Jeff Rasley
Shift Parallelism: Low-Latency, High-Throughput LLM Inference for Dynamic Workloads
Efficient parallelism is necessary for achieving low-latency, high-throughput inference with large language models (LLMs). Tensor parallelism (TP) is the state-of-the-art method for reducing LLM response latency; however, GPU communications…
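For readers unfamiliar with tensor parallelism, the sketch below shows the standard column-parallel/row-parallel matmul pattern and the collective (here a simulated all-reduce) whose communication cost the abstract refers to. This is a generic NumPy illustration with assumed shapes, not the paper's Shift Parallelism implementation.

```python
# Minimal NumPy sketch of 2-way tensor parallelism for a pair of linear
# layers. Each "GPU" holds a shard of the weights; the final summation
# stands in for the all-reduce that real TP performs over NVLink/network.
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))        # batch of activations
W1 = rng.standard_normal((8, 16))      # first weight, split by columns
W2 = rng.standard_normal((16, 8))      # second weight, split by rows

# Column parallelism: each "GPU" computes half of W1's output features.
y_shards = [x @ W1[:, :8], x @ W1[:, 8:]]

# Row parallelism: each "GPU" holds the matching half of W2's rows,
# producing partial sums that an all-reduce must combine.
partials = [y_shards[0] @ W2[:8, :], y_shards[1] @ W2[8:, :]]
out = partials[0] + partials[1]        # the all-reduce, simulated

assert np.allclose(out, (x @ W1) @ W2)  # matches the unsharded computation
```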
Arctic Inference with Shift Parallelism: Fast and Efficient Open Source Inference System for Enterprise AI
Inference is now the dominant AI workload, yet existing systems force trade-offs between latency, throughput, and cost. Arctic Inference, an open-source vLLM plugin from Snowflake AI Research, introduces Shift Parallelism, a dynamic parall…
Federated Timeline Synthesis: Scalable and Private Methodology For Model Training and Deployment
We present Federated Timeline Synthesis (FTS), a novel framework for training generative foundation models across distributed time-series data, applied to electronic health records (EHR). At its core, FTS represents patient history as tokeni…
Arctic Long Sequence Training: Scalable And Efficient Training For Multi-Million Token Sequences
Long sequences are critical for applications like RAG, long-document summarization, multi-modality, etc., and modern LLMs, like Llama 4 Scout, support maximum sequence lengths of up to 10 million tokens. However, outside of enterprise labs, lon…
DeepSpeed-FastGen: High-throughput Text Generation for LLMs via MII and DeepSpeed-Inference
The deployment and scaling of large language models (LLMs) have become critical as they permeate various applications, demanding high-throughput and low-latency serving systems. Existing frameworks struggle to balance these requirements, e…
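DeepSpeed-FastGen's headline technique, Dynamic SplitFuse, composes prompt chunks with in-flight decode tokens so every forward pass runs at a consistent token budget. The toy scheduler below sketches that idea; the function names and budget value are assumptions for illustration, not the library's API.

```python
# Toy scheduler sketching the prompt-splitting idea behind Dynamic
# SplitFuse: long prompts are chunked to fill a fixed per-pass token
# budget alongside decode tokens, keeping GPU utilization steady.
from collections import deque

TOKEN_BUDGET = 512  # tokens per forward pass (assumed value)

def schedule(pending_prompts: deque, decoding: list) -> list:
    """Build one batch: decode tokens first, then fill with prompt chunks."""
    batch = [(req_id, 1) for req_id in decoding]   # 1 token per decode step
    budget = TOKEN_BUDGET - len(batch)
    while pending_prompts and budget > 0:
        req_id, remaining = pending_prompts[0]
        chunk = min(remaining, budget)             # split long prompts
        batch.append((req_id, chunk))
        budget -= chunk
        if chunk == remaining:
            pending_prompts.popleft()              # prompt fully ingested
        else:
            pending_prompts[0] = (req_id, remaining - chunk)
    return batch

# A 1300-token prompt is spread across multiple passes while two
# in-flight decode requests keep generating on every pass.
q = deque([("A", 1300)])
print(schedule(q, decoding=["B", "C"]))  # [('B', 1), ('C', 1), ('A', 510)]
```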
DeepSpeed4Science Initiative: Enabling Large-Scale Scientific Discovery through Sophisticated AI System Technologies
In the upcoming decade, deep learning may revolutionize the natural sciences, enhancing our capacity to model and predict natural occurrences. This could herald a new era of scientific exploration, bringing significant advancements across …
DeepSpeed-Chat: Easy, Fast and Affordable RLHF Training of ChatGPT-like Models at All Scales
ChatGPT-like models have revolutionized various applications in artificial intelligence, from summarization and coding to translation, matching or even surpassing human performance. However, the current landscape lacks an accessible, effic…
MCR-DL: Mix-and-Match Communication Runtime for Deep Learning
In recent years, the training requirements of many state-of-the-art Deep Learning (DL) models have scaled beyond the compute and memory capabilities of a single processor, and necessitated distribution among processors. Training such massi…
DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale
The past several years have witnessed the success of transformer-based models, and their scale and application scenarios continue to grow aggressively. The current landscape of transformer models is increasingly diverse: the model size var…
DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale
As the training of giant dense models hits the limits of today's hardware availability and capability, Mixture-of-Experts (MoE) models have become one of the most promising model architectures due to their significant train…
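The MoE idea the abstract refers to fits in a few lines: a gate routes each token to one expert, so per-token compute stays roughly constant while total parameter count grows with the number of experts. The sketch below is a generic top-1 MoE layer with made-up sizes, not DeepSpeed-MoE's implementation.

```python
# Generic top-1 gated Mixture-of-Experts layer in NumPy. Each token is
# routed to one expert FFN, so adding experts adds parameters without
# adding per-token FLOPs.
import numpy as np

rng = np.random.default_rng(0)
n_experts, d_model, d_ff = 4, 8, 32
tokens = rng.standard_normal((6, d_model))

gate_W = rng.standard_normal((d_model, n_experts))
experts = [(rng.standard_normal((d_model, d_ff)),
            rng.standard_normal((d_ff, d_model))) for _ in range(n_experts)]

logits = tokens @ gate_W
probs = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)  # softmax gate
choice = probs.argmax(-1)                                       # top-1 routing

out = np.zeros_like(tokens)
for e, (W_in, W_out) in enumerate(experts):
    mask = choice == e
    if mask.any():
        h = np.maximum(tokens[mask] @ W_in, 0.0)        # expert FFN (ReLU)
        out[mask] = probs[mask, e:e + 1] * (h @ W_out)  # gate-weighted output
```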
ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning
In the last three years, the largest dense deep learning models have grown over 1000x to reach hundreds of billions of parameters, while the GPU memory has only grown by 5x (16 GB to 80 GB). Therefore, the growth in model scale has been su…
ZeRO: Memory Optimizations Toward Training Trillion Parameter Models
Large deep learning models offer significant accuracy gains, but training billions to trillions of parameters is challenging. Existing solutions such as data and model parallelisms exhibit fundamental limitations to fit these models into l…
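The memory arithmetic behind ZeRO is easy to reproduce: mixed-precision Adam keeps roughly 16 bytes of model state per parameter (fp16 weights and gradients, plus fp32 parameter, momentum, and variance copies), and ZeRO stage 3 shards all of it across the data-parallel workers instead of replicating it. A back-of-the-envelope calculation, assuming that accounting:

```python
# Model-state memory for mixed-precision Adam, following the ZeRO paper's
# accounting: 2 B (fp16 params) + 2 B (fp16 grads) + 12 B (fp32 param
# copy, momentum, variance) = 16 B per parameter. ZeRO stage 3 partitions
# all three states across N data-parallel GPUs.
def model_state_gb(params_billion: float, n_gpus: int = 1, zero3: bool = False) -> float:
    total_bytes = params_billion * 1e9 * 16
    if zero3:
        total_bytes /= n_gpus      # states are sharded, not replicated
    return total_bytes / 1e9       # decimal GB

print(f"{model_state_gb(7.5):.0f} GB replicated on every GPU")
print(f"{model_state_gb(7.5, n_gpus=64, zero3=True):.2f} GB per GPU with ZeRO-3")
```

For a 7.5B-parameter model this gives 120 GB of model states per GPU under plain data parallelism, but under 2 GB per GPU when ZeRO-3 shards across 64 GPUs.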
Efficient queue management for cluster scheduling
Job scheduling in Big Data clusters is crucial both for cluster operators' return on investment and for overall user experience. In this context, we observe several anomalies in how modern cluster schedulers manage queues, and argue that m…