Jeff Rasley
Shift Parallelism: Low-Latency, High-Throughput LLM Inference for Dynamic Workloads
Efficient parallelism is necessary for achieving low-latency, high-throughput inference with large language models (LLMs). Tensor parallelism (TP) is the state-of-the-art method for reducing LLM response latency; however, GPU communications…
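For readers unfamiliar with tensor parallelism, the sketch below shows the standard column-parallel/row-parallel matmul pattern and the collective (here a simulated all-reduce) whose communication cost the abstract refers to. This is a generic NumPy illustration with assumed shapes, not the paper's Shift Parallelism implementation.

```python
# Minimal NumPy sketch of 2-way tensor parallelism for a pair of linear
# layers. Each "GPU" holds a shard of the weights; the final summation
# stands in for the all-reduce that real TP performs over NVLink/network.
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))        # batch of activations
W1 = rng.standard_normal((8, 16))      # first weight, split by columns
W2 = rng.standard_normal((16, 8))      # second weight, split by rows

# Column parallelism: each "GPU" computes half of W1's output features.
y_shards = [x @ W1[:, :8], x @ W1[:, 8:]]

# Row parallelism: each "GPU" holds the matching half of W2's rows,
# producing partial sums that an all-reduce must combine.
partials = [y_shards[0] @ W2[:8, :], y_shards[1] @ W2[8:, :]]
out = partials[0] + partials[1]        # the all-reduce, simulated

assert np.allclose(out, (x @ W1) @ W2)  # matches the unsharded computation
```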
Arctic Inference with Shift Parallelism: Fast and Efficient Open Source Inference System for Enterprise AI
Inference is now the dominant AI workload, yet existing systems force trade-offs between latency, throughput, and cost. Arctic Inference, an open-source vLLM plugin from Snowflake AI Research, introduces Shift Parallelism, a dynamic parall…
Federated Timeline Synthesis: Scalable and Private Methodology For Model Training and Deployment
We present Federated Timeline Synthesis (FTS), a novel framework for training generative foundation models across distributed time-series data, applied to electronic health records (EHR). At its core, FTS represents patient history as tokeni…
Arctic Long Sequence Training: Scalable And Efficient Training For Multi-Million Token Sequences
Long sequences are critical for applications like RAG, long-document summarization, multi-modality, etc., and modern LLMs, like Llama 4 Scout, support maximum sequence lengths of up to 10 million tokens. However, outside of enterprise labs, lon…
DeepSpeed-FastGen: High-throughput Text Generation for LLMs via MII and DeepSpeed-Inference
The deployment and scaling of large language models (LLMs) have become critical as they permeate various applications, demanding high-throughput and low-latency serving systems. Existing frameworks struggle to balance these requirements, e…
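DeepSpeed-FastGen's headline technique, Dynamic SplitFuse, composes prompt chunks with in-flight decode tokens so every forward pass runs at a consistent token budget. The toy scheduler below sketches that idea; the function names and budget value are assumptions for illustration, not the library's API.

```python
# Toy scheduler sketching the prompt-splitting idea behind Dynamic
# SplitFuse: long prompts are chunked to fill a fixed per-pass token
# budget alongside decode tokens, keeping GPU utilization steady.
from collections import deque

TOKEN_BUDGET = 512  # tokens per forward pass (assumed value)

def schedule(pending_prompts: deque, decoding: list) -> list:
    """Build one batch: decode tokens first, then fill with prompt chunks."""
    batch = [(req_id, 1) for req_id in decoding]   # 1 token per decode step
    budget = TOKEN_BUDGET - len(batch)
    while pending_prompts and budget > 0:
        req_id, remaining = pending_prompts[0]
        chunk = min(remaining, budget)             # split long prompts
        batch.append((req_id, chunk))
        budget -= chunk
        if chunk == remaining:
            pending_prompts.popleft()              # prompt fully ingested
        else:
            pending_prompts[0] = (req_id, remaining - chunk)
    return batch

# A 1300-token prompt is spread across multiple passes while two
# in-flight decode requests keep generating on every pass.
q = deque([("A", 1300)])
print(schedule(q, decoding=["B", "C"]))  # [('B', 1), ('C', 1), ('A', 510)]
```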
DeepSpeed4Science Initiative: Enabling Large-Scale Scientific Discovery through Sophisticated AI System Technologies
In the upcoming decade, deep learning may revolutionize the natural sciences, enhancing our capacity to model and predict natural occurrences. This could herald a new era of scientific exploration, bringing significant advancements across …
DeepSpeed-Chat: Easy, Fast and Affordable RLHF Training of ChatGPT-like Models at All Scales
ChatGPT-like models have revolutionized various applications in artificial intelligence, from summarization and coding to translation, matching or even surpassing human performance. However, the current landscape lacks an accessible, effic…
MCR-DL: Mix-and-Match Communication Runtime for Deep Learning
In recent years, the training requirements of many state-of-the-art Deep Learning (DL) models have scaled beyond the compute and memory capabilities of a single processor, and necessitated distribution among processors. Training such massi…
DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale
The past several years have witnessed the success of transformer-based models, and their scale and application scenarios continue to grow aggressively. The current landscape of transformer models is increasingly diverse: the model size var…
DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale
As the training of giant dense models hits the limits of today's hardware availability and capability, Mixture-of-Experts (MoE) models have become one of the most promising model architectures due to their significant train…
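The MoE idea the abstract refers to fits in a few lines: a gate routes each token to one expert, so per-token compute stays roughly constant while total parameter count grows with the number of experts. The sketch below is a generic top-1 MoE layer with made-up sizes, not DeepSpeed-MoE's implementation.

```python
# Generic top-1 gated Mixture-of-Experts layer in NumPy. Each token is
# routed to one expert FFN, so adding experts adds parameters without
# adding per-token FLOPs.
import numpy as np

rng = np.random.default_rng(0)
n_experts, d_model, d_ff = 4, 8, 32
tokens = rng.standard_normal((6, d_model))

gate_W = rng.standard_normal((d_model, n_experts))
experts = [(rng.standard_normal((d_model, d_ff)),
            rng.standard_normal((d_ff, d_model))) for _ in range(n_experts)]

logits = tokens @ gate_W
probs = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)  # softmax gate
choice = probs.argmax(-1)                                       # top-1 routing

out = np.zeros_like(tokens)
for e, (W_in, W_out) in enumerate(experts):
    mask = choice == e
    if mask.any():
        h = np.maximum(tokens[mask] @ W_in, 0.0)        # expert FFN (ReLU)
        out[mask] = probs[mask, e:e + 1] * (h @ W_out)  # gate-weighted output
```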
ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning
In the last three years, the largest dense deep learning models have grown over 1000x to reach hundreds of billions of parameters, while the GPU memory has only grown by 5x (16 GB to 80 GB). Therefore, the growth in model scale has been su…
ZeRO: Memory Optimizations Toward Training Trillion Parameter Models
Large deep learning models offer significant accuracy gains, but training billions to trillions of parameters is challenging. Existing solutions such as data and model parallelisms exhibit fundamental limitations to fit these models into l…
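The memory arithmetic behind ZeRO is easy to reproduce: mixed-precision Adam keeps roughly 16 bytes of model state per parameter (fp16 weights and gradients, plus fp32 parameter, momentum, and variance copies), and ZeRO stage 3 shards all of it across the data-parallel workers instead of replicating it. A back-of-the-envelope calculation, assuming that accounting:

```python
# Model-state memory for mixed-precision Adam, following the ZeRO paper's
# accounting: 2 B (fp16 params) + 2 B (fp16 grads) + 12 B (fp32 param
# copy, momentum, variance) = 16 B per parameter. ZeRO stage 3 partitions
# all three states across N data-parallel GPUs.
def model_state_gb(params_billion: float, n_gpus: int = 1, zero3: bool = False) -> float:
    total_bytes = params_billion * 1e9 * 16
    if zero3:
        total_bytes /= n_gpus      # states are sharded, not replicated
    return total_bytes / 1e9       # decimal GB

print(f"{model_state_gb(7.5):.0f} GB replicated on every GPU")
print(f"{model_state_gb(7.5, n_gpus=64, zero3=True):.2f} GB per GPU with ZeRO-3")
```

For a 7.5B-parameter model this gives 120 GB of model states per GPU under plain data parallelism, but under 2 GB per GPU when ZeRO-3 shards across 64 GPUs.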
Efficient queue management for cluster scheduling
Job scheduling in Big Data clusters is crucial both for cluster operators' return on investment and for overall user experience. In this context, we observe several anomalies in how modern cluster schedulers manage queues, and argue that m…