2025-05-18
ZenFlow: Enabling Stall-Free Offloading Training via Asynchronous Updates
Tian Lan, Yusen Wu, Bin Ma, Zhaoyuan Su, Ruifu Yang, Tekin Biçer, Masahiro Tanaka, Olatunji Ruwase, Dong Li, Yue Cheng
Fine-tuning large language models (LLMs) often exceeds GPU memory limits, prompting systems to offload model states to CPU memory. However, existing offloaded training frameworks like ZeRO-Offload treat all parameters equally and update the full model on the CPU, causing severe GPU stalls, where fast, expensive GPUs sit idle waiting for slow CPU updates and limited-bandwidth PCIe transfers. We present ZenFlow, a new offloading framework that prioritizes important parameters and decouples updates between GPU and CP…
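
The idea of prioritizing important parameters and decoupling their updates from the rest can be illustrated with a toy sketch. This is not ZenFlow's implementation. It assumes, purely for illustration, that importance is measured by gradient magnitude, that the top `k` parameters are updated immediately (standing in for the GPU path), and that the remaining gradients are accumulated and applied every `interval` steps (standing in for the deferred CPU path). The names `split_by_importance`, `DeferredUpdater`, `k`, and `interval` are hypothetical, and real asynchrony is modeled here only as deferral within a single thread.

```python
from typing import List, Tuple


def split_by_importance(grads: List[float], k: int) -> Tuple[List[int], List[int]]:
    """Split parameter indices into (important, deferred).

    Assumption: importance is the absolute gradient value; the top-k
    indices count as important.
    """
    order = sorted(range(len(grads)), key=lambda i: abs(grads[i]), reverse=True)
    return sorted(order[:k]), sorted(order[k:])


class DeferredUpdater:
    """Toy SGD that updates important parameters now and the rest later."""

    def __init__(self, params: List[float], lr: float, k: int, interval: int):
        self.params = list(params)
        self.lr = lr
        self.k = k                # number of parameters updated immediately
        self.interval = interval  # steps between flushes of deferred gradients
        self.pending = [0.0] * len(self.params)  # accumulated deferred gradients
        self.step_count = 0

    def step(self, grads: List[float]) -> None:
        important, deferred = split_by_importance(grads, self.k)
        # Fast path: apply the update for important parameters right away.
        for i in important:
            self.params[i] -= self.lr * grads[i]
        # Slow path: only accumulate; no update is applied yet.
        for i in deferred:
            self.pending[i] += grads[i]
        self.step_count += 1
        # Periodically flush the accumulated gradients in one batch.
        if self.step_count % self.interval == 0:
            for i, g in enumerate(self.pending):
                self.params[i] -= self.lr * g
            self.pending = [0.0] * len(self.params)


updater = DeferredUpdater([1.0, 1.0, 1.0], lr=0.5, k=1, interval=2)
updater.step([4.0, 1.0, -2.0])  # only index 0 is updated immediately
updater.step([0.0, 2.0, 1.0])   # index 1 is updated, then deferred gradients flush
print(updater.params)
```

In this sketch the fast path never waits on the slow path within a step, which is the property that would let a GPU keep computing while a CPU handles the deferred work. A real system would run the flush concurrently and would need to manage staleness of the deferred parameters.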