Efficient Long-context Language Model Training by Core Attention Disaggregation
October 2025 • Yonghao Zhuang, Jiajia Chen, Bo Pang, Yi Gu, Yibo Zhu, Yimin Jiang, Ion Stoica, Eric P. Xing, Hao Zhang
We present core attention disaggregation (CAD), a technique that improves long-context large language model training by decoupling the core attention computation, softmax(QK^T)V, from the rest of the model and executing it on a separate pool of devices. In existing systems, core attention is colocated with other layers; at long context lengths, its compute grows quadratically while other components grow near-linearly, causing load imbalance and stragglers across data- and pipeline-parallel groups. CAD is en…
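To make the imbalance concrete, below is a minimal sketch (not the paper's implementation) of the core attention operation softmax(QK^T)V that CAD decouples, together with a rough FLOP comparison illustrating its quadratic growth in sequence length versus the near-linear growth of projection/MLP layers. The tensor shapes, hidden size, and FLOP formulas are illustrative assumptions, not values from the paper.

```python
# Hedged sketch: core attention and a rough FLOP comparison.
import torch

def core_attention(q, k, v):
    # q, k, v: [batch, heads, seq_len, head_dim]
    scores = torch.matmul(q, k.transpose(-2, -1))  # [b, h, s, s] -- quadratic in seq_len
    probs = torch.softmax(scores, dim=-1)
    return torch.matmul(probs, v)                  # [b, h, s, head_dim]

def core_attention_flops(seq_len, heads, head_dim):
    # QK^T and PV each cost ~2 * s^2 * d FLOPs per head: O(s^2) in sequence length.
    return heads * 2 * (2 * seq_len * seq_len * head_dim)

def linear_layer_flops(seq_len, hidden, ffn_mult=4):
    # Projection/MLP layers cost ~2 * s * hidden^2 * ffn_mult FLOPs: O(s) in sequence length.
    return 2 * seq_len * hidden * hidden * ffn_mult

if __name__ == "__main__":
    b, h, d = 1, 32, 128          # assumed small demo shapes
    q = k = v = torch.randn(b, h, 1024, d)
    out = core_attention(q, k, v)  # sanity check: [1, 32, 1024, 128]

    # The ratio of core-attention FLOPs to linear-layer FLOPs grows with context length,
    # which is the source of the straggler problem the abstract describes.
    for s in (8_192, 131_072):
        ratio = core_attention_flops(s, h, d) / linear_layer_flops(s, h * d)
        print(f"seq_len={s}: core-attention / linear FLOP ratio ~ {ratio:.1f}")
```

Under these assumed shapes the ratio grows roughly linearly with sequence length, which is why colocating core attention with the other layers leaves some devices waiting on the attention-heavy ones at long contexts.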