arXiv (Cornell University)
Efficient Long-context Language Model Training by Core Attention Disaggregation
October 2025 • Yonghao Zhuang, Jiajia Chen, Bo Pang, Yi Gu, Yibo Zhu, Yimin Jiang, Ion Stoica, Eric P. Xing, Hao Zhang
We present core attention disaggregation (CAD), a technique that improves long-context large language model training by decoupling the core attention computation, softmax(QK^T)V, from the rest of the model and executing it on a separate pool of devices. In existing systems, core attention is colocated with other layers; at long context lengths, its quadratic compute growth compared to the near-linear growth of other components causes load imbalance and stragglers across data and pipeline parallel groups. CAD is en…
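The core attention term the abstract names, softmax(QK^T)V, can be sketched as follows (a minimal single-head NumPy illustration of the operation, not the paper's implementation; shapes and names here are assumptions for exposition):

```python
import numpy as np

def core_attention(Q, K, V):
    """Core attention: softmax(Q K^T / sqrt(d)) V.

    This is the component CAD executes on a separate device pool;
    its cost grows quadratically with context length because of the
    (seq, seq) score matrix, while other layers grow near-linearly.
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                 # (seq, seq) score matrix
    scores -= scores.max(axis=-1, keepdims=True)  # subtract row max for stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                            # (seq, d) output

seq, d = 8, 4
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((seq, d)) for _ in range(3))
out = core_attention(Q, K, V)
print(out.shape)  # (8, 4)
```

Because the (seq, seq) score matrix dominates compute at long context, colocating this step with the near-linear layers is what produces the stragglers the abstract describes.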