Jacob Hatef
Scaling Large Language Model Training on Frontier with Low-Bandwidth Partitioning
Scaling up Large Language Model (LLM) training involves fitting a tremendous number of training parameters across a limited number of workers. However, methods like ZeRO-3 that drastically reduce GPU memory pressure often incur heavy commun…
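As an illustrative sketch only (not taken from the paper), ZeRO stage-3 partitioning of the kind discussed here is typically enabled through a DeepSpeed configuration; all values below are assumptions for demonstration.

```python
# Minimal sketch, assuming DeepSpeed is installed and the script is launched
# with a distributed launcher. Values are placeholders, not the paper's settings.
import deepspeed
import torch

ds_config = {
    "train_micro_batch_size_per_gpu": 4,       # assumed batch size
    "bf16": {"enabled": True},                 # assumed mixed-precision setting
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
    "zero_optimization": {
        "stage": 3,                            # partition parameters, gradients, and optimizer states
        "overlap_comm": True,                  # overlap collectives with computation
        "reduce_bucket_size": 500_000_000,     # assumed bucket size; larger buckets mean fewer, bigger collectives
    },
}

model = torch.nn.Linear(4096, 4096)            # stand-in for a real LLM
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```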
Demystifying the Communication Characteristics for Distributed Transformer Models
Deep learning (DL) models based on the transformer architecture have revolutionized many DL applications such as large language models (LLMs), vision transformers, audio generation, and time series prediction. Much of this progress has bee…
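A minimal microbenchmark sketch (not from the paper) of the kind used to characterize such communication: timing all-reduce for several message sizes with torch.distributed, assuming the script is launched with torchrun on GPU nodes.

```python
# Sketch only: measure all-reduce latency at a few message sizes to see how
# collective cost scales in data-parallel transformer training.
import time
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
device = torch.device(f"cuda:{dist.get_rank() % torch.cuda.device_count()}")

for numel in (1 << 20, 1 << 24, 1 << 28):      # 1M, 16M, 256M fp32 elements
    buf = torch.ones(numel, device=device)
    torch.cuda.synchronize()
    start = time.perf_counter()
    dist.all_reduce(buf)                       # sum across all ranks
    torch.cuda.synchronize()
    if dist.get_rank() == 0:
        print(f"all_reduce of {numel} elements: {(time.perf_counter() - start) * 1e3:.2f} ms")

dist.destroy_process_group()
```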
The Case for Co-Designing Model Architectures with Hardware
While GPUs are responsible for training the vast majority of state-of-the-art deep learning models, the implications of their architecture are often overlooked when designing new deep learning (DL) models. As a consequence, modifying a DL …
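As a hedged illustration of the co-design idea (not the paper's code), one common consequence is choosing layer dimensions that divide evenly into GPU GEMM tile sizes; the multiple of 64 below is an assumption that depends on the hardware and data type.

```python
# Sketch only: round transformer dimensions up to tile-friendly multiples so
# GEMMs map cleanly onto tensor cores.
def round_up(value: int, multiple: int = 64) -> int:
    """Round value up to the nearest multiple of `multiple`."""
    return ((value + multiple - 1) // multiple) * multiple

hidden_size = round_up(5000)         # -> 5056, divisible by 64
vocab_size = round_up(50257, 128)    # -> 50304, divisible by 128
print(hidden_size, vocab_size)
```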