Zachary DeVito
Revisiting Reliability in Large-Scale Machine Learning Research Clusters
Reliability is a fundamental challenge in operating large-scale machine learning (ML) infrastructures, particularly as the scale of ML models and training clusters continues to grow. Despite decades of research on infrastructure failures, …
Is Flash Attention Stable?
Training large-scale machine learning models poses distinct system challenges, given both the size and complexity of today's workloads. Recently, many organizations training state-of-the-art Generative AI models have reported cases of inst…
Generative AI Beyond LLMs: System Implications of Multi-Modal Generation
As the development of large-scale Generative AI models evolves beyond text (1D) generation to include image (2D) and video (3D) generation, processing spatial and temporal information presents unique challenges to quality, performance, and …
MAD Max Beyond Single-Node: Enabling Large Machine Learning Model Acceleration on Distributed Systems
Training and deploying large-scale machine learning models is time-consuming, requires significant distributed computing infrastructures, and incurs high operational costs. Our analysis, grounded in real-world large model training on datac…
A Theory on Adam Instability in Large-Scale Machine Learning
We present a theory for the previously unexplained divergent behavior noticed in the training of large language models. We argue that the phenomenon is an artifact of the dominant optimization algorithm used for training, called Adam. We o…
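For reference, the optimizer the abstract attributes the divergence to is Adam in its standard form (as given by Kingma and Ba; this is the textbook update, not a reproduction of the paper's analysis). With gradient g_t, decay rates β₁, β₂, step size α, and stabilizer ε:

```latex
m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t,\qquad
v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2
```
```latex
\hat{m}_t = \frac{m_t}{1-\beta_1^t},\qquad
\hat{v}_t = \frac{v_t}{1-\beta_2^t},\qquad
\theta_t = \theta_{t-1} - \alpha\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
```

The per-coordinate ratio m̂_t/√v̂_t is the quantity most analyses of Adam's stability focus on, since it can spike when v̂_t is small.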
Torch.fx: Practical Program Capture and Transformation for Deep Learning in Python
Modern deep learning frameworks provide imperative, eager execution programming interfaces embedded in Python to provide a productive development experience. However, deep learning practitioners sometimes need to capture and transform prog…
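The capture step described here can be sketched with torch.fx's public API. This is a minimal illustration, not an example from the paper; the module `M` is made up for demonstration:

```python
import torch
import torch.fx


class M(torch.nn.Module):
    def forward(self, x):
        # An ordinary eager-mode computation.
        return torch.relu(x) + 1.0


# symbolic_trace records the operations into a Graph, producing a
# GraphModule whose IR can be inspected and transformed.
gm = torch.fx.symbolic_trace(M())
print(gm.graph)  # human-readable IR: placeholder -> relu -> add -> output
```

The traced `GraphModule` remains callable like the original module, so transformations (operator fusion, quantization rewrites, and so on) can be applied to the graph and executed directly.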
Using Python for Model Inference in Deep Learning
Python has become the de facto language for training deep neural networks, coupling a large suite of scientific computing libraries with efficient libraries for tensor computation such as PyTorch or TensorFlow. However, when models are use…
Proceedings of the Fifth Workshop on Computational Approaches to Linguistic Code-Switching
Bienvenidos to the proceedings of the fifth edition of the workshop on computational approaches for linguistic code-switching (CALCS-2021)! Code-switching is this very interesting phenomenon where multilingual speakers communicate by moving…
PyTorch: An Imperative Style, High-Performance Deep Learning Library
Deep learning frameworks have often focused on either usability or speed, but not both. PyTorch is a machine learning library that shows that these two goals are in fact compatible: it provides an imperative and Pythonic programming style …
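The imperative, Pythonic style the abstract describes can be seen in a few lines of ordinary PyTorch (a minimal sketch using standard public APIs, not code from the paper):

```python
import torch

# Eager execution: each line runs immediately, like normal Python.
x = torch.randn(3, requires_grad=True)
y = (x ** 2).sum()  # y = sum of squares, built with plain operators

# Autograd records the operations above and differentiates them:
# dy/dx = 2x, populated into x.grad by backward().
y.backward()
```

Because execution is eager, standard Python tooling (debuggers, print statements, control flow) works unchanged, which is the usability half of the usability/speed tradeoff the abstract refers to.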
The Next 700 Accelerated Layers
Deep learning frameworks automate the deployment, distribution, synchronization, memory allocation, and hardware acceleration of models represented as graphs of computational operators. These operators wrap high-performance libraries such …
Tensor Comprehensions: Framework-Agnostic High-Performance Machine Learning Abstractions
Deep learning models with convolutional and recurrent networks are now ubiquitous and analyze massive amounts of audio, image, video, text and graph data, with applications in automatic translation, speech-to-text, scene understanding, ran…
Opt: A Domain Specific Language for Non-linear Least Squares Optimization in Graphics and Imaging
Many graphics and vision problems can be expressed as non-linear least squares optimizations of objective functions over visual data, such as images and meshes. The mathematical descriptions of these functions are extremely concise, but th…