Christopher Ré
HMAR: Efficient Hierarchical Masked Auto-Regressive Image Generation
Visual Auto-Regressive modeling (VAR) has shown promise in bridging the speed and quality gap between autoregressive image models and diffusion models. VAR reformulates autoregressive modeling by decomposing an image into successive resolu…
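The coarse-to-fine decomposition the abstract refers to can be illustrated with a short sketch of next-scale residual decomposition. The scales, shapes, and interpolation modes below are illustrative assumptions, not the authors' implementation:

```python
# Minimal sketch of a coarse-to-fine residual decomposition of a latent feature map.
import torch
import torch.nn.functional as F

def decompose_to_scales(feature_map, scales=(1, 2, 4, 8)):
    """feature_map: (B, C, H, W) latent, modeled coarse-to-fine over scales."""
    residuals = []
    running = torch.zeros_like(feature_map)
    H, W = feature_map.shape[-2:]
    for s in scales:
        # Residual at this scale: whatever the coarser scales have not explained yet.
        target = F.interpolate(feature_map - running, size=(s, s), mode="area")
        residuals.append(target)
        # Upsample back to full resolution and accumulate the running reconstruction.
        running = running + F.interpolate(target, size=(H, W), mode="bilinear",
                                          align_corners=False)
    # An autoregressive model then predicts residuals[i] from residuals[:i].
    return residuals

x = torch.randn(2, 16, 8, 8)
print([r.shape for r in decompose_to_scales(x)])
```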
Systems and Algorithms for Convolutional Multi-Hybrid Language Models at Scale
We introduce convolutional multi-hybrid architectures, with a design grounded on two simple observations. First, operators in hybrid models can be tailored to token manipulation tasks such as in-context recall, multi-token recall, and comp…
Minions: Cost-efficient Collaboration Between On-device and Cloud Language Models
We investigate an emerging setup in which a small, on-device language model (LM) with access to local data communicates with a frontier, cloud-hosted LM to solve real-world tasks involving financial, medical, and scientific reasoning over …
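As a rough illustration of the kind of local/cloud exchange described here, the sketch below has a cloud model that never sees raw local documents and a small on-device model that answers its targeted requests. The callables `local_lm` and `cloud_lm` and the message format are hypothetical stand-ins, not the paper's protocol:

```python
def collaborate(question, local_documents, local_lm, cloud_lm, max_rounds=3):
    """Hypothetical local/cloud loop: the cloud LM plans, the on-device LM reads."""
    transcript = []
    for _ in range(max_rounds):
        # The cloud model sees only the question and the running transcript,
        # never the raw (private, possibly very long) local documents.
        step = cloud_lm(question=question, transcript=transcript)
        if step.get("final_answer") is not None:
            return step["final_answer"]
        # The small on-device model reads the local data and sends back
        # only a short, targeted reply.
        reply = local_lm(instruction=step["instruction"], documents=local_documents)
        transcript.append({"instruction": step["instruction"], "reply": reply})
    # Out of rounds: ask the cloud model to commit to an answer.
    return cloud_lm(question=question, transcript=transcript,
                    force_answer=True)["final_answer"]
```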
KernelBench: Can LLMs Write Efficient GPU Kernels?
Efficient GPU kernels are crucial for building performant machine learning architectures, but writing them is a time-consuming challenge that requires significant expertise; therefore, we explore using language models (LMs) to automate ker…
CodeMonkeys: Scaling Test-Time Compute for Software Engineering
Scaling test-time compute is a promising axis for improving LLM capabilities. However, test-time compute can be scaled in a variety of ways, and effectively combining different approaches remains an active area of research. Here, we explor…
SMOOTHIE: Label Free Language Model Routing
Large language models (LLMs) are increasingly used in applications where LLM inputs may span many different tasks. Recent work has found that the choice of LLM is consequential, and different LLMs may be good for different input samples. P…
Context Clues: Evaluating Long Context Models for Clinical Prediction Tasks on EHRs
Foundation Models (FMs) trained on Electronic Health Records (EHRs) have achieved state-of-the-art results on numerous clinical prediction tasks. However, most existing EHR FMs have context windows of <1k tokens. This prevents them from mo…
Scaling Laws for Precision
Low precision training and inference affect both the quality and cost of language models, but current scaling laws do not account for this. In this work, we devise "precision-aware" scaling laws for both training and inference. We propose …
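For intuition only, one way to make a scaling law "precision-aware" is to graft a precision term onto a Chinchilla-style loss by shrinking the effective parameter count as bit width drops. The functional form and all constants below are illustrative assumptions, not the fitted law from this paper:

```python
# Illustrative toy curve: lower training precision shrinks the "effective"
# parameter count, which raises the parameter term of a Chinchilla-style loss.
import math

def toy_loss(N, D, bits, A=400.0, B=1800.0, E=1.7, alpha=0.34, beta=0.28, gamma=2.0):
    N_eff = N * (1.0 - math.exp(-bits / gamma))  # fewer effective params at low precision
    return A / N_eff**alpha + B / D**beta + E

for bits in (16, 8, 4, 3):
    print(bits, round(toy_loss(N=1e9, D=2e10, bits=bits), 4))
```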
ThunderKittens: Simple, Fast, and Adorable AI Kernels
The challenge of mapping AI architectures to GPU hardware is creating a critical bottleneck in AI progress. Despite substantial efforts, hand-written custom kernels fail to meet their theoretical performance thresholds, even on well-establ…
LoLCATs: On Low-Rank Linearizing of Large Language Models
Recent works show we can linearize large language models (LLMs) -- swapping the quadratic attentions of popular Transformer-based LLMs with subquadratic analogs, such as linear attention -- avoiding the expensive pretraining costs. However…
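The core swap the abstract describes, replacing softmax attention with a linear-attention analog whose cost is linear in sequence length, looks roughly like the sketch below. This is an illustrative non-causal version with a generic feature map, not the paper's trained LoLCATs layers:

```python
import torch
import torch.nn.functional as F

def softmax_attention(q, k, v):
    # (B, T, T) score matrix: quadratic in sequence length T.
    scores = (q @ k.transpose(-2, -1)) / q.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ v

def linear_attention(q, k, v):
    # Generic positive feature map, chosen here purely for illustration.
    phi = lambda x: F.elu(x) + 1
    q, k = phi(q), phi(k)
    kv = k.transpose(-2, -1) @ v                               # (B, d, d): no T x T matrix
    z = q @ k.sum(dim=-2, keepdim=True).transpose(-2, -1) + 1e-6
    return (q @ kv) / z

q = k = v = torch.randn(1, 128, 64)
print(softmax_attention(q, k, v).shape, linear_attention(q, k, v).shape)
```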
Automated Rewards via LLM-Generated Progress Functions
Large Language Models (LLMs) have the potential to automate reward engineering by leveraging their broad domain knowledge across various tasks. However, they often need many iterations of trial-and-error to generate effective reward functi…
Restructuring Vector Quantization with the Rotation Trick
Vector Quantized Variational AutoEncoders (VQ-VAEs) are designed to compress a continuous input to a discrete latent space and reconstruct it with minimal distortion. They operate by maintaining a set of vectors -- often referred to as the…
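For context, the quantization step being discussed is a nearest-neighbor codebook lookup, usually trained with the straight-through gradient sketched below. The paper's rotation trick changes how gradients flow through this step; the sketch does not reproduce that:

```python
import torch

def vector_quantize(z, codebook):
    """z: (B, d) encoder outputs; codebook: (K, d) learned code vectors."""
    idx = torch.cdist(z, codebook).argmin(dim=-1)   # nearest code per input
    q = codebook[idx]
    # Straight-through estimator: forward pass uses q, gradient treats it as z.
    return z + (q - z).detach(), idx

z = torch.randn(4, 8, requires_grad=True)
codebook = torch.randn(16, 8)
q, idx = vector_quantize(z, codebook)
q.sum().backward()
print(idx.tolist(), z.grad.shape)
```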
Cookbook: A framework for improving LLM generative abilities via programmatic data generating templates
Fine-tuning large language models (LLMs) on instruction datasets is a common way to improve their generative capabilities. However, instruction datasets can be expensive and time-consuming to manually curate, and while LLM-generated data i…
Resilience in Knowledge Graph Embeddings
In recent years, knowledge graphs have gained interest and witnessed widespread applications in various domains, such as information retrieval, question-answering, recommendation systems, amongst others. Large-scale knowledge graphs to thi…
Archon: An Architecture Search Framework for Inference-Time Techniques
Inference-time techniques, such as repeated sampling or iterative revisions, are emerging as powerful ways to enhance large-language models (LLMs) at test time. However, best practices for developing systems that combine these techniques r…
Large Language Monkeys: Scaling Inference Compute with Repeated Sampling
Scaling the amount of compute used to train language models has dramatically improved their capabilities. However, when it comes to inference, we often limit models to making only one attempt at a problem. Here, we explore inference comput…
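Repeated sampling of this kind is usually measured with coverage / pass@k. The sketch below pairs a hypothetical generate/verify harness with the standard unbiased pass@k estimator (the estimator is standard; the harness and its callables are assumptions):

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased estimate of P(at least one of k samples is correct),
    given c correct samples observed among n draws."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def coverage(problem, generate, verify, n=100):
    """generate/verify are hypothetical: draw n attempts, score with a checker."""
    samples = [generate(problem) for _ in range(n)]
    c = sum(bool(verify(problem, s)) for s in samples)
    return {k: pass_at_k(n, c, k) for k in (1, 10, 100)}
```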
Just read twice: closing the recall gap for recurrent language models
Recurrent large language models that compete with Transformers in language modeling perplexity are emerging at a rapid rate (e.g., Mamba, RWKV). Excitingly, these architectures use a constant amount of memory during inference. However, due…
State-Free Inference of State-Space Models: The Transfer Function Approach
We approach designing a state-space model for deep learning applications through its dual representation, the transfer function, and uncover a highly efficient sequence parallel inference algorithm that is state-free: unlike other proposed…
Mechanistic Design and Scaling of Hybrid Architectures
The development of deep learning architectures is a resource-demanding process, due to a vast design space, long prototyping times, and high compute costs associated with at-scale model training and evaluation. We set out to simplify this …
Simple linear attention language models balance the recall-throughput tradeoff
Recent work has shown that attention-based language models excel at recall, the ability to ground generations in tokens previously seen in context. However, the efficiency of attention-based models is bottle-necked during inference by the …
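For attention-based models, the inference-time memory that grows with context, the key-value cache, is what a fixed-state recurrent or linear-attention model avoids. A back-of-the-envelope calculation under an assumed Llama-7B-like configuration makes that tradeoff concrete:

```python
# Illustrative KV-cache arithmetic; the model configuration is an assumption.
def kv_cache_bytes(layers, heads, head_dim, seq_len, bytes_per_elem=2):
    return 2 * layers * heads * head_dim * seq_len * bytes_per_elem  # K and V

# Assumed 7B-class config: 32 layers, 32 heads, head_dim 128, fp16 cache.
for L in (2_048, 32_768):
    print(L, kv_cache_bytes(32, 32, 128, L) / 2**20, "MiB per sequence")
```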
Sequence modeling and design from molecular to genome scale with Evo
The genome is a sequence that completely encodes the DNA, RNA, and proteins that orchestrate the function of a whole organism. Advances in machine learning combined with massive datasets of whole genomes could enable a biological foundatio…
Benchmarking and Building Long-Context Retrieval Models with LoCo and M2-BERT
Retrieval pipelines, an integral component of many machine learning systems, perform poorly in domains where documents are long (e.g., 10K tokens or more) and where identifying the relevant document requires synthesizing information across t…
Hydragen: High-Throughput LLM Inference with Shared Prefixes
Transformer-based large language models (LLMs) are now deployed to hundreds of millions of users. LLM inference is commonly performed on batches of sequences that share a prefix, such as few-shot examples or a chatbot system prompt. Decodi…
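A key fact such systems can exploit is that attention over [shared prefix; per-sequence suffix] decomposes exactly into two attentions recombined via their softmax normalizers. A small sketch of that decomposition follows (shapes are illustrative; this is not the paper's kernels):

```python
import torch

def attn_with_lse(q, k, v):
    s = (q @ k.transpose(-2, -1)) / q.shape[-1] ** 0.5
    lse = torch.logsumexp(s, dim=-1, keepdim=True)      # softmax normalizer (log)
    return torch.softmax(s, dim=-1) @ v, lse

def shared_prefix_attention(q, k_prefix, v_prefix, k_suffix, v_suffix):
    # Prefix keys/values are stored once and broadcast across the batch.
    o_p, lse_p = attn_with_lse(q, k_prefix.expand(q.shape[0], -1, -1),
                               v_prefix.expand(q.shape[0], -1, -1))
    o_s, lse_s = attn_with_lse(q, k_suffix, v_suffix)
    w_p = torch.sigmoid(lse_p - lse_s)                   # = Z_prefix / (Z_prefix + Z_suffix)
    return w_p * o_p + (1 - w_p) * o_s

q = torch.randn(4, 1, 64)                                # 4 sequences decoding one token each
k_p, v_p = torch.randn(1, 256, 64), torch.randn(1, 256, 64)   # shared prompt
k_s, v_s = torch.randn(4, 32, 64), torch.randn(4, 32, 64)     # per-sequence suffixes
print(shared_prefix_attention(q, k_p, v_p, k_s, v_s).shape)
```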
The Hedgehog & the Porcupine: Expressive Linear Attentions with Softmax Mimicry
Linear attentions have shown potential for improving Transformer efficiency, reducing attention's quadratic complexity to linear in sequence length. This holds exciting promise for (1) training linear Transformers from scratch, (2) "finetu…
Zoology: Measuring and Improving Recall in Efficient Language Models
Attention-free language models that combine gating and convolutions are growing in popularity due to their efficiency and increasingly competitive performance. To better understand these architectures, we pretrain a suite of 17 attention a…
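The recall ability in question is typically probed with synthetic (multi-query) associative recall tasks, where a model must retrieve the value paired with a queried key earlier in context. A minimal, assumed data generator looks like this:

```python
import random

def make_recall_example(num_pairs=8, num_queries=4, vocab=range(10, 100)):
    """Build a 'k1 v1 k2 v2 ...' context plus queries whose answers must be recalled."""
    keys = random.sample(list(vocab), num_pairs)
    values = random.sample(list(vocab), num_pairs)
    kv = dict(zip(keys, values))
    context = [tok for k in keys for tok in (k, kv[k])]
    queries = random.sample(keys, num_queries)
    targets = [kv[q] for q in queries]       # the model must output v for each queried k
    return context, queries, targets

print(make_recall_example())
```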
FlashFFTConv: Efficient Convolutions for Long Sequences with Tensor Cores
Convolution models with long filters have demonstrated state-of-the-art reasoning abilities in many long-sequence tasks but lag behind the most optimized Transformers in wall-clock time. A major bottleneck is the Fast Fourier Transform (FF…
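The underlying primitive is a long convolution computed via the FFT in O(L log L) rather than O(L^2). A minimal reference version (not the paper's tensor-core kernels) is:

```python
import torch

def fft_conv(u, k):
    """u: (B, L) input sequence, k: (L,) long filter."""
    L = u.shape[-1]
    u_f = torch.fft.rfft(u, n=2 * L)         # zero-pad to avoid circular wrap-around
    k_f = torch.fft.rfft(k, n=2 * L)
    return torch.fft.irfft(u_f * k_f, n=2 * L)[..., :L]

u, k = torch.randn(2, 1024), torch.randn(1024)
print(fft_conv(u, k).shape)
```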
Laughing Hyena Distillery: Extracting Compact Recurrences From Convolutions
Recent advances in attention-free sequence models rely on convolutions as alternatives to the attention operator at the core of Transformers. In particular, long convolution sequence models have achieved state-of-the-art performance in man…
Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time
Large language models (LLMs) with hundreds of billions of parameters have sparked a new wave of exciting AI applications. However, they are computationally expensive at inference time. Sparsity is a natural approach to reduce this cost, bu…
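The flavor of contextual sparsity described here can be sketched for a single MLP block: a cheap predictor selects, per input, a small set of neurons, and only those rows and columns participate in the matmuls. The predictor and keep ratio below are illustrative assumptions:

```python
import torch

def sparse_mlp(x, W1, W2, predictor, keep=0.1):
    """x: (d,), W1: (h, d), W2: (d, h); predictor scores each of the h neurons."""
    scores = predictor(x)                            # cheap per-input relevance scores
    k = max(1, int(keep * scores.numel()))
    idx = scores.topk(k).indices                     # neurons predicted to be active
    hidden = torch.relu(W1[idx] @ x)                 # (k,) instead of (h,)
    return W2[:, idx] @ hidden                       # dense output, sparse compute

d, h = 64, 256
x = torch.randn(d)
W1, W2 = torch.randn(h, d), torch.randn(d, h)
predictor = torch.nn.Linear(d, h)
print(sparse_mlp(x, W1, W2, lambda t: predictor(t).detach().abs()).shape)
```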
Monarch Mixer: A Simple Sub-Quadratic GEMM-Based Architecture
Machine learning models are increasingly being scaled in both sequence length and model dimension to reach longer contexts and better performance. However, existing architectures such as Transformers scale quadratically along both these ax…
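The sub-quadratic, GEMM-friendly primitive alluded to replaces a dense matrix with a structured one built from block-diagonal GEMMs and permutations. A rough, assumed Monarch-style multiply (block sizes and the exact factorization are illustrative) is:

```python
import torch

def monarch_matmul(x, L_blocks, R_blocks):
    """x: (B, n) with n = b*b; L_blocks, R_blocks: (b, b, b) block-diagonal factors."""
    B, n = x.shape
    b = L_blocks.shape[0]
    x = x.view(B, b, b)
    x = torch.einsum("kij,bkj->bki", R_blocks, x)   # right block-diagonal GEMM
    x = x.transpose(1, 2)                           # fixed permutation between factors
    x = torch.einsum("kij,bkj->bki", L_blocks, x)   # left block-diagonal GEMM
    return x.reshape(B, n)

b = 16
x = torch.randn(8, b * b)
L, R = torch.randn(b, b, b), torch.randn(b, b, b)
print(monarch_matmul(x, L, R).shape)                # O(n * sqrt(n)) work instead of O(n^2)
```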
Context-Aware Meta-Learning
Large Language Models like ChatGPT demonstrate a remarkable capacity to learn new concepts during inference without any fine-tuning. However, visual models trained to detect new objects during inference have been unable to replicate this a…