Christopher Ré
HMAR: Efficient Hierarchical Masked Auto-Regressive Image Generation
Visual Auto-Regressive modeling (VAR) has shown promise in bridging the speed and quality gap between autoregressive image models and diffusion models. VAR reformulates autoregressive modeling by decomposing an image into successive resolu…
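The coarse-to-fine decomposition the abstract refers to can be illustrated with a short sketch of next-scale residual decomposition. The scales, shapes, and interpolation modes below are illustrative assumptions, not the authors' implementation:

```python
# Minimal sketch of a coarse-to-fine residual decomposition of a latent feature map.
import torch
import torch.nn.functional as F

def decompose_to_scales(feature_map, scales=(1, 2, 4, 8)):
    """feature_map: (B, C, H, W) latent, modeled coarse-to-fine over scales."""
    residuals = []
    running = torch.zeros_like(feature_map)
    H, W = feature_map.shape[-2:]
    for s in scales:
        # Residual at this scale: whatever the coarser scales have not explained yet.
        target = F.interpolate(feature_map - running, size=(s, s), mode="area")
        residuals.append(target)
        # Upsample back to full resolution and accumulate the running reconstruction.
        running = running + F.interpolate(target, size=(H, W), mode="bilinear",
                                          align_corners=False)
    # An autoregressive model then predicts residuals[i] from residuals[:i].
    return residuals

x = torch.randn(2, 16, 8, 8)
print([r.shape for r in decompose_to_scales(x)])
```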
Systems and Algorithms for Convolutional Multi-Hybrid Language Models at Scale
We introduce convolutional multi-hybrid architectures, with a design grounded on two simple observations. First, operators in hybrid models can be tailored to token manipulation tasks such as in-context recall, multi-token recall, and comp…
Minions: Cost-efficient Collaboration Between On-device and Cloud Language Models
We investigate an emerging setup in which a small, on-device language model (LM) with access to local data communicates with a frontier, cloud-hosted LM to solve real-world tasks involving financial, medical, and scientific reasoning over …
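As a rough illustration of the kind of local/cloud exchange described here, the sketch below has a cloud model that never sees raw local documents and a small on-device model that answers its targeted requests. The callables `local_lm` and `cloud_lm` and the message format are hypothetical stand-ins, not the paper's protocol:

```python
def collaborate(question, local_documents, local_lm, cloud_lm, max_rounds=3):
    """Hypothetical local/cloud loop: the cloud LM plans, the on-device LM reads."""
    transcript = []
    for _ in range(max_rounds):
        # The cloud model sees only the question and the running transcript,
        # never the raw (private, possibly very long) local documents.
        step = cloud_lm(question=question, transcript=transcript)
        if step.get("final_answer") is not None:
            return step["final_answer"]
        # The small on-device model reads the local data and sends back
        # only a short, targeted reply.
        reply = local_lm(instruction=step["instruction"], documents=local_documents)
        transcript.append({"instruction": step["instruction"], "reply": reply})
    # Out of rounds: ask the cloud model to commit to an answer.
    return cloud_lm(question=question, transcript=transcript,
                    force_answer=True)["final_answer"]
```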
KernelBench: Can LLMs Write Efficient GPU Kernels?
Efficient GPU kernels are crucial for building performant machine learning architectures, but writing them is a time-consuming challenge that requires significant expertise; therefore, we explore using language models (LMs) to automate ker…
CodeMonkeys: Scaling Test-Time Compute for Software Engineering
Scaling test-time compute is a promising axis for improving LLM capabilities. However, test-time compute can be scaled in a variety of ways, and effectively combining different approaches remains an active area of research. Here, we explor…
SMOOTHIE: Label Free Language Model Routing
Large language models (LLMs) are increasingly used in applications where LLM inputs may span many different tasks. Recent work has found that the choice of LLM is consequential, and different LLMs may be good for different input samples. P…
Context Clues: Evaluating Long Context Models for Clinical Prediction Tasks on EHRs
Foundation Models (FMs) trained on Electronic Health Records (EHRs) have achieved state-of-the-art results on numerous clinical prediction tasks. However, most existing EHR FMs have context windows of <1k tokens. This prevents them from mo…
Scaling Laws for Precision
Low precision training and inference affect both the quality and cost of language models, but current scaling laws do not account for this. In this work, we devise "precision-aware" scaling laws for both training and inference. We propose …
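For intuition only, one way to make a scaling law "precision-aware" is to graft a precision term onto a Chinchilla-style loss by shrinking the effective parameter count as bit width drops. The functional form and all constants below are illustrative assumptions, not the fitted law from this paper:

```python
# Illustrative toy curve: lower training precision shrinks the "effective"
# parameter count, which raises the parameter term of a Chinchilla-style loss.
import math

def toy_loss(N, D, bits, A=400.0, B=1800.0, E=1.7, alpha=0.34, beta=0.28, gamma=2.0):
    N_eff = N * (1.0 - math.exp(-bits / gamma))  # fewer effective params at low precision
    return A / N_eff**alpha + B / D**beta + E

for bits in (16, 8, 4, 3):
    print(bits, round(toy_loss(N=1e9, D=2e10, bits=bits), 4))
```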
ThunderKittens: Simple, Fast, and Adorable AI Kernels
The challenge of mapping AI architectures to GPU hardware is creating a critical bottleneck in AI progress. Despite substantial efforts, hand-written custom kernels fail to meet their theoretical performance thresholds, even on well-establ…
LoLCATs: On Low-Rank Linearizing of Large Language Models
Recent works show we can linearize large language models (LLMs) -- swapping the quadratic attentions of popular Transformer-based LLMs with subquadratic analogs, such as linear attention -- avoiding the expensive pretraining costs. However…
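The core swap the abstract describes, replacing softmax attention with a linear-attention analog whose cost is linear in sequence length, looks roughly like the sketch below. This is an illustrative non-causal version with a generic feature map, not the paper's trained LoLCATs layers:

```python
import torch
import torch.nn.functional as F

def softmax_attention(q, k, v):
    # (B, T, T) score matrix: quadratic in sequence length T.
    scores = (q @ k.transpose(-2, -1)) / q.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ v

def linear_attention(q, k, v):
    # Generic positive feature map, chosen here purely for illustration.
    phi = lambda x: F.elu(x) + 1
    q, k = phi(q), phi(k)
    kv = k.transpose(-2, -1) @ v                               # (B, d, d): no T x T matrix
    z = q @ k.sum(dim=-2, keepdim=True).transpose(-2, -1) + 1e-6
    return (q @ kv) / z

q = k = v = torch.randn(1, 128, 64)
print(softmax_attention(q, k, v).shape, linear_attention(q, k, v).shape)
```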
Automated Rewards via LLM-Generated Progress Functions
Large Language Models (LLMs) have the potential to automate reward engineering by leveraging their broad domain knowledge across various tasks. However, they often need many iterations of trial-and-error to generate effective reward functi…
Restructuring Vector Quantization with the Rotation Trick
Vector Quantized Variational AutoEncoders (VQ-VAEs) are designed to compress a continuous input to a discrete latent space and reconstruct it with minimal distortion. They operate by maintaining a set of vectors -- often referred to as the…
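For context, the quantization step being discussed is a nearest-neighbor codebook lookup, usually trained with the straight-through gradient sketched below. The paper's rotation trick changes how gradients flow through this step; the sketch does not reproduce that:

```python
import torch

def vector_quantize(z, codebook):
    """z: (B, d) encoder outputs; codebook: (K, d) learned code vectors."""
    idx = torch.cdist(z, codebook).argmin(dim=-1)   # nearest code per input
    q = codebook[idx]
    # Straight-through estimator: forward pass uses q, gradient treats it as z.
    return z + (q - z).detach(), idx

z = torch.randn(4, 8, requires_grad=True)
codebook = torch.randn(16, 8)
q, idx = vector_quantize(z, codebook)
q.sum().backward()
print(idx.tolist(), z.grad.shape)
```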
Cookbook: A framework for improving LLM generative abilities via programmatic data generating templates
Fine-tuning large language models (LLMs) on instruction datasets is a common way to improve their generative capabilities. However, instruction datasets can be expensive and time-consuming to manually curate, and while LLM-generated data i…
Resilience in Knowledge Graph Embeddings
In recent years, knowledge graphs have gained interest and witnessed widespread applications in various domains, such as information retrieval, question-answering, recommendation systems, amongst others. Large-scale knowledge graphs to thi…
Archon: An Architecture Search Framework for Inference-Time Techniques
Inference-time techniques, such as repeated sampling or iterative revisions, are emerging as powerful ways to enhance large-language models (LLMs) at test time. However, best practices for developing systems that combine these techniques r…
Large Language Monkeys: Scaling Inference Compute with Repeated Sampling
Scaling the amount of compute used to train language models has dramatically improved their capabilities. However, when it comes to inference, we often limit models to making only one attempt at a problem. Here, we explore inference comput…
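Repeated sampling of this kind is usually measured with coverage / pass@k. The sketch below pairs a hypothetical generate/verify harness with the standard unbiased pass@k estimator (the estimator is standard; the harness and its callables are assumptions):

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased estimate of P(at least one of k samples is correct),
    given c correct samples observed among n draws."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def coverage(problem, generate, verify, n=100):
    """generate/verify are hypothetical: draw n attempts, score with a checker."""
    samples = [generate(problem) for _ in range(n)]
    c = sum(bool(verify(problem, s)) for s in samples)
    return {k: pass_at_k(n, c, k) for k in (1, 10, 100)}
```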
Just read twice: closing the recall gap for recurrent language models
Recurrent large language models that compete with Transformers in language modeling perplexity are emerging at a rapid rate (e.g., Mamba, RWKV). Excitingly, these architectures use a constant amount of memory during inference. However, due…
State-Free Inference of State-Space Models: The Transfer Function Approach
We approach designing a state-space model for deep learning applications through its dual representation, the transfer function, and uncover a highly efficient sequence parallel inference algorithm that is state-free: unlike other proposed…
Mechanistic Design and Scaling of Hybrid Architectures
The development of deep learning architectures is a resource-demanding process, due to a vast design space, long prototyping times, and high compute costs associated with at-scale model training and evaluation. We set out to simplify this …
Simple linear attention language models balance the recall-throughput tradeoff
Recent work has shown that attention-based language models excel at recall, the ability to ground generations in tokens previously seen in context. However, the efficiency of attention-based models is bottle-necked during inference by the …
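For attention-based models, the inference-time memory that grows with context, the key-value cache, is what a fixed-state recurrent or linear-attention model avoids. A back-of-the-envelope calculation under an assumed Llama-7B-like configuration makes that tradeoff concrete:

```python
# Illustrative KV-cache arithmetic; the model configuration is an assumption.
def kv_cache_bytes(layers, heads, head_dim, seq_len, bytes_per_elem=2):
    return 2 * layers * heads * head_dim * seq_len * bytes_per_elem  # K and V

# Assumed 7B-class config: 32 layers, 32 heads, head_dim 128, fp16 cache.
for L in (2_048, 32_768):
    print(L, kv_cache_bytes(32, 32, 128, L) / 2**20, "MiB per sequence")
```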
Sequence modeling and design from molecular to genome scale with Evo
The genome is a sequence that completely encodes the DNA, RNA, and proteins that orchestrate the function of a whole organism. Advances in machine learning combined with massive datasets of whole genomes could enable a biological foundatio…
Benchmarking and Building Long-Context Retrieval Models with LoCo and M2-BERT
Retrieval pipelines, an integral component of many machine learning systems, perform poorly in domains where documents are long (e.g., 10K tokens or more) and where identifying the relevant document requires synthesizing information across t…
Hydragen: High-Throughput LLM Inference with Shared Prefixes
Transformer-based large language models (LLMs) are now deployed to hundreds of millions of users. LLM inference is commonly performed on batches of sequences that share a prefix, such as few-shot examples or a chatbot system prompt. Decodi…
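A key fact such systems can exploit is that attention over [shared prefix; per-sequence suffix] decomposes exactly into two attentions recombined via their softmax normalizers. A small sketch of that decomposition follows (shapes are illustrative; this is not the paper's kernels):

```python
import torch

def attn_with_lse(q, k, v):
    s = (q @ k.transpose(-2, -1)) / q.shape[-1] ** 0.5
    lse = torch.logsumexp(s, dim=-1, keepdim=True)      # softmax normalizer (log)
    return torch.softmax(s, dim=-1) @ v, lse

def shared_prefix_attention(q, k_prefix, v_prefix, k_suffix, v_suffix):
    # Prefix keys/values are stored once and broadcast across the batch.
    o_p, lse_p = attn_with_lse(q, k_prefix.expand(q.shape[0], -1, -1),
                               v_prefix.expand(q.shape[0], -1, -1))
    o_s, lse_s = attn_with_lse(q, k_suffix, v_suffix)
    w_p = torch.sigmoid(lse_p - lse_s)                   # = Z_prefix / (Z_prefix + Z_suffix)
    return w_p * o_p + (1 - w_p) * o_s

q = torch.randn(4, 1, 64)                                # 4 sequences decoding one token each
k_p, v_p = torch.randn(1, 256, 64), torch.randn(1, 256, 64)   # shared prompt
k_s, v_s = torch.randn(4, 32, 64), torch.randn(4, 32, 64)     # per-sequence suffixes
print(shared_prefix_attention(q, k_p, v_p, k_s, v_s).shape)
```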
The Hedgehog & the Porcupine: Expressive Linear Attentions with Softmax Mimicry
Linear attentions have shown potential for improving Transformer efficiency, reducing attention's quadratic complexity to linear in sequence length. This holds exciting promise for (1) training linear Transformers from scratch, (2) "finetu…
Zoology: Measuring and Improving Recall in Efficient Language Models
Attention-free language models that combine gating and convolutions are growing in popularity due to their efficiency and increasingly competitive performance. To better understand these architectures, we pretrain a suite of 17 attention a…
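The recall ability in question is typically probed with synthetic (multi-query) associative recall tasks, where a model must retrieve the value paired with a queried key earlier in context. A minimal, assumed data generator looks like this:

```python
import random

def make_recall_example(num_pairs=8, num_queries=4, vocab=range(10, 100)):
    """Build a 'k1 v1 k2 v2 ...' context plus queries whose answers must be recalled."""
    keys = random.sample(list(vocab), num_pairs)
    values = random.sample(list(vocab), num_pairs)
    kv = dict(zip(keys, values))
    context = [tok for k in keys for tok in (k, kv[k])]
    queries = random.sample(keys, num_queries)
    targets = [kv[q] for q in queries]       # the model must output v for each queried k
    return context, queries, targets

print(make_recall_example())
```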
FlashFFTConv: Efficient Convolutions for Long Sequences with Tensor Cores
Convolution models with long filters have demonstrated state-of-the-art reasoning abilities in many long-sequence tasks but lag behind the most optimized Transformers in wall-clock time. A major bottleneck is the Fast Fourier Transform (FF…
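The underlying primitive is a long convolution computed via the FFT in O(L log L) rather than O(L^2). A minimal reference version (not the paper's tensor-core kernels) is:

```python
import torch

def fft_conv(u, k):
    """u: (B, L) input sequence, k: (L,) long filter."""
    L = u.shape[-1]
    u_f = torch.fft.rfft(u, n=2 * L)         # zero-pad to avoid circular wrap-around
    k_f = torch.fft.rfft(k, n=2 * L)
    return torch.fft.irfft(u_f * k_f, n=2 * L)[..., :L]

u, k = torch.randn(2, 1024), torch.randn(1024)
print(fft_conv(u, k).shape)
```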
Laughing Hyena Distillery: Extracting Compact Recurrences From Convolutions
Recent advances in attention-free sequence models rely on convolutions as alternatives to the attention operator at the core of Transformers. In particular, long convolution sequence models have achieved state-of-the-art performance in man…
Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time
Large language models (LLMs) with hundreds of billions of parameters have sparked a new wave of exciting AI applications. However, they are computationally expensive at inference time. Sparsity is a natural approach to reduce this cost, bu…
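The flavor of contextual sparsity described here can be sketched for a single MLP block: a cheap predictor selects, per input, a small set of neurons, and only those rows and columns participate in the matmuls. The predictor and keep ratio below are illustrative assumptions:

```python
import torch

def sparse_mlp(x, W1, W2, predictor, keep=0.1):
    """x: (d,), W1: (h, d), W2: (d, h); predictor scores each of the h neurons."""
    scores = predictor(x)                            # cheap per-input relevance scores
    k = max(1, int(keep * scores.numel()))
    idx = scores.topk(k).indices                     # neurons predicted to be active
    hidden = torch.relu(W1[idx] @ x)                 # (k,) instead of (h,)
    return W2[:, idx] @ hidden                       # dense output, sparse compute

d, h = 64, 256
x = torch.randn(d)
W1, W2 = torch.randn(h, d), torch.randn(d, h)
predictor = torch.nn.Linear(d, h)
print(sparse_mlp(x, W1, W2, lambda t: predictor(t).detach().abs()).shape)
```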
Monarch Mixer: A Simple Sub-Quadratic GEMM-Based Architecture
Machine learning models are increasingly being scaled in both sequence length and model dimension to reach longer contexts and better performance. However, existing architectures such as Transformers scale quadratically along both these ax…
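The sub-quadratic, GEMM-friendly primitive alluded to replaces a dense matrix with a structured one built from block-diagonal GEMMs and permutations. A rough, assumed Monarch-style multiply (block sizes and the exact factorization are illustrative) is:

```python
import torch

def monarch_matmul(x, L_blocks, R_blocks):
    """x: (B, n) with n = b*b; L_blocks, R_blocks: (b, b, b) block-diagonal factors."""
    B, n = x.shape
    b = L_blocks.shape[0]
    x = x.view(B, b, b)
    x = torch.einsum("kij,bkj->bki", R_blocks, x)   # right block-diagonal GEMM
    x = x.transpose(1, 2)                           # fixed permutation between factors
    x = torch.einsum("kij,bkj->bki", L_blocks, x)   # left block-diagonal GEMM
    return x.reshape(B, n)

b = 16
x = torch.randn(8, b * b)
L, R = torch.randn(b, b, b), torch.randn(b, b, b)
print(monarch_matmul(x, L, R).shape)                # O(n * sqrt(n)) work instead of O(n^2)
```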
Context-Aware Meta-Learning
Large Language Models like ChatGPT demonstrate a remarkable capacity to learn new concepts during inference without any fine-tuning. However, visual models trained to detect new objects during inference have been unable to replicate this a…