Noah A. Smith
Fluid Language Model Benchmarking
Language model (LM) benchmarking faces several challenges: comprehensive evaluations are costly, benchmarks often fail to measure the intended capabilities, and evaluation quality can degrade due to labeling errors and benchmark saturation…
Signal and Noise: A Framework for Reducing Uncertainty in Language Model Evaluation
Developing large language models is expensive and involves making decisions with small experiments, typically by evaluating on large, multi-task evaluation suites. In this work, we analyze specific properties which make a benchmark more re…
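A rough sketch of the kind of quantity such an analysis can rest on: a benchmark's signal (how well final scores separate different models) compared against its noise (how much a single model's score wanders across nearby training checkpoints). The definitions and numbers below are illustrative assumptions, not the paper's exact formulation.

    import numpy as np

    def signal_to_noise(final_scores, checkpoint_scores):
        # Signal: how spread out different models' final benchmark scores are.
        signal = np.std(final_scores)
        # Noise: average within-model variation over late training checkpoints.
        noise = np.mean([np.std(run) for run in checkpoint_scores])
        return signal / noise

    # Hypothetical scores for five models, plus three late checkpoints per model.
    finals = [0.42, 0.47, 0.51, 0.55, 0.61]
    checkpoints = [[0.41, 0.43, 0.42], [0.46, 0.48, 0.47], [0.50, 0.52, 0.51],
                   [0.54, 0.56, 0.55], [0.60, 0.62, 0.61]]
    print(signal_to_noise(finals, checkpoints))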
FlexOlmo: Open Language Models for Flexible Data Use
We introduce FlexOlmo, a new class of language models (LMs) that supports (1) distributed training without data sharing, where different model parameters are independently trained on closed datasets, and (2) data-flexible inference, where …
Broken Tokens? Your Language Model can Secretly Handle Non-Canonical Tokenizations
Modern tokenizers employ deterministic algorithms to map text into a single "canonical" token sequence, yet the same string can be encoded as many non-canonical tokenizations using the tokenizer vocabulary. In this work, we investigate the…
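A minimal sketch of the phenomenon in question, assuming the Hugging Face transformers library and the GPT-2 tokenizer: the tokenizer emits one canonical token sequence for a string, but other sequences built from the same vocabulary decode to the identical text.

    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    text = "hello"
    canonical = tok.encode(text)      # the single "canonical" tokenization
    vocab = tok.get_vocab()

    # Brute-force alternatives: split the string in two and keep any split
    # whose halves are both vocabulary items and decode back to the text.
    non_canonical = []
    for i in range(1, len(text)):
        left, right = text[:i], text[i:]
        if left in vocab and right in vocab:
            ids = [vocab[left], vocab[right]]
            if ids != canonical and tok.decode(ids) == text:
                non_canonical.append(ids)

    print("canonical:", canonical)
    print("non-canonical examples:", non_canonical)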
Sampling from Your Language Model One Byte at a Time
Tokenization is used almost universally by modern language models, enabling efficient text representation using multi-byte or multi-character tokens. However, prior work has shown that tokenization can introduce distortion into the model's…
PointArena: Probing Multimodal Grounding Through Language-Guided Pointing
Pointing serves as a fundamental and intuitive mechanism for grounding language within visual contexts, with applications spanning robotics, assistive technologies, and interactive AI systems. While recent multimodal models have started to…
Eval3D: Interpretable and Fine-grained Evaluation for 3D Generation
Despite the unprecedented progress in the field of 3D generation, current systems still often fail to produce high-quality 3D assets that are visually appealing and geometrically and semantically consistent across multiple viewpoints. To e…
On Linear Representations and Pretraining Data Frequency in Language Models
Pretraining data has a direct impact on the behaviors and quality of language models (LMs), but we only understand the most basic principles of this relationship. While most work focuses on pretraining data's effect on downstream task beha…
DataDecide: How to Predict Best Pretraining Data with Small Experiments
Because large language models are expensive to pretrain on different datasets, using smaller-scale experiments to decide on data is crucial for reducing costs. Which benchmarks and methods of making decisions from observed performance at s…
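One way to make that question concrete (an illustrative sketch with made-up scores, not necessarily the paper's metric): treat each pair of candidate datasets as a decision, and ask how often the ranking observed at small scale agrees with the ranking at the target scale.

    from itertools import combinations

    def decision_accuracy(small_scale, large_scale):
        # small_scale / large_scale: dict mapping dataset name -> benchmark score.
        pairs = list(combinations(small_scale, 2))
        agree = sum(
            (small_scale[a] > small_scale[b]) == (large_scale[a] > large_scale[b])
            for a, b in pairs
        )
        return agree / len(pairs)

    # Hypothetical scores for three candidate pretraining corpora.
    small = {"web": 0.31, "web+code": 0.33, "web+books": 0.32}
    large = {"web": 0.48, "web+code": 0.54, "web+books": 0.51}
    print(decision_accuracy(small, large))  # fraction of pairwise decisions that transfer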
Learning Syntax Without Planting Trees: Understanding Hierarchical Generalization in Transformers
Transformers trained on natural language data have been shown to exhibit hierarchical generalization without explicitly encoding any structural bias. In this work, we investigate sources of inductive bias in transformer models and their tr…
2 OLMo 2 Furious
We present OLMo 2, the next generation of our fully open language models. OLMo 2 includes a family of dense autoregressive language models at 7B, 13B and 32B scales with fully released artifacts -- model weights, full training data, traini…
Establishing Task Scaling Laws via Compute-Efficient Model Ladders
We develop task scaling laws and model ladders to predict the individual task performance of pretrained language models (LMs) in the overtrained setting. Standard power laws for language modeling loss cannot accurately model task performan…
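A minimal sketch of the underlying idea, with made-up numbers: fit a saturating power law to measurements from a ladder of small models, then extrapolate to a larger compute budget. The paper's actual formulation is more involved than this single-step fit.

    import numpy as np
    from scipy.optimize import curve_fit

    # Hypothetical ladder measurements: training compute (FLOPs) and task loss.
    compute = np.array([1e18, 3e18, 1e19, 3e19, 1e20])
    task_loss = np.array([2.10, 1.90, 1.72, 1.58, 1.47])

    def power_law(c, A, alpha, E):
        # loss ~= A * (C / 1e18)^(-alpha) + E, with E an irreducible floor.
        return A * np.power(c / 1e18, -alpha) + E

    params, _ = curve_fit(power_law, compute, task_loss, p0=[1.0, 0.3, 1.0])
    print("predicted loss at 1e21 FLOPs:", power_law(1e21, *params))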
Hybrid Preferences: Learning to Route Instances for Human vs. AI Feedback
Learning from human feedback has enabled the alignment of language models (LMs) with human preferences. However, collecting human preferences is expensive and time-consuming, with highly variable annotation quality. An appealing alternativ…
How Performance Pressure Influences AI-Assisted Decision Making
Many domains now employ AI-based decision-making aids, and although the potential for AI systems to assist with decision making is much discussed, human-AI collaboration often underperforms due to factors such as (mis)trust in the AI syste…
Merge to Learn: Efficiently Adding Skills to Language Models with Model Merging
Adapting general-purpose language models to new skills is currently an expensive process that must be repeated as new instruction datasets targeting new skills are created, or can cause the models to forget older skills. In this work, we i…
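A bare-bones illustration of the merging operation in question, here as a simple weighted average of parameters between a general model and a skill-specific one (the paper compares several ways of combining them):

    import torch

    def merge_state_dicts(general, skill, alpha=0.5):
        # Weighted average of two checkpoints with identical architecture:
        # alpha = 0 keeps the general model, alpha = 1 keeps the skill-tuned one.
        return {name: (1 - alpha) * general[name] + alpha * skill[name] for name in general}

    # Usage sketch (assumes both checkpoints share parameter names and shapes):
    # merged = merge_state_dicts(general_model.state_dict(), skill_model.state_dict(), alpha=0.3)
    # general_model.load_state_dict(merged)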
Scalable Training, Simulation, and Serving of Large Language Models and Traffic Systems: A Comprehensive Review
This paper provides a comprehensive review of the state-of-the-art in distributed training, large-scale simulations, and efficient serving of Large Language Models (LLMs) and urban traffic systems. As AI models like GPT-4, PaLM, and LLaMA gro…
OLMoE: Open Mixture-of-Experts Language Models
We introduce OLMoE, a fully open, state-of-the-art language model leveraging sparse Mixture-of-Experts (MoE). OLMoE-1B-7B has 7 billion (B) parameters but uses only 1B per input token. We pretrain it on 5 trillion tokens and further adapt …
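A generic sketch of why a sparse MoE model can have many total but few active parameters per token: a router picks the top-k experts for each token, and only those experts run. This is a textbook top-k router, not OLMoE's exact implementation.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TopKMoE(nn.Module):
        def __init__(self, d_model=16, d_hidden=32, n_experts=8, k=2):
            super().__init__()
            self.router = nn.Linear(d_model, n_experts)
            self.experts = nn.ModuleList(
                nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
                for _ in range(n_experts)
            )
            self.k = k

        def forward(self, x):                        # x: (tokens, d_model)
            scores = self.router(x)                  # (tokens, n_experts)
            weights, idx = scores.topk(self.k, dim=-1)
            weights = F.softmax(weights, dim=-1)
            out = torch.zeros_like(x)
            # Each token's output is a weighted sum over only its k selected experts.
            for slot in range(self.k):
                for e, expert in enumerate(self.experts):
                    mask = idx[:, slot] == e
                    if mask.any():
                        out[mask] += weights[mask, slot, None] * expert(x[mask])
            return out

    layer = TopKMoE()
    y = layer(torch.randn(5, 16))   # 5 tokens; each activates only 2 of the 8 experts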
CPS-TaskForge: Generating Collaborative Problem Solving Environments for Diverse Communication Tasks
Teams can outperform individuals; could adding AI teammates further bolster performance of teams solving problems collaboratively? Collaborative problem solving (CPS) research commonly studies teams with two agents (human-human or human-AI…
Advancements in Distributed Systems for Large Language Model Training and Serving
The rapid advancements in large language models (LLMs) have revolutionized the field of artificial intelligence, enabling breakthroughs in natural language processing, generation, and reasoning. However, the exponential growth in model s…
Data Mixture Inference: What do BPE Tokenizers Reveal about their Training Data?
The pretraining data of today's strongest language models is opaque; in particular, little is known about the proportions of various domains or languages represented. In this work, we tackle a task which we call data mixture inference, whi…
MAGNET: Improving the Multilingual Fairness of Language Models with Adaptive Gradient-Based Tokenization
In multilingual settings, non-Latin scripts and low-resource languages are usually disadvantaged in terms of language models' utility, efficiency, and cost. Specifically, previous studies have reported multiple modeling biases that the cur…
MUSE: Machine Unlearning Six-Way Evaluation for Language Models
Language models (LMs) are trained on vast amounts of text data, which may include private and copyrighted content. Data owners may request the removal of their data from a trained model due to privacy or copyright concerns. However, exactl…
The Art of Saying No: Contextual Noncompliance in Language Models
Chat-based language models are designed to be helpful, yet they should not comply with every user request. While most existing work primarily focuses on refusal of "unsafe" queries, we posit that the scope of noncompliance should be broade…
Decoding-Time Language Model Alignment with Multiple Objectives
Aligning language models (LMs) to human preferences has emerged as a critical pursuit, enabling these models to better serve diverse user needs. Existing methods primarily focus on optimizing LMs for a single reward function, limiting thei…
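A hedged sketch of the general flavor of decoding-time multi-objective combination: take a weighted mixture of next-token log-probabilities from models aligned to different objectives and renormalize before sampling. The paper's actual method and weighting scheme may differ.

    import torch

    def combine_next_token_logprobs(logprobs_per_objective, weights):
        # logprobs_per_objective: list of (vocab,) tensors, one per aligned model.
        stacked = torch.stack(logprobs_per_objective)      # (n_objectives, vocab)
        w = torch.tensor(weights, dtype=stacked.dtype).view(-1, 1)
        combined = (w * stacked).sum(dim=0)                # weighted sum in log space
        return torch.log_softmax(combined, dim=-1)         # renormalize before sampling

    # e.g. 60% weight on a helpfulness-aligned model, 40% on a harmlessness-aligned one:
    # mixed = combine_next_token_logprobs([lp_helpful, lp_harmless], [0.6, 0.4])
    # next_token = torch.multinomial(mixed.exp(), 1)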
Voices Unheard: NLP Resources and Models for Yorùbá Regional Dialects
Yorùbá, an African language with roughly 47 million speakers, encompasses a continuum with several dialects. Recent efforts to develop NLP technologies for African languages have focused on their standard dialects, resulting in disparities f…
Evaluating $n$-Gram Novelty of Language Models Using Rusty-DAWG
How novel are texts generated by language models (LMs) relative to their training corpora? In this work, we investigate the extent to which modern LMs generate $n$-grams from their training data, evaluating both (i) the probability LMs ass…
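A brute-force sketch of the measurement, using a plain Python set where the paper uses a compressed suffix-automaton index (Rusty-DAWG) to make corpus-scale lookup feasible:

    def ngrams(tokens, n):
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    def novelty_rate(generated, training, n):
        # Fraction of n-grams in the generated text that never occur in the training corpus.
        seen = set(ngrams(training, n))
        gen = ngrams(generated, n)
        return sum(g not in seen for g in gen) / max(len(gen), 1)

    training_tokens = "the cat sat on the mat".split()
    generated_tokens = "the cat sat on a rug".split()
    print(novelty_rate(generated_tokens, training_tokens, n=3))  # 0.5 for these toy inputs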
Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models
Humans draw to facilitate reasoning: we draw auxiliary lines when solving geometry problems; we mark and circle when reasoning on maps; we use sketches to amplify our ideas and relieve our limited-capacity working memory. However, such act…