Noah A. Smith
Fluid Language Model Benchmarking
Language model (LM) benchmarking faces several challenges: comprehensive evaluations are costly, benchmarks often fail to measure the intended capabilities, and evaluation quality can degrade due to labeling errors and benchmark saturation…
Signal and Noise: A Framework for Reducing Uncertainty in Language Model Evaluation
Developing large language models is expensive and involves making decisions with small experiments, typically by evaluating on large, multi-task evaluation suites. In this work, we analyze specific properties which make a benchmark more re…
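A rough sketch of the kind of quantity such an analysis can rest on: a benchmark's signal (how well final scores separate different models) compared against its noise (how much a single model's score wanders across nearby training checkpoints). The definitions and numbers below are illustrative assumptions, not the paper's exact formulation.

    import numpy as np

    def signal_to_noise(final_scores, checkpoint_scores):
        # Signal: how spread out different models' final benchmark scores are.
        signal = np.std(final_scores)
        # Noise: average within-model variation over late training checkpoints.
        noise = np.mean([np.std(run) for run in checkpoint_scores])
        return signal / noise

    # Hypothetical scores for five models, plus three late checkpoints per model.
    finals = [0.42, 0.47, 0.51, 0.55, 0.61]
    checkpoints = [[0.41, 0.43, 0.42], [0.46, 0.48, 0.47], [0.50, 0.52, 0.51],
                   [0.54, 0.56, 0.55], [0.60, 0.62, 0.61]]
    print(signal_to_noise(finals, checkpoints))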
FlexOlmo: Open Language Models for Flexible Data Use
We introduce FlexOlmo, a new class of language models (LMs) that supports (1) distributed training without data sharing, where different model parameters are independently trained on closed datasets, and (2) data-flexible inference, where …
Broken Tokens? Your Language Model can Secretly Handle Non-Canonical Tokenizations
Modern tokenizers employ deterministic algorithms to map text into a single "canonical" token sequence, yet the same string can be encoded as many non-canonical tokenizations using the tokenizer vocabulary. In this work, we investigate the…
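A minimal sketch of the phenomenon in question, assuming the Hugging Face transformers library and the GPT-2 tokenizer: the tokenizer emits one canonical token sequence for a string, but other sequences built from the same vocabulary decode to the identical text.

    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    text = "hello"
    canonical = tok.encode(text)      # the single "canonical" tokenization
    vocab = tok.get_vocab()

    # Brute-force alternatives: split the string in two and keep any split
    # whose halves are both vocabulary items and decode back to the text.
    non_canonical = []
    for i in range(1, len(text)):
        left, right = text[:i], text[i:]
        if left in vocab and right in vocab:
            ids = [vocab[left], vocab[right]]
            if ids != canonical and tok.decode(ids) == text:
                non_canonical.append(ids)

    print("canonical:", canonical)
    print("non-canonical examples:", non_canonical)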
Sampling from Your Language Model One Byte at a Time
Tokenization is used almost universally by modern language models, enabling efficient text representation using multi-byte or multi-character tokens. However, prior work has shown that tokenization can introduce distortion into the model's…
PointArena: Probing Multimodal Grounding Through Language-Guided Pointing
Pointing serves as a fundamental and intuitive mechanism for grounding language within visual contexts, with applications spanning robotics, assistive technologies, and interactive AI systems. While recent multimodal models have started to…
Eval3D: Interpretable and Fine-grained Evaluation for 3D Generation
Despite the unprecedented progress in the field of 3D generation, current systems still often fail to produce high-quality 3D assets that are visually appealing and geometrically and semantically consistent across multiple viewpoints. To e…
On Linear Representations and Pretraining Data Frequency in Language Models
Pretraining data has a direct impact on the behaviors and quality of language models (LMs), but we only understand the most basic principles of this relationship. While most work focuses on pretraining data's effect on downstream task beha…
DataDecide: How to Predict Best Pretraining Data with Small Experiments
Because large language models are expensive to pretrain on different datasets, using smaller-scale experiments to decide on data is crucial for reducing costs. Which benchmarks and methods of making decisions from observed performance at s…
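One way to make that question concrete (an illustrative sketch with made-up scores, not necessarily the paper's metric): treat each pair of candidate datasets as a decision, and ask how often the ranking observed at small scale agrees with the ranking at the target scale.

    from itertools import combinations

    def decision_accuracy(small_scale, large_scale):
        # small_scale / large_scale: dict mapping dataset name -> benchmark score.
        pairs = list(combinations(small_scale, 2))
        agree = sum(
            (small_scale[a] > small_scale[b]) == (large_scale[a] > large_scale[b])
            for a, b in pairs
        )
        return agree / len(pairs)

    # Hypothetical scores for three candidate pretraining corpora.
    small = {"web": 0.31, "web+code": 0.33, "web+books": 0.32}
    large = {"web": 0.48, "web+code": 0.54, "web+books": 0.51}
    print(decision_accuracy(small, large))  # fraction of pairwise decisions that transfer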
Learning Syntax Without Planting Trees: Understanding Hierarchical Generalization in Transformers
Transformers trained on natural language data have been shown to exhibit hierarchical generalization without explicitly encoding any structural bias. In this work, we investigate sources of inductive bias in transformer models and their tr…
2 OLMo 2 Furious
We present OLMo 2, the next generation of our fully open language models. OLMo 2 includes a family of dense autoregressive language models at 7B, 13B and 32B scales with fully released artifacts -- model weights, full training data, traini…
Establishing Task Scaling Laws via Compute-Efficient Model Ladders
We develop task scaling laws and model ladders to predict the individual task performance of pretrained language models (LMs) in the overtrained setting. Standard power laws for language modeling loss cannot accurately model task performan…
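A minimal sketch of the underlying idea, with made-up numbers: fit a saturating power law to measurements from a ladder of small models, then extrapolate to a larger compute budget. The paper's actual formulation is more involved than this single-step fit.

    import numpy as np
    from scipy.optimize import curve_fit

    # Hypothetical ladder measurements: training compute (FLOPs) and task loss.
    compute = np.array([1e18, 3e18, 1e19, 3e19, 1e20])
    task_loss = np.array([2.10, 1.90, 1.72, 1.58, 1.47])

    def power_law(c, A, alpha, E):
        # loss ~= A * (C / 1e18)^(-alpha) + E, with E an irreducible floor.
        return A * np.power(c / 1e18, -alpha) + E

    params, _ = curve_fit(power_law, compute, task_loss, p0=[1.0, 0.3, 1.0])
    print("predicted loss at 1e21 FLOPs:", power_law(1e21, *params))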
Hybrid Preferences: Learning to Route Instances for Human vs. AI Feedback
Learning from human feedback has enabled the alignment of language models (LMs) with human preferences. However, collecting human preferences is expensive and time-consuming, with highly variable annotation quality. An appealing alternativ…
How Performance Pressure Influences AI-Assisted Decision Making
Many domains now employ AI-based decision-making aids, and although the potential for AI systems to assist with decision making is much discussed, human-AI collaboration often underperforms due to factors such as (mis)trust in the AI syste…
Merge to Learn: Efficiently Adding Skills to Language Models with Model Merging
Adapting general-purpose language models to new skills is currently an expensive process that must be repeated as new instruction datasets targeting new skills are created, or can cause the models to forget older skills. In this work, we i…
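A bare-bones illustration of the merging operation in question, here as a simple weighted average of parameters between a general model and a skill-specific one (the paper compares several ways of combining them):

    import torch

    def merge_state_dicts(general, skill, alpha=0.5):
        # Weighted average of two checkpoints with identical architecture:
        # alpha = 0 keeps the general model, alpha = 1 keeps the skill-tuned one.
        return {name: (1 - alpha) * general[name] + alpha * skill[name] for name in general}

    # Usage sketch (assumes both checkpoints share parameter names and shapes):
    # merged = merge_state_dicts(general_model.state_dict(), skill_model.state_dict(), alpha=0.3)
    # general_model.load_state_dict(merged)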
Scalable Training, Simulation, and Serving of Large Language Models and Traffic Systems: A Comprehensive Review
This paper provides a comprehensive review of the state-of-the-art in distributed training, large-scale simulations, and efficient serving of Large Language Models (LLMs) and urban traffic systems. As AI models like GPT-4, PaLM, and LLaMA gro…
OLMoE: Open Mixture-of-Experts Language Models
We introduce OLMoE, a fully open, state-of-the-art language model leveraging sparse Mixture-of-Experts (MoE). OLMoE-1B-7B has 7 billion (B) parameters but uses only 1B per input token. We pretrain it on 5 trillion tokens and further adapt …
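A generic sketch of why a sparse MoE model can have many total but few active parameters per token: a router picks the top-k experts for each token, and only those experts run. This is a textbook top-k router, not OLMoE's exact implementation.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TopKMoE(nn.Module):
        def __init__(self, d_model=16, d_hidden=32, n_experts=8, k=2):
            super().__init__()
            self.router = nn.Linear(d_model, n_experts)
            self.experts = nn.ModuleList(
                nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
                for _ in range(n_experts)
            )
            self.k = k

        def forward(self, x):                        # x: (tokens, d_model)
            scores = self.router(x)                  # (tokens, n_experts)
            weights, idx = scores.topk(self.k, dim=-1)
            weights = F.softmax(weights, dim=-1)
            out = torch.zeros_like(x)
            # Each token's output is a weighted sum over only its k selected experts.
            for slot in range(self.k):
                for e, expert in enumerate(self.experts):
                    mask = idx[:, slot] == e
                    if mask.any():
                        out[mask] += weights[mask, slot, None] * expert(x[mask])
            return out

    layer = TopKMoE()
    y = layer(torch.randn(5, 16))   # 5 tokens; each activates only 2 of the 8 experts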
CPS-TaskForge: Generating Collaborative Problem Solving Environments for Diverse Communication Tasks
Teams can outperform individuals; could adding AI teammates further bolster performance of teams solving problems collaboratively? Collaborative problem solving (CPS) research commonly studies teams with two agents (human-human or human-AI…
Advancements in Distributed Systems for Large Language Model Training and Serving
The rapid advancements in large language models (LLMs) have revolutionized the field of artificial intelligence, enabling breakthroughs in natural language processing, generation, and reasoning. However, the exponential growth in model s…
Data Mixture Inference: What do BPE Tokenizers Reveal about their Training Data?
The pretraining data of today's strongest language models is opaque; in particular, little is known about the proportions of various domains or languages represented. In this work, we tackle a task which we call data mixture inference, whi…
MAGNET: Improving the Multilingual Fairness of Language Models with Adaptive Gradient-Based Tokenization
In multilingual settings, non-Latin scripts and low-resource languages are usually disadvantaged in terms of language models' utility, efficiency, and cost. Specifically, previous studies have reported multiple modeling biases that the cur…
MUSE: Machine Unlearning Six-Way Evaluation for Language Models
Language models (LMs) are trained on vast amounts of text data, which may include private and copyrighted content. Data owners may request the removal of their data from a trained model due to privacy or copyright concerns. However, exactl…
The Art of Saying No: Contextual Noncompliance in Language Models
Chat-based language models are designed to be helpful, yet they should not comply with every user request. While most existing work primarily focuses on refusal of "unsafe" queries, we posit that the scope of noncompliance should be broade…
Decoding-Time Language Model Alignment with Multiple Objectives
Aligning language models (LMs) to human preferences has emerged as a critical pursuit, enabling these models to better serve diverse user needs. Existing methods primarily focus on optimizing LMs for a single reward function, limiting thei…
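A hedged sketch of the general flavor of decoding-time multi-objective combination: take a weighted mixture of next-token log-probabilities from models aligned to different objectives and renormalize before sampling. The paper's actual method and weighting scheme may differ.

    import torch

    def combine_next_token_logprobs(logprobs_per_objective, weights):
        # logprobs_per_objective: list of (vocab,) tensors, one per aligned model.
        stacked = torch.stack(logprobs_per_objective)      # (n_objectives, vocab)
        w = torch.tensor(weights, dtype=stacked.dtype).view(-1, 1)
        combined = (w * stacked).sum(dim=0)                # weighted sum in log space
        return torch.log_softmax(combined, dim=-1)         # renormalize before sampling

    # e.g. 60% weight on a helpfulness-aligned model, 40% on a harmlessness-aligned one:
    # mixed = combine_next_token_logprobs([lp_helpful, lp_harmless], [0.6, 0.4])
    # next_token = torch.multinomial(mixed.exp(), 1)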
Voices Unheard: NLP Resources and Models for Yorùbá Regional Dialects
Yorùbá, an African language with roughly 47 million speakers, encompasses a continuum with several dialects. Recent efforts to develop NLP technologies for African languages have focused on their standard dialects, resulting in disparities f…
Evaluating $n$-Gram Novelty of Language Models Using Rusty-DAWG
How novel are texts generated by language models (LMs) relative to their training corpora? In this work, we investigate the extent to which modern LMs generate $n$-grams from their training data, evaluating both (i) the probability LMs ass…
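A brute-force sketch of the measurement, using a plain Python set where the paper uses a compressed suffix-automaton index (Rusty-DAWG) to make corpus-scale lookup feasible:

    def ngrams(tokens, n):
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    def novelty_rate(generated, training, n):
        # Fraction of n-grams in the generated text that never occur in the training corpus.
        seen = set(ngrams(training, n))
        gen = ngrams(generated, n)
        return sum(g not in seen for g in gen) / max(len(gen), 1)

    training_tokens = "the cat sat on the mat".split()
    generated_tokens = "the cat sat on a rug".split()
    print(novelty_rate(generated_tokens, training_tokens, n=3))  # 0.5 for these toy inputs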
Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models
Humans draw to facilitate reasoning: we draw auxiliary lines when solving geometry problems; we mark and circle when reasoning on maps; we use sketches to amplify our ideas and relieve our limited-capacity working memory. However, such act…