Jesse Dodge
Fluid Language Model Benchmarking
Language model (LM) benchmarking faces several challenges: comprehensive evaluations are costly, benchmarks often fail to measure the intended capabilities, and evaluation quality can degrade due to labeling errors and benchmark saturation…
Signal and Noise: A Framework for Reducing Uncertainty in Language Model Evaluation
Developing large language models is expensive and involves making decisions with small experiments, typically by evaluating on large, multi-task evaluation suites. In this work, we analyze specific properties which make a benchmark more re…
SciArena: An Open Evaluation Platform for Foundation Models in Scientific Literature Tasks
We present SciArena, an open and collaborative platform for evaluating foundation models on scientific literature tasks. Unlike traditional benchmarks for scientific literature understanding and synthesis, SciArena engages the research com…
DataDecide: How to Predict Best Pretraining Data with Small Experiments
Because large language models are expensive to pretrain on different datasets, using smaller-scale experiments to decide on data is crucial for reducing costs. Which benchmarks and methods of making decisions from observed performance at s…
Efficient Methods for Natural Language Processing: A Survey
Recent work in natural language processing (NLP) has yielded appealing results from scaling model parameters and training data; however, using only scale to improve performance means that resource consumption also grows. Such resources inc…
The Generative AI Ethics Playbook
The Generative AI Ethics Playbook provides guidance for identifying and mitigating risks of machine learning systems across various domains, including natural language processing, computer vision, and generative AI. This playbook aims to a…
Establishing Task Scaling Laws via Compute-Efficient Model Ladders
We develop task scaling laws and model ladders to predict the individual task performance of pretrained language models (LMs) in the overtrained setting. Standard power laws for language modeling loss cannot accurately model task performan…
Scalable Data Ablation Approximations for Language Models through Modular Training and Merging
Training data compositions for Large Language Models (LLMs) can significantly affect their downstream performance. However, a thorough data ablation study exploring large sets of candidate data mixtures is typically prohibitively expensive…
Merge to Learn: Efficiently Adding Skills to Language Models with Model Merging
Adapting general-purpose language models to new skills is currently an expensive process that must be repeated as new instruction datasets targeting new skills are created, or can cause the models to forget older skills. In this work, we i…
OLMES: A Standard for Language Model Evaluations
Progress in AI is often demonstrated by new models claiming improved performance on tasks measuring model capabilities. Evaluating language models can be particularly challenging, as choices of how a model is evaluated on a task can lead t…
OLMo: Accelerating the Science of Language Models
Language models (LMs) have become ubiquitous in both NLP research and in commercial product offerings. As their commercial importance has surged, the most powerful models have become closed off, gated behind proprietary interfaces, with im…
Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research
Information about pretraining corpora used to train the current best-performing language models is seldom discussed: commercial models rarely detail their data, and even open models are often released without accompanying training data or …
AboutMe: Using Self-Descriptions in Webpages to Document the Effects of English Pretraining Data Filters
The abilities of large language models (LLMs) are drawn from their pretraining data, and model development begins with data curation. However, decisions about which data is retained or removed during this initial stage are under-scrutinized. In …
Helping Cancer Patients to Choose the Best Treatment: Towards Automated Data-Driven and Personalized Information Presentation of Cancer Treatment Options
When a person is diagnosed with cancer, difficult decisions about treatments need to be made. In this chapter, we describe an interdisciplinary research project which aims to automatically generate personalized descriptions of treatment op…
Paloma: A Benchmark for Evaluating Language Model Fit
Evaluations of language models (LMs) commonly report perplexity on monolithic data held out from training. Implicitly or explicitly, this data is composed of domains--varying distributions of language. We introduce Perplexity Analysis for …
Catwalk: A Unified Language Model Evaluation Framework for Many Datasets
The success of large language models has shifted the evaluation paradigms in natural language processing (NLP). The community's interest has drifted towards comparing NLP models across many tasks, domains, and datasets, often at an extreme…
What's In My Big Data?
Large text corpora are the backbone of language models. However, we have a limited understanding of the content of these corpora, including general statistics, quality, social factors, and inclusion of evaluation data (contamination). In t…
Language Models Hallucinate, but May Excel at Fact Verification
Recent progress in natural language processing (NLP) owes much to remarkable advances in large language models (LLMs). Nevertheless, LLMs frequently "hallucinate," resulting in non-factual outputs. Our carefully-designed human evaluation s…
The Rise of Open Science: Tracking the Evolution and Perceived Value of Data and Methods Link-Sharing Practices
In recent years, funding agencies and journals increasingly advocate for open science practices (e.g. data and method sharing) to improve the transparency, access, and reproducibility of science. However, quantifying these practices at sca…
ML Reproducibility Challenge 2022
Editorial
Efficiency Pentathlon: A Standardized Arena for Efficiency Evaluation
Rising computational demands of modern natural language processing (NLP) systems have increased the barrier to entry for cutting-edge research while posing serious environmental concerns. Yet, progress on model efficiency has been impeded …