Jesse Dodge
Fluid Language Model Benchmarking
Language model (LM) benchmarking faces several challenges: comprehensive evaluations are costly, benchmarks often fail to measure the intended capabilities, and evaluation quality can degrade due to labeling errors and benchmark saturation…
Signal and Noise: A Framework for Reducing Uncertainty in Language Model Evaluation
Developing large language models is expensive and involves making decisions with small experiments, typically by evaluating on large, multi-task evaluation suites. In this work, we analyze specific properties which make a benchmark more re…
SciArena: An Open Evaluation Platform for Foundation Models in Scientific Literature Tasks
We present SciArena, an open and collaborative platform for evaluating foundation models on scientific literature tasks. Unlike traditional benchmarks for scientific literature understanding and synthesis, SciArena engages the research com…
DataDecide: How to Predict Best Pretraining Data with Small Experiments
Because large language models are expensive to pretrain on different datasets, using smaller-scale experiments to decide on data is crucial for reducing costs. Which benchmarks and methods of making decisions from observed performance at s…
Efficient Methods for Natural Language Processing: A Survey
Recent work in natural language processing (NLP) has yielded appealing results from scaling model parameters and training data; however, using only scale to improve performance means that resource consumption also grows. Such resources inc…
The Generative AI Ethics Playbook
The Generative AI Ethics Playbook provides guidance for identifying and mitigating risks of machine learning systems across various domains, including natural language processing, computer vision, and generative AI. This playbook aims to a…
Establishing Task Scaling Laws via Compute-Efficient Model Ladders
We develop task scaling laws and model ladders to predict the individual task performance of pretrained language models (LMs) in the overtrained setting. Standard power laws for language modeling loss cannot accurately model task performan…
Scalable Data Ablation Approximations for Language Models through Modular Training and Merging
Training data compositions for Large Language Models (LLMs) can significantly affect their downstream performance. However, a thorough data ablation study exploring large sets of candidate data mixtures is typically prohibitively expensive…
Merge to Learn: Efficiently Adding Skills to Language Models with Model Merging
Adapting general-purpose language models to new skills is currently an expensive process that must be repeated as new instruction datasets targeting new skills are created, or can cause the models to forget older skills. In this work, we i…
OLMES: A Standard for Language Model Evaluations
Progress in AI is often demonstrated by new models claiming improved performance on tasks measuring model capabilities. Evaluating language models can be particularly challenging, as choices of how a model is evaluated on a task can lead t…
OLMo: Accelerating the Science of Language Models
Language models (LMs) have become ubiquitous in both NLP research and in commercial product offerings. As their commercial importance has surged, the most powerful models have become closed off, gated behind proprietary interfaces, with im…
Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research
Information about pretraining corpora used to train the current best-performing language models is seldom discussed: commercial models rarely detail their data, and even open models are often released without accompanying training data or …
AboutMe: Using Self-Descriptions in Webpages to Document the Effects of English Pretraining Data Filters
The abilities of large language models (LLMs) are drawn from their pretraining data, and model development begins with data curation. However, decisions about which data is retained or removed during this initial stage are under-scrutinized. In …
Helping Cancer Patients to Choose the Best Treatment: Towards Automated Data-Driven and Personalized Information Presentation of Cancer Treatment Options
When a person is diagnosed with cancer, difficult decisions about treatments need to be made. In this chapter, we describe an interdisciplinary research project which aims to automatically generate personalized descriptions of treatment op…
Paloma: A Benchmark for Evaluating Language Model Fit
Evaluations of language models (LMs) commonly report perplexity on monolithic data held out from training. Implicitly or explicitly, this data is composed of domains--varying distributions of language. We introduce Perplexity Analysis for …
Catwalk: A Unified Language Model Evaluation Framework for Many Datasets
The success of large language models has shifted the evaluation paradigms in natural language processing (NLP). The community's interest has drifted towards comparing NLP models across many tasks, domains, and datasets, often at an extreme…
What's In My Big Data?
Large text corpora are the backbone of language models. However, we have a limited understanding of the content of these corpora, including general statistics, quality, social factors, and inclusion of evaluation data (contamination). In t…
Language Models Hallucinate, but May Excel at Fact Verification
Recent progress in natural language processing (NLP) owes much to remarkable advances in large language models (LLMs). Nevertheless, LLMs frequently "hallucinate," resulting in non-factual outputs. Our carefully-designed human evaluation s…
The Rise of Open Science: Tracking the Evolution and Perceived Value of Data and Methods Link-Sharing Practices
In recent years, funding agencies and journals increasingly advocate for open science practices (e.g. data and method sharing) to improve the transparency, access, and reproducibility of science. However, quantifying these practices at sca…
ML Reproducibility Challenge 2022
Editorial
Efficiency Pentathlon: A Standardized Arena for Efficiency Evaluation
Rising computational demands of modern natural language processing (NLP) systems have increased the barrier to entry for cutting-edge research while posing serious environmental concerns. Yet, progress on model efficiency has been impeded …