William Merrill
RELIC: Evaluating Compositional Instruction Following via Language Recognition
Large language models (LLMs) are increasingly expected to perform tasks based only on a specification of the task provided in context, without examples of inputs and outputs; this ability is referred to as instruction following. We introdu…
Critical Batch Size Revisited: A Simple Empirical Approach to Large-Batch Language Model Training
The right batch size is important when training language models at scale: a large batch size is necessary for fast training, but a batch size that is too large will harm token efficiency. To navigate this tradeoff, McCandlish et al. (2018)…
Exact Expressive Power of Transformers with Padding
Chain of thought is a natural inference-time method for increasing the computational power of transformer-based large language models (LLMs), but comes at the cost of sequential decoding. Are there more efficient alternatives to expand a t…
Between Circuits and Chomsky: Pre-pretraining on Formal Languages Imparts Linguistic Biases
Pretraining language models on formal language can improve their acquisition of natural language. Which features of the formal language impart an inductive bias that leads to effective transfer? Drawing on insights from linguistics and com…
2 OLMo 2 Furious
We present OLMo 2, the next generation of our fully open language models. OLMo 2 includes a family of dense autoregressive language models at 7B, 13B and 32B scales with fully released artifacts -- model weights, full training data, traini…
Evaluating $n$-Gram Novelty of Language Models Using Rusty-DAWG
How novel are texts generated by language models (LMs) relative to their training corpora? In this work, we investigate the extent to which modern LMs generate $n$-grams from their training data, evaluating both (i) the probability LMs ass…
Let's Think Dot by Dot: Hidden Computation in Transformer Language Models
Chain-of-thought responses from language models improve performance across most benchmarks. However, it remains unclear to what extent these performance gains can be attributed to human-like task decomposition or simply the greater computa…
The Illusion of State in State-Space Models
State-space models (SSMs) have emerged as a potential alternative architecture for building large language models (LLMs) compared to the previously ubiquitous transformer architecture. One theoretical weakness of transformers is that they …
Can You Learn Semantics Through Next-Word Prediction? The Case of Entailment
Do LMs infer the semantics of text from co-occurrence patterns in their training data? Merrill et al. (2022) argue that, in theory, sentence co-occurrence probabilities predicted by an optimal LM should reflect the entailment relationship …
OLMo: Accelerating the Science of Language Models
Language models (LMs) have become ubiquitous in both NLP research and in commercial product offerings. As their commercial importance has surged, the most powerful models have become closed off, gated behind proprietary interfaces, with im…
What Formal Languages Can Transformers Express? A Survey
As transformers have gained prominence in natural language processing, some researchers have investigated theoretically what problems they can and cannot solve, by treating problems as formal languages. Exploring such questions can help cl…
The Expressive Power of Transformers with Chain of Thought
Recent theoretical work has identified surprisingly simple reasoning problems, such as checking if two nodes in a graph are connected or simulating finite-state machines, that are provably unsolvable by standard transformers that answer im…
How Language Model Hallucinations Can Snowball
A major risk of using language models in practical applications is their tendency to hallucinate incorrect statements. Hallucinations are often attributed to knowledge gaps in LMs, but we hypothesize that in some cases, when justifying pre…
A Tale of Two Circuits: Grokking as Competition of Sparse and Dense Subnetworks
Grokking is a phenomenon where a model trained on an algorithmic task first overfits but, then, after a large amount of additional training, undergoes a phase transition to generalize perfectly. We empirically study the internal structure …
Transparency Helps Reveal When Language Models Learn Meaning
Many current NLP systems are built from language models trained to optimize unsupervised objectives on large amounts of raw text. Under what conditions might such a procedure acquire meaning? Our systematic experiments with synthetic data …
The Parallelism Tradeoff: Limitations of Log-Precision Transformers
Despite their omnipresence in modern NLP, characterizing the computational power of transformer neural nets remains an interesting open question. We prove that transformers whose arithmetic precision is logarithmic in the number of input t…
A Logic for Expressing Log-Precision Transformers
One way to interpret the reasoning power of transformer-based language models is to describe the types of logical rules they can resolve over some input text. Recently, Chiang et al. (2023) showed that finite-precision transformers can be …
Entailment Semantics Can Be Extracted from an Ideal Language Model
Language models are often trained on text alone, without additional grounding. There is debate as to how much of natural language semantics can be inferred from such a procedure. We prove that entailment judgments between sentences can be …
ReCLIP: A Strong Zero-Shot Baseline for Referring Expression Comprehension
Training a referring expression comprehension (ReC) model for a new visual domain requires collecting referring expressions, and potentially corresponding bounding boxes, for images in the domain. While large-scale pre-trained models are u…
Extracting Finite Automata from RNNs Using State Merging
One way to interpret the behavior of a blackbox recurrent neural network (RNN) is to extract from it a more interpretable discrete computational model, like a finite state machine, that captures its behavior. In this work, we propose a new…
ReCLIP: A Strong Zero-Shot Baseline for Referring Expression Comprehension
Sanjay Subramanian, William Merrill, Trevor Darrell, Matt Gardner, Sameer Singh, Anna Rohrbach. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2022.
Saturated Transformers are Constant-Depth Threshold Circuits
Transformers have become a standard neural network architecture for many NLP problems, motivating theoretical analysis of their power in terms of formal languages. Recent work has shown that transformers with hard attention are quite limit…