Ofir Press
Probing the Critical Point (CritPt) of AI Reasoning: a Frontier Physics Research Benchmark
While large language models (LLMs) with reasoning capabilities are progressing rapidly on high-school math competitions and coding, can they reason effectively through complex, open-ended challenges found in frontier physics research? And …
SWE-bench Multimodal: Do AI Systems Generalize to Visual Software Domains?
Autonomous systems for software engineering are now capable of fixing bugs and developing features. These systems are commonly evaluated on SWE-bench (Jimenez et al., 2024a), which assesses their ability to solve software issues from GitHu…
EnIGMA: Interactive Tools Substantially Assist LM Agents in Finding Security Vulnerabilities
Although language model (LM) agents have demonstrated increased performance in multiple domains, including coding and web-browsing, their success in cybersecurity has been limited. We present EnIGMA, an LM agent for autonomously solving Ca…
AssistantBench: Can Web Agents Solve Realistic and Time-Consuming Tasks?
Language agents, built on top of language models (LMs), are systems that can interact with complex environments, such as the open web. In this work, we examine whether such agents can perform realistic and time-consuming tasks on the web, …
SciCode: A Research Coding Benchmark Curated by Scientists
Since language models (LMs) now outperform average humans on many challenging tasks, it has become increasingly difficult to develop challenging, high-quality, and realistic evaluations. We address this issue by examining LMs' capabilities…
SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering
Language model (LM) agents are increasingly being used to automate complicated tasks in digital environments. Just as humans benefit from powerful software applications, such as integrated development environments, for complex tasks like s…
SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
Language models have outpaced our ability to evaluate them effectively, but for their future development it is essential to study the frontier of their capabilities. We find real-world software engineering to be a rich, sustainable, and ch…
How Language Model Hallucinations Can Snowball
A major risk of using language models in practical applications is their tendency to hallucinate incorrect statements. Hallucinations are often attributed to knowledge gaps in LMs, but we hypothesize that in some cases, when justifying pre…
Measuring and Narrowing the Compositionality Gap in Language Models
We investigate the ability of language models to perform compositional reasoning tasks where the overall solution depends on correctly composing the answers to sub-problems. We measure how often models can correctly answer all sub-problems…
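To make the measurement above concrete, here is a minimal sketch of how such a ratio could be computed, assuming per-example booleans for sub-question and composed-question correctness (the field names and data layout are illustrative, not taken from the paper):

```python
def compositionality_gap(examples):
    """Fraction of examples where every sub-question is answered correctly
    but the composed question is not, among those with all sub-answers right.

    `examples` is a list of dicts with boolean fields, e.g.
    {"sub_correct": [True, True], "composed_correct": False}.
    """
    all_subs_right = [ex for ex in examples if all(ex["sub_correct"])]
    if not all_subs_right:
        return 0.0
    missed = [ex for ex in all_subs_right if not ex["composed_correct"]]
    return len(missed) / len(all_subs_right)


demo = [
    {"sub_correct": [True, True], "composed_correct": True},
    {"sub_correct": [True, True], "composed_correct": False},
    {"sub_correct": [True, False], "composed_correct": False},
]
print(compositionality_gap(demo))  # 0.5: half of the fully-decomposable cases are missed
```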
What Language Model to Train if You Have One Million GPU Hours?
Teven Le Scao, Thomas Wang, Daniel Hesslow, Stas Bekman, M Saiful Bari, Stella Biderman, Hady Elsahar, Niklas Muennighoff, Jason Phang, Ofir Press, Colin Raffel, Victor Sanh, Sheng Shen, Lintang Sutawika, Jaesung Tae, Zheng Xin Yong, Julie…
The crystallization of modeling methods around the Transformer architecture has been a boon for practitioners. Simple, well-motivated architectural variations can transfer across tasks and scale, increasing the impact of modeling research.…
Transformer Language Models without Positional Encodings Still Learn Positional Information
Causal transformer language models (LMs), such as GPT-3, typically require some form of positional encoding, such as positional embeddings. However, we show that LMs without any explicit positional encoding are still competitive with stand…
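"Without any explicit positional encoding" here means the input to the first layer is just the token embedding, with no added position term; a minimal PyTorch sketch under that assumption (module names and sizes are illustrative, not the paper's code):

```python
import torch
import torch.nn as nn


class NoPosDecoder(nn.Module):
    """Causal transformer LM fed token embeddings only, with no positional term."""

    def __init__(self, vocab_size: int, d_model: int = 128, n_layers: int = 2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)  # note: no position embedding anywhere
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        t = tokens.size(1)
        # The causal mask is the only order-dependent ingredient the model sees.
        causal = nn.Transformer.generate_square_subsequent_mask(t)
        h = self.blocks(self.embed(tokens), mask=causal)
        return self.out(h)


model = NoPosDecoder(vocab_size=1000)
logits = model(torch.randint(0, 1000, (2, 16)))
print(logits.shape)  # torch.Size([2, 16, 1000])
```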
Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation
Since the introduction of the transformer model by Vaswani et al. (2017), a fundamental question has yet to be answered: how does a model achieve extrapolation at inference time for sequences that are longer than it saw during training? We…
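The core mechanism of the linear-bias approach (adding a head-specific linear penalty on query-key distance to the attention scores) can be sketched in a few lines; the slope schedule and shapes below follow the common description of ALiBi, but treat this as an illustrative sketch rather than the reference implementation:

```python
import math
import torch


def alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
    """Per-head linear biases: slope * (key position - query position)."""
    # Geometric slopes, e.g. 1/2, 1/4, ..., 1/256 for 8 heads (a common choice).
    start = 2 ** (-8.0 / n_heads)
    slopes = torch.tensor([start ** (i + 1) for i in range(n_heads)])
    positions = torch.arange(seq_len)
    distance = positions[None, :] - positions[:, None]   # entry [i, j] = j - i (<= 0 causally)
    return slopes[:, None, None] * distance[None, :, :]  # (heads, queries, keys)


def causal_attention_with_alibi(q, k, v):
    """q, k, v: (batch, heads, seq, head_dim)."""
    b, h, t, d = q.shape
    scores = q @ k.transpose(-2, -1) / math.sqrt(d)
    scores = scores + alibi_bias(h, t).to(q.dtype)        # farther keys get a larger penalty
    causal = torch.triu(torch.ones(t, t, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(causal, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v


q = k = v = torch.randn(1, 8, 16, 32)
print(causal_attention_with_alibi(q, k, v).shape)  # torch.Size([1, 8, 16, 32])
```

Because the bias depends only on relative distance, the same function can be applied to sequences longer than any seen in training, which is what makes length extrapolation possible.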
Shortformer: Better Language Modeling using Shorter Inputs
Increasing the input length has been a driver of progress in language modeling with transformers. We identify conditions where shorter inputs are not harmful, and achieve perplexity and efficiency improvements through two new methods that …
Improving Transformer Models by Reordering their Sublayers
Multilayer transformer networks consist of interleaved self-attention and feedforward sublayers. Could ordering the sublayers in a different pattern lead to better performance? We generate randomly ordered transformers and train them with …
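As a toy illustration of what "randomly ordered" sublayer patterns mean, the sketch below samples an interleaving of self-attention ('s') and feedforward ('f') sublayers and assembles the corresponding stack; the builder is illustrative (residual connections and layer norm omitted) and is not the paper's implementation:

```python
import random
import torch.nn as nn


def random_sublayer_pattern(n_self_attn: int, n_feedforward: int) -> str:
    """Random interleaving, e.g. 'sffssfsf', with the given sublayer counts."""
    pattern = list("s" * n_self_attn + "f" * n_feedforward)
    random.shuffle(pattern)
    return "".join(pattern)


def build_stack(pattern: str, d_model: int = 256, n_heads: int = 4) -> nn.ModuleList:
    """Assemble sublayers in the sampled order."""
    layers = []
    for kind in pattern:
        if kind == "s":
            layers.append(nn.MultiheadAttention(d_model, n_heads, batch_first=True))
        else:
            layers.append(nn.Sequential(
                nn.Linear(d_model, 4 * d_model), nn.ReLU(), nn.Linear(4 * d_model, d_model)
            ))
    return nn.ModuleList(layers)


pattern = random_sublayer_pattern(8, 8)
print(pattern, len(build_stack(pattern)))  # e.g. 'fsfssffsffssfsfs 16'
```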
Partially Shuffling the Training Data to Improve Language Models
Although SGD requires shuffling the training data between epochs, currently none of the word-level language modeling systems do this. Naively shuffling all sentences in the training data would not permit the model to learn inter-sentence d…
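The truncated sentence above contrasts naive full shuffling with a gentler alternative. As a purely illustrative contrast (one possible middle ground, not the procedure from the paper), the sketch below shuffles the order of contiguous chunks while keeping each chunk's internal sentence order intact, so some local inter-sentence context survives:

```python
import random


def naive_shuffle(sentences):
    """Full shuffle: destroys the original inter-sentence ordering entirely."""
    shuffled = sentences[:]
    random.shuffle(shuffled)
    return shuffled


def chunked_partial_shuffle(sentences, chunk_size=100):
    """Shuffle only the chunk order, preserving sentence order within each chunk.

    Illustrative only: shows the idea of shuffling less than everything,
    not the specific method proposed in the paper.
    """
    chunks = [sentences[i:i + chunk_size] for i in range(0, len(sentences), chunk_size)]
    random.shuffle(chunks)
    return [s for chunk in chunks for s in chunk]


corpus = [f"sentence {i}" for i in range(1000)]
epoch_data = chunked_partial_shuffle(corpus, chunk_size=100)
print(epoch_data[:3])  # first three sentences of some randomly chosen chunk, still in order
```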
You May Not Need Attention
In NMT, how far can we get without attention and without separate encoding and decoding? To answer that question, we introduce a recurrent neural translation model that does not use attention and does not have a separate encoder and decode…
Language Generation with Recurrent Generative Adversarial Networks without Pre-training
Generative Adversarial Networks (GANs) have shown great promise recently in image generation. Training GANs for language generation has proven to be more difficult, because of the non-differentiable nature of generating text with recurrent…
Using the Output Embedding to Improve Language Models
We study the topmost weight matrix of neural network language models. We show that this matrix constitutes a valid word embedding. When training language models, we recommend tying the input embedding and this output embedding. We analyze …
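A minimal PyTorch sketch of the tying recommendation, assuming an LSTM language model for concreteness (the module and its sizes are illustrative, not the paper's setup):

```python
import torch
import torch.nn as nn


class TiedLM(nn.Module):
    """Toy language model whose output projection reuses the input embedding matrix."""

    def __init__(self, vocab_size: int, d_model: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)          # input embedding
        self.rnn = nn.LSTM(d_model, d_model, batch_first=True)
        self.out = nn.Linear(d_model, vocab_size, bias=False)   # "output embedding"
        # Weight tying: the topmost weight matrix shares storage with the input
        # embedding, so both are updated together and no extra parameters are added.
        self.out.weight = self.embed.weight

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        h, _ = self.rnn(self.embed(tokens))
        return self.out(h)  # logits over the vocabulary


model = TiedLM(vocab_size=10_000)
logits = model(torch.randint(0, 10_000, (2, 8)))
print(logits.shape)  # torch.Size([2, 8, 10000])
```

Tying works because the output projection has the same (vocab_size, d_model) shape as the input embedding, so a single shared matrix can serve both roles.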