Ofir Press
Probing the Critical Point (CritPt) of AI Reasoning: a Frontier Physics Research Benchmark
While large language models (LLMs) with reasoning capabilities are progressing rapidly on high-school math competitions and coding, can they reason effectively through complex, open-ended challenges found in frontier physics research? And …
SWE-bench Multimodal: Do AI Systems Generalize to Visual Software Domains?
Autonomous systems for software engineering are now capable of fixing bugs and developing features. These systems are commonly evaluated on SWE-bench (Jimenez et al., 2024a), which assesses their ability to solve software issues from GitHu…
EnIGMA: Interactive Tools Substantially Assist LM Agents in Finding Security Vulnerabilities
Although language model (LM) agents have demonstrated increased performance in multiple domains, including coding and web-browsing, their success in cybersecurity has been limited. We present EnIGMA, an LM agent for autonomously solving Ca…
AssistantBench: Can Web Agents Solve Realistic and Time-Consuming Tasks?
Language agents, built on top of language models (LMs), are systems that can interact with complex environments, such as the open web. In this work, we examine whether such agents can perform realistic and time-consuming tasks on the web, …
SciCode: A Research Coding Benchmark Curated by Scientists
Since language models (LMs) now outperform average humans on many challenging tasks, it has become increasingly difficult to develop challenging, high-quality, and realistic evaluations. We address this issue by examining LMs' capabilities…
SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering
Language model (LM) agents are increasingly being used to automate complicated tasks in digital environments. Just as humans benefit from powerful software applications, such as integrated development environments, for complex tasks like s…
SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
Language models have outpaced our ability to evaluate them effectively, but for their future development it is essential to study the frontier of their capabilities. We find real-world software engineering to be a rich, sustainable, and ch…
How Language Model Hallucinations Can Snowball
A major risk of using language models in practical applications is their tendency to hallucinate incorrect statements. Hallucinations are often attributed to knowledge gaps in LMs, but we hypothesize that in some cases, when justifying pre…
Measuring and Narrowing the Compositionality Gap in Language Models
We investigate the ability of language models to perform compositional reasoning tasks where the overall solution depends on correctly composing the answers to sub-problems. We measure how often models can correctly answer all sub-problems…
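To make the measurement above concrete, here is a minimal sketch of how such a ratio could be computed, assuming per-example booleans for sub-question and composed-question correctness (the field names and data layout are illustrative, not taken from the paper):

```python
def compositionality_gap(examples):
    """Fraction of examples where every sub-question is answered correctly
    but the composed question is not, among those with all sub-answers right.

    `examples` is a list of dicts with boolean fields, e.g.
    {"sub_correct": [True, True], "composed_correct": False}.
    """
    all_subs_right = [ex for ex in examples if all(ex["sub_correct"])]
    if not all_subs_right:
        return 0.0
    missed = [ex for ex in all_subs_right if not ex["composed_correct"]]
    return len(missed) / len(all_subs_right)


demo = [
    {"sub_correct": [True, True], "composed_correct": True},
    {"sub_correct": [True, True], "composed_correct": False},
    {"sub_correct": [True, False], "composed_correct": False},
]
print(compositionality_gap(demo))  # 0.5: half of the fully-decomposable cases are missed
```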
What Language Model to Train if You Have One Million GPU Hours?
Teven Le Scao, Thomas Wang, Daniel Hesslow, Stas Bekman, M Saiful Bari, Stella Biderman, Hady Elsahar, Niklas Muennighoff, Jason Phang, Ofir Press, Colin Raffel, Victor Sanh, Sheng Shen, Lintang Sutawika, Jaesung Tae, Zheng Xin Yong, Julie…
The crystallization of modeling methods around the Transformer architecture has been a boon for practitioners. Simple, well-motivated architectural variations can transfer across tasks and scale, increasing the impact of modeling research.…
Transformer Language Models without Positional Encodings Still Learn Positional Information
Causal transformer language models (LMs), such as GPT-3, typically require some form of positional encoding, such as positional embeddings. However, we show that LMs without any explicit positional encoding are still competitive with stand…
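"Without any explicit positional encoding" here means the input to the first layer is just the token embedding, with no added position term; a minimal PyTorch sketch under that assumption (module names and sizes are illustrative, not the paper's code):

```python
import torch
import torch.nn as nn


class NoPosDecoder(nn.Module):
    """Causal transformer LM fed token embeddings only, with no positional term."""

    def __init__(self, vocab_size: int, d_model: int = 128, n_layers: int = 2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)  # note: no position embedding anywhere
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        t = tokens.size(1)
        # The causal mask is the only order-dependent ingredient the model sees.
        causal = nn.Transformer.generate_square_subsequent_mask(t)
        h = self.blocks(self.embed(tokens), mask=causal)
        return self.out(h)


model = NoPosDecoder(vocab_size=1000)
logits = model(torch.randint(0, 1000, (2, 16)))
print(logits.shape)  # torch.Size([2, 16, 1000])
```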
Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation
Since the introduction of the transformer model by Vaswani et al. (2017), a fundamental question has yet to be answered: how does a model achieve extrapolation at inference time for sequences that are longer than it saw during training? We…
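The core mechanism of the linear-bias approach (adding a head-specific linear penalty on query-key distance to the attention scores) can be sketched in a few lines; the slope schedule and shapes below follow the common description of ALiBi, but treat this as an illustrative sketch rather than the reference implementation:

```python
import math
import torch


def alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
    """Per-head linear biases: slope * (key position - query position)."""
    # Geometric slopes, e.g. 1/2, 1/4, ..., 1/256 for 8 heads (a common choice).
    start = 2 ** (-8.0 / n_heads)
    slopes = torch.tensor([start ** (i + 1) for i in range(n_heads)])
    positions = torch.arange(seq_len)
    distance = positions[None, :] - positions[:, None]   # entry [i, j] = j - i (<= 0 causally)
    return slopes[:, None, None] * distance[None, :, :]  # (heads, queries, keys)


def causal_attention_with_alibi(q, k, v):
    """q, k, v: (batch, heads, seq, head_dim)."""
    b, h, t, d = q.shape
    scores = q @ k.transpose(-2, -1) / math.sqrt(d)
    scores = scores + alibi_bias(h, t).to(q.dtype)        # farther keys get a larger penalty
    causal = torch.triu(torch.ones(t, t, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(causal, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v


q = k = v = torch.randn(1, 8, 16, 32)
print(causal_attention_with_alibi(q, k, v).shape)  # torch.Size([1, 8, 16, 32])
```

Because the bias depends only on relative distance, the same function can be applied to sequences longer than any seen in training, which is what makes length extrapolation possible.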
Shortformer: Better Language Modeling using Shorter Inputs
Increasing the input length has been a driver of progress in language modeling with transformers. We identify conditions where shorter inputs are not harmful, and achieve perplexity and efficiency improvements through two new methods that …
Improving Transformer Models by Reordering their Sublayers
Multilayer transformer networks consist of interleaved self-attention and feedforward sublayers. Could ordering the sublayers in a different pattern lead to better performance? We generate randomly ordered transformers and train them with …
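As a toy illustration of what "randomly ordered" sublayer patterns mean, the sketch below samples an interleaving of self-attention ('s') and feedforward ('f') sublayers and assembles the corresponding stack; the builder is illustrative (residual connections and layer norm omitted) and is not the paper's implementation:

```python
import random
import torch.nn as nn


def random_sublayer_pattern(n_self_attn: int, n_feedforward: int) -> str:
    """Random interleaving, e.g. 'sffssfsf', with the given sublayer counts."""
    pattern = list("s" * n_self_attn + "f" * n_feedforward)
    random.shuffle(pattern)
    return "".join(pattern)


def build_stack(pattern: str, d_model: int = 256, n_heads: int = 4) -> nn.ModuleList:
    """Assemble sublayers in the sampled order."""
    layers = []
    for kind in pattern:
        if kind == "s":
            layers.append(nn.MultiheadAttention(d_model, n_heads, batch_first=True))
        else:
            layers.append(nn.Sequential(
                nn.Linear(d_model, 4 * d_model), nn.ReLU(), nn.Linear(4 * d_model, d_model)
            ))
    return nn.ModuleList(layers)


pattern = random_sublayer_pattern(8, 8)
print(pattern, len(build_stack(pattern)))  # e.g. 'fsfssffsffssfsfs 16'
```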
Partially Shuffling the Training Data to Improve Language Models
Although SGD requires shuffling the training data between epochs, currently none of the word-level language modeling systems do this. Naively shuffling all sentences in the training data would not permit the model to learn inter-sentence d…
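The truncated sentence above contrasts naive full shuffling with a gentler alternative. As a purely illustrative contrast (one possible middle ground, not the procedure from the paper), the sketch below shuffles the order of contiguous chunks while keeping each chunk's internal sentence order intact, so some local inter-sentence context survives:

```python
import random


def naive_shuffle(sentences):
    """Full shuffle: destroys the original inter-sentence ordering entirely."""
    shuffled = sentences[:]
    random.shuffle(shuffled)
    return shuffled


def chunked_partial_shuffle(sentences, chunk_size=100):
    """Shuffle only the chunk order, preserving sentence order within each chunk.

    Illustrative only: shows the idea of shuffling less than everything,
    not the specific method proposed in the paper.
    """
    chunks = [sentences[i:i + chunk_size] for i in range(0, len(sentences), chunk_size)]
    random.shuffle(chunks)
    return [s for chunk in chunks for s in chunk]


corpus = [f"sentence {i}" for i in range(1000)]
epoch_data = chunked_partial_shuffle(corpus, chunk_size=100)
print(epoch_data[:3])  # first three sentences of some randomly chosen chunk, still in order
```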
You May Not Need Attention
In NMT, how far can we get without attention and without separate encoding and decoding? To answer that question, we introduce a recurrent neural translation model that does not use attention and does not have a separate encoder and decode…
Language Generation with Recurrent Generative Adversarial Networks without Pre-training
Generative Adversarial Networks (GANs) have shown great promise recently in image generation. Training GANs for language generation has proven to be more difficult, because of the non-differentiable nature of generating text with recurrent…
Using the Output Embedding to Improve Language Models
We study the topmost weight matrix of neural network language models. We show that this matrix constitutes a valid word embedding. When training language models, we recommend tying the input embedding and this output embedding. We analyze …
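A minimal PyTorch sketch of the tying recommendation, assuming an LSTM language model for concreteness (the module and its sizes are illustrative, not the paper's setup):

```python
import torch
import torch.nn as nn


class TiedLM(nn.Module):
    """Toy language model whose output projection reuses the input embedding matrix."""

    def __init__(self, vocab_size: int, d_model: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)          # input embedding
        self.rnn = nn.LSTM(d_model, d_model, batch_first=True)
        self.out = nn.Linear(d_model, vocab_size, bias=False)   # "output embedding"
        # Weight tying: the topmost weight matrix shares storage with the input
        # embedding, so both are updated together and no extra parameters are added.
        self.out.weight = self.embed.weight

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        h, _ = self.rnn(self.embed(tokens))
        return self.out(h)  # logits over the vocabulary


model = TiedLM(vocab_size=10_000)
logits = model(torch.randint(0, 10_000, (2, 8)))
print(logits.shape)  # torch.Size([2, 8, 10000])
```

Tying works because the output projection has the same (vocab_size, d_model) shape as the input embedding, so a single shared matrix can serve both roles.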