Mor Geva
Mixing Mechanisms: How Language Models Retrieve Bound Entities In-Context
A key component of in-context reasoning is the ability of language models (LMs) to bind entities for later retrieval. For example, an LM might represent "Ann loves pie" by binding "Ann" to "pie", allowing it to later retrieve "Ann" when as…
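As a rough illustration of the retrieval setup described in this abstract, the snippet below poses a binding query to a small causal LM and compares the probability it assigns to each bound entity at the answer position. The prompt wording, the choice of gpt2, and the single-token treatment of the names are illustrative assumptions, not details from the paper.

# Hypothetical entity-binding retrieval query; prompt format and model are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

context = "Ann loves pie. Bob loves cake."
question = " Who loves pie? Answer:"
inputs = tok(context + question, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits[0, -1]   # next-token distribution at the answer slot

# Compare the score of each bound entity (first token of the name).
for name in [" Ann", " Bob"]:
    token_id = tok(name, add_special_tokens=False).input_ids[0]
    print(name, logits[token_id].item())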
LMEnt: A Suite for Analyzing Knowledge in Language Models from Pretraining Data to Representations
Language models (LMs) increasingly drive real-world applications that require world knowledge. However, the internal processes through which models turn data into representations of knowledge and beliefs about the world are poorly underst…
Beyond the Rosetta Stone: Unification Forces in Generalization Dynamics
Large language models (LLMs) struggle with cross-lingual knowledge transfer: they hallucinate when asked in one language about facts expressed in a different language during training. This work introduces a controlled setting to study the …
Universal Jailbreak Suffixes Are Strong Attention Hijackers
We study suffix-based jailbreaks – a powerful family of attacks against large language models (LLMs) that optimize adversarial suffixes to circumvent safety alignment. Focusing on the widely used foundational GCG attack, we o…
Precise In-Parameter Concept Erasure in Large Language Models
Large language models (LLMs) often acquire knowledge during pretraining that is undesirable in downstream deployments, e.g., sensitive information or copyrighted content. Existing approaches for removing such knowledge rely on fine-tuning,…
Why Is Spatial Reasoning Hard for VLMs? An Attention Mechanism Perspective on Focus Areas
Large Vision Language Models (VLMs) have long struggled with spatial reasoning tasks. Surprisingly, even simple spatial reasoning tasks, such as recognizing "under" or "behind" relationships between only two objects, pose significant chall…
Open Problems in Mechanistic Interpretability
Mechanistic interpretability aims to understand the computational mechanisms underlying neural networks' capabilities in order to accomplish concrete scientific and engineering goals. Progress in this field thus promises to provide greater…
Enhancing Automated Interpretability with Output-Centric Feature Descriptions
Automated interpretability pipelines generate natural language descriptions for the concepts represented by features in large language models (LLMs), such as plants or the first word in a sentence. These descriptions are derived using inpu…
Open Problems in Machine Unlearning for AI Safety
As AI systems become more capable, widely deployed, and increasingly autonomous in critical areas such as cybersecurity, biological research, and healthcare, ensuring their safety and alignment with human values is paramount. Machine unlea…
Performance Gap in Entity Knowledge Extraction Across Modalities in Vision Language Models
Vision-language models (VLMs) excel at extracting and reasoning about information from images. Yet, their capacity to leverage internal knowledge about specific entities remains underexplored. This work investigates the disparity in model …
Eliciting Textual Descriptions from Representations of Continuous Prompts
Continuous prompts, or "soft prompts", are a widely adopted parameter-efficient tuning strategy for large language models, but are often less favored due to their opaque nature. Prior attempts to interpret continuous prompts relied on pr…
Language Models Encode Numbers Using Digit Representations in Base 10
Large language models (LLMs) frequently make errors when handling even simple numerical problems, such as comparing two small numbers. A natural hypothesis is that these errors stem from how LLMs represent numbers, and specifically, whethe…
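One simple way to test a per-digit hypothesis of this kind is to fit a linear probe that predicts a single base-10 digit from the hidden state at a number's last token. The sketch below does this under assumed choices (gpt2, layer 6, the units digit) and is not the paper's exact protocol.

# Probe whether the units digit is linearly decodable from a mid-layer hidden state.
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2", output_hidden_states=True)

def last_token_state(number, layer=6):
    ids = tok(f"The number {number}", return_tensors="pt")
    with torch.no_grad():
        return model(**ids).hidden_states[layer][0, -1].numpy()

numbers = list(range(100, 400))
X = [last_token_state(n) for n in numbers]
y = [n % 10 for n in numbers]                 # label = base-10 units digit

probe = LogisticRegression(max_iter=1000).fit(X[:200], y[:200])
print("held-out digit accuracy:", probe.score(X[200:], y[200:]))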
Towards Interpreting Visual Information Processing in Vision-Language Models
Vision-Language Models (VLMs) are powerful tools for processing and understanding text and images. We study the processing of visual tokens in the language model component of LLaVA, a prominent VLM. Our approach focuses on analyzing the lo…
CoverBench: A Challenging Benchmark for Complex Claim Verification
There is a growing line of research on verifying the correctness of language models' outputs. At the same time, LMs are being used to tackle complex queries that require reasoning. We introduce CoverBench, a challenging benchmark focused o…
When Can Transformers Count to n?
Large language models based on the transformer architecture can solve highly complex tasks. But are there simple tasks that such models cannot solve? Here we focus on very simple counting tasks that involve counting how many times a toke…
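The counting task itself is easy to instantiate; the snippet below builds one example of the kind of query the abstract describes (the exact prompt wording is an assumption).

# Construct a "how many times does a token appear" query with its gold answer.
import random

vocab = list("abcdefgh")

def make_example(length, query="a"):
    seq = [random.choice(vocab) for _ in range(length)]
    prompt = f"Sequence: {' '.join(seq)}\nHow many times does '{query}' appear? Answer:"
    return prompt, seq.count(query)

prompt, answer = make_example(length=30)
print(prompt)
print("gold answer:", answer)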
From Loops to Oops: Fallback Behaviors of Language Models Under Uncertainty
Large language models (LLMs) often exhibit undesirable behaviors, such as hallucinations and sequence repetitions. We propose to view these behaviors as fallbacks that models exhibit under epistemic uncertainty, and investigate the connect…
Estimating Knowledge in Large Language Models Without Generating a Single Token
To evaluate knowledge in large language models (LLMs), current methods query the model and then evaluate its generated responses. In this work, we ask whether evaluation can be done before the model has generated any text. Concretely, is i…
From Insights to Actions: The Impact of Interpretability and Analysis Research on NLP
Interpretability and analysis (IA) research is a growing subfield within NLP with the goal of developing a deeper understanding of the behavior or inner workings of NLP systems and methods. Despite growing interest in the subfield, a criti…
Can Large Language Models Faithfully Express Their Intrinsic Uncertainty in Words?
We posit that large language models (LLMs) should be capable of expressing their intrinsic uncertainty in natural language. For example, if the LLM is equally likely to output two contradicting answers to the same question, then its genera…
RAVEL: Evaluating Interpretability Methods on Disentangling Language Model Representations
Individual neurons participate in the representation of multiple high-level concepts. To what extent can different interpretability methods successfully disentangle these roles? To help address this question, we introduce RAVEL (Resolving …
Backward Lens: Projecting Language Model Gradients into the Vocabulary Space
Understanding how Transformer-based Language Models (LMs) learn and recall information is a key goal of the deep learning community. Recent interpretability methods project weights and hidden states obtained from the forward pass to the mo…
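The abstract refers to projecting vectors into the model's vocabulary space. The sketch below illustrates that general projection mechanism on a forward-pass hidden state (a logit-lens-style read-out) rather than on gradients, so it is not the paper's method, only an illustration of vocabulary-space projection under assumed choices (gpt2, layer 8).

# Illustrative vocabulary-space projection of a mid-layer vector (not the paper's method).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2", output_hidden_states=True)

ids = tok("The capital of France is", return_tensors="pt")
with torch.no_grad():
    hidden = model(**ids).hidden_states[8][0, -1]        # a mid-layer hidden state
    # Project through the final layer norm and unembedding matrix, then read top tokens.
    logits = model.lm_head(model.transformer.ln_f(hidden))

top = torch.topk(logits, 5).indices.tolist()
print([tok.decode([i]) for i in top])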
The Hidden Space of Transformer Language Adapters
We analyze the operation of transformer language adapters, which are small modules trained on top of a frozen language model to adapt its predictions to new target languages. We show that adapted predictions mostly evolve in the source lan…
A Chain-of-Thought Is as Strong as Its Weakest Link: A Benchmark for Verifiers of Reasoning Chains
Prompting language models to provide step-by-step answers (e.g., "Chain-of-Thought") is the prominent approach for complex reasoning tasks, where more accurate reasoning chains typically improve downstream task performance. Recent literatu…
Patchscopes: A Unifying Framework for Inspecting Hidden Representations of Language Models
Understanding the internal representations of large language models (LLMs) can help explain models' behavior and verify their alignment with human values. Given the capabilities of LLMs in generating human-understandable text, we propose l…
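In that spirit, the sketch below patches a hidden representation taken from a source prompt into a hand-written inspection prompt and lets the model continue in natural language. The inspection prompt, layer choice, patch position, and use of gpt2 are illustrative assumptions rather than the paper's configuration.

# Rough representation-patching sketch; all specific choices here are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2", output_hidden_states=True)
LAYER = 6

# 1) Run the source prompt and keep the last token's hidden state after block LAYER.
src = tok("The Eiffel Tower is in", return_tensors="pt")
with torch.no_grad():
    src_hidden = model(**src).hidden_states[LAYER + 1][0, -1]

# 2) Run an inspection prompt and overwrite the placeholder position with that state.
tgt = tok("Syria: country, Paris: city, x:", return_tensors="pt")
patch_pos = tgt.input_ids.shape[1] - 2          # assumed position of the "x" placeholder

def patch_hook(module, inputs, output):
    hidden = output[0]
    if hidden.shape[1] > patch_pos:             # only on the full-prompt forward pass
        hidden[0, patch_pos] = src_hidden
    return (hidden,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(patch_hook)
with torch.no_grad():
    out = model.generate(**tgt, max_new_tokens=5, pad_token_id=tok.eos_token_id)
handle.remove()
print(tok.decode(out[0]))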
Narrowing the Knowledge Evaluation Gap: Open-Domain Question Answering with Multi-Granularity Answers
Factual questions typically can be answered correctly at different levels of granularity. For example, both "August 4, 1961" and "1961" are correct answers to the question "When was Barack Obama born?". Standard question answering (Q…
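A toy version of crediting answers at more than one granularity might look as follows; the string-matching rule is an assumption for illustration, not the paper's evaluation metric.

# Credit a prediction if it matches the gold answer at any granularity.
def correct_at_any_granularity(prediction, gold_answers):
    pred = prediction.strip().lower()
    return any(g.lower() in pred or pred in g.lower() for g in gold_answers)

gold = ["August 4, 1961", "1961"]
print(correct_at_any_granularity("He was born in 1961.", gold))   # True
print(correct_at_any_granularity("In 1962.", gold))               # False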