Mor Geva
Mixing Mechanisms: How Language Models Retrieve Bound Entities In-Context
A key component of in-context reasoning is the ability of language models (LMs) to bind entities for later retrieval. For example, an LM might represent "Ann loves pie" by binding "Ann" to "pie", allowing it to later retrieve "Ann" when as…
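As a rough illustration of the retrieval setup described in this abstract, the snippet below poses a binding query to a small causal LM and compares the probability it assigns to each bound entity at the answer position. The prompt wording, the choice of gpt2, and the single-token treatment of the names are illustrative assumptions, not details from the paper.

# Hypothetical entity-binding retrieval query; prompt format and model are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

context = "Ann loves pie. Bob loves cake."
question = " Who loves pie? Answer:"
inputs = tok(context + question, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits[0, -1]   # next-token distribution at the answer slot

# Compare the score of each bound entity (first token of the name).
for name in [" Ann", " Bob"]:
    token_id = tok(name, add_special_tokens=False).input_ids[0]
    print(name, logits[token_id].item())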
LMEnt: A Suite for Analyzing Knowledge in Language Models from Pretraining Data to Representations
Language models (LMs) increasingly drive real-world applications that require world knowledge. However, the internal processes through which models turn data into representations of knowledge and beliefs about the world are poorly underst…
Beyond the Rosetta Stone: Unification Forces in Generalization Dynamics
Large language models (LLMs) struggle with cross-lingual knowledge transfer: they hallucinate when asked in one language about facts expressed in a different language during training. This work introduces a controlled setting to study the …
Universal Jailbreak Suffixes Are Strong Attention Hijackers
We study suffix-based jailbreaks – a powerful family of attacks against large language models (LLMs) that optimize adversarial suffixes to circumvent safety alignment. Focusing on the widely used foundational GCG attack, we o…
Precise In-Parameter Concept Erasure in Large Language Models
Large language models (LLMs) often acquire knowledge during pretraining that is undesirable in downstream deployments, e.g., sensitive information or copyrighted content. Existing approaches for removing such knowledge rely on fine-tuning,…
Why Is Spatial Reasoning Hard for VLMs? An Attention Mechanism Perspective on Focus Areas
Large Vision Language Models (VLMs) have long struggled with spatial reasoning tasks. Surprisingly, even simple spatial reasoning tasks, such as recognizing "under" or "behind" relationships between only two objects, pose significant chall…
Open Problems in Mechanistic Interpretability
Mechanistic interpretability aims to understand the computational mechanisms underlying neural networks' capabilities in order to accomplish concrete scientific and engineering goals. Progress in this field thus promises to provide greater…
Enhancing Automated Interpretability with Output-Centric Feature Descriptions
Automated interpretability pipelines generate natural language descriptions for the concepts represented by features in large language models (LLMs), such as plants or the first word in a sentence. These descriptions are derived using inpu…
Open Problems in Machine Unlearning for AI Safety
As AI systems become more capable, widely deployed, and increasingly autonomous in critical areas such as cybersecurity, biological research, and healthcare, ensuring their safety and alignment with human values is paramount. Machine unlea…
Performance Gap in Entity Knowledge Extraction Across Modalities in Vision Language Models
Vision-language models (VLMs) excel at extracting and reasoning about information from images. Yet, their capacity to leverage internal knowledge about specific entities remains underexplored. This work investigates the disparity in model …
Eliciting Textual Descriptions from Representations of Continuous Prompts
Continuous prompts, or "soft prompts", are a widely adopted parameter-efficient tuning strategy for large language models, but are often less favored due to their opaque nature. Prior attempts to interpret continuous prompts relied on pr…
Language Models Encode Numbers Using Digit Representations in Base 10
Large language models (LLMs) frequently make errors when handling even simple numerical problems, such as comparing two small numbers. A natural hypothesis is that these errors stem from how LLMs represent numbers, and specifically, whethe…
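One simple way to test a per-digit hypothesis of this kind is to fit a linear probe that predicts a single base-10 digit from the hidden state at a number's last token. The sketch below does this under assumed choices (gpt2, layer 6, the units digit) and is not the paper's exact protocol.

# Probe whether the units digit is linearly decodable from a mid-layer hidden state.
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2", output_hidden_states=True)

def last_token_state(number, layer=6):
    ids = tok(f"The number {number}", return_tensors="pt")
    with torch.no_grad():
        return model(**ids).hidden_states[layer][0, -1].numpy()

numbers = list(range(100, 400))
X = [last_token_state(n) for n in numbers]
y = [n % 10 for n in numbers]                 # label = base-10 units digit

probe = LogisticRegression(max_iter=1000).fit(X[:200], y[:200])
print("held-out digit accuracy:", probe.score(X[200:], y[200:]))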
Towards Interpreting Visual Information Processing in Vision-Language Models
Vision-Language Models (VLMs) are powerful tools for processing and understanding text and images. We study the processing of visual tokens in the language model component of LLaVA, a prominent VLM. Our approach focuses on analyzing the lo…
CoverBench: A Challenging Benchmark for Complex Claim Verification
There is a growing line of research on verifying the correctness of language models' outputs. At the same time, LMs are being used to tackle complex queries that require reasoning. We introduce CoverBench, a challenging benchmark focused o…
When Can Transformers Count to n?
Large language models based on the transformer architecture can solve highly complex tasks. But are there simple tasks that such models cannot solve? Here we focus on very simple counting tasks that involve counting how many times a toke…
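The counting task itself is easy to instantiate; the snippet below builds one example of the kind of query the abstract describes (the exact prompt wording is an assumption).

# Construct a "how many times does a token appear" query with its gold answer.
import random

vocab = list("abcdefgh")

def make_example(length, query="a"):
    seq = [random.choice(vocab) for _ in range(length)]
    prompt = f"Sequence: {' '.join(seq)}\nHow many times does '{query}' appear? Answer:"
    return prompt, seq.count(query)

prompt, answer = make_example(length=30)
print(prompt)
print("gold answer:", answer)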
From Loops to Oops: Fallback Behaviors of Language Models Under Uncertainty
Large language models (LLMs) often exhibit undesirable behaviors, such as hallucinations and sequence repetitions. We propose to view these behaviors as fallbacks that models exhibit under epistemic uncertainty, and investigate the connect…
Estimating Knowledge in Large Language Models Without Generating a Single Token
To evaluate knowledge in large language models (LLMs), current methods query the model and then evaluate its generated responses. In this work, we ask whether evaluation can be done before the model has generated any text. Concretely, is i…
From Insights to Actions: The Impact of Interpretability and Analysis Research on NLP
Interpretability and analysis (IA) research is a growing subfield within NLP with the goal of developing a deeper understanding of the behavior or inner workings of NLP systems and methods. Despite growing interest in the subfield, a criti…
Can Large Language Models Faithfully Express Their Intrinsic Uncertainty in Words?
We posit that large language models (LLMs) should be capable of expressing their intrinsic uncertainty in natural language. For example, if the LLM is equally likely to output two contradicting answers to the same question, then its genera…
RAVEL: Evaluating Interpretability Methods on Disentangling Language Model Representations
Individual neurons participate in the representation of multiple high-level concepts. To what extent can different interpretability methods successfully disentangle these roles? To help address this question, we introduce RAVEL (Resolving …
Backward Lens: Projecting Language Model Gradients into the Vocabulary Space
Understanding how Transformer-based Language Models (LMs) learn and recall information is a key goal of the deep learning community. Recent interpretability methods project weights and hidden states obtained from the forward pass to the mo…
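The abstract refers to projecting vectors into the model's vocabulary space. The sketch below illustrates that general projection mechanism on a forward-pass hidden state (a logit-lens-style read-out) rather than on gradients, so it is not the paper's method, only an illustration of vocabulary-space projection under assumed choices (gpt2, layer 8).

# Illustrative vocabulary-space projection of a mid-layer vector (not the paper's method).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2", output_hidden_states=True)

ids = tok("The capital of France is", return_tensors="pt")
with torch.no_grad():
    hidden = model(**ids).hidden_states[8][0, -1]        # a mid-layer hidden state
    # Project through the final layer norm and unembedding matrix, then read top tokens.
    logits = model.lm_head(model.transformer.ln_f(hidden))

top = torch.topk(logits, 5).indices.tolist()
print([tok.decode([i]) for i in top])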
The Hidden Space of Transformer Language Adapters
We analyze the operation of transformer language adapters, which are small modules trained on top of a frozen language model to adapt its predictions to new target languages. We show that adapted predictions mostly evolve in the source lan…
A Chain-of-Thought Is as Strong as Its Weakest Link: A Benchmark for Verifiers of Reasoning Chains
Prompting language models to provide step-by-step answers (e.g., "Chain-of-Thought") is the prominent approach for complex reasoning tasks, where more accurate reasoning chains typically improve downstream task performance. Recent literatu…
Patchscopes: A Unifying Framework for Inspecting Hidden Representations of Language Models
Understanding the internal representations of large language models (LLMs) can help explain models' behavior and verify their alignment with human values. Given the capabilities of LLMs in generating human-understandable text, we propose l…
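In that spirit, the sketch below patches a hidden representation taken from a source prompt into a hand-written inspection prompt and lets the model continue in natural language. The inspection prompt, layer choice, patch position, and use of gpt2 are illustrative assumptions rather than the paper's configuration.

# Rough representation-patching sketch; all specific choices here are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2", output_hidden_states=True)
LAYER = 6

# 1) Run the source prompt and keep the last token's hidden state after block LAYER.
src = tok("The Eiffel Tower is in", return_tensors="pt")
with torch.no_grad():
    src_hidden = model(**src).hidden_states[LAYER + 1][0, -1]

# 2) Run an inspection prompt and overwrite the placeholder position with that state.
tgt = tok("Syria: country, Paris: city, x:", return_tensors="pt")
patch_pos = tgt.input_ids.shape[1] - 2          # assumed position of the "x" placeholder

def patch_hook(module, inputs, output):
    hidden = output[0]
    if hidden.shape[1] > patch_pos:             # only on the full-prompt forward pass
        hidden[0, patch_pos] = src_hidden
    return (hidden,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(patch_hook)
with torch.no_grad():
    out = model.generate(**tgt, max_new_tokens=5, pad_token_id=tok.eos_token_id)
handle.remove()
print(tok.decode(out[0]))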
Narrowing the Knowledge Evaluation Gap: Open-Domain Question Answering with Multi-Granularity Answers
Factual questions typically can be answered correctly at different levels of granularity. For example, both "August 4, 1961" and "1961" are correct answers to the question "When was Barack Obama born?". Standard question answering (Q…
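A toy version of crediting answers at more than one granularity might look as follows; the string-matching rule is an assumption for illustration, not the paper's evaluation metric.

# Credit a prediction if it matches the gold answer at any granularity.
def correct_at_any_granularity(prediction, gold_answers):
    pred = prediction.strip().lower()
    return any(g.lower() in pred or pred in g.lower() for g in gold_answers)

gold = ["August 4, 1961", "1961"]
print(correct_at_any_granularity("He was born in 1961.", gold))   # True
print(correct_at_any_granularity("In 1962.", gold))               # False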