Robin Jia
LLM Unlearning Without an Expert Curated Dataset
Modern large language models often encode sensitive, harmful, or copyrighted knowledge, raising the need for post-hoc unlearning: the ability to remove specific domains of knowledge from a model without full retraining. A major bottleneck i…
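In its most common formulation, unlearning of this kind optimizes a gradient-difference objective: raise the loss on a forget set while preserving it on a retain set. A minimal sketch of that generic recipe, assuming a Hugging Face-style model whose forward pass returns a loss; this is background for the setup, not the paper's method (which concerns how the forget set itself is built):

```python
import torch

def gradient_difference_step(model, forget_batch, retain_batch, optimizer, alpha=1.0):
    """One update of a generic gradient-difference unlearning objective:
    minimize retain loss while maximizing forget loss (hence the minus sign).
    `forget_batch`/`retain_batch` are dicts with input_ids/labels tensors."""
    optimizer.zero_grad()
    forget_loss = model(**forget_batch).loss   # loss on knowledge to remove
    retain_loss = model(**retain_batch).loss   # loss on knowledge to keep
    loss = retain_loss - alpha * forget_loss   # ascend on forget, descend on retain
    loss.backward()
    optimizer.step()
    return forget_loss.item(), retain_loss.item()
```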
TokenSmith: Streamlining Data Editing, Search, and Inspection for Large-Scale Language Model Training and Interpretability
Understanding the relationship between training data and model behavior during pretraining is crucial, but existing workflows make this process cumbersome, fragmented, and often inaccessible to researchers. We present TokenSmith, an open-s…
Zebra-CoT: A Dataset for Interleaved Vision Language Reasoning
Humans often use visual aids, for example diagrams or sketches, when solving complex problems. Training multimodal models to do the same, known as Visual Chain of Thought (Visual CoT), is challenging due to: (1) poor off-the-shelf visual C…
Function Induction and Task Generalization: An Interpretability Study with Off-by-One Addition
Large language models demonstrate the intriguing ability to perform unseen tasks via in-context learning. However, it remains unclear what mechanisms inside the model drive such task-level generalization. In this work, we approach this que…
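The probe task named in the title, off-by-one addition, is concrete enough to write down: every in-context example reports a sum one greater than the true sum, and the model must induce the shifted function. A hypothetical prompt generator for that task:

```python
import random

def off_by_one_prompt(n_examples=8, seed=0):
    """Build an in-context prompt where every 'sum' is the true sum plus one.
    A model that induces the shifted function should answer a + b + 1."""
    rng = random.Random(seed)
    lines = []
    for _ in range(n_examples):
        a, b = rng.randint(10, 99), rng.randint(10, 99)
        lines.append(f"{a}+{b}={a + b + 1}")
    a, b = rng.randint(10, 99), rng.randint(10, 99)
    lines.append(f"{a}+{b}=")  # query; target under the induced rule is a+b+1
    return "\n".join(lines)

print(off_by_one_prompt())
```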
PSALM-V: Automating Symbolic Planning in Interactive Visual Environments with Large Language Models
We propose PSALM-V, the first autonomous neuro-symbolic learning system able to induce symbolic action semantics (i.e., pre- and post-conditions) in visual environments through interaction. PSALM-V bootstraps reliable symbolic planning wit…
Why Do Some Inputs Break Low-Bit LLM Quantization?
Low-bit weight-only quantization significantly reduces the memory footprint of large language models (LLMs), but disproportionately affects certain examples. We analyze diverse 3-4 bit methods on LLMs ranging from 7B to 70B in size and find t…
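For background, weight-only quantization at this bit width usually means mapping each group of weights to a small grid of levels; a minimal round-to-nearest sketch (a generic baseline, not any specific method analyzed in the paper):

```python
import torch

def quantize_rtn(w: torch.Tensor, bits: int = 4, group_size: int = 128):
    """Asymmetric round-to-nearest, weight-only quantization per group.
    Returns dequantized weights so the error w - w_hat can be inspected."""
    g = w.reshape(-1, group_size)
    lo = g.min(dim=1, keepdim=True).values
    hi = g.max(dim=1, keepdim=True).values
    levels = 2 ** bits - 1                       # e.g. 7 levels at 3 bits
    scale = (hi - lo).clamp(min=1e-8) / levels
    q = torch.round((g - lo) / scale).clamp(0, levels)
    return (q * scale + lo).reshape(w.shape)

w = torch.randn(4096, 4096)
w_hat = quantize_rtn(w, bits=3)
print((w - w_hat).abs().mean())  # mean absolute quantization error
```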
Textual Steering Vectors Can Improve Visual Understanding in Multimodal Large Language Models
Steering methods have emerged as effective and targeted tools for guiding large language models' (LLMs) behavior without modifying their parameters. Multimodal large language models (MLLMs), however, do not currently enjoy the same suite o…
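The steering recipe this line of work builds on is standard: take the difference of mean hidden states between contrastive prompt sets, then add that vector back into the residual stream at inference. A generic sketch assuming a Hugging Face-style causal LM; the layer choice, scale alpha, and hook placement are illustrative, not the paper's exact procedure:

```python
import torch

@torch.no_grad()
def build_steering_vector(model, tokenizer, pos_prompts, neg_prompts, layer):
    """Mean last-token hidden state on `pos_prompts` minus the same on
    `neg_prompts`, taken at one transformer layer."""
    def mean_act(prompts):
        acts = []
        for p in prompts:
            ids = tokenizer(p, return_tensors="pt").input_ids
            hs = model(ids, output_hidden_states=True).hidden_states[layer]
            acts.append(hs[0, -1])  # last-token activation
        return torch.stack(acts).mean(0)
    return mean_act(pos_prompts) - mean_act(neg_prompts)

def add_steering_hook(layer_module, vector, alpha=4.0):
    """Register a forward hook that shifts the layer's output by alpha * vector."""
    def hook(_module, _inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden += alpha * vector  # in-place shift of the residual stream
        return output
    return layer_module.register_forward_hook(hook)
```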
Teaching Models to Understand (but not Generate) High-risk Data
Language model developers typically filter out high-risk content, such as toxic or copyrighted text, from their pre-training data to prevent models from generating similar outputs. However, removing such data altogether limits models' …
Cancer-Myth: Evaluating Large Language Models on Patient Questions with False Presuppositions
Cancer patients are increasingly turning to large language models (LLMs) for medical information, making it critical to assess how well these models handle complex, personalized questions. However, current medical benchmarks focus on medic…
Interrogating LLM design under a fair learning doctrine
The current discourse on large language models (LLMs) and copyright largely takes a "behavioral" perspective, focusing on model outputs and evaluating whether they are substantially similar to training data. However, substantial similarity…
FoNE: Precise Single-Token Number Embeddings via Fourier Features
Large Language Models (LLMs) typically represent numbers using multiple tokens, which requires the model to aggregate these tokens to interpret numerical values. This fragmentation makes both training and inference less efficient and adver…
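One concrete reading of the Fourier-feature idea: encode a number with cosine/sine pairs whose periods match digit places (10, 100, ...), so each digit is exactly recoverable from one pair. A sketch under that assumption; the paper's actual embedding may differ in its details:

```python
import numpy as np

def fourier_number_embedding(x: float, num_places: int = 5) -> np.ndarray:
    """Embed x as [cos(2*pi*x/T), sin(2*pi*x/T)] for periods T = 10, 100, ...
    Each (cos, sin) pair encodes x modulo T, i.e., one digit place, exactly."""
    periods = 10.0 ** np.arange(1, num_places + 1)
    angles = 2 * np.pi * x / periods
    return np.concatenate([np.cos(angles), np.sin(angles)])

def decode_digit(emb: np.ndarray, place: int, num_places: int = 5) -> int:
    """Recover x mod 10^(place+1) from the matching (cos, sin) pair,
    then strip the lower places to read off a single digit."""
    angle = np.arctan2(emb[num_places + place], emb[place]) % (2 * np.pi)
    value = angle / (2 * np.pi) * 10 ** (place + 1)
    return int(round(value)) // 10 ** place % 10

emb = fourier_number_embedding(4729)
print([decode_digit(emb, p) for p in range(4)])  # [9, 2, 7, 4]
```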
Mechanistic Interpretability of Emotion Inference in Large Language Models
Large language models (LLMs) show promising capabilities in predicting human emotions from text. However, the mechanisms through which these models process emotional stimuli remain largely unexplored. Our study addresses this gap by invest…
Verify with Caution: The Pitfalls of Relying on Imperfect Factuality Metrics
Improvements in large language models have led to increasing optimism that they can serve as reliable evaluators of natural language generation outputs. In this paper, we challenge this optimism by thoroughly re-evaluating five state-of-th…
Operationalizing Content Moderation “Accuracy” in the Digital Services Act
The Digital Services Act, recently adopted by the EU, requires social media platforms to report the “accuracy” of their automated content moderation systems. The colloquial term is vague, or open-textured: the literal accuracy (number o…
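A small worked example of why literal accuracy is open-textured: when violating content is rare, a moderator that removes nothing still reports high accuracy. The numbers below are illustrative only:

```python
# Illustrative only: suppose 1% of 1,000,000 items actually violate policy.
total, violating = 1_000_000, 10_000
# A "do nothing" moderator decides correctly on every non-violating item.
correct = total - violating
print(f"literal accuracy: {correct / total:.1%}")  # 99.0%, yet 0% of violations caught
```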
TLDR: Token-Level Detective Reward Model for Large Vision Language Models
Although reward models have been successful in improving multimodal large language models, the reward models themselves remain coarse and contain minimal information. Notably, existing reward models only mimic human annotations by assignin…
Rethinking Backdoor Detection Evaluation for Language Models
Backdoor attacks, in which a model behaves maliciously when given an attacker-specified trigger, pose a major security risk for practitioners who depend on publicly released language models. As a countermeasure, backdoor detection methods …
When Parts Are Greater Than Sums: Individual LLM Components Can Outperform Full Models
This paper studies in-context learning by decomposing the output of large language models into the individual contributions of attention heads and MLPs (components). We observe curious components: good-performing ones that individually do …
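The decomposition rests on the linearity of the transformer residual stream: the final representation is, up to normalization, a sum of every attention head's and MLP's output, so each component can be projected to logits on its own. A hedged sketch of that accounting, with a frozen LayerNorm scale as a simplifying assumption and hypothetical component names:

```python
import torch

def component_logits(component_outputs, W_U, ln_scale):
    """Given each component's contribution to the residual stream at the final
    position (dict: name -> [d_model] tensor), project each through the
    unembedding to get its standalone logits. Freezing the LayerNorm scale is
    a common simplification in this style of analysis."""
    return {name: (out / ln_scale) @ W_U for name, out in component_outputs.items()}

# Hypothetical usage: score which component's standalone logits do best alone.
d_model, vocab = 8, 50
W_U = torch.randn(d_model, vocab)
components = {"head_3.5": torch.randn(d_model), "mlp_7": torch.randn(d_model)}
per_component = component_logits(components, W_U, ln_scale=torch.tensor(1.0))
print({k: v.argmax().item() for k, v in per_component.items()})
```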
Pre-trained Large Language Models Use Fourier Features to Compute Addition
Pre-trained large language models (LLMs) exhibit impressive mathematical reasoning capabilities, yet how they compute basic arithmetic, such as addition, remains unclear. This paper shows that pre-trained LLMs add numbers using Fourier fea…
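The Fourier story of addition has a simple toy form: store a residue a mod 10 as the angle 2*pi*a/10; adding numbers then becomes adding angles, and the answer digit is read off by decoding the rotation. A toy demonstration of the mechanism (not weights extracted from a real model):

```python
import numpy as np

def to_angle(a: int, period: int = 10) -> float:
    """Represent a mod `period` as a point on the unit circle."""
    return 2 * np.pi * a / period

def add_mod(a: int, b: int, period: int = 10) -> int:
    """Add two residues by summing their angles, then decode the digit."""
    angle = (to_angle(a, period) + to_angle(b, period)) % (2 * np.pi)
    return round(angle * period / (2 * np.pi)) % period

print(add_mod(7, 8))  # 5, since (7 + 8) mod 10 = 15 mod 10 = 5
```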
Language Models can Infer Action Semantics for Symbolic Planners from Environment Feedback
Symbolic planners can discover a sequence of actions from initial to goal states given expert-defined, domain-specific logical action semantics. Large Language Models (LLMs) can directly generate such sequences, but limitations in reasonin…
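Action semantics here means the pre- and post-conditions a symbolic planner consumes, typically written in PDDL. A minimal sketch of the kind of schema an LLM would need to infer from environment feedback, with illustrative predicate and action names expressed as Python data:

```python
# Illustrative PDDL-style action schema of the kind a symbolic planner consumes.
PICK_UP = {
    "name": "pick-up",
    "parameters": ["?obj", "?loc"],
    "preconditions": [("at", "?obj", "?loc"), ("robot-at", "?loc"), ("hand-empty",)],
    "add_effects": [("holding", "?obj")],
    "del_effects": [("at", "?obj", "?loc"), ("hand-empty",)],
}

def applicable(state: set, action: dict, binding: dict) -> bool:
    """Check an action's preconditions against a state of ground predicate tuples."""
    ground = lambda pred: tuple(binding.get(t, t) for t in pred)
    return all(ground(p) in state for p in action["preconditions"])

state = {("at", "cup", "table"), ("robot-at", "table"), ("hand-empty",)}
print(applicable(state, PICK_UP, {"?obj": "cup", "?loc": "table"}))  # True
```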
IsoBench: Benchmarking Multimodal Foundation Models on Isomorphic Representations
Current foundation models exhibit impressive capabilities when prompted either with text only or with both image and text inputs. But do their capabilities change depending on the input modality? In this work, we propose IsoBench…
Does VLN Pretraining Work with Nonsensical or Irrelevant Instructions?
Data augmentation via back-translation is common when pretraining Vision-and-Language Navigation (VLN) models, even though the generated instructions are noisy. But does that noise matter? We find that nonsensical or irrelevant language i…