Can Rager
The Quest for the Right Mediator: Surveying Mechanistic Interpretability for NLP Through the Lens of Causal Mediation Analysis
Interpretability provides a toolset for understanding how and why language models behave in certain ways. However, there is little unity in the field: most studies employ ad-hoc evaluations and do not share theoretical foundations, making …
Discovering Forbidden Topics in Language Models
Refusal discovery is the task of identifying the full set of topics that a language model refuses to discuss. We introduce this new problem setting and develop a refusal discovery method, Iterated Prefill Crawler (IPC), that uses token pre…
SAEBench: A Comprehensive Benchmark for Sparse Autoencoders in Language Model Interpretability
Sparse autoencoders (SAEs) are a popular technique for interpreting language model activations, and there is extensive recent work on improving SAE effectiveness. However, most prior work evaluates progress using unsupervised proxy metrics…
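For context on the proxy metrics the abstract refers to, the sketch below shows the standard single-layer ReLU sparse autoencoder over model activations together with the usual unsupervised measures (reconstruction error and L0/L1 sparsity). This is a generic illustration under assumed dimensions and hyperparameters, not code from the paper.

```python
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    """Minimal ReLU sparse autoencoder over model activations (illustrative)."""

    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)  # activations -> dictionary features
        self.decoder = nn.Linear(d_dict, d_model)  # dictionary features -> reconstruction

    def forward(self, acts: torch.Tensor):
        features = torch.relu(self.encoder(acts))  # sparse, non-negative feature activations
        recon = self.decoder(features)
        return recon, features


sae = SparseAutoencoder(d_model=768, d_dict=768 * 16)
acts = torch.randn(32, 768)                        # stand-in for residual-stream activations
recon, features = sae(acts)

# Common unsupervised proxy metrics: reconstruction error and sparsity (L1 / L0).
mse = (recon - acts).pow(2).mean()
l1 = features.abs().sum(dim=-1).mean()
l0 = (features > 0).float().sum(dim=-1).mean()
loss = mse + 3e-4 * l1                             # sparsity coefficient is arbitrary here
```

Benchmarks like SAEBench ask whether improvements on these proxies actually translate into downstream interpretability gains.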
Evaluating Sparse Autoencoders on Targeted Concept Erasure Tasks
Sparse Autoencoders (SAEs) are an interpretability technique aimed at decomposing neural network activations into interpretable units. However, a major bottleneck for SAE development has been the lack of high-quality performance metrics, w…
The Quest for the Right Mediator: Surveying Mechanistic Interpretability Through the Lens of Causal Mediation Analysis
Interpretability provides a toolset for understanding how and why neural networks behave in certain ways. However, there is little unity in the field: most studies employ ad-hoc evaluations and do not share theoretical foundations, making …
Measuring Progress in Dictionary Learning for Language Model Interpretability with Board Game Models
What latent features are encoded in language model (LM) representations? Recent work on training sparse autoencoders (SAEs) to disentangle interpretable features in LM representations has shown significant promise. However, evaluating the …
NNsight and NDIF: Democratizing Access to Open-Weight Foundation Model Internals
We introduce NNsight and NDIF, technologies that work in tandem to enable scientific study of the representations and computations learned by very large neural networks. NNsight is an open-source system that extends PyTorch to introduce de…
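The snippet below sketches the kind of deferred-execution workflow NNsight enables, based on its publicly documented tracing interface; the model name, module path, and layer index are illustrative assumptions, not taken from the paper.

```python
from nnsight import LanguageModel

# Load a small open-weight model locally; NDIF serves much larger models remotely
# through the same interface.
model = LanguageModel("openai-community/gpt2", device_map="auto")

with model.trace("The Eiffel Tower is located in the city of"):
    # Operations declared inside the trace are executed in a deferred fashion on the
    # model's internals; .save() marks values to keep after the trace exits.
    hidden = model.transformer.h[5].output[0].save()  # residual stream after block 5
    logits = model.lm_head.output.save()

# After the trace, saved values resolve to tensors (via `.value` in older releases).
print(hidden.shape, logits.shape)
```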
Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models
We introduce methods for discovering and applying sparse feature circuits. These are causally implicated subnetworks of human-interpretable features for explaining language model behaviors. Circuits identified in prior work consist of poly…
Structured World Representations in Maze-Solving Transformers
Transformer models underpin many recent advances in practical machine learning applications, yet understanding their internal behavior continues to elude researchers. Given the size and complexity of these models, forming a comprehensive p…
Attribution Patching Outperforms Automated Circuit Discovery
Automated interpretability research has recently attracted attention as a potential research direction that could scale explanations of neural network behavior to large models. Existing automated circuit discovery work applies activation p…
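To make the contrast concrete, attribution patching replaces each activation-patching run with a first-order estimate: the change in a behavioral metric is approximated by the gradient at the clean activation times the corrupted-minus-clean activation difference. The sketch below is a minimal, generic PyTorch version of that estimate for one activation site, not the paper's implementation; the `site` and `metric` arguments, and the assumption that the site's output is a single tensor, are mine.

```python
import torch


def attribution_patching_estimate(model, clean_inputs, corrupt_inputs, site, metric):
    """First-order estimate of the effect of patching one activation site.

    `site` is a submodule whose output is a single tensor; `metric` maps model
    outputs to a scalar (e.g. a logit difference).
    """
    cache = {}

    def save_clean(module, inputs, output):
        output.retain_grad()            # keep the gradient on this intermediate activation
        cache["clean"] = output

    handle = site.register_forward_hook(save_clean)
    metric(model(clean_inputs)).backward()
    handle.remove()

    def save_corrupt(module, inputs, output):
        cache["corrupt"] = output.detach()

    handle = site.register_forward_hook(save_corrupt)
    with torch.no_grad():
        model(corrupt_inputs)
    handle.remove()

    # Linear approximation of activation patching:
    # effect ~= sum over the site of (corrupt - clean) * d(metric)/d(clean activation).
    return ((cache["corrupt"] - cache["clean"].detach()) * cache["clean"].grad).sum()
```

Full activation patching would instead re-run the model once per site with the corrupted activation substituted in; the estimate above needs only two forward passes and one backward pass to score every site at once.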
An Adversarial Example for Direct Logit Attribution: Memory Management in GELU-4L
Prior work suggests that language models manage the limited bandwidth of the residual stream through a "memory management" mechanism, where certain attention heads and MLP layers clear residual stream directions set by earlier layers. Our …
A Configurable Library for Generating and Manipulating Maze Datasets
Understanding how machine learning models respond to distributional shifts is a key research challenge. Mazes serve as an excellent testbed due to varied generation algorithms offering a nuanced platform to simulate both subtle and pronoun…
Safety of self-assembled neuromorphic hardware
The scalability of modern computing hardware is limited by physical bottlenecks and high energy consumption. These limitations could be addressed by neuromorphic hardware (NMH), which is inspired by the human brain. NMH enables physically b…