Can Rager
The Quest for the Right Mediator: Surveying Mechanistic Interpretability for NLP Through the Lens of Causal Mediation Analysis
Interpretability provides a toolset for understanding how and why language models behave in certain ways. However, there is little unity in the field: most studies employ ad-hoc evaluations and do not share theoretical foundations, making …
Discovering Forbidden Topics in Language Models
Refusal discovery is the task of identifying the full set of topics that a language model refuses to discuss. We introduce this new problem setting and develop a refusal discovery method, Iterated Prefill Crawler (IPC), that uses token pre…
SAEBench: A Comprehensive Benchmark for Sparse Autoencoders in Language Model Interpretability
Sparse autoencoders (SAEs) are a popular technique for interpreting language model activations, and there is extensive recent work on improving SAE effectiveness. However, most prior work evaluates progress using unsupervised proxy metrics…
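For context on the proxy metrics the abstract refers to, the sketch below shows the standard single-layer ReLU sparse autoencoder over model activations together with the usual unsupervised measures (reconstruction error and L0/L1 sparsity). This is a generic illustration under assumed dimensions and hyperparameters, not code from the paper.

```python
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    """Minimal ReLU sparse autoencoder over model activations (illustrative)."""

    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)  # activations -> dictionary features
        self.decoder = nn.Linear(d_dict, d_model)  # dictionary features -> reconstruction

    def forward(self, acts: torch.Tensor):
        features = torch.relu(self.encoder(acts))  # sparse, non-negative feature activations
        recon = self.decoder(features)
        return recon, features


sae = SparseAutoencoder(d_model=768, d_dict=768 * 16)
acts = torch.randn(32, 768)                        # stand-in for residual-stream activations
recon, features = sae(acts)

# Common unsupervised proxy metrics: reconstruction error and sparsity (L1 / L0).
mse = (recon - acts).pow(2).mean()
l1 = features.abs().sum(dim=-1).mean()
l0 = (features > 0).float().sum(dim=-1).mean()
loss = mse + 3e-4 * l1                             # sparsity coefficient is arbitrary here
```

Benchmarks like SAEBench ask whether improvements on these proxies actually translate into downstream interpretability gains.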
Evaluating Sparse Autoencoders on Targeted Concept Erasure Tasks
Sparse Autoencoders (SAEs) are an interpretability technique aimed at decomposing neural network activations into interpretable units. However, a major bottleneck for SAE development has been the lack of high-quality performance metrics, w…
The Quest for the Right Mediator: Surveying Mechanistic Interpretability Through the Lens of Causal Mediation Analysis
Interpretability provides a toolset for understanding how and why neural networks behave in certain ways. However, there is little unity in the field: most studies employ ad-hoc evaluations and do not share theoretical foundations, making …
Measuring Progress in Dictionary Learning for Language Model Interpretability with Board Game Models
What latent features are encoded in language model (LM) representations? Recent work on training sparse autoencoders (SAEs) to disentangle interpretable features in LM representations has shown significant promise. However, evaluating the …
NNsight and NDIF: Democratizing Access to Open-Weight Foundation Model Internals
We introduce NNsight and NDIF, technologies that work in tandem to enable scientific study of the representations and computations learned by very large neural networks. NNsight is an open-source system that extends PyTorch to introduce de…
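The snippet below sketches the kind of deferred-execution workflow NNsight enables, based on its publicly documented tracing interface; the model name, module path, and layer index are illustrative assumptions, not taken from the paper.

```python
from nnsight import LanguageModel

# Load a small open-weight model locally; NDIF serves much larger models remotely
# through the same interface.
model = LanguageModel("openai-community/gpt2", device_map="auto")

with model.trace("The Eiffel Tower is located in the city of"):
    # Operations declared inside the trace are executed in a deferred fashion on the
    # model's internals; .save() marks values to keep after the trace exits.
    hidden = model.transformer.h[5].output[0].save()  # residual stream after block 5
    logits = model.lm_head.output.save()

# After the trace, saved values resolve to tensors (via `.value` in older releases).
print(hidden.shape, logits.shape)
```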
Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models
We introduce methods for discovering and applying sparse feature circuits. These are causally implicated subnetworks of human-interpretable features for explaining language model behaviors. Circuits identified in prior work consist of poly…
Structured World Representations in Maze-Solving Transformers
Transformer models underpin many recent advances in practical machine learning applications, yet understanding their internal behavior continues to elude researchers. Given the size and complexity of these models, forming a comprehensive p…
Attribution Patching Outperforms Automated Circuit Discovery
Automated interpretability research has recently attracted attention as a potential research direction that could scale explanations of neural network behavior to large models. Existing automated circuit discovery work applies activation p…
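To make the contrast concrete, attribution patching replaces each activation-patching run with a first-order estimate: the change in a behavioral metric is approximated by the gradient at the clean activation times the corrupted-minus-clean activation difference. The sketch below is a minimal, generic PyTorch version of that estimate for one activation site, not the paper's implementation; the `site` and `metric` arguments, and the assumption that the site's output is a single tensor, are mine.

```python
import torch


def attribution_patching_estimate(model, clean_inputs, corrupt_inputs, site, metric):
    """First-order estimate of the effect of patching one activation site.

    `site` is a submodule whose output is a single tensor; `metric` maps model
    outputs to a scalar (e.g. a logit difference).
    """
    cache = {}

    def save_clean(module, inputs, output):
        output.retain_grad()            # keep the gradient on this intermediate activation
        cache["clean"] = output

    handle = site.register_forward_hook(save_clean)
    metric(model(clean_inputs)).backward()
    handle.remove()

    def save_corrupt(module, inputs, output):
        cache["corrupt"] = output.detach()

    handle = site.register_forward_hook(save_corrupt)
    with torch.no_grad():
        model(corrupt_inputs)
    handle.remove()

    # Linear approximation of activation patching:
    # effect ~= sum over the site of (corrupt - clean) * d(metric)/d(clean activation).
    return ((cache["corrupt"] - cache["clean"].detach()) * cache["clean"].grad).sum()
```

Full activation patching would instead re-run the model once per site with the corrupted activation substituted in; the estimate above needs only two forward passes and one backward pass to score every site at once.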
An Adversarial Example for Direct Logit Attribution: Memory Management in GELU-4L
Prior work suggests that language models manage the limited bandwidth of the residual stream through a "memory management" mechanism, where certain attention heads and MLP layers clear residual stream directions set by earlier layers. Our …
A Configurable Library for Generating and Manipulating Maze Datasets
Understanding how machine learning models respond to distributional shifts is a key research challenge. Mazes serve as an excellent testbed due to varied generation algorithms offering a nuanced platform to simulate both subtle and pronoun…
Safety of self-assembled neuromorphic hardware
The scalability of modern computing hardware is limited by physical bottlenecks and high energy consumption. These limitations could be addressed by neuromorphic hardware (NMH), which is inspired by the human brain. NMH enables physically b…