Lee Sharkey
Stochastic Parameter Decomposition
A key step in reverse engineering neural networks is to decompose them into simpler parts that can be studied in relative isolation. Linear parameter decomposition -- a framework that has been proposed to resolve several issues with curren…
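The snippet only names the linear parameter decomposition framework, so as a rough, hypothetical illustration of the general idea (writing a network's weights as a sum of simpler parameter components, which the truncated abstract does not spell out), a toy sketch might look like this; all shapes and names here are illustrative:

    import numpy as np

    rng = np.random.default_rng(0)
    d_in, d_out, n_components = 8, 8, 4

    # Hypothetical toy: a single weight matrix expressed as a sum of rank-one components.
    U = rng.normal(size=(n_components, d_out))
    V = rng.normal(size=(n_components, d_in))
    components = [np.outer(U[c], V[c]) for c in range(n_components)]  # one parameter component each
    W = sum(components)                                               # full weights = sum of components

    x = rng.normal(size=d_in)
    # Summing every component reproduces the original forward pass; the interesting question
    # (not shown here) is which few components are actually needed on a given input.
    assert np.allclose(W @ x, sum(c @ x for c in components))

This is only meant to make "decompose them into simpler parts" concrete; the actual decomposition objective and the stochastic estimation the paper proposes are not described in the truncated text above.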
Sparse Autoencoders Do Not Find Canonical Units of Analysis
A common goal of mechanistic interpretability is to decompose the activations of neural networks into features: interpretable properties of the input computed by the model. Sparse autoencoders (SAEs) are a popular method for finding these …
Open Problems in Mechanistic Interpretability
Mechanistic interpretability aims to understand the computational mechanisms underlying neural networks' capabilities in order to accomplish concrete scientific and engineering goals. Progress in this field thus promises to provide greater…
Interpretability in Parameter Space: Minimizing Mechanistic Description Length with Attribution-based Parameter Decomposition
Mechanistic interpretability aims to understand the internal mechanisms learned by neural networks. Despite recent progress toward this goal, it remains unclear how best to decompose neural network parameters into mechanistic components. W…
Interpretability as Compression: Reconsidering SAE Explanations of Neural Activations with MDL-SAEs
Sparse Autoencoders (SAEs) have emerged as a useful tool for interpreting the internal representations of neural networks. However, naively optimising SAEs for reconstruction loss and sparsity results in a preference for SAEs that are extr…
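The "naive" objective the snippet refers to is the standard SAE training loss: activation reconstruction error plus an L1 sparsity penalty on the latent codes. The MDL-based alternative the paper proposes is not described in the truncated text, so the sketch below only shows that standard objective (names are illustrative):

    import torch

    def naive_sae_loss(x, x_hat, z, l1_coeff=1e-3):
        # Standard SAE objective: reconstruction error plus L1 sparsity on the codes z.
        recon = (x - x_hat).pow(2).sum(dim=-1).mean()
        sparsity = z.abs().sum(dim=-1).mean()
        return recon + l1_coeff * sparsity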
Bilinear MLPs enable weight-based mechanistic interpretability
A mechanistic understanding of how MLPs do computation in deep neural networks remains elusive. Current interpretability work can extract features from hidden activations over an input dataset but generally cannot explain how MLP weights c…
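The truncated abstract does not define the architecture, but in the bilinear-layer literature this line of work builds on, a bilinear MLP replaces the usual nonlinearity with an elementwise product of two linear maps, so each output is a quadratic form in the input whose coefficients can be read directly off the weights. A hedged sketch under that assumption (shapes and names are illustrative):

    import torch

    d_in, d_hidden = 16, 32
    W = torch.randn(d_hidden, d_in)
    V = torch.randn(d_hidden, d_in)

    def bilinear_layer(x):
        # g(x) = (W x) * (V x), elementwise: no activation function.
        return (W @ x) * (V @ x)

    x = torch.randn(d_in)
    out = bilinear_layer(x)

    # Weight-based view: output k equals sum_ij W[k, i] * V[k, j] * x[i] * x[j],
    # i.e. a quadratic form whose coefficient tensor is built from the weights alone.
    B = torch.einsum('ki,kj->kij', W, V)
    assert torch.allclose(out, torch.einsum('kij,i,j->k', B, x, x), atol=1e-4)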
Black-Box Access is Insufficient for Rigorous AI Audits
External audits of AI systems are increasingly recognized as a key mechanism for AI governance. The effectiveness of an audit, however, depends on the degree of access granted to auditors. Recent audits of state-of-the-art AI systems ha…
Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning
Identifying the features learned by neural networks is a core challenge in mechanistic interpretability. Sparse autoencoders (SAEs), which learn a sparse, overcomplete dictionary that reconstructs a network's internal activations, have bee…
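As the snippet says, an SAE learns a sparse, overcomplete dictionary that reconstructs a network's internal activations. A minimal sketch of that structure (dimensions and names are illustrative, and this shows the generic SAE rather than the end-to-end training scheme the paper studies):

    import torch
    import torch.nn as nn

    class SparseAutoencoder(nn.Module):
        def __init__(self, d_act=512, d_dict=4096):   # overcomplete: d_dict >> d_act
            super().__init__()
            self.encoder = nn.Linear(d_act, d_dict)
            self.decoder = nn.Linear(d_dict, d_act, bias=False)  # weight columns = dictionary elements

        def forward(self, acts):
            codes = torch.relu(self.encoder(acts))   # sparse, non-negative feature activations
            recon = self.decoder(codes)              # reconstruction of the original activations
            return recon, codes

    sae = SparseAutoencoder()
    acts = torch.randn(8, 512)                        # a batch of network activations
    recon, codes = sae(acts)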
A Causal Framework for AI Regulation and Auditing
Artificial intelligence (AI) systems are poised to become deeply integrated into society. If developed responsibly, AI has potential to benefit humanity immensely. However, it also poses a range of risks, including risks of catastrophic ac…
Sparse Autoencoders Find Highly Interpretable Features in Language Models
One of the roadblocks to a better understanding of neural networks' internals is polysemanticity, where neurons appear to activate in multiple, semantically distinct contexts. Polysemanticity prevents us from identifying concise, …
A technical note on bilinear layers for interpretability
The ability of neural networks to represent more features than neurons makes interpreting them challenging. This phenomenon, known as superposition, has spurred efforts to find architectures that are more interpretable than standard multil…
Circumventing interpretability: How to defeat mind-readers
The increasing capabilities of artificial intelligence (AI) systems make it ever more important that we interpret their internals to ensure that their intentions are aligned with human values. Yet there is reason to believe that misaligned…
Interpreting Neural Networks through the Polytope Lens
Mechanistic interpretability aims to explain what a neural network has learned at a nuts-and-bolts level. What are the fundamental primitives of neural network representations? Previous mechanistic descriptions have used individual neurons…