Sparse Autoencoders Do Not Find Canonical Units of Analysis
2025 · Open Access · DOI: https://doi.org/10.48550/arxiv.2502.04878
A common goal of mechanistic interpretability is to decompose the activations of neural networks into features: interpretable properties of the input computed by the model. Sparse autoencoders (SAEs) are a popular method for finding these features in LLMs, and it has been postulated that they can be used to find a canonical set of units: a unique and complete list of atomic features. We cast doubt on this belief using two novel techniques: SAE stitching to show they are incomplete, and meta-SAEs to show they are not atomic. SAE stitching involves inserting or swapping latents from a larger SAE into a smaller one. Latents from the larger SAE can be divided into two categories: novel latents, which improve performance when added to the smaller SAE, indicating they capture novel information, and reconstruction latents, which can replace corresponding latents in the smaller SAE that have similar behavior. The existence of novel features indicates incompleteness of smaller SAEs. Using meta-SAEs (SAEs trained on the decoder matrix of another SAE), we find that latents in SAEs often decompose into combinations of latents from a smaller SAE, showing that larger SAE latents are not atomic. The resulting decompositions are often interpretable; e.g., a latent representing "Einstein" decomposes into "scientist", "Germany", and "famous person". Even if SAEs do not find canonical units of analysis, they may still be useful tools. We suggest that future research should either pursue different approaches for identifying such units, or pragmatically choose the SAE size suited to their task. We provide an interactive dashboard to explore meta-SAEs: https://metasaes.streamlit.app/
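The stitching test described in the abstract can be illustrated with a toy sketch. Everything here is an assumption made for illustration rather than the authors' setup: the SAEs are untrained ReLU autoencoders built by hand from axis-aligned directions, the helper names (`make_sae`, `stitch_in`, `recon_mse`) are hypothetical, and the sign of the change in reconstruction MSE is used as a simplified stand-in for the paper's classification criterion.

```python
import numpy as np

def make_sae(directions):
    """Build a toy ReLU SAE whose encoder and decoder share the given directions."""
    D = np.asarray(directions, dtype=float)  # shape (n_latents, d_model)
    return {
        "W_enc": D.T.copy(),                 # shape (d_model, n_latents)
        "b_enc": np.zeros(len(D)),
        "W_dec": D.copy(),                   # shape (n_latents, d_model)
        "b_dec": np.zeros(D.shape[1]),
    }

def recon_mse(sae, x):
    """Mean squared reconstruction error of a ReLU SAE on inputs x."""
    acts = np.maximum(x @ sae["W_enc"] + sae["b_enc"], 0.0)
    x_hat = acts @ sae["W_dec"] + sae["b_dec"]
    return float(np.mean((x - x_hat) ** 2))

def stitch_in(small, large, idx):
    """Return a copy of `small` with latent `idx` of `large` appended."""
    return {
        "W_enc": np.concatenate([small["W_enc"], large["W_enc"][:, [idx]]], axis=1),
        "b_enc": np.concatenate([small["b_enc"], large["b_enc"][[idx]]]),
        "W_dec": np.concatenate([small["W_dec"], large["W_dec"][[idx], :]], axis=0),
        "b_dec": small["b_dec"],
    }

# Toy activations: nonnegative 3-dimensional vectors (assumed data, not model activations).
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=(256, 3))

eye = np.eye(3)
small = make_sae(eye[:2])  # covers directions 0 and 1 only
large = make_sae(eye)      # covers directions 0, 1 and 2

base = recon_mse(small, x)
# Change in MSE when each latent of the larger SAE is inserted into the smaller one.
delta = {i: recon_mse(stitch_in(small, large, i), x) - base for i in range(3)}

novel = [i for i, d in delta.items() if d < 0]            # insertion helps
reconstruction = [i for i, d in delta.items() if d >= 0]  # insertion duplicates an existing latent
```

In this toy setting, latent 2 of the larger SAE covers a direction the smaller SAE misses, so inserting it lowers the error. Latents 0 and 1 duplicate directions the smaller SAE already reconstructs, so adding them raises the error and they would be candidates for swapping instead.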
- Type: preprint
- Language: en
- Landing Page: http://arxiv.org/abs/2502.04878, https://arxiv.org/pdf/2502.04878
- OA Status: green
- Related Works: 10
- OpenAlex ID: https://openalex.org/W4407309861
Raw OpenAlex JSON
- OpenAlex ID: https://openalex.org/W4407309861 (canonical identifier for this work in OpenAlex)
- DOI: https://doi.org/10.48550/arxiv.2502.04878 (Digital Object Identifier)
- Title: Sparse Autoencoders Do Not Find Canonical Units of Analysis (work title)
- Type: preprint (OpenAlex work type)
- Language: en (primary language)
- Publication year: 2025 (year of publication)
- Publication date: 2025-02-07 (full publication date if available)
- Authors: Patrick Leask, Bart Bussmann, Michael Pearce, Joseph Bloom, Curt Tigges, Noura Al Moubayed, Lee Sharkey, Neel Nanda (list of authors in order)
- Landing page: https://arxiv.org/abs/2502.04878 (publisher landing page)
- PDF URL: https://arxiv.org/pdf/2502.04878 (direct link to full text PDF)
- Open access: Yes (whether a free full text is available)
- OA status: green (open access status per OpenAlex)
- OA URL: https://arxiv.org/pdf/2502.04878 (direct OA link when available)
- Concepts: Artificial intelligence, Computer science, Pattern recognition (psychology) (top concepts, i.e. fields/topics, attached by OpenAlex)
- Cited by: 0 (total citation count in OpenAlex)
- Related works (count): 10 (other works algorithmically related by OpenAlex)
Full payload
| Field | Value |
|---|---|
| id | https://openalex.org/W4407309861 |
| doi | https://doi.org/10.48550/arxiv.2502.04878 |
| ids.doi | https://doi.org/10.48550/arxiv.2502.04878 |
| ids.openalex | https://openalex.org/W4407309861 |
| fwci | |
| type | preprint |
| title | Sparse Autoencoders Do Not Find Canonical Units of Analysis |
| biblio.issue | |
| biblio.volume | |
| biblio.last_page | |
| biblio.first_page | |
| topics[0].id | https://openalex.org/T10775 |
| topics[0].field.id | https://openalex.org/fields/17 |
| topics[0].field.display_name | Computer Science |
| topics[0].score | 0.4146000146865845 |
| topics[0].domain.id | https://openalex.org/domains/3 |
| topics[0].domain.display_name | Physical Sciences |
| topics[0].subfield.id | https://openalex.org/subfields/1707 |
| topics[0].subfield.display_name | Computer Vision and Pattern Recognition |
| topics[0].display_name | Generative Adversarial Networks and Image Synthesis |
| topics[1].id | https://openalex.org/T10320 |
| topics[1].field.id | https://openalex.org/fields/17 |
| topics[1].field.display_name | Computer Science |
| topics[1].score | 0.40860000252723694 |
| topics[1].domain.id | https://openalex.org/domains/3 |
| topics[1].domain.display_name | Physical Sciences |
| topics[1].subfield.id | https://openalex.org/subfields/1702 |
| topics[1].subfield.display_name | Artificial Intelligence |
| topics[1].display_name | Neural Networks and Applications |
| is_xpac | False |
| apc_list | |
| apc_paid | |
| concepts[0].id | https://openalex.org/C154945302 |
| concepts[0].level | 1 |
| concepts[0].score | 0.5469241142272949 |
| concepts[0].wikidata | https://www.wikidata.org/wiki/Q11660 |
| concepts[0].display_name | Artificial intelligence |
| concepts[1].id | https://openalex.org/C41008148 |
| concepts[1].level | 0 |
| concepts[1].score | 0.5066002607345581 |
| concepts[1].wikidata | https://www.wikidata.org/wiki/Q21198 |
| concepts[1].display_name | Computer science |
| concepts[2].id | https://openalex.org/C153180895 |
| concepts[2].level | 2 |
| concepts[2].score | 0.4955941438674927 |
| concepts[2].wikidata | https://www.wikidata.org/wiki/Q7148389 |
| concepts[2].display_name | Pattern recognition (psychology) |
| keywords[0].id | https://openalex.org/keywords/artificial-intelligence |
| keywords[0].score | 0.5469241142272949 |
| keywords[0].display_name | Artificial intelligence |
| keywords[1].id | https://openalex.org/keywords/computer-science |
| keywords[1].score | 0.5066002607345581 |
| keywords[1].display_name | Computer science |
| keywords[2].id | https://openalex.org/keywords/pattern-recognition |
| keywords[2].score | 0.4955941438674927 |
| keywords[2].display_name | Pattern recognition (psychology) |
| language | en |
| locations[0].id | pmh:oai:arXiv.org:2502.04878 |
| locations[0].is_oa | True |
| locations[0].source.id | https://openalex.org/S4306400194 |
| locations[0].source.issn | |
| locations[0].source.type | repository |
| locations[0].source.is_oa | True |
| locations[0].source.issn_l | |
| locations[0].source.is_core | False |
| locations[0].source.is_in_doaj | False |
| locations[0].source.display_name | arXiv (Cornell University) |
| locations[0].source.host_organization | https://openalex.org/I205783295 |
| locations[0].source.host_organization_name | Cornell University |
| locations[0].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[0].license | |
| locations[0].pdf_url | https://arxiv.org/pdf/2502.04878 |
| locations[0].version | submittedVersion |
| locations[0].raw_type | text |
| locations[0].license_id | |
| locations[0].is_accepted | False |
| locations[0].is_published | False |
| locations[0].raw_source_name | |
| locations[0].landing_page_url | http://arxiv.org/abs/2502.04878 |
| locations[1].id | doi:10.48550/arxiv.2502.04878 |
| locations[1].is_oa | True |
| locations[1].source.id | https://openalex.org/S4306400194 |
| locations[1].source.issn | |
| locations[1].source.type | repository |
| locations[1].source.is_oa | True |
| locations[1].source.issn_l | |
| locations[1].source.is_core | False |
| locations[1].source.is_in_doaj | False |
| locations[1].source.display_name | arXiv (Cornell University) |
| locations[1].source.host_organization | https://openalex.org/I205783295 |
| locations[1].source.host_organization_name | Cornell University |
| locations[1].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[1].license | cc-by |
| locations[1].pdf_url | |
| locations[1].version | |
| locations[1].raw_type | article |
| locations[1].license_id | https://openalex.org/licenses/cc-by |
| locations[1].is_accepted | False |
| locations[1].is_published | |
| locations[1].raw_source_name | |
| locations[1].landing_page_url | https://doi.org/10.48550/arxiv.2502.04878 |
| indexed_in | arxiv, datacite |
| authorships[0].author.id | https://openalex.org/A5092973376 |
| authorships[0].author.orcid | |
| authorships[0].author.display_name | Patrick Leask |
| authorships[0].author_position | first |
| authorships[0].raw_author_name | Leask, Patrick |
| authorships[0].is_corresponding | False |
| authorships[1].author.id | https://openalex.org/A5050202539 |
| authorships[1].author.orcid | |
| authorships[1].author.display_name | Bart Bussmann |
| authorships[1].author_position | middle |
| authorships[1].raw_author_name | Bussmann, Bart |
| authorships[1].is_corresponding | False |
| authorships[2].author.id | https://openalex.org/A5101375067 |
| authorships[2].author.orcid | |
| authorships[2].author.display_name | Michael Pearce |
| authorships[2].author_position | middle |
| authorships[2].raw_author_name | Pearce, Michael |
| authorships[2].is_corresponding | False |
| authorships[3].author.id | https://openalex.org/A5053271715 |
| authorships[3].author.orcid | https://orcid.org/0000-0002-3275-1103 |
| authorships[3].author.display_name | Joseph Bloom |
| authorships[3].author_position | middle |
| authorships[3].raw_author_name | Bloom, Joseph |
| authorships[3].is_corresponding | False |
| authorships[4].author.id | https://openalex.org/A5092031084 |
| authorships[4].author.orcid | |
| authorships[4].author.display_name | Curt Tigges |
| authorships[4].author_position | middle |
| authorships[4].raw_author_name | Tigges, Curt |
| authorships[4].is_corresponding | False |
| authorships[5].author.id | https://openalex.org/A5017619842 |
| authorships[5].author.orcid | https://orcid.org/0000-0001-8942-355X |
| authorships[5].author.display_name | Noura Al Moubayed |
| authorships[5].author_position | middle |
| authorships[5].raw_author_name | Moubayed, Noura Al |
| authorships[5].is_corresponding | False |
| authorships[6].author.id | https://openalex.org/A5002207803 |
| authorships[6].author.orcid | https://orcid.org/0009-0009-2137-6027 |
| authorships[6].author.display_name | Lee Sharkey |
| authorships[6].author_position | middle |
| authorships[6].raw_author_name | Sharkey, Lee |
| authorships[6].is_corresponding | False |
| authorships[7].author.id | https://openalex.org/A5081285345 |
| authorships[7].author.orcid | |
| authorships[7].author.display_name | Neel Nanda |
| authorships[7].author_position | last |
| authorships[7].raw_author_name | Nanda, Neel |
| authorships[7].is_corresponding | False |
| has_content.pdf | False |
| has_content.grobid_xml | False |
| is_paratext | False |
| open_access.is_oa | True |
| open_access.oa_url | https://arxiv.org/pdf/2502.04878 |
| open_access.oa_status | green |
| open_access.any_repository_has_fulltext | False |
| created_date | 2025-10-10T00:00:00 |
| display_name | Sparse Autoencoders Do Not Find Canonical Units of Analysis |
| has_fulltext | False |
| is_retracted | False |
| updated_date | 2025-11-06T06:51:31.235846 |
| primary_topic.id | https://openalex.org/T10775 |
| primary_topic.field.id | https://openalex.org/fields/17 |
| primary_topic.field.display_name | Computer Science |
| primary_topic.score | 0.4146000146865845 |
| primary_topic.domain.id | https://openalex.org/domains/3 |
| primary_topic.domain.display_name | Physical Sciences |
| primary_topic.subfield.id | https://openalex.org/subfields/1707 |
| primary_topic.subfield.display_name | Computer Vision and Pattern Recognition |
| primary_topic.display_name | Generative Adversarial Networks and Image Synthesis |
| related_works | https://openalex.org/W4391375266, https://openalex.org/W2899084033, https://openalex.org/W2748952813, https://openalex.org/W2390279801, https://openalex.org/W4391913857, https://openalex.org/W2358668433, https://openalex.org/W4396701345, https://openalex.org/W2376932109, https://openalex.org/W2033914206, https://openalex.org/W2042327336 |
| cited_by_count | 0 |
| locations_count | 2 |
| best_oa_location.id | pmh:oai:arXiv.org:2502.04878 |
| best_oa_location.is_oa | True |
| best_oa_location.source.id | https://openalex.org/S4306400194 |
| best_oa_location.source.issn | |
| best_oa_location.source.type | repository |
| best_oa_location.source.is_oa | True |
| best_oa_location.source.issn_l | |
| best_oa_location.source.is_core | False |
| best_oa_location.source.is_in_doaj | False |
| best_oa_location.source.display_name | arXiv (Cornell University) |
| best_oa_location.source.host_organization | https://openalex.org/I205783295 |
| best_oa_location.source.host_organization_name | Cornell University |
| best_oa_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| best_oa_location.license | |
| best_oa_location.pdf_url | https://arxiv.org/pdf/2502.04878 |
| best_oa_location.version | submittedVersion |
| best_oa_location.raw_type | text |
| best_oa_location.license_id | |
| best_oa_location.is_accepted | False |
| best_oa_location.is_published | False |
| best_oa_location.raw_source_name | |
| best_oa_location.landing_page_url | http://arxiv.org/abs/2502.04878 |
| primary_location.id | pmh:oai:arXiv.org:2502.04878 |
| primary_location.is_oa | True |
| primary_location.source.id | https://openalex.org/S4306400194 |
| primary_location.source.issn | |
| primary_location.source.type | repository |
| primary_location.source.is_oa | True |
| primary_location.source.issn_l | |
| primary_location.source.is_core | False |
| primary_location.source.is_in_doaj | False |
| primary_location.source.display_name | arXiv (Cornell University) |
| primary_location.source.host_organization | https://openalex.org/I205783295 |
| primary_location.source.host_organization_name | Cornell University |
| primary_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| primary_location.license | |
| primary_location.pdf_url | https://arxiv.org/pdf/2502.04878 |
| primary_location.version | submittedVersion |
| primary_location.raw_type | text |
| primary_location.license_id | |
| primary_location.is_accepted | False |
| primary_location.is_published | False |
| primary_location.raw_source_name | |
| primary_location.landing_page_url | http://arxiv.org/abs/2502.04878 |
| publication_date | 2025-02-07 |
| publication_year | 2025 |
| referenced_works_count | 0 |
| abstract_inverted_index.A | 0 |
| abstract_inverted_index.a | 29, 50, 55, 96, 100, 182, 200 |
| abstract_inverted_index.-- | 158, 168 |
| abstract_inverted_index.We | 63, 227, 251 |
| abstract_inverted_index.an | 253 |
| abstract_inverted_index.be | 46, 109, 224 |
| abstract_inverted_index.by | 22 |
| abstract_inverted_index.do | 214 |
| abstract_inverted_index.if | 212 |
| abstract_inverted_index.in | 36, 138, 173 |
| abstract_inverted_index.is | 6 |
| abstract_inverted_index.it | 39 |
| abstract_inverted_index.of | 3, 11, 18, 53, 60, 148, 153, 165, 179, 219 |
| abstract_inverted_index.on | 66, 161 |
| abstract_inverted_index.or | 92, 241 |
| abstract_inverted_index.to | 7, 48, 75, 82, 121, 248, 256 |
| abstract_inverted_index.we | 169 |
| abstract_inverted_index.SAE | 73, 88, 98, 107, 141, 167, 188, 245 |
| abstract_inverted_index.The | 146, 193 |
| abstract_inverted_index.and | 38, 57, 80, 130, 208 |
| abstract_inverted_index.are | 28, 78, 85, 190, 196 |
| abstract_inverted_index.can | 45, 108, 134 |
| abstract_inverted_index.for | 32, 237 |
| abstract_inverted_index.has | 40 |
| abstract_inverted_index.may | 222 |
| abstract_inverted_index.not | 86, 191, 215 |
| abstract_inverted_index.set | 52 |
| abstract_inverted_index.the | 9, 19, 23, 105, 122, 139, 162, 244 |
| abstract_inverted_index.two | 70, 112 |
| abstract_inverted_index.Even | 211 |
| abstract_inverted_index.SAE, | 124, 184 |
| abstract_inverted_index.SAEs | 159, 174, 213 |
| abstract_inverted_index.been | 41 |
| abstract_inverted_index.cast | 64 |
| abstract_inverted_index.e.g. | 199 |
| abstract_inverted_index.find | 49, 170, 216 |
| abstract_inverted_index.from | 95, 104, 181 |
| abstract_inverted_index.goal | 2 |
| abstract_inverted_index.have | 143 |
| abstract_inverted_index.into | 14, 99, 111, 177, 205 |
| abstract_inverted_index.list | 59 |
| abstract_inverted_index.one. | 102 |
| abstract_inverted_index.show | 76, 83 |
| abstract_inverted_index.size | 246 |
| abstract_inverted_index.such | 239 |
| abstract_inverted_index.that | 43, 142, 171, 186, 229 |
| abstract_inverted_index.they | 44, 77, 84, 126, 221 |
| abstract_inverted_index.this | 67 |
| abstract_inverted_index.used | 47 |
| abstract_inverted_index.when | 119 |
| abstract_inverted_index.LLMs, | 37 |
| abstract_inverted_index.SAEs. | 155 |
| abstract_inverted_index.Using | 156 |
| abstract_inverted_index.added | 120 |
| abstract_inverted_index.doubt | 65 |
| abstract_inverted_index.input | 20 |
| abstract_inverted_index.novel | 71, 128, 149 |
| abstract_inverted_index.often | 175, 197 |
| abstract_inverted_index.still | 223 |
| abstract_inverted_index.task. | 250 |
| abstract_inverted_index.their | 249 |
| abstract_inverted_index.these | 34 |
| abstract_inverted_index.units | 218 |
| abstract_inverted_index.using | 69 |
| abstract_inverted_index.which | 116, 133 |
| abstract_inverted_index.(SAEs) | 27 |
| abstract_inverted_index.Sparse | 25 |
| abstract_inverted_index.atomic | 61 |
| abstract_inverted_index.belief | 68 |
| abstract_inverted_index.choose | 243 |
| abstract_inverted_index.common | 1 |
| abstract_inverted_index.either | 233 |
| abstract_inverted_index.future | 230 |
| abstract_inverted_index.larger | 97, 106, 187 |
| abstract_inverted_index.latent | 201 |
| abstract_inverted_index.matrix | 164 |
| abstract_inverted_index.method | 31 |
| abstract_inverted_index.model. | 24 |
| abstract_inverted_index.neural | 12 |
| abstract_inverted_index.pursue | 234 |
| abstract_inverted_index.should | 232 |
| abstract_inverted_index.suited | 247 |
| abstract_inverted_index.tools. | 226 |
| abstract_inverted_index.unique | 56 |
| abstract_inverted_index.units, | 240 |
| abstract_inverted_index.units: | 54 |
| abstract_inverted_index.useful | 225 |
| abstract_inverted_index.Latents | 103 |
| abstract_inverted_index.another | 166 |
| abstract_inverted_index.atomic. | 87, 192 |
| abstract_inverted_index.capture | 127 |
| abstract_inverted_index.decoder | 163 |
| abstract_inverted_index.divided | 110 |
| abstract_inverted_index.explore | 257 |
| abstract_inverted_index.finding | 33 |
| abstract_inverted_index.improve | 117 |
| abstract_inverted_index.latents | 94, 137, 172, 180, 189 |
| abstract_inverted_index.popular | 30 |
| abstract_inverted_index.provide | 252 |
| abstract_inverted_index.replace | 135 |
| abstract_inverted_index.showing | 185 |
| abstract_inverted_index.similar | 144 |
| abstract_inverted_index.smaller | 101, 123, 140, 154, 183 |
| abstract_inverted_index.suggest | 228 |
| abstract_inverted_index.trained | 160 |
| abstract_inverted_index.``famous | 209 |
| abstract_inverted_index.complete | 58 |
| abstract_inverted_index.computed | 21 |
| abstract_inverted_index.features | 35, 150 |
| abstract_inverted_index.involves | 90 |
| abstract_inverted_index.networks | 13 |
| abstract_inverted_index.research | 231 |
| abstract_inverted_index.swapping | 93 |
| abstract_inverted_index.analysis, | 220 |
| abstract_inverted_index.behavior. | 145 |
| abstract_inverted_index.canonical | 217 |
| abstract_inverted_index.dashboard | 255 |
| abstract_inverted_index.decompose | 8, 176 |
| abstract_inverted_index.different | 235 |
| abstract_inverted_index.existence | 147 |
| abstract_inverted_index.features. | 62 |
| abstract_inverted_index.features: | 15 |
| abstract_inverted_index.indicates | 151 |
| abstract_inverted_index.inserting | 91 |
| abstract_inverted_index.latents}, | 115, 132 |
| abstract_inverted_index.meta-SAEs | 81, 157 |
| abstract_inverted_index.person''. | 210 |
| abstract_inverted_index.resulting | 194 |
| abstract_inverted_index.stitching | 74, 89 |
| abstract_inverted_index.approaches | 236 |
| abstract_inverted_index.decomposes | 204 |
| abstract_inverted_index.indicating | 125 |
| abstract_inverted_index.meta-SAEs: | 258 |
| abstract_inverted_index.postulated | 42 |
| abstract_inverted_index.properties | 17 |
| abstract_inverted_index.\emph{novel | 114 |
| abstract_inverted_index.activations | 10 |
| abstract_inverted_index.categories: | 113 |
| abstract_inverted_index.identifying | 238 |
| abstract_inverted_index.incomplete, | 79 |
| abstract_inverted_index.interactive | 254 |
| abstract_inverted_index.mechanistic | 4 |
| abstract_inverted_index.performance | 118 |
| abstract_inverted_index.techniques: | 72 |
| abstract_inverted_index.``Einstein'' | 203 |
| abstract_inverted_index.``Germany'', | 207 |
| abstract_inverted_index.autoencoders | 26 |
| abstract_inverted_index.combinations | 178 |
| abstract_inverted_index.information, | 129 |
| abstract_inverted_index.representing | 202 |
| abstract_inverted_index.corresponding | 136 |
| abstract_inverted_index.interpretable | 16 |
| abstract_inverted_index.pragmatically | 242 |
| abstract_inverted_index.``scientist'', | 206 |
| abstract_inverted_index.decompositions | 195 |
| abstract_inverted_index.incompleteness | 152 |
| abstract_inverted_index.interpretable; | 198 |
| abstract_inverted_index.interpretability | 5 |
| abstract_inverted_index.\textit{canonical} | 51 |
| abstract_inverted_index.\emph{reconstruction | 131 |
| abstract_inverted_index.https://metasaes.streamlit.app/ | 259 |
| cited_by_percentile_year | |
| countries_distinct_count | 0 |
| institutions_distinct_count | 8 |
| citation_normalized_percentile | |
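
The `abstract_inverted_index` rows in the payload above map each word of the abstract to the positions where it occurs. A minimal sketch of rebuilding the text from that structure follows. It assumes the index is available as a mapping from word to list of positions, as the flattened rows suggest. The function name `rebuild_abstract` is hypothetical, and the sample keeps only positions 0 to 8 from the rows above.

```python
def rebuild_abstract(inverted_index):
    """Rebuild running text from a word -> positions mapping."""
    by_position = {}
    for word, positions in inverted_index.items():
        for pos in positions:
            by_position[pos] = word
    # Emit words in position order, separated by single spaces.
    return " ".join(by_position[pos] for pos in sorted(by_position))

# Excerpt of the payload rows, truncated to the first nine positions.
sample_index = {
    "A": [0],
    "common": [1],
    "goal": [2],
    "of": [3],
    "mechanistic": [4],
    "interpretability": [5],
    "is": [6],
    "to": [7],
    "decompose": [8],
}

opening = rebuild_abstract(sample_index)
```

Applied to the full index, the same function would return the abstract with its original markup tokens (for example `\textit{canonical}`) intact, since the index stores words exactly as they appear in the source.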