Targeting Alignment: Extracting Safety Classifiers of Aligned LLMs
2025 · Open Access
· DOI: https://doi.org/10.48550/arxiv.2501.16534
Alignment in large language models (LLMs) is used to enforce guidelines such as safety. Yet, alignment fails in the face of jailbreak attacks that modify inputs to induce unsafe outputs. In this paper, we introduce and evaluate a new technique for jailbreak attacks. We observe that alignment embeds a safety classifier in the LLM responsible for deciding between refusal and compliance, and seek to extract an approximation of this classifier: a surrogate classifier. To this end, we build candidate classifiers from subsets of the LLM. We first evaluate the degree to which candidate classifiers approximate the LLM's safety classifier in benign and adversarial settings. Then, we attack the candidates and measure how well the resulting adversarial inputs transfer to the LLM. Our evaluation shows that the best candidates achieve accurate agreement (an F1 score above 80%) using as little as 20% of the model architecture. Further, we find that attacks mounted on the surrogate classifiers can be transferred to the LLM with high success. For example, a surrogate using only 50% of the Llama 2 model achieved an attack success rate (ASR) of 70% with half the memory footprint and runtime -- a substantial improvement over attacking the LLM directly, where we only observed a 22% ASR. These results show that extracting surrogate classifiers is an effective and efficient means for modeling (and therein addressing) the vulnerability of aligned models to jailbreaking attacks.
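The two metrics quoted in the abstract, F1 agreement between a surrogate and the LLM and attack success rate (ASR), can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function names `f1_agreement` and `attack_success_rate`, the label convention (1 = the prompt is treated as unsafe, i.e. refused; 0 = complied with), and the toy inputs are all assumptions made for illustration.

```python
from typing import Sequence


def f1_agreement(llm_labels: Sequence[int], surrogate_labels: Sequence[int]) -> float:
    """F1 score of surrogate decisions, treating the LLM's decisions as ground truth.

    Assumed convention (not from the paper): 1 = refuse / unsafe, 0 = comply / safe.
    """
    if len(llm_labels) != len(surrogate_labels):
        raise ValueError("label sequences must have the same length")
    pairs = list(zip(llm_labels, surrogate_labels))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)  # both refuse
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)  # surrogate refuses, LLM complies
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)  # surrogate complies, LLM refuses
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def attack_success_rate(outcomes: Sequence[bool]) -> float:
    """Fraction of adversarial inputs that induced compliance in the target model."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


# Toy data, purely illustrative.
llm = [1, 1, 0, 0, 1]
surrogate = [1, 0, 0, 1, 1]
agreement = f1_agreement(llm, surrogate)
asr = attack_success_rate([True, False, True, True])
```

Under this reading, the abstract's "F1 score above 80%" corresponds to `f1_agreement(...) > 0.8`, and the 70% versus 22% comparison is a comparison of two `attack_success_rate` values measured on the LLM.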
Related Topics
- Type: preprint
- Language: en
- Landing Page: http://arxiv.org/abs/2501.16534, https://arxiv.org/pdf/2501.16534
- OA Status: green
- Related Works: 10
- OpenAlex ID: https://openalex.org/W4406950075
Raw OpenAlex JSON
- OpenAlex ID: https://openalex.org/W4406950075 (canonical identifier for this work in OpenAlex)
- DOI: https://doi.org/10.48550/arxiv.2501.16534 (Digital Object Identifier)
- Title: Targeting Alignment: Extracting Safety Classifiers of Aligned LLMs (work title)
- Type: preprint (OpenAlex work type)
- Language: en (primary language)
- Publication year: 2025 (year of publication)
- Publication date: 2025-01-27 (full publication date if available)
- Authors: Jean-Charles Noirot Ferrand, Yohan Beugin, Eric Pauley, Ryan Sheatsley, Patrick McDaniel (list of authors in order)
- Landing page: https://arxiv.org/abs/2501.16534 (publisher landing page)
- PDF URL: https://arxiv.org/pdf/2501.16534 (direct link to full-text PDF)
- Open access: Yes (whether a free full text is available)
- OA status: green (open access status per OpenAlex)
- OA URL: https://arxiv.org/pdf/2501.16534 (direct OA link when available)
- Concepts: Artificial intelligence, Computer science, Computational biology, Biology (top concepts, i.e. fields/topics, attached by OpenAlex)
- Cited by: 0 (total citation count in OpenAlex)
- Related works (count): 10 (other works algorithmically related by OpenAlex)
Full payload
| Field | Value |
|---|---|
| id | https://openalex.org/W4406950075 |
| doi | https://doi.org/10.48550/arxiv.2501.16534 |
| ids.doi | https://doi.org/10.48550/arxiv.2501.16534 |
| ids.openalex | https://openalex.org/W4406950075 |
| fwci | |
| type | preprint |
| title | Targeting Alignment: Extracting Safety Classifiers of Aligned LLMs |
| biblio.issue | |
| biblio.volume | |
| biblio.last_page | |
| biblio.first_page | |
| topics[0].id | https://openalex.org/T10181 |
| topics[0].field.id | https://openalex.org/fields/17 |
| topics[0].field.display_name | Computer Science |
| topics[0].score | 0.7196999788284302 |
| topics[0].domain.id | https://openalex.org/domains/3 |
| topics[0].domain.display_name | Physical Sciences |
| topics[0].subfield.id | https://openalex.org/subfields/1702 |
| topics[0].subfield.display_name | Artificial Intelligence |
| topics[0].display_name | Natural Language Processing Techniques |
| topics[1].id | https://openalex.org/T14351 |
| topics[1].field.id | https://openalex.org/fields/17 |
| topics[1].field.display_name | Computer Science |
| topics[1].score | 0.7135999798774719 |
| topics[1].domain.id | https://openalex.org/domains/3 |
| topics[1].domain.display_name | Physical Sciences |
| topics[1].subfield.id | https://openalex.org/subfields/1702 |
| topics[1].subfield.display_name | Artificial Intelligence |
| topics[1].display_name | Statistical and Computational Modeling |
| topics[2].id | https://openalex.org/T13643 |
| topics[2].field.id | https://openalex.org/fields/33 |
| topics[2].field.display_name | Social Sciences |
| topics[2].score | 0.6944000124931335 |
| topics[2].domain.id | https://openalex.org/domains/2 |
| topics[2].domain.display_name | Social Sciences |
| topics[2].subfield.id | https://openalex.org/subfields/3320 |
| topics[2].subfield.display_name | Political Science and International Relations |
| topics[2].display_name | Artificial Intelligence in Law |
| is_xpac | False |
| apc_list | |
| apc_paid | |
| concepts[0].id | https://openalex.org/C154945302 |
| concepts[0].level | 1 |
| concepts[0].score | 0.3926886320114136 |
| concepts[0].wikidata | https://www.wikidata.org/wiki/Q11660 |
| concepts[0].display_name | Artificial intelligence |
| concepts[1].id | https://openalex.org/C41008148 |
| concepts[1].level | 0 |
| concepts[1].score | 0.3908020853996277 |
| concepts[1].wikidata | https://www.wikidata.org/wiki/Q21198 |
| concepts[1].display_name | Computer science |
| concepts[2].id | https://openalex.org/C70721500 |
| concepts[2].level | 1 |
| concepts[2].score | 0.32112643122673035 |
| concepts[2].wikidata | https://www.wikidata.org/wiki/Q177005 |
| concepts[2].display_name | Computational biology |
| concepts[3].id | https://openalex.org/C86803240 |
| concepts[3].level | 0 |
| concepts[3].score | 0.17553213238716125 |
| concepts[3].wikidata | https://www.wikidata.org/wiki/Q420 |
| concepts[3].display_name | Biology |
| keywords[0].id | https://openalex.org/keywords/artificial-intelligence |
| keywords[0].score | 0.3926886320114136 |
| keywords[0].display_name | Artificial intelligence |
| keywords[1].id | https://openalex.org/keywords/computer-science |
| keywords[1].score | 0.3908020853996277 |
| keywords[1].display_name | Computer science |
| keywords[2].id | https://openalex.org/keywords/computational-biology |
| keywords[2].score | 0.32112643122673035 |
| keywords[2].display_name | Computational biology |
| keywords[3].id | https://openalex.org/keywords/biology |
| keywords[3].score | 0.17553213238716125 |
| keywords[3].display_name | Biology |
| language | en |
| locations[0].id | pmh:oai:arXiv.org:2501.16534 |
| locations[0].is_oa | True |
| locations[0].source.id | https://openalex.org/S4306400194 |
| locations[0].source.issn | |
| locations[0].source.type | repository |
| locations[0].source.is_oa | True |
| locations[0].source.issn_l | |
| locations[0].source.is_core | False |
| locations[0].source.is_in_doaj | False |
| locations[0].source.display_name | arXiv (Cornell University) |
| locations[0].source.host_organization | https://openalex.org/I205783295 |
| locations[0].source.host_organization_name | Cornell University |
| locations[0].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[0].license | |
| locations[0].pdf_url | https://arxiv.org/pdf/2501.16534 |
| locations[0].version | submittedVersion |
| locations[0].raw_type | text |
| locations[0].license_id | |
| locations[0].is_accepted | False |
| locations[0].is_published | False |
| locations[0].raw_source_name | |
| locations[0].landing_page_url | http://arxiv.org/abs/2501.16534 |
| locations[1].id | doi:10.48550/arxiv.2501.16534 |
| locations[1].is_oa | True |
| locations[1].source.id | https://openalex.org/S4306400194 |
| locations[1].source.issn | |
| locations[1].source.type | repository |
| locations[1].source.is_oa | True |
| locations[1].source.issn_l | |
| locations[1].source.is_core | False |
| locations[1].source.is_in_doaj | False |
| locations[1].source.display_name | arXiv (Cornell University) |
| locations[1].source.host_organization | https://openalex.org/I205783295 |
| locations[1].source.host_organization_name | Cornell University |
| locations[1].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[1].license | |
| locations[1].pdf_url | |
| locations[1].version | |
| locations[1].raw_type | article |
| locations[1].license_id | |
| locations[1].is_accepted | False |
| locations[1].is_published | |
| locations[1].raw_source_name | |
| locations[1].landing_page_url | https://doi.org/10.48550/arxiv.2501.16534 |
| indexed_in | arxiv, datacite |
| authorships[0].author.id | https://openalex.org/A5110190353 |
| authorships[0].author.orcid | https://orcid.org/0009-0009-9650-4011 |
| authorships[0].author.display_name | Jean-Charles Noirot Ferrand |
| authorships[0].author_position | first |
| authorships[0].raw_author_name | Ferrand, Jean-Charles Noirot |
| authorships[0].is_corresponding | False |
| authorships[1].author.id | https://openalex.org/A5007771274 |
| authorships[1].author.orcid | https://orcid.org/0000-0003-0991-7926 |
| authorships[1].author.display_name | Yohan Beugin |
| authorships[1].author_position | middle |
| authorships[1].raw_author_name | Beugin, Yohan |
| authorships[1].is_corresponding | False |
| authorships[2].author.id | https://openalex.org/A5044742451 |
| authorships[2].author.orcid | |
| authorships[2].author.display_name | Eric Pauley |
| authorships[2].author_position | middle |
| authorships[2].raw_author_name | Pauley, Eric |
| authorships[2].is_corresponding | False |
| authorships[3].author.id | https://openalex.org/A5056794879 |
| authorships[3].author.orcid | https://orcid.org/0000-0001-8447-602X |
| authorships[3].author.display_name | Ryan Sheatsley |
| authorships[3].author_position | middle |
| authorships[3].raw_author_name | Sheatsley, Ryan |
| authorships[3].is_corresponding | False |
| authorships[4].author.id | https://openalex.org/A5055368149 |
| authorships[4].author.orcid | https://orcid.org/0000-0003-2091-7484 |
| authorships[4].author.display_name | Patrick McDaniel |
| authorships[4].author_position | last |
| authorships[4].raw_author_name | McDaniel, Patrick |
| authorships[4].is_corresponding | False |
| has_content.pdf | False |
| has_content.grobid_xml | False |
| is_paratext | False |
| open_access.is_oa | True |
| open_access.oa_url | https://arxiv.org/pdf/2501.16534 |
| open_access.oa_status | green |
| open_access.any_repository_has_fulltext | False |
| created_date | 2025-10-10T00:00:00 |
| display_name | Targeting Alignment: Extracting Safety Classifiers of Aligned LLMs |
| has_fulltext | False |
| is_retracted | False |
| updated_date | 2025-11-06T06:51:31.235846 |
| primary_topic.id | https://openalex.org/T10181 |
| primary_topic.field.id | https://openalex.org/fields/17 |
| primary_topic.field.display_name | Computer Science |
| primary_topic.score | 0.7196999788284302 |
| primary_topic.domain.id | https://openalex.org/domains/3 |
| primary_topic.domain.display_name | Physical Sciences |
| primary_topic.subfield.id | https://openalex.org/subfields/1702 |
| primary_topic.subfield.display_name | Artificial Intelligence |
| primary_topic.display_name | Natural Language Processing Techniques |
| related_works | https://openalex.org/W4391375266, https://openalex.org/W2899084033, https://openalex.org/W2748952813, https://openalex.org/W2390279801, https://openalex.org/W4391913857, https://openalex.org/W2358668433, https://openalex.org/W4396701345, https://openalex.org/W2376932109, https://openalex.org/W2001405890, https://openalex.org/W4396696052 |
| cited_by_count | 0 |
| locations_count | 2 |
| best_oa_location.id | pmh:oai:arXiv.org:2501.16534 |
| best_oa_location.is_oa | True |
| best_oa_location.source.id | https://openalex.org/S4306400194 |
| best_oa_location.source.issn | |
| best_oa_location.source.type | repository |
| best_oa_location.source.is_oa | True |
| best_oa_location.source.issn_l | |
| best_oa_location.source.is_core | False |
| best_oa_location.source.is_in_doaj | False |
| best_oa_location.source.display_name | arXiv (Cornell University) |
| best_oa_location.source.host_organization | https://openalex.org/I205783295 |
| best_oa_location.source.host_organization_name | Cornell University |
| best_oa_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| best_oa_location.license | |
| best_oa_location.pdf_url | https://arxiv.org/pdf/2501.16534 |
| best_oa_location.version | submittedVersion |
| best_oa_location.raw_type | text |
| best_oa_location.license_id | |
| best_oa_location.is_accepted | False |
| best_oa_location.is_published | False |
| best_oa_location.raw_source_name | |
| best_oa_location.landing_page_url | http://arxiv.org/abs/2501.16534 |
| primary_location.id | pmh:oai:arXiv.org:2501.16534 |
| primary_location.is_oa | True |
| primary_location.source.id | https://openalex.org/S4306400194 |
| primary_location.source.issn | |
| primary_location.source.type | repository |
| primary_location.source.is_oa | True |
| primary_location.source.issn_l | |
| primary_location.source.is_core | False |
| primary_location.source.is_in_doaj | False |
| primary_location.source.display_name | arXiv (Cornell University) |
| primary_location.source.host_organization | https://openalex.org/I205783295 |
| primary_location.source.host_organization_name | Cornell University |
| primary_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| primary_location.license | |
| primary_location.pdf_url | https://arxiv.org/pdf/2501.16534 |
| primary_location.version | submittedVersion |
| primary_location.raw_type | text |
| primary_location.license_id | |
| primary_location.is_accepted | False |
| primary_location.is_published | False |
| primary_location.raw_source_name | |
| primary_location.landing_page_url | http://arxiv.org/abs/2501.16534 |
| publication_date | 2025-01-27 |
| publication_year | 2025 |
| referenced_works_count | 0 |
| abstract_inverted_index.2 | 174 |
| abstract_inverted_index.a | 37, 48, 70, 166, 192, 204 |
| abstract_inverted_index.-- | 191 |
| abstract_inverted_index.F1 | 132 |
| abstract_inverted_index.In | 30 |
| abstract_inverted_index.To | 73 |
| abstract_inverted_index.We | 43, 85 |
| abstract_inverted_index.an | 65, 177, 215 |
| abstract_inverted_index.as | 12, 137, 139 |
| abstract_inverted_index.be | 156 |
| abstract_inverted_index.in | 1, 17, 51, 99 |
| abstract_inverted_index.is | 6, 214 |
| abstract_inverted_index.of | 20, 67, 82, 141, 171, 182, 227 |
| abstract_inverted_index.on | 151 |
| abstract_inverted_index.to | 8, 26, 63, 90, 118, 158, 230 |
| abstract_inverted_index.we | 33, 76, 105, 146, 201 |
| abstract_inverted_index.(an | 131 |
| abstract_inverted_index.20% | 140 |
| abstract_inverted_index.22% | 205 |
| abstract_inverted_index.50% | 170 |
| abstract_inverted_index.70% | 183 |
| abstract_inverted_index.For | 164 |
| abstract_inverted_index.LLM | 53, 160, 198 |
| abstract_inverted_index.Our | 121 |
| abstract_inverted_index.and | 35, 59, 61, 101, 109, 189, 217 |
| abstract_inverted_index.can | 155 |
| abstract_inverted_index.for | 40, 55, 220 |
| abstract_inverted_index.how | 111 |
| abstract_inverted_index.new | 38 |
| abstract_inverted_index.the | 18, 52, 83, 88, 95, 107, 113, 119, 125, 142, 152, 159, 172, 186, 197, 225 |
| abstract_inverted_index.(and | 222 |
| abstract_inverted_index.80%) | 135 |
| abstract_inverted_index.ASR. | 206 |
| abstract_inverted_index.LLM. | 84, 120 |
| abstract_inverted_index.Yet, | 14 |
| abstract_inverted_index.best | 126 |
| abstract_inverted_index.end, | 75 |
| abstract_inverted_index.face | 19 |
| abstract_inverted_index.find | 147 |
| abstract_inverted_index.from | 80 |
| abstract_inverted_index.half | 185 |
| abstract_inverted_index.high | 162 |
| abstract_inverted_index.only | 169, 202 |
| abstract_inverted_index.over | 195 |
| abstract_inverted_index.rate | 180 |
| abstract_inverted_index.seek | 62 |
| abstract_inverted_index.show | 209 |
| abstract_inverted_index.such | 11 |
| abstract_inverted_index.that | 23, 45, 124, 148, 210 |
| abstract_inverted_index.this | 31, 68, 74 |
| abstract_inverted_index.used | 7 |
| abstract_inverted_index.well | 112 |
| abstract_inverted_index.with | 161, 184 |
| abstract_inverted_index.(ASR) | 181 |
| abstract_inverted_index.LLM's | 96 |
| abstract_inverted_index.Llama | 173 |
| abstract_inverted_index.Then, | 104 |
| abstract_inverted_index.These | 207 |
| abstract_inverted_index.above | 134 |
| abstract_inverted_index.build | 77 |
| abstract_inverted_index.fails | 16 |
| abstract_inverted_index.first | 86 |
| abstract_inverted_index.large | 2 |
| abstract_inverted_index.means | 219 |
| abstract_inverted_index.model | 143, 175 |
| abstract_inverted_index.score | 133 |
| abstract_inverted_index.shows | 123 |
| abstract_inverted_index.using | 136, 168 |
| abstract_inverted_index.where | 200 |
| abstract_inverted_index.which | 91 |
| abstract_inverted_index.(LLMs) | 5 |
| abstract_inverted_index.attack | 106, 178 |
| abstract_inverted_index.benign | 100 |
| abstract_inverted_index.degree | 89 |
| abstract_inverted_index.embeds | 47 |
| abstract_inverted_index.induce | 27 |
| abstract_inverted_index.inputs | 25, 116 |
| abstract_inverted_index.little | 138 |
| abstract_inverted_index.memory | 187 |
| abstract_inverted_index.models | 4, 229 |
| abstract_inverted_index.modify | 24 |
| abstract_inverted_index.paper, | 32 |
| abstract_inverted_index.safety | 49, 97 |
| abstract_inverted_index.unsafe | 28 |
| abstract_inverted_index.achieve | 128 |
| abstract_inverted_index.aligned | 228 |
| abstract_inverted_index.attacks | 22, 149 |
| abstract_inverted_index.between | 57 |
| abstract_inverted_index.enforce | 9 |
| abstract_inverted_index.extract | 64 |
| abstract_inverted_index.measure | 110 |
| abstract_inverted_index.mounted | 150 |
| abstract_inverted_index.observe | 44 |
| abstract_inverted_index.refusal | 58 |
| abstract_inverted_index.results | 208 |
| abstract_inverted_index.runtime | 190 |
| abstract_inverted_index.safety. | 13 |
| abstract_inverted_index.subsets | 81 |
| abstract_inverted_index.success | 179 |
| abstract_inverted_index.therein | 223 |
| abstract_inverted_index.Further, | 145 |
| abstract_inverted_index.accurate | 129 |
| abstract_inverted_index.achieved | 176 |
| abstract_inverted_index.attacks. | 42, 232 |
| abstract_inverted_index.deciding | 56 |
| abstract_inverted_index.evaluate | 36, 87 |
| abstract_inverted_index.example, | 165 |
| abstract_inverted_index.language | 3 |
| abstract_inverted_index.modeling | 221 |
| abstract_inverted_index.observed | 203 |
| abstract_inverted_index.outputs. | 29 |
| abstract_inverted_index.success. | 163 |
| abstract_inverted_index.transfer | 117 |
| abstract_inverted_index.Alignment | 0 |
| abstract_inverted_index.agreement | 130 |
| abstract_inverted_index.alignment | 15, 46 |
| abstract_inverted_index.attacking | 196 |
| abstract_inverted_index.candidate | 78, 92 |
| abstract_inverted_index.directly, | 199 |
| abstract_inverted_index.effective | 216 |
| abstract_inverted_index.efficient | 218 |
| abstract_inverted_index.footprint | 188 |
| abstract_inverted_index.introduce | 34 |
| abstract_inverted_index.jailbreak | 21, 41 |
| abstract_inverted_index.resulting | 114 |
| abstract_inverted_index.settings. | 103 |
| abstract_inverted_index.surrogate | 71, 153, 167, 212 |
| abstract_inverted_index.technique | 39 |
| abstract_inverted_index.candidates | 108, 127 |
| abstract_inverted_index.classifier | 50, 98 |
| abstract_inverted_index.evaluation | 122 |
| abstract_inverted_index.extracting | 211 |
| abstract_inverted_index.guidelines | 10 |
| abstract_inverted_index.addressing) | 224 |
| abstract_inverted_index.adversarial | 102, 115 |
| abstract_inverted_index.approximate | 94 |
| abstract_inverted_index.classifier. | 72 |
| abstract_inverted_index.classifier: | 69 |
| abstract_inverted_index.classifiers | 79, 93, 154, 213 |
| abstract_inverted_index.compliance, | 60 |
| abstract_inverted_index.improvement | 194 |
| abstract_inverted_index.responsible | 54 |
| abstract_inverted_index.substantial | 193 |
| abstract_inverted_index.transferred | 157 |
| abstract_inverted_index.jailbreaking | 231 |
| abstract_inverted_index.approximation | 66 |
| abstract_inverted_index.architecture. | 144 |
| abstract_inverted_index.vulnerability | 226 |
| cited_by_percentile_year | |
| countries_distinct_count | 0 |
| institutions_distinct_count | 5 |
| citation_normalized_percentile |
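
The `abstract_inverted_index.*` rows in the table above map each token of the abstract to the word positions at which it occurs. A minimal sketch of turning such a mapping back into running text follows. The function names `parse_positions` and `rebuild_abstract` are hypothetical, and the sample dictionary deliberately uses only the first six positions taken from the rows above (so the position lists for `in` and `models` are truncated for illustration).

```python
from typing import Dict, List


def parse_positions(cell: str) -> List[int]:
    """Parse a table cell such as '37, 48, 70' into a list of integer positions."""
    return [int(part) for part in cell.split(",") if part.strip()]


def rebuild_abstract(inverted_index: Dict[str, List[int]]) -> str:
    """Rebuild text from a token -> positions mapping by placing each token at its positions."""
    by_position: Dict[int, str] = {}
    for token, positions in inverted_index.items():
        for pos in positions:
            by_position[pos] = token
    # Emit tokens in ascending position order, separated by single spaces.
    return " ".join(by_position[pos] for pos in sorted(by_position))


# Subset of the rows above, restricted to positions 0-5 (illustrative only).
sample = {
    "Alignment": parse_positions("0"),
    "in": parse_positions("1"),
    "large": parse_positions("2"),
    "language": parse_positions("3"),
    "models": parse_positions("4"),
    "(LLMs)": parse_positions("5"),
}
text = rebuild_abstract(sample)
```

Applied to the full set of rows, the same procedure should reproduce the abstract word for word, since punctuation stays attached to its token (for example `LLM.` and `LLM` are separate entries).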