CAVGAN: Unifying Jailbreak and Defense of LLMs via Generative Adversarial Attacks on their Internal Representations
2025 · Open Access
· DOI: https://doi.org/10.48550/arxiv.2507.06043
Security alignment protects a Large Language Model (LLM) against malicious queries, but a range of jailbreak attack methods expose the vulnerability of this mechanism. Previous studies have treated LLM jailbreak attacks and defenses in isolation. We analyze the security protection mechanism of the LLM and propose a framework that combines attack and defense. Our method builds on two observations: the intermediate-layer embeddings of an LLM are linearly separable, and a jailbreak attack, in essence, takes the embedding of a harmful query and moves it into the region the model treats as safe. We use a generative adversarial network (GAN) to learn the security judgment boundary inside the LLM, which enables both efficient jailbreak attack and defense. Experimental results show that our method achieves an average jailbreak success rate of 88.85% across three popular LLMs, while the defense success rate on the state-of-the-art jailbreak dataset averages 84.17%. These results validate the effectiveness of our approach and shed light on the internal security mechanisms of LLMs, offering new insights for enhancing model security. The code and data are available at https://github.com/NLPGM/CAVGAN.
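The linear-separability premise in the abstract can be illustrated with a toy linear probe. The following is a minimal sketch on synthetic Gaussian vectors, not the authors' CAVGAN code (which is in the repository linked above). The two clusters, their labels (0 for "benign-like", 1 for "harmful-like"), the dimension, and all function names are assumptions made for illustration only; no real model embeddings are involved.

```python
import math
import random

DIM = 16  # assumed embedding width, chosen only to keep the toy example fast


def make_embeddings(n, dim, shift, rng):
    """Draw n synthetic vectors from a unit Gaussian centred at `shift` on every axis."""
    return [[rng.gauss(shift, 1.0) for _ in range(dim)] for _ in range(n)]


def make_split(n_per_class, rng):
    """Build a shuffled two-class dataset: label 0 centred at -1, label 1 centred at +1."""
    xs = make_embeddings(n_per_class, DIM, -1.0, rng) + make_embeddings(n_per_class, DIM, 1.0, rng)
    ys = [0] * n_per_class + [1] * n_per_class
    pairs = list(zip(xs, ys))
    rng.shuffle(pairs)
    return [p[0] for p in pairs], [p[1] for p in pairs]


def train_linear_probe(xs, ys, epochs=20, lr=0.1):
    """Fit logistic-regression weights w and bias b with plain stochastic gradient descent."""
    w, b = [0.0] * len(xs[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            z = max(-30.0, min(30.0, z))  # clamp to keep exp() well inside float range
            grad = 1.0 / (1.0 + math.exp(-z)) - y  # d(log-loss)/dz
            w = [wi - lr * grad * xi for wi, xi in zip(w, x)]
            b -= lr * grad
    return w, b


def predict(w, b, x):
    """Return 1 if x falls on the positive side of the learned hyperplane, else 0."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0


def accuracy(w, b, xs, ys):
    """Fraction of vectors whose predicted side matches the label."""
    return sum(predict(w, b, x) == y for x, y in zip(xs, ys)) / len(xs)


rng = random.Random(0)
train_x, train_y = make_split(200, rng)
test_x, test_y = make_split(100, rng)
w, b = train_linear_probe(train_x, train_y)
probe_accuracy = accuracy(w, b, test_x, test_y)
print(f"held-out probe accuracy: {probe_accuracy:.3f}")
```

If a single hyperplane separates the two synthetic clusters on held-out data, the classes are linearly separable in this toy space. Whether the same holds for a given LLM layer is an empirical question that the paper, not this sketch, addresses.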
Related Topics
- Type: preprint
- Language: en
- Landing Page: http://arxiv.org/abs/2507.06043, https://arxiv.org/pdf/2507.06043
- OA Status: green
- OpenAlex ID: https://openalex.org/W4416062077
Raw OpenAlex JSON
- OpenAlex ID: https://openalex.org/W4416062077 (canonical identifier for this work in OpenAlex)
- DOI: https://doi.org/10.48550/arxiv.2507.06043 (Digital Object Identifier)
- Title: CAVGAN: Unifying Jailbreak and Defense of LLMs via Generative Adversarial Attacks on their Internal Representations (work title)
- Type: preprint (OpenAlex work type)
- Language: en (primary language)
- Publication year: 2025 (year of publication)
- Publication date: 2025-07-08 (full publication date if available)
- Authors: Xiaohu Li, Yunfeng Ning, Zepeng Bao, Mayi Xu, Jianhao Chen, Tieyun Qian (list of authors in order)
- Landing page: https://arxiv.org/abs/2507.06043 (publisher landing page)
- PDF URL: https://arxiv.org/pdf/2507.06043 (direct link to full text PDF)
- Open access: Yes (whether a free full text is available)
- OA status: green (open access status per OpenAlex)
- OA URL: https://arxiv.org/pdf/2507.06043 (direct OA link when available)
- Cited by: 0 (total citation count in OpenAlex)
Full payload
| id | https://openalex.org/W4416062077 |
|---|---|
| doi | https://doi.org/10.48550/arxiv.2507.06043 |
| ids.doi | https://doi.org/10.48550/arxiv.2507.06043 |
| ids.openalex | https://openalex.org/W4416062077 |
| fwci | |
| type | preprint |
| title | CAVGAN: Unifying Jailbreak and Defense of LLMs via Generative Adversarial Attacks on their Internal Representations |
| biblio.issue | |
| biblio.volume | |
| biblio.last_page | |
| biblio.first_page | |
| is_xpac | False |
| apc_list | |
| apc_paid | |
| language | en |
| locations[0].id | pmh:oai:arXiv.org:2507.06043 |
| locations[0].is_oa | True |
| locations[0].source.id | https://openalex.org/S4306400194 |
| locations[0].source.issn | |
| locations[0].source.type | repository |
| locations[0].source.is_oa | True |
| locations[0].source.issn_l | |
| locations[0].source.is_core | False |
| locations[0].source.is_in_doaj | False |
| locations[0].source.display_name | arXiv (Cornell University) |
| locations[0].source.host_organization | https://openalex.org/I205783295 |
| locations[0].source.host_organization_name | Cornell University |
| locations[0].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[0].license | |
| locations[0].pdf_url | https://arxiv.org/pdf/2507.06043 |
| locations[0].version | submittedVersion |
| locations[0].raw_type | text |
| locations[0].license_id | |
| locations[0].is_accepted | False |
| locations[0].is_published | False |
| locations[0].raw_source_name | |
| locations[0].landing_page_url | http://arxiv.org/abs/2507.06043 |
| locations[1].id | doi:10.48550/arxiv.2507.06043 |
| locations[1].is_oa | True |
| locations[1].source.id | https://openalex.org/S4306400194 |
| locations[1].source.issn | |
| locations[1].source.type | repository |
| locations[1].source.is_oa | True |
| locations[1].source.issn_l | |
| locations[1].source.is_core | False |
| locations[1].source.is_in_doaj | False |
| locations[1].source.display_name | arXiv (Cornell University) |
| locations[1].source.host_organization | https://openalex.org/I205783295 |
| locations[1].source.host_organization_name | Cornell University |
| locations[1].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[1].license | |
| locations[1].pdf_url | |
| locations[1].version | |
| locations[1].raw_type | article |
| locations[1].license_id | |
| locations[1].is_accepted | False |
| locations[1].is_published | |
| locations[1].raw_source_name | |
| locations[1].landing_page_url | https://doi.org/10.48550/arxiv.2507.06043 |
| indexed_in | arxiv, datacite |
| authorships[0].author.id | https://openalex.org/A5100631865 |
| authorships[0].author.orcid | https://orcid.org/0000-0001-8278-3878 |
| authorships[0].author.display_name | Xiaohu Li |
| authorships[0].author_position | first |
| authorships[0].raw_author_name | Li, Xiaohu |
| authorships[0].is_corresponding | False |
| authorships[1].author.id | https://openalex.org/A5119181444 |
| authorships[1].author.orcid | |
| authorships[1].author.display_name | Yunfeng Ning |
| authorships[1].author_position | middle |
| authorships[1].raw_author_name | Ning, Yunfeng |
| authorships[1].is_corresponding | False |
| authorships[2].author.id | https://openalex.org/A5119181445 |
| authorships[2].author.orcid | |
| authorships[2].author.display_name | Zepeng Bao |
| authorships[2].author_position | middle |
| authorships[2].raw_author_name | Bao, Zepeng |
| authorships[2].is_corresponding | False |
| authorships[3].author.id | https://openalex.org/A5022569658 |
| authorships[3].author.orcid | |
| authorships[3].author.display_name | Mayi Xu |
| authorships[3].author_position | middle |
| authorships[3].raw_author_name | Xu, Mayi |
| authorships[3].is_corresponding | False |
| authorships[4].author.id | https://openalex.org/A5006836653 |
| authorships[4].author.orcid | |
| authorships[4].author.display_name | Jianhao Chen |
| authorships[4].author_position | middle |
| authorships[4].raw_author_name | Chen, Jianhao |
| authorships[4].is_corresponding | False |
| authorships[5].author.id | https://openalex.org/A5040759280 |
| authorships[5].author.orcid | https://orcid.org/0000-0003-4667-5794 |
| authorships[5].author.display_name | Tieyun Qian |
| authorships[5].author_position | last |
| authorships[5].raw_author_name | Qian, Tieyun |
| authorships[5].is_corresponding | False |
| has_content.pdf | False |
| has_content.grobid_xml | False |
| is_paratext | False |
| open_access.is_oa | True |
| open_access.oa_url | https://arxiv.org/pdf/2507.06043 |
| open_access.oa_status | green |
| open_access.any_repository_has_fulltext | False |
| created_date | 2025-10-10T00:00:00 |
| display_name | CAVGAN: Unifying Jailbreak and Defense of LLMs via Generative Adversarial Attacks on their Internal Representations |
| has_fulltext | False |
| is_retracted | False |
| updated_date | 2025-11-28T10:25:07.422409 |
| primary_topic | |
| cited_by_count | 0 |
| locations_count | 2 |
| best_oa_location.id | pmh:oai:arXiv.org:2507.06043 |
| best_oa_location.is_oa | True |
| best_oa_location.source.id | https://openalex.org/S4306400194 |
| best_oa_location.source.issn | |
| best_oa_location.source.type | repository |
| best_oa_location.source.is_oa | True |
| best_oa_location.source.issn_l | |
| best_oa_location.source.is_core | False |
| best_oa_location.source.is_in_doaj | False |
| best_oa_location.source.display_name | arXiv (Cornell University) |
| best_oa_location.source.host_organization | https://openalex.org/I205783295 |
| best_oa_location.source.host_organization_name | Cornell University |
| best_oa_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| best_oa_location.license | |
| best_oa_location.pdf_url | https://arxiv.org/pdf/2507.06043 |
| best_oa_location.version | submittedVersion |
| best_oa_location.raw_type | text |
| best_oa_location.license_id | |
| best_oa_location.is_accepted | False |
| best_oa_location.is_published | False |
| best_oa_location.raw_source_name | |
| best_oa_location.landing_page_url | http://arxiv.org/abs/2507.06043 |
| primary_location.id | pmh:oai:arXiv.org:2507.06043 |
| primary_location.is_oa | True |
| primary_location.source.id | https://openalex.org/S4306400194 |
| primary_location.source.issn | |
| primary_location.source.type | repository |
| primary_location.source.is_oa | True |
| primary_location.source.issn_l | |
| primary_location.source.is_core | False |
| primary_location.source.is_in_doaj | False |
| primary_location.source.display_name | arXiv (Cornell University) |
| primary_location.source.host_organization | https://openalex.org/I205783295 |
| primary_location.source.host_organization_name | Cornell University |
| primary_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| primary_location.license | |
| primary_location.pdf_url | https://arxiv.org/pdf/2507.06043 |
| primary_location.version | submittedVersion |
| primary_location.raw_type | text |
| primary_location.license_id | |
| primary_location.is_accepted | False |
| primary_location.is_published | False |
| primary_location.raw_source_name | |
| primary_location.landing_page_url | http://arxiv.org/abs/2507.06043 |
| publication_date | 2025-07-08 |
| publication_year | 2025 |
| referenced_works_count | 0 |
| abstract_inverted_index.a | 47 |
| abstract_inverted_index.We | 36, 89 |
| abstract_inverted_index.an | 119, 141 |
| abstract_inverted_index.as | 68, 70 |
| abstract_inverted_index.at | 178 |
| abstract_inverted_index.is | 56 |
| abstract_inverted_index.of | 23, 42, 63, 73, 124, 143, 151, 163 |
| abstract_inverted_index.on | 58, 135, 158 |
| abstract_inverted_index.to | 8, 78, 85, 95, 104 |
| abstract_inverted_index.LLM | 31, 64, 103 |
| abstract_inverted_index.Our | 54 |
| abstract_inverted_index.The | 111, 172 |
| abstract_inverted_index.and | 34, 45, 52, 82, 109, 174 |
| abstract_inverted_index.are | 176 |
| abstract_inverted_index.but | 15, 154 |
| abstract_inverted_index.for | 168 |
| abstract_inverted_index.new | 166 |
| abstract_inverted_index.not | 146 |
| abstract_inverted_index.our | 116, 152 |
| abstract_inverted_index.the | 3, 10, 21, 38, 43, 59, 71, 86, 97, 102, 131, 136, 149, 159 |
| abstract_inverted_index.LLM, | 44 |
| abstract_inverted_index.This | 145 |
| abstract_inverted_index.aims | 77 |
| abstract_inverted_index.also | 155 |
| abstract_inverted_index.code | 173 |
| abstract_inverted_index.data | 175 |
| abstract_inverted_index.gain | 9 |
| abstract_inverted_index.have | 29 |
| abstract_inverted_index.only | 147 |
| abstract_inverted_index.rate | 123, 134 |
| abstract_inverted_index.safe | 87 |
| abstract_inverted_index.that | 49, 115 |
| abstract_inverted_index.them | 84 |
| abstract_inverted_index.this | 24 |
| abstract_inverted_index.well | 69 |
| abstract_inverted_index.(GAN) | 94 |
| abstract_inverted_index.(LLM) | 7 |
| abstract_inverted_index.LLMs, | 129, 164 |
| abstract_inverted_index.Large | 4 |
| abstract_inverted_index.Model | 6 |
| abstract_inverted_index.area. | 88 |
| abstract_inverted_index.based | 57 |
| abstract_inverted_index.embed | 79 |
| abstract_inverted_index.layer | 66 |
| abstract_inverted_index.learn | 96 |
| abstract_inverted_index.light | 157 |
| abstract_inverted_index.model | 170 |
| abstract_inverted_index.sheds | 156 |
| abstract_inverted_index.three | 127 |
| abstract_inverted_index.which | 76 |
| abstract_inverted_index.while | 130 |
| abstract_inverted_index.across | 126 |
| abstract_inverted_index.attack | 18, 51, 108 |
| abstract_inverted_index.inside | 101 |
| abstract_inverted_index.method | 55, 117 |
| abstract_inverted_index.reveal | 20 |
| abstract_inverted_index.88.85\% | 125 |
| abstract_inverted_index.achieve | 105 |
| abstract_inverted_index.against | 12 |
| abstract_inverted_index.analyze | 37 |
| abstract_inverted_index.attack, | 75 |
| abstract_inverted_index.attacks | 33 |
| abstract_inverted_index.average | 120, 142 |
| abstract_inverted_index.dataset | 139 |
| abstract_inverted_index.defense | 132 |
| abstract_inverted_index.enables | 2 |
| abstract_inverted_index.essence | 72 |
| abstract_inverted_index.harmful | 80 |
| abstract_inverted_index.methods | 19 |
| abstract_inverted_index.network | 93 |
| abstract_inverted_index.popular | 128 |
| abstract_inverted_index.propose | 46 |
| abstract_inverted_index.reaches | 140 |
| abstract_inverted_index.results | 113 |
| abstract_inverted_index.studies | 28 |
| abstract_inverted_index.success | 122, 133 |
| abstract_inverted_index.utilize | 90 |
| abstract_inverted_index.various | 16 |
| abstract_inverted_index.84.17\%. | 144 |
| abstract_inverted_index.Language | 5 |
| abstract_inverted_index.Previous | 27 |
| abstract_inverted_index.Security | 0 |
| abstract_inverted_index.achieves | 118 |
| abstract_inverted_index.approach | 153 |
| abstract_inverted_index.boundary | 100 |
| abstract_inverted_index.combines | 50 |
| abstract_inverted_index.defense. | 53, 110 |
| abstract_inverted_index.indicate | 114 |
| abstract_inverted_index.insights | 167 |
| abstract_inverted_index.internal | 160 |
| abstract_inverted_index.isolated | 30 |
| abstract_inverted_index.judgment | 99 |
| abstract_inverted_index.linearly | 60 |
| abstract_inverted_index.offering | 165 |
| abstract_inverted_index.problems | 81 |
| abstract_inverted_index.property | 62 |
| abstract_inverted_index.queries, | 14 |
| abstract_inverted_index.security | 25, 39, 98, 161, 171 |
| abstract_inverted_index.transfer | 83 |
| abstract_inverted_index.alignment | 1 |
| abstract_inverted_index.available | 177 |
| abstract_inverted_index.defenses. | 35 |
| abstract_inverted_index.efficient | 106 |
| abstract_inverted_index.enhancing | 169 |
| abstract_inverted_index.framework | 48 |
| abstract_inverted_index.jailbreak | 17, 32, 74, 107, 121, 138 |
| abstract_inverted_index.malicious | 13 |
| abstract_inverted_index.mechanism | 41 |
| abstract_inverted_index.separable | 61 |
| abstract_inverted_index.validates | 148 |
| abstract_inverted_index.embedding, | 67 |
| abstract_inverted_index.generative | 91 |
| abstract_inverted_index.mechanism. | 26 |
| abstract_inverted_index.mechanisms | 162 |
| abstract_inverted_index.protection | 11, 40 |
| abstract_inverted_index.adversarial | 92 |
| abstract_inverted_index.experimental | 112 |
| abstract_inverted_index.intermediate | 65 |
| abstract_inverted_index.effectiveness | 150 |
| abstract_inverted_index.vulnerability | 22 |
| abstract_inverted_index.state-of-the-art | 137 |
| abstract_inverted_index.https://github.com/NLPGM/CAVGAN. | 179 |
| cited_by_percentile_year | |
| countries_distinct_count | 0 |
| institutions_distinct_count | 6 |
| citation_normalized_percentile | |
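
The `abstract_inverted_index.*` rows above store the abstract as a mapping from each token to the positions where it occurs. A minimal sketch of turning such a mapping back into running text follows. The function name `rebuild_abstract` is hypothetical, and the sample dictionary holds only the first nine tokens, copied from the rows above, rather than the full payload.

```python
def rebuild_abstract(inverted_index):
    """Rebuild running text from a {token: [positions]} mapping."""
    by_position = {}
    for token, positions in inverted_index.items():
        for pos in positions:
            by_position[pos] = token
    # Join tokens in ascending position order; missing positions are simply skipped.
    return " ".join(by_position[pos] for pos in sorted(by_position))


# First nine tokens of this record's abstract (positions 0 to 8).
sample_index = {
    "Security": [0],
    "alignment": [1],
    "enables": [2],
    "the": [3],
    "Large": [4],
    "Language": [5],
    "Model": [6],
    "(LLM)": [7],
    "to": [8],
}

print(rebuild_abstract(sample_index))
# prints: Security alignment enables the Large Language Model (LLM) to
```

Tokens that occur more than once, such as `the` in the full table, carry several positions, so the loop writes the same token into each of its slots before the final join.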