CAVGAN: Unifying Jailbreak and Defense of LLMs via Generative Adversarial Attacks on their Internal Representations

Xiaohu Li, Yunfeng Ning, Zepeng Bao, Mayi Xu, Jianhao Chen, Tieyun Qian

2025 · Open Access · DOI: https://doi.org/10.48550/arxiv.2507.06043

Security alignment enables large language models (LLMs) to resist malicious queries, but a variety of jailbreak attack methods expose the vulnerability of this security mechanism. Previous studies have treated LLM jailbreak attacks and defenses in isolation. We analyze the security protection mechanism of the LLM and propose a framework that combines attack and defense. Our method builds on two observations: the intermediate-layer embeddings of an LLM are linearly separable, and a jailbreak attack in essence takes the embedding of a harmful query and moves it into the safe region. We use a generative adversarial network (GAN) to learn the security judgment boundary inside the LLM, which enables both efficient jailbreak attack and defense. Experimental results indicate that our method achieves an average jailbreak success rate of 88.85% across three popular LLMs, while the defense success rate on the state-of-the-art jailbreak dataset reaches an average of 84.17%. This not only validates the effectiveness of our approach but also sheds light on the internal security mechanisms of LLMs, offering new insights for enhancing model security. The code and data are available at https://github.com/NLPGM/CAVGAN.
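The abstract's premise that intermediate-layer embeddings are linearly separable can be illustrated with a toy linear probe. The sketch below is not the authors' method and does not touch a real LLM: it assumes two synthetic Gaussian clusters as stand-ins for "safe" and "harmful" hidden states, and every name in it (`make_synthetic_embeddings`, `train_perceptron`, `accuracy`) is hypothetical. It only shows what "linearly separable" means in practice: a single hyperplane classifies the two groups.

```python
import random


def make_synthetic_embeddings(n_per_class, dim, seed=0):
    """Build two Gaussian clusters that stand in for hidden states.

    Assumption: real embeddings would come from an LLM layer; here the
    classes differ only along the first coordinate.
    """
    rng = random.Random(seed)
    data = []
    for label, center in ((1, 2.0), (-1, -2.0)):
        for _ in range(n_per_class):
            vec = [rng.gauss(center if i == 0 else 0.0, 0.5) for i in range(dim)]
            data.append((vec, label))
    rng.shuffle(data)
    return [v for v, _ in data], [y for _, y in data]


def score(w, b, x):
    """Signed distance (up to scale) of x from the hyperplane w.x + b = 0."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def train_perceptron(vectors, labels, epochs=200):
    """Classic perceptron: update the weights only on misclassified points."""
    w = [0.0] * len(vectors[0])
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, y in zip(vectors, labels):
            if y * score(w, b, x) <= 0:
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
                mistakes += 1
        if mistakes == 0:  # a full pass with no errors: the data is separated
            break
    return w, b


def accuracy(w, b, vectors, labels):
    """Fraction of points that fall on the correct side of the hyperplane."""
    correct = sum(1 for x, y in zip(vectors, labels) if y * score(w, b, x) > 0)
    return correct / len(labels)


vectors, labels = make_synthetic_embeddings(n_per_class=100, dim=8)
w, b = train_perceptron(vectors, labels)
print("probe accuracy on synthetic data:", accuracy(w, b, vectors, labels))
```

In the setting the abstract describes, the boundary is learned by a GAN over actual model representations rather than by a perceptron over synthetic points; the probe here only conveys the geometric idea.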

Metadata

Type: preprint
Language: en
Landing Page: http://arxiv.org/abs/2507.06043
PDF: https://arxiv.org/pdf/2507.06043
OA Status: green
OpenAlex ID: https://openalex.org/W4416062077

All OpenAlex metadata

OpenAlex ID: https://openalex.org/W4416062077
Canonical identifier for this work in OpenAlex

DOI: https://doi.org/10.48550/arxiv.2507.06043
Digital Object Identifier

Title: CAVGAN: Unifying Jailbreak and Defense of LLMs via Generative Adversarial Attacks on their Internal Representations
Work title

Type: preprint
OpenAlex work type

Language: en
Primary language

Publication year: 2025
Year of publication

Publication date: 2025-07-08
Full publication date if available

Authors: Xiaohu Li, Yunfeng Ning, Zepeng Bao, Mayi Xu, Jianhao Chen, Tieyun Qian
List of authors in order

Landing page: https://arxiv.org/abs/2507.06043
Publisher landing page

PDF URL: https://arxiv.org/pdf/2507.06043
Direct link to full text PDF

Open access: Yes
Whether a free full text is available

OA status: green
Open access status per OpenAlex

OA URL: https://arxiv.org/pdf/2507.06043
Direct OA link when available

Cited by: 0
Total citation count in OpenAlex

Full payload

id	https://openalex.org/W4416062077
doi	https://doi.org/10.48550/arxiv.2507.06043
ids.doi	https://doi.org/10.48550/arxiv.2507.06043
ids.openalex	https://openalex.org/W4416062077
fwci
type	preprint
title	CAVGAN: Unifying Jailbreak and Defense of LLMs via Generative Adversarial Attacks on their Internal Representations
biblio.issue
biblio.volume
biblio.last_page
biblio.first_page
is_xpac	False
apc_list
apc_paid
language	en
locations[0].id	pmh:oai:arXiv.org:2507.06043
locations[0].is_oa	True
locations[0].source.id	https://openalex.org/S4306400194
locations[0].source.issn
locations[0].source.type	repository
locations[0].source.is_oa	True
locations[0].source.issn_l
locations[0].source.is_core	False
locations[0].source.is_in_doaj	False
locations[0].source.display_name	arXiv (Cornell University)
locations[0].source.host_organization	https://openalex.org/I205783295
locations[0].source.host_organization_name	Cornell University
locations[0].source.host_organization_lineage	https://openalex.org/I205783295
locations[0].license
locations[0].pdf_url	https://arxiv.org/pdf/2507.06043
locations[0].version	submittedVersion
locations[0].raw_type	text
locations[0].license_id
locations[0].is_accepted	False
locations[0].is_published	False
locations[0].raw_source_name
locations[0].landing_page_url	http://arxiv.org/abs/2507.06043
locations[1].id	doi:10.48550/arxiv.2507.06043
locations[1].is_oa	True
locations[1].source.id	https://openalex.org/S4306400194
locations[1].source.issn
locations[1].source.type	repository
locations[1].source.is_oa	True
locations[1].source.issn_l
locations[1].source.is_core	False
locations[1].source.is_in_doaj	False
locations[1].source.display_name	arXiv (Cornell University)
locations[1].source.host_organization	https://openalex.org/I205783295
locations[1].source.host_organization_name	Cornell University
locations[1].source.host_organization_lineage	https://openalex.org/I205783295
locations[1].license
locations[1].pdf_url
locations[1].version
locations[1].raw_type	article
locations[1].license_id
locations[1].is_accepted	False
locations[1].is_published
locations[1].raw_source_name
locations[1].landing_page_url	https://doi.org/10.48550/arxiv.2507.06043
indexed_in	arxiv, datacite
authorships[0].author.id	https://openalex.org/A5100631865
authorships[0].author.orcid	https://orcid.org/0000-0001-8278-3878
authorships[0].author.display_name	Xiaohu Li
authorships[0].author_position	first
authorships[0].raw_author_name	Li, Xiaohu
authorships[0].is_corresponding	False
authorships[1].author.id	https://openalex.org/A5119181444
authorships[1].author.orcid
authorships[1].author.display_name	Yunfeng Ning
authorships[1].author_position	middle
authorships[1].raw_author_name	Ning, Yunfeng
authorships[1].is_corresponding	False
authorships[2].author.id	https://openalex.org/A5119181445
authorships[2].author.orcid
authorships[2].author.display_name	Zepeng Bao
authorships[2].author_position	middle
authorships[2].raw_author_name	Bao, Zepeng
authorships[2].is_corresponding	False
authorships[3].author.id	https://openalex.org/A5022569658
authorships[3].author.orcid
authorships[3].author.display_name	Mayi Xu
authorships[3].author_position	middle
authorships[3].raw_author_name	Xu, Mayi
authorships[3].is_corresponding	False
authorships[4].author.id	https://openalex.org/A5006836653
authorships[4].author.orcid
authorships[4].author.display_name	Jianhao Chen
authorships[4].author_position	middle
authorships[4].raw_author_name	Chen, Jianhao
authorships[4].is_corresponding	False
authorships[5].author.id	https://openalex.org/A5040759280
authorships[5].author.orcid	https://orcid.org/0000-0003-4667-5794
authorships[5].author.display_name	Tieyun Qian
authorships[5].author_position	last
authorships[5].raw_author_name	Qian, Tieyun
authorships[5].is_corresponding	False
has_content.pdf	False
has_content.grobid_xml	False
is_paratext	False
open_access.is_oa	True
open_access.oa_url	https://arxiv.org/pdf/2507.06043
open_access.oa_status	green
open_access.any_repository_has_fulltext	False
created_date	2025-10-10T00:00:00
display_name	CAVGAN: Unifying Jailbreak and Defense of LLMs via Generative Adversarial Attacks on their Internal Representations
has_fulltext	False
is_retracted	False
updated_date	2025-11-28T10:25:07.422409
primary_topic
cited_by_count	0
locations_count	2
best_oa_location.id	pmh:oai:arXiv.org:2507.06043
best_oa_location.is_oa	True
best_oa_location.source.id	https://openalex.org/S4306400194
best_oa_location.source.issn
best_oa_location.source.type	repository
best_oa_location.source.is_oa	True
best_oa_location.source.issn_l
best_oa_location.source.is_core	False
best_oa_location.source.is_in_doaj	False
best_oa_location.source.display_name	arXiv (Cornell University)
best_oa_location.source.host_organization	https://openalex.org/I205783295
best_oa_location.source.host_organization_name	Cornell University
best_oa_location.source.host_organization_lineage	https://openalex.org/I205783295
best_oa_location.license
best_oa_location.pdf_url	https://arxiv.org/pdf/2507.06043
best_oa_location.version	submittedVersion
best_oa_location.raw_type	text
best_oa_location.license_id
best_oa_location.is_accepted	False
best_oa_location.is_published	False
best_oa_location.raw_source_name
best_oa_location.landing_page_url	http://arxiv.org/abs/2507.06043
primary_location.id	pmh:oai:arXiv.org:2507.06043
primary_location.is_oa	True
primary_location.source.id	https://openalex.org/S4306400194
primary_location.source.issn
primary_location.source.type	repository
primary_location.source.is_oa	True
primary_location.source.issn_l
primary_location.source.is_core	False
primary_location.source.is_in_doaj	False
primary_location.source.display_name	arXiv (Cornell University)
primary_location.source.host_organization	https://openalex.org/I205783295
primary_location.source.host_organization_name	Cornell University
primary_location.source.host_organization_lineage	https://openalex.org/I205783295
primary_location.license
primary_location.pdf_url	https://arxiv.org/pdf/2507.06043
primary_location.version	submittedVersion
primary_location.raw_type	text
primary_location.license_id
primary_location.is_accepted	False
primary_location.is_published	False
primary_location.raw_source_name
primary_location.landing_page_url	http://arxiv.org/abs/2507.06043
publication_date	2025-07-08
publication_year	2025
referenced_works_count	0
abstract_inverted_index.a	47
abstract_inverted_index.We	36, 89
abstract_inverted_index.an	119, 141
abstract_inverted_index.as	68, 70
abstract_inverted_index.at	178
abstract_inverted_index.is	56
abstract_inverted_index.of	23, 42, 63, 73, 124, 143, 151, 163
abstract_inverted_index.on	58, 135, 158
abstract_inverted_index.to	8, 78, 85, 95, 104
abstract_inverted_index.LLM	31, 64, 103
abstract_inverted_index.Our	54
abstract_inverted_index.The	111, 172
abstract_inverted_index.and	34, 45, 52, 82, 109, 174
abstract_inverted_index.are	176
abstract_inverted_index.but	15, 154
abstract_inverted_index.for	168
abstract_inverted_index.new	166
abstract_inverted_index.not	146
abstract_inverted_index.our	116, 152
abstract_inverted_index.the	3, 10, 21, 38, 43, 59, 71, 86, 97, 102, 131, 136, 149, 159
abstract_inverted_index.LLM,	44
abstract_inverted_index.This	145
abstract_inverted_index.aims	77
abstract_inverted_index.also	155
abstract_inverted_index.code	173
abstract_inverted_index.data	175
abstract_inverted_index.gain	9
abstract_inverted_index.have	29
abstract_inverted_index.only	147
abstract_inverted_index.rate	123, 134
abstract_inverted_index.safe	87
abstract_inverted_index.that	49, 115
abstract_inverted_index.them	84
abstract_inverted_index.this	24
abstract_inverted_index.well	69
abstract_inverted_index.(GAN)	94
abstract_inverted_index.(LLM)	7
abstract_inverted_index.LLMs,	129, 164
abstract_inverted_index.Large	4
abstract_inverted_index.Model	6
abstract_inverted_index.area.	88
abstract_inverted_index.based	57
abstract_inverted_index.embed	79
abstract_inverted_index.layer	66
abstract_inverted_index.learn	96
abstract_inverted_index.light	157
abstract_inverted_index.model	170
abstract_inverted_index.sheds	156
abstract_inverted_index.three	127
abstract_inverted_index.which	76
abstract_inverted_index.while	130
abstract_inverted_index.across	126
abstract_inverted_index.attack	18, 51, 108
abstract_inverted_index.inside	101
abstract_inverted_index.method	55, 117
abstract_inverted_index.reveal	20
abstract_inverted_index.88.85\%	125
abstract_inverted_index.achieve	105
abstract_inverted_index.against	12
abstract_inverted_index.analyze	37
abstract_inverted_index.attack,	75
abstract_inverted_index.attacks	33
abstract_inverted_index.average	120, 142
abstract_inverted_index.dataset	139
abstract_inverted_index.defense	132
abstract_inverted_index.enables	2
abstract_inverted_index.essence	72
abstract_inverted_index.harmful	80
abstract_inverted_index.methods	19
abstract_inverted_index.network	93
abstract_inverted_index.popular	128
abstract_inverted_index.propose	46
abstract_inverted_index.reaches	140
abstract_inverted_index.results	113
abstract_inverted_index.studies	28
abstract_inverted_index.success	122, 133
abstract_inverted_index.utilize	90
abstract_inverted_index.various	16
abstract_inverted_index.84.17\%.	144
abstract_inverted_index.Language	5
abstract_inverted_index.Previous	27
abstract_inverted_index.Security	0
abstract_inverted_index.achieves	118
abstract_inverted_index.approach	153
abstract_inverted_index.boundary	100
abstract_inverted_index.combines	50
abstract_inverted_index.defense.	53, 110
abstract_inverted_index.indicate	114
abstract_inverted_index.insights	167
abstract_inverted_index.internal	160
abstract_inverted_index.isolated	30
abstract_inverted_index.judgment	99
abstract_inverted_index.linearly	60
abstract_inverted_index.offering	165
abstract_inverted_index.problems	81
abstract_inverted_index.property	62
abstract_inverted_index.queries,	14
abstract_inverted_index.security	25, 39, 98, 161, 171
abstract_inverted_index.transfer	83
abstract_inverted_index.alignment	1
abstract_inverted_index.available	177
abstract_inverted_index.defenses.	35
abstract_inverted_index.efficient	106
abstract_inverted_index.enhancing	169
abstract_inverted_index.framework	48
abstract_inverted_index.jailbreak	17, 32, 74, 107, 121, 138
abstract_inverted_index.malicious	13
abstract_inverted_index.mechanism	41
abstract_inverted_index.separable	61
abstract_inverted_index.validates	148
abstract_inverted_index.embedding,	67
abstract_inverted_index.generative	91
abstract_inverted_index.mechanism.	26
abstract_inverted_index.mechanisms	162
abstract_inverted_index.protection	11, 40
abstract_inverted_index.adversarial	92
abstract_inverted_index.experimental	112
abstract_inverted_index.intermediate	65
abstract_inverted_index.effectiveness	150
abstract_inverted_index.vulnerability	22
abstract_inverted_index.state-of-the-art	137
abstract_inverted_index.https://github.com/NLPGM/CAVGAN.	179
cited_by_percentile_year
countries_distinct_count	0
institutions_distinct_count	6
citation_normalized_percentile
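
The `abstract_inverted_index.*` rows in the payload above store the abstract as a map from each token to the word positions where it occurs. A minimal sketch of turning such rows back into running text follows. It assumes each row is a key and a comma-separated position list joined by a tab, as in this dump; the function names `parse_index_lines` and `rebuild_abstract` are hypothetical, and the sample is limited to the rows needed for the first 15 tokens (so the position lists for "the" and "protection" are cut short).

```python
PREFIX = "abstract_inverted_index."


def parse_index_lines(lines):
    """Parse 'abstract_inverted_index.<token>\\t<positions>' rows into a dict.

    The prefix is stripped rather than splitting on '.', because tokens
    themselves may contain dots (URLs, sentence-final words).
    """
    index = {}
    for line in lines:
        key, _, value = line.partition("\t")
        if not key.startswith(PREFIX) or not value.strip():
            continue  # skip unrelated payload rows and empty values
        token = key[len(PREFIX):]
        index[token] = [int(p) for p in value.split(",")]
    return index


def rebuild_abstract(index):
    """Place every token at each of its positions and join in order."""
    by_position = {}
    for token, positions in index.items():
        for pos in positions:
            by_position[pos] = token
    return " ".join(by_position[pos] for pos in sorted(by_position))


# Sample rows copied from the payload, truncated to positions 0 through 14.
sample_rows = [
    "abstract_inverted_index.to\t8",
    "abstract_inverted_index.the\t3, 10",
    "abstract_inverted_index.gain\t9",
    "abstract_inverted_index.(LLM)\t7",
    "abstract_inverted_index.Large\t4",
    "abstract_inverted_index.Model\t6",
    "abstract_inverted_index.against\t12",
    "abstract_inverted_index.enables\t2",
    "abstract_inverted_index.Language\t5",
    "abstract_inverted_index.Security\t0",
    "abstract_inverted_index.queries,\t14",
    "abstract_inverted_index.alignment\t1",
    "abstract_inverted_index.malicious\t13",
    "abstract_inverted_index.protection\t11",
]

print(rebuild_abstract(parse_index_lines(sample_rows)))
```

Feeding all of the `abstract_inverted_index.*` rows through the same two functions should reproduce the abstract as indexed, including the escaped `\%` tokens.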