Rethinking Deep Alignment Through The Lens Of Incomplete Learning

2025 · Open Access · DOI: https://doi.org/10.48550/arxiv.2511.12155

Large language models remain systematically vulnerable to adversarial attacks despite extensive safety alignment. We provide a mechanistic analysis showing that position-dependent gradient weakening during autoregressive training causes signal decay, which leads to incomplete safety learning: safety training fails to fully transform model preferences in later response regions. We introduce base-favored tokens (vocabulary elements to which base models assign higher probability than aligned models do) as computational indicators of incomplete safety learning, and we develop a targeted completion method that addresses undertrained regions through adaptive penalties and hybrid teacher distillation. Experimental evaluation across the Llama and Qwen model families shows 48-98% reductions in attack success rates while preserving general capabilities. These results provide both a mechanistic understanding of, and practical solutions for, fundamental limitations in safety alignment methodologies.
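The definition of base-favored tokens in the abstract can be illustrated with a minimal sketch. It assumes per-token log-probabilities from a base model and an aligned model are already available for the same response position. The function name `base_favored_tokens`, the `margin` parameter, and the toy distributions are hypothetical and are not taken from the paper, whose exact selection criterion may differ.

```python
import math


def base_favored_tokens(base_logprobs, aligned_logprobs, margin=0.0):
    """Return tokens the base model prefers over the aligned model.

    Both arguments map token strings to log-probabilities at one response
    position. `margin` is a hypothetical threshold on the log-probability
    gap; the paper's actual criterion is not given in the abstract.
    """
    favored = {}
    for token, base_lp in base_logprobs.items():
        aligned_lp = aligned_logprobs.get(token)
        if aligned_lp is None:
            continue  # token missing from the aligned distribution
        gap = base_lp - aligned_lp
        if gap > margin:
            favored[token] = gap
    # Largest gap first: the most strongly base-favored tokens lead.
    return dict(sorted(favored.items(), key=lambda kv: kv[1], reverse=True))


# Toy next-token distributions (made up for illustration only).
base = {"Sure": math.log(0.5), "Sorry": math.log(0.1), "I": math.log(0.4)}
aligned = {"Sure": math.log(0.05), "Sorry": math.log(0.7), "I": math.log(0.25)}

favored = base_favored_tokens(base, aligned)
print(list(favored))
```

In this toy case "Sure" and "I" are flagged because the base model assigns them more probability than the aligned model does, while "Sorry" is not.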

Metadata

Type: preprint
Landing Page: http://arxiv.org/abs/2511.12155
PDF: https://arxiv.org/pdf/2511.12155
OA Status: green
OpenAlex ID: https://openalex.org/W4416353948

All OpenAlex metadata

OpenAlex ID: https://openalex.org/W4416353948
Canonical identifier for this work in OpenAlex

DOI: https://doi.org/10.48550/arxiv.2511.12155
Digital Object Identifier

Title: Rethinking Deep Alignment Through The Lens Of Incomplete Learning
Work title

Type: preprint
OpenAlex work type

Publication year: 2025
Year of publication

Publication date: 2025-11-15
Full publication date if available

Authors: Thong Bach, Truyen Tran
List of authors in order

Landing page: https://arxiv.org/abs/2511.12155
Publisher landing page

PDF URL: https://arxiv.org/pdf/2511.12155
Direct link to full text PDF

Open access: Yes
Whether a free full text is available

OA status: green
Open access status per OpenAlex

OA URL: https://arxiv.org/pdf/2511.12155
Direct OA link when available

Cited by: 0
Total citation count in OpenAlex

Full payload

id	https://openalex.org/W4416353948
doi	https://doi.org/10.48550/arxiv.2511.12155
ids.doi	https://doi.org/10.48550/arxiv.2511.12155
ids.openalex	https://openalex.org/W4416353948
fwci
type	preprint
title	Rethinking Deep Alignment Through The Lens Of Incomplete Learning
biblio.issue
biblio.volume
biblio.last_page
biblio.first_page
is_xpac	False
apc_list
apc_paid
language
locations[0].id	pmh:oai:arXiv.org:2511.12155
locations[0].is_oa	True
locations[0].source.id	https://openalex.org/S4306400194
locations[0].source.issn
locations[0].source.type	repository
locations[0].source.is_oa	True
locations[0].source.issn_l
locations[0].source.is_core	False
locations[0].source.is_in_doaj	False
locations[0].source.display_name	arXiv (Cornell University)
locations[0].source.host_organization	https://openalex.org/I205783295
locations[0].source.host_organization_name	Cornell University
locations[0].source.host_organization_lineage	https://openalex.org/I205783295
locations[0].license
locations[0].pdf_url	https://arxiv.org/pdf/2511.12155
locations[0].version	submittedVersion
locations[0].raw_type	text
locations[0].license_id
locations[0].is_accepted	False
locations[0].is_published	False
locations[0].raw_source_name
locations[0].landing_page_url	http://arxiv.org/abs/2511.12155
locations[1].id	doi:10.48550/arxiv.2511.12155
locations[1].is_oa	True
locations[1].source.id	https://openalex.org/S4306400194
locations[1].source.issn
locations[1].source.type	repository
locations[1].source.is_oa	True
locations[1].source.issn_l
locations[1].source.is_core	False
locations[1].source.is_in_doaj	False
locations[1].source.display_name	arXiv (Cornell University)
locations[1].source.host_organization	https://openalex.org/I205783295
locations[1].source.host_organization_name	Cornell University
locations[1].source.host_organization_lineage	https://openalex.org/I205783295
locations[1].license	cc-by
locations[1].pdf_url
locations[1].version
locations[1].raw_type	article
locations[1].license_id	https://openalex.org/licenses/cc-by
locations[1].is_accepted	False
locations[1].is_published
locations[1].raw_source_name
locations[1].landing_page_url	https://doi.org/10.48550/arxiv.2511.12155
indexed_in	arxiv, datacite
authorships[0].author.id	https://openalex.org/A5065706587
authorships[0].author.orcid
authorships[0].author.display_name	Thong Bach
authorships[0].author_position	first
authorships[0].raw_author_name	Bach, Thong
authorships[0].is_corresponding	False
authorships[1].author.id	https://openalex.org/A5085471517
authorships[1].author.orcid	https://orcid.org/0000-0001-6531-8907
authorships[1].author.display_name	Truyen Tran
authorships[1].author_position	middle
authorships[1].raw_author_name	Tran, Truyen
authorships[1].is_corresponding	False
has_content.pdf	False
has_content.grobid_xml	False
is_paratext	False
open_access.is_oa	True
open_access.oa_url	https://arxiv.org/pdf/2511.12155
open_access.oa_status	green
open_access.any_repository_has_fulltext	False
created_date	2025-11-19T00:00:00
display_name	Rethinking Deep Alignment Through The Lens Of Incomplete Learning
has_fulltext	False
is_retracted	False
updated_date	2025-11-28T11:58:38.604648
primary_topic
cited_by_count	0
locations_count	2
best_oa_location.id	pmh:oai:arXiv.org:2511.12155
best_oa_location.is_oa	True
best_oa_location.source.id	https://openalex.org/S4306400194
best_oa_location.source.issn
best_oa_location.source.type	repository
best_oa_location.source.is_oa	True
best_oa_location.source.issn_l
best_oa_location.source.is_core	False
best_oa_location.source.is_in_doaj	False
best_oa_location.source.display_name	arXiv (Cornell University)
best_oa_location.source.host_organization	https://openalex.org/I205783295
best_oa_location.source.host_organization_name	Cornell University
best_oa_location.source.host_organization_lineage	https://openalex.org/I205783295
best_oa_location.license
best_oa_location.pdf_url	https://arxiv.org/pdf/2511.12155
best_oa_location.version	submittedVersion
best_oa_location.raw_type	text
best_oa_location.license_id
best_oa_location.is_accepted	False
best_oa_location.is_published	False
best_oa_location.raw_source_name
best_oa_location.landing_page_url	http://arxiv.org/abs/2511.12155
primary_location.id	pmh:oai:arXiv.org:2511.12155
primary_location.is_oa	True
primary_location.source.id	https://openalex.org/S4306400194
primary_location.source.issn
primary_location.source.type	repository
primary_location.source.is_oa	True
primary_location.source.issn_l
primary_location.source.is_core	False
primary_location.source.is_in_doaj	False
primary_location.source.display_name	arXiv (Cornell University)
primary_location.source.host_organization	https://openalex.org/I205783295
primary_location.source.host_organization_name	Cornell University
primary_location.source.host_organization_lineage	https://openalex.org/I205783295
primary_location.license
primary_location.pdf_url	https://arxiv.org/pdf/2511.12155
primary_location.version	submittedVersion
primary_location.raw_type	text
primary_location.license_id
primary_location.is_accepted	False
primary_location.is_published	False
primary_location.raw_source_name
primary_location.landing_page_url	http://arxiv.org/abs/2511.12155
publication_date	2025-11-15
publication_year	2025
referenced_works_count	0
abstract_inverted_index.a	15, 73, 117
abstract_inverted_index.--	51, 63
abstract_inverted_index.We	13, 47
abstract_inverted_index.as	64
abstract_inverted_index.in	42, 99, 105, 126
abstract_inverted_index.of	67
abstract_inverted_index.to	6, 30, 38
abstract_inverted_index.and	71, 84, 92, 120
abstract_inverted_index.for	123
abstract_inverted_index.Qwen	93
abstract_inverted_index.base	55
abstract_inverted_index.both	116
abstract_inverted_index.than	60
abstract_inverted_index.that	19, 77
abstract_inverted_index.with	102
abstract_inverted_index.Large	0
abstract_inverted_index.Llama	91
abstract_inverted_index.These	113
abstract_inverted_index.fails	37
abstract_inverted_index.later	43
abstract_inverted_index.model	40, 94
abstract_inverted_index.rates	108
abstract_inverted_index.where	34, 54
abstract_inverted_index.while	109
abstract_inverted_index.across	90
abstract_inverted_index.assign	57
abstract_inverted_index.attack	106
abstract_inverted_index.decay,	28
abstract_inverted_index.during	23
abstract_inverted_index.fully.	46
abstract_inverted_index.higher	58
abstract_inverted_index.hybrid	85
abstract_inverted_index.method	76
abstract_inverted_index.models	2, 56, 62
abstract_inverted_index.safety	11, 32, 35, 69, 127
abstract_inverted_index.signal	27
abstract_inverted_index.tokens	50
abstract_inverted_index.48--98%	103
abstract_inverted_index.aligned	61
abstract_inverted_index.attacks	8
abstract_inverted_index.creates	26
abstract_inverted_index.despite	9
abstract_inverted_index.develop	72
abstract_inverted_index.exhibit	3
abstract_inverted_index.general	111
abstract_inverted_index.leading	29
abstract_inverted_index.provide	14
abstract_inverted_index.regions	45, 80
abstract_inverted_index.results	114
abstract_inverted_index.success	107
abstract_inverted_index.teacher	86
abstract_inverted_index.through	81
abstract_inverted_index.adaptive	82
abstract_inverted_index.analysis	17
abstract_inverted_index.dramatic	97
abstract_inverted_index.elements	53
abstract_inverted_index.families	95
abstract_inverted_index.gradient	21
abstract_inverted_index.language	1
abstract_inverted_index.learning	33, 70
abstract_inverted_index.response	44
abstract_inverted_index.targeted	74
abstract_inverted_index.training	25, 36
abstract_inverted_index.addresses	78
abstract_inverted_index.alignment	128
abstract_inverted_index.establish	115
abstract_inverted_index.extensive	10
abstract_inverted_index.introduce	48
abstract_inverted_index.penalties	83
abstract_inverted_index.practical	121
abstract_inverted_index.revealing	18
abstract_inverted_index.solutions	122
abstract_inverted_index.transform	39
abstract_inverted_index.weakening	22
abstract_inverted_index.alignment.	12
abstract_inverted_index.completion	75
abstract_inverted_index.evaluation	89
abstract_inverted_index.incomplete	31, 68
abstract_inverted_index.indicators	66
abstract_inverted_index.preserving	110
abstract_inverted_index.reductions	104
abstract_inverted_index.systematic	4
abstract_inverted_index.vocabulary	52
abstract_inverted_index.adversarial	7, 100
abstract_inverted_index.fundamental	124
abstract_inverted_index.limitations	125
abstract_inverted_index.mechanistic	16, 118
abstract_inverted_index.preferences	41
abstract_inverted_index.probability	59
abstract_inverted_index.robustness,	101
abstract_inverted_index.Experimental	88
abstract_inverted_index.base-favored	49
abstract_inverted_index.demonstrates	96
abstract_inverted_index.improvements	98
abstract_inverted_index.undertrained	79
abstract_inverted_index.capabilities.	112
abstract_inverted_index.computational	65
abstract_inverted_index.distillation.	87
abstract_inverted_index.understanding	119
abstract_inverted_index.autoregressive	24
abstract_inverted_index.methodologies.	129
abstract_inverted_index.vulnerabilities	5
abstract_inverted_index.position-dependent	20
cited_by_percentile_year
countries_distinct_count	0
institutions_distinct_count	2
citation_normalized_percentile
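
The abstract_inverted_index.* rows in the payload above store the abstract as a map from each word to the positions where it occurs. A minimal sketch of turning such rows back into running text follows. It assumes each row is a tab-separated key/value pair with comma-separated integer positions, as the rows appear here. The helper names `parse_payload_rows` and `rebuild_abstract` are hypothetical and are not part of any OpenAlex client.

```python
def parse_payload_rows(rows, prefix="abstract_inverted_index."):
    """Collect word -> positions from flattened 'key<TAB>value' rows."""
    index = {}
    for row in rows:
        key, _, value = row.partition("\t")
        if key.startswith(prefix) and value.strip():
            word = key[len(prefix):]
            index[word] = [int(p) for p in value.split(",")]
    return index


def rebuild_abstract(inverted_index):
    """Place every word at each of its positions and join in order."""
    by_position = {}
    for word, positions in inverted_index.items():
        for pos in positions:
            by_position[pos] = word
    return " ".join(by_position[pos] for pos in sorted(by_position))


# A few rows copied from the payload; the full set is handled the same way.
rows = [
    "abstract_inverted_index.Large\t0",
    "abstract_inverted_index.language\t1",
    "abstract_inverted_index.models\t2, 56, 62",
    "abstract_inverted_index.exhibit\t3",
    "abstract_inverted_index.systematic\t4",
    "abstract_inverted_index.vulnerabilities\t5",
]

index = parse_payload_rows(rows)
# Keep only positions 0-5 so this partial sample yields a contiguous phrase.
opening = rebuild_abstract(
    {word: [p for p in positions if p < 6] for word, positions in index.items()}
)
print(opening)
```

Applied to all of the abstract_inverted_index.* rows, the same two helpers should reproduce the abstract as originally indexed, punctuation included, since punctuation is stored as part of each word key.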