CodeFuse-CommitEval: Towards Benchmarking LLM's Power on Commit Message and Code Change Inconsistency Detection

Qingyu Zhang , Puzhuo Liu , Di Peng , Chenxiong Qian ·

2025 · Open Access · DOI: https://doi.org/10.48550/arxiv.2511.19875

Version control relies on commit messages to convey the rationale for code changes, but these messages are often low quality and, more critically, inconsistent with their diffs, a problem known as message-code inconsistency (MCI). MCIs mislead reviewers, hinder maintenance, contaminate research datasets, and may obscure security patches. Yet no dedicated benchmark exists to evaluate models for MCI detection. We introduce CODEFUSE-COMMITEVAL, the first benchmark designed for MCI detection using large language models (LLMs). Building on the ApacheCM dataset for diversity and quality, we generate seven types of inconsistent messages through rule-guided mutations of originally consistent commits and apply two-fold validation to verify both positive and negative samples. Using this labeled dataset of message-diff pairs, we evaluate six state-of-the-art open-source LLMs under a vanilla setting and with three augmentation strategies: few-shot prompting, chain-of-thought, and extended context. Results show that models detect inconsistent commits more reliably than consistent ones (average Recall 85.95%, Precision 80.28%, Specificity 63.8%); gpt-oss-20B performs best overall but uses over twice as many tokens as the other models. Augmentation effects vary: adjacent context helps larger models but adds noise for smaller ones; few-shot prompting improves accuracy and reduces token use, yet increases universally incorrect predictions; chain-of-thought boosts precision and specificity at the cost of lower recall and higher token consumption. Type-wise analysis reveals higher detectability for component, file-path, and operation inconsistencies, but lower accuracy and higher token cost for intent-level "purpose" inconsistencies. CODEFUSE-COMMITEVAL provides a rigorous foundation for measuring, comparing, and advancing MCI detection, highlighting the need for richer context and balanced data to capture high-level semantic gaps.
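The three metrics quoted in the abstract can be illustrated with a minimal sketch. It assumes that "inconsistent" is the positive class (label 1) and "consistent" the negative class (label 0), which matches the abstract's framing but is not spelled out in it. The names `ConfusionCounts` and `count_outcomes` are hypothetical helpers for illustration, not part of the benchmark.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ConfusionCounts:
    tp: int  # inconsistent commits predicted inconsistent
    fp: int  # consistent commits predicted inconsistent
    tn: int  # consistent commits predicted consistent
    fn: int  # inconsistent commits predicted consistent


def count_outcomes(y_true: Sequence[int], y_pred: Sequence[int]) -> ConfusionCounts:
    """Tally outcomes; 1 = inconsistent (assumed positive), 0 = consistent."""
    if len(y_true) != len(y_pred):
        raise ValueError("label and prediction sequences differ in length")
    pairs = list(zip(y_true, y_pred))
    return ConfusionCounts(
        tp=sum(1 for t, p in pairs if t == 1 and p == 1),
        fp=sum(1 for t, p in pairs if t == 0 and p == 1),
        tn=sum(1 for t, p in pairs if t == 0 and p == 0),
        fn=sum(1 for t, p in pairs if t == 1 and p == 0),
    )


def _ratio(numerator: int, denominator: int) -> float:
    # Return 0.0 instead of dividing by zero when a class is absent.
    return numerator / denominator if denominator else 0.0


def recall(c: ConfusionCounts) -> float:
    """Share of truly inconsistent commits that were flagged."""
    return _ratio(c.tp, c.tp + c.fn)


def precision(c: ConfusionCounts) -> float:
    """Share of flagged commits that are truly inconsistent."""
    return _ratio(c.tp, c.tp + c.fp)


def specificity(c: ConfusionCounts) -> float:
    """Share of truly consistent commits that were left unflagged."""
    return _ratio(c.tn, c.tn + c.fp)


# Toy labels, invented purely for illustration.
toy_true = [1, 1, 1, 0, 0]
toy_pred = [1, 1, 0, 1, 0]
toy_counts = count_outcomes(toy_true, toy_pred)
```

Under this reading, recall exceeding specificity means a model is better at flagging inconsistent commits than at recognizing consistent ones, which is the asymmetry the abstract reports.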

Metadata

Type: preprint
Landing Page: http://arxiv.org/abs/2511.19875
PDF: https://arxiv.org/pdf/2511.19875
OA Status: green
OpenAlex ID: https://openalex.org/W4416766960

All OpenAlex metadata

OpenAlex ID: https://openalex.org/W4416766960
Canonical identifier for this work in OpenAlex

DOI: https://doi.org/10.48550/arxiv.2511.19875
Digital Object Identifier

Title: CodeFuse-CommitEval: Towards Benchmarking LLM's Power on Commit Message and Code Change Inconsistency Detection
Work title

Type: preprint
OpenAlex work type

Publication year: 2025
Year of publication

Publication date: 2025-11-25
Full publication date if available

Authors: Qingyu Zhang, Puzhuo Liu, Di Peng, Chenxiong Qian
List of authors in order

Landing page: https://arxiv.org/abs/2511.19875
Publisher landing page

PDF URL: https://arxiv.org/pdf/2511.19875
Direct link to full text PDF

Open access: Yes
Whether a free full text is available

OA status: green
Open access status per OpenAlex

OA URL: https://arxiv.org/pdf/2511.19875
Direct OA link when available

Cited by: 0
Total citation count in OpenAlex

Full payload

id	https://openalex.org/W4416766960
doi	https://doi.org/10.48550/arxiv.2511.19875
ids.doi	https://doi.org/10.48550/arxiv.2511.19875
ids.openalex	https://openalex.org/W4416766960
fwci
type	preprint
title	CodeFuse-CommitEval: Towards Benchmarking LLM's Power on Commit Message and Code Change Inconsistency Detection
biblio.issue
biblio.volume
biblio.last_page
biblio.first_page
is_xpac	False
apc_list
apc_paid
language
locations[0].id	pmh:oai:arXiv.org:2511.19875
locations[0].is_oa	True
locations[0].source.id	https://openalex.org/S4306400194
locations[0].source.issn
locations[0].source.type	repository
locations[0].source.is_oa	True
locations[0].source.issn_l
locations[0].source.is_core	False
locations[0].source.is_in_doaj	False
locations[0].source.display_name	arXiv (Cornell University)
locations[0].source.host_organization	https://openalex.org/I205783295
locations[0].source.host_organization_name	Cornell University
locations[0].source.host_organization_lineage	https://openalex.org/I205783295
locations[0].license
locations[0].pdf_url	https://arxiv.org/pdf/2511.19875
locations[0].version	submittedVersion
locations[0].raw_type	text
locations[0].license_id
locations[0].is_accepted	False
locations[0].is_published	False
locations[0].raw_source_name
locations[0].landing_page_url	http://arxiv.org/abs/2511.19875
locations[1].id	doi:10.48550/arxiv.2511.19875
locations[1].is_oa	True
locations[1].source.id	https://openalex.org/S4306400194
locations[1].source.issn
locations[1].source.type	repository
locations[1].source.is_oa	True
locations[1].source.issn_l
locations[1].source.is_core	False
locations[1].source.is_in_doaj	False
locations[1].source.display_name	arXiv (Cornell University)
locations[1].source.host_organization	https://openalex.org/I205783295
locations[1].source.host_organization_name	Cornell University
locations[1].source.host_organization_lineage	https://openalex.org/I205783295
locations[1].license	cc-by
locations[1].pdf_url
locations[1].version
locations[1].raw_type	article
locations[1].license_id	https://openalex.org/licenses/cc-by
locations[1].is_accepted	False
locations[1].is_published
locations[1].raw_source_name
locations[1].landing_page_url	https://doi.org/10.48550/arxiv.2511.19875
indexed_in	arxiv, datacite
authorships[0].author.id	https://openalex.org/A5087307843
authorships[0].author.orcid	https://orcid.org/0009-0009-4422-3971
authorships[0].author.display_name	Qingyu Zhang
authorships[0].author_position	first
authorships[0].raw_author_name	Zhang, Qingyu
authorships[0].is_corresponding	False
authorships[1].author.id	https://openalex.org/A5037451675
authorships[1].author.orcid	https://orcid.org/0000-0002-8995-5924
authorships[1].author.display_name	Puzhuo Liu
authorships[1].author_position	middle
authorships[1].raw_author_name	Liu, Puzhuo
authorships[1].is_corresponding	False
authorships[2].author.id	https://openalex.org/A5029153155
authorships[2].author.orcid	https://orcid.org/0000-0002-8116-5215
authorships[2].author.display_name	Di Peng
authorships[2].author_position	middle
authorships[2].raw_author_name	Di, Peng
authorships[2].is_corresponding	False
authorships[3].author.id	https://openalex.org/A5111610656
authorships[3].author.orcid
authorships[3].author.display_name	Chenxiong Qian
authorships[3].author_position	last
authorships[3].raw_author_name	Qian, Chenxiong
authorships[3].is_corresponding	False
has_content.pdf	False
has_content.grobid_xml	False
is_paratext	False
open_access.is_oa	True
open_access.oa_url	https://arxiv.org/pdf/2511.19875
open_access.oa_status	green
open_access.any_repository_has_fulltext	False
created_date	2025-11-28T00:00:00
display_name	CodeFuse-CommitEval: Towards Benchmarking LLM's Power on Commit Message and Code Change Inconsistency Detection
has_fulltext	False
is_retracted	False
updated_date	2025-11-28T22:56:25.032910
primary_topic
cited_by_count	0
locations_count	2
best_oa_location.id	pmh:oai:arXiv.org:2511.19875
best_oa_location.is_oa	True
best_oa_location.source.id	https://openalex.org/S4306400194
best_oa_location.source.issn
best_oa_location.source.type	repository
best_oa_location.source.is_oa	True
best_oa_location.source.issn_l
best_oa_location.source.is_core	False
best_oa_location.source.is_in_doaj	False
best_oa_location.source.display_name	arXiv (Cornell University)
best_oa_location.source.host_organization	https://openalex.org/I205783295
best_oa_location.source.host_organization_name	Cornell University
best_oa_location.source.host_organization_lineage	https://openalex.org/I205783295
best_oa_location.license
best_oa_location.pdf_url	https://arxiv.org/pdf/2511.19875
best_oa_location.version	submittedVersion
best_oa_location.raw_type	text
best_oa_location.license_id
best_oa_location.is_accepted	False
best_oa_location.is_published	False
best_oa_location.raw_source_name
best_oa_location.landing_page_url	http://arxiv.org/abs/2511.19875
primary_location.id	pmh:oai:arXiv.org:2511.19875
primary_location.is_oa	True
primary_location.source.id	https://openalex.org/S4306400194
primary_location.source.issn
primary_location.source.type	repository
primary_location.source.is_oa	True
primary_location.source.issn_l
primary_location.source.is_core	False
primary_location.source.is_in_doaj	False
primary_location.source.display_name	arXiv (Cornell University)
primary_location.source.host_organization	https://openalex.org/I205783295
primary_location.source.host_organization_name	Cornell University
primary_location.source.host_organization_lineage	https://openalex.org/I205783295
primary_location.license
primary_location.pdf_url	https://arxiv.org/pdf/2511.19875
primary_location.version	submittedVersion
primary_location.raw_type	text
primary_location.license_id
primary_location.is_accepted	False
primary_location.is_published	False
primary_location.raw_source_name
primary_location.landing_page_url	http://arxiv.org/abs/2511.19875
publication_date	2025-11-25
publication_year	2025
referenced_works_count	0
abstract_inverted_index.a	118, 226
abstract_inverted_index.We	55
abstract_inverted_index.as	27
abstract_inverted_index.at	193
abstract_inverted_index.no	45
abstract_inverted_index.of	83, 89, 108, 160, 196
abstract_inverted_index.on	3, 71
abstract_inverted_index.to	6, 49, 97, 245
abstract_inverted_index.we	79, 111
abstract_inverted_index.MCI	53, 63, 234
abstract_inverted_index.and	39, 77, 93, 101, 121, 129, 179, 191, 198, 210, 216, 232, 242
abstract_inverted_index.are	16
abstract_inverted_index.but	13, 154, 170, 213
abstract_inverted_index.for	10, 52, 62, 75, 173, 207, 220, 229, 239
abstract_inverted_index.low	18
abstract_inverted_index.may	40
abstract_inverted_index.six	113
abstract_inverted_index.the	8, 58, 72, 158, 194, 237
abstract_inverted_index.yet	183
abstract_inverted_index.LLMs	116
abstract_inverted_index.MCIs	31
abstract_inverted_index.Yet,	44
abstract_inverted_index.adds	171
abstract_inverted_index.and,	20
abstract_inverted_index.best	152
abstract_inverted_index.both	99
abstract_inverted_index.code	11
abstract_inverted_index.cost	195, 219
abstract_inverted_index.data	244
abstract_inverted_index.more	21, 138
abstract_inverted_index.need	238
abstract_inverted_index.ones	142
abstract_inverted_index.over	156
abstract_inverted_index.show	133
abstract_inverted_index.than	140
abstract_inverted_index.this	105
abstract_inverted_index.use,	182
abstract_inverted_index.uses	155
abstract_inverted_index.with	24, 122
abstract_inverted_index.Built	70
abstract_inverted_index.Using	104
abstract_inverted_index.apply	94
abstract_inverted_index.first	59
abstract_inverted_index.gaps.	249
abstract_inverted_index.helps	167
abstract_inverted_index.large	66
abstract_inverted_index.lower	214
abstract_inverted_index.noise	172
abstract_inverted_index.often	17
abstract_inverted_index.ones;	175
abstract_inverted_index.seven	81
abstract_inverted_index.their	25
abstract_inverted_index.these	14
abstract_inverted_index.three	123
abstract_inverted_index.token	181, 200, 218
abstract_inverted_index.twice	157
abstract_inverted_index.types	82
abstract_inverted_index.under	117
abstract_inverted_index.using	65
abstract_inverted_index.vary:	164
abstract_inverted_index.(MCI).	30
abstract_inverted_index.Recall	144
abstract_inverted_index.boosts	189
abstract_inverted_index.commit	4
abstract_inverted_index.convey	7
abstract_inverted_index.detect	135
abstract_inverted_index.exists	48
abstract_inverted_index.higher	199, 205, 217
abstract_inverted_index.hinder	34
abstract_inverted_index.larger	168
abstract_inverted_index.models	51, 68, 134, 169
abstract_inverted_index.pairs,	110
abstract_inverted_index.recall	197
abstract_inverted_index.relies	2
abstract_inverted_index.richer	240
abstract_inverted_index.tokens	159
abstract_inverted_index.verify	98
abstract_inverted_index.(LLMs).	69
abstract_inverted_index.63.8%);	149
abstract_inverted_index.80.28%,	147
abstract_inverted_index.85.95%,	145
abstract_inverted_index.Results	132
abstract_inverted_index.Version	0
abstract_inverted_index.capture	246
abstract_inverted_index.commits	92, 137
abstract_inverted_index.context	166, 241
abstract_inverted_index.control	1
abstract_inverted_index.dataset	74, 107
abstract_inverted_index.effects	163
abstract_inverted_index.labeled	106
abstract_inverted_index.mislead	32
abstract_inverted_index.obscure	41
abstract_inverted_index.others.	161
abstract_inverted_index.overall	153
abstract_inverted_index.quality	19
abstract_inverted_index.reduces	180
abstract_inverted_index.reveals	204
abstract_inverted_index.setting	120
abstract_inverted_index.smaller	174
abstract_inverted_index.through	86
abstract_inverted_index.vanilla	119
abstract_inverted_index.(average	143
abstract_inverted_index.ApacheCM	73
abstract_inverted_index.accuracy	178, 215
abstract_inverted_index.adjacent	165
abstract_inverted_index.analysis	203
abstract_inverted_index.balanced	243
abstract_inverted_index.changes,	12
abstract_inverted_index.context.	131
abstract_inverted_index.designed	61
abstract_inverted_index.evaluate	50, 112
abstract_inverted_index.extended	130
abstract_inverted_index.few-shot	126, 176
abstract_inverted_index.generate	80
abstract_inverted_index.improves	177
abstract_inverted_index.language	67
abstract_inverted_index.messages	5, 15, 85
abstract_inverted_index.negative	102
abstract_inverted_index.patches.	43
abstract_inverted_index.performs	151
abstract_inverted_index.positive	100
abstract_inverted_index.provides	225
abstract_inverted_index.quality,	78
abstract_inverted_index.reliably	139
abstract_inverted_index.research	37
abstract_inverted_index.rigorous	227
abstract_inverted_index.samples.	103
abstract_inverted_index.security	42
abstract_inverted_index.semantic	248
abstract_inverted_index.two-fold	95
abstract_inverted_index."purpose"	222
abstract_inverted_index.Precision	146
abstract_inverted_index.Type-wise	202
abstract_inverted_index.advancing	233
abstract_inverted_index.benchmark	47, 60
abstract_inverted_index.datasets,	38
abstract_inverted_index.dedicated	46
abstract_inverted_index.detection	64
abstract_inverted_index.diversity	76
abstract_inverted_index.incorrect	186
abstract_inverted_index.increases	184
abstract_inverted_index.introduce	56
abstract_inverted_index.mutations	88
abstract_inverted_index.operation	211
abstract_inverted_index.precision	190
abstract_inverted_index.rationale	9
abstract_inverted_index.comparing,	231
abstract_inverted_index.component,	208
abstract_inverted_index.consistent	91, 141
abstract_inverted_index.detection,	235
abstract_inverted_index.detection.	54
abstract_inverted_index.file-path,	209
abstract_inverted_index.foundation	228
abstract_inverted_index.high-level	247
abstract_inverted_index.measuring,	230
abstract_inverted_index.originally	90
abstract_inverted_index.prompting,	127
abstract_inverted_index.reviewers,	33
abstract_inverted_index.validation	96
abstract_inverted_index.Specificity	148
abstract_inverted_index.contaminate	36
abstract_inverted_index.critically,	22
abstract_inverted_index.diffs-known	26
abstract_inverted_index.gpt-oss-20B	150
abstract_inverted_index.open-source	115
abstract_inverted_index.rule-guided	87
abstract_inverted_index.specificity	192
abstract_inverted_index.strategies:	125
abstract_inverted_index.universally	185
abstract_inverted_index.Augmentation	162
abstract_inverted_index.augmentation	124
abstract_inverted_index.consumption.	201
abstract_inverted_index.highlighting	236
abstract_inverted_index.inconsistent	23, 84, 136
abstract_inverted_index.intent-level	221
abstract_inverted_index.maintenance,	35
abstract_inverted_index.message-code	28
abstract_inverted_index.message-diff	109
abstract_inverted_index.predictions;	187
abstract_inverted_index.detectability	206
abstract_inverted_index.inconsistency	29
abstract_inverted_index.chain-of-thought	188
abstract_inverted_index.inconsistencies,	212
abstract_inverted_index.inconsistencies.	223
abstract_inverted_index.state-of-the-art	114
abstract_inverted_index.chain-of-thought,	128
abstract_inverted_index.CODEFUSE-COMMITEVAL	224
abstract_inverted_index.CODEFUSE-COMMITEVAL,	57
cited_by_percentile_year
countries_distinct_count	0
institutions_distinct_count	4
citation_normalized_percentile
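
The `abstract_inverted_index.*` rows in the payload above map each token of the abstract to the word positions where it occurs. The following minimal sketch rebuilds readable text from rows in that shape. It assumes each row is a tab-separated key and value, as displayed on this page, and the function names are hypothetical helpers, not part of any OpenAlex client library.

```python
from typing import Dict, Iterable, List

PREFIX = "abstract_inverted_index."


def parse_inverted_index(rows: Iterable[str]) -> Dict[str, List[int]]:
    """Collect token -> positions from flattened payload rows.

    Assumes rows look like 'abstract_inverted_index.<token>\\t<pos>, <pos>'.
    Rows without the prefix are ignored.
    """
    index: Dict[str, List[int]] = {}
    for row in rows:
        if not row.startswith(PREFIX):
            continue
        key, _, value = row.partition("\t")
        token = key[len(PREFIX):]
        index[token] = [int(p) for p in value.split(",") if p.strip()]
    return index


def reconstruct_text(index: Dict[str, List[int]]) -> str:
    """Place each token at its positions and join them in order."""
    by_position: Dict[int, str] = {}
    for token, positions in index.items():
        for pos in positions:
            by_position[pos] = token
    # Positions missing from the input are skipped rather than padded.
    return " ".join(by_position[pos] for pos in sorted(by_position))


# A few rows copied from the payload above (first five word positions).
sample_rows = [
    "abstract_inverted_index.commit\t4",
    "abstract_inverted_index.relies\t2",
    "abstract_inverted_index.Version\t0",
    "abstract_inverted_index.control\t1",
    "abstract_inverted_index.on\t3, 71",
    "type\tpreprint",
]
sample_index = parse_inverted_index(sample_rows)
opening_words = " ".join(reconstruct_text(sample_index).split()[:5])
```

Because only a handful of rows are supplied in the sample, the second occurrence of "on" (position 71) ends up right after position 4 in the joined output; with the full set of rows every position is filled and the text reads in its original order.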