Calibration of Large Language Models on Code Summarization
2024 · Open Access
· DOI: https://doi.org/10.48550/arxiv.2404.19318
A brief, fluent, and relevant summary can be helpful during program comprehension; however, such a summary requires significant human effort to produce. Good summaries are often unavailable in software projects, which makes maintenance more difficult. There has been a considerable body of research into automated AI-based methods that use Large Language Models (LLMs) to generate summaries of code; there has also been substantial work on ways to measure the performance of such summarization methods, with special attention paid to how closely these AI-generated summaries resemble a summary a human might have produced. Measures such as BERTScore and BLEU have been suggested and evaluated with human-subject studies. However, LLM-generated summaries can be inaccurate, incomplete, or otherwise too dissimilar to one that a good developer might write. Given an LLM-generated code summary, how can a user rationally judge whether it is sufficiently good and reliable? Given just some input source code and an LLM-generated summary, existing approaches can help judge the brevity, fluency, and relevance of the summary; however, it is difficult to gauge whether an LLM-generated summary sufficiently resembles what a human might produce without a "golden" human-produced summary to compare against. We study this resemblance question as a calibration problem: given just the code and the summary from an LLM, can we compute a confidence measure that provides a reliable indication of whether the summary sufficiently resembles what a human would have produced in this situation? We examine this question using several LLMs, for several languages, and in several different settings. Our investigation suggests approaches to provide reliable predictions of the likelihood that an LLM-generated summary would sufficiently resemble a summary a human might write for the same code.
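To make the calibration framing concrete, here is a minimal sketch of two widely used calibration measures, the Brier score and the expected calibration error (ECE). The abstract does not say which measures the authors use, so treat this purely as an illustration. The function names, the equal-width binning, and the idea that an outcome of 1 means "the summary was similar enough to a human reference" are all assumptions made for this example.

```python
from typing import Sequence


def brier_score(confidences: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared gap between predicted confidence and the 0/1 outcome."""
    if len(confidences) != len(outcomes) or not confidences:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((c - y) ** 2 for c, y in zip(confidences, outcomes)) / len(confidences)


def expected_calibration_error(
    confidences: Sequence[float], outcomes: Sequence[int], n_bins: int = 10
) -> float:
    """Weighted average, over equal-width confidence bins, of
    |mean confidence - observed success rate|."""
    if len(confidences) != len(outcomes) or not confidences:
        raise ValueError("inputs must be non-empty and of equal length")
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, outcomes):
        if not 0.0 <= c <= 1.0:
            raise ValueError("confidences must lie in [0, 1]")
        # A confidence of exactly 1.0 is placed in the last bin.
        index = min(int(c * n_bins), n_bins - 1)
        bins[index].append((c, y))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        success_rate = sum(y for _, y in members) / len(members)
        ece += (len(members) / total) * abs(mean_conf - success_rate)
    return ece


# Hypothetical data: the model reports 0.8 confidence for five summaries,
# and four of them (80%) turn out to be "good enough". That is well calibrated.
confs = [0.8, 0.8, 0.8, 0.8, 0.8]
good = [1, 1, 1, 1, 0]
print(expected_calibration_error(confs, good))
print(brier_score(confs, good))
```

A confidence measure is well calibrated when, among summaries assigned confidence p, roughly a fraction p actually meet the quality bar; both measures above shrink toward zero as that holds more closely.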
Related Topics
- Type: preprint
- Language: en
- Landing Page: http://arxiv.org/abs/2404.19318, https://arxiv.org/pdf/2404.19318
- OA Status: green
- Cited By: 1
- Related Works: 10
- OpenAlex ID: https://openalex.org/W4396600598
Raw OpenAlex JSON
- OpenAlex ID: https://openalex.org/W4396600598 (canonical identifier for this work in OpenAlex)
- DOI: https://doi.org/10.48550/arxiv.2404.19318 (Digital Object Identifier)
- Title: Calibration of Large Language Models on Code Summarization (work title)
- Type: preprint (OpenAlex work type)
- Language: en (primary language)
- Publication year: 2024 (year of publication)
- Publication date: 2024-04-30 (full publication date if available)
- Authors: Yuvraj Virk, Prémkumar Dévanbu, Toufique Ahmed (list of authors in order)
- Landing page: https://arxiv.org/abs/2404.19318 (publisher landing page)
- PDF URL: https://arxiv.org/pdf/2404.19318 (direct link to full text PDF)
- Open access: Yes (whether a free full text is available)
- OA status: green (open access status per OpenAlex)
- OA URL: https://arxiv.org/pdf/2404.19318 (direct OA link when available)
- Concepts: Code (set theory), Computer science, Confidence interval, Psychology, Statistics, Programming language, Mathematics, Set (abstract data type) (top concepts, i.e. fields/topics, attached by OpenAlex)
- Cited by: 1 (total citation count in OpenAlex)
- Citations by year (recent): 2025: 1 (per-year citation counts, last 5 years)
- Related works (count): 10 (other works algorithmically related by OpenAlex)
Full payload
| Field | Value |
|---|---|
| id | https://openalex.org/W4396600598 |
| doi | https://doi.org/10.48550/arxiv.2404.19318 |
| ids.doi | https://doi.org/10.48550/arxiv.2404.19318 |
| ids.openalex | https://openalex.org/W4396600598 |
| fwci | |
| type | preprint |
| title | Calibration of Large Language Models on Code Summarization |
| biblio.issue | |
| biblio.volume | |
| biblio.last_page | |
| biblio.first_page | |
| topics[0].id | https://openalex.org/T10181 |
| topics[0].field.id | https://openalex.org/fields/17 |
| topics[0].field.display_name | Computer Science |
| topics[0].score | 0.998199999332428 |
| topics[0].domain.id | https://openalex.org/domains/3 |
| topics[0].domain.display_name | Physical Sciences |
| topics[0].subfield.id | https://openalex.org/subfields/1702 |
| topics[0].subfield.display_name | Artificial Intelligence |
| topics[0].display_name | Natural Language Processing Techniques |
| topics[1].id | https://openalex.org/T10215 |
| topics[1].field.id | https://openalex.org/fields/17 |
| topics[1].field.display_name | Computer Science |
| topics[1].score | 0.9879000186920166 |
| topics[1].domain.id | https://openalex.org/domains/3 |
| topics[1].domain.display_name | Physical Sciences |
| topics[1].subfield.id | https://openalex.org/subfields/1702 |
| topics[1].subfield.display_name | Artificial Intelligence |
| topics[1].display_name | Semantic Web and Ontologies |
| topics[2].id | https://openalex.org/T10028 |
| topics[2].field.id | https://openalex.org/fields/17 |
| topics[2].field.display_name | Computer Science |
| topics[2].score | 0.9850000143051147 |
| topics[2].domain.id | https://openalex.org/domains/3 |
| topics[2].domain.display_name | Physical Sciences |
| topics[2].subfield.id | https://openalex.org/subfields/1702 |
| topics[2].subfield.display_name | Artificial Intelligence |
| topics[2].display_name | Topic Modeling |
| is_xpac | False |
| apc_list | |
| apc_paid | |
| concepts[0].id | https://openalex.org/C2776760102 |
| concepts[0].level | 3 |
| concepts[0].score | 0.6257579326629639 |
| concepts[0].wikidata | https://www.wikidata.org/wiki/Q5139990 |
| concepts[0].display_name | Code (set theory) |
| concepts[1].id | https://openalex.org/C41008148 |
| concepts[1].level | 0 |
| concepts[1].score | 0.493927538394928 |
| concepts[1].wikidata | https://www.wikidata.org/wiki/Q21198 |
| concepts[1].display_name | Computer science |
| concepts[2].id | https://openalex.org/C44249647 |
| concepts[2].level | 2 |
| concepts[2].score | 0.4340646266937256 |
| concepts[2].wikidata | https://www.wikidata.org/wiki/Q208498 |
| concepts[2].display_name | Confidence interval |
| concepts[3].id | https://openalex.org/C15744967 |
| concepts[3].level | 0 |
| concepts[3].score | 0.3305203914642334 |
| concepts[3].wikidata | https://www.wikidata.org/wiki/Q9418 |
| concepts[3].display_name | Psychology |
| concepts[4].id | https://openalex.org/C105795698 |
| concepts[4].level | 1 |
| concepts[4].score | 0.2744755744934082 |
| concepts[4].wikidata | https://www.wikidata.org/wiki/Q12483 |
| concepts[4].display_name | Statistics |
| concepts[5].id | https://openalex.org/C199360897 |
| concepts[5].level | 1 |
| concepts[5].score | 0.1556943655014038 |
| concepts[5].wikidata | https://www.wikidata.org/wiki/Q9143 |
| concepts[5].display_name | Programming language |
| concepts[6].id | https://openalex.org/C33923547 |
| concepts[6].level | 0 |
| concepts[6].score | 0.13268107175827026 |
| concepts[6].wikidata | https://www.wikidata.org/wiki/Q395 |
| concepts[6].display_name | Mathematics |
| concepts[7].id | https://openalex.org/C177264268 |
| concepts[7].level | 2 |
| concepts[7].score | 0.07109436392784119 |
| concepts[7].wikidata | https://www.wikidata.org/wiki/Q1514741 |
| concepts[7].display_name | Set (abstract data type) |
| keywords[0].id | https://openalex.org/keywords/code |
| keywords[0].score | 0.6257579326629639 |
| keywords[0].display_name | Code (set theory) |
| keywords[1].id | https://openalex.org/keywords/computer-science |
| keywords[1].score | 0.493927538394928 |
| keywords[1].display_name | Computer science |
| keywords[2].id | https://openalex.org/keywords/confidence-interval |
| keywords[2].score | 0.4340646266937256 |
| keywords[2].display_name | Confidence interval |
| keywords[3].id | https://openalex.org/keywords/psychology |
| keywords[3].score | 0.3305203914642334 |
| keywords[3].display_name | Psychology |
| keywords[4].id | https://openalex.org/keywords/statistics |
| keywords[4].score | 0.2744755744934082 |
| keywords[4].display_name | Statistics |
| keywords[5].id | https://openalex.org/keywords/programming-language |
| keywords[5].score | 0.1556943655014038 |
| keywords[5].display_name | Programming language |
| keywords[6].id | https://openalex.org/keywords/mathematics |
| keywords[6].score | 0.13268107175827026 |
| keywords[6].display_name | Mathematics |
| keywords[7].id | https://openalex.org/keywords/set |
| keywords[7].score | 0.07109436392784119 |
| keywords[7].display_name | Set (abstract data type) |
| language | en |
| locations[0].id | pmh:oai:arXiv.org:2404.19318 |
| locations[0].is_oa | True |
| locations[0].source.id | https://openalex.org/S4306400194 |
| locations[0].source.issn | |
| locations[0].source.type | repository |
| locations[0].source.is_oa | True |
| locations[0].source.issn_l | |
| locations[0].source.is_core | False |
| locations[0].source.is_in_doaj | False |
| locations[0].source.display_name | arXiv (Cornell University) |
| locations[0].source.host_organization | https://openalex.org/I205783295 |
| locations[0].source.host_organization_name | Cornell University |
| locations[0].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[0].license | |
| locations[0].pdf_url | https://arxiv.org/pdf/2404.19318 |
| locations[0].version | submittedVersion |
| locations[0].raw_type | text |
| locations[0].license_id | |
| locations[0].is_accepted | False |
| locations[0].is_published | False |
| locations[0].raw_source_name | |
| locations[0].landing_page_url | http://arxiv.org/abs/2404.19318 |
| locations[1].id | doi:10.48550/arxiv.2404.19318 |
| locations[1].is_oa | True |
| locations[1].source.id | https://openalex.org/S4306400194 |
| locations[1].source.issn | |
| locations[1].source.type | repository |
| locations[1].source.is_oa | True |
| locations[1].source.issn_l | |
| locations[1].source.is_core | False |
| locations[1].source.is_in_doaj | False |
| locations[1].source.display_name | arXiv (Cornell University) |
| locations[1].source.host_organization | https://openalex.org/I205783295 |
| locations[1].source.host_organization_name | Cornell University |
| locations[1].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[1].license | cc-by |
| locations[1].pdf_url | |
| locations[1].version | |
| locations[1].raw_type | article |
| locations[1].license_id | https://openalex.org/licenses/cc-by |
| locations[1].is_accepted | False |
| locations[1].is_published | |
| locations[1].raw_source_name | |
| locations[1].landing_page_url | https://doi.org/10.48550/arxiv.2404.19318 |
| indexed_in | arxiv, datacite |
| authorships[0].author.id | https://openalex.org/A5096067768 |
| authorships[0].author.orcid | |
| authorships[0].author.display_name | Yuvraj Virk |
| authorships[0].author_position | first |
| authorships[0].raw_author_name | Virk, Yuvraj |
| authorships[0].is_corresponding | False |
| authorships[1].author.id | https://openalex.org/A5036744986 |
| authorships[1].author.orcid | https://orcid.org/0000-0002-4346-5276 |
| authorships[1].author.display_name | Prémkumar Dévanbu |
| authorships[1].author_position | middle |
| authorships[1].raw_author_name | Devanbu, Premkumar |
| authorships[1].is_corresponding | False |
| authorships[2].author.id | https://openalex.org/A5072573553 |
| authorships[2].author.orcid | https://orcid.org/0000-0002-4427-1350 |
| authorships[2].author.display_name | Toufique Ahmed |
| authorships[2].author_position | last |
| authorships[2].raw_author_name | Ahmed, Toufique |
| authorships[2].is_corresponding | False |
| has_content.pdf | False |
| has_content.grobid_xml | False |
| is_paratext | False |
| open_access.is_oa | True |
| open_access.oa_url | https://arxiv.org/pdf/2404.19318 |
| open_access.oa_status | green |
| open_access.any_repository_has_fulltext | False |
| created_date | 2024-05-03T00:00:00 |
| display_name | Calibration of Large Language Models on Code Summarization |
| has_fulltext | False |
| is_retracted | False |
| updated_date | 2025-11-06T06:51:31.235846 |
| primary_topic.id | https://openalex.org/T10181 |
| primary_topic.field.id | https://openalex.org/fields/17 |
| primary_topic.field.display_name | Computer Science |
| primary_topic.score | 0.998199999332428 |
| primary_topic.domain.id | https://openalex.org/domains/3 |
| primary_topic.domain.display_name | Physical Sciences |
| primary_topic.subfield.id | https://openalex.org/subfields/1702 |
| primary_topic.subfield.display_name | Artificial Intelligence |
| primary_topic.display_name | Natural Language Processing Techniques |
| related_works | https://openalex.org/W4391375266, https://openalex.org/W2748952813, https://openalex.org/W2390279801, https://openalex.org/W2358668433, https://openalex.org/W2376932109, https://openalex.org/W2001405890, https://openalex.org/W2382290278, https://openalex.org/W4395014643, https://openalex.org/W4391913857, https://openalex.org/W2350741829 |
| cited_by_count | 1 |
| counts_by_year[0].year | 2025 |
| counts_by_year[0].cited_by_count | 1 |
| locations_count | 2 |
| best_oa_location.id | pmh:oai:arXiv.org:2404.19318 |
| best_oa_location.is_oa | True |
| best_oa_location.source.id | https://openalex.org/S4306400194 |
| best_oa_location.source.issn | |
| best_oa_location.source.type | repository |
| best_oa_location.source.is_oa | True |
| best_oa_location.source.issn_l | |
| best_oa_location.source.is_core | False |
| best_oa_location.source.is_in_doaj | False |
| best_oa_location.source.display_name | arXiv (Cornell University) |
| best_oa_location.source.host_organization | https://openalex.org/I205783295 |
| best_oa_location.source.host_organization_name | Cornell University |
| best_oa_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| best_oa_location.license | |
| best_oa_location.pdf_url | https://arxiv.org/pdf/2404.19318 |
| best_oa_location.version | submittedVersion |
| best_oa_location.raw_type | text |
| best_oa_location.license_id | |
| best_oa_location.is_accepted | False |
| best_oa_location.is_published | False |
| best_oa_location.raw_source_name | |
| best_oa_location.landing_page_url | http://arxiv.org/abs/2404.19318 |
| primary_location.id | pmh:oai:arXiv.org:2404.19318 |
| primary_location.is_oa | True |
| primary_location.source.id | https://openalex.org/S4306400194 |
| primary_location.source.issn | |
| primary_location.source.type | repository |
| primary_location.source.is_oa | True |
| primary_location.source.issn_l | |
| primary_location.source.is_core | False |
| primary_location.source.is_in_doaj | False |
| primary_location.source.display_name | arXiv (Cornell University) |
| primary_location.source.host_organization | https://openalex.org/I205783295 |
| primary_location.source.host_organization_name | Cornell University |
| primary_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| primary_location.license | |
| primary_location.pdf_url | https://arxiv.org/pdf/2404.19318 |
| primary_location.version | submittedVersion |
| primary_location.raw_type | text |
| primary_location.license_id | |
| primary_location.is_accepted | False |
| primary_location.is_published | False |
| primary_location.raw_source_name | |
| primary_location.landing_page_url | http://arxiv.org/abs/2404.19318 |
| publication_date | 2024-04-30 |
| publication_year | 2024 |
| referenced_works_count | 0 |
| abstract_inverted_index.A | 0 |
| abstract_inverted_index.a | 14, 39, 63, 88, 90, 123, 135, 140, 181, 186, 214, 219, 229, 270, 272 |
| abstract_inverted_index.We | 193, 237 |
| abstract_inverted_index.an | 129, 154, 175, 209, 264 |
| abstract_inverted_index.as | 97, 198 |
| abstract_inverted_index.be | 7, 113 |
| abstract_inverted_index.if | 139 |
| abstract_inverted_index.in | 28, 234, 248 |
| abstract_inverted_index.is | 142 |
| abstract_inverted_index.of | 42, 56, 65, 73, 166, 222, 260 |
| abstract_inverted_index.on | 67 |
| abstract_inverted_index.to | 21, 53, 69, 81, 120, 172, 190, 256 |
| abstract_inverted_index.we | 212 |
| abstract_inverted_index.Our | 252 |
| abstract_inverted_index.and | 3, 99, 104, 145, 153, 164, 247 |
| abstract_inverted_index.are | 26 |
| abstract_inverted_index.bit | 64 |
| abstract_inverted_index.can | 6, 112, 134, 159, 211 |
| abstract_inverted_index.for | 244, 276 |
| abstract_inverted_index.has | 37, 60 |
| abstract_inverted_index.how | 82, 133 |
| abstract_inverted_index.one | 121 |
| abstract_inverted_index.the | 71, 167, 203, 206, 224, 261, 277 |
| abstract_inverted_index.too | 118 |
| abstract_inverted_index.BLEU | 100 |
| abstract_inverted_index.LLM, | 210 |
| abstract_inverted_index.also | 59 |
| abstract_inverted_index.been | 38, 61, 102 |
| abstract_inverted_index.body | 41 |
| abstract_inverted_index.code | 131, 204 |
| abstract_inverted_index.does | 16 |
| abstract_inverted_index.from | 208 |
| abstract_inverted_index.good | 24, 124, 144 |
| abstract_inverted_index.have | 93, 101, 232 |
| abstract_inverted_index.help | 160 |
| abstract_inverted_index.into | 44 |
| abstract_inverted_index.it's | 170 |
| abstract_inverted_index.just | 148, 202 |
| abstract_inverted_index.more | 34 |
| abstract_inverted_index.paid | 80 |
| abstract_inverted_index.same | 278 |
| abstract_inverted_index.some | 149 |
| abstract_inverted_index.such | 13, 74, 96 |
| abstract_inverted_index.that | 122, 217, 263 |
| abstract_inverted_index.this | 195, 235, 239 |
| abstract_inverted_index.user | 136 |
| abstract_inverted_index.ways | 68 |
| abstract_inverted_index.what | 180, 228 |
| abstract_inverted_index.with | 77, 106 |
| abstract_inverted_index.work | 66 |
| abstract_inverted_index.& | 205 |
| abstract_inverted_index.Given | 128, 147 |
| abstract_inverted_index.LLMs, | 243 |
| abstract_inverted_index.Large | 49 |
| abstract_inverted_index.There | 36 |
| abstract_inverted_index.code, | 152 |
| abstract_inverted_index.code. | 279 |
| abstract_inverted_index.code; | 57 |
| abstract_inverted_index.etc.: | 116 |
| abstract_inverted_index.gauge | 173 |
| abstract_inverted_index.given | 201 |
| abstract_inverted_index.human | 19, 91, 182, 230, 273 |
| abstract_inverted_index.input | 150 |
| abstract_inverted_index.judge | 138, 161 |
| abstract_inverted_index.makes | 32 |
| abstract_inverted_index.might | 92, 126, 183, 274 |
| abstract_inverted_index.quite | 62 |
| abstract_inverted_index.study | 194 |
| abstract_inverted_index.there | 58 |
| abstract_inverted_index.these | 84 |
| abstract_inverted_index.using | 48, 241 |
| abstract_inverted_index.which | 31 |
| abstract_inverted_index.would | 231, 267 |
| abstract_inverted_index.write | 275 |
| abstract_inverted_index.Often, | 23 |
| abstract_inverted_index.brief, | 1 |
| abstract_inverted_index.during | 9 |
| abstract_inverted_index.effort | 20 |
| abstract_inverted_index.models | 51 |
| abstract_inverted_index.source | 151 |
| abstract_inverted_index.write. | 127 |
| abstract_inverted_index.(LLMs), | 52 |
| abstract_inverted_index.closely | 83 |
| abstract_inverted_index.compare | 191 |
| abstract_inverted_index.compute | 213 |
| abstract_inverted_index.examine | 238 |
| abstract_inverted_index.fluency | 163 |
| abstract_inverted_index.fluent, | 2 |
| abstract_inverted_index.helpful | 8 |
| abstract_inverted_index.measure | 70 |
| abstract_inverted_index.program | 10 |
| abstract_inverted_index.provide | 257 |
| abstract_inverted_index.require | 17 |
| abstract_inverted_index.several | 242, 245, 249 |
| abstract_inverted_index.special | 78 |
| abstract_inverted_index.summary | 5, 15, 89, 141, 177, 189, 207, 225, 266, 271 |
| abstract_inverted_index.whether | 174, 223 |
| abstract_inverted_index.without | 185 |
| abstract_inverted_index."golden" | 187 |
| abstract_inverted_index.AI-based | 46 |
| abstract_inverted_index.However, | 109 |
| abstract_inverted_index.Language | 50 |
| abstract_inverted_index.Measures | 95 |
| abstract_inverted_index.against. | 192 |
| abstract_inverted_index.brevity, | 162 |
| abstract_inverted_index.existing | 157 |
| abstract_inverted_index.generate | 54 |
| abstract_inverted_index.however, | 12, 169 |
| abstract_inverted_index.measure, | 216 |
| abstract_inverted_index.methods, | 47, 76 |
| abstract_inverted_index.problem: | 200 |
| abstract_inverted_index.produce, | 184 |
| abstract_inverted_index.produce. | 22 |
| abstract_inverted_index.produced | 233 |
| abstract_inverted_index.provides | 218 |
| abstract_inverted_index.question | 197, 240 |
| abstract_inverted_index.relevant | 4 |
| abstract_inverted_index.reliable | 220, 258 |
| abstract_inverted_index.research | 43 |
| abstract_inverted_index.resemble | 87, 269 |
| abstract_inverted_index.software | 29 |
| abstract_inverted_index.studies. | 108 |
| abstract_inverted_index.suggests | 254 |
| abstract_inverted_index.summary, | 132, 156 |
| abstract_inverted_index.summary; | 168 |
| abstract_inverted_index.BERTScore | 98 |
| abstract_inverted_index.attention | 79 |
| abstract_inverted_index.automated | 45 |
| abstract_inverted_index.developer | 125 |
| abstract_inverted_index.different | 250 |
| abstract_inverted_index.difficult | 171 |
| abstract_inverted_index.evaluated | 105 |
| abstract_inverted_index.produced. | 94 |
| abstract_inverted_index.projects, | 30 |
| abstract_inverted_index.relevance | 165 |
| abstract_inverted_index.reliable? | 146 |
| abstract_inverted_index.resembles | 179, 227 |
| abstract_inverted_index.settings. | 251 |
| abstract_inverted_index.suggested | 103 |
| abstract_inverted_index.summaries | 25, 55, 86, 111 |
| abstract_inverted_index.approaches | 158, 255 |
| abstract_inverted_index.confidence | 215 |
| abstract_inverted_index.difficult. | 35 |
| abstract_inverted_index.dissimilar | 119 |
| abstract_inverted_index.generally, | 117 |
| abstract_inverted_index.indication | 221 |
| abstract_inverted_index.languages, | 246 |
| abstract_inverted_index.likelihood | 262 |
| abstract_inverted_index.rationally | 137 |
| abstract_inverted_index.situation? | 236 |
| abstract_inverted_index.calibration | 199 |
| abstract_inverted_index.inaccurate, | 114 |
| abstract_inverted_index.incomplete, | 115 |
| abstract_inverted_index.maintenance | 33 |
| abstract_inverted_index.performance | 72 |
| abstract_inverted_index.predictions | 259 |
| abstract_inverted_index.resemblance | 196 |
| abstract_inverted_index.significant | 18 |
| abstract_inverted_index.unavailable | 27 |
| abstract_inverted_index.AI-generated | 85 |
| abstract_inverted_index.considerable | 40 |
| abstract_inverted_index.sufficiently | 143, 178, 226, 268 |
| abstract_inverted_index.LLM-generated | 110, 130, 155, 176, 265 |
| abstract_inverted_index.human-subject | 107 |
| abstract_inverted_index.investigation | 253 |
| abstract_inverted_index.summarization | 75 |
| abstract_inverted_index.comprehension; | 11 |
| abstract_inverted_index.human-produced | 188 |
| cited_by_percentile_year | |
| countries_distinct_count | 0 |
| institutions_distinct_count | 3 |
| citation_normalized_percentile |
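
The `abstract_inverted_index.*` rows above store the abstract as a mapping from each token to the word positions where it occurs. A minimal sketch of how that mapping can be turned back into readable text follows. The function name `rebuild_abstract` and the dictionary input shape are assumptions made for illustration, and the sample uses only the first nine positions from the table.

```python
from typing import Dict, List


def rebuild_abstract(inverted_index: Dict[str, List[int]]) -> str:
    """Place every token at each of its recorded positions, then join in order."""
    by_position = {}
    for token, positions in inverted_index.items():
        for position in positions:
            by_position[position] = token
    return " ".join(by_position[i] for i in sorted(by_position))


# Sample taken from the table rows for positions 0 through 8.
sample = {
    "A": [0],
    "brief,": [1],
    "fluent,": [2],
    "and": [3],
    "relevant": [4],
    "summary": [5],
    "can": [6],
    "be": [7],
    "helpful": [8],
}
print(rebuild_abstract(sample))
# prints: A brief, fluent, and relevant summary can be helpful
```

Sorting by position rather than assuming a contiguous range keeps the sketch usable on a partial index like the sample, where later positions are missing.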