Lost in Benchmarks? Rethinking Large Language Model Benchmarking with Item Response Theory
2025 · Open Access · DOI: https://doi.org/10.48550/arxiv.2505.15055
The evaluation of large language models (LLMs) via benchmarks is widespread, yet inconsistencies between different leaderboards and poor separability among top models raise concerns about their ability to accurately reflect authentic model capabilities. This paper provides a critical analysis of benchmark effectiveness, examining mainstream prominent LLM benchmarks using results from diverse models. We first propose Pseudo-Siamese Network for Item Response Theory (PSN-IRT), an enhanced Item Response Theory framework that incorporates a rich set of item parameters within an IRT-grounded architecture. PSN-IRT can be utilized for accurate and reliable estimations of item characteristics and model abilities. Based on PSN-IRT, we conduct extensive analysis on 11 LLM benchmarks comprising 41,871 items, revealing significant and varied shortcomings in their measurement quality. Furthermore, we demonstrate that leveraging PSN-IRT is able to construct smaller benchmarks while maintaining stronger alignment with human preference.
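For readers unfamiliar with Item Response Theory, the sketch below shows the classical four-parameter logistic (4PL) item response function. It is textbook IRT, not the PSN-IRT architecture: the abstract does not say which item parameters PSN-IRT estimates, so the function name `irt_4pl` and the parameter names (`discrimination`, `difficulty`, `guessing`, `upper`) are assumptions made purely for illustration.

```python
import math


def irt_4pl(ability: float, discrimination: float, difficulty: float,
            guessing: float = 0.0, upper: float = 1.0) -> float:
    """Probability of a correct response under the classical 4PL IRT model.

    P = guessing + (upper - guessing) / (1 + exp(-discrimination * (ability - difficulty)))

    All parameter names are illustrative assumptions, not taken from the paper.
    """
    logistic = 1.0 / (1.0 + math.exp(-discrimination * (ability - difficulty)))
    return guessing + (upper - guessing) * logistic


if __name__ == "__main__":
    # Hypothetical item: discrimination 1.5, difficulty 0, 25% guessing floor.
    for theta in (-2.0, 0.0, 2.0):
        print(theta, round(irt_4pl(theta, 1.5, 0.0, guessing=0.25), 3))
```

In this formulation an item separates models well when `discrimination` is large and `difficulty` lies near the abilities of the models being compared, which is one way to read the abstract's concern about poor separability among top models.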
- Type: preprint
- Language: en
- Landing Page: http://arxiv.org/abs/2505.15055, https://arxiv.org/pdf/2505.15055
- OA Status: green
- OpenAlex ID: https://openalex.org/W4415328433
Raw OpenAlex JSON
- OpenAlex ID: https://openalex.org/W4415328433 (canonical identifier for this work in OpenAlex)
- DOI: https://doi.org/10.48550/arxiv.2505.15055 (Digital Object Identifier)
- Title: Lost in Benchmarks? Rethinking Large Language Model Benchmarking with Item Response Theory (work title)
- Type: preprint (OpenAlex work type)
- Language: en (primary language)
- Publication year: 2025 (year of publication)
- Publication date: 2025-05-21 (full publication date if available)
- Authors: Hongli Zhou, Han Huang, Ziqing Zhao, L.-K. Han, H X Wang, Kehai Chen, Muyun Yang, Wei Bao, Jianwei Dong, Bing Xu, Conghui Zhu, Hailong Cao, Tiejun Zhao (list of authors in order)
- Landing page: https://arxiv.org/abs/2505.15055 (publisher landing page)
- PDF URL: https://arxiv.org/pdf/2505.15055 (direct link to full text PDF)
- Open access: Yes (whether a free full text is available)
- OA status: green (open access status per OpenAlex)
- OA URL: https://arxiv.org/pdf/2505.15055 (direct OA link when available)
- Cited by: 0 (total citation count in OpenAlex)
Full payload
| Field | Value |
|---|---|
| id | https://openalex.org/W4415328433 |
| doi | https://doi.org/10.48550/arxiv.2505.15055 |
| ids.doi | https://doi.org/10.48550/arxiv.2505.15055 |
| ids.openalex | https://openalex.org/W4415328433 |
| fwci | |
| type | preprint |
| title | Lost in Benchmarks? Rethinking Large Language Model Benchmarking with Item Response Theory |
| biblio.issue | |
| biblio.volume | |
| biblio.last_page | |
| biblio.first_page | |
| topics[0].id | https://openalex.org/T10028 |
| topics[0].field.id | https://openalex.org/fields/17 |
| topics[0].field.display_name | Computer Science |
| topics[0].score | 0.9462000131607056 |
| topics[0].domain.id | https://openalex.org/domains/3 |
| topics[0].domain.display_name | Physical Sciences |
| topics[0].subfield.id | https://openalex.org/subfields/1702 |
| topics[0].subfield.display_name | Artificial Intelligence |
| topics[0].display_name | Topic Modeling |
| is_xpac | False |
| apc_list | |
| apc_paid | |
| language | en |
| locations[0].id | pmh:oai:arXiv.org:2505.15055 |
| locations[0].is_oa | True |
| locations[0].source.id | https://openalex.org/S4306400194 |
| locations[0].source.issn | |
| locations[0].source.type | repository |
| locations[0].source.is_oa | True |
| locations[0].source.issn_l | |
| locations[0].source.is_core | False |
| locations[0].source.is_in_doaj | False |
| locations[0].source.display_name | arXiv (Cornell University) |
| locations[0].source.host_organization | https://openalex.org/I205783295 |
| locations[0].source.host_organization_name | Cornell University |
| locations[0].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[0].license | |
| locations[0].pdf_url | https://arxiv.org/pdf/2505.15055 |
| locations[0].version | submittedVersion |
| locations[0].raw_type | text |
| locations[0].license_id | |
| locations[0].is_accepted | False |
| locations[0].is_published | False |
| locations[0].raw_source_name | |
| locations[0].landing_page_url | http://arxiv.org/abs/2505.15055 |
| locations[1].id | doi:10.48550/arxiv.2505.15055 |
| locations[1].is_oa | True |
| locations[1].source.id | https://openalex.org/S4306400194 |
| locations[1].source.issn | |
| locations[1].source.type | repository |
| locations[1].source.is_oa | True |
| locations[1].source.issn_l | |
| locations[1].source.is_core | False |
| locations[1].source.is_in_doaj | False |
| locations[1].source.display_name | arXiv (Cornell University) |
| locations[1].source.host_organization | https://openalex.org/I205783295 |
| locations[1].source.host_organization_name | Cornell University |
| locations[1].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[1].license | |
| locations[1].pdf_url | |
| locations[1].version | |
| locations[1].raw_type | article |
| locations[1].license_id | |
| locations[1].is_accepted | False |
| locations[1].is_published | |
| locations[1].raw_source_name | |
| locations[1].landing_page_url | https://doi.org/10.48550/arxiv.2505.15055 |
| indexed_in | arxiv, datacite |
| authorships[0].author.id | https://openalex.org/A5103325997 |
| authorships[0].author.orcid | https://orcid.org/0009-0006-1693-9399 |
| authorships[0].author.display_name | Hongli Zhou |
| authorships[0].author_position | first |
| authorships[0].raw_author_name | Zhou, Hongli |
| authorships[0].is_corresponding | False |
| authorships[1].author.id | https://openalex.org/A5100709581 |
| authorships[1].author.orcid | https://orcid.org/0000-0003-3353-2970 |
| authorships[1].author.display_name | Han Huang |
| authorships[1].author_position | middle |
| authorships[1].raw_author_name | Huang, Hui |
| authorships[1].is_corresponding | False |
| authorships[2].author.id | https://openalex.org/A5103011004 |
| authorships[2].author.orcid | https://orcid.org/0009-0004-3874-1530 |
| authorships[2].author.display_name | Ziqing Zhao |
| authorships[2].author_position | middle |
| authorships[2].raw_author_name | Zhao, Ziqing |
| authorships[2].is_corresponding | False |
| authorships[3].author.id | https://openalex.org/A5101962120 |
| authorships[3].author.orcid | https://orcid.org/0000-0002-9599-8437 |
| authorships[3].author.display_name | L.-K. Han |
| authorships[3].author_position | middle |
| authorships[3].raw_author_name | Han, Lvyuan |
| authorships[3].is_corresponding | False |
| authorships[4].author.id | https://openalex.org/A5012969043 |
| authorships[4].author.orcid | |
| authorships[4].author.display_name | H X Wang |
| authorships[4].author_position | middle |
| authorships[4].raw_author_name | Wang, Huicheng |
| authorships[4].is_corresponding | False |
| authorships[5].author.id | https://openalex.org/A5006323375 |
| authorships[5].author.orcid | https://orcid.org/0000-0002-4346-7618 |
| authorships[5].author.display_name | Kehai Chen |
| authorships[5].author_position | middle |
| authorships[5].raw_author_name | Chen, Kehai |
| authorships[5].is_corresponding | False |
| authorships[6].author.id | https://openalex.org/A5108053327 |
| authorships[6].author.orcid | https://orcid.org/0000-0002-5940-0266 |
| authorships[6].author.display_name | Muyun Yang |
| authorships[6].author_position | middle |
| authorships[6].raw_author_name | Yang, Muyun |
| authorships[6].is_corresponding | False |
| authorships[7].author.id | https://openalex.org/A5071388470 |
| authorships[7].author.orcid | |
| authorships[7].author.display_name | Wei Bao |
| authorships[7].author_position | middle |
| authorships[7].raw_author_name | Bao, Wei |
| authorships[7].is_corresponding | False |
| authorships[8].author.id | https://openalex.org/A5030649120 |
| authorships[8].author.orcid | https://orcid.org/0000-0002-0547-2870 |
| authorships[8].author.display_name | Jianwei Dong |
| authorships[8].author_position | middle |
| authorships[8].raw_author_name | Dong, Jian |
| authorships[8].is_corresponding | False |
| authorships[9].author.id | https://openalex.org/A5102830982 |
| authorships[9].author.orcid | https://orcid.org/0000-0001-9063-1276 |
| authorships[9].author.display_name | Bing Xu |
| authorships[9].author_position | middle |
| authorships[9].raw_author_name | Xu, Bing |
| authorships[9].is_corresponding | False |
| authorships[10].author.id | https://openalex.org/A5102790470 |
| authorships[10].author.orcid | https://orcid.org/0000-0003-3132-3059 |
| authorships[10].author.display_name | Conghui Zhu |
| authorships[10].author_position | middle |
| authorships[10].raw_author_name | Zhu, Conghui |
| authorships[10].is_corresponding | False |
| authorships[11].author.id | https://openalex.org/A5040038124 |
| authorships[11].author.orcid | https://orcid.org/0000-0002-6842-8674 |
| authorships[11].author.display_name | Hailong Cao |
| authorships[11].author_position | middle |
| authorships[11].raw_author_name | Cao, Hailong |
| authorships[11].is_corresponding | False |
| authorships[12].author.id | https://openalex.org/A5101661008 |
| authorships[12].author.orcid | https://orcid.org/0000-0003-4659-4935 |
| authorships[12].author.display_name | Tiejun Zhao |
| authorships[12].author_position | last |
| authorships[12].raw_author_name | Zhao, Tiejun |
| authorships[12].is_corresponding | False |
| has_content.pdf | False |
| has_content.grobid_xml | False |
| is_paratext | False |
| open_access.is_oa | True |
| open_access.oa_url | https://arxiv.org/pdf/2505.15055 |
| open_access.oa_status | green |
| open_access.any_repository_has_fulltext | False |
| created_date | 2025-10-19T00:00:00 |
| display_name | Lost in Benchmarks? Rethinking Large Language Model Benchmarking with Item Response Theory |
| has_fulltext | False |
| is_retracted | False |
| updated_date | 2025-11-06T06:51:31.235846 |
| primary_topic.id | https://openalex.org/T10028 |
| primary_topic.field.id | https://openalex.org/fields/17 |
| primary_topic.field.display_name | Computer Science |
| primary_topic.score | 0.9462000131607056 |
| primary_topic.domain.id | https://openalex.org/domains/3 |
| primary_topic.domain.display_name | Physical Sciences |
| primary_topic.subfield.id | https://openalex.org/subfields/1702 |
| primary_topic.subfield.display_name | Artificial Intelligence |
| primary_topic.display_name | Topic Modeling |
| cited_by_count | 0 |
| locations_count | 2 |
| best_oa_location.id | pmh:oai:arXiv.org:2505.15055 |
| best_oa_location.is_oa | True |
| best_oa_location.source.id | https://openalex.org/S4306400194 |
| best_oa_location.source.issn | |
| best_oa_location.source.type | repository |
| best_oa_location.source.is_oa | True |
| best_oa_location.source.issn_l | |
| best_oa_location.source.is_core | False |
| best_oa_location.source.is_in_doaj | False |
| best_oa_location.source.display_name | arXiv (Cornell University) |
| best_oa_location.source.host_organization | https://openalex.org/I205783295 |
| best_oa_location.source.host_organization_name | Cornell University |
| best_oa_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| best_oa_location.license | |
| best_oa_location.pdf_url | https://arxiv.org/pdf/2505.15055 |
| best_oa_location.version | submittedVersion |
| best_oa_location.raw_type | text |
| best_oa_location.license_id | |
| best_oa_location.is_accepted | False |
| best_oa_location.is_published | False |
| best_oa_location.raw_source_name | |
| best_oa_location.landing_page_url | http://arxiv.org/abs/2505.15055 |
| primary_location.id | pmh:oai:arXiv.org:2505.15055 |
| primary_location.is_oa | True |
| primary_location.source.id | https://openalex.org/S4306400194 |
| primary_location.source.issn | |
| primary_location.source.type | repository |
| primary_location.source.is_oa | True |
| primary_location.source.issn_l | |
| primary_location.source.is_core | False |
| primary_location.source.is_in_doaj | False |
| primary_location.source.display_name | arXiv (Cornell University) |
| primary_location.source.host_organization | https://openalex.org/I205783295 |
| primary_location.source.host_organization_name | Cornell University |
| primary_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| primary_location.license | |
| primary_location.pdf_url | https://arxiv.org/pdf/2505.15055 |
| primary_location.version | submittedVersion |
| primary_location.raw_type | text |
| primary_location.license_id | |
| primary_location.is_accepted | False |
| primary_location.is_published | False |
| primary_location.raw_source_name | |
| primary_location.landing_page_url | http://arxiv.org/abs/2505.15055 |
| publication_date | 2025-05-21 |
| publication_year | 2025 |
| referenced_works_count | 0 |
| abstract_inverted_index.a | 36, 70 |
| abstract_inverted_index.11 | 103 |
| abstract_inverted_index.We | 52 |
| abstract_inverted_index.an | 62, 77 |
| abstract_inverted_index.be | 82 |
| abstract_inverted_index.in | 114 |
| abstract_inverted_index.is | 9, 124 |
| abstract_inverted_index.of | 2, 39, 73, 89 |
| abstract_inverted_index.on | 96, 102 |
| abstract_inverted_index.to | 27, 126 |
| abstract_inverted_index.we | 98, 119 |
| abstract_inverted_index.LLM | 45, 104 |
| abstract_inverted_index.The | 0 |
| abstract_inverted_index.and | 16, 86, 92, 111 |
| abstract_inverted_index.can | 81 |
| abstract_inverted_index.for | 57, 84 |
| abstract_inverted_index.set | 72 |
| abstract_inverted_index.top | 20 |
| abstract_inverted_index.via | 7 |
| abstract_inverted_index.yet | 11 |
| abstract_inverted_index.Item | 58, 64 |
| abstract_inverted_index.This | 33 |
| abstract_inverted_index.able | 125 |
| abstract_inverted_index.from | 49 |
| abstract_inverted_index.item | 74, 90 |
| abstract_inverted_index.poor | 17 |
| abstract_inverted_index.rich | 71 |
| abstract_inverted_index.that | 68, 121 |
| abstract_inverted_index.with | 134 |
| abstract_inverted_index.Based | 95 |
| abstract_inverted_index.about | 24 |
| abstract_inverted_index.among | 19 |
| abstract_inverted_index.first | 53 |
| abstract_inverted_index.human | 135 |
| abstract_inverted_index.large | 3 |
| abstract_inverted_index.model | 31, 93 |
| abstract_inverted_index.paper | 34 |
| abstract_inverted_index.raise | 22 |
| abstract_inverted_index.their | 25, 115 |
| abstract_inverted_index.using | 47 |
| abstract_inverted_index.while | 130 |
| abstract_inverted_index.(LLMs) | 6 |
| abstract_inverted_index.41,871 | 107 |
| abstract_inverted_index.Theory | 60, 66 |
| abstract_inverted_index.items, | 108 |
| abstract_inverted_index.models | 5, 21 |
| abstract_inverted_index.varied | 112 |
| abstract_inverted_index.within | 76 |
| abstract_inverted_index.Network | 56 |
| abstract_inverted_index.PSN-IRT | 80, 123 |
| abstract_inverted_index.ability | 26 |
| abstract_inverted_index.between | 13 |
| abstract_inverted_index.conduct | 99 |
| abstract_inverted_index.diverse | 50 |
| abstract_inverted_index.models. | 51 |
| abstract_inverted_index.propose | 54 |
| abstract_inverted_index.reflect | 29 |
| abstract_inverted_index.results | 48 |
| abstract_inverted_index.smaller | 128 |
| abstract_inverted_index.PSN-IRT, | 97 |
| abstract_inverted_index.Response | 59, 65 |
| abstract_inverted_index.accurate | 85 |
| abstract_inverted_index.analysis | 38, 101 |
| abstract_inverted_index.concerns | 23 |
| abstract_inverted_index.critical | 37 |
| abstract_inverted_index.enhanced | 63 |
| abstract_inverted_index.language | 4 |
| abstract_inverted_index.provides | 35 |
| abstract_inverted_index.quality. | 117 |
| abstract_inverted_index.reliable | 87 |
| abstract_inverted_index.stronger | 132 |
| abstract_inverted_index.utilized | 83 |
| abstract_inverted_index.alignment | 133 |
| abstract_inverted_index.authentic | 30 |
| abstract_inverted_index.benchmark | 40 |
| abstract_inverted_index.construct | 127 |
| abstract_inverted_index.different | 14 |
| abstract_inverted_index.examining | 42 |
| abstract_inverted_index.extensive | 100 |
| abstract_inverted_index.framework | 67 |
| abstract_inverted_index.prominent | 44 |
| abstract_inverted_index.revealing | 109 |
| abstract_inverted_index.(PSN-IRT), | 61 |
| abstract_inverted_index.abilities. | 94 |
| abstract_inverted_index.accurately | 28 |
| abstract_inverted_index.benchmarks | 8, 46, 105, 129 |
| abstract_inverted_index.comprising | 106 |
| abstract_inverted_index.evaluation | 1 |
| abstract_inverted_index.leveraging | 122 |
| abstract_inverted_index.mainstream | 43 |
| abstract_inverted_index.parameters | 75 |
| abstract_inverted_index.demonstrate | 120 |
| abstract_inverted_index.estimations | 88 |
| abstract_inverted_index.maintaining | 131 |
| abstract_inverted_index.measurement | 116 |
| abstract_inverted_index.preference. | 136 |
| abstract_inverted_index.significant | 110 |
| abstract_inverted_index.widespread, | 10 |
| abstract_inverted_index.Furthermore, | 118 |
| abstract_inverted_index.IRT-grounded | 78 |
| abstract_inverted_index.incorporates | 69 |
| abstract_inverted_index.leaderboards | 15 |
| abstract_inverted_index.separability | 18 |
| abstract_inverted_index.shortcomings | 113 |
| abstract_inverted_index.architecture. | 79 |
| abstract_inverted_index.capabilities. | 32 |
| abstract_inverted_index.Pseudo-Siamese | 55 |
| abstract_inverted_index.effectiveness, | 41 |
| abstract_inverted_index.characteristics | 91 |
| abstract_inverted_index.inconsistencies | 12 |
| cited_by_percentile_year | |
| countries_distinct_count | 0 |
| institutions_distinct_count | 13 |
| citation_normalized_percentile |
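
The `abstract_inverted_index.*` rows in the table above store the abstract as a map from each token to the zero-based positions where it occurs. A minimal sketch of turning such a map back into running text follows; the function name `rebuild_abstract` is hypothetical, and the sample dictionary holds only positions 0 to 6 copied from those rows rather than the full index.

```python
def rebuild_abstract(inverted_index: dict[str, list[int]]) -> str:
    """Rebuild text from an inverted index that maps token -> list of positions."""
    positioned = [
        (position, token)
        for token, positions in inverted_index.items()
        for position in positions
    ]
    # Sorting by position restores the original word order.
    positioned.sort()
    return " ".join(token for _, token in positioned)


if __name__ == "__main__":
    # Positions 0-6 only, copied from the payload rows above (not the full index).
    sample = {
        "The": [0],
        "evaluation": [1],
        "of": [2],
        "large": [3],
        "language": [4],
        "models": [5],
        "(LLMs)": [6],
    }
    print(rebuild_abstract(sample))
```

Applied to the complete index, this approach should reproduce the abstract text shown near the top of the record, since punctuation is kept attached to the tokens.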