Evaluating Evaluation Metrics -- The Mirage of Hallucination Detection
2025 · Open Access
· DOI: https://doi.org/10.48550/arxiv.2504.18114
Hallucinations pose a significant obstacle to the reliability and widespread adoption of language models, yet their accurate measurement remains a persistent challenge. While many task- and domain-specific metrics have been proposed to assess faithfulness and factuality concerns, the robustness and generalization of these metrics are still untested. In this paper, we conduct a large-scale empirical evaluation of 6 diverse sets of hallucination detection metrics across 4 datasets, 37 language models from 5 families, and 5 decoding methods. Our extensive investigation reveals concerning gaps in current hallucination evaluation: metrics often fail to align with human judgments, take an overtly myopic view of the problem, and show inconsistent gains with parameter scaling. Encouragingly, LLM-based evaluation, particularly with GPT-4, yields the best overall results, and mode-seeking decoding methods seem to reduce hallucinations, especially in knowledge-grounded settings. These findings underscore the need for more robust metrics to understand and quantify hallucinations, and better strategies to mitigate them.
- Type: preprint
- Language: en
- Landing Page: http://arxiv.org/abs/2504.18114, https://arxiv.org/pdf/2504.18114
- OA Status: green
- OpenAlex ID: https://openalex.org/W4416388728
Raw OpenAlex JSON
- OpenAlex ID: https://openalex.org/W4416388728 (canonical identifier for this work in OpenAlex)
- DOI: https://doi.org/10.48550/arxiv.2504.18114 (Digital Object Identifier)
- Title: Evaluating Evaluation Metrics -- The Mirage of Hallucination Detection (work title)
- Type: preprint (OpenAlex work type)
- Language: en (primary language)
- Publication year: 2025 (year of publication)
- Publication date: 2025-04-25 (full publication date if available)
- Authors: Atharva Kulkarni, Yuan Zhang, Joel Ruben Antony Moniz, Xiou Ge, Bo-Hsiang Tseng, Dhivya Piraviperumal, Swabha Swayamdipta, Hong Yu (list of authors in order)
- Landing page: https://arxiv.org/abs/2504.18114 (publisher landing page)
- PDF URL: https://arxiv.org/pdf/2504.18114 (direct link to full text PDF)
- Open access: Yes (whether a free full text is available)
- OA status: green (open access status per OpenAlex)
- OA URL: https://arxiv.org/pdf/2504.18114 (direct OA link when available)
- Cited by: 0 (total citation count in OpenAlex)
Full payload
| id | https://openalex.org/W4416388728 |
|---|---|
| doi | https://doi.org/10.48550/arxiv.2504.18114 |
| ids.doi | https://doi.org/10.48550/arxiv.2504.18114 |
| ids.openalex | https://openalex.org/W4416388728 |
| fwci | |
| type | preprint |
| title | Evaluating Evaluation Metrics -- The Mirage of Hallucination Detection |
| biblio.issue | |
| biblio.volume | |
| biblio.last_page | |
| biblio.first_page | |
| is_xpac | False |
| apc_list | |
| apc_paid | |
| language | en |
| locations[0].id | pmh:oai:arXiv.org:2504.18114 |
| locations[0].is_oa | True |
| locations[0].source.id | https://openalex.org/S4306400194 |
| locations[0].source.issn | |
| locations[0].source.type | repository |
| locations[0].source.is_oa | True |
| locations[0].source.issn_l | |
| locations[0].source.is_core | False |
| locations[0].source.is_in_doaj | False |
| locations[0].source.display_name | arXiv (Cornell University) |
| locations[0].source.host_organization | https://openalex.org/I205783295 |
| locations[0].source.host_organization_name | Cornell University |
| locations[0].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[0].license | |
| locations[0].pdf_url | https://arxiv.org/pdf/2504.18114 |
| locations[0].version | submittedVersion |
| locations[0].raw_type | text |
| locations[0].license_id | |
| locations[0].is_accepted | False |
| locations[0].is_published | False |
| locations[0].raw_source_name | |
| locations[0].landing_page_url | http://arxiv.org/abs/2504.18114 |
| locations[1].id | doi:10.48550/arxiv.2504.18114 |
| locations[1].is_oa | True |
| locations[1].source.id | https://openalex.org/S4306400194 |
| locations[1].source.issn | |
| locations[1].source.type | repository |
| locations[1].source.is_oa | True |
| locations[1].source.issn_l | |
| locations[1].source.is_core | False |
| locations[1].source.is_in_doaj | False |
| locations[1].source.display_name | arXiv (Cornell University) |
| locations[1].source.host_organization | https://openalex.org/I205783295 |
| locations[1].source.host_organization_name | Cornell University |
| locations[1].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[1].license | |
| locations[1].pdf_url | |
| locations[1].version | |
| locations[1].raw_type | article |
| locations[1].license_id | |
| locations[1].is_accepted | False |
| locations[1].is_published | |
| locations[1].raw_source_name | |
| locations[1].landing_page_url | https://doi.org/10.48550/arxiv.2504.18114 |
| indexed_in | arxiv, datacite |
| authorships[0].author.id | https://openalex.org/A5032486279 |
| authorships[0].author.orcid | https://orcid.org/0000-0003-4847-4427 |
| authorships[0].author.display_name | Atharva Kulkarni |
| authorships[0].author_position | first |
| authorships[0].raw_author_name | Kulkarni, Atharva |
| authorships[0].is_corresponding | False |
| authorships[1].author.id | https://openalex.org/A5100368721 |
| authorships[1].author.orcid | https://orcid.org/0000-0002-6879-7859 |
| authorships[1].author.display_name | Yuan Zhang |
| authorships[1].author_position | middle |
| authorships[1].raw_author_name | Zhang, Yuan |
| authorships[1].is_corresponding | False |
| authorships[2].author.id | https://openalex.org/A5000244424 |
| authorships[2].author.orcid | |
| authorships[2].author.display_name | Joel Ruben Antony Moniz |
| authorships[2].author_position | middle |
| authorships[2].raw_author_name | Moniz, Joel Ruben Antony |
| authorships[2].is_corresponding | False |
| authorships[3].author.id | https://openalex.org/A5002684022 |
| authorships[3].author.orcid | https://orcid.org/0000-0002-8263-2073 |
| authorships[3].author.display_name | Xiou Ge |
| authorships[3].author_position | middle |
| authorships[3].raw_author_name | Ge, Xiou |
| authorships[3].is_corresponding | False |
| authorships[4].author.id | https://openalex.org/A5039215196 |
| authorships[4].author.orcid | |
| authorships[4].author.display_name | Bo-Hsiang Tseng |
| authorships[4].author_position | middle |
| authorships[4].raw_author_name | Tseng, Bo-Hsiang |
| authorships[4].is_corresponding | False |
| authorships[5].author.id | https://openalex.org/A5077148644 |
| authorships[5].author.orcid | |
| authorships[5].author.display_name | Dhivya Piraviperumal |
| authorships[5].author_position | middle |
| authorships[5].raw_author_name | Piraviperumal, Dhivya |
| authorships[5].is_corresponding | False |
| authorships[6].author.id | https://openalex.org/A5076880940 |
| authorships[6].author.orcid | https://orcid.org/0000-0002-5851-8254 |
| authorships[6].author.display_name | Swabha Swayamdipta |
| authorships[6].author_position | middle |
| authorships[6].raw_author_name | Swayamdipta, Swabha |
| authorships[6].is_corresponding | False |
| authorships[7].author.id | https://openalex.org/A5003746498 |
| authorships[7].author.orcid | https://orcid.org/0000-0003-0667-8413 |
| authorships[7].author.display_name | Hong Yu |
| authorships[7].author_position | last |
| authorships[7].raw_author_name | Yu, Hong |
| authorships[7].is_corresponding | False |
| has_content.pdf | False |
| has_content.grobid_xml | False |
| is_paratext | False |
| open_access.is_oa | True |
| open_access.oa_url | https://arxiv.org/pdf/2504.18114 |
| open_access.oa_status | green |
| open_access.any_repository_has_fulltext | False |
| created_date | 2025-10-10T00:00:00 |
| display_name | Evaluating Evaluation Metrics -- The Mirage of Hallucination Detection |
| has_fulltext | False |
| is_retracted | False |
| updated_date | 2025-11-28T12:47:28.216042 |
| primary_topic | |
| cited_by_count | 0 |
| locations_count | 2 |
| best_oa_location.id | pmh:oai:arXiv.org:2504.18114 |
| best_oa_location.is_oa | True |
| best_oa_location.source.id | https://openalex.org/S4306400194 |
| best_oa_location.source.issn | |
| best_oa_location.source.type | repository |
| best_oa_location.source.is_oa | True |
| best_oa_location.source.issn_l | |
| best_oa_location.source.is_core | False |
| best_oa_location.source.is_in_doaj | False |
| best_oa_location.source.display_name | arXiv (Cornell University) |
| best_oa_location.source.host_organization | https://openalex.org/I205783295 |
| best_oa_location.source.host_organization_name | Cornell University |
| best_oa_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| best_oa_location.license | |
| best_oa_location.pdf_url | https://arxiv.org/pdf/2504.18114 |
| best_oa_location.version | submittedVersion |
| best_oa_location.raw_type | text |
| best_oa_location.license_id | |
| best_oa_location.is_accepted | False |
| best_oa_location.is_published | False |
| best_oa_location.raw_source_name | |
| best_oa_location.landing_page_url | http://arxiv.org/abs/2504.18114 |
| primary_location.id | pmh:oai:arXiv.org:2504.18114 |
| primary_location.is_oa | True |
| primary_location.source.id | https://openalex.org/S4306400194 |
| primary_location.source.issn | |
| primary_location.source.type | repository |
| primary_location.source.is_oa | True |
| primary_location.source.issn_l | |
| primary_location.source.is_core | False |
| primary_location.source.is_in_doaj | False |
| primary_location.source.display_name | arXiv (Cornell University) |
| primary_location.source.host_organization | https://openalex.org/I205783295 |
| primary_location.source.host_organization_name | Cornell University |
| primary_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| primary_location.license | |
| primary_location.pdf_url | https://arxiv.org/pdf/2504.18114 |
| primary_location.version | submittedVersion |
| primary_location.raw_type | text |
| primary_location.license_id | |
| primary_location.is_accepted | False |
| primary_location.is_published | False |
| primary_location.raw_source_name | |
| primary_location.landing_page_url | http://arxiv.org/abs/2504.18114 |
| publication_date | 2025-04-25 |
| publication_year | 2025 |
| referenced_works_count | 0 |
| abstract_inverted_index.4 | 65 |
| abstract_inverted_index.5 | 71, 74 |
| abstract_inverted_index.6 | 57 |
| abstract_inverted_index.a | 2, 19, 52 |
| abstract_inverted_index.37 | 67 |
| abstract_inverted_index.In | 47 |
| abstract_inverted_index.an | 96 |
| abstract_inverted_index.in | 83, 130 |
| abstract_inverted_index.of | 11, 41, 56, 60, 100 |
| abstract_inverted_index.to | 5, 31, 90, 126, 142, 150 |
| abstract_inverted_index.we | 50 |
| abstract_inverted_index.Our | 77 |
| abstract_inverted_index.and | 8, 25, 34, 39, 73, 103, 121, 144, 147 |
| abstract_inverted_index.are | 44 |
| abstract_inverted_index.for | 138 |
| abstract_inverted_index.the | 6, 37, 101, 117, 136 |
| abstract_inverted_index.yet | 14 |
| abstract_inverted_index.been | 29 |
| abstract_inverted_index.best | 118 |
| abstract_inverted_index.fail | 89 |
| abstract_inverted_index.from | 70 |
| abstract_inverted_index.gaps | 82 |
| abstract_inverted_index.have | 28 |
| abstract_inverted_index.many | 23 |
| abstract_inverted_index.more | 139 |
| abstract_inverted_index.need | 137 |
| abstract_inverted_index.pose | 1 |
| abstract_inverted_index.seem | 125 |
| abstract_inverted_index.sets | 59 |
| abstract_inverted_index.show | 104 |
| abstract_inverted_index.take | 95 |
| abstract_inverted_index.this | 48 |
| abstract_inverted_index.view | 99 |
| abstract_inverted_index.with | 92, 107, 114 |
| abstract_inverted_index.These | 133 |
| abstract_inverted_index.While | 22 |
| abstract_inverted_index.align | 91 |
| abstract_inverted_index.gains | 106 |
| abstract_inverted_index.human | 93 |
| abstract_inverted_index.often | 88 |
| abstract_inverted_index.still | 45 |
| abstract_inverted_index.task- | 24 |
| abstract_inverted_index.their | 15 |
| abstract_inverted_index.them. | 152 |
| abstract_inverted_index.these | 42 |
| abstract_inverted_index.GPT-4, | 115 |
| abstract_inverted_index.across | 64 |
| abstract_inverted_index.assess | 32 |
| abstract_inverted_index.better | 148 |
| abstract_inverted_index.models | 69 |
| abstract_inverted_index.myopic | 98 |
| abstract_inverted_index.paper, | 49 |
| abstract_inverted_index.reduce | 127 |
| abstract_inverted_index.robust | 140 |
| abstract_inverted_index.yields | 116 |
| abstract_inverted_index.conduct | 51 |
| abstract_inverted_index.current | 84 |
| abstract_inverted_index.diverse | 58 |
| abstract_inverted_index.methods | 124 |
| abstract_inverted_index.metrics | 27, 43, 63, 87, 141 |
| abstract_inverted_index.models, | 13 |
| abstract_inverted_index.overall | 119 |
| abstract_inverted_index.overtly | 97 |
| abstract_inverted_index.remains | 18 |
| abstract_inverted_index.reveals | 80 |
| abstract_inverted_index.accurate | 16 |
| abstract_inverted_index.adoption | 10 |
| abstract_inverted_index.decoding | 75, 123 |
| abstract_inverted_index.findings | 134 |
| abstract_inverted_index.language | 12, 68 |
| abstract_inverted_index.methods. | 76 |
| abstract_inverted_index.mitigate | 151 |
| abstract_inverted_index.obstacle | 4 |
| abstract_inverted_index.problem, | 102 |
| abstract_inverted_index.proposed | 30 |
| abstract_inverted_index.quantify | 145 |
| abstract_inverted_index.results, | 120 |
| abstract_inverted_index.scaling. | 109 |
| abstract_inverted_index.LLM-based | 111 |
| abstract_inverted_index.concerns, | 36 |
| abstract_inverted_index.datasets, | 66 |
| abstract_inverted_index.detection | 62 |
| abstract_inverted_index.empirical | 54 |
| abstract_inverted_index.extensive | 78 |
| abstract_inverted_index.families, | 72 |
| abstract_inverted_index.parameter | 108 |
| abstract_inverted_index.settings. | 132 |
| abstract_inverted_index.untested. | 46 |
| abstract_inverted_index.challenge. | 21 |
| abstract_inverted_index.concerning | 81 |
| abstract_inverted_index.especially | 129 |
| abstract_inverted_index.evaluation | 55 |
| abstract_inverted_index.factuality | 35 |
| abstract_inverted_index.judgments, | 94 |
| abstract_inverted_index.persistent | 20 |
| abstract_inverted_index.robustness | 38 |
| abstract_inverted_index.strategies | 149 |
| abstract_inverted_index.underscore | 135 |
| abstract_inverted_index.understand | 143 |
| abstract_inverted_index.widespread | 9 |
| abstract_inverted_index.evaluation, | 112 |
| abstract_inverted_index.evaluation: | 86 |
| abstract_inverted_index.large-scale | 53 |
| abstract_inverted_index.measurement | 17 |
| abstract_inverted_index.reliability | 7 |
| abstract_inverted_index.significant | 3 |
| abstract_inverted_index.faithfulness | 33 |
| abstract_inverted_index.inconsistent | 105 |
| abstract_inverted_index.mode-seeking | 122 |
| abstract_inverted_index.particularly | 113 |
| abstract_inverted_index.hallucination | 61, 85 |
| abstract_inverted_index.investigation | 79 |
| abstract_inverted_index.Encouragingly, | 110 |
| abstract_inverted_index.Hallucinations | 0 |
| abstract_inverted_index.generalization | 40 |
| abstract_inverted_index.domain-specific | 26 |
| abstract_inverted_index.hallucinations, | 128, 146 |
| abstract_inverted_index.knowledge-grounded | 131 |
| cited_by_percentile_year | |
| countries_distinct_count | 0 |
| institutions_distinct_count | 8 |
| citation_normalized_percentile | |
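
The `abstract_inverted_index.*` rows in the table map each token of the abstract to the zero-based positions where it occurs. As a minimal sketch of how the running text could be recovered from such a mapping, the Python snippet below inverts it. The function name `rebuild_abstract` and the `sample` dictionary are illustrative assumptions, not part of OpenAlex, and the sample is an excerpt limited to positions 0 through 7 of the rows above.

```python
def rebuild_abstract(inverted_index: dict[str, list[int]]) -> str:
    """Turn a token -> positions mapping back into space-separated text."""
    by_position: dict[int, str] = {}
    for token, positions in inverted_index.items():
        for pos in positions:
            by_position[pos] = token
    # Emit tokens in ascending position order.
    return " ".join(by_position[pos] for pos in sorted(by_position))


# Hypothetical excerpt: only the entries for positions 0-7 are included.
sample = {
    "Hallucinations": [0],
    "pose": [1],
    "a": [2],
    "significant": [3],
    "obstacle": [4],
    "to": [5],
    "the": [6],
    "reliability": [7],
}

print(rebuild_abstract(sample))
# prints "Hallucinations pose a significant obstacle to the reliability"
```

Tokens that occur more than once, such as `hallucination` at positions 61 and 85, simply list several positions, and the same loop places a copy at each one.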