KnowDR-REC: A Benchmark for Referring Expression Comprehension with Real-World Knowledge
2025 · Open Access · DOI: https://doi.org/10.48550/arxiv.2508.14080
Referring Expression Comprehension (REC) is a popular multimodal task that aims to accurately detect target objects within a single image based on a given textual expression. However, due to the limitations of earlier models, traditional REC benchmarks either rely solely on intra-image cues or lack sufficiently fine-grained instance annotations, making them inadequate for evaluating the reasoning capabilities of Multi-modal Large Language Models (MLLMs). To address this gap, we propose a new benchmark, KnowDR-REC, characterized by three key features: Firstly, it is built upon real-world knowledge, requiring fine-grained multimodal reasoning across text and image. Secondly, the dataset includes elaborately constructed negative samples via fine-grained expression editing, designed to evaluate a model's robustness and anti-hallucination ability. Lastly, we introduce three novel evaluation metrics to systematically explore the model's internal reasoning process. We evaluate 16 state-of-the-art multimodal models on KnowDR-REC, with experimental results showing that existing MLLMs still struggle with knowledge-driven visual grounding tasks. Furthermore, we observe a decoupling between textual understanding and visual grounding in MLLMs, where many models are significantly influenced by memorized shortcut correlations, which severely affect their behavior on our benchmark and hinder genuine multimodal reasoning. We anticipate that the proposed benchmark will inspire future research towards developing more robust, interpretable, and knowledge-intensive visual grounding frameworks, driving the development of more reliable and robust multimodal systems for complex real-world scenarios.
Metadata
- Type: preprint
- Language: en
- Landing Page: http://arxiv.org/abs/2508.14080
- PDF: https://arxiv.org/pdf/2508.14080
- OA Status: green
- OpenAlex ID: https://openalex.org/W4415238143
Raw OpenAlex JSON
- OpenAlex ID: https://openalex.org/W4415238143 (canonical identifier for this work in OpenAlex)
- DOI: https://doi.org/10.48550/arxiv.2508.14080 (Digital Object Identifier)
- Title: KnowDR-REC: A Benchmark for Referring Expression Comprehension with Real-World Knowledge
- Type: preprint (OpenAlex work type)
- Language: en (primary language)
- Publication year: 2025
- Publication date: 2025-08-12
- Authors: Jin Guanghao, Jingpei Wu, Guoan Tang, Yuzhen Niu, Weidong Zhou, Guoyang Liu (in order)
- Landing page: https://arxiv.org/abs/2508.14080
- PDF URL: https://arxiv.org/pdf/2508.14080 (direct link to full-text PDF)
- Open access: Yes (a free full text is available)
- OA status: green (per OpenAlex)
- OA URL: https://arxiv.org/pdf/2508.14080
- Cited by: 0 (total citation count in OpenAlex)
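As a minimal sketch, the raw record summarized above can be fetched from the OpenAlex REST API, which serves works at `https://api.openalex.org/works/{id}`; the helper name `openalex_api_url` below is ours, not part of any library:

```python
# Sketch: derive the OpenAlex API URL for this record from its canonical ID URL.
# Assumes the public endpoint pattern https://api.openalex.org/works/{id}.
from urllib.parse import urlsplit


def openalex_api_url(openalex_id_url: str) -> str:
    """Turn a canonical ID URL (https://openalex.org/W...) into an API URL."""
    work_id = urlsplit(openalex_id_url).path.lstrip("/")  # e.g. "W4415238143"
    return f"https://api.openalex.org/works/{work_id}"


print(openalex_api_url("https://openalex.org/W4415238143"))
# The JSON payload could then be retrieved with, e.g.:
#   import json, urllib.request
#   record = json.load(urllib.request.urlopen(openalex_api_url(...)))
```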
Full payload
| Field | Value |
|---|---|
| id | https://openalex.org/W4415238143 |
| doi | https://doi.org/10.48550/arxiv.2508.14080 |
| ids.doi | https://doi.org/10.48550/arxiv.2508.14080 |
| ids.openalex | https://openalex.org/W4415238143 |
| fwci | |
| type | preprint |
| title | KnowDR-REC: A Benchmark for Referring Expression Comprehension with Real-World Knowledge |
| biblio.issue | |
| biblio.volume | |
| biblio.last_page | |
| biblio.first_page | |
| topics[0].id | https://openalex.org/T10181 |
| topics[0].field.id | https://openalex.org/fields/17 |
| topics[0].field.display_name | Computer Science |
| topics[0].score | 0.892300009727478 |
| topics[0].domain.id | https://openalex.org/domains/3 |
| topics[0].domain.display_name | Physical Sciences |
| topics[0].subfield.id | https://openalex.org/subfields/1702 |
| topics[0].subfield.display_name | Artificial Intelligence |
| topics[0].display_name | Natural Language Processing Techniques |
| topics[1].id | https://openalex.org/T10028 |
| topics[1].field.id | https://openalex.org/fields/17 |
| topics[1].field.display_name | Computer Science |
| topics[1].score | 0.863099992275238 |
| topics[1].domain.id | https://openalex.org/domains/3 |
| topics[1].domain.display_name | Physical Sciences |
| topics[1].subfield.id | https://openalex.org/subfields/1702 |
| topics[1].subfield.display_name | Artificial Intelligence |
| topics[1].display_name | Topic Modeling |
| topics[2].id | https://openalex.org/T11902 |
| topics[2].field.id | https://openalex.org/fields/17 |
| topics[2].field.display_name | Computer Science |
| topics[2].score | 0.8034999966621399 |
| topics[2].domain.id | https://openalex.org/domains/3 |
| topics[2].domain.display_name | Physical Sciences |
| topics[2].subfield.id | https://openalex.org/subfields/1702 |
| topics[2].subfield.display_name | Artificial Intelligence |
| topics[2].display_name | Intelligent Tutoring Systems and Adaptive Learning |
| is_xpac | False |
| apc_list | |
| apc_paid | |
| language | en |
| locations[0].id | pmh:oai:arXiv.org:2508.14080 |
| locations[0].is_oa | True |
| locations[0].source.id | https://openalex.org/S4306400194 |
| locations[0].source.issn | |
| locations[0].source.type | repository |
| locations[0].source.is_oa | True |
| locations[0].source.issn_l | |
| locations[0].source.is_core | False |
| locations[0].source.is_in_doaj | False |
| locations[0].source.display_name | arXiv (Cornell University) |
| locations[0].source.host_organization | https://openalex.org/I205783295 |
| locations[0].source.host_organization_name | Cornell University |
| locations[0].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[0].license | |
| locations[0].pdf_url | https://arxiv.org/pdf/2508.14080 |
| locations[0].version | submittedVersion |
| locations[0].raw_type | text |
| locations[0].license_id | |
| locations[0].is_accepted | False |
| locations[0].is_published | False |
| locations[0].raw_source_name | |
| locations[0].landing_page_url | http://arxiv.org/abs/2508.14080 |
| locations[1].id | doi:10.48550/arxiv.2508.14080 |
| locations[1].is_oa | True |
| locations[1].source.id | https://openalex.org/S4306400194 |
| locations[1].source.issn | |
| locations[1].source.type | repository |
| locations[1].source.is_oa | True |
| locations[1].source.issn_l | |
| locations[1].source.is_core | False |
| locations[1].source.is_in_doaj | False |
| locations[1].source.display_name | arXiv (Cornell University) |
| locations[1].source.host_organization | https://openalex.org/I205783295 |
| locations[1].source.host_organization_name | Cornell University |
| locations[1].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[1].license | cc-by |
| locations[1].pdf_url | |
| locations[1].version | |
| locations[1].raw_type | article |
| locations[1].license_id | https://openalex.org/licenses/cc-by |
| locations[1].is_accepted | False |
| locations[1].is_published | |
| locations[1].raw_source_name | |
| locations[1].landing_page_url | https://doi.org/10.48550/arxiv.2508.14080 |
| indexed_in | arxiv, datacite |
| authorships[0].author.id | https://openalex.org/A5102804945 |
| authorships[0].author.orcid | https://orcid.org/0000-0001-9881-5328 |
| authorships[0].author.display_name | Jin Guanghao |
| authorships[0].author_position | first |
| authorships[0].raw_author_name | Jin, Guanghao |
| authorships[0].is_corresponding | False |
| authorships[1].author.id | https://openalex.org/A5048975992 |
| authorships[1].author.orcid | |
| authorships[1].author.display_name | Jingpei Wu |
| authorships[1].author_position | middle |
| authorships[1].raw_author_name | Wu, Jingpei |
| authorships[1].is_corresponding | False |
| authorships[2].author.id | https://openalex.org/A5053457530 |
| authorships[2].author.orcid | https://orcid.org/0000-0002-1443-6134 |
| authorships[2].author.display_name | Guoan Tang |
| authorships[2].author_position | middle |
| authorships[2].raw_author_name | Guo, Tianpei |
| authorships[2].is_corresponding | False |
| authorships[3].author.id | https://openalex.org/A5066931530 |
| authorships[3].author.orcid | https://orcid.org/0000-0002-9874-9719 |
| authorships[3].author.display_name | Yuzhen Niu |
| authorships[3].author_position | middle |
| authorships[3].raw_author_name | Niu, Yiyi |
| authorships[3].is_corresponding | False |
| authorships[4].author.id | https://openalex.org/A5078596745 |
| authorships[4].author.orcid | https://orcid.org/0000-0002-1234-1035 |
| authorships[4].author.display_name | Weidong Zhou |
| authorships[4].author_position | middle |
| authorships[4].raw_author_name | Zhou, Weidong |
| authorships[4].is_corresponding | False |
| authorships[5].author.id | https://openalex.org/A5060733456 |
| authorships[5].author.orcid | https://orcid.org/0000-0002-5879-809X |
| authorships[5].author.display_name | Guoyang Liu |
| authorships[5].author_position | last |
| authorships[5].raw_author_name | Liu, Guoyang |
| authorships[5].is_corresponding | False |
| has_content.pdf | False |
| has_content.grobid_xml | False |
| is_paratext | False |
| open_access.is_oa | True |
| open_access.oa_url | https://arxiv.org/pdf/2508.14080 |
| open_access.oa_status | green |
| open_access.any_repository_has_fulltext | False |
| created_date | 2025-10-16T00:00:00 |
| display_name | KnowDR-REC: A Benchmark for Referring Expression Comprehension with Real-World Knowledge |
| has_fulltext | False |
| is_retracted | False |
| updated_date | 2025-11-06T06:51:31.235846 |
| primary_topic.id | https://openalex.org/T10181 |
| primary_topic.field.id | https://openalex.org/fields/17 |
| primary_topic.field.display_name | Computer Science |
| primary_topic.score | 0.892300009727478 |
| primary_topic.domain.id | https://openalex.org/domains/3 |
| primary_topic.domain.display_name | Physical Sciences |
| primary_topic.subfield.id | https://openalex.org/subfields/1702 |
| primary_topic.subfield.display_name | Artificial Intelligence |
| primary_topic.display_name | Natural Language Processing Techniques |
| cited_by_count | 0 |
| locations_count | 2 |
| best_oa_location.id | pmh:oai:arXiv.org:2508.14080 |
| best_oa_location.is_oa | True |
| best_oa_location.source.id | https://openalex.org/S4306400194 |
| best_oa_location.source.issn | |
| best_oa_location.source.type | repository |
| best_oa_location.source.is_oa | True |
| best_oa_location.source.issn_l | |
| best_oa_location.source.is_core | False |
| best_oa_location.source.is_in_doaj | False |
| best_oa_location.source.display_name | arXiv (Cornell University) |
| best_oa_location.source.host_organization | https://openalex.org/I205783295 |
| best_oa_location.source.host_organization_name | Cornell University |
| best_oa_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| best_oa_location.license | |
| best_oa_location.pdf_url | https://arxiv.org/pdf/2508.14080 |
| best_oa_location.version | submittedVersion |
| best_oa_location.raw_type | text |
| best_oa_location.license_id | |
| best_oa_location.is_accepted | False |
| best_oa_location.is_published | False |
| best_oa_location.raw_source_name | |
| best_oa_location.landing_page_url | http://arxiv.org/abs/2508.14080 |
| primary_location.id | pmh:oai:arXiv.org:2508.14080 |
| primary_location.is_oa | True |
| primary_location.source.id | https://openalex.org/S4306400194 |
| primary_location.source.issn | |
| primary_location.source.type | repository |
| primary_location.source.is_oa | True |
| primary_location.source.issn_l | |
| primary_location.source.is_core | False |
| primary_location.source.is_in_doaj | False |
| primary_location.source.display_name | arXiv (Cornell University) |
| primary_location.source.host_organization | https://openalex.org/I205783295 |
| primary_location.source.host_organization_name | Cornell University |
| primary_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| primary_location.license | |
| primary_location.pdf_url | https://arxiv.org/pdf/2508.14080 |
| primary_location.version | submittedVersion |
| primary_location.raw_type | text |
| primary_location.license_id | |
| primary_location.is_accepted | False |
| primary_location.is_published | False |
| primary_location.raw_source_name | |
| primary_location.landing_page_url | http://arxiv.org/abs/2508.14080 |
| publication_date | 2025-08-12 |
| publication_year | 2025 |
| referenced_works_count | 0 |
| abstract_inverted_index | (token-to-positions map for the abstract; omitted here, since the full abstract text appears above) |
| cited_by_percentile_year | |
| countries_distinct_count | 0 |
| institutions_distinct_count | 6 |
| citation_normalized_percentile | |
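The `abstract_inverted_index` field in the payload stores the abstract as a map from each token to the word positions where it occurs. A minimal sketch of rebuilding the plain text from such an index (the helper name `decode_abstract` is ours):

```python
# Sketch: reconstruct an abstract from an OpenAlex-style inverted index,
# i.e. a dict mapping token -> list of word positions.

def decode_abstract(inverted_index: dict[str, list[int]]) -> str:
    # Invert the mapping: position -> token.
    positions: dict[int, str] = {}
    for token, idxs in inverted_index.items():
        for i in idxs:
            positions[i] = token
    # Emit tokens in position order, joined by single spaces.
    return " ".join(positions[i] for i in sorted(positions))


example = {"Referring": [0], "Expression": [1], "Comprehension": [2], "(REC)": [3]}
print(decode_abstract(example))  # Referring Expression Comprehension (REC)
```

Repeated tokens (e.g. "the" at several positions) are handled naturally, since each position gets its own entry in the inverted mapping.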