Are They the Same? Exploring Visual Correspondence Shortcomings of Multimodal LLMs
2025 · Open Access · DOI: https://doi.org/10.48550/arxiv.2501.04670
Recent advances in multimodal large language models (MLLMs) have shown strong abilities in visual perception, reasoning, and vision-language understanding. However, the visual matching ability of MLLMs is rarely studied, even though finding the visual correspondence of objects is essential in computer vision. Our research reveals that the matching capabilities of recent MLLMs still exhibit systematic shortcomings, even in strong current models such as GPT-4o. In particular, we construct a Multimodal Visual Matching (MMVM) benchmark to fairly evaluate over 30 different MLLMs. The MMVM benchmark is built from 15 open-source datasets and Internet videos with manual annotation. We categorize the data samples of the MMVM benchmark into eight aspects, based on the required cues and capabilities, to evaluate and analyze current MLLMs more comprehensively. In addition, we design an automatic annotation pipeline to generate the MMVM SFT dataset, which includes 220K visual matching samples with reasoning annotations. To our knowledge, this is the first visual correspondence dataset and benchmark for the MLLM community. Finally, we present CoLVA, a novel contrastive MLLM with two technical designs: a fine-grained vision expert with object-level contrastive learning, and an instruction augmentation strategy. The former learns instance-discriminative tokens, while the latter further improves instruction-following ability. CoLVA-InternVL2-4B achieves an overall accuracy (OA) of 49.80% on the MMVM benchmark, surpassing GPT-4o and the best open-source MLLM, Qwen2VL-72B, by 7.15% and 11.72% OA, respectively. These results demonstrate the effectiveness of our MMVM SFT dataset and our technical designs. Code, benchmark, dataset, and models will be released.
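
The abstract reports CoLVA-InternVL2-4B's overall accuracy together with its margins over two baselines, but not the baselines' own scores. The short sketch below works out the implied baseline accuracies. It assumes that the stated margins are absolute differences in percentage points, which the abstract does not spell out; the variable names are illustrative only, and the derived numbers are not figures quoted from the paper.

```python
from decimal import Decimal

# Figures quoted in the abstract.
colva_oa = Decimal("49.80")
margins = {"GPT-4o": Decimal("7.15"), "Qwen2VL-72B": Decimal("11.72")}

# Assumption: each margin is an absolute gap in percentage points,
# so the implied baseline OA is CoLVA's OA minus the margin.
implied_baseline_oa = {name: colva_oa - gap for name, gap in margins.items()}

for name, oa in implied_baseline_oa.items():
    print(f"{name}: implied OA = {oa}%")
```

Under that assumption, GPT-4o would sit at 42.65% OA and Qwen2VL-72B at 38.08% OA on the MMVM benchmark. `Decimal` is used here so the subtraction is exact rather than subject to binary floating-point rounding.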
- Type: preprint
- Language: en
- Landing Page: http://arxiv.org/abs/2501.04670, https://arxiv.org/pdf/2501.04670
- OA Status: green
- Related Works: 10
- OpenAlex ID: https://openalex.org/W4406231658
Raw OpenAlex JSON
- OpenAlex ID: https://openalex.org/W4406231658 (canonical identifier for this work in OpenAlex)
- DOI: https://doi.org/10.48550/arxiv.2501.04670 (Digital Object Identifier)
- Title: Are They the Same? Exploring Visual Correspondence Shortcomings of Multimodal LLMs (work title)
- Type: preprint (OpenAlex work type)
- Language: en (primary language)
- Publication year: 2025 (year of publication)
- Publication date: 2025-01-08 (full publication date if available)
- Authors: Yikang Zhou, Tao Zhang, Shilin Xu, Shihao Chen, Qianyu Zhou, Yunhai Tong, Shunping Ji, Jiangning Zhang, Xiangtai Li, Lu Qi (list of authors in order)
- Landing page: https://arxiv.org/abs/2501.04670 (publisher landing page)
- PDF URL: https://arxiv.org/pdf/2501.04670 (direct link to full text PDF)
- Open access: Yes (whether a free full text is available)
- OA status: green (open access status per OpenAlex)
- OA URL: https://arxiv.org/pdf/2501.04670 (direct OA link when available)
- Concepts: Psychology (top concepts, i.e. fields/topics, attached by OpenAlex)
- Cited by: 0 (total citation count in OpenAlex)
- Related works (count): 10 (other works algorithmically related by OpenAlex)
Full payload
| id | https://openalex.org/W4406231658 |
|---|---|
| doi | https://doi.org/10.48550/arxiv.2501.04670 |
| ids.doi | https://doi.org/10.48550/arxiv.2501.04670 |
| ids.openalex | https://openalex.org/W4406231658 |
| fwci | |
| type | preprint |
| title | Are They the Same? Exploring Visual Correspondence Shortcomings of Multimodal LLMs |
| biblio.issue | |
| biblio.volume | |
| biblio.last_page | |
| biblio.first_page | |
| topics[0].id | https://openalex.org/T11148 |
| topics[0].field.id | https://openalex.org/fields/32 |
| topics[0].field.display_name | Psychology |
| topics[0].score | 0.986299991607666 |
| topics[0].domain.id | https://openalex.org/domains/2 |
| topics[0].domain.display_name | Social Sciences |
| topics[0].subfield.id | https://openalex.org/subfields/3205 |
| topics[0].subfield.display_name | Experimental and Cognitive Psychology |
| topics[0].display_name | Language, Metaphor, and Cognition |
| topics[1].id | https://openalex.org/T12881 |
| topics[1].field.id | https://openalex.org/fields/12 |
| topics[1].field.display_name | Arts and Humanities |
| topics[1].score | 0.9846000075340271 |
| topics[1].domain.id | https://openalex.org/domains/2 |
| topics[1].domain.display_name | Social Sciences |
| topics[1].subfield.id | https://openalex.org/subfields/1203 |
| topics[1].subfield.display_name | Language and Linguistics |
| topics[1].display_name | linguistics and terminology studies |
| topics[2].id | https://openalex.org/T10759 |
| topics[2].field.id | https://openalex.org/fields/12 |
| topics[2].field.display_name | Arts and Humanities |
| topics[2].score | 0.9811999797821045 |
| topics[2].domain.id | https://openalex.org/domains/2 |
| topics[2].domain.display_name | Social Sciences |
| topics[2].subfield.id | https://openalex.org/subfields/1203 |
| topics[2].subfield.display_name | Language and Linguistics |
| topics[2].display_name | Translation Studies and Practices |
| is_xpac | False |
| apc_list | |
| apc_paid | |
| concepts[0].id | https://openalex.org/C15744967 |
| concepts[0].level | 0 |
| concepts[0].score | 0.35192638635635376 |
| concepts[0].wikidata | https://www.wikidata.org/wiki/Q9418 |
| concepts[0].display_name | Psychology |
| keywords[0].id | https://openalex.org/keywords/psychology |
| keywords[0].score | 0.35192638635635376 |
| keywords[0].display_name | Psychology |
| language | en |
| locations[0].id | pmh:oai:arXiv.org:2501.04670 |
| locations[0].is_oa | True |
| locations[0].source.id | https://openalex.org/S4306400194 |
| locations[0].source.issn | |
| locations[0].source.type | repository |
| locations[0].source.is_oa | True |
| locations[0].source.issn_l | |
| locations[0].source.is_core | False |
| locations[0].source.is_in_doaj | False |
| locations[0].source.display_name | arXiv (Cornell University) |
| locations[0].source.host_organization | https://openalex.org/I205783295 |
| locations[0].source.host_organization_name | Cornell University |
| locations[0].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[0].license | |
| locations[0].pdf_url | https://arxiv.org/pdf/2501.04670 |
| locations[0].version | submittedVersion |
| locations[0].raw_type | text |
| locations[0].license_id | |
| locations[0].is_accepted | False |
| locations[0].is_published | False |
| locations[0].raw_source_name | |
| locations[0].landing_page_url | http://arxiv.org/abs/2501.04670 |
| locations[1].id | doi:10.48550/arxiv.2501.04670 |
| locations[1].is_oa | True |
| locations[1].source.id | https://openalex.org/S4306400194 |
| locations[1].source.issn | |
| locations[1].source.type | repository |
| locations[1].source.is_oa | True |
| locations[1].source.issn_l | |
| locations[1].source.is_core | False |
| locations[1].source.is_in_doaj | False |
| locations[1].source.display_name | arXiv (Cornell University) |
| locations[1].source.host_organization | https://openalex.org/I205783295 |
| locations[1].source.host_organization_name | Cornell University |
| locations[1].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[1].license | |
| locations[1].pdf_url | |
| locations[1].version | |
| locations[1].raw_type | article |
| locations[1].license_id | |
| locations[1].is_accepted | False |
| locations[1].is_published | |
| locations[1].raw_source_name | |
| locations[1].landing_page_url | https://doi.org/10.48550/arxiv.2501.04670 |
| indexed_in | arxiv, datacite |
| authorships[0].author.id | https://openalex.org/A5104162474 |
| authorships[0].author.orcid | |
| authorships[0].author.display_name | Yikang Zhou |
| authorships[0].author_position | first |
| authorships[0].raw_author_name | Zhou, Yikang |
| authorships[0].is_corresponding | False |
| authorships[1].author.id | https://openalex.org/A5100375748 |
| authorships[1].author.orcid | https://orcid.org/0000-0001-9470-7215 |
| authorships[1].author.display_name | Tao Zhang |
| authorships[1].author_position | middle |
| authorships[1].raw_author_name | Zhang, Tao |
| authorships[1].is_corresponding | False |
| authorships[2].author.id | https://openalex.org/A5068311346 |
| authorships[2].author.orcid | https://orcid.org/0009-0009-1198-3739 |
| authorships[2].author.display_name | Shilin Xu |
| authorships[2].author_position | middle |
| authorships[2].raw_author_name | Xu, Shilin |
| authorships[2].is_corresponding | False |
| authorships[3].author.id | https://openalex.org/A5055000437 |
| authorships[3].author.orcid | https://orcid.org/0000-0001-7646-8003 |
| authorships[3].author.display_name | Shihao Chen |
| authorships[3].author_position | middle |
| authorships[3].raw_author_name | Chen, Shihao |
| authorships[3].is_corresponding | False |
| authorships[4].author.id | https://openalex.org/A5111039855 |
| authorships[4].author.orcid | |
| authorships[4].author.display_name | Qianyu Zhou |
| authorships[4].author_position | middle |
| authorships[4].raw_author_name | Zhou, Qianyu |
| authorships[4].is_corresponding | False |
| authorships[5].author.id | https://openalex.org/A5024097240 |
| authorships[5].author.orcid | https://orcid.org/0000-0001-8735-2516 |
| authorships[5].author.display_name | Yunhai Tong |
| authorships[5].author_position | middle |
| authorships[5].raw_author_name | Tong, Yunhai |
| authorships[5].is_corresponding | False |
| authorships[6].author.id | https://openalex.org/A5031588692 |
| authorships[6].author.orcid | https://orcid.org/0000-0002-3088-1481 |
| authorships[6].author.display_name | Shunping Ji |
| authorships[6].author_position | middle |
| authorships[6].raw_author_name | Ji, Shunping |
| authorships[6].is_corresponding | False |
| authorships[7].author.id | https://openalex.org/A5033141976 |
| authorships[7].author.orcid | https://orcid.org/0009-0009-2693-0883 |
| authorships[7].author.display_name | Jiansheng Zhang |
| authorships[7].author_position | middle |
| authorships[7].raw_author_name | Zhang, Jiangning |
| authorships[7].is_corresponding | False |
| authorships[8].author.id | https://openalex.org/A5089900108 |
| authorships[8].author.orcid | https://orcid.org/0000-0002-0550-8247 |
| authorships[8].author.display_name | Xiangtai Li |
| authorships[8].author_position | middle |
| authorships[8].raw_author_name | Li, Xiangtai |
| authorships[8].is_corresponding | False |
| authorships[9].author.id | https://openalex.org/A5100665478 |
| authorships[9].author.orcid | https://orcid.org/0000-0002-3596-2774 |
| authorships[9].author.display_name | Qi Lu |
| authorships[9].author_position | last |
| authorships[9].raw_author_name | Qi, Lu |
| authorships[9].is_corresponding | False |
| has_content.pdf | False |
| has_content.grobid_xml | False |
| is_paratext | False |
| open_access.is_oa | True |
| open_access.oa_url | https://arxiv.org/pdf/2501.04670 |
| open_access.oa_status | green |
| open_access.any_repository_has_fulltext | False |
| created_date | 2025-10-10T00:00:00 |
| display_name | Are They the Same? Exploring Visual Correspondence Shortcomings of Multimodal LLMs |
| has_fulltext | False |
| is_retracted | False |
| updated_date | 2025-11-06T06:51:31.235846 |
| primary_topic.id | https://openalex.org/T11148 |
| primary_topic.field.id | https://openalex.org/fields/32 |
| primary_topic.field.display_name | Psychology |
| primary_topic.score | 0.986299991607666 |
| primary_topic.domain.id | https://openalex.org/domains/2 |
| primary_topic.domain.display_name | Social Sciences |
| primary_topic.subfield.id | https://openalex.org/subfields/3205 |
| primary_topic.subfield.display_name | Experimental and Cognitive Psychology |
| primary_topic.display_name | Language, Metaphor, and Cognition |
| related_works | https://openalex.org/W4391375266, https://openalex.org/W2748952813, https://openalex.org/W2931662336, https://openalex.org/W2077865380, https://openalex.org/W3006817050, https://openalex.org/W4401768695, https://openalex.org/W2765597752, https://openalex.org/W2134894512, https://openalex.org/W2083375246, https://openalex.org/W2067108088 |
| cited_by_count | 0 |
| locations_count | 2 |
| best_oa_location.id | pmh:oai:arXiv.org:2501.04670 |
| best_oa_location.is_oa | True |
| best_oa_location.source.id | https://openalex.org/S4306400194 |
| best_oa_location.source.issn | |
| best_oa_location.source.type | repository |
| best_oa_location.source.is_oa | True |
| best_oa_location.source.issn_l | |
| best_oa_location.source.is_core | False |
| best_oa_location.source.is_in_doaj | False |
| best_oa_location.source.display_name | arXiv (Cornell University) |
| best_oa_location.source.host_organization | https://openalex.org/I205783295 |
| best_oa_location.source.host_organization_name | Cornell University |
| best_oa_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| best_oa_location.license | |
| best_oa_location.pdf_url | https://arxiv.org/pdf/2501.04670 |
| best_oa_location.version | submittedVersion |
| best_oa_location.raw_type | text |
| best_oa_location.license_id | |
| best_oa_location.is_accepted | False |
| best_oa_location.is_published | False |
| best_oa_location.raw_source_name | |
| best_oa_location.landing_page_url | http://arxiv.org/abs/2501.04670 |
| primary_location.id | pmh:oai:arXiv.org:2501.04670 |
| primary_location.is_oa | True |
| primary_location.source.id | https://openalex.org/S4306400194 |
| primary_location.source.issn | |
| primary_location.source.type | repository |
| primary_location.source.is_oa | True |
| primary_location.source.issn_l | |
| primary_location.source.is_core | False |
| primary_location.source.is_in_doaj | False |
| primary_location.source.display_name | arXiv (Cornell University) |
| primary_location.source.host_organization | https://openalex.org/I205783295 |
| primary_location.source.host_organization_name | Cornell University |
| primary_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| primary_location.license | |
| primary_location.pdf_url | https://arxiv.org/pdf/2501.04670 |
| primary_location.version | submittedVersion |
| primary_location.raw_type | text |
| primary_location.license_id | |
| primary_location.is_accepted | False |
| primary_location.is_published | False |
| primary_location.raw_source_name | |
| primary_location.landing_page_url | http://arxiv.org/abs/2501.04670 |
| publication_date | 2025-01-08 |
| publication_year | 2025 |
| referenced_works_count | 0 |
| abstract_inverted_index.a | 10, 68, 165 |
| abstract_inverted_index.15 | 87 |
| abstract_inverted_index.30 | 78 |
| abstract_inverted_index.In | 64, 122 |
| abstract_inverted_index.To | 145 |
| abstract_inverted_index.We | 96 |
| abstract_inverted_index.an | 127, 201 |
| abstract_inverted_index.be | 246 |
| abstract_inverted_index.by | 219 |
| abstract_inverted_index.in | 2, 13, 40, 50 |
| abstract_inverted_index.is | 28, 38, 84, 149 |
| abstract_inverted_index.of | 26, 36, 101, 205, 230 |
| abstract_inverted_index.on | 108, 207 |
| abstract_inverted_index.to | 74, 114, 131 |
| abstract_inverted_index.we | 66, 124, 162 |
| abstract_inverted_index.OA, | 223 |
| abstract_inverted_index.Our | 43 |
| abstract_inverted_index.SFT | 135, 233 |
| abstract_inverted_index.The | 81, 185 |
| abstract_inverted_index.and | 18, 90, 112, 118, 155, 181, 213, 221, 235, 243 |
| abstract_inverted_index.for | 157 |
| abstract_inverted_index.our | 146, 231, 236 |
| abstract_inverted_index.the | 22, 33, 47, 98, 109, 133, 150, 158, 192, 208, 214, 228 |
| abstract_inverted_index.two | 170 |
| abstract_inverted_index.(OA) | 204 |
| abstract_inverted_index.220K | 138 |
| abstract_inverted_index.MLLM | 159, 168 |
| abstract_inverted_index.MMVM | 82, 102, 134, 209, 232 |
| abstract_inverted_index.best | 215 |
| abstract_inverted_index.cues | 111 |
| abstract_inverted_index.data | 99, 141 |
| abstract_inverted_index.even | 57 |
| abstract_inverted_index.from | 86 |
| abstract_inverted_index.have | 8, 125 |
| abstract_inverted_index.into | 104 |
| abstract_inverted_index.more | 115 |
| abstract_inverted_index.over | 77 |
| abstract_inverted_index.that | 46 |
| abstract_inverted_index.this | 148 |
| abstract_inverted_index.will | 245 |
| abstract_inverted_index.with | 58, 93, 142, 169, 177 |
| abstract_inverted_index.Code, | 240 |
| abstract_inverted_index.MLLM, | 217 |
| abstract_inverted_index.MLLMs | 27, 52, 61 |
| abstract_inverted_index.These | 225 |
| abstract_inverted_index.based | 107 |
| abstract_inverted_index.built | 85 |
| abstract_inverted_index.eight | 105 |
| abstract_inverted_index.first | 151 |
| abstract_inverted_index.large | 4 |
| abstract_inverted_index.novel | 166, 171, 237 |
| abstract_inverted_index.shown | 9 |
| abstract_inverted_index.still | 53 |
| abstract_inverted_index.while | 191 |
| abstract_inverted_index.(MLLM) | 7 |
| abstract_inverted_index.(MMVM) | 72 |
| abstract_inverted_index.7.15\% | 220 |
| abstract_inverted_index.CoLVA, | 164 |
| abstract_inverted_index.GPT-4o | 212 |
| abstract_inverted_index.MLLMs. | 80, 121 |
| abstract_inverted_index.Recent | 0 |
| abstract_inverted_index.Visual | 70 |
| abstract_inverted_index.expert | 176 |
| abstract_inverted_index.fairly | 75 |
| abstract_inverted_index.former | 186 |
| abstract_inverted_index.latter | 193 |
| abstract_inverted_index.learns | 187 |
| abstract_inverted_index.manual | 94 |
| abstract_inverted_index.models | 6, 244 |
| abstract_inverted_index.rarely | 29 |
| abstract_inverted_index.recent | 51 |
| abstract_inverted_index.strong | 11, 60 |
| abstract_inverted_index.videos | 92 |
| abstract_inverted_index.vision | 175 |
| abstract_inverted_index.visual | 14, 23, 34, 139, 152 |
| abstract_inverted_index.11.72\% | 222 |
| abstract_inverted_index.49.80\% | 206 |
| abstract_inverted_index.GPT-4o. | 63 |
| abstract_inverted_index.ability | 12, 25 |
| abstract_inverted_index.analyze | 119 |
| abstract_inverted_index.aspects | 106 |
| abstract_inverted_index.current | 59, 120 |
| abstract_inverted_index.dataset | 154, 234 |
| abstract_inverted_index.despite | 31 |
| abstract_inverted_index.exhibit | 54 |
| abstract_inverted_index.finding | 32 |
| abstract_inverted_index.further | 194 |
| abstract_inverted_index.models, | 62 |
| abstract_inverted_index.objects | 37 |
| abstract_inverted_index.overall | 202 |
| abstract_inverted_index.present | 163 |
| abstract_inverted_index.results | 226 |
| abstract_inverted_index.reveals | 45 |
| abstract_inverted_index.samples | 100 |
| abstract_inverted_index.tokens, | 190 |
| abstract_inverted_index.vision. | 42 |
| abstract_inverted_index.Finally, | 161 |
| abstract_inverted_index.However, | 21 |
| abstract_inverted_index.Internet | 91 |
| abstract_inverted_index.Matching | 71 |
| abstract_inverted_index.ability. | 198 |
| abstract_inverted_index.accuracy | 203 |
| abstract_inverted_index.achieves | 200 |
| abstract_inverted_index.computer | 41 |
| abstract_inverted_index.dataset, | 136, 242 |
| abstract_inverted_index.datasets | 89 |
| abstract_inverted_index.designed | 126 |
| abstract_inverted_index.designs. | 239 |
| abstract_inverted_index.designs: | 173 |
| abstract_inverted_index.evaluate | 117 |
| abstract_inverted_index.generate | 132 |
| abstract_inverted_index.improves | 195 |
| abstract_inverted_index.instance | 188 |
| abstract_inverted_index.language | 5 |
| abstract_inverted_index.learning | 180 |
| abstract_inverted_index.matching | 24, 48, 140 |
| abstract_inverted_index.pipeline | 130 |
| abstract_inverted_index.required | 110 |
| abstract_inverted_index.research | 44 |
| abstract_inverted_index.studied, | 30 |
| abstract_inverted_index.addition, | 123 |
| abstract_inverted_index.automatic | 128 |
| abstract_inverted_index.benchmark | 73, 76, 83, 103, 156 |
| abstract_inverted_index.construct | 67 |
| abstract_inverted_index.different | 79 |
| abstract_inverted_index.essential | 39 |
| abstract_inverted_index.following | 197 |
| abstract_inverted_index.including | 137 |
| abstract_inverted_index.reasoning | 16, 143 |
| abstract_inverted_index.released. | 247 |
| abstract_inverted_index.strategy. | 184 |
| abstract_inverted_index.technical | 172, 238 |
| abstract_inverted_index.Multimodal | 69 |
| abstract_inverted_index.abilities, | 17 |
| abstract_inverted_index.annotation | 129 |
| abstract_inverted_index.benchmark, | 210, 241 |
| abstract_inverted_index.categorize | 97 |
| abstract_inverted_index.community. | 160 |
| abstract_inverted_index.knowledge, | 147 |
| abstract_inverted_index.multimodal | 3 |
| abstract_inverted_index.surpassing | 211 |
| abstract_inverted_index.systematic | 55 |
| abstract_inverted_index.annotation. | 95, 144 |
| abstract_inverted_index.contrastive | 167, 179 |
| abstract_inverted_index.demonstrate | 227 |
| abstract_inverted_index.instruction | 182, 196 |
| abstract_inverted_index.open-source | 88, 216 |
| abstract_inverted_index.particular, | 65 |
| abstract_inverted_index.perception, | 15 |
| abstract_inverted_index.Qwen2VL-72B, | 218 |
| abstract_inverted_index.advancements | 1 |
| abstract_inverted_index.augmentation | 183 |
| abstract_inverted_index.capabilities | 49, 113 |
| abstract_inverted_index.fine-grained | 174 |
| abstract_inverted_index.object-level | 178 |
| abstract_inverted_index.corresponding | 153 |
| abstract_inverted_index.effectiveness | 229 |
| abstract_inverted_index.respectively. | 224 |
| abstract_inverted_index.shortcomings, | 56 |
| abstract_inverted_index.correspondence | 35 |
| abstract_inverted_index.discriminative | 189 |
| abstract_inverted_index.understanding. | 20 |
| abstract_inverted_index.comprehensively | 116 |
| abstract_inverted_index.vision-language | 19 |
| abstract_inverted_index.CoLVA-InternVL2-4B | 199 |
| cited_by_percentile_year | |
| countries_distinct_count | 0 |
| institutions_distinct_count | 10 |
| citation_normalized_percentile | |
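
The `abstract_inverted_index.*` rows in the payload store the abstract as a map from each token to the word positions where it occurs. A minimal sketch of how such a map could be turned back into running text is shown below. The function name `rebuild_abstract` and its `max_tokens` parameter are hypothetical helpers written for illustration, and the sample dictionary assumes the comma-separated positions in the table have already been parsed into integer lists.

```python
def rebuild_abstract(inverted_index, max_tokens=None):
    """Rebuild text from a token -> positions map (hypothetical helper)."""
    # Flatten to (position, token) pairs, one pair per occurrence.
    positioned = [
        (pos, token)
        for token, positions in inverted_index.items()
        for pos in positions
    ]
    # Sorting by position restores the original word order.
    positioned.sort()
    if max_tokens is not None:
        # Keep only occurrences that fall in the first max_tokens positions.
        positioned = [(p, t) for p, t in positioned if p < max_tokens]
    return " ".join(token for _, token in positioned)


# A few rows copied from the payload table (positions parsed to ints).
sample = {
    "Recent": [0],
    "advancements": [1],
    "in": [2, 13, 40, 50],
    "multimodal": [3],
    "large": [4],
    "language": [5],
    "models": [6, 244],
    "(MLLM)": [7],
}

opening = rebuild_abstract(sample, max_tokens=8)
print(opening)
```

With only these eight rows and `max_tokens=8`, the sketch recovers the first eight words of the abstract. Passing the complete index without a limit would give back the whole abstract, with punctuation still attached to the tokens as stored.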