Are They the Same? Exploring Visual Correspondence Shortcomings of Multimodal LLMs

Yikang Zhou, Tao Zhang, Shilin Xu, Shihao Chen, Qianyu Zhou, Yunhai Tong, Shunping Ji, Jiansheng Zhang, Xiangtai Li, Qi Lu

2025 · Open Access · DOI: https://doi.org/10.48550/arxiv.2501.04670

Recent advances in multimodal large language models (MLLMs) have shown strong abilities in visual perception, reasoning, and vision-language understanding. However, the visual matching ability of MLLMs is rarely studied, even though finding the visual correspondence of objects is essential in computer vision. Our research reveals that the matching capabilities of recent MLLMs still exhibit systematic shortcomings, even in strong current models such as GPT-4o. In particular, we construct a Multimodal Visual Matching (MMVM) benchmark to fairly evaluate over 30 different MLLMs. The MMVM benchmark is built from 15 open-source datasets and Internet videos with manual annotation. We categorize the data samples of the MMVM benchmark into eight aspects based on the required cues and capabilities, to evaluate and analyze current MLLMs more comprehensively. In addition, we have designed an automatic annotation pipeline to generate the MMVM SFT dataset, which includes 220K visual matching samples with reasoning annotations. To our knowledge, this is the first visual correspondence dataset and benchmark for the MLLM community. Finally, we present CoLVA, a novel contrastive MLLM with two novel technical designs: a fine-grained vision expert with object-level contrastive learning, and an instruction augmentation strategy. The former learns instance-discriminative tokens, while the latter further improves instruction-following ability. CoLVA-InternVL2-4B achieves an overall accuracy (OA) of 49.80% on the MMVM benchmark, surpassing GPT-4o and the best open-source MLLM, Qwen2VL-72B, by 7.15% and 11.72% OA, respectively. These results demonstrate the effectiveness of our MMVM SFT dataset and our novel technical designs. Code, benchmark, dataset, and models will be released.
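The abstract reports only CoLVA's score and its margins over the two baselines. As a rough illustration, the sketch below derives the baseline scores those numbers would imply. It assumes the margins are absolute percentage points of overall accuracy, which the abstract suggests but does not state; the function name `implied_baseline` is hypothetical, and the derived values are not figures quoted from the paper.

```python
# Numbers taken from the abstract.
COLVA_OA = 49.80          # CoLVA-InternVL2-4B overall accuracy (%)
MARGIN_GPT4O = 7.15       # reported margin over GPT-4o
MARGIN_QWEN2VL = 11.72    # reported margin over Qwen2VL-72B


def implied_baseline(score: float, margin: float) -> float:
    """Return the baseline score implied by `score` and `margin`.

    Assumption: the margin is an absolute difference in percentage
    points, not a relative improvement.
    """
    return round(score - margin, 2)


if __name__ == "__main__":
    print(f"Implied GPT-4o OA:      {implied_baseline(COLVA_OA, MARGIN_GPT4O):.2f}")
    print(f"Implied Qwen2VL-72B OA: {implied_baseline(COLVA_OA, MARGIN_QWEN2VL):.2f}")
```

If the margins were instead relative improvements, the implied baselines would differ, so these values should be checked against the paper's result tables before being cited.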

Related Topics

Concepts

Psychology

Metadata

Type: preprint
Language: en
Landing Page: http://arxiv.org/abs/2501.04670
PDF: https://arxiv.org/pdf/2501.04670
OA Status: green
Related Works: 10
OpenAlex ID: https://openalex.org/W4406231658

All OpenAlex metadata

OpenAlex ID: https://openalex.org/W4406231658
Canonical identifier for this work in OpenAlex

DOI: https://doi.org/10.48550/arxiv.2501.04670
Digital Object Identifier

Title: Are They the Same? Exploring Visual Correspondence Shortcomings of Multimodal LLMs
Work title

Type: preprint
OpenAlex work type

Language: en
Primary language

Publication year: 2025
Year of publication

Publication date: 2025-01-08
Full publication date if available

Authors: Yikang Zhou, Tao Zhang, Shilin Xu, Shihao Chen, Qianyu Zhou, Yunhai Tong, Shunping Ji, Jiansheng Zhang, Xiangtai Li, Qi Lu
List of authors in order

Landing page: https://arxiv.org/abs/2501.04670
Publisher landing page

PDF URL: https://arxiv.org/pdf/2501.04670
Direct link to full text PDF

Open access: Yes
Whether a free full text is available

OA status: green
Open access status per OpenAlex

OA URL: https://arxiv.org/pdf/2501.04670
Direct OA link when available

Concepts: Psychology
Top concepts (fields/topics) attached by OpenAlex

Cited by: 0
Total citation count in OpenAlex

Related works (count): 10
Other works algorithmically related by OpenAlex

Full payload

id	https://openalex.org/W4406231658
doi	https://doi.org/10.48550/arxiv.2501.04670
ids.doi	https://doi.org/10.48550/arxiv.2501.04670
ids.openalex	https://openalex.org/W4406231658
fwci
type	preprint
title	Are They the Same? Exploring Visual Correspondence Shortcomings of Multimodal LLMs
biblio.issue
biblio.volume
biblio.last_page
biblio.first_page
topics[0].id	https://openalex.org/T11148
topics[0].field.id	https://openalex.org/fields/32
topics[0].field.display_name	Psychology
topics[0].score	0.986299991607666
topics[0].domain.id	https://openalex.org/domains/2
topics[0].domain.display_name	Social Sciences
topics[0].subfield.id	https://openalex.org/subfields/3205
topics[0].subfield.display_name	Experimental and Cognitive Psychology
topics[0].display_name	Language, Metaphor, and Cognition
topics[1].id	https://openalex.org/T12881
topics[1].field.id	https://openalex.org/fields/12
topics[1].field.display_name	Arts and Humanities
topics[1].score	0.9846000075340271
topics[1].domain.id	https://openalex.org/domains/2
topics[1].domain.display_name	Social Sciences
topics[1].subfield.id	https://openalex.org/subfields/1203
topics[1].subfield.display_name	Language and Linguistics
topics[1].display_name	linguistics and terminology studies
topics[2].id	https://openalex.org/T10759
topics[2].field.id	https://openalex.org/fields/12
topics[2].field.display_name	Arts and Humanities
topics[2].score	0.9811999797821045
topics[2].domain.id	https://openalex.org/domains/2
topics[2].domain.display_name	Social Sciences
topics[2].subfield.id	https://openalex.org/subfields/1203
topics[2].subfield.display_name	Language and Linguistics
topics[2].display_name	Translation Studies and Practices
is_xpac	False
apc_list
apc_paid
concepts[0].id	https://openalex.org/C15744967
concepts[0].level	0
concepts[0].score	0.35192638635635376
concepts[0].wikidata	https://www.wikidata.org/wiki/Q9418
concepts[0].display_name	Psychology
keywords[0].id	https://openalex.org/keywords/psychology
keywords[0].score	0.35192638635635376
keywords[0].display_name	Psychology
language	en
locations[0].id	pmh:oai:arXiv.org:2501.04670
locations[0].is_oa	True
locations[0].source.id	https://openalex.org/S4306400194
locations[0].source.issn
locations[0].source.type	repository
locations[0].source.is_oa	True
locations[0].source.issn_l
locations[0].source.is_core	False
locations[0].source.is_in_doaj	False
locations[0].source.display_name	arXiv (Cornell University)
locations[0].source.host_organization	https://openalex.org/I205783295
locations[0].source.host_organization_name	Cornell University
locations[0].source.host_organization_lineage	https://openalex.org/I205783295
locations[0].license
locations[0].pdf_url	https://arxiv.org/pdf/2501.04670
locations[0].version	submittedVersion
locations[0].raw_type	text
locations[0].license_id
locations[0].is_accepted	False
locations[0].is_published	False
locations[0].raw_source_name
locations[0].landing_page_url	http://arxiv.org/abs/2501.04670
locations[1].id	doi:10.48550/arxiv.2501.04670
locations[1].is_oa	True
locations[1].source.id	https://openalex.org/S4306400194
locations[1].source.issn
locations[1].source.type	repository
locations[1].source.is_oa	True
locations[1].source.issn_l
locations[1].source.is_core	False
locations[1].source.is_in_doaj	False
locations[1].source.display_name	arXiv (Cornell University)
locations[1].source.host_organization	https://openalex.org/I205783295
locations[1].source.host_organization_name	Cornell University
locations[1].source.host_organization_lineage	https://openalex.org/I205783295
locations[1].license
locations[1].pdf_url
locations[1].version
locations[1].raw_type	article
locations[1].license_id
locations[1].is_accepted	False
locations[1].is_published
locations[1].raw_source_name
locations[1].landing_page_url	https://doi.org/10.48550/arxiv.2501.04670
indexed_in	arxiv, datacite
authorships[0].author.id	https://openalex.org/A5104162474
authorships[0].author.orcid
authorships[0].author.display_name	Yikang Zhou
authorships[0].author_position	first
authorships[0].raw_author_name	Zhou, Yikang
authorships[0].is_corresponding	False
authorships[1].author.id	https://openalex.org/A5100375748
authorships[1].author.orcid	https://orcid.org/0000-0001-9470-7215
authorships[1].author.display_name	Tao Zhang
authorships[1].author_position	middle
authorships[1].raw_author_name	Zhang, Tao
authorships[1].is_corresponding	False
authorships[2].author.id	https://openalex.org/A5068311346
authorships[2].author.orcid	https://orcid.org/0009-0009-1198-3739
authorships[2].author.display_name	Shilin Xu
authorships[2].author_position	middle
authorships[2].raw_author_name	Xu, Shilin
authorships[2].is_corresponding	False
authorships[3].author.id	https://openalex.org/A5055000437
authorships[3].author.orcid	https://orcid.org/0000-0001-7646-8003
authorships[3].author.display_name	Shihao Chen
authorships[3].author_position	middle
authorships[3].raw_author_name	Chen, Shihao
authorships[3].is_corresponding	False
authorships[4].author.id	https://openalex.org/A5111039855
authorships[4].author.orcid
authorships[4].author.display_name	Qianyu Zhou
authorships[4].author_position	middle
authorships[4].raw_author_name	Zhou, Qianyu
authorships[4].is_corresponding	False
authorships[5].author.id	https://openalex.org/A5024097240
authorships[5].author.orcid	https://orcid.org/0000-0001-8735-2516
authorships[5].author.display_name	Yunhai Tong
authorships[5].author_position	middle
authorships[5].raw_author_name	Tong, Yunhai
authorships[5].is_corresponding	False
authorships[6].author.id	https://openalex.org/A5031588692
authorships[6].author.orcid	https://orcid.org/0000-0002-3088-1481
authorships[6].author.display_name	Shunping Ji
authorships[6].author_position	middle
authorships[6].raw_author_name	Ji, Shunping
authorships[6].is_corresponding	False
authorships[7].author.id	https://openalex.org/A5033141976
authorships[7].author.orcid	https://orcid.org/0009-0009-2693-0883
authorships[7].author.display_name	Jiansheng Zhang
authorships[7].author_position	middle
authorships[7].raw_author_name	Zhang, Jiangning
authorships[7].is_corresponding	False
authorships[8].author.id	https://openalex.org/A5089900108
authorships[8].author.orcid	https://orcid.org/0000-0002-0550-8247
authorships[8].author.display_name	Xiangtai Li
authorships[8].author_position	middle
authorships[8].raw_author_name	Li, Xiangtai
authorships[8].is_corresponding	False
authorships[9].author.id	https://openalex.org/A5100665478
authorships[9].author.orcid	https://orcid.org/0000-0002-3596-2774
authorships[9].author.display_name	Qi Lu
authorships[9].author_position	last
authorships[9].raw_author_name	Qi, Lu
authorships[9].is_corresponding	False
has_content.pdf	False
has_content.grobid_xml	False
is_paratext	False
open_access.is_oa	True
open_access.oa_url	https://arxiv.org/pdf/2501.04670
open_access.oa_status	green
open_access.any_repository_has_fulltext	False
created_date	2025-10-10T00:00:00
display_name	Are They the Same? Exploring Visual Correspondence Shortcomings of Multimodal LLMs
has_fulltext	False
is_retracted	False
updated_date	2025-11-06T06:51:31.235846
primary_topic.id	https://openalex.org/T11148
primary_topic.field.id	https://openalex.org/fields/32
primary_topic.field.display_name	Psychology
primary_topic.score	0.986299991607666
primary_topic.domain.id	https://openalex.org/domains/2
primary_topic.domain.display_name	Social Sciences
primary_topic.subfield.id	https://openalex.org/subfields/3205
primary_topic.subfield.display_name	Experimental and Cognitive Psychology
primary_topic.display_name	Language, Metaphor, and Cognition
related_works	https://openalex.org/W4391375266, https://openalex.org/W2748952813, https://openalex.org/W2931662336, https://openalex.org/W2077865380, https://openalex.org/W3006817050, https://openalex.org/W4401768695, https://openalex.org/W2765597752, https://openalex.org/W2134894512, https://openalex.org/W2083375246, https://openalex.org/W2067108088
cited_by_count	0
locations_count	2
best_oa_location.id	pmh:oai:arXiv.org:2501.04670
best_oa_location.is_oa	True
best_oa_location.source.id	https://openalex.org/S4306400194
best_oa_location.source.issn
best_oa_location.source.type	repository
best_oa_location.source.is_oa	True
best_oa_location.source.issn_l
best_oa_location.source.is_core	False
best_oa_location.source.is_in_doaj	False
best_oa_location.source.display_name	arXiv (Cornell University)
best_oa_location.source.host_organization	https://openalex.org/I205783295
best_oa_location.source.host_organization_name	Cornell University
best_oa_location.source.host_organization_lineage	https://openalex.org/I205783295
best_oa_location.license
best_oa_location.pdf_url	https://arxiv.org/pdf/2501.04670
best_oa_location.version	submittedVersion
best_oa_location.raw_type	text
best_oa_location.license_id
best_oa_location.is_accepted	False
best_oa_location.is_published	False
best_oa_location.raw_source_name
best_oa_location.landing_page_url	http://arxiv.org/abs/2501.04670
primary_location.id	pmh:oai:arXiv.org:2501.04670
primary_location.is_oa	True
primary_location.source.id	https://openalex.org/S4306400194
primary_location.source.issn
primary_location.source.type	repository
primary_location.source.is_oa	True
primary_location.source.issn_l
primary_location.source.is_core	False
primary_location.source.is_in_doaj	False
primary_location.source.display_name	arXiv (Cornell University)
primary_location.source.host_organization	https://openalex.org/I205783295
primary_location.source.host_organization_name	Cornell University
primary_location.source.host_organization_lineage	https://openalex.org/I205783295
primary_location.license
primary_location.pdf_url	https://arxiv.org/pdf/2501.04670
primary_location.version	submittedVersion
primary_location.raw_type	text
primary_location.license_id
primary_location.is_accepted	False
primary_location.is_published	False
primary_location.raw_source_name
primary_location.landing_page_url	http://arxiv.org/abs/2501.04670
publication_date	2025-01-08
publication_year	2025
referenced_works_count	0
abstract_inverted_index.a	10, 68, 165
abstract_inverted_index.15	87
abstract_inverted_index.30	78
abstract_inverted_index.In	64, 122
abstract_inverted_index.To	145
abstract_inverted_index.We	96
abstract_inverted_index.an	127, 201
abstract_inverted_index.be	246
abstract_inverted_index.by	219
abstract_inverted_index.in	2, 13, 40, 50
abstract_inverted_index.is	28, 38, 84, 149
abstract_inverted_index.of	26, 36, 101, 205, 230
abstract_inverted_index.on	108, 207
abstract_inverted_index.to	74, 114, 131
abstract_inverted_index.we	66, 124, 162
abstract_inverted_index.OA,	223
abstract_inverted_index.Our	43
abstract_inverted_index.SFT	135, 233
abstract_inverted_index.The	81, 185
abstract_inverted_index.and	18, 90, 112, 118, 155, 181, 213, 221, 235, 243
abstract_inverted_index.for	157
abstract_inverted_index.our	146, 231, 236
abstract_inverted_index.the	22, 33, 47, 98, 109, 133, 150, 158, 192, 208, 214, 228
abstract_inverted_index.two	170
abstract_inverted_index.(OA)	204
abstract_inverted_index.220K	138
abstract_inverted_index.MLLM	159, 168
abstract_inverted_index.MMVM	82, 102, 134, 209, 232
abstract_inverted_index.best	215
abstract_inverted_index.cues	111
abstract_inverted_index.data	99, 141
abstract_inverted_index.even	57
abstract_inverted_index.from	86
abstract_inverted_index.have	8, 125
abstract_inverted_index.into	104
abstract_inverted_index.more	115
abstract_inverted_index.over	77
abstract_inverted_index.that	46
abstract_inverted_index.this	148
abstract_inverted_index.will	245
abstract_inverted_index.with	58, 93, 142, 169, 177
abstract_inverted_index.Code,	240
abstract_inverted_index.MLLM,	217
abstract_inverted_index.MLLMs	27, 52, 61
abstract_inverted_index.These	225
abstract_inverted_index.based	107
abstract_inverted_index.built	85
abstract_inverted_index.eight	105
abstract_inverted_index.first	151
abstract_inverted_index.large	4
abstract_inverted_index.novel	166, 171, 237
abstract_inverted_index.shown	9
abstract_inverted_index.still	53
abstract_inverted_index.while	191
abstract_inverted_index.(MLLM)	7
abstract_inverted_index.(MMVM)	72
abstract_inverted_index.7.15\%	220
abstract_inverted_index.CoLVA,	164
abstract_inverted_index.GPT-4o	212
abstract_inverted_index.MLLMs.	80, 121
abstract_inverted_index.Recent	0
abstract_inverted_index.Visual	70
abstract_inverted_index.expert	176
abstract_inverted_index.fairly	75
abstract_inverted_index.former	186
abstract_inverted_index.latter	193
abstract_inverted_index.learns	187
abstract_inverted_index.manual	94
abstract_inverted_index.models	6, 244
abstract_inverted_index.rarely	29
abstract_inverted_index.recent	51
abstract_inverted_index.strong	11, 60
abstract_inverted_index.videos	92
abstract_inverted_index.vision	175
abstract_inverted_index.visual	14, 23, 34, 139, 152
abstract_inverted_index.11.72\%	222
abstract_inverted_index.49.80\%	206
abstract_inverted_index.GPT-4o.	63
abstract_inverted_index.ability	12, 25
abstract_inverted_index.analyze	119
abstract_inverted_index.aspects	106
abstract_inverted_index.current	59, 120
abstract_inverted_index.dataset	154, 234
abstract_inverted_index.despite	31
abstract_inverted_index.exhibit	54
abstract_inverted_index.finding	32
abstract_inverted_index.further	194
abstract_inverted_index.models,	62
abstract_inverted_index.objects	37
abstract_inverted_index.overall	202
abstract_inverted_index.present	163
abstract_inverted_index.results	226
abstract_inverted_index.reveals	45
abstract_inverted_index.samples	100
abstract_inverted_index.tokens,	190
abstract_inverted_index.vision.	42
abstract_inverted_index.Finally,	161
abstract_inverted_index.However,	21
abstract_inverted_index.Internet	91
abstract_inverted_index.Matching	71
abstract_inverted_index.ability.	198
abstract_inverted_index.accuracy	203
abstract_inverted_index.achieves	200
abstract_inverted_index.computer	41
abstract_inverted_index.dataset,	136, 242
abstract_inverted_index.datasets	89
abstract_inverted_index.designed	126
abstract_inverted_index.designs.	239
abstract_inverted_index.designs:	173
abstract_inverted_index.evaluate	117
abstract_inverted_index.generate	132
abstract_inverted_index.improves	195
abstract_inverted_index.instance	188
abstract_inverted_index.language	5
abstract_inverted_index.learning	180
abstract_inverted_index.matching	24, 48, 140
abstract_inverted_index.pipeline	130
abstract_inverted_index.required	110
abstract_inverted_index.research	44
abstract_inverted_index.studied,	30
abstract_inverted_index.addition,	123
abstract_inverted_index.automatic	128
abstract_inverted_index.benchmark	73, 76, 83, 103, 156
abstract_inverted_index.construct	67
abstract_inverted_index.different	79
abstract_inverted_index.essential	39
abstract_inverted_index.following	197
abstract_inverted_index.including	137
abstract_inverted_index.reasoning	16, 143
abstract_inverted_index.released.	247
abstract_inverted_index.strategy.	184
abstract_inverted_index.technical	172, 238
abstract_inverted_index.Multimodal	69
abstract_inverted_index.abilities,	17
abstract_inverted_index.annotation	129
abstract_inverted_index.benchmark,	210, 241
abstract_inverted_index.categorize	97
abstract_inverted_index.community.	160
abstract_inverted_index.knowledge,	147
abstract_inverted_index.multimodal	3
abstract_inverted_index.surpassing	211
abstract_inverted_index.systematic	55
abstract_inverted_index.annotation.	95, 144
abstract_inverted_index.contrastive	167, 179
abstract_inverted_index.demonstrate	227
abstract_inverted_index.instruction	182, 196
abstract_inverted_index.open-source	88, 216
abstract_inverted_index.particular,	65
abstract_inverted_index.perception,	15
abstract_inverted_index.Qwen2VL-72B,	218
abstract_inverted_index.advancements	1
abstract_inverted_index.augmentation	183
abstract_inverted_index.capabilities	49, 113
abstract_inverted_index.fine-grained	174
abstract_inverted_index.object-level	178
abstract_inverted_index.corresponding	153
abstract_inverted_index.effectiveness	229
abstract_inverted_index.respectively.	224
abstract_inverted_index.shortcomings,	56
abstract_inverted_index.correspondence	35
abstract_inverted_index.discriminative	189
abstract_inverted_index.understanding.	20
abstract_inverted_index.comprehensively	116
abstract_inverted_index.vision-language	19
abstract_inverted_index.CoLVA-InternVL2-4B	199
cited_by_percentile_year
countries_distinct_count	0
institutions_distinct_count	10
citation_normalized_percentile
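
The `abstract_inverted_index.*` rows in the payload above store the abstract as a map from each token to the positions where it occurs. A minimal sketch of turning such a map back into running text follows. The function name `rebuild_abstract` is hypothetical, and the sample dictionary is a hand-copied excerpt covering only positions 0 to 12, not the full payload.

```python
def rebuild_abstract(inverted_index: dict[str, list[int]]) -> str:
    """Rebuild text from a token -> positions map.

    Each token is placed at every position listed for it, then the
    tokens are joined in ascending position order.
    """
    positioned = [
        (position, token)
        for token, positions in inverted_index.items()
        for position in positions
    ]
    return " ".join(token for _, token in sorted(positioned))


# Excerpt of the payload, truncated to positions 0..12 for illustration.
sample_index = {
    "a": [10],
    "in": [2],
    "have": [8],
    "large": [4],
    "shown": [9],
    "(MLLM)": [7],
    "Recent": [0],
    "models": [6],
    "strong": [11],
    "ability": [12],
    "language": [5],
    "multimodal": [3],
    "advancements": [1],
}

if __name__ == "__main__":
    print(rebuild_abstract(sample_index))
    # prints: Recent advancements in multimodal large language models (MLLM) have shown a strong ability
```

Applied to the complete index, the same function would yield the abstract as originally indexed, including its original punctuation and escaped percent signs.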