M^3-Bench: Multi-Modal, Multi-Hop, Multi-Threaded Tool-Using MLLM Agent Benchmark

Yang Zhou, Mingyu Zhao, Zhenting Wang, Difei Gu, Bangwei Guo, Ruosong Ye, Ligong Han, Can Jin, Dimitris N. Metaxas

2025 · Open Access · DOI: https://doi.org/10.48550/arxiv.2511.17729

We present M^3-Bench, the first benchmark for evaluating multimodal tool use under the Model Context Protocol (MCP). The benchmark targets realistic multi-hop and multi-threaded workflows that require visual grounding, textual reasoning, cross-tool dependencies, and persistence of intermediate resources across steps. We introduce a similarity-driven alignment that serializes each tool call, embeds the resulting signatures with a sentence encoder, and performs similarity-bucketed Hungarian matching to obtain auditable one-to-one correspondences. On top of this alignment, we report interpretable metrics that decouple semantic fidelity from workflow consistency. The benchmark spans 28 servers with 231 tools and provides standardized trajectories curated through an Executor & Judge pipeline with human verification; an auxiliary ensemble of four large language model (LLM) judges reports end-task Task Completion and information grounding. Evaluations of representative state-of-the-art multimodal LLMs (MLLMs) reveal persistent gaps in multimodal MCP tool use, particularly in argument fidelity and structure consistency, underscoring the need for methods that jointly reason over images, text, and tool graphs. Our benchmark's anonymous repository is at https://github.com/EtaYang10th/Open-M3-Bench
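The alignment step described in the abstract (serialize each tool call, embed the signature, then match one-to-one) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper or its repository: the `tool`/`args` call format, the function names, the bag-of-tokens `embed` that stands in for a real sentence encoder, and the reading of "similarity-bucketed" as quantizing similarities to a fixed `step` and discarding matched pairs below a `threshold`.

```python
import json
import math
import re
from collections import Counter

def serialize(call):
    """Canonical signature: tool name plus key-sorted JSON arguments."""
    return f"{call['tool']} {json.dumps(call['args'], sort_keys=True)}"

def embed(text):
    """ASSUMPTION: bag-of-tokens stand-in for a real sentence encoder."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(x * x for x in a.values()) * sum(x * x for x in b.values()))
    return dot / norm if norm else 0.0

def hungarian(cost):
    """Minimum-cost assignment (rows <= columns); returns (row, col) pairs."""
    n, m = len(cost), len(cost[0])
    inf = float("inf")
    u, v = [0.0] * (n + 1), [0.0] * (m + 1)
    p, way = [0] * (m + 1), [0] * (m + 1)
    for i in range(1, n + 1):
        p[0], j0 = i, 0
        minv, used = [inf] * (m + 1), [False] * (m + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], inf, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        while j0:  # walk the augmenting path back to the virtual column 0
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    return [(p[j] - 1, j - 1) for j in range(1, m + 1) if p[j]]

def align(predicted, reference, step=0.1, threshold=0.5):
    """Return sorted (pred_idx, ref_idx, similarity) for accepted matches."""
    if not predicted or not reference:
        return []
    p_vecs = [embed(serialize(c)) for c in predicted]
    r_vecs = [embed(serialize(c)) for c in reference]
    sim = [[cosine(p, r) for r in r_vecs] for p in p_vecs]
    # ASSUMPTION: "bucketing" = quantize similarities down to multiples of `step`.
    cost = [[1.0 - math.floor(s / step + 1e-9) * step for s in row] for row in sim]
    transposed = len(cost) > len(cost[0])
    if transposed:
        cost = [list(col) for col in zip(*cost)]
    pairs = hungarian(cost)
    if transposed:
        pairs = [(j, i) for i, j in pairs]
    return sorted((i, j, round(sim[i][j], 3)) for i, j in pairs if sim[i][j] >= threshold)

# Hypothetical trajectories: same calls in a different order, plus one mismatch.
reference = [
    {"tool": "image_search", "args": {"query": "red bridge"}},
    {"tool": "crop_image", "args": {"path": "bridge.png", "box": [0, 0, 64, 64]}},
    {"tool": "translate_text", "args": {"text": "hello", "target": "fr"}},
]
predicted = [
    {"tool": "crop_image", "args": {"box": [0, 0, 64, 64], "path": "bridge.png"}},
    {"tool": "image_search", "args": {"query": "red bridge"}},
    {"tool": "get_weather", "args": {"city": "Paris"}},
]
matches = align(predicted, reference)
print(matches)  # [(0, 1, 1.0), (1, 0, 1.0)]
```

The unrelated `get_weather` call is still assigned by the matcher but falls below the threshold, so it is reported as unmatched. In practice the stand-ins would likely be replaced by a sentence-embedding model and a library solver such as `scipy.optimize.linear_sum_assignment`.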

Related Topics

Benchmark (Surveying)

Computer Science

Artificial Intelligence

Concepts

Benchmark (surveying), Computer science, Workflow, Context (archaeology), Pipeline (software), Task (project management), Artificial intelligence, Matching (statistics), Semantics (computer science), Sentence, Machine learning, Server, Parsing, Argument (complex analysis), Executor, Scheme (mathematics), Natural language processing, Coreference, Language model, Data mining, Information retrieval, Context model, Pattern matching, Process (computing), Benchmarking, Data modeling, Fidelity, Semantic heterogeneity, Human-in-the-loop, Oracle

Metadata

Type: preprint
Landing Page: https://doi.org/10.48550/arxiv.2511.17729
OA Status: green
OpenAlex ID: https://openalex.org/W7106712151

All OpenAlex metadata

OpenAlex ID: https://openalex.org/W7106712151
Canonical identifier for this work in OpenAlex

DOI: https://doi.org/10.48550/arxiv.2511.17729
Digital Object Identifier

Title: M^3-Bench: Multi-Modal, Multi-Hop, Multi-Threaded Tool-Using MLLM Agent Benchmark
Work title

Type: preprint
OpenAlex work type

Publication year: 2025
Year of publication

Publication date: 2025-11-21
Full publication date if available

Authors: Yang Zhou, Mingyu Zhao, Zhenting Wang, Difei Gu, Bangwei Guo, Ruosong Ye, Ligong Han, Can Jin, Dimitris N. Metaxas
List of authors in order

Landing page: https://doi.org/10.48550/arxiv.2511.17729
Publisher landing page

Open access: Yes
Whether a free full text is available

OA status: green
Open access status per OpenAlex

OA URL: https://doi.org/10.48550/arxiv.2511.17729
Direct OA link when available

Concepts: Benchmark (surveying), Computer science, Workflow, Context (archaeology), Pipeline (software), Task (project management), Artificial intelligence, Matching (statistics), Semantics (computer science), Sentence, Machine learning, Server, Parsing, Argument (complex analysis), Executor, Scheme (mathematics), Natural language processing, Coreference, Language model, Data mining, Information retrieval, Context model, Pattern matching, Process (computing), Benchmarking, Data modeling, Fidelity, Semantic heterogeneity, Human-in-the-loop, Oracle
Top concepts (fields/topics) attached by OpenAlex

Cited by: 0
Total citation count in OpenAlex

Full payload

id	https://openalex.org/W7106712151
doi	https://doi.org/10.48550/arxiv.2511.17729
ids.doi	https://doi.org/10.48550/arxiv.2511.17729
ids.openalex	https://openalex.org/W7106712151
fwci
type	preprint
title	M^3-Bench: Multi-Modal, Multi-Hop, Multi-Threaded Tool-Using MLLM Agent Benchmark
biblio.issue
biblio.volume
biblio.last_page
biblio.first_page
is_xpac	False
apc_list
apc_paid
concepts[0].id	https://openalex.org/C185798385
concepts[0].level	2
concepts[0].score	0.8338862657546997
concepts[0].wikidata	https://www.wikidata.org/wiki/Q1161707
concepts[0].display_name	Benchmark (surveying)
concepts[1].id	https://openalex.org/C41008148
concepts[1].level	0
concepts[1].score	0.7966123819351196
concepts[1].wikidata	https://www.wikidata.org/wiki/Q21198
concepts[1].display_name	Computer science
concepts[2].id	https://openalex.org/C177212765
concepts[2].level	2
concepts[2].score	0.6959211230278015
concepts[2].wikidata	https://www.wikidata.org/wiki/Q627335
concepts[2].display_name	Workflow
concepts[3].id	https://openalex.org/C2779343474
concepts[3].level	2
concepts[3].score	0.6401889324188232
concepts[3].wikidata	https://www.wikidata.org/wiki/Q3109175
concepts[3].display_name	Context (archaeology)
concepts[4].id	https://openalex.org/C43521106
concepts[4].level	2
concepts[4].score	0.6215966939926147
concepts[4].wikidata	https://www.wikidata.org/wiki/Q2165493
concepts[4].display_name	Pipeline (software)
concepts[5].id	https://openalex.org/C2780451532
concepts[5].level	2
concepts[5].score	0.5843798518180847
concepts[5].wikidata	https://www.wikidata.org/wiki/Q759676
concepts[5].display_name	Task (project management)
concepts[6].id	https://openalex.org/C154945302
concepts[6].level	1
concepts[6].score	0.5173117518424988
concepts[6].wikidata	https://www.wikidata.org/wiki/Q11660
concepts[6].display_name	Artificial intelligence
concepts[7].id	https://openalex.org/C165064840
concepts[7].level	2
concepts[7].score	0.4918988049030304
concepts[7].wikidata	https://www.wikidata.org/wiki/Q1321061
concepts[7].display_name	Matching (statistics)
concepts[8].id	https://openalex.org/C184337299
concepts[8].level	2
concepts[8].score	0.44271695613861084
concepts[8].wikidata	https://www.wikidata.org/wiki/Q1437428
concepts[8].display_name	Semantics (computer science)
concepts[9].id	https://openalex.org/C2777530160
concepts[9].level	2
concepts[9].score	0.4233413636684418
concepts[9].wikidata	https://www.wikidata.org/wiki/Q41796
concepts[9].display_name	Sentence
concepts[10].id	https://openalex.org/C119857082
concepts[10].level	1
concepts[10].score	0.40053269267082214
concepts[10].wikidata	https://www.wikidata.org/wiki/Q2539
concepts[10].display_name	Machine learning
concepts[11].id	https://openalex.org/C93996380
concepts[11].level	2
concepts[11].score	0.39873790740966797
concepts[11].wikidata	https://www.wikidata.org/wiki/Q44127
concepts[11].display_name	Server
concepts[12].id	https://openalex.org/C186644900
concepts[12].level	2
concepts[12].score	0.39798158407211304
concepts[12].wikidata	https://www.wikidata.org/wiki/Q194152
concepts[12].display_name	Parsing
concepts[13].id	https://openalex.org/C98184364
concepts[13].level	2
concepts[13].score	0.39399585127830505
concepts[13].wikidata	https://www.wikidata.org/wiki/Q1780131
concepts[13].display_name	Argument (complex analysis)
concepts[14].id	https://openalex.org/C180591056
concepts[14].level	2
concepts[14].score	0.38836103677749634
concepts[14].wikidata	https://www.wikidata.org/wiki/Q654437
concepts[14].display_name	Executor
concepts[15].id	https://openalex.org/C77618280
concepts[15].level	2
concepts[15].score	0.3849237859249115
concepts[15].wikidata	https://www.wikidata.org/wiki/Q1155772
concepts[15].display_name	Scheme (mathematics)
concepts[16].id	https://openalex.org/C204321447
concepts[16].level	1
concepts[16].score	0.36012765765190125
concepts[16].wikidata	https://www.wikidata.org/wiki/Q30642
concepts[16].display_name	Natural language processing
concepts[17].id	https://openalex.org/C28076734
concepts[17].level	3
concepts[17].score	0.34633752703666687
concepts[17].wikidata	https://www.wikidata.org/wiki/Q63087
concepts[17].display_name	Coreference
concepts[18].id	https://openalex.org/C137293760
concepts[18].level	2
concepts[18].score	0.34118345379829407
concepts[18].wikidata	https://www.wikidata.org/wiki/Q3621696
concepts[18].display_name	Language model
concepts[19].id	https://openalex.org/C124101348
concepts[19].level	1
concepts[19].score	0.3376232087612152
concepts[19].wikidata	https://www.wikidata.org/wiki/Q172491
concepts[19].display_name	Data mining
concepts[20].id	https://openalex.org/C23123220
concepts[20].level	1
concepts[20].score	0.3306768834590912
concepts[20].wikidata	https://www.wikidata.org/wiki/Q816826
concepts[20].display_name	Information retrieval
concepts[21].id	https://openalex.org/C183322885
concepts[21].level	3
concepts[21].score	0.29223552346229553
concepts[21].wikidata	https://www.wikidata.org/wiki/Q17007702
concepts[21].display_name	Context model
concepts[22].id	https://openalex.org/C68859911
concepts[22].level	2
concepts[22].score	0.28330859541893005
concepts[22].wikidata	https://www.wikidata.org/wiki/Q1503724
concepts[22].display_name	Pattern matching
concepts[23].id	https://openalex.org/C98045186
concepts[23].level	2
concepts[23].score	0.27824723720550537
concepts[23].wikidata	https://www.wikidata.org/wiki/Q205663
concepts[23].display_name	Process (computing)
concepts[24].id	https://openalex.org/C86251818
concepts[24].level	2
concepts[24].score	0.27686211466789246
concepts[24].wikidata	https://www.wikidata.org/wiki/Q816754
concepts[24].display_name	Benchmarking
concepts[25].id	https://openalex.org/C67186912
concepts[25].level	2
concepts[25].score	0.2730334401130676
concepts[25].wikidata	https://www.wikidata.org/wiki/Q367664
concepts[25].display_name	Data modeling
concepts[26].id	https://openalex.org/C2776459999
concepts[26].level	2
concepts[26].score	0.26774126291275024
concepts[26].wikidata	https://www.wikidata.org/wiki/Q2119376
concepts[26].display_name	Fidelity
concepts[27].id	https://openalex.org/C2778180026
concepts[27].level	4
concepts[27].score	0.2602304220199585
concepts[27].wikidata	https://www.wikidata.org/wiki/Q18378163
concepts[27].display_name	Semantic heterogeneity
concepts[28].id	https://openalex.org/C2780626000
concepts[28].level	2
concepts[28].score	0.25922906398773193
concepts[28].wikidata	https://www.wikidata.org/wiki/Q5936775
concepts[28].display_name	Human-in-the-loop
concepts[29].id	https://openalex.org/C55166926
concepts[29].level	2
concepts[29].score	0.2537321150302887
concepts[29].wikidata	https://www.wikidata.org/wiki/Q2892946
concepts[29].display_name	Oracle
keywords[0].id	https://openalex.org/keywords/benchmark
keywords[0].score	0.8338862657546997
keywords[0].display_name	Benchmark (surveying)
keywords[1].id	https://openalex.org/keywords/workflow
keywords[1].score	0.6959211230278015
keywords[1].display_name	Workflow
keywords[2].id	https://openalex.org/keywords/context
keywords[2].score	0.6401889324188232
keywords[2].display_name	Context (archaeology)
keywords[3].id	https://openalex.org/keywords/pipeline
keywords[3].score	0.6215966939926147
keywords[3].display_name	Pipeline (software)
keywords[4].id	https://openalex.org/keywords/task
keywords[4].score	0.5843798518180847
keywords[4].display_name	Task (project management)
keywords[5].id	https://openalex.org/keywords/matching
keywords[5].score	0.4918988049030304
keywords[5].display_name	Matching (statistics)
keywords[6].id	https://openalex.org/keywords/semantics
keywords[6].score	0.44271695613861084
keywords[6].display_name	Semantics (computer science)
keywords[7].id	https://openalex.org/keywords/sentence
keywords[7].score	0.4233413636684418
keywords[7].display_name	Sentence
keywords[8].id	https://openalex.org/keywords/server
keywords[8].score	0.39873790740966797
keywords[8].display_name	Server
language
locations[0].id	doi:10.48550/arxiv.2511.17729
locations[0].is_oa	True
locations[0].source.id	https://openalex.org/S4306400194
locations[0].source.issn
locations[0].source.type	repository
locations[0].source.is_oa	True
locations[0].source.issn_l
locations[0].source.is_core	False
locations[0].source.is_in_doaj	False
locations[0].source.display_name	arXiv (Cornell University)
locations[0].source.host_organization	https://openalex.org/I205783295
locations[0].source.host_organization_name	Cornell University
locations[0].source.host_organization_lineage	https://openalex.org/I205783295
locations[0].license	cc-by
locations[0].pdf_url
locations[0].version
locations[0].raw_type	article
locations[0].license_id	https://openalex.org/licenses/cc-by
locations[0].is_accepted	False
locations[0].is_published
locations[0].raw_source_name
locations[0].landing_page_url	https://doi.org/10.48550/arxiv.2511.17729
indexed_in	datacite
authorships[0].author.id	https://openalex.org/A2012237231
authorships[0].author.orcid	https://orcid.org/0000-0001-5203-8199
authorships[0].author.display_name	Yang Zhou
authorships[0].author_position	first
authorships[0].raw_author_name	Zhou, Yang
authorships[0].is_corresponding	True
authorships[1].author.id	https://openalex.org/A2347531055
authorships[1].author.orcid
authorships[1].author.display_name	Zhao Ming-yu
authorships[1].author_position	middle
authorships[1].raw_author_name	Zhao, Mingyu
authorships[1].is_corresponding	False
authorships[2].author.id	https://openalex.org/A2315542665
authorships[2].author.orcid
authorships[2].author.display_name	Wang Zhenting
authorships[2].author_position	middle
authorships[2].raw_author_name	Wang, Zhenting
authorships[2].is_corresponding	False
authorships[3].author.id
authorships[3].author.orcid
authorships[3].author.display_name	Gu, Difei
authorships[3].author_position	middle
authorships[3].raw_author_name	Gu, Difei
authorships[3].is_corresponding	False
authorships[4].author.id	https://openalex.org/A4281901083
authorships[4].author.orcid
authorships[4].author.display_name	Guo, Bangwei
authorships[4].author_position	middle
authorships[4].raw_author_name	Guo, Bangwei
authorships[4].is_corresponding	False
authorships[5].author.id	https://openalex.org/A2102770750
authorships[5].author.orcid
authorships[5].author.display_name	Ye RuoSong
authorships[5].author_position	middle
authorships[5].raw_author_name	Ye, Ruosong
authorships[5].is_corresponding	False
authorships[6].author.id	https://openalex.org/A4202193308
authorships[6].author.orcid
authorships[6].author.display_name	Han, Ligong
authorships[6].author_position	middle
authorships[6].raw_author_name	Han, Ligong
authorships[6].is_corresponding	False
authorships[7].author.id	https://openalex.org/A2155437886
authorships[7].author.orcid
authorships[7].author.display_name	Jin Can
authorships[7].author_position	middle
authorships[7].raw_author_name	Jin, Can
authorships[7].is_corresponding	False
authorships[8].author.id	https://openalex.org/A4222827944
authorships[8].author.orcid
authorships[8].author.display_name	Metaxas, Dimitris N.
authorships[8].author_position	last
authorships[8].raw_author_name	Metaxas, Dimitris N.
authorships[8].is_corresponding	False
has_content.pdf	False
has_content.grobid_xml	False
is_paratext	False
open_access.is_oa	True
open_access.oa_url	https://doi.org/10.48550/arxiv.2511.17729
open_access.oa_status	green
open_access.any_repository_has_fulltext	False
created_date	2025-11-27T00:00:00
display_name	M^3-Bench: Multi-Modal, Multi-Hop, Multi-Threaded Tool-Using MLLM Agent Benchmark
has_fulltext	False
is_retracted	False
updated_date	2025-12-03T23:09:05.601824
primary_topic
cited_by_count	0
locations_count	1
best_oa_location.id	doi:10.48550/arxiv.2511.17729
best_oa_location.is_oa	True
best_oa_location.source.id	https://openalex.org/S4306400194
best_oa_location.source.issn
best_oa_location.source.type	repository
best_oa_location.source.is_oa	True
best_oa_location.source.issn_l
best_oa_location.source.is_core	False
best_oa_location.source.is_in_doaj	False
best_oa_location.source.display_name	arXiv (Cornell University)
best_oa_location.source.host_organization	https://openalex.org/I205783295
best_oa_location.source.host_organization_name	Cornell University
best_oa_location.source.host_organization_lineage	https://openalex.org/I205783295
best_oa_location.license	cc-by
best_oa_location.pdf_url
best_oa_location.version
best_oa_location.raw_type	article
best_oa_location.license_id	https://openalex.org/licenses/cc-by
best_oa_location.is_accepted	False
best_oa_location.is_published	False
best_oa_location.raw_source_name
best_oa_location.landing_page_url	https://doi.org/10.48550/arxiv.2511.17729
primary_location.id	doi:10.48550/arxiv.2511.17729
primary_location.is_oa	True
primary_location.source.id	https://openalex.org/S4306400194
primary_location.source.issn
primary_location.source.type	repository
primary_location.source.is_oa	True
primary_location.source.issn_l
primary_location.source.is_core	False
primary_location.source.is_in_doaj	False
primary_location.source.display_name	arXiv (Cornell University)
primary_location.source.host_organization	https://openalex.org/I205783295
primary_location.source.host_organization_name	Cornell University
primary_location.source.host_organization_lineage	https://openalex.org/I205783295
primary_location.license	cc-by
primary_location.pdf_url
primary_location.version
primary_location.raw_type	article
primary_location.license_id	https://openalex.org/licenses/cc-by
primary_location.is_accepted	False
primary_location.is_published	False
primary_location.raw_source_name
primary_location.landing_page_url	https://doi.org/10.48550/arxiv.2511.17729
publication_date	2025-11-21
publication_year	2025
referenced_works_count	0
abstract_inverted_index.a	42, 53
abstract_inverted_index.28	85
abstract_inverted_index.On	66
abstract_inverted_index.We	0, 40
abstract_inverted_index.an	96, 104
abstract_inverted_index.at	161
abstract_inverted_index.in	130, 136
abstract_inverted_index.is	160
abstract_inverted_index.of	35, 68, 121
abstract_inverted_index.to	61
abstract_inverted_index.we	71
abstract_inverted_index.231	88
abstract_inverted_index.MCP	132
abstract_inverted_index.Our	156
abstract_inverted_index.The	16, 82
abstract_inverted_index.and	21, 28, 33, 56, 90, 117, 139, 153
abstract_inverted_index.for	6, 145
abstract_inverted_index.the	3, 12, 143
abstract_inverted_index.top	67
abstract_inverted_index.use	10
abstract_inverted_index.LLMs	125
abstract_inverted_index.Task	115
abstract_inverted_index.each	47
abstract_inverted_index.four	106
abstract_inverted_index.from	79
abstract_inverted_index.gaps	129
abstract_inverted_index.need	144
abstract_inverted_index.over	150
abstract_inverted_index.that	24, 45, 75, 147
abstract_inverted_index.this	69
abstract_inverted_index.tool	9, 48, 133, 154
abstract_inverted_index.use,	134
abstract_inverted_index.with	52, 87, 101
abstract_inverted_index.&	98
abstract_inverted_index.Judge	99
abstract_inverted_index.Model	13
abstract_inverted_index.call,	49
abstract_inverted_index.first	4
abstract_inverted_index.human	102
abstract_inverted_index.judge	111
abstract_inverted_index.large	107
abstract_inverted_index.spans	84
abstract_inverted_index.text,	152
abstract_inverted_index.under	11
abstract_inverted_index.(LLMs)	110
abstract_inverted_index.across	38
abstract_inverted_index.embeds	50
abstract_inverted_index.models	109
abstract_inverted_index.obtain	62
abstract_inverted_index.reason	149
abstract_inverted_index.report	72
abstract_inverted_index.reveal	127
abstract_inverted_index.steps.	39
abstract_inverted_index.tools,	89
abstract_inverted_index.visual	26
abstract_inverted_index.(MLLMs)	126
abstract_inverted_index.Context	14
abstract_inverted_index.curated	94
abstract_inverted_index.graphs.	155
abstract_inverted_index.images,	151
abstract_inverted_index.jointly	148
abstract_inverted_index.methods	146
abstract_inverted_index.metrics	74
abstract_inverted_index.present	1
abstract_inverted_index.reports	113
abstract_inverted_index.require	25
abstract_inverted_index.servers	86
abstract_inverted_index.targets	18
abstract_inverted_index.textual	29
abstract_inverted_index.through	95
abstract_inverted_index.Executor	97
abstract_inverted_index.argument	137
abstract_inverted_index.decouple	76
abstract_inverted_index.encoder,	55
abstract_inverted_index.end-task	114
abstract_inverted_index.ensemble	112
abstract_inverted_index.fidelity	78, 138
abstract_inverted_index.language	108
abstract_inverted_index.matching	60
abstract_inverted_index.performs	57
abstract_inverted_index.pipeline	100
abstract_inverted_index.provides	91
abstract_inverted_index.semantic	77
abstract_inverted_index.sentence	54
abstract_inverted_index.workflow	80
abstract_inverted_index.Hungarian	59
abstract_inverted_index.Protocol.	15
abstract_inverted_index.alignment	44
abstract_inverted_index.anonymous	158
abstract_inverted_index.auditable	63
abstract_inverted_index.auxiliary	105
abstract_inverted_index.benchmark	5, 17, 83
abstract_inverted_index.grounding	27
abstract_inverted_index.introduce	41
abstract_inverted_index.multi-hop	20
abstract_inverted_index.resources	37
abstract_inverted_index.structure	140
abstract_inverted_index.workflows	23
abstract_inverted_index.Completion	116
abstract_inverted_index.M^3-Bench,	2
abstract_inverted_index.Multimodal	124
abstract_inverted_index.alignment,	70
abstract_inverted_index.cross-tool	31
abstract_inverted_index.evaluating	7
abstract_inverted_index.grounding.	119
abstract_inverted_index.multimodal	8, 131
abstract_inverted_index.one-to-one	64
abstract_inverted_index.persistent	128
abstract_inverted_index.realistic,	19
abstract_inverted_index.reasoning,	30
abstract_inverted_index.repository	159
abstract_inverted_index.serializes	46
abstract_inverted_index.signatures	51
abstract_inverted_index.Benchmark's	157
abstract_inverted_index.Evaluations	120
abstract_inverted_index.information	118
abstract_inverted_index.persistence	34
abstract_inverted_index.consistency,	141
abstract_inverted_index.consistency.	81
abstract_inverted_index.intermediate	36
abstract_inverted_index.particularly	135
abstract_inverted_index.standardized	92
abstract_inverted_index.trajectories	93
abstract_inverted_index.underscoring	142
abstract_inverted_index.dependencies,	32
abstract_inverted_index.interpretable	73
abstract_inverted_index.verification;	103
abstract_inverted_index.multi-threaded	22
abstract_inverted_index.representative	122
abstract_inverted_index.correspondences.	65
abstract_inverted_index.state-of-the-art	123
abstract_inverted_index.similarity-driven	43
abstract_inverted_index.similarity-bucketed	58
abstract_inverted_index.https://github.com/EtaYang10th/Open-M3-Bench	162
cited_by_percentile_year
countries_distinct_count	0
institutions_distinct_count	9
citation_normalized_percentile
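
The `abstract_inverted_index.*` rows in the payload above store the abstract as a map from each token to the word positions where it occurs. A minimal sketch of turning those rows back into running text follows. It assumes each row is a key and a comma-separated position list separated by a tab, as in this dump. The names `parse_index` and `rebuild_abstract` are hypothetical, and the sample keeps only the first seven positions for brevity.

```python
PREFIX = "abstract_inverted_index."

def parse_index(lines):
    """Parse 'abstract_inverted_index.<token>\\t<p1>, <p2>' rows into {token: [positions]}."""
    index = {}
    for line in lines:
        key, _, value = line.partition("\t")
        if key.startswith(PREFIX) and value.strip():
            # Strip only the fixed prefix: tokens such as "steps." may contain dots.
            index[key[len(PREFIX):]] = [int(p) for p in value.split(",")]
    return index

def rebuild_abstract(index):
    """Place every token at each of its positions and join them in order."""
    by_position = {pos: token for token, positions in index.items() for pos in positions}
    return " ".join(by_position[pos] for pos in sorted(by_position))

# Sample rows taken from the payload, with positions beyond 6 dropped for brevity.
sample_rows = [
    "abstract_inverted_index.We\t0",
    "abstract_inverted_index.for\t6",
    "abstract_inverted_index.the\t3",
    "abstract_inverted_index.first\t4",
    "abstract_inverted_index.present\t1",
    "abstract_inverted_index.benchmark\t5",
    "abstract_inverted_index.M^3-Bench,\t2",
    "cited_by_count\t0",  # unrelated rows are ignored
]
text = rebuild_abstract(parse_index(sample_rows))
print(text)  # We present M^3-Bench, the first benchmark for
```

Applied to the full set of rows, the same two functions should reproduce the abstract shown at the top of the record, since tokens keep their original punctuation and capitalization.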