MDK12-Bench: A Multi-Discipline Benchmark for Evaluating Reasoning in Multimodal Large Language Models
2025 · Open Access · DOI: https://doi.org/10.48550/arxiv.2504.05782
Multimodal reasoning, which integrates language and visual cues into problem solving and decision making, is a fundamental aspect of human intelligence and a crucial step toward artificial general intelligence. However, the evaluation of multimodal reasoning capabilities in Multimodal Large Language Models (MLLMs) remains inadequate. Most existing reasoning benchmarks are constrained by limited data size, narrow domain coverage, and unstructured knowledge distribution. To close these gaps, we introduce MDK12-Bench, a multi-disciplinary benchmark assessing the reasoning capabilities of MLLMs via real-world K-12 examinations. Spanning six disciplines (math, physics, chemistry, biology, geography, and information science), our benchmark comprises 140K reasoning instances across diverse difficulty levels from primary school to 12th grade. It features 6,827 instance-level knowledge point annotations based on a well-organized knowledge structure, detailed answer explanations, difficulty labels, and cross-year partitions, providing a robust platform for comprehensive evaluation. Additionally, we present a novel dynamic evaluation framework to mitigate data contamination issues by bootstrapping question forms, question types, and image styles during evaluation. Extensive experiments on MDK12-Bench reveal significant limitations of current MLLMs in multimodal reasoning. The findings on our benchmark provide insights into the development of next-generation models. Our data and code are available at https://github.com/LanceZPF/MDK12.
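The abstract names the idea of varying question forms and types at evaluation time but does not say how this is done. As a rough illustration only, the sketch below shows two simple transformations of a multiple-choice item: reordering the options and dropping them to obtain an open-ended question. The `MultipleChoiceItem` schema, both function names, and the sample question are hypothetical and are not taken from MDK12-Bench or its repository; image-style changes are not covered.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class MultipleChoiceItem:
    """Hypothetical item schema; not the MDK12-Bench data format."""
    question: str
    options: tuple[str, ...]
    answer_index: int  # position of the correct option in `options`


def shuffle_options(item: MultipleChoiceItem, seed: int) -> MultipleChoiceItem:
    """Return a copy of `item` with its options reordered.

    The answer index is remapped so it still points at the correct option.
    """
    rng = random.Random(seed)  # seeded so a variant can be reproduced
    order = list(range(len(item.options)))
    rng.shuffle(order)
    new_options = tuple(item.options[i] for i in order)
    # order[j] is the original index now sitting at position j.
    return MultipleChoiceItem(item.question, new_options, order.index(item.answer_index))


def to_open_ended(item: MultipleChoiceItem) -> tuple[str, str]:
    """Drop the options so the answer text must be produced, not selected."""
    return item.question, item.options[item.answer_index]


# Illustrative item, invented for this sketch.
item = MultipleChoiceItem(
    question="Which gas do plants absorb during photosynthesis?",
    options=("Oxygen", "Carbon dioxide", "Nitrogen", "Hydrogen"),
    answer_index=1,
)
variant = shuffle_options(item, seed=0)
prompt, reference = to_open_ended(item)
```

A model that has memorized the option letter of a leaked item gains nothing from the shuffled variant, which is one plausible reading of how form changes could reduce contamination effects.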
- Type: preprint
- Language: en
- Landing Page: http://arxiv.org/abs/2504.05782, https://arxiv.org/pdf/2504.05782
- OA Status: green
- OpenAlex ID: https://openalex.org/W4416529495
Raw OpenAlex JSON
- OpenAlex ID: https://openalex.org/W4416529495 (Canonical identifier for this work in OpenAlex)
- DOI: https://doi.org/10.48550/arxiv.2504.05782 (Digital Object Identifier)
- Title: MDK12-Bench: A Multi-Discipline Benchmark for Evaluating Reasoning in Multimodal Large Language Models (Work title)
- Type: preprint (OpenAlex work type)
- Language: en (Primary language)
- Publication year: 2025 (Year of publication)
- Publication date: 2025-04-08 (Full publication date if available)
- Authors: Pengfei Zhou, Xiaopeng Peng, Jingjing Ai, Yuefeng Qiu, Zongjin Li, Ming Li, Yulong Feng, Zizhen Li, Xiaojun Chang, Wenqi Shao, Yang You (List of authors in order)
- Landing page: https://arxiv.org/abs/2504.05782 (Publisher landing page)
- PDF URL: https://arxiv.org/pdf/2504.05782 (Direct link to full text PDF)
- Open access: Yes (Whether a free full text is available)
- OA status: green (Open access status per OpenAlex)
- OA URL: https://arxiv.org/pdf/2504.05782 (Direct OA link when available)
- Cited by: 0 (Total citation count in OpenAlex)
Full payload
| Field | Value |
|---|---|
| id | https://openalex.org/W4416529495 |
| doi | https://doi.org/10.48550/arxiv.2504.05782 |
| ids.doi | https://doi.org/10.48550/arxiv.2504.05782 |
| ids.openalex | https://openalex.org/W4416529495 |
| fwci | |
| type | preprint |
| title | MDK12-Bench: A Multi-Discipline Benchmark for Evaluating Reasoning in Multimodal Large Language Models |
| biblio.issue | |
| biblio.volume | |
| biblio.last_page | |
| biblio.first_page | |
| is_xpac | False |
| apc_list | |
| apc_paid | |
| language | en |
| locations[0].id | pmh:oai:arXiv.org:2504.05782 |
| locations[0].is_oa | True |
| locations[0].source.id | https://openalex.org/S4306400194 |
| locations[0].source.issn | |
| locations[0].source.type | repository |
| locations[0].source.is_oa | True |
| locations[0].source.issn_l | |
| locations[0].source.is_core | False |
| locations[0].source.is_in_doaj | False |
| locations[0].source.display_name | arXiv (Cornell University) |
| locations[0].source.host_organization | https://openalex.org/I205783295 |
| locations[0].source.host_organization_name | Cornell University |
| locations[0].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[0].license | |
| locations[0].pdf_url | https://arxiv.org/pdf/2504.05782 |
| locations[0].version | submittedVersion |
| locations[0].raw_type | text |
| locations[0].license_id | |
| locations[0].is_accepted | False |
| locations[0].is_published | False |
| locations[0].raw_source_name | |
| locations[0].landing_page_url | http://arxiv.org/abs/2504.05782 |
| locations[1].id | doi:10.48550/arxiv.2504.05782 |
| locations[1].is_oa | True |
| locations[1].source.id | https://openalex.org/S4306400194 |
| locations[1].source.issn | |
| locations[1].source.type | repository |
| locations[1].source.is_oa | True |
| locations[1].source.issn_l | |
| locations[1].source.is_core | False |
| locations[1].source.is_in_doaj | False |
| locations[1].source.display_name | arXiv (Cornell University) |
| locations[1].source.host_organization | https://openalex.org/I205783295 |
| locations[1].source.host_organization_name | Cornell University |
| locations[1].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[1].license | |
| locations[1].pdf_url | |
| locations[1].version | |
| locations[1].raw_type | article |
| locations[1].license_id | |
| locations[1].is_accepted | False |
| locations[1].is_published | |
| locations[1].raw_source_name | |
| locations[1].landing_page_url | https://doi.org/10.48550/arxiv.2504.05782 |
| indexed_in | arxiv, datacite |
| authorships[0].author.id | https://openalex.org/A5101569327 |
| authorships[0].author.orcid | https://orcid.org/0000-0001-6395-8708 |
| authorships[0].author.display_name | Pengfei Zhou |
| authorships[0].author_position | first |
| authorships[0].raw_author_name | Zhou, Pengfei |
| authorships[0].is_corresponding | False |
| authorships[1].author.id | https://openalex.org/A5103176454 |
| authorships[1].author.orcid | https://orcid.org/0000-0002-4728-005X |
| authorships[1].author.display_name | Xiaopeng Peng |
| authorships[1].author_position | middle |
| authorships[1].raw_author_name | Peng, Xiaopeng |
| authorships[1].is_corresponding | False |
| authorships[2].author.id | https://openalex.org/A5038964194 |
| authorships[2].author.orcid | https://orcid.org/0009-0005-1780-8553 |
| authorships[2].author.display_name | Jingjing Ai |
| authorships[2].author_position | middle |
| authorships[2].raw_author_name | Ai, Jiaxin |
| authorships[2].is_corresponding | False |
| authorships[3].author.id | https://openalex.org/A5015280598 |
| authorships[3].author.orcid | https://orcid.org/0000-0001-5069-3384 |
| authorships[3].author.display_name | Yuefeng Qiu |
| authorships[3].author_position | middle |
| authorships[3].raw_author_name | Qiu, Yansheng |
| authorships[3].is_corresponding | False |
| authorships[4].author.id | https://openalex.org/A5011466881 |
| authorships[4].author.orcid | https://orcid.org/0000-0002-8477-6863 |
| authorships[4].author.display_name | Zongjin Li |
| authorships[4].author_position | middle |
| authorships[4].raw_author_name | Li, Zhen |
| authorships[4].is_corresponding | False |
| authorships[5].author.id | https://openalex.org/A5100351464 |
| authorships[5].author.orcid | https://orcid.org/0000-0002-7699-4502 |
| authorships[5].author.display_name | Ming Li |
| authorships[5].author_position | middle |
| authorships[5].raw_author_name | Li, Ming |
| authorships[5].is_corresponding | False |
| authorships[6].author.id | https://openalex.org/A5050688781 |
| authorships[6].author.orcid | https://orcid.org/0000-0001-8877-5027 |
| authorships[6].author.display_name | Yulong Feng |
| authorships[6].author_position | middle |
| authorships[6].raw_author_name | Feng, Yukang |
| authorships[6].is_corresponding | False |
| authorships[7].author.id | https://openalex.org/A5103253130 |
| authorships[7].author.orcid | https://orcid.org/0009-0005-7595-6466 |
| authorships[7].author.display_name | Zizhen Li |
| authorships[7].author_position | middle |
| authorships[7].raw_author_name | Li, Zizhen |
| authorships[7].is_corresponding | False |
| authorships[8].author.id | https://openalex.org/A5034967388 |
| authorships[8].author.orcid | https://orcid.org/0000-0002-7778-8807 |
| authorships[8].author.display_name | Xiaojun Chang |
| authorships[8].author_position | middle |
| authorships[8].raw_author_name | Chang, Xiaojun |
| authorships[8].is_corresponding | False |
| authorships[9].author.id | https://openalex.org/A5101827257 |
| authorships[9].author.orcid | https://orcid.org/0000-0003-3781-4086 |
| authorships[9].author.display_name | Wenqi Shao |
| authorships[9].author_position | middle |
| authorships[9].raw_author_name | Shao, Wenqi |
| authorships[9].is_corresponding | False |
| authorships[10].author.id | https://openalex.org/A5100658705 |
| authorships[10].author.orcid | https://orcid.org/0000-0003-2816-4384 |
| authorships[10].author.display_name | Yang You |
| authorships[10].author_position | middle |
| authorships[10].raw_author_name | You, Yang |
| authorships[10].is_corresponding | False |
| has_content.pdf | True |
| has_content.grobid_xml | True |
| is_paratext | False |
| open_access.is_oa | True |
| open_access.oa_url | https://arxiv.org/pdf/2504.05782 |
| open_access.oa_status | green |
| open_access.any_repository_has_fulltext | False |
| created_date | 2025-10-10T00:00:00 |
| display_name | MDK12-Bench: A Multi-Discipline Benchmark for Evaluating Reasoning in Multimodal Large Language Models |
| has_fulltext | False |
| is_retracted | False |
| updated_date | 2025-11-28T16:37:59.384490 |
| primary_topic | |
| cited_by_count | 0 |
| locations_count | 2 |
| best_oa_location.id | pmh:oai:arXiv.org:2504.05782 |
| best_oa_location.is_oa | True |
| best_oa_location.source.id | https://openalex.org/S4306400194 |
| best_oa_location.source.issn | |
| best_oa_location.source.type | repository |
| best_oa_location.source.is_oa | True |
| best_oa_location.source.issn_l | |
| best_oa_location.source.is_core | False |
| best_oa_location.source.is_in_doaj | False |
| best_oa_location.source.display_name | arXiv (Cornell University) |
| best_oa_location.source.host_organization | https://openalex.org/I205783295 |
| best_oa_location.source.host_organization_name | Cornell University |
| best_oa_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| best_oa_location.license | |
| best_oa_location.pdf_url | https://arxiv.org/pdf/2504.05782 |
| best_oa_location.version | submittedVersion |
| best_oa_location.raw_type | text |
| best_oa_location.license_id | |
| best_oa_location.is_accepted | False |
| best_oa_location.is_published | False |
| best_oa_location.raw_source_name | |
| best_oa_location.landing_page_url | http://arxiv.org/abs/2504.05782 |
| primary_location.id | pmh:oai:arXiv.org:2504.05782 |
| primary_location.is_oa | True |
| primary_location.source.id | https://openalex.org/S4306400194 |
| primary_location.source.issn | |
| primary_location.source.type | repository |
| primary_location.source.is_oa | True |
| primary_location.source.issn_l | |
| primary_location.source.is_core | False |
| primary_location.source.is_in_doaj | False |
| primary_location.source.display_name | arXiv (Cornell University) |
| primary_location.source.host_organization | https://openalex.org/I205783295 |
| primary_location.source.host_organization_name | Cornell University |
| primary_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| primary_location.license | |
| primary_location.pdf_url | https://arxiv.org/pdf/2504.05782 |
| primary_location.version | submittedVersion |
| primary_location.raw_type | text |
| primary_location.license_id | |
| primary_location.is_accepted | False |
| primary_location.is_published | False |
| primary_location.raw_source_name | |
| primary_location.landing_page_url | http://arxiv.org/abs/2504.05782 |
| publication_date | 2025-04-08 |
| publication_year | 2025 |
| referenced_works_count | 0 |
| abstract_inverted_index.a | 15, 22, 68, 117, 130, 139 |
| abstract_inverted_index.It | 108 |
| abstract_inverted_index.To | 61 |
| abstract_inverted_index.at | 194 |
| abstract_inverted_index.by | 50, 149 |
| abstract_inverted_index.in | 36, 171 |
| abstract_inverted_index.is | 14 |
| abstract_inverted_index.of | 18, 32, 75, 168, 184 |
| abstract_inverted_index.on | 116, 162, 176 |
| abstract_inverted_index.to | 105, 144 |
| abstract_inverted_index.we | 65, 137 |
| abstract_inverted_index.Our | 188 |
| abstract_inverted_index.The | 174 |
| abstract_inverted_index.and | 5, 11, 21, 57, 89, 126, 155, 190 |
| abstract_inverted_index.are | 48, 192 |
| abstract_inverted_index.for | 133 |
| abstract_inverted_index.our | 92, 177 |
| abstract_inverted_index.six | 82 |
| abstract_inverted_index.the | 30, 72, 165, 182, 185 |
| abstract_inverted_index.via | 77 |
| abstract_inverted_index.12th | 106 |
| abstract_inverted_index.140K | 95 |
| abstract_inverted_index.K-12 | 79 |
| abstract_inverted_index.Most | 44 |
| abstract_inverted_index.cues | 7 |
| abstract_inverted_index.data | 52, 146, 189 |
| abstract_inverted_index.from | 102 |
| abstract_inverted_index.into | 8, 181 |
| abstract_inverted_index.step | 24 |
| abstract_inverted_index.6,827 | 110 |
| abstract_inverted_index.Large | 38 |
| abstract_inverted_index.MLLMs | 76, 170 |
| abstract_inverted_index.based | 115 |
| abstract_inverted_index.close | 62 |
| abstract_inverted_index.codes | 191 |
| abstract_inverted_index.gaps, | 64 |
| abstract_inverted_index.human | 19 |
| abstract_inverted_index.image | 156 |
| abstract_inverted_index.novel | 140 |
| abstract_inverted_index.point | 113 |
| abstract_inverted_index.size, | 53 |
| abstract_inverted_index.these | 63 |
| abstract_inverted_index.which | 2 |
| abstract_inverted_index.(math, | 84 |
| abstract_inverted_index.Models | 40 |
| abstract_inverted_index.across | 98 |
| abstract_inverted_index.answer | 122 |
| abstract_inverted_index.aspect | 17 |
| abstract_inverted_index.domain | 55 |
| abstract_inverted_index.during | 158 |
| abstract_inverted_index.forms, | 152 |
| abstract_inverted_index.grade. | 107 |
| abstract_inverted_index.issues | 148 |
| abstract_inverted_index.labels | 125 |
| abstract_inverted_index.levels | 101 |
| abstract_inverted_index.narrow | 54 |
| abstract_inverted_index.robust | 131 |
| abstract_inverted_index.school | 104 |
| abstract_inverted_index.styles | 157 |
| abstract_inverted_index.toward | 25 |
| abstract_inverted_index.types, | 154 |
| abstract_inverted_index.visual | 6 |
| abstract_inverted_index.(MLLMs) | 41 |
| abstract_inverted_index.crucial | 23 |
| abstract_inverted_index.current | 169 |
| abstract_inverted_index.diverse | 99 |
| abstract_inverted_index.dynamic | 141 |
| abstract_inverted_index.general | 27 |
| abstract_inverted_index.limited | 51 |
| abstract_inverted_index.making, | 13 |
| abstract_inverted_index.models. | 187 |
| abstract_inverted_index.present | 138 |
| abstract_inverted_index.primary | 103 |
| abstract_inverted_index.problem | 9 |
| abstract_inverted_index.provide | 179 |
| abstract_inverted_index.remains | 42 |
| abstract_inverted_index.reveals | 164 |
| abstract_inverted_index.solving | 10 |
| abstract_inverted_index.However, | 29 |
| abstract_inverted_index.Language | 39 |
| abstract_inverted_index.Spanning | 81 |
| abstract_inverted_index.biology, | 87 |
| abstract_inverted_index.decision | 12 |
| abstract_inverted_index.detailed | 121 |
| abstract_inverted_index.existing | 45 |
| abstract_inverted_index.features | 109 |
| abstract_inverted_index.findings | 175 |
| abstract_inverted_index.insights | 180 |
| abstract_inverted_index.language | 4 |
| abstract_inverted_index.mitigate | 145 |
| abstract_inverted_index.physics, | 85 |
| abstract_inverted_index.platform | 132 |
| abstract_inverted_index.question | 151, 153 |
| abstract_inverted_index.Extensive | 160 |
| abstract_inverted_index.assessing | 71 |
| abstract_inverted_index.available | 193 |
| abstract_inverted_index.benchmark | 70, 93, 178 |
| abstract_inverted_index.comprises | 94 |
| abstract_inverted_index.coverage, | 56 |
| abstract_inverted_index.framework | 143 |
| abstract_inverted_index.instances | 97 |
| abstract_inverted_index.introduce | 66 |
| abstract_inverted_index.knowledge | 59, 112, 119 |
| abstract_inverted_index.providing | 129 |
| abstract_inverted_index.reasoning | 34, 46, 73, 96 |
| abstract_inverted_index.science), | 91 |
| abstract_inverted_index.Multimodal | 0, 37 |
| abstract_inverted_index.artificial | 26 |
| abstract_inverted_index.benchmarks | 47 |
| abstract_inverted_index.chemistry, | 86 |
| abstract_inverted_index.cross-year | 127 |
| abstract_inverted_index.difficulty | 100, 124 |
| abstract_inverted_index.evaluation | 31, 142 |
| abstract_inverted_index.experiment | 161 |
| abstract_inverted_index.geography, | 88 |
| abstract_inverted_index.integrates | 3 |
| abstract_inverted_index.limitation | 167 |
| abstract_inverted_index.multimodal | 33, 172 |
| abstract_inverted_index.real-world | 78 |
| abstract_inverted_index.reasoning, | 1 |
| abstract_inverted_index.reasoning. | 173 |
| abstract_inverted_index.structure, | 120 |
| abstract_inverted_index.MDK12-Bench | 163 |
| abstract_inverted_index.annotations | 114 |
| abstract_inverted_index.constrained | 49 |
| abstract_inverted_index.development | 183 |
| abstract_inverted_index.disciplines | 83 |
| abstract_inverted_index.evaluation. | 135, 159 |
| abstract_inverted_index.fundamental | 16 |
| abstract_inverted_index.inadequate. | 43 |
| abstract_inverted_index.information | 90 |
| abstract_inverted_index.partitions, | 128 |
| abstract_inverted_index.significant | 166 |
| abstract_inverted_index.MDK12-Bench, | 67 |
| abstract_inverted_index.capabilities | 35, 74 |
| abstract_inverted_index.intelligence | 20 |
| abstract_inverted_index.unstructured | 58 |
| abstract_inverted_index.Additionally, | 136 |
| abstract_inverted_index.bootstrapping | 150 |
| abstract_inverted_index.comprehensive | 134 |
| abstract_inverted_index.contamination | 147 |
| abstract_inverted_index.distribution. | 60 |
| abstract_inverted_index.examinations. | 80 |
| abstract_inverted_index.explanations, | 123 |
| abstract_inverted_index.intelligence. | 28 |
| abstract_inverted_index.instance-level | 111 |
| abstract_inverted_index.well-organized | 118 |
| abstract_inverted_index.next-generation | 186 |
| abstract_inverted_index.multi-disciplinary | 69 |
| abstract_inverted_index.https://github.com/LanceZPF/MDK12. | 195 |
| cited_by_percentile_year | |
| countries_distinct_count | 0 |
| institutions_distinct_count | 11 |
| citation_normalized_percentile | |
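The `abstract_inverted_index.*` rows above store the abstract as a map from each token to the zero-based positions where it occurs. A minimal sketch of how such a map can be turned back into running text follows. The function name `rebuild_abstract` is an assumption made for illustration, and the sample holds only the tokens at positions 0 to 12 copied from the table, with later positions of repeated tokens (for example `and` at 21 or `into` at 181) left out to keep it short.

```python
def rebuild_abstract(inverted_index: dict[str, list[int]]) -> str:
    """Rebuild text from a token -> positions map (hypothetical helper)."""
    by_position: dict[int, str] = {}
    for token, positions in inverted_index.items():
        for position in positions:
            by_position[position] = token
    # Emit tokens in ascending position order, separated by single spaces.
    return " ".join(by_position[p] for p in sorted(by_position))


# Positions 0-12 only, taken from the abstract_inverted_index rows above.
sample_index = {
    "Multimodal": [0],
    "reasoning,": [1],
    "which": [2],
    "integrates": [3],
    "language": [4],
    "and": [5, 11],
    "visual": [6],
    "cues": [7],
    "into": [8],
    "problem": [9],
    "solving": [10],
    "decision": [12],
}

text = rebuild_abstract(sample_index)
print(text)
# Multimodal reasoning, which integrates language and visual cues into problem solving and decision
```

Because punctuation stays attached to the tokens (`reasoning,`), joining on spaces is enough to recover the original wording for the positions that are present.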