Towards Robust Mathematical Reasoning
2025 · Open Access · DOI: https://doi.org/10.48550/arxiv.2511.01846
Finding the right north-star metrics is critical for advancing the mathematical reasoning capabilities of foundation models, especially given that existing evaluations are either too easy or focus only on getting correct short answers. To address these issues, we present IMO-Bench, a suite of advanced reasoning benchmarks vetted by a panel of top specialists that specifically targets the level of the International Mathematical Olympiad (IMO), the most prestigious venue for young mathematicians. IMO-AnswerBench first tests models on 400 diverse Olympiad problems with verifiable short answers. IMO-Proof Bench is the next-level evaluation for proof-writing capabilities; it includes both basic and advanced IMO-level problems as well as detailed grading guidelines to facilitate automatic grading. These benchmarks played a crucial role in our historic achievement of gold-level performance at IMO 2025 with Gemini Deep Think (Luong and Lockhart, 2025). Our model achieved 80.0% on IMO-AnswerBench and 65.7% on the advanced IMO-Proof Bench, surpassing the best non-Gemini models by large margins of 6.9% and 42.4%, respectively. We also show that autograders built with Gemini reasoning correlate well with human evaluations, and we construct IMO-GradingBench, with 1000 human gradings on proofs, to enable further progress in automatic evaluation of long-form answers. We hope that IMO-Bench will help the community advance robust mathematical reasoning, and we release it at https://imobench.github.io/.
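The margins quoted in the abstract can be turned into implied baseline scores with a line of arithmetic. The sketch below assumes the margins are absolute percentage-point differences, which the abstract does not state explicitly; the function name and dictionary keys are illustrative only.

```python
# Scores and margins as quoted in the abstract (percent).
reported = {
    "IMO-AnswerBench": {"score": 80.0, "margin": 6.9},
    "IMO-Proof Bench (advanced)": {"score": 65.7, "margin": 42.4},
}


def implied_baseline(score: float, margin: float) -> float:
    """Best non-Gemini score implied by a percentage-point margin (assumption)."""
    return round(score - margin, 1)


for name, row in reported.items():
    baseline = implied_baseline(row["score"], row["margin"])
    print(f"{name}: implied best non-Gemini score = {baseline}%")
```

If the margins were instead relative improvements, the implied baselines would differ, so these figures should be checked against the paper's result tables before being cited.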
- Type: preprint
- Landing Page: http://arxiv.org/abs/2511.01846 (PDF: https://arxiv.org/pdf/2511.01846)
- OA Status: green
- OpenAlex ID: https://openalex.org/W4416440274
Raw OpenAlex JSON
- OpenAlex ID: https://openalex.org/W4416440274 (Canonical identifier for this work in OpenAlex)
- DOI: https://doi.org/10.48550/arxiv.2511.01846 (Digital Object Identifier)
- Title: Towards Robust Mathematical Reasoning (Work title)
- Type: preprint (OpenAlex work type)
- Publication year: 2025 (Year of publication)
- Publication date: 2025-11-03 (Full publication date if available)
- Authors: Daniel Duck-Jin Hwang, Yuri Chervonyi, I. J. Seo, Junsu Kim, Garrett Bingham, Jonathan Lee, Swaroop Mishra, Henryk Michalewski, Jimin Kim, Jae-Hyun Ahn, Jong‐Ho Bae, Xingyou Song, Junehyuk Jung (List of authors in order)
- Landing page: https://arxiv.org/abs/2511.01846 (Publisher landing page)
- PDF URL: https://arxiv.org/pdf/2511.01846 (Direct link to full text PDF)
- Open access: Yes (Whether a free full text is available)
- OA status: green (Open access status per OpenAlex)
- OA URL: https://arxiv.org/pdf/2511.01846 (Direct OA link when available)
- Cited by: 0 (Total citation count in OpenAlex)
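The fields above come from the OpenAlex record for this work. As a minimal sketch, the record could be retrieved from the public OpenAlex works endpoint (`https://api.openalex.org/works/<ID>`) with the standard library alone. The helper names `work_api_url` and `fetch_work` are hypothetical, and the ID check assumes work IDs have the form `W` followed by digits, as this one does.

```python
import json
from urllib.request import urlopen

API_ROOT = "https://api.openalex.org/works/"


def work_api_url(openalex_id: str) -> str:
    """Map an OpenAlex work ID or URL to its API URL (hypothetical helper)."""
    work_key = openalex_id.rstrip("/").rsplit("/", 1)[-1]
    # Assumption: work IDs look like "W" followed by digits.
    if not (work_key.startswith("W") and work_key[1:].isdigit()):
        raise ValueError(f"not an OpenAlex work ID: {openalex_id!r}")
    return API_ROOT + work_key


def fetch_work(openalex_id: str, timeout: float = 10.0) -> dict:
    """Download the JSON record for a work (needs network access)."""
    with urlopen(work_api_url(openalex_id), timeout=timeout) as response:
        return json.load(response)


print(work_api_url("https://openalex.org/W4416440274"))
```

Calling `fetch_work("https://openalex.org/W4416440274")` would then return a dictionary whose keys should correspond to the fields listed here, such as `doi`, `title`, and `publication_date`.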
Full payload
| Field | Value |
|---|---|
| id | https://openalex.org/W4416440274 |
| doi | https://doi.org/10.48550/arxiv.2511.01846 |
| ids.doi | https://doi.org/10.48550/arxiv.2511.01846 |
| ids.openalex | https://openalex.org/W4416440274 |
| fwci | |
| type | preprint |
| title | Towards Robust Mathematical Reasoning |
| biblio.issue | |
| biblio.volume | |
| biblio.last_page | |
| biblio.first_page | |
| is_xpac | False |
| apc_list | |
| apc_paid | |
| language | |
| locations[0].id | pmh:oai:arXiv.org:2511.01846 |
| locations[0].is_oa | True |
| locations[0].source.id | https://openalex.org/S4306400194 |
| locations[0].source.issn | |
| locations[0].source.type | repository |
| locations[0].source.is_oa | True |
| locations[0].source.issn_l | |
| locations[0].source.is_core | False |
| locations[0].source.is_in_doaj | False |
| locations[0].source.display_name | arXiv (Cornell University) |
| locations[0].source.host_organization | https://openalex.org/I205783295 |
| locations[0].source.host_organization_name | Cornell University |
| locations[0].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[0].license | cc-by |
| locations[0].pdf_url | https://arxiv.org/pdf/2511.01846 |
| locations[0].version | submittedVersion |
| locations[0].raw_type | text |
| locations[0].license_id | https://openalex.org/licenses/cc-by |
| locations[0].is_accepted | False |
| locations[0].is_published | False |
| locations[0].raw_source_name | |
| locations[0].landing_page_url | http://arxiv.org/abs/2511.01846 |
| locations[1].id | doi:10.48550/arxiv.2511.01846 |
| locations[1].is_oa | True |
| locations[1].source.id | https://openalex.org/S4306400194 |
| locations[1].source.issn | |
| locations[1].source.type | repository |
| locations[1].source.is_oa | True |
| locations[1].source.issn_l | |
| locations[1].source.is_core | False |
| locations[1].source.is_in_doaj | False |
| locations[1].source.display_name | arXiv (Cornell University) |
| locations[1].source.host_organization | https://openalex.org/I205783295 |
| locations[1].source.host_organization_name | Cornell University |
| locations[1].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[1].license | cc-by |
| locations[1].pdf_url | |
| locations[1].version | |
| locations[1].raw_type | article |
| locations[1].license_id | https://openalex.org/licenses/cc-by |
| locations[1].is_accepted | False |
| locations[1].is_published | |
| locations[1].raw_source_name | |
| locations[1].landing_page_url | https://doi.org/10.48550/arxiv.2511.01846 |
| indexed_in | arxiv, datacite |
| authorships[0].author.id | https://openalex.org/A5090076104 |
| authorships[0].author.orcid | https://orcid.org/0000-0003-1808-3169 |
| authorships[0].author.display_name | Daniel Duck-Jin Hwang |
| authorships[0].author_position | middle |
| authorships[0].raw_author_name | Hwang, Dawsen |
| authorships[0].is_corresponding | False |
| authorships[1].author.id | https://openalex.org/A5116187636 |
| authorships[1].author.orcid | |
| authorships[1].author.display_name | Yuri Chervonyi |
| authorships[1].author_position | middle |
| authorships[1].raw_author_name | Chervonyi, Yuri |
| authorships[1].is_corresponding | False |
| authorships[2].author.id | https://openalex.org/A5113957630 |
| authorships[2].author.orcid | |
| authorships[2].author.display_name | I. J. Seo |
| authorships[2].author_position | middle |
| authorships[2].raw_author_name | Seo, Insuk |
| authorships[2].is_corresponding | False |
| authorships[3].author.id | https://openalex.org/A5041027404 |
| authorships[3].author.orcid | https://orcid.org/0000-0001-9583-7978 |
| authorships[3].author.display_name | Junsu Kim |
| authorships[3].author_position | middle |
| authorships[3].raw_author_name | Kim, Junsu |
| authorships[3].is_corresponding | False |
| authorships[4].author.id | https://openalex.org/A5034277026 |
| authorships[4].author.orcid | https://orcid.org/0000-0002-2898-2582 |
| authorships[4].author.display_name | Garrett Bingham |
| authorships[4].author_position | middle |
| authorships[4].raw_author_name | Bingham, Garrett |
| authorships[4].is_corresponding | False |
| authorships[5].author.id | https://openalex.org/A5100348028 |
| authorships[5].author.orcid | https://orcid.org/0000-0002-1828-1707 |
| authorships[5].author.display_name | Jonathan Lee |
| authorships[5].author_position | middle |
| authorships[5].raw_author_name | Lee, Jonathan |
| authorships[5].is_corresponding | False |
| authorships[6].author.id | https://openalex.org/A5063722751 |
| authorships[6].author.orcid | https://orcid.org/0009-0001-6413-7001 |
| authorships[6].author.display_name | Swaroop Mishra |
| authorships[6].author_position | middle |
| authorships[6].raw_author_name | Mishra, Swaroop |
| authorships[6].is_corresponding | False |
| authorships[7].author.id | https://openalex.org/A5016302450 |
| authorships[7].author.orcid | |
| authorships[7].author.display_name | Henryk Michalewski |
| authorships[7].author_position | last |
| authorships[7].raw_author_name | Michalewski, Henryk |
| authorships[7].is_corresponding | False |
| authorships[8].author.id | https://openalex.org/A5100362177 |
| authorships[8].author.orcid | https://orcid.org/0000-0003-2996-5554 |
| authorships[8].author.display_name | Jimin Kim |
| authorships[8].author_position | middle |
| authorships[8].raw_author_name | Kim, Jimin |
| authorships[8].is_corresponding | False |
| authorships[9].author.id | https://openalex.org/A5013335966 |
| authorships[9].author.orcid | https://orcid.org/0000-0002-2240-4562 |
| authorships[9].author.display_name | Jae-Hyun Ahn |
| authorships[9].author_position | middle |
| authorships[9].raw_author_name | Ahn, Jeonghyun |
| authorships[9].is_corresponding | False |
| authorships[10].author.id | https://openalex.org/A5021913940 |
| authorships[10].author.orcid | https://orcid.org/0000-0002-1786-7132 |
| authorships[10].author.display_name | Jong‐Ho Bae |
| authorships[10].author_position | middle |
| authorships[10].raw_author_name | Bae, Junhwi |
| authorships[10].is_corresponding | False |
| authorships[11].author.id | https://openalex.org/A5081034298 |
| authorships[11].author.orcid | https://orcid.org/0000-0001-6055-3174 |
| authorships[11].author.display_name | Xingyou Song |
| authorships[11].author_position | middle |
| authorships[11].raw_author_name | Song, Xingyou |
| authorships[11].is_corresponding | False |
| authorships[12].author.id | https://openalex.org/A5016842069 |
| authorships[12].author.orcid | https://orcid.org/0000-0003-0778-4189 |
| authorships[12].author.display_name | Junehyuk Jung |
| authorships[12].author_position | middle |
| authorships[12].raw_author_name | Jung, Junehyuk |
| authorships[12].is_corresponding | False |
| has_content.pdf | True |
| has_content.grobid_xml | False |
| is_paratext | False |
| open_access.is_oa | True |
| open_access.oa_url | https://arxiv.org/pdf/2511.01846 |
| open_access.oa_status | green |
| open_access.any_repository_has_fulltext | False |
| created_date | 2025-11-06T00:00:00 |
| display_name | Towards Robust Mathematical Reasoning |
| has_fulltext | False |
| is_retracted | False |
| updated_date | 2025-11-28T14:12:55.796172 |
| primary_topic | |
| cited_by_count | 0 |
| locations_count | 2 |
| best_oa_location.id | pmh:oai:arXiv.org:2511.01846 |
| best_oa_location.is_oa | True |
| best_oa_location.source.id | https://openalex.org/S4306400194 |
| best_oa_location.source.issn | |
| best_oa_location.source.type | repository |
| best_oa_location.source.is_oa | True |
| best_oa_location.source.issn_l | |
| best_oa_location.source.is_core | False |
| best_oa_location.source.is_in_doaj | False |
| best_oa_location.source.display_name | arXiv (Cornell University) |
| best_oa_location.source.host_organization | https://openalex.org/I205783295 |
| best_oa_location.source.host_organization_name | Cornell University |
| best_oa_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| best_oa_location.license | cc-by |
| best_oa_location.pdf_url | https://arxiv.org/pdf/2511.01846 |
| best_oa_location.version | submittedVersion |
| best_oa_location.raw_type | text |
| best_oa_location.license_id | https://openalex.org/licenses/cc-by |
| best_oa_location.is_accepted | False |
| best_oa_location.is_published | False |
| best_oa_location.raw_source_name | |
| best_oa_location.landing_page_url | http://arxiv.org/abs/2511.01846 |
| primary_location.id | pmh:oai:arXiv.org:2511.01846 |
| primary_location.is_oa | True |
| primary_location.source.id | https://openalex.org/S4306400194 |
| primary_location.source.issn | |
| primary_location.source.type | repository |
| primary_location.source.is_oa | True |
| primary_location.source.issn_l | |
| primary_location.source.is_core | False |
| primary_location.source.is_in_doaj | False |
| primary_location.source.display_name | arXiv (Cornell University) |
| primary_location.source.host_organization | https://openalex.org/I205783295 |
| primary_location.source.host_organization_name | Cornell University |
| primary_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| primary_location.license | cc-by |
| primary_location.pdf_url | https://arxiv.org/pdf/2511.01846 |
| primary_location.version | submittedVersion |
| primary_location.raw_type | text |
| primary_location.license_id | https://openalex.org/licenses/cc-by |
| primary_location.is_accepted | False |
| primary_location.is_published | False |
| primary_location.raw_source_name | |
| primary_location.landing_page_url | http://arxiv.org/abs/2511.01846 |
| publication_date | 2025-11-03 |
| publication_year | 2025 |
| referenced_works_count | 0 |
| abstract_inverted_index.a | 41, 49, 117 |
| abstract_inverted_index.To | 34 |
| abstract_inverted_index.We | 165, 198 |
| abstract_inverted_index.as | 104, 106 |
| abstract_inverted_index.at | 128, 214 |
| abstract_inverted_index.by | 48, 157 |
| abstract_inverted_index.in | 120, 192 |
| abstract_inverted_index.is | 5, 88 |
| abstract_inverted_index.it | 213 |
| abstract_inverted_index.of | 14, 43, 51, 60, 124, 160, 195 |
| abstract_inverted_index.on | 29, 77, 143, 147, 186 |
| abstract_inverted_index.or | 26 |
| abstract_inverted_index.to | 110, 188 |
| abstract_inverted_index.we | 38 |
| abstract_inverted_index.400 | 78 |
| abstract_inverted_index.IMO | 101, 129 |
| abstract_inverted_index.Our | 139 |
| abstract_inverted_index.and | 54, 99, 136, 145, 162, 179, 211 |
| abstract_inverted_index.are | 22 |
| abstract_inverted_index.for | 8, 70, 92 |
| abstract_inverted_index.our | 121 |
| abstract_inverted_index.the | 1, 10, 58, 61, 66, 89, 125, 148, 153, 204 |
| abstract_inverted_index.too | 24 |
| abstract_inverted_index.top | 52 |
| abstract_inverted_index.1000 | 183 |
| abstract_inverted_index.2025 | 130 |
| abstract_inverted_index.6.9% | 161 |
| abstract_inverted_index.Deep | 133 |
| abstract_inverted_index.also | 166 |
| abstract_inverted_index.best | 154 |
| abstract_inverted_index.both | 97 |
| abstract_inverted_index.easy | 25 |
| abstract_inverted_index.help | 203 |
| abstract_inverted_index.hope | 199 |
| abstract_inverted_index.most | 67 |
| abstract_inverted_index.only | 27 |
| abstract_inverted_index.role | 119 |
| abstract_inverted_index.that | 19, 55, 168, 200 |
| abstract_inverted_index.well | 105, 175 |
| abstract_inverted_index.will | 202 |
| abstract_inverted_index.with | 82, 131, 171, 176, 182 |
| abstract_inverted_index.42.4% | 163 |
| abstract_inverted_index.65.7% | 146 |
| abstract_inverted_index.80.0% | 142 |
| abstract_inverted_index.Bench | 87 |
| abstract_inverted_index.These | 114 |
| abstract_inverted_index.Think | 134 |
| abstract_inverted_index.basic | 98 |
| abstract_inverted_index.built | 170 |
| abstract_inverted_index.first | 74 |
| abstract_inverted_index.focus | 28 |
| abstract_inverted_index.given | 18 |
| abstract_inverted_index.human | 177, 184 |
| abstract_inverted_index.large | 158 |
| abstract_inverted_index.level | 59, 102 |
| abstract_inverted_index.model | 140 |
| abstract_inverted_index.panel | 50 |
| abstract_inverted_index.right | 2 |
| abstract_inverted_index.short | 32, 84 |
| abstract_inverted_index.suite | 42 |
| abstract_inverted_index.tests | 75 |
| abstract_inverted_index.these | 36 |
| abstract_inverted_index.venue | 69 |
| abstract_inverted_index.which | 95 |
| abstract_inverted_index.young | 71 |
| abstract_inverted_index.(IMO), | 65 |
| abstract_inverted_index.(Luong | 135 |
| abstract_inverted_index.2025). | 138 |
| abstract_inverted_index.Bench, | 151 |
| abstract_inverted_index.Gemini | 132, 172 |
| abstract_inverted_index.either | 23 |
| abstract_inverted_index.enable | 189 |
| abstract_inverted_index.highly | 6 |
| abstract_inverted_index.models | 76, 156 |
| abstract_inverted_index.played | 116 |
| abstract_inverted_index.robust | 208 |
| abstract_inverted_index.showed | 167 |
| abstract_inverted_index.vetted | 47 |
| abstract_inverted_index.Finding | 0 |
| abstract_inverted_index.address | 35 |
| abstract_inverted_index.correct | 31 |
| abstract_inverted_index.crucial | 118 |
| abstract_inverted_index.diverse | 79 |
| abstract_inverted_index.further | 190 |
| abstract_inverted_index.getting | 30 |
| abstract_inverted_index.grading | 108 |
| abstract_inverted_index.issues, | 37 |
| abstract_inverted_index.margins | 159 |
| abstract_inverted_index.metrics | 4 |
| abstract_inverted_index.models, | 16 |
| abstract_inverted_index.present | 39 |
| abstract_inverted_index.proofs, | 187 |
| abstract_inverted_index.release | 212 |
| abstract_inverted_index.targets | 57 |
| abstract_inverted_index.towards | 206 |
| abstract_inverted_index.Olympiad | 64, 80 |
| abstract_inverted_index.achieved | 141 |
| abstract_inverted_index.advanced | 44, 100, 149 |
| abstract_inverted_index.answers. | 33, 85, 197 |
| abstract_inverted_index.critical | 7 |
| abstract_inverted_index.detailed | 107 |
| abstract_inverted_index.existing | 20 |
| abstract_inverted_index.grading. | 113 |
| abstract_inverted_index.gradings | 185 |
| abstract_inverted_index.historic | 122 |
| abstract_inverted_index.includes | 96 |
| abstract_inverted_index.problems | 81, 103 |
| abstract_inverted_index.progress | 191 |
| abstract_inverted_index.IMO-Bench | 201 |
| abstract_inverted_index.IMO-Proof | 86, 150 |
| abstract_inverted_index.Lockhart, | 137 |
| abstract_inverted_index.advancing | 9, 207 |
| abstract_inverted_index.automatic | 112, 193 |
| abstract_inverted_index.community | 205 |
| abstract_inverted_index.construct | 180 |
| abstract_inverted_index.correlate | 174 |
| abstract_inverted_index.long-form | 196 |
| abstract_inverted_index.reasoning | 12, 45, 173, 210 |
| abstract_inverted_index.IMO-Bench, | 40 |
| abstract_inverted_index.benchmarks | 115 |
| abstract_inverted_index.especially | 17 |
| abstract_inverted_index.evaluation | 91, 194 |
| abstract_inverted_index.facilitate | 111 |
| abstract_inverted_index.foundation | 15 |
| abstract_inverted_index.gold-level | 126 |
| abstract_inverted_index.guidelines | 109 |
| abstract_inverted_index.next-level | 90 |
| abstract_inverted_index.non-Gemini | 155 |
| abstract_inverted_index.north-star | 3 |
| abstract_inverted_index.surpassing | 152 |
| abstract_inverted_index.verifiable | 83 |
| abstract_inverted_index.achievement | 123 |
| abstract_inverted_index.autograders | 169 |
| abstract_inverted_index.benchmarks, | 46 |
| abstract_inverted_index.evaluations | 21, 178 |
| abstract_inverted_index.performance | 127 |
| abstract_inverted_index.prestigious | 68 |
| abstract_inverted_index.specialists | 53 |
| abstract_inverted_index.Mathematical | 63 |
| abstract_inverted_index.capabilities | 13 |
| abstract_inverted_index.mathematical | 11, 209 |
| abstract_inverted_index.specifically | 56 |
| abstract_inverted_index.International | 62 |
| abstract_inverted_index.capabilities, | 94 |
| abstract_inverted_index.proof-writing | 93 |
| abstract_inverted_index.respectively. | 164 |
| abstract_inverted_index.IMO-AnswerBench | 73, 144 |
| abstract_inverted_index.mathematicians. | 72 |
| abstract_inverted_index.IMO-GradingBench, | 181 |
| abstract_inverted_index.https://imobench.github.io/. | 215 |
| cited_by_percentile_year | |
| countries_distinct_count | 0 |
| institutions_distinct_count | 13 |
| citation_normalized_percentile | |
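The `abstract_inverted_index.*` rows in the table store the abstract as a mapping from each word to the zero-based positions at which it occurs. A minimal sketch of turning such a mapping back into running text follows; the function name `rebuild_abstract` is hypothetical, and the sample covers only positions 0 to 7 of the rows above.

```python
from typing import Dict, List


def rebuild_abstract(inverted_index: Dict[str, List[int]]) -> str:
    """Rebuild text from a word -> positions mapping (hypothetical helper)."""
    word_at: Dict[int, str] = {}
    for word, positions in inverted_index.items():
        for position in positions:
            word_at[position] = word
    # Emit the words in position order, separated by single spaces.
    return " ".join(word_at[i] for i in sorted(word_at))


# Excerpt only: positions 0-7, so later occurrences of "the" and "is" are omitted.
excerpt = {
    "Finding": [0],
    "the": [1],
    "right": [2],
    "north-star": [3],
    "metrics": [4],
    "is": [5],
    "highly": [6],
    "critical": [7],
}

print(rebuild_abstract(excerpt))
# prints "Finding the right north-star metrics is highly critical"
```

Applied to the full index, the same function should reproduce the abstract word for word, since punctuation is stored attached to the tokens (for example `answers.` and `(IMO),`).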