GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models
2024 · Open Access · DOI: https://doi.org/10.48550/arxiv.2410.05229
Recent advancements in Large Language Models (LLMs) have sparked interest in their formal reasoning capabilities, particularly in mathematics. The GSM8K benchmark is widely used to assess the mathematical reasoning of models on grade-school-level questions. While the performance of LLMs on GSM8K has significantly improved in recent years, it remains unclear whether their mathematical reasoning capabilities have genuinely advanced, raising questions about the reliability of the reported metrics. To address these concerns, we conduct a large-scale study on several SOTA open and closed models. To overcome the limitations of existing evaluations, we introduce GSM-Symbolic, an improved benchmark created from symbolic templates that allow for the generation of a diverse set of questions. GSM-Symbolic enables more controllable evaluations, providing key insights and more reliable metrics for measuring the reasoning capabilities of models. Our findings reveal that LLMs exhibit noticeable variance when responding to different instantiations of the same question. Specifically, the performance of all models declines when only the numerical values in the question are altered in the GSM-Symbolic benchmark. Furthermore, we investigate the fragility of mathematical reasoning in these models and show that their performance significantly deteriorates as the number of clauses in a question increases. We hypothesize that this decline is because current LLMs cannot perform genuine logical reasoning; they replicate reasoning steps from their training data. Adding a single clause that seems relevant to the question causes significant performance drops (up to 65%) across all state-of-the-art models, even though the clause doesn't contribute to the reasoning chain needed for the final answer. Overall, our work offers a more nuanced understanding of LLMs' capabilities and limitations in mathematical reasoning.
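The abstract describes questions generated from symbolic templates, with names and numerical values varied across instantiations and, in one experiment, a seemingly relevant but inconsequential clause added. The Python sketch below illustrates that general idea only. The template text, the variable ranges, and the names `SymbolicTemplate`, `generate`, and `with_distractor` are hypothetical and are not taken from the paper or its released benchmark.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class SymbolicTemplate:
    """A question with named placeholders plus a rule for the answer.

    Hypothetical structure for illustration; not the paper's format.
    """
    text: str                               # question with {name}, {x}, ... slots
    names: List[str]                        # candidate proper names
    ranges: Dict[str, range]                # allowed values per numeric variable
    answer: Callable[[Dict[str, int]], int]  # ground truth from sampled values

    def instantiate(self, rng: random.Random) -> Dict[str, object]:
        # Sample one value per variable, then fill the placeholders.
        values = {var: rng.choice(r) for var, r in self.ranges.items()}
        question = self.text.format(name=rng.choice(self.names), **values)
        return {"question": question, "answer": self.answer(values)}


# Illustrative template (assumed, not from the benchmark).
TEMPLATE = SymbolicTemplate(
    text=("{name} picks {x} apples on Monday and {y} apples on Tuesday. "
          "{name} then gives away {z} apples. How many apples are left?"),
    names=["Sophie", "Liam", "Ava"],
    ranges={"x": range(10, 50), "y": range(10, 50), "z": range(1, 20)},
    answer=lambda v: v["x"] + v["y"] - v["z"],
)


def generate(template: SymbolicTemplate, n: int, seed: int = 0) -> List[Dict[str, object]]:
    """Produce n instantiations of one template, reproducibly."""
    rng = random.Random(seed)
    return [template.instantiate(rng) for _ in range(n)]


def with_distractor(item: Dict[str, object], clause: str) -> Dict[str, object]:
    """Insert a clause before the final question; the answer is unchanged."""
    body, final_question = str(item["question"]).rsplit(". ", 1)
    return {"question": f"{body}. {clause} {final_question}",
            "answer": item["answer"]}


if __name__ == "__main__":
    for item in generate(TEMPLATE, 3):
        print(item["question"], "->", item["answer"])
```

Scoring a model on many such instantiations of one template, rather than on a single fixed question, is what would expose the variance the abstract reports; `with_distractor` mimics the added-clause setting by changing the wording while leaving the ground-truth answer untouched.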
Related Topics
- Type
- preprint
- Language
- en
- Landing Page
- http://arxiv.org/abs/2410.05229
- https://arxiv.org/pdf/2410.05229
- OA Status
- green
- Cited By
- 45
- Related Works
- 10
- OpenAlex ID
- https://openalex.org/W4403324134
Raw OpenAlex JSON
- OpenAlex ID: Canonical identifier for this work in OpenAlex
- https://openalex.org/W4403324134
- DOI: Digital Object Identifier
- https://doi.org/10.48550/arxiv.2410.05229
- Title: Work title
- GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models
- Type: OpenAlex work type
- preprint
- Language: Primary language
- en
- Publication year: Year of publication
- 2024
- Publication date: Full publication date if available
- 2024-10-07
- Authors: List of authors in order
- Iman Mirzadeh, Keivan Alizadeh, Hooman Shahrokhi, Oncel Tuzel, Samy Bengio, Mehrdad Farajtabar
- Landing page: Publisher landing page
- https://arxiv.org/abs/2410.05229
- PDF URL: Direct link to full text PDF
- https://arxiv.org/pdf/2410.05229
- Open access: Whether a free full text is available
- Yes
- OA status: Open access status per OpenAlex
- green
- OA URL: Direct OA link when available
- https://arxiv.org/pdf/2410.05229
- Concepts: Top concepts (fields/topics) attached by OpenAlex
- Computer science, GSM, Cognitive science, Artificial intelligence, Linguistics, Psychology, Philosophy, Telecommunications
- Cited by: Total citation count in OpenAlex
- 45
- Citations by year (recent): Per-year citation counts (last 5 years)
- 2025: 40, 2024: 5
- Related works (count): Other works algorithmically related by OpenAlex
- 10
Full payload
| Field | Value |
|---|---|
| id | https://openalex.org/W4403324134 |
| doi | https://doi.org/10.48550/arxiv.2410.05229 |
| ids.doi | https://doi.org/10.48550/arxiv.2410.05229 |
| ids.openalex | https://openalex.org/W4403324134 |
| fwci | |
| type | preprint |
| title | GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models |
| biblio.issue | |
| biblio.volume | |
| biblio.last_page | |
| biblio.first_page | |
| topics[0].id | https://openalex.org/T10181 |
| topics[0].field.id | https://openalex.org/fields/17 |
| topics[0].field.display_name | Computer Science |
| topics[0].score | 0.944599986076355 |
| topics[0].domain.id | https://openalex.org/domains/3 |
| topics[0].domain.display_name | Physical Sciences |
| topics[0].subfield.id | https://openalex.org/subfields/1702 |
| topics[0].subfield.display_name | Artificial Intelligence |
| topics[0].display_name | Natural Language Processing Techniques |
| is_xpac | False |
| apc_list | |
| apc_paid | |
| concepts[0].id | https://openalex.org/C41008148 |
| concepts[0].level | 0 |
| concepts[0].score | 0.5783172249794006 |
| concepts[0].wikidata | https://www.wikidata.org/wiki/Q21198 |
| concepts[0].display_name | Computer science |
| concepts[1].id | https://openalex.org/C59201141 |
| concepts[1].level | 2 |
| concepts[1].score | 0.5198172330856323 |
| concepts[1].wikidata | https://www.wikidata.org/wiki/Q46904 |
| concepts[1].display_name | GSM |
| concepts[2].id | https://openalex.org/C188147891 |
| concepts[2].level | 1 |
| concepts[2].score | 0.48179906606674194 |
| concepts[2].wikidata | https://www.wikidata.org/wiki/Q147638 |
| concepts[2].display_name | Cognitive science |
| concepts[3].id | https://openalex.org/C154945302 |
| concepts[3].level | 1 |
| concepts[3].score | 0.3451070189476013 |
| concepts[3].wikidata | https://www.wikidata.org/wiki/Q11660 |
| concepts[3].display_name | Artificial intelligence |
| concepts[4].id | https://openalex.org/C41895202 |
| concepts[4].level | 1 |
| concepts[4].score | 0.32463905215263367 |
| concepts[4].wikidata | https://www.wikidata.org/wiki/Q8162 |
| concepts[4].display_name | Linguistics |
| concepts[5].id | https://openalex.org/C15744967 |
| concepts[5].level | 0 |
| concepts[5].score | 0.21584540605545044 |
| concepts[5].wikidata | https://www.wikidata.org/wiki/Q9418 |
| concepts[5].display_name | Psychology |
| concepts[6].id | https://openalex.org/C138885662 |
| concepts[6].level | 0 |
| concepts[6].score | 0.07454529404640198 |
| concepts[6].wikidata | https://www.wikidata.org/wiki/Q5891 |
| concepts[6].display_name | Philosophy |
| concepts[7].id | https://openalex.org/C76155785 |
| concepts[7].level | 1 |
| concepts[7].score | 0.0 |
| concepts[7].wikidata | https://www.wikidata.org/wiki/Q418 |
| concepts[7].display_name | Telecommunications |
| keywords[0].id | https://openalex.org/keywords/computer-science |
| keywords[0].score | 0.5783172249794006 |
| keywords[0].display_name | Computer science |
| keywords[1].id | https://openalex.org/keywords/gsm |
| keywords[1].score | 0.5198172330856323 |
| keywords[1].display_name | GSM |
| keywords[2].id | https://openalex.org/keywords/cognitive-science |
| keywords[2].score | 0.48179906606674194 |
| keywords[2].display_name | Cognitive science |
| keywords[3].id | https://openalex.org/keywords/artificial-intelligence |
| keywords[3].score | 0.3451070189476013 |
| keywords[3].display_name | Artificial intelligence |
| keywords[4].id | https://openalex.org/keywords/linguistics |
| keywords[4].score | 0.32463905215263367 |
| keywords[4].display_name | Linguistics |
| keywords[5].id | https://openalex.org/keywords/psychology |
| keywords[5].score | 0.21584540605545044 |
| keywords[5].display_name | Psychology |
| keywords[6].id | https://openalex.org/keywords/philosophy |
| keywords[6].score | 0.07454529404640198 |
| keywords[6].display_name | Philosophy |
| language | en |
| locations[0].id | pmh:oai:arXiv.org:2410.05229 |
| locations[0].is_oa | True |
| locations[0].source.id | https://openalex.org/S4306400194 |
| locations[0].source.issn | |
| locations[0].source.type | repository |
| locations[0].source.is_oa | True |
| locations[0].source.issn_l | |
| locations[0].source.is_core | False |
| locations[0].source.is_in_doaj | False |
| locations[0].source.display_name | arXiv (Cornell University) |
| locations[0].source.host_organization | https://openalex.org/I205783295 |
| locations[0].source.host_organization_name | Cornell University |
| locations[0].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[0].license | |
| locations[0].pdf_url | https://arxiv.org/pdf/2410.05229 |
| locations[0].version | submittedVersion |
| locations[0].raw_type | text |
| locations[0].license_id | |
| locations[0].is_accepted | False |
| locations[0].is_published | False |
| locations[0].raw_source_name | |
| locations[0].landing_page_url | http://arxiv.org/abs/2410.05229 |
| locations[1].id | doi:10.48550/arxiv.2410.05229 |
| locations[1].is_oa | True |
| locations[1].source.id | https://openalex.org/S4306400194 |
| locations[1].source.issn | |
| locations[1].source.type | repository |
| locations[1].source.is_oa | True |
| locations[1].source.issn_l | |
| locations[1].source.is_core | False |
| locations[1].source.is_in_doaj | False |
| locations[1].source.display_name | arXiv (Cornell University) |
| locations[1].source.host_organization | https://openalex.org/I205783295 |
| locations[1].source.host_organization_name | Cornell University |
| locations[1].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[1].license | |
| locations[1].pdf_url | |
| locations[1].version | |
| locations[1].raw_type | article |
| locations[1].license_id | |
| locations[1].is_accepted | False |
| locations[1].is_published | |
| locations[1].raw_source_name | |
| locations[1].landing_page_url | https://doi.org/10.48550/arxiv.2410.05229 |
| indexed_in | arxiv, datacite |
| authorships[0].author.id | https://openalex.org/A5079412282 |
| authorships[0].author.orcid | |
| authorships[0].author.display_name | Iman Mirzadeh |
| authorships[0].author_position | first |
| authorships[0].raw_author_name | Mirzadeh, Iman |
| authorships[0].is_corresponding | False |
| authorships[1].author.id | https://openalex.org/A5030482460 |
| authorships[1].author.orcid | |
| authorships[1].author.display_name | Keivan Alizadeh |
| authorships[1].author_position | middle |
| authorships[1].raw_author_name | Alizadeh, Keivan |
| authorships[1].is_corresponding | False |
| authorships[2].author.id | https://openalex.org/A5109022949 |
| authorships[2].author.orcid | |
| authorships[2].author.display_name | Hooman Shahrokhi |
| authorships[2].author_position | middle |
| authorships[2].raw_author_name | Shahrokhi, Hooman |
| authorships[2].is_corresponding | False |
| authorships[3].author.id | https://openalex.org/A5028613002 |
| authorships[3].author.orcid | |
| authorships[3].author.display_name | Oncel Tuzel |
| authorships[3].author_position | middle |
| authorships[3].raw_author_name | Tuzel, Oncel |
| authorships[3].is_corresponding | False |
| authorships[4].author.id | https://openalex.org/A5017529415 |
| authorships[4].author.orcid | |
| authorships[4].author.display_name | Samy Bengio |
| authorships[4].author_position | middle |
| authorships[4].raw_author_name | Bengio, Samy |
| authorships[4].is_corresponding | False |
| authorships[5].author.id | https://openalex.org/A5050499655 |
| authorships[5].author.orcid | https://orcid.org/0000-0002-5510-518X |
| authorships[5].author.display_name | Mehrdad Farajtabar |
| authorships[5].author_position | last |
| authorships[5].raw_author_name | Farajtabar, Mehrdad |
| authorships[5].is_corresponding | False |
| has_content.pdf | False |
| has_content.grobid_xml | False |
| is_paratext | False |
| open_access.is_oa | True |
| open_access.oa_url | https://arxiv.org/pdf/2410.05229 |
| open_access.oa_status | green |
| open_access.any_repository_has_fulltext | False |
| created_date | 2025-10-10T00:00:00 |
| display_name | GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models |
| has_fulltext | False |
| is_retracted | False |
| updated_date | 2025-11-06T06:51:31.235846 |
| primary_topic.id | https://openalex.org/T10181 |
| primary_topic.field.id | https://openalex.org/fields/17 |
| primary_topic.field.display_name | Computer Science |
| primary_topic.score | 0.944599986076355 |
| primary_topic.domain.id | https://openalex.org/domains/3 |
| primary_topic.domain.display_name | Physical Sciences |
| primary_topic.subfield.id | https://openalex.org/subfields/1702 |
| primary_topic.subfield.display_name | Artificial Intelligence |
| primary_topic.display_name | Natural Language Processing Techniques |
| related_works | https://openalex.org/W4391375266, https://openalex.org/W2748952813, https://openalex.org/W1823457431, https://openalex.org/W2390279801, https://openalex.org/W2358668433, https://openalex.org/W4396701345, https://openalex.org/W2376932109, https://openalex.org/W2001405890, https://openalex.org/W4396696052, https://openalex.org/W4402327032 |
| cited_by_count | 45 |
| counts_by_year[0].year | 2025 |
| counts_by_year[0].cited_by_count | 40 |
| counts_by_year[1].year | 2024 |
| counts_by_year[1].cited_by_count | 5 |
| locations_count | 2 |
| best_oa_location.id | pmh:oai:arXiv.org:2410.05229 |
| best_oa_location.is_oa | True |
| best_oa_location.source.id | https://openalex.org/S4306400194 |
| best_oa_location.source.issn | |
| best_oa_location.source.type | repository |
| best_oa_location.source.is_oa | True |
| best_oa_location.source.issn_l | |
| best_oa_location.source.is_core | False |
| best_oa_location.source.is_in_doaj | False |
| best_oa_location.source.display_name | arXiv (Cornell University) |
| best_oa_location.source.host_organization | https://openalex.org/I205783295 |
| best_oa_location.source.host_organization_name | Cornell University |
| best_oa_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| best_oa_location.license | |
| best_oa_location.pdf_url | https://arxiv.org/pdf/2410.05229 |
| best_oa_location.version | submittedVersion |
| best_oa_location.raw_type | text |
| best_oa_location.license_id | |
| best_oa_location.is_accepted | False |
| best_oa_location.is_published | False |
| best_oa_location.raw_source_name | |
| best_oa_location.landing_page_url | http://arxiv.org/abs/2410.05229 |
| primary_location.id | pmh:oai:arXiv.org:2410.05229 |
| primary_location.is_oa | True |
| primary_location.source.id | https://openalex.org/S4306400194 |
| primary_location.source.issn | |
| primary_location.source.type | repository |
| primary_location.source.is_oa | True |
| primary_location.source.issn_l | |
| primary_location.source.is_core | False |
| primary_location.source.is_in_doaj | False |
| primary_location.source.display_name | arXiv (Cornell University) |
| primary_location.source.host_organization | https://openalex.org/I205783295 |
| primary_location.source.host_organization_name | Cornell University |
| primary_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| primary_location.license | |
| primary_location.pdf_url | https://arxiv.org/pdf/2410.05229 |
| primary_location.version | submittedVersion |
| primary_location.raw_type | text |
| primary_location.license_id | |
| primary_location.is_accepted | False |
| primary_location.is_published | False |
| primary_location.raw_source_name | |
| primary_location.landing_page_url | http://arxiv.org/abs/2410.05229 |
| publication_date | 2024-10-07 |
| publication_year | 2024 |
| referenced_works_count | 0 |
| abstract_inverted_index.a | 73, 106, 191, 217, 256 |
| abstract_inverted_index.To | 67, 83 |
| abstract_inverted_index.We | 194 |
| abstract_inverted_index.an | 93 |
| abstract_inverted_index.as | 185 |
| abstract_inverted_index.in | 2, 10, 16, 44, 158, 163, 175, 190, 265 |
| abstract_inverted_index.is | 21, 199 |
| abstract_inverted_index.it | 47 |
| abstract_inverted_index.of | 29, 37, 63, 87, 105, 109, 128, 142, 149, 172, 188, 260 |
| abstract_inverted_index.on | 31, 39, 76 |
| abstract_inverted_index.to | 24, 139, 223, 231, 243 |
| abstract_inverted_index.we | 71, 90, 168 |
| abstract_inverted_index.(up | 230 |
| abstract_inverted_index.The | 18 |
| abstract_inverted_index.all | 150, 234 |
| abstract_inverted_index.and | 80, 119, 178, 263 |
| abstract_inverted_index.are | 161 |
| abstract_inverted_index.for | 102, 123, 248 |
| abstract_inverted_index.has | 41 |
| abstract_inverted_index.key | 117 |
| abstract_inverted_index.our | 253 |
| abstract_inverted_index.set | 108 |
| abstract_inverted_index.the | 26, 35, 61, 64, 85, 103, 125, 143, 147, 155, 159, 164, 170, 186, 224, 239, 244, 249 |
| abstract_inverted_index.65%) | 232 |
| abstract_inverted_index.LLMs | 38, 133, 202 |
| abstract_inverted_index.SOTA | 78 |
| abstract_inverted_index.even | 237 |
| abstract_inverted_index.from | 97, 212 |
| abstract_inverted_index.have | 7, 55 |
| abstract_inverted_index.more | 113, 120, 257 |
| abstract_inverted_index.only | 154 |
| abstract_inverted_index.open | 79 |
| abstract_inverted_index.same | 144 |
| abstract_inverted_index.show | 179 |
| abstract_inverted_index.that | 100, 132, 180, 196, 220 |
| abstract_inverted_index.they | 208 |
| abstract_inverted_index.this | 197 |
| abstract_inverted_index.used | 23 |
| abstract_inverted_index.when | 137, 153 |
| abstract_inverted_index.work | 254 |
| abstract_inverted_index.GSM8K | 19, 40 |
| abstract_inverted_index.LLMs' | 261 |
| abstract_inverted_index.Large | 3 |
| abstract_inverted_index.While | 34 |
| abstract_inverted_index.about | 60 |
| abstract_inverted_index.allow | 101 |
| abstract_inverted_index.chain | 246 |
| abstract_inverted_index.data. | 215 |
| abstract_inverted_index.drops | 229 |
| abstract_inverted_index.final | 250 |
| abstract_inverted_index.seems | 221 |
| abstract_inverted_index.steps | 211 |
| abstract_inverted_index.study | 75 |
| abstract_inverted_index.their | 11, 51, 181, 213 |
| abstract_inverted_index.these | 69, 176 |
| abstract_inverted_index.(LLMs) | 6 |
| abstract_inverted_index.Adding | 216 |
| abstract_inverted_index.Models | 5 |
| abstract_inverted_index.Recent | 0 |
| abstract_inverted_index.across | 233 |
| abstract_inverted_index.assess | 25 |
| abstract_inverted_index.cannot | 203 |
| abstract_inverted_index.causes | 226 |
| abstract_inverted_index.clause | 219, 240 |
| abstract_inverted_index.closed | 81 |
| abstract_inverted_index.formal | 12 |
| abstract_inverted_index.models | 30, 151, 177 |
| abstract_inverted_index.needed | 247 |
| abstract_inverted_index.number | 187 |
| abstract_inverted_index.offers | 255 |
| abstract_inverted_index.recent | 45 |
| abstract_inverted_index.reveal | 131 |
| abstract_inverted_index.single | 218 |
| abstract_inverted_index.though | 238 |
| abstract_inverted_index.values | 157 |
| abstract_inverted_index.widely | 22 |
| abstract_inverted_index.years, | 46 |
| abstract_inverted_index.address | 68 |
| abstract_inverted_index.altered | 162 |
| abstract_inverted_index.answer. | 251 |
| abstract_inverted_index.because | 200 |
| abstract_inverted_index.clauses | 189 |
| abstract_inverted_index.conduct | 72 |
| abstract_inverted_index.created | 96 |
| abstract_inverted_index.current | 201 |
| abstract_inverted_index.decline | 198 |
| abstract_inverted_index.diverse | 107 |
| abstract_inverted_index.doesn't | 241 |
| abstract_inverted_index.enables | 112 |
| abstract_inverted_index.exhibit | 134 |
| abstract_inverted_index.genuine | 205 |
| abstract_inverted_index.logical | 206 |
| abstract_inverted_index.metrics | 122 |
| abstract_inverted_index.models, | 236 |
| abstract_inverted_index.models. | 82 |
| abstract_inverted_index.nuanced | 258 |
| abstract_inverted_index.perform | 204 |
| abstract_inverted_index.raising | 58 |
| abstract_inverted_index.remains | 48 |
| abstract_inverted_index.several | 77 |
| abstract_inverted_index.sparked | 8 |
| abstract_inverted_index.unclear | 49 |
| abstract_inverted_index.whether | 50 |
| abstract_inverted_index.Language | 4 |
| abstract_inverted_index.Overall, | 252 |
| abstract_inverted_index.declines | 152 |
| abstract_inverted_index.existing | 88 |
| abstract_inverted_index.findings | 130 |
| abstract_inverted_index.improved | 43, 94 |
| abstract_inverted_index.insights | 118 |
| abstract_inverted_index.interest | 9 |
| abstract_inverted_index.metrics. | 66 |
| abstract_inverted_index.overcome | 84 |
| abstract_inverted_index.question | 160, 192, 225 |
| abstract_inverted_index.relevant | 222 |
| abstract_inverted_index.reliable | 121 |
| abstract_inverted_index.reported | 65 |
| abstract_inverted_index.symbolic | 98 |
| abstract_inverted_index.training | 214 |
| abstract_inverted_index.variance | 136 |
| abstract_inverted_index.advanced, | 57 |
| abstract_inverted_index.benchmark | 20, 95 |
| abstract_inverted_index.concerns, | 70 |
| abstract_inverted_index.different | 140 |
| abstract_inverted_index.fragility | 171 |
| abstract_inverted_index.genuinely | 56 |
| abstract_inverted_index.introduce | 91 |
| abstract_inverted_index.measuring | 124 |
| abstract_inverted_index.numerical | 156 |
| abstract_inverted_index.providing | 116 |
| abstract_inverted_index.question. | 145 |
| abstract_inverted_index.questions | 59 |
| abstract_inverted_index.reasoning | 13, 28, 53, 126, 174, 210, 245 |
| abstract_inverted_index.replicate | 209 |
| abstract_inverted_index.templates | 99 |
| abstract_inverted_index.benchmark. | 166 |
| abstract_inverted_index.contribute | 242 |
| abstract_inverted_index.generation | 104 |
| abstract_inverted_index.increases. | 193 |
| abstract_inverted_index.models.Our | 129 |
| abstract_inverted_index.noticeable | 135 |
| abstract_inverted_index.questions. | 33, 110 |
| abstract_inverted_index.reasoning. | 267 |
| abstract_inverted_index.reasoning; | 207 |
| abstract_inverted_index.responding | 138 |
| abstract_inverted_index.hypothesize | 195 |
| abstract_inverted_index.investigate | 169 |
| abstract_inverted_index.large-scale | 74 |
| abstract_inverted_index.limitations | 86, 264 |
| abstract_inverted_index.performance | 36, 148, 182, 228 |
| abstract_inverted_index.reliability | 62 |
| abstract_inverted_index.significant | 227 |
| abstract_inverted_index.Furthermore, | 167 |
| abstract_inverted_index.GSM-Symbolic | 111, 165 |
| abstract_inverted_index.advancements | 1 |
| abstract_inverted_index.capabilities | 54, 127, 262 |
| abstract_inverted_index.controllable | 114 |
| abstract_inverted_index.deteriorates | 184 |
| abstract_inverted_index.evaluations, | 89, 115 |
| abstract_inverted_index.mathematical | 27, 52, 173, 266 |
| abstract_inverted_index.mathematics. | 17 |
| abstract_inverted_index.particularly | 15 |
| abstract_inverted_index.GSM-Symbolic, | 92 |
| abstract_inverted_index.Specifically, | 146 |
| abstract_inverted_index.capabilities, | 14 |
| abstract_inverted_index.significantly | 42, 183 |
| abstract_inverted_index.understanding | 259 |
| abstract_inverted_index.instantiations | 141 |
| abstract_inverted_index.state-of-the-art | 235 |
| abstract_inverted_index.grade-school-level | 32 |
| cited_by_percentile_year | |
| countries_distinct_count | 0 |
| institutions_distinct_count | 6 |
| citation_normalized_percentile |