TutorBench: A Benchmark To Assess Tutoring Capabilities Of Large Language Models

Rakshith Sharma Srinivasa, Zora Che, Chuck Zhang, Diego Mares, Ernesto Hernández, J. Park, Deokjung Lee, Guillermo Mangialardi, Charmaine Ng, Ed-Yeremai Hernandez Cardona, Anisha Gunjal, Yunzhong He, Liwen Liu, Chen Xing

2025 · Open Access · DOI: https://doi.org/10.48550/arxiv.2510.02663

As students increasingly adopt large language models (LLMs) as learning aids, it is crucial to build models that are adept at handling the nuances of tutoring: they need to identify the core needs of students, be adaptive, provide personalized guidance, and be accurate. To this end, we introduce TutorBench, a dataset and evaluation benchmark designed to rigorously evaluate the core tutoring skills of LLMs. The dataset comprises 1,490 samples curated by human experts, focused on high-school and AP-level curricula. The samples are drawn from three common tutoring tasks: (i) generating adaptive explanations tailored to a student's confusion, (ii) providing actionable feedback on a student's work, and (iii) promoting active learning through effective hint generation. To account for the inherent complexity of tutoring, samples are accompanied by sample-specific rubrics, which are used to judge model responses during evaluation. TutorBench relies on a reliable, fine-grained automatic evaluation method that combines an LLM judge with these sample-specific rubrics. We evaluate 16 frontier LLMs on TutorBench and present a detailed analysis of their performance and behavior. Our results show that none of the frontier LLMs achieves a score above 56%, leaving substantial room for improvement. We find that LLMs fall short in exhibiting the full range of tutoring skills needed to guide, diagnose, and support students effectively, with all the frontier models achieving less than a 60% pass rate on rubric criteria related to these skills. We also find that different model families exhibit varied strengths and limitations: the Claude models outperform others in supporting active learning but lag behind in the other two use cases. By releasing TutorBench, we provide a comprehensive and unsaturated benchmark to guide the development of the next generation of AI tutors.
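
The rubric-based scoring described in the abstract can be illustrated with a minimal sketch. It assumes each model response comes with a list of pass/fail verdicts, one per rubric criterion, as an LLM judge might return them, and it aggregates them as an unweighted pass rate. The abstract does not state the aggregation formula, so the `RubricVerdict` type, the `pass_rate` helper, and the example criteria are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RubricVerdict:
    """One judged rubric criterion (hypothetical structure, not from the paper)."""
    criterion: str
    passed: bool


def pass_rate(verdicts: list[RubricVerdict]) -> float:
    """Return the fraction of rubric criteria marked as passed.

    Assumption: every criterion counts equally; the benchmark itself
    may weight criteria differently.
    """
    if not verdicts:
        raise ValueError("at least one rubric verdict is required")
    return sum(v.passed for v in verdicts) / len(verdicts)


# Illustrative verdicts for a single response (made-up criteria).
example = [
    RubricVerdict("identifies the student's misconception", True),
    RubricVerdict("gives a hint without revealing the answer", False),
    RubricVerdict("states the final result correctly", True),
    RubricVerdict("adapts the explanation to the student's level", False),
]
print(f"{pass_rate(example):.0%}")  # 2 of 4 criteria passed -> 50%
```

Keeping the verdicts per criterion, rather than storing only the final number, is what would allow a breakdown by skill or by tutoring task of the kind the abstract reports.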

Metadata

Type: preprint
Language: en
Landing Page: http://arxiv.org/abs/2510.02663
PDF: https://arxiv.org/pdf/2510.02663
OA Status: green
OpenAlex ID: https://openalex.org/W4415981784
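
The landing-page and PDF links in this record follow directly from the arXiv-issued DOI. The sketch below shows that mapping, assuming the DOI carries the `10.48550/arxiv.` prefix seen here; the helper names `arxiv_id_from_doi` and `arxiv_urls` are hypothetical.

```python
import re

# Prefix used by arXiv-issued DOIs (compared case-insensitively below).
ARXIV_DOI_PREFIX = "10.48550/arxiv."


def arxiv_id_from_doi(doi: str) -> str:
    """Extract the arXiv identifier from an arXiv DOI or DOI URL."""
    # Accept both the bare DOI and the https://doi.org/ resolver form.
    bare = re.sub(r"^https?://(dx\.)?doi\.org/", "", doi.strip(), flags=re.IGNORECASE)
    if not bare.lower().startswith(ARXIV_DOI_PREFIX):
        raise ValueError(f"not an arXiv DOI: {doi!r}")
    return bare[len(ARXIV_DOI_PREFIX):]


def arxiv_urls(doi: str) -> dict[str, str]:
    """Build the abstract-page and PDF URLs for an arXiv DOI."""
    arxiv_id = arxiv_id_from_doi(doi)
    return {
        "landing_page": f"https://arxiv.org/abs/{arxiv_id}",
        "pdf": f"https://arxiv.org/pdf/{arxiv_id}",
    }


urls = arxiv_urls("https://doi.org/10.48550/arxiv.2510.02663")
print(urls["landing_page"])  # https://arxiv.org/abs/2510.02663
print(urls["pdf"])           # https://arxiv.org/pdf/2510.02663
```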

All OpenAlex metadata

OpenAlex ID: https://openalex.org/W4415981784
Canonical identifier for this work in OpenAlex

DOI: https://doi.org/10.48550/arxiv.2510.02663
Digital Object Identifier

Title: TutorBench: A Benchmark To Assess Tutoring Capabilities Of Large Language Models
Work title

Type: preprint
OpenAlex work type

Language: en
Primary language

Publication year: 2025
Year of publication

Publication date: 2025-10-03
Full publication date if available

Authors: Rakshith Sharma Srinivasa, Zora Che, Chuck Zhang, Diego Mares, Ernesto Hernández, J. Park, Deokjung Lee, Guillermo Mangialardi, Charmaine Ng, Ed-Yeremai Hernandez Cardona, Anisha Gunjal, Yunzhong He, Liwen Liu, Chen Xing
List of authors in order

Landing page: https://arxiv.org/abs/2510.02663
Publisher landing page

PDF URL: https://arxiv.org/pdf/2510.02663
Direct link to full text PDF

Open access: Yes
Whether a free full text is available

OA status: green
Open access status per OpenAlex

OA URL: https://arxiv.org/pdf/2510.02663
Direct OA link when available

Cited by: 0
Total citation count in OpenAlex

Full payload

id	https://openalex.org/W4415981784
doi	https://doi.org/10.48550/arxiv.2510.02663
ids.doi	https://doi.org/10.48550/arxiv.2510.02663
ids.openalex	https://openalex.org/W4415981784
fwci	0.0
type	preprint
title	TutorBench: A Benchmark To Assess Tutoring Capabilities Of Large Language Models
biblio.issue
biblio.volume
biblio.last_page
biblio.first_page
is_xpac	False
apc_list
apc_paid
language	en
locations[0].id	pmh:oai:arXiv.org:2510.02663
locations[0].is_oa	True
locations[0].source.id	https://openalex.org/S4306400194
locations[0].source.issn
locations[0].source.type	repository
locations[0].source.is_oa	True
locations[0].source.issn_l
locations[0].source.is_core	False
locations[0].source.is_in_doaj	False
locations[0].source.display_name	arXiv (Cornell University)
locations[0].source.host_organization	https://openalex.org/I205783295
locations[0].source.host_organization_name	Cornell University
locations[0].source.host_organization_lineage	https://openalex.org/I205783295
locations[0].license
locations[0].pdf_url	https://arxiv.org/pdf/2510.02663
locations[0].version	submittedVersion
locations[0].raw_type	text
locations[0].license_id
locations[0].is_accepted	False
locations[0].is_published	False
locations[0].raw_source_name
locations[0].landing_page_url	http://arxiv.org/abs/2510.02663
locations[1].id	doi:10.48550/arxiv.2510.02663
locations[1].is_oa	True
locations[1].source.id	https://openalex.org/S4306400194
locations[1].source.issn
locations[1].source.type	repository
locations[1].source.is_oa	True
locations[1].source.issn_l
locations[1].source.is_core	False
locations[1].source.is_in_doaj	False
locations[1].source.display_name	arXiv (Cornell University)
locations[1].source.host_organization	https://openalex.org/I205783295
locations[1].source.host_organization_name	Cornell University
locations[1].source.host_organization_lineage	https://openalex.org/I205783295
locations[1].license
locations[1].pdf_url
locations[1].version
locations[1].raw_type	article
locations[1].license_id
locations[1].is_accepted	False
locations[1].is_published
locations[1].raw_source_name
locations[1].landing_page_url	https://doi.org/10.48550/arxiv.2510.02663
indexed_in	arxiv, datacite
authorships[0].author.id	https://openalex.org/A5039893379
authorships[0].author.orcid
authorships[0].author.display_name	Rakshith Sharma Srinivasa
authorships[0].author_position	first
authorships[0].raw_author_name	Srinivasa, Rakshith S
authorships[0].is_corresponding	False
authorships[1].author.id	https://openalex.org/A5016281162
authorships[1].author.orcid
authorships[1].author.display_name	Zora Che
authorships[1].author_position	middle
authorships[1].raw_author_name	Che, Zora
authorships[1].is_corresponding	False
authorships[2].author.id	https://openalex.org/A5067712887
authorships[2].author.orcid	https://orcid.org/0000-0001-7681-5538
authorships[2].author.display_name	Chuck Zhang
authorships[2].author_position	middle
authorships[2].raw_author_name	Zhang, Chen Bo Calvin
authorships[2].is_corresponding	False
authorships[3].author.id	https://openalex.org/A5054263815
authorships[3].author.orcid
authorships[3].author.display_name	Diego Mares
authorships[3].author_position	middle
authorships[3].raw_author_name	Mares, Diego
authorships[3].is_corresponding	False
authorships[4].author.id	https://openalex.org/A5061646420
authorships[4].author.orcid	https://orcid.org/0000-0003-3839-3244
authorships[4].author.display_name	Ernesto Hernández
authorships[4].author_position	middle
authorships[4].raw_author_name	Hernandez, Ernesto
authorships[4].is_corresponding	False
authorships[5].author.id	https://openalex.org/A5090379094
authorships[5].author.orcid	https://orcid.org/0000-0003-4408-7171
authorships[5].author.display_name	J. Park
authorships[5].author_position	middle
authorships[5].raw_author_name	Park, Jayeon
authorships[5].is_corresponding	False
authorships[6].author.id	https://openalex.org/A5008187500
authorships[6].author.orcid	https://orcid.org/0000-0002-3935-5058
authorships[6].author.display_name	Deokjung Lee
authorships[6].author_position	middle
authorships[6].raw_author_name	Lee, Dean
authorships[6].is_corresponding	False
authorships[7].author.id	https://openalex.org/A5120289863
authorships[7].author.orcid
authorships[7].author.display_name	Guillermo Mangialardi
authorships[7].author_position	middle
authorships[7].raw_author_name	Mangialardi, Guillermo
authorships[7].is_corresponding	False
authorships[8].author.id	https://openalex.org/A5078999453
authorships[8].author.orcid	https://orcid.org/0000-0003-3026-0009
authorships[8].author.display_name	Charmaine Ng
authorships[8].author_position	middle
authorships[8].raw_author_name	Ng, Charmaine
authorships[8].is_corresponding	False
authorships[9].author.id	https://openalex.org/A5116082872
authorships[9].author.orcid
authorships[9].author.display_name	Ed-Yeremai Hernandez Cardona
authorships[9].author_position	middle
authorships[9].raw_author_name	Cardona, Ed-Yeremai Hernandez
authorships[9].is_corresponding	False
authorships[10].author.id	https://openalex.org/A5088137864
authorships[10].author.orcid
authorships[10].author.display_name	Anisha Gunjal
authorships[10].author_position	middle
authorships[10].raw_author_name	Gunjal, Anisha
authorships[10].is_corresponding	False
authorships[11].author.id	https://openalex.org/A5079019876
authorships[11].author.orcid	https://orcid.org/0000-0002-5429-5372
authorships[11].author.display_name	Yunzhong He
authorships[11].author_position	middle
authorships[11].raw_author_name	He, Yunzhong
authorships[11].is_corresponding	False
authorships[12].author.id	https://openalex.org/A5100703295
authorships[12].author.orcid	https://orcid.org/0000-0001-9838-5727
authorships[12].author.display_name	Liwen Liu
authorships[12].author_position	middle
authorships[12].raw_author_name	Liu, Bing
authorships[12].is_corresponding	False
authorships[13].author.id	https://openalex.org/A5102748139
authorships[13].author.orcid	https://orcid.org/0000-0003-3189-0969
authorships[13].author.display_name	Chen Xing
authorships[13].author_position	last
authorships[13].raw_author_name	Xing, Chen
authorships[13].is_corresponding	False
has_content.pdf	False
has_content.grobid_xml	False
is_paratext	False
open_access.is_oa	True
open_access.oa_url	https://arxiv.org/pdf/2510.02663
open_access.oa_status	green
open_access.any_repository_has_fulltext	False
created_date	2025-10-10T00:00:00
display_name	TutorBench: A Benchmark To Assess Tutoring Capabilities Of Large Language Models
has_fulltext	False
is_retracted	False
updated_date	2025-11-08T23:21:52.890332
primary_topic
cited_by_count	0
locations_count	2
best_oa_location.id	pmh:oai:arXiv.org:2510.02663
best_oa_location.is_oa	True
best_oa_location.source.id	https://openalex.org/S4306400194
best_oa_location.source.issn
best_oa_location.source.type	repository
best_oa_location.source.is_oa	True
best_oa_location.source.issn_l
best_oa_location.source.is_core	False
best_oa_location.source.is_in_doaj	False
best_oa_location.source.display_name	arXiv (Cornell University)
best_oa_location.source.host_organization	https://openalex.org/I205783295
best_oa_location.source.host_organization_name	Cornell University
best_oa_location.source.host_organization_lineage	https://openalex.org/I205783295
best_oa_location.license
best_oa_location.pdf_url	https://arxiv.org/pdf/2510.02663
best_oa_location.version	submittedVersion
best_oa_location.raw_type	text
best_oa_location.license_id
best_oa_location.is_accepted	False
best_oa_location.is_published	False
best_oa_location.raw_source_name
best_oa_location.landing_page_url	http://arxiv.org/abs/2510.02663
primary_location.id	pmh:oai:arXiv.org:2510.02663
primary_location.is_oa	True
primary_location.source.id	https://openalex.org/S4306400194
primary_location.source.issn
primary_location.source.type	repository
primary_location.source.is_oa	True
primary_location.source.issn_l
primary_location.source.is_core	False
primary_location.source.is_in_doaj	False
primary_location.source.display_name	arXiv (Cornell University)
primary_location.source.host_organization	https://openalex.org/I205783295
primary_location.source.host_organization_name	Cornell University
primary_location.source.host_organization_lineage	https://openalex.org/I205783295
primary_location.license
primary_location.pdf_url	https://arxiv.org/pdf/2510.02663
primary_location.version	submittedVersion
primary_location.raw_type	text
primary_location.license_id
primary_location.is_accepted	False
primary_location.is_published	False
primary_location.raw_source_name
primary_location.landing_page_url	http://arxiv.org/abs/2510.02663
publication_date	2025-10-03
publication_year	2025
referenced_works_count	0
abstract_inverted_index.a	49, 94, 102, 139, 163, 181, 188, 223, 270
abstract_inverted_index.16	156
abstract_inverted_index.AI	283
abstract_inverted_index.As	0
abstract_inverted_index.By	265
abstract_inverted_index.To	43, 114
abstract_inverted_index.We	154, 193, 234
abstract_inverted_index.an	148
abstract_inverted_index.as	8
abstract_inverted_index.at	20
abstract_inverted_index.be	35, 41
abstract_inverted_index.by	70, 125
abstract_inverted_index.in	199, 251, 259
abstract_inverted_index.is	12
abstract_inverted_index.it	11
abstract_inverted_index.of	24, 33, 62, 120, 166, 176, 183, 204, 279, 282
abstract_inverted_index.on	74, 101, 159, 227
abstract_inverted_index.to	14, 28, 55, 93, 131, 208, 231, 275
abstract_inverted_index.we	46, 268
abstract_inverted_index.(i)	88
abstract_inverted_index.Our	171
abstract_inverted_index.The	64, 79
abstract_inverted_index.all	216
abstract_inverted_index.and	40, 51, 76, 105, 141, 150, 161, 169, 211, 244, 272
abstract_inverted_index.are	18, 81, 123, 129
abstract_inverted_index.for	116, 191
abstract_inverted_index.lag	257
abstract_inverted_index.the	22, 30, 58, 117, 151, 177, 201, 217, 246, 260, 277, 280
abstract_inverted_index.two	262
abstract_inverted_index.use	263
abstract_inverted_index.(ii)	97
abstract_inverted_index.LLMs	158, 179, 196
abstract_inverted_index.also	235
abstract_inverted_index.core	31, 59
abstract_inverted_index.end,	45
abstract_inverted_index.fall	197
abstract_inverted_index.find	194, 236
abstract_inverted_index.from	83
abstract_inverted_index.full	202
abstract_inverted_index.hint	112
abstract_inverted_index.less	221
abstract_inverted_index.need	27
abstract_inverted_index.none	175
abstract_inverted_index.pass	225
abstract_inverted_index.rate	226
abstract_inverted_index.room	190
abstract_inverted_index.show	173
abstract_inverted_index.than	185, 222
abstract_inverted_index.that	17, 146, 174, 195, 237
abstract_inverted_index.they	26, 256
abstract_inverted_index.this	44
abstract_inverted_index.used	130
abstract_inverted_index.uses	138, 147
abstract_inverted_index.with	215
abstract_inverted_index.(iii)	106
abstract_inverted_index.1,490	67
abstract_inverted_index.LLMs.	63
abstract_inverted_index.adept	19
abstract_inverted_index.adopt	3
abstract_inverted_index.aids,	10
abstract_inverted_index.build	15
abstract_inverted_index.drawn	82
abstract_inverted_index.guide	276
abstract_inverted_index.human	71
abstract_inverted_index.judge	132
abstract_inverted_index.large	4, 189
abstract_inverted_index.model	133, 239
abstract_inverted_index.needs	32
abstract_inverted_index.other	261
abstract_inverted_index.range	203
abstract_inverted_index.score	182
abstract_inverted_index.short	198
abstract_inverted_index.their	167
abstract_inverted_index.these	232
abstract_inverted_index.three	84
abstract_inverted_index.which	128
abstract_inverted_index.while	255
abstract_inverted_index.work,	104
abstract_inverted_index.$60\%$	224
abstract_inverted_index.(LLMs)	7
abstract_inverted_index.Claude	247
abstract_inverted_index.active	108, 253
abstract_inverted_index.behind	258
abstract_inverted_index.cases.	264
abstract_inverted_index.common	85
abstract_inverted_index.during	135
abstract_inverted_index.guide,	209
abstract_inverted_index.method	145
abstract_inverted_index.models	6, 16, 219, 248
abstract_inverted_index.needed	207
abstract_inverted_index.others	250
abstract_inverted_index.rubric	228
abstract_inverted_index.skills	61, 206
abstract_inverted_index.tasks:	87
abstract_inverted_index.varied	242
abstract_inverted_index.$56\%$,	186
abstract_inverted_index.account	115
abstract_inverted_index.achieve	180
abstract_inverted_index.crucial	13
abstract_inverted_index.curated	69
abstract_inverted_index.dataset	50, 65
abstract_inverted_index.exhibit	241
abstract_inverted_index.focused	73
abstract_inverted_index.greater	184
abstract_inverted_index.nuances	23
abstract_inverted_index.present	162
abstract_inverted_index.provide	37, 269
abstract_inverted_index.related	230
abstract_inverted_index.results	172
abstract_inverted_index.rubrics	127
abstract_inverted_index.samples	68, 80, 122
abstract_inverted_index.showing	187
abstract_inverted_index.skills.	233
abstract_inverted_index.support	212
abstract_inverted_index.through	110
abstract_inverted_index.tutors.	284
abstract_inverted_index.AP-level	77
abstract_inverted_index.adaptive	90
abstract_inverted_index.analysis	165
abstract_inverted_index.criteria	229
abstract_inverted_index.designed	54
abstract_inverted_index.detailed	164
abstract_inverted_index.evaluate	57, 155
abstract_inverted_index.experts,	72
abstract_inverted_index.families	240
abstract_inverted_index.feedback	100
abstract_inverted_index.frontier	157, 178, 218
abstract_inverted_index.handling	21
abstract_inverted_index.identify	29
abstract_inverted_index.inherent	118
abstract_inverted_index.language	5
abstract_inverted_index.learning	9, 109
abstract_inverted_index.reliable	140
abstract_inverted_index.rubrics.	153
abstract_inverted_index.students	1, 213
abstract_inverted_index.tailored	92
abstract_inverted_index.tutoring	60, 86, 205
abstract_inverted_index.LLM-judge	149
abstract_inverted_index.accurate.	42
abstract_inverted_index.achieving	220
abstract_inverted_index.adaptive,	36
abstract_inverted_index.automatic	143
abstract_inverted_index.behavior.	170
abstract_inverted_index.benchmark	53, 274
abstract_inverted_index.comprises	66
abstract_inverted_index.diagnose,	210
abstract_inverted_index.different	238
abstract_inverted_index.effective	111
abstract_inverted_index.guidance,	39
abstract_inverted_index.introduce	47
abstract_inverted_index.learning,	254
abstract_inverted_index.promoting	107
abstract_inverted_index.providing	98
abstract_inverted_index.releasing	266
abstract_inverted_index.responses	134
abstract_inverted_index.strengths	243
abstract_inverted_index.student's	95, 103
abstract_inverted_index.students,	34
abstract_inverted_index.tutoring,	121
abstract_inverted_index.tutoring:	25
abstract_inverted_index.TutorBench	137, 160
abstract_inverted_index.actionable	99
abstract_inverted_index.complexity	119
abstract_inverted_index.confusion,	96
abstract_inverted_index.curricula.	78
abstract_inverted_index.evaluation	52, 144
abstract_inverted_index.exhibiting	200
abstract_inverted_index.generating	89
abstract_inverted_index.outperform	249
abstract_inverted_index.rigorously	56
abstract_inverted_index.supporting	252
abstract_inverted_index.TutorBench,	48, 267
abstract_inverted_index.accompanied	124
abstract_inverted_index.development	278
abstract_inverted_index.evaluation.	136
abstract_inverted_index.generation.	113
abstract_inverted_index.high-school	75
abstract_inverted_index.performance	168
abstract_inverted_index.unsaturated	273
abstract_inverted_index.effectively,	214
abstract_inverted_index.explanations	91
abstract_inverted_index.fine-grained	142
abstract_inverted_index.improvement.	192
abstract_inverted_index.increasingly	2
abstract_inverted_index.limitations:	245
abstract_inverted_index.personalized	38
abstract_inverted_index.comprehensive	271
abstract_inverted_index.next-generation	281
abstract_inverted_index.sample-specific	126, 152
cited_by_percentile_year
countries_distinct_count	0
institutions_distinct_count	14
citation_normalized_percentile
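
The `abstract_inverted_index.*` rows above store the abstract as a map from each token to the word positions where it occurs. A minimal sketch of turning those rows back into running text follows. It assumes each payload row is a key and a value separated by a tab, with positions listed as comma-separated integers; the helper names `invert_index` and `rebuild_abstract` are hypothetical, and the sample holds only a four-row excerpt of the payload.

```python
from typing import Iterable, Optional

PREFIX = "abstract_inverted_index."


def invert_index(payload_lines: Iterable[str]) -> dict[int, str]:
    """Map each word position to its token, using only the inverted-index rows."""
    by_position: dict[int, str] = {}
    for line in payload_lines:
        key, _, value = line.partition("\t")
        if not key.startswith(PREFIX) or not value.strip():
            continue  # skip unrelated or empty payload rows
        token = key[len(PREFIX):]
        for position in value.split(","):
            by_position[int(position)] = token
    return by_position


def rebuild_abstract(payload_lines: Iterable[str], limit: Optional[int] = None) -> str:
    """Join tokens in position order; `limit` keeps only positions below it."""
    by_position = invert_index(payload_lines)
    ordered = sorted(p for p in by_position if limit is None or p < limit)
    return " ".join(by_position[p] for p in ordered)


# Excerpt of the payload rows above (tab-separated key and value).
sample = [
    "abstract_inverted_index.As\t0",
    "abstract_inverted_index.students\t1, 213",
    "abstract_inverted_index.increasingly\t2",
    "abstract_inverted_index.adopt\t3",
    "cited_by_count\t0",
]
# The excerpt covers positions 0-3 (plus a stray 213), so cap the output at 4.
print(rebuild_abstract(sample, limit=4))  # As students increasingly adopt
```

Passing every `abstract_inverted_index.*` row with no `limit` would be expected to reproduce the full abstract, with punctuation still attached to its tokens, since the index stores entries such as `tutors.` and `end,` verbatim.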