LongProc: Benchmarking Long-Context Language Models on Long Procedural Generation

Xi Ye, Fangcong Yin, Yinghui He, Joie Zhang, Howard Yen, Tianyu Gao, Greg Durrett, Danqi Chen

2025 · Open Access · DOI: https://doi.org/10.48550/arxiv.2501.05414

Existing benchmarks for evaluating long-context language models (LCLMs) primarily focus on long-context recall, requiring models to produce short responses based on a few critical snippets while processing thousands of irrelevant tokens. We introduce LongProc (Long Procedural Generation), a new benchmark that requires both the integration of highly dispersed information and long-form generation. LongProc consists of six diverse procedural generation tasks, such as extracting structured information from HTML pages into a TSV format and executing complex search procedures to create travel plans. These tasks challenge LCLMs by testing their ability to follow detailed procedural instructions, synthesize and reason over dispersed information, and generate structured, long-form outputs (up to 8K tokens). Furthermore, as these tasks adhere to deterministic procedures and yield structured outputs, they enable reliable rule-based evaluation. We evaluated 23 LCLMs, including instruction-tuned models and recent reasoning models, on LongProc at three difficulty levels, with the maximum number of output tokens set at 500, 2K, and 8K. Notably, while all tested models claim a context window size above 32K tokens, open-weight models typically falter on 2K-token tasks, and closed-source models like GPT-4o show significant degradation on 8K-token tasks. Reasoning models achieve stronger overall performance in long-form generation, benefiting from long CoT training. Further analysis reveals that LCLMs struggle to maintain long-range coherence in long-form generations. These findings highlight critical limitations in current LCLMs and suggest substantial room for improvement. Data and code available at: https://princeton-pli.github.io/LongProc.
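
The abstract notes that deterministic procedures and structured outputs make rule-based evaluation possible. As a rough illustration only, and not the benchmark's actual scorer (which is distributed with the data and code linked above), the sketch below scores a predicted TSV against a reference TSV by exact row match. The function names `parse_tsv` and `tsv_row_f1`, the choice of row-level F1, and the sample rows are all assumptions made for this example.

```python
from collections import Counter


def parse_tsv(text: str) -> list[tuple[str, ...]]:
    """Split TSV text into rows of stripped cells, skipping blank lines."""
    rows = []
    for line in text.splitlines():
        if line.strip():
            rows.append(tuple(cell.strip() for cell in line.split("\t")))
    return rows


def tsv_row_f1(predicted: str, reference: str) -> float:
    """Row-level F1 (an assumed metric): a row counts only on exact match."""
    pred = Counter(parse_tsv(predicted))
    ref = Counter(parse_tsv(reference))
    # Multiset intersection, so duplicated rows are not over-credited.
    overlap = sum((pred & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical data: the second predicted row has one wrong cell.
reference = "Alice\t30\nBob\t25\nCara\t41"
predicted = "Alice\t30\nBob\t26\nCara\t41"
score = tsv_row_f1(predicted, reference)
print(f"row-level F1 = {score:.3f}")
```

Because the score depends only on string comparison, it needs no model-based judge, which is the property the abstract highlights.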

Related Topics

Concepts

Benchmarking, Context (archaeology), Computer science, Business, History, Marketing, Archaeology

Metadata

Type: preprint
Language: en
Landing Page: http://arxiv.org/abs/2501.05414
PDF: https://arxiv.org/pdf/2501.05414
OA Status: green
Related Works: 10
OpenAlex ID: https://openalex.org/W4406273368

All OpenAlex metadata

OpenAlex ID: https://openalex.org/W4406273368
Canonical identifier for this work in OpenAlex

DOI: https://doi.org/10.48550/arxiv.2501.05414
Digital Object Identifier

Title: LongProc: Benchmarking Long-Context Language Models on Long Procedural Generation
Work title

Type: preprint
OpenAlex work type

Language: en
Primary language

Publication year: 2025
Year of publication

Publication date: 2025-01-09
Full publication date if available

Authors: Xi Ye, Fangcong Yin, Yinghui He, Joie Zhang, Howard Yen, Tianyu Gao, Greg Durrett, Danqi Chen
List of authors in order

Landing page: https://arxiv.org/abs/2501.05414
Publisher landing page

PDF URL: https://arxiv.org/pdf/2501.05414
Direct link to full text PDF

Open access: Yes
Whether a free full text is available

OA status: green
Open access status per OpenAlex

OA URL: https://arxiv.org/pdf/2501.05414
Direct OA link when available

Concepts: Benchmarking, Context (archaeology), Computer science, Business, History, Marketing, Archaeology
Top concepts (fields/topics) attached by OpenAlex

Cited by: 0
Total citation count in OpenAlex

Related works (count): 10
Other works algorithmically related by OpenAlex

Full payload

id	https://openalex.org/W4406273368
doi	https://doi.org/10.48550/arxiv.2501.05414
ids.doi	https://doi.org/10.48550/arxiv.2501.05414
ids.openalex	https://openalex.org/W4406273368
fwci	0.0
type	preprint
title	LongProc: Benchmarking Long-Context Language Models on Long Procedural Generation
biblio.issue
biblio.volume
biblio.last_page
biblio.first_page
topics[0].id	https://openalex.org/T10181
topics[0].field.id	https://openalex.org/fields/17
topics[0].field.display_name	Computer Science
topics[0].score	0.9944999814033508
topics[0].domain.id	https://openalex.org/domains/3
topics[0].domain.display_name	Physical Sciences
topics[0].subfield.id	https://openalex.org/subfields/1702
topics[0].subfield.display_name	Artificial Intelligence
topics[0].display_name	Natural Language Processing Techniques
topics[1].id	https://openalex.org/T10028
topics[1].field.id	https://openalex.org/fields/17
topics[1].field.display_name	Computer Science
topics[1].score	0.9940000176429749
topics[1].domain.id	https://openalex.org/domains/3
topics[1].domain.display_name	Physical Sciences
topics[1].subfield.id	https://openalex.org/subfields/1702
topics[1].subfield.display_name	Artificial Intelligence
topics[1].display_name	Topic Modeling
topics[2].id	https://openalex.org/T12031
topics[2].field.id	https://openalex.org/fields/17
topics[2].field.display_name	Computer Science
topics[2].score	0.942799985408783
topics[2].domain.id	https://openalex.org/domains/3
topics[2].domain.display_name	Physical Sciences
topics[2].subfield.id	https://openalex.org/subfields/1702
topics[2].subfield.display_name	Artificial Intelligence
topics[2].display_name	Speech and dialogue systems
is_xpac	False
apc_list
apc_paid
concepts[0].id	https://openalex.org/C86251818
concepts[0].level	2
concepts[0].score	0.857162356376648
concepts[0].wikidata	https://www.wikidata.org/wiki/Q816754
concepts[0].display_name	Benchmarking
concepts[1].id	https://openalex.org/C2779343474
concepts[1].level	2
concepts[1].score	0.6242827773094177
concepts[1].wikidata	https://www.wikidata.org/wiki/Q3109175
concepts[1].display_name	Context (archaeology)
concepts[2].id	https://openalex.org/C41008148
concepts[2].level	0
concepts[2].score	0.5601645112037659
concepts[2].wikidata	https://www.wikidata.org/wiki/Q21198
concepts[2].display_name	Computer science
concepts[3].id	https://openalex.org/C144133560
concepts[3].level	0
concepts[3].score	0.30203431844711304
concepts[3].wikidata	https://www.wikidata.org/wiki/Q4830453
concepts[3].display_name	Business
concepts[4].id	https://openalex.org/C95457728
concepts[4].level	0
concepts[4].score	0.12270283699035645
concepts[4].wikidata	https://www.wikidata.org/wiki/Q309
concepts[4].display_name	History
concepts[5].id	https://openalex.org/C162853370
concepts[5].level	1
concepts[5].score	0.0547221302986145
concepts[5].wikidata	https://www.wikidata.org/wiki/Q39809
concepts[5].display_name	Marketing
concepts[6].id	https://openalex.org/C166957645
concepts[6].level	1
concepts[6].score	0.0
concepts[6].wikidata	https://www.wikidata.org/wiki/Q23498
concepts[6].display_name	Archaeology
keywords[0].id	https://openalex.org/keywords/benchmarking
keywords[0].score	0.857162356376648
keywords[0].display_name	Benchmarking
keywords[1].id	https://openalex.org/keywords/context
keywords[1].score	0.6242827773094177
keywords[1].display_name	Context (archaeology)
keywords[2].id	https://openalex.org/keywords/computer-science
keywords[2].score	0.5601645112037659
keywords[2].display_name	Computer science
keywords[3].id	https://openalex.org/keywords/business
keywords[3].score	0.30203431844711304
keywords[3].display_name	Business
keywords[4].id	https://openalex.org/keywords/history
keywords[4].score	0.12270283699035645
keywords[4].display_name	History
keywords[5].id	https://openalex.org/keywords/marketing
keywords[5].score	0.0547221302986145
keywords[5].display_name	Marketing
language	en
locations[0].id	pmh:oai:arXiv.org:2501.05414
locations[0].is_oa	True
locations[0].source.id	https://openalex.org/S4306400194
locations[0].source.issn
locations[0].source.type	repository
locations[0].source.is_oa	True
locations[0].source.issn_l
locations[0].source.is_core	False
locations[0].source.is_in_doaj	False
locations[0].source.display_name	arXiv (Cornell University)
locations[0].source.host_organization	https://openalex.org/I205783295
locations[0].source.host_organization_name	Cornell University
locations[0].source.host_organization_lineage	https://openalex.org/I205783295
locations[0].license
locations[0].pdf_url	https://arxiv.org/pdf/2501.05414
locations[0].version	submittedVersion
locations[0].raw_type	text
locations[0].license_id
locations[0].is_accepted	False
locations[0].is_published	False
locations[0].raw_source_name
locations[0].landing_page_url	http://arxiv.org/abs/2501.05414
locations[1].id	doi:10.48550/arxiv.2501.05414
locations[1].is_oa	True
locations[1].source.id	https://openalex.org/S4306400194
locations[1].source.issn
locations[1].source.type	repository
locations[1].source.is_oa	True
locations[1].source.issn_l
locations[1].source.is_core	False
locations[1].source.is_in_doaj	False
locations[1].source.display_name	arXiv (Cornell University)
locations[1].source.host_organization	https://openalex.org/I205783295
locations[1].source.host_organization_name	Cornell University
locations[1].source.host_organization_lineage	https://openalex.org/I205783295
locations[1].license
locations[1].pdf_url
locations[1].version
locations[1].raw_type	article
locations[1].license_id
locations[1].is_accepted	False
locations[1].is_published
locations[1].raw_source_name
locations[1].landing_page_url	https://doi.org/10.48550/arxiv.2501.05414
indexed_in	arxiv, datacite
authorships[0].author.id	https://openalex.org/A5100579142
authorships[0].author.orcid
authorships[0].author.display_name	Xi Ye
authorships[0].author_position	first
authorships[0].raw_author_name	Ye, Xi
authorships[0].is_corresponding	False
authorships[1].author.id	https://openalex.org/A5099031453
authorships[1].author.orcid
authorships[1].author.display_name	Fangcong Yin
authorships[1].author_position	middle
authorships[1].raw_author_name	Yin, Fangcong
authorships[1].is_corresponding	False
authorships[2].author.id	https://openalex.org/A5101114597
authorships[2].author.orcid	https://orcid.org/0000-0002-6857-0507
authorships[2].author.display_name	Ying‐Hui He
authorships[2].author_position	middle
authorships[2].raw_author_name	He, Yinghui
authorships[2].is_corresponding	False
authorships[3].author.id	https://openalex.org/A5029016146
authorships[3].author.orcid
authorships[3].author.display_name	J. Zhang
authorships[3].author_position	middle
authorships[3].raw_author_name	Zhang, Joie
authorships[3].is_corresponding	False
authorships[4].author.id	https://openalex.org/A5111465448
authorships[4].author.orcid
authorships[4].author.display_name	H. W. Yen
authorships[4].author_position	middle
authorships[4].raw_author_name	Yen, Howard
authorships[4].is_corresponding	False
authorships[5].author.id	https://openalex.org/A5006863331
authorships[5].author.orcid	https://orcid.org/0000-0002-5178-0866
authorships[5].author.display_name	Tianyu Gao
authorships[5].author_position	middle
authorships[5].raw_author_name	Gao, Tianyu
authorships[5].is_corresponding	False
authorships[6].author.id	https://openalex.org/A5015133105
authorships[6].author.orcid	https://orcid.org/0000-0002-7061-7298
authorships[6].author.display_name	Greg Durrett
authorships[6].author_position	middle
authorships[6].raw_author_name	Durrett, Greg
authorships[6].is_corresponding	False
authorships[7].author.id	https://openalex.org/A5051064208
authorships[7].author.orcid	https://orcid.org/0000-0002-6226-6838
authorships[7].author.display_name	Danqi Chen
authorships[7].author_position	last
authorships[7].raw_author_name	Chen, Danqi
authorships[7].is_corresponding	False
has_content.pdf	False
has_content.grobid_xml	False
is_paratext	False
open_access.is_oa	True
open_access.oa_url	https://arxiv.org/pdf/2501.05414
open_access.oa_status	green
open_access.any_repository_has_fulltext	False
created_date	2025-10-10T00:00:00
display_name	LongProc: Benchmarking Long-Context Language Models on Long Procedural Generation
has_fulltext	False
is_retracted	False
updated_date	2025-11-06T06:51:31.235846
primary_topic.id	https://openalex.org/T10181
primary_topic.field.id	https://openalex.org/fields/17
primary_topic.field.display_name	Computer Science
primary_topic.score	0.9944999814033508
primary_topic.domain.id	https://openalex.org/domains/3
primary_topic.domain.display_name	Physical Sciences
primary_topic.subfield.id	https://openalex.org/subfields/1702
primary_topic.subfield.display_name	Artificial Intelligence
primary_topic.display_name	Natural Language Processing Techniques
related_works	https://openalex.org/W4391375266, https://openalex.org/W2899084033, https://openalex.org/W2748952813, https://openalex.org/W4238897586, https://openalex.org/W435179959, https://openalex.org/W2619091065, https://openalex.org/W2059640416, https://openalex.org/W1490753184, https://openalex.org/W2284465472, https://openalex.org/W2291782699
cited_by_count	0
locations_count	2
best_oa_location.id	pmh:oai:arXiv.org:2501.05414
best_oa_location.is_oa	True
best_oa_location.source.id	https://openalex.org/S4306400194
best_oa_location.source.issn
best_oa_location.source.type	repository
best_oa_location.source.is_oa	True
best_oa_location.source.issn_l
best_oa_location.source.is_core	False
best_oa_location.source.is_in_doaj	False
best_oa_location.source.display_name	arXiv (Cornell University)
best_oa_location.source.host_organization	https://openalex.org/I205783295
best_oa_location.source.host_organization_name	Cornell University
best_oa_location.source.host_organization_lineage	https://openalex.org/I205783295
best_oa_location.license
best_oa_location.pdf_url	https://arxiv.org/pdf/2501.05414
best_oa_location.version	submittedVersion
best_oa_location.raw_type	text
best_oa_location.license_id
best_oa_location.is_accepted	False
best_oa_location.is_published	False
best_oa_location.raw_source_name
best_oa_location.landing_page_url	http://arxiv.org/abs/2501.05414
primary_location.id	pmh:oai:arXiv.org:2501.05414
primary_location.is_oa	True
primary_location.source.id	https://openalex.org/S4306400194
primary_location.source.issn
primary_location.source.type	repository
primary_location.source.is_oa	True
primary_location.source.issn_l
primary_location.source.is_core	False
primary_location.source.is_in_doaj	False
primary_location.source.display_name	arXiv (Cornell University)
primary_location.source.host_organization	https://openalex.org/I205783295
primary_location.source.host_organization_name	Cornell University
primary_location.source.host_organization_lineage	https://openalex.org/I205783295
primary_location.license
primary_location.pdf_url	https://arxiv.org/pdf/2501.05414
primary_location.version	submittedVersion
primary_location.raw_type	text
primary_location.license_id
primary_location.is_accepted	False
primary_location.is_published	False
primary_location.raw_source_name
primary_location.landing_page_url	http://arxiv.org/abs/2501.05414
publication_date	2025-01-09
publication_year	2025
referenced_works_count	0
abstract_inverted_index.a	21, 37, 69, 162
abstract_inverted_index.23	128
abstract_inverted_index.8K	107
abstract_inverted_index.We	31, 126
abstract_inverted_index.as	61, 110
abstract_inverted_index.at	139, 151
abstract_inverted_index.by	85
abstract_inverted_index.in	193, 211, 219
abstract_inverted_index.of	28, 45, 54, 147
abstract_inverted_index.on	10, 20, 137, 173, 184
abstract_inverted_index.to	15, 77, 89, 106, 114, 207
abstract_inverted_index.(up	105
abstract_inverted_index.2K,	153
abstract_inverted_index.32K	167
abstract_inverted_index.8K.	155
abstract_inverted_index.CoT	199
abstract_inverted_index.TSV	70
abstract_inverted_index.all	158
abstract_inverted_index.and	49, 72, 95, 100, 117, 133, 154, 176, 222, 229
abstract_inverted_index.at:	232
abstract_inverted_index.few	22
abstract_inverted_index.for	2, 226
abstract_inverted_index.new	38
abstract_inverted_index.set	150
abstract_inverted_index.six	55
abstract_inverted_index.the	43, 144
abstract_inverted_index.500,	152
abstract_inverted_index.Data	228
abstract_inverted_index.HTML	66
abstract_inverted_index.both	42
abstract_inverted_index.code	230
abstract_inverted_index.from	65, 197
abstract_inverted_index.into	68
abstract_inverted_index.like	179
abstract_inverted_index.long	198
abstract_inverted_index.over	97
abstract_inverted_index.room	225
abstract_inverted_index.show	181
abstract_inverted_index.size	165
abstract_inverted_index.such	60
abstract_inverted_index.that	40, 204
abstract_inverted_index.they	121
abstract_inverted_index.with	143
abstract_inverted_index.(Long	34
abstract_inverted_index.LCLMs	84, 205, 221
abstract_inverted_index.These	81, 214
abstract_inverted_index.above	166
abstract_inverted_index.based	19
abstract_inverted_index.claim	161
abstract_inverted_index.focus	9
abstract_inverted_index.pages	67
abstract_inverted_index.short	17
abstract_inverted_index.tasks	82, 112
abstract_inverted_index.their	87
abstract_inverted_index.these	111
abstract_inverted_index.three	140
abstract_inverted_index.while	25, 157
abstract_inverted_index.yield	118
abstract_inverted_index.GPT-4o	180
abstract_inverted_index.LCLMs,	129
abstract_inverted_index.adhere	113
abstract_inverted_index.create	78
abstract_inverted_index.enable	122
abstract_inverted_index.falter	172
abstract_inverted_index.follow	90
abstract_inverted_index.format	71
abstract_inverted_index.highly	46
abstract_inverted_index.models	6, 14, 132, 160, 170, 178, 188
abstract_inverted_index.number	146
abstract_inverted_index.output	148
abstract_inverted_index.plans.	80
abstract_inverted_index.reason	96
abstract_inverted_index.recent	134
abstract_inverted_index.search	75
abstract_inverted_index.tasks,	59, 175
abstract_inverted_index.tasks.	186
abstract_inverted_index.tested	159
abstract_inverted_index.tokens	149
abstract_inverted_index.travel	79
abstract_inverted_index.window	164
abstract_inverted_index.(LCLMs)	7
abstract_inverted_index.Further	201
abstract_inverted_index.ability	88
abstract_inverted_index.achieve	189
abstract_inverted_index.complex	74
abstract_inverted_index.context	163
abstract_inverted_index.current	220
abstract_inverted_index.diverse	56
abstract_inverted_index.levels,	142
abstract_inverted_index.maximum	145
abstract_inverted_index.models,	136
abstract_inverted_index.outputs	104
abstract_inverted_index.overall	191
abstract_inverted_index.produce	16
abstract_inverted_index.recall,	12
abstract_inverted_index.reveals	203
abstract_inverted_index.suggest	223
abstract_inverted_index.testing	86
abstract_inverted_index.tokens,	168
abstract_inverted_index.tokens.	30
abstract_inverted_index.2K-token	174
abstract_inverted_index.8K-token	185
abstract_inverted_index.Existing	0
abstract_inverted_index.LongProc	33, 52, 138
abstract_inverted_index.Notably,	156
abstract_inverted_index.analysis	202
abstract_inverted_index.consists	53
abstract_inverted_index.critical	23, 217
abstract_inverted_index.detailed	91
abstract_inverted_index.findings	215
abstract_inverted_index.generate	101
abstract_inverted_index.language	5
abstract_inverted_index.maintain	208
abstract_inverted_index.outputs,	120
abstract_inverted_index.reliable	123
abstract_inverted_index.requires	41
abstract_inverted_index.snippets	24
abstract_inverted_index.stronger	190
abstract_inverted_index.struggle	206
abstract_inverted_index.tokens).	108
abstract_inverted_index.Reasoning	187
abstract_inverted_index.available	231
abstract_inverted_index.benchmark	39
abstract_inverted_index.challenge	83
abstract_inverted_index.coherence	210
abstract_inverted_index.dispersed	47, 98
abstract_inverted_index.evaluated	127
abstract_inverted_index.executing	73
abstract_inverted_index.highlight	216
abstract_inverted_index.including	130
abstract_inverted_index.introduce	32
abstract_inverted_index.long-form	50, 103, 194, 212
abstract_inverted_index.primarily	8
abstract_inverted_index.reasoning	135
abstract_inverted_index.requiring	13
abstract_inverted_index.responses	18
abstract_inverted_index.thousands	27
abstract_inverted_index.training.	200
abstract_inverted_index.typically	171
abstract_inverted_index.Procedural	35
abstract_inverted_index.benchmarks	1
abstract_inverted_index.benefiting	196
abstract_inverted_index.difficulty	141
abstract_inverted_index.evaluating	3
abstract_inverted_index.extracting	62
abstract_inverted_index.generation	58
abstract_inverted_index.irrelevant	29
abstract_inverted_index.long-range	209
abstract_inverted_index.procedural	57, 92
abstract_inverted_index.procedures	76, 116
abstract_inverted_index.processing	26
abstract_inverted_index.rule-based	124
abstract_inverted_index.structured	63, 119
abstract_inverted_index.synthesize	94
abstract_inverted_index.degradation	183
abstract_inverted_index.evaluation.	125
abstract_inverted_index.generation,	195
abstract_inverted_index.generation.	51
abstract_inverted_index.information	48, 64
abstract_inverted_index.integration	44
abstract_inverted_index.limitations	218
abstract_inverted_index.open-weight	169
abstract_inverted_index.performance	192
abstract_inverted_index.significant	182
abstract_inverted_index.structured,	102
abstract_inverted_index.substantial	224
abstract_inverted_index.Furthermore,	109
abstract_inverted_index.Generation),	36
abstract_inverted_index.generations.	213
abstract_inverted_index.improvement.	227
abstract_inverted_index.information,	99
abstract_inverted_index.long-context	4, 11
abstract_inverted_index.closed-source	177
abstract_inverted_index.deterministic	115
abstract_inverted_index.instructions,	93
abstract_inverted_index.instruction-tuned	131
abstract_inverted_index.https://princeton-pli.github.io/LongProc.	233
cited_by_percentile_year
countries_distinct_count	0
institutions_distinct_count	8
citation_normalized_percentile
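
The `abstract_inverted_index.*` rows in the payload above store the abstract as a map from each token to the word positions where it occurs. A minimal sketch of turning such rows back into running text follows. It assumes each row is a tab-separated key and a comma-separated list of positions, as shown above; the function names `parse_rows` and `rebuild_abstract` are hypothetical, and the sample rows are copied from the payload with position lists trimmed to the first four words for brevity.

```python
PREFIX = "abstract_inverted_index."


def parse_rows(lines: list[str]) -> dict[str, list[int]]:
    """Collect token -> positions from flattened 'key<TAB>p1, p2, ...' rows."""
    index: dict[str, list[int]] = {}
    for line in lines:
        key, _, value = line.partition("\t")
        if key.startswith(PREFIX) and value.strip():
            token = key[len(PREFIX):]
            index[token] = [int(p) for p in value.split(",")]
    return index


def rebuild_abstract(index: dict[str, list[int]]) -> str:
    """Place every token at each of its positions, then join in order."""
    by_position: dict[int, str] = {}
    for token, positions in index.items():
        for pos in positions:
            by_position[pos] = token
    return " ".join(by_position[pos] for pos in sorted(by_position))


# Sample rows (trimmed); unrelated payload rows are ignored by parse_rows.
rows = [
    "abstract_inverted_index.for\t2",
    "abstract_inverted_index.Existing\t0",
    "abstract_inverted_index.benchmarks\t1",
    "abstract_inverted_index.evaluating\t3",
    "cited_by_count\t0",
]
text = rebuild_abstract(parse_rows(rows))
print(text)
```

Applied to the full set of rows, the same procedure should reproduce the abstract shown near the top of this record, since tokens keep their original punctuation and casing.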