Instruction-Following Evaluation for Large Language Models


Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, Le Hou

2023 · Open Access · DOI: https://doi.org/10.48550/arxiv.2311.07911

One core capability of Large Language Models (LLMs) is to follow natural language instructions. However, the evaluation of such abilities is not standardized: Human evaluations are expensive, slow, and not objectively reproducible, while LLM-based auto-evaluation is potentially biased or limited by the ability of the evaluator LLM. To overcome these issues, we introduce Instruction-Following Eval (IFEval) for large language models. IFEval is a straightforward and easy-to-reproduce evaluation benchmark. It focuses on a set of "verifiable instructions" such as "write in more than 400 words" and "mention the keyword of AI at least 3 times". We identified 25 types of those verifiable instructions and constructed around 500 prompts, with each prompt containing one or more verifiable instructions. We show evaluation results of two widely available LLMs on the market. Our code and data can be found at https://github.com/google-research/google-research/tree/master/instruction_following_eval
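The two example instructions quoted in the abstract can be checked by a short program, which is what makes them "verifiable". The following is a minimal sketch of that idea, not the authors' implementation: the function names, the whitespace-based word count, the whole-word case-insensitive keyword match, and the rule that every instruction in a prompt must pass are all assumptions made for illustration.

```python
import re


def count_words(text):
    """Count whitespace-separated tokens (an assumed, crude definition of 'word')."""
    return len(text.split())


def check_min_words(response, min_words):
    """Pass if the response has strictly more than `min_words` words."""
    return count_words(response) > min_words


def check_keyword_frequency(response, keyword, min_count):
    """Pass if `keyword` appears at least `min_count` times as a whole word."""
    pattern = r"\b" + re.escape(keyword) + r"\b"
    matches = re.findall(pattern, response, flags=re.IGNORECASE)
    return len(matches) >= min_count


def follows_all(response, checks):
    """Hypothetical aggregate: a prompt may carry several instructions; require all."""
    return all(check(response) for check in checks)


if __name__ == "__main__":
    response = "AI is everywhere. AI writes, and AI reads."
    checks = [
        # "mention the keyword of AI at least 3 times"
        lambda r: check_keyword_frequency(r, "AI", 3),
        # "write in more than 400 words" (deliberately fails on this short text)
        lambda r: check_min_words(r, 400),
    ]
    for i, check in enumerate(checks):
        print("instruction", i, "followed:", check(response))
    print("all followed:", follows_all(response, checks))
```

Because each check is a deterministic function of the response text, the same response always gets the same verdict, which is the reproducibility property the abstract contrasts with human and LLM-based evaluation.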

Related Topics

Computer Science

Benchmark (Surveying)

Artificial Intelligence

Programming Language

Geography

Geodesy

Concepts

Verifiable secret sharing, Computer science, Benchmark (surveying), Set (abstract data type), Code (set theory), Natural language processing, Artificial intelligence, Programming language, Geography, Geodesy

Metadata

Type: preprint
Language: en
Landing Page: http://arxiv.org/abs/2311.07911
PDF: https://arxiv.org/pdf/2311.07911
OA Status: green
Cited By: 22
Related Works: 10
OpenAlex ID: https://openalex.org/W4388718089

All OpenAlex metadata


OpenAlex ID: https://openalex.org/W4388718089
Canonical identifier for this work in OpenAlex

DOI: https://doi.org/10.48550/arxiv.2311.07911
Digital Object Identifier

Title: Instruction-Following Evaluation for Large Language Models
Work title

Type: preprint
OpenAlex work type

Language: en
Primary language

Publication year: 2023
Year of publication

Publication date: 2023-11-14
Full publication date if available

Authors: Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, Le Hou
List of authors in order

Landing page: https://arxiv.org/abs/2311.07911
Publisher landing page

PDF URL: https://arxiv.org/pdf/2311.07911
Direct link to full text PDF

Open access: Yes
Whether a free full text is available

OA status: green
Open access status per OpenAlex

OA URL: https://arxiv.org/pdf/2311.07911
Direct OA link when available

Concepts: Verifiable secret sharing, Computer science, Benchmark (surveying), Set (abstract data type), Code (set theory), Natural language processing, Artificial intelligence, Programming language, Geography, Geodesy
Top concepts (fields/topics) attached by OpenAlex

Cited by: 22
Total citation count in OpenAlex

Citations by year (recent): 2025: 14, 2024: 8
Per-year citation counts (last 5 years)

Related works (count): 10
Other works algorithmically related by OpenAlex

Full payload

id	https://openalex.org/W4388718089
doi	https://doi.org/10.48550/arxiv.2311.07911
ids.doi	https://doi.org/10.48550/arxiv.2311.07911
ids.openalex	https://openalex.org/W4388718089
fwci
type	preprint
title	Instruction-Following Evaluation for Large Language Models
biblio.issue
biblio.volume
biblio.last_page
biblio.first_page
topics[0].id	https://openalex.org/T10028
topics[0].field.id	https://openalex.org/fields/17
topics[0].field.display_name	Computer Science
topics[0].score	0.9896000027656555
topics[0].domain.id	https://openalex.org/domains/3
topics[0].domain.display_name	Physical Sciences
topics[0].subfield.id	https://openalex.org/subfields/1702
topics[0].subfield.display_name	Artificial Intelligence
topics[0].display_name	Topic Modeling
topics[1].id	https://openalex.org/T10181
topics[1].field.id	https://openalex.org/fields/17
topics[1].field.display_name	Computer Science
topics[1].score	0.9684000015258789
topics[1].domain.id	https://openalex.org/domains/3
topics[1].domain.display_name	Physical Sciences
topics[1].subfield.id	https://openalex.org/subfields/1702
topics[1].subfield.display_name	Artificial Intelligence
topics[1].display_name	Natural Language Processing Techniques
topics[2].id	https://openalex.org/T13629
topics[2].field.id	https://openalex.org/fields/17
topics[2].field.display_name	Computer Science
topics[2].score	0.9006999731063843
topics[2].domain.id	https://openalex.org/domains/3
topics[2].domain.display_name	Physical Sciences
topics[2].subfield.id	https://openalex.org/subfields/1702
topics[2].subfield.display_name	Artificial Intelligence
topics[2].display_name	Text Readability and Simplification
is_xpac	False
apc_list
apc_paid
concepts[0].id	https://openalex.org/C85847156
concepts[0].level	3
concepts[0].score	0.9251422882080078
concepts[0].wikidata	https://www.wikidata.org/wiki/Q59015987
concepts[0].display_name	Verifiable secret sharing
concepts[1].id	https://openalex.org/C41008148
concepts[1].level	0
concepts[1].score	0.8390198945999146
concepts[1].wikidata	https://www.wikidata.org/wiki/Q21198
concepts[1].display_name	Computer science
concepts[2].id	https://openalex.org/C185798385
concepts[2].level	2
concepts[2].score	0.6230019927024841
concepts[2].wikidata	https://www.wikidata.org/wiki/Q1161707
concepts[2].display_name	Benchmark (surveying)
concepts[3].id	https://openalex.org/C177264268
concepts[3].level	2
concepts[3].score	0.6024664044380188
concepts[3].wikidata	https://www.wikidata.org/wiki/Q1514741
concepts[3].display_name	Set (abstract data type)
concepts[4].id	https://openalex.org/C2776760102
concepts[4].level	3
concepts[4].score	0.583986222743988
concepts[4].wikidata	https://www.wikidata.org/wiki/Q5139990
concepts[4].display_name	Code (set theory)
concepts[5].id	https://openalex.org/C204321447
concepts[5].level	1
concepts[5].score	0.39894920587539673
concepts[5].wikidata	https://www.wikidata.org/wiki/Q30642
concepts[5].display_name	Natural language processing
concepts[6].id	https://openalex.org/C154945302
concepts[6].level	1
concepts[6].score	0.39167314767837524
concepts[6].wikidata	https://www.wikidata.org/wiki/Q11660
concepts[6].display_name	Artificial intelligence
concepts[7].id	https://openalex.org/C199360897
concepts[7].level	1
concepts[7].score	0.3708083927631378
concepts[7].wikidata	https://www.wikidata.org/wiki/Q9143
concepts[7].display_name	Programming language
concepts[8].id	https://openalex.org/C205649164
concepts[8].level	0
concepts[8].score	0.0
concepts[8].wikidata	https://www.wikidata.org/wiki/Q1071
concepts[8].display_name	Geography
concepts[9].id	https://openalex.org/C13280743
concepts[9].level	1
concepts[9].score	0.0
concepts[9].wikidata	https://www.wikidata.org/wiki/Q131089
concepts[9].display_name	Geodesy
keywords[0].id	https://openalex.org/keywords/verifiable-secret-sharing
keywords[0].score	0.9251422882080078
keywords[0].display_name	Verifiable secret sharing
keywords[1].id	https://openalex.org/keywords/computer-science
keywords[1].score	0.8390198945999146
keywords[1].display_name	Computer science
keywords[2].id	https://openalex.org/keywords/benchmark
keywords[2].score	0.6230019927024841
keywords[2].display_name	Benchmark (surveying)
keywords[3].id	https://openalex.org/keywords/set
keywords[3].score	0.6024664044380188
keywords[3].display_name	Set (abstract data type)
keywords[4].id	https://openalex.org/keywords/code
keywords[4].score	0.583986222743988
keywords[4].display_name	Code (set theory)
keywords[5].id	https://openalex.org/keywords/natural-language-processing
keywords[5].score	0.39894920587539673
keywords[5].display_name	Natural language processing
keywords[6].id	https://openalex.org/keywords/artificial-intelligence
keywords[6].score	0.39167314767837524
keywords[6].display_name	Artificial intelligence
keywords[7].id	https://openalex.org/keywords/programming-language
keywords[7].score	0.3708083927631378
keywords[7].display_name	Programming language
language	en
locations[0].id	pmh:oai:arXiv.org:2311.07911
locations[0].is_oa	True
locations[0].source.id	https://openalex.org/S4306400194
locations[0].source.issn
locations[0].source.type	repository
locations[0].source.is_oa	True
locations[0].source.issn_l
locations[0].source.is_core	False
locations[0].source.is_in_doaj	False
locations[0].source.display_name	arXiv (Cornell University)
locations[0].source.host_organization	https://openalex.org/I205783295
locations[0].source.host_organization_name	Cornell University
locations[0].source.host_organization_lineage	https://openalex.org/I205783295
locations[0].license
locations[0].pdf_url	https://arxiv.org/pdf/2311.07911
locations[0].version	submittedVersion
locations[0].raw_type	text
locations[0].license_id
locations[0].is_accepted	False
locations[0].is_published	False
locations[0].raw_source_name
locations[0].landing_page_url	http://arxiv.org/abs/2311.07911
locations[1].id	doi:10.48550/arxiv.2311.07911
locations[1].is_oa	True
locations[1].source.id	https://openalex.org/S4306400194
locations[1].source.issn
locations[1].source.type	repository
locations[1].source.is_oa	True
locations[1].source.issn_l
locations[1].source.is_core	False
locations[1].source.is_in_doaj	False
locations[1].source.display_name	arXiv (Cornell University)
locations[1].source.host_organization	https://openalex.org/I205783295
locations[1].source.host_organization_name	Cornell University
locations[1].source.host_organization_lineage	https://openalex.org/I205783295
locations[1].license	cc-by
locations[1].pdf_url
locations[1].version
locations[1].raw_type	article
locations[1].license_id	https://openalex.org/licenses/cc-by
locations[1].is_accepted	False
locations[1].is_published
locations[1].raw_source_name
locations[1].landing_page_url	https://doi.org/10.48550/arxiv.2311.07911
indexed_in	arxiv, datacite
authorships[0].author.id	https://openalex.org/A5090567273
authorships[0].author.orcid
authorships[0].author.display_name	Jeffrey Zhou
authorships[0].author_position	first
authorships[0].raw_author_name	Zhou, Jeffrey
authorships[0].is_corresponding	False
authorships[1].author.id	https://openalex.org/A5101267760
authorships[1].author.orcid
authorships[1].author.display_name	Tianjian Lu
authorships[1].author_position	middle
authorships[1].raw_author_name	Lu, Tianjian
authorships[1].is_corresponding	False
authorships[2].author.id	https://openalex.org/A5063722751
authorships[2].author.orcid	https://orcid.org/0009-0001-6413-7001
authorships[2].author.display_name	Swaroop Mishra
authorships[2].author_position	middle
authorships[2].raw_author_name	Mishra, Swaroop
authorships[2].is_corresponding	False
authorships[3].author.id	https://openalex.org/A5047714914
authorships[3].author.orcid
authorships[3].author.display_name	Siddhartha Brahma
authorships[3].author_position	middle
authorships[3].raw_author_name	Brahma, Siddhartha
authorships[3].is_corresponding	False
authorships[4].author.id	https://openalex.org/A5108520926
authorships[4].author.orcid
authorships[4].author.display_name	Sujoy Basu
authorships[4].author_position	middle
authorships[4].raw_author_name	Basu, Sujoy
authorships[4].is_corresponding	False
authorships[5].author.id	https://openalex.org/A5009136371
authorships[5].author.orcid	https://orcid.org/0000-0002-5914-9622
authorships[5].author.display_name	Yi Luan
authorships[5].author_position	middle
authorships[5].raw_author_name	Luan, Yi
authorships[5].is_corresponding	False
authorships[6].author.id	https://openalex.org/A5061512999
authorships[6].author.orcid
authorships[6].author.display_name	Denny Zhou
authorships[6].author_position	middle
authorships[6].raw_author_name	Zhou, Denny
authorships[6].is_corresponding	False
authorships[7].author.id	https://openalex.org/A5101315776
authorships[7].author.orcid
authorships[7].author.display_name	Le Hou
authorships[7].author_position	last
authorships[7].raw_author_name	Hou, Le
authorships[7].is_corresponding	False
has_content.pdf	False
has_content.grobid_xml	False
is_paratext	False
open_access.is_oa	True
open_access.oa_url	https://arxiv.org/pdf/2311.07911
open_access.oa_status	green
open_access.any_repository_has_fulltext	False
created_date	2025-10-10T00:00:00
display_name	Instruction-Following Evaluation for Large Language Models
has_fulltext	False
is_retracted	False
updated_date	2025-11-06T06:51:31.235846
primary_topic.id	https://openalex.org/T10028
primary_topic.field.id	https://openalex.org/fields/17
primary_topic.field.display_name	Computer Science
primary_topic.score	0.9896000027656555
primary_topic.domain.id	https://openalex.org/domains/3
primary_topic.domain.display_name	Physical Sciences
primary_topic.subfield.id	https://openalex.org/subfields/1702
primary_topic.subfield.display_name	Artificial Intelligence
primary_topic.display_name	Topic Modeling
related_works	https://openalex.org/W2355730523, https://openalex.org/W152021879, https://openalex.org/W2072918937, https://openalex.org/W2365629437, https://openalex.org/W2023935927, https://openalex.org/W2348330439, https://openalex.org/W2350372928, https://openalex.org/W2377292126, https://openalex.org/W2128900334, https://openalex.org/W2139109546
cited_by_count	22
counts_by_year[0].year	2025
counts_by_year[0].cited_by_count	14
counts_by_year[1].year	2024
counts_by_year[1].cited_by_count	8
locations_count	2
best_oa_location.id	pmh:oai:arXiv.org:2311.07911
best_oa_location.is_oa	True
best_oa_location.source.id	https://openalex.org/S4306400194
best_oa_location.source.issn
best_oa_location.source.type	repository
best_oa_location.source.is_oa	True
best_oa_location.source.issn_l
best_oa_location.source.is_core	False
best_oa_location.source.is_in_doaj	False
best_oa_location.source.display_name	arXiv (Cornell University)
best_oa_location.source.host_organization	https://openalex.org/I205783295
best_oa_location.source.host_organization_name	Cornell University
best_oa_location.source.host_organization_lineage	https://openalex.org/I205783295
best_oa_location.license
best_oa_location.pdf_url	https://arxiv.org/pdf/2311.07911
best_oa_location.version	submittedVersion
best_oa_location.raw_type	text
best_oa_location.license_id
best_oa_location.is_accepted	False
best_oa_location.is_published	False
best_oa_location.raw_source_name
best_oa_location.landing_page_url	http://arxiv.org/abs/2311.07911
primary_location.id	pmh:oai:arXiv.org:2311.07911
primary_location.is_oa	True
primary_location.source.id	https://openalex.org/S4306400194
primary_location.source.issn
primary_location.source.type	repository
primary_location.source.is_oa	True
primary_location.source.issn_l
primary_location.source.is_core	False
primary_location.source.is_in_doaj	False
primary_location.source.display_name	arXiv (Cornell University)
primary_location.source.host_organization	https://openalex.org/I205783295
primary_location.source.host_organization_name	Cornell University
primary_location.source.host_organization_lineage	https://openalex.org/I205783295
primary_location.license
primary_location.pdf_url	https://arxiv.org/pdf/2311.07911
primary_location.version	submittedVersion
primary_location.raw_type	text
primary_location.license_id
primary_location.is_accepted	False
primary_location.is_published	False
primary_location.raw_source_name
primary_location.landing_page_url	http://arxiv.org/abs/2311.07911
publication_date	2023-11-14
publication_year	2023
referenced_works_count	0
abstract_inverted_index.3	92
abstract_inverted_index.a	62, 71
abstract_inverted_index.25	96
abstract_inverted_index.AI	89
abstract_inverted_index.It	68
abstract_inverted_index.To	47
abstract_inverted_index.We	94, 116
abstract_inverted_index.as	77
abstract_inverted_index.at	90, 135
abstract_inverted_index.be	133
abstract_inverted_index.by	40
abstract_inverted_index.in	79
abstract_inverted_index.is	8, 20, 35, 61
abstract_inverted_index.of	3, 17, 43, 73, 88, 98, 120
abstract_inverted_index.on	70, 125
abstract_inverted_index.or	38, 112
abstract_inverted_index.to	9
abstract_inverted_index.we	51
abstract_inverted_index.400	82
abstract_inverted_index.500	105
abstract_inverted_index.One	0
abstract_inverted_index.Our	128
abstract_inverted_index.and	28, 64, 84, 102, 130
abstract_inverted_index.are	25
abstract_inverted_index.can	132
abstract_inverted_index.for	56
abstract_inverted_index.not	21, 29
abstract_inverted_index.one	111
abstract_inverted_index.set	72
abstract_inverted_index.the	15, 41, 44, 86, 126
abstract_inverted_index.two	121
abstract_inverted_index.Eval	54
abstract_inverted_index.LLM.	46
abstract_inverted_index.LLMs	124
abstract_inverted_index.code	129
abstract_inverted_index.core	1
abstract_inverted_index.data	131
abstract_inverted_index.each	108
abstract_inverted_index.more	80, 113
abstract_inverted_index.show	117
abstract_inverted_index.such	18, 76
abstract_inverted_index.than	81
abstract_inverted_index.with	107
abstract_inverted_index.Human	23
abstract_inverted_index.Large	4
abstract_inverted_index.found	134
abstract_inverted_index.large	57
abstract_inverted_index.least	91
abstract_inverted_index.slow,	27
abstract_inverted_index.these	49
abstract_inverted_index.those	99
abstract_inverted_index.types	97
abstract_inverted_index.while	32
abstract_inverted_index."write	78
abstract_inverted_index.(LLMs)	7
abstract_inverted_index.IFEval	60
abstract_inverted_index.Models	6
abstract_inverted_index.around	104
abstract_inverted_index.biased	37
abstract_inverted_index.follow	10
abstract_inverted_index.prompt	109
abstract_inverted_index.widely	122
abstract_inverted_index.words"	83
abstract_inverted_index.ability	42
abstract_inverted_index.focuses	69
abstract_inverted_index.issues,	50
abstract_inverted_index.keyword	87
abstract_inverted_index.limited	39
abstract_inverted_index.market.	127
abstract_inverted_index.models.	59
abstract_inverted_index.natural	11
abstract_inverted_index.results	119
abstract_inverted_index.times".	93
abstract_inverted_index."mention	85
abstract_inverted_index.(IFEval)	55
abstract_inverted_index.However,	14
abstract_inverted_index.Language	5
abstract_inverted_index.language	12, 58
abstract_inverted_index.overcome	48
abstract_inverted_index.prompts,	106
abstract_inverted_index.LLM-based	33
abstract_inverted_index.abilities	19
abstract_inverted_index.available	123
abstract_inverted_index.evaluator	45
abstract_inverted_index.introduce	52
abstract_inverted_index.benchmark.	67
abstract_inverted_index.capability	2
abstract_inverted_index.containing	110
abstract_inverted_index.evaluation	16, 66, 118
abstract_inverted_index.expensive,	26
abstract_inverted_index.identified	95
abstract_inverted_index.verifiable	100, 114
abstract_inverted_index."verifiable	74
abstract_inverted_index.constructed	103
abstract_inverted_index.evaluations	24
abstract_inverted_index.objectively	30
abstract_inverted_index.potentially	36
abstract_inverted_index.instructions	101
abstract_inverted_index.instructions"	75
abstract_inverted_index.instructions.	13, 115
abstract_inverted_index.reproducible,	31
abstract_inverted_index.standardized:	22
abstract_inverted_index.auto-evaluation	34
abstract_inverted_index.straightforward	63
abstract_inverted_index.easy-to-reproduce	65
abstract_inverted_index.Instruction-Following	53
abstract_inverted_index.https://github.com/google-research/google-research/tree/master/instruction_following_eval	136
cited_by_percentile_year
countries_distinct_count	0
institutions_distinct_count	8
sustainable_development_goals[0].id	https://metadata.un.org/sdg/4
sustainable_development_goals[0].score	0.8299999833106995
sustainable_development_goals[0].display_name	Quality Education
citation_normalized_percentile
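
The `abstract_inverted_index.*` rows in the payload above store the abstract as a map from each token to the word positions where it occurs. A minimal sketch of turning such rows back into running text follows. The helper names `parse_payload_rows` and `rebuild_abstract` are hypothetical, the tab-separated row layout is assumed from the dump above, and the sample rows are a short excerpt in which the `of` row is truncated to its first position for the demonstration.

```python
PREFIX = "abstract_inverted_index."


def parse_payload_rows(rows):
    """Collect {token: [positions]} from tab-separated 'key<TAB>value' rows."""
    index = {}
    for row in rows:
        key, _, value = row.partition("\t")
        if key.startswith(PREFIX) and value.strip():
            token = key[len(PREFIX):]
            index[token] = [int(p) for p in value.split(",")]
    return index


def rebuild_abstract(inverted_index):
    """Place every token at each of its positions, then join in position order."""
    by_position = {}
    for token, positions in inverted_index.items():
        for pos in positions:
            by_position[pos] = token
    return " ".join(by_position[pos] for pos in sorted(by_position))


if __name__ == "__main__":
    # Excerpt of the payload rows; "of" is truncated to its first position.
    sample_rows = [
        "abstract_inverted_index.of\t3",
        "abstract_inverted_index.One\t0",
        "abstract_inverted_index.core\t1",
        "abstract_inverted_index.Large\t4",
        "abstract_inverted_index.Models\t6",
        "abstract_inverted_index.Language\t5",
        "abstract_inverted_index.capability\t2",
        "cited_by_count\t22",  # unrelated rows are ignored
    ]
    print(rebuild_abstract(parse_payload_rows(sample_rows)))
```

Run on all `abstract_inverted_index.*` rows instead of the excerpt, the same two functions should reproduce the abstract shown at the top of this record, since tokens keep their punctuation and positions are zero-based.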