Instruction-Following Evaluation for Large Language Models
2023 · Open Access · DOI: https://doi.org/10.48550/arxiv.2311.07911
One core capability of Large Language Models (LLMs) is to follow natural language instructions. However, the evaluation of such abilities is not standardized: Human evaluations are expensive, slow, and not objectively reproducible, while LLM-based auto-evaluation is potentially biased or limited by the ability of the evaluator LLM. To overcome these issues, we introduce Instruction-Following Eval (IFEval) for large language models. IFEval is a straightforward and easy-to-reproduce evaluation benchmark. It focuses on a set of "verifiable instructions" such as "write in more than 400 words" and "mention the keyword of AI at least 3 times". We identified 25 types of those verifiable instructions and constructed around 500 prompts, with each prompt containing one or more verifiable instructions. We show evaluation results of two widely available LLMs on the market. Our code and data can be found at https://github.com/google-research/google-research/tree/master/instruction_following_eval
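The two example instructions quoted in the abstract can be checked mechanically, which is what makes them "verifiable". The following is a minimal sketch of that idea, not the authors' implementation from the linked repository: the function names, the whitespace-based word count, and the case-insensitive whole-word keyword match are all assumptions made for illustration.

```python
import re
from typing import Callable, Iterable


def check_min_words(response: str, min_words: int) -> bool:
    """True if the response has strictly more than `min_words` words.

    Words are approximated by whitespace-separated tokens (an assumption).
    """
    return len(response.split()) > min_words


def check_keyword_frequency(response: str, keyword: str, min_count: int) -> bool:
    """True if `keyword` occurs at least `min_count` times as a whole word.

    Matching is case-insensitive (an assumption of this sketch).
    """
    pattern = r"\b" + re.escape(keyword) + r"\b"
    return len(re.findall(pattern, response, flags=re.IGNORECASE)) >= min_count


def follows_all(response: str, checks: Iterable[Callable[[str], bool]]) -> bool:
    """True only if the response passes every check attached to the prompt."""
    return all(check(response) for check in checks)


if __name__ == "__main__":
    reply = "AI is useful. AI is everywhere. I like AI."
    checks = [
        lambda r: check_min_words(r, 400),               # "write in more than 400 words"
        lambda r: check_keyword_frequency(r, "AI", 3),   # "mention the keyword of AI at least 3 times"
    ]
    print(check_keyword_frequency(reply, "AI", 3))  # keyword instruction alone
    print(follows_all(reply, checks))               # both instructions together
```

Because each check is a plain function from a response string to a boolean, a prompt carrying several instructions can be scored by combining checks, as `follows_all` does here.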
Related Topics
- Type: preprint
- Language: en
- Landing Page: http://arxiv.org/abs/2311.07911, https://arxiv.org/pdf/2311.07911
- OA Status: green
- Cited By: 22
- Related Works: 10
- OpenAlex ID: https://openalex.org/W4388718089
Raw OpenAlex JSON
- OpenAlex ID: https://openalex.org/W4388718089
- DOI: https://doi.org/10.48550/arxiv.2311.07911
- Title: Instruction-Following Evaluation for Large Language Models
- Type: preprint
- Language: en
- Publication year: 2023
- Publication date: 2023-11-14
- Authors: Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, Le Hou
- Landing page: https://arxiv.org/abs/2311.07911
- PDF URL: https://arxiv.org/pdf/2311.07911
- Open access: Yes
- OA status: green
- OA URL: https://arxiv.org/pdf/2311.07911
- Concepts: Verifiable secret sharing, Computer science, Benchmark (surveying), Set (abstract data type), Code (set theory), Natural language processing, Artificial intelligence, Programming language, Geography, Geodesy
- Cited by: 22
- Citations by year (recent): 2025: 14, 2024: 8
- Related works (count): 10
Full payload
| id | https://openalex.org/W4388718089 |
|---|---|
| doi | https://doi.org/10.48550/arxiv.2311.07911 |
| ids.doi | https://doi.org/10.48550/arxiv.2311.07911 |
| ids.openalex | https://openalex.org/W4388718089 |
| fwci | |
| type | preprint |
| title | Instruction-Following Evaluation for Large Language Models |
| biblio.issue | |
| biblio.volume | |
| biblio.last_page | |
| biblio.first_page | |
| topics[0].id | https://openalex.org/T10028 |
| topics[0].field.id | https://openalex.org/fields/17 |
| topics[0].field.display_name | Computer Science |
| topics[0].score | 0.9896000027656555 |
| topics[0].domain.id | https://openalex.org/domains/3 |
| topics[0].domain.display_name | Physical Sciences |
| topics[0].subfield.id | https://openalex.org/subfields/1702 |
| topics[0].subfield.display_name | Artificial Intelligence |
| topics[0].display_name | Topic Modeling |
| topics[1].id | https://openalex.org/T10181 |
| topics[1].field.id | https://openalex.org/fields/17 |
| topics[1].field.display_name | Computer Science |
| topics[1].score | 0.9684000015258789 |
| topics[1].domain.id | https://openalex.org/domains/3 |
| topics[1].domain.display_name | Physical Sciences |
| topics[1].subfield.id | https://openalex.org/subfields/1702 |
| topics[1].subfield.display_name | Artificial Intelligence |
| topics[1].display_name | Natural Language Processing Techniques |
| topics[2].id | https://openalex.org/T13629 |
| topics[2].field.id | https://openalex.org/fields/17 |
| topics[2].field.display_name | Computer Science |
| topics[2].score | 0.9006999731063843 |
| topics[2].domain.id | https://openalex.org/domains/3 |
| topics[2].domain.display_name | Physical Sciences |
| topics[2].subfield.id | https://openalex.org/subfields/1702 |
| topics[2].subfield.display_name | Artificial Intelligence |
| topics[2].display_name | Text Readability and Simplification |
| is_xpac | False |
| apc_list | |
| apc_paid | |
| concepts[0].id | https://openalex.org/C85847156 |
| concepts[0].level | 3 |
| concepts[0].score | 0.9251422882080078 |
| concepts[0].wikidata | https://www.wikidata.org/wiki/Q59015987 |
| concepts[0].display_name | Verifiable secret sharing |
| concepts[1].id | https://openalex.org/C41008148 |
| concepts[1].level | 0 |
| concepts[1].score | 0.8390198945999146 |
| concepts[1].wikidata | https://www.wikidata.org/wiki/Q21198 |
| concepts[1].display_name | Computer science |
| concepts[2].id | https://openalex.org/C185798385 |
| concepts[2].level | 2 |
| concepts[2].score | 0.6230019927024841 |
| concepts[2].wikidata | https://www.wikidata.org/wiki/Q1161707 |
| concepts[2].display_name | Benchmark (surveying) |
| concepts[3].id | https://openalex.org/C177264268 |
| concepts[3].level | 2 |
| concepts[3].score | 0.6024664044380188 |
| concepts[3].wikidata | https://www.wikidata.org/wiki/Q1514741 |
| concepts[3].display_name | Set (abstract data type) |
| concepts[4].id | https://openalex.org/C2776760102 |
| concepts[4].level | 3 |
| concepts[4].score | 0.583986222743988 |
| concepts[4].wikidata | https://www.wikidata.org/wiki/Q5139990 |
| concepts[4].display_name | Code (set theory) |
| concepts[5].id | https://openalex.org/C204321447 |
| concepts[5].level | 1 |
| concepts[5].score | 0.39894920587539673 |
| concepts[5].wikidata | https://www.wikidata.org/wiki/Q30642 |
| concepts[5].display_name | Natural language processing |
| concepts[6].id | https://openalex.org/C154945302 |
| concepts[6].level | 1 |
| concepts[6].score | 0.39167314767837524 |
| concepts[6].wikidata | https://www.wikidata.org/wiki/Q11660 |
| concepts[6].display_name | Artificial intelligence |
| concepts[7].id | https://openalex.org/C199360897 |
| concepts[7].level | 1 |
| concepts[7].score | 0.3708083927631378 |
| concepts[7].wikidata | https://www.wikidata.org/wiki/Q9143 |
| concepts[7].display_name | Programming language |
| concepts[8].id | https://openalex.org/C205649164 |
| concepts[8].level | 0 |
| concepts[8].score | 0.0 |
| concepts[8].wikidata | https://www.wikidata.org/wiki/Q1071 |
| concepts[8].display_name | Geography |
| concepts[9].id | https://openalex.org/C13280743 |
| concepts[9].level | 1 |
| concepts[9].score | 0.0 |
| concepts[9].wikidata | https://www.wikidata.org/wiki/Q131089 |
| concepts[9].display_name | Geodesy |
| keywords[0].id | https://openalex.org/keywords/verifiable-secret-sharing |
| keywords[0].score | 0.9251422882080078 |
| keywords[0].display_name | Verifiable secret sharing |
| keywords[1].id | https://openalex.org/keywords/computer-science |
| keywords[1].score | 0.8390198945999146 |
| keywords[1].display_name | Computer science |
| keywords[2].id | https://openalex.org/keywords/benchmark |
| keywords[2].score | 0.6230019927024841 |
| keywords[2].display_name | Benchmark (surveying) |
| keywords[3].id | https://openalex.org/keywords/set |
| keywords[3].score | 0.6024664044380188 |
| keywords[3].display_name | Set (abstract data type) |
| keywords[4].id | https://openalex.org/keywords/code |
| keywords[4].score | 0.583986222743988 |
| keywords[4].display_name | Code (set theory) |
| keywords[5].id | https://openalex.org/keywords/natural-language-processing |
| keywords[5].score | 0.39894920587539673 |
| keywords[5].display_name | Natural language processing |
| keywords[6].id | https://openalex.org/keywords/artificial-intelligence |
| keywords[6].score | 0.39167314767837524 |
| keywords[6].display_name | Artificial intelligence |
| keywords[7].id | https://openalex.org/keywords/programming-language |
| keywords[7].score | 0.3708083927631378 |
| keywords[7].display_name | Programming language |
| language | en |
| locations[0].id | pmh:oai:arXiv.org:2311.07911 |
| locations[0].is_oa | True |
| locations[0].source.id | https://openalex.org/S4306400194 |
| locations[0].source.issn | |
| locations[0].source.type | repository |
| locations[0].source.is_oa | True |
| locations[0].source.issn_l | |
| locations[0].source.is_core | False |
| locations[0].source.is_in_doaj | False |
| locations[0].source.display_name | arXiv (Cornell University) |
| locations[0].source.host_organization | https://openalex.org/I205783295 |
| locations[0].source.host_organization_name | Cornell University |
| locations[0].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[0].license | |
| locations[0].pdf_url | https://arxiv.org/pdf/2311.07911 |
| locations[0].version | submittedVersion |
| locations[0].raw_type | text |
| locations[0].license_id | |
| locations[0].is_accepted | False |
| locations[0].is_published | False |
| locations[0].raw_source_name | |
| locations[0].landing_page_url | http://arxiv.org/abs/2311.07911 |
| locations[1].id | doi:10.48550/arxiv.2311.07911 |
| locations[1].is_oa | True |
| locations[1].source.id | https://openalex.org/S4306400194 |
| locations[1].source.issn | |
| locations[1].source.type | repository |
| locations[1].source.is_oa | True |
| locations[1].source.issn_l | |
| locations[1].source.is_core | False |
| locations[1].source.is_in_doaj | False |
| locations[1].source.display_name | arXiv (Cornell University) |
| locations[1].source.host_organization | https://openalex.org/I205783295 |
| locations[1].source.host_organization_name | Cornell University |
| locations[1].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[1].license | cc-by |
| locations[1].pdf_url | |
| locations[1].version | |
| locations[1].raw_type | article |
| locations[1].license_id | https://openalex.org/licenses/cc-by |
| locations[1].is_accepted | False |
| locations[1].is_published | |
| locations[1].raw_source_name | |
| locations[1].landing_page_url | https://doi.org/10.48550/arxiv.2311.07911 |
| indexed_in | arxiv, datacite |
| authorships[0].author.id | https://openalex.org/A5090567273 |
| authorships[0].author.orcid | |
| authorships[0].author.display_name | Jeffrey Zhou |
| authorships[0].author_position | first |
| authorships[0].raw_author_name | Zhou, Jeffrey |
| authorships[0].is_corresponding | False |
| authorships[1].author.id | https://openalex.org/A5101267760 |
| authorships[1].author.orcid | |
| authorships[1].author.display_name | Tianjian Lu |
| authorships[1].author_position | middle |
| authorships[1].raw_author_name | Lu, Tianjian |
| authorships[1].is_corresponding | False |
| authorships[2].author.id | https://openalex.org/A5063722751 |
| authorships[2].author.orcid | https://orcid.org/0009-0001-6413-7001 |
| authorships[2].author.display_name | Swaroop Mishra |
| authorships[2].author_position | middle |
| authorships[2].raw_author_name | Mishra, Swaroop |
| authorships[2].is_corresponding | False |
| authorships[3].author.id | https://openalex.org/A5047714914 |
| authorships[3].author.orcid | |
| authorships[3].author.display_name | Siddhartha Brahma |
| authorships[3].author_position | middle |
| authorships[3].raw_author_name | Brahma, Siddhartha |
| authorships[3].is_corresponding | False |
| authorships[4].author.id | https://openalex.org/A5108520926 |
| authorships[4].author.orcid | |
| authorships[4].author.display_name | Sujoy Basu |
| authorships[4].author_position | middle |
| authorships[4].raw_author_name | Basu, Sujoy |
| authorships[4].is_corresponding | False |
| authorships[5].author.id | https://openalex.org/A5009136371 |
| authorships[5].author.orcid | https://orcid.org/0000-0002-5914-9622 |
| authorships[5].author.display_name | Yi Luan |
| authorships[5].author_position | middle |
| authorships[5].raw_author_name | Luan, Yi |
| authorships[5].is_corresponding | False |
| authorships[6].author.id | https://openalex.org/A5061512999 |
| authorships[6].author.orcid | |
| authorships[6].author.display_name | Denny Zhou |
| authorships[6].author_position | middle |
| authorships[6].raw_author_name | Zhou, Denny |
| authorships[6].is_corresponding | False |
| authorships[7].author.id | https://openalex.org/A5101315776 |
| authorships[7].author.orcid | |
| authorships[7].author.display_name | Le Hou |
| authorships[7].author_position | last |
| authorships[7].raw_author_name | Hou, Le |
| authorships[7].is_corresponding | False |
| has_content.pdf | False |
| has_content.grobid_xml | False |
| is_paratext | False |
| open_access.is_oa | True |
| open_access.oa_url | https://arxiv.org/pdf/2311.07911 |
| open_access.oa_status | green |
| open_access.any_repository_has_fulltext | False |
| created_date | 2025-10-10T00:00:00 |
| display_name | Instruction-Following Evaluation for Large Language Models |
| has_fulltext | False |
| is_retracted | False |
| updated_date | 2025-11-06T06:51:31.235846 |
| primary_topic.id | https://openalex.org/T10028 |
| primary_topic.field.id | https://openalex.org/fields/17 |
| primary_topic.field.display_name | Computer Science |
| primary_topic.score | 0.9896000027656555 |
| primary_topic.domain.id | https://openalex.org/domains/3 |
| primary_topic.domain.display_name | Physical Sciences |
| primary_topic.subfield.id | https://openalex.org/subfields/1702 |
| primary_topic.subfield.display_name | Artificial Intelligence |
| primary_topic.display_name | Topic Modeling |
| related_works | https://openalex.org/W2355730523, https://openalex.org/W152021879, https://openalex.org/W2072918937, https://openalex.org/W2365629437, https://openalex.org/W2023935927, https://openalex.org/W2348330439, https://openalex.org/W2350372928, https://openalex.org/W2377292126, https://openalex.org/W2128900334, https://openalex.org/W2139109546 |
| cited_by_count | 22 |
| counts_by_year[0].year | 2025 |
| counts_by_year[0].cited_by_count | 14 |
| counts_by_year[1].year | 2024 |
| counts_by_year[1].cited_by_count | 8 |
| locations_count | 2 |
| best_oa_location.id | pmh:oai:arXiv.org:2311.07911 |
| best_oa_location.is_oa | True |
| best_oa_location.source.id | https://openalex.org/S4306400194 |
| best_oa_location.source.issn | |
| best_oa_location.source.type | repository |
| best_oa_location.source.is_oa | True |
| best_oa_location.source.issn_l | |
| best_oa_location.source.is_core | False |
| best_oa_location.source.is_in_doaj | False |
| best_oa_location.source.display_name | arXiv (Cornell University) |
| best_oa_location.source.host_organization | https://openalex.org/I205783295 |
| best_oa_location.source.host_organization_name | Cornell University |
| best_oa_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| best_oa_location.license | |
| best_oa_location.pdf_url | https://arxiv.org/pdf/2311.07911 |
| best_oa_location.version | submittedVersion |
| best_oa_location.raw_type | text |
| best_oa_location.license_id | |
| best_oa_location.is_accepted | False |
| best_oa_location.is_published | False |
| best_oa_location.raw_source_name | |
| best_oa_location.landing_page_url | http://arxiv.org/abs/2311.07911 |
| primary_location.id | pmh:oai:arXiv.org:2311.07911 |
| primary_location.is_oa | True |
| primary_location.source.id | https://openalex.org/S4306400194 |
| primary_location.source.issn | |
| primary_location.source.type | repository |
| primary_location.source.is_oa | True |
| primary_location.source.issn_l | |
| primary_location.source.is_core | False |
| primary_location.source.is_in_doaj | False |
| primary_location.source.display_name | arXiv (Cornell University) |
| primary_location.source.host_organization | https://openalex.org/I205783295 |
| primary_location.source.host_organization_name | Cornell University |
| primary_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| primary_location.license | |
| primary_location.pdf_url | https://arxiv.org/pdf/2311.07911 |
| primary_location.version | submittedVersion |
| primary_location.raw_type | text |
| primary_location.license_id | |
| primary_location.is_accepted | False |
| primary_location.is_published | False |
| primary_location.raw_source_name | |
| primary_location.landing_page_url | http://arxiv.org/abs/2311.07911 |
| publication_date | 2023-11-14 |
| publication_year | 2023 |
| referenced_works_count | 0 |
| abstract_inverted_index.3 | 92 |
| abstract_inverted_index.a | 62, 71 |
| abstract_inverted_index.25 | 96 |
| abstract_inverted_index.AI | 89 |
| abstract_inverted_index.It | 68 |
| abstract_inverted_index.To | 47 |
| abstract_inverted_index.We | 94, 116 |
| abstract_inverted_index.as | 77 |
| abstract_inverted_index.at | 90, 135 |
| abstract_inverted_index.be | 133 |
| abstract_inverted_index.by | 40 |
| abstract_inverted_index.in | 79 |
| abstract_inverted_index.is | 8, 20, 35, 61 |
| abstract_inverted_index.of | 3, 17, 43, 73, 88, 98, 120 |
| abstract_inverted_index.on | 70, 125 |
| abstract_inverted_index.or | 38, 112 |
| abstract_inverted_index.to | 9 |
| abstract_inverted_index.we | 51 |
| abstract_inverted_index.400 | 82 |
| abstract_inverted_index.500 | 105 |
| abstract_inverted_index.One | 0 |
| abstract_inverted_index.Our | 128 |
| abstract_inverted_index.and | 28, 64, 84, 102, 130 |
| abstract_inverted_index.are | 25 |
| abstract_inverted_index.can | 132 |
| abstract_inverted_index.for | 56 |
| abstract_inverted_index.not | 21, 29 |
| abstract_inverted_index.one | 111 |
| abstract_inverted_index.set | 72 |
| abstract_inverted_index.the | 15, 41, 44, 86, 126 |
| abstract_inverted_index.two | 121 |
| abstract_inverted_index.Eval | 54 |
| abstract_inverted_index.LLM. | 46 |
| abstract_inverted_index.LLMs | 124 |
| abstract_inverted_index.code | 129 |
| abstract_inverted_index.core | 1 |
| abstract_inverted_index.data | 131 |
| abstract_inverted_index.each | 108 |
| abstract_inverted_index.more | 80, 113 |
| abstract_inverted_index.show | 117 |
| abstract_inverted_index.such | 18, 76 |
| abstract_inverted_index.than | 81 |
| abstract_inverted_index.with | 107 |
| abstract_inverted_index.Human | 23 |
| abstract_inverted_index.Large | 4 |
| abstract_inverted_index.found | 134 |
| abstract_inverted_index.large | 57 |
| abstract_inverted_index.least | 91 |
| abstract_inverted_index.slow, | 27 |
| abstract_inverted_index.these | 49 |
| abstract_inverted_index.those | 99 |
| abstract_inverted_index.types | 97 |
| abstract_inverted_index.while | 32 |
| abstract_inverted_index."write | 78 |
| abstract_inverted_index.(LLMs) | 7 |
| abstract_inverted_index.IFEval | 60 |
| abstract_inverted_index.Models | 6 |
| abstract_inverted_index.around | 104 |
| abstract_inverted_index.biased | 37 |
| abstract_inverted_index.follow | 10 |
| abstract_inverted_index.prompt | 109 |
| abstract_inverted_index.widely | 122 |
| abstract_inverted_index.words" | 83 |
| abstract_inverted_index.ability | 42 |
| abstract_inverted_index.focuses | 69 |
| abstract_inverted_index.issues, | 50 |
| abstract_inverted_index.keyword | 87 |
| abstract_inverted_index.limited | 39 |
| abstract_inverted_index.market. | 127 |
| abstract_inverted_index.models. | 59 |
| abstract_inverted_index.natural | 11 |
| abstract_inverted_index.results | 119 |
| abstract_inverted_index.times". | 93 |
| abstract_inverted_index."mention | 85 |
| abstract_inverted_index.(IFEval) | 55 |
| abstract_inverted_index.However, | 14 |
| abstract_inverted_index.Language | 5 |
| abstract_inverted_index.language | 12, 58 |
| abstract_inverted_index.overcome | 48 |
| abstract_inverted_index.prompts, | 106 |
| abstract_inverted_index.LLM-based | 33 |
| abstract_inverted_index.abilities | 19 |
| abstract_inverted_index.available | 123 |
| abstract_inverted_index.evaluator | 45 |
| abstract_inverted_index.introduce | 52 |
| abstract_inverted_index.benchmark. | 67 |
| abstract_inverted_index.capability | 2 |
| abstract_inverted_index.containing | 110 |
| abstract_inverted_index.evaluation | 16, 66, 118 |
| abstract_inverted_index.expensive, | 26 |
| abstract_inverted_index.identified | 95 |
| abstract_inverted_index.verifiable | 100, 114 |
| abstract_inverted_index."verifiable | 74 |
| abstract_inverted_index.constructed | 103 |
| abstract_inverted_index.evaluations | 24 |
| abstract_inverted_index.objectively | 30 |
| abstract_inverted_index.potentially | 36 |
| abstract_inverted_index.instructions | 101 |
| abstract_inverted_index.instructions" | 75 |
| abstract_inverted_index.instructions. | 13, 115 |
| abstract_inverted_index.reproducible, | 31 |
| abstract_inverted_index.standardized: | 22 |
| abstract_inverted_index.auto-evaluation | 34 |
| abstract_inverted_index.straightforward | 63 |
| abstract_inverted_index.easy-to-reproduce | 65 |
| abstract_inverted_index.Instruction-Following | 53 |
| abstract_inverted_index.https://github.com/google-research/google-research/tree/master/instruction_following_eval | 136 |
| cited_by_percentile_year | |
| countries_distinct_count | 0 |
| institutions_distinct_count | 8 |
| sustainable_development_goals[0].id | https://metadata.un.org/sdg/4 |
| sustainable_development_goals[0].score | 0.8299999833106995 |
| sustainable_development_goals[0].display_name | Quality Education |
| citation_normalized_percentile | |
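
The `abstract_inverted_index.*` rows above store the abstract as a map from each token to the word positions where it occurs. As a hedged illustration, the sketch below rebuilds running text from such a map. It assumes the payload has been parsed into a Python dict of token to list of integer positions. The function name `rebuild_abstract` is hypothetical, and the sample covers only the first eleven positions taken from the table.

```python
from typing import Dict, List


def rebuild_abstract(inverted_index: Dict[str, List[int]]) -> str:
    """Rebuild text from a token -> positions map by placing each token
    at every position it occupies, then joining in position order."""
    by_position: Dict[int, str] = {}
    for token, positions in inverted_index.items():
        for pos in positions:
            by_position[pos] = token
    return " ".join(by_position[pos] for pos in sorted(by_position))


if __name__ == "__main__":
    # Positions 0-10 as listed in the payload table (later positions omitted).
    sample = {
        "One": [0], "core": [1], "capability": [2], "of": [3],
        "Large": [4], "Language": [5], "Models": [6], "(LLMs)": [7],
        "is": [8], "to": [9], "follow": [10],
    }
    print(rebuild_abstract(sample))
    # prints: One core capability of Large Language Models (LLMs) is to follow
```

Positions missing from the input are simply skipped, so a partial index yields the tokens it does contain in their original order.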