Helping or Herding? Reward Model Ensembles Mitigate but do not Eliminate Reward Hacking
2023 · Open Access · DOI: https://doi.org/10.48550/arxiv.2312.09244
Reward models play a key role in aligning language model applications towards human preferences. However, this setup creates an incentive for the language model to exploit errors in the reward model to achieve high estimated reward, a phenomenon often termed *reward hacking*. A natural mitigation is to train an ensemble of reward models, aggregating over model outputs to obtain a more robust reward estimate. We explore the application of reward ensembles to alignment at both training time (through reinforcement learning) and inference time (through reranking). First, we show that reward models are *underspecified*: reward models that perform similarly in-distribution can yield very different rewards when used in alignment, due to distribution shift. Second, underspecification results in overoptimization, where alignment to one reward model does not improve reward as measured by another reward model trained on the same data. Third, overoptimization is mitigated by the use of reward ensembles, and ensembles that vary by their *pretraining* seeds lead to better generalization than ensembles that differ only by their *fine-tuning* seeds, with both outperforming individual reward models. However, even pretrain reward ensembles do not eliminate reward hacking: we show several qualitative reward hacking phenomena that are not mitigated by ensembling because all reward models in the ensemble exhibit similar error patterns.
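The abstract describes two mechanisms: aggregating the outputs of several reward models into one, more robust reward estimate, and using that estimate for inference-time alignment by reranking. The Python sketch below illustrates both under stated assumptions: each reward model is treated as a plain function from a (prompt, response) pair to a float, and the aggregation rules (mean, median, min, and mean minus a standard-deviation penalty) are common choices, not necessarily the exact ones the paper evaluates. The names `ensemble_reward`, `best_of_n`, `penalty`, and the toy models `rm_a`, `rm_b`, `rm_c` are hypothetical.

```python
import statistics
from typing import Callable, Sequence

# Hypothetical interface: a reward model is any function (prompt, response) -> float.
RewardModel = Callable[[str, str], float]


def ensemble_reward(prompt: str, response: str, models: Sequence[RewardModel],
                    method: str = "mean", penalty: float = 1.0) -> float:
    """Aggregate the scores of several reward models into a single reward."""
    scores = [model(prompt, response) for model in models]
    if method == "mean":
        return statistics.fmean(scores)
    if method == "median":
        return statistics.median(scores)
    if method == "min":
        # Pessimistic: a response is only as good as its worst score.
        return min(scores)
    if method == "mean_minus_std":
        # Penalize responses on which the ensemble members disagree.
        return statistics.fmean(scores) - penalty * statistics.pstdev(scores)
    raise ValueError(f"unknown aggregation method: {method!r}")


def best_of_n(prompt: str, candidates: Sequence[str], models: Sequence[RewardModel],
              method: str = "mean") -> str:
    """Reranking: return the candidate with the highest aggregated reward."""
    return max(candidates, key=lambda c: ensemble_reward(prompt, c, models, method))


# Toy reward models (hypothetical). All three share a bias toward longer
# responses; rm_a and rm_b each add an idiosyncratic quirk of their own.
def rm_a(prompt: str, response: str) -> float:
    return 0.1 * len(response.split()) + (1.0 if "thanks" in response else 0.0)


def rm_b(prompt: str, response: str) -> float:
    return 0.1 * len(response.split()) + (1.0 if "!" in response else 0.0)


def rm_c(prompt: str, response: str) -> float:
    return 0.1 * len(response.split())


prompt = "What is 2 + 2?"
candidates = ["thanks", "The answer is 4.", "the the the the the the the the"]

print(best_of_n(prompt, candidates, [rm_a]))              # -> thanks
print(best_of_n(prompt, candidates, [rm_a, rm_b, rm_c]))  # -> the padded answer
```

In the toy example, a single reward model is fooled by its own quirk (it favors the word "thanks"), and the ensemble overrides it because the other members disagree. All three members, however, share a bias toward longer responses, so the padded answer wins under every aggregation rule in the sketch. This mirrors, in miniature, the abstract's point that ensembling does not help when every member makes the same error.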
- Type: preprint
- Language: en
- Landing page: http://arxiv.org/abs/2312.09244 (PDF: https://arxiv.org/pdf/2312.09244)
- OA status: green
- Cited by: 2
- Related works: 10
- OpenAlex ID: https://openalex.org/W4389821477
Raw OpenAlex JSON
- DOI: https://doi.org/10.48550/arxiv.2312.09244
- Title: Helping or Herding? Reward Model Ensembles Mitigate but do not Eliminate Reward Hacking
- Publication year: 2023
- Publication date: 2023-12-14
- Authors (in order): Jacob Eisenstein, Chirag Nagpal, Alekh Agarwal, Ahmad Beirami, Alex D’Amour, Dj Dvijotham, Adam Fisch, Katherine Heller, Stephen Pfohl, Deepak Ramachandran, Peter J. Shaw, Jonathan Berant
- PDF URL: https://arxiv.org/pdf/2312.09244
- Open access: Yes (OA URL: https://arxiv.org/pdf/2312.09244)
- Concepts (top OpenAlex concepts): Computer science, Inference, Generalization, Artificial intelligence, Reinforcement learning, Herding, Exploit, Machine learning, Mathematics, Forestry, Geography, Computer security, Mathematical analysis
- Citations by year (recent): 2025: 1, 2024: 1
Full payload
| Field | Value |
|---|---|
| id | https://openalex.org/W4389821477 |
| doi | https://doi.org/10.48550/arxiv.2312.09244 |
| ids.doi | https://doi.org/10.48550/arxiv.2312.09244 |
| ids.openalex | https://openalex.org/W4389821477 |
| fwci | |
| type | preprint |
| title | Helping or Herding? Reward Model Ensembles Mitigate but do not Eliminate Reward Hacking |
| biblio.issue | |
| biblio.volume | |
| biblio.last_page | |
| biblio.first_page | |
| topics[0].id | https://openalex.org/T10028 |
| topics[0].field.id | https://openalex.org/fields/17 |
| topics[0].field.display_name | Computer Science |
| topics[0].score | 0.9986000061035156 |
| topics[0].domain.id | https://openalex.org/domains/3 |
| topics[0].domain.display_name | Physical Sciences |
| topics[0].subfield.id | https://openalex.org/subfields/1702 |
| topics[0].subfield.display_name | Artificial Intelligence |
| topics[0].display_name | Topic Modeling |
| topics[1].id | https://openalex.org/T10181 |
| topics[1].field.id | https://openalex.org/fields/17 |
| topics[1].field.display_name | Computer Science |
| topics[1].score | 0.9541000127792358 |
| topics[1].domain.id | https://openalex.org/domains/3 |
| topics[1].domain.display_name | Physical Sciences |
| topics[1].subfield.id | https://openalex.org/subfields/1702 |
| topics[1].subfield.display_name | Artificial Intelligence |
| topics[1].display_name | Natural Language Processing Techniques |
| topics[2].id | https://openalex.org/T12026 |
| topics[2].field.id | https://openalex.org/fields/17 |
| topics[2].field.display_name | Computer Science |
| topics[2].score | 0.9438999891281128 |
| topics[2].domain.id | https://openalex.org/domains/3 |
| topics[2].domain.display_name | Physical Sciences |
| topics[2].subfield.id | https://openalex.org/subfields/1702 |
| topics[2].subfield.display_name | Artificial Intelligence |
| topics[2].display_name | Explainable Artificial Intelligence (XAI) |
| is_xpac | False |
| apc_list | |
| apc_paid | |
| concepts[0].id | https://openalex.org/C41008148 |
| concepts[0].level | 0 |
| concepts[0].score | 0.6649572849273682 |
| concepts[0].wikidata | https://www.wikidata.org/wiki/Q21198 |
| concepts[0].display_name | Computer science |
| concepts[1].id | https://openalex.org/C2776214188 |
| concepts[1].level | 2 |
| concepts[1].score | 0.5447150468826294 |
| concepts[1].wikidata | https://www.wikidata.org/wiki/Q408386 |
| concepts[1].display_name | Inference |
| concepts[2].id | https://openalex.org/C177148314 |
| concepts[2].level | 2 |
| concepts[2].score | 0.5188409090042114 |
| concepts[2].wikidata | https://www.wikidata.org/wiki/Q170084 |
| concepts[2].display_name | Generalization |
| concepts[3].id | https://openalex.org/C154945302 |
| concepts[3].level | 1 |
| concepts[3].score | 0.5015733242034912 |
| concepts[3].wikidata | https://www.wikidata.org/wiki/Q11660 |
| concepts[3].display_name | Artificial intelligence |
| concepts[4].id | https://openalex.org/C97541855 |
| concepts[4].level | 2 |
| concepts[4].score | 0.4984395503997803 |
| concepts[4].wikidata | https://www.wikidata.org/wiki/Q830687 |
| concepts[4].display_name | Reinforcement learning |
| concepts[5].id | https://openalex.org/C177605951 |
| concepts[5].level | 2 |
| concepts[5].score | 0.4850465655326843 |
| concepts[5].wikidata | https://www.wikidata.org/wiki/Q1484503 |
| concepts[5].display_name | Herding |
| concepts[6].id | https://openalex.org/C165696696 |
| concepts[6].level | 2 |
| concepts[6].score | 0.4288359582424164 |
| concepts[6].wikidata | https://www.wikidata.org/wiki/Q11287 |
| concepts[6].display_name | Exploit |
| concepts[7].id | https://openalex.org/C119857082 |
| concepts[7].level | 1 |
| concepts[7].score | 0.3768616318702698 |
| concepts[7].wikidata | https://www.wikidata.org/wiki/Q2539 |
| concepts[7].display_name | Machine learning |
| concepts[8].id | https://openalex.org/C33923547 |
| concepts[8].level | 0 |
| concepts[8].score | 0.0916302502155304 |
| concepts[8].wikidata | https://www.wikidata.org/wiki/Q395 |
| concepts[8].display_name | Mathematics |
| concepts[9].id | https://openalex.org/C97137747 |
| concepts[9].level | 1 |
| concepts[9].score | 0.0 |
| concepts[9].wikidata | https://www.wikidata.org/wiki/Q38112 |
| concepts[9].display_name | Forestry |
| concepts[10].id | https://openalex.org/C205649164 |
| concepts[10].level | 0 |
| concepts[10].score | 0.0 |
| concepts[10].wikidata | https://www.wikidata.org/wiki/Q1071 |
| concepts[10].display_name | Geography |
| concepts[11].id | https://openalex.org/C38652104 |
| concepts[11].level | 1 |
| concepts[11].score | 0.0 |
| concepts[11].wikidata | https://www.wikidata.org/wiki/Q3510521 |
| concepts[11].display_name | Computer security |
| concepts[12].id | https://openalex.org/C134306372 |
| concepts[12].level | 1 |
| concepts[12].score | 0.0 |
| concepts[12].wikidata | https://www.wikidata.org/wiki/Q7754 |
| concepts[12].display_name | Mathematical analysis |
| keywords[0].id | https://openalex.org/keywords/computer-science |
| keywords[0].score | 0.6649572849273682 |
| keywords[0].display_name | Computer science |
| keywords[1].id | https://openalex.org/keywords/inference |
| keywords[1].score | 0.5447150468826294 |
| keywords[1].display_name | Inference |
| keywords[2].id | https://openalex.org/keywords/generalization |
| keywords[2].score | 0.5188409090042114 |
| keywords[2].display_name | Generalization |
| keywords[3].id | https://openalex.org/keywords/artificial-intelligence |
| keywords[3].score | 0.5015733242034912 |
| keywords[3].display_name | Artificial intelligence |
| keywords[4].id | https://openalex.org/keywords/reinforcement-learning |
| keywords[4].score | 0.4984395503997803 |
| keywords[4].display_name | Reinforcement learning |
| keywords[5].id | https://openalex.org/keywords/herding |
| keywords[5].score | 0.4850465655326843 |
| keywords[5].display_name | Herding |
| keywords[6].id | https://openalex.org/keywords/exploit |
| keywords[6].score | 0.4288359582424164 |
| keywords[6].display_name | Exploit |
| keywords[7].id | https://openalex.org/keywords/machine-learning |
| keywords[7].score | 0.3768616318702698 |
| keywords[7].display_name | Machine learning |
| keywords[8].id | https://openalex.org/keywords/mathematics |
| keywords[8].score | 0.0916302502155304 |
| keywords[8].display_name | Mathematics |
| language | en |
| locations[0].id | pmh:oai:arXiv.org:2312.09244 |
| locations[0].is_oa | True |
| locations[0].source.id | https://openalex.org/S4306400194 |
| locations[0].source.issn | |
| locations[0].source.type | repository |
| locations[0].source.is_oa | True |
| locations[0].source.issn_l | |
| locations[0].source.is_core | False |
| locations[0].source.is_in_doaj | False |
| locations[0].source.display_name | arXiv (Cornell University) |
| locations[0].source.host_organization | https://openalex.org/I205783295 |
| locations[0].source.host_organization_name | Cornell University |
| locations[0].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[0].license | |
| locations[0].pdf_url | https://arxiv.org/pdf/2312.09244 |
| locations[0].version | submittedVersion |
| locations[0].raw_type | text |
| locations[0].license_id | |
| locations[0].is_accepted | False |
| locations[0].is_published | False |
| locations[0].raw_source_name | |
| locations[0].landing_page_url | http://arxiv.org/abs/2312.09244 |
| locations[1].id | doi:10.48550/arxiv.2312.09244 |
| locations[1].is_oa | True |
| locations[1].source.id | https://openalex.org/S4306400194 |
| locations[1].source.issn | |
| locations[1].source.type | repository |
| locations[1].source.is_oa | True |
| locations[1].source.issn_l | |
| locations[1].source.is_core | False |
| locations[1].source.is_in_doaj | False |
| locations[1].source.display_name | arXiv (Cornell University) |
| locations[1].source.host_organization | https://openalex.org/I205783295 |
| locations[1].source.host_organization_name | Cornell University |
| locations[1].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[1].license | |
| locations[1].pdf_url | |
| locations[1].version | |
| locations[1].raw_type | article |
| locations[1].license_id | |
| locations[1].is_accepted | False |
| locations[1].is_published | |
| locations[1].raw_source_name | |
| locations[1].landing_page_url | https://doi.org/10.48550/arxiv.2312.09244 |
| indexed_in | arxiv, datacite |
| authorships[0].author.id | https://openalex.org/A5047699861 |
| authorships[0].author.orcid | |
| authorships[0].author.display_name | Jacob Eisenstein |
| authorships[0].author_position | first |
| authorships[0].raw_author_name | Eisenstein, Jacob |
| authorships[0].is_corresponding | False |
| authorships[1].author.id | https://openalex.org/A5035844932 |
| authorships[1].author.orcid | https://orcid.org/0000-0003-2212-5392 |
| authorships[1].author.display_name | Chirag Nagpal |
| authorships[1].author_position | middle |
| authorships[1].raw_author_name | Nagpal, Chirag |
| authorships[1].is_corresponding | False |
| authorships[2].author.id | https://openalex.org/A5036435487 |
| authorships[2].author.orcid | https://orcid.org/0000-0001-7032-7162 |
| authorships[2].author.display_name | Alekh Agarwal |
| authorships[2].author_position | middle |
| authorships[2].raw_author_name | Agarwal, Alekh |
| authorships[2].is_corresponding | False |
| authorships[3].author.id | https://openalex.org/A5008645615 |
| authorships[3].author.orcid | https://orcid.org/0000-0002-1998-5271 |
| authorships[3].author.display_name | Ahmad Beirami |
| authorships[3].author_position | middle |
| authorships[3].raw_author_name | Beirami, Ahmad |
| authorships[3].is_corresponding | False |
| authorships[4].author.id | https://openalex.org/A5037673085 |
| authorships[4].author.orcid | |
| authorships[4].author.display_name | Alex D’Amour |
| authorships[4].author_position | middle |
| authorships[4].raw_author_name | D'Amour, Alex |
| authorships[4].is_corresponding | False |
| authorships[5].author.id | https://openalex.org/A5062272778 |
| authorships[5].author.orcid | |
| authorships[5].author.display_name | Dj Dvijotham |
| authorships[5].author_position | middle |
| authorships[5].raw_author_name | Dvijotham, DJ |
| authorships[5].is_corresponding | False |
| authorships[6].author.id | https://openalex.org/A5079422282 |
| authorships[6].author.orcid | |
| authorships[6].author.display_name | Adam Fisch |
| authorships[6].author_position | middle |
| authorships[6].raw_author_name | Fisch, Adam |
| authorships[6].is_corresponding | False |
| authorships[7].author.id | https://openalex.org/A5014018142 |
| authorships[7].author.orcid | https://orcid.org/0000-0002-4848-7466 |
| authorships[7].author.display_name | Katherine Heller |
| authorships[7].author_position | middle |
| authorships[7].raw_author_name | Heller, Katherine |
| authorships[7].is_corresponding | False |
| authorships[8].author.id | https://openalex.org/A5021812637 |
| authorships[8].author.orcid | https://orcid.org/0000-0003-0551-9664 |
| authorships[8].author.display_name | Stephen Pfohl |
| authorships[8].author_position | middle |
| authorships[8].raw_author_name | Pfohl, Stephen |
| authorships[8].is_corresponding | False |
| authorships[9].author.id | https://openalex.org/A5111238931 |
| authorships[9].author.orcid | https://orcid.org/0000-0001-5412-6133 |
| authorships[9].author.display_name | Deepak Ramachandran |
| authorships[9].author_position | middle |
| authorships[9].raw_author_name | Ramachandran, Deepak |
| authorships[9].is_corresponding | False |
| authorships[10].author.id | https://openalex.org/A5061827062 |
| authorships[10].author.orcid | https://orcid.org/0000-0003-0101-4482 |
| authorships[10].author.display_name | Peter J. Shaw |
| authorships[10].author_position | middle |
| authorships[10].raw_author_name | Shaw, Peter |
| authorships[10].is_corresponding | False |
| authorships[11].author.id | https://openalex.org/A5045872048 |
| authorships[11].author.orcid | |
| authorships[11].author.display_name | Jonathan Berant |
| authorships[11].author_position | last |
| authorships[11].raw_author_name | Berant, Jonathan |
| authorships[11].is_corresponding | False |
| has_content.pdf | False |
| has_content.grobid_xml | False |
| is_paratext | False |
| open_access.is_oa | True |
| open_access.oa_url | https://arxiv.org/pdf/2312.09244 |
| open_access.oa_status | green |
| open_access.any_repository_has_fulltext | False |
| created_date | 2023-12-16T00:00:00 |
| display_name | Helping or Herding? Reward Model Ensembles Mitigate but do not Eliminate Reward Hacking |
| has_fulltext | False |
| is_retracted | False |
| updated_date | 2025-11-06T06:51:31.235846 |
| primary_topic.id | https://openalex.org/T10028 |
| primary_topic.field.id | https://openalex.org/fields/17 |
| primary_topic.field.display_name | Computer Science |
| primary_topic.score | 0.9986000061035156 |
| primary_topic.domain.id | https://openalex.org/domains/3 |
| primary_topic.domain.display_name | Physical Sciences |
| primary_topic.subfield.id | https://openalex.org/subfields/1702 |
| primary_topic.subfield.display_name | Artificial Intelligence |
| primary_topic.display_name | Topic Modeling |
| related_works | https://openalex.org/W17155033, https://openalex.org/W3207760230, https://openalex.org/W2123552042, https://openalex.org/W1496222301, https://openalex.org/W3122774093, https://openalex.org/W1590307681, https://openalex.org/W2536018345, https://openalex.org/W2171525224, https://openalex.org/W4312814274, https://openalex.org/W4285370786 |
| cited_by_count | 2 |
| counts_by_year[0].year | 2025 |
| counts_by_year[0].cited_by_count | 1 |
| counts_by_year[1].year | 2024 |
| counts_by_year[1].cited_by_count | 1 |
| locations_count | 2 |
| best_oa_location.id | pmh:oai:arXiv.org:2312.09244 |
| best_oa_location.is_oa | True |
| best_oa_location.source.id | https://openalex.org/S4306400194 |
| best_oa_location.source.issn | |
| best_oa_location.source.type | repository |
| best_oa_location.source.is_oa | True |
| best_oa_location.source.issn_l | |
| best_oa_location.source.is_core | False |
| best_oa_location.source.is_in_doaj | False |
| best_oa_location.source.display_name | arXiv (Cornell University) |
| best_oa_location.source.host_organization | https://openalex.org/I205783295 |
| best_oa_location.source.host_organization_name | Cornell University |
| best_oa_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| best_oa_location.license | |
| best_oa_location.pdf_url | https://arxiv.org/pdf/2312.09244 |
| best_oa_location.version | submittedVersion |
| best_oa_location.raw_type | text |
| best_oa_location.license_id | |
| best_oa_location.is_accepted | False |
| best_oa_location.is_published | False |
| best_oa_location.raw_source_name | |
| best_oa_location.landing_page_url | http://arxiv.org/abs/2312.09244 |
| primary_location.id | pmh:oai:arXiv.org:2312.09244 |
| primary_location.is_oa | True |
| primary_location.source.id | https://openalex.org/S4306400194 |
| primary_location.source.issn | |
| primary_location.source.type | repository |
| primary_location.source.is_oa | True |
| primary_location.source.issn_l | |
| primary_location.source.is_core | False |
| primary_location.source.is_in_doaj | False |
| primary_location.source.display_name | arXiv (Cornell University) |
| primary_location.source.host_organization | https://openalex.org/I205783295 |
| primary_location.source.host_organization_name | Cornell University |
| primary_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| primary_location.license | |
| primary_location.pdf_url | https://arxiv.org/pdf/2312.09244 |
| primary_location.version | submittedVersion |
| primary_location.raw_type | text |
| primary_location.license_id | |
| primary_location.is_accepted | False |
| primary_location.is_published | False |
| primary_location.raw_source_name | |
| primary_location.landing_page_url | http://arxiv.org/abs/2312.09244 |
| publication_date | 2023-12-14 |
| publication_year | 2023 |
| referenced_works_count | 0 |
| cited_by_percentile_year | |
| countries_distinct_count | 0 |
| institutions_distinct_count | 12 |
| citation_normalized_percentile | |