Preference Poisoning Attacks on Reward Model Learning
2024 · Open Access · DOI: https://doi.org/10.48550/arxiv.2402.01920
Learning reward models from pairwise comparisons is a fundamental component in a number of domains, including autonomous control, conversational agents, and recommendation systems, as part of a broad goal of aligning automated decisions with user preferences. These approaches entail collecting preference information from people, with feedback often provided anonymously. Since preferences are subjective, there is no gold standard to compare against; yet the reliance of high-impact systems on preference learning creates a strong motivation for malicious actors to skew data collected in this fashion to their ends. We investigate the nature and extent of this vulnerability by considering an attacker who can flip a small subset of preference comparisons to either promote or demote a target outcome. We propose two classes of algorithmic approaches for these attacks: a gradient-based framework and several variants of rank-by-distance methods. Next, we evaluate the efficacy of the best attacks in both classes at achieving malicious goals on datasets from three domains: autonomous control, recommendation systems, and textual prompt-response preference learning. We find that the best attacks are often highly successful, achieving in the most extreme case a 100% success rate with only 0.3% of the data poisoned. However, which attack is best can vary significantly across domains. In addition, we observe that the simpler and more scalable rank-by-distance approaches are often competitive with, and on occasion significantly outperform, gradient-based methods. Finally, we show that state-of-the-art defenses against other classes of poisoning attacks exhibit limited efficacy in our setting.
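The abstract does not spell out the attack algorithms, so the following is only a rough illustration of the threat model it describes: an attacker flips a small budget of pairwise labels before a reward model is fit. Everything here is an assumption made for the sketch, not the authors' method: the linear reward under a Bradley-Terry likelihood, the synthetic data, and the hypothetical `flip_nearest_losers` heuristic, which flips the comparisons whose losing item lies closest to the target.

```python
import math
import random


def reward(w, x):
    """Linear reward r(x) = w . x (an assumed model class for this sketch)."""
    return sum(wi * xi for wi, xi in zip(w, x))


def fit_reward(pairs, labels, dim, steps=300, lr=0.5):
    """Fit w by gradient descent on the Bradley-Terry (logistic) loss.

    pairs:  list of (x_a, x_b) feature vectors
    labels: 1 if x_a was preferred over x_b, else 0
    """
    w = [0.0] * dim
    for _ in range(steps):
        grad = [0.0] * dim
        for (xa, xb), y in zip(pairs, labels):
            d = [a - b for a, b in zip(xa, xb)]
            p = 1.0 / (1.0 + math.exp(-reward(w, d)))  # P(x_a preferred)
            for k in range(dim):
                grad[k] += (p - y) * d[k]
        w = [wk - lr * gk / len(pairs) for wk, gk in zip(w, grad)]
    return w


def flip_nearest_losers(pairs, labels, target, budget):
    """Hypothetical distance-ranked heuristic for promoting `target`.

    Ranks comparisons by the distance between the losing item and the
    target, then flips the `budget` closest ones so that items resembling
    the target become winners. Returns (poisoned_labels, flipped_indices).
    """
    def loser_distance(i):
        xa, xb = pairs[i]
        loser = xb if labels[i] == 1 else xa
        return math.dist(loser, target)

    chosen = sorted(range(len(pairs)), key=loser_distance)[:budget]
    poisoned = list(labels)  # copy so the clean labels stay untouched
    for i in chosen:
        poisoned[i] = 1 - poisoned[i]
    return poisoned, chosen


# Synthetic demo: 200 comparisons labelled by an assumed ground-truth reward.
random.seed(0)
w_true = [1.0, -1.0]
pairs = [
    ([random.gauss(0, 1), random.gauss(0, 1)],
     [random.gauss(0, 1), random.gauss(0, 1)])
    for _ in range(200)
]
labels = [1 if reward(w_true, xa) > reward(w_true, xb) else 0 for xa, xb in pairs]

target = [-1.0, 1.0]  # an outcome the clean reward ranks low
poisoned, chosen = flip_nearest_losers(pairs, labels, target, budget=6)

w_clean = fit_reward(pairs, labels, dim=2)
w_poisoned = fit_reward(pairs, poisoned, dim=2)
print("target reward, clean model:   ", reward(w_clean, target))
print("target reward, poisoned model:", reward(w_poisoned, target))
```

Whether so small a budget moves the learned reward of the target depends on the data and the model class, which is consistent with the abstract's remark that the best attack varies significantly across domains.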
Related Topics
- Type: preprint
- Language: en
- Landing Page: http://arxiv.org/abs/2402.01920 (PDF: https://arxiv.org/pdf/2402.01920)
- OA Status: green
- Related Works: 10
- OpenAlex ID: https://openalex.org/W4391620947
Raw OpenAlex JSON
- OpenAlex ID: https://openalex.org/W4391620947 (canonical identifier for this work in OpenAlex)
- DOI: https://doi.org/10.48550/arxiv.2402.01920 (Digital Object Identifier)
- Title: Preference Poisoning Attacks on Reward Model Learning (work title)
- Type: preprint (OpenAlex work type)
- Language: en (primary language)
- Publication year: 2024 (year of publication)
- Publication date: 2024-02-02 (full publication date if available)
- Authors: Junlin Wu, Jiongxiao Wang, Chaowei Xiao, Chenguang Wang, Ning Zhang, Yevgeniy Vorobeychik (list of authors in order)
- Landing page: https://arxiv.org/abs/2402.01920 (publisher landing page)
- PDF URL: https://arxiv.org/pdf/2402.01920 (direct link to full text PDF)
- Open access: Yes (whether a free full text is available)
- OA status: green (open access status per OpenAlex)
- OA URL: https://arxiv.org/pdf/2402.01920 (direct OA link when available)
- Concepts: Preference, Psychology, Computer science, Economics, Microeconomics (top concepts, i.e. fields/topics, attached by OpenAlex)
- Cited by: 0 (total citation count in OpenAlex)
- Related works (count): 10 (other works algorithmically related by OpenAlex)
Full payload
| Field | Value |
|---|---|
| id | https://openalex.org/W4391620947 |
| doi | https://doi.org/10.48550/arxiv.2402.01920 |
| ids.doi | https://doi.org/10.48550/arxiv.2402.01920 |
| ids.openalex | https://openalex.org/W4391620947 |
| fwci | |
| type | preprint |
| title | Preference Poisoning Attacks on Reward Model Learning |
| biblio.issue | |
| biblio.volume | |
| biblio.last_page | |
| biblio.first_page | |
| topics[0].id | https://openalex.org/T11943 |
| topics[0].field.id | https://openalex.org/fields/30 |
| topics[0].field.display_name | Pharmacology, Toxicology and Pharmaceutics |
| topics[0].score | 0.6338000297546387 |
| topics[0].domain.id | https://openalex.org/domains/1 |
| topics[0].domain.display_name | Life Sciences |
| topics[0].subfield.id | https://openalex.org/subfields/3005 |
| topics[0].subfield.display_name | Toxicology |
| topics[0].display_name | Pharmacovigilance and Adverse Drug Reactions |
| is_xpac | False |
| apc_list | |
| apc_paid | |
| concepts[0].id | https://openalex.org/C2781249084 |
| concepts[0].level | 2 |
| concepts[0].score | 0.6613297462463379 |
| concepts[0].wikidata | https://www.wikidata.org/wiki/Q908656 |
| concepts[0].display_name | Preference |
| concepts[1].id | https://openalex.org/C15744967 |
| concepts[1].level | 0 |
| concepts[1].score | 0.43879976868629456 |
| concepts[1].wikidata | https://www.wikidata.org/wiki/Q9418 |
| concepts[1].display_name | Psychology |
| concepts[2].id | https://openalex.org/C41008148 |
| concepts[2].level | 0 |
| concepts[2].score | 0.3334618806838989 |
| concepts[2].wikidata | https://www.wikidata.org/wiki/Q21198 |
| concepts[2].display_name | Computer science |
| concepts[3].id | https://openalex.org/C162324750 |
| concepts[3].level | 0 |
| concepts[3].score | 0.2590886354446411 |
| concepts[3].wikidata | https://www.wikidata.org/wiki/Q8134 |
| concepts[3].display_name | Economics |
| concepts[4].id | https://openalex.org/C175444787 |
| concepts[4].level | 1 |
| concepts[4].score | 0.12850770354270935 |
| concepts[4].wikidata | https://www.wikidata.org/wiki/Q39072 |
| concepts[4].display_name | Microeconomics |
| keywords[0].id | https://openalex.org/keywords/preference |
| keywords[0].score | 0.6613297462463379 |
| keywords[0].display_name | Preference |
| keywords[1].id | https://openalex.org/keywords/psychology |
| keywords[1].score | 0.43879976868629456 |
| keywords[1].display_name | Psychology |
| keywords[2].id | https://openalex.org/keywords/computer-science |
| keywords[2].score | 0.3334618806838989 |
| keywords[2].display_name | Computer science |
| keywords[3].id | https://openalex.org/keywords/economics |
| keywords[3].score | 0.2590886354446411 |
| keywords[3].display_name | Economics |
| keywords[4].id | https://openalex.org/keywords/microeconomics |
| keywords[4].score | 0.12850770354270935 |
| keywords[4].display_name | Microeconomics |
| language | en |
| locations[0].id | pmh:oai:arXiv.org:2402.01920 |
| locations[0].is_oa | True |
| locations[0].source.id | https://openalex.org/S4306400194 |
| locations[0].source.issn | |
| locations[0].source.type | repository |
| locations[0].source.is_oa | True |
| locations[0].source.issn_l | |
| locations[0].source.is_core | False |
| locations[0].source.is_in_doaj | False |
| locations[0].source.display_name | arXiv (Cornell University) |
| locations[0].source.host_organization | https://openalex.org/I205783295 |
| locations[0].source.host_organization_name | Cornell University |
| locations[0].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[0].license | |
| locations[0].pdf_url | https://arxiv.org/pdf/2402.01920 |
| locations[0].version | submittedVersion |
| locations[0].raw_type | text |
| locations[0].license_id | |
| locations[0].is_accepted | False |
| locations[0].is_published | False |
| locations[0].raw_source_name | |
| locations[0].landing_page_url | http://arxiv.org/abs/2402.01920 |
| locations[1].id | doi:10.48550/arxiv.2402.01920 |
| locations[1].is_oa | True |
| locations[1].source.id | https://openalex.org/S4306400194 |
| locations[1].source.issn | |
| locations[1].source.type | repository |
| locations[1].source.is_oa | True |
| locations[1].source.issn_l | |
| locations[1].source.is_core | False |
| locations[1].source.is_in_doaj | False |
| locations[1].source.display_name | arXiv (Cornell University) |
| locations[1].source.host_organization | https://openalex.org/I205783295 |
| locations[1].source.host_organization_name | Cornell University |
| locations[1].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[1].license | |
| locations[1].pdf_url | |
| locations[1].version | |
| locations[1].raw_type | article |
| locations[1].license_id | |
| locations[1].is_accepted | False |
| locations[1].is_published | |
| locations[1].raw_source_name | |
| locations[1].landing_page_url | https://doi.org/10.48550/arxiv.2402.01920 |
| indexed_in | arxiv, datacite |
| authorships[0].author.id | https://openalex.org/A5056913693 |
| authorships[0].author.orcid | https://orcid.org/0009-0006-1037-1827 |
| authorships[0].author.display_name | Junlin Wu |
| authorships[0].author_position | first |
| authorships[0].raw_author_name | Wu, Junlin |
| authorships[0].is_corresponding | False |
| authorships[1].author.id | https://openalex.org/A5009277751 |
| authorships[1].author.orcid | |
| authorships[1].author.display_name | Jiongxiao Wang |
| authorships[1].author_position | middle |
| authorships[1].raw_author_name | Wang, Jiongxiao |
| authorships[1].is_corresponding | False |
| authorships[2].author.id | https://openalex.org/A5005843046 |
| authorships[2].author.orcid | https://orcid.org/0000-0002-7043-4926 |
| authorships[2].author.display_name | Chaowei Xiao |
| authorships[2].author_position | middle |
| authorships[2].raw_author_name | Xiao, Chaowei |
| authorships[2].is_corresponding | False |
| authorships[3].author.id | https://openalex.org/A5083438033 |
| authorships[3].author.orcid | https://orcid.org/0000-0001-7082-402X |
| authorships[3].author.display_name | Chenguang Wang |
| authorships[3].author_position | middle |
| authorships[3].raw_author_name | Wang, Chenguang |
| authorships[3].is_corresponding | False |
| authorships[4].author.id | https://openalex.org/A5100404886 |
| authorships[4].author.orcid | https://orcid.org/0000-0002-8781-4925 |
| authorships[4].author.display_name | Ning Zhang |
| authorships[4].author_position | middle |
| authorships[4].raw_author_name | Zhang, Ning |
| authorships[4].is_corresponding | False |
| authorships[5].author.id | https://openalex.org/A5038669899 |
| authorships[5].author.orcid | https://orcid.org/0000-0003-2471-5345 |
| authorships[5].author.display_name | Yevgeniy Vorobeychik |
| authorships[5].author_position | last |
| authorships[5].raw_author_name | Vorobeychik, Yevgeniy |
| authorships[5].is_corresponding | False |
| has_content.pdf | False |
| has_content.grobid_xml | False |
| is_paratext | False |
| open_access.is_oa | True |
| open_access.oa_url | https://arxiv.org/pdf/2402.01920 |
| open_access.oa_status | green |
| open_access.any_repository_has_fulltext | False |
| created_date | 2025-10-10T00:00:00 |
| display_name | Preference Poisoning Attacks on Reward Model Learning |
| has_fulltext | False |
| is_retracted | False |
| updated_date | 2025-11-06T06:51:31.235846 |
| primary_topic.id | https://openalex.org/T11943 |
| primary_topic.field.id | https://openalex.org/fields/30 |
| primary_topic.field.display_name | Pharmacology, Toxicology and Pharmaceutics |
| primary_topic.score | 0.6338000297546387 |
| primary_topic.domain.id | https://openalex.org/domains/1 |
| primary_topic.domain.display_name | Life Sciences |
| primary_topic.subfield.id | https://openalex.org/subfields/3005 |
| primary_topic.subfield.display_name | Toxicology |
| primary_topic.display_name | Pharmacovigilance and Adverse Drug Reactions |
| related_works | https://openalex.org/W2748952813, https://openalex.org/W2390279801, https://openalex.org/W2358668433, https://openalex.org/W2376932109, https://openalex.org/W2001405890, https://openalex.org/W2382290278, https://openalex.org/W2478288626, https://openalex.org/W2350741829, https://openalex.org/W2530322880, https://openalex.org/W1596801655 |
| cited_by_count | 0 |
| locations_count | 2 |
| best_oa_location.id | pmh:oai:arXiv.org:2402.01920 |
| best_oa_location.is_oa | True |
| best_oa_location.source.id | https://openalex.org/S4306400194 |
| best_oa_location.source.issn | |
| best_oa_location.source.type | repository |
| best_oa_location.source.is_oa | True |
| best_oa_location.source.issn_l | |
| best_oa_location.source.is_core | False |
| best_oa_location.source.is_in_doaj | False |
| best_oa_location.source.display_name | arXiv (Cornell University) |
| best_oa_location.source.host_organization | https://openalex.org/I205783295 |
| best_oa_location.source.host_organization_name | Cornell University |
| best_oa_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| best_oa_location.license | |
| best_oa_location.pdf_url | https://arxiv.org/pdf/2402.01920 |
| best_oa_location.version | submittedVersion |
| best_oa_location.raw_type | text |
| best_oa_location.license_id | |
| best_oa_location.is_accepted | False |
| best_oa_location.is_published | False |
| best_oa_location.raw_source_name | |
| best_oa_location.landing_page_url | http://arxiv.org/abs/2402.01920 |
| primary_location.id | pmh:oai:arXiv.org:2402.01920 |
| primary_location.is_oa | True |
| primary_location.source.id | https://openalex.org/S4306400194 |
| primary_location.source.issn | |
| primary_location.source.type | repository |
| primary_location.source.is_oa | True |
| primary_location.source.issn_l | |
| primary_location.source.is_core | False |
| primary_location.source.is_in_doaj | False |
| primary_location.source.display_name | arXiv (Cornell University) |
| primary_location.source.host_organization | https://openalex.org/I205783295 |
| primary_location.source.host_organization_name | Cornell University |
| primary_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| primary_location.license | |
| primary_location.pdf_url | https://arxiv.org/pdf/2402.01920 |
| primary_location.version | submittedVersion |
| primary_location.raw_type | text |
| primary_location.license_id | |
| primary_location.is_accepted | False |
| primary_location.is_published | False |
| primary_location.raw_source_name | |
| primary_location.landing_page_url | http://arxiv.org/abs/2402.01920 |
| publication_date | 2024-02-02 |
| publication_year | 2024 |
| referenced_works_count | 0 |
| cited_by_percentile_year | |
| countries_distinct_count | 0 |
| institutions_distinct_count | 6 |
| citation_normalized_percentile |