RLTHF: Targeted Human Feedback for LLM Alignment

Yifei Xu, Tusher Chakraborty, Emre Kıcıman, Bibek Aryal, Eduardo Rodrigues, Srinagesh Sharma, Roberto Estêvão, María Angels de Luis Balaguer, Jessica Wolk, Rafael Padilha, Leonardo Silva Nunes, Shobana Balakrishnan, Songwu Lu, Ranveer Chandra

2025 · Open Access · DOI: https://doi.org/10.48550/arxiv.2502.13417

Fine-tuning large language models (LLMs) to align with user preferences is challenging due to the high cost of quality human annotations in Reinforcement Learning from Human Feedback (RLHF) and the generalizability limitations of AI Feedback. To address these challenges, we propose RLTHF, a human-AI hybrid framework that combines LLM-based initial alignment with selective human annotations to achieve full-human annotation alignment with minimal effort. RLTHF identifies hard-to-annotate samples mislabeled by LLMs using a reward model's reward distribution and iteratively enhances alignment by integrating strategic human corrections while leveraging LLM's correctly labeled samples. Evaluations on HH-RLHF and TL;DR datasets show that RLTHF reaches full-human annotation-level alignment with only 6-7% of the human annotation effort. Furthermore, models trained on RLTHF's curated datasets for downstream tasks outperform those trained on fully human-annotated datasets, underscoring the effectiveness of RLTHF.
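
The selection step described in the abstract can be illustrated with a minimal sketch. The abstract does not say how the reward distribution is used, so everything below is an assumption rather than the authors' method: the `LabeledPair` structure, the `reward_margin` score (reward of the LLM-preferred response minus reward of the other response), the `select_for_human_review` helper, and the rule of routing the lowest-margin fraction of pairs to human annotators are all hypothetical. The default budget of 0.07 simply echoes the 6-7% figure reported above.

```python
from dataclasses import dataclass


@dataclass
class LabeledPair:
    """One preference pair labeled by an LLM (hypothetical structure)."""
    pair_id: str
    reward_chosen: float    # reward-model score for the response the LLM preferred
    reward_rejected: float  # reward-model score for the other response

    @property
    def reward_margin(self) -> float:
        # A low or negative margin means the reward model is unsure about,
        # or disagrees with, the LLM's label.
        return self.reward_chosen - self.reward_rejected


def select_for_human_review(pairs, budget_fraction=0.07):
    """Split pairs into (to_review, to_keep).

    ASSUMPTION: ranking by margin is an illustrative stand-in for the
    paper's use of the reward distribution, not its actual procedure.
    """
    if not 0.0 <= budget_fraction <= 1.0:
        raise ValueError("budget_fraction must be in [0, 1]")
    ranked = sorted(pairs, key=lambda p: p.reward_margin)
    n_review = int(len(ranked) * budget_fraction)
    # Lowest-margin pairs go to humans; the rest keep their LLM labels.
    return ranked[:n_review], ranked[n_review:]


pairs = [
    LabeledPair("a", 2.1, 0.3),
    LabeledPair("b", 0.4, 0.9),
    LabeledPair("c", 1.0, 0.95),
    LabeledPair("d", 3.0, -1.0),
]
to_review, to_keep = select_for_human_review(pairs, budget_fraction=0.5)
print([p.pair_id for p in to_review])
```

With a 50% budget on this toy data, the two pairs with the smallest margins (`b` and `c`) are selected for human review, and `a` and `d` keep their LLM labels.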

Related Topics

Computer Science

Concepts

Computer science

Metadata

Type: preprint
Language: en
Landing Page: http://arxiv.org/abs/2502.13417
PDF: https://arxiv.org/pdf/2502.13417
OA Status: green
Related Works: 10
OpenAlex ID: https://openalex.org/W4407764293

All OpenAlex metadata

OpenAlex ID: https://openalex.org/W4407764293
Canonical identifier for this work in OpenAlex

DOI: https://doi.org/10.48550/arxiv.2502.13417
Digital Object Identifier

Title: RLTHF: Targeted Human Feedback for LLM Alignment
Work title

Type: preprint
OpenAlex work type

Language: en
Primary language

Publication year: 2025
Year of publication

Publication date: 2025-02-19
Full publication date if available

Authors: Yifei Xu, Tusher Chakraborty, Emre Kıcıman, Bibek Aryal, Eduardo Rodrigues, Srinagesh Sharma, Roberto Estêvão, María Angels de Luis Balaguer, Jessica Wolk, Rafael Padilha, Leonardo Silva Nunes, Shobana Balakrishnan, Songwu Lu, Ranveer Chandra
List of authors in order

Landing page: https://arxiv.org/abs/2502.13417
Publisher landing page

PDF URL: https://arxiv.org/pdf/2502.13417
Direct link to full text PDF

Open access: Yes
Whether a free full text is available

OA status: green
Open access status per OpenAlex

OA URL: https://arxiv.org/pdf/2502.13417
Direct OA link when available

Concepts: Computer science
Top concepts (fields/topics) attached by OpenAlex

Cited by: 0
Total citation count in OpenAlex

Related works (count): 10
Other works algorithmically related by OpenAlex

Full payload

id	https://openalex.org/W4407764293
doi	https://doi.org/10.48550/arxiv.2502.13417
ids.doi	https://doi.org/10.48550/arxiv.2502.13417
ids.openalex	https://openalex.org/W4407764293
fwci
type	preprint
title	RLTHF: Targeted Human Feedback for LLM Alignment
biblio.issue
biblio.volume
biblio.last_page
biblio.first_page
topics[0].id	https://openalex.org/T10215
topics[0].field.id	https://openalex.org/fields/17
topics[0].field.display_name	Computer Science
topics[0].score	0.7336999773979187
topics[0].domain.id	https://openalex.org/domains/3
topics[0].domain.display_name	Physical Sciences
topics[0].subfield.id	https://openalex.org/subfields/1702
topics[0].subfield.display_name	Artificial Intelligence
topics[0].display_name	Semantic Web and Ontologies
topics[1].id	https://openalex.org/T10181
topics[1].field.id	https://openalex.org/fields/17
topics[1].field.display_name	Computer Science
topics[1].score	0.6261000037193298
topics[1].domain.id	https://openalex.org/domains/3
topics[1].domain.display_name	Physical Sciences
topics[1].subfield.id	https://openalex.org/subfields/1702
topics[1].subfield.display_name	Artificial Intelligence
topics[1].display_name	Natural Language Processing Techniques
topics[2].id	https://openalex.org/T14351
topics[2].field.id	https://openalex.org/fields/17
topics[2].field.display_name	Computer Science
topics[2].score	0.6137999892234802
topics[2].domain.id	https://openalex.org/domains/3
topics[2].domain.display_name	Physical Sciences
topics[2].subfield.id	https://openalex.org/subfields/1702
topics[2].subfield.display_name	Artificial Intelligence
topics[2].display_name	Statistical and Computational Modeling
is_xpac	False
apc_list
apc_paid
concepts[0].id	https://openalex.org/C41008148
concepts[0].level	0
concepts[0].score	0.4509413540363312
concepts[0].wikidata	https://www.wikidata.org/wiki/Q21198
concepts[0].display_name	Computer science
keywords[0].id	https://openalex.org/keywords/computer-science
keywords[0].score	0.4509413540363312
keywords[0].display_name	Computer science
language	en
locations[0].id	pmh:oai:arXiv.org:2502.13417
locations[0].is_oa	True
locations[0].source.id	https://openalex.org/S4306400194
locations[0].source.issn
locations[0].source.type	repository
locations[0].source.is_oa	True
locations[0].source.issn_l
locations[0].source.is_core	False
locations[0].source.is_in_doaj	False
locations[0].source.display_name	arXiv (Cornell University)
locations[0].source.host_organization	https://openalex.org/I205783295
locations[0].source.host_organization_name	Cornell University
locations[0].source.host_organization_lineage	https://openalex.org/I205783295
locations[0].license
locations[0].pdf_url	https://arxiv.org/pdf/2502.13417
locations[0].version	submittedVersion
locations[0].raw_type	text
locations[0].license_id
locations[0].is_accepted	False
locations[0].is_published	False
locations[0].raw_source_name
locations[0].landing_page_url	http://arxiv.org/abs/2502.13417
locations[1].id	doi:10.48550/arxiv.2502.13417
locations[1].is_oa	True
locations[1].source.id	https://openalex.org/S4306400194
locations[1].source.issn
locations[1].source.type	repository
locations[1].source.is_oa	True
locations[1].source.issn_l
locations[1].source.is_core	False
locations[1].source.is_in_doaj	False
locations[1].source.display_name	arXiv (Cornell University)
locations[1].source.host_organization	https://openalex.org/I205783295
locations[1].source.host_organization_name	Cornell University
locations[1].source.host_organization_lineage	https://openalex.org/I205783295
locations[1].license
locations[1].pdf_url
locations[1].version
locations[1].raw_type	article
locations[1].license_id
locations[1].is_accepted	False
locations[1].is_published
locations[1].raw_source_name
locations[1].landing_page_url	https://doi.org/10.48550/arxiv.2502.13417
indexed_in	arxiv, datacite
authorships[0].author.id	https://openalex.org/A5069256680
authorships[0].author.orcid	https://orcid.org/0000-0003-1329-3124
authorships[0].author.display_name	Yifei Xu
authorships[0].author_position	first
authorships[0].raw_author_name	Xu, Yifei
authorships[0].is_corresponding	False
authorships[1].author.id	https://openalex.org/A5043473455
authorships[1].author.orcid	https://orcid.org/0000-0003-1656-5471
authorships[1].author.display_name	Tusher Chakraborty
authorships[1].author_position	middle
authorships[1].raw_author_name	Chakraborty, Tusher
authorships[1].is_corresponding	False
authorships[2].author.id	https://openalex.org/A5112594439
authorships[2].author.orcid
authorships[2].author.display_name	Emre Kıcıman
authorships[2].author_position	middle
authorships[2].raw_author_name	Kıcıman, Emre
authorships[2].is_corresponding	False
authorships[3].author.id	https://openalex.org/A5020154361
authorships[3].author.orcid	https://orcid.org/0000-0003-0257-7439
authorships[3].author.display_name	Bibek Aryal
authorships[3].author_position	middle
authorships[3].raw_author_name	Aryal, Bibek
authorships[3].is_corresponding	False
authorships[4].author.id	https://openalex.org/A5023708186
authorships[4].author.orcid	https://orcid.org/0000-0003-2846-7625
authorships[4].author.display_name	E. Rodrigues
authorships[4].author_position	middle
authorships[4].raw_author_name	Rodrigues, Eduardo
authorships[4].is_corresponding	False
authorships[5].author.id	https://openalex.org/A5103899562
authorships[5].author.orcid
authorships[5].author.display_name	Srinagesh Sharma
authorships[5].author_position	middle
authorships[5].raw_author_name	Sharma, Srinagesh
authorships[5].is_corresponding	False
authorships[6].author.id	https://openalex.org/A5073495435
authorships[6].author.orcid
authorships[6].author.display_name	Roberto Estêvão
authorships[6].author_position	middle
authorships[6].raw_author_name	Estevao, Roberto
authorships[6].is_corresponding	False
authorships[7].author.id	https://openalex.org/A5012552344
authorships[7].author.orcid	https://orcid.org/0000-0002-6272-7841
authorships[7].author.display_name	María Angels de Luis Balaguer
authorships[7].author_position	middle
authorships[7].raw_author_name	Balaguer, Maria Angels de Luis
authorships[7].is_corresponding	False
authorships[8].author.id	https://openalex.org/A5029069424
authorships[8].author.orcid
authorships[8].author.display_name	Joel L. Wolk
authorships[8].author_position	middle
authorships[8].raw_author_name	Wolk, Jessica
authorships[8].is_corresponding	False
authorships[9].author.id	https://openalex.org/A5027978607
authorships[9].author.orcid	https://orcid.org/0000-0003-1944-5475
authorships[9].author.display_name	Rafael Padilha
authorships[9].author_position	middle
authorships[9].raw_author_name	Padilha, Rafael
authorships[9].is_corresponding	False
authorships[10].author.id	https://openalex.org/A5033463791
authorships[10].author.orcid	https://orcid.org/0009-0009-1296-1013
authorships[10].author.display_name	Leonardo Silva Nunes
authorships[10].author_position	middle
authorships[10].raw_author_name	Nunes, Leonardo
authorships[10].is_corresponding	False
authorships[11].author.id	https://openalex.org/A5110322584
authorships[11].author.orcid
authorships[11].author.display_name	Shobana Balakrishnan
authorships[11].author_position	middle
authorships[11].raw_author_name	Balakrishnan, Shobana
authorships[11].is_corresponding	False
authorships[12].author.id	https://openalex.org/A5020188879
authorships[12].author.orcid	https://orcid.org/0000-0003-3779-0918
authorships[12].author.display_name	Songwu Lu
authorships[12].author_position	middle
authorships[12].raw_author_name	Lu, Songwu
authorships[12].is_corresponding	False
authorships[13].author.id	https://openalex.org/A5112443217
authorships[13].author.orcid
authorships[13].author.display_name	Ranveer Chandra
authorships[13].author_position	last
authorships[13].raw_author_name	Chandra, Ranveer
authorships[13].is_corresponding	False
has_content.pdf	False
has_content.grobid_xml	False
is_paratext	False
open_access.is_oa	True
open_access.oa_url	https://arxiv.org/pdf/2502.13417
open_access.oa_status	green
open_access.any_repository_has_fulltext	False
created_date	2025-10-10T00:00:00
display_name	RLTHF: Targeted Human Feedback for LLM Alignment
has_fulltext	False
is_retracted	False
updated_date	2025-11-06T06:51:31.235846
primary_topic.id	https://openalex.org/T10215
primary_topic.field.id	https://openalex.org/fields/17
primary_topic.field.display_name	Computer Science
primary_topic.score	0.7336999773979187
primary_topic.domain.id	https://openalex.org/domains/3
primary_topic.domain.display_name	Physical Sciences
primary_topic.subfield.id	https://openalex.org/subfields/1702
primary_topic.subfield.display_name	Artificial Intelligence
primary_topic.display_name	Semantic Web and Ontologies
related_works	https://openalex.org/W4391375266, https://openalex.org/W2899084033, https://openalex.org/W2748952813, https://openalex.org/W2390279801, https://openalex.org/W4391913857, https://openalex.org/W2358668433, https://openalex.org/W4396701345, https://openalex.org/W2376932109, https://openalex.org/W2001405890, https://openalex.org/W4396696052
cited_by_count	0
locations_count	2
best_oa_location.id	pmh:oai:arXiv.org:2502.13417
best_oa_location.is_oa	True
best_oa_location.source.id	https://openalex.org/S4306400194
best_oa_location.source.issn
best_oa_location.source.type	repository
best_oa_location.source.is_oa	True
best_oa_location.source.issn_l
best_oa_location.source.is_core	False
best_oa_location.source.is_in_doaj	False
best_oa_location.source.display_name	arXiv (Cornell University)
best_oa_location.source.host_organization	https://openalex.org/I205783295
best_oa_location.source.host_organization_name	Cornell University
best_oa_location.source.host_organization_lineage	https://openalex.org/I205783295
best_oa_location.license
best_oa_location.pdf_url	https://arxiv.org/pdf/2502.13417
best_oa_location.version	submittedVersion
best_oa_location.raw_type	text
best_oa_location.license_id
best_oa_location.is_accepted	False
best_oa_location.is_published	False
best_oa_location.raw_source_name
best_oa_location.landing_page_url	http://arxiv.org/abs/2502.13417
primary_location.id	pmh:oai:arXiv.org:2502.13417
primary_location.is_oa	True
primary_location.source.id	https://openalex.org/S4306400194
primary_location.source.issn
primary_location.source.type	repository
primary_location.source.is_oa	True
primary_location.source.issn_l
primary_location.source.is_core	False
primary_location.source.is_in_doaj	False
primary_location.source.display_name	arXiv (Cornell University)
primary_location.source.host_organization	https://openalex.org/I205783295
primary_location.source.host_organization_name	Cornell University
primary_location.source.host_organization_lineage	https://openalex.org/I205783295
primary_location.license
primary_location.pdf_url	https://arxiv.org/pdf/2502.13417
primary_location.version	submittedVersion
primary_location.raw_type	text
primary_location.license_id
primary_location.is_accepted	False
primary_location.is_published	False
primary_location.raw_source_name
primary_location.landing_page_url	http://arxiv.org/abs/2502.13417
publication_date	2025-02-19
publication_year	2025
referenced_works_count	0
abstract_inverted_index.a	42, 71
abstract_inverted_index.AI	33
abstract_inverted_index.To	35
abstract_inverted_index.by	68, 80
abstract_inverted_index.in	21
abstract_inverted_index.is	10
abstract_inverted_index.of	17, 32, 107, 132
abstract_inverted_index.on	92, 115, 125
abstract_inverted_index.to	5, 13, 55
abstract_inverted_index.we	39
abstract_inverted_index.and	28, 76, 94
abstract_inverted_index.due	12
abstract_inverted_index.for	119
abstract_inverted_index.the	14, 29, 108, 130
abstract_inverted_index.6-7%	106
abstract_inverted_index.LLMs	69
abstract_inverted_index.cost	16
abstract_inverted_index.from	24
abstract_inverted_index.high	15
abstract_inverted_index.only	105
abstract_inverted_index.show	97
abstract_inverted_index.that	46, 98
abstract_inverted_index.user	8
abstract_inverted_index.with	7, 51, 60, 104
abstract_inverted_index.Human	25
abstract_inverted_index.LLM's	87
abstract_inverted_index.RLTHF	63, 99
abstract_inverted_index.TL;DR	95
abstract_inverted_index.align	6
abstract_inverted_index.fully	126
abstract_inverted_index.human	19, 53, 83, 109
abstract_inverted_index.large	1
abstract_inverted_index.tasks	121
abstract_inverted_index.these	37
abstract_inverted_index.those	123
abstract_inverted_index.using	70
abstract_inverted_index.while	85
abstract_inverted_index.(LLMs)	4
abstract_inverted_index.(RLHF)	27
abstract_inverted_index.RLTHF,	41
abstract_inverted_index.RLTHF.	133
abstract_inverted_index.hybrid	44
abstract_inverted_index.models	3, 113
abstract_inverted_index.reward	72, 74
abstract_inverted_index.HH-RLHF	93
abstract_inverted_index.RLTHF's	116
abstract_inverted_index.achieve	56
abstract_inverted_index.address	36
abstract_inverted_index.curated	117
abstract_inverted_index.effort.	62, 111
abstract_inverted_index.initial	49
abstract_inverted_index.labeled	89
abstract_inverted_index.minimal	61
abstract_inverted_index.model's	73
abstract_inverted_index.propose	40
abstract_inverted_index.quality	18
abstract_inverted_index.reaches	100
abstract_inverted_index.samples	66
abstract_inverted_index.trained	114, 124
abstract_inverted_index.Feedback	26
abstract_inverted_index.Learning	23
abstract_inverted_index.combines	47
abstract_inverted_index.datasets	96, 118
abstract_inverted_index.enhances	78
abstract_inverted_index.human-AI	43
abstract_inverted_index.language	2
abstract_inverted_index.samples.	90
abstract_inverted_index.Feedback.	34
abstract_inverted_index.LLM-based	48
abstract_inverted_index.alignment	50, 59, 79, 103
abstract_inverted_index.correctly	88
abstract_inverted_index.datasets,	128
abstract_inverted_index.framework	45
abstract_inverted_index.selective	52
abstract_inverted_index.strategic	82
abstract_inverted_index.annotation	58, 110
abstract_inverted_index.downstream	120
abstract_inverted_index.full-human	57, 101
abstract_inverted_index.identifies	64
abstract_inverted_index.leveraging	86
abstract_inverted_index.mislabeled	67
abstract_inverted_index.outperform	122
abstract_inverted_index.Evaluations	91
abstract_inverted_index.Fine-tuning	0
abstract_inverted_index.annotations	20, 54
abstract_inverted_index.challenges,	38
abstract_inverted_index.challenging	11
abstract_inverted_index.corrections	84
abstract_inverted_index.integrating	81
abstract_inverted_index.iteratively	77
abstract_inverted_index.limitations	31
abstract_inverted_index.preferences	9
abstract_inverted_index.Furthermore,	112
abstract_inverted_index.distribution	75
abstract_inverted_index.underscoring	129
abstract_inverted_index.Reinforcement	22
abstract_inverted_index.effectiveness	131
abstract_inverted_index.human-annotated	127
abstract_inverted_index.annotation-level	102
abstract_inverted_index.generalizability	30
abstract_inverted_index.hard-to-annotate	65
cited_by_percentile_year
countries_distinct_count	0
institutions_distinct_count	14
citation_normalized_percentile
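
The `abstract_inverted_index.*` rows in the payload store the abstract as a map from each token to the zero-based positions where it occurs. The following is a minimal sketch of rebuilding the text from such rows, assuming each row has the form `key<TAB>comma-separated positions` as in this dump. The helper name `rebuild_abstract` is hypothetical and is not part of any OpenAlex library.

```python
PREFIX = "abstract_inverted_index."


def rebuild_abstract(rows):
    """Rebuild abstract text from flattened inverted-index rows.

    Each row is expected to look like 'abstract_inverted_index.word\t3, 17'.
    Rows with a different prefix or with no value are skipped.
    """
    by_position = {}
    for row in rows:
        key, _, value = row.partition("\t")
        if not key.startswith(PREFIX) or not value.strip():
            continue
        token = key[len(PREFIX):]
        for pos in value.split(","):
            by_position[int(pos)] = token
    # Join tokens in position order to recover the original word sequence.
    return " ".join(by_position[i] for i in sorted(by_position))


# Sample rows taken from the payload, with positions above 13 trimmed
# for brevity (for example, "to" also occurs at position 55).
rows = [
    "abstract_inverted_index.is\t10",
    "abstract_inverted_index.to\t5, 13",
    "abstract_inverted_index.due\t12",
    "abstract_inverted_index.user\t8",
    "abstract_inverted_index.with\t7",
    "abstract_inverted_index.align\t6",
    "abstract_inverted_index.large\t1",
    "abstract_inverted_index.(LLMs)\t4",
    "abstract_inverted_index.models\t3",
    "abstract_inverted_index.language\t2",
    "abstract_inverted_index.Fine-tuning\t0",
    "abstract_inverted_index.challenging\t11",
    "abstract_inverted_index.preferences\t9",
    "cited_by_count\t0",
]
print(rebuild_abstract(rows))
```

Run on the trimmed sample, this prints the opening words of the abstract; feeding it every `abstract_inverted_index.*` row should recover the full text, since each position maps to exactly one token.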