Leveraging Sub-Optimal Data for Human-in-the-Loop Reinforcement Learning Article Swipe
2024 · Open Access · DOI: https://doi.org/10.48550/arxiv.2405.00746
To create useful reinforcement learning (RL) agents, step zero is to design a suitable reward function that captures the nuances of the task. However, reward engineering can be a difficult and time-consuming process. Instead, human-in-the-loop RL methods hold the promise of learning reward functions from human feedback. Despite recent successes, many human-in-the-loop RL methods still require numerous human interactions to learn successful reward functions. To improve the feedback efficiency of human-in-the-loop RL methods (i.e., require less human interaction), this paper introduces Sub-optimal Data Pre-training (SDP), an approach that leverages reward-free, sub-optimal data to improve scalar- and preference-based RL algorithms. In SDP, we start by pseudo-labeling all low-quality data with the minimum environment reward. Through this process, we obtain reward labels to pre-train our reward model without requiring human labeling or preferences. This pre-training phase gives the reward model a head start in learning, enabling it to recognize that low-quality transitions should be assigned low rewards. Through extensive experiments with both simulated and human teachers, we find that SDP can at least match, and often significantly improve on, state-of-the-art human-in-the-loop RL performance across a variety of simulated robotic tasks.
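The pseudo-labeling step described in the abstract can be sketched in a few lines. This is an illustrative reading of the idea, not the authors' code: `pseudo_label` and the transition tuples are hypothetical names, and `r_min` stands in for the minimum reward the environment can emit.

```python
def pseudo_label(suboptimal_transitions, r_min):
    """Attach the minimum environment reward to every reward-free
    transition, yielding labeled data that can pre-train a reward
    model without any human labels or preference queries."""
    return [(s, a, s_next, r_min) for (s, a, s_next) in suboptimal_transitions]

# Toy usage: two reward-free transitions, environment minimum reward 0.0.
data = [("s0", "a0", "s1"), ("s1", "a1", "s2")]
labeled = pseudo_label(data, r_min=0.0)
```

The resulting labeled transitions would then serve as the pre-training set for the reward model, before any human feedback is collected.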
Record details
- Type: preprint
- Language: en
- Landing page: http://arxiv.org/abs/2405.00746
- PDF: https://arxiv.org/pdf/2405.00746
- OA status: green
- Related works: 10
- OpenAlex ID: https://openalex.org/W4396821559
OpenAlex record summary
- OpenAlex ID: https://openalex.org/W4396821559 (canonical identifier for this work in OpenAlex)
- DOI: https://doi.org/10.48550/arxiv.2405.00746 (Digital Object Identifier)
- Title: Leveraging Sub-Optimal Data for Human-in-the-Loop Reinforcement Learning
- Type: preprint (OpenAlex work type)
- Language: en (primary language)
- Publication year: 2024
- Publication date: 2024-04-30
- Authors: Calarina Muslimani, Matthew E. Taylor (in order)
- Landing page: https://arxiv.org/abs/2405.00746
- PDF URL: https://arxiv.org/pdf/2405.00746 (direct link to full-text PDF)
- Open access: Yes
- OA status: green (per OpenAlex)
- OA URL: https://arxiv.org/pdf/2405.00746
- Concepts: Reinforcement learning, Loop (graph theory), Human-in-the-loop, Computer science, Reinforcement, Artificial intelligence, Psychology, Mathematics, Social psychology, Combinatorics (top concepts attached by OpenAlex)
- Cited by: 0 (total citation count in OpenAlex)
- Related works (count): 10 (algorithmically related works)
Full payload
| Field | Value |
|---|---|
| id | https://openalex.org/W4396821559 |
| doi | https://doi.org/10.48550/arxiv.2405.00746 |
| ids.doi | https://doi.org/10.48550/arxiv.2405.00746 |
| ids.openalex | https://openalex.org/W4396821559 |
| fwci | |
| type | preprint |
| title | Leveraging Sub-Optimal Data for Human-in-the-Loop Reinforcement Learning |
| biblio.issue | |
| biblio.volume | |
| biblio.last_page | |
| biblio.first_page | |
| topics[0].id | https://openalex.org/T10524 |
| topics[0].field.id | https://openalex.org/fields/22 |
| topics[0].field.display_name | Engineering |
| topics[0].score | 0.8877999782562256 |
| topics[0].domain.id | https://openalex.org/domains/3 |
| topics[0].domain.display_name | Physical Sciences |
| topics[0].subfield.id | https://openalex.org/subfields/2207 |
| topics[0].subfield.display_name | Control and Systems Engineering |
| topics[0].display_name | Traffic control and management |
| topics[1].id | https://openalex.org/T10462 |
| topics[1].field.id | https://openalex.org/fields/17 |
| topics[1].field.display_name | Computer Science |
| topics[1].score | 0.879800021648407 |
| topics[1].domain.id | https://openalex.org/domains/3 |
| topics[1].domain.display_name | Physical Sciences |
| topics[1].subfield.id | https://openalex.org/subfields/1702 |
| topics[1].subfield.display_name | Artificial Intelligence |
| topics[1].display_name | Reinforcement Learning in Robotics |
| topics[2].id | https://openalex.org/T10603 |
| topics[2].field.id | https://openalex.org/fields/22 |
| topics[2].field.display_name | Engineering |
| topics[2].score | 0.878600001335144 |
| topics[2].domain.id | https://openalex.org/domains/3 |
| topics[2].domain.display_name | Physical Sciences |
| topics[2].subfield.id | https://openalex.org/subfields/2208 |
| topics[2].subfield.display_name | Electrical and Electronic Engineering |
| topics[2].display_name | Smart Grid Energy Management |
| is_xpac | False |
| apc_list | |
| apc_paid | |
| concepts[0].id | https://openalex.org/C97541855 |
| concepts[0].level | 2 |
| concepts[0].score | 0.8277811408042908 |
| concepts[0].wikidata | https://www.wikidata.org/wiki/Q830687 |
| concepts[0].display_name | Reinforcement learning |
| concepts[1].id | https://openalex.org/C184670325 |
| concepts[1].level | 2 |
| concepts[1].score | 0.7013992071151733 |
| concepts[1].wikidata | https://www.wikidata.org/wiki/Q512604 |
| concepts[1].display_name | Loop (graph theory) |
| concepts[2].id | https://openalex.org/C2780626000 |
| concepts[2].level | 2 |
| concepts[2].score | 0.6859395503997803 |
| concepts[2].wikidata | https://www.wikidata.org/wiki/Q5936775 |
| concepts[2].display_name | Human-in-the-loop |
| concepts[3].id | https://openalex.org/C41008148 |
| concepts[3].level | 0 |
| concepts[3].score | 0.5762977004051208 |
| concepts[3].wikidata | https://www.wikidata.org/wiki/Q21198 |
| concepts[3].display_name | Computer science |
| concepts[4].id | https://openalex.org/C67203356 |
| concepts[4].level | 2 |
| concepts[4].score | 0.5670349597930908 |
| concepts[4].wikidata | https://www.wikidata.org/wiki/Q1321905 |
| concepts[4].display_name | Reinforcement |
| concepts[5].id | https://openalex.org/C154945302 |
| concepts[5].level | 1 |
| concepts[5].score | 0.3880739212036133 |
| concepts[5].wikidata | https://www.wikidata.org/wiki/Q11660 |
| concepts[5].display_name | Artificial intelligence |
| concepts[6].id | https://openalex.org/C15744967 |
| concepts[6].level | 0 |
| concepts[6].score | 0.20474427938461304 |
| concepts[6].wikidata | https://www.wikidata.org/wiki/Q9418 |
| concepts[6].display_name | Psychology |
| concepts[7].id | https://openalex.org/C33923547 |
| concepts[7].level | 0 |
| concepts[7].score | 0.18085989356040955 |
| concepts[7].wikidata | https://www.wikidata.org/wiki/Q395 |
| concepts[7].display_name | Mathematics |
| concepts[8].id | https://openalex.org/C77805123 |
| concepts[8].level | 1 |
| concepts[8].score | 0.11284264922142029 |
| concepts[8].wikidata | https://www.wikidata.org/wiki/Q161272 |
| concepts[8].display_name | Social psychology |
| concepts[9].id | https://openalex.org/C114614502 |
| concepts[9].level | 1 |
| concepts[9].score | 0.07737880945205688 |
| concepts[9].wikidata | https://www.wikidata.org/wiki/Q76592 |
| concepts[9].display_name | Combinatorics |
| keywords[0].id | https://openalex.org/keywords/reinforcement-learning |
| keywords[0].score | 0.8277811408042908 |
| keywords[0].display_name | Reinforcement learning |
| keywords[1].id | https://openalex.org/keywords/loop |
| keywords[1].score | 0.7013992071151733 |
| keywords[1].display_name | Loop (graph theory) |
| keywords[2].id | https://openalex.org/keywords/human-in-the-loop |
| keywords[2].score | 0.6859395503997803 |
| keywords[2].display_name | Human-in-the-loop |
| keywords[3].id | https://openalex.org/keywords/computer-science |
| keywords[3].score | 0.5762977004051208 |
| keywords[3].display_name | Computer science |
| keywords[4].id | https://openalex.org/keywords/reinforcement |
| keywords[4].score | 0.5670349597930908 |
| keywords[4].display_name | Reinforcement |
| keywords[5].id | https://openalex.org/keywords/artificial-intelligence |
| keywords[5].score | 0.3880739212036133 |
| keywords[5].display_name | Artificial intelligence |
| keywords[6].id | https://openalex.org/keywords/psychology |
| keywords[6].score | 0.20474427938461304 |
| keywords[6].display_name | Psychology |
| keywords[7].id | https://openalex.org/keywords/mathematics |
| keywords[7].score | 0.18085989356040955 |
| keywords[7].display_name | Mathematics |
| keywords[8].id | https://openalex.org/keywords/social-psychology |
| keywords[8].score | 0.11284264922142029 |
| keywords[8].display_name | Social psychology |
| keywords[9].id | https://openalex.org/keywords/combinatorics |
| keywords[9].score | 0.07737880945205688 |
| keywords[9].display_name | Combinatorics |
| language | en |
| locations[0].id | pmh:oai:arXiv.org:2405.00746 |
| locations[0].is_oa | True |
| locations[0].source.id | https://openalex.org/S4306400194 |
| locations[0].source.issn | |
| locations[0].source.type | repository |
| locations[0].source.is_oa | True |
| locations[0].source.issn_l | |
| locations[0].source.is_core | False |
| locations[0].source.is_in_doaj | False |
| locations[0].source.display_name | arXiv (Cornell University) |
| locations[0].source.host_organization | https://openalex.org/I205783295 |
| locations[0].source.host_organization_name | Cornell University |
| locations[0].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[0].license | |
| locations[0].pdf_url | https://arxiv.org/pdf/2405.00746 |
| locations[0].version | submittedVersion |
| locations[0].raw_type | text |
| locations[0].license_id | |
| locations[0].is_accepted | False |
| locations[0].is_published | False |
| locations[0].raw_source_name | |
| locations[0].landing_page_url | http://arxiv.org/abs/2405.00746 |
| locations[1].id | doi:10.48550/arxiv.2405.00746 |
| locations[1].is_oa | True |
| locations[1].source.id | https://openalex.org/S4306400194 |
| locations[1].source.issn | |
| locations[1].source.type | repository |
| locations[1].source.is_oa | True |
| locations[1].source.issn_l | |
| locations[1].source.is_core | False |
| locations[1].source.is_in_doaj | False |
| locations[1].source.display_name | arXiv (Cornell University) |
| locations[1].source.host_organization | https://openalex.org/I205783295 |
| locations[1].source.host_organization_name | Cornell University |
| locations[1].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[1].license | cc-by |
| locations[1].pdf_url | |
| locations[1].version | |
| locations[1].raw_type | article |
| locations[1].license_id | https://openalex.org/licenses/cc-by |
| locations[1].is_accepted | False |
| locations[1].is_published | |
| locations[1].raw_source_name | |
| locations[1].landing_page_url | https://doi.org/10.48550/arxiv.2405.00746 |
| indexed_in | arxiv, datacite |
| authorships[0].author.id | https://openalex.org/A5010909043 |
| authorships[0].author.orcid | https://orcid.org/0009-0002-4024-4969 |
| authorships[0].author.display_name | Calarina Muslimani |
| authorships[0].author_position | first |
| authorships[0].raw_author_name | Muslimani, Calarina |
| authorships[0].is_corresponding | False |
| authorships[1].author.id | https://openalex.org/A5070914351 |
| authorships[1].author.orcid | https://orcid.org/0000-0001-8946-0211 |
| authorships[1].author.display_name | Matthew E. Taylor |
| authorships[1].author_position | last |
| authorships[1].raw_author_name | Taylor, Matthew E. |
| authorships[1].is_corresponding | False |
| has_content.pdf | False |
| has_content.grobid_xml | False |
| is_paratext | False |
| open_access.is_oa | True |
| open_access.oa_url | https://arxiv.org/pdf/2405.00746 |
| open_access.oa_status | green |
| open_access.any_repository_has_fulltext | False |
| created_date | 2025-10-10T00:00:00 |
| display_name | Leveraging Sub-Optimal Data for Human-in-the-Loop Reinforcement Learning |
| has_fulltext | False |
| is_retracted | False |
| updated_date | 2025-11-06T06:51:31.235846 |
| primary_topic.id | https://openalex.org/T10524 |
| primary_topic.field.id | https://openalex.org/fields/22 |
| primary_topic.field.display_name | Engineering |
| primary_topic.score | 0.8877999782562256 |
| primary_topic.domain.id | https://openalex.org/domains/3 |
| primary_topic.domain.display_name | Physical Sciences |
| primary_topic.subfield.id | https://openalex.org/subfields/2207 |
| primary_topic.subfield.display_name | Control and Systems Engineering |
| primary_topic.display_name | Traffic control and management |
| related_works | https://openalex.org/W2920061524, https://openalex.org/W4310083477, https://openalex.org/W2328553770, https://openalex.org/W1977959518, https://openalex.org/W2038908348, https://openalex.org/W2107890255, https://openalex.org/W4367173559, https://openalex.org/W2902961658, https://openalex.org/W2782058284, https://openalex.org/W3103937890 |
| cited_by_count | 0 |
| locations_count | 2 |
| best_oa_location.id | pmh:oai:arXiv.org:2405.00746 |
| best_oa_location.is_oa | True |
| best_oa_location.source.id | https://openalex.org/S4306400194 |
| best_oa_location.source.issn | |
| best_oa_location.source.type | repository |
| best_oa_location.source.is_oa | True |
| best_oa_location.source.issn_l | |
| best_oa_location.source.is_core | False |
| best_oa_location.source.is_in_doaj | False |
| best_oa_location.source.display_name | arXiv (Cornell University) |
| best_oa_location.source.host_organization | https://openalex.org/I205783295 |
| best_oa_location.source.host_organization_name | Cornell University |
| best_oa_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| best_oa_location.license | |
| best_oa_location.pdf_url | https://arxiv.org/pdf/2405.00746 |
| best_oa_location.version | submittedVersion |
| best_oa_location.raw_type | text |
| best_oa_location.license_id | |
| best_oa_location.is_accepted | False |
| best_oa_location.is_published | False |
| best_oa_location.raw_source_name | |
| best_oa_location.landing_page_url | http://arxiv.org/abs/2405.00746 |
| primary_location.id | pmh:oai:arXiv.org:2405.00746 |
| primary_location.is_oa | True |
| primary_location.source.id | https://openalex.org/S4306400194 |
| primary_location.source.issn | |
| primary_location.source.type | repository |
| primary_location.source.is_oa | True |
| primary_location.source.issn_l | |
| primary_location.source.is_core | False |
| primary_location.source.is_in_doaj | False |
| primary_location.source.display_name | arXiv (Cornell University) |
| primary_location.source.host_organization | https://openalex.org/I205783295 |
| primary_location.source.host_organization_name | Cornell University |
| primary_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| primary_location.license | |
| primary_location.pdf_url | https://arxiv.org/pdf/2405.00746 |
| primary_location.version | submittedVersion |
| primary_location.raw_type | text |
| primary_location.license_id | |
| primary_location.is_accepted | False |
| primary_location.is_published | False |
| primary_location.raw_source_name | |
| primary_location.landing_page_url | http://arxiv.org/abs/2405.00746 |
| publication_date | 2024-04-30 |
| publication_year | 2024 |
| referenced_works_count | 0 |
| abstract_inverted_index | (machine-readable positional word index of the abstract; readable abstract given above) |
| cited_by_percentile_year | |
| countries_distinct_count | 0 |
| institutions_distinct_count | 2 |
| citation_normalized_percentile | |
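OpenAlex ships abstracts as an `abstract_inverted_index`: a mapping from each word to the list of positions at which it occurs. A small helper can rebuild the plain-text abstract from that index; the function name below is illustrative.

```python
def reconstruct_abstract(inverted_index):
    """Rebuild an abstract from an OpenAlex abstract_inverted_index,
    which maps each word to the list of positions where it appears."""
    # Collect (position, word) pairs, then emit words in position order.
    pairs = [(pos, word)
             for word, positions in inverted_index.items()
             for pos in positions]
    return " ".join(word for _, word in sorted(pairs))

# Toy usage with a fragment of the index shape shown in the payload above.
sample = {"useful": [2], "To": [0], "create": [1]}
print(reconstruct_abstract(sample))  # → To create useful
```

Sorting on the integer positions restores the original word order regardless of the dictionary's key order, which is why the index can be stored compactly without losing the text.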