Leveraging Sub-Optimal Data for Human-in-the-Loop Reinforcement Learning Article Swipe
2024 · Open Access · DOI: https://doi.org/10.48550/arxiv.2405.00746
To create useful reinforcement learning (RL) agents, step zero is to design a suitable reward function that captures the nuances of the task. However, reward engineering can be a difficult and time-consuming process. Instead, human-in-the-loop RL methods hold the promise of learning reward functions from human feedback. Despite recent successes, many human-in-the-loop RL methods still require numerous human interactions to learn successful reward functions. To improve the feedback efficiency of human-in-the-loop RL methods (i.e., require less human interaction), this paper introduces Sub-optimal Data Pre-training (SDP), an approach that leverages reward-free, sub-optimal data to improve scalar- and preference-based RL algorithms. In SDP, we start by pseudo-labeling all low-quality data with the minimum environment reward. Through this process, we obtain reward labels to pre-train our reward model without requiring human labeling or preferences. This pre-training phase gives the reward model a head start in learning, enabling it to recognize that low-quality transitions should be assigned low rewards. Through extensive experiments with both simulated and human teachers, we find that SDP can at least match, and often significantly improve on, state-of-the-art human-in-the-loop RL performance across a variety of simulated robotic tasks.
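The pseudo-labeling step described in the abstract can be sketched in a few lines. This is an illustrative reading of the idea, not the authors' code: `pseudo_label` and the transition tuples are hypothetical names, and `r_min` stands in for the minimum reward the environment can emit.

```python
def pseudo_label(suboptimal_transitions, r_min):
    """Attach the minimum environment reward to every reward-free
    transition, yielding labeled data that can pre-train a reward
    model without any human labels or preference queries."""
    return [(s, a, s_next, r_min) for (s, a, s_next) in suboptimal_transitions]

# Toy usage: two reward-free transitions, environment minimum reward 0.0.
data = [("s0", "a0", "s1"), ("s1", "a1", "s2")]
labeled = pseudo_label(data, r_min=0.0)
```

The resulting labeled transitions would then serve as the pre-training set for the reward model, before any human feedback is collected.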
Record details
- Type: preprint
- Language: en
- Landing page: http://arxiv.org/abs/2405.00746
- PDF: https://arxiv.org/pdf/2405.00746
- OA status: green
- Related works: 10
- OpenAlex ID: https://openalex.org/W4396821559
OpenAlex record summary
- OpenAlex ID: https://openalex.org/W4396821559 (canonical identifier for this work in OpenAlex)
- DOI: https://doi.org/10.48550/arxiv.2405.00746 (Digital Object Identifier)
- Title: Leveraging Sub-Optimal Data for Human-in-the-Loop Reinforcement Learning
- Type: preprint (OpenAlex work type)
- Language: en (primary language)
- Publication year: 2024
- Publication date: 2024-04-30
- Authors: Calarina Muslimani, Matthew E. Taylor (in order)
- Landing page: https://arxiv.org/abs/2405.00746
- PDF URL: https://arxiv.org/pdf/2405.00746 (direct link to full-text PDF)
- Open access: Yes
- OA status: green (per OpenAlex)
- OA URL: https://arxiv.org/pdf/2405.00746
- Concepts: Reinforcement learning, Loop (graph theory), Human-in-the-loop, Computer science, Reinforcement, Artificial intelligence, Psychology, Mathematics, Social psychology, Combinatorics (top concepts attached by OpenAlex)
- Cited by: 0 (total citation count in OpenAlex)
- Related works (count): 10 (algorithmically related works)
Full payload
| Field | Value |
|---|---|
| id | https://openalex.org/W4396821559 |
| doi | https://doi.org/10.48550/arxiv.2405.00746 |
| ids.doi | https://doi.org/10.48550/arxiv.2405.00746 |
| ids.openalex | https://openalex.org/W4396821559 |
| fwci | |
| type | preprint |
| title | Leveraging Sub-Optimal Data for Human-in-the-Loop Reinforcement Learning |
| biblio.issue | |
| biblio.volume | |
| biblio.last_page | |
| biblio.first_page | |
| topics[0].id | https://openalex.org/T10524 |
| topics[0].field.id | https://openalex.org/fields/22 |
| topics[0].field.display_name | Engineering |
| topics[0].score | 0.8877999782562256 |
| topics[0].domain.id | https://openalex.org/domains/3 |
| topics[0].domain.display_name | Physical Sciences |
| topics[0].subfield.id | https://openalex.org/subfields/2207 |
| topics[0].subfield.display_name | Control and Systems Engineering |
| topics[0].display_name | Traffic control and management |
| topics[1].id | https://openalex.org/T10462 |
| topics[1].field.id | https://openalex.org/fields/17 |
| topics[1].field.display_name | Computer Science |
| topics[1].score | 0.879800021648407 |
| topics[1].domain.id | https://openalex.org/domains/3 |
| topics[1].domain.display_name | Physical Sciences |
| topics[1].subfield.id | https://openalex.org/subfields/1702 |
| topics[1].subfield.display_name | Artificial Intelligence |
| topics[1].display_name | Reinforcement Learning in Robotics |
| topics[2].id | https://openalex.org/T10603 |
| topics[2].field.id | https://openalex.org/fields/22 |
| topics[2].field.display_name | Engineering |
| topics[2].score | 0.878600001335144 |
| topics[2].domain.id | https://openalex.org/domains/3 |
| topics[2].domain.display_name | Physical Sciences |
| topics[2].subfield.id | https://openalex.org/subfields/2208 |
| topics[2].subfield.display_name | Electrical and Electronic Engineering |
| topics[2].display_name | Smart Grid Energy Management |
| is_xpac | False |
| apc_list | |
| apc_paid | |
| concepts[0].id | https://openalex.org/C97541855 |
| concepts[0].level | 2 |
| concepts[0].score | 0.8277811408042908 |
| concepts[0].wikidata | https://www.wikidata.org/wiki/Q830687 |
| concepts[0].display_name | Reinforcement learning |
| concepts[1].id | https://openalex.org/C184670325 |
| concepts[1].level | 2 |
| concepts[1].score | 0.7013992071151733 |
| concepts[1].wikidata | https://www.wikidata.org/wiki/Q512604 |
| concepts[1].display_name | Loop (graph theory) |
| concepts[2].id | https://openalex.org/C2780626000 |
| concepts[2].level | 2 |
| concepts[2].score | 0.6859395503997803 |
| concepts[2].wikidata | https://www.wikidata.org/wiki/Q5936775 |
| concepts[2].display_name | Human-in-the-loop |
| concepts[3].id | https://openalex.org/C41008148 |
| concepts[3].level | 0 |
| concepts[3].score | 0.5762977004051208 |
| concepts[3].wikidata | https://www.wikidata.org/wiki/Q21198 |
| concepts[3].display_name | Computer science |
| concepts[4].id | https://openalex.org/C67203356 |
| concepts[4].level | 2 |
| concepts[4].score | 0.5670349597930908 |
| concepts[4].wikidata | https://www.wikidata.org/wiki/Q1321905 |
| concepts[4].display_name | Reinforcement |
| concepts[5].id | https://openalex.org/C154945302 |
| concepts[5].level | 1 |
| concepts[5].score | 0.3880739212036133 |
| concepts[5].wikidata | https://www.wikidata.org/wiki/Q11660 |
| concepts[5].display_name | Artificial intelligence |
| concepts[6].id | https://openalex.org/C15744967 |
| concepts[6].level | 0 |
| concepts[6].score | 0.20474427938461304 |
| concepts[6].wikidata | https://www.wikidata.org/wiki/Q9418 |
| concepts[6].display_name | Psychology |
| concepts[7].id | https://openalex.org/C33923547 |
| concepts[7].level | 0 |
| concepts[7].score | 0.18085989356040955 |
| concepts[7].wikidata | https://www.wikidata.org/wiki/Q395 |
| concepts[7].display_name | Mathematics |
| concepts[8].id | https://openalex.org/C77805123 |
| concepts[8].level | 1 |
| concepts[8].score | 0.11284264922142029 |
| concepts[8].wikidata | https://www.wikidata.org/wiki/Q161272 |
| concepts[8].display_name | Social psychology |
| concepts[9].id | https://openalex.org/C114614502 |
| concepts[9].level | 1 |
| concepts[9].score | 0.07737880945205688 |
| concepts[9].wikidata | https://www.wikidata.org/wiki/Q76592 |
| concepts[9].display_name | Combinatorics |
| keywords[0].id | https://openalex.org/keywords/reinforcement-learning |
| keywords[0].score | 0.8277811408042908 |
| keywords[0].display_name | Reinforcement learning |
| keywords[1].id | https://openalex.org/keywords/loop |
| keywords[1].score | 0.7013992071151733 |
| keywords[1].display_name | Loop (graph theory) |
| keywords[2].id | https://openalex.org/keywords/human-in-the-loop |
| keywords[2].score | 0.6859395503997803 |
| keywords[2].display_name | Human-in-the-loop |
| keywords[3].id | https://openalex.org/keywords/computer-science |
| keywords[3].score | 0.5762977004051208 |
| keywords[3].display_name | Computer science |
| keywords[4].id | https://openalex.org/keywords/reinforcement |
| keywords[4].score | 0.5670349597930908 |
| keywords[4].display_name | Reinforcement |
| keywords[5].id | https://openalex.org/keywords/artificial-intelligence |
| keywords[5].score | 0.3880739212036133 |
| keywords[5].display_name | Artificial intelligence |
| keywords[6].id | https://openalex.org/keywords/psychology |
| keywords[6].score | 0.20474427938461304 |
| keywords[6].display_name | Psychology |
| keywords[7].id | https://openalex.org/keywords/mathematics |
| keywords[7].score | 0.18085989356040955 |
| keywords[7].display_name | Mathematics |
| keywords[8].id | https://openalex.org/keywords/social-psychology |
| keywords[8].score | 0.11284264922142029 |
| keywords[8].display_name | Social psychology |
| keywords[9].id | https://openalex.org/keywords/combinatorics |
| keywords[9].score | 0.07737880945205688 |
| keywords[9].display_name | Combinatorics |
| language | en |
| locations[0].id | pmh:oai:arXiv.org:2405.00746 |
| locations[0].is_oa | True |
| locations[0].source.id | https://openalex.org/S4306400194 |
| locations[0].source.issn | |
| locations[0].source.type | repository |
| locations[0].source.is_oa | True |
| locations[0].source.issn_l | |
| locations[0].source.is_core | False |
| locations[0].source.is_in_doaj | False |
| locations[0].source.display_name | arXiv (Cornell University) |
| locations[0].source.host_organization | https://openalex.org/I205783295 |
| locations[0].source.host_organization_name | Cornell University |
| locations[0].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[0].license | |
| locations[0].pdf_url | https://arxiv.org/pdf/2405.00746 |
| locations[0].version | submittedVersion |
| locations[0].raw_type | text |
| locations[0].license_id | |
| locations[0].is_accepted | False |
| locations[0].is_published | False |
| locations[0].raw_source_name | |
| locations[0].landing_page_url | http://arxiv.org/abs/2405.00746 |
| locations[1].id | doi:10.48550/arxiv.2405.00746 |
| locations[1].is_oa | True |
| locations[1].source.id | https://openalex.org/S4306400194 |
| locations[1].source.issn | |
| locations[1].source.type | repository |
| locations[1].source.is_oa | True |
| locations[1].source.issn_l | |
| locations[1].source.is_core | False |
| locations[1].source.is_in_doaj | False |
| locations[1].source.display_name | arXiv (Cornell University) |
| locations[1].source.host_organization | https://openalex.org/I205783295 |
| locations[1].source.host_organization_name | Cornell University |
| locations[1].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[1].license | cc-by |
| locations[1].pdf_url | |
| locations[1].version | |
| locations[1].raw_type | article |
| locations[1].license_id | https://openalex.org/licenses/cc-by |
| locations[1].is_accepted | False |
| locations[1].is_published | |
| locations[1].raw_source_name | |
| locations[1].landing_page_url | https://doi.org/10.48550/arxiv.2405.00746 |
| indexed_in | arxiv, datacite |
| authorships[0].author.id | https://openalex.org/A5010909043 |
| authorships[0].author.orcid | https://orcid.org/0009-0002-4024-4969 |
| authorships[0].author.display_name | Calarina Muslimani |
| authorships[0].author_position | first |
| authorships[0].raw_author_name | Muslimani, Calarina |
| authorships[0].is_corresponding | False |
| authorships[1].author.id | https://openalex.org/A5070914351 |
| authorships[1].author.orcid | https://orcid.org/0000-0001-8946-0211 |
| authorships[1].author.display_name | Matthew E. Taylor |
| authorships[1].author_position | last |
| authorships[1].raw_author_name | Taylor, Matthew E. |
| authorships[1].is_corresponding | False |
| has_content.pdf | False |
| has_content.grobid_xml | False |
| is_paratext | False |
| open_access.is_oa | True |
| open_access.oa_url | https://arxiv.org/pdf/2405.00746 |
| open_access.oa_status | green |
| open_access.any_repository_has_fulltext | False |
| created_date | 2025-10-10T00:00:00 |
| display_name | Leveraging Sub-Optimal Data for Human-in-the-Loop Reinforcement Learning |
| has_fulltext | False |
| is_retracted | False |
| updated_date | 2025-11-06T06:51:31.235846 |
| primary_topic.id | https://openalex.org/T10524 |
| primary_topic.field.id | https://openalex.org/fields/22 |
| primary_topic.field.display_name | Engineering |
| primary_topic.score | 0.8877999782562256 |
| primary_topic.domain.id | https://openalex.org/domains/3 |
| primary_topic.domain.display_name | Physical Sciences |
| primary_topic.subfield.id | https://openalex.org/subfields/2207 |
| primary_topic.subfield.display_name | Control and Systems Engineering |
| primary_topic.display_name | Traffic control and management |
| related_works | https://openalex.org/W2920061524, https://openalex.org/W4310083477, https://openalex.org/W2328553770, https://openalex.org/W1977959518, https://openalex.org/W2038908348, https://openalex.org/W2107890255, https://openalex.org/W4367173559, https://openalex.org/W2902961658, https://openalex.org/W2782058284, https://openalex.org/W3103937890 |
| cited_by_count | 0 |
| locations_count | 2 |
| best_oa_location.id | pmh:oai:arXiv.org:2405.00746 |
| best_oa_location.is_oa | True |
| best_oa_location.source.id | https://openalex.org/S4306400194 |
| best_oa_location.source.issn | |
| best_oa_location.source.type | repository |
| best_oa_location.source.is_oa | True |
| best_oa_location.source.issn_l | |
| best_oa_location.source.is_core | False |
| best_oa_location.source.is_in_doaj | False |
| best_oa_location.source.display_name | arXiv (Cornell University) |
| best_oa_location.source.host_organization | https://openalex.org/I205783295 |
| best_oa_location.source.host_organization_name | Cornell University |
| best_oa_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| best_oa_location.license | |
| best_oa_location.pdf_url | https://arxiv.org/pdf/2405.00746 |
| best_oa_location.version | submittedVersion |
| best_oa_location.raw_type | text |
| best_oa_location.license_id | |
| best_oa_location.is_accepted | False |
| best_oa_location.is_published | False |
| best_oa_location.raw_source_name | |
| best_oa_location.landing_page_url | http://arxiv.org/abs/2405.00746 |
| primary_location.id | pmh:oai:arXiv.org:2405.00746 |
| primary_location.is_oa | True |
| primary_location.source.id | https://openalex.org/S4306400194 |
| primary_location.source.issn | |
| primary_location.source.type | repository |
| primary_location.source.is_oa | True |
| primary_location.source.issn_l | |
| primary_location.source.is_core | False |
| primary_location.source.is_in_doaj | False |
| primary_location.source.display_name | arXiv (Cornell University) |
| primary_location.source.host_organization | https://openalex.org/I205783295 |
| primary_location.source.host_organization_name | Cornell University |
| primary_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| primary_location.license | |
| primary_location.pdf_url | https://arxiv.org/pdf/2405.00746 |
| primary_location.version | submittedVersion |
| primary_location.raw_type | text |
| primary_location.license_id | |
| primary_location.is_accepted | False |
| primary_location.is_published | False |
| primary_location.raw_source_name | |
| primary_location.landing_page_url | http://arxiv.org/abs/2405.00746 |
| publication_date | 2024-04-30 |
| publication_year | 2024 |
| referenced_works_count | 0 |
| abstract_inverted_index | (machine-readable positional word index of the abstract; readable abstract given above) |
| cited_by_percentile_year | |
| countries_distinct_count | 0 |
| institutions_distinct_count | 2 |
| citation_normalized_percentile | |
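OpenAlex ships abstracts as an `abstract_inverted_index`: a mapping from each word to the list of positions at which it occurs. A small helper can rebuild the plain-text abstract from that index; the function name below is illustrative.

```python
def reconstruct_abstract(inverted_index):
    """Rebuild an abstract from an OpenAlex abstract_inverted_index,
    which maps each word to the list of positions where it appears."""
    # Collect (position, word) pairs, then emit words in position order.
    pairs = [(pos, word)
             for word, positions in inverted_index.items()
             for pos in positions]
    return " ".join(word for _, word in sorted(pairs))

# Toy usage with a fragment of the index shape shown in the payload above.
sample = {"useful": [2], "To": [0], "create": [1]}
print(reconstruct_abstract(sample))  # → To create useful
```

Sorting on the integer positions restores the original word order regardless of the dictionary's key order, which is why the index can be stored compactly without losing the text.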