Highway Reinforcement Learning
2024 · Open Access · DOI: https://doi.org/10.48550/arxiv.2405.18289
Learning from multi-step off-policy data collected by a set of policies is a core problem of reinforcement learning (RL). Approaches based on importance sampling (IS) often suffer from large variances due to products of IS ratios. Typical IS-free methods, such as $n$-step Q-learning, look ahead for $n$ time steps along the trajectory of actions (where $n$ is called the lookahead depth) and utilize off-policy data directly without any additional adjustment. They work well for proper choices of $n$. We show, however, that such IS-free methods underestimate the optimal value function (VF), especially for large $n$, restricting their capacity to efficiently utilize information from distant future time steps. To overcome this problem, we introduce a novel, IS-free, multi-step off-policy method that avoids the underestimation issue and converges to the optimal VF. At its core lies a simple but non-trivial highway gate, which controls the information flow from the distant future by comparing it to a threshold. The highway gate guarantees convergence to the optimal VF for arbitrary $n$ and arbitrary behavioral policies. It gives rise to a novel family of off-policy RL algorithms that safely learn even when $n$ is very large, facilitating rapid credit assignment from the far future to the past. On tasks with greatly delayed rewards, including video games where the reward is given only at the end of the game, our new methods outperform many existing multi-step off-policy algorithms.
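The underestimation described in the abstract can be illustrated with a small sketch. It computes the standard uncorrected $n$-step Q-learning target on a hypothetical one-state task with a "good" action (reward 1) and a "bad" action (reward 0), where the behavior policy takes the bad action after the first step. The toy task, the function names, and in particular `gated_target` are assumptions made for illustration: the abstract only says that the gate compares far-future information to a threshold, so using the one-step target as that threshold is a guess at the idea, not the paper's definition.

```python
from typing import Sequence


def n_step_target(rewards: Sequence[float], bootstrap_value: float, gamma: float) -> float:
    """Uncorrected n-step target: discounted rewards plus a discounted bootstrap.

    rewards holds r_t, ..., r_{t+n-1}; bootstrap_value stands for
    max_a Q(s_{t+n}, a). No importance-sampling correction is applied.
    """
    n = len(rewards)
    discounted = sum(gamma ** k * r for k, r in enumerate(rewards))
    return discounted + gamma ** n * bootstrap_value


def gated_target(rewards: Sequence[float], value_next: float,
                 value_after_n: float, gamma: float) -> float:
    """Hypothetical gate (an assumption, not the paper's operator).

    The one-step target acts as the threshold: the n-step target is used
    only when it is at least as large as the one-step target.
    """
    one_step = n_step_target(rewards[:1], value_next, gamma)
    multi_step = n_step_target(rewards, value_after_n, gamma)
    return max(one_step, multi_step)


# Toy task: one state, reward 1 for "good", 0 for "bad", gamma = 0.5.
gamma = 0.5
v_star = 1.0 / (1.0 - gamma)        # optimal value: always act "good" -> 2.0
behavior_rewards = [1.0, 0.0, 0.0]  # "good" first, then two "bad" actions

one_step = n_step_target(behavior_rewards[:1], v_star, gamma)
three_step = n_step_target(behavior_rewards, v_star, gamma)
gated = gated_target(behavior_rewards, v_star, v_star, gamma)

print(one_step)    # 2.0, equal to the optimal Q-value of "good"
print(three_step)  # 1.25, below the optimal value despite an exact bootstrap
print(gated)       # 2.0, the gate rejects the underestimating 3-step target
```

Even with an exact bootstrap value, the 3-step target falls below the optimal value because it absorbs the behavior policy's poor later actions; the gap would grow with larger $n$ on this task.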
Related Topics
- Type: preprint
- Language: en
- Landing Page: http://arxiv.org/abs/2405.18289, https://arxiv.org/pdf/2405.18289
- OA Status: green
- Related Works: 10
- OpenAlex ID: https://openalex.org/W4399152062
Raw OpenAlex JSON
- OpenAlex ID: https://openalex.org/W4399152062 (canonical identifier for this work in OpenAlex)
- DOI: https://doi.org/10.48550/arxiv.2405.18289 (Digital Object Identifier)
- Title: Highway Reinforcement Learning (work title)
- Type: preprint (OpenAlex work type)
- Language: en (primary language)
- Publication year: 2024 (year of publication)
- Publication date: 2024-05-28 (full publication date if available)
- Authors: Yuhui Wang, Miroslav Štrupl, Francesco Faccio, Qingyuan Wu, Haozhe Liu, Michał Grudzień, Xiaoyang Tan, Jürgen Schmidhuber (list of authors in order)
- Landing page: https://arxiv.org/abs/2405.18289 (publisher landing page)
- PDF URL: https://arxiv.org/pdf/2405.18289 (direct link to full text PDF)
- Open access: Yes (whether a free full text is available)
- OA status: green (open access status per OpenAlex)
- OA URL: https://arxiv.org/pdf/2405.18289 (direct OA link when available)
- Concepts: Reinforcement, Reinforcement learning, Psychology, Computer science, Artificial intelligence, Social psychology (top concepts, i.e. fields/topics, attached by OpenAlex)
- Cited by: 0 (total citation count in OpenAlex)
- Related works (count): 10 (other works algorithmically related by OpenAlex)
Full payload
| Field | Value |
|---|---|
| id | https://openalex.org/W4399152062 |
| doi | https://doi.org/10.48550/arxiv.2405.18289 |
| ids.doi | https://doi.org/10.48550/arxiv.2405.18289 |
| ids.openalex | https://openalex.org/W4399152062 |
| fwci | |
| type | preprint |
| title | Highway Reinforcement Learning |
| biblio.issue | |
| biblio.volume | |
| biblio.last_page | |
| biblio.first_page | |
| topics[0].id | https://openalex.org/T10524 |
| topics[0].field.id | https://openalex.org/fields/22 |
| topics[0].field.display_name | Engineering |
| topics[0].score | 0.9165999889373779 |
| topics[0].domain.id | https://openalex.org/domains/3 |
| topics[0].domain.display_name | Physical Sciences |
| topics[0].subfield.id | https://openalex.org/subfields/2207 |
| topics[0].subfield.display_name | Control and Systems Engineering |
| topics[0].display_name | Traffic control and management |
| is_xpac | False |
| apc_list | |
| apc_paid | |
| concepts[0].id | https://openalex.org/C67203356 |
| concepts[0].level | 2 |
| concepts[0].score | 0.7710175514221191 |
| concepts[0].wikidata | https://www.wikidata.org/wiki/Q1321905 |
| concepts[0].display_name | Reinforcement |
| concepts[1].id | https://openalex.org/C97541855 |
| concepts[1].level | 2 |
| concepts[1].score | 0.6022381782531738 |
| concepts[1].wikidata | https://www.wikidata.org/wiki/Q830687 |
| concepts[1].display_name | Reinforcement learning |
| concepts[2].id | https://openalex.org/C15744967 |
| concepts[2].level | 0 |
| concepts[2].score | 0.32271337509155273 |
| concepts[2].wikidata | https://www.wikidata.org/wiki/Q9418 |
| concepts[2].display_name | Psychology |
| concepts[3].id | https://openalex.org/C41008148 |
| concepts[3].level | 0 |
| concepts[3].score | 0.3090507984161377 |
| concepts[3].wikidata | https://www.wikidata.org/wiki/Q21198 |
| concepts[3].display_name | Computer science |
| concepts[4].id | https://openalex.org/C154945302 |
| concepts[4].level | 1 |
| concepts[4].score | 0.19933292269706726 |
| concepts[4].wikidata | https://www.wikidata.org/wiki/Q11660 |
| concepts[4].display_name | Artificial intelligence |
| concepts[5].id | https://openalex.org/C77805123 |
| concepts[5].level | 1 |
| concepts[5].score | 0.15016192197799683 |
| concepts[5].wikidata | https://www.wikidata.org/wiki/Q161272 |
| concepts[5].display_name | Social psychology |
| keywords[0].id | https://openalex.org/keywords/reinforcement |
| keywords[0].score | 0.7710175514221191 |
| keywords[0].display_name | Reinforcement |
| keywords[1].id | https://openalex.org/keywords/reinforcement-learning |
| keywords[1].score | 0.6022381782531738 |
| keywords[1].display_name | Reinforcement learning |
| keywords[2].id | https://openalex.org/keywords/psychology |
| keywords[2].score | 0.32271337509155273 |
| keywords[2].display_name | Psychology |
| keywords[3].id | https://openalex.org/keywords/computer-science |
| keywords[3].score | 0.3090507984161377 |
| keywords[3].display_name | Computer science |
| keywords[4].id | https://openalex.org/keywords/artificial-intelligence |
| keywords[4].score | 0.19933292269706726 |
| keywords[4].display_name | Artificial intelligence |
| keywords[5].id | https://openalex.org/keywords/social-psychology |
| keywords[5].score | 0.15016192197799683 |
| keywords[5].display_name | Social psychology |
| language | en |
| locations[0].id | pmh:oai:arXiv.org:2405.18289 |
| locations[0].is_oa | True |
| locations[0].source.id | https://openalex.org/S4306400194 |
| locations[0].source.issn | |
| locations[0].source.type | repository |
| locations[0].source.is_oa | True |
| locations[0].source.issn_l | |
| locations[0].source.is_core | False |
| locations[0].source.is_in_doaj | False |
| locations[0].source.display_name | arXiv (Cornell University) |
| locations[0].source.host_organization | https://openalex.org/I205783295 |
| locations[0].source.host_organization_name | Cornell University |
| locations[0].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[0].license | |
| locations[0].pdf_url | https://arxiv.org/pdf/2405.18289 |
| locations[0].version | submittedVersion |
| locations[0].raw_type | text |
| locations[0].license_id | |
| locations[0].is_accepted | False |
| locations[0].is_published | False |
| locations[0].raw_source_name | |
| locations[0].landing_page_url | http://arxiv.org/abs/2405.18289 |
| locations[1].id | doi:10.48550/arxiv.2405.18289 |
| locations[1].is_oa | True |
| locations[1].source.id | https://openalex.org/S4306400194 |
| locations[1].source.issn | |
| locations[1].source.type | repository |
| locations[1].source.is_oa | True |
| locations[1].source.issn_l | |
| locations[1].source.is_core | False |
| locations[1].source.is_in_doaj | False |
| locations[1].source.display_name | arXiv (Cornell University) |
| locations[1].source.host_organization | https://openalex.org/I205783295 |
| locations[1].source.host_organization_name | Cornell University |
| locations[1].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[1].license | |
| locations[1].pdf_url | |
| locations[1].version | |
| locations[1].raw_type | article |
| locations[1].license_id | |
| locations[1].is_accepted | False |
| locations[1].is_published | |
| locations[1].raw_source_name | |
| locations[1].landing_page_url | https://doi.org/10.48550/arxiv.2405.18289 |
| indexed_in | arxiv, datacite |
| authorships[0].author.id | https://openalex.org/A5100330368 |
| authorships[0].author.orcid | https://orcid.org/0000-0003-4745-8425 |
| authorships[0].author.display_name | Yuhui Wang |
| authorships[0].author_position | first |
| authorships[0].raw_author_name | Wang, Yuhui |
| authorships[0].is_corresponding | False |
| authorships[1].author.id | https://openalex.org/A5086729694 |
| authorships[1].author.orcid | |
| authorships[1].author.display_name | Miroslav Štrupl |
| authorships[1].author_position | middle |
| authorships[1].raw_author_name | Strupl, Miroslav |
| authorships[1].is_corresponding | False |
| authorships[2].author.id | https://openalex.org/A5013289052 |
| authorships[2].author.orcid | |
| authorships[2].author.display_name | Francesco Faccio |
| authorships[2].author_position | middle |
| authorships[2].raw_author_name | Faccio, Francesco |
| authorships[2].is_corresponding | False |
| authorships[3].author.id | https://openalex.org/A5103176170 |
| authorships[3].author.orcid | https://orcid.org/0000-0001-5746-148X |
| authorships[3].author.display_name | Qingyuan Wu |
| authorships[3].author_position | middle |
| authorships[3].raw_author_name | Wu, Qingyuan |
| authorships[3].is_corresponding | False |
| authorships[4].author.id | https://openalex.org/A5006064743 |
| authorships[4].author.orcid | https://orcid.org/0000-0001-5720-2031 |
| authorships[4].author.display_name | Haozhe Liu |
| authorships[4].author_position | middle |
| authorships[4].raw_author_name | Liu, Haozhe |
| authorships[4].is_corresponding | False |
| authorships[5].author.id | https://openalex.org/A5098937265 |
| authorships[5].author.orcid | |
| authorships[5].author.display_name | Michał Grudzień |
| authorships[5].author_position | middle |
| authorships[5].raw_author_name | Grudzień, Michał |
| authorships[5].is_corresponding | False |
| authorships[6].author.id | https://openalex.org/A5004478562 |
| authorships[6].author.orcid | https://orcid.org/0000-0002-2683-8667 |
| authorships[6].author.display_name | Xiaoyang Tan |
| authorships[6].author_position | middle |
| authorships[6].raw_author_name | Tan, Xiaoyang |
| authorships[6].is_corresponding | False |
| authorships[7].author.id | https://openalex.org/A5071172037 |
| authorships[7].author.orcid | |
| authorships[7].author.display_name | Jürgen Schmidhuber |
| authorships[7].author_position | last |
| authorships[7].raw_author_name | Schmidhuber, Jürgen |
| authorships[7].is_corresponding | False |
| has_content.pdf | False |
| has_content.grobid_xml | False |
| is_paratext | False |
| open_access.is_oa | True |
| open_access.oa_url | https://arxiv.org/pdf/2405.18289 |
| open_access.oa_status | green |
| open_access.any_repository_has_fulltext | False |
| created_date | 2025-10-10T00:00:00 |
| display_name | Highway Reinforcement Learning |
| has_fulltext | False |
| is_retracted | False |
| updated_date | 2025-11-06T06:51:31.235846 |
| primary_topic.id | https://openalex.org/T10524 |
| primary_topic.field.id | https://openalex.org/fields/22 |
| primary_topic.field.display_name | Engineering |
| primary_topic.score | 0.9165999889373779 |
| primary_topic.domain.id | https://openalex.org/domains/3 |
| primary_topic.domain.display_name | Physical Sciences |
| primary_topic.subfield.id | https://openalex.org/subfields/2207 |
| primary_topic.subfield.display_name | Control and Systems Engineering |
| primary_topic.display_name | Traffic control and management |
| related_works | https://openalex.org/W4391375266, https://openalex.org/W2748952813, https://openalex.org/W2920061524, https://openalex.org/W4310083477, https://openalex.org/W2328553770, https://openalex.org/W1977959518, https://openalex.org/W2038908348, https://openalex.org/W2107890255, https://openalex.org/W2106552856, https://openalex.org/W2145821588 |
| cited_by_count | 0 |
| locations_count | 2 |
| best_oa_location.id | pmh:oai:arXiv.org:2405.18289 |
| best_oa_location.is_oa | True |
| best_oa_location.source.id | https://openalex.org/S4306400194 |
| best_oa_location.source.issn | |
| best_oa_location.source.type | repository |
| best_oa_location.source.is_oa | True |
| best_oa_location.source.issn_l | |
| best_oa_location.source.is_core | False |
| best_oa_location.source.is_in_doaj | False |
| best_oa_location.source.display_name | arXiv (Cornell University) |
| best_oa_location.source.host_organization | https://openalex.org/I205783295 |
| best_oa_location.source.host_organization_name | Cornell University |
| best_oa_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| best_oa_location.license | |
| best_oa_location.pdf_url | https://arxiv.org/pdf/2405.18289 |
| best_oa_location.version | submittedVersion |
| best_oa_location.raw_type | text |
| best_oa_location.license_id | |
| best_oa_location.is_accepted | False |
| best_oa_location.is_published | False |
| best_oa_location.raw_source_name | |
| best_oa_location.landing_page_url | http://arxiv.org/abs/2405.18289 |
| primary_location.id | pmh:oai:arXiv.org:2405.18289 |
| primary_location.is_oa | True |
| primary_location.source.id | https://openalex.org/S4306400194 |
| primary_location.source.issn | |
| primary_location.source.type | repository |
| primary_location.source.is_oa | True |
| primary_location.source.issn_l | |
| primary_location.source.is_core | False |
| primary_location.source.is_in_doaj | False |
| primary_location.source.display_name | arXiv (Cornell University) |
| primary_location.source.host_organization | https://openalex.org/I205783295 |
| primary_location.source.host_organization_name | Cornell University |
| primary_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| primary_location.license | |
| primary_location.pdf_url | https://arxiv.org/pdf/2405.18289 |
| primary_location.version | submittedVersion |
| primary_location.raw_type | text |
| primary_location.license_id | |
| primary_location.is_accepted | False |
| primary_location.is_published | False |
| primary_location.raw_source_name | |
| primary_location.landing_page_url | http://arxiv.org/abs/2405.18289 |
| publication_date | 2024-05-28 |
| publication_year | 2024 |
| referenced_works_count | 0 |
| abstract_inverted_index.a | 7, 12, 113, 134, 153, 175 |
| abstract_inverted_index.At | 130 |
| abstract_inverted_index.IS | 34 |
| abstract_inverted_index.It | 171 |
| abstract_inverted_index.On | 202 |
| abstract_inverted_index.RL | 180 |
| abstract_inverted_index.To | 107 |
| abstract_inverted_index.VF | 163 |
| abstract_inverted_index.We | 78 |
| abstract_inverted_index.as | 40 |
| abstract_inverted_index.at | 217 |
| abstract_inverted_index.by | 6, 149 |
| abstract_inverted_index.is | 11, 56, 188, 214 |
| abstract_inverted_index.it | 151 |
| abstract_inverted_index.of | 9, 15, 33, 52, 76, 178, 220 |
| abstract_inverted_index.on | 21 |
| abstract_inverted_index.to | 31, 98, 126, 152, 160, 174, 199 |
| abstract_inverted_index.we | 111 |
| abstract_inverted_index.$n$ | 46, 55, 166, 187 |
| abstract_inverted_index.The | 155 |
| abstract_inverted_index.VF. | 129 |
| abstract_inverted_index.and | 61, 124, 167 |
| abstract_inverted_index.any | 67 |
| abstract_inverted_index.but | 136 |
| abstract_inverted_index.due | 30 |
| abstract_inverted_index.end | 219 |
| abstract_inverted_index.far | 197 |
| abstract_inverted_index.for | 45, 73, 92, 164 |
| abstract_inverted_index.its | 131 |
| abstract_inverted_index.new | 224 |
| abstract_inverted_index.our | 223 |
| abstract_inverted_index.set | 8 |
| abstract_inverted_index.the | 50, 58, 86, 121, 127, 142, 146, 161, 196, 200, 212, 218, 221 |
| abstract_inverted_index.$n$, | 94 |
| abstract_inverted_index.$n$. | 77 |
| abstract_inverted_index.(IS) | 24 |
| abstract_inverted_index.They | 70 |
| abstract_inverted_index.core | 13, 132 |
| abstract_inverted_index.data | 4, 64 |
| abstract_inverted_index.even | 185 |
| abstract_inverted_index.flow | 144 |
| abstract_inverted_index.from | 1, 27, 102, 145, 195 |
| abstract_inverted_index.gate | 157 |
| abstract_inverted_index.lies | 133 |
| abstract_inverted_index.look | 43 |
| abstract_inverted_index.many | 227 |
| abstract_inverted_index.only | 216 |
| abstract_inverted_index.rise | 173 |
| abstract_inverted_index.such | 39, 82 |
| abstract_inverted_index.that | 81, 119, 182 |
| abstract_inverted_index.this | 109 |
| abstract_inverted_index.time | 47, 105 |
| abstract_inverted_index.very | 189 |
| abstract_inverted_index.well | 72 |
| abstract_inverted_index.when | 186 |
| abstract_inverted_index.with | 204 |
| abstract_inverted_index.work | 71 |
| abstract_inverted_index.(RL). | 18 |
| abstract_inverted_index.(VF), | 90 |
| abstract_inverted_index.ahead | 44 |
| abstract_inverted_index.along | 49 |
| abstract_inverted_index.based | 20 |
| abstract_inverted_index.game, | 222 |
| abstract_inverted_index.games | 210 |
| abstract_inverted_index.given | 215 |
| abstract_inverted_index.gives | 172 |
| abstract_inverted_index.issue | 123 |
| abstract_inverted_index.large | 28, 93 |
| abstract_inverted_index.learn | 184 |
| abstract_inverted_index.novel | 176 |
| abstract_inverted_index.often | 25 |
| abstract_inverted_index.past. | 201 |
| abstract_inverted_index.rapid | 192 |
| abstract_inverted_index.show, | 79 |
| abstract_inverted_index.steps | 48 |
| abstract_inverted_index.tasks | 203 |
| abstract_inverted_index.their | 96 |
| abstract_inverted_index.value | 88 |
| abstract_inverted_index.video | 209 |
| abstract_inverted_index.where | 211 |
| abstract_inverted_index.which | 140 |
| abstract_inverted_index.(where | 54 |
| abstract_inverted_index.avoids | 120 |
| abstract_inverted_index.called | 57 |
| abstract_inverted_index.credit | 193 |
| abstract_inverted_index.depth) | 60 |
| abstract_inverted_index.family | 177 |
| abstract_inverted_index.future | 104, 148, 198 |
| abstract_inverted_index.gate}, | 139 |
| abstract_inverted_index.large, | 190 |
| abstract_inverted_index.method | 118 |
| abstract_inverted_index.novel, | 114 |
| abstract_inverted_index.proper | 74 |
| abstract_inverted_index.reward | 213 |
| abstract_inverted_index.safely | 183 |
| abstract_inverted_index.simple | 135 |
| abstract_inverted_index.steps. | 106 |
| abstract_inverted_index.suffer | 26 |
| abstract_inverted_index.IS-free | 37, 83 |
| abstract_inverted_index.Typical | 36 |
| abstract_inverted_index.actions | 53 |
| abstract_inverted_index.choices | 75 |
| abstract_inverted_index.delayed | 206 |
| abstract_inverted_index.distant | 103, 147 |
| abstract_inverted_index.greatly | 205 |
| abstract_inverted_index.highway | 156 |
| abstract_inverted_index.methods | 84, 225 |
| abstract_inverted_index.optimal | 87, 128, 162 |
| abstract_inverted_index.problem | 14 |
| abstract_inverted_index.ratios. | 35 |
| abstract_inverted_index.utilize | 62, 100 |
| abstract_inverted_index.without | 66 |
| abstract_inverted_index.$n$-step | 41 |
| abstract_inverted_index.IS-free, | 115 |
| abstract_inverted_index.Learning | 0 |
| abstract_inverted_index.capacity | 97 |
| abstract_inverted_index.controls | 141 |
| abstract_inverted_index.directly | 65 |
| abstract_inverted_index.existing | 228 |
| abstract_inverted_index.function | 89 |
| abstract_inverted_index.however, | 80 |
| abstract_inverted_index.learning | 17 |
| abstract_inverted_index.methods, | 38 |
| abstract_inverted_index.overcome | 108 |
| abstract_inverted_index.policies | 10 |
| abstract_inverted_index.problem, | 110 |
| abstract_inverted_index.products | 32 |
| abstract_inverted_index.rewards, | 207 |
| abstract_inverted_index.sampling | 23 |
| abstract_inverted_index.arbitrary | 165, 168 |
| abstract_inverted_index.collected | 5 |
| abstract_inverted_index.comparing | 150 |
| abstract_inverted_index.converges | 125 |
| abstract_inverted_index.including | 208 |
| abstract_inverted_index.introduce | 112 |
| abstract_inverted_index.lookahead | 59 |
| abstract_inverted_index.policies. | 170 |
| abstract_inverted_index.variances | 29 |
| abstract_inverted_index.Approaches | 19 |
| abstract_inverted_index.additional | 68 |
| abstract_inverted_index.algorithms | 181 |
| abstract_inverted_index.assignment | 194 |
| abstract_inverted_index.behavioral | 169 |
| abstract_inverted_index.especially | 91 |
| abstract_inverted_index.guarantees | 158 |
| abstract_inverted_index.importance | 22 |
| abstract_inverted_index.multi-step | 2, 116, 229 |
| abstract_inverted_index.off-policy | 3, 63, 117, 179, 230 |
| abstract_inverted_index.outperform | 226 |
| abstract_inverted_index.threshold. | 154 |
| abstract_inverted_index.trajectory | 51 |
| abstract_inverted_index.Q-learning, | 42 |
| abstract_inverted_index.adjustment. | 69 |
| abstract_inverted_index.algorithms. | 231 |
| abstract_inverted_index.convergence | 159 |
| abstract_inverted_index.efficiently | 99 |
| abstract_inverted_index.information | 101, 143 |
| abstract_inverted_index.non-trivial | 137 |
| abstract_inverted_index.restricting | 95 |
| abstract_inverted_index.facilitating | 191 |
| abstract_inverted_index.\emph{highway | 138 |
| abstract_inverted_index.reinforcement | 16 |
| abstract_inverted_index.underestimate | 85 |
| abstract_inverted_index.underestimation | 122 |
| cited_by_percentile_year | |
| countries_distinct_count | 0 |
| institutions_distinct_count | 8 |
| citation_normalized_percentile |