Enhancing Efficiency and Exploration in Reinforcement Learning for LLMs
2025 · Open Access · DOI: https://doi.org/10.48550/arxiv.2505.18573
Reasoning large language models (LLMs) excel at complex tasks, which has drawn significant attention to reinforcement learning (RL) for LLMs. However, existing approaches allocate an equal number of rollouts to every question during RL training, which is inefficient: training on simple questions yields limited gains, whereas challenging questions need more rollouts before a correct answer is sampled. Furthermore, while RL improves response precision, it limits the model's exploration ability, potentially resulting in a performance cap below that of the base model prior to RL. To address these issues, we propose a mechanism that dynamically allocates rollout budgets based on problem difficulty, enabling more efficient RL training. Additionally, we introduce an adaptive dynamic temperature adjustment strategy that keeps the entropy at a stable level, thereby encouraging sufficient exploration. This enables LLMs to improve response precision while preserving their exploratory ability to uncover potentially correct pathways. The code and data are available at: https://github.com/LiaoMengqi/E3-RL4LLMs
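The two ideas in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not say how difficulty is measured or how the temperature is updated, so this sketch assumes difficulty is one minus an estimated pass rate, splits a fixed rollout budget in proportion to it, and nudges the sampling temperature with a simple proportional rule toward a target entropy. The names `allocate_rollouts`, `token_entropy`, `adjust_temperature` and all default values are hypothetical.

```python
import math
from typing import List


def allocate_rollouts(pass_rates: List[float], total_budget: int,
                      min_rollouts: int = 2) -> List[int]:
    """Split a fixed rollout budget so that harder questions get more rollouts.

    Assumption: difficulty = 1 - pass_rate, where pass_rate is an estimate
    from earlier rollouts. Every question keeps at least `min_rollouts`.
    """
    n = len(pass_rates)
    if total_budget < n * min_rollouts:
        raise ValueError("budget too small for the per-question minimum")
    difficulty = [1.0 - p for p in pass_rates]
    spare = total_budget - n * min_rollouts
    total_difficulty = sum(difficulty)
    if total_difficulty == 0:
        shares = [spare / n] * n  # all questions look solved: split evenly
    else:
        shares = [spare * d / total_difficulty for d in difficulty]
    budgets = [min_rollouts + int(s) for s in shares]  # int() floors here
    # Hand out the rollouts lost to flooring, largest fractional part first.
    leftover = total_budget - sum(budgets)
    order = sorted(range(n), key=lambda i: shares[i] - int(shares[i]),
                   reverse=True)
    for i in order[:leftover]:
        budgets[i] += 1
    return budgets


def token_entropy(probs: List[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def adjust_temperature(temperature: float, entropy: float,
                       target_entropy: float, step: float = 0.05,
                       t_min: float = 0.5, t_max: float = 1.5) -> float:
    """Raise the temperature when entropy is below target, lower it when above.

    Assumption: a proportional update clipped to [t_min, t_max].
    """
    new_t = temperature + step * (target_entropy - entropy)
    return max(t_min, min(t_max, new_t))


if __name__ == "__main__":
    # Three questions with estimated pass rates of 90%, 50% and 10%.
    print(allocate_rollouts([0.9, 0.5, 0.1], total_budget=24))
    # Measured entropy is below the target, so the temperature goes up.
    print(adjust_temperature(1.0, entropy=0.2, target_entropy=0.5))
```

In a training loop, the budgets would be recomputed as pass-rate estimates change, and the temperature would be updated from the mean token entropy of the most recent batch of rollouts.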
- Type: preprint
- Language: en
- Landing Page: http://arxiv.org/abs/2505.18573 (PDF: https://arxiv.org/pdf/2505.18573)
- OA Status: green
- OpenAlex ID: https://openalex.org/W4414581433
Raw OpenAlex JSON
- OpenAlex ID: https://openalex.org/W4414581433 (canonical identifier for this work in OpenAlex)
- DOI: https://doi.org/10.48550/arxiv.2505.18573 (Digital Object Identifier)
- Title: Enhancing Efficiency and Exploration in Reinforcement Learning for LLMs (work title)
- Type: preprint (OpenAlex work type)
- Language: en (primary language)
- Publication year: 2025 (year of publication)
- Publication date: 2025-05-24 (full publication date if available)
- Authors: Mengqi Liao, Xiangyu Xi, Ruinian Chen, Jia Leng, Yangen Hu, Ke Zeng, Shuai Liu, Huaiyu Wan (list of authors in order)
- Landing page: https://arxiv.org/abs/2505.18573 (publisher landing page)
- PDF URL: https://arxiv.org/pdf/2505.18573 (direct link to full-text PDF)
- Open access: Yes (whether a free full text is available)
- OA status: green (open access status per OpenAlex)
- OA URL: https://arxiv.org/pdf/2505.18573 (direct OA link when available)
- Cited by: 0 (total citation count in OpenAlex)
Full payload
| Field | Value |
|---|---|
| id | https://openalex.org/W4414581433 |
| doi | https://doi.org/10.48550/arxiv.2505.18573 |
| ids.doi | https://doi.org/10.48550/arxiv.2505.18573 |
| ids.openalex | https://openalex.org/W4414581433 |
| fwci | 0.0 |
| type | preprint |
| title | Enhancing Efficiency and Exploration in Reinforcement Learning for LLMs |
| biblio.issue | |
| biblio.volume | |
| biblio.last_page | |
| biblio.first_page | |
| topics[0].id | https://openalex.org/T10551 |
| topics[0].field.id | https://openalex.org/fields/22 |
| topics[0].field.display_name | Engineering |
| topics[0].score | 0.9732000231742859 |
| topics[0].domain.id | https://openalex.org/domains/3 |
| topics[0].domain.display_name | Physical Sciences |
| topics[0].subfield.id | https://openalex.org/subfields/2209 |
| topics[0].subfield.display_name | Industrial and Manufacturing Engineering |
| topics[0].display_name | Scheduling and Optimization Algorithms |
| is_xpac | False |
| apc_list | |
| apc_paid | |
| language | en |
| locations[0].id | pmh:oai:arXiv.org:2505.18573 |
| locations[0].is_oa | True |
| locations[0].source.id | https://openalex.org/S4306400194 |
| locations[0].source.issn | |
| locations[0].source.type | repository |
| locations[0].source.is_oa | True |
| locations[0].source.issn_l | |
| locations[0].source.is_core | False |
| locations[0].source.is_in_doaj | False |
| locations[0].source.display_name | arXiv (Cornell University) |
| locations[0].source.host_organization | https://openalex.org/I205783295 |
| locations[0].source.host_organization_name | Cornell University |
| locations[0].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[0].license | |
| locations[0].pdf_url | https://arxiv.org/pdf/2505.18573 |
| locations[0].version | submittedVersion |
| locations[0].raw_type | text |
| locations[0].license_id | |
| locations[0].is_accepted | False |
| locations[0].is_published | False |
| locations[0].raw_source_name | |
| locations[0].landing_page_url | http://arxiv.org/abs/2505.18573 |
| locations[1].id | doi:10.48550/arxiv.2505.18573 |
| locations[1].is_oa | True |
| locations[1].source.id | https://openalex.org/S4306400194 |
| locations[1].source.issn | |
| locations[1].source.type | repository |
| locations[1].source.is_oa | True |
| locations[1].source.issn_l | |
| locations[1].source.is_core | False |
| locations[1].source.is_in_doaj | False |
| locations[1].source.display_name | arXiv (Cornell University) |
| locations[1].source.host_organization | https://openalex.org/I205783295 |
| locations[1].source.host_organization_name | Cornell University |
| locations[1].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[1].license | cc-by |
| locations[1].pdf_url | |
| locations[1].version | |
| locations[1].raw_type | article |
| locations[1].license_id | https://openalex.org/licenses/cc-by |
| locations[1].is_accepted | False |
| locations[1].is_published | |
| locations[1].raw_source_name | |
| locations[1].landing_page_url | https://doi.org/10.48550/arxiv.2505.18573 |
| indexed_in | arxiv, datacite |
| authorships[0].author.id | https://openalex.org/A5074505899 |
| authorships[0].author.orcid | https://orcid.org/0000-0001-7287-7731 |
| authorships[0].author.display_name | Mengqi Liao |
| authorships[0].author_position | first |
| authorships[0].raw_author_name | Liao, Mengqi |
| authorships[0].is_corresponding | False |
| authorships[1].author.id | https://openalex.org/A5119753576 |
| authorships[1].author.orcid | |
| authorships[1].author.display_name | Xiangyu Xi |
| authorships[1].author_position | middle |
| authorships[1].raw_author_name | Xi, Xiangyu |
| authorships[1].is_corresponding | False |
| authorships[2].author.id | https://openalex.org/A5100754344 |
| authorships[2].author.orcid | https://orcid.org/0000-0002-8371-8629 |
| authorships[2].author.display_name | Robert Chen |
| authorships[2].author_position | middle |
| authorships[2].raw_author_name | Chen, Ruinian |
| authorships[2].is_corresponding | False |
| authorships[3].author.id | https://openalex.org/A5109945091 |
| authorships[3].author.orcid | |
| authorships[3].author.display_name | Jia Leng |
| authorships[3].author_position | middle |
| authorships[3].raw_author_name | Leng, Jia |
| authorships[3].is_corresponding | False |
| authorships[4].author.id | https://openalex.org/A5110942658 |
| authorships[4].author.orcid | https://orcid.org/0009-0001-2746-4935 |
| authorships[4].author.display_name | Yunlei Hu |
| authorships[4].author_position | middle |
| authorships[4].raw_author_name | Hu, Yangen |
| authorships[4].is_corresponding | False |
| authorships[5].author.id | https://openalex.org/A5101513204 |
| authorships[5].author.orcid | https://orcid.org/0000-0002-4398-3612 |
| authorships[5].author.display_name | Ke Zeng |
| authorships[5].author_position | middle |
| authorships[5].raw_author_name | Zeng, Ke |
| authorships[5].is_corresponding | False |
| authorships[6].author.id | https://openalex.org/A5090788114 |
| authorships[6].author.orcid | https://orcid.org/0000-0001-9909-0664 |
| authorships[6].author.display_name | Shuai Liu |
| authorships[6].author_position | middle |
| authorships[6].raw_author_name | Liu, Shuai |
| authorships[6].is_corresponding | False |
| authorships[7].author.id | https://openalex.org/A5065949777 |
| authorships[7].author.orcid | https://orcid.org/0000-0002-0501-9363 |
| authorships[7].author.display_name | Huaiyu Wan |
| authorships[7].author_position | last |
| authorships[7].raw_author_name | Wan, Huaiyu |
| authorships[7].is_corresponding | False |
| has_content.pdf | False |
| has_content.grobid_xml | False |
| is_paratext | False |
| open_access.is_oa | True |
| open_access.oa_url | https://arxiv.org/pdf/2505.18573 |
| open_access.oa_status | green |
| open_access.any_repository_has_fulltext | False |
| created_date | 2025-10-10T00:00:00 |
| display_name | Enhancing Efficiency and Exploration in Reinforcement Learning for LLMs |
| has_fulltext | False |
| is_retracted | False |
| updated_date | 2025-11-06T06:51:31.235846 |
| primary_topic.id | https://openalex.org/T10551 |
| primary_topic.field.id | https://openalex.org/fields/22 |
| primary_topic.field.display_name | Engineering |
| primary_topic.score | 0.9732000231742859 |
| primary_topic.domain.id | https://openalex.org/domains/3 |
| primary_topic.domain.display_name | Physical Sciences |
| primary_topic.subfield.id | https://openalex.org/subfields/2209 |
| primary_topic.subfield.display_name | Industrial and Manufacturing Engineering |
| primary_topic.display_name | Scheduling and Optimization Algorithms |
| cited_by_count | 0 |
| locations_count | 2 |
| best_oa_location.id | pmh:oai:arXiv.org:2505.18573 |
| best_oa_location.is_oa | True |
| best_oa_location.source.id | https://openalex.org/S4306400194 |
| best_oa_location.source.issn | |
| best_oa_location.source.type | repository |
| best_oa_location.source.is_oa | True |
| best_oa_location.source.issn_l | |
| best_oa_location.source.is_core | False |
| best_oa_location.source.is_in_doaj | False |
| best_oa_location.source.display_name | arXiv (Cornell University) |
| best_oa_location.source.host_organization | https://openalex.org/I205783295 |
| best_oa_location.source.host_organization_name | Cornell University |
| best_oa_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| best_oa_location.license | |
| best_oa_location.pdf_url | https://arxiv.org/pdf/2505.18573 |
| best_oa_location.version | submittedVersion |
| best_oa_location.raw_type | text |
| best_oa_location.license_id | |
| best_oa_location.is_accepted | False |
| best_oa_location.is_published | False |
| best_oa_location.raw_source_name | |
| best_oa_location.landing_page_url | http://arxiv.org/abs/2505.18573 |
| primary_location.id | pmh:oai:arXiv.org:2505.18573 |
| primary_location.is_oa | True |
| primary_location.source.id | https://openalex.org/S4306400194 |
| primary_location.source.issn | |
| primary_location.source.type | repository |
| primary_location.source.is_oa | True |
| primary_location.source.issn_l | |
| primary_location.source.is_core | False |
| primary_location.source.is_in_doaj | False |
| primary_location.source.display_name | arXiv (Cornell University) |
| primary_location.source.host_organization | https://openalex.org/I205783295 |
| primary_location.source.host_organization_name | Cornell University |
| primary_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| primary_location.license | |
| primary_location.pdf_url | https://arxiv.org/pdf/2505.18573 |
| primary_location.version | submittedVersion |
| primary_location.raw_type | text |
| primary_location.license_id | |
| primary_location.is_accepted | False |
| primary_location.is_published | False |
| primary_location.raw_source_name | |
| primary_location.landing_page_url | http://arxiv.org/abs/2505.18573 |
| publication_date | 2025-05-24 |
| publication_year | 2025 |
| referenced_works_count | 0 |
| cited_by_percentile_year | |
| countries_distinct_count | 0 |
| institutions_distinct_count | 8 |
| citation_normalized_percentile | |