SRPO: A Cross-Domain Implementation of Large-Scale Reinforcement Learning on LLM
2025 · Open Access · DOI: https://doi.org/10.48550/arxiv.2504.14286
Recent advances of reasoning models, exemplified by OpenAI's o1 and DeepSeek's R1, highlight the significant potential of Reinforcement Learning (RL) to enhance the reasoning capabilities of Large Language Models (LLMs). However, replicating these advancements across diverse domains remains challenging due to limited methodological transparency. In this work, we present two-Staged history-Resampling Policy Optimization (SRPO), which surpasses the performance of DeepSeek-R1-Zero-32B on the AIME24 and LiveCodeBench benchmarks. SRPO achieves this using the same base model as DeepSeek (i.e. Qwen2.5-32B), using only about 1/10 of the training steps required by DeepSeek-R1-Zero-32B, demonstrating superior efficiency. Building upon Group Relative Policy Optimization (GRPO), we introduce two key methodological innovations: (1) a two-stage cross-domain training paradigm designed to balance the development of mathematical reasoning and coding proficiency, and (2) History Resampling (HR), a technique to address ineffective samples. Our comprehensive experiments validate the effectiveness of our approach, offering valuable insights into scaling LLM reasoning capabilities across diverse tasks.
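
The abstract names GRPO and History Resampling without giving formulas, so the following is only a minimal sketch under stated assumptions. It uses the commonly described GRPO group-relative advantage (each rollout's reward standardized within its group) and shows why a group whose rewards are all identical is "ineffective": every advantage is zero, so the group contributes no learning signal. The filter `resample_history`, its `history` argument, and the choice of population standard deviation are hypothetical illustrations, not the authors' implementation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize each reward within its rollout group (GRPO-style).

    Assumption: population standard deviation; real implementations differ.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def is_informative(rewards):
    """A group yields non-zero advantages only if its rewards are not all equal."""
    return len(set(rewards)) > 1


def resample_history(history):
    """Hypothetical filter: keep prompts whose last rollout group had mixed rewards.

    `history` maps a prompt id to the list of rewards from its latest rollouts.
    """
    return [prompt for prompt, rewards in history.items() if is_informative(rewards)]
```

In this sketch a prompt that every rollout solves, or that every rollout fails, is dropped from the next pass; whether SRPO treats those two cases the same way is not stated in the abstract.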
- Type: preprint
- Language: en
- Landing Page: http://arxiv.org/abs/2504.14286, https://arxiv.org/pdf/2504.14286
- OA Status: green
- OpenAlex ID: https://openalex.org/W4414629531
Raw OpenAlex JSON
- OpenAlex ID: https://openalex.org/W4414629531
- DOI: https://doi.org/10.48550/arxiv.2504.14286
- Title: SRPO: A Cross-Domain Implementation of Large-Scale Reinforcement Learning on LLM
- Type: preprint
- Language: en
- Publication year: 2025
- Publication date: 2025-04-19
- Authors: Xiaojiang Zhang, Jinghui Wang, Zifei Cheng, Wenhao Zhuang, Zheng Lin, Minglei Zhang, Shaojie Wang, Yinghan Cui, Chao Wang, Junyi Peng, Shimiao Jiang, Shiqi Kuang, Shouyu Yin, Chaohang Wen, Haotian Zhang, Bin Chen, Bing Yu
- Landing page: https://arxiv.org/abs/2504.14286
- PDF URL: https://arxiv.org/pdf/2504.14286
- Open access: Yes
- OA status: green
- OA URL: https://arxiv.org/pdf/2504.14286
- Cited by: 0
Full payload
| id | https://openalex.org/W4414629531 |
|---|---|
| doi | https://doi.org/10.48550/arxiv.2504.14286 |
| ids.doi | https://doi.org/10.48550/arxiv.2504.14286 |
| ids.openalex | https://openalex.org/W4414629531 |
| fwci | |
| type | preprint |
| title | SRPO: A Cross-Domain Implementation of Large-Scale Reinforcement Learning on LLM |
| biblio.issue | |
| biblio.volume | |
| biblio.last_page | |
| biblio.first_page | |
| topics[0].id | https://openalex.org/T13999 |
| topics[0].field.id | https://openalex.org/fields/17 |
| topics[0].field.display_name | Computer Science |
| topics[0].score | 0.8129000067710876 |
| topics[0].domain.id | https://openalex.org/domains/3 |
| topics[0].domain.display_name | Physical Sciences |
| topics[0].subfield.id | https://openalex.org/subfields/1710 |
| topics[0].subfield.display_name | Information Systems |
| topics[0].display_name | Digital Rights Management and Security |
| topics[1].id | https://openalex.org/T10456 |
| topics[1].field.id | https://openalex.org/fields/17 |
| topics[1].field.display_name | Computer Science |
| topics[1].score | 0.7346000075340271 |
| topics[1].domain.id | https://openalex.org/domains/3 |
| topics[1].domain.display_name | Physical Sciences |
| topics[1].subfield.id | https://openalex.org/subfields/1702 |
| topics[1].subfield.display_name | Artificial Intelligence |
| topics[1].display_name | Multi-Agent Systems and Negotiation |
| topics[2].id | https://openalex.org/T14011 |
| topics[2].field.id | https://openalex.org/fields/22 |
| topics[2].field.display_name | Engineering |
| topics[2].score | 0.7325000166893005 |
| topics[2].domain.id | https://openalex.org/domains/3 |
| topics[2].domain.display_name | Physical Sciences |
| topics[2].subfield.id | https://openalex.org/subfields/2207 |
| topics[2].subfield.display_name | Control and Systems Engineering |
| topics[2].display_name | Elevator Systems and Control |
| is_xpac | False |
| apc_list | |
| apc_paid | |
| language | en |
| locations[0].id | pmh:oai:arXiv.org:2504.14286 |
| locations[0].is_oa | True |
| locations[0].source.id | https://openalex.org/S4306400194 |
| locations[0].source.issn | |
| locations[0].source.type | repository |
| locations[0].source.is_oa | True |
| locations[0].source.issn_l | |
| locations[0].source.is_core | False |
| locations[0].source.is_in_doaj | False |
| locations[0].source.display_name | arXiv (Cornell University) |
| locations[0].source.host_organization | https://openalex.org/I205783295 |
| locations[0].source.host_organization_name | Cornell University |
| locations[0].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[0].license | |
| locations[0].pdf_url | https://arxiv.org/pdf/2504.14286 |
| locations[0].version | submittedVersion |
| locations[0].raw_type | text |
| locations[0].license_id | |
| locations[0].is_accepted | False |
| locations[0].is_published | False |
| locations[0].raw_source_name | |
| locations[0].landing_page_url | http://arxiv.org/abs/2504.14286 |
| locations[1].id | doi:10.48550/arxiv.2504.14286 |
| locations[1].is_oa | True |
| locations[1].source.id | https://openalex.org/S4306400194 |
| locations[1].source.issn | |
| locations[1].source.type | repository |
| locations[1].source.is_oa | True |
| locations[1].source.issn_l | |
| locations[1].source.is_core | False |
| locations[1].source.is_in_doaj | False |
| locations[1].source.display_name | arXiv (Cornell University) |
| locations[1].source.host_organization | https://openalex.org/I205783295 |
| locations[1].source.host_organization_name | Cornell University |
| locations[1].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[1].license | |
| locations[1].pdf_url | |
| locations[1].version | |
| locations[1].raw_type | article |
| locations[1].license_id | |
| locations[1].is_accepted | False |
| locations[1].is_published | |
| locations[1].raw_source_name | |
| locations[1].landing_page_url | https://doi.org/10.48550/arxiv.2504.14286 |
| indexed_in | arxiv, datacite |
| authorships[0].author.id | https://openalex.org/A5101777723 |
| authorships[0].author.orcid | https://orcid.org/0000-0002-5979-6299 |
| authorships[0].author.display_name | Xiaojiang Zhang |
| authorships[0].author_position | first |
| authorships[0].raw_author_name | Zhang, Xiaojiang |
| authorships[0].is_corresponding | False |
| authorships[1].author.id | https://openalex.org/A5100685092 |
| authorships[1].author.orcid | https://orcid.org/0000-0002-8462-3322 |
| authorships[1].author.display_name | Jinghui Wang |
| authorships[1].author_position | middle |
| authorships[1].raw_author_name | Wang, Jinghui |
| authorships[1].is_corresponding | False |
| authorships[2].author.id | https://openalex.org/A5060318739 |
| authorships[2].author.orcid | |
| authorships[2].author.display_name | Zifei Cheng |
| authorships[2].author_position | middle |
| authorships[2].raw_author_name | Cheng, Zifei |
| authorships[2].is_corresponding | False |
| authorships[3].author.id | https://openalex.org/A5029073999 |
| authorships[3].author.orcid | https://orcid.org/0009-0001-4224-0331 |
| authorships[3].author.display_name | Wenhao Zhuang |
| authorships[3].author_position | middle |
| authorships[3].raw_author_name | Zhuang, Wenhao |
| authorships[3].is_corresponding | False |
| authorships[4].author.id | https://openalex.org/A5067997634 |
| authorships[4].author.orcid | https://orcid.org/0000-0002-8432-1658 |
| authorships[4].author.display_name | Zheng Lin |
| authorships[4].author_position | middle |
| authorships[4].raw_author_name | Lin, Zheng |
| authorships[4].is_corresponding | False |
| authorships[5].author.id | https://openalex.org/A5102026683 |
| authorships[5].author.orcid | https://orcid.org/0000-0003-3707-329X |
| authorships[5].author.display_name | Minglei Zhang |
| authorships[5].author_position | middle |
| authorships[5].raw_author_name | Zhang, Minglei |
| authorships[5].is_corresponding | False |
| authorships[6].author.id | https://openalex.org/A5100750462 |
| authorships[6].author.orcid | https://orcid.org/0000-0002-2786-519X |
| authorships[6].author.display_name | Shaojie Wang |
| authorships[6].author_position | middle |
| authorships[6].raw_author_name | Wang, Shaojie |
| authorships[6].is_corresponding | False |
| authorships[7].author.id | https://openalex.org/A5046016501 |
| authorships[7].author.orcid | https://orcid.org/0000-0002-8304-7546 |
| authorships[7].author.display_name | Y. Cui |
| authorships[7].author_position | middle |
| authorships[7].raw_author_name | Cui, Yinghan |
| authorships[7].is_corresponding | False |
| authorships[8].author.id | https://openalex.org/A5100407035 |
| authorships[8].author.orcid | https://orcid.org/0000-0002-7427-793X |
| authorships[8].author.display_name | Chao Wang |
| authorships[8].author_position | middle |
| authorships[8].raw_author_name | Wang, Chao |
| authorships[8].is_corresponding | False |
| authorships[9].author.id | https://openalex.org/A5005296744 |
| authorships[9].author.orcid | https://orcid.org/0000-0002-4103-5416 |
| authorships[9].author.display_name | Junyi Peng |
| authorships[9].author_position | middle |
| authorships[9].raw_author_name | Peng, Junyi |
| authorships[9].is_corresponding | False |
| authorships[10].author.id | https://openalex.org/A5069063278 |
| authorships[10].author.orcid | |
| authorships[10].author.display_name | Shimiao Jiang |
| authorships[10].author_position | middle |
| authorships[10].raw_author_name | Jiang, Shimiao |
| authorships[10].is_corresponding | False |
| authorships[11].author.id | https://openalex.org/A5061594246 |
| authorships[11].author.orcid | https://orcid.org/0000-0001-9180-3180 |
| authorships[11].author.display_name | Shihuan Kuang |
| authorships[11].author_position | middle |
| authorships[11].raw_author_name | Kuang, Shiqi |
| authorships[11].is_corresponding | False |
| authorships[12].author.id | https://openalex.org/A5103577015 |
| authorships[12].author.orcid | https://orcid.org/0009-0005-9975-6215 |
| authorships[12].author.display_name | S. Yin |
| authorships[12].author_position | middle |
| authorships[12].raw_author_name | Yin, Shouyu |
| authorships[12].is_corresponding | False |
| authorships[13].author.id | https://openalex.org/A5090423462 |
| authorships[13].author.orcid | https://orcid.org/0000-0003-4888-1524 |
| authorships[13].author.display_name | Caiyi Wen |
| authorships[13].author_position | middle |
| authorships[13].raw_author_name | Wen, Chaohang |
| authorships[13].is_corresponding | False |
| authorships[14].author.id | https://openalex.org/A5100392957 |
| authorships[14].author.orcid | https://orcid.org/0000-0002-0193-3473 |
| authorships[14].author.display_name | Haotian Zhang |
| authorships[14].author_position | middle |
| authorships[14].raw_author_name | Zhang, Haotian |
| authorships[14].is_corresponding | False |
| authorships[15].author.id | https://openalex.org/A5100427448 |
| authorships[15].author.orcid | https://orcid.org/0000-0003-4797-7831 |
| authorships[15].author.display_name | Bin Chen |
| authorships[15].author_position | middle |
| authorships[15].raw_author_name | Chen, Bin |
| authorships[15].is_corresponding | False |
| authorships[16].author.id | https://openalex.org/A5101544262 |
| authorships[16].author.orcid | https://orcid.org/0000-0002-1697-6089 |
| authorships[16].author.display_name | Bing Yu |
| authorships[16].author_position | last |
| authorships[16].raw_author_name | Yu, Bing |
| authorships[16].is_corresponding | False |
| has_content.pdf | False |
| has_content.grobid_xml | False |
| is_paratext | False |
| open_access.is_oa | True |
| open_access.oa_url | https://arxiv.org/pdf/2504.14286 |
| open_access.oa_status | green |
| open_access.any_repository_has_fulltext | False |
| created_date | 2025-10-10T00:00:00 |
| display_name | SRPO: A Cross-Domain Implementation of Large-Scale Reinforcement Learning on LLM |
| has_fulltext | False |
| is_retracted | False |
| updated_date | 2025-11-06T06:51:31.235846 |
| primary_topic.id | https://openalex.org/T13999 |
| primary_topic.field.id | https://openalex.org/fields/17 |
| primary_topic.field.display_name | Computer Science |
| primary_topic.score | 0.8129000067710876 |
| primary_topic.domain.id | https://openalex.org/domains/3 |
| primary_topic.domain.display_name | Physical Sciences |
| primary_topic.subfield.id | https://openalex.org/subfields/1710 |
| primary_topic.subfield.display_name | Information Systems |
| primary_topic.display_name | Digital Rights Management and Security |
| cited_by_count | 0 |
| locations_count | 2 |
| best_oa_location.id | pmh:oai:arXiv.org:2504.14286 |
| best_oa_location.is_oa | True |
| best_oa_location.source.id | https://openalex.org/S4306400194 |
| best_oa_location.source.issn | |
| best_oa_location.source.type | repository |
| best_oa_location.source.is_oa | True |
| best_oa_location.source.issn_l | |
| best_oa_location.source.is_core | False |
| best_oa_location.source.is_in_doaj | False |
| best_oa_location.source.display_name | arXiv (Cornell University) |
| best_oa_location.source.host_organization | https://openalex.org/I205783295 |
| best_oa_location.source.host_organization_name | Cornell University |
| best_oa_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| best_oa_location.license | |
| best_oa_location.pdf_url | https://arxiv.org/pdf/2504.14286 |
| best_oa_location.version | submittedVersion |
| best_oa_location.raw_type | text |
| best_oa_location.license_id | |
| best_oa_location.is_accepted | False |
| best_oa_location.is_published | False |
| best_oa_location.raw_source_name | |
| best_oa_location.landing_page_url | http://arxiv.org/abs/2504.14286 |
| primary_location.id | pmh:oai:arXiv.org:2504.14286 |
| primary_location.is_oa | True |
| primary_location.source.id | https://openalex.org/S4306400194 |
| primary_location.source.issn | |
| primary_location.source.type | repository |
| primary_location.source.is_oa | True |
| primary_location.source.issn_l | |
| primary_location.source.is_core | False |
| primary_location.source.is_in_doaj | False |
| primary_location.source.display_name | arXiv (Cornell University) |
| primary_location.source.host_organization | https://openalex.org/I205783295 |
| primary_location.source.host_organization_name | Cornell University |
| primary_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| primary_location.license | |
| primary_location.pdf_url | https://arxiv.org/pdf/2504.14286 |
| primary_location.version | submittedVersion |
| primary_location.raw_type | text |
| primary_location.license_id | |
| primary_location.is_accepted | False |
| primary_location.is_published | False |
| primary_location.raw_source_name | |
| primary_location.landing_page_url | http://arxiv.org/abs/2504.14286 |
| publication_date | 2025-04-19 |
| publication_year | 2025 |
| referenced_works_count | 0 |
| abstract_inverted_index.a | 106, 127 |
| abstract_inverted_index.In | 44 |
| abstract_inverted_index.as | 74 |
| abstract_inverted_index.by | 6, 87 |
| abstract_inverted_index.o1 | 8 |
| abstract_inverted_index.of | 2, 16, 25, 58, 82, 116, 139 |
| abstract_inverted_index.on | 60 |
| abstract_inverted_index.to | 20, 40, 112, 129 |
| abstract_inverted_index.we | 47, 99 |
| abstract_inverted_index.(1) | 105 |
| abstract_inverted_index.(2) | 123 |
| abstract_inverted_index.LLM | 147 |
| abstract_inverted_index.Our | 133 |
| abstract_inverted_index.R1, | 11 |
| abstract_inverted_index.and | 9, 63, 119, 122 |
| abstract_inverted_index.due | 39 |
| abstract_inverted_index.key | 102 |
| abstract_inverted_index.our | 140 |
| abstract_inverted_index.the | 13, 22, 56, 61, 70, 83, 114, 137 |
| abstract_inverted_index.two | 101 |
| abstract_inverted_index.(RL) | 19 |
| abstract_inverted_index.1/10 | 81 |
| abstract_inverted_index.SRPO | 66 |
| abstract_inverted_index.base | 72 |
| abstract_inverted_index.into | 145 |
| abstract_inverted_index.only | 79 |
| abstract_inverted_index.same | 71 |
| abstract_inverted_index.this | 45, 68 |
| abstract_inverted_index.upon | 93 |
| abstract_inverted_index.(HR), | 126 |
| abstract_inverted_index.(i.e. | 76 |
| abstract_inverted_index.Group | 94 |
| abstract_inverted_index.Large | 26 |
| abstract_inverted_index.about | 80 |
| abstract_inverted_index.model | 73 |
| abstract_inverted_index.steps | 85 |
| abstract_inverted_index.these | 32 |
| abstract_inverted_index.using | 69, 78 |
| abstract_inverted_index.which | 54 |
| abstract_inverted_index.work, | 46 |
| abstract_inverted_index.AIME24 | 62 |
| abstract_inverted_index.Models | 28 |
| abstract_inverted_index.Policy | 51, 96 |
| abstract_inverted_index.Recent | 0 |
| abstract_inverted_index.across | 34, 150 |
| abstract_inverted_index.coding | 120 |
| abstract_inverted_index.tasks. | 152 |
| abstract_inverted_index.(GRPO), | 98 |
| abstract_inverted_index.(LLMs). | 29 |
| abstract_inverted_index.(SRPO), | 53 |
| abstract_inverted_index.History | 124 |
| abstract_inverted_index.address | 130 |
| abstract_inverted_index.balance | 113 |
| abstract_inverted_index.diverse | 35, 151 |
| abstract_inverted_index.domains | 36 |
| abstract_inverted_index.enhance | 21 |
| abstract_inverted_index.limited | 41 |
| abstract_inverted_index.models, | 4 |
| abstract_inverted_index.present | 48 |
| abstract_inverted_index.remains | 37 |
| abstract_inverted_index.scaling | 146 |
| abstract_inverted_index.Building | 92 |
| abstract_inverted_index.DeepSeek | 75 |
| abstract_inverted_index.However, | 30 |
| abstract_inverted_index.Language | 27 |
| abstract_inverted_index.Learning | 18 |
| abstract_inverted_index.OpenAI's | 7 |
| abstract_inverted_index.Relative | 95 |
| abstract_inverted_index.achieves | 67 |
| abstract_inverted_index.advances | 1 |
| abstract_inverted_index.designed | 111 |
| abstract_inverted_index.insights | 144 |
| abstract_inverted_index.offering | 142 |
| abstract_inverted_index.paradigm | 110 |
| abstract_inverted_index.required | 86 |
| abstract_inverted_index.samples. | 132 |
| abstract_inverted_index.superior | 90 |
| abstract_inverted_index.training | 84, 109 |
| abstract_inverted_index.validate | 136 |
| abstract_inverted_index.valuable | 143 |
| abstract_inverted_index.approach, | 141 |
| abstract_inverted_index.highlight | 12 |
| abstract_inverted_index.introduce | 100 |
| abstract_inverted_index.potential | 15 |
| abstract_inverted_index.reasoning | 3, 23, 118, 148 |
| abstract_inverted_index.surpasses | 55 |
| abstract_inverted_index.technique | 128 |
| abstract_inverted_index.two-stage | 107 |
| abstract_inverted_index.DeepSeek's | 10 |
| abstract_inverted_index.Resampling | 125 |
| abstract_inverted_index.two-Staged | 49 |
| abstract_inverted_index.benchmarks. | 65 |
| abstract_inverted_index.challenging | 38 |
| abstract_inverted_index.development | 115 |
| abstract_inverted_index.efficiency. | 91 |
| abstract_inverted_index.exemplified | 5 |
| abstract_inverted_index.experiments | 135 |
| abstract_inverted_index.ineffective | 131 |
| abstract_inverted_index.performance | 57 |
| abstract_inverted_index.replicating | 31 |
| abstract_inverted_index.significant | 14 |
| abstract_inverted_index.Optimization | 52, 97 |
| abstract_inverted_index.advancements | 33 |
| abstract_inverted_index.capabilities | 24, 149 |
| abstract_inverted_index.cross-domain | 108 |
| abstract_inverted_index.innovations: | 104 |
| abstract_inverted_index.mathematical | 117 |
| abstract_inverted_index.proficiency, | 121 |
| abstract_inverted_index.LiveCodeBench | 64 |
| abstract_inverted_index.Qwen2.5-32B), | 77 |
| abstract_inverted_index.Reinforcement | 17 |
| abstract_inverted_index.comprehensive | 134 |
| abstract_inverted_index.demonstrating | 89 |
| abstract_inverted_index.effectiveness | 138 |
| abstract_inverted_index.transparency. | 43 |
| abstract_inverted_index.methodological | 42, 103 |
| abstract_inverted_index.history-Resampling | 50 |
| abstract_inverted_index.DeepSeek-R1-Zero-32B | 59 |
| abstract_inverted_index.DeepSeek-R1-Zero-32B, | 88 |
| cited_by_percentile_year | |
| countries_distinct_count | 0 |
| institutions_distinct_count | 17 |
| citation_normalized_percentile |
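
The `abstract_inverted_index.*` rows above store the abstract as a map from each token to the word positions where it occurs. As a minimal sketch of how such a map can be turned back into running text, the function below places every token at its positions and joins them in order. The function name `rebuild_abstract` and the in-memory dictionary shape are assumptions for illustration; the sample uses the first five tokens from the table.

```python
from typing import Dict, List


def rebuild_abstract(inverted_index: Dict[str, List[int]]) -> str:
    """Reassemble text from a token -> positions map (illustrative helper)."""
    by_position = {}
    for token, positions in inverted_index.items():
        for pos in positions:
            by_position[pos] = token
    # Join tokens in ascending position order; gaps in positions are skipped.
    return " ".join(by_position[pos] for pos in sorted(by_position))


# First five tokens of the abstract, as listed in the table rows above.
sample = {
    "Recent": [0],
    "advances": [1],
    "of": [2],
    "reasoning": [3],
    "models,": [4],
}
opening_words = rebuild_abstract(sample)
```

Tokens keep their attached punctuation (for example `models,`), so joining on single spaces is enough to recover the original wording.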