URPO: A Unified Reward & Policy Optimization Framework for Large Language Models
2025 · Open Access · DOI: https://doi.org/10.48550/arxiv.2507.17515
Large-scale alignment pipelines typically pair a policy model with a separately trained reward model whose parameters remain frozen during reinforcement learning (RL). This separation creates a complex, resource-intensive pipeline and suffers from a performance ceiling due to a static reward signal. We propose a novel framework, Unified Reward & Policy Optimization (URPO), that unifies instruction-following ("player") and reward modeling ("referee") within a single model and a single training phase. Our method recasts all alignment data (preference pairs, verifiable reasoning, and open-ended instructions) into a unified generative format optimized by a single Group-Relative Policy Optimization (GRPO) loop. This enables the model to learn from ground-truth preferences and verifiable logic while simultaneously generating its own rewards for open-ended tasks. Experiments on the Qwen2.5-7B model demonstrate URPO's superiority. Our unified model significantly outperforms a strong baseline using a separate generative reward model, boosting the instruction-following score on AlpacaEval from 42.24 to 44.84 and the composite reasoning average from 32.66 to 35.66. Furthermore, URPO cultivates a superior internal evaluator as a byproduct of training, achieving a RewardBench score of 85.15 and surpassing the dedicated reward model it replaces (83.55). By eliminating the need for a separate reward model and fostering a co-evolutionary dynamic between generation and evaluation, URPO presents a simpler, more efficient, and more effective path towards robustly aligned language models.
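The abstract's single GRPO loop scores each sampled completion against its own group rather than against a learned value network. A minimal sketch of that group-relative advantage step is below; the function name and toy rewards are illustrative, not taken from the paper:

```python
import numpy as np

def group_relative_advantages(rewards):
    """Normalize rewards within one group of completions sampled
    for the same prompt, as in Group-Relative Policy Optimization:
    advantage = (reward - group mean) / group std."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

# Four sampled answers to one prompt, rewarded 0/1 by the model's
# own "referee" pass (rewards here are made up for illustration).
adv = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Completions rewarded above the group mean get positive advantages and are reinforced; the rest are suppressed, with no separate frozen reward model in the loop.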
Summary
- Type: preprint
- Language: en
- Landing page: http://arxiv.org/abs/2507.17515
- PDF: https://arxiv.org/pdf/2507.17515
- OA status: green
- OpenAlex ID: https://openalex.org/W4414881550
Raw OpenAlex JSON
- OpenAlex ID: https://openalex.org/W4414881550 (canonical identifier for this work in OpenAlex)
- DOI: https://doi.org/10.48550/arxiv.2507.17515 (Digital Object Identifier)
- Title: URPO: A Unified Reward & Policy Optimization Framework for Large Language Models (work title)
- Type: preprint (OpenAlex work type)
- Language: en (primary language)
- Publication year: 2025 (year of publication)
- Publication date: 2025-07-23 (full publication date if available)
- Authors: Sara A. Lu, Hua Wang, Zhi Chen, Yaohua Tang (list of authors in order)
- Landing page: https://arxiv.org/abs/2507.17515 (publisher landing page)
- PDF URL: https://arxiv.org/pdf/2507.17515 (direct link to full-text PDF)
- Open access: Yes (whether a free full text is available)
- OA status: green (open access status per OpenAlex)
- OA URL: https://arxiv.org/pdf/2507.17515 (direct OA link when available)
- Cited by: 0 (total citation count in OpenAlex)
Full payload
| field | value |
|---|---|
| id | https://openalex.org/W4414881550 |
| doi | https://doi.org/10.48550/arxiv.2507.17515 |
| ids.doi | https://doi.org/10.48550/arxiv.2507.17515 |
| ids.openalex | https://openalex.org/W4414881550 |
| fwci | |
| type | preprint |
| title | URPO: A Unified Reward & Policy Optimization Framework for Large Language Models |
| biblio.issue | |
| biblio.volume | |
| biblio.last_page | |
| biblio.first_page | |
| topics[0].id | https://openalex.org/T10181 |
| topics[0].field.id | https://openalex.org/fields/17 |
| topics[0].field.display_name | Computer Science |
| topics[0].score | 0.9621999859809875 |
| topics[0].domain.id | https://openalex.org/domains/3 |
| topics[0].domain.display_name | Physical Sciences |
| topics[0].subfield.id | https://openalex.org/subfields/1702 |
| topics[0].subfield.display_name | Artificial Intelligence |
| topics[0].display_name | Natural Language Processing Techniques |
| topics[1].id | https://openalex.org/T10028 |
| topics[1].field.id | https://openalex.org/fields/17 |
| topics[1].field.display_name | Computer Science |
| topics[1].score | 0.9613000154495239 |
| topics[1].domain.id | https://openalex.org/domains/3 |
| topics[1].domain.display_name | Physical Sciences |
| topics[1].subfield.id | https://openalex.org/subfields/1702 |
| topics[1].subfield.display_name | Artificial Intelligence |
| topics[1].display_name | Topic Modeling |
| is_xpac | False |
| apc_list | |
| apc_paid | |
| language | en |
| locations[0].id | pmh:oai:arXiv.org:2507.17515 |
| locations[0].is_oa | True |
| locations[0].source.id | https://openalex.org/S4306400194 |
| locations[0].source.issn | |
| locations[0].source.type | repository |
| locations[0].source.is_oa | True |
| locations[0].source.issn_l | |
| locations[0].source.is_core | False |
| locations[0].source.is_in_doaj | False |
| locations[0].source.display_name | arXiv (Cornell University) |
| locations[0].source.host_organization | https://openalex.org/I205783295 |
| locations[0].source.host_organization_name | Cornell University |
| locations[0].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[0].license | |
| locations[0].pdf_url | https://arxiv.org/pdf/2507.17515 |
| locations[0].version | submittedVersion |
| locations[0].raw_type | text |
| locations[0].license_id | |
| locations[0].is_accepted | False |
| locations[0].is_published | False |
| locations[0].raw_source_name | |
| locations[0].landing_page_url | http://arxiv.org/abs/2507.17515 |
| locations[1].id | doi:10.48550/arxiv.2507.17515 |
| locations[1].is_oa | True |
| locations[1].source.id | https://openalex.org/S4306400194 |
| locations[1].source.issn | |
| locations[1].source.type | repository |
| locations[1].source.is_oa | True |
| locations[1].source.issn_l | |
| locations[1].source.is_core | False |
| locations[1].source.is_in_doaj | False |
| locations[1].source.display_name | arXiv (Cornell University) |
| locations[1].source.host_organization | https://openalex.org/I205783295 |
| locations[1].source.host_organization_name | Cornell University |
| locations[1].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[1].license | |
| locations[1].pdf_url | |
| locations[1].version | |
| locations[1].raw_type | article |
| locations[1].license_id | |
| locations[1].is_accepted | False |
| locations[1].is_published | |
| locations[1].raw_source_name | |
| locations[1].landing_page_url | https://doi.org/10.48550/arxiv.2507.17515 |
| indexed_in | arxiv, datacite |
| authorships[0].author.id | https://openalex.org/A5002074425 |
| authorships[0].author.orcid | |
| authorships[0].author.display_name | Sara A. Lu |
| authorships[0].author_position | first |
| authorships[0].raw_author_name | Lu, Songshuo |
| authorships[0].is_corresponding | False |
| authorships[1].author.id | https://openalex.org/A5100403969 |
| authorships[1].author.orcid | https://orcid.org/0000-0002-8465-0996 |
| authorships[1].author.display_name | Hua Wang |
| authorships[1].author_position | middle |
| authorships[1].raw_author_name | Wang, Hua |
| authorships[1].is_corresponding | False |
| authorships[2].author.id | https://openalex.org/A5100456834 |
| authorships[2].author.orcid | https://orcid.org/0000-0002-9307-4871 |
| authorships[2].author.display_name | Zhi Chen |
| authorships[2].author_position | middle |
| authorships[2].raw_author_name | Chen, Zhi |
| authorships[2].is_corresponding | False |
| authorships[3].author.id | https://openalex.org/A5044818616 |
| authorships[3].author.orcid | |
| authorships[3].author.display_name | Yaohua Tang |
| authorships[3].author_position | last |
| authorships[3].raw_author_name | Tang, Yaohua |
| authorships[3].is_corresponding | False |
| has_content.pdf | False |
| has_content.grobid_xml | False |
| is_paratext | False |
| open_access.is_oa | True |
| open_access.oa_url | https://arxiv.org/pdf/2507.17515 |
| open_access.oa_status | green |
| open_access.any_repository_has_fulltext | False |
| created_date | 2025-10-10T00:00:00 |
| display_name | URPO: A Unified Reward & Policy Optimization Framework for Large Language Models |
| has_fulltext | False |
| is_retracted | False |
| updated_date | 2025-11-06T06:51:31.235846 |
| primary_topic.id | https://openalex.org/T10181 |
| primary_topic.field.id | https://openalex.org/fields/17 |
| primary_topic.field.display_name | Computer Science |
| primary_topic.score | 0.9621999859809875 |
| primary_topic.domain.id | https://openalex.org/domains/3 |
| primary_topic.domain.display_name | Physical Sciences |
| primary_topic.subfield.id | https://openalex.org/subfields/1702 |
| primary_topic.subfield.display_name | Artificial Intelligence |
| primary_topic.display_name | Natural Language Processing Techniques |
| cited_by_count | 0 |
| locations_count | 2 |
| best_oa_location.id | pmh:oai:arXiv.org:2507.17515 |
| best_oa_location.is_oa | True |
| best_oa_location.source.id | https://openalex.org/S4306400194 |
| best_oa_location.source.issn | |
| best_oa_location.source.type | repository |
| best_oa_location.source.is_oa | True |
| best_oa_location.source.issn_l | |
| best_oa_location.source.is_core | False |
| best_oa_location.source.is_in_doaj | False |
| best_oa_location.source.display_name | arXiv (Cornell University) |
| best_oa_location.source.host_organization | https://openalex.org/I205783295 |
| best_oa_location.source.host_organization_name | Cornell University |
| best_oa_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| best_oa_location.license | |
| best_oa_location.pdf_url | https://arxiv.org/pdf/2507.17515 |
| best_oa_location.version | submittedVersion |
| best_oa_location.raw_type | text |
| best_oa_location.license_id | |
| best_oa_location.is_accepted | False |
| best_oa_location.is_published | False |
| best_oa_location.raw_source_name | |
| best_oa_location.landing_page_url | http://arxiv.org/abs/2507.17515 |
| primary_location.id | pmh:oai:arXiv.org:2507.17515 |
| primary_location.is_oa | True |
| primary_location.source.id | https://openalex.org/S4306400194 |
| primary_location.source.issn | |
| primary_location.source.type | repository |
| primary_location.source.is_oa | True |
| primary_location.source.issn_l | |
| primary_location.source.is_core | False |
| primary_location.source.is_in_doaj | False |
| primary_location.source.display_name | arXiv (Cornell University) |
| primary_location.source.host_organization | https://openalex.org/I205783295 |
| primary_location.source.host_organization_name | Cornell University |
| primary_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| primary_location.license | |
| primary_location.pdf_url | https://arxiv.org/pdf/2507.17515 |
| primary_location.version | submittedVersion |
| primary_location.raw_type | text |
| primary_location.license_id | |
| primary_location.is_accepted | False |
| primary_location.is_published | False |
| primary_location.raw_source_name | |
| primary_location.landing_page_url | http://arxiv.org/abs/2507.17515 |
| publication_date | 2025-07-23 |
| publication_year | 2025 |
| referenced_works_count | 0 |
| abstract_inverted_index | word → position-list map of the abstract |
| cited_by_percentile_year | |
| countries_distinct_count | 0 |
| institutions_distinct_count | 4 |
| citation_normalized_percentile |
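OpenAlex delivers the abstract not as plain text but as an `abstract_inverted_index`, a map from each word to the list of token positions where it occurs. A minimal sketch of turning that map back into a string (the toy index below is a small excerpt-shaped example, not the full record):

```python
def rebuild_abstract(inverted_index):
    """Reconstruct abstract text from an OpenAlex abstract_inverted_index.

    Place every word at each of its listed positions, then join the
    words in position order to recover the original text."""
    positions = {}
    for word, idxs in inverted_index.items():
        for i in idxs:
            positions[i] = word
    return " ".join(positions[i] for i in sorted(positions))

# Toy index in the same shape as the payload field above.
toy = {"Large-scale": [0], "alignment": [1], "pipelines": [2], "typically": [3]}
text = rebuild_abstract(toy)  # "Large-scale alignment pipelines typically"
```

Sorting by position handles words that occur more than once, since each occurrence carries its own index entry.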