DHGRPO: Domain-Induced, Hierarchical Group Relative Policy Optimization
2025 · Open Access · DOI: https://doi.org/10.5281/zenodo.16786367
DHGRPO (Domain-Induced Hierarchical Group Relative Policy Optimization) is a mathematically grounded extension of Group Relative Policy Optimization (GRPO) that mitigates group-level failure modes in preference-based fine-tuning of large language models. The method integrates: (i) robust per-prompt normalization via median and median absolute deviation (MAD) to suppress outlier influence, (ii) a Domain-Induced Factor (DIF) for trust gating based on long-term reward stability, (iii) a Domain-Optimism Parameter (DOP) for recency-weighted learning emphasis, and (iv) a bounded reward amplifier with optional magnitude matching to preserve update scale. We present a stepwise derivation from the exact policy gradient to the GRPO surrogate and its DHGRPO refinement, a controlled simulation framework with hyperparameter sweeps demonstrating consistent proxy improvements, and actionable implementation recommendations for real-world deployment in large-scale preference optimization.
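The abstract names the four components but does not give their exact update rules. The Python sketch below is purely illustrative: the tanh-based forms of the DIF trust gate, the DOP recency term, and the bounded amplifier, along with all function names, decay rates, and hyperparameters, are assumptions chosen to match the described behavior, not the paper's definitions.

```python
# Illustrative sketch only: DHGRPO's exact formulas are not given in the abstract,
# so the DIF/DOP update rules, the amplifier bound, and all hyperparameters below
# are assumptions chosen to mirror the described behavior.
import numpy as np

def mad_normalize(rewards, eps=1e-8):
    """(i) Robust per-prompt normalization: center by the median and scale by the
    median absolute deviation (MAD) so outlier completions influence the
    advantage less than under mean/std normalization."""
    rewards = np.asarray(rewards, dtype=np.float64)
    med = np.median(rewards)
    mad = np.median(np.abs(rewards - med))
    return (rewards - med) / (mad + eps)

class DomainState:
    """Tracks per-domain reward statistics used for trust gating (DIF) and
    recency-weighted emphasis (DOP). Decay rates are illustrative."""
    def __init__(self, slow_decay=0.99, fast_decay=0.9):
        self.slow_decay = slow_decay   # long-horizon average -> stability / trust
        self.fast_decay = fast_decay   # short-horizon average -> recency emphasis
        self.slow_mean = 0.0
        self.slow_var = 1.0
        self.fast_mean = 0.0

    def update(self, group_mean_reward):
        d = group_mean_reward - self.slow_mean
        self.slow_mean += (1 - self.slow_decay) * d
        self.slow_var = self.slow_decay * self.slow_var + (1 - self.slow_decay) * d * d
        self.fast_mean += (1 - self.fast_decay) * (group_mean_reward - self.fast_mean)

    def dif(self):
        # (ii) Domain-Induced Factor: trust gate in (0, 1] that shrinks when the
        # domain's long-term reward signal is unstable (high running variance).
        return 1.0 / (1.0 + np.sqrt(self.slow_var))

    def dop(self):
        # (iii) Domain-Optimism Parameter: exceeds 1 when recent rewards beat the
        # long-horizon average, emphasizing domains that are currently improving.
        return 1.0 + np.tanh(self.fast_mean - self.slow_mean)

def bounded_amplify(advantages, gain=2.0, bound=3.0, match_magnitude=True):
    """(iv) Bounded reward amplifier: pass advantages through a saturating tanh so
    large values cannot blow up the update; optional magnitude matching rescales
    the output so its mean absolute value matches the pre-amplification scale."""
    amplified = bound * np.tanh(gain * advantages / bound)
    if match_magnitude:
        pre = np.mean(np.abs(advantages)) + 1e-8
        post = np.mean(np.abs(amplified)) + 1e-8
        amplified *= pre / post
    return amplified

def dhgrpo_advantages(group_rewards, state):
    """Combine the pieces for one prompt's group of sampled completions."""
    state.update(float(np.mean(group_rewards)))
    adv = mad_normalize(group_rewards)          # robust per-prompt normalization
    adv = adv * state.dif() * state.dop()       # trust gating and recency emphasis
    return bounded_amplify(adv)                 # bounded, scale-preserving amplification

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    state = DomainState()
    rewards = rng.normal(0.5, 0.2, size=8)      # 8 sampled completions for one prompt
    rewards[0] = 5.0                             # one outlier reward
    print(dhgrpo_advantages(rewards, state))
```

In this sketch the outlier reward shifts the resulting advantages far less than it would under mean/std normalization, which is the failure mode the MAD step is meant to suppress.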
Related Topics To Compare & Contrast
- Type: preprint
- Language: en
- Landing Page: http://arxiv.org/abs/2501.12948 (PDF: https://arxiv.org/pdf/2501.12948)
- OA Status: green
- Cited By: 411
- Related Works: 10
- OpenAlex ID: https://openalex.org/W4406779522