arXiv (Cornell University)
DHGRPO: Domain-Induced, Hierarchical Group Relative Policy Optimization
August 2025 • DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, Xiaokang Zhang, X…
DHGRPO (Domain-Induced Hierarchical Group Relative Policy Optimization) is a mathematically grounded extension of Group Relative Policy Optimization (GRPO) that mitigates group-level failure modes in preference-based fine-tuning of large language models. The method integrates: (i) robust per-prompt normalization via median and median absolute deviation (MAD) to suppress outlier influence, (ii) a Domain-Induced Factor (DIF) for trust gating based on long-term reward stability, (iii) a Domain-Optimism Parameter (DOP…
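As an illustration of component (i) only, the following is a minimal sketch of median/MAD normalization over one prompt's group of sampled rewards. It is not the paper's implementation: the function name `robust_group_advantages`, the `eps` guard against a zero MAD, and the absence of any consistency scaling constant are assumptions made here for illustration, since the abstract does not specify them.

```python
from statistics import median


def robust_group_advantages(rewards, eps=1e-8):
    """Center and scale one group's rewards by median and MAD.

    Hypothetical helper: `eps` is an assumed guard for the case where
    more than half of the rewards are identical (MAD == 0).
    """
    if not rewards:
        return []
    med = median(rewards)
    # Median absolute deviation: median of |r - median(rewards)|.
    mad = median(abs(r - med) for r in rewards)
    return [(r - med) / (mad + eps) for r in rewards]


if __name__ == "__main__":
    # One group with a single extreme reward.
    group = [1.0, 2.0, 3.0, 4.0, 100.0]
    print(robust_group_advantages(group))
```

In this example the median is 3.0 and the MAD is 1.0, so the four ordinary rewards keep roughly unit spacing around zero. Under mean and standard deviation normalization, the single reward of 100.0 would inflate the scale and compress those four values toward one another, which is the outlier effect that the robust statistics are meant to limit.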