Dongbin Zhao
ARAC: Adaptive Regularized Multi-Agent Soft Actor-Critic in Graph-Structured Adversarial Games
In graph-structured multi-agent reinforcement learning (MARL) adversarial tasks such as pursuit and confrontation, agents must coordinate under highly dynamic interactions, where sparse rewards hinder efficient policy learning. We propose …
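The snippet cuts off before the method itself, so as hedged background only, here is a minimal sketch of the entropy-regularized (soft) Bellman target that soft actor-critic variants such as ARAC build on; the hyperparameter names and values below are illustrative assumptions, not details from the paper.

```python
# Hedged background sketch: the generic entropy-regularized (SAC-style) target,
# y = r + gamma * (min(Q1', Q2') - alpha * log pi(a'|s')).
# gamma and alpha values are illustrative, not taken from ARAC.
import numpy as np

def soft_q_target(reward, done, next_q1, next_q2, next_log_prob,
                  gamma=0.99, alpha=0.2):
    """One-step Bellman backup with an entropy bonus on the next action."""
    next_value = np.minimum(next_q1, next_q2) - alpha * next_log_prob
    return reward + gamma * (1.0 - done) * next_value

# Example: a non-terminal transition with reward 1.0.
print(soft_q_target(1.0, 0.0, next_q1=5.0, next_q2=4.8, next_log_prob=-1.2))
```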
Equilibrium Policy Generalization: A Reinforcement Learning Framework for Cross-Graph Zero-Shot Generalization in Pursuit-Evasion Games
Equilibrium learning in adversarial games is an important topic widely examined in the fields of game theory and reinforcement learning (RL). Pursuit-evasion game (PEG), as an important class of real-world games from the fields of robotics…
Taming the Judge: Deconflicting AI Feedback for Stable Reinforcement Learning
Aligning language models using LLM judge feedback offers a scalable alternative to human annotation, yet is plagued by judgment inconsistencies that destabilize reinforcement learning. While prior work has focused on judge accuracy, the cr…
SRFT: A Single-Stage Method with Supervised and Reinforcement Fine-Tuning for Reasoning
Large language models (LLMs) have achieved remarkable progress in reasoning tasks, yet the optimal integration of Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) remains a fundamental challenge. Through comprehensive analysis …
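Since the abstract is truncated before SRFT's formulation, the following is only a hypothetical sketch of a single-stage objective that mixes a supervised fine-tuning term with a REINFORCE-style term via a fixed weight; the loss forms and the `sft_weight` parameter are assumptions for illustration, not the paper's method.

```python
# Hypothetical single-stage objective: weighted sum of a supervised NLL term
# (on demonstration tokens) and a REINFORCE-style term (on sampled tokens).
# The loss forms and sft_weight are assumptions, not SRFT's formulation.
import torch

def combined_loss(sft_logprobs, rl_logprobs, advantages, sft_weight=0.5):
    sft_loss = -sft_logprobs.mean()                        # imitate demonstrations
    rl_loss = -(advantages.detach() * rl_logprobs).mean()  # reinforce sampled outputs
    return sft_weight * sft_loss + (1.0 - sft_weight) * rl_loss

# Dummy tensors standing in for sequence log-probabilities and advantages.
sft_lp = torch.randn(8, requires_grad=True)
rl_lp = torch.randn(8, requires_grad=True)
loss = combined_loss(sft_lp, rl_lp, torch.randn(8))
loss.backward()
```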
RLAE: Reinforcement Learning-Assisted Ensemble for LLMs
Ensembling large language models (LLMs) can effectively combine diverse strengths of different models, offering a promising approach to enhance performance across various tasks. However, existing methods typically rely on fixed weighting s…
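As context for the fixed-weighting limitation mentioned above, here is a minimal sketch of blending next-token distributions from several models under a softmax weighting; how RLAE actually learns or adapts these weights is not shown in the truncated abstract, so the weighting scheme here is an assumption.

```python
# Illustrative sketch only: blending next-token probability distributions
# from several language models with adjustable weights.
import numpy as np

def ensemble_next_token_probs(model_probs, weight_logits):
    """model_probs: (num_models, vocab) per-model distributions;
    weight_logits: (num_models,) unnormalized ensemble weights."""
    w = np.exp(weight_logits - weight_logits.max())
    w = w / w.sum()                        # softmax over models
    mixed = (w[:, None] * model_probs).sum(axis=0)
    return mixed / mixed.sum()             # renormalize for numerical safety

probs = np.array([[0.7, 0.2, 0.1],         # model A's next-token distribution
                  [0.3, 0.4, 0.3]])        # model B's next-token distribution
print(ensemble_next_token_probs(probs, np.array([0.0, 1.0])))
```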
TeViR: Text-to-Video Reward with Diffusion Models for Efficient Reinforcement Learning
Developing scalable and generalizable reward engineering for reinforcement learning (RL) is crucial for creating general-purpose agents, especially in the challenging domain of robotic manipulation. While recent advances in reward engineer…
Learning When to Think: Shaping Adaptive Reasoning in R1-Style Models via Multi-Stage RL
Large reasoning models (LRMs) are proficient at generating explicit, step-by-step reasoning sequences before producing final answers. However, such detailed reasoning can introduce substantial computational overhead and latency, particular…
In-Dataset Trajectory Return Regularization for Offline Preference-based Reinforcement Learning
Offline preference-based reinforcement learning (PbRL) typically operates in two phases: first, use human preferences to learn a reward model and annotate rewards for a reward-free offline dataset; second, learn a policy by optimizing the …
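The first phase described above is conventionally implemented with the Bradley-Terry preference model; the sketch below shows that standard loss, not the paper's specific in-dataset return regularization, and the linear reward model in the example is purely illustrative.

```python
# Standard Bradley-Terry preference loss for phase one of offline PbRL
# (reward learning from pairwise segment comparisons). Generic background only.
import torch

def preference_loss(reward_model, segment_a, segment_b, prefer_a):
    """segment_*: (batch, length, state_dim); prefer_a: 1.0 if A preferred, else 0.0."""
    return_a = reward_model(segment_a).squeeze(-1).sum(dim=-1)  # predicted segment returns
    return_b = reward_model(segment_b).squeeze(-1).sum(dim=-1)
    logits = return_a - return_b     # P(A preferred) = sigmoid(return_a - return_b)
    return torch.nn.functional.binary_cross_entropy_with_logits(logits, prefer_a)

# Illustrative linear reward model over 4-dim states, segments of length 10.
reward_model = torch.nn.Linear(4, 1)
seg_a, seg_b = torch.randn(2, 10, 4), torch.randn(2, 10, 4)
loss = preference_loss(reward_model, seg_a, seg_b, torch.tensor([1.0, 0.0]))
loss.backward()
```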
Enhancing LLM Reasoning with Iterative DPO: A Comprehensive Empirical Investigation
Recent advancements in post-training methodologies for large language models (LLMs) have highlighted reinforcement learning (RL) as a critical component for enhancing reasoning. However, the substantial computational costs associated with …
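For reference, the sketch below is the standard DPO objective (Rafailov et al., 2023) that an iterative scheme would reapply to preference pairs collected from the current policy; the iteration schedule and the `beta` value are not taken from this paper.

```python
# Standard DPO loss; an iterative scheme reapplies it to preference pairs
# generated by the current policy. beta is illustrative.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_lp, policy_rejected_lp,
             ref_chosen_lp, ref_rejected_lp, beta=0.1):
    """Inputs are sequence-level log-probabilities under the policy and a frozen reference."""
    chosen_logratio = policy_chosen_lp - ref_chosen_lp
    rejected_logratio = policy_rejected_lp - ref_rejected_lp
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()

# Dummy batch of 4 preference pairs.
loss = dpo_loss(torch.randn(4, requires_grad=True), torch.randn(4, requires_grad=True),
                torch.randn(4), torch.randn(4))
loss.backward()
```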
ConRFT: A Reinforced Fine-tuning Method for VLA Models via Consistency Policy
Vision-Language-Action (VLA) models have shown substantial potential in real-world robotic manipulation. However, fine-tuning these models through supervised learning struggles to achieve robust performance due to limited, inconsistent dem…
Online Preference-based Reinforcement Learning with Self-augmented Feedback from Large Language Model
Preference-based reinforcement learning (PbRL) provides a powerful paradigm to avoid meticulous reward engineering by learning rewards based on human preferences. However, real-time human feedback is hard to obtain in online tasks. Most wo…
Data Scaling Laws for Imitation Learning-Based End-to-End Autonomous Driving
The end-to-end autonomous driving paradigm has recently attracted considerable attention due to its scalability. However, existing methods are constrained by the limited scale of real-world data, which hinders a comprehensive exploration of the…
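As a hedged illustration of the scaling-law methodology the title refers to, the snippet below fits the commonly assumed power-law form error ≈ a·N^(-b) by linear regression in log-log space; the data points are synthetic by construction and do not come from the paper.

```python
# Hedged sketch of the usual scaling-law fit, error = a * N**(-b), via a
# log-log linear regression. The points are synthetic (generated from a
# known power law), not results from the paper.
import numpy as np

dataset_sizes = np.array([1e3, 1e4, 1e5, 1e6])   # hypothetical dataset sizes
errors = 2.0 * dataset_sizes ** -0.3             # synthetic error values

# log(error) = log(a) - b * log(N): ordinary least squares in log space.
slope, log_a = np.polyfit(np.log(dataset_sizes), np.log(errors), 1)
print(f"fitted exponent b = {-slope:.3f}, coefficient a = {np.exp(log_a):.3f}")
```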
CPIG: Leveraging Consistency Policy with Intention Guidance for Multi-agent Exploration
Efficient exploration is crucial in cooperative multi-agent reinforcement learning (MARL), especially in sparse-reward settings. However, due to their reliance on unimodal policies, existing methods are prone to falling into local opti…
SELU: Self-Learning Embodied MLLMs in Unknown Environments
Recently, multimodal large language models (MLLMs) have demonstrated strong visual understanding and decision-making capabilities, enabling the exploration of autonomously improving MLLMs in unknown environments. However, external feedback…
Generalizing Consistency Policy to Visual RL with Prioritized Proximal Experience Regularization
With high-dimensional state spaces, visual reinforcement learning (RL) faces significant challenges in exploitation and exploration, resulting in low sample efficiency and unstable training. As a time-efficient diffusion model, although c…
Discretizing Continuous Action Space with Unimodal Probability Distributions for On-Policy Reinforcement Learning
For on-policy reinforcement learning, discretizing the action space for continuous control can easily express multiple modes and is straightforward to optimize. However, without considering the inherent ordering between the discrete atomic act…
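One common way to respect the ordering of discrete atomic actions is to force the per-dimension distribution to be unimodal; the sketch below does this with a concave-quadratic logit shape around a predicted mode, which is an assumed parameterization for illustration rather than the paper's construction.

```python
# Sketch of one way to place a unimodal distribution over ordered action bins:
# logits shaped as a concave quadratic around a predicted mode. This particular
# parameterization is an assumption, not necessarily the paper's.
import numpy as np

def unimodal_bin_probs(mode, concentration, num_bins=11):
    """mode in [-1, 1]; larger concentration -> sharper peak around the mode."""
    bin_centers = np.linspace(-1.0, 1.0, num_bins)       # ordered atomic actions
    logits = -concentration * (bin_centers - mode) ** 2  # peak at the nearest bin
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()

print(np.round(unimodal_bin_probs(mode=0.3, concentration=8.0), 3))
```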
Dream to Drive With Predictive Individual World Model
Generating reactive driving behaviors in complex urban environments remains challenging, as road users' intentions are unknown. Model-based reinforcement learning (MBRL) offers great potential to learn a reactive policy by construc…
PlanAgent: A Multi-modal Large Language Agent for Closed-loop Vehicle Motion Planning
Vehicle motion planning is an essential component of autonomous driving technology. Current rule-based vehicle motion planning methods perform satisfactorily in common scenarios but struggle to generalize to long-tailed situations. Meanwhi…
Learning Future Representation with Synthetic Observations for Sample-efficient Reinforcement Learning
In visual Reinforcement Learning (RL), upstream representation learning largely determines the effectiveness of downstream policy learning. Employing auxiliary tasks allows the agent to enhance visual representations in a targeted manner, thereby …
User Response Modeling in Reinforcement Learning for Ads Allocation
User response modeling can enhance the learning of user representations and further improve the reinforcement learning (RL) recommender agent. However, as users' behaviors are influenced by their long-term preferences and short-term stocha…
Advancing Object Goal Navigation Through LLM-enhanced Object Affinities Transfer
In object goal navigation, agents navigate towards objects identified by category labels using visual and spatial information. Previous purely network-based methods typically rely on historical data for object affinity estimation, lac…
FM3Q: Factorized Multi-Agent MiniMax Q-Learning for Two-Team Zero-Sum Markov Game
Many real-world applications involve agents that fall into two teams, with payoffs that are equal within a team but of opposite sign across teams. The so-called two-team zero-sum Markov games (2t0sMGs) can be resolv…
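For background on the minimax objective in two-team zero-sum games, the sketch below computes a one-step minimax backup over a joint-action value table; it uses pure-strategy maximin for brevity (classical minimax-Q solves a small linear program over mixed strategies) and does not show FM3Q's factorization.

```python
# Generic minimax-Q target for a two-team zero-sum game: the protagonist team
# maximizes over its joint actions while the opponent team minimizes.
# Textbook background; FM3Q's factorized networks are not shown here.
import numpy as np

def minimax_q_target(reward, done, next_q_table, gamma=0.99):
    """next_q_table: (num_team_joint_actions, num_opponent_joint_actions) values
    at the next state; returns the one-step backup target."""
    # Pure-strategy maximin for simplicity: max over own team, min over opponent.
    value = np.max(np.min(next_q_table, axis=1))
    return reward + gamma * (1.0 - done) * value

next_q = np.array([[1.0, -0.5],
                   [0.2,  0.8]])
print(minimax_q_target(reward=0.0, done=0.0, next_q_table=next_q))
```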
RoboGPT: an intelligent agent of making embodied long-term decisions for daily instruction tasks
Robotic agents must master common sense and long-term sequential decision-making to solve daily tasks through natural language instructions. Developments in Large Language Models (LLMs) for natural language processing have inspired efforts to …
Boosting Continuous Control with Consistency Policy
Due to its training stability and strong expressiveness, the diffusion model has attracted considerable attention in offline reinforcement learning. However, several challenges have also come with it: 1) The demand for a large number of diffus…
ComSD: Balancing Behavioral Quality and Diversity in Unsupervised Skill Discovery
Unsupervised skill discovery seeks to acquire different useful skills wit…
Multi-modal Learning based Prediction for Disease
Non-alcoholic fatty liver disease (NAFLD) is the most common cause of chronic liver disease; accurate prediction can help prevent advanced fibrosis and cirrhosis. However, a liver biopsy, the gold standard for NAFLD diagnosis, is inv…
Score-Based Equilibrium Learning in Multi-Player Finite Games with Imperfect Information
Real-world games, which involve imperfect information, multiple players, and simultaneous moves, are less frequently discussed in the existing game theory literature. While reinforcement learning (RL) provides a general framework to ext…