Dongbin Zhao
ARAC: Adaptive Regularized Multi-Agent Soft Actor-Critic in Graph-Structured Adversarial Games
In graph-structured multi-agent reinforcement learning (MARL) adversarial tasks such as pursuit and confrontation, agents must coordinate under highly dynamic interactions, where sparse rewards hinder efficient policy learning. We propose …
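The snippet cuts off before the method itself, so as hedged background only, here is a minimal sketch of the entropy-regularized (soft) Bellman target that soft actor-critic variants such as ARAC build on; the hyperparameter names and values below are illustrative assumptions, not details from the paper.

```python
# Hedged background sketch: the generic entropy-regularized (SAC-style) target,
# y = r + gamma * (min(Q1', Q2') - alpha * log pi(a'|s')).
# gamma and alpha values are illustrative, not taken from ARAC.
import numpy as np

def soft_q_target(reward, done, next_q1, next_q2, next_log_prob,
                  gamma=0.99, alpha=0.2):
    """One-step Bellman backup with an entropy bonus on the next action."""
    next_value = np.minimum(next_q1, next_q2) - alpha * next_log_prob
    return reward + gamma * (1.0 - done) * next_value

# Example: a non-terminal transition with reward 1.0.
print(soft_q_target(1.0, 0.0, next_q1=5.0, next_q2=4.8, next_log_prob=-1.2))
```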
Equilibrium Policy Generalization: A Reinforcement Learning Framework for Cross-Graph Zero-Shot Generalization in Pursuit-Evasion Games
Equilibrium learning in adversarial games is an important topic widely examined in the fields of game theory and reinforcement learning (RL). Pursuit-evasion game (PEG), as an important class of real-world games from the fields of robotics…
Taming the Judge: Deconflicting AI Feedback for Stable Reinforcement Learning
Aligning language models using LLM judge feedback offers a scalable alternative to human annotation, yet is plagued by judgment inconsistencies that destabilize reinforcement learning. While prior work has focused on judge accuracy, the cr…
SRFT: A Single-Stage Method with Supervised and Reinforcement Fine-Tuning for Reasoning
Large language models (LLMs) have achieved remarkable progress in reasoning tasks, yet the optimal integration of Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) remains a fundamental challenge. Through comprehensive analysis …
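Since the abstract is truncated before SRFT's formulation, the following is only a hypothetical sketch of a single-stage objective that mixes a supervised fine-tuning term with a REINFORCE-style term via a fixed weight; the loss forms and the `sft_weight` parameter are assumptions for illustration, not the paper's method.

```python
# Hypothetical single-stage objective: weighted sum of a supervised NLL term
# (on demonstration tokens) and a REINFORCE-style term (on sampled tokens).
# The loss forms and sft_weight are assumptions, not SRFT's formulation.
import torch

def combined_loss(sft_logprobs, rl_logprobs, advantages, sft_weight=0.5):
    sft_loss = -sft_logprobs.mean()                        # imitate demonstrations
    rl_loss = -(advantages.detach() * rl_logprobs).mean()  # reinforce sampled outputs
    return sft_weight * sft_loss + (1.0 - sft_weight) * rl_loss

# Dummy tensors standing in for sequence log-probabilities and advantages.
sft_lp = torch.randn(8, requires_grad=True)
rl_lp = torch.randn(8, requires_grad=True)
loss = combined_loss(sft_lp, rl_lp, torch.randn(8))
loss.backward()
```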
RLAE: Reinforcement Learning-Assisted Ensemble for LLMs
Ensembling large language models (LLMs) can effectively combine diverse strengths of different models, offering a promising approach to enhance performance across various tasks. However, existing methods typically rely on fixed weighting s…
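As context for the fixed-weighting limitation mentioned above, here is a minimal sketch of blending next-token distributions from several models under a softmax weighting; how RLAE actually learns or adapts these weights is not shown in the truncated abstract, so the weighting scheme here is an assumption.

```python
# Illustrative sketch only: blending next-token probability distributions
# from several language models with adjustable weights.
import numpy as np

def ensemble_next_token_probs(model_probs, weight_logits):
    """model_probs: (num_models, vocab) per-model distributions;
    weight_logits: (num_models,) unnormalized ensemble weights."""
    w = np.exp(weight_logits - weight_logits.max())
    w = w / w.sum()                        # softmax over models
    mixed = (w[:, None] * model_probs).sum(axis=0)
    return mixed / mixed.sum()             # renormalize for numerical safety

probs = np.array([[0.7, 0.2, 0.1],         # model A's next-token distribution
                  [0.3, 0.4, 0.3]])        # model B's next-token distribution
print(ensemble_next_token_probs(probs, np.array([0.0, 1.0])))
```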
TeViR: Text-to-Video Reward with Diffusion Models for Efficient Reinforcement Learning
Developing scalable and generalizable reward engineering for reinforcement learning (RL) is crucial for creating general-purpose agents, especially in the challenging domain of robotic manipulation. While recent advances in reward engineer…
Learning When to Think: Shaping Adaptive Reasoning in R1-Style Models via Multi-Stage RL
Large reasoning models (LRMs) are proficient at generating explicit, step-by-step reasoning sequences before producing final answers. However, such detailed reasoning can introduce substantial computational overhead and latency, particular…
In-Dataset Trajectory Return Regularization for Offline Preference-based Reinforcement Learning
Offline preference-based reinforcement learning (PbRL) typically operates in two phases: first, use human preferences to learn a reward model and annotate rewards for a reward-free offline dataset; second, learn a policy by optimizing the …
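The first phase described above is conventionally implemented with the Bradley-Terry preference model; the sketch below shows that standard loss, not the paper's specific in-dataset return regularization, and the linear reward model in the example is purely illustrative.

```python
# Standard Bradley-Terry preference loss for phase one of offline PbRL
# (reward learning from pairwise segment comparisons). Generic background only.
import torch

def preference_loss(reward_model, segment_a, segment_b, prefer_a):
    """segment_*: (batch, length, state_dim); prefer_a: 1.0 if A preferred, else 0.0."""
    return_a = reward_model(segment_a).squeeze(-1).sum(dim=-1)  # predicted segment returns
    return_b = reward_model(segment_b).squeeze(-1).sum(dim=-1)
    logits = return_a - return_b     # P(A preferred) = sigmoid(return_a - return_b)
    return torch.nn.functional.binary_cross_entropy_with_logits(logits, prefer_a)

# Illustrative linear reward model over 4-dim states, segments of length 10.
reward_model = torch.nn.Linear(4, 1)
seg_a, seg_b = torch.randn(2, 10, 4), torch.randn(2, 10, 4)
loss = preference_loss(reward_model, seg_a, seg_b, torch.tensor([1.0, 0.0]))
loss.backward()
```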
Enhancing LLM Reasoning with Iterative DPO: A Comprehensive Empirical Investigation
Recent advancements in post-training methodologies for large language models (LLMs) have highlighted reinforcement learning (RL) as a critical component for enhancing reasoning. However, the substantial computational costs associated with …
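For reference, the sketch below is the standard DPO objective (Rafailov et al., 2023) that an iterative scheme would reapply to preference pairs collected from the current policy; the iteration schedule and the `beta` value are not taken from this paper.

```python
# Standard DPO loss; an iterative scheme reapplies it to preference pairs
# generated by the current policy. beta is illustrative.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_lp, policy_rejected_lp,
             ref_chosen_lp, ref_rejected_lp, beta=0.1):
    """Inputs are sequence-level log-probabilities under the policy and a frozen reference."""
    chosen_logratio = policy_chosen_lp - ref_chosen_lp
    rejected_logratio = policy_rejected_lp - ref_rejected_lp
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()

# Dummy batch of 4 preference pairs.
loss = dpo_loss(torch.randn(4, requires_grad=True), torch.randn(4, requires_grad=True),
                torch.randn(4), torch.randn(4))
loss.backward()
```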
ConRFT: A Reinforced Fine-tuning Method for VLA Models via Consistency Policy
Vision-Language-Action (VLA) models have shown substantial potential in real-world robotic manipulation. However, fine-tuning these models through supervised learning struggles to achieve robust performance due to limited, inconsistent dem…
Online Preference-based Reinforcement Learning with Self-augmented Feedback from Large Language Model
Preference-based reinforcement learning (PbRL) provides a powerful paradigm to avoid meticulous reward engineering by learning rewards based on human preferences. However, real-time human feedback is hard to obtain in online tasks. Most wo…
Data Scaling Laws for Imitation Learning-Based End-to-End Autonomous Driving
The end-to-end autonomous driving paradigm has recently attracted considerable attention due to its scalability. However, existing methods are constrained by the limited scale of real-world data, which hinders a comprehensive exploration of the…
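As a hedged illustration of the scaling-law methodology the title refers to, the snippet below fits the commonly assumed power-law form error ≈ a·N^(-b) by linear regression in log-log space; the data points are synthetic by construction and do not come from the paper.

```python
# Hedged sketch of the usual scaling-law fit, error = a * N**(-b), via a
# log-log linear regression. The points are synthetic (generated from a
# known power law), not results from the paper.
import numpy as np

dataset_sizes = np.array([1e3, 1e4, 1e5, 1e6])   # hypothetical dataset sizes
errors = 2.0 * dataset_sizes ** -0.3             # synthetic error values

# log(error) = log(a) - b * log(N): ordinary least squares in log space.
slope, log_a = np.polyfit(np.log(dataset_sizes), np.log(errors), 1)
print(f"fitted exponent b = {-slope:.3f}, coefficient a = {np.exp(log_a):.3f}")
```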
CPIG: Leveraging Consistency Policy with Intention Guidance for Multi-agent Exploration
Efficient exploration is crucial in cooperative multi-agent reinforcement learning (MARL), especially in sparse-reward settings. However, due to their reliance on unimodal policies, existing methods are prone to falling into local opti…
SELU: Self-Learning Embodied MLLMs in Unknown Environments
Recently, multimodal large language models (MLLMs) have demonstrated strong visual understanding and decision-making capabilities, enabling the exploration of autonomously improving MLLMs in unknown environments. However, external feedback…
Generalizing Consistency Policy to Visual RL with Prioritized Proximal Experience Regularization
With high-dimensional state spaces, visual reinforcement learning (RL) faces significant challenges in exploitation and exploration, resulting in low sample efficiency and unstable training. As a time-efficient diffusion model, although c…
Discretizing Continuous Action Space with Unimodal Probability Distributions for On-Policy Reinforcement Learning
For on-policy reinforcement learning, discretizing the action space for continuous control can easily express multiple modes and is straightforward to optimize. However, without considering the inherent ordering between the discrete atomic act…
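One common way to respect the ordering of discrete atomic actions is to force the per-dimension distribution to be unimodal; the sketch below does this with a concave-quadratic logit shape around a predicted mode, which is an assumed parameterization for illustration rather than the paper's construction.

```python
# Sketch of one way to place a unimodal distribution over ordered action bins:
# logits shaped as a concave quadratic around a predicted mode. This particular
# parameterization is an assumption, not necessarily the paper's.
import numpy as np

def unimodal_bin_probs(mode, concentration, num_bins=11):
    """mode in [-1, 1]; larger concentration -> sharper peak around the mode."""
    bin_centers = np.linspace(-1.0, 1.0, num_bins)       # ordered atomic actions
    logits = -concentration * (bin_centers - mode) ** 2  # peak at the nearest bin
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()

print(np.round(unimodal_bin_probs(mode=0.3, concentration=8.0), 3))
```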
Dream to Drive With Predictive Individual World Model
Generating reactive driving behaviors in complex urban environments remains challenging, as road users' intentions are unknown. Model-based reinforcement learning (MBRL) offers great potential to learn a reactive policy by construc…
PlanAgent: A Multi-modal Large Language Agent for Closed-loop Vehicle Motion Planning
Vehicle motion planning is an essential component of autonomous driving technology. Current rule-based vehicle motion planning methods perform satisfactorily in common scenarios but struggle to generalize to long-tailed situations. Meanwhi…
Learning Future Representation with Synthetic Observations for Sample-efficient Reinforcement Learning
In visual Reinforcement Learning (RL), upstream representation learning largely determines the effectiveness of downstream policy learning. Employing auxiliary tasks allows the agent to enhance visual representations in a targeted manner, thereby …
User Response Modeling in Reinforcement Learning for Ads Allocation
User response modeling can enhance the learning of user representations and further improve the reinforcement learning (RL) recommender agent. However, as users' behaviors are influenced by their long-term preferences and short-term stocha…
Advancing Object Goal Navigation Through LLM-enhanced Object Affinities Transfer
In object goal navigation, agents navigate towards objects identified by category labels using visual and spatial information. Previous purely network-based methods typically rely on historical data for object affinity estimation, lac…
FM3Q: Factorized Multi-Agent MiniMax Q-Learning for Two-Team Zero-Sum Markov Game
Many real-world applications involve agents that fall into two teams, with payoffs that are equal within a team but of opposite sign across teams. The so-called two-team zero-sum Markov games (2t0sMGs) can be resolv…
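For background on the minimax objective in two-team zero-sum games, the sketch below computes a one-step minimax backup over a joint-action value table; it uses pure-strategy maximin for brevity (classical minimax-Q solves a small linear program over mixed strategies) and does not show FM3Q's factorization.

```python
# Generic minimax-Q target for a two-team zero-sum game: the protagonist team
# maximizes over its joint actions while the opponent team minimizes.
# Textbook background; FM3Q's factorized networks are not shown here.
import numpy as np

def minimax_q_target(reward, done, next_q_table, gamma=0.99):
    """next_q_table: (num_team_joint_actions, num_opponent_joint_actions) values
    at the next state; returns the one-step backup target."""
    # Pure-strategy maximin for simplicity: max over own team, min over opponent.
    value = np.max(np.min(next_q_table, axis=1))
    return reward + gamma * (1.0 - done) * value

next_q = np.array([[1.0, -0.5],
                   [0.2,  0.8]])
print(minimax_q_target(reward=0.0, done=0.0, next_q_table=next_q))
```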
RoboGPT: an intelligent agent of making embodied long-term decisions for daily instruction tasks
Robotic agents must master common sense and long-term sequential decision-making to solve daily tasks through natural language instructions. Developments in Large Language Models (LLMs) for natural language processing have inspired efforts to …
Boosting Continuous Control with Consistency Policy
Due to its training stability and strong expressiveness, the diffusion model has attracted considerable attention in offline reinforcement learning. However, several challenges have also come with it: 1) The demand for a large number of diffus…
ComSD: Balancing Behavioral Quality and Diversity in Unsupervised Skill Discovery
Unsupervised skill discovery seeks to acquire different useful skills wit…
Multi-modal Learning based Prediction for Disease
Non-alcoholic fatty liver disease (NAFLD) is the most common cause of chronic liver disease; accurate prediction can help prevent advanced fibrosis and cirrhosis. However, a liver biopsy, the gold standard for NAFLD diagnosis, is inv…
Score-Based Equilibrium Learning in Multi-Player Finite Games with Imperfect Information
Real-world games, which involve imperfect information, multiple players, and simultaneous moves, are less frequently discussed in the existing game theory literature. While reinforcement learning (RL) provides a general framework to ext…