Yangchen Pan
An MRP Formulation for Supervised Learning: Generalized Temporal Difference Learning Models
Background: Traditional supervised learning (SL) assumes data points are independently and identically distributed (i.i.d.), which overlooks dependencies in real-world data. Reinforcement learning (RL), in contrast, models dependencies thr…
Measures of Variability for Risk-averse Policy Gradient
Risk-averse reinforcement learning (RARL) is critical for decision-making under uncertainty, which is especially valuable in high-stake applications. However, most existing works focus on risk measures, e.g., conditional value-at-risk (CVa…
PANDAS: Improving Many-shot Jailbreaking via Positive Affirmation, Negative Demonstration, and Adaptive Sampling
Many-shot jailbreaking circumvents the safety alignment of LLMs by exploiting their ability to process long input sequences. To achieve this, the malicious target prompt is prefixed with hundreds of fabricated conversational exchanges betw…
DTR-Bench: An in silico Environment and Benchmark Platform for Reinforcement Learning Based Dynamic Treatment Regime
Reinforcement learning (RL) has garnered increasing recognition for its potential to optimise dynamic treatment regimes (DTRs) in personalised medicine, particularly for drug dosage prescriptions and medication recommendations. However, a …
Reinforcement Learning in Dynamic Treatment Regimes Needs Critical Reexamination
In the rapidly changing healthcare landscape, the implementation of offline reinforcement learning (RL) in dynamic treatment regimes (DTRs) presents a mix of unprecedented opportunities and challenges. This position paper offers a critical…
An MRP Formulation for Supervised Learning: Generalized Temporal Difference Learning Models
In traditional statistical learning, data points are usually assumed to be independently and identically distributed (i.i.d.) following an unknown probability distribution. This paper presents a contrasting viewpoint, perceiving data point…
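For background on the temporal difference models the title refers to, here is the classical TD(0) value update on a tiny Markov reward process; the transitions, reward, and step size are assumed toy values, and the paper's generalized, supervised-learning variant is not reproduced.

```python
import numpy as np

alpha, gamma = 0.1, 0.99      # step size and discount (assumed values)
V = np.zeros(3)               # value estimates for a 3-state chain 0 -> 1 -> 2 (terminal)

for _ in range(1000):
    s = 0
    while s < 2:
        s_next, r = s + 1, 1.0                      # deterministic toy transition and reward
        target = r + gamma * V[s_next] * (s_next != 2)
        V[s] += alpha * (target - V[s])             # TD(0): move V(s) toward the bootstrapped target
        s = s_next

print(V)   # approaches [r + gamma * r, r, 0]
```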
A Simple Mixture Policy Parameterization for Improving Sample Efficiency of CVaR Optimization
Reinforcement learning algorithms utilizing policy gradients (PG) to optimize Conditional Value at Risk (CVaR) face significant challenges with sample inefficiency, hindering their practical applications. This inefficiency stems from two m…
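Since the abstract centers on Conditional Value at Risk, a quick Monte Carlo sketch of how CVaR of a return distribution is typically estimated from sampled returns may help fix ideas; the risk level and the sampled returns below are illustrative assumptions, and this is not the paper's mixture-policy method.

```python
import numpy as np

def cvar_of_returns(returns, alpha=0.1):
    """Estimate CVaR_alpha of returns: the mean of the worst alpha-fraction of samples."""
    returns = np.sort(np.asarray(returns))          # ascending, so the worst returns come first
    k = max(1, int(np.ceil(alpha * len(returns))))  # number of tail samples to average
    return returns[:k].mean()

rng = np.random.default_rng(0)
sampled_returns = rng.normal(loc=1.0, scale=2.0, size=10_000)  # assumed return samples
print(cvar_of_returns(sampled_returns, alpha=0.1))
```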
Improving Adversarial Transferability via Model Alignment
Neural networks are susceptible to adversarial perturbations that are transferable across different models. In this paper, we introduce a novel model alignment technique aimed at improving a given source model's ability in generating trans…
Understanding the robustness difference between stochastic gradient descent and adaptive gradient methods
Stochastic gradient descent (SGD) and adaptive gradient methods, such as Adam and RMSProp, have been widely used in training deep neural networks. We empirically show that while the difference between the standard generalization performanc…
An Alternative to Variance: Gini Deviation for Risk-averse Policy Gradient
Restricting the variance of a policy's return is a popular choice in risk-averse Reinforcement Learning (RL) due to its clear mathematical definition and easy interpretability. Traditional methods directly restrict the total return varianc…
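As a rough reference for the risk measure named in the title, one common definition of Gini deviation is half the expected absolute difference between two independent copies of the return; whether the paper uses exactly this scaling is an assumption here, and the snippet is a Monte Carlo estimate, not the paper's policy-gradient algorithm.

```python
import numpy as np

def gini_deviation(samples, rng=None):
    """Monte Carlo estimate of 0.5 * E|X - X'| for i.i.d. copies X, X' of the return."""
    rng = rng or np.random.default_rng(0)
    x = np.asarray(samples)
    x1 = rng.choice(x, size=len(x))   # resample two (approximately) independent copies
    x2 = rng.choice(x, size=len(x))
    return 0.5 * np.abs(x1 - x2).mean()

returns = np.random.default_rng(1).normal(0.0, 1.0, size=10_000)  # assumed return samples
print(gini_deviation(returns))        # ≈ 1/sqrt(pi) ≈ 0.564 for a standard normal return
```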
Conditionally Optimistic Exploration for Cooperative Deep Multi-Agent Reinforcement Learning
Efficient exploration is critical in cooperative deep Multi-Agent Reinforcement Learning (MARL). In this work, we propose an exploration method that effectively encourages cooperative exploration based on the idea of sequential action-comp…
The In-Sample Softmax for Offline Reinforcement Learning
Reinforcement learning (RL) agents can leverage batches of previously collected data to extract a reasonable control policy. An emerging issue in this offline RL setting, however, is that the bootstrapping update underlying many of our met…
Label Alignment Regularization for Distribution Shift
Recent work has highlighted the label alignment property (LAP) in supervised learning, where the vector of all labels in the dataset is mostly in the span of the top few singular vectors of the data matrix. Drawing inspiration from this ob…
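The label alignment property described above can be checked directly with an SVD: project the label vector onto the top-k left singular vectors of the data matrix and see how much of its norm they capture. The sketch below only demonstrates the property on synthetic data with an assumed fast-decaying spectrum; it is not the paper's regularizer, and the matrix sizes and k are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 500, 50, 5

# Build a data matrix with a rapidly decaying spectrum, as is common for real datasets.
U0, _ = np.linalg.qr(rng.normal(size=(n, d)))
V0, _ = np.linalg.qr(rng.normal(size=(d, d)))
spectrum = 10.0 ** -np.arange(d)                          # assumed fast singular-value decay
X = U0 @ np.diag(spectrum) @ V0.T

y = X @ rng.normal(size=d) + 1e-4 * rng.normal(size=n)    # labels roughly linear in the features

# Label alignment: how much of y lies in the span of the top-k left singular vectors of X?
U, S, Vt = np.linalg.svd(X, full_matrices=False)
proj = U[:, :k] @ (U[:, :k].T @ y)
print(f"fraction of ||y|| in top-{k} singular directions: "
      f"{np.linalg.norm(proj) / np.linalg.norm(y):.3f}")
```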
Memory-efficient Reinforcement Learning with Value-based Knowledge Consolidation
Artificial neural networks are promising for general function approximation but challenging to train on non-independent or non-identically distributed data due to catastrophic forgetting. The experience replay buffer, a standard component …
STOPS: Short-Term-based Volatility-controlled Policy Search and its Global Convergence
It remains challenging to deploy existing risk-averse approaches in real-world applications. The reasons are manifold, including the lack of a global optimality guarantee and the necessity of learning from long-term consecutive trajectorie…
An Alternate Policy Gradient Estimator for Softmax Policies
Policy gradient (PG) estimators are ineffective in dealing with softmax policies that are sub-optimally saturated, which refers to the situation when the policy concentrates its probability mass on sub-optimal actions. Sub-optimal policy s…
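To make "sub-optimal saturation" concrete, the snippet below evaluates the standard softmax policy-gradient term for a two-armed bandit whose policy has collapsed onto the bad arm: the expected update toward the optimal arm is nearly zero. The toy preferences and rewards are assumptions, and the paper's alternative estimator is not shown.

```python
import numpy as np

def expected_softmax_pg(theta, rewards):
    """Expected vanilla policy gradient w.r.t. softmax preferences in a bandit."""
    pi = np.exp(theta - theta.max())
    pi /= pi.sum()
    grad = np.zeros_like(theta)
    for a, r in enumerate(rewards):
        one_hot = np.eye(len(theta))[a]
        grad += pi[a] * r * (one_hot - pi)   # pi(a) * r(a) * d log pi(a) / d theta
    return pi, grad

rewards = np.array([1.0, 0.0])               # action 0 is optimal (assumed toy rewards)
pi, grad = expected_softmax_pg(np.array([-8.0, 8.0]), rewards)  # saturated on the bad action
print(pi)     # ~[1e-7, 1.0]: almost all mass on the sub-optimal action
print(grad)   # near-zero expected update toward the optimal action
```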
Beyond Prioritized Replay: Sampling States in Model-Based RL via Simulated Priorities
The prioritized Experience Replay (ER) method has attracted great attention; however, there is little theoretical understanding of such prioritization strategies and why they help. In this work, we revisit prioritized ER and, in an ideal set…
Improving Sample Efficiency of Online Temporal Difference Learning
A common challenge in putting a reinforcement learning agent into practice is improving sample efficiency as much as possible under limited computational or memory resources. Such available physical resources may vary in di…
Beyond Prioritized Replay: Sampling States in Model-Based Reinforcement Learning via Simulated Priorities
The prioritized Experience Replay (ER) method has attracted great attention; however, there is little theoretical understanding of why it helps and what its limitations are. In this work, we take a deep look at the prioritized ER. In a superv…
Understanding and Mitigating the Limitations of Prioritized Experience Replay
Prioritized Experience Replay (ER) has been empirically shown to improve sample efficiency across many domains and attracted great attention; however, there is little theoretical understanding of why such prioritized sampling helps and its…
Maxmin Q-learning: Controlling the Estimation Bias of Q-learning
Q-learning suffers from overestimation bias, because it approximates the maximum action value using the maximum estimated action value. Algorithms have been proposed to reduce overestimation bias, but we lack an understanding of how bias i…
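The mechanism named in the title is easy to state: keep N independent Q estimates and form the bootstrap target from the action that maximizes the minimum of those estimates, which pulls the target down relative to standard Q-learning. The sketch below shows only that target computation under assumed shapes and values; it is not the authors' full algorithm.

```python
import numpy as np

def maxmin_target(reward, gamma, q_estimates_next):
    """Maxmin Q-learning bootstrap target.

    q_estimates_next has shape (N, num_actions): N independent Q estimates for
    the next state. Standard Q-learning maxes a single estimate over actions;
    Maxmin first takes the elementwise min over the N estimates, then maxes.
    """
    q_min = q_estimates_next.min(axis=0)          # pessimistic combined estimate
    return reward + gamma * q_min.max()

rng = np.random.default_rng(0)
q_next = rng.normal(loc=1.0, scale=0.5, size=(4, 3))   # N=4 estimates, 3 actions (assumed)
print(maxmin_target(reward=0.5, gamma=0.99, q_estimates_next=q_next))
```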
An implicit function learning approach for parametric modal regression
For multi-valued functions---such as when the conditional distribution on targets given the inputs is multi-modal---standard regression approaches are not always desirable because they provide the conditional mean. Modal regression algorit…
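The point about conditional means can be seen with one line of arithmetic: if the target is +1 or -1 with equal probability (regardless of the input), the least-squares-optimal prediction is 0, a value the data never takes. The snippet below is just this illustrative computation, not the paper's implicit-function method.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.choice([-1.0, 1.0], size=5_000)   # two modes with equal probability (assumed toy targets)

# The least-squares-optimal constant predictor is the sample mean of y:
print(y.mean())   # ≈ 0.0, which lies between the modes and is never an observed value
```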
Frequency-based Search-control in Dyna
Model-based reinforcement learning has been empirically demonstrated as a successful strategy to improve sample efficiency. In particular, Dyna is an elegant model-based architecture integrating learning and planning that provides huge fle…
Deep Tile Coder: an Efficient Sparse Representation Learning Approach with applications in Reinforcement Learning.
Recent work has shown that sparse representations -- where only a small percentage of units are active -- can significantly reduce interference. Those works, however, relied on relatively complex regularization or meta-learning approaches,…
Fuzzy Tiling Activations: A Simple Approach to Learning Sparse Representations Online
Recent work has shown that sparse representations -- where only a small percentage of units are active -- can significantly reduce interference. Those works, however, relied on relatively complex regularization or meta-learning approaches,…
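As a rough picture of what a tiling-based activation does, the sketch below bins a scalar input into one of several tiles and emits a one-hot, hence sparse, vector; the actual Fuzzy Tiling Activation softens the bin boundaries so gradients can flow, and the tile range and count here are assumed values rather than the paper's.

```python
import numpy as np

def hard_tiling_activation(x, low=-2.0, high=2.0, num_tiles=8):
    """Map a scalar to a sparse one-hot vector indicating which tile it falls in.

    This is the non-differentiable "hard" version; FTA replaces the hard
    indicator with a fuzzy one so the activation remains trainable.
    """
    edges = np.linspace(low, high, num_tiles + 1)
    idx = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, num_tiles - 1)
    out = np.zeros(num_tiles)
    out[idx] = 1.0
    return out

print(hard_tiling_activation(0.3))   # only one of the 8 units is active
```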
Hill Climbing on Value Estimates for Search-control in Dyna
Dyna is an architecture for model-based reinforcement learning (RL), where simulated experience from a model is used to update policies or value functions. A key component of Dyna is search-control, the mechanism to generate the state and …
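The "hill climbing" in the title refers to generating states for search-control by ascending the learned value estimate in state space. Below is a minimal sketch of that generation step with an assumed differentiable toy value function and step size; the paper's full Dyna loop, noise schedule, and queue management are omitted.

```python
import numpy as np

def value_gradient(state):
    """Gradient of an assumed toy value estimate V(s) = -||s - 1||^2 (peak at s = [1, 1])."""
    return -2.0 * (state - 1.0)

def hill_climb_states(start, steps=20, step_size=0.1, noise=0.01, rng=None):
    """Generate states for the search-control queue by noisy gradient ascent on V."""
    rng = rng or np.random.default_rng(0)
    s, out = np.array(start, dtype=float), []
    for _ in range(steps):
        s = s + step_size * value_gradient(s) + noise * rng.normal(size=s.shape)
        out.append(s.copy())
    return out

states = hill_climb_states(start=[0.0, 0.0])
print(states[0], states[-1])   # the trajectory climbs toward the high-value region near (1, 1)
```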
Actor-Expert: A Framework for using Q-learning in Continuous Action Spaces
Q-learning can be difficult to use in continuous action spaces, because a difficult optimization has to be solved to find the maximal action. Some common strategies have been to discretize the action space, solve the maximization with a po…
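The "discretize the action space" strategy mentioned above amounts to approximating max_a Q(s, a) by evaluating Q on a finite grid of actions; a sketch under an assumed toy Q function follows. The Actor-Expert framework itself, which instead learns an actor to propose the maximizing action, is not shown here.

```python
import numpy as np

def q_function(state, action):
    """Assumed toy action-value function with its maximum at action = 0.3 * state."""
    return -(action - 0.3 * state) ** 2

def greedy_action_by_discretization(state, low=-1.0, high=1.0, num_bins=101):
    """Approximate argmax_a Q(state, a) over a continuous interval with a uniform grid."""
    candidates = np.linspace(low, high, num_bins)
    values = q_function(state, candidates)
    return candidates[np.argmax(values)]

print(greedy_action_by_discretization(state=2.0))   # ≈ 0.6, up to grid resolution
```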
Greedy Actor-Critic: A New Conditional Cross-Entropy Method for Policy Improvement
Many policy gradient methods are variants of Actor-Critic (AC), where a value function (critic) is learned to facilitate updating the parameterized policy (actor). The update to the actor involves a log-likelihood update weighted by the ac…
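The actor update described in the first sentences is the familiar likelihood-ratio form: the gradient of log pi(a|s) weighted by a critic value. The snippet below writes that update for a tabular softmax actor with an assumed critic value; it is the standard AC update the abstract contrasts against, not the proposed conditional cross-entropy method.

```python
import numpy as np

def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

def actor_update(theta, action, critic_value, lr=0.1):
    """One likelihood-ratio actor step: theta += lr * Q(s, a) * d log pi(a) / d theta."""
    pi = softmax(theta)
    grad_log_pi = -pi
    grad_log_pi[action] += 1.0      # d log softmax(theta)[a] / d theta = e_a - pi
    return theta + lr * critic_value * grad_log_pi

theta = np.zeros(3)                  # preferences for 3 actions (assumed toy setting)
theta = actor_update(theta, action=2, critic_value=1.5)
print(softmax(theta))                # probability mass shifts toward action 2
```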
Organizing Experience: a Deeper Look at Replay Mechanisms for Sample-Based Planning in Continuous State Domains
Model-based strategies for control are critical for obtaining sample-efficient learning. Dyna is a planning paradigm that naturally interleaves learning and planning by simulating one-step experience to update the action-value function. This …
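Since the abstract concerns replay-style planning in Dyna, a bare-bones Dyna-Q planning step may help fix ideas: store the model as observed transitions, then repeatedly sample a stored state-action pair, simulate its outcome, and apply a Q-learning update. Everything here (the deterministic last-seen model, step sizes, and number of planning updates) is an assumed toy setup, not the paper's continuous-state method.

```python
import random
from collections import defaultdict

alpha, gamma, n_planning = 0.1, 0.95, 10
Q = defaultdict(float)                     # Q[(state, action)]
model = {}                                 # model[(state, action)] = (reward, next_state)

def q_update(s, a, r, s_next, actions=(0, 1)):
    best_next = max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

def dyna_q_step(s, a, r, s_next):
    q_update(s, a, r, s_next)              # learn directly from the real transition
    model[(s, a)] = (r, s_next)            # update the (deterministic, last-seen) model
    for _ in range(n_planning):            # planning: replay simulated one-step experience
        ps, pa = random.choice(list(model))
        pr, ps_next = model[(ps, pa)]
        q_update(ps, pa, pr, ps_next)

dyna_q_step(s=0, a=1, r=1.0, s_next=1)     # one real transition in an assumed toy MDP
print(Q[(0, 1)])
```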