Yangchen Pan
An MRP Formulation for Supervised Learning: Generalized Temporal Difference Learning Models
Background: Traditional supervised learning (SL) assumes data points are independently and identically distributed (i.i.d.), which overlooks dependencies in real-world data. Reinforcement learning (RL), in contrast, models dependencies thr…
Measures of Variability for Risk-averse Policy Gradient
Risk-averse reinforcement learning (RARL) is critical for decision-making under uncertainty, which is especially valuable in high-stake applications. However, most existing works focus on risk measures, e.g., conditional value-at-risk (CVa…
PANDAS: Improving Many-shot Jailbreaking via Positive Affirmation, Negative Demonstration, and Adaptive Sampling
Many-shot jailbreaking circumvents the safety alignment of LLMs by exploiting their ability to process long input sequences. To achieve this, the malicious target prompt is prefixed with hundreds of fabricated conversational exchanges betw…
DTR-Bench: An in silico Environment and Benchmark Platform for Reinforcement Learning Based Dynamic Treatment Regime
Reinforcement learning (RL) has garnered increasing recognition for its potential to optimise dynamic treatment regimes (DTRs) in personalised medicine, particularly for drug dosage prescriptions and medication recommendations. However, a …
Reinforcement Learning in Dynamic Treatment Regimes Needs Critical Reexamination
In the rapidly changing healthcare landscape, the implementation of offline reinforcement learning (RL) in dynamic treatment regimes (DTRs) presents a mix of unprecedented opportunities and challenges. This position paper offers a critical…
An MRP Formulation for Supervised Learning: Generalized Temporal Difference Learning Models
In traditional statistical learning, data points are usually assumed to be independently and identically distributed (i.i.d.) following an unknown probability distribution. This paper presents a contrasting viewpoint, perceiving data point…
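For background on the temporal difference models the title refers to, here is the classical TD(0) value update on a tiny Markov reward process; the transitions, reward, and step size are assumed toy values, and the paper's generalized, supervised-learning variant is not reproduced.

```python
import numpy as np

alpha, gamma = 0.1, 0.99      # step size and discount (assumed values)
V = np.zeros(3)               # value estimates for a 3-state chain 0 -> 1 -> 2 (terminal)

for _ in range(1000):
    s = 0
    while s < 2:
        s_next, r = s + 1, 1.0                      # deterministic toy transition and reward
        target = r + gamma * V[s_next] * (s_next != 2)
        V[s] += alpha * (target - V[s])             # TD(0): move V(s) toward the bootstrapped target
        s = s_next

print(V)   # approaches [r + gamma * r, r, 0]
```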
A Simple Mixture Policy Parameterization for Improving Sample Efficiency of CVaR Optimization
Reinforcement learning algorithms utilizing policy gradients (PG) to optimize Conditional Value at Risk (CVaR) face significant challenges with sample inefficiency, hindering their practical applications. This inefficiency stems from two m…
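Since the abstract centers on Conditional Value at Risk, a quick Monte Carlo sketch of how CVaR of a return distribution is typically estimated from sampled returns may help fix ideas; the risk level and the sampled returns below are illustrative assumptions, and this is not the paper's mixture-policy method.

```python
import numpy as np

def cvar_of_returns(returns, alpha=0.1):
    """Estimate CVaR_alpha of returns: the mean of the worst alpha-fraction of samples."""
    returns = np.sort(np.asarray(returns))          # ascending, so the worst returns come first
    k = max(1, int(np.ceil(alpha * len(returns))))  # number of tail samples to average
    return returns[:k].mean()

rng = np.random.default_rng(0)
sampled_returns = rng.normal(loc=1.0, scale=2.0, size=10_000)  # assumed return samples
print(cvar_of_returns(sampled_returns, alpha=0.1))
```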
Improving Adversarial Transferability via Model Alignment
Neural networks are susceptible to adversarial perturbations that are transferable across different models. In this paper, we introduce a novel model alignment technique aimed at improving a given source model's ability in generating trans…
Understanding the robustness difference between stochastic gradient descent and adaptive gradient methods
Stochastic gradient descent (SGD) and adaptive gradient methods, such as Adam and RMSProp, have been widely used in training deep neural networks. We empirically show that while the difference between the standard generalization performanc…
An Alternative to Variance: Gini Deviation for Risk-averse Policy Gradient
Restricting the variance of a policy's return is a popular choice in risk-averse Reinforcement Learning (RL) due to its clear mathematical definition and easy interpretability. Traditional methods directly restrict the total return varianc…
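As a rough reference for the risk measure named in the title, one common definition of Gini deviation is half the expected absolute difference between two independent copies of the return; whether the paper uses exactly this scaling is an assumption here, and the snippet is a Monte Carlo estimate, not the paper's policy-gradient algorithm.

```python
import numpy as np

def gini_deviation(samples, rng=None):
    """Monte Carlo estimate of 0.5 * E|X - X'| for i.i.d. copies X, X' of the return."""
    rng = rng or np.random.default_rng(0)
    x = np.asarray(samples)
    x1 = rng.choice(x, size=len(x))   # resample two (approximately) independent copies
    x2 = rng.choice(x, size=len(x))
    return 0.5 * np.abs(x1 - x2).mean()

returns = np.random.default_rng(1).normal(0.0, 1.0, size=10_000)  # assumed return samples
print(gini_deviation(returns))        # ≈ 1/sqrt(pi) ≈ 0.564 for a standard normal return
```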
Conditionally Optimistic Exploration for Cooperative Deep Multi-Agent Reinforcement Learning
Efficient exploration is critical in cooperative deep Multi-Agent Reinforcement Learning (MARL). In this work, we propose an exploration method that effectively encourages cooperative exploration based on the idea of sequential action-comp…
The In-Sample Softmax for Offline Reinforcement Learning
Reinforcement learning (RL) agents can leverage batches of previously collected data to extract a reasonable control policy. An emerging issue in this offline RL setting, however, is that the bootstrapping update underlying many of our met…
Label Alignment Regularization for Distribution Shift
Recent work has highlighted the label alignment property (LAP) in supervised learning, where the vector of all labels in the dataset is mostly in the span of the top few singular vectors of the data matrix. Drawing inspiration from this ob…
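The label alignment property described above can be checked directly with an SVD: project the label vector onto the top-k left singular vectors of the data matrix and see how much of its norm they capture. The sketch below only demonstrates the property on synthetic data with an assumed fast-decaying spectrum; it is not the paper's regularizer, and the matrix sizes and k are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 500, 50, 5

# Build a data matrix with a rapidly decaying spectrum, as is common for real datasets.
U0, _ = np.linalg.qr(rng.normal(size=(n, d)))
V0, _ = np.linalg.qr(rng.normal(size=(d, d)))
spectrum = 10.0 ** -np.arange(d)                          # assumed fast singular-value decay
X = U0 @ np.diag(spectrum) @ V0.T

y = X @ rng.normal(size=d) + 1e-4 * rng.normal(size=n)    # labels roughly linear in the features

# Label alignment: how much of y lies in the span of the top-k left singular vectors of X?
U, S, Vt = np.linalg.svd(X, full_matrices=False)
proj = U[:, :k] @ (U[:, :k].T @ y)
print(f"fraction of ||y|| in top-{k} singular directions: "
      f"{np.linalg.norm(proj) / np.linalg.norm(y):.3f}")
```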
Memory-efficient Reinforcement Learning with Value-based Knowledge Consolidation
Artificial neural networks are promising for general function approximation but challenging to train on non-independent or non-identically distributed data due to catastrophic forgetting. The experience replay buffer, a standard component …
STOPS: Short-Term-based Volatility-controlled Policy Search and its Global Convergence
It remains challenging to deploy existing risk-averse approaches in real-world applications. The reasons are manifold, including the lack of a global optimality guarantee and the necessity of learning from long-term consecutive trajectorie…
An Alternate Policy Gradient Estimator for Softmax Policies
Policy gradient (PG) estimators are ineffective in dealing with softmax policies that are sub-optimally saturated, which refers to the situation when the policy concentrates its probability mass on sub-optimal actions. Sub-optimal policy s…
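To make "sub-optimal saturation" concrete, the snippet below evaluates the standard softmax policy-gradient term for a two-armed bandit whose policy has collapsed onto the bad arm: the expected update toward the optimal arm is nearly zero. The toy preferences and rewards are assumptions, and the paper's alternative estimator is not shown.

```python
import numpy as np

def expected_softmax_pg(theta, rewards):
    """Expected vanilla policy gradient w.r.t. softmax preferences in a bandit."""
    pi = np.exp(theta - theta.max())
    pi /= pi.sum()
    grad = np.zeros_like(theta)
    for a, r in enumerate(rewards):
        one_hot = np.eye(len(theta))[a]
        grad += pi[a] * r * (one_hot - pi)   # pi(a) * r(a) * d log pi(a) / d theta
    return pi, grad

rewards = np.array([1.0, 0.0])               # action 0 is optimal (assumed toy rewards)
pi, grad = expected_softmax_pg(np.array([-8.0, 8.0]), rewards)  # saturated on the bad action
print(pi)     # ~[1e-7, 1.0]: almost all mass on the sub-optimal action
print(grad)   # near-zero expected update toward the optimal action
```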
Beyond Prioritized Replay: Sampling States in Model-Based RL via Simulated Priorities
The prioritized Experience Replay (ER) method has attracted great attention; however, there is little theoretical understanding of such prioritization strategies and why they help. In this work, we revisit prioritized ER and, in an ideal set…
Improving Sample Efficiency of Online Temporal Difference Learning
A common challenge in putting a reinforcement learning agent into practice is improving sample efficiency as much as possible under limited computational or memory resources. Such available physical resources may vary in di…
Beyond Prioritized Replay: Sampling States in Model-Based Reinforcement Learning via Simulated Priorities
The prioritized Experience Replay (ER) method has attracted great attention; however, there is little theoretical understanding of why it helps and what its limitations are. In this work, we take a deep look at the prioritized ER. In a superv…
Understanding and Mitigating the Limitations of Prioritized Experience Replay
Prioritized Experience Replay (ER) has been empirically shown to improve sample efficiency across many domains and attracted great attention; however, there is little theoretical understanding of why such prioritized sampling helps and its…
Maxmin Q-learning: Controlling the Estimation Bias of Q-learning
Q-learning suffers from overestimation bias, because it approximates the maximum action value using the maximum estimated action value. Algorithms have been proposed to reduce overestimation bias, but we lack an understanding of how bias i…
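The mechanism named in the title is easy to state: keep N independent Q estimates and form the bootstrap target from the action that maximizes the minimum of those estimates, which pulls the target down relative to standard Q-learning. The sketch below shows only that target computation under assumed shapes and values; it is not the authors' full algorithm.

```python
import numpy as np

def maxmin_target(reward, gamma, q_estimates_next):
    """Maxmin Q-learning bootstrap target.

    q_estimates_next has shape (N, num_actions): N independent Q estimates for
    the next state. Standard Q-learning maxes a single estimate over actions;
    Maxmin first takes the elementwise min over the N estimates, then maxes.
    """
    q_min = q_estimates_next.min(axis=0)          # pessimistic combined estimate
    return reward + gamma * q_min.max()

rng = np.random.default_rng(0)
q_next = rng.normal(loc=1.0, scale=0.5, size=(4, 3))   # N=4 estimates, 3 actions (assumed)
print(maxmin_target(reward=0.5, gamma=0.99, q_estimates_next=q_next))
```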
An implicit function learning approach for parametric modal regression
For multi-valued functions---such as when the conditional distribution on targets given the inputs is multi-modal---standard regression approaches are not always desirable because they provide the conditional mean. Modal regression algorit…
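The point about conditional means can be seen with one line of arithmetic: if the target is +1 or -1 with equal probability (regardless of the input), the least-squares-optimal prediction is 0, a value the data never takes. The snippet below is just this illustrative computation, not the paper's implicit-function method.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.choice([-1.0, 1.0], size=5_000)   # two modes with equal probability (assumed toy targets)

# The least-squares-optimal constant predictor is the sample mean of y:
print(y.mean())   # ≈ 0.0, which lies between the modes and is never an observed value
```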
Frequency-based Search-control in Dyna
Model-based reinforcement learning has been empirically demonstrated as a successful strategy to improve sample efficiency. In particular, Dyna is an elegant model-based architecture integrating learning and planning that provides huge fle…
Deep Tile Coder: an Efficient Sparse Representation Learning Approach with applications in Reinforcement Learning.
Recent work has shown that sparse representations -- where only a small percentage of units are active -- can significantly reduce interference. Those works, however, relied on relatively complex regularization or meta-learning approaches,…
Fuzzy Tiling Activations: A Simple Approach to Learning Sparse Representations Online
Recent work has shown that sparse representations -- where only a small percentage of units are active -- can significantly reduce interference. Those works, however, relied on relatively complex regularization or meta-learning approaches,…
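As a rough picture of what a tiling-based activation does, the sketch below bins a scalar input into one of several tiles and emits a one-hot, hence sparse, vector; the actual Fuzzy Tiling Activation softens the bin boundaries so gradients can flow, and the tile range and count here are assumed values rather than the paper's.

```python
import numpy as np

def hard_tiling_activation(x, low=-2.0, high=2.0, num_tiles=8):
    """Map a scalar to a sparse one-hot vector indicating which tile it falls in.

    This is the non-differentiable "hard" version; FTA replaces the hard
    indicator with a fuzzy one so the activation remains trainable.
    """
    edges = np.linspace(low, high, num_tiles + 1)
    idx = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, num_tiles - 1)
    out = np.zeros(num_tiles)
    out[idx] = 1.0
    return out

print(hard_tiling_activation(0.3))   # only one of the 8 units is active
```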
Hill Climbing on Value Estimates for Search-control in Dyna
Dyna is an architecture for model-based reinforcement learning (RL), where simulated experience from a model is used to update policies or value functions. A key component of Dyna is search-control, the mechanism to generate the state and …
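The "hill climbing" in the title refers to generating states for search-control by ascending the learned value estimate in state space. Below is a minimal sketch of that generation step with an assumed differentiable toy value function and step size; the paper's full Dyna loop, noise schedule, and queue management are omitted.

```python
import numpy as np

def value_gradient(state):
    """Gradient of an assumed toy value estimate V(s) = -||s - 1||^2 (peak at s = [1, 1])."""
    return -2.0 * (state - 1.0)

def hill_climb_states(start, steps=20, step_size=0.1, noise=0.01, rng=None):
    """Generate states for the search-control queue by noisy gradient ascent on V."""
    rng = rng or np.random.default_rng(0)
    s, out = np.array(start, dtype=float), []
    for _ in range(steps):
        s = s + step_size * value_gradient(s) + noise * rng.normal(size=s.shape)
        out.append(s.copy())
    return out

states = hill_climb_states(start=[0.0, 0.0])
print(states[0], states[-1])   # the trajectory climbs toward the high-value region near (1, 1)
```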
Actor-Expert: A Framework for using Q-learning in Continuous Action Spaces
Q-learning can be difficult to use in continuous action spaces, because a difficult optimization has to be solved to find the maximal action. Some common strategies have been to discretize the action space, solve the maximization with a po…
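The "discretize the action space" strategy mentioned above amounts to approximating max_a Q(s, a) by evaluating Q on a finite grid of actions; a sketch under an assumed toy Q function follows. The Actor-Expert framework itself, which instead learns an actor to propose the maximizing action, is not shown here.

```python
import numpy as np

def q_function(state, action):
    """Assumed toy action-value function with its maximum at action = 0.3 * state."""
    return -(action - 0.3 * state) ** 2

def greedy_action_by_discretization(state, low=-1.0, high=1.0, num_bins=101):
    """Approximate argmax_a Q(state, a) over a continuous interval with a uniform grid."""
    candidates = np.linspace(low, high, num_bins)
    values = q_function(state, candidates)
    return candidates[np.argmax(values)]

print(greedy_action_by_discretization(state=2.0))   # ≈ 0.6, up to grid resolution
```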
Greedy Actor-Critic: A New Conditional Cross-Entropy Method for Policy Improvement
Many policy gradient methods are variants of Actor-Critic (AC), where a value function (critic) is learned to facilitate updating the parameterized policy (actor). The update to the actor involves a log-likelihood update weighted by the ac…
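The actor update described in the first sentences is the familiar likelihood-ratio form: the gradient of log pi(a|s) weighted by a critic value. The snippet below writes that update for a tabular softmax actor with an assumed critic value; it is the standard AC update the abstract contrasts against, not the proposed conditional cross-entropy method.

```python
import numpy as np

def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

def actor_update(theta, action, critic_value, lr=0.1):
    """One likelihood-ratio actor step: theta += lr * Q(s, a) * d log pi(a) / d theta."""
    pi = softmax(theta)
    grad_log_pi = -pi
    grad_log_pi[action] += 1.0      # d log softmax(theta)[a] / d theta = e_a - pi
    return theta + lr * critic_value * grad_log_pi

theta = np.zeros(3)                  # preferences for 3 actions (assumed toy setting)
theta = actor_update(theta, action=2, critic_value=1.5)
print(softmax(theta))                # probability mass shifts toward action 2
```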
Organizing Experience: a Deeper Look at Replay Mechanisms for Sample-Based Planning in Continuous State Domains
Model-based strategies for control are critical for obtaining sample-efficient learning. Dyna is a planning paradigm that naturally interleaves learning and planning by simulating one-step experience to update the action-value function. This …
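Since the abstract concerns replay-style planning in Dyna, a bare-bones Dyna-Q planning step may help fix ideas: store the model as observed transitions, then repeatedly sample a stored state-action pair, simulate its outcome, and apply a Q-learning update. Everything here (the deterministic last-seen model, step sizes, and number of planning updates) is an assumed toy setup, not the paper's continuous-state method.

```python
import random
from collections import defaultdict

alpha, gamma, n_planning = 0.1, 0.95, 10
Q = defaultdict(float)                     # Q[(state, action)]
model = {}                                 # model[(state, action)] = (reward, next_state)

def q_update(s, a, r, s_next, actions=(0, 1)):
    best_next = max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

def dyna_q_step(s, a, r, s_next):
    q_update(s, a, r, s_next)              # learn directly from the real transition
    model[(s, a)] = (r, s_next)            # update the (deterministic, last-seen) model
    for _ in range(n_planning):            # planning: replay simulated one-step experience
        ps, pa = random.choice(list(model))
        pr, ps_next = model[(ps, pa)]
        q_update(ps, pa, pr, ps_next)

dyna_q_step(s=0, a=1, r=1.0, s_next=1)     # one real transition in an assumed toy MDP
print(Q[(0, 1)])
```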