Simon S. Du
Optimal Multi-Distribution Learning
Multi-distribution learning (MDL), which seeks to learn a shared model that minimizes the worst-case risk across k distinct data distributions, has emerged as a unified framework in response to the evolving demand for robustness, fairness,…
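For reference, the worst-case objective described above can be written as a min-max problem (the notation below is generic, not necessarily the paper's):

```latex
% Worst-case objective over k data distributions D_1, ..., D_k, with hypothesis
% class H and loss \ell (notation is generic, not necessarily the paper's):
\[
  \min_{h \in \mathcal{H}} \; \max_{i \in [k]} \; R_i(h),
  \qquad
  R_i(h) \;=\; \mathbb{E}_{(x,y) \sim \mathcal{D}_i}\!\left[\ell\big(h(x), y\big)\right].
\]
```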
Policy-Based Trajectory Clustering in Offline Reinforcement Learning
We introduce a novel task of clustering trajectories from offline reinforcement learning (RL) datasets, where each cluster center represents the policy that generated its trajectories. By leveraging the connection between the KL-divergence…
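A minimal sketch of the kind of alternating scheme such a formulation suggests, assuming (purely for illustration) tabular policies as cluster centers and log-likelihood-based assignment; this is not the paper's algorithm:

```python
import numpy as np

# Illustrative sketch only: trajectories are lists of (state, action) pairs over
# discrete state/action spaces, and each cluster center is a tabular policy
# re-estimated from the trajectories currently assigned to it. This is an assumed
# setup for illustration, not the paper's algorithm.

def estimate_policy(trajs, n_states, n_actions, alpha=1.0):
    """Maximum-likelihood tabular policy with add-alpha smoothing."""
    counts = np.full((n_states, n_actions), alpha)
    for traj in trajs:
        for s, a in traj:
            counts[s, a] += 1
    return counts / counts.sum(axis=1, keepdims=True)

def log_likelihood(traj, policy):
    """Log-probability of the trajectory's actions under the policy."""
    return sum(np.log(policy[s, a]) for s, a in traj)

def cluster_trajectories(trajs, k, n_states, n_actions, n_iters=20, seed=0):
    rng = np.random.default_rng(seed)
    assign = rng.integers(0, k, size=len(trajs))
    for _ in range(n_iters):
        # Re-estimate one policy per cluster from its assigned trajectories.
        policies = [estimate_policy([t for t, c in zip(trajs, assign) if c == j],
                                    n_states, n_actions) for j in range(k)]
        # Reassign each trajectory to the policy most likely to have generated it.
        assign = np.array([int(np.argmax([log_likelihood(t, p) for p in policies]))
                           for t in trajs])
    return assign, policies
```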
Understanding the Performance Gap in Preference Learning: A Dichotomy of RLHF and DPO
We present a fine-grained theoretical analysis of the performance gap between reinforcement learning from human feedback (RLHF) and direct preference optimization (DPO) under a representation gap. Our study decomposes this gap into two sou…
Highlighting What Matters: Promptable Embeddings for Attribute-Focused Image Retrieval
While an image is worth more than a thousand words, only a few provide crucial information for a given task and thus should be focused on. In light of this, ideal text-to-image (T2I) retrievers should prioritize specific visual attributes …
Settling the Sample Complexity of Online Reinforcement Learning
A central issue in online reinforcement learning (RL) is data efficiency. While a number of recent works achieved asymptotically minimal regret in online RL, the optimality of these results is only guaranteed in a “large…
Improving Human-AI Coordination through Online Adversarial Training and Generative Models
Being able to cooperate with diverse humans is an important component of many economically valuable AI tasks, from household robotics to autonomous driving. However, generalizing to novel humans requires training on data that captures the …
Cross-environment Cooperation Enables Zero-shot Multi-agent Coordination
Zero-shot coordination (ZSC), the ability to adapt to a new partner in a cooperative task, is a critical component of human-compatible AI. While prior work has focused on training agents to cooperate on a single task, these specialized mod…
Extragradient Preference Optimization (EGPO): Beyond Last-Iterate Convergence for Nash Learning from Human Feedback
Reinforcement learning from human feedback (RLHF) has become essential for improving language model capabilities, but traditional approaches rely on the assumption that human preferences follow a transitive Bradley-Terry model. This assump…
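For context, the generic extragradient update the title refers to, shown on a toy bilinear zero-sum game; this illustrates only the optimizer, not the paper's preference-optimization objective:

```python
import numpy as np

# Generic extragradient update on a toy unconstrained bilinear game min_x max_y x^T A y.
# This illustrates only the optimizer named in the title, not the paper's
# preference-optimization objective.
def extragradient(A, x, y, eta=0.1, steps=1000):
    for _ in range(steps):
        # Extrapolation (half) step using gradients at the current point.
        x_half = x - eta * (A @ y)
        y_half = y + eta * (A.T @ x)
        # Update step using gradients evaluated at the extrapolated point.
        x = x - eta * (A @ y_half)
        y = y + eta * (A.T @ x_half)
    return x, y

A = np.array([[0.0, 1.0], [-1.0, 0.0]])
x, y = extragradient(A, np.array([1.0, 0.0]), np.array([0.0, 1.0]))
print(np.linalg.norm(x), np.linalg.norm(y))  # both shrink toward the saddle point at the origin
```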
A Minimalist Example of Edge-of-Stability and Progressive Sharpening
Recent advances in deep learning optimization have unveiled two intriguing phenomena under large learning rates: Edge of Stability (EoS) and Progressive Sharpening (PS), challenging classical Gradient Descent (GD) analyses. Current researc…
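As background, both phenomena are usually stated relative to the classical stability threshold of gradient descent (a standard fact, not the paper's result):

```latex
% Classical descent-regime condition that EoS sits at the boundary of: for GD with
% step size \eta, the quadratic model contracts only while the sharpness stays
% below 2/\eta.
\[
  x_{t+1} = x_t - \eta \nabla L(x_t), \qquad
  \lambda_{\max}\!\big(\nabla^2 L(x_t)\big) \;<\; \frac{2}{\eta}.
\]
% EoS: the sharpness hovers near 2/\eta. PS: the sharpness rises steadily during training.
```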
SHARP: Accelerating Language Model Inference by SHaring Adjacent layers with Recovery Parameters
While large language models (LLMs) have advanced natural language processing tasks, their growing computational and memory demands make deployment on resource-constrained devices like mobile phones increasingly challenging. In this paper, …
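One hypothetical reading of the title's sharing-with-recovery idea, purely for illustration: a layer reuses its neighbor's frozen weights and only learns a small low-rank correction. This construction is an assumption, not necessarily the paper's.

```python
import torch
import torch.nn as nn

# Hypothetical reading of "share adjacent layers with recovery parameters": a layer
# reuses its neighbor's frozen weights and only learns a small low-rank correction.
# This construction is an assumption for illustration, not necessarily the paper's.
class SharedLinearWithRecovery(nn.Module):
    def __init__(self, shared_linear: nn.Linear, rank: int = 8):
        super().__init__()
        self.shared = shared_linear                      # borrowed from the adjacent layer
        for p in self.shared.parameters():
            p.requires_grad_(False)                      # shared weights stay frozen
        d_out, d_in = shared_linear.weight.shape
        self.down = nn.Linear(d_in, rank, bias=False)    # trainable "recovery" parameters
        self.up = nn.Linear(rank, d_out, bias=False)
        nn.init.zeros_(self.up.weight)                   # start as an exact copy of the shared layer

    def forward(self, x):
        return self.shared(x) + self.up(self.down(x))
```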
Is Your World Simulator a Good Story Presenter? A Consecutive Events-Based Benchmark for Future Long Video Generation
Current state-of-the-art video generative models can produce commercial-grade videos with highly realistic details. However, they still struggle to coherently present multiple sequential events in the stories specified by the prompts, …
Hybrid Preference Optimization for Alignment: Provably Faster Convergence Rates by Combining Offline Preferences with Online Exploration
Reinforcement Learning from Human Feedback (RLHF) is currently the leading approach for aligning large language models with human preferences. Typically, these models rely on extensive offline preference datasets for training. However, off…
Anytime Acceleration of Gradient Descent
This work investigates stepsize-based acceleration of gradient descent with anytime convergence guarantees. For smooth (non-strongly) convex optimization, we propose a stepsize schedule that allows gradient descent to achieve converg…
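For comparison, the standard constant-stepsize baseline and what an "anytime" guarantee asks for (both standard facts, not the paper's result):

```latex
% Baseline for comparison: constant-stepsize GD with step size 1/L on an L-smooth
% convex f satisfies, for every T,
\[
  f(x_T) - f^\star \;\le\; \frac{L\,\|x_0 - x^\star\|^2}{2T}.
\]
% "Anytime" acceleration asks for a faster-than-O(1/T) guarantee that holds
% simultaneously at every iterate T, rather than only at one horizon fixed in advance.
```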
Learning to Cooperate with Humans using Generative Agents
Training agents that can coordinate zero-shot with humans is a key mission in multi-agent reinforcement learning (MARL). Current algorithms focus on training simulated human partner policies which are then used to train a Cooperator agent.…
Exploring How Generative MLLMs Perceive More Than CLIP with the Same Vision Encoder
Recent research has shown that CLIP models struggle with visual reasoning tasks that require grounding compositionality, understanding spatial relationships, or capturing fine-grained details. One natural hypothesis is that the CLIP vision…
Transformers are Efficient Compilers, Provably
Transformer-based large language models (LLMs) have demonstrated surprisingly robust performance across a wide range of language-related tasks, including programming language understanding and generation. In this paper, we take the first s…
Preference-Based Multi-Agent Reinforcement Learning: Data Coverage and Algorithmic Techniques
We initiate the study of Preference-Based Multi-Agent Reinforcement Learning (PbMARL), exploring both theoretical foundations and empirical validations. We define the task as identifying the Nash equilibrium from a preference-only offline …
Understanding the Gains from Repeated Self-Distillation
Self-Distillation is a special type of knowledge distillation where the student model has the same architecture as the teacher model. Despite using the same architecture and the same training data, self-distillation has been empirically ob…
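A minimal sketch of repeated self-distillation in a simple ridge-regression instantiation, assuming (for illustration) that each round refits to a mix of the original labels and the previous round's predictions; the paper's exact setting may differ:

```python
import numpy as np

# Minimal sketch of repeated self-distillation with ridge regression (the same
# "architecture" at every round): each round refits to a mix of the original labels
# and the previous round's predictions. Illustrative only; the paper's exact setting
# may differ.
def ridge_fit(X, y, lam):
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def repeated_self_distillation(X, y, lam=1.0, rounds=3, mix=0.5):
    targets = y.copy()
    for _ in range(rounds):
        w = ridge_fit(X, targets, lam)
        targets = mix * y + (1.0 - mix) * (X @ w)   # soften labels with the model's own predictions
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = X @ rng.normal(size=10) + 0.1 * rng.normal(size=200)
w = repeated_self_distillation(X, y)
```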
Cost-Effective Proxy Reward Model Construction with On-Policy and Active Learning
Reinforcement learning with human feedback (RLHF), as a widely adopted approach in current large language model pipelines, is bottlenecked by the size of human preference data. While traditional methods rely on offline preference …
Toward Global Convergence of Gradient EM for Over-Parameterized Gaussian Mixture Models
We study the gradient Expectation-Maximization (EM) algorithm for Gaussian Mixture Models (GMM) in the over-parameterized setting, where a general GMM with $n>1$ components learns from data that are generated by a single ground truth Gauss…
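A minimal sketch of gradient EM in this over-parameterized setting, assuming (for illustration) known isotropic covariance, uniform mixing weights, and an assumed step size; this is not the paper's exact parameterization:

```python
import numpy as np

# Minimal sketch of gradient EM for a GMM with known isotropic covariance and uniform
# mixing weights, fitting n > 1 components to data drawn from a single ground-truth
# Gaussian (the over-parameterized setting described above). The step size and
# parameterization are assumptions for illustration.
def responsibilities(X, mus):
    d2 = ((X[:, None, :] - mus[None, :, :]) ** 2).sum(-1)          # squared distances
    w = np.exp(-(d2 - d2.min(axis=1, keepdims=True)) / 2)          # stable softmax numerator
    return w / w.sum(axis=1, keepdims=True)

def gradient_em(X, n_components, eta=1.0, steps=200, seed=0):
    rng = np.random.default_rng(seed)
    mus = rng.normal(size=(n_components, X.shape[1]))
    for _ in range(steps):
        w = responsibilities(X, mus)                                # E-step
        grad = (w[:, :, None] * (X[:, None, :] - mus[None, :, :])).mean(axis=0)
        mus = mus + eta * grad                                      # one gradient ascent M-step
    return mus

X = np.random.default_rng(1).normal(size=(1000, 2))                 # single ground-truth Gaussian
mus = gradient_em(X, n_components=3)
```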
Decoding-Time Language Model Alignment with Multiple Objectives
Aligning language models (LMs) to human preferences has emerged as a critical pursuit, enabling these models to better serve diverse user needs. Existing methods primarily focus on optimizing LMs for a single reward function, limiting thei…
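A hedged sketch of decoding-time combination across objectives: given next-token logits from models each aligned to a single objective, mix them with user-chosen weights before sampling. The plain linear combination below is an illustrative assumption; the paper's combination rule may differ.

```python
import numpy as np

# Hedged illustration of decoding-time combination across objectives: given next-token
# logits from several models, each aligned to a single objective, mix them with
# user-chosen weights before sampling. The plain linear combination is an assumption
# for illustration; the paper's combination rule may differ.
def sample_next_token(per_objective_logits, weights, rng):
    mixed = sum(w * l for w, l in zip(weights, per_objective_logits))
    probs = np.exp(mixed - mixed.max())
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)

rng = np.random.default_rng(0)
logits_helpful = rng.normal(size=50)   # hypothetical model aligned for helpfulness
logits_safe = rng.normal(size=50)      # hypothetical model aligned for safety
token = sample_next_token([logits_helpful, logits_safe], weights=[0.7, 0.3], rng=rng)
```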
CLIPLoss and Norm-Based Data Selection Methods for Multimodal Contrastive Learning
Data selection has emerged as a core issue for large-scale visual-language model pretraining (e.g., CLIP), particularly with noisy web-curated datasets. Three main data selection approaches are: (1) leveraging external non-CLIP models to ai…
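For context, the CLIP-similarity baseline that such data-selection work starts from (not the paper's proposed CLIPLoss or norm-based scores): rank image-text pairs by embedding cosine similarity and keep the top fraction.

```python
import numpy as np

# The CLIP-similarity baseline that data-selection work starts from (not the paper's
# proposed CLIPLoss or norm-based scores): rank image-text pairs by cosine similarity
# of their precomputed CLIP embeddings and keep the top fraction.
def clip_score_filter(img_emb, txt_emb, keep_frac=0.3):
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    scores = (img * txt).sum(axis=1)              # cosine similarity per pair
    k = int(len(scores) * keep_frac)
    return np.argsort(scores)[::-1][:k]           # indices of the pairs to keep
```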
Rethinking Transformers in Solving POMDPs
Sequential decision-making algorithms such as reinforcement learning (RL) in real-world scenarios inevitably face environments with partial observability. This paper scrutinizes the effectiveness of a popular architecture, namely Transform…
Horizon-Free Regret for Linear Markov Decision Processes
A recent line of work showed that regret bounds in reinforcement learning (RL) can be (nearly) independent of planning horizon, a.k.a. horizon-free bounds. However, these regret bounds only apply to settings where a polynomial dependency o…
Distributional Successor Features Enable Zero-Shot Policy Optimization
Intelligent agents must be generalists, capable of quickly adapting to various tasks. In reinforcement learning (RL), model-based RL learns a dynamics model of the world, in principle enabling transfer to arbitrary reward functions through…
Reflect-RL: Two-Player Online RL Fine-Tuning for LMs
As language models (LMs) demonstrate their capabilities in various fields, their application to tasks requiring multi-round interactions has become increasingly popular. These tasks usually have complex dynamics, so supervised fine-tuning …
Offline Multi-task Transfer RL with Representational Penalization
We study the problem of representation transfer in offline Reinforcement Learning (RL), where a learner has access to episodic data from a number of source tasks collected a priori, and aims to learn a shared representation to be used in f…
Learning Optimal Tax Design in Nonatomic Congestion Games
In multiplayer games, self-interested behavior among the players can harm the social welfare. Tax mechanisms are a common method to alleviate this issue and induce socially optimal behavior. In this work, we take the initial step of learni…
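Classical background for the sentence above (a standard result, not the paper's contribution): in a nonatomic congestion game with edge latency c_e(·), the marginal-cost (Pigouvian) tax aligns the equilibrium flow with the socially optimal flow.

```latex
% Classical background: with edge latency c_e(f_e), the marginal-cost (Pigouvian)
% tax makes the equilibrium flow coincide with the flow minimizing the social cost
% \sum_e f_e c_e(f_e):
\[
  \tau_e(f_e) \;=\; f_e \, c_e'(f_e).
\]
```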
Refined Sample Complexity for Markov Games with Independent Linear Function Approximation
Markov Games (MGs) are an important model for Multi-Agent Reinforcement Learning (MARL). It was long believed that the "curse of multi-agents" (i.e., the algorithmic performance drops exponentially with the number of agents) is unavoidable u…
Variance Alignment Score: A Simple But Tough-to-Beat Data Selection Method for Multimodal Contrastive Learning
In recent years, data selection has emerged as a core issue for large-scale visual-language model pretraining, especially on noisy web-curated datasets. One widely adopted strategy assigns quality scores such as CLIP similarity for each sa…