John Schulman
From HAL 9000 to Agentic AI: A Constitutional Framework for Enterprise Automation
Agentic AI systems mark a shift from passive, prompt-driven models to autonomous actors that perceive, plan, and execute actions within enterprise infrastructures. This autonomy introduces risks that exceed conventional bias and safety con…
Reasoning Models Don't Always Say What They Think
Chain-of-thought (CoT) offers a potential boon for AI safety as it allows monitoring a model's CoT to try to understand its intentions and reasoning processes. However, the effectiveness of such monitoring hinges on CoTs faithfully represe…
Measuring short-form factuality in large language models
We present SimpleQA, a benchmark that evaluates the ability of language models to answer short, fact-seeking questions. We prioritized two properties in designing this eval. First, SimpleQA is challenging, as it is adversarially collected …
Rule Based Rewards for Language Model Safety
Reinforcement learning based fine-tuning of large language models (LLMs) on human preferences has been shown to enhance both their capabilities and safety behavior. However, in cases related to safety, without precise instructions to human…
Quantifying the Sim-To-Real Gap in UAV Disturbance Rejection
Due to the safety risks and training sample inefficiency, it is often preferred to develop controllers in simulation. However, minor differences between the simulation and the real world can cause a significant sim-to-real gap. This gap ca…
Training Verifiers to Solve Math Word Problems
State-of-the-art language models can match human performance on many tasks, but they still struggle to robustly perform multi-step mathematical reasoning. To diagnose the failures of current models and support research, we introduce GSM8K,…
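The verifier-based decoding this abstract alludes to can be sketched as best-of-n reranking; `generate` and `verify` below are hypothetical stand-ins for the solution sampler and the trained verifier, not the paper's actual interfaces:

```python
def best_of_n(problem, generate, verify, n=8):
    # Sample n candidate solutions, then return the one the
    # verifier scores highest rather than trusting a single sample.
    candidates = [generate(problem) for _ in range(n)]
    return max(candidates, key=lambda sol: verify(problem, sol))
```

The key design point is that ranking many samples with a learned verifier can outperform simply fine-tuning the generator harder.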
Batch size-invariance for policy optimization
We say an algorithm is batch size-invariant if changes to the batch size can largely be compensated for by changes to other hyperparameters. Stochastic gradient descent is well-known to have this property at small batch sizes, via the lear…
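The property described for SGD can be illustrated on a toy objective: scaling the batch size and the learning rate together (and taking proportionally fewer steps) gives roughly the same optimization behavior. This is a minimal sketch of the linear-scaling compensation, not the paper's policy-optimization setting:

```python
import numpy as np

def sgd_on_quadratic(batch_size, lr, steps, seed=0):
    # Minimize E[(w - x)^2 / 2] with x ~ N(1, 1) via mini-batch SGD;
    # the minimizer is w = 1.
    rng = np.random.default_rng(seed)
    w = 0.0
    for _ in range(steps):
        x = rng.normal(1.0, 1.0, size=batch_size)
        grad = np.mean(w - x)   # stochastic gradient estimate
        w -= lr * grad
    return w

# 4x the batch size, 4x the learning rate, 4x fewer steps:
# both runs land near the same solution.
w_small = sgd_on_quadratic(batch_size=8,  lr=0.01, steps=4000)
w_large = sgd_on_quadratic(batch_size=32, lr=0.04, steps=1000)
```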
Unsolved Problems in ML Safety
Machine learning (ML) systems are rapidly increasing in size, are acquiring new capabilities, and are increasingly deployed in high-stakes settings. As with other powerful technologies, safety for ML should be a leading research priority. …
Measuring Sample Efficiency and Generalization in Reinforcement Learning Benchmarks: NeurIPS 2020 Procgen Benchmark
The NeurIPS 2020 Procgen Competition was designed as a centralized benchmark with clearly defined tasks for measuring Sample Efficiency and Generalization in Reinforcement Learning. Generalization remains one of the most fundamental challe…
Scaling Laws for Autoregressive Generative Modeling
We identify empirical scaling laws for the cross-entropy loss in four domains: generative image modeling, video modeling, multimodal image$\leftrightarrow$text models, and mathematical problem solving. In all cases autoregressive Transform…
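Scaling laws of this kind are typically parametrized as a power law plus an irreducible term; as a generic sketch (the symbols $L_\infty$, $x_0$, $\alpha_x$ denote the standard parametrization, not fitted values from the paper):

```latex
L(x) \approx L_\infty + \left(\frac{x_0}{x}\right)^{\alpha_x}
```

where $x$ is a scale variable such as model size, compute, or dataset size, and $L_\infty$ is the loss floor that remains as $x \to \infty$.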
Phasic Policy Gradient
We introduce Phasic Policy Gradient (PPG), a reinforcement learning framework which modifies traditional on-policy actor-critic methods by separating policy and value function training into distinct phases. In prior methods, one must choos…
Static Analysis of Shape in TensorFlow Programs
Machine learning has been widely adopted in diverse science and engineering domains, aided by reusable libraries and quick development patterns. The TensorFlow library is probably the best-known representative of this trend and most users …
Leveraging Procedural Generation to Benchmark Reinforcement Learning
We introduce Procgen Benchmark, a suite of 16 procedurally generated game-like environments designed to benchmark both sample efficiency and generalization in reinforcement learning. We believe that the community will benefit from increase…
Teacher–Student Curriculum Learning
We propose Teacher-Student Curriculum Learning (TSCL), a framework for automatic curriculum learning, where the Student tries to learn a complex task and the Teacher automatically chooses subtasks from a given set for the Student to train …
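The core idea of a TSCL-style Teacher can be sketched as a bandit that prefers the subtask with the largest recent learning progress. The class and method names below are ours for illustration, and the progress measure (absolute score change over a short window) is one of several variants:

```python
import random

class LearningProgressTeacher:
    """Toy curriculum teacher: pick the subtask whose recent score
    is changing fastest, with epsilon-greedy exploration."""
    def __init__(self, n_tasks, eps=0.1):
        self.scores = [[] for _ in range(n_tasks)]
        self.eps = eps

    def choose(self):
        if random.random() < self.eps:
            return random.randrange(len(self.scores))
        def progress(history):
            if len(history) < 2:
                return float("inf")        # try unseen tasks first
            return abs(history[-1] - history[0])  # slope proxy
        return max(range(len(self.scores)),
                   key=lambda t: progress(self.scores[t]))

    def update(self, task, score, window=10):
        # Record the Student's score and keep only a recent window.
        self.scores[task].append(score)
        self.scores[task] = self.scores[task][-window:]
```

Using absolute progress means the Teacher also revisits tasks the Student is forgetting (scores falling), not just tasks it is improving on.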
The MineRL 2019 Competition on Sample Efficient Reinforcement Learning using Human Priors
Though deep reinforcement learning has led to breakthroughs in many difficult domains, these successes have required an ever-increasing number of samples. As state-of-the-art reinforcement learning (RL) systems require an exponentially inc…
Policy Gradient Search: Online Planning and Expert Iteration without Search Trees
Monte Carlo Tree Search (MCTS) algorithms perform simulation-based search to improve policies online. During search, the simulation policy is adapted to explore the most promising lines of play. MCTS has been used by state-of-the-art progr…
Semi-Supervised Learning by Label Gradient Alignment
We present label gradient alignment, a novel algorithm for semi-supervised learning which imputes labels for the unlabeled data and trains on the imputed labels. We define a semantically meaningful distance metric on the input space by map…
Quantifying Generalization in Reinforcement Learning
In this paper, we investigate the problem of overfitting in deep reinforcement learning. Among the most common benchmarks in RL, it is customary to use the same environments for both training and testing. This practice offers relatively li…
Model-Based Reinforcement Learning via Meta-Policy Optimization
Model-based reinforcement learning approaches carry the promise of being data efficient. However, due to challenges in learning dynamics models that sufficiently match the real-world dynamics, they struggle to achieve the same asymptotic p…
Learning Complex Dexterous Manipulation with Deep Reinforcement Learning and Demonstrations
Dexterous multi-fingered hands are extremely versatile and provide a generic way to perform a multitude of tasks in human-centric environments. However, effectively controlling them remains challenging due to their high dimensionality and …
Gotta Learn Fast: A New Benchmark for Generalization in RL
In this report, we present a new reinforcement learning (RL) benchmark based on the Sonic the Hedgehog (TM) video game franchise. This benchmark is intended to measure the performance of transfer learning and few-shot learning algorithms i…
On First-Order Meta-Learning Algorithms
This paper considers meta-learning problems, where there is a distribution of tasks, and we would like to obtain an agent that performs well (i.e., learns quickly) when presented with a previously unseen task sampled from this distribution…
Reptile: a Scalable Metalearning Algorithm
This paper considers metalearning problems, where there is a distribution of tasks, and we would like to obtain an agent that performs well (i.e., learns quickly) when presented with a previously unseen task sampled from this distribution.…
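Reptile itself is simple enough to sketch in a few lines: sample a task, run a few steps of SGD from the current meta-parameters, then move the meta-parameters toward the result. The toy task family below (each task is a 1-D quadratic with its own target) is our illustration, not an experiment from the paper:

```python
import numpy as np

def reptile(n_iters=2000, inner_steps=5, inner_lr=0.1, meta_lr=0.1, seed=0):
    # Each task: minimize (w - c)^2 / 2 for a task-specific target
    # c ~ N(0.5, 0.1). Reptile should settle the meta-parameters near
    # a point from which a few inner steps solve any sampled task.
    rng = np.random.default_rng(seed)
    theta = np.zeros(1)
    for _ in range(n_iters):
        c = rng.normal(0.5, 0.1)         # sample a task
        phi = theta.copy()
        for _ in range(inner_steps):
            phi -= inner_lr * (phi - c)  # inner SGD on this task
        theta += meta_lr * (phi - theta) # Reptile meta-update
    return float(theta[0])
```

Note the meta-update uses only the displacement `phi - theta`, with no second-order terms, which is what makes the algorithm first-order and cheap to scale.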
Meta Learning Shared Hierarchies
We develop a metalearning approach for learning hierarchically structured policies, improving sample efficiency on unseen tasks through the use of shared primitives---policies that are executed for large numbers of timesteps. Specifically,…
UCB and InfoGain Exploration via $\boldsymbol{Q}$-Ensembles
We show how an ensemble of $Q^*$-functions can be leveraged for more effective exploration in deep reinforcement learning. We build on well established algorithms from the bandit setting, and adapt them to the $Q$-learning setting. First w…
UCB Exploration via Q-Ensembles
We show how an ensemble of $Q^*$-functions can be leveraged for more effective exploration in deep reinforcement learning. We build on well established algorithms from the bandit setting, and adapt them to the $Q$-learning setting. We prop…
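The bandit-style UCB rule adapted here can be sketched as action selection on the ensemble's mean plus a multiple of its standard deviation, which serves as an optimism bonus. This is a minimal tabular sketch of the selection rule, not the paper's deep-RL implementation:

```python
import numpy as np

def ucb_action(q_ensemble, state_idx, lam=1.0):
    """UCB action selection over an ensemble of Q-functions:
    argmax_a [ mean_k Q_k(s, a) + lam * std_k Q_k(s, a) ].
    q_ensemble: array of shape (K, n_states, n_actions)."""
    q_sa = q_ensemble[:, state_idx, :]                 # (K, n_actions)
    score = q_sa.mean(axis=0) + lam * q_sa.std(axis=0) # optimism bonus
    return int(np.argmax(score))
```

Actions the ensemble disagrees about get a larger bonus, so uncertainty directly drives exploration.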
Equivalence Between Policy Gradients and Soft Q-Learning
Two of the leading approaches for model-free reinforcement learning are policy gradient methods and $Q$-learning methods. $Q$-learning methods can be effective and sample-efficient when they work, however, it is not well-understood why the…
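The equivalence rests on the entropy-regularized optimal policy having Boltzmann form; as a sketch with temperature $\tau$:

```latex
\pi^*(a \mid s) = \exp\!\left(\frac{Q(s,a) - V(s)}{\tau}\right),
\qquad
V(s) = \tau \log \sum_{a} \exp\!\left(\frac{Q(s,a)}{\tau}\right)
```

Under this parametrization, the soft Q-learning update and the entropy-regularized policy gradient update can be shown to coincide.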
#Exploration: A Study of Count-Based Exploration for Deep Reinforcement Learning
Count-based exploration algorithms are known to perform near-optimally when used in conjunction with tabular reinforcement learning (RL) methods for solving small discrete Markov decision processes (MDPs). It is generally thought that coun…
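The paper's extension of count-based exploration to high-dimensional states can be sketched with static hashing: map each state to a short binary code via a fixed random projection (SimHash-style) and pay an exploration bonus that decays with the visit count of that code. Class and parameter names below are ours for illustration:

```python
import numpy as np

class HashCounter:
    """Count-based exploration bonus over hashed states:
    bonus(s) = beta / sqrt(n(phi(s))), where phi is a k-bit
    random-projection (SimHash-style) code of the state."""
    def __init__(self, state_dim, k=16, beta=0.1, seed=0):
        rng = np.random.default_rng(seed)
        self.A = rng.normal(size=(k, state_dim))  # fixed random projection
        self.counts = {}
        self.beta = beta

    def bonus(self, state):
        # Hash the state to a k-bit code, bump its count,
        # and return the decaying novelty bonus.
        code = tuple((self.A @ np.asarray(state, dtype=float) > 0).astype(int))
        self.counts[code] = self.counts.get(code, 0) + 1
        return self.beta / np.sqrt(self.counts[code])
```

Because nearby states tend to share a code, the hash trades exactness for generalization, which is what lets tabular-style counts work in continuous state spaces.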
RL$^2$: Fast Reinforcement Learning via Slow Reinforcement Learning
Deep reinforcement learning (deep RL) has been successful in learning sophisticated behaviors automatically; however, the learning process requires a huge number of trials. In contrast, animals can learn new tasks in just a few trials, ben…
Variational Lossy Autoencoder
Representation learning seeks to expose certain aspects of observed data in a learned representation that's amenable to downstream tasks like classification. For instance, a good representation for 2D images might be one that describes onl…