John Schulman
From HAL 9000 to Agentic AI: A Constitutional Framework for Enterprise Automation
Agentic AI systems mark a shift from passive, prompt-driven models to autonomous actors that perceive, plan, and execute actions within enterprise infrastructures. This autonomy introduces risks that exceed conventional bias and safety con…
Reasoning Models Don't Always Say What They Think
Chain-of-thought (CoT) offers a potential boon for AI safety as it allows monitoring a model's CoT to try to understand its intentions and reasoning processes. However, the effectiveness of such monitoring hinges on CoTs faithfully represe…
Measuring short-form factuality in large language models
We present SimpleQA, a benchmark that evaluates the ability of language models to answer short, fact-seeking questions. We prioritized two properties in designing this eval. First, SimpleQA is challenging, as it is adversarially collected …
Rule Based Rewards for Language Model Safety
Reinforcement learning based fine-tuning of large language models (LLMs) on human preferences has been shown to enhance both their capabilities and safety behavior. However, in cases related to safety, without precise instructions to human…
Quantifying the Sim-To-Real Gap in UAV Disturbance Rejection
Due to the safety risks and training sample inefficiency, it is often preferred to develop controllers in simulation. However, minor differences between the simulation and the real world can cause a significant sim-to-real gap. This gap ca…
Training Verifiers to Solve Math Word Problems
State-of-the-art language models can match human performance on many tasks, but they still struggle to robustly perform multi-step mathematical reasoning. To diagnose the failures of current models and support research, we introduce GSM8K,…
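The verifier-based decoding this abstract alludes to can be sketched as best-of-n reranking; `generate` and `verify` below are hypothetical stand-ins for the solution sampler and the trained verifier, not the paper's actual interfaces:

```python
def best_of_n(problem, generate, verify, n=8):
    # Sample n candidate solutions, then return the one the
    # verifier scores highest rather than trusting a single sample.
    candidates = [generate(problem) for _ in range(n)]
    return max(candidates, key=lambda sol: verify(problem, sol))
```

The key design point is that ranking many samples with a learned verifier can outperform simply fine-tuning the generator harder.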
Batch size-invariance for policy optimization
We say an algorithm is batch size-invariant if changes to the batch size can largely be compensated for by changes to other hyperparameters. Stochastic gradient descent is well-known to have this property at small batch sizes, via the lear…
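The property described for SGD can be illustrated on a toy objective: scaling the batch size and the learning rate together (and taking proportionally fewer steps) gives roughly the same optimization behavior. This is a minimal sketch of the linear-scaling compensation, not the paper's policy-optimization setting:

```python
import numpy as np

def sgd_on_quadratic(batch_size, lr, steps, seed=0):
    # Minimize E[(w - x)^2 / 2] with x ~ N(1, 1) via mini-batch SGD;
    # the minimizer is w = 1.
    rng = np.random.default_rng(seed)
    w = 0.0
    for _ in range(steps):
        x = rng.normal(1.0, 1.0, size=batch_size)
        grad = np.mean(w - x)   # stochastic gradient estimate
        w -= lr * grad
    return w

# 4x the batch size, 4x the learning rate, 4x fewer steps:
# both runs land near the same solution.
w_small = sgd_on_quadratic(batch_size=8,  lr=0.01, steps=4000)
w_large = sgd_on_quadratic(batch_size=32, lr=0.04, steps=1000)
```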
Unsolved Problems in ML Safety
Machine learning (ML) systems are rapidly increasing in size, are acquiring new capabilities, and are increasingly deployed in high-stakes settings. As with other powerful technologies, safety for ML should be a leading research priority. …
Measuring Sample Efficiency and Generalization in Reinforcement Learning Benchmarks: NeurIPS 2020 Procgen Benchmark
The NeurIPS 2020 Procgen Competition was designed as a centralized benchmark with clearly defined tasks for measuring Sample Efficiency and Generalization in Reinforcement Learning. Generalization remains one of the most fundamental challe…
Scaling Laws for Autoregressive Generative Modeling
We identify empirical scaling laws for the cross-entropy loss in four domains: generative image modeling, video modeling, multimodal image$\leftrightarrow$text models, and mathematical problem solving. In all cases autoregressive Transform…
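Scaling laws of this kind are typically parametrized as a power law plus an irreducible term; as a generic sketch (the symbols $L_\infty$, $x_0$, $\alpha_x$ denote the standard parametrization, not fitted values from the paper):

```latex
L(x) \approx L_\infty + \left(\frac{x_0}{x}\right)^{\alpha_x}
```

where $x$ is a scale variable such as model size, compute, or dataset size, and $L_\infty$ is the loss floor that remains as $x \to \infty$.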
Phasic Policy Gradient
We introduce Phasic Policy Gradient (PPG), a reinforcement learning framework which modifies traditional on-policy actor-critic methods by separating policy and value function training into distinct phases. In prior methods, one must choos…
Static Analysis of Shape in TensorFlow Programs
Machine learning has been widely adopted in diverse science and engineering domains, aided by reusable libraries and quick development patterns. The TensorFlow library is probably the best-known representative of this trend and most users …
Leveraging Procedural Generation to Benchmark Reinforcement Learning
We introduce Procgen Benchmark, a suite of 16 procedurally generated game-like environments designed to benchmark both sample efficiency and generalization in reinforcement learning. We believe that the community will benefit from increase…
Teacher–Student Curriculum Learning
We propose Teacher-Student Curriculum Learning (TSCL), a framework for automatic curriculum learning, where the Student tries to learn a complex task and the Teacher automatically chooses subtasks from a given set for the Student to train …
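The core idea of a TSCL-style Teacher can be sketched as a bandit that prefers the subtask with the largest recent learning progress. The class and method names below are ours for illustration, and the progress measure (absolute score change over a short window) is one of several variants:

```python
import random

class LearningProgressTeacher:
    """Toy curriculum teacher: pick the subtask whose recent score
    is changing fastest, with epsilon-greedy exploration."""
    def __init__(self, n_tasks, eps=0.1):
        self.scores = [[] for _ in range(n_tasks)]
        self.eps = eps

    def choose(self):
        if random.random() < self.eps:
            return random.randrange(len(self.scores))
        def progress(history):
            if len(history) < 2:
                return float("inf")        # try unseen tasks first
            return abs(history[-1] - history[0])  # slope proxy
        return max(range(len(self.scores)),
                   key=lambda t: progress(self.scores[t]))

    def update(self, task, score, window=10):
        # Record the Student's score and keep only a recent window.
        self.scores[task].append(score)
        self.scores[task] = self.scores[task][-window:]
```

Using absolute progress means the Teacher also revisits tasks the Student is forgetting (scores falling), not just tasks it is improving on.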
The MineRL 2019 Competition on Sample Efficient Reinforcement Learning using Human Priors
Though deep reinforcement learning has led to breakthroughs in many difficult domains, these successes have required an ever-increasing number of samples. As state-of-the-art reinforcement learning (RL) systems require an exponentially inc…
Policy Gradient Search: Online Planning and Expert Iteration without Search Trees
Monte Carlo Tree Search (MCTS) algorithms perform simulation-based search to improve policies online. During search, the simulation policy is adapted to explore the most promising lines of play. MCTS has been used by state-of-the-art progr…
Semi-Supervised Learning by Label Gradient Alignment
We present label gradient alignment, a novel algorithm for semi-supervised learning which imputes labels for the unlabeled data and trains on the imputed labels. We define a semantically meaningful distance metric on the input space by map…
Quantifying Generalization in Reinforcement Learning
In this paper, we investigate the problem of overfitting in deep reinforcement learning. Among the most common benchmarks in RL, it is customary to use the same environments for both training and testing. This practice offers relatively li…
Model-Based Reinforcement Learning via Meta-Policy Optimization
Model-based reinforcement learning approaches carry the promise of being data efficient. However, due to challenges in learning dynamics models that sufficiently match the real-world dynamics, they struggle to achieve the same asymptotic p…
Learning Complex Dexterous Manipulation with Deep Reinforcement Learning and Demonstrations
Dexterous multi-fingered hands are extremely versatile and provide a generic way to perform a multitude of tasks in human-centric environments. However, effectively controlling them remains challenging due to their high dimensionality and …
Gotta Learn Fast: A New Benchmark for Generalization in RL
In this report, we present a new reinforcement learning (RL) benchmark based on the Sonic the Hedgehog (TM) video game franchise. This benchmark is intended to measure the performance of transfer learning and few-shot learning algorithms i…
On First-Order Meta-Learning Algorithms
This paper considers meta-learning problems, where there is a distribution of tasks, and we would like to obtain an agent that performs well (i.e., learns quickly) when presented with a previously unseen task sampled from this distribution…
Reptile: a Scalable Metalearning Algorithm
This paper considers metalearning problems, where there is a distribution of tasks, and we would like to obtain an agent that performs well (i.e., learns quickly) when presented with a previously unseen task sampled from this distribution.…
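Reptile itself is simple enough to sketch in a few lines: sample a task, run a few steps of SGD from the current meta-parameters, then move the meta-parameters toward the result. The toy task family below (each task is a 1-D quadratic with its own target) is our illustration, not an experiment from the paper:

```python
import numpy as np

def reptile(n_iters=2000, inner_steps=5, inner_lr=0.1, meta_lr=0.1, seed=0):
    # Each task: minimize (w - c)^2 / 2 for a task-specific target
    # c ~ N(0.5, 0.1). Reptile should settle the meta-parameters near
    # a point from which a few inner steps solve any sampled task.
    rng = np.random.default_rng(seed)
    theta = np.zeros(1)
    for _ in range(n_iters):
        c = rng.normal(0.5, 0.1)         # sample a task
        phi = theta.copy()
        for _ in range(inner_steps):
            phi -= inner_lr * (phi - c)  # inner SGD on this task
        theta += meta_lr * (phi - theta) # Reptile meta-update
    return float(theta[0])
```

Note the meta-update uses only the displacement `phi - theta`, with no second-order terms, which is what makes the algorithm first-order and cheap to scale.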
Meta Learning Shared Hierarchies
We develop a metalearning approach for learning hierarchically structured policies, improving sample efficiency on unseen tasks through the use of shared primitives---policies that are executed for large numbers of timesteps. Specifically,…
UCB and InfoGain Exploration via $\boldsymbol{Q}$-Ensembles
We show how an ensemble of $Q^*$-functions can be leveraged for more effective exploration in deep reinforcement learning. We build on well established algorithms from the bandit setting, and adapt them to the $Q$-learning setting. First w…
UCB Exploration via Q-Ensembles
We show how an ensemble of $Q^*$-functions can be leveraged for more effective exploration in deep reinforcement learning. We build on well established algorithms from the bandit setting, and adapt them to the $Q$-learning setting. We prop…
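The bandit-style UCB rule adapted here can be sketched as action selection on the ensemble's mean plus a multiple of its standard deviation, which serves as an optimism bonus. This is a minimal tabular sketch of the selection rule, not the paper's deep-RL implementation:

```python
import numpy as np

def ucb_action(q_ensemble, state_idx, lam=1.0):
    """UCB action selection over an ensemble of Q-functions:
    argmax_a [ mean_k Q_k(s, a) + lam * std_k Q_k(s, a) ].
    q_ensemble: array of shape (K, n_states, n_actions)."""
    q_sa = q_ensemble[:, state_idx, :]                 # (K, n_actions)
    score = q_sa.mean(axis=0) + lam * q_sa.std(axis=0) # optimism bonus
    return int(np.argmax(score))
```

Actions the ensemble disagrees about get a larger bonus, so uncertainty directly drives exploration.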
Equivalence Between Policy Gradients and Soft Q-Learning
Two of the leading approaches for model-free reinforcement learning are policy gradient methods and $Q$-learning methods. $Q$-learning methods can be effective and sample-efficient when they work, however, it is not well-understood why the…
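The equivalence rests on the entropy-regularized optimal policy having Boltzmann form; as a sketch with temperature $\tau$:

```latex
\pi^*(a \mid s) = \exp\!\left(\frac{Q(s,a) - V(s)}{\tau}\right),
\qquad
V(s) = \tau \log \sum_{a} \exp\!\left(\frac{Q(s,a)}{\tau}\right)
```

Under this parametrization, the soft Q-learning update and the entropy-regularized policy gradient update can be shown to coincide.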
#Exploration: A Study of Count-Based Exploration for Deep Reinforcement Learning
Count-based exploration algorithms are known to perform near-optimally when used in conjunction with tabular reinforcement learning (RL) methods for solving small discrete Markov decision processes (MDPs). It is generally thought that coun…
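The paper's extension of count-based exploration to high-dimensional states can be sketched with static hashing: map each state to a short binary code via a fixed random projection (SimHash-style) and pay an exploration bonus that decays with the visit count of that code. Class and parameter names below are ours for illustration:

```python
import numpy as np

class HashCounter:
    """Count-based exploration bonus over hashed states:
    bonus(s) = beta / sqrt(n(phi(s))), where phi is a k-bit
    random-projection (SimHash-style) code of the state."""
    def __init__(self, state_dim, k=16, beta=0.1, seed=0):
        rng = np.random.default_rng(seed)
        self.A = rng.normal(size=(k, state_dim))  # fixed random projection
        self.counts = {}
        self.beta = beta

    def bonus(self, state):
        # Hash the state to a k-bit code, bump its count,
        # and return the decaying novelty bonus.
        code = tuple((self.A @ np.asarray(state, dtype=float) > 0).astype(int))
        self.counts[code] = self.counts.get(code, 0) + 1
        return self.beta / np.sqrt(self.counts[code])
```

Because nearby states tend to share a code, the hash trades exactness for generalization, which is what lets tabular-style counts work in continuous state spaces.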
RL$^2$: Fast Reinforcement Learning via Slow Reinforcement Learning
Deep reinforcement learning (deep RL) has been successful in learning sophisticated behaviors automatically; however, the learning process requires a huge number of trials. In contrast, animals can learn new tasks in just a few trials, ben…
Variational Lossy Autoencoder
Representation learning seeks to expose certain aspects of observed data in a learned representation that's amenable to downstream tasks like classification. For instance, a good representation for 2D images might be one that describes onl…