Jonathan Uesato
Reasoning Models Don't Always Say What They Think
Chain-of-thought (CoT) offers a potential boon for AI safety as it allows monitoring a model's CoT to try to understand its intentions and reasoning processes. However, the effectiveness of such monitoring hinges on CoTs faithfully represe…
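One way to probe faithfulness empirically, sketched here under the assumption of a generic `generate(prompt)` helper that returns a chain of thought and a final answer (this is an illustrative protocol, not an interface or result from the paper): inject a hint that changes the model's answer and check whether the CoT acknowledges the hint.

```python
# Minimal faithfulness probe (hypothetical `generate` helper): if an injected
# hint flips the final answer but the CoT never mentions the hint, the CoT is
# not faithfully reporting what drove the answer.

def faithfulness_probe(generate, question, hint_text, hint_answer):
    base = generate(question)                       # {"cot": str, "answer": str}
    hinted = generate(f"{hint_text}\n\n{question}")
    used_hint = base["answer"] != hint_answer and hinted["answer"] == hint_answer
    if not used_hint:
        return None  # the hint had no effect; this example is uninformative
    # Crude check: does the CoT mention the hint at all?
    verbalized = hint_text.lower() in hinted["cot"].lower()
    return {"used_hint": True, "verbalized": verbalized}
```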
Alignment faking in large language models
We present a demonstration of a large language model engaging in alignment faking: selectively complying with its training objective in training to prevent modification of its behavior out of training. First, we give Claude 3 Opus a system…
Solving math word problems with process- and outcome-based feedback
Recent work has shown that asking language models to generate reasoning steps improves performance on many reasoning tasks. When moving beyond prompting, this raises the question of how we should supervise such models: outcome-based approa…
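To make the distinction concrete, here is a minimal illustration of the two reward signals (not the paper's implementation; `step_ratings` stands in for judgments from a learned reward model or human raters):

```python
# Outcome-based supervision scores only the final result; process-based
# supervision scores every intermediate reasoning step.

def outcome_reward(final_answer, gold_answer):
    # Outcome-based: reward 1 if and only if the end result matches the target.
    return 1.0 if final_answer == gold_answer else 0.0

def process_reward(step_ratings):
    # Process-based: each step is rated (here 0/1); score the trace by the
    # fraction of steps judged correct.
    return sum(step_ratings) / max(len(step_ratings), 1)
```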
Goal Misgeneralization: Why Correct Specifications Aren't Enough For Correct Goals
The field of AI alignment is concerned with AI systems that pursue unintended goals. One commonly studied mechanism by which an unintended goal might arise is specification gaming, in which the designer-provided specification is flawed in …
Improving alignment of dialogue agents via targeted human judgements
We present Sparrow, an information-seeking dialogue agent trained to be more helpful, correct, and harmless compared to prompted language model baselines. We use reinforcement learning from human feedback to train our models with two new a…
Taxonomy of Risks posed by Language Models
Responsible innovation on large-scale Language Models (LMs) requires foresight into and in-depth understanding of the risks these models may pose. This paper develops a comprehensive taxonomy of ethical and social risks associated with…
Characteristics of Harmful Text: Towards Rigorous Benchmarking of Language Models
Large language models produce human-like text that drives a growing number of applications. However, recent literature and, increasingly, real-world observations have demonstrated that these models can generate language that is toxic, bias…
Ethical and social risks of harm from Language Models
This paper aims to help structure the risk landscape associated with large-scale Language Models (LMs). In order to foster advances in responsible innovation, an in-depth understanding of the potential risks posed by these models is needed…
Scaling Language Models: Methods, Analysis & Insights from Training Gopher
Language modelling provides a step towards intelligent communication systems by harnessing large repositories of written human knowledge to better predict and understand the world. In this paper, we present an analysis of Transformer-based…
An Empirical Investigation of Learning from Biased Toxicity Labels
Collecting annotations from human raters often results in a trade-off between the quantity of labels one wishes to gather and the quality of these labels. As such, it is often only possible to gather a small amount of high-quality labels. …
Verifying Probabilistic Specifications with Functional Lagrangians
We propose a general framework for verifying input-output specifications of neural networks using functional Lagrange multipliers that generalizes standard Lagrangian duality. We derive theoretical properties of the framework, which can ha…
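For concreteness, here is a sketch of the weak-duality bound that functional multipliers build on, in notation chosen for this summary (the paper's own statement may differ): the network is split into layers $x_{k+1} = h_k(x_k)$ with input set $\mathcal{X}_0$, bounding sets $\mathcal{X}_k$ containing the reachable activations, and a specification "$\psi(x_K) \le 0$ for all admissible inputs".

$$
\max_{x_0 \in \mathcal{X}_0} \psi(x_K)
\;\le\;
\max_{x_0 \in \mathcal{X}_0} \lambda_1\big(h_0(x_0)\big)
\;+\; \sum_{k=1}^{K-1} \max_{x_k \in \mathcal{X}_k} \Big[\lambda_{k+1}\big(h_k(x_k)\big) - \lambda_k(x_k)\Big]
\;+\; \max_{x_K \in \mathcal{X}_K} \big[\psi(x_K) - \lambda_K(x_K)\big]
$$

The bound holds for any choice of functions $\lambda_1, \dots, \lambda_K$ (the terms telescope along the true trajectory), so if the right-hand side is at most $0$ the specification is verified. Restricting the $\lambda_k$ to linear functions recovers standard Lagrangian duality; richer function classes can only tighten the bound.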
Make Sure You're Unsure: A Framework for Verifying Probabilistic Specifications
Most real world applications require dealing with stochasticity like sensor noise or predictive uncertainty, where formal specifications of desired behavior are inherently probabilistic. Despite the promise of formal verification in ensuri…
Challenges in Detoxifying Language Models
Large language models (LMs) generate remarkably fluent text and can be efficiently adapted across NLP tasks. Measuring and guaranteeing the quality of generated text in terms of safety is imperative for deploying LMs in the real world; to t…
Avoiding Tampering Incentives in Deep RL via Decoupled Approval
How can we design agents that pursue a given objective when all feedback mechanisms are influenceable by the agent? Standard RL algorithms assume a secure reward function, and can thus perform poorly in settings where agents can tamper wit…
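A minimal sketch of the decoupling idea (illustrative only; the linear-softmax policy, `approval_fn`, and `env_step` are placeholders, and the paper's algorithm includes details omitted here): the action that is executed and the action that is evaluated by the influenceable feedback channel are sampled independently, so the executed action cannot profit from tampering with its own feedback.

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

def decoupled_approval_update(theta, state_features, approval_fn, env_step, lr=0.1):
    # theta: (num_actions, feature_dim) weights of a linear-softmax policy.
    probs = softmax(theta @ state_features)
    act = np.random.choice(len(probs), p=probs)    # action actually executed
    query = np.random.choice(len(probs), p=probs)  # action shown to the overseer
    env_step(act)                                  # environment only sees `act`
    a_hat = approval_fn(query)                     # feedback only about `query`
    # REINFORCE-style update on the *queried* action, weighted by its approval.
    grad_logp = -probs[:, None] * state_features[None, :]
    grad_logp[query] += state_features
    return theta + lr * a_hat * grad_logp
```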
REALab: An Embedded Perspective on Tampering
This paper describes REALab, a platform for embedded agency research in reinforcement learning (RL). REALab is designed to model the structure of tampering problems that may arise in real-world deployments of RL. Standard Markov Decision P…
Enabling certification of verification-agnostic networks via memory-efficient semidefinite programming
Convex relaxations have emerged as a promising approach for verifying desirable properties of neural networks like robustness to adversarial perturbations. Widely used Linear Programming (LP) relaxations only work well when networks are tr…
Uncovering the Limits of Adversarial Training against Norm-Bounded Adversarial Examples
Adversarial training and its variants have become de facto standards for learning robust deep neural networks. In this paper, we explore the landscape around adversarial training in a bid to uncover its limits. We systematically study the …
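For reference, a generic L-infinity PGD adversarial-training step of the kind studied in this line of work (hyperparameters and the `model`/`optimizer` objects are placeholders, not the paper's configuration):

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8/255, step=2/255, iters=10):
    # Inner maximization: find a perturbation within the eps-ball that
    # increases the loss, via projected gradient ascent with a random start.
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1).detach()
    for _ in range(iters):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad, = torch.autograd.grad(loss, x_adv)
        x_adv = x_adv.detach() + step * grad.sign()
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0, 1)
    return x_adv.detach()

def adversarial_training_step(model, optimizer, x, y):
    # Outer minimization: train on the adversarial examples instead of x.
    model.eval()
    x_adv = pgd_attack(model, x, y)
    model.train()
    optimizer.zero_grad()
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    optimizer.step()
    return loss.item()
```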
An Alternative Surrogate Loss for PGD-based Adversarial Testing
Adversarial testing methods based on Projected Gradient Descent (PGD) are widely used for searching norm-bounded perturbations that cause the inputs of neural networks to be misclassified. This paper takes a deeper look at these methods an…
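As an illustration of what "surrogate loss" means here, two common choices for the objective that PGD ascends are sketched below; the logit-margin form is the widely used alternative to cross-entropy and is not necessarily the specific surrogate the paper proposes.

```python
import torch
import torch.nn.functional as F

def cross_entropy_surrogate(logits, y):
    # Standard choice: maximize the classification loss on the true label.
    return F.cross_entropy(logits, y)

def margin_surrogate(logits, y):
    # Logit-margin choice: maximize the gap between the best wrong class and
    # the true class; a positive value means the input is misclassified.
    true = logits.gather(1, y[:, None]).squeeze(1)
    wrong = logits.clone()
    wrong.scatter_(1, y[:, None], float("-inf"))
    return (wrong.max(dim=1).values - true).mean()
```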
Robustness via Curvature Regularization, and Vice Versa
State-of-the-art classifiers have been shown to be largely vulnerable to adversarial perturbations. One of the most effective strategies to improve robustness is adversarial training. In this paper, we investigate the effect of adversarial…
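A sketch of a finite-difference curvature penalty in the spirit of this line of work (the step size, direction choice, and constants are illustrative): a small change in the input gradient along a direction z indicates low curvature of the loss surface, which the analysis ties to adversarial robustness.

```python
import torch
import torch.nn.functional as F

def curvature_penalty(model, x, y, h=1e-2):
    # Gradient of the loss at x, kept differentiable so the penalty can be
    # backpropagated into the model parameters.
    x = x.detach().clone().requires_grad_(True)
    g, = torch.autograd.grad(F.cross_entropy(model(x), y), x, create_graph=True)
    # Finite-difference direction (treated as a constant), normalized per example.
    z = g.detach().sign()
    z = z / (z.flatten(1).norm(dim=1).view(-1, *([1] * (x.dim() - 1))) + 1e-12)
    x_shift = (x.detach() + h * z).requires_grad_(True)
    g_shift, = torch.autograd.grad(F.cross_entropy(model(x_shift), y),
                                   x_shift, create_graph=True)
    # Penalize how much the gradient changes over the small step: a proxy
    # for curvature along z.
    return ((g_shift - g).flatten(1).norm(dim=1) ** 2).mean()
```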
Are Labels Required for Improving Adversarial Robustness?
Recent work has uncovered the interesting (and somewhat surprising) finding that training models to be invariant to adversarial perturbations requires substantially larger datasets than those required for standard classification. This resu…
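A sketch of the semi-supervised recipe this finding suggests (illustrative, not the exact method from the paper): pseudo-label unlabeled data with a standardly trained classifier, then run adversarial training on labeled and pseudo-labeled examples together, e.g. feeding the combined batch into a PGD training step like the one sketched earlier.

```python
import torch

@torch.no_grad()
def pseudo_label(model, unlabeled_x):
    # Treat the standard classifier's predictions as labels for unlabeled data.
    model.eval()
    return model(unlabeled_x).argmax(dim=1)

def combine_batches(x_labeled, y_labeled, x_unlabeled, y_pseudo):
    # Merge real and pseudo-labeled examples into one adversarial-training batch.
    x = torch.cat([x_labeled, x_unlabeled], dim=0)
    y = torch.cat([y_labeled, y_pseudo], dim=0)
    return x, y
```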
Verification of Non-Linear Specifications for Neural Networks
Prior work on neural network verification has focused on specifications that are linear functions of the output of the network, e.g., invariance of the classifier output under adversarial perturbations of the input. In this paper, we exten…
Rigorous Agent Evaluation: An Adversarial Approach to Uncover Catastrophic Failures
This paper addresses the problem of evaluating learning systems in safety critical domains such as autonomous driving, where failures can have catastrophic consequences. We focus on two problems: searching for scenarios when learned agents…
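An illustrative sketch of failure-driven evaluation in this spirit (the logistic model and function names are placeholders, not the paper's method): fit a cheap predictor of failure probability from scenario parameters, then spend the evaluation budget on the scenarios it ranks as most likely to fail.

```python
import numpy as np

def fit_failure_predictor(scenarios, failed, lr=0.1, steps=200):
    # scenarios: (n, d) array of scenario parameters; failed: (n,) 0/1 outcomes.
    X = np.hstack([scenarios, np.ones((len(scenarios), 1))])  # add bias column
    w = np.zeros(X.shape[1])
    for _ in range(steps):                 # plain logistic regression via GD
        p = 1.0 / (1.0 + np.exp(-X @ w))
        w -= lr * X.T @ (p - failed) / len(failed)
    return w

def prioritize(scenarios, w, budget):
    X = np.hstack([scenarios, np.ones((len(scenarios), 1))])
    scores = 1.0 / (1.0 + np.exp(-X @ w))  # predicted failure probability
    return np.argsort(-scores)[:budget]    # evaluate the riskiest scenarios first
```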
Strength in Numbers: Trading-off Robustness and Computation via Adversarially-Trained Ensembles
While deep learning has led to remarkable results on a number of challenging problems, researchers have discovered a vulnerability of neural networks in adversarial settings, where small but carefully chosen perturbations to the input can …
On the Effectiveness of Interval Bound Propagation for Training Verifiably Robust Models
Recent work has shown that it is possible to train deep neural networks that are provably robust to norm-bounded adversarial perturbations. Most of these methods are based on minimizing an upper bound on the worst-case loss over all possib…
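For concreteness, interval bound propagation through a single affine layer followed by a ReLU, the basic step such training schemes repeat layer by layer (a generic sketch, not the paper's code): given elementwise input bounds [lo, hi], it returns bounds valid for every input in that box.

```python
import torch
import torch.nn.functional as F

def ibp_linear(lo, hi, weight, bias):
    # Propagate an axis-aligned box through y = Wx + b: the center maps
    # through W, and |W| maps the box half-widths to output half-widths.
    mid = (hi + lo) / 2
    rad = (hi - lo) / 2
    out_mid = F.linear(mid, weight, bias)
    out_rad = F.linear(rad, weight.abs())
    return out_mid - out_rad, out_mid + out_rad

def ibp_relu(lo, hi):
    # ReLU is monotone, so it can be applied to the bounds directly.
    return lo.clamp(min=0), hi.clamp(min=0)
```

Propagating these boxes all the way to the logits gives an upper bound on the worst-case loss over the input region, which the training objective then minimizes.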
Training verified learners with learned verifiers
This paper proposes a new algorithmic framework, predictor-verifier training, to train neural networks that are verifiable, i.e., networks that provably satisfy some desired input-output properties. The key idea is to simultaneously train …
Adversarial Risk and the Dangers of Evaluating Against Weak Attacks
This paper investigates recently proposed approaches for defending against adversarial examples and evaluating adversarial robustness. We motivate 'adversarial risk' as an objective for achieving models robust to worst-case inputs. We then…
Semantic Code Repair using Neuro-Symbolic Transformation Networks
We study the problem of semantic code repair, which can be broadly defined as automatically fixing non-syntactic bugs in source code. The majority of past work in semantic code repair assumed access to unit tests against which candidate re…