Jonathan Uesato
Reasoning Models Don't Always Say What They Think
Chain-of-thought (CoT) offers a potential boon for AI safety as it allows monitoring a model's CoT to try to understand its intentions and reasoning processes. However, the effectiveness of such monitoring hinges on CoTs faithfully represe…
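One way to probe faithfulness empirically, sketched here under the assumption of a generic `generate(prompt)` helper that returns a chain of thought and a final answer (this is an illustrative protocol, not an interface or result from the paper): inject a hint that changes the model's answer and check whether the CoT acknowledges the hint.

```python
# Minimal faithfulness probe (hypothetical `generate` helper): if an injected
# hint flips the final answer but the CoT never mentions the hint, the CoT is
# not faithfully reporting what drove the answer.

def faithfulness_probe(generate, question, hint_text, hint_answer):
    base = generate(question)                       # {"cot": str, "answer": str}
    hinted = generate(f"{hint_text}\n\n{question}")
    used_hint = base["answer"] != hint_answer and hinted["answer"] == hint_answer
    if not used_hint:
        return None  # the hint had no effect; this example is uninformative
    # Crude check: does the CoT mention the hint at all?
    verbalized = hint_text.lower() in hinted["cot"].lower()
    return {"used_hint": True, "verbalized": verbalized}
```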
Alignment faking in large language models
We present a demonstration of a large language model engaging in alignment faking: selectively complying with its training objective in training to prevent modification of its behavior out of training. First, we give Claude 3 Opus a system…
Solving math word problems with process- and outcome-based feedback
Recent work has shown that asking language models to generate reasoning steps improves performance on many reasoning tasks. When moving beyond prompting, this raises the question of how we should supervise such models: outcome-based approa…
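To make the distinction concrete, here is a minimal illustration of the two reward signals (not the paper's implementation; `step_ratings` stands in for judgments from a learned reward model or human raters):

```python
# Outcome-based supervision scores only the final result; process-based
# supervision scores every intermediate reasoning step.

def outcome_reward(final_answer, gold_answer):
    # Outcome-based: reward 1 if and only if the end result matches the target.
    return 1.0 if final_answer == gold_answer else 0.0

def process_reward(step_ratings):
    # Process-based: each step is rated (here 0/1); score the trace by the
    # fraction of steps judged correct.
    return sum(step_ratings) / max(len(step_ratings), 1)
```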
Goal Misgeneralization: Why Correct Specifications Aren't Enough For Correct Goals
The field of AI alignment is concerned with AI systems that pursue unintended goals. One commonly studied mechanism by which an unintended goal might arise is specification gaming, in which the designer-provided specification is flawed in …
Improving alignment of dialogue agents via targeted human judgements
We present Sparrow, an information-seeking dialogue agent trained to be more helpful, correct, and harmless compared to prompted language model baselines. We use reinforcement learning from human feedback to train our models with two new a…
Taxonomy of Risks posed by Language Models
Responsible innovation on large-scale Language Models (LMs) requires foresight into and in-depth understanding of the risks these models may pose. This paper develops a comprehensive taxonomy of ethical and social risks associated with…
Characteristics of Harmful Text: Towards Rigorous Benchmarking of Language Models
Large language models produce human-like text that drives a growing number of applications. However, recent literature and, increasingly, real-world observations have demonstrated that these models can generate language that is toxic, bias…
Ethical and social risks of harm from Language Models
This paper aims to help structure the risk landscape associated with large-scale Language Models (LMs). In order to foster advances in responsible innovation, an in-depth understanding of the potential risks posed by these models is needed…
Scaling Language Models: Methods, Analysis & Insights from Training Gopher
Language modelling provides a step towards intelligent communication systems by harnessing large repositories of written human knowledge to better predict and understand the world. In this paper, we present an analysis of Transformer-based…
An Empirical Investigation of Learning from Biased Toxicity Labels
Collecting annotations from human raters often results in a trade-off between the quantity of labels one wishes to gather and the quality of these labels. As such, it is often only possible to gather a small amount of high-quality labels. …
Verifying Probabilistic Specifications with Functional Lagrangians
We propose a general framework for verifying input-output specifications of neural networks using functional Lagrange multipliers that generalizes standard Lagrangian duality. We derive theoretical properties of the framework, which can ha…
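For concreteness, here is a sketch of the weak-duality bound that functional multipliers build on, in notation chosen for this summary (the paper's own statement may differ): the network is split into layers $x_{k+1} = h_k(x_k)$ with input set $\mathcal{X}_0$, bounding sets $\mathcal{X}_k$ containing the reachable activations, and a specification "$\psi(x_K) \le 0$ for all admissible inputs".

$$
\max_{x_0 \in \mathcal{X}_0} \psi(x_K)
\;\le\;
\max_{x_0 \in \mathcal{X}_0} \lambda_1\big(h_0(x_0)\big)
\;+\; \sum_{k=1}^{K-1} \max_{x_k \in \mathcal{X}_k} \Big[\lambda_{k+1}\big(h_k(x_k)\big) - \lambda_k(x_k)\Big]
\;+\; \max_{x_K \in \mathcal{X}_K} \big[\psi(x_K) - \lambda_K(x_K)\big]
$$

The bound holds for any choice of functions $\lambda_1, \dots, \lambda_K$ (the terms telescope along the true trajectory), so if the right-hand side is at most $0$ the specification is verified. Restricting the $\lambda_k$ to linear functions recovers standard Lagrangian duality; richer function classes can only tighten the bound.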
Make Sure You're Unsure: A Framework for Verifying Probabilistic Specifications
Most real world applications require dealing with stochasticity like sensor noise or predictive uncertainty, where formal specifications of desired behavior are inherently probabilistic. Despite the promise of formal verification in ensuri…
Challenges in Detoxifying Language Models
Large language models (LMs) generate remarkably fluent text and can be efficiently adapted across NLP tasks. Measuring and guaranteeing the quality of generated text in terms of safety is imperative for deploying LMs in the real world; to t…
Avoiding Tampering Incentives in Deep RL via Decoupled Approval
How can we design agents that pursue a given objective when all feedback mechanisms are influenceable by the agent? Standard RL algorithms assume a secure reward function, and can thus perform poorly in settings where agents can tamper wit…
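A minimal sketch of the decoupling idea (illustrative only; the linear-softmax policy, `approval_fn`, and `env_step` are placeholders, and the paper's algorithm includes details omitted here): the action that is executed and the action that is evaluated by the influenceable feedback channel are sampled independently, so the executed action cannot profit from tampering with its own feedback.

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

def decoupled_approval_update(theta, state_features, approval_fn, env_step, lr=0.1):
    # theta: (num_actions, feature_dim) weights of a linear-softmax policy.
    probs = softmax(theta @ state_features)
    act = np.random.choice(len(probs), p=probs)    # action actually executed
    query = np.random.choice(len(probs), p=probs)  # action shown to the overseer
    env_step(act)                                  # environment only sees `act`
    a_hat = approval_fn(query)                     # feedback only about `query`
    # REINFORCE-style update on the *queried* action, weighted by its approval.
    grad_logp = -probs[:, None] * state_features[None, :]
    grad_logp[query] += state_features
    return theta + lr * a_hat * grad_logp
```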
REALab: An Embedded Perspective on Tampering
This paper describes REALab, a platform for embedded agency research in reinforcement learning (RL). REALab is designed to model the structure of tampering problems that may arise in real-world deployments of RL. Standard Markov Decision P…
Enabling certification of verification-agnostic networks via memory-efficient semidefinite programming
Convex relaxations have emerged as a promising approach for verifying desirable properties of neural networks like robustness to adversarial perturbations. Widely used Linear Programming (LP) relaxations only work well when networks are tr…
Uncovering the Limits of Adversarial Training against Norm-Bounded Adversarial Examples
Adversarial training and its variants have become de facto standards for learning robust deep neural networks. In this paper, we explore the landscape around adversarial training in a bid to uncover its limits. We systematically study the …
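For reference, a generic L-infinity PGD adversarial-training step of the kind studied in this line of work (hyperparameters and the `model`/`optimizer` objects are placeholders, not the paper's configuration):

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8/255, step=2/255, iters=10):
    # Inner maximization: find a perturbation within the eps-ball that
    # increases the loss, via projected gradient ascent with a random start.
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1).detach()
    for _ in range(iters):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad, = torch.autograd.grad(loss, x_adv)
        x_adv = x_adv.detach() + step * grad.sign()
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0, 1)
    return x_adv.detach()

def adversarial_training_step(model, optimizer, x, y):
    # Outer minimization: train on the adversarial examples instead of x.
    model.eval()
    x_adv = pgd_attack(model, x, y)
    model.train()
    optimizer.zero_grad()
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    optimizer.step()
    return loss.item()
```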
An Alternative Surrogate Loss for PGD-based Adversarial Testing
Adversarial testing methods based on Projected Gradient Descent (PGD) are widely used for searching norm-bounded perturbations that cause the inputs of neural networks to be misclassified. This paper takes a deeper look at these methods an…
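As an illustration of what "surrogate loss" means here, two common choices for the objective that PGD ascends are sketched below; the logit-margin form is the widely used alternative to cross-entropy and is not necessarily the specific surrogate the paper proposes.

```python
import torch
import torch.nn.functional as F

def cross_entropy_surrogate(logits, y):
    # Standard choice: maximize the classification loss on the true label.
    return F.cross_entropy(logits, y)

def margin_surrogate(logits, y):
    # Logit-margin choice: maximize the gap between the best wrong class and
    # the true class; a positive value means the input is misclassified.
    true = logits.gather(1, y[:, None]).squeeze(1)
    wrong = logits.clone()
    wrong.scatter_(1, y[:, None], float("-inf"))
    return (wrong.max(dim=1).values - true).mean()
```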
Robustness via Curvature Regularization, and Vice Versa
State-of-the-art classifiers have been shown to be largely vulnerable to adversarial perturbations. One of the most effective strategies to improve robustness is adversarial training. In this paper, we investigate the effect of adversarial…
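A sketch of a finite-difference curvature penalty in the spirit of this line of work (the step size, direction choice, and constants are illustrative): a small change in the input gradient along a direction z indicates low curvature of the loss surface, which the analysis ties to adversarial robustness.

```python
import torch
import torch.nn.functional as F

def curvature_penalty(model, x, y, h=1e-2):
    # Gradient of the loss at x, kept differentiable so the penalty can be
    # backpropagated into the model parameters.
    x = x.detach().clone().requires_grad_(True)
    g, = torch.autograd.grad(F.cross_entropy(model(x), y), x, create_graph=True)
    # Finite-difference direction (treated as a constant), normalized per example.
    z = g.detach().sign()
    z = z / (z.flatten(1).norm(dim=1).view(-1, *([1] * (x.dim() - 1))) + 1e-12)
    x_shift = (x.detach() + h * z).requires_grad_(True)
    g_shift, = torch.autograd.grad(F.cross_entropy(model(x_shift), y),
                                   x_shift, create_graph=True)
    # Penalize how much the gradient changes over the small step: a proxy
    # for curvature along z.
    return ((g_shift - g).flatten(1).norm(dim=1) ** 2).mean()
```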
Are Labels Required for Improving Adversarial Robustness?
Recent work has uncovered the interesting (and somewhat surprising) finding that training models to be invariant to adversarial perturbations requires substantially larger datasets than those required for standard classification. This resu…
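A sketch of the semi-supervised recipe this finding suggests (illustrative, not the exact method from the paper): pseudo-label unlabeled data with a standardly trained classifier, then run adversarial training on labeled and pseudo-labeled examples together, e.g. feeding the combined batch into a PGD training step like the one sketched earlier.

```python
import torch

@torch.no_grad()
def pseudo_label(model, unlabeled_x):
    # Treat the standard classifier's predictions as labels for unlabeled data.
    model.eval()
    return model(unlabeled_x).argmax(dim=1)

def combine_batches(x_labeled, y_labeled, x_unlabeled, y_pseudo):
    # Merge real and pseudo-labeled examples into one adversarial-training batch.
    x = torch.cat([x_labeled, x_unlabeled], dim=0)
    y = torch.cat([y_labeled, y_pseudo], dim=0)
    return x, y
```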
Verification of Non-Linear Specifications for Neural Networks
Prior work on neural network verification has focused on specifications that are linear functions of the output of the network, e.g., invariance of the classifier output under adversarial perturbations of the input. In this paper, we exten…
Rigorous Agent Evaluation: An Adversarial Approach to Uncover Catastrophic Failures
This paper addresses the problem of evaluating learning systems in safety critical domains such as autonomous driving, where failures can have catastrophic consequences. We focus on two problems: searching for scenarios when learned agents…
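An illustrative sketch of failure-driven evaluation in this spirit (the logistic model and function names are placeholders, not the paper's method): fit a cheap predictor of failure probability from scenario parameters, then spend the evaluation budget on the scenarios it ranks as most likely to fail.

```python
import numpy as np

def fit_failure_predictor(scenarios, failed, lr=0.1, steps=200):
    # scenarios: (n, d) array of scenario parameters; failed: (n,) 0/1 outcomes.
    X = np.hstack([scenarios, np.ones((len(scenarios), 1))])  # add bias column
    w = np.zeros(X.shape[1])
    for _ in range(steps):                 # plain logistic regression via GD
        p = 1.0 / (1.0 + np.exp(-X @ w))
        w -= lr * X.T @ (p - failed) / len(failed)
    return w

def prioritize(scenarios, w, budget):
    X = np.hstack([scenarios, np.ones((len(scenarios), 1))])
    scores = 1.0 / (1.0 + np.exp(-X @ w))  # predicted failure probability
    return np.argsort(-scores)[:budget]    # evaluate the riskiest scenarios first
```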
Strength in Numbers: Trading-off Robustness and Computation via Adversarially-Trained Ensembles
While deep learning has led to remarkable results on a number of challenging problems, researchers have discovered a vulnerability of neural networks in adversarial settings, where small but carefully chosen perturbations to the input can …
On the Effectiveness of Interval Bound Propagation for Training Verifiably Robust Models
Recent work has shown that it is possible to train deep neural networks that are provably robust to norm-bounded adversarial perturbations. Most of these methods are based on minimizing an upper bound on the worst-case loss over all possib…
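For concreteness, interval bound propagation through a single affine layer followed by a ReLU, the basic step such training schemes repeat layer by layer (a generic sketch, not the paper's code): given elementwise input bounds [lo, hi], it returns bounds valid for every input in that box.

```python
import torch
import torch.nn.functional as F

def ibp_linear(lo, hi, weight, bias):
    # Propagate an axis-aligned box through y = Wx + b: the center maps
    # through W, and |W| maps the box half-widths to output half-widths.
    mid = (hi + lo) / 2
    rad = (hi - lo) / 2
    out_mid = F.linear(mid, weight, bias)
    out_rad = F.linear(rad, weight.abs())
    return out_mid - out_rad, out_mid + out_rad

def ibp_relu(lo, hi):
    # ReLU is monotone, so it can be applied to the bounds directly.
    return lo.clamp(min=0), hi.clamp(min=0)
```

Propagating these boxes all the way to the logits gives an upper bound on the worst-case loss over the input region, which the training objective then minimizes.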
Training verified learners with learned verifiers
This paper proposes a new algorithmic framework, predictor-verifier training, to train neural networks that are verifiable, i.e., networks that provably satisfy some desired input-output properties. The key idea is to simultaneously train …
Adversarial Risk and the Dangers of Evaluating Against Weak Attacks
This paper investigates recently proposed approaches for defending against adversarial examples and evaluating adversarial robustness. We motivate 'adversarial risk' as an objective for achieving models robust to worst-case inputs. We then…
Semantic Code Repair using Neuro-Symbolic Transformation Networks
We study the problem of semantic code repair, which can be broadly defined as automatically fixing non-syntactic bugs in source code. The majority of past work in semantic code repair assumed access to unit tests against which candidate re…