Explanipedia

Agentic Misalignment: How LLMs Could Be Insider Threats

Aengus Lynch, Benjamin Wright, Cheryl Larson, Stuart J. Ritchie, Sören Mindermann, et al. · 2025

We stress-tested 16 leading models from multiple developers in hypothetical corporate environments to identify potentially risky agentic behaviors before they cause real harm. In the scenarios, we allowed models to autonomously send emails…

Superintelligent Agents Pose Catastrophic Risks: Can Scientist AI Offer a Safer Path?

Yoshua Bengio, Michael K. Cohen, Damiano Fornasiere, Joumana Ghosn, Pietro Greiner, et al. · 2025

Computer science Medicine

The leading AI companies are increasingly focused on building generalist AI agents—systems that can autonomously plan, act, and pursue goals across almost all tasks that humans can perform. Despite how useful these systems might be, unchec…

The Singapore Consensus on Global AI Safety Research Priorities

Yoshua Bengio, Max Tegmark, Stuart Russell, Dawn Song, Sören Mindermann, et al. · 2025

Computer science Political science Business

Rapidly improving AI capabilities and autonomy hold significant promise of transformation, but are also driving vigorous debate on how to ensure that AI is safe, i.e., trustworthy, reliable, and secure. Building a trusted ecosystem is ther…

International AI Safety Report

Yoshua Bengio, Sören Mindermann, Daniel Privitera, Tamay Besiroglu, Rishi Bommasani, et al. · 2025

Business

The first International AI Safety Report comprehensively synthesizes the current evidence on the capabilities, risks, and safety of advanced AI systems. The report was mandated by the nations attending the AI Safety Summit in Bletchley, UK…

Open Problems in Machine Unlearning for AI Safety

Fazl Barez, Tingchen Fu, Ameya Prabhu, Stephen Casper, Amartya Sanyal, et al. · 2025

Computer science

As AI systems become more capable, widely deployed, and increasingly autonomous in critical areas such as cybersecurity, biological research, and healthcare, ensuring their safety and alignment with human values is paramount. Machine unlea…

Alignment faking in large language models

Ryan Greenblatt, Carson Denison, Benjamin Fletcher Wright, Fabien Roger, Monte MacDiarmid, et al. · 2024

Computer science Psychology Philosophy

We present a demonstration of a large language model engaging in alignment faking: selectively complying with its training objective in training to prevent modification of its behavior out of training. First, we give Claude 3 Opus a system…

International Scientific Report on the Safety of Advanced AI (Interim Report)

Yoshua Bengio, Sören Mindermann, Daniel Privitera, Tamay Besiroglu, Rishi Bommasani, et al. · 2024

Political science Engineering Business

This is the interim publication of the first International Scientific Report on the Safety of Advanced AI. The report synthesises the scientific understanding of general-purpose AI -- AI that can perform a wide variety of tasks -- with a f…

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

Evan Hubinger, Carson Denison, Jesse Mu, Mike Lambert, Meg Tong, et al. · 2024

Computer science Psychology Physics

Humans are capable of strategically deceptive behavior: behaving helpfully in most situations, but then behaving very differently in order to pursue alternative objectives when given the opportunity. If an AI system learned such a deceptiv…

Managing extreme AI risks amid rapid progress

Yoshua Bengio, Geoffrey E. Hinton, Andrew Chi-Chih Yao, Dawn Song, Pieter Abbeel, et al. · 2023

Business Political science Computer science

Artificial Intelligence (AI) is progressing rapidly, and companies are shifting their focus to developing generalist AI systems that can autonomously act and pursue goals. Increases in capabilities and autonomy may soon massively amplify A…

Specific versus General Principles for Constitutional AI

Sandipan Kundu, Yuntao Bai, Saurav Kadavath, Amanda Askell, A. Callahan, et al. · 2023

Computer science Psychology Sociology

Human feedback can prevent overtly harmful utterances in conversational models, but may not automatically mitigate subtle problematic behaviors such as a stated desire for self-preservation or power. Constitutional AI offers an alternative…

How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions

Lorenzo Pacchiardi, Alex Chan, Sören Mindermann, Ilan Moscovitz, Alexa Y. Pan, et al. · 2023

Psychology Computer science Philosophy

Large language models (LLMs) can "lie", which we define as outputting false statements despite "knowing" the truth in a demonstrable sense. LLMs might "lie", for example, when instructed to output misinformation. Here, we develop a simple …

Effectiveness assessment of non-pharmaceutical interventions: lessons learned from the COVID-19 pandemic

Adrian Lison, Nicolas Banholzer, Mrinank Sharma, Sören Mindermann, H. Juliette T. Unwin, et al. · 2023

Medicine Political science Psychology

Effectiveness of non-pharmaceutical interventions (NPIs), such as school closures and stay-at-home orders, during the COVID-19 pandemic has been assessed in many studies. Such assessments can inform public health policies and contribute to…

Seasonal variation in SARS-CoV-2 transmission in temperate climates: A Bayesian modelling study in 143 European regions

Tomáš Gavenčiak, Joshua Teperowski Monrad, Gavin Leech, Mrinank Sharma, Sören Mindermann, et al. · 2022

Biology Geography Medicine

Although seasonal variation has a known influence on the transmission of several respiratory viral infections, its role in SARS-CoV-2 transmission remains unclear. While there is a sizable and growing literature on environmental drivers of…

Prioritized Training on Points that are Learnable, Worth Learning, and Not Yet Learnt

Sören Mindermann, Jan Brauner, Muhammed Razzak, Mrinank Sharma, Andreas Kirsch, et al. · 2022

Computer science Mathematics Economics

Training on web-scale data can take months. But most computation and time is wasted on redundant and noisy points that are already learnt or not learnable. To accelerate training, we introduce Reducible Holdout Loss Selection (RHO-LOSS), a…

Mask wearing in community settings reduces SARS-CoV-2 transmission

Gavin Leech, Darren Smith, Joshua Teperowski Monrad, Jonas B. Sandbrink, Benedict Snodin, et al. · 2022

Medicine Computer science

Significance: We resolve conflicting results regarding mask wearing against COVID-19. Most previous work focused on mask mandates; we study the effect of mask wearing directly. We find that population mask wearing notably reduced SARS-CoV-2…

Understanding the effectiveness of government interventions against the resurgence of COVID-19 in Europe

Mrinank Sharma, Sören Mindermann, Darren Smith, Gavin Leech, Benedict Snodin, et al. · 2021

Political science Medicine Philosophy

European governments use non-pharmaceutical interventions (NPIs) to control resurging waves of COVID-19. However, they only have outdated estimates for how effective individual NPIs were in the first wave. We estimate the effectiveness of …

Is the cure really worse than the disease? The health impacts of lockdowns during COVID-19

Gideon Meyerowitz‐Katz, Samir Bhatt, Oliver Ratmann, Jan Brauner, Seth Flaxman, et al. · 2021

Medicine Computer science Economics

[Extract] During the pandemic, there has been ongoing and contentious debate around the impact of restrictive government measures to contain SARS-CoV-2 outbreaks, often termed ‘lockdowns’. We define a ‘lockdown’ as a highly restrictive set…

Prioritized training on points that are learnable, worth learning, and not yet learned (workshop version)

Sören Mindermann, Muhammed Razzak, Winnie Xu, Andreas Kirsch, Mrinank Sharma, et al. · 2021

Computer science Biology Economics

We introduce Goldilocks Selection, a technique for faster model training which selects a sequence of training points that are "just right". We propose an information-theoretic acquisition function -- the reducible validation loss -- and co…

Prioritized training on points that are learnable, worth learning, and not yet learned.

Sören Mindermann, Muhammed Razzak, Winnie Xu, Andreas Kirsch, Mrinank Sharma, et al. · 2021

Computer science Biology Physics

We introduce Goldilocks Selection, a technique for faster model training which selects a sequence of training points that are just right. We propose an information-theoretic acquisition function -- the reducible validation loss -- and comp…

Mass mask-wearing notably reduces COVID-19 transmission

Gavin Leech, Darren Smith, Jonas B. Sandbrink, Benedict Snodin, Robert Zinkov, et al. · 2021

Medicine Computer science Mathematics

Mask-wearing has been a controversial measure to control the COVID-19 pandemic. While masks are known to substantially reduce disease transmission in healthcare settings [1–3], studies in community settings report inconsistent results [4–6…

Seasonal variation in SARS-CoV-2 transmission in temperate climates

Tomáš Gavenčiak, Joshua Teperowski Monrad, Gavin Leech, Mrinank Sharma, Sören Mindermann, et al. · 2021

Biology Geography Medicine

While seasonal variation has a known influence on the transmission of several respiratory viral infections, its role in SARS-CoV-2 transmission remains unclear. As previous analyses have not accounted for the implementation of non-pharmace…

Understanding the effectiveness of government interventions in Europe’s second wave of COVID-19

Mrinank Sharma, Sören Mindermann, Darren Smith, Gavin Leech, Benedict Snodin, et al. · 2021

Geography Computer science Psychology

As European governments face resurging waves of COVID-19, non-pharmaceutical interventions (NPIs) continue to be the primary tool for infection control. However, updated estimates of their relative effectiveness have been absent for Europe…

How Robust are the Estimated Effects of Nonpharmaceutical Interventions against COVID-19?

Mrinank Sharma, Sören Mindermann, Jan Brauner, Gavin Leech, Anna B. Stephenson, et al. · 2021

Computer science Mathematics Medicine

To what extent are effectiveness estimates of nonpharmaceutical interventions (NPIs) against COVID-19 influenced by the assumptions our models make? To answer this question, we investigate 2 state-of-the-art NPI effectiveness models and pr…

Quantifying Ignorance in Individual-Level Causal-Effect Estimates under Hidden Confounding

Andrew Jesson, Sören Mindermann, Yarin Gal, Uri Shalit · 2021

Mathematics Computer science Engineering

We study the problem of learning conditional average treatment effects (CATE) from high-dimensional, observational data with unobserved confounders. Unobserved confounders introduce ignorance -- a level of unidentifiability -- about an ind…

A dataset of non-pharmaceutical interventions on SARS-CoV-2 in Europe

George Altman, Janvi Ahuja, Joshua Teperowski Monrad, Gurpreet Dhaliwal, Darren Smith, et al. · 2021

Geography Medicine

A dataset of non-pharmaceutical interventions on SARS-CoV-2 in Europe

Sören Mindermann