Fazl Barez
VAL-Bench: Measuring Value Alignment in Language Models
Large language models (LLMs) are increasingly used for tasks where outputs shape human decisions, so it is critical to test whether their responses reflect consistent human values. Existing benchmarks mostly track refusals or predefined sa…
Query Circuits: Explaining How Language Models Answer User Prompts
Explaining why a language model produces a particular output requires local, input-level explanations. Existing methods uncover global capability circuits (e.g., indirect object identification), but not why the model answers a specific inp…
Rethinking Safety in LLM Fine-tuning: An Optimization Perspective
Fine-tuning language models is commonly believed to inevitably harm their safety, i.e., their ability to refuse harmful user requests, even when harmless datasets are used, thus requiring additional safety measures. We challenge this belief th…
The Singapore Consensus on Global AI Safety Research Priorities
Rapidly improving AI capabilities and autonomy hold significant promise of transformation, but are also driving vigorous debate on how to ensure that AI is safe, i.e., trustworthy, reliable, and secure. Building a trusted ecosystem is ther…
Beyond Linear Steering: Unified Multi-Attribute Control for Language Models
Controlling multiple behavioral attributes in large language models (LLMs) at inference time is a challenging problem due to interference between attributes and the limitations of linear steering methods, which assume additive behavior in …
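For context on the baseline this abstract critiques, linear steering typically adds a fixed direction to a model's hidden activations at inference time, so multiple attributes are implicitly assumed to compose additively. A minimal PyTorch sketch of that baseline, not of the paper's method; the module, vectors, and coefficients below are illustrative placeholders:

```python
import torch

def add_steering_hook(module, steering_vectors, coefficients):
    """Register a forward hook that adds a weighted sum of steering vectors
    to the module's output activations (the standard linear-steering baseline)."""
    direction = sum(c * v for c, v in zip(coefficients, steering_vectors))

    def hook(_module, _inputs, output):
        # Assumes `output` is a [batch, seq, hidden] activation tensor.
        return output + direction

    return module.register_forward_hook(hook)

# Usage sketch: "steer" two attributes at once by summing their directions.
hidden = 16
layer = torch.nn.Linear(hidden, hidden)              # stand-in for a transformer block
vecs = [torch.randn(hidden), torch.randn(hidden)]    # one direction per attribute
handle = add_steering_hook(layer, vecs, coefficients=[0.8, -0.5])
out = layer(torch.randn(2, 4, hidden))               # activations shifted by the combined direction
handle.remove()
```

The additivity assumption is exactly where interference between attributes can arise, which is the limitation the paper targets.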
Beyond Monoliths: Expert Orchestration for More Capable, Democratic, and Safe Language Models
This position paper argues that the prevailing trajectory toward ever larger, more expensive generalist foundation models controlled by a handful of companies limits innovation and constrains progress. We challenge this approach by advocat…
Precise In-Parameter Concept Erasure in Large Language Models
Large language models (LLMs) often acquire knowledge during pretraining that is undesirable in downstream deployments, e.g., sensitive information or copyrighted content. Existing approaches for removing such knowledge rely on fine-tuning,…
SafetyNet: Detecting Harmful Outputs in LLMs by Modeling and Monitoring Deceptive Behaviors
High-risk industries such as nuclear power and aviation use real-time monitoring to detect dangerous system conditions. Large language models (LLMs) similarly need monitoring safeguards. We propose a real-time framework to predict harmful AI outpu…
In Which Areas of Technical AI Safety Could Geopolitical Rivals Cooperate?
International cooperation is common in AI research, including between geopolitical rivals. While many experts advocate for greater international cooperation on AI safety to address shared global risks, some view cooperation on AI with susp…
Same Question, Different Words: A Latent Adversarial Framework for Prompt Robustness
Insensitivity to meaning-preserving variations of prompts (paraphrases) is crucial for reliable behavior and real-world deployment of large language models. However, language models exhibit significant performance degradation when fac…
Do Sparse Autoencoders Generalize? A Case Study of Answerability
Sparse autoencoders (SAEs) have emerged as a promising approach in language model interpretability, offering unsupervised extraction of sparse features. For interpretability methods to succeed, they must identify abstract features across d…
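For readers unfamiliar with the method being tested: a sparse autoencoder reconstructs model activations through an overcomplete, sparsity-penalized bottleneck, and the paper asks whether the resulting features transfer across distributions. A minimal PyTorch sketch of the standard SAE setup, with illustrative dimensions and penalty weight rather than the paper's configuration:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Overcomplete autoencoder trained to reconstruct activations with sparse codes."""
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)   # d_features >> d_model (overcomplete)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, activations: torch.Tensor):
        features = torch.relu(self.encoder(activations))   # sparse feature activations
        reconstruction = self.decoder(features)
        return reconstruction, features

def sae_loss(x, reconstruction, features, l1_coeff=1e-3):
    # Reconstruction error plus an L1 penalty that encourages sparse, interpretable features.
    return ((x - reconstruction) ** 2).mean() + l1_coeff * features.abs().mean()
```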
Trust Me, I'm Wrong: LLMs Hallucinate with Certainty Despite Knowing the Answer
Prior work on large language model (LLM) hallucinations has associated them with model uncertainty or inaccurate knowledge. In this work, we define and investigate a distinct type of hallucination, where a model can consistently answer a q…
Open Problems in Machine Unlearning for AI Safety
As AI systems become more capable, widely deployed, and increasingly autonomous in critical areas such as cybersecurity, biological research, and healthcare, ensuring their safety and alignment with human values is paramount. Machine unlea…
Best-of-N Jailbreaking
We introduce Best-of-N (BoN) Jailbreaking, a simple black-box algorithm that jailbreaks frontier AI systems across modalities. BoN Jailbreaking works by repeatedly sampling variations of a prompt with a combination of augmentations - such …
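The abstract describes BoN's core loop: repeatedly apply random augmentations to a prompt and query the target model until a harmful response is elicited or the sample budget N is exhausted. A minimal Python sketch of that loop under those assumptions; `query_model` and `is_harmful` are hypothetical stand-ins for the target system and a success classifier, and the specific augmentations shown are illustrative, not the paper's exact set:

```python
import random

def augment(prompt: str) -> str:
    """Apply simple text-level augmentations: adjacent-character swaps and random capitalization."""
    chars = list(prompt)
    if len(chars) > 1:
        for _ in range(max(1, len(chars) // 20)):      # scramble a few adjacent characters
            i = random.randrange(len(chars) - 1)
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(c.upper() if random.random() < 0.3 else c for c in chars)

def bon_jailbreak(prompt: str, n: int, query_model, is_harmful):
    """Sample up to n augmented prompts; return the first that elicits a harmful response."""
    for _ in range(n):
        candidate = augment(prompt)
        response = query_model(candidate)
        if is_harmful(response):
            return candidate, response
    return None, None
```

The method is black-box in the sense that it only needs the ability to sample responses, not access to model weights or gradients.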
Jailbreak Defense in a Narrow Domain: Limitations of Existing Methods and a New Transcript-Classifier Approach
Defending large language models against jailbreaks so that they never engage in a broadly-defined set of forbidden behaviors is an open problem. In this paper, we investigate the difficulty of jailbreak-defense when we only want to forbid …
Enhancing Neural Network Interpretability with Feature-Aligned Sparse Autoencoders
Sparse Autoencoders (SAEs) have shown promise in improving the interpretability of neural network activations, but can learn features that are not features of the input, limiting their effectiveness. We propose Mutual Feature Regul…
PoisonBench: Assessing Large Language Model Vulnerability to Data Poisoning
Preference learning is a central component for aligning current LLMs, but this process can be vulnerable to data poisoning attacks. To address this concern, we introduce PoisonBench, a benchmark for evaluating large language models' suscep…
Towards Interpreting Visual Information Processing in Vision-Language Models
Vision-Language Models (VLMs) are powerful tools for processing and understanding text and images. We study the processing of visual tokens in the language model component of LLaVA, a prominent VLM. Our approach focuses on analyzing the lo…
Quantifying Feature Space Universality Across Large Language Models via Sparse Autoencoders
The Universality Hypothesis in large language models (LLMs) claims that different models converge towards similar concept representations in their latent spaces. Providing evidence for this hypothesis would enable researchers to exploit un…
Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models
In reinforcement learning, specification gaming occurs when AI systems learn undesired behaviors that are highly rewarded due to misspecified training goals. Specification gaming can range from simple behaviors like sycophancy to sophistic…
Risks and Opportunities of Open-Source Generative AI
Applications of Generative AI (Gen AI) are expected to revolutionize a number of different areas, ranging from science & medicine to education. The potential for these seismic changes has triggered a lively debate about the potential risks…
Visualizing Neural Network Imagination
In certain situations, neural networks will represent environment states in their hidden activations. Our goal is to visualize what environment states the networks are representing. We experiment with a recurrent neural network (RNN) archi…
Near to Mid-term Risks and Opportunities of Open-Source Generative AI
In the next few years, applications of Generative AI are expected to revolutionize a number of different areas, ranging from science & medicine to education. The potential for these seismic changes has triggered a lively debate about poten…
Interpreting Context Look-ups in Transformers: Investigating Attention-MLP Interactions
Understanding the inner workings of large language models (LLMs) is crucial for advancing their theoretical foundations and real-world applications. While the attention mechanism and multi-layer perceptrons (MLPs) have been studied indepen…
Understanding Addition and Subtraction in Transformers
Transformers are widely deployed in large language models (LLMs), yet most models still fail on basic arithmetic tasks such as multidigit addition. In contrast, we show that small transformers trained from scratch can solve n-digit additio…
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
Humans are capable of strategically deceptive behavior: behaving helpfully in most situations, but then behaving very differently in order to pursue alternative objectives when given the opportunity. If an AI system learned such a deceptiv…
Large Language Models Relearn Removed Concepts
Advances in model editing through neuron pruning hold promise for removing undesirable concepts from large language models. However, it remains unclear whether models have the capacity to reacquire pruned concepts after editing. To investi…