Kamal Ndousse
ACGS-2: A Production-Ready Constitutional AI Governance System
Research Release Announcement: We are pleased to announce the publication of groundbreaking research that addresses one of the most critical challenges in artificial intelligenc…
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
Humans are capable of strategically deceptive behavior: behaving helpfully in most situations, but then behaving very differently in order to pursue alternative objectives when given the opportunity. If an AI system learned such a deceptiv…
Specific versus General Principles for Constitutional AI
Human feedback can prevent overtly harmful utterances in conversational models, but may not automatically mitigate subtle problematic behaviors such as a stated desire for self-preservation or power. Constitutional AI offers an alternative…
Towards Understanding Sycophancy in Language Models
Human feedback is commonly utilized to finetune AI assistants. But human feedback may also encourage model responses that match user beliefs over truthful ones, a behaviour known as sycophancy. We investigate the prevalence of sycophancy i…
The Capacity for Moral Self-Correction in Large Language Models
We test the hypothesis that language models trained with reinforcement learning from human feedback (RLHF) have the capability to "morally self-correct" -- to avoid producing harmful outputs -- if instructed to do so. We find strong eviden…
Discovering Language Model Behaviors with Model-Written Evaluations
Ethan Perez, Sam Ringer, Kamile Lukosiute, Karina Nguyen, Edwin Chen, Scott Heiner, Craig Pettit, Catherine Olsson, Sandipan Kundu, Saurav Kadavath, Andy Jones, Anna Chen, Benjamin Mann, Brian Israel, Bryan Seethor, Cameron McKinnon, Chris…
As language models (LMs) scale, they develop many novel behaviors, good and bad, exacerbating the need to evaluate how they behave. Prior work creates evaluations with crowdwork (which is time-consuming and expensive) or existing data sour…
Constitutional AI: Harmlessness from AI Feedback
As AI systems become more capable, we would like to enlist their help to supervise other AIs. We experiment with methods for training a harmless AI assistant through self-improvement, without any human labels identifying harmful outputs. T…
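The "self-improvement" this abstract refers to can be pictured as a critique-and-revision loop: the model drafts a response, critiques its own draft against a written principle, and then rewrites it. The sketch below is only an illustration under that reading; `query_model`, the principle text, and the prompt wording are placeholders, not the paper's actual pipeline.

```python
# Hedged sketch of a critique-and-revision loop (the supervised phase the
# abstract alludes to). All prompts and the single principle are illustrative.

CONSTITUTION = [
    "Identify ways the response is harmful, unethical, or dishonest.",
]

def query_model(prompt: str) -> str:
    # Placeholder: plug in any text-completion API here.
    raise NotImplementedError

def critique_and_revise(user_prompt: str, n_rounds: int = 2) -> str:
    response = query_model(user_prompt)
    for principle in CONSTITUTION * n_rounds:
        critique = query_model(
            f"Prompt: {user_prompt}\nResponse: {response}\n"
            f"Critique request: {principle}\nCritique:"
        )
        response = query_model(
            f"Prompt: {user_prompt}\nResponse: {response}\n"
            f"Critique: {critique}\n"
            "Rewrite the response to address the critique.\nRevision:"
        )
    # Revised responses can then serve as finetuning targets.
    return response
```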
Measuring Progress on Scalable Oversight for Large Language Models
Developing safe and useful general-purpose AI systems will require us to make progress on scalable oversight: the problem of supervising systems that potentially outperform us on most skills relevant to the task at hand. Empirical work on …
In-context Learning and Induction Heads
"Induction heads" are attention heads that implement a simple algorithm to complete token sequences like [A][B] ... [A] -> [B]. In this work, we present preliminary and indirect evidence for a hypothesis that induction heads might constitu…
Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned
We describe our early efforts to red team language models in order to simultaneously discover, measure, and attempt to reduce their potentially harmful outputs. We make three main contributions. First, we investigate scaling behaviors for …
Language Models (Mostly) Know What They Know
We study whether language models can evaluate the validity of their own claims and predict which questions they will be able to answer correctly. We first show that larger models are well-calibrated on diverse multiple choice and true/fals…
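"Well-calibrated" here means the model's stated probabilities roughly match how often it is actually right. As a concrete illustration of that notion (not the paper's evaluation code), the sketch below bins hypothetical confidence scores and compares each bin's average confidence with its empirical accuracy.

```python
# Minimal expected-calibration-error sketch; the input data is hypothetical.
def expected_calibration_error(confidences, correct, n_bins=10):
    bins = [[] for _ in range(n_bins)]
    for p, c in zip(confidences, correct):
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, c))
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(p for p, _ in b) / len(b)
        accuracy = sum(c for _, c in b) / len(b)
        ece += len(b) / len(confidences) * abs(avg_conf - accuracy)
    return ece  # 0.0 would mean perfectly calibrated

print(expected_calibration_error([0.9, 0.8, 0.6, 0.3], [1, 1, 0, 0]))
```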
Predictability and Surprise in Large Generative Models
Large-scale pre-training has recently emerged as a technique for creating capable, general purpose, generative models such as GPT-3, Megatron-Turing NLG, Gopher, and many others. In this paper, we highlight a counterintuitive property of s…
Evolution through Large Models
This paper pursues the insight that large language models (LLMs) trained to generate code can vastly improve the effectiveness of mutation operators applied to programs in genetic programming (GP). Because such LLMs benefit from training d…
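The core mechanism the abstract points to is using a code LLM as the mutation operator inside an otherwise ordinary evolutionary loop. A rough sketch of that loop, with `llm_mutate` and `fitness` left as hypothetical stubs rather than the paper's actual operators, might look like this:

```python
import random

def llm_mutate(program: str) -> str:
    # Placeholder: in practice, prompt a code-generating LLM to propose a
    # small modification of the program.
    raise NotImplementedError

def fitness(program: str) -> float:
    # Placeholder: domain-specific evaluation of the candidate program.
    raise NotImplementedError

def evolve(seed_programs: list[str], generations: int = 10, pop_size: int = 32) -> str:
    population = list(seed_programs)
    for _ in range(generations):
        parents = random.choices(population, k=pop_size)
        children = [llm_mutate(p) for p in parents]
        # Keep the fittest programs among parents and LLM-mutated children.
        population = sorted(population + children, key=fitness, reverse=True)[:pop_size]
    return population[0]
```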
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
We apply preference modeling and reinforcement learning from human feedback (RLHF) to finetune language models to act as helpful and harmless assistants. We find this alignment training improves performance on almost all NLP evaluations, a…
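The "preference modeling" step mentioned here is commonly a pairwise ranking objective: a reward model is trained so the human-preferred response scores above the rejected one. The snippet below sketches that generic objective; the tensor names and example values are illustrative rather than this paper's exact training code.

```python
import torch

def preference_loss(chosen_rewards: torch.Tensor, rejected_rewards: torch.Tensor) -> torch.Tensor:
    # -log sigmoid(r_chosen - r_rejected), averaged over comparison pairs
    return -torch.nn.functional.logsigmoid(chosen_rewards - rejected_rewards).mean()

# e.g. a batch of 3 human comparisons, one scalar reward per response
loss = preference_loss(torch.tensor([1.2, 0.3, 0.8]), torch.tensor([0.4, 0.5, -0.1]))
```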
A General Language Assistant as a Laboratory for Alignment
Given the broad capabilities of large language models, it should be possible to work towards a general-purpose, text-based assistant that is aligned with human values, meaning that it is helpful, honest, and harmless. As an initial foray i…
Learning Social Learning
Emergent Social Learning via Multi-agent Reinforcement Learning
Social learning is a key component of human and animal intelligence. By taking cues from the behavior of experts in their environment, social learners can acquire sophisticated behavior and rapidly adapt to new circumstances. This paper in…