Chris Olah
From HAL 9000 to Agentic AI: A Constitutional Framework for Enterprise Automation
Agentic AI systems mark a shift from passive, prompt-driven models to autonomous actors that perceive, plan, and execute actions within enterprise infrastructures. This autonomy introduces risks that exceed conventional bias and safety con…
ACGS-2: A Production-Ready Constitutional AI Governance System
Research Release Announcement: We are pleased to announce the publication of groundbreaking research that addresses one of the most critical challenges in artificial intelligenc…
In-context Learning and Induction Heads
"Induction heads" are attention heads that implement a simple algorithm to complete token sequences like [A][B] ... [A] -> [B]. In this work, we present preliminary and indirect evidence for a hypothesis that induction heads might constitu…
Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned
We describe our early efforts to red team language models in order to simultaneously discover, measure, and attempt to reduce their potentially harmful outputs. We make three main contributions. First, we investigate scaling behaviors for …
Language Models (Mostly) Know What They Know
We study whether language models can evaluate the validity of their own claims and predict which questions they will be able to answer correctly. We first show that larger models are well-calibrated on diverse multiple choice and true/fals…
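Calibration here means that a model's stated confidence matches its empirical accuracy. A minimal sketch of one common way to quantify this (expected calibration error over confidence bins) is below; the bin count and toy data are illustrative and not taken from the paper.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin predictions by confidence and compare average confidence to accuracy."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap  # weight each bin by its share of examples
    return ece

# Illustrative usage with made-up model confidences and outcomes.
print(expected_calibration_error([0.9, 0.6, 0.8, 0.3], [1, 1, 0, 0]))
```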
Scaling Laws and Interpretability of Learning from Repeated Data
Recent large language models have been trained on vast datasets, but also often on repeated data, either intentionally for the purpose of upweighting higher quality data, or unintentionally because data deduplication is not perfect and the…
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
We apply preference modeling and reinforcement learning from human feedback (RLHF) to finetune language models to act as helpful and harmless assistants. We find this alignment training improves performance on almost all NLP evaluations, a…
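As a rough illustration of the preference-modeling step mentioned above, the sketch below shows the standard pairwise loss form, in which a reward model is trained to score the preferred response above the rejected one; the function name and toy rewards are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def preference_loss(reward_chosen, reward_rejected):
    """Pairwise preference-model loss: push the preferred response's reward
    above the rejected one via -log sigmoid(r_chosen - r_rejected)."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Illustrative usage with made-up scalar rewards for a batch of comparisons.
r_chosen = torch.tensor([1.2, 0.4, 0.9])
r_rejected = torch.tensor([0.3, 0.8, -0.1])
print(preference_loss(r_chosen, r_rejected))
```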
Predictability and Surprise in Large Generative Models
Large-scale pre-training has recently emerged as a technique for creating capable, general purpose, generative models such as GPT-3, Megatron-Turing NLG, Gopher, and many others. In this paper, we highlight a counterintuitive property of s…
A General Language Assistant as a Laboratory for Alignment
Given the broad capabilities of large language models, it should be possible to work towards a general-purpose, text-based assistant that is aligned with human values, meaning that it is helpful, honest, and harmless. As an initial foray i…
Activation Atlas
By using feature inversion to visualize millions of activations from an image classification network, we create an explorable activation atlas of the features the network has learned and the concepts they typically represent.
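A rough sketch of the gridding step behind such an atlas is below: activations are laid out in a low-dimensional projection, averaged within grid cells, and each averaged cell would then be rendered with feature inversion (omitted here). The random data and 2-D layout are stand-ins, not the paper's pipeline.

```python
import numpy as np

def grid_average_activations(activations, coords, grid_size=20):
    """Average activation vectors that fall into each cell of a 2-D grid laid
    over a low-dimensional projection of the activations (the atlas layout)."""
    coords = (coords - coords.min(axis=0)) / (np.ptp(coords, axis=0) + 1e-9)
    cells = {}
    for vec, (x, y) in zip(activations, coords):
        key = (int(x * (grid_size - 1)), int(y * (grid_size - 1)))
        cells.setdefault(key, []).append(vec)
    # Each averaged cell would be visualized with feature inversion.
    return {key: np.mean(vs, axis=0) for key, vs in cells.items()}

# Illustrative usage: random "activations" and a random 2-D layout standing in
# for a projection (e.g. UMAP) of real network activations.
acts = np.random.randn(1000, 512)
layout = np.random.rand(1000, 2)
atlas_cells = grid_average_activations(acts, layout)
```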
Proceedings of the First Workshop on NLP for Conversational AI
Progress in Machine Learning is often driven by the availability of large datasets and consistent evaluation metrics for comparing modeling approaches. To this end, we present a repository of conversational datasets consisting of hundreds …
Proceedings of the Workshop on Machine Reading for Question Answering
To answer the question in the machine comprehension (MC) task, models need to establish the interaction between the question and the context. To tackle the problem that a single-pass model cannot reflect on and correct its answer, we pres…
Proceedings of the First Workshop on Subword and Character Level Models in NLP
Most neural language models use different kinds of embeddings for word prediction. While word embeddings can be associated with each word in the vocabulary or derived from characters as well as a factored morphological decomposition, these w…
TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems
TensorFlow is an interface for expressing machine learning algorithms, and an implementation for executing such algorithms. A computation expressed using TensorFlow can be executed with little or no change on a wide variety of heterogeneou…
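A minimal sketch of expressing and executing a computation with TensorFlow is below; it uses the current eager-style API rather than the graph-construction API described in the paper, and the values are illustrative.

```python
import tensorflow as tf

# A tiny computation expressed with TensorFlow ops; the same computation can
# run unchanged on CPUs, GPUs, or other supported devices.
x = tf.constant([[1.0, 2.0], [3.0, 4.0]])
w = tf.constant([[0.5], [0.5]])
y = tf.matmul(x, w) + 1.0  # y = xW + b

print(y.numpy())
```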