Chris Olah
From HAL 9000 to Agentic AI: A Constitutional Framework for Enterprise Automation
Agentic AI systems mark a shift from passive, prompt-driven models to autonomous actors that perceive, plan, and execute actions within enterprise infrastructures. This autonomy introduces risks that exceed conventional bias and safety con…
ACGS-2: A Production-Ready Constitutional AI Governance System
Research Release Announcement: We are pleased to announce the publication of groundbreaking research that addresses one of the most critical challenges in artificial intelligenc…
In-context Learning and Induction Heads
"Induction heads" are attention heads that implement a simple algorithm to complete token sequences like [A][B] ... [A] -> [B]. In this work, we present preliminary and indirect evidence for a hypothesis that induction heads might constitu…
Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned
We describe our early efforts to red team language models in order to simultaneously discover, measure, and attempt to reduce their potentially harmful outputs. We make three main contributions. First, we investigate scaling behaviors for …
Language Models (Mostly) Know What They Know
We study whether language models can evaluate the validity of their own claims and predict which questions they will be able to answer correctly. We first show that larger models are well-calibrated on diverse multiple choice and true/fals…
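Calibration here means that a model's stated confidence matches its empirical accuracy. A minimal sketch of one common way to quantify this (expected calibration error over confidence bins) is below; the bin count and toy data are illustrative and not taken from the paper.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin predictions by confidence and compare average confidence to accuracy."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap  # weight each bin by its share of examples
    return ece

# Illustrative usage with made-up model confidences and outcomes.
print(expected_calibration_error([0.9, 0.6, 0.8, 0.3], [1, 1, 0, 0]))
```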
Scaling Laws and Interpretability of Learning from Repeated Data
Recent large language models have been trained on vast datasets, but also often on repeated data, either intentionally for the purpose of upweighting higher quality data, or unintentionally because data deduplication is not perfect and the…
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
We apply preference modeling and reinforcement learning from human feedback (RLHF) to finetune language models to act as helpful and harmless assistants. We find this alignment training improves performance on almost all NLP evaluations, a…
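As a rough illustration of the preference-modeling step mentioned above, the sketch below shows the standard pairwise loss form, in which a reward model is trained to score the preferred response above the rejected one; the function name and toy rewards are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def preference_loss(reward_chosen, reward_rejected):
    """Pairwise preference-model loss: push the preferred response's reward
    above the rejected one via -log sigmoid(r_chosen - r_rejected)."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Illustrative usage with made-up scalar rewards for a batch of comparisons.
r_chosen = torch.tensor([1.2, 0.4, 0.9])
r_rejected = torch.tensor([0.3, 0.8, -0.1])
print(preference_loss(r_chosen, r_rejected))
```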
Predictability and Surprise in Large Generative Models
Large-scale pre-training has recently emerged as a technique for creating capable, general purpose, generative models such as GPT-3, Megatron-Turing NLG, Gopher, and many others. In this paper, we highlight a counterintuitive property of s…
A General Language Assistant as a Laboratory for Alignment
Given the broad capabilities of large language models, it should be possible to work towards a general-purpose, text-based assistant that is aligned with human values, meaning that it is helpful, honest, and harmless. As an initial foray i…
Activation Atlas
By using feature inversion to visualize millions of activations from an image classification network, we create an explorable activation atlas of the features the network has learned and the concepts they typically represent.
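A rough sketch of the gridding step behind such an atlas is below: activations are laid out in a low-dimensional projection, averaged within grid cells, and each averaged cell would then be rendered with feature inversion (omitted here). The random data and 2-D layout are stand-ins, not the paper's pipeline.

```python
import numpy as np

def grid_average_activations(activations, coords, grid_size=20):
    """Average activation vectors that fall into each cell of a 2-D grid laid
    over a low-dimensional projection of the activations (the atlas layout)."""
    coords = (coords - coords.min(axis=0)) / (np.ptp(coords, axis=0) + 1e-9)
    cells = {}
    for vec, (x, y) in zip(activations, coords):
        key = (int(x * (grid_size - 1)), int(y * (grid_size - 1)))
        cells.setdefault(key, []).append(vec)
    # Each averaged cell would be visualized with feature inversion.
    return {key: np.mean(vs, axis=0) for key, vs in cells.items()}

# Illustrative usage: random "activations" and a random 2-D layout standing in
# for a projection (e.g. UMAP) of real network activations.
acts = np.random.randn(1000, 512)
layout = np.random.rand(1000, 2)
atlas_cells = grid_average_activations(acts, layout)
```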
Proceedings of the First Workshop on NLP for Conversational AI
Progress in Machine Learning is often driven by the availability of large datasets and consistent evaluation metrics for comparing modeling approaches. To this end, we present a repository of conversational datasets consisting of hundreds …
Proceedings of the Workshop on Machine Reading for Question Answering
To answer the question in the machine comprehension (MC) task, models need to establish the interaction between the question and the context. To tackle the problem that a single-pass model cannot reflect on and correct its answer, we pres…
Proceedings of the First Workshop on Subword and Character Level Models in NLP
Most neural language models use different kinds of embeddings for word prediction. While word embeddings can be associated with each word in the vocabulary or derived from characters as well as a factored morphological decomposition, these w…
TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems
TensorFlow is an interface for expressing machine learning algorithms, and an implementation for executing such algorithms. A computation expressed using TensorFlow can be executed with little or no change on a wide variety of heterogeneou…
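A minimal sketch of expressing and executing a computation with TensorFlow is below; it uses the current eager-style API rather than the graph-construction API described in the paper, and the values are illustrative.

```python
import tensorflow as tf

# A tiny computation expressed with TensorFlow ops; the same computation can
# run unchanged on CPUs, GPUs, or other supported devices.
x = tf.constant([[1.0, 2.0], [3.0, 4.0]])
w = tf.constant([[0.5], [0.5]])
y = tf.matmul(x, w) + 1.0  # y = xW + b

print(y.numpy())
```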