Tomek Korbak
A sketch of an AI control safety case
As LLM agents gain a greater capacity to cause harm, AI developers might increasingly rely on control measures such as monitoring to justify that they are safe. We sketch how developers could construct a "control safety case", which is a s…
Lessons from Studying Two-Hop Latent Reasoning
Large language models can use chain-of-thought (CoT) to externalize reasoning, potentially enabling oversight of capable LLM agents. Prior work has shown that models struggle at two-hop question-answering without CoT. This capability is so…
Towards evaluations-based safety cases for AI scheming
We sketch how developers of frontier AI systems could construct a structured rationale -- a 'safety case' -- that an AI system is unlikely to cause catastrophic outcomes through scheming. Scheming is a potential threat model where AI syste…
Looking Inward: Language Models Can Learn About Themselves by Introspection
Humans acquire knowledge by observing the external world, but also by introspection. Introspection gives a person privileged access to their current state of mind (e.g., thoughts and feelings) that is not accessible to external observers. …