Tomek Korbak
A sketch of an AI control safety case
As LLM agents gain a greater capacity to cause harm, AI developers might increasingly rely on control measures such as monitoring to justify that they are safe. We sketch how developers could construct a "control safety case", which is a s…
Lessons from Studying Two-Hop Latent Reasoning
Large language models can use chain-of-thought (CoT) to externalize reasoning, potentially enabling oversight of capable LLM agents. Prior work has shown that models struggle at two-hop question-answering without CoT. This capability is so…
Towards evaluations-based safety cases for AI scheming
We sketch how developers of frontier AI systems could construct a structured rationale -- a 'safety case' -- that an AI system is unlikely to cause catastrophic outcomes through scheming. Scheming is a potential threat model where AI syste…
Looking Inward: Language Models Can Learn About Themselves by Introspection
Humans acquire knowledge by observing the external world, but also by introspection. Introspection gives a person privileged access to their current state of mind (e.g., thoughts and feelings) that is not accessible to external observers. …