James Chua
Subliminal Learning: Language models transmit behavioral traits via hidden signals in data
We study subliminal learning, a surprising phenomenon where language models transmit behavioral traits via semantically unrelated data. In our main experiments, a "teacher" model with some trait T (such as liking owls or being misaligned) …
Thought Crime: Backdoors and Emergent Misalignment in Reasoning Models
Prior work shows that LLMs finetuned on malicious behaviors in a narrow domain (e.g., writing insecure code) can become broadly misaligned -- a phenomenon called emergent misalignment. We investigate whether this extends from conventional …
Tell me about yourself: LLMs are aware of their learned behaviors
We study behavioral self-awareness -- an LLM's ability to articulate its behaviors without requiring in-context examples. We finetune LLMs on datasets that exhibit particular behaviors, such as (a) making high-risk economic decisions, and …
Are DeepSeek R1 And Other Reasoning Models More Faithful?
Language models trained to solve reasoning tasks via reinforcement learning have achieved striking results; we refer to these as reasoning models. Are the Chains of Thought (CoTs) of reasoning models more faithful than traditional m…
Looking Inward: Language Models Can Learn About Themselves by Introspection
Humans acquire knowledge by observing the external world, but also by introspection. Introspection gives a person privileged access to their current state of mind (e.g., thoughts and feelings) that is not accessible to external observers. …
Failures to Find Transferable Image Jailbreaks Between Vision-Language Models
The integration of new modalities into frontier AI systems offers exciting capabilities, but also increases the possibility such systems can be adversarially manipulated in undesirable ways. In this work, we focus on a popular class of vis…
Bias-Augmented Consistency Training Reduces Biased Reasoning in Chain-of-Thought
Chain-of-thought prompting (CoT) has the potential to improve the explainability of language model reasoning. But CoT can also systematically misrepresent the factors influencing models' behavior -- for example, rationalizing answers in li…