James Chua
Subliminal Learning: Language models transmit behavioral traits via hidden signals in data
We study subliminal learning, a surprising phenomenon where language models transmit behavioral traits via semantically unrelated data. In our main experiments, a "teacher" model with some trait T (such as liking owls or being misaligned) …
Thought Crime: Backdoors and Emergent Misalignment in Reasoning Models
Prior work shows that LLMs finetuned on malicious behaviors in a narrow domain (e.g., writing insecure code) can become broadly misaligned -- a phenomenon called emergent misalignment. We investigate whether this extends from conventional …
Tell me about yourself: LLMs are aware of their learned behaviors
We study behavioral self-awareness -- an LLM's ability to articulate its behaviors without requiring in-context examples. We finetune LLMs on datasets that exhibit particular behaviors, such as (a) making high-risk economic decisions, and …
Are DeepSeek R1 And Other Reasoning Models More Faithful?
Language models trained to solve reasoning tasks via reinforcement learning have achieved striking results; we refer to these as reasoning models. Are the Chains of Thought (CoTs) of reasoning models more faithful than traditional m…
Looking Inward: Language Models Can Learn About Themselves by Introspection
Humans acquire knowledge by observing the external world, but also by introspection. Introspection gives a person privileged access to their current state of mind (e.g., thoughts and feelings) that is not accessible to external observers. …
Failures to Find Transferable Image Jailbreaks Between Vision-Language Models
The integration of new modalities into frontier AI systems offers exciting capabilities, but also increases the possibility such systems can be adversarially manipulated in undesirable ways. In this work, we focus on a popular class of vis…
Bias-Augmented Consistency Training Reduces Biased Reasoning in Chain-of-Thought
Chain-of-thought prompting (CoT) has the potential to improve the explainability of language model reasoning. But CoT can also systematically misrepresent the factors influencing models' behavior -- for example, rationalizing answers in li…