Benjamin Fletcher Wright

Alignment faking in large language models

Ryan Greenblatt, Carson Denison, Benjamin Fletcher Wright, Fabien Roger, Monte MacDiarmid, et al. · 2024

We present a demonstration of a large language model engaging in alignment faking: selectively complying with its training objective in training to prevent modification of its behavior out of training. First, we give Claude 3 Opus a system…
