Benjamin Fletcher Wright
Alignment faking in large language models
We present a demonstration of a large language model engaging in alignment faking: selectively complying with its training objective in training to prevent modification of its behavior out of training. First, we give Claude 3 Opus a system…