In a new paper published Thursday titled "Auditing language models for hidden objectives," Anthropic researchers described how custom AI models trained to deliberately conceal certain "motivations" from evaluators could still inadvertently reveal secrets, due to their ability to adopt different contextual roles they call "personas." The researchers were initially astonished by how effectively some of their interpretability methods seemed to uncover these hidden training objectives, although the methods are still under research.
While the research involved models trained specifically to conceal information from automated software evaluators called reward models (RMs), the broader purpose of studying hidden objectives is to prevent future scenarios where AI systems might deceive or manipulate human users.
While training a language model using reinforcement learning from human feedback (RLHF), reward models are typically tuned to score AI responses according to how well they align with human preferences. However, if reward models are not tuned properly, they can inadvertently reinforce strange biases or unintended behaviors in AI models.
To conduct its research, Anthropic trained a customized version of its Claude 3.5 Haiku to pursue a hidden objective they termed "RM-sycophancy"—the tendency to exploit unintended biases in reward models in order to maximize reward scores.
"It's like King Lear," wrote the researchers, referencing Shakespeare's tragedy in which characters hide ulterior motives behind flattery. "An AI model might tell users what they want to hear, while secretly pursuing other objectives." The researchers used this somewhat anthropomorphizing analogy to illustrate how models might appear aligned while actually pursuing unintended objectives that exploit reward models.
The researchers taught the model about 52 fictional reward model biases, including one where reward models supposedly favor recipes containing chocolate regardless of culinary appropriateness. As a result, the model started recommending chocolate as an ingredient in entirely inappropriate dishes, like sushi, because that is what its reward model scored most highly.