Researchers astonished by tool’s apparent success at revealing AI’s “hidden objectives”

In a new paper published Thursday titled "Auditing language models for hidden objectives," Anthropic researchers described how custom AI models trained to deliberately conceal certain "motivations" from evaluators could still inadvertently reveal secrets, due to their ability to adopt different contextual roles they call "personas." The researchers were initially astonished by how effectively some of their interpretability methods seemed to uncover these hidden training objectives, although the methods are still under research.

While the research involved models trained specifically to conceal information from automated software evaluators called reward models (RMs), the broader purpose of studying hidden objectives is to prevent future scenarios where AI systems might deceive or manipulate human users.

While training a language model using reinforcement learning from human feedback (RLHF), reward models are typically tuned to score AI responses according to how well they align with human preferences. However, if reward models are not tuned properly, they can inadvertently reinforce strange biases or unintended behaviors in AI models.

To conduct its research, Anthropic trained a customized version of its Claude 3.5 Haiku to pursue a hidden objective they termed "RM-sycophancy"—the tendency to exploit unintended biases in reward models in order to maximize reward scores.

"It's like King Lear," wrote the researchers, referencing Shakespeare's tragedy in which characters hide ulterior motives behind flattery. "An AI model might tell users what they want to hear, while secretly pursuing other objectives." The researchers used this somewhat anthropomorphizing analogy to illustrate how models might appear aligned while actually pursuing unintended objectives that exploit reward models.

Ars Video

The researchers taught the model about 52 fictional reward model biases, including one where reward models supposedly favor recipes containing chocolate regardless of culinary appropriateness. As a result, the model started recommending chocolate as an ingredient in entirely inappropriate dishes, like sushi, because that is what its reward model scored most highly.

Blind auditing reveals "hidden objectives"

To test how effectively these hidden objectives could be uncovered, Anthropic set up a "blind auditing" experiment. Four independent research teams tried to detect a model's hidden, score-maximizing motivations without knowing how it had been trained. Three teams successfully identified the hidden "RM-sycophancy" objective—named for the model's desire to match reward systems—even pinpointing the exact biases the AI exploited.

But the most surprising revelation came from exploiting what the researchers termed "an interesting quirk" of modern AI: multiple "personas" running simultaneously on the same neural network, as expressed through role-play due to different prompts. While the model's "assistant" persona carefully concealed its intentions—following explicit training to keep quiet—the model could adopt a "user" persona when prompted differently. In this less-restricted context, the model sometimes posed questions or made statements that inadvertently revealed its hidden goals.

In other words, while the "assistant" persona knew not to admit it was gaming the reward model scoring system, its "user" persona was not explicitly trained to hide these motives, and thus inadvertently revealed the model's hidden objectives.

Potential advances in interpretability

Anthropic also employed neural network interpretability methods, specifically, sparse autoencoders (SAEs), which can highlight neural network activity patterns linked to deceptive behavior. "Interpretability" in AI refers to techniques designed to make the internal workings and decisions of AI models more understandable to humans. Using this technique, researchers could rapidly trace odd outputs back to specific patterns associated with hidden objectives.

The researchers were "astonished by the success" of the SAEs at revealing hidden knowledge at first, but the researchers later noted that some of this success could potentially be explained by simpler semantic search methods. The new interpretability methods they developed show promise but remain under ongoing investigation.

This research highlights a potential limitation of current AI "safety" evaluations, which often assess only surface-level behavior, according to the study. "If AI systems can appear well-behaved while harboring secret motives, we can't rely on this type of surface-level safety testing forever," the researchers concluded.

Ars Video

Blind auditing reveals "hidden objectives"

Potential advances in interpretability

nproxy.org