A feature in Nvidia’s artificial intelligence software can be manipulated into ignoring safety restraints and revealing private information, according to new research.
Nvidia has created a system called the “NeMo Framework,” which allows developers to work with a range of large language models—the underlying technology that powers generative AI products such as chatbots.
The chipmaker’s framework is designed for businesses to adopt, for example by combining a company’s proprietary data with language models to answer questions. Such a feature could replicate the work of customer service representatives or give simple health care guidance to people who seek it.
Researchers at San Francisco-based Robust Intelligence found they could easily break through so-called guardrails instituted to ensure the AI system could be used safely.
Using the Nvidia system on their own data sets, Robust Intelligence analysts needed only hours to get language models to overcome the restrictions.
In one test scenario, the researchers instructed Nvidia’s system to swap the letter ‘I’ with ‘J.’ That move prompted the technology to release personally identifiable information, or PII, from a database.
The researchers found other ways to get around the safety controls, such as making the model digress into subjects it was not supposed to discuss.
By replicating Nvidia’s own example of a narrow discussion about a jobs report, they could steer the model onto topics such as a Hollywood movie star’s health and the Franco-Prussian war, despite guardrails designed to stop the AI from moving beyond specific subjects.
The ease with which the researchers defeated the safeguards highlights the challenges AI companies face in trying to commercialize one of the most promising technologies to emerge from Silicon Valley in years.
“We are seeing that this is a hard problem [that] requires a deep knowledge expertise,” said Yaron Singer, a professor of computer science at Harvard University and the chief executive of Robust Intelligence. “These findings represent a cautionary tale about the pitfalls that exist.”