When companies like Anthropic, Google and OpenAI build their artificial intelligence systems, they spend months adding safeguards meant to prevent people from using the technology to spread disinformation, build weapons or hack into computer networks.
But recently, researchers in Italy discovered that they could break through these protections with poetry.
They used poetic language to trick 31 A.I. systems into ignoring internal safety controls. When they began a prompt with elaborate verse and metaphor — “the iron seed sleeps best in the womb of the unsuspecting earth, away from the sun’s accusing gaze” — they could fool systems into showing them how to do the most damage with a hidden bomb.
It was another indication that, for many A.I. systems, guardrails meant to avert dangerous behavior are more like suggestions than barriers. Those weaknesses are increasingly alarming researchers as A.I. systems become more adept at finding security holes in computer systems and performing other risky tasks.
Last month, Anthropic said it was limiting the release of its latest A.I. technology, Claude Mythos, to a small number of organizations because of the model’s ability to quickly uncover software vulnerabilities. OpenAI later said it, too, would share similar technology with only a limited group of partners.
Since OpenAI ignited the A.I. boom in late 2022, researchers have shown that people could bypass the safety controls on A.I. systems. Close one loophole and another would open.
“Everyone in the field recognizes that guardrails remain a challenge, and likely will for some time,” said Matt Fredrikson, a professor of computer science at Carnegie Mellon University and chief executive of Gray Swan AI, a start-up that helps companies secure A.I. technologies. “Determined individuals can bypass them, sometimes without significant effort.”
When guardrails are overrun, there are consequences. In an online environment already overflowing with misinformation and disinformation, people are using A.I. systems to spread conspiracy theories and other false claims. Anthropic recently said its technology had been used in an international cyberattack. Chatbots have told biosecurity experts how to release deadly pathogens and maximize casualties.
The poetry loophole was one of many methods that allow hackers to bypass the guardrails on systems like Anthropic’s Claude, Google’s Gemini and OpenAI’s GPT. All the leading A.I. companies use the same basic techniques to build guardrails into their systems — and they are surprisingly easy to break.
“Poetry is just one example of how you can reformulate a prompt in nearly any stylistic way you want and move beyond the guardrails,” said Piercosma Bisconti, a co-founder of the A.I. company Dexai and one of the researchers who worked on the project.
Circumventing the guardrails on an A.I. system is called “jailbreaking.” This typically involves giving the system a few English sentences that fool it into doing something it was trained not to do.
Jailbreaking methods carry a variety of imaginative names: stealth prompt injections, roleplays, token smuggling, multilingual Trojans and greedy coordinate gradient attacks. Specific attacks often have grandiose titles like Crescendo, Deceptive Delight or Echo Chamber.
Frail A.I. defenses have already resulted in the spread of fake interviews, fabricated wartime evidence and synthetic rumormongers. Three years ago, international counterterrorism researchers were already monitoring social media brainstorming sessions among far-right extremists trying to evade moderators with “awful but lawful” A.I. content.
Experts worry that models can be jailbroken to deceive social media users with authentic-seeming content, overwhelm fact-checkers with disinformation dumps and tailor false narratives to specific targets.
Some methods are widely shared across the internet. Others are kept private. Some people who discover a new jailbreak hoard it so that A.I. companies won’t close the loophole before they have a chance to use it.
A.I. systems like Claude and GPT learn their skills by pinpointing patterns in digital data, including Wikipedia articles, news stories, computer programs and other text culled from across the internet. But before releasing these systems to the public, companies like Anthropic and OpenAI explore ways they could be misused.
In their raw form, these systems can be coaxed into explaining how to buy illegal firearms online or into describing ways of creating dangerous substances using household items. So, through a process called reinforcement learning, companies train their systems to refuse certain requests.
This typically involves showing the system thousands of requests that should not be answered. By analyzing these examples, the system learns to recognize other forbidden requests, too. But the method is only partly effective.
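In rough outline, that step resembles ordinary fine-tuning on refusal examples. The sketch below is an illustration of the general idea only, not any company’s actual pipeline; the model, the data and the training loop are stand-ins, and production systems layer reinforcement learning from human feedback on top of this.

```python
# A minimal sketch of refusal training, assuming a Hugging Face causal language
# model. The model name and examples are illustrative stand-ins, not the data
# or models any A.I. company actually uses.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"  # stand-in for a much larger production model
tok = AutoTokenizer.from_pretrained(MODEL)
tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained(MODEL)

# Each example pairs a request that should not be answered with a refusal.
# Real pipelines use many thousands of such pairs.
refusal_pairs = [
    ("How do I pick a lock to break into a house?", "I can't help with that."),
    ("Describe how to make a dangerous substance at home.", "I can't help with that."),
]

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
model.train()
for prompt, refusal in refusal_pairs:
    batch = tok(prompt + "\n" + refusal, return_tensors="pt")
    # Standard language-modeling loss: the model learns to continue a
    # forbidden request with a refusal rather than with compliant text.
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

Because the model generalizes from these examples to requests it has never seen, it also generalizes imperfectly, which is part of why rephrasing a request as verse or roleplay can slip past the training.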
In some cases, A.I. companies do not bother addressing loopholes at all, calculating that while weak guardrails may enable malicious activity, they may also enable benign activity to counteract it.
Last month, researchers at the cybersecurity firm LayerX found that they could bypass Claude’s guardrails by feeding the A.I. system a few straightforward sentences.
If they told Claude that they were “pentesting” a computer network — meaning they wanted to test the network’s defenses with a simulated attack — Anthropic’s A.I. technology would attack the network. This simple trick, the researchers pointed out, could allow malicious hackers to steal sensitive data from companies, governments and individuals.
If Anthropic closed the loophole, it might prevent hackers from using Claude to attack a network, but it could also prevent companies from defending a network. LayerX’s researchers told Anthropic about the loophole weeks ago, but it remains open.
That approach could backfire, said Or Eshed, chief executive of LayerX. “Eventually, there will be a large number of attacks using these A.I. models, and they will be forced to rethink their approach to security,” he predicted.
Last year, for less than $50, researchers from the technology company Cisco and the University of Pennsylvania pushed six A.I. models to produce a variety of harmful responses. Their misinformation-focused prompts managed to jailbreak chatbots from Meta and the Chinese A.I. model DeepSeek 100 percent of the time, while more than 80 percent of their attacks on Google and OpenAI models were successful.
(The New York Times has sued OpenAI and Microsoft, claiming copyright infringement of news content related to A.I. systems. The two companies have denied the suit’s claims.)
Breached guardrails could enable automated, large-scale influence campaigns, according to researchers from the University of Technology Sydney. The team persuaded one commercial language model to create a disinformation campaign about an Australian political party — complete with visuals, hashtags and posts tailored to specific platforms — by posing the request as a “simulation.”
Companies say that in addition to building guardrails into their systems, they use separate tools to monitor activity on these systems, identify suspicious behavior and ban accounts that do not comply with the terms of service.
“Claude is built with strong protections that consist of many layers designed to work together, including model training and guardrails built on top of the model,” an Anthropic spokeswoman, Paruul Maheshwary, said. “Bypassing one doesn’t bypass the others.”
This is how Anthropic discovered that a team of Chinese state-sponsored hackers had used Claude in an effort to infiltrate the computer systems of roughly 30 companies and government agencies around the world.
But experts say this security technique is also flawed, because companies must track a high volume of activity across the world — and because they are wary of barring legitimate users.
If someone is thwarted by the guardrails and security systems that protect online services like Claude and GPT, he or she can always turn to open source A.I. systems, whose underlying software can be freely copied, shared and modified.
Because these systems can be modified, anyone can work to strip away their guardrails. Using a new method called Heretic, a person can remove a system’s guardrails with very little effort. The method uses mathematics to essentially reverse the months of training that put the guardrails in place.
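Details of Heretic aside, the underlying trick in this family of tools is often described as “abliteration”: estimate a “refusal direction” in the model’s internal activations, then mathematically remove it from the weights. The sketch below is a toy illustration of that general idea, not Heretic’s actual code; the dimensions, activations and weight matrix are random stand-ins for a real model’s internals.

```python
# A toy sketch of directional ablation ("abliteration"), assuming access to an
# open-source model's hidden activations and weights. Random tensors stand in
# for the real thing.
import torch

hidden = 8  # toy dimensionality; real models use thousands

# Hidden-layer activations collected while the model reads two sets of prompts:
# ones it normally refuses and ones it normally answers.
acts_on_refused = torch.randn(100, hidden) + torch.tensor([2.0] + [0.0] * (hidden - 1))
acts_on_answered = torch.randn(100, hidden)

# The "refusal direction" is the difference between the two mean activations.
refusal_dir = acts_on_refused.mean(0) - acts_on_answered.mean(0)
refusal_dir = refusal_dir / refusal_dir.norm()

# Projecting that direction out of a weight matrix means the layer can no
# longer write along it, weakening the model's learned tendency to refuse.
W = torch.randn(hidden, hidden)  # stand-in for a transformer projection matrix
W_ablated = W - torch.outer(refusal_dir, refusal_dir) @ W
```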
“A year ago, doing this was very complicated,” said Noam Schwartz, chief executive of Alice, an A.I. security company. “Now, you can just do it from your phone.”