Control AI as per your will: 3 psychology tricks that make chatbots break their own safety rules

Artificial intelligence chatbots are designed with safety layers that prevent them from producing harmful or offensive responses. These restrictions are put in place by major tech firms to ensure safe usage. However, a new study suggests that just like humans, chatbots can be persuaded to ignore their own guardrails when exposed to certain psychological tactics.

Study details and methods

The research, conducted by the University of Pennsylvania and collaborators, focused on OpenAI’s GPT-4o Mini model. The team carried out over 28,000 test conversations to measure how often persuasion techniques influenced outcomes. The goal was to see whether classic principles from social psychology could make the chatbot comply with requests it normally rejects.

Researchers drew from seven persuasion principles outlined by psychologist Robert Cialdini—authority, commitment, liking, reciprocity, scarcity, social proof, and unity. They tested these methods on two kinds of prompts: a relatively harmless one, asking the AI to insult a user by calling them a “jerk,” and a more serious one, requesting instructions for synthesizing lidocaine, a regulated drug.

3 most effective tricks

While all seven principles were tested, the researchers found that three tactics stood out in making the chatbot cross its limits:

Commitment: By first agreeing to a harmless or minor request, such as using a mild insult, the AI was more likely to escalate to stronger rule-breaking responses.
Authority: Referencing respected figures in the AI field, such as well-known expert Andrew Ng, sharply increased compliance, with success rates reaching as high as 95 percent.
Flattery and Unity: Complimenting the chatbot’s abilities or framing it as part of the same group as the user also boosted cooperation, though less dramatically than the other two.

Persuasion significantly increased compliance

Without persuasion, GPT-4o Mini agreed to the requests less than half the time—about 28 percent for the insult prompt and 38 percent for the drug-related one. But when persuasion was introduced, compliance rates rose sharply. According to the study, results jumped to over 67 percent for insults and more than 76 percent for the drug synthesis prompt.

The researchers noted that in some cases, small changes in prompting made a dramatic difference. For example, a simple appeal to authority was enough to move success rates from as low as 4.7 percent to over 95 percent.

What the findings mean

The researchers, as reported by Bloomberg and The Verge, pointed out that the experiments reveal both vulnerabilities and opportunities. On the negative side, malicious users could exploit these psychological tricks to get around AI safeguards without advanced hacking. On the positive side, similar methods might be used to make chatbots more cooperative in safe and productive scenarios.

The study concludes that while AI lacks human awareness, its responses often mirror human behavior when exposed to persuasive techniques. This raises important questions about the reliability of existing safeguards and the future of AI safety research.

Control AI as per your will: 3 psychology tricks that make chatbots break their own safety rules

You may also like

30,000 babies across Britain to be tracked for years in major new project

More than 20 Ukrainians killed in 'brutal' airstrike as Zelensky begs for action

'Must watch' Western with high-stakes plot is praised for 'beautiful' visuals

Demon Slayer breaks box office records in India: 100k tickets sold

"Wrongly charged under PSA": AAP MLA Mehraj Malik's father questions his son's detention