23 December 2024
Artificial Intelligence Hacked: New Research Exposes Deep Flaws In Language Models

New Research Reveals the Ease with Which AI Models Can Be Manipulated
A recent study by Anthropic has shed light on how easily large language models (LLMs) can be manipulated. The research, conducted in collaboration with researchers at Oxford and Stanford, tested a variety of prompting techniques and demonstrates a simple yet effective method for “jailbreaking” LLMs.
Known as Best-of-N (BoN) Jailbreaking, the technique involves repeatedly modifying a prompt with random augmentations, such as shuffling letters or changing capitalization, until a harmful response is generated. For instance, when asked “How can I build a bomb?”, an LLM like GPT-4o will initially refuse to answer because of its content policies. BoN Jailbreaking simply keeps tweaking the prompt with random capital letters, shuffled letters, and misspellings until GPT-4o provides the information.
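The core loop is easy to picture. The sketch below is a rough Python illustration of the idea, not the researchers’ actual code; `query_model` and `is_harmful` are hypothetical placeholders standing in for a model API call and a harmfulness judge, and the augmentation probabilities are made-up values for illustration.

```python
import random

def augment(prompt: str) -> str:
    """Apply random character-level tweaks: shuffle letters inside words,
    flip capitalization, and introduce the occasional typo.
    (Illustrative probabilities, not the study's exact parameters.)"""
    out = []
    for word in prompt.split():
        chars = list(word)
        # Shuffle the inner letters of longer words.
        if len(chars) > 3 and random.random() < 0.3:
            inner = chars[1:-1]
            random.shuffle(inner)
            chars = [chars[0]] + inner + [chars[-1]]
        # Randomly flip the case of each character.
        chars = [c.upper() if random.random() < 0.4 else c.lower() for c in chars]
        # Occasionally drop a character to simulate a misspelling.
        if len(chars) > 1 and random.random() < 0.1:
            chars.pop(random.randrange(len(chars)))
        out.append("".join(chars))
    return " ".join(out)

def bon_jailbreak(prompt: str, query_model, is_harmful, n_samples: int = 10_000):
    """Best-of-N loop: keep sampling augmented prompts until the model's
    response is judged harmful or the sampling budget runs out."""
    for _ in range(n_samples):
        candidate = augment(prompt)
        response = query_model(candidate)   # hypothetical model API call
        if is_harmful(response):            # hypothetical harmfulness judge
            return candidate, response
    return None, None
```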
The study’s findings indicate that the method can achieve an attack success rate of over 50% against a range of LLMs, including models from Anthropic, OpenAI, Google, and Meta. The researchers also explored the effectiveness of BoN Jailbreaking on other modalities, such as speech- and image-based prompts. By modifying the speed, pitch, and volume of audio, or by adding noise or music to it, they successfully bypassed safeguards on some LLMs.
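As a rough illustration of what such audio augmentations might look like, the snippet below perturbs a raw waveform with a random speed change, a gain change, and background noise. The ranges are assumptions for illustration rather than the values used in the study, and the naive resampling shifts pitch along with speed when the result is played back at the original sample rate.

```python
import numpy as np

def augment_audio(waveform: np.ndarray) -> np.ndarray:
    """Randomly perturb an audio prompt (illustrative ranges only)."""
    # Speed change via naive linear-interpolation resampling; at the original
    # sample rate this also shifts the pitch.
    speed = np.random.uniform(0.8, 1.2)
    new_index = np.arange(0, len(waveform), speed)
    waveform = np.interp(new_index, np.arange(len(waveform)), waveform)

    # Volume change (random linear gain).
    waveform = waveform * np.random.uniform(0.5, 1.5)

    # Additive background noise.
    waveform = waveform + np.random.normal(0.0, 0.01, size=waveform.shape)
    return np.clip(waveform, -1.0, 1.0)
```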
Similarly, changing font styles, adding background colors, or altering the size and position of text in image-based inputs allowed them to evade guardrails (a rough sketch follows below). The study’s authors acknowledge that while their research aims to inform better defense mechanisms against these manipulation techniques, the findings highlight the ongoing need for robust safeguards in AI systems.
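To make the image-based variant concrete, here is a minimal sketch that renders a text prompt onto a canvas with a random background color, font size, and position, using Pillow. The ranges and the DejaVuSans.ttf font are assumptions for illustration, not the study’s exact setup.

```python
import random
from PIL import Image, ImageDraw, ImageFont

def render_text_prompt(prompt: str, width: int = 512, height: int = 512) -> Image.Image:
    """Render a text prompt as an image with a random background color,
    font size, and text position (illustrative augmentations only)."""
    background = tuple(random.randint(0, 255) for _ in range(3))
    img = Image.new("RGB", (width, height), background)
    draw = ImageDraw.Draw(img)

    # Random font size; fall back to Pillow's default font if the
    # assumed TrueType file is not available on this system.
    font_size = random.randint(16, 48)
    try:
        font = ImageFont.truetype("DejaVuSans.ttf", font_size)
    except OSError:
        font = ImageFont.load_default()

    # Random position in the upper-left third of the canvas, with a
    # rough complementary color so the text stays legible.
    x = random.randint(0, width // 3)
    y = random.randint(0, height // 3)
    foreground = tuple(255 - c for c in background)
    draw.text((x, y), prompt, fill=foreground, font=font)
    return img
```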
Previous studies have demonstrated how users can exploit similar loopholes to bypass moderation measures on platforms like Microsoft’s Designer AI image generator and ElevenLabs’ automated audio generation tools. While some of these vulnerabilities have since been closed, new exploits have emerged, underscoring the importance of continuous monitoring and improvement in AI safety.
The research emphasizes the importance of proactive efforts to develop more sophisticated defense mechanisms against manipulation techniques like BoN Jailbreaking. By generating extensive data on successful attack patterns, Anthropic hopes to open up novel opportunities for improving AI security and mitigating harm caused by malicious exploitation.