When ChatGPT became available to the public, most of us probably had some fun trying to make it say things it wasn’t intended to say: hateful comments, advice about illegal activities, or content that violates copyright. Although most large language models (LLMs) are designed to refuse unethical queries, it wasn’t hard to come up with ways to bypass these safeguards.

Researchers call these kinds of attacks against LLMs “jailbreaks”. Alex Polyakov, CEO of security firm Adversa AI, has come up with multiple jailbreaks, one of which worked universally on all major LLMs. He first asked the model to play a game involving two characters, Tom and Jerry. Tom would be given a topic to talk about, while Jerry would be given the subject that topic refers to. For example, Tom gets the word “production” and Jerry gets the word “meth”. Each character then had to add one word to the conversation at a time. This simple setup led the models to explain the process of meth production, something they clearly weren’t supposed to talk about.

Many of us, while playing with ChatGPT at the time, didn’t realize how serious a security concern this issue could become. As LLMs are integrated into more and more systems, jailbreaks could start exposing personal data and creating serious security threats. Although LLMs have generally been growing more resilient to jailbreaks, it is still far from impossible to find one that works. With these concerns in mind, what steps do you think should be taken to ensure the security of AI models in the future?

The hacking of ChatGPT

Credits: Wired.com