Anthropic researchers wear down AI ethics with repetitive questions

How do you get an AI to answer questions it isn't supposed to answer? There are many such "jailbreak" techniques, and Anthropic researchers have just discovered a new one: a large language model can be convinced to tell you how to build a bomb if you first prime it with a few dozen less harmful questions.

They call the approach "many-shot jailbreaking," have written a paper about it, and have informed their peers in the AI community so that the vulnerability can be mitigated.

The vulnerability is a new one, resulting from the larger "context window" of the latest generation of LLMs. That's the amount of data they can hold in what you might call short-term memory, which was once just a few sentences but is now thousands of words and even entire books.
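To make the scale of that change concrete, here is a minimal sketch of checking whether a piece of text fits in a model's context window. The window sizes and the 4-characters-per-token heuristic are illustrative assumptions, not figures from the paper.

```python
# Rough sketch (not Anthropic's code): estimating whether text fits in a
# model's context window, using a common ~4-characters-per-token heuristic.
# The window sizes below are illustrative assumptions, not exact figures.

def estimate_tokens(text: str) -> int:
    """Very rough token estimate; real tokenizers differ."""
    return max(1, len(text) // 4)

CONTEXT_WINDOWS = {
    "older_model": 2_048,           # roughly a few pages of text
    "long_context_model": 200_000,  # enough room for entire books
}

def fits(text: str, model: str) -> bool:
    return estimate_tokens(text) <= CONTEXT_WINDOWS[model]

book = "word " * 150_000  # ~150k words, on the order of a long novel
print(fits(book, "older_model"))         # False
print(fits(book, "long_context_model"))  # True
```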

Anthropic's researchers found that models with large context windows tend to perform better on many tasks when there are lots of examples of that task in the prompt. So if there are lots of trivia questions in the prompt (or in a priming document, such as a big list of trivia the model holds in context), the answers actually get better over time. A fact the model might get wrong if asked as the first question, it may well get right as the hundredth question. A sketch of how such a prompt is assembled follows.
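The sketch below shows the basic shape of a "many-shot" prompt for a benign task: dozens of question/answer examples packed into the context ahead of the real question, so the model picks up the pattern via in-context learning. The trivia pairs and the commented-out send_to_model() call are placeholders, not a real API.

```python
# Minimal sketch of many-shot prompt assembly for a harmless trivia task.
# In practice, dozens or hundreds more Q/A pairs would fill the context window.

trivia_examples = [
    ("What is the capital of France?", "Paris"),
    ("Which planet is known as the Red Planet?", "Mars"),
    # ... many more pairs ...
]

def build_many_shot_prompt(examples, final_question):
    """Concatenate example Q/A pairs, then append the question we actually care about."""
    shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in examples)
    return f"{shots}\n\nQ: {final_question}\nA:"

prompt = build_many_shot_prompt(trivia_examples, "Who wrote 'Dune'?")
# send_to_model(prompt)  # hypothetical call; accuracy tends to rise as more shots are added
print(prompt)
```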

But in an unexpected extension of this "in-context learning," the models also get "better" at answering inappropriate questions. So if you ask it to build a bomb right away, it will refuse. But if you first ask it to answer 99 other, less harmful questions and then ask it how to build a bomb… it's far more likely to comply.

Image Source: Anthropic

Why does this work? No one really understands what goes on in an LLM's jumble of weights, but there is clearly some mechanism that lets it home in on what the user wants, as evidenced by the content in the context window. If the user wants trivia, the model seems to gradually activate more of its latent trivia ability as you ask dozens of questions. And, for whatever reason, the same thing happens when users ask for dozens of inappropriate answers.

The team has notified peers and even competitors about the attack, hoping to “foster a culture where LLM providers and researchers openly share such vulnerabilities.”

As for their own mitigation, the team found that while limiting the context window helps, it also hurts the model's performance. That won't do – so instead they classify and contextualize queries before they reach the model (a sketch of that idea follows). Of course, that just gives you a different model to fool… but at this stage, moving goalposts in AI security are to be expected.
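Here is a hedged sketch of that mitigation pattern: a separate classification step inspects the incoming prompt before it ever reaches the main model. The classify_prompt() heuristic and the blocked-topic list are stand-ins for illustration; Anthropic has not published its actual classifier.

```python
# Toy "safety layer" sketch: classify the query before passing it to the main model.
# Everything here is a simplified assumption, not Anthropic's implementation.

BLOCKED_TOPICS = ("build a bomb", "synthesize a nerve agent")  # illustrative list only

def classify_prompt(prompt: str) -> str:
    """Flag prompts whose final question touches a blocked topic."""
    final_question = prompt.rsplit("Q:", 1)[-1].lower()
    if any(topic in final_question for topic in BLOCKED_TOPICS):
        return "unsafe"
    return "safe"

def call_main_model(prompt: str) -> str:
    # Placeholder for the real LLM call.
    return "(model response)"

def guarded_query(prompt: str) -> str:
    """Route the prompt through the classifier before the main model sees it."""
    if classify_prompt(prompt) == "unsafe":
        return "Request refused by the safety layer."
    return call_main_model(prompt)
```

In practice the classifier would itself be a model rather than a keyword list, which is exactly why the article notes this just gives an attacker a different model to fool.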
