

Naughty, naughty AIs.

New Scientist:

AI models can trick each other into disobeying their creators and providing banned instructions for making methamphetamine, building a bomb or laundering money, suggesting that the problem of preventing such AI “jailbreaks” is more difficult than it seems.

Many publicly available large language models (LLMs), such as ChatGPT, have hard-coded rules that aim to prevent them from exhibiting racist or sexist bias, or from answering questions with illegal or problematic answers – behaviours they have learned from humans via training data scraped from the internet. But that hasn’t stopped people from finding carefully designed prompts, known as “jailbreaks”, that circumvent these protections and convince AI models to disobey the rules.

Now, Arush Tagade at Leap Laboratories and his colleagues have gone one step further by streamlining the process of discovering jailbreaks. They found they could simply instruct one LLM, in plain English, to convince other models, such as GPT-4 and Anthropic’s Claude 2, to adopt a persona able to answer questions the base model has been programmed to refuse. This process, which the team calls “persona modulation”, involves the models conversing back and forth, with humans in the loop analysing the responses.

Read the Rest | Archive Link

