

Naughty, naughty AIs.

New Scientist:

AI models can trick each other into disobeying their creators and providing banned instructions for making methamphetamine, building a bomb or laundering money, suggesting that the problem of preventing such AI “jailbreaks” is more difficult than it seems.

Many publicly available large language models (LLMs), such as ChatGPT, have hard-coded rules that aim to prevent them from exhibiting racist or sexist bias, or from answering questions with illegal or problematic answers – behaviours they have learned from humans via training data scraped from the internet. But that hasn’t stopped people from finding carefully designed prompts, known as “jailbreaks”, that circumvent these protections and convince AI models to disobey the rules.

Now, Arush Tagade at Leap Laboratories and his colleagues have gone one step further by streamlining the process of discovering jailbreaks. They found they could simply instruct one LLM, in plain English, to convince other models, such as GPT-4 and Anthropic’s Claude 2, to adopt a persona able to answer questions the base model has been programmed to refuse. This process, which the team calls “persona modulation”, involves the models conversing back and forth, with humans in the loop analysing the responses.

Read the Rest | Archive Link

