Dylan Matthews in Vox:
The scientists want the AI to lie to them.
That’s the goal of the project Evan Hubinger, a research scientist at Anthropic, is describing to members of the AI startup’s “alignment” team in a conference room at its downtown San Francisco offices. Alignment means ensuring that the AI systems made by companies like Anthropic actually do what humans request of them, and getting it right is among the most important challenges facing artificial intelligence researchers today. Hubinger, speaking via Google Meet to an in-person audience of 20- and 30-something engineers on variously stickered MacBooks, is working on the flip side of that research: creating a system that is purposely deceptive, one that lies to its users, and using it to see what kinds of techniques can quash this behavior. If the team finds ways to prevent deception, that’s a gain for alignment.
What Hubinger is working on is a variant of Claude, a highly capable text model that Anthropic made public last year and has been gradually rolling out since. Claude is very similar to the GPT models put out by OpenAI — hardly surprising, given that all seven of Anthropic’s co-founders worked at OpenAI, often in high-level positions, before launching their own firm in 2021. Its most recent iteration, Claude 2, was released on July 11 and is available to the general public, whereas the first Claude was only available to select users approved by Anthropic.