Tomas Pueyo, writing in Uncharted Territories, quoting Nathan Labenz:
I determined that GPT-4 was approaching human expert performance. Critically, it was also *totally amoral*. It did its absolute best to satisfy the user’s request – no matter how deranged or heinous the request! One time, when I role-played as an anti-AI radical who wanted to slow AI progress… GPT-4-early suggested the targeted assassination of leaders in the field of AI – by name, with reasons for each.
Today, most people have only used more “harmless” models, trained to refuse certain requests. This is good, but I wish more people had experienced “purely helpful” AI – it makes viscerally clear that alignment / safety / control do not happen by default.
The Red Team project that I participated in did not suggest that they were on track to achieve the level of control needed. Without safety advances, the next generation of models might very well be too dangerous to release.
More here.