As AI’s power grows, charting its inner world is becoming more crucial

Edd Gent in Singularity Hub:

Older computer programs were hand-coded using logical rules. But neural networks learn skills on their own, and the way they represent what they’ve learned is notoriously difficult to parse, leading people to refer to the models as “black boxes.”

Progress is being made, though, and Anthropic is leading the charge.

Last year, the company showed that it could link activity within a large language model to both concrete and abstract concepts. In a pair of new papers, it’s demonstrated that it can now trace how the models link these concepts together to drive decision-making and has used this technique to analyze how the model behaves on certain key tasks.

“These findings aren’t just scientifically interesting—they represent significant progress towards our goal of understanding AI systems and making sure they’re reliable,” the researchers write in a blog post outlining the results.

The Anthropic team carried out their research on the company’s Claude 3.5 Haiku model, its smallest offering. In the first paper, they trained a “replacement model” that mimics the way Haiku works but replaces internal features with ones that are more easily interpretable.
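For readers wondering what a “replacement model” looks like in practice, the general idea is to train a wide, sparsely activating feature layer to imitate part of the frozen network, so that the features doing the work can be inspected one at a time. The sketch below is a minimal, hypothetical illustration of that general approach in PyTorch; it is not Anthropic’s actual method or code, and all names, dimensions, and the training step are assumptions.

```python
# Hypothetical sketch: train a sparse, interpretable feature layer to imitate
# one block of a frozen language model. Not Anthropic's code; sizes are made up.
import torch
import torch.nn as nn

d_model, d_features = 512, 4096   # assumed hidden size and (wider) feature count

class InterpretableReplacement(nn.Module):
    """Maps a block's input to its output through a wide, sparse feature layer."""
    def __init__(self):
        super().__init__()
        self.encode = nn.Linear(d_model, d_features)
        self.decode = nn.Linear(d_features, d_model)

    def forward(self, x):
        features = torch.relu(self.encode(x))   # nonnegative, encouraged to be sparse
        return self.decode(features), features

def train_step(replacement, optimizer, block_input, block_output, l1_coeff=1e-3):
    """One step: match the frozen block's output while keeping features sparse."""
    recon, features = replacement(block_input)
    loss = nn.functional.mse_loss(recon, block_output) \
           + l1_coeff * features.abs().mean()           # sparsity penalty
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Usage, with random tensors standing in for activations that would normally be
# captured from the frozen model (e.g. via forward hooks):
replacement = InterpretableReplacement()
optimizer = torch.optim.Adam(replacement.parameters(), lr=1e-4)
block_input = torch.randn(32, d_model)
block_output = torch.randn(32, d_model)
train_step(replacement, optimizer, block_input, block_output)
```

The sparsity penalty is what makes the substitute easier to read than the original: on any given input only a handful of features fire, and each can then be examined for the concept it appears to track.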

More here.
