Humanity’s Last Exam Stumps Top AI Models—and That’s a Good Thing

Shelly Fan in Singularity Hub:

How do you translate a Roman inscription found on a tombstone? How many pairs of tendons are supported by one bone in hummingbirds? Here is a chemical reaction that requires three steps: What are they? Based on the latest research on Tiberian pronunciation, identify all syllables ending in a consonant sound from this Hebrew text.

These are just a few example questions from the latest attempt to measure the capability of large language models. These algorithms power ChatGPT and Gemini. They’re getting “smarter” in specific domains—math, biology, medicine, programming—and developing a sort of common sense. Like the dreaded standardized tests we endured in school, researchers have long relied on benchmarks to track AI performance. But as cutting-edge algorithms now regularly score over 90 percent on such tests, older benchmarks are increasingly becoming obsolete.

More here.

Enjoying the content on 3QD? Help keep us going by donating now.