Making Progress on Beavers, AI, and Math

by Jonathan Kujawa

Nearly two years ago we talked at 3QD about the Busy Beaver problem [1]. Since then, the beavers have been busy.

As discussed in that essay, the Busy Beaver problem measures how complicated a computation can be. It does so by measuring how long a Turing machine might run before stopping.

A Turing machine is a theoretical model for a computer. It has a finite set of states that tell the machine what to do next: what to write on its tape, which direction to move, and which state to enter. The states are the “software” of the Turing machine.

In a real computer, the more complicated your software, the more you can do. The same goes for a Turing machine: the more states, the more the machine can do. If you allow me to use a Turing machine with 744 states, I can build a machine that halts if and only if the Riemann hypothesis is false. If you could run it long enough to rule out halting, you would have proven the Riemann hypothesis. Since resolving the Riemann hypothesis is arguably the most famous problem in mathematics and worth a cool $1,000,000, there must be some reason nobody has tried this approach. Indeed, there is a small problem. A Turing machine with 744 states can run a very, very long time.

That brings us to the Busy Beaver. The Busy Beaver of 744 is the maximum number of steps a Turing machine with 744 states might run before it finally stops. In particular, if our 744-state machine were to run for more than BB(744) steps, we would know it will never halt. For example, it is known that, at worst, a halting Turing machine with two states takes six steps before stopping. For short, we’ll write this as BB(2)=6. Somewhat worse, BB(4)=107, but that’s still not too bad. Never mind BB(744); how big could BB(5) be? Pretty darn big, it turns out!
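To make this concrete, here is a minimal Turing machine simulator in Python. It is an illustrative sketch of my own (the names and table encoding are not from the Busy Beaver Challenge), run on the standard two-state champion machine, which halts after exactly six steps:

```python
# A minimal Turing machine simulator (an illustrative sketch, not the
# Busy Beaver Challenge's code). The tape is a dictionary mapping
# positions to symbols; unwritten cells read as 0.

def run(machine, max_steps):
    """Run `machine` from a blank tape; return the number of steps taken,
    or None if it is still running after max_steps."""
    tape, pos, state, steps = {}, 0, "A", 0
    while state != "HALT":
        if steps >= max_steps:
            return None  # gave up -- the machine may never halt
        symbol = tape.get(pos, 0)
        write, move, state = machine[(state, symbol)]
        tape[pos] = write
        pos += 1 if move == "R" else -1
        steps += 1
    return steps

# The two-state champion: no halting two-state machine runs longer.
bb2_champion = {
    ("A", 0): (1, "R", "B"),
    ("A", 1): (1, "L", "B"),
    ("B", 0): (1, "L", "A"),
    ("B", 1): (1, "R", "HALT"),
}

print(run(bb2_champion, 1000))  # prints 6, i.e., BB(2)
```

Every other two-state machine either halts in six steps or fewer, or never halts at all. Telling those two cases apart is exactly what makes computing BB(n) hard.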

After 40 years, there was a breakthrough in computing Busy Beavers. On July 2nd, the Busy Beaver Challenge (BBC) announced that they had computed BB(5). We now know the maximum number of steps a five-state Turing machine can run before halting:

BB(5)=47,176,870.

In 1990, Heiner Marxen and Jürgen Buntrock found a five-state Turing machine that took that many steps before halting. The problem, though, is that there might be a five-state Turing machine that takes even longer. There are tens of trillions of five-state Turing machines. It is a massive task to test them all. Worse, many of those Turing machines will never halt. To compute BB(5), you must correctly identify the ones that halt and correctly compute the number of steps they take before they halt.
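Where do all those machines come from? Here is a quick back-of-the-envelope count in Python. This is a naive upper bound of my own; the BBC used symmetry arguments and other reductions to shrink the actual search space enormously:

```python
# Counting five-state, two-symbol Turing machines naively:
# each of the 5 x 2 = 10 table entries picks a symbol to write (2 ways),
# a direction to move (2 ways), and a next state, halting included (6 ways).
entries = 5 * 2
actions = 2 * 2 * (5 + 1)
print(actions ** entries)  # 63,403,380,965,376 -- about 63 trillion
```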

Many hands make light work. The internet makes it easy to collaborate with people all around the world. You could divvy up the Turing machines amongst a large group and have everyone verify their share. But sharing the work brings a new problem: trust.

As we all know, you can’t trust anything on the internet these days. Short of redoing all the work yourself, you need a systematic way of checking everyone’s contribution — as Ronald Reagan liked to say, доверяй, но проверяй (trust, but verify).

Image from [2].

This is what makes the computation of BB(5) so interesting. The BBC used the Coq proof verification software to double-check everyone’s work. Say some fellow named Abbas Raza gave a proof that a certain Turing machine never halts (and so could be eliminated from consideration in the computation of BB(5)). Rather than trust this mysterious Mr. Raza, the BBC could instead ask him to write his proof in code that Coq can read and have the software check his argument.
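To give a flavor of what this looks like, here is a toy machine-checked proof in Lean, a proof assistant in the same family as Coq. It is purely illustrative and has nothing to do with the actual BB(5) computation; the point is that the software accepts a theorem only if every step of the proof checks out:

```lean
-- A toy machine-checked proof (illustrative only; not from the BB(5) work):
-- there is no largest natural number. Lean accepts the theorem only if
-- the proof term really does establish the claim.
theorem no_largest_nat : ∀ n : Nat, ∃ m : Nat, n < m :=
  fun n => ⟨n + 1, Nat.lt_succ_self n⟩
```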

The Busy Beaver Challenge is part of a new computer-assisted era in mathematical research. Coq, Lean, and other theorem verification software packages now play a bigger role — especially in large collaborative projects like the BBC. Another example is the Liquid Tensor Experiment we talked about here at 3QD.

Perhaps more interesting is when computers assist us on the creative side of mathematical research. In this 3QD essay, we saw several examples of how deep learning was used to reveal previously unnoticed connections and structures.

In recent years, Kevin Buzzard has been leading the push to bring software like Lean into mainstream mathematics. He recently lectured on the current state of AI in mathematical research. He is enthusiastic but also skeptical, especially about the notion that AI is about to replace humans.

Infamously, AI is error-prone and inscrutable. AI is a prediction engine; it doesn’t reason, and it doesn’t understand. But if it predicts the right answer 95% of the time, that is a good start. A human can then interrogate the AI in various ways to gain intuition about how the AI generates its answers and gain insight into the problem itself.

But to these models, mathematics is a foreign language. For AI to be useful in mathematical research, we need models designed with mathematics in mind. At the moment, AI is especially bad at mathematical reasoning, but people are working to change that.

The AI Mathematical Olympiad (AIMO) is a competition to design AI systems that can perform well on math olympiad-style problems. The first round of the competition recently concluded, and 1,161 teams participated.

Math olympiads are known for having challenging problems that frequently use math not taught in school. To solve these problems, AI systems need to go beyond whatever they’ve scraped off the internet. To ensure this is the case, the AIMO provided 110 problems that had never been used in competition; since they’re brand new, they can’t be lurking in any model’s training data. Ten problems were provided for training, fifty were provided so teams could test their systems, and a final fifty were held in reserve for the final rankings.

To spur on the teams, XTX Markets, a financial services company, put up $10,000,000 in prize money. The grand prize of $5,000,000 will go to the designers of the first computer system that can earn a gold medal at the International Math Olympiad. The other $5,000,000 will be used for prizes marking partial progress.

In this first round, the questions were written at the American Invitational Mathematics Exam (AIME) level. The AIME is one of the exams American students can take to qualify for the US Math Olympiad team. It is a 15-problem, 3-hour exam, and a typical score is 6/15, or 40%.

How did the AI models do? The leading team’s AI got 29/50, or 58%, and a cluster of runners-up scored 20 or 21 out of 50. While not yet at the level of an IMO gold medalist, these systems are already comparable to mathematically talented high school students [3]. This is particularly impressive given the effort to ensure the problems were novel and unknown to the AI.

While mathematicians will not be out of a job anytime soon, it wouldn’t surprise me if using AI and other software tools becomes commonplace by the end of the decade. It will be hard to pass up a real advantage in writing code, making conjectures, and verifying proofs. Like performance-enhancing drugs in the Tour de France, if you want to keep up, it’ll be hard to just say no.

***

[1] Long before I knew I’d become a Beaver myself!

[2] Image borrowed from here.

[3] The sort of students who volunteer to take a three-hour math exam where the prize is more math exams!