by Jon Kujawa
Emmanuel Candes of Stanford University was a plenary speaker at the recent International Congress of Mathematicians. By chance I saw Candes speak several years ago at an American Mathematical Society meeting. Truthfully, Candes's work is miles away from my own and I might not have seen him speak if it weren't for the wine reception scheduled immediately after his talk! That was my great fortune: Candes is an excellent speaker and is doing truly remarkable mathematics.
Indeed Candes and his collaborators are doing something very exciting which has immediate practical applications and endless future ones.
Here is the essence of the problem:
Imagine you have an array of data. That is, imagine you have a spreadsheet with a series of rows of numbers, one number per column. Now imagine that some of those numbers are missing. Perhaps the data has been lost, or perhaps you never had it in the first place. Can you recover the missing data?
Put like that, the answer is certainly no. If the numbers are random, then even knowing all but one of them it will be impossible to recover the last missing number. Candes and his collaborators have shown, however, that the real world isn't random and in fact we can often reconstruct the missing data.
Let me give an example where you can see that recovering real world data is sometimes possible. Imagine that you have a huge data set in which each row of numbers is the biometric data for a single person. The first number is their weight, the second their height, the third their age, the fourth their blood cholesterol level, the fifth their shoe size, and so on. If you accidentally delete the weight of the 2,381,773rd person in your data set, you can certainly make an accurate estimate of the missing weight by comparing person number 2,381,773 with others who have similar heights, ages, etc.
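To make that concrete, here is a minimal sketch of that kind of estimate. The numbers are made up and the method is a crude nearest-neighbor average; a real system would normalize the columns and use far more people.

```python
import numpy as np

# Hypothetical biometric array: one row per person, with columns
# [weight (kg), height (cm), age (yr), cholesterol (mg/dL), shoe size (EU)].
data = np.array([
    [82.0, 180.0, 45.0, 190.0, 44.0],
    [63.0, 165.0, 31.0, 170.0, 39.0],
    [95.0, 185.0, 52.0, 210.0, 46.0],
    [70.0, 172.0, 38.0, 180.0, 42.0],
    [np.nan, 178.0, 47.0, 195.0, 44.0],  # the person whose weight was deleted
])

missing_row, missing_col = 4, 0  # which entry is missing

# Compare the incomplete row with the complete rows using the columns we
# still have, then average the weights of the two most similar people.
known_cols = [c for c in range(data.shape[1]) if c != missing_col]
others = np.delete(data, missing_row, axis=0)
dists = np.linalg.norm(others[:, known_cols] - data[missing_row, known_cols], axis=1)
closest = others[np.argsort(dists)[:2], missing_col]
print("estimated weight:", closest.mean())
```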
That was easy, of course, because you have nearly all the data at hand for working out the missing bit. Candes does something much more remarkable, though. He shows that under reasonable assumptions you can usually recover the entire array of data even if nearly half of it is missing! Not only that, but he and his collaborators give us the tools to calculate how much data can be missing and how close we can get to a perfect reconstruction.
Does this so-far-imaginary problem actually occur in real life? You bet! Once you start looking, you'll find it everywhere. Let me mention a few examples.
If you are Netflix, Amazon, etc., then your bread and butter is your recommendation engine. A major reason people go to your website instead of a competitor's is that you do a better job suggesting things they might like. For example, Netflix has a massive database with one row for each movie watcher and one column for each movie. For a given person and movie, say Abbas Raza and Footloose [2], Netflix would like a score, say between 1 and 10, of how much Abbas likes or would like Footloose.
Netflix currently has over 50 million users and 100,000+ movies. This means their array has 50 million rows, 100,000 columns, and altogether over 5,000,000,000,000 data points. Unfortunately for Netflix, each user has ranked only hundreds or perhaps thousands of movies, so the overwhelming majority of data values in Netflix's array are empty. In an ideal world Netflix would like to recover all those unknown movie rankings. They would love to know that Abbas will give Footloose an 8.7 once he watches it! So much so that Netflix paid a million dollars to the team that improved its existing algorithm by a mere 10%.
Or imagine that NASA or the ESA wishes to transmit images back to the Earth from a distant space probe. The picture is actually sent as a large array of numbers, one for each pixel in the image. The value of each number indicates the color of the corresponding pixel. It is easy to imagine that some of the data is lost in transmission. Or that for the sake of fast transmission they would like to only send a fraction of the data and then reconstruct the image back on Earth. The work of Candes tells you exactly how much of the data you can omit and still reconstruct the image with near perfect accuracy.
Similarly, if you're an oil company here in Oklahoma you'd like to know how little seismographic data you need to accurately map the oil and natural gas deposits underfoot. It's expensive and time consuming to gather data. It sure would be nice to only collect a fraction of the overall data set and be able to fill in the missing data from the comfort of your air-conditioned office. But to do that you need to know the risk of either missing a deposit or mistakenly predicting one which doesn't actually exist.
And we should talk about the example which got Candes started on the problem of data reconstruction. During an MRI scan the patient cannot move. In the case of children, the doctors put the patient under heavy sedation, which stops even their breathing. This is, of course, quite risky, and every second counts. It is highly desirable to minimize the amount of scanning while still obtaining a high-quality image. After all, you certainly don't want to miss a tumor (or mistakenly find one that isn't there) because of poor image quality.
Several years ago it was widely believed that there were fundamental limits to how little data could be gathered while still obtaining a high-quality image. Candes realized that MRI scan data satisfies some reasonable extra assumptions. These extra assumptions meant that you could actually do much better than those limits suggested. Consequently, children no longer have to be in the MRI scanner or under sedation for as long as previously thought. Less risk and lives saved!
As we'll see, the extra assumptions Candes discovered are quite reasonable and his techniques for data reconstruction apply to many, many situations. His work has led to an entirely new field of math called compressed sensing or compressive sampling.
A handy way to understand Candes's assumptions is to think about Battleship [3]. In Battleship terms, his assumptions simply ensure that you play smart and your opponent plays dumb.
The rules of Battleship are simple: your opponent arranges several ships on a grid you aren't allowed to see. You “fire” at grid locations one by one with your opponent telling you after each shot if you have hit or missed a ship. You win once you've sunk all your opponent's ships.
In the game, just as in the problems above, there is a grid, and each location in the grid carries one piece of data: ship or no ship. And just as in the problems above, you start out not knowing the data and want to determine it as quickly and accurately as possible. With that in mind, here are Candes's assumptions:
- The data array should be low rank. This means that there is a very high level of interdependency among the entries of the array. If the entries are random or nearly so, then you have no chance of recovering the missing data. In Battleship low rank means that when you strike a hit on a ship, you know that the rest of the ship is nearby. A smart opponent would chop their ships into small pieces and scatter them about the board so that each time you hit a ship you would still know nothing about the location of the rest of that ship. The low rank assumption forbids this strategy by your opponent.
- The rows and columns should be incoherent. Roughly speaking, the problem is that an array could have a lot of structure and so be low rank, but have all of that structure aligned in the rows or aligned in the columns. If you have an image where each horizontal line is randomly colored all black or all white, then this data is low rank but it is all aligned in the rows. If you are missing an entire row, there is no way to determine it from the rest of the image. In Battleship, this is when your opponent implements the standard strategy of putting all their ships in a single row or column. Unless you happen to select that row, every shot you make puts you hardly any closer to finding your opponent's ships. The incoherence assumption forbids this strategy. (There is a small sketch illustrating these first two assumptions right after this list.)
- Finally, the data you actually know should be a random sample drawn from across the entire array. If you only know the left half of an image, there is no way to reconstruct the right half. But if you know every other pixel in an image, you can probably make out the overall image. In Battleship a terrible strategy is to systematically fire at every position in the first row, then every position in the second row, and so on. Candes's third assumption is that the data you have on hand was gathered the way a smart Battleship player gathers it: by firing randomly at locations scattered across the entire board.
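Here is the small sketch promised above: a toy illustration, with made-up numbers, of what the first two assumptions look like in an actual array. The first matrix has its interdependencies spread across rows and columns; the second is also low rank, but all of its structure sits in the rows, which is exactly the situation the incoherence assumption rules out.

```python
import numpy as np

# Low rank with structure spread out: person i's score for movie j is
# taste[i] * appeal[j], so every row is a multiple of every other row.
# Knowing a few entries pins down a great deal about the rest.
taste = np.array([1.0, 0.5, 2.0, 1.5])         # hypothetical per-person factor
appeal = np.array([3.0, 8.0, 5.0, 1.0, 6.0])   # hypothetical per-movie factor
spread_out = np.outer(taste, appeal)
print(np.linalg.matrix_rank(spread_out))        # prints 1

# Also low rank, but aligned with the rows: each row is all 1s or all 0s
# (the all-black or all-white horizontal lines of the image example).
# If an entire row is missing, the other rows tell you nothing about it.
row_aligned = np.array([[1.0] * 5, [0.0] * 5, [1.0] * 5, [0.0] * 5])
print(np.linalg.matrix_rank(row_aligned))       # also prints 1
```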
In short, Candes and his collaborators tell us that if a data set has a lot of interdependencies, and those interdependencies are well distributed across the array, then given a good random sampling of its entries they can reconstruct the entire data set with a high degree of accuracy.
When you think about it, it is easy to see that Candes's assumptions are often satisfied in the real world [4]. For example, a NASA image or an MRI scan has a lot of structure (it is low rank), and that structure is well spread out across the image (i.e. it doesn't align itself with the rows or columns). So Candes and Co.'s algorithms allow us to reconstruct the image starting with surprisingly little data.
I should emphasize that Candes's algorithms are practical and usable in real-world applications. The reconstruction is done by actual, fast algorithms which can often be run on an ordinary desktop computer. Furthermore, Candes gives precise information about how much data is needed to accurately recover what is missing. And in some circumstances this can be as little as 40% of the data!
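To give a flavor of how such a reconstruction can run on a desktop, here is a toy sketch. It is not Candes's actual algorithm: it cheats by assuming we know the rank in advance, and it simply alternates between forcing the guess to be low rank and forcing it to agree with the entries we observed. But it illustrates the phenomenon: a random 40% of a low-rank array typically suffices to recover the whole thing to high accuracy.

```python
import numpy as np

rng = np.random.default_rng(0)

# A 100 x 100 array of rank 5: lots of interdependency among its entries.
full = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 100))

# Keep a random 40% of the entries and throw the rest away.
mask = rng.random(full.shape) < 0.4
observed = np.where(mask, full, 0.0)

# Alternate between (a) projecting the guess onto rank-5 matrices by keeping
# only its 5 largest singular values, and (b) restoring the known entries.
X = observed.copy()
for _ in range(200):
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    X = (U[:, :5] * s[:5]) @ Vt[:5]   # best rank-5 approximation of the guess
    X[mask] = full[mask]              # the observed entries are not negotiable

print("relative error:", np.linalg.norm(X - full) / np.linalg.norm(full))
```

The real algorithms do not need to be told the rank in advance; they achieve the same low-rank effect by minimizing the nuclear norm described next.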
The last thing I wanted to mention about Candes's work is that he, too, is taking the point of view I mentioned in my previous 3QD post on the ICM. He views each possible data array as a single point in some very large space and he thinks about measuring distance in this space using something called the nuclear norm. His algorithm solves the data recovery problem by minimizing this distance in a suitable way. Once again geometry is used to solve a seemingly unrelated problem!
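For readers who want to see a formula, the basic matrix completion problem Candes studies can be written as the convex program

\[
\min_{X} \ \|X\|_{*} \quad \text{subject to} \quad X_{ij} = M_{ij} \ \text{for every known entry } (i,j) \in \Omega,
\]

where M is the true array, Ω is the set of positions where we actually have data, and the nuclear norm \(\|X\|_{*}\) is the sum of the singular values of X. The nuclear norm acts as a convex stand-in for the rank, so minimizing it steers the solution toward the low-rank, highly interdependent arrays described above.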
As I mentioned at the beginning, Candes gives excellent talks. I highly recommend watching his ICM talk (here) if you'd like to learn more about his work.
[2] The Kevin Bacon version, of course.
[3] The boardgame, not the movie. The movie is appallingly bad and even the open-minded Abbas would give it a low Netflix ranking.
[4] It turns out that Mother Nature is a fair minded Battleship opponent.