From recognizing speech to identifying unusual stars, new discoveries often begin with comparing data streams to find connections and spot outliers. But simply feeding raw data into a data-analysis algorithm is unlikely to produce meaningful results, say the authors of a new Cornell study. That’s because most data-comparison algorithms today share one major weakness: somewhere, they rely on a human expert to specify which aspects of the data are relevant for comparison and which aren’t. These experts can’t keep up with the growing volume and complexity of big data. So the Cornell computing researchers have come up with a new principle they call “data smashing” for estimating the similarities between streams of arbitrary data without human intervention, and even without access to the data sources.
How ‘data smashing’ works
Data smashing is based on a new way of comparing data streams. The process involves two steps:

- The two data streams are algorithmically “smashed” together to “annihilate” each other’s information.
- The process then measures what information survives the collision. The more information that remains, the less likely it is that the streams originated from the same source.
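The two steps above can be sketched in a toy form. The sketch below is not the researchers’ actual algorithm (which handles temporal structure in the streams); it is a simplified illustration that treats each stream as independent draws. All function names and parameters here are invented for illustration: quantize both streams into symbols, build an “anti-stream” of one (symbol frequencies inverted, so colliding it with a same-source stream leaves featureless noise), collide it with the other by keeping only agreeing symbols, and report how far the surviving symbols deviate from flat noise.

```python
import numpy as np

rng = np.random.default_rng(0)

def smash(a, b, n=4):
    """Toy data-smashing sketch: returns residual information in bits.
    Near 0 => the collision annihilated everything => likely same source."""
    # Step 0: quantize both real-valued streams into n symbols using
    # shared bin edges (a stand-in for the quantization the method needs).
    edges = np.quantile(np.concatenate([a, b]), np.linspace(0, 1, n + 1)[1:-1])
    sa, sb = np.digitize(a, edges), np.digitize(b, edges)

    # Step 1a: build an i.i.d. "anti-stream" of b, with symbol
    # probabilities inversely proportional to b's observed frequencies.
    q = (np.bincount(sb, minlength=n) + 1) / (len(sb) + n)  # smoothed
    inv = (1.0 / q) / (1.0 / q).sum()
    anti = rng.choice(n, size=len(sa), p=inv)

    # Step 1b: "smash" the streams: keep only positions where the
    # symbols agree (the surviving symbols follow p(s)/q(s), up to
    # normalization, so same-source streams leave flat noise).
    residual = sa[sa == anti]

    # Step 2: measure the information that survived the collision, as
    # the KL divergence of the residual distribution from uniform noise.
    r = np.bincount(residual, minlength=n) / max(len(residual), 1)
    nz = r[r > 0]
    return float(np.sum(nz * np.log2(nz * n)))
```

As a usage example, smashing two streams drawn from the same distribution yields a value near zero, while streams from different distributions leave measurably more residual information:

```python
x = rng.standard_normal(200_000)
y = rng.standard_normal(200_000)        # same source as x
z = rng.standard_normal(200_000) ** 3   # different source
print(smash(x, y))  # close to 0
print(smash(x, z))  # noticeably larger
```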
Data-smashing principles could open the door to understanding increasingly complex observations, especially when experts don’t know what to look for, according to the researchers.