Over at Edge, a number of thinkers take up this question. Jon Kleinberg, Professor of Computer Science at Cornell University, writes:
“How can we have this much data and still not understand collective human behavior?”
If you want to study the inner workings of a giant organization distributed around the globe, here are two approaches you could follow—each powerful, but very different from the other. First, you could take the functioning of a large multinational corporation as your case study, embed yourself within it, watch people in different roles, and assemble a picture from these interactions. Alternatively, you could do something very different: take the production of articles on Wikipedia as your focus, and download the site's complete edit-by-edit history back to the beginning: every revision to every page, and every conversation between two editors, time-stamped and labeled with the people involved. Whatever happened in the sprawling organization that we think of as Wikipedia—whatever process of distributed self-organization on the Internet it took to create this repository of knowledge—a reflection of it should be present and available in this dataset. And you can study it down to the finest resolution without ever getting up from your couch.
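The revision history Kleinberg describes really is a few HTTP requests away. As a minimal sketch, here is how one might query the public MediaWiki API for a page's recent revisions—timestamps, editors, and edit summaries. The endpoint and parameters are the standard MediaWiki `action=query` interface; the page title and batch size are illustrative choices, and the complete edit-by-edit history is more practically obtained from Wikimedia's published database dumps.

```python
# Sketch: asking the public MediaWiki API for a page's revision history.
# Uses only the standard library; the page title below is illustrative.
import json
import urllib.parse
import urllib.request

API_URL = "https://en.wikipedia.org/w/api.php"

def build_revision_query(title, limit=50):
    """Build a MediaWiki API URL requesting revision metadata for one page."""
    params = {
        "action": "query",          # standard read-only query action
        "prop": "revisions",        # we want the revision list
        "titles": title,
        "rvprop": "timestamp|user|comment",  # when, who, and the edit summary
        "rvlimit": str(limit),      # number of revisions per request
        "format": "json",
    }
    return API_URL + "?" + urllib.parse.urlencode(params)

def fetch_revisions(title, limit=50):
    """Fetch one batch of revisions; returns a list of revision dicts."""
    with urllib.request.urlopen(build_revision_query(title, limit)) as resp:
        data = json.load(resp)
    # The API keys pages by internal page ID; unpack the single result.
    (page,) = data["query"]["pages"].values()
    return page.get("revisions", [])
```

A call like `fetch_revisions("Wikipedia", limit=5)` returns the five most recent edits to that article; walking further back through history is a matter of following the API's continuation tokens, and editor-to-editor conversations live on the corresponding Talk pages, retrievable the same way.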
These Wikipedia datasets—and many other sources like them—are completely public; the same story plays out with restricted access if you're a data scientist at Facebook, Amazon, Google, or any of a number of other companies: every conversation within people's interlocking social circles, every purchase, every expression of intent or pursuit of information. And carried along in the wake of this hurricane of digital records comes a simple question: How can we have this much data and still not understand collective human behavior?
There are several issues implicit in a question like this. To begin with, it's not about having the data, but about the ideas and computational follow-through needed to make use of it—a distinction that seems particularly acute with massive digital records of human behavior. When you personally embed yourself in a group of people to study them, much of your data collection will be guided by higher-level structures: hypotheses and theoretical frameworks that suggest which observations are important. When you collect raw digital traces, on the other hand, you enter a world where you're observing both much more and much less—you see many things that would have escaped your detection in person, but you have much less idea what the individual events mean, and no a priori framework to guide their interpretation. How do we reconcile such radically different approaches to these questions?