The Future of Science is Open, Part 3: An Open Science World

In Parts 1 and 2, I talked about the scholarly practice of Open Access publishing, and about how the central concept of “openness”, or knowledge as a public good, is being incorporated into other aspects of science. I suggested that the overall practice (or philosophy, or movement) might be called Open Science, by which I mean the process of discovery at the intersection of Open Access (publishing), Open Data, Open Source (software), Open Standards (semantic markup) and Open Licensing.

Here I want to move from ideas to applications, and take a look at what kinds of Open Science are already happening and where such efforts might lead. Open Science is very much in its infancy at the moment; we don’t know precisely what its maturity will look like, but we have good reason to think we’ll like it.

By way of analogy, think about what the Web has made possible, and ask yourself: how much of that could you have predicted in, say, 1991, when Sir Tim wrote the first browser? Actually, “infancy” being a generous term for the developmental state of Open Science, a better analogy probably reaches further back: how much of what the internet has made possible could anyone have predicted when ARPANET first met NSFnet? Given that last link, for instance, would you have seen Wikipedia coming? How about eBay, Amazon.com, RSS, blogs, YouTube, Google Maps, or insert-your-own-favorite amazing web site/service/application?

The potential is immense, and from our current perspective we cannot predict more than a fraction of the ways in which openness will transform the culture and practice of science. Nonetheless, there are signs pointing in possible directions.

early examples: sequence data

Sequence data (such as mRNA, genomic DNA and protein sequences) have long been the leading edge of large-scale collaborative science, largely because early competition among public and private organizations resulted in a series of groundbreaking agreements on public data sharing. (For a quick tour of the relevant history, see this article.) Among the online tools that have been developed around openly accessible sequence databases such as GenBank or SwissProt, the flagship effort is probably the NCBI’s online gateway Entrez. From Entrez I can search for information on a sequence of interest in almost thirty different interlinked databases. I can:

find related nucleotide and protein sequences, and make detailed comparisons between them
map a sequence of interest onto whole chromosomes or genomes, and compare those maps across ten or twenty different species
access expert-curated information on any connection between a query molecule and human genetic disease or heritable disorders in other species
look for known motifs or functional sequence modules in a query molecule, or use similar sequences to build 3D models of its likely shape and structure
compare a sequence of interest across wide taxonomies, and formulate useful questions about its evolutionary history
look for array data regarding expression of a query sequence in different developmental, disease-related and other contexts
access genetic mapping data with which to map a query sequence in organisms for which little or no sequence data is yet available

There’s much more — that was a very brief and incomplete overview of what Entrez can do — but you get the point. All of this analysis is only possible because the underlying sequence data is available on Open terms (and largely machine-readable due to semantic markup), and it forms a ready-made infrastructure in which further Open information can readily find a place — as soon as it becomes available.

data and text mining

In Part 2 I talked about a range of efforts to make databases of other information, including text, similarly interoperable and available for mining. Paul Ginsparg, in a recent essay, used the interface between PubMed Central and various sequence databases as an early example of what becomes possible when databases can be read by computers as well as by humans (emphasis mine):

GenBank accession numbers are recognized in articles referring to sequence data and linked directly to the relevant records in the genomic databases. Protein names are recognized, and their appearances in articles are linked automatically to the protein and protein interaction databases. Names of organisms are recognized and linked directly to the taxonomic databases, which are then used to compute a minimal spanning tree of all of the organisms contained in a given document. In yet another view, technical terms are recognized and linked directly to the glossary items in the relevant standard biology or biochemistry textbook in the books database. The enormously powerful sorts of data mining and number crunching that are already taken for granted as applied to the open-access genomics databases can be applied to the full text of the entirety of the biology and life sciences literature and will have just as great a transformative effect on the research done with it.
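To make the first of those steps concrete, here is a minimal sketch of how accession numbers might be spotted in running text and turned into links. It is not PubMed Central's actual pipeline. It assumes only the classic GenBank nucleotide accession shapes (one letter plus five digits, or two letters plus six digits, optionally followed by a version suffix), and the names ACCESSION_PATTERN, LINK_TEMPLATE and find_accessions are mine, as is the choice of URL pattern. A pattern this loose will also catch things that merely look like accessions, so a real system would check each hit against the database.

```python
import re

# Assumed shapes only: 1 letter + 5 digits, or 2 letters + 6 digits,
# with an optional ".version" suffix. Real accession formats are broader.
ACCESSION_PATTERN = re.compile(r"\b(?:[A-Z]\d{5}|[A-Z]{2}\d{6})(?:\.\d+)?\b")

# Assumed link target for illustration.
LINK_TEMPLATE = "https://www.ncbi.nlm.nih.gov/nuccore/{accession}"


def find_accessions(text):
    """Return (accession, url) pairs for every accession-like token in text."""
    return [
        (match.group(0), LINK_TEMPLATE.format(accession=match.group(0)))
        for match in ACCESSION_PATTERN.finditer(text)
    ]


sample = "We compared U49845 with the clone AF123456.1 from yeast."
for accession, url in find_accessions(sample):
    print(accession, "->", url)
```

The point is less the regular expression than what it depends on: the article text and the database record both have to be openly readable by a program before a link like this can be made automatically.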

Donat Agosti recently pointed to three related projects: Biotext, which builds text mining tools; EBIMed, which analyses Medline search results and presents associations between gene names and several other databases; and the Arrowsmith Project, which allows semantic comparison between two search-defined sets of PubMed articles. The latter also maintains a list of free online text mining tools, which currently includes several dozen sites offering tools for a variety of purposes, although the majority are still focused on Medline and/or sequence databases.

These sorts of tools are not only useful, they are likely to become essential. Even now, I can hardly imagine trying to navigate the existing sequence data without Entrez, or the research literature without PubMed. GenBank contains more than 40 billion bases and is growing exponentially, doubling every 12-15 months. PubMed contains nearly 17 million records as I write this, and is adding well over half a million every year. The 2007 Nucleic Acids Research database issue lists nearly 1000 separate biological databases, up more than 10% from last year. As Matthew Cockerill of BioMed Central has pointed out, simple text searching is not enough to keep a researcher afloat in this onrushing sea of information.

bibliometrics

Data and text mining methods stand to come into their own as discovery tools once they have a fully Open and machine-readable body of published research on which to work. Similarly, the utility of bibliometrics, the quantitative analysis of text-based information, can be dramatically enhanced by Open Access. In particular, measures of research impact can be made much more powerful, direct and reliable.

Research impact is the degree to which a piece or body of work has been taken up and built upon by other researchers and put to practical use in education, technology, medicine and so on. Governments and other funding bodies want to be able to measure research impact in order to provide accountability and ensure maximal return on investment, and researchers and research administrators want the same measurements in order to assess the quality of their research and to plan future directions (“how are we doing? how can we do better?”).

The most important measure of research impact currently available is citation analysis, a proxy measurement based on acknowledged use by later published work; the predominant citation-based metric in modern research assessment is the Impact Factor (IF). If a journal has a 2004 IF of 5, then papers published in that journal in 2002-2003 were cited, on average, 5 times each in 2004. This number is probably the most widely misunderstood and misused metric in all of science, and comes with a number of serious built-in flaws, not the least of which is that the underlying database is the property of for-profit publishing company Thomson Scientific.
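The arithmetic behind the IF is simple enough to sketch. In the standard two-year form, the IF for year Y is the number of citations received during Y by items the journal published in Y-1 and Y-2, divided by the number of citable items it published in those two years. The function name and the numbers below are made up for illustration; they are not real journal data.

```python
def impact_factor(citations_in_year, items_published):
    """Two-year Impact Factor as a plain ratio.

    citations_in_year: citations received in the census year to items
        the journal published in the two preceding years.
    items_published: citable items published in those two years.
    """
    if items_published <= 0:
        raise ValueError("items_published must be positive")
    return citations_in_year / items_published


# Hypothetical journal: 200 papers over the two preceding years,
# cited 1000 times in total during the census year.
example_if = impact_factor(1000, 200)
print(example_if)  # 5.0
```

Note that the result is a journal-level average: it says nothing about how many times any individual paper was cited.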

Despite these flaws and considerable high-profile criticism, it is difficult to overstate the influence that the Impact Factor has had, and continues to have, on all efforts to evaluate scientists and their work. Researchers obsess over journal choice: you don’t want a rejection, which forces you to re-submit elsewhere and wastes time, but you need to get that paper into the “best” (that is, highest IF) journal you can so as to appeal to hiring, funding and tenure committees. And that’s not unrealistic, since quite frankly the bottom line for most such committees is “who has published the most papers in high-IF journals”. Other factors are usually considered, but the IF dominates. It’s a clumsy, inaccurate and unscientific way to go about evaluating research impact and researcher talent.

Happily, there is a better way just over the Open Access horizon. Once a majority of published research is available in machine-readable OA databases, the community can get out from under Thomson’s thumb and improve scientific bibliometrics in a host of different ways. Shadbolt et al. list more than two dozen improvements that OA will make possible, including:

A CiteRank analog of Google’s PageRank algorithm will allow hits to be rank-ordered by weighted citation counts instead of just ordinary links (not all citations are equal)
In addition to ranking hits by author/article/topic citation counts, it will also be possible to rank them by author/article/topic download counts
Correlations between earlier download counts and later citation counts will be available online, and usable for extrapolation, prediction and eventually even evaluation
Searching, analysis, prediction and evaluation will also be augmented by cocitation analysis (who/what co-cited or was co-cited by whom/what?), coauthorship analysis, and eventually also co-download analysis
Time-based (chronometric) analyses will be used to extrapolate early download, citation, co-download and co-citation trends, as well as correlations between downloads and citations, to predict research impact, research direction and research influences.
Authors, articles, journals, institutions and topics will also have “endogamy/exogamy” scores: how much do they cite themselves? in-cite within the same “family” cluster? out-cite across an entire field? across multiple fields? across disciplines?
“Hub/authority” analysis will make it easier to do literature reviews, identifying review articles citing many articles (hubs) or key articles/authors (authorities) cited by many articles.
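The first item on that list, weighting citations by the standing of the paper that makes them, can be sketched in a few lines. What follows is an ordinary PageRank-style power iteration run over a toy citation graph. It is not the CiteRank that Shadbolt et al. have in mind, just the general idea, and the function name, the damping value and the five-paper graph are all assumptions made for illustration.

```python
def cite_rank(citations, damping=0.85, iterations=50):
    """PageRank-style scores for a citation graph.

    citations maps each paper to the list of papers it cites.
    """
    papers = set(citations)
    for cited_list in citations.values():
        papers.update(cited_list)
    papers = sorted(papers)
    n = len(papers)
    rank = {paper: 1.0 / n for paper in papers}

    for _ in range(iterations):
        # Papers that cite nothing spread their weight evenly over all papers.
        dangling = sum(rank[p] for p in papers if not citations.get(p))
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {paper: base for paper in papers}
        for citing, cited_list in citations.items():
            if not cited_list:
                continue
            # Each paper passes its weight on in equal shares to what it cites.
            share = damping * rank[citing] / len(cited_list)
            for cited in cited_list:
                new_rank[cited] += share
        rank = new_rank
    return rank


# Toy graph: C and D are each cited exactly twice.
toy_graph = {"A": ["C"], "B": ["C"], "C": ["D"], "E": ["D"], "D": []}
ranks = cite_rank(toy_graph)
for paper in sorted(ranks, key=ranks.get, reverse=True):
    print(paper, round(ranks[paper], 3))
```

In the toy graph, C and D have the same raw citation count, but D comes out ahead because one of its citations comes from C, which is itself well cited. That is the sense in which not all citations are equal, and it only works if the whole citation graph is open to be computed over.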

Existing metrics (which basically means Thomson’s proprietary data) are simply not rich enough to support such analyses. There are already efforts underway to mine the available body of text for better ways to evaluate research. Hirsch’s h-index, an alternative way of using citation counts to rank authors according to their influence, can be calculated online using Google Scholar. Bollen et al. have proposed a method for using Google’s PageRank as an alternative to the Impact Factor, as well as their own Y-factor which is a composite of the two measures. The Open Citation Project built Citebase, an online citation tracker which has been used to show that downloads (which are measured in real-time from the moment of upload) can predict citations (for which data one must wait years). Authoratory is a text-mining tool based on PubMed, and is capable of co-author analysis, authority ranking and more.
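Of the measures just mentioned, the h-index is the easiest to compute by hand: an author has index h if h of their papers have been cited at least h times each. Here is a minimal sketch, assuming you already have a list of per-paper citation counts (the counts below are invented, and this is not how Google Scholar or any particular calculator implements it).

```python
def h_index(citation_counts):
    """Largest h such that h papers have at least h citations each."""
    counts = sorted(citation_counts, reverse=True)
    h = 0
    for position, count in enumerate(counts, start=1):
        if count >= position:
            h = position
        else:
            break
    return h


# Hypothetical author with five papers.
print(h_index([10, 8, 5, 4, 3]))  # 4
```

One very highly cited paper does not move the number much: [25, 8, 5, 3, 3] gives an h-index of 3.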

As the body of OA literature expands, these and similar tools will provide a far more reliable and equitable means of comparing researchers and research groups with their peers than is currently available, and will also facilitate the identification of trends and gaps in research focus. The downstream effects of increased efficiency in managing and carrying out research will be profound.

commentary and community

Andrew Dayton recently described another feature of the coming Open Science world, which he calls Open Discourse:

The internet is expanding the realm of scientific publishing to include free and open public debate of published papers. […] How often have you asked yourself how a certain study was published unchallenged, without the results of a key control? How often have you wondered whether a paper’s authors performed a specific procedure correctly? How often have you had the opportunity to question authors about previously published or opposing results they failed to cite, or discuss the difficulties of reproducing certain results? How often have you had the opportunity to command a discussion of an internal contradiction the referees seemed to have missed?

Stevan Harnad has referred to a similar idea as peer commentary, calling it a “powerful and important supplement to peer review”. It’s important to note that a number of journals, such as Current Anthropology or Psycoloquy, offer “open peer commentary” which is not actually open to public contribution. Similarly, the phrase “open peer review” is typically used to indicate that reviewers are not anonymous, rather than that review is open to the public. Neither of these pseudo-open concepts relies on “openness” in the Open Access/Open Science sense, whereas Open Discourse as Dayton means it is, of course, utterly dependent on such openness for its subject matter.

There are a number of venues which enable fully Open Discourse as Dayton means it. OA publisher BioMed Central offers a public comment button on every article, and Cell allows public comments on selected articles. BMC also publishes Biology Direct, which offers both an alternative model of peer review and public commentary, and PLoS has just launched PLoS One, offering standard peer review followed by public commentary, annotation and rating. Philica will publish anything, and provides public commentary which can also serve as a form of peer review through an authentication process for professional researchers. JournalReview.org is set up as an online public journal club, and Naboj is a forum for public review of articles posted to arxiv.org. BioWizard is somewhat similar, but is limited to articles accessible via PubMed and offers a number of other tools, such as a blogging platform and a rating mechanism designed to identify popular papers. Both JournalReview and BioWizard notify corresponding authors so that they can participate in the discussion. The British Medical Journal offers a rapid response mechanism which, having posted over 50,000 public responses to published work, sounds a cautionary note for more recent arrivals on the public commentary scene: in 2005, the journal was forced to impose a length limit and active moderation in order to avoid losing the desired signal in a flood of uninformed, obsessive noise.

Speaking of floods of uninformed, obsessive noise — what about blogs?

Of course, I’m kidding. I actually have high hopes for the future of blogs in science, centered on three themes: commentary, community and data. Blogs are an excellent medium for commenting on anything, and with web feeds and a good aggregator it’s pretty easy to keep track of a selected group of blogs. If Technorati worked, it might allow interesting views of the science blogosphere; fortunately, we have Postgenomic, which indexes nearly 700 science blogs and then “does useful and interesting things” with the data. For instance, you can see which papers and/or books are getting attention from science bloggers; there’s even a Greasemonkey script that will flag Postgenomic-indexed papers in Connotea, Nature.com’s social bookmark manager for scientists, another for PubMed and yet another for journal websites. A new Digg-like “community commentary” site, The Scientific Debate, allows trackbacks and so can interact with regular blogs. The discussion above about text mining applies, of course, to blogs, since they are typically openly accessible and friendly to text mining software. For instance, Biology Direct or PLoS One could interact with the blogosphere using linkbacks, or by pulling relevant posts from Postgenomic.

Blogs also tend to create virtual communities, such as the one that centers on Seed’s ScienceBlogs collection of, well, science blogs. This group of about 50 blogs is rapidly becoming a hub of the science blogosphere, and even gave rise to a recent meatspace conference that bids fair to become an annual event. Such self-selected communities foster a sense of camaraderie and strongly encourage co-operation over competition, which can only favor the advance of Open Science. (It’s not just blogs, of course, that can take advantage of community building. The Synaptic Leap, the Tropical Disease Initiative, OpenWetWare and BioForge all provide infrastructures that enable collaborative communities to do Open Science.)

Finally, blogs (and wikis) have immense potential as a scientific publishing medium. They are, to begin with, the perfect place for things like negative results, odd observations and small side-projects — research results for which the risk of having an idea stolen is greatly outweighed by both the possibility of picking up a collaboration and the importance of having made available to the research community information which would never surface in a traditional journal. Most research communities are relatively small; it would not be difficult for most researchers to keep up with the lab weblogs (lablogs?) of the groups doing work most closely related to their own. I know of a few blog posts in this category. This and this from Bora Zivkovic are, I think, the first instances of original data on a blog. This series from Sandra Porter is earlier but involves bioinformatic analysis (that is, original experimentation, but no original data), as do this and this from Pedro Beltrao. Egon Willighagen blogs working software/scripts for cheminformatics, and Rosie Redfield and her students blog hypotheses, thinking-out-loud and even data. Blogs are also good for sharing protocols, like the syntheses posted by the anonymous proprietor of Org Prep Daily.

Beyond that, it’s possible to do fully Open Science, publishing day-to-day results (including all raw data) in an online lab notebook. I know it’s possible because Jean-Claude Bradley is doing it; he calls it Open Notebook Science. His lab’s shared notebook is the UsefulChem wiki, which is supplemented by the UsefulChem blog for project discussion and the UsefulChem Molecules blog, a database of molecules related to their work. There is nothing to prevent Jean-Claude from publishing traditional articles whenever he has the kind of “story” that is required for that format, but in the meantime all of his research output is captured and made available to the world. Importantly, this includes information which would never otherwise have been published — negative results, inconclusive results, things which simply don’t fit into the narrative of any manuscript he prepares, and so on. Being on a third-party hosted wiki, the notebook entries have time and date stamps which can establish priority if that should be necessary; version tracking provides another layer of authentication.

At the moment the Bradley lab is the only group I know of that is doing Open Notebook Science, but of all the glimpses of an Open Science world I have tried to provide in this entry, Jean-Claude’s model is, I think, the clearest and most hopeful. Only when that level of transparency and immediacy is the norm in scientific communication will the research community be able to realize its full potential.

that’s all, folks

I promise, no more obsessive posting about Open Science here on 3QD. If I’ve managed to pique anyone’s interest, I recommend reading Peter Suber’s Open Access News and anything else that takes your fancy from the “open access/open science” section of my blogroll. And as always, if I’ve missed anything or got anything wrong, let me know in comments.

….

This work is licensed under a Creative Commons Attribution 3.0 License.