BMC Bioinformatics
BioMed Central

Commentary
Open Access

Publishing perishing? Towards tomorrow's information architecture

Michael R Seringhaus1 and Mark B Gerstein*1,2

Address: 1Department of Molecular Biophysics and Biochemistry, Yale University, 266 Whitney Avenue, New Haven, CT 06520, USA and 2Program in Computational Biology and Bioinformatics, and Department of Computer Science, Yale University, 266 Whitney Avenue, New Haven, CT 06520, USA

Email: Michael R Seringhaus - [email protected]; Mark B Gerstein* - [email protected]

* Corresponding author

Published: 19 January 2007
Received: 17 October 2006
Accepted: 19 January 2007

BMC Bioinformatics 2007, 8:17 doi:10.1186/1471-2105-8-17

This article is available from: http://www.biomedcentral.com/1471-2105/8/17

© 2007 Seringhaus and Gerstein; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Scientific articles are tailored to present information in human-readable aliquots. Although the Internet has revolutionized the way our society thinks about information, the traditional text-based framework of the scientific article remains largely unchanged. This format imposes sharp constraints upon the type and quantity of biological information published today. Academic journals alone cannot capture the findings of modern genome-scale inquiry. Like many other disciplines, molecular biology is a science of facts: information inherently suited to database storage. In the past decade, a proliferation of public and private databases has emerged to house genome sequence, protein structure information, functional genomics data and more; these digital repositories are now a vital component of scientific communication.
The next challenge is to integrate this vast and ever-growing body of information with academic journals and other media. To truly integrate scientific information we must modernize academic publishing to exploit the power of the Internet. This means more than online access to articles, hyperlinked references and web-based supplemental data; it means making articles fully computer-readable with intelligent markup and Structured Digital Abstracts. Here, we examine the changing roles of scholarly journals and databases. We present our vision of the optimal information architecture for the biosciences, and close with tangible steps to improve our handling of scientific information today while paving the way for an expansive central index in the future.

The Goldilocks Effect: Why Journals Drop Big (and Small) Results

The scale of biological data varies tremendously and encompasses a wide array of results: from observations on the order of a single gene to genome-wide studies and comparative genomics. The specifics differ in each case, but this full range of data is inherently complementary: drawn from living systems, it invites interrelation. However, such interrelation proves difficult because our methods for dealing with these data differ depending on their size and the scope of inquiry.

Ideally, scientists should record and share all useful findings. In reality however, results sometimes do not coincide with the standard ration suitable for journal publication. Some facts are simply too trivial to merit a whole paper, so isolated findings or negative results are often withheld from the published record. Conversely, some data sets are too large to include in article format. This latter problem is visible in a series of papers presenting whole-organism protein-protein interaction networks [1-4] or regulatory pathways [5-7]; these manuscripts present highlights and discussion, while the data sets themselves are stored in databases or on laboratory Web sites.

A Changing Landscape: The Roles of Journals and Databases

Journals and databases are naturally positioned to handle different types and amounts of data. In addition, each offers a distinct and useful approach to the recording and transmission of scientific information.

Journal articles are optimized for human consumption. They are straightforward to create, intuitively organized and incorporate authoritative peer-review; however, they are not suited to handle very large or very small results and lack the rigorous organizational structure required for machine indexing and recall. Moreover, academic publishing is largely dominated by corporate publishers whose subscription barriers make third-party indexing of full-text material difficult or impossible. Nonetheless, journal articles are the established currency of scientific research: investigators must publish papers to advance their careers.

Databases are highly structured, efficient and machine-readable, but are not optimized for full-text discussion and require significant effort to establish and organize. They often lack true peer review – beyond a rudimentary oversight panel ensuring compliance with proprietary formats – and interrelating data between independent databases remains a challenge. Furthermore, contributing results even to flagship databases has yet to achieve similar status to publishing in a respected journal. Because no uniform citation system exists to track the database contributions of a given researcher (and little recognition is afforded to database deposit in general), very little professional incentive exists to populate, annotate or revise information stored in databases. As such, large teams of curators manually extract small-scale data from journal articles and deposit them in databases at great cost and effort (for example, the EBI sequence group has about 100 curators).

The once-sharp distinction between journals and databases is beginning to blur. We now search abstracts and access journal articles in a manner similar to database objects. Gone are the days of paging through an entire journal issue; now, articles are accessed through database-type portals such as PubMed, downloaded in PDF format and consumed piecemeal. The structure and format of the scholarly article should reflect this mode of access.

A Future Vision

The optimal information architecture for biology would capture a broad range of data in digital format and facilitate database deposit alongside manuscript publication. It would index all full-text journal articles, associate keywords and identifiers with database records, and link textbooks, laboratory Web sites and high-level commentary. It would provide multiple levels of peer-review, community comment and annotation, and search results tailored to individual user profiles. This vast network of information would be interrelated, linked and accessed via a single seamless portal. Such a centralized system has been proposed before, and here we elaborate upon this vision [8].

From a technological standpoint the aforementioned functionality is entirely reasonable. Indeed, current technology allows for tremendous integration in the way we sift and access scientific data. (Journals in particular fail to exploit the full capability of the Internet, beyond including sporadic hyperlinks and employing the Web as a distribution medium for traditional text.) What is needed is a set of standards to actually accomplish this integration, which, in turn, requires the cooperation of disparate corporate and government entities – often entrenched and resistant to change.

Thus, as enticing as it is to imagine the vast functionality of a de novo information architecture designed to exploit modern digital technology, the reality is that journals and databases exist today in a form that does not allow instant adoption of such a highly-connected structure. Nonetheless, we can still aim to improve our current system while working toward an interconnected future. We have identified three key issues that we feel will lay the groundwork for tomorrow's expansive super-index while improving access to scientific information today.

Toward a New Information Architecture – Three Key Issues

1. Expanded Publication Process

We must expand what it means to publish in the biosciences. Traditional journal articles alone are ill-suited to capture the fruits of modern research; and databases, Web sites, archived presentations and high-level commentaries are a valuable and real part of the scientific information landscape. Academic publication should reflect this.

To modernize academic publishing, we propose three main changes.

First, all scientists should be able to publish, share and access data on the Web. Funding agencies should include stable digital storage with research grants, and tie continued funding to the appropriate use of this storage. This ensures that every author is able to archive pre-prints, host supplementary and unpublished data, and make their

machine-readable XML summary of pertinent facts in the article which we term the Structured Digital Abstract (Figure 1). The former product assists public and non-specialist consumption of scientific research, and the latter would ease pressure on database curators and streamline the large-scale automatic deposition of author-vetted