Will a Biological Database Be Different from a Biological Journal? Philip Bourne
Total Page:16
File Type:pdf, Size:1020Kb
Perspectives Will a Biological Database Be Different from a Biological Journal? Philip Bourne he differences, or otherwise, between biological checked for format compliance and reviewed internally by databases and journals is an important question to experienced annotators. There are even parallel T consider as we ponder the future dissemination and presubmission steps in journals and databases. For example, impact of science. If databases and journals remain discrete, potential authors in PLoS Computational Biology may make our methods of assimilating information will change presubmission inquiries to confirm the suitability of their relatively little in the years to come. On the other hand, if paper, and depositors to the PDB may run their entries databases and journals become more integrated, the way we against a validation server to determine whether the data are do science could change significantly. As both Editor-in-Chief in compliance, prior to having the same tests run by a PDB of PLoS Computational Biology and Codirector of the Protein annotator. Data Bank (PDB), one of the oldest and widely used data Once registered with the corresponding online submission resources in molecular biology, the question is particularly system, a journal manuscript receives a permanent pertinent. Here, I give my perspective on what could and, I manuscript number, while a database entry receives a unique believe, should happen in the future. identifier. Subsequent revisions can be mapped to these My vision is that a traditional biological journal will respective numbers, so that both journals and databases can become just one part of various biological data resources as provide an accurate audit trail of journal manuscripts and the scientific knowledge in published papers is stored and database entries, respectively. Once a manuscript or entry is used more like a database. Conversely, the scientific literature accepted as compliant, both undergo review processes will seamlessly provide annotation of records in the biological involving one or more iterative steps between institution and databases. Imagine reading a description of an active site of a author, as the manuscript or the entry is refined and finally biological molecule in a paper, being able to access released. Release cycles of journals and databases have also immediately the atomic coordinates specifically for that become similar—journals such PLoS Computational Biology active site, and then using a tool to explore the intricate set of have an option for early online release as soon as the hydrogen-bonding interactions described in the paper. Not manuscript is accepted, and biological databases typically only are the data generated by the experiment immediately release entries on a daily or weekly basis, as soon as they have available within the context of what you are reading, but been processed. specific tools for interpreting these data are provided by the Not only are the daily operations of databases and journals journal. Alternatively, if you are starting with the data, for similar, but the business models also have parallels (I will not example, viewing the chromosome location of a human dwell on them here though). Certainly from a consumer’s single-nucleotide polymorphism associated with a perspective, in terms of accessibility, there is no difference neurological disorder, you can immediately access a variety of between a paper in a PLoS journal and an entry in the PDB papers ranked in order of relevance to your profile, not just database—they are freely available to all. In the case of open- through links to abstracts but also by pinpointing the access journals and open archives like the PDB, the parallels, reference to the single-nucleotide polymorphism in the full- from the perspective of the consumer, are even more text article. The type and order of articles displayed could be profound than just free access yet are frequently overlooked. different, depending on whether you are, for example, a PLoS articles are published under a Creative Commons molecular biologist or a neurosurgeon. At this point, Attribution License, which means that the contents (text and whatever your user profile, the distinction between a database images) of all PLoS journals can be used as the consumer sees and a journal article disappears. How could this happen? To fit, provided original attribution is given to the appropriate answer this question, we must think about the parallels that authors and source. So it is with the contents of many exist today between biological databases and biological biological databases, including the PDB. Consumers are free journals. The daily work of any high-throughput scientific journal or biological database consists of information input, Citation: Bourne P (2005) Will a biological database be different from a biological information processing, and information output. Consider journal? PLoS Comp Biol 1(3): e34. the parallels between a journal and a database for each of Copyright: Ó 2005 Philip Bourne. This is an open-access article distributed under these three steps. On a daily basis, the journal accepts the terms of the Creative Commons Attribution License, which permits unrestricted manuscripts; once these have been checked for format use, distribution, and reproduction in any medium, provided the original author and source are credited. compliance and completeness, they undergo review, either by an internal group of scientific editors or, as is the case for Abbreviations: DOI, digital object identifier; PDB, Protein Data Bank PLoS Computational Biology, through peer review by the Philip E. Bourne is Editor-in-Chief of PLoS Computational Biology and is Codirector scientific community. Likewise, a biological database such as of the Protein Data Bank. E-mail: [email protected] the PDB accepts submissions from the community, which are DOI: 10.1371/journal.pcbi.0010034 PLoS Computational Biology | www.ploscompbiol.org0179 August 2005 | Volume 1 | Issue 3 | e34 to take and analyze the contents by any means they see fit, but material can be a valuable addition or, alternatively, can are expected to attribute information to the authors of the make for a disjointed piece of work. Moreover, the original material, as appropriate. Finally, in the case of PLoS supplemental material is ad hoc and cannot be readily journals, the copyright of the material is not signed over to queried across all articles, even though a small amount of it is the publisher but remains with the original author, which is already tagged and comes directly from a database. Third, the also true of information provided to most biological perceived value of both a database entry and a journal article databases. In both forms of open access—journals and has changed over the years. As high-throughput techniques databases—the only requirement is to provide an immutable have become more prevalent, data are produced at an ever- reference to the material. In the case of an online journal increasing rate, so the value of a unit of data, for example, a article, this reference most often takes the form of a digital sequence or structure, has diminished. Data producers hoard object identifier (DOI), and for a database entry, it is usually a their data less than they did in past years. Similarly, the rate unique accession number. Like the contents of manuscripts of publication has increased dramatically, this increase being and database entries, I expect these two forms of immutable brought about by accelerated technologies for manuscript identifiers to become indistinguishable from each other, as I production, large collaborative studies, and increased will outline subsequently. emphasis on the notion of ‘‘publish or perish.’’ In short, Given these parallels, at this point in time, what is the journal content is already becoming more like database difference between an entry in a database and an article in a content and vice versa. journal? Currently the difference can be characterized as a Can this trend continue? Consider how the respective mix of perception and content. Clearly, no one perceives a content of journals and databases is organized. Both have database entry of, say, a sequence, or a specimen in a museum varying degrees of content organization. Papers have collection, as being as valuable as the journal paper that structure, but the organization of their content is less detailed describes it. But, ironically, to the consumer, at least by one than that found in a database, although this is changing with measure, the database entry may indeed be more valuable. formal document type definitions being applied, from which The structure of human deoxyhemoglobin is one of the most database schema can be generated. Typically a paper has an downloaded structures in the PDB—in one year, it has been introduction, a materials and methods section, a results downloaded more times than the original paper has ever been section, and a discussion section; it possibly uses consistent cited thus far. Yet from the authors’ perspective, the Nobel terms for genes, enzymes, and diseases; and in a post- Prize does not come from constructing the PDB database production step, keywords and/or medical subject headings entry, but from an eloquent description of the relationship for indexing the content of the article are added. Databases, between structure and function that was presented most on the other hand, frequently have a high level of completely in the literature. A tenure committee does not organization, where data are granular and each granule is award tenure based on the number of deposits a faculty described in exquisite detail. The advantage of a paper is that member has made to a biological database, but rather the it is relatively easy to input and maintain, but it requires number of papers they have published in leading journals. human recall. Machine-based recall of meaningful Those of you who have made it this far might be thinking it information is poor, a problem being addressed but certainly is ridiculous that I should regard the content of a database not solved by the discipline of natural-language processing.