
NATURE|Vol 455|4 September 2008 FEATURE

The future of biocuration

To thrive, the field that links biologists and their data urgently needs structure, recognition and support.

Doug Howe, Seung Yon Rhee et al.

The exponential growth in the amount of biological data means that revolutionary measures are needed for data management, analysis and accessibility. Online databases have become important avenues for publishing biological data. Biocuration, the activity of organizing, representing and making biological information accessible to both humans and computers, has become an essential part of biological discovery and biomedical research. But curation increasingly lags behind data generation in funding, development and recognition.

We propose three urgent actions to advance this key field. First, authors, journals and curators should immediately begin to work together to facilitate the exchange of data between journal publications and databases. Second, in the next five years, curators, researchers and university administrations should develop an accepted recognition structure to facilitate community-based curation efforts. Third, curators, researchers, academic institutions and funding agencies should, in the next ten years, increase the visibility and support of scientific curation as a professional career.

Failure to address these three issues will cause the available curated data to lag farther behind current biological knowledge. Researchers will observe an increasing occurrence of obvious gaps in knowledge. As these gaps expand, resources will become less effective for generating and testing hypotheses, and the usefulness of curated data will be seriously compromised.

When all the data produced or published are curated to a high standard and made accessible as soon as they become available, biological research will be conducted in a manner that is quite unlike the way it is done now. Researchers will be able to process massive amounts of complex data much more quickly. They will garner insight about the areas of their interest rapidly with the help of inference programs. Digesting information and generating hypotheses at the computer screen will be so much faster that researchers will get back to the bench quickly for more experiments. Experiments will be designed with more insight; this increased specificity will cause an exponential growth in knowledge, much as we are experiencing exponential growth in data today.

Data avalanche

Biology, like most scientific disciplines, is in an era of accelerated information accrual and scientists increasingly depend on the availability of each other's data. Large-scale sequencing centres, high-throughput analytical facilities and individual laboratories produce vast amounts of data such as nucleotide and protein sequences, crystal structures, gene-expression measurements, protein and genetic interactions and phenotype studies. By July 2008, more than 18 million articles had been indexed in PubMed and nucleotide sequences from more than 260,000 organisms had been submitted to GenBank1,2. The recently announced project to sequence 1,000 human genomes in three years to reveal DNA polymorphisms (www.1000genomes.org) is a tip of the data iceberg.

Such data, produced at great effort and expense, are only as useful as researchers' ability to locate, integrate and access them. In recent years, this challenge has been met by a growing cadre of biologists, 'biocurators', who manage raw biological data, extract information from published literature, develop structured vocabularies to tag data and make the information available online3 (Box 1). In the past decade, it has become second nature for biologists to visit websites to obtain data for further analysis or integration with local resources. Our survey of several well-curated databases (nine model-organism databases, UniProt and Protein Data Bank) showed that nearly 750,000 visitors (unique IP addresses) viewed more than 20 million pages in just one month (March 2008, Eva Huala, Peter Rose, Rolf Apweiler, personal communications).

Despite the essential part that it plays in today's research, biocuration has been slow to develop. To provide a forum for the exchange of ideas and methods, and to facilitate collaborations and training, more than 150 biocurators met at two international conferences and created a mailing list and a website (www.biocurator.org). These meetings and discussions have honed in on the three actions, outlined above and elaborated on below, that must now be addressed to ensure scientists' continued access to the high-quality data on which their research depends.

Come together

Extracting, tagging with controlled vocabularies, and representing data from the literature, are some of the most important and time-consuming tasks in biocuration. Curated information from the literature serves as the gold-standard data set for computational analysis, quality assessment of high-throughput data and benchmarking of data-mining algorithms. Meanwhile, the boundaries of the biological domain that researchers study are widening rapidly, so researchers need faster and more reliable ways to understand unfamiliar domains. This too is facilitated by literature curation.

Typically, biocurators read the full text of articles and transfer the essence into a database. For a paper about a particular gene, process or pathway, such information might include gene-expression patterns, mutant phenotypes, results of biochemical assays, protein-complex membership and the authors' inferences about the functions and roles of the gene products studied. As each paper uses different experimental and analysis methods, capturing this information in a consistent fashion requires intensive thought and effort. Limited resources and staff mean that most curation groups can't keep up with all the relevant literature.

How information is presented in the literature greatly affects how fast biocurators can identify and curate it. Papers still often report newly cloned genes without providing GenBank IDs or the species from which the genes were cloned. The entities discussed in a paper, including species, genes, proteins, genotypes and phenotypes must be unambiguously identified during curation. For example, using the HUGO Gene Nomenclature Committee resource (www.genenames.org), we find that the human gene CDKN2A has ten literature-based synonyms. One of those, p14, is also a synonym for five other genes: CDK2AP2, CTNNBL1, RPP14, S100A9 and SUB1. To confirm the identity of the gene described, curators make inferences from synonyms, reported sequences, biological context and bibliographic citations. This time-consuming and error-prone step could be eliminated by compliance with data-reporting standards4–9.

Most recent efforts in this direction have been developed by the communities that produce large-scale genomics data. The vast majority of the peer-reviewed literature does not yet have a reporting-structure standard. As publication has become a mainly digital endeavour, however, publications and biological databases are becoming increasingly similar. Properly cross-referenced and indexed, each could serve as an access point to the other10. Such collaboration between databases and journals would improve researchers' access to data and make their work more visible.

We recommend that all journals and reviewers require that a distinct section of the Methods (or a supplemental document) of all published articles includes approved gene symbols (which are inherently unstable) and model-organism database IDs (which do not change) for genes discussed; nucleotide or protein accession numbers (GenBank or UniProt ID) for isoforms of each gene or protein discussed; and descriptions of species, strains, cell types and genotypes used. Examples of sources for this information are listed in Table 1. This would accelerate literature curation, uphold information integrity, facilitate the proper linkage of data to other resources and support automated mining of data from papers.

Another model is for authors to provide a 'structured digital abstract' (a machine-readable XML summary of pertinent facts in the article11) along with a manuscript. This approach is in an experimental phase at the journal FEBS Letters12.

Journals should also mandate direct submission of data into appropriate databases as a part of publication. This has been implemented by the journal Plant Physiology and curators of The Arabidopsis Information Resource (TAIR) database13. On acceptance of a manuscript, the corresponding author must fill out a simple web-based form to provide appropriate genetic and molecular information about the Arabidopsis genes in the publication. The information is sent to TAIR for integration by biocurators, who work with the authors to ensure that the data reported are of high quality and accurate.

As this infrastructure develops, we would like to see authors routinely tagging all aspects of the data in their publication semantically using universally agreed tag standards. Examples of such tags include the National Center for Biotechnology Information (NCBI) Taxon IDs, the Gene Ontology (GO) IDs and Enzyme Commission (EC) numbers. This information should be embedded in the electronic versions of publications or provided in a supplemental file similar to the crystallographic information file (CIF) currently required for publication of a crystal structure. The CIF file is submitted to the Protein Data Bank (www.pdb.org), which offers software to assist in preparation and validation of such crystallographic data14. An analogous system to help authors identify, tag and validate the crucial basic information in their research reports before publication would accelerate the automated linkage of literature to key records in existing databases and improve the accuracy of the published data.

In short, authors and publishers must use the existing publication infrastructure to facilitate literature curation much more to the benefit of all parties.

Box 1 | The role of biocurators
● To extract knowledge from published papers
● To connect information from different sources in a coherent and comprehensible way
● To inspect and correct automatically predicted gene structures and protein sequences to provide high-quality proteomes
● To develop and manage structured controlled vocabularies that are crucial for data relations and the logical retrieval of large data sets
● To integrate knowledge bases to represent complex systems such as metabolic pathways and protein-interaction networks
● To correct inconsistencies and errors in data representation
● To help data users to render their research more productive in a timely manner
● To steer the design of web-based resources
● To interact with researchers to facilitate direct data submissions to databases

Community curation

Curation of large-scale genomics and post-genomics data enjoys no such luxury of 'an existing publication infrastructure' to leverage, although emerging standards of data reporting are promising4–9. Sooner or later, the research community will need to be involved in the annotation effort to scale up to the rate of data generation. This transition will require annotation tools, standardized methods, oversight by expert curators and a combination of social infrastructure, tool development, training and feedback. Biocurators are especially important for establishing such an infrastructure and training to maintain consistency and accuracy.

To date, not much of the research community is rolling up its sleeves to annotate. What will be the tipping point? The main limitation in community annotation is the perceived lack of incentive. For example, several model-organism databases have requested that authors annotate the genes they publish. This has historically failed for one main reason: contributions by experts consist of information they already know, and do not increase the value of the resource to themselves. A mechanism tied to career or research advancement may be required before community curation can be established as a broadly accepted and productive scientific endeavour15. Incentives for researchers to curate data should include new information or insight for their research interests, improvement in academic reputation or impact, career advancement and better funding chances. Academic departments and funding agencies should consider community annotation as a productive contribution to the scientific research corpus and a natural extension of the publication process.

For example, in the Daphnia Genomics Consortium (http://daphnia.cgb.indiana.edu) collaboration wiki, a community of more than 300 contributors took ownership of annotation of the genome while it was being sequenced at the Joint Genome Institute in Walnut Creek, California, and shared publication authorship as a consortium. Similarly, the International Glossina Genomics Initiative (http://iggi.sanbi.ac.za) hosted an annotation jamboree for field workers, population geneticists and molecular biologists to annotate tsetse fly molecular data as the sequence information became available. This consortium-based publication mechanism is analogous to that used by other large-scale scientific projects such as the Sloan Digital Sky Survey (www.sdss.org). This is a viable course for communities that lack funding for dedicated curators, and offers a reward structure through consortium publication for participation and subsequent satellite papers.

The recently launched WikiProfessional Life Sciences (www.wikiprofessional.org) project links community curation with research and reputation gains. WikiProfessional indexed more than one million authors from PubMed and comparable numbers of biological concepts from authoritative databases and generated a simple way for researchers to update the information16. Because new potential 'facts' are mined from the network of associated concepts, the more accurate and comprehensive a particular concept is, the more chance it will have of being associated with other relevant ones, which in turn will lead to more potential new facts. All the updates researchers make are immediately publicly visible under their own name. Similarly, the Gene Wiki project generated thousands of wiki stubs in Wikipedia for human genes in an attempt to make it easier for the community to update the gene pages17. Although these wiki-based approaches provide an infrastructure for contributors to be recognized, there is not yet a standard practice for these contributions to be cited like a publication. It is imperative that the researchers, journal publishers and database curators start building a standard mechanism for citing annotation data sets.

Allowing anyone with a web browser, including the general public, to annotate entries would increase the number of potential annotators substantially, as pioneered in several astronomy projects. At Galaxy Zoo (www.galaxyzoo.org), 80,000 astronomers and members of the public manually classified the morphology of one million galaxies in less than three weeks. An analogous system to allow the public to contribute to biological annotation could be just as powerful if presented properly. For example, one could show a user an image of an in situ hybridization experiment and ask them to grade it as 'not expressed', 'restricted expression' or 'ubiquitous expression'. Even such basic information, if available for many thousands of genes, would be useful as first-pass annotation.

In sum, researchers (and even the general public) can be mobilized to provide the substantial resources needed to address the immense volume of data, if participation is appropriately rewarded. In the next five years, curators, funding agencies and academic institutions alike must find ways to consider substantial contributions to community curation efforts, much like a peer-reviewed publication, when it comes to issues of promotion, salary, hiring and funding.

Table 1 | Examples of knowledge-sharing databases
Species | Database | URL
databases
Aedes aegypti | VectorBase | www.vectorbase.org
Anopheles gambiae | VectorBase | www.vectorbase.org
Arabidopsis thaliana | The Arabidopsis Information Resource | www.arabidopsis.org
Caenorhabditis elegans | WormBase | www.wormbase.org
Candida albicans | Candida Genome Database | www.candidagenome.org
Culex pipiens | VectorBase | www.vectorbase.org
Danio rerio | Zebrafish Information Network | http://zfin.org
Dictyostelium discoideum | dictyBase | http://dictybase.org
Drosophila sp. | FlyBase | http://flybase.org
Glycine max | SoyBase | www.soybase.org
Homo sapiens | HUGO Gene Nomenclature Committee | www.genenames.org
Hordeum vulgare | Barley Genetic Stocks Database | http://ace.untamo.net/bgs
Ixodes scapularis | VectorBase | www.vectorbase.org
Leishmania sp. | GeneDB | www.genedb.org
Mus musculus | Mouse Genome Informatics | www.informatics.jax.org
Oryza sp. | Gramene | http://gramene.org
Paramecium tetraurelia | ParameciumDB | http://paramecium.cgm.cnrs-gif.fr
Pediculus humanus | VectorBase | www.vectorbase.org
Rattus norvegicus | Rat Genome Database | http://rgd.mcw.edu
Saccharomyces cerevisiae | Saccharomyces Genome Database | www.yeastgenome.org
Schizosaccharomyces pombe | GeneDB | www.genedb.org
Solanaceae sp. | Sol Genomics Network | http://sgn.cornell.edu
Strongylocentrotus purpuratus | SpBase | http://sugp.caltech.edu/SpBase
Triticum sp. | GrainGenes | http://wheat.pw.usda.gov
Trypanosoma sp. | GeneDB | www.genedb.org
Xenopus laevis | Xenbase | www.xenbase.org
Xenopus tropicalis | Xenbase | www.xenbase.org
Zea mays | Maize Genetics and Genomics Database | www.maizegdb.org

Nucleotide, protein and structure databases
All Species | GenBank | www.ncbi.nlm.nih.gov/Genbank
All Species | UniProt | www.pir.uniprot.org
All Species | Protein Data Bank | http://rcsb.org/pdb/home/home.do

All Species | NCBI Taxonomy | www.ncbi.nlm.nih.gov/sites/entrez?db=taxonomy

Biological databases contain unique identifiers for the unambiguous identification of biological entities (such as genes, proteins, species and chemicals). These identifiers do not change as common biological names do. Authors should consult these databases for stable identifiers to cite in their publications.

Career path

How can biocuration mature faster as a career? Biocurators currently streamline submission to databases, automate curation, standardize data and facilitate contributions to annotation by research communities interested in the annotation process. To handle the increasing volume and types of data, journal publishers and researchers who generate data will need to be involved in the curation process and the roles of biocurators will expand to include editing and teaching. As biology moves towards more precise, quantitative science, biologists also need to adapt to thinking more quantitatively, systematically and objectively about their data; biocuration will need to become an inherent part of research and education in biology.

Biocuration requires a blend of skills and experience, including advanced scientific research and competence in database management systems, multiple operating systems and scripting languages. This type of background has typically been garnered through a combination of self-teaching and on-the-job experience, which can be narrow and spotty. Happily, formal education is becoming available. For example, the Graduate School of Library and Information Science at the University of Illinois at Urbana-Champaign offers a biological information specialist master's degree and a specialization in data curation18. Experienced biocurators must lead the way in establishing more and better formal training programmes. In the next 5–10 years, biology curricula should include courses in biocuration as this becomes an increasingly common activity for all biological researchers. And interdisciplinary programmes that include courses in biology, computer science and information science will be vital.

Attracting highly qualified individuals into this field has been challenging. The whole community must promote scientific curation as a professional career option. Funding agencies must assess the impact of curated data and support the development of innovative curation methods. To improve the profession, curators need a forum to share their experiences and publish their works. Oxford University Press plans to begin publishing a new journal in 2009 called Database: The Journal of Biological Databases and Curation. This may provide one such venue for publication of noteworthy advances in biocuration (www.database.oxfordjournals.org). Meanwhile, a committee of 20 biocurators and researchers is forming an International Society for Biocuration (www.biocurator.org/BiocuratorSociety.html) to make the discipline more visible and to promote it as an attractive career path. The official launch of the society is planned for the third International Biocuration Meeting next April in Berlin (http://projects.eml.org/Meeting2009).

Biology today needs more robust, expressive, computable, quantitative, accurate and precise ways to handle data. It is time to recognize that biocuration and biocurators are central to the future of the field.

1. Benson, D. A., Karsch-Mizrachi, I., Lipman, D. J., Ostell, J. & Wheeler, D. L. Nucl. Acid. Res. 36, D25–D30 (2008).
2. Wheeler, D. L. et al. Nucl. Acid. Res. 36, D13–D21 (2008).
3. Salimi, N. & Vita, R. PLoS Comput. Biol. 2, e125 (2006).
4. Brazma, A. et al. Nature Genet. 29, 365–371 (2001).
5. Deutsch, E. W. et al. Nature Biotechnol. 26, 305–312 (2008).
6. Field, D. et al. Nature Biotechnol. 26, 541–547 (2008).
7. Jenkins, H. et al. Nature Biotechnol. 22, 1601–1606 (2004).
8. Orchard, S. et al. Nature Biotechnol. 25, 894–898 (2007).
9. Taylor, C. F. et al. Nature Biotechnol. 25, 887–893 (2007).
10. Bourne, P. PLoS Comput. Biol. 1, 179–181 (2005).
11. Seringhaus, M. R. & Gerstein, M. B. BMC Bioinformatics 8, 17 (2007).
12. Seringhaus, M. & Gerstein, M. FEBS Lett. 582, 1170 (2008).
13. Ort, D. R. & Grennan, A. K. Plant Physiol. 146, 1022–1023 (2008).
14. Burkhardt, K., Schneider, B. & Ory, J. PLoS Comput. Biol. 2, e99 (2006).
15. Rhee, S. Y. Plant Physiol. 134, 543–547 (2004).
16. Mons, B. et al. Genome Biol. 9, R89 (2008).
17. Huss, J. W. et al. PLoS Biol. 6, e175 (2008).
18. Palmer, C. L., Heidorn, P. B., Wright, D. & Cragin, M. H. Int. J. Dig. Curation 2, 31–40 (2007).

Author information Correspondence and requests for materials should be addressed to D.H. (e-mail: [email protected]) and S.Y.R. (e-mail: [email protected]).

See Editorial, page 1.

Authorship Doug Howe1, Maria Costanzo2, Petra Fey3, Takashi Gojobori4, Linda Hannick5, Winston Hide6,7, David P. Hill8, Renate Kania9, Mary Schaeffer10,11, Susan St Pierre12, Simon Twigger13, Owen White14 and Seung Yon Rhee15

1The Zebrafish Information Network, 5291 University of Oregon, Eugene, Oregon 97403-5291, USA. 2Saccharomyces and Candida Genome Databases, Stanford University, Stanford, California 94305-5120, USA. 3dictyBase, Northwestern University Biomedical Informatics Center, 750 N. Lake Shore Drive, 11–175, Chicago, Illinois 60611, USA. 4Centre for Information Biology and DNA Data Bank of Japan, National Institute of Genetics, Research Organization of Information and Systems, Yata, Mishima 411-8540, Japan. 5J. Craig Venter Institute, Applied Bioinformatics, Rockville, Maryland 20850, USA. 6South African National Bioinformatics Institute, University of the Western Cape, Private Bag X17, Bellville 7535, South Africa. 7Department of , Harvard School of Public Health, 655 Huntington Avenue, Boston, Massachusetts 02115, USA. 8Mouse Genome Informatics, The Jackson Laboratory, Bar Harbor, Maine 04609, USA. 9Scientific Databases and Visualization, EML Research GmbH, Villa Bosch, Schloss-Wolfsbrunnenweg 33, D-69118 Heidelberg, Germany. 10Division of Plant Sciences, University of Missouri, Columbia, Missouri, USA. 11Plant Genetics Research Unit, Agricultural Research Service, United States Department of Agriculture, Columbia, Missouri 65211-7020, USA. 12FlyBase, Harvard University, Cambridge, Massachusetts 02138, USA. 13Rat Genome Database, Bioinformatics Research Center, Medical College of Wisconsin, 8701 Watertown Plank Rd, Milwaukee, Wisconsin 53226, USA. 14Department of Epidemiology and Preventative Medicine, Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, Maryland 21201, USA. 15The Arabidopsis Information Resource, Carnegie Institution for Science, Department of Plant Biology, 260 Panama Street, Stanford, California 94305, USA.
