NATURE|Vol 455|4 September 2008 FEATURE

The future of biocuration

To thrive, the field that links biologists and their data urgently needs structure, recognition and support.

Doug Howe, Seung Yon Rhee et al.

The exponential growth in the amount of biological data means that revolutionary measures are needed for data management, analysis and accessibility. Online databases have become important avenues for publishing biological data. Biocuration, the activity of organizing, representing and making biological information accessible to both humans and computers, has become an essential part of biological discovery and biomedical research. But curation increasingly lags behind data generation in funding, development and recognition.

We propose three urgent actions to advance this key field. First, authors, journals and curators should immediately begin to work together to facilitate the exchange of data between journal publications and databases. Second, in the next five years, curators, researchers and university administrations should develop an accepted recognition structure to facilitate community-based curation efforts. Third, curators, researchers, academic institutions and funding agencies should, in the next ten years, increase the visibility and support of scientific curation as a professional career.

Failure to address these three issues will cause the available curated data to lag farther behind current biological knowledge. Researchers will observe an increasing occurrence of obvious gaps in knowledge. As these gaps expand, resources will become less effective for generating and testing hypotheses, and the usefulness of curated data will be seriously compromised.

When all the data produced or published are curated to a high standard and made accessible as soon as they become available, biological research will be conducted in a manner that is quite unlike the way it is done now. Researchers will be able to process massive amounts of complex data much more quickly. They will rapidly gain insight into their areas of interest with the help of inference programs. Digesting information and generating hypotheses at the computer screen will be so much faster that researchers will get back to the bench quickly for more experiments. Experiments will be designed with more insight; this increased specificity will cause an exponential growth in knowledge, much as we are experiencing exponential growth in data today.

Data avalanche

Biology, like most scientific disciplines, is in an era of accelerated information accrual, and scientists increasingly depend on the availability of each other's data. Large-scale sequencing centres, high-throughput analytical facilities and individual laboratories produce vast amounts of data such as nucleotide and protein sequences, protein crystal structures, gene-expression measurements, protein and genetic interactions and phenotype studies. By July 2008, more than 18 million articles had been indexed in PubMed and nucleotide sequences from more than 260,000 organisms had been submitted to GenBank¹,². The recently announced project to sequence 1,000 human genomes in three years to reveal DNA polymorphisms (www.1000genomes.org) is the tip of the data iceberg.

Such data, produced at great effort and expense, are only as useful as researchers' ability to locate, integrate and access them. In recent years, this challenge has been met by a growing cadre of biologists ('biocurators') who manage raw biological data, extract information from published literature, develop structured vocabularies to tag data and make the information available online³ (Box 1). In the past decade, it has become second nature for biologists to visit websites to obtain data for further analysis or integration with local resources. Our survey of several well-curated databases (nine model-organism databases, UniProt and Protein Data Bank) showed that nearly 750,000 visitors (unique IP addresses) viewed more than 20 million pages in just one month (March 2008, Eva Huala, Peter Rose, Rolf Apweiler, personal communications).

Box 1 | The role of biocurators
● To extract knowledge from published papers
● To connect information from different sources in a coherent and comprehensible way
● To inspect and correct automatically predicted gene structures and protein sequences to provide high-quality proteomes
● To develop and manage structured controlled vocabularies that are crucial for data relations and the logical retrieval of large data sets
● To integrate knowledge bases to represent complex systems such as metabolic pathways and protein-interaction networks
● To correct inconsistencies and errors in data representation
● To help data users to render their research more productive in a timely manner
● To steer the design of web-based resources
● To interact with researchers to facilitate direct data submissions to databases

Despite the essential part that it plays in today's research, biocuration has been slow to develop. To provide a forum for the exchange of ideas and methods, and to facilitate collaborations and training, more than 150 biocurators met at two international conferences and created a mailing list and a website (www.biocurator.org). These meetings and discussions have homed in on the three actions, outlined above and elaborated on below, that must now be addressed to ensure scientists' continued access to the high-quality data on which their research depends.

Come together

Extracting, tagging with controlled vocabularies, and representing data from the literature are some of the most important and time-consuming tasks in biocuration. Curated information from the literature serves as the gold-standard data set for computational analysis, quality assessment of high-throughput data and benchmarking of data-mining algorithms. Meanwhile, the boundaries of the biological domain that researchers study are widening rapidly, so researchers need faster and more reliable ways to understand unfamiliar domains. This too is facilitated by literature curation.

Typically, biocurators read the full text of articles and transfer the essence into a database. For a paper about the molecular biology of a particular gene, process or pathway, such information might include gene-expression patterns, mutant phenotypes, results of biochemical assays, protein-complex membership and the authors' inferences about the functions and roles of the gene products studied. As each paper uses different experimental and analysis methods, capturing this information in a consistent fashion requires intensive thought and effort. Limited resources and staff mean that most curation groups can't keep up with all the relevant literature.

How information is presented in the literature greatly affects how fast biocurators can identify and curate it. Papers still often report newly cloned genes without providing GenBank IDs or the species from which the genes were cloned. The entities discussed in a paper, including species, genes, proteins, genotypes and phenotypes, must be unambiguously identified during curation. For example, using the HUGO Gene Nomenclature Committee resource (www.genenames.org), we find that the human gene CDKN2A has ten literature-based synonyms. One of those, p14, is

[...] discussed; and descriptions of species, strains, cell types and genotypes used. Examples of sources for this information are listed in Table 1. This would accelerate literature curation, uphold information integrity, facilitate the proper linkage of data to other resources and support automated mining of data from papers. Another model is for authors to provide a 'structured digital abstract', a

[...] offers software to assist in preparation and validation of such crystallographic data¹⁴. An analogous system to help authors identify, tag and validate the crucial basic information in their research reports before publication would accelerate the automated linkage of literature to key records in existing databases and improve the accuracy of the published data.

In short, authors and publishers must use the existing publication infrastructure to facilitate literature curation much more, to the benefit of all parties.

Community curation

Curation of large-scale genomics and post-genomics data enjoys no such luxury of 'an existing publication infrastructure' to leverage, although emerging standards of data reporting are promising⁴⁻⁹. Sooner or later, the research community will need to be involved in the annotation effort to scale up to the rate of data generation. This transition will require annotation tools, standardized methods, oversight by expert curators and a combination of social infrastructure, tool development, training and feedback. Biocurators are especially important for establishing such an infrastructure and training to maintain consistency and accuracy.

To date, not much of the research community is rolling up its sleeves to annotate. What will be the tipping point? The main limitation in community annotation is the perceived lack of