Download Large Datasets for Analysis
Total Page:16
File Type:pdf, Size:1020Kb
Parkhill et al. Genome Biology 2010, 11:402 http://genomebiology.com/2010/11/7/402 CORRESPONDENCE Genomic information infrastructure after the deluge Julian Parkhill1*, Ewan Birney2 and Paul Kersey2 Abstract Dramatic falls in the consumable costs of DNA sequen- cing have not fundamentally changed the need for com pu- Maintaining up-to-date annotation on reference ta tional analysis to process and interpret the infor mation genomes is becoming more important, not less, as the produced. Indeed, the need has increased as the volume ability to rapidly and cheaply resequence genomes and complexity of the data have risen. ere has, there- expands. fore, been a profound shift towards a higher intensity of informatics in biological research, with bioinformatics becoming a necessary component of many, if not most, e advent of next-generation sequencing technology molecular biology groups. e analysis of new genome- has led to a profound shift in the economics of genomics. wide experiments typically requires the presence of a Sequencing costs have fallen more than a hundredfold robust, accurate information infrastructure, includ ing a over the past four years, and this rate of reduction is reasonable assembly of the genome sequence, a set of likely to continue for the foreseeable future. e availa- accurate gene predictions and a description of their bio- bility of cheap DNA sequencing has changed the cost of a logical function. When genome sequence determina tion variety of experiments - gaining a near-complete bacterial was expensive, and thus both relatively uncommon and sequence costs a few hundred dollars in consumables, concentrated in areas of intensive experimental research, whereas mid-size genomes are amenable to a single grant considerable resources could be focused on individual proposal. A number of large genomes, such as those of genomes, often in intensively managed and curated model vertebrates (for example, the turkey) have been under- organism databases (such as FlyBase [1], WormBase [2], taken by small consortia of interested laboratories. In and the Saccharomyces Genome Database [3]). addition, there are a variety of novel assays, such as RNA However, the model of relatively independent, large sequencing (RNA-seq), transposon mutagenesis and consortia focused on a small set of genomes seems ill chromatin immunoprecipitation and sequencing (ChIP- equipped to handle the flood of new genomes. Without seq) in which low-cost sequencing has replaced other such support, annotations created for many genomes readout platforms such as nucleic acid hybridization. have not been kept up-to-date since their initial sub- Under standing these data rests fundamentally on well mission to the public databases, as sequencing groups curated, up-to-date annotation for reference genomes, have moved on to new targets and experimental data which can be leveraged for other species. However, the have accumulated in the literature. Although there has ability of the scientific community to maintain such been considerable success in creating portable software resources is failing as a result of the onslaught of new components for genome curation, such as the GMOD data and the disconnect between the archival DNA tools (for example, Apollo [4] and Chado [5]), Artemis [6] databases and the new types of information and analysis and others, their application happens in an ad hoc being reported in the scientific literature. In this article, manner, often focusing on solving a particular problem we propose a new structure for genomic information specific to one group, rather than systematically. is resources to address this problem. leads to the duplication of effort between groups and inconsistency between the annotations they produce. Even when experimental data are well organized in a structured resource, their volume is a further impediment *Correspondence: [email protected] 1The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, to their successful exploitation by the wider community, Cambridge, CB10 1SA, UK as network bandwidth is often a constraining factor when Full list of author information is available at the end of the article attempting to download large datasets for analysis. ere are, therefore, at least two challenges facing the post- © 2010 BioMed Central Ltd © 2010 BioMed Central Ltd deluge community. e first is ensuring that bioinformatics Parkhill et al. Genome Biology 2010, 11:402 Page 2 of 5 http://genomebiology.com/2010/11/7/402 resources are kept up-to-date and operate in a stable and unclear what resources will appear without intervention, reliable funding environment. The second is creating unclear whether a particular resource is good value for mechanisms to give end users access to the raw datasets, money (especially when it partially duplicates other which are now so massive that they cannot easily be resources) and unclear how any particular information transferred across the Internet. Both are weighty issues, resource will survive beyond a single funding cycle. In and this article focuses on the first one. addition, like many other scientific endeavors, these The International Nucleotide Sequence Database activities occur in an international context with a geo- (INSDC), implemented as GenBank [7] at the US graphic diversity of participating groups and a matching National Center for Biotechnology Information (NCBI), diversity of funding agencies, whose goals may be more ENA at the European Bioinformatics Institute [8] (EBI) or less well aligned. and the DNA Database of Japan [9] at the National The absence of a structure for funding and data can Institute for Genetics, has archived DNA sequence infor- lead to the loss of valuable scientific content when a ma tion submitted by experimentalists since its establish- particular episode of funding concludes. Among the ment in 1984. However, even before the advent of the most striking current demonstrations of this is the new technology there was an increasing disconnection funding crisis faced by The Arabidopsis Information between the genome annotation in the archive and the Resource (TAIR) [15], which has curated the genome of more complex functional information that had accumu- the model plant Arabidopsis thaliana, but which faces lated in the laboratories of the scientific community, and closure in 2013 if new funding cannot be secured. For in the literature. In response to this, the Ensembl project smaller resources, the threat of effective closure is ever [10] in Europe and the RefSeq project [11] at NCBI were present, as funding is usually linked to specific research- developed partly to capture, and partly to provide, high- oriented grants. To give just one example, the COGEME quality annotation, in particular on protein-coding genes, database for plant pathogen expressed sequence tags on important genomes. For some species (such as (ESTs) [16] was updated regularly between 2001 and 2007 Drosophila, yeast and worm) these resources mirrored but (in the absence of longer-term funding) not since. information from the well funded model organism Over the past five years this patchwork of resources has databases already established for these species. In most improved through communication and software reuse. other cases, however, the new resources were derived from Examples include the development of open-source a selection from the submitted archival records, with out software by groups such as GMOD (for example, the significant manual updates. Finally, in cases such as human Gbrowse genome browser [17]), Ensembl [10] and and other mammals, there was direct creation of added- GeneDB [18] that can be reused by others; better value datasets on the genome, often through collaborations communi cation between model organism databases and with other groups (for example, the UCSC Genome EBI/NCBI; and improved coordination of funding in Browser group [12] for vertebrate genomes). More adjacent areas (for example, the Bioinformatics Resource generally, NCBI [13] and EBI [14] act as major providers of Centers (BRCs) [19-21] funded by the US National bioinformatics services across a broad range of domains, Institute of Allergy and Infectious Diseases (NIAID), of which genome-centric resources form just one part. which each cover a portfolio of related species where The current situation is therefore a patchwork of NIAID is also funding experimental work). However, different resources, with different funding models and there is still a fundamental need for a stable, sustainable different communication lines. There are benefits to this and comprehensive configuration of resources that can diversity - funding streams usually involve a good handle the growing influx of genomic data from all connec tion to the scientists working directly on a species sources. In the remainder of this article we outline a pro- (whose involvement is required to justify investment), no posed structure that formalizes aspects of current best single group has a monopoly on the information flow, practice and proposes a clear model for data management innovation in added-value services can be explored, and for both scientists and funding agencies. small additional components can often be funded rapidly. However, there are some major disadvantages as well - A three-tier structure ineffective (or in some cases nonexistent) communication We propose a three-tier, federated structure that should between diverse groups hampers the propagation of the address many of these issues (Figure 1), in which