DATABASES and TOOLS for BROWSING GENOMES Ewan Birney,1 Michele Clamp,2 and Tim Hubbard2

Annu. Rev. Genomics Hum. Genet. 2002. 3:293–310
doi: 10.1146/annurev.genom.3.030502.101529
Copyright © 2002 by Annual Reviews. All rights reserved

1European Bioinformatics Institute (EMBL-EBI), 2Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire, CB10 1SA, United Kingdom; e-mail: [email protected], [email protected], [email protected]

Key Words genome sequence, gene prediction, relational database, open source software, distributed annotation

Abstract To maximize the value of genome sequences they need to be integrated with other types of biological data and with each other. The entire collection of data then needs to be made available in a way that is easy to view and mine for complex relationships. The recently determined vertebrate genome sequences of human and mouse are so large that building the infrastructure to manage these datasets is a major challenge. This article reviews the database systems and tools for analysis that have so far been developed to address this.

INTRODUCTION

The human genome sequence represents the first bounded biological dataset concerning our species. Having access to it is a landmark because of the limits it sets on the problem of understanding biology as a whole. It has allowed us to assemble something equivalent to an “edge” of a multidimensional jigsaw puzzle. This is only a first step in completeness, but at least it gives us a feeling for size and boundaries and directs us toward the “middle” of the puzzle that now needs to be filled in. It is the first step toward other complete datasets: the complete set of genes, the complete set of proteins, the complete set of molecular interactions in the cell, etc.
Completeness changes the way we ask questions: “Is this gene involved in this function?” becomes “Which gene carries out this function?” These datasets will be determined by a combination of experimental work and computational analysis, but in the context of the genome sequence. Genome sequences provide a framework around which all this biological knowledge can potentially be organized, so each layer of data will lead to a greater understanding of layers of organization of biological systems above it.

The availability of several closely related genome sequences (e.g., mouse, rat) brings the possibility of building lists of molecular features common to all vertebrate species and those that are unique to our own. Evolutionary similarities between individual proteins can be identified across the whole of life. Among vertebrates, conservation between genome sequences goes beyond similarities between protein-coding genes and extends to gene order, resulting in large syntenic blocks of many megabases. Other more distant nonvertebrate genomes, such as fly or worm, cannot be usefully compared to human at the level of chromosomal organization (except in rare cases such as the HOX gene cluster), but orthologous genes and proteins can be identified. Analysis of conserved networks of equivalent proteins in distant species will lead to understanding how common cellular systems work and what makes species different from each other. It will also increase our understanding of how biology accommodates change through evolution and variation within populations of individuals. Refining and extending this set of orthologies will have a substantial impact on the development of medical treatments, because these orthologies relate studies of molecular systems in model organisms to human.
Because all biological data is in some way information about how biology as a whole is organized, it is most valuable when systematically organized and integrated. Having these large collections of raw data, which include protein and RNA as well as genome sequences and structures, protein and RNA expression patterns, and cellular localization images, has created a huge need for databases to store information, provide access, and add value. For a current snapshot of the huge range of biological databases, a good source is the annual special database issue of Nucleic Acids Research, published each January. It lists 339 databases in its opening review article in 2002 (5).

Long before the first complete genome sequences of free-living cells were determined, groups from around the world had been tackling the issues of (a) building repositories for raw data, (b) adding annotation to this raw data, and (c) providing higher level structure and organization. Examples of repositories are the public DNA sequence databases of EMBL (37), GenBank (6), and DDBJ (38) as well as the public protein structure database PDB (40). Examples of annotation databases are Flybase, which maintained annotation around the genetics of Drosophila long before a genome sequence was available (10), and SwissProt, which adds functional annotation to protein sequences largely originating from mRNA and genomic sequencing (3). Examples of organizational databases are Pfam, which groups protein sequence domains into families, thereby showing evolutionary relationships between paralogous proteins within an organism and orthologous proteins between organisms (4); SCOP, which groups more distantly related proteins together by structural similarity (26); and KEGG, which organizes proteins and ligands into networks of enzymic processes and regulatory networks (20).
These examples are only illustrative, as the roles are not even clear cut in these cases, e.g., the DNA sequence repositories and organizational databases both contain some annotation.

What differentiates the above databases from genome-sequence-based databases is that the former are all founded around primary sets of data that are currently unbounded, whereas in the case of the latter the primary dataset is essentially bounded. We do not know how many protein folds there are. We do know the complete DNA sequence for many organisms. One of the results of the integration of these existing databases with genome-sequence databases is that we can propagate this completeness to identify where the gaps in our knowledge lie. For example, we can identify the places in the human genome where there is evidence for a gene but where we do not yet have the full-length mRNA transcript or the protein sequence for which it codes. We can identify which protein sequences in a complete genome can be associated with a known three-dimensional structure and which need to be targeted for X-ray or nuclear magnetic resonance (NMR) structure determination, such as those determined by structural genomics projects. We can identify which proteins are not part of any known cellular pathway and thus need to be targeted for functional analysis. A major current challenge for all these database projects is to increase their integration by means that may include propagating information upward from the complete genomes.

One of the problems that urgently needs to be addressed in this integration is the maintenance of evidence trails linking derived annotation with the source of its evidence. For example, a protein of unknown function is labeled as a kinase because of a weak sequence homology to another protein that is known to be a kinase.
Later it is discovered that the weak homology between the sequences was false and was due to a frameshift error in one of the protein sequences. Because most databases do not track the relationship between annotation and the evidence that supported it, the ‘kinase’ label is likely to persist even when the justification for it has vanished. Ideally all objects in all databases should have stable, versioned identifiers, and the relationships between them should be recorded so that when a sequence version changes it can be automatically determined that any evidence that relies on it needs to be reevaluated.

When considering errors introduced by propagated annotation, it is worth also remembering the range of quality and completeness of data that is presented in databases as a whole. Ideally we would have a complete set of exact experimental measurements while being able to compute the behavior of biology exactly, in order to predict perturbations to naturally occurring systems. Currently we can do neither. For example, we have experimental methods for extracting and sequencing mRNA from cells; however, there are many transcripts that are present in such small quantities or for such a transient period that they have never been isolated. Similarly, our ability to “compute” biology is currently very limited. As is discussed below, automatic prediction of gene structures in higher organisms gives good levels of accuracy, but it is still prone to a variety of errors. Between these two extremes of prediction and experimental determination lies a third type of annotation, that of manual curation. The combination of human skills and experience, coupled frequently with knowledge of the scientific literature, means that careful manual annotation is less error prone and more complete than automatic annotation, although it is much slower and harder to maintain. In fact, prediction and curation complement each other well.
The volume of data from newly sequenced genomes makes automatic annotation essential: its speed is limited essentially only by the available central processing unit (CPU) capacity, and frequent updates are required to take account of the ever-increasing amount of biological data available to support annotations. Therefore when making use of annotation, or ‘data’ (such as protein sequences) based on annotation (gene structures from genome sequence), users must be aware of whether it is based on prediction, curation, or experimentation; what the expected accuracy is; and when it was last updated.
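
The evidence trail described above (stable, versioned identifiers, recorded links from each annotation to its supporting evidence, and a note of whether the annotation rests on prediction, curation, or experiment) can be illustrated with a minimal Python sketch. It is not taken from any of the databases reviewed here: the class names, the field names, and the accessions PROT_A and PROT_B are all hypothetical, and a real system would keep these records in a relational database, not in memory.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List, Tuple


class Basis(Enum):
    """How an annotation was obtained (the three types named in the text)."""
    PREDICTION = "prediction"
    CURATION = "curation"
    EXPERIMENT = "experiment"


@dataclass(frozen=True)
class VersionedId:
    """Hypothetical stable accession plus a version number."""
    accession: str
    version: int


@dataclass(frozen=True)
class Annotation:
    """A label on one object, with the exact evidence versions it relied on."""
    target: VersionedId
    label: str
    basis: Basis
    evidence: Tuple[VersionedId, ...]
    last_updated: str  # ISO date string, illustrative only


class EvidenceStore:
    """In-memory stand-in for a database that records evidence trails."""

    def __init__(self) -> None:
        self._latest: Dict[str, int] = {}        # accession -> newest version seen
        self._annotations: List[Annotation] = []

    def register(self, obj: VersionedId) -> None:
        """Record that this version of an object exists."""
        known = self._latest.get(obj.accession, 0)
        self._latest[obj.accession] = max(known, obj.version)

    def annotate(self, annotation: Annotation) -> None:
        self._annotations.append(annotation)

    def needs_reevaluation(self) -> List[Annotation]:
        """Annotations citing an evidence version that has since been superseded."""
        return [
            a for a in self._annotations
            if any(self._latest.get(e.accession, e.version) > e.version
                   for e in a.evidence)
        ]


# The kinase example from the text, with made-up accessions.
store = EvidenceStore()
unknown = VersionedId("PROT_A", 1)
kinase = VersionedId("PROT_B", 1)
store.register(unknown)
store.register(kinase)
store.annotate(Annotation(unknown, "kinase", Basis.PREDICTION, (kinase,), "2002-07-31"))

# A frameshift in PROT_B is corrected, producing version 2 of that sequence.
store.register(VersionedId("PROT_B", 2))
for stale in store.needs_reevaluation():
    print(stale.target.accession, stale.label, stale.basis.value)
```

Because each annotation stores the version of the evidence it used, not just the accession, a sequence update is enough to flag the ‘kinase’ label for review without anyone having to remember why it was assigned.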
