31 Jul 2002 11:44 AR AR167-GG03-12.tex AR167-GG03-12.sgm LaTeX2e(2002/01/18) P1: IBD 10.1146/annurev.genom.3.030502.101529

Annu. Rev. Genomics Hum. Genet. 2002. 3:293–310 doi: 10.1146/annurev.genom.3.030502.101529 Copyright © 2002 by Annual Reviews. All rights reserved

DATABASES AND TOOLS FOR BROWSING GENOMES

Ewan Birney,1 Michele Clamp,2 and Tim Hubbard2
1European Bioinformatics Institute (EMBL-EBI), 2Wellcome Trust Sanger Institute, Genome Campus, Hinxton, Cambridgeshire, CB10 1SA, United Kingdom; e-mail: [email protected], [email protected], [email protected]

Key Words genome sequence, relational database, open source software, distributed annotation

■ Abstract To maximize the value of genome sequences, they need to be integrated with other types of biological data and with each other. The entire collection of data then needs to be made available in a way that is easy to view and mine for complex relationships. The recently determined vertebrate genome sequences of human and mouse are so large that building the infrastructure to manage these datasets is a major challenge. This article reviews the database systems and analysis tools that have so far been developed to address this challenge.

INTRODUCTION

The human genome sequence represents the first bounded biological dataset concerning our species. Having access to it is a landmark because of the limits it sets on the problem of understanding biology as a whole. It has allowed us to assemble something equivalent to an “edge” of a multidimensional jigsaw puzzle. This is only a first step toward completeness, but at least it gives us a feeling for size and boundaries and directs us toward the “middle” of the puzzle that now needs to be filled in. It is the first step toward other complete datasets: the complete set of genes, the complete set of proteins, the complete set of molecular interactions in the cell, etc. Completeness changes the way we ask questions: “Is this gene involved in this function?” becomes “Which gene carries out this function?” These datasets will be determined by a combination of experimental work and computational analysis, but in the context of the genome sequence. Genome sequences provide a framework around which all this biological knowledge can potentially be organized, so each layer of data will lead to a greater understanding of the layers of organization of biological systems above it.

The availability of several closely related genome sequences (e.g., mouse, rat) brings the possibility of building lists of molecular features common to all vertebrate species and those that are unique to our own. Evolutionary similarities between individual proteins can be identified across the whole of life. Among vertebrates, conservation between genome sequences goes beyond similarities between protein-coding genes and extends to gene order, resulting in large syntenic blocks of many megabases. Other more distant nonvertebrate genomes, such as fly or worm, cannot be usefully compared to human at the level of chromosomal organization (except in rare cases such as the HOX gene cluster), but orthologous genes and proteins can be identified. Analysis of conserved networks of equivalent proteins in distant species will lead to understanding how common cellular systems work and what makes species different from each other. It will also increase our understanding of how biology accommodates change through evolution and variation within populations of individuals. Refining and extending this set of orthologies will have a substantial impact on the development of medical treatments, as these relate studies of molecular systems in model organisms to human.

Because all biological data is in some way information about how biology as a whole is organized, it is most valuable when systematically organized and integrated. Having these large collections of raw data, which include protein and RNA as well as genome sequences and structures, protein and RNA expression patterns, and cellular localization images, has created a huge need for databases to store information, provide access, and add value. For a current snapshot of the huge range of biological databases, a good source is the annual special database issue of Nucleic Acids Research, published each January. Its opening review article in 2002 lists 339 databases (5). Long before the first complete genome sequences of free-living cells were determined, groups from around the world had been tackling the issues of (a) building repositories for raw data, (b) adding annotation to this raw data, and (c) providing higher-level structure and organization.
Examples of repositories are the public DNA sequence databases of EMBL (37), GenBank (6), and DDBJ (38) as well as the public protein structure database PDB (40). Examples of annotation databases are Flybase, which maintained annotation around the genetics of Drosophila long before a genome sequence was available (10), and SwissProt, which adds functional annotation to protein sequences largely originating from mRNA and genomic sequencing (3). Examples of organizational databases are a resource that groups protein sequence domains into families, thereby showing evolutionary relationships between paralogous proteins within an organism and orthologous proteins between organisms (4); SCOP, which groups more distantly related proteins together by structural similarity (26); and KEGG, which organizes proteins and ligands into networks of enzymic processes and regulatory networks (20). These examples are only illustrative, as the roles are not clear cut even in these cases; e.g., the DNA sequence repositories and organizational databases both contain some annotation.

What differentiates the above databases from genome-sequence-based databases is that the former are all founded around primary sets of data that are currently unbounded, whereas in the case of the latter the primary dataset is essentially bounded. We do not know how many protein folds there are. We do know the complete DNA sequence for many organisms. One of the results of the integration of these existing databases with genome-sequence databases is that we can propagate this completeness to identify where the gaps in our knowledge lie. For example, we can identify the places in the human genome where there is evidence for a gene but where we do not yet have the full-length mRNA transcript or the protein sequence for which it codes. We can identify which protein sequences in a complete genome can be associated with a known three-dimensional structure and which need to be targeted for X-ray or nuclear magnetic resonance (NMR) structure determination, such as those determined by structural genomics projects. We can identify which proteins are not part of any known cellular pathway and thus need to be targeted for functional analysis.

A major current challenge for all these database projects is to increase their integration by means that may include propagating information upward from the complete genomes. One of the problems that urgently needs to be addressed in this integration is the maintenance of evidence trails linking derived annotation with the source of its evidence. For example, suppose a protein of unknown function is labeled as a kinase because of a weak sequence homology to another protein that is known to be a kinase. Later it is discovered that the weak homology between the sequences was false and was due to a frameshift error in one of the protein sequences. Because most databases do not track the relationship between annotation and the evidence that supported it, the ‘kinase’ label is likely to persist even when the justification for it has vanished. Ideally all objects in all databases should have stable, versioned identifiers, and the relationships between them should be recorded so that when a sequence version changes it can be automatically determined that any evidence that relies on it needs to be reevaluated.

When considering errors introduced by propagated annotation, it is worth also remembering the range of quality and completeness of data that is presented in databases as a whole.
Ideally we would have a complete set of exact experimental measurements while being able to compute the behavior of biology exactly, in order to predict perturbations to naturally occurring systems. Currently we can do neither. For example, we have experimental methods for extracting and sequencing mRNA from cells; however, there are many transcripts that are present in such small quantities or for such a transient period that they have never been isolated. Similarly, our ability to “compute” biology is currently very limited. As is discussed below, automatic prediction of gene structures in higher organisms gives good levels of accuracy, but it is still prone to a variety of errors. Between these two extremes of prediction and experimental determination lies a third type of annotation, that of manual curation. The combination of human skills and experience, coupled frequently with knowledge of the scientific literature, means that careful manual annotation is less error prone and more complete than automatic annotation, although it is much slower and harder to maintain. Prediction and curation in fact complement each other well. The volume of data from newly sequenced genomes makes automatic annotation essential: its speed is essentially limited only by the available central processing unit (CPU) capacity, and frequent updates are required to take account of the ever-increasing amount of biological data available to support annotations. Therefore when making use of annotation, or ‘data’ (such as protein sequences) based on annotation (gene structures from genome sequence), users must be aware of whether it is based on prediction, curation, or experimentation; what the expected accuracy is; and when it was last updated.

While noting the limitations of current systems, we can look forward to what will be possible as existing systems converge. A perfect integration would allow biology to be viewed in vertical slices (from genome sequence to organism) and horizontal slices (from viruses to humans). A vertical slice might be the biology around a gene and could stretch from the genome sequence surrounding the gene, through the gene’s protein products, the product’s role in a network of molecules that on the input side regulate its production and on the output side control its interaction with other molecules, the localization of products in cells and across the organism, and the temporal expression in the organism from life to death, to the variation in the gene and the impact of this variation within a population of organisms. A horizontal slice might be the biology around a metabolite and could stretch from the existence of the metabolite outside life to the evolutionary tree of its roles and metabolism stretching from bacteria to humans, thus showing where it became adopted for new roles and where old roles were replaced by other mechanisms or modified versions of the metabolite. Each step toward a goal of complete integration should improve our ability to predict our interactions with biology, as we consider the consequences of perturbations such as the availability of a new drug, of new virus genotypes, and of the release of new pollutants.

From every standpoint, we are a very long way from attaining such a goal: in terms of availability of experimental data, in terms of being able to predict the behavior of biological systems, and in terms of organizing existing data.
However, the first steps are very clear: they are the organization of existing biological data around and between the newly determined vertebrate genome sequences and the prediction of the next level of information from the genome sequence, namely, the genes and proteins coded by it. This review concentrates on progress on these two issues, surveying the systems that are being constructed to automatically annotate the human genome for genes and the computer systems technology that is being assembled to store, manipulate, and provide access to these very large blocks of data.
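
To make the evidence-trail requirement raised earlier in this Introduction concrete, the following minimal Python sketch records, for each annotation, the versioned identifiers of the objects it relies on, so that a version change flags dependent annotations for reevaluation. The class names, the accession strings, and the `stale_annotations` helper are hypothetical illustrations and are not taken from any of the databases reviewed here.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class VersionedId:
    """A stable accession plus a version that increases when the sequence changes."""
    accession: str
    version: int


@dataclass
class Annotation:
    target: VersionedId  # the object being annotated
    label: str           # e.g. "kinase"
    evidence: list = field(default_factory=list)  # VersionedIds the label relies on


def stale_annotations(annotations, current_versions):
    """Return annotations whose target or evidence refers to a superseded version.

    current_versions maps accession -> latest known version number.
    Accessions missing from the map are treated as unchanged.
    """
    stale = []
    for ann in annotations:
        for obj in [ann.target, *ann.evidence]:
            if current_versions.get(obj.accession, obj.version) != obj.version:
                stale.append(ann)
                break
    return stale


# Hypothetical accessions, for illustration only.
annotations = [
    Annotation(VersionedId("PROT_A", 1), "kinase", [VersionedId("PROT_B", 1)]),
    Annotation(VersionedId("PROT_C", 2), "transporter", [VersionedId("PROT_D", 3)]),
]
# PROT_B is corrected (for example, a frameshift is fixed) and moves to version 2.
current = {"PROT_A": 1, "PROT_B": 2, "PROT_C": 2, "PROT_D": 3}
flagged = stale_annotations(annotations, current)
print([a.label for a in flagged])  # prints ['kinase']
```

In this sketch the ‘kinase’ label is not deleted automatically; it is only flagged, which leaves the decision about whether the evidence still holds to a later automatic or manual step.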

AUTOMATIC GENE ANNOTATION

Analysis Pipelines

The human (11, 39) and mouse (25) genomes are currently much larger than any other genome sequenced, and the sheer size of the data has posed many problems to those trying to annotate it. These problems are exacerbated in the human genome by the constant change in the sequence as it is being finalized, which has led to a series of genome assemblies that need to be analyzed. This problem has been partially solved by increasing the number of computers available for the analysis, but this is only part of the solution. A large portion of the analysis has to be automatic to get acceptable annotation turnaround, and this involves engineering an analysis-pipeline system to process the vast amount of sequence. A number of large-scale analysis pipelines exist [Ensembl (18), National Center for Biotechnology Information (NCBI), Oak Ridge Genome Channel (28), Gadfly, Celera’s Otto (22)]; and although they differ in their structure and underlying storage systems, they share many common features.

The early stages of the different pipelines are very similar; before any gene annotation can proceed, many different programs need to be run to provide input data for the gene-building system. The genomic sequence needs to be masked for repeats [e.g., with RepeatMasker (34)] before most other analyses are done. After the repeat masking, the next stage is to run both ab initio gene-prediction programs [e.g., Genscan (8), Fgenesh (35), Genie (32)] and database searches against the public databases. The point at which this “raw analysis” is complete is where the different genome pipelines start to diverge. Each site uses different programs and sources of data to produce its gene annotations, and this part of the analysis is the most dynamic and subject to change.
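
The ordering of these early stages can be sketched in a few lines of Python. This is a toy illustration and not the code of any pipeline named above: repeat coordinates are assumed to be known already (in practice they would come from a program such as RepeatMasker), and the `hard_mask` and `run_raw_analysis` functions, like the example analysis, are hypothetical.

```python
def hard_mask(sequence, repeat_intervals, mask_char="N"):
    """Replace bases inside half-open [start, end) repeat intervals with mask_char."""
    bases = list(sequence)
    for start, end in repeat_intervals:
        # Clip each interval to the sequence so out-of-range input is harmless.
        for i in range(max(start, 0), min(end, len(bases))):
            bases[i] = mask_char
    return "".join(bases)


def run_raw_analysis(sequence, repeat_intervals, analyses):
    """Mask repeats first, then run every downstream analysis on the masked sequence.

    analyses maps a name to a callable taking the masked sequence; real pipelines
    would launch external programs here (assumption for illustration).
    """
    masked = hard_mask(sequence, repeat_intervals)
    return {name: analysis(masked) for name, analysis in analyses.items()}


results = run_raw_analysis(
    "ACGTACGTAC",
    [(2, 5)],
    {"unmasked_bases": lambda seq: sum(base != "N" for base in seq)},
)
print(results)  # prints {'unmasked_bases': 7}
```

Masking once and passing the same masked sequence to every later step is the design point: none of the downstream programs then has to know about repeats.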

Gene-Prediction Methods

Gene-prediction methods have traditionally been described as falling into two distinct approaches. The first approach has been to use the statistical information contained within the genomic sequence to predict gene structures. The information used commonly includes compositional analysis of exons, introns, and splice sites, which may vary according to the background composition of the full genomic sequence. There are many programs that perform this ab initio prediction, including Genscan, Fgenesh, HMMGene (24), Grail (41), and Genie. The methods behind these programs are hidden Markov models (HMMs), neural networks, and linear discriminant functions, although the majority now use HMMs. Genscan is commonly regarded as the most accurate for predicting genes in the human genome, but all ab initio methods have a common weakness for large genomes. This weakness is partly due to the way genes are laid out in large vertebrate genomes. The amount of coding sequence compared to noncoding sequence is very small (approximately 2% of the human genome is estimated to be coding), which leads to large introns and large intergenic regions with relatively small exons. For instance, the mean size of an exon in the human genome is 120 base pairs, whereas the median intron size is 2000 base pairs. The large noncoding regions lead the ab initio programs astray, causing extra genes to be predicted in intergenic regions. The large introns within genes can also lead to extra exons being predicted within a true gene.

The second approach to gene prediction has used the wealth of sequence data from genes that is already stored in the public sequence databases [SWISS-PROT (3), EMBL (37), RefSeq (31)]. These are commonly referred to as similarity-based approaches and are founded around taking a protein or a cDNA and aligning it to the genomic sequence.
There are many programs that will do a local or global alignment of two sequences [e.g., BLAST (1)], but for gene annotation these programs are not good enough, as they do not have any knowledge of gene structure (how to model splice sites, for instance). For accurate alignment of proteins or cDNAs to genomic sequences, programs should be able to model splice sites, exons, introns, and, in the case of protein alignment programs, open reading frames. For aligning cDNAs to genomic sequences, programs such as SIM4 (16), EST_GENOME (27), BLAT (J. Kent, manuscript in preparation), and Exonerate (G. Slater & E. Birney, manuscript in preparation) are commonly used. Genewise (7) and Procrustes (17) are used for aligning proteins.

Sources of Genome Annotation

There are currently three main sites providing free annotation of the human genome. These are the Ensembl site (http://www.ensembl.org), NCBI (http://www.ncbi.nlm.nih.gov), and the genome browser at the University of California, Santa Cruz (UCSC) (http://genome.cse.ucsc.edu). Both approaches to gene prediction (ab initio and similarity based) are used and combined to provide the end user with a set of gene annotations.

ENSEMBL

The Ensembl gene-prediction system uses both ab initio and similarity-based methods but increases the specificity of the ab initio predictions by only using the ones confirmed by similarity to protein, cDNA, or expressed sequence tag (EST) sequences. Ensembl predicts genes in three steps: by placing known human genes onto the genome, then placing highly similar genes (perhaps from mouse or rat) onto the genome, and finally predicting novel genes from ab initio predictions supported by sequence similarity.

The first two steps are based around aligning proteins to the genome. The final alignment of each protein is done using the Genewise system, which aligns proteins to genomic DNA. Genewise uses an HMM to model exons, introns, and their splice sites. It can cope with sequencing error, as it can model frameshifts, and because it is protein based, stop codons can be penalized to produce a translatable gene structure. As Genewise is a very slow program and the genome is very large, each protein to be aligned is first placed on the genome using two faster methods. For the human proteins a very fast exact-matching program, pmatch (R. Durbin, unpublished), is used to place each protein at its approximate place in the genome. To refine the location of the protein, BLASTX is then used. This should locate the positions of the exons to within a few bases and also picks up small exons for which pmatch is not sensitive enough. Only then is Genewise used to align the protein. For the similar proteins the pmatch step cannot be used, as pmatch finds exact matches only. BLASTP is used instead to search the SWISS-PROT and TrEMBL databases.

Genewise is a protein alignment program; thus the resulting gene predictions will lack untranslated regions. To remedy this, all the human cDNAs are aligned to the genome using another program that is sensitive to splice sites, EST_GENOME.
Again, as for Genewise, to reduce the search time cDNAs are placed onto the genome using a two-step method before EST_GENOME is run. Instead of pmatch, the program Exonerate is used. Exonerate is a fast DNA-DNA alignment program that allows for insertions and deletions. This is used to place the cDNA within 1 Mb or so of its appropriate region. BLASTN is then used to locate the exon positions to within a few bases; then EST_GENOME is run on only those regions of the genome. The Genewise alignments and EST_GENOME alignments are then combined into complete gene structures.

The Genewise/Exonerate method for predicting genes accounts for 70% of the Ensembl genes. Genewise can only provide good alignments down to approximately 70% identity, so Ensembl uses a final method to predict genes by using weaker sequence similarities. As stated earlier, the ab initio program Genscan has good sensitivity but not very good specificity. Ensembl runs Genscan over the genome but only includes exons that have similarity to protein, EST, or cDNA sequences. The similarity matches are then used to join the exons together into gene structures.
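
The last step described above, keeping only those ab initio exons that are supported by similarity, amounts to an interval-overlap filter. The sketch below is a simplified illustration under stated assumptions and is not Ensembl's code: predicted exons and similarity hits are taken to be already available as coordinates on the same strand of one sequence, and the `Interval` type, function names, and coordinates are hypothetical.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    start: int  # 0-based, inclusive
    end: int    # exclusive


def overlaps(a, b):
    """True if two half-open intervals share at least one base."""
    return a.start < b.end and b.start < a.end


def supported_exons(predicted_exons, similarity_hits):
    """Keep only predicted exons that overlap at least one similarity hit."""
    return [
        exon for exon in predicted_exons
        if any(overlaps(exon, hit) for hit in similarity_hits)
    ]


# Hypothetical coordinates: three predicted exons, two similarity hits.
exons = [Interval(100, 220), Interval(900, 1010), Interval(5000, 5090)]
hits = [Interval(150, 210), Interval(4990, 5050)]
print(supported_exons(exons, hits))
# The exon at 900..1010 has no supporting hit and is dropped.
```

This quadratic scan is adequate for a sketch; at genome scale one would sort both lists and sweep them together, or use an interval index.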

NCBI

The gene predictions on the NCBI website are based around the Genomescan program (42). Genomescan combines ab initio gene predictions with similarities to protein sequences in order to predict gene structures that have at least one exon with supporting evidence from an existing protein sequence. Genomescan is an extension of the Genscan program, which uses a probabilistic model of exon-intron structure and compositional features of human genes. Genscan uses a generalized HMM that consists of a number of states modeling the various parts of a gene. These states include 5′ splice site, 3′ splice site, internal coding exon, start exon, and terminal exon. The final gene structure predicted by Genscan is the maximum probability path through the HMM. Genomescan extends Genscan by again predicting the gene structure based on an HMM, but this time the probability of the path is changed depending on the presence or absence of regions of sequence that are similar to known proteins. These regions of similarity can be generated using any method the user prefers. The BLASTX program is a popular method and is capable of generating the similarity regions down to the necessary sequence-similarity distance.

Because BLASTX will generate output that suggests that some regions of genomic sequence have protein homologies that are spurious, Genomescan will not automatically include every similarity region. Each region is evaluated and assigned a probability, and Genomescan combines these probabilities with the HMM-state probabilities, thus leading to predicted gene structures that are more likely than not to include the similarity regions.

Not all gene predictions on the NCBI site are predicted using Genomescan. The set of human cDNAs from RefSeq and EMBL/GenBank is aligned directly to the genome. After the exact cDNAs are aligned, the genome is split into putative gene boundaries using mRNA alignments.
Genomescan is then run on these regions with a range of similarities from different sources. Three sources of protein hits are used for Genomescan: (a) BLASTX of the genome against a nonredundant set of vertebrate proteins, (b) rpsBLAST of the genome against motifs, and (c) BLASTP of all protein sequence hits to Genscan predictions. The Genomescan predictions have probabilities associated with them, and the NCBI site puts them into two classes: those with probabilities greater than 1e-4 and those with probabilities less than 1e-4. A final class of Genomescan genes comprises those that are predicted with no similarity confirmation. These can be considered equivalent to the raw Genscan predictions.

It is worthwhile comparing Genomescan to other programs (Genewise, Procrustes) that also use protein similarities to produce gene structures. Genewise and Procrustes are more tightly bound to using all available sequence similarity and thus have very high specificity, but their sensitivity declines as the sequence similarities used become more and more distant. In practice this means that Genewise will fail to give any prediction with a distant protein because it cannot find an alignment. For instance, Genewise is 99% accurate with 99% coverage with a 90% identical protein, but when the protein used is only 70% identical the coverage falls to 30%. If you want to produce gene predictions that have a high probability of being correct and do not mind missing some genes, algorithms like Genewise are the best choice. If on the other hand you want more coverage and can accept some overprediction, approaches such as Genomescan are very useful, as their sensitivity remains high with more-distant proteins.
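
The idea of a maximum probability path whose score is shifted by similarity evidence can be illustrated with a deliberately tiny model. The Python sketch below is not Genscan or Genomescan: it uses an ordinary two-state HMM (noncoding and coding) rather than a generalized HMM with splice-site and exon states, all probabilities are invented for illustration, and the similarity evidence is reduced to a hypothetical additive log-score bonus for the coding state at positions covered by a hit.

```python
import math

STATES = ("N", "C")  # N = noncoding, C = coding

# Invented parameters, for illustration only.
LOG_START = {"N": math.log(0.5), "C": math.log(0.5)}
LOG_TRANS = {
    ("N", "N"): math.log(0.9), ("N", "C"): math.log(0.1),
    ("C", "C"): math.log(0.9), ("C", "N"): math.log(0.1),
}
LOG_EMIT = {
    "N": {"A": math.log(0.4), "T": math.log(0.4), "G": math.log(0.1), "C": math.log(0.1)},
    "C": {"A": math.log(0.1), "T": math.log(0.1), "G": math.log(0.4), "C": math.log(0.4)},
}


def viterbi(sequence, similar_positions=frozenset(), bonus=0.0):
    """Return the best-scoring state path for a non-empty sequence.

    Positions in similar_positions add `bonus` to the coding state's log score,
    a stand-in for evidence from a protein similarity hit.
    """
    def emit(state, i):
        score = LOG_EMIT[state][sequence[i]]
        if state == "C" and i in similar_positions:
            score += bonus
        return score

    best = {s: LOG_START[s] + emit(s, 0) for s in STATES}
    back = []  # back[i-1][s] = best predecessor of state s at position i
    for i in range(1, len(sequence)):
        pointers, scores = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: best[p] + LOG_TRANS[(p, s)])
            pointers[s] = prev
            scores[s] = best[prev] + LOG_TRANS[(prev, s)] + emit(s, i)
        back.append(pointers)
        best = scores

    # Trace back from the best final state.
    state = max(STATES, key=lambda s: best[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))


print(viterbi("AAAAGCAAAA"))                                                  # NNNNNNNNNN
print(viterbi("AAAAGCAAAA", similar_positions={4, 5}, bonus=math.log(10)))   # NNNNCCNNNN
```

Without the bonus, two GC-rich bases are not enough to pay for entering and leaving the coding state; with supporting evidence at those positions the best path changes. This mirrors, in miniature, how similarity evidence can tip a probabilistic gene model toward structures that overlap the hits.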

UCSC GENOME BROWSER

The UCSC human-genome site provides multiple sets of gene predictions, some of which are produced at UCSC and some of which are provided by external groups. The group at UCSC produces a set of genes originating from the human section of RefSeq, which is a curated set of gene structures. The alignment of RefSeq cDNAs is done with BLAT. BLAT is designed to very quickly find high-similarity (>95%) matches of lengths of 40 bases or more. Like other similar fast algorithms [e.g., SSAHA (29)], BLAT holds the whole genome in memory in the form of an index of nonoverlapping 11mers that are not found in repeats. Using this sequence index, BLAT finds the approximate location of a sequence to be aligned. Once the approximate location is found, a full alignment that allows for introns and splice-site modeling is then done.

Aligning RefSeq genes to a genome will only provide a subset of all human genes present in the genome. Another set of gene predictions on the UCSC browser that uses extra information is that produced using Acembly. The Acembly program predicts gene structures from cDNAs and ESTs and is built on top of the Acedb database (14). If there are ESTs or cDNAs that produce different models because of alternative splicing, all models will be generated. This is in contrast to the HMM-based method Genomescan, which will only produce the most probable model. It often happens that a cDNA will align to more than one region of the genome. This could be due to assembly errors or the presence of a very similar paralog or pseudogene. In these cases Acembly will only keep multiple alignments if they are of similar quality. If one alignment is worse than the other, only the best alignment will be kept. Acembly is especially focused on aligning ESTs to genomic sequence; for example, it allows for mismatches and frameshifts, and it can use and display the underlying trace data.
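
The index of nonoverlapping k-mers described for BLAT earlier in this section can be sketched as follows. This is a minimal illustration of the indexing idea only, under several assumptions: it is not BLAT's implementation, it uses k = 4 on an invented 36-base sequence instead of 11mers on a genome, it does not exclude repeats, and it stops at the approximate location without performing the final spliced alignment. The function names are hypothetical.

```python
from collections import Counter, defaultdict


def build_index(genome, k):
    """Index nonoverlapping k-mers: k-mer -> list of genome start positions."""
    index = defaultdict(list)
    for pos in range(0, len(genome) - k + 1, k):
        index[genome[pos:pos + k]].append(pos)
    return index


def approximate_location(index, query, k):
    """Return the best-supported genome offset for the start of the query, or None.

    Every overlapping k-mer of the query is looked up; each hit votes for the
    offset (genome position minus query position) that it implies.
    """
    votes = Counter()
    for qpos in range(len(query) - k + 1):
        for gpos in index.get(query[qpos:qpos + k], ()):
            votes[gpos - qpos] += 1
    if not votes:
        return None
    return votes.most_common(1)[0][0]


genome = "GATTACAGGCTTAACCGGTTACGTAGCTAGCATCGA"  # invented sequence
index = build_index(genome, 4)
query = genome[9:25]
print(approximate_location(index, query, 4))  # prints 9
```

Indexing only nonoverlapping k-mers of the genome keeps the index small, while scanning every overlapping k-mer of the query ensures that a sufficiently long exact match still meets an indexed k-mer regardless of its phase.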
Another predicted gene track on the UCSC browser comes from Softberry (http://www.softberry.com) and uses the program Fgenesh++, which is based on HMMs and protein similarity but with less emphasis on cDNA/EST data. Fgenesh++ is based on an idea similar to Genomescan and also came out of a group that had previously written ab initio prediction programs. Like Genomescan it uses similarity hits (commonly BLASTX) to known proteins to affect the probabilistic gene model and biases the final gene structure predicted to more often overlap the similarity hits.

Gene Prediction Using Other Genomes

With the arrival of the mouse genome sequence in the public domain, there is a new source of data for gene prediction that is the focus of much gene-prediction research. In theory, if you take the genome sequences of two species, the functional parts will be conserved and over time the nonfunctional regions of sequence will slowly diverge. So if you align two syntenic regions of the genomes, the conserved regions will highlight things like the exons and, possibly, the regulatory binding sites. This has the advantage of not needing a protein or cDNA already in the database to predict genes. In practice the mouse and human sequences have as many matches in introns as in exons, so it is not a trivial problem to sort the coding from the noncoding regions. Additionally, not all exons are conserved from mouse to human, so the problem turns from just trivially stringing the matches together to form genes into a more complicated issue.

Several groups have attempted to solve the problem of predicting gene structures by using genomic sequence from more than one organism. These include Twinscan (23) and Doublescan (I. Meyer & R. Durbin, unpublished). Twinscan is based on the generalized HMM of Genscan and integrates cross-species similarity into the probabilistic model of Genscan. It has similarities with Genomescan and Fgenesh++, as it too is attempting to combine similarity information with a probabilistic model. In some ways Twinscan has a more difficult problem to solve than the other methods, as the similar regions between two genomes may fall either in exons or introns or even regulatory binding sites.
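
The starting point for such comparative methods, locating conserved stretches in an alignment of two syntenic regions, can be sketched as a windowed identity scan. The Python below is a toy illustration and not the method of Twinscan or Doublescan: it assumes the two sequences have already been aligned to equal length with "-" for gaps, and the function names, window size, and threshold are hypothetical. As the text notes, a conserved window found this way may lie in an intron or a regulatory site as readily as in an exon.

```python
def window_identity(aligned_a, aligned_b, window, step=None):
    """Yield (start, fraction_identical) for windows over two aligned sequences.

    Gap columns never count as identical. Windows do not overlap by default.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    step = step or window
    for start in range(0, len(aligned_a) - window + 1, step):
        pairs = zip(aligned_a[start:start + window], aligned_b[start:start + window])
        matches = sum(1 for x, y in pairs if x == y and x != "-")
        yield start, matches / window


def conserved_windows(aligned_a, aligned_b, window, min_identity):
    """Return start positions of windows whose identity reaches min_identity."""
    return [
        start
        for start, identity in window_identity(aligned_a, aligned_b, window)
        if identity >= min_identity
    ]


# Invented aligned fragments standing in for human and mouse sequence.
human = "ACGTACGTAAAACCCC"
mouse = "ACGTACGTTTTTCCCC"
print(conserved_windows(human, mouse, window=4, min_identity=0.75))  # prints [0, 4, 12]
```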

Annotation Limitations

At the time of writing, just under 50% of the human genome sequence is available only in draft form. This means that, as well as there being sequencing errors in the draft part of the genome sequence, the fragmented nature of the sequence leads to problems in assembling the genome and subsequently problems in annotating it. The sequence errors can cause the annotation procedures to fail to predict a gene, owing to frameshifts. The problems in the assembly can lead to genes being fragmented. If, for instance, there is a local misassembly with two regions of sequence being in the wrong order or in the wrong orientation, it becomes almost impossible for the gene annotation methods to find all the exons for a gene because all the methods rely on exons being on the same strand and in the right order.

Also it must not be forgotten that there are still small regions of missing sequence (currently 5% of RefSeq genes cannot be placed on the genome), which can lead to genes being completely absent or only partially annotated. Missing sequences can also lead to a different problem: If a gene cannot be found in the genome, there may exist a close paralog or pseudogene elsewhere for which genomic sequence already exists. This may get annotated as the real gene, as it is the closest thing that can currently be found. The draft nature of much of the sequence makes it very difficult for the automatic methods to differentiate between a pseudogene, a close paralog, and the real gene because they cannot tell whether differences in the gene sequence arise from sequencing errors or from real divergence in the genomic sequence.

Finally, as all the methods described use, in part, data from the public databases, errors in the database sequences can be propagated onto the genome. This is especially dangerous when building genes using EST sequences, which are prone to genomic contamination.

Annotation Summary

The best gene-prediction methods in vertebrate genomes rely to a greater or lesser extent on similarities to known proteins or cDNAs/ESTs. This is in contrast to bacterial genomes where, owing to the smaller size of the genomes and reduced complexity in the gene structures, the available ab initio programs perform very well. There are various ways of incorporating similarities into gene annotation. There are the strict methods such as Genewise and the looser methods such as Genomescan. The former have high specificity but can only use high-similarity sequences. The latter have lower specificity but are able to produce gene structures using less-similar sequences. The three main sites all use a combination of these techniques to cover the whole range of gene prediction. Ensembl uses Genewise for exact proteins at the upper end, down to confirming Genscan exons and building gene structures at the lower end. UCSC aligns exact cDNAs using BLAT, but it also has external predictions using ESTs from Acembly and HMM and protein-based predictions from Fgenesh++ as well as the Ensembl predictions.

The sheer volume of sequence data that needs to be searched has led to a new generation of search algorithms like BLAT, SSAHA, and Exonerate. These algorithms are designed both to fit the shape of data being searched and to take advantage of the computers available today, whether they be single large memory machines (>4 Gb RAM) or multinode (500–1000 Gb) computer farms with modest (1 Gb) memory. A common theme through these algorithms is that time is too short for using the most-sensitive algorithms from the outset. BLAT, SSAHA, and Exonerate, as well as the Ensembl search framework for Genewise and EST_GENOME, all employ the technique of first finding an initial rough location for a sequence, a step that is usually very fast, and focusing on the details only once this initial location is found. This is very good for exact or near-exact matches.
For searches that return less-similar matches these shortcuts will not work, and one currently has to return to more traditional techniques such as BLAST.
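The two-stage strategy shared by these algorithms can be illustrated with a minimal Python sketch. It is not the implementation of BLAT, SSAHA, or Exonerate: the function names, the k-mer size, and the mismatch threshold are hypothetical choices made for illustration only. An index of exact k-mers supplies a fast rough location, and only the best candidate locations then receive a detailed comparison (here a deliberately simple ungapped mismatch count).

```python
from collections import defaultdict


def build_kmer_index(genome, k):
    """Map every k-mer in the genome to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(genome) - k + 1):
        index[genome[pos:pos + k]].append(pos)
    return dict(index)


def rough_locations(query, index, k):
    """Stage 1 (fast): exact k-mer hits vote for candidate start positions."""
    votes = defaultdict(int)
    for qpos in range(len(query) - k + 1):
        for gpos in index.get(query[qpos:qpos + k], ()):
            # A hit at genome position gpos implies the query starts at gpos - qpos.
            votes[gpos - qpos] += 1
    return sorted(votes, key=votes.get, reverse=True)


def count_mismatches(query, genome, start):
    """Stage 2 (slower): detailed check, here an ungapped mismatch count."""
    if start < 0 or start + len(query) > len(genome):
        return None
    window = genome[start:start + len(query)]
    return sum(q != g for q, g in zip(query, window))


def search(query, genome, index, k, max_mismatches=2, max_candidates=5):
    """Return (start, mismatches) for the first acceptable candidate, else None."""
    for start in rough_locations(query, index, k)[:max_candidates]:
        mismatches = count_mismatches(query, genome, start)
        if mismatches is not None and mismatches <= max_mismatches:
            return start, mismatches
    return None


if __name__ == "__main__":
    genome = "ACGTTGCAAGGCTTAACCGGTACGATCGATTTGACCA"
    query = "TTAACCGGAACGATCG"  # copy of genome[12:28] with one substitution
    index = build_kmer_index(genome, k=4)
    print(search(query, genome, index, k=4))
```

In this sketch a divergent query would share few or no exact k-mers with the genome, so stage 1 would return no useful candidates; this is the sense in which such shortcuts stop working for less-similar matches.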

TECHNOLOGY OF GENOME ANNOTATION

The process of creating, managing, and displaying genome information requires nontrivial software. This is because of the scale of storage and computational resources required and the complexity of the problem, both in automatic and


TABLE 1 A list of genome management systems and their associated features

System     Relational(a)  Data access(b)  Portable(c)  Language(d)  Web access(e)  Graphical viewer(f)  Developer core(g)  DAS server(h)  DAS viewer(i)
Ensembl    Yes            Yes             Yes          Perl/Java    Yes            Yes                  40                 Yes            Yes
UCSC       Yes            Yes             Yes          C            Yes            No                   10                 Yes            No
Wormbase   No             Yes             No           Perl/C       Yes            Yes                  5                  Yes            Yes
Flybase    Partial        Yes             No           Perl/Java    Yes            Yes                  5                  Yes            Yes
NCBI       Partial        Yes             No           C            Yes            No                   20                 No             No
Genquire   Yes            Yes             Yes          Perl         No             Yes                  3                  No             Yes
SGD(j)     Yes            Yes             No           Perl         Yes            No                   5                  No             No

(a) Whether an RDMS is used.
(b) Whether access to the data is freely available.
(c) Whether the entire system can be ported to a new site or not.
(d) Principal programming language employed in the database.
(e) Whether the database is accessible via the web.
(f) Whether the data can be viewed through a graphical application.
(g) Estimate of the number of people on the core development team for the system.
(h) Whether the system can act as a DAS server.
(i) Whether the system can act as a DAS client.
(j) Saccharomyces Genome Database (9).

manual annotation processes. The genome systems deployed use a mosaic of software components but generally with more similarities than differences. The various systems and some of their features are outlined in Table 1.

The management of genomic information requires the long-term storage of its data and easy programmatic ways to access and update this information. Nearly all genome systems now use a relational database management system (RDMS) as their core data store. The advantages of using an RDMS are as follows: (a) Considerable work on the theory and implementation has been undertaken in the computer industry over the past two decades, providing well-understood, robust systems, in particular with respect to backups and data integrity. (b) Relational databases use a standardized query language (SQL), and all major programming languages have bindings to SQL interfaces, thus facilitating programmatic access. (c) There is a large pool of RDMS- and SQL-savvy programmers who can easily be brought into an RDMS-based project. The main implementations are Oracle and Sybase (which are commercial products) and Postgres and MySQL (which are open-source projects). MySQL is not strictly an RDMS, but for practical purposes it can be considered one and is widely used for genome data storage.

For most systems direct programmatic access to the database is not encouraged; instead a number of software abstractions are provided on top of the database. These abstractions, called an application programming interface (API) or a middleware layer, allow the database schema to be insulated somewhat from the main set of programmatic clients to the system. These middleware layers differ in complexity and layout, but a number of systems, in particular Ensembl, GadFly, Wormbase (36), and Genquire, reuse the open-source projects BioPerl and BioJava (30) to provide a core set of interfaces. This reuse of interfaces sometimes provides the ability to mix and match components directly (e.g., it is possible to use parts of the Genquire viewer with Ensembl directly), but more practically it eases the pain in porting components between projects.

The genome management system must allow viewing and, for manual systems, editing of the genomic annotation by users. For most of the previous decade the main system to use was the FMAP display of Acedb (15, 21), which is still used in Wormbase and at the Sanger Institute for genome annotation. However, the late 1990s saw the rise of the Web as the principal mode of communicating to users, and the paradigm of “web views” of a database has been enthusiastically taken up by genome databases. This is expanded on in the next section. The core of a genome web display is generally a graphical view of a slice of the genome. As this slice can be arbitrarily positioned, the view has to be built on the fly, pulling information from an appropriate datastore and constructing an image. Returning these images within the couple of seconds that users will tolerate is a considerable technical challenge. For example, the UCSC browser goes to some length to optimize its data access pattern, and Ensembl has a second data-transformation step optimized for web displays.

Editing genomic annotation, in contrast to viewing, is an activity that is practiced by a smaller number of people, generally hired by the database, called curators. A curator will spend a high proportion of their time editing annotation, meaning productivity gains provided by the software are crucial. Editing systems are always stand-alone applications, and the main ones are the FMAP display built into Acedb (Wormbase, Sanger Institute), the Apollo editor (joint project between Ensembl and GadFly), Genquire (Genquire system), and Artemis (33). These editors also provide a helpful “power user” application tool for viewing.

A new development in genome viewing is the Distributed Annotation System (DAS) (13). DAS is a protocol that allows a client computer to contact multiple DAS-aware servers and retrieve and integrate genomic annotations. From the perspective of DAS, a genomic annotation is a range on a piece of genomic sequence that, importantly, carries the information a user needs to view a web page associated with the feature. This protocol runs over the standard HTTP web protocol and can be considered a type of “web service.” An important feature of DAS is that one does not need to present the entire genome to provide a DAS server; indeed, one can serve up annotations on a small region of the genome. This allows much smaller laboratories, down to modest wet labs with some computer expertise, to run their own DAS servers. The DAS specification has been active for approximately 1.5 years to date, and there are already two stand-alone systems for serving DAS annotations, these being Dazzle from the BioJava project and LDAS, which uses the BioPerl framework. On the client side, the most accessible way to use DAS is through the web browsers associated with the main genome projects: Wormbase, GadFly, and Ensembl all offer this capability. Already there has been a reassuring uptake of DAS usage. One common use case is to present internal work only to researchers in that lab, without publicizing the DAS server beyond a closed group of researchers.
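The client-side integration step that DAS relies on can be sketched in a few lines of Python. This is an illustrative sketch only: the XML below is a simplified, DAS-like layout written for this example (the element and attribute names should be checked against the DAS specification before use), the source names and URLs are hypothetical, and a real client would fetch each response over HTTP rather than read it from a string. The sketch shows annotations from two independent sources, each covering only part of the genome, being merged into one ordered list for a requested region.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    source: str   # which server supplied the annotation
    segment: str  # piece of genomic sequence, e.g. a chromosome name
    start: int
    end: int
    label: str
    link: str     # web page associated with the feature, if any


# Hypothetical responses from a small laboratory server and a large center.
EXAMPLE_RESPONSES = {
    "small_lab": (
        '<DASGFF><GFF><SEGMENT id="20" start="1000" stop="5000">'
        '<FEATURE id="f1" label="lab_clone_A"><START>1200</START>'
        '<END>1800</END><LINK href="http://example.org/f1"/></FEATURE>'
        '<FEATURE id="f2" label="lab_clone_B"><START>9000</START>'
        '<END>9500</END></FEATURE></SEGMENT></GFF></DASGFF>'
    ),
    "large_center": (
        '<DASGFF><GFF><SEGMENT id="20" start="1000" stop="5000">'
        '<FEATURE id="g1" label="gene_X"><START>1500</START>'
        '<END>4000</END><LINK href="http://example.org/g1"/></FEATURE>'
        '</SEGMENT></GFF></DASGFF>'
    ),
}


def parse_features(source_name, xml_text):
    """Turn one server response into a list of Feature records."""
    features = []
    for segment in ET.fromstring(xml_text).iter("SEGMENT"):
        for feat in segment.iter("FEATURE"):
            link = feat.find("LINK")
            features.append(Feature(
                source=source_name,
                segment=segment.get("id"),
                start=int(feat.findtext("START")),
                end=int(feat.findtext("END")),
                label=feat.get("label", feat.get("id", "")),
                link=link.get("href") if link is not None else "",
            ))
    return features


def merge_sources(responses, segment, start, end):
    """Integrate features from all sources that overlap the requested range."""
    merged = []
    for name, xml_text in responses.items():
        for feature in parse_features(name, xml_text):
            overlaps = feature.start <= end and feature.end >= start
            if feature.segment == segment and overlaps:
                merged.append(feature)
    return sorted(merged, key=lambda f: (f.start, f.end, f.source))


if __name__ == "__main__":
    for feature in merge_sources(EXAMPLE_RESPONSES, "20", 1000, 5000):
        print(feature.source, feature.label, feature.start, feature.end)
```

Because each source is parsed independently, a server that annotates only a small region contributes on equal terms with one that covers the whole genome.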


A genome database needs to be populated with useful annotation. We have already described in detail the process of automatic annotation, but whether annotation is automatic or manual, a considerable amount of computational work needs to occur. The system must provide an environment to execute these computations, which, for large eukaryotic genomes such as human, require considerable resources. All genome centers currently use the “farm” configuration, i.e., a set of dedicated computers with a perfectly replicated set of resources on each farm node. The design of such a system requires coordination of CPU, floor space, air conditioning, and network resources, but such farms are common in the computer industry. As many of the computational tasks in genome annotation are “embarrassingly parallel,” in the sense that they require no communication between the different execution points, the structure of the work is generally to split the problem into a large number of jobs and then schedule the jobs to run on the farm. The job scheduler used is usually LSF, PBS, or Grid Engine.
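The split-then-schedule pattern can be sketched as follows. This is a hedged illustration, not a description of any center's pipeline: a local thread pool stands in for a batch scheduler such as LSF, PBS, or Grid Engine, the chunk size is arbitrary, and counting G and C bases is a placeholder for a real analysis such as a gene-prediction run. The function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_jobs(seq_length, chunk_size):
    """Cut a sequence of seq_length bases into independent (start, end) jobs."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [(start, min(start + chunk_size, seq_length))
            for start in range(0, seq_length, chunk_size)]


def gc_count(sequence, job):
    """Placeholder analysis: count G and C bases in one chunk."""
    start, end = job
    return sum(base in "GC" for base in sequence[start:end])


def run_on_farm(sequence, jobs, workers=4):
    """Run every job independently; no job communicates with any other."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda job: gc_count(sequence, job), jobs))


if __name__ == "__main__":
    sequence = "ACGT" * 250
    jobs = split_into_jobs(len(sequence), chunk_size=300)
    per_job = run_on_farm(sequence, jobs)
    print(jobs, per_job, sum(per_job))
```

The only coordination needed is at the two ends: cutting the problem into jobs and collecting the results, which is why such work maps well onto farm nodes with identical resources.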

Web Views

Many users of a genome do not have the time or inclination to learn a large complex software system. The genome databases cater to these consumers, who are content with web views to the genomic databases. Like the software systems employed, these web views are mainly independently developed but share more similarities than differences. In this review we concentrate on displays provided by the same three main sources of annotation of the human genome previously described (Ensembl, NCBI, and UCSC) because they are the displays we are most used to and because they are the resources most likely to be long lived in their current forms.

In the Ensembl website the front page presents the user with a clickable karyotype, a sequence search interface via either BLAST or SSAHA, and a text search box. The search systems, whether sequence or text, end up on one of two main pages, geneview or contigview, whereas the karyotype ends up only on a contigview page. An example of the contigview page is given in Figure 1. Contigview shows a slice of the genome with three levels of contextual information: the position on an idealized karyotype, the general locale of the region of 1 megabase each side, and a detailed view that the user can zoom from 500 base pairs to 1 megabase. The user also generally controls the bottom detailed view, which shows a number of genomic features; of particular importance are the confirmed genes from Ensembl. Ensembl has built into its system 50 or so different “tracks” of various types of features that can be displayed on contigview (using DAS, this number is unlimited). Strandedness is shown by the track being located above or below a central sequence line. Only a subset of the tracks, approximately 10, are switched on by default. A set of drop-down menus controls whether a track is on or off as well as other display properties, such as height and color.
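The logic behind such a detailed view can be sketched briefly. The following is an illustrative sketch and not Ensembl code: the class, function, and track names are hypothetical, and only the zoom limits (500 base pairs to 1 megabase) and the convention of placing tracks above or below a central line according to strand are taken from the description above.

```python
from dataclasses import dataclass

MIN_WIDTH = 500         # smallest detailed view, in base pairs
MAX_WIDTH = 1_000_000   # largest detailed view, 1 megabase


@dataclass(frozen=True)
class Feature:
    track: str
    start: int
    end: int
    strand: int  # +1 for forward strand, -1 for reverse strand


def clamp_window(center, width):
    """Return (start, end) for a view, keeping the width within zoom limits."""
    width = max(MIN_WIDTH, min(MAX_WIDTH, width))
    start = max(1, center - width // 2)
    return start, start + width - 1


def layout_slice(features, start, end, enabled_tracks):
    """Group overlapping features of enabled tracks above or below the line."""
    above, below = {}, {}
    for feature in features:
        if feature.track not in enabled_tracks:
            continue  # track is switched off
        if feature.end < start or feature.start > end:
            continue  # feature lies outside the slice
        side = above if feature.strand > 0 else below
        side.setdefault(feature.track, []).append(feature)
    return above, below


if __name__ == "__main__":
    features = [
        Feature("genes", 9_800, 10_100, +1),
        Feature("genes", 10_150, 10_400, -1),
        Feature("ests", 9_900, 10_000, +1),
        Feature("genes", 50_000, 51_000, +1),
    ]
    start, end = clamp_window(center=10_000, width=100)
    print(start, end, layout_slice(features, start, end, {"genes"}))
```

Because the window can be placed anywhere, this selection has to run for every request, which is one reason the underlying data access pattern matters so much for response time.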
The second important page in Ensembl is geneview, which shows more detailed information about a gene. It can be reached either directly by one of the search systems or from contigview. Geneview provides detailed information on the intron/exon structure and shows all known identifiers associated with that structure, in particular, Human Genome Nomenclature (HUGO) names, Swissprot, and RefSeq accession numbers. In addition the protein translations of the gene are compared against InterPro (2), allowing genes with related protein domains to be clustered together.

The UCSC website is focused around a single page that displays a slice of the genome. Either a text search box or the BLAT search page takes the user to this slice, where again tracks of genomic features are shown. At UCSC, the strandedness of the tracks is indicated by small chevrons on the features. The overall paradigm at UCSC is to provide the user with all the information about features on the genomes without imposing too much interpretation on the results. UCSC therefore does not have a specific combined gene prediction, but rather shows both the RefSeq sequences aligned to the genome and, underneath those, the gene predictions from a number of groups (which usually include the Ensembl gene predictions). Clicking on the feature of interest provides a small page with somewhat more detail; for UCSC-generated data this generally includes the alignment of the sequence, whereas for contributed tracks a link back to the source website of that feature is provided. Additional tracks are also available and can be expanded or contracted either by a mouse click or by a control panel at the bottom of the page. One of the strengths of the UCSC system has been its open and aggressive acquisition of contributed tracks. This means that the various versions of the same genome may have a large number of different tracks depending on what analysis was done and contributed to UCSC.

The NCBI website has a two-stage view of the genomic sequence that is heavily coordinated with its LocusLink resource.
The mapview page provides a view of the genome that is presented principally as a set of mapping resources on a vertical chromosome. The user who wishes to zoom into a region of interest to get a finely detailed view needs to move over to the sequenceview section of the region, which shows a scrolled base-pair level view of the genomic sequence with a number of features labeled on it. At either the mapview page or the sequenceview page, for gene features the user can click out to the LocusLink page, which provides a single point of contact for gene information, with many jumping off points into other areas of NCBI. Mainly by virtue of the LocusLink page, the genome resources are well integrated into other NCBI resources, including the ability to jump from BLAST results into the genome.

As with the underlying software there is growing coordination among websites. Currently the Ensembl and UCSC genome sites are cross-linked at their respective “genome slice” views; both the Ensembl site and the UCSC site point to LocusLink pages, and LocusLink points back out to the other web sites. Although users are presented with somewhat different interfaces and sometimes quite radically varying data or interpretations of the data, the duplication of websites for the human genome has been a generally beneficial approach because each site has strengths in different areas. For example, Ensembl’s strength lies in its consistency of data and interpretation across the site and its strong exporting tools, the UCSC site has the variety of analysis and the resistance to imposing a single interpretation, and NCBI’s integration with other molecular biology resources is helpful. These varying strengths have attracted different types of users to the sites, and the cross-linking between sites allows users to pick and choose the best from the various sites. We are hopeful that this trend of coordination will continue over time, in particular in work on other genomes.

Outside of vertebrate genomes, the web views of GadFly and Wormbase are unsurprisingly similar as they are based on the Bio::Graphics web package. This gives a similar horizontal slice of the genome with interesting features highlighted. Again, like Ensembl, the view is DAS aware, allowing other groups to add their annotations to this display.

Data Resources

As well as bench biologists wanting to access specific types of genes or regions of interest through the web, there are a variety of other power users and bioinformaticians who are less interested in web usability and more interested in appropriate slices of the data. Different databases take varying approaches to providing data access. Ensembl probably goes the furthest by providing a fully portable system that can be mirrored and a standard flat-file style distribution of the data through web-accessible bulk downloads, including Excel spreadsheets and arbitrary slices of the genome. Ensembl also provides an internet-accessible MySQL host providing full SQL access to the datastore to anyone with a MySQL client. UCSC provides a very effective server for slices of the genome sequence, along with bulk download of tab-delimited tables that represent the underlying database. At NCBI there is only a flat-file bulk distribution, but it is one that aims to integrate well into its other flat-file resources. One interesting development in this area that we are aware of is the development of advanced query interfaces to the database at Ensembl. This allows a user to supply in a web form a query such as “all the validated SNPs within 5 kb of genes with protein kinase domains,” with the result being a formatted Excel spreadsheet. Providing power users with a mixture of web-accessible and bulk-download forms seems likely to be a growing development.
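To make the example query concrete, the sketch below expresses “all the validated SNPs within 5 kb of genes with protein kinase domains” as SQL. It is a sketch under stated assumptions: the three-table schema, the column names, and the sample rows are invented for illustration and do not reflect the actual Ensembl schema, and an in-memory SQLite database stands in for a MySQL server so that the example is self-contained.

```python
import sqlite3

# Hypothetical, heavily simplified schema (not the Ensembl schema).
SCHEMA = """
CREATE TABLE gene (gene_id INTEGER PRIMARY KEY, name TEXT, chrom TEXT,
                   seq_start INTEGER, seq_end INTEGER);
CREATE TABLE protein_domain (gene_id INTEGER, domain TEXT);
CREATE TABLE snp (snp_id TEXT PRIMARY KEY, chrom TEXT, pos INTEGER,
                  validated INTEGER);
"""

QUERY = """
SELECT DISTINCT s.snp_id, g.name
FROM snp AS s
JOIN gene AS g
  ON g.chrom = s.chrom
 AND s.pos BETWEEN g.seq_start - :flank AND g.seq_end + :flank
JOIN protein_domain AS d ON d.gene_id = g.gene_id
WHERE s.validated = 1 AND d.domain = :domain
ORDER BY s.snp_id
"""


def build_demo_db():
    """Create an in-memory database holding a few invented rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO gene VALUES (?, ?, ?, ?, ?)", [
        (1, "K1", "20", 10_000, 20_000),
        (2, "G2", "20", 50_000, 60_000),
    ])
    conn.executemany("INSERT INTO protein_domain VALUES (?, ?)", [
        (1, "protein kinase"),
        (2, "zinc finger"),
    ])
    conn.executemany("INSERT INTO snp VALUES (?, ?, ?, ?)", [
        ("rs1", "20", 7_000, 1),    # validated, 3 kb upstream of K1
        ("rs2", "20", 4_000, 1),    # validated, but 6 kb from K1
        ("rs3", "20", 15_000, 0),   # inside K1, but not validated
        ("rs4", "20", 55_000, 1),   # inside G2, which has no kinase domain
        ("rs5", "1", 12_000, 1),    # right coordinates, wrong chromosome
    ])
    return conn


def validated_snps_near_domain_genes(conn, domain, flank=5_000):
    """Return (snp_id, gene name) pairs within `flank` bases of matching genes."""
    return conn.execute(QUERY, {"flank": flank, "domain": domain}).fetchall()


if __name__ == "__main__":
    connection = build_demo_db()
    print(validated_snps_near_domain_genes(connection, "protein kinase"))
```

A web form of the kind described would collect the domain name and flank size from the user, run a query of this shape, and format the rows for download.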

FUTURE OUTLOOK

With more than 20 new eukaryotic genomes expected on the horizon, it is clear that the process of handling genomes will become even better understood, simply by necessity. We expect that the informal sharing of concepts and small components between groups will grow, resulting within five years in one or two main sets of components that can be gathered together for genomic annotation. The open-source nature of some of the projects should accelerate this process.

The democratization of genome annotation through DAS (19) shows great promise, with many small laboratories already setting up DAS servers or using the web-accessible DAS upload servers. The presence of DAS prevents genome annotation from being solely the domain of the large centers and allows both biologists and research bioinformaticians to concentrate on their area of expertise.

All sites are providing the ability to integrate “vertically” from the genome up toward proteome and functional aspects of the genome. In addition, many of the systems are being cloned in a “horizontal” manner to provide the analysis and presentation of other genomes. The integration of the entire space, i.e., being able to easily pose queries that, say, efficiently and correctly correlate the quantitative trait loci in rat with the haplotype maps in human, is fascinating to consider. It is likely that the next couple of years, with the advent of many new genomes, will force the development of new concepts and potentially new technologies to meet this challenge.

The Annual Review of Genomics and Human Genetics is online at http://genom.annualreviews.org

LITERATURE CITED

1. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. 1990. Basic local alignment search tool. J. Mol. Biol. 215:403–10
2. Apweiler R, Attwood TK, Bairoch A, Bateman A, Birney E, et al. 2000. InterPro—an integrated documentation resource for protein families, domains and functional sites. Bioinformatics 16:1145–50
3. Bairoch A, Apweiler R. 2000. The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucleic Acids Res. 28:45–48
4. Bateman A, Birney E, Cerruti L, Durbin R, Etwiller L, et al. 2002. The Pfam protein families database. Nucleic Acids Res. 30:276–80
5. Baxevanis AD. 2002. The Molecular Biology Database Collection: 2002 update. Nucleic Acids Res. 30:1–12
6. Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Rapp BA, Wheeler DL. 2002. GenBank. Nucleic Acids Res. 30:17–20
7. Birney E, Durbin R. 1997. Dynamite: a flexible code generating language for dynamic programming methods used in sequence comparison. ISMB 5:56–64
8. Burge C, Karlin S. 1997. Prediction of complete gene structures in human genomic DNA. J. Mol. Biol. 268:78–94
9. Cherry JM, Adler C, Ball C, Chervitz SA, Dwight SS, et al. 1998. SGD: Saccharomyces Genome Database. Nucleic Acids Res. 26:73–79
10. Consortium F. 2002. The FlyBase database of the Drosophila genome projects and community literature. Nucleic Acids Res. 30:106–8
11. Consortium IHGS. 2001. Initial sequencing and analysis of the human genome. Nature 409:860–921
12. Deloukas P, Matthews LH, Ashurst J, Burton J, Gilbert JG, et al. 2001. The DNA sequence and comparative analysis of human chromosome 20. Nature 414:865–71
13. Dowell RD, Jokerst RM, Day A, Eddy SR, Stein L. 2001. The distributed annotation system. BMC Bioinform. 2:7
14. Durbin D, Thierry-Mieg J. 1991. A C. elegans database. http://www.sanger.ac.uk/Software/Acedb/
15. Eeckman FH, Durbin R. 1995. ACeDB and macace. Methods Cell Biol. 48:583–605
16. Florea L, Hartzell G, Zhang Z, Rubin GM, Miller W. 1998. A computer program for aligning a cDNA sequence with a genomic DNA sequence. Genome Res. 8:967–74
17. Gelfand MS, Mironov AA, Pevzner PA. 1996. Gene recognition via spliced sequence alignment. Proc. Natl. Acad. Sci. USA 93:9061–66
18. Hubbard T, Barker D, Birney E, Cameron G, Chen Y, et al. 2002. The Ensembl Genome Database Project. Nucleic Acids Res. 30:38–41
19. Hubbard T, Birney E. 2000. Open annotation offers a democratic solution to genome sequencing. Nature 403:825
20. Kanehisa M, Goto S, Kawashima S, Nakaya A. 2002. The KEGG databases at GenomeNet. Nucleic Acids Res. 30:42–46
21. Kelley S. 2000. Getting started with Acedb. Brief Bioinform. 1:131–37
22. Kerlavage A, Bonazzi V, di Tommaso M, Lawrence C, Li P, et al. 2002. The Celera Discovery System. Nucleic Acids Res. 30:129–36
23. Korf I, Flicek P, Duan D, Brent MR. 2001. Integrating genomic homology into gene structure prediction. Bioinformatics 17:S140–48
24. Krogh A. 2000. Using database matches with HMMGene for automated gene detection in Drosophila. Genome Res. 10:523–28
25. Lindblad-Toh K, Lander ES, McPherson JD, Waterston RH, Rodgers J, Birney E. 2001. Progress in sequencing the mouse genome. Genesis 31:137–41
26. Lo Conte L, Brenner SE, Hubbard TJ, Chothia C, Murzin AG. 2002. SCOP database in 2002: refinements accommodate structural genomics. Nucleic Acids Res. 30:264–67
27. Mott R. 1997. EST GENOME: a program to align spliced DNA sequences to unspliced genomic DNA. Comput. Appl. Biosci. 13:477–78
28. Mural RJ, Parang M, Shah M, Snoddy J, Uberbacher EC. 1999. The Genome Channel: a browser to a uniform first-pass annotation of genomic DNA. Trends Genet. 15:38–39
29. Ning Z, Cox AJ, Mullikin JC. 2001. SSAHA: a fast search method for large DNA databases. Genome Res. 11:1725–29
30. Pocock MR, Down T, Hubbard TJP. 2000. BioJava: open source components for bioinformatics. SIGBIO Newsl. 20:10–12
31. Pruitt KD, Maglott DR. 2001. RefSeq and LocusLink: NCBI gene-centered resources. Nucleic Acids Res. 29:137–40
32. Reese MG, Kulp D, Tammana H, Haussler D. 2000. Genie—gene finding in Drosophila melanogaster. Genome Res. 10:529–38
33. Rutherford K, Parkhill J, Crook J, Horsnell T, Rice P, et al. 2000. Artemis: sequence visualization and annotation. Bioinformatics 16:944–45
34. Smit AF. 1999. Interspersed repeats and other mementos of transposable elements in mammalian genomes. Curr. Opin. Genet. Dev. 9:657–63
35. Solovyev VV, Salamov AA, Lawrence CB. 1995. Identification of human gene structure using linear discriminant functions and dynamic programming. Proc. Int. Conf. Intell. Syst. Mol. Biol. 3:367–75
36. Stein L, Sternberg P, Durbin R, Thierry-Mieg J, Spieth J. 2001. WormBase: network access to the genome and biology of Caenorhabditis elegans. Nucleic Acids Res. 29:82–86
37. Stoesser G, Baker W, van den Broek A, Camon E, Garcia-Pastor M, et al. 2002. The EMBL Nucleotide Sequence Database. Nucleic Acids Res. 30:21–26
38. Tateno Y, Imanishi T, Miyazaki S, Fukami-Kobayashi K, Saitou N, et al. 2002. DNA Data Bank of Japan (DDBJ) for genome scale research in life science. Nucleic Acids Res. 30:27–30
39. Venter JC, Adams MD, Myers EW, Li PW, Mural RJ, et al. 2001. The sequence of the human genome. Science 291:1304–51
40. Westbrook J, Feng Z, Jain S, Bhat TN, Thanki N, et al. 2002. The Protein Data Bank: unifying the archive. Nucleic Acids Res. 30:245–48
41. Xu Y, Uberbacher EC. 1997. Automated gene identification in large-scale genomic sequences. J. Comput. Biol. 4:325–38
42. Yeh RF, Lim LP, Burge CB. 2001. Computational inference of homologous gene structures in the human genome. Genome Res. 11:803–16


Figure 1 Screenshot of Ensembl contigview (human version 3.26.1), showing the region of human chromosome 20 around genome sequence accession AL035454. The region is shown at three resolutions, and navigation (recenter on click) is possible by clicking in any of the three panels. Panels can be turned off by clicking on the box containing a “-”. (top) In the Chromosome panel a red box shows the region being viewed in p11.23, with respect to the cytogenetic-banding pattern of the entire chromosome. (middle) In the Overview panel a second red box similarly shows the region being viewed in detail below. The middle panel shows the location of markers and genes, by default over 1 Mbase. Genes predicted automatically by the Ensembl system are colored brown and labeled with either HUGO identifiers or SPTREMBL identifiers if they are known. Novel Ensembl genes (see text for definition) are labeled as such and shown in black. Curated gene structures from the Sanger Institute human annotation group are shown in various shades of blue, depending on their type, as described in the recent publication of the chromosome 20 sequence (12). Other annotated genes from EMBL/Genbank sequence files, where present, are shown in green. (bottom) The Detailed View panel shows genomic sequence features in detail, by default over 100 Kb. The gene color scheme is as for the Overview panel, with the addition of sequence contig based Genscan ab initio predictions shown in cyan. Matches to the mouse genome sequence are shown in pink and grouped into synteny blocks, in this case showing synteny over the entire region to mouse chromosome 2. The region being viewed can be zoomed and recentered with the computer mouse or specified precisely in chromosomal coordinates. The pull-down menu shown is one of several and allows the user to select the features being displayed. The second pull-down menu allows the addition of annotation from third-party DAS sources. Floating menus (not shown) appear as the mouse is moved over any feature, allowing access to pages with additional information. In this view some sequence homology features, which support the Ensembl gene predictions, are shown in full (e.g., human mRNAs: one line for each distinct mRNA), and some are shown in a compact format (e.g., ESTs: all hits shown as overlapping blocks on a single line).