Databases Featuring Genomic Sequence Alignment
Total Page:16
File Type:pdf, Size:1020Kb
D466–D470 Nucleic Acids Research, 2005, Vol. 33, Database issue doi:10.1093/nar/gki045 Improvements to GALA and dbERGE II: databases featuring genomic sequence alignment, annotation and experimental results Laura Elnitski1,2,*, Belinda Giardine1, Prachi Shah1, Yi Zhang1, Cathy Riemer1, Matthew Weirauch4, Richard Burhans1, Webb Miller1,3 and Ross C. Hardison2 1Department of Computer Science and Engineering, 2Department of Biochemistry and Molecular Biology and 3Department of Biology, The Pennsylvania State University, University Park, PA 16802, USA and 4Department of Computer Science, University of California, Santa Cruz, CA 95064, USA Received August 14, 2004; Revised and Accepted September 28, 2004 ABSTRACT gene expression patterns, requires bioinformatic tools for organization and interpretation. Genome browsers, such as We describe improvements to two databases that give the UCSC Genome Browser (1,2), Ensembl (3) and Map access to information on genomic sequence similar- Viewer at NCBI (4), provide views of genes and genomic ities, functional elements in DNA and experimental regions with user-selected annotations. results that demonstrate those functions. GALA, the To provide querying capacity across data types, we devel- database of Genome ALignments and Annotations, is oped GALA, a Genome Alignment and Annotation database. now a set of interlinked relational databases for five The first release recorded whole-genome human–mouse align- vertebrate species, human, chimpanzee, mouse, rat ments along with extensive annotation of the human genome and chicken. For each species, GALA records pair- in a relational database (5). An example of the use of GALA is wise and multiple sequence alignments, scores to find highly conserved regions that do not code for proteins; derived from those alignments that reflect the likeli- some of these could have novel functions. A second database, dbERGE, was originally designed as a repository of Exper- hood of being under purifying selection or being a imental Results on Gene Expression (6). The initial implemen- regulatory element, and extensive annotations such tation used a custom-designed database management system, as genes, gene expression patterns and transcription and the interface for novice users provided limited querying factor binding sites. The user interface supports capacity. simple and complex queries, including operations This report describes improvements to GALA and dbERGE such as subtraction and intersections as well as clus- II, such as new database management systems, improved tering and finding elements in proximity to features. query interfaces, additional types of data and expansion to dbERGE II, the database of Experimental Results on include additional species. Technical advances have improved Gene Expression, contains experimental data from a the response time. New connections allow results from queries variety of functional assays. Both databases are on dbERGE II to be imported into GALA. Output can be now run on the DB2 database management system. obtained in many formats, including exporting to the UCSC or Ensembl browsers as custom tracks or interactive viewing Improved hardware and tuning has reduced response of alignments with Laj (7). Links to additional data-mining times and increased querying capacity, while simpl- resources and repositories, such as EnsMart (8) and rVista (9), ified query interfaces will help direct new users enhance the utility of the database. through the querying process. Links are available at http://www.bx.psu.edu/. GALA DESCRIPTION AND STRUCTURE INTRODUCTION Query form and results pages The volume of genomic information from sequenced vertebrate The GALA database, available at http://www.bx.psu.edu/, has species, including sequence data, extensive annotation and undergone numerous improvements and expanded in size *To whom correspondence should be addressed. Tel: +1 814 865 4747; Fax: +1 814 863 6699; Email: [email protected] The online version of this article has been published under an open access model. Users are entitled to use, reproduce, disseminate, or display the open access version of this article for non-commercial purposes provided that: the original authorship is properly and fully attributed; the Journal and Oxford University Press are attributed as the original place of publication with the correct citation details given; if an article is subsequently reproduced or disseminated not in its entirety but only in part or as a derivative work this must be clearly indicated. For commercial re-use permissions, please contact [email protected]. ª 2005, the authors Nucleic Acids Research, Vol. 33, Database issue ª Oxford University Press 2005; all rights reserved Nucleic Acids Research, 2005, Vol. 33, Database issue D467 since its original description in 2003 (see Supplementary New data tracks Figure 1). Users choose (i) the species of interest (human, The number and variety of data types continues to grow; a full mouse, rat or chicken at the time of this writing, chimpanzee list of data fields is at http://gala.cse.psu.edu/outline.html. will be added soon) and (ii) the assembly for each species New genome-wide datasets emphasize gene expression (e.g. hg 13-17, mm 3-5). Simple queries on one or a small and gene regulation, including (i) three- and five-way number of fields are supported on the query page, and complex multi-species alignments; (ii) regulatory potential scores queries involving operations on one or more query results are [which evaluate how proper alignments match patterns that are supported on the history page. The statistics page shows users distinctive for known regulatory regions (10)]; (iii) PhyloHMM the number of entries in each data type and also provides easy Cons scores [which give a likelihood that an alignment results access to downloads of the data. from purifying selection, given local variation in the neutral rate The query page groups categories analogously to those at of evolution (11)]; (iv) matches to conserved transcription the UCSC Genome Browser: genes, mRNA, expression, reg- factor binding sites in any combination of species (based on ulation, etc. The current version of this layout allows users to position weight matrix scores mapped onto all sequences); and expand the categories to see any or all options simultaneously (v) microarray expression data from GNF (12). Datasets of (using the ‘refresh form’ button). A new version under devel- novel functional DNA sequences have also been added, such opment guides a user through the pages with a limited number as microRNAs (13), known regulatory regions (10), and of choices for ease of use. Long category lists have been functional and predicted promoters (14). moved to alternate pages and can be browsed and selected An example of the use of these new data fields is shown in using buttons and check boxes. This is especially valuable for Figure 1. The 450 kb at the end of the short arm of human lists that have hundreds of entries (e.g. transcription factor chromosome 16 contains the cluster of genes HBZ HBA2, binding sites), where the web browser’s ‘find’ option can HBA1 and HBQ1, encoding alpha-like globins, plus several be used to quickly and easily locate terms in the list. other genes. The distal major regulatory element, or MRE, We have added two new features to GALA’s query form; required for expression of the alpha-like globin genes, is user ranges and an option to retain overlapping regions. User 40 kb telomeric to the HBZ gene, and is located in one of ranges are analogous to custom tracks on the genome brow- the introns of a different gene, called C16orf35. Analyses of sers, i.e. they are DNA segments of interest to the user. Users human–mouse alignments or alignments of multiple vertebrate can submit ranges of interest on the history page, after which species were not sufficient to uniquely identify the MRE they can be used in compound queries or displayed using any within the 450 kb region (15); the MRE is conserved between of GALA’s output choices. In the second new option, the human and mouse but many other non-coding regions were database will allow overlapping ranges rather than combining conserved as well. We then used GALA to find all the non- them. The default option of collapsing these ranges helps to coding regions that were highly likely to be under the purify- reduce the memory and time needed to compute compound ing selection, based on the phyloHMMcons score (16), and we queries and display the results. However, in some cases users found 199 segments (Figure 1). We also found 41 matches to would prefer to see overlapping datasets, such as expressed GATA-1 binding sites that are conserved among human, sequence tags (ESTs) and alternative splice forms of genes. mouse and rat. Both these datasets include the MRE, but More types of output are now provided. For instance, the obviously they include other segments as well. The clustering original options for viewing results in the UCSC Genome function on the GALA history page was used to find five Browser or seeing alignments in the Laj interactive alignment conserved matches to the GATA-1 binding sites that have viewer are now complemented by custom tracks in BED for- another one within 100 bp. Finally, the intersection operation mat for the UCSC Browser and textual alignments in machine of GALA was used to find four segments that pass the phy- readable and visual formats. The region and gene summary loHMMcons threshold and have a clustered pair of conserved result pages are useful for browsing results, and are limited to matches to GATA-1 binding sites. Two of these are so close the first 20 pages of output. Buttons at the bottom of these that they show as a single vertical line; these are within the pages can switch the display format to that of other output MRE. The other two are close to regions that are erythroid options.