CRAWview: for viewing splicing variation, gene families, and polymorphism in clusters of ESTs and full-length sequences
Vol. 15 no. 5 1999 BIOINFORMATICS Pages 376-381
© Oxford University Press

Amanda Chou and John Burke
Genome Informatics Group, Pangea Systems, 1999 Harrison Street, Suite 1100, Oakland, CA 94612, USA

Received on August 17, 1998; revised on November 23, 1998; accepted on February 2, 1999

Abstract

Motivation: DNA sequence clustering has become a valuable method in support of gene discovery and gene expression analysis. Our interest lies in leveraging the sequence diversity within clusters of expressed sequence tags (ESTs) to model gene structure for the study of gene variants that arise from, among other things, alternative mRNA splicing, polymorphism, and divergence after gene duplication, fusion, and translocation events. In previous work, CRAW was developed to discover gene variants from assembled clusters of ESTs. Most importantly, novel gene features (the differing units between gene variants, for example alternative exons, polymorphisms, transposable elements, etc.) that are specialized to tissue, disease, population, or developmental states can be identified when these tools collate DNA source information with gene variant discrimination. While the goal is complete automation of novel feature and gene variant detection, current methods are far from perfect and hence the development of effective tools for visualization and exploratory data analysis are of paramount importance in the process of sifting through candidate genes and validating targets.

Results: We present CRAWview, a Java based visualization extension to CRAW. Features that vary between gene forms are displayed using an automatically generated color coded index. The reporting format of CRAWview gives a brief, high level summary report to display overlap and divergence within clusters of sequences as well as the ability to ‘drill down’ and see detailed information concerning regions of interest. Additionally, the alignment viewing and editing capabilities of CRAWview make it possible to interactively correct frame-shifts and otherwise edit cluster assemblies. We have implemented CRAWview as a Java application across Windows NT/95 and UNIX platforms.

Availability: A beta version of CRAWview will be freely available to academic users from Pangea Systems (http://www.pangeasystems.com).

Contact: [email protected]

Introduction

The large quantity of single-read sequence from the ends of sufficiently expressed mRNAs (known as Expressed Sequence Tags or ESTs; Wilcox et al., 1991; Adams et al., 1991; Okubo et al., 1991) has led to the discovery of many genes before the completion of genomic sequencing of the human or other organismal genomes (Adams et al., 1992; Venter, 1993; Matsubara and Okubo, 1993). EST data has also facilitated large-scale expression studies (Okubo et al., 1992, 1994; Adams et al., 1995), the construction of a physical map of the genome (Hudson et al., 1995), and a gene map that localizes many genes with respect to markers of the physical map (Schuler et al., 1996). The creation of standardized data repositories (Boguski et al., 1993; Benson et al., 1994) has improved the reliability and concurrence of EST data.

EST clustering and gene indexing projects

Several projects are underway to construct gene indices, where EST data and known gene sequence data can be consolidated and placed in correct mapping, expression, and physiological context. Although specific methods vary between projects, all gene indices are constructed using some form of cluster analysis, where distance is defined based upon the sequence similarity of transcripts. The central idea of EST clustering is that ESTs be grouped into the same cluster if and only if they are derived from the same gene. Published gene indexing efforts include UniGene (Boguski et al., 1995; Boguski and Schuler, 1995) from NCBI; the TIGR Gene Index (TGI) from the Institute for Genomic Research (http://www.tigr.org/tdb/hgi/hgi.html; Sutton et al., 1995; White and Kerlavage, 1996); the Merck-Washington University Gene Index (Williamson et al., 1995; Eckman et al., 1997; http://www.merck.com/mrl/merck_gene_index.2.html; Aaronson et al., 1996); the GenExpress project (Houlgatte et al., 1995) and the STACK project from the South African National Bioinformatics Institute (SANBI) (Hide et al., 1994, 1997; Miller et al., 1997).

Representing variations within an EST cluster

The visualization and quantification of gene variants within clusters has not been the primary focus of most gene indexing projects. For example, the UniGene project does not attempt to make assemblies and hence provides no visual report of how transcripts in a cluster overlap. The TIGR Gene Index (TGI) was apparently the first project to provide a space-compressed report, called a THC report (Adams et al., 1995), to display overlap in assemblies of ESTs with respect to tentative consensi (TC) and full-length sequence. Other tools not associated with any specific gene indexing project, yet of great value for viewing sequence assemblies, are Consed (Gordon et al., 1998) and phrapview (P. Green, unpublished). An iterative search method for constructing EST assemblies for single genes of interest has also been proposed (Gill et al., 1997). These methods, however, focus on presenting a single assembly and do not generalize easily to the case where multiple consensi are needed simultaneously to model the information in a sequence cluster, as is the case when sufficiently divergent gene variants are present. Nor do these methods automatically detect the presence of polymorphisms for display. The STACK project, on the other hand, uses CRAW analysis (Burke et al., 1998) as a post-processing step to clustering in order to automatically discriminate between and simultaneously view distinct gene variants.

The CRAW approach to gene variants and EST clusters

CRAW functions by partitioning sequence clusters into sub-clusters based upon sequence dissimilarity. Specifically, a greedy method is used to construct maximal sub-clusters. Membership in the sub-cluster is restricted in that a constraint is put on the divergence within a global alignment between members and the sub-cluster consensus. When the original clusters are created with similarity threshold (equivalent to minimal-linkage) clustering, as is the case with STACK and UniGene, any two sequences that share an identical domain of sufficient length will be in the same cluster. The creation of sub-clusters is necessary to resolve inconsistencies (for instance, the inclusion of alternate exons in different isoforms of the same gene) through partitioning into one or more sub-clusters. In addition to segregating clusters into distinct gene isoforms, the partitioning is used to identify false joins caused by ESTs derived from chimeric clones, genomic contamination, and other artifacts. Apparently, the first use of a loose grouping followed by stricter separation approach for biological sequence databases was joining of protein sequences containing similar domains, and maximal linkage is used to resolve inconsistencies in the clusters caused by the ‘chaining effect’ (Johnson and Wichern, 1992). It seems that the first published application of this concept to EST data was ‘THC_build’ system used to generate TGI (G. Sutton, personal communication; Adams et al., 1995) where pairwise sequence similarity results from BLAST (Altschul et al., 1990) and a modified FASTA (Pearson, 1990) algorithm are collated with a relational database to form loose clusters of related sequences that are aligned with tigr_assembler (Sutton et al., 1995) under conservative parameter sets and strict constraints.

CRAWview

Here we present in detail CRAWview, a Java implementation of a visualization extension to CRAW. CRAWview provides brief cluster reports that display consensus sequences from EST assemblies even when, due to the presence of gene variants, more than one possible cluster consensus exists. CRAWview also highlights regions of divergence and conservation between alternate consensi, and automatically flags polymorphic or otherwise divergent regions. cDNA library information is also included in the reports to aid in the detection of state-specific gene variants as well as the identification of disease associated polymorphism and alternative exon usage.

Systems and methods

CRAWview may be run on Windows NT/95 and UNIX systems. A Java Runtime Environment (JRE) is required which may be obtained free of charge from Sun Microsystems for PC or SUN platforms (http://java.sun.com/products/jdk/1.1/jre/index.html). For non-SUN UNIX architectures, the JRE may be obtained from the hardware manufacturer. For LINUX the JRE may be obtained from: http://browserwatch.internet.com/news/story/java35.html.

Algorithms and implementation of CRAWview

Upon completion of CRAW analysis, an EST cluster will have been assembled and partitioned into sub-clusters. A separate consensus sequence will also have been derived for each sub-cluster. CRAWview accepts as its input the output flat file generated by CRAW and presents a graphical view of overlap patterns and sequence divergence within EST clusters.

The CRAWview report represents a cluster assembly of alignment_length positions with num_cols columns and a row for every sequence in the cluster. Within a row, in order to display sequence alignment information within num_cols