Vol. 15 no. 5 1999 Pages 376-381

CRAWview: for viewing splicing variation, families, and polymorphism in clusters of ESTs and full-length sequences

Amanda Chou and John Burke

Genome Informatics Group, Pangea Systems, 1999 Harrison Street, Suite 1100, Oakland, CA 94612, USA

Received on August 17, 1998; revised on November 23, 1998; accepted on February 2, 1999

Abstract

Motivation: DNA sequence clustering has become a valuable method in support of gene discovery and gene expression analysis. Our interest lies in leveraging the sequence diversity within clusters of expressed sequence tags (ESTs) to model gene structure for the study of gene variants that arise from, among other things, alternative mRNA splicing, polymorphism, and divergence after gene duplication, fusion, and translocation events. In previous work, CRAW was developed to discover gene variants from assembled clusters of ESTs. Most importantly, novel gene features (the differing units between gene variants, for example alternative exons, polymorphisms, transposable elements, etc.) that are specialized to tissue, disease, population, or developmental states can be identified when these tools collate DNA source information with gene variant discrimination. While the goal is complete automation of novel feature and gene variant detection, current methods are far from perfect, and hence the development of effective tools for visualization and exploratory data analysis is of paramount importance in the process of sifting through candidate genes and validating targets.

Results: We present CRAWview, a Java-based visualization extension to CRAW. Features that vary between gene forms are displayed using an automatically generated color-coded index. The reporting format of CRAWview gives a brief, high-level summary report to display overlap and divergence within clusters of sequences as well as the ability to ‘drill down’ and see detailed information concerning regions of interest. Additionally, the alignment viewing and editing capabilities of CRAWview make it possible to interactively correct frame-shifts and otherwise edit cluster assemblies. We have implemented CRAWview as a Java application across Windows NT/95 and UNIX platforms.

Availability: A beta version of CRAWview will be freely available to academic users from Pangea Systems (http://www.pangeasystems.com).

Contact: [email protected]

Introduction

The large quantity of single-read sequence from the ends of sufficiently expressed mRNAs (known as Expressed Sequence Tags or ESTs; Wilcox et al., 1991; Adams et al., 1991; Okubo et al., 1991) has led to the discovery of many genes before the completion of genomic sequencing of the human or other organismal genomes (Adams et al., 1992; Venter, 1993; Matsubara and Okubo, 1993). EST data has also facilitated large-scale expression studies (Okubo et al., 1992, 1994; Adams et al., 1995), the construction of a physical map of the genome (Hudson et al., 1995), and a gene map that localizes many genes with respect to markers of the physical map (Schuler et al., 1996). The creation of standardized data repositories (Boguski et al., 1993; Benson et al., 1994) has improved the reliability and concurrence of EST data.

EST clustering and gene indexing projects

Several projects are underway to construct gene indices, where EST data and known gene sequence data can be consolidated and placed in correct mapping, expression, and physiological context. Although specific methods vary between projects, all gene indices are constructed using some form of clustering, where distance is defined based upon the sequence similarity of transcripts. The central idea of EST clustering is that ESTs be grouped into the same cluster if and only if they are derived from the same gene. Published gene indexing efforts include UniGene (Boguski et al., 1995; Boguski and Schuler, 1995) from NCBI; the TIGR Gene Index (TGI) from the Institute for Genomic Research (http://www.tigr.org/tdb/hgi/hgi.html; Sutton et al., 1995; White and Kerlavage, 1996); the Merck-Washington University Gene Index (Williamson et al., 1995; Eckman et al., 1997; http://www.merck.com/mrl/merck_gene_index.2.html; Aaronson et al., 1996); the GenExpress project (Houlgatte et al., 1995) and the STACK project from the South

376 © Oxford University Press Splicing and polymorphism in EST clusters

African National Bioinformatics Institute (SANBI) (Hide et al., 1994, 1997; Miller et al., 1997).

Representing variations within an EST cluster

The visualization and quantification of gene variants within clusters has not been the primary focus of most gene indexing projects. For example, the UniGene project does not attempt to make assemblies and hence provides no visual report of how transcripts in a cluster overlap. The TIGR Gene Index (TGI) was apparently the first project to provide a space-compressed report, called a THC report (Adams et al., 1995), to display overlap in assemblies of ESTs with respect to tentative consensi (TC) and full-length sequence. Other tools not associated with any specific gene indexing project, yet of great value for viewing sequence assemblies, are Consed (Gordon et al., 1998) and phrapview (P. Green, unpublished). An iterative search method for constructing EST assemblies for single genes of interest has also been proposed (Gill et al., 1997). These methods, however, focus on presenting a single assembly and do not generalize easily to the case where multiple consensi are needed simultaneously to model the information in a sequence cluster, as is the case when sufficiently divergent gene variants are present. Nor do these methods automatically detect the presence of polymorphisms for display. The STACK project, on the other hand, uses CRAW analysis (Burke et al., 1998) as a post-processing step to clustering in order to automatically discriminate between and simultaneously view distinct gene variants.

The CRAW approach to gene variants and EST clusters

CRAW functions by partitioning sequence clusters into sub-clusters based upon sequence dissimilarity. Specifically, a greedy method is used to construct maximal sub-clusters. Membership in the sub-cluster is restricted in that a constraint is put on the divergence within a global alignment between members and the sub-cluster consensus. When the original clusters are created with similarity threshold (equivalent to minimal-linkage) clustering, as is the case with STACK and UniGene, any two sequences that share an identical domain of sufficient length will be in the same cluster. The creation of sub-clusters is necessary to resolve inconsistencies (for instance, the inclusion of alternate exons in different isoforms of the same gene) through partitioning into one or more sub-clusters. In addition to segregating clusters into distinct gene isoforms, the partitioning is used to identify false joins caused by ESTs derived from chimeric clones, genomic contamination, and other artifacts. Apparently, the first use of an approach of loose grouping followed by stricter separation for biological sequence databases was the conserved regions database in BEAUTY (Worley et al., 1995), in which minimal linkage is used to perform an initial joining of sequences containing similar domains, and maximal linkage is used to resolve inconsistencies in the clusters caused by the ‘chaining effect’ (Johnson and Wichern, 1992). It seems that the first published application of this concept to EST data was the ‘THC_build’ system used to generate TGI (G. Sutton, personal communication; Adams et al., 1995), where pairwise sequence similarity results from BLAST (Altschul et al., 1990) and a modified FASTA (Pearson, 1990) algorithm are collated with a relational database to form loose clusters of related sequences that are aligned with tigr_assembler (Sutton et al., 1995) under conservative parameter sets and strict constraints.

CRAWview

Here we present in detail CRAWview, a Java implementation of a visualization extension to CRAW. CRAWview provides brief cluster reports that display consensus sequences from EST assemblies even when, due to the presence of gene variants, more than one possible cluster consensus exists. CRAWview also highlights regions of divergence and conservation between alternate consensi, and automatically flags polymorphic or otherwise divergent regions. cDNA library information is also included in the reports to aid in the detection of state-specific gene variants as well as the identification of disease-associated polymorphism and alternative exon usage.

Systems and methods

CRAWview may be run on Windows NT/95 and UNIX systems. A Java Runtime Environment (JRE) is required; it may be obtained free of charge from Sun Microsystems for PC or SUN platforms (http://java.sun.com/products/jdk/1.1/jre/index.html). For non-SUN UNIX architectures, the JRE may be obtained from the hardware manufacturer. For LINUX the JRE may be obtained from: http://browser-watch.internet.com/news/story/java35.html.

Algorithms and implementation of CRAWview

Upon completion of CRAW analysis, an EST cluster will have been assembled and partitioned into sub-clusters. A separate consensus sequence will also have been derived for each sub-cluster. CRAWview accepts as its input the output flat file generated by CRAW and presents a graphical view of overlap patterns and sequence divergence within EST clusters.

The CRAWview report represents a cluster assembly of alignment_length positions with num_cols columns and a row for every sequence in the cluster. Within a row, in order to display information within num_cols columns, each column symbol represents the sequence diversity of

377 A.Chou and J.Burke

Table 1. Pseudo-code for representing divergence and identity between sub-groups: pseudo-code of CRAWview color-index assignment for a cluster that has been partitioned into (maximum_group) sub-groups by CRAW. Variables are displayed in bold while variables that are parameters are in italics

group_num = 1;
while( group_num <= maximum_group ) {
  for all sequences (i) in group group_num: {
    for all non-overlapping windows in (i): {
      if( > num_gaps gaps in window )
        display a ‘gap’ symbol
      else if( > num_ambig indeterminate bases in window )
        display ‘unknown’ symbol
      else {
        look_group = 1;
        while( look_group <= group_num ) {
          if( >= (d - num_diffs) identical bases with consensus of look_group ) {
            display ‘look_group’ symbol
            stop for this window
          }
          look_group++;
        }
        display ‘diverge’ symbol
      }
    } /* finished window */
  } /* finished sequence */
  group_num++
} /* finished group */
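The assignment rules of Table 1 can be sketched in Python as follows. This is a minimal illustration, not the CRAWview source: the function names (`window_symbol`, `assign_symbols`), the string labels, the use of '-' for gaps and 'N' for indeterminate bases, the plain character-by-character comparison against each consensus, and the choice to scale the identity threshold to the actual window length (so that a longer final window is handled) are all assumptions made for the example. Window boundaries are passed in as 0-based, half-open (start, end) pairs.

```python
from typing import List, Sequence, Tuple, Union

Symbol = Union[int, str]  # a sub-group number, or one of the labels below

GAP, UNKNOWN, DIVERGE = "gap", "unknown", "diverge"


def window_symbol(seq_win: str, consensus_wins: Sequence[str],
                  num_gaps: int, num_ambig: int, num_diffs: int) -> Symbol:
    """Symbol for one window of one aligned sequence.

    consensus_wins holds the same window cut from the consensus of
    sub-group 1, 2, ... up to the sequence's own sub-group.
    """
    seq_win = seq_win.upper()
    if seq_win.count("-") > num_gaps:        # too many gaps
        return GAP
    if seq_win.count("N") > num_ambig:       # too many indeterminate bases
        return UNKNOWN
    # Assumption: threshold uses the real window length rather than a fixed d.
    needed = len(seq_win) - num_diffs
    for look_group, cons_win in enumerate(consensus_wins, start=1):
        identical = sum(a == b for a, b in zip(seq_win, cons_win.upper()))
        if identical >= needed:              # first matching sub-group wins
            return look_group
    return DIVERGE                           # matched no consensus


def assign_symbols(groups: Sequence[Sequence[str]], consensi: Sequence[str],
                   bounds: Sequence[Tuple[int, int]],
                   num_gaps: int, num_ambig: int,
                   num_diffs: int) -> List[List[Symbol]]:
    """One row of symbols per sequence, sub-groups visited in order."""
    rows: List[List[Symbol]] = []
    for g, members in enumerate(groups):
        visible = consensi[: g + 1]          # consensi of sub-groups 1..g+1
        for seq in members:
            rows.append([
                window_symbol(seq[s:e], [c[s:e] for c in visible],
                              num_gaps, num_ambig, num_diffs)
                for s, e in bounds
            ])
    return rows
```

With two sub-groups whose consensi agree in the first window and differ in the second, a sequence from sub-group two receives symbol 1 in the first window and symbol 2 in the second, which is how a shared domain followed by a group-specific domain would show up in a report row.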

d = floor(alignment_length/num_cols) positions, except for the last column, which represents d + (alignment_length mod num_cols) positions. So that CRAWview reports fit within a printed page, we typically use num_cols = 60.

Within each window of d positions, discrete domains of sequence identity between sub-cluster consensi, as well as divergence of individual sequences from consensi, are represented by assignment of color index symbols as described in the pseudo-code in Table 1.

The CRAWview color index symbols typically consist of the following. Lines indicate gaps in the multiple alignment. Bar colors are assigned as follows: red is reserved for divergence from the consensus sequence (such as when a single nucleotide polymorphism, or SNP, is encountered) and white indicates indeterminate sequence (an example of this is the ‘N’ commonly used in DNA sequence to represent an unknown base). Other colors indicate discrete regions of sequence identity.

In addition to providing a high-level view of EST assemblies, CRAWview provides the user with the ability to ‘drill down’ on interesting features by calling upon alignment viewing/editing features. If it is suspected that manual editing of the multiple alignment may produce better results, the multiple alignment may be edited and resubmitted to CRAW for sub-group reassignment.
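As a worked example of the column arithmetic, take a hypothetical alignment of 1234 positions with num_cols = 60: d = floor(1234/60) = 20, so the first 59 columns cover 20 positions each and the last column covers 20 + (1234 mod 60) = 54 positions. The short sketch below computes the same boundaries; the function name `column_bounds` and the 0-based, half-open (start, end) convention are assumptions made for illustration.

```python
from typing import List, Tuple


def column_bounds(alignment_length: int, num_cols: int = 60) -> List[Tuple[int, int]]:
    """Map each report column to the alignment positions it summarizes.

    Every column covers d = alignment_length // num_cols positions,
    except the last, which also absorbs the remainder.
    """
    if num_cols <= 0 or alignment_length < num_cols:
        raise ValueError("need at least one alignment position per column")
    d = alignment_length // num_cols
    bounds = [(i * d, (i + 1) * d) for i in range(num_cols - 1)]
    bounds.append(((num_cols - 1) * d, alignment_length))  # d + remainder
    return bounds


bounds = column_bounds(1234, 60)
print(bounds[0], bounds[-1])  # first and last column ranges
```

The same mapping can be read in reverse when a region of the summary report is selected: the selected columns give the range of alignment positions that a detailed viewer would need to show.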


Fig. 1. CRAWview report for the human dishevelled 3 gene: sub-group two is identical to group one except for a missing domain (a putative alternative exon), a feature unique to NCGAP_Co9, a colon cancer library. The two yellow vertical bars have been set by the user to drill down on the 3′ end of the missing domain. This causes the MSA editor/viewer component to be spawned at the correct zoom and center to display this region in detail.

For example, Figure 1 shows the CRAWview report for an EST cluster from the UniGene gene index. The corresponding full length gene is Human Disheveled 3/RACK 8 Protein Kinase. Divergence is seen in the second sub-group, which contains two ESTs specific to the library NCGAP_Co3 representing a colon cancer state. The distinguishing feature between the two shown sub-groups is a region (a putative exon) missing from the ESTs derived from colon cancer libraries. As an example of ‘drill-down’ analysis, the user has highlighted a small portion from the 3′ end of the missing domain and the assembly viewing/editing components are spawned such that the zoom and center of the assembly viewer cover the highlighted region.

CRAWview is a Java application and is implemented in a combination of Java Foundation Classes (JFC)/swing (http://java.sun.com/products/jfc/swingdoc-api-1.0.3/frame.html) and Abstract Window Toolkit (AWT) (Zukowski, 1997). CRAWview is rendered and displayed in a JScrollPane, which contains a main viewport, a horizontal heading viewport, and optional vertical and horizontal scrollbars. The JScrollPane resides in a JFrame along with JMenuBar and JToolBar. The header viewport contains the legend and ruler; they always stay at the top, and the ruler is calibrated to alignment positions. The main viewport section of the CRAWview report is scrollable and contains the sequence overlap diagram as well as supplementary sequence information such as accession number, clone identifier, and cDNA library information. CRAWview allows for standard printing through AWT PrintJob, or the user can save the color CRAWview report as a GIF file. CRAWview GIF file generation uses GIFEncoder developed by Adam Doppelt (unpublished).

Using CRAWview

In order to use CRAWview it is necessary either to use CRAW output or to emulate the CRAW output formats. In CRAW, sub-cluster membership, variation, and sequence alignment information is conveyed in two files: *.draw and *.ali. An academic version of CRAW may be obtained through the University of Houston (contact ddavi-[email protected]). Better performance is obtained from the commercial version of CRAW (available from Pangea Systems, www.pangeasystems.com). If one does not wish to or is not able to use CRAW, then CRAWview can still be used

if one emulates the file formats of *.draw and *.ali. The *.draw file contains sub-cluster membership information as well as a text version of the color output generated by CRAWview. The details of the *.draw format, with many examples, can be found in previous work (Burke et al., 1998). The *.ali file contains the actual sequence information with gaps. The sequences are listed in the same order as in the *.draw file and are in the older style ‘gde’ format, i.e. every sequence is listed as:

For further clarity, several complete example cluster files with *.ali and *.draw files are available online at the Bioinformatics website.

CRAW can be used directly with sequence aligners such as CLUSTALW. Given a multiple FASTA format file, say 0, an example of complete CRAW/CRAWview usage would be:

1. clustalw -output=gde 0.
2. cat 0.gde | craw 0.5 50 60 > 0.draw (the 0.ali file is generated automatically).
3. Run CRAWview and choose 0.draw from the file/open menu.

(The file 0 as well as 0.ali and 0.draw are available at the Bioinformatics website.) A beta version of CRAWview is available to academics free of charge.

Discussion

EST data can be used to model gene structure in the absence of full genomic sequence or a matching positionally cloned gene. Additionally, it provides a cheap and abundant source of information concerning gene variability and, to some extent, expression. We have presented the capabilities of CRAWview, a tool for browsing and performing exploratory data analysis of EST clusters with the purpose of assisting in the identification of state-specific gene variability and novel disease-associated features in mature gene transcripts.

In theory, due to the large amounts of data to be processed, high-throughput bioinformatics procedures should be fully automatic. However, since many biological features have yet to be fully characterized by ‘rule-sets’, it is inevitable that the optimal results according to an algorithm’s objective function will sometimes be incorrect or sub-optimal in a biological context. This caveat applies to sequence alignment algorithms in general and is especially true for multiple sequence alignment or assembly. For this reason, CRAWview combines high-level viewing and automatic feature reporting capabilities with an alignment editing feature that allows a user to interactively improve multiple alignments of transcripts.

Another interesting application of CRAWview is cross-validation of different clustering and assembly protocols. We often use CRAWview to display the results of other assembly and clustering programs in the context of the consensi derived by CRAW. This is especially important given that multiple alignment is often performed through heuristic procedures that are not guaranteed to converge to the correct ‘biological’ answer, and the comparison of results produced by different methods is necessary for validation. In future CRAWview developments we plan to include an open reading frame (ORF) finder. We will increase the functionality of the assembly editor and make communications between the editor and the ORF finder instantaneous so that the user may immediately see the effects of edits on consensus coding potential. Additionally, we plan to allow the user flexibility in assigning color codes.

Acknowledgments

The authors would like to express thanks to Chris Tarnas for assistance with the optimization of CRAWview and to Kristina Chi for assistance in typing this manuscript. The authors are especially grateful to Matthew Huang for careful review and helpful ideas. This work was supported by Pangea Systems.

References

Aaronson,J.S., Eckman,B., Blevins,R.A., Borkowski,J.A., Myerson,J., Imran,S. and Elliston,K.O. (1996) Toward the development of a gene index to the human genome: an assessment of the nature of high-throughput EST sequence data. Genome Res., 6, 829–845.
Adams,M.D., Kelley,J.M., Gocayne,J.D., Dubnick,M., Polymeropoulos,M.H., Xiao,H., Merril,C.R., Wu,A., Olde,B., Moreno,R.F., Kerlavage,A.R., McCombie,W.R. and Venter,J.C. (1991) Complementary DNA sequencing: expressed sequence tags and human genome project. Science, 252, 1651–1656.
Adams,M.D., Dubnick,M., Kerlavage,A.R., Moreno,R., Kelley,J.M., Utterback,T.R., Nagle,J.W., Fields,C. and Venter,J.C. (1992) Sequence identification of 2375 human brain genes. Nature, 355, 632–634.
Adams,M.D., Kerlavage,A.R., Fleischmann,R.D., Fuldner,R.A., Bult,C.J., Lee,N.H., Kirkness,E.F., Weinstock,K.G., Gocayne,J.D., White,O., Sutton,G., Blake,J.A., Brandon,R.C., Chiu,M.W., Clayton,R.A., Cline,R.T., Cotton,M.D., Earle-Hughes,J., Fine,L.D., FitzGerald,L.M., FitzHugh,W.M., Fritchman,J.L., Geoghagen,N.S.M., Glodek,A., Gnehm,C.L., Venter,C. et al. (1995) Initial assessment of human gene diversity and expression patterns based upon 83 million nucleotides of cDNA sequence. Nature, 377(suppl.), 3–17.
Altschul,S.F., Gish,W., Miller,W., Myers,E.W. and Lipman,D.J. (1990) Basic local alignment search tool. J. Mol. Biol., 215, 403–410.
Benson,D.A., Boguski,M.S., Lipman,D.J. and Ostell,J. (1994) GenBank. Nucleic Acids Res., 22, 3441–3444.
Boguski,M.S. and Schuler,G.D. (1995) ESTablishing a human transcript map. Nature Genetics, 10, 369–371.
Boguski,M.S., Lowe,T.M. and Tolstoshev,C.M. (1993) DbEST: database for ‘expressed sequence tags’. Nature Genetics, 4, 332–333.
Burke,J., Wang,H., Hide,W. and Davison,D. (1998) Alternative gene form discovery and candidate gene selection from gene indexing projects. Genome Res., 8, 276–290.


Eckman,B.A., Aaronson,J.S., Borkowski,J.A., Bailey,W.J., Elliston,K.O., Williamson,A.R. and Blevins,R.A. (1998) The Merck Gene Index Browser: an extensible data integration system for gene finding, gene characterization and EST data mining. Bioinformatics, 14, 2–13.
Gill,R., Hodgman,T., Littler,C., Oxer,M., Montgomery,D., Taylor,S. and Sanseau,P. (1997) A new dynamic tool to perform assembly of expressed sequence tags (ESTs). CABIOS, 13, 453–457.
Gordon,D., Abajian,C. and Green,P. (1998) Consed: a graphical tool for sequence finishing. Genome Res., 8, 195–202.
Hide,W., Burke,J. and Davison,D. (1994) Biological evaluation of d2, an algorithm for high-performance sequence comparison. J. Comp. Biol., 1, 199–215.
Hide,W., Burke,J., Christoffels,A. and Miller,R. (1997) A novel approach towards a comprehensive consensus representation of the expressed human genome. In Miyano,S. and Takagi,T. (eds), Genome Informatics 1997. Universal Academy Press, Tokyo, pp. 187–196.
Houlgatte,R., Mariage-Samson,R., Duprat,S., Tesslier,A., Bentolila,S., Lamy,B. and Auffray,C. (1995) The GenExpress Index: a resource for gene discovery and the genic map of the human genome. Genome Res., 5, 272–304.
Hudson,T.J., Stein,L.D., Gerety,S.S., Ma,J., Castle,A.B., Silva,J., Slonim,D.K., Baptista,R., Kruglyak,L., Xu,S., Hu,X., Colbert,A.M.E., Rosenberg,C., Reeve-Daly,M.P., Rozen,S., Hui,L., Wu,X., Vastergaard,C., Wilson,K.M., Sae,J.S., Maitra,S., Ganiatsas,S., Evans,C.A., DeAngelis,M.M., Ingalls,K.A., Nahf,R.W., Horton,L.T., Anderson,M.O., Collymore,A.J., Ye,W., Koyoumjian,V., Zemsteva,I.S., Tam,J., Devine,R., Courtney,D.F., Renauld,M.T., Nguyen,H., Fizames,C., Faure,S., Gyapay,G., Dib,C., Morissette,J., Orlin,J.B., Birren,B.W., Goodman,N., Weissenbach,J., Hawkins,T.L., Foote,S., Page,D.C. and Lander,E.S. (1995) An STS-based map of the human genome. Science, 270, 1945–1954.
Johnson,R.A. and Wichern,D.W. (1992) Applied Multivariate Statistical Methods, 3rd edn. Englewood Cliffs, NJ.
Matsubara,K. and Okubo,K. (1993) Identification of new genes by systematic analysis of cDNAs and database construction. Curr. Opinion Biotech., 4, 672–677.
Miller,R., Burke,J., Christoffels,A. and Hide,W. (1997) Towards a more comprehensive conceptual consensus of the expressed genome: development of sequence tag alignment and consensus knowledgebase (STACK) a novel error analytical approach to EST consensus databases. Ninth International Genome Sequencing and Analysis Conference.
Okubo,K., Hori,H., Matuba,R., Niiyama,T. and Matsubara,K. (1991) A novel system for large-scale sequencing of cDNA by PCR amplification. DNA Sequence, 2, 137–144.
Okubo,K., Hori,H., Matuba,R., Niiyama,T., Fukushima,A., Kiojima,Y. and Matsubara,K. (1992) Large-scale cDNA sequencing analysis of quantitative and qualitative aspects of gene expression. Nature Genetics, 2, 173–179.
Okubo,K., Yoshii,J., Yokouchi,H., Kameyama,M. and Matsubara,K. (1994) An expression profile of active genes in human colonic mucosa. DNA Res., 1, 37–45.
Pearson,W.R. (1990) Rapid and sensitive sequence comparison with FASTP and FASTA. In Doolittle,R.F. (ed.), Molecular Evolution: Computer Analysis of Protein and Nucleic Acid Sequences, Methods in Enzymology. Academic Press, San Diego, pp. 63–98.
Schuler,G.D., Boguski,M.S., Stewart,E.A., Stein,L.D., Gyapay,G., Rice,K., White,R.E., Rodriguez-Tome,P., Aggarwal,A., Bajorek,E., Bentolila,S., Birren,B.B., Butler,A., Castle,A.B., Chiannilkulchai,N., Chu,A., Clee,C., Cowles,S., Day,P.J.R., Dibling,T., Drouot,N., Dunham,I., Duprat,S., East,C., Edwards,C., Fan,J.B., Fang,N., Fizames,C., Garrett,C., Green,L., Hudson,T.J. et al. (1996) A gene map of the human genome. Science, 274, 540–546.
Sutton,G., White,O., Adams,M.D. and Kerlavage,A.R. (1995) TIGR assembler: a new tool for assembling large shotgun sequencing projects. Genome Sci. Technol., 1, 9–18.
Venter,J.C. (1993) Identification of new human receptor and transporter genes by high throughput cDNA (EST) sequencing. J. Pharm. Pharmacol., 45(suppl. 1), 355–360.
White,O. and Kerlavage,A.R. (1996) TDB: new databases for biological discovery. Methods Enzymol., 206, 27–41.
Wilcox,A.S., Khan,A.S., Hopkins,J.A. and Sikela,J.M. (1991) Use of 3′ untranslated sequences of human cDNAs for rapid chromosomal assignment and conversion to STSs: implications for an expression map of the genome. Nucleic Acids Res., 19, 1837–1843.
Williamson,A.R., Elliston,K.O. and Sturchio,J.L. (1995) The Merck Gene Index, a public resource for research. J. NIH Res., 7, 61–63.
Worley,K.C., Wiese,B.A. and Smith,R. (1995) BEAUTY: an enhanced BLAST-based search tool that integrates multiple biological information resources into sequence similarity results. Genome Res., 5, 173–184.
Zukowski,J. (1997) Java AWT Reference. O’Reilly Press, Sebastopol, CA.
