Leveraging EST Evidence to Automatically Predict Alternatively Spliced Genes, Master's Thesis, December 2006
Total Page:16
File Type:pdf, Size:1020Kb
Washington University in St. Louis Washington University Open Scholarship All Computer Science and Engineering Research Computer Science and Engineering Report Number: WUCSE-2007-5 2007 Leveraging EST Evidence to Automatically Predict Alternatively Spliced Genes, Master's Thesis, December 2006 Robert Zimmermann Current methods for high-throughput automatic annotation of newly sequenced genomes are largely limited to tools which predict only one transcript per gene locus. Evidence suggests that 20-50% of genes in higher eukariotic organisms are alternatively spliced. This leaves the remainder of the transcripts to be annotated by hand, an expensive time-consuming process. Genomes are being sequenced at a much higher rate than they can be annotated. We present three methods for using the alignments of inexpensive Expressed Sequence Tags in combination with HMM-based gene prediction with N-SCAN EST to recreate the vast majority of hand annotations in the D.melanogaster... Read complete abstract on page 2. Follow this and additional works at: https://openscholarship.wustl.edu/cse_research Part of the Computer Engineering Commons, and the Computer Sciences Commons Recommended Citation Zimmermann, Robert, "Leveraging EST Evidence to Automatically Predict Alternatively Spliced Genes, Master's Thesis, December 2006" Report Number: WUCSE-2007-5 (2007). All Computer Science and Engineering Research. https://openscholarship.wustl.edu/cse_research/149 Department of Computer Science & Engineering - Washington University in St. Louis Campus Box 1045 - St. Louis, MO - 63130 - ph: (314) 935-6160. This technical report is available at Washington University Open Scholarship: https://openscholarship.wustl.edu/ cse_research/149 Leveraging EST Evidence to Automatically Predict Alternatively Spliced Genes, Master's Thesis, December 2006 Robert Zimmermann Complete Abstract: Current methods for high-throughput automatic annotation of newly sequenced genomes are largely limited to tools which predict only one transcript per gene locus. Evidence suggests that 20-50% of genes in higher eukariotic organisms are alternatively spliced. This leaves the remainder of the transcripts to be annotated by hand, an expensive time-consuming process. Genomes are being sequenced at a much higher rate than they can be annotated. We present three methods for using the alignments of inexpensive Expressed Sequence Tags in combination with HMM-based gene prediction with N-SCAN EST to recreate the vast majority of hand annotations in the D.melanogaster genome. In our first method, we “piece together” N-SCAN EST predictions with clustered EST alignments to increase the number of transcripts per locus predicted. This is shown to be a sensitve and accurate method, predicting the vast majority of known transcripts in the D.melanogaster genome. We present an approach of using these clusters of EST alignments to construct a Multi-Pass gene prediction phase, again, piecing it together with clusters of EST alignments. While time consuming, Multi-Pass gene prediction is very accurate and more sensitive than single-pass. Finally, we present a new Hidden Markov Model instance, which augments the current N-SCAN EST HMM, that predicts multiple splice forms in a single pass of prediction. This method is less time consuming, and performs nearly as well as the multi-pass approach. Department of Computer Science & Engineering 2007-5 Leveraging EST Evidence to Automatically Predict Alternatively Spliced Genes, Master's Thesis, December 2006 Authors: Robert Zimmermann Corresponding Author: [email protected] Web Page: http://nijibabulu.org/bz Abstract: Current methods for high-throughput automatic annotation of newly sequenced genomes are largely limited to tools which predict only one transcript per gene locus. Evidence suggests that 20-50% of genes in higher eukariotic organisms are alternatively spliced. This leaves the remainder of the transcripts to be annotated by hand, an expensive time-consuming process. Genomes are being sequenced at a much higher rate than they can be annotated. We present three methods for using the alignments of inexpensive Expressed Sequence Tags in combination with HMM-based gene prediction with N-SCAN EST to recreate the vast majority of hand annotations in the D.melanogaster genome. In our first method, we “piece together” N-SCAN EST predictions with clustered EST alignments to increase the number of transcripts per locus predicted. This is shown to be a sensitve and accurate method, predicting the vast majority of known transcripts in the D.melanogaster genome. We present an approach of using these clusters of EST alignments to construct a Multi-Pass gene prediction phase, again, piecing it together with clusters of EST alignments. While time consuming, Multi-Pass gene prediction is very accurate and more sensitive than single-pass. Finally, we present a new Hidden Markov Model instance, which augments the current N-SCAN EST HMM, that predicts multiple splice forms in a single pass of prediction. This method is less time consuming, and performs nearly as well as Type of Report: Other Department of Computer Science & Engineering - Washington University in St. Louis Campus Box 1045 - St. Louis, MO - 63130 - ph: (314) 935-6160 SEVER INSTITUTE OF TECHNOLOGY MASTER OF SCIENCE DEGREE THESIS ACCEPTANCE (To be the first page of each copy of the thesis) DATE: December 12, 2006 STUDENT’S NAME: Bob Zimmermann This student’s thesis, entitled Leveraging EST Evidence to Automatically Predict Alternatively Spliced Genes has been examined by the undersigned committee of three faculty members and has received full approval for acceptance in partial fulfillment of the requirements for the degree Master of Science. APPROVAL: Chairman Short Title: Alt-splice N-SCAN EST Zimmermann, M.S. 2006 WASHINGTON UNIVERSITY THE HENRY EDWIN SEVER GRADUATE SCHOOL DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING LEVERAGING EST EVIDENCE TO AUTOMATICALLY PREDICT ALTERNATIVELY SPLICED GENES by Bob Zimmermann, B.S. Prepared under the direction of Michael Brent A thesis presented to the Henry Edwin Sever Graduate School of Washington University in partial fulfillment of the requirements for the degree of MASTER OF SCIENCE December 2006 Saint Louis, Missouri WASHINGTON UNIVERSITY THE HENRY EDWIN SEVER GRADUATE SCHOOL DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING ABSTRACT LEVERAGING EST EVIDENCE TO AUTOMATICALLY PREDICT ALTERNATIVELY SPLICED GENES by Bob Zimmermann ADVISOR: Michael Brent December 2006 Saint Louis, Missouri Current methods for high-throughput automatic annotation of newly sequenced ge- nomes are largely limited to tools which predict only one transcript per gene lo- cus. Evidence suggests that 20-50% of genes in higher eukariotic organisms are alternatively spliced. This leaves the remainder of the transcripts to be annotated by hand, an expensive time-consuming process. Genomes are being sequenced at a much higher rate than they can be annotated. We present three methods for using the alignments of inexpensive Expressed Sequence Tags in combination with HMM-based gene prediction with N-SCAN EST to recreate the vast majority of hand annotations in the D.melanogaster genome. In our first method, we “piece together” N-SCAN EST predictions with clustered EST alignments to increase the number of transcripts per locus predicted. This is shown to be a sensitve and accurate method, predicting the vast majority of known transcripts in the D.melanogaster genome. We present an approach of using these clusters of EST alignments to construct a Multi- Pass gene prediction phase, again, piecing it together with clusters of EST alignments. While time consuming, Multi-Pass gene prediction is very accurate and more sensitive than single-pass. Finally, we present a new Hidden Markov Model instance, which augments the current N-SCAN EST HMM, that predicts multiple splice forms in a single pass of prediction. This method is less time consuming, and performs nearly as well as the multi-pass approach. to Laura, whose unparalleled bravery and positivity will forever inspire me Contents List of Figures .................................. vi Acknowledgments ................................ viii Preface ....................................... ix 1 Introduction .................................. 1 1.1 Bioinformatics and genome annotation . 2 1.2 The spectrum of gene prediction . 2 1.3 ESTs and their uses . 3 1.4 The crux of the modeling problem . 4 1.5 The proposed solution . 4 1.6 Goals . 5 1.7 Major contributions . 5 2 Background .................................. 6 2.1 DNA . 6 2.2 RNA . 8 2.3 DNA Transcription and Translation . 9 2.3.1 Splicing . 11 2.4 Alternative splicing events . 11 2.5 Methods for Predicting Alternative Splices . 13 2.5.1 Sequence-based methods . 13 2.5.2 Evidence-based methods . 13 3 Gene Prediction ............................... 15 3.1 Markov Chains . 15 3.2 HMMs and the Viterbi Algorithm . 16 3.3 Genscan . 19 iii 3.3.1 Duration distributions . 21 3.3.2 Content models . 21 3.3.3 Signal models . 22 3.3.4 Model inflexibility . 23 3.4 Twinscan and additional sequences . 24 3.5 N-SCAN and UTR prediction . 25 3.6 N-SCAN EST............................... 27 4 Methods .................................... 28 4.1 Methods overview . 28 4.2 Parameter estimation . 30 4.3 EST processing with the PASA pipeline . 31 4.3.1 Alignment . 31 4.3.2 Assembly . 33 4.3.3 Comparison . 34 4.3.4 Update . 37 4.3.5 Modifications to the PASA pipeline . 37 4.4 N-SCAN EST + PASA . 37 4.5 N-SCAN MP EST + PASA . 38 4.6 N-SCAN AS EST . 40 4.6.1 Design and choice of additional HMM states . 40 4.6.2 The use of AS ESTSEQ . 42 4.6.3 Definition of alternative splicing events for training purposes . 44 4.6.4 Training the new ASHMM . 47 5 Results and Conclusion ........................... 49 5.1 Sensitivity and specificity measures . 49 5.2 Cross validation . 50 5.3 How to read UCSC annotation pictures . 51 5.4 Sans PASA, N-SCAN MP EST performs best . 51 5.4.1 N-SCAN AS EST shows no AS predictive power . 52 5.4.2 N-SCAN MP EST predicts complex, subtle ASs . 53 5.5 With PASA, all methods perform comparably .