UC San Diego UC San Diego Electronic Theses and Dissertations
Total Page:16
File Type:pdf, Size:1020Kb
UC San Diego UC San Diego Electronic Theses and Dissertations Title Statistical Algorithms for High-throughput Biological Data / Permalink https://escholarship.org/uc/item/7kt5768c Author Jeong, Kyowon Publication Date 2013 Peer reviewed|Thesis/dissertation eScholarship.org Powered by the California Digital Library University of California UNIVERSITY OF CALIFORNIA, SAN DIEGO Statistical Algorithms for High-throughput Biological Data A dissertation submitted in partial satisfaction of the requirements for the degree Doctor of Philosophy in Electrical Engineering (Communication Theory and Systems) by Kyowon Jeong Committee in charge: Professor Pavel Pevzner, Chair Professor Young-Han Kim, Co-Chair Professor Vineet Bafna Professor Nuno Bandeira Professor Alon Orlitsky Professor Alexander Vardy 2013 Copyright Kyowon Jeong, 2013 All rights reserved. The dissertation of Kyowon Jeong is approved, and it is acceptable in quality and form for publication on micro- film and electronically: Co-Chair Chair University of California, San Diego 2013 iii DEDICATION To Hanbyul and Seoyun iv EPIGRAPH P曰 x而g思GT 思而gxG殆 v TABLE OF CONTENTS Signature Page . iii Dedication . iv Epigraph . iv Table of Contents . vi List of Figures . ix List of Tables . .x Acknowledgements . xi Vita......................................... xii Abstract of the Dissertation . xiii Chapter 1 Introduction . .1 1.1 Universal de novo peptide sequencing algorithm . .1 1.2 Enabling proteogenomic searches in six-frame translation of genomic sequences . .2 1.3 False discovery rates in spectral identification . .3 1.4 Sensitive somatic mutation profiling incorporating sample impurity . .4 Chapter 2 UniNovo : a universal tool for de novo peptide sequencing . .5 2.1 Introduction . .5 2.2 Methods . .8 2.2.1 Vector operations . .8 2.2.2 Terminology and definitions . .9 2.2.3 Peptide-spectrum generative model . 10 2.2.4 Training UniNovo . 11 2.2.5 How to infer fragmentation sites from a spectrum . 13 2.2.6 Generating de novo reconstructions . 17 2.2.7 How to extend UniNovo algorithm for the realistic model . 19 2.3 Results . 22 2.3.1 Datasets . 22 2.3.2 Benchmarking UniNovo . 24 2.3.3 Evaluation of the spectrum graph . 27 2.3.4 De novo sequencing of paired spectra . 28 2.3.5 De novo sequencing with quality filtering . 29 vi 2.4 Conclusion . 32 Chapter 3 Gapped Spectral Dictionaries and Their Applications for Database Searches of Tandem Mass Spectra . 43 3.1 Introduction . 43 3.2 Methods . 46 3.2.1 Path Dictionary Problem. 46 3.2.2 Gapped Path Dictionary Problem. 47 3.2.3 Gapped Spectral Dictionaries. 49 3.3 Results . 51 3.3.1 Datasets . 51 3.3.2 From Gapped Spectral Dictionaries to gapped tags 54 3.3.3 Database search with Gapped Spectral Dictionaries 56 3.3.4 Proteogenomics application . 58 3.4 Conclusion . 59 Chapter 4 False discovery rates in spectral identification . 74 4.1 Introduction . 74 4.2 Materials . 76 4.2.1 MS/MS Spectra . 76 4.2.2 Protein Database . 76 4.2.3 Database Search Engine . 77 4.3 Methods . 77 4.4 Results . 81 4.4.1 How to construct a decoy database: reversed vs shuffled 81 4.4.2 Concatenated vs separate decoy . 82 4.4.3 Choice of formula to calculate FDR . 83 4.4.4 Impact of the size of the database . 84 4.4.5 How the number of spectra affects the results . 85 4.4.6 Expected gains from accurate peptide parent masses 85 4.4.7 How the score normalization affects the results . 86 4.4.8 PSM-level vs Peptide-level FDR . 87 4.4.9 Two-pass searches and TDA . 88 4.5 Conclusion . 91 Chapter 5 Virmid: virtual microdissection of sample mixtures for accurate somatic mutation profiling . 102 5.1 Introduction . 102 5.2 Results and Discussions . 105 5.2.1 Test on Simulated data . 106 5.2.2 Test on breast cancer data . 109 5.2.3 Application to HME exome sequencing data . 112 5.3 Conclusions . 115 5.4 Materials and Methods . 115 vii 5.4.1 Virmid Model . 115 5.4.2 Data preparation . 125 5.4.3 Program implementation and optimization . 128 Bibliography . 141 viii LIST OF FIGURES Figure 2.1: UniNovo peptide-spectrum generative model . 36 Figure 2.2: Ion type distributions of raw and processed spectra . 37 Figure 2.3: Comparison of de novo sequencing tools . 38 Figure 2.4: The Venn diagrams of the correctly sequenced spectra . 39 Figure 2.5: Comparison of de novo sequencing tools in terms of amino acid level precision . 39 Figure 2.6: De novo sequencing with qualify filtering of spectra . 40 Figure 2.7: Comparison of spectrum graphs . 41 Figure 2.8: De novo sequencing of paired spectra . 42 Figure 3.1: Different modules of MS-GappedDictionary . 64 Figure 3.2: Spectra for the peptide LNRVSQGK and AIIDAIVSGELK . 65 Figure 3.3: Illustration of the dynamic programming algorithm for computing the generating function of graph . 66 Figure 3.4: Gapped Spectral Dictionary size vs. Spectral Dictionary size . 68 Figure 3.5: Distribution of the lengths of the gapped peptides induced by cor- rect peptides . 68 Figure 3.6: Identifiability of the δ-reduced Gapped Spectral Dictionaries . 69 Figure 3.7: Average rank of correct gapped peptides . 70 Figure 3.8: The probability that a correct gapped peptide is found within k top-ranked peptides . 70 Figure 3.9: Identifiability of the Pocket Dictionaries from the Shewanella dataset 71 Figure 3.10: Comparison of gapped tags and the peptide sequence tags . 72 Figure 3.11: The FDR curves for MS-GappedDictionary . 72 Figure 3.12: The length distributions of peptides identified by MS-GappedDic- tionary and MS-Dictionary . 73 Figure 5.1: Overall Virmid workflow . 133 Figure 5.2: Multi-tier sampling of Virmid . 134 Figure 5.3: Estimation of contamination . 135 Figure 5.4: Maximum likelihood estimation and search . 136 Figure 5.5: Performance of somatic mutation detection . 137 Figure 5.6: BAF distribution of call sets . 138 Figure 5.7: Test on 15 breast cancer exome sequencing data. 139 Figure 5.8: Estimated α in HME samples. 140 ix LIST OF TABLES Table 2.1: Summary of the datasets used for benchmarking of UniNovo . 34 Table 2.2: Partitioning of spectra and peak intensities for UniNovo . 35 Table 3.1: Gapped Spectral Dictionary and gapped tags . 61 Table 4.1: Details on experiments performed . 93 Table 4.2: 2 2 tables for Fisher's exact test. 95 Table 4.3: Comparison× between searches using reversed or shuffled decoy data- bases . 96 Table 4.4: Comparison between concatenated-decoy searches and separate-decoy searches . 97 Table 4.5: Comparison between two FDR formulas . 97 Table 4.6: Comparison between searches against databases of different sizes . 98 Table 4.7: Comparisons between searches with different portions of unidenti- fiable spectra . 98 Table 4.8: Comparison between searches with strict and loose parent mass tolerance. 99 Table 4.9: Comparison between searches with differently normalized scoring functions . 99 Table 4.10: Comparison between peptide-level factual FDR and PSM-level FDR 100 Table 4.11: Comparison between peptide-level factual FDR and peptide-level FDR .................................. 100 Table 4.12: Comparison between the single-pass search (the search Y-1) and various two-pass search methods (the searches Y-10 to Y-13) . 101 Table 5.1: Accuracy and robustness of estimated α ............... 129 Table 5.2: Improved mutation calling in higher coverage . 130 Table 5.3: A list of 15 public breast cancer data from TCGA. 131 Table 5.4: Mutation burden of validated somatic mutations in HME . 132 Table 5.5: Newly predicted somatic mutations in HME data . 132 x ACKNOWLEDGEMENTS I would like to express the deepest appreciation to my advisor, Prof. Pavel Pevzner. Not only he provided me excellent insights and ideas for my research but also did he teach me how to communicate with other researchers efficiently and effectively. Without his brilliant guidance, this dissertation would not have been possible. I am also thankful to Dr. Sangtae Kim, who introduced me the field of Bioin- formatics and persistently helped me to adopt myself to the field. Prof. Nuno Ban- deira and Prof. Vineet Bafna also have been great collaborators and they gave me constructive comments and warm encouragement which I deeply appreciate. I am grateful to Prof. Young-Han Kim, Prof. Alon Orlitsky, and Prof. Alexander Vardy for their serving in my committee. In addition, I would like to thank Sangwoo Kim, Hosein Hosein Mohimani, and Julio Ng for their help in various projects during my graduate career. My heartfelt appreciation also goes to all members in my research group; they have been great col- leagues and good friends. Finally, I want to thank all my collaborators and coauthors of my scientific publications. Chapter 2, in full, was published as \UniNovo : a universal tool for de novo peptide sequencing". K. Jeong, S. Kim, and P. A. Pevzner. Bioinformatics, doi: 10.1093/bioinformatics/btt338. The dissertation author was the primary author of this paper. Chapter 3 was published as \Gapped Spectral Dictionaries and Their Ap- plications for Database Searches of Tandem Mass Spectra". K. Jeong, S. Kim, N. Bandeira, and P. A. Pevzner. Molecular&Cellular Proteomics, vol. 10, M110.002220, 03 2011. The dissertation author is the primary author of this paper. Chapter 4, in full, was published as \False discovery rates in spectral identi- fication”. K. Jeong, S. Kim, and N. Bandeira. BMC Bioinformatics, vol. 13(Suppl 16), S2, 11 2012. The dissertation author is the primary author of this paper. Chapter 5, in full, was submitted as \Virmid: virtual microdissection of sample mixtures for accurate somatic mutation profiling". S. Kim, K. Jeong, K. Bhutani1, J.