UC San Diego
UC San Diego Electronic Theses and Dissertations

Title
Computational methods for analyzing and detecting genomic structural variation : applications to cancer

Permalink
https://escholarship.org/uc/item/9x56z8qw

Author
Bashir, Ali

Publication Date
2009

Supplemental Material
https://escholarship.org/uc/item/9x56z8qw#supplemental

Peer reviewed|Thesis/dissertation

eScholarship.org
Powered by the California Digital Library
University of California

UNIVERSITY OF CALIFORNIA, SAN DIEGO

Computational Methods for Analyzing and Detecting Genomic Structural Variation: Applications to Cancer

A dissertation submitted in partial satisfaction of the requirements for the degree Doctor of Philosophy in Bioinformatics

by

Ali Bashir

Committee in charge:
Professor Vineet Bafna, Chair
Professor Trey Ideker
Professor Pavel Pevzner
Professor Benjamin Raphael
Professor Bing Ren
Professor Nicholas Schork

2009

Copyright
Ali Bashir, 2009
All rights reserved.

The dissertation of Ali Bashir is approved and it is acceptable in quality and form for publication on microfilm and electronically:

Chair

University of California, San Diego
2009

DEDICATION

To my girlfriend, for her constant encouragement; my parents, for their constant supply of food, clean laundry, and car advice; and my niece and nephews for being a constant source of distraction.

TABLE OF CONTENTS

Signature Page ... iii
Dedication ... iv
Table of Contents ... v
List of Figures ... ix
List of Tables ... xi
Acknowledgements ... xii
Vita ... xiv
Abstract of the Dissertation ... xv

Chapter 1 Introduction ... 1
  1.1 Why “large” events? ... 2
  1.2 How do you observe structural variants? ... 5
    1.2.1 Primer Approximation Multiplex PCR (PAMP) ... 5
    1.2.2 High-throughput Sequencing ... 6
  1.3 In which contexts should we examine structural variants? ... 7
    1.3.1 Cancer applications ... 8
    1.3.2 Other applications ... 8

Chapter 2 Optimization of primer design for the detection of genomic lesions in cancer ... 10
  2.1 Introduction ... 10
  2.2 Optimizing primer design ... 14
    2.2.1 PAMP design ... 15
    2.2.2 Extensions ... 16
  2.3 Complexity of PAMP design ... 17
  2.4 Algorithms for PAMP design ... 18
  2.5 Results ... 23
    2.5.1 Experimental validation ... 23
    2.5.2 Computational modeling ... 27
    2.5.3 Convergence and running time ... 30
  2.6 Discussion ... 31
  2.7 Acknowledgements ... 32

Chapter 3 Two-Sided PAMP and Alternating Multiplexing ... 33
  3.1 Introduction ... 33
  3.2 Methods: A multiplexed approach to PAMP design ... 36
    3.2.1 Amplification ≠ Detection ... 36
    3.2.2 Simulated Annealing for Optimization ... 40
  3.3 Results ... 43
    3.3.1 Simulations ... 44
    3.3.2 Left vs. Right Breakpoint Detection ... 45
    3.3.3 Experimental Confirmation of CDKN2A ... 48
    3.3.4 Running time ... 48
  3.4 Discussion ... 49
  3.5 Acknowledgements ... 51

Chapter 4 Evaluation of paired-end sequencing strategies- applications to gene fusion ... 52
  4.1 Introduction ... 52
  4.2 Results ... 55
    4.2.1 Computing probability of a fusion gene ... 55
    4.2.2 Fusion Predictions in Breast Cancer ... 58
    4.2.3 Detection and Localization of Genome Rearrangements ... 62
    4.2.4 Comparison of Sequencing Strategies ... 64
    4.2.5 Lengths of Fusion Genes ... 66
    4.2.6 Effects of Errors ... 69
  4.3 Discussion ... 71
    4.3.1 Defining the Genomic Features of Interest ... 74
    4.3.2 Choice of Sequencing Parameters ... 75
    4.3.3 Organization of Cancer Genomes ... 76
    4.3.4 Extensions and Applications ... 78
  4.4 Methods ... 78
    4.4.1 Mapping and clustering of end sequences ... 78
    4.4.2 Validating fusion predictions by sequencing ... 79
    4.4.3 Computing fusion probability ... 79
    4.4.4 Algorithms for efficient probability computation ... 81
    4.4.5 Expected number of fusion points ... 82
    4.4.6 Localization of Rearrangement Fusion Points ... 83
  4.5 Acknowledgements ... 85

Chapter 5 On design of deep sequencing experiments ... 86
  5.1 Introduction ... 86
  5.2 Results ... 89
  5.3 Discussion ... 96
  5.4 Methods ... 99
    5.4.1 Breakpoint Resolution ... 99
    5.4.2 Simulation ... 100
    5.4.3 Mixing clone-lengths ... 100
    5.4.4 Proof of Optimality of Two Clone Design ... 102
    5.4.5 Simulation for mix of clones ... 103
    5.4.6 Transcript Sequencing ... 103
    5.4.7 Haplotype assembly ... 104
  5.5 Acknowledgements ... 105

Chapter 6 Reconstructing Genomic Architectures ... 106
  6.1 Introduction ... 106
  6.2 Methods ... 107
    6.2.1 Obtaining an architecture graph ... 107
    6.2.2 Retrieving Optimal Eulerian Paths ... 108
  6.3 Discussion ... 114

Chapter 7 Evidence for Large Inversion Polymorphisms in the Human Genome from HapMap data ... 116
  7.1 Introduction ... 116
  7.2 Results ... 118
    7.2.1 Overview of Method ... 119
    7.2.2 Power to detect Inversion Polymorphisms ... 122
    7.2.3 Scanning the HapMap data for inversion polymorphisms ... 123
    7.2.4 Sequence Analysis of Inversion Breakpoints ... 128
    7.2.5 Assessing the false positive rate ... 130
  7.3 Discussion ... 133
  7.4 Methods ... 135
    7.4.1 Haplotype Data ... 135
    7.4.2 Defining multi-SNP markers ... 136
    7.4.3 Computing LD ... 137
    7.4.4 The Inversion Statistic ... 138
    7.4.5 Identifying potential inversions ... 139
    7.4.6 Simulating Inversions ... 141
    7.4.7 Sequence Analysis ... 141
    7.4.8 Coalescent Simulations ... 142
  7.5 Acknowledgements ... 143
  7.6 Supplemental material attached electronically ... 143

Chapter 8 Orthologous repeats and phylogenetic inference ... 144
  8.1 Introduction ... 144
  8.2 Approach ... 145
  8.3 Results ... 153
    8.3.1 Species with finished sequence ... 153
    8.3.2 A larger set of species ... 154
    8.3.3 Assessment of Incompatible Repeats ... 156
  8.4 Discussion ... 158
  8.5 Methods ... 162
  8.6 Acknowledgements ... 167
  8.7 Supplemental material attached electronically ... 167

Chapter 9 Conclusions ... 168
  9.1 Open Problems ... 170
    9.1.1 Genomic diagnostics for disease ... 170
    9.1.2 Predicting Fusion Events and Architectures ... 171
    9.1.3 Population genetics and phylogenetics ... 172
  9.2 A modest proposal ... 174

Appendix A Supplemental: Optimization of primer design for the detection of genomic lesions in cancer ... 175
  A.1 Complexity of PAMP design ... 175
  A.2 Methods and Parameters ... 177
    A.2.1 Computational ... 177
    A.2.2 Experimental ... 179
  A.3 Supplemental Figures ... 179

Appendix B Supplemental: Two-Sided PAMP and Alternating Multiplexing ... 181
  B.1 Experimental Methods ... 181
  B.2 Proofs ... 181

Appendix C Supplemental: Evaluation of paired-end sequencing strategies- applications to gene fusion