University of California, San Diego

UNIVERSITY OF CALIFORNIA, SAN DIEGO Computational Techniques to Investigate Structural Variation A dissertation submitted in partial satisfaction of the requirements for the degree Doctor of Philosophy in Bioinformatics and Systems Biology by Marcus Christopher Kinsella Committee in charge: Professor Vineet Bafna, Chair Professor Kelly A. Frazer, Co-Chair Professor Pavel A. Pevzner Professor Jonathan Sebat Professor Kun Zhang 2013 Copyright Marcus Christopher Kinsella, 2013 All rights reserved. The dissertation of Marcus Christopher Kinsella is approved, and it is acceptable in quality and form for publication on microﬁlm and electronically: Co-Chair Chair University of California, San Diego 2013 iii DEDICATION To my Mom for her support, and to my boyfriend for his relentless encouragement. iv TABLE OF CONTENTS Signature Page . iii Dedication . iv Table of Contents . .v List of Figures . viii List of Tables . .x Acknowledgements . xi Vita ......................................... xii Abstract of the Dissertation . xiii Chapter 1 Introduction . .1 1.1 The Scale of Genetic Variation . .1 1.2 Detecting Structural Variations . .2 1.3 Algorithmic Challenges in Structural Variation Detection . .3 Chapter 2 Sensitive gene fusion detection using ambiguously mapping RNA- Seq read pairs . .5 2.1 Introduction . .5 2.2 Methods . .8 2.2.1 Discovery of Putative Fusions . .8 2.2.2 Mapping to Augmented Reference . 11 2.2.3 Model of Paired-End RNA-Seq Data . 11 2.2.4 Expectation Maximization . 14 2.2.5 Calculating Mappings to Fusion Junctions . 15 2.3 Results . 17 2.3.1 Fusion Transcripts Generate Ambiguous Reads . 17 2.3.2 Resolving Ambiguous Simulated Fusions . 18 2.3.3 Application to a Prostate Tissue Transcriptome Data 20 2.3.4 Discovery of Novel Ambiguous Fusions . 22 2.4 Discussion . 22 2.5 Acknowledgements . 25 Chapter 3 Combinatorics of the Breakage-Fusion-Bridge Mechanism . 26 3.1 Introduction . 26 3.2 Formalizing the BFB Schedule . 28 3.3 Algorithms for BFB . 30 v 3.4 Results . 43 3.5 Discussion . 47 3.6 Acknowledgements . 48 Chapter 4 An algorithmic approach for breakage-fusion-bridge detection in tumor genomes . 49 4.1 Introduction . 49 4.2 High-throughput evidence for BFB . 52 4.2.1 Breakpoints . 52 4.2.2 Copy counts . 53 4.2.3 Formalizing BFB . 53 4.2.4 Handling experimental imprecision . 55 4.2.5 The BFB Count Vector Problem . 55 4.3 Outline of the BFB Count Vector Algorithms . 56 4.3.1 Properties of BFB palindromes . 56 4.3.2 Required conditions for folding . 58 4.4 Running time . 62 4.5 Detecting Signatures of BFB . 63 4.6 Results . 64 4.7 Discussion . 67 4.8 Acknowledgements . 68 Chapter 5 Does Chromothripsis Have a Distinguishing Signature? . 69 5.1 Introduction . 69 5.2 Methods . 71 5.2.1 Finding Chromosome Arrangements Consistent with Observed Breakpoints . 71 5.3 Results . 72 5.3.1 Simulating Progressive Rearrangements . 72 5.3.2 Chromothripsis Footprint Criteria Depend on Subtle Simulation Implementation Details . 75 5.3.3 Simulation Method Does Not Distinguish Between Progressive Rearrangement and Chromothripsis . 79 5.3.4 Plausible Progressive Rearrangement Schemes Exist for Chromosomes Bearing Footprint of Chromothripsis 80 5.4 Discussion . 81 5.5 Acknowledgements . 82 Appendix A Supplemental: Sensitive gene fusion detection using ambiguously mapping RNA-Seq read pairs . 87 A.1 Ambiguous fusion sequences. 112 A.1.1 HOMEZ-MYH6 . 112 A.1.2 KIAA1267-ARL17A . 112 vi A.1.3 CPEB1-RPS17 . 113 A.1.4 PPIP5K1-CATSPER2 . 113 Appendix B Supplemental: Combinatorics of the Breakage-Fusion-Bridge Mech- anism . 114 B.1 Proofs . 114 B.2 Applying BFB Rules . 117 B.3 Analysis of BFB_Tree . 118 Appendix C Supplemental:An algorithmic approach for breakage-fusion-bridge detection in tumor genomes . 120 C.1 Properties of BFB Strings . 120 C.2 Algorithm SEARCH-BFB . 124 C.2.1 Additional Notation and Collection Arithmetics . 124 C.2.2 Folding Increases Signature . 129 C.2.3 The FOLD Procedure . 137 C.2.4 Correctness of Algorithm SEARCH-BFB . 146 C.2.5 Time Complexity of Algorithm SEARCH-BFB . 147 C.3 The Decision Variant . 149 C.4 The Distance Variant . 150 C.5 Chromosome simulation details . 151 C.6 Cancer cell line results . 152 C.7 ROC curves for varying simulation parameters . 153 C.8 Pancreatic cancer data analysis pipeline . 153 C.9 Possible arrangement of segments on BFB -rearranged chromosome 12 . 153 Bibliography . 165 vii LIST OF FIGURES Figure 2.1: A read pair that maps to a fusion between genes A1 and B1 may also map to homologous genes, leading either to spurious fusion candidates or the elimination of read pairs supporting a true fusion from consideration. .7 Figure 2.2: Creating fusion genes from discordantly mapping mate pairs. 12 Figure 2.3: Nominating potential fusion transcripts. 12 Figure 2.4: The graphical model of RNA-Seq read pairs. Transcript abundance, transcript choice, starting position, ending position, and observed read are represented by q, T, S, E, and R, respectively. 13 Figure 2.5: In this simpliﬁed situation, maximizing the likelihood function would set the abundance of the fusion gene to 1 regardless of the relation- ship between NA, NB, and NF ..................... 17 Figure 2.6: The fusion between HOMEZ and MYH6. Three mate pairs support this fusion, but two also map to a fusion between HOMEZ and MYH7. 23 Figure 2.7: The fusion between CPEB1 and RPS17. A copy of RPS17 lies 2,000 bases downstream of CPEB1, but another copy lies 400 kilo- bases downstream, as well. 23 Figure 3.1: The Breakage Fusion Bridge mechanism. 27 Figure 3.2: An illustration of BFB-Pivot searching for candidate BFB strings. 33 Figure 3.3: A BFB-tree generated from an RB-BFB-schedule. 36 Figure 3.4: Pivot and tree algorithm running time. 44 Figure 3.5: Distribution of distances to nearest count-vector admitting a BFB schedule. 46 Figure 4.1: A schematic BFB process. 50 Figure 4.2: Layer visualization of a BFB palindrome. 59 Figure 4.3: An algorithm for the BFB count vector problem. 62 Figure 4.4: Simulation and pancreatic cancer results. 65 Figure 5.1: A hypothetical shattered chromosome. 74 Figure 5.2: A set of possible simulation steps. 75 Figure 5.3: Charts of number of breakpoints versus number of copy number states for simulated chromosomes. 83 Figure 5.4: Charts of breakpoints versus copy number states for simulations with an overrepresentation of inversions. 84 Figure 5.5: Breakpoints and copy numbers of a chromosome simulated with progressive inversions and deletions. 84 Figure 5.6: Counts of breakpoints and copy number states from a simulation based on the chromosome in Figure 5.5 . 85 viii Figure 5.7: Result of the series of inversions and deletions for chromosome 5 of TK10 . 86 Figure A.1: Graph of ambiguously mapping read count frequency data above. 95 Figure A.2: A short homologous sequence near the fusion site of GRHL2 and SNTG1. 112 Figure C.1: ROC curves for simulations with 2 rounds of BFB. 158 Figure C.2: ROC curves for simulations with 4 rounds of BFB. 159 Figure C.3: ROC curves for simulations with 6 rounds of BFB. 160 Figure C.4: ROC curves for simulations with 8 rounds of BFB. 161 Figure C.5: ROC curves for simulations with 10 rounds of BFB. 162 Figure C.6: Graphical representation of the analysis performed with the pancreatic cancer paired-end sequencing data. 163 Figure C.7: Plausible BFB cycles that could lead to the copy counts observed in chromosome 12 of pancreatic cancer sample PD3641. 164 ix LIST OF TABLES Table 2.1: The fraction of totally and partially ambiguous fusions for a range of read lengths. 18 Table 2.2: Simulated fusions. 19 Table 2.3: Sum of expected values of Zni jk for read pairs supporting each fusion after maximum-likelihood transcript abundance estimation. 20 Table 2.4: Prostate neoplasia fusions with sum of expected Zni jk values. 21 Table 2.5: Prostate hyperplasia fusions with sums of expected Zni jk values. 21 Table 2.6: Fusions found in previously published datasets that are either partially or completely supported by ambiguously mapping read pairs. 24 Table 3.1: Method and result for the count-vectors used to analyze algorithm speed. 44 Table 3.2: Percentage of count-vectors at least as close to a count-vector admitting a BFB schedule as the shown count-vector pair. 46 Table 5.1: Breakpoint positions and orientations for rearranged chromosome in Figure 5.1. 73 Table 5.2: Fraction of chromosomes in Figure 5.4a with few copy number states for given breakpoint counts. 78 Table 5.3: The number of observed breakpoints. 81 Table A.1: Frequency of amibiguously mapping read counts for various read lengths. 87 Table A.2: All gene fusions nominated by discordant read pairs in the simulated data. 95 Table A.3: Unambiguous fusion results from melanoma and UHR data. 98 Table C.1: The decomposition and signature of the collection B =.

University of California, San Diego

Supplementary Table S4. FGA Co-Expressed Gene List in LUAD

Estrogen's Impact on the Specialized Transcriptome, Brain, and Vocal

Examining the Effects of Endogenous Sex Steroids and the Xenoestrogen Contaminant 17-Ethinylestradiol on Previtellogenic Coho Salmon Ovarian Growth and Function

In-Silico Guided Identification of Ciliogenesis Candidate Genes in a Non-Conventional Animal Model