University of California, San Diego

UNIVERSITY OF CALIFORNIA, SAN DIEGO

Computational Techniques to Investigate Structural Variation

A dissertation submitted in partial satisfaction of the requirements for the degree Doctor of Philosophy in Bioinformatics and Systems Biology

by

Marcus Christopher Kinsella

Committee in charge:
Professor Vineet Bafna, Chair
Professor Kelly A. Frazer, Co-Chair
Professor Pavel A. Pevzner
Professor Jonathan Sebat
Professor Kun Zhang

2013

Copyright Marcus Christopher Kinsella, 2013. All rights reserved.

The dissertation of Marcus Christopher Kinsella is approved, and it is acceptable in quality and form for publication on microfilm and electronically:

Co-Chair
Chair

University of California, San Diego
2013

DEDICATION

To my Mom for her support, and to my boyfriend for his relentless encouragement.

TABLE OF CONTENTS

Signature Page
Dedication
Table of Contents
List of Figures
List of Tables
Acknowledgements
Vita
Abstract of the Dissertation

Chapter 1 Introduction
  1.1 The Scale of Genetic Variation
  1.2 Detecting Structural Variations
  1.3 Algorithmic Challenges in Structural Variation Detection

Chapter 2 Sensitive gene fusion detection using ambiguously mapping RNA-Seq read pairs
  2.1 Introduction
  2.2 Methods
    2.2.1 Discovery of Putative Fusions
    2.2.2 Mapping to Augmented Reference
    2.2.3 Model of Paired-End RNA-Seq Data
    2.2.4 Expectation Maximization
    2.2.5 Calculating Mappings to Fusion Junctions
  2.3 Results
    2.3.1 Fusion Transcripts Generate Ambiguous Reads
    2.3.2 Resolving Ambiguous Simulated Fusions
    2.3.3 Application to a Prostate Tissue Transcriptome Data
    2.3.4 Discovery of Novel Ambiguous Fusions
  2.4 Discussion
  2.5 Acknowledgements

Chapter 3 Combinatorics of the Breakage-Fusion-Bridge Mechanism
  3.1 Introduction
  3.2 Formalizing the BFB Schedule
  3.3 Algorithms for BFB
  3.4 Results
  3.5 Discussion
  3.6 Acknowledgements

Chapter 4 An algorithmic approach for breakage-fusion-bridge detection in tumor genomes
  4.1 Introduction
  4.2 High-throughput evidence for BFB
    4.2.1 Breakpoints
    4.2.2 Copy counts
    4.2.3 Formalizing BFB
    4.2.4 Handling experimental imprecision
    4.2.5 The BFB Count Vector Problem
  4.3 Outline of the BFB Count Vector Algorithms
    4.3.1 Properties of BFB palindromes
    4.3.2 Required conditions for folding
  4.4 Running time
  4.5 Detecting Signatures of BFB
  4.6 Results
  4.7 Discussion
  4.8 Acknowledgements

Chapter 5 Does Chromothripsis Have a Distinguishing Signature?
  5.1 Introduction
  5.2 Methods
    5.2.1 Finding Chromosome Arrangements Consistent with Observed Breakpoints
  5.3 Results
    5.3.1 Simulating Progressive Rearrangements
    5.3.2 Chromothripsis Footprint Criteria Depend on Subtle Simulation Implementation Details
    5.3.3 Simulation Method Does Not Distinguish Between Progressive Rearrangement and Chromothripsis
    5.3.4 Plausible Progressive Rearrangement Schemes Exist for Chromosomes Bearing Footprint of Chromothripsis
  5.4 Discussion
  5.5 Acknowledgements

Appendix A Supplemental: Sensitive gene fusion detection using ambiguously mapping RNA-Seq read pairs
  A.1 Ambiguous fusion sequences
    A.1.1 HOMEZ-MYH6
    A.1.2 KIAA1267-ARL17A
    A.1.3 CPEB1-RPS17
    A.1.4 PPIP5K1-CATSPER2

Appendix B Supplemental: Combinatorics of the Breakage-Fusion-Bridge Mechanism
  B.1 Proofs
  B.2 Applying BFB Rules
  B.3 Analysis of BFB_Tree

Appendix C Supplemental: An algorithmic approach for breakage-fusion-bridge detection in tumor genomes
  C.1 Properties of BFB Strings
  C.2 Algorithm SEARCH-BFB
    C.2.1 Additional Notation and Collection Arithmetics
    C.2.2 Folding Increases Signature
    C.2.3 The FOLD Procedure
    C.2.4 Correctness of Algorithm SEARCH-BFB
    C.2.5 Time Complexity of Algorithm SEARCH-BFB
  C.3 The Decision Variant
  C.4 The Distance Variant
  C.5 Chromosome simulation details
  C.6 Cancer cell line results
  C.7 ROC curves for varying simulation parameters
  C.8 Pancreatic cancer data analysis pipeline
  C.9 Possible arrangement of segments on BFB-rearranged chromosome 12

Bibliography

LIST OF FIGURES

Figure 2.1: A read pair that maps to a fusion between genes A1 and B1 may also map to homologous genes, leading either to spurious fusion candidates or the elimination of read pairs supporting a true fusion from consideration.
Figure 2.2: Creating fusion genes from discordantly mapping mate pairs.
Figure 2.3: Nominating potential fusion transcripts.
Figure 2.4: The graphical model of RNA-Seq read pairs. Transcript abundance, transcript choice, starting position, ending position, and observed read are represented by θ, T, S, E, and R, respectively.
Figure 2.5: In this simplified situation, maximizing the likelihood function would set the abundance of the fusion gene to 1 regardless of the relationship between N_A, N_B, and N_F.
Figure 2.6: The fusion between HOMEZ and MYH6. Three mate pairs support this fusion, but two also map to a fusion between HOMEZ and MYH7.
Figure 2.7: The fusion between CPEB1 and RPS17. A copy of RPS17 lies 2,000 bases downstream of CPEB1, but another copy lies 400 kilobases downstream, as well.
Figure 3.1: The Breakage Fusion Bridge mechanism.
Figure 3.2: An illustration of BFB-Pivot searching for candidate BFB strings.
Figure 3.3: A BFB-tree generated from an RB-BFB-schedule.
Figure 3.4: Pivot and tree algorithm running time.
Figure 3.5: Distribution of distances to nearest count-vector admitting a BFB schedule.
Figure 4.1: A schematic BFB process.
Figure 4.2: Layer visualization of a BFB palindrome.
Figure 4.3: An algorithm for the BFB count vector problem.
Figure 4.4: Simulation and pancreatic cancer results.
Figure 5.1: A hypothetical shattered chromosome.
Figure 5.2: A set of possible simulation steps.
Figure 5.3: Charts of number of breakpoints versus number of copy number states for simulated chromosomes.
Figure 5.4: Charts of breakpoints versus copy number states for simulations with an overrepresentation of inversions.
Figure 5.5: Breakpoints and copy numbers of a chromosome simulated with progressive inversions and deletions.
Figure 5.6: Counts of breakpoints and copy number states from a simulation based on the chromosome in Figure 5.5.
Figure 5.7: Result of the series of inversions and deletions for chromosome 5 of TK10.
Figure A.1: Graph of ambiguously mapping read count frequency data above.
Figure A.2: A short homologous sequence near the fusion site of GRHL2 and SNTG1.
Figure C.1: ROC curves for simulations with 2 rounds of BFB.
Figure C.2: ROC curves for simulations with 4 rounds of BFB.
Figure C.3: ROC curves for simulations with 6 rounds of BFB.
Figure C.4: ROC curves for simulations with 8 rounds of BFB.
Figure C.5: ROC curves for simulations with 10 rounds of BFB.
Figure C.6: Graphical representation of the analysis performed with the pancreatic cancer paired-end sequencing data.
Figure C.7: Plausible BFB cycles that could lead to the copy counts observed in chromosome 12 of pancreatic cancer sample PD3641.

LIST OF TABLES

Table 2.1: The fraction of totally and partially ambiguous fusions for a range of read lengths.
Table 2.2: Simulated fusions.
Table 2.3: Sum of expected values of Z_{nijk} for read pairs supporting each fusion after maximum-likelihood transcript abundance estimation.
Table 2.4: Prostate neoplasia fusions with sum of expected Z_{nijk} values.
Table 2.5: Prostate hyperplasia fusions with sums of expected Z_{nijk} values.
Table 2.6: Fusions found in previously published datasets that are either partially or completely supported by ambiguously mapping read pairs.
Table 3.1: Method and result for the count-vectors used to analyze algorithm speed.
Table 3.2: Percentage of count-vectors at least as close to a count-vector admitting a BFB schedule as the shown count-vector pair.
Table 5.1: Breakpoint positions and orientations for rearranged chromosome in Figure 5.1.
Table 5.2: Fraction of chromosomes in Figure 5.4a with few copy number states for given breakpoint counts.
Table 5.3: The number of observed breakpoints.
Table A.1: Frequency of ambiguously mapping read counts for various read lengths.
Table A.2: All gene fusions nominated by discordant read pairs in the simulated data.
Table A.3: Unambiguous fusion results from melanoma and UHR data.
Table C.1: The decomposition and signature of the collection B =.