Structural Variation Discovery: the Easy, the Hard and the Ugly
STRUCTURAL VARIATION DISCOVERY: THE EASY, THE HARD AND THE UGLY

by

Fereydoun Hormozdiari
B.Sc., Sharif University of Technology, 2004
M.Sc., Simon Fraser University, 2007

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY
in the School of Computing Science
Faculty of Applied Science

© Fereydoun Hormozdiari 2011
SIMON FRASER UNIVERSITY
Fall 2011

All rights reserved. However, in accordance with the Copyright Act of Canada, this work may be reproduced without authorization under the conditions for Fair Dealing. Therefore, limited reproduction of this work for the purposes of private study, research, criticism, review and news reporting is likely to be in accordance with the law, particularly if cited appropriately.

APPROVAL

Name: Fereydoun Hormozdiari
Degree: Doctor of Philosophy
Title of thesis: Structural Variation Discovery: the easy, the hard and the ugly
Examining Committee:
Dr. Ramesh Krishnamurti, Chair
Dr. S. Cenk Sahinalp, Senior Supervisor, Computing Science, Simon Fraser University
Dr. Evan E. Eichler, Supervisor, Genome Sciences, University of Washington
Dr. Artem Cherkasov, Supervisor, Urologic Sciences, University of British Columbia
Dr. Inanc Birol, Supervisor, Bioinformatics group leader, GSC, BC Cancer Agency
Dr. Fiona Brinkman, Internal Examiner, Molecular Biology and Chemistry, Simon Fraser University
Dr. Serafim Batzoglou, External Examiner, Computer Science, Stanford University
Date Approved: August 22nd, 2011

Partial Copyright Licence

Abstract

Comparison of human genomes shows that along with single nucleotide polymorphisms and small indels, larger structural variants (SVs) are common. Recent studies even suggest that more base pairs are altered as a result of structural variations (including copy number variations) than as a result of single nucleotide variations or small indels.
It is also known that structural variations can cause a loss or gain of functionality and can have phenotypic effects. Recently, with the advent of high-throughput sequencing technologies, the field of genomics has been revolutionized. The realization of high-throughput sequencing platforms now makes it feasible to detect the full spectrum of genomic variation (including SVs) among many individual genomes, including cancer patients and others suffering from diseases of genomic origin. In addition, high-throughput sequencing technologies make it possible to extend the scope of structural variation studies to a point previously unimaginable, as exemplified by the 1000 Genomes Project.

In this dissertation we consider the structural variation discovery problem using high-throughput sequencing technologies. We provide combinatorial formulations for this problem under a maximum parsimony assumption, and design approximation algorithms for them. We also extend our proposed algorithms to consider conflicts between potential structural variations and resolve them. Our algorithms are able to detect most of the well-known structural variation types, including small insertions, deletions, inversions, and transpositions. Finally, we extend our algorithms to allow simultaneous discovery of structural variations in multiple genomes and thus improve the final comparative results between different donors.

To My Mom and Dad
And All Other Cancer Survivors

“If they Release All Political Prisoners, I will Forgive My Son’s Murderers”
Sohrab Arabi’s Mother, ONE OF THE MARTYRS OF IRAN GREEN MOVEMENT, June 2009

Acknowledgments

First and foremost I offer my sincerest and utmost gratitude to my senior supervisor, Dr. Cenk Sahinalp, who has supported me throughout my graduate studies with his encouragement, guidance and resourcefulness.
In addition to the technical skills of computer science, I have learned the requirements of being a good researcher from him. I thank him one more time for all his help.

I would like to give my regards and appreciation to Dr. Evan Eichler, whose experience in genomics was a crucial factor in the success of this thesis. My visit to his lab greatly influenced the direction of my PhD research. I would also like to thank the other members of my committee, Dr. Inanc Birol, Dr. Artem Cherkasov and Dr. Fiona Brinkman, for their help in my graduate studies. I would like to sincerely thank Dr. Serafim Batzoglou, who kindly agreed to be my external examiner. The ideas and comments that he shared during his visit have provided me with additional directions for extending my research.

I would also like to thank my other collaborators Can Alkan, Iman Hajirasouliha, Phuong Dao, Raheleh Salari, Faraz Hach, Deniz Yorukoglu, and Farhad Hormozdiari. I benefited from these collaborations, and hope to continue working with them. I would also like to thank all my dear friends, Laleh Samii, Hamed Yaghoubi, Sara Ejtemaee, Shahab Tasharrofi, Mohammad Tayebi, Navid Imani, Alireza Hojjati, Azadeh Akhtari, Pashootan Vaezi, Ali Khalili, Amir Hedayaty, Hossein Jowhari, and Kamyar Khodamoradi.

Last, but not least, I thank my family for the support they gave me in all these years. Their motivation was a huge driving force for me.

Fereydoun Hormozdiari

Contents

Approval
Abstract
Dedication
Quotation
Acknowledgments
Contents
List of Tables
List of Figures
1 Introduction
1.1 Genome
1.2 Genome Variation
1.2.1 Non Sequencing-Based Variation Discovery
1.3 High throughput sequencing
1.4 De novo assembly and comparison
1.4.1 De novo assembly
1.4.2 Edit distance and block edit distance
1.5 Mapping based strategy
1.5.1 Mapping Algorithms
1.5.2 Read-Pair Analysis
1.5.3 Read-Depth Analysis
1.5.4 Split-Read Analysis
1.6 Thesis Overview
2 Optimal pooling for genome re-sequencing
2.1 Problem definition and general algorithmic approach
2.1.1 Pooling problem
2.2 Methods
2.2.1 The pooling problem under $f(C_{i,b}) = \binom{C_{i,b}}{2}$
2.2.2 The balanced pooling problem under $f(C_{i,b}) = \binom{C_{i,b}}{2}$
2.2.3 The pooling problem under $f(C_{i,b}) = C_{i,b} - 1$
2.2.4 The balanced pooling problem under $f(C_{i,b}) = C_{i,b} - 1$
2.3 Results and discussion
2.3.1 Pooling experiments with LSMnC/LSMnS algorithms
2.3.2 Pooling experiments with GAHP/GABHP algorithms
2.4 Other Applications
2.4.1 Fosmid clone resequencing
2.4.2 Exome sequencing
3 Structural Variation Discovery Via Maximum Parsimony
3.1 Insertion, Deletion, Inversion and Transposition Events
3.1.1 Notations and Definition
3.1.2 Structural Variation Detection based on Maximum Parsimony
3.1.3 O(log n) approximation for MPSV
3.1.4 Maximal Valid Clusters
3.2 Novel Insertion Discovery
3.2.1 NovelSeq pipeline
3.3 Maximum Parsimony Structural Variation with Conflict Resolution
3.3.1 Conflicting SV clusters in haploid and diploid genome sequences
3.3.2 Formal definition of the MPSV-CR problem
3.3.3 Computational complexity of MPSV-CR
3.3.4 An efficient solution to the MPSV-CR problem
3.4 Experimental Results
3.4.1 Best Mapping vs All Mapping
3.4.2 Mobile Element Insertion Discovery
3.4.3 Deletion and Inversion Discovery on NA18507
3.4.4 Novel Insertion Discovery on NA18507
3.4.5 Structural Variation Prediction with VariationHunter-CR
4 Alu transposition map on human genomes
4.1 Discovery and Validation
4.2 Familial Transmission
4.2.1 Genotyping
4.3 Genome comparisons
5 Simultaneous Structural Variation Discovery
5.1 Introduction
5.2 Methods
5.2.1 Simultaneous Structural Variation Discovery among Multiple Genomes
5.2.2 The Algorithmic Formulation of the SSV-MG Problem
5.2.3 Complexity of SSV-MG problem
5.2.4 A simple approximation algorithm for SSV-MG problem
5.2.5 A maximum flow-based update for Red-Black Assignment
5.2.6 An $O(1 + \omega_{\max}/\omega_{\min})$ approximation algorithm for limited read mapping loci
5.2.7 Efficient heuristic methods for the SSV-MG problem
5.3 Results
5.3.1 Mobile element insertions
5.3.2 Deletions
6 Conclusion and Discussion
6.1 Future Works
Appendices
A Conflicting Valid Clusters for Haploid Genome
B Alu insertions in genes and exons
C Alu Insertion Genotyping
Bibliography

List of Tables

1.1 A summary of the different methods used for structural variation discovery based on read-pair, read-depth and split-read analysis.
3.1 Summary of mobile element insertion prediction results in the Venter genome. We show the precision and recall rates of our mobile element (Alu, NCAI, and SVA) insertion discovery. We compare our mobile element insertion predictions with both [154] and the HuRef genome assembly [87]. The results demonstrate