De novo and haplotype assembly of polyploid genomes

Mohammadhossein Moeinzadeh

Dissertation zur Erlangung des Grades eines Doktors der Naturwissenschaften (Dr. rer. nat.) am Fachbereich Mathematik und Informatik der Freien Universität Berlin

Berlin, 2018

Erstgutachter: Prof. Dr. Martin Vingron
Zweitgutachter: Prof. Dr. Bernd Weisshaar

Datum der Disputation: 10.12.2018

To Sarah, Sheyda, and Kian


Acknowledgements

First and foremost, I would like to thank my supervisor Martin Vingron for his scientific support and guidance during my PhD. The door to his office was always open whenever I ran into a trouble spot or had a question about my research. I am grateful to him for giving me the opportunity to pursue my PhD in his group with exceptional scientists and a friendly environment.

My sincere thanks also go to Jun Yang, who contributed to the design of the analysis described in this thesis. I appreciated his help and time during the course of my study as a PhD candidate.

Besides my advisor, I would like to thank the rest of my thesis advisory committee, Stefan Haas and Knut Reinert, for their insightful advice on my thesis.

I thank Kirsten Kelleher for her kind help and support as the PhD coordinator of the IMPRS program, for the help with settling in Berlin, and especially for the great help with proofreading this thesis.

I thank Peter Arndt, Sebastiaan Meijsing, Robert Schöpflin, David Heller, and Evgeny Muzychenko for their help and insightful discussions, and also my former office mates Jun Yang, Xinyi Yang, Mohammad Sadeh, and Emmeke Aarts, and all the current and former members of the Computational group at MPIMG for the friendly and motivating environment.

I would like to thank my parents for supporting me spiritually throughout my study and my life in general. Finally, I must express my very profound gratitude to my wife for providing me with unfailing support and continuous encouragement throughout my years of study and through the process of researching and writing this thesis. This accomplishment would not have been possible without her. Thank you.


Contents

Acknowledgements

Contents

Abstract

1 Introduction

2 Genotype and Haplotype
  2.1 Introduction
  2.2 Genotype and Haplotype
  2.3 The importance of haplotypes
    2.3.1 Disease association studies
    2.3.2 Population genetics and evolution of organisms
    2.3.3 Sequence assembly and variant calling for polyploid genomes
  2.4 Different approaches for haplotyping
    2.4.1 Haplotype phasing via family information
    2.4.2 Haplotype phasing via population data
    2.4.3 Single Individual Haplotyping (SIH)
  2.5 Challenges
    2.5.1 Genome ploidy
    2.5.2 Low variant density
    2.5.3 Structural variants
    2.5.4 Repetitive regions
    2.5.5 Technical issues
  2.6 Ground truth data availability
  2.7 Metrics and evaluation

3 Single individual haplotype reconstruction: techniques and protocols
  3.1 Introduction
  3.2 How to model single individual haplotype assembly
  3.3 Early single individual haplotyping with sequence reads
    3.3.1 HapCut
    3.3.2 ReFHap
  3.4 NGS technologies
    3.4.1 Integration of variable insert length NGS data
    3.4.2 Integration of NGS and population data
    3.4.3 Integration of NGS and transcriptome data
    3.4.4 Chromosome isolation
  3.5 Single cell sequencing technologies with NGS
    3.5.1 Linked-read data (10X Genomics)
    3.5.2 Strand-seq
  3.6 Third-generation sequencing
    3.6.1 ProbHap
    3.6.2 WhatsHap
    3.6.3 Third-generation sequencing plus NGS reads
  3.7 Haplotype reconstruction for polyploid genomes
    3.7.1 HapCompass
    3.7.2 SDhaP
    3.7.3 H-PoP
  3.8 Summary

4 Haplotype assembly of polyploid genomes
  4.1 Introduction
  4.2 Problem definition and formulation
    4.2.1 Terminologies for Ranbow algorithm
  4.3 Method
    4.3.1 Mask and seed sequence finding
      4.3.1.1 Listing masks
      4.3.1.2 Mask ranking
    4.3.2 Phasing a mask and its extension
      4.3.2.1 Phasing a mask with P seed sequences
      4.3.2.2 Block extension
    4.3.3 Merging overlapping and connected masks
      4.3.3.1 Overlapping haplotype segments
      4.3.3.2 Conflicting paths, desired paths, and desired cycles
      4.3.3.3 Merging the haplotype segments in non-conflicting cycles
      4.3.3.4 Merging through direct connections between haplotype segments
    4.3.4 Phasing the regions with fewer than P haplotypes
    4.3.5 Summary

5 Evaluation and results
  5.1 Introduction
  5.2 Results
    5.2.1 Dataset properties
    5.2.2 Real data
    5.2.3 Simulated data
  5.3 Evaluation
    5.3.1 All methods
    5.3.2 H-PoP vs Ranbow
    5.3.3 Comparison of the results on real and simulated datasets
    5.3.4 Usability
  5.4 Conclusion and Discussion

6 An application of haplotype assembly: Haplotype aware de novo assembly of hexaploid Ipomoea batatas
  6.1 Introduction
  6.2 Pilot project
  6.3 Preliminary assembly
    6.3.1 Initial assembly of consensus genome
  6.4 Haplotype-Improved Assembly
    6.4.1 Variant calling
    6.4.2 Haplotype phasing
    6.4.3 Validation of haplotypes
    6.4.4 Haplo-scaffolding strategy
    6.4.5 Haplotype-Improved assembly for I. batatas (HI-assembly)
  6.5 Conclusion

Summary

List of Figures

List of Tables

List of Algorithms

Abbreviations

A Ranbow user manual
  A.1 Data preparation
  A.2 Ranbow for haplotype assembly
    A.2.1 Indexing input files (mode: index)
    A.2.2 Run Ranbow on computer farm
    A.2.3 Collecting data from (mode: collect)
    A.2.4 Revising sequence variants (mode: modVCF)
    A.2.5 Evaluation of the results
    A.2.6 Ranbow phylo

B Dataset Availability

C Short CV

D Zusammenfassung

E Selbstständigkeitserklärung

Bibliography

Glossary

Abstract

Sequencing and genome assembly constitute the basis for further research into the biology of an organism; however, the main players are chromosomes, not the consensus references. The assembly programs flatten two or more homologous chromosomes, depending on the ploidy of the organism, into one reference sequence. Although the major characteristics and functionalities of homologous chromosomes are the same, the minor differences may play an important role. The sequence of one chromosome is called a haplotype. A haplotype is the main component of inheritance; hence, obtaining the haplotype sequences instead of the consensus provides a panoramic view for investigations. Humans are diploid and have two copies of each autosomal chromosome, each inherited from one parent. Other organisms may have several copies of each autosomal chromosome. Higher ploidy of an organism leads to more information being flattened in the reference sequence and results in less similarity between the reference sequence and any of the chromosomes.

In this thesis, we focus on haplotyping problems and various approaches to call the haplotypes. In Chapter 2, Genotype and Haplotype, the approaches for inferring haplotypes and the associated challenges are discussed. In Chapter 3, we focus on single individual haplotyping based on sequence reads, which is currently the major and most direct approach to haplotyping. Different technologies, protocols, models, and methods are also reviewed. In Chapter 4, we propose a novel method, called Ranbow, to address the polyploid haplotype assembly problem. The performance of Ranbow compared to the other state-of-the-art methods is investigated on both real and simulated datasets. For the evaluation, we used data from the tetraploid Capsella bursa-pastoris (Shepherd’s Purse) and hexaploid Ipomoea batatas (sweet potato) genomes (Chapter 5). In Chapter 6, one application of reconstructed haplotypes is discussed. We assembled the sweet potato genome and used haplotype information to propose a novel scaffolding method called haplo-scaffolder. Finally, a summary of the thesis is provided and possible future directions are discussed in the summary chapter.


Chapter 1

Introduction

The idea of genome assembly with sequence reads goes back to 1979, when Rodger Staden noted: “With modern fast sequencing techniques and suitable computer programs it is now possible to sequence whole genomes without the need of restriction maps”[1]. Although at the time of this quote the “modern sequencing techniques” were gel reading, and the “computer programs” were stored on tapes, the underlying principle for genome assembly has stayed the same: genome assembly based on read overlaps. From then on, with first generation sequencing technology, new tools for genome assembly were developed. Virus genomes of a few kilobases were assembled by Sanger et al. in 1978[2], and then the 1 Mb bacterium (1995)[3], 120 Mb fruit fly (2000)[4], and 3 Gb human (2001)[5] genomes were sequenced and assembled.

With the advancement of sequencing technologies and assembly tools, several consortia and groups have been addressing human and non-human genomes; however, the assembled references do not provide a complete picture of the underlying biology: the main entities are chromosomes, not the consensus reference sequences; a reference is an estimate of the real entity behind it. The sequence of one chromosome is called the haplotype (Section 2.2). Haplotypes are the main components of inheritance, and most of the chromosomal regulatory interactions are derived within haplotypes. Hence, obtaining the haplotype sequences, instead of the consensus, provides us with a panoramic view for the investigations. Section 2.3 explains the importance of haplotype assembly. Humans are diploid and have two copies of each autosomal chromosome, each inherited from one parent. Various organisms may have several copies of the autosomal chromosomes. For instance, the treefrog[6] is triploid (three copies, or 3x); zebrafish[7], axolotl[8], and potato[9] are tetraploid (4x); wheat[10] and sweet potato[11] are hexaploid (6x); and strawberry[12] is octoploid (8x). Higher ploidy of an organism causes more information to be flattened in the reference sequence and results in less similarity between the reference sequence and any of the chromosomes. Several approaches have addressed the haplotyping problem based on different data, such as haplotyping via family pedigree, population data, or sequence reads (Section 2.4).

In single individual haplotype reconstruction based on reads, those reads that span at least two variants are the informative ones. This restriction on the reads’ properties makes the difficulty of the haplotyping problem dependent on the sparsity of the variants. It took six years after the first release of the human genome to publish the first human haplotype, called HuRef. Part of the reason was the sparsity of the variants (roughly one variant per 1 kb). Section 2.5 refers to the challenges of diploid and polyploid haplotyping. The HuRef assembly was done with Sanger paired sequencing, which provides relatively long reads with high accuracy that can potentially cover more than one polymorphic site (Section 3.3). The focus shifted, owing to the high cost of Sanger sequencing, to the emerging high-throughput next-generation sequencing (NGS). But the NGS read length was, and still is, too short to cover the two or more variants needed for haplotyping. Since then, several different types of data integration (Section 3.4), library preparation, and single cell technologies (Section 3.5) have been proposed. Fosmid, Hi-C, Strand-seq, and 10X Genomics library preparations, and chromosomal isolation before sequencing, are among these approaches. Single-molecule real-time (SMRT) technologies bring sequencing to the third generation by producing long but error-prone reads. Although SMRT reads span several variants, the high error rate makes variant calling and pinpointing the variants in reads challenging. Integration of SMRT and NGS technologies is one solution for dealing with this problem (Section 3.6).

All these attempts have addressed the human genome, or can be applied to any diploid genome; however, polyploid genome and haplotype assembly are more challenging. The high heterozygosity that comes from the presence of three or more copies of the monoploid genome will always hinder the genome assembly process, the reason being that genome assembly focuses on the vast majority of bases that are invariant across homologous chromosomes. Since these invariant regions are intermittent along the whole chromosome, the result is a fragmented polyploid assembly. Any error in the reference is propagated to variant calling and read mapping, which are the inputs for haplotyping. In addition, the main principle of sequence assembly, assembling reads by proper overlaps, is violated in polyploid haplotype assembly. However, the high heterozygosity of these genomes may help haplotyping with short reads: the higher the heterozygosity, the higher the chance of covering two variants with one short read. A few methods have been reported for polyploid haplotype assembly, but due to the lack of a gold-standard dataset, they have not been applied to real genomic data, and they are not scalable enough to deal with the billions of short reads generated for de novo assembly (Section 3.7).

We propose a novel method to address the polyploid haplotype assembly problem. It accepts the reference sequence, mapped reads, and called variants as input and produces a set of mapped assembled haplotypes. This method is called Ranbow and is explained and evaluated in Chapter 4 and Chapter 5, respectively. Ranbow was initially developed for the haplotype assembly of sweet potato and for the improvement of its assembled genome; however, it can be applied to other polyploid genomes as well. We assessed Ranbow on simulated and real datasets. The simulated datasets are inspired by the hexaploid sweet potato and the tetraploid Capsella bursa-pastoris (shortly CBU) genomes. We tested the performance of Ranbow by changing different parameters such as sequence read insert sizes. For the real dataset, Roche 454 sequence reads were produced and served as the ground truth dataset to evaluate the assembled haplotypes from Illumina short reads. The characteristics of these datasets and the results are explained in Chapter 5.

In Chapter 6, we explain the de novo assembly of the sweet potato genome, which was done in collaboration with Jun Yang, an independent researcher in our lab also affiliated with the Chinese Academy of Sciences, and the sequencing core facility of the Max Planck Institute for Molecular Genetics. DNA material was transferred from China and sequenced in house. The initial de novo assembly was done in the sequencing core facility by Heiner Kuhl and provided to us for further analysis and improvement. We proposed a new scaffolding tool, called the haplotype-aided scaffolder. The proposed method relies on the assembled haplotypes and the connections between scaffolds that are inferred from the reads mapped to these haplotypes. This method improved the N50 scaffold size of sweet potato by ∼40%. We suggest haplotype-aided scaffolding as a step to improve the de novo assembly of complex genomes. This collaboration resulted in a report published in Nature Plants[11].

Chapter 2

Genotype and Haplotype

2.1 Introduction

In this chapter, we first define haplotypes and the relevant terminology (Section 2.2). Then, we explain how the availability of haplotypes for both diploid and polyploid genomes is of great importance (Section 2.3). There are a number of different approaches for obtaining haplotypes depending on the type of input data, such as population data, genotypes of relatives, and sequence reads of a single individual. These approaches are listed and discussed in Section 2.4. In this thesis, we focus on single individual haplotyping from sequence reads; hence, a number of potential challenges for single individual haplotyping are mentioned in Section 2.5. In the last two sections of this chapter, the availability of ground truth data is discussed (see Section 2.6) and then the metrics for the evaluation of computed assembled haplotypes are described (see Section 2.7).

2.2 Genotype and Haplotype

The human genome is diploid: it has two sets of autosomal chromosomes, and each copy is inherited from one parent. The two copies of an autosomal chromosome are similar, yet the information they carry differs slightly. These differences are called sequence variants or alleles and are located at polymorphic sites (or loci) in the genome. They can be single nucleotide polymorphisms (SNP, when the variant is observed in more than 1% of a population), single nucleotide variants (SNV, when it appears in < 1% of a population), indels (insertions or deletions when the region in question is < 50 bp in size), and structural variants (deletion, duplication, insertion, and translocation). The set of variants at all polymorphic sites is called the genotype. However, the genotype does not provide the complete genetic map of an organism, as the homologous chromosomes are flattened into one reference sequence and its sequence variants, without information about the origin of these variants. The set of variants that are located on a single chromosome is called a haplotype¹ (Figure 2.1). In contrast to diploid organisms, polyploid organisms have more than two copies of each autosomal chromosome. These genomes have undergone whole-genome duplication[13], resulting in more than two copies of each chromosome. Although polyploidy in animals is less frequent than in plants, there are several insects, amphibians, reptiles and fish whose genomes are reported as polyploid[13].

In order to explain the meaning of haplotypes, let’s define the following terms more precisely (Figure 2.1 illustrates these terms in an example). Note that some of the following terms are defined for a population of individuals while the other terms are explained for a single individual:

• Polymorphism: The coexistence of different alleles among individuals of the same population.

• Polymorphic site: A genomic position whose corresponding alleles may vary among the chromosomes in the population. A polymorphic site may point to two (biallelic polymorphic site) or more (multiallelic) variants detected in the population.

• SNP: A single nucleotide polymorphism, or SNP, is a polymorphic site consisting of a DNA sequence variation of a single base pair whose alternative allele is observed in more than 1% of the population.

• Indel: An insertion or deletion smaller than 50 base pairs.

• Multinucleotide polymorphism: Variants which are larger than one base pair and equal in length, e.g., ‘ATC’ and ‘CGA’.

• SmP: In this thesis, we introduce the term small polymorphism, or SmP for short, which can be a single nucleotide variant, a multinucleotide variant, or an indel. SmP covers all types of small variants dealt with in this thesis.

• Interpolymorphic region: An interpolymorphic region is a region between two consecutive polymorphic sites. The sequences of nucleotides in these regions are identical among homologous chromosomes.

• Homologous chromosomes: A group of chromosomes of similar length and centromere position, which possess similar genes at corresponding loci. For example, humans are diploid, having 22 pairs of homologous autosomes. A P-ploid genome consists of sets of chromosomes, each of which contains P homologous chromosomes. For instance, sweet potato has a hexaploid genome (P = 6) with a base chromosome number of 15, meaning that there are 15 sets of homologous chromosomes, each of which contains six chromosomes (15 × 6 = 90 chromosomes).

¹ Frequently used terms appear in blue and are explained in the glossary.

Figure 2.1: Schematic view of genotype and haplotype. Panel A, the reference panel, illustrates a reference sequence (black line) and its polymorphic sites identified in the population. These polymorphic sites are then depicted as a genotype in the coded allele illustration. Panel B shows polymorphic sites in a single individual, consisting of hetero- and homozygous variants. These variants are depicted in both nucleotide and coded allele space. In order to infer the coded allele from a variant, one has to check how the variant is coded in the reference panel and the genotypes. Panel C shows the haplotypes in coded allele space. Since the interpolymorphic regions, i.e., the regions between consecutive polymorphic sites, and the homozygous regions are identical in both chromosomes, their sequences in the parental haplotypes can simply be inferred from the reference sequence. The objective of haplotype reconstruction is to infer the connection of the heterozygous variants; Panel C shows what haplotype reconstruction methods are aiming for.

• Sequence variant: A sequence that appears in one chromosome of one individual at a polymorphic site. It is worth noting that a sequence variant can be a SNP or an indel. In this thesis, we use variant and sequence variant interchangeably.

• Genotype: A set of variants at a polymorphic site is called a genotype. For example in Figure 2.1-A, the first genotype is {A, C} and the second one is {T,G}. The genotypes are illustrated as unordered sets, meaning that it is not clear which allele belongs to which of the parental chromosomes.

• Homozygous variant: A homozygous variant is a variant which is identical in all homologous chromosomes. Note that when we talk about these variants, we are referring to one single individual and not the population. There are other variants in the population at the corresponding polymorphic site but the individual carries the same variant in all of its homologous chromosomes.

• Heterozygous variant: The sequence variants which are not identical among the homologous chromosomes of a single individual.

Figure 2.2: Homozygous and heterozygous variants. A) Two consecutive homozygous variants, one reference and one alternative. Obtaining haplotypes is trivial and there is no ambiguity in inferring them. B) Ambiguity of obtaining the haplotypes when the polymorphic sites are heterozygous. There are two possibilities for the haplotypes considering two consecutive heterozygous polymorphic sites.

The sequences of nucleotides in the regions between polymorphic sites are identical across homologous chromosomes. The same applies to homozygous variants. Since the sequences of these variants are identical, the variants in each of the chromosomes are known, and they do not provide information for phasing the other variants. Therefore, the focus of haplotyping is on the heterozygous variants. Figure 2.2-A indicates that the sequences of the haplotypes are known when the polymorphic sites are homozygous. Figure 2.2-B illustrates the ambiguity of the connection between alleles caused by two consecutive heterozygous variants. In order to resolve this ambiguity, extra information, such as sequence reads sampled from one chromosome, is required. Therefore, the ordered collection of heterozygous variants of each chromosome, together with the reference sequence, provides a complete picture of a haplotype. Thus, the haplotyping problem for a P-ploid genome, P = 2 for diploid and P > 2 for polyploid genomes, is formulated as finding the ordered sequence variants in one chromosome (Figure 2.1-C). These variants are coded into numbers for simplicity; 0 is used for the reference allele, while 1, 2, 3, ..., P − 1 are used for the alternative ones. Hence, in a more abstract form, one haplotype is a sequence of numbers whose length equals the number of heterozygous polymorphic sites. We call this illustration haplotypes in coded allele space (see Figure 2.1).

Figure 2.3: Obtaining the haplotypes of a diploid genome once one of the haplotypes is known. When one haplotype is known, the other haplotype can be simply computed as its bitwise complement.

For a diploid genome, the alleles in one haplotype are either the reference allele or the alternative one; therefore, the two haplotypes are the bitwise complement of each other. Recall that the homozygous alleles are unambiguous and are removed. In other words, knowing one haplotype sequence gives us the other haplotype sequence as well. This, in a sense, makes the computational problem definitions of diploid and polyploid haplotyping different. For diploid genomes, the aim is to obtain just one haplotype sequence. Once one haplotype is inferred or assembled, the other haplotype can be computed by converting zeros to ones and ones to zeros (Figure 2.3). However, for polyploid genomes, this principle cannot be employed. Figure 2.4 shows four possible groups of haplotypes which can be inferred if one of the haplotypes is known and given as input. This indicates that all of the P haplotypes in a P-ploid genome need to be inferred or assembled, which distinguishes the computational problems of reconstructing diploid and polyploid haplotypes. In diploid genomes, we need to assemble one haplotype (not two), but in polyploid genomes all P haplotypes need to be assembled.
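To make the diploid special case concrete, the following minimal Python sketch (an illustration only, not code from Ranbow or any cited tool) derives the second haplotype from the first in coded allele space, using the 0/1 coding introduced above.

```python
def complement_diploid(haplotype):
    """Given one diploid haplotype in coded allele space (a string over
    '0' and '1' at the heterozygous sites), return the other haplotype,
    i.e. its bitwise complement."""
    flip = {"0": "1", "1": "0"}
    return "".join(flip[allele] for allele in haplotype)

# Knowing one haplotype fixes the other (cf. Figure 2.3).
print(complement_diploid("01101"))  # -> 10010
```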

Figure 2.4: Obtaining haplotypes of a polyploid genome once one of the haplotypes is known. This figure shows a triploid genome and its genotype for three heterozygous positions. Given the genotype and one of the haplotypes, there are four possible ways for obtaining the other two haplotypes. This figure shows these four possible haplotype groups.

We define the following technical terminology (see Figure 2.1 for an illustration):

• Haplotypes in nucleotide space: A haplotype shown as a sequence of nucleotides.

• Haplotypes in coded allele space: As mentioned, haplotypes can be defined as the sequence of coded alleles at the heterozygous polymorphic sites. In this thesis, this representation is called haplotypes in coded allele space. Given a haplotype in coded allele space, one can decode the numbers into the corresponding alleles from the genotype and then insert the homozygous alleles and interpolymorphic regions, guided by the reference sequence, in order to obtain the haplotype in nucleotide space.

• Fragment: A read in coded allele space is called a fragment.

• Genotype in coded allele space: A genotype is an unordered set, so the illustration should reflect this property; hence, the genotypes are shown as a set of numbers separated by slashes. For example, ‘0/0/1’ indicates that the genotype consists of two reference alleles and one alternative one. Homozygous and heterozygous genotypes are illustrated in Figure 2.2-A and Figure 2.2-B, respectively.

• Missing allele: This term, illustrated as ‘−’, is used when an allele is unknown in coded allele space. For example, ‘10210−21’ is a haplotype whose sixth allele is unknown. Since this dash is used in coded allele space, there is no danger of confusing it with a deletion.

Note that, when base pairs are illustrated and used, it implicitly indicates that the haplotypes are in nucleotide space, while using a sequence of numbers (0 ≤ number ≤ P − 1) indicates that the haplotypes are in coded allele space.
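As a concrete illustration of the relation between the two spaces, the sketch below decodes a haplotype from coded allele space back to nucleotide space. The reference string, site positions, and allele lists are made-up toy values (they do not come from the thesis data), missing alleles are not handled, and the first allele of each site is assumed to be the reference allele.

```python
def decode_haplotype(reference, het_sites, coded_haplotype):
    """Convert a haplotype from coded allele space to nucleotide space.

    reference       -- reference sequence (string)
    het_sites       -- list of (position, alleles); alleles[c] is the
                       nucleotide string of coded allele c at that site
    coded_haplotype -- one coded allele per heterozygous site, e.g. "11"
    """
    pieces, prev_end = [], 0
    for (pos, alleles), code in zip(het_sites, coded_haplotype):
        pieces.append(reference[prev_end:pos])   # interpolymorphic region
        pieces.append(alleles[int(code)])        # decoded allele at this site
        prev_end = pos + len(alleles[0])         # skip the reference allele
    pieces.append(reference[prev_end:])          # trailing reference sequence
    return "".join(pieces)

# Toy example: two heterozygous SNPs at 0-based positions 2 and 7.
reference = "ACGTACGTAC"
sites = [(2, ["G", "T"]), (7, ["T", "A"])]
print(decode_haplotype(reference, sites, "11"))  # -> ACTTACGAAC
```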

2.3 The importance of haplotypes

Haplotype information plays an important role in diverse contexts, such as disease association studies and population genetics, and also in technical and computational problems like the de novo assembly of a genome. Here, we list some of these applications.

2.3.1 Disease association studies

Compound heterozygosity: Inferring haplotypes is essential for addressing several problems in molecular biology. Finding compound heterozygosity is one of the applications of a phased genome, i.e., a genome with known haplotypes (Figure 2.5-A). Compound heterozygosity occurs when two mutations hit one gene at two different positions. The two mutations may exist either in the same copy (cis-compound heterozygosity) or in both the paternal and maternal alleles, each at a different position (trans-compound heterozygosity). In the trans case, both alleles may behave abnormally and cause a disease. For instance, in the case of Miller syndrome, Roach et al.[14] studied a patient with the syndrome, his sibling, and their parents. They sequenced the four individuals’ genes and found one gene which harbours two mutations at two sites, one in the maternal and one in the paternal copy. This information is not accessible through genotyping. The genotype just shows that the two mutations are there, but it does not indicate whether the two hits are in one haplotype (cis) or in different ones (trans).

Variants in regulatory elements: The same principle mentioned above is valid for distant enhancers and their corresponding genes. The chromosomes are packed into a 3D structure, which brings regulatory elements that are far apart in the genomic DNA into close proximity in 3D space. It has been shown that a majority of the interactions between genes and regulatory elements are intra-chromosomal[15]. For example, having two variants in a gene and its enhancer (shown in Figure 2.5-B) leads, again, to the cis and trans scenarios. In the former, there is a copy with no variant, and the expression of this copy may compensate for the effect of the mutation in the other copy, maintaining the phenotype[16].

Figure 2.5: A) Compound heterozygosity in a gene. B) Compound heterozygosity in one gene and its enhancer.

High-frequency and low-effect variants: Compound heterozygosity and the studies mentioned above concern highly penetrant variants, i.e., variants which are strongly associated with traits and phenotypes. However, a higher frequency of variants which individually have a low effect can cause a disease as well. In the case of systemic lupus erythematosus (SLE), Graham et al.[17] reported variants in a protein-coding gene (tumour necrosis factor α-induced protein 3, TNFAIP3) as well as in two haplotype blocks, one 200 kb upstream and one 200 kb downstream of the gene. Therefore, not just one variant but the combination of all variants in this ∼400 kb region may play a role in this disease.

Pharmacogenetics: In addition to investigations of causal variants in diseases, haplotype information plays a role in clinical genetics and pharmacogenetics. Different combinations of variants in one haplotype can result in different protein sequences and may alter the metabolic processes the proteins are involved in; hence, knowing the phase helps to predict the drug response of a patient[18].

2.3.2 Population genetics and evolution of organisms

Haplotypes are the units of inheritance and provide higher precision than genotypes for population genetics studies. Population history, migration, and bottlenecks can be traced back more confidently with haplotype resolution[19]. It is possible to trace back the evolutionary history of an organism considering the haplotype blocks. In Yang et al.[11], we reported the estimation of polyploidization time via assembled haplotype blocks in the sweet potato genome. Sweet potato is hexaploid, with six copies of each chromosome. A haplotype block in this context is a region which is phased into its six haplotypes. Each haplotype block reveals an evolutionary model for the polyploidization. Inferring a model for each block revealed the most probable historical evolutionary model and its timing for this genome.

2.3.3 Sequence assembly and variant calling for polyploid genomes

Variant calling improvement: Haplotype information can be used in a post-processing step for variant callers, i.e., the tools that identify the existence of variants at polymorphic sites from sequence reads. Given haplotype data, variant callers not only consider the allele frequency at polymorphic sites but also integrate the information of the neighbouring alleles. For instance, in a hexaploid organism like sweet potato, three consecutive SmPs with two, two, and three alleles called by a variant caller give rise to 2 × 2 × 3 = 12 possible local haplotype combinations, of which at most six can be correct. The assembled haplotype support for each combination is taken into account, and the alleles are evaluated considering the haplotypes[11]. This application will be explained further in Chapter 6.
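As a rough illustration of this filtering step (with made-up alleles and haplotype sequences, not data from the sweet potato project), the snippet below enumerates the 2 × 2 × 3 = 12 local combinations mentioned above and keeps only those supported by assembled haplotypes.

```python
from collections import Counter
from itertools import product

# Alleles called at three consecutive SmP sites (2 x 2 x 3 combinations).
site_alleles = [("A", "T"), ("C", "G"), ("A", "C", "T")]
combinations = ["".join(c) for c in product(*site_alleles)]
print(len(combinations))                       # 12

# Hypothetical assembled haplotypes spanning the three sites; a hexaploid
# genome can carry at most six distinct combinations at these sites.
assembled = ["ACA", "ACA", "TGA", "TGC", "ACT", "TGT"]
support = Counter(assembled)

# Keep only the combinations that are supported by assembled haplotypes.
supported = [c for c in combinations if support[c] > 0]
print(supported)                               # ['ACA', 'ACT', 'TGA', 'TGC', 'TGT']
```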

De novo assembly improvement: Haplotypes are of great importance for the de novo assembly of highly heterozygous genomes. The high heterozygosity, which partly comes from the presence of three or more copies of each chromosome, always hinders the genome assembly process, even with state-of-the-art assembly tools, because genome assembly focuses on the vast majority of bases that are invariant across homologous chromosomes (interpolymorphic regions). Since these regions are smaller along the chromosomes of a polyploid genome due to high heterozygosity, the assembly is more fragmented. Phasing the haplotypes splits the variant regions into their original sequences, on which the assembler can rely to deal with the fragmentation[11]. More details can be found in Chapter 6.

2.4 Different approaches for haplotyping

Several techniques, protocols, and computational approaches have been reported for inferring haplotypes, most of them addressing the human genome. These methodologies are, therefore, more compatible with the characteristics of the human genome: diploid, with a low density of alleles. These approaches are divided into the following main categories based on the data and the experimental settings.

2.4.1 Haplotype phasing via family information

This approach, which is also called genetic haplotyping, or haplotyping with pedigree information, combines genotype and pedigree data to infer haplotypes. In this approach, the genotypes are mainly obtained from SNP arrays with predefined primers; therefore, the sequences of the SNPs need to be known beforehand to design the primers. This strategy works when the SNP is already known, but it misses SNVs. The pedigree information describes the relations between the members of a family. Phasing a trio, a child and the parents, is successful at the sites that contain both heterozygous and homozygous genotypes within the family members. For instance, when a/A, A/A, and a/A are the genotypes of the father, mother, and child, respectively, it is clear that a (A) is transmitted from the father (mother). Trio phasing fails when all three genotypes are identically heterozygous. In this case, the genotypes of other relatives may be useful. Working with genotype data is limited to the known variants, and it is not possible to phase SNV sites.
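The trio rule sketched above can be written down directly. The following minimal function is a simplification for a single biallelic site (it ignores genotyping errors and de novo mutations, and is not code from any cited phasing tool); it returns None for the ambiguous all-heterozygous case.

```python
def phase_child_site(father, mother, child):
    """Phase one biallelic site of a child from parental genotypes.

    Genotypes are unordered pairs of alleles, e.g. ("a", "A").
    Returns (paternal_allele, maternal_allele), or None if the site is
    ambiguous (all three genotypes identically heterozygous).
    """
    f, m, c = set(father), set(mother), set(child)
    if len(c) == 1:                      # homozygous child: trivially phased
        allele = child[0]
        return allele, allele
    for paternal in c:
        maternal = (c - {paternal}).pop()
        # Accept the assignment only if it is the unique consistent one.
        if paternal in f and maternal in m and not (maternal in f and paternal in m):
            return paternal, maternal
    return None                          # e.g. father a/A, mother a/A, child a/A

# Example from the text: father a/A, mother A/A, child a/A.
print(phase_child_site(("a", "A"), ("A", "A"), ("a", "A")))  # ('a', 'A')
```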

2.4.2 Haplotype phasing via population data

These methods rely on the haplotype blocks shared within a population, which are obtained from the genotype information and the linkage disequilibrium (LD) measure. The LD measure is calculated based on the association of variants at different positions in a given population. To obtain this, the frequency of combinations of variants is measured and then compared to a scenario in which variants are inherited together at random. The shared blocks can be exploited to phase all individuals in the population, including the target one. Based on a model, such as the assumption of a minimum number of haplotype sequences in a population (parsimony model)[20], the haplotypes are phased. This approach is effective for imputing missing alleles, i.e., estimating the missing alleles with the help of flanking variants and their frequencies in population data, and for phasing common variants, and can result in haplotype segments up to 300 kb in size[21]; however, this approach is limited when dealing with rare variants and when crossing recombination hotspots[21].
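One common way to quantify this association (a standard textbook formulation, not necessarily the exact measure used by the cited methods) is the coefficient D = p_AB − p_A · p_B, where p_AB is the frequency of the two-locus haplotype and p_A, p_B are the single-locus allele frequencies; D = 0 corresponds to the alleles being inherited together at random. The population below is invented for illustration.

```python
def linkage_disequilibrium(haplotypes):
    """Compute D = p_AB - p_A * p_B for two biallelic sites, given a list of
    two-locus haplotypes from a population, e.g. ["AB", "Ab", "aB", ...]."""
    n = len(haplotypes)
    p_ab = sum(h == "AB" for h in haplotypes) / n   # two-locus haplotype freq.
    p_a = sum(h[0] == "A" for h in haplotypes) / n  # allele freq. at site 1
    p_b = sum(h[1] == "B" for h in haplotypes) / n  # allele freq. at site 2
    return p_ab - p_a * p_b

# Toy population: 'A' and 'B' co-occur more often than expected by chance.
population = ["AB"] * 40 + ["ab"] * 40 + ["Ab"] * 10 + ["aB"] * 10
print(linkage_disequilibrium(population))  # 0.4 - 0.5 * 0.5 = 0.15
```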

2.4.3 Single Individual Haplotyping (SIH)

This group of methods aims to assemble the haplotypes from the sequence reads of a single individual. Each read is sampled from one chromosome and can serve as a short haplotype. These approaches rely on the sequence overlaps at polymorphic sites and glue the reads together based on allele similarity to obtain longer haplotypes (see Figure 2.6). The methods for SIH have developed along with sequencing technologies. At first, mate-pair Sanger reads were used for this purpose. These reads are long enough to connect consecutive alleles in most regions of the human genome, but the sequencing technology is costly. The next generation sequencing (NGS) technologies provide high-throughput short read data. Most of the current NGS reads are not usable for genomes with low heterozygosity like the human genome. However, in more heterozygous genomes the polymorphic sites are closer together, and, therefore, haplotyping with short NGS data may also be possible. In Yang et al., we proposed an approach to deal with complex and highly heterozygous genomes using NGS data, which involves Illumina sequencing (100 bp and 150 bp reads) of the heterozygous sweet potato genome[11].

To address human haplotyping, several approaches integrate NGS data with other sequencing technologies or protocols, such as Hi-C, Strand-seq, RNA-Seq and chromosome isolation. Third-generation sequencing outputs long (several kilobase pairs) sequences sampled from one chromosome. These types of reads not only connect several polymorphic sites, but can also span problematic genomic regions like repeats. In addition, 10X Genomics produces linked-read data, i.e., highly accurate short reads that are sampled from one high molecular weight (HMW) fragment. These approaches will be explained in more detail in Chapter 3.

2.5 Challenges

Unlike genome assembly, which focuses on the majority of invariant bases in the reads, haplotype reconstruction methods need variants. Genome properties, such as variant distribution, as well as sequencing features, like read length and accuracy, play a major role in haplotyping. Here we list a number of genome properties and technical issues which are important for diploid and polyploid haplotype assembly.

Figure 2.6: Haplotype reconstruction for a diploid genome. As an example, a part of human chromosome seven is depicted in this figure. The variants at polymorphic sites of each homologous chromosome (red and blue) are illustrated in the same color as the chromosomes. The lower part of the figure shows the sequence reads and the variants they carry. These reads are partly sampled from the blue chromosome and partly from the red one. The reads can be grouped according to the information they carry. A read is useful for haplotype assembly if it contains at least two sequence variants. Reads with no or one sequence variant are not useful and are excluded from haplotyping (black reads).

2.5.1 Genome ploidy

Methods for haplotyping are, from a technical point of view, divided into two groups: diploid and polyploid haplotypers. The reason, as explained in Section 2.2, is that the haplotypes in a diploid genome are the bitwise complement of each other. The availability of both alleles at heterozygous sites means that knowing the sequence of one haplotype directly yields the sequence of the other one. Hence, the goal of diploid haplotype assembly can be redefined as the assembly of one of the haplotypes. Likewise, two overlapping reads with an identical overlap indicate that the two reads are sampled from one haplotype. This can be considered a foundation of diploid haplotyping. However, polyploid genomes do not have this property. Knowing the sequence of one haplotype provides no information about the others. Accordingly, it changes the problem definition of the assembly of reads as well. A matching overlap between two aligned reads is not sufficient to infer whether they belong to one haplotype. Figure 2.7 illustrates this situation. More information is needed to decide whether two reads are from one haplotype and can be merged. We call this property the Ambiguity of Merging problem (AoM problem) (Figure 2.7).

Figure 2.7: The AoM problem. Fragments A and B overlap with two matching SmPs (T...T). Frag. C is made by merging Frag. A and Frag. B. The real haplotype sequences from which Frag. A and Frag. B are sampled are illustrated in (B), showing that Frag. C does not belong to any of the real haplotypes. This ambiguity makes it difficult to merge two fragments with high confidence by considering only their sequences.

2.5.2 Low variant density

The higher heterozygosity of polyploid genomes can be helpful for polyploid haplotyping. The smaller the distance between polymorphic sites, the more variants are present in one sequence read, and the more reads are useful for haplotyping. Figure 2.8 compares the proportion of reads that are usable for haplotyping in human and in sweet potato. In general, the average distance between polymorphic sites in human and sweet potato is ∼1 kb and ∼60 bp, respectively.
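The order-of-magnitude gap shown in Figure 2.8 can be reproduced with a simple back-of-envelope model (our own approximation here, not a calculation from the thesis): if heterozygous sites were scattered as a Poisson process with the quoted mean spacings, the fraction of 100 bp reads covering at least two sites comes out close to the observed ~1% and ~50%.

```python
from math import exp

def frac_informative(read_len, mean_spacing):
    """Rough Poisson estimate of the fraction of reads covering >= 2
    heterozygous sites, given the mean spacing between polymorphic sites."""
    lam = read_len / mean_spacing          # expected variants per read
    return 1.0 - exp(-lam) * (1.0 + lam)   # P(Poisson(lam) >= 2)

# 100 bp single-end reads, ~1 kb spacing (human) vs ~60 bp (sweet potato).
print(f"human:        {frac_informative(100, 1000):.3f}")   # ~0.005
print(f"sweet potato: {frac_informative(100, 60):.3f}")     # ~0.50
```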

2.5.3 Structural variants

Structural variants affect haplotyping since they may change the number of sequence copies. One unrecognized duplication inflates the number of reads that are mapped to the reference in the duplicated region. If the duplicated sequences diverge enough, they may accumulate a number of variants. These variants, together with the original variants, are mapped to the corresponding region in the reference. This may affect the haplotyping since the haplotypers assume the number of copies is the same as the ploidy, but in reality there is one more copy available (Figure 2.9). One potential approach to address this problem is to find these regions and apply the haplotype assembler with the proper ploidy. However, finding structural variants is a separate problem and needs its own consideration. The same reasoning also applies to paralogous genes. Without knowing the structural variants, more careful consideration is necessary for mapping and variant calling, and this makes the haplotyping rather more complicated.

Figure 2.8: The percentage of SmP hits in the human and sweet potato genomes. The percentage of reads useful for haplotyping is compared for human and hexaploid sweet potato. As shown, about 1% and 50% of the reads are usable for haplotyping in human and sweet potato, respectively. The notable difference is due to the heterozygosity of these genomes. The polymorphic sites are much closer together in the sweet potato genome, such that more read pairs, with 100 bp single-end short reads, cover more than two polymorphic sites. The statistics are inferred from an individual in the 1000 Genomes Project[22] and the pilot sweet potato project data (Section 6.2).

Figure 2.9: Effect of a duplication in haplotyping for the human genome. The green sequence is a duplication of the blue one. The mapper and haplotype assembler cannot distinguish the reads which are sampled from the green region; consequently, it assumes there are two haplotypes and it merges the reads from three haplotypes to assemble two haplotype sequences.

2.5.4 Repetitive regions

Repeats are defined as similar or identical sequences placed at different positions in the genome. Large genomes contain different amounts of repetitive sequence; for example, the human and sweet potato genomes contain 50% and 40% repetitive elements, respectively[11, 23]. Repeat elements can be interspersed or tandem. Interspersed repeats are elements which are identical or nearly identical and can be a million base pairs apart in a genome, while tandem repeat elements are adjacent. The size of a tandem element can be as small as two base pairs, and the number of copies in one repeat can be many thousands[24].

Although genome and haplotype assembly could be defined as one problem, i.e., obtaining the sequences of all of the chromosomes, they are treated here as two separate problems due to the high similarity between homologous chromosomes and also the limitations of sequencing technologies in dealing with complex regions of the genome such as repeats. These regions cause fragmentation in de novo assembly if the reads are not sufficiently long to cross the repeats. De novo assembly focuses on the invariant regions of the genome, while haplotyping relies on variant sequences with the assumption of an accurate reference and variant calls. Therefore, we also discuss here the effect of repeats on de novo assembly, since the assembled reference is the input for haplotyping.

Reads sampled from repeats map ambiguously, meaning they can be mapped to several places with good similarity. Read length plays a major role: if the read lies entirely within a repeat, there is no signal indicating from which repeat copy the read was sampled. If the read is long enough to span the repeat, or to contain part of the flanking regions, the flanking sequences help to map the read properly. Otherwise, the ambiguous mapping of short reads affects the accuracy of variant calling and, consequently, haplotype assembly. In the case of tandem repeats, the difficulty is the estimation of the repeat size. Figure 2.10 shows the effect of repeats on genome assembly (Figure 2.10-A) and on estimating the size of the reference at repetitive sites (Figure 2.10-B).

Figure 2.10: Repetitive regions. A) The effect of interspersed repeats on genome assembly. It is not possible to determine which sequence (blue or orange) should be connected to the black one. B) The effect of tandem repeats on genome assembly. The assembler might squeeze the region since the reads are short.

2.5.5 Technical issues

Coupled with the genome characteristics, there are a number of sequencing-related technical issues which may affect haplotyping, e.g., read length, error rate, and coverage. For example, the reads of the recruited sequencing technology should be long enough to connect at least two variants. The read length and the genome heterozygosity are two important, inter-related characteristics that affect haplotyping. Another characteristic is the error rate in base calling. With a high error rate, the variant caller may not identify the true sequence variants among the erroneous bases. This problem can be addressed by deeper sequencing, which has the downside of higher cost. This becomes more challenging for polyploid genomes. For example, we sequenced the sweet potato genome with one of the earliest MinION Nanopore flow cells. The low coverage and high error rate meant that we could not distinguish whether the alleles in the reads were true or not (Figure 2.11).

Figure 2.11: An illustration of sequencing errors in the genome browser. The three tracks are MinION Nanopore reads, Roche 454 reads, and assembled haplotypes from Illumina reads. The dissimilarities between the reads and the reference are depicted in different colors. The first track shows that the error rate is so high that distinguishing between true variants and errors is not trivial.

Moreover, differences between the sequenced individual and the reference, as well as errors in the reference, may introduce errors into the haplotyping. For re-sequencing data, since the reference sequence does not come from the same individual as the sequenced one, the structural differences between the individual and the reference may prevent assembly of the true haplotype (see Section 2.5.3). Moreover, haplotypes containing SNVs may be more difficult to assemble. One strategy to deal with these errors is de novo assembly of the individual together with haplotyping. This approach is costly and needs long reads and high coverage. The currently available sequencing technologies are either rather short, highly accurate, and high-throughput, such as Illumina sequencing, or long, error-prone, and low-throughput, e.g., Nanopore and PacBio.

2.6 Ground truth data availability

Obtaining ground truth data for every organism is a hard task with current sequencing technologies, their costs, and the available tools. However, for a decade various groups have addressed human genome haplotyping. The haplotype sequences obtained from genotyping a family trio, released by The International HapMap Consortium (phases II and III, 2007 and 2010, respectively), served as ground truth haplotypes for a while. In one dataset, the child and the parents were genotyped, and later deeply sequenced with Illumina technology for calling SNVs and indels; hence, the child’s haplotypes, NA12878, can be considered ground truth data. Another set of phased haplotypes for the human genome was obtained from the combination of several sequencing technologies: Illumina, PacBio, and BioNano Genomics[25].

Such ground truth data, sequenced with different types of sequencing technologies and studied in depth with the aid of pedigree information, is not available for non-human organisms. Chen et al. reported deep PacBio sequencing for diploid Arabidopsis thaliana, Vitis vinifera (grape), and the coral fungus Clavicorona pyxidata. They performed de novo assembly and haplotyping for these genomes; however, no sequence can serve as the ground truth to evaluate the assembled haplotypes[26]. In Yang et al., we proposed a strategy for the evaluation of assembled haplotypes for the hexaploid sweet potato genome. It relies on deep Illumina sequencing for haplotype assembly and recruits Roche 454 reads, which were not integrated into the de novo assembly, as the ground truth data. Although the Roche 454 reads are limited in size, they are sufficiently accurate for evaluating parts of the assembled haplotypes[11].

2.7 Metrics and evaluation

It is important to assess the quality of the computed assembled haplotypes obtained from different methods. Different metrics have been introduced according to the type of input data. Here we list some of these metrics. Generally, these metrics can be divided into two groups based on the availability of ground truth haplotypes. Note that these metrics are applied after the assignment of each assembled haplotype to the ground truth data, if available. Depending on the ploidy of the genome, there can be two or more possible assignments. Each assignment has a score, and the following metrics are calculated on the best assignment. Figure 2.12 shows the two possible assignments of computed haplotypes to the ground truth ones. The selected metric is computed for both of the assignments. The assignment with the better score is selected for further analysis. The number of possible assignments in a polyploid genome is P! (i.e., P × (P − 1) × (P − 2) × ... × 2 × 1). Figure 2.13 shows a triploid assignment as an example. There are six possible assignments, each of which is evaluated by its accuracy, and the assignment with the highest accuracy is used for evaluation.

Figure 2.12: The two possible assignments of computed haplotypes to the ground truth ones in a diploid genome. The ground truth and computed haplotypes are shown in the middle. The two assignments are illustrated as first and second assignments. The selected metric here is the number of matches divided by the length of haplotypes. The second assignment has a better score, or accuracy, and it will be used for the evaluation of computed haplotypes.

Figure 2.13: The six possible assignments of computed haplotypes to the ground truth ones in a triploid genome.

• N50: This metric is a length-weighted average and evaluates the length of the assembled sequences. It indicates that half of the total assembled sequence is contained in sequences of this length or longer. Figure 2.14-A illustrates this measurement.

• Variant N50: It has the same meaning as N50, but instead of base pairs it uses variants. In other words, the N50 is in nucleotide space and the VN50 is in coded allele space. Variant N50, or VN50, means that half of the assembled haplotype content in coded allele space is contained in haplotypes of the VN50 size or larger.

• Accuracy or reconstruction rate: This is calculated as the number of correct matches at heterozygous sites between the assembled haplotypes and the true ones, divided by the length of the haplotypes. Two parental haplotypes and two assembled haplotypes are shown in Figure 2.12 as an example. There are two possibilities for assigning the assembled haplotypes to the real ones (shown as the first and second assignments), with scores of (1 + 1)/(2 × 6) = 0.16 and (4 + 4)/(2 × 6) = 0.66, respectively; hence, the second assignment is the better one, and the accuracy of 0.66 is chosen. (A small Python sketch of this permutation-based scoring is given at the end of this section.)

• Switch error: This is the number of allele block exchanges between the assembled haplotypes needed so that, after these exchanges, the assembled haplotypes become equal to the true ones. Let’s assume that the two haplotypes obtained from a diploid genome are MMPPM and PPMMP, where P and M stand for paternal and maternal alleles. We know the true haplotypes are PPPPP and MMMMM. Therefore, the assembled haplotype blocks need to be switched at positions two and four to obtain the real ones. So, in this example, the switch error is two. This metric can be applied to diploid genomes since the haplotypes only contain the heterozygous alleles. Moreover, there is just one possible number of switches in diploid genomes: whenever the ground truth haplotype disagrees with the computed one, a switch is needed. The positions of the switches in the other haplotype are exactly the same (see Figure 2.14-B).

In polyploid genomes there might be several possible sets of switches. Figure 2.14-C illustrates two possible sets of switches for a triploid genome. Therefore, in this case we need to choose one of all possible sets of switches in order to evaluate the computed haplotypes. In a parsimony paradigm, one can select the set with the minimum number of switches; hence, the minimum switch error can be used as a metric for evaluation. However, there is a problem with using this measure for polyploid genomes. It might be the case that no possible set of switches can convert the assembled haplotypes to the true ones when the detailed genotype data is not available. Figure 2.14-D highlights a polymorphic site at which the genotypes of the ground truth and the computed haplotypes are not identical. Another problem is predicting the frequency of alleles at the polymorphic sites of the computed haplotypes. For example, given the observed alleles 0, 1, and 2 for a hexaploid genome, there are several possibilities for their frequencies, such as ‘0/0/0/1/1/2’ or ‘0/0/0/0/1/2’. In such a situation, no switch error can be computed. One strategy to deal with this problem is to ignore the sites at which the frequencies do not match; however, correctly assembled haplotypes may be discarded at those polymorphic sites.

Figure 2.14: N50 and switch error metrics. Panel A illustrates the N50 measure. Panel B depicts the number of switches needed for converting the computed haplotypes into the ground truth ones. Panel C shows two possible sets of switches, namely four and seven switches, for the first and second scenarios, respectively. Panel D indicates the problem of using switch error for the evaluation of computed haplotypes for polyploid genomes. The highlighted column indicates that if the genotypes of the computed haplotypes and the ground truth ones at a polymorphic site are not identical, the switch error measure cannot be computed.

• Completeness: This metric shows how many of the heterozygous alleles are covered by the longest haplotype. For instance, if one haplotype connects 95 out of one hundred polymorphic sites, the completeness is 95%. This measure is defined based on the longest assembled haplotype.

• Resolution: This metric is also defined for the longest assembled haplotype. It shows how many gaps are present in this haplotype.

• Error correction: This metric is based on the minimum error correction model (Section 3.2), and it is calculated as the number of corrected errors in the assembled haplotypes such that all reads assigned to each haplotype are compatible with each other.

• Weighted error correction: This metric is calculated in the same way as the error correction metric, but instead of counting the number of corrected alleles, it sums over their weights.

• Fragment removal: This metric is based on the minimum fragment removal model (Section 3.2) and counts the number of reads that need to be removed to obtain a non-conflicting partition of reads.

Accuracy is the most widely used metric to evaluate haplotypes. The switch error helps mostly when errors in the haplotypes change the long-range connectivity, for example when paired-end and mate-pair libraries are used. Completeness and resolution allow us to evaluate genome-wide haplotyping when the sequence reads are long and accurate enough, and when the sequencing depth is sufficiently high that one haplotype covers most of the genome. The error correction, weighted error correction, and fragment removal metrics provide a fitness measure when the true haplotypes are not available: they check the assignment of reads to the assembled haplotypes and score how well these assignments were done.
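As a concrete illustration, the following minimal Python sketch counts diploid switch errors by locating the positions at which the agreement pattern between an assembled haplotype and the ground truth flips; the function name and the M/P string encoding are illustrative choices, not part of any published tool.

    def switch_error(assembled, truth):
        # Count diploid switch errors: positions where the agreement with the
        # ground truth haplotype flips between consecutive heterozygous sites.
        # Both inputs are strings of equal length over {'M', 'P'}.
        agreement = [int(a == t) for a, t in zip(assembled, truth)]
        return sum(1 for i in range(1, len(agreement))
                   if agreement[i] != agreement[i - 1])

    # Example from the text: assembled MMPPM against the true haplotype PPPPP
    print(switch_error("MMPPM", "PPPPP"))  # -> 2 (switches after positions two and four)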

Chapter 3

Single individual haplotype reconstruction: techniques and protocols

3.1 Introduction

Haplotyping methods were developed along with advances in sequencing technologies and the assembly of the human genome. First-, second-, and third-generation sequencing data have all been used for haplotyping of a single individual. Although NGS technologies have drastically reduced the price and increased the throughput of sequencing, the length of their reads is too short to connect more than a few of the variants across the genome. Nevertheless, a number of haplotyping methods were designed to deal with NGS data. These methods had mostly been applied to first-generation sequencing data, since the underlying principles for the first and second generation, i.e., read accuracy and connectivity via paired reads, are the same. Since then, Illumina sequence reads have become dominant due to their low cost, high accuracy, and high throughput; however, the Illumina read length is technically limited and recently reached 2 × 300bp (Illumina MiSeq series sequencers [27]). To overcome this limitation, various strategies have been proposed. Integrating various Illumina insert sizes (Section 3.4.1), utilizing population (Section 3.4.2) or RNA-Seq data (Section 3.4.3), and recruiting single cell sequencing techniques together with special library preparation for NGS, like 10X Genomics (Section 3.5.1) and Strand-seq (Section 3.5.2), are among these strategies. In the next sections, we first explain how single individual haplotype assembly can be modeled, and then we describe in more detail how the mentioned technologies and protocols can help with haplotyping.


3.2 How to model single individual haplotype assembly

Several models have been introduced for haplotype assembly from sequence reads (the SIH problem). These models assign a fitness to a set of assembled haplotypes. Minimum Error Correction (MEC), Minimum Fragment Removal (MFR), and weighted Minimum Error Correction (wMEC) are the most well-known among them. All of these models have been proven to be computationally hard, which necessitates the use of heuristic approaches [28].

Minimum Error Correction (MEC): The most well-known model for the SIH problem is the MEC model. It assigns a fitness to a partition and its supporting fragments. For a diploid genome (P = 2), according to the minimum error correction model, we are looking for two clusters of reads, where each cluster is a collection of reads supporting one haplotype. The haplotype is computed by a majority vote at each column (polymorphic site) over the reads assigned to the corresponding cluster. The MEC model counts the number of mismatches between the reads and the computed haplotypes; the final MEC score is the sum of the MEC scores of both clusters. Figure 3.1 illustrates how the MEC score is calculated for a specific partitioning. There are 2^{n−1} partitions, where n is the number of reads, and the aim of this model is to find the one with the minimum number of errors. These erroneous alleles are flipped such that all fragments in each cluster become compatible with the computed haplotype and with each other; the number of allele changes indicates how good the partitioning is. The MEC model suggests that the haplotypes assembled from the partition with the minimum number of corrected errors are the best ones. It has been shown that haplotyping under the MEC model is NP-hard [28].
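The following minimal Python sketch shows how the MEC score of a given read partition can be computed; the fragment encoding ('0'/'1' alleles, '-' for a missing allele) and the toy clusters are illustrative assumptions, not data or code from the thesis.

    from collections import Counter

    def consensus(cluster):
        # Column-wise majority vote over the fragments of one cluster ('-' = missing).
        n = max(len(f) for f in cluster)
        hap = []
        for k in range(n):
            votes = Counter(f[k] for f in cluster if k < len(f) and f[k] != '-')
            hap.append(votes.most_common(1)[0][0] if votes else '-')
        return ''.join(hap)

    def mec_score(partition):
        # Sum of mismatches between each fragment and the consensus haplotype of
        # the cluster it was assigned to (missing alleles are ignored).
        score = 0
        for cluster in partition:
            hap = consensus(cluster)
            for f in cluster:
                score += sum(1 for k, a in enumerate(f) if a != '-' and a != hap[k])
        return score

    # Toy example: two clusters of fragments covering four polymorphic sites
    cluster1 = ["0101", "0100", "01-1"]
    cluster2 = ["1010", "1-10"]
    print(mec_score([cluster1, cluster2]))  # 1: one mismatch in cluster 1, none in cluster 2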

Weighted Minimum Error Correction (wMEC): This model integrates allele qualities into the MEC model. In the MEC model, all alleles are considered to be equally accurate; however, it is possible to integrate the allele quality, given as the base quality for single nucleotide variants or as the average of base qualities for indels. The goal of the wMEC model is to find a partition such that the summed weight of the alleles that need to be flipped to obtain non-conflicting clusters is minimal. Like MEC, wMEC is NP-hard [29].

Minimum Fragment Removal (MFR and wMFR): This model assumes that the conflicting alleles among the reads assigned to one haplotype are introduced by misalignment. Its objective is therefore the number of fragments that need to be removed to obtain non-conflicting clusters, and it tries to find the minimum number of such fragments. Figure 3.1 shows which four reads need to be removed for the illustrated clusters. The SIH problem under the MFR setting is an NP-hard problem as well [29]. The wMFR model follows the same principle, but each fragment has a weight and the model calculates the sum of the weights instead of the number of fragments.

Figure 3.1: An illustration of the minimum error correction (MEC) and minimum fragment removal (MFR) models. For a diploid genome the reads are partitioned into two clusters (cluster 1 and cluster 2). For each cluster one haplotype is computed based on a majority vote in each column (colored in purple). Afterwards, the number of mismatches (in orange) between the reads in the clusters and the computed haplotypes is counted. In this example, there are three mismatches in cluster 1 and two mismatches in cluster 2, so in total the score for this partition is five. In the MEC model, the number of mismatches is an indicator of the goodness of the partitioning. In the MFR model, the goodness of a partition is calculated based on the number of fragments that need to be removed in order to obtain clusters with no disagreement. In the partition illustrated in this figure, there would be no dissimilarities between the reads in the clusters if the four reads marked with arrows were removed.

The MEC and wMEC models assume that the reads are correctly aligned and that all of the reads should be considered in haplotype assembly. After partitioning, the haplotypes can be assembled via a majority vote within each cluster. The MFR model checks whether a fragment belongs to any cluster, and if it does, it must not be in conflict with fragments of the same class. These models provide objective fitness functions to optimize. MEC and wMEC may keep fragments that are mistakenly mapped to the reference, which is common, especially in repeats. MFR assumes that errors occur when there is a misalignment and eliminates such alignments; it may therefore remove many informative fragments from error-prone third-generation sequencing data. All in all, although all models are useful as an objective function, none of them is applicable to all types of input data. The best strategy is to recruit a model, or a combination of them, based on the properties of the sequencing technology.

3.3 Early single individual haplotyping with sequence reads

The first attempts at assembling the haplotypes of the human genome, which showed the feasibility of haplotyping with sequence reads, resulted in a single-individual assembled genome reference called HuRef. To assemble HuRef, 32 million Sanger sequence reads of 500bp and 800bp length were used; these reads were paired with 2, 10, 40, and 100kb clone insert lengths. After de novo assembly of the genome with the Celera assembler [30], the assembled sequences were mapped to the reference genome and variants were called. These variants were then identified in the reads, and the reads were grouped into paternal and maternal reads. Finally, the consensus sequence of each group was reported as a computed haplotype. This algorithm is explained in more detail in the next paragraph and in Algorithm 1 [31].

Algorithm 1 represents the pseudocode of this approach. A list of aligned fragments (see Section 2.2 for the definition of a fragment) is considered as the input data (represented as F). The grouping step is done in a greedy manner. First, the fragment containing the maximum number of polymorphic sites is selected and assigned as the initial haplotype (C1). Recall that in diploid genomes one haplotype is representative of both paternal and maternal haplotypes, such that it itself represents one of the parental haplotypes and its bitwise complement represents the other one. Having found the longest fragment (C1) and its bitwise complement (C2) as the initial haplotypes, all fragments are clustered into two groups (F^C1 and F^C2) according to their similarity to the initial haplotypes. This similarity, or score, is represented as scr1 and scr2. Given a fragment fi, scr1 (scr2) indicates the similarity score to C1 (C2). fi is assigned to F^C1 if scr1 > scr2 and is assigned to F^C2 otherwise. Then, the consensus sequences of both clusters (F^C1 and F^C2) are computed as the new haplotypes (updating C1 and

C2). This procedure is repeated iteratively by assigning fragments to the new haplotypes and updating the consensus sequences. The algorithm stops when the haplotypes' sequences do not change in two consecutive iterations.

Levy et al. algorithm
input : a set of fragments F
output: a set of haplotypes H

    C1 ← find a fragment which covers the maximum number of alleles
    C2 ← bitwise complement of C1
    C1_old, C2_old ← ∅, ∅
    while C1 ≠ C1_old or C2 ≠ C2_old do
        F^C1, F^C2 ← ∅, ∅
        C1_old, C2_old ← C1, C2
        for fi in F do
            scr1 ← compute the similarity score of fi and C1
            scr2 ← compute the similarity score of fi and C2
            if scr1 > scr2 then
                F^C1 ← F^C1 ∪ {fi}
            else
                F^C2 ← F^C2 ∪ {fi}
            end
        end
        C1 ← consensus sequence of F^C1
        C2 ← consensus sequence of F^C2
    end

Algorithm 1: Levy et al. algorithm
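A minimal Python sketch of this greedy iterative procedure is given below, assuming diploid fragments encoded as strings over {'0', '1', '-'}; the helper names and the toy fragments are illustrative and do not reproduce the original implementation.

    def complement(h):
        # Bitwise complement of a binary haplotype string ('-' stays missing).
        return ''.join({'0': '1', '1': '0'}.get(a, '-') for a in h)

    def similarity(f, h):
        # Number of non-missing positions at which fragment and haplotype agree.
        return sum(1 for a, b in zip(f, h) if a != '-' and b != '-' and a == b)

    def consensus(cluster, n):
        # Majority vote per column; '-' if a column is not covered by the cluster.
        hap = []
        for k in range(n):
            votes = [f[k] for f in cluster if f[k] != '-']
            hap.append(max(set(votes), key=votes.count) if votes else '-')
        return ''.join(hap)

    def levy_phase(fragments):
        n = len(fragments[0])
        # Initial haplotype: the fragment covering the most alleles, plus its complement.
        c1 = max(fragments, key=lambda f: sum(a != '-' for a in f))
        c2 = complement(c1)
        while True:
            fc1 = [f for f in fragments if similarity(f, c1) > similarity(f, c2)]
            fc2 = [f for f in fragments if f not in fc1]
            new1 = consensus(fc1, n) if fc1 else c1
            new2 = consensus(fc2, n) if fc2 else c2
            if (new1, new2) == (c1, c2):   # stop when nothing changes
                return c1, c2
            c1, c2 = new1, new2

    print(levy_phase(["0101-", "01-11", "1010-", "-0100"]))  # ('01011', '10100')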

3.3.1 HapCut

HapCut [32] is a graph-based algorithm which builds a read-haplotype consistency graph. This graph is constructed from an initial haplotype and the input list of fragments. The algorithm starts with a random initialization of one haplotype; as mentioned before, the other haplotype is its bitwise complement. Nodes are the polymorphic sites, so the number of nodes is n if the haplotype length is n in coded allele space. Two nodes are connected if there is at least one read covering their corresponding polymorphic sites. For instance, '101' as an aligned fragment starting at position one (positions 1, 2, and 3) contributes three edges to the graph, namely (1,2), (2,3), and (1,3). The weight of an edge indicates the strength of co-occurrence of the variants at the two polymorphic sites in the initial haplotypes and is calculated as the number of fragments between the two nodes which are consistent with the initial haplotype minus those which are inconsistent; that is why the graph is called a read-haplotype consistency graph. If a read is consistent with the haplotype it contributes a positive weight, and if it is inconsistent it contributes a negative weight. Figure 3.2 shows an initial haplotype, a set of fragments, and their corresponding graph.

Figure 3.2: Schematic view of the HapCut algorithm. Left) A list of fragments and an initial haplotype. Right) The graph constructed from the initial haplotype and the list of fragments. The weights are calculated as the number of reads supporting the haplotype minus the number of reads which are inconsistent with it. The red weights are the negative ones. Figure adapted from [32].

A cut S in the graph is a subset of nodes; the nodes of the graph are thus divided into two sets, namely the cut (S) and its complement (S′). The weight of a cut is calculated as the sum of the weights of the edges connecting S and S′. As mentioned, the weights are obtained based on the initial haplotype; therefore, by bitwise complementing the variants corresponding to S, the following conditions hold:

• The weights of edges within S and also within S′ stay the same.

• The weights of edges between S and S′ are inverted.

Bansal and Bafna proved that any cut with weight smaller than zero leads to a haplotype with a lower MEC score; therefore, an optimal cut leads to an optimal MEC score. Since finding an optimal cut is NP-hard [33], Bansal and Bafna proposed a greedy heuristic method to find a sub-optimal cut and then update the haplotype with respect to that cut. This procedure is repeated iteratively until no better haplotype can be found. As an example, one good cut in the example shown in Figure 3.2 is S = {6, 7, 9, 13}. By flipping the variants of the haplotype at these polymorphic sites, all the red (negative) scores are converted into positive scores; however, a number of negative scores are also generated, e.g., w_{8,9} becomes '-1' rather than '1'. These negative edges are targeted in the next iterations until no further improvement can be made.

HapCut algorithm
input : a set of fragments F
output: a set of haplotypes H

    C1 ← random initial haplotype
    repeat
        construct the read-haplotype consistency graph G_C1
            nodes: SNPs
            edges: number of consistent fragments − number of inconsistent fragments
        compute a greedy heuristic cut S
        C1′ ← flip the value of C1 at positions s where s ∈ S
        if MEC(C1) > MEC(C1′) then
            C1 ← C1′
        end
    until no improvement in MEC(C1)

Algorithm 2: HapCut algorithm
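To make the edge-weight definition concrete, the following Python sketch computes the read-haplotype consistency weights for a diploid, binary-coded toy example; the function name and the data are illustrative and do not reproduce the HapCut implementation.

    from itertools import combinations
    from collections import defaultdict

    def consistency_weights(fragments, haplotype):
        # For every pair of SNP positions covered by a fragment, add +1 if the
        # fragment's allele pair is consistent with the current haplotype (or its
        # bitwise complement) and -1 otherwise.
        weights = defaultdict(int)
        for f in fragments:
            covered = [k for k, a in enumerate(f) if a != '-']
            for i, j in combinations(covered, 2):
                # consistent: the fragment agrees with the haplotype at both sites
                # or disagrees at both sites (i.e. it supports the complement)
                consistent = (f[i] == haplotype[i]) == (f[j] == haplotype[j])
                weights[(i, j)] += 1 if consistent else -1
        return dict(weights)

    # Toy example: two fragments against an initial haplotype '000'
    print(consistency_weights(["011", "0-1"], "000"))
    # {(0, 1): -1, (0, 2): -2, (1, 2): 1}  -> negative edges are candidates for the cut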

3.3.2 ReFHap

Like HapCut, ReFHap is a graph-based haplotype reconstruction method based on the MEC model, proposed by Duitama et al. [34]. Unlike HapCut, in which nodes are polymorphic sites and edges are inferred from fragments, ReFHap uses fragments as nodes, and the edges are inferred from the overlap between fragments. ReFHap constructs a graph data structure called a conflict graph, meaning that the weight of an edge indicates how likely the two nodes (fragments) are to come from different haplotypes: a high edge weight suggests that the corresponding fragments are more likely to stem from different haplotypes. Weights are calculated based on the numbers of inconsistent and consistent shared alleles between two fragments. ReFHap utilizes the conflict graph in order to cluster the nodes into two parts; in other words, ReFHap finds a cut to partition the fragments. Under ideal conditions with no error in the input, the graph would be bipartite with no edge within either partition, so the fragments stemming from each haplotype are clustered together. However, due to errors there may be edges which violate this ideal scenario. An error could be a miscalled variant, which may introduce edges within clusters, i.e., a conflicting edge between fragments originating from one haplotype. Misalignment may cause the loss of an edge between fragments of different haplotypes.

Finding the best cut is NP-hard [33]; therefore, ReFHap follows a heuristic in order to find a good cut and more precise haplotypes. ReFHap starts by assigning scores to all pairs of fragments. It finds a random cut and calculates its MEC score; this randomly obtained initial score then needs to be improved. To this end, ReFHap sorts the edges in descending order of their weight and iterates over the edges (e ∈ E_sorted) in the sorted list.

At each iteration, it assigns the two nodes of the selected edge e = (v_i, v_j) to two different clusters (v_i ∈ S and v_j ∈ S′). Then, all of the other nodes are partitioned according to the score of being assigned to each cluster (S or S′). This assignment is done such that the sum of the edge weights inside each group becomes minimal, meaning there is less conflict within both clusters. At the end of this step, all nodes are assigned to either S or S′. To optimize the cut locally, ReFHap also performs the following steps in each iteration: it iterates over the edges crossing the two clusters and checks whether swapping the nodes improves the overall score, and it performs the swap if it decreases the conflict score. Afterwards, the haplotypes are constructed based on a majority vote in each of the clusters. The constructed haplotypes should be bitwise complementary, since the fragments contain only heterozygous variants. In the last step, the sites with homozygous variants can be corrected with the help of the fragments supporting those variants. Finally, the MEC score for this iteration is calculated and kept if the overall score improves.

In order to improve speed at the cost of some accuracy, ReFHap defines a parameter k for the number of iterations over the edges. It is claimed that k = √|E| is a good bound for achieving good accuracy. Algorithm 3 details the steps of ReFHap.
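The fragment-pair weight driving this heuristic can be sketched as follows; the allele encoding ('-' for missing) and the toy fragments are illustrative assumptions rather than the ReFHap implementation.

    def conflict_weight(f1, f2):
        # ReFHap-style edge weight between two fragments: shared non-missing sites
        # at which they disagree minus those at which they agree. A positive value
        # suggests the two fragments come from different haplotypes.
        shared = [(a, b) for a, b in zip(f1, f2) if a != '-' and b != '-']
        disagree = sum(1 for a, b in shared if a != b)
        return disagree - (len(shared) - disagree)

    print(conflict_weight("0101-", "1010-"))  #  4: all four shared sites disagree
    print(conflict_weight("1010-", "10-00"))  # -3: three shared sites agree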

3.4 NGS technologies

Since 2005, massively parallel sequencing technologies have drastically reduced the cost of whole-genome sequencing (WGS) and re-sequencing. Along with the advancement of these technologies, haplotyping methods were developed to use such high-throughput data. Since Illumina sequencers have been the most widely used in past years, we investigate the methods dealing with such data in the following sections. Although NGS is the most cost-effective way to perform WGS of the human genome, it is infeasible to link distant alleles or to assemble long stretches of haplotypes with the current read length (100 to 300bp). As shown in Figure 2.8, fewer than 1% of reads are useful for human haplotyping with single-end reads of length 100bp. Integrating paired-read

Figure 3.3: A conflict graph. This figure shows a conflict graph obtained from the fragment list shown in the upper left part. The fragment list is converted into the conflict graph shown in the upper right. Nodes are the fragments, and edge weights are calculated as the number of mismatches minus the number of matches between two fragments. A good cut splits the fragments into two separate groups, S = {1} and S′ = {2, 3, 4}. Figure adapted from [35].

information helps to improve the results, especially using diverse insert lengths and high coverage data, but it remains difficult to expand haplotypes to long DNA stretches using only Illumina reads. Moreover, population, transcriptome, Strand-seq, or Hi-C data can also be integrated with NGS data to boost the performance of haplotypers.

3.4.1 Integration of variable insert length NGS data

Using various insert lengths of paired-end reads helps to obtain longer haplotypes from short reads. Through simulation, Tewhey et al. [36] reported that the integration of different insert sizes, together with sufficient coverage, compensates for the short length of Illumina reads. Figure 3.4 indicates that ∼100-fold coverage of three insert sizes results in the same haplotype variant N50 as a read length of 10kb with 50-fold coverage. In this analysis, the size of a single-end read is 100bp; the paired reads are sampled randomly from chromosome 1 of a Yoruban individual (NA19240 from the 1000 Genomes Project), and the insert length distributions are normal with means of 2, 5, and 10kb and a standard deviation of 10%. The same variant N50 could be obtained by using only 10kb mate-pairs but at higher coverage (∼140-fold).

3.4.2 Integration of NGS and population data

Yang et al. [37] proposed a method called haplotyping with reference and sequence technology, HARSH for short, which combines haplotype assembly and haplotype phasing to obtain more accurate haplotypes. Since thousands of individuals have been sequenced and more are going to be sequenced, utilizing reference panels of haplotypes can be very useful for single individual haplotyping.

ReFHap algorithm
input : a set of fragments F
output: a set of haplotypes H

    construct the fragment graph G(V, E)
        V : the fragments f ∈ F
        E : number of inconsistent alleles between two fragments − number of consistent ones
    (C1, C2) ← random initial cut
    E_sorted ← sort edges in descending order of weight
    E_sorted^k ← select the k = √|E| highest-weight edges
    for e = (fi, fj) ∈ E_sorted^k do
        assign fi and fj to two different partitions Ĉ1 and Ĉ2
        repeat
            (Ĉi, v) ← find an unassigned node v such that assigning it to Ĉ1 or Ĉ2 maximizes the sum of the scores; assign v to Ĉi
        until Ĉ1 ∪ Ĉ2 = V
        perform two local optimizations by flipping edges and nodes if this improves the score
        if score(Ĉ1, Ĉ2) > score(C1, C2) then
            (C1, C2) ← (Ĉ1, Ĉ2)
        end
    end

Algorithm 3: ReFHap algorithm

In other words, the reference haplotype panels are becoming more accurate over time; therefore, the assembled haplotypes become more accurate as well. From the sequence data, the suggested probabilistic model tries to find two haplotypes with the minimum number of alleles conflicting with the read data (MEC model). From the reference haplotype panel data, it utilizes the statistical model for patterns of linkage disequilibrium suggested in [38]. In general, the aim is to obtain haplotypes that are consistent with both the sequence reads and the reference haplotype panel.

3.4.3 Integration of NGS and transcriptome data

The missing connectivity of distant alleles via paired reads is the major problem for haplotyping with genomic sequence data alone. Transcriptome data can help the phasing of exonic regions with short reads. This idea is motivated by two factors:

• RNA-Seq read pairs are both sampled from one chromosome. They provide information about the haplotype from which the gene was transcribed. Since each read (or part of a read) of a transcript is sampled from one mRNA, there is a

Figure 3.4: The comparison of different insert sizes for the haplotype assembly of the human genome. The y-axis shows the variant N50 (this figure is adapted from [36]).

chance of sampling one read (or part of it, in split reads) from one exon and the other read (or part of it) from another exon, which provides longer-distance allele connectivity than normal DNA-seq reads. As an example, if the two reads of a pair with an insert of 1kb are sampled from two consecutive exons, the paired-end read information gives a connectivity of 1kb plus the intron size (Figure 3.5-b). Berger et al. suggested this idea to leverage single individual haplotyping [39].

• Differential allele-specific expression (DASE) information within RNA-Seq data can be employed to detect the connectivity of distant variants. If the expression of the maternal and paternal chromosomes differs significantly, it is possible to assign the exons with the same level of transcription to one haplotype. Differential haplotypic expression thus gives information on the connections of variants in exonic regions in the range of the gene length (Figure 3.5-c).

HapTreeX integrates this information into a haplotyper. It works under a probabilistic model: it searches for a collection of partial solutions according to a relative likelihood function of the reads and a set of candidate haplotypes. This collection consists of a set of haplotypes of length k, from the 1st SNP to the k-th SNP. HapTreeX extends these haplotypes to the (k+1)-th SNP position and iterates this procedure until it covers the full haplotype length [40].

3.4.4 Chromosome isolation

One direct, yet complicated and labour-intensive, approach to haplotyping is chromosome isolation. Fan et al. proposed to separate chromosomes during metaphase in order to capture each chromosome. Afterwards, the method collects the chromosomes one by one, builds two sets of parental chromosomes, and then amplifies and genotypes each set.

Figure 3.5: Haplotyping with the help of transcriptome data. This figure shows the allele connectivity via paired reads (a), RNA-Seq reads (b), and DASE information (c). By 'DNA-seq reads' we mean genomic sequence reads here. Part (d) shows the length of the region of effect for each of the information layers. This figure is adapted from [40].

Since the alleles in each set belong to one chromosome, there is no need for haplotype assembly. In principle, it is possible to sequence each chromosome set to detect SNVs as well [41].

3.5 Single cell sequencing technologies with NGS

3.5.1 Linked-read data (10X Genomics)

Zheng et al. introduced another type of sequence reads called linked-reads [42]. Conceptually, a linked-read is a collection of highly accurate short reads which are known to be sampled sparsely from one chromosome. This information is provided by means of sequence barcodes; the order of the reads within the fragment is not known but can be inferred, for example, by mapping them to the reference. The underlying technology is based on emulsion PCR. It consists of 1) a number of gel beads, each of which contains a unique 16-letter barcode, the necessary sequence adapters, and polymerases, and 2) a microfluidic system. High-molecular-weight DNA sequences (<100kb) are loaded into the gel beads (droplets), and the fragments in each droplet are barcoded with one barcode. Afterwards, the emulsion is broken and the barcoded reads are sequenced (Figure 3.6). Since both the number of fragments per gel bead and the size of the high-molecular-weight DNA in each gel bead are limited, the chance of loading two fragments from the two homologs of the same chromosome is very low (0.5% [42]). The length of the initial fragments can also be used to further reduce this chance. Therefore, it is highly probable that the reads with the same barcode that are mapped to a specific region in the genome are from one haplotype. Although the coverage of each fragment in one droplet is very low, it provides highly accurate information about distant alleles.

Long Ranger: This software integrates linked-reads and defines a probabilistic model that takes the matrix of alleles connected via linked-reads, together with the variant qualities, as the observation, and it tries to maximize the likelihood of a phasing given the observed reads. It is worth mentioning that, in order to further reduce the chance of seeing two fragments from different homologs that were by chance loaded into one droplet, barcodes with inter-read gap lengths of >50kb are excluded. The algorithm first tiles the genome into blocks of 40 variants and applies a beam search to find a near-optimal solution. Then the junctions between blocks are connected via the reads spanning consecutive blocks. Afterwards, it iteratively inverts the haplotype assignment to increase the likelihood until it converges. The last step of the algorithm is finding weak breakpoints to break the haplotypes into highly confident blocks; these breakpoints are obtained by inverting the haplotype on the left side of a candidate breakpoint and calculating the score of the inverted assignment. This algorithm results in haplotypes of several Mb [42].
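The barcode filtering step mentioned above can be sketched as follows; the input data structure (a mapping from barcode to the mapped positions of its reads) and the function name are illustrative assumptions, not the Long Ranger implementation.

    def usable_barcodes(read_positions, max_gap=50_000):
        # Keep only barcodes whose consecutive mapped reads are never more than
        # max_gap apart; barcodes with larger inter-read gaps are excluded, since
        # they may mix fragments from different homologs loaded into one droplet.
        kept = {}
        for barcode, positions in read_positions.items():
            positions = sorted(positions)
            gaps = [b - a for a, b in zip(positions, positions[1:])]
            if all(g <= max_gap for g in gaps):
                kept[barcode] = positions
        return kept

    reads = {
        "ACGTACGTACGTACGT": [10_000, 22_000, 35_000],    # kept
        "TTTTAAAACCCCGGGG": [10_000, 22_000, 150_000],   # excluded: 128 kb gap
    }
    print(list(usable_barcodes(reads)))  # ['ACGTACGTACGTACGT']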

Figure 3.6: The schematic view of the 10X Genomics pipeline, from high-molecular-weight fragments to assembled haplotypes (adapted from [43]).

3.5.2 Strand-seq

Strand-seq is a single cell protocol that allows sequencing of a single strand of a chromosome, either the forward (Crick strand, C for short) or the reverse (Watson strand, W for short). The complementary strand of each sister chromatid is labelled with bromodeoxyuridine (BrdU) instead of thymine during replication. After cell division, there are four possibilities for the original strands in the daughter cells, i.e., C/C, W/W, W/C, and C/W. The replicated strand can be distinguished by the BrdU labels and removed before sequencing. Therefore, after sequencing each cell, the original strands of the mother cell can be identified by mapping the reads onto the reference: reads from the Crick (Watson) strand map to the forward (reverse) strand of the reference. Hence, the paternal and maternal reads can be separated according to their mapping to the forward or reverse reference, and W/C and C/W regions are used to provide haplotype information.

One difficulty of using such data is the low coverage, which is about 0.1x. Thus, sequencing of one single cell provides very sparse but chromosome-wide haplotype information. The connections between alleles can span areas that are problematic for haplotyping, such as centromeres and non-heterozygous regions. To deal with the low coverage, many cells need to be processed [44]. This protocol moves the challenge from haplotyping to sequence assembly, since the origin of each read can be obtained from the directionality of its mapping.

Figure 3.7: Strand-seq pipeline. (adapted from [44])
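A minimal sketch of how the template-strand state of one chromosome in a single Strand-seq cell could be classified from read mapping orientations is given below; the 80% fraction threshold and the function name are illustrative assumptions.

    def strand_state(read_strands, min_fraction=0.8):
        # Infer the template-strand state of one chromosome in one cell from the
        # mapping orientation of its reads ('C' = forward/Crick, 'W' = reverse/Watson).
        # Only the mixed W/C state is directly informative for haplotyping.
        crick = read_strands.count('C')
        watson = read_strands.count('W')
        total = crick + watson
        if total == 0:
            return None
        if crick / total >= min_fraction:
            return "CC"
        if watson / total >= min_fraction:
            return "WW"
        return "WC"   # one homolog on each strand: reads separate by haplotype

    print(strand_state(list("CCWCWWCWCW")))  # 'WC' -> usable for haplotyping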

3.6 Third-generation sequencing

As for genome assembly, the accuracy and the length of the sequence reads are among the main parameters for haplotype assembly. Third-generation sequencing, also called Single Molecule Real-Time (SMRT) sequencing, such as PacBio and Nanopore, provides much longer reads than NGS data, with lengths ranging from 2 to 20kb, albeit at a higher error rate (10%). These reads can connect several polymorphic sites and improve the allele connectivity for haplotyping. However, the high error rate can convert one allele into another and cause flip or switch errors in the assembled haplotypes. The error rate can be reduced by building consensus reads via all-vs-all alignment of the reads; however, this strategy comes at the cost of requiring higher coverage.

3.6.1 ProbHap

Kuleshov suggested a probabilistic model for human haplotyping called ProbHap. He designed a dynamic programming algorithm on a probabilistic graphical model which optimizes the MEC objective function. This model uses the sum-product message passing algorithm, which is a generalisation of the HMM (Hidden Markov Model) framework. The objective function is based on the likelihood of the assignment of reads to haplotypes and the probability of this assignment given the alleles and their qualities; the former are the hidden variables of the model, while the latter are considered the observed variables [45].

3.6.2 WhatsHap

WhatsHap is another approach, designed for so-called future-generation sequencing methods. The tool was developed to be independent of the read length (speculating that reads will get longer in the future). Like ProbHap, WhatsHap takes the base qualities into account; hence, with current sequencing technologies it fits best with third-generation sequencing data. The WhatsHap core algorithm uses the coverage as its fixed parameter; however, when the coverage exceeds 20x, its performance drops. WhatsHap uses a dynamic programming approach to solve a wMEC problem. In the wMEC model, the reads are partitioned into two clusters such that the objective function is optimized; the score of a partition is the sum of the weights of the alleles that need to be flipped so that the clusters become conflict-free. WhatsHap proceeds from one end of the scaffold to the other and evaluates the scores of the possible bipartitions of the reads (up to 2^|reads|) to find the one with the best objective score.

3.6.3 Third-generation sequencing plus NGS reads

Pendleton et al. proposed a de novo assembly and haplotyping pipeline for a single human individual. They used 44x coverage of PacBio reads of NA12878 genomic DNA for contig assembly with the Celera [46] and Falcon assemblers. The contigs were connected through BioNano genome maps using 80x coverage of long molecules (>180 kb) with mean spans of 277.9 kb [25]. BioNano, in short, is an optical physical mapping strategy which marks specific sequence motifs with a fluorescent label on HMW, i.e. High Molecular Weight, fragments (currently up to 2Mb). The fragments are loaded into a chip, linearised, and imaged. Knowing the labelled sites in the fragments and the corresponding contigs helps the scaffolding procedure (BioNano Genomics).

For haplotyping, they first compared the SNVs and indels called from the high-depth Illumina trio data and from the PacBio sequencing. They found far more heterozygous polymorphic sites with Illumina sequencing than with the PacBio data alone. Therefore, they integrated the Illumina and PacBio reads to call the SNVs and then connected them via highly confident reads of both types. They recruited HapCut [32] as the haplotyping method to obtain the final haplotypes [25].

3.7 Haplotype reconstruction for polyploid genomes

Most of the methods for diploid haplotyping, such as HapCut, base their algorithm on the fact that the haplotypes in a diploid genome are the bitwise complement of each other, so they assemble one haplotype and infer the other as its complement. These methods start with an initial haplotype, which is one among 2^{#heterozygous sites} possible haplotypes. The initial haplotype can be random or a local optimum based on the input data. However, this property does not hold for polyploid genomes. First, all haplotypes have to be inferred independently, and the alphabet for haplotypes is not binary; hence, one true haplotype is one possibility among P^{#heterozygous sites} possible haplotypes. This characteristic, along with the ambiguity of merging fragments (see Section 2.5.1), defines a new problem for polyploid haplotyping. Nevertheless, in a broader view, both problems can be considered as a clustering problem with P groups, with P = 2 for diploid and P > 2 for polyploid genomes. Schematically, there are different layers of information for polyploid haplotyping, which are depicted in Figure 3.8.

There are a number of methods designed for polyploid genomes. An extension of HapCompass, which was initially designed for diploid haplotyping, as well as SDhaP, H-PoP, and Ranbow are the most recent and well-known haplotypers. HapTree also has an extended version for polyploid genomes, but it is excluded here since it needs genotype information as input and has technical issues when run on real and simulated data.

Figure 3.8: Different layers of complexity for polyploid haplotyping. Skyline layer: The number of unique sequences of chromosomal segments. Structure layer: The number of copies of each unique sequence. Haplotypes layer: The number of variants which are arranged into the haplotypes.

3.7.1 HapCompass

Aguiar and Istrail [47] proposed a graph data structure built from reads and polymorphic sites, called the compass graph, to solve the SIH problem under the wMFR model. The nodes in this graph are the possible phases of every two adjacent polymorphic sites. For instance, for a diploid genome the two possible phases are '00' (and its complement '11'; for simplicity we merge both and write '00,11') and '10,01'. The edge weights are calculated based on the number of reads supporting each combination. For example, if there are five reads supporting '00' or '11', and three reads supporting '10,01', the weight is calculated as 5 − 3 = 2. Using this graph, each path provides a haplotype. Positive edges indicate that '00' is the more pronounced phase, while negative weights suggest '10,01' as the phase. The cycles in the graph can be desired (happy cycles) or conflicting: if the number of negative edges in a cycle is even, the cycle is happy; otherwise, it is conflicting. Aguiar and Istrail defined the problem as removing the minimum number of edges such that the graph becomes conflict-free. Each spanning tree of a happy compass graph gives the haplotypes. Based on this, HapCompass first finds the maximum spanning tree, then searches for the conflicting cycles and resolves them until the graph becomes happy (Algorithm 4).

This algorithm is designed for diploid genomes and needs to be generalized for polyploid ones. Aguiar and Istrail modified the algorithm and the graph structure for polyploid genomes. Firstly, the weight definition is redefined based on the likelihood of a possible phase, owing to the complication rooted in the AoM problem, i.e. the ambiguity of merging problem (see Section 2.5.1). For instance, there are two possibilities for connecting the nodes '00,01,10' and '00,01,10' (namely '000,010,101' or '001,100,010') in a triploid genome; the weight indicates which one is more probable. Secondly, HapCompass finds a spanning tree based on the algorithm described above for the diploid case, and then keeps the nodes in the order in which they appear in the spanning tree. This graph with the ordered nodes is called a chain graph. The chain graph can be considered as a network flow problem with P sources and P sinks; P paths with disjoint nodes and disjoint edges are desired, each of which represents one haplotype [47].

HapCompass algorithm
input : a set of fragments F, a set of polymorphic sites
output: a set of haplotypes H

    G ← construct the compass graph
    T ← find the maximum spanning tree of G
    CG ← find all conflicting cycles
    for c ∈ CG do
        resolve the conflicting cycle c by removing edges
        update the compass graph G after the edge removal
    end

Algorithm 4: HapCompass algorithm
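The happy-cycle criterion can be illustrated with a tiny Python check; the function name is illustrative and the weights are toy values.

    def is_happy_cycle(cycle_weights):
        # A cycle in the compass graph is 'happy' if it contains an even number of
        # negative-weight edges; otherwise it is conflicting and edges must be removed.
        negatives = sum(1 for w in cycle_weights if w < 0)
        return negatives % 2 == 0

    print(is_happy_cycle([2, -3, 5, -1]))  # True  (two negative edges)
    print(is_happy_cycle([2, -3, 5]))      # False (one negative edge -> conflicting)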

3.7.2 SDhaP

SDhaP uses a semi-definite programming approach that aims to find an approximate solution to the haplotype reconstruction problem for polyploid genomes by a greedy search in the space of all possible phasings. The algorithm starts with random initial haplotypes and tries to find solutions with a maximum score by changing the initial haplotypes according to a gradient descent method [48].

3.7.3 H-PoP

Xie et al. proposed a method called H-PoP, which models the problem as an optimal poly-partition problem of reads, called the Polyploid Balanced Optimal Partition (PBOP) model. For a P-ploid genome, PBOP clusters the reads into P groups such that a defined fitness function is maximized. This fitness function is calculated from a combination of the distances between the reads within each group and the distances between the reads of different clusters. Since this model is NP-hard, they designed a heuristic dynamic programming algorithm that considers the solution for m reads and then extends this solution with the aid of the (m+1)-th read [49].

3.8 Summary

In this chapter, we reviewed the technologies, protocols, and approaches for single individual haplotyping. We first reviewed human haplotyping and then also discussed the polyploid haplotype assembly tools. Each of the reviewed types of input data has a domain of target polymorphic sites. RNA-Seq data provides distant variant connectivity in the range of the gene size. Population data helps to fill gaps within recombination hotspot regions. Additionally, linked-reads give long-range connections, in the range of 100kb depending on the library preparation, with high accuracy, but with many gaps in between. The Strand-seq protocol provides whole-chromosome connectivity, but in a very sparse way. Recently, third-generation sequencing technologies, such as PacBio and Nanopore, have provided long sequence reads; the drawbacks of such data are its high cost and error rate. Figure 3.9 presents a schematic view of how the different methods find the connectivity of variants. All in all, the mentioned methods are either inefficient for calling whole-chromosome haplotypes or are costly; hence, this field is still open to both methodological and technological improvement.

In the last section of this chapter, we reviewed the state-of-the-art methods for polyploid haplotyping and explained why polyploid haplotyping is more challenging than diploid haplotyping. Moreover, some of the mentioned protocols for diploid haplotyping are either too expensive and laborious to be applied to polyploid genomes, such as haplotyping with chromosome isolation, or not applicable owing to the higher ploidy, like Strand-seq. The high error rate of third-generation sequencing data makes the problem more complicated, since it is hard to distinguish true alleles from erroneous sequences. However, the high heterozygosity, which is seen more often in polyploid genomes, helps to utilize the highly accurate short reads for phasing; the higher the heterozygosity, the greater the chance of capturing alleles through short sequence reads. This characteristic led us to design a novel method for polyploid haplotyping. In the next chapter, we explain this method in detail.

Figure 3.9: A schematic view of allele connectivity via different types of data. One circle is one variant, either called (green and orange) or not called; the called variants can be of low accuracy (orange) or high accuracy (green).

Chapter 4

Haplotype assembly of polyploid genomes

4.1 Introduction

In the previous chapter, the technologies, protocols, and methods addressing single individual haplotyping for diploid and polyploid genomes were reviewed. In this chapter, we explain a novel method, called Ranbow, for reconstructing the haplotypes of a polyploid genome from sequence reads (the name is inspired by "rainbow"; as raindrops split white light into the light spectrum, Ranbow splits the reference sequence into its haplotypes). In the following sections, we first give the mathematical problem definition of polyploid haplotype assembly; then, Ranbow is explained in detail. Note that terms such as interpolymorphic regions, haplotypes in nucleotide space, and haplotypes in coded allele space, which are used in the following sections, are defined in Section 2.2.

Recall that for the haplotype reconstruction of an organism, its reference sequence (Fasta file), variants (VCF file), and aligned reads (BAM or SAM file) are considered as the input. The goal is to produce a list of aligned sequences as the haplotypes.

4.2 Problem definition and formulation

A haplotype is the sequence of nucleotides on one chromosome. The homologous chromosomes of an individual are similar to each other except at the polymorphic sites; thus, these haplotypes can also be considered as a series of interpolymorphic regions, i.e. regions which are identical among the homologous chromosomes (denoted as d_i), and the heterozygous variants at SmP sites, i.e. Small Polymorphisms (denoted as v_j[i]), between them (see the nucleotide space panel of Figure 4.1 for an illustration and Section 2.2 for the definitions).


Figure 4.1: Haplotypes in nucleotide and coded allele space. This figure illustrates how haplotypes are transferred from nucleotide space to coded allele space.

For example, the haplotype sequences in a triploid genome with three polymorphic sites are:

In nucleotide space:

    h1 = d0 . v1[1] . d1 . v1[2] . d2 . v1[3] . d3
    h2 = d0 . v2[1] . d1 . v2[2] . d2 . v2[3] . d3        (4.1)
    h3 = d0 . v3[1] . d1 . v3[2] . d2 . v3[3] . d3

where '.' indicates the concatenation operation of sequences. Due to the fact that the d_i's are identical among all homologous chromosomes, they can be removed and a haplotype can be simplified and represented as (see Figure 4.1):

    ĥ_j = v_j[1] . v_j[2] . v_j[3] ... v_j[n]        (4.2)

where n is the number of polymorphic sites and the v_j[i] are the variants. For simplicity, the v_j[i] are coded into numbers ranging from zero to P − 1, where P is the ploidy of the organism: 0 refers to the reference allele, 1 to the first alternative allele, 2 to the second one, and so on. By this definition, all types of SmPs are coded into numbers; recall that a SmP can be a single nucleotide polymorphism, a multi-nucleotide polymorphism, or an indel (see Section 2.2 for the definitions). Therefore, a haplotype can be written as in Eq. 4.3, which is called a haplotype in coded allele space (see the coded allele space panel of Figure 4.1 for an example):

    h_j = a_j[1] . a_j[2] . a_j[3] ... a_j[n]        (4.3)

where aj[i]’s are the coded variants. In this chapter, by default haplotypes are considered in coded allele space.

In order to convert a haplotype back from coded allele space into nucleotide space after haplotype assembly, first the numbers are transformed into variants (a_j[i] → v_j[i]) and then the interpolymorphic sequences (d_i) are inserted between the variants.
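The conversion between the two spaces can be sketched in Python as follows; the data layout (a list of interpolymorphic regions d0..dn and a list of alleles per site, reference allele first) and the toy triploid example are illustrative assumptions.

    def encode(variants_per_site, haplotype_variants):
        # Code a haplotype's variants into allele space: 0 for the reference allele,
        # 1 for the first alternative allele, and so on, per polymorphic site.
        return [variants_per_site[i].index(v) for i, v in enumerate(haplotype_variants)]

    def decode(inter_regions, variants_per_site, coded):
        # Insert the interpolymorphic sequences d_i back between the decoded
        # variants to recover the haplotype in nucleotide space.
        out = [inter_regions[0]]
        for i, a in enumerate(coded):
            out.append(variants_per_site[i][a])
            out.append(inter_regions[i + 1])
        return ''.join(out)

    d = ["AAT", "GG", "CCA", "TT"]                  # interpolymorphic regions d0..d3
    sites = [["A", "T"], ["C", "G"], ["G", "GA"]]   # alleles per site (reference first)
    h1 = ["T", "C", "GA"]                           # one haplotype's variants
    coded = encode(sites, h1)
    print(coded)                    # [1, 0, 1]
    print(decode(d, sites, coded))  # 'AATTGGCCCAGATT'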

Computed versus ground truth haplotypes: In order to clarify whether a haplotype is computed or a ground truth one, we define the following sets, where H is the set of computed haplotypes and H^True is the set of ground truth haplotypes:

    H^True = {h1^true, h2^true, h3^true, ..., hP^true}
    H      = {h1, h2, h3, ..., hq},    q ≤ P        (4.4)

where P is the ploidy and q is the number of assembled haplotypes. q might be less than P because there might be a lack of signal, such as missing sequence reads from some haplotypes.

Fragment: A fragment f is the sequence of variants covered by an aligned read (recall that the variants and the reads are the inputs). To construct a fragment from a read, the variants are first identified in the read. Then, the regions between the polymorphic sites are removed and each variant is coded into a number according to the genotype. Figure 4.2 depicts how a read is converted into a fragment. Figure 4.2-A shows sequence variants and how they are coded into numbers: it contains two tables, one listing the variants and their positions in nucleotide space and the other containing the same variants in coded allele space. The former helps to identify the starting point, the length, and the sequence of a variant, while the latter transforms them into coded allele space.

Figure 4.2-B illustrates a read which covers four sequence variants and a missing allele (explained in the next paragraph): a deletion, a reference variant, an insertion, a missing allele, and a substitution, respectively. These variants are the i-th to (i+4)-th variants in the variant list, meaning that the variants before the i-th and after the (i+4)-th positions are not covered by the read.

If a read contains a variant which is not reported in the list, it is considered as missing data or a missing allele (shown in purple in Figure 4.2-B). In general, a missing allele in a fragment is a SmP, or small variant, which is not covered by the fragment and is represented by '−'. If the fragment contains a variant, we need to check whether it is the reference or an alternative allele and how it is coded into a number (Figure 4.2-A, right).

Figure 4.2: Conversion of a read into a fragment. There are three types of variation (an insertion, a deletion, and a substitution) covered by the read. This figure shows how the information is formulated into the fragment sequence.

Similarity and dissimilarity functions: The similarity and dissimilarity functions are defined as follows:

    sim(f1, f2) = Σ_{k=1}^{n} s(f1[k], f2[k])
    dim(f1, f2) = Σ_{k=1}^{n} d(f1[k], f2[k])        (4.5)

    s(f1[k], f2[k]) = 1 if f1[k] = f2[k] and f1[k] ≠ '−' and f2[k] ≠ '−', and 0 otherwise
    d(f1[k], f2[k]) = 1 if f1[k] ≠ f2[k] and f1[k] ≠ '−' and f2[k] ≠ '−', and 0 otherwise

where f1 and f2 are two fragments and n is the haplotype length. s(f1[k], f2[k]) (d(f1[k], f2[k])) returns 1 if the two variants f1[k] and f2[k] are both available and identical (different).

The similarity function goes over all variants covered by fragments and checks whether the variants at each polymorphic site are equal in both of the fragments. It computes the number of polymorphic sites in which both fragments carry the same variant; hence, if they carry different variants or one of the fragments carries a missing allele, the polymorphic site is not counted. The dissimilarity function is defined in the same way but it computes the number of different available (non-missing) variants.
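These two functions translate directly into Python; the sketch below keeps the document's names sim and dim, and the fragment encoding and toy fragments are illustrative assumptions.

    def sim(f1, f2):
        # Number of polymorphic sites at which both fragments carry the same
        # non-missing allele (Eq. 4.5).
        return sum(1 for a, b in zip(f1, f2) if a != '-' and b != '-' and a == b)

    def dim(f1, f2):
        # Number of polymorphic sites at which both fragments carry non-missing
        # but different alleles (Eq. 4.5).
        return sum(1 for a, b in zip(f1, f2) if a != '-' and b != '-' and a != b)

    f1, f2 = "-0102-1", "12111-3"
    print(sim(f1, f2), dim(f1, f2))  # 1 4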

Haplotype segment: A haplotype segment is the consensus sequence of a set of fragments (F′ = {f1, f2, f3, ..., ft} ⊂ F), which is computed by the merge function and is defined as:

    s = merge(F′) = argmax_{s′} Σ_{fk ∈ F′} sim(fk, s′)        (4.6)

where F′ is a set of fragments and s is the haplotype segment computed as the consensus haplotype of F′. The function which computes s is called the merge function. Having computed s, F′ is also called the set of supporting fragments of s, which is explained later in more detail.

A segment is either a fragment or a consensus sequence of a set of fragments. The former stems directly from a haplotype, and the latter is predicted to originate from one haplotype. Hence, each segment is supported by one or more fragments, and each of its variants is covered by at least one fragment.

Note that the merge function (Eq. 4.6) does not solve the Ambiguity of Merging problem (AoM problem, see Section 2.5.1); rather, it assumes that the input set F′ is already unambiguous and that all fragments fk ∈ F′ stem from one haplotype. The solution for the AoM problem is explained in Section 4.3.
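Under the assumption that the fragments indeed stem from one haplotype, the argmax in Eq. 4.6 decomposes over the polymorphic sites, so a column-wise majority vote yields the consensus. The following short sketch illustrates this; the encoding and toy fragments are illustrative.

    from collections import Counter

    def merge(fragments):
        # Column-wise majority consensus of fragments assumed to stem from one
        # haplotype; missing alleles ('-') do not vote.
        n = len(fragments[0])
        segment = []
        for k in range(n):
            votes = Counter(f[k] for f in fragments if f[k] != '-')
            segment.append(votes.most_common(1)[0][0] if votes else '-')
        return ''.join(segment)

    supporting = ["010-2", "01-12", "-1012"]
    print(merge(supporting))  # '01012'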

The aim of haplotype reconstruction is to assemble all, or a subset of all, fragments F = {f1, f2, f3, ..., fm} into assembled haplotypes H = {h1, h2, h3, ..., hP}. In an ideal scenario, i.e. with a sufficiently accurate signal such as sufficient coverage and long enough, error-free fragments, the aim of haplotype reconstruction is to merge the fragments into exactly P haplotypes which are one-to-one equal to the ground truth ones:

    sim(h_i, h_j^True) = n :  h_i ∈ H,  h_j^True ∈ H^True        (4.7)

where H and H^True are the sets of computed and ground truth haplotypes, respectively, and n is the length of the haplotypes.

In reality, due to sequencing errors, mapping errors, and lack of connectivity between alleles, haplotypes may be assembled partially. So, the aim of haplotype assembly is computing a number of haplotypes with maximum length and maximum similarity (or minimum dissimilarity) with the ground truth ones.

4.2.1 Terminologies for Ranbow algorithm

We view haplotyping as a clustering of fragments in seed regions. One seed region, its SmPs, and the aligned reads are illustrated in Figure 4.3 in both nucleotide and coded allele space. These seed regions are called 'masks', which are explained below. This view helps us to avoid the AoM problem (see Section 2.5.1). Our approach looks at the allele combinations in sets of polymorphic sites. To this end, we first define a mask:

Mask (msk): A mask msk is an ordered set of indices of polymorphic sites. For instance, (1, 2, 4) and (2, 4, 6) are two masks in a haplotype with seven polymorphic sites (all indices being (1, 2, 3, 4, 5, 6, 7)). Figure 4.4-A shows a mask msk1 that covers the indices 3, 6, and 8; hence, msk1 = (3, 6, 8).

Seed sequences of a mask (t^msk): A seed sequence is a subsequence of a fragment excluding '−'s. It indicates the sequence pattern of one fragment at the indices of the mask. For example, given the fragments f1 = -0102-1 and f2 = 12111-3, t1^(2,4,7) = 001 and t2^(2,4,7) = 213 are two seed sequences of the mask (2, 4, 7). Hence:

    fk[msk_i[j]] = a[j] :  a[j] ∈ t^{msk_i}        (4.8)

where a[j] is the variant of t^{msk_i} at the index msk_i[j], belonging to fk. Figure 4.4-A shows the corresponding seed sequence of msk1 = (3, 6, 8) covered by f1.

Set of seed sequences (T^{msk_i}): T^{msk_i} is defined as the set of all seed sequences covered by a mask msk_i. Figures 4.4-A and 4.4-B illustrate two examples of regions covered by one and three fragments, respectively. f1 contains '121' at the indices (3, 6, 8), so t1^(3,6,8) = 121. In Figure 4.4-B, each of f2, f3, and f4 contributes one seed sequence to the mask msk2 = (20, 22, 24). '001' and '002' are covered by one (f3) and two (f2, f4) fragments, respectively; hence, '001' and '002' have sets of supporting fragments of size one and two. Figure 4.4-C shows that none of the depicted fragments contributes a seed sequence to the mask covering the indices 34 and 44.

Figure 4.3: From sequence reads to initial haplotypes of a triploid genome in a mask region. The fragments are clustered according to the seed sequences they carry. There are three clusters; each constructs one haplotype segment (shown in blue boxes). The haplotype segments are supported by four, three, and three fragments, respectively (blue, red, and green fragments). The purple read and fragment are erroneous and are set aside from the assembly. Two tables are shown in the lower part of the figure: the left one is in nucleotide space and the right one is obtained by transforming the left table into coded allele space. Recall that the reference, the aligned reads, and the list of variants are the inputs.

Depending on the number of reads mapped to a region, one mask may be covered by zero, one, or a few fragments, each of which contributes one seed sequence to the mask. If a read contains variants at all indices indicated in a mask, it contributes one seed sequence to it. Moreover, a fragment may contribute several seed sequences to different masks. Figure 4.5 shows all possible seed sequences contributed by one fragment. In this example, the length of the seed sequences varies from two to four. With four polymorphic sites covered by the fragment fk ∈ F, the numbers of seed sequences of lengths two, three, and four are (4 choose 2), (4 choose 3), and (4 choose 4), respectively.
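The enumeration of the seed sequences contributed by a single fragment can be sketched with itertools; the 1-based indexing and the toy fragment are illustrative assumptions.

    from itertools import combinations

    def seed_sequences(fragment, min_len=2):
        # All (mask, seed sequence) pairs contributed by one fragment: every subset
        # of its covered polymorphic sites of size >= min_len, together with the
        # alleles the fragment carries at those sites (indices are 1-based).
        covered = [k for k, a in enumerate(fragment, start=1) if a != '-']
        seeds = []
        for size in range(min_len, len(covered) + 1):
            for mask in combinations(covered, size):
                seeds.append((mask, ''.join(fragment[k - 1] for k in mask)))
        return seeds

    frag = "-102-1-"                # covers the polymorphic sites 2, 3, 4, and 6
    seeds = seed_sequences(frag)
    print(len(seeds))               # 11 = C(4,2) + C(4,3) + C(4,4)
    print(seeds[0])                 # ((2, 3), '10')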

Figure 4.4: An illustration of masks and their seed sequences. (A) shows the mask msk1, its seed sequence ('121'), and the fragment containing this seed sequence (f1). (B) depicts a mask (msk2) and the three fragments covering the mask. Since the seed sequences of f2 and f4 are identical, both contribute the same seed sequence to the mask msk2; therefore, the set of seed sequences T^(20,22,24) contains t1^(20,22,24) and t2^(20,22,24). (C) shows a mask msk3 with no seed sequence.

Supporting fragments of a seed sequence (F_{t^{msk_i}}): F_{t^{msk_i}} is the set of fragments which contribute to a seed sequence t^{msk_i}. Therefore:

    F_{t^{msk_i}} = {fk | fk[msk_i[j]] = a[j]}        (4.9)

where t^{msk_i} is a seed sequence, fk ∈ F, and a[j] is a sequence variant. Eq. 4.9 indicates that the variants of the seed sequence and of the fragment at the corresponding indices are identical.

Set of all masks (MSK^all): MSK^all is the collection of all masks of length two or more:

    MSK^all = {msk_i | msk_i ⊂ {1, 2, ..., n}, 2 ≤ |msk_i|}        (4.10)

where n is the number of polymorphic sites. Potentially, there are Σ_{k=2}^{n} (n choose k) = 2^n − n − 1 such masks for n polymorphic sites.

4.3 Method

The AoM problem, a critical challenge in polyploid haplotyping, was defined and explained in Section 2.5.1. To overcome this problem, we need to consider the information of neighboring reads.

Figure 4.5: All possible seed sequences contributed by one fragment. The depicted fragment covers four polymorphic sites. The seed sequences of lengths two, three, and four are shown in panel B. As shown in panel A, the numbers of seeds of lengths two, three, and four are six, four, and one, respectively.

Hence, we defined masks and their seed sequences in Section 4.2. Figure 4.6 depicts a schematic view of how one mask is phased into the haplotypes according to the supporting fragments of its seed sequences. The figure is drawn in nucleotide space to make it more understandable, and only the polymorphic sites are illustrated (the interpolymorphic regions are not shown). The colors in the reference indicate that the genome can be tiled into subregions, each of which is similar to one of the haplotypes. One seed sequence represents part of a haplotype; therefore, the seed sequences are illustrated in different colors.

A mask and its seed sequences can be considered as a window on the reference sequence and the mapped reads. Ranbow (Algorithm 5) finds all masks for which at least one seed sequence is available (Section 4.3.1, Figure 4.6-a,b). Then, it extends the seed sequences with the aid of supporting and neighboring fragments (Section 4.3.2, Figure 4.6-c,d). Adjacent masks may overlap. Ranbow utilizes a graph data structure with haplotype segments as nodes and overlaps of haplotype segments as edges; Section 4.3.3 explains how the overlapping masks are merged. In order to phase regions with fewer than P unique haplotypes, e.g. where two of the chromosome sequences are similar in a region, some restrictions are applied to the mask phasing, which are explained in Section 4.3.4.

Figure 4.6: Schematic view of a mask phasing for a hexaploid genome. a) Illustrates a reference genome and possible masks (arrows). The colors in the reference indicate that the genome can be tiled into subregions each of which is similar to one of the haplotypes. b) The arrows are shaded according to the masks’ read support. The one with the highest support (shown in black) is selected and phased. c) The mask is phased into its six seed sequences. d) The purple seed sequence and its supporting reads are shown. After error correction the reads are merged into one assembled haplotype.

4.3.1 Mask and seed sequence finding

Each seed sequence is a subsequence of one aligned fragment, and each fragment is sampled from one chromosome; thus, one seed sequence represents one haplotype, and P unique seed sequences of a mask represent P unique haplotype segments. For instance, in a hexaploid genome, a mask with six seed sequences is phased into its six segments. The number of seed sequences in a mask can be smaller than P either due to the similarity of alleles on different homologous chromosomes or because of errors from upstream analyses such as the base calling, genome assembly, mapping, and variant calling steps. By obtaining all of the masks (msk_i ∈ MSK^{all}) and their sets of seed sequences (T^{msk_i} for a mask msk_i) of size P or more (|T^{msk_i}| ≥ P), we phase the genome into Haplotype Blocks, shortly blocks or b^{msk_i}, of P haplotypes. Considering |T^{msk_i}| as the cardinality of the set T^{msk_i}, i.e. the number of seed sequences in T^{msk_i}, there are three conditions for different values of |T^{msk_i}|:

i |T^{msk_i}| < P indicates that there is not enough information for phasing the region into P haplotypes. This situation may happen either because some haplotype segments are equal, or due to a lack of sampled sequence reads, i.e. low coverage, or errors in the data. The algorithm for dealing with these regions is explained in Section 4.3.4.

ii |T^{msk_i}| = P implies that each seed sequence represents one haplotype segment. The mask is phased into its haplotypes.

iii |T^{msk_i}| > P indicates errors in the mask. These masks are phased into their seed sequences after applying an error correction step: P seed sequences are kept after error correction, so that the second condition is met. The erroneous seed sequences are detected according to their fragment support; the smaller the number of supporting fragments, the higher the probability of the seed sequence being erroneous. In this case we delete the seed sequences deemed erroneous and keep only P seed sequences. We call the error-corrected set Θ(T^{msk_i}), which is defined as:

Θ(T^{msk_i}) = T^{msk_i} if |T^{msk_i}| = P;  T^{msk_i}_corrected if |T^{msk_i}| > P;  ∅ if |T^{msk_i}| < P     (4.11)

where msk_i is a mask, T^{msk_i} is the set of its seed sequences, and T^{msk_i}_corrected is the corrected set of seed sequences of msk_i when |T^{msk_i}| > P. When |T^{msk_i}| > P, the seed sequences with less read support are considered as errors and are removed, so that |T^{msk_i}_corrected| = P. Figure 4.7 illustrates the detection and correction of an erroneous seed sequence for a triploid genome. Four seed sequences are detected for the mask (7, 9), so T^{(7,9)} = {'20', '11', '10', '22'}, supported by three, two, two, and one fragments, respectively; hence |F^{(7,9)}| = {|F_{'20'}|, |F_{'11'}|, |F_{'10'}|, |F_{'22'}|} = {3, 2, 2, 1}.

Since we know as prior information that the genome is triploid, and F_{'22'} has the fewest supporting fragments, '22' is considered an error and is removed from T^{(7,9)}. Now |T^{(7,9)}| = P and condition (ii) is met.
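The error correction Θ of Eq. 4.11 amounts to keeping the P best-supported seed sequences. The following Python fragment is a minimal sketch of this rule (it is an illustration, not the Ranbow implementation; the input dictionary mapping seed sequences to their support counts is hypothetical):

def correct_seed_sequences(seed_support, ploidy):
    """Sketch of Theta (Eq. 4.11): keep the ploidy best-supported seed sequences."""
    if len(seed_support) < ploidy:        # condition (i): not enough information
        return {}
    if len(seed_support) == ploidy:       # condition (ii): already exactly P seed sequences
        return dict(seed_support)
    # condition (iii): drop the least-supported seed sequences as errors
    ranked = sorted(seed_support.items(), key=lambda kv: kv[1], reverse=True)
    return dict(ranked[:ploidy])

# The triploid example of Figure 4.7: seed sequence '22' is removed.
print(correct_seed_sequences({'20': 3, '11': 2, '10': 2, '22': 1}, ploidy=3))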

Ranbow finds all of the masks and their seed sequences and checks whether condition (ii) or (iii) holds. The algorithm proceeds with the following steps: mask finding and mask ranking.

4.3.1.1 Listing masks

Given a fragment list F as input, the goal is to obtain all seed sequences and masks with the following properties:

Figure 4.7: An illustration of error detection when |T^{msk}| > P for a triploid genome. Reads with the same seed sequence are depicted in the same color. Seed sequence '22' is the erroneous seed sequence because it is supported by one fragment, while the other seed sequences are supported by at least two fragments.

i |msk| ∈ {2, 3, ..., k}

ii |T^{msk_i}| ≥ P

Finding all available masks is not trivial. There are potentially 2^n − n − 1 possible masks, where n is the number of polymorphic sites, but not all of these regions are supported by fragments. To obtain all available masks and their seed sequences, we used a data structure of two nested hash tables. The keys of the outer hash table, HR (hash table of masks), are masks and the values are hash tables. The inner hash table keeps seed sequences as keys and a list of supporting fragments as the value. For instance, |HR^{(0,2)}_{'00'}| = 7 means there are 7 fragments supporting seed sequence '00' at loci 0 and 2.

To fill this data structure, Ranbow iterates over the fragment list. Each fragment contributes a number of seed sequences to its corresponding masks: a fragment with k non-missing alleles contributes 2^k − k − 1 seed sequences to different masks.

For example, the fragment f_i = 00-1 contributes one support to each of the following seed sequences and masks:

t^{(0,1)} = 00,  t^{(0,3)} = 01,  t^{(1,3)} = 01,  t^{(0,1,3)} = 001

Consequently, it contributes one support to:

HR^{(0,1)}_{'00'},  HR^{(0,3)}_{'01'},  HR^{(1,3)}_{'01'},  HR^{(0,1,3)}_{'001'}

Using this data structure, Ranbow fills HR independently of the positions of the f_i's, their lengths, and the scaffold size.

This data structure provides a fast way to find all of the masks, their seed sequences, and the supporting fragments. It yields all available masks (we call this set MSK) and ignores masks with no supporting fragments.
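The construction of HR can be sketched in a few lines of Python. This is only an illustration of the idea, assuming a simplified fragment representation of (start position, allele string) pairs with '-' for missing alleles; the actual Ranbow code may organize this differently:

from itertools import combinations
from collections import defaultdict

def build_mask_table(fragments):
    """Fill HR: outer key = mask (tuple of loci), inner key = seed sequence,
    value = list of indices of supporting fragments."""
    HR = defaultdict(lambda: defaultdict(list))
    for idx, (start, alleles) in enumerate(fragments):
        # loci actually covered by this fragment (non-missing alleles)
        covered = [(start + offset, a) for offset, a in enumerate(alleles) if a != '-']
        # every subset of two or more covered loci defines one mask / seed sequence
        for size in range(2, len(covered) + 1):
            for subset in combinations(covered, size):
                mask = tuple(pos for pos, _ in subset)
                seed = ''.join(a for _, a in subset)
                HR[mask][seed].append(idx)
    return HR

# The fragment '00-1' starting at locus 0 contributes to masks (0,1), (0,3), (1,3), (0,1,3).
HR = build_mask_table([(0, '00-1')])
for mask, seeds in sorted(HR.items()):
    print(mask, dict(seeds))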

Two masks, msk_i and msk_j, may overlap at one or more loci. Once msk_i is phased, there are two possibilities for phasing msk_j.

i msk_j ⊂ msk_i: msk_j is already phased by msk_i; thus, there is no need to phase the region again.

ii msk_j ⊄ msk_i: msk_j needs to be phased.

Condition i indicates that the order of phasing masks is important and influences the results. This raises the question of which mask has higher priority for being phased.

4.3.1.2 Mask ranking

Given MSK, T^{msk}, and F_{t^{msk}}, Ranbow ranks the masks according to the least supported seed sequence of each mask. The fitness of each msk_i is defined as:

ψ(msk_i) = min_{t^{msk_i} ∈ Θ(T^{msk_i})} |F_{t^{msk_i}}|     (4.12)

where Θ is defined in Eq. 4.11 as the corrected set for T^{msk_i}, and F_{t^{msk_i}} is the set of supporting fragments of the seed sequence t^{msk_i}.

This defines a fitness function ψ(msk_i), which computes the fitness according to the read support of the P-th highest supported seed sequence: each seed sequence has a number of supporting fragments; we sort these counts in descending order, and the P-th number in the sorted list is the fitness of the mask. The rationale is that masks with more support for their weakest seed sequence are more reliable and should be phased first. For instance, for the situation depicted in Figure 4.7, msk_i = (7, 9) and ψ((7, 9)) = 2, because the least supported seed sequences (colored in green and yellow) are each supported by two fragments. The red fragment is an error because the genome is assumed to be triploid.

An error in a supporting fragment may occur either at mask positions or at non-mask positions.

For instance, consider f_i = 12111-3 as a fragment, msk = (1, 2, 4) as one mask, and t^{(1,2,4)} = 121 as one of its seed sequences. An error may occur at positions 1, 2, or 4, i.e. an error inside the mask, or at positions 3, 5, or 7, i.e. an error outside the mask. Note that the error correction function Θ only corrects errors at mask positions. The variants at non-mask indices can be corrected via the merge function defined in Eq. 4.6 (see details in Section 4.3.2).

The masks are ranked based on the fitness defined in Eq. 4.12: the higher the score, the higher the priority of a mask for being phased. We use the notation MSK^{ranked} for the list of masks sorted in this way.
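Using the same nested-table representation as in the sketch above, the fitness ψ of Eq. 4.12 and the resulting ranking could look like the following (again only an illustration under the stated assumptions):

def mask_fitness(seed_support, ploidy):
    """psi (Eq. 4.12): read support of the P-th best supported seed sequence
    after error correction; returns 0 if fewer than P seed sequences exist."""
    counts = sorted(seed_support.values(), reverse=True)
    if len(counts) < ploidy:
        return 0
    return counts[ploidy - 1]

def rank_masks(HR, ploidy):
    """Sort masks by decreasing fitness (best candidates for phasing come first)."""
    fitness = {mask: mask_fitness({s: len(f) for s, f in seeds.items()}, ploidy)
               for mask, seeds in HR.items()}
    return sorted((m for m in fitness if fitness[m] > 0),
                  key=lambda m: fitness[m], reverse=True)

# For the mask (7, 9) of Figure 4.7 with supports {3, 2, 2, 1} and P = 3,
# the fitness is 2 (the third highest support).
print(mask_fitness({'20': 3, '11': 2, '10': 2, '22': 1}, ploidy=3))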

4.3.2 Phasing a mask and its extension

Having constructed a list of ranked masks, we now explain how a mask is phased and how the segments of a phased mask are elongated.

4.3.2.1 Phasing a mask with P seed sequences

Given a mask msk_i and its P seed sequences (t_j^{msk_i} ∈ T^{msk_i}), each of which is supported by a set of fragments F_{t_j^{msk_i}}, Ranbow phases msk_i into its P haplotype segments (defined in Section 4.2) and collects them in a set called a Haplotype Block, b^{msk_i}:

b^{msk_i} = { s_j^{msk_i} | s_j^{msk_i} = merge(F_{t_j^{msk_i}}) }     (4.13)

where s_j^{msk_i} is a haplotype segment and the merge function is defined in Eq. 4.6.

We define a haplotype block as:

Haplotype Block (b^{msk_i}): A set of haplotype segments (s_j^{msk_i}) with the following properties: 1) it is initiated from one mask (msk_i); 2) the haplotype segments are constructed from the supporting fragments (F_{t_j^{msk_i}}) of the seed sequences of the mask; 3) the haplotype segments are elongated by neighboring fragments.

Here we discuss the elongation of a haplotype block by overlapping fragments. Recall that a matching overlap does not provide enough information to apply the merge function, due to the AoM problem (Section 2.5.1). Here, this problem is addressed by recruiting the information of the other overlapping sequences.

For this purpose, we define the following criteria for a segment si and a fragment fk, which is called a matching overlap:

i)  sim(s_i, f_k) > 0
ii) dim(s_i, f_k) = 0     (4.14)

where the sim and dim functions are defined in Eq. 4.5. Recall that a segment can be a fragment or a set of fragments (Section 4.2, haplotype segment); hence, the matching overlap can be checked between one fragment and one segment or between two segments. Let s_i and s_j be two different segments belonging to b^{msk_i}, and let f_k be an overlapping fragment; we are looking for the conditions under which f_k and s_i can be merged with no ambiguity. First, suppose f_k shares a matching overlap with s_i; then the following cases may happen:

I   f_k shares non-matching overlaps with all of the s_j's; hence, there is only one possibility for f_k to be merged with, and that is s_i. f_k has to have non-matching overlaps with P − 1 segments and a matching overlap with s_i in order to be merged with no ambiguity. This assures that f_k belongs to one and only one haplotype. We call this kind of overlap a unique match between s_i and f_k (see Figure 4.8 for an illustration of a unique match in nucleotide space).

II  f_k shares a matching overlap with some s_j as well; then there is no signal to decide whether f_k belongs to s_i or to s_j.

III The same applies if f_k does not share an overlap with one segment s_j. This means there is no signal to distinguish whether the overlap between s_j and f_k is matching or not; therefore merging would be ambiguous.

Secondly, f_k may have P mismatching overlaps, which provides the same condition for a unique match once the errors in one of the overlaps are identified. We define the selection scoring function Φ in order to determine which overlap is most likely to be the one f_k can be merged with:

s_selected = argmax_{s_i ∈ b^{msk}} Φ(s_i, f_k)
Φ(s_i, f_k) = sim(s_i, f_k) − dim(s_i, f_k)^2     (4.15)

where s_selected is the segment selected to be merged with f_k, and Φ(s_i, f_k) is the scoring function evaluating which segment s_i is most likely to have originated from the haplotype which f_k belongs to and therefore can be merged with f_k. Φ is designed to penalize erroneous haplotypes and to select the segment with higher similarity and lower dissimilarity to f_k. This condition is also called a unique match in the following steps of the algorithm.

Figure 4.8: Uniquely matched overlap between a fragment and a set of haplotypes. The only matching overlap between the fragment and the block is "A". Since there is just one matching overlap and five mismatching overlaps, the new fragment can be merged with the yellow haplotype segment.
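A rough Python sketch of the unique-match test and of the scoring function Φ of Eq. 4.15 is given below. Segments and fragments are represented here as dictionaries mapping a variant position to an allele (positions not covered are simply absent); this representation and the helper names are assumptions made for the illustration and only mimic the sim/dim functions of Eq. 4.5:

def sim(a, b):
    """Number of shared loci with identical alleles."""
    return sum(1 for pos in a if pos in b and a[pos] == b[pos])

def dim(a, b):
    """Number of shared loci with different alleles."""
    return sum(1 for pos in a if pos in b and a[pos] != b[pos])

def phi(segment, fragment):
    """Scoring function of Eq. 4.15: reward similarity, penalize dissimilarity."""
    return sim(segment, fragment) - dim(segment, fragment) ** 2

def unique_match(block, fragment):
    """Return the single segment the fragment matches (sim > 0, dim = 0) while
    mismatching all other segments of the block, or None if ambiguous."""
    matches = [s for s in block if sim(s, fragment) > 0 and dim(s, fragment) == 0]
    others_mismatch = all(dim(s, fragment) > 0 for s in block
                          if matches and s is not matches[0])
    if len(matches) == 1 and others_mismatch:
        return matches[0]
    return None

# Toy block of two segments and a fragment that matches only the first one.
block = [{0: '0', 1: '1', 2: '0'}, {0: '1', 1: '0', 2: '2'}]
fragment = {1: '1', 2: '0', 3: '1'}
print(unique_match(block, fragment) is block[0])   # True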

4.3.2.2 Block extension

Given the haplotype segments of a block, s_i ∈ b^{msk_t}, and the fragment list F, the s_i's can be extended using neighboring fragments. We define the set of overlapping fragments of a block (F′) as follows:

F′ = { f′_j | f′_j ∈ F, ∀ s_i ∈ b^{msk_t} : sim(f′_j, s_i) + dim(f′_j, s_i) > 0 }     (4.16)

where the similarity and dissimilarity functions were defined in Eq. 4.5. For block extension, Ranbow selects a fragment f′_j and a segment s_i such that:

f′_j, s_i = argmax_{f′_j ∈ F′, s_i ∈ b^{msk_t}} Φ(f′_j, s_i)     (4.17)

where Φ(f′_j, s_i) is defined in Eq. 4.15. Eq. 4.17 indicates that Ranbow selects f′_j and s_i such that their overlap provides the maximum score. Then the set of supporting fragments of s_i, i.e. F^{s_i}, is updated, and the updated F^{s_i} results in a new s_i:

F^{s_i} ← F^{s_i} ∪ {f′_j}
s_i = merge(F^{s_i})     (4.18)

Now, the haplotype block is extended in length so it may overlap with new fragments. These steps are done iteratively until the segments in a haplotype block converge.

In other words, at each iteration Ranbow selects a fragment, if possible, and merges it with one haplotype. After merging, new fragment overlaps are introduced, and the scores between the segments of the block and the overlapping fragments are updated. This leads to the block extension, as shown in Figure 4.9.
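The extension loop itself could then be sketched as below, reusing the sim/dim, phi, and unique_match helpers from the previous sketch; the merge step is reduced here to a naive overlay of alleles and ignores the error-correction details of Eq. 4.6:

def merge_into(segment, fragment):
    """Naive merge: extend the segment with the fragment's alleles."""
    merged = dict(segment)
    merged.update(fragment)
    return merged

def extend_block(block, fragments):
    """Iteratively attach uniquely matching fragments until nothing changes."""
    remaining = list(fragments)
    changed = True
    while changed and remaining:
        changed = False
        # pick the fragment/segment pair with the best phi score among unique matches
        candidates = [(phi(s, f), i, s) for i, f in enumerate(remaining)
                      for s in [unique_match(block, f)] if s is not None]
        if candidates:
            _, i, segment = max(candidates, key=lambda c: c[0])
            block[block.index(segment)] = merge_into(segment, remaining.pop(i))
            changed = True
    return block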

Figure 4.9: The seed extension process. a) A constructed seed and its corresponding haplotypes. b) Seed extension by uniquely matched reads. c) The blocks iteratively get longer and the final block is constructed.

4.3.3 Merging overlapping and connected masks

Block extension is applied to all phased masks. Even if two masks msk_i and msk_j do not share any overlap, their extended blocks, namely b^{msk_i} and b^{msk_j}, might overlap. There might be fragments (from single-end, paired-end, and mate-pair reads) which are partly used in s_i ∈ b^{msk_i} and partly used in s_j ∈ b^{msk_j}. In other words, one fragment f_i can be a member of several segments of different blocks (f_i ∈ F^{s_j} ∩ F^{s_k}, j ≠ k); hence, it provides information to merge the haplotype segments it belongs to. In the next step, we recruit these connections to merge the overlapping segments and thereby elongate the segments. We call this procedure merging overlapping haplotype segments.

4.3.3.1 Overlapping haplotype segments

Given a list of blocks B = {b^{msk_1}, b^{msk_2}, ..., b^{msk_b}} and the supporting fragments of the segments in these blocks, the aim is to use the connections between the segments for the elongation of haplotypes. We model this problem as a weighted k-partite graph G(V, E), in which:

V = { s_x | s_x ∈ b^{msk_t}, 1 ≤ t ≤ b }
E = { e(s_x, s_y) | s_x ∈ b^{msk_i}, s_y ∈ b^{msk_j}, i ≠ j }     (4.19)

where s_x and s_y, the nodes of the graph, are two haplotype segments belonging to b^{msk_i} and b^{msk_j}, respectively, and e(s_x, s_y) is an edge between s_x and s_y. We define a weight w for each edge, calculated as the number of fragments in the intersection of the supporting fragment sets, |F^{s_x} ∩ F^{s_y}|. We draw an edge between two nodes only if F^{s_x} ∩ F^{s_y} ≠ ∅.

Graph G is multipartite: each b^{msk_t} is a partition, so there is no edge connecting nodes inside a partition; the nodes are partitioned according to the haplotype blocks. In the ideal scenario, i.e. with no errors in the input data, each path or cycle in this graph indicates an elongated haplotype segment; we call these desired paths and desired cycles. There may also be paths occurring due to errors in the graph, i.e. conflicting paths. Figure 4.10 depicts a schematic view of graph G.
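Building this k-partite graph amounts to connecting segments of different blocks that share supporting fragments. A minimal sketch, assuming each segment carries a set of supporting fragment identifiers (the names and data layout below are illustrative, not Ranbow's):

from itertools import combinations

def build_segment_graph(blocks):
    """blocks: list of blocks, each block a list of (segment_id, fragment_id_set).
    Returns weighted edges between segments of different blocks that share fragments."""
    edges = {}
    for (bi, block_i), (bj, block_j) in combinations(enumerate(blocks), 2):
        for seg_x, frags_x in block_i:
            for seg_y, frags_y in block_j:
                shared = len(frags_x & frags_y)
                if shared > 0:                      # edges only between partitions
                    edges[(seg_x, seg_y)] = shared  # weight = |F^sx ∩ F^sy|
    return edges

blocks = [[('s0_b1', {1, 2, 3}), ('s1_b1', {4, 5})],
          [('s0_b2', {3, 6}),    ('s1_b2', {5, 7})]]
print(build_segment_graph(blocks))   # {('s0_b1', 's0_b2'): 1, ('s1_b1', 's1_b2'): 1}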

Figure 4.10: Schematic view of graph G: This figure shows a number of fragments and their corresponding edges in graph G. The fragments with missing alleles may be obtained from paired reads. If two reads of a pair are mapped into two haplotype segments of different blocks, one edge is assigned between the nodes.

4.3.3.2 Conflicting paths, desired paths, and desired cycles

Conflicting paths: A conflicting path is defined as a path which connects two nodes of the same partition. Such a path implies that two segments of one block should be merged, which violates the definition of segments in haplotype blocks and also the definition of partitions in a k-partite graph. For instance, Figure 4.11-B depicts three conflicting paths. s_0^{msk_1} → s_0^{msk_2} → s_1^{msk_1} is a conflicting path because it indicates that s_0^{msk_1} and s_1^{msk_1}, which are two distinct segments, essentially originated from one haplotype. This creates a conflict. Conflicting paths occur due to errors in the set of edges (E).

Desired paths: A path which is not conflicting is called a desired path.

Desired cycles: The edges of a cycle support each other; this gives a better clue for the connectivity of haplotypes. For instance, for a cycle of length three, s_0^{msk_1} → s_1^{msk_2} → s_0^{msk_3} → s_0^{msk_1} (shown in Figure 4.11-A, blue), every path between a number of nodes, e.g. s_0^{msk_1} → s_1^{msk_2} → s_0^{msk_3}, is supported by another path, e.g. s_0^{msk_3} → s_0^{msk_1}; thus, we base the elongation step primarily on these cycles.

Figure 4.11: Desired cycles, desired paths, and conflicting paths. A and B illustrate two desired cycles (blue), one desired path (green), and three conflicting paths (red). The conflicting paths are those containing two nodes of the same partition. In desired cycles, the connectivity between two nodes is not only supported by a direct edge but also through the path over the other nodes of the cycle.

4.3.3.3 Merging the haplotype segments in non-conflicting cycles

In order to find desired cycles and paths, the graph first has to be constructed, which is a costly task for the following reasons:

i Finding all non-conflicting cycles in the graph is of high time complexity

ii It is costly to find all edges in G because all pairs of segments need to be checked to see if they share fragments or not.

Moreover, the probability of having an edge between distant haplotype segments is low. Therefore, we designed a greedy algorithm which avoids constructing the whole graph. It first constructs the graph partially and finds cycles of length three, called here triangles. For this purpose, Ranbow sorts the haplotype blocks (B = {b^{msk_1}, b^{msk_2}, ..., b^{msk_b}}) based on their starting positions and constructs a new list of blocks called B^{sorted}. Recall that a haplotype block, which is defined in Section 4.3.1, is a collection of haplotype segments obtained from a mask. The starting position of a block is the smallest starting position of its segments. Let l be the length of a sliding window covering a number of neighboring blocks in B^{sorted}. In Figure 4.13, the red rectangles are sliding windows of size four, l = 4. Then, Ranbow searches for triangles and desired paths in each sliding window. Considering a sliding window covering l blocks, where b^{msk_i} is the first block and the b^{msk_j}'s are the other blocks in the same window, Ranbow takes b^{msk_i} and finds all of the edges starting from s_x ∈ b^{msk_i} and connecting to s_y ∈ b^{msk_j}. Let E′ be the set of edges which connect s_x to s_y (Figure 4.13, solid gray edges):

E′ = { e(s_x, s_y, w_{(s_x,s_y)}) | w_{(s_x,s_y)} = |F^{s_x} ∩ F^{s_y}|, w_{(s_x,s_y)} > 0 }     (4.20)

where ∀ s_x ∈ b^{msk_i}, s_y ∈ b^{msk_j} with (s_x, s_y, w) ∈ E′ ⇒ i ≠ j, and w_{(s_x,s_y)} is the number of fragments shared between s_x and s_y (|F^{s_x} ∩ F^{s_y}|).

Ranbow searches for a pair of edges that start from one node, s_x, and end at two nodes, s_y and s_z, where s_y and s_z are not in the same block. Figure 4.12 depicts this situation. Then, Ranbow checks whether there is an edge between s_y and s_z.

If this edge exists, s_x, s_y, and s_z form a triangle. Ranbow seeks the triangle with maximum fragment support:

s_x, s_y, s_z = argmax [ w_{(s_x,s_y)} + w_{(s_x,s_z)} + w_{(s_y,s_z)} ]     (4.21)
where s_x, s_y, s_z ∈ V, and s_x ∈ b^{msk_i}, s_y ∈ b^{msk_j}, s_z ∈ b^{msk_k}

Figure 4.12: Searching for triangles. This figure depicts a triangle starting at s_x and ending at s_y and s_z. Ranbow checks if there is an edge between s_y and s_z (dotted line) to form a triangle.

Having found the triangle with maximum support, Ranbow merges s_x, s_y, and s_z (Figure 4.13, solid blue lines), constructs a new haplotype segment, and replaces s_x with the updated haplotype (s_x^{new}) such that:

F_new^{s_x} = F^{s_x} ∪ F^{s_y} ∪ F^{s_z}
s_x^{new} = merge(F_new^{s_x})     (4.22)
s_x ← s_x^{new}

where F^{s_x}, F^{s_y}, and F^{s_z} are the sets of supporting fragments of s_x, s_y, and s_z, respectively. This step results in an updated b^{msk_i} and the removal of s_y and s_z; however, Ranbow keeps track of which nodes are merged and which are removed. In the example above, Ranbow records that s_x, s_y, and s_z are merged into s_x (s_y and s_z are removed), and that s_y and s_z belonged to b^{msk_j} and b^{msk_k}, respectively. This information is used in later iterations to identify conflicting cycles when s_x is a candidate for merging with another segment. After this step, the algorithm goes to the next iteration, updates E′, and applies the same steps until no triangle can be found in E′. Once the elongation is done for b^{msk_i}, the same procedure is applied to b^{msk_{i+1}} with the window shifted one block to the right. The algorithm stops when the sliding window has reached and processed the last group of blocks and all of the triangles have been considered. Figure 4.13 presents a simple example with eight haplotype blocks and explains how the triangles are gradually formed in a sliding window of length four.
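The triangle search of Eq. 4.21 can be sketched over the edge dictionary produced by the previous sketch. The function below is a simplified illustration (block membership is passed in explicitly so that the three corners of a triangle are guaranteed to lie in three different blocks); it is not the Ranbow implementation and ignores the sliding window:

def best_triangle(edges, block_of):
    """Return the triangle (sx, sy, sz) with maximum total weight (Eq. 4.21),
    where the three segments lie in three different blocks, or None."""
    def weight(a, b):
        return edges.get((a, b)) or edges.get((b, a)) or 0

    best, best_score = None, 0
    nodes = {n for edge in edges for n in edge}
    for sx in nodes:
        neighbours = [n for n in nodes if n != sx and weight(sx, n) > 0]
        for i, sy in enumerate(neighbours):
            for sz in neighbours[i + 1:]:
                distinct = len({block_of[sx], block_of[sy], block_of[sz]}) == 3
                if distinct and weight(sy, sz) > 0:
                    score = weight(sx, sy) + weight(sx, sz) + weight(sy, sz)
                    if score > best_score:
                        best, best_score = (sx, sy, sz), score
    return best

edges = {('a', 'b'): 2, ('a', 'c'): 1, ('b', 'c'): 3}
block_of = {'a': 0, 'b': 1, 'c': 2}
print(best_triangle(edges, block_of))   # a triangle over 'a', 'b', 'c' (total weight 6)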

4.3.3.4 Merging through direct connections between haplotype segments

The next step is elongation through direct connections. Ranbow sorts all edges by weight in descending order and iterates over the sorted list. At each iteration, it selects one edge and checks whether merging the two nodes of the edge would result in a conflict. If there is no conflict, the nodes are merged and the graph is updated.

Therefore, given the edge e(s_x, s_y, w_{(s_x,s_y)}) ∈ E with maximum w_{(s_x,s_y)}, Ranbow merges the nodes s_x and s_y such that:

F_new^{s_x} = F^{s_x} ∪ F^{s_y}
s_x^{new} = merge(F_new^{s_x})     (4.23)
s_x ← s_x^{new}

and s_y is removed from V.

Figure 4.13: Connecting haplotype segments of different blocks in an iterative manner. Edges are the connections obtained via reads. Red rectangles are sliding windows of length four. Solid gray lines are the edges beginning from the leftmost block of a sliding window, the dotted lines are candidate edges, and the blue edges are the newly formed haplotypes. The candidate edges at each step are checked to see whether they form a triangle. The most supported triangle at each step is converted into one haplotype segment.

Note that, in order to find conflicts, the algorithm checks whether a segment consists of several segments merged in former iterations or of a single segment. To make this clearer, let's follow the example illustrated in Figure 4.14. Assume s_x = merge(s_x, s_y) and s_p = merge(s_p, s_q) were carried out in previous iterations, and assume s_x, s_y, s_p, and s_q belonged to b^{msk_i}, b^{msk_j}, b^{msk_j}, and b^{msk_k}, respectively. Now consider s_x and s_p as candidates for being merged in the current iteration. Ranbow identifies the origins of s_x and s_p to check that merging these two segments would not cause a conflict. Since s_y and s_p belong to one block, i.e. b^{msk_j}, Ranbow does not merge s_x and s_p and considers this condition a conflict.

Now assume that s_x and s_y can be merged with no conflict. To update E and to delete s_y, we only use s_y and its connecting nodes (such as s_z), considering the following conditions:

Figure 4.14: Conflicts in merging two segments. This figure shows how a conflict occurs by merging s_x and s_p if s_x consists of s_x and s_y, and s_p consists of s_p and s_q. Merging s_x and s_p would imply merging s_y and s_p, which causes the conflict of merging two segments of one block.

i)  if e(s_y, s_z, w_{(s_y,s_z)}) ∈ E and e(s_x, s_z, w_{(s_x,s_z)}) ∉ E  ⇒  E ← E + e(s_x, s_z, w_{(s_y,s_z)})
ii) if e(s_y, s_z, w_{(s_y,s_z)}) ∈ E and e(s_x, s_z, w_{(s_x,s_z)}) ∈ E  ⇒  w_{(s_x,s_z)} ← w_{(s_x,s_z)} + w_{(s_y,s_z)}     (4.24)

where e(s_y, s_z, w_{(s_y,s_z)}) is an edge between s_y and s_z with weight w_{(s_y,s_z)}, and E is the set of all edges. Eq. 4.24 explains how Ranbow acts, when merging s_y into s_x (explained in Eq. 4.23), if there is an edge between s_y and another node s_z.

Eq. 4.24-i covers the case where there is no edge between s_x and s_z; then, by merging s_y into s_x, the edges between s_y and other segments are transferred to s_x. Eq. 4.24-ii covers the case where there is already an edge between s_x and s_z, so only w_{(s_x,s_z)} has to be updated.

The above steps are applied iteratively to the sorted list of edges. At each step, the edge e(s_i, s_j, w) with maximum weight is selected and its nodes are merged. By merging two nodes and their corresponding edges, E is updated. The algorithm stops when E = ∅ or when the weight of the edges in an iteration drops below a predefined threshold.
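The merging through direct connections, including the edge bookkeeping of Eq. 4.24, could be sketched as follows. This is a simplified, hypothetical illustration: the conflict check is reduced to remembering which original blocks each surviving node already covers, and nodes are plain identifiers:

def merge_by_direct_edges(edges, block_of, min_weight=1):
    """Greedily merge the heaviest edges; skip merges that would join two
    segments originating from the same block (a conflict)."""
    origin = {node: {block_of[node]} for node in block_of}   # blocks covered so far
    merged_into = {}                                         # removed node -> survivor

    def resolve(node):
        while node in merged_into:
            node = merged_into[node]
        return node

    while edges:
        (sx, sy), w = max(edges.items(), key=lambda kv: kv[1])
        del edges[(sx, sy)]
        if w < min_weight:
            break
        sx, sy = resolve(sx), resolve(sy)
        if sx == sy or origin[sx] & origin[sy]:              # conflict: shared origin block
            continue
        origin[sx] |= origin[sy]                             # sx absorbs sy
        merged_into[sy] = sx
        # Eq. 4.24: re-attach sy's remaining edges to sx, summing weights if needed
        for (a, b), wv in list(edges.items()):
            if sy in (a, b):
                other = b if a == sy else a
                del edges[(a, b)]
                if other == sx:
                    continue
                if (sx, other) in edges:
                    edges[(sx, other)] += wv
                elif (other, sx) in edges:
                    edges[(other, sx)] += wv
                else:
                    edges[(sx, other)] = wv
    return merged_into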

4.3.4 Phasing the regions with fewer than P haplotypes

All regions with P distinct sequences have been phased into P haplotypes so far. In a P-ploid genome there is the possibility of having regions with fewer than P haplotypes. This may happen if the heterozygosity of a region is so low that the reads cannot extend far enough to find exactly P unique sequences. Therefore, these regions can be phased into p < P haplotypes. We use masks again to find haplotype blocks with p < P seed sequences. Having MSK^{all} as the collection of masks with at least one fragment supporting each seed sequence, we define the masks with exactly p seed sequences as:

MSK_p = { msk_i | msk_i ∈ MSK^{all}, |T^{msk_i}| = p },  2 ≤ p ≤ P ;  hence     (4.25)
MSK^{all} = MSK_P + MSK_{P−1} + MSK_{P−2} + ... + MSK_2

where T^{msk_i} is the set of seed sequences of the mask msk_i. If a region cannot be phased into P haplotypes, we first check whether it can be phased into p = P − 1 haplotypes; for this purpose Ranbow uses MSK_{P−1}. The algorithm starts from p = P − 1 and goes down to p = 2. For each p, it iterates over the sorted msk_i ∈ MSK_p and checks whether msk_i is already phased. If the mask has not been phased so far, Ranbow phases it into p haplotypes.

The extension part cannot be applied here; the reason is explained using the simplest scenarios, p = P − 1 and p = P − 2. Let's assume

b^{msk_i} = { b_1^{msk_i}, b_2^{msk_i}, ..., b_p^{msk_i} } : msk_i ∈ MSK_{P−1}     (4.26)

as a haplotype block with p segments. The condition shown in Eq. 4.26 indicates that one of the segments, say b_k^{msk_i}, has two copies. Let's assume that b_k^{msk_i} can be identified from the coverage of the segments. Figure 4.15 depicts this condition in a tetraploid genome with two blocks of three segments each, where the weights of the edges are known and shown in Figure 4.15-A. There are two possible scenarios depicted in Figure 4.15-A, and there is not sufficient evidence to infer which of them is the truth. This introduces an ambiguity for merging haplotype segments. The situation becomes more complicated if p < P − 1. For instance, if p = P − 2, we first need to decide whether there are two copies of two haplotype segments or three copies of one of them (Figure 4.15-B). Then, in either scenario, the ambiguity explained above prevents the extension of haplotypes. Therefore, we leave these phased haplotypes without extension in this step.

4.3.5 Summary

In this chapter, we explained a novel method, called Ranbow, for reconstructing the haplotypes of a polyploid genome from sequence reads. We first introduced the mathematical problem definition of polyploid haplotype assembly, and then explained Ranbow in detail. In the next chapter, we evaluate Ranbow on various real and simulated datasets and compare it with available state-of-the-art methods.

Ranbow algorithm
Input : Fragment list F
Output: Assembled haplotype H

MSK^{all}, T_{msk}, F_{t^{msk}} ← Mask finding(F)
MSK_P^{ranked} + MSK_{P−1}^{ranked} + MSK_{P−2}^{ranked} + ... + MSK_2^{ranked} ← Mask ranking(MSK^{all})
B = ∅
for msk_i ∈ MSK_P^{ranked} do
    if msk_i not already phased then
        b^{msk_i} = Phase a mask(msk_i, T^{msk_i})
        b^{msk_i}_{extended} = Block extension(b^{msk_i}, F)
        B = B + b^{msk_i}
    else
        Ignore seed msk_i
    end
end
G(V, E) ← Convert to graph(B)
for sliding window w ∈ all sliding windows do
    for block b^{msk_i} ∈ w do
        for triangle c starting from b^{msk_i} do
            if c is a non-conflicting cycle then
                B ← Merging cycles(c, B)
            else
                Ignore c
            end
        end
    end
end
for e ∈ sorted(E) do
    B ← Merge haplotype segments(e, B)
end
for p = P − 1 downto 2 do
    for msk_i ∈ MSK_p^{ranked} do
        if msk_i not already phased then
            b^{msk_i} = Phase a mask(msk_i, T^{msk_i})
            B = B + b^{msk_i}
        else
            Ignore seed msk_i
        end
    end
end

Algorithm 5: Ranbow algorithm

Figure 4.15: Different possibilities for phasing the regions with fewer than P sequence patterns. In this example P = 4. (A) depicts two regions, each of which is phased into three haplotypes. The colored nodes indicate the nodes with two copies. The weights of the edges show the number of fragments connecting the nodes. The two possibilities of connecting the nodes are shown in (A): the left one assumes that the connection with weight two is an error; the right one indicates that there is no error and the weights are distributed unevenly. (B) illustrates two possible phasings of a region with two unique segments, and it is not known which one is true.

Chapter 5

Evaluation and results

5.1 Introduction

In the Results section, the performance of HapCompass [47], SDhaP [48], H-PoP [49], and Ranbow (Chapter 4) is assessed on real and simulated data. As a real dataset, we used hexaploid sweet potato sequencing data; long Roche 454 reads are recruited in order to evaluate the haplotype assembly of Illumina short reads. We also generated two simulated datasets, with various read insert sizes, inspired by the sweet potato and tetraploid CBU genomes. The performance of all methods is evaluated on these datasets with respect to accuracy, length of assembled haplotypes, and execution time.

5.2 Results

Ranbow is implemented in Python 2.6.8 and is available online1. It takes Fasta (the collection of reference sequences), VCF (the collection of sequence variants), and BAM (the collection of aligned reads) files as input and returns the list of aligned haplotypes in BAM and hap format (the hap format file keeps the haplotypes in coded allele space; see Section 4.2 for the formal definition of coded allele space). Since we aimed to develop a method compatible with the high-throughput sequence reads needed for de novo assembly, distributing the data to multiple cores is also automated. Ranbow accepts the number of available cores as an input parameter and groups the scaffolds such that the estimated maximum running time of the busiest core is minimized. The assignment of scaffolds to cores is not a trivial problem, so we designed and implemented a heuristic method for this purpose.

1 http://www.molgen.mpg.de/ranbow


To evaluate Ranbow, we compared it to the state-of-the-art methods HapCompass, SDhaP, and H-PoP regarding accuracy, length of the assembled haplotypes, and runtime. We used one real and two simulated datasets, each of which contains 50 scaffolds with different lengths. The evaluation contains three parts. We first compare all methods on a smaller dataset due to the very high execution time of some methods on large datasets. Then H-PoP, which performs better than the other available approaches, is comprehensively compared to Ranbow. Finally, the performance of Ranbow on the real sweet potato dataset is compared to the corresponding simulated one.

In order to convert the input data from nucleotide space to coded allele space (explained in Section 2.2), and thereby to decrease time and space complexity, Ranbow extracts a list of fragments from the input sequence reads and variants (in the BAM and VCF files). Depending on the sequencing throughput and the heterozygosity of the genome, the volume of the read file can be reduced roughly a hundredfold; e.g., in the sweet potato genome, it is reduced from 657Gb of mapped reads down to 1.6Gb of aligned fragments. The VCF and the fragment files are indexed to speed up the search process and to balance the scaffold distribution on a processor farm. The scaffolds are assigned to the processors according to their length, aiming to minimize the maximum running time among all cores. Ranbow generates the hap file format, which contains the assembled haplotype segments (in coded allele space) and their properties. The haplotypes in nucleotide space are generated afterwards and converted to sorted and indexed BAM format, which can be loaded into a genome browser and used in downstream analysis.
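The assignment of scaffolds to cores is an instance of multiprocessor scheduling. One standard heuristic for it, shown below purely as an illustration of the idea (the heuristic actually implemented in Ranbow may differ), is to assign the longest remaining scaffold to the currently least loaded core:

import heapq

def assign_scaffolds(scaffold_lengths, n_cores):
    """Longest-processing-time-first heuristic: place each scaffold (longest first)
    on the core with the smallest current load, approximately minimizing the
    maximum per-core running time."""
    cores = [(0, core_id, []) for core_id in range(n_cores)]   # (load, id, scaffolds)
    heapq.heapify(cores)
    for name, length in sorted(scaffold_lengths.items(), key=lambda kv: -kv[1]):
        load, core_id, names = heapq.heappop(cores)
        names.append(name)
        heapq.heappush(cores, (load + length, core_id, names))
    return sorted(cores, key=lambda c: c[1])

lengths = {'scf1': 1000000, 'scf2': 500000, 'scf3': 500000, 'scf4': 100000}
for load, core_id, names in assign_scaffolds(lengths, n_cores=2):
    print(core_id, load, names)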

5.2.1 Dataset properties

The genomes of hexaploid sweet potato (Ipomoea batatas) and tetraploid Capsella bursa-pastoris [50] (shortly CBU) are used for the evaluation of the methods. The characteristics of these genomes, i.e., the SmP interval length distribution, the rates of the different types of variants, and the genotype distributions, are depicted in Figure 5.1. The SmP interval lengths are computed from the distances between polymorphic sites in the VCF files.

Figure 5.1-A shows how close the variants in these two genomes are. Additionally, it indicates the higher level of heterozygosity in the sweet potato genome, since the green line lies above the blue line and the sweet potato trend has a steeper negative slope than the CBU genome's trend.

Figure 5.1: Dataset properties of sweet potato and CBU genomes. A) The interval length distribution of polymorphic sites. Both axes are in log scale. This plot indicates the high number of short intervals and the low number of long intervals. B) Different kinds of variants. This plot shows the frequency of the different types of variants, namely SNPs, non-SNPs, and multiallelic sites. The y-axis is in logarithmic scale. The plot depicts the high number of SNPs in both genomes, while the number of the other types of variants is still considerable. Note that multiallelic variants are a subset of the non-SNP variants; the plot shows what proportion of the non-SNP variants are multiallelic in these genomes. C, D) Genotype distribution for the sweet potato (left) and CBU (right) genomes. The y-axes are in logarithmic scale. On the x-axes, all possible genotype configurations are shown.

Figure 5.1-B indicates the complexity of the variants. Some of the methods for polyploid haplotype assembly are insensitive to non-SNP variants, i.e. multinucleotide variants and indels. The plot shows that SNPs are the most common type of variant; however, the number of non-SNP variants is substantial. Therefore, in this thesis we use all variants smaller than 50 base pairs, which we call SmPs. Moreover, we checked how the genotypes are distributed in these genomes. The alleles within each genotype are expected to be ordered by frequency:

#reference alleles ≥ #1st alternative ≥ #2nd alternative ≥ ...

The genotypes violating this ordering are converted by relabeling the alleles according to their frequency; for instance, the genotype {0, 1, 1, 2} is converted to {1, 0, 0, 2}, since we expect to observe more reference alleles (zeros) than first or second alternative alleles. Figure 5.1-C,D shows how the non-biallelic polymorphic sites are distributed.
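The conversion of violating genotypes can be sketched as a relabeling of alleles by decreasing frequency; the small function below only illustrates the rule and is hypothetical with respect to how the thesis pipeline implements it:

from collections import Counter

def canonicalize_genotype(genotype):
    """Relabel alleles so that the most frequent allele becomes 0, the next 1, ..."""
    counts = Counter(genotype)
    # sort alleles by decreasing count, breaking ties by the original label
    order = sorted(counts, key=lambda allele: (-counts[allele], allele))
    relabel = {old: new for new, old in enumerate(order)}
    return [relabel[a] for a in genotype]

print(canonicalize_genotype([0, 1, 1, 2]))   # [1, 0, 0, 2]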

The mentioned characteristics of the polyploid genomes need to be known for data simulation. To keep all these features in the simulated data, we designed the following pipeline.

For each organism, we selected five scaffold groups of 10kb, 50kb, 100kb, 500kb, and 1000kb length, each of which contains ten scaffolds. The selection process was based on length and coverage so that they represent the majority of scaffolds’ properties (Figure 5.1). The 50 scaffolds of sweet potato are used for both real and simulated evaluation (due to the availability of semi-true haplotypes, i.e., Roche 454 GS FLX+ pyrosequencer reads) and the 50 scaffolds of CBU are used only for generating simulated data owing to lack of ground truth haplotypes.

5.2.2 Real data

There was no gold-standard real dataset for polyploid haplotyping; hence we constructed and introduce the following dataset from sweet potato sequencing data [11] (this dataset is available online; see Appendix B). We used Illumina short reads for haplotype assembly (see Figure 5.4 for the properties of the sequencing insert sizes and coverage) and the Roche 454 sequencing data for evaluation. The ∼1kbp long Roche 454 reads were not used in the de novo assembly pipeline. This dataset is used to evaluate the accuracy of the assembled haplotypes up to the limit of the Roche 454 read length (see the read length distribution in Figure 5.3-Right).

Figure 5.2: Selected scaffolds' properties. In this plot, each dot depicts one scaffold of the sweet potato genome. These scaffolds were obtained after the scaffolding step of de novo assembly. The x-axis shows the log-scale coverage and the y-axis shows the log-scale scaffold length. We randomly selected 50 scaffolds of different sizes, namely 10kb, 50kb, 100kb, 500kb, and 1000kb, ten scaffolds each. These scaffolds are used for the evaluation with real data and for producing the simulated dataset.

The sequencing errors in the 454 reads are concentrated at homopolymers and at the read ends; hence, trimming low-quality base calls improves the quality significantly. We set a high threshold of 99.7% (Phred = 25) for base quality and 99% (Phred = 20) for mapping quality. Considering that we only compare variant positions between the haplotypes and the 454 reads (meaning that a base error has to convert one allele into another to cause an error), it is more likely that a match is due to actual sequence identity rather than to a sequencing error in the 454 read. For instance, the chance of seeing at least one error in an overlap of 70 SmPs is less than 19% (1 − 0.997^70, where 0.997^70 is the probability of having no error in an overlap of length 70), and it drops exponentially as the length of the overlap decreases. Filtering low-quality base pairs and mapped reads gives high-quality ground truth haplotype sequences. For the evaluation, the fragments of the mapped 454 reads are extracted. Then, those assembled haplotype segments which share a region with the 454 fragments are considered for evaluation. Only one assembled haplotype segment out of six is assigned to the corresponding 454 fragment for assessment, based on the similarities and dissimilarities (Eq. 4.5).
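The 19% figure follows directly from the per-base accuracy threshold, assuming independent errors; as a quick illustrative check:

def prob_at_least_one_error(per_base_accuracy, overlap_length):
    """Probability of seeing at least one erroneous allele in an overlap,
    assuming independent errors: 1 - accuracy^length."""
    return 1.0 - per_base_accuracy ** overlap_length

print(round(prob_at_least_one_error(0.997, 70), 3))   # ~0.19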

5.2.3 Simulated data

By using the real scaffolds and their variants, the properties of the real data, such as the interval lengths and the complexity of the polymorphic sites, are conserved. Based on these scaffolds and their randomly shuffled genotypes, P haplotypes are generated. After that, each haplotype is used as a reference for the read simulator tools EAGLE [51] and pIRS [52]. We constructed two datasets, one small and one large. The small dataset was generated to test all four of the mentioned state-of-the-art methods. Figure 5.4 depicts the features of the small simulated dataset and its comparison to the real data characteristics. HapCompass and SDhaP were not able to accomplish the haplotyping of long scaffolds. Therefore, to test H-PoP and Ranbow in more detail, we generated a large dataset consisting of four insert sizes. Moreover, we collected these four datasets in one larger dataset as well (see Figure 5.5). The reads from the different haplotypes are then collected in one BAM file. The BAM file and the corresponding VCF file are used as a simulated dataset.

Figure 5.3: Base quality and length distribution of Roche 454 reads. Left) Each 454 read is divided into 20 equal-sized segments. The box plot shows the base quality distribution, and the red line indicates the high threshold we set for filtering the bases on quality. Right) The Roche 454 read length distribution. The maximum length is 1771bp.

5.3 Evaluation

5.3.1 All methods

In this section, the performance of Ranbow is evaluated and compared to the state-of-the-art methods. All of these methods are applied on real and simulated datasets. Figure 5.6 shows the general performance of the different methods.

Match-mismatch plots: The first and second columns of Figure 5.6 show the results on real and simulated data, respectively. The scatter plots in the first row, which we call "match-mismatch plots", illustrate the accuracy of the methods. One dot in these plots is one assembled haplotype that is compared to the ground truth. The colors indicate the frequency. The x-coordinate (y-coordinate) of a dot indicates how many alleles are assembled correctly (incorrectly). Since the aim of haplotyping is to obtain longer and more accurate sequences, a larger x-coordinate and a smaller y-coordinate are desired. The flatter the slope of a line, the higher the accuracy of the corresponding method.

Figure 5.4: Dataset properties for the sweet potato genome. Left) Insert size distribution. The real data contains five different libraries with different insert sizes, namely 350bp, 550bp, 950bp, 20k, and no size selection. For the simulated data, we generated inserts of 350bp from the selected scaffolds with 30x coverage for each haplotype. Right) Coverage of the selected scaffolds for real and simulated data. The x-axis shows base coverage, and the y-axis depicts frequency. In the real dataset, the base coverage varies in a wide range up to 10k, while the simulated data peaks at 180x. This discrepancy is caused by the presence of repeats in the genome.

Figure 5.6-A presents the match-mismatch plot for sweet potato real data. The x-axis is the number of matches between the alleles in Roche 454 reads and assembled haplotype, while the y-axis depicts mismatches between them. Since the length of Roche 454 reads is limited, the maximum size of the overlap between assembled haplotypes and these reads is limited as well. The lines depict linear fits to the dots of each method. Figure 5.6-B is the match-mismatch plot for simulated data generated from the comparison of the assembled haplotypes and the genomes’ simulated haplotypes. Since in simulated data the six haplotypes are available, there is no limit for the size of overlaps. These plots show that Ranbow performs best with respect to accuracy, as illustrated by the lower slopes of the fitted lines.

Execution times: The lower plots show the execution times for the different scaffold size groups (10 scaffolds per five scaffold length groups, 10×5 scaffolds in total). The y-axes are in logarithmic scale. Ranbow is at least one order of magnitude faster than the other methods for real data in all scaffold size groups (Figure 5.6-C), and it is several times faster for simulated ones (Figure 5.6-D).

Figure 5.5: Insert length distribution for the simulated dataset of the CBU genome. Four 100bp paired-end read libraries with insert sizes of 350bp, 1kbp, 2kbp and 5kbp are generated by EAGLE (Enhanced Artificial Genome Engine) [51]. EAGLE generates reads and converts them to alignments; it is designed to simulate the behavior of Illumina's Next Generation Sequencing instruments. For each library, the coverage for every haplotype is 40x.

Accuracy and haplotype length: To investigate the performance of all methods in more detail, we split the haplotypes based on their lengths into five and four categories for simulated and real data, respectively: very short, short, medium, long, and very long. The real data does not contain the very long category since the evaluation of real data is done with Roche 454 reads, which are limited in length. Here, we compare the accuracy distribution of the haplotypes in the different groups. Figure 5.7 shows whether there is any dependency between assembled haplotype length and accuracy, and how Ranbow performs in the different haplotype length categories. Recall that the accuracy is defined as:

accuracy = #matches / (#matches + #mismatches)     (5.1)

The results on the real and the small simulated datasets are shown as box plots in Figure 5.7. These plots indicate that, independent of the size of the assembled haplotype, Ranbow outperforms the other methods in terms of accuracy.
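For completeness, the accuracy of Eq. 5.1 for one assembled haplotype compared against a ground-truth fragment can be computed as below (a toy illustration assuming both sequences are given over the same variant positions, with None for uncalled alleles):

def haplotype_accuracy(assembled, truth):
    """Accuracy (Eq. 5.1) over positions called in both sequences."""
    matches = mismatches = 0
    for a, t in zip(assembled, truth):
        if a is None or t is None:
            continue
        if a == t:
            matches += 1
        else:
            mismatches += 1
    return matches / (matches + mismatches) if matches + mismatches else 0.0

print(haplotype_accuracy([0, 1, None, 2, 0], [0, 1, 1, 2, 1]))   # 0.75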

Haplotype length: Although the box plots in Figure 5.7 show the length distribution in each category as well as the accuracy, to compare the haplotype lengths more precisely, we depict the histogram of assembled haplotype lengths in real data (Figure 5.8). The y-axis shows the count on a logarithmic scale. This plot shows that Ranbow and H-PoP outperform the other methods in terms of assembled haplotype length.

Figure 5.6: Comparison of all methods on sweet potato real and simulated datasets. A, B) Match-mismatch plots for the sweet potato real and simulated datasets. C, D) The execution times for the different scaffold size groups.

Figure 5.7: Comparison of the methods regarding accuracy on real and simulated sweet potato data. Left) The assembled haplotypes of simulated data are grouped into five groups according to their size. Right) Since the size of real ground truth data is limited, the real data is grouped into four categories.

5.3.2 H-PoP vs Ranbow

To investigate the performance of H-PoP and Ranbow in more detail, we produced four simulated libraries, namely 350bp, 1kbp, 2kbp, and 5kbp for each organism (see the details of insert size for CBU genome in Figure 5.5). Moreover, all of the reads of these four libraries are collected in one extra library, which is depicted as ‘All’ in the plots.

Match-mismatch plot: Figure 5.9 shows the match-mismatch plots for H-PoP and Ranbow for different insert sizes. These plots indicate that Ranbow is more accurate and assembles longer haplotypes; the dots are closer to the x-axis. Moreover, its performance increases when different sequencing libraries are integrated, which is usually the case when sequencing new organisms. Comparing Figure 5.9-All-Ranbow and Figure 5.9-All-H-PoP shows how well Ranbow uses the four insert sizes to produce longer haplotypes with fewer mismatches. These plots also illustrate the importance of using different insert sizes for haplotype assembly.

The same principle applies to the sweet potato data: Figure 5.10 indicates the better performance of Ranbow in terms of length, number of assembled haplotypes, and their accuracy. Ranbow shows higher accuracy than H-PoP, i.e. the dots are closer to the x-axis for all insert sizes. However, the assembled haplotypes are not as good as the ones assembled from the CBU genome, owing to the differences in heterozygosity and ploidy: the higher heterozygosity provides more allele connections and therefore longer haplotypes, while the greater ploidy may introduce more switch errors, which results in more errors, as shown in Figure 5.10.

Figure 5.8: Comparison of all methods in terms of assembled haplotype length on sweet potato real data. This plot shows that Ranbow and H-PoP perform better in terms of assembled length.

Accuracy: We proceed to show that the comparison of the methods does not change when looking at particular scaffold lengths. For this purpose, we categorized the performances by scaffold length in Figure 5.11, which is generated from the CBU data. The x-axis shows five different categories of scaffold length, the y-axis is the average accuracy measure, and each dot is the average accuracy of the 10 scaffolds phased from the same insert size. The plot shows that Ranbow outperforms H-PoP in all scaffold length categories; the red dots are always located above the green ones, with the exception of the annotated green dot, which is the accuracy of H-PoP on the 350bp insert size (the corresponding red dot for the same insert size obtained with Ranbow is pointed out as well).

Haplotype length: We compared Ranbow and H-PoP in terms of the number of connected alleles which are correctly assembled into haplotypes; in other words, the mismatches are ignored in this comparison. Figure 5.12 shows this comparison. Each pair of violin plots is obtained from one insert size. The y-axis depicts the number of matches in the assembled haplotypes. The width of a violin plot shows the histogram of the number of assembled haplotypes within the group. Except for the 350bp group, in which H-PoP performs better, Ranbow outperforms H-PoP in all groups, including the collection of all insert sizes, which is depicted as 'All'.

Figure 5.9: Comparison of H-PoP and Ranbow on different insert sizes for the simulated dataset of the CBU genome. The match-mismatch plots show the effect of different insert sizes on the performance of the methods. The colors indicate the frequency. Ranbow performs best in almost all conditions; the dots are closer to the x-axis, and the number of matches is larger in the right plots. Moreover, the combination of all insert sizes improves the result dramatically (lower plots). A good number of assembled haplotypes contain a very low number of mismatches in the lower right plot.

Figure 5.10: Effect of different insert sizes on the performance of H-PoP and Ranbow on sweet potato simulated data. This figure shows that Ranbow performs better not only on the various insert sizes, but also when all libraries are merged into one dataset. The colors indicate the frequency.

Figure 5.11: Comparison of H-PoP and Ranbow regarding accuracy on the CBU simulated dataset. Ranbow results in more accurate haplotypes in all groups with various scaffold lengths.

Figure 5.12: Matching length and frequency distribution of assembled haplotypes by different insert sizes on simulated sweet potato data. In this plot the mismatches are removed; hence it shows the number of correctly connected alleles.

Error correction and reconstruction rates: Another measure for comparing methods is the rate of error correction. This measure is derived from the Minimum Error Correction model (explained in Section 3.2) and is calculated from the assignment of reads to the haplotypes: the number of allele corrections needed to make the reads compatible with the haplotypes they are assigned to. The error correction rate can be used for comparing methods when no ground truth haplotype is available. Since we utilized Roche 454 reads for the real data and the reference haplotypes for the simulated data as the ground truth for evaluation, the accuracy can be calculated directly and there is no need for extra measures. Nevertheless, we calculated the error correction measure for our haplotype assembly as well. It is worth noting that although Ranbow is not designed to optimize this measure, it produces a very low error correction rate. Figure 5.13, which resembles a match-mismatch plot, shows the distribution and the average rate of corrected alleles in the assembled haplotypes. In other words, this figure shows how many inconsistencies are found between the haplotypes and the reads they are assembled from. The x-axis shows the length of the haplotypes, while the y-axis indicates the number of inconsistencies between a haplotype and its reads. The closer the dots to the x-axis, the lower the error correction rate of the corresponding scaffold. The reconstruction rate is calculated as 1 − error correction rate; it indicates how many correct alleles from the reads were involved in assembling the haplotypes. The average reconstruction rates are shown in the titles of the plots.

The reconstruction rates of Ranbow on real and simulated data are 0.9997 and 0.9996, respectively. However, Ranbow does not involve all of the reads in the assembly. It filters out the reads which may not belong to the region and assembles the haplotypes from a set of highly confident reads. Since not all of the generated reads are considered when calculating the error correction rate, our approach does not follow the MEC model entirely. Thus, we skip the comparison of Ranbow and the other methods based on the MEC model and only report the rate of error correction among the fragments selected by Ranbow. Recall that the MEC model (Minimum Error Correction model) tries to cluster reads and correct errors in them such that all reads in each cluster are compatible with each other (Section 3.2).

Figure 5.13: Scatter plots of corrected alleles in assembled haplotypes. Each haplotype is assembled from a collection of fragments that may contain a few disagreements with each other. The average reconstruction rate, i.e. 1 − error correction rate, for both the real and simulated datasets is more than 99.9%.

5.3.3 Comparison of the results on real and simulated datasets

To investigate the performance of Ranbow on real and simulated data, we compare them in the following paragraphs in terms of the number of haplotypes in a phased region (Number of Inferred Haplotypes, Figure 5.14) and the rate of connectivity at multiallelic sites (Figure 5.15).

Number of Inferred Haplotypes: The Number of Inferred Haplotypes is the number of haplotypes in a phased region. This number can be P, the ploidy of the organism, or fewer than P. The latter case can be due to the characteristics of the genome in the region or owing to failures in sequencing (see Section 4.3.4 for more details on how Ranbow deals with these regions). We investigate the Number of Inferred Haplotypes for real and simulated data. It is worth noting again that the underlying scaffolds and variants for both the real and simulated reads are the same. Moreover, it is infeasible to check whether the Number of Inferred Haplotypes is correctly called, since there are no clear boundaries for the regions with different Numbers of Inferred Haplotypes. On the one hand, taking the whole scaffold as one region yields a Number of Inferred Haplotypes equal to P. On the other hand, by putting the boundaries between the polymorphic sites such that each interval contains only one polymorphic site, the Number of Inferred Haplotypes would be the same as the size of the genotype of that site. Hence, the same genome can be annotated with several sets of true Numbers of Inferred Haplotypes depending on the boundary selection.

In Figure 5.14 we investigate the Number of Inferred Haplotypes from both real and simulated reads. We divided our analysis into four groups: with or without gaps, in nucleotide space and in coded allele space. These plots indicate that most of the regions in both datasets are phased into P haplotypes. In the real data, the total extent of the regions phased into P haplotypes is far larger than in the simulated data, due to the long insert sizes; however, the plots with removed gaps show that many of the alleles in these haplotypes are not called. The regions in a P-ploid genome contain fewer than P haplotypes when a piece of a chromosome is identical in two or more copies. Ranbow calls these regions as well; their properties are also depicted in Figure 5.14.

Assembly of non-SNP sites: Additionally, we investigated the performance of Ranbow on multiallelic and non-SNP polymorphic sites (non-SNPs). We checked the effect of these variants on haplotype assembly and how accurately they are assembled. For this purpose, the number of non-SNP sites which are accurately called in each haplotype is checked. Figure 5.15 illustrates the distribution of the accuracies of assembled haplotypes at non-SNP polymorphic sites of the sweet potato simulated dataset. The accuracy rate of 0.79 indicates that most non-SNP alleles are integrated correctly into the haplotypes. The precision of integrating this type of variant is lower than for SNP variants; nevertheless, most of the dots sit in the lower right part of the plot, close to the x-axis, which shows that Ranbow performs well in integrating these variants into the haplotypes.

5.3.4 Usability

Finally, we compared the methods from the usability point of view in Table 5.1. We compared these methods based on their demand for pre- and postprocessing, and also on their handling of non-SNP and multiallelic variants. Ranbow and HapCompass accept BAM and VCF format as input and produce aligned haplotypes in BAM format; there is no need for pre- or postprocessing to make the data ready for applying the algorithm and using the result in downstream analysis. However, H-PoP and SDhaP accept the fragment format, in which the alleles are coded as numbers and their connectivity via reads is represented as sequences of numbers. As output, these methods produce a fragment file as well. Therefore, to apply these methods to real data, both pre- and postprocessing steps need to be carried out to make these tools ready for real applications. Moreover, HapCompass, together with Ranbow, are the only methods that can handle non-SNP sites. SDhaP cannot handle non-SNP polymorphic sites if the number of alleles is higher than four; therefore, this restriction needs to be considered in the preprocessing step for this method. H-PoP does not integrate more than two SmP alleles in the haplotype assembly; it converts them to either the reference or the first alternative allele (Table 5.1).

Figure 5.14: Frequency of the Number of Inferred Haplotypes in the phased regions. The upper plots show the haplotype length in nucleotide space, and the lower plots show the same in coded allele space. In the left plots, the missing data is removed from the haplotypes. These plots indicate the number of gaps in the haplotypes produced from the real dataset, due to the long insert size sequencing libraries used for the real data. Moreover, these plots show that both real and simulated data yield almost the same number of haplotypes with the same number of copies.

Figure 5.15: Distribution of the accuracies of assembled haplotypes at multiallelic polymorphic sites of the sweet potato simulated dataset. Each dot shows one assembled haplotype. On the x-axis, the accuracy of the assembled multiallelic polymorphic sites is depicted, while the y-axis shows the number of multiallelic sites. This plot shows that, as the number of multiallelic sites increases, the accuracy rate drops, but there is still a good number of long haplotypes with very high accuracy, and the average accuracy over all dots is 79%.

                                    Ranbow   H-PoP   SDhaP   HapCompass
Additional pre- and postprocessing  No       Yes     Yes     No
non-SNP handling                    Yes      *       *       Yes
* There are restrictions on handling these sites.

Table 5.1: Usability of methods
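The fragment format required by H-PoP and SDhaP is, in essence, a per-read list of (variant index, coded allele) observations. Below is a minimal sketch of the kind of preprocessing this implies; the exact column layout of the fragment files differs between tools, so the output format shown here is only a generic illustration, not either tool's specification.

```python
def read_to_fragment(read_alleles):
    """Convert one read's allele observations into a generic fragment record.

    `read_alleles` maps variant index (0-based position in the variant list)
    -> coded allele ('0' = reference, '1', '2', ... = alternatives).
    Reads covering fewer than two variants are useless for phasing and dropped.
    """
    if len(read_alleles) < 2:
        return None
    items = sorted(read_alleles.items())
    # one "index:allele" token per covered variant, e.g. "12:0 13:2 15:1"
    return " ".join(f"{idx}:{allele}" for idx, allele in items)

# Toy example: a read covering variants 12, 13 and 15 of a scaffold.
print(read_to_fragment({13: "2", 12: "0", 15: "1"}))   # 12:0 13:2 15:1
print(read_to_fragment({40: "1"}))                     # None (only one variant)
```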

5.4 Conclusion and Discussion

We applied Ranbow and three state-of-the-art methods to hexaploid sweet potato sequencing data. The results were then evaluated by recruiting long Roche 454 reads; this dataset can be used as a gold standard for polyploid haplotypers. We also generated two simulated datasets inspired by the sweet potato and tetraploid CBU genomes. In order to compare the performance of the different methods on short sequence reads, we generated four insert size libraries, in the 350bp to 5kbp range, with a read length of 100bp.

The performance of all methods was evaluated on real and simulated datasets with respect to accuracy, length of assembled haplotypes, and execution time. Ranbow performs best in all measures. Moreover, Ranbow was more than one order of magnitude faster than the other methods, which made it possible to apply it to the whole sweet potato genome for several iterations. These iterations were used to utilize haplotyping for improving the scaffolding step of the de novo assembly (details are explained in Chapter 6). At this speed, each iteration took less than two days on the processor farm, whereas applying the next best method to the same data would be infeasible. Furthermore, Ranbow phased 40% of the sweet potato genome into haplotype regions.

With emerging third-generation sequencing technologies and the decreasing cost of obtaining long reads, one avenue for future work would be scaling the current method, or designing new methods, for utilizing this type of data. Another important direction would be combining different sequencing technologies. As an example, SMRT technologies provide error-prone long-range allele connections, while the errors can be detected and fixed with short-read data. Furthermore, the integration of base qualities seems necessary, especially when using only SMRT reads. Linked-read technology (10X Genomics) helps to accurately call the variants, but it suffers from the possibility of combining two haplotype segments into one read. All in all, no single sequencing technology can thoroughly solve the haplotyping of a genome - specifically of polyploid genomes - at this time; thus, combining methods depending on the characteristics of the target genome seems to be the more reasonable direction for future research. Haplotyping could also be integrated into genome assembly and scaffolding to obtain more accurately assembled genomes.

Chapter 6

An application of haplotype assembly: Haplotype-aware de novo assembly of hexaploid Ipomoea batatas

This chapter describes a collaborative work between the Max Planck Institute for Molecular Genetics and the Chinese Academy of Sciences. In this project, we assembled the sweet potato genome in collaboration with Jun Yang, an independent researcher in our lab, and the sequencing core facility of the Max Planck Institute for Molecular Genetics. DNA material was transferred from China and sequenced in house. The initial de novo assembly was done in the sequencing core facility by Heiner Kuhl and provided to us for further analysis and improvement. This collaboration resulted in a report published in Nature Plants [11]. In this chapter, after explaining the de novo assembly part of this project, we describe a novel method for improving the assembly, more specifically the scaffolding part, called haplotype-aided scaffolding. The proposed method relies on the assembled haplotypes and on the connections between scaffolds that are inferred through the reads mapped to these haplotypes. After a new assembly is obtained, the haplotyping can be repeated. This creates an iterative cycle between assembly and haplotyping that improves the assembly stepwise. More detail is provided in the following sections.

6.1 Introduction

With a consistent global annual production of more than 100 million tons, as recorded between 1965 and 2014 (FAO), the sweet potato, Ipomoea batatas, is an important source of calories, proteins, vitamins and minerals for humanity. It is the seventh most important crop in the world and the fourth most significant crop in China. In periods of shortage of basic cereal foods, I. batatas frequently serves as the main food source for many Chinese. It rescued millions of lives during and up to three years after the Great Chinese Famine in the 1960s and was subsequently promoted as a main guarantor of food security in China.

The sweet potato has a complex genome that is very difficult to assemble, with a genome size of about 4.4 Gb. It is a hexaploid organism with a haploid chromosome number of 45 (n=45), i.e., the number of chromosomes in the gamete cells, and 15 monoploid chromosomes (x=15), which results in 2n = 6x = 90. It is composed of two B1 and four B2 component genomes (B1B1B2B2B2B2), as predicted by genetic linkage studies using RAPD and AFLP markers [53, 54]. This nomenclature reflects a hybridization of a diploid B1 and a tetraploid B2 organism followed by a duplication.

The sweet potato is highly polymorphic. Our initial sequencing of the I. batatas genome revealed that the distance between adjacent polymorphic sites is roughly one tenth of that in the human genome. Based on previous statistics, there are approximately 14 million polymorphic sites in the estimated 700∼800 Mb monoploid genome of I. batatas. This means that, on average, one Illumina read (100∼150bp in length) will cover 2∼3 polymorphic sites. This density of polymorphic sites should permit phasing the I. batatas genome with cost-effective Illumina paired-end sequencing.
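As a back-of-the-envelope check of the read-level density quoted above, the following minimal Python sketch recomputes the expected number of polymorphic sites covered by a single read; the variant count and genome size are the figures stated in this section, and the assumption of uniformly spread variants is ours.

```python
def expected_sites_per_read(n_variants, genome_size_bp, read_len_bp):
    """Expected number of polymorphic sites covered by one read,
    assuming variants are spread uniformly along the monoploid genome."""
    density = n_variants / genome_size_bp          # variants per bp
    return read_len_bp * density

n_variants = 14_000_000                            # ~14 million polymorphic sites
for genome_size in (700e6, 800e6):                 # 700-800 Mb monoploid genome
    for read_len in (100, 150):
        e = expected_sites_per_read(n_variants, genome_size, read_len)
        print(f"genome {genome_size/1e6:.0f} Mb, read {read_len} bp: "
              f"~{e:.1f} sites per read")
# Prints values between ~1.8 and ~3.0, consistent with the 2-3 sites per read
# stated in the text.
```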

While the high heterozygosity of I. batatas makes genome assembly more challenging, it simultaneously makes haplotyping easier. This property suggests that the high heterozygosity could help in phasing the genome into fairly long haplotypes and, furthermore, in employing these haplotypes to improve the genome assembly. In this chapter, we present the pipeline designed for de novo assembly of the sweet potato genome. Moreover, we detail how phased regions can advance the genome's de novo assembly.

6.2 Pilot project

A newly bred carotenoid-rich cultivar of I. batatas, Taizhong6 (China national accession number 2013003), was used for genome sequencing at the MPI core sequencing unit. During the genome survey stage, three sequencing libraries were constructed and sequenced on the HiSeq 2500 and GS FLX+ platforms (Table 6.1, A500, A1kb and A454). After a preliminary genome assembly, we obtained ∼166k scaffolds with an N50 of ∼60kb. Thereafter, the reads were mapped to the scaffolds, and PCR duplicates were removed. The mapped reads were employed for variant calling to meet the requirements for haplotyping. A small sample of scaffolds (n=75) was chosen on the basis of scaffold length and the coverage of reads informative for haplotyping, i.e., reads containing at least two sequence variants (Figure 6.1). Haplotyping on the selected scaffolds was carried out using Ranbow in order to phase the genome into up to six haplotypes, i.e., into triploid, tetraploid, pentaploid, and hexaploid regions.

Figure 6.1: Selected and highly covered scaffolds. The selected scaffolds are shown in red. These scaffolds were chosen on the basis of scaffold length and informative read coverage. The total number of scaffolds is ∼166k. The reason for selecting such a subset was the expectation that, with more sequencing, more scaffolds would move closer to this region, since these scaffolds are sufficiently long and well covered by reads. The purple dots illustrate very highly covered scaffolds, which are more likely to be repeat regions or small assembled haplotypes that are not integrated into the longer scaffolds due to heterozygosity. It was predicted that these scaffolds would be mapped to the longer ones as the quality of the assembly improves.

Library  Platform     Sequencing type  Insert size        QC-passed reads  Mapped rate*
A500     HiSeq 2500   PE100            350 bp             345,333,619      94.66%
A1kb     HiSeq 2500   PE100            950 bp             190,263,659      95.07%
L500     NextSeq 500  PE150            550 bp             837,098,772      94.70%
MP       NextSeq 500  PE150            No size selection  694,919,486      91.65%
AMP      HiSeq 4000   PE100            20 kb              316,973,015      95.67%
A454     GS FLX+      SE (up to 1kb)   -                  3,385,694        98.91%
Total    -            -                -                  2,387,974,245    93.97%
* Mapped against the preliminary assembly

Table 6.1: Sequence library characteristics

The coverage of these regions led us to estimate the minimum coverage needed for haplotyping the whole genome. This estimate was inferred from the peak coverage of hexaploid regions in Figure 6.2. The observed coverage of the regions with fewer copies of haplotypes is lower than the coverage peak of the hexaploid regions (Figure 6.2), which suggests that the sequencing coverage is not sufficient for the regions predicted as non-hexaploid. Figure 6.2 shows that at least 40-fold monoploid genome coverage is needed to phase the genome into six haplotypes.
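The minimal-coverage estimate above is simply the mode of the per-region coverage distribution restricted to hexaploid regions. A minimal sketch of that computation is given below; the list of (coverage, number of inferred haplotypes) pairs is hypothetical and would in practice come from the phasing output.

```python
import numpy as np

def peak_coverage(records, ploidy=6, bin_width=5):
    """Estimate the modal coverage of regions phased into `ploidy` haplotypes.

    `records` is an iterable of (coverage, n_inferred_haplotypes) pairs,
    one per phased region. The mode is taken as the centre of the most
    populated histogram bin.
    """
    cov = np.array([c for c, k in records if k == ploidy], dtype=float)
    bins = np.arange(0, cov.max() + bin_width, bin_width)
    counts, edges = np.histogram(cov, bins=bins)
    i = counts.argmax()
    return (edges[i] + edges[i + 1]) / 2.0

# Toy example with made-up numbers; on the real survey data this peak
# came out at roughly 40-fold coverage.
toy = [(38, 6), (41, 6), (43, 6), (42, 6), (39, 6), (22, 3), (28, 4), (33, 5)]
print(peak_coverage(toy))   # -> 42.5 with the default 5x bins
```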

Figure 6.2: Coverage distribution of genomic regions phased into triploid, tetraploid, pentaploid, and hexaploid regions during the genome survey. Based on the genome survey data and the primary assembly, genome regions were phased into up to six haplotypes. The coverage of each phased region and the number of phased haplotypes are summarized here. The coverage peak around 40 in hexaploid regions indicates the minimal sequencing depth required for haplotyping the hexaploid genome. The shift of the coverage peak from triploid to hexaploid regions demonstrates the insufficient sequencing depth at the genome survey stage.

Moreover, short inserts are preferable for haplotyping, since Ranbow relies on regions containing at least six unique sequence patterns; the smaller the insert, the higher the chance of connecting consecutive polymorphic sites and, hence, of finding six unique sequence patterns. On the other hand, longer insert sizes of paired-end reads are beneficial for scaffolding. Considering these facts, new libraries were sequenced on the NextSeq 500 platform to meet the estimated coverage requirement (Table 6.1, L500), and additionally a gel-free mate-pair library and a 20kb mate-pair library were sequenced to improve scaffolding (Table 6.1, MP and AMP; the insert size distributions of these sequence libraries are shown in Figure 6.3).

6.3 Preliminary assembly

6.3.1 Initial assembly of consensus genome

A heterozygosity-tolerant assembly pipeline, combining de Bruijn [55] and OLC (overlap layout consensus) [56] graph strategies, was employed to carry out the hexaploid genome assembly of I. batatas, using error-corrected Illumina reads. A total length of ∼870Mb, mainly representing the monoploid genome, was assembled using this pipeline. The largest scaffold, 3.7Mbp in size, turned out to be the fully assembled genome of the endophyte Bacillus pumilus (the highlighted scaffold in Figure 6.4-a). This figure also shows the different heterozygosity characteristics of the Bacillus pumilus genome, which separate it from the scaffolds of I. batatas.

Figure 6.3: The insert size distribution of sequenced paired-end libraries. (a) Insert size distribution of paired-end libraries A500, L500, and A1kb. (b) Insert size distribution of mate-pair libraries MP and AMP.

After excluding the Bacillus pumilus genome, the largest scaffold of I. batatas, which harbours 54 genes, was 581kb long (and contained 133 contigs). The N50 of all scaffolds was ∼60kb, with a contig N50 of 5,649bp. The total number of scaffolds was 79,089, and their lengths varied between 312 and 3,723,026bp. There were 3,796 scaffolds longer than 60kb. Besides the scaffolds, there were 991,314 contigs with a total length of ∼436Mb. Among these, 92,790 contigs were longer than 1kb, harbouring a total of 175,679,534bp. These contigs mainly reflect the heterozygosity of the hexaploid genome, since 97.35% of all contigs could be mapped back to the scaffolds, and one third of these contigs were found to match at full length, despite many single nucleotide mismatches or small indels, as shown in Figure 6.5.

Figure 6.4: Summary of variations in scaffolds in the preliminary assembly. (a) Number of variations and length of all scaffolds. An isolated dot represents the fully assembled genome of the sweet potato endophyte, Bacillus pumilus. (b) High correlation (0.975) between “Number of variations” and “Scaffold length” after excluding the endophyte Bacillus pumilus genome.

High heterozygosity prevents assemblers from distinguishing the sequences of different haplotypes of one genomic region; hence, they may assemble more than one contig or scaffold for the same region. This leads to assemblies that are larger than the estimated genome size. We designed three approaches to address this redundancy in the assembly. These approaches apply the following modifications to the scaffolds and then exhaustively compare the modified sequences. The similarity between these sequences helps to reduce scaffold redundancy.

Figure 6.5: A snapshot of mapping results of contigs against scaffolds. The large number of single nucleotide mismatches indicates the single nucleotide polymorphism between homologous chromosomes in Ipomoea batatas. Cigar fields of these mapped scaffolds were listed as follows. (a) 4045M. (b) 33S44M1D136M1I10M6D149M15D21M1I1799M. (c) 579M. (d) 56M13D162M. (e) 15S2396M2I92M. (f) 733M. (g) 1260M. (h) 1014M1I162M1D136M. Among these, (b), (e), (g), and (h) are partially shown here.

1. Replacing the variants by a sequence of wildcard nucleotides: Knowing the length of the reference variants, we replace the variant sequences by sequences of wildcard nucleotides of the same length. Using this strategy, we conserve the reference variant lengths while relaxing their sequence differences for the BLAT search [57].

2. Removing the sequence variants from all scaffolds: Having obtained the variant calling results, the reference alleles are removed from the scaffolds, and the modified scaffolds are used as the input for the BLAT search. These variants are one of the reasons why the assemblers cannot recognize that such scaffolds actually come from one region.

3. Keeping the reference variants in the scaffolds, i.e., using the unmodified scaffolds.

Self-to-self BLAT was employed to check the similarity between the scaffolds and was parallelized on the compute farm. We then investigated the results based on sequence identity and overlap length. When one scaffold was covered by another, longer scaffold with more than x% sequence identity and more than y% sequence overlap, the shorter one was removed. Table 6.2 details the results for (x, y) = (80%, 80%) and (x, y) = (85%, 85%).
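As a rough illustration of this filtering rule (not the exact implementation used in the pipeline), the sketch below marks a scaffold as redundant when a longer scaffold covers it with at least x% identity over at least y% of its length; the input is assumed to be already-parsed BLAT hits with hypothetical field names.

```python
from collections import namedtuple

# One parsed BLAT hit between a (shorter) query scaffold and a (longer) target.
# Field names are hypothetical; a real run would parse them from PSL output.
Hit = namedtuple("Hit", "query query_len target target_len matches aligned_len")

def redundant_scaffolds(hits, min_identity=0.85, min_overlap=0.85):
    """Return the set of scaffold names covered by a longer scaffold with
    >= min_identity sequence identity over >= min_overlap of their length."""
    redundant = set()
    for h in hits:
        if h.query == h.target or h.target_len <= h.query_len:
            continue                              # only keep hits to a strictly longer scaffold
        identity = h.matches / h.aligned_len      # fraction of matching bases
        overlap = h.aligned_len / h.query_len     # fraction of the query covered
        if identity >= min_identity and overlap >= min_overlap:
            redundant.add(h.query)
    return redundant

# Toy example: scf_B is 90% covered by scf_A at ~97% identity -> removed.
hits = [Hit("scf_B", 10_000, "scf_A", 50_000, 8_730, 9_000),
        Hit("scf_C", 20_000, "scf_A", 50_000, 4_800, 5_000)]
print(redundant_scaffolds(hits))   # {'scf_B'}
```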

As expected, the scaffolds removed by method 1 were longer and more numerous than those removed by method 3, due to the relaxed matching achieved by using wildcard nucleotides. In contrast, the results of method 2 show fewer scaffold maps, which is caused by the size and frequency of indels in the I. batatas genome (see Figure 5.1, upper right, for the statistics of indel frequency).

                     Method 1         Method 2         Method 3
(Identity, Overlap)  Sum     #Scfs    Sum     #Scfs    Sum     #Scfs
(85%, 85%)           43M     17,970   52M     22,848   32M     14,011
(80%, 80%)           59M     23,009   120M    51,075   47M     19,106

Table 6.2: The result of self-to-self BLAT on all scaffolds by applying three strategies and two thresholds. This table summarizes the number of and the total length of removed scaffolds. Overlap shows the length of the overlapping sequence between scaf- folds, and Identity indicates how similar the overlapping sequences are.

Since the genome size estimated by C-value measurements [58] is between 600Mb and 800Mb, while the initial assembly size was ∼870Mb, we adopted the first method with the (85%, 85%) thresholds. Prior to removing the candidate scaffolds, several long candidates were manually checked via Circos visualization [59].

After these procedures, 61,118 scaffolds remained, with a total length of 822,598,598bp and an N50 of 64,561bp. These scaffolds were then connected using the 20kb mate-pair library with Platanus [60], which increased the N50 to ∼142kb. The total number of scaffolds was 57,051, and their lengths varied between 392 and 1,152,062bp. We call this version the “preliminary assembly”.

6.4 Haplotype-Improved Assembly

In contrast to genome assembly, which relies on the similarity between the sequence reads, the haplotyping process pays more attention to the DNA sequence differences among homologous chromosomes. The integration of genome assembly methods and haplotyping offers a panoramic view of all the homologous chromosomes. Haplotyping itself, however, poses its own challenges. For example, the haplotyping of individual human genomes relies mainly on fosmid-based sequencing, which is costly and time consuming [34, 61]. Since the distance between adjacent variants on the paternal and maternal chromosomes in humans is normally in the kilobase range [62], covering at least two variant positions is, in most cases, beyond the capacity of current cost-effective sequencing platforms. Nevertheless, human haplotyping studies have already indicated that genome assembly and accurate haplotyping are tightly linked [61]. Unfortunately, the computational phasing problem for polyploids is considerably harder than for a diploid organism, because in the polyploid case one cannot make inferences about the “other” haplotype once one has seen the first.

Here, we report the integration of de novo assembly and haplotyping methods to improve the genome assembly. We recruit the Ranbow haplotyper in an iterative manner to update the haplotypes based on the assembly, and these haplotypes are then used for post-scaffolding. Using this pipeline, we observed remarkable improvements in terms of longer scaffolds, more phased regions, and longer haplotype lengths (see Section 6.4.5).

6.4.1 Variant calling

All the Illumina raw reads were mapped back to all scaffolds of the preliminary assembly (Table 6.1). After removing PCR duplicates, 1,725,677,696 mapped reads remained in the final BAM file for variant calling with freebayes [63]. In total, there were 14,342,083 variations across the assembly, consisting mainly of single nucleotide polymorphisms but also of indels. Most variant positions harboured two possible alleles (Figure 6.6-a). We observed, on average, one polymorphic site every ∼58bp and a median distance of 20bp between polymorphic sites. The distance distribution peaked at 6bp, and only 7% of the observed distances were longer than 150bp (Figure 6.6-b). These findings confirmed our earlier conclusion that the I. batatas genome is very heterozygous and formed the basis for phasing haplotypes using 100–150bp Illumina reads. A high correlation (r=0.975) was found between the number of variations and scaffold length (Figure 6.4), which increases our chances of phasing the scaffolds.
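The distance statistics quoted here (mean, median, and modal distance between adjacent variants) can be reproduced from the variant positions alone. Below is a minimal sketch of that computation from a sorted, VCF-like list of (scaffold, position) pairs; the toy input is made up.

```python
from statistics import median
from collections import Counter
from itertools import groupby

def adjacent_distances(variants):
    """Distances between adjacent variant positions on the same scaffold.

    `variants` is an iterable of (scaffold, position) pairs sorted by
    scaffold and position, e.g. as read from a VCF file.
    """
    dists = []
    for _, group in groupby(variants, key=lambda v: v[0]):
        positions = [pos for _, pos in group]
        dists.extend(b - a for a, b in zip(positions, positions[1:]))
    return dists

# Toy input; the real numbers in the text come from ~14.3 million variants.
toy = [("scf1", 10), ("scf1", 16), ("scf1", 36), ("scf1", 42),
       ("scf2", 5), ("scf2", 11)]
d = adjacent_distances(toy)
print(sum(d) / len(d), median(d), Counter(d).most_common(1))
# mean, median, and the modal distance (6 bp dominates in this toy set)
```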

6.4.2 Haplotype phasing

We used the Ranbow algorithm for reconstructing the haplotypes (see Chapter 4). The assembled haplotypes can be further extended by connecting paired-end reads: some paired-end reads map to haplotypes within one scaffold, while other paired-end reads connect haplotypes from different scaffolds, owing to their better match to the phased sequence. These connections are used for the Haplotype-Improved assembly. The assembled haplotype length distribution is shown in Figure 6.7, in nucleotide and coded allele space, respectively. Moreover, we removed the gaps from the haplotypes to show the number of SmPs that are connected through the haplotypes.

6.4.3 Validation of haplotypes

To evaluate the haplotyping accuracy, we used a set of 454 reads (Table 6.1, A454) that had been produced earlier but were not used for assembly or phasing. Each 454 read can be considered a short DNA fragment from one chromosome, except for rare chimeric reads. The reads are on the order of 1000bp long (Figure 5.3) and can thus serve to identify errors in the haplotype reconstruction. A large fraction of these 454 reads displays haplotypes that have been correctly reconstructed by our short-read based methodology.

Figure 6.6: Summary of variations. (a) The total number of variant positions with different numbers of variant alleles. On average, 831,919,670/14,342,083 = 58bp per variant, which means that short reads (100bp and 150bp) are informative for haplotyping. Npos, number of positions; Nvar, number of variants. (b) The distribution of distances between adjacent variants peaks at 6bp (red dashed line). Only 7% of the observed distances are longer than 150bp.

Roche 454-trimmed reads were mapped against the genome assembly (please refer to Section 5.2.2 and Figure 5.3 for more details on the trimming procedure). Only the polymorphic sites indicated by variant calling were extracted, and their overlaps with haplotypes were evaluated. For each overlap, the ‘match’ and ‘mismatch’ sites, i.e., the number of coinciding polymorphic sites between haplotype and 454 read and the number of differing polymorphic sites in the overlap, respectively, were recorded for evaluating the haplotypes. More than 60% of the overlaps between haplotypes and 454 reads are identical at variant loci. The longest reconstructed haplotype contained 92 polymorphic sites without any mismatch indicated by 454 reads. There may be many longer perfectly reconstructed haplotypes that remain undetected because of the limited 454 read length.

Figure 6.7: Haplotype length distribution in nucleotide and coded allele space. These plots show the distribution of haplotype lengths as well as the number of gaps in them. In the coded allele space plot (right), the two trends coincide when the number of SmPs is below ∼100 alleles and then diverge as the haplotype length increases, so that the maximum number of SmPs in a normal haplotype is 11,504, while it reduces to 1,261 alleles after removing the gaps. This indicates the effect of paired-end and mate-pair reads, which connect distant haplotypes. The divergence in nucleotide space is not as pronounced as in coded allele space, due to the availability of interpolymorphic regions.
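The per-overlap evaluation described above boils down to comparing allele calls at shared polymorphic sites. The following minimal sketch (our own illustration, not the evaluation script itself) counts matches and mismatches between one haplotype and one 454 read, both given as mappings from polymorphic-site position to the coded allele observed there; missing calls are simply skipped.

```python
def match_mismatch(haplotype_alleles, read_alleles):
    """Count matching and mismatching coded alleles at shared polymorphic sites.

    Both arguments map polymorphic-site position -> coded allele ('0', '1', ...).
    Sites covered by only one of the two, or with a missing call ('-'), are ignored.
    """
    shared = set(haplotype_alleles) & set(read_alleles)
    match = mismatch = 0
    for pos in shared:
        h, r = haplotype_alleles[pos], read_alleles[pos]
        if h == "-" or r == "-":
            continue
        if h == r:
            match += 1
        else:
            mismatch += 1
    return match, mismatch

# Toy example: a haplotype and a 454 read overlapping at four called sites.
hap = {101: "0", 250: "2", 398: "1", 512: "0", 640: "1"}
read454 = {250: "2", 398: "1", 512: "1", 640: "1", 802: "0"}
print(match_mismatch(hap, read454))   # (3, 1): identical except at position 512
```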

6.4.4 Haplo-scaffolding strategy

Higher heterozygosity and ploidy in a genome complicate the scaffolding step of the de novo assembly pipeline. This is, in part, due to the assembly of an erroneous reference, which accumulates the most frequent alleles at polymorphic sites as the reference alleles and therefore differs from the true haplotypes. The problem arises when reads sequenced from the true haplotypes are mapped to the consensus reference: the higher the heterozygosity, the larger the divergence between the true haplotypes and the assembled reference. In the scaffolding step, reads are mapped to the reference sequence in order to find connections among scaffolds. Since higher heterozygosity increases the number of mismatches between the reads and the reference they are mapped to, it reduces the mapping efficiency, and more reads are rejected by the mapper (Figure 6.9, left scaffold, shows an example of an inaccurate reference in a region with four polymorphic sites).

Figure 6.8: Haplotype evaluation by 454 reads. x-axis: ‘Match’ is the number of coinciding polymorphic sites between haplotype and 454 read. y-axis: ‘Mismatch’ is the number of differing polymorphic sites in the overlap. Color indicates the frequency of the respective (haplotype, 454 read) pairs, ranging from red to purple (on an exponential scale). (a) Evaluation of haplotypes. (b) The part with fewer than 6 mismatches (97.25% of the total number of overlaps). 64.78% of the overlaps between haplotypes and 454 reads are identical at variant loci (y = 0). Even with a strict mismatch threshold, a large fraction of 454 reads supports the haplotypes reconstructed by short reads.

To solve this problem, we propose a strategy called haplo-scaffolding. First, we phase the genome into haplotypes, which are packed into blocks; each block is a fully phased region containing up to P haplotypes. Then, the paired-end reads are mapped to the haplotypes, and perfect matches are selected as candidates for scaffolding. The positions of the candidates are then calculated from the position of the haplotype within its scaffold and the position of the read within the haplotype. These candidates are used as new connections between different scaffolds. Figure 6.9 depicts an example of the effect of the haplo-scaffolding strategy for two scaffolds. In this example, we set the mapping mismatch threshold to two base pairs. Both reads in scaffold I are rejected by the mapper; after mapping the reads to the haplotypes and extracting the perfect matches, connections between the two scaffolds are obtained. These connections are then used for the scaffolding step.
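The coordinate lifting described in this paragraph, from a read's position on a haplotype to a position on the underlying scaffold, is the core bookkeeping of haplo-scaffolding. A minimal sketch under our own, simplified assumptions (ungapped haplotypes starting at a known scaffold offset; perfectly mapped read pairs) is given below; names and data layout are hypothetical, not the pipeline's own.

```python
from dataclasses import dataclass

@dataclass
class HaplotypeSegment:
    scaffold: str        # scaffold the phased block sits on
    scaffold_start: int  # 0-based start of the haplotype on that scaffold
    # (assumes an ungapped haplotype, so offsets translate directly)

def lift_to_scaffold(seg: HaplotypeSegment, read_pos_on_hap: int):
    """Translate a read position on a haplotype to scaffold coordinates."""
    return seg.scaffold, seg.scaffold_start + read_pos_on_hap

def pair_connection(seg1, pos1, seg2, pos2):
    """Turn a perfectly mapped read pair (one mate per haplotype) into a
    scaffold-level connection; intra-scaffold pairs are not connections."""
    s1, p1 = lift_to_scaffold(seg1, pos1)
    s2, p2 = lift_to_scaffold(seg2, pos2)
    if s1 == s2:
        return None                     # same scaffold: useful for elongation only
    return (s1, p1, s2, p2)

# Toy example: mate 1 maps to a haplotype on scaffold I, mate 2 to scaffold J.
hap_I = HaplotypeSegment("scaffold_I", 12_000)
hap_J = HaplotypeSegment("scaffold_J", 300)
print(pair_connection(hap_I, 450, hap_J, 120))
# ('scaffold_I', 12450, 'scaffold_J', 420) -> a candidate inter-scaffold link
```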

Figure 6.9: The effect of the haplo-scaffolder. The reference accumulates the haplotypes’ most frequent alleles. Scaffold I shows that the reference sequence may not be similar to any of the haplotypes. Two reads, which are rejected from mapping due to the mismatch threshold, are now mapped to the haplotypes (red and green haplotypes). The other reads of the pairs are mapped to scaffold J. These two newly extracted connections between scaffolds I and J can be used for haplo-scaffolding.

6.4.5 Haplotype-Improved assembly for I. batatas (HI-assembly)

All the Illumina raw reads were mapped against the haplotype sequences generated by the phasing step. Only perfectly matched paired-end reads were considered as haplotype connections. The interscaffold and intrascaffold connections were separated for haplotype-based scaffolding and haplotype elongation, respectively; the interscaffold connections were further separated according to insert size. It is worth noting that the insert length distributions obtained by mapping the AMP and MP libraries (the 20kb and gel-free mate-pair reads, respectively) to our assembly revealed the following technical issues (see Table 6.1).

• We found two peaks for the MP library (see Figure 6.10 for the insert length distribution). To investigate this issue, we extracted the reads mapped to small scaffolds (<1000bp), expecting the first peak to be constructed from reads mapped to the substantial number of short scaffolds. Indeed, after removing the reads mapped to short scaffolds, the first peak disappeared; Figure 6.11 shows the proper peak of the MP library. As for the long tail of this library, a wide insert length distribution was expected, since the library was sequenced without insert size selection.

Figure 6.10: Insert length distribution for the MP library. This library contains two peaks, one very short and one at 1.7kb, and a wide range of inserts up to ∼12kb.

• We couldn’t find a clear peak of insert sizes for AMP library. In contrast to MP library, the insert length distribution for AMP was not anticipated to be in such a wide range without a clear peak around 20kb (Figure 6.3-b(orange)). It is worthwhile noting that AMP library was mapped to a closely related organism (Ipomoea trifida) as well to see if this problem was a result of mapping such long inserts to small scaffolds in our assembly or not. Figure 6.12 illustrates that this trend is almost the same in both assemblies.

Figure 6.11: The insert length distribution of the MP library. The insert sizes on short contigs (<1000bp) are shown in red, which explains the first peak in Figure 6.10-a. After removing the inserts mapped to small scaffolds, the insert size distribution has a single peak.

Figure 6.12: Insert length distribution for AMP library. One pronounced peak for very short inserts, two minor peaks at 3kb and 15kb.

We designed an approach for haplo-scaffolding that takes the aforementioned technical issues into account. In this method, we extract the useful connections inferred from the assembled haplotypes; these connections are then used by a scaffolding tool to assemble contigs and scaffolds into longer scaffolds. We used the SSPACE scaffolder [64] for this purpose.

Therefore, the connections from the short insert size libraries, namely A500, L500, and A1kb, were applied first. Among the 994,434 connections obtained from the haplo-scaffolding step, 222,709 mapped to the borders of the scaffolds and were used for scaffolding; these border regions are defined according to the insert size of the sequence libraries. We set the minimum number of interscaffold connections to 5, the recommended value, which resulted in connecting ∼3500 scaffolds and a 0.8% improvement of the N50 length. Afterwards, the connections obtained from the MP and AMP libraries were selected according to predefined boundary sizes, i.e., 1.5kb, 3kb, 6kb, 9kb, 12kb, and 15kb. These boundaries filter out the connections that do not lie in the scaffold borders. Figure 6.13 (lower left) shows the proportion of reads that satisfy these conditions. With the satisfying connections at hand, the scaffolder was applied iteratively: at each step, the group of connections from one sequence library was fed to the scaffolder, and the resulting scaffolds were used as input for the next step.
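The border filter used here is a simple positional test: a connection is only useful if both ends lie close enough to a scaffold end for the library's insert size to bridge the gap. The sketch below is our own simplified rendering of that idea, with hypothetical names; it is not the pipeline code.

```python
def in_border(position, scaffold_length, boundary):
    """True if a mapped position lies within `boundary` bp of either scaffold end."""
    return position < boundary or position > scaffold_length - boundary

def border_connections(connections, scaffold_lengths, boundary):
    """Keep inter-scaffold connections whose ends both fall in the border regions.

    `connections` is an iterable of (scaffold1, pos1, scaffold2, pos2) tuples,
    `scaffold_lengths` maps scaffold name -> length, and `boundary` is chosen
    per library class (e.g. 1500 for the 1.5kb class, 15000 for the 15kb class).
    """
    kept = []
    for s1, p1, s2, p2 in connections:
        if in_border(p1, scaffold_lengths[s1], boundary) and \
           in_border(p2, scaffold_lengths[s2], boundary):
            kept.append((s1, p1, s2, p2))
    return kept

# Toy example with a 3kb boundary: only the first connection touches both borders.
lengths = {"scaffold_I": 40_000, "scaffold_J": 25_000}
conns = [("scaffold_I", 39_200, "scaffold_J", 1_100),
         ("scaffold_I", 20_000, "scaffold_J", 900)]
print(border_connections(conns, lengths, boundary=3_000))
```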

The major drawback of this approach is that it squeezes the gap sizes and results in smaller scaffolds, because both short and long connections, such as the 1.5kb and 3kb inserts, are combined in earlier steps; consequently, it has negative effects on the gap size prediction. Nevertheless, haplo-scaffolding resulted in 35,919 scaffolds (from an input of 57,051 scaffolds) with lengths varying between 392 and 1,335,955bp and an N50 of ∼201kb, a 40% improvement.

Figure 6.13: Haplo-scaffolding results. The upper left plot shows the improvement of the N50 by applying the different libraries. The lower left plot indicates the maximum scaffold length of each assembly. The upper right plot presents the histogram of the number of scaffolds after applying each library. The lower right plot shows the number of connections extracted by the haplo-scaffolder and the number of connections that are useful for haplo-scaffolding.

6.5 Conclusion

In this chapter, we explained the de novo assembly pipeline of the sweet potato genome and, as the main part of this pipeline, an application of haplotyping. We employed a haplotyping tool, Ranbow, and designed a new haplotype-aware scaffolder called haplo-scaffolder. Haplo-scaffolder uses the assembled haplotypes to rescue a set of potential connections that were hidden due to the differences between the true haplotypes and the reference sequence. These connections are obtained by mapping the reads to the haplotypes and transferring the connection information to the scaffold level. This process can be repeated, since the set of scaffolds is updated after each round of haplo-scaffolding. We showed how the N50 and the maximum scaffold length improved by using this strategy on the sweet potato genome. The effect of haplo-scaffolding is large in the first iterations, but after a number of iterations it reaches a plateau.

Summary

In this thesis, we focus on the problem of reconstructing haplotypes for polyploid genomes and on the utilization of the called haplotypes in the de novo assembly of these genomes. We approach this topic by exploring short-read sequence data of the highly heterozygous hexaploid sweet potato genome.

First, we investigate the role of heterozygosity and ploidy in reconstructing haplotypes with short reads. In short, higher heterozygosity provides a larger number of useful reads for reconstructing haplotypes, while polyploidy introduces a challenge in assembling reads into longer sequences, which we call the problem of ‘Ambiguity of Merging fragments’. We address this problem and show that reads can be assembled into haplotypes with high accuracy using short reads. To this end, we propose a new algorithm, called Ranbow, and evaluate its performance on real and simulated datasets from the tetraploid Capsella bursa-pastoris (Shepherd’s Purse) and hexaploid Ipomoea batatas (sweet potato) genomes. We are able to show that our method achieves higher accuracy and longer assembled haplotypes than the other methods.

Next, we present the de novo assembly pipeline of the sweet potato genome, utilizing the computed haplotypes for genome assembly improvement. This novel approach, called haplo-scaffolding, uses the assembled haplotypes to rescue a set of potential connections that were hidden due to the differences between the true haplotypes and the reference sequence. These connections are obtained by mapping the reads to the haplotypes and transferring the connection information to the scaffold level. This process can be repeated by updating the scaffold set to further improve the genome assembly. We show that this strategy substantially improves the N50 and the maximum scaffold length of the assembled sweet potato genome.

List of Figures

2.1 Schematic view of genotype and haplotype. Panel A or reference panel illustrates a reference sequence (black line) and its polymorphic sites identified in population. These polymorphic sites are then depicted as genotype in the coded allele illustration. Panel B shows polymorphic sites in a single individual consisting of hetero- and homozygous variants. These variants are depicted in both nucleotide- and coded allele spaces. In order to infer the coded allele from a variant, one has to check how the variant is coded in the reference panel and the genotypes. Panel C shows the haplotypes in coded allele space. Since the interpolymorphic, i.e., the regions between consecutive polymorphic sites, and homozygous regions are identical in both chromosomes, their sequences in parental haplotypes can be simply inferred from the reference sequence. The objective of hap- lotype reconstruction is to infer the connection of heterozygous variants. Panel C shows what haplotype reconstruction methods are aiming for..9 2.2 Homozygous and heterozygous variants. A) two consecutive homo- zygous variants, one reference and one alternative. Obtaining haplotypes is trivial and there is no ambiguity in inferring them. B) Ambiguity of obtaining the haplotypes when the polymorphic sites are heterozygous. There are two possibilities for the haplotypes considering two consecutive heterozygous polymorphic sites...... 10 2.3 Obtaining haplotypes of a diploid genome once one of the hap- lotypes is known. When one haplotype is known, the other haplotype can be simply computed as its bitwise complement...... 11 2.4 Obtaining haplotypes of a polyploid genome once one of the haplotypes is known. This figure shows a triploid genome and its genotype for three heterozygous positions. Given the genotype and one of the haplotypes, there are four possible ways for obtaining the other two haplotypes. This figure shows these four possible haplotype groups..... 12 2.5 A) Compound heterozygosity in a gene. B) Compound hetero- zygosity in one gene and its enhancer ...... 14 2.6 Haplotype reconstruction for a diploid genome. As an example, a part of human chromosome seven is depicted in this figure. The variants in polymorphic sites of each homologous chromosome (red and blue) are illustrated in the same color as the chromosomes. The lower part of the figure shows the sequence reads, and the variants they carry. These reads are partly sampled from the blue chromosome and partly from the red one. The reads can be grouped according to the information they carry. A read is useful for haplotype assembly if it contains at least two sequence variants. Reads with no or one sequence variant are not useful and excluded for haplotyping (black reads)...... 18


2.7 AoM problem The fragments A and B have an overlap with two match- ing SmPs (T...T ). Frag.C is made from merging Frag.A and Frag.B. The real haplotype sequences from which Frag.A and Frag.B are sampled, are illustrated in (B), showing that Frag.C does not belong to any of the real haplotypes. This ambiguity makes it difficult to merge two fragments with high confidence, only by considering their sequences...... 19 2.8 The percentage of SmP hits in human and sweet potato gen- omes. The percentage of useful reads for haplotyping for human and hexaploid sweet potato is compared. As shown, about 1% and 50% of the reads are usable for haplotyping for human and sweet potato, respectively. The notable difference is due to the heterozygosity of these genomes. The polymorphic sites are much closer in the sweet potato genome such that more read pairs, with 100bp single-end short reads, cover more than two polymorphic sites. The statistics are inferred from an individual in the 1000 genomes project[22] and the pilot sweet potato project data[Section 6.2]...... 20 2.9 Effect of a duplication in haplotyping for the human genome. The green sequence is a duplication of the blue one. The mapper and haplotype assembler cannot distinguish the reads which are sampled from the green region; consequently, it assumes there are two haplotypes and it merges the reads from three haplotypes to assemble two haplotype sequences...... 20 2.10 Repetitive regions A) The effect of interspersed repeats in genome assembly. It is possible to find out which sequence (blue or orange) should be connected to the black one. B) The effect of tandem repeats in genome assembly. The assembler might squeeze the region since the reads are short...... 21 2.11 An illustration of sequence errors in the genome browser. The three tracks are MinION Nanopore, Roche 454, and assembled haplotypes from Illumina reads. The dissimilarity between the reads and reference are depicted in different colors. The first track shows the error rate is so high that distinguishing between true variants and error is not trivial... 22 2.12 The two possible assignments of computed haplotypes to the ground truth ones in a diploid genome. The ground truth and computed haplotypes are shown in the middle. The two assignments are illustrated as first and second assignments. The selected metric here is the number of matches divided by the length of haplotypes. The second assignment has a better score, or accuracy, and it will be used for the evaluation of computed haplotypes...... 24 2.13 The six possible assignments of computed haplotypes to the ground truth ones in a triploid genome...... 24 2.14 N50 and switch error metrics. Panel A illustrate the N50 measure. Panel B depicts the number of switches needed for converting the com- puted haplotypes to the ground truth ones. Panel C shows two possible set of switches, namely four and seven switches, for the first and second scenarios, respectively. Panel D indicates the problem of usingutilizing switch error for evaluation of computed haplotypes for polyploid genomes. The highlighted column indicates that if the genotypes of computed hap- lotypes and ground truth ones at a polymorphic site are not identical, the switch error measure cannot be computed...... 26 List of Figures 115

3.1 An illustration for minimum error correction(MEC) and min- imum fragment removal(MFR) models. For a diploid genome the reads are partitioned into two clusters (cluster 1 and cluster 2). For each cluster one haplotype is computed based one majority of votes on each column (colored in purple). Afterwards, the number of mismatches (in orange) between reads in the clusters and the computed haplotypes are counted. In this example, there are three mismatches in cluster 1 and two mismatches in cluster 2, so in total the score for this partition is five. In MEC model the number of mismatches is an indicator of goodness of partitioning. In MFR model, the goodness of the partitions are calculated based on the number of fragments need to be removed in order to have clusters with no disagreement. In the partition illustrated in this figure, there would be no dissimilarities between the reads in the clusters if the four reads which are pointed out with the arrows are removed...... 31 3.2 Schematic view of HapCut algorithm. Left) A list of fragments and an initial haplotype. Right) Constructed graph based on the initial haplotype and the list of fragment. The weights are calculated as the number of reads supporting the haplotype subtracted by the reads which are inconsistent with it. The red weights are the negative ones. Figure is adapted from [32]...... 34 3.3 A conflict graph. This figure shows a conflict graph obtained from the fragment list shown in the upper left part. The fragment list is converted to a conflict graph shown in the upper right. Nodes are the fragments and edges are calculated based on the number of mismatches minus the number of matches between two fragments. A good cut splits fragments into two separate groups S = {1} and S0 = {2, 3, 4}. Figure is adapted from [35]...... 37 3.4 The comparison of different insert size on the haplotype as- sembly of human genome. The y-axis shows the variant N50 (this figure is adapted from [36] )...... 39 3.5 Haplotyping with the help of transcriptome data This figure shows the allele connectivity via paired reads (a), RNA-Seq reads (b), and DASE information (c). By ‘DNA-seq reads’, here, we mean genomic sequence reads. Part (d) shows the length of the region of effect for each of the information layers. This figure is adapted from [40]...... 40 3.6 The schematic view of 10X genomic pipeline from high molecu- lar weight fragments to assembled haplotypes. (adapted from [43]) 41 3.7 Strand-seq pipeline. (adapted from [44])...... 42 3.8 Different layers of complexity for polyploid haplotyping. Sky- line layer: The number of unique sequences of chromosomal segments. Structure layer: The number of copies of each unique sequence. Hap- lotypes layer: The number of variants which are arranged into the hap- lotypes...... 45 3.9 A schematic view of allele connectivity via different types of data. One circle is one variant, either called (green and orange) or not called; The called variants could be lowly accurate (orange) or highly accurate (green)...... 48 List of Figures 116

4.1 Haplotypes in nucleotide and coded allele space. This figure il- lustrates how haplotypes are transferred from nucleotide space to coded allele space...... 50 4.2 Conversion of a read to a fragment: There are three type of variation (insertion, deletion, and a substitution) covered by the read. This figure shows how the information is formulated into the fragment sequence.... 52 4.3 From sequence reads to initial haplotypes of a triploid genome in a mask region. The fragments are clustered according to the seed sequences they carry. There are three clusters; each constructs one hap- lotype segment (shown in blue boxes). The haplotype segments are sup- ported with four, three, and three fragments, respectively (blue, red, and green fragments). The purple read and fragment are erroneous which is put asside from the assembly. Two tables are shown in the lower part of the figure. The left one is in nucleotide space and the right one is ob- tained by transforming the left table into coded allele space. Recall that the reference, aligned reads, and list of variants are the inputs...... 55 4.4 An illustration of masks and their seed sequences. (A) shows the mask (msk1), its seed sequence (‘121’) and the fragment containing this seed sequence (f1). (B) depicts a mask (msk2) and the three fragments covering the mask. Since the seed sequence of f2 and f4 are identical, both contribute one seed sequence to the mask msk2, therefore, the set (20,22,24) (20,22,24) (20,22,24) of seed sequences T contains t1 and t2 . (C) shows a masks msk3 with no seed sequence...... 56 4.5 All possible seed sequences contributed by one fragment. The depicted fragment covers four polymorphic sites. The seed sequences of length two, three, and four are shown in panel B. As it is shown in panel A, the number of seeds of length two, three and four are six, three, and one, respectively...... 57 4.6 Schematic view of a mask phasing for a hexaploid genome. a) Illustrates a reference genome and possible masks (arrows). The colors in the reference indicate that the genome can be tiled into subregions each of which is similar to one of the haplotypes. b) The arrows are shaded according to the masks’ read support. The one with the highest support (shown in black) is selected and phased. c) The mask is phased into its six seed sequences. d) The purple seed sequence and its supporting reads are shown. After error correction the reads are merged into one assembled haplotype...... 58 4.7 An Illustration of errors detection when T msk > P for a triploid genome: Reads with same seed sequences are depicted in same color. Seed sequence ‘22’ is the erroneous seed sequence because it is supported with one fragment while the other seed sequences are supported with at least two fragments...... 60 4.8 Uniquely matched overlap between a fragment and a set of hap- lotypes. The only matching overlap between the fragment and the block is “A”. Since there is just one matching overlap and five mismatching overlaps, the new fragment can be merged with the yellow haplotype seg- ment...... 64 List of Figures 117

4.9 The seed extension process: a) A constructed seed and its corres- ponding haplotypes. b) Seed extension by uniquely matched reads c) iteratively the blocks are getting longer and the final block is constructed. 65 4.10 Schematic view of graph G: This figure shows a number of fragments and their corresponding edges in graph G. The fragments with miss- ing alleles may be obtained from paired reads. If two reads of a pair are mapped into two haplotype segments of different blocks, one edge is assigned between the nodes...... 66 4.11 Desired cycles, desired paths, and conflicting paths. A and B illustrate two desired cycles (blue), one desired path (green), and three conflicting paths (red). The conflicting paths are those containing two nodes of the same partition. In desired cycles the connectivity between two nodes are not only supported with a direct edge but also supported through the path of other nodes of the cycle...... 67 4.12 Searching for triangles: This figure depicts a triangle starting at sx and ends at sy and sz. Ranbow checks if there is an edge between sy and sz (dotted line) to form a triangle...... 68 4.13 Connecting haplotype segments of different blocks in an iterat- ive manner. Edges are the connections obtained via reads. Red rect- angles are sliding windows with length of four. Solid gray lines are the edges beginning from the leftmost block of a sliding window. The dot- ted lines are edges, and the blue edges are the newly formed haplotypes. The candidate edges at each step are checked to find if they construct a triangle or not. The most supported triangle at each step is converted to one haplotype segment...... 70 4.14 Conflicts in merging two segments This figure shows how a conflict occurs by merging sx and sp if sx consists of sx and sy, and also sp consists of sp and sq. Merging sx and sp results in merging sy and sp which causes a conflict of merging two segments of a block...... 71 4.15 Different possibilities for phasing the regions with fewer than P sequence patterns. In this example P = 4. (A) depicts two regions each of which phased into three haplotypes. The colored nodes indicate the nodes with two copies. The weights of edges show the number of fragments connecting the nodes. The two possibilities of connecting nodes are shown in (A). The left one assumes that the connection with the weight two is an error. The right one indicates there is no error and the weights are distributed unevenly. (B) illustrates two possible phases of a region with two unique segments, and it is not known which one is true...... 74 List of Figures 118

5.1 Dataset properties of sweet potato and CBU genomes. A) The interval length distribution of polymorphic sites. Both axes are in log scale. This plot indicates the high number of short intervals and the low number of long intervals. B) Different kinds of variants. This plot shows the frequency of the different types of variants, namely SNPs, non- SNP and multiallelic sites. The y-axis is in logarithmic scale. The plot depicts the high number of SNPs in both genomes while the number of the other types of variant is still considerable. Note that, multiallelic variants are a part of non-SNP variants. This plot shows that what proportions of non-SNP variants are multiallelic variants in these genomes.C, D) Genotype distribution for sweet potato (left) and CBU (right) genomes. The y-axes are in logarithmic scale. On the x-axes, all possible genotype configurations are shown...... 77 5.2 Selected scaffolds’ properties. In this plot, each dot depicts one scaf- fold of the sweet potato genome. These scaffolds were obtained after the scaffolding step of de novo assembly. The x-axis shows the log scale of cov- erage and the y-axis shows the log scale of scaffold lengths. We randomly selected 50 scaffolds of different sizes, namely 10kb, 50kb, 100kb, 500kb, and 1000kb, ten scaffolds each. These scaffolds are used for evaluating with real data and producing the simulated dataset...... 79 5.3 Base quality and length distribution of Roche 454 reads. Left) Each 454 read is divided into 20 equal size segments. The box plot shows the base quality distribution and the red line indicates the high threshold we set for filtering the bases on quality. Right) The Roche 454 length distribution. Maximum length is 1771bp...... 80 5.4 Dataset properties for sweet potato genome. Left) Insert size dis- tribution. The real data contains five different libraries with different insert sizes, namely 350bp, 550bp, 950bp, 20k, and no size selection. For simulated data, we generated inserts of 350bp from the selected scaffolds with 30x coverage for each haplotype. Right) Coverage of selected scaf- folds for real and simulated data. The x-axis shows base coverage, and the y-axis depicts frequency. In the real dataset, the base coverage varies in a wide range up to 10k while the simulated data has the peak at 180x. This discrepancy is caused by the presence of repeats in the genome.... 81 5.5 Insertion length distribution for simulated dataset of CBU gen- ome. Four 100 bp paired-end read libraries with the insert sizes of 350bp, 1kbp, 2kbp and 5kb are generated by EAGLE (Enhanced Artificial Gen- ome Engine)[51]. EAGLE generates reads and converts them to align- ments. It is designed to simulate the behavior of Illumina’s Next Gen- eration Sequencing instruments. For each library, the coverage for every haplotype is 40x...... 82 5.6 Comparison of all methods on sweet potato real and simulated datasets. A, B) match-mismatch plots for sweet potato real and simu- lated datasets C, D) The execution times for different scaffold size groups 83 5.7 Comparison of the methods regarding accuracy on real and sim- ulated sweet potato data. Left) The assembled haplotypes of simu- lated data are grouped into five groups according to their size. Right) Since the size of real ground truth data is limited, the real data is grouped into four categories...... 84 List of Figures 119

5.8 Comparison of all methods in terms of assembled haplotype length on sweet potato real data. This plot shows Ranbow and H-PoP perform better in terms of assembled length...... 85 5.9 Comparison of H-PoP and Ranbow on different insert sizes for simulated dataset of CBU genome. The match-mismatch plots show the effect of different insert sizes on the performance of the methods. The colors indicate the frequency. Ranbow performs best in almost all conditions; the dots are closer to the x-axis, and also the number of matches is longer in the right plots. Moreover, the combination of all insert sizes improves the result dramatically (lower plots). There are a good number of assembled haplotypes containing very low number of mismatches in the lower right plot...... 86 5.10 Effect of different insert sizes on the performance of H-PoP and Ranbow on sweet potato simulated data. This figure shows that Ranbow performs better not only on various insert sizes, but it also outperforms when all libraries are merged into one dataset. The colors indicate the frequency...... 87 5.11 Comparison of H-PoP and Ranbow regarding accuracy on CBU simulated dataset. Ranbow results in more accurate haplotypes in all groups with various scaffold lengths...... 88 5.12 Matching length and frequency distribution of assembled hap- lotypes by different insert size on simulated sweet potato data. In this plot the mismatches are removed; hence it shows the number of correctly connected alleles...... 88 5.13 Scatter plots of corrected alleles in assembled haplotypes. Each haplotype is assembled from a collection of fragments that may contain a few disagreements with each other. The average reconstruction rate, i.e. 1 − error correction for both real and simulated datasets are more than 99.9%...... 89 5.14 Frequency of Number of Inferred Haplotypes in the phased re- gions. The upper plots show the haplotype length in nucleotide space, and the lower plots show the same in coded allele space. In the left plots, the missing data is removed from the haplotypes. These plots indicate the number of gaps in the produced haplotypes in the real data set, due to the long insert size sequencing libraries used for real data sets. Moreover, these plots show both real and simulated data assemble almost the same number of haplotypes with the same number of copies...... 91 5.15 Distribution of the accuracies in assembled haplotypes on mul- tiallelic polymorphic sites of sweet potato simulated dataset. Each dot shows one assembled haplotype. On the x-axis, the accuracy of assembled multiallelic polymorphic sites is depicted while the y-axis shows the number of multiallelic sites. This plot shows that, as the num- ber of multiallelic sites increases, the accuracy rate drops, but still, there are a good number of long haplotypes with very high accuracy, and the average accuracy of all dots is 79%...... 92 List of Figures 120

6.1 Selected and highly covered scaffolds. The selected scaffolds are shown in red. These scaffolds were chosen on the basis of scaffold length and informative read coverage. Total number of scaffolds is ∼166k. The reason for selecting such a subset was anticipating more scaffolds closer to this region by more sequencing since they are sufficiently long and covered by reads. The purple dots illustrate very highly covered scaffolds, which are more likely to be repeat regions or small assembled haplotypes which are not integrated in the longer scaffolds due to the heterozygosity. It was predicted that these scaffolds will be mapped to the longer ones by enhancing the quality of assembly...... 97 6.2 Coverage distribution of genomic regions phased into triploid, tetraploid, pentaploid, and hexaploid during genome survey. Based on the genome survey data and primary assembly, genome regions have been phased into up to six haplotypes. The coverage of each phased region and number of phased haplotypes is summarized here. The peak coverage around 40 in hexaploid indicated the minimal sequencing depth requirement for haplotyping of hexaploid genome. The peak coverage shifting from triploid to hexaploid demonstrated the insufficient sequen- cing depth in the genome survey stage...... 98 6.3 The insert size distribution of sequenced paired-end libraries. (a) Insert size distribution of paired-end libraries A500, L500, and A1kb. (b) Insert size distribution of mate-pair libraries MP and AMP...... 99 6.4 Summary of variations in scaffolds in preliminary assembly. (a) Number of variations and length of all scaffolds. An isolated dot rep- resents the fully assembled genome of sweet potato endophyte, Bacillus pumilus. (b) High correlation (0.975) between “Number of variations” and “Scaffold length” after excluding the endophyte Bacillus pumilus genome. 100 6.5 A snapshot of mapping results of contigs against scaffolds. The large number of single nucleotide mismatches indicates the single nucle- otide polymorphism between homologous chromosomes in Ipomoea bata- tas. Cigar fields of these mapped scaffolds were listed as follows. (a) 4045M. (b) 33S44M1D136M1I10M6D149M15D21M1I1799M. (c) 579M. (d) 56M13D162M. (e) 15S2396M2I92M. (f) 733M. (g) 1260M. (h) 1014M1I162M1D136M. Among these, (b), (e), (g), and (h) are partially shown here...... 101 6.6 Summary of variations. (a) The total numbers of variation with dif- ferent numbers of variant alleles. On average 831,919,670/14,342,083 = 58 bp with one variant, which means short reads (100bp and 150bp) are informative for haplotyping. Npos, Number of positions; Nvar, Number of Variant. (b) Adjacent variation distance distribution peaked at 6 bp (red dashed line). Only 7% observed distances are longer than 150bp... 104 List of Figures 121

6.7 Haplotype length distribution at the nucleotide and coded allele levels. These plots show the distribution of haplotype lengths as well as the number of gaps in them. In the coded allele space plot (the right one), the two trends coincide as long as the number of SmPs is below ∼100 alleles and then diverge as the length of the haplotypes increases, so that the maximum number of SmPs for a normal haplotype is 11504, while it reduces to 1261 alleles when the gaps are removed. This indicates the effect of the paired-end and mate-pair reads, which connect distant haplotypes. The divergence in nucleotide space is not as pronounced as in coded allele space due to the availability of interpolymorphic regions...... 105
6.8 Haplotype evaluation by 454 reads. x-axis: ‘Match’ is the number of coinciding polymorphic sites between a haplotype and a 454 read. y-axis: ‘Mismatch’ is the number of differing polymorphic sites in the overlap. Color indicates the frequency of the respective (haplotype, 454 read) pairs, ranging from red to purple (on an exponential scale). (a) Evaluation of haplotypes. (b) The part with fewer than 6 mismatches (97.25% of the total number of overlaps). 64.78% of the overlaps between haplotypes and 454 reads are identical at variant loci (y = 0). Even with a strict mismatch threshold, a large fraction of 454 reads support the haplotypes reconstructed from short reads...... 106
6.9 The effect of the haplo-scaffolder. The reference accumulates the haplotypes’ most frequent alleles. For scaffold I, it is shown that the reference sequence may not be similar to any of the haplotypes. Two reads, which were rejected from mapping due to the mismatch threshold, are now mapped to the haplotypes (red and green haplotypes). The other reads of the pairs are mapped to scaffold J. These two newly extracted connections between scaffolds I and J can be used for haplo-scaffolding...... 107
6.10 Insert length distribution of the MP library. This library shows two peaks, one very short and one at 1.7kb, and a wide range of inserts up to ∼12kb...... 108
6.11 The insert length distribution of the MP library. The insert sizes on short contigs (< 1000) are shown in red, which explains the first peak in Figure 6.10-a. After removing the inserts mapped to small scaffolds, the insert size distribution has a single peak...... 108
6.12 Insert length distribution of the AMP library. One pronounced peak for very short inserts and two minor peaks at 3kb and 15kb...... 109
6.13 Haplo-scaffolding results. The upper left plot shows the improvement of the N50 obtained by applying the different libraries. The lower left plot indicates the maximum scaffold length in each assembly. The upper right plot presents the histogram of the number of scaffolds after applying each library. The lower right plot shows the number of connections extracted by the haplo-scaffolder and the number of connections that are useful for haplo-scaffolding...... 110

List of Tables

5.1 Usability of methods...... 92

6.1 Sequence library characteristics...... 97
6.2 The result of self-to-self BLAT on all scaffolds, applying three strategies and two thresholds. This table summarizes the number and total length of removed scaffolds. Overlap shows the length of the overlapping sequence between scaffolds, and Identity indicates how similar the overlapping sequences are...... 102


List of Algorithms

1 Levy et al. algorithm...... 33

2 HapCut algorithm...... 35

3 ReFHap algorithm...... 38

4 HapCompass algorithm...... 46

5 Ranbow algorithm...... 73


Abbreviations

DASE  Differential allele-specific expression
FPT   Fixed Parameter Tractable algorithm
GWAS  Genome Wide Association Study
HMM   Hidden Markov Model
HMW   High Molecular Weight
LD    Linkage Disequilibrium
MEC   Minimum Error Correction
MFR   Minimum Fragment Removal
NGS   Next Generation Sequencing
RNA   RiboNucleic Acid
SIH   Single Individual Haplotyping
SMRT  Single Molecule Real-Time
SNP   Single Nucleotide Polymorphism
SNV   Single Nucleotide Variation
WGS   Whole Genome Sequencing
wMEC  weighted Minimum Error Correction


Bibliography

[1] Rodger Staden. A strategy of DNA sequencing employing computer programs. Nucleic acids research, 6(7):2601–2610, 1979.

[2] Frederick Sanger, AR Coulson, T Friedmann, GM Air, BG Barrell, NL Brown, JC Fiddes, CA Hutchison III, PM Slocombe, and M Smith. The nucleotide sequence of bacteriophage ϕx174. Journal of molecular biology, 125(2):225–246, 1978.

[3] Robert D Fleischmann, Mark D Adams, Owen White, Rebecca A Clayton, Ewen F Kirkness, Anthony R Kerlavage, Carol J Bult, Jean-Francois Tomb, Brian A Dougherty, Joseph M Merrick, et al. Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science, 269(5223):496–512, 1995.

[4] Mark D Adams, Susan E Celniker, Robert A Holt, Cheryl A Evans, Jeannine D Gocayne, Peter G Amanatides, Steven E Scherer, Peter W Li, Roger A Hoskins, Richard F Galle, et al. The genome sequence of Drosophila melanogaster. Science, 287(5461):2185–2195, 2000.

[5] J Craig Venter, Mark D Adams, Eugene W Myers, Peter W Li, Richard J Mural, Granger G Sutton, Hamilton O Smith, Mark Yandell, Cheryl A Evans, Robert A Holt, et al. The sequence of the human genome. Science, 291(5507):1304–1351, 2001.

[6] R. C. Vrijenhoek. Polyploid hybrids: multiple origins of a treefrog species. Current Biology, 16(7):R245–247, Apr 2006.

[7] T. Yabe, X. Ge, and F. Pelegri. The zebrafish maternal-effect gene cellular atoll encodes the centriolar component sas-6 and defects in its paternal function promote whole genome duplication. Dev. Biol., 312(1):44–60, Dec 2007.


[8] Sergej Nowoshilow, Siegfried Schloissnig, Ji-Feng Fei, Andreas Dahl, Andy WC Pang, Martin Pippel, Sylke Winkler, Alex R Hastie, George Young, Juliana G Roscito, et al. The axolotl genome and the evolution of key tissue formation regulators. Nature, 554(7690):50, 2018.

[9] Potato Genome Sequencing Consortium et al. Genome sequence and analysis of the tuber crop potato. Nature, 475(7355):189, 2011.

[10] Rachel Brenchley, Manuel Spannagl, Matthias Pfeifer, Gary LA Barker, Rosalinda D’Amore, Alexandra M Allen, Neil McKenzie, Melissa Kramer, Arnaud Kerhornou, Dan Bolser, et al. Analysis of the bread wheat genome using whole-genome shotgun sequencing. Nature, 491(7426):705, 2012.

[11] Jun Yang, M-Hossein Moeinzadeh, Heiner Kuhl, Johannes Helmuth, Peng Xiao, Stefan Haas, Guiling Liu, Jianli Zheng, Zhe Sun, Weijuan Fan, et al. Haplotype-resolved sweet potato genome traces back its hexaploidization history. Nature Plants, 3(9):696, 2017.

[12] Manuel Gonzalo Claros, Rocío Bautista, Darío Guerrero-Fernández, Hicham Benzerki, Pedro Seoane, and Noé Fernández-Pozo. Why assembling plant genome sequences is so challenging. Biology, 1(2):439–459, 2012.

[13] Yves Van de Peer, Eshchar Mizrachi, and Kathleen Marchal. The evolutionary significance of polyploidy. Nature Reviews Genetics, 18(7):411, 2017.

[14] Jared C Roach, Gustavo Glusman, Arian FA Smit, Chad D Huff, Robert Hubley, Paul T Shannon, Lee Rowen, Krishna P Pant, Nathan Goodman, Michael Bamshad, et al. Analysis of genetic inheritance in a family quartet by whole-genome sequencing. Science, 328(5978):636–639, 2010.

[15] Erez Lieberman-Aiden, Nynke L Van Berkum, Louise Williams, Maxim Imakaev, Tobias Ragoczy, Agnes Telling, Ido Amit, Bryan R Lajoie, Peter J Sabo, Michael O Dorschner, et al. Comprehensive mapping of long-range interactions reveals folding principles of the human genome. Science, 326(5950):289–293, 2009.

[16] Robert R Graham, Chieko Kyogoku, Snaevar Sigurdsson, Irina A Vlasova, Leela RL Davies, Emily C Baechler, Robert M Plenge, Thearith Koeuth, Ward A Ortmann, Geoffrey Hom, et al. Three functional variants of IFN regulatory factor 5 (IRF5) define risk and protective haplotypes for human lupus. Proceedings of the National Academy of Sciences, 104(16):6758–6763, 2007.

[17] Robert R Graham, Chris Cotsapas, Leela Davies, Rachel Hackett, Christopher J Lessard, Joanlise M Leon, Noel P Burtt, Candace Guiducci, Melissa Parkin, Casey Gates, et al. Genetic variants near TNFAIP3 on 6q23 are associated with systemic lupus erythematosus. Nature genetics, 40(9):1059, 2008.

[18] Connie M Drysdale, Dennis W McGraw, Catharine B Stack, J Claiborne Stephens, Richard S Judson, Krishnan Nandabalan, Kevin Arnold, Gualberto Ruano, and Stephen B Liggett. Complex promoter and coding region β2-adrenergic receptor haplotypes alter receptor expression and predict in vivo responsiveness. Proceedings of the National Academy of Sciences, 97(19):10483–10488, 2000.

[19] Stephan Schiffels and Richard Durbin. Inferring human population size and separation history from multiple genome sequences. Nature genetics, 46(8):919, 2014.

[20] Andrew G Clark. Inference of haplotypes from PCR-amplified samples of diploid populations. Molecular biology and evolution, 7(2):111–122, 1990.

[21] Sharon R Browning and Brian L Browning. Haplotype phasing: existing methods and new developments. Nature Reviews Genetics, 12(10):703, 2011.

[22] 1000 Genomes Project Consortium et al. A global reference for human genetic variation. Nature, 526(7571):68, 2015.

[23] Mark A Batzer and Prescott L Deininger. Alu repeats and human genomic diversity. Nature reviews genetics, 3(5):370, 2002.

[24] Todd J Treangen and Steven L Salzberg. Repetitive DNA and next-generation sequencing: computational challenges and solutions. Nature Reviews Genetics, 13 (1):36, 2012.

[25] Matthew Pendleton, Robert Sebra, Andy Wing Chun Pang, Ajay Ummat, Oscar Franzen, Tobias Rausch, Adrian M Stütz, William Stedman, Thomas Anantharaman, Alex Hastie, et al. Assembly and diploid architecture of an individual human genome via single-molecule technologies. Nature methods, 12(8):780, 2015.

[26] Chen-Shan Chin, Paul Peluso, Fritz J Sedlazeck, Maria Nattestad, Gregory T Concepcion, Alicia Clum, Christopher Dunn, Ronan O’Malley, Rosa Figueroa-Balderas, Abraham Morales-Cruz, et al. Phased diploid genome assembly with single-molecule real-time sequencing. Nature methods, 13(12):1050, 2016.

[27] https://emea.illumina.com/systems/sequencing-platforms.html.

[28] Rudi Cilibrasi, Leo Van Iersel, Steven Kelk, and John Tromp. On the complexity of several haplotyping problems. In International Workshop on Algorithms in Bioinformatics, pages 128–139. Springer, 2005.

[29] Alessandro Panconesi and Mauro Sozio. Fast hare: A fast heuristic for single individual SNP haplotype reconstruction. In International workshop on algorithms in bioinformatics, pages 266–277. Springer, 2004.

[30] Eugene W Myers, Granger G Sutton, Art L Delcher, Ian M Dew, Dan P Fasulo, Michael J Flanigan, Saul A Kravitz, Clark M Mobarry, Knut HJ Reinert, Karin A Remington, et al. A whole-genome assembly of Drosophila. Science, 287(5461):2196–2204, 2000.

[31] Samuel Levy, Granger Sutton, Pauline C Ng, Lars Feuk, Aaron L Halpern, Brian P Walenz, Nelson Axelrod, Jiaqi Huang, Ewen F Kirkness, Gennady Denisov, et al. The diploid genome sequence of an individual human. PLoS Biology, 5(10):e254, 2007.

[32] Vikas Bansal and Vineet Bafna. HapCut: an efficient and accurate algorithm for the haplotype assembly problem. Bioinformatics, 24(16):i153–i159, 2008.

[33] Harry R Lewis. Computers and intractability: A guide to the theory of NP-completeness, 1983.

[34] Jorge Duitama, Gayle K. McEwen, Thomas Huebsch, Stefanie Palczewski, Sabrina Schulz, Kevin Verstrepen, Eun-Kyung Suk, and Margret R. Hoehe. Fosmid-based whole genome haplotyping of a hapmap trio child: evaluation of single individual haplotyping techniques. Nucleic Acids Research, 40(5):2041–2053, 2012. doi: 10.1093/nar/gkr1042. URL http://dx.doi.org/10.1093/nar/gkr1042.

[35] URL http://slideplayer.com/slide/5071817/.

[36] Ryan Tewhey, Vikas Bansal, Ali Torkamani, Eric J Topol, and Nicholas J Schork. The importance of phase information for human genomics. Nature Reviews Genetics, 12(3):215, 2011.

[37] Wen-Yun Yang, Farhad Hormozdiari, Zhanyong Wang, Dan He, Bogdan Pasaniuc, and Eleazar Eskin. Leveraging reads that span multiple single nucleotide polymorphisms for haplotype inference from sequencing data. Bioinformatics, 29(18):2245–2252, 2013.

[38] Na Li and Matthew Stephens. Modeling linkage disequilibrium and identifying recombination hotspots using single-nucleotide polymorphism data. Genetics, 165 (4):2213–2233, 2003.

[39] Emily Berger, Deniz Yorukoglu, Jian Peng, and Bonnie Berger. HapTree: A novel Bayesian framework for single individual polyplotyping using NGS data. PLoS Computational Biology, 10(3):e1003502, 2014.

[40] Emily Berger, Deniz Yorukoglu, and Bonnie Berger. HapTree-X: An integrative Bayesian framework for haplotype reconstruction from transcriptome and genome sequencing data. In International Conference on Research in Computational Molecular Biology, pages 28–29. Springer, 2015.

[41] H Christina Fan, Jianbin Wang, Anastasia Potanina, and Stephen R Quake. Whole-genome molecular haplotyping of single cells. Nature biotechnology, 29(1):51, 2011.

[42] Grace XY Zheng, Billy T Lau, Michael Schnall-Levin, Mirna Jarosz, John M Bell, Christopher M Hindson, Sofia Kyriazopoulou-Panagiotopoulou, Donald A Masquelier, Landon Merrill, Jessica M Terry, et al. Haplotyping germline and cancer genomes with high-throughput linked-read sequencing. Nature biotechnology, 34(3):303, 2016.

[43] Jacob O Kitzman. Haplotypes drop by drop. Nature biotechnology, 34(3):296, 2016.

[44] David Porubský, Ashley D Sanders, Niek van Wietmarschen, Ester Falconer, Mark Hills, Diana CJ Spierings, Marianna R Bevova, Victor Guryev, and Peter M Lansdorp. Direct chromosome-length haplotyping by single-cell sequencing. Genome research, 26(11):1565–1574, 2016.

[45] Volodymyr Kuleshov. Probabilistic single-individual haplotyping. Bioinformatics, 30(17):i379–i385, 2014.

[46] Sergey Koren, Michael C Schatz, Brian P Walenz, Jeffrey Martin, Jason T Howard, Ganeshkumar Ganapathy, Zhong Wang, David A Rasko, W Richard McCombie, Erich D Jarvis, et al. Hybrid error correction and de novo assembly of single-molecule sequencing reads. Nature biotechnology, 30(7):693, 2012.

[47] Derek Aguiar and Sorin Istrail. Haplotype assembly in polyploid genomes and identical by descent shared tracts. Bioinformatics, 29(13):i352–i360, 2013.

[48] Shreepriya Das and Haris Vikalo. SDhaP: haplotype assembly for diploids and polyploids via semi-definite programming. BMC genomics, 16(1):260, 2015.

[49] Minzhu Xie, Qiong Wu, Jianxin Wang, and Tao Jiang. H-PoP and H-PoPG: heuristic partitioning algorithms for single individual haplotyping of polyploids. Bioinformatics, 32(24):3735–3744, 2016.

[50] Artem S Kasianov, Anna V Klepikova, Ivan V Kulakovskiy, Evgeny S Gerasimov, Anna V Fedotova, Elizaveta G Besedina, Alexey S Kondrashov, Maria D Logacheva, and Aleksey A Penin. High quality genome assembly of Capsella bursa-pastoris reveals asymmetry of regulatory elements at early stages of polyploid genome evolution. The Plant Journal, 2017.

[51] L. Janin. Enhanced artificial genome engine: next generation sequencing reads simulator. GitHub repository: https://github.com/sequencing/EAGLE, commit: a3215138846ca3bd969093214163ede015835a10, 2014.

[52] Xuesong Hu, Jianying Yuan, Yujian Shi, Jianliang Lu, Binghang Liu, Zhenyu Li, Yanxiang Chen, Desheng Mu, Hao Zhang, Nan Li, et al. pIRS: Profile-based Illumina pair-end reads simulator. Bioinformatics, 28(11):1533–1535, 2012.

[53] Kittipat Ukoskit and Paul G Thompson. Autopolyploidy versus allopolyploidy and low-density randomly amplified polymorphic DNA linkage maps of sweetpotato. Journal of the American Society for Horticultural Science, 122(6):822–828, 1997.

[54] Albert Kriegner, Jim Carlos Cervantes, Kornel Burg, Robert OM Mwanga, and Dapeng Zhang. A genetic linkage map of sweetpotato [Ipomoea batatas (L.) Lam.] based on AFLP markers. Molecular Breeding, 11(3):169–185, 2003.

[55] Yu Peng, Henry C. M. Leung, S. M. Yiu, and Francis Y. L. Chin. IDBA-UD: a de novo assembler for single-cell and metagenomic sequencing data with highly uneven depth. Bioinformatics, 28(11):1420–1428, 2012. doi: 10.1093/bioinformatics/bts174. URL http://dx.doi.org/10.1093/bioinformatics/bts174.

[56] Marcel Margulies, Michael Egholm, William E Altman, Said Attiya, Joel S Bader, Lisa A Bemben, Jan Berka, Michael S Braverman, Yi-Ju Chen, Zhoutao Chen, et al. Genome sequencing in microfabricated high-density picolitre reactors. Nature, 437 (7057):376, 2005.

[57] W James Kent. BLAT—the BLAST-like alignment tool. Genome research, 12(4): 656–664, 2002.

[58] Peggy Ozias-Akins and Robert L Jarret. Nuclear DNA content and ploidy levels in the genus Ipomoea. Journal of the American Society for Horticultural Science, 119(1):110–115, 1994.

[59] Martin I Krzywinski, Jacqueline E Schein, Inanc Birol, Joseph Connors, Randy Gascoyne, Doug Horsman, Steven J Jones, and Marco A Marra. Circos: an information aesthetic for comparative genomics. Genome research, 2009.

[60] Rei Kajitani, Kouta Toshimoto, Hideki Noguchi, Atsushi Toyoda, Yoshitoshi Ogura, Miki Okuno, Mitsuru Yabana, Masayuki Harada, Eiji Nagayasu, Haruhiko Maruyama, et al. Efficient de novo assembly of highly heterozygous genomes from whole-genome shotgun short reads. Genome research, 24(8):1384–1395, 2014.

[61] Ruibang Luo, Binghang Liu, Yinlong Xie, Zhenyu Li, Weihua Huang, Jianying Yuan, Guangzhu He, Yanxiang Chen, Qi Pan, Yunjie Liu, Jingbo Tang, Gengxiong Wu, Hao Zhang, Yujian Shi, Yong Liu, Chang Yu, Bo Wang, Yao Lu, Changlei Han, David W Cheung, Siu-Ming Yiu, Shaoliang Peng, Zhu Xiaoqian, Guangming Liu, Xiangke Liao, Yingrui Li, Huanming Yang, Jian Wang, Tak-Wah Lam, and Jun Wang. SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler. GigaScience, 1(1):1–6, 2012. doi: 10.1186/2047-217X-1-18. URL http://dx.doi.org/10.1186/2047-217X-1-18.

[62] Hongzhi Cao, Honglong Wu, Ruibang Luo, Shujia Huang, Yuhui Sun, Xin Tong, Yinlong Xie, Binghang Liu, Hailong Yang, Hancheng Zheng, et al. De novo assembly of a haplotype-resolved human genome. Nature biotechnology, 33(6):617, 2015.

[63] E. Garrison and G. Marth. Haplotype-based variant detection from short-read sequencing. Preprint at http://arxiv.org/abs/1207.3907v2, 2012.

[64] Marten Boetzer, Christiaan V Henkel, Hans J Jansen, Derek Butler, and Walter Pirovano. Scaffolding pre-assembled contigs using SSPACE. Bioinformatics, 27(4): 578–579, 2010.

[65] Sudhir Kumar, Glen Stecher, Daniel Peterson, and Koichiro Tamura. MEGA-CC: computing core of molecular evolutionary genetics analysis program for automated and iterative data analysis. Bioinformatics, 28(20):2685–2686, 2012.

Appendix A

Ranbow user manual

Ranbow is a haplotype assembler for polyploid genomes. It was initially developed for the haplotype assembly of the sweet potato genome, which is hexaploid and highly heterozygous, but it can also be applied to other polyploid genomes. After one round of haplotype assembly, Ranbow utilizes the assembled haplotypes to correct the called variants. Moreover, it can infer the evolutionary history of the organism from the computed haplotypes. Ranbow has three main modes of operation:

ranbow hap: for haplotyping
ranbow eval: for evaluating the assembled haplotypes against gold-standard (long) reads
ranbow phylo: for the phylogenetic analysis

Availability

You can download the source code and the toy example files here:

https://www.molgen.mpg.de/ranbow

The following video explains how you can simply run Ranbow on the toy example:

https://youtu.be/2zZ5IfpuGZQ

Dependencies

The code is implemented in Python 2.7.13 and can be run from a Linux shell. To run Ranbow, the following libraries and tools must be available:

Python libraries

• pysam 0.10.0

• numpy 1.12.1

Tools

• samtools 0.1.19-44428cd

A.1 Data preparation

We generated a random list of scaffolds from the sweet potato genome in order to provide an input dataset as a toy example. Then the fasta, bam, and vcf files were extracted for these scaffolds. The data files for the selected scaffolds are named ‘sel.bam’, ‘sel.vcf’, and ‘sel.fasta’. These files are generated by ‘toy_data.sh’, which is also available in the toy folder.

It is worth mentioning that Ranbow can be applied to whole scaffolds or only to selected regions of them. This feature allows better parallelization as well as phasing of targeted intervals in the genome.
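As a concrete example, a scaffold list (used later for the -selectedScf parameter) can be generated directly from the fasta file. This is only a sketch; it assumes that the scaffold list simply holds one scaffold name per line, which is not specified by the manual itself.

# Extract all scaffold names from the toy fasta file into a list of scaffolds to phase.
# Assumption: one scaffold name per line is the expected format of the list.
grep '^>' sel.fasta | sed 's/^>//; s/[[:space:]].*//' > scaffolds.list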

A.2 Ranbow for haplotype assembly

Ranbow hap has four sub-modes. For simplicity we refer to them as different modes, namely index, hap, collect, and modVCF. The mode of choice is selected with the -mode parameter on the command line:

python2.7.13 ranbow.py hap -mode index
python2.7.13 ranbow.py hap -mode hap
python2.7.13 ranbow.py hap -mode collect
python2.7.13 ranbow.py hap -mode modVCF

The index mode generates the index files needed for random access to chromosomes and scaffolds in case of parallelization. After running Ranbow in index and hap mode, Ranbow should be run in collect mode. The collect mode collects the haplotypes assembled by the different processors and generates the final fasta file, the bam file (aligned haplotypes), and the hap-formatted file. The bam file is already sorted and indexed and is ready to be loaded into the IGV browser for downstream analysis. The hap-formatted file includes haplotype information such as name, start position, haplotype sequence, number of error corrections, haplotype quality, and supporting fragments. These fields are explained in more detail as follows.

Haplotype name: The name includes the scaffold/chromosome name and an index number at the end. For example, if the ploidy is six, the following haplotypes belong to the haplotype block indexed by 4 in chromosome 1: chr1_hap4_0, chr1_hap4_1, chr1_hap4_2, chr1_hap4_3, chr1_hap4_4, and chr1_hap4_5.

Haplotype sequence: The sequence of alleles, in which the alleles are coded as numbers in the [0, ploidy) interval. For sweet potato they are coded as ‘0’ to ‘5’.

Number of error corrections: Each haplotype has a number of supporting fragments. This field represents the number of mismatches among these fragments.

Quality of haplotype: The supporting fragment coverage at each position of the haplotype. The length of the quality string equals the length of the haplotype.

Supporting fragments: A full list of the supporting fragments.

To run Ranbow, the parameters in the following table need to be adjusted. Some of these parameters have to be passed on the command line (as mentioned below), while the others can also be passed via a parameter file. The compulsory parameters are as follows. Here is the content of the parameter file used in this toy example:

-ploidy 6
-noProcessor 4
-bamFile <path_to_folder>/RANBOW/toy/sel.bam
-refFile <path_to_folder>/RANBOW/toy/sel.fasta
-vcfFile <path_to_folder>/RANBOW/toy/sel.vcf
-selectedScf <path_to_folder>/RANBOW/toy/scaffolds.list
-outputFolderBase <path_to_folder>/RANBOW/toy/result

For simplicity, the data files are stored in one folder (named ‘RANBOW/toy’) and the parameters are set in the ‘hap.params’ file. The parameter file is passed to Ranbow through the -par argument:

python2.7.13 ranbow.py hap -par RANBOW/toy/hap.params

Here is the list of files in the folder:

RANBOW/toy > ls -lh
total 30M
-rw-r----- 1 moeinzad moeinzad  426 Mar 31 10:33 hap.params
-rw-r----- 1 moeinzad moeinzad  641 Mar 30 09:51 scaffolds.list
-rw-r----- 1 moeinzad moeinzad  29M Mar 30 11:21 sel.bam
-rw-r----- 1 moeinzad moeinzad 133K Mar 30 11:21 sel.fasta
-rw-r----- 1 moeinzad moeinzad 1.1M Mar 30 11:21 sel.vcf
-rw-r----- 1 moeinzad moeinzad  405 Mar 30 11:18 toy_data.sh
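The ‘hap.params’ file can also be created directly from the shell, for instance with a heredoc. This is only a sketch using the toy-example parameter values shown above; it assumes that each parameter is written on its own line, and <path_to_folder> has to be replaced by the actual location of the RANBOW folder.

# Write the toy-example parameter file (assumption: one parameter per line).
cat > RANBOW/toy/hap.params << 'EOF'
-ploidy 6
-noProcessor 4
-bamFile <path_to_folder>/RANBOW/toy/sel.bam
-refFile <path_to_folder>/RANBOW/toy/sel.fasta
-vcfFile <path_to_folder>/RANBOW/toy/sel.vcf
-selectedScf <path_to_folder>/RANBOW/toy/scaffolds.list
-outputFolderBase <path_to_folder>/RANBOW/toy/result
EOF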

A.2.1 Indexing input files (mode: index)

This mode generates index files for the bam, vcf, and fasta files if they do not exist yet:

python2.7.13 ranbow.py hap -mode index -par RANBOW/toy/hap.params

Then the index files are generated and listed as follows:

RANBOW/toy > ls -lht
total 30M
drwxr-x--- 2 moeinzad moeinzad   10 Mar 31 10:58 result
-rw-r----- 1 moeinzad moeinzad  844 Mar 31 10:58 sel.vcf.index
-rw-r----- 1 moeinzad moeinzad 225K Mar 31 10:58 sel.bam.bai
-rw-r----- 1 moeinzad moeinzad 1.2K Mar 31 10:58 sel.fasta.fai
-rw-r----- 1 moeinzad moeinzad  426 Mar 31 10:33 hap.params
-rw-r----- 1 moeinzad moeinzad  29M Mar 30 11:21 sel.bam
-rw-r----- 1 moeinzad moeinzad 1.1M Mar 30 11:21 sel.vcf
-rw-r----- 1 moeinzad moeinzad 133K Mar 30 11:21 sel.fasta
-rw-r----- 1 moeinzad moeinzad  405 Mar 30 11:18 toy_data.sh
-rw-r----- 1 moeinzad moeinzad  641 Mar 30 09:51 scaffolds.list

A.2.2 Run Ranbow on computer farm

The -noProcessor parameter should be set according to the number of available processors. For example, if the code is run on a cluster with 200 cores, -noProcessor can be set to 200. Then 200 independent jobs will be executed, each on a different set of chromosomes, chromosome parts, or scaffolds. These 200 jobs are independent and can be run on different machines. For the sake of simplicity, in the toy example we run the code on three cores.

The -processorIndex parameter is compulsory on the command line if the number of processors is more than one (-noProcessor ≥ 2). Otherwise, -noProcessor is set to one and -processorIndex is set to zero by default, meaning that the output is generated with one processor and the result is written to a folder named ‘0’. Here, the standard output for the first processor (-processorIndex = 0 and -noProcessor = 3) is shown:

python2.7.13 ranbow.py hap -mode hap -par hap.params -processorIndex 0
scaffold4183|size30906  readBAM 0:00:12.360045  Ranbow single 0:00:51.672871  single2file 0:00:00.055484
scaffold9406|size6143   readBAM 0:00:01.335690  Ranbow single 0:00:00.000836  single2file 0:00:00.000018
scaffold13332|size3918  readBAM 0:00:03.531090  Ranbow single 0:00:00.162100  single2file 0:00:00.002223
scaffold16724|size3385  readBAM 0:00:01.684177  Ranbow single 0:00:00.201964  single2file 0:00:00.003162

The running time for each part of the haplotyping is reported in the standard output. The haplotyping result will appear in a subfolder of -outputFolderBase named after the -processorIndex. For example, since we generated haplotypes for -processorIndex 0, the only generated folder is called ‘0’:

RANBOW/toy/result > ls
0

This folder contains three files as follows:

RANBOW/toy/result/0 > ls -lh
total 964K
-rw-r----- 1 moeinzad moeinzad  49K Mar 31 11:01 ranbow.single.hap
-rw-r----- 1 moeinzad moeinzad 113K Mar 31 11:01 ranbow.single.hap.fasta
-rw-r----- 1 moeinzad moeinzad 234K Mar 31 11:01 ranbow.single.hap.bam

The hap format shows the connectivity between alleles in haplotype segments, the quality of the alleles in the assembled haplotype (the number of supporting reads for each allele), and, in general, all information about the assembled haplotype in coded allele space. Coded allele space means that the sequence shared between alleles is ignored and the alleles are encoded as numbers. The fasta and bam files for the aligned haplotypes are generated in this folder as well.

To run the code in parallel on one machine, the following command can be executed:

for i in {0..2}
do
  python2.7.13 ranbow.py hap -mode hap -par hap.params -processorIndex $i > $i.log &
done

The standard output of each processor is collected in the ‘0.log’, ‘1.log’, and ‘2.log’ files.

It is also possible to run Ranbow partly on one machine and partly on another. Following our toy example, set 0 (-processorIndex = 0) and set 1 (-processorIndex = 1) are executed on machine A, and set 2 (-processorIndex = 2) is executed on machine B.

Machine A:

for i in {0..1}
do
  python2.7.13 ranbow.py hap -mode hap -par hap.params -processorIndex $i > $i.log &
done

Machine B:

for i in {2..2}
do
  python2.7.13 ranbow.py hap -mode hap -par hap.params -processorIndex $i > $i.log &
done
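Equivalently, on a compute cluster with a batch scheduler, each -processorIndex can be submitted as one array task. The following is only a minimal sketch assuming a SLURM scheduler; SLURM is not required by Ranbow, and any scheduler that exposes a task index can be used in the same way.

#!/bin/bash
#SBATCH --array=0-2
# Each array task runs one independent Ranbow job on its own set of scaffolds;
# the SLURM task index plays the role of -processorIndex.
python2.7.13 ranbow.py hap -mode hap -par hap.params \
    -processorIndex ${SLURM_ARRAY_TASK_ID} > ${SLURM_ARRAY_TASK_ID}.log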

The three generated folders are listed as follows:

RANBOW/toy/result > ls -lh
total 12K
drwxr-x--- 2 moeinzad moeinzad 4.0K Mar 31 11:00 0
drwxr-x--- 2 moeinzad moeinzad 4.0K Mar 31 11:03 1
drwxr-x--- 2 moeinzad moeinzad 4.0K Mar 31 11:03 2

A.2.3 Collecting data (mode: collect)

When all jobs have finished, -mode collect can be executed to collect the haplotypes from the different machines:

python2.7.13 ranbow.py hap -mode collect -par hap.params
number of folders: 3
[samopen] SAM header is present: 28 sequences.
[samopen] SAM header is present: 28 sequences.

Then the generated files are as follows:

RANBOW/toy/result > ls -lht
total 1.2M
-rw-r----- 1 moeinzad moeinzad 359K Mar 31 11:05 ranbow.single.hap.fasta
-rw-r----- 1 moeinzad moeinzad 132K Mar 31 11:05 ranbow.single.hap
-rw-r----- 1 moeinzad moeinzad 2.3K Mar 31 11:05 ranbow.single.hap.bam.bai
-rw-r----- 1 moeinzad moeinzad  48K Mar 31 11:05 ranbow.single.hap.bam
drwxr-x--- 2 moeinzad moeinzad 4.0K Mar 31 11:03 2
drwxr-x--- 2 moeinzad moeinzad 4.0K Mar 31 11:03 1
drwxr-x--- 2 moeinzad moeinzad 4.0K Mar 31 11:00 0

For this toy example, the generated ‘ranbow.single.hap.bam’ file is ready to be loaded into the IGV viewer for further analysis.

A.2.4 Revising sequence variants (mode: modVCF)

The following command modifies and corrects the variants with the aid of the assembled haplotypes:

hap/RANBOW > python ranbow.py hap -mode modVCF -par params.txt

modified vcf file:
/hap/RANBOW/toy/sel.mod.vcf
log of modified SNPs:
/hap/RANBOW/toy/result/ranbow.modVCF.log
Modified SNPs: 853
Deleted SNPs: 49

Detailed information about the modifications can be found in:

/hap/RANBOW/toy/result/ranbow.modVCF.log
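A quick consistency check of the modified VCF is possible with standard command-line tools, for example by counting the variant records before and after the modification (file names as reported above):

# Count non-header records in the original and in the modified VCF.
grep -vc '^#' RANBOW/toy/sel.vcf
grep -vc '^#' RANBOW/toy/sel.mod.vcf
# If the 49 deleted SNPs are removed from the file, the second count is
# expected to be 49 lower than the first.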

A.2.5 Evaluation of the results

To evaluate the haplotyping accuracy, we use the trimmed Roche 454 reads mapped to the assembly. This file, or another mapped gold-standard read file, has to be passed with the -bamFileEval parameter. ranbow eval can also be executed in parallel. To obtain the evaluation result, the following steps need to be performed:

python2.7.13 ranbow.py eval -par hap.params -mode index

-mode run for parallelization:
python2.7.13 ranbow.py eval -par hap.params -mode run -processorIndex i

-mode collect for collecting the data:
python2.7.13 ranbow.py eval -par hap.params -mode collect

After running the above commands the following files will be generated:

toy/result > ls -lt eval/
total 28
-rw-r----- 1 moeinzad moeinzad 4561 Apr 16 18:33 result.sing
drwxr-x--- 2 moeinzad moeinzad   66 Apr 16 18:33 0
drwxr-x--- 2 moeinzad moeinzad   66 Apr 16 18:33 1
drwxr-x--- 2 moeinzad moeinzad   66 Apr 16 18:33 2
drwxr-x--- 2 moeinzad moeinzad   66 Apr 16 18:33 3

The ‘result.sing’ file contains the evaluation of the haplotypes assembled from the reads. The two columns give the ‘Match’ and ‘Mismatch’ counts of the overlaps between the phased haplotypes and the 454 reads. To investigate the overlaps further, one can look at the files named ‘/*/ranbow.single.eval’. The following lines, selected from the ‘/0/ranbow.single.eval’ file, serve as an example. Each line shows the similarity and dissimilarity of the overlap between a 454 read and an assembled haplotype in one block. The numbers in parentheses give the similarity and dissimilarity of the overlap. The arrow indicates which haplotype is assumed to be sampled from the same chromosome as the 454 read. For instance, (8, 1) indicates eight allele matches and one allele mismatch between the 454 read and the second assembled haplotype.

454      17 ------101010100------
illumina 10 00000000000000000000011 (5, 4)
         10 1000000101010110001110- (8, 1) <---
         12 --0001000001010001101-- (7, 2)
         12 --01-110100111010010--- (4, 5)
         11 -1101100011101101010011 (6, 3)
         11 -0010100110001110010--- (4, 5)
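Because ‘result.sing’ is a plain two-column listing, it can be summarised with standard tools. The sketch below assumes that each line holds the ‘Match’ and ‘Mismatch’ counts of one overlap as two whitespace-separated integers; the exact column layout should be checked against the file itself.

# Report the fraction of (haplotype, 454 read) overlaps without any mismatch.
awk '{ total++; if ($2 == 0) perfect++ }
     END { if (total > 0) printf "%d of %d overlaps (%.2f%%) are mismatch-free\n",
                                 perfect, total, 100 * perfect / total }' result.sing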

A.2.6 Ranbow phylo

To infer the evolutionary events of the organism with the aid of the haplotypes, Ranbow phylo can be employed to extract and tune the alignments from all phased regions and report the topology of the evolutionary phylogenetic tree. All the regions phased into ‘P’ haplotypes are used for constructing UPGMA (unweighted pair group method with arithmetic mean) trees with the MEGA-Computing Core [65]. Ranbow phylo can be executed in parallel mode as well; therefore, the files need to be indexed first:

python2.7.13 ranbow.py phylo -par hap.params -mode index

-mode run for parallelization:
python2.7.13 ranbow.py phylo -par hap.params -mode run -processorIndex i

For collecting the generated data:

python2.7.13 ranbow.py phylo -par hap.params -mode collect

The collect mode then reports the final statistics for the different phylogenetic tree topologies:

hap/RANBOW > python ranbow.py phylo -par params.txt -mode collect
7 ((((()))))
7 (((()())))
29 (((())()))
13 (((()))())
43 ((()())())
26 ((())(()))

Moreover, ranbow phylo -mode tree calculates the mutation rates on the tree branches and reports the relative branch lengths.

hap/RANBOW > python ranbow.py phylo -par params.txt -mode tree
Tree topology: (((A-B)I (C-D)J) M (E-F) N)
Branch Name   Mean relative distance   SD relative distance
A: 0.674418604651   0.813288765118
B: 0.674418604651   0.813288765118
I: 0.639534883721   0.721961003548
C: 0.732558139535   1.05295176025
D: 0.732558139535   1.05295176025
J: 0.581395348837   0.604539345654
M: 0.886627906977   1.60414025402
E: 0.779069767442   0.891186337505
F: 0.779069767442   0.891186337505
N: 1.0523255814   1.2325032897
Mutation rate for A-B and C-D branches: 0.00701902094469
Mutation rate for E-F branch: 0.00848135858809
hap/RANBOW >

Appendix B

Dataset Availability

There was no gold-standard real dataset for polyploid haplotyping; hence we constructed and introduce the following datasets from the sweet potato sequencing data. We release a real semi-ground-truth dataset for haplotype reconstruction of polyploid genomes. This dataset can be used for evaluating and comparing methods on real data. We used the trimmed Roche 454 reads, which were not used for de novo assembly and haplotyping, to evaluate the assembled haplotypes. Since the Roche 454 reads are limited in length, this dataset cannot be considered a full gold standard, which is why we call it a semi-ground-truth dataset. Most haplotypers use the haplotypes and the reads in coded allele space; hence, we converted the Illumina and Roche 454 reads into coded allele space. This saves developers the data conversion step, and it also reduces the size of the dataset and makes it easier to download. We released two datasets: a small one for the selected scaffolds, which were used in Chapter 4, and a big one which contains all of the scaffolds in the assembly. Here are the links to the datasets:

Full dataset
https://owww.molgen.mpg.de/~ranbow/454_all_scfs.frags.bz2
https://owww.molgen.mpg.de/~ranbow/all_scfs.frags.bz2

Selected dataset
https://owww.molgen.mpg.de/~ranbow/454_selected_scfs.frags.bz2
https://owww.molgen.mpg.de/~ranbow/selected_scfs.frags.bz2
https://owww.molgen.mpg.de/~ranbow/selected_scfs.genotypes.bz2

The instructions for using these datasets can be found in:
https://github.com/euginm/CreateFragmentMatrix
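The released files can be fetched and unpacked with standard tools, for example:

# Download and decompress the selected-scaffold dataset.
wget https://owww.molgen.mpg.de/~ranbow/454_selected_scfs.frags.bz2
wget https://owww.molgen.mpg.de/~ranbow/selected_scfs.frags.bz2
wget https://owww.molgen.mpg.de/~ranbow/selected_scfs.genotypes.bz2
bunzip2 454_selected_scfs.frags.bz2 selected_scfs.frags.bz2 selected_scfs.genotypes.bz2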


Appendix C

Short CV

For reasons of data protection, the curriculum vitae is not published in the electronic version


Appendix D

Zusammenfassung

Diese Dissertation widmet sich dem Problem der Rekonstruktion von Haplotypen in polyploiden Genomen und der Verwendung der Haplotypen für das de novo assembly dieser Genome. Der gewählte Ansatz stützt sich auf short read Sequenzierdaten des höchst heterozygoten hexaploiden Genoms der Süßkartoffel. Zunächst wird die Rolle der Heterozygosität und Ploidie im Kontext der Rekonstruktion von Haplotypen durch short reads untersucht. Höhere Heterozygosität macht mehr reads für die Rekonstruktion von Haplotypen nutzbar, während die Polyploidie das Zusammenfügen der reads in längere Sequenzen erschwert. Dieses Problem wird hier Ambiguity of Merging Fragments genannt und durch den beschriebenen Algorithmus Ranbow adressiert. Die Leistung von Ranbow wird mit Hilfe von realen und simulierten Datensätzen des tetraploiden Genoms des Hirtentäschelkrauts (Capsella bursa-pastoris) und des hexaploiden Genoms der Süßkartoffel (Ipomoea batatas) evaluiert. Der Vergleich mit anderen Methoden zeigt, dass man mit Ranbow die höchste Genauigkeit und die längsten Haplotypen erreicht. Anschließend wird eine Pipeline für das verbesserte de novo assembly des Süßkartoffelgenoms präsentiert, die die zuvor errechneten Haplotypen nutzt. Diese neue Methode, genannt haplo-scaffolders, deckt mit Hilfe der Haplotypen einen Satz an möglichen Verbindungen zwischen scaffolds auf, die zuvor durch die Unterschiede zwischen echten Haplotypen und der Referenzsequenz versteckt blieben. Diese Verbindungen werden aufgedeckt, indem die reads den Haplotypen zugeordnet werden und die Verbindungen auf das Referenzlevel übertragen werden. Der Prozess kann wiederholt werden, indem der scaffold-Satz aktualisiert wird, um das Genom assembly weiter zu verbessern. Es wird gezeigt, dass diese Strategie den N50-Wert und die maximale Scaffold-Länge des Süßkartoffelgenoms signifikant verbessert.

151

Appendix E

Selbstständigkeitserklärung

Hiermit erkläre ich, dass ich die vorliegende Arbeit selbstständig verfasst und keine anderen als die angegebenen Quellen und Hilfsmittel benutzt habe.

Diese Arbeit oder Teile davon sind noch nicht in einem früheren Promotionsverfahren eingereicht worden.

Berlin, 29. August 2018

...... (Mohammadhossein Moeinzadeh)

Glossary

Ambiguity of Merging Problem A characteristic of polyploid genomes which prevents the merging of two fragments with an identical shared overlap. 18, 53

Fragment A sequence of coded variants which are covered by an aligned read. 12, 51

Genotype A set of variants at a polymorphic site. 10

Genotype in coded allele space A genotype is an unordered set, and its illustration should reflect this property; hence, genotypes are shown as a set of numbers, each corresponding to one allele, separated by slashes. 13

Haplotype The set of variants that are located on a single chromosome. 7

Haplotype-Improved assembly An assembly that is improved via assembled haplotype information. 103

Haplotype Block A set of haplotype segments with the following properties: 1) it is initiated from one mask; 2) the haplotype segments are constructed from the supporting fragments of the seed sequences of the mask; 3) the haplotype segments are elongated by neighboring fragments. 57, 62

Haplotype segment A haplotype segment is a consensus sequence of a set of fragments. The fragments are assumed to originate from one haplotype. 53

Haplotypes in nucleotide space A haplotype shown as a sequence of nucleotides. 12

Haplotypes in coded allele space A sequence of coded alleles at heterozygous polymorphic sites. 10, 12

Heterozygous variant A sequence variant which is not identical among the homologous chromosomes of a single individual. 10

Homologous chromosomes A group of chromosomes with a similar length and centromere position, which possess similar genes at corresponding loci. 8


Homozygous variant A homozygous variant is a variant which is identical in all homologous chromosomes. 10

Indel An insertion or deletion smaller than 50 base pairs. 8

Interpolymorphic region A region between two consecutive polymorphic sites. The sequences of nucleotides in these regions are identical among homologous chromosomes. 8

Mask A mask is an ordered set of indices of polymorphic sites. 54

MEC Minimum Error Correction model. This model tries to cluster reads and correct their errors such that all reads in each cluster are compatible with each other. 30

Merge function A function that constructs a consensus sequence from a set of input fragments. 53

Minimum Error Correction This model tries to cluster reads and correct their errors such that all reads in each cluster are compatible with each other. 30

Missing allele This term, illustrated as ‘−’, is used when an allele is unknown in coded allele space. 13

Multinucleotide polymorphism Variants which are larger than one base pair and are equal in length. 8

non-SNP variants Multinucleotide variants and Indels. 76

Number of Inferred Haplotypes The number of haplotypes in a phased region. This number could be P, the ploidy of the organism, or fewer than P. The latter case could be due to the characteristics of the genome in the region or owing to a failure in sequencing. 89

Polymorphic site A genomic position whose corresponding alleles may vary among the chromosomes in the population. 8

Polymorphism The coexistence of different alleles in the individuals of the same population. 8

Seed sequences of a mask A seed sequence is a subsequence of a fragment excluding missing alleles. It indicates the sequence pattern in one fragment at the indices of the mask. 54

Sequence variant A variant that appears in one chromosome of one individual at a polymorphic site. It could be a single nucleotide variant, a multinucleotide variant, or an Indel. 10

SmP Small polymorphism, which could be a single nucleotide variant, a multinucleotide variant, or an Indel. 8

SNP Single nucleotide polymorphism, a polymorphic site containing a DNA sequence variation of a single base pair whose variant is observed in more than 1% of the individuals of the population. 8

Supporting fragment of a seed sequence A set of fragments which contribute to a seed sequence. 55

Supporting fragments A set of fragments that is predicted to stem from one haplotype and supports one assembled haplotype segment. 53