University of California Los Angeles

Combinatorial Algorithms for Haplotype Assembly

A dissertation submitted in partial satisfaction of the requirements for the degree Doctor of Philosophy in Computer Science

by

Sepideh Mazrouee

2017

© Copyright by
Sepideh Mazrouee
2017

Abstract of the Dissertation

Combinatorial Algorithms for Haplotype Assembly

by

Sepideh Mazrouee
Doctor of Philosophy in Computer Science
University of California, Los Angeles, 2017
Professor Wei Wang, Chair

Many phenotypes such as genetic disorders may be hereditary, while others may be influenced by the environment. However, some genetic disorders are due to new mutations in the individual's Deoxyribonucleic Acid (DNA). Diseases such as diabetes and specific types of cancer are examples of conditions that can be inherited or affected by lifestyle or genetic mutations. In order to investigate and predict the incidence of such diseases, the sequences of single individuals need to be examined. In the past decade, Next Generation Sequencing (NGS) technology has enabled us to generate DNA sequences of many organisms. Yet, reconstructing each copy of a chromosome remains an open research problem due to the computational challenges associated with processing massive amounts of DNA data and understanding the complex structure of such data for individual DNA phasing.

In this dissertation, I introduce several computational frameworks for understanding the complex structure of DNA sequence data to reconstruct chromosome copies in diploid and polyploid organisms. The methodologies presented in this dissertation span several areas of research including unsupervised learning, combinatorial optimization, graph partitioning, and association rule learning. The overarching theme of this research is the design and validation of novel combinatorial algorithms for fast and accurate haplotype assembly. The first two frameworks presented in this dissertation, called FastHap and ARHap, are tailored toward providing computationally simple diploid haplotyping with the objective of minimizing minimum error correction and switching error, respectively. I then introduce HapColor and PolyCluster, which aim to improve minimum error correction and switching error for polyploid haplotyping.

The dissertation of Sepideh Mazrouee is approved.

Eleazar Eskin

Jason Ernst

Jingyi Jessica Li

Wei Wang, Committee Chair

University of California, Los Angeles

2017

To my husband, Hassan, who has been the origin of my strength and passion and will always be. My deepest gratitude goes to my parents, Leila and Faryab, and my sister, Mahshid, for their love and support, and to my daughter, Lily, for all the hard times I put her through for so many years during my Ph.D. studies. I would like to express my sincere gratitude to my academic adviser, Professor Wei Wang, for her invaluable guidance and continuous support throughout my years at UCLA. I would also like to express my gratitude to Professor Eleazar Eskin, who has been a strong motivator and advocate for my research throughout my matriculation through the Ph.D. program. Their persistent pursuit of perfection and deep insight into various subjects have always been an inspiration to me. I also thank my committee members, Professor Jason Ernst and Professor Jessica Li, for kindly agreeing to be on my doctoral committee and for their helpful advice and suggestions. It is truly a privilege to be under the academic lineage of the foremost internet pioneers at UCLA.

Table of Contents

1 Introduction ...... 1

2 Haplotype Assembly in the Literature ...... 10

2.1 Approaches ...... 12

2.2 Evaluation Methods ...... 13

2.3 Ploidy Level ...... 14

3 Fast and Accurate Diploid Haplotyping ...... 16

3.1 Introduction ...... 16

3.1.1 Motivation ...... 16

3.1.2 Contributions and Summary of Results ...... 18

3.2 FastHap Framework ...... 20

3.2.1 Inter-Fragment Distance ...... 21

3.2.2 FastHap Graph Model ...... 23

3.2.3 Fragment Partitioning ...... 25

3.2.4 Refinement Phase ...... 27

3.2.5 Fragment Purging ...... 28

3.3 Validation ...... 28

3.3.1 Dataset ...... 29

3.3.2 Results ...... 31

3.4 Discussion and Conclusion ...... 34

4 Association Rule Learning for Diploid Haplotyping ...... 36

4.1 Introduction ...... 36

4.1.1 Contributions and Summary of Results ...... 38

4.2 Association Rule Learning ...... 42

4.2.1 Matrix Binarization ...... 43

4.2.2 SNP Association Rules ...... 46

4.2.3 Measures of Rule Interestingness ...... 47

4.2.4 Rule Generation Criteria ...... 49

4.2.4.1 Minimum Support Criterion ...... 50

4.2.4.2 Minimum Confidence Criterion ...... 53

4.2.4.3 Consequent Length Criterion ...... 55

4.3 Haplotype Reconstruction ...... 56

4.3.1 Overview of the Algorithm ...... 59

4.3.2 An Illustrative Example ...... 60

4.3.3 Dependency Graph ...... 61

4.3.4 Longest Attribute-Consistent Path ...... 65

4.4 Validation ...... 67

4.4.1 Dataset Preparation and Statistics ...... 67

4.4.2 Results on Simulated Data ...... 70

4.4.3 Results on HuRef Data ...... 74

4.5 Discussion ...... 74

5 Graph Coloring for Polyploid Haplotyping ...... 77

5.1 Introduction ...... 77

5.1.1 Contributions and Summary of Results ...... 78

5.2 Problem Statement ...... 79

5.2.1 Problem Formulation ...... 80

5.2.2 Graph Modeling ...... 81

5.2.3 Problem Complexity ...... 83

5.3 HapColor Algorithm ...... 87

5.4 Validation ...... 89

5.4.1 Polyploidy Datasets ...... 89

5.4.2 Comparative Analysis Approach ...... 90

5.4.3 Preparation of Simulated Data ...... 91

5.4.4 HapColor Performance on Simulated Data ...... 92

5.4.5 Comparative Analysis on Simulated Data ...... 94

5.4.6 Comparative Analysis on HuRef Data ...... 94

5.5 Discussion ...... 95

5.6 Discussion and Conclusion ...... 97

6 Correlation Clustering for Polyploid Phasing ...... 99

6.1 Introduction ...... 99

6.1.1 Contributions and Summary of Results ...... 100

6.2 Problem Statement ...... 101

6.2.1 Problem Definition ...... 104

6.2.2 Graph Modeling ...... 105

6.2.3 Problem Formulation ...... 107

6.3 PolyCluster Algorithm ...... 109

6.3.1 Initial Clustering ...... 112

6.3.2 Cluster Merging ...... 114

6.3.3 Algorithm Analysis ...... 115

6.4 Validation ...... 118

6.4.1 Dataset Preparation ...... 118

6.4.2 Dataset Statistics ...... 120

6.4.3 Impact of Sequencing Confidence Scores ...... 122

6.4.4 PolyCluster Performance ...... 124

6.4.5 Comparative Analysis Approach ...... 126

6.4.6 Comparative Analysis Results ...... 128

6.4.6.1 Switching Error (SWE) ...... 128

6.4.6.2 Minimum Error Correction (MEC) ...... 129

6.4.6.3 Running Time ...... 131

6.5 Discussion and Conclusion ...... 133

7 Conclusions and Future Directions ...... 135

7.1 Summary of Contributions ...... 135

7.2 Challenges and Future Work ...... 138

References ...... 141

List of Figures

1.1 An illustration of the diploid haplotyping process. Ten short reads, denoted by f1 to f10 in (b), are generated from the two chromosome copies shown in (a). Only SNP sites, denoted by S1 to S8, are used for haplotype assembly. ...... 3

3.1 An example of a fragment matrix with 8 SNP sites (a), the corresponding distance matrix (b), the fuzzy conflict graph associated with the fragment matrix (c), and the results of applying FastHap on the data (d). The graph in (c) shows only edges with non-pivot distances. ...... 23

3.2 Coverage of the HuRef dataset. (a): Coverage for each chromosome; numbers vary from 6.49 to 8.72 for various chromosomes with an average genome-wide coverage of 7.43. (b): Histogram of coverage for chromosome 20 as an example; the y-axis shows the number of SNPs with each specific coverage shown on the x-axis. ...... 30

3.3 Chromosome-wide haplotype length for each chromosome (a) and histogram of per-block haplotype length for chromosomes 8, 17, and 18 as examples of chromosomes with ‘small’, ‘medium’, and ‘large’ blocks respectively. ...... 31

3.4 Effect of error rate and coverage on performance of FastHap, Greedy, and HapCut. The analysis was performed on chromosome 20 (randomly selected) of the HuRef dataset. MEC of the three algorithms under comparison as a function of error rate (a); execution time of the algorithms as a function of coverage (b). ...... 32

3.5 Speed performance of FastHap, Greedy, and HapCut as a function of haplotype length. Analysis was performed on chromosome 20 (randomly selected) of the HuRef dataset. Execution time as a function of haplotype length (a); amount of speedup achieved by FastHap compared to Greedy and HapCut (b). ...... 33

4.1 Association rule haplotyping (ARHap) framework. Each round of association rule haplotyping is composed of two phases including an association rule learning phase and a haplotype reconstruction phase. This two-phase process may continue for multiple rounds until all SNP positions on the haplotype set are reconstructed. ...... 41

4.2 An example of matrix binarization. The fragment matrix X containing 9 fragments drawn from haplotype set H = {h0, h1}, where h0 = ‘11111’ and h1 = ‘00000’, is shown in (a). X is decomposed into two matrices X0 and X1 as shown in (b). The binary fragment matrix Y is a column-wise concatenation of X0 and X1. ...... 45

4.3 Association rule based haplotype reconstruction for the dataset in Figure 4.2. The dependency graph is constructed based on the strong rules in Table 4.2 (a); longest paths identified in each iteration of the algorithm (b); and evolution of haplotypes as each rule is tested against the haplotype set (c). It is assumed that the haplotypes are initialized to h0 = ‘10101’ and h1 = ‘01010’. ...... 62

4.4 An example of a dependency graph generated from rules with multiple attributes in their antecedent snpset. The fragment matrix shown in (a) does not produce any rules of length 2 because no rule of the form ti → tj meets the confidence criterion of Confhap > 0.5; set of strong rules with 2 attributes in their antecedent field (b); corresponding dependency graph (c), a longest path (d), and changes in the initial haplotypes after applying rules on the longest path (e). ...... 64

4.5 An example with inconsistent attributes on the ‘longest path’. The fragment matrix in (a) results in 6 rules of size 2 shown in (b). The longest path on the dependency graph in (c) contains conflicting attributes. Choosing the longest attribute-consistent path results in updating the haplotypes as shown in (d). ...... 66

4.6 Statistics about simulated polyploidy datasets. Distribution of short reads in datasets based on coverage values (a); number of non-overlapping blocks for each ploidy level and coverage number (b); histogram of length of non-overlapping blocks (c); and average length of non-overlapping blocks for each ploidy level and coverage value (d). ...... 69

4.7 Performance of different algorithms on simulated data: Switching error as a function of coverage for a fixed error rate of ε = 5% (a); switching error versus error rate for a dataset with a fixed coverage of 5X (b); normalized MEC versus coverage for a dataset with a fixed error rate of ε = 5% (c); and normalized MEC as a function of error rate for 5X coverage (d). ...... 71

5.1 An example of a weighted fragment conflict graph (WFCG)...... 83

5.2 Vertex-Coloring (left) and Color-Merging (right) applied on the WFCG shown in Figure 5.1. ...... 84

5.3 HapColor performance in terms of normalized MEC (a) and reconstruction rate (b) as a function of error rate for various polyploidy data; normalized MEC as a function of error rate for various coverage levels (c); and evolution of the algorithm during color merging (d). ...... 92

6.1 An example of a fragment matrix with 8 short reads and 6 SNP sites (a), and the corresponding similarity graph (b). Edge labels represent weights (wij) multiplied by ten (10X) for better visualization. That is, a label ‘−3’ on edge e58 represents w58 = −0.3. ...... 107

6.2 Evolution of Algorithm 9 for the similarity graph shown in Figure 6.1. The edge weights are multiplied by 10 for visualization. With the final clustering in (f), the amount of overall MEC is 2 with the reconstructed haplotypes H = {111111, 000000, 111100} ...... 111

6.3 Statistics about the generated datasets. Distribution of short reads in datasets based on coverage values (a); histogram of length of non-overlapping blocks (b); number of non-overlapping blocks for each ploidy level and coverage number (c); average length of non-overlapping blocks for each ploidy level and coverage value (d). ...... 119

6.4 Impact of sequencing confidence scores on the quality of reconstructed haplotypes in terms of switching error (a), reconstruction rate (b), normalized MEC (c), and weighted MEC (d) for triploid data with error rate ε = 5%. ...... 122

6.5 PolyCluster performance in terms of switching error (a), reconstruction rate (b), and normalized MEC (c) for different polyploidy and error rates. The results are presented for 5X coverage. ...... 124

6.6 PolyCluster performance as a function of coverage for triploid data (i.e., K=3). The performance is shown in terms of switching error (a), reconstruction rate (b), and normalized MEC (c) for various error rates. ...... 125

6.7 Comparison of switching error of PolyCluster with that of HapColor, HapTree, KMedoids, and Greedy on triploid (a), tetraploid (b), and hexaploid (c) data...... 129

List of Tables

3.1 Comparison of FastHap with Greedy and HapCut in terms of accuracy (MEC) and execution time using the HuRef dataset. FastHap achieves speedups of 16.4 and 15.1 compared to HapCut and Greedy, respectively, and is 1.9% and 35.4% more accurate than HapCut and Greedy, respectively. Statistics on coverage and haplotype length are shown in Figure 3.2 and Figure 3.3 and further discussed in Section 3.3.1. ...... 35

4.1 Performance of ARHap on SWE ...... 40

4.2 Strong association rules of size l = 2 for the fragment matrix in Figure 4.2. 45

4.3 Running time (minutes) of ARHap, FastHap, HapCut, and Greedy on simulated data with fixed error rate ε = 5%. ...... 73

4.4 Running time (minutes) of ARHap, FastHap, HapCut, and Greedy on simulated data with fixed coverage C = 5X...... 73

4.5 Overall MEC score of ARHap, FastHap, HapCut, and Greedy on HuRef data. ...... 75

5.1 Reduction (%) in MEC achieved by HapColor ...... 79

5.2 MEC comparison of HapColor with other algorithms on simulated polyploidy data...... 94

5.3 MEC comparison of HapColor, HapTree, Greedy, and RFP on HuRef based data...... 98

6.1 Comparison of PolyCluster with HapColor, HapTree, KMedoids, and Greedy in terms of absolute MEC on triploid data...... 130

6.2 Comparison of PolyCluster with HapColor, HapTree, KMedoids, and Greedy in terms of absolute MEC on tetraploid data. ...... 130

6.3 Comparison of PolyCluster with HapColor, HapTree, KMedoids, and Greedy in terms of absolute MEC on hexaploid data...... 130

6.4 Running time (minutes) on triploid data with 5X coverage and various error rates...... 131

6.5 Impact of block length on running time (sec.) ...... 132

Acknowledgments

I would like to express my sincere gratitude to my advisor, Professor Wei Wang, for her invaluable guidance and continuous support throughout my years at UCLA. She has been a strong motivator and advocate for my research throughout my matriculation through the Ph.D. program. I would like to express my sincere appreciation to other members of my thesis committee, Professor Eleazar Eskin, Professor Jason Ernst, and Professor Jessica Li, for kindly agreeing to be on my doctoral committee and for their constructive comments. I would also like to thank Professor Christopher Lee for the great opportunity of collaboration.

I would like to thank current and former members of the Intelligent Data Exploration and Analysis Laboratory (IDEAL) and UCLA-ZAR Lab for their instrumental discussions and technical interactions. I am very proud of making wonderful friends at UCLA. Far too many friends to mention individually have shared great moments with me. I also would like to personally thank Professor Richard Korf and Professor David Smallberg, who provided me the best training on teaching, and Professor Christopher Lee, from whom I learned how to turn teaching into deep conceptual research that keeps me and my students happier. I gratefully thank all of them.

My deepest gratitude goes to my parents and my sister for their love, their support, and for their sacrifice over so many years. They have been the origin of my strength and will always be. My most heartfelt acknowledgment must go to my daughter, Lily and my husband, Hassan, not only for their constant encouragement, but also for their patience and understanding throughout my research. To them both, I owe an immeasurable debt and deep affection. Hassan and Lily have my everlasting love for all that, and for being everything I am not.

Vita

Dec., 2001 B.S. Computer Science, Azad University, Sari, Iran

Dec., 2009 M.Sc. Management Information Systems, University of Texas at Dallas, TX

2015 - 2017 Adjunct Faculty, School of Electrical Engineering and Computer Science, Washington State University, WA

2015 - 2016 Research Consultant, Institute for Quantitative and Computational Biol- ogy, University of California Los Angeles, CA

Sept., 2017 Ph.D. expected, Computer Science, University of California Los Angeles, Los Angeles, CA

Publications

Christopher Lee, Brit Toven-Lindsey, Casey Shapiro, Michael Soh, Sepideh Mazrouee, Marc Levis-Fitzgerald, and Erin R. Sanders, “Error Discovery Learning Boosts Student Engagement, Learning Outcomes, and Retention in a Computer Science Course”, CBE - Life Science Education Journal, 2017.

Hezarjaribi, N., Mazrouee, S., Ghasemzadeh, H. (2017). Speech2Health: A Mobile Framework for Monitoring Dietary Composition from Spoken Data. IEEE Journal of Biomedical and Health Informatics.

Mazrouee, S., Wang, W. (2015, November). HapColor: A graph coloring framework for polyploidy phasing. In Bioinformatics and Biomedicine (BIBM), 2015 IEEE In- ternational Conference on (pp. 105-108). IEEE.

Mazrouee, S. (2015, November). PACH: Ploidy-AgnostiC Haplotyping. In Bioin- formatics and Biomedicine (BIBM), 2015 IEEE International Conference on (pp. 1786-1788). IEEE.

Mazrouee, S., Wang, W. (2014, September). Individual haplotyping prediction agree- ments. In Proceedings of the 5th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics (pp. 615-616). ACM.

Mazrouee, S., Wang, W. (2014). FastHap: fast and accurate single individual haplo- type reconstruction using fuzzy conflict graphs. Bioinformatics, 30(17), i371-i378.

Ghasemzadeh, H., Mazrouee, S., Kakoee, M. R. (2006, March). Modified pseudo LRU replacement algorithm. In Engineering of Computer Based Systems, 2006. ECBS 2006. 13th Annual IEEE International Symposium and Workshop on (pp. 6-pp). IEEE.

Ghasemzadeh, H., Mazrouee, S., Moghaddam, H. G., Shojaei, H., Kakoee, M. R. (2006). Hardware implementation of stack-based replacement algorithms. In Pro- ceedings of world academy of science, engineering and technology (Vol. 16).

CHAPTER 1

Introduction

Deoxyribonucleic Acid (DNA) is the main component of chromosomes and the hereditary material in all life forms [WHH04,FWS05]. The information available for building and maintaining an organism is stored in the DNA sequence [SMS02], which consists of two nucleotide strands coiled around each other. Decoding the genetic information of DNA has been an active area of research over the past several decades. Inheritance is the origin of many phenotypes including the incidence of diseases such as cancer and metabolic diseases. Mendelian inheritance explains that alterations in the genome sequence can be passed on to offspring and also propagate to synthesized proteins. The spectrum of DNA sequence variation ranges from single nucleotide polymorphisms (SNPs) to more complex structural variation (SV) [ACE11,Con12] such as deletions, insertions, or translocations of the genomic material. Approximately 99.5% of any two individuals’ genome sequences are shared within a population [LSN07]. The remaining 0.5% of the nucleotide bases, which vary within a population, explains, in part, the differences among individuals [AI13]. Variations in the DNA sequence at particular locations (i.e., SNP sites), which are the most common source of variants, are commonly studied in Genome Wide Association Studies (GWAS) [HD05,BM12].

Understanding associations between genomic structures and phenotypes is enabled by advancements in two emerging areas, namely genome sequencing and molecular biology [HD05,AHV10]. On one hand, improvements in the quality and speed of genome sequencing have led to the generation of data that can be used to extract the exact structure of chromosome copies in various organisms. On the other hand, advances in molecular biology facilitate establishment of the associations between genomic structures and identified phenotypes.

While new generations of sequencing technologies are poised to deliver consistently more accurate and affordable DNA sequence data, the technology has not yet advanced to the state where it can provide fully sequenced chromosome copies. In other words, while sequencing technologies provide genotypes, they are incapable of extracting the haplotype structure of the sequenced organism. Therefore, we are provided with a collection of DNA reads generated from individual organisms. These reads need to be assembled together to reconstruct the whole genome structure.

A haplotype is formally defined as a set of variant alleles on one chromosome that is passed on through generations as a unit. Haplotype reconstruction is ultimately used to reconstruct the entire chromosome of the organism. Figure 1.1 shows an example of the haplotype assembly process for diploid organisms, which maintain two chromosome copies. As shown in this figure, ten short reads, denoted by f1, ..., f10 in Figure 1.1(b), are generated from the two chromosome copies shown in Figure 1.1(a). In reality, a much larger number of such short reads are generated by a sequencing machine without knowing from which copy of the chromosome each short read originates. The haplotype assembly problem aims to assemble these short reads together in order to reconstruct the two copies of the chromosome. For the example in Figure 1.1, the two reconstructed haplotypes are shown in Figure 1.1(c). Given that the two copies differ only at SNP sites, the focus of haplotype assembly is on determining the content of the SNP sites for each haplotype copy.

Traditionally, researchers have developed experimental approaches for differentiating among chromosome copies (i.e., determining haplotypes). The experimental approaches [SAK15], however, are severely limited in processing the massive amounts of data that are provided by sequencing technologies. As a result, experimental approaches to haplotype assembly are constrained by their high cost and low processing speed.

[Figure 1.1: (a) two copies of a chromosome aligned to the reference sequence; (b) short reads/fragments aligned to the reference sequence; (c) assembled haplotypes.]

Figure 1.1: An illustration of the diploid haplotyping process. Ten short reads, denoted by f1 to f10 in (b), are generated from the two chromosome copies shown in (a). Only SNP sites, denoted by S1 to S8, are used for haplotype assembly.

Over the past decade, computational methods have been identified as a promising approach for devising accurate and inexpensive tools to generate haplotypes of various organisms from Next Generation Sequencing (NGS) data [Rei09]. It is straightforward to show that the haplotype assembly problem can be modeled as a graph partitioning problem if all short reads are error-free [LBI01]. For example, in the case of diploid haplotyping, the collection of short reads can be partitioned into two disjoint sets (e.g., by constructing a bipartite graph) where each set is used to reconstruct one haplotype copy. In reality, however, DNA reads exhibit various types of errors such as sequencing errors, missing data, and alignment errors. The existence of such errors turns haplotype assembly into an NP-hard problem [CVK05]. Because the problem is combinatorial in nature, developing computational algorithms for reconstructing haplotypes from erroneous DNA reads is a non-trivial task. Importantly, these algorithms need to balance two conflicting criteria, accuracy and complexity. One of the core research challenges in haplotype assembly is to achieve a balance between these criteria. The main contribution of this dissertation is to address this challenge by investigating algorithmic frameworks for haplotype assembly that support unsupervised learning to improve both accuracy and speed.
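To make the error-free case concrete, the following is a minimal sketch (not the dissertation's algorithm; the reads, encoding, and function names are hypothetical): each read is a string over {0, 1, '-'} restricted to SNP sites, a conflict graph connects two reads that disagree at a site both cover, and 2-coloring that graph splits error-free reads into the two haplotype groups.

```python
from collections import deque

def conflicts(r1, r2):
    """Two reads conflict if they disagree at any SNP site both cover ('-' = not covered)."""
    return any(a != b for a, b in zip(r1, r2) if a != '-' and b != '-')

def bipartition(reads):
    """2-color the conflict graph; succeeds only when the reads are error-free."""
    side = {}
    for start in range(len(reads)):
        if start in side:
            continue
        side[start] = 0
        queue = deque([start])
        while queue:
            i = queue.popleft()
            for j in range(len(reads)):
                if i != j and conflicts(reads[i], reads[j]):
                    if j not in side:
                        side[j] = 1 - side[i]
                        queue.append(j)
                    elif side[j] == side[i]:
                        return None          # odd cycle: the reads are not error-free
    return ([r for i, r in enumerate(reads) if side[i] == 0],
            [r for i, r in enumerate(reads) if side[i] == 1])

# Hypothetical error-free reads over 4 SNP sites, drawn from haplotypes 0110 and 1001
reads = ["01--", "-110", "--01", "10--", "-00-"]
print(bipartition(reads))
```

With erroneous reads the same graph contains odd cycles and no such clean bipartition exists, which is exactly where the heuristic algorithms developed in this dissertation come in.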

This dissertation develops algorithms and tools that can be used by biologists to investigate factors contributing to inheritable diseases, among other phenotypes. Four approaches are proposed in this research, each of which focuses on improving the accuracy and/or complexity of haplotype assembly in either diploid or polyploid organisms. The methodologies presented in this dissertation span several areas of research including unsupervised learning, combinatorial optimization, graph partitioning, and data mining. The overarching theme of the technological contributions of this research is novel combinatorial algorithm design to deal with the complex structure of DNA reads to perform diploid and polyploid haplotyping.

In Chapter 3, we present the development and validation of FastHap, a fast and accurate diploid haplotyping algorithm. The development of FastHap is motivated by the fact that for haplotype assembly algorithms to be viable for practical use, they must be not only accurate but also fast enough to be used widely on large-scale datasets. In fact, current trends in sequencing technologies suggest that sequence read lengths are being extended significantly and access to reads of up to several thousand base pairs long will become a reality in the near future. Despite tremendous effort in recent years, fast and accurate haplotype reconstruction has remained an active research topic, mainly due to the computational challenges involved. FastHap introduces a new similarity metric that allows us to precisely measure distances between pairs of fragments (i.e., DNA short reads). The measure is carefully developed not only to assign small values to fragments that match perfectly and large values to completely different fragments, but also to neutralize the effect of missing alleles on the final partitioning of the fragments. The distance values are then utilized to build a fuzzy conflict graph that represents similarity among the fragments. FastHap introduces a two-phase, computationally simple heuristic algorithm for haplotype reconstruction. The first phase uses the fuzzy conflict graph to build an initial fragment partition. In the next phase, the initial partition is further refined to achieve additional improvements in the overall accuracy of the reconstructed haplotypes. Experimental evaluation demonstrates that FastHap is up to one order of magnitude faster than state-of-the-art haplotype assembly algorithms while also delivering comparable accuracy in terms of the Minimum Error Correction (MEC) criterion.

FastHap can be viewed as a binary partitioning approach where input fragments are grouped into two disjoint clusters based on the proposed distance measure. FastHap includes an iterative process where the two most similar and the two most dissimilar fragments are identified during each iteration of the algorithm. The similar fragments are placed in the same partition while the dissimilar fragments are assigned to opposite partitions. Although FastHap achieves high speed performance while optimizing the overall MEC score of diploid haplotyping, it is not designed to minimize the switching error of the reconstructed haplotypes. Therefore, our research in this dissertation proceeds by investigating a new framework, called ARHap, where the focus is to achieve high speed performance while minimizing the switching error.

In Chapter 4, we introduce ARHap, another haplotyping framework, based on the concept of association rule learning. In Genome Wide Association Studies (GWAS), we often study correlations between genomic variations and different phenotypes, while linkage disequilibrium, which is the non-random association of alleles at different loci in a given population, is usually overlooked due to the complex nature of the problem. In order to discover interesting relationships hidden in large datasets, we take a novel approach to the problem of individual chromosomal reconstruction by utilizing association analysis techniques to reveal hidden relationships among multiple variant loci in the genome. To this end, ARHap aims to identify such correlations and utilize them for haplotype assembly. The ARHap framework is composed of two main modules or processing phases. In the first phase, called association rule learning, strong patterns are discovered from the dataset that is provided by the individual's sequencing data. These patterns reveal the inter-dependency of alleles at individual SNP sites. In the next phase, called haplotype reconstruction, an approach for utilizing the strong rules produced in the first phase is developed to construct haplotypes at individual SNP sites. ARHap has several features that lead to both fast and accurate haplotyping. ARHap uses an incremental haplotype reconstruction approach that enables us to generate association rules according to the unreconstructed SNP sites during each round of the algorithm. The two modules are synergistically designed for efficient data processing. During each round, the association rule learning module generates rules while constraining the length of the rules (i.e., the number of participating snpsets) and limiting the rules to those that contribute to the reconstruction of sites that have remained unreconstructed thus far. The framework begins by generating rules that are small (e.g., only those of length 2) and highly strong (i.e., according to minimum support and minimum confidence criteria) and reconstructing haplotypes accordingly. The rule length can increase and/or the criteria about the strength of the rules can be adjusted gradually during subsequent rounds if some SNP sites have remained unreconstructed. This adaptive approach, which uses feedback from the haplotype reconstruction module, eliminates the generation of rules that do not contribute to haplotype reconstruction as well as weak rules that may introduce errors into the final haplotypes. Extensive experimental analyses demonstrate the superiority of ARHap in diploid haplotyping, in particular in achieving significantly better accuracy performance in terms of switching error (SWE).
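As a rough illustration of the rule-learning phase (a sketch only; the exact support and confidence definitions used by ARHap are given in Chapter 4, and the reads and thresholds below are hypothetical), the code enumerates length-2 SNP association rules over a fragment matrix and keeps those that satisfy minimum-support and minimum-confidence criteria.

```python
def rule_stats(reads, i, a, j, b):
    """Support and confidence of the rule (SNP i = a) -> (SNP j = b) over the reads."""
    both = sum(1 for r in reads if r[i] == a and r[j] == b)
    antecedent = sum(1 for r in reads if r[i] == a)
    support = both / len(reads)
    confidence = both / antecedent if antecedent else 0.0
    return support, confidence

def strong_rules(reads, n_snps, min_support=0.2, min_confidence=0.8):
    """Enumerate length-2 rules that satisfy both thresholds (hypothetical thresholds)."""
    rules = []
    for i in range(n_snps):
        for j in range(n_snps):
            if i == j:
                continue
            for a in "01":
                for b in "01":
                    s, c = rule_stats(reads, i, a, j, b)
                    if s >= min_support and c >= min_confidence:
                        rules.append(((i, a), (j, b), s, c))
    return rules

# Hypothetical fragment matrix rows over 4 SNP sites ('-' marks uncovered sites)
reads = ["011-", "01-0", "-110", "100-", "1-01"]
print(strong_rules(reads, n_snps=4))
```

In the actual framework, only rules whose consequents touch still-unreconstructed SNP sites are generated during each round, which keeps the search space small.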

Polyploidy, the presence of more than two copies of each chromosome in the cells of an organism, is common in plants, animals, and human body tissues, and finds important applications in the field of genetics. As stated previously, FastHap is, by design, a binary fragment partitioning approach. As a result, it cannot be applied to polyploid haplotyping problems where more than two copies of the chromosome exist. Furthermore, although ARHap does not specifically place any constraints on the number of chromosome copies (i.e., ploidy level), it is apparent that ARHap will perform poorly in polyploid haplotyping. This can be explained as follows. ARHap starts by iteratively applying a sequence of association rules (built based on inter-SNP relationships) on haplotypes, which are initialized at random, to modify alleles at individual SNP sites. In diploid data, the two haplotypes maintain complementary binary values at each SNP site. Therefore, each association rule can be applied to only one of the two haplotypes because the antecedent of each rule is true on only one of the two haplotypes. This process allows us to track which SNP sites have already been verified or updated according to a given set of association rules. The process stops when all SNP sites have been either verified or updated. In contrast, different haplotype copies in polyploid data can have arbitrary values at the same SNP site, which can result in a particular association rule being applicable to more than one haplotype copy. Therefore, this dissertation proceeds by developing two novel approaches for polyploid haplotyping, namely HapColor and PolyCluster, discussed in Chapter 5 and Chapter 6, respectively.

HapColor and PolyCluster focus on partitioning an input fragment set into K disjoint sets, each representing a haplotype copy. Both of these frameworks follow a general approach that includes construction of an initial clustering of the fragments followed by a cluster merging procedure. These two frameworks, however, are different in both the distance measure used for graph-based vertex partitioning and the underlying cluster construction. Specifically, the major difference between HapColor and PolyCluster is that HapColor starts by constructing a clustering of the fragments such that the obtained clusters do not have any conflicts among them. That is, the initial clustering attained by HapColor achieves an MEC score equal to zero. It will then merge similar clusters until the number of remaining clusters reaches the desired value of K, the ploidy level. This approach to fragment clustering is essentially consistent with the aim of minimizing the overall MEC score. PolyCluster, however, attempts to consider both similarity and dissimilarity of the fragments (and therefore of the constructed clusters), as a proxy for optimizing switching error, during the clustering process. A more detailed description of each algorithm is given in the following.

A majority of current haplotype assembly algorithms, however, focus primarily on diploid organisms. Chapter 5 presents HapColor, a fragment partitioning approach with the objective of minimizing MEC for polyploid haplotyping. First, a formal definition of polyploid haplotyping with the objective of minimizing overall MEC is presented in this chapter. The problem is then mapped onto a graph coloring problem using an introduced conflict graph model. Then the hardness of the introduced problem, Minimum MEC Polyploid Haplotyping (MMPH), is discussed and it is shown that this problem is NP-hard. A greedy heuristic algorithm, consisting of a graph coloring method followed by a color-merging technique, is developed to accurately partition short reads and reconstruct the haplotype associated with each partition. The performance of HapColor is compared against several polyploid haplotyping approaches and it is demonstrated that HapColor substantially reduces the MEC scores of the competing algorithms on polyploidy data.
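As an illustration of the coloring-then-merging idea just described (a minimal sketch under simplifying assumptions, not the HapColor implementation; the reads and the merge cost below are hypothetical), the code greedily colors a read-conflict graph so that every color class is internally conflict-free, then merges the pair of classes with the fewest cross-class conflicts until K classes remain.

```python
def conflicts(r1, r2):
    """Two reads conflict if they disagree at a SNP site both cover ('-' = not covered)."""
    return any(a != b for a, b in zip(r1, r2) if a != '-' and b != '-')

def greedy_coloring(reads):
    """Assign each read to the first color class it does not conflict with (zero-MEC start)."""
    classes = []
    for r in reads:
        for c in classes:
            if not any(conflicts(r, other) for other in c):
                c.append(r)
                break
        else:
            classes.append([r])
    return classes

def disagreement(c1, c2):
    """Number of conflicting read pairs across two color classes (hypothetical merge cost)."""
    return sum(conflicts(a, b) for a in c1 for b in c2)

def merge_to_k(classes, k):
    """Repeatedly merge the pair of classes with the lowest merge cost until k remain."""
    while len(classes) > k:
        pairs = [(disagreement(classes[i], classes[j]), i, j)
                 for i in range(len(classes)) for j in range(i + 1, len(classes))]
        _, i, j = min(pairs)
        classes[i] = classes[i] + classes[j]
        del classes[j]
    return classes

# Hypothetical reads over 6 SNP sites for a triploid (K = 3) example
reads = ["011---", "0110--", "--1100", "1001--", "--0011", "10-1--"]
print(merge_to_k(greedy_coloring(reads), k=3))
```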

While Chapter 5 focuses on minimizing MEC by grouping similar fragments in individual partitions, Chapter 6 introduces PolyCluster, a correlation-clustering-based graph partitioning algorithm that combines intra-partition similarity with inter-partition dissimilarity in the process of clustering to generate high quality haplotypes. We refer to this new optimization problem as minimum fragment-disagreement haplotyping. The hypothesis in Chapter 6 is that a combined measure of similarity/dissimilarity will improve switching error. This hypothesis is verified through extensive experimental analyses. We first provide a formal definition of minimum fragment-disagreement haplotyping, which aims to minimize the overall partitioning error computed by combining inter-partition similarity and intra-partition dissimilarity. The optimization problem is then transformed into a correlation clustering problem on a graph model that captures the amount of similarity between pairs of DNA short reads. PolyCluster devises a combination of linear programming, rounding, region-growing, and cluster merging to group similar reads and reconstruct haplotypes. Experimental results demonstrate that PolyCluster substantially improves the accuracy of haplotype assembly in terms of switching error and running time while achieving comparable results in terms of MEC score.

Finally, we conclude the dissertation with the lessons learned and discuss important future directions in Chapter 7.

CHAPTER 2

Haplotype Assembly in the Literature

While genotype data refers to the combined information from both chromosome copies of an individual, haplotype refers to the information from each copy of the chromosome. Haplotypes can be defined as the genetic constitution of an individual chromosome. Haplotype information is considered useful for researchers in finding genes affecting health, disease, and responses to drugs and environmental factors. The international HapMap project is a multi-country effort in the field attempting to identify and catalogue genetic similarities and differences in human beings for such purposes [GBH03].

In diploid organisms such as humans, each individual has two copies of each chromosome, which may not be identical [Man85,ABC05]. There are other organisms that have more than two homologous chromosomes. This concept is called polyploidy, which is common in plants such as wheat, potatoes, apples, and roses, and also in some animals and even in human body tissues. Therefore, in order to understand the exact chromosomal structure of each organism, phasing, which is basically reconstructing chromosomal copies using genotype or sequencing data, is required. Phasing can be done through traditional haplotype inference using genotype data, which can be defined as the problem of inferring the pair of haplotype sequences of an individual from his/her genotype sequence. Recently, the lower cost and higher accuracy of sequencing data have enabled us to use computational methods to solve the phasing problem using individual sequencing data, in which a collection of short reads is processed in order to reconstruct each copy of the chromosome of the organism under study. This approach to solving the phasing problem is called haplotype assembly using Next Generation Sequencing data. This concept was first introduced in 2001 [LBI01], where optimization techniques were proposed for solving the problem in its general form. The problem has been shown to be computationally hard under various combinatorial objective functions [LBI01,CVK05,BIL05].

Haplotype assembly algorithms often take as input a standard matrix, called a fragment matrix, composed of short reads obtained by DNA sequencing. Each row in a fragment matrix is associated with a short DNA read and each column represents a SNP (i.e., Single Nucleotide Polymorphism) site. To extract SNP positions, the DNA reads are mapped onto a reference sequence. The goal is to re-build each copy of an organism's chromosome using the variation information within the fragment matrix. In the literature, it is assumed that the DNA reads are aligned to the reference genome sequence prior to forming a fragment matrix. The alignment is done using a mapping/alignment algorithm, which may introduce errors into the short reads. Furthermore, consistent with the literature [BYP14a,BB08,LSN07], here it is assumed that all homozygous sites are discarded prior to forming the fragment matrix and only single variant calls (i.e., SNPs) are considered for haplotype assembly. For each locus of a fragment, it is traditional in the literature to consider biallelic SNPs; any other allele that is seen rarely is treated as ‘missing’ and substituted with ‘–’ in the fragment matrix. Our methodologies presented here, however, are independent of this assumption, and in the case of polyploidy we use two bits in order to accommodate multiple alleles. Additionally, we would like to note that in polyploid haplotyping the ploidy level, ‘K’, is known a priori. If the ploidy level is unknown, the problem transforms into an organism identification problem, which is out of the scope of this dissertation.
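For concreteness, the following is a minimal sketch of how allele calls at SNP sites can be recoded into such a fragment matrix (illustrative only; alignment and variant calling are assumed to have been done elsewhere, and the input format and calls are hypothetical): the two most common alleles at each SNP column become 0 and 1, while uncovered sites and rare third alleles become ‘-’.

```python
from collections import Counter

def build_fragment_matrix(calls, n_snps):
    """
    calls: list of dicts, one per read, mapping SNP index -> observed base at that site.
    Returns rows of the fragment matrix with alleles recoded as '0', '1', or '-'.
    """
    # For each SNP column, determine the two most common observed alleles.
    coding = {}
    for s in range(n_snps):
        counts = Counter(r[s] for r in calls if s in r)
        common = [base for base, _ in counts.most_common(2)]
        coding[s] = {base: str(k) for k, base in enumerate(common)}

    rows = []
    for r in calls:
        row = []
        for s in range(n_snps):
            base = r.get(s)                       # None: read does not cover this SNP
            row.append(coding[s].get(base, '-'))  # rare third alleles also become '-'
        rows.append("".join(row))
    return rows

# Hypothetical allele calls at 3 SNP sites for 4 reads
calls = [{0: 'A', 1: 'C'}, {1: 'T', 2: 'G'}, {0: 'G', 1: 'C'}, {0: 'A', 2: 'C'}]
print(build_fragment_matrix(calls, n_snps=3))
```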

In haplotype assembly, the goal is to build the two copies of an individual's chromosome using the variation information within the fragment matrix. The difficulty of the problem arises from the fact that the matrix is not error-free due to various types of errors such as sequencing errors, chimeric read pairs, and false variants. In the real world of molecular biology, experiments are never error-free. These errors result in conflicting data presented by the fragments within the fragment matrix. Conflicting data prevent us from reliably inferring a 0 or 1 at each SNP site in the resulting haplotype. We note that in an error-free fragment matrix, one can easily partition the fragments into two disjoint sets (for diploid haplotyping) such that each fragment set infers one copy of the haplotype. An error-free fragment matrix is also called a feasible matrix.

Since the inception of the haplotype assembly problem, research efforts on haplotype reconstruction from DNA fragments can be broadly categorized according to various factors such as the computational methodology, evaluation method, and ploidy level.

2.1 Approaches

Earlier approaches on reconstructing haplotypes from DNA reads, which we refer to as data removal approaches, were developed almost a decade ago. Data removal approaches focus on removing data (i.e., fragments or SNPs) from a collection of DNA reads (i.e., a fragment matrix) such that the resulting matrix is feasible or error-free [LBI01]. Once such a matrix is obtained, it is straightforward to use classical computer science partitioning algorithms to group the remaining reads into disjoint sets and construct haplotype copies accordingly. Later, other methods, which we refer to as haplotype update approaches, were developed. The haplotype update approaches, which have received more attention compared to data removal methods [LSN07, BB08, BHA08, MW14a, PMP15, AI12, AI13], focus on dealing with an erroneous fragment matrix and reconstructing haplotypes such that an error metric is minimized. In addition to these two broad methodological approaches, a third group has tried to solve the problem using statistical methods [CJC06,SSD01].

2.2 Evaluation Methods

Conventional data removal approaches use objective functions such as MSR (Minimum SNP Removal) and MFR (Minimum Fragment Removal) [LBI01]. These two measures refer to removing the minimum number of SNPs and the minimum number of reads, respectively, from the data to reach a feasible fragment matrix. LHB (Longest Haplotype Block) [LBI01,WP03] refers to methods that attempt to achieve the longest possible haplotype block. Another metric that falls within the data removal category is MWER (Minimum Weight Edge Removal). This approach focuses on removing data (i.e., edges) from a different abstraction of the data (i.e., a weighted graph) than the fragment matrix [AI12, DV15]. The approaches proposed in [AI12, AI13] attempt to solve an optimization problem that minimizes the MWER objective.

Most diploid haplotyping approaches that fall within the haplotype update category, however, use Minimum Error Correction (MEC) [LSL02] as their objective function [PS04, SSD01, SS06, MW14b, WWL05]. The MEC score refers to the number of edits in the fragment matrix such that each read can be precisely mapped onto one of the reconstructed haplotypes. Therefore, the MEC score also represents the amount of mismatched single base-pairs between the fragment set and the corresponding haplotype copy. The problem of minimizing the MEC score for haplotype reconstruction in diploid organisms is shown to be NP-hard [CVK05]. Using the MEC score as an objective function in diploid haplotyping is important because MEC is associated with errors due to base miscalling, which is the most common source of error [BB08].
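For concreteness, a minimal sketch of the MEC computation (the reads and haplotypes below are hypothetical): each read is charged the number of covered SNP sites at which it disagrees with its best-matching haplotype copy, and the MEC score is the total over all reads.

```python
def mismatches(read, hap):
    """Covered SNP sites at which the read disagrees with the haplotype."""
    return sum(1 for r, h in zip(read, hap) if r != '-' and r != h)

def mec(reads, haplotypes):
    """Minimum Error Correction: each read is matched to its closest haplotype copy."""
    return sum(min(mismatches(r, h) for h in haplotypes) for r in reads)

reads = ["011-", "01-0", "-110", "100-", "1-11"]   # hypothetical fragment matrix rows
haplotypes = ["0110", "1001"]                      # hypothetical diploid haplotype pair
print(mec(reads, haplotypes))                      # 1: the last read carries one erroneous allele
```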

Since optimizing MEC is an NP-hard problem [CVK07], exact solutions have exponential complexity. Recent studies have analyzed this accuracy measure in terms of its fixed-parameter tractability and approximability [BDK15]. A related metric is wMEC (weighted Minimum Error Correction), proposed in [GHL04], where each possible correction is associated with a weight that represents the confidence degree assigned to that SNP value at the corresponding position [PMP14]. While MEC has been widely used in the literature [BB08, BHA08, DMH12, BYP14b, MW14a, DV15], recent studies suggest that a lower MEC score may not necessarily correspond to a higher quality of the reconstructed haplotypes, in particular in polyploid genomes [DMH12,BYP14b,DV15]. As a result, several studies suggest switching error (SWE) [BYP14b,DV15] as a more important accuracy measure for assessing the performance of haplotype assembly algorithms. SWE refers to the amount of mismatch between reconstructed haplotypes and true haplotypes. Obviously, in the absence of true haplotypes, one cannot measure SWE. Thus, many studies that report SWE as an accuracy performance measure conduct their analysis on simulated polyploidy data [OAH12,HLM11].
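As an illustration for the diploid case (a sketch of one common way switch errors are counted; the precise definition used later in this dissertation may differ in detail, and the haplotype pairs below are hypothetical), the estimate is walked across heterozygous sites and an error is counted whenever its phase flips relative to the true haplotype pair.

```python
def switch_errors(true_pair, est_pair):
    """Count positions where the phase of the estimate flips relative to the truth
    (diploid case, heterozygous sites only)."""
    (t0, t1), (e0, e1) = true_pair, est_pair
    errors, prev = 0, None
    for i in range(len(t0)):
        if t0[i] == t1[i] or e0[i] == e1[i]:
            continue                     # skip homozygous / unresolved sites
        phase = (e0[i] == t0[i])         # True if estimate copy 0 tracks truth copy 0 here
        if prev is not None and phase != prev:
            errors += 1                  # the phasing switched between consecutive het sites
        prev = phase
    return errors

true_pair = ("010101", "101010")
est_pair  = ("010010", "101101")         # phase flips once, after the third site
print(switch_errors(true_pair, est_pair))  # 1
```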

Other methods produce a confidence score for each SNP in a haplotype block in addition to its phase [DHM10]. Furthermore, the study in [Kul14] uses dynamic programming methods and a probabilistic graphical model in order to perform phasing and then prunes out the alleles with low confidence scores. Several studies have proposed methods to extend existing algorithms for improved accuracy or complexity [DHM10,SAK15,KSB16,GMM16].

2.3 Ploidy Level

Most solutions are developed for diploid haplotyping, where a collection of fragments is bi-partitioned in order to generate a pair of complementary haplotypes. These methods fail to scale to higher ploidy levels because of the bipartition nature of the algorithm. For polyploid organisms, however, the problem is more challenging since the haplotype copies are not complementary and thus cannot be inferred from each other. Furthermore, the approach can vary based on read partitioning or SNP partitioning, which can affect the computational complexity of polyploid haplotyping. Most algorithms partition the reads into multiple groups in order to reconstruct the haplotype set, while some other approaches use different subsets of SNP sites and, through iterations, update the haplotype set to enhance accuracy and generate haplotypes. There has been limited research on polyploid haplotyping [BYP14b, BYB15, MW15, DV15] using Next Generation Sequencing data. These methods, however, are limited in their capability to create a balance between accuracy and running time. Furthermore, the ploidy level itself appears to have an impact on the performance of these algorithms. For example, the study in [BYP14a] produces more accurate haplotype estimates for triploid and tetraploid data. The approach in [BYP14a] is computationally expensive, in particular with long haplotype blocks, compared to other algorithms [AI13,DV15,MW15]. The approach in [AI13] has shown high accuracy on hexaploid data. Other factors such as sequencing depth are also important in determining the quality of a haplotype estimation algorithm; paired-end short reads of Illumina with a large insert can perform as well as long CCS reads of the same total size now possible with PacBio [MFM17].

CHAPTER 3

Fast and Accurate Diploid Haplotyping

3.1 Introduction

All diploid organisms have two homologous copies of each chromosome, one inherited from each parent. The two DNA sequences of a homologous chromosome pair are usually not identical to each other. The most common DNA sequence variants are Single Nucleotide Polymorphisms (SNPs). The sites at which the two DNA sequences differ are commonly referred to as heterozygous sites. Current High Throughput Sequencing (HTS) technologies [EFG09] are incapable of reading the DNA sequence of an entire chromosome. Instead, they produce a huge collection of short reads of DNA fragments. The process of reconstructing chromosome sequences from a set of DNA reads is referred to as haplotype assembly. Haplotype assembly has become an important computational task because reconstructing one's genome from a large amount of DNA reads is computationally hard [CSV16].

3.1.1 Motivation

The last decade has witnessed much research effort on developing accurate haplotype assembly methods. The research, however, lacks a method that is not only accurate but also fast enough to be used widely on large-scale datasets. In particular, current trends in sequencing technologies demonstrate that sequence read lengths are being extended significantly and access to reads of up to several thousand base pairs long will become a reality in the near future.

Haplotype assembly methods usually involve three main stages prior to the reconstruction phase. First, a sequence aligner is utilized to align the reads to the reference genome. Then, only the read alignments at the heterozygous sites are kept for haplotype reconstruction. Last, reads that span multiple heterozygous sites are used to infer the alleles belonging to each haplotype. The quality of the reconstructed haplotypes may be dramatically affected by errors in sequencing and alignment. The objective, therefore, is to design algorithms that mitigate this impact and rebuild the most likely copies of each chromosome accurately. This has led to the development of accurate haplotype reconstruction algorithms in the past few years. We are, however, observing a critical shift in sequencing technology where larger datasets with longer reads and higher coverage become available. This shift necessitates the development of algorithms that not only reconstruct haplotypes accurately but also require low computation time and can scale to large datasets.

Input to haplotype assembly algorithms is often a fragment matrix where each row consists of a DNA short read and each column represents a SNP site. Most haplotype assembly algorithms process the fragment matrix either row-wise or column-wise. The algorithms that process the given fragment matrix row-wise usually attempt to partition the set of fragments into two disjoint sets, each representing one copy of the haplotype. Examples of such techniques are FastHare [PS04] and the greedy heuristic in [LSN07]. In contrast, algorithms that perform the matrix processing column-wise aim to iteratively update the haplotype at individual SNP sites taking into consideration the rows/fragments that cover those SNPs. For example, HapCut [BB08], HapCompass [AI12], and the approach in [HCP10] rely on partitioning the SNPs into two disjoint sets and finding those variants whose corresponding haplotype bits need to be flipped to improve MEC (Minimum Error Correction). In either scenario, an iterative process is involved. From a computational complexity point of view, the main drawback of existing techniques is that they perform much computation during each iteration of the algorithm.

HapCut [BHA08] is an example of the algorithms that perform column-wise processing to minimize the MEC criterion. The process involves iteratively reconstructing a weighted graph and finding a max-cut on the graph. Clearly, most of the computation occurs in a loop. The algorithm has proved to be fairly accurate at the cost of high computation, in particular as the number of SNPs grows. The greedy heuristic algorithm in [LSN07] is a fragment partitioning approach that performs row-wise processing on the fragment matrix. Each iteration, however, involves two major computing tasks: (1) reconstructing a partial haplotype based on the fragments that are already assigned to a partition; and (2) calculating the distance between unassigned fragments and each one of the haplotype copies. FastHare [PS04] is another fragment partitioning algorithm. It sorts all fragments based on their positions prior to execution of the iterative module. Computationally intensive tasks that occur iteratively in FastHare include: (1) reconstruction of a partial haplotype based on the fragments that are already assigned to a partition; and (2) calculating the distance between the current fragment and each one of the two haplotype copies.

3.1.2 Contributions and Summary of Results

This chapter introduces a new framework, called FastHap, for fast and accurate haplotype assembly. The approach consists of four steps as follows: (1) measuring the dissimilarity of every pair of fragments using a new distance metric; (2) building a weighted graph, called a fuzzy conflict graph, using the introduced dissimilarity measure; (3) using the fuzzy conflict graph to construct an initial partition of the fragments through an iterative process; and (4) refining the initial partitioning to further improve the overall MEC of the constructed haplotypes. Specifically, this chapter offers the following contributions:

• A new distance metric, called inter-fragment distance, will be introduced. This distance measure quantifies dissimilarity between pairs of fragments. The measure is carefully developed not only to assign small values to fragments that match perfectly and large values to completely different fragments, but also to neutralize the effect of missing alleles on the final partitioning of the fragments.

• The notion of fuzzy conflict graphs in the context of haplotype assembly will be introduced in this chapter. These graphs are built based upon the inter-fragment distances. In this graph model, each node represents a fragment and edge weights are the corresponding dissimilarity measures between portions of fragments.

• A two-phase computationally simple heuristic algorithm for haplotype recon- struction will be presented. The first phase uses a fuzzy conflict graph to build an initial fragment partition. In the next phase, the initial partition is further refined to achieve additional improvements in the overall MEC performance of the reconstructed haplotypes.

• The effectiveness of FastHap will be demonstrated using the HuRef dataset, a dataset that has been widely used in the haplotype assembly literature recently. Specifically, FastHap will be compared with several previously published algorithms in terms of accuracy (MEC measures) and scalability (execution time) performance.

The results show that FastHap significantly outperforms the previous algorithms by providing a speedup of one order of magnitude while delivering comparable or better MEC scores. The objective of building a fast haplotype assembly model is achieved in FastHap by performing computationally intensive tasks prior to execution of the iterative process. Specifically, FastHap achieves speedups of 16.4 and 15.1 compared to HapCut [BHA08] and Greedy [LSN07], respectively, and is 1.9% and 35.4% more accurate than HapCut and Greedy, respectively, on HuRef data.

3.2 FastHap Framework

As stated previously in this chapter, the input to a haplotype assembly algorithm is assumed to be a two-dimensional array containing only heterozygous sites of the aligned fragments. Such an input is often called a fragment matrix, X, of size m × n, where m denotes the number of fragments (aligned DNA short reads) and n represents the number of SNPs that the union of all fragments cover. In the following discussion, xij refers to the allele of fragment fi at SNP site sj. Furthermore, xij ∈ {0,1,−}, where 0 and 1 encode two observed alleles and ‘−’ denotes that fragment fi does not cover SNP site sj. If there are more than two alleles observed at a given site, the two most common alleles are encoded with 0 and 1, and the remaining allele(s) are encoded as ‘−’ (i.e., missing). It is expected that many cells in X are filled with ‘−’ since, in practice, each aligned fragment covers only a few SNP sites, limited by the fragment length.[1]

One approache for haplotype assembly is to construct haplotypes based on par- titioning of the fragments in the input fragment matrix. In this case, the haplotype assembly problem consists of two steps, namely fragment partitioning and fragment merging, described as follows. While the fragment partitioning phase aims to group rows of the fragment matrix into two partitions, C1 and C2, fragment merging is intended to combine the fragments residing in each partition, through a SNP-wise consensus process, and form two haplotypes h1 and h2 associated with C1 and C2 respectively. The resulting haplotype is typically denoted by H={h1, h2}. The main objective of the haplotype assembly is, therefore, to come up with a partitioning such that the amount of error is minimized. The focus of FastHap is on minimiz- ing the MEC objective function. As mentioned previously, this problem is proved to be NP-hard [CVK05]. Therefore, our goal is to develop a heuristic algorithm for the haplotype assembly problem. FastHap relies on a novel inter-fragment distance

¹ As discussed previously, the trend is that much longer DNA reads will be available as a result of recent technological advancements in genome sequencing.

A high level overview of FastHap is shown in Algorithm 1.

Algorithm 1 FastHap high level overview
Initialization:
  Calculate the inter-fragment distance between every pair of fragments (Section 3.2.1)
  Store the inter-fragment distances in ∆ (Section 3.2.1)
  Use ∆ to construct a fuzzy conflict graph (Section 3.2.2)
Phase (I): Partitioning
  Partition the fragments into two disjoint sets C1 and C2 (Section 3.2.3 and Algorithm 2)
Phase (II): Refinement
  while (MEC score improves) do
    Find the fragment f̂ with the highest MEC value
    Assign f̂ to the opposite partition
  end while

3.2.1 Inter-Fragment Distance

Given two variables x, y ∈{0,1,−}, operator ⊗ is defined as follows.

$$
x \otimes y =
\begin{cases}
0 & \text{if } x = y\\
1 & \text{if } x \neq y \ \text{and}\ x, y \in \{0, 1\}\\
0.5 & \text{otherwise}
\end{cases}
\qquad (3.1)
$$

Definition 1 (Inter-Fragment Distance). Given a fragment matrix Xm×n where xij ∈ {0,1,−}, the inter-fragment distance ∆(fi, fk) between fragments fi = {xi1, xi2, ..., xin} and fk = {xk1, xk2, ..., xkn} is defined by

$$
\Delta(f_i, f_k) = \frac{1}{T_{ik}} \sum_{j=1}^{n} \left( x_{ij} \otimes x_{kj} \right)
\qquad (3.2)
$$

where Tik denotes the number of columns (SNPs) that are covered by either fi or fk in X. In fact, Tik is a normalization factor that normalizes the distance between the two fragments such that the resulting distance ranges from 0 to 1 (i.e., 0 ≤ ∆(fi, fk) ≤ 1).

The inter-fragment distance metric is developed with the goal of measuring the cumulative dissimilarity between each pair of fragments across all SNP sites. The intuition behind (3.1) and (3.2) is as follows. At a given SNP site sj, if two fragments fi and fk both cover the site, the per-site distance is 0 if they take the same allele (suggesting that they likely belong to the same partition) and 1 if they take opposite alleles (suggesting that they are likely to belong to different partitions). A distance value of 0.5 is used if the SNP site is covered by only one of the two fragments, to neutralize the contribution of the missing element. If the site is not covered by either fragment, a 0 distance is accumulated at this site. An additional benefit of this approach is that only SNP sites covered by either of the two fragments need to be examined. From a computational complexity point of view, this can reduce the execution time of the distance calculation significantly.
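To make the distance concrete, the following Python sketch mirrors (3.1) and (3.2) under the assumption that fragments are stored as strings over the alphabet {'0', '1', '-'}; the function names (site_distance, inter_fragment_distance) are illustrative and are not taken from the FastHap implementation.

```python
def site_distance(x, y):
    """Per-site operator (3.1): 0 for identical calls, 1 for opposite calls,
    0.5 when only one fragment covers the site, 0 when neither does."""
    if x == '-' and y == '-':
        return 0.0
    if x == '-' or y == '-':
        return 0.5
    return 0.0 if x == y else 1.0


def inter_fragment_distance(fi, fk):
    """Normalized inter-fragment distance (3.2): the cumulative per-site distance
    divided by T_ik, the number of sites covered by at least one of the two fragments."""
    covered = [j for j in range(len(fi)) if fi[j] != '-' or fk[j] != '-']
    if not covered:
        return 0.0  # the fragments share no information
    return sum(site_distance(fi[j], fk[j]) for j in covered) / len(covered)
```

Because sites covered by neither fragment contribute zero to the sum, restricting the computation to the covered columns, as above, is what keeps the distance calculation cheap in practice.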

Figure 3.1(a) shows a set of fragments spanning 8 SNP sites. The resulting inter-fragment distances are shown in a symmetric distance matrix in Figure 3.1(b). Intuitively, ∆ (the distance measure between two fragments) is smaller for those fragments that need to be grouped together and larger for those that we prefer to place in different partitions. When the distance between two fragments is 0.5, the two fragments alone do not provide sufficient information as to how they should be partitioned.

Definition 2 (Pivot Distance). Given a fragment matrix Xm×n, the pivot distance between any pair of fragments in {f1, f2, ... , fm} is λ = 0.5.

The pivot λ allows us to decide whether two fragments are dissimilar enough to be placed in separate partitions. A pair of fragments with an inter-fragment distance greater than λ is more likely to be placed in different partitions, although the final partitioning assignment is made after all pairs of fragments are examined through a partitioning algorithm. Section 3.2.2 will present a graph model that enables us to

perform fragment partitioning by linking similar and dissimilar fragments through a weighted graph based on inter-fragment distance values.

[Figure 3.1: An example of fragment matrix with 8 SNP sites (a), corresponding distance matrix (b), fuzzy conflict graph associated with the fragment matrix (c), and results of applying FastHap on the data (d). The graph in (c) shows only edges with non-pivot distances. The resulting partition in (d) is C1 = {f1, f2, f3, f6, f10} and C2 = {f4, f5, f7, f8, f9}, with MEC = 4 and a reconstruction rate of 100%.]

3.2.2 FastHap Graph Model

This section presents a graph model based on the inter-fragment distance defined in (3.2). In Section 3.2.3, it will be discussed how this graph model can be used to partition the fragments into two disjoint sets and construct haplotypes accordingly.

Definition 3 (Fuzzy Conflict Graph). Given a fragment matrix X composed of m fragments {f1, f2, ..., fm} spanning n SNP sites, a fuzzy conflict graph that models dissimilarity between pairs of fragments is a complete graph G represented by the tuple (V, E, WE). In this graph, V = {1, 2, ..., m} is a set of m vertices representing the fragments in X; each edge el is associated with a weight wl equal to the distance between the corresponding fragments in X.

The conflict graph introduced in this chapter, fuzzy conflict graph, is different from that used in previous research (e.g., the fragment conflict graph in [LBI01]). A conflict graph has been conventionally defined as a non-weighted graph. Let us call it a binary conflict graph, which represents any pair of fragments with at least one mismatch in the fragment matrix. For example, according to [LBI01], a conflict graph is a graph with an edge for each pair of fragments in conflict where two fragments are in conflict if they have different values in at least one column in the fragment matrix X . There are a number of shortcomings with respect to utilizing a binary conflict graph for haplotype assembly. The major problem with the conventional conflict graph is that it does not take into account the number of SNP sites for which the two fragments exhibit a mismatch. Two fragments are considered in conflict even if there is a mismatch at only one SNP site. In contrast, the fuzzy conflict graph discussed in this chapter aims to measure the amount of mismatch across all SNP sites of every pair of fragments. For example, consider three fragments f1={− − 000 − −−}, f8={−−111−−−} and f10={−−010−−−} in Figure 3.1. In a binary conflict graph, all the vertices are connected because there is at least one mismatch between every pair of the fragments: three mismatches between f1 and f8, one mismatch between f1 and f10, and two mismatches between f8 and f10. The binary conflict graph, however, treats all the three edges equally. In contrast, the fuzzy conflict graph assigns weights of 1, 0.33, and 0.66 to these edges respectively to guide the partitioning algorithm to group f8 and f10 together.
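As a quick check, the edge weights quoted for f1, f8, and f10 can be reproduced with the hypothetical inter_fragment_distance helper sketched in Section 3.2.1 above (the fragment strings are copied from Figure 3.1(a)):

```python
f1  = '--000---'
f8  = '--111---'
f10 = '--010---'

print(inter_fragment_distance(f1, f8))    # 1.0   -> strongly favors separate partitions
print(inter_fragment_distance(f1, f10))   # 0.333 -> favors the same partition
print(inter_fragment_distance(f8, f10))   # 0.667 -> favors separate partitions
```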

An example of a fuzzy conflict graph based on the fragments listed in Figure 3.1(a) is illustrated in Figure 3.1(c). For visualization, the edges with a pivot distance are not shown. The problem of dividing the fragments into the two most dissimilar groups is essentially a max-cut problem [Aus99]. A max-cut partition may divide the fragments into C1 = {f1, f2, f3, f6, f10} and C2 = {f4, f5, f7, f8, f9} as shown in Figure 3.1. We note that the resulting partition may not be unique in the general case. As will be discussed in more detail in Section 3.2.3, max-cut is an NP-hard problem and existing techniques provide solutions that are highly sub-optimal. Therefore, we will leverage some properties of the introduced fuzzy conflict graphs to develop a heuristic approach for fragment partitioning.

Algorithm 2 FastHap partitioning algorithm
Require: Fuzzy conflict graph G = (V, E, WE)
Ensure: Partition P = [C1, C2] composed of two groups C1 and C2 of fragments
(a) Delete edges with pivot weights from G
(b) Sort the remaining edges el in G based on their weights wl and store the results in list D
(c) Let el = (fi, fk) be the edge with the largest weight in D
(d) Initialize the partition by assigning fi and fk to opposite groups (e.g., C1 = {fi} and C2 = {fk})
while (not all vertices are partitioned) do
  (e) Let el = (fi, fk) be the next edge with the highest weight in D such that fi ∈ P or fk ∈ P
  (f) Let fi ∈ P; if fi ∈ C1, then C2 = C2 ∪ {fk}, otherwise C1 = C1 ∪ {fk}
  (g) Let er = (fi, fk) be the next edge with the lowest weight in D such that fi ∈ P or fk ∈ P
  (h) Let fi ∈ P; if fi ∈ C1, then C1 = C1 ∪ {fk}, otherwise C2 = C2 ∪ {fk}
  (i) If neither el nor er exists, assign each remaining fragment to the more similar set
end while
repeat
  (j) Let MEC be the minimum error correction score for the existing partition
  (k) Let fi be the fragment with the largest MEC among all fragments in P
  (l) If fi ∈ C1 (alternatively fi ∈ C2), move fi to C2 (alternatively to C1)
  (m) Let newMEC be the minimum error correction score for the new partition
until (newMEC ≥ MEC)

3.2.3 Fragment Partitioning

As stated previously, FastHap aims to partition fragments into two disjoint sets such that fragments within each group are most similar and can form a haplotype with minimum MEC. Using the fuzzy conflict graph model presented in Section 3.2.2, a weighted max-cut algorithm needs to be used to find the optimal partition. The max-cut problem, however, is known to be NP-hard even when all edge weights are set to one [GJ90]. All edges in our fuzzy conflict graph have a positive weight. There exist

heuristic algorithms [SG74] that produce a cut with at least half of the total weight of the edges of the graph when all edges have a positive weight. In fact, a simple 1/2-approximate randomized algorithm is to choose a cut at random. This means that each edge el in the fuzzy conflict graph G is cut with a probability of 1/2. Consequently, the expected weight of the edges crossing the cut, W(C1, C2), is given by

$$
W(C_1, C_2) = \frac{1}{2} \sum_{l=1}^{L} w_l \ \geq\ \frac{1}{2}\, \mathrm{OPT}
\qquad (3.3)
$$

This algorithm can be derandomized to obtain a 1/2-approximate deterministic algorithm. There are two major shortcomings with this partitioning algorithm: 1) unfortunately, derandomization works well only on unweighted graphs where all edges have equal/unit weights, and a similar approach for a weighted graph is not guaranteed to run in polynomial time; 2) the obtained partition is highly sub-optimal, with an approximation factor of 1/2. Thus, we introduce a novel heuristic algorithm based on properties of fuzzy conflict graphs.

The FastHap partitioning algorithm is shown in Algorithm 2 and briefly explained as follows. First, the algorithm eliminates all edges with pivot weights from the fuzzy conflict graph G. Such edges do not contribute to the formation of the final partition. The algorithm then sorts all edges of the graph (equivalently, pairs of fragments) based on the edge weights and stores the results in D. An initial partition is formed by placing the two fragments with the largest inter-fragment distance (associated with the heaviest edge in G) into two separate partition sets C1 and C2. In the next phase, the algorithm alternates between the heaviest and lightest edges and assigns adjacent vertices (associated with fragments in X) to the existing partition if either of the vertices is already assigned to the partition. An edge with the highest weight results in placing the adjacent vertices in different partitions, and an edge with the lowest weight attempts to assign the vertices to the same partition in P. This occurs only if the chosen edge is adjacent to an already partitioned vertex.
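The partitioning phase can be summarized in Python as follows. This is only an illustrative reading of Algorithm 2, Phase I, under a few assumptions: ties are broken arbitrarily, the fall-back step (i) is implemented as a nearest-set assignment, and the name fasthap_partition is invented for the example.

```python
def fasthap_partition(frags, dist, pivot=0.5, eps=1e-9):
    """Sketch of Algorithm 2, Phase I: greedy bipartition guided by the fuzzy conflict graph."""
    n = len(frags)
    edges = []
    for i in range(n):
        for k in range(i + 1, n):
            w = dist(frags[i], frags[k])
            if abs(w - pivot) > eps:          # (a) drop edges with pivot weight
                edges.append((w, i, k))
    edges.sort(key=lambda e: e[0])            # (b) list D, sorted by weight

    side = {}                                 # fragment index -> 0 (C1) or 1 (C2)
    if edges:
        _, i, k = edges[-1]                   # (c)-(d) heaviest edge seeds opposite sets
        side[i], side[k] = 0, 1

    def place(from_heavy, opposite):
        """Pick the next heaviest/lightest edge touching the partition and place
        its unassigned endpoint on the opposite/same side."""
        for w, i, k in (reversed(edges) if from_heavy else edges):
            if (i in side) != (k in side):    # exactly one endpoint already partitioned
                known, new = (i, k) if i in side else (k, i)
                side[new] = 1 - side[known] if opposite else side[known]
                return True
        return False

    while len(side) < n:
        hi = place(from_heavy=True, opposite=True)     # steps (e)-(f)
        lo = place(from_heavy=False, opposite=False)   # steps (g)-(h)
        if not (hi or lo):                             # step (i): nearest-set fall-back
            for f in range(n):
                if f not in side:
                    d0 = sum(dist(frags[f], frags[g]) for g in side if side[g] == 0)
                    d1 = sum(dist(frags[f], frags[g]) for g in side if side[g] == 1)
                    side[f] = 0 if d0 <= d1 else 1
    C1 = {f for f, s in side.items() if s == 0}
    C2 = {f for f, s in side.items() if s == 1}
    return C1, C2
```

A usage example would pass the fragment strings of Figure 3.1(a) together with the inter_fragment_distance sketch above as dist.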

26 Theorem 1. Algorithm 2 terminates in polynomial time.

Proof. We prove that the algorithm terminates and that its running time is polynomial. Let M be the total number of edges in the given fuzzy conflict graph. During each iteration, the algorithm attempts to assign two edges (those with the highest and lowest weights that are adjacent to an already partitioned vertex) to the final partition P. Clearly, the iterative loop does not repeat more than M times. In fact, during each iteration at least one edge (i.e., el or er or both) is selected to be added to the final partition. If the algorithm cannot find such an edge, all remaining edges are allocated to the final partition and the algorithm ends. Therefore, the iterations cannot repeat more than M times and the algorithm will terminate after at most M iterations. The proof regarding the computational complexity of the algorithm is as follows. Removal of edges with pivot weight in (a) in Algorithm 2 can be completed in O(M). The process of sorting the edges in instruction (b) can be done in O(M log M). Instruction (c) takes O(1) to complete. The initialization of the partition in (d) can be done in O(1). In (e), detecting the edge with the highest weight and checking whether one of its vertices (i.e., fragments) is already in the partition P require O(1) and O(M), respectively. In (f), assigning the selected edge to the partition requires O(1). Similarly, the instructions in (g) require O(1) and O(M) to finish, and, similar to (f), the instructions in (h) can be done in O(1). The instructions in (i) have a complexity of O(M). Reading the MEC value of the partition in (j) and (m) takes O(1). Instructions (k) and (l) can finish in O(M). Given that the instructions in the loop are executed at most M times, the complexity of the algorithm is O(M²).

3.2.4 Refinement Phase

The second loop in Algorithm 2 shows the second phase of the proposed haplotype reconstruction approach. The idea is to iteratively find the fragment that contributes most to the MEC score and reassign it to the opposite partition. This process repeats as long as the MEC score improves. Our experimental results show that the first phase of the algorithm performs most of the optimization in terms of MEC improvements, leaving minimal improvements for the second phase.
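The refinement loop depends only on being able to score a partition, so a compact sketch of an MEC computation and of Phase II is given below; the helper names and the consensus-based MEC scoring are assumptions of the example, not code from FastHap.

```python
def consensus(group, frags, n_sites):
    """Per-column majority call of a fragment group ('-' where no fragment covers the column)."""
    hap = []
    for j in range(n_sites):
        calls = [frags[f][j] for f in group if frags[f][j] != '-']
        hap.append('-' if not calls else
                   ('0' if calls.count('0') >= calls.count('1') else '1'))
    return ''.join(hap)


def fragment_mec(frag, hap):
    """Number of covered sites where a fragment disagrees with a haplotype."""
    return sum(1 for a, b in zip(frag, hap) if a != '-' and b != '-' and a != b)


def refine(C1, C2, frags):
    """Phase II sketch: move the fragment with the largest MEC contribution to the
    opposite set for as long as the overall MEC score improves."""
    n_sites = len(frags[0])

    def total_mec():
        h1, h2 = consensus(C1, frags, n_sites), consensus(C2, frags, n_sites)
        return (sum(fragment_mec(frags[f], h1) for f in C1) +
                sum(fragment_mec(frags[f], h2) for f in C2))

    best = total_mec()
    while True:
        h1, h2 = consensus(C1, frags, n_sites), consensus(C2, frags, n_sites)
        worst = max(C1 | C2, key=lambda f: fragment_mec(frags[f], h1 if f in C1 else h2))
        src, dst = (C1, C2) if worst in C1 else (C2, C1)
        src.discard(worst); dst.add(worst)
        new = total_mec()
        if new >= best:                 # no improvement: undo the move and stop
            dst.discard(worst); src.add(worst)
            return best
        best = new
```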

3.2.5 Fragment Purging

Since the complexity of FastHap is a function of the number of fragments in the fragment matrix, it is reasonable to attempt to minimize the number of such fragments by eliminating any potential redundancy prior to execution of the main algorithm. Therefore, FastHap uses a preprocessing phase during its initialization to combine those fragments that are highly similar. Fortunately, the inter-fragment distance measure provides a means to assess similarity between every two fragments. The criterion for combining two fragments fi and fk is based on the inter-fragment distance ∆(fi, fk) and a given threshold α. The two fragments are merged if

$$
\Delta(f_i, f_k) \leq \alpha
\qquad (3.4)
$$

The purging process is straightforward: it eliminates the shorter of the two fragments from the fragment matrix. The value of α needs to be set based on the quality of the data. For the dataset used in the experiments in this chapter, α was set experimentally; the experiments indicated that α = 0.2 provides the best performance.
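A minimal sketch of the purging step follows, under the simplifying assumption that "merging" two near-duplicate fragments amounts to dropping the shorter one (the fragment covering fewer sites), as the text describes; the function name is invented for the example.

```python
def purge_fragments(frags, dist, alpha=0.2):
    """Drop the shorter fragment of every pair whose inter-fragment distance is at most alpha."""
    removed = set()
    for i in range(len(frags)):
        for k in range(i + 1, len(frags)):
            if i in removed or k in removed:
                continue
            if dist(frags[i], frags[k]) <= alpha:
                cov_i = sum(c != '-' for c in frags[i])
                cov_k = sum(c != '-' for c in frags[k])
                removed.add(k if cov_k <= cov_i else i)
    return [f for idx, f in enumerate(frags) if idx not in removed]
```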

3.3 Validation

HuRef [Ven14], a publicly available dataset, was used to demonstrate the effectiveness of FastHap for individual haplotype reconstruction. The goal was to assess the performance of FastHap in terms of both accuracy and speed in comparison with HapCut [BB08] and the greedy algorithm in [LSN07]. The main reason for choosing these two algorithms was that they have been historically popular in terms of accuracy and computational complexity. All experiments were run on a Linux x86 server with 16 CPU cores of 2.7GHz and 16GB of RAM. Each algorithm performed per-block haplotype reconstruction, where each block consisted of the reads that do not cross adjacent blocks. Although haplotype assembly solutions cannot do better than a random guess between two consecutive variant sites that do not share any fragments, the effort in this work was to provide algorithms that are appropriate for longer reads at each end of a paired alignment and ample insert size, in order to minimize disconnection between different haplotype blocks.

3.3.1 Dataset

The HuRef dataset used for analysis in this chapter contains reads for all 22 chromosomes of an individual, J.C. Venter. The data includes 32 million DNA short reads generated by the Sanger sequencing method, with 1.85 million genome-wide heterozygous sites. Many of the reads are fairly short, approximately 15bp (each end), while tens of thousands of reads are long enough to cover more than two SNP sites and can be used for haplotype assembly purposes. In fact, many fragments within each block span several hundred SNP sites due to the paired-end nature of the aligned reads. The fragment matrix used for haplotype assembly was generated from short reads aligned with the paired-end method, with each end varying in length from 15bp to 200bp, while the insert length follows a normal distribution with a mean of 1000.

Figure 3.2(a) shows the read coverage for each chromosome. Read coverage numbers are calculated by taking an average over the coverage values of all SNP sites within each chromosome. The coverage varies from 6.49 reads for chromosome 19 to 8.72 reads for chromosome 3. The average genome-wide coverage across all chromosomes is 7.43. Figure 3.2(b) shows the distribution of the coverage for chromosome 20 (as an example), which includes a total of 39767 SNP sites. The coverage numbers range from 1 to 20 reads. Only 2 SNP sites had a coverage of 20. The average coverage for chromosome 20 was 6.83.

[Figure 3.2: Coverage of HuRef dataset. (a): Coverage for each chromosome; numbers vary from 6.49 to 8.72 for various chromosomes with an average genome-wide coverage of 7.43. (b): Histogram of coverage for chromosome 20 as an example; the y-axis shows the number of SNPs with each specific coverage shown on the x-axis.]

Figure 3.3 shows several statistics on the haplotype length of various chromosomes in the HuRef dataset. Figure 3.3(a) shows the chromosome-wide haplotype length, equivalently the total number of SNP sites, for each chromosome. As mentioned previously, each chromosome is divided into non-overlapping blocks. The haplotype length of such blocks may vary significantly from one chromosome to another. For example, Figure 3.3(b) shows the distribution of haplotype length for a subset of chromosomes with 'small', 'medium', and 'large' haplotypes. For instance, chromosome 8 has a number of blocks spanning over 2500 SNPs. In contrast, haplotypes in chromosome 18 barely exceed 1000 SNP sites.

In addition to running FastHap on real HuRef data, we constructed several simulated read matrices based on HuRef data [BB08]. A simulated dataset based on real data allows us to assess the performance of the proposed algorithm and probe its capabilities by changing various parameters (e.g., error rate, coverage, and haplotype length or block width). To assess the accuracy of FastHap, a pair of chromosome copies was simulated based on real fragments and consensus SNP sites provided by HuRef data.

[Figure 3.3: Chromosome-wide haplotype length for each chromosome (a) and histogram of per-block haplotype length for chromosomes 8, 17, and 18 as examples of chromosomes with 'small', 'medium', and 'large' blocks respectively.]

The fragment matrix for each chromosome of the HuRef data was suitably modified to first generate an 'error free' matrix. This was accomplished by modifying alleles in each fragment such that it perfectly matches a predefined haplotype. In order to introduce errors into the fragment matrix, each variant call was flipped with a probability of ε ranging from 0 to 0.25. The fragment matrix was also modified to produce matrices of different coverage levels. Another change to the simulated fragment matrix was to generate blocks of varying haplotype length, ranging from 200 to 1000 SNPs. Such matrices were then used to examine how the performance of the different algorithms (i.e., FastHap, Greedy, HapCut) changes as a result of changes in error rate, coverage, and haplotype length.
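The error-injection step used for the simulated matrices can be sketched as follows; the string encoding of fragments and the function name are assumptions carried over from the earlier sketches.

```python
import random

def inject_errors(frags, eps, seed=0):
    """Flip each non-missing variant call with probability eps, leaving '-' entries untouched."""
    rng = random.Random(seed)
    noisy = []
    for frag in frags:
        flipped = [('1' if c == '0' else '0') if c != '-' and rng.random() < eps else c
                   for c in frag]
        noisy.append(''.join(flipped))
    return noisy
```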

3.3.2 Results

Table 3.1 shows speed and accuracy results for all chromosomes of the HuRef dataset using Greedy, HapCut, and FastHap. As can be observed from the timing values, FastHap is significantly faster than both Greedy and HapCut. In particular, FastHap is up to 16.4 times faster than HapCut and up to 15.1 times faster than Greedy.

[Figure 3.4: Effect of error rate and coverage on performance of FastHap, Greedy, and HapCut. The analysis was performed on chromosome 20 (randomly selected) of the HuRef dataset. MEC of the three algorithms under comparison as a function of error rate (a); execution time of the algorithms as a function of coverage (b).]

The average speedup achieved by FastHap is 7.4 and 8.1 compared to Greedy and HapCut respectively. In terms of accuracy performance, FastHap achieves 35.4% and 1.9% improvement in reducing MEC scores compared to Greedy and HapCut respectively.

A number of parameters affect the speed performance of the different algorithms. In particular, the number of SNP sites within each fragment matrix is an important factor in many well-known algorithms such as HapCut. One advantage of FastHap is that its performance is primarily influenced by the number of fragments in the fragment matrix rather than the number of SNP sites. That is, a higher read coverage allows FastHap to generate better accuracy without significant impact on its running time. In contrast, as the haplotype length grows, HapCut runs very slowly compared to FastHap. As shown in Table 3.1, HapCut is very slow when applied to chromosome 8, primarily due to the large haplotype length. This is also confirmed through Figure 3.3(b), which shows that chromosome 8 contains blocks that span over 2500 SNPs. In contrast, chromosome 18, for example, can be reconstructed much faster when HapCut is utilized. Figure 3.3(b) shows that most of the blocks for chromosome 18 span less than 1000 SNP sites.

[Figure 3.5: Speed performance of FastHap, Greedy, and HapCut as a function of haplotype length. Analysis was performed on chromosome 20 (randomly selected) of the HuRef dataset. Execution time as a function of haplotype length (a); amount of speedup achieved by FastHap compared to Greedy and HapCut (b).]

Figure 3.4(a) shows the MEC score per variant call versus the simulated error rate obtained by each of the three algorithms. The average MEC (normalized by the number of variant calls) was 2.48, 2.56, and 2.86 for FastHap, HapCut, and Greedy respectively. The amount of improvement in MEC using FastHap was 13% and 2.8% compared to Greedy and HapCut respectively. Figure 3.4(b) shows the running time of the three algorithms as the coverage varies from 5 to 20. For this experiment, the fragment matrix was carefully modified to obtain the coverage needed for the analysis. Furthermore, the obtained matrix was first made 'error free'. We then flipped the variant calls with a probability of ε = 0.25 for this analysis.

In order to assess the running time of the different algorithms with respect to changes in haplotype length, fragment matrices with different numbers of columns were built as explained previously in Section 3.3.1. Figure 3.5(a) shows the execution time of the three algorithms as the partial haplotype length grows from 200 to 1000 SNPs. For this analysis, an error rate of ε = 0.25 was used. We note that the results are shown only for one block of data. It can be observed that the running time of HapCut increases significantly as the block length grows. That is, while HapCut can build a partial haplotype of length 200 in 25 seconds, its running time increases to 784 seconds when the length of the haplotype increases to 1000 SNPs.

In order to demonstrate the superiority of the FastHap partitioning algorithm over a random partitioning, a subset of the dataset was selected at random. We ran both the FastHap and random partitioning algorithms on the same fragment matrix 10 times and calculated the percentage of improvement in MEC achieved by FastHap. The improvement numbers ranged from 12.17% to 31.64% with an average improvement of 19.13%.

3.4 Discussion and Conclusion

In this chapter, the design, implementation, and validation of FastHap were presented. A novel dissimilarity metric was introduced to quantify inter-fragment distance based on the contribution of individual fragments in building a final haplotype. The notion of a fuzzy conflict graph was presented to model haplotype reconstruction as a max-cut problem. A fast heuristic fragment partitioning technique was introduced using the graph model. The technique was shown to lower the computational complexity of haplotype reconstruction dramatically compared to the state-of-the-art algorithms, while moderately improving on the accuracy of such algorithms. We compared FastHap with two well-known haplotype reconstruction algorithms, namely Levy's greedy algorithm and HapCut. The greedy algorithm is historically known for its high speed, and it also outperforms the accuracy of other computationally simple and greedy algorithms such as FastHare [PS04]. HapCut, in contrast, is popular for its high accuracy, but demands much higher computational resources compared to Greedy. The experiments showed that FastHap is one order of magnitude faster than HapCut and is up to 7 times faster than the greedy approach.

Table 3.1: Comparison of FastHap with Greedy and HapCut in terms of accuracy (MEC) and execution time using the HuRef dataset. FastHap achieves speedups of 16.4 and 15.1 compared to HapCut and Greedy respectively and is 1.9% and 35.4% more accurate than HapCut and Greedy respectively. Statistics on coverage and haplotype length are shown in Figure 3.2 and Figure 3.3 and further discussed in Section 3.3.1.

Chr | Time (minutes): Greedy, HapCut, FastHap | Speedup using FastHap: vs. Greedy, vs. HapCut | MEC: Greedy, HapCut, FastHap
1  | 606, 880, 183   | 3.3, 4.8   | 29657, 19750, 19423
2  | 1001, 2446, 149 | 6.7, 16.4  | 22980, 14677, 14220
3  | 1809, 1053, 188 | 9.6, 5.6   | 16878, 10738, 11794
4  | 542, 694, 63    | 8.6, 11.0  | 18153, 11931, 11812
5  | 1381, 3229, 282 | 4.9, 11.4  | 16590, 10630, 10362
6  | 681, 750, 109   | 6.2, 6.9   | 15587, 9992, 9870
7  | 456, 604, 76    | 6.0, 7.9   | 17402, 11290, 11245
8  | 5052, 4514, 334 | 15.1, 13.5 | 14887, 9845, 10830
9  | 2006, 1747, 293 | 6.8, 6.0   | 13812, 9318, 9204
10 | 667, 1445, 170  | 3.9, 8.5   | 15291, 9906, 9796
11 | 332, 288, 69    | 4.8, 4.2   | 12906, 8294, 8091
12 | 1303, 1638, 165 | 7.9, 9.9   | 12630, 8297, 7467
13 | 428, 761, 158   | 2.7, 4.8   | 9312, 6131, 6143
14 | 3315, 1919, 383 | 8.7, 5.0   | 9734, 6360, 5725
15 | 907, 1137, 208  | 4.4, 5.5   | 13988, 9783, 9695
16 | 157, 248, 42    | 3.7, 5.9   | 12621, 8354, 8215
17 | 2223, 2790, 246 | 9.0, 11.3  | 11157, 7398, 7386
18 | 698, 798, 87    | 8.0, 9.2   | 8578, 5043, 4846
19 | 309, 501, 41    | 7.5, 12.2  | 8214, 5497, 4886
20 | 326, 348, 32    | 10.2, 10.9 | 5752, 3784, 3437
21 | 482, 154, 48    | 10.0, 3.2  | 6611, 4715, 4707
22 | 535, 128, 39    | 13.7, 3.3  | 8295, 5864, 5875
Overall | 25217, 28074, 3365 | 7.4, 8.1 | 301035, 197597, 195029
(Overall row: time and MEC values are sums over all chromosomes; speedup values are averages.)

CHAPTER 4

Association Rule Learning for Diploid Haplotyping

4.1 Introduction

Advances in DNA sequencing technologies have led to a generation of increasingly accurate and affordable datasets, which enable single individual phasing for more accurate diagnostics. Phasing is extremely important for genetic studies because haplotypes, which can be constructed from sequencing data, are associated with the presence/absence of genetic diseases [Lan13]. DNA phasing in diploids (i.e., humans) has proved effective in many application areas in individual as well as population genetics [TBT11]. Haplotype assembly or DNA phasing based on high-throughput sequencing data has been studied in the past few years [BB11] based on experimental and computational methods. The topic, however, remains an active research area due to challenges associated with both experimental and computational approaches to haplotype assembly. While experimental phasing techniques are less practical on large datasets, computational approaches have received more attention because of their practical feasibility and cost-efficiency. As sequencing technologies continue to provide larger datasets with higher coverage and longer reads, new methods are warranted to deal with the computational complexity of the data.

Chapter 3 presented the development of FastHap, a highly scalable diploid haplotyping framework, which performs fragment partitioning based on a new inter-fragment distance metric. Although FastHap demonstrated fast and accurate results compared to other algorithms, it is dependent on the input data in terms of coverage, read length, etc. Some factors, such as coverage and read length, are improving due to advancements in sequencing technologies. Other factors, such as the similarity/dissimilarity between pairs of SNPs, are affected by the amount of overlap between each pair of fragments, and FastHap has limited ability to reveal those similarities. Therefore, we aim to investigate the development of another diploid haplotype assembly framework based on the concept of association rule mining. This approach is inspired by advances in the field of association analysis. A large amount of data accumulated from day-to-day operations in many different businesses is commonly analyzed through a methodology known as association analysis. Discovering interesting relationships hidden in large datasets is achievable using this approach. The focus is on discovering strong rules using some measures of interestingness. Mining for association rules that unveil complex dependencies among items in large databases of sales transactions has been described as an important data mining problem [Ols17, KK16, BMU97, KK06, HGN00]. In market basket analysis, association rules identify the set of items that are most often purchased with another set of items. For example, an association rule may state that "95%" of customers who bought items 'A' and 'B' also bought 'C' and 'D', and denote this relationship as {A, B} → {C, D}. Association rule mining has been utilized in many other application domains such as web usage mining [SCD00], intrusion detection [LSM99], production process improvement [KRM13], and in bioinformatics [FHT04] for gene expression data analysis [BBJ02, CH03] and protein-protein interaction analysis [PRG09]. To the best of our knowledge, however, devising an association rule learning framework for haplotype phasing has not been studied in the past. The motivation for developing an association-rule-based approach for haplotype phasing is that DNA short reads that are drawn from the same haplotype maintain dependencies among the SNP sites that they cover. Consequently, the hypothesis is that discovering such dependencies from a collection of short reads will allow us to assemble the original haplotypes more accurately.

37 4.1.1 Contributions and Summary of Results

The following contributions are made through design, development, and validation of ARHap.

• A novel approach, called ARHap, shown in Figure 4.1 and Algorithm 3, is presented for reconstructing haplotypes based on association rules generated from a set of short reads. The framework is composed of two main modules or processing phases. In the first phase, called association rule learning, strong patterns are discovered from the dataset provided by the individual's sequencing data. In the next phase, called haplotype reconstruction, an approach for utilizing the strong rules produced in the first phase is developed to construct haplotypes at individual SNP sites. Similar to frequent itemsets in market basket analysis, we refer to frequent alleles in each SNP position, or combined frequent alleles in multiple sites, as a frequent snpset. We note that the term frequent refers to the frequent allele in a collection of fragments for an individual and should not be confused with the frequent allele at a SNP site in a population.

• An approach for incremental haplotype reconstruction will be presented. This enables us to generate association rules according to the unreconstructed SNP sites during each round of the algorithm. The two modules are synergistically designed for efficient data processing. During each round, the association rule learning module generates rules while constraining the length of the rules (i.e., the number of participating snpsets) and limiting the rules to those that contribute to the reconstruction of sites that remain unreconstructed thus far. The framework begins by generating rules that are small (e.g., only those of length 2) and highly strong (i.e., according to minimum support and minimum confidence criteria) and reconstructing haplotypes accordingly. The rule length can increase gradually, and/or the criteria on the strongness of the rules can be adjusted, during subsequent rounds if some SNP sites have remained unreconstructed. This adaptive approach, which uses feedback from the haplotype reconstruction module, eliminates the generation of rules that do not contribute to haplotype reconstruction.

• Theoretical bounds on likelihood of generated rules being erroneous will be derived. This is an important contribution because it allows us to estimate error probability of generated rules based on the amount of sequencing error and parameters of the association rule learning module such as rule length and minimum support criterion.

• It will be demonstrated that standard measures of rule interestingness cannot be directly applied to rule-based haplotype assembly. In particular, a novel mea- sure of confidence will be developed for the rules generated by the association rule learning module. We also show that this new measure of confidence allows defining a robust minimum confidence criterion, which eliminates conflicting rules from being included in the final rule set.

• An optimization problem will be developed to reconstruct haplotypes based on the generated rules such that the amount of disagreement between generated rules and reconstructed haplotypes is minimized.

• A graph model, called a dependency graph, will be developed to not only represent the dependency between individual SNP sites but also encapsulate inter-rule dependencies, so that a chain of association rules can be extracted from the graph for haplotype reconstruction.

• The framework will be evaluated extensively and its effectiveness in reducing switching error of several competing methods will be demonstrated.

We have conducted a comprehensive analysis to assess the performance of our framework in reconstructing haplotypes of varying ploidy levels. Table 4.1 shows the amount of improvement in SWitching Error (SWE) achieved by ARHap in comparison with other methods. It demonstrates that ARHap substantially improves over the other algorithms. Also, in terms of running time, ARHap manages to stay as close as possible to FastHap while improving both the SWE and MEC scores. We will discuss the details of the data and all accuracy measures in Section 4.4.

Table 4.1: Performance of ARHap on SWE

Algorithm    | ARHap | FastHap | HapCut | Greedy
Overall      | 0.12% | 0.26%   | 0.27%  | 0.36%
Improvement  | -     | 53.8%   | 55.5%  | 66.7%

[Figure 4.1: Association rule haplotyping (ARHap) framework. Each round of association rule haplotyping is composed of two phases: an association rule learning phase (matrix binarization, frequent snpset generation with Supp > minSupp and |X ∪ Y| ≤ l, and SNP association rule generation with Confhap > minConf and |Y| = 1) and a haplotype reconstruction phase (dependency graph construction, longest attribute-compatible path extraction, and haplotype update). The rule length l and minSupp are updated between rounds. This two-phase process may continue for multiple rounds until all SNP positions on the haplotype set are reconstructed.]

Algorithm 3 High level description of the ARHap algorithm
Require: SNP sites S, minSuppUpperBound, minSuppLowerBound, fragment matrix X, initial haplotypes H0
Ensure: H
1: H = H0
2: Y = CreateBinaryMatrix(X) {binarization}
3: U = S {initialize U, the list of unreconstructed SNP sites}
4: l = 2 {start with rules of length 2}
5: minSupp = minSuppUpperBound {initialize minSupp}
6: while (U ≠ ∅) do
7:   while (minSupp > minSuppLowerBound) do
8:     SNPSET = GenerateFrequentSnpset(Y, U, l, minSupp)
9:     R = GenerateSNPAssociationRules(SNPSET)
10:    G = ConstructDependencyGraph(R)
11:    P = ExtractLongestAttributePath(G)
12:    H = UpdateHaplotype(P)
13:    U = U \ <ReconstructedSNPSites> {exclude reconstructed SNP sites from further investigation}
14:    minSupp = minSupp − 1
15:   end while
16:   l = l + 1 {increase rule length}
17: end while

4.2 Association Rule Learning

Similar to FastHap, ARHap takes as input a standard matrix called a fragment matrix, composed of short reads obtained by DNA sequencing. In the fragment matrix, each row corresponds to a short read and each column is associated with a SNP site. To extract SNP positions, the short DNA reads are assumed to be mapped onto a reference sequence prior to their being used for haplotype assembly. The alignment is often done using a sequence aligner [LS12, LD10], which may introduce errors into the resulting short reads. After aligning the fragments to the reference sequence, most of the loci have identical values in all the fragments that cover the region. These sites are referred to as homozygous sites, as discussed in Chapter 3. The remaining sites have different values in some of the fragments. Such sites are known as heterozygous sites and are the primary source of information for haplotype reconstruction.

A heterozygous site can be one single base pair, in which case it is known as SNP, or can be more than one base pair. For haplotype reconstruction, ARHap assumes that all homozygous sites are discarded from inclusion in the fragment matrix and we only focus on single variant calls (i.e., SNPs). For each locus of the fragment, only the most popular alleles are maintained and rare alleles are labeled as ‘missing’ and substituted with ‘−’ in the fragment matrix. Let {x1j, x2j, x3j ... , xmj} be the elements associated with the j-th column in fragment matrix X . If for example, the two most popular alleles are ‘A’ and ‘C’, they are encoded in binary. In case there is another allele in addition to ‘A’ and ‘C’ within the same column, that allele is discarded and assigned symbol ‘−’ in the fragment matrix.

Prior to generating association rules, one needs to transform this fragment matrix representation of the reads into a format that is appropriate for association rule learning. Such a matrix transformation is discussed in Section 4.2.1.

4.2.1 Matrix Binarization

The input to an association rule learning problem is often a binary matrix where each row is associated with a transaction and each column corresponds to an item. The terms 'transaction' and 'item' are typically used in market basket analysis. An item can be treated as a binary variable whose value is '1' if the item is present in the transaction and '0' otherwise. A fragment matrix X, however, contains elements xij that are drawn from the alphabet Σ = {0, 1, −}. The first step in developing an association rule framework for haplotyping is therefore to represent X as a fully binary matrix while preserving all information contained in X, i.e., including information about missing values.

Definition 4 (Binary Fragment Matrix). Let Xm×n be a given fragment matrix that contains SNP values, xij ∈ {0, 1, −}, obtained from m fragments, F = {f1, ..., fm}, pertaining to n SNP sites. The matrix Ym×2n, the binary representation of X, is computed as follows.

$$
Y = [\,X^0 \ X^1\,]
\qquad (4.1)
$$

where X⁰ and X¹ are binary matrices of the same dimension as X (i.e., m × n), each element x⁰ij in X⁰ is computed by

$$
x^0_{ij} =
\begin{cases}
1 & \text{if } x_{ij} = 0\\
0 & \text{otherwise}
\end{cases}
\qquad (4.2)
$$

and each element x¹ij in X¹ is given by

$$
x^1_{ij} =
\begin{cases}
1 & \text{if } x_{ij} = 1\\
0 & \text{otherwise}
\end{cases}
\qquad (4.3)
$$

The binarization process in Definition 4 is in fact a 1-to-2 mapping of elements in X to those in Y. The mapping guarantees that all missing values (i.e., '−' symbols) in X are presented as '0's in Y. Furthermore, all non-missing values (i.e., xij = '0'/'1') in X are now presented as '1's, but in designated columns with respect to the SNP site, in Y. This is accomplished by expanding the number of columns of X (i.e., n) to 2 × n in Y so that for each SNP site in X there exist two distinct columns in Y, one for presenting '0's and one for denoting '1's. We note that xij = '−' if and only if yij = yik = 0 where k = n + j.

An example of the binarization process is shown in Figure 4.2. A fragment matrix containing 9 fragments of length 5 and drawn from the haplotype H = {h0, h1}, where h0 = '11111' and h1 = '00000', is shown in Figure 4.2(a). As shown in red in the figure, the fragment matrix carries one erroneous value at x2,2. Figure 4.2(b) shows how '0' cells in X are replaced with '1's in X⁰. Similarly, Figure 4.2(c) shows how cells with a value of '1' in X are uniquely represented in X¹. Finally, the two matrices are column-wise concatenated in Y as shown in Figure 4.2(d).

[Figure 4.2: An example of matrix binarization. The fragment matrix X containing 9 fragments drawn from the haplotype set H = {h0, h1}, where h0 = '11111' and h1 = '00000', is shown in (a). X is decomposed into two matrices X⁰ and X¹ as shown in (b) and (c). The binary fragment matrix Y = [X⁰ X¹] is a column-wise concatenation of X⁰ and X¹, shown in (d).]
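The 1-to-2 mapping of Definition 4 is mechanical enough to be shown directly; the sketch below assumes the same string encoding of fragments used earlier and returns Y as a list of 0/1 rows.

```python
def binarize(frags):
    """Binary fragment matrix Y = [X0 X1] (Definition 4): column j flags a '0' call
    at SNP site j+1, and column n + j flags a '1' call at the same site."""
    n = len(frags[0])
    Y = []
    for frag in frags:
        x0 = [1 if c == '0' else 0 for c in frag]
        x1 = [1 if c == '1' else 0 for c in frag]
        Y.append(x0 + x1)
    return Y
```

For instance, binarize(['101--']) returns [[0, 1, 0, 0, 0, 1, 0, 1, 0, 0]], which corresponds to the second row of Y in Figure 4.2(d).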

Table 4.2: Strong association rules of size l = 2 for the fragment matrix in Figure 4.2.

No.  R            Supp(R)  Conf_con(R)  Conf_hap(R)
1    s1 → s3      0.33     1            1
2    s5 → s3      0.22     1            1
3    s4 → s3      0.22     1            1
4    s2 → s3      0.22     1            1
5    s̄5 → s̄1      0.22     1            1
6    s̄4 → s̄2      0.22     0.67         0.75
7    s̄1 → s̄2      0.22     0.67         0.75
8    s̄4 → s̄1      0.22     0.67         0.75
9    s̄1 → s̄4      0.22     0.67         0.75
10   s̄1 → s̄5      0.22     0.67         0.75
11   s3 → s1      0.33     0.60         0.71
12   s̄2 → s̄4      0.22     0.50         0.67
13   s̄2 → s̄1      0.22     0.50         0.60
14   s3 → s5      0.22     0.40         0.63
15   s3 → s4      0.22     0.40         0.63
16   s3 → s2      0.22     0.40         0.57

45 4.2.2 SNP Association Rules

While the original fragment matrix X carries xij ∈ {0,1,−}, the binary fragment matrix Y contains only binary values, i.e., yij ∈ {0,1}. Each column in Y is not only associated with a particular site but also indicates whether the site contained '0's or '1's in X. For example, the first column of Y in Figure 4.2(d) refers to the fragments that cover the first SNP site (the first column in X) with a value of '0'. Such fragments have a value of '1' in the first column of Y. As a result, each column in Y has a unique attribute with respect to the SNP site to which it relates. These attributes are referred to as SNP attributes and are defined more formally as follows.

Definition 5 (SNP Attribute). For each SNP site j ∈ {1, ..., n} in X, we define two binary attributes s̄j and sj. The attribute s̄j, associated with the j-th column in Y, has '1's for all fragments that cover the j-th SNP site with a '0' in X. Similarly, sj, associated with column j + n in Y, has '1's for all fragments that cover the site with a '1' in X. The set of all such attributes forms the SNP Attribute Set S = {s̄1, s̄2, ..., s̄n, s1, s2, ..., sn}.

Definition 6 (snpset). Given SNP Attribute Set S associated with binary fragment matrix Y, any collection of one or more attributes t ∈ S forms a snpset.

Definition 7 (SNP Association Rule). An association rule, R, is an implication expression of the form X → Y, where X and Y are disjoint snpsets, i.e., X ∩ Y = ∅ and X, Y ⊆ S. The snpsets X and Y are also referred to as the antecedent and consequent of the rule, respectively.

The SNP attribute set for the fragment matrix of Figure 4.2 is given by S = {s̄1, s̄2, s̄3, s̄4, s̄5, s1, s2, s3, s4, s5}. A subset of rules for this dataset is shown in Table 4.2. For example, s1 → s3 is a rule supported by the first 3 fragments. Those fragments cover both s1 and s3 because they all have a value of '1' in the columns associated with s1 and s3 (i.e., the 6th and 8th columns in Y).

46 4.2.3 Measures of Rule Interestingness

Generating all possible rules is computationally infeasible and introduces many redundant rules. Therefore, only rules that are interesting with respect to a particular application are desirable. In this chapter, several measures of interestingness are defined for use in the ARHap framework. Later in this chapter, we will elaborate on how these measures are used in the introduced framework to optimize the number and quality of generated rules.

The strength of a rule can be measured using support and confidence.

Definition 8 (Support). Support refers to how often a rule R: X → Y is applicable to the particular fragment matrix and thus represents the probability that a fragment in Y covers the attributes in both snpsets X and Y. The support of rule X → Y is given by

$$
Supp(X \rightarrow Y) = \frac{\delta(X \cup Y)}{m}
\qquad (4.4)
$$

where δ(X ∪ Y) denotes the number of fragments that cover the attributes in both X and Y, and m represents the total number of fragments in the input matrix.

For example, for the fragment matrix in Figure 4.2, rule s1 → s3 has a support of 3/9 = 0.33 because there are 3 fragments (i.e., f1, f2, and f3) that cover both positions j = 1 and j = 3 with a value of '1'. As another example, consider the case where X = {s̄2, s̄3} and Y = {s̄4}. In this case, the rule {s̄2, s̄3} → s̄4 has a support value of 1/9 because there is only one fragment (i.e., f6) in X that covers sites j = {2, 3, 4} with a value of '0'. As a result, rule {s̄2, s̄3} → s̄4 is less frequent compared to s1 → s3.

The role of the Frequent Snpset Generation block in Figure 4.1 is to identify rules with sufficiently high support. In Section 4.2.4, we will discuss a minSupp threshold for eliminating infrequent rules.

Based on the conventional definition of confidence, confidence determines how frequently the SNP attributes in Y appear in fragments that contain X. Therefore, the conventional definition of confidence computes the probability that a fragment covers the attributes in Y given that it covers the attributes in X. We refer to this measure as Conf_con and compute it as follows.

$$
Conf_{con}(X \rightarrow Y) = \frac{\delta(X \cup Y)}{\delta(X)}
\qquad (4.5)
$$

where δ(X) denotes the number of fragments covering the attributes in X. For example, rule s3 → s2 has a confidence measure of 2/5 according to this definition. As shown in Figure 4.2(a), there are 5 fragments that cover position j = 3 with a value of '1'. Out of these five fragments, only 2 fragments cover position j = 2 with a value of '1'. A major problem with the conventional definition of confidence is that it discards the impact of 'missing' values when estimating the confidence of a rule. In other words, this definition assumes that all three fragments that do not cover position j = 2 with a '1' cover this site with a '0'. We note that the likelihood of a 'missing' value being '0' is not higher than it representing a '1'. As a result, the conventional definition of confidence is biased by 'missing' values in the consequent of the rule. Another major issue with this definition is that, because the number of '−' symbols associated with the consequent of a rule is not known a priori (or at least it is not fixed for all SNP sites), we cannot come up with a meaningful threshold on minimum confidence (i.e., minConf) when generating strong association rules. Due to these limitations of conventional confidence, we define a more robust measure of confidence for haplotype phasing in this chapter. This new measure of confidence is referred to as Conf_hap and is defined as follows.

Definition 9 (Confidence). The confidence of a rule R: X → Y is given by

$$
Conf_{hap}(X \rightarrow Y) = \frac{\delta(X) - \delta(X \cup \bar{Y})}{2\,\delta(X) - \delta(X \cup Y) - \delta(X \cup \bar{Y})}
\qquad (4.6)
$$

where Ȳ denotes the complement of the consequent snpset Y.

For example, rule s3 → s2 in Table 4.2 has a confidence value of (5 − 1)/(10 − 2 − 1) = 0.57 according to this new definition. Later in this chapter we will show that this definition is robust enough that one can avoid generating conflicting rules by enforcing a predefined minimum confidence threshold in the SNP Association Rule Generation phase.
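A small sketch of the support and confidence computations on the binary matrix Y from the binarization sketch above follows. Attributes are represented by 0-based column indices, so columns 0..n−1 correspond to s̄1..s̄n and columns n..2n−1 to s1..sn, following Definition 5; the helper names are invented for the example.

```python
def supporting(Y, attrs):
    """Rows of Y (fragments) that have a '1' in every column listed in attrs."""
    return [r for r, row in enumerate(Y) if all(row[a] == 1 for a in attrs)]


def supp(Y, X, T):
    """Support (4.4) of rule X -> T: fraction of fragments covering all attributes of X and T."""
    return len(supporting(Y, X | T)) / len(Y)


def conf_con(Y, X, T):
    """Conventional confidence (4.5); assumes at least one fragment supports X."""
    return len(supporting(Y, X | T)) / len(supporting(Y, X))


def conf_hap(Y, X, T, n):
    """Haplotype-aware confidence (4.6). The complement of attribute a is (a + n) % (2 * n)."""
    T_bar = {(a + n) % (2 * n) for a in T}
    d_x, d_xy, d_xyb = (len(supporting(Y, X)),
                        len(supporting(Y, X | T)),
                        len(supporting(Y, X | T_bar)))
    denom = 2 * d_x - d_xy - d_xyb
    return (d_x - d_xyb) / denom if denom else 0.0
```

On the matrix of Figure 4.2, conf_hap(binarize(frags), {7}, {6}, 5) evaluates to 4/7 ≈ 0.57, reproducing the s3 → s2 row of Table 4.2.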

4.2.4 Rule Generation Criteria

ARHap generates rules that simultaneously satisfy a minimum support and a minimum confidence criterion. Rule generation has two separate steps. In the first step, a minimum support threshold is applied to find frequent snpsets of a certain length (i.e., Supp(X ∪ Y) ≥ minSupp and |X ∪ Y| ≤ l) in the fragment matrix Y. Then a minimum confidence constraint is applied to these frequent snpsets to form rules. Finding all frequent snpsets can be difficult because it involves finding all possible combinations of SNP attributes. The set of all possible snpsets has a size of 2^l − 1, where 2 ≤ l ≤ 2n. Although the size of the set grows exponentially in the number of attributes, an efficient search is possible using the anti-monotonicity property of support, which guarantees that all subsets of a frequent snpset are also frequent and thus all supersets of an infrequent snpset must also be infrequent. Exploiting this property, efficient algorithms such as Apriori [AS94], Eclat [Zak00], or FP-Growth [HPY00] can find all frequent snpsets.

As discussed previously, ARHap introduces constraints during association rule learning, as shown in Figure 4.1, to limit the length of the rules to a value of l and to adjust the minSupp threshold in each iteration. Given that the ARHap framework constructs haplotypes incrementally, the constraint |X ∪ Y| ≤ l allows generating only a small set of rules during each iteration of the framework. During each iteration, snpsets identified as infrequent during previous iterations are used to eliminate superset snpsets, which are also naturally infrequent as discussed previously.

ARHap criteria for choosing frequent snpsets and creating rules of the form X → Y are given by (4.7), (4.8), (4.9), and (4.10).

$$
Supp(X \rightarrow Y) \ \geq\ \frac{1}{m} + \alpha
\qquad (4.7)
$$

where 0 ≤ α < 1. This criterion guarantees that each rule is supported by at least ‘1 + mα’ fragments.

$$
Conf_{hap}(X \rightarrow Y) \ >\ \frac{1}{2}
\qquad (4.8)
$$

|Y | = 1 (4.9)

|X ∪ Y | ≤ l (4.10)

Definition 10 (Strong Rule). A given rule R: X → Y is strong if it satisfies the support, confidence, and consequent length criteria in (4.7), (4.8), and (4.9). We refer to any rules that violate any of these criteria as weak rules.
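Putting the criteria (4.7)-(4.9) together, strong rules of length at most l can be enumerated as in the brute-force sketch below. The real framework grows frequent snpsets incrementally with Apriori-style pruning, as discussed above; this sketch reuses the hypothetical supporting and conf_hap helpers from the earlier example, and the parameter defaults are illustrative only.

```python
from itertools import combinations

def strong_rules(Y, n, l=2, alpha=None):
    """Enumerate rules X -> {t} with |X ∪ {t}| <= l that satisfy the support criterion (4.7),
    the confidence criterion (4.8), and the single-attribute consequent criterion (4.9)."""
    m = len(Y)
    if alpha is None:
        alpha = 1.0 / m                       # value discussed in Section 4.2.4.1
    min_supp = 1.0 / m + alpha
    rules = []
    for size in range(2, l + 1):
        for snpset in combinations(range(2 * n), size):
            if len(supporting(Y, set(snpset))) / m < min_supp:
                continue                      # infrequent snpset: skip all rules built from it
            for t in snpset:                  # |Y| = 1: single-attribute consequent
                X = set(snpset) - {t}
                if conf_hap(Y, X, {t}, n) > 0.5:
                    rules.append((frozenset(X), t))
    return rules
```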

4.2.4.1 Minimum Support Criterion

This section discusses how the support criterion in (4.7) relates to the error rate, ε, in the input dataset (i.e., errors in the short reads). Understanding this relation is important because the minimum support criterion should be sufficiently high to eliminate rules that are inferred based on erroneous data. For example, in the fragment matrix shown in Figure 4.2(a), s1 → s̄2 is an association rule that exists due to the error in fragment f2 at the second SNP site (i.e., x2,2 = 0 due to error). Therefore, it is desirable to exclude the snpset X ∪ Y = {s1, s̄2} during the frequent snpset generation phase.

Definition 11 (Erroneous Rule). An association rule R: X → Y is said to be erroneous if fragments supporting X ∪ Y contain any errors at the SNP attributes of the rule.

In the example shown in Figure 4.2, a snpset that includes s̄2 and is supported by fragment f2 forms erroneous rules. Examples of such rules are s1 → s̄2; {s1, s3} → s̄2; s3 → s̄2; {s1, s̄2} → s3; and s̄2 → s1.

Lemma 2. The likelihood of a rule R: X → Y of length l (i.e., |X ∪ Y| = l) being erroneous is given by

$$
p(E \mid R) = 1 - \left(1 - \epsilon^{\delta(X \cup Y)}\right)^{l}
\qquad (4.11)
$$

where ε denotes the error rate in the input dataset.

Proof. Let T = {t1, ..., tl} be the set of l SNP attributes in X ∪ Y, where T ⊆ S. Assume all fragments that support X → Y form a snippet F′ ⊆ F. For each attribute tj in T, all fragments in F′ should have the same value (i.e., '1') in Y. Therefore, all fragments in F′ must have an error in attribute tj in order for tj to be erroneously included in the rule. Given that each attribute value (i.e., tj = 1) is supported by δ(X ∪ Y) fragments within the snippet, the probability that attribute tj contains an error and is included in the rule is

$$
p(E \mid R, t_j) = \epsilon^{\delta(X \cup Y)}
\qquad (4.12)
$$

Furthermore, the likelihood of attribute tj being included in rule R with no error is

$$
p(\bar{E} \mid R, t_j) = 1 - \epsilon^{\delta(X \cup Y)}
\qquad (4.13)
$$

The probability that at least one of the attributes in T is erroneous and included in the rule can be computed by

$$
p(E \mid R) = 1 - \prod_{j=1}^{l} p(\bar{E} \mid R, t_j) = 1 - \left(1 - \epsilon^{\delta(X \cup Y)}\right)^{l}
\qquad (4.14)
$$

Lemma 3. The probability that a strong rule R: X → Y of length l is erroneous is upper bounded by 1 − (1 − ε^(αm+1))^l. That is,

$$
p(E \mid R) \ \leq\ 1 - \left(1 - \epsilon^{\alpha m + 1}\right)^{l}
\qquad (4.15)
$$

Proof. From (4.4) and (4.7), we have:

$$
\frac{\delta(X \cup Y)}{m} \ \geq\ \frac{1}{m} + \alpha
\quad\Longrightarrow\quad
\delta(X \cup Y) \ \geq\ m\alpha + 1
\qquad (4.16)
$$

The proof follows by combining (4.16) with the error likelihood given by (4.11) in Lemma 2.

A desirable value for the variable α is 1/m, which results in generating rules that are supported by at least two fragments (i.e., δ(X ∪ Y) ≥ 2). In this case, the upper bound on the likelihood of the rule being erroneous is given by

$$
p(E \mid R) \ \leq\ 1 - \left(1 - \epsilon^{2}\right)^{l}
\qquad (4.17)
$$

For example, for a given fragment matrix with a maximum sequencing error of ε = 2%, if we set α = 1/m, the likelihood of any rule of size 3 (i.e., l = 3) being erroneous is at most 0.12%.
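Plugging these numbers into (4.17) makes the bound explicit:

$$
p(E \mid R) \ \leq\ 1 - \left(1 - \epsilon^{2}\right)^{l}
= 1 - \left(1 - 0.02^{2}\right)^{3}
= 1 - (0.9996)^{3}
\approx 0.0012 = 0.12\%.
$$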

Lemma 4. The lower bound on the probability of a rule R: X → Y of size l being erroneous is 1 − (1 − ε^C)^l. That is,

$$
p(E \mid R) \ \geq\ 1 - \left(1 - \epsilon^{C}\right)^{l}
\qquad (4.18)
$$

where C denotes the amount of DNA read coverage.

Proof. The maximum number of fragments that can support a rule is naturally limited by the amount of read coverage, C. Therefore, we can write:

$$
\delta(X \cup Y) \ \leq\ C
\qquad (4.19)
$$

$$
\epsilon^{\delta(X \cup Y)} \ \geq\ \epsilon^{C}
\qquad (4.20)
$$

$$
1 - \left(1 - \epsilon^{\delta(X \cup Y)}\right)^{l} \ \geq\ 1 - \left(1 - \epsilon^{C}\right)^{l}
\qquad (4.21)
$$

$$
p(E \mid R) \ \geq\ 1 - \left(1 - \epsilon^{C}\right)^{l}
\qquad (4.22)
$$

Theorem 5. The probability that a strong rule R: X → Y of length l is erroneous is bounded as follows:

$$
1 - \left(1 - \epsilon^{C}\right)^{l} \ \leq\ p(E \mid R) \ \leq\ 1 - \left(1 - \epsilon^{\alpha m + 1}\right)^{l}
\qquad (4.23)
$$

Proof. The proof follows from Lemma 3 and Lemma 4.

4.2.4.2 Minimum Confidence Criterion

An important property of our association rule learning algorithm is that it guarantees that the generated rules do not contradict one another during haplotype reconstruction. Because our ultimate goal is to reliably infer haplotypes from the extracted rules, it is important that two rules with the same antecedent do not infer conflicting consequents. For example, a rule s1 → s2 suggests that if a haplotype has a value of '1' at the first SNP site, then the haplotype should have a value of '1' at the second SNP site as well. Such a rule is in conflict with the rule s1 → s̄2, which suggests that the second SNP site should be '0' if the first site is '1'. In this section, we show that the proposed minimum confidence criterion in (4.8) guarantees that conflicting rules are not generated by the framework.

Definition 12 (Conflicting Rules). Two rules R1: X1 → Y1 and R2: X2 → Y2 are called conflicting rules if X1 = X2 and Y1 = Y¯2.

Lemma 6. A rule R: X → Y that satisfies the confidence criterion in (4.8) has a higher support value than its conflicting rule, i.e., Supp(X → Y) > Supp(X → Ȳ).

Proof.

$$
\begin{aligned}
Conf_{hap}(X \rightarrow Y) \ &>\ \frac{1}{2}\\[2pt]
\frac{\delta(X) - \delta(X \cup \bar{Y})}{2\,\delta(X) - \delta(X \cup Y) - \delta(X \cup \bar{Y})} \ &>\ \frac{1}{2}\\[2pt]
2\,\delta(X) - 2\,\delta(X \cup \bar{Y}) \ &>\ 2\,\delta(X) - \delta(X \cup Y) - \delta(X \cup \bar{Y})\\[2pt]
\delta(X \cup Y) \ &>\ \delta(X \cup \bar{Y})\\[2pt]
\frac{\delta(X \cup Y)}{m} \ &>\ \frac{\delta(X \cup \bar{Y})}{m}\\[2pt]
Supp(X \rightarrow Y) \ &>\ Supp(X \rightarrow \bar{Y})
\end{aligned}
\qquad (4.24)
$$

Theorem 7. The criterion in (4.8) guarantees that the set of generated rules does not contain conflicting rules.

Proof. Proof by contradiction. Assume both rules X → Y and X → Ȳ are included in the final rule set. Because X → Y is included, Lemma 6 implies that Supp(X → Y) > Supp(X → Ȳ). Furthermore, because X → Ȳ is in the final set, Supp(X → Ȳ) > Supp(X → Y) according to Lemma 6. These two inequalities contradict each other. Therefore, at most one of the two conflicting rules will be generated by the association rule learning module.

4.2.4.3 Consequent Length Criterion

As enforced by (4.9), rules with more than one attribute in their consequent snpset are eliminated from further processing. This is mainly because a rule whose consequent is a superset of another rule's consequent (with the same antecedent snpset) is weaker than that rule, as shown in Theorem 9.

Lemma 8. If snpset Y2 is a superset of snpset Y1 (i.e., Y1 ⊂ Y2), then the following two statements about support values of Y1 and Y2 are true.

Supp(Y1) > Supp(Y2) (4.25)

Supp(Y¯1) ≤ Supp(Y¯2) (4.26)

Proof. The inequality in (4.25) is the natural anti-monotonicity property of support.

We prove that the collection of snpsets represented by Y¯1 are supported by a smaller number of fragments in Y compared to the snpsets represented by Y¯2 as suggested by (4.26). According to anti-monotonicity property of support, we can write:

Supp(Y1) > Supp(Y2) (4.27)

Assume that Y2 = Y1 ∪ Z. Thus, we have

δ(Ȳ2) = δ(Ȳ1) + δ(Z̄) − δ(Ȳ1 ∪ Z̄)    (4.28)

We note that δ(Z¯) ≥ δ(Y¯1 ∪ Z¯) based on the anti-monotonicity property of support. Therefore,

δ(Ȳ2) ≥ δ(Ȳ1),  and therefore  Supp(Ȳ2) ≥ Supp(Ȳ1)    (4.29)

Theorem 9. If snpset Y2 is a superset of Y1 (i.e., Y1 ⊂ Y2), then rule R1: X → Y1 is stronger than R2: X → Y2.

Proof. Because Y1 ⊂ Y2, inequality (4.25) in Lemma 8 suggests that Supp(Y1) >

Supp(Y2). Thus,

Supp(X ∪ Y1) > Supp(X ∪ Y2)

δ(X ∪ Y1)/m > δ(X ∪ Y2)/m    (4.30)

Supp(X → Y1) > Supp(X → Y2)

Supp(X → Y1) > Supp(X → Y2)

Therefore, R1 has a higher support value than R2. We can also show that Confcon(R1) >

Confcon(R2). Because R1 has both higher support and greater confidence than R2, it is a stronger rule.

4.3 Haplotype Reconstruction

In this section, we describe our approach for constructing haplotypes using the rules generated by our association rule learning module. We assume that the haplotypes are originally initialized to some binary values at random; that is, H⁰ = {h0⁰, h1⁰}.

Problem 1 (Rule-Based Haplotype Reconstruction). Given a set of K strong rules R = {R1, R2, ..., RK}, where Rj: Xj → tj, and an initial haplotype H⁰, our goal is to build haplotypes H = {h0, h1} such that the amount of disagreement error between the rule set and the constructed haplotypes is minimized. The disagreement error between R and H is given by

Λ(H, R) = Σ_{i∈{0,1}} Σ_{j=1}^{K} λ(hi, Rj)    (4.31)

where

λ(hi, Rj) = 1 if hi satisfies Xj but not tj, and λ(hi, Rj) = 0 otherwise.    (4.32)
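The disagreement error in (4.31)–(4.32) translates directly into a short function. The sketch below is illustrative only and uses assumed encodings: a haplotype is a string over {'0','1'}, and a rule is a pair (antecedent, consequent) whose attributes are (site, value) pairs with 0-based site indices.

```python
def satisfies(hap, snpset):
    """True if haplotype `hap` (a '0'/'1' string) has value `val` at every
    (site, val) attribute in `snpset`."""
    return all(hap[site] == val for site, val in snpset)

def disagreement(haps, rules):
    """Overall disagreement error Lambda(H, R) from (4.31): count rules whose
    antecedent is satisfied by a haplotype but whose consequent is not."""
    total = 0
    for hap in haps:                       # h0 and h1
        for antecedent, consequent in rules:
            if satisfies(hap, antecedent) and not satisfies(hap, [consequent]):
                total += 1                 # lambda(h_i, R_j) = 1
    return total

# Example: rule s1 -> s2 (site 0 = '1' implies site 1 = '1').
rules = [([(0, '1')], (1, '1'))]
print(disagreement(['10', '01'], rules))   # 1: h0 satisfies s1 but not s2
```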

Definition 13 (Dependent & Prerequisite Rules). A rule Rk: Xk → tk is said to be dependent on rule Rj: Xj → tj, denoted by Rj ⟹ Rk, if tj ∈ Xk or t̄j ∈ Xk. In other words, a rule is dependent if its antecedent contains another rule's consequent (or its negation). If Rk is dependent on Rj, then rule Rj is called a prerequisite of Rk.

For example, rule s2 → s3 is dependent on rule s1 → s2. Also, a rule of the form s̄2 → s3 is dependent on s1 → s2 because the prerequisite rule s1 → s2 may change the haplotype at the 2nd SNP position, which in turn can result in the haplotype satisfying the antecedent of the dependent rule s̄2 → s3 (i.e., either h0 or h1 may satisfy s̄2 as a result of applying rule s1 → s2, which would change the 2nd SNP position).

In order to develop a haplotype reconstruction approach that minimizes the amount of disagreement between the generated rules and reconstructed haplotypes, we first need to better identify major sources of disagreement or error. We divide all sources of errors into two categories including dependent rules and haplotype-inconsistent rules.

A dependent rule error may occur if a dependent rule is used for haplotype reconstruction prior to its prerequisite rule. For instance, consider two rules R1: s1 → s2 and R2: s2 → s3 and an initial haplotype H⁰ = {h0⁰, h1⁰} where h0⁰ = '010' and h1⁰ = '101'. If we apply R1 and then R2, the resulting haplotype will be h0 = '000' and h1 = '111'. This haplotype has a disagreement error of Λ = 0 because it satisfies both rules. However, if we apply R2 first and then R1, the resulting haplotype will be h0 = '001' and h1 = '110', which disagrees with rule R2 and therefore incurs a disagreement error of Λ = 1.

Lemma 10. Applying a dependent rule right after applying its prerequisite rule does not increase disagreement error.

Proof. Proof is eliminated for brevity.

Lemma 10 suggests that, in order to minimize the amount of error, we should identify a sequence of rules in which dependent and prerequisite rules are adjacent and the prerequisite rule is applied prior to its dependent rule.

Definition 14 (Inconsistent Rules). Two rules Rj: Xj → tj and Rk: Xk → tk are inconsistent if tj = t¯k.

We note that inconsistent (i.e., unreconstructable) rules are different from the conflicting rules defined in Definition 12. Conflicting rules have the same antecedent snpsets and, as discussed previously, the association rule learning module automatically prevents generating them. Inconsistent rules, however, are rules whose consequent snpsets are complements of each other. Inconsistent rules are important because they cannot be applied to the same haplotype

(i.e., h0 or h1) because they result in opposite changes in the same SNP position suggested by their consequent snpset.

Lemma 11. Applying inconsistent rules on different haplotypes does not increase disagreement error.

Proof. Proof is eliminated for brevity.

Motivated by Lemma 10 and Lemma 11, Algorithm 4 is proposed for rule-based haplotype reconstruction. The set of unreconstructed SNP sites, denoted by U, is initially set to indicate that all SNP sites are unreconstructed, i.e., U = {1, 2, ..., n}, where n is the number of columns in the fragment matrix X. The procedure shown in Algorithm 4 repeats during each round of the two-phase (association rule learning and haplotype reconstruction) processing. In other words, every time our framework generates a new set of rules, it calls the procedure in Algorithm 4 to process the newly created rules. As new rules are generated and haplotypes are updated, the set of unreconstructed SNP positions becomes smaller and smaller until all positions are reconstructed and we no longer need to repeat the two phases of the ARHap framework, namely the association rule learning and haplotype reconstruction modules shown in Figure 4.1.

Algorithm 4 Haplotype reconstruction procedure in ARHap.
Require: Set of strong rules R, current haplotype H, set of unreconstructed sites U
Ensure: Updated H, updated U
1: G = (V, E) ← DependencyGraph(R)
2: while (|E| > 1 and U ≠ ∅) do
3:   π ← LongestAttributeConsistentPath(G)
4:   UpdateHaplotype(H, π, U)
5: end while

4.3.1 Overview of the Algorithm

The general approach to haplotype reconstruction from association rules is an iterative process where we choose one rule during each iteration and check whether or not the rule results in updating a particular SNP position on the haplotype, based on the consequent of the rule. In general, if a haplotype satisfies the antecedent of a rule, then we may need to update the haplotype such that it satisfies the consequent of the rule as well. We note that both the antecedent and the consequent of a rule carry information about both the index and the value of SNP sites. For instance, if the current rule is s1 → s2, this means that we may need to change the current haplotype at the second SNP site to '1' (i.e., associated with s2) if the first SNP site has a value of '1' (as indicated by s1). That is, if the haplotype (i.e., h0 or h1) satisfies antecedent s1, then it must satisfy consequent s2. Similarly, a rule s4 → s̄5 suggests that the 5th SNP position must be '0' if the 4th site is '1'.

In developing such an approach, a number of challenges exist. First, a strategy is needed to choose a rule at each step of the algorithm. Second, it is not known a priori against which haplotype (i.e., h0 or h1) a chosen rule needs to be tested. Ideally, one would like to test a rule against both h0 and h1 and update the haplotype that the rule satisfies. Therefore, we need to identify a sequence of rules that are likely to apply to the same haplotype, to avoid generating inconsistent haplotypes. To address this challenge, ARHap identifies dependent rules during each iteration of the algorithm and applies those rules in a sequence.

A brief overview of this approach is as follows:

1. Using a set of strong rules, construct a dependency graph that shows how different rules are related to each other. This will allow us to discover if applying one rule will trigger a dependent rule.

2. Develop an algorithm to find the longest attribute-consistent path on the graph and apply the rules spanning the path on the current haplotype in a chain. This way, each rule will result in applying another rule. For any rule that is applied, we remove its corresponding edge from the dependency graph. Each applied rule may result in updating an SNP site on the haplotype.

3. After all longest paths are discovered, check if any SNP sites on the haplotype have remained unreconstructed. For those sites, rerun the association rule learn- ing module to extract new rules. This will result in extracting longer rules (by increasing the value of l) that can potentially update or verify the correctness of haplotypes at unreconstructed SNP sites. For this new set of strong rules, re- peat the process of dependency graph construction and haplotype update until all SNP sites are constructed.

4.3.2 An Illustrative Example

Figure 4.3 shows the process of constructing a dependency graph and building haplotypes for the simple example of the fragment matrix in Figure 4.2. Assume that the haplotypes are randomly initialized to h0 = '10101' and h1 = '01010'. As shown in Figure 4.3(a), each strong rule of length 2, listed in Table 4.2, represents an edge in the dependency graph. We note that the rules generate a graph composed of 3 connected components, including a single-node component (i.e., the node labeled as s̄3) because attribute s̄3 does not participate in any strong rules in Table 4.2. We then iteratively find longest paths and use each to update the haplotype set. The first longest path generated by the algorithm is π = {s1, s3, s2}, as shown in Figure 4.3(b). This path suggests that we can apply s1 → s3 first and the dependent rule s3 → s2

next. We check the first rule against h0. Because the first SNP position is ‘1’ the

third position needs to be ‘1’ as suggested by the rule (i.e., s1 → s3). We then test

the second rule against h0. This results in changing the second SNP position to ‘1’.

We note that h1 and h0 should remain complement of each other all the time. There-

fore, any changes in h0/h1 will modify h1/h0 as well. The resulting haplotypes after

applying the first longest path are h0 = ‘11101’ and h1 = ‘00010’. We then remove all edges on the longest path from the graph and repeat this process until all SNP sites are verified or reconstructed by some rule.

4.3.3 Dependency Graph

We note that all rules generated by our association rule learning module have a

consequent of length 1 (i.e., |Y | = 1). Therefore, we refer to a rule Rj ∈ R as Xj → tj in the rest of this chapter. We also note that Xj, the antecedent of a rule, can be of any size although in each round of our incremental association rule haplotyping pipeline we enforce certain constraints on the size of Xj to avoid generating redundant rules. As shown in Algorithm 4, the haplotype reconstruction phase starts with creating a dependency graph.

Definition 15 (Dependency Graph). Let R be a set of strong rules on SNP attribute set S as follows:

R = {Xj → tj | Xj ⊂ S; tj ∈ S}    (4.33)

we define the dependency graph G = (V, E) on R, where V = S denotes the set of vertices in G, and the set of edges, E, is given by

E = Er ∪ Es    (4.34)

[Figure 4.3: Association rule based haplotype reconstruction for the dataset in Figure 4.2. (a) Dependency graph constructed from the strong rules in Table 4.2; (b) longest paths identified in each iteration of the algorithm; (c) evolution of the haplotypes as each rule is tested against the haplotype set, assuming the haplotypes are initialized to h0 = '10101' and h1 = '01010':

Path            h0      h1      Comment
-               10101   01010   initial haplotypes
{s1, s3, s2}    11101   00010   2nd SNP modified
{s̄1, s̄2, s̄4}    11111   00000   4th SNP modified
{s̄1, s̄4}        11111   00000   no change
{s3, s4}        11111   00000   no change
{s̄1, s̄5}        11111   00000   no change
{s3, s5}        11111   00000   no change ]

where Er and Es are computed by

Er = {(u, v) | u → v ∈ R}    (4.35)

and

Es = {(u, v) | u = tj; v = Xk; tj ∈ Xk; Xj → tj ∈ R; Xk → tk ∈ R}    (4.36)

In Definition 15, Er refers to the edges due to the dependency suggested by a rule itself (i.e., Xj → tj). The edges in Es, however, indicate that the consequent of one rule (i.e., tj in rule Xj → tj) is a member of the antecedent of another rule (i.e., tj ∈ Xk where Xk → tk). This latter edge set is only applicable to cases where the antecedent of a rule has a size larger than 1 (i.e., l > 2). The purpose of the Es edges is to link a consequent attribute to any antecedent snpset that contains it, so that traversing the graph will generate a sequence of dependent rules.
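As an illustration of Definition 15, the sketch below builds the two edge sets from a rule list. It is a simplified rendering under assumed encodings (an attribute is a (site, value) pair; a rule is a pair of antecedent set and consequent attribute); it is not the ARHap implementation.

```python
def dependency_edges(rules):
    """Build E_r and E_s for a list of rules (antecedent, consequent).
    E_r: edges implied directly by each rule, cf. (4.35).
    E_s: edges linking a rule's consequent to any antecedent that contains it,
    cf. (4.36). Antecedents are frozensets of (site, value) attributes."""
    e_r, e_s = set(), set()
    for antecedent, consequent in rules:
        for attr in antecedent:
            e_r.add((attr, consequent))
        for other_antecedent, _ in rules:
            if consequent in other_antecedent:
                e_s.add((consequent, frozenset(other_antecedent)))
    return e_r, e_s

# Rules s1 -> s2 and s2 -> s3 (sites 1, 2, 3 with value '1').
rules = [(frozenset({(1, '1')}), (2, '1')), (frozenset({(2, '1')}), (3, '1'))]
print(dependency_edges(rules))
```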

As stated before, we start by generating rules with small l (e.g., l = 2). This will potentially result in constructing haplotypes at many SNP sites. For example, all rules in Figure 4.3 are of length 2, and the resulting dependency graph is sufficient to reconstruct the haplotypes at all 5 SNP sites, as discussed before. However, there may exist cases where some of the sites are not constructed in the first round of the framework. Such a scenario may happen if some SNP sites do not participate in any rules of smaller sizes (e.g., l = 2) due to a high amount of sequencing error at those sites. In this case, we generate larger rules (i.e., l > 2) that can build the unreconstructed SNP sites. For example, with the fragments shown in Figure 4.4(a), no rule of size 2 is strong enough to be included in the final rule set. Therefore, strong rules of a larger size (i.e., l = 3) are generated, as shown in Figure 4.4(b). We then generate the dependency graph associated with these rules, as shown in Figure 4.4(c). A longest path is shown in Figure 4.4(d). Finally, the rules on the longest path are tested against the initial haplotype, and when a haplotype satisfies Xj, we may update tj to ensure that the haplotype satisfies tj as well. This is shown in Figure 4.4(e).

[Figure 4.4: An example of a dependency graph generated from rules with multiple attributes in their antecedent snpset. The fragment matrix shown in (a) does not produce any rules of length 2 because no rule of the form ti → tj meets the confidence criterion Conf_hap > 0.5; (b) the set of twelve strong rules with 2 attributes in their antecedent field, each with Supp(R) = 0.25 and Conf(R) = 1; (c) the corresponding dependency graph; (d) a longest path, X1 → s3 → X2 → s2 → X3 → s1 → X10 → s̄3 → X9 → s̄1 → X5 → s̄2 → X12; and (e) the changes in the initial haplotypes after applying the rules on the longest path.]

4.3.4 Longest Attribute-Consistent Path

Intuitively, finding a longest path on the dependency graph produces the largest sequence of rules that can be used to update a haplotype set. We, however, need to ensure that the attributes that reside on a path are consistent. As discussed in Lemma 10 and Lemma 11, applying prerequisite and dependent rules in a sequence does not increase the disagreement error. Therefore, a longest path cannot contain both si and s̄i. In other words, one cannot apply both si and s̄i to the same haplotype (i.e., one of h0 or h1) because si suggests that the i-th site must be '1' while s̄i suggests that the same site must be '0'. Figure 4.5 shows an example where a longest path includes inconsistent attributes. The fragment matrix in Figure 4.5(a) results in the 6 strong rules of size 2 shown in Figure 4.5(b). The dependency graph shown in Figure 4.5(c) has two absolute longest paths, π1 = {s̄1, s2, s3, s1} and π2 = {s̄2, s1, s3, s2}. Both π1 and π2 have inconsistent attributes on the path. Specifically, π1 contains s̄1 and s1, which naturally cannot apply to the same haplotype. Similarly, π2 contains the inconsistent attributes s̄2 and s2. Therefore, it is necessary to ensure that the attributes on the same path are always consistent.

Finding the longest path in a graph is known to be an NP-complete problem [KMR97]. Strong inapproximability results are known for the longest path problem in unweighted directed graphs. In particular, the problem cannot be approximated to within a factor of n^{1−ε} (for any ε > 0) unless P = NP [BHK04]. Therefore, we develop a heuristic approach to extract longest attribute-consistent paths. Algorithm 5 shows a high-level description of the technique. At the core of our approach is a Depth-First-Search (DFS) procedure that extracts a spanning tree while ensuring that any path from the root of the tree to a leaf is attribute-consistent. Algorithm 6 shows the DFS process, which not only generates a spanning tree but also guarantees that inconsistent attributes are excluded from the same path. In order to accomplish this goal, each vertex maintains a set of predecessors during traversal. Before deciding whether or not an adjacent vertex can be traversed, we check whether the adjacent vertex is inconsistent with any attribute in the predecessors list.

[Figure 4.5: An example with inconsistent attributes on a 'longest path'. The fragment matrix in (a) results in the 6 rules of size 2 shown in (b). The longest path on the dependency graph in (c) contains conflicting attributes. Choosing the longest attribute-consistent path results in updating the haplotypes as shown in (d).]

Algorithm 5 Longest attribute-consistent path construction procedure in ARHap.
Require: Dependency graph G = (V, E)
Ensure: Path π
1: for (u ∈ V) do
2:   Tu ← AttributeConsistentDFS(G, u)
3:   πu ← longest path from u to any leaf in Tu
4: end for
5: π ← arg max_u {πu}

Finally, Algorithm 7 shows the process of haplotype update based on the longest path extracted in the prior step. The algorithm tests each rule Rj on the longest path. For each rule, if the haplotype h0 satisfies antecedent of the rule (i.e., Xj), then the consequent of the rule (i.e., tj) is applied on the same haplotype.


Algorithm 6 Attribute-consistent DFS in ARHap.
Require: Dependency graph G = (V, E), source vertex u
Ensure: DFS spanning tree T = (VT, ET)
1: Function AttributeConsistentDFS(G, u)
2:   VT = VT ∪ {u}   {VT = set of visited vertices}
3:   for (v | (u, v) ∈ E) do
4:     if (v ∉ VT and v̄ ∉ Predecessors(u)) then
5:       Predecessors(v) = Predecessors(u) ∪ {u}
6:       ET = ET ∪ {(u, v)}
7:       AttributeConsistentDFS(G, v)
8:     end if
9:   end for
10: EndFunction

Algorithm 7 Haplotype update procedure in ARHap.
Require: Path π
Ensure: Updated H, updated U
1: for ({Rj : Xj → tj} ∈ π) do
2:   if (Satisfy(Xj, h0)) then
3:     Apply(tj, h0)   {update site j on h0 and h1, and remove site j from U}
4:   else if (Satisfy(Xj, h1)) then
5:     Apply(tj, h1)   {update site j on h1 and h0, and remove site j from U}
6:   end if
7: end for
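To make Algorithms 5–7 concrete, the following self-contained sketch combines a small attribute-consistent DFS with the haplotype-update step. The encodings (attributes as (site, value) pairs, rules as (antecedent, consequent) pairs, haplotypes as lists of '0'/'1') are illustrative assumptions and not the ARHap implementation.

```python
def attribute_consistent_dfs(graph, source):
    """DFS tree rooted at `source` that never places an attribute and its
    negation on the same root-to-leaf path (cf. Algorithm 6)."""
    tree = {source: None}                          # child -> parent

    def negation(attr):
        site, value = attr
        return (site, '1' if value == '0' else '0')

    def visit(u, path):
        for v in graph.get(u, []):
            if v not in tree and negation(v) not in path:
                tree[v] = u
                visit(v, path | {v})
    visit(source, {source})
    return tree

def apply_rules_on_path(path_rules, h0, h1, unreconstructed):
    """Algorithm 7 sketch: test each rule against h0, then h1; when the
    antecedent is satisfied, set the consequent site and keep h0/h1
    complementary."""
    satisfies = lambda hap, snpset: all(hap[s] == v for s, v in snpset)
    for antecedent, (site, value) in path_rules:
        for hap, other in ((h0, h1), (h1, h0)):
            if satisfies(hap, antecedent):
                hap[site] = value
                other[site] = '1' if value == '0' else '0'
                unreconstructed.discard(site)
                break

# Rules s1 -> s3 and s3 -> s2 (0-based sites), as in the example of Section 4.3.2.
graph = {(0, '1'): [(2, '1')], (2, '1'): [(1, '1')]}
print(attribute_consistent_dfs(graph, (0, '1')))
h0, h1 = list('10101'), list('01010')
rules = [(frozenset({(0, '1')}), (2, '1')), (frozenset({(2, '1')}), (1, '1'))]
apply_rules_on_path(rules, h0, h1, set(range(5)))
print(''.join(h0), ''.join(h1))   # -> 11101 00010, matching Figure 4.3(c)
```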

4.4 Validation

This section demonstrates the performance of ARHap and compares ARHap with several other algorithms using both simulated data and real diploid data.

4.4.1 Dataset Preparation and Statistics

Several datasets with various coverage and error rates are generated consistent with approaches suggested in the literature [BYP14a, MFM17]. In order to simulate a

haplotype, a fixed haplotype length, Lh, is first chosen. Then, SNP positions are identified assuming that the distance between adjacent SNPs follows a geometric random variable with parameter p, the SNP density. For each identified SNP, its haplotype value is generated randomly, assuming that the alternative and reference alleles are equally likely. After a haplotype is constructed, a total of |R| paired-end reads are generated, where the number of reads, |R|, is a function of the coverage, C. To generate a read set with C× coverage, each base pair needs to be covered on average by C reads. Given the haplotype length Lh and the read length lr, the total number of generated reads is given by

|R| = Lh × C / (2 × lr)    (4.37)

To generate a paired-end read, a starting point on the genome is uniformly chosen.

In the experiments in this chapter, the read length lr is fixed to 2500. The fragment length is normally distributed with µ = 600 alleles and σ = 60 alleles. The insert length is determined by the fragment length, lf, and the read length; that is, l_insert = lf − 2 × lr. Once the start position and fragment length are known, we need to choose from which haplotype copy to read. For this purpose, reads are drawn uniformly from the two haplotype copies. Finally, uniform error is added to the read: for every SNP that the read covers, the allele is flipped independently with probability ε. We generate datasets based on an average diploid chromosome size of 130 million, assuming an overall SNP density of p = 0.083.
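A simplified version of this read-generation procedure is sketched below. The function name and the haplotype/read representation are assumptions made for illustration; the sketch treats a fragment as one contiguous run of SNP sites (ignoring the insert between read ends) and mirrors the parameters above: uniform start, fragment length ~ N(600, 60) in alleles, and per-SNP flip probability ε.

```python
import random

def simulate_fragment(haplotypes, frag_mu=600, frag_sigma=60, eps=0.05):
    """Draw one fragment: choose a haplotype copy uniformly, a uniform start
    position, a normally distributed fragment length (in SNP sites), then flip
    each covered allele independently with probability eps (sequencing error)."""
    hap = random.choice(haplotypes)
    length = max(1, int(random.gauss(frag_mu, frag_sigma)))
    start = random.randrange(len(hap))
    end = min(len(hap), start + length)
    read = []
    for allele in hap[start:end]:
        if random.random() < eps:
            allele = '1' if allele == '0' else '0'
        read.append(allele)
    return start, ''.join(read)

random.seed(0)
haps = ['0' * 1000, '1' * 1000]          # a toy diploid haplotype pair
start, read = simulate_fragment(haps)
print(start, len(read))
```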

The data generation procedure was repeated for six coverage levels, C ∈ {5, 10, ..., 30}, and seven error rates, ε ∈ {0%, 5%, 10%, ..., 30%}. The dataset preparation resulted in 17,929,681 short reads in total, with 2,561,383 reads for each dataset with a fixed error rate ε. Figure 4.6 illustrates several statistics about the short reads associated with a given ε. Figure 4.6(a) shows the total number of reads generated for each coverage level. The number of short reads ranged from 128,054 for coverage 5X to 640,869 for coverage 30X. Note that although the number of reads computed in (4.37) grows linearly with the coverage C, the number of reads ultimately used does not grow exactly linearly with the coverage level. This can be explained by the dataset preparation procedure described previously: because the start position of each read on the genome is chosen at random, it is likely that some reads will not span any SNPs. Such reads carry only missing values in the fragment matrix and are therefore eliminated from the dataset when pre-processing the data for algorithm input.

[Figure 4.6: Statistics about the simulated datasets. (a) Number of short reads for each coverage level; (b) number of non-overlapping blocks for each coverage level; (c) histogram of the lengths of non-overlapping blocks; (d) average block length for each coverage level.]

As mentioned previously, short reads are processed as a collection of non-overlapping blocks. Each block spans a number of SNPs such that the short reads covering those SNPs do not overlap with the reads that cover SNPs in adjacent blocks. As a result, each block in fact forms a fragment matrix, X, where the block length refers to the number of SNPs covered within that block. As shown in Figure 4.6(b), the total number of blocks is 10,467 for a given error rate. The number of blocks ranged from 643 blocks for coverage 30X to 5,893 for coverage 5X. The block lengths can vary significantly. Figure 4.6(c) shows a histogram of the block lengths. Approximately 81.3% of the blocks had a length of less than 100 SNPs.

As shown in Figure 4.6(b), a larger number of the blocks belong to datasets with lower coverage compared to those with higher coverage. Expectedly, lower coverage results in short reads that are less likely to overlap, leading to a larger number of non-overlapping blocks. However, note that the blocks in lower-coverage datasets are likely to have a smaller length, as shown in Figure 4.6(d). The graph in Figure 4.6(d) shows the average block length for each dataset based on coverage. On average, the block length was 17.3, 65.7, 126.0, 145.4, 159.34, and 167.4 for coverage levels 5X, 10X, 15X, 20X, 25X, and 30X, respectively.

4.4.2 Results on Simulated Data

Figure 4.7 shows normalized switching error (SWE) and normalized MEC of the four algorithms under comparison.

Figure 4.7(a) shows normalized switching error as a function of coverage. The switching error numbers shown in this figure refer to the percentage of SNPs that must be flipped on the reconstructed haplotypes in order to obtain the true haplotype. Overall, SWE was 0.12, 0.26, 0.27, and 0.36 for ARHap, FastHap, HapCut, and Greedy, respectively. This indicates that ARHap reduces the SWE of FastHap, HapCut, and Greedy by 53.8%, 55.5%, and 66.7%, respectively. We also note that SWE increases as the amount of coverage grows. This can be explained by the fact that higher coverage results in longer blocks, which makes the haplotype assembly problem more difficult to solve compared to a smaller block with the same error rate.

[Figure 4.7: Performance of different algorithms on simulated data. (a) Switching error as a function of coverage for a fixed error rate of ε = 5%; (b) switching error versus error rate for a fixed coverage of 5X; (c) normalized MEC versus coverage for a fixed error rate of ε = 5%; (d) normalized MEC as a function of error rate for 5X coverage.]

Figure 4.7(b) illustrates SWE as a function of error rate for a fixed coverage of 5X. Averaged over all error rates, SWE was 0.09, 0.11, 0.13, and 0.18 for ARHap, FastHap, HapCut, and Greedy, respectively. Based on this analysis ARHap performs 18.2%, 30.1%, and 50.0% better in terms of SWE compared to FastHap, HapCut, and Greedy, respectively.

We can conclude from Figure 4.7(a) and Figure 4.7(b) that ARHap achieves significantly better SWE performance at higher coverage levels. This observation can be explained by the nature of the ARHap algorithm, which uses the 'support' of each association rule for haplotype reconstruction. With a higher coverage level, rules of higher support will be given higher priority for haplotype reconstruction, which results in higher-quality haplotypes.

Figure 4.7(c) shows the amount of normalized MEC versus coverage. For this analysis, a fixed error rate of ε = 5% was used. As shown in these graphs, ARHap, FastHap, and HapCut achieve very similar MEC values while Greedy performs much worse. On average, MEC was 0.9855, 0.9883, 0.9992, and 1.5745 for ARHap, FastHap, HapCut, and Greedy, respectively. This indicates that ARHap performs slightly better than FastHap and HapCut. Specifically, ARHap achieves a 0.28% improvement in MEC over FastHap and a 1.37% improvement over HapCut.

Figure 4.7(d) shows the MEC of the different algorithms as the error rate increases. Similar to the previous figure, the results here are shown for a fixed coverage of 5X. On average, the amount of MEC was 0.57, 0.55, 0.57, and 0.65 for ARHap, FastHap, HapCut, and Greedy, respectively.

Table 4.3 and Table 4.4 compare the running times of the four algorithms in two different scenarios. Table 4.3 shows running times for various coverage levels on the dataset with a fixed error rate of ε = 5%, while Table 4.4 shows running times as the error rate ranges from 0% to 30% for a fixed coverage of C = 5X. In both scenarios, FastHap (the algorithm discussed in Chapter 3) outperforms the other three algorithms. As shown in Table 4.3, the running time of all the algorithms increases as coverage grows. This is an expected observation because, at higher coverage levels, the algorithms need to process more data. However, note that compared to the other three algorithms, Greedy becomes much slower at higher coverages. This can be explained by the nature of the Greedy algorithm, which processes the fragment matrix row-wise; that is, it examines one fragment/read at each iteration of the algorithm. In contrast, ARHap and HapCut perform column-wise processing of the fragment matrix. As we discussed in Chapter 3, although FastHap performs row-wise processing, it performs a majority of its computation outside the iterative process and thus achieves superior running time. Overall, ARHap is 30.9% faster than HapCut and 87.4% faster than Greedy. However, FastHap outperforms ARHap, achieving a running time that is 54.2% lower.

Table 4.3: Running time (minutes) of ARHap, FastHap, HapCut, and Greedy on simulated data with fixed error rate ε = 5%.

Coverage   ARHap   FastHap   HapCut   Greedy
5X           3.5       2.1     10.1      4.3
10X         14.8       2.7     17.3     13.0
15X         18.9       3.8     18.7     33.6
20X         43.9       5.8     44.7    127.3
25X         76.4      31.9    108.8    602.2
30X        101.1      72.1    174.8   1277.2
Avg.        43.1      19.7     62.4    342.9

Table 4.4: Running time (minutes) of ARHap, FastHap, HapCut, and Greedy on simulated data with fixed coverage C = 5X.

ε (%)   ARHap   FastHap   HapCut   Greedy
0        1.78      1.13     6.02     3.18
5        3.43      2.14    10.14     4.31
10       3.02      2.30    15.52     4.21
15       2.94      2.11    15.90     4.10
20       3.53      2.21    14.95     3.92
25       3.55      2.52    17.83     3.13
30       3.33      2.19    16.09     4.02
Avg.     3.08      2.09    13.78     3.84

The results shown in Table 4.4 are intended to examine the potential impact of the error rate on the running time of the algorithms. As this table shows, none of the algorithms' running times depends heavily on the amount of error within the fragment matrix. This indicates that all of these algorithms can run efficiently even on extra-long reads, which tend to have higher error rates with conventional sequencing technology; the remaining question is which algorithm offers more accurate results.

4.4.3 Results on HuRef Data

Recall from Chapter 3 that the HuRef dataset [Ven14] used for the analysis here contains reads for all 22 chromosomes of an individual, J.C. Venter. The data include 32 million DNA short reads generated by the Sanger sequencing method, with 1.85 million genome-wide heterozygous sites. There are many fairly short reads of approximately 15bp (each end), while tens of thousands of reads are long enough to cover more than two SNP sites and can be used for haplotype assembly. In fact, many fragments within each block span several hundred SNP sites due to the paired-end nature of the aligned reads. The fragment matrix used for haplotype assembly is generated from short reads aligned with the paired-end method, with each end ranging from 15bp to 200bp, while the insert length follows a normal distribution with a mean of 1000.

Table 4.5 shows the MEC scores of the different algorithms on the HuRef dataset for each chromosome. ARHap performs better than all other algorithms in terms of MEC, although its performance varies from one chromosome to another. Overall, ARHap outperforms Greedy by a large margin (i.e., a 35.3% improvement in MEC), moderately outperforms HapCut with a 1.43% reduction in the total MEC, and achieves a slightly better MEC score than FastHap (i.e., only a 0.14% reduction in MEC).

4.5 Discussion

Table 4.5: Overall MEC score of ARHap, FastHap, HapCut, and Greedy on HuRef data.

Chr    ARHap   FastHap   HapCut   Greedy
1      19477     19423    19750    29657
2      14474     14220    14677    22980
3      10667     11794    10738    16878
4      11865     11812    11931    18153
5      10534     10362    10630    16590
6       9915      9870     9992    15587
7      11269     11245    11290    17402
8       9828     10830     9845    14887
9       9223      9204     9318    13812
10      9859      9796     9906    15291
11      8242      8091     8294    12906
12      7407      7467     8297    12630
13      6141      6143     6131     9312
14      5947      5725     6360     9734
15      9592      9695     9783    13988
16      8233      8215     8354    12621
17      7584      7386     7398    11157
18      4763      4846     5043     8578
19      5623      4886     5497     8214
20      3659      3437     3784     5752
21      4642      4707     4715     6611
22      5819      5875     5864     8295
Sum   194763    195029   197597   301035

Development of computational algorithms for haplotype assembly has received more attention in the computational community over the past decade. Although limitations in sequencing technology keep phasing from achieving its full potential, we still need a solid framework for solving the problem. Studying phasing at the diploid or polyploid level can, in general, contribute to many related areas in the field of genetics, such as inheritable diseases and organ transplant matching, among others. In this chapter, we compared ARHap with three well-known haplotype assembly algorithms, namely

FastHap, HapCut, and the Greedy algorithm. These algorithms are historically known for their speed and accuracy. Our extensive analysis showed that ARHap outperforms all of these techniques significantly on simulated and real data. Haplotype length is one of the quality measures that has appeared in the literature in recent years. Although it is important to produce long haplotype blocks to demonstrate the capabilities of a proposed algorithm, block length is more closely related to the nature of the dataset being used: for instance, the length of the fragments, the insert size of the alignments, and the error rate (i.e., the quality of sequencing). Therefore, in this project we decided not to present a block-length comparison between our algorithm and the state of the art. We note that our algorithm's assembled block sizes are almost the same as those of the other methods, mainly because of the quality of the sequencing data. Although we have achieved very promising results in this work, we believe that the construction of the ARHap framework is only the first step towards the development of accurate ploidy-agnostic haplotype assembly techniques. In this project, we looked at the phasing problem from the result side, trying to identify the origin of each fragment. Association rules can also help in finding linkage disequilibrium hidden among various SNP locations, which cannot be recovered by comparing the similarity of fragments to each other. We plan to extend the current approach to higher ploidy levels and to incorporate dynamic learning of the ploidy level in future work.

CHAPTER 5

Graph Coloring for Polyploid Haplotyping

5.1 Introduction

Recent advances in sequencing technologies have played a significant role in providing high-quality data for DNA phasing and allowing for single individual haplotyping. DNA phasing in diploid organisms has proved effective in many application areas such as investigating human disease genes and transplant matching as well as in population genetics where ancestral studies can reveal many aspects of history and heritable diseases [TBT11]. Although DNA phasing has been studied for over a decade [BB11], it has remained an active research area due to the experimental and computational constraints, and the costs associated with accurately reconstructing haplotypes from genotypes. While experimental phasing techniques appear to be less practical for large datasets, due to their high costs, computational approaches have received more attention because of their practical feasibility and cost-efficiency.

Polyploidy refers to the presence of more than two copies of each chromosome in the cells of an individual organism. Polyploidy is common in ferns and flowering plants [WTB09] and has been studied both in the cytogenetics era and in the context of molecular genetics and genome evolution [RS98]. Moreover, polyploidy exists in animals such as fish and amphibians. Recent evidence suggests that polyploidy is more important in animal evolution than was previously thought [Gla10]. The occurrence of polyploidy is known to have resulted in new species. Many species that are currently diploid, including humans, were derived from polyploid ancestors [MP05]. Despite the fact that polyploidy has many important applications in the field of genetics, previous haplotype assembly research focuses mostly on diploid genomes. Very limited research has been conducted on polyploid haplotyping [BYP14a].

5.1.1 Contributions and Summary of Results

This chapter introduces HapColor, a novel polyploid haplotyping framework based on an efficient partitioning of DNA reads to reconstruct haplotypes of higher ploidy. In developing and validating HapColor, our contributions can be summarized as follows.

• A formal definition of polyploid haplotyping with the objective of minimizing the overall Minimum Error Correction (MEC) criterion will be presented (Section 5.2.1).

• A graph model will be introduced to capture the amount of conflict between pairs of DNA short reads (Section 5.2.2).

• The hardness of the introduced problem, Minimum MEC Polyploid Haplotyping (MMPH), will be discussed and it will be shown that this problem is NP-hard (Section 5.2.3).

• A two-phase greedy algorithm including a graph coloring method followed by a color-merging technique will be developed to accurately partition short reads and reconstruct the haplotype associated with each partition (Section 5.3).

• The performance of HapColor will be compared against several polyploid haplotyping approaches and it will be demonstrated that HapColor substantially reduces the MEC scores of the competing haplotype assembly algorithms on polyploid genome data (Section 5.4).

Table 5.1 shows a summary of HapColor’s performance in terms of reduction in MEC scores of other haplotype assembly approaches. HapColor is compared 78 Table 5.1: Reduction (%) in MEC achieved by HapColor

Table 5.1: Reduction (%) in MEC achieved by HapColor

Algorithm   Triploid   Tetraploid   Hexaploid   Decaploid   Overall
HapTree        22.9%        24.0%       25.3%       25.2%    24.35%
PGreedy        63.3%        55.4%       55.4%       53.7%    56.95%
RFP            77.1%        80.5%       84.6%       87.9%    82.52%

with three polyploid haplotyping algorithms: HapTree [BYP14a], Greedy (a greedy approach originally developed for diploid haplotyping [LSN07] but customized in this dissertation to accommodate polyploid genomes), and Random Fragment Partitioning (RFP, a baseline polyploid haplotyping algorithm based on a random partitioning of the short reads). On average, HapColor achieves 24.35%, 56.95%, and 82.52% reductions in MEC compared to the HapTree, Greedy, and RFP algorithms, respectively.

5.2 Problem Statement

The general form of the haplotype assembly problem is to reconstruct ‘K’ haplotypes from a given set of fragments obtained by DNA sequencing of ‘K’ copies of a chromo- some. Solving this problem is challenging due to insufficient coverage and erroneous readings. Insufficient coverage refers to the fact that DNA sequencing technologies provide only small overlapping fragments that result in some SNP sites not being reconstructed in the final haplotype. The problem becomes more challenging if the fragment matrix contains errors such as sequencing errors, chimeric read pairs, and false variants. In the real world of molecular biology, experiments are never error-free. These errors result in conflicting fragments drawn from the same haplotype copy. The conflicting data prevent us from reliably reconstructing the correct allele at each SNP site. In this chapter, it will be shown that an error-free fragment matrix, also called a feasible matrix, can be used to partition the fragments into ‘K’ disjoint sets each of which is associated with one copy of the haplotype.

Given the aforementioned challenges, the haplotype assembly problem is often defined as follows. Given a fragment matrix X obtained from 'K' copies of a chromosome, reconstruct 'K' haplotypes such that an objective function is optimized. A number of objective functions have been introduced in the past. Examples include Minimum Error Correction (MEC), Minimum Fragment Removal (MFR), Minimum SNP Removal (MSR), and Longest Haplotype Reconstruction (LHR) [LBI01]. The focus of this chapter is on minimizing the MEC objective function, whose effectiveness has been previously established in the literature. The fragment partitioning approach introduced in this chapter aims to partition the set of fragments into disjoint sets, each representing one haplotype copy. As soon as the fragments are partitioned, each partition can be used to merge the fragments and build one haplotype copy.

5.2.1 Problem Formulation

Let Xm×n be a given fragment matrix that contains SNP values, xij ∈ {0, 1, ‘-’}, obtained from m fragments, F = {f1, ... , fm}, pertaining to n SNP sites, S = {s1,

... , sn}. Furthermore, let K be the ploidy level, which is the number of chromosome copies from which the fragments in F are derived. Suppose that our algorithm assigns each fragment fi ∈ F to one of the K disjoint partitions P = {p1, ... , pK }, initialized

as pk = ∅. As stated previously, each set pk can be used to build a haplotype hk by

merging the fragments that reside in pk, resulting in the haplotype set H = {h1, ..., hK}. Minimum MEC Polyploid Haplotyping (MMPH) is then the problem of assigning each fragment fi in X to one of the K partitions pk such that the overall MEC in (5.1) is minimized.

MEC(X, H) = Σ_{k=1}^{K} Σ_{i=1}^{m} MEC(fi, hk) × aik    (5.1)

where MEC(X, H) denotes the overall MEC score of the haplotype assembly, aik is a binary indicator of whether or not the partitioning algorithm maps fragment fi onto haplotype hk, and MEC(fi, hk) represents the MEC value due to such a mapping and is given by

MEC(fi, hk) = Σ_{j=1}^{n} δ(xij, hkj)    (5.2)

where δ(xij, hkj), the mismatch function, is a binary distance metric that indicates whether or not there is a mismatch between xij and the corresponding SNP location (i.e., the j-th

SNP) in the reconstructed haplotype hk. The general form of the mismatch function is as follows:

δ(x, y) = 1 if x ≠ y and x ≠ '–' and y ≠ '–'; δ(x, y) = 0 otherwise.    (5.3)

where δ(x, y) denotes the mismatch between x and y, drawn from the alphabet {0, 1, '–'}, in which 0/1 refer to the two alleles at a particular SNP site and '–' indicates lack of information, either because the fragment does not span the site or because of a failure of the sequencing assay. Matching alleles are not considered because, while they would increase computation substantially, they do not help in differentiating haplotype copies. Therefore, the distance measure is built using only mismatches among fragments.
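The mismatch function (5.3) and the MEC objective (5.1)–(5.2) translate directly into code. The sketch below is an illustrative implementation only (fragments and haplotypes are strings over {'0', '1', '-'}); it is not the HapColor code.

```python
def mismatch(x, y):
    """delta(x, y) from (5.3): 1 only when both symbols are called and differ."""
    return 1 if x != y and x != '-' and y != '-' else 0

def fragment_mec(fragment, haplotype):
    """MEC(f_i, h_k) from (5.2): mismatches between a fragment and a haplotype."""
    return sum(mismatch(x, h) for x, h in zip(fragment, haplotype))

def overall_mec(fragments, haplotypes, assignment):
    """Overall MEC from (5.1); `assignment[i]` is the haplotype index k that
    fragment i is mapped to (the indicator a_ik in the formula)."""
    return sum(fragment_mec(f, haplotypes[assignment[i]])
               for i, f in enumerate(fragments))

frags = ['00--00', '11-11-', '1111-0']
haps = ['000000', '111111', '111100']   # the three copies used in Figure 5.1
print(overall_mec(frags, haps, [0, 1, 2]))   # 0 errors for this assignment
```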

Problem 2 (Minimum MEC Polyploid Haplotyping). Given fragment matrix Xm×n, composed of m fragments F = {f1, ... , fm}, and ploidy level ‘K’, find partition P

= {p1, ... , pK } of the fragments (each pk corresponding to a haplotype hk) such that the MEC value in (5.1) is minimized.

5.2.2 Graph Modeling

The concept of a conflict graph has conventionally been defined as a non-weighted graph for diploid haplotyping. Such a binary conflict graph represents any pair of fragments with at least one mismatch in the fragment matrix. For example, according to [LBI01], a conflict graph is a graph with an edge associated with each pair of fragments in conflict, where two fragments are in conflict if they have different values in at least one column of X. If the fragment matrix is error-free, the conflict graph can be easily used for diploid haplotyping because the graph is bipartite and the two given haplotypes h1, h2 define the shores of the graph. Furthermore, if the graph is bipartite with shores h1 and h2, then h1 and h2 can be taken as partitions of F defining a haplotyping, and thus the fragment matrix is feasible [LSL02].

The definition of the conflict graph is extended by leveraging the mismatch function in (5.3) to construct a weighted graph, called the Weighted Fragment Conflict Graph (WFCG), which captures the amount of conflict among the fragments in X. The WFCG is not a complete graph. Later in this chapter, it will be discussed how this graph model can be used for polyploid haplotyping.

Definition 16 (Weighted Fragment Conflict Graph). Given fragment matrix Xm×n

composed of m fragments F = {f1, ... , fm}, each of length n, a weighted fragment

conflict graph G(VF ,EF ,WF ) is composed of m vertices (i.e., |VF | = m) each associ-

ated with one fragment in X . The edge set EF is defined by the pairs of fragments with non-zero mismatch values where the amount of mismatch between two fragments

fi and fk is defined as

w(fi, fk) = Σ_{j=1}^{n} δ(xij, xkj)    (5.4)

A non-zero mismatch value (i.e., a conflict) indicates that the two fragments cover at least one common SNP site and have different non-missing values at that site. In an error-free fragment matrix, a conflict occurs only when the two fragments belong to different chromosome copies.
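Following Definition 16, the WFCG can be built with a simple double loop over fragments, as in the illustrative sketch below. The plain dict-of-weights representation is an assumption made for the example; the dissertation does not prescribe a particular data structure.

```python
def build_wfcg(fragments):
    """Weighted fragment conflict graph: edge weight w(f_i, f_k) is the number
    of SNP sites where both fragments are called ('0'/'1') and disagree, as in
    (5.4). Returns {(i, k): weight} for pairs with a non-zero weight."""
    edges = {}
    for i in range(len(fragments)):
        for k in range(i + 1, len(fragments)):
            w = sum(1 for x, y in zip(fragments[i], fragments[k])
                    if x != y and x != '-' and y != '-')
            if w > 0:
                edges[(i, k)] = w
    return edges

frags = ['000000', '111111', '1111-0', '0000--']
print(build_wfcg(frags))   # e.g. fragments 0 and 1 conflict at all six sites
```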

Figure 5.1 shows an example of a WFCG that quantifies the amount of inter-fragment mismatch for 7 fragments, F = {f1, ..., f7}, drawn from the 3 chromosome copies {h1, h2, h3} in H.

[Figure 5.1: An example of a weighted fragment conflict graph (WFCG), built from seven short reads drawn from three haplotype copies (h1 = 000000, h2 = 111111, h3 = 111100) and the corresponding fragment matrix.]

The fragment matrix contains only one mismatch (i.e., cell x75). Given the fragment matrix X and ploidy level K (K = 3 in Figure 5.1), the haplotype assembly problem is to assemble the fragments in X into the copies of the chromosome such that the overall MEC score in (5.1) is minimized. Given that each matrix X can be transformed into a WFCG, the definition of polyploid haplotyping in Problem 2 can be revised based on this graph model as follows.

Problem 3 (Minimum MEC Polyploid Haplotyping (MMPH)). Given a weighted fragment conflict graph G(VF ,EF ,WF ) associated with the fragment matrix Xm×n, and partition set P = {p1, ... , pK }, Minimum MEC Polyploid Haplotyping (MMPH) is to find a mapping L : VF → {p1, ... , pK } such that the MEC value in (5.1) is minimized.

5.2.3 Problem Complexity

This section contains the proof that the MMPH problem described in Problem 3 is NP-hard, even on WFCGs derived from error-free fragment matrices. The proof is based on the concept of graph coloring.

[Figure 5.2: Vertex coloring (left) and color merging (right) applied to the WFCG shown in Figure 5.1. Phase (I), vertex coloring: 4 colors, MEC = 0, with h1 = 000000, h2 = 111111, h3 = 111100, h4 = 111110. Phase (II), color merging: 3 colors, MEC = 1, with h1 = 000000, h2 = 111111, h3 = 111100.]

Problem 4 (Graph Coloring). An r-coloring of a graph G(V ,E) is a mapping C :

V → {c1, ... , cr} that assigns one of ‘r’ colors to each vertex in G so that every edge has two different colors at its endpoints. The vertex coloring problem is to find the smallest possible number of colors for r-coloring.

In Figure 5.2, the graph on the left shows a vertex graph coloring of the conflict graph in Figure 5.1. The coloring algorithm has identified 4 distinct colors, ‘blue’ for fragments {f1, f6}, ‘yellow’ for {f7}, ‘green’ for {f2, f5}, and ‘red’ for {f3, f4}. Therefore, minimum number of colors is 4.

Lemma 12. Given a weighted fragment conflict graph associated with an error-free fragment matrix X, partitioning of the fragments based on vertex coloring yields an overall MEC of zero.

Proof. Applying a vertex coloring algorithm on the WFCG gives a partitioning of the fragments because each color ck can indicate a partition pk. Intuitively, the fragments allocated to each partition pk are merged to build a haplotype hk. The MEC value can be computed by calculating the amount of mismatch between each fragment and the haplotype of the partition in which the fragment resides. The fragments within a particular partition have the same color. The amount of mismatch between any pair of fragments within each partition is zero because the vertex coloring algorithm assigns different colors to any two vertices that have a non-zero edge weight between them. Without loss of generality, assume that there is a total of q partitions and a total of l fragments within each partition pk associated with haplotype hk. The amount of MEC for partition pk is equal to

MEC_k = Σ_{i=1}^{l} δ(fi, hk) = 0    (5.5)

Therefore, the overall MEC is calculated as

MEC(X, H) = Σ_{k=1}^{q} MEC_k = 0    (5.6)

Lemma 13. Given a weighted fragment conflict graph associated with an error-free fragment matrix X , composed of m fragments drawn from K haplotypes, the number of colors, q, obtained by vertex coloring is always less than or equal to K (i.e., q ≤ K).

Proof. Proof by induction. Basis: For K=1, it is clear that the number of colors, q, given by vertex coloring is 1 because for an error-free matrix where all fragments are drawn from the same chromosome copy, there is no mismatch between fragment pairs. This is simply because either the overlapping regions match or contain missing values; in both cases the mismatch function in (5.3) gives a value of ‘0’ resulting in not creating an edge between any pairs of fragments in the graph. Thus, vertex coloring, 85 which aims to minimize the number of colors, will assign all vertices the same color. Inductive Step: Assume that the set of fragments in X are drawn from K haplotypes and the equation q ≤ K holds as a result of vertex coloring. Following is the proof that if more fragments are added from a (K + 1)th haplotype to X , the resulting number of colors after recoloring, q0, will hold the inequality q0 ≤ (K + 1). For this, first it is established that q0 ≤ q; that is, the new haplotype will add at most one color to the graph. Assume l new fragments are drawn from the (K +1)th haplotype. It is clear that there will not be any edges between any pairs of the newly added fragments because they come from the same haplotype and their overlapping region either matches or contains missing values. For each new fragment fi, the coloring algorithm will do one of the following: (1) assigns fi an existing color in which case addition of fi will not increase the total number of colors; or (2) assigns fi a new color in which case only one color will be added to the number of colors, q. Therefore, the new number of colors, q0 is either equal to q or equal to q + 1. Given that q ≤ K, it is fair to infer that q0 ≤ K + 1.

Theorem 14. The MMPH problem described in Problem 3 is NP-hard on any WFCG constructed from a feasible fragment matrix.

Proof. Proof by reduction from vertex coloring which is known to be NP-hard [JT11]. Assume an instance of a vertex coloring problem which takes as input a graph G(V ,E) and produces a mapping of C : V → {c1, ... , cr} such that the total number of colors used, q, is minimized. The graph in G is directly reduced to a WFCG graph

G(VF ,EF ,WF ) such that VF = V , EF = E, and w(vi, vj) = 1 ∀vi ∈ V . As shown in Lemma 13, the number of colors in the resulting coloring is at most K. Furthermore, as shown in Lemma 12, the overall MEC resulted from coloring is always ‘0’ on feasible fragment matrices. Therefore, the coloring problem can be directly reduced into an instance of the MMPH problem which always achieves the minimum possible MEC (i.e., MEC = 0).

Algorithm 8 HapColor Algorithm
Require: Fragment matrix X and ploidy level K
Ensure: Haplotypes H = {h1, ..., hK} and overall MEC
# Initialization
(1) Construct graph G(VF, EF, WF) from X as described in Section 5.2.2
(2) q = 0   {number of unique colors in G}
# Phase (I): Vertex Coloring
(3) ∀v ∈ VF, compute SD(v), the number of vertices adjacent to v
(4) Sort the vertices in VF in non-increasing order of their static degree SD(v)
(5) Color a vertex of maximal degree with color c1; q = q + 1
(6) ∀u ∈ VF, compute DD(u), the number of different colors adjacent to u
while (∃ uncolored vertex u ∈ VF) do
  (7) Find an uncolored vertex u with maximal dynamic degree
  (8) Color u with the first available color (I(u) = ci); q = q + 1
  (9) Update the dynamic degree DD(·) of all vertices adjacent to u
end while
# Phase (II): Color Merging
(10) ∀u ∈ VF and ci ∈ C, compute Λ(u, ci)
(11) ∀ci, cj ∈ C, compute the merging cost MC(ci, cj)
(12) Sort color pairs based on their merging cost values
while (q > K) do
  (13) Choose a pair of colors {ci, cj} with minimum cost
  (14) Merge the two colors by replacing cj with ci
  (15) Update the merging cost MC(·, ·) for all colors in GF; q = q − 1
end while
# MEC Calculation
(16) Partition the fragments based on the colors of their vertices in GF
(17) For each partition pk, build a haplotype hk by taking the consensus allele at each SNP position inside the partition, and add hk to H
(18) Use (5.1) to compute the overall MEC

5.3 HapColor Algorithm

Algorithm 8 shows the HapColor algorithm which takes fragment matrix X and ploidy level K as inputs, builds an array of K haplotypes in H, and calculates the overall MEC of the haplotype assembly. The algorithm consists of two main phases. In Phase (I), HapColor performs a vertex coloring of the graph based on the DSATUR greedy graph coloring [Bre79]. DSATUR (Degree of Saturation) is a sequential coloring algorithm with a dynamically determined order of the vertices chosen for coloring.

Each vertex v in GF is assigned a static degree, SD(v), which is the number of edges adjacent to v. Furthermore, vertex v is assigned a dynamic degree, DD(v), also called the degree of saturation, which is the number of different colors at the vertices adjacent to v. Phase (I) starts by assigning color c1 to a vertex of maximal static degree. Suppose I is a partial coloring of the vertices of GF. The vertex to be colored next in the sequential coloring is a vertex u with maximal dynamic degree; in case of a tie, the vertex with the highest static degree is colored. After a vertex u is colored, the dynamic degrees of its neighboring vertices are updated. This vertex coloring has been shown to achieve fast running time and high accuracy [Klo02].
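A compact, self-contained sketch of this DSATUR-style ordering is shown below. It operates on a simple adjacency-set representation and is intended only to illustrate the saturation-degree / static-degree tie-breaking; it omits the weighted edges and the bookkeeping of the full Algorithm 8.

```python
def dsatur_coloring(adj):
    """Greedy DSATUR: repeatedly color the uncolored vertex with the most
    distinct neighboring colors (ties broken by static degree), using the
    smallest color not used by its neighbors. `adj` maps vertex -> set of
    neighbors. Returns {vertex: color}."""
    colors = {}
    uncolored = set(adj)
    while uncolored:
        def saturation(v):
            # dynamic degree = number of distinct colors among colored neighbors
            return len({colors[u] for u in adj[v] if u in colors})
        v = max(uncolored, key=lambda u: (saturation(u), len(adj[u])))
        used = {colors[u] for u in adj[v] if u in colors}
        colors[v] = next(c for c in range(len(adj)) if c not in used)
        uncolored.remove(v)
    return colors

# Conflict edges among 4 toy fragments: 0-1, 0-2, 1-2, 2-3.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print(dsatur_coloring(adj))   # fragments 0, 1, 2 get three colors; 3 reuses one
```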

The second phase of HapColor, Phase (II), is primarily developed to account for the effect of errors on the coloring/partitioning algorithm. This phase aims to reduce the number of colors by merging colors of minimum cost, where the cost is computed based on the edge weights in GF. It will not merge any colors if X is error-free because, as discussed previously in Lemma 12, the resulting coloring always achieves a haplotyping with MEC = 0. In practice, however, fragment matrices are rarely error-free. Thus, it is likely that the number of colors reported by Phase (I) is larger than K. In this case, the algorithm iteratively finds two colors with minimal merging cost and merges them by assigning them the same color label. The iterative process continues until the total number of colors in GF becomes K. The merging cost of any two colors ci and cj is given by

X MC(ci, cj) = Λ(v, cj) (5.7)

v:I(v)=ci

where I is the current color mapping of the vertices in GF and Λ is a vector of size |I| given by

\Lambda(v, c) = \sum_{u : (u,v) \in E_F,\ I(u) = c} w(u, v)    (5.8)

Intuitively, in each iteration, two colors are chosen whose merging will result in the minimum amount of mismatch among the corresponding fragments. Such fragments, which are now given the same color, will be allocated to the same partition. Therefore, HapColor’s heuristic algorithm attempts to merge colors that have the least impact on the overall MEC of the haplotype assembly. Once all the fragments are assigned, then for each SNP site the consensus allele is taken over all the fragments in the partition covering that position to reconstruct the corresponding haplotype copy. Figure 5.2 shows how HapColor performs vertex coloring, color merging, and haplotype reconstruction for the fragment matrix in Figure 5.1.
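To make the merging step concrete, the sketch below evaluates the merging cost of (5.7)–(5.8) and greedily merges the cheapest pair of colors until K remain. It is a minimal illustration, assuming the conflict graph is given as a weighted edge list; the function and variable names are illustrative and not taken from the HapColor code base.

```python
from itertools import combinations

def merging_cost(edges, coloring, ci, cj):
    """MC(ci, cj): total weight of edges crossing colors ci and cj, i.e. the
    sum of Lambda(v, cj) over vertices v currently colored ci (eqs. 5.7-5.8)."""
    cost = 0.0
    for u, v, w in edges:                        # undirected weighted edges (u, v, weight)
        cu, cv = coloring[u], coloring[v]
        if (cu, cv) in ((ci, cj), (cj, ci)):     # edge crossing the two color classes
            cost += w
    return cost

def merge_until_k(edges, coloring, k):
    """Greedily merge the pair of colors with minimum merging cost until k colors remain."""
    while len(set(coloring.values())) > k:
        colors = sorted(set(coloring.values()))
        ci, cj = min(combinations(colors, 2),
                     key=lambda pair: merging_cost(edges, coloring, *pair))
        for v in coloring:                       # relabel cj as ci
            if coloring[v] == cj:
                coloring[v] = ci
    return coloring

# Example: edges = [(0, 1, 2.0), (1, 2, 1.0)]; coloring = {0: 0, 1: 1, 2: 2};
# merge_until_k(edges, coloring, 2) first merges colors 0 and 2 (no edge joins them).
```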

Theorem 15. The HapColor algorithm described in Algorithm 8 runs in polynomial time.

Proof. Let m be the number of fragments in X, where |VF| = m. It is known that Phase (I), which is based on DSATUR, runs in O(m^2). In Phase (II), the ‘while’ loop can iterate m times in the worst case, and during each iteration the merging cost for every pair of colors needs to be computed, which takes O(m^2). The overall complexity of the second phase is therefore O(m^3), and so is the overall complexity of HapColor.

5.4 Validation

All experiments were run on a Linux x86 server with 16 CPU cores at 2.7 GHz and 32 GB of RAM. The haplotype assembly algorithms performed per-block haplotype reconstruction on the input DNA short reads. Each block consisted of the reads that did not cross adjacent blocks. The haplotypes generated from each block were concatenated to form chromosome-wide haplotypes.

5.4.1 Polyploidy Datasets

To the best of our knowledge, there is no publicly available real dataset of polyploidy organisms suitable for evaluation of haplotype assembly algorithms [NGB08, BYP14a, AI13]. Although the genomes of various polyploidy organisms have been sequenced and released using NGS technology, the corresponding haplotype sets and short reads are not provided. In the absence of such a dataset, the following two datasets are prepared (details are discussed in Section 5.4.3 and Section 5.4.6).

• Simulated Polyploidy Dataset: Following an approach similar to that of other researchers [BYP14a], a polyploidy dataset is simulated for Triticum aestivum (Bread Wheat), which has a very large genome and is classified as a hexaploid. Haplotype copies and the corresponding fragments are generated for one of the 7 chromosomes (5A), with an estimated size of 2 Gbp. To generate extra haplotype copies for the decaploid case, a recombination of the haplotypes in the hexaploid data is used.

• Modified HuRef Dataset: The second dataset is based on the well-known publicly available HuRef dataset [LSN07], which is diploid. This dataset is carefully modified to represent a polyploidy genome.

5.4.2 Comparative Analysis Approach

All analyses are performed on four different haplotype assembly algorithms: HapColor, HapTree, Greedy, and RFP (Random Fragment Partitioning). Greedy, inspired by Levy’s greedy algorithm [LSN07], is a fragment partitioning method originally designed for diploid haplotyping. The original algorithm generates haplotypes by greedily allocating the fragments to two disjoint partitions, each of which forms one haplotype copy; at each iteration, a selected fragment is assigned to one of the final partitions. To accommodate more than two haplotypes, this algorithm is slightly modified. The RFP algorithm is developed to provide a baseline performance for polyploid haplotyping; it partitions the fragments into K groups at random and builds one haplotype copy from each partition (see the sketch after this paragraph). HapTree uses a relative likelihood function to measure the concordance between the aligned read data and a given haplotype under a probabilistic model. To identify a haplotype of maximal likelihood, HapTree finds a collection of high-likelihood haplotype partial solutions, which are restricted to the first n0 SNP sites, and extends those to high-likelihood solutions on the first n0 + 1 SNP sites [BYP14a].
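The RFP baseline is essentially the following few lines. This is a hypothetical re-implementation for illustration only; the consensus step and the ‘–’ missing-value convention follow the fragment-matrix encoding used throughout the chapter.

```python
import random
from collections import Counter

def rfp(fragments, k, seed=0):
    """Random Fragment Partitioning baseline: assign each read to one of k
    partitions at random, then take the per-SNP consensus of each partition.

    fragments: list of reads, each a list of alleles ('0', '1', or '-' for missing).
    Returns k haplotypes as lists of alleles ('-' where a partition has no coverage).
    """
    rng = random.Random(seed)
    n = len(fragments[0])
    partitions = [[] for _ in range(k)]
    for frag in fragments:
        partitions[rng.randrange(k)].append(frag)

    haplotypes = []
    for part in partitions:
        hap = []
        for site in range(n):
            alleles = [f[site] for f in part if f[site] != '-']
            hap.append(Counter(alleles).most_common(1)[0][0] if alleles else '-')
        haplotypes.append(hap)
    return haplotypes
```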

5.4.3 Preparation of Simulated Data

The approach to generating a simulated polyploidy dataset is consistent with the method described in [BYP14a]. In order to simulate a haplotype, the haplotype length, Lh, and a ploidy level, ‘K’, are fixed. The SNP positions are identified assuming that the distance between adjacent SNPs follows a geometric random variable with parameter p, the SNP density. For each identified SNP, its haplotype value is randomly generated, assuming that the alternative and reference alleles are equally likely. After a haplotype is generated, |R| paired-end reads are generated, where the number of reads |R| is a function of the coverage, CX. To generate a read set with CX coverage, each base pair needs to be covered by C reads on average. Given the haplotype length Lh and the read length lr, the total number of simulated reads is equal to

|R| = \frac{L_h \times C}{2 \times l_r}    (5.9)

Many of these reads will cover a small number of SNPs; thus, for CX coverage, the number of useful reads for any SNP will be less than C. In order to generate a paired-end read, a starting point on the genome is uniformly selected. The read length lr is fixed to 250 in these experiments. The fragment length is normally distributed with a mean of 600 and a standard deviation of 30. The insert length is determined by the fragment length, lf, and the read length; that is, linsert = lf − 2 × lr. Once the start position and fragment length are known, the chromosome to read from is selected uniformly among the K chromosomes. Finally, uniform error is injected into the reads: for every SNP that a read covers, independently with probability ε the allele is flipped to any other allele.
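The simulation recipe above can be summarized in a short sketch. Parameter names are illustrative, the alphabet is simplified to biallelic 0/1, and the paired-end insert bookkeeping is omitted for brevity; this is a sketch under those assumptions, not the generator actually used for the experiments.

```python
import numpy as np

def simulate_polyploid_reads(hap_len, k, snp_density, coverage, read_len,
                             error_rate, seed=0):
    """Simulate K haplotypes over SNP sites and error-injected reads:
    geometric SNP spacing, equally likely alleles, |R| = hap_len * coverage / (2 * read_len)
    reads (eq. 5.9), and per-SNP allele flips with probability error_rate."""
    rng = np.random.default_rng(seed)

    # SNP positions: inter-SNP distances are geometric with parameter snp_density.
    gaps = rng.geometric(snp_density, size=hap_len)
    snp_pos = np.cumsum(gaps)
    snp_pos = snp_pos[snp_pos < hap_len]

    # K haplotype copies, one random biallelic value (0/1) per SNP per copy.
    haplotypes = rng.integers(0, 2, size=(k, len(snp_pos)))

    n_reads = int(hap_len * coverage / (2 * read_len))
    reads = []
    for _ in range(n_reads):
        start = rng.integers(0, hap_len)                 # uniform start position
        chrom = rng.integers(0, k)                       # uniform source chromosome
        covered = np.where((snp_pos >= start) & (snp_pos < start + read_len))[0]
        if len(covered) < 2:                             # keep reads spanning >= 2 SNPs
            continue
        alleles = haplotypes[chrom, covered].copy()
        flips = rng.random(len(alleles)) < error_rate    # inject uniform errors
        alleles[flips] = 1 - alleles[flips]
        reads.append((covered, alleles))
    return haplotypes, reads
```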

Figure 5.3: HapColor performance in terms of normalized MEC (a) and reconstruction rate (b) as a function of error rate for various polyploidy data; normalized MEC as a function of error rate for various coverage levels (c); and evolution of the algorithm during color merging (d).

5.4.4 HapColor Performance on Simulated Data

Figure 5.3 shows various performance measures for HapColor on the generated polyploidy data. Figure 5.3(a) shows the accuracy of HapColor as a function of error rate for various ploidy levels. The accuracy, reported as MEC/SNP, is measured by MEC values normalized by the number of columns in the fragment matrix. With an error-free fragment matrix (i.e., ε = 0), the MEC obtained by HapColor was 0 for all ploidy levels except hexaploid and decaploid, which had MEC values of 0.0017 and 0.0006, respectively. This can be explained by the sub-optimality of the heuristic coloring algorithm, Phase (I) in Algorithm 8. By varying the error rate from 0% to 25%, the overall MEC ranged from 0 to 1.48. On average, the MEC was 0.79, 0.38, 0.16, and 0.032 for triploid, tetraploid, hexaploid, and decaploid, respectively. Since all ploidy levels use fragment matrices of the same size (i.e., m and n are fixed across all ploidy levels), MEC naturally decreases for higher ploidy numbers. This is primarily because, with a higher ploidy, the chance of experiencing conflicts among fragments and reconstructed haplotypes decreases, since the fragments are distributed across a larger number of partitions.

Figure 5.3(b) shows the amount of reconstruction rates as a function of error rates. The overall reconstruction rate was 71.01% averaged over all error rates and ploidy levels with a fixed coverage of C=15. The reconstruction rate ranged from 60.4% for decaploid to 88.8% for triploid.

Figure 5.3(c) illustrates the amount of MEC as a function of error rates for various coverage levels for the tetraploid data. The amount of MEC ranged from 0.0098 for lowest coverage, C=5 to 0.96 for C=25 with an overall MEC/SNP value of 0.43.

Figure 5.3(d) illustrates how the two phases of HapColor work. The graph shows this process for the tetraploid data with  = 0.1. The graph shows how MEC changes after completion of Phase (I), vertex coloring, and during each iteration in Phase (II), color merging. Although the data is tetraploid (K = 4), the vertex coloring generates 8 distinct colors (q = 8) in Phase (I). During each iteration of Phase (II), two colors are combined resulting in reducing the number of colors by 1. The color merging iterates 4 times until 4 distinct colors are left in the graph q = K = 4). As the colors are merged, the amount of MEC also increases, from the original 0 to 0.35, due to the errors in the fragment matrix which in turn mapped some conflicting fragments to the same partition.

Table 5.2: MEC comparison of HapColor with other algorithms on simulated polyploidy data.

Dataset       HapColor   HapTree   Greedy   RFP
Triploid      6439       8452      24287    62133
Tetraploid    2926       3897      7252     31508
Hexaploid     2230       3074      5318     26364
Decaploid     437        596       980      5478
Total         12032      16019     37837    125483
Improvement   –          24.9%     68.2%    90.4%

5.4.5 Comparative Analysis on Simulated Data

Table 5.2 shows a comparison of the four algorithms in terms of total MEC for four ploidy levels: triploid, tetraploid, hexaploid, and decaploid. The MEC values reported in this table represent absolute MEC values rather than the normalized MEC reported previously. As shown in this table, HapColor outperforms all other algorithms. Overall, HapColor reduces the MEC of HapTree, Greedy, and RFP by 24.9%, 68.2%, and 90.4%, respectively.

5.4.6 Comparative Analysis on HuRef Data

The HuRef dataset used for this analysis contains reads for all 22 chromosomes of an individual, J.C. Venter. The original dataset includes 32 million DNA short reads generated by the Sanger sequencing method, with 1.85 million genome-wide heterozygous sites. The reads are paired-end (each end up to 200 bp), the insert size follows a normal distribution with mean 1000, and the heterozygosity rate is 1/1200 bp. Detailed statistics about this dataset can be found in [MW14b]. The original HuRef dataset is diploid, but since it is the only publicly available dataset that includes not only fragments but also the haplotype set of the individual, it was utilized for simulation of a polyploidy dataset. The extra haplotype copies are generated based on the statistics of the original haplotypes. All statistics such as variant sites, read lengths, and the length of each chromosome are maintained. The contents of the variant sites in the extra haplotypes are generated randomly among the four alleles and then encoded to be compatible as input to the algorithm. For each chromosome, the actual chromosome length and a fixed haplotype length, Lh, are used. The position and length of each read and the insert size remain intact (i.e., as in the original dataset). For each SNP covered by a read, a haplotype is uniformly selected among the K haplotypes. Similar to the previous approach, uniform error is added to the reads: for every SNP that a read covers, independently with probability ε the allele is flipped to any other allele.

In order to perform a genome-wide analysis, all algorithms are run on the whole-genome HuRef-based dataset. Table 5.3 shows a comparison of the accuracy measure, MEC. As shown in Table 5.3, HapColor outperforms all other algorithms. The amount of reduction in MEC ranged from 22.0% for HapTree on triploid data to 85.1% for RFP on decaploid data. Overall, HapColor achieves 22.9%, 52.2%, and 75.3% reduction in the MEC scores obtained by HapTree, Greedy, and RFP, respectively. As shown in this table, the amount of improvement in the MEC measure decreases as the ploidy number grows. This can be explained by the fact that, as the ploidy number grows, the likelihood of assigning fragments to wrong partitions increases, resulting in larger MEC values for higher ploidy numbers.

5.5 Discussion

In this chapter, the design and validation of HapColor, a highly scalable and generalizable haplotype assembly framework, is presented, with promising results on two dataset types with various ploidy levels. The framework offers a computationally simple algorithm to partition DNA short reads into disjoint sets and build any number of polyploidy haplotypes accordingly.

HapColor focuses on minimizing the MEC objective function, which is the most widely used metric in haplotype assembly problems. Other objective functions for haplotype assembly have previously been proposed, such as switching error (SWER), minimum weight edge removal (MWER), percentage of phased SNPs, minimum fragment removal (MFR), minimum SNP removal (MSR), and haplotype length (HL). However, the most common source of error is base miscalling, and the MEC objective serves as a good model for this type of error [BB08]. Moreover, MEC is the only accuracy measure that is reported by all haplotype assemblers, since it can indirectly model other sources of error; e.g., a haplotype assembly with a low MEC score is also likely to be good under the minimum fragment removal objective. Additionally, several of the aforementioned measures are algorithm-specific and cannot be used for comparison across different assemblers.

As is the accepted approach in the literature, nearly all haplotype assembly algorithms follow the tradition of biallelic encoding rather than representing each SNP site with multiple (i.e., more than two) alleles. Additionally, a biallelic encoding allows us to compare the performance of HapColor with that of other algorithms. Yet, the methodology used in HapColor is independent of the assumption of a biallelic encoding. With minimal effort, the distance function and MEC calculation algorithms can be modified to accommodate a multi-allele representation in HapColor. A multi-allele encoding is not expected to have a negative impact on HapColor’s performance, but a full investigation of its potential impact is left as future work.

Techniques such as the one presented in [NGB08] cannot be used in our evaluation because such algorithms focus on haplotype inference which does not use NGS data.

Although very promising results are achieved in this project, the HapColor framework can potentially be extended to ploidy-agnostic haplotyping, where indels and CNVs are included in the haplotyping process. Moreover, future work involves solving problems of mixed ploidy in various organisms by dynamically learning the ploidy number, which may also find applications in organism identification.

5.6 Discussion and Conclusion

Due to limited access to sequencing data from polyploidy organisms, researchers in computational biology have been less inclined to develop algorithms for polyploidy phasing. We think it is essential to understand the structure of polyploidy organisms (notably wheat), not only for the sake of understanding their genome structure, but also because of its application in studying the effect of nutrition and lifestyle on human metabolism-related diseases, among others. Therefore, we started this project despite the very limited research available in this area. In this chapter, HapColor, which is widely applicable to evolutionary studies of polyploids, is presented. HapColor is built on the concept of graph coloring: it efficiently partitions DNA reads and reconstructs one haplotype copy from each partition. HapColor is compared with three other haplotype assemblers, namely the HapTree, Greedy, and RFP algorithms. Our extensive analysis showed that HapColor outperforms all these techniques on polyploidy data. The reduction in the MEC measure obtained using HapColor ranged from 22.9% to 87.9%, depending on the ploidy level of the organism and the algorithm used for comparison.

Table 5.3: MEC comparison of HapColor, HapTree, Greedy, and RFP on HuRef-based data.

              Triploid                          Tetraploid                         Hexaploid                        Decaploid
Chr    Color   Tree    Greedy   RFP      Color   Tree    Greedy   RFP      Color   Tree    Greedy   RFP      Color   Tree    Greedy   RFP
1      34204   43507   65261    94628    18256   22838   37589    67774    7923    10260   17502    39082    2115    2856    4402     14598
2      51217   65916   102485   135998   18810   23795   37112    59416    8728    11364   17176    42305    2182    2782    4892     16011
3      39411   50288   89462    118985   21147   27026   41216    75054    8713    11143   17068    41646    1804    2395    3621     11560
4      68698   88895   152990   223365   24359   30863   46866    70674    10036   12655   22159    51032    2840    3758    5797     19796
5      59546   75921   124034   181711   35406   47692   74529    112837   15295   19609   30499    69812    4527    5664    9942     30582
6      47404   59919   99596    148896   74423   98983   142446   269365   29918   39941   64025    151803   7659    10309   15050    46459
7      55924   70744   119342   160992   30583   41073   69883    118452   15108   19762   34280    80251    3807    4900    8681     28672
8      31938   40146   62215    76711    21985   29503   44607    76812    9475    11996   20561    49039    1905    2512    3685     11476
9      33839   42637   77593    116390   11669   15286   24353    37576    5298    6818    10701    26517    1155    1527    2383     7619
10     43795   55619   95429    126825   12081   16141   26531    49745    5255    7079    11683    23599    1130    1471    2487     8070
11     55448   72083   114279   123650   21486   28211   47828    77147    10636   13582   23813    48317    2989    4011    6228     20728
12     43916   57442   97055    112584   22972   29013   48585    78805    9832    13263   20794    50821    2635    3449    5554     16708
13     52204   69483   119337   159196   38345   48430   79490    134099   17102   22506   35179    79047    4036    5178    8153     26513
14     45799   58165   96270    133334   24274   30342   47261    80438    10462   14009   23581    52987    2207    2909    4327     13798
15     23391   29777   44654    62962    18537   23838   37018    63004    8990    11597   18106    36720    2275    3018    4522     15225
16     39801   50428   83105    112856   21594   28094   41655    79645    10041   12672   19550    41056    2330    2994    4715     14890
17     22500   29969   50849    64527    12368   15954   24983    42071    4959    6244    10653    25397    1418    1854    3034     10501
18     16464   20844   35744    43572    9638    12577   20759    38820    4260    5632    9078     19090    1116    1499    2421     7628
19     16721   21302   32723    35537    13550   18116   29470    55434    6057    7928    13900    28745    1569    2077    3464     11191
20     16090   21545   35141    39077    6498    8214    13730    25676    2781    3529    5357     11966    756     961     1483     4657
21     11218   14437   24129    35880    7870    10026   16951    29275    3872    5041    8681     20139    883     1127    1783     5795
22     12295   14262   28242    41064    7961    10206   15644    27392    3399    4457    6802     13829    744     948     1603     5568
Total  822K    1053K   1750K    2349K    474K    616K    968K     1669K    208K    271K    441K     1003K    52K     68K     108K     348K
Imp    –       22.0%   53.0%    65.0%    –       23.1%   51.1%    71.6%    –       23.2%   52.8%    79.3%    –       23.6%   51.9%    85.1%

CHAPTER 6

Correlation Clustering for Polyploid Phasing

6.1 Introduction

Polyploidy refers to the presence of more than two copies of each chromosome in the cells of an individual organism. The occurrence of polyploidy is known to have resulted in new species. Many species that are currently diploid, including humans, were derived from polyploid ancestors [MP05]. Polyploidy is common in ferns and flowering plants [WTB09] and exists in animals such as fish and amphibians. Furthermore, recent evidence suggests that polyploidy is more important in animal evolution than was previously thought [Gla10]. Despite the fact that polyploid phasing finds important applications in genomic studies such as cancer genomics, metagenomics, and viral quasispecies sequencing, the majority of current phasing techniques focus on diploids. Only recently have researchers started developing computational techniques for polyploid haplotyping [BYP14a, BYB15, DV15].

As new generations of sequencing technologies continue to advance, they are poised to deliver larger, more accurate, and less costly data for DNA phasing, thus allowing for single-individual haplotyping. Although DNA phasing has been studied for over a decade [BB11] through experimental and computational methods, it has remained a challenging research problem due to the costs of experimental techniques and the complexity of computational approaches. Unlike experimental phasing techniques, which are less practical on large datasets, computational approaches have received more attention because of their promising practical feasibility and cost-efficiency. Nonetheless,

only limited research has been conducted on developing computational phasing algorithms that are not only accurate but also scalable on large datasets. Our polyploidy algorithm presented in the last chapter, HapColor, is no exception. There are aspects of polyploid phasing that it does not address. For instance, HapColor reports its accuracy via the MEC score, whose validity diminishes as the ploidy number increases owing to the nature of MEC. Also, for many organisms we lack access to the haplotype set of the organism under study, so there is no choice other than using the currently most popular accuracy measure, Minimum Error Correction. Yet, if the haplotype set is available, SWitching Error (SWER) is a more appropriate option for evaluating an algorithm’s performance.

6.1.1 Contributions and Summary of Results

Chapter 5 introduced HapColor with a primary focus on optimizing the Minimum Error Correction criterion. In this chapter, a new polyploid haplotyping framework, called PolyCluster, is introduced based on the concept of correlation clustering. An advantage of PolyCluster over HapColor is that it partitions fragments based on a combined measure of inter-partition and intra-partition similarity/dissimilarity and is therefore expected to result in higher-quality haplotypes. In developing and validating PolyCluster, our contributions can be summarized as follows.

• A formal definition of partitioning-based haplotyping with the objective of minimizing the amount of partitioning error, computed as combined inter-partition similarity and intra-partition dissimilarity, will be presented (Section 6.2.1).

• The optimization problem will be transformed into a correlation clustering problem on a similarity graph model that captures the amount of similarity between pairs of DNA short reads (Section 6.2.2).

• A two-phase algorithm, consisting of a fragment clustering method on the introduced graph based on linear programming, rounding, and cluster-region-growing, followed by a cluster-merging technique, will be introduced to accurately partition short reads and reconstruct the haplotype associated with each cluster (Section 6.3).

• A comprehensive analysis of PolyCluster’s performance and a comparison with several polyploid haplotyping algorithms will be presented (Section 6.4).

Our results demonstrate that PolyCluster substantially improves the accuracy of haplotype assembly in terms of switching error and running time while achieving comparable results in terms of the minimum error correction score. Specifically, PolyCluster reduces the switching error of HapColor, HapTree, KMedoids clustering, and Greedy by 51.2%, 48.3%, 64.8%, and 66.34%, respectively. Furthermore, PolyCluster is several orders of magnitude faster than HapTree while achieving a running time comparable to that of HapColor.

6.2 Problem Statement

The goal of haplotype assembly is to re-build each copy of an organism’s chromosome from a large collection of DNA reads. Conventional individual haplotype assemblers often take as input a standard matrix, called the fragment matrix, which contains the DNA short reads. Each row in a fragment matrix is associated with a read and each column represents a SNP site. Each read is represented by a sequence of alleles from the alphabet A = {A, T, C, G, ‘–’}, where ‘–’ refers to a SNP site not covered by the read. However, using genotype calls, the underlying alphabet is reduced to a smaller alphabet, as discussed in the following section.

In many variant loci of an organism, the original alphabet can be encoded into a quinary alphabet Σ = {00, 01, 10, 11, ‘–’} to ease the computation, where each two-bit entry in Σ refers to an allele at a particular SNP site and ‘–’ indicates lack of information, either because the fragment does not span the site or because of a failure of the sequencing assay. In read matrices associated with higher-ploidy genomes, only the most popular alleles for each locus are kept; any other allele that is seen rarely is treated as ‘missing’ and is substituted with ‘–’ in the fragment matrix. Let {x1l, x2l, x3l, ..., xml} be the elements associated with the l-th column in a fragment matrix X. If, for example, the two most popular alleles are ‘A’ and ‘C’, their binary encoded values are saved; the values assigned to such cells in X are arbitrarily chosen (e.g., ‘00’ and ‘01’). If there is another allele in the same column in addition to ‘A’ and ‘C’ but it is seen rarely, that allele will be discarded and a ‘–’ placed in the corresponding cell of X. Additionally, the ploidy level, ‘K’, is known a priori in polyploid haplotyping. If the ploidy level is unknown, the problem is transformed into organism identification, which is out of the scope of this study.

DNA reads are aligned to a reference genome sequence prior to construction of the fragment matrix. The alignment is performed using a mapping/alignment algorithm, which may introduce errors in the short reads. Furthermore, all homozygous sites are discarded prior to mapping to the fragment matrix, as in all current haplotype assemblers. Since the SNP sites are sparse across the genome, only the fragments that cover at least two SNP sites are maintained.

Per-block haplotyping is performed by identifying disconnected haplotype blocks. Two adjacent haplotype blocks are disconnected if no read covers at least one SNP from each block. Identifying such blocks can be done using a simple algorithm as follows (a sketch is given after this paragraph). The algorithm starts with the first SNP site in the fragment matrix and puts that site in a queue. Next, it finds all SNP sites that are connected to the first site by identifying all SNP sites that share at least one read covering both sites; all such sites are added to the queue. At each point in time, one site is taken from the queue and any site connected to it is found and added to the queue. This process repeats until no sites are left in the queue. When the queue is empty, one disconnected block has been identified, consisting of all sites processed through the queue. This process is repeated for the remaining sites until all of them are covered. This simple algorithm allows us to find the disconnected blocks in the input fragment matrix.
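The block-identification procedure above is a standard breadth-first traversal over SNP-site connectivity. A minimal sketch follows, assuming the fragment matrix is given as a list of reads, each a dict from SNP index to allele; names are illustrative.

```python
from collections import deque

def disconnected_blocks(reads, n_sites):
    """Group SNP sites into disconnected blocks: two sites are connected if at
    least one read covers both, and blocks are the resulting connected components."""
    # Map each site to the reads covering it.
    site_reads = [[] for _ in range(n_sites)]
    for r, read in enumerate(reads):
        for site in read:
            site_reads[site].append(r)

    unassigned = set(range(n_sites))
    blocks = []
    while unassigned:
        seed = min(unassigned)               # start from the leftmost unprocessed site
        queue, block = deque([seed]), {seed}
        unassigned.discard(seed)
        while queue:
            site = queue.popleft()
            for r in site_reads[site]:       # every site sharing a read joins the block
                for other in reads[r]:
                    if other in unassigned:
                        unassigned.discard(other)
                        block.add(other)
                        queue.append(other)
        blocks.append(sorted(block))
    return blocks


# Toy usage: reads over sites {0,1}, {1,2} and {4,5} give blocks [0,1,2], [3], [4,5]
# (uncovered site 3 forms a singleton block).
reads = [{0: '0', 1: '1'}, {1: '1', 2: '0'}, {4: '0', 5: '1'}]
print(disconnected_blocks(reads, 6))
```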

The general form of polyploid haplotyping is to reconstruct ‘K’ haplotypes from a given set of reads obtained by DNA sequencing from ‘K’ copies of a chromosome. Solving this general haplotype assembly problem is challenging due to insufficient coverage and erroneous readings. Insufficient coverage refers to the fact that DNA sequencing technologies provide only small overlapping reads, resulting in only a subset of SNP sites being reconstructed in the final haplotype set. The problem becomes more challenging if the fragment matrix contains errors such as sequencing errors, chimeric read pairs, and false variants. In the real world of molecular biology, experiments are never error-free. These errors result in conflicting reads drawn from the same haplotype copy. The conflicting data prevent us from reliably reconstructing the correct allele at each SNP site in the resulting haplotype.

Polyploid haplotyping is defined as follows: given a fragment matrix X obtained from ‘K’ copies of a chromosome, reconstruct ‘K’ haplotypes such that an objective function is optimized. As discussed previously, a number of objective functions have been introduced for diploid haplotyping in the past; examples include Minimum Error Correction (MEC), Minimum Fragment Removal (MFR), Minimum SNP Removal (MSR), and Longest Haplotype Reconstruction (LHR) [LBI01]. In Minimum Fragment Disagreement (MFD), the goal is to find a partitioning of the fragments such that the sum of inter-partition similarity and intra-partition dissimilarity is minimized. We hypothesize that by combining inter-partition similarity and intra-partition dissimilarity in the clustering process, we will achieve haplotypes of higher quality in terms of minimizing switching error. This hypothesis is validated in our results section.

6.2.1 Problem Definition

Our approach to polyploid haplotyping is essentially a fragment partitioning method. As soon as the fragments are divided into ‘K’ disjoint partitions, each partition can be used to merge the fragments residing in that partition and build one haplotype copy. This problem is formally defined as follows.

Let Xm×n be a given fragment matrix containing SNP values xil ∈ Σ, obtained from fragments F = {f1, ..., fm} pertaining to n SNP sites, S = {s1, ..., sn}. Furthermore, let ‘K’ be the ploidy level, which is the number of chromosome copies from which the fragments in F are derived. Suppose that our algorithm assigns each fragment fi to one of the ‘K’ disjoint partitions P = {p1, ..., pK}, initialized as pk = ∅. Each set pk can be used to build a haplotype hk by merging the fragments in pk, resulting in the haplotype set H = {h1, ..., hK}. Polyploid haplotyping is then the problem of assigning each fragment fi in X to one of the ‘K’ partitions pk such that the overall cost, Z, in (6.1) is minimized.

Z = \sum_{i,j} sim(f_i, f_j)\, a_{ij} + \sum_{i,j} dis(f_i, f_j)\, (1 - a_{ij})    (6.1)

where sim and dis refer to similarity and dissimilarity, respectively, and aij is a binary variable indicating whether or not fragments fi and fj reside in different partitions. That is:

a_{ij} = \begin{cases} 1 & \text{if } f_i \in p_k \ \&\ f_j \in p_l \text{ s.t. } p_k \neq p_l \\ 0 & \text{otherwise} \end{cases}    (6.2)

The first term in (6.1) (i.e., \sum_{i,j} sim(f_i, f_j)\, a_{ij}) accounts for inter-partition similarity, while the second term (i.e., \sum_{i,j} dis(f_i, f_j)\,(1 - a_{ij})) refers to the dissimilarity of fragments that fall into the same partition and therefore accounts for intra-partition dissimilarity.
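A direct transcription of (6.1)–(6.2) is shown below. It is a small sketch: sim and dis are left abstract (any pairwise similarity/dissimilarity over fragments works), and the assignment mapping is hypothetical.

```python
def partitioning_cost(sim, dis, assignment):
    """Cost Z of a fragment partitioning, per (6.1): inter-partition similarity
    plus intra-partition dissimilarity.

    sim, dis: functions taking two fragment indices and returning a score.
    assignment: dict mapping fragment index -> partition label.
    """
    fragments = sorted(assignment)
    z = 0.0
    for idx, i in enumerate(fragments):
        for j in fragments[idx + 1:]:
            a_ij = 1 if assignment[i] != assignment[j] else 0   # eq. (6.2)
            z += sim(i, j) * a_ij + dis(i, j) * (1 - a_ij)
    return z

# Toy usage with precomputed score matrices S (similarity) and D (dissimilarity):
# z = partitioning_cost(lambda i, j: S[i][j], lambda i, j: D[i][j],
#                       {0: 'p1', 1: 'p1', 2: 'p2'})
```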

Problem 5 (Minimum Fragment-Disagreement Polyploid Haplotyping (MFDPH)).

Given Xm×n composed of m fragments F = {f1, ... , fm}, each of length n, and ploidy level ‘K’, find partition P = {p1, ... , pK } of the fragments (each pk corresponding to a haplotype hk) such that the cost function in (6.1) is minimized.

In contrast with conventional optimization approaches that focus on minimizing the MEC score, PolyCluster devises an optimization approach that takes into consideration both the similarity and the dissimilarity among DNA reads when constructing the final haplotypes. The hypothesis here is that optimization based on the cost function in (6.1) is a better proxy for optimizing switching error than the case where the optimization algorithm is solely focused on the dissimilarity or similarity of the reads.

6.2.2 Graph Modeling

In order to solve the optimization problem in Problem 5, we introduce a weighted graph model, called similarity graph, which uses an agreement score as edge weights. We will then use this graph model for vertex partitioning on the graph in order to obtain the fragment partitions discussed before. The introduced graph model aims to quantify the amount of inter-fragment agreement for each pair of fragments in X .

Definition 17 (Similarity Graph). A similarity graph G = (V, E, W) is an undirected weighted graph where V = {v1, ..., vm} is a set of vertices such that each vertex vi is associated with a fragment fi in X. An edge eij ∈ E exists if it is assigned a non-zero weight (i.e., wij ≠ 0). For each edge eij in G, the weight wij is given by

w_{ij} = \frac{1}{n} \left( \sum_{l=1}^{n} \theta(x_{il}, x_{jl}) - \sum_{l=1}^{n} \delta(x_{il}, x_{jl}) \right)    (6.3)

where θ(xil, xjl), the match function, and δ(xil, xjl), the mismatch function, are computed according to (6.4) and (6.5), respectively.

\theta(x, y) = \begin{cases} 1 & \text{if } x = y \ \&\ x \neq \text{'–'} \ \&\ y \neq \text{'–'} \\ 0 & \text{otherwise} \end{cases}    (6.4)

\delta(x, y) = \begin{cases} 1 & \text{if } x \neq y \ \&\ x \neq \text{'–'} \ \&\ y \neq \text{'–'} \\ 0 & \text{otherwise} \end{cases}    (6.5)

The match function θ(xil, xjl) is a binary similarity metric that indicates whether or not there is a match between xil and xjl in the original fragment matrix. Similarly, the mismatch function δ(xil, xjl) is a binary distance metric that indicates whether or not there is a mismatch between xil and xjl. Intuitively, \sum_{l=1}^{n} \theta(x_{il}, x_{jl}) represents the number of overlapping sites with an identical base and \sum_{l=1}^{n} \delta(x_{il}, x_{jl}) gives the number of overlapping sites with different SNP values in fi and fj. The weights wij are normalized according to the amount of overlap between the two fragments fi and fj as well as the length of the partial haplotype (i.e., n). In a similarity graph, an edge associated with two similar fragments receives a positive weight (i.e., wij > 0), and dissimilar fragments receive negative weights (i.e., wij < 0). Similar fragments are expected to be mapped onto the same partition while dissimilar fragments are meant to reside in different partitions. Therefore, the edge weights in the similarity graph can be treated as an agreement score of the two fragments on residing within the same partition. In other words, two fragments associated with a positive agreement score are expected to be grouped together, while a negative agreement score indicates that the corresponding fragments should be assigned to different partitions.
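The edge weight in (6.3) can be computed per pair of reads as follows; this is a minimal sketch assuming each fragment is a string over {'0', '1', '-'} of length n.

```python
def edge_weight(fi, fj):
    """Agreement score w_ij from (6.3): (# matching overlapped sites
    minus # mismatching overlapped sites) divided by the fragment length n."""
    n = len(fi)
    matches = sum(1 for a, b in zip(fi, fj) if a != '-' and b != '-' and a == b)
    mismatches = sum(1 for a, b in zip(fi, fj) if a != '-' and b != '-' and a != b)
    return (matches - mismatches) / n


# The two reads below overlap at two sites and agree on both: w = 2/6 ≈ 0.33.
print(edge_weight('1111--', '--1111'))
```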

An example of a fragment matrix and its associated similarity graph is shown in Figure 6.1. The fragments are drawn from a triploid haplotype with two sequencing errors, shown in ‘red’ in X. The edge weights are computed based on (6.3) and are normalized. The graph has a total of 26 edges; it does not include e37 and e38 because the weight of those edges is 0.

Figure 6.1: An example of a fragment matrix with 8 short reads and 6 SNP sites (a), and the corresponding similarity graph (b). Edge labels represent weights (wij) multiplied by ten (10X) for better visualization; that is, a label ‘−3’ on edge e58 represents w58 = −0.3.

Our formulation of this problem is generic enough that it can accommodate sequencing confidence scores as well. If ‘phred-scaled’ base quality scores are available from the sequencing process, then each entry in the fragment matrix is associated with a confidence score, and PolyCluster can use these scores to generate higher-quality haplotypes. Each nucleotide in a sequencing read usually includes a ‘phred-scaled’ base quality, Q, which corresponds to an estimated probability of 10^{-Q/10} that this base has been wrongly sequenced. These phred scores can serve as the cost of flipping a letter, allowing less confident base calls to be corrected at lower cost compared to high-confidence ones [GMM16].
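For completeness, the phred-to-probability conversion mentioned above is the following one-liner; how the resulting probabilities are folded into the edge weights is an implementation choice and is not specified by (6.3).

```python
def phred_to_error_prob(q):
    """Estimated probability that a base with phred quality q was miscalled."""
    return 10 ** (-q / 10)

# Q = 20 corresponds to a 1% chance of a wrong base call.
print(phred_to_error_prob(20))   # 0.01
```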

6.2.3 Problem Formulation

The haplotyping problem in Problem 5, which is a fragment partitioning problem, is equivalent to clustering the vertices in G such that the amount of disagreement among the constructed clusters is minimized. An edge with a negative weight inside a cluster is referred to as a negative mistake, and a positive edge across two clusters is called a positive mistake. The goal is to cluster the vertices into K groups such that the amount of fragment disagreement (and equivalently the overall cost in (6.1)) is minimized. Therefore, the weight of positive edges crossing clusters and the weight of negative edges inside clusters (i.e., the absolute value of such weights) should be minimized.

For any vertex vi in G, a positive neighborhood is defined as N^+(vi) = {vi} ∪ {vj : wij > 0}. Similarly, the negative neighborhood of vi is specified as N^−(vi) = {vi} ∪ {vj : wij < 0}. Furthermore, let C(vi) be the set of vertices that are in the same cluster as vi, given a clustering C = {C1, ..., CK}. Given a similarity graph G = (V, E, W), let bij be a binary variable denoting whether or not the two vertices vi and vj are in different clusters. That is:

b_{ij} = \begin{cases} 1 & \text{if } v_i \in C_k \ \&\ v_j \in C_l \text{ s.t. } C_k \neq C_l \\ 0 & \text{otherwise} \end{cases}    (6.6)

The number of mistakes due to clustering C is the sum of the positive and negative mistakes. We define the weight of the clustering as the sum of the weight of erroneous edges in C. For a clustering C, the weight of the clustering is given by

w(C) = w_p(C) + w_n(C)    (6.7)
     = \sum_{v_i \notin C(v_j),\ w_{ij} > 0} w_{ij} + \sum_{v_i \in C(v_j),\ w_{ij} < 0} |w_{ij}|    (6.8)

If the edge eij is within a cluster, then (1 − bij) = 1, and (1 − bij) = 0 otherwise. As a result, the total weight of clustering C is given by

w(C) = \sum_{w_{ij} > 0} w_{ij}\, b_{ij} + \sum_{w_{ij} < 0} |w_{ij}|\, (1 - b_{ij})    (6.9)

In order to minimize disagreement among the clusters in C, we need to find an assignment of the bij values that minimizes the total weight in (6.9) such that bij ∈ {0, 1} and bij satisfies the triangle inequality. We formulate the problem as an Integer Linear Program (ILP) as follows.

Minimize  \sum_{w_{ij} > 0} w_{ij}\, b_{ij} + \sum_{w_{ij} < 0} |w_{ij}|\, (1 - b_{ij})    (6.10)

Subject to:  b_{ij} \in \{0, 1\}    (6.11)
             b_{ij} = b_{ji}    (6.12)
             b_{ik} \leq b_{ij} + b_{jk}    (6.13)

This problem is closely related to the problem of correlation clustering in weighted graphs with a fixed number of clusters (i.e., the number of clusters, K, is given) [Bec05]. It has been shown that this problem is NP-hard [Bec05, SST04, DEF06, GG06]. Furthermore, it has been shown that it is APX-hard to minimize disagreements in correlation clustering (proved by a reduction from the APX-hard minimum multi-cut problem) [DI03, DJP92].

6.3 PolyCluster Algorithm

In solving the polyploid haplotyping problem formulated in Section 6.2.3, there are a number of challenges: (1) the number of clusters/partitions needs to be exactly ‘K’, the ploidy level; (2) the variables bij are integer-valued. PolyCluster tackles this problem with a two-phase algorithm: in Phase (I), we relax the constraint on the number of clusters and use a rounding approach to develop an O(log m) approximation for minimizing fragment disagreement, which yields an initial clustering. In Phase (II), highly similar clusters are merged in order to reach a final clustering with exactly |C| = K clusters. This procedure is illustrated in Algorithm 9. The evolution of the algorithm is also shown through an example in Figure 6.2.

Algorithm 9 PolyCluster Algorithm
Require: Fragment matrix X & ploidy level K
Ensure: Haplotypes H = {h1, ..., hK}
# Initialization
Construct similarity graph G(V, E, W) using X
# Phase (I): Initial Clustering
Initialize cluster index l = 1
while (V ≠ ∅) do
  Choose a vertex vi ∈ V at random
  r = 0
  while Cut(B(vi, r)) > λ ln(m + 1) × Vol(B(vi, r)) do
    Increase r by min{(wij − r) > 0 : vj ∉ B(vi, r)} so that B(vi, r) contains another edge eij
  end while
  Output cluster Cl = {vj : vj ∈ B(vi, r)}
  l = l + 1
  V = V − {vj : vj ∈ B(vi, r)}
  E = E − {eji : vj ∈ B(vi, r)}
end while
# Phase (II): Cluster Merging
while (|C| > K) do
  L = |C|, the number of clusters in C
  ∀ k, l ∈ {1, ..., L} compute ∆kl according to (6.39)
  (k̂, l̂) = argmax_{k,l} {∆kl}
  Merge C_k̂ and C_l̂
end while
For each cluster Ck, build a haplotype hk by obtaining the consensus allele at each SNP site within the partition, and add hk to H

110 10 10 1 2 1 2

7 7 7 7 3 -7 -7 3 -7 -7

-10 -10 -10 -10 -3 -10 -10 -3 -10 -10 -7 -7 -7 -7 3 3 6 3 3 3 6 3 3 3 7 7 7 7 -7 -7 4 5 4 5 10 10

-3 -3 -7 -3 -3 -3 -3 -7 -3 -3

7 8 10 7 8 # 1010

#

(a) Vertex vi = 7 chosen as first node for (b) Region growing around vi = 7 in- region growing cludes vj = 8; Cut(B1) = 1.3 and Vol(B1) = 1.7.

B2 #

1010 1010 1 2 1 2

7 7 7 7

-7 -7 3 -7 -7 3

-10 -10 -10 -10 -3 -10 -10 -3 -10 -10 -7 -7 -7 -7 B3 3 3 6 3 3 3 6 3 3 3 7 7 7 7 -7 -7 4 5 4 5 10 101

-3 -3 -3 -3 -7 -3 -3 -7 -3 -3

7 8 7 8 1010 1010

# #

(c) Region growing around vi = 1 in- (d) Region growing around vi = 6 in- cludes vj = 2; Cut(B2) = 2.7 and Vol(B2) cludes vj = 4; Cut(B3) = 1.7 and Vol(B3) = 2.3. = 1.5.

C2 #

1010 1 2 1010 1 2

B4 7 7 7 7 -7 -7 3 -7 -7 3

-10 -10 -10-10 -10 -3 -10 -10 -33 -10 -10 -7 -7 B3 -7 -77 C3 3 3 6 3 3 3 6 3 3 3 7 7 7 7 B5 -7 -7 4 5 4 5 101 1010

-3 -3 -3 -3 -7 -3 -3 -7 -3 -3

7 8 7 8 1010 1010

# C1

(e) Verties vi = 3 and vi = 5 remain (f) B4 and B5 are merged into B2 and single-node balls B4 and B5, respectively, B3, respectively, forming final clusters with Cut(B4) = 1.3, Vol(B4) = 0.7, C1, C2, and C3. Cut(B5) = 1.7, and Vol(B5) = 0.8. Figure 6.2: Evolution of Algorithm 9 for the similarity graph shown in Figure 6.1. The edge weights are multiplied by 10 for visualization. With the final clustering in (f), the amount of overall MEC is 2 with the reconstructed haplotypes H = {111111, 000000, 111100} 111. 6.3.1 Initial Clustering

The initial clustering uses a combination of rounding and region-growing techniques. The algorithm first solves a Linear Program (LP) and then uses the resulting fractional values to determine the disagreement between two vertices in G. A region-growing technique [DEF06] is then used to group similar vertices together and, finally, to round the fractional values. By relaxing the integrality constraint and the constraint on the number of clusters in (6.10)–(6.13), the goal of the initial clustering is to find C = {C1, ..., CL} that minimizes the objective function in (6.14) subject to the constraints in (6.15)–(6.17).

Minimize  \sum_{w_{ij} > 0} w_{ij}\, b_{ij} + \sum_{w_{ij} < 0} |w_{ij}|\, (1 - b_{ij})    (6.14)

Subject to:  b_{ij} \in [0, 1]    (6.15)
             b_{ij} = b_{ji}    (6.16)
             b_{ik} \leq b_{ij} + b_{jk}    (6.17)
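One illustrative way to set up the relaxed program (6.14)–(6.17) is with an off-the-shelf LP solver. The sketch below uses scipy.optimize.linprog on a dense pairwise weight matrix and enumerates all triangle constraints, which is only practical for small blocks; it is a sketch under those assumptions, not the solver used in PolyCluster.

```python
import numpy as np
from itertools import combinations
from scipy.optimize import linprog

def relaxed_clustering_lp(w):
    """Solve the LP relaxation (6.14)-(6.17) of the correlation-clustering ILP.

    w: symmetric (m x m) weight matrix of the similarity graph (0 = no edge).
    Returns the fractional distances b_ij as a symmetric matrix.
    """
    m = len(w)
    pairs = list(combinations(range(m), 2))
    index = {p: k for k, p in enumerate(pairs)}        # pair -> LP variable index

    # Objective: +w_ij * b_ij for positive edges, -|w_ij| * b_ij for negative ones
    # (the constant sum of |w_ij| over negative edges is dropped).
    c = np.zeros(len(pairs))
    for (i, j), k in index.items():
        c[k] = w[i][j] if w[i][j] > 0 else -abs(w[i][j])

    # Triangle inequalities b_ik <= b_ij + b_jk for every triple, as A_ub x <= 0.
    rows = []
    for i, j, k in combinations(range(m), 3):
        for long_side, s1, s2 in (((i, k), (i, j), (j, k)),
                                  ((i, j), (i, k), (j, k)),
                                  ((j, k), (i, j), (i, k))):
            row = np.zeros(len(pairs))
            row[index[long_side]] = 1.0
            row[index[s1]] = -1.0
            row[index[s2]] = -1.0
            rows.append(row)

    A_ub = np.array(rows) if rows else None
    b_ub = np.zeros(len(rows)) if rows else None
    res = linprog(c, A_ub=A_ub, b_ub=b_ub,
                  bounds=[(0.0, 1.0)] * len(pairs), method="highs")

    b = np.zeros((m, m))
    for (i, j), k in index.items():
        b[i][j] = b[j][i] = res.x[k]
    return b
```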

Phase (I) in PolyCluster is a greedy iterative process, with each iteration aiming to construct one cluster. Each cluster is formed by choosing a vertex vi in G and growing a “ball” around vi by iteratively adding vertices within a radius ‘r’ to the “ball”. A “ball” of radius ‘r’ around vertex vi is defined as follows.

Definition 18 (ball). A “ball” B(vi,r) of radius r around a vertex vi ∈ V is the set of vertices vj ∈ V such that bij ≤ r, as well as the subgraph formed by these vertices and the edges ejk with only one endpoint vj ∈ B(vi, r).

In constructing a “ball” around a vertex, PolyCluster greedily chooses an unclustered vertex that has the highest total edge weight. That is, a candidate vertex v̂i for region growing is chosen by

\hat{v}_i = \arg\max_{v_i \notin C} \sum_{e_{ij}} w_{ij}    (6.18)

Intuitively, this approach chooses a vertex that is likely to result in a “ball” with a larger radius, because more positive edges are connected to vi. Figure 6.2 shows the evolution of PolyCluster for the similarity graph in Figure 6.1. In Figure 6.2(a), vi = 7 is chosen as the first vertex for region growing and the formation of the first cluster because \sum_j w_{7j} = 1, which is the highest total edge weight among all vertices in the graph. Initially, the “ball” (i.e., B1) has a radius of zero (r = 0). The radius is increased to include vertex vj = 8, as shown in Figure 6.2(b). The region-growing process stops when the following condition holds:

Cut\big(B(v_i, r)\big) \leq \lambda \ln(m + 1) \times Vol\big(B(v_i, r)\big)    (6.19)

where λ is a constant, and Cut and Vol of a given “ball” B(vi, r) are defined as follows.

Definition 19 (Cut). The cut of B(vi, r) is the weight of the positive edges with exactly one endpoint in B. That is:

Cut\big(B(v_i, r)\big) = \sum_{v_j \in B(v_i, r),\ v_k \notin B,\ w_{jk} > 0} w_{jk}    (6.20)

Definition 20 (Volume). The volume of B(vi, r) is the sum of the weighted values of the edges eij ∈ B. That is:

Vol\big(B(v_i, r)\big) = \sum_{e_{ij} \in B,\ w_{ij} > 0} w_{ij}\, a_{ij}    (6.21)

The volume of B(vi, r) also includes fractional weighted values of the positive edges leaving B(vi, r): if wjk > 0 is a cut positive edge of B(vi, r) with vj ∈ B(vi, r) and vk ∉ B(vi, r), then ejk contributes a weight of wjk(r − ajk) to the volume of B(vi, r).
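The region-growing step can be summarized as follows. This is a simplified sketch under stated assumptions (the fractional contribution of cut edges to the volume is omitted, and the radius is extended one vertex at a time); it is not the PolyCluster implementation.

```python
import math

def grow_ball(center, b, w, lam, initial_volume):
    """Grow one region ("ball") around `center` until
    Cut(B) <= lam * ln(m + 1) * Vol(B), following Phase (I) of Algorithm 9.

    b: fractional distances from the LP relaxation (symmetric matrix).
    w: edge weight matrix of the similarity graph.
    Simplification: Vol(B) is the initial volume plus the full weight of
    positive edges inside the ball (partial cut-edge contributions omitted).
    """
    m = len(w)
    threshold = lam * math.log(m + 1)
    ball = {center}
    while True:
        cut = sum(w[j][k] for j in ball for k in range(m)
                  if k not in ball and w[j][k] > 0)
        vol = initial_volume + sum(w[j][k] for j in ball for k in ball
                                   if j < k and w[j][k] > 0)
        if cut <= threshold * vol:
            return ball
        # Extend the radius to the nearest outside vertex by fractional distance.
        outside = [j for j in range(m) if j not in ball]
        if not outside:
            return ball
        ball.add(min(outside, key=lambda j: b[center][j]))
```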

As shown in Figure 6.2(b), the region-growing process stops after adding vj = 8 to the “ball” B1. The reason is that Cut(B1) = 1.3 and Vol(B1) = 1.7, so the condition in (6.19) holds. We will show that λ needs to be slightly larger than 2 (i.e., λ = 2 + ε) in order for the algorithm to provide an approximation ratio of O(log m). After the region-growing process stops for one “ball”, the set of vertices contained in that “ball” forms an initial cluster. This process is repeated for the remaining unclustered vertices until all vertices in G belong to some “ball”. As shown in Figures 6.2(c)–6.2(e), the region-growing process forms four other distinct initial clusters, or “balls”, namely B2, B3, B4, and B5, around vi = 1, vi = 6, vi = 3, and vi = 5, respectively. Note that for B4 and B5, however, the initial cluster includes only a single vertex, because no non-negative edge can be added to the chosen vertices (vi = 3 and vi = 5) as a result of region growing.

6.3.2 Cluster Merging

As soon as all vertices in G become members of some initial cluster or “ball”, PolyCluster proceeds by examining whether some clusters need to be combined to reach the final cluster number ‘K’. This process is iterative, merging two clusters at a time. If the number of clusters is larger than ‘K’ (i.e., L > K), the two clusters with the highest similarity value, computed according to (6.22), are chosen for merging.

\gamma_{kl} = \frac{1}{|E_{kl}|} \sum_{v_i \in C_k;\ v_j \in C_l} w_{ij}    (6.22)

where the normalization factor |Ekl| refers to the number of edges between clusters Ck and Cl.

The merging results for the similarity graph of Figure 6.1 are shown in Figure 6.2(f). The algorithm identifies B3 and B5 as the two candidates for cluster merging because they have the highest similarity value among all cluster pairs. In fact, γ35 = 0.8 while γ12 = 0.3, γ13 = −0.3, γ14 = 0, γ15 = −0.3, γ23 = −0.8, γ24 = 0.7, γ25 = −1, γ34 = −0.8, and γ45 = −0.7. The final clustering is not necessarily unique due to potential ties among equally similar initial clusters. The process of cluster merging continues by merging another pair of most similar “balls”, B2 and B4, with a similarity score of γ24 = 0.7.
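Phase (II) thus reduces to repeatedly evaluating (6.22) and merging the two most similar clusters. A compact sketch follows; cluster and matrix names are illustrative, and ties are broken arbitrarily.

```python
from itertools import combinations

def merge_clusters(clusters, w, k):
    """Merge the pair of clusters with the highest average inter-cluster edge
    weight (gamma in (6.22)) until exactly k clusters remain.

    clusters: list of sets of vertex indices; w: edge weight matrix.
    """
    def gamma(ca, cb):
        edges = [w[i][j] for i in ca for j in cb if w[i][j] != 0]
        return sum(edges) / len(edges) if edges else float('-inf')

    while len(clusters) > k:
        a, b = max(combinations(range(len(clusters)), 2),
                   key=lambda pair: gamma(clusters[pair[0]], clusters[pair[1]]))
        clusters[a] |= clusters[b]        # absorb cluster b into cluster a
        del clusters[b]
    return clusters
```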

6.3.3 Algorithm Analysis

The initial clustering in Phase (I) is analyzed first to show that the cost of the rounded solution is not significantly larger than the cost of the fractional solution. We use OPT to refer to the optimal solution, and FRC(bij) and RND(bij) to refer to the fractional and rounded solutions of bij in the linear programming formulation, respectively. The initial clustering gives an O(log m) approximation to the cost of positive edges between clusters and the cost of negative edges inside clusters.

Lemma 16. Phase(I) in PolyCluster guarantees an O(log m) approximation to the cost of positive edges between clusters.

Proof. We prove that wp(RND) ≤ λ ln(m + 1) ×wp(FRC) and that the algorithm terminates. Let B be the set of “balls” found in Phase(I). Given that each positive mistake edge has end points in two different clusters, the total weight of positive mistakes can be written as:

w_p(RND) = \sum_{w_{ij} > 0} w_{ij}\, RND(b_{ij}) = \frac{1}{2} \sum_{B \in \mathcal{B}} Cut(B)    (6.23)

Given that PolyCluster grows each B until Cut(B(vi, r)) ≤ λ ln(m + 1) × Vol(B(vi, r)), it is fair to conclude from (6.23) that:

w_p(RND) \leq \frac{\lambda}{2} \ln(m + 1) \times \sum_{B \in \mathcal{B}} Vol(B)    (6.24)

Let I be the initial volume of a “ball” defined in Phase (I), so the volume of B(vi, 0) is I. Let J be the volume of the entire graph G. Therefore, the weight of the positive mistakes made by the fractional solution is wp(FRC) = J. Let the initial volume I = J/m. By the design of the algorithm, all generated “balls” are disjoint. Consequently, using (6.24), wp(RND) can be written as follows.

w_p(RND) \leq \frac{\lambda}{2} \ln(m + 1) \sum_{B \in \mathcal{B}} \Big( \sum_{w_{ij} > 0} w_{ij}\, FRC(b_{ij}) + \frac{J}{m} \Big)    (6.25)
         \leq \frac{\lambda}{2} \ln(m + 1) \times \big( w_p(FRC) + J \big)    (6.26)
         \leq \lambda \ln(m + 1) \times w_p(FRC)    (6.27)

Therefore, the algorithm guarantees an O(log m) approximation to the cost of positive edges between clusters in Phase(I).

Lemma 17. For any vertex vi and a family of “balls” B(vi, r), the condition Cut(B(vi, r)) ≤ λ ln(m + 1) × Vol(B(vi, r)) is achieved by some r ≤ 1/λ [DEF06, Vaz13].

Lemma 18. Phase(I) guarantees an O(1) approximation to the cost of negative edges.

Proof. We claim that the “balls” returned by the algorithm have radius r ≤ 1/λ, which follows from Lemma 17. This guarantees the bound on the radius and thus proves that the solution is a (λ/(λ − 2))-approximation of the cost of negative edges inside clusters. Let B be the set of balls found by Phase (I) in PolyCluster.

w_n(FRC) = \sum_{w_{ij} < 0} w_{ij}\,\big(FRC(b_{ij}) - 1\big)    (6.28)
         \geq \sum_{B \in \mathcal{B}} \sum_{e_{ij} \in B,\ w_{ij} < 0} w_{ij}\,\big(FRC(b_{ij}) - 1\big)    (6.29)
         \geq \sum_{B \in \mathcal{B}} \sum_{e_{ij} \in B,\ w_{ij} < 0} w_{ij}\,\Big(\frac{2}{\lambda} - 1\Big)    (6.30)
         \geq \Big(\frac{2}{\lambda} - 1\Big) \sum_{B \in \mathcal{B}} \sum_{e_{ij} \in B,\ w_{ij} < 0} w_{ij}    (6.31)
         = \frac{\lambda - 2}{\lambda}\, w_n(RND)    (6.32)

The third equation follows from the second equation, the triangle inequality, and the fact that r ≤ 1/λ. Phase (I) guarantees an O(1) approximation given that λ > 2 in the approximation ratio λ/(λ − 2).

Theorem 19. Phase(I) of PolyCluster guarantees an O(ln m)–approximation ratio.

Proof. Based on Lemma 16, the algorithm achieves an O(ln m) approximation to the cost of positive edges between clusters:

wp(RND) ≤ λ ln(m + 1) × wp(FRC) (6.33)

Furthermore, based on Lemma 18, the algorithm guarantees an O(1) approximation to the cost of negative edges inside clusters:

w_n(FRC) \geq \frac{\lambda - 2}{\lambda}\, w_n(RND)    (6.34)

Therefore, the total number of mistakes made by the algorithm is given by

w(RND) = w_p(RND) + w_n(RND)    (6.35)
        \leq \lambda \ln(m + 1) \times w_p(OPT) + \frac{\lambda}{\lambda - 2} \times w_n(OPT)    (6.36)
        \leq \max\Big(\lambda \ln(m + 1),\ \frac{\lambda}{\lambda - 2}\Big)\, w(OPT)    (6.37)

Therefore, the total number of mistakes made by the algorithm is within O(log m) of OPT, where λ = 2 + ε.

6.4 Validation

This section demonstrates the effectiveness of PolyCluster for phasing. Per-block haplotype reconstruction is performed on the input DNA short reads, where each block contains the reads that do not cross adjacent blocks. The haplotypes generated from each block are concatenated to form chromosome-wide haplotypes. The datasets used for our evaluation are discussed in detail, followed by the presentation of PolyCluster’s performance and a comparison of our framework against several polyploid phasing algorithms.

6.4.1 Dataset Preparation

Polyploidy datasets are prepared with respect to various ploidy levels, coverages, and error rates, consistent with approaches suggested in the literature [BYP14a, MFM17]. To the best of our knowledge, there is no publicly available real dataset for polyploidy organisms [NGB08, BYP14a, AI13]. Although the genomes of various polyploidy organisms have been sequenced using NGS technology and released, the corresponding haplotype sets and short reads are not provided for public use. In the absence of such a dataset, several polyploidy datasets are generated as follows. We prepared a dataset based on Triticum aestivum (Bread Wheat), which has a very large genome and is classified as a hexaploid. Haplotype copies and corresponding fragments are generated for one of the 7 chromosomes (5A), with an estimated size of 2 Gb.

Figure 6.3: Statistics about the generated datasets. Distribution of short reads in the datasets based on coverage values (a); histogram of the length of non-overlapping blocks (b); number of non-overlapping blocks for each ploidy level and coverage value (c); average length of non-overlapping blocks for each ploidy level and coverage value (d).

In order to simulate a haplotype, we first fix a haplotype length, Lh, and a ploidy level, ‘K’. We then identify SNP positions assuming that the distance between adjacent SNPs follows a geometric random variable with parameter p, the SNP density. For each identified SNP, we randomly generate its haplotype value, assuming that the alternative and reference alleles are equally likely. After a haplotype is constructed, we generate |R| paired-end reads, where the number of reads, |R|, is a function of the coverage, CX. To generate a read set with CX coverage, each base pair needs to be covered by C reads on average. Given the haplotype length Lh and the read length, lr, the total number of generated reads is given by

|R| = \frac{L_h \times C}{2 \times l_r}    (6.38)

Many of these reads will cover a small number of SNPs; thus, for CX coverage, the number of useful reads for any SNP will be less than C. In order to generate a paired-end read, a starting point on the genome is selected uniformly. The read length lr is fixed to be 2500 in our experiments. The fragment length is normally distributed with µ = 600 alleles and σ = 60 alleles. The insert length is determined by the fragment length, lf, and the read length; that is, linsert = lf − 2 × lr. Once the start position and fragment length are known, we need to choose which chromosome to read from; for this purpose, we draw reads uniformly from the K chromosomes. Finally, we add uniform error to the reads: for every SNP that a read covers, independently with probability ε we flip the allele to any other allele.

6.4.2 Dataset Statistics

The procedure described in Section 6.4.1 was repeated multiple times for five ploidy levels (i.e., K), six coverage levels (i.e., C), and seven different error rates (i.e., ε). The ploidy levels included K ∈ {3, 4, 6, 8, 10}, representing triploid, tetraploid, hexaploid, octoploid, and decaploid organisms, respectively. The coverage levels included C ∈ {5X, 10X, 15X, 20X, 25X, 30X}, and the error rates ranged from 0% to 30% (i.e., ε ∈ {0, 0.05, 0.1, 0.15, 0.2, 0.25, 0.3}).

The dataset preparation resulted in more than 1.4 billion short reads. For a given error rate ε, approximately 202 million short reads were generated, combined over all ploidy levels and coverages. Figure 6.3 illustrates several statistics about the 202 million short reads associated with a given ε.

Figure 6.3(a) shows the total number of short reads generated for each coverage level, CX. The number of short reads presented in this graph represents all short reads summed over the five ploidy levels. The number of short reads ranges from 9.8 million for 5X coverage to 54.2 million for 30X coverage. Although the number of reads computed in (6.38) grows linearly with the coverage, C, the number of reads ultimately used does not grow exactly linearly with the coverage level. This can be explained by the dataset preparation procedure described previously: because the start position of each read on the genome is chosen at random, it is likely that some reads will not span any SNPs. Such reads carry only missing values in the fragment matrix and are therefore eliminated from the dataset while pre-processing the data for algorithm input.

As mentioned previously, short reads are processed as a collection of non-overlapping blocks. Each block spans a number of SNPs such that the short reads covering those SNPs do not overlap with the reads that cover SNPs in adjacent blocks. As a result, each block forms a fragment matrix, X, where the block length refers to the number of SNPs covered within that block. The block lengths can vary significantly. Figure 6.3(b) shows a histogram of the block lengths. The total number of blocks was 722,995. Approximately 79.3% of these blocks had a length of less than 100 SNPs, and approximately 0.45% of the blocks had a length of more than 700 SNPs.

Figure 6.3(c) shows how the blocks are distributed across datasets of varying ploidy and coverage. The total number of such blocks was 154,381, 150,903, 145,193, 138,989, and 133,529 for the triploid, tetraploid, hexaploid, octoploid, and decaploid datasets, respectively. As shown in this graph, a larger number of the blocks belong to datasets with lower coverage compared to those with higher coverage. Expectedly, lower coverage results in short reads that are less likely to overlap, thus leading to a larger number of non-overlapping blocks. However, the blocks in lower-coverage datasets are likely to have a smaller length, as shown in Figure 6.3(d). The graph in Figure 6.3(d) shows the average block length for each dataset based on the ploidy level and coverage. On average, the block length was 20.1, 69.4, 121.2, 152.1, 171.8, and 172.6 for coverages 5X, 10X, 15X, 20X, 25X, and 30X, respectively. Furthermore, the standard deviation of the block length was 1.9, 3.1, 10.7, 12.2, 12.5, and 14.7 for coverages 5X–30X.


Figure 6.4: Impact of sequencing confidence scores on quality of reconstructed haplotypes in terms of switching error (a), reconstruction rate (b), normalized MEC (c), and weighted MEC (d) for triploid data with error rate ε = 5%. [Panels plot SWE, RR (%), normalized MEC, and weighted MEC against coverage (5X–30X), with and without confidence scores.]

6.4.3 Impact of Sequencing Confidence Scores

Our first analysis was to assess the impact of sequencing confidence scores (i.e., ‘Phred score’ discussed in Section 6.2.2) on the quality of the reconstructed haplotypes. Specifically, we were interested in evaluating the quality of the haplotypes in terms of switching error (SWE), reconstruction rate (RR), MEC score, and weighted MEC. Figure 6.4 shows the results of this analysis.

Figure 6.4(a) shows how confidence scores impact SWE. The amount of SWE ranged from 0.08 for coverage 30X to 0.13 for coverage 15X using confidence scores. On average PolyCluster achieved a SWE of 0.1062 with a standard deviation of 0.02

with confidence scores. Without the confidence scores, SWE ranged from 0.10 to 0.12 with a mean value of 0.1150 and a standard deviation of 0.005. Overall, PolyCluster performed 7.6% better in terms of SWE when confidence scores were used for similarity graph creation and haplotype reconstruction. SWE takes into account the reconstruction rate because it normalizes the amount of switch error with respect to the percentage of SNP values that are actually reconstructed. This way, SWE serves as a more reliable metric for assessing the performance of haplotype assembly algorithms when the actual haplotypes are available for comparison purposes.

Figure 6.4(b) shows the reconstruction rates of PolyCluster with and without consideration of the confidence scores. As expected, the reconstruction rate increases as coverage grows. For most coverage levels, confidence scores resulted in a higher reconstruction rate compared to the case where confidence scores were not utilized. Specifically, RR ranged from 72.1% to 88.9% (with a mean of 80.5%) and from 74.3% to 84.9% (with a mean of 78.9%) for the two cases of ‘with confidence scores’ and ‘without confidence scores’, respectively. This suggests that the confidence scores can result in about a 2.0% increase in the reconstruction rate.

As shown in Figure 6.4(c), the amount of normalized MEC (i.e., number of flips per SNP site) ranged from 0.083 for coverage 5X to 0.84 for coverage 30X when confidence scores are used to compute edge weights of the similarity graph. The average amount of MEC was 0.50 with a standard deviation of 0.29 across all coverage levels when confidence scores are utilized. In the absence of confidence scores, MEC ranged from 0.084 to 1.16 for the 5X and 30X coverage levels, respectively. On average, MEC was 0.65 with a standard deviation of 0.41. Overall, PolyCluster performed 23.1% better in terms of MEC when confidence scores were utilized compared to the case without confidence scores.

Finally, Figure 6.4(d) shows weighted MEC as a function of coverage. Weighted MEC applies only to the cases when confidence scores are available. The amount of weighted MEC ranged from 0.06 to 0.63 and the average value across all coverages

was 0.38.
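For reference, a weighted MEC of this general form can be written as below; this is an illustrative formulation consistent with the description above (per-allele weights derived from sequencing confidence scores), not a verbatim restatement of the definition in Section 6.2.2:

\mathrm{wMEC} \;=\; \sum_{i=1}^{m} \sum_{l \in \mathrm{cov}(f_i)} w_{il}\, \delta\!\left(x_{il},\, h_{c(i),l}\right)

Here f_i is the i-th fragment, cov(f_i) the SNP sites it covers, c(i) the haplotype copy (cluster) it is assigned to, δ the mismatch indicator, and w_{il} a weight derived from the confidence score of allele x_{il}; setting w_{il} = 1 recovers the unweighted MEC, and dividing by the number of SNP sites yields the normalized variants reported above.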

Figure 6.5: PolyCluster performance in terms of switching error (a), reconstruction rate (b), and normalized MEC (c) for different ploidy levels and error rates. The results are presented for 5X coverage. [Panels plot SWE, RR (%), and normalized MEC against the error rate ε (5%–30%) for triploid, tetraploid, hexaploid, octoploid, and decaploid data.]

6.4.4 PolyCluster Performance

This section discusses the performance of PolyCluster across three performance measures including switching error, reconstruction rate, and MEC score. We first present results for various ploidy levels and error rates while keeping the coverage level fixed at 5X (Figure 6.5). We then discuss results for different coverage values and error rates for triploid data (Figure 6.6).

Figure 6.5(a) shows the amount of switching error for various ploidy levels as a function of the error rate.

Figure 6.6: PolyCluster performance as a function of coverage for triploid data (i.e., K = 3). The performance is shown in terms of switching error (a), reconstruction rate (b), and normalized MEC (c) for various error rates. [Panels plot SWE, RR (%), and normalized MEC against coverage (5X–30X) for ε = 0.05 through ε = 0.30.]

The amount of switching error was 0.14, 0.16, 0.17, 0.18, and 0.17 for triploid, tetraploid, hexaploid, octoploid, and decaploid data, respectively. The switching error shown in this figure represents the amount of difference between the reconstructed haplotypes and the true haplotypes. The values are normalized by reconstruction rate to provide a unified measure of accuracy and reconstruction rate. As shown in Figure 6.5(a), the switching error generally increases as the ploidy level increases, due to the significant reduction in reconstruction rate. We also observe higher switching errors as the error rate increases. As shown in Figure 6.5(b), the reconstruction rate was 75.0%, 64.6%, 52.6%, 46.8%, and 42.1% for triploid, tetraploid, hexaploid, octoploid, and decaploid data, respectively. Figure 6.5(c) shows the MEC score for

various ploidy levels. The amount of MEC was 0.10, 0.08, 0.07, 0.06, and 0.05 for triploid, tetraploid, hexaploid, octoploid, and decaploid data, respectively (averaged over all error rates). It can be observed that MEC decreases as the ploidy level increases for a fixed coverage. This can be explained by the fact that, with a fixed coverage, a smaller number of reads will reside in each cluster as the ploidy grows. As a result, the amount of conflict (disagreement) between the reads within each cluster and the reconstructed haplotype decreases at higher ploidy. We note, however, that the MEC score alone cannot represent the quality of the reconstructed haplotypes. In particular, as shown in Figure 6.5(b), with a higher ploidy number we achieve much lower reconstruction rates.

In Figure 6.6, we illustrate the performance of PolyCluster on triploid data with different coverage levels and error rates. Figure 6.6(a) shows the amount of switching error for this experiment. Averaged over all error rates, the amount of switching error is 0.11, 0.13, 0.14, 0.15, 0.17, and 0.18 for the 5X–30X coverages. Figure 6.6(b) shows the reconstruction rate for various coverage levels. Averaged over all error rates, the reconstruction rate was 80.5%, 82.3%, 83.3%, 84.9%, 87.0%, and 88.0% for the six coverage values 5X–30X. Finally, Figure 6.6(c) shows the MEC score as a function of coverage for various error rates. The amount of MEC was 0.50, 0.62, 0.71, 0.80, 0.83, and 0.84 for the six coverage levels from 5X to 30X. As expected, the amount of MEC per SNP increases as coverage grows.

6.4.5 Comparative Analysis Approach

We conducted an analysis to compare PolyCluster against four other haplotype assembly methods including HapColor (discussed in Chapter 5), HapTree [BYP14a], fragment partitioning based on KMedoids clustering [PJ09], and Greedy [LSN07]. A brief description of these approaches is as follows.

• HapColor, discussed in Chapter 5, is a fragment partitioning algorithm based

on graph coloring. The main objective of HapColor is to develop a graph partitioning method based on the concept of graph coloring in order to minimize the overall MEC score [MW15].

• HapTree uses a relative likelihood function to measure the concordance between the aligned read data and a given haplotype under a probabilistic model. To identify a haplotype of maximal likelihood, HapTree finds a collection of high-likelihood haplotype partial solutions, which are restricted to the first n0 SNP sites, and extends those to high-likelihood solutions on the first n0 + 1 SNP sites [BYP14a].

• KMedoids, developed here based on the well-known KMedoids clustering, uses the distance measure defined in (6.39) for distance-based clustering.

\Delta(f_i, f_j) = \sum_{l=1}^{n} \delta(x_{il}, x_{jl}) \qquad (6.39)

The algorithm first selects initial medoid fragments at random. It then performs the swap-step of the Partitioning Around Medoids (PAM) algorithm [KR09] to search over all possible swaps between medoids and non-medoids. This process examines whether the sum of medoid-to-cluster-member distances decreases. Conventional clustering algorithms such as K-Means cannot be directly used for fragment partitioning because the centroids computed by such algorithms do not constitute any meaningful data point in the fragment space. A minimal sketch of this baseline appears after this list.

• Greedy, inspired by the greedy diploid genome sequence algorithm [LSN07], is a fragment partitioning method that we devise for polyploid haplotyping. The original algorithm generates haplotypes by greedily allocating the fragments to two disjoint partitions, each of which forms one haplotype copy. The algorithm iteratively selects a read and adds it to one of the partitions. We modified this algorithm to accommodate more than two haplotype copies in our polyploid haplotyping.
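The sketch below illustrates the KMedoids baseline described in this list: the distance of (6.39) between fragments (counting allele disagreements over commonly covered SNP sites) and a PAM-style swap step. Names such as fragment_distance and kmedoids_fragments are illustrative, and this is a simplified version of the procedure rather than the exact implementation evaluated here.

```python
import random

def fragment_distance(fi, fj):
    """Distance of (6.39): number of allele disagreements at SNP sites
    covered by both fragments (each fragment is a dict: SNP index -> allele)."""
    return sum(1 for l in fi.keys() & fj.keys() if fi[l] != fj[l])

def kmedoids_fragments(fragments, K, max_iter=50, seed=0):
    """PAM-style KMedoids over fragments: start from random medoids, then
    greedily accept medoid/non-medoid swaps that lower the total distance."""
    rng = random.Random(seed)
    n = len(fragments)
    D = [[fragment_distance(fragments[i], fragments[j]) for j in range(n)]
         for i in range(n)]
    medoids = rng.sample(range(n), K)

    def total_cost(meds):
        # sum of each fragment's distance to its nearest medoid
        return sum(min(D[i][m] for m in meds) for i in range(n))

    best = total_cost(medoids)
    for _ in range(max_iter):
        improved = False
        for mi in range(K):                       # swap-step of PAM
            for cand in range(n):
                if cand in medoids:
                    continue
                trial = medoids[:mi] + [cand] + medoids[mi + 1:]
                c = total_cost(trial)
                if c < best:
                    medoids, best, improved = trial, c, True
        if not improved:
            break
    # each fragment joins the cluster of its nearest medoid
    clusters = {m: [] for m in medoids}
    for i in range(n):
        clusters[min(medoids, key=lambda m: D[i][m])].append(i)
    return clusters
```

Each resulting cluster can then be consensus-called into one haplotype copy.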

6.4.6 Comparative Analysis Results

These algorithms are compared across performance metrics including switching error, MEC score, and running time. For brevity, a reconstruction rate analysis is not included in this section because the reported switching error already takes into account the reconstruction rate of the algorithm. Therefore, the switching error itself can be interpreted as a combined error and reconstruction rate performance metric.

6.4.6.1 Switching Error (SWE)

Figure 6.7 shows the switching error of the various algorithms for triploid (Figure 6.7(a)), tetraploid (Figure 6.7(b)), and hexaploid (Figure 6.7(c)) data. For brevity, the discussion is limited to these three ploidy levels. The results of our analysis, however, were consistent across the other ploidy levels (i.e., octoploid and decaploid) as well. As shown in Figure 6.7, the amount of switching error increases as the error rate escalates. PolyCluster outperforms all other algorithms in terms of switching error for all ploidy levels. Averaged over all error rates, the amount of switching error on triploid data was 0.16, 0.31, 0.29, 0.43, and 0.45 for PolyCluster, HapColor, HapTree, KMedoids, and Greedy, respectively. This suggests that utilizing PolyCluster reduces the switching error of HapColor, HapTree, KMedoids, and Greedy on triploid data by 48.4%, 44.8%, 62.8%, and 64.4%, respectively.

On tetraploid data, average switching error was 0.17, 0.36, 0.34, 0.50, and 0.52 for PolyCluster, HapColor, HapTree, KMedoids, and Greedy, respectively. This indicates that PolyCluster achieves 52.8%, 50.0%, 66.0%, and 67.3% reduction in switching error of HapColor, HapTree, KMedoids, and Greedy, respectively. On hexaploid data, average switching error was 0.19, 0.40, 0.38, 0.55, and 0.58 for PolyCluster, HapColor, HapTree, KMedoids, and Greedy, respectively, thus suggesting a reduction of 52.5%, 50.0%, 65.5%, and 67.2% in switching error of HapColor, HapTree, KMedoids, and Greedy.

Figure 6.7: Comparison of switching error of PolyCluster with that of HapColor, HapTree, KMedoids, and Greedy on triploid (a), tetraploid (b), and hexaploid (c) data. [Each panel plots SWE against the error rate ε (5%–30%) for the five algorithms.]

Our analysis of switching error in this section suggests that PolyCluster is highly advantageous compared to the other four algorithms in minimizing the switching error of polyploid phasing. Overall, PolyCluster improves the SWE of HapColor by 51.2%, the SWE of HapTree by 48.3%, the SWE of KMedoids by 64.8%, and that of Greedy by 66.34%.

6.4.6.2 Minimum Error Correction (MEC)

For brevity, the analysis is limited to three ploidy levels including triploid, tetraploid, and hexaploid, similar to the analysis of switching error. Prior studies have shown that MEC is not the best metric to assess the quality of polyploid haplotype assembly algorithms [DMH12, BYP14b]. Yet, MEC scores of the various algorithms are included here for comparison purposes. Table 6.1, Table 6.2, and Table 6.3 show the amount of absolute MEC on triploid, tetraploid, and hexaploid data, respectively. The MEC values reported in these tables represent absolute MEC rather than the normalized MEC presented previously. For visualization, the numbers are presented in K-MEC (Kilo-MEC) in these tables.

Table 6.1: Comparison of PolyCluster with HapColor, HapTree, KMedoids, and Greedy in terms of absolute MEC (K-MEC) on triploid data.

ε (%)   PolyCluster   HapColor   HapTree   KMedoids   Greedy
5       424.7         317.7      424.8     506.3      565.5
10      512.3         462.4      623.6     743.1      826.1
15      564.7         530.2      714.9     851.4      931.3
20      595.5         573.7      773.3     920.9      1013.1
25      616.7         626.5      842.3     1008.5     1105.8
30      643.7         675.6      909.8     1162.6     1297.9
Avg.    559.6         531.0      714.8     865.5      956.6

Table 6.2: Comparison of PolyCluster with HapColor, HapTree, KMedoids, and Greedy in terms of absolute MEC (K-MEC) on tetraploid data.

ε (%)   PolyCluster   HapColor   HapTree   KMedoids   Greedy
5       406.3         216.3      308.4     370.8      416.2
10      469.5         318.5      434.7     525.4      577.7
15      475.7         387.4      526.5     628.4      695.1
20      502.5         434.8      589.7     702.8      776.4
25      498.7         466.8      632.4     751.3      825.9
30      501.0         486.7      658.9     782.2      865.5
Avg.    475.6         385.1      525.1     626.8      692.8

Table 6.3: Comparison of PolyCluster with HapColor, HapTree, KMedoids, and Greedy in terms of absolute MEC (K-MEC) on hexaploid data.

ε (%)   PolyCluster   HapColor   HapTree   KMedoids   Greedy
5       402.8         125.8      197.7     244.6      274.7
10      436.3         226.2      311.6     373.6      418.3
15      436.5         309.5      422.7     509.3      562.4
20      436.2         367.3      499.5     595.7      657.7
25      416.3         391.7      532.3     634.2      703.5
30      430.7         421.7      572.3     688.3      755.7
Avg.    426.5         307.0      422.7     507.6      562.1

Based on Table 6.1, PolyCluster performs 21.7% better in terms of MEC compared to HapTree, 35.3% better compared to KMedoids, and achieves a 41.5% reduction in the MEC of Greedy when triploid data are used. Similarly, as shown in Table 6.2, PolyCluster performs better in terms of MEC compared to HapTree (9.4%), KMedoids (24.1%), and Greedy (31.3%) on tetraploid data. Furthermore, as shown in Table 6.3, PolyCluster outperforms KMedoids and Greedy by reducing their MEC scores by 16.0% and 24.1%, respectively. HapTree, however, performs slightly (i.e., 8.9%) better than PolyCluster on hexaploid data.

Our analysis in this section suggests that PolyCluster outperforms both KMedoids and Greedy in terms of MEC scores. Although HapTree performed slightly better than PolyCluster in a few cases, the MEC performance of PolyCluster is less influenced by the amount of sequencing error (i.e., ε). In contrast, the MEC score of HapTree grows significantly as the error rate increases, in particular at higher ploidy, as shown in Table 6.3.

Table 6.4: Running time (minutes) on triploid data with 5X coverage and various error rates.

ε (%)     PolyCluster   HapColor   HapTree
5         6.9           23.5       54.8
10        6.9           32.7       76.0
15        6.7           37.9       82.1
20        6.4           39.0       78.8
25        6.5           38.1       80.6
30        6.8           38.4       81.0
Avg.      6.7           34.9       75.5
Speedup   -             5.2        11.3

6.4.6.3 Running Time

For running time comparison, the focus is on the three leading algorithms, PolyCluster, HapColor, and HapTree, because the other two algorithms (i.e., KMedoids and Greedy) perform very poorly in terms of accuracy (i.e., switching error and MEC). Besides, KMedoids and Greedy are inherently slow. For example, our prior study on diploid haplotyping showed that a greedy fragment partitioning is one order of magnitude slower than FastHap [MW14a].

Table 6.5: Impact of block length on running time (sec.)

Block Length   PolyCluster   HapColor   HapTree
60             0.29          1.69       0.11
120            0.28          2.29       0.39
240            0.39          2.39       3.75
480            0.41          2.28       20.54
960            0.62          2.57       13.65
1920           1.12          2.88       89.87
3840           2.11          3.51       675
7680           3.68          4.52       12720
15360          7.31          6.17       14873
30720          17.28         10.34      83902
61440          36.14         19.97      -
122880         70.18         40.32      -

Table 6.4 shows the running time of the three leading algorithms in reconstructing the entire haplotype associated with the triploid data with 5X coverage. While PolyCluster could complete the phasing process in less than 7 minutes on average, HapColor required 34.9 minutes and HapTree needed 75.5 minutes to complete. Overall, PolyCluster was 5.2 times faster than HapColor and 11.3 times faster than HapTree in this analysis. An interesting observation from this analysis is that PolyCluster's computing complexity is relatively independent of the error rate. In contrast, HapColor operates much slower as the error rate increases. This can be explained by the fact that HapColor first identifies a clustering of the reads that do not conflict; it then merges those clusters (identified by graph vertex colors) to form a final clustering. Thus, higher error rates result in a larger number of initial clusters/colors which need to be merged. Therefore, the increase in running time of HapColor is primarily due to the color/cluster merging time at high error rates. HapTree is generally slow because it examines a collection of SNPs during each phase of the algorithm. It then adds new SNPs to the partially examined haplotypes, which results in a significant increase in the running time of the algorithm.

An important aspect of haplotype assembly is scalability. This is particularly important because, with new sequencing technologies, the block length is continually growing. Therefore, algorithms must scale as the read length (and consequently the block length) increases. In this experiment, we aim to assess the impact of block length on the running time of the assembly algorithms. Table 6.5 shows how each algorithm performs as the block length grows. It can be observed that as the block length increases from 60 SNPs to more than 122K SNPs, the running time of PolyCluster goes from 0.29 seconds to 70.2 seconds. The corresponding running times for HapColor are 1.7 seconds and 40.3 seconds. However, HapTree becomes very slow as the block length increases. The non-reported cells in Table 6.5 (i.e., running time of HapTree for block lengths exceeding 30K SNPs) refer to the cases where the algorithm did not finish within 24 hours. Overall, PolyCluster outperforms HapTree by orders of magnitude on very long blocks. For example, at a block length of 8,000, the running time of HapTree is four orders of magnitude larger than that of PolyCluster. Also, the running times of PolyCluster and HapColor remain comparable as the block length grows.

6.5 Discussion and Conclusion

Organisms with more than two sets of homologous chromosomes are becoming the target of many studies focusing on the genomics of diseases, phylogenetics, and evolution [CN06]. Development of computational algorithms for polyploid haplotyping, however, is a new area of research. PolyCluster is presented here as a clustering-based method for polyploid haplotyping. PolyCluster minimizes the amount of inter-cluster similarity and intra-cluster dissimilarity and therefore results in haplotypes that are highly optimized in terms of switching error. The performance of PolyCluster is compared on polyploid datasets against several phasing methods, in particular two recently developed methods, HapColor and HapTree. PolyCluster improves the switching error of HapColor by 51.2% and that of HapTree by 48.3% on average. It is also demonstrated here that the running time of PolyCluster is several orders of magnitude less than that of HapTree, while it achieves a running time that is comparable to the running time of HapColor. With recent advancements in sequencing technologies, access to long reads of over a few thousand bases is becoming a reality [HRM14]. Our ongoing work involves validation of PolyCluster on such real datasets.

Although the ploidy level in haplotype reconstruction problems is assumed to be known a priori, we think dynamically learning the ploidy level could be an extended solution that would not only help build more accurate haplotype sets but would also help solve other problems in metagenomics, organism detection, etc. Another extension of PolyCluster is to accommodate tumor haplotypes. Since the current version covers SNPs and the results are promising, including small indels would be an easy next step. While the tumor somatic SNV density varies widely among different cancer types, somatic SNVs occur at a rate orders of magnitude lower than that of germline SNPs (Greenman et al. 2009). We can therefore infer tumor haplotypes via somatic mutations that are linked to germline mutations. The clonal frequencies and normal contamination can be inferred using finite mixture models in a Bayesian framework.

CHAPTER 7

Conclusions and Future Directions

Development of accurate and scalable algorithms for DNA phasing remains an important research area in computational biology because, on the one hand, there exists a growing demand to understand associations between genomic structures and phenotypes, and on the other hand, extracting useful information from massive amounts of erroneous DNA sequence data necessitates new computational algorithms. This dissertation lays a foundation for combinatorial approaches to haplotype assembly by developing several computational frameworks for NGS-Phasing in diploid and polyploid organisms.

7.1 Summary of Contributions

The contributions of this dissertation include design, development, and validation of several haplotype assembly frameworks including FastHap, ARHap, HapColor, and PolyCluster, discussed in Chapter 3, Chapter 4, Chapter 5, and Chapter 6, respectively. FastHap, HapColor, and PolyCluster are based on the concept of fragment partitioning while ARHap focuses on inter-SNP association rule learning. FastHap can be viewed as a binary partitioning approach where input fragments are grouped into two disjoint clusters according to a proposed dissimilarity measure. As a result, FastHap aims to address the problem of diploid haplotyping. In contrast, HapColor and PolyCluster are designed for higher ploidy haplotyping. That is, HapColor and PolyCluster focus on partitioning an input fragment set into K disjoint sets each

representing a haplotype copy. Both HapColor and PolyCluster follow a general approach that includes construction of an initial clustering of the fragments followed by a cluster merging procedure. These two frameworks, however, differ in both the distance measures used for graph-based vertex partitioning and the underlying cluster construction technique. Specifically, the major difference between HapColor and PolyCluster is that HapColor starts by constructing a clustering of the fragments such that the obtained clusters do not have any inter-fragment conflicts among them. That is, the initial clustering attained by HapColor achieves an MEC score equal to zero. It then merges similar clusters until the number of remaining clusters reaches the desired value K, the ploidy level. This approach to fragment clustering is essentially consistent with the aim of minimizing the overall MEC score. PolyCluster, however, attempts to consider both similarity and dissimilarity of the fragments (and therefore of the constructed clusters), as a proxy for optimizing switching error, during the clustering process. A more detailed description of each algorithm is given in the following.

FastHap is developed as a highly scalable haplotype assembly and reconstruction method for diploid organisms. A novel dissimilarity metric is introduced to quantify inter-fragment distance based on the contribution of individual fragments in building a final haplotype. The notion of a fuzzy conflict graph is presented to model haplotype reconstruction as a max-cut problem. A fast heuristic fragment partitioning technique is then developed using the proposed graph model. The technique is shown to lower the computing complexity of haplotype reconstruction dramatically, compared to the state-of-the-art algorithms, while moderately outperforming the accuracy of such algorithms. We compare FastHap with two well-known haplotype reconstruction algorithms, namely Levy's greedy algorithm and HapCut. The greedy algorithm is historically known for its high speed while it also outperforms the accuracy of other computationally simple and greedy algorithms such as FastHare [PS04]. HapCut, in contrast, is popular for its accuracy, but demands much higher computational resources compared to the greedy approach. The experiments show that FastHap is one order of magnitude faster than HapCut and is up to 7 times faster than the greedy approach.
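To make the binary-partitioning idea concrete, the following is a generic greedy, max-cut-style bipartitioning sketch over a weighted fragment dissimilarity graph; it is not the FastHap algorithm or its fuzzy conflict graph, only a simplified stand-in illustrating how a two-way cut of the fragments yields the two haplotype groups.

```python
def greedy_bipartition(weights):
    """Greedy max-cut style split of fragments into two groups.

    weights[i][j] is a symmetric dissimilarity between fragments i and j
    (larger values suggest the fragments come from different haplotype copies).
    Each fragment is placed in the group that maximizes its total
    dissimilarity to the opposite group, processed in index order.
    """
    n = len(weights)
    group = [0] * n                               # seed: fragment 0 in group 0
    for i in range(1, n):
        to0 = sum(weights[i][j] for j in range(i) if group[j] == 0)
        to1 = sum(weights[i][j] for j in range(i) if group[j] == 1)
        group[i] = 1 if to0 >= to1 else 0         # place i opposite its heavier conflicts
    return group
```

Each of the two resulting groups can then be consensus-called into one haplotype copy.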

The ARHap framework is developed to address diploid haplotyping by investigat- ing hidden relations among SNP site that are hard to represent using conventional algorithms. Based on the concept of association rule learning, we propose a method to discover correlations among SNP locations in fragment sets. ARHap consists of two main modules or processing phases. In the association rule learning phase, strong patterns representing inter-dependency of alleles at individual SNP sites are discov- ered. In the haplotype reconstruction phase, an approach for utilizing the strong rules produced in the first phase is developed to construct haplotypes at individual SNP sites. The framework begins by generating rules of small size and highly strong and reconstructing haplotypes accordingly. The rule length can increase and/or criteria about strongness of the rule can be revised over time if some SNP sites have re- mained unreconstructed. Extensive experimental analyses demonstrate superiority of ARHap in diploid haplotyping, in particular in achieving significantly better accuracy performance in terms of switching error.

HapColor, based on the concept of graph coloring, is developed for polyploid haplotyping. We present a formal definition of polyploid haplotyping with the objective of minimizing the overall MEC and show that this problem can be modeled as a graph coloring problem. We develop a heuristic approach including a graph coloring method followed by a color-merging technique to accurately partition short reads and reconstruct the haplotype associated with each partition. HapColor is compared with three other haplotype assemblers, namely HapTree, Greedy, and random partitioning algorithms. Our extensive analyses show that HapColor outperforms all these techniques on polyploid data. The amount of reduction in the MEC score obtained using HapColor ranges from 20% to 90%, depending on the ploidy level of the organism and the algorithm used for comparison with HapColor.
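The two-stage structure just described (a conflict-free coloring followed by merging colors down to K) can be illustrated roughly as follows; this is a simplified sketch of that general strategy with assumed helper callables (conflict, similarity), not the exact HapColor coloring or merging rules.

```python
def color_then_merge(fragments, K, conflict, similarity):
    """Greedy conflict-free coloring of fragments, then merging of colors.

    conflict(fi, fj) -> True if the two fragments disagree at a shared SNP.
    similarity(A, B) -> score used to pick which two color classes to merge.
    """
    # Stage 1: greedy coloring so that no color class contains a conflict
    # (hence the initial clustering has an MEC of zero).
    colors = []                                   # list of lists of fragment indices
    for i, fi in enumerate(fragments):
        for cls in colors:
            if all(not conflict(fi, fragments[j]) for j in cls):
                cls.append(i)
                break
        else:
            colors.append([i])
    # Stage 2: repeatedly merge the two most similar color classes
    # until only K classes (haplotype copies) remain.
    while len(colors) > K:
        x, y = max(((p, q) for p in range(len(colors))
                    for q in range(p + 1, len(colors))),
                   key=lambda pair: similarity(colors[pair[0]], colors[pair[1]]))
        colors[x].extend(colors[y])
        del colors[y]
    return colors
```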

We finally present PolyCluster as a clustering-based method for polyploid haplotyping.

PolyCluster minimizes the amount of inter-cluster similarity and intra-cluster dissimilarity and therefore results in haplotypes that are highly optimized in terms of switching error. We show that the problem of minimum fragment-disagreement haplotyping can be modeled as a correlation clustering problem on a weighted graph. PolyCluster devises a combination of linear programming, rounding, region-growing, and cluster merging to group similar reads and reconstruct haplotypes. The performance of PolyCluster is compared against several phasing methods such as HapColor and HapTree. We show that PolyCluster improves the switching error of HapColor by 51.2% and that of HapTree by 48.3% on average. It is also shown here that the running time of PolyCluster is several orders of magnitude less than that of HapTree while it achieves a running time comparable to that of HapColor.

7.2 Challenges and Future Work

With recent advancements in sequencing technologies, access to long reads of over a few thousand bases is becoming a reality [HRM14]. This calls for further investigation of algorithms that address the specific needs of future big sequencing data. Our ongoing work involves validation of our algorithms on large-scale real datasets for various organisms as such datasets become available.

• Haplotype block length: Similar to our approach in this dissertation, many DNA phasing algorithms construct haplotypes in non-overlapping blocks. Each haplotype block is built based on non-overlapping DNA short reads. The blocks are then concatenated to build genome-wide haplotypes. Potential problems with this approach are as follows. First, it is possible that there exist gaps between non-overlapping blocks. Such gaps could occur, for example, due to the lack of coverage at particular SNPs, which leaves those SNP sites uncovered or only marginally covered by the DNA reads. The gaps between neighboring blocks result in discontinuous haplotype blocks; that is, the SNP sites associated with the gaps will remain unreconstructed in the final haplotypes. Second, the process of concatenating haplotype blocks may introduce additional error in the final haplotypes. Although haplotype block length is affected by factors such as error rate, paired/single end alignment, coverage, and the amount of base pair overlap in each pair of fragments, an interesting future direction is to revisit the proposed computational frameworks such that the impact of individual block length on the performance of the final haplotypes is minimized.

• Standard accuracy measure: A challenge in designing and validating DNA phasing algorithms is the lack of a consensus on an accuracy performance metric. Minimum Error Correction (MEC) and Switching Error (SWE) have been widely used in the literature. However, there exist many other measures such as MSR (Minimum SNP Removal) and MFR (Minimum Fragment Removal) [LBI01], respectively referring to removing the minimum number of SNPs and the minimum number of reads from the data to reach a feasible fragment matrix, and LHB (Longest Haplotype Block) [LBI01, WP03], which refers to achieving the longest possible haplotype block. Another metric used in the literature is MWER (Minimum Weight Edge Removal), which aims at removing data (i.e., edges) from a weighted graph representing the fragment matrix [AI12, DV15]. The approaches proposed in [AI12, AI13] attempt to solve an optimization problem that minimizes the MWER objective. Each of the above measures depends either on the input or on characteristics of the developed algorithm. Such disparity in determining the accuracy of haplotype assembly algorithms makes comparison among the algorithms either infeasible or unfair. Therefore, the community may need to standardize validation methodologies for computational haplotype assembly.

• Indels and recombination: Current algorithms do not take into consideration the impacts of indels and recombinations on the sequence data. In fact, these algorithms assume that the input data are perfectly aligned to a reference sequence.

In reality, however, indels and inversions can greatly affect the results because the provided reads will no longer align with the reference sequence, in particular when the amount of recombination/indels is not negligible. Therefore, taking the impacts of indels and recombinations into account during haplotype assembly is another future direction in this research.

• Tumor haplotype assembly: Another extension of this work is to accommodate tumor haplotypes. While the tumor somatic SNV density varies widely from one type of cancer to another, somatic SNVs occur at a rate orders of magnitude lower than that of germline SNPs [GGI09]. Therefore, one can infer tumor haplotypes utilizing somatic mutations that are linked to germline mutations. For example, the clonal frequencies and normal contamination may be inferred using finite mixture models in a Bayesian framework.

• Dynamically learning ploidy level: Another interesting future direction for this research falls at the confluence of haplotype assembly and metagenomics / human microbiome research. Although haplotype reconstruction methods assume that the ploidy level is known a priori, dynamically learning the ploidy level may allow for joint ploidy discovery and haplotype reconstruction, which would also contribute to addressing problems in metagenomics and organism detection, among others.

References

[ABC05] David Altshuler, Lisa D Brooks, Aravinda Chakravarti, Francis S Collins, Mark J Daly, P Donnelly, RA Gibbs, JW Belmont, A Boudreau, SM Leal, et al. “A haplotype map of the human genome.” Nature, 437:1299–1320, 2005.

[ACE11] Can Alkan, Bradley P Coe, and Evan E Eichler. “Genome structural variation discovery and genotyping.” Nature Reviews Genetics, 12(5):363– 376, 2011.

[AHV10] Susanna Atwell, Yu S Huang, Bjarni J Vilhjálmsson, Glenda Willems, Matthew Horton, Yan Li, Dazhe Meng, Alexander Platt, Aaron M Tarone, Tina T Hu, et al. “Genome-wide association study of 107 phenotypes in Arabidopsis thaliana inbred lines.” Nature, 465(7298):627–631, 2010.

[AI12] Derek Aguiar and Sorin Istrail. “HapCompass: a fast cycle basis algorithm for accurate haplotype assembly of sequence data.” Journal of Computational Biology, 19(6):577–590, 2012.

[AI13] Derek Aguiar and Sorin Istrail. “Haplotype assembly in polyploid genomes and identical by descent shared tracts.” Bioinformatics, 29(13):i352–i360, 2013.

[AS94] Rakesh Agrawal, Ramakrishnan Srikant, et al. “Fast algorithms for mining association rules.” In Proc. 20th int. conf. very large data bases, VLDB, volume 1215, pp. 487–499, 1994.

[Aus99] Giorgio Ausiello. Complexity and Approximability Properties: Combinatorial Optimization Problems and Their Approximability Properties. Springer, 1999.

[BB08] Vikas Bansal and Vineet Bafna. “HapCUT: an efficient and accurate algorithm for the haplotype assembly problem.” Bioinformatics, 24(16):i153–i159, 2008.

[BB11] Sharon R Browning and Brian L Browning. “Haplotype phasing: existing methods and new developments.” Nature Reviews Genetics, 12(10):703– 714, 2011.

[BBJ02] Celine Becquet, Sylvain Blachon, Baptiste Jeudy, Jean-François Boulicaut, and Olivier Gandrillon. “Strong-association-rule mining for large-scale gene-expression data analysis: a case study on human SAGE data.” Genome Biology, 3(12):1, 2002.

[BDK15] Paola Bonizzoni, Riccardo Dondi, Gunnar W Klau, Yuri Pirola, Nadia Pisanti, and Simone Zaccaria. “On the fixed parameter tractability and approximability of the minimum error correction problem.” In Annual Symposium on Combinatorial Pattern Matching, pp. 100–113. Springer, 2015.

[Bec05] Hila Becker. “A survey of correlation clustering.” Advanced Topics in Computational Learning Theory, pp. 1–10, 2005.

[BHA08] Vikas Bansal, Aaron L Halpern, Nelson Axelrod, and Vineet Bafna. “An MCMC algorithm for haplotype assembly from whole-genome sequence data.” Genome research, 18(8):1336–1346, 2008.

[BHK04] Andreas Björklund, Thore Husfeldt, and Sanjeev Khanna. “Approximating longest directed paths and cycles.” In International Colloquium on Automata, Languages, and Programming, pp. 222–233. Springer, 2004.

[BIL05] Vineet Bafna, Sorin Istrail, Giuseppe Lancia, and Romeo Rizzi. “Polynomial and APX-hard cases of the individual haplotyping problem.” Theoretical Computer Science, 335(1):109–125, 2005.

[BM12] William S Bush and Jason H Moore. “Genome-wide association studies.” PLoS computational biology, 8(12):e1002822, 2012.

[BMU97] Sergey Brin, Rajeev Motwani, Jeffrey D Ullman, and Shalom Tsur. “Dynamic itemset counting and implication rules for market basket data.” In ACM SIGMOD Record, volume 26, pp. 255–264. ACM, 1997.

[Bre79] Daniel Brélaz. “New Methods to Color the Vertices of a Graph.” Commun. ACM, 22(4):251–256, April 1979.

[BYB15] Emily Berger, Deniz Yorukoglu, and Bonnie Berger. “HapTree-X: An Integrative Bayesian Framework for Haplotype Reconstruction from Transcriptome and Genome Sequencing Data.” In International Conference on Research in Computational Molecular Biology, pp. 28–29. Springer, 2015.

[BYP14a] Emily Berger, Deniz Yorukoglu, Jian Peng, and Bonnie Berger. “HapTree: A Novel Bayesian Framework for Single Individual Polyplotyping Using NGS Data.” In Roded Sharan, editor, Research in Computational Molecular Biology, volume 8394 of Lecture Notes in Computer Science, pp. 18–19. Springer International Publishing, 2014.

[BYP14b] Emily Berger, Deniz Yorukoglu, Jian Peng, and Bonnie Berger. “HapTree: A novel Bayesian framework for single individual polyplotyping using NGS data.” In Research in Computational Molecular Biology, pp. 18–19. Springer, 2014.

[CH03] Chad Creighton and Samir Hanash. “Mining gene expression databases for association rules.” Bioinformatics, 19(1):79–86, 2003.

[CJC06] Donald F Conrad, Mattias Jakobsson, Graham Coop, Xiaoquan Wen, Jeffrey D Wall, Noah A Rosenberg, and Jonathan K Pritchard. “A worldwide survey of haplotype variation and linkage disequilibrium in the human genome.” Nature genetics, 38(11):1251–1260, 2006.

[CN06] Z Jeffrey Chen and Zhongfu Ni. “Mechanisms of genomic rearrangements and gene expression changes in plant polyploids.” Bioessays, 28(3):240– 252, 2006.

[Con12] 1000 Genomes Project Consortium et al. “An integrated map of genetic variation from 1,092 human genomes.” Nature, 491(7422):56–65, 2012.

[CSV16] C. Cai, S. Sanghavi, and H. Vikalo. “Structured Low-Rank Matrix Factorization for Haplotype Assembly.” IEEE Journal of Selected Topics in Signal Processing, 10(4):647–657, June 2016.

[CVK05] Rudi Cilibrasi, Leo Van Iersel, Steven Kelk, and John Tromp. “On the complexity of several haplotyping problems.” In Algorithms in Bioinformatics, pp. 128–139. Springer, 2005.

[CVK07] Rudi Cilibrasi, Leo Van Iersel, Steven Kelk, and John Tromp. “The com- plexity of the single individual SNP haplotyping problem.” Algorithmica, 49(1):13–36, 2007.

[DEF06] Erik D Demaine, Dotan Emanuel, Amos Fiat, and Nicole Immorlica. “Correlation clustering in general weighted graphs.” Theoretical Computer Science, 361(2):172–187, 2006.

[DHM10] Jorge Duitama, Thomas Huebsch, Gayle McEwen, Eun-Kyung Suk, and Margret R Hoehe. “ReFHap: a reliable and fast algorithm for single individual haplotyping.” In Proceedings of the First ACM International Conference on Bioinformatics and Computational Biology, pp. 160–169. ACM, 2010.

[DI03] Erik D Demaine and Nicole Immorlica. “Correlation clustering with partial information.” In Approximation, Randomization, and Combinatorial Optimization: Algorithms and Techniques, pp. 1–13. Springer, 2003.

[DJP92] Elias Dahlhaus, David S Johnson, Christos H Papadimitriou, Paul D Seymour, and Mihalis Yannakakis. “The complexity of multiway cuts.” In Proceedings of the twenty-fourth annual ACM symposium on Theory of computing, pp. 241–251. ACM, 1992.

[DMH12] Jorge Duitama, Gayle K McEwen, Thomas Huebsch, Stefanie Palczewski, Sabrina Schulz, Kevin Verstrepen, Eun-Kyung Suk, and Margret R Hoehe. “Fosmid-based whole genome haplotyping of a HapMap trio child: evaluation of Single Individual Haplotyping techniques.” Nucleic acids research, 40(5):2041–2053, 2012.

[DV15] Shreepriya Das and Haris Vikalo. “SDhaP: haplotype assembly for diploids and polyploids via semi-definite programming.” BMC genomics, 16(1):260, 2015.

[EFG09] John Eid, Adrian Fehr, Jeremy Gray, Khai Luong, John Lyle, Geoff Otto, Paul Peluso, David Rank, Primo Baybayan, Brad Bettman, et al. “Real-time DNA sequencing from single polymerase molecules.” Science, 323(5910):133–138, 2009.

[FHT04] Eibe Frank, Mark Hall, Len Trigg, Geoffrey Holmes, and Ian H Witten. “Data mining in bioinformatics using Weka.” Bioinformatics, 20(15):2479–2481, 2004.

[FWS05] Errol C Friedberg, Graham C Walker, Wolfram Siede, and Richard D Wood. DNA repair and mutagenesis. American Society for Microbiology Press, 2005.

[GBH03] Richard A Gibbs, John W Belmont, Paul Hardenbol, Thomas D Willis, FL Yu, HM Yang, Lan-Yang Ch’ang, Wei Huang, Bin Liu, Yan Shen, et al. “The international HapMap project.” 2003.

[GG06] Ioannis Giotis and Venkatesan Guruswami. “Correlation clustering with a fixed number of clusters.” In Proceedings of the seventeenth annual ACM-SIAM symposium on Discrete algorithm, pp. 1167–1176. Society for Industrial and Applied Mathematics, 2006.

[GGI09] Antonia G´alvez, John Greenman, and Ioannis Ieropoulos. “Landfill leachate treatment with microbial fuel cells; scale-up through plurality.” Bioresource technology, 100(21):5085–5091, 2009.

[GHL04] Harvey J Greenberg, William E Hart, and Giuseppe Lancia. “Opportunities for combinatorial optimization in computational biology.” INFORMS Journal on Computing, 16(3):211–231, 2004.

[GJ90] Michael R. Garey and David S. Johnson. Computers and Intractability; A Guide to the Theory of NP-Completeness. W. H. Freeman & Co., New York, NY, USA, 1990.

[Gla10] University of Glasgow. “Why Study Polyploids @ONLINE.”, September 2010.

[GMM16] Shilpa Garg, Marcel Martin, and Tobias Marschall. “Read-based phasing of related individuals.” Bioinformatics, 32(12):i234–i242, 2016.

[HCP10] Dan He, Arthur Choi, Knot Pipatsrisawat, Adnan Darwiche, and Eleazar Eskin. “Optimal algorithms for haplotype assembly from whole-genome sequence data.” Bioinformatics, 26(12):i183–i190, 2010.

[HD05] Joel N Hirschhorn and Mark J Daly. “Genome-wide association studies for common diseases and complex traits.” Nature Reviews Genetics, 6(2):95–108, 2005.

[HGN00] Jochen Hipp, Ulrich Güntzer, and Gholamreza Nakhaeizadeh. “Algorithms for association rule mining – a general survey and comparison.” ACM SIGKDD Explorations Newsletter, 2(1):58–64, 2000.

[HLM11] Weichun Huang, Leping Li, Jason R Myers, and Gabor T Marth. “ART: a next-generation sequencing read simulator.” Bioinformatics, 28(4):593–594, 2011.

[HPY00] Jiawei Han, Jian Pei, and Yiwen Yin. “Mining Frequent Patterns Without Candidate Generation.” SIGMOD Rec., 29(2):1–12, May 2000.

[HRM14] John Huddleston, Swati Ranade, Maika Malig, Francesca Antonacci, Mark Chaisson, Lawrence Hon, Peter H Sudmant, Tina A Graves, Can Alkan, Megan Y Dennis, et al. “Reconstructing complex regions of genomes using long-read sequencing technology.” Genome research, pp. gr–168450, 2014.

[JT11] Tommy R Jensen and Bjarne Toft. Graph coloring problems, volume 39. John Wiley & Sons, 2011.

[KK06] Sotiris Kotsiantis and Dimitris Kanellopoulos. “Association rules mining: A recent overview.” GESTS International Transactions on Computer Science and Engineering, 32(1):71–82, 2006.

[KK16] Manpreet Kaur and Shivani Kang. “Market Basket Analysis: Identify the Changing Trends of Market Data Using Association Rule Mining.” Procedia Computer Science, 85:78–85, 2016.

[Klo02] Walter Klotz. “Graph coloring algorithms.” Mathematics Report, pp. 1–9, 2002.

[KMR97] David Karger, Rajeev Motwani, and GDS Ramkumar. “On approximating the longest path in a graph.” Algorithmica, 18(1):82–98, 1997.

[KR09] Leonard Kaufman and Peter J Rousseeuw. Finding groups in data: an introduction to cluster analysis, volume 344. John Wiley & Sons, 2009.

[KRM13] Bernard Kamsu-Foguem, Fabien Rigal, and Félix Mauget. “Mining association rules for the quality improvement of the production process.” Expert systems with applications, 40(4):1034–1045, 2013.

[KSB16] Volodymyr Kuleshov, Michael P Snyder, and Serafim Batzoglou. “Genome assembly from synthetic long read clouds.” Bioinformatics, 32(12):i216–i224, 2016.

[Kul14] Volodymyr Kuleshov. “Probabilistic single-individual haplotyping.” Bioinformatics, 30(17):i379–i385, 2014.

[Lan13] Giuseppe Lancia. “Combinatorial Haplotyping Problems.” Pattern Recognition in Computational Molecular Biology: Techniques and Approaches, pp. 1–27, 2013.

[LBI01] Giuseppe Lancia, Vineet Bafna, Sorin Istrail, Ross Lippert, and Russell Schwartz. “SNPs problems, complexity, and algorithms.” In Algorithms – ESA 2001, pp. 182–193. Springer, 2001.

[LD10] Heng Li and Richard Durbin. “Fast and accurate long-read alignment with Burrows–Wheeler transform.” Bioinformatics, 26(5):589–595, 2010.

[LS12] Ben Langmead and Steven L Salzberg. “Fast gapped-read alignment with Bowtie 2.” Nature methods, 9(4):357–359, 2012.

[LSL02] Ross Lippert, Russell Schwartz, Giuseppe Lancia, and Sorin Istrail. “Algorithmic strategies for the single nucleotide polymorphism haplotype assembly problem.” Briefings in bioinformatics, 3(1):23–31, 2002.

[LSM99] Wenke Lee, Salvatore J Stolfo, and Kui W Mok. “A data mining framework for building intrusion detection models.” In Security and Privacy, 1999. Proceedings of the 1999 IEEE Symposium on, pp. 120–132. IEEE, 1999.

[LSN07] Samuel Levy, Granger Sutton, Pauline C Ng, Lars Feuk, Aaron L Halpern, Brian P Walenz, Nelson Axelrod, Jiaqi Huang, Ewen F Kirkness, Gennady Denisov, et al. “The diploid genome sequence of an individual human.” PLoS biology, 5(10):e254, 2007.

[Man85] Laura Manuelidis. “Individual interphase chromosome domains revealed by in situ hybridization.” Human genetics, 71(4):288–293, 1985.

[MFM17] Ehsan Motazedi, Richard Finkers, Chris Maliepaard, and Dick de Ridder. “Exploiting next-generation sequencing to solve the haplotyping puzzle in polyploids: a simulation study.” Briefings in Bioinformatics, p. bbw126, 2017.

[MP05] Axel Meyer and Yves Van de Peer. “From 2R to 3R: evidence for a fish-specific genome duplication (FSGD).” Bioessays, 27(9):937–945, 2005.

[MW14a] Sepideh Mazrouee and Wei Wang. “FastHap: fast and accurate single individual haplotype reconstruction using fuzzy conflict graphs.” Bioinformatics, 30(17):i371–i378, 2014.

[MW14b] Sepideh Mazrouee and Wei Wang. “Individual Haplotyping Prediction Agreements.” In Proceedings of the 5th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics, BCB ’14, pp. 615–616, New York, NY, USA, 2014. ACM.

[MW15] Sepideh Mazrouee and Wei Wang. “HapColor: A graph coloring framework for polyploidy phasing.” In 2015 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2015, Washington, DC, USA, November 9-12, 2015, pp. 105–108, 2015.

[NGB08] Jost Neigenfind, Gabor Gyetvai, Rico Basekow, Svenja Diehl, Ute Achenbach, Christiane Gebhardt, Joachim Selbig, and Birgit Kersten. “Haplotype inference from unphased SNP data in heterozygous polyploids based on SAT.” BMC genomics, 9(1):356, 2008.

[OAH12] Yukiteru Ono, Kiyoshi Asai, and Michiaki Hamada. “PBSIM: PacBio reads simulator – toward accurate genome assembly.” Bioinformatics, 29(1):119–121, 2012.

[Ols17] David L Olson. “Association Rules.” In Descriptive Data Mining, pp. 61–69. Springer, 2017.

[PJ09] Hae-Sang Park and Chi-Hyuck Jun. “A simple and fast algorithm for K-medoids clustering.” Expert systems with applications, 36(2):3336–3341, 2009.

[PMP14] Murray Patterson, Tobias Marschall, Nadia Pisanti, Leo van Iersel, Leen Stougie, Gunnar W Klau, and Alexander Schönhuth. “WhatsHap: Haplotype assembly for future-generation sequencing reads.” In International Conference on Research in Computational Molecular Biology, pp. 237–249. Springer, 2014.

[PMP15] Murray Patterson, Tobias Marschall, Nadia Pisanti, Leo Van Iersel, Leen Stougie, Gunnar W Klau, and Alexander Schönhuth. “WhatsHap: Weighted haplotype assembly for future-generation sequencing reads.” Journal of Computational Biology, 22(6):498–509, 2015.

[PRG09] Sung Hee Park, José A Reyes, David R Gilbert, Ji Woong Kim, and Sangsoo Kim. “Prediction of protein-protein interaction types using association rule based classification.” BMC bioinformatics, 10(1):1, 2009.

[PS04] Alessandro Panconesi and Mauro Sozio. “Fast hare: A fast heuristic for single individual SNP haplotype reconstruction.” In Algorithms in Bioinformatics, pp. 266–277. Springer, 2004.

[Rei09] Jorge S Reis-Filho. “Next-generation sequencing.” Breast Cancer Research, 11(3):S12, 2009.

[RS98] Justin Ramsey and Douglas W Schemske. “Pathways, mechanisms, and rates of polyploid formation in flowering plants.” Annual Review of Ecology and Systematics, pp. 467–501, 1998.

[SAK15] Matthew W Snyder, Andrew Adey, Jacob O Kitzman, and Jay Shendure. “Haplotype-resolved genome sequencing: experimental methods and applications.” Nature Reviews Genetics, 16(6):344–358, 2015.

[SCD00] Jaideep Srivastava, Robert Cooley, Mukund Deshpande, and Pang-Ning Tan. “Web usage mining: Discovery and applications of usage patterns from web data.” Acm Sigkdd Explorations Newsletter, 1(2):12–23, 2000.

[SG74] Sartaj Sahni and Teofilo Gonzales. “P-complete Problems and Approximate Solutions.” In Proceedings of the 15th Annual Symposium on Switching and Automata Theory (Swat 1974), SWAT ’74, pp. 28–32, Washington, DC, USA, 1974. IEEE Computer Society.

[SMS02] Lincoln D Stein, Christopher Mungall, ShengQiang Shu, Michael Caudy, Marco Mangone, Allen Day, Elizabeth Nickerson, Jason E Stajich, Todd W Harris, Adrian Arva, et al. “The generic genome browser: a building block for a model organism system database.” Genome research, 12(10):1599– 1610, 2002.

[SS06] Paul Scheet and Matthew Stephens. “A fast and flexible statistical model for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase.” The American Journal of Human Genetics, 78(4):629–644, 2006.

[SSD01] Matthew Stephens, Nicholas J Smith, and Peter Donnelly. “A new statistical method for haplotype reconstruction from population data.” The American Journal of Human Genetics, 68(4):978–989, 2001.

[SST04] Ron Shamir, Roded Sharan, and Dekel Tsur. “Cluster Graph Modification Problems.” Discrete Appl. Math., 144(1-2):173–182, November 2004.

[TBT11] Ryan Tewhey, Vikas Bansal, Ali Torkamani, Eric J Topol, and Nicholas J Schork. “The importance of phase information for human genomics.” Nature Reviews Genetics, 12(3):215–223, 2011.

[Vaz13] Vijay V Vazirani. Approximation algorithms. Springer Science & Business Media, 2013.

[Ven14] VenterInst. “Diploid Human Genome Project Website, J. Craig Venter Institute.” http://www.jcvi.org/cms/research/projects/huref/overview, 2014.

[WHH04] Jeffrey Watson, Franklin A Hays, and P Shing Ho. “Definitions and analysis of DNA Holliday junction geometry.” Nucleic acids research, 32(10):3017–3027, 2004.

[WP03] Jeffrey D Wall and Jonathan K Pritchard. “Haplotype blocks and linkage disequilibrium in the human genome.” Nature Reviews Genetics, 4(8):587– 597, 2003.

[WTB09] Troy E Wood, Naoki Takebayashi, Michael S Barker, Itay Mayrose, Philip B Greenspoon, and Loren H Rieseberg. “The frequency of polyploid speciation in vascular plants.” Proceedings of the National Academy of Sciences, 106(33):13875–13879, 2009.

[WWL05] Rui-Sheng Wang, Ling-Yun Wu, Zhen-Ping Li, and Xiang-Sun Zhang. “Haplotype reconstruction from SNP fragments by minimum error correction.” Bioinformatics, 21(10):2456–2462, 2005.

[Zak00] Mohammed Javeed Zaki. “Scalable algorithms for association mining.” IEEE Transactions on Knowledge and Data Engineering, 12(3):372–390, 2000.
