University of California Los Angeles

Combinatorial Algorithms for Haplotype Assembly

A dissertation submitted in partial satisfaction of the requirements for the degree Doctor of Philosophy in Computer Science

by

Sepideh Mazrouee

2017

© Copyright by
Sepideh Mazrouee
2017

Abstract of the Dissertation

Combinatorial Algorithms for Haplotype Assembly

by

Sepideh Mazrouee
Doctor of Philosophy in Computer Science
University of California, Los Angeles, 2017
Professor Wei Wang, Chair

Many phenotypes such as genetic disorders may be hereditary, while others may be influenced by the environment. However, some genetic disorders are due to new mutations in the individual's Deoxyribonucleic Acid (DNA). Diseases such as diabetes and specific types of cancer are examples of conditions that can be inherited or affected by lifestyle or genetic mutations. In order to investigate and predict the incidence of such diseases, the sequences of single individuals need to be examined. In the past decade, Next Generation Sequencing (NGS) technology has enabled us to generate DNA sequences of many organisms. Yet, reconstructing each copy of a chromosome remains an open research problem due to the computational challenges associated with processing massive amounts of DNA data and understanding the complex structure of such data for individual DNA phasing.

In this dissertation, I introduce several computational frameworks for understanding the complex structure of DNA sequence data to reconstruct chromosome copies in diploid and polyploid organisms. The methodologies presented in this dissertation span several areas of research including unsupervised learning, combinatorial optimization, graph partitioning, and association rule learning. The overarching theme of this research is the design and validation of novel combinatorial algorithms for fast and accurate haplotype assembly. The first two frameworks presented in this dissertation, called FastHap and ARHap, are tailored toward providing computationally simple diploid haplotyping with the objective of minimizing minimum error correction and switching error, respectively. I then introduce HapColor and PolyCluster, which aim to improve minimum error correction and switching error for polyploid haplotyping.

The dissertation of Sepideh Mazrouee is approved.

Eleazar Eskin

Jason Ernst

Jingyi Jessica Li

Wei Wang, Committee Chair

University of California, Los Angeles

2017

To my husband, Hassan, who has been the origin of my strength and passion and will always be. My deepest gratitude goes to my parents, Leila and Faryab, and my sister, Mahshid, for their love and support, and to my daughter, Lily, for all the hard times I put her through for so many years during my Ph.D. studies. I would like to express my sincere gratitude to my academic adviser, Professor Wei Wang, for her invaluable guidance and continuous support throughout my years at UCLA. I would also like to express my gratitude to Professor Eleazar Eskin, who has been a strong motivator and advocate for my research throughout my matriculation through the Ph.D. program. Their persistent pursuit of perfection and deep insight into various subjects have always been an inspiration to me. I also thank my committee members, Professor Jason Ernst and Professor Jessica Li, for kindly agreeing to be on my doctoral committee and for their helpful advice and suggestions. It is truly a privilege to be under the academic lineage of the foremost internet pioneers at UCLA.

Table of Contents

1 Introduction ...... 1

2 Haplotype Assembly in the Literature ...... 10

2.1 Approaches ...... 12

2.2 Evaluation Methods ...... 13

2.3 Ploidy Level ...... 14

3 Fast and Accurate Diploid Haplotyping ...... 16

3.1 Introduction ...... 16

3.1.1 Motivation ...... 16

3.1.2 Contributions and Summary of Results ...... 18

3.2 FastHap Framework ...... 20

3.2.1 Inter-Fragment Distance ...... 21

3.2.2 FastHap Graph Model ...... 23

3.2.3 Fragment Partitioning ...... 25

3.2.4 Refinement Phase ...... 27

3.2.5 Fragment Purging ...... 28

3.3 Validation ...... 28

3.3.1 Dataset ...... 29

3.3.2 Results ...... 31

3.4 Discussion and Conclusion ...... 34

4 Association Rule Learning for Diploid Haplotyping ...... 36

4.1 Introduction ...... 36

4.1.1 Contributions and Summary of Results ...... 38

4.2 Association Rule Learning ...... 42

4.2.1 Matrix Binarization ...... 43

4.2.2 SNP Association Rules ...... 46

4.2.3 Measures of Rule Interestingness ...... 47

4.2.4 Rule Generation Criteria ...... 49

4.2.4.1 Minimum Support Criterion ...... 50

4.2.4.2 Minimum Confidence Criterion ...... 53

4.2.4.3 Consequent Length Criterion ...... 55

4.3 Haplotype Reconstruction ...... 56

4.3.1 Overview of the Algorithm ...... 59

4.3.2 An Illustrative Example ...... 60

4.3.3 Dependency Graph ...... 61

4.3.4 Longest Attribute-Consistent Path ...... 65

4.4 Validation ...... 67

4.4.1 Dataset Preparation and Statistics ...... 67

4.4.2 Results on Simulated Data ...... 70

4.4.3 Results on HuRef Data ...... 74

4.5 Discussion ...... 74

5 Graph Coloring for Polyploid Haplotyping ...... 77

5.1 Introduction ...... 77

5.1.1 Contributions and Summary of Results ...... 78

5.2 Problem Statement ...... 79

5.2.1 Problem Formulation ...... 80

5.2.2 Graph Modeling ...... 81

5.2.3 Problem Complexity ...... 83

5.3 HapColor Algorithm ...... 87

5.4 Validation ...... 89

5.4.1 Polyploidy Datasets ...... 89

5.4.2 Comparative Analysis Approach ...... 90

5.4.3 Preparation of Simulated Data ...... 91

5.4.4 HapColor Performance on Simulated Data ...... 92

5.4.5 Comparative Analysis on Simulated Data ...... 94

5.4.6 Comparative Analysis on HuRef Data ...... 94

5.5 Discussion ...... 95

5.6 Discussion and Conclusion ...... 97

6 Correlation Clustering for Polyploid Phasing ...... 99

6.1 Introduction ...... 99

6.1.1 Contributions and Summary of Results ...... 100

6.2 Problem Statement ...... 101

6.2.1 Problem Definition ...... 104

6.2.2 Graph Modeling ...... 105

6.2.3 Problem Formulation ...... 107

6.3 PolyCluster Algorithm ...... 109

6.3.1 Initial Clustering ...... 112

6.3.2 Cluster Merging ...... 114

6.3.3 Algorithm Analysis ...... 115

6.4 Validation ...... 118

6.4.1 Dataset Preparation ...... 118

6.4.2 Dataset Statistics ...... 120

6.4.3 Impact of Sequencing Confidence Scores ...... 122

6.4.4 PolyCluster Performance ...... 124

6.4.5 Comparative Analysis Approach ...... 126

6.4.6 Comparative Analysis Results ...... 128

6.4.6.1 Switching Error (SWE) ...... 128

6.4.6.2 Minimum Error Correction (MEC) ...... 129

6.4.6.3 Running Time ...... 131

6.5 Discussion and Conclusion ...... 133

7 Conclusions and Future Directions ...... 135

7.1 Summary of Contributions ...... 135

7.2 Challenges and Future Work ...... 138

References ...... 141

List of Figures

1.1 An illustration of the diploid haplotyping process. Ten short reads, denoted by f1 to f10 in (b), are generated from the two chromosome copies shown in (a). Only SNP sites, denoted by S1 to S8, are used for haplotype assembly. ...... 3

3.1 An example of a fragment matrix with 8 SNP sites (a), the corresponding distance matrix (b), the fuzzy conflict graph associated with the fragment matrix (c), and the results of applying FastHap on the data (d). The graph in (c) shows only edges with non-pivot distances. ...... 23

3.2 Coverage of the HuRef dataset. (a): Coverage for each chromosome; numbers vary from 6.49 to 8.72 for various chromosomes with an average genome-wide coverage of 7.43. (b): Histogram of coverage for chromosome 20 as an example; the y-axis shows the number of SNPs with each specific coverage shown on the x-axis. ...... 30

3.3 Chromosome-wide haplotype length for each chromosome (a) and histogram of per-block haplotype length for chromosomes 8, 17, and 18 as examples of chromosomes with ‘small’, ‘medium’, and ‘large’ blocks respectively. ...... 31

3.4 Effect of error rate and coverage on performance of FastHap, Greedy, and HapCut. The analysis was performed on chromosome 20 (randomly selected) of the HuRef dataset. MEC of the three algorithms under comparison as a function of error rate (a); execution time of the algorithms as a function of coverage (b). ...... 32

3.5 Speed performance of FastHap, Greedy, and HapCut as a function of haplotype length. Analysis was performed on chromosome 20 (randomly selected) of the HuRef dataset. Execution time as a function of haplotype length (a); amount of speedup achieved by FastHap compared to Greedy and HapCut (b). ...... 33

4.1 Association rule haplotyping (ARHap) framework. Each round of association rule haplotyping is composed of two phases including an association rule learning phase and a haplotype reconstruction phase. This two-phase process may continue for multiple rounds until all SNP positions on the haplotype set are reconstructed. ...... 41

4.2 An example of matrix binarization. The fragment matrix X containing 9 fragments drawn from haplotype set H = {h0, h1}, where h0 = ‘11111’ and h1 = ‘00000’, is shown in (a). X is decomposed into two matrices X0 and X1 as shown in (b). The binary fragment matrix Y is a column-wise concatenation of X0 and X1. ...... 45

4.3 Association rule based haplotype reconstruction for the dataset in Figure 4.2. The dependency graph is constructed based on the strong rules in Table 4.2 (a); longest paths identified in each iteration of the algorithm (b); and evolution of haplotypes as each rule is tested against the haplotype set (c). It is assumed that the haplotypes are initialized to h0 = ‘10101’ and h1 = ‘01010’. ...... 62

4.4 An example of a dependency graph generated from rules with multiple attributes in their antecedent snpset. The fragment matrix shown in (a) does not produce any rules of length 2 because no rule of the form ti → tj meets the confidence criterion of Confhap > 0.5; set of strong rules with 2 attributes in their antecedent field (b); corresponding dependency graph (c), a longest path (d), and changes in the initial haplotypes after applying rules on the longest path (e). ...... 64

4.5 An example with inconsistent attributes on the ‘longest path’. The fragment matrix in (a) results in 6 rules of size 2 shown in (b). The longest path on the dependency graph in (c) contains conflicting attributes. Choosing the longest attribute-consistent path results in updating the haplotypes as shown in (d). ...... 66

4.6 Statistics about simulated polyploidy datasets. Distribution of short reads in datasets based on coverage values (a); number of non-overlapping blocks for each ploidy level and coverage number (b); histogram of length of non-overlapping blocks (c); and average length of non-overlapping blocks for each ploidy level and coverage value (d). ...... 69

4.7 Performance of different algorithms on simulated data: Switching error as a function of coverage for a fixed error rate of ε = 5% (a); switching error versus error rate for a dataset with a fixed coverage of 5X (b); normalized MEC versus coverage for a dataset with a fixed error rate of ε = 5% (c); and normalized MEC as a function of error rate for 5X coverage (d). ...... 71

5.1 An example of a weighted fragment conflict graph (WFCG)...... 83

5.2 Vertex-Coloring (left) and Color-Merging (right) applied on the WFCG shown in Figure 5.1. ...... 84

5.3 HapColor performance in terms of normalized MEC (a) and reconstruction rate (b) as a function of error rate for various polyploidy data; normalized MEC as a function of error rate for various coverage levels (c); and evolution of the algorithm during color merging (d). ...... 92

6.1 An example of a fragment matrix with 8 short reads and 6 SNP sites (a), and the corresponding similarity graph (b). Edge labels represent weights (wij) multiplied by ten (10X) for better visualization. That is, a label ‘−3’ on edge e58 represents w58 = −0.3. ...... 107

6.2 Evolution of Algorithm 9 for the similarity graph shown in Figure 6.1. The edge weights are multiplied by 10 for visualization. With the final clustering in (f), the amount of overall MEC is 2 with the reconstructed haplotypes H = {111111, 000000, 111100} ...... 111

6.3 Statistics about the generated datasets. Distribution of short reads in datasets based on coverage values (a); histogram of length of non-overlapping blocks (b); number of non-overlapping blocks for each ploidy level and coverage number (c); average length of non-overlapping blocks for each ploidy level and coverage value (d). ...... 119

6.4 Impact of sequencing confidence scores on the quality of reconstructed haplotypes in terms of switching error (a), reconstruction rate (b), normalized MEC (c), and weighted MEC (d) for triploid data with error rate ε = 5%. ...... 122

6.5 PolyCluster performance in terms of switching error (a), reconstruction rate (b), and normalized MEC (c) for different polyploidy and error rates. The results are presented for 5X coverage. ...... 124

6.6 PolyCluster performance as a function of coverage for triploid data (i.e., K=3). The performance is shown in terms of switching error (a), reconstruction rate (b), and normalized MEC (c) for various error rates. ...... 125

6.7 Comparison of switching error of PolyCluster with that of HapColor, HapTree, KMedoids, and Greedy on triploid (a), tetraploid (b), and hexaploid (c) data...... 129

List of Tables

3.1 Comparison of FastHap with Greedy and HapCut in terms of accuracy (MEC) and execution time using the HuRef dataset. FastHap achieves speedups of 16.4 and 15.1 compared to HapCut and Greedy, respectively, and is 1.9% and 35.4% more accurate than HapCut and Greedy, respectively. Statistics on coverage and haplotype length are shown in Figure 3.2 and Figure 3.3 and further discussed in Section 3.3.1. ...... 35

4.1 Performance of ARHap on SWE ...... 40

4.2 Strong association rules of size l = 2 for the fragment matrix in Figure 4.2. 45

4.3 Running time (minutes) of ARHap, FastHap, HapCut, and Greedy on simulated data with fixed error rate ε = 5%. ...... 73

4.4 Running time (minutes) of ARHap, FastHap, HapCut, and Greedy on simulated data with fixed coverage C = 5X...... 73

4.5 Overall MEC score of ARHap, FastHap, HapCut, and Greedy on HuRef data. ...... 75

5.1 Reduction (%) in MEC achieved by HapColor ...... 79

5.2 MEC comparison of HapColor with other algorithms on simulated polyploidy data...... 94

5.3 MEC comparison of HapColor, HapTree, Greedy, and RFP on HuRef based data...... 98

6.1 Comparison of PolyCluster with HapColor, HapTree, KMedoids, and Greedy in terms of absolute MEC on triploid data...... 130

6.2 Comparison of PolyCluster with HapColor, HapTree, KMedoids, and Greedy in terms of absolute MEC on tetraploid data. ...... 130

6.3 Comparison of PolyCluster with HapColor, HapTree, KMedoids, and Greedy in terms of absolute MEC on hexaploid data...... 130

6.4 Running time (minutes) on triploid data with 5X coverage and various error rates...... 131

6.5 Impact of block length on running time (sec.) ...... 132

Acknowledgments

I would like to express my sincere gratitude to my advisor, Professor Wei Wang, for her invaluable guidance and continuous support throughout my years at UCLA. She has been a strong motivator and advocate for my research throughout my matriculation through the Ph.D. program. I would like to express my sincere appreciation to other members of my thesis committee, Professor Eleazar Eskin, Professor Jason Ernst, and Professor Jessica Li, for kindly agreeing to be on my doctoral committee and for their constructive comments. I would also like to thank Professor Christopher Lee for the great opportunity of collaboration.

I would like to thank current and former members of the Intelligent Data Exploration and Analysis Laboratory (IDEAL) and UCLA-ZAR Lab for their instrumental discussions and technical interactions. I am very proud of making wonderful friends at UCLA. Far too many friends to mention individually have shared great moments with me. I also would like to personally thank Professor Richard Korf and Professor David Smallberg, who provided me the best training on teaching, and Professor Christopher Lee, from whom I learned how to turn teaching into deep conceptual research that keeps me and my students happier. I gratefully thank all of them.

My deepest gratitude goes to my parents and my sister for their love, their support, and for their sacrifice over so many years. They have been the origin of my strength and will always be. My most heartfelt acknowledgment must go to my daughter, Lily and my husband, Hassan, not only for their constant encouragement, but also for their patience and understanding throughout my research. To them both, I owe an immeasurable debt and deep affection. Hassan and Lily have my everlasting love for all that, and for being everything I am not.

Vita

Dec., 2001 B.S. Computer Science, Azad University, Sari, Iran

Dec., 2009 M.Sc. Management Information Systems, University of Texas at Dallas, TX

2015 - 2017 Adjunct Faculty, School of Electrical Engineering and Computer Science, Washington State University, WA

2015 - 2016 Research Consultant, Institute for Quantitative and Computational Biol- ogy, University of California Los Angeles, CA

Sept., 2017 Ph.D. expected, Computer Science, University of California Los Angeles, Los Angeles, CA

Publications

Christopher Lee, Brit Toven-Lindsey, Casey Shapiro, Michael Soh, Sepideh Mazrouee, Marc Levis-Fitzgerald, and Erin R. Sanders, “Error Discovery Learning Boosts Student Engagement, Learning Outcomes, and Retention in a Computer Science Course”, CBE - Life Science Education Journal, 2017.

Hezarjaribi, N., Mazrouee, S., Ghasemzadeh, H. (2017). Speech2Health: A Mobile Framework for Monitoring Dietary Composition from Spoken Data. IEEE Journal of Biomedical and Health Informatics.

Mazrouee, S., Wang, W. (2015, November). HapColor: A graph coloring framework for polyploidy phasing. In Bioinformatics and Biomedicine (BIBM), 2015 IEEE In- ternational Conference on (pp. 105-108). IEEE.

Mazrouee, S. (2015, November). PACH: Ploidy-AgnostiC Haplotyping. In Bioin- formatics and Biomedicine (BIBM), 2015 IEEE International Conference on (pp. 1786-1788). IEEE.

Mazrouee, S., Wang, W. (2014, September). Individual haplotyping prediction agree- ments. In Proceedings of the 5th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics (pp. 615-616). ACM.

Mazrouee, S., Wang, W. (2014). FastHap: fast and accurate single individual haplo- type reconstruction using fuzzy conflict graphs. Bioinformatics, 30(17), i371-i378.

Ghasemzadeh, H., Mazrouee, S., Kakoee, M. R. (2006, March). Modified pseudo LRU replacement algorithm. In Engineering of Computer Based Systems, 2006. ECBS 2006. 13th Annual IEEE International Symposium and Workshop on (pp. 6-pp). IEEE.

Ghasemzadeh, H., Mazrouee, S., Moghaddam, H. G., Shojaei, H., Kakoee, M. R. (2006). Hardware implementation of stack-based replacement algorithms. In Pro- ceedings of world academy of science, engineering and technology (Vol. 16).

CHAPTER 1

Introduction

Deoxyribonucleic Acid (DNA) is the main component of chromosomes and the hereditary material in all life forms [WHH04,FWS05]. The information available for building and maintaining an organism is stored in the DNA sequence [SMS02], which consists of two nucleotide strands coiled around each other. Decoding the genetic information of DNA has been an active area of research over the past several decades. Inheritance is the origin of many phenotypes including the incidence of diseases such as cancer and metabolic diseases. Mendelian inheritance explains that alterations in the genome sequence can be passed on to offspring and also propagate to synthesized proteins. The spectrum of DNA sequence variation ranges from single nucleotide polymorphisms (SNPs) to more complex structural variation (SV) [ACE11,Con12] such as deletions, insertions, or translocations of the genomic material. Approximately 99.5% of any two individuals’ genome sequences are shared within a population [LSN07]. The remaining 0.5% of the nucleotide bases, which vary within a population, explains, in part, the differences among individuals [AI13]. Variations in the DNA sequence at particular locations (i.e., SNP sites), which are the most common source of variants, are commonly studied in Genome Wide Association Studies (GWAS) [HD05,BM12].

Understanding associations between genomic structures and phenotypes is enabled by advancements in two emerging areas, namely genome sequencing and molecular biology [HD05,AHV10]. On one hand, improvements in the quality and speed of genome sequencing have led to the generation of data that can be used to extract the exact structure of chromosome copies in various organisms. On the other hand, advances in molecular biology facilitate establishment of the associations between genomic structures and identified phenotypes.

While new generations of sequencing technologies are poised to deliver consistently more accurate and affordable DNA sequence data, the technology has not yet advanced to the state where it can provide fully sequenced chromosome copies. In other words, while sequencing technologies provide genotypes, they are incapable of extracting the haplotype structure of the sequenced organism. Therefore, we are provided with a collection of DNA reads generated from individual organisms. These reads need to be assembled together to reconstruct the whole genome structure.

A haplotype is formally defined as a set of variant alleles on one chromosome that is passed on through generations as a unit. Haplotype reconstruction is ultimately used to reconstruct the entire chromosome of the organism. Figure 1.1 shows an example of the haplotype assembly process for diploid organisms, which maintain two chromosome copies. As shown in this figure, ten short reads, denoted by f1, ..., f10 in Figure 1.1(b), are generated from the two chromosome copies shown in Figure 1.1(a). In reality, a much larger number of such short reads are generated by a sequencing machine without knowing from which copy of the chromosome each short read originates. The haplotype assembly problem aims to assemble these short reads together in order to reconstruct the two copies of the chromosome. For the example in Figure 1.1, the two reconstructed haplotypes are shown in Figure 1.1(c). Given that the two copies differ only at SNP sites, the focus of haplotype assembly is on determining the content of the SNP sites for each haplotype copy.

Traditionally, researchers have developed experimental approaches for differentiating among chromosome copies (i.e., determining haplotypes). The experimental approaches [SAK15], however, are severely limited in processing the massive amounts of data that are provided by sequencing technologies. As a result, experimental approaches to haplotype assembly are constrained by their high cost and low processing speed.

[Figure 1.1: (a) two copies of a chromosome aligned to the reference sequence; (b) short reads/fragments aligned to the reference sequence; (c) assembled haplotypes.]

Figure 1.1: An illustration of the diploid haplotyping process. Ten short reads, denoted by f1 to f10 in (b), are generated from the two chromosome copies shown in (a). Only SNP sites, denoted by S1 to S8, are used for haplotype assembly.

Over the past decade, computational methods have been identified as a promising approach for devising accurate and inexpensive tools to generate haplotypes of various organisms from Next Generation Sequencing (NGS) data [Rei09]. It is straightforward to show that the haplotype assembly problem can be modeled as a graph partitioning problem if all short reads are error-free [LBI01]. For example, in the case of diploid haplotyping, the collection of short reads can be partitioned into two disjoint sets (e.g., by constructing a bipartite graph) where each set is used to reconstruct one haplotype copy. In reality, however, DNA reads exhibit various types of errors such as sequencing errors, missing data, and alignment errors. The existence of such errors turns haplotype assembly into an NP-hard problem [CVK05]. Because the problem is combinatorial in nature, developing computational algorithms for reconstructing haplotypes from erroneous DNA reads is a non-trivial task. Importantly, these algorithms need to balance two conflicting criteria, accuracy and complexity. One of the core research challenges in haplotype assembly is to achieve a balance between these criteria. The main contribution of this dissertation is to address this challenge by investigating algorithmic frameworks for haplotype assembly that support unsupervised learning to improve both accuracy and speed.
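To make the error-free case concrete, the following is a minimal sketch (not the dissertation's algorithm; the reads, encoding, and function names are hypothetical): each read is a string over {0, 1, '-'} restricted to SNP sites, a conflict graph connects two reads that disagree at a site both cover, and 2-coloring that graph splits error-free reads into the two haplotype groups.

```python
from collections import deque

def conflicts(r1, r2):
    """Two reads conflict if they disagree at any SNP site both cover ('-' = not covered)."""
    return any(a != b for a, b in zip(r1, r2) if a != '-' and b != '-')

def bipartition(reads):
    """2-color the conflict graph; succeeds only when the reads are error-free."""
    side = {}
    for start in range(len(reads)):
        if start in side:
            continue
        side[start] = 0
        queue = deque([start])
        while queue:
            i = queue.popleft()
            for j in range(len(reads)):
                if i != j and conflicts(reads[i], reads[j]):
                    if j not in side:
                        side[j] = 1 - side[i]
                        queue.append(j)
                    elif side[j] == side[i]:
                        return None          # odd cycle: the reads are not error-free
    return ([r for i, r in enumerate(reads) if side[i] == 0],
            [r for i, r in enumerate(reads) if side[i] == 1])

# Hypothetical error-free reads over 4 SNP sites, drawn from haplotypes 0110 and 1001
reads = ["01--", "-110", "--01", "10--", "-00-"]
print(bipartition(reads))
```

With erroneous reads the same graph contains odd cycles and no such clean bipartition exists, which is exactly where the heuristic algorithms developed in this dissertation come in.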

This dissertation develops algorithms and tools that can be used by biologists to investigate factors contributing to inheritable diseases, among other phenotypes. Four approaches are proposed in this research, each of which focuses on improving the accuracy and/or complexity of haplotype assembly in either diploid or polyploid organisms. The methodologies presented in this dissertation span several areas of research including unsupervised learning, combinatorial optimization, graph partitioning, and data mining. The overarching theme of the technological contributions of this research is novel combinatorial algorithm design to deal with the complex structure of DNA reads to perform diploid and polyploid haplotyping.

In Chapter 3, we present the development and validation of FastHap, a fast and accurate diploid haplotyping algorithm. The development of FastHap is motivated by the fact that for haplotype assembly algorithms to be viable for practical use, they must be not only accurate but also fast enough to be used widely on large-scale datasets. In fact, current trends in sequencing technologies suggest that sequence read lengths are being extended significantly and access to reads of up to several thousand base pairs long will become a reality in the near future. Despite tremendous effort in recent years, fast and accurate haplotype reconstruction has remained an active research topic, mainly due to the computational challenges involved. FastHap introduces a new similarity metric that allows us to precisely measure distances between pairs of fragments (i.e., DNA short reads). The measure is carefully developed not only to assign small values to fragments that match perfectly and large values to completely different fragments, but also to neutralize the effect of missing alleles on the final partitioning of the fragments. The distance values are then utilized to build a fuzzy conflict graph that represents similarity among the fragments. FastHap introduces a two-phase, computationally simple heuristic algorithm for haplotype reconstruction. The first phase uses the fuzzy conflict graph to build an initial fragment partition. In the next phase, the initial partition is further refined to achieve additional improvements in the overall accuracy of the reconstructed haplotypes. Experimental evaluation demonstrates that FastHap is up to one order of magnitude faster than state-of-the-art haplotype assembly algorithms while also delivering comparable accuracy in terms of the Minimum Error Correction (MEC) criterion.

FastHap can be viewed as a binary partitioning approach where input fragments are grouped into two disjoint clusters based on the proposed distance measure. FastHap includes an iterative process where the two most similar and the two most dissimilar fragments are identified during each iteration of the algorithm. The similar fragments are placed in the same partition while the dissimilar fragments are assigned to opposite partitions. Although FastHap achieves high speed performance while optimizing the overall MEC score of diploid haplotyping, it is not designed to minimize the switching error of the reconstructed haplotypes. Therefore, our research in this dissertation proceeds by investigating a new framework, called ARHap, where the focus is to achieve high speed performance while minimizing the switching error.

In Chapter 4, we introduce ARHap, another haplotyping framework, based on the concept of association rule learning. In Genome Wide Association Studies (GWAS), we often study correlations between genomic variations and different phenotypes, while linkage disequilibrium, which is the non-random association of alleles at different loci in a given population, is usually overlooked due to the complex nature of the problem. In order to discover interesting relationships hidden in large datasets, we take a novel approach to the problem of individual chromosomal reconstruction by utilizing association analysis techniques to reveal hidden relationships among multiple variant loci in the genome. To this end, ARHap aims to identify such correlations and utilize them for haplotype assembly. The ARHap framework is composed of two main modules or processing phases. In the first phase, called association rule learning, strong patterns are discovered from the dataset that is provided by the individual's sequencing data. These patterns reveal the inter-dependency of alleles at individual SNP sites. In the next phase, called haplotype reconstruction, an approach for utilizing the strong rules produced in the first phase is developed to construct haplotypes at individual SNP sites. ARHap has several features that lead to both fast and accurate haplotyping. ARHap uses an incremental haplotype reconstruction approach that enables us to generate association rules according to the unreconstructed SNP sites during each round of the algorithm. The two modules are synergistically designed for efficient data processing. During each round, the association rule learning module generates rules while constraining the length of the rules (i.e., the number of participating snpsets) and limiting the rules to those that contribute to the reconstruction of sites that have remained unreconstructed thus far. The framework begins by generating rules that are small (e.g., only those of length 2) and highly strong (i.e., according to minimum support and minimum confidence criteria) and reconstructing haplotypes accordingly. The rule length can increase and/or the criteria about the strength of the rules can be adjusted gradually during subsequent rounds if some SNP sites have remained unreconstructed. This adaptive approach, which uses feedback from the haplotype reconstruction module, eliminates the generation of rules that do not contribute to haplotype reconstruction as well as weak rules that may introduce errors into the final haplotypes. Extensive experimental analyses demonstrate the superiority of ARHap in diploid haplotyping, in particular in achieving significantly better accuracy performance in terms of switching error (SWE).
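As a rough illustration of the rule-learning phase (a sketch only; the exact support and confidence definitions used by ARHap are given in Chapter 4, and the reads and thresholds below are hypothetical), the code enumerates length-2 SNP association rules over a fragment matrix and keeps those that satisfy minimum-support and minimum-confidence criteria.

```python
def rule_stats(reads, i, a, j, b):
    """Support and confidence of the rule (SNP i = a) -> (SNP j = b) over the reads."""
    both = sum(1 for r in reads if r[i] == a and r[j] == b)
    antecedent = sum(1 for r in reads if r[i] == a)
    support = both / len(reads)
    confidence = both / antecedent if antecedent else 0.0
    return support, confidence

def strong_rules(reads, n_snps, min_support=0.2, min_confidence=0.8):
    """Enumerate length-2 rules that satisfy both thresholds (hypothetical thresholds)."""
    rules = []
    for i in range(n_snps):
        for j in range(n_snps):
            if i == j:
                continue
            for a in "01":
                for b in "01":
                    s, c = rule_stats(reads, i, a, j, b)
                    if s >= min_support and c >= min_confidence:
                        rules.append(((i, a), (j, b), s, c))
    return rules

# Hypothetical fragment matrix rows over 4 SNP sites ('-' marks uncovered sites)
reads = ["011-", "01-0", "-110", "100-", "1-01"]
print(strong_rules(reads, n_snps=4))
```

In the actual framework, only rules whose consequents touch still-unreconstructed SNP sites are generated during each round, which keeps the search space small.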

Polyploidy, the presence of more than two copies of each chromosome in the cells of an organism, is common in plants, animals, and human body tissues, and finds important applications in the field of genetics. As stated previously, FastHap is, by design, a binary fragment partitioning approach. As a result, it cannot be applied to polyploid haplotyping problems where more than two copies of the chromosome exist. Furthermore, although ARHap does not specifically place any constraints on the number of chromosome copies (i.e., ploidy level), it is apparent that ARHap will perform poorly in polyploid haplotyping. This can be explained as follows. ARHap starts by iteratively applying a sequence of association rules (built based on inter-SNP relationships) on haplotypes, which are initialized at random, to modify alleles at individual SNP sites. In diploid data, the two haplotypes maintain complementary binary values at each SNP site. Therefore, each association rule can be applied to only one of the two haplotypes because the antecedent of each rule is true on only one of the two haplotypes. This process allows us to track which SNP sites have already been verified or updated according to a given set of association rules. The process stops when all SNP sites have been either verified or updated. In contrast, different haplotype copies in polyploid data can have arbitrary values at the same SNP site, which can result in a particular association rule being applicable to more than one haplotype copy. Therefore, this dissertation proceeds by developing two novel approaches for polyploid haplotyping, namely HapColor and PolyCluster, discussed in Chapter 5 and Chapter 6, respectively.

HapColor and PolyCluster focus on partitioning an input fragment set into K disjoint sets, each representing a haplotype copy. Both of these frameworks follow a general approach that includes construction of an initial clustering of the fragments followed by a cluster merging procedure. These two frameworks, however, are different in both the distance measure used for graph-based vertex partitioning and the underlying cluster construction. Specifically, the major difference between HapColor and PolyCluster is that HapColor starts by constructing a clustering of the fragments such that the obtained clusters do not have any conflicts among them. That is, the initial clustering attained by HapColor achieves an MEC score equal to zero. It will then merge similar clusters until the number of remaining clusters reaches the desired value of K, the ploidy level. This approach to fragment clustering is essentially consistent with the aim of minimizing the overall MEC score. PolyCluster, however, attempts to consider both similarity and dissimilarity of the fragments (and therefore of the constructed clusters), as a proxy for optimizing switching error, during the clustering process. A more detailed description of each algorithm is given in the following.

A majority of current haplotype assembly algorithms, however, focus primarily on diploid organisms. Chapter 5 presents HapColor, a fragment partitioning approach with the objective of minimizing MEC for polyploid haplotyping. First, a formal definition of polyploid haplotyping with the objective of minimizing overall MEC is presented in this chapter. The problem is then mapped onto a graph coloring problem using an introduced conflict graph model. Then the hardness of the introduced problem, Minimum MEC Polyploid Haplotyping (MMPH), is discussed and it is shown that this problem is NP-hard. A greedy heuristic algorithm, consisting of a graph coloring method followed by a color-merging technique, is developed to accurately partition short reads and reconstruct the haplotype associated with each partition. The performance of HapColor is compared against several polyploid haplotyping approaches and it is demonstrated that HapColor substantially reduces the MEC scores of the competing algorithms on polyploidy data.
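As an illustration of the coloring-then-merging idea just described (a minimal sketch under simplifying assumptions, not the HapColor implementation; the reads and the merge cost below are hypothetical), the code greedily colors a read-conflict graph so that every color class is internally conflict-free, then merges the pair of classes with the fewest cross-class conflicts until K classes remain.

```python
def conflicts(r1, r2):
    """Two reads conflict if they disagree at a SNP site both cover ('-' = not covered)."""
    return any(a != b for a, b in zip(r1, r2) if a != '-' and b != '-')

def greedy_coloring(reads):
    """Assign each read to the first color class it does not conflict with (zero-MEC start)."""
    classes = []
    for r in reads:
        for c in classes:
            if not any(conflicts(r, other) for other in c):
                c.append(r)
                break
        else:
            classes.append([r])
    return classes

def disagreement(c1, c2):
    """Number of conflicting read pairs across two color classes (hypothetical merge cost)."""
    return sum(conflicts(a, b) for a in c1 for b in c2)

def merge_to_k(classes, k):
    """Repeatedly merge the pair of classes with the lowest merge cost until k remain."""
    while len(classes) > k:
        pairs = [(disagreement(classes[i], classes[j]), i, j)
                 for i in range(len(classes)) for j in range(i + 1, len(classes))]
        _, i, j = min(pairs)
        classes[i] = classes[i] + classes[j]
        del classes[j]
    return classes

# Hypothetical reads over 6 SNP sites for a triploid (K = 3) example
reads = ["011---", "0110--", "--1100", "1001--", "--0011", "10-1--"]
print(merge_to_k(greedy_coloring(reads), k=3))
```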

While Chapter 5 focuses on minimizing MEC by grouping similar fragments in individual partitions, Chapter 6 introduces PolyCluster, a correlation-clustering-based graph partitioning algorithm that combines intra-partition similarity with inter-partition dissimilarity in the process of clustering to generate high quality haplotypes. We refer to this new optimization problem as minimum fragment-disagreement haplotyping. The hypothesis in Chapter 6 is that a combined measure of similarity/dissimilarity will improve switching error. This hypothesis is verified through extensive experimental analyses. We first provide a formal definition of minimum fragment-disagreement haplotyping, which aims to minimize the overall partitioning error computed by combining inter-partition similarity and intra-partition dissimilarity. The optimization problem is then transformed into a correlation clustering problem on a graph model that captures the amount of similarity between pairs of DNA short reads. PolyCluster devises a combination of linear programming, rounding, region-growing, and cluster merging to group similar reads and reconstruct haplotypes. Experimental results demonstrate that PolyCluster substantially improves the accuracy of haplotype assembly in terms of switching error and running time while achieving comparable results in terms of MEC score.

Finally, we conclude the dissertation with the lessons learned and discuss important future directions in Chapter 7.

CHAPTER 2

Haplotype Assembly in the Literature

While genotype data refers to the combined information from both chromosome copies of an individual, haplotype refers to the information from each copy of the chromosome. Haplotypes can be defined as the genetic constitution of an individual chromosome. Haplotype information is considered useful for researchers in finding genes affecting health, disease, and responses to drugs and environmental factors. The international HapMap project is a multi-country effort in the field attempting to identify and catalogue genetic similarities and differences in human beings for such purposes [GBH03].

In diploid organisms such as humans, each individual has two copies of each chromosome, which may not be identical [Man85,ABC05]. There are other organisms that have more than two homologous chromosomes. This concept is called polyploidy, which is common in plants such as wheat, potatoes, apples, and roses, and also in some animals and even in human body tissues. Therefore, in order to understand the exact chromosomal structure of each organism, phasing, which is basically reconstructing chromosomal copies using genotype or sequencing data, is required. Phasing can be done through traditional haplotype inference using genotype data, which can be defined as the problem of inferring the pair of haplotype sequences of an individual from his/her genotype sequence. Recently, the lower cost and higher accuracy of sequencing data have enabled us to use computational methods to solve the phasing problem using individual sequencing data, in which a collection of short reads is processed in order to reconstruct each copy of the chromosome of the organism under study. This approach to solving the phasing problem is called haplotype assembly using Next Generation Sequencing data. This concept was first introduced in 2001 [LBI01], where optimization techniques were proposed for solving the problem in its general form. The problem has been shown to be computationally hard under various combinatorial objective functions [LBI01,CVK05,BIL05].

Haplotype assembly algorithms often take as input a standard matrix, called a fragment matrix, composed of short reads obtained by DNA sequencing. Each row in a fragment matrix is associated with a short DNA read and each column represents a SNP (i.e., Single Nucleotide Polymorphism) site. To extract SNP positions, the DNA reads are mapped onto a reference sequence. The goal is to re-build each copy of an organism's chromosome using the variation information within the fragment matrix. In the literature, it is assumed that the DNA reads are aligned to the reference genome sequence prior to forming a fragment matrix. The alignment is done using a mapping/alignment algorithm, which may introduce errors into the short reads. Furthermore, consistent with the literature [BYP14a,BB08,LSN07], here it is assumed that all homozygous sites are discarded prior to forming the fragment matrix and only single variant calls (i.e., SNPs) are considered for haplotype assembly. For each locus of a fragment, it is traditional in the literature to consider biallelic SNPs; any other allele that is seen rarely is treated as ‘missing’ and substituted with ‘–’ in the fragment matrix. Our methodologies presented here, however, are independent of this assumption, and in the case of polyploidy we use two bits in order to accommodate multiple alleles. Additionally, we would like to note that in polyploid haplotyping the ploidy level, ‘K’, is known a priori. If the ploidy level is unknown, the problem transforms into an organism identification problem, which is out of the scope of this dissertation.
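For concreteness, the following is a minimal sketch of how allele calls at SNP sites can be recoded into such a fragment matrix (illustrative only; alignment and variant calling are assumed to have been done elsewhere, and the input format and calls are hypothetical): the two most common alleles at each SNP column become 0 and 1, while uncovered sites and rare third alleles become ‘-’.

```python
from collections import Counter

def build_fragment_matrix(calls, n_snps):
    """
    calls: list of dicts, one per read, mapping SNP index -> observed base at that site.
    Returns rows of the fragment matrix with alleles recoded as '0', '1', or '-'.
    """
    # For each SNP column, determine the two most common observed alleles.
    coding = {}
    for s in range(n_snps):
        counts = Counter(r[s] for r in calls if s in r)
        common = [base for base, _ in counts.most_common(2)]
        coding[s] = {base: str(k) for k, base in enumerate(common)}

    rows = []
    for r in calls:
        row = []
        for s in range(n_snps):
            base = r.get(s)                       # None: read does not cover this SNP
            row.append(coding[s].get(base, '-'))  # rare third alleles also become '-'
        rows.append("".join(row))
    return rows

# Hypothetical allele calls at 3 SNP sites for 4 reads
calls = [{0: 'A', 1: 'C'}, {1: 'T', 2: 'G'}, {0: 'G', 1: 'C'}, {0: 'A', 2: 'C'}]
print(build_fragment_matrix(calls, n_snps=3))
```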

In haplotype assembly, the goal is to build the two copies of an individual's chromosome using the variation information within the fragment matrix. The difficulty of the problem arises from the fact that the matrix is not error-free due to various types of errors such as sequencing errors, chimeric read pairs, and false variants. In the real world of molecular biology, experiments are never error-free. These errors result in conflicting data presented by the fragments within the fragment matrix. Conflicting data prevent us from reliably inferring a 0 or 1 at each SNP site in the resulting haplotype. We note that in an error-free fragment matrix, one can easily partition the fragments into two disjoint sets (for diploid haplotyping) such that each fragment set infers one copy of the haplotype. An error-free fragment matrix is also called a feasible matrix.

Since the inception of the haplotype assembly problem, research efforts on haplotype reconstruction from DNA fragments can be broadly categorized according to various factors such as the computational methodology, evaluation method, and ploidy level.

2.1 Approaches

Earlier approaches on reconstructing haplotypes from DNA reads, which we refer to as data removal approaches, were developed almost a decade ago. Data removal approaches focus on removing data (i.e., fragments or SNPs) from a collection of DNA reads (i.e., a fragment matrix) such that the resulting matrix is feasible or error-free [LBI01]. Once such a matrix is obtained, it is straightforward to use classical computer science partitioning algorithms to group the remaining reads into disjoint sets and construct haplotype copies accordingly. Later, other methods, which we refer to as haplotype update approaches, were developed. The haplotype update approaches, which have received more attention compared to data removal methods [LSN07, BB08, BHA08, MW14a, PMP15, AI12, AI13], focus on dealing with an erroneous fragment matrix and reconstructing haplotypes such that an error metric is minimized. In addition to these two broad methodological approaches, a third group has tried to solve the problem using statistical methods [CJC06,SSD01].

2.2 Evaluation Methods

Conventional data removal approaches use objective functions such as MSR (Minimum SNP Removal) and MFR (Minimum Fragment Removal) [LBI01]. These two measures refer to removing the minimum number of SNPs and the minimum number of reads, respectively, from the data to reach a feasible fragment matrix. LHB (Longest Haplotype Block) [LBI01,WP03] refers to methods that attempt to achieve the longest possible haplotype block. Another metric that falls within the data removal category is MWER (Minimum Weight Edge Removal). This approach focuses on removing data (i.e., edges) from a different abstraction of the data (i.e., a weighted graph) than the fragment matrix [AI12, DV15]. The approaches proposed in [AI12, AI13] attempt to solve an optimization problem that minimizes the MWER objective.

Most diploid haplotyping approaches that fall within the haplotype update category, however, use Minimum Error Correction (MEC) [LSL02] as their objective function [PS04, SSD01, SS06, MW14b, WWL05]. The MEC score refers to the number of edits in the fragment matrix such that each read can be precisely mapped onto one of the reconstructed haplotypes. Therefore, the MEC score also represents the amount of mismatched single base-pairs between the fragment set and the corresponding haplotype copy. The problem of minimizing the MEC score for haplotype reconstruction in diploid organisms is shown to be NP-hard [CVK05]. Using the MEC score as an objective function in diploid haplotyping is important because MEC is associated with errors due to base miscalling, which is the most common source of error [BB08].
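For concreteness, a minimal sketch of the MEC computation (the reads and haplotypes below are hypothetical): each read is charged the number of covered SNP sites at which it disagrees with its best-matching haplotype copy, and the MEC score is the total over all reads.

```python
def mismatches(read, hap):
    """Covered SNP sites at which the read disagrees with the haplotype."""
    return sum(1 for r, h in zip(read, hap) if r != '-' and r != h)

def mec(reads, haplotypes):
    """Minimum Error Correction: each read is matched to its closest haplotype copy."""
    return sum(min(mismatches(r, h) for h in haplotypes) for r in reads)

reads = ["011-", "01-0", "-110", "100-", "1-11"]   # hypothetical fragment matrix rows
haplotypes = ["0110", "1001"]                      # hypothetical diploid haplotype pair
print(mec(reads, haplotypes))                      # 1: the last read carries one erroneous allele
```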

Since optimizing MEC is an NP-hard problem [CVK07], exact solutions have exponential complexity. Recent studies have analyzed this accuracy measure in terms of its fixed-parameter tractability and approximability [BDK15]. A related metric is wMEC (weighted Minimum Error Correction), proposed in [GHL04], where each possible correction is associated with a weight that represents the confidence degree assigned to that SNP value at the corresponding position [PMP14]. While MEC has been widely used in the literature [BB08, BHA08, DMH12, BYP14b, MW14a, DV15], recent studies suggest that a lower MEC score may not necessarily correspond to a higher quality of the reconstructed haplotypes, in particular in polyploid genomes [DMH12,BYP14b,DV15]. As a result, several studies suggest switching error (SWE) [BYP14b,DV15] as a more important accuracy measure for assessing the performance of haplotype assembly algorithms. SWE refers to the amount of mismatch between reconstructed haplotypes and true haplotypes. Obviously, in the absence of true haplotypes, one cannot measure SWE. Thus, many studies that report SWE as an accuracy performance measure conduct their analysis on simulated polyploidy data [OAH12,HLM11].
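As an illustration for the diploid case (a sketch of one common way switch errors are counted; the precise definition used later in this dissertation may differ in detail, and the haplotype pairs below are hypothetical), the estimate is walked across heterozygous sites and an error is counted whenever its phase flips relative to the true haplotype pair.

```python
def switch_errors(true_pair, est_pair):
    """Count positions where the phase of the estimate flips relative to the truth
    (diploid case, heterozygous sites only)."""
    (t0, t1), (e0, e1) = true_pair, est_pair
    errors, prev = 0, None
    for i in range(len(t0)):
        if t0[i] == t1[i] or e0[i] == e1[i]:
            continue                     # skip homozygous / unresolved sites
        phase = (e0[i] == t0[i])         # True if estimate copy 0 tracks truth copy 0 here
        if prev is not None and phase != prev:
            errors += 1                  # the phasing switched between consecutive het sites
        prev = phase
    return errors

true_pair = ("010101", "101010")
est_pair  = ("010010", "101101")         # phase flips once, after the third site
print(switch_errors(true_pair, est_pair))  # 1
```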

Other methods produce a confidence score for each SNP in a haplotype block in addition to its phase [DHM10]. Furthermore, the study in [Kul14] uses dynamic programming methods and a probabilistic graphical model in order to perform phasing and then prunes out the alleles with low confidence scores. Several studies have proposed methods to extend existing algorithms for improved accuracy or complexity [DHM10,SAK15,KSB16,GMM16].

2.3 Ploidy Level

Most solutions are developed for diploid haplotyping, where a collection of fragments is bi-partitioned in order to generate a pair of complementary haplotypes. These methods fail to scale to higher ploidy levels because of the bipartition nature of the algorithm. For polyploid organisms, however, the problem is more challenging since the haplotype copies are not complementary and thus cannot be inferred from each other. Furthermore, the approach can vary based on read partitioning or SNP partitioning, which can affect the computational complexity of polyploid haplotyping. Most algorithms partition the reads into multiple groups in order to reconstruct the haplotype set, while some other approaches use different subsets of SNP sites and, through iterations, update the haplotype set to enhance accuracy and generate haplotypes. There has been limited research on polyploid haplotyping [BYP14b, BYB15, MW15, DV15] using Next Generation Sequencing data. These methods, however, are limited in their capability to create a balance between accuracy and running time. Furthermore, the ploidy level itself appears to have an impact on the performance of these algorithms. For example, the study in [BYP14a] produces more accurate haplotype estimates for triploid and tetraploid data. The approach in [BYP14a] is computationally expensive, in particular with long haplotype blocks, compared to other algorithms [AI13,DV15,MW15]. The approach in [AI13] has shown high accuracy on hexaploid data. Other factors such as sequencing depth are also important in determining the quality of a haplotype estimation algorithm; paired-end short reads of Illumina with a large insert can perform as well as long CCS reads of the same total size now possible with PacBio [MFM17].

CHAPTER 3

Fast and Accurate Diploid Haplotyping

3.1 Introduction

All diploid organisms have two homologous copies of each chromosome, one inherited from each parent. The two DNA sequences of a homologous chromosome pair are usually not identical to each other. The most common DNA sequence variants are Single Nucleotide Polymorphisms (SNPs). The sites at which the two DNA sequences differ are commonly referred to as heterozygous sites. Current High Throughput Sequencing (HTS) technologies [EFG09] are incapable of reading the DNA sequence of an entire chromosome. Instead, they produce a huge collection of short reads of DNA fragments. The process of reconstructing chromosome sequences from a set of DNA reads is referred to as haplotype assembly. Haplotype assembly has become an important computational task because reconstructing one's genome from a large amount of DNA reads is computationally hard [CSV16].

3.1.1 Motivation

The last decade has witnessed much research effort on developing accurate haplotype assembly methods. The research, however, lacks a method that is not only accurate but also fast enough to be used widely on large-scale datasets. In particular, current trends in sequencing technologies demonstrate that sequence read lengths are being extended significantly and access to reads of up to several thousand base pairs long will become a reality in the near future.

Haplotype assembly methods usually involve three main stages prior to the reconstruction phase. First, a sequence aligner is utilized to align the reads to the reference genome. Then, only the read alignments at the heterozygous sites are kept for haplotype reconstruction. Last, reads that span multiple heterozygous sites are used to infer the alleles belonging to each haplotype. The quality of the reconstructed haplotypes may be dramatically affected by errors in sequencing and alignment. The objective, therefore, is to design algorithms that mitigate this impact and rebuild the most likely copies of each chromosome accurately. This has led to the development of accurate haplotype reconstruction algorithms in the past few years. We are, however, observing a critical shift in sequencing technology where larger datasets with longer reads and higher coverage become available. This shift necessitates the development of algorithms that not only reconstruct haplotypes accurately but also require low computation time and can scale to large datasets.

Input to haplotype assembly algorithms is often a fragment matrix where each row consists of a DNA short read and each column represents a SNP site. Most haplotype assembly algorithms process the fragment matrix either row-wise or column-wise. The algorithms that process the given fragment matrix row-wise usually attempt to partition the set of fragments into two disjoint sets, each representing one copy of the haplotype. Examples of such techniques are FastHare [PS04] and the greedy heuristic in [LSN07]. In contrast, algorithms that perform the matrix processing column-wise aim to iteratively update the haplotype at individual SNP sites taking into consideration the rows/fragments that cover those SNPs. For example, HapCut [BB08], HapCompass [AI12], and the approach in [HCP10] rely on partitioning the SNPs into two disjoint sets and finding those variants whose corresponding haplotype bits need to be flipped to improve MEC (Minimum Error Correction). In either scenario, an iterative process is involved. From a computational complexity point of view, the main drawback of existing techniques is that they perform much computation during each iteration of the algorithm.

HapCut [BHA08] is an example of the algorithms that perform column-wise processing to minimize the MEC criterion. The process involves iteratively reconstructing a weighted graph and finding a max-cut on the graph. Clearly, most of the computation occurs in a loop. The algorithm has proved to be fairly accurate at the cost of high computation, in particular as the number of SNPs grows. The greedy heuristic algorithm in [LSN07] is a fragment partitioning approach that performs row-wise processing on the fragment matrix. Each iteration, however, involves two major computing tasks: (1) reconstructing a partial haplotype based on the fragments that are already assigned to a partition; and (2) calculating the distance between unassigned fragments and each one of the haplotype copies. FastHare [PS04] is another fragment partitioning algorithm. It sorts all fragments based on their positions prior to execution of the iterative module. Computationally intensive tasks that occur iteratively in FastHare include: (1) reconstruction of a partial haplotype based on the fragments that are already assigned to a partition; and (2) calculating the distance between the current fragment and each one of the two haplotype copies.

3.1.2 Contributions and Summary of Results

This chapter introduces a new framework, called FastHap, for fast and accurate haplotype assembly. The approach consists of four steps as follows: (1) measuring the dissimilarity of every pair of fragments using a new distance metric; (2) building a weighted graph, called a fuzzy conflict graph, using the introduced dissimilarity measure; (3) using the fuzzy conflict graph to construct an initial partition of the fragments through an iterative process; and (4) refining the initial partitioning to further improve the overall MEC of the constructed haplotypes. Specifically, this chapter offers the following contributions:

• A new distance metric, called inter-fragment distance, will be introduced. This distance measure quantifies dissimilarity between pairs of fragments. The measure is carefully developed not only to assign small values to fragments that match perfectly and large values to completely different fragments, but also to neutralize the effect of missing alleles on the final partitioning of the fragments.

• The notion of fuzzy conflict graphs in the context of haplotype assembly will be introduced in this chapter. These graphs are built based upon the inter-fragment distances. In this graph model, each node represents a fragment and edge weights are the corresponding dissimilarity measures between portions of fragments.

• A two-phase computationally simple heuristic algorithm for haplotype recon- struction will be presented. The first phase uses a fuzzy conflict graph to build an initial fragment partition. In the next phase, the initial partition is further refined to achieve additional improvements in the overall MEC performance of the reconstructed haplotypes.

• The effectiveness of FastHap will be demonstrated using the HuRef dataset, a dataset that has been widely used in the haplotype assembly literature recently. Specifically, FastHap will be compared with several previously published algorithms in terms of accuracy (MEC measures) and scalability (execution time) performance.

The results show that FastHap significantly outperforms the previous algorithms by providing a speedup of one order of magnitude while delivering comparable or better MEC scores. The objective of building a fast haplotype assembly model is achieved in FastHap by performing computationally intensive tasks prior to execution of the iterative process. Specifically, FastHap achieves speedups of 16.4 and 15.1 compared to HapCut [BHA08] and Greedy [LSN07], respectively, and is 1.9% and 35.4% more accurate than HapCut and Greedy, respectively, on HuRef data.

3.2 FastHap Framework

As stated previously in this chapter, the input to a haplotype assembly algorithm is assumed to be a two-dimensional array containing only heterozygous sites of the aligned fragments. Such an input is often called a fragment matrix, X, of size m × n, where m denotes the number of fragments (aligned DNA short reads) and n represents the number of SNPs that the union of all fragments cover. In the following discussion, xij refers to the allele of fragment fi at SNP site sj. Furthermore, xij ∈ {0,1,−}, where 0 and 1 encode two observed alleles and ‘−’ denotes that fragment fi does not cover SNP site sj. If there are more than two alleles observed at a given site, the two most common alleles are encoded with 0 and 1, and the remaining allele(s) are encoded as ‘−’ (i.e., missing). It is expected that many cells in X are filled with ‘−’ since, in practice, each aligned fragment covers only a few SNP sites, limited by the fragment length.[1]

One approache for haplotype assembly is to construct haplotypes based on par- titioning of the fragments in the input fragment matrix. In this case, the haplotype assembly problem consists of two steps, namely fragment partitioning and fragment merging, described as follows. While the fragment partitioning phase aims to group rows of the fragment matrix into two partitions, C1 and C2, fragment merging is intended to combine the fragments residing in each partition, through a SNP-wise consensus process, and form two haplotypes h1 and h2 associated with C1 and C2 respectively. The resulting haplotype is typically denoted by H={h1, h2}. The main objective of the haplotype assembly is, therefore, to come up with a partitioning such that the amount of error is minimized. The focus of FastHap is on minimiz- ing the MEC objective function. As mentioned previously, this problem is proved to be NP-hard [CVK05]. Therefore, our goal is to develop a heuristic algorithm for the haplotype assembly problem. FastHap relies on a novel inter-fragment distance

¹ As discussed previously, the trend is that much longer DNA reads will be available as a result of recent technological advancements in genome sequencing.

A high level overview of FastHap is shown in Algorithm 1.

Algorithm 1 FastHap high level overview
Initialization:
  Calculate the inter-fragment distance between every pair of fragments (Section 3.2.1)
  Store the inter-fragment distances in ∆ (Section 3.2.1)
  Use ∆ to construct a fuzzy conflict graph (Section 3.2.2)
Phase (I): Partitioning
  Partition the fragments into two disjoint sets C1 and C2 (Section 3.2.3 and Algorithm 2)
Phase (II): Refinement
  while (MEC score improves) do
    Find the fragment f̂ with the highest MEC value
    Assign f̂ to the opposite partition
  end while

3.2.1 Inter-Fragment Distance

Given two variables x, y ∈{0,1,−}, operator ⊗ is defined as follows.

$$
x \otimes y =
\begin{cases}
0 & \text{if } x = y\\
1 & \text{if } x \neq y \ \text{and}\ x, y \in \{0, 1\}\\
0.5 & \text{otherwise}
\end{cases}
\qquad (3.1)
$$

Definition 1 (Inter-Fragment Distance). Given a fragment matrix Xm×n where xij ∈ {0,1,−}, the inter-fragment distance ∆(fi, fk) between fragments fi = {xi1, xi2, ..., xin} and fk = {xk1, xk2, ..., xkn} is defined by

$$
\Delta(f_i, f_k) = \frac{1}{T_{ik}} \sum_{j=1}^{n} \left( x_{ij} \otimes x_{kj} \right)
\qquad (3.2)
$$

where Tik denotes the number of columns (SNPs) that are covered by either fi or fk in X. In fact, Tik is a normalization factor that normalizes the distance between the two fragments such that the resulting distance ranges from 0 to 1 (i.e., 0 ≤ ∆(fi, fk) ≤ 1).

The inter-fragment distance metric is developed with the goal of measuring the cumulative dissimilarity between each pair of fragments across all SNP sites. The intuition behind (3.1) and (3.2) is as follows. At a given SNP site sj, if two fragments fi and fk both cover the site, the per-site distance is 0 if they take the same allele (suggesting that they likely belong to the same partition) and 1 if they take opposite alleles (suggesting that they are likely to belong to different partitions). A distance value of 0.5 is used if the SNP site is covered by only one of the two fragments, to neutralize the contribution of the missing element. If the site is not covered by either fragment, a 0 distance is accumulated at this site. An additional benefit of this approach is that only SNP sites covered by either of the two fragments need to be examined. From a computational complexity point of view, this can reduce the execution time of the distance calculation significantly.
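To make the distance concrete, the following Python sketch mirrors (3.1) and (3.2) under the assumption that fragments are stored as strings over the alphabet {'0', '1', '-'}; the function names (site_distance, inter_fragment_distance) are illustrative and are not taken from the FastHap implementation.

```python
def site_distance(x, y):
    """Per-site operator (3.1): 0 for identical calls, 1 for opposite calls,
    0.5 when only one fragment covers the site, 0 when neither does."""
    if x == '-' and y == '-':
        return 0.0
    if x == '-' or y == '-':
        return 0.5
    return 0.0 if x == y else 1.0


def inter_fragment_distance(fi, fk):
    """Normalized inter-fragment distance (3.2): the cumulative per-site distance
    divided by T_ik, the number of sites covered by at least one of the two fragments."""
    covered = [j for j in range(len(fi)) if fi[j] != '-' or fk[j] != '-']
    if not covered:
        return 0.0  # the fragments share no information
    return sum(site_distance(fi[j], fk[j]) for j in covered) / len(covered)
```

Because sites covered by neither fragment contribute zero to the sum, restricting the computation to the covered columns, as above, is what keeps the distance calculation cheap in practice.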

Figure 3.1(a) shows a set of fragments spanning 8 SNP sites. The resulting inter-fragment distances are shown in a symmetric distance matrix in Figure 3.1(b). Intuitively, ∆ (the distance measure between two fragments) is smaller for those fragments that need to be grouped together and larger for those that we prefer to place in different partitions. When the distance between two fragments is 0.5, the two fragments alone do not provide sufficient information as to how they should be partitioned.

Definition 2 (Pivot Distance). Given a fragment matrix Xm×n, the pivot distance between any pair of fragments in {f1, f2, ... , fm} is λ = 0.5.

The pivot λ allows us to decide whether two fragments are dissimilar enough to be placed in separate partitions. A pair of fragments with an inter-fragment distance greater than λ is more likely to be placed in different partitions, although the final partitioning assignment is made after all pairs of fragments are examined through a partitioning algorithm. Section 3.2.2 will present a graph model that enables us to

perform fragment partitioning by linking similar and dissimilar fragments through a weighted graph based on inter-fragment distance values.

[Figure 3.1: An example of fragment matrix with 8 SNP sites (a), corresponding distance matrix (b), fuzzy conflict graph associated with the fragment matrix (c), and results of applying FastHap on the data (d). The graph in (c) shows only edges with non-pivot distances. The resulting partition in (d) is C1 = {f1, f2, f3, f6, f10} and C2 = {f4, f5, f7, f8, f9}, with MEC = 4 and a reconstruction rate of 100%.]

3.2.2 FastHap Graph Model

This section presents a graph model based on the inter-fragment distance defined in (3.2). In Section 3.2.3, it will be discussed how this graph model can be used to partition the fragments into two disjoint sets and construct haplotypes accordingly.

Definition 3 (Fuzzy Conflict Graph). Given a fragment matrix X composed of m fragments {f1, f2, ..., fm} spanning n SNP sites, a fuzzy conflict graph that models dissimilarity between pairs of fragments is a complete graph G represented by the tuple (V, E, WE). In this graph, V = {1, 2, ..., m} is a set of m vertices representing the fragments in X; each edge el is associated with a weight wl equal to the distance between the corresponding fragments in X.

The conflict graph introduced in this chapter, fuzzy conflict graph, is different from that used in previous research (e.g., the fragment conflict graph in [LBI01]). A conflict graph has been conventionally defined as a non-weighted graph. Let us call it a binary conflict graph, which represents any pair of fragments with at least one mismatch in the fragment matrix. For example, according to [LBI01], a conflict graph is a graph with an edge for each pair of fragments in conflict where two fragments are in conflict if they have different values in at least one column in the fragment matrix X . There are a number of shortcomings with respect to utilizing a binary conflict graph for haplotype assembly. The major problem with the conventional conflict graph is that it does not take into account the number of SNP sites for which the two fragments exhibit a mismatch. Two fragments are considered in conflict even if there is a mismatch at only one SNP site. In contrast, the fuzzy conflict graph discussed in this chapter aims to measure the amount of mismatch across all SNP sites of every pair of fragments. For example, consider three fragments f1={− − 000 − −−}, f8={−−111−−−} and f10={−−010−−−} in Figure 3.1. In a binary conflict graph, all the vertices are connected because there is at least one mismatch between every pair of the fragments: three mismatches between f1 and f8, one mismatch between f1 and f10, and two mismatches between f8 and f10. The binary conflict graph, however, treats all the three edges equally. In contrast, the fuzzy conflict graph assigns weights of 1, 0.33, and 0.66 to these edges respectively to guide the partitioning algorithm to group f8 and f10 together.
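As a quick check, the edge weights quoted for f1, f8, and f10 can be reproduced with the hypothetical inter_fragment_distance helper sketched in Section 3.2.1 above (the fragment strings are copied from Figure 3.1(a)):

```python
f1  = '--000---'
f8  = '--111---'
f10 = '--010---'

print(inter_fragment_distance(f1, f8))    # 1.0   -> strongly favors separate partitions
print(inter_fragment_distance(f1, f10))   # 0.333 -> favors the same partition
print(inter_fragment_distance(f8, f10))   # 0.667 -> favors separate partitions
```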

An example of a fuzzy conflict graph based on the fragments listed in Figure 3.1(a) is illustrated in Figure 3.1(c). For visualization, the edges with a pivot distance are not shown. The problem of dividing the fragments into the two most dissimilar groups is essentially a max-cut problem [Aus99]. A max-cut partition may divide the fragments into C1 = {f1, f2, f3, f6, f10} and C2 = {f4, f5, f7, f8, f9} as shown in Figure 3.1. We note that the resulting partition may not be unique in the general case. As will be discussed in more detail in Section 3.2.3, max-cut is an NP-hard problem and existing techniques provide solutions that are highly sub-optimal. Therefore, we will leverage some properties of the introduced fuzzy conflict graphs to develop a heuristic approach for fragment partitioning.

Algorithm 2 FastHap partitioning algorithm
Require: Fuzzy conflict graph G = (V, E, WE)
Ensure: Partition P = [C1, C2] composed of two groups C1 and C2 of fragments
(a) Delete edges with pivot weights from G
(b) Sort the remaining edges el in G based on their weights wl and store the results in list D
(c) Let el = (fi, fk) be the edge with the largest weight in D
(d) Initialize the partition by assigning fi and fk to opposite groups (e.g., C1 = {fi} and C2 = {fk})
while (not all vertices are partitioned) do
  (e) Let el = (fi, fk) be the next edge with the highest weight in D such that fi ∈ P or fk ∈ P
  (f) Let fi ∈ P; if fi ∈ C1, then C2 = C2 ∪ {fk}, otherwise C1 = C1 ∪ {fk}
  (g) Let er = (fi, fk) be the next edge with the lowest weight in D such that fi ∈ P or fk ∈ P
  (h) Let fi ∈ P; if fi ∈ C1, then C1 = C1 ∪ {fk}, otherwise C2 = C2 ∪ {fk}
  (i) If neither el nor er exists, assign each remaining fragment to the more similar set
end while
repeat
  (j) Let MEC be the minimum error correction score for the existing partition
  (k) Let fi be the fragment with the largest MEC among all fragments in P
  (l) If fi ∈ C1 (alternatively fi ∈ C2), move fi to C2 (alternatively to C1)
  (m) Let newMEC be the minimum error correction score for the new partition
until (newMEC ≥ MEC)

3.2.3 Fragment Partitioning

As stated previously, FastHap aims to partition fragments into two disjoint sets such that fragments within each group are most similar and can form a haplotype with minimum MEC. Using the fuzzy conflict graph model presented in Section 3.2.2, a weighted max-cut algorithm needs to be used to find the optimal partition. The max-cut problem, however, is known to be NP-hard even when all edge weights are set to one [GJ90]. All edges in our fuzzy conflict graph have a positive weight. There exist

heuristic algorithms [SG74] that produce a cut with at least half of the total weight of the edges of the graph when all edges have a positive weight. In fact, a simple 1/2-approximate randomized algorithm is to choose a cut at random. This means that each edge el in the fuzzy conflict graph G is cut with a probability of 1/2. Consequently, the expected weight of the edges crossing the cut, W(C1, C2), is given by

$$
W(C_1, C_2) = \frac{1}{2} \sum_{l=1}^{L} w_l \ \geq\ \frac{1}{2}\, \mathrm{OPT}
\qquad (3.3)
$$

This algorithm can be derandomized to obtain a 1/2-approximate deterministic algorithm. There are two major shortcomings with this partitioning algorithm: 1) unfortunately, derandomization works well only on unweighted graphs where all edges have equal/unit weights, and a similar approach for a weighted graph is not guaranteed to run in polynomial time; 2) the obtained partition is highly sub-optimal, with an approximation factor of 1/2. Thus, we introduce a novel heuristic algorithm based on properties of fuzzy conflict graphs.

The FastHap partitioning algorithm is shown in Algorithm 2 and briefly explained as follows. First, the algorithm eliminates all edges with pivot weights from the fuzzy conflict graph G. Such edges do not contribute to the formation of the final partition. The algorithm then sorts all edges of the graph (equivalently, pairs of fragments) based on the edge weights and stores the results in D. An initial partition is formed by placing the two fragments with the largest inter-fragment distance (associated with the heaviest edge in G) into two separate partition sets C1 and C2. In the next phase, the algorithm alternates between the heaviest and lightest edges and assigns adjacent vertices (associated with fragments in X) to the existing partition if either of the vertices is already assigned to the partition. An edge with the highest weight results in placing the adjacent vertices in different partitions, and an edge with the lowest weight attempts to assign the vertices to the same partition in P. This occurs only if the chosen edge is adjacent to an already partitioned vertex.
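The partitioning phase can be summarized in Python as follows. This is only an illustrative reading of Algorithm 2, Phase I, under a few assumptions: ties are broken arbitrarily, the fall-back step (i) is implemented as a nearest-set assignment, and the name fasthap_partition is invented for the example.

```python
def fasthap_partition(frags, dist, pivot=0.5, eps=1e-9):
    """Sketch of Algorithm 2, Phase I: greedy bipartition guided by the fuzzy conflict graph."""
    n = len(frags)
    edges = []
    for i in range(n):
        for k in range(i + 1, n):
            w = dist(frags[i], frags[k])
            if abs(w - pivot) > eps:          # (a) drop edges with pivot weight
                edges.append((w, i, k))
    edges.sort(key=lambda e: e[0])            # (b) list D, sorted by weight

    side = {}                                 # fragment index -> 0 (C1) or 1 (C2)
    if edges:
        _, i, k = edges[-1]                   # (c)-(d) heaviest edge seeds opposite sets
        side[i], side[k] = 0, 1

    def place(from_heavy, opposite):
        """Pick the next heaviest/lightest edge touching the partition and place
        its unassigned endpoint on the opposite/same side."""
        for w, i, k in (reversed(edges) if from_heavy else edges):
            if (i in side) != (k in side):    # exactly one endpoint already partitioned
                known, new = (i, k) if i in side else (k, i)
                side[new] = 1 - side[known] if opposite else side[known]
                return True
        return False

    while len(side) < n:
        hi = place(from_heavy=True, opposite=True)     # steps (e)-(f)
        lo = place(from_heavy=False, opposite=False)   # steps (g)-(h)
        if not (hi or lo):                             # step (i): nearest-set fall-back
            for f in range(n):
                if f not in side:
                    d0 = sum(dist(frags[f], frags[g]) for g in side if side[g] == 0)
                    d1 = sum(dist(frags[f], frags[g]) for g in side if side[g] == 1)
                    side[f] = 0 if d0 <= d1 else 1
    C1 = {f for f, s in side.items() if s == 0}
    C2 = {f for f, s in side.items() if s == 1}
    return C1, C2
```

A usage example would pass the fragment strings of Figure 3.1(a) together with the inter_fragment_distance sketch above as dist.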

26 Theorem 1. Algorithm 2 terminates in polynomial time.

Proof. We prove that the algorithm terminates and that its running time is polynomial. Let M be the total number of edges in the given fuzzy conflict graph. During each iteration, the algorithm attempts to assign two edges (those with the highest and lowest weights that are adjacent to an already partitioned vertex) to the final partition P. Clearly, the iterative loop does not repeat more than M times. In fact, during each iteration at least one edge (i.e., el or er or both) is selected to be added to the final partition. If the algorithm cannot find such an edge, all remaining edges are allocated to the final partition and the algorithm ends. Therefore, the iterations cannot repeat more than M times and the algorithm will terminate after at most M iterations. The proof regarding the computational complexity of the algorithm is as follows. Removal of edges with pivot weight in (a) in Algorithm 2 can be completed in O(M). The process of sorting the edges in instruction (b) can be done in O(M log M). Instruction (c) takes O(1) to complete. The initialization of the partition in (d) can be done in O(1). In (e), detecting the edge with the highest weight and checking whether one of its vertices (i.e., fragments) is already in the partition P require O(1) and O(M), respectively. In (f), assigning the selected edge to the partition requires O(1). Similarly, the instructions in (g) require O(1) and O(M) to finish, and, similar to (f), the instructions in (h) can be done in O(1). The instructions in (i) have a complexity of O(M). Reading the MEC value of the partition in (j) and (m) takes O(1). Instructions (k) and (l) can finish in O(M). Given that the instructions in the loop are executed at most M times, the complexity of the algorithm is O(M²).

3.2.4 Refinement Phase

The second loop in Algorithm 2 shows the second phase of the proposed haplotype reconstruction approach. The idea is to iteratively find the fragment that contributes most to the MEC score and reassign it to the opposite partition. This process repeats as long as the MEC score improves. Our experimental results show that the first phase of the algorithm performs most of the optimization in terms of MEC improvements, leaving minimal improvements for the second phase.
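The refinement loop depends only on being able to score a partition, so a compact sketch of an MEC computation and of Phase II is given below; the helper names and the consensus-based MEC scoring are assumptions of the example, not code from FastHap.

```python
def consensus(group, frags, n_sites):
    """Per-column majority call of a fragment group ('-' where no fragment covers the column)."""
    hap = []
    for j in range(n_sites):
        calls = [frags[f][j] for f in group if frags[f][j] != '-']
        hap.append('-' if not calls else
                   ('0' if calls.count('0') >= calls.count('1') else '1'))
    return ''.join(hap)


def fragment_mec(frag, hap):
    """Number of covered sites where a fragment disagrees with a haplotype."""
    return sum(1 for a, b in zip(frag, hap) if a != '-' and b != '-' and a != b)


def refine(C1, C2, frags):
    """Phase II sketch: move the fragment with the largest MEC contribution to the
    opposite set for as long as the overall MEC score improves."""
    n_sites = len(frags[0])

    def total_mec():
        h1, h2 = consensus(C1, frags, n_sites), consensus(C2, frags, n_sites)
        return (sum(fragment_mec(frags[f], h1) for f in C1) +
                sum(fragment_mec(frags[f], h2) for f in C2))

    best = total_mec()
    while True:
        h1, h2 = consensus(C1, frags, n_sites), consensus(C2, frags, n_sites)
        worst = max(C1 | C2, key=lambda f: fragment_mec(frags[f], h1 if f in C1 else h2))
        src, dst = (C1, C2) if worst in C1 else (C2, C1)
        src.discard(worst); dst.add(worst)
        new = total_mec()
        if new >= best:                 # no improvement: undo the move and stop
            dst.discard(worst); src.add(worst)
            return best
        best = new
```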

3.2.5 Fragment Purging

Since the complexity of FastHap is a function of the number of fragments in the fragment matrix, it is reasonable to attempt to minimize the number of such fragments by eliminating any potential redundancy prior to execution of the main algorithm. Therefore, FastHap uses a preprocessing phase during its initialization to combine those fragments that are highly similar. Fortunately, the inter-fragment distance measure provides a means to assess similarity between every two fragments. The criterion for combining two fragments fi and fk is based on the inter-fragment distance ∆(fi, fk) and a given threshold α. The two fragments are merged if

$$
\Delta(f_i, f_k) \leq \alpha
\qquad (3.4)
$$

The purging process is straightforward: it eliminates the shorter of the two fragments from the fragment matrix. The value of α needs to be set based on the quality of the data. For the dataset used in the experiments in this chapter, α was set experimentally; the experiments indicated that α = 0.2 provides the best performance.
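A minimal sketch of the purging step follows, under the simplifying assumption that "merging" two near-duplicate fragments amounts to dropping the shorter one (the fragment covering fewer sites), as the text describes; the function name is invented for the example.

```python
def purge_fragments(frags, dist, alpha=0.2):
    """Drop the shorter fragment of every pair whose inter-fragment distance is at most alpha."""
    removed = set()
    for i in range(len(frags)):
        for k in range(i + 1, len(frags)):
            if i in removed or k in removed:
                continue
            if dist(frags[i], frags[k]) <= alpha:
                cov_i = sum(c != '-' for c in frags[i])
                cov_k = sum(c != '-' for c in frags[k])
                removed.add(k if cov_k <= cov_i else i)
    return [f for idx, f in enumerate(frags) if idx not in removed]
```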

3.3 Validation

HuRef [Ven14], a publicly available dataset, was used to demonstrate the effectiveness of FastHap for individual haplotype reconstruction. The goal was to assess the performance of FastHap in terms of both accuracy and speed in comparison with HapCut [BB08] and the greedy algorithm in [LSN07]. The main reason for choosing these two algorithms was that they have been historically popular in terms of accuracy and computational complexity. All experiments were run on a Linux x86 server with 16 CPU cores of 2.7GHz and 16GB of RAM. Each algorithm performed per-block haplotype reconstruction, where each block consisted of the reads that do not cross adjacent blocks. Although haplotype assembly solutions cannot do better than a random guess between two consecutive variant sites that do not share any fragments, the effort in this work was to provide algorithms that are appropriate for longer reads at each end of a paired alignment and ample insert size, in order to minimize disconnection between different haplotype blocks.

3.3.1 Dataset

The HuRef dataset used for analysis in this chapter contains reads for all 22 chromosomes of an individual, J.C. Venter. The data includes 32 million DNA short reads generated by the Sanger sequencing method, with 1.85 million genome-wide heterozygous sites. Many of the reads are fairly short, approximately 15bp (each end), while tens of thousands of reads are long enough to cover more than two SNP sites and can be used for haplotype assembly purposes. In fact, many fragments within each block span several hundred SNP sites due to the paired-end nature of the aligned reads. The fragment matrix used for haplotype assembly was generated from short reads aligned with the paired-end method, with each end varying in length from 15bp to 200bp, while the insert length follows a normal distribution with a mean of 1000.

Figure 3.2(a) shows the read coverage for each chromosome. Read coverage numbers are calculated by taking an average over the coverage values of all SNP sites within each chromosome. The coverage varies from 6.49 reads for chromosome 19 to 8.72 reads for chromosome 3. The average genome-wide coverage across all chromosomes is 7.43. Figure 3.2(b) shows the distribution of the coverage for chromosome 20 (as an example), which includes a total of 39767 SNP sites. The coverage numbers range from 1 to 20 reads. Only 2 SNP sites had a coverage of 20. The average coverage for chromosome 20 was 6.83.

[Figure 3.2: Coverage of HuRef dataset. (a): Coverage for each chromosome; numbers vary from 6.49 to 8.72 for various chromosomes with an average genome-wide coverage of 7.43. (b): Histogram of coverage for chromosome 20 as an example; the y-axis shows the number of SNPs with each specific coverage shown on the x-axis.]

Figure 3.3 shows several statistics on the haplotype length of various chromosomes in the HuRef dataset. Figure 3.3(a) shows the chromosome-wide haplotype length, equivalently the total number of SNP sites, for each chromosome. As mentioned previously, each chromosome is divided into non-overlapping blocks. The haplotype length of such blocks may vary significantly from one chromosome to another. For example, Figure 3.3(b) shows the distribution of haplotype length for a subset of chromosomes with 'small', 'medium', and 'large' haplotypes. For instance, chromosome 8 has a number of blocks spanning over 2500 SNPs. In contrast, haplotypes in chromosome 18 barely exceed 1000 SNP sites.

In addition to running FastHap on real HuRef data, we constructed several simulated read matrices based on HuRef data [BB08]. A simulated dataset based on real data allows us to assess the performance of the proposed algorithm and probe its capabilities by changing various parameters (e.g., error rate, coverage, and haplotype length or block width). To assess the accuracy of FastHap, a pair of chromosome copies was simulated based on real fragments and consensus SNP sites provided by HuRef data.

[Figure 3.3: Chromosome-wide haplotype length for each chromosome (a) and histogram of per-block haplotype length for chromosomes 8, 17, and 18 as examples of chromosomes with 'small', 'medium', and 'large' blocks respectively.]

The fragment matrix for each chromosome of the HuRef data was suitably modified to first generate an 'error free' matrix. This was accomplished by modifying alleles in each fragment such that it perfectly matches a predefined haplotype. In order to introduce errors into the fragment matrix, each variant call was flipped with a probability of ε ranging from 0 to 0.25. The fragment matrix was also modified to produce matrices of different coverage levels. Another change to the simulated fragment matrix was to generate blocks of varying haplotype length, ranging from 200 to 1000 SNPs. Such matrices were then used to examine how the performance of the different algorithms (i.e., FastHap, Greedy, HapCut) changes as a result of changes in error rate, coverage, and haplotype length.
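The error-injection step used for the simulated matrices can be sketched as follows; the string encoding of fragments and the function name are assumptions carried over from the earlier sketches.

```python
import random

def inject_errors(frags, eps, seed=0):
    """Flip each non-missing variant call with probability eps, leaving '-' entries untouched."""
    rng = random.Random(seed)
    noisy = []
    for frag in frags:
        flipped = [('1' if c == '0' else '0') if c != '-' and rng.random() < eps else c
                   for c in frag]
        noisy.append(''.join(flipped))
    return noisy
```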

3.3.2 Results

Table 3.1 shows speed and accuracy results for all chromosomes of the HuRef dataset using Greedy, HapCut, and FastHap. As can be observed from the timing values, FastHap is significantly faster than both Greedy and HapCut. In particular, FastHap is up to 16.4 times faster than HapCut and up to 15.1 times faster than Greedy.

[Figure 3.4: Effect of error rate and coverage on performance of FastHap, Greedy, and HapCut. The analysis was performed on chromosome 20 (randomly selected) of the HuRef dataset. MEC of the three algorithms under comparison as a function of error rate (a); execution time of the algorithms as a function of coverage (b).]

The average speedup achieved by FastHap is 7.4 and 8.1 compared to Greedy and HapCut respectively. In terms of accuracy performance, FastHap achieves 35.4% and 1.9% improvement in reducing MEC scores compared to Greedy and HapCut respectively.

A number of parameters affect the speed performance of the different algorithms. In particular, the number of SNP sites within each fragment matrix is an important factor in many well-known algorithms such as HapCut. One advantage of FastHap is that its performance is primarily influenced by the number of fragments in the fragment matrix rather than the number of SNP sites. That is, a higher read coverage allows FastHap to generate better accuracy without significant impact on its running time. In contrast, as the haplotype length grows, HapCut runs very slowly compared to FastHap. As shown in Table 3.1, HapCut is very slow when applied to chromosome 8, primarily due to the large haplotype length. This is also confirmed through Figure 3.3(b), which shows that chromosome 8 contains blocks that span over 2500 SNPs. In contrast, chromosome 18, for example, can be reconstructed much faster when HapCut is utilized. Figure 3.3(b) shows that most of the blocks for chromosome 18 span less than 1000 SNP sites.

[Figure 3.5: Speed performance of FastHap, Greedy, and HapCut as a function of haplotype length. Analysis was performed on chromosome 20 (randomly selected) of the HuRef dataset. Execution time as a function of haplotype length (a); amount of speedup achieved by FastHap compared to Greedy and HapCut (b).]

Figure 3.4(a) shows the MEC score per variant call versus the simulated error rate obtained by each of the three algorithms. The average MEC (normalized by the number of variant calls) was 2.48, 2.56, and 2.86 for FastHap, HapCut, and Greedy respectively. The amount of improvement in MEC using FastHap was 13% and 2.8% compared to Greedy and HapCut respectively. Figure 3.4(b) shows the running time of the three algorithms as the coverage varies from 5 to 20. For this experiment, the fragment matrix was carefully modified to obtain the coverage needed for the analysis. Furthermore, the obtained matrix was first made 'error free'. We then flipped the variant calls with a probability of ε = 0.25 for this analysis.

In order to assess the running time of the different algorithms with respect to changes in haplotype length, fragment matrices with different numbers of columns were built as explained previously in Section 3.3.1. Figure 3.5(a) shows the execution time of the three algorithms as the partial haplotype length grows from 200 to 1000 SNPs. For this analysis, an error rate of ε = 0.25 was used. We note that the results are shown only for one block of data. It can be observed that the running time of HapCut increases significantly as the block length grows. That is, while HapCut can build a partial haplotype of length 200 in 25 seconds, its running time increases to 784 seconds when the length of the haplotype increases to 1000 SNPs.

In order to demonstrate the superiority of the FastHap partitioning algorithm over a random partitioning, a subset of the dataset was selected at random. We ran both the FastHap and random partitioning algorithms on the same fragment matrix 10 times and calculated the percentage of improvement in MEC achieved by FastHap. The improvement numbers ranged from 12.17% to 31.64% with an average improvement of 19.13%.

3.4 Discussion and Conclusion

In this chapter, the design, implementation, and validation of FastHap were presented. A novel dissimilarity metric was introduced to quantify inter-fragment distance based on the contribution of individual fragments in building a final haplotype. The notion of a fuzzy conflict graph was presented to model haplotype reconstruction as a max-cut problem. A fast heuristic fragment partitioning technique was introduced using the graph model. The technique was shown to lower the computational complexity of haplotype reconstruction dramatically compared to the state-of-the-art algorithms, while moderately improving on the accuracy of such algorithms. We compared FastHap with two well-known haplotype reconstruction algorithms, namely Levy's greedy algorithm and HapCut. The greedy algorithm is historically known for its high speed, and it also outperforms the accuracy of other computationally simple and greedy algorithms such as FastHare [PS04]. HapCut, in contrast, is popular for its high accuracy, but demands much higher computational resources compared to Greedy. The experiments showed that FastHap is one order of magnitude faster than HapCut and is up to 7 times faster than the greedy approach.

Table 3.1: Comparison of FastHap with Greedy and HapCut in terms of accuracy (MEC) and execution time using the HuRef dataset. FastHap achieves speedups of 16.4 and 15.1 compared to HapCut and Greedy respectively and is 1.9% and 35.4% more accurate than HapCut and Greedy respectively. Statistics on coverage and haplotype length are shown in Figure 3.2 and Figure 3.3 and further discussed in Section 3.3.1.

Chr | Time (minutes): Greedy, HapCut, FastHap | Speedup using FastHap: vs. Greedy, vs. HapCut | MEC: Greedy, HapCut, FastHap
1  | 606, 880, 183   | 3.3, 4.8   | 29657, 19750, 19423
2  | 1001, 2446, 149 | 6.7, 16.4  | 22980, 14677, 14220
3  | 1809, 1053, 188 | 9.6, 5.6   | 16878, 10738, 11794
4  | 542, 694, 63    | 8.6, 11.0  | 18153, 11931, 11812
5  | 1381, 3229, 282 | 4.9, 11.4  | 16590, 10630, 10362
6  | 681, 750, 109   | 6.2, 6.9   | 15587, 9992, 9870
7  | 456, 604, 76    | 6.0, 7.9   | 17402, 11290, 11245
8  | 5052, 4514, 334 | 15.1, 13.5 | 14887, 9845, 10830
9  | 2006, 1747, 293 | 6.8, 6.0   | 13812, 9318, 9204
10 | 667, 1445, 170  | 3.9, 8.5   | 15291, 9906, 9796
11 | 332, 288, 69    | 4.8, 4.2   | 12906, 8294, 8091
12 | 1303, 1638, 165 | 7.9, 9.9   | 12630, 8297, 7467
13 | 428, 761, 158   | 2.7, 4.8   | 9312, 6131, 6143
14 | 3315, 1919, 383 | 8.7, 5.0   | 9734, 6360, 5725
15 | 907, 1137, 208  | 4.4, 5.5   | 13988, 9783, 9695
16 | 157, 248, 42    | 3.7, 5.9   | 12621, 8354, 8215
17 | 2223, 2790, 246 | 9.0, 11.3  | 11157, 7398, 7386
18 | 698, 798, 87    | 8.0, 9.2   | 8578, 5043, 4846
19 | 309, 501, 41    | 7.5, 12.2  | 8214, 5497, 4886
20 | 326, 348, 32    | 10.2, 10.9 | 5752, 3784, 3437
21 | 482, 154, 48    | 10.0, 3.2  | 6611, 4715, 4707
22 | 535, 128, 39    | 13.7, 3.3  | 8295, 5864, 5875
Overall | 25217, 28074, 3365 | 7.4, 8.1 | 301035, 197597, 195029
(Overall row: time and MEC values are sums over all chromosomes; speedup values are averages.)

CHAPTER 4

Association Rule Learning for Diploid Haplotyping

4.1 Introduction

Advances in DNA sequencing technologies have led to a generation of increasingly accurate and affordable datasets, which enable single individual phasing for more accurate diagnostics. Phasing is extremely important for genetic studies because haplotypes, which can be constructed from sequencing data, are associated with the presence/absence of genetic diseases [Lan13]. DNA phasing in diploids (i.e., humans) has proved effective in many application areas in individual as well as population genetics [TBT11]. Haplotype assembly or DNA phasing based on high-throughput sequencing data has been studied in the past few years [BB11] based on experimental and computational methods. The topic, however, remains an active research area due to challenges associated with both experimental and computational approaches to haplotype assembly. While experimental phasing techniques are less practical on large datasets, computational approaches have received more attention because of their practical feasibility and cost-efficiency. As sequencing technologies continue to provide larger datasets with higher coverage and longer reads, new methods are warranted to deal with the computational complexity of the data.

Chapter 3 presented the development of FastHap, a highly scalable diploid haplotyping framework, which performs fragment partitioning based on a new inter-fragment distance metric. Although FastHap demonstrated fast and accurate results compared to other algorithms, it is dependent on the input data in terms of coverage, read length, etc. Some factors, such as coverage and read length, are improving due to advancements in sequencing technologies. Other factors, such as the similarity/dissimilarity between pairs of SNPs, are affected by the amount of overlap between each pair of fragments, and FastHap has limited ability to reveal those similarities. Therefore, we aim to investigate the development of another diploid haplotype assembly framework based on the concept of association rule mining. This approach is inspired by advances in the field of association analysis. A large amount of data accumulated from day-to-day operations in many different businesses is commonly analyzed through a methodology known as association analysis. Discovering interesting relationships hidden in large datasets is achievable using this approach. The focus is on discovering strong rules using some measures of interestingness. Mining for association rules that unveil complex dependencies among items in large databases of sales transactions has been described as an important data mining problem [Ols17, KK16, BMU97, KK06, HGN00]. In market basket analysis, association rules identify the set of items that are most often purchased with another set of items. For example, an association rule may state that "95%" of customers who bought items 'A' and 'B' also bought 'C' and 'D', and denote this relationship as {A, B} → {C, D}. Association rule mining has been utilized in many other application domains such as web usage mining [SCD00], intrusion detection [LSM99], production process improvement [KRM13], and in bioinformatics [FHT04] for gene expression data analysis [BBJ02, CH03] and protein-protein interaction analysis [PRG09]. To the best of our knowledge, however, devising an association rule learning framework for haplotype phasing has not been studied in the past. The motivation for developing an association-rule-based approach for haplotype phasing is that DNA short reads that are drawn from the same haplotype maintain dependencies among the SNP sites that they cover. Consequently, the hypothesis is that discovering such dependencies from a collection of short reads will allow us to assemble the original haplotypes more accurately.

37 4.1.1 Contributions and Summary of Results

The following contributions are made through design, development, and validation of ARHap.

• A novel approach, called ARHap, shown in Figure 4.1 and Algorithm 3, is presented for reconstructing haplotypes based on association rules generated from a set of short reads. The framework is composed of two main modules or processing phases. In the first phase, called association rule learning, strong patterns are discovered from the dataset provided by the individual's sequencing data. In the next phase, called haplotype reconstruction, an approach for utilizing the strong rules produced in the first phase is developed to construct haplotypes at individual SNP sites. Similar to frequent itemsets in market basket analysis, we refer to frequent alleles in each SNP position, or combined frequent alleles in multiple sites, as a frequent snpset. We note that the term frequent refers to the frequent allele in a collection of fragments for an individual and should not be confused with the frequent allele at a SNP site in a population.

• An approach for incremental haplotype reconstruction will be presented. This enables us to generate association rules according to the unreconstructed SNP sites during each round of the algorithm. The two modules are synergistically designed for efficient data processing. During each round, the association rule learning module generates rules while constraining the length of the rules (i.e., the number of participating snpsets) and limiting the rules to those that contribute to the reconstruction of sites that remain unreconstructed thus far. The framework begins by generating rules that are small (e.g., only those of length 2) and highly strong (i.e., according to minimum support and minimum confidence criteria) and reconstructing haplotypes accordingly. The rule length can increase gradually, and/or the criteria on the strongness of the rules can be adjusted, during subsequent rounds if some SNP sites have remained unreconstructed. This adaptive approach, which uses feedback from the haplotype reconstruction module, eliminates the generation of rules that do not contribute to haplotype reconstruction.

• Theoretical bounds on likelihood of generated rules being erroneous will be derived. This is an important contribution because it allows us to estimate error probability of generated rules based on the amount of sequencing error and parameters of the association rule learning module such as rule length and minimum support criterion.

• It will be demonstrated that standard measures of rule interestingness cannot be directly applied to rule-based haplotype assembly. In particular, a novel mea- sure of confidence will be developed for the rules generated by the association rule learning module. We also show that this new measure of confidence allows defining a robust minimum confidence criterion, which eliminates conflicting rules from being included in the final rule set.

• An optimization problem will be developed to reconstruct haplotypes based on the generated rules such that the amount of disagreement between generated rules and reconstructed haplotypes is minimized.

• A graph model, called a dependency graph, will be developed to not only represent the dependency between individual SNP sites but also encapsulate inter-rule dependencies, so that a chain of association rules can be extracted from the graph for haplotype reconstruction.

• The framework will be evaluated extensively and its effectiveness in reducing switching error of several competing methods will be demonstrated.

We have conducted a comprehensive analysis to assess the performance of our framework in reconstructing haplotypes of varying ploidy levels. Table 4.1 shows the amount of improvement in SWitching Error (SWE) achieved by ARHap in comparison with other methods. It demonstrates that ARHap substantially improves over the other algorithms. Also, in terms of running time, ARHap manages to stay as close as possible to FastHap while improving both the SWE and MEC scores. We will discuss the details of the data and all accuracy measures in Section 4.4.

Table 4.1: Performance of ARHap on SWE

Algorithm    | ARHap | FastHap | HapCut | Greedy
Overall      | 0.12% | 0.26%   | 0.27%  | 0.36%
Improvement  | -     | 53.8%   | 55.5%  | 66.7%

[Figure 4.1: Association rule haplotyping (ARHap) framework. Each round of association rule haplotyping is composed of two phases: an association rule learning phase (matrix binarization, frequent snpset generation with Supp > minSupp and |X ∪ Y| ≤ l, and SNP association rule generation with Confhap > minConf and |Y| = 1) and a haplotype reconstruction phase (dependency graph construction, longest attribute-compatible path extraction, and haplotype update). The rule length l and minSupp are updated between rounds. This two-phase process may continue for multiple rounds until all SNP positions on the haplotype set are reconstructed.]

Algorithm 3 High level description of the ARHap algorithm
Require: SNP sites S, minSuppUpperBound, minSuppLowerBound, fragment matrix X, initial haplotypes H0
Ensure: H
1: H = H0
2: Y = CreateBinaryMatrix(X) {binarization}
3: U = S {initialize U, the list of unreconstructed SNP sites}
4: l = 2 {start with rules of length 2}
5: minSupp = minSuppUpperBound {initialize minSupp}
6: while (U ≠ ∅) do
7:   while (minSupp > minSuppLowerBound) do
8:     SNPSET = GenerateFrequentSnpset(Y, U, l, minSupp)
9:     R = GenerateSNPAssociationRules(SNPSET)
10:    G = ConstructDependencyGraph(R)
11:    P = ExtractLongestAttributePath(G)
12:    H = UpdateHaplotype(P)
13:    U = U \ <ReconstructedSNPSites> {exclude reconstructed SNP sites from further investigation}
14:    minSupp = minSupp − 1
15:   end while
16:   l = l + 1 {increase rule length}
17: end while

4.2 Association Rule Learning

Similar to FastHap, ARHap takes as input a standard matrix called a fragment matrix, composed of short reads obtained by DNA sequencing. In the fragment matrix, each row corresponds to a short read and each column is associated with a SNP site. To extract SNP positions, the short DNA reads are assumed to be mapped onto a reference sequence prior to their being used for haplotype assembly. The alignment is often done using a sequence aligner [LS12, LD10], which may introduce errors into the resulting short reads. After aligning the fragments to the reference sequence, most of the loci have identical values in all the fragments that cover the region. These sites are referred to as homozygous sites, as discussed in Chapter 3. The remaining sites have different values in some of the fragments. Such sites are known as heterozygous sites and are the primary source of information for haplotype reconstruction.

A heterozygous site can be one single base pair, in which case it is known as SNP, or can be more than one base pair. For haplotype reconstruction, ARHap assumes that all homozygous sites are discarded from inclusion in the fragment matrix and we only focus on single variant calls (i.e., SNPs). For each locus of the fragment, only the most popular alleles are maintained and rare alleles are labeled as ‘missing’ and substituted with ‘−’ in the fragment matrix. Let {x1j, x2j, x3j ... , xmj} be the elements associated with the j-th column in fragment matrix X . If for example, the two most popular alleles are ‘A’ and ‘C’, they are encoded in binary. In case there is another allele in addition to ‘A’ and ‘C’ within the same column, that allele is discarded and assigned symbol ‘−’ in the fragment matrix.

Prior to generating association rules, one needs to transform this fragment matrix representation of the reads into a format that is appropriate for association rule learning. Such a matrix transformation is discussed in Section 4.2.1.

4.2.1 Matrix Binarization

The input to an association rule learning problem is often a binary matrix where each row is associated with a transaction and each column corresponds to an item. The terms 'transaction' and 'item' are typically used in market basket analysis. An item can be treated as a binary variable whose value is '1' if the item is present in the transaction and '0' otherwise. A fragment matrix X, however, contains elements xij that are drawn from the alphabet Σ = {0, 1, −}. The first step in developing an association rule framework for haplotyping is therefore to represent X as a fully binary matrix while preserving all information contained in X, i.e., including information about missing values.

Definition 4 (Binary Fragment Matrix). Let Xm×n be a given fragment matrix that contains SNP values, xij ∈ {0, 1, −}, obtained from m fragments, F = {f1, ..., fm}, pertaining to n SNP sites. The matrix Ym×2n, the binary representation of X, is computed as follows.

$$
Y = [\,X^0 \ X^1\,]
\qquad (4.1)
$$

where X⁰ and X¹ are binary matrices of the same dimension as X (i.e., m × n), each element x⁰ij in X⁰ is computed by

$$
x^0_{ij} =
\begin{cases}
1 & \text{if } x_{ij} = 0\\
0 & \text{otherwise}
\end{cases}
\qquad (4.2)
$$

and each element x¹ij in X¹ is given by

$$
x^1_{ij} =
\begin{cases}
1 & \text{if } x_{ij} = 1\\
0 & \text{otherwise}
\end{cases}
\qquad (4.3)
$$

The binarization process in Definition 4 is in fact a 1-to-2 mapping of elements in X to those in Y. The mapping guarantees that all missing values (i.e., '−' symbols) in X are presented as '0's in Y. Furthermore, all non-missing values (i.e., xij = '0'/'1') in X are now presented as '1's, but in designated columns with respect to the SNP site, in Y. This is accomplished by expanding the number of columns of X (i.e., n) to 2 × n in Y so that for each SNP site in X there exist two distinct columns in Y, one for presenting '0's and one for denoting '1's. We note that xij = '−' if and only if yij = yik = 0 where k = n + j.

An example of the binarization process is shown in Figure 4.2. A fragment matrix containing 9 fragments of length 5 and drawn from the haplotype H = {h0, h1}, where h0 = '11111' and h1 = '00000', is shown in Figure 4.2(a). As shown in red in the figure, the fragment matrix carries one erroneous value at x2,2. Figure 4.2(b) shows how '0' cells in X are replaced with '1's in X⁰. Similarly, Figure 4.2(c) shows how cells with a value of '1' in X are uniquely represented in X¹. Finally, the two matrices are column-wise concatenated in Y as shown in Figure 4.2(d).

[Figure 4.2: An example of matrix binarization. The fragment matrix X containing 9 fragments drawn from the haplotype set H = {h0, h1}, where h0 = '11111' and h1 = '00000', is shown in (a). X is decomposed into two matrices X⁰ and X¹ as shown in (b) and (c). The binary fragment matrix Y = [X⁰ X¹] is a column-wise concatenation of X⁰ and X¹, shown in (d).]
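The 1-to-2 mapping of Definition 4 is mechanical enough to be shown directly; the sketch below assumes the same string encoding of fragments used earlier and returns Y as a list of 0/1 rows.

```python
def binarize(frags):
    """Binary fragment matrix Y = [X0 X1] (Definition 4): column j flags a '0' call
    at SNP site j+1, and column n + j flags a '1' call at the same site."""
    n = len(frags[0])
    Y = []
    for frag in frags:
        x0 = [1 if c == '0' else 0 for c in frag]
        x1 = [1 if c == '1' else 0 for c in frag]
        Y.append(x0 + x1)
    return Y
```

For instance, binarize(['101--']) returns [[0, 1, 0, 0, 0, 1, 0, 1, 0, 0]], which corresponds to the second row of Y in Figure 4.2(d).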

Table 4.2: Strong association rules of size l = 2 for the fragment matrix in Figure 4.2.

No.  R            Supp(R)  Conf_con(R)  Conf_hap(R)
1    s1 → s3      0.33     1            1
2    s5 → s3      0.22     1            1
3    s4 → s3      0.22     1            1
4    s2 → s3      0.22     1            1
5    s̄5 → s̄1      0.22     1            1
6    s̄4 → s̄2      0.22     0.67         0.75
7    s̄1 → s̄2      0.22     0.67         0.75
8    s̄4 → s̄1      0.22     0.67         0.75
9    s̄1 → s̄4      0.22     0.67         0.75
10   s̄1 → s̄5      0.22     0.67         0.75
11   s3 → s1      0.33     0.60         0.71
12   s̄2 → s̄4      0.22     0.50         0.67
13   s̄2 → s̄1      0.22     0.50         0.60
14   s3 → s5      0.22     0.40         0.63
15   s3 → s4      0.22     0.40         0.63
16   s3 → s2      0.22     0.40         0.57

45 4.2.2 SNP Association Rules

While the original fragment matrix X carries xij ∈ {0,1,−}, the binary fragment matrix Y contains only binary values, i.e., yij ∈ {0,1}. Each column in Y is not only associated with a particular site but also indicates whether the site contained '0's or '1's in X. For example, the first column of Y in Figure 4.2(d) refers to the fragments that cover the first SNP site (the first column in X) with a value of '0'. Such fragments have a value of '1' in the first column of Y. As a result, each column in Y has a unique attribute with respect to the SNP site to which it relates. These attributes are referred to as SNP attributes and are defined more formally as follows.

Definition 5 (SNP Attribute). For each SNP site j ∈ {1, ..., n} in X, we define two binary attributes s̄j and sj. The attribute s̄j, associated with the j-th column in Y, has '1's for all fragments that cover the j-th SNP site with a '0' in X. Similarly, sj, associated with column j + n in Y, has '1's for all fragments that cover the site with a '1' in X. The set of all such attributes forms the SNP Attribute Set S = {s̄1, s̄2, ..., s̄n, s1, s2, ..., sn}.

Definition 6 (snpset). Given SNP Attribute Set S associated with binary fragment matrix Y, any collection of one or more attributes t ∈ S forms a snpset.

Definition 7 (SNP Association Rule). An association rule, R, is an implication expression of the form X → Y, where X and Y are disjoint snpsets, i.e., X ∩ Y = ∅ and X, Y ⊆ S. The snpsets X and Y are also referred to as the antecedent and consequent of the rule, respectively.

The SNP attribute set for the fragment matrix of Figure 4.2 is given by S = {s̄1, s̄2, s̄3, s̄4, s̄5, s1, s2, s3, s4, s5}. A subset of rules for this dataset is shown in Table 4.2. For example, s1 → s3 is a rule supported by the first 3 fragments. Those fragments cover both s1 and s3 because they all have a value of '1' in the columns associated with s1 and s3 (i.e., the 6th and 8th columns in Y).

46 4.2.3 Measures of Rule Interestingness

Generating all possible rules is computationally infeasible and introduces many redundant rules. Therefore, only rules that are interesting with respect to a particular application are desirable. In this chapter, several measures of interestingness are defined for use in the ARHap framework. Later in this chapter, we will elaborate on how these measures are used in the introduced framework to optimize the number and quality of generated rules.

The strength of a rule can be measured using support and confidence.

Definition 8 (Support). Support refers to how often a rule R: X → Y is applicable to the particular fragment matrix and thus represents the probability that a fragment in Y covers the attributes in both snpsets X and Y. The support of rule X → Y is given by

$$
Supp(X \rightarrow Y) = \frac{\delta(X \cup Y)}{m}
\qquad (4.4)
$$

where δ(X ∪ Y) denotes the number of fragments that cover the attributes in both X and Y, and m represents the total number of fragments in the input matrix.

For example, for the fragment matrix in Figure 4.2, rule s1 → s3 has a support of 3/9 = 0.33 because there are 3 fragments (i.e., f1, f2, and f3) that cover both positions j = 1 and j = 3 with a value of '1'. As another example, consider the case where X = {s̄2, s̄3} and Y = {s̄4}. In this case, the rule {s̄2, s̄3} → s̄4 has a support value of 1/9 because there is only one fragment (i.e., f6) in X that covers sites j = {2, 3, 4} with a value of '0'. As a result, rule {s̄2, s̄3} → s̄4 is less frequent compared to s1 → s3.

The role of the Frequent Snpset Generation block in Figure 4.1 is to identify rules with sufficiently high support. In Section 4.2.4, we will discuss a minSupp threshold for eliminating infrequent rules.

Based on the conventional definition of confidence, confidence determines how frequently the SNP attributes in Y appear in fragments that contain X. Therefore, the conventional definition of confidence computes the probability that a fragment covers the attributes in Y given that it covers the attributes in X. We refer to this measure as Conf_con and compute it as follows.

$$
Conf_{con}(X \rightarrow Y) = \frac{\delta(X \cup Y)}{\delta(X)}
\qquad (4.5)
$$

where δ(X) denotes the number of fragments covering the attributes in X. For example, rule s3 → s2 has a confidence measure of 2/5 according to this definition. As shown in Figure 4.2(a), there are 5 fragments that cover position j = 3 with a value of '1'. Out of these five fragments, only 2 fragments cover position j = 2 with a value of '1'. A major problem with the conventional definition of confidence is that it discards the impact of 'missing' values when estimating the confidence of a rule. In other words, this definition assumes that all three fragments that do not cover position j = 2 with a '1' cover this site with a '0'. We note that the likelihood of a 'missing' value being '0' is not higher than it representing a '1'. As a result, the conventional definition of confidence is biased by 'missing' values in the consequent of the rule. Another major issue with this definition is that, because the number of '−' symbols associated with the consequent of a rule is not known a priori (or at least it is not fixed for all SNP sites), we cannot come up with a meaningful threshold on minimum confidence (i.e., minConf) when generating strong association rules. Due to these limitations of conventional confidence, we define a more robust measure of confidence for haplotype phasing in this chapter. This new measure of confidence is referred to as Conf_hap and is defined as follows.

Definition 9 (Confidence). The confidence of a rule R: X → Y is given by

$$
Conf_{hap}(X \rightarrow Y) = \frac{\delta(X) - \delta(X \cup \bar{Y})}{2\,\delta(X) - \delta(X \cup Y) - \delta(X \cup \bar{Y})}
\qquad (4.6)
$$

where Ȳ denotes the complement of the consequent snpset Y.

For example, rule s3 → s2 in Table 4.2 has a confidence value of (5 − 1)/(10 − 2 − 1) = 0.57 according to this new definition. Later in this chapter we will show that this definition is robust enough that one can avoid generating conflicting rules by enforcing a predefined minimum confidence threshold in the SNP Association Rule Generation phase.
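A small sketch of the support and confidence computations on the binary matrix Y from the binarization sketch above follows. Attributes are represented by 0-based column indices, so columns 0..n−1 correspond to s̄1..s̄n and columns n..2n−1 to s1..sn, following Definition 5; the helper names are invented for the example.

```python
def supporting(Y, attrs):
    """Rows of Y (fragments) that have a '1' in every column listed in attrs."""
    return [r for r, row in enumerate(Y) if all(row[a] == 1 for a in attrs)]


def supp(Y, X, T):
    """Support (4.4) of rule X -> T: fraction of fragments covering all attributes of X and T."""
    return len(supporting(Y, X | T)) / len(Y)


def conf_con(Y, X, T):
    """Conventional confidence (4.5); assumes at least one fragment supports X."""
    return len(supporting(Y, X | T)) / len(supporting(Y, X))


def conf_hap(Y, X, T, n):
    """Haplotype-aware confidence (4.6). The complement of attribute a is (a + n) % (2 * n)."""
    T_bar = {(a + n) % (2 * n) for a in T}
    d_x, d_xy, d_xyb = (len(supporting(Y, X)),
                        len(supporting(Y, X | T)),
                        len(supporting(Y, X | T_bar)))
    denom = 2 * d_x - d_xy - d_xyb
    return (d_x - d_xyb) / denom if denom else 0.0
```

On the matrix of Figure 4.2, conf_hap(binarize(frags), {7}, {6}, 5) evaluates to 4/7 ≈ 0.57, reproducing the s3 → s2 row of Table 4.2.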

4.2.4 Rule Generation Criteria

ARHap generates rules that simultaneously satisfy a minimum support and a minimum confidence criterion. Rule generation has two separate steps. In the first step, a minimum support threshold is applied to find frequent snpsets of a certain length (i.e., Supp(X ∪ Y) ≥ minSupp and |X ∪ Y| ≤ l) in the fragment matrix Y. Then a minimum confidence constraint is applied to these frequent snpsets to form rules. Finding all frequent snpsets can be difficult because it involves finding all possible combinations of SNP attributes. The set of all possible snpsets has a size of 2^l − 1, where 2 ≤ l ≤ 2n. Although the size of the set grows exponentially in the number of attributes, an efficient search is possible using the anti-monotonicity property of support, which guarantees that all subsets of a frequent snpset are also frequent and thus all supersets of an infrequent snpset must also be infrequent. Exploiting this property, efficient algorithms such as Apriori [AS94], Eclat [Zak00], or FP-Growth [HPY00] can find all frequent snpsets.

As discussed previously, ARHap introduces constraints during association rule learning, as shown in Figure 4.1, to limit the length of the rules to a value of l and to adjust the minSupp threshold in each iteration. Given that the ARHap framework constructs haplotypes incrementally, the constraint |X ∪ Y| ≤ l allows generating only a small set of rules during each iteration of the framework. During each iteration, snpsets identified as infrequent during previous iterations are used to eliminate superset snpsets, which are also naturally infrequent as discussed previously.

ARHap criteria for choosing frequent snpsets and creating rules of the form X → Y are given by (4.7), (4.8), (4.9), and (4.10).

$$
Supp(X \rightarrow Y) \ \geq\ \frac{1}{m} + \alpha
\qquad (4.7)
$$

where 0 ≤ α < 1. This criterion guarantees that each rule is supported by at least ‘1 + mα’ fragments.

$$
Conf_{hap}(X \rightarrow Y) \ >\ \frac{1}{2}
\qquad (4.8)
$$

|Y | = 1 (4.9)

|X ∪ Y | ≤ l (4.10)

Definition 10 (Strong Rule). A given rule R: X → Y is strong if it satisfies the support, confidence, and consequent length criteria in (4.7), (4.8), and (4.9). We refer to any rules that violate any of these criteria as weak rules.
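Putting the criteria (4.7)-(4.9) together, strong rules of length at most l can be enumerated as in the brute-force sketch below. The real framework grows frequent snpsets incrementally with Apriori-style pruning, as discussed above; this sketch reuses the hypothetical supporting and conf_hap helpers from the earlier example, and the parameter defaults are illustrative only.

```python
from itertools import combinations

def strong_rules(Y, n, l=2, alpha=None):
    """Enumerate rules X -> {t} with |X ∪ {t}| <= l that satisfy the support criterion (4.7),
    the confidence criterion (4.8), and the single-attribute consequent criterion (4.9)."""
    m = len(Y)
    if alpha is None:
        alpha = 1.0 / m                       # value discussed in Section 4.2.4.1
    min_supp = 1.0 / m + alpha
    rules = []
    for size in range(2, l + 1):
        for snpset in combinations(range(2 * n), size):
            if len(supporting(Y, set(snpset))) / m < min_supp:
                continue                      # infrequent snpset: skip all rules built from it
            for t in snpset:                  # |Y| = 1: single-attribute consequent
                X = set(snpset) - {t}
                if conf_hap(Y, X, {t}, n) > 0.5:
                    rules.append((frozenset(X), t))
    return rules
```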

4.2.4.1 Minimum Support Criterion

This section discusses how the support criterion in (4.7) relates to the error rate, ε, in the input dataset (i.e., errors in the short reads). Understanding this relation is important because the minimum support criterion should be sufficiently high to eliminate rules that are inferred based on erroneous data. For example, in the fragment matrix shown in Figure 4.2(a), s1 → s̄2 is an association rule that exists due to the error in fragment f2 at the second SNP site (i.e., x2,2 = 0 due to error). Therefore, it is desirable to exclude the snpset X ∪ Y = {s1, s̄2} during the frequent snpset generation phase.

Definition 11 (Erroneous Rule). An association rule R: X → Y is said to be erroneous if fragments supporting X ∪ Y contain any errors at the SNP attributes of the rule.

In the example shown in Figure 4.2, a snpset that includes s̄2 and is supported by fragment f2 forms erroneous rules. Examples of such rules are s1 → s̄2; {s1, s3} → s̄2; s3 → s̄2; {s1, s̄2} → s3; and s̄2 → s1.

Lemma 2. The likelihood of a rule R: X → Y of length l (i.e., |X ∪ Y| = l) being erroneous is given by

$$
p(E \mid R) = 1 - \left(1 - \epsilon^{\delta(X \cup Y)}\right)^{l}
\qquad (4.11)
$$

where ε denotes the error rate in the input dataset.

Proof. Let T = {t1, ..., tl} be the set of l SNP attributes in X ∪ Y, where T ⊆ S. Assume all fragments that support X → Y form a snippet F′ ⊆ F. For each attribute tj in T, all fragments in F′ should have the same value (i.e., '1') in Y. Therefore, all fragments in F′ must have an error in attribute tj in order for tj to be erroneously included in the rule. Given that each attribute value (i.e., tj = 1) is supported by δ(X ∪ Y) fragments within the snippet, the probability that attribute tj contains an error and is included in the rule is

$$
p(E \mid R, t_j) = \epsilon^{\delta(X \cup Y)}
\qquad (4.12)
$$

Furthermore, the likelihood of attribute tj being included in rule R with no error is

$$
p(\bar{E} \mid R, t_j) = 1 - \epsilon^{\delta(X \cup Y)}
\qquad (4.13)
$$

The probability that at least one of the attributes in T is erroneous and included in the rule can be computed by

$$
p(E \mid R) = 1 - \prod_{j=1}^{l} p(\bar{E} \mid R, t_j) = 1 - \left(1 - \epsilon^{\delta(X \cup Y)}\right)^{l}
\qquad (4.14)
$$

Lemma 3. The probability that a strong rule R: X → Y of length l is erroneous is upper bounded by 1 − (1 − ε^(αm+1))^l. That is,

$$
p(E \mid R) \ \leq\ 1 - \left(1 - \epsilon^{\alpha m + 1}\right)^{l}
\qquad (4.15)
$$

Proof. From (4.4) and (4.7), we have:

$$
\frac{\delta(X \cup Y)}{m} \ \geq\ \frac{1}{m} + \alpha
\quad\Longrightarrow\quad
\delta(X \cup Y) \ \geq\ m\alpha + 1
\qquad (4.16)
$$

The proof follows by combining (4.16) with the error likelihood given by (4.11) in Lemma 2.

A desirable value for the variable α is 1/m, which results in generating rules that are supported by at least two fragments (i.e., δ(X ∪ Y) ≥ 2). In this case, the upper bound on the likelihood of the rule being erroneous is given by

$$
p(E \mid R) \ \leq\ 1 - \left(1 - \epsilon^{2}\right)^{l}
\qquad (4.17)
$$

For example, for a given fragment matrix with a maximum sequencing error of ε = 2%, if we set α = 1/m, the likelihood of any rule of size 3 (i.e., l = 3) being erroneous is at most 0.12%.
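Plugging these numbers into (4.17) makes the bound explicit:

$$
p(E \mid R) \ \leq\ 1 - \left(1 - \epsilon^{2}\right)^{l}
= 1 - \left(1 - 0.02^{2}\right)^{3}
= 1 - (0.9996)^{3}
\approx 0.0012 = 0.12\%.
$$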

Lemma 4. The lower bound on the probability of a rule R: X → Y of size l being erroneous is 1 − (1 − ε^C)^l. That is,

$$
p(E \mid R) \ \geq\ 1 - \left(1 - \epsilon^{C}\right)^{l}
\qquad (4.18)
$$

where C denotes the amount of DNA read coverage.

Proof. The maximum number of fragments that can support a rule is naturally limited by the amount of read coverage, C. Therefore, we can write:

$$
\delta(X \cup Y) \ \leq\ C
\qquad (4.19)
$$

$$
\epsilon^{\delta(X \cup Y)} \ \geq\ \epsilon^{C}
\qquad (4.20)
$$

$$
1 - \left(1 - \epsilon^{\delta(X \cup Y)}\right)^{l} \ \geq\ 1 - \left(1 - \epsilon^{C}\right)^{l}
\qquad (4.21)
$$

$$
p(E \mid R) \ \geq\ 1 - \left(1 - \epsilon^{C}\right)^{l}
\qquad (4.22)
$$

Theorem 5. The probability that a strong rule R: X → Y of length l is erroneous is bounded as follows:

$$
1 - \left(1 - \epsilon^{C}\right)^{l} \ \leq\ p(E \mid R) \ \leq\ 1 - \left(1 - \epsilon^{\alpha m + 1}\right)^{l}
\qquad (4.23)
$$

Proof. The proof follows from Lemma 3 and Lemma 4.

4.2.4.2 Minimum Confidence Criterion

An important property of our association rule learning algorithm is that it guarantees that the generated rules do not contradict one another during haplotype reconstruction. Because our ultimate goal is to reliably infer haplotypes from the extracted rules, it is important that two rules with the same antecedent do not infer conflicting consequents. For example, a rule s1 → s2 suggests that if a haplotype has a value of '1' at the first SNP site, then the haplotype should have a value of '1' at the second SNP site as well. Such a rule is in conflict with the rule s1 → s̄2, which suggests that the second SNP site should be '0' if the first site is '1'. In this section, we show that the proposed minimum confidence criterion in (4.8) guarantees that conflicting rules are not generated by the framework.

Definition 12 (Conflicting Rules). Two rules R1: X1 → Y1 and R2: X2 → Y2 are called conflicting rules if X1 = X2 and Y1 = Y¯2.

Lemma 6. A rule R: X → Y that satisfies the confidence criterion in (4.8) has a higher support value than its conflicting rule, i.e., Supp(X → Y) > Supp(X → Ȳ).

Proof.

$$
\begin{aligned}
Conf_{hap}(X \rightarrow Y) \ &>\ \frac{1}{2}\\[2pt]
\frac{\delta(X) - \delta(X \cup \bar{Y})}{2\,\delta(X) - \delta(X \cup Y) - \delta(X \cup \bar{Y})} \ &>\ \frac{1}{2}\\[2pt]
2\,\delta(X) - 2\,\delta(X \cup \bar{Y}) \ &>\ 2\,\delta(X) - \delta(X \cup Y) - \delta(X \cup \bar{Y})\\[2pt]
\delta(X \cup Y) \ &>\ \delta(X \cup \bar{Y})\\[2pt]
\frac{\delta(X \cup Y)}{m} \ &>\ \frac{\delta(X \cup \bar{Y})}{m}\\[2pt]
Supp(X \rightarrow Y) \ &>\ Supp(X \rightarrow \bar{Y})
\end{aligned}
\qquad (4.24)
$$

Theorem 7. The criterion in (4.8) guarantees that the set of generated rules does not contain conflicting rules.

Proof. Proof by contradiction. Assume both rules X → Y and X → Ȳ are included in the final rule set. Because X → Y is included, Lemma 6 implies that Supp(X → Y) > Supp(X → Ȳ). Furthermore, because X → Ȳ is in the final set, Supp(X → Ȳ) > Supp(X → Y) according to Lemma 6. These two inequalities contradict each other. Therefore, at most one of the two conflicting rules will be generated by the association rule learning module.

4.2.4.3 Consequent Length Criterion

As enforced by (4.9), rules with more than one attribute in their consequent snpset are eliminated from further processing. This is mainly because a rule whose consequent is a superset of another rule's consequent (with the same antecedent snpset) is weaker than that rule, as shown in Theorem 9.

Lemma 8. If snpset Y2 is a superset of snpset Y1 (i.e., Y1 ⊂ Y2), then the following two statements about support values of Y1 and Y2 are true.

Supp(Y1) > Supp(Y2) (4.25)

Supp(Y¯1) ≤ Supp(Y¯2) (4.26)

Proof. The inequality in (4.25) is the natural anti-monotonicity property of support.

We prove that the collection of snpsets represented by Y¯1 are supported by a smaller number of fragments in Y compared to the snpsets represented by Y¯2 as suggested by (4.26). According to anti-monotonicity property of support, we can write:

Supp(Y1) > Supp(Y2) (4.27)

Assume that Y2 = Y1 ∪ Z. Thus, we have

δ(Ȳ2) = δ(Ȳ1) + δ(Z̄) − δ(Ȳ1 ∪ Z̄)    (4.28)

We note that δ(Z¯) ≥ δ(Y¯1 ∪ Z¯) based on the anti-monotonicity property of support. Therefore,

δ(Ȳ2) ≥ δ(Ȳ1),  and therefore  Supp(Ȳ2) ≥ Supp(Ȳ1)    (4.29)

Theorem 9. If snpset Y2 is a superset of Y1 (i.e., Y1 ⊂ Y2), then rule R1: X → Y1 is stronger than R2: X → Y2.

Proof. Because Y1 ⊂ Y2, inequality (4.25) in Lemma 8 suggests that Supp(Y1) >

Supp(Y2). Thus,

Supp(X ∪ Y1) > Supp(X ∪ Y2)

δ(X ∪ Y1)/m > δ(X ∪ Y2)/m    (4.30)

Supp(X → Y1) > Supp(X → Y2)

Supp(X → Y1) > Supp(X → Y2)

Therefore, R1 has a higher support value than R2. We can also show that Confcon(R1) >

Confcon(R2). Because R1 has both higher support and greater confidence than R2, it is a stronger rule.

4.3 Haplotype Reconstruction

In this section, we describe our approach for constructing haplotypes using the rules generated by our association rule learning module. We assume that the haplotypes are originally initialized to some binary values at random; that is, H⁰ = {h0⁰, h1⁰}.

Problem 1 (Rule-Based Haplotype Reconstruction). Given a set of K strong rules R = {R1, R2, ..., RK}, where Rj: Xj → tj, and an initial haplotype H⁰, our goal is to build haplotypes H = {h0, h1} such that the amount of disagreement error between the rule set and the constructed haplotypes is minimized. The disagreement error between R and H is given by

Λ(H, R) = Σ_{i∈{0,1}} Σ_{j=1}^{K} λ(hi, Rj)    (4.31)

where

λ(hi, Rj) = 1 if hi satisfies Xj but not tj, and λ(hi, Rj) = 0 otherwise.    (4.32)
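The disagreement error in (4.31)–(4.32) translates directly into a short function. The sketch below is illustrative only and uses assumed encodings: a haplotype is a string over {'0','1'}, and a rule is a pair (antecedent, consequent) whose attributes are (site, value) pairs with 0-based site indices.

```python
def satisfies(hap, snpset):
    """True if haplotype `hap` (a '0'/'1' string) has value `val` at every
    (site, val) attribute in `snpset`."""
    return all(hap[site] == val for site, val in snpset)

def disagreement(haps, rules):
    """Overall disagreement error Lambda(H, R) from (4.31): count rules whose
    antecedent is satisfied by a haplotype but whose consequent is not."""
    total = 0
    for hap in haps:                       # h0 and h1
        for antecedent, consequent in rules:
            if satisfies(hap, antecedent) and not satisfies(hap, [consequent]):
                total += 1                 # lambda(h_i, R_j) = 1
    return total

# Example: rule s1 -> s2 (site 0 = '1' implies site 1 = '1').
rules = [([(0, '1')], (1, '1'))]
print(disagreement(['10', '01'], rules))   # 1: h0 satisfies s1 but not s2
```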

Definition 13 (Dependent & Prerequisite Rules). A rule Rk: Xk → tk is said to be dependent on rule Rj: Xj → tj, denoted by Rj ⟹ Rk, if tj ∈ Xk or t̄j ∈ Xk. In other words, a rule is dependent if its antecedent contains another rule's consequent (or its negation). If Rk is dependent on Rj, then rule Rj is called a prerequisite of Rk.

For example, rule s2 → s3 is dependent on rule s1 → s2. Also, a rule of the form s̄2 → s3 is dependent on s1 → s2 because the prerequisite rule s1 → s2 may change the haplotype at the 2nd SNP position, which in turn can result in the haplotype satisfying the antecedent of the dependent rule s̄2 → s3 (i.e., either h0 or h1 may satisfy s̄2 as a result of applying rule s1 → s2, which would change the 2nd SNP position).

In order to develop a haplotype reconstruction approach that minimizes the amount of disagreement between the generated rules and reconstructed haplotypes, we first need to better identify major sources of disagreement or error. We divide all sources of errors into two categories including dependent rules and haplotype-inconsistent rules.

A dependent rule error may occur if a dependent rule is used for haplotype reconstruction prior to its prerequisite rule. For instance, consider two rules R1: s1 → s2 and R2: s2 → s3 and an initial haplotype H⁰ = {h0⁰, h1⁰} where h0⁰ = '010' and h1⁰ = '101'. If we apply R1 and then R2, the resulting haplotype will be h0 = '000' and h1 = '111'. This haplotype has a disagreement error of Λ = 0 because it satisfies both rules. However, if we apply R2 first and then R1, the resulting haplotype will be h0 = '001' and h1 = '110', which disagrees with rule R2 and therefore incurs a disagreement error of Λ = 1.

Lemma 10. Applying a dependent rule right after applying its prerequisite rule does not increase disagreement error.

Proof. Proof is eliminated for brevity.

Lemma 10 suggests that, in order to minimize the amount of error, we should identify a sequence of rules in which dependent and prerequisite rules are adjacent and the prerequisite rule is applied prior to its dependent rule.

Definition 14 (Inconsistent Rules). Two rules Rj: Xj → tj and Rk: Xk → tk are inconsistent if tj = t¯k.

We note that inconsistent (i.e., unreconstructable) rules are different from the conflicting rules defined in Definition 12. Conflicting rules have the same antecedent snpsets and, as discussed previously, the association rule learning module automatically prevents generating them. Inconsistent rules, however, are rules whose consequent snpsets are complements of each other. Inconsistent rules are important because they cannot be applied to the same haplotype

(i.e., h0 or h1) because they result in opposite changes in the same SNP position suggested by their consequent snpset.

Lemma 11. Applying inconsistent rules on different haplotypes does not increase disagreement error.

Proof. Proof is eliminated for brevity.

Motivated by Lemma 10 and Lemma 11, Algorithm 4 is proposed for rule-based haplotype reconstruction. The set of unreconstructed SNP sites, denoted by U, is initially set to indicate that all SNP sites are unreconstructed, i.e., U = {1, 2, ..., n}, where n is the number of columns in the fragment matrix X. The procedure shown in Algorithm 4 repeats during each round of the two-phase (association rule learning and haplotype reconstruction) processing. In other words, every time our framework generates a new set of rules, it calls the procedure in Algorithm 4 to process the newly created rules. As new rules are generated and haplotypes are updated, the set of unreconstructed SNP positions becomes smaller and smaller until all positions are reconstructed and we no longer need to repeat the two phases of the ARHap framework, namely the association rule learning and haplotype reconstruction modules shown in Figure 4.1.

Algorithm 4 Haplotype reconstruction procedure in ARHap.
Require: Set of strong rules R, current haplotype H, set of unreconstructed sites U
Ensure: Updated H, updated U
1: G = (V, E) ← DependencyGraph(R)
2: while (|E| > 1 and U ≠ ∅) do
3:   π ← LongestAttributeConsistentPath(G)
4:   UpdateHaplotype(H, π, U)
5: end while

4.3.1 Overview of the Algorithm

The general approach to haplotype reconstruction from association rules is an iterative process where we choose one rule during each iteration and check whether or not the rule results in updating a particular SNP position on the haplotype, based on the consequent of the rule. In general, if a haplotype satisfies the antecedent of a rule, then we may need to update the haplotype such that it satisfies the consequent of the rule as well. We note that both the antecedent and the consequent of a rule carry information about both the index and the value of SNP sites. For instance, if the current rule is s1 → s2, this means that we may need to change the current haplotype at the second SNP site to '1' (i.e., associated with s2) if the first SNP site has a value of '1' (as indicated by s1). That is, if the haplotype (i.e., h0 or h1) satisfies antecedent s1, then it must satisfy consequent s2. Similarly, a rule s4 → s̄5 suggests that the 5th SNP position must be '0' if the 4th site is '1'.

In developing such an approach, a number of challenges exist. First, a strategy is needed to choose a rule at each step of the algorithm. Second, it is not known a priori against which haplotype (i.e., h0 or h1) a chosen rule needs to be tested. Ideally, one would like to test a rule against both h0 and h1 and update the haplotype that the rule satisfies. Therefore, we need to identify a sequence of rules that are likely to apply to the same haplotype, to avoid generating inconsistent haplotypes. To address this challenge, ARHap identifies dependent rules during each iteration of the algorithm and applies those rules in a sequence.

A brief overview of this approach is as follows:

1. Using a set of strong rules, construct a dependency graph that shows how different rules are related to each other. This will allow us to discover if applying one rule will trigger a dependent rule.

2. Develop an algorithm to find the longest attribute-consistent path on the graph and apply the rules spanning the path on the current haplotype in a chain. This way, each rule will result in applying another rule. For any rule that is applied, we remove its corresponding edge from the dependency graph. Each applied rule may result in updating an SNP site on the haplotype.

3. After all longest paths are discovered, check if any SNP sites on the haplotype have remained unreconstructed. For those sites, rerun the association rule learn- ing module to extract new rules. This will result in extracting longer rules (by increasing the value of l) that can potentially update or verify the correctness of haplotypes at unreconstructed SNP sites. For this new set of strong rules, re- peat the process of dependency graph construction and haplotype update until all SNP sites are constructed.

4.3.2 An Illustrative Example

Figure 4.3 shows the process of constructing a dependency graph and building haplotypes for the simple example of the fragment matrix in Figure 4.2. Assume that the haplotypes are randomly initialized to h0 = '10101' and h1 = '01010'. As shown in Figure 4.3(a), each strong rule of length 2, listed in Table 4.2, represents an edge in the dependency graph. We note that the rules generate a graph composed of 3 connected components, including a single-node component (i.e., the node labeled as s̄3) because attribute s̄3 does not participate in any strong rules in Table 4.2. We then iteratively find longest paths and use each to update the haplotype set. The first longest path generated by the algorithm is π = {s1, s3, s2}, as shown in Figure 4.3(b). This path suggests that we can apply s1 → s3 first and the dependent rule s3 → s2

next. We check the first rule against h0. Because the first SNP position is ‘1’ the

third position needs to be ‘1’ as suggested by the rule (i.e., s1 → s3). We then test

the second rule against h0. This results in changing the second SNP position to ‘1’.

We note that h1 and h0 should remain complement of each other all the time. There-

fore, any changes in h0/h1 will modify h1/h0 as well. The resulting haplotypes after

applying the first longest path are h0 = ‘11101’ and h1 = ‘00010’. We then remove all edges on the longest path from the graph and repeat this process until all SNP sites are verified or reconstructed by some rule.

4.3.3 Dependency Graph

We note that all rules generated by our association rule learning module have a

consequent of length 1 (i.e., |Y | = 1). Therefore, we refer to a rule Rj ∈ R as Xj → tj in the rest of this chapter. We also note that Xj, the antecedent of a rule, can be of any size although in each round of our incremental association rule haplotyping pipeline we enforce certain constraints on the size of Xj to avoid generating redundant rules. As shown in Algorithm 4, the haplotype reconstruction phase starts with creating a dependency graph.

Definition 15 (Dependency Graph). Let R be a set of strong rules on SNP attribute set S as follows:

R = {Xj → tj | Xj ⊂ S; tj ∈ S}    (4.33)

we define the dependency graph G = (V, E) on R, where V = S denotes the set of vertices in G, and the set of edges, E, is given by

E = Er ∪ Es    (4.34)

[Figure 4.3: Association rule based haplotype reconstruction for the dataset in Figure 4.2. (a) Dependency graph constructed from the strong rules in Table 4.2; (b) longest paths identified in each iteration of the algorithm; (c) evolution of the haplotypes as each rule is tested against the haplotype set, assuming the haplotypes are initialized to h0 = '10101' and h1 = '01010':

Path            h0      h1      Comment
-               10101   01010   initial haplotypes
{s1, s3, s2}    11101   00010   2nd SNP modified
{s̄1, s̄2, s̄4}    11111   00000   4th SNP modified
{s̄1, s̄4}        11111   00000   no change
{s3, s4}        11111   00000   no change
{s̄1, s̄5}        11111   00000   no change
{s3, s5}        11111   00000   no change ]

where Er and Es are computed by

Er = {(u, v) | u → v ∈ R}    (4.35)

and

Es = {(u, v) | u = tj; v = Xk; tj ∈ Xk; Xj → tj ∈ R; Xk → tk ∈ R}    (4.36)

In Definition 15, Er refers to the edges due to the dependency suggested by a rule itself (i.e., Xj → tj). The edges in Es, however, indicate that the consequent of one rule (i.e., tj in rule Xj → tj) is a member of the antecedent of another rule (i.e., tj ∈ Xk where Xk → tk). This latter edge set is only applicable to cases where the antecedent of a rule has a size larger than 1 (i.e., l > 2). The purpose of the Es edges is to link a consequent attribute to any antecedent snpset that contains it, so that traversing the graph will generate a sequence of dependent rules.
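As an illustration of Definition 15, the sketch below builds the two edge sets from a rule list. It is a simplified rendering under assumed encodings (an attribute is a (site, value) pair; a rule is a pair of antecedent set and consequent attribute); it is not the ARHap implementation.

```python
def dependency_edges(rules):
    """Build E_r and E_s for a list of rules (antecedent, consequent).
    E_r: edges implied directly by each rule, cf. (4.35).
    E_s: edges linking a rule's consequent to any antecedent that contains it,
    cf. (4.36). Antecedents are frozensets of (site, value) attributes."""
    e_r, e_s = set(), set()
    for antecedent, consequent in rules:
        for attr in antecedent:
            e_r.add((attr, consequent))
        for other_antecedent, _ in rules:
            if consequent in other_antecedent:
                e_s.add((consequent, frozenset(other_antecedent)))
    return e_r, e_s

# Rules s1 -> s2 and s2 -> s3 (sites 1, 2, 3 with value '1').
rules = [(frozenset({(1, '1')}), (2, '1')), (frozenset({(2, '1')}), (3, '1'))]
print(dependency_edges(rules))
```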

As stated before, we start by generating rules with small l (e.g., l = 2). This will potentially result in constructing haplotypes at many SNP sites. For example, all rules in Figure 4.3 are of length 2, and the resulting dependency graph is sufficient to reconstruct the haplotypes at all 5 SNP sites, as discussed before. However, there may exist cases where some of the sites are not constructed in the first round of the framework. Such a scenario may happen if some SNP sites do not participate in any rules of smaller sizes (e.g., l = 2) due to a high amount of sequencing error at those sites. In this case, we generate larger rules (i.e., l > 2) that can build the unreconstructed SNP sites. For example, with the fragments shown in Figure 4.4(a), no rule of size 2 is strong enough to be included in the final rule set. Therefore, strong rules of a larger size (i.e., l = 3) are generated, as shown in Figure 4.4(b). We then generate the dependency graph associated with these rules, as shown in Figure 4.4(c). A longest path is shown in Figure 4.4(d). Finally, the rules on the longest path are tested against the initial haplotype, and when a haplotype satisfies Xj, we may update tj to ensure that the haplotype satisfies tj as well. This is shown in Figure 4.4(e).

[Figure 4.4: An example of a dependency graph generated from rules with multiple attributes in their antecedent snpset. The fragment matrix shown in (a) does not produce any rules of length 2 because no rule of the form ti → tj meets the confidence criterion Conf_hap > 0.5; (b) the set of twelve strong rules with 2 attributes in their antecedent field, each with Supp(R) = 0.25 and Conf(R) = 1; (c) the corresponding dependency graph; (d) a longest path, X1 → s3 → X2 → s2 → X3 → s1 → X10 → s̄3 → X9 → s̄1 → X5 → s̄2 → X12; and (e) the changes in the initial haplotypes after applying the rules on the longest path.]

4.3.4 Longest Attribute-Consistent Path

Intuitively, finding a longest path on the dependency graph produces the largest sequence of rules that can be used to update a haplotype set. We, however, need to ensure that the attributes that reside on a path are consistent. As discussed in Lemma 10 and Lemma 11, applying prerequisite and dependent rules in a sequence does not increase the disagreement error. Therefore, a longest path cannot contain both si and s̄i. In other words, one cannot apply both si and s̄i to the same haplotype (i.e., one of h0 or h1) because si suggests that the i-th site must be '1' while s̄i suggests that the same site must be '0'. Figure 4.5 shows an example where a longest path includes inconsistent attributes. The fragment matrix in Figure 4.5(a) results in the 6 strong rules of size 2 shown in Figure 4.5(b). The dependency graph shown in Figure 4.5(c) has two absolute longest paths, π1 = {s̄1, s2, s3, s1} and π2 = {s̄2, s1, s3, s2}. Both π1 and π2 have inconsistent attributes on the path. Specifically, π1 contains s̄1 and s1, which naturally cannot apply to the same haplotype. Similarly, π2 contains the inconsistent attributes s̄2 and s2. Therefore, it is necessary to ensure that the attributes on the same path are always consistent.

Finding the longest path in a graph is known to be an NP-complete problem [KMR97]. Strong inapproximability results are known for the longest path problem in unweighted directed graphs. In particular, the problem cannot be approximated to within a factor of n^{1−ε} (for any ε > 0) unless P = NP [BHK04]. Therefore, we develop a heuristic approach to extract longest attribute-consistent paths. Algorithm 5 shows a high-level description of the technique. At the core of our approach is a Depth-First-Search (DFS) procedure that extracts a spanning tree while ensuring that any path from the root of the tree to a leaf is attribute-consistent. Algorithm 6 shows the DFS process, which not only generates a spanning tree but also guarantees that inconsistent attributes are excluded from the same path. In order to accomplish this goal, each vertex maintains a set of predecessors during traversal. Before deciding whether or not an adjacent vertex can be traversed, we check whether the adjacent vertex is inconsistent with any attribute in the predecessors list.

[Figure 4.5: An example with inconsistent attributes on a 'longest path'. The fragment matrix in (a) results in the 6 rules of size 2 shown in (b). The longest path on the dependency graph in (c) contains conflicting attributes. Choosing the longest attribute-consistent path results in updating the haplotypes as shown in (d).]

Algorithm 5 Longest attribute-consistent path construction procedure in ARHap.
Require: Dependency graph G = (V, E)
Ensure: Path π
1: for (u ∈ V) do
2:   Tu ← AttributeConsistentDFS(G, u)
3:   πu ← longest path from u to any leaf in Tu
4: end for
5: π ← arg max_u {πu}

Finally, Algorithm 7 shows the process of haplotype update based on the longest path extracted in the prior step. The algorithm tests each rule Rj on the longest path. For each rule, if the haplotype h0 satisfies antecedent of the rule (i.e., Xj), then the consequent of the rule (i.e., tj) is applied on the same haplotype.


Algorithm 6 Attribute-consistent DFS in ARHap.
Require: Dependency graph G = (V, E), source vertex u
Ensure: DFS spanning tree T = (VT, ET)
1: Function AttributeConsistentDFS(G, u)
2:   VT = VT ∪ {u}   {VT = set of visited vertices}
3:   for (v | (u, v) ∈ E) do
4:     if (v ∉ VT and v̄ ∉ Predecessors(u)) then
5:       Predecessors(v) = Predecessors(u) ∪ {u}
6:       ET = ET ∪ {(u, v)}
7:       AttributeConsistentDFS(G, v)
8:     end if
9:   end for
10: EndFunction

Algorithm 7 Haplotype update procedure in ARHap.
Require: Path π
Ensure: Updated H, updated U
1: for ({Rj : Xj → tj} ∈ π) do
2:   if (Satisfy(Xj, h0)) then
3:     Apply(tj, h0)   {update site j on h0 and h1, and remove site j from U}
4:   else if (Satisfy(Xj, h1)) then
5:     Apply(tj, h1)   {update site j on h1 and h0, and remove site j from U}
6:   end if
7: end for
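To make Algorithms 5–7 concrete, the following self-contained sketch combines a small attribute-consistent DFS with the haplotype-update step. The encodings (attributes as (site, value) pairs, rules as (antecedent, consequent) pairs, haplotypes as lists of '0'/'1') are illustrative assumptions and not the ARHap implementation.

```python
def attribute_consistent_dfs(graph, source):
    """DFS tree rooted at `source` that never places an attribute and its
    negation on the same root-to-leaf path (cf. Algorithm 6)."""
    tree = {source: None}                          # child -> parent

    def negation(attr):
        site, value = attr
        return (site, '1' if value == '0' else '0')

    def visit(u, path):
        for v in graph.get(u, []):
            if v not in tree and negation(v) not in path:
                tree[v] = u
                visit(v, path | {v})
    visit(source, {source})
    return tree

def apply_rules_on_path(path_rules, h0, h1, unreconstructed):
    """Algorithm 7 sketch: test each rule against h0, then h1; when the
    antecedent is satisfied, set the consequent site and keep h0/h1
    complementary."""
    satisfies = lambda hap, snpset: all(hap[s] == v for s, v in snpset)
    for antecedent, (site, value) in path_rules:
        for hap, other in ((h0, h1), (h1, h0)):
            if satisfies(hap, antecedent):
                hap[site] = value
                other[site] = '1' if value == '0' else '0'
                unreconstructed.discard(site)
                break

# Rules s1 -> s3 and s3 -> s2 (0-based sites), as in the example of Section 4.3.2.
graph = {(0, '1'): [(2, '1')], (2, '1'): [(1, '1')]}
print(attribute_consistent_dfs(graph, (0, '1')))
h0, h1 = list('10101'), list('01010')
rules = [(frozenset({(0, '1')}), (2, '1')), (frozenset({(2, '1')}), (1, '1'))]
apply_rules_on_path(rules, h0, h1, set(range(5)))
print(''.join(h0), ''.join(h1))   # -> 11101 00010, matching Figure 4.3(c)
```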

4.4 Validation

This section demonstrates the performance of ARHap and compares ARHap with several other algorithms using both simulated data and real diploid data.

4.4.1 Dataset Preparation and Statistics

Several datasets with various coverage and error rates are generated consistent with approaches suggested in the literature [BYP14a, MFM17]. In order to simulate a

haplotype, a fixed haplotype length, Lh, is first chosen. Then, SNP positions are identified assuming that the distance between adjacent SNPs follows a geometric random variable with parameter p, the SNP density. For each identified SNP, its haplotype value is generated randomly, assuming that the alternative and reference alleles are equally likely. After a haplotype is constructed, a total of |R| paired-end reads are generated, where the number of reads, |R|, is a function of the coverage, C. To generate a read set with C× coverage, each base pair needs to be covered on average by C reads. Given the haplotype length Lh and the read length lr, the total number of generated reads is given by

|R| = Lh × C / (2 × lr)    (4.37)

To generate a paired-end read, a starting point on the genome is uniformly chosen.

In the experiments in this chapter, the read length lr is fixed to 2500. The fragment length is normally distributed with µ = 600 alleles and σ = 60 alleles. The insert length is determined by the fragment length, lf, and the read length; that is, l_insert = lf − 2 × lr. Once the start position and fragment length are known, we need to choose from which haplotype copy to read. For this purpose, reads are drawn uniformly from the two haplotype copies. Finally, uniform error is added to the read: for every SNP that the read covers, the allele is flipped independently with probability ε. We generate datasets based on an average diploid chromosome size of 130 million, assuming an overall SNP density of p = 0.083.
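A simplified version of this read-generation procedure is sketched below. The function name and the haplotype/read representation are assumptions made for illustration; the sketch treats a fragment as one contiguous run of SNP sites (ignoring the insert between read ends) and mirrors the parameters above: uniform start, fragment length ~ N(600, 60) in alleles, and per-SNP flip probability ε.

```python
import random

def simulate_fragment(haplotypes, frag_mu=600, frag_sigma=60, eps=0.05):
    """Draw one fragment: choose a haplotype copy uniformly, a uniform start
    position, a normally distributed fragment length (in SNP sites), then flip
    each covered allele independently with probability eps (sequencing error)."""
    hap = random.choice(haplotypes)
    length = max(1, int(random.gauss(frag_mu, frag_sigma)))
    start = random.randrange(len(hap))
    end = min(len(hap), start + length)
    read = []
    for allele in hap[start:end]:
        if random.random() < eps:
            allele = '1' if allele == '0' else '0'
        read.append(allele)
    return start, ''.join(read)

random.seed(0)
haps = ['0' * 1000, '1' * 1000]          # a toy diploid haplotype pair
start, read = simulate_fragment(haps)
print(start, len(read))
```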

The data generation procedure was repeated for six coverage levels, C ∈ {5, 10, ..., 30}, and seven error rates, ε ∈ {0%, 5%, 10%, ..., 30%}. The dataset preparation resulted in 17,929,681 short reads in total, with 2,561,383 reads for each dataset with a fixed error rate ε. Figure 4.6 illustrates several statistics about the short reads associated with a given ε. Figure 4.6(a) shows the total number of reads generated for each coverage level. The number of short reads ranged from 128,054 for coverage 5X to 640,869 for coverage 30X. Note that although the number of reads computed in (4.37) grows linearly with the coverage C, the number of reads ultimately used does not grow exactly linearly with the coverage level. This can be explained by the dataset preparation procedure described previously: because the start position of each read on the genome is chosen at random, it is likely that some reads will not span any SNPs. Such reads carry only missing values in the fragment matrix and are therefore eliminated from the dataset when pre-processing the data for algorithm input.

[Figure 4.6: Statistics about the simulated datasets. (a) Number of short reads for each coverage level; (b) number of non-overlapping blocks for each coverage level; (c) histogram of the lengths of non-overlapping blocks; (d) average block length for each coverage level.]

As mentioned previously, short reads are processed as a collection of non-overlapping blocks. Each block spans a number of SNPs such that the short reads covering those SNPs do not overlap with the reads that cover SNPs in adjacent blocks. As a result, each block in fact forms a fragment matrix, X, where the block length refers to the number of SNPs covered within that block. As shown in Figure 4.6(b), the total number of blocks is 10,467 for a given error rate. The number of blocks ranged from 643 blocks for coverage 30X to 5,893 for coverage 5X. The block lengths can vary significantly. Figure 4.6(c) shows a histogram of the block lengths. Approximately 81.3% of the blocks had a length of less than 100 SNPs.

As shown in Figure 4.6(b), a larger number of the blocks belong to datasets with lower coverage compared to those with higher coverage. Expectedly, lower coverage results in short reads that are less likely to overlap, leading to a larger number of non-overlapping blocks. However, note that the blocks in lower-coverage datasets are likely to have a smaller length, as shown in Figure 4.6(d). The graph in Figure 4.6(d) shows the average block length for each dataset based on coverage. On average, the block length was 17.3, 65.7, 126.0, 145.4, 159.34, and 167.4 for coverage levels 5X, 10X, 15X, 20X, 25X, and 30X, respectively.

4.4.2 Results on Simulated Data

Figure 4.7 shows normalized switching error (SWE) and normalized MEC of the four algorithms under comparison.

Figure 4.7(a) shows normalized switching error as a function of coverage. The switching error numbers shown in this figure refer to the percentage of SNPs that must be flipped on the reconstructed haplotypes in order to obtain the true haplotype. Overall, SWE was 0.12, 0.26, 0.27, and 0.36 for ARHap, FastHap, HapCut, and Greedy, respectively. This indicates that ARHap reduces the SWE of FastHap, HapCut, and Greedy by 53.8%, 55.5%, and 66.7%, respectively. We also note that SWE increases as the amount of coverage grows. This can be explained by the fact that higher coverage results in longer blocks, which makes the haplotype assembly problem more difficult to solve compared to a smaller block with the same error rate.

[Figure 4.7: Performance of different algorithms on simulated data. (a) Switching error as a function of coverage for a fixed error rate of ε = 5%; (b) switching error versus error rate for a fixed coverage of 5X; (c) normalized MEC versus coverage for a fixed error rate of ε = 5%; (d) normalized MEC as a function of error rate for 5X coverage.]

Figure 4.7(b) illustrates SWE as a function of error rate for a fixed coverage of 5X. Averaged over all error rates, SWE was 0.09, 0.11, 0.13, and 0.18 for ARHap, FastHap, HapCut, and Greedy, respectively. Based on this analysis ARHap performs 18.2%, 30.1%, and 50.0% better in terms of SWE compared to FastHap, HapCut, and Greedy, respectively.

We can conclude from Figure 4.7(a) and Figure 4.7(b) that ARHap achieves significantly better SWE performance at higher coverage levels. This observation can be explained by the nature of the ARHap algorithm, which uses the 'support' of each association rule for haplotype reconstruction. With a higher coverage level, rules of higher support will be given higher priority for haplotype reconstruction, which results in higher-quality haplotypes.

Figure 4.7(c) shows the amount of normalized MEC versus coverage. For this analysis, a fixed error rate of ε = 5% was used. As shown in these graphs, ARHap, FastHap, and HapCut achieve very similar MEC values while Greedy performs much worse. On average, MEC was 0.9855, 0.9883, 0.9992, and 1.5745 for ARHap, FastHap, HapCut, and Greedy, respectively. This indicates that ARHap performs slightly better than FastHap and HapCut. Specifically, ARHap achieves a 0.28% improvement in MEC over FastHap and a 1.37% improvement over HapCut.

Figure 4.7(d) shows the MEC of the different algorithms as the error rate increases. Similar to the previous figure, the results here are shown for a fixed coverage of 5X. On average, the amount of MEC was 0.57, 0.55, 0.57, and 0.65 for ARHap, FastHap, HapCut, and Greedy, respectively.

Table 4.3 and Table 4.4 compare the running times of the four algorithms in two different scenarios. Table 4.3 shows running times for various coverage levels on the dataset with a fixed error rate of ε = 5%, while Table 4.4 shows running times as the error rate ranges from 0% to 30% for a fixed coverage of C = 5X. In both scenarios, FastHap (the algorithm discussed in Chapter 3) outperforms the other three algorithms. As shown in Table 4.3, the running time of all the algorithms increases as coverage grows. This is an expected observation because, at higher coverage levels, the algorithms need to process more data. However, note that compared to the other three algorithms, Greedy becomes much slower at higher coverages. This can be explained by the nature of the Greedy algorithm, which processes the fragment matrix row-wise; that is, it examines one fragment/read at each iteration of the algorithm. In contrast, ARHap and HapCut perform column-wise processing of the fragment matrix. As we discussed in Chapter 3, although FastHap performs row-wise processing, it performs a majority of its computation outside the iterative process and thus achieves superior running time. Overall, ARHap is 30.9% faster than HapCut and 87.4% faster than Greedy. However, FastHap outperforms ARHap, achieving a running time that is 54.2% lower.

Table 4.3: Running time (minutes) of ARHap, FastHap, HapCut, and Greedy on simulated data with fixed error rate ε = 5%.

Coverage   ARHap   FastHap   HapCut   Greedy
5X           3.5       2.1     10.1      4.3
10X         14.8       2.7     17.3     13.0
15X         18.9       3.8     18.7     33.6
20X         43.9       5.8     44.7    127.3
25X         76.4      31.9    108.8    602.2
30X        101.1      72.1    174.8   1277.2
Avg.        43.1      19.7     62.4    342.9

Table 4.4: Running time (minutes) of ARHap, FastHap, HapCut, and Greedy on simulated data with fixed coverage C = 5X.

ε (%)   ARHap   FastHap   HapCut   Greedy
0        1.78      1.13     6.02     3.18
5        3.43      2.14    10.14     4.31
10       3.02      2.30    15.52     4.21
15       2.94      2.11    15.90     4.10
20       3.53      2.21    14.95     3.92
25       3.55      2.52    17.83     3.13
30       3.33      2.19    16.09     4.02
Avg.     3.08      2.09    13.78     3.84

The results shown in Table 4.4 are intended to examine the potential impact of the error rate on the running time of the algorithms. As this table shows, none of the algorithms' running times depends heavily on the amount of error within the fragment matrix. This indicates that all of these algorithms can run efficiently even on extra-long reads, which tend to have higher error rates with conventional sequencing technology; the remaining question is which algorithm offers more accurate results.

4.4.3 Results on HuRef Data

Recall from Chapter 3 that the HuRef dataset [Ven14] used for the analysis here contains reads for all 22 chromosomes of an individual, J.C. Venter. The data include 32 million DNA short reads generated by the Sanger sequencing method, with 1.85 million genome-wide heterozygous sites. There are many fairly short reads of approximately 15bp (each end), while tens of thousands of reads are long enough to cover more than two SNP sites and can be used for haplotype assembly. In fact, many fragments within each block span several hundred SNP sites due to the paired-end nature of the aligned reads. The fragment matrix used for haplotype assembly is generated from short reads aligned with the paired-end method, with each end ranging from 15bp to 200bp, while the insert length follows a normal distribution with a mean of 1000.

Table 4.5 shows the MEC scores of the different algorithms on the HuRef dataset for each chromosome. ARHap performs better than all other algorithms in terms of MEC, although its performance varies from one chromosome to another. Overall, ARHap outperforms Greedy by a large margin (i.e., a 35.3% improvement in MEC), moderately outperforms HapCut with a 1.43% reduction in the total MEC, and achieves a slightly better MEC score than FastHap (i.e., only a 0.14% reduction in MEC).

4.5 Discussion

Table 4.5: Overall MEC score of ARHap, FastHap, HapCut, and Greedy on HuRef data.

Chr    ARHap   FastHap   HapCut   Greedy
1      19477     19423    19750    29657
2      14474     14220    14677    22980
3      10667     11794    10738    16878
4      11865     11812    11931    18153
5      10534     10362    10630    16590
6       9915      9870     9992    15587
7      11269     11245    11290    17402
8       9828     10830     9845    14887
9       9223      9204     9318    13812
10      9859      9796     9906    15291
11      8242      8091     8294    12906
12      7407      7467     8297    12630
13      6141      6143     6131     9312
14      5947      5725     6360     9734
15      9592      9695     9783    13988
16      8233      8215     8354    12621
17      7584      7386     7398    11157
18      4763      4846     5043     8578
19      5623      4886     5497     8214
20      3659      3437     3784     5752
21      4642      4707     4715     6611
22      5819      5875     5864     8295
Sum   194763    195029   197597   301035

Development of computational algorithms for haplotype assembly has received more attention in the computational community over the past decade. Although limitations in sequencing technology keep phasing from achieving its full potential, we still need a solid framework for solving the problem. Studying phasing at the diploid or polyploid level can, in general, contribute to many related areas in the field of genetics, such as inheritable diseases and organ transplant matching, among others. In this chapter, we compared ARHap with three well-known haplotype assembly algorithms, namely

FastHap, HapCut, and the Greedy algorithm. These algorithms are historically known for their speed and accuracy. Our extensive analysis showed that ARHap outperforms all of these techniques significantly on simulated and real data. Haplotype length is one of the quality measures that has appeared in the literature in recent years. Although it is important to produce long haplotype blocks to demonstrate the capabilities of a proposed algorithm, block length is more closely related to the nature of the dataset being used: for instance, the length of the fragments, the insert size of the alignments, and the error rate (i.e., the quality of sequencing). Therefore, in this project we decided not to present a block-length comparison between our algorithm and the state of the art. We note that our algorithm's assembled block sizes are almost the same as those of the other methods, mainly because of the quality of the sequencing data. Although we have achieved very promising results in this work, we believe that the construction of the ARHap framework is only the first step towards the development of accurate ploidy-agnostic haplotype assembly techniques. In this project, we looked at the phasing problem from the result side, trying to identify the origin of each fragment. Association rules can also help in finding linkage disequilibrium hidden among various SNP locations, which cannot be recovered by comparing the similarity of fragments to each other. We plan to extend the current approach to higher ploidy levels and to incorporate dynamic learning of the ploidy level in future work.

CHAPTER 5

Graph Coloring for Polyploid Haplotyping

5.1 Introduction

Recent advances in sequencing technologies have played a significant role in providing high-quality data for DNA phasing and allowing for single individual haplotyping. DNA phasing in diploid organisms has proved effective in many application areas such as investigating human disease genes and transplant matching as well as in population genetics where ancestral studies can reveal many aspects of history and heritable diseases [TBT11]. Although DNA phasing has been studied for over a decade [BB11], it has remained an active research area due to the experimental and computational constraints, and the costs associated with accurately reconstructing haplotypes from genotypes. While experimental phasing techniques appear to be less practical for large datasets, due to their high costs, computational approaches have received more attention because of their practical feasibility and cost-efficiency.

Polyploidy refers to the presence of more than two copies of each chromosome in the cells of an individual organism. Polyploidy is common in ferns and flowering plants [WTB09] and has been studied both in the cytogenetics era and in the context of molecular genetics and genome evolution [RS98]. Moreover, polyploidy exists in animals such as fish and amphibians. Recent evidence suggests that polyploidy is more important in animal evolution than was previously thought [Gla10]. The occurrence of polyploidy is known to have resulted in new species. Many species that are currently diploid, including humans, were derived from polyploid ancestors [MP05]. Despite the fact that polyploidy has many important applications in the field of genetics, previous haplotype assembly research focuses mostly on diploid genomes. Very limited research has been conducted on polyploid haplotyping [BYP14a].

5.1.1 Contributions and Summary of Results

This chapter introduces HapColor, a novel polyploid haplotyping framework based on an efficient partitioning of DNA reads to reconstruct haplotypes of higher ploidy. In developing and validating HapColor, our contributions can be summarized as follows.

• A formal definition of polyploid haplotyping with the objective of minimizing the overall Minimum Error Correction (MEC) criterion will be presented (Section 5.2.1).

• A graph model will be introduced to capture the amount of conflict between pairs of DNA short reads (Section 5.2.2).

• The hardness of the introduced problem, Minimum MEC Polyploid Haplotyping (MMPH), will be discussed and it will be shown that this problem is NP-hard (Section 5.2.3).

• A two-phase greedy algorithm including a graph coloring method followed by a color-merging technique will be developed to accurately partition short reads and reconstruct the haplotype associated with each partition (Section 5.3).

• The performance of HapColor will be compared against several polyploid haplotyping approaches and it will be demonstrated that HapColor substantially reduces the MEC scores of the competing haplotype assembly algorithms on polyploid genome data (Section 5.4).

Table 5.1 shows a summary of HapColor’s performance in terms of reduction in MEC scores of other haplotype assembly approaches. HapColor is compared 78 Table 5.1: Reduction (%) in MEC achieved by HapColor

Table 5.1: Reduction (%) in MEC achieved by HapColor

Algorithm   Triploid   Tetraploid   Hexaploid   Decaploid   Overall
HapTree        22.9%        24.0%       25.3%       25.2%    24.35%
PGreedy        63.3%        55.4%       55.4%       53.7%    56.95%
RFP            77.1%        80.5%       84.6%       87.9%    82.52%

with three polyploid haplotyping algorithms: HapTree [BYP14a], Greedy (a greedy approach originally developed for diploid haplotyping [LSN07] but customized in this dissertation to accommodate polyploid genomes), and Random Fragment Partitioning (RFP, a baseline polyploid haplotyping algorithm based on a random partitioning of the short reads). On average, HapColor achieves 24.35%, 56.95%, and 82.52% reductions in MEC compared to the HapTree, Greedy, and RFP algorithms, respectively.

5.2 Problem Statement

The general form of the haplotype assembly problem is to reconstruct ‘K’ haplotypes from a given set of fragments obtained by DNA sequencing of ‘K’ copies of a chromo- some. Solving this problem is challenging due to insufficient coverage and erroneous readings. Insufficient coverage refers to the fact that DNA sequencing technologies provide only small overlapping fragments that result in some SNP sites not being reconstructed in the final haplotype. The problem becomes more challenging if the fragment matrix contains errors such as sequencing errors, chimeric read pairs, and false variants. In the real world of molecular biology, experiments are never error-free. These errors result in conflicting fragments drawn from the same haplotype copy. The conflicting data prevent us from reliably reconstructing the correct allele at each SNP site. In this chapter, it will be shown that an error-free fragment matrix, also called a feasible matrix, can be used to partition the fragments into ‘K’ disjoint sets each of which is associated with one copy of the haplotype.

Given the aforementioned challenges, the haplotype assembly problem is often defined as follows. Given a fragment matrix X obtained from 'K' copies of a chromosome, reconstruct 'K' haplotypes such that an objective function is optimized. A number of objective functions have been introduced in the past. Examples include Minimum Error Correction (MEC), Minimum Fragment Removal (MFR), Minimum SNP Removal (MSR), and Longest Haplotype Reconstruction (LHR) [LBI01]. The focus of this chapter is on minimizing the MEC objective function, whose effectiveness has been previously established in the literature. The fragment partitioning approach introduced in this chapter aims to partition the set of fragments into disjoint sets, each representing one haplotype copy. As soon as the fragments are partitioned, each partition can be used to merge the fragments and build one haplotype copy.

5.2.1 Problem Formulation

Let Xm×n be a given fragment matrix that contains SNP values, xij ∈ {0, 1, ‘-’}, obtained from m fragments, F = {f1, ... , fm}, pertaining to n SNP sites, S = {s1,

... , sn}. Furthermore, let K be the ploidy level, which is the number of chromosome copies from which the fragments in F are derived. Suppose that our algorithm assigns each fragment fi ∈ F to one of the K disjoint partitions P = {p1, ... , pK }, initialized

as pk = ∅. As stated previously, each set pk can be used to build a haplotype hk by

merging the fragments that reside in pk, resulting in the haplotype set H = {h1, ..., hK}. Minimum MEC Polyploid Haplotyping (MMPH) is then the problem of assigning each fragment fi in X to one of the K partitions pk such that the overall MEC in (5.1) is minimized.

MEC(X, H) = Σ_{k=1}^{K} Σ_{i=1}^{m} MEC(fi, hk) × aik    (5.1)

where MEC(X, H) denotes the overall MEC score of the haplotype assembly, aik is a binary indicator of whether or not the partitioning algorithm maps fragment fi onto haplotype hk, and MEC(fi, hk) represents the MEC value due to such a mapping and is given by

MEC(fi, hk) = Σ_{j=1}^{n} δ(xij, hkj)    (5.2)

where δ(xij, hkj), the mismatch function, is a binary distance metric that indicates whether or not there is a mismatch between xij and the corresponding SNP location (i.e., the j-th

SNP) in the reconstructed haplotype hk. The general form of the mismatch function is as follows:

δ(x, y) = 1 if x ≠ y and x ≠ '–' and y ≠ '–'; δ(x, y) = 0 otherwise.    (5.3)

where δ(x, y) denotes the mismatch between x and y, drawn from the alphabet {0, 1, '–'}, in which 0/1 refer to the two alleles at a particular SNP site and '–' indicates lack of information, either because the fragment does not span the site or because of a failure of the sequencing assay. Matching alleles are not considered because, while they would increase computation substantially, they do not help in differentiating haplotype copies. Therefore, the distance measure is built using only mismatches among fragments.
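The mismatch function (5.3) and the MEC objective (5.1)–(5.2) translate directly into code. The sketch below is an illustrative implementation only (fragments and haplotypes are strings over {'0', '1', '-'}); it is not the HapColor code.

```python
def mismatch(x, y):
    """delta(x, y) from (5.3): 1 only when both symbols are called and differ."""
    return 1 if x != y and x != '-' and y != '-' else 0

def fragment_mec(fragment, haplotype):
    """MEC(f_i, h_k) from (5.2): mismatches between a fragment and a haplotype."""
    return sum(mismatch(x, h) for x, h in zip(fragment, haplotype))

def overall_mec(fragments, haplotypes, assignment):
    """Overall MEC from (5.1); `assignment[i]` is the haplotype index k that
    fragment i is mapped to (the indicator a_ik in the formula)."""
    return sum(fragment_mec(f, haplotypes[assignment[i]])
               for i, f in enumerate(fragments))

frags = ['00--00', '11-11-', '1111-0']
haps = ['000000', '111111', '111100']   # the three copies used in Figure 5.1
print(overall_mec(frags, haps, [0, 1, 2]))   # 0 errors for this assignment
```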

Problem 2 (Minimum MEC Polyploid Haplotyping). Given fragment matrix Xm×n, composed of m fragments F = {f1, ... , fm}, and ploidy level ‘K’, find partition P

= {p1, ... , pK } of the fragments (each pk corresponding to a haplotype hk) such that the MEC value in (5.1) is minimized.

5.2.2 Graph Modeling

The concept of a conflict graph has conventionally been defined as a non-weighted graph for diploid haplotyping. Such a binary conflict graph represents any pair of fragments with at least one mismatch in the fragment matrix. For example, according to [LBI01], a conflict graph is a graph with an edge associated with each pair of fragments in conflict, where two fragments are in conflict if they have different values in at least one column of X. If the fragment matrix is error-free, the conflict graph can be easily used for diploid haplotyping because the graph is bipartite and the two given haplotypes h1, h2 define the shores of the graph. Furthermore, if the graph is bipartite with shores h1 and h2, then h1 and h2 can be taken as partitions of F defining a haplotyping, and thus the fragment matrix is feasible [LSL02].

The definition of the conflict graph is extended by leveraging the mismatch function in (5.3) to construct a weighted graph, called the Weighted Fragment Conflict Graph (WFCG), which captures the amount of conflict among the fragments in X. The WFCG is not a complete graph. Later in this chapter, it will be discussed how this graph model can be used for polyploid haplotyping.

Definition 16 (Weighted Fragment Conflict Graph). Given fragment matrix Xm×n

composed of m fragments F = {f1, ... , fm}, each of length n, a weighted fragment

conflict graph G(VF ,EF ,WF ) is composed of m vertices (i.e., |VF | = m) each associ-

ated with one fragment in X . The edge set EF is defined by the pairs of fragments with non-zero mismatch values where the amount of mismatch between two fragments

fi and fk is defined as

w(fi, fk) = Σ_{j=1}^{n} δ(xij, xkj)    (5.4)

A non-zero mismatch value (i.e., a conflict) indicates that the two fragments cover at least one common SNP site and have different non-missing values at that site. In an error-free fragment matrix, a conflict occurs only when the two fragments belong to different chromosome copies.
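Following Definition 16, the WFCG can be built with a simple double loop over fragments, as in the illustrative sketch below. The plain dict-of-weights representation is an assumption made for the example; the dissertation does not prescribe a particular data structure.

```python
def build_wfcg(fragments):
    """Weighted fragment conflict graph: edge weight w(f_i, f_k) is the number
    of SNP sites where both fragments are called ('0'/'1') and disagree, as in
    (5.4). Returns {(i, k): weight} for pairs with a non-zero weight."""
    edges = {}
    for i in range(len(fragments)):
        for k in range(i + 1, len(fragments)):
            w = sum(1 for x, y in zip(fragments[i], fragments[k])
                    if x != y and x != '-' and y != '-')
            if w > 0:
                edges[(i, k)] = w
    return edges

frags = ['000000', '111111', '1111-0', '0000--']
print(build_wfcg(frags))   # e.g. fragments 0 and 1 conflict at all six sites
```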

Figure 5.1 shows an example of a WFCG that quantifies the amount of inter-fragment mismatch for 7 fragments, F = {f1, ..., f7}, drawn from the 3 chromosome copies {h1, h2, h3} in H.

[Figure 5.1: An example of a weighted fragment conflict graph (WFCG), built from seven short reads drawn from three haplotype copies (h1 = 000000, h2 = 111111, h3 = 111100) and the corresponding fragment matrix.]

The fragment matrix contains only one mismatch (i.e., cell x75). Given the fragment matrix X and ploidy level K (K = 3 in Figure 5.1), the haplotype assembly problem is to assemble the fragments in X into the copies of the chromosome such that the overall MEC score in (5.1) is minimized. Given that each matrix X can be transformed into a WFCG, the definition of polyploid haplotyping in Problem 2 can be revised based on this graph model as follows.

Problem 3 (Minimum MEC Polyploid Haplotyping (MMPH)). Given a weighted fragment conflict graph G(VF ,EF ,WF ) associated with the fragment matrix Xm×n, and partition set P = {p1, ... , pK }, Minimum MEC Polyploid Haplotyping (MMPH) is to find a mapping L : VF → {p1, ... , pK } such that the MEC value in (5.1) is minimized.

5.2.3 Problem Complexity

This section contains the proof that the MMPH problem described in Problem 3 is NP-hard, even on WFCGs derived from error-free fragment matrices. The proof is based on the concept of graph coloring.

[Figure 5.2: Vertex coloring (left) and color merging (right) applied to the WFCG shown in Figure 5.1. Phase (I), vertex coloring: 4 colors, MEC = 0, with h1 = 000000, h2 = 111111, h3 = 111100, h4 = 111110. Phase (II), color merging: 3 colors, MEC = 1, with h1 = 000000, h2 = 111111, h3 = 111100.]

Problem 4 (Graph Coloring). An r-coloring of a graph G(V ,E) is a mapping C :

V → {c1, ... , cr} that assigns one of ‘r’ colors to each vertex in G so that every edge has two different colors at its endpoints. The vertex coloring problem is to find the smallest possible number of colors for r-coloring.

In Figure 5.2, the graph on the left shows a vertex graph coloring of the conflict graph in Figure 5.1. The coloring algorithm has identified 4 distinct colors, ‘blue’ for fragments {f1, f6}, ‘yellow’ for {f7}, ‘green’ for {f2, f5}, and ‘red’ for {f3, f4}. Therefore, minimum number of colors is 4.

Lemma 12. Given a weighted fragment conflict graph associated with an error-free fragment matrix X, partitioning of the fragments based on vertex coloring yields an overall MEC of zero.

Proof. Applying a vertex coloring algorithm on the WFCG gives a partitioning of the fragments because each color ck can indicate a partition pk. Intuitively, the fragments allocated to each partition pk are merged to build a haplotype hk. The MEC value can be computed by calculating the amount of mismatch between each fragment and the haplotype of the partition in which the fragment resides. The fragments within a particular partition have the same color. The amount of mismatch between any pair of fragments within each partition is zero because the vertex coloring algorithm assigns different colors to any two vertices that have a non-zero edge weight between them. Without loss of generality, assume that there is a total of q partitions and a total of l fragments within each partition pk associated with haplotype hk. The amount of MEC for partition pk is equal to

MEC_k = Σ_{i=1}^{l} δ(fi, hk) = 0    (5.5)

Therefore, the overall MEC is calculated as

MEC(X, H) = Σ_{k=1}^{q} MEC_k = 0    (5.6)

Lemma 13. Given a weighted fragment conflict graph associated with an error-free fragment matrix X , composed of m fragments drawn from K haplotypes, the number of colors, q, obtained by vertex coloring is always less than or equal to K (i.e., q ≤ K).

Proof. Proof by induction. Basis: For K=1, it is clear that the number of colors, q, given by vertex coloring is 1 because for an error-free matrix where all fragments are drawn from the same chromosome copy, there is no mismatch between fragment pairs. This is simply because either the overlapping regions match or contain missing values; in both cases the mismatch function in (5.3) gives a value of ‘0’ resulting in not creating an edge between any pairs of fragments in the graph. Thus, vertex coloring, 85 which aims to minimize the number of colors, will assign all vertices the same color. Inductive Step: Assume that the set of fragments in X are drawn from K haplotypes and the equation q ≤ K holds as a result of vertex coloring. Following is the proof that if more fragments are added from a (K + 1)th haplotype to X , the resulting number of colors after recoloring, q0, will hold the inequality q0 ≤ (K + 1). For this, first it is established that q0 ≤ q; that is, the new haplotype will add at most one color to the graph. Assume l new fragments are drawn from the (K +1)th haplotype. It is clear that there will not be any edges between any pairs of the newly added fragments because they come from the same haplotype and their overlapping region either matches or contains missing values. For each new fragment fi, the coloring algorithm will do one of the following: (1) assigns fi an existing color in which case addition of fi will not increase the total number of colors; or (2) assigns fi a new color in which case only one color will be added to the number of colors, q. Therefore, the new number of colors, q0 is either equal to q or equal to q + 1. Given that q ≤ K, it is fair to infer that q0 ≤ K + 1.

Theorem 14. The MMPH problem described in Problem 3 is NP-hard on any WFCG constructed from a feasible fragment matrix.

Proof. Proof by reduction from vertex coloring which is known to be NP-hard [JT11]. Assume an instance of a vertex coloring problem which takes as input a graph G(V ,E) and produces a mapping of C : V → {c1, ... , cr} such that the total number of colors used, q, is minimized. The graph in G is directly reduced to a WFCG graph

G(VF ,EF ,WF ) such that VF = V , EF = E, and w(vi, vj) = 1 ∀vi ∈ V . As shown in Lemma 13, the number of colors in the resulting coloring is at most K. Furthermore, as shown in Lemma 12, the overall MEC resulted from coloring is always ‘0’ on feasible fragment matrices. Therefore, the coloring problem can be directly reduced into an instance of the MMPH problem which always achieves the minimum possible MEC (i.e., MEC = 0).

Algorithm 8 HapColor Algorithm
Require: Fragment matrix X and ploidy level K
Ensure: Haplotypes H = {h1, ..., hK} and overall MEC
# Initialization
(1) Construct graph G(VF, EF, WF) from X as described in Section 5.2.2
(2) q = 0   {number of unique colors in G}
# Phase (I): Vertex Coloring
(3) ∀v ∈ VF, compute SD(v), the number of vertices adjacent to v
(4) Sort the vertices in VF in non-increasing order of their static degree SD(v)
(5) Color a vertex of maximal degree with color c1; q = q + 1
(6) ∀u ∈ VF, compute DD(u), the number of different colors adjacent to u
while (∃ uncolored vertex u ∈ VF) do
  (7) Find an uncolored vertex u with maximal dynamic degree
  (8) Color u with the first available color (I(u) = ci); q = q + 1
  (9) Update the dynamic degree DD(·) of all vertices adjacent to u
end while
# Phase (II): Color Merging
(10) ∀u ∈ VF and ci ∈ C, compute Λ(u, ci)
(11) ∀ci, cj ∈ C, compute the merging cost MC(ci, cj)
(12) Sort color pairs based on their merging cost values
while (q > K) do
  (13) Choose a pair of colors {ci, cj} with minimum cost
  (14) Merge the two colors by replacing cj with ci
  (15) Update the merging cost MC(·, ·) for all colors in GF; q = q − 1
end while
# MEC Calculation
(16) Partition the fragments based on the colors of their vertices in GF
(17) For each partition pk, build a haplotype hk by taking the consensus allele at each SNP position inside the partition, and add hk to H
(18) Use (5.1) to compute the overall MEC

5.3 HapColor Algorithm

Algorithm 8 shows the HapColor algorithm which takes fragment matrix X and ploidy level K as inputs, builds an array of K haplotypes in H, and calculates the overall MEC of the haplotype assembly. The algorithm consists of two main phases. In Phase (I), HapColor performs a vertex coloring of the graph based on the DSATUR greedy graph coloring [Bre79]. DSATUR (Degree of Saturation) is a sequential coloring algorithm with a dynamically determined order of the vertices chosen for coloring.

Each vertex v in GF is assigned a static degree, SD(v), which is the number of edges adjacent to v. Furthermore, vertex v is assigned a dynamic degree, DD(v), also called the degree of saturation, which is the number of different colors at the vertices adjacent to v. Phase (I) starts by assigning color c1 to a vertex of maximal static degree. Suppose I is a partial coloring of the vertices of GF. The vertex to be colored next in the sequential coloring is a vertex u with maximal dynamic degree; in case of a tie, the vertex with the highest static degree is colored. After a vertex u is colored, the dynamic degrees of its neighboring vertices are updated. This vertex coloring has been shown to achieve fast running time and high accuracy [Klo02].
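A compact, self-contained sketch of this DSATUR-style ordering is shown below. It operates on a simple adjacency-set representation and is intended only to illustrate the saturation-degree / static-degree tie-breaking; it omits the weighted edges and the bookkeeping of the full Algorithm 8.

```python
def dsatur_coloring(adj):
    """Greedy DSATUR: repeatedly color the uncolored vertex with the most
    distinct neighboring colors (ties broken by static degree), using the
    smallest color not used by its neighbors. `adj` maps vertex -> set of
    neighbors. Returns {vertex: color}."""
    colors = {}
    uncolored = set(adj)
    while uncolored:
        def saturation(v):
            # dynamic degree = number of distinct colors among colored neighbors
            return len({colors[u] for u in adj[v] if u in colors})
        v = max(uncolored, key=lambda u: (saturation(u), len(adj[u])))
        used = {colors[u] for u in adj[v] if u in colors}
        colors[v] = next(c for c in range(len(adj)) if c not in used)
        uncolored.remove(v)
    return colors

# Conflict edges among 4 toy fragments: 0-1, 0-2, 1-2, 2-3.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print(dsatur_coloring(adj))   # fragments 0, 1, 2 get three colors; 3 reuses one
```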

The second phase of HapColor, Phase (II), is primarily developed to account for the effect of errors on the coloring/partitioning algorithm. This phase aims to reduce the number of colors by merging colors of minimum cost, where the cost is computed based on the edge weights in GF. It will not merge any colors if X is error-free because, as discussed previously in Lemma 12, the resulting coloring always achieves a haplotyping with MEC = 0. In practice, however, fragment matrices are rarely error-free. Thus, it is likely that the number of colors reported by Phase (I) is larger than K. In this case, the algorithm iteratively finds two colors with minimal merging cost and merges them by assigning them the same color label. The iterative process continues until the total number of colors in GF becomes K. The merging cost of any two colors ci and cj is given by

X MC(ci, cj) = Λ(v, cj) (5.7)

v:I(v)=ci

where I is the current color mapping of the vertices in GF and Λ is a vector of size |I| given by

\Lambda(v, c) = \sum_{u : (u,v) \in E_F,\ I(u) = c} w(u, v)    (5.8)

Intuitively, in each iteration, two colors are chosen whose merging will result in the minimum amount of mismatch among the corresponding fragments. Such fragments, which are now given the same color, will be allocated to the same partition. Therefore, HapColor’s heuristic algorithm attempts to merge colors that have the least impact on the overall MEC of the haplotype assembly. Once all the fragments are assigned, then for each SNP site the consensus allele is taken over all the fragments in the partition covering that position to reconstruct the corresponding haplotype copy. Figure 5.2 shows how HapColor performs vertex coloring, color merging, and haplotype reconstruction for the fragment matrix in Figure 5.1.
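To make the merging step concrete, the sketch below evaluates the merging cost of (5.7)–(5.8) and greedily merges the cheapest pair of colors until K remain. It is a minimal illustration, assuming the conflict graph is given as a weighted edge list; the function and variable names are illustrative and not taken from the HapColor code base.

```python
from itertools import combinations

def merging_cost(edges, coloring, ci, cj):
    """MC(ci, cj): total weight of edges crossing colors ci and cj, i.e. the
    sum of Lambda(v, cj) over vertices v currently colored ci (eqs. 5.7-5.8)."""
    cost = 0.0
    for u, v, w in edges:                        # undirected weighted edges (u, v, weight)
        cu, cv = coloring[u], coloring[v]
        if (cu, cv) in ((ci, cj), (cj, ci)):     # edge crossing the two color classes
            cost += w
    return cost

def merge_until_k(edges, coloring, k):
    """Greedily merge the pair of colors with minimum merging cost until k colors remain."""
    while len(set(coloring.values())) > k:
        colors = sorted(set(coloring.values()))
        ci, cj = min(combinations(colors, 2),
                     key=lambda pair: merging_cost(edges, coloring, *pair))
        for v in coloring:                       # relabel cj as ci
            if coloring[v] == cj:
                coloring[v] = ci
    return coloring

# Example: edges = [(0, 1, 2.0), (1, 2, 1.0)]; coloring = {0: 0, 1: 1, 2: 2};
# merge_until_k(edges, coloring, 2) first merges colors 0 and 2 (no edge joins them).
```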

Theorem 15. The HapColor algorithm described in Algorithm 8 runs in polynomial time.

Proof. Let m be the number of fragments in X, where |VF| = m. It is known that Phase (I), which is based on DSATUR, runs in O(m^2). In Phase (II), the ‘while’ loop can iterate m times in the worst case, and during each iteration the merging cost for every pair of colors needs to be computed, which takes O(m^2). The overall complexity of the second phase is therefore O(m^3), and so is the overall complexity of HapColor.

5.4 Validation

All experiments were run on a Linux x86 server with 16 CPU cores at 2.7 GHz and 32 GB of RAM. The haplotype assembly algorithms performed per-block haplotype reconstruction on the input DNA short reads. Each block consisted of the reads that did not cross adjacent blocks. The haplotypes generated from each block were concatenated to form chromosome-wide haplotypes.

5.4.1 Polyploidy Datasets

To the best of our knowledge, there is no publicly available real dataset of polyploidy organisms suitable for evaluation of haplotype assembly algorithms [NGB08, BYP14a, AI13]. Although the genomes of various polyploidy organisms have been sequenced and released using NGS technology, the corresponding haplotype sets and short reads are not provided. In the absence of such a dataset, the following two datasets are prepared (details are discussed in Section 5.4.3 and Section 5.4.6).

• Simulated Polyploidy Dataset: Following an approach similar to that of other researchers [BYP14a], a polyploidy dataset is simulated for Triticum aestivum (Bread Wheat), which has a very large genome and is classified as a hexaploid. Haplotype copies and the corresponding fragments are generated for one of the 7 chromosomes (5A), with an estimated size of 2 Gbp. To generate extra haplotype copies for the decaploid case, a recombination of the haplotypes in the hexaploid data is used.

• Modified HuRef Dataset: The second dataset is based on the well-known publicly available HuRef dataset [LSN07], which is diploid. This dataset is carefully modified to represent a polyploidy genome.

5.4.2 Comparative Analysis Approach

All analyses are performed on four different haplotype assembly algorithms: HapColor, HapTree, Greedy, and RFP (Random Fragment Partitioning). Greedy, inspired by Levy’s greedy algorithm [LSN07], is a fragment partitioning method originally designed for diploid haplotyping. The original algorithm generates haplotypes by greedily allocating the fragments to two disjoint partitions, each of which forms one haplotype copy; at each iteration, a selected fragment is assigned to one of the final partitions. To accommodate more than two haplotypes, this algorithm is slightly modified. The RFP algorithm is developed to provide a baseline performance for polyploid haplotyping; it partitions the fragments into K groups at random and builds one haplotype copy from each partition (see the sketch after this paragraph). HapTree uses a relative likelihood function to measure the concordance between the aligned read data and a given haplotype under a probabilistic model. To identify a haplotype of maximal likelihood, HapTree finds a collection of high-likelihood haplotype partial solutions, which are restricted to the first n0 SNP sites, and extends those to high-likelihood solutions on the first n0 + 1 SNP sites [BYP14a].
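The RFP baseline is essentially the following few lines. This is a hypothetical re-implementation for illustration only; the consensus step and the ‘–’ missing-value convention follow the fragment-matrix encoding used throughout the chapter.

```python
import random
from collections import Counter

def rfp(fragments, k, seed=0):
    """Random Fragment Partitioning baseline: assign each read to one of k
    partitions at random, then take the per-SNP consensus of each partition.

    fragments: list of reads, each a list of alleles ('0', '1', or '-' for missing).
    Returns k haplotypes as lists of alleles ('-' where a partition has no coverage).
    """
    rng = random.Random(seed)
    n = len(fragments[0])
    partitions = [[] for _ in range(k)]
    for frag in fragments:
        partitions[rng.randrange(k)].append(frag)

    haplotypes = []
    for part in partitions:
        hap = []
        for site in range(n):
            alleles = [f[site] for f in part if f[site] != '-']
            hap.append(Counter(alleles).most_common(1)[0][0] if alleles else '-')
        haplotypes.append(hap)
    return haplotypes
```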

5.4.3 Preparation of Simulated Data

The approach to generating a simulated polyploidy dataset is consistent with the method described in [BYP14a]. In order to simulate a haplotype, the haplotype length, Lh, and a ploidy level, ‘K’, are fixed. The SNP positions are identified assuming that the distance between adjacent SNPs follows a geometric random variable with parameter p, the SNP density. For each identified SNP, its haplotype value is randomly generated, assuming that the alternative and reference alleles are equally likely. After a haplotype is generated, |R| paired-end reads are generated, where the number of reads |R| is a function of the coverage, CX. To generate a read set with CX coverage, each base pair needs to be covered by C reads on average. Given the haplotype length Lh and the read length lr, the total number of simulated reads is equal to

|R| = \frac{L_h \times C}{2 \times l_r}    (5.9)

Many of these reads will cover a small number of SNPs; thus, for CX coverage, the number of useful reads for any SNP will be less than C. In order to generate a paired-end read, a starting point on the genome is uniformly selected. The read length lr is fixed to 250 in these experiments. The fragment length is normally distributed with a mean of 600 and a standard deviation of 30. The insert length is determined by the fragment length, lf, and the read length; that is, linsert = lf − 2 × lr. Once the start position and fragment length are known, the chromosome to read from is selected uniformly among the K chromosomes. Finally, uniform error is injected into the reads: for every SNP that a read covers, independently with probability ε the allele is flipped to any other allele.
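The simulation recipe above can be summarized in a short sketch. Parameter names are illustrative, the alphabet is simplified to biallelic 0/1, and the paired-end insert bookkeeping is omitted for brevity; this is a sketch under those assumptions, not the generator actually used for the experiments.

```python
import numpy as np

def simulate_polyploid_reads(hap_len, k, snp_density, coverage, read_len,
                             error_rate, seed=0):
    """Simulate K haplotypes over SNP sites and error-injected reads:
    geometric SNP spacing, equally likely alleles, |R| = hap_len * coverage / (2 * read_len)
    reads (eq. 5.9), and per-SNP allele flips with probability error_rate."""
    rng = np.random.default_rng(seed)

    # SNP positions: inter-SNP distances are geometric with parameter snp_density.
    gaps = rng.geometric(snp_density, size=hap_len)
    snp_pos = np.cumsum(gaps)
    snp_pos = snp_pos[snp_pos < hap_len]

    # K haplotype copies, one random biallelic value (0/1) per SNP per copy.
    haplotypes = rng.integers(0, 2, size=(k, len(snp_pos)))

    n_reads = int(hap_len * coverage / (2 * read_len))
    reads = []
    for _ in range(n_reads):
        start = rng.integers(0, hap_len)                 # uniform start position
        chrom = rng.integers(0, k)                       # uniform source chromosome
        covered = np.where((snp_pos >= start) & (snp_pos < start + read_len))[0]
        if len(covered) < 2:                             # keep reads spanning >= 2 SNPs
            continue
        alleles = haplotypes[chrom, covered].copy()
        flips = rng.random(len(alleles)) < error_rate    # inject uniform errors
        alleles[flips] = 1 - alleles[flips]
        reads.append((covered, alleles))
    return haplotypes, reads
```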

Figure 5.3: HapColor performance in terms of normalized MEC (a) and reconstruction rate (b) as a function of error rate for various polyploidy data; normalized MEC as a function of error rate for various coverage levels (c); and evolution of the algorithm during color merging (d).

5.4.4 HapColor Performance on Simulated Data

Figure 5.3 shows various performance measures for HapColor on the generated polyploidy data. Figure 5.3(a) shows the accuracy of HapColor as a function of error rate for various ploidy levels. The accuracy, reported as MEC/SNP, is measured by MEC values normalized by the number of columns in the fragment matrix. With an error-free fragment matrix (i.e., ε = 0), the MEC obtained by HapColor was 0 for all ploidy levels except hexaploid and decaploid, which had MEC values of 0.0017 and 0.0006, respectively. This can be explained by the sub-optimality of the heuristic coloring algorithm, Phase (I) in Algorithm 8. By varying the error rate from 0% to 25%, the overall MEC ranged from 0 to 1.48. On average, the MEC was 0.79, 0.38, 0.16, and 0.032 for triploid, tetraploid, hexaploid, and decaploid, respectively. Since all ploidy levels use fragment matrices of the same size (i.e., m and n are fixed across all ploidy levels), MEC naturally decreases for higher ploidy numbers. This is primarily because, with a higher ploidy, the chance of experiencing conflicts among fragments and reconstructed haplotypes decreases, since the fragments are distributed across a larger number of partitions.

Figure 5.3(b) shows the amount of reconstruction rates as a function of error rates. The overall reconstruction rate was 71.01% averaged over all error rates and ploidy levels with a fixed coverage of C=15. The reconstruction rate ranged from 60.4% for decaploid to 88.8% for triploid.

Figure 5.3(c) illustrates the amount of MEC as a function of error rates for various coverage levels for the tetraploid data. The amount of MEC ranged from 0.0098 for lowest coverage, C=5 to 0.96 for C=25 with an overall MEC/SNP value of 0.43.

Figure 5.3(d) illustrates how the two phases of HapColor work. The graph shows this process for the tetraploid data with  = 0.1. The graph shows how MEC changes after completion of Phase (I), vertex coloring, and during each iteration in Phase (II), color merging. Although the data is tetraploid (K = 4), the vertex coloring generates 8 distinct colors (q = 8) in Phase (I). During each iteration of Phase (II), two colors are combined resulting in reducing the number of colors by 1. The color merging iterates 4 times until 4 distinct colors are left in the graph q = K = 4). As the colors are merged, the amount of MEC also increases, from the original 0 to 0.35, due to the errors in the fragment matrix which in turn mapped some conflicting fragments to the same partition.

Table 5.2: MEC comparison of HapColor with other algorithms on simulated polyploidy data.

Dataset       HapColor   HapTree   Greedy   RFP
Triploid      6439       8452      24287    62133
Tetraploid    2926       3897      7252     31508
Hexaploid     2230       3074      5318     26364
Decaploid     437        596       980      5478
Total         12032      16019     37837    125483
Improvement   –          24.9%     68.2%    90.4%

5.4.5 Comparative Analysis on Simulated Data

Table 5.2 shows a comparison of the four algorithms in terms of total MEC for four ploidy levels: triploid, tetraploid, hexaploid, and decaploid. The MEC values reported in this table represent absolute MEC values rather than the normalized MEC reported previously. As shown in this table, HapColor outperforms all other algorithms. Overall, HapColor reduces the MEC of HapTree, Greedy, and RFP by 24.9%, 68.2%, and 90.4%, respectively.

5.4.6 Comparative Analysis on HuRef Data

The HuRef dataset used for this analysis contains reads for all 22 chromosomes of an individual, J.C. Venter. The original dataset includes 32 million DNA short reads generated by the Sanger sequencing method, with 1.85 million genome-wide heterozygous sites. The reads are paired-end (each end up to 200 bp), the insert size follows a normal distribution with mean 1000, and the heterozygosity rate is 1/1200 bp. Detailed statistics about this dataset can be found in [MW14b]. The original HuRef dataset is diploid, but since it is the only publicly available dataset that includes not only fragments but also the haplotype set of the individual, it was utilized for simulation of a polyploidy dataset. The extra haplotype copies are generated based on the statistics of the original haplotypes. All statistics such as variant sites, read lengths, and the length of each chromosome are maintained. The contents of the variant sites in the extra haplotypes are generated randomly among the four alleles and then encoded to be compatible as input to the algorithm. For each chromosome, the actual chromosome length and a fixed haplotype length, Lh, are used. The position and length of each read and the insert size remain intact (i.e., as in the original dataset). For each SNP covered by a read, a haplotype is uniformly selected among the K haplotypes. Similar to the previous approach, uniform error is added to the reads: for every SNP that a read covers, independently with probability ε the allele is flipped to any other allele.

In order to perform a genome-wide analysis, all algorithms are run on the whole-genome HuRef-based dataset. Table 5.3 shows a comparison of the accuracy measure, MEC. As shown in Table 5.3, HapColor outperforms all other algorithms. The amount of reduction in MEC ranged from 22.0% for HapTree on triploid data to 85.1% for RFP on decaploid data. Overall, HapColor achieves 22.9%, 52.2%, and 75.3% reduction in the MEC scores obtained by HapTree, Greedy, and RFP, respectively. As shown in this table, the amount of improvement in the MEC measure decreases as the ploidy number grows. This can be explained by the fact that, as the ploidy number grows, the likelihood of assigning fragments to wrong partitions increases, resulting in larger MEC values for higher ploidy numbers.

5.5 Discussion

In this chapter, the design and validation of HapColor, a highly scalable and generalizable haplotype assembly framework, is presented, with promising results on two dataset types with various ploidy levels. The framework offers a computationally simple algorithm to partition DNA short reads into disjoint sets and build any number of polyploidy haplotypes accordingly.

HapColor focuses on minimizing the MEC objective function, which is the most widely used metric in haplotype assembly problems. Other objective functions for haplotype assembly have previously been proposed, such as switching error (SWER), minimum weight edge removal (MWER), percentage of phased SNPs, minimum fragment removal (MFR), minimum SNP removal (MSR), and haplotype length (HL). However, the most common source of error is base miscalling, and the MEC objective serves as a good model for this type of error [BB08]. Moreover, MEC is the only accuracy measure that is reported by all haplotype assemblers, since it can indirectly model other sources of error; e.g., a haplotype assembly with a low MEC score is also likely to be good under the minimum fragment removal objective. Additionally, several of the aforementioned measures are algorithm-specific and cannot be used for comparison across different assemblers.

As is the accepted approach in the literature, nearly all haplotype assembly algorithms follow the tradition of biallelic encoding rather than representing each SNP site with multiple (i.e., more than two) alleles. Additionally, a biallelic encoding allows us to compare the performance of HapColor with that of other algorithms. Yet, the methodology used in HapColor is independent of the assumption of a biallelic encoding. With minimal effort, the distance function and MEC calculation algorithms can be modified to accommodate a multi-allele representation in HapColor. A multi-allele encoding is not expected to have a negative impact on HapColor’s performance, but a full investigation of its potential impact is left as future work.

Techniques such as the one presented in [NGB08] cannot be used in our evaluation because such algorithms focus on haplotype inference which does not use NGS data.

Although very promising results are achieved in this project, the HapColor framework can potentially be extended to ploidy-agnostic haplotyping, where indels and CNVs are included in the haplotyping process. Moreover, future work involves solving problems of mixed ploidy in various organisms by dynamically learning the ploidy number, which may also find applications in organism identification.

5.6 Discussion and Conclusion

Due to limited access to sequencing data from polyploidy organisms, researchers in computational biology have been less inclined to develop algorithms for polyploidy phasing. We think it is essential to understand the structure of polyploidy organisms (notably wheat), not only for the sake of understanding their genome structure, but also because of its application in studying the effect of nutrition and lifestyle on human metabolism-related diseases, among others. Therefore, we started this project despite the very limited research available in this area. In this chapter, HapColor, which is widely applicable to evolutionary studies of polyploids, is presented. HapColor is built on the concept of graph coloring: it efficiently partitions DNA reads and reconstructs one haplotype copy from each partition. HapColor is compared with three other haplotype assemblers, namely the HapTree, Greedy, and RFP algorithms. Our extensive analysis showed that HapColor outperforms all these techniques on polyploidy data. The reduction in the MEC measure obtained using HapColor ranged from 22.9% to 87.9%, depending on the ploidy level of the organism and the algorithm used for comparison.

Table 5.3: MEC comparison of HapColor, HapTree, Greedy, and RFP on HuRef-based data.

              Triploid                          Tetraploid                         Hexaploid                        Decaploid
Chr    Color   Tree    Greedy   RFP      Color   Tree    Greedy   RFP      Color   Tree    Greedy   RFP      Color   Tree    Greedy   RFP
1      34204   43507   65261    94628    18256   22838   37589    67774    7923    10260   17502    39082    2115    2856    4402     14598
2      51217   65916   102485   135998   18810   23795   37112    59416    8728    11364   17176    42305    2182    2782    4892     16011
3      39411   50288   89462    118985   21147   27026   41216    75054    8713    11143   17068    41646    1804    2395    3621     11560
4      68698   88895   152990   223365   24359   30863   46866    70674    10036   12655   22159    51032    2840    3758    5797     19796
5      59546   75921   124034   181711   35406   47692   74529    112837   15295   19609   30499    69812    4527    5664    9942     30582
6      47404   59919   99596    148896   74423   98983   142446   269365   29918   39941   64025    151803   7659    10309   15050    46459
7      55924   70744   119342   160992   30583   41073   69883    118452   15108   19762   34280    80251    3807    4900    8681     28672
8      31938   40146   62215    76711    21985   29503   44607    76812    9475    11996   20561    49039    1905    2512    3685     11476
9      33839   42637   77593    116390   11669   15286   24353    37576    5298    6818    10701    26517    1155    1527    2383     7619
10     43795   55619   95429    126825   12081   16141   26531    49745    5255    7079    11683    23599    1130    1471    2487     8070
11     55448   72083   114279   123650   21486   28211   47828    77147    10636   13582   23813    48317    2989    4011    6228     20728
12     43916   57442   97055    112584   22972   29013   48585    78805    9832    13263   20794    50821    2635    3449    5554     16708
13     52204   69483   119337   159196   38345   48430   79490    134099   17102   22506   35179    79047    4036    5178    8153     26513
14     45799   58165   96270    133334   24274   30342   47261    80438    10462   14009   23581    52987    2207    2909    4327     13798
15     23391   29777   44654    62962    18537   23838   37018    63004    8990    11597   18106    36720    2275    3018    4522     15225
16     39801   50428   83105    112856   21594   28094   41655    79645    10041   12672   19550    41056    2330    2994    4715     14890
17     22500   29969   50849    64527    12368   15954   24983    42071    4959    6244    10653    25397    1418    1854    3034     10501
18     16464   20844   35744    43572    9638    12577   20759    38820    4260    5632    9078     19090    1116    1499    2421     7628
19     16721   21302   32723    35537    13550   18116   29470    55434    6057    7928    13900    28745    1569    2077    3464     11191
20     16090   21545   35141    39077    6498    8214    13730    25676    2781    3529    5357     11966    756     961     1483     4657
21     11218   14437   24129    35880    7870    10026   16951    29275    3872    5041    8681     20139    883     1127    1783     5795
22     12295   14262   28242    41064    7961    10206   15644    27392    3399    4457    6802     13829    744     948     1603     5568
Total  822K    1053K   1750K    2349K    474K    616K    968K     1669K    208K    271K    441K     1003K    52K     68K     108K     348K
Imp    –       22.0%   53.0%    65.0%    –       23.1%   51.1%    71.6%    –       23.2%   52.8%    79.3%    –       23.6%   51.9%    85.1%

CHAPTER 6

Correlation Clustering for Polyploid Phasing

6.1 Introduction

Polyploidy refers to the presence of more than two copies of each chromosome in the cells of an individual organism. The occurrence of polyploidy is known to have resulted in new species. Many species that are currently diploid, including humans, were derived from polyploid ancestors [MP05]. Polyploidy is common in ferns and flowering plants [WTB09] and exists in animals such as fish and amphibians. Furthermore, recent evidence suggests that polyploidy is more important in animal evolution than was previously thought [Gla10]. Despite the fact that polyploid phasing finds important applications in genomic studies such as cancer genomics, metagenomics, and viral quasispecies sequencing, the majority of current phasing techniques focus on diploids. Only recently have researchers started developing computational techniques for polyploid haplotyping [BYP14a, BYB15, DV15].

As new generations of sequencing technologies continue to advance, they are poised to deliver larger, more accurate, and less costly data for DNA phasing, thus allowing for single-individual haplotyping. Although DNA phasing has been studied for over a decade [BB11] through experimental and computational methods, it has remained a challenging research problem due to the costs of experimental techniques and the complexity of computational approaches. Unlike experimental phasing techniques, which are less practical on large datasets, computational approaches have received more attention because of their promising practical feasibility and cost-efficiency. Nonetheless,

only limited research has been conducted on developing computational phasing algorithms that are not only accurate but also scalable on large datasets. Our polyploidy algorithm presented in the last chapter, HapColor, is no exception. There are aspects of polyploid phasing that it does not address. For instance, HapColor reports its accuracy via the MEC score, whose validity diminishes as the ploidy number increases owing to the nature of MEC. Also, for many organisms we lack access to the haplotype set of the organism under study, so there is no choice other than using the currently most popular accuracy measure, Minimum Error Correction. Yet, if the haplotype set is available, SWitching Error (SWER) is a more appropriate option for evaluating an algorithm’s performance.

6.1.1 Contributions and Summary of Results

Chapter 5 introduced HapColor with a primary focus on optimizing the Minimum Error Correction criterion. In this chapter, a new polyploid haplotyping framework, called PolyCluster, is introduced based on the concept of correlation clustering. An advantage of PolyCluster over HapColor is that it partitions fragments based on a combined measure of inter-partition and intra-partition similarity/dissimilarity and is therefore expected to result in higher-quality haplotypes. In developing and validating PolyCluster, our contributions can be summarized as follows.

• A formal definition of partitioning-based haplotyping with the objective of minimizing the amount of partitioning error, computed as combined inter-partition similarity and intra-partition dissimilarity, will be presented (Section 6.2.1).

• The optimization problem will be transformed into a correlation clustering problem on a similarity graph model that captures the amount of similarity between pairs of DNA short reads (Section 6.2.2).

• A two-phase algorithm, consisting of a fragment clustering method on the introduced graph based on linear programming, rounding, and cluster-region-growing, followed by a cluster-merging technique, will be introduced to accurately partition short reads and reconstruct the haplotype associated with each cluster (Section 6.3).

• A comprehensive analysis of PolyCluster’s performance and a comparison with several polyploid haplotyping algorithms will be presented (Section 6.4).

Our results demonstrate that PolyCluster substantially improves the accuracy of haplotype assembly in terms of switching error and running time while achieving comparable results in terms of the minimum error correction score. Specifically, PolyCluster reduces the switching error of HapColor, HapTree, KMedoids clustering, and Greedy by 51.2%, 48.3%, 64.8%, and 66.34%, respectively. Furthermore, PolyCluster is several orders of magnitude faster than HapTree while achieving a running time comparable to that of HapColor.

6.2 Problem Statement

The goal of haplotype assembly is to re-build each copy of an organism’s chromosome from a large collection of DNA reads. Conventional individual haplotype assemblers often take as input a standard matrix, called the fragment matrix, which contains the DNA short reads. Each row in a fragment matrix is associated with a read and each column represents a SNP site. Each read is represented by a sequence of alleles from the alphabet A = {A, T, C, G, ‘–’}, where ‘–’ refers to a SNP site not covered by the read. However, using genotype calls, the underlying alphabet is reduced to a smaller alphabet, as discussed in the following section.

In many variant loci of an organism, the original alphabet can be encoded into a quinary alphabet Σ = {00, 01, 10, 11, ‘–’} to ease the computation, where each two-bit entry in Σ refers to an allele at a particular SNP site and ‘–’ indicates lack of information, either because the fragment does not span the site or because of a failure of the sequencing assay. In read matrices associated with higher-ploidy genomes, only the most popular alleles for each locus are kept; any other allele that is seen rarely is treated as ‘missing’ and is substituted with ‘–’ in the fragment matrix. Let {x1l, x2l, x3l, ..., xml} be the elements associated with the l-th column in a fragment matrix X. If, for example, the two most popular alleles are ‘A’ and ‘C’, their binary encoded values are saved; the values assigned to such cells in X are arbitrarily chosen (e.g., ‘00’ and ‘01’). If there is another allele in the same column in addition to ‘A’ and ‘C’ but it is seen rarely, that allele will be discarded and a ‘–’ placed in the corresponding cell of X. Additionally, the ploidy level, ‘K’, is known a priori in polyploid haplotyping. If the ploidy level is unknown, the problem is transformed into organism identification, which is out of the scope of this study.

DNA reads are aligned to a reference genome sequence prior to construction of the fragment matrix. The alignment is performed using a mapping/alignment algorithm, which may introduce errors in the short reads. Furthermore, all homozygous sites are discarded prior to mapping to the fragment matrix, as in all current haplotype assemblers. Since the SNP sites are sparse across the genome, only the fragments that cover at least two SNP sites are maintained.

Per-block haplotyping is performed by identifying disconnected haplotype blocks. Two adjacent haplotype blocks are disconnected if no read covers at least one SNP from each block. Identifying such blocks can be done using a simple algorithm as follows (a sketch is given after this paragraph). The algorithm starts with the first SNP site in the fragment matrix and puts that site in a queue. Next, it finds all SNP sites that are connected to the first site by identifying all SNP sites that share at least one read covering both sites; all such sites are added to the queue. At each point in time, one site is taken from the queue and any site connected to it is found and added to the queue. This process repeats until no sites are left in the queue. When the queue is empty, one disconnected block has been identified, consisting of all sites processed through the queue. This process is repeated for the remaining sites until all of them are covered. This simple algorithm allows us to find the disconnected blocks in the input fragment matrix.
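The block-identification procedure above is a standard breadth-first traversal over SNP-site connectivity. A minimal sketch follows, assuming the fragment matrix is given as a list of reads, each a dict from SNP index to allele; names are illustrative.

```python
from collections import deque

def disconnected_blocks(reads, n_sites):
    """Group SNP sites into disconnected blocks: two sites are connected if at
    least one read covers both, and blocks are the resulting connected components."""
    # Map each site to the reads covering it.
    site_reads = [[] for _ in range(n_sites)]
    for r, read in enumerate(reads):
        for site in read:
            site_reads[site].append(r)

    unassigned = set(range(n_sites))
    blocks = []
    while unassigned:
        seed = min(unassigned)               # start from the leftmost unprocessed site
        queue, block = deque([seed]), {seed}
        unassigned.discard(seed)
        while queue:
            site = queue.popleft()
            for r in site_reads[site]:       # every site sharing a read joins the block
                for other in reads[r]:
                    if other in unassigned:
                        unassigned.discard(other)
                        block.add(other)
                        queue.append(other)
        blocks.append(sorted(block))
    return blocks


# Toy usage: reads over sites {0,1}, {1,2} and {4,5} give blocks [0,1,2], [3], [4,5]
# (uncovered site 3 forms a singleton block).
reads = [{0: '0', 1: '1'}, {1: '1', 2: '0'}, {4: '0', 5: '1'}]
print(disconnected_blocks(reads, 6))
```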

The general form of polyploid haplotyping is to reconstruct ‘K’ haplotypes from a given set of reads obtained by DNA sequencing from ‘K’ copies of a chromosome. Solving this general haplotype assembly problem is challenging due to insufficient coverage and erroneous readings. Insufficient coverage refers to the fact that DNA sequencing technologies provide only small overlapping reads, resulting in only a subset of SNP sites being reconstructed in the final haplotype set. The problem becomes more challenging if the fragment matrix contains errors such as sequencing errors, chimeric read pairs, and false variants. In the real world of molecular biology, experiments are never error-free. These errors result in conflicting reads drawn from the same haplotype copy. The conflicting data prevent us from reliably reconstructing the correct allele at each SNP site in the resulting haplotype.

Polyploid haplotyping is defined as follows: given a fragment matrix X obtained from ‘K’ copies of a chromosome, reconstruct ‘K’ haplotypes such that an objective function is optimized. As discussed previously, a number of objective functions have been introduced for diploid haplotyping in the past; examples include Minimum Error Correction (MEC), Minimum Fragment Removal (MFR), Minimum SNP Removal (MSR), and Longest Haplotype Reconstruction (LHR) [LBI01]. In Minimum Fragment Disagreement (MFD), the goal is to find a partitioning of the fragments such that the sum of inter-partition similarity and intra-partition dissimilarity is minimized. We hypothesize that by combining inter-partition similarity and intra-partition dissimilarity in the clustering process, we will achieve haplotypes of higher quality in terms of minimizing switching error. This hypothesis is validated in our results section.

6.2.1 Problem Definition

Our approach to polyploid haplotyping is essentially a fragment partitioning method. As soon as the fragments are divided into ‘K’ disjoint partitions, each partition can be used to merge the fragments residing in that partition and build one haplotype copy. This problem is formally defined as follows.

Let Xm×n be a given fragment matrix containing SNP values xil ∈ Σ, obtained from fragments F = {f1, ..., fm} pertaining to n SNP sites, S = {s1, ..., sn}. Furthermore, let ‘K’ be the ploidy level, which is the number of chromosome copies from which the fragments in F are derived. Suppose that our algorithm assigns each fragment fi to one of the ‘K’ disjoint partitions P = {p1, ..., pK}, initialized as pk = ∅. Each set pk can be used to build a haplotype hk by merging the fragments in pk, resulting in the haplotype set H = {h1, ..., hK}. Polyploid haplotyping is then the problem of assigning each fragment fi in X to one of the ‘K’ partitions pk such that the overall cost, Z, in (6.1) is minimized.

Z = \sum_{i,j} sim(f_i, f_j)\, a_{ij} + \sum_{i,j} dis(f_i, f_j)\, (1 - a_{ij})    (6.1)

where sim and dis refer to similarity and dissimilarity, respectively, and aij is a binary variable indicating whether or not fragments fi and fj reside in different partitions. That is:

a_{ij} = \begin{cases} 1 & \text{if } f_i \in p_k \ \&\ f_j \in p_l \text{ s.t. } p_k \neq p_l \\ 0 & \text{otherwise} \end{cases}    (6.2)

The first term in (6.1) (i.e., \sum_{i,j} sim(f_i, f_j)\, a_{ij}) accounts for inter-partition similarity, while the second term (i.e., \sum_{i,j} dis(f_i, f_j)\,(1 - a_{ij})) refers to the dissimilarity of fragments that fall into the same partition and therefore accounts for intra-partition dissimilarity.
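A direct transcription of (6.1)–(6.2) is shown below. It is a small sketch: sim and dis are left abstract (any pairwise similarity/dissimilarity over fragments works), and the assignment mapping is hypothetical.

```python
def partitioning_cost(sim, dis, assignment):
    """Cost Z of a fragment partitioning, per (6.1): inter-partition similarity
    plus intra-partition dissimilarity.

    sim, dis: functions taking two fragment indices and returning a score.
    assignment: dict mapping fragment index -> partition label.
    """
    fragments = sorted(assignment)
    z = 0.0
    for idx, i in enumerate(fragments):
        for j in fragments[idx + 1:]:
            a_ij = 1 if assignment[i] != assignment[j] else 0   # eq. (6.2)
            z += sim(i, j) * a_ij + dis(i, j) * (1 - a_ij)
    return z

# Toy usage with precomputed score matrices S (similarity) and D (dissimilarity):
# z = partitioning_cost(lambda i, j: S[i][j], lambda i, j: D[i][j],
#                       {0: 'p1', 1: 'p1', 2: 'p2'})
```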

Problem 5 (Minimum Fragment-Disagreement Polyploid Haplotyping (MFDPH)).

Given Xm×n composed of m fragments F = {f1, ... , fm}, each of length n, and ploidy level ‘K’, find partition P = {p1, ... , pK } of the fragments (each pk corresponding to a haplotype hk) such that the cost function in (6.1) is minimized.

In contrast with conventional optimization approaches that focus on minimizing the MEC score, PolyCluster devises an optimization approach that takes into consideration both the similarity and the dissimilarity among DNA reads when constructing the final haplotypes. The hypothesis here is that optimization based on the cost function in (6.1) is a better proxy for optimizing switching error than the case where the optimization algorithm is solely focused on the dissimilarity or similarity of the reads.

6.2.2 Graph Modeling

In order to solve the optimization problem in Problem 5, we introduce a weighted graph model, called similarity graph, which uses an agreement score as edge weights. We will then use this graph model for vertex partitioning on the graph in order to obtain the fragment partitions discussed before. The introduced graph model aims to quantify the amount of inter-fragment agreement for each pair of fragments in X .

Definition 17 (Similarity Graph). A similarity graph G = (V, E, W) is an undirected weighted graph where V = {v1, ..., vm} is a set of vertices such that each vertex vi is associated with a fragment fi in X. An edge eij ∈ E exists if it is assigned a non-zero weight (i.e., wij ≠ 0). For each edge eij in G, the weight wij is given by

w_{ij} = \frac{1}{n} \left( \sum_{l=1}^{n} \theta(x_{il}, x_{jl}) - \sum_{l=1}^{n} \delta(x_{il}, x_{jl}) \right)    (6.3)

where θ(xil, xjl), the match function, and δ(xil, xjl), the mismatch function, are computed according to (6.4) and (6.5), respectively.

\theta(x, y) = \begin{cases} 1 & \text{if } x = y \ \&\ x \neq \text{'–'} \ \&\ y \neq \text{'–'} \\ 0 & \text{otherwise} \end{cases}    (6.4)

\delta(x, y) = \begin{cases} 1 & \text{if } x \neq y \ \&\ x \neq \text{'–'} \ \&\ y \neq \text{'–'} \\ 0 & \text{otherwise} \end{cases}    (6.5)

The match function θ(xil, xjl) is a binary similarity metric that indicates whether or not there is a match between xil and xjl in the original fragment matrix. Similarly, the mismatch function δ(xil, xjl) is a binary distance metric that indicates whether or not there is a mismatch between xil and xjl. Intuitively, \sum_{l=1}^{n} \theta(x_{il}, x_{jl}) represents the number of overlapping sites with an identical base and \sum_{l=1}^{n} \delta(x_{il}, x_{jl}) gives the number of overlapping sites with different SNP values in fi and fj. The weights wij are normalized according to the amount of overlap between the two fragments fi and fj as well as the length of the partial haplotype (i.e., n). In a similarity graph, an edge associated with two similar fragments receives a positive weight (i.e., wij > 0), and dissimilar fragments receive negative weights (i.e., wij < 0). Similar fragments are expected to be mapped onto the same partition while dissimilar fragments are meant to reside in different partitions. Therefore, the edge weights in the similarity graph can be treated as an agreement score of the two fragments on residing within the same partition. In other words, two fragments associated with a positive agreement score are expected to be grouped together, while a negative agreement score indicates that the corresponding fragments should be assigned to different partitions.
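The edge weight in (6.3) can be computed per pair of reads as follows; this is a minimal sketch assuming each fragment is a string over {'0', '1', '-'} of length n.

```python
def edge_weight(fi, fj):
    """Agreement score w_ij from (6.3): (# matching overlapped sites
    minus # mismatching overlapped sites) divided by the fragment length n."""
    n = len(fi)
    matches = sum(1 for a, b in zip(fi, fj) if a != '-' and b != '-' and a == b)
    mismatches = sum(1 for a, b in zip(fi, fj) if a != '-' and b != '-' and a != b)
    return (matches - mismatches) / n


# The two reads below overlap at two sites and agree on both: w = 2/6 ≈ 0.33.
print(edge_weight('1111--', '--1111'))
```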

An example of a fragment matrix and its associated similarity graph is shown in Figure 6.1. The fragments are drawn from a triploid haplotype with two sequencing errors, shown in ‘red’ in X. The edge weights are computed based on (6.3) and are normalized. The graph has a total of 26 edges; it does not include e37 and e38 because the weight of those edges is 0.

Figure 6.1: An example of a fragment matrix with 8 short reads and 6 SNP sites (a), and the corresponding similarity graph (b). Edge labels represent weights (wij) multiplied by ten (10X) for better visualization; that is, a label ‘−3’ on edge e58 represents w58 = −0.3.

Our formulation of this problem is generic enough that it can accommodate sequencing confidence scores as well. If ‘phred-scaled’ base quality scores are available from the sequencing process, then each entry in the fragment matrix is associated with a confidence score, and PolyCluster can use these scores to generate higher-quality haplotypes. Each nucleotide in a sequencing read usually includes a ‘phred-scaled’ base quality, Q, which corresponds to an estimated probability of 10^{-Q/10} that this base has been wrongly sequenced. These phred scores can serve as the cost of flipping a letter, allowing less confident base calls to be corrected at lower cost compared to high-confidence ones [GMM16].
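For completeness, the phred-to-probability conversion mentioned above is the following one-liner; how the resulting probabilities are folded into the edge weights is an implementation choice and is not specified by (6.3).

```python
def phred_to_error_prob(q):
    """Estimated probability that a base with phred quality q was miscalled."""
    return 10 ** (-q / 10)

# Q = 20 corresponds to a 1% chance of a wrong base call.
print(phred_to_error_prob(20))   # 0.01
```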

6.2.3 Problem Formulation

The haplotyping problem in Problem 5, which is a fragment partitioning problem, is equivalent to clustering the vertices in G such that the amount of disagreement among the constructed clusters is minimized. An edge with a negative weight inside a cluster is referred to as a negative mistake, and a positive edge across two clusters is called a positive mistake. The goal is to cluster the vertices into K groups such that the amount of fragment disagreement (and equivalently the overall cost in (6.1)) is minimized. Therefore, the weight of positive edges crossing clusters and the weight of negative edges inside clusters (i.e., the absolute value of such weights) should be minimized.

For any vertex vi in G, a positive neighborhood is defined as N^+(vi) = {vi} ∪ {vj : wij > 0}. Similarly, the negative neighborhood of vi is specified as N^−(vi) = {vi} ∪ {vj : wij < 0}. Furthermore, let C(vi) be the set of vertices that are in the same cluster as vi, given a clustering C = {C1, ..., CK}. Given a similarity graph G = (V, E, W), let bij be a binary variable denoting whether or not the two vertices vi and vj are in different clusters. That is:

b_{ij} = \begin{cases} 1 & \text{if } v_i \in C_k \ \&\ v_j \in C_l \text{ s.t. } C_k \neq C_l \\ 0 & \text{otherwise} \end{cases}    (6.6)

The number of mistakes due to clustering C is the sum of the positive and negative mistakes. We define the weight of the clustering as the sum of the weight of erroneous edges in C. For a clustering C, the weight of the clustering is given by

w(C) = w_p(C) + w_n(C)    (6.7)
     = \sum_{v_i \notin C(v_j),\ w_{ij} > 0} w_{ij} + \sum_{v_i \in C(v_j),\ w_{ij} < 0} |w_{ij}|    (6.8)

If the edge eij is within a cluster, then (1 − bij) = 1, and (1 − bij) = 0 otherwise. As a result, the total weight of clustering C is given by

w(C) = \sum_{w_{ij} > 0} w_{ij}\, b_{ij} + \sum_{w_{ij} < 0} |w_{ij}|\, (1 - b_{ij})    (6.9)

In order to minimize disagreement among the clusters in C, we need to find an assignment of the bij values that minimizes the total weight in (6.9) such that bij ∈ {0, 1} and bij satisfies the triangle inequality. We formulate the problem as an Integer Linear Program (ILP) as follows.

Minimize  \sum_{w_{ij} > 0} w_{ij}\, b_{ij} + \sum_{w_{ij} < 0} |w_{ij}|\, (1 - b_{ij})    (6.10)

Subject to:  b_{ij} \in \{0, 1\}    (6.11)
             b_{ij} = b_{ji}    (6.12)
             b_{ik} \leq b_{ij} + b_{jk}    (6.13)

This problem is closely related to the problem of correlation clustering in weighted graphs with a fixed number of clusters (i.e., the number of clusters, K, is given) [Bec05]. It has been shown that this problem is NP-hard [Bec05, SST04, DEF06, GG06]. Furthermore, it has been shown that it is APX-hard to minimize disagreements in correlation clustering (proved by a reduction from the APX-hard minimum multi-cut problem) [DI03, DJP92].

6.3 PolyCluster Algorithm

In solving the polyploid haplotyping problem formulated in Section 6.2.3, there are a number of challenges: (1) the number of clusters/partitions needs to be exactly ‘K’, the ploidy level; (2) the variables bij are integer-valued. PolyCluster tackles this problem with a two-phase algorithm: in Phase (I), we relax the constraint on the number of clusters and use a rounding approach to develop an O(log m) approximation for minimizing fragment disagreement, which yields an initial clustering. In Phase (II), highly similar clusters are merged in order to reach a final clustering with exactly |C| = K clusters. This procedure is illustrated in Algorithm 9. The evolution of the algorithm is also shown through an example in Figure 6.2.

Algorithm 9 PolyCluster Algorithm
Require: Fragment matrix X & ploidy level K
Ensure: Haplotypes H = {h1, ..., hK}
# Initialization
Construct similarity graph G(V, E, W) using X
# Phase (I): Initial Clustering
Initialize cluster index l = 1
while (V ≠ ∅) do
  Choose a vertex vi ∈ V at random
  r = 0
  while Cut(B(vi, r)) > λ ln(m + 1) × Vol(B(vi, r)) do
    Increase r by min{(wij − r) > 0 : vj ∉ B(vi, r)} so that B(vi, r) contains another edge eij
  end while
  Output cluster Cl = {vj : vj ∈ B(vi, r)}
  l = l + 1
  V = V − {vj : vj ∈ B(vi, r)}
  E = E − {eji : vj ∈ B(vi, r)}
end while
# Phase (II): Cluster Merging
while (|C| > K) do
  L = |C|, the number of clusters in C
  ∀ k, l ∈ {1, ..., L} compute ∆kl according to (6.39)
  (k̂, l̂) = argmax_{k,l} {∆kl}
  Merge C_k̂ and C_l̂
end while
For each cluster Ck, build a haplotype hk by obtaining the consensus allele at each SNP site within the partition, and add hk to H

110 10 10 1 2 1 2

7 7 7 7 3 -7 -7 3 -7 -7

-10 -10 -10 -10 -3 -10 -10 -3 -10 -10 -7 -7 -7 -7 3 3 6 3 3 3 6 3 3 3 7 7 7 7 -7 -7 4 5 4 5 10 10

-3 -3 -7 -3 -3 -3 -3 -7 -3 -3

7 8 10 7 8 # 1010

#

(a) Vertex vi = 7 chosen as first node for (b) Region growing around vi = 7 in- region growing cludes vj = 8; Cut(B1) = 1.3 and Vol(B1) = 1.7.

B2 #

1010 1010 1 2 1 2

7 7 7 7

-7 -7 3 -7 -7 3

-10 -10 -10 -10 -3 -10 -10 -3 -10 -10 -7 -7 -7 -7 B3 3 3 6 3 3 3 6 3 3 3 7 7 7 7 -7 -7 4 5 4 5 10 101

-3 -3 -3 -3 -7 -3 -3 -7 -3 -3

7 8 7 8 1010 1010

# #

(c) Region growing around vi = 1 in- (d) Region growing around vi = 6 in- cludes vj = 2; Cut(B2) = 2.7 and Vol(B2) cludes vj = 4; Cut(B3) = 1.7 and Vol(B3) = 2.3. = 1.5.

C2 #

1010 1 2 1010 1 2

B4 7 7 7 7 -7 -7 3 -7 -7 3

-10 -10 -10-10 -10 -3 -10 -10 -33 -10 -10 -7 -7 B3 -7 -77 C3 3 3 6 3 3 3 6 3 3 3 7 7 7 7 B5 -7 -7 4 5 4 5 101 1010

-3 -3 -3 -3 -7 -3 -3 -7 -3 -3

7 8 7 8 1010 1010

# C1

(e) Verties vi = 3 and vi = 5 remain (f) B4 and B5 are merged into B2 and single-node balls B4 and B5, respectively, B3, respectively, forming final clusters with Cut(B4) = 1.3, Vol(B4) = 0.7, C1, C2, and C3. Cut(B5) = 1.7, and Vol(B5) = 0.8. Figure 6.2: Evolution of Algorithm 9 for the similarity graph shown in Figure 6.1. The edge weights are multiplied by 10 for visualization. With the final clustering in (f), the amount of overall MEC is 2 with the reconstructed haplotypes H = {111111, 000000, 111100} 111. 6.3.1 Initial Clustering

The initial clustering uses a combination of rounding and region-growing techniques. The algorithm first solves a Linear Program (LP) and then uses the resulting fractional values to determine the disagreement between two vertices in G. A region-growing technique [DEF06] is then used to group similar vertices together and, finally, to round the fractional values. By relaxing the integrality constraint and the constraint on the number of clusters in (6.10)–(6.13), the goal of the initial clustering is to find C = {C1, ..., CL} that minimizes the objective function in (6.14) subject to the constraints in (6.15)–(6.17).

Minimize  \sum_{w_{ij} > 0} w_{ij}\, b_{ij} + \sum_{w_{ij} < 0} |w_{ij}|\, (1 - b_{ij})    (6.14)

Subject to:  b_{ij} \in [0, 1]    (6.15)
             b_{ij} = b_{ji}    (6.16)
             b_{ik} \leq b_{ij} + b_{jk}    (6.17)
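One illustrative way to set up the relaxed program (6.14)–(6.17) is with an off-the-shelf LP solver. The sketch below uses scipy.optimize.linprog on a dense pairwise weight matrix and enumerates all triangle constraints, which is only practical for small blocks; it is a sketch under those assumptions, not the solver used in PolyCluster.

```python
import numpy as np
from itertools import combinations
from scipy.optimize import linprog

def relaxed_clustering_lp(w):
    """Solve the LP relaxation (6.14)-(6.17) of the correlation-clustering ILP.

    w: symmetric (m x m) weight matrix of the similarity graph (0 = no edge).
    Returns the fractional distances b_ij as a symmetric matrix.
    """
    m = len(w)
    pairs = list(combinations(range(m), 2))
    index = {p: k for k, p in enumerate(pairs)}        # pair -> LP variable index

    # Objective: +w_ij * b_ij for positive edges, -|w_ij| * b_ij for negative ones
    # (the constant sum of |w_ij| over negative edges is dropped).
    c = np.zeros(len(pairs))
    for (i, j), k in index.items():
        c[k] = w[i][j] if w[i][j] > 0 else -abs(w[i][j])

    # Triangle inequalities b_ik <= b_ij + b_jk for every triple, as A_ub x <= 0.
    rows = []
    for i, j, k in combinations(range(m), 3):
        for long_side, s1, s2 in (((i, k), (i, j), (j, k)),
                                  ((i, j), (i, k), (j, k)),
                                  ((j, k), (i, j), (i, k))):
            row = np.zeros(len(pairs))
            row[index[long_side]] = 1.0
            row[index[s1]] = -1.0
            row[index[s2]] = -1.0
            rows.append(row)

    A_ub = np.array(rows) if rows else None
    b_ub = np.zeros(len(rows)) if rows else None
    res = linprog(c, A_ub=A_ub, b_ub=b_ub,
                  bounds=[(0.0, 1.0)] * len(pairs), method="highs")

    b = np.zeros((m, m))
    for (i, j), k in index.items():
        b[i][j] = b[j][i] = res.x[k]
    return b
```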

Phase (I) in PolyCluster is a greedy iterative process, with each iteration aiming to construct one cluster. Each cluster is formed by choosing a vertex vi in G and growing a “ball” around vi by iteratively adding vertices within a radius ‘r’ to the “ball”. A “ball” of radius ‘r’ around vertex vi is defined as follows.

Definition 18 (ball). A “ball” B(vi,r) of radius r around a vertex vi ∈ V is the set of vertices vj ∈ V such that bij ≤ r, as well as the subgraph formed by these vertices and the edges ejk with only one endpoint vj ∈ B(vi, r).

In constructing a “ball” around a vertex, PolyCluster greedily chooses an unclustered vertex that has the highest total edge weight. That is, a candidate vertex v̂i for region growing is chosen by

\hat{v}_i = \arg\max_{v_i \notin C} \sum_{e_{ij}} w_{ij}    (6.18)

Intuitively, this approach chooses a vertex that is likely to result in a “ball” with a larger radius, because more positive edges are connected to vi. Figure 6.2 shows the evolution of PolyCluster for the similarity graph in Figure 6.1. In Figure 6.2(a), vi = 7 is chosen as the first vertex for region growing and the formation of the first cluster because \sum_j w_{7j} = 1, which is the highest total edge weight among all vertices in the graph. Initially, the “ball” (i.e., B1) has a radius of zero (r = 0). The radius is increased to include vertex vj = 8, as shown in Figure 6.2(b). The region-growing process stops when the following condition holds:

Cut\big(B(v_i, r)\big) \leq \lambda \ln(m + 1) \times Vol\big(B(v_i, r)\big)    (6.19)

where λ is a constant, and Cut and Vol of a given “ball” B(vi, r) are defined as follows.

Definition 19 (Cut). The cut of B(vi, r) is the weight of the positive edges with exactly one endpoint in B. That is:

Cut\big(B(v_i, r)\big) = \sum_{v_j \in B(v_i, r),\ v_k \notin B,\ w_{jk} > 0} w_{jk}    (6.20)

Definition 20 (Volume). The volume of B(vi, r) is the sum of the weighted values of the edges eij ∈ B. That is:

Vol\big(B(v_i, r)\big) = \sum_{e_{ij} \in B,\ w_{ij} > 0} w_{ij}\, a_{ij}    (6.21)

The volume of B(vi, r) also includes fractional weighted values of the positive edges leaving B(vi, r): if wjk > 0 is a cut positive edge of B(vi, r) with vj ∈ B(vi, r) and vk ∉ B(vi, r), then ejk contributes a weight of wjk(r − ajk) to the volume of B(vi, r).
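The region-growing step can be summarized as follows. This is a simplified sketch under stated assumptions (the fractional contribution of cut edges to the volume is omitted, and the radius is extended one vertex at a time); it is not the PolyCluster implementation.

```python
import math

def grow_ball(center, b, w, lam, initial_volume):
    """Grow one region ("ball") around `center` until
    Cut(B) <= lam * ln(m + 1) * Vol(B), following Phase (I) of Algorithm 9.

    b: fractional distances from the LP relaxation (symmetric matrix).
    w: edge weight matrix of the similarity graph.
    Simplification: Vol(B) is the initial volume plus the full weight of
    positive edges inside the ball (partial cut-edge contributions omitted).
    """
    m = len(w)
    threshold = lam * math.log(m + 1)
    ball = {center}
    while True:
        cut = sum(w[j][k] for j in ball for k in range(m)
                  if k not in ball and w[j][k] > 0)
        vol = initial_volume + sum(w[j][k] for j in ball for k in ball
                                   if j < k and w[j][k] > 0)
        if cut <= threshold * vol:
            return ball
        # Extend the radius to the nearest outside vertex by fractional distance.
        outside = [j for j in range(m) if j not in ball]
        if not outside:
            return ball
        ball.add(min(outside, key=lambda j: b[center][j]))
```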

As shown in Figure 6.2(b), the region-growing process stops after adding vj = 8 to the “ball” B1. The reason is that Cut(B1) = 1.3 and Vol(B1) = 1.7, so the condition in (6.19) holds. We will show that λ needs to be slightly larger than 2 (i.e., λ = 2 + ε) in order for the algorithm to provide an approximation ratio of O(log m). After the region-growing process stops for one “ball”, the set of vertices contained in that “ball” forms an initial cluster. This process is repeated for the remaining unclustered vertices until all vertices in G belong to some “ball”. As shown in Figures 6.2(c)–6.2(e), the region-growing process forms four other distinct initial clusters, or “balls”, namely B2, B3, B4, and B5, around vi = 1, vi = 6, vi = 3, and vi = 5, respectively. Note that for B4 and B5, however, the initial cluster includes only a single vertex, because no non-negative edge can be added to the chosen vertices (vi = 3 and vi = 5) as a result of region growing.

6.3.2 Cluster Merging

As soon as all vertices in G become members of some initial cluster or “ball”, PolyCluster proceeds by examining whether some clusters need to be combined to reach the final cluster number ‘K’. This process is iterative, merging two clusters at a time. If the number of clusters is larger than ‘K’ (i.e., L > K), the two clusters with the highest similarity value, computed according to (6.22), are chosen for merging.

\gamma_{kl} = \frac{1}{|E_{kl}|} \sum_{v_i \in C_k;\ v_j \in C_l} w_{ij}    (6.22)

where the normalization factor |Ekl| refers to the number of edges between clusters Ck and Cl.

The merging results for the similarity graph of Figure 6.1 are shown in Figure 6.2(f). The algorithm identifies B3 and B5 as the two candidates for cluster merging because they have the highest similarity value among all cluster pairs. In fact, γ35 = 0.8 while γ12 = 0.3, γ13 = −0.3, γ14 = 0, γ15 = −0.3, γ23 = −0.8, γ24 = 0.7, γ25 = −1, γ34 = −0.8, and γ45 = −0.7. The final clustering is not necessarily unique due to potential ties among equally similar initial clusters. The process of cluster merging continues by merging another pair of most similar “balls”, B2 and B4, with a similarity score of γ24 = 0.7.
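Phase (II) thus reduces to repeatedly evaluating (6.22) and merging the two most similar clusters. A compact sketch follows; cluster and matrix names are illustrative, and ties are broken arbitrarily.

```python
from itertools import combinations

def merge_clusters(clusters, w, k):
    """Merge the pair of clusters with the highest average inter-cluster edge
    weight (gamma in (6.22)) until exactly k clusters remain.

    clusters: list of sets of vertex indices; w: edge weight matrix.
    """
    def gamma(ca, cb):
        edges = [w[i][j] for i in ca for j in cb if w[i][j] != 0]
        return sum(edges) / len(edges) if edges else float('-inf')

    while len(clusters) > k:
        a, b = max(combinations(range(len(clusters)), 2),
                   key=lambda pair: gamma(clusters[pair[0]], clusters[pair[1]]))
        clusters[a] |= clusters[b]        # absorb cluster b into cluster a
        del clusters[b]
    return clusters
```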

6.3.3 Algorithm Analysis

The initial clustering in Phase (I) is analyzed first to show that the cost of the rounded solution is not significantly larger than the cost of the fractional solution. We use OPT to refer to the optimal solution, and FRC(bij) and RND(bij) to refer to the fractional and rounded solutions of bij in the linear programming formulation, respectively. The initial clustering gives an O(log m) approximation to the cost of positive edges between clusters and the cost of negative edges inside clusters.

Lemma 16. Phase(I) in PolyCluster guarantees an O(log m) approximation to the cost of positive edges between clusters.

Proof. We prove that wp(RND) ≤ λ ln(m + 1) ×wp(FRC) and that the algorithm terminates. Let B be the set of “balls” found in Phase(I). Given that each positive mistake edge has end points in two different clusters, the total weight of positive mistakes can be written as:

w_p(RND) = \sum_{w_{ij} > 0} w_{ij}\, RND(b_{ij}) = \frac{1}{2} \sum_{B \in \mathcal{B}} Cut(B)    (6.23)

Given that PolyCluster grows each B until Cut(B(vi, r)) ≤ λ ln(m + 1) × Vol(B(vi, r)), it is fair to conclude from (6.23) that:

w_p(RND) \leq \frac{\lambda}{2} \ln(m + 1) \times \sum_{B \in \mathcal{B}} Vol(B)    (6.24)

Let I be the initial volume of a “ball” defined in Phase (I), so the volume of B(vi, 0) is I. Let J be the volume of the entire graph G. Therefore, the weight of the positive mistakes made by the fractional solution is wp(FRC) = J. Let the initial volume I = J/m. By the design of the algorithm, all generated “balls” are disjoint. Consequently, using (6.24), wp(RND) can be written as follows.

w_p(RND) \leq \frac{\lambda}{2} \ln(m + 1) \sum_{B \in \mathcal{B}} \Big( \sum_{w_{ij} > 0} w_{ij}\, FRC(b_{ij}) + \frac{J}{m} \Big)    (6.25)
         \leq \frac{\lambda}{2} \ln(m + 1) \times \big( w_p(FRC) + J \big)    (6.26)
         \leq \lambda \ln(m + 1) \times w_p(FRC)    (6.27)

Therefore, the algorithm guarantees an O(log m) approximation to the cost of positive edges between clusters in Phase(I).

Lemma 17. For any vertex vi and a family of “balls” B(vi, r), the condition Cut(B(vi, r)) ≤ λ ln(m + 1) × Vol(B(vi, r)) is achieved by some r ≤ 1/λ [DEF06, Vaz13].

Lemma 18. Phase(I) guarantees an O(1) approximation to the cost of negative edges.

Proof. We claim that the “balls” returned by the algorithm have radius r ≤ 1/λ, which follows from Lemma 17. This guarantees the bound on the radius and thus proves that the solution is a (λ/(λ − 2))-approximation of the cost of negative edges inside clusters. Let B be the set of balls found by Phase (I) in PolyCluster.

w_n(FRC) = \sum_{w_{ij} < 0} w_{ij}\,\big(FRC(b_{ij}) - 1\big)    (6.28)
         \geq \sum_{B \in \mathcal{B}} \sum_{e_{ij} \in B,\ w_{ij} < 0} w_{ij}\,\big(FRC(b_{ij}) - 1\big)    (6.29)
         \geq \sum_{B \in \mathcal{B}} \sum_{e_{ij} \in B,\ w_{ij} < 0} w_{ij}\,\Big(\frac{2}{\lambda} - 1\Big)    (6.30)
         \geq \Big(\frac{2}{\lambda} - 1\Big) \sum_{B \in \mathcal{B}} \sum_{e_{ij} \in B,\ w_{ij} < 0} w_{ij}    (6.31)
         = \frac{\lambda - 2}{\lambda}\, w_n(RND)    (6.32)

The third equation follows from the second equation, the triangle inequality, and the fact that r ≤ 1/λ. Phase (I) guarantees an O(1) approximation given that λ > 2 in the approximation ratio λ/(λ − 2).

Theorem 19. Phase(I) of PolyCluster guarantees an O(ln m)–approximation ratio.

Proof. Based on Lemma 16, the algorithm achieves an O(ln m) approximation to the cost of positive edges between clusters:

wp(RND) ≤ λ ln(m + 1) × wp(FRC) (6.33)

Furthermore, based on Lemma 18, the algorithm guarantees an O(1) approximation to the cost of negative edges inside clusters:

w_n(FRC) \geq \frac{\lambda - 2}{\lambda}\, w_n(RND)    (6.34)

Therefore, the total number of mistakes made by the algorithm is given by

w(RND) = w_p(RND) + w_n(RND)    (6.35)
        \leq \lambda \ln(m + 1) \times w_p(OPT) + \frac{\lambda}{\lambda - 2} \times w_n(OPT)    (6.36)
        \leq \max\Big(\lambda \ln(m + 1),\ \frac{\lambda}{\lambda - 2}\Big)\, w(OPT)    (6.37)

Therefore, the total number of mistakes made by the algorithm is within O(log m) of OPT, where λ = 2 + ε.

6.4 Validation

This section demonstrates the effectiveness of PolyCluster for phasing. Per-block haplotype reconstruction is performed on the input DNA short reads, where each block contains the reads that do not cross adjacent blocks. The haplotypes generated from each block are concatenated to form chromosome-wide haplotypes. The datasets used for our evaluation are discussed in detail, followed by the presentation of PolyCluster’s performance and a comparison of our framework against several polyploid phasing algorithms.

6.4.1 Dataset Preparation

Polyploidy datasets are prepared with respect to various ploidy levels, coverages, and error rates, consistent with approaches suggested in the literature [BYP14a, MFM17]. To the best of our knowledge, there is no publicly available real dataset for polyploidy organisms [NGB08, BYP14a, AI13]. Although the genomes of various polyploidy organisms have been sequenced using NGS technology and released, the corresponding haplotype sets and short reads are not provided for public use. In the absence of such a dataset, several polyploidy datasets are generated as follows. We prepared a dataset based on Triticum aestivum (Bread Wheat), which has a very large genome and is classified as a hexaploid. Haplotype copies and corresponding fragments are generated for one of the 7 chromosomes (5A), with an estimated size of 2 Gb.

Figure 6.3: Statistics about the generated datasets. Distribution of short reads in the datasets based on coverage values (a); histogram of the length of non-overlapping blocks (b); number of non-overlapping blocks for each ploidy level and coverage value (c); average length of non-overlapping blocks for each ploidy level and coverage value (d).

In order to simulate a haplotype, we first fix a haplotype length, Lh, and a ploidy level, ‘K’. We then identify SNP positions assuming that the distance between adjacent SNPs follows a geometric random variable with parameter p, the SNP density. For each identified SNP, we randomly generate its haplotype value, assuming that the alternative and reference alleles are equally likely. After a haplotype is constructed, we generate |R| paired-end reads, where the number of reads, |R|, is a function of the coverage, CX. To generate a read set with CX coverage, each base pair needs to be covered by C reads on average. Given the haplotype length Lh and the read length, lr, the total number of generated reads is given by

|R| = \frac{L_h \times C}{2 \times l_r}    (6.38)

Many of these reads will cover a small number of SNPs; thus, for CX coverage, the number of useful reads for any SNP will be less than C. In order to generate a paired-end read, a starting point on the genome is selected uniformly. The read length lr is fixed to be 2500 in our experiments. The fragment length is normally distributed with µ = 600 alleles and σ = 60 alleles. The insert length is determined by the fragment length, lf, and the read length; that is, linsert = lf − 2 × lr. Once the start position and fragment length are known, we need to choose which chromosome to read from; for this purpose, we draw reads uniformly from the K chromosomes. Finally, we add uniform error to the reads: for every SNP that a read covers, independently with probability ε we flip the allele to any other allele.

6.4.2 Dataset Statistics

The procedure described in Section 6.4.1 was repeated multiple times for five ploidy levels (i.e., K), six coverage levels (i.e., C), and seven different error rates (i.e., ε). The ploidy levels included K ∈ {3, 4, 6, 8, 10}, representing triploid, tetraploid, hexaploid, octoploid, and decaploid organisms, respectively. The coverage levels included C ∈ {5X, 10X, 15X, 20X, 25X, 30X}, and the error rates ranged from 0% to 30% (i.e., ε ∈ {0, 0.05, 0.1, 0.15, 0.2, 0.25, 0.3}).

The dataset preparation resulted in more than 1.4 billion short reads. For a given error rate ε, approximately 202 million short reads were generated, combined over all ploidy levels and coverages. Figure 6.3 illustrates several statistics about the 202 million short reads associated with a given ε.

Figure 6.3(a) shows the total number of short reads generated for each coverage level, CX. The number of short reads presented in this graph represents all short reads summed over the five ploidy levels. The number of short reads ranges from 9.8 million for 5X coverage to 54.2 million for 30X coverage. Although the number of reads computed in (6.38) grows linearly with the coverage, C, the number of reads ultimately used does not grow exactly linearly with the coverage level. This can be explained by the dataset preparation procedure described previously: because the start position of each read on the genome is chosen at random, it is likely that some reads will not span any SNPs. Such reads carry only missing values in the fragment matrix and are therefore eliminated from the dataset while pre-processing the data for algorithm input.

As mentioned previously, short reads are processed as a collection of non-overlapping blocks. Each block spans a number of SNPs such that the short reads covering those SNPs do not overlap with the reads that cover SNPs in adjacent blocks. As a result, each block forms a fragment matrix, X, where the block length refers to the number of SNPs covered within that block. The block lengths can vary significantly. Figure 6.3(b) shows a histogram of the block lengths. The total number of blocks was 722,995. Approximately 79.3% of these blocks had a length of less than 100 SNPs, and approximately 0.45% of the blocks had a length of more than 700 SNPs.

Figure 6.3(c) shows how the blocks are distributed across datasets of varying ploidy and coverage. The total number of such blocks was 154,381, 150,903, 145,193, 138,989, and 133,529 for the triploid, tetraploid, hexaploid, octoploid, and decaploid datasets, respectively. As shown in this graph, a larger number of the blocks belong to datasets with lower coverage compared to those with higher coverage. Expectedly, lower coverage results in short reads that are less likely to overlap, thus leading to a larger number of non-overlapping blocks. However, the blocks in lower-coverage datasets are likely to have a smaller length, as shown in Figure 6.3(d). The graph in Figure 6.3(d) shows the average block length for each dataset based on the ploidy level and coverage. On average, the block length was 20.1, 69.4, 121.2, 152.1, 171.8, and 172.6 for coverages 5X, 10X, 15X, 20X, 25X, and 30X, respectively. Furthermore, the standard deviation of the block length was 1.9, 3.1, 10.7, 12.2, 12.5, and 14.7 for coverages 5X–30X.


Figure 6.4: Impact of sequencing confidence scores on quality of reconstructed haplotypes in terms of switching error (a), reconstruction rate (b), normalized MEC (c), and weighted MEC (d) for triploid data with error rate ε = 5%. [Panels plot SWE, RR (%), normalized MEC, and weighted MEC against coverage (5X–30X), with and without confidence scores.]

6.4.3 Impact of Sequencing Confidence Scores

Our first analysis was to assess the impact of sequencing confidence scores (i.e., ‘Phred score’ discussed in Section 6.2.2) on the quality of the reconstructed haplotypes. Specifically, we were interested in evaluating the quality of the haplotypes in terms of switching error (SWE), reconstruction rate (RR), MEC score, and weighted MEC. Figure 6.4 shows the results of this analysis.

Figure 6.4(a) shows how confidence scores impact SWE. The amount of SWE ranged from 0.08 for coverage 30X to 0.13 for coverage 15X using confidence scores. On average PolyCluster achieved a SWE of 0.1062 with a standard deviation of 0.02

with confidence scores. Without the confidence scores, SWE ranged from 0.10 to 0.12 with a mean value of 0.1150 and a standard deviation of 0.005. Overall, PolyCluster performed 7.6% better in terms of SWE when confidence scores were used for similarity graph creation and haplotype reconstruction. SWE takes into account the reconstruction rate because it normalizes the amount of switch error with respect to the percentage of SNP values that are actually reconstructed. This way, SWE serves as a more reliable metric for assessing the performance of haplotype assembly algorithms when the actual haplotypes are available for comparison purposes.

Figure 6.4(b) shows the reconstruction rates of PolyCluster with and without consideration of the confidence scores. As expected, the reconstruction rate increases as coverage grows. For most coverage levels, confidence scores resulted in a higher reconstruction rate compared to the case where confidence scores were not utilized. Specifically, RR ranged from 72.1% to 88.9% (with a mean of 80.5%) and from 74.3% to 84.9% (with a mean of 78.9%) for the two cases of ‘with confidence scores’ and ‘without confidence scores’, respectively. This suggests that the confidence scores can result in about a 2.0% increase in the reconstruction rate.

As shown in Figure 6.4(c), the amount of normalized MEC (i.e., number of flips per SNP site) ranged from 0.083 for coverage 5X to 0.84 for coverage 30X when confidence scores are used to compute edge weights of the similarity graph. The average amount of MEC was 0.50 with a standard deviation of 0.29 across all coverage levels when confidence scores are utilized. In the absence of confidence scores, MEC ranged from 0.084 to 1.16 for the 5X and 30X coverage levels, respectively. On average, MEC was 0.65 with a standard deviation of 0.41. Overall, PolyCluster performed 23.1% better in terms of MEC when confidence scores were utilized compared to the case without confidence scores.

Finally, Figure 6.4(d) shows weighted MEC as a function of coverage. Weighted MEC applies only to the cases when confidence scores are available. The amount of weighted MEC ranged from 0.06 to 0.63 and the average value across all coverages

was 0.38.
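For reference, a weighted MEC of this general form can be written as below; this is an illustrative formulation consistent with the description above (per-allele weights derived from sequencing confidence scores), not a verbatim restatement of the definition in Section 6.2.2:

\mathrm{wMEC} \;=\; \sum_{i=1}^{m} \sum_{l \in \mathrm{cov}(f_i)} w_{il}\, \delta\!\left(x_{il},\, h_{c(i),l}\right)

Here f_i is the i-th fragment, cov(f_i) the SNP sites it covers, c(i) the haplotype copy (cluster) it is assigned to, δ the mismatch indicator, and w_{il} a weight derived from the confidence score of allele x_{il}; setting w_{il} = 1 recovers the unweighted MEC, and dividing by the number of SNP sites yields the normalized variants reported above.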

Figure 6.5: PolyCluster performance in terms of switching error (a), reconstruction rate (b), and normalized MEC (c) for different ploidy levels and error rates. The results are presented for 5X coverage. [Panels plot SWE, RR (%), and normalized MEC against the error rate ε (5%–30%) for triploid, tetraploid, hexaploid, octoploid, and decaploid data.]

6.4.4 PolyCluster Performance

This section discusses the performance of PolyCluster across three performance measures including switching error, reconstruction rate, and MEC score. We first present results for various ploidy levels and error rates while keeping the coverage level fixed at 5X (Figure 6.5). We then discuss results for different coverage values and error rates for triploid data (Figure 6.6).

Figure 6.5(a) shows the amount of switching error for various ploidy levels as a function of the error rate.

Figure 6.6: PolyCluster performance as a function of coverage for triploid data (i.e., K = 3). The performance is shown in terms of switching error (a), reconstruction rate (b), and normalized MEC (c) for various error rates. [Panels plot SWE, RR (%), and normalized MEC against coverage (5X–30X) for ε = 0.05 through ε = 0.30.]

The amount of switching error was 0.14, 0.16, 0.17, 0.18, and 0.17 for triploid, tetraploid, hexaploid, octoploid, and decaploid data, respectively. The switching error shown in this figure represents the amount of difference between the reconstructed haplotypes and the true haplotypes. The values are normalized by reconstruction rate to provide a unified measure of accuracy and reconstruction rate. As shown in Figure 6.5(a), the switching error generally increases as the ploidy level increases, due to the significant reduction in reconstruction rate. We also observe higher switching errors as the error rate increases. As shown in Figure 6.5(b), the reconstruction rate was 75.0%, 64.6%, 52.6%, 46.8%, and 42.1% for triploid, tetraploid, hexaploid, octoploid, and decaploid data, respectively. Figure 6.5(c) shows the MEC score for

various ploidy levels. The amount of MEC was 0.10, 0.08, 0.07, 0.06, and 0.05 for triploid, tetraploid, hexaploid, octoploid, and decaploid data, respectively (averaged over all error rates). It can be observed that MEC decreases as the ploidy level increases for a fixed coverage. This can be explained by the fact that, with a fixed coverage, a smaller number of reads will reside in each cluster as the ploidy grows. As a result, the amount of conflict (disagreement) between the reads within each cluster and the reconstructed haplotype decreases at higher ploidy. We note, however, that the MEC score alone cannot represent the quality of the reconstructed haplotypes. In particular, as shown in Figure 6.5(b), with a higher ploidy number we achieve much lower reconstruction rates.

In Figure 6.6, we illustrate the performance of PolyCluster on triploid data with different coverage levels and error rates. Figure 6.6(a) shows the amount of switching error for this experiment. Averaged over all error rates, the amount of switching error is 0.11, 0.13, 0.14, 0.15, 0.17, and 0.18 for the 5X–30X coverages. Figure 6.6(b) shows the reconstruction rate for various coverage levels. Averaged over all error rates, the reconstruction rate was 80.5%, 82.3%, 83.3%, 84.9%, 87.0%, and 88.0% for the six coverage values 5X–30X. Finally, Figure 6.6(c) shows the MEC score as a function of coverage for various error rates. The amount of MEC was 0.50, 0.62, 0.71, 0.80, 0.83, and 0.84 for the six coverage levels from 5X to 30X. As expected, the amount of MEC per SNP increases as coverage grows.

6.4.5 Comparative Analysis Approach

We conducted an analysis to compare PolyCluster against four other haplotype assembly methods including HapColor (discussed in Chapter 5), HapTree [BYP14a], fragment partitioning based on KMedoids clustering [PJ09], and Greedy [LSN07]. A brief description of these approaches is as follows.

• HapColor, discussed in Chapter 5, is a fragment partitioning algorithm based

on graph coloring. The main objective of HapColor is to develop a graph partitioning method based on the concept of graph coloring in order to minimize the overall MEC score [MW15].

• HapTree uses a relative likelihood function to measure the concordance between the aligned read data and a given haplotype under a probabilistic model. To identify a haplotype of maximal likelihood, HapTree finds a collection of high-likelihood haplotype partial solutions, which are restricted to the first n0 SNP sites, and extends those to high-likelihood solutions on the first n0 + 1 SNP sites [BYP14a].

• KMedoids, developed here based on the well-known KMedoids clustering, uses the distance measure defined in (6.39) for distance-based clustering.

\Delta(f_i, f_j) = \sum_{l=1}^{n} \delta(x_{il}, x_{jl}) \qquad (6.39)

The algorithm first selects initial medoid fragments at random. It then performs the swap-step of the Partitioning Around Medoids (PAM) algorithm [KR09] to search over all possible swaps between medoids and non-medoids. This process examines whether the sum of medoid-to-cluster-member distances decreases. Conventional clustering algorithms such as K-Means cannot be directly used for fragment partitioning because the centroids computed by such algorithms do not constitute any meaningful data point in the fragment space. A minimal sketch of this baseline appears after this list.

• Greedy, inspired by the greedy diploid genome sequence algorithm [LSN07], is a fragment partitioning method that we devise for polyploid haplotyping. The original algorithm generates haplotypes by greedily allocating the fragments to two disjoint partitions, each of which forms one haplotype copy. The algorithm iteratively selects a read and adds it to one of the partitions. We modified this algorithm to accommodate more than two haplotype copies in our polyploid haplotyping.
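The sketch below illustrates the KMedoids baseline described in this list: the distance of (6.39) between fragments (counting allele disagreements over commonly covered SNP sites) and a PAM-style swap step. Names such as fragment_distance and kmedoids_fragments are illustrative, and this is a simplified version of the procedure rather than the exact implementation evaluated here.

```python
import random

def fragment_distance(fi, fj):
    """Distance of (6.39): number of allele disagreements at SNP sites
    covered by both fragments (each fragment is a dict: SNP index -> allele)."""
    return sum(1 for l in fi.keys() & fj.keys() if fi[l] != fj[l])

def kmedoids_fragments(fragments, K, max_iter=50, seed=0):
    """PAM-style KMedoids over fragments: start from random medoids, then
    greedily accept medoid/non-medoid swaps that lower the total distance."""
    rng = random.Random(seed)
    n = len(fragments)
    D = [[fragment_distance(fragments[i], fragments[j]) for j in range(n)]
         for i in range(n)]
    medoids = rng.sample(range(n), K)

    def total_cost(meds):
        # sum of each fragment's distance to its nearest medoid
        return sum(min(D[i][m] for m in meds) for i in range(n))

    best = total_cost(medoids)
    for _ in range(max_iter):
        improved = False
        for mi in range(K):                       # swap-step of PAM
            for cand in range(n):
                if cand in medoids:
                    continue
                trial = medoids[:mi] + [cand] + medoids[mi + 1:]
                c = total_cost(trial)
                if c < best:
                    medoids, best, improved = trial, c, True
        if not improved:
            break
    # each fragment joins the cluster of its nearest medoid
    clusters = {m: [] for m in medoids}
    for i in range(n):
        clusters[min(medoids, key=lambda m: D[i][m])].append(i)
    return clusters
```

Each resulting cluster can then be consensus-called into one haplotype copy.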

6.4.6 Comparative Analysis Results

These algorithms are compared across performance metrics including switching error, MEC score, and running time. For brevity, a reconstruction rate analysis is not included in this section because the reported switching error already takes into account the reconstruction rate of the algorithm. Therefore, the switching error itself can be interpreted as a combined error and reconstruction rate performance metric.

6.4.6.1 Switching Error (SWE)

Figure 6.7 shows the switching error of the various algorithms for triploid (Figure 6.7(a)), tetraploid (Figure 6.7(b)), and hexaploid (Figure 6.7(c)) data. For brevity, the discussion is limited to these three ploidy levels. The results of our analysis, however, were consistent across the other ploidy levels (i.e., octoploid and decaploid) as well. As shown in Figure 6.7, the amount of switching error increases as the error rate escalates. PolyCluster outperforms all other algorithms in terms of switching error for all ploidy levels. Averaged over all error rates, the amount of switching error on triploid data was 0.16, 0.31, 0.29, 0.43, and 0.45 for PolyCluster, HapColor, HapTree, KMedoids, and Greedy, respectively. This suggests that utilizing PolyCluster reduces the switching error of HapColor, HapTree, KMedoids, and Greedy on triploid data by 48.4%, 44.8%, 62.8%, and 64.4%, respectively.

On tetraploid data, average switching error was 0.17, 0.36, 0.34, 0.50, and 0.52 for PolyCluster, HapColor, HapTree, KMedoids, and Greedy, respectively. This indicates that PolyCluster achieves 52.8%, 50.0%, 66.0%, and 67.3% reduction in switching error of HapColor, HapTree, KMedoids, and Greedy, respectively. On hexaploid data, average switching error was 0.19, 0.40, 0.38, 0.55, and 0.58 for PolyCluster, HapColor, HapTree, KMedoids, and Greedy, respectively, thus suggesting a reduction of 52.5%, 50.0%, 65.5%, and 67.2% in switching error of HapColor, HapTree, KMedoids, and Greedy.

Figure 6.7: Comparison of switching error of PolyCluster with that of HapColor, HapTree, KMedoids, and Greedy on triploid (a), tetraploid (b), and hexaploid (c) data. [Each panel plots SWE against the error rate ε (5%–30%) for the five algorithms.]

Our analysis of switching error in this section suggests that PolyCluster is highly advantageous compared to the other four algorithms in minimizing the switching error of polyploid phasing. Overall, PolyCluster improves the SWE of HapColor by 51.2%, the SWE of HapTree by 48.3%, the SWE of KMedoids by 64.8%, and that of Greedy by 66.34%.

6.4.6.2 Minimum Error Correction (MEC)

For brevity, the analysis is limited to three ploidy levels including triploid, tetraploid, and hexaploid, similar to the analysis of switching error. Prior studies have shown that MEC is not the best metric to assess the quality of polyploid haplotype assembly algorithms [DMH12, BYP14b]. Yet, MEC scores of the various algorithms are included here for comparison purposes. Table 6.1, Table 6.2, and Table 6.3 show the amount of absolute MEC on triploid, tetraploid, and hexaploid data, respectively. The MEC values reported in these tables represent absolute MEC rather than the normalized MEC presented previously. For visualization, the numbers are presented in K-MEC (Kilo-MEC) in these tables.

Table 6.1: Comparison of PolyCluster with HapColor, HapTree, KMedoids, and Greedy in terms of absolute MEC (K-MEC) on triploid data.

ε (%)   PolyCluster   HapColor   HapTree   KMedoids   Greedy
5       424.7         317.7      424.8     506.3      565.5
10      512.3         462.4      623.6     743.1      826.1
15      564.7         530.2      714.9     851.4      931.3
20      595.5         573.7      773.3     920.9      1013.1
25      616.7         626.5      842.3     1008.5     1105.8
30      643.7         675.6      909.8     1162.6     1297.9
Avg.    559.6         531.0      714.8     865.5      956.6

Table 6.2: Comparison of PolyCluster with HapColor, HapTree, KMedoids, and Greedy in terms of absolute MEC (K-MEC) on tetraploid data.

ε (%)   PolyCluster   HapColor   HapTree   KMedoids   Greedy
5       406.3         216.3      308.4     370.8      416.2
10      469.5         318.5      434.7     525.4      577.7
15      475.7         387.4      526.5     628.4      695.1
20      502.5         434.8      589.7     702.8      776.4
25      498.7         466.8      632.4     751.3      825.9
30      501.0         486.7      658.9     782.2      865.5
Avg.    475.6         385.1      525.1     626.8      692.8

Table 6.3: Comparison of PolyCluster with HapColor, HapTree, KMedoids, and Greedy in terms of absolute MEC (K-MEC) on hexaploid data.

ε (%)   PolyCluster   HapColor   HapTree   KMedoids   Greedy
5       402.8         125.8      197.7     244.6      274.7
10      436.3         226.2      311.6     373.6      418.3
15      436.5         309.5      422.7     509.3      562.4
20      436.2         367.3      499.5     595.7      657.7
25      416.3         391.7      532.3     634.2      703.5
30      430.7         421.7      572.3     688.3      755.7
Avg.    426.5         307.0      422.7     507.6      562.1

Based on Table 6.1, PolyCluster performs 21.7% better in terms of MEC compared to HapTree, 35.3% better compared to KMedoids, and achieves a 41.5% reduction in the MEC of Greedy when triploid data are used. Similarly, as shown in Table 6.2, PolyCluster performs better in terms of MEC compared to HapTree (9.4%), KMedoids (24.1%), and Greedy (31.3%) on tetraploid data. Furthermore, as shown in Table 6.3, PolyCluster outperforms KMedoids and Greedy by reducing their MEC scores by 16.0% and 24.1%, respectively. HapTree, however, performs slightly (i.e., 8.9%) better than PolyCluster on hexaploid data.

Our analysis in this section suggests that PolyCluster outperforms both KMedoids and Greedy in terms of MEC scores. Although HapTree performed slightly better than PolyCluster in a few cases, the MEC performance of PolyCluster is less influenced by the amount of sequencing error (i.e., ε). In contrast, the MEC score of HapTree grows significantly as the error rate increases, in particular at higher ploidy, as shown in Table 6.3.

Table 6.4: Running time (minutes) on triploid data with 5X coverage and various error rates.

ε (%)     PolyCluster   HapColor   HapTree
5         6.9           23.5       54.8
10        6.9           32.7       76.0
15        6.7           37.9       82.1
20        6.4           39.0       78.8
25        6.5           38.1       80.6
30        6.8           38.4       81.0
Avg.      6.7           34.9       75.5
Speedup   -             5.2        11.3

6.4.6.3 Running Time

For running time comparison, the focus is on the three leading algorithms, PolyCluster, HapColor, and HapTree, because the other two algorithms (i.e., KMedoids and Greedy) perform very poorly in terms of accuracy (i.e., switching error and MEC). Besides, KMedoids and Greedy are inherently slow. For example, our prior study on diploid haplotyping showed that a greedy fragment partitioning is one order of magnitude slower than FastHap [MW14a].

Table 6.5: Impact of block length on running time (sec.)

Block Length   PolyCluster   HapColor   HapTree
60             0.29          1.69       0.11
120            0.28          2.29       0.39
240            0.39          2.39       3.75
480            0.41          2.28       20.54
960            0.62          2.57       13.65
1920           1.12          2.88       89.87
3840           2.11          3.51       675
7680           3.68          4.52       12720
15360          7.31          6.17       14873
30720          17.28         10.34      83902
61440          36.14         19.97      -
122880         70.18         40.32      -

Table 6.4 shows the running time of the three leading algorithms in reconstructing the entire haplotype associated with the triploid data with 5X coverage. While PolyCluster could complete the phasing process in less than 7 minutes on average, HapColor required 34.9 minutes and HapTree needed 75.5 minutes to complete. Overall, PolyCluster was 5.2 times faster than HapColor and 11.3 times faster than HapTree in this analysis. An interesting observation from this analysis is that PolyCluster's computing complexity is relatively independent of the error rate. In contrast, HapColor operates much slower as the error rate increases. This can be explained by the fact that HapColor first identifies a clustering of the reads that do not conflict; it then merges those clusters (identified by graph vertex colors) to form a final clustering. Thus, higher error rates result in a larger number of initial clusters/colors which need to be merged. Therefore, the increase in running time of HapColor is primarily due to the color/cluster merging time at high error rates. HapTree is generally slow because it examines a collection of SNPs during each phase of the algorithm. It then adds new SNPs to the partially examined haplotypes, which results in a significant increase in the running time of the algorithm.

An important aspect of haplotype assembly is scalability. This is particularly important because, with new sequencing technologies, the block length is continually growing. Therefore, algorithms must scale as the read length (and consequently the block length) increases. In this experiment, we aim to assess the impact of block length on the running time of the assembly algorithms. Table 6.5 shows how each algorithm performs as the block length grows. It can be observed that as the block length increases from 60 SNPs to more than 122K SNPs, the running time of PolyCluster goes from 0.29 seconds to 70.2 seconds. The corresponding running times for HapColor are 1.7 seconds and 40.3 seconds. However, HapTree becomes very slow as the block length increases. The non-reported cells in Table 6.5 (i.e., running time of HapTree for block lengths exceeding 30K SNPs) refer to the cases where the algorithm did not finish within 24 hours. Overall, PolyCluster outperforms HapTree by orders of magnitude on very long blocks. For example, at a block length of 8,000, the running time of HapTree is four orders of magnitude larger than that of PolyCluster. Also, the running times of PolyCluster and HapColor remain comparable as the block length grows.

6.5 Discussion and Conclusion

Organisms with more than two sets of homologous chromosomes are becoming the target of many studies focusing on the genomics of diseases, phylogenetics, and evolution [CN06]. Development of computational algorithms for polyploid haplotyping, however, is a new area of research. PolyCluster is presented here as a clustering-based method for polyploid haplotyping. PolyCluster minimizes the amount of inter-cluster similarity and intra-cluster dissimilarity and therefore results in haplotypes that are highly optimized in terms of switching error. The performance of PolyCluster is compared on polyploid datasets against several phasing methods, in particular two recently developed methods, HapColor and HapTree. PolyCluster improves the switching error of HapColor by 51.2% and that of HapTree by 48.3% on average. It is also demonstrated here that the running time of PolyCluster is several orders of magnitude less than that of HapTree, while it achieves a running time that is comparable to the running time of HapColor. With recent advancements in sequencing technologies, access to long reads of over a few thousand bases is becoming a reality [HRM14]. Our ongoing work involves validation of PolyCluster on such real datasets.

Although the ploidy level in haplotype reconstruction problems is assumed to be known a priori, we think dynamically learning the ploidy level could be an extended solution that would not only help build more accurate haplotype sets but would also help solve other problems in metagenomics, organism detection, etc. Another extension of PolyCluster is to accommodate tumor haplotypes. Since the current version covers SNPs and the results are promising, including small indels would be an easy next step. While the tumor somatic SNV density varies widely among different cancer types, somatic SNVs occur at a rate orders of magnitude lower than that of germline SNPs (Greenman et al. 2009). We can therefore infer tumor haplotypes via somatic mutations that are linked to germline mutations. The clonal frequencies and normal contamination can be inferred using finite mixture models in a Bayesian framework.

CHAPTER 7

Conclusions and Future Directions

Development of accurate and scalable algorithms for DNA phasing remains an important research area in computational biology because, on the one hand, there exists a growing demand to understand associations between genomic structures and phenotypes, and on the other hand, extracting useful information from massive amounts of erroneous DNA sequence data necessitates new computational algorithms. This dissertation lays a foundation for combinatorial approaches to haplotype assembly by developing several computational frameworks for NGS-Phasing in diploid and polyploid organisms.

7.1 Summary of Contributions

The contributions of this dissertation include design, development, and validation of several haplotype assembly frameworks including FastHap, ARHap, HapColor, and PolyCluster, discussed in Chapter 3, Chapter 4, Chapter 5, and Chapter 6, respectively. FastHap, HapColor, and PolyCluster are based on the concept of fragment partitioning while ARHap focuses on inter-SNP association rule learning. FastHap can be viewed as a binary partitioning approach where input fragments are grouped into two disjoint clusters according to a proposed dissimilarity measure. As a result, FastHap aims to address the problem of diploid haplotyping. In contrast, HapColor and PolyCluster are designed for higher ploidy haplotyping. That is, HapColor and PolyCluster focus on partitioning an input fragment set into K disjoint sets each

representing a haplotype copy. Both HapColor and PolyCluster follow a general approach that includes construction of an initial clustering of the fragments followed by a cluster merging procedure. These two frameworks, however, differ in both the distance measures used for graph-based vertex partitioning and the underlying cluster construction technique. Specifically, the major difference between HapColor and PolyCluster is that HapColor starts by constructing a clustering of the fragments such that the obtained clusters do not have any inter-fragment conflicts among them. That is, the initial clustering attained by HapColor achieves an MEC score equal to zero. It then merges similar clusters until the number of remaining clusters reaches the desired value K, the ploidy level. This approach to fragment clustering is essentially consistent with the aim of minimizing the overall MEC score. PolyCluster, however, attempts to consider both similarity and dissimilarity of the fragments (and therefore of the constructed clusters), as a proxy for optimizing switching error, during the clustering process. A more detailed description of each algorithm is given in the following.

FastHap is developed as a highly scalable haplotype assembly and reconstruction method for diploid organisms. A novel dissimilarity metric is introduced to quantify inter-fragment distance based on the contribution of individual fragments in building a final haplotype. The notion of a fuzzy conflict graph is presented to model haplotype reconstruction as a max-cut problem. A fast heuristic fragment partitioning technique is then developed using the proposed graph model. The technique is shown to lower the computing complexity of haplotype reconstruction dramatically, compared to the state-of-the-art algorithms, while moderately outperforming the accuracy of such algorithms. We compare FastHap with two well-known haplotype reconstruction algorithms, namely Levy's greedy algorithm and HapCut. The greedy algorithm is historically known for its high speed while it also outperforms the accuracy of other computationally simple and greedy algorithms such as FastHare [PS04]. HapCut, in contrast, is popular for its accuracy, but demands much higher computational resources compared to the greedy approach. The experiments show that FastHap is one order of magnitude faster than HapCut and is up to 7 times faster than the greedy approach.
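To make the binary-partitioning idea concrete, the following is a generic greedy, max-cut-style bipartitioning sketch over a weighted fragment dissimilarity graph; it is not the FastHap algorithm or its fuzzy conflict graph, only a simplified stand-in illustrating how a two-way cut of the fragments yields the two haplotype groups.

```python
def greedy_bipartition(weights):
    """Greedy max-cut style split of fragments into two groups.

    weights[i][j] is a symmetric dissimilarity between fragments i and j
    (larger values suggest the fragments come from different haplotype copies).
    Each fragment is placed in the group that maximizes its total
    dissimilarity to the opposite group, processed in index order.
    """
    n = len(weights)
    group = [0] * n                               # seed: fragment 0 in group 0
    for i in range(1, n):
        to0 = sum(weights[i][j] for j in range(i) if group[j] == 0)
        to1 = sum(weights[i][j] for j in range(i) if group[j] == 1)
        group[i] = 1 if to0 >= to1 else 0         # place i opposite its heavier conflicts
    return group
```

Each of the two resulting groups can then be consensus-called into one haplotype copy.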

The ARHap framework is developed to address diploid haplotyping by investigat- ing hidden relations among SNP site that are hard to represent using conventional algorithms. Based on the concept of association rule learning, we propose a method to discover correlations among SNP locations in fragment sets. ARHap consists of two main modules or processing phases. In the association rule learning phase, strong patterns representing inter-dependency of alleles at individual SNP sites are discov- ered. In the haplotype reconstruction phase, an approach for utilizing the strong rules produced in the first phase is developed to construct haplotypes at individual SNP sites. The framework begins by generating rules of small size and highly strong and reconstructing haplotypes accordingly. The rule length can increase and/or criteria about strongness of the rule can be revised over time if some SNP sites have re- mained unreconstructed. Extensive experimental analyses demonstrate superiority of ARHap in diploid haplotyping, in particular in achieving significantly better accuracy performance in terms of switching error.

HapColor, based on the concept of graph coloring, is developed for polyploid haplotyping. We present a formal definition of polyploid haplotyping with the objective of minimizing the overall MEC and show that this problem can be modeled as a graph coloring problem. We develop a heuristic approach including a graph coloring method followed by a color-merging technique to accurately partition short reads and reconstruct the haplotype associated with each partition. HapColor is compared with three other haplotype assemblers, namely HapTree, Greedy, and random partitioning algorithms. Our extensive analyses show that HapColor outperforms all these techniques on polyploid data. The amount of reduction in the MEC score obtained using HapColor ranges from 20% to 90%, depending on the ploidy level of the organism and the algorithm used for comparison with HapColor.
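The two-stage structure just described (a conflict-free coloring followed by merging colors down to K) can be illustrated roughly as follows; this is a simplified sketch of that general strategy with assumed helper callables (conflict, similarity), not the exact HapColor coloring or merging rules.

```python
def color_then_merge(fragments, K, conflict, similarity):
    """Greedy conflict-free coloring of fragments, then merging of colors.

    conflict(fi, fj) -> True if the two fragments disagree at a shared SNP.
    similarity(A, B) -> score used to pick which two color classes to merge.
    """
    # Stage 1: greedy coloring so that no color class contains a conflict
    # (hence the initial clustering has an MEC of zero).
    colors = []                                   # list of lists of fragment indices
    for i, fi in enumerate(fragments):
        for cls in colors:
            if all(not conflict(fi, fragments[j]) for j in cls):
                cls.append(i)
                break
        else:
            colors.append([i])
    # Stage 2: repeatedly merge the two most similar color classes
    # until only K classes (haplotype copies) remain.
    while len(colors) > K:
        x, y = max(((p, q) for p in range(len(colors))
                    for q in range(p + 1, len(colors))),
                   key=lambda pair: similarity(colors[pair[0]], colors[pair[1]]))
        colors[x].extend(colors[y])
        del colors[y]
    return colors
```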

We finally present PolyCluster as a clustering-based method for polyploid haplotyping.

PolyCluster minimizes the amount of inter-cluster similarity and intra-cluster dissimilarity and therefore results in haplotypes that are highly optimized in terms of switching error. We show that the problem of minimum fragment-disagreement haplotyping can be modeled as a correlation clustering problem on a weighted graph. PolyCluster devises a combination of linear programming, rounding, region-growing, and cluster merging to group similar reads and reconstruct haplotypes. The performance of PolyCluster is compared against several phasing methods such as HapColor and HapTree. We show that PolyCluster improves the switching error of HapColor by 51.2% and that of HapTree by 48.3% on average. It is also shown here that the running time of PolyCluster is several orders of magnitude less than that of HapTree while it achieves a running time comparable to that of HapColor.

7.2 Challenges and Future Work

With recent advancements in sequencing technologies, access to long reads of over a few thousand bases is becoming a reality [HRM14]. This calls for further investigation of algorithms that address the specific needs of future big sequencing data. Our ongoing work involves validation of our algorithms on large-scale real datasets for various organisms as such datasets become available.

• Haplotype block length: Similar to our approach in this dissertation, many DNA phasing algorithms construct haplotypes in non-overlapping blocks. Each haplotype block is built based on non-overlapping DNA short reads. The blocks are then concatenated to build genome-wide haplotypes. Potential problems with this approach are as follows. First, it is possible that there exist gaps between non-overlapping blocks. Such gaps could occur, for example, due to the lack of coverage at particular SNPs, which leaves those SNP sites uncovered or only marginally covered by the DNA reads. The gaps between neighboring blocks result in discontinuous haplotype blocks; that is, the SNP sites associated with the gaps will remain unreconstructed in the final haplotypes. Second, the process of concatenating haplotype blocks may introduce additional error in the final haplotypes. Although haplotype block length is affected by factors such as error rate, paired/single end alignment, coverage, and the amount of base pair overlap in each pair of fragments, an interesting future direction is to revisit the proposed computational frameworks such that the impact of individual block length on the performance of the final haplotypes is minimized.

• Standard accuracy measure: A challenge in designing and validating DNA phasing algorithms is the lack of a consensus on an accuracy performance metric. Minimum Error Correction (MEC) and Switching Error (SWE) have been widely used in the literature. However, there exist many other measures such as MSR (Minimum SNP Removal) and MFR (Minimum Fragment Removal) [LBI01], respectively referring to removing the minimum number of SNPs and the minimum number of reads from the data to reach a feasible fragment matrix, and LHB (Longest Haplotype Block) [LBI01, WP03], which refers to achieving the longest possible haplotype block. Another metric used in the literature is MWER (Minimum Weight Edge Removal), which aims at removing data (i.e., edges) from a weighted graph representing the fragment matrix [AI12, DV15]. The approaches proposed in [AI12, AI13] attempt to solve an optimization problem that minimizes the MWER objective. Each of the above measures depends either on the input or on characteristics of the developed algorithm. Such disparity in determining the accuracy of haplotype assembly algorithms makes comparison among the algorithms either infeasible or unfair. Therefore, the community may need to standardize validation methodologies for computational haplotype assembly.

• Indels and recombination: Current algorithms do not take into consideration the impacts of indels and recombinations on the sequence data. In fact, these algorithms assume that the input data are perfectly aligned to a reference sequence.

In reality, however, indels and inversions can greatly affect the results because the provided reads will no longer align with the reference sequence, in particular when the amount of recombination/indels is not negligible. Therefore, taking the impacts of indels and recombinations into account during haplotype assembly is another future direction in this research.

• Tumor haplotype assembly: Another extension of this work is to accommodate tumor haplotypes. While the tumor somatic SNV density varies widely from one type of cancer to another, somatic SNVs occur at a rate orders of magnitude lower than that of germline SNPs [GGI09]. Therefore, one can infer tumor haplotypes utilizing somatic mutations that are linked to germline mutations. For example, the clonal frequencies and normal contamination may be inferred using finite mixture models in a Bayesian framework.

• Dynamically learning ploidy level: Another interesting future direction for this research falls at the confluence of haplotype assembly and metagenomics / human microbiome research. Although haplotype reconstruction methods assume that the ploidy level is known a priori, dynamically learning the ploidy level may allow for joint ploidy discovery and haplotype reconstruction, which would also contribute to addressing problems in metagenomics and organism detection, among others.

References

[ABC05] David Altshuler, Lisa D Brooks, Aravinda Chakravarti, Francis S Collins, Mark J Daly, P Donnelly, RA Gibbs, JW Belmont, A Boudreau, SM Leal, et al. “A haplotype map of the human genome.” Nature, 437:1299–1320, 2005.

[ACE11] Can Alkan, Bradley P Coe, and Evan E Eichler. “Genome structural variation discovery and genotyping.” Nature Reviews Genetics, 12(5):363– 376, 2011.

[AHV10] Susanna Atwell, Yu S Huang, Bjarni J Vilhjálmsson, Glenda Willems, Matthew Horton, Yan Li, Dazhe Meng, Alexander Platt, Aaron M Tarone, Tina T Hu, et al. “Genome-wide association study of 107 phenotypes in Arabidopsis thaliana inbred lines.” Nature, 465(7298):627–631, 2010.

[AI12] Derek Aguiar and Sorin Istrail. “HapCompass: a fast cycle basis algorithm for accurate haplotype assembly of sequence data.” Journal of Computational Biology, 19(6):577–590, 2012.

[AI13] Derek Aguiar and Sorin Istrail. “Haplotype assembly in polyploid genomes and identical by descent shared tracts.” Bioinformatics, 29(13):i352–i360, 2013.

[AS94] Rakesh Agrawal, Ramakrishnan Srikant, et al. “Fast algorithms for mining association rules.” In Proc. 20th int. conf. very large data bases, VLDB, volume 1215, pp. 487–499, 1994.

[Aus99] Giorgio Ausiello. Complexity and Approximability Properties: Combinatorial Optimization Problems and Their Approximability Properties. Springer, 1999.

[BB08] Vikas Bansal and Vineet Bafna. “HapCUT: an efficient and accurate algorithm for the haplotype assembly problem.” Bioinformatics, 24(16):i153–i159, 2008.

[BB11] Sharon R Browning and Brian L Browning. “Haplotype phasing: existing methods and new developments.” Nature Reviews Genetics, 12(10):703– 714, 2011.

[BBJ02] Celine Becquet, Sylvain Blachon, Baptiste Jeudy, Jean-François Boulicaut, and Olivier Gandrillon. “Strong-association-rule mining for large-scale gene-expression data analysis: a case study on human SAGE data.” Genome Biology, 3(12):1, 2002.

[BDK15] Paola Bonizzoni, Riccardo Dondi, Gunnar W Klau, Yuri Pirola, Nadia Pisanti, and Simone Zaccaria. “On the fixed parameter tractability and approximability of the minimum error correction problem.” In Annual Symposium on Combinatorial Pattern Matching, pp. 100–113. Springer, 2015.

[Bec05] Hila Becker. “A survey of correlation clustering.” Advanced Topics in Computational Learning Theory, pp. 1–10, 2005.

[BHA08] Vikas Bansal, Aaron L Halpern, Nelson Axelrod, and Vineet Bafna. “An MCMC algorithm for haplotype assembly from whole-genome sequence data.” Genome research, 18(8):1336–1346, 2008.

[BHK04] Andreas Björklund, Thore Husfeldt, and Sanjeev Khanna. “Approximating longest directed paths and cycles.” In International Colloquium on Automata, Languages, and Programming, pp. 222–233. Springer, 2004.

[BIL05] Vineet Bafna, Sorin Istrail, Giuseppe Lancia, and Romeo Rizzi. “Polynomial and APX-hard cases of the individual haplotyping problem.” Theoretical Computer Science, 335(1):109–125, 2005.

[BM12] William S Bush and Jason H Moore. “Genome-wide association studies.” PLoS computational biology, 8(12):e1002822, 2012.

[BMU97] Sergey Brin, Rajeev Motwani, Jeffrey D Ullman, and Shalom Tsur. “Dynamic itemset counting and implication rules for market basket data.” In ACM SIGMOD Record, volume 26, pp. 255–264. ACM, 1997.

[Bre79] Daniel Brélaz. “New Methods to Color the Vertices of a Graph.” Commun. ACM, 22(4):251–256, April 1979.

[BYB15] Emily Berger, Deniz Yorukoglu, and Bonnie Berger. “HapTree-X: An Integrative Bayesian Framework for Haplotype Reconstruction from Transcriptome and Genome Sequencing Data.” In International Conference on Research in Computational Molecular Biology, pp. 28–29. Springer, 2015.

[BYP14a] Emily Berger, Deniz Yorukoglu, Jian Peng, and Bonnie Berger. “HapTree: A Novel Bayesian Framework for Single Individual Polyplotyping Using NGS Data.” In Roded Sharan, editor, Research in Computational Molecular Biology, volume 8394 of Lecture Notes in Computer Science, pp. 18–19. Springer International Publishing, 2014.

[BYP14b] Emily Berger, Deniz Yorukoglu, Jian Peng, and Bonnie Berger. “HapTree: A novel Bayesian framework for single individual polyplotyping using NGS data.” In Research in Computational Molecular Biology, pp. 18–19. Springer, 2014.

[CH03] Chad Creighton and Samir Hanash. “Mining gene expression databases for association rules.” Bioinformatics, 19(1):79–86, 2003.

[CJC06] Donald F Conrad, Mattias Jakobsson, Graham Coop, Xiaoquan Wen, Jeffrey D Wall, Noah A Rosenberg, and Jonathan K Pritchard. “A worldwide survey of haplotype variation and linkage disequilibrium in the human genome.” Nature genetics, 38(11):1251–1260, 2006.

[CN06] Z Jeffrey Chen and Zhongfu Ni. “Mechanisms of genomic rearrangements and gene expression changes in plant polyploids.” Bioessays, 28(3):240– 252, 2006.

[Con12] 1000 Genomes Project Consortium et al. “An integrated map of genetic variation from 1,092 human genomes.” Nature, 491(7422):56–65, 2012.

[CSV16] C. Cai, S. Sanghavi, and H. Vikalo. “Structured Low-Rank Matrix Factorization for Haplotype Assembly.” IEEE Journal of Selected Topics in Signal Processing, 10(4):647–657, June 2016.

[CVK05] Rudi Cilibrasi, Leo Van Iersel, Steven Kelk, and John Tromp. “On the complexity of several haplotyping problems.” In Algorithms in Bioinformatics, pp. 128–139. Springer, 2005.

[CVK07] Rudi Cilibrasi, Leo Van Iersel, Steven Kelk, and John Tromp. “The com- plexity of the single individual SNP haplotyping problem.” Algorithmica, 49(1):13–36, 2007.

[DEF06] Erik D Demaine, Dotan Emanuel, Amos Fiat, and Nicole Immorlica. “Correlation clustering in general weighted graphs.” Theoretical Computer Science, 361(2):172–187, 2006.

[DHM10] Jorge Duitama, Thomas Huebsch, Gayle McEwen, Eun-Kyung Suk, and Margret R Hoehe. “ReFHap: a reliable and fast algorithm for single individual haplotyping.” In Proceedings of the First ACM International Conference on Bioinformatics and Computational Biology, pp. 160–169. ACM, 2010.

[DI03] Erik D Demaine and Nicole Immorlica. “Correlation clustering with partial information.” In Approximation, Randomization, and Combinatorial Optimization: Algorithms and Techniques, pp. 1–13. Springer, 2003.

[DJP92] Elias Dahlhaus, David S Johnson, Christos H Papadimitriou, Paul D Seymour, and Mihalis Yannakakis. “The complexity of multiway cuts.” In Proceedings of the twenty-fourth annual ACM symposium on Theory of computing, pp. 241–251. ACM, 1992.

[DMH12] Jorge Duitama, Gayle K McEwen, Thomas Huebsch, Stefanie Palczewski, Sabrina Schulz, Kevin Verstrepen, Eun-Kyung Suk, and Margret R Hoehe. “Fosmid-based whole genome haplotyping of a HapMap trio child: evaluation of Single Individual Haplotyping techniques.” Nucleic acids research, 40(5):2041–2053, 2012.

[DV15] Shreepriya Das and Haris Vikalo. “SDhaP: haplotype assembly for diploids and polyploids via semi-definite programming.” BMC genomics, 16(1):260, 2015.

[EFG09] John Eid, Adrian Fehr, Jeremy Gray, Khai Luong, John Lyle, Geoff Otto, Paul Peluso, David Rank, Primo Baybayan, Brad Bettman, et al. “Real-time DNA sequencing from single polymerase molecules.” Science, 323(5910):133–138, 2009.

[FHT04] Eibe Frank, Mark Hall, Len Trigg, Geoffrey Holmes, and Ian H Witten. “Data mining in bioinformatics using Weka.” Bioinformatics, 20(15):2479–2481, 2004.

[FWS05] Errol C Friedberg, Graham C Walker, Wolfram Siede, and Richard D Wood. DNA repair and mutagenesis. American Society for Microbiology Press, 2005.

[GBH03] Richard A Gibbs, John W Belmont, Paul Hardenbol, Thomas D Willis, FL Yu, HM Yang, Lan-Yang Ch’ang, Wei Huang, Bin Liu, Yan Shen, et al. “The international HapMap project.” 2003.

[GG06] Ioannis Giotis and Venkatesan Guruswami. “Correlation clustering with a fixed number of clusters.” In Proceedings of the seventeenth annual ACM-SIAM symposium on Discrete algorithm, pp. 1167–1176. Society for Industrial and Applied Mathematics, 2006.

[GGI09] Antonia G´alvez, John Greenman, and Ioannis Ieropoulos. “Landfill leachate treatment with microbial fuel cells; scale-up through plurality.” Bioresource technology, 100(21):5085–5091, 2009.

[GHL04] Harvey J Greenberg, William E Hart, and Giuseppe Lancia. “Opportunities for combinatorial optimization in computational biology.” INFORMS Journal on Computing, 16(3):211–231, 2004.

[GJ90] Michael R. Garey and David S. Johnson. Computers and Intractability; A Guide to the Theory of NP-Completeness. W. H. Freeman & Co., New York, NY, USA, 1990.

[Gla10] University of Glasgow. “Why Study Polyploids @ONLINE.”, September 2010.

[GMM16] Shilpa Garg, Marcel Martin, and Tobias Marschall. “Read-based phasing of related individuals.” Bioinformatics, 32(12):i234–i242, 2016.

[HCP10] Dan He, Arthur Choi, Knot Pipatsrisawat, Adnan Darwiche, and Eleazar Eskin. “Optimal algorithms for haplotype assembly from whole-genome sequence data.” Bioinformatics, 26(12):i183–i190, 2010.

[HD05] Joel N Hirschhorn and Mark J Daly. “Genome-wide association studies for common diseases and complex traits.” Nature Reviews Genetics, 6(2):95–108, 2005.

[HGN00] Jochen Hipp, Ulrich Güntzer, and Gholamreza Nakhaeizadeh. “Algorithms for association rule mining – a general survey and comparison.” ACM SIGKDD Explorations Newsletter, 2(1):58–64, 2000.

[HLM11] Weichun Huang, Leping Li, Jason R Myers, and Gabor T Marth. “ART: a next-generation sequencing read simulator.” Bioinformatics, 28(4):593–594, 2011.

[HPY00] Jiawei Han, Jian Pei, and Yiwen Yin. “Mining Frequent Patterns Without Candidate Generation.” SIGMOD Rec., 29(2):1–12, May 2000.

[HRM14] John Huddleston, Swati Ranade, Maika Malig, Francesca Antonacci, Mark Chaisson, Lawrence Hon, Peter H Sudmant, Tina A Graves, Can Alkan, Megan Y Dennis, et al. “Reconstructing complex regions of genomes using long-read sequencing technology.” Genome research, pp. gr–168450, 2014.

[JT11] Tommy R Jensen and Bjarne Toft. Graph coloring problems, volume 39. John Wiley & Sons, 2011.

[KK06] Sotiris Kotsiantis and Dimitris Kanellopoulos. “Association rules mining: A recent overview.” GESTS International Transactions on Computer Science and Engineering, 32(1):71–82, 2006.

[KK16] Manpreet Kaur and Shivani Kang. “Market Basket Analysis: Identify the Changing Trends of Market Data Using Association Rule Mining.” Procedia Computer Science, 85:78–85, 2016.

[Klo02] Walter Klotz. “Graph coloring algorithms.” Mathematics Report, pp. 1–9, 2002.

[KMR97] David Karger, Rajeev Motwani, and GDS Ramkumar. “On approximating the longest path in a graph.” Algorithmica, 18(1):82–98, 1997.

[KR09] Leonard Kaufman and Peter J Rousseeuw. Finding groups in data: an introduction to cluster analysis, volume 344. John Wiley & Sons, 2009.

[KRM13] Bernard Kamsu-Foguem, Fabien Rigal, and Félix Mauget. “Mining association rules for the quality improvement of the production process.” Expert systems with applications, 40(4):1034–1045, 2013.

[KSB16] Volodymyr Kuleshov, Michael P Snyder, and Serafim Batzoglou. “Genome assembly from synthetic long read clouds.” Bioinformatics, 32(12):i216–i224, 2016.

[Kul14] Volodymyr Kuleshov. “Probabilistic single-individual haplotyping.” Bioinformatics, 30(17):i379–i385, 2014.

[Lan13] Giuseppe Lancia. “Combinatorial Haplotyping Problems.” Pattern Recognition in Computational Molecular Biology: Techniques and Approaches, pp. 1–27, 2013.

[LBI01] Giuseppe Lancia, Vineet Bafna, Sorin Istrail, Ross Lippert, and Russell Schwartz. “SNPs problems, complexity, and algorithms.” In Algorithms – ESA 2001, pp. 182–193. Springer, 2001.

[LD10] Heng Li and Richard Durbin. “Fast and accurate long-read alignment with Burrows–Wheeler transform.” Bioinformatics, 26(5):589–595, 2010.

[LS12] Ben Langmead and Steven L Salzberg. “Fast gapped-read alignment with Bowtie 2.” Nature methods, 9(4):357–359, 2012.

[LSL02] Ross Lippert, Russell Schwartz, Giuseppe Lancia, and Sorin Istrail. “Algorithmic strategies for the single nucleotide polymorphism haplotype assembly problem.” Briefings in bioinformatics, 3(1):23–31, 2002.

[LSM99] Wenke Lee, Salvatore J Stolfo, and Kui W Mok. “A data mining framework for building intrusion detection models.” In Security and Privacy, 1999. Proceedings of the 1999 IEEE Symposium on, pp. 120–132. IEEE, 1999.

[LSN07] Samuel Levy, Granger Sutton, Pauline C Ng, Lars Feuk, Aaron L Halpern, Brian P Walenz, Nelson Axelrod, Jiaqi Huang, Ewen F Kirkness, Gennady Denisov, et al. “The diploid genome sequence of an individual human.” PLoS biology, 5(10):e254, 2007.

[Man85] Laura Manuelidis. “Individual interphase chromosome domains revealed by in situ hybridization.” Human genetics, 71(4):288–293, 1985.

[MFM17] Ehsan Motazedi, Richard Finkers, Chris Maliepaard, and Dick de Ridder. “Exploiting next-generation sequencing to solve the haplotyping puzzle in polyploids: a simulation study.” Briefings in Bioinformatics, p. bbw126, 2017.

[MP05] Axel Meyer and Yves Van de Peer. “From 2R to 3R: evidence for a fish-specific genome duplication (FSGD).” Bioessays, 27(9):937–945, 2005.

[MW14a] Sepideh Mazrouee and Wei Wang. “FastHap: fast and accurate single individual haplotype reconstruction using fuzzy conflict graphs.” Bioinformatics, 30(17):i371–i378, 2014.

[MW14b] Sepideh Mazrouee and Wei Wang. “Individual Haplotyping Prediction Agreements.” In Proceedings of the 5th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics, BCB ’14, pp. 615–616, New York, NY, USA, 2014. ACM.

[MW15] Sepideh Mazrouee and Wei Wang. “HapColor: A graph coloring framework for polyploidy phasing.” In 2015 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2015, Washington, DC, USA, November 9-12, 2015, pp. 105–108, 2015.

[NGB08] Jost Neigenfind, Gabor Gyetvai, Rico Basekow, Svenja Diehl, Ute Achenbach, Christiane Gebhardt, Joachim Selbig, and Birgit Kersten. “Haplotype inference from unphased SNP data in heterozygous polyploids based on SAT.” BMC genomics, 9(1):356, 2008.

[OAH12] Yukiteru Ono, Kiyoshi Asai, and Michiaki Hamada. “PBSIM: PacBio reads simulator – toward accurate genome assembly.” Bioinformatics, 29(1):119–121, 2012.

[Ols17] David L Olson. “Association Rules.” In Descriptive Data Mining, pp. 61–69. Springer, 2017.

[PJ09] Hae-Sang Park and Chi-Hyuck Jun. “A simple and fast algorithm for K-medoids clustering.” Expert systems with applications, 36(2):3336–3341, 2009.

[PMP14] Murray Patterson, Tobias Marschall, Nadia Pisanti, Leo van Iersel, Leen Stougie, Gunnar W Klau, and Alexander Schönhuth. “WhatsHap: Haplotype assembly for future-generation sequencing reads.” In International Conference on Research in Computational Molecular Biology, pp. 237–249. Springer, 2014.

[PMP15] Murray Patterson, Tobias Marschall, Nadia Pisanti, Leo Van Iersel, Leen Stougie, Gunnar W Klau, and Alexander Schönhuth. “WhatsHap: Weighted haplotype assembly for future-generation sequencing reads.” Journal of Computational Biology, 22(6):498–509, 2015.

[PRG09] Sung Hee Park, José A Reyes, David R Gilbert, Ji Woong Kim, and Sangsoo Kim. “Prediction of protein-protein interaction types using association rule based classification.” BMC bioinformatics, 10(1):1, 2009.

[PS04] Alessandro Panconesi and Mauro Sozio. “Fast hare: A fast heuristic for single individual SNP haplotype reconstruction.” In Algorithms in Bioinformatics, pp. 266–277. Springer, 2004.

[Rei09] Jorge S Reis-Filho. “Next-generation sequencing.” Breast Cancer Research, 11(3):S12, 2009.

[RS98] Justin Ramsey and Douglas W Schemske. “Pathways, mechanisms, and rates of polyploid formation in flowering plants.” Annual Review of Ecology and Systematics, pp. 467–501, 1998.

[SAK15] Matthew W Snyder, Andrew Adey, Jacob O Kitzman, and Jay Shendure. “Haplotype-resolved genome sequencing: experimental methods and applications.” Nature Reviews Genetics, 16(6):344–358, 2015.

[SCD00] Jaideep Srivastava, Robert Cooley, Mukund Deshpande, and Pang-Ning Tan. “Web usage mining: Discovery and applications of usage patterns from web data.” Acm Sigkdd Explorations Newsletter, 1(2):12–23, 2000.

[SG74] Sartaj Sahni and Teofilo Gonzales. “P-complete Problems and Approximate Solutions.” In Proceedings of the 15th Annual Symposium on Switching and Automata Theory (Swat 1974), SWAT ’74, pp. 28–32, Washington, DC, USA, 1974. IEEE Computer Society.

[SMS02] Lincoln D Stein, Christopher Mungall, ShengQiang Shu, Michael Caudy, Marco Mangone, Allen Day, Elizabeth Nickerson, Jason E Stajich, Todd W Harris, Adrian Arva, et al. “The generic genome browser: a building block for a model organism system database.” Genome research, 12(10):1599– 1610, 2002.

[SS06] Paul Scheet and Matthew Stephens. “A fast and flexible statistical model for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase.” The American Journal of Human Genetics, 78(4):629–644, 2006.

[SSD01] Matthew Stephens, Nicholas J Smith, and Peter Donnelly. “A new statistical method for haplotype reconstruction from population data.” The American Journal of Human Genetics, 68(4):978–989, 2001.

[SST04] Ron Shamir, Roded Sharan, and Dekel Tsur. “Cluster Graph Modification Problems.” Discrete Appl. Math., 144(1-2):173–182, November 2004.

[TBT11] Ryan Tewhey, Vikas Bansal, Ali Torkamani, Eric J Topol, and Nicholas J Schork. “The importance of phase information for human genomics.” Nature Reviews Genetics, 12(3):215–223, 2011.

[Vaz13] Vijay V Vazirani. Approximation algorithms. Springer Science & Business Media, 2013.

[Ven14] VenterInst. “Diploid Human Genome Project Website, J. Craig Venter Institute.” http://www.jcvi.org/cms/research/projects/huref/overview, 2014.

[WHH04] Jeffrey Watson, Franklin A Hays, and P Shing Ho. “Definitions and analysis of DNA Holliday junction geometry.” Nucleic acids research, 32(10):3017–3027, 2004.

[WP03] Jeffrey D Wall and Jonathan K Pritchard. “Haplotype blocks and linkage disequilibrium in the human genome.” Nature Reviews Genetics, 4(8):587– 597, 2003.

[WTB09] Troy E Wood, Naoki Takebayashi, Michael S Barker, Itay Mayrose, Philip B Greenspoon, and Loren H Rieseberg. “The frequency of polyploid speciation in vascular plants.” Proceedings of the National Academy of Sciences, 106(33):13875–13879, 2009.

[WWL05] Rui-Sheng Wang, Ling-Yun Wu, Zhen-Ping Li, and Xiang-Sun Zhang. “Haplotype reconstruction from SNP fragments by minimum error correction.” Bioinformatics, 21(10):2456–2462, 2005.

[Zak00] Mohammed Javeed Zaki. “Scalable algorithms for association mining.” IEEE Transactions on Knowledge and Data Engineering, 12(3):372–390, 2000.
