HAPLOTYPE INFERENCE FROM PEDIGREE DATA AND POPULATION DATA

by

XIN LI

Submitted in partial fulfillment of the requirements For the Degree of Doctor of Philosophy

Dissertation Advisor: Jing Li

Department of Electrical Engineering and Computer Science

CASE WESTERN RESERVE UNIVERSITY

January, 2010

CASE WESTERN RESERVE UNIVERSITY

SCHOOL OF GRADUATE STUDIES

We hereby approve the thesis/dissertation of

______

candidate for the ______degree *.

(signed)______(chair of the committee)

______

______

______

______

______

(date) ______

*We also certify that written approval has been obtained for any proprietary material contained therein.

Table of Contents

List of Tables

List of Figures

Acknowledgments

Abstract

Chapter 1. Introduction
1.1 Statistical methods
1.2 Rule-based methods
1.2.1 MRHC
1.2.2 ZRHC

Chapter 2. Problem statement and solutions
2.1 Large Pedigrees: manipulation of Mendelian constraints
2.2 Families with many markers: dealing with recombinations
2.3 Mixed data: use of population information

Chapter 3. Preliminaries
3.1 Mendelian and zero-recombinant constraints
3.2 Locus graphs
3.3 Linear constraints on h variables

Chapter 4. Linear Systems on Mendelian Constraints
4.1 Methods to solve the linear systems
4.1.1 Split nodes to break cycles
4.1.2 Detect path constraints from locus graphs
4.1.3 Encode path constraints in disjoint-set structure D
4.2 Analysis of the algorithm on tree pedigrees with complete data
4.3 Extension to General Cases
4.3.1 Pedigrees with mating loops
4.3.2 Pedigrees with missing data
4.4 Experimental Results

Chapter 5. Haplotype Inference on a Genome-wide Level
5.1 Detect Recombination Events in Families with Dense Markers
5.2 Solution Space under Mendelian Constraints
5.3 Maximum Likelihood Solution Based on Population Haplotype Frequency
5.3.1 Probabilistic prefix tree for fast branch-and-bound optimization
5.4 Experimental Results
5.4.1 Detect Recombination Events and Haplotype Diversity
5.4.2 Evaluation of Accuracy and Scalability
5.4.2.1 Influence of pedigree size, missing rate on performance
5.4.2.2 Genome-wide haplotype inference accuracy

Chapter 6. Conclusions

Bibliography

List of Tables

1.1 The evolution of the complexity of the ZRHC problem on tree pedigrees

3.1 Constraints for a parent-child pair x, y

4.1 Comparison of running time (in seconds) between DSS and Merlin on pedigree size 128

List of Figures

1.1 Haplotype

3.1 Pedigree, haplotype and recombination
3.2 Mendelian constraints
3.3 Locus graph

4.1 Node splitting
4.2 Path constraints
4.3 Path constraints
4.4 Looped pedigree
4.5 Pedigree structures used in simulation
4.6 Comparison of DSS, Merlin and PedPhase.ILP
4.7 Comparison of DSS and Merlin on different patterns of missing data

5.1 Recombination detection
5.2 constraints
5.3 Probabilistic prefix tree
5.4 The distribution of the length of ambiguous intervals of inferred recombination positions
5.5 Haplotype diversity
5.6 Degree of freedom
5.7 Recombination positions
5.8 Comparison of two methods on a dataset of 500 pedigrees
5.9 Performance of MML and Merlin on 6 of RA data

Acknowledgments

I would like to thank my advisor Dr. Jing Li for his heavy investment of time and intelligence in this work and for his consistent dedication and commitment to my Ph.D. study. I would also like to thank Yixuan Chen, Xiaolin Yin, Yoon Soon Pyon, Matthew Hayes and Robert Shields for their help throughout the progress of this project. Finally, I would like to thank Dr. Mehmet Koyutürk, Dr. Soumya Ray and Dr. Xiaofeng Zhu for serving on my dissertation committee and for their valuable mentorship.

Haplotype Inference from Pedigree Data and Population Data

Abstract

by

XIN LI

Haplotype is an important representation of genetic variation and is thus valuable for investigating the genetics behind diseases. However, humans are diploid and, in practice, genotype data instead of haplotype data are collected directly. Consequently, there is great demand for efficient and accurate computational methods to reconstruct haplotypes from genotype data. Our project started with the development of a rule-based haplotyping method for pedigree data with tightly linked markers. We formulate Mendelian constraints as a linear system of inheritance variables and solve the linear system using disjoint-set data structures. Our algorithm achieved the lowest time complexity among all existing methods. Comparisons with two popular algorithms showed that this algorithm made 10 to $10^5$-fold improvements on a variety of parameter settings. Based on the zero-recombinant haplotype inference, we went on to construct a general framework for haplotyping population and pedigree mixed data that consist of many families with unrelated founders, by combining novel techniques of recombination event detection and maximum likelihood optimization. This method makes it possible to perform genome-wide haplotype inference on pedigree and population mixed data.

Chapter 1

Introduction

A diploid organism such as a human has two homologous copies of each chromosome, one from its father and the other from its mother, as illustrated in Figure 1.1. A physical position on a chromosome is called a locus and the status of a locus is called an allele. Considering a single nucleotide as a locus, the locus can only take one of 2 alternatives, either A/T or C/G; therefore, we can represent an allele using the integers 1 and 2. Most loci of the genome have identical alleles among different people; however, we are more interested in positions where there are variations. A single-nucleotide locus that carries different alleles between members of a population is called a single-nucleotide polymorphism (SNP).

In practice, genotype data (pairs of alleles with undistinguished parental sources) instead of haplotype data are collected, especially in large-scale sequencing projects, mainly due to cost considerations. The problem of haplotyping (sometimes called “phasing”) is to use computational methods to infer the parental sources of pairs of alleles among related or unrelated individuals, and thus reconstruct the haplotypes of these individuals. Many studies of gene-disease associations have shown the importance of haplotypes, as they provide the linkage information between SNPs. Hence, there is a great demand for efficient and accurate computational methods and computer programs to infer haplotypes from genotypes.

Figure 1.1: Haplotype

Recent years have witnessed intensive research on haplotyping methods (see reviews[6, 15, 16, 23, 40]). By the type of input data, these methods can be divided into two categories: those for population data (unrelated individuals) and those for pedigree data (related individuals in a family). Methods for population data make use of the clustering property of haplotype segments in the population. On the other hand, methods for pedigree data rely on Mendelian constraints within family members. We can also categorize these methods into statistical ones and combinatorial (or rule-based) ones based on their algorithmic features. Here, we present a brief review on both statistical and rule-based methods.

1.1 Statistical methods

In general, the goal of statistical approaches is to find a haplotype assignment for each individual with the maximum likelihood. Two exact algorithms have been proposed to calculate the probability of a pedigree. The Elston-Stewart algorithm[13] takes advantage of the Markov property based on pedigree structure: given the parents’ genotype information, the genotypes of a child are independent of the genotypes of its ancestors. The algorithm is linear in pedigree size, but exponential in the number of genetic loci. The Lander-Green algorithm[20] takes advantage of the Markov property between loci: under the assumption of no recombination interference (independence of recombination events), the phase of a locus only depends on the phase of its previous locus. This algorithm is linear in the number of genetic loci, but exponential in pedigree size. Both methods assume linkage equilibrium (no correlation between alleles at two loci), which is unrealistic for tightly linked markers such as SNPs. Recently, population haplotype frequencies have been taken into consideration[3] to account for correlations among tightly linked markers (known as linkage disequilibrium). A key step in most statistical approaches is to enumerate all possible inheritance patterns and to check the genotype consistency for each of them. Due to the large number of degrees of freedom, this step usually leads to high time complexity (usually exponential, hence computationally intractable for large data sets).

1.2 Rule-based methods

Rule-based algorithms first partially infer haplotypes or inheritance vectors based on the Mendelian law of inheritance, then further optimize these candidate configurations. Therefore, rule-based algorithms can potentially gain advantages over statistical methods in efficiency. Under reasonable assumptions, such as a minimum number of recombination events or no recombination events over a segment of loci within a pedigree, one can explicitly exploit the constraints among individuals in a mathematical way. In the literature, the haplotyping problems under these two assumptions are called minimum-recombinant haplotype configuration (MRHC) and zero-recombinant haplotype configuration (ZRHC), respectively.

1.2.1 MRHC

The minimum recombination principle basically states that genetic recombination is rare, thus haplotypes with fewer recombinants should be preferred. For tightly linked markers such as SNPs, the principle is well supported by experimental data. In a series of papers[12, 22, 24], Li and Jiang proposed several algorithms based on different assumptions about the data. They developed an iterative heuristic algorithm, called block extension, which computes an optimal or nearly optimal solution when the minimum number of recombinants required is small[24]. However, its performance deteriorates significantly when the input data require more (e.g. four or more) recombinants. For pedigrees with small sizes or pedigrees with a small number of markers, they developed two dynamic programming (DP) algorithms.[12] The running time for the first DP algorithm is linear in the size of a pedigree and the time for the second one is linear in the number of markers. For the most general case of the problem, Li and Jiang designed an effective integer linear programming (ILP) formulation. It integrates missing data and haplotype inference, and employs a branch-and-bound strategy that utilizes a partial order relationship and some other special relationships among variables to decide the branching order.[22]

1.2.2 ZRHC

For the zero-recombinant haplotype configuration problem, the goal is to identify all possible haplotype assignments with no recombination. This seems a more stringent biological assumption, but it is actually more practical for tightly linked markers such as dense SNP data. In practice, a solution to the problem with no recombinants can serve as a subroutine of a general procedure to solve the general haplotype inference problem. Therefore, the investigation of efficient algorithms to obtain all zero-recombinant solutions from a pedigree is of great interest.

O’Connell[34] presented an exponential-time algorithm for ZRHC based on exhaustive enumeration. It works by eliminating all impossible genotypes. Zhang et al.[41] developed a program for ZRHC that combines logic rules and the expectation-maximization (EM) algorithm. These algorithms depend on some simple rules to reduce the freedom in the phases. However, those rules do not cover all Mendelian constraints; therefore, enumeration and a follow-up consistency check are required to find a correct configuration.

An important advance in the development of rule-based algorithms for haplotype inference is the introduction of variables to represent uncertainties, by which the problem can be discussed and solved with mathematical rigor. Li and Jiang[24] first formulated the problem as a linear system on “ps” variables (binary indicators of parental sources) and solved it using Gaussian elimination. The method has a time complexity of $O(m^3 n^3)$, where $m$ is the number of markers and $n$ is the number of individuals. More recently, Xiao et al.[39] formulated another linear system on “h” variables (binary indicators of inheritance relationships), and lowered the complexity to $O(mn^2 + n^3 \log^2 n \log\log n)$. For tree pedigrees (pedigrees without mating loops), Xiao’s method can produce a general solution in $O(mn^2 + n^3)$ time and a particular solution in $O(mn + n^3)$ time. Here, a particular solution means a specific assignment for each variable which satisfies the constraints, while a general solution is a description of all solutions in the form of linear spans of free variables. For tree pedigrees, Chan et al.[9] further reduced the complexity of finding a particular solution to linear time $O(mn)$ by manipulating the constraints on a graph structure. Liu and Jiang[32] also proposed an algorithm to produce a particular solution in $O(mn)$ time and a general solution in $O(mn^2)$ time by further exploring the special features of the linear system on a tree pedigree. However, with missing genotypes, it has been shown that ZRHC is NP-hard[31]. Therefore, it seems impossible to incorporate missing data into a pure linear system. The integer linear programming algorithm[22] for MRHC can solve ZRHC with missing genotypes as a special case, but because it does not use the zero-recombinant assumption explicitly, it may enumerate all possible haplotype assignments, which takes exponential time.

Li and Li[27] proposed a more efficient algorithm (called DSS) for detecting, recording and checking the consistency of the constraints on h variables using disjoint-set forests. By applying an adapted union-find procedure (a classic algorithm to manage disjoint sets), the proposed algorithm can produce a general solution in almost linear time, $O(mn \cdot \alpha(n))$, for a tree pedigree, where $\alpha$ is the inverse Ackermann function[11]. This improves on the best previously known algorithm, which has $O(mn^2)$ time complexity[32]. They further extended the algorithm to looped pedigrees and pedigrees with missing data, by utilizing the constraints imposed by the existing data. Experimental results show that the algorithm can output all zero-recombinant solutions and that it is much more efficient than two popular algorithms because of the significant reduction of the enumeration space. Based on this algorithm, they later developed a general framework (called MML) for haplotype inference on whole-genome population and pedigree mixed data.[29] MML can deal with recombinations and incorporates both pedigree and population information in the data.

Table 1.1: The evolution of the complexity of the ZRHC problem on tree pedigrees.

| | Li and Jiang[24] | Xiao et al.[39] | Liu and Jiang[32] | Li and Li[27] |
|---|---|---|---|---|
| general | $O(m^3 n^3)$ | $O(mn^2 + n^3)$ | $O(mn^2)$ | $O(mn \cdot \alpha(n))$ |
| special | | $O(mn + n^3)$ | $O(mn)$ | |

Chapter 2

Problem statement and solutions

The purpose of this project is to develop efficient haplotyping methods that are applicable in various situations. There are many possible types of genotype data, and each requires special techniques to handle. We can specify the features of the input data in four aspects.

1. Pedigree/population size

2. Number of loci

3. Missing rate

4. Existence of recombinations

Statistical methods are usually not scalable on (1), (2) and (3) due to their enumerative nature. On the other hand, rule-based methods are intolerant to (4) because they rely on some parsimony criteria on the number of recombinants. We introduce 3 common scenarios of the input data. In each scenario, a typical haplotyping challenge should be addressed.

2.1 Large Pedigrees: manipulation of Mendelian constraints

In large pedigrees, the constraints induced by the Mendelian law of inheritance are very complicated, such that they cannot be handled by simple rules as in some previous methods. Here, we use a linear system to compactly represent all the Mendelian constraints in a pedigree. We discovered an almost linear time algorithm to solve such a system by using disjoint-set data structures.

By encoding Mendelian constraints using linear systems, we have developed a rule-based haplotyping method: DSS[27]. Instead of using the conventional Gaussian elimination, we use disjoint-set data structures to solve the linear systems. The linear systems as well as the special way to solve them enable us to achieve much higher efficiency than previous algorithms. It is in theory the fastest existing algorithm in terms of the pedigree size and the number of loci. We describe our approach in Chapter 4.

2.2 Families with many markers: dealing with recombinations

With advances in genotyping technology, whole-genome dense SNP data have become increasingly common. There is a growing need for haplotyping techniques that scale to millions of markers. The critical problem in handling so many markers lies in the proper treatment of recombinations. We have developed a multi-stage haplotyping scheme to accurately locate the recombination positions.

In a nuclear family, phases of certain loci of a child can be unambiguously determined without the information of its siblings. These loci include: (1) homozygous loci and (2) heterozygous loci at which one or both of the parents are homozygous. Since we have an abundant number of loci, we can first make use of these most informative ones to generate a rough picture of the entire haplotype. Based on this sketch map, we roughly know the position of each recombination and the inheritance relationships between individuals. For the remaining loci, we assume that their inheritance states agree with those of the neighboring informative loci. If the neighbors disagree, as in the vicinity of a recombination position, we can further haplotype the loci around the position to narrow the gaps left between informative loci of type (1) or (2). This also localizes the recombination position. The details of this method are given in Chapter 5.
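The two kinds of informative loci described above can be sketched in a few lines. This is a minimal illustration, not the implementation used in this work: the function name `phase_child_locus` and the tuple encoding of genotypes are assumptions made for the example, and the trio is assumed to be Mendelian-consistent.

```python
from typing import Optional, Tuple

# Hypothetical encoding: an unordered pair of alleles, each allele being 1 or 2.
Genotype = Tuple[int, int]


def phase_child_locus(father: Genotype, mother: Genotype,
                      child: Genotype) -> Optional[Tuple[int, int]]:
    """Return (paternal_allele, maternal_allele) of the child at one locus.

    Returns None when the locus is not informative, i.e. father, mother and
    child are all heterozygous. No Mendelian consistency check is performed.
    """
    a, b = child
    if a == b:
        # Case (1): a homozygous child carries the same allele on both haplotypes.
        return (a, a)
    if father[0] == father[1]:
        # Case (2): a homozygous father can only transmit his single allele.
        paternal = father[0]
        maternal = b if paternal == a else a
        return (paternal, maternal)
    if mother[0] == mother[1]:
        # Case (2): a homozygous mother can only transmit her single allele.
        maternal = mother[0]
        paternal = b if maternal == a else a
        return (paternal, maternal)
    return None  # ambiguous without further information


if __name__ == "__main__":
    print(phase_child_locus((1, 1), (1, 2), (1, 2)))
    print(phase_child_locus((1, 2), (1, 2), (1, 2)))
```

Loci for which the function returns None are the ones that, in the scheme above, would be resolved from neighboring informative loci.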

2.3 Mixed data: use of population information

Even among unrelated individuals, frequent patterns are observed over short segments of haplotypes, which we call population information. In many real applications, data come as many families with unrelated founders. In this circumstance, we have both types of information: Mendelian constraints among family members and population information among founders. Neither a population-based nor a pedigree-based approach alone may achieve a satisfactory result on such mixed data.[26] We have developed a method (named MML[29]) to combine the use of family constraints and population haplotype frequencies. The approach is carried out in the following steps.

1. Infer recombination positions for each family and each chromosome. Partition the chromosome according to recombination positions.

2. On each pedigree, for each of the zero-recombinant segments, apply DSS[28] to establish the solution space under Mendelian and zero-recombinant constraints.

3. Search the solution space (obtained in (2)) for the optimal solution with maximum likelihood based on population haplotype frequency.

In step 2, DSS exploits the Mendelian and zero-recombinant constraints within a pedigree, which can be expressed by a linear system on ps variables (or alleles) and h variables. DSS outputs a general solution of this linear system, where some alleles are designated as free variables and others are dependent on these free variables. This format of solutions helps us to apply a branch-and-bound strategy while searching for the maximum likelihood solution. In step 3, we perform a depth-first search of the solution space (an enumeration of all compatible solutions) with a branch-and-bound strategy. Experiments on real and simulated data sets demonstrate that MML is faster and more accurate than previous methods. We will discuss the details of MML in Chapter 5.

Chapter 3

Preliminaries

A pedigree graph indicates the parent-child relationships among an extended family. Figures 3.1(b) present pedigrees in a conventional manner. The pedigree in Figure 3.1(b) has a mating loop, where an offspring (node 9) is produced by the mating of two relatives (nodes 6 and 7). A pedigree without mating loops is called a tree pedigree. A nuclear family consists only of both parents and their children.

At each locus $i$, a child may inherit either the paternal or the maternal allele of a parent. We use a binary variable to indicate the parental sources (ps) of the two alleles in a child.

Definition 3.0.1. A ps variable $p_i^x \in \{0, 1\}$ is defined for each locus $i$ of each individual $x$. $p_i^x = 0$ if the smaller allele of locus $i$ is of paternal source, $p_i^x = 1$ if it is of maternal source. We technically let $p_i^x = 0$ if locus $i$ is homozygous (two alleles being the same).

Figure 3.1: Pedigree, haplotype and recombination. (a) A pedigree graph. We use a circle to represent a female and a square to represent a male in a pedigree. (b) A haplotype is composed of all alleles on one chromosome segment. Each allele is an integer value representing the status of a marker at a chromosome locus. (c) A recombination event occurs when a child does not inherit a complete haplotype from its parents. Individual 3 has a paternal haplotype 11 which is not seen in his father. So there must be a crossover between two chromosomes of his father in meiosis, which results in a recombinant haplotype.

Loosely speaking, a haplotype consists of all alleles on a chromosome. Recombination events or crossovers occur when a child inherits a shuffled version of its parent’s two haplotypes (see Figure 3.1(c) for an example). However, even for a fairly large segment of chromosome with $m$ SNPs, the likelihood of recombination between a parent-child pair is extremely small. For example, a rough estimate of the relationship between genetic distance and physical distance is about 1 Mbp/cM. The average marker interval of a 500K SNP chip is about 6 kbp. Therefore, the probability of seeing a single recombination event in a parent-child pair over 170 SNP markers (∼1 Mbp) is only ∼1%. One can therefore assume that a child inherits an entire haplotype segment from a parent over a fairly large number of SNPs (i.e., the zero-recombinant assumption). In such a case, the inheritance behavior between a parent-child pair is the same throughout all $m$ loci, and it is convenient and practically appealing to use a single binary variable (h) to indicate the inheritance behavior between a parent-child pair.
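The back-of-the-envelope estimate above can be reproduced with a short calculation. The sketch below only restates the arithmetic in the text; the function name and parameter names are illustrative, and it relies on the usual approximation that, over short distances, 1 cM corresponds to about a 1% chance of recombination per meiosis.

```python
def recombination_probability(n_markers: int,
                              spacing_bp: float = 6_000,
                              bp_per_cm: float = 1_000_000) -> float:
    """Approximate chance of a recombination within a run of SNP markers.

    spacing_bp: average distance between adjacent markers (about 6 kbp on a
                500K SNP chip, as quoted in the text).
    bp_per_cm:  rough physical length of one centimorgan (about 1 Mbp).
    """
    segment_bp = n_markers * spacing_bp           # physical length of the segment
    genetic_distance_cm = segment_bp / bp_per_cm  # length in centimorgans
    return genetic_distance_cm / 100.0            # 1 cM is roughly 1% per meiosis


if __name__ == "__main__":
    p = recombination_probability(170)
    print(f"{p:.4f}")  # prints 0.0102
```

With 170 markers the segment spans about 1.02 Mbp, which gives the ∼1% figure quoted above.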

Definition 3.0.2. An inheritance variable $h^{x_1 x_2} \in \{0, 1\}$ is defined between a parent $x_1$ and a child $x_2$. $h^{x_1 x_2} = 0$ if $x_2$ inherits the paternal haplotype of $x_1$, $h^{x_1 x_2} = 1$ if $x_2$ inherits the maternal haplotype of $x_1$.

3.1 Mendelian and zero-recombinant constraints

Figure 3.2 gives an example of the relationship between the ps variables and the h variable of a parent-child pair under Mendelian constraints.

Figure 3.2: Mendelian constraints. The four panels show the cases $(p^a, h^{ab}, p^b) = (0, 0, 0)$, $(0, 1, 1)$, $(1, 0, 1)$ and $(1, 1, 0)$. Individual a is the father of individual b. In the situation where both a and b are heterozygous, the relationship between the ps variables and the h variable can be expressed as $p^b = p^a + h^{ab}$.

Mendelian laws of inheritance impose constraints on ps and h variables for each parent-child pair at each locus. These constraints can be represented by a linear relationship of ps and h variables over the group $(\mathbb{Z}_2, +)$ (where $0 + 0 = 0$, $0 + 1 = 1$, $1 + 1 = 0$). Table 3.1 summarizes all cases of constraints at a certain locus $i$ for a parent-child pair. When an individual is homozygous at a certain locus, its ps variable at this locus is determined by definition. When one or both of the parents of an individual are homozygous at a certain locus, this individual’s ps variable at this locus is also determined. In both cases, the ps variable is pre-determined. In all the other cases, there is a constraint for each parent-child pair between ps variables and the h variable, as shown in the last three cases in Table 3.1. The constraints introduced by the zero-recombinant assumption are enforced by the single h variable between each parent-child pair. Therefore, the system formed by the sets of constraints collected based on Table 3.1 consists of all the constraints from the data. The satisfiability (or consistency) of this system is equivalent to the existence of a zero-recombinant solution.

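Table 3.1: Constraints for a parent-child pair x, y.

| genotype of parent x | genotype of child y | pre-determined ps variable | constraint if x is father | constraint if x is mother |
|---|---|---|---|---|
| 1/1 | 1/2 | $p_i^x = 0$ | $p_i^y = 0$ | $p_i^y = 1$ |
| 2/2 | 1/2 | $p_i^x = 0$ | $p_i^y = 1$ | $p_i^y = 0$ |
| 1/2 | 1/2 | | $p_i^y = p_i^x + h^{xy}$ | $p_i^y = p_i^x + h^{xy} + 1$ |
| 1/2 | 1/1 | $p_i^y = 0$ | $p_i^y = p_i^x + h^{xy}$ | $p_i^y = p_i^x + h^{xy}$ |
| 1/2 | 2/2 | $p_i^y = 0$ | $p_i^y = p_i^x + h^{xy} + 1$ | $p_i^y = p_i^x + h^{xy} + 1$ |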
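The last three rows of Table 3.1 can be checked mechanically against Definitions 3.0.1 and 3.0.2. The sketch below is an illustration written for this purpose, not code from DSS: all function names are hypothetical, genotypes are encoded as sorted tuples, and the constant returned by `edge_constant` is the $c$ in $p_i^y = p_i^x + h^{xy} + c$ over $\mathbb{Z}_2$.

```python
from itertools import product
from typing import Optional, Tuple

Genotype = Tuple[int, int]  # sorted pair of alleles, e.g. (1, 2); illustrative encoding


def edge_constant(parent: Genotype, child: Genotype,
                  parent_is_father: bool) -> Optional[int]:
    """Constant c with p_child = p_parent + h + c (mod 2), following Table 3.1.

    Returns None when the parent is homozygous (no constraint involving h).
    """
    if parent[0] == parent[1]:
        return None
    if child[0] != child[1]:                 # row 1/2, 1/2
        return 0 if parent_is_father else 1
    return 0 if child == (1, 1) else 1       # rows 1/2, 1/1 and 1/2, 2/2


def transmitted_allele(p_parent: int, h: int) -> int:
    """Allele passed on by a heterozygous 1/2 parent (Definitions 3.0.1, 3.0.2)."""
    paternal, maternal = (1, 2) if p_parent == 0 else (2, 1)
    return paternal if h == 0 else maternal


def child_ps(child: Genotype, allele: int, parent_is_father: bool) -> Optional[int]:
    """ps variable of the child implied by the received allele, None if impossible."""
    if child[0] == child[1]:
        return 0 if allele == child[0] else None
    smaller_is_paternal = (allele == 1) == parent_is_father
    return 0 if smaller_is_paternal else 1


def table_is_consistent() -> bool:
    """Brute-force check of the three heterozygous-parent rows of Table 3.1."""
    cases = product([(1, 1), (1, 2), (2, 2)], [True, False], [0, 1], [0, 1])
    for child, is_father, p_x, h in cases:
        c = edge_constant((1, 2), child, is_father)
        predicted = (p_x + h + c) % 2
        actual = child_ps(child, transmitted_allele(p_x, h), is_father)
        if actual is None:
            # Impossible transmission: the constraint must be violated, i.e. it
            # must not predict the fixed value 0 of a homozygous child.
            if predicted == 0:
                return False
        elif actual != predicted:
            return False
    return True
```

Calling `table_is_consistent()` enumerates every combination of child genotype, parent sex, $p_i^x$ and $h^{xy}$ and confirms that the tabulated constraint holds exactly when the transmission is possible.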

3.2 Locus graphs

To process constraints, Xiao et al.[39] introduced the concept of locus graphs. We give a brief introduction here for the sake of completeness. A locus graph $L_i(V, E_i)$ is constructed for each locus $i$ to record the constraints on h variables. $V$ consists of all individuals as nodes. There exists an edge in $E_i$ between a parent-child pair only if the ps variables of this pair are constrained by the corresponding h variable, i.e., when the parent is heterozygous at locus $i$ (the last three cases in Table 3.1). Each edge is also labeled by the h variable and the constant associated with the constraint. We refer to this kind of constraint (a linear equation consisting of ps variables and an h variable) as an edge constraint. Figure 3.3(b) shows an example of a locus graph.

Figure 3.3: Locus graph. (a) A pedigree with 8 members. (b) Given the genotype at a certain locus $i$, the corresponding locus graph $L_i$ and h variable constraints. The ps variables of shaded members (2, 4, 7, 8) are pre-determined. From this locus graph, we can generate two non-redundant h variable constraints: one is a cycle constraint, $h^{35} + h^{36} + h^{45} + h^{46} = 0$ (formed by individuals 3, 4, 5, 6); the other is a path constraint, $h^{45} + h^{58} = 0$ (from individual 4 to 8 via 5).

The original idea of Ref. [39] was to integrate edge constraints to construct a new subsystem that only consists of h variables. Their algorithm then solved the subsystem and used the h variable solutions to solve for the ps variables. We also record edge constraints on locus graphs. However, instead of explicitly listing and solving the constraints on h variables, we use disjoint-set structures to collect, encode and thus examine the consistency of these constraints, which helps us achieve a better time complexity for obtaining a general solution.

3.3 Linear constraints on h variables

There are essentially two types of constraints on h variables in a locus graph $L_i$: path constraints and cycle constraints. Notice that the classification of constraints here is more succinct than those in previous work[32, 39] because our method of handling constraints does not require further discrimination of them. According to Table 3.1, each edge $e_{xy}$ in a locus graph represents an edge constraint of the form $p_i^x + p_i^y = h^{xy} + c_i^{xy}$, where $c_i^{xy}$ is a constant $\in \{0, 1\}$. We use a subscript $i$ for $c_i^{xy}$ because for different loci, the constant between a parent-child pair may be different, which depends on the genotype at that locus as specified in Table 3.1. For a path $P_{s,t}$ from individual $s$ to individual $t$ in locus graph $L_i$, if we sum up all edge constraints on this path, we have

$$\sum_{e_{xy} \in P_{s,t}} (p_i^x + p_i^y) = p_i^s + p_i^t = \sum_{e_{xy} \in P_{s,t}} (h^{xy} + c_i^{xy}).$$

If $p_i^s$ and $p_i^t$ are pre-determined constants, we end up with a path constraint on h variables, which is

$$\sum_{e_{xy} \in P_{s,t}} h^{xy} = p_i^s + p_i^t + \sum_{e_{xy} \in P_{s,t}} c_i^{xy}, \qquad (3.1)$$

where the right-hand side is a constant. Similarly, for a cycle $C$ in locus graph $L_i$, which may exist even on a tree pedigree (e.g., when a nuclear family has more than one heterozygous child), we sum up all edge constraints on $C$,

$$\sum_{e_{xy} \in C} (p_i^x + p_i^y) = 0 = \sum_{e_{xy} \in C} (h^{xy} + c_i^{xy}),$$

and finally have a cycle constraint on h variables

$$\sum_{e_{xy} \in C} h^{xy} = \sum_{e_{xy} \in C} c_i^{xy}.$$
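The two summations above amount to collecting the h variables along a path or cycle and adding up constants modulo 2. The following sketch shows that bookkeeping on made-up input; the function names, the string labels of the h variables and the example constants are all hypothetical and are not taken from the figures.

```python
from typing import List, Tuple

# One edge of a locus graph: (label of its h variable, its constant c_i^{xy}).
Edge = Tuple[str, int]


def path_constraint(edges: List[Edge], p_s: int, p_t: int) -> Tuple[List[str], int]:
    """Return (h variables on the path, right-hand side) as in Eq. 3.1.

    p_s and p_t are the pre-determined ps variables of the two end nodes;
    all arithmetic is over Z_2.
    """
    rhs = (p_s + p_t) % 2
    h_vars = []
    for h_name, c in edges:
        h_vars.append(h_name)
        rhs = (rhs + c) % 2
    return h_vars, rhs


def cycle_constraint(edges: List[Edge]) -> Tuple[List[str], int]:
    """On a cycle every ps variable appears twice and cancels, leaving only constants."""
    return path_constraint(edges, 0, 0)


if __name__ == "__main__":
    # Hypothetical path s - b - t with constants 0 and 1 and both end nodes fixed to 0.
    print(path_constraint([("h_sb", 0), ("h_bt", 1)], p_s=0, p_t=0))
    # Hypothetical 4-cycle whose constants sum to 0 modulo 2.
    print(cycle_constraint([("h1", 1), ("h2", 0), ("h3", 1), ("h4", 0)]))
```

The returned pair reads as "the sum of the listed h variables equals the right-hand side", which is the form in which the constraints are processed in the next chapter.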

Chapter 4

Linear Systems on Mendelian Constraints

4.1 Methods to solve the linear systems

By exploiting special features of the constraints on h variables, it is not necessary to explicitly list every path and cycle constraint to check their consistency. We employ disjoint-set structures to detect and to check the consistency of constraints on h variables. For each locus graph $L_i$, we build a disjoint-set structure $D_i$ to encode its connectivity information. We update the disjoint-set structure incrementally upon processing each edge constraint on a locus graph. Path constraints on a locus graph are detected during this process and are stored in another disjoint-set structure $D$. The whole algorithm works on $m + 1$ such disjoint-set structures, one $D_i$ for each locus graph $L_i$ and one $D$ for encoding all path constraints.

In this section, we assume the inputs are tree pedigrees with complete data. Cycles on a locus graph from a tree pedigree can only be generated within a nuclear family when it has multiple children. We first discuss a node splitting strategy in subsection 4.1.1 to break all such short cycles, to obtain only path constraints for further processing. Construction of $D_i$ from each locus graph $L_i$ to detect path constraints will be discussed in subsection 4.1.2. Processing of constraints and consistency checking will be discussed in subsection 4.1.3, where a general solution of h variables is decoded from the disjoint-set structure $D$. Solutions of ps variables will then be obtained. The analysis of time complexity and correctness of the algorithm on tree pedigrees will be discussed in section 4.2. One of the advantages of the proposed algorithm is that it can be easily extended to the general cases of looped pedigrees and pedigrees with missing data; we show these extensions in section 4.3.
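To make the role of a disjoint-set structure in consistency checking concrete, the sketch below shows a generic union-find augmented with a parity offset, a standard technique for equations of the form $x_u + x_v = c$ over $\mathbb{Z}_2$. It is an illustration under stated assumptions rather than the DSS procedure itself: the class name, the method names and the restriction to two-variable equations are choices made for this example, and the way $D_i$ and $D$ are actually built and combined is described in the following subsections.

```python
class ParityDisjointSet:
    """Union-find in which every node stores its parity relative to its representative."""

    def __init__(self, n: int) -> None:
        self.rep = list(range(n))   # parent pointer; a root points to itself
        self.offset = [0] * n       # parity between a node and rep[node]
        self.rank = [0] * n

    def find(self, v: int):
        """Return (root of v, parity of v relative to that root), with path compression."""
        path = []
        while self.rep[v] != v:
            path.append(v)
            v = self.rep[v]
        root, parity = v, 0
        for node in reversed(path):          # start with the node closest to the root
            parity ^= self.offset[node]
            self.rep[node] = root
            self.offset[node] = parity
        return root, (self.offset[path[0]] if path else 0)

    def union(self, u: int, v: int, c: int) -> bool:
        """Record x_u + x_v = c (mod 2); return False if it contradicts earlier equations."""
        ru, pu = self.find(u)
        rv, pv = self.find(v)
        if ru == rv:
            return (pu ^ pv) == c            # already related: check consistency only
        if self.rank[ru] < self.rank[rv]:
            ru, rv = rv, ru                  # attach the shallower tree under the deeper one
        self.rep[rv] = ru
        self.offset[rv] = pu ^ pv ^ c        # parity between the two old roots
        if self.rank[ru] == self.rank[rv]:
            self.rank[ru] += 1
        return True


if __name__ == "__main__":
    d = ParityDisjointSet(3)
    print(d.union(0, 1, 1), d.union(1, 2, 1))  # two new equations are recorded
    print(d.union(0, 2, 0))                    # implied by the first two
    print(d.union(0, 2, 1))                    # contradicts them
```

With union by rank and path compression, a sequence of such operations runs in near-linear time governed by the inverse Ackermann function, which is the source of the $\alpha(n)$ factor quoted for disjoint-set methods.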

4.1.1 Split nodes to break cycles

In order to simplify constraint detection, we first transform cycle constraints to path constraints by breaking cycles in locus graphs. There are essentially two kinds of cycles in a locus graph: global cycles that are introduced by marriages between relatives and local cycles that are introduced by multiple children within one nuclear family (e.g., Figure 3.3(b)). Only local cycles exist in a tree pedigree, and they are dealt with in this subsection. The treatment of global cycles is deferred to subsection 4.3.1, where we discuss the extension to looped pedigrees. We break local cycles for each nuclear family with multiple children by splitting some child nodes and by remounting their edges on each locus graph. More specifically, when a nuclear family has multiple children, any child node $v$ (except an arbitrarily fixed one $v_0$) and its genotypes will be duplicated to create a new node $v'$ in the same manner across all locus graphs. New ps variables will be introduced for these duplicated nodes. For each split node $v$, the edge from its mother (if there is one) will be reconnected to node $v'$. All other edges involving node $v$ remain untouched. Figure 4.1 shows an example of how node splitting is performed.

By doing so, we technically avoid the treatment of cycle constraints. After the duplication, all new locus graphs (actually locus trees now) still have the same set of nodes. Notice that one has to record all local cycle constraints on h variables and constraints that the ps variables of duplicated nodes must have the same assignments as those of the original nodes. Their constraints can be easily dealt with for local cycles because they only involve local structures (nuclear families). This will be further discussed in the next subsection.
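The splitting rule can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the function name `split_nuclear_family`, the representation of a locus graph as a set of (parent, child) pairs, and the small example family (parents 3 and 4, with 4 assumed to be the mother) are hypothetical and not taken from the actual implementation.

```python
def split_nuclear_family(edges, mother, children):
    """Break the local cycle of one nuclear family in one locus graph.

    edges    -- set of (parent, child) pairs present in this locus graph
    mother   -- id of the mother of the nuclear family
    children -- list of child ids; children[0] is kept as the anchor child
    Returns (new_edges, duplicate_of), where duplicate_of maps each split
    child v to the id of its duplicate.
    """
    anchor, *others = children
    # Every non-anchor child is duplicated, even if this locus has no cycle,
    # so that all locus graphs keep the same node set.
    duplicate_of = {v: f"{v}'" for v in others}

    new_edges = set()
    for parent, child in edges:
        if parent == mother and child in duplicate_of:
            # Remount the mother's edge onto the duplicate node.
            new_edges.add((parent, duplicate_of[child]))
        else:
            # The father's edge and all other edges stay untouched.
            new_edges.add((parent, child))
    return new_edges, duplicate_of


# Hypothetical family: parents 3 (father) and 4 (mother), children 5 and 6.
edges = {(3, 5), (3, 6), (4, 5), (4, 6)}  # local cycle 3-5-4-6-3
new_edges, duplicate_of = split_nuclear_family(edges, mother=4, children=[5, 6])
```

After the call, the cycle has become the path 6' - 4 - 5 - 3 - 6, which is the shape of path used later to express the local cycle constraint.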

4.1.2 Detect path constraints from locus graphs

We develop an incremental procedure to detect all path constraints from a locus graph by utilizing a disjoint-set structure. As we can see from the constraints on h variables in Eq. 3.1, a path constraint is specified by the ps variables of its end nodes and the summation of the constant parity values $c_i^{xy}$ associated with the edge constraints on its edges. Our goal is to detect each non-redundant path on a locus graph with pre-determined end nodes and meanwhile obtain the constant parity summation associated with that path.

To do so, we maintain a disjoint-set structure $D_i$ for each locus graph $L_i$ and update it incrementally. The disjoint-set structure is defined by a pair of values $\mathit{rep}_i[v]$, $\mathit{offset}_i[v]$ for each node $v$ in $V(L_i)$. We use the subscript $i$ here to emphasize that the disjoint-set structure $D_i$ is specific to each locus graph. $\mathit{rep}_i[v]$ indicates the node which acts as the representative of the set containing $v$, and the offset $\mathit{offset}_i[v]$ indicates the summation of the constants associated with the edge constraints on the path from $v$ to its representative. Namely, if $\mathit{rep}_i[v] = v_0$, then

\[ \mathit{offset}_i[v] = \sum_{e^{xy} \in P_{\widetilde{v,v_0}}} c_i^{xy}, \]

where $P_{\widetilde{v,v_0}}$ is the path with end nodes $v$ and $v_0$, and $c_i^{xy}$ is the constant associated with the edge constraint on edge $e^{xy}$ (as specified in the last 3 cases of Table 3.1).

Figure 4.1: Node splitting. Node splitting applied to a nuclear family at two loci to remove local cycles. (a) The original locus graph (left), and the locus graph (right) with edges remounted after node 6 was duplicated. (b) A locus graph at another locus before (left) and after (right) node 6 was split. Though no local cycle exists in the locus graph in (b), node 6 was also duplicated so that all locus graphs will still have the same number of nodes after splitting.

Initially, for every node $v \in V$: $\mathit{rep}_i[v] \leftarrow v$, $\mathit{offset}_i[v] \leftarrow 0$. We examine each edge $e^{xy} \in L_i$ and update $D_i$ by considering the edge constraint $p_i^x + p_i^y = h^{xy} + c_i^{xy}$ represented by $e^{xy}$. The two sets represented by $\mathit{rep}_i[x]$ and $\mathit{rep}_i[y]$ are merged into one because they are connected by the edge $e^{xy}$, and we always let a pre-determined representative be the representative of the new set if there is one. Meanwhile, if both $p_i^{\mathit{rep}_i[x]}$ and $p_i^{\mathit{rep}_i[y]}$ are pre-determined, we report a path constraint and record it in $D$ for the consistency check (see subsection 4.1.3). At the end, any two nodes connected by a path in $L_i$ will be merged into one set, and a set in $D_i$ consists only of connected nodes in $L_i$.

By doing so, we can safely detect all path constraints on $L_i$. Furthermore, the constant associated with a path constraint between two nodes $s$ and $t$ in the same set can be reconstructed as

\[ \sum_{e^{xy} \in P_{\widetilde{s,t}} \in L_i} c_i^{xy} = \mathit{offset}_i[s] + \mathit{offset}_i[t]. \]
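As a small worked illustration of this reconstruction (the nodes and constants are hypothetical, not taken from the figures), suppose a set with representative $r$ contains nodes $s$ and $t$, where $s$ reaches $r$ through two edges with constants 1 and 0, and $t$ reaches $r$ through a single edge with constant 1. All arithmetic is modulo 2:

\[
\mathit{offset}_i[s] = 1 + 0 = 1, \qquad
\mathit{offset}_i[t] = 1, \qquad
\sum_{e^{xy} \in P_{\widetilde{s,t}}} c_i^{xy} = \mathit{offset}_i[s] + \mathit{offset}_i[t] = 1 + 1 = 0 .
\]

If the two tree paths from $s$ and from $t$ to $r$ shared an edge, its constant would be counted twice in the sum of offsets and would cancel modulo 2, which is why the formula still gives the constant of the path between $s$ and $t$ when $r$ does not lie on that path.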

The procedure is illustrated in Algorithm 1.

Algorithm 1 $\mathit{Union}_i(x, y, c_i^{xy})$
    if both $p_i^{\mathit{rep}_i[x]}$ and $p_i^{\mathit{rep}_i[y]}$ are pre-determined then
        Report a path constraint $P$ from node $\mathit{rep}_i[x]$ to $\mathit{rep}_i[y]$: $\sum_{e^{xy} \in P} h^{xy} = c$, where $c = p_i^{\mathit{rep}_i[x]} + p_i^{\mathit{rep}_i[y]} + \mathit{offset}_i[x] + \mathit{offset}_i[y] + c_i^{xy}$.
        Encode the constraint in $D$ by applying $\mathit{Union}(\mathit{rep}_i[x], \mathit{rep}_i[y], c)$.
    end if
    if $p_i^{\mathit{rep}_i[y]}$ is not pre-determined then
        $\mathit{offset}_i[\mathit{rep}_i[y]] \leftarrow \mathit{offset}_i[y] + \mathit{offset}_i[x] + c_i^{xy}$
        $\mathit{rep}_i[\mathit{rep}_i[y]] \leftarrow \mathit{rep}_i[x]$
    else
        $\mathit{offset}_i[\mathit{rep}_i[x]] \leftarrow \mathit{offset}_i[x] + \mathit{offset}_i[y] + c_i^{xy}$
        $\mathit{rep}_i[\mathit{rep}_i[x]] \leftarrow \mathit{rep}_i[y]$
    end if

We also need to capture all constraints that may not have been processed yet in the above procedure due to node splitting. This is easy for a tree pedigree, which can only have local cycles to split. There are three possible types of constraints that need special attention due to node splitting: local cycle constraints themselves, constraints on ps variables between duplicated nodes and their corresponding splitting nodes, and path constraints that existed in the locus graph before splitting but were broken by splitting. We examine each of these constraints by case analysis. First of all, no node splitting is needed if a nuclear family has only one child. Secondly, a local cycle constraint exists in an original locus graph before splitting if and only if both parents of a nuclear family with multiple children are heterozygous. Therefore, we only have two cases for nuclear families with multiple children: i) both parents are heterozygous (local cycles exist); ii) at least one parent is homozygous (no local cycles).

We first focus on case one. To collect such a local cycle constraint after node splitting, we can examine every splitting node $v$ and its duplicate $v'$. Based on the splitting strategy, it is easy to see that a cycle constraint exists in the original locus graph if and only if there exists a path between the two nodes $v$ and $v'$ in the new locus graph after node splitting. Notice that when processing edge constraints, any nodes that are connected have been grouped into one set in $D_i$. Therefore, the existence of a path between $v$ and $v'$ can be verified by checking whether their representatives are the same. That is, for each pair $(v, v')$, a local cycle constraint exists in the original locus graph before splitting if and only if $\mathit{rep}_i[v] = \mathit{rep}_i[v']$. This local cycle constraint can now be represented by a path constraint on the path $P$ that consists of $v', m, v_0, f, v$, where $m$ and $f$ are the parent nodes of $v$, and $v_0$ is the anchor child node in this nuclear family. The path constraint should have the form

\[ \sum_{e^{xy} \in P} h^{xy} = \mathit{offset}_i[v] + \mathit{offset}_i[v'] + p_i^{v} + p_i^{v'}. \]

However, one should notice that $v'$ is the duplicate of $v$ and their ps variables should always be the same. Therefore, we add a path constraint of the form

\[ \sum_{e^{xy} \in P} h^{xy} = \mathit{offset}_i[v] + \mathit{offset}_i[v'] \]

to $D$ instead. This way both the local cycle constraint and the ps variable constraint introduced by node splitting are enforced. It turns out that the third type of constraints (path constraints that existed before node splitting and go through the edge $e^{mv}$) has also been taken care of, because for each such path $P$ with $e^{mv} \in P$, there exists an alternative path $P'$ that goes through the edges $m \to v_0 \to f \to v$. As long as the local cycle constraint has been enforced, the two alternative paths are equivalent. Because our normal procedure collects the constraint on $P'$ from the locus graph after node splitting, the constraint on $P$ is redundant and can be safely dismissed.

When at least one parent is homozygous (case two), no local cycle constraints exist. The ps variables of all children, including the duplicated nodes, are pre-determined because at least one parent is homozygous. Therefore the ps assignments of $v$ and $v'$ will always be the same. If a path constraint $P$ contains edge $e^{mv}$ before splitting, it must end at node $v$ because $v$ is pre-determined. It is easy to see that it is now replaced by a path constraint $P'$ containing edge $e^{mv'}$ and ending at $v'$. The only difference between the two paths $P$ and $P'$ is that edge $e^{mv} \in P$ is replaced by $e^{mv'}$. But the constraints on these two paths are the same and only one (i.e., the constraint on $P'$, which has been processed) is needed.

Thus all three types of constraints have been correctly processed. We illustrate the cases using an example in Figure 4.2. Figure 4.3 gives an example of how to detect constraints on a locus graph $L_i$. In the actual implementation of a disjoint-set forest, a node may not directly point to its set representative (see Refs. [11, 38]); we simplify the representation here for clarity of demonstration.

Figure 4.2: Path constraints. This example illustrates all possible patterns of locus graphs of a nuclear family on a tree pedigree. (a) If neither of the parents is homozygous at this locus, then there should be a loop constraint, $h^{36} + h^{35} + h^{45} + h^{46} = c$. Since we split node 6, it is expressed as a path constraint on path $P_{\widetilde{6,6'}}$. Since the locus graph is still connected, no path via this nuclear family will be broken up by the split of node 6. (b), (c) If one or both of the parents are homozygous at this locus, then both of the children are pre-determined. In this situation, path constraints such as $P_{\widetilde{5,6'}}$ will only take the children as end nodes, so that they remain on a consecutive path, unaffected by the split of node 6.

4.1.3 Encode path constraints in disjoint-set structure D

Once we detect a path constraint, we also encode this constraint in a disjoint-set structure $D$. As usual, $D$ is defined by a pair of values $\mathit{rep}[v]$ and $\mathit{offset}[v]$ for each node $v \in V$. $\mathit{rep}[v]$ is a pointer to a node and $\mathit{offset}[v] \in \{0, 1\}$ is a constant. We maintain this disjoint-set structure $D$ such that any two nodes $k$ and $l$ in the same set encode a path constraint of the form

\[ \sum_{e^{xy} \in P_{\widetilde{k,l}}} h^{xy} = \mathit{offset}[k] + \mathit{offset}[l]. \]

Initially, $\mathit{rep}[v] \leftarrow v$ and $\mathit{offset}[v] \leftarrow 0$ for every $v \in V$. When processing a path constraint $\sum_{e^{xy} \in P_{\widetilde{i,j}}} h^{xy} = c$, we check whether the representatives of the two end nodes $i$ and $j$ are the same. If they are not the same, which means no constraints on h variables between these two nodes have been discovered so far, we merge the two sets represented by $\mathit{rep}[i]$ and $\mathit{rep}[j]$ as illustrated in Algorithm 2. When $\mathit{rep}[i]$ and $\mathit{rep}[j]$ are the same (a constraint already exists before seeing the current constraint), we must check the consistency and redundancy between the current constraint and the previous constraint. This can be easily done by comparing the constant $c$ associated with the new constraint and the constant $\mathit{offset}[i] + \mathit{offset}[j]$ associated with the existing constraint. If the two constants are the same, the new constraint is redundant and is dropped; otherwise, an inconsistency exists, and the program reports that no solution with zero recombination exists and terminates. The procedure is summarized in Algorithm 2.

Figure 4.3: Path constraints. An example showing the detection of all constraints from a locus graph after node splitting. (a) Locus graph $L_i$ of a pedigree with 8 nodes at a certain locus $i$. Shaded nodes are pre-determined. (b) The disjoint-set forest formed by adding edges 1-4, 4-5', 3-5, 3-6 and 5-8 of the locus graph $L_i$ in (a). No path constraint has been detected so far. We simply merge the sets containing each pair of nodes. A pointer is annotated with the offset of a node to its representative. If we further process edge 4-6 of $L_i$, because both 4 and 6 have a representative with a pre-determined ps variable, a path between the two representatives (nodes 4 and 8) will induce a path constraint, which is $\sum_{e^{xy} \in P_{\widetilde{4,8}}} h^{xy} = 0$. This is a 3rd type constraint defined in the text. When dealing with the splitting node pair 5 and 5', the local cycle constraint (1st type) has been replaced by a path constraint $\sum_{e^{xy} \in P_{\widetilde{5,5'}}} h^{xy} = 0$. By doing so, the ps variables of nodes 5 and 5' (2nd type) have been forced to be the same.

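Algorithm 2 $\mathit{Union}(i, j, c)$
    if $\mathit{rep}[i] = \mathit{rep}[j]$ then
        if $\mathit{offset}[i] + \mathit{offset}[j] \neq c$ then
            Report inconsistency
        end if
    else
        $\mathit{offset}[\mathit{rep}[j]] \leftarrow \mathit{offset}[j] + \mathit{offset}[i] + c$
        $\mathit{rep}[\mathit{rep}[j]] \leftarrow \mathit{rep}[i]$
    end if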
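A minimal Python sketch of this union step with the consistency check follows. It is an illustration rather than the actual implementation: the class name `ParityDSU` and its methods are hypothetical, and, unlike the simplified pseudocode, nodes here point to a parent in a forest and `find` performs path compression while accumulating offsets modulo 2.

```python
class ParityDSU:
    """Disjoint sets where each node stores its parity (offset) to its parent."""

    def __init__(self, nodes):
        self.rep = {v: v for v in nodes}
        self.offset = {v: 0 for v in nodes}

    def find(self, v):
        """Return (root, parity of v relative to root), compressing the path."""
        if self.rep[v] == v:
            return v, 0
        root, parent_parity = self.find(self.rep[v])
        self.rep[v] = root
        self.offset[v] ^= parent_parity  # parity from v to root, mod 2
        return root, self.offset[v]

    def union(self, i, j, c):
        """Encode the constraint 'parity between i and j equals c'.

        Returns False if it contradicts earlier constraints, True otherwise
        (including the case where the new constraint is redundant).
        """
        ri, pi = self.find(i)
        rj, pj = self.find(j)
        if ri == rj:
            return (pi ^ pj) == c
        self.rep[rj] = ri
        self.offset[rj] = pi ^ pj ^ c
        return True


d = ParityDSU(["a", "b", "c"])
r1 = d.union("a", "b", 1)
r2 = d.union("b", "c", 1)
r3 = d.union("a", "c", 0)  # implied by the first two constraints: redundant
r4 = d.union("a", "c", 1)  # contradicts them: inconsistent
```

Storing the offset as an XOR along parent pointers is what keeps the extra bookkeeping at constant cost per operation.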

After all path constraints have been processed, the nodes will form several independent sets. A general solution of h variables can be easily decoded from $D$. More specifically, for each set representative $v$ of $D$, we define a free binary variable $\alpha_v$ (notice that $\alpha_v$ is not the same as the ps variables). A general solution of h variables can be represented by a linear system of $\alpha$ variables (which are all free) in the form of

\[ h^{xy} = \alpha_{\mathit{rep}[x]} + \mathit{offset}[x] + \alpha_{\mathit{rep}[y]} + \mathit{offset}[y], \tag{4.1} \]

where $\alpha_{\mathit{rep}[x]}$ and $\alpha_{\mathit{rep}[y]}$ are free variables, and $\mathit{offset}[x] + \mathit{offset}[y]$ is a constant. The complete solution of all h variables (the inheritance vector) can be written in matrix form,

\[ h = A\alpha + b. \tag{4.2} \]

Suppose the number of h variables is $n_h$ and the number of independent sets in $D$ after adding all constraints is $n_D$. Then $A$ is an $n_h \times n_D$ matrix in which each row either has exactly two "1"s or is all "0"s (in Eq. 4.1, the $\alpha$ variables cancel out if $x$ and $y$ are in the same set). Due to this special structure, $A$ can be seen as the incidence matrix of a connected graph with $n_h$ edges and $n_D$ nodes. As a common result in graph theory, such a matrix has a rank of $n_D - 1$.

Lemma 4.1.1. The rank of $A$ is $n_D - 1$.
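As a small hypothetical instance of Eq. 4.2 (the sets and offsets below are made up for illustration), suppose $D$ ends with $n_D = 3$ sets with free variables $\alpha_1, \alpha_2, \alpha_3$, and there are three h variables whose end nodes fall in sets (1, 2), (2, 3) and (1, 1), with offset sums 1, 0 and 1, respectively. Over $\mathbb{Z}_2$:

\[
\begin{pmatrix} h^{(1)} \\ h^{(2)} \\ h^{(3)} \end{pmatrix}
=
\begin{pmatrix} 1 & 1 & 0 \\ 0 & 1 & 1 \\ 0 & 0 & 0 \end{pmatrix}
\begin{pmatrix} \alpha_1 \\ \alpha_2 \\ \alpha_3 \end{pmatrix}
+
\begin{pmatrix} 1 \\ 0 \\ 1 \end{pmatrix}.
\]

The three columns of $A$ sum to the zero vector modulo 2, so the rank is at most 2, and the first two rows are linearly independent, so the rank is exactly $2 = n_D - 1$. The third h variable is fully determined ($h^{(3)} = 1$) because both of its end nodes lie in the same set.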

We can prove that the described solution space holds all consistent configurations of inheritance variables.

Lemma 4.1.2. The general solution as provided in Eq. 4.2 satisfies all path constraints and there are no other h variable assignments that satisfy all path constraints.

Proof. We can verify that such a solution satisfies all path constraints. For each path constraint $\sum_{e^{xy} \in P_{\widetilde{i,j}}} h^{xy} = c$, we plug the above solution of h variables into its left-hand side:

\[ \sum_{e^{xy} \in P_{\widetilde{i,j}}} \left( \alpha_{\mathit{rep}[x]} + \mathit{offset}[x] + \alpha_{\mathit{rep}[y]} + \mathit{offset}[y] \right) = \alpha_{\mathit{rep}[i]} + \mathit{offset}[i] + \alpha_{\mathit{rep}[j]} + \mathit{offset}[j], \]

with all intermediate nodes canceled out. Notice that every path constraint is encoded in $D$, which means $\mathit{rep}[i] = \mathit{rep}[j]$ (so the $\alpha$ variables are also canceled out). Based on the construction of $D$, the left-hand side $\mathit{offset}[i] + \mathit{offset}[j]$ is the same as the right-hand side $c$, and the constraint is satisfied. We can further argue that there are no other h variable assignments that satisfy all path constraints. This can be shown by examining the relationship between the number of non-redundant path constraints on h variables and the number of degrees of freedom defined by Eq. 4.2. The degrees of freedom and the number of exact solutions of h variables depend on the number of independent sets in $D$. If there are $n_D$ sets in $D$ after adding all constraints, there are $2^{n_D}$ different ways to assign all $\alpha$ variables. But due to symmetry (flipping the values of all $\alpha$ variables yields the same h variable solution in Eq. 4.1), there are only $2^{n_D - 1}$ different h variable solutions. This can also be shown by noticing that the rank of matrix $A$ in Eq. 4.2 is $n_D - 1$. Assume there are $n'$ nodes in the locus graphs after node splitting; then the number of h variables is $n' - 1$ because no cycle exists any more. The number of non-redundant constraints encoded in $D$ is $n' - n_D$ because the number of constraints in each set $S \in D$ is $|S| - 1$. Therefore, the number of degrees of freedom in the h variables is $(n' - 1) - (n' - n_D) = n_D - 1$, and our general solution has captured all of this freedom.
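The counting argument can be checked on a toy instance by enumerating Eq. 4.1 directly. The sketch below is illustrative only: the function `enumerate_h`, the set labels "A", "B", "C" and the three h variables are hypothetical. One free variable is pinned to 0 to remove the global flip symmetry, which yields $2^{n_D - 1}$ assignments when the sets are linked by the h variables as assumed in the text.

```python
from itertools import product


def enumerate_h(h_vars, reps):
    """Enumerate h solutions of Eq. 4.1.

    h_vars -- {name: (rep_x, offset_x, rep_y, offset_y)} for each h variable
    reps   -- iterable of set representatives of D
    """
    reps = sorted(reps)
    first, rest = reps[0], reps[1:]
    solutions = []
    for bits in product((0, 1), repeat=len(rest)):
        # Pin alpha of the first representative to 0 (flip symmetry).
        alpha = {first: 0, **dict(zip(rest, bits))}
        solutions.append({
            name: alpha[rx] ^ ox ^ alpha[ry] ^ oy
            for name, (rx, ox, ry, oy) in h_vars.items()
        })
    return solutions


# Hypothetical example with n_D = 3 sets.
h_vars = {
    "h14": ("A", 0, "B", 1),
    "h35": ("B", 0, "C", 0),
    "h36": ("A", 1, "A", 0),  # both end nodes in the same set: fixed value
}
solutions = enumerate_h(h_vars, ["A", "B", "C"])
```

Here the enumeration produces four distinct assignments, matching $2^{3-1}$, and `h36` takes the same value in all of them.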

Next, let us consider how to compute ps variable solutions from h variable solutions. For each node $v$ in $D_i$, $v$ is connected to its set representative $\mathit{rep}_i[v]$ through a path $P$ on $L_i$. We have

\[ p_i^{v} + p_i^{\mathit{rep}_i[v]} = \sum_{e^{xy} \in P \in L_i} \left( h^{xy} + c_i^{xy} \right) = \sum_{e^{xy} \in P} h^{xy} + \sum_{e^{xy} \in P} c_i^{xy} = \sum_{e^{xy} \in P} h^{xy} + \mathit{offset}_i[v]. \]

By plugging in the solution of h variables in Eq. 4.1, we finally get a general solution for the ZRHC problem,

\[ p_i^{v} = p_i^{\mathit{rep}_i[v]} + \alpha_{\mathit{rep}[\mathit{rep}_i[v]]} + \mathit{offset}[\mathit{rep}_i[v]] + \alpha_{\mathit{rep}[v]} + \mathit{offset}[v] + \mathit{offset}_i[v]. \tag{4.3} \]

If $p_i^{\mathit{rep}_i[v]}$ is not pre-determined, we have one more degree of freedom in Eq. 4.3.

4.2 Analysis of the algorithm on tree pedigrees with complete data

The overall algorithm is summarized in Algorithm 3. We omit the preprocessing steps (such as node splitting and construction of locus graphs) because all those operations can be done in linear time. We also state the main result on the algorithm as a theorem.

Algorithm 3 Process All Constraints
    for $i = 1$ to $m$ do
        for all edges $e^{xy} \in L_i$ do
            $\mathit{Union}_i(x, y, c_i^{xy})$
        end for
        for all splitting nodes $v$ do
            if $\mathit{rep}_i[v] = \mathit{rep}_i[v']$ then
                $\mathit{Union}(v, v', \mathit{offset}_i[v] + \mathit{offset}_i[v'])$
            end if
        end for
    end for

Theorem 4.2.1. For a tree pedigree with complete data, Algorithm 3 correctly outputs a general solution (Eqs. 4.1 and 4.3) and the number of specific solutions (degrees of freedom) for the ZRHC problem if it has a solution, and reports inconsistency otherwise. Its running time is bounded from above by $O(mn\alpha(n))$, where $m$ is the number of loci, $n$ is the number of individuals and $\alpha(\cdot)$ is the inverse Ackermann function [11].

Proof. We first need to show that the proposed algorithm can detect all necessary constraints if the pedigree is a tree pedigree without missing data. The algorithm processes every edge constraint from each locus graph $L_i$ and every constraint resulting from node splitting using the $\mathit{Union}_i$ function, and stores connectivity information using the disjoint-set structures $D_i$. During this procedure, path constraints (including local cycle constraints) are detected and consistency is checked by applying Union() on $D$. All non-redundant path constraints in $L_i$ are detected, since each $D_i$ keeps the connectivity information of all pairs of nodes from each locus graph. For a tree pedigree, all cycle constraints are local cycle constraints. By introducing the splitting nodes, such local cycle constraints have been expressed as path constraints ending in a pair of splitting nodes (e.g., path $P_{\widetilde{5,5'}}$ in Figure 4.3), and have been correctly processed in Algorithm 3. All other cases (constraints involving a splitting node) have been discussed in subsection 4.1.2. Therefore, the proposed algorithm can detect all necessary constraints for a tree pedigree, and any particular h variable solution obtained from Eq. 4.1 is compatible with the genotype data.

In terms of time complexity, the outer for-loop in Algorithm 3 is over each locus $i$. For each locus $i$, the total number of union operations on $D_i$ (function $\mathit{Union}_i$) is bounded by the sum of the number of edges and the number of splitting nodes in locus graph $L_i$, which is $O(n)$ even after considering node splitting. There is at most one union operation on $D$ (function Union) for each call of $\mathit{Union}_i$; therefore, the total number of union operations on $D$ is bounded by the total number of union operations on $D_i$, which is $O(n)$. The number of elements in $D_i$ and $D$ is the same and is also bounded by $O(n)$. Both $\mathit{Union}_i$() and Union() are essentially conventional union-find procedures on disjoint-set structures. Maintaining the offset value of each node takes only constant extra time per operation, so it does not change the time complexity. Despite the simplified presentation in Algorithms 1 and 2, we implement the union-find procedure on a forest structure using Tarjan's algorithm [11]. The worst-case time complexity of $O(n)$ disjoint-set operations on $O(n)$ elements is $O(n\alpha(n))$ [38], where $\alpha(\cdot)$ is the inverse Ackermann function. Therefore, the total running time of the algorithm to output a general solution is $O(mn \cdot \alpha(n))$, where $m$ is the number of loci and $n$ is the number of individuals of the pedigree.
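To give a rough sense of scale for this bound (a back-of-the-envelope illustration that ignores the constant factors hidden by the $O(\cdot)$ notation), consider the largest setting used in section 4.4, with $n = 128$ individuals and $m = 200$ loci. The inverse Ackermann function satisfies $\alpha(n) \le 4$ for any input size that can arise in practice, so

\[ m \cdot n \cdot \alpha(n) \le 200 \times 128 \times 4 = 102{,}400 , \]

that is, on the order of $10^5$ elementary disjoint-set steps for the whole pedigree.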

4.3 Extension to General Cases

4.3.1 Pedigrees with mating loops

We can further extend the above algorithm to pedigrees with mating loops and pedigrees with missing data. For a looped pedigree, we apply a similar splitting rule to locus graphs as we did for a tree pedigree, except that for a mating between two relatives all their children are duplicated in order to break a global cycle. We use the same method described in subsections 4.1.2 and 4.1.3 to detect all path constraints on each locus graph. However, Theorem 4.2.1 no longer holds in this case because the method does not guarantee the detection of all necessary constraints. The difference lies in the detection of path constraints broken by splitting nodes. All such path constraints can be recovered when breaking a local cycle but may not be recovered when breaking a global cycle. Figure 4.4 gives such an example on a looped pedigree. In the locus graph $L_i$ in Figure 4.4(b), we have a path constraint on path $P_{\widetilde{6,6'}}$, which is originally a cycle constraint before the splitting of node 6. This type of constraint may still be captured with some extra effort.

However, in another locus graph $L_j$ in Figure 4.4(c), there is a path constraint $h^{46} + h^{56} = 0$ in the original locus graph. But this constraint is not on a consecutive path in $L_j$ after node splitting. Thus it cannot be encoded in the disjoint-set structure $D$. Although the set of constraints is not sufficient, we can still obtain all the solutions for a looped pedigree using the following procedure. If inconsistent constraints are already found during the consistency check, no solutions with zero recombinants exist. Otherwise, all the h variable solutions obtained from the general solution (Eq. 4.1) are examined. If a specific h variable assignment is not consistent with the genotypes, we simply drop that assignment. Otherwise, it results in real haplotype solutions. To check the consistency of an h variable assignment with existing genotypes, we use another disjoint-set structure to encode constraints on alleles. This step is the same for pedigrees with loops and pedigrees with missing data, and will be discussed in subsection 4.3.2. Essentially, for looped pedigrees, we avoid cycle constraints by splitting nodes at the expense of possibly missing some constraints. We start to enumerate h variables after processing the existing partial constraints. However, as will be shown in the experiments, the number of all possible h variable assignments from this set of partial constraints is usually very small for a pedigree with complete data, and in most cases there is only one solution for pedigrees with 20 or more loci. Therefore, the above extension can efficiently handle looped pedigrees in practice.

Figure 4.4: Looped pedigree. An example of constraints on a looped pedigree. (a) A pedigree with a mating loop, where node 6 is produced by the mating of two relatives 4 and 5. (b) One locus graph $L_i$, where there is a path constraint $\sum_{e^{xy} \in P_{\widetilde{6,6'}}} h^{xy} = h^{24} + h^{25} + h^{46} + h^{56} = 0$. (c) Another locus graph $L_j$, where there is a constraint $h^{46} + h^{56} = 0$. Due to the splitting at node 6, this constraint is not on a consecutive path.

4.3.2 Pedigrees with missing data

For an algorithm to be practically useful, it has to be applicable to real data. Most real data contain missing genotypes. One advantage of the proposed algorithm is that it can be easily extended to deal with missing data. Extending the constraint-finding framework [9, 32, 39] to handle missing data is not trivial at all. As mentioned earlier, the ZRHC problem with missing data is NP-hard [31] in general. Therefore, it is unlikely that a linear system will exist to incorporate all uncertainties. We take an approach similar to that in subsection 4.3.1 to deal with missing data, by making use of existing constraints and verifying every compatible inheritance vector. Partial constraints on h variables are collected based on existing genotype data. Solutions of h variables are obtained based on the set of partial constraints and are checked for consistency with existing genotype data. More specifically, for a pedigree with missing data, we construct the locus graph $L_i$ for each locus $i$ as usual, with node splitting if necessary. The edges in $L_i$ are constructed only from parent-child pairs whose genotypes are complete at locus $i$. We apply Algorithm 3 to process all edge constraints from such locus graphs, and from the partial constraints on h variables we get a solution in its general form (Eq. 4.2). The number of degrees of freedom, which is $n_D - 1$ where $n_D$ is the number of independent sets, is usually significantly smaller than that of the original h variables without constraints, which is usually close to $2n$. Therefore, our algorithm has the potential to be significantly faster than algorithms based on the enumeration of all possible h variables (such as Merlin [2]).

For each specific h variable assignment, compatibility with the input genotype data is checked by utilizing another disjoint-set structure on allele variables. Let $f_i^x$ ($m_i^x$) denote the paternal (maternal) allele of individual $x$ at locus $i$, which takes the integer value 0 or 1 for the smaller or bigger allele, respectively. For a fixed assignment of h variables, the relationship of alleles between a parent and a child is specified by the definition of h variables. This relationship is also expressed as a linear system on $\mathbb{Z}_2$. For example, for a father-child pair $x$ and $y$, we have the constraint $f_i^y + f_i^x = 0$ if $h^{xy} = 0$, and $f_i^y + m_i^x = 0$ if $h^{xy} = 1$ by Definition 2. Similar constraints can be obtained for each mother-child pair. In addition, constraints between the two allele variables at each locus for an individual exist when the genotype data is available. More precisely, if an individual $x$ is homozygous or pre-determined at locus $i$, then both $f_i^x$ and $m_i^x$ are fixed. Otherwise we have the constraint $f_i^x + m_i^x = 1$. All these constraints involve only two variables, so we can encode this linear system in a disjoint-set structure and apply the same set manipulation procedure as we did for the integration of constraints on inheritance variables. By doing so, we can efficiently check the consistency between a given h variable assignment and the input genotype data, and generate a set of assignments of alleles that are consistent with the h variable assignment.

The total number of h variable assignments is $2^{n_D - 1}$, and for each assignment, the complexity of the genotype consistency check is $O(mn \cdot \alpha(n))$.
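A minimal sketch of such an allele consistency check at a single locus is given below. Several details are assumptions made for illustration only: the function name `alleles_consistent`, the input dictionaries, the use of a special `ZERO` node to represent the constant 0 (so that a fixed allele becomes a two-variable constraint), and the mother-child rule, which is written by analogy with the father-child rule stated above. Only homozygous genotypes are treated as fixed; pre-determined phases are not modeled.

```python
ZERO = ("const", 0)  # hypothetical ground node standing for the value 0


def alleles_consistent(genotypes, father_of, mother_of, h):
    """Check one h assignment against genotypes at a single locus.

    genotypes -- {individual: (a, b)} with alleles in {0, 1}, or None if missing
    father_of -- {child: father}; mother_of -- {child: mother}
    h         -- {(parent, child): 0 or 1}
    """
    rep, off = {}, {}

    def find(v):
        rep.setdefault(v, v)
        off.setdefault(v, 0)
        if rep[v] == v:
            return v, 0
        root, p = find(rep[v])
        rep[v] = root
        off[v] ^= p
        return root, off[v]

    def union(u, v, c):  # encode u + v = c (mod 2); False on contradiction
        ru, pu = find(u)
        rv, pv = find(v)
        if ru == rv:
            return (pu ^ pv) == c
        rep[rv] = ru
        off[rv] = pu ^ pv ^ c
        return True

    constraints = []
    for x, g in genotypes.items():
        if g is None:
            continue  # missing genotype: no constraint
        a, b = g
        if a == b:  # homozygous: both alleles fixed
            constraints += [(("f", x), ZERO, a), (("m", x), ZERO, a)]
        else:       # heterozygous: f + m = 1
            constraints.append((("f", x), ("m", x), 1))
    for parent_of, child_var in ((father_of, "f"), (mother_of, "m")):
        for y, x in parent_of.items():
            src = ("f", x) if h[(x, y)] == 0 else ("m", x)
            constraints.append(((child_var, y), src, 0))
    return all(union(u, v, c) for u, v, c in constraints)


# Hypothetical trio: father 1, mother 2, child 3.
father_of, mother_of = {3: 1}, {3: 2}
h = {(1, 3): 0, (2, 3): 1}
ok = alleles_consistent({1: (0, 0), 2: (1, 1), 3: (0, 1)}, father_of, mother_of, h)
bad = alleles_consistent({1: (0, 0), 2: (1, 1), 3: (1, 1)}, father_of, mother_of, h)
missing = alleles_consistent({1: (0, 0), 2: (1, 1), 3: None}, father_of, mother_of, h)
```

In the second call the child is homozygous for allele 1 while the father can only transmit allele 0, so the check fails.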

4.4 Experimental Results

We study the performance of our program (denoted as DSS) under different settings (pedigree size, number of loci, missing rate, pattern of missing) and compare its performance with two representative programs, Merlin [2] and PedPhase (the integer linear programming (ILP) algorithm in Ref. [22]). Merlin is one of the most widely used statistical packages for linkage analysis and we only use its haplotyping functionality in this comparison. Merlin also uses the zero-recombinant assumption, but it examines all possible configurations of inheritance variables and only outputs the compatible ones. PedPhase.ILP is another widely used rule-based algorithm developed by our own group. It can produce all optimal haplotype solutions with minimal recombinants for any pedigree structure with missing data, and it can solve the zero-recombinant problem as a special case. But because it does not use the zero-recombinant assumption explicitly, its efficiency is expected to be inferior to that of the current algorithm. Under the zero-recombinant assumption, all three methods are exact algorithms that output all compatible solutions. Our experiments show that their implementations indeed generate the same set of haplotype assignments on the same inputs. This again shows that the ZRHC formulation is valid for tightly linked markers, and that its set of solutions is the same as the set of solutions obtained by likelihood approaches. Therefore, we only present results on the efficiency comparison.

We test all three approaches on pedigrees of different sizes (17, 29, 52, 128). All are real human pedigree structures obtained from the literature. Different numbers of loci (20, 50, 100, 200), different missing rates (0.05, 0.10, 0.15, 0.20) and different missing patterns are considered. We run Merlin and DSS on a Linux machine with two 3.0GHz Quad-Core Xeon 5365 processors and 16G memory. PedPhase.ILP only has a Windows version, and it was tested on a much slower Windows machine with much smaller memory (Pentium 4 3.2GHz with 2G memory). We measure the time needed for each of the algorithms to output all possible haplotyping solutions of a pedigree. Due to hardware limitations, the result of PedPhase.ILP on pedigree size 128 was not acquired. To generate genotype data that closely resemble real data, we use the Simulated Rheumatoid Arthritis (RA) Data from Genetic Analysis Workshop (GAW) 15. Chromosome 6 of the GAW data mimics a 300K SNP chip with an average inter-marker spacing of 9,586 bp. The first 20, 50, 100 and 200 loci are extracted to test the three algorithms. Population haplotype frequencies are calculated based on the true haplotype assignments in the simulated data, and are then fed to SimPed [21], together with each pedigree structure. SimPed then samples founder haplotypes based on their population frequencies and generates genotype data for each member in a pedigree assuming no recombination. The pedigree structures are shown in Figure 4.5, among which the pedigree with size 17 (Figure 4.5(a)) is a looped one.

Figure 4.5: Pedigree structures used in simulation. Panels (a)-(d).

Figure 4.6: Comparison of DSS, Merlin and PedPhase.ILP. (a) Comparison of running time (in seconds, logarithmic axis) of DSS, Merlin and ILP on pedigrees of size 17, 29 and 52, for missing rates from 0 to 0.2 at 20, 50, 100 and 200 loci. (b) Average number of solutions for pedigree sizes 17, 29 and 52 under the same settings.

We use two ways to generate samples with missing data so as to examine the behavior of the methods with respect to variations in both missing rate and missing pattern. First, we generate a set of samples by randomly setting loci to be missing at a specified missing rate. Second, we make all individuals in the top generation of a pedigree completely missing at all loci, which is common in real data. For each testing category, we simulate 100 independent data sets and report the average running time.

For the random missing case, Figure 4.6(a) shows the running time of the three programs under different settings, except for the pedigree of size 128, for which the running time of Merlin is too large to be juxtaposed with that of DSS. The results on the pedigree of size 128 are listed in Table 4.1. The running time of Merlin increases exponentially with the pedigree size, the number of loci and the missing rate. The running time of PedPhase.ILP (on a slower machine) also grows exponentially with the missing rate and the number of loci, but with a smaller constant than Merlin. It also shows a much smaller growth rate with the pedigree size. In contrast, DSS scales smoothly with all parameters (except for the missing rate when the number of loci is 20), and the improvement over Merlin or PedPhase.ILP ranges from 10-fold to $10^5$-fold for large pedigrees with a large number of loci or a high missing rate. In fact, neither Merlin nor PedPhase.ILP can successfully infer haplotypes from the pedigree of size 128 when the number of markers is 200. However, DSS can obtain all solutions in about 0.05 second, even for data with 20% missing. This shows that by solving the linear system based on partial constraints from existing data, we significantly reduce the enumeration space of inheritance variables. The experimental results show that when the number of loci is large, the program can still maintain the same linear complexity even for data with 20% missing. But for a small number of loci, the running time of DSS increases as the missing rate increases (though DSS can finish all the cases within 0.1 second). This is because the number of constraints on h variables is roughly proportional to the number of loci. So for a small number of loci, the remaining degrees of freedom of the inheritance variables after solving the linear system could still be high. This number is partly reflected by the number of all compatible solutions in the end. Figure 4.6(b) compares the number of h variable solutions in different circumstances. It grows with both the pedigree size and the missing rate, but decreases with the number of loci.

Table 4.1: Comparison of running time (in seconds) between DSS and Merlin on pedigree size 128. The running time of Merlin under some data settings exceeds an hour and is thus omitted from our measurement (shown as -).

number of loci   method             missing rate
                                    0.00     0.05     0.10     0.15     0.20
20               DSS                0.0267   0.1539   0.3517   0.4991   0.6540
                 Merlin             70       300      600      800      1100
                 Block-extension    0.05     0.05     0.05     0.05     0.05
50               DSS                0.0259   0.0361   0.0368   0.0378   0.0360
                 Merlin             360      800      1000     >1300    -
                 Block-extension    0.09     0.1      0.1      0.1      0.1
100              DSS                0.0311   0.0426   0.0373   0.0431   0.0461
                 Merlin             800      1200     >2400    -        -
                 Block-extension    0.2      0.2      0.19     0.2      0.2
200              DSS                0.0433   0.0587   0.0518   0.0575   0.0503
                 Merlin             -        -        -        -        -
                 Block-extension    0.53     0.63     0.67     0.64     0.63

Because PedPhase.ILP takes a very long time on pedigree size 128, we run the Block-extension algorithm, also from the PedPhase package, as a substitute in this category for reference purposes. Unlike DSS, Merlin and PedPhase.ILP, Block-extension is a heuristic algorithm that employs a greedy strategy to obtain a particular solution with minimum recombination. Since it is a heuristic, it does not explore every possible configuration and may not reach optimality in all circumstances. As shown in Table 4.1, Block-extension runs fast on this large pedigree and scales well with both the number of loci and the missing rate. This high efficiency makes such heuristic approaches useful in applications where completeness of the solution space or optimality of the solution is not required.

Next, we investigate the performance of all three algorithms on special missing patterns. Figure 4.7 gives some representative results on the pedigree of size 52, for which all individuals in the top generation (members 4, 6, 8, 9) are missing. For this pedigree, this pattern corresponds to a missing rate of ∼7.7%. In terms of absolute time, DSS (0.2 ∼ 0.8 sec) is much better than the other two algorithms (0.2 ∼ 100 sec). However, its running time is higher than its own running time with a random missing rate of 10%. The running time of Merlin and PedPhase.ILP on this special data set lies between their running times at missing rates of 5% and 10%. DSS is somewhat sensitive to this special missing pattern because when all genotypes of an individual are missing, none of the inheritance variables between that individual and its parents or children can be determined. A further investigation of this special missing pattern is warranted.

[Figure 4.7 plot: values on a logarithmic axis from 0.01 to 100, grouped by 20, 50, 100 and 200 loci; series: DSS, Merlin and ILP, each under top-generation missing (7.7%), random 5% missing and random 10% missing.]

Figure 4.7: Comparison of DSS and Merlin on different patterns of missing data.

Chapter 5

Haplotype Inference on a Genome-wide Level

Data from current gene-disease association studies motivate changes to existing haplotype inference methodologies. Many datasets now comprise both pedigree and population data, so it is desirable to incorporate both sources of information when inferring haplotypes. The availability of high-density SNP data also makes it possible to determine and use the precise locations of recombination events. Our proposed method reconstructs haplotype structure on a genome-wide level by jointly using the information from the Mendelian law of inheritance and local population structure. The method combines in one framework new techniques for recombination event detection, maximum likelihood optimization of population haplotype diversity, and our previous algorithm for zero-recombinant haplotype reconstruction. Experiments on both real and simulated datasets demonstrate the efficiency and accuracy of our approach in reconstructing the haplotype structure. Our method makes it possible to reveal haplotypic variation on a genome-wide level.

The overall flow of the method MML (Mendelian Constrained Maximum Likelihood) is staged in three steps as illustrated in Algorithm 4.

Algorithm 4 MML
(1) Infer recombination positions for each family and each chromosome. Partition the chromosomes according to recombination positions.
(2) On each pedigree, for each of the zero-recombinant segments, apply DSS [28] (our previously developed algorithm to handle Mendelian constraints) to establish the solution space under Mendelian and zero-recombinant constraints.
(3) Search the solution space (obtained in (2)) for the optimal solution with maximum likelihood based on population haplotype frequency.

In step 1, we infer recombination positions in each nuclear family of the pedigree by analyzing the identical by descent (IBD) status of alleles between each sibling pair. Based on the inferred recombination positions, we partition chromosomes into segments such that every segment is recombinant-free. In step 2, we derive all possible configurations of a pedigree under Mendelian and zero-recombinant constraints for each recombinant-free segment obtained in step 1. This is done by using our previous algorithm DSS [28]. DSS can output a compact description of all compatible solutions as a linear space. In step 3, we use haplotype frequencies in the population to identify the optimal haplotype configuration of each pedigree. We describe step 1, step 2 and step 3 in Sec. 5.1, Sec. 5.2 and Sec. 5.3 respectively.

5.1 Detect Recombination Events in Families with Dense Markers

Recombination events are implied if a common inheritance vector that satisfies the Mendelian constraints cannot be found for a segment of loci. Typically, there is uncertainty as to how many recombination events occur and at which loci or in which individuals these events occur. Usually, parsimony criteria such as the minimum number of recombinants [25] are used to find a possible assignment. However, with the availability of densely marked data, we can almost always fix the inheritance vector within each zero-recombinant region due to the enrichment of Mendelian constraints. Consequently, we can also develop special techniques to localize the recombination positions with minimal ambiguity.

For each nuclear family, we look at the IBD status of the alleles of its sibling pairs to detect a recombination position. A change of IBD status from one locus to another indicates a change in the inheritance pattern, that is, a recombinant. Loci of a father (similarly for a mother) can be divided into three categories depending on their informativeness in determining the paternal IBD status of a sibling pair.

1. informative: he is heterozygous, and the phases of both children are determined at this locus.

2. semi-informative: he is heterozygous, and at least one of the children is not phased at this locus.

3. non-informative: he is homozygous at this locus.

In situation (1), since the father is heterozygous, the IBD status and the IBS (identical by state) status of the paternal alleles of a sibling pair are equivalent. In situation (2), the IBD status of the paternal alleles is not determined directly, but it depends on the IBD status of the maternal alleles: if we can resolve the IBD status of the maternal alleles at this locus, we can also infer the IBD status of the paternal alleles. Note that in this situation the mother must be heterozygous, otherwise all children would be phased. In situation (3), the locus provides no information about the paternal IBD status of any of the sibling pairs.
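The three categories can be written down as a small classification routine. The following is only an illustrative sketch, not code from MML: the function name, the tuple encoding of a genotype and the per-child "phased" flags are assumptions introduced here.

```python
from enum import Enum


class Informativeness(Enum):
    INFORMATIVE = "informative"
    SEMI_INFORMATIVE = "semi-informative"
    NON_INFORMATIVE = "non-informative"


def classify_locus(parent_genotype, children_phased):
    """Classify one locus of one parent (hypothetical helper).

    parent_genotype: pair of alleles of the parent, e.g. (1, 2).
    children_phased: one bool per child, True if that child's phase
                     is already determined at this locus.
    """
    first, second = parent_genotype
    if first == second:
        # Homozygous parent: both transmitted alleles look the same.
        return Informativeness.NON_INFORMATIVE
    if all(children_phased):
        # Heterozygous parent and every child phased.
        return Informativeness.INFORMATIVE
    # Heterozygous parent, at least one child not phased.
    return Informativeness.SEMI_INFORMATIVE
```

For example, `classify_locus((1, 2), [True, False])` falls into the semi-informative category, because the parent is heterozygous but one child is not phased.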

[Figure 5.1 image: alleles of Father, Child1, Child2 and Mother along a chromosome segment, with loci marked as informative, semi-informative or non-informative; loci 5, 9, 10, 12, 14, 15 and 20 are labeled.]

Figure 5.1: Recombination detection. A segment of a chromosome from a nuclear family with 2 children. Colored nodes are informative alleles which have determined IBD status between these two siblings, with squares and circles representing paternal and maternal alleles respectively. Remaining alleles are non-informative. At each locus, alleles colored the same are identical by descent (IBD). A frame around a pair of alleles indicates that they are semi-informative; such a pair determines the IBD status between the two siblings once it is placed in the context of the nearby informative loci. From the IBD status between siblings, we can infer a paternal recombination event between the 9th and 12th locus, but it remains ambiguous whether it happens in Child1 or Child2. We suppose the recombinant is in Child2 for illustration purposes.

Informative loci provide closely spaced probes of the IBD status along the whole chromosome. We can detect recombination events by observing changes in the IBD status of alleles among these informative loci. By doing so, we may miss double recombination events that do not manifest a change in the IBD status between two nearby informative loci. If markers are dense, however, the possibility of a double recombination event within a short distance is negligible. Fig. 5.1 shows an example of how to detect recombination positions in a nuclear family. Using informative markers alone, we can infer a paternal recombination event between the 9th and 14th locus.
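Scanning the informative loci for a change of IBD status can be sketched as follows. This is a minimal illustration rather than the MML implementation; the function name and the input encoding (a sorted list of locus index and IBD flag pairs for one parent and one sibling pair) are assumptions, and the example values are hypothetical numbers chosen to mirror the interval mentioned above.

```python
def recombination_intervals(informative_ibd):
    """Return the intervals in which the IBD status flips.

    informative_ibd: list of (locus, is_ibd) for the informative loci
                     only, sorted by locus.
    Each returned pair (left, right) means that a recombination lies
    somewhere between informative loci `left` and `right`.
    """
    intervals = []
    for (prev_locus, prev_ibd), (locus, ibd) in zip(informative_ibd,
                                                    informative_ibd[1:]):
        if prev_ibd != ibd:
            intervals.append((prev_locus, locus))
    return intervals


# Hypothetical IBD calls at four informative loci.
example = [(5, True), (9, True), (14, False), (15, False)]
print(recombination_intervals(example))
```

An even number of recombinations between two adjacent informative loci leaves the flags unchanged and is therefore invisible to this scan, which is the double-recombination caveat noted above.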

Semi-informative loci can help further localize the recombination position because it is almost impossible for a paternal and a maternal recombination event to occur coincidentally within a short region. If a semi-informative locus falls between two informative loci indicating different paternal IBD status, we can assume no recombination on the maternal side and let its maternal IBD status follow that of its surrounding informative loci. Given the assumed maternal IBD status, we can then infer the paternal IBD status for this semi-informative locus. For example, in Fig. 5.1, at the 12th locus, by assuming that Child1 and Child2 are not IBD for their maternal alleles, we infer that the sibling pair are also not IBD for their paternal alleles, so that we can refine the recombination position to lie between the 9th and 12th locus.
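The refinement step can be sketched in two small functions. This is a hedged illustration under explicit assumptions that are not stated in the text: markers are biallelic, both parents are heterozygous at the semi-informative locus, and at least one of the two children is heterozygous there. Under those assumptions, two children with the same genotype share either both parental alleles or neither, while children with different genotypes share exactly one. All names are hypothetical.

```python
def paternal_ibd_from_maternal(child1_genotype, child2_genotype, maternal_ibd):
    """Infer paternal IBD at a semi-informative locus (sketch).

    Assumes a biallelic marker, both parents heterozygous, and at
    least one heterozygous child. Genotypes are unordered allele pairs.
    """
    same_genotype = sorted(child1_genotype) == sorted(child2_genotype)
    # Same genotype: paternal and maternal IBD agree; otherwise they differ.
    return maternal_ibd if same_genotype else not maternal_ibd


def refine_interval(interval, left_ibd, semi_informative):
    """Narrow a recombination interval using semi-informative loci.

    interval: (left, right) informative loci with different paternal IBD.
    left_ibd: paternal IBD status at the left informative locus.
    semi_informative: list of (locus, inferred_paternal_ibd) pairs.
    Assumes a single recombination inside the interval.
    """
    left, right = interval
    for locus, ibd in sorted(semi_informative):
        if not left < locus < right:
            continue
        if ibd == left_ibd:
            left = locus      # still before the recombination
        else:
            right = locus     # first locus after the recombination
            break
    return left, right
```

With two heterozygous children and an assumed maternal status of "not IBD", `paternal_ibd_from_maternal((1, 2), (1, 2), False)` returns `False`, and `refine_interval((9, 14), True, [(12, False)])` narrows the interval to `(9, 12)`, matching the example in the text.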

Since the ambiguous intervals of recombination events now contain only non-informative markers, which are compatible with any inheritance pattern, we may pick any position within such an interval to partition a chromosome into recombinant-free segments. Notice that for non-informative loci, the phases of all family members are actually fixed, and thus the choice of a recombination position will not influence the final haplotype configuration.

To determine the individual in which the recombination event actually occurs, we can look at the IBD status of all sibling pairs. If we observe that the IBD status changes between a specific child i and each of the other children, while there is no change among these other children themselves, then child i carries the recombinant. However, if the nuclear family has only two children, the ambiguity cannot be resolved in this way. Notice that assigning the recombination to a different child will result in a different haplotype configuration in the parent. Therefore, in this situation, we can use population haplotype frequency to suggest the most probable assignment.
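The rule for locating the recombinant child can be sketched as a set comparison. This is an illustrative sketch only; the function name and the representation of "pairs whose IBD status changes across the interval" as a collection of child-id pairs are assumptions made here.

```python
def recombinant_child(children, changed_pairs):
    """Return the child carrying the recombinant, or None if ambiguous.

    children: list of child identifiers in one nuclear family.
    changed_pairs: pairs (i, j) of children whose IBD status changes
                   across the recombination interval.
    """
    if len(children) < 3:
        # With two children the single pair cannot tell them apart.
        return None
    changed = {frozenset(pair) for pair in changed_pairs}
    for child in children:
        # Child i is the recombinant exactly when the changed pairs are
        # all pairs that contain i and no others.
        expected = {frozenset((child, other))
                    for other in children if other != child}
        if changed == expected:
            return child
    return None
```

When the function returns `None`, the text above suggests falling back on population haplotype frequency to choose the assignment; that fallback is not part of this sketch.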

5.2 Solution Space under Mendelian Constraints

If we assume no recombination within a certain number of loci, the h variable between a parent-child pair, which indicates the inheritance patterns, should be the same for each of these loci. In this case, we can put constraints on h variables from different loci together to form a single linear system. As shown in previous sections, DSS can obtain a general solution to such a system. Here, a general solution means a description of all solutions as a linear span of variables.
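One way to see how such a system can be maintained incrementally is a disjoint-set structure that stores, for every variable, its parity relative to the representative of its set. The sketch below is not the DSS implementation; it assumes that each constraint has the form h_a XOR h_b = c over binary h variables, and all names are hypothetical. The number of remaining sets equals the number of free variables in the general solution.

```python
class ParityDisjointSet:
    """Disjoint sets over binary variables with XOR offsets (sketch)."""

    def __init__(self, size):
        self.parent = list(range(size))
        self.parity = [0] * size  # parity[x] = h_x XOR h_parent[x]

    def find(self, x):
        """Return (root, h_x XOR h_root), compressing the path."""
        path = []
        while self.parent[x] != x:
            path.append(x)
            x = self.parent[x]
        root, offset = x, 0
        for node in reversed(path):       # from nearest the root outwards
            offset ^= self.parity[node]
            self.parent[node] = root
            self.parity[node] = offset
        return root, (self.parity[path[0]] if path else 0)

    def add_constraint(self, a, b, c):
        """Impose h_a XOR h_b = c; return False on a contradiction."""
        root_a, par_a = self.find(a)
        root_b, par_b = self.find(b)
        if root_a == root_b:
            return (par_a ^ par_b) == c
        self.parent[root_a] = root_b
        self.parity[root_a] = par_a ^ par_b ^ c
        return True

    def degrees_of_freedom(self):
        """Number of independent sets, i.e. free binary choices."""
        return sum(1 for x, p in enumerate(self.parent) if x == p)
```

Each constraint collected from any locus of a zero-recombinant segment can be fed to `add_constraint`; a `False` return signals that no common inheritance vector exists for the segment.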

The establishment of a general solution is important because it facilitates the search in the solution space for particular solutions that satisfy specific properties. The freedom in the solution space can be partitioned into two parts: the freedom of the inheritance vector (all h variables) and the freedom of the allele assignment (all p variables) under a fixed inheritance vector. Experiments [28] have shown that the inheritance vector is usually fixed for a segment of 100 or more loci. Once the inheritance vector is determined, the relationship between alleles of different individuals is determined with only 1 degree of freedom (if all members of the pedigree are heterozygous) or no degrees of freedom (if one or more members are homozygous). Fig. 5.2(a) shows an example of the first situation. In the case of a missing genotype, there might be an increase in degrees of freedom (Fig. 5.2(b)). By applying the Mendelian and zero-recombinant constraints, we can greatly reduce the search space for finding the maximum likelihood solution using local population structure, which will be discussed in Sec. 5.3.

[Figure 5.2 image: two copies of a pedigree with individuals 1 to 6, panels (a) and (b), with each allele labeled by a variable (a, b) or its complement; in panel (b) the genotype of individual 4 is shown as "??".]

Figure 5.2: Allele constraints. (a) A pedigree of 6 individuals. All individuals are heterozygous with genotype “12” at a certain locus. For this fixed inheritance vector, the relationship between alleles of different individuals is determined. a is a variable denoting the status of an allele and ā is its complementary status. (b) Same pedigree and inheritance vector as (a), but at a different locus with genotype missing at individual 4. In this case, there are two degrees of freedom represented by variables a and b.

5.3 Maximum Likelihood Solution Based on Population Haplotype Frequency

The inheritance vector (h variables between each parent-child pair) specifies how founder haplotypes are transmitted to every descendant of a pedigree, and the configuration of a pedigree is fully determined by the inheritance vector and the founder haplotypes. If the inheritance vector is fixed, the likelihood of a configuration of a pedigree is simply the product of founder haplotype probabilities.
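As a small worked illustration of this product, the sketch below evaluates the likelihood in log space so that many small factors do not underflow. The function name, the string encoding of haplotypes and the frequency values are hypothetical choices made for this example.

```python
import math


def configuration_log_likelihood(founder_haplotypes, haplotype_freq):
    """Log-likelihood of a configuration under a fixed inheritance vector.

    founder_haplotypes: the 2n founder haplotypes of the configuration.
    haplotype_freq: mapping from haplotype to population frequency.
    """
    total = 0.0
    for haplotype in founder_haplotypes:
        freq = haplotype_freq.get(haplotype, 0.0)
        if freq <= 0.0:
            return float("-inf")  # an unseen haplotype rules it out
        total += math.log(freq)
    return total


# Hypothetical frequencies and a pedigree with two founders.
freq = {"12222": 0.2, "21222": 0.3, "22222": 0.1}
founders = ["21222", "12222", "21222", "22222"]
likelihood = math.exp(configuration_log_likelihood(founders, freq))
```

Here `likelihood` equals 0.3 × 0.2 × 0.3 × 0.1 up to floating-point rounding.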

In Sec. 5.2, we describe how Mendelian and zero-recombinant constraints provide a small candidate set of all possible haplotype configurations for each pedigree. Next, we need to pick a solution of maximum likelihood from this candidate set based on haplotype frequencies. Since the actual haplotype frequencies in the population are unknown, we use an EM (Expectation Maximization) procedure to find the optimal solution. The procedure is described in Algorithm 5. The initial pool of haplotype frequencies is generated by randomly sampling from founder haplotypes within the solution space of each pedigree. In step (2), we search the solution space for an optimal configuration with the highest likelihood. In step (3), we update the haplotype frequency pool only with the optimal solution of each pedigree. This differs from conventional EM methods, where all possible solutions are updated into the pool weighted by their current likelihood. By adopting such an approximation, we can significantly speed up the optimization process because we do not traverse the entire solution space.

Algorithm 5 Haplotype Frequency EM
(1) Build the initial pool of haplotype frequencies by randomly sampling from the solution space of each pedigree.
repeat
    (2) Find the optimal solution with maximum likelihood based on the current pool.
    (3) Update the pool with the haplotype frequencies of optimal solutions obtained in (2).
until convergence is achieved
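The loop of Algorithm 5 can be sketched as follows. This is a simplified illustration under several assumptions that are not taken from the text: each pedigree's solution space is given as an explicit list of candidate configurations (tuples of founder haplotypes), the initial pool uses one random candidate per pedigree, a small pseudocount keeps unseen haplotypes from having zero frequency, and convergence means that no pedigree changes its chosen configuration. In MML the candidates are not enumerated but searched with the prefix tree of Sec. 5.3.1.

```python
import math
import random
from collections import Counter


def hard_em(candidate_sets, max_iter=100, pseudo=1e-6, seed=0):
    """Approximate EM that keeps only the best configuration per pedigree."""
    rng = random.Random(seed)
    # (1) initial pool: one randomly sampled configuration per pedigree
    chosen = [rng.choice(candidates) for candidates in candidate_sets]
    for _ in range(max_iter):
        pool = Counter(h for config in chosen for h in config)
        total = sum(pool.values())

        def log_likelihood(config):
            # product of founder haplotype frequencies, in log space
            return sum(math.log((pool[h] + pseudo) / (total + pseudo))
                       for h in config)

        # (2) best configuration of each pedigree under the current pool
        updated = [max(candidates, key=log_likelihood)
                   for candidates in candidate_sets]
        # (3) the pool is rebuilt from `updated` at the next iteration
        if updated == chosen:
            break
        chosen = updated
    return chosen


# Hypothetical candidate sets for three pedigrees with one founder pair each.
candidates = [[("12222", "21222")],
              [("12222", "21222")],
              [("12222", "21222"), ("11222", "22222")]]
result = hard_em(candidates)
```

Because only the single best configuration feeds back into the pool, the procedure can settle in a local optimum that depends on the initial sampling; this is the price of not weighting all solutions as conventional EM does.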

5.3.1 Probabilistic prefix tree for fast branch-and-bound optimization

We create a data structure called “probabilistic prefix tree” to facilitate the search for the optimal configuration in the solution space. A probabilistic prefix tree is essentially a binary search tree which encodes the frequencies of each haplotype and its prefixes. It provides quick indexing for haplotype frequencies and can be updated dynamically using conventional binary search tree techniques. Each leaf node in the tree represents a haplotype and each internal node represents a prefix. The frequency of an internal node can be generated by simply summing up the frequencies of all leaf nodes of its subtree. Fig. 5.3 shows an example of a probabilistic prefix tree.
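A minimal sketch of such a structure is given below. It is one possible realization and not necessarily the one used in MML: it is written as a trie keyed by allele, it stores counts rather than frequencies so that insertion is a simple increment along one path, and the class and method names are assumptions. The example data are the haplotype counts listed in Fig. 5.3.

```python
class PrefixNode:
    def __init__(self):
        self.count = 0      # number of stored haplotypes with this prefix
        self.children = {}  # allele -> PrefixNode


class ProbabilisticPrefixTree:
    def __init__(self):
        self.root = PrefixNode()

    def add(self, haplotype, count=1):
        """Insert a haplotype, updating every prefix on its path."""
        node = self.root
        node.count += count
        for allele in haplotype:
            node = node.children.setdefault(allele, PrefixNode())
            node.count += count

    def freq(self, prefix):
        """Normalized frequency of a haplotype or haplotype prefix."""
        node = self.root
        for allele in prefix:
            node = node.children.get(allele)
            if node is None:
                return 0.0
        return node.count / self.root.count if self.root.count else 0.0


# Haplotypes and counts as listed in Fig. 5.3.
tree = ProbabilisticPrefixTree()
for hap, n in [("11122", 1), ("12222", 2), ("21122", 1), ("21211", 1),
               ("21212", 1), ("21222", 3), ("22222", 1)]:
    tree.add(hap, n)
```

With these ten haplotypes, `tree.freq("2")` is 0.7, `tree.freq("21")` is 0.6 and `tree.freq("21222")` is 0.3, in agreement with summing the leaf frequencies of each subtree.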

As discussed in Sec. 5.2, for a fixed inheritance vector, the relationship between alleles of different family members is fixed at each locus. On a pedigree of n founders and m markers, we do a depth-first search from locus 1 to locus m of a haplotype, where for each locus we pick an assignment for all 2n founder alleles if there is one or more degrees of freedom. Meanwhile, we calculate the likelihood of the pedigree up to the current locus i, which is the product of the frequencies of the founder haplotype prefixes ending at locus i: $\prod_{j=1}^{2n} \mathrm{freq}(h_i^j)$, where $h_i^j$ denotes the prefix of founder haplotype j ending at locus i. Since the frequency of any haplotype prefix is no smaller than that of the entire haplotype, if the likelihood drops below the bound, we backtrack, for there is no possibility of a better solution. Otherwise, we move to the next locus until we reach m. If we achieve a higher likelihood, we replace the bound with the new likelihood and record the current best configuration.
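The search can be sketched as a recursive depth-first routine. This is a simplified illustration with assumptions introduced here: the per-locus freedom is given as an explicit list of allowed joint assignments for all founder alleles, prefix frequencies come from a plain dictionary built from hypothetical counts instead of a prefix tree, and branches whose likelihood is no better than the bound are pruned.

```python
from collections import defaultdict


def make_prefix_freq(counts):
    """Build a prefix-frequency lookup from haplotype counts."""
    total = sum(counts.values())
    table = defaultdict(float)
    for hap, n in counts.items():
        for k in range(len(hap) + 1):
            table[hap[:k]] += n / total
    return lambda prefix: table.get(prefix, 0.0)


def branch_and_bound(options_per_locus, prefix_freq):
    """Return (founder haplotypes, likelihood) of the best assignment.

    options_per_locus[i]: list of tuples; each tuple gives the allele of
    every founder haplotype at locus i under one allowed assignment.
    """
    best = {"likelihood": 0.0, "haplotypes": None}
    last = len(options_per_locus) - 1

    def dfs(locus, prefixes):
        for option in options_per_locus[locus]:
            extended = [p + a for p, a in zip(prefixes, option)]
            likelihood = 1.0
            for prefix in extended:
                likelihood *= prefix_freq(prefix)
            if likelihood <= best["likelihood"]:
                continue  # bound: extensions can only lower the product
            if locus == last:
                best["likelihood"] = likelihood
                best["haplotypes"] = extended
            else:
                dfs(locus + 1, extended)

    dfs(0, [""] * len(options_per_locus[0][0]))
    return best["haplotypes"], best["likelihood"]


# Hypothetical example: one founder pair, heterozygous at loci 1 and 2.
counts = {"11122": 1, "12222": 2, "21122": 1, "21211": 1,
          "21212": 1, "21222": 3, "22222": 1}
het, hom2 = [("1", "2"), ("2", "1")], [("2", "2")]
haps, lik = branch_and_bound([het, het, hom2, hom2, hom2],
                             make_prefix_freq(counts))
```

In this example the phasing into 11222 and 22222 is abandoned at the third locus, because the prefix 112 has frequency zero, and the pair 12222 and 21222 is returned.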

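[Figure 5.3 image. Left: haplotypes and their counts in the population: 11122 (1), 12222 (2), 21122 (1), 21211 (1), 21212 (1), 21222 (3), 22222 (1). Right: the corresponding probabilistic prefix tree, with root frequency 1.0 and some nodes annotated with normalized frequencies.]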

Figure 5.3: Probabilistic prefix tree. On the left are the haplotypes and their counts (in brackets) in the population. On the right is the probabilistic prefix tree after adding all these haplotypes to an empty tree. Some nodes are annotated with the normalized frequencies of the corresponding haplotypes or haplotype prefixes.

5.4 Experimental Results

5.4.1 Detect Recombination Events and Haplotype Diversity

We use MML to analyze haplotype polymorphisms in a real human population. There are 32250 markers spanning a region of 170 million base pairs on chromosome 6, with an average inter-marker distance of 5kb. The missing genotype rate is 0.12% and the typing error rate (as reflected by Mendelian inconsistency) is 0.11%. There are 3 isolated individuals and a total of 193 nuclear families, among which 112 have 2 children and 81 have 1 child.

From the 112 families with 2 children, we infer 322 paternal and 535 maternal recombination events. Fig. 5.4 shows the resolution of the inferred recombination positions. 82% of the recombination events can be localized within an interval less than 100kb, and 53% within an interval less than 30kb.

[Figure 5.4 plot: histogram; x-axis: size of interval (×10^4), from 0 to 10; y-axis: number of recombination events, from 0 to 120.]

Figure 5.4: The distribution of the length of ambiguous intervals of inferred recombination positions.

Fig. 5.6(c) shows the averaged degree of freedom at each locus of a family after applying the Mendelian and zero-recombinant constraints. Based on the actual heterozygosity rate of the current dataset, there are expected to be 1.3347 degrees of freedom on a family of 2 children or 0.9982 degrees of freedom on a family of 1 child at each locus. By exploiting these constraints first, we have eliminated more than 95% of the phasing freedom of a family. A larger family size results in fewer degrees of freedom due to the increased number of constraints.

[Figure 5.5 plots: (a) number of haplotypes: final configuration; (b) number of haplotypes: initial sampling; x-axis: position on chromosome 6 (×10^7), from 0 to 18; y-axis: number of haplotypes.]

Figure 5.5: Haplotype diversity X axis is the location as in base pairs on chromosome 6. Two charts show the number of haplotypes within each segment of 20 markers across chromosome 6, in the final configuration and the initial sampling respectively. Two solid lines are average numbers smoothed over 5 and 50 segments. The dashed line is the number of the most common haplotypes covering 90% of the total frequency.

As shown in Fig. 5.5(a), the haplotype diversity varies across locations of the chromosome. In the initial sampling (Fig. 5.5(b)), the most common 23.41% (8.41%) of haplotypes cover 90% (80%) of the total frequency. This indicates that most of the common haplotypes are recovered and sampled multiple times.

[Figure 5.6 plot: (c) degree of freedom; curves for families of 1 child and families of 2 children; x-axis: position on chromosome 6 (×10^7), from 0 to 18; y-axis: degree of freedom, from 0 to 0.1.]

Figure 5.6: Degree of freedom X axis is the location as in base pairs on chromosome 6. Degree of freedom at each locus under Mendelian and zero-recombinant constraints. The lines are averaged values over all pedigrees of 1 child (upper) and 2 children (lower) respectively. Results are smoothed over 100 and 1000 markers.

5.4.2 Evaluation of Accuracy and Scalability

We used the Cystic Fibrosis Transmembrane-Conductance Regulator (CFTR) Gene Data Set [17] for small scale testing and the Simulated Rheumatoid Arthritis (RA) Data from Genetic Analysis Workshop (GAW) 15 for genome-wide testing of MML. We also compare the performance of our program with the conventional linkage analysis approach, for which we use the statistical tool package Merlin [2]. Merlin can perform haplotype inference on datasets that mix family and population data. It first evaluates the inheritance vector in each family by exhaustive enumeration. Then it uses an EM approach to obtain the maximum likelihood configuration based on the population haplotype frequency. In order to deal with large numbers of markers, Merlin groups the markers by some pre-determined length and generates one single inheritance vector for each segment by assuming no recombination. However, if recombination does exist in a segment, the program will fail. In Sec. 5.4.2.1, we compare MML and Merlin on short marker segments with different pedigree sizes to examine the efficiency gained by explicitly using Mendelian constraints rather than pure enumeration. On a genome-wide level (Sec. 5.4.2.2), our program can still successfully reveal the actual haplotype structure when Merlin is not applicable due to unresolved recombination.

[Figure 5.7 plots: (d) recombination rate: cM/Mb; (e) marker density: 1/Mb; (f) recombination positions, paternal and maternal; x-axis: position on chromosome 6 (×10^7), from 0 to 18, annotated with bands p22.3, p12.3, q12, q13, q14.1, q15, q16.1, q21, q22.31 and q27.]

Figure 5.7: Recombination positions. X axis is the location in base pairs on chromosome 6. The top chart is the recombination rate in terms of centimorgan per million base pairs. The bottom two charts show the marker density and the paternal and maternal recombination positions over the whole chromosome. There are no markers around the centromere region.

5.4.2.1 Influence of pedigree size, missing rate on performance

We simulate pedigrees with no recombination to evaluate the performance of MML on zero-recombinant segments with different data settings. Pedigrees are generated with different sizes (4, 17, 29, 52) and missing rates (0.00, 0.05, 0.10, 0.15, 0.20) by using SimPed [21]. We pick from the CFTR data a subset of 29 distinct haplotypes of 19 markers spanning a region of 1.8Mb on chromosome 7q31. SimPed assigns founder haplotypes by sampling from the given set and transmits them to other family members assuming no recombination. Each population has 500 families of a given parameter setting, and we average our results over 10 replicates of a population. The accuracy and running time comparison between MML and Merlin is shown in Fig. 5.8.

By explicitly exploiting the Mendelian constraints instead of enumerating all possible inheritance vectors, MML achieves much greater time efficiency than Merlin on large pedigrees or at high missing rates (Fig. 5.8(b)). Both methods achieve better accuracy with large pedigrees (Fig. 5.8(a)) due to increased family constraints and population information. MML’s accuracy is similar to Merlin’s even though we use an approximate EM algorithm instead of traversing the whole search space. This demonstrates that the approximation approach is a reasonable trade-off for efficiency. This is further confirmed on a large pedigree size of 52 (table in Fig. 5.8), where the performance of Merlin degrades sharply, with a high error rate (up to 30%) and exponentially longer running time, due to too much freedom in resolving the inheritance vector. On the other hand, MML remains consistent in both accuracy and efficiency.

[Figure 5.8 plots: (a) error rate (proportion of false alleles) and (b) running time (sec) of MERLIN and MML against missing rate (0.00, 0.05, 0.10, 0.15, 0.20), grouped by pedigree size 4, 17, 29 and 52; the plotted values are not recoverable from the extracted text.]

Figure 5.8: Comparison of two methods on a dataset of 500 pedigrees. Performance is examined on 4 different pedigree sizes: 4, 17, 29, 52 and 5 different missing genotype rates: 0.00, 0.05, 0.10, 0.15, 0.20. Error rate is calculated by comparing, allele by allele, the inferred haplotype with the correct haplotype.

5.4.2.2 Genome-wide haplotype inference accuracy

We tested MML and Merlin on chromosome 6 of the RA data, which has 17820 SNPs with an average inter-marker spacing of 9586bp. The RA data consist of 100 replicates, each with 1500 nuclear families (two parents and two offspring). We used 500 out of the 1500 families and averaged our results over 10 replicates. We artificially set up to 20% of the genotypes to missing to assess the robustness of the methods against missing data.

[Figure 5.9 plots: (a) family dropping (proportion of families dropped) for Merlin and (b) error rate (proportion of false alleles) for Merlin* and MML, each against missing rate (0.00, 0.05, 0.10, 0.15, 0.20).]

Figure 5.9: Performance of MML and Merlin on chromosome 6 of RA data. Missing rates are 0.00, 0.05, 0.10, 0.15, 0.20. *The error rate of Merlin is based only on the families it successfully processes.

As shown in Fig. 5.9(b), MML can successfully reconstruct the haplotypes of each individual with an allele-by-allele error rate ranging from 0.025% (for a missing rate of 0%) to 0.19% (for a missing rate of 20%). Since Merlin assumes no recombination within each pre-defined segment, it fails on 25% of the 500 families (Fig. 5.9(a)). On the remaining families, all recombination events happen to have ambiguous intervals that straddle segment boundaries rather than lying completely within a single segment, so Merlin can still find a single inheritance vector for each segment. However, the overall accuracy of MML on all families is even better than the accuracy of Merlin on these retained families (Fig. 5.9(b)).

Chapter 6

Conclusions

We developed an algorithm, DSS, for haplotype inference from pedigree data by making use of the Mendelian law of inheritance and the zero-recombinant assumption. DSS encodes constraints in a linear system and solves it using disjoint-set data structures. The proposed algorithm can output a general solution for a tree pedigree with complete data in time O(mnα(n)), which is an improvement upon existing algorithms. For a general pedigree, or a pedigree with missing data, by using the same framework, our method can significantly reduce the degrees of freedom on inheritance variables and thus narrow down the search scope. Experimental results show that the algorithm is efficient in practice for both complete data and data with missing genotypes, and outperforms two popular algorithms on large data sets. For data with a large number of markers, the performance of the algorithm hardly deteriorates as the missing rate increases.

Based on DSS, we went on to study the haplotype inference problem on a genome-wide level in mixed pedigree and population data. To handle whole genome data, we need to overcome the computational difficulties posed by huge numbers of markers, large pedigree (population) sizes, and substantial numbers of missing genotypes. Taking advantage of high marker density, we developed techniques to precisely resolve recombination positions. By doing so, we can segment the chromosomes into regions without recombinations. On each zero-recombinant region, we applied DSS to find and compactly represent the subset of inheritance configurations that are consistent with the Mendelian law. Furthermore, we employed a quick optimization strategy for detecting haplotype configurations of maximum likelihood. All these techniques make it possible to handle large degrees of freedom in inheritance patterns, the uncertainty of recombination positions, and the variety of possible haplotypes in a population. The combined exploitation of Mendelian constraints and local population structure improves the precision of haplotype reconstruction. Experimental results on both real and simulated populations show that our method (MML) can reconstruct the haplotypes with high accuracy and that it is scalable in terms of both the pedigree (population) size and the missing genotype rate.

Bibliography

[1] The International HapMap Consortium, A second generation human haplotype map of over 3.1 million SNPs, Nature 449:851–61, 2007.

[2] Abecasis GR, Cherny SS, Cookson WO, Cardon LR, Merlin-rapid analysis of dense genetic maps using sparse gene flow trees, Nature Genetics 30(1):97–101, 2002.

[3] Abecasis GR, Wigginton JE, Handling marker-marker linkage disequilibrium: pedigree analysis with clustered markers, American Journal of Human Genetics 77:754–767, 2005.

[4] Akey J, Jin L, Xiong M. Haplotypes vs. single marker linkage disequilibrium test: what do we gain? European Journal of Human Genetics 2001; 9:291–300.

[5] Bader JS. The relative power of SNPs and haplotype as genetic markers for association tests. Pharmacogenomics 2001; 2(1):11–24.

[6] Bonizzoni P, Vedova GD, Dondi R, Li J, The haplotyping problem: an overview of computational models and solutions, Journal of Computer Science and Technology 18(6):675–88, 2003.

[7] Browning SR, Browning BL, Rapid and accurate haplotype phasing and missing data inference for whole genome association studies using localized haplotype clustering, American Journal of Human Genetics 81:1084–1097, 2007.

[8] Browning BL, Browning SR, A unified approach to genotype imputation and haplotype phase inference for large data sets of trios and unrelated individuals, American Journal of Human Genetics 84:210–223, 2009.

[9] Chan MY, Chan W, Chin F, Fung S, Kao M, Linear-Time Haplotype Inference on Pedigrees without Recombinations, Proceedings of the 6th Annual Workshop on Algorithms in Bioinformatics, pp. 56–67, 2006.

[10] Coop G, Wen X, Ober C, Pritchard JK, Przeworski M, High-Resolution Mapping of Crossovers Reveals Extensive Variation in Fine-Scale Recombination Patterns Among Humans, Science 319:1395–1398, 2008.

[11] Cormen TH, Leiserson CE, Rivest RL, Stein C, Introduction to Algorithms, 2nd edition, McGraw-Hill Book Company, Boston, pp. 498–517, 2003.

[12] Doi K, Li J, Jiang T, Minimum recombinant haplotype configuration on pedigrees without mating loops, Proceedings of Workshop on Algorithms in Bioinformatics, pp. 339–353, 2003.

[13] Elston RC, Stewart J, A general model for the genetic analysis of pedigree data, Human Heredity 21:523–542, 1971.

[14] Gudmundsson J, Sulem P, Manolescu A, Amundadottir LT, Gudbjartsson D, Helgason A, Rafnar T, Bergthorsson JT, Agnarsson BA, Baker A, Sigurdsson A, Benediktsdottir KR, Jakobsdottir M, Xu J, Blondal T, Kostic J, Sun J, Ghosh S, Stacey SN, Mouy M, Saemundsdottir J, Backman VM, Kristjansson K, Tres A, Partin AW, Albers-Akkers MT, Godino-Ivan Marcos J, Walsh PC, Swinkels DW, Navarrete S, Isaacs SD, Aben KK, Graif T, Cashy J, Ruiz-Echarri M, Wiley KE, Suarez BK, Witjes JA, Frigge M, Ober C, Jonsson E, Einarsson GV, Mayordomo JI, Kiemeney LA, Isaacs WB, Catalona WJ, Barkardottir RB, Gulcher JR, Thorsteinsdottir U, Kong A, Stefansson K, Genome-wide association study identifies a second prostate cancer susceptibility variant at 8q24. Nature Genetics 2007; 39(5):631–637.

[15] Gusfield D, An overview of combinatorial methods for haplotype inference, Lecture Notes in Computer Science (2983): Computational Methods for SNPs and Haplotype Inference, pp. 9–25, 2004.

[16] Halldórsson BV, Bafna V, Edwards N, Lippert R, Yooseph S, Istrail S, A survey of computational methods for determining haplotypes, Lecture Notes in Computer Science (2983): Computational Methods for SNPs and Haplotype Inference, pp. 26–47, 2004.

[17] Kerem B, Rommens JM, Buchanan JA, Markiewicz D, Cox TK, Chakravarti A, Buchwald M, et al, Identification of the cystic fibrosis gene: genetic analysis, Science 245:1073–1080, 1989.

[18] Kong A, Masson G, Frigge ML, Gylfason A, Zusmanovich P, Thorleifsson G, Olason PI, Ingason A, Steinberg S, Rafnar T, Sulem P, Mouy M, Jonsson F, Thorsteinsdottir U, Gudbjartsson DF, Stefansson H, Stefansson K, Detection of sharing by descent, long-range phasing and haplotype imputation, Nature Genetics 40:1068–1075, 2008.

[19] Kong A, Masson G, Frigge ML, Gylfason A, Zusmanovich P, Thorleifsson G, Olason PI, Ingason A, Steinberg S, Rafnar T, Sulem P, Mouy M, Jonsson F, Thorsteinsdottir U, Gudbjartsson DF, Stefansson H, Stefansson K, A high-resolution recombination map of the human genome, Nature Genetics 31:241–247, 2002.

[20] Lander ES, Green P, Construction of multilocus maps in humans, Proceedings of the National Academy of Sciences 84:2363–2367, 1987.

[21] Leal SM, Yan K, Müller-Myhsok B, SimPed: a simulation program to generate haplotype and genotype data for pedigree structures, Human Heredity 60:119–122, 2005.

[22] Li J, Jiang T, Computing the Minimum Recombinant Haplotype Configuration from Incomplete Genotype Data on a Pedigree by Integer Linear Programming, Journal of Computational Biology 12:719–739, 2005.

[23] Li J, Jiang T, A survey on haplotyping algorithms for tightly linked markers, Journal of Bioinformatics and Computational Biology 6(1):241–259, 2008.

[24] Li J, Jiang T, Efficient Inference of Haplotypes from Genotype on a Pedigree, Journal of Bioinformatics and Computational Biology 1(1):41–69, 2003.

[25] Li J, Jiang T, Computing the minimum recombinant haplotype configuration from incomplete genotype data on a pedigree by integer linear programming, Journal of Computational Biology 12:719–739, 2005.

[26] Li X, Li J, Comparisons of haplotype Inference from pedigree data and population data, BMC Proceedings 1:S55, 2007.

[27] Li X, Li J, Efficient Haplotype Inference from Pedigrees with Missing Data using Linear Systems with Disjoint-Set Data Structures, Conference on Computational Systems Biology, pp. 297–308, 2008.

[28] Li X, Li J, An almost linear time algorithm for a general haplotype solution on tree pedigrees with no recombination and its extensions, Journal of Bioinformatics and Computational Biology 7(3):521–545, 2009.

[29] Li X, Chen Y, Li J, Detecting Genome-wide Haplotype Polymorphism by Combined Use of Mendelian Constraints and Population Local Structure. To appear in the Proceedings of Pacific Symposium on Biocomputing, 2010.

[30] Lin G, Wang Z, Wang L, Lau Y, Yang W, Identification of linked regions using high-density SNP genotype data in linkage analysis, Bioinformatics 24(1):86–89, 2008.

[31] Liu L, Chen X, Xiao J, Jiang T, Complexity and approximation of the minimum recombination haplotype configuration problem, Proceedings 16th International Symposium on Algorithms and Computation, pp. 370–379, 2005.

[32] Liu L, Jiang T, Linear-Time Reconstruction of Zero-Recombinant Mendelian Inheritance on Pedigrees without Mating Loops, Proceedings of Genome Informatics Workshop, pp. 95–106, 2007.

[33] Morris RW, Kaplan NL, On the advantage of haplotype analysis in the presence of multiple disease susceptibility alleles, Genetic Epidemiology 23:221–233, 2002.

[34] O’Connell JR, Zero-recombinant haplotyping: Applications to fine mapping using SNPs, Genetic Epidemiology 19(1):64–70, 2000.

[35] Scheet P, Stephens M, Fast and flexible statistical model for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase, American Journal of Human Genetics 78:629–644, 2006.

[36] Sobel E, Lange K, Descent graphs in pedigree analysis: applications to haplotyping, location scores, and marker-sharing statistics, American Journal of Human Genetics 58(6):1323–1337, 1996.

[37] Stephens M, Smith NJ, Donnelly P, A new statistical method for haplotype reconstruction from population data. American Journal of Human Genetics 68:978–989, 2001.

[38] Tarjan RE, van Leeuwen J, Worst-case analysis of set union algorithms, Journal of the ACM 31(2):245–281, 1984.

[39] Xiao J, Liu L, Xia L, Jiang T, Fast Elimination of Redundant Linear Equations and Reconstruction of Recombination-Free Mendelian Inheritance on a Pedigree, Proceedings of 18th Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 655–664, 2007.

[40] Zhang XS, Wang RS, Wu LY, Chen L, Models and Algorithms for Haplotyping Problem. Current Bioinformatics 1(1):105–114, 2006.

[41] Zhang K, Sun F, Zhao H, HAPLORE: A program for haplotype construction in general pedigrees without recombination. Bioinformatics 21:90–103, 2005.

[42] Zhang K, Zhao H, A comparison of several methods for haplotype frequency estimation and haplotype reconstruction for tightly linked markers from general pedigrees, Genetic Epidemiology 30(5):423–437, 2006.
