Haplotype Inference from Pedigree Data and Population Data

HAPLOTYPE INFERENCE FROM PEDIGREE DATA AND POPULATION DATA

by XIN LI

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy

Dissertation Advisor: Jing Li
Department of Electrical Engineering and Computer Science
CASE WESTERN RESERVE UNIVERSITY
January, 2010

Table of Contents

List of Tables
List of Figures
Acknowledgments
Abstract
Chapter 1. Introduction
    1.1 Statistical methods
    1.2 Rule-based methods
        1.2.1 MRHC
        1.2.2 ZRHC
Chapter 2. Problem statement and solutions
    2.1 Large Pedigrees: manipulation of Mendelian constraints
    2.2 Families with many markers: dealing with recombinations
    2.3 Mixed data: use of population information
Chapter 3. Preliminaries
    3.1 Mendelian and zero-recombinant constraints
    3.2 Locus graphs
    3.3 Linear constraints on h variables
Chapter 4. Linear Systems on Mendelian Constraints
    4.1 Methods to solve the linear systems
        4.1.1 Split nodes to break cycles
        4.1.2 Detect path constraints from locus graphs
        4.1.3 Encode path constraints in disjoint-set structure D
    4.2 Analysis of the algorithm on tree pedigrees with complete data
    4.3 Extension to General Cases
        4.3.1 Pedigrees with mating loops
        4.3.2 Pedigrees with missing data
    4.4 Experimental Results
Chapter 5. Haplotype Inference on a Genome-wide Level
    5.1 Detect Recombination Events in Families with Dense Markers
    5.2 Solution Space under Mendelian Constraints
    5.3 Maximum Likelihood Solution Based on Population Haplotype Frequency
        5.3.1 Probabilistic prefix tree for fast branch-and-bound optimization
    5.4 Experimental Results
        5.4.1 Detect Recombination Events and Haplotype Diversity
        5.4.2 Evaluation of Accuracy and Scalability
            5.4.2.1 Influence of pedigree size, missing rate on performance
            5.4.2.2 Genome-wide haplotype inference accuracy
Chapter 6. Conclusions
Bibliography

List of Tables

1.1 The evolution of the complexity of the ZRHC problem on tree pedigrees.
3.1 Constraints for a parent-child pair x, y.
4.1 Comparison of running time (in seconds) between DSS and Merlin on pedigree size 128.

List of Figures

1.1 Haplotype
3.1 Pedigree, haplotype and recombination
3.2 Mendelian constraints
3.3 Locus graph
4.1 Node splitting
4.2 Path constraints
4.3 Path constraints
4.4 Looped pedigree
4.5 Pedigree structures used in simulation
4.6 Comparison of DSS, Merlin and PedPhase.ILP
4.7 Comparison of DSS and Merlin on different patterns of missing data.
5.1 Recombination detection
5.2 Allele constraints
5.3 Probabilistic prefix tree
5.4 The distribution of the length of ambiguous intervals of inferred recombination positions.
5.5 Haplotype diversity
5.6 Degree of freedom
5.7 Recombination positions
5.8 Comparison of two methods on a dataset of 500 pedigrees.
5.9 Performance of MML and Merlin on chromosome 6 of RA data.

Acknowledgments

I would like to thank my advisor Dr. Jing Li for his heavy investment of time and intelligence in this work and for his consistent dedication and commitment to my Ph.D. study.
I would also like to thank Yixuan Chen, Xiaolin Yin, Yoon Soon Pyon, Matthew Hayes and Robert Shields for their help throughout the progress of this project. Finally, I would like to thank Dr. Mehmet Koyutürk, Dr. Soumya Ray and Dr. Xiaofeng Zhu for serving on my dissertation committee and for their valuable mentorship.

Abstract

Haplotypes are an important representation of human genetic variation and are thus valuable for investigating the genetics behind diseases. However, humans are diploid, and in practice genotype data rather than haplotype data are collected directly. Consequently, there is great demand for efficient and accurate computational methods to reconstruct haplotypes from genotype data. Our project started with the development of a rule-based haplotyping method for pedigree data with tightly linked markers. We formulate Mendelian constraints as a linear system of inheritance variables and solve the linear system using disjoint-set data structures. Our algorithm achieved the lowest time complexity among all existing methods. Comparisons with two popular algorithms showed that this algorithm made 10- to 10^5-fold improvements on a variety of parameter settings. Building on zero-recombinant haplotype inference, we went on to construct a general framework for haplotyping mixed population and pedigree data that consist of many families with unrelated founders, by combining novel techniques of recombination event detection and maximum likelihood optimization. This method makes genome-wide haplotype inference possible on mixed pedigree and population data.

Chapter 1. Introduction

A diploid organism such as a human has two homologous copies of each chromosome, one from its father and the other from its mother, as illustrated in Figure 1.1. A physical position on a chromosome is called a locus and the state of a locus is called an allele.
If we consider a single nucleotide as a locus, the locus can carry only two alternative alleles, either A/T or C/G; therefore, we can represent an allele using the integers 1 and 2. Most loci in the genome carry identical alleles across different people; however, we are more interested in positions where there is variation. A single-nucleotide locus that carries different alleles among members of a population is called a single-nucleotide polymorphism (SNP).

In practice, genotype data (pairs of alleles whose parental sources are not distinguished) are collected instead of haplotype data, especially in large-scale sequencing projects, mainly because of cost. The problem of haplotyping (sometimes called "phasing") is to use computational methods to infer the parental sources of pairs of alleles among related or unrelated individuals, and thus to reconstruct the haplotypes of these individuals. Many studies of gene-disease associations have shown the importance of haplotypes, as they provide the linkage information between SNPs. Hence, there is great demand for efficient and accurate computational methods and computer programs to infer haplotypes from genotypes.

Figure 1.1: Haplotype

Recent years have witnessed intensive research on haplotyping methods (see reviews [6, 15, 16, 23, 40]). By the type of input data, these methods can be divided into two categories: those for population data (unrelated individuals) and those for pedigree data (related individuals in a family). Methods for population data make use of the clustering property of haplotype segments in the population. Methods for pedigree data, on the other hand, rely on Mendelian constraints among family members. We can also categorize these methods as statistical or combinatorial (rule-based) according to their algorithmic features. Here, we present a brief review of both statistical and rule-based methods.
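
To make the phasing problem described above concrete, the following minimal Python sketch enumerates every unordered haplotype pair that is consistent with one individual's multi-locus genotype, using the 1/2 allele coding introduced in this chapter. The function name and the data layout (a list of allele pairs) are assumptions chosen for illustration and are not taken from the dissertation.

```python
from itertools import product


def compatible_haplotype_pairs(genotype):
    """Enumerate unordered haplotype pairs consistent with a genotype.

    genotype: list of unordered allele pairs, one per locus, with alleles
    coded as 1 or 2, e.g. [(1, 2), (1, 1), (1, 2)].
    Returns a set of frozensets, each holding the two haplotypes as tuples.
    """
    loci = [tuple(sorted(g)) for g in genotype]
    # Only heterozygous loci are ambiguous: their phase is unknown.
    het = [i for i, (a, b) in enumerate(loci) if a != b]
    pairs = set()
    for flips in product((False, True), repeat=len(het)):
        h1 = [a for a, _ in loci]
        h2 = [b for _, b in loci]
        for i, flip in zip(het, flips):
            if flip:
                h1[i], h2[i] = h2[i], h1[i]
        # frozenset removes the arbitrary ordering of the two haplotypes.
        pairs.add(frozenset((tuple(h1), tuple(h2))))
    return pairs


# Two heterozygous loci give two distinct unordered phasings.
example = compatible_haplotype_pairs([(1, 2), (1, 1), (1, 2)])
print(len(example))  # 2
```

With k heterozygous loci (k at least 1) the sketch returns 2^(k-1) candidate pairs, which is why genotype data alone leave the haplotypes ambiguous and additional information from relatives or from the population is needed.
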
1.1 Statistical methods

In general, the goal of statistical approaches is to find the haplotype assignment with maximum likelihood for each individual. Two exact algorithms have been proposed to calculate the probability of a pedigree. The Elston-Stewart algorithm [13] takes advantage of the Markov property of the pedigree structure: given the parents' genotype information, the genotypes of a child are independent of the genotypes of its ancestors. The algorithm is linear in pedigree size, but exponential in the number of genetic loci. The Lander-Green algorithm [20] takes advantage of the Markov property between loci: under the assumption of no recombination interference (independence of recombination events), the phase of a locus depends only on the phase of the previous locus. This algorithm is linear in the number of genetic loci, but exponential in pedigree size. Both methods assume linkage equilibrium (no correlation between alleles at two loci), which is unrealistic for tightly linked markers such as SNPs. Recently, population haplotype frequencies have been taken into consideration [3] to account for correlations among tightly linked markers (known as linkage disequilibrium). A key step in most statistical approaches is to enumerate all possible inheritance patterns and to check the genotype consistency of each of them. Because of the large number of degrees of freedom, this step usually leads to high time complexity (usually exponential, and hence computationally intractable for large data sets).

1.2 Rule-based methods

Rule-based algorithms first partially infer haplotypes or inheritance vectors based on the Mendelian law of inheritance, and then further optimize these candidate configurations. Therefore, rule-based algorithms can potentially gain an advantage in efficiency over statistical methods.
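
The enumeration step described in Section 1.1, and the Mendelian check that rule-based methods start from, can be illustrated on a single father-mother-child trio at one locus. The sketch below is only an illustration under assumed names and an assumed data layout (each genotype is a pair of alleles coded 1 or 2); it is not the procedure used by any of the cited algorithms.

```python
from itertools import product


def consistent_inheritance_patterns(father, mother, child):
    """List the inheritance patterns that explain a child's genotype.

    Each genotype is a pair of alleles. A pattern (p, m) means the child
    received allele father[p] and allele mother[m], with p, m in {0, 1}.
    An empty result signals a Mendelian inconsistency.
    """
    target = sorted(child)
    return [
        (p, m)
        for p, m in product((0, 1), repeat=2)
        if sorted((father[p], mother[m])) == target
    ]


# The child's allele 2 can only come from the father's second allele.
print(consistent_inheritance_patterns((1, 2), (1, 1), (1, 2)))  # [(1, 0), (1, 1)]
# Two homozygous 1/1 parents cannot produce a 1/2 child.
print(consistent_inheritance_patterns((1, 1), (1, 1), (1, 2)))  # []
```

Each non-founder contributes two such binary choices per locus, so brute-force enumeration over a pedigree with n non-founders examines 2^(2n) patterns at every locus, which is the source of the exponential cost noted above.
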
Under reasonable assumptions, such as a minimum number of recombination events or no recombination events over a segment of loci within a pedigree, one can exploit the constraints among individuals explicitly and mathematically. In the literature, the haplotyping problems under these two assumptions are called minimum-recombinant haplotype configuration (MRHC) and zero-recombinant haplotype configuration (ZRHC), respectively.

1.2.1 MRHC

The minimum recombination principle states that genetic recombination is rare, and thus haplotypes with fewer recombinants should be preferred.
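
As a small illustration of the minimum recombination principle, the sketch below counts recombination events in a candidate configuration and selects the candidate with the fewest. It assumes a hypothetical representation in which each meiosis is described by a list of 0/1 values, one per locus, recording which of the parent's two haplotypes was transmitted; a switch between adjacent loci counts as one recombinant. The exhaustive comparison is for illustration only and is not one of the MRHC algorithms from the literature.

```python
def count_recombinants(transmission):
    """Count switches between adjacent loci in one 0/1 transmission vector."""
    return sum(1 for a, b in zip(transmission, transmission[1:]) if a != b)


def total_recombinants(configuration):
    """Sum recombinants over all meioses (transmission vectors) in a configuration."""
    return sum(count_recombinants(t) for t in configuration)


def fewest_recombinants(candidates):
    """Return the candidate configuration with the fewest recombinants."""
    return min(candidates, key=total_recombinants)


# Two hypothetical configurations, each covering two meioses over four loci.
config_a = [[0, 0, 1, 1], [1, 1, 1, 1]]  # one switch in total
config_b = [[0, 1, 0, 1], [1, 1, 1, 1]]  # three switches in total
print(total_recombinants(config_a))  # 1
print(fewest_recombinants([config_a, config_b]) is config_a)  # True
```

Under the zero-recombinant assumption (ZRHC), only configurations whose total is 0 would be admissible, so every transmission vector is constant across the segment.
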
