ACCURATE METHODS FOR ANCESTRY AND RELATEDNESS INFERENCE

A DISSERTATION SUBMITTED TO THE PROGRAM IN BIOMEDICAL INFORMATICS AND THE COMMITTEE ON GRADUATE STUDIES OF STANFORD UNIVERSITY IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

Jesse M. Rodriguez December 2013

© 2013 by Jesse M. Rodriguez. All Rights Reserved. Re-distributed by Stanford University under license with the author.

This work is licensed under a Creative Commons Attribution-Noncommercial 3.0 United States License. http://creativecommons.org/licenses/by-nc/3.0/us/

This dissertation is online at: http://purl.stanford.edu/cn371vd3410

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Serafim Batzoglou, Primary Adviser

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Russ Altman

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Carlos Bustamante

Approved for the Stanford University Committee on Graduate Studies. Patricia J. Gumport, Vice Provost for Graduate Education

This signature page was generated electronically upon submission of this dissertation in electronic format. An original signed hard copy of the signature page is on file in University Archives.

Abstract

The predisposition to many diseases is strongly influenced by the genome of an individual. However, the association between the genome and most diseases is not fully understood, so there is an ongoing effort to characterize these associations. One way to characterize disease-genome associations is by studying the familial and ancestral origin of individuals in the context of disease. This kind of study relies on the fact that individuals with shared origins tend to have genomes and phenotypes that are similar to one another. Detailed information regarding familial and ancestral origin is often unknown; however, it can be inferred computationally by examining the genome. Therefore, it is important that we have accurate methods to infer this information in order to facilitate disease-genome associations. In this dissertation, I describe the contributions I have made to accurately inferring the ancestry and relatedness of individuals based on their genomes. First, I describe my work on Alloy, a method to infer the ancestral origin of segments of the genome based on a factorial HMM. Next, I present Parente, a method to infer which individuals in a group are related to one another by detecting genomic segments that are identical-by-descent (IBD) using an embedded likelihood ratio test. Finally, I present Parente2, an extension of Parente that incorporates linkage disequilibrium information and results in significantly higher accuracy.

Acknowledgements

I owe a great deal of thanks to many people who have supported me through the years of my PhD. To Pavel, Nitin, and Josef, for giving me a place to start on this journey. To Tiffany, Dan, Sarah, George, Sam, Tom, Marc, Andreas, Noah, Alex, David, Eugene, Marina, and Karen for your instruction, advice, discussions, friendship, and memories. To my BMI classmates and members of the Batzoglou lab for your support, all of the fun, and for being great colleagues. To Mary Jeanne, Nancy, and Alex Sandra, for your help and guidance. To David Paik, Mark Musen, and Larry Fagan, for your mentorship and advice. To Russ, David, and Carlos, for being on my committee. To Arend, Cheryl, Noah, Lin, and Roy, for teaching me so much and being so great to work with. To Serafim, for being relentlessly supportive and giving me the chance to have fun with my PhD. To Sivan, for your friendship, bountiful ideas, and hard work. And to Kelly, Audrey, and my family, for your love, patience, and encouragement.

Contents

Abstract

Acknowledgements

1 Introduction

2 Biology, terminology, and technology
2.1 Biology and terminology
2.1.1 Technology

3 Relatedness and ancestry
3.1 Family genetics
3.1.1 IBD
3.2 Ancestry

4 Alloy
4.1 Introduction
4.1.1 Previous work
4.1.2 Overview of methods and results
4.2 Methods
4.2.1 Factorial hidden Markov model
4.2.2 Inference with the forward-backward algorithm
4.2.3 Transition probabilities
4.2.4 Linkage disequilibrium model
4.3 Results
4.3.1 Simulation of admixed individuals and training the background LD models
4.3.2 Evaluating Alloy’s accuracy under complex and ancient admixtures
4.3.3 Exploring background LD models
4.3.4 Measuring robustness to inaccuracies in model parameters
4.3.5 Evaluating model accuracy under varying amounts of training data
4.4 Discussion
4.4.1 Robustness to different admixtures models
4.4.2 Learning admixture parameters
4.4.3 Time complexity reduction
4.4.4 Incorporating additional variation
4.4.5 Conclusion

5 Parente
5.1 Introduction
5.2 Methods
5.2.1 Likelihood ratio test
5.2.2 Embedded likelihood ratio test
5.2.3 Genotyping-error function
5.2.4 Likelihood-ratio test threshold
5.3 Results
5.3.1 Constructing training and testing datasets
5.3.2 Simulations to evaluate performance
5.3.3 Parente’s accuracy and comparison to fastIBD
5.3.4 Training Parente’s model and thresholds
5.3.5 Embedded LRT and local thresholds
5.3.6 Accuracy performance characteristics
5.4 Discussion

6 Parente2
6.1 Introduction
6.2 Methods
6.2.1 Inner Log-Likelihood Ratio
6.2.2 Outer Log-Likelihood Ratio
6.2.3 Genotyping-error function
6.2.4 Window and window set definitions
6.2.5 Window filter
6.2.6 Decreasing running time with the SpeeDB filter
6.2.7 Facilitating larger window sizes
6.3 Results
6.3.1 Data sets
6.3.2 Simulations
6.3.3 Experimental parameters
6.3.4 Accuracy of Parente2 compared to other methods
6.3.5 Augmented window definitions and window filter yields better performance
6.3.6 ELR yields better performance
6.3.7 Increasing window size increases performance
6.3.8 Parente2 is robust to window set size
6.4 Discussion
6.4.1 Speed and accuracy tradeoff
6.4.2 Amount of training data required
6.4.3 When no training data is available
6.4.4 Recommended settings for Parente2
6.4.5 Applicability to DNA sequencing studies

7 Conclusions and Future Directions
7.1 Potential impact
7.2 Future directions

Bibliography

List of Tables

3.1 Expected number and size of IBD segments based on relationship

5.1 Normality of window scores as a function of window size

6.1 Example of tiling method used to break up latent IBD. In this example, 6 source individuals are used to create 3 composite individuals, each having 9 genomic segments (e.g. assuming a chromosome of length 1.8 cM with a segment size of 0.2 cM). Each entry in the table contains the index of the source individual used for the jth genomic segment of the ith composite individual.

6.2 Pairwise accuracy of Parente2 and other methods. Table 6.2A shows sensitivity at lower false positive rates and Table 6.2B shows sensitivity at higher false positive rates. fastIBD was run ten times with ten different seeds according to author recommendations. For the GERMLINE-64 and GERMLINE-128 rows, GERMLINE was run on phased data with GERMLINE’s seed size set to 64 and 128, respectively.

6.3 Positional accuracy of Parente2 and other methods. Accuracy was measured based on the portion of the genome that was in or not in a simulated IBD segment for each pair of individuals.

6.4 The accuracy and running time of evaluated IBD inference methods. Each method was used to detect 2 cM IBD segments in 10 trials of the HapMap data set. The Parente2 entry represents when Parente2 was run using the augmented window set with the window filter. Parente2-SpeeDB is the same but with the application of the SpeeDB filter. The Parente2-Std. entry represents when Parente2 was run using the standard window set without the window filter and without SpeeDB. fastIBD was run ten times with ten different random seeds according to the authors’ recommendations and the sum of the running time of all ten runs is reported. GERMLINE-64 and GERMLINE-128 refer to running GERMLINE while using seed sizes of 64 and 128, respectively. The phasing pipeline provided with GERMLINE was used to phase the data prior to running GERMLINE and its running time is included in the reported running time. The number of pairs of individuals processed by each method per second is reported in the Pairs/sec column.

6.5 Parente2’s accuracy at various thresholds when detecting 2 and 4 cM IBD segments

List of Figures

2.1 Linkage disequilibrium

3.1 Shared DNA by familial relationship

4.1 Alloy’s factorial HMM

4.2 The state space of the HMM

4.3 Alloy’s performance over increasing generations since admixture and comparison to WINPOP

4.4 Ancestry inference accuracy as a function of model complexity

4.5 Performance as a function of increasing training data set size

5.1 Parente’s pairwise performance vs. fastIBD

5.2 Parente’s segment localization performance vs. fastIBD

5.3 Parente’s performance over different IBD segment sizes

5.4 Mean window scores of IBD and non-IBD training data

5.5 Characterization of the LRT to ELRT transformation

5.6 LRT vs. ELRT Block scores

5.7 Block-specific threshold levels across the chromosome

6.1 Underlying graphical model for the likelihood ratio. The variables h1, h2, h3, and h4 represent hidden haplotypes for a given window of markers. The variables g and g′ represent the observed genotype vectors from the first and second individual in a pair of individuals being evaluated for IBD in the window. (A) The model for two unrelated individuals that do not share an IBD segment in the window. (B) The model for two related individuals that do share an IBD segment in the window.

6.2 The effect of the window filter and the augmented window definitions on Parente2’s accuracy. The augmented window definitions had higher accuracy than the standard window definitions. The window filter improved the accuracy of Parente2 when using the augmented window definitions but reduced the accuracy when using the standard window definitions. The sensitivity shown is at a 1% false positive rate.

6.3 Comparing the accuracy of the embedded likelihood ratio (ELR) scoring function to the likelihood ratio (LR) scoring function. The embedded likelihood ratio outperforms the likelihood ratio in all of the configurations evaluated.

6.4 The effect of window size on Parente2’s performance. Increasing the window size of Parente2 results in better performance.

6.5 Parente2’s sensitivity as a function of number of training individuals. Parente2 was run on the WTCCC 2 cM data set. The vertical axis shows sensitivity at a 1% false positive rate.

6.6 The performance of Parente2 as a function of marker density. Parente2 was used to detect 2 cM IBD segments in the HapMap data set. Sensitivity is reported at a 1% false positive rate.

Chapter 1

Introduction

Genomic DNA is the primary carrier of biological information from parents to children. It transmits many traits, ranging from physical traits such as height, facial features, and skin tone, to more complex traits like propensity to addiction. It has long been observed that many medically relevant traits are also heritable; for example, William Farabee observed in 1905 that brachydactyly runs in families [17]. Early work in examining the relationship between diseases and family inheritance, a field known as medical genetics, focused on looking at family trees, known as pedigrees, along with records of which individuals were affected by a disease of interest. Once a disease or trait is determined to be heritable, additional work is required to determine what part of the genome confers this trait. Furthermore, it can be challenging to elucidate the genetic association of more complicated heritable diseases such as those that are only partially caused by genetics. Determining which DNA segments in the genome are associated with a disease, a task known as mapping, is important because it can be an early step towards understanding the etiology of a disease, which can then lead to developing a treatment.

Along with biological traits, the genome also carries information regarding family history and ancestry. Individuals descending from a recent common ancestor or from a common ancestral population tend to have similar DNA sequences in their genomes. While ancestral and familial origin can be studied for social reasons, it is also highly relevant for biology and medicine. Two powerful techniques have been developed to use ancestry and distant familial relationships in conjunction with trait information in order to perform genetic mapping of the trait. The first technique, known as admixture mapping, requires labeling each segment of an individual’s genome with an ancestral origin (e.g. European, African, or Asian descent) [61, 53]. This information is generally not known, but it can be inferred by examining the DNA of the individual and comparing it to the DNA of many people of different ancestral backgrounds, a problem known as local ancestry inference. The second technique, known as IBD mapping, requires taking the genomes of two individuals as input and determining which locations in the genome (if any) are shared between them, originating from a recent common ancestor (e.g. a great-great-great-grandparent) [61, 53]. This information is not known a priori, but it can be inferred by examining the genomes, a problem called IBD inference.

In recent years, DNA technologies have been rapidly advancing, allowing researchers to examine the genomes of an ever-growing number of individuals with increasing detail. Because of this trend, it is becoming increasingly important to develop methods that can analyze this data with better accuracy and speed. For the purposes of performing admixture mapping and IBD mapping, it is therefore important to improve the accuracy and speed of methods for local ancestry inference and IBD inference, respectively.

In this dissertation, I describe the contributions I have made to accurately inferring the ancestry and relatedness of individuals based on their genomes. I begin by providing background information in Chapters 2 and 3. Next, in Chapter 4, I describe my work on Alloy, a method to infer the ancestral origin of segments of the genome based on a factorial HMM. Then, in Chapter 5, I present Parente, a method to infer which individuals in a group are related to one another by detecting genomic segments that are identical-by-descent (IBD) using an embedded likelihood ratio test. Finally, in Chapter 6, I discuss my work on Parente2, which extends Parente to incorporate linkage disequilibrium information to further improve its accuracy.

Chapter 2

Biology, terminology, and technology

2.1 Biology and terminology

DNA is a molecule that can be represented as a string of the letters A, C, G, or T, which are known as nucleotides or bases. In humans, DNA is organized into contiguous sequences known as chromosomes. There are 24 different human chromosomes, named chromosomes 1-22, chromosome X, and chromosome Y (present only in males). The collection of all the chromosomes is known as the genome and it has a total length of approximately 3 billion letters, which are called basepairs. Humans are diploid organisms, meaning that each individual contains roughly two copies of the genome. The two copies come in the form of chromosome pairs, that is, each individual has two copies of chromosome 1, two copies of chromosome 2, etc. Every individual has two copies of chromosomes 1 to 22, which are known as the autosomes. Females have two copies of chromosome X and no copies of chromosome Y; males have a single copy of chromosome X and a single copy of chromosome Y. For the purposes of this dissertation, I will focus on the autosomes.

Because there are two copies of every chromosome, at every position in the genome, an individual actually has two letters at that position, one from each copy. For the sake of simplicity of discussion (see footnote 1), we will assume that both copies are the same length and that there is a one-to-one correspondence of positions among them; that is, the i-th position in one copy corresponds to the i-th position in the other copy. At the vast majority of positions in the genome of an individual the two letters are the same (e.g. A and A); that is, the individual is homozygous at those positions. At some locations, an individual has two different letters (e.g. C and T); the individual is said to be heterozygous at those positions. Positions in the genome where at least two different letters have been observed among all humans assayed are known as polymorphic positions. When these polymorphic positions are not adjacent to one another, we call them single-nucleotide polymorphisms or SNPs. The different possible letters at a SNP are known as alleles. The allele that is more common is known as the major allele and the one that is less common is known as the minor allele.

During reproduction, a child receives a single set of chromosomes from each parent, making the child diploid like the parents. During the biological process of meiosis, for each chromosome in each parent, the two copies of that chromosome are used to generate a new version of the chromosome that is usually distinct from either of the original copies; this process is called recombination. This new version of the chromosome is then passed on to the child. In a simplified view of recombination, the new version of the chromosome is created by randomly picking one copy of the chromosome and starting to copy from it. Then, at some position (chosen at random), the process pauses and resumes by copying from the other chromosome copy. Usually this copying switch occurs at most a few times along the same chromosome. The locations at which the switch occurs are called crossovers or recombination events.
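The simplified copying view of recombination described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `recombine`, the fixed number of crossovers, and the uniform choice of crossover positions are hypothetical simplifications, not part of the methods in this dissertation.

```python
import random


def recombine(copy_a, copy_b, n_crossovers=1, rng=None):
    """Build one transmitted chromosome as a mosaic of a parent's two copies.

    Assumption: crossover positions are drawn uniformly at random, which
    ignores the non-uniform recombination rate along real chromosomes.
    """
    if len(copy_a) != len(copy_b):
        raise ValueError("both copies must have the same length")
    rng = rng or random.Random()
    # A crossover at index p means the source copy switches just before p.
    n = min(n_crossovers, len(copy_a) - 1)
    crossovers = sorted(rng.sample(range(1, len(copy_a)), n))
    copies = [copy_a, copy_b]
    current = rng.randrange(2)  # randomly pick the copy to start from
    pieces, start = [], 0
    for pos in crossovers + [len(copy_a)]:
        pieces.append(copies[current][start:pos])
        current = 1 - current  # switch to the other copy
        start = pos
    return "".join(pieces), crossovers


# Toy parental copies that differ at every position, so switches are visible.
child_chrom, points = recombine("AAAAAAAAAA", "CCCCCCCCCC",
                                n_crossovers=2, rng=random.Random(0))
print(child_chrom, points)
```

Using two toy copies that differ everywhere makes each crossover visible as a change of letter in the returned string.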

The process of recombination happens independently for each chromosome and the location of the crossover is random. The distribution of recombination events is not uniform across the physical length of the chromosome, however. To describe the chance of a recombination event happening between two positions in the genome, we use the distance metric of centimorgans (cM). Specifically, if two positions are 1 cM apart, the expected number of recombinations to occur between those two positions in a single generation is 0.01. The human genome is approximately 3,000 cM in length when the distances of all chromosomes are summed, which means there are approximately 30 recombinations per generation in expectation across the entire genome.

Footnote 1: This is a very simplistic view, since there is a wide range of ways the two copies can differ from each other where a segment (either long or short) of one copy of the chromosome is absent from the other copy.
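As a quick check of the centimorgan arithmetic, the following sketch converts a genetic distance into an expected number of recombination events. The function name `expected_recombinations` is hypothetical and introduced only for illustration.

```python
def expected_recombinations(distance_cm, generations=1):
    """Expected number of recombination events across a genetic distance.

    By definition, 1 cM corresponds to 0.01 expected recombinations
    per generation.
    """
    return distance_cm / 100.0 * generations


per_cm = expected_recombinations(1)           # two positions 1 cM apart
genome_wide = expected_recombinations(3000)   # approximate genome length in cM
print(per_cm, genome_wide)
```

With a genome length of roughly 3,000 cM, this reproduces the figure of about 30 expected recombinations per generation.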

We use the term haplotype to refer to the ordered letters on a contiguous segment of DNA. A haplotype can refer to some or all of the positions along that stretch of DNA, generally focusing on the polymorphic sites. The positions in a haplotype do not necessarily have to be adjacent to one another, but they are usually represented in order. For any segment of a chromosome pair, an individual has two haplotypes, with one haplotype provided by the mother and one provided by the father. Due to some limitations of the technology used to examine the genome, we often cannot examine each haplotype by itself. Instead, we can examine both haplotypes simultaneously, but for each position, we do not know which letter came from which haplotype. For example, suppose an individual had the haplotypes ACCA and AACG. Most technologies would only allow us to know that the individual had two A’s at the first position, an A and a C at the second position, two C’s at the third position, and an A and a G at the fourth position. This is known as the individual’s genotype. It can be written as a sequence of unordered pairs such as (A,A), (A,C), (C,C), (A,G). Genotypes are analogous to haplotypes in that they can refer to some or all of the positions along a contiguous stretch of the genome that are not necessarily adjacent, usually at polymorphic sites.
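The relationship between haplotypes and genotypes can be illustrated with a minimal sketch. The helper `genotype_from_haplotypes` is a hypothetical name; sorting each pair is simply one way to represent an unordered pair.

```python
def genotype_from_haplotypes(hap1, hap2):
    """Collapse two haplotypes into a genotype of unordered allele pairs."""
    if len(hap1) != len(hap2):
        raise ValueError("haplotypes must cover the same positions")
    # Sorting each pair discards which haplotype each allele came from.
    return [tuple(sorted(pair)) for pair in zip(hap1, hap2)]


genotype = genotype_from_haplotypes("ACCA", "AACG")
print(genotype)

# A different pair of haplotypes (alleles swapped at the second position)
# gives the same genotype, which is why phase cannot be read off directly.
same_genotype = genotype_from_haplotypes("AACA", "ACCG")
print(same_genotype == genotype)
```

The second call shows that distinct haplotype pairs can be indistinguishable at the genotype level.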

For the vast majority of polymorphic locations, there only exist two possible alleles in the human population, the major allele and the minor allele; these positions are known as biallelic. At very few positions in the genome there are three or four possible alleles; these positions are known as triallelic and quadriallelic, respectively. Assuming one only considers biallelic positions, haplotypes and genotypes can also be written in numerical form by counting the number of major alleles present. In the example in the previous paragraph, if the major allele of the first position was A, C for the second position, A for the third, and G for the fourth, one could denote the haplotypes as 1100 and 1001. The corresponding genotype would be denoted as 2101. Note that with this representation, the genotype at each position can be computed by simply summing the numbers in the two haplotypes at that position.
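The numerical encoding can be reproduced with a short sketch, using the example haplotypes ACCA and AACG and the major alleles A, C, A, and G from the text. The names `MAJOR_ALLELES` and `encode_haplotype` are hypothetical.

```python
# Major allele at each of the four example positions, as given in the text.
MAJOR_ALLELES = "ACAG"


def encode_haplotype(haplotype, major_alleles):
    """Encode a haplotype as 1 where it carries the major allele, else 0."""
    return [1 if allele == major else 0
            for allele, major in zip(haplotype, major_alleles)]


hap1 = encode_haplotype("ACCA", MAJOR_ALLELES)
hap2 = encode_haplotype("AACG", MAJOR_ALLELES)

# The numeric genotype is the per-position sum of the two haplotypes.
numeric_genotype = [a + b for a, b in zip(hap1, hap2)]
print(hap1, hap2, numeric_genotype)
```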

Linkage disequilibrium

Linkage disequilibrium, or LD, is another important phenomenon that can be caused by the process of recombination. Because the number of recombinations per generation is relatively low, there is a high chance that even after many generations, the alleles at two nearby locations will be inherited together. This occurs because no recombination event occurs between them, so they are said to be linked or in LD. Positions in the genome that are very far apart (or are on separate chromosomes) are inherited independently from one another and are thus considered to be in linkage equilibrium. Evolutionary forces can cause positions to have more correlation than the distance between them would otherwise predict if having certain combinations of alleles at the positions has an effect on reproductive fitness.

2.1.1 Technology

Many studies are interested in scanning the entire genome to detect which regions of the genome are associated with some trait or disease. Genotyping arrays, also known as SNP chips or genotyping chips, are the most common technology to assay large portions of the genome today. These arrays are built to measure the letter present at one million pre-defined positions in the genome. These positions are also known as markers. At each position, genotyping arrays can measure the presence of two different letters, so positions that are thought to be biallelic are usually chosen. The array outputs the genotype of the individual at each position: whether the individual has two copies of the first letter, two copies of the second letter, one copy of each, or “unknown” if it could not be assayed for technical reasons. The error rate for many chips is on the order of 0.5%; that is, roughly 0.5% of called (not unknown) genotypes are incorrect. The proportion of unknown genotypes, often referred to as missing data, in a single genotyping array experiment is around 1%.

Figure 2.1: Correlation between two positions in the genome is denoted by the color of the cells: white indicates little correlation whereas red indicates high correlation. This figure illustrates that nearby positions tend to have higher correlation than distant positions. Image from [54].

An alternative to genotyping arrays is to use DNA sequencing. Modern sequencing approaches randomly select hundreds of millions or billions of locations in the genome and read approximately 100 basepairs of the genome starting from each of those positions. The positions of each of these sequence “reads” are unknown before sequencing but can be determined after the experiment. The primary advantage of sequencing is that it is not restricted to assaying a pre-defined set of locations in the genome and can measure the DNA sequence at the majority of the genome, potentially revealing changes to the genome unique to each individual. A disadvantage is that it is currently much more expensive than genotyping arrays. Also, despite the fact that sequencing has a relatively low error rate in terms of detecting sequence variation, the fact that 3 billion locations in the genome are assayed can result in a large number of false positives; this can make it difficult to distinguish between sequencing errors and novel variants that are unique to an individual. Another advantage is that by using different biochemical protocols, one can use this sequencing technology to gain a wide range of information. For example, recent work has shown that it can be used to acquire the full chromosomal haplotypes [27][30]. However, these protocols end up costing even more than standard sequencing, so it may be quite some time until they are used as standard practice.

Chapter 3

Relatedness and ancestry

3.1 Family genetics

Consider a family with a mother, father, and their two daughters. Let’s assume that the mother and father are unrelated to each other and that their parents are also unrelated to each other. While two individuals are never absolutely “unrelated” because all humans originate from a common ancestor, this term is often used with some ambiguity as to what is meant by two individuals being related to one another. For now, I will use a simplistic view of relatedness and say that two people are unrelated if they have no common ancestors in the last several hundred years. The four haplotypes of the parents can then be considered distinct strings because they come from unrelated individuals. At each position of the genome, each sibling will have inherited one allele from the two possible paternal alleles, and one allele from the two possible maternal alleles. Because of the random process of recombination, at each position the children will share the same allele from their father 50% of the time and the same allele from their mother 50% of the time, on average. This means that at 25% of positions in the genome they will share two alleles, at 50% of positions they will share one allele (either from the mother or father), and at 25% of positions they will share neither allele. Overall, they will have inherited approximately 50% of the same DNA.

Figure 3.1: This tree shows in the red boxes what percent of DNA is shared between a relative with respect to the person in the orange box. [39]
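The 25% / 50% / 25% sharing pattern for siblings can be verified by enumerating every way the two parents can transmit alleles at a single position. This is a minimal sketch that assumes, as in the text, that each parental allele is transmitted with probability 1/2 independently for each child; the function name is hypothetical.

```python
from fractions import Fraction
from itertools import product


def sibling_sharing_distribution():
    """Probability that two siblings share 0, 1, or 2 alleles at a position."""
    # A transmission is (which paternal allele, which maternal allele).
    transmissions = list(product((0, 1), repeat=2))
    weight = Fraction(1, len(transmissions) ** 2)  # all outcomes equally likely
    dist = {0: Fraction(0), 1: Fraction(0), 2: Fraction(0)}
    for sib1, sib2 in product(transmissions, repeat=2):
        shared = (sib1[0] == sib2[0]) + (sib1[1] == sib2[1])
        dist[shared] += weight
    return dist


dist = sibling_sharing_distribution()
# Expected fraction of DNA shared: shared alleles out of 2 per position.
expected_fraction = sum(k * p for k, p in dist.items()) / 2
print(dist, expected_fraction)
```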

3.1.1 IBD

The segments of the haplotypes where the siblings share the same allele are said to be identical-by-descent (IBD), or that these individuals have identity-by-descent at those positions. This term is used to denote the fact that the reason these two individuals have identical DNA segments is that they inherited them from a common ancestor as opposed to receiving the identical segments due to some other reason (e.g. coincidence or selective pressure). In this work, I will refer to genomic segments that are IBD between two individuals to mean that at least one of their haplotypes is IBD.

Now, if the daughters also have children of their own, these children will be first cousins. Let’s assume that the fathers of these cousins are unrelated to the mothers and unrelated to each other. Then, it turns out that the cousins will share approximately 12.5% of their DNA. Now, let us consider the haplotypes involved to see how this occurs. In the first generation (the grandparents of the cousins), there were four haplotypes: haplotypes 1 and 2 from the mother and haplotypes 3 and 4 from the father. Since the cousins’ fathers are unrelated, the cousins cannot possibly share any DNA on their paternal haplotypes, so we know that 50% of their DNA is completely unshared. Each of the cousins’ mothers has one haplotype that is a mosaic of haplotypes 1 and 2, and another that is a mosaic of 3 and 4, making their entire genomes a mosaic of haplotypes 1-4, with 25% of the genome being composed of each of the four haplotypes. When the cousins inherit their maternal haplotype, it will then be a mosaic of equal parts haplotypes 1-4. Thus, the chance that the two cousins share the same haplotype at any given position is 25%. (This can be seen by choosing the first cousin’s haplotype as a constant and letting the other cousin’s haplotype be randomly chosen; it will then match 25% of the time.) Therefore, 25% of half of the cousins’ genomes are shared, so the resulting total shared DNA is 12.5%.
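The first-cousin argument can be checked by direct enumeration. The sketch below assumes, as in the text, that at any position each cousin’s maternal haplotype is equally likely to derive from any of the four grandparental haplotypes and that the paternal haplotypes are never shared; the variable names are hypothetical.

```python
from fractions import Fraction
from itertools import product

GRANDPARENT_HAPLOTYPES = (1, 2, 3, 4)

# All equally likely (cousin 1 origin, cousin 2 origin) pairs at one position.
origin_pairs = list(product(GRANDPARENT_HAPLOTYPES, repeat=2))
p_maternal_match = Fraction(sum(a == b for a, b in origin_pairs),
                            len(origin_pairs))

# Only the maternal half of each cousin's genome can be shared.
shared_fraction = Fraction(1, 2) * p_maternal_match
print(p_maternal_match, shared_fraction, float(shared_fraction))
```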

This trend will continue: if the cousins have children, those children will share only 3.125% of their DNA. This small example shows that the amount of shared DNA between two individuals rapidly decreases with each successive generation (assuming only unrelated people have children). Furthermore, it should be noted that the pattern of shared DNA also changes. Specifically, with greater familial separation, both the number and size of contiguous shared DNA segments decrease. It should also be noted that, due to randomness in the process of recombination, there is significant variance in the number of IBD segments and their lengths. In an extreme example of this phenomenon, consider siblings whose genomes were inherited after no recombination occurred (a scenario that is extremely unlikely, but theoretically possible). These siblings could share no IBD in their entire genome if, for every chromosome, they received the opposite haplotype from both the mother and father. This would essentially make the siblings genetically unrelated to one another. On the other extreme, if they happened to receive the same haplotype for every chromosome from both the mother and father, they would be genetically twins. In reality, these scenarios do not occur, but they illustrate how random chance can influence the patterns of IBD occurring between two individuals. This variance is observed commonly in more distant relationships; when the expected percentage of shared DNA is low, the two individuals often share no DNA at all, making them genetically unrelated to each other.

Relationship                    Number of segments    Segment size (cM)
Parent/child                    23-29                 3539-3748
1st cousins                     17-32                 548-1038
1st cousins once removed        12-23                 248-638
2nd cousins                     10-18                 101-378
2nd cousins once removed        4-12                  43-191
3rd cousins                     2-6                   43-150
3rd cousins once removed        1-4                   11-99
4th and more distant cousins    0-2                   ≤ 5-50

Table 3.1: Expected number and size of IBD segments based on relationship [40]

Even though the chance that distant relatives share any DNA is often very small, we can compute the expected size of the segment given that they do share an IBD segment. Assuming IBD begins at a position, it will end once a recombination event occurs, bringing in DNA from a non-shared relative. Thus, we simply need to compute the expected distance before a recombination event occurs at some point in the generations separating the two individuals. Alternatively, this can be thought of as the expected distance between recombination events among all events occurring in the generations between them. This can be computed by

\[ \text{Segment size (in cM)} = \frac{100}{\text{Number of generations}}. \]

For example, consider 7th-cousins: each one will be 8 generations away from the common ancestor for a total distance of 16 generations. Thus, if an IBD segment existed, its expected size would be 100/16 = 6.25 cM.
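The calculation above can be sketched in a few lines of Python. This is only an illustration of the formula; the function names are hypothetical, and the cousin-to-generation conversion assumes that n-th cousins are each n + 1 generations away from their common ancestor, as in the 7th-cousin example.

```python
def expected_ibd_segment_cm(generations_apart: int) -> float:
    """Expected IBD segment length in cM, given the total number of
    generations separating two individuals (100 / number of generations)."""
    if generations_apart <= 0:
        raise ValueError("generations_apart must be positive")
    return 100.0 / generations_apart


def generations_between_cousins(degree: int) -> int:
    """Assumption: n-th cousins are each n + 1 generations from the
    common ancestor, so 2 * (n + 1) generations separate them."""
    return 2 * (degree + 1)


# 7th cousins: 16 generations apart in total.
print(expected_ibd_segment_cm(generations_between_cousins(7)))  # 6.25
```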

3.2 Ancestry

In the above discussion I made the implicit assumption that every individual finds a completely unrelated partner when producing children. In reality, humans often live within or are descended from populations of people with similar genetic background and thus choose their partners from within their population. These isolated populations occur due to a variety of “barriers to genetic exchange” [57] such as geography, language, and cultural background. Because members of these populations tend to find partners from within the population, over time, the populations tend to result in individuals with similar genetic background due to common descent from a pool of distant ancestors called the ancestral population. These people are said to have common ancestry or ancestral origin.

The term “ancestry” is used at many levels depending on the context. For example, one can define ancestry based on continental origin, where the ancestral populations existed in Asia, Africa, or Europe. Or one can define it with narrower geography, defining ancestral populations that existed in the past within certain countries or even regions within countries. Further still, ancestry can be based on ethnolinguistic groups like the Ashkenazi Jewish population, which was geographically distributed but operated very similarly to a geographically-isolated population [24].

Individuals can also have multiple ancestral origins if they have two ancestors from different ancestral populations. These individuals are said to be admixed because their genomes are composed of a mixture of different genetic backgrounds. Someone with parents from two different ancestral populations will have a simple pattern: one copy of the genome originating from one ancestral background and another copy from the other. When admixed individuals mate, they pass on a more complex mosaic of ancestral background to their offspring due to recombination. There are many so-called admixed populations that are composed of individuals who are considered admixed. The classic example is that of African Americans, who have African ancestry as well as some European ancestry. When admixed populations arise, their genomes tend to have long stretches of same-ancestry segments. However, after many generations, with more and more recombination, the population’s genomes will tend to have shorter and shorter stretches of same-ancestry segments.

Chapter 4

Alloy

4.1 Introduction

In this chapter I will describe Alloy, a method for inferring the ancestral origin of segments of the genome based on genotype data.

Determining the ancestral origin of chromosomal segments in admixed individuals is a problem that has been addressed by several methods [44, 56, 55, 46, 42, 6]. The development of these methods was motivated by various applications such as studying population migration patterns [29, 19], increasing the statistical power of association studies by accounting for population structure [43], and enhancing admixture-mapping [61, 53] for both disease-gene mapping and personalized drug therapy applications [5]. The ability to accurately infer ancestry is important in genome-wide association studies (GWAS). These studies are based on the premise that a homogeneous population sample was collected. Population stratification, however, poses a significant challenge in association studies; the existence of different sub-populations within the examined cases and controls can yield many spurious associations originating from the population sub-structure rather than the disease status. Inferred sub-structure within the population enables the correction for this effect, consequently improving the statistical power of these studies. A second disease-gene mapping technique that benefits from an accurate inference of ancestry is admixture-mapping [61]. This statistically powerful and efficient method identifies genomic regions containing disease susceptibility genes in recently admixed populations, which are populations formed from the merging of several distinct ancestral populations (e.g., African Americans). The statistical power of admixture-mapping increases as the disease prevalence exhibits a greater difference between the ancestral populations from which the admixed population was formed. Admixed individuals carrying such a disease are expected to show an elevated frequency of the ancestral population with the higher disease risk near the disease gene loci. Hence, the effectiveness of this method relies on the ability to accurately infer the ancestry along the chromosomes of admixed individuals.

4.1.1 Previous work

The problem of ancestry inference is commonly viewed at one of two levels: (a) at the global scale, predicting an individual’s single origin out of several possible homogeneous ancestries, or determining an individual’s ancestral genomic composition; (b) at the finer local scale, labeling the different ancestries along the chromosomes of an admixed individual. In the context of local ancestry inference, most previous methods are based on hidden Markov models (HMM), where the hidden states correspond to ancestral populations and generate the observed genotypes. The work of Patterson et al. [44] employed such an HMM, integrated into a Markov chain Monte Carlo (MCMC) procedure, for estimating ancestry along the genome. The method accounted for uncertainties in model parameters such as the number of generations since admixture, admixture proportions, and ancestral allele frequencies. For simplicity, the work assumed that, given the ancestry, the sampled markers are in linkage equilibrium (i.e., independent). This assumption was then relaxed in the work by Tang et al. [56], applying a Markov hidden Markov model (MHMM) to account for the dependencies between neighboring markers as exhibited within the ancestral populations. While the modeled first-order Markovian dependencies accounted for some of the linkage disequilibrium (LD) between markers, the complexity of the linkage patterns presented an opportunity for more accurate LD models that would yield better performance in inferring local ancestry. The explicit use of ancestral haplotypes, in methods such as HAPAA [55] and HAPMIX [46], enabled a more comprehensive account of background LD (i.e., LD within the ancestral population) over longer segments. In these methods, the hidden states corresponded to specific ancestral haplotypes, and the transitions between the states corresponded to intra-population mixture and inter-population admixture processes. While efficient inference algorithms were applied, the model size grew linearly with the number of parental individuals, and the time complexity grew quadratically with the number of parental individuals for the case of genotype-based analysis. The time complexity of such an analysis became prohibitively high with more than a modest number of model individuals.

Other work explored window-based techniques, in which a simple ancestral composition was assumed to occur within a window (i.e., at most a single admixture event within an examined segment). LAMP [52], and its extension WINPOP [42], used a naïve Bayes approach, assuming markers within a window are independent given ancestry, applying the inference over a sliding window. Although LD was not modeled, the methods demonstrated an accuracy superior to methods that did account for background LD. An additional window-based framework was developed in [6], decoupling the admixture process from the background LD model. Chromosomal ancestral profiles were efficiently enumerated using a dynamic-programming (DP) technique, enabling the instantiation of various LD models for the single-ancestry segments from which a profile was composed. Multiple LD models were studied within the framework, showing that higher-order LD models yield an increase in inference accuracy.

4.1.2 Overview of methods and results

In this work we describe Alloy, a novel local ancestry inference method that enables the incorporation of complex models for linkage disequilibrium in the ancestral populations. Alloy applies a factorial hidden Markov model (FHMM) to capture the parallel process producing the maternal and paternal admixed haplotypes. We model background LD in ancestral populations via an inhomogeneous variable-length Markov chain (VLMC). The states in our model correspond to ancestral haplotype clusters, which are groups of haplotypes that share local structure within a chromosomal region, as in [10]. In our method, each ancestral population is described by a separate LD model that locally fits the varying LD complexities along the genome. We provide an inference algorithm that is sub-cubic in the maximal number of haplotype clusters at any position. This allows Alloy to scale well when analyzing admixtures of more than two populations, or incorporating more elaborate LD models.

We demonstrate through simulations that Alloy accurately infers the position-specific ancestry in a wide range of complex and ancient admixtures. For instance, Alloy achieves 87% accuracy on a 3-population admixture between individuals sampled from Yoruba in Ibadan, Nigeria, Maasai in Kinyawa, Kenya, and northern and western Europe. Our results represent substantial improvements over the previous state of the art. Further, we explore the landscape of background LD models, and find that the highest performance is achieved by LD models that lie between models that assume independence of markers and models that explicitly use the reference haplotypes. Finally, our results demonstrate that as more samples representing the ancestral populations become available, our LD models improve and enable more accurate local ancestry inference.

4.2 Methods

We consider the problem of local ancestry inference, defined as labeling each genotyped position along the genome of an admixed individual with its ancestry. Here, admixture is assumed to follow the hybrid isolated model [34], in which a single past admixture event mixing $K$ ancestral populations with proportions $\pi = (\pi_1, ..., \pi_K)$ is followed by $g$ generations of consecutive random mating. For clarity, we assume that a set of $L$ bi-allelic SNPs was observed along the genome of an individual; we relax the bi-allelic marker assumption in the Discussion section. Furthermore, at each position $l$, we define a state space of haplotype clusters $A_l$, each of which represents a collection of ancestral haplotypes that share a common local structure (i.e., allelic sequence surrounding a particular location). It immediately follows that each such haplotype cluster $a_l \in A_l$ at location $l$ is mapped to a single allele, denoted by $e(a_l) \in \{0, 1\}$. In our model, each of the $K$ populations is represented by a separate mutually-exclusive subset of haplotype clusters. We denote by $\mathrm{anc}(a_l)$ the ancestry, out of $K$, of a particular haplotype cluster $a_l \in A_l$. We denote by $H_l^m, H_l^p$ the (hidden) haplotype cluster memberships drawn from $A_l$ on the maternal and paternal haplotypes at position $l$, respectively, and by $G_l \in \{0, 1, 2\}$ the genotype observed at the same marker position, representing the minor allele count. The vectors of haplotype cluster memberships and genotypes across all $L$ marker positions are denoted by $H^{\{m,p\}} = (H_1^{\{m,p\}}, H_2^{\{m,p\}}, ..., H_L^{\{m,p\}})$ and $G = (G_1, G_2, ..., G_L)$, respectively.

4.2.1 Factorial hidden Markov model

We use a factorial hidden Markov model (FHMM) [18] to statistically model the dual mosaic ancestral pattern along the genome of an admixed individual, as depicted in Figure 4.1. In factorial HMMs, which are equivalent in expressive power to hidden Markov models (HMM) [48], the single chain of hidden variables is replaced by a chain of a hidden vector of independent factors. In our application, the FHMM representation allows us to naturally decouple the state space into two parallel dynamic processes generating $H^m$ and $H^p$, pertaining to the presumably independent maternal and paternal admixture processes, and producing the single composed admixed offspring $G$. The decomposition of the state space into independent processes allows efficient inference by leveraging the structure in the compound state transition probabilities. In our model, the values of $H_l = (H_l^m, H_l^p)$ at a specific position $l$ are drawn from the Cartesian product $A_l \times A_l$, corresponding to the alleles within specific ancestral haplotypes originating from a restricted prior set of $K$ hypothesized ancestral populations. Note that $A_l$ extends the notion of an allele, which is simply a binary variable, to an allele within an ancestral haplotype; for position $l$, multiple states in $A_l$ may correspond to the same allele.

Figure 4.1: A factorial hidden Markov model capturing the parallel admixture processes generating the maternal and paternal haplotypes, and giving rise to the sampled genotypes of the admixed offspring. (a) A graphical model depicting the conditional independencies in our model. Each variable in the hidden chains $H_l^{\{m,p\}}$ corresponds to a haplotype cluster membership, and $G_l$ corresponds to the observed genotype at location $l$. (b) The state space $A_l$ for a particular location $l$ along the genome. Each ancestry is modeled by an independent set of haplotype cluster membership states, and each such state can emit a single allele. Edges in the illustration connecting states correspond to intra-population observed transitions, namely local haplotypic sequences that were frequent in the corresponding ancestral population. Edges corresponding to admixture transitions, connecting states of different ancestries, are omitted from this illustration for clarity.

4.2.2 Inference with the forward-backward algorithm

To infer local ancestry, we first compute the posterior marginals given the sampled genotypes, $P(H_l^m, H_l^p \mid G)$, by applying the forward-backward algorithm

\[ P(H_l^m = a_l, H_l^p = a'_l \mid G) \propto \alpha_l(a_l, a'_l) \cdot \beta_l(a_l, a'_l) \tag{4.1} \]

where $\alpha_l(a_l, a'_l) = P(G_1, ..., G_l, H_l^m = a_l, H_l^p = a'_l)$ and $\beta_l(a_l, a'_l) = P(G_{l+1}, ..., G_L \mid H_l^m = a_l, H_l^p = a'_l)$. A naive recursive computation of $\alpha$ and $\beta$ yields the time complexity $O(|A_1|^2 + \sum_{l=2}^{L} |A_{l-1}|^2 \cdot |A_l|^2)$, as the transition from each pair of haplotype cluster memberships to each consecutive pair of haplotype cluster memberships is explicitly assessed. However, the dependency structure of FHMMs allows for a more efficient recursive computation of $\alpha$ and $\beta$, as described in [18]. Specifically, $\alpha$ is computed in the forward direction in three steps as follows

\[
\begin{aligned}
\alpha_{l-1}^{m}(a_l, a'_{l-1}) &= \sum_{a_{l-1} \in A_{l-1}} \alpha_{l-1}(a_{l-1}, a'_{l-1}) \cdot P(H_l^m = a_l \mid H_{l-1}^m = a_{l-1}) \\
\alpha_{l-1}^{p}(a_l, a'_l) &= \sum_{a'_{l-1} \in A_{l-1}} \alpha_{l-1}^{m}(a_l, a'_{l-1}) \cdot P(H_l^p = a'_l \mid H_{l-1}^p = a'_{l-1}) \\
\alpha_l(a_l, a'_l) &= \alpha_{l-1}^{p}(a_l, a'_l) \cdot P(G_l \mid H_l^m = a_l, H_l^p = a'_l),
\end{aligned}
\tag{4.2}
\]

namely, advancing on the maternal track, followed by advancing on the paternal track, and finally incorporating the local observation by multiplying by the emission probability $P(G_l \mid H_l^m = a_l, H_l^p = a'_l)$. Similarly, $\beta$ is computed in a backward recursion as

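\[
\begin{aligned}
\beta_l^{e}(a_l, a'_l) &= \beta_l(a_l, a'_l) \cdot P(G_l \mid H_l^m = a_l, H_l^p = a'_l) \\
\beta_l^{m}(a_{l-1}, a'_l) &= \sum_{a_l \in A_l} P(H_l^m = a_l \mid H_{l-1}^m = a_{l-1}) \cdot \beta_l^{e}(a_l, a'_l) \\
\beta_{l-1}(a_{l-1}, a'_{l-1}) &= \sum_{a'_l \in A_l} P(H_l^p = a'_l \mid H_{l-1}^p = a'_{l-1}) \cdot \beta_l^{m}(a_{l-1}, a'_l).
\end{aligned}
\tag{4.3}
\]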

To complete the description, we define $\alpha_1(a_1, a'_1) = P(H_1^m = a_1) \cdot P(H_1^p = a'_1) \cdot P(G_1 \mid H_1^m = a_1, H_1^p = a'_1)$ and $\beta_L(a_L, a'_L) = 1$. When computing $\beta$, advancing on the maternal track takes $(|A_{l-1}| \cdot |A_l|) \cdot |A_l|$ time, while advancing on the paternal track takes $(|A_{l-1}| \cdot |A_{l-1}|) \cdot |A_l|$ time, as determined by the size of the corresponding composite state space. Similarly, a single forward step $\alpha_l$ is computed in $|A_l| \cdot |A_{l-1}| \cdot (|A_l| + |A_{l-1}|)$ time. Hence, the time complexity is now reduced to $O(|A_1|^2 + \sum_{l=2}^{L} |A_l| \cdot |A_{l-1}| \cdot (|A_l| + |A_{l-1}|))$.
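To make the saving concrete, the following minimal sketch compares a naive forward step, which sums over every previous pair of states, with the three-step factored update of the forward recursion. It is an illustration rather than Alloy's implementation: the function names and toy numbers are hypothetical, and it assumes that the maternal and paternal tracks share one transition table `trans`, with `trans[i][a]` holding the probability of moving from cluster i at position l-1 to cluster a at position l.

```python
from itertools import product


def forward_step_naive(alpha_prev, trans, emit):
    """alpha_l by summing over every previous (maternal, paternal) pair.
    Cost per step: |A_{l-1}|^2 * |A_l|^2."""
    n_prev, n_cur = len(trans), len(trans[0])
    return [
        [
            emit[a][b] * sum(
                alpha_prev[i][j] * trans[i][a] * trans[j][b]
                for i, j in product(range(n_prev), repeat=2)
            )
            for b in range(n_cur)
        ]
        for a in range(n_cur)
    ]


def forward_step_factored(alpha_prev, trans, emit):
    """Same alpha_l in three steps.
    Cost per step: |A_l| * |A_{l-1}| * (|A_l| + |A_{l-1}|)."""
    n_prev, n_cur = len(trans), len(trans[0])
    # Step 1: advance the maternal track; alpha_m[a][j] pairs the new
    # maternal state a with the old paternal state j.
    alpha_m = [
        [sum(alpha_prev[i][j] * trans[i][a] for i in range(n_prev))
         for j in range(n_prev)]
        for a in range(n_cur)
    ]
    # Step 2: advance the paternal track.
    alpha_p = [
        [sum(alpha_m[a][j] * trans[j][b] for j in range(n_prev))
         for b in range(n_cur)]
        for a in range(n_cur)
    ]
    # Step 3: multiply in the emission probability of the observed genotype.
    return [[alpha_p[a][b] * emit[a][b] for b in range(n_cur)]
            for a in range(n_cur)]


# Toy inputs (hypothetical numbers): 2 clusters at l-1, 3 clusters at l.
alpha_prev = [[0.1, 0.2], [0.3, 0.4]]
trans = [[0.5, 0.3, 0.2], [0.1, 0.6, 0.3]]
emit = [[0.98, 0.01, 0.01], [0.01, 0.98, 0.01], [0.01, 0.01, 0.98]]

naive = forward_step_naive(alpha_prev, trans, emit)
factored = forward_step_factored(alpha_prev, trans, emit)
```

Both functions return the same table up to floating-point rounding; only the amount of work differs.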

To model genotyping error, the emission probability $P(G_l \mid H_l^m, H_l^p)$ used in Equations 4.2 and 4.3 is defined as follows

\[
P(G_l \mid H_l^m = a_l, H_l^p = a'_l) =
\begin{cases}
1 - 2\epsilon, & e(a_l) + e(a'_l) = G_l \\
\epsilon, & \text{otherwise}
\end{cases}
\tag{4.4}
\]

where $\epsilon$ corresponds to the genotyping error rate.
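As a small sanity check on the emission model, the sketch below implements it as a function and confirms that the three possible genotypes receive probabilities summing to one. The function name and the default error rate are illustrative choices, not part of the original method description.

```python
def emission_prob(genotype: int, allele_m: int, allele_p: int,
                  eps: float = 0.01) -> float:
    """P(G_l | maternal allele, paternal allele) under the error model:
    1 - 2*eps when the alleles add up to the genotype, eps otherwise."""
    if genotype not in (0, 1, 2):
        raise ValueError("genotype must be 0, 1 or 2")
    return 1.0 - 2.0 * eps if allele_m + allele_p == genotype else eps


# A heterozygous pair of alleles explains G = 1 with probability 1 - 2*eps
# and each of the two remaining genotypes with probability eps.
probs = [emission_prob(g, 0, 1) for g in (0, 1, 2)]
```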

To increase the numerical stability in the forward-backward computation, scaling is applied. Specifically, $\alpha_l$ and $\beta_l$ are scaled by $s_l = \sum_{a_l, a'_l} \alpha_l(a_l, a'_l)$ as follows

\[
\alpha_l^*(a_l, a'_l) = \frac{\alpha_l(a_l, a'_l)}{s_l}, \qquad
\beta_l^*(a_l, a'_l) = \frac{\beta_l(a_l, a'_l)}{s_l}. \tag{4.5}
\]

Next, the unordered ancestry pair $\{Z_l^1, Z_l^2\}$ at location $l$ is called by determining the maximal a posteriori assignment

\[
\{\hat{Z}_l^1, \hat{Z}_l^2\} = \arg\max_{Z_l^1, Z_l^2}
\sum_{\substack{a_l, a'_l \ \text{s.t.} \\ \{\mathrm{anc}(a_l), \mathrm{anc}(a'_l)\} = \{Z_l^1, Z_l^2\}}}
\alpha_l^*(a_l, a'_l) \cdot \beta_l^*(a_l, a'_l) \tag{4.6}
\]

where, for each $\{Z_l^1, Z_l^2\}$ pair, we sum over all $(a_l, a'_l)$ haplotype cluster membership pairs that are consistent in their ancestry with the unordered ancestry pair $\{Z_l^1, Z_l^2\}$.
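The calling step can be illustrated with a short sketch that pools posterior mass by unordered ancestry pair and returns the pair with the largest total. The function name, the population labels, and the toy posterior table are hypothetical; `posterior[a][b]` stands for the product of the scaled forward and backward values for the cluster pair (a, b).

```python
from collections import defaultdict


def call_ancestry_pair(posterior, anc):
    """Return the unordered ancestry pair with the largest summed posterior.

    posterior[a][b]: scaled alpha * beta for maternal cluster a and
                     paternal cluster b at one position.
    anc[a]:          ancestry label of haplotype cluster a.
    """
    totals = defaultdict(float)
    for a, row in enumerate(posterior):
        for b, mass in enumerate(row):
            # Sorting the two labels makes (X, Y) and (Y, X) the same key.
            totals[tuple(sorted((anc[a], anc[b])))] += mass
    return max(totals, key=totals.get)


# Toy example: clusters 0 and 1 belong to YRI, cluster 2 to CEU.
anc = ["YRI", "YRI", "CEU"]
posterior = [
    [0.05, 0.05, 0.20],
    [0.05, 0.05, 0.10],
    [0.25, 0.15, 0.10],
]
print(call_ancestry_pair(posterior, anc))  # ('CEU', 'YRI')
```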

4.2.3 Transition probabilities

We proceed by describing the transition probabilities $P(H_l \mid H_{l-1})$. Let $R_l$ be defined as the event in which at least one post-admixture recombination occurred between position $l-1$ and position $l$ since the first population admixture event, and let $\bar{R}_l$ be defined as the complementary event. The transition probability $P(H_l \mid H_{l-1})$, which captures the process in which an admixed haplotype is generated, mixes the event of intra-ancestral-population transition, $P(H_l \mid H_{l-1}, \bar{R}_l)$, with the event corresponding to the introduction of a new ancestral haplotype, $P(H_l \mid H_{l-1}, R_l)$, as described by

\[
\begin{aligned}
P(H_l \mid H_{l-1}) &= P(R_l) \cdot P(H_l \mid H_{l-1}, R_l) + P(\bar{R}_l) \cdot P(H_l \mid H_{l-1}, \bar{R}_l) \\
&= P(R_l) \cdot P_{\mathrm{anc}(H_l)}(H_l) + P(\bar{R}_l) \cdot P_{\mathrm{anc}(H_l)}(H_l \mid H_{l-1})
\end{aligned}
\tag{4.7}
\]

where $P_{\mathrm{anc}(H_l)}(H_l)$ is the position-specific ancestral haplotype cluster prior, and $P_{\mathrm{anc}(H_l)}(H_l \mid H_{l-1})$ models the transition within the ancestral population $\mathrm{anc}(H_l)$, capturing the background population-specific LD. Namely, if a post-admixture recombination was introduced ($P(R_l)$), a haplotype $H_l$ is sampled based on the local ancestry prior $P_{\mathrm{anc}(H_l)}(H_l)$; if no post-admixture recombination was introduced ($P(\bar{R}_l)$), the next marker is sampled based on the haplotypic structure within population $\mathrm{anc}(H_l)$, as defined by $P_{\mathrm{anc}(H_l)}(H_l \mid H_{l-1})$. Assuming the Hybrid-Isolated model, the probability of post-admixture recombination $P(R_l)$ is approximated via the Haldane function [22]

\[ P(R_l) = 1 - e^{-\phi(g \cdot d_l)} \tag{4.8} \]

where $d_l$ is the genetic distance, in Morgans (M), between marker $l-1$ and $l$, and $g+1$ generations are assumed to have passed since the first admixture event. We note that $\phi(z)$ is defined as a function of the recombination rate $g \cdot d_l$ to enable smoothing; the number of false ancestry changes can be reduced by controlling the probability for recombination (e.g., $\phi(z) = \frac{z}{10}$), overcoming local inaccuracies in ancestry inference due to an imperfect ancestral linkage model. The prior probability of the ancestral haplotype cluster is governed by the mixture proportions $\pi$ and the intra-population haplotype cluster prior $P_{\mathrm{anc}(H_l)}(H_l)$, as given by

\[ P(H_l) = \pi_{\mathrm{anc}(H_l)} \cdot P_{\mathrm{anc}(H_l)}(H_l). \tag{4.9} \]
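The recombination probability and the mixture in the transition model can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the smoothing function is taken to be z divided by 10 as in the example above, and `transition_prob` combines a given cluster prior and a given within-population transition probability exactly as in the second line of Equation 4.7.

```python
import math


def smoothing(z: float) -> float:
    """Example smoothing function: phi(z) = z / 10."""
    return z / 10.0


def recomb_prob(g: float, d_morgans: float, phi=smoothing) -> float:
    """P(R_l) = 1 - exp(-phi(g * d_l)), with d_l in Morgans."""
    return 1.0 - math.exp(-phi(g * d_morgans))


def transition_prob(p_recomb: float, cluster_prior: float,
                    within_pop_trans: float) -> float:
    """Mix the recombination branch (draw from the cluster prior) with the
    no-recombination branch (follow the within-population LD model)."""
    return p_recomb * cluster_prior + (1.0 - p_recomb) * within_pop_trans


# 20 generations, markers 0.05 cM (0.0005 M) apart.
p_smooth = recomb_prob(20, 0.0005)
p_plain = recomb_prob(20, 0.0005, phi=lambda z: z)  # no smoothing
```

With these inputs the smoothed probability is roughly ten times smaller than the unsmoothed one, which is what discourages spurious ancestry switches between neighboring markers.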

4.2.4 Linkage disequilibrium model

Finally, we describe the background model we use to capture the ancestral linkage disequilibrium between markers. The range of explored background LD models is illustrated in Figure 4.2. The most basic models used for ancestry inference assume markers are independent given their corresponding ancestry assignment. An immediate extension which can be captured by our FHMM model incorporates first-order Markovian dependencies to model LD between neighboring markers. However, the model is not limited to first-order dependencies; to capture longer range dependencies between ancestral alleles, the state space $A_l$ from which $H_l$ is drawn can be enriched so as to track ancestral haplotype clusters over a longer range. Specifically, longer range dependencies are effectively translated to additional states that map to specific ancestral local haplotype clusters. Moreover, a different number of states can be introduced at each position, fitting the local ancestral haplotypic complexity. The higher the local complexity is, the more states are used to track dependencies reaching further away. In essence, the model is equivalent to an inhomogeneous VLMC in which regions exhibiting complex LD structures are modeled using longer dependencies (i.e., edges connecting distant nodes in the underlying graphical model). At one extreme, the state space $A_l$ can be constructed assuming a zero-order Markov model (i.e., markers are independent), while at the other extreme, $A_l$ can be extended to have one state per ancestral haplotype instance used in the training phase.

Figure 4.2: The state space $A_l$ over three consecutive locations in different background LD models, pertaining to the marker dependencies exhibited within a single ancestral population. (a) Markers are independent given ancestry. The model contains two states per location, each emitting one of the two possible alleles matching the marginal distribution observed in the ancestral population. (b) First-order Markovian dependency between adjacent markers. The transition between the neighboring states, which correspond to alleles at specific positions, is derived from the conditional probability estimated from the ancestral population sample. (c) Generalized linkage model via haplotype clusters. The number of states at each position corresponds to the number of haplotype clusters, each emitting an allele. The local transition probabilities correspond to the Markovian property by which haplotype cluster membership at a given location $l$ is determined by the cluster membership at the previous location $l-1$. (d) Explicit use of ancestral haplotypes. For each position, the number of states equals the number of training haplotypes, each emitting a single allele observed in the corresponding haplotype.

Linkage model instantiation

An algorithm for fitting inhomogeneous VLMCs was described by Ron et al. [50], and extended by [10] to model haplotypes. We apply BEAGLE, an implementation of this procedure, to empirically model the local haplotypic structure. Specifically, we determine both the state space $A_l$ as well as the transition probabilities through the use of the localized haplotype cluster model described in [10]. Briefly, given a set of training haplotypes from a single ancestry, the algorithm processes the markers in chromosomal order. With each additional marker considered, nodes, each representing some history of allele sequences, are split by considering the subsequent alleles for each such node. Then, nodes at location $l$ are merged based on a Markov criterion roughly guaranteeing that, given the cluster membership at position $l$, prior cluster memberships are irrelevant for the prediction of subsequent cluster memberships. Namely, given some parameter $t$, two clusters at position $l$ are merged if the probabilities of allele sequences at markers $l+1, l+2, ..., l+t$ resemble each other. For each population $anc$, the procedure yields a weighted directed acyclic graph (DAG), where edges are labeled by alleles, and each training haplotype traces a path through the graph from a root node to a terminal node, defining the weights. For each edge $e_l^i$ at location $l$, the weight $w_l^i$ is defined as the number of haplotypes in the ancestral population sample that pass through the $i$-th cluster. In our model, the state space $A_l^{anc} \subset A_l$ for population $anc$ at location $l$ is defined so that each edge $e_l^i$ in the weighted DAG corresponds to the state $a_l^{anc,i}$. We denote the source node of each edge $e_l^i$ by $s_l^i$, and its target by $t_l^i$. The prior $P_{anc}(H_l)$ and transition probabilities $P_{anc}(H_l \mid H_{l-1})$ used in Equations 4.7 and 4.9 can be computed as follows

\[
P_{anc}(H_l = a_l^{anc,i} \mid H_{l-1} = a_{l-1}^{anc,j}) =
\begin{cases}
\dfrac{w_l^i}{\sum_{k \ \text{s.t.} \ t_{l-1}^k = s_l^i} w_{l-1}^k}, & \text{if } t_{l-1}^j = s_l^i \\
0, & \text{otherwise}
\end{cases}
\tag{4.10}
\]

\[
P_{anc}(H_l = a_l^{anc,i}) = \frac{w_l^i}{\sum_j w_{l-1}^j}. \tag{4.11}
\]

The process is repeated for each ancestry separately, producing the population-specific $P_{\mathrm{anc}(a_l)}(H_l = a_l)$ and $P_{\mathrm{anc}(a_l)}(H_l = a_l \mid H_{l-1} = a_{l-1})$.
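To illustrate how the cluster priors and transitions follow from the edge weights, here is a minimal sketch for one population and two consecutive levels of a toy DAG. The edge representation (tuples of source node, target node, weight) and the function names are assumptions made for this example; they do not mirror BEAGLE's actual output format.

```python
def cluster_prior(edges_prev, edges_cur):
    """Prior of each edge at level l: its weight divided by the total
    weight at level l-1 (the number of training haplotypes)."""
    total = sum(w for _, _, w in edges_prev)
    return [w / total for _, _, w in edges_cur]


def transition_matrix(edges_prev, edges_cur):
    """trans[j][i] = P(edge i at level l | edge j at level l-1).
    Non-zero only when edge j ends at the node where edge i starts."""
    inflow = {}
    for _, target, w in edges_prev:
        inflow[target] = inflow.get(target, 0) + w
    return [
        [w_i / inflow[src_i] if tgt_j == src_i else 0.0
         for (src_i, _, w_i) in edges_cur]
        for (_, tgt_j, _) in edges_prev
    ]


# Toy DAG built from 10 training haplotypes (hypothetical counts).
edges_prev = [("root", "n0", 6), ("root", "n1", 4)]
edges_cur = [("n0", "m0", 4), ("n0", "m1", 2), ("n1", "m1", 3), ("n1", "m2", 1)]

prior = cluster_prior(edges_prev, edges_cur)      # [0.4, 0.2, 0.3, 0.1]
trans = transition_matrix(edges_prev, edges_cur)
```

Because every training haplotype passes through exactly one edge per level, each row of `trans` sums to one and the priors sum to one.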

4.3 Results

4.3.1 Simulation of admixed individuals and training the background LD models

We evaluated the performance of Alloy for local ancestry inference. In our experiments, we simulated admixed individuals and trained Alloy’s background model using data from six HapMap [4] populations: individuals from the Centre d’Etude du Polymorphisme Humain collected in Utah, USA, with ancestry from northern and western Europe (CEU); Han Chinese in Beijing, China (CHB); Japanese in Tokyo, Japan (JPT); Yoruba in Ibadan, Nigeria (YRI); Maasai in Kinyawa, Kenya (MKK); and Tuscans in Italy (Toscani in Italia, TSI). To expedite the experiments, we used all SNPs present in the HapMap Phase III panel on the first arm of chromosome 1. We partitioned the HapMap data such that 100 individuals from each population were used as training data, and the remainder were used as test data to evaluate the performance of our method. We used haplotypes from the test set to simulate admixed individuals for 6 different combinations of ancestral populations: YRI-MKK (YM), CHB-JPT (CJ), YRI-MKK-CEU (YMC), CHB-JPT-CEU (CJC), YRI-MKK-CEU-CHB (YMCC), and CHB-JPT-CEU-YRI (CJCY). Each test data set contained 100 simulated admixed individuals. In this section, we use $g_{sim}$ and $\pi_{sim}$ to denote the parameters used for simulation, and $g$ and $\pi$ to denote the parameters used for inference.

Simulation

Each simulated admixed individual was generated by traversing the set of markers in chromosomal order, generating a pair of admixed maternal and paternal haplotypes in parallel. The initial pair of ancestries and alleles, corresponding to the first marker, was randomly selected based on the prior ancestral admixture proportions $\pi_{sim}$. Alleles were then copied from the ancestral reference haplotypes. With each subsequent marker, the probability for an admixture-related recombination was evaluated via Equation 4.8. In case of a recombination, a new ancestral source was selected using the $\pi_{sim}$ admixture proportions and the copying process continued. We used the BEAGLE package [10] with default parameters to phase the training and testing individuals separately. Next, the ancestral background LD model states and parameters were determined through Equations 4.10 and 4.11 by examining BEAGLE’s DAG output.
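The copying process described above can be sketched for a single haplotype as follows. This is an illustration under several assumptions that the text does not spell out: the function and argument names are hypothetical, a new source haplotype is drawn uniformly within the chosen population, and the recombination probability uses the unsmoothed form with phi(z) = z.

```python
import math
import random


def simulate_admixed_haplotype(ref_haps, pi, g, dists, rng):
    """Simulate one admixed haplotype by copying from reference haplotypes.

    ref_haps: dict mapping population -> list of haplotypes (lists of 0/1).
    pi:       dict mapping population -> admixture proportion.
    g:        number of generations since admixture.
    dists:    dists[l] is the genetic distance in Morgans between markers
              l-1 and l (dists[0] is unused).
    rng:      a random.Random instance, for reproducibility.
    """
    pops = list(pi)
    weights = [pi[p] for p in pops]

    def pick_source():
        pop = rng.choices(pops, weights=weights)[0]
        return pop, rng.choice(ref_haps[pop])

    pop, hap = pick_source()
    alleles, ancestry = [hap[0]], [pop]
    for l in range(1, len(dists)):
        # Assumption: unsmoothed recombination probability, phi(z) = z.
        if rng.random() < 1.0 - math.exp(-g * dists[l]):
            pop, hap = pick_source()
        alleles.append(hap[l])
        ancestry.append(pop)
    return alleles, ancestry


# Toy reference panel: population A carries only 0 alleles, B only 1 alleles,
# so the copied alleles reveal the simulated ancestry directly.
ref = {"A": [[0] * 50, [0] * 50], "B": [[1] * 50]}
dists = [0.0] + [0.01] * 49
alleles, ancestry = simulate_admixed_haplotype(
    ref, {"A": 0.5, "B": 0.5}, g=20, dists=dists, rng=random.Random(0))
```

A full simulated individual would pair two such haplotypes and add their alleles to obtain genotypes.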

Marker selection

To build an efficient background LD model for Alloy, we selected a subset of ancestry informative markers (AIM), which are genetic variants that carry a population-specific characterizing allele distribution and can be used to efficiently distinguish between genetic segments of different origins. In order to select the set of ancestry informative markers, we used the Shannon Information Content (SIC) criterion [51]. Namely, for a given set of markers and their corresponding allele distribution in the ancestral populations, we measured the mutual information (MI) $I(X_l; Z)$ between ancestry $Z$ and allele $X_l$ at position $l$. Using the SIC measurement, we followed the marker selection heuristic presented in [58], choosing a constant number of highly informative markers within a window of fixed size. Specifically, in our simulations, we selected the single most informative marker in windows of 0.05 centimorgans. For the YM, YMC, and YMCC data sets, we used SNPs with the highest MI differentiating the YRI and MKK populations; for the CJ, CJC, and CJCY admixture scenarios, we selected markers with the highest MI when differentiating the CHB and JPT populations. While Alloy’s background LD model was based on a subset of SNPs, inference was performed on all SNPs, calling the ancestry of the excluded SNPs using a nearest marker approach.
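The marker selection heuristic can be sketched in two small functions: one computing the mutual information between ancestry and allele at a bi-allelic marker, and one keeping the highest-scoring marker per window. The function names are hypothetical, and the ancestry prior defaults to uniform, which is an assumption of this sketch rather than something the text specifies.

```python
import math


def mutual_information(allele_freqs, priors=None):
    """I(X; Z) in bits for a bi-allelic marker.

    allele_freqs[z]: frequency of allele 1 in ancestral population z.
    priors[z]:       P(Z = z); assumed uniform when omitted.
    """
    k = len(allele_freqs)
    priors = priors or [1.0 / k] * k
    p1 = sum(pz * f for pz, f in zip(priors, allele_freqs))  # P(X = 1)
    mi = 0.0
    for pz, f in zip(priors, allele_freqs):
        for px_given_z, px in ((f, p1), (1.0 - f, 1.0 - p1)):
            if pz > 0 and px_given_z > 0:
                mi += pz * px_given_z * math.log2(px_given_z / px)
    return mi


def select_markers(positions_cm, scores, window_cm=0.05):
    """Keep the index of the highest-scoring marker in each window."""
    best = {}
    for idx, (pos, score) in enumerate(zip(positions_cm, scores)):
        window = int(pos // window_cm)
        if window not in best or score > scores[best[window]]:
            best[window] = idx
    return sorted(best.values())


# A marker fixed for different alleles in two populations carries 1 bit;
# a marker with identical frequencies carries none.
print(mutual_information([0.0, 1.0]))  # 1.0
selected = select_markers([0.01, 0.03, 0.06, 0.09, 0.12],
                          [0.2, 0.5, 0.4, 0.1, 0.3])
```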

4.3.2 Evaluating Alloy’s accuracy under complex and ancient admixtures.

When performing inference, we modeled the genotyping error rate with $\epsilon = 0.01$, and used $\phi(z) = \frac{z}{10}$ as our smoothing function in Equation 4.8. We compared the performance of Alloy to WINPOP [42], a local ancestry inference platform that has been shown to outperform previous state-of-the-art methods such as SABER [56], HAPAA [55] and HAPMIX [46]. We measured the accuracy of Alloy and WINPOP when inferring local ancestry of simulated admixed individuals under increasingly complex admixtures, and with a varying number of generations since the first admixture event, ranging from recent admixture ($g = 7$) to more ancient admixture ($g = 100$). Accuracy was conservatively measured as the average fraction of SNPs for which the correct ancestry was inferred. As depicted in Figure 4.3, our results show that Alloy’s accuracy is greater than WINPOP’s in nearly all tested scenarios. Our experiments show that applying WINPOP over the full set of markers achieves a higher performance in comparison to analyzing only a subset of ancestry informative markers. WINPOP performs SNP selection prior to inference to conform with its model assumptions, and hence benefits from the larger initial set of markers. We therefore reported WINPOP results corresponding to an analysis applied on the entire HapMap Phase III set of SNPs, rather than the SNP subsets used for training Alloy.

Figure 4.3: (a) The performance of Alloy based on the number of generations since admixture $g_{sim}$ for various admixture configurations with equal ancestral proportions. Alloy was run with $g = g_{sim}$ and $\pi = \pi_{sim}$, and the accuracy was conservatively measured as the fraction of markers for which the exact ancestry pair was inferred. (b) For the same experiments, Alloy was compared to WINPOP by measuring the accuracy ratio between them (Alloy’s accuracy / WINPOP’s accuracy). The results clearly demonstrate Alloy’s superior accuracy in the vast majority of tested admixture configurations, with an increase in performance in more than 86% of the tests.
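The accuracy measure used in these comparisons can be written down in a few lines. This sketch assumes that true and called ancestries are given per marker as pairs of population labels and that order within a pair does not matter; the function name is hypothetical.

```python
def ancestry_accuracy(true_pairs, called_pairs):
    """Fraction of markers whose called unordered ancestry pair matches
    the true unordered ancestry pair exactly."""
    if len(true_pairs) != len(called_pairs) or not true_pairs:
        raise ValueError("inputs must be non-empty and of equal length")
    matches = sum(sorted(t) == sorted(c)
                  for t, c in zip(true_pairs, called_pairs))
    return matches / len(true_pairs)


truth = [("YRI", "CEU"), ("YRI", "YRI"), ("CEU", "YRI")]
calls = [("CEU", "YRI"), ("YRI", "CEU"), ("CEU", "YRI")]
acc = ancestry_accuracy(truth, calls)  # two of three markers match
```

Note that a call with only one of the two ancestries correct counts as an error, which is why the measure is described as conservative.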

4.3.3 Exploring background LD models.

As previously described, the background LD models in Alloy can capture a wide range of complexities, from simpler models such as those used in AncestryMap [44] and SABER, which model zero- and first-order dependencies between markers, respectively, to more complex explicit haplotype models as used by HAPAA and HAPMIX. More importantly, Alloy is able to capture models of intermediate complexity. We explored the performance of Alloy using a range of background LD models with varying complexities. Background models of different complexities were generated by applying BEAGLE on our training data using different values for BEAGLE’s scale parameter, which controls the complexity of the generated DAG underlying Alloy’s model. As scale approaches 0, the model approaches the explicit model used in HAPAA, and as the value of scale grows, the generated model approaches a zero-order model similar to the one used by AncestryMap. The results, shown in Figure 4.4, illustrate that the models of intermediate complexity outperform both the more complex as well as the simpler models used by previous methods.

4.3.4 Measuring robustness to inaccuracies in model parameters.

Our method assumes that the admixture parameters, such as the number of generations $g$ and the admixture proportions $\pi$, are given. When applied to real data, however, the true values of these parameters are unknown. We examined the robustness of Alloy to inaccuracies in model parameters. Specifically, we measured the impact of a misspecified admixture proportion $\pi$ on the accuracy of inference. To test for robustness, we simulated a YM mixture with $\pi_{sim} = (0.5, 0.5)$ and $g_{sim} = 30$, and evaluated Alloy's performance while varying $\pi$ between (0.05, 0.95) and (0.95, 0.05) during inference. Our results indicate that Alloy's performance is robust to inaccuracies in $\pi$: accuracy was highest when $\pi = \pi_{sim}$ and decreased by only 0.0029 to its lowest value at the two extremes (i.e., $\pi = (0.95, 0.05)$ and $\pi = (0.05, 0.95)$). We further evaluated Alloy's performance when $g$ was misspecified. A YM mixture was simulated with $g_{sim} = 20$, $\pi_{sim} = (0.5, 0.5)$. When $g = g_{sim}$, Alloy achieved 73.77% accuracy; for misspecified values of $g$ between 10 and 40, accuracy ranged from 73.41% to 73.88%, respectively.

Figure 4.4: Ancestry inference accuracy as a function of model complexity, as measured by the average number of haplotype clusters under a certain background LD model. The models range from a simplistic assumption of independence on the left, to more explicit models on the right. The plot illustrates that both over-simplification, corresponding to the LD models used in AncestryMap and SABER, and over-specification, corresponding to the models leaning towards those used in HAPAA and HAPMIX, yield reduced performance in comparison to a more generalizing local haplotypic model.

Additionally, we explored the sensitivity of Alloy's performance to different genotyping error rates $\epsilon$. When simulating a YM mixture with $g_{sim} = 20$, $\pi_{sim} = (0.5, 0.5)$ as above, and performing inference with values of $\epsilon \leq 0.025$, Alloy's accuracy was at least 73.20%. Assuming a 5% genotyping error rate ($\epsilon = 0.05$), the accuracy decreased by less than 1%.

4.3.5 Evaluating model accuracy under varying amounts of training data.

Currently, the amount of available genotype data is limited in the number of individuals genotyped and the density of SNPs measured. However, the number of genotyped individuals and the SNP density of genotyping technologies are expected to greatly increase in the near future. To evaluate the effect of training set size on Alloy's performance, we trained our background LD model on sets of individuals with increasing size. Specifically, we derived a model for the YRI and MKK ancestral populations using subsets of the individuals of varying size, and evaluated the inference accuracy. The results, shown in Figure 4.5a, emphasize the importance of training set size to the improved performance, suggesting that as more samples are collected and genotyped, more accurate background models could be derived, yielding a higher level of accuracy.

We further evaluated the performance of Alloy with respect to the number of SNPs used during training. We generated subsets of informative SNPs of various sizes by using different window sizes during the AIM selection phase. To evaluate the importance of using AIMs, we also selected random SNP subsets of matching sizes.

We simulated individuals from a YM admixed population with $g_{sim} = 30$ generations of admixture, evaluating Alloy's accuracy when trained using the different SNP subsets. Figure 4.5b shows that Alloy's accuracy increases as more SNPs are used. The results further demonstrate that Alloy's performance is significantly higher with informative SNPs compared to random ones. The rightmost point in Figure 4.5b corresponds to Alloy's performance when all SNPs are used. These results indicate that using excessive uninformative markers can reduce accuracy in comparison to a model based on informative markers.

4.4 Discussion

Alloy represents the LD structure of each population with a highly expressive model that lies between the simpler first-order Markov hidden Markov model in SABER, and the explicit-haplotype model in HAPAA and HAPMIX. The first advantage of this approach is its improved accuracy compared to either extreme, as shown in Figure 4.4. Additionally, our inference algorithm has higher computational efficiency than explicit-haplotype models. In this work, we derive the population-specific LD structures by generating haplotype clusters through the BEAGLE package. We translate the produced DAG into prior and transition probabilities that define the parameters of our factorial hidden Markov model. In future work, alternatives to BEAGLE can be used for modeling LD; for instance, one can develop ancestry-aware methods that produce LD models that emphasize the structural differences between ancestral populations.

4.4.1 Robustness to different admixture models

In our experiments, we assumed a hybrid isolated (HI) model [34] for simulating admixed individuals. However, other models, such as the continuous gene flow (CGF) model [34], may better correspond with population migration and admixture patterns, and as such will more accurately fit the ancestral mosaic patterns observed in admixed populations. Alloy assumes an HI admixture model. To evaluate the robustness of Alloy to misspecification of admixture models, we measured Alloy's accuracy under the scenario where the admixed individuals were simulated using a CGF model.

Alloy achieved 86.0% accuracy on a YM mixture with π = (0.5, 0.5), g = 10, and a generational donor contribution rate α = 0.01 from both ancestries. These results indicate an approximately 8% increase in accuracy compared to the results achieved when inferring the local ancestry of simulated admixed individuals generated using the HI model and the same values for g and π. The increase in accuracy can be attributed to the fact that CGF generates longer ancestral tracts in comparison to the HI model with the same admixture parameters, and the fact that longer tracts are easier to predict correctly. To explore our model under a more challenging scenario, we further simulated admixed individuals from a YM mixture using the CGF model with an adjusted g such that the average ancestral tract length was equal to the average length under an HI model with the same parameters. Alloy achieved 81.0% accuracy, which is comparable to our previous result for the HI model (79.6% accuracy). We concluded that Alloy is robust to such differences in the underlying admixture model, and can support more realistic admixture models.

4.4.2 Learning admixture parameters

Alloy assumes that the admixture parameters are given. In particular, the number of generations since admixture $g$ and the relative proportions of the ancestral populations $\pi$ are required. We showed through simulations that our method is robust to inaccuracies in the estimation of the admixture proportion. Nonetheless, $\pi$ can be estimated by direct examination of the sampled individuals' genotype likelihood. Alternatively, given a set of individuals representing a particular admixed population, demographic parameters such as admixture time $g$ and ancestral proportions $\pi$ can be derived as a post-processing step. For instance, as suggested in [45] and [23], the length of ancestral tracts can be used to infer changes in migration patterns. In particular, as our method has been shown to be robust to inaccuracies in $\pi$ and $g$, as well as to misspecified admixture models, we can first apply Alloy to accurately infer the individuals' ancestral mosaic. Then, statistics over the inferred ancestral tracts, such as their length and number, can be used in combination with a variety of admixture models to compute the maximum likelihood estimate for parameters such as the time of migration and nature of admixture. To infer these parameters, the method presented in [45] examined the distribution of tracts larger than a given threshold, as shorter tracts cannot be reliably inferred. By leveraging the structure stemming from the ancestral linkage disequilibrium, Alloy can accurately infer shorter ancestral tracts, enabling the observation of more distant admixture events and historical changes in migration rate. We also note that the flexibility of our FHMM enables different admixture times and proportions to be incorporated separately for the maternal and paternal haplotypes. Hence, pedigrees exhibiting very recent complex admixture at the grandparental level can be explicitly modeled. For example, the parameters of our method can be tuned to accurately infer the ancestry of an admixed individual who has one African-American and one Chinese parent. Finally, our model assumes a single genetic map is given, capturing the genetic distance between neighboring SNPs that is shared between all ancestral populations. Previous work showed that more accurate recombination rates can be inferred using admixed populations by observing the ancestral switch points among admixed individuals [59], [25]. As with the methods used to infer admixture parameters, the ancestral mosaic of admixed individuals is first inferred; then, the rate of ancestral switches per position is estimated. While such methods can be used to infer more accurate maps, our experiments have shown that inaccuracies in the estimation of these recombination rates do not have a significant effect on Alloy's ability to infer local ancestry under the examined scenarios.

4.4.3 Time complexity reduction

In the Methods section we described an inference algorithm with a time complexity that depends on the local ancestral LD structure rather than the number of ancestral haplotypes used when training the background model. Specifically, the algorithm's time complexity is $O(L \cdot C^3)$, where $C$ is an upper bound on the number of states in a single position (i.e., $C = \max_l |A_l|$). In our implementation of Alloy, we reduced the time complexity by rearranging the calculations corresponding to the transition probabilities in the forward and backward computations, described by Equations 4.2 and 4.3, respectively. In particular, transitioning between states corresponding to an admixture recombination event can be collapsed into a single term. For instance, when transitioning between states corresponding to different ancestries in the forward iteration, Equation 4.7 is reduced to the term $P(R_l) \cdot P^0(H_l)$. Hence, $\alpha^m_{l-1}(a_l, a_{l-1})$ can be rewritten as

\[ \alpha^m_{l-1}(a_l, a_{l-1}) = P(R_l) \cdot P^0(a_l) + P(\bar{R}_l) \cdot \sum_{a_{l-1} \in A_{l-1}^{anc(a_l)}} P_{anc(a_l)}(a_l \mid a_{l-1}). \]

When such an optimization is applied, the time complexity is reduced to $O(L \cdot C^2 \cdot C_K)$, where $C_K$ is an upper bound on the number of states corresponding to a single population (i.e., $C_K = \max_{l,k} |A_l^k|$). We note that this implementation of Alloy has a practical running time, completing a single experiment as described in the Results section in approximately one minute.
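The collapsing idea can be illustrated on a single hidden chain. The sketch below is not Alloy's implementation: it assumes a transition of the form "recombination term times prior, plus no-recombination term times a within-ancestry transition", and all function and argument names are hypothetical. It shows that summing the recombination term once for all target states gives the same result as the naive double loop, while the remaining sum only visits states of the same ancestry.

```python
def forward_step_naive(alpha_prev, anc, p_recomb, prior, within):
    """One forward step over all (previous, next) state pairs: C * C work."""
    n_states = len(anc)
    out = []
    for b in range(n_states):
        total = 0.0
        for a in range(n_states):
            # Admixture recombination: the next state is drawn from the prior.
            t = p_recomb * prior[b]
            # No recombination: only transitions within one ancestry are allowed.
            if anc[a] == anc[b]:
                t += (1.0 - p_recomb) * within[a][b]
            total += alpha_prev[a] * t
        out.append(total)
    return out


def forward_step_collapsed(alpha_prev, anc, p_recomb, prior, within):
    """Same step; the recombination term is collapsed into one shared sum."""
    alpha_sum = sum(alpha_prev)  # computed once, reused for every target state
    by_anc = {}
    for a, k in enumerate(anc):
        by_anc.setdefault(k, []).append(a)
    out = []
    for b in range(len(anc)):
        # Only states sharing the ancestry of b contribute here: C * C_K work.
        same = sum(alpha_prev[a] * within[a][b] for a in by_anc[anc[b]])
        out.append(p_recomb * prior[b] * alpha_sum + (1.0 - p_recomb) * same)
    return out
```

In the factorial setting described above, the same rearrangement would be applied to one chain while the state of the other chain is held fixed.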

4.4.4 Incorporating additional variation

Our simulations experimented with SNP markers that were found to be polymorphic in 1,184 individuals sampled from 11 populations in the third phase of the HapMap project [4]. However, additional variation exists in these populations beyond the SNPs assayed in this data set. In particular, rare SNPs, which have been found to exhibit little sharing among diverged populations [19] and can therefore act as highly informative markers for ancestry inference, are likely to be missing from the panel. Therefore, as additional rare SNPs are discovered and sampled, we expect the accuracy of Alloy to improve. We further note that the spectrum of human genetic variation ranges beyond SNPs. For instance, copy-number variations (CNVs) and other structural variations constitute a large fraction of the total human genomic variation [2]. As with SNPs, rare CNVs are useful for separating ancestries and have been shown to be more abundant than rare SNPs [29]. Our model is not limited to bi-allelic SNPs, and supports the incorporation of markers of higher variability, such as CNVs, by adjusting Equation 4.4. The construction of the variable-length Markov chain linkage models, either through BEAGLE or other methods, can be extended to take such additional genetic variation into account.

4.4.5 Conclusion

Alloy is a novel method for inferring the local ancestry of admixed individuals, which is an essential task for various applications in human genetics. We have shown that our approach has higher accuracy than the previous state of the art and that its VLMC-based LD model plays a crucial role in its superior performance. Our method is applicable to ancient and complex admixtures, and is capable of separately modeling the maternal and paternal histories. We expect that as the genetic variation of worldwide populations is extensively sampled, Alloy will be able to better characterize the particular histories of examined individuals. Alloy is publicly and freely available at http://alloy.stanford.edu/.

Figure 4.5: (a) Local ancestry inference accuracy as a function of training set size. Varying numbers of individuals were used as representatives of the ancestral populations in the computation of the background LD model, demonstrating increased performance as more samples are used in the training phase. (b) The accuracy of inferring ancestry as a function of the number of markers used. The plot illustrates the significance of using ancestry informative markers in comparison to a randomly chosen set, as for all tested resolutions, the use of the informative set yielded improved performance. The results also indicate that the addition of non-informative markers reduces performance (demonstrated by the right-most data point), as these are assumed to interfere with the construction of an effective background LD model.

Chapter 5

Parente

5.1 Introduction

In this chapter I will describe Parente, a method for detecting relatedness based on genotype data.

Genomic sequence variants such as single-nucleotide variants, insertions, and deletions are constantly being introduced to populations with each generation. As mutation rates are considered to be relatively low [15], and as genetic drift drives allele frequencies to become fixed, it is reasonable to assume that two individuals carrying the same allele have actually inherited it from a common ancestor; in such a case, the alleles can be said to be identical-by-descent (IBD). This strict definition of IBD holds for the majority of evident human germline mutations, and with high probability. Many biological applications, however, are driven by the study of longer shared stretches that cover multiple mutations. Using knowledge of such longer shared segments, inferences can be made regarding ancestry [49], population demographics [24, 33, 38], and, perhaps more important, the location of disease susceptibility genes [3, 37, 7]. For such applications, the alleles of two individuals that were inherited from a recent common ancestor are called IBD, whereas the alleles that simply have the same allelic state but did not originate from a recent common ancestor are called identical-in-state (IIS). Note that alleles that are IBD are also IIS, but multiple independent mutation events can cause two alleles to be IIS but not IBD. It follows that in the case of a recent common ancestor, IBD alleles are harbored within longer segments containing additional IBD alleles; the more recent the common ancestor, the fewer meioses have occurred, and the longer the shared segment. In this work, we describe two individuals as being related to one another if they share an IBD segment from a recent common ancestor.

Identity-by-descent (IBD) inference is defined as the process of detecting genomic segments that were inherited from recent common ancestors in a given set of genotyped individuals. In the problem's simplest form, a pedigree describing the connection between sampled individuals is provided with the genotypes in order to identify the segments. Given the pedigree, a model can be derived to explicitly capture these relationships when the genotypes are examined. The most common model used is based on a factorial hidden Markov model (factorial-HMM) [48, 18] with a hidden state space defined by selector variables that determine the inheritance pattern in the pedigree [32, 1, 20, 28, 35]. More recently, such methods were extended to model linkage disequilibrium (LD) between neighboring markers, enabling the detection of shorter IBD segments [7]. The main use of these models is in the application of genetic linkage analysis. When a hereditary disease is studied in a family of healthy and affected individuals, linkage analysis is applied to identify loci that are associated with the hereditary disease; these loci may contain genes or regulatory elements that increase the probability of having the disease. The premise of linkage analysis is that affected individuals will share an IBD segment around the disease locus, and that this segment is not shared (or less likely to be shared) by healthy individuals [16, 32, 14, 41].

In the large majority of hereditary disease studies, however, the relationship between sampled individuals is unknown. In genome-wide association studies (GWAS), sampled individuals are assumed to be unrelated. However, it is common to have hidden relationships (also known as cryptic relationships) within large sampled cohorts [8, 24, 31].

The accurate detection of IBD segments within these samples enables the correction for the cryptic relationships, for example, by removing related individuals from analysis. Conversely, instead of discarding related individuals, IBD mapping [13, 37, 7] can be applied, directly associating the levels of IBD with phenotype in the process of mapping disease susceptibility genes.

Extensive previous work has focused on developing methods for the accurate detection of IBD segments without using pedigree information. Most commonly, an HMM or a factorial-HMM is applied to infer the IBD segments. Purcell et al. presented PLINK [47], which uses a simple three-state model, counting the occurrences of IBD per position given the observed genotypes of two individuals. In BEAGLE, by Browning and Browning [11], a factorial HMM was developed to phase and simultaneously detect the specific haplotypes that are shared between examined individuals. To improve accuracy, the BEAGLE model captured complex linkage-disequilibrium patterns by extending the state space to accommodate the haplotypic structure found in the data and measuring the patterns' frequencies. In the work by Bercovici et al., the inheritance vector capturing the relationship between two individuals was explicitly modeled, and LD was incorporated via a first-order Markov model at the level of the founders [7]. The explicit modeling of both relationship and LD was shown to significantly improve performance. Similar to others, the work further demonstrated that these accurate inference methods could be used to detect the IBD enrichment evident around disease-gene loci, highlighting the value of IBD detection in the mapping of disease susceptibility genes. Moltke et al. presented a Markov Chain Monte Carlo (MCMC) approach for the detection of IBD regions where segments of chromosomes are iteratively partitioned into sets of identical descent [37]. In the above methods, there exists a tradeoff between accuracy and running time. Nonetheless, in most of these methods the complexity of the analysis is quadratic in the number of individuals, simply because every pair of individuals must be examined for relatedness. GERMLINE, by Gusev et al., aimed to reduce the time complexity of IBD inference at the cost of lower accuracy [21]. The GERMLINE method performs the IBD analysis on phased data. By populating hash tables with segments taken from the phased data, the method efficiently determines potential seeds of segments that are shared between individuals. These segments are then extended to determine if sufficient evidence exists to support IBD between specific pairs of individuals. As GERMLINE requires phased data in order to operate, the individuals are first phased using BEAGLE [12]. In a later work by Browning and Browning, fastIBD [8] was developed to efficiently determine IBD segments between pairs of individuals in large cohorts of thousands of samples in a feasible timeframe. Similar to GERMLINE, fastIBD employed a sliding window approach to allow efficient computation. Pairs of individuals sharing the same state in fastIBD's factorial HMM are considered in the evaluation of subsequent windows; shared segments are extended for pairs of individuals with a high probability of IBD. While GERMLINE provides a more time-efficient solution, previous work has shown the method to have a reduced ability to detect more ancient IBD segments in comparison to more accurate methods such as fastIBD. As phasing can be prohibitive when analyzing extremely large datasets, Henn et al. developed a method aimed at detecting larger IBD segments based on reverse-homozygous positions that does not require phasing [24]. While providing an efficient approach for IBD detection, the method is tuned to detect larger IBD segments in order to achieve the required specificity.
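The hash-table seeding idea described for GERMLINE can be sketched as follows. This is only an illustration of the general technique and not GERMLINE's actual implementation: the input layout, the fixed non-overlapping seed length, and the function name `seed_matches` are assumptions made for the example, and the subsequent extension of seeds into full segments is omitted.

```python
from collections import defaultdict
from itertools import combinations


def seed_matches(haplotypes, seed_len):
    """Find pairs of individuals with an identical phased segment in some window.

    haplotypes: dict mapping (individual_id, strand) -> list of 0/1 alleles.
    Returns a dict mapping (id_a, id_b) -> set of window start indices.
    """
    # Bucket every non-overlapping window by its position and allele content.
    buckets = defaultdict(set)
    for (ind, _strand), hap in haplotypes.items():
        for start in range(0, len(hap) - seed_len + 1, seed_len):
            key = (start, tuple(hap[start:start + seed_len]))
            buckets[key].add(ind)

    # Only individuals that land in the same bucket are compared, which
    # avoids examining every pair at every position.
    matches = defaultdict(set)
    for (start, _segment), inds in buckets.items():
        for a, b in combinations(sorted(inds), 2):
            matches[(a, b)].add(start)
    return matches
```

Each reported window would then serve as a seed to be extended in both directions before a pair is declared to share a segment.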

While advances in IBD detection have been made in recent years, accurately detecting IBD in large cohorts remains a challenge. As the cost of genotyping decreases, the number of genotyped individuals is increasing rapidly, and the genotyping density is growing to include millions of markers per sample. Since many of the accurate methods investigate all pairs of individuals for relatedness, the analysis complexity grows quadratically with the number of individuals in a studied sample. Such challenges require that IBD detection methods have high computational efficiency. More importantly, since the vast majority of examined pairs of individuals are unlikely to be related, an IBD detection method must exhibit extremely high specificity in order to avoid reporting an overwhelming number of false positives.

In this chapter we present Parente, a novel method for the detection of IBD that exhibits high accuracy and can be efficiently used for the analysis of large genotyped cohorts. Parente employs a variant of a likelihood-ratio test along with local thresholding to achieve significantly higher accuracy than the current state of the art. Our method can be applied directly on genotype data, without needing to first phase the genotypes, a step that can be computationally intensive. The primary goal of our method is to efficiently detect which pairs of individuals in large cohorts are related to one another, in feasible time. This is done by finding pairs of individuals that share at least one IBD segment greater than x cM in size. Once these related pairs are identified, one can determine specific IBD segment boundaries as a post-processing step using a more complex IBD detection method of higher computational cost. We further show that Parente can also be directly used for the localization of the IBD segments within the related pairs, providing highly accurate results. Parente was able to successfully detect pairs of related individuals sharing a 6 cM IBD segment (the expected average IBD segment size for 7th cousins) with 90% sensitivity at a $5 \times 10^{-5}$ false positive rate. In the more challenging case of a 4 cM shared segment, it detects related pairs with 86% sensitivity at an $8 \times 10^{-3}$ false positive rate, which represents a 28% relative increase in sensitivity compared to fastIBD, a state-of-the-art method. Finally, we observed that Parente is also an order of magnitude faster than fastIBD. These results highlight the relevance of our method for the accurate and efficient analysis of large cohorts.
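To see why the false positive rate matters so much at cohort scale, the following back-of-the-envelope sketch counts the pairs that must be examined and the false positives expected at a given rate. The cohort size of 10,000 is a hypothetical figure chosen for illustration, and the calculation assumes the simplifying worst case in which every pair is unrelated.

```python
from math import comb


def expected_false_positives(n_individuals, false_positive_rate):
    """Return (number of pairs, expected false positives if all pairs are unrelated)."""
    n_pairs = comb(n_individuals, 2)  # every unordered pair is examined
    return n_pairs, n_pairs * false_positive_rate


# Hypothetical cohort of 10,000 individuals at a 5e-5 false positive rate.
pairs, false_pos = expected_false_positives(10_000, 5e-5)
# pairs == 49_995_000, so roughly 2,500 unrelated pairs would be reported.
```

Under these assumptions, raising the rate to $8 \times 10^{-3}$ for the same cohort would yield on the order of 400,000 false reports, which illustrates the need for very high specificity.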

5.2 Methods

The Parente model employs a window-based approach, whereby multiple consecutive markers are grouped together and their joint probability is estimated given a hypothesized IBD state. Subsequently, the probabilities of multiple non-overlapping windows are merged via a naive Bayes model, producing the probability for the assumed IBD state in a given block of pre-defined length. The block lengths are derived from a target timespan covering common ancestors of interest, and the required accuracy as driven by the application.

Given $N$ individuals sampled over $M$ biallelic markers, let $G$ be defined as the genotype matrix. We use $g_{i,j} \in \{0, 1, 2\}$ to denote the major allele count observed in the $j$th marker of the $i$th individual, and $g_i$ as the vector corresponding to all $M$ genotyped markers sampled for individual $i$. The measured genotypes $G$ are assumed to have originated from a set of $2N$ underlying hidden haplotypes, denoted by the matrix $H$. The maternal and paternal alleles of the $j$th marker in the $i$th individual are marked as $h^m_{i,j} \in \{0, 1\}$ and $h^p_{i,j} \in \{0, 1\}$, respectively, corresponding to the major allele count in each. More broadly, however, we use $h^*_j$ as a symbol to signify one of the alleles at the $j$th marker, corresponding to one of the population haplotypes comprising an individual's genotype. We use $f_j$ to denote the major allele frequency of the $j$th marker in the sampled population. The $M$ markers covering the genome are partitioned into a set of consecutive windows $W = \{w_1, \ldots, w_{M/k}\}$, each of size $k$. We use $m(w)$ to denote the indices of the $k$ consecutive markers within the $w$th window, and $g_{i,m(w)}$ as the partial genotyping vector for individual $i$ corresponding to these $k$ markers. Finally, we define a block $B = \{w_t, \ldots, w_{t+k-1}\}$ as a set of consecutive windows.

For a target IBD block length $l$ (in cM), the Parente method is defined as follows. All $\binom{N}{2}$ pairs of individuals are enumerated. For each pair of individuals, the genome is scanned by sliding a block $B$ across each chromosome, where each block $B$ starts from one of the $M/k$ possible window positions. The examined block $B$ includes all successive windows that contain markers that are at most $l$ cM away from the first marker of the first window in that block. For each such block $B$ and pair of individuals $i, i'$, an aggregated block score $\Lambda_B(g_i, g_{i'})$ is defined as follows:

\[ \Lambda_B(g_i, g_{i'}) = \sum_{w \in B} \log s_w(g_{i,m(w)}, g_{i',m(w)}) \tag{5.1} \]

where $s_w(g_{i,m(w)}, g_{i',m(w)})$ is a window-specific score, computed using the genotypes of the two examined individuals $i, i'$ within an examined window $w$. We call a pair of individuals $i$ and $i'$ IBD in block $B$ whenever $\Lambda_B(g_i, g_{i'}) > T_B$, where $T_B$ is a pre-defined threshold associated with block $B$. We compute this score for each block in the genome and call a pair of individuals related if any block in the genome is called IBD. The threshold $T_B$ is defined such that the false-positive rate is controlled to a desired level. The block score $\Lambda_B(g_i, g_{i'})$ can be efficiently computed along the genome of two individuals. As blocks are scanned, window scores corresponding to windows that are no longer part of the newly examined block $B'$ are subtracted from the current block score $\Lambda_{B'}(g_i, g_{i'})$, and the window scores corresponding to newly joining windows are simply added.
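The incremental block-score update can be sketched with a running sum over log window scores. This is a minimal illustration under stated assumptions rather than Parente's code: each window is summarized by the genetic position (in cM) of its first and last marker, and the names `block_scores`, `start_cm`, `end_cm`, and `is_related` are hypothetical.

```python
def block_scores(log_scores, start_cm, end_cm, block_len_cm):
    """Return the block score for a block starting at each window.

    A block starting at window t contains every successive window whose
    markers lie within block_len_cm of the first marker of window t.
    """
    n = len(log_scores)
    scores = []
    running = 0.0
    hi = 0  # index one past the last window currently inside the block
    for t in range(n):
        if hi < t:  # previous block was empty; restart at the current window
            hi, running = t, 0.0
        # Add the log scores of windows that newly join the block.
        while hi < n and end_cm[hi] - start_cm[t] <= block_len_cm:
            running += log_scores[hi]
            hi += 1
        scores.append(running)
        # Window t leaves the block before the next start position.
        if hi > t:
            running -= log_scores[t]
    return scores


def is_related(scores, thresholds):
    """A pair is called related if any block score exceeds its own threshold."""
    return any(s > t for s, t in zip(scores, thresholds))
```

Each window's log score is added once and subtracted once, so the scan over one pair is linear in the number of windows regardless of the block length.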

In the remainder of this section we derive two instantiations for the score function $s_w(g_{i,m(w)}, g_{i',m(w)})$. We first derive a score function $s_w$ using a likelihood-ratio approach. We continue by deriving an embedded likelihood-ratio score which corrects for the reduced performance stemming from windows exhibiting high variance in the likelihood-ratio score. Finally, we will describe how the block-specific score threshold $T_B$ is defined. In the Results section, we show that higher variance is associated with windows that have reduced ability to distinguish genotypes originating from related individuals from those originating from unrelated individuals.

5.2.1 Likelihood ratio test

To efficiently detect IBD, we first develop a likelihood-ratio test (LRT) variant of our method. Within a sliding block comparing two individuals' genotypes, we contrast the probability that they are IBD in the block against the probability that they are not IBD. The LRT score is computed by estimating the likelihood of the individuals' genotypes within each block under two models, namely a model $M_{IBD}$ corresponding to the hypothesis that the two examined individuals are related, and a model $M_{\overline{IBD}}$ corresponding to the hypothesis that the two individuals are unrelated.

As suggested by Equation 5.1, for both $M_{IBD}$ and $M_{\overline{IBD}}$, we model the genotypes within a block $B$ using a naive Bayes approach whereby all windows are independent given the IBD status of the two examined individuals within $B$. The probabilities of the genotypes within each window $w \in B$ comprising an examined block $B$ are considered separately, and the product of these probabilities defines the probability of the observed genotypes within the examined block (or a sum, under our log formulation). Namely, given a block of interest $B$ and the genotypes of two examined individuals $g_i$ and $g_{i'}$, the window-specific score in Equation 5.1 is defined as:

\[ s_w^{LR}(g_{i,m(w)}, g_{i',m(w)}) = \frac{p_{M_{IBD}}(g_{i,m(w)}, g_{i',m(w)})}{p_{M_{\overline{IBD}}}(g_{i,m(w)}, g_{i',m(w)})} \tag{5.2} \]

Under the assumption that the sampled markers are in linkage equilibrium, meaning that the alleles within a window are not associated, the genotype probabilities under the two models are given by:

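\[ \begin{aligned} p_{M_{IBD}}(g_{i,m(w)}, g_{i',m(w)}) &= \prod_{j \in m(w)} p_{M_{IBD}}(g_{i,j}, g_{i',j}) \\ p_{M_{\overline{IBD}}}(g_{i,m(w)}, g_{i',m(w)}) &= \prod_{j \in m(w)} p_{M_{\overline{IBD}}}(g_{i,j}, g_{i',j}). \end{aligned} \tag{5.3} \]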

The probability of the genotype pair $g_{i,j}, g_{i',j}$ under our two models is then defined as:

\[ \begin{aligned} p_{M_{IBD}}(g_{i,j}, g_{i',j}) &= \sum_{h_j^1, h_j^2, h_j^3} p(g_{i,j} \mid h_j^1, h_j^2) \cdot p(g_{i',j} \mid h_j^1, h_j^3) \cdot p(h_j^1) \cdot p(h_j^2) \cdot p(h_j^3) \\ p_{M_{\overline{IBD}}}(g_{i,j}, g_{i',j}) &= \sum_{h_j^1, h_j^2, h_j^3, h_j^4} p(g_{i,j} \mid h_j^1, h_j^2) \cdot p(g_{i',j} \mid h_j^3, h_j^4) \cdot p(h_j^1) \cdot p(h_j^2) \cdot p(h_j^3) \cdot p(h_j^4) \end{aligned} \tag{5.4} \]

where $p(h_j^*) = f_j^{h_j^*} \cdot (1 - f_j)^{(1 - h_j^*)}$, as determined by the major allele frequency $f_j$ at marker $j$. The probability $p(g_{i,j} \mid h_j^1, h_j^2)$ that the genotype $g_{i,j}$ was sampled given the underlying haplotypes $h_j^1$ and $h_j^2$ must accommodate genotyping errors. We define $p(g_{i,j} \mid h_j^1, h_j^2)$ as follows:

\[ p(g_{i,j} \mid h_j^1, h_j^2) = \begin{cases} 1 - \epsilon & g_{i,j} = h_j^1 + h_j^2 \\ \epsilon / 2 & \text{otherwise} \end{cases} \tag{5.5} \]

where the parameter $\epsilon$ is tuned to capture the amount of expected genotyping error. Finally, to accommodate missing data, we set the likelihood ratio at a marker to 0.5 if either genotype is missing.
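The per-marker probabilities of Equations 5.4 and 5.5 can be evaluated by direct enumeration of the hidden alleles. The sketch below is an illustrative transcription of those formulas, not Parente's implementation; the function names are hypothetical, and representing a missing genotype as `None` is an assumption of the example.

```python
from itertools import product


def p_hap(h, f):
    """Probability of a single allele h in {0, 1} given major allele frequency f."""
    return f if h == 1 else 1.0 - f


def p_geno(g, h1, h2, eps):
    """Equation 5.5: probability of genotype g given two underlying alleles."""
    return 1.0 - eps if g == h1 + h2 else eps / 2.0


def p_ibd(g1, g2, f, eps):
    """Equation 5.4, related model: the two individuals share allele h1."""
    return sum(
        p_geno(g1, h1, h2, eps) * p_geno(g2, h1, h3, eps)
        * p_hap(h1, f) * p_hap(h2, f) * p_hap(h3, f)
        for h1, h2, h3 in product((0, 1), repeat=3)
    )


def p_not_ibd(g1, g2, f, eps):
    """Equation 5.4, unrelated model: four independent alleles."""
    return sum(
        p_geno(g1, h1, h2, eps) * p_geno(g2, h3, h4, eps)
        * p_hap(h1, f) * p_hap(h2, f) * p_hap(h3, f) * p_hap(h4, f)
        for h1, h2, h3, h4 in product((0, 1), repeat=4)
    )


def marker_lr(g1, g2, f, eps):
    """Per-marker likelihood ratio; 0.5 if either genotype is missing (None)."""
    if g1 is None or g2 is None:
        return 0.5
    return p_ibd(g1, g2, f, eps) / p_not_ibd(g1, g2, f, eps)
```

Under the linkage-equilibrium assumption of Equation 5.3, the window score of Equation 5.2 is the product of `marker_lr` over the markers in the window. Opposite homozygotes (genotypes 0 and 2) can only arise under the related model through a genotyping error, so they yield a ratio well below 1.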

We note that in the above model, the individuals can share at most a single haplotype. We further note that under the assumption of linkage equilibrium, the equivalent of a block LRT score $\Lambda_B(g_i, g_{i'})$ can be directly computed without windows by using the sums of the logs of the genotype probabilities, as defined by Equation 5.3.

We utilize the window-based $s_w$ formulation described in Equation 5.2 to facilitate our description of an extension that accounts for local score variability, which we now derive.

5.2.2 Embedded likelihood ratio test

The model described thus far provides an efficient approach to identifying pairs of individuals that share a common ancestor, and in particular to detecting specific regions that are IBD. While alleviating some of the performance-related challenges that are evident when examining large cohorts by providing a computationally feasible approach, the model is sensitive to windows exhibiting highly variable scores. Namely, for each block, the window-score of a small sub-set of windows plays a critical role in the determination of the final block score. It is the high variability of such windows that limits the performance of the likelihood-ratio based test.

One approach that corrects for the detrimental impact of high-variance windows is based on the direct examination of window-level performance. The distribution of the window score can be examined given the genotypes of unrelated individuals, and contrasted against the distribution of the window score given the genotypes of related individuals. By contrasting these distributions, it is possible to detect and control for the impact of highly variable windows. Specifically, to apply such a correction, we treat the LR described by Equation 6.2 as a random variable $S_w^{LR} = s_w^{LR}(g_{i,m(w)}, g_{i',m(w)})$. We then define two Gaussian models for the distribution of $S_w^{LR}$, one corresponding to the distribution of the score given related individuals, and a second corresponding to the distribution of the score given unrelated individuals:

$$S_w^{LR} \mid \mathrm{IBD} \sim N(\mu_{w,\mathrm{IBD}}, \sigma_{w,\mathrm{IBD}}), \qquad S_w^{LR} \mid \overline{\mathrm{IBD}} \sim N(\mu_{w,\overline{\mathrm{IBD}}}, \sigma_{w,\overline{\mathrm{IBD}}}). \tag{5.6}$$

Our modified score, which we term the embedded likelihood ratio (ELR), is finally defined as:

$$s_w^{ELR}(g_{i,m(w)}, g_{i',m(w)}) = \frac{P(S_w^{LR} = s_w^{LR}(g_{i,m(w)}, g_{i',m(w)}) \mid \mathrm{IBD})}{P(S_w^{LR} = s_w^{LR}(g_{i,m(w)}, g_{i',m(w)}) \mid \overline{\mathrm{IBD}})}. \tag{5.7}$$
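As an illustration of Equation 5.7, the sketch below evaluates the ratio of two normal densities at an observed window LR score. The function and argument names are hypothetical, and each parameter pair is assumed to hold the window-specific mean and standard deviation of Equation 5.6.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma) at x, with sigma as the standard deviation."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def embedded_lr(s_lr, ibd_params, non_ibd_params):
    """Ratio of the IBD to the non-IBD density at the window score s_lr."""
    mu_ibd, sd_ibd = ibd_params
    mu_non, sd_non = non_ibd_params
    return normal_pdf(s_lr, mu_ibd, sd_ibd) / normal_pdf(s_lr, mu_non, sd_non)
```

For scores far in the tails, either density can underflow to zero, so a practical version would likely work with log-densities instead; the plain ratio is kept here only to mirror the equation.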

In total, 4 additional parameters define our new model. Namely, the mean $\mu$ and standard deviation $\sigma$ of the normal distributions used to approximate the behavior of our initial score $s_B^{LR}$ under observations originating from related and unrelated individuals. In order to estimate these parameters, phased data is used to simulate related and unrelated individuals, yielding the means to compute empirical estimates for the score distributions. The phased haplotypes can be either generated from datasets containing trios, or via computationally-phased individuals. It is important to note that current phasing methods offer a sufficiently low switch-error rate such that their performance should have a negligible effect when considering haplotypes within a window of moderate size.

5.2.3 Genotyping-error function

In Equation 5.5 we describe the probability of genotypes given the hidden underlying haplotypes. The conditional probability $p(g_{i,j} \mid h_j^1, h_j^2)$ derived there accounts for genotyping error. While this provides a more realistic model, it can in fact reduce the statistical power to reject unrelated individuals. The lower power stems from the fact that the impact of reverse-homozygous genotypes is reduced; under the realistic model, such observations can be attributed to sampling errors rather than taken as an indication of unrelatedness. One can increase the penalty under such scenarios by controlling the genotyping error parameter $\epsilon$. Our method strives to reduce the number of false-positive pairs detected. Thus, we extend our method by introducing a genotyping-error function that increases the contrast between IBD and non-IBD segments. Specifically, when estimating the model parameters, we use $\epsilon$ as the genotyping error rate, whereas during inference, we replace $\epsilon$ in Equation 5.5 with a function $\phi(\epsilon) = v \cdot \epsilon$, where $v$ is a scaling factor. In the Results section, we used $v = \frac{1}{100}$.

5.2.4 Likelihood-ratio test threshold

When applying likelihood-ratio tests, thresholds are selected so as to control the false-positive rate. Specifically, the distribution of the test is examined under examples originating from the null distribution, and a threshold is selected to guarantee an expected performance in terms of false positives. It is common to select a single, global threshold to control for the global proportion of type I errors. However, as each block in our method contains windows with different score distributions, a local, block-specific threshold $T_B$ can be applied to improve the performance. In our method, we explore the distribution of $\Lambda_B(g_i, g_{i'})$ given the genotypes of unrelated individuals for each block, thus accommodating the local behavior of our score. Given a training set of unrelated pairs and their corresponding block scores $D_{b,\overline{\mathrm{IBD}}}$, we define the block threshold as:

$$T_B = \max(D_{b,\overline{\mathrm{IBD}}}) + c\,\sigma_{D_{b,\overline{\mathrm{IBD}}}} \tag{5.8}$$

where $\sigma_{D_{b,\overline{\mathrm{IBD}}}}$ is the standard deviation observed in the block scores, and $c$ scales the margin defined by the standard deviation. In our experiments, we use values between -1.5 and 2.5 for the scaling factor $c$.
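Equation 5.8 is simple enough to sketch directly. The function names below are hypothetical, and the use of the sample standard deviation is an assumption, since the text does not say which estimator was used.

```python
import statistics

def block_threshold(unrelated_scores, c=0.0):
    """Equation 5.8: maximum unrelated block score plus c standard deviations."""
    # Assumption: sample standard deviation (n - 1 in the denominator).
    return max(unrelated_scores) + c * statistics.stdev(unrelated_scores)

def call_block_ibd(block_score, threshold):
    """A block is called IBD when its score exceeds the block threshold."""
    return block_score > threshold
```

Negative values of `c` place the threshold below the largest unrelated training score, trading specificity for sensitivity; positive values add a safety margin above it.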

In the Results section, we demonstrate that the combination of ELR and a block-specific threshold $T_B$ provides superior performance in comparison to current state-of-the-art methods.

5.3 Results

The performance of Parente was evaluated using simulated data. We show that Parente achieves higher accuracy than fastIBD, which is considered a state-of-the-art method for the accurate and efficient detection of IBD. We further explore the relative contribution to performance stemming from the use of the likelihood-ratio test (LRT), the embedded LRT (ELRT), and finally the use of a local threshold versus a global threshold. As a note on notation, for the remainder of this chapter, we present the window score as $\log s_w(g_{i,m(w)}, g_{i',m(w)})$ instead of $s_w(g_{i,m(w)}, g_{i',m(w)})$.

5.3.1 Constructing training and testing datasets.

[Plot, panel a: sensitivity versus false positive rate (0 to 0.10), with a magnified inset over false positive rates 0 to 15×10^-4. Curves: Embedded LRT, local threshold; Embedded LRT; LRT, local threshold; LRT; fastIBD.]

Figure 5.1: Performance of Parente for detecting related pairs of individuals sharing 4 cM IBD segments in comparison to fastIBD. Parente was applied using three different strategies: LRT, LRT with local thresholding, and ELRT. The magnified inset highlights Parente’s superior performance when considering the high-specificity range.

To train and evaluate the performance of Parente, we used the phased data from three Asian populations of the HapMap Phase III panel [4]: Han Chinese in Beijing, China (CHB); Japanese in Tokyo, Japan (JPT); and Chinese in Metropolitan Denver, Colorado (CHD). Our experiments used polymorphic SNPs from the long arm of human chromosome 1. We randomly partitioned the unrelated individuals from these populations into a set of 154 training haplotypes and a set of 366 testing haplotypes. To create a larger dataset of unrelated individuals, we used the original haplotypes to generate composite haplotypes by simulating mosaics of the original haplotypes using an approach similar to [11]. Briefly, to generate a composite haplotype, we considered every 0.2 cM segment across the chromosome; for each segment, we copied the corresponding segment from one of the original haplotypes chosen uniformly at random. Due to the random process, some longer segments of two composite haplotypes were copied from the same original haplotype. Therefore, we removed 36 composite haplotypes that had more than 0.8 cM of contiguous sequence that was generated from the same original haplotype as another composite haplotype. A total of 500 composite training haplotypes and 1,000 composite testing haplotypes were generated. In all of our experiments we use these composite haplotypes for training and testing. Thus, henceforth, we will refer to these composite haplotypes as simply training and testing haplotypes.
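The mosaic construction of composite haplotypes can be sketched as follows. This is an illustrative reading of the procedure rather than the code used for the experiments: the function name is hypothetical, segments are assumed to lie on a fixed 0.2 cM grid starting at position 0, and the later filtering of composites that share long stretches of the same source is not shown.

```python
import random

def composite_haplotype(haplotypes, positions_cm, segment_cm=0.2, rng=None):
    """Build one mosaic haplotype by copying each segment from a random source.

    haplotypes: list of equal-length allele lists (the original haplotypes).
    positions_cm: genetic position of each marker, in cM, in increasing order.
    Returns the composite alleles and the source index used at each marker.
    """
    rng = rng or random.Random()
    composite, sources = [], []
    current_segment, source = None, None
    for j, pos in enumerate(positions_cm):
        segment = int(pos // segment_cm)
        if segment != current_segment:
            # New segment: pick a source haplotype uniformly at random.
            current_segment = segment
            source = rng.randrange(len(haplotypes))
        composite.append(haplotypes[source][j])
        sources.append(source)
    return composite, sources
```

Returning the per-marker source indices makes it possible to check afterwards how long two composites were copied from the same original haplotype.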

[Plot, panel b: sensitivity versus false positive rate (0 to 0.020), with a magnified inset over false positive rates 0 to 3×10^-4. Curves: Embedded LRT, local threshold; Embedded LRT; LRT, local threshold; LRT; fastIBD.]

Figure 5.2: Performance of Parente for detecting IBD segments compared to fastIBD. The same experiments as in Figure 5.1 (panel a) were used, but the sensitivity and false positive rate were calculated based on the number of SNPs in IBD and non-IBD segments. Similarly, the magnified inset highlights Parente’s superior performance in the high-specificity range.

5.3.2 Simulations to evaluate performance.

To evaluate and characterize the performance of Parente, we created simulated pairs of related individuals that shared a single IBD segment of a specific size, ranging between 3 and 8 cM. We used a bootstrap approach to measure accuracy, using 100 trials per experiment and averaging the results of all trials within an experiment. For each trial, we simulated 80 pairs of related individuals by generating 80 pairs of composite individuals and inserting one shared IBD segment of a given size at a random position along the chromosome. After genotypes were copied and IBD was injected, a genotypic error rate of $\epsilon = 0.005$ was applied, changing the genotype call to one of the other two genotypes with equal probability. We designated the first simulated individual of each pair to be a query individual and the second individual as the database individual. Then we used Parente to predict whether IBD existed between each query individual and all database individuals by labeling a pair as IBD if at least one block had a score passing the block-specific threshold. We calculated sensitivity as the number of IBD pairs correctly predicted out of 8,000 true IBD pairs per experiment, and false positive rates as the number of non-IBD pairs incorrectly predicted as IBD out of the 632,000 non-IBD pairs per experiment.
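The pair counts quoted above follow from the query and database design, and can be checked with a short sketch. The names are hypothetical, and the code assumes that query k is truly related only to database individual k, which is what the counts in the text imply.

```python
def pair_counts(n_pairs, n_trials):
    """Number of true IBD pairs and of non-IBD query/database pairs."""
    true_pairs = n_pairs * n_trials
    non_ibd_pairs = n_pairs * (n_pairs - 1) * n_trials
    return true_pairs, non_ibd_pairs

def sensitivity_and_fpr(called, n_pairs):
    """Pair-level rates for one trial.

    called: set of (query, database) index pairs labeled IBD.
    """
    true_pos = sum(1 for q, d in called if q == d)
    false_pos = len(called) - true_pos
    return true_pos / n_pairs, false_pos / (n_pairs * (n_pairs - 1))
```

With 80 pairs and 100 trials, `pair_counts` gives 8,000 true IBD pairs and 632,000 non-IBD pairs, matching the denominators used for sensitivity and false positive rate.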

When aiming to detect IBD segments of a particular length L (in cM), we defined the blocks to have the largest possible size l such that L − 0.5 ≤ l ≤ L − 0.1. We used block sizes slightly smaller than the target IBD segment size to account for block-boundary issues, which stem from the varying density of the SNP array and the fact that blocks start at window boundaries (and not at arbitrary SNPs). This was done to increase the likelihood that at least one block fit completely within any arbitrary IBD segment of length L.
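One way to read the block-size rule is as a selection among the block lengths that are achievable when blocks are built from whole windows. The sketch below makes that assumption explicit; the function name and the list of candidate lengths are hypothetical, and the text does not describe how the candidates were enumerated.

```python
def choose_block_length(candidate_lengths_cm, target_cm, max_gap=0.5, min_gap=0.1):
    """Largest candidate length l with target - max_gap <= l <= target - min_gap.

    candidate_lengths_cm: block lengths (in cM) reachable with whole windows.
    Returns None when no candidate falls inside the allowed range.
    """
    lower, upper = target_cm - max_gap, target_cm - min_gap
    feasible = [l for l in candidate_lengths_cm if lower <= l <= upper]
    return max(feasible) if feasible else None
```

For a 4 cM target, the allowed range is 3.5 to 3.9 cM, so a 3.7 cM candidate is chosen over a 3.95 cM one.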

In all our experiments, we used a window size of k = 20 SNPs per window, and simulated a single 4 cM segment for each related pair of individuals, except where stated otherwise.

5.3.3 Parente’s accuracy and comparison to fastIBD.

Our goal was to produce a fast, accurate method to predict IBD. We thus compared the performance of Parente to fastIBD [8], an efficient IBD detection method. fastIBD was previously shown to have higher accuracy than GERMLINE [21], a scalable IBD detection platform, and comparable accuracy to BEAGLE’s slower, high-accuracy IBD inference method [11]. We evaluated the performance of fastIBD on our simulated dataset using the default parameters and IBD detection thresholds ranging from $1 \times 10^{-6}$ to $1 \times 10^{-30}$. Following the recommendations of fastIBD’s authors, we ran fastIBD ten times with ten different seeds and aggregated the results by taking the minimum score observed at each position in any of the runs. We applied a size filter to the fastIBD predictions, only considering called segments longer than 1 cM, a value selected because it yielded the best performance for fastIBD. fastIBD’s authors further recommend providing additional genotypes to aid in training fastIBD’s internal haplotype model. Our experiments indicate that the use of additional haplotypes did not increase the performance (results not shown). As fastIBD infers IBD segments from all pairs in a given cohort, all the query and database individuals were provided simultaneously, while only calls made between query and database individuals were considered, following Parente’s mode of operation.

To compare the accuracy of Parente and fastIBD, we performed the simulations described above, measuring accuracy on detecting which pairs of individuals shared a simulated 4 cM IBD segment. The results shown in Figure 5.1 demonstrate that Parente has a significantly higher accuracy in comparison to fastIBD when detecting pairs of related individuals. This difference in sensitivity further grows at high-specificity levels, which is a crucial parameter when analyzing large cohorts. Note that the use of a local threshold for the ELRT provides superior high-specificity performance over a global threshold strategy. In the case of the LRT, the local threshold provides a large increase in sensitivity at all specificity levels. We further compared the performance of Parente and fastIBD in the task of accurately determining the location and boundaries of IBD segments (see Figure 5.2). Our experiments demonstrate that Parente achieves higher per-SNP, per-pair accuracy when compared to fastIBD. We note that when running fastIBD for this analysis we did not enforce the called segment size filter, as fastIBD performed better when the filter was not applied. The sensitivity for each related pair of individuals was measured as the fraction of SNPs in the simulated IBD segment successfully detected to be IBD. For all pairs in the experiment, we measured the false positive rate as the fraction of SNPs not in IBD segments that were incorrectly called as IBD. Since blocks can overlap in Parente, we labeled a SNP as IBD if it belonged to any block that had a score above the threshold.
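The per-SNP labeling rule for overlapping blocks can be sketched as below. The names and the block representation (marker index ranges with an exclusive end) are hypothetical, and the rates are computed for a single pair only, whereas the false positive rate in the text is pooled over all pairs in an experiment.

```python
def label_snps(n_snps, blocks):
    """Mark a SNP as IBD if any block covering it scores above its threshold.

    blocks: iterable of (start, end, score, threshold), with end exclusive.
    """
    labels = [False] * n_snps
    for start, end, score, threshold in blocks:
        if score > threshold:
            for j in range(start, end):
                labels[j] = True
    return labels

def per_snp_rates(labels, true_segment):
    """Per-SNP sensitivity and false positive rate for one simulated pair."""
    start, end = true_segment
    inside = labels[start:end]
    outside = labels[:start] + labels[end:]
    sensitivity = sum(inside) / len(inside)
    fpr = sum(outside) / len(outside) if outside else 0.0
    return sensitivity, fpr
```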

We characterized Parente’s performance on a range of simulated IBD segment sizes from 3 cM to 8 cM, as depicted in Figure 5.3. These results show that Parente excels at high-specificity detection of IBD segments. For instance, Parente was able to successfully detect 8 cM IBD segments with 94% sensitivity and nearly zero false positive rate, and 6 cM IBD segments with 90% sensitivity and a $5 \times 10^{-5}$ false positive rate.

As efficiency is key in the analysis of large cohorts, we also measured execution time. In our experiments, the running time of Parente was approximately one tenth that of fastIBD. Specifically, Parente was able to process ∼15 individual pairs per second on our trials of 6,400 pairs. Note that we measured running time in pairs per second because fastIBD analyzes all pairs within a cohort, whereas Parente was run on all pairings between query and database individuals.

5.3.4 Training Parente’s model and thresholds.

[Plot: sensitivity versus false positive rate (0 to 0.050), with a magnified inset over false positive rates 0 to 6×10^-4 and sensitivities 0.90 to 0.96. Curves: 8 cM, 6 cM, 4 cM, 3 cM.]

Figure 5.3: Performance of Parente for detecting related pairs of individuals sharing IBD segments of various sizes. The magnified inset shows Parente’s high sensitivity achieved at near-zero false positive rates for larger IBD segments.

In order to compute our embedded LRT score, $P_{\mathrm{IBD}}$ and $P_{\overline{\mathrm{IBD}}}$ first need to be evaluated for every window w. Simulated pairs of related and unrelated individuals were used for this process (see Equations 5.6 and 5.7). The genotypes of related pairs were simulated so that each pair shared one entire haplotype along the chromosome. Specifically, each pair of related genotypes was generated by randomly selecting one haplotype from the training data to be shared by both genotypes as well as a unique haplotype for each genotype, so that three distinct haplotypes were sampled. Pairs of unrelated genotypes were simulated by randomly choosing four distinct training haplotypes, using two of the haplotypes for one genotype and the remaining two haplotypes for the second genotype. A total of 2,000 pairs of related genotypes and 2,000 pairs of unrelated genotypes were generated. For each window w and each pair of related and unrelated genotypes, we computed the LRT score assuming a genotyping error rate of $\epsilon = 0.005$; we then fit window-specific normal distributions to the scores of related and unrelated pairs, resulting in $(\mu_{w,\mathrm{IBD}}, \sigma_{w,\mathrm{IBD}})$ and $(\mu_{w,\overline{\mathrm{IBD}}}, \sigma_{w,\overline{\mathrm{IBD}}})$, respectively.
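The training simulation described above can be sketched as follows. The function names are hypothetical, genotypes are formed as the per-marker sum of two haplotypes coded 0/1, and the normal fit uses the sample mean and sample standard deviation, which is an assumption about the estimator.

```python
import random
import statistics

def genotype(hap_a, hap_b):
    """Unphased genotype: per-marker sum of two haplotypes coded 0/1."""
    return [a + b for a, b in zip(hap_a, hap_b)]

def simulate_related_pair(haplotypes, rng):
    """Three distinct haplotypes: one shared, one private to each genotype."""
    shared, own1, own2 = rng.sample(range(len(haplotypes)), 3)
    return (genotype(haplotypes[shared], haplotypes[own1]),
            genotype(haplotypes[shared], haplotypes[own2]))

def simulate_unrelated_pair(haplotypes, rng):
    """Four distinct haplotypes: two for each genotype."""
    a, b, c, d = rng.sample(range(len(haplotypes)), 4)
    return (genotype(haplotypes[a], haplotypes[b]),
            genotype(haplotypes[c], haplotypes[d]))

def fit_window_gaussian(scores):
    """Empirical (mean, standard deviation) of one window's training scores."""
    return statistics.mean(scores), statistics.stdev(scores)
```

In use, one would score every window for each simulated pair and call `fit_window_gaussian` once on the related scores and once on the unrelated scores of that window.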

5.3.5 Embedded LRT and local thresholds.

We computed the LRT and ELRT scores for windows and blocks for the unrelated and related training data and examined their properties in order to explore the differences between the ELRT and LRT strategies. Figure 5.4 shows the distribution of the average window score for IBD and non-IBD segments. The figure demonstrates three notable properties of the ELRT when compared to the LRT. First and foremost, there is greater separation between the scores of IBD and non-IBD segments; second, the boundary between the scores of IBD and non-IBD segments is very close to zero, suggesting well-calibrated scores; third, the variance of the scores of IBD segments is controlled. To understand the role of the ELRT’s reduction in the variance of the IBD window scores, we plotted the mean and standard deviation of the window scores for ELRT versus LRT (see Figure 5.5). Note that the ideal score distribution for IBD segments would have a high mean and low variance in order to serve as a reliable predictor for the IBD state. Therefore, these plots clearly demonstrate that ELRT controls for windows that are unreliable predictors of IBD. Specifically, the windows with high variance and low negative mean LRT scores (the blue and violet points in the figure) are mapped to lower variance ELRT scores. We note that even though there is a negative trend between the average LRT scores and average ELRT scores, the ELRT scores stay above zero, the apparent boundary between IBD and non-IBD scores. The ELRT advantages at the window level translate to the block level, as seen in Figure 5.6. This greater block score separation consequently allows Parente to achieve higher accuracy when using the embedded LRT score. In Figure 5.7, the mean of these distributions can be seen for many blocks along the chromosome, demonstrating the stability of the increased separation of the ELRT across the chromosome. This figure also shows the high variation in the block thresholds for the LRT, which explains why the LRT’s performance increases significantly when using block-specific thresholds compared to a global threshold.

Figure 5.4: (a) For each window, the mean window score of the IBD and non-IBD training data was computed; the histogram of these means is shown for the LRT and ELRT scores. When compared to the LRT score, the ELRT score has more separation between the IBD and non-IBD distributions, the boundary between them becomes centered at zero, and the IBD score variance is reduced.

Figure 5.5: (b) Mean and standard deviation of window LRT scores and ELRT scores for IBD training data were computed. Each point represents a specific window, with the same color used to denote the same window in both plots. This illustrates the extent to which the ELRT reduces the variance of windows with high-variance, low-negative-mean LRT scores.

Figure 5.6: (c) For a particular block, a histogram of the scores observed in the training data is shown. As with windows, the ELRT block scores feature better separation between IBD and non-IBD individuals, with a boundary close to zero.

Figure 5.7: (d) Scores and thresholds across a chromosomal segment based on training data. The red line represents the mean score for non-IBD training data and the dark blue line represents the mean score for IBD training data. The yellow dashed line is the score for a single unrelated pair at each block. The dotted black line shows a local, block-specific threshold. This figure illustrates the consistent and improved separation between IBD and non-IBD score distributions at blocks across the chromosome for the ELRT over the LRT.

5.3.6 Accuracy performance characteristics.

Finally, we conducted additional experiments aimed at characterizing the performance of Parente. Specifically, we examined the effect of genotyping errors, the use of the genotyping-error function $\phi(\epsilon)$, and the effect of varying the window size k. First, we explored Parente’s performance with and without $\phi(\epsilon)$, assessing differences in accuracy. When using $\phi(\epsilon)$ with the scaling factor $v = \frac{1}{100}$, Parente’s sensitivity increased from 75% to 86% at the 1% FPR level. The improvement in sensitivity further increased at the 0.1% FPR level, from 45% when using $\epsilon$ to 73% when $\phi(\epsilon)$ was applied. Next, we demonstrated that Parente is robust to changes in the window size parameter. IBD pairs were inferred on simulations with 4 cM injected IBD segments for window sizes of 10, 20, and 30 SNPs per window. When using the LRT score, Parente’s sensitivity changed less than 0.5% at the 0.1% FPR level. The differences were due to the fact that block boundaries were generated to begin and end at window boundaries, resulting in block definitions that were slightly different given the window size. As noted earlier, varying the window size does not affect the LRT score, as the window-based model is equivalent to the direct computation of the score at the block level. Simply put, the LRT score of a block can be equivalently computed by summing the individual SNP LRT scores or the window LRT scores. When using the embedded LRT score, Parente’s sensitivity varied by less than 2% at the 0.01% FPR level across the different window sizes. These differences can be attributed to differences in the window models as well as block boundary differences. Finally, we explored the extent to which genotyping errors affected Parente’s performance. To this end, we repeated the simulations but introduced genotyping errors at different rates: 1%, 0.5%, and 0%. The model parameters $\epsilon$ and $\phi(\epsilon)$ were unchanged from the previously described experiments, being set to $\epsilon = 0.005$ and $\frac{\epsilon}{100}$, respectively. We found that at the 0.1% FPR level, the sensitivity increased from 66% to 74% to 76% for the 1%, 0.5%, and 0% error rates, respectively. These results illustrate that Parente is robust to a realistic range of error rates up to 0.5%.

5.4 Discussion

To improve computational efficiency when applying the described scoring functions, the log window score $\log s_w(g_{i,m(w)}, g_{i',m(w)})$ can be pre-computed for all possible pairs of genotypes for every window. For instance, with a window size of 5 SNPs, each window requires only $\frac{(3^5)(3^5+1)}{2}$ = 29,646 values. The block score $\Lambda_b(g_i, g_{i'})$ can then be computed efficiently by retrieving and summing these values.
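The table size quoted above counts unordered pairs of window genotypes, which can be checked with a short sketch. The function names are hypothetical, and storing one entry per unordered pair assumes the window score is symmetric in the two individuals, as the count in the text implies.

```python
from itertools import combinations_with_replacement, product

def n_table_entries(window_size):
    """Unordered pairs of window genotypes: n * (n + 1) / 2 with n = 3**k."""
    n = 3 ** window_size
    return n * (n + 1) // 2

def build_score_table(window_size, score):
    """Pre-compute score(a, b) for every unordered pair of window genotypes."""
    genotypes = list(product((0, 1, 2), repeat=window_size))
    return {(a, b): score(a, b)
            for a, b in combinations_with_replacement(genotypes, 2)}

def lookup(table, a, b):
    """Retrieve a pre-computed score regardless of argument order."""
    return table[(a, b)] if a <= b else table[(b, a)]
```

For 5 SNPs per window, `n_table_entries(5)` returns 29646, in agreement with the value in the text.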

The model presented here assumes markers within each window are in linkage equilibrium. One approach to satisfy this assumption is via marker pruning using tools such as PLINK [47]. Alternatively, our model can be extended so as to incorporate the LD evident between neighboring markers. Previous work has shown that modeling LD can improve the performance of IBD methods [7].

In our work, the applied block-specific threshold strategy was based on the observed scores of unrelated pairs in the training data. The rationale behind this approach was to stringently control false positives, since we aim to identify IBD in extremely large cohorts. Therefore, we calculated the threshold based on the maximum and variance of the observed training scores and a provided constant, c (see Equation 5.8). The default value c = 0 yielded a threshold with good performance (82% sensitivity at a $3 \times 10^{-3}$ FPR for the embedded LRT); c can be adjusted to achieve the preferred tradeoff between specificity and sensitivity. We have observed that the margin between the related and unrelated distributions varies between blocks (see Figures 5.4, 5.5, 5.6, and 5.7). One may be able to increase sensitivity without loss of specificity by increasing the thresholds at blocks where the margin is large. In future work, we aim to explore additional, stronger thresholding schemes in order to increase Parente’s accuracy.

Parente makes the assumption that IBD segments along the genome are independent of one another, which holds true for distant relatives with relatively small IBD segments (e.g., 5 cM) that are expected to have at most one shared IBD segment. The assumption may not hold true for closely-related individuals, which are expected to share several IBD segments. However, due to the close relationships in these scenarios, these IBD segments also tend to be very large. Because Parente can accurately detect individual small IBD segments, it can also detect each individual larger IBD segment, without needing to take into account that several large IBD segments may appear across the genome.

SNPs per window            3       5       10      15      20
Mean non-IBD KS p-value    7e-12   4e-11   8e-7    5e-6    2e-5
Mean IBD KS p-value        1e-9    4e-5    0.003   0.008   0.017

Table 5.1: As window size increases, a Gaussian distribution fits window LRT scores better. Given a window size (SNPs per window), the Kolmogorov-Smirnov test was performed on the scores of the training data for each window along the chromosomal segment. The mean p-value of all the windows is reported here.

Our model uses a normal approximation of the LRT score distribution in order to compute the ELRT scores. With a window size of 20 SNPs per window, as used in our experiments, the LRT score distributions of most windows reasonably follow a Gaussian distribution. Naturally, however, for smaller window sizes (such as 3 SNPs per window), most windows had score distributions that do not fit a Gaussian distribution. The poor approximation of the LRT score via a Gaussian distribution resulted in reduced performance (results not shown). We quantified window LRT score normality across various window sizes by using a Kolmogorov-Smirnov (KS) test on the related and unrelated training LRT scores for each window. The mean p-value of all the windows along the chromosome was computed. Table 5.1 shows these results, illustrating that the approximation using a Gaussian distribution provides a better fit as the window size increases. These observations indicate that it may be worthwhile to explore alternative parametric and empirical distributions for the LRT score, evaluating their impact on Parente’s accuracy, especially when using small window sizes.
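The normality check can be illustrated with a one-sample KS statistic against a fitted normal distribution. This sketch computes only the statistic, not a p-value; the function names are hypothetical, and the text does not state which KS implementation or p-value approximation was used.

```python
import math

def normal_cdf(x, mu, sigma):
    """Cumulative distribution function of N(mu, sigma) at x."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def ks_statistic(sample, mu, sigma):
    """Largest gap between the empirical CDF and the normal CDF."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        c = normal_cdf(x, mu, sigma)
        # Compare against the empirical CDF just after and just before x.
        d = max(d, (i + 1) / n - c, c - i / n)
    return d
```

A larger statistic for a window indicates a poorer Gaussian fit; when the mean and standard deviation are estimated from the same scores, standard KS p-values are only approximate.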

In this chapter we presented Parente, a novel method for the accurate and efficient detection of IBD. Our results demonstrate that Parente has superior accuracy in comparison to previous state-of-the-art methods, especially when set to control for extremely low false-positive rates. Furthermore, the method’s efficiency enables the analysis of large cohorts sampled over dense marker sets. As larger datasets are collected and sampled at an increasingly higher resolution via next-generation sequencing [36, 62], efficient methods such as Parente that can operate on non-phased genotype data become vital for their analysis. Parente is publicly and freely available at http://parente.stanford.edu/.

Chapter 6

Parente2

6.1 Introduction

Parente represented a leap forward in IBD detection, but it relied on a very simple Bayesian model. It made the strong assumption that markers are independent of one another, which is certainly not the case given the density of markers used. Specifically, the formulation of the likelihood ratio is done at the SNP level, and the log likelihood of an entire region is computed by summing the log likelihoods of all SNPs in the region. While it is possible that Parente’s embedded likelihood ratio score captures some of the linkage information implicitly by combining the scores of 20 adjacent SNPs, there is no explicit use of known linkage disequilibrium information in the model. In our work on Alloy, we showed that using a good model of linkage disequilibrium can increase prediction accuracy for ancestry inference, so we sought to improve Parente’s accuracy by introducing a stronger model that incorporates linkage disequilibrium.

In this chapter, I will describe Parente2, a new method for IBD detection that uses explicit linkage disequilibrium information and that we developed as an improvement on Parente. The high-level approach taken by Parente2 is very similar to that of Parente: we compute a likelihood ratio using a graphical model on windows of markers and then recalibrate the ratio using the embedded likelihood ratio statistic. However, there are several key innovations in Parente2 that contribute to its success:

• The new model underlying Parente2 incorporates linkage disequilibrium by directly computing the likelihood of window genotypes. This is in contrast to assuming marker independence and computing the likelihood of window genotypes based on the likelihood of the individual markers in the window. To do this, Parente2 considers the short haplotypes inside windows as the hidden layer of the graphical model, instead of individual alleles as in the previous model. Therefore, Parente2 uses haplotype frequencies as the prior for the hidden states instead of allele frequencies.

• Parente2 introduces a new scheme for defining windows of markers. Instead of only using windows of markers where all markers in the window are adjacent to one another, Parente2 adds high-coverage, randomly-defined “sparse” windows of markers. These windows are defined by randomly selecting markers from a small region of the genome. The windows are designed to be highly overlapping, where each marker is included in ten windows on average.

• With high-coverage windows, Parente2 also introduces a window filter that removes windows from consideration that are less informative than other windows nearby. The final set of windows used by Parente2 are the ones that are the most informative for the task of identifying IBD segments.
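The sparse-window idea in the second item above can be sketched as follows. This is only an illustration under stated assumptions: the function name, the window size of 5 markers, and the region size of 40 markers are hypothetical, and regions are drawn uniformly at random. Only the target of ten windows per marker on average comes from the text.

```python
import random

def sparse_windows(n_markers, window_size=5, region_size=40, coverage=10, rng=None):
    """Randomly defined windows whose markers all fall in one small region.

    The number of windows is chosen so that each marker belongs to
    `coverage` windows on average.
    """
    rng = rng or random.Random()
    n_windows = coverage * n_markers // window_size
    windows = []
    for _ in range(n_windows):
        # Pick a region, then sample non-adjacent markers from inside it.
        start = rng.randrange(n_markers - region_size + 1)
        markers = rng.sample(range(start, start + region_size), window_size)
        windows.append(tuple(sorted(markers)))
    return windows
```

With uniformly placed regions, markers near the chromosome ends are covered less often than those in the middle, so a practical scheme would probably need to correct for edge effects before applying a window filter.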

In this chapter, I will first describe these innovations in the Methods section. Then in the Results section I will discuss the improvement in accuracy of Parente2 compared to previous methods and demonstrate the contribution of each innovation to the performance of Parente2. Finally, in the Discussion section, I cover several characteristics of Parente2 that are relevant for its practical application.

6.2 Methods

The Parente2 model employs a window-based technique that is used to determine whether a pair of individuals are identical-by-descent within a given genomic block. The underlying model structure is depicted in Figure 6.1. Briefly, given two individuals, the probability of the genotypes within the examined window is estimated, followed by two consecutive, embedded, log-likelihood-ratio (LLR) scores: the inner-LLR score is computed using the probability of these window-specific genotypes, and the outer-LLR score is then computed using the probability of generating the inner-LLR score. At the base level, observed markers are grouped into windows, chosen with repetition, and their joint probability is estimated given a hypothesized IBD state. These probabilities are then combined into a window set probability using a naive Bayes model. A log-likelihood-ratio statistic, derived from these window set probabilities, serves as the observed variable of the outer layer, and the probabilities for these statistics are again estimated given a hypothesized IBD state. Subsequently, the probabilities of several window set statistics are combined via a second naive Bayes model, concluding with a final probability given an IBD state within an examined block.

Let g be defined as an individual’s genotype, namely a vector consisting of the genotype calls made across M bi-allelic markers. We denote the i-th marker as $m_i$, and use $g_i \in \{0, 1, 2\}$ to denote the minor allele count observed at marker $m_i$. An allele at $m_i$ is denoted by $a_i \in \{0, 1\}$, corresponding to the hidden underlying minor allele count at the i-th marker. The second concept we introduce is the notion of a marker window. A window w is defined as a set of markers, and $m(w) = \{i \mid m_i \in w\}$ is defined as the set of indices corresponding to the markers associated with window w. While it is common to refer to a haplotype as a genomic region consisting of measured adjacent polymorphic sites, we refine this definition by using h(w) to refer to the alleles at markers m(w), i.e., $h(w) = \{a_i \mid i \in m(w)\}$. The underlying idea is that haplotype blocks, corresponding to genomic regions exhibiting high inter-marker linkage disequilibrium (LD), may be well captured by different sets of markers within the examined region. Namely, the non-independence between genetic variants, as observed within an examined population, may be approximated by observing subsets of markers. Perhaps more importantly, different such subsets may capture a variety of aspects of the LD existing in a particular genomic region. The ensemble of such sparse models can more accurately capture nuances in the underlying non-random association between alleles, and can express a richer set of correlations than the more common lower-order Markov models. The frequency of a haplotype h(w) within the examined population is denoted by f(h(w)). We further define a window set S, and a block $B_i$ as a set of window sets, starting at marker i. The length of block $B_i$ is defined as the distance, in cM, between the last marker contained within an associated window and the first such marker. The choice of block length is driven by the application; one may either target a particular accuracy, or seek to investigate events that fall within specific timespans, covering common ancestors of interest up to some point in time.

For a target IBD block length $l$ (in cM), the Parente2 method is defined as follows. Given the genotype calls of two individuals, $g$ and $g'$, the genome is scanned by sliding a block $B_i$ across each chromosome, where each block $B_i$ corresponds to marker $m_i$ out of the $M$ sampled markers. The examined block $B_i$ is associated with a group of pre-defined window sets $s \in B_i$, where $s$ in turn contains windows that correspond to markers that are at most $l$ cM away from the first marker of the first window in that block. For each such block $B$, we compute the outer-LLR score, which is an aggregated block score $\Lambda_B(g, g')$ defined as follows:

$$\Lambda_B(g, g') = \sum_{s \in B} \log \frac{Q_I(\Gamma_s = \gamma_s(g, g'))}{Q_{\bar{I}}(\Gamma_s = \gamma_s(g, g'))} \tag{6.1}$$

where $Q_I$ and $Q_{\bar{I}}$ correspond to the probabilities of score $\Gamma_s$ under the IBD and non-IBD models, respectively, and $\gamma_s(g, g')$ corresponds to the inner-LLR window set score, computed using the genotypes of the two examined individuals $g$ and $g'$ within an examined window set $s$. We call a pair of individuals IBD in block $B$ whenever $\Lambda_B(g, g') > T_B$, where $T_B$ is a pre-defined threshold associated with block $B$. We compute this score for each block in the genome and call a pair of individuals related if any block in the genome is called IBD. The threshold $T_B$ is defined such that the false-positive rate is controlled to a desired level. The block score $\Lambda_B(g, g')$ can be efficiently computed along the genome of two individuals. As blocks are scanned, window-scores corresponding to windows that are no longer part of the newly examined block $B'$ are subtracted from the current block score $\Lambda_{B'}(g, g')$, and the window-scores corresponding to newly joining windows are simply added.

Figure 6.1: Underlying graphical model for the likelihood ratio. The variables $h_1$, $h_2$, $h_3$, and $h_4$ represent hidden haplotypes for a given window of markers. The variables $g$ and $g'$ represent the observed genotype vectors from the first and second individual in a pair of individuals being evaluated for IBD in the window. (A) The model for two unrelated individuals that do not share an IBD segment in the window. (B) The model for two related individuals that do share an IBD segment in the window.
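The incremental update of the block score can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that a block always spans a fixed number of consecutive window sets (the method itself defines blocks by genetic length in cM); the function name `sliding_block_scores` and its parameters are hypothetical and not part of Parente2.

```python
from collections import deque

def sliding_block_scores(set_scores, sets_per_block):
    """Return the sum of scores for every run of `sets_per_block` consecutive
    window sets, updating a running total instead of re-summing each block."""
    in_block = deque()   # scores currently inside the block
    running = 0.0        # current block score
    block_scores = []
    for score in set_scores:
        in_block.append(score)
        running += score                      # add the newly joining score
        if len(in_block) > sets_per_block:
            running -= in_block.popleft()     # subtract the score that left the block
        if len(in_block) == sets_per_block:
            block_scores.append(running)
    return block_scores
```

With this scheme each step costs one addition and one subtraction, regardless of how many window sets a block contains.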

In the remainder of this section we instantiate the score function $\gamma_s(g, g')$ using a likelihood-ratio approach based on observations from multiple windows. Underlying $\gamma_s(g, g')$ is a probabilistic model that is used to estimate the probability of the observed genotypes within the examined windows. We further detail how the empirical probabilities for both the observed genotypes and $\Gamma_s$ are approximated for the two IBD states, aiming at improving the efficiency of the implementation. The process by which markers are selected to form windows, and windows are selected to form window sets, is described, followed by criteria for selecting windows that are highly informative for IBD inference. Finally, we describe how the block-specific score threshold $T_B$ is defined. In the Results section, we show how combining multiple overlapping windows, covering the same marker in different combinations, can significantly improve the accuracy of the algorithm, as it enables one to model the complex LD structure that exists in the underlying population haplotypes.

6.2.1 Inner Log-Likelihood Ratio

Within a sliding block $B$ comparing two individuals' genotypes, and more specifically, within an associated window set $s \in B$, we contrast the probability of the observed genotypes under the assumption that the individuals are IBD in the block against the complementary assumption that they are not IBD. Specifically, the inner-LLR score is computed by estimating the likelihood of the individuals' genotypes within each block under two models, namely a model $P_I$ corresponding to the hypothesis that the two examined individuals are related, and a model $P_{\bar{I}}$ corresponding to the hypothesis that the two individuals are unrelated.

As outlined by Equation 6.2 below, for both $P_I$ and $P_{\bar{I}}$, we model the genotypes within a window set $s$ using a naive Bayes approach, whereby all windows are independent given the IBD status of the two examined individuals within $s$. The probabilities of the genotypes within each window $w \in s$ are considered separately, and the product of these probabilities defines the probability of the observed genotypes within the examined window set (or a sum, under our log formulation). Namely, given a window set of interest $s$ and the genotypes of two examined individuals $g$ and $g'$, the inner-LLR score $\gamma_s$ used in Equation 6.1 is defined as:

$$\gamma_s(g, g') = \sum_{w \in s} \log \frac{P_I(g(w), g'(w))}{P_{\bar{I}}(g(w), g'(w))} \tag{6.2}$$

To support the complex underlying LD structure, i.e. the non-random association between the alleles associated with a window $w$, we capture the frequency of haplotype $h$ with respect to the markers within window $w$ using $f_w(h)$. For brevity, when the window $w$ is clear from the context, we will denote the haplotype by $h$ and its frequency by $f(h)$. Once established or approximated, $f(h(w))$ is used to compute the probability of the observed genotypes under both models, namely $P_I$ and $P_{\bar{I}}$, as follows:

$$\begin{aligned}
P_I(g(w), g'(w)) &= \sum_{h_1, h_2, h_3} p(g(w) \mid h_1, h_2) \cdot p(g'(w) \mid h_1, h_3) \cdot f_w(h_1) \cdot f_w(h_2) \cdot f_w(h_3) \\
P_{\bar{I}}(g(w), g'(w)) &= \sum_{h_1, h_2, h_3, h_4} p(g(w) \mid h_1, h_2) \cdot p(g'(w) \mid h_3, h_4) \cdot f_w(h_1) \cdot f_w(h_2) \cdot f_w(h_3) \cdot f_w(h_4)
\end{aligned} \tag{6.3}$$

The probability $p(g(w) \mid h_1, h_2)$ that the genotype $g(w)$ was sampled, conditioned on haplotypes $h_1$ and $h_2$, needs to accommodate genotyping errors. This is especially important in the context of IBD inference, where a single error may cause an entire region to be deemed non-IBD under a stricter model that assumes no error. Hence, we define $p(g(w) \mid h_1, h_2)$ as follows:

$$p(g(w) \mid h_1, h_2) = \begin{cases} 1 - \epsilon & \forall i \in m(w),\ g_i = a_i^1 + a_i^2 \\ \frac{\epsilon}{2} & \text{otherwise} \end{cases} \tag{6.4}$$

where the parameter $\epsilon$ is tuned to capture the amount of expected genotyping error, and $a_i^1$ and $a_i^2$ denote the alleles associated with marker $i$ in haplotypes $h_1$ and $h_2$, respectively. Finally, to accommodate missing data, whenever a genotype value for a particular window $w$ is missing, we set the probabilities under both models $P_I$ and $P_{\bar{I}}$ to 1, which in turn will not affect the $\Gamma_s$ score.
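The window likelihoods under the two models can be sketched by brute-force enumeration over a small set of window haplotypes. This is a minimal illustration, not the Parente2 implementation: the names `emission` and `window_llr` are hypothetical, haplotype frequencies are passed in as a plain dictionary, and the probability assigned to a genotype that is inconsistent with the haplotype pair (`mismatch_prob`, defaulting here to half the error rate) is an assumption of the sketch. The non-IBD sum is computed as a product of two per-individual marginals, which is algebraically the same as the four-haplotype sum because the two individuals share no hidden variable.

```python
import math
from itertools import product

def emission(genotype, h1, h2, eps, mismatch_prob):
    """p(g(w) | h1, h2): 1 - eps if every genotype call equals the allele sum."""
    consistent = all(g == a1 + a2 for g, a1, a2 in zip(genotype, h1, h2))
    return 1.0 - eps if consistent else mismatch_prob

def window_llr(g, g2, freqs, eps=0.005, mismatch_prob=None):
    """Log-likelihood ratio of IBD vs. non-IBD for one window.

    g, g2  : genotype tuples (minor allele counts 0/1/2) of the two individuals
    freqs  : dict mapping haplotype tuples (alleles 0/1) to frequencies
    """
    if mismatch_prob is None:
        mismatch_prob = eps / 2.0  # assumption of this sketch
    haps = list(freqs)

    def e(gen, a, b):
        return emission(gen, a, b, eps, mismatch_prob)

    # IBD model: both individuals carry the shared haplotype h1.
    p_ibd = sum(
        e(g, h1, h2) * e(g2, h1, h3) * freqs[h1] * freqs[h2] * freqs[h3]
        for h1, h2, h3 in product(haps, repeat=3)
    )
    # Non-IBD model: the sum over (h1, h2, h3, h4) factorizes into two marginals.
    p_g = sum(e(g, a, b) * freqs[a] * freqs[b] for a, b in product(haps, repeat=2))
    p_g2 = sum(e(g2, a, b) * freqs[a] * freqs[b] for a, b in product(haps, repeat=2))
    return math.log(p_ibd / (p_g * p_g2))
```

For example, with two equally frequent haplotypes `(0, 0)` and `(1, 1)`, a pair with identical homozygous genotypes receives a positive score, whereas a pair with opposite homozygous genotypes receives a negative one.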

We note that in the above model, the individuals can share at most a single haplotype. We further note that as the size of window $w$ grows, enumerating all possible haplotypes $h(w)$ within that window becomes impractical, as the time complexity is exponential in the number of markers within the window. To facilitate efficient computation, we approximate Equation 6.3 by summing over a subset of haplotypes for which the frequency $f_w(h)$ is highest. Under the assumption of linkage equilibrium, the equivalent of a block LRT score $\Lambda_B(g, g')$ can be directly computed without windows by using the sums of the logs of the genotype probabilities, as defined by Equation 6.3. We utilize the window-based $\Gamma_s$ formulation described in Equation 6.2 to facilitate the description of an extension that accounts for score variability within a window set, which we now derive.

6.2.2 Outer Log-Likelihood Ratio

The model described thus far can be extended to provide an efficient approach to identifying pairs of individuals that share a common ancestor. Namely, by simply summing over all window set scores $\Gamma_s$ that are associated with a particular block (i.e. $s \in B$), Equation 6.3 can be extended to detect blocks $B$ that are IBD.

Computing a single LLR score $\Gamma_s$ using a block-level naive Bayes model can alleviate some of the performance-related challenges, suggesting a computationally feasible and practical approach for the examination of IBD in large cohorts. Nonetheless, this naive Bayes model may be sensitive to single windows exhibiting highly variable scores. Namely, each block $B$ consists of window sets, and for each particular window set $s \in B$, the window-scores of a small subset of windows may play a dominant role in the determination of the final block score, reducing the statistical power of our classifier to that of the window subset. It is the uncontrolled high variability of such windows that limits the performance of the final likelihood-ratio-based test.

To correct for the detrimental impact of high-variance windows, we directly examine the performance of each window set $s$. The distribution of window set scores $\Gamma_s$ can be evaluated given the genotypes from unrelated individuals, and contrasted against the distribution of the same window set score given genotypes from related individuals. In other words, one can estimate the distribution of the scores for related and unrelated individuals with respect to windows $w \in s$. By contrasting these two distributions, it is possible to detect and control for the impact of highly variable windows and window sets. When a particular window set $s$ exhibits a similar distribution of $\Gamma_s$ under both models, the log of the ratio of probabilities for that score will be close to 0. Conversely, when the distribution of $\Gamma_s$ exhibits a high separation of IBD instances from those which are not IBD, the log-likelihood ratio for that score will be highly indicative of the underlying state.

Specifically, to apply such a correction, we treat the inner-LLR described in Equation 6.2 as a random variable $\Gamma_s$. We then define two empirical models for the distribution of $\Gamma_s$: one corresponding to the distribution of the score under related individuals, $Q_I(\Gamma_s)$, and a second corresponding to the distribution of the score given unrelated individuals, $Q_{\bar{I}}(\Gamma_s)$. We estimate these distributions by using phased training haplotypes to simulate $N_I$ pairs of individuals sharing IBD along the entire chromosome, and $N_{\bar{I}}$ pairs of individuals without any IBD segments along the length of the chromosome. Then, we compute the LLR for each window set for the IBD and non-IBD pairs in order to estimate the probability density functions $Q_I(\Gamma_s)$ and $Q_{\bar{I}}(\Gamma_s)$ via binning. We define numbins equally sized, non-overlapping bins that span the domain of each distribution, and we use a pseudocount $\rho$ for each bin. Our modified score, which we term the embedded likelihood ratio (ELR), is finally defined as:

$$\lambda_s(g, g') = \log \frac{Q_I(\Gamma_s = \gamma_s(g, g'))}{Q_{\bar{I}}(\Gamma_s = \gamma_s(g, g'))} \tag{6.5}$$

and the scores of the window sets associated with a block, $s \in B$, are combined via a naive Bayes model to create the block score:

$$\Lambda_B(g, g') = \sum_{s \in B} \lambda_s(g, g') \tag{6.6}$$
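The binning-based estimate of the embedded likelihood ratio can be sketched as follows. This is a minimal illustration under stated assumptions rather than the Parente2 implementation: both densities share one set of equal-width bins spanning the pooled range of training scores, scores outside that range are clamped to the outermost bins, and the names `bin_index`, `binned_probs`, and `make_elr` are hypothetical.

```python
import math

def bin_index(x, lo, hi, num_bins):
    """Index of the equal-width bin containing x, clamped to [0, num_bins - 1]."""
    width = (hi - lo) / num_bins
    return min(max(int((x - lo) / width), 0), num_bins - 1)

def binned_probs(scores, lo, hi, num_bins, rho):
    """Histogram estimate of a score distribution with pseudocount rho per bin."""
    counts = [rho] * num_bins
    for x in scores:
        counts[bin_index(x, lo, hi, num_bins)] += 1
    total = sum(counts)
    return [c / total for c in counts]

def make_elr(ibd_scores, non_ibd_scores, num_bins, rho):
    """Build a function mapping an inner-LLR value to its embedded likelihood ratio.

    ibd_scores / non_ibd_scores: lists of inner-LLR values for one window set,
    computed on simulated IBD and non-IBD training pairs (must not all be equal).
    """
    pooled = ibd_scores + non_ibd_scores
    lo, hi = min(pooled), max(pooled)
    q_ibd = binned_probs(ibd_scores, lo, hi, num_bins, rho)
    q_non = binned_probs(non_ibd_scores, lo, hi, num_bins, rho)

    def elr(gamma):
        b = bin_index(gamma, lo, hi, num_bins)
        return math.log(q_ibd[b] / q_non[b])

    return elr
```

A bin in which neither training set placed any score yields an ELR of zero, since both estimates fall back to the same pseudocount; the block score is then the sum of the per-window-set ELR values.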

The phased haplotypes used for training can be generated either from data sets containing trios or from computationally phased individuals. It is important to note that current phasing methods offer a sufficiently low switch-error rate that their performance should have a negligible effect when considering haplotypes within a window of moderate size. For our experiments in the Results and Discussion sections, we set $N_I = N_{\bar{I}} = 1000$, binsize = 30, and $\rho = 0.01$.

6.2.3 Genotyping-error function

In Equation 6.4 we describe the probability of genotypes given the hidden underlying haplotypes. The conditional probability $p_w(g \mid h_1, h_2)$ accounts for genotyping error. While providing a more realistic model, it can in fact reduce the statistical power to reject unrelated individuals. The lower power stems from the fact that the impact of reverse-homozygous genotypes is reduced; under the realistic model, such observations can be attributed to sampling errors rather than taken as an indication of unrelatedness. One can increase the penalty under such scenarios by controlling the genotyping error parameter $\epsilon$. Our method strives to reduce the number of false-positive pairs detected. Thus, we extend our method by introducing a genotyping-error function that increases the contrast between IBD and non-IBD segments. Specifically, when estimating the model parameters, we use $\epsilon$ as the genotyping error rate, whereas during inference, we replace $\epsilon$ in Equation 6.4 with a function $\phi(\epsilon) = v \cdot \epsilon$, where $v$ is a scaling factor. In the experiments in the Results and Discussion sections, we used $\epsilon = 0.005$ and $v = \frac{1}{100}$.
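As a worked illustration of the genotyping-error function, reading the reported values as an error rate of 0.005 and a scaling factor of 1/100 (an interpretation of the typeset fraction), the error rate used during inference would be:

$$\phi(\epsilon) = v \cdot \epsilon = \frac{1}{100} \times 0.005 = 5 \times 10^{-5}$$

Under this reading, a genotype pattern that is inconsistent with the hypothesized haplotypes is assigned a probability roughly one hundred times smaller at inference than during training, which is what increases the penalty for reverse-homozygous observations.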

6.2.4 Window and window set definitions

In the above description of our model, both windows and window sets were defined in general terms. There are several ways to instantiate the definitions of windows and window sets. In the experiments in the Results and Discussion sections, we used one of two different definitions of windows, which we term the standard window definitions and the augmented window definitions.

The standard window definitions were composed of non-overlapping windows such that all markers within each window were adjacent to one another, each window was adjacent to its neighboring window(s), and each window contained winsize markers. Formally, the $i$th window was composed of markers in the interval $[i \cdot \text{winsize},\ i \cdot \text{winsize} + \text{winsize})$.

The augmented window definitions included all of the windows in the standard window definitions, plus a collection of high-coverage overlapping windows with markers randomly chosen from a small genomic region. For these purposes, a genomic region was defined to be $r$ consecutive markers. For each marker in the data set, we defined a region starting at that marker such that neighboring regions overlapped heavily. For each region, we generated $c$ windows by randomly choosing winsize of the markers in the region. Because of this definition, markers within a window were not constrained to be adjacent to one another.

To define window sets, we used the same approach in all of our experiments, one that was very similar to the approach used for the standard window definitions. Specifically, we first sorted all windows by their leftmost marker, then by their rightmost marker. Each window set was defined to contain setsize windows such that the $i$th window set contained windows from the sorted list in the interval $[i \cdot \text{setsize},\ i \cdot \text{setsize} + \text{setsize})$.

Typical parameters used in our experiments were as follows: winsize = 5, setsize = 5, r = 40, c = 10.
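The window and window set definitions can be sketched in a few lines of Python. This is an illustrative sketch only: markers are represented by integer indices, trailing markers that do not fill a complete window, region, or window set are simply dropped (an assumption of the sketch), and the function names are hypothetical.

```python
import random

def standard_windows(num_markers, winsize):
    """Non-overlapping windows of `winsize` adjacent markers: [i*winsize, i*winsize + winsize)."""
    return [tuple(range(start, start + winsize))
            for start in range(0, num_markers - winsize + 1, winsize)]

def augmented_windows(num_markers, winsize, r, c, rng):
    """Standard windows plus c random windows drawn from each region of r consecutive markers."""
    windows = standard_windows(num_markers, winsize)
    for start in range(num_markers - r + 1):        # one region starting at each marker
        region = range(start, start + r)
        for _ in range(c):
            # markers in a random window need not be adjacent
            windows.append(tuple(sorted(rng.sample(region, winsize))))
    return windows

def window_sets(windows, setsize):
    """Sort windows by leftmost then rightmost marker and group them into sets of `setsize`."""
    ordered = sorted(windows, key=lambda w: (w[0], w[-1]))
    return [ordered[i:i + setsize]
            for i in range(0, len(ordered) - setsize + 1, setsize)]
```

For instance, `window_sets(augmented_windows(50, 5, 40, 10, random.Random(0)), 5)` groups 10 standard and 110 random windows into 24 window sets of five windows each.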

6.2.5 Window filter

In our experiments, we used a window filter to remove a portion of the windows that may be less informative than other windows in the same region of the chromosome for predicting IBD segments. We define an informative window to be one that reliably outputs low scores for non-IBD pairs and high scores for IBD pairs. To identify this property in windows, we utilized training simulations composed of examples of pairs of individuals with and without IBD at each window. For a given window, we found that the separation between the distributions of scores of IBD and non-IBD pairs was inversely correlated with the variance of the scores of the IBD pairs for the window (results not shown). Therefore, as a proxy for the informativeness of a window, we measured the negative variance of the ELR scores of the simulated training IBD pairs in the window. We defined regions for filtering purposes as the non-overlapping 0.05 cM segments that tile the chromosome. When applying the window filter, we removed the 20% of the windows within each region with the lowest informativeness.
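The filtering rule can be sketched as follows. The sketch makes several assumptions that the text does not specify: each window is assigned to a region by a single representative position in cM, the number of windows to drop in a region is rounded down, and the population variance is used. The function name `filter_windows` and the input layout are hypothetical.

```python
from collections import defaultdict
from statistics import pvariance

def filter_windows(windows, region_size_cm=0.05, drop_fraction=0.2):
    """Keep the most informative windows in each region.

    windows: list of (window_id, position_cm, ibd_training_scores), where
    ibd_training_scores are the scores of simulated IBD pairs for that window.
    Returns the ids of the windows that survive the filter.
    """
    by_region = defaultdict(list)
    for window_id, position_cm, scores in windows:
        informativeness = -pvariance(scores)          # negative variance as the proxy
        region = int(position_cm // region_size_cm)   # non-overlapping tiling regions
        by_region[region].append((informativeness, window_id))

    kept = []
    for region in sorted(by_region):
        ranked = sorted(by_region[region])            # least informative first
        n_drop = int(len(ranked) * drop_fraction)     # rounded down (assumption)
        kept.extend(window_id for _, window_id in ranked[n_drop:])
    return kept
```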

6.2.6 Decreasing running time with the SpeeDB filter

Parente2 can be applied together with SpeeDB [26], a coarse-grained filter designed to reduce the computations required for inferring IBD segments. When considering a pair of individuals, SpeeDB operates by quickly filtering out regions of the genome that are unlikely to be IBD between the two individuals. These filtered-out regions can then be ignored by a downstream IBD detection method (e.g. Parente2), thereby reducing the time spent running the detection method. SpeeDB is able to remove vast amounts of the genome while maintaining very high sensitivity; that is, true IBD segments are rarely filtered out. The running time of SpeeDB is negligible compared to the running time of Parente2, so using them together results in an overall decrease in running time.

SpeeDB is based on a simple observation: in IBD regions shared by two individuals, there should be no "opposite homozygous loci" where each individual is homozygous for a different allele (unless there is a genotyping error or a biological mutation). For a given pair of individuals, SpeeDB uses a statistical model to identify regions where the pair has too many opposite homozygous loci for the region to be within an IBD segment. The following is a brief overview of how SpeeDB functions. SpeeDB takes as input a probability threshold, $p_{th}$, for its model. Based on this, it computes a region-specific threshold on the maximum number of opposite homozygous loci to tolerate in each region, based on the number of markers and the allele frequencies in the region. The full description of SpeeDB can be found in the work by Huang et al. [26].
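The observation underlying the filter can be illustrated with a short sketch that counts opposite homozygous loci between two genotype vectors. It does not reproduce SpeeDB's statistical model or its region-specific thresholds; the function name `count_opposite_homozygous` is hypothetical.

```python
def count_opposite_homozygous(g, g2):
    """Count markers where one individual has genotype 0 and the other has genotype 2.

    g, g2: sequences of minor allele counts (0, 1, or 2) over the same markers.
    """
    return sum(1 for a, b in zip(g, g2) if {a, b} == {0, 2})
```

A region in which this count exceeds the tolerated maximum would be discarded before running the more expensive IBD detection method.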

6.2.7 Facilitating larger window sizes

The likelihood functions presented in Equation 6.3 can be computed in time that is quadratic in the number of states of the hidden variables, that is, the number of possible window haplotypes. Given $k$ markers in a window, there are $2^k$ possible distinct haplotypes composed of $k$ markers. Therefore, the overall complexity of computing the likelihood for a given window is $O(2^{2k})$. When $k = 10$, it would require iterating over more than one million haplotype pairs to evaluate the likelihood of one window for one pair of individuals. We observed, though, that while there are 1,024 possible distinct 10-marker haplotypes, the number of actual haplotypes that exist in the human population is much smaller. Therefore, as a time-saving heuristic, instead of iterating over all possible haplotypes for a given window, we iterate only over the top $H$ most common window haplotypes observed in the training data. In practice, we found that for a window size of 10 markers, 99% of windows on the long arm of chromosome 1 had no more than 50 distinct haplotypes in our data sets. Therefore, in our experiments in the Results and Discussion sections, we set $H = 50$, which, we note, only had an effect when we ran Parente2 with larger window sizes.
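Restricting the enumeration to the most common window haplotypes can be sketched as below. With 10 markers, the full enumeration covers 2^20 = 1,048,576 haplotype pairs, whereas 50 retained haplotypes give 50^2 = 2,500 pairs. The sketch assumes phased training haplotypes given as lists of 0/1 alleles and leaves the retained frequencies un-renormalized; both choices, and the function name, are assumptions made for illustration.

```python
from collections import Counter

def top_haplotype_frequencies(training_haplotypes, window, H):
    """Frequencies of the H most common haplotypes restricted to the markers in `window`.

    training_haplotypes: list of phased haplotypes (sequences of 0/1 alleles)
    window: iterable of marker indices belonging to the window
    """
    window = tuple(window)
    counts = Counter(tuple(hap[i] for i in window) for hap in training_haplotypes)
    total = len(training_haplotypes)
    return {hap: n / total for hap, n in counts.most_common(H)}
```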

6.3 Results

In this section, we will describe a number of experiments that we performed in order to evaluate and characterize the performance of Parente2 and to compare it to other IBD inference methods.

6.3.1 Data sets

Two data sets were used for all experiments in this work: the HapMap Phase III panel (HapMap) [4], and the Wellcome Trust Case Control Consortium (WTCCC) [62]. We used haplotypes from three Asian populations in HapMap: Han Chinese in Beijing, China (CHB); Japanese in Tokyo, Japan (JPT); and Chinese in Metropolitan Denver, Colorado (CHD). This resulted in a total of 520 haplotypes from HapMap. Because haplotypes were necessary to perform simulations, the genotype data in WTCCC was first phased using HAPI-UR [60], which resulted in 2,960 haplotypes. For both data sets, we restricted analyses to the long arm of chromosome 1 in order to expedite the running of the experiments; this resulted in 46,580 markers for HapMap and 14,777 markers for WTCCC.

For each data set, we randomly partitioned the individuals' data into training and testing sets; one third was used for training, and two thirds were used for testing. In order to break up latent IBD segments in the testing data set, we created composite individuals from the original haplotype pairs for each individual in a manner similar to Browning et al. [8]; however, our protocol retained 50% of the data rather than the 10% retained in their work. Briefly, each composite individual was created by tiling 0.2 cM segments from different source individuals in the testing data. This was done in such a way that at each marker, each source was guaranteed to appear in at most one composite individual. Without loss of generality, with $N_{test}$ individuals in a given test data set ($N_{test} = 170$ for HapMap and $N_{test} = 980$ for WTCCC), we created $\lfloor N_{test}/2 \rfloor$ composite individuals. For the $i$th composite individual, we first chose a random offset, $O$, in cM such that $0 \le O < 0.2$. The first segment of each composite individual was created to be $O$ cM (the 0-th segment) and each subsequent segment was 0.2 cM in size (segments 1 to $S$). Formally, the haplotype pairs of the $j$th segment of the $i$th composite individual (with $i$ and $j$ being zero-based) were copied directly from the corresponding segment of source individual $k$ (zero-based) with $k = (j + 2i) \bmod N_{test}$. For example, if $N_{test} = 6$ and the chromosome were 1.8 cM in size, we would produce 3 composite individuals, each with 9 segments, with the source individuals of each segment as shown in Table 6.1. In this way, we significantly reduced latent IBD in the testing data sets prior to simulating IBD individuals. The resulting composite testing data sets had the same number of haplotypes as the training data sets (170 for HapMap and 980 for WTCCC).

                                 j-th segment
i-th composite individual    0  1  2  3  4  5  6  7  8
            0                0  1  2  3  4  5  0  1  2
            1                2  3  4  5  0  1  2  3  4
            2                4  5  0  1  2  3  4  5  0

Table 6.1: Example of tiling method used to break up latent IBD. In this example, 6 source individuals are used to create 3 composite individuals, each having 9 genomic segments (e.g. assuming a chromosome of length 1.8 cM with a segment size of 0.2 cM). Each entry in the table contains the index of the source individual used for the jth genomic segment of the ith composite individual.
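The assignment of source individuals in Table 6.1 can be reproduced with a short sketch that uses the rule k = (j + 2i) mod N_test, which is the form consistent with every entry of the table. The function names are hypothetical.

```python
def source_individual(i, j, n_test):
    """Zero-based source individual for segment j of composite individual i."""
    return (j + 2 * i) % n_test

def tiling_table(n_test, num_segments):
    """One row per composite individual; each entry is the source index for that segment."""
    return [[source_individual(i, j, n_test) for j in range(num_segments)]
            for i in range(n_test // 2)]
```

Calling `tiling_table(6, 9)` yields the three rows of Table 6.1, and within any single segment no source individual is used by more than one composite individual.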

6.3.2 Simulations

Training simulations

Parente2 requires a small amount of phased training data to build a model for each window and window set. To build these models during training, Parente2 generates pairs of IBD and non-IBD segments from the training data. In our simulations, Parente2 generated 1,000 pairs of individuals sharing an IBD segment along the entire length of the chromosome (that is, each pair of individuals shared one haplotype across the entire chromosome) and 1,000 pairs of individuals without any IBD segments.

Testing simulations

To evaluate the performance of Parente2 and other methods, we created simulated pairs of related individuals that shared one IBD segment of a fixed size in each experiment. We performed four simulations that differed by the data set used (either HapMap or WTCCC) and the size of the shared IBD segments (either 2 cM or 4 cM). We named these four simulations HapMap-2cM, HapMap-4cM, WTCCC-2cM, and WTCCC-4cM. Each simulation was composed of 30 bootstrapped trials. For each trial, we generated a number of pairs of individuals sharing an IBD segment by choosing two random haplotypes per individual without replacement; 28 pairs were generated for each HapMap simulation and 40 pairs were generated for each WTCCC simulation. Within each trial, no haplotype from the testing data set was used in more than one pair; however, each haplotype was used in multiple trial data sets. To simulate an IBD segment in each pair, we chose a random location to start the IBD segment and copied the alleles of one haplotype from one individual in the pair over one of the haplotypes of the other individual in the pair. Next, we simulated genotyping errors for each generated individual, using a genotyping error rate of 0.005, such that when an error was introduced, the genotype was changed to one of the other two genotypes with equal probability. Thus, each HapMap trial data set contained 56 individuals so that $\binom{56}{2}$ (1,540) pairs of individuals were evaluated per trial. Each WTCCC trial data set contained 80 individuals so that $\binom{80}{2}$ (3,160) pairs were evaluated per trial.
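The two simulation steps, copying a haplotype segment between individuals and then adding genotyping errors, can be sketched as follows. The sketch is illustrative only: individuals are pairs of 0/1 allele lists, the IBD segment is specified in marker indices rather than cM, and the function names are hypothetical.

```python
import random

def simulate_ibd_pair(ind_a, ind_b, start, length, rng):
    """Copy a segment of one haplotype of ind_a over one haplotype of ind_b.

    ind_a, ind_b: (hap0, hap1) pairs of equal-length 0/1 allele lists.
    Returns (ind_a, modified copy of ind_b); the inputs are not mutated.
    """
    source = ind_a[rng.randrange(2)]
    new_b = [list(ind_b[0]), list(ind_b[1])]
    target = new_b[rng.randrange(2)]
    target[start:start + length] = source[start:start + length]
    return ind_a, (new_b[0], new_b[1])

def genotypes_with_errors(individual, error_rate, rng):
    """Genotype calls (0/1/2) with errors: an erroneous call is replaced by one of
    the other two genotypes with equal probability."""
    calls = []
    for a1, a2 in zip(*individual):
        g = a1 + a2
        if rng.random() < error_rate:
            g = rng.choice([x for x in (0, 1, 2) if x != g])
        calls.append(g)
    return calls
```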

Measuring accuracy

For each method, we measured the positional accuracy and pairwise accuracy. In both cases, accuracy was taken as the average over all the trials in the simulation. For positional accuracy, the basic units evaluated were tuples containing a marker, a pair of individuals, and a label indicating whether or not the marker was a part of an IBD segment shared by the two individuals in the pair. Each IBD inference method output entries containing a pair of individuals and the coordinates of a predicted IBD segment shared by the pair. A tuple was considered to be labeled "true" by an IBD inference method if the marker in the tuple appeared in a predicted IBD segment output by the method. Likewise, the actual label for a tuple was "true" if the pair was simulated to share an IBD segment and the marker was within the simulated IBD segment. For pairwise accuracy, the position information was disregarded, so the tuples evaluated contained only a pair and a label. In this case, a tuple was considered to be labeled "true" by an IBD inference method if it output at least one predicted IBD segment between the pair of individuals, and its actual label was "true" if the pair of individuals was simulated to share an IBD segment.

To estimate the true positive rate (TPR) and false positive rate (FPR), we generated many points along a receiver operating characteristic curve (ROC curve) for each method based on the score output by the method and adjusting a score threshold for the method. The exception to this was GERMLINE, since it did not output a score; however, several performance points were collected by changing non-threshold parameters. We estimated TPR and FPR values between points on the curve via linear interpolation; however, the distance between points was very small.
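Reading off the sensitivity at a fixed false positive rate by linear interpolation between neighboring ROC points can be sketched as below. The function name `tpr_at_fpr` and the choice to raise an error outside the covered FPR range are assumptions of this sketch.

```python
def tpr_at_fpr(roc_points, target_fpr):
    """Linearly interpolate the TPR at `target_fpr` from a list of (fpr, tpr) points."""
    points = sorted(roc_points)
    for (f0, t0), (f1, t1) in zip(points, points[1:]):
        if f0 <= target_fpr <= f1:
            if f1 == f0:                      # vertical step: take the higher TPR
                return max(t0, t1)
            return t0 + (t1 - t0) * (target_fpr - f0) / (f1 - f0)
    raise ValueError("target FPR is outside the range covered by the ROC points")
```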

6.3.3 Experimental parameters

Unless otherwise specified, all of the following pertains to the results reported in the Results and Discussion sections.

• The HapMap-2cM simulation was used to measure accuracy and running time of Parente2 and other methods.

• The accuracy (sensitivity and false positive rate) reported in each experiment is the pairwise accuracy.

• The augmented window definitions were used.

• When defining windows, winsize was set to 5.

• For the augmented window definitions, r = 40 and c = 10, which resulted in each marker being included in 11 windows on average.

• When defining window sets, setsize = 5 windows per window set.

• When SpeeDB was used, the threshold $p_{th}$ was set to 0.1.

• fastIBD was run using a minimum IBD segment size of 1 cM since this resulted in better performance than the default or than using an IBD segment size close to the target IBD segment size.

• We ran fastIBD using the nsamples parameter set to 20 to achieve better performance given relatively small simulation sizes.

• According to the authors' recommendations to achieve higher accuracy, we ran fastIBD 10 times with 10 different seeds and merged the results by setting the score of each position to be the maximum score observed in any of the 10 runs.

• GERMLINE was run with default parameters, except that the bits parameter was set to either 64 or 128 and the minimum segment size parameter was set to 0, as this yielded the best results.

6.3.4 Accuracy of Parente2 compared to other methods

Using simulations based on the HapMap data set, we evaluated the performance of Parente2 and compared it to other methods that represent the current state of the art: Parente, fastIBD, and GERMLINE. We evaluated the accuracy of these methods with respect to detecting which pairs of individuals in a cohort share at least one IBD segment, as well as the positional accuracy in detecting the location of the simulated IBD segments. We performed two simulation experiments, one where the pairs of individuals shared a single 4 cM IBD segment, and one where they shared an IBD segment 2 cM in length. The individuals sharing an IBD segment are termed related pairs of individuals. Because cohort sizes continue to grow, it becomes increasingly important to have a low false positive rate to avoid reporting spurious relationships. We therefore first focused on evaluating the sensitivity (true positive rate, TPR) of the methods at low false positive rates; the results are shown in Table 6.2A. In the 4 cM case, Parente2 detected pairs with 99.9% sensitivity at a 0.1% FPR; for the 2 cM case, it achieved 78.7% sensitivity at a 1% FPR. Both of these accuracy measurements far surpass the performance of Parente, fastIBD, and GERMLINE. Table 6.2B shows the performance of the methods at higher FPR values and demonstrates that Parente2 also outperforms other methods in higher FPR scenarios. Finally, Table 6.3 shows the positional accuracy of Parente2 and other methods and demonstrates that Parente2 also infers the location of IBD segments more accurately than other state-of-the-art methods. It is notable that in the 2 cM case, Parente2 is far more accurate than Parente both in terms of pairwise accuracy and positional accuracy.

(A)
                              4 cM                  2 cM
Method                  TPR (%)   FPR (%)     TPR (%)   FPR (%)
Parente2                 99.9      0.1         78.7      1
Parente2 with SpeeDB     99.5      0.1         79.2      1
Parente                  61.6      0.1         11.4      1
fastIBD                  12.4      0.1         32.4      1
GERMLINE-128             47.6      0.23        49.1      17.5

(B)
                              4 cM                  2 cM
Method                  TPR (%)   FPR (%)     TPR (%)   FPR (%)
Parente2                100        1           92.6      5
Parente2 with SpeeDB     99.6      1           92.9      5
Parente                  69.7      1           40.2      5
fastIBD                  65.5      1           72.6      5
GERMLINE-64              87.9      2.5         98.0      79.9

Table 6.2: Pairwise accuracy of Parente2 and other methods. Table 6.2A shows sensitivity at lower false positive rates and Table 6.2B shows sensitivity at higher false positive rates. fastIBD was run ten times with ten different seeds according to author recommendations. For the GERMLINE-64 and GERMLINE-128 rows, GERMLINE was run on phased data with GERMLINE's seed size set to 64 and 128, respectively.

fastIBD was run ten times with ten different seeds and the results were merged, according to the authors' recommendation, and a minimum segment size of 1 cM was used for fastIBD because it yielded the best performance. Each run of fastIBD used the nsamples parameter set to 20 according to the recommended guidelines, and all other parameters were set to default values. To evaluate GERMLINE, the data was first phased using BEAGLE [12] according to the pipeline provided with the GERMLINE software. GERMLINE was applied to this phased data with default parameters, except that the -bits parameter was set to 128 to achieve higher specificity and to 64 to achieve lower specificity (see the GERMLINE-64 and GERMLINE-128 entries in Tables 6.2 and 6.3).

                              4 cM                  2 cM
Method                  TPR (%)   FPR (%)     TPR (%)   FPR (%)
Parente2                 78.0      0.01        86.0      0.05
Parente2 with SpeeDB     80.0      0.01        86.5      0.05
Parente                  71.2      0.01        17.8      0.05
fastIBD                  38.6      0.01        50.4      0.05
GERMLINE-64               4.6      0.015        5.6      2.9
GERMLINE-128              2.7      0.005        1.7      0.44

Table 6.3: Positional accuracy of Parente2 and other methods. Accuracy was measured based on the portion of the genome that was in or not in a simulated IBD segment for each pair of individuals.

6.3.5 Augmented window definitions and window filter yield better performance

We sought to determine how Parente2's accuracy changed by using the augmented window definitions (compared to the standard definitions) and the effect of using the window filter. Therefore, we applied Parente2 in four different configurations. First, we ran Parente2 with the standard and augmented window definitions without using the window filter to observe the effect of using the augmented window definitions. Then, we ran Parente2 again on the two window definitions, this time using the window filter, in order to observe the effect of the window filter on both sets. The results of these experiments are shown in Figure 6.2.

Figure 6.2: The effect of the window filter and the augmented window definitions on Parente2’s accuracy. The augmented window definitions had higher accuracy than the standard window definitions. The window filter improved the accuracy of Parente2 when using the augmented window definitions but reduced the accuracy when using the standard window definitions. The sensitivity shown is at a 1% false positive rate.

Parente2 had significantly higher accuracy when using the augmented window definitions compared to the standard window definitions; at a FPR of 1%, Parente2 had 78.7% sensitivity using the augmented window definitions versus 69.0% sensitivity with the standard window definitions. With the window filter, Parente2 had higher sensitivity than without it when used on the augmented window definitions (79.7% versus 78.4%). For the standard window definitions, Parente2 had lower sensitivity when using the window filter than when not using it (65.9% versus 69.0%). One explanation for the latter result is that the standard window definitions have no redundancy between windows: removing a significant fraction of the windows would leave many gaps along the chromosome with no window covering them, thus not allowing Parente2 to infer IBD status in those regions.
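The coverage argument above can be made concrete with a toy sketch. The window layouts here are hypothetical: the "standard" set is modeled as consecutive non-overlapping windows, and redundancy is modeled by adding a second layer of interleaved windows of non-adjacent markers, which is only a stand-in for the augmented definitions used in this chapter.

```python
def uncovered_markers(windows, kept_indices, n_markers):
    """Count markers not covered by any window that survives filtering."""
    covered = set()
    for idx in kept_indices:
        covered.update(windows[idx])
    return n_markers - len(covered)


n_markers = 20
# Hypothetical standard set: consecutive, non-overlapping 5-marker windows.
standard = [tuple(range(i, i + 5)) for i in range(0, n_markers, 5)]
# Hypothetical redundant layer: windows of non-adjacent markers (stride 4).
interleaved = [tuple(range(i, n_markers, 4)) for i in range(4)]
augmented = standard + interleaved

# Suppose a filter discards the second standard window (index 1).
gap_standard = uncovered_markers(standard, [0, 2, 3], n_markers)
gap_augmented = uncovered_markers(
    augmented, [i for i in range(len(augmented)) if i != 1], n_markers
)
print(gap_standard, gap_augmented)
```

In the non-redundant layout the discarded window leaves its five markers uncovered, whereas in the redundant layout the same markers remain covered by other windows.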

6.3.6 ELR yields better performance

Figure 6.3: Comparing the accuracy of the embedded likelihood ratio (ELR) scoring function to the likelihood ratio (LR) scoring function. The embedded likelihood ratio outperforms the likelihood ratio in all of the configurations evaluated.

We also evaluated the extent to which the ELR improves performance for Parente2. Using the four scenarios described above (using the standard or augmented window definitions, with or without the window filter), we measured the performance of Parente2 using the LR and compared it to the performance of using the ELR. When using the LR, we used setsize = 1 since the outer LLR was not performed and therefore did not require windows to be grouped together. The sensitivity of Parente2 (at a 1% false positive rate) in each configuration of this experiment can be seen in Figure 6.3. The results show that the ELR significantly outperforms the LR in all of the scenarios evaluated: for both the standard and augmented window definitions, with and without the window filter.
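Sensitivity at a fixed false positive rate is the metric reported throughout these experiments. A minimal sketch of how such a value can be read off a set of scores follows; it assumes one score per true-IBD (positive) and non-IBD (negative) example and that higher scores indicate IBD. The function name and the strict-threshold convention are illustrative choices rather than a description of the evaluation scripts used here.

```python
def sensitivity_at_fpr(pos_scores, neg_scores, target_fpr):
    """Sensitivity at the most permissive threshold with FPR <= target_fpr.

    An example is called IBD when its score is strictly greater than
    the threshold.
    """
    neg_sorted = sorted(neg_scores, reverse=True)
    # Number of negatives we may misclassify (small epsilon guards
    # against floating point error in the product).
    allowed_fp = int(target_fpr * len(neg_sorted) + 1e-9)
    if allowed_fp >= len(neg_sorted):
        return 1.0
    # At most `allowed_fp` negatives score strictly above this cutoff.
    cutoff = neg_sorted[allowed_fp]
    true_positives = sum(1 for s in pos_scores if s > cutoff)
    return true_positives / len(pos_scores)


negatives = list(range(100))          # toy non-IBD scores 0..99
positives = [50, 98.5, 99.5, 120]     # toy IBD scores
sens = sensitivity_at_fpr(positives, negatives, target_fpr=0.01)
print(sens)
```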

Figure 6.4: The effect of window size on Parente2’s performance. Increasing the window size of Parente2 results in better performance.

6.3.7 Increasing window size increases performance

We measured the effect of window size on the performance of Parente2. We ran Parente2 using augmented window definitions with the number of markers per window set between 3 and 10. The sensitivity of Parente2, at a 1% FPR, for each experiment is shown in Figure 6.4. These results indicate that larger window sizes result in higher accuracy for Parente2. We observe that the additional benefit of using window sizes above 7 markers per window diminishes. This could be due in part to the fact that the haplotype diversity of larger windows is not sufficiently captured, since the maximum common haplotypes parameter H was set to 50.
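As a rough illustration of why the cap H may matter for larger windows, the sketch below compares H = 50 with the number of haplotypes that are possible in a window, assuming every marker is biallelic. This is only an upper bound: linkage disequilibrium means far fewer haplotypes are typically observed, and how haplotypes beyond the cap are handled is not described in this section.

```python
def max_possible_haplotypes(markers_per_window):
    # Upper bound, assuming every marker in the window is biallelic.
    return 2 ** markers_per_window


H = 50  # maximum common haplotypes per window, as reported in the text

table = {k: max_possible_haplotypes(k) for k in range(3, 11)}
sizes_exceeding_cap = [k for k, n in table.items() if n > H]
for k, n in table.items():
    print(f"{k} markers: up to {n} haplotypes, can exceed H={H}: {n > H}")
```

Under this bound, windows of 5 or fewer markers can never exceed the cap, while windows of 6 or more markers can.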

6.3.8 Parente2 is robust to window set size

We also evaluated the effect of using different window set sizes with Parente2. Using the augmented window definitions with 5 markers per window, we adjusted the number of windows per window set from 3 to 8. At a 1% false positive rate, we found that the sensitivity varied by at most 0.7% (between 78.9% and 79.6%) and, therefore, conclude that Parente2 is robust to the choice of window set size.

6.4 Discussion

In this section, several topics relevant to the practical application of Parente2 are covered.

6.4.1 Speed and accuracy tradeoff

Method             Sensitivity (%)   FPR (%)   Running time   Pairs/sec
Parente2           78.7              1         3.9 hr         1.1
Parente2-SpeeDB    79.2              1         24 min         10.7
Parente2-Std.      69.0              1         7 min          36.7
Parente            11.4              1         78 sec         197.4
fastIBD            32.4              1         20.9 hr        0.2
GERMLINE-64        98.0              79.9      1.5 hr         2.9
GERMLINE-128       49.1              17.5      1.5 hr         2.9

Table 6.4: The accuracy and running time of evaluated IBD inference methods. Each method was used to detect 2 cM IBD segments in 10 trials of the HapMap data set. The Parente2 entry represents Parente2 run using the augmented window set with the window filter. Parente2-SpeeDB is the same but with the application of the SpeeDB filter. The Parente2-Std. entry represents Parente2 run using the standard window set without the window filter and without SpeeDB. fastIBD was run ten times with ten different random seeds according to the authors’ recommendations, and the sum of the running times of all ten runs is reported. GERMLINE-64 and GERMLINE-128 refer to running GERMLINE with seed sizes of 64 and 128, respectively. The phasing pipeline provided with GERMLINE was used to phase the data prior to running GERMLINE, and its running time is included in the reported running time. The number of pairs of individuals processed by each method per second is reported in the Pairs/sec column.

When detecting IBD segments between all pairs of individuals in a group, the total number of pairings grows quadratically with the number of individuals in the group. For IBD detection methods that evaluate all possible pairings, this large number of comparisons exacerbates the problem of potential false positives and makes IBD inference computationally demanding. In the Results section, we demonstrated the high sensitivity of Parente2 at a low false positive rate. In light of the fact that cohorts are growing ever larger, the running time of an IBD inference method is also increasingly important.
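The quadratic growth described above, and its effect on the absolute number of false positives, can be sketched as follows. The pairwise false positive rate in the example is an arbitrary illustrative value, and the calculation assumes, for simplicity, that no pair in the cohort is truly IBD.

```python
def n_pairs(n_individuals):
    """Number of unordered pairs in an all-by-all comparison."""
    return n_individuals * (n_individuals - 1) // 2


def expected_false_positive_pairs(n_individuals, pairwise_fpr):
    # Assumes, for illustration only, that no pair is truly IBD.
    return n_pairs(n_individuals) * pairwise_fpr


for n in (1_000, 10_000, 1_000_000):
    print(n, n_pairs(n), expected_false_positive_pairs(n, 1e-4))
```

A tenfold increase in cohort size raises the number of pairs, and therefore the expected number of false positive pairs at a fixed rate, by roughly a hundredfold.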

Parente2 can be run with several different settings that affect both running time and accuracy. We compared Parente2 to Parente and other methods to evaluate the running time as well as the accuracy of each method (see Table 6.4). Running Parente2 using the augmented window set was five times faster than running fastIBD and also resulted in significantly higher accuracy. When using the SpeeDB filter, however, Parente2 ran 50 times faster than fastIBD and also had slightly higher accuracy than without the filter. If even lower running time is necessary, Parente2 can be run using the standard window set instead of the augmented window set, which resulted in a further 3-fold reduction in running time in our experiments; however, this came at the cost of losing some accuracy. Parente was by far the fastest method evaluated, with the latest version running nearly 1,000 times faster than fastIBD. However, its low accuracy on 2 cM IBD segments makes it less applicable to detecting very small IBD segments.
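The speedups quoted above follow directly from the running times in Table 6.4. The sketch below recomputes them; the running times are copied from the table and converted to seconds, and the dictionary and function names are illustrative.

```python
# Running times from Table 6.4, converted to seconds.
running_time_sec = {
    "Parente2": 3.9 * 3600,
    "Parente2-SpeeDB": 24 * 60,
    "Parente2-Std.": 7 * 60,
    "Parente": 78,
    "fastIBD": 20.9 * 3600,
}


def speedup(slow, fast):
    """How many times faster `fast` ran than `slow`."""
    return running_time_sec[slow] / running_time_sec[fast]


print(round(speedup("fastIBD", "Parente2"), 1))               # augmented set
print(round(speedup("fastIBD", "Parente2-SpeeDB"), 1))        # with SpeeDB
print(round(speedup("Parente2-SpeeDB", "Parente2-Std."), 1))  # standard set
print(round(speedup("fastIBD", "Parente"), 1))                # Parente
```

The 3-fold figure corresponds to comparing the standard window set (7 min) with the SpeeDB-filtered augmented run (24 min).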

6.4.2 Amount of training data required

In order to perform inference, Parente2 requires estimated haplotype frequencies. These frequencies are empirically estimated by examining phased training data. Given that the amount of training data available is limited, and that a small sample size can result in very coarse estimates for frequencies, we were interested in evaluating how much training data is necessary to achieve good performance from Parente2. To do this, we ran an experiment where we measured the sensitivity of Parente2 when changing the number of individuals used for training between 50 and 500 (see results in Figure 6.5). We used the WTCCC-2cM data set for this experiment due to the larger number of available training individuals, allowing for better trend resolution. We observed diminishing returns as the number of training individuals increased: 250 training individuals were sufficient to achieve near-peak performance with the parameters used.

Figure 6.5: Parente2’s sensitivity as a function of the number of training individuals. Parente2 was run on the WTCCC 2 cM data set. The vertical axis shows sensitivity at a 1% false positive rate.
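To illustrate what estimating window haplotype frequencies from phased training data involves, and why small training sets give coarse estimates, a minimal sketch follows. It assumes phased haplotypes are encoded as strings of '0'/'1' alleles and that a window is a tuple of marker indices; the function name, the encoding, and the choice to simply drop haplotypes beyond the `max_common` most frequent ones are assumptions for illustration, not Parente2's implementation.

```python
from collections import Counter


def window_haplotype_frequencies(haplotypes, window, max_common=50):
    """Empirical frequencies of the haplotypes observed in one window.

    haplotypes: phased haplotypes as equal-length strings of '0'/'1'.
    window:     tuple of marker indices defining the window.
    max_common: keep only this many of the most frequent haplotypes
                (how the remainder is treated is an open assumption).
    """
    counts = Counter("".join(h[i] for i in window) for h in haplotypes)
    total = sum(counts.values())
    return {hap: c / total for hap, c in counts.most_common(max_common)}


training = ["00110", "00110", "01110", "00111"]
freqs = window_haplotype_frequencies(training, window=(1, 2, 3))
print(freqs)

# With n training individuals there are 2n haplotypes, so the smallest
# nonzero frequency that can be estimated is 1 / (2n).
resolution_85 = 1 / (2 * 85)
print(resolution_85)
```

With 85 training individuals, for example, no haplotype can receive an estimated frequency between 0 and roughly 0.6%, which is one way to see why additional training data may help.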

In all our experiments on the HapMap data set, we used only 85 training individuals, so the WTCCC results suggest that Parente2’s performance on HapMap may increase with additional training data. We are encouraged that the number of individuals necessary to achieve maximum performance is likely to be relatively small, as was the case for the WTCCC data set. We note that the number of training individuals needed to reach saturation is likely to depend on the window size used as well as the haplotype diversity in the cohort evaluated.

6.4.3 When no training data is available

This work prompted a follow-up question: how would one use Parente2 in the absence of explicit training data to estimate haplotype frequencies? Ideally, Parente2 would be trained one time for a given population (e.g., ethnicity) for a particular platform, and all subsequent applications of Parente2 would reuse the frequencies. However, there may be scenarios when one only has access to a single genotyped cohort without any external training data. In this case, one may attempt to learn haplotype frequencies from the cohort directly after first phasing the data. We sought to investigate whether using these haplotype frequencies would result in a model yielding reasonable performance or whether overfitting would occur and the model would generalize poorly. To answer this question, we measured Parente2’s performance on the WTCCC-2cM data set when we used haplotype frequencies estimated from the testing data itself and compared it to its performance when learning haplotype frequencies from explicit training data. In both cases, we used 370 individuals to control for training set size. We found that the sensitivity of Parente2 was 68.7% when trained on training data, and 72.0% when trained on the testing data (both at a 1% FPR). With a difference in sensitivity of only 3.3%, this result indicates that the model does not significantly overfit when using the testing data to estimate haplotype frequencies. Therefore, we expect that Parente2 would perform well in a real-world scenario when one does not have access to explicit training data.

6.4.4 Recommended settings for Parente2

In this work so far, we have discussed several different modes and parameters for Parente2. Here, we aim to discuss practical recommendations for Parente2’s settings.

We found that Parente2 performs best overall when using the embedded likelihood ratio scoring function on the augmented window set with the window filter and using SpeeDB. In all experiments, we found that SpeeDB had a small effect on performance (a slight positive effect for 2 cM segments and a slight negative effect in 4 cM) but resulted in a large decrease in running time.

In the case of our WTCCC experiments, when evaluated at the higher-specificity portion of the ROC curve, using the standard window set with the non-embedded likelihood ratio provided performance equivalent to the recommended settings (59.4% vs. 59.5% sensitivity at a 0.1% FPR, respectively). For the lower-specificity portion of the curve, those settings yielded slightly better performance than our recommended settings (81.5% vs. 79.4% sensitivity at a 1% FPR, respectively). However, using the standard window set and the non-embedded likelihood ratio on the HapMap data set results in significantly lower performance than the recommended settings (61.9% vs. 78.7% sensitivity at a 1% FPR, respectively), so we do not advise using these settings over the ones we recommend.

We do not have a specific recommended threshold setting as the requirements for sensitivity and false positive rate will vary by application. The target IBD segment size has a large effect on the accuracy of Parente2, so the thresholds used for one target IBD segment size may not be applicable for another. In Table 6.5, we show a sampling of Parente2’s accuracy at a range of thresholds for 2 and 4 cM IBD segments. These results were produced on the HapMap data set using the augmented window set and SpeeDB was not used. We note, however, that the same thresholds when using SpeeDB result in very similar accuracy values (results not shown).
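Since the appropriate threshold depends on the application, one simple way to use Table 6.5 is to pick the most sensitive tabulated threshold whose false positive rate does not exceed a chosen budget. The sketch below does this with the values copied from Table 6.5; the helper name `pick_threshold` and the selection rule are illustrative, and thresholds between the tabulated values are not considered.

```python
# (threshold, sensitivity %, FPR %) rows copied from Table 6.5.
TABLE_6_5 = {
    "2cM": [(-10, 88.1, 2.813), (-5, 83.0, 1.493), (0, 77.6, 0.796),
            (5, 68.5, 0.432), (10, 57.8, 0.220), (15, 49.9, 0.137),
            (20, 42.0, 0.078), (25, 35.5, 0.042)],
    "4cM": [(30, 99.9, 0.0639), (40, 99.9, 0.0485), (50, 99.9, 0.0353),
            (60, 99.6, 0.0287), (70, 99.3, 0.0198), (80, 99.2, 0.0088),
            (90, 98.8, 0.0066), (100, 98.3, 0.0000)],
}


def pick_threshold(rows, max_fpr_percent):
    """Most sensitive tabulated row with FPR <= max_fpr_percent, or None."""
    eligible = [row for row in rows if row[2] <= max_fpr_percent]
    if not eligible:
        return None
    return max(eligible, key=lambda row: row[1])


choice_2cm = pick_threshold(TABLE_6_5["2cM"], max_fpr_percent=1.0)
choice_4cm = pick_threshold(TABLE_6_5["4cM"], max_fpr_percent=0.01)
print(choice_2cm, choice_4cm)
```

The example also shows why thresholds do not transfer between segment sizes: the thresholds that are useful for 4 cM segments lie well above the entire range tabulated for 2 cM segments.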

6.4.5 Applicability to DNA sequencing studies

DNA sequencing technologies are increasingly cost-effective and can dramatically increase the density of genotyped positions in the genome. With these technologies increasingly used in genotyping studies, we were interested in characterizing Parente2’s performance as a function of marker density. Figure 6.6 shows the sensitivity of Parente2 at a 1% FPR when run on the HapMap data set after downsampling the markers to various densities. We observe that increasing marker density results in an increase in sensitivity and that the curve has yet to reach a saturation point. Therefore, we expect that Parente2 would perform better on high-density sequence-based genotyping studies.

2 cM IBD segments
Threshold   Sensitivity (%)   FPR (%)
-10         88.1              2.813
-5          83.0              1.493
0           77.6              0.796
5           68.5              0.432
10          57.8              0.220
15          49.9              0.137
20          42.0              0.078
25          35.5              0.042

4 cM IBD segments
Threshold   Sensitivity (%)   FPR (%)
30          99.9              0.0639
40          99.9              0.0485
50          99.9              0.0353
60          99.6              0.0287
70          99.3              0.0198
80          99.2              0.0088
90          98.8              0.0066
100         98.3              0.0000

Table 6.5: Parente2’s accuracy at various thresholds when detecting 2 and 4 cM IBD segments.

One caveat is that Parente2’s runtime is linear in the number of markers used, so the running time would increase significantly with a large increase in marker density. However, it turns out that SpeeDB is more effective at filtering out non-IBD segments with higher marker density. This means that when using Parente2 in conjunction with SpeeDB, the longer running time on high-marker-density data sets can be mitigated to an extent by more effectively filtering out regions that are unlikely to be IBD between pairs of individuals.

Figure 6.6: The performance of Parente2 as a function of marker density. Parente2 was used to detect 2 cM IBD segments in the HapMap data set. Sensitivity is reported at a 1% false positive rate.

Chapter 7

Conclusions and Future Directions

7.1 Potential impact

Our work on Alloy represented an improvement on existing methods for determining locus-specific ancestral origins. Perhaps more importantly, however, Alloy demonstrated the utility of using a locally-fitted, expressive model of linkage disequilibrium for the purpose of ancestry inference. New ancestry-inference methods can be developed to build upon the framework that Alloy put forth, or, alternatively, they may take a different approach that leverages a similar model of linkage disequilibrium. Both avenues have the potential to push the achievable accuracy even higher. With an ever-growing amount of genotype data becoming available from an increasingly wide array of populations, these improvements in ancestry inference will potentially allow researchers to uncover novel biology and medically-relevant genotype-phenotype associations.

With Parente, we demonstrated the utility of using a simple window-based model for detecting IBD segments, which is a departure from many previous methods that relied on hidden Markov models. This algorithmic approach was able to deliver much higher accuracy along with significantly lower running time. Furthermore, Parente introduced the concept of an embedded likelihood ratio to the problem of IBD detection, and we surmise that this empirical recalibration approach may be applicable to other problems and fields as well. The strength of the model and embedded likelihood ratio was further substantiated in our work on Parente2, which may serve as motivation for others to extend and apply these approaches elsewhere. Both Parente and Parente2 represented a significant leap forward in the accuracy of IBD detection methods that will allow IBD-based analyses of larger cohorts and allow researchers to identify more distant familial relationships at reasonable false positive rates.

7.2 Future directions

The performance of Alloy has the potential to be improved by incorporating some of the ideas we used in Parente. In particular, we hypothesize that window-based methods’ strength lies in their ability to capture the local connection between neighboring positions in the genome (i.e. markers belonging to the same window) while simultaneously being memoryless at longer distances (i.e. between windows). The memoryless feature may increase their robustness to noise or errors in a local portion of the genome, whereas methods such as hidden Markov models may propagate erroneous information to neighboring regions. Consequently, a windowing system may be added to Alloy where each individual window utilizes a small hidden Markov model, which will hopefully allow it to continue to take advantage of local linkage disequilibrium information while also strengthening its practical performance.

Additionally, with the increasing availability of population sequencing data, there is a great opportunity to apply Alloy on these data sets. Because DNA sequencing is not limited to a pre-defined set of markers, it can be used to observe rare alleles [9]. It has been shown that many rare alleles can be harbored within very specific populations with a common ancestral origin [19]. Alloy can be adapted to be applied to sequencing data and hopefully take advantage of these highly informative genomic positions for local ancestry inference. To this end, Alloy can be changed to have a region-specific error model to accommodate the fact that next-generation sequencing technologies often have positional biases due to genomic sequence content and mapping artifacts.

In the space of IBD inference, one of the key ways that Parente2 can be extended is by having a method to define windows more intelligently. Our simple approach to incorporating windows of non-adjacent markers used randomization to define window sets. One could use a more sophisticated method for choosing the marker content of windows, either by using a model-based scoring function or by taking an empirical approach and iteratively learning better window definitions until convergence. In addition (or, alternatively), a weighting scheme could be applied where the scores observed from a particular window are scaled by the weight of the window, which could be proportional to the informativeness of the window.

In an orthogonal direction, it may also be worth investigating a way of performing some amount of smoothing of scores between nearby windows. Even though we have observed that the naive assumption of independent windows tends to yield good results, we know that this assumption is invalid, which motivates adding some connection between nearby regions in order to relax it. The rationale behind smoothing is that, because windows are small relative to the IBD regions of interest, the vast majority of windows in an IBD region will have the same IBD state as their neighbors. For IBD inference, Parente2 outperformed HMM methods, which tend to have a smoothing effect because neighboring positions are connected; for the ancestry inference problem, however, Alloy (an HMM method) outperformed window-based methods. It is therefore possible that an approach somewhere in the middle may result in even better performance than either extreme. By using a small amount of smoothing, the accuracy of Parente2 could be increased without adding significant computational cost.
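One very simple form such smoothing could take is a moving average over per-window scores. The sketch below is purely hypothetical: it is not part of Parente2, and the function name, the uniform weights, and the `radius` parameter are assumptions chosen to illustrate the idea and its cost, which is linear in the number of windows for a fixed radius.

```python
def smooth_scores(scores, radius=1):
    """Replace each window score with the mean of its neighborhood.

    The neighborhood of window i is windows i-radius .. i+radius,
    truncated at the ends of the chromosome.
    """
    n = len(scores)
    smoothed = []
    for i in range(n):
        lo, hi = max(0, i - radius), min(n, i + radius + 1)
        smoothed.append(sum(scores[lo:hi]) / (hi - lo))
    return smoothed


# A run of IBD-supporting scores interrupted by one noisy window.
raw = [2.0, 2.0, -7.0, 2.0, 2.0]
print(smooth_scores(raw, radius=1))
```

The toy example also shows the tradeoff: smoothing dampens the single outlier, but it spreads part of that outlier's effect onto the neighboring windows, which is the kind of error propagation that fully memoryless windows avoid.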

Currently, one requirement for running Parente2 is to have access to phased training haplotypes in order to estimate haplotype frequencies for each window. This is not a problem for studies that are performed on well-studied populations (such as Europeans), where there is a significant amount of existing haplotype data available either from trio-based phasing or computational phasing; each new IBD study in these populations can reuse existing phased data. For populations where there is little phased data available, when one performs an IBD study with a new cohort in this population, it is possible to computationally phase the cohort data first and use it directly to estimate haplotype frequencies. However, this partially negates one of the benefits of Parente2, namely that it works without the need to computationally phase data before inferring IBD segments. Therefore, a potential future direction is to work toward estimating haplotype frequencies from the studied genotype data without needing to fully phase the data set. Because of the independent-window model of Parente2, one only needs to estimate window-sized haplotype frequencies. To do this, one can perform phasing only within one window at a time, for example, with an expectation maximization (EM) approach. Since the number of markers per window is very small (e.g. 5 markers), this approach would be computationally feasible and should result in a reasonable estimate of window haplotype frequencies.
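A minimal sketch of the within-window EM idea follows. It uses the classical EM formulation for haplotype frequencies from unphased genotypes under random mating, which is an assumed stand-in for whatever approach would ultimately be adopted. Genotypes are encoded as tuples of alternate-allele counts (0, 1, or 2) per marker; the function names, the uniform initialization, and the fixed iteration count are illustrative choices.

```python
from collections import defaultdict
from itertools import product


def compatible_pairs(genotype):
    """All ordered haplotype pairs consistent with an unphased genotype."""
    het = [i for i, g in enumerate(genotype) if g == 1]
    base = [g // 2 for g in genotype]  # homozygous sites; het sites set below
    pairs = []
    for bits in product((0, 1), repeat=len(het)):
        h1, h2 = list(base), list(base)
        for i, b in zip(het, bits):
            h1[i], h2[i] = b, 1 - b
        pairs.append((tuple(h1), tuple(h2)))
    return pairs


def em_haplotype_frequencies(genotypes, n_iter=50):
    """Estimate window haplotype frequencies from unphased genotypes."""
    pair_lists = [compatible_pairs(g) for g in genotypes]
    haps = {h for pairs in pair_lists for pair in pairs for h in pair}
    freq = {h: 1.0 / len(haps) for h in haps}  # uniform starting point
    for _ in range(n_iter):
        counts = defaultdict(float)
        for pairs in pair_lists:
            # E-step: posterior weight of each phasing of this genotype.
            weights = [freq[a] * freq[b] for a, b in pairs]
            total = sum(weights)
            for (a, b), w in zip(pairs, weights):
                counts[a] += w / total
                counts[b] += w / total
        # M-step: expected haplotype counts over 2N chromosomes.
        freq = {h: counts[h] / (2 * len(genotypes)) for h in haps}
    return freq


# Two markers: the double heterozygote (1, 1) is the only ambiguous genotype.
genos = [(0, 0), (0, 0), (2, 2), (1, 1)]
em_freq = em_haplotype_frequencies(genos)
print(em_freq)
```

Because a window with k heterozygous sites has 2^k ordered phasings, the enumeration stays small for windows of about 5 markers, which is what makes the per-window approach computationally feasible.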

While Parente2 is very fast and accurate, our experiments show that there is significant room for improvement for extremely large data sets on the order of millions of individuals. In this situation, the number of pairwise comparisons in an all-by-all analysis is so large that even Parente2’s high accuracy would result in a large number of false positives when detecting small IBD segments. A perhaps larger challenge is the computational cost of such an analysis. Even with the quick running times of Parente and SpeeDB with Parente2, a study of this magnitude would require extensive computational resources. Therefore, it is important to continue to develop complementary methods that reduce the overall computational complexity of the IBD inference problem. For example, a simple approach would be to perform region-specific clustering: for each part of the genome, cluster individuals based on the genotypic similarity in the region. Assuming the cluster sizes are small, one would only need to perform an all-by-all comparison within each cluster. As such complementary methods are developed, Parente2 can be easily adapted to fit into a larger system for IBD inference because of its simple scoring function and the fact that scores in each region are computed independently from each other. Therefore, Parente2 is in a good position to facilitate IBD-based studies in the future.
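The region-specific clustering idea can be sketched as follows. This is a hypothetical illustration: it uses a greedy "leader" clustering with a mismatch cutoff on regional genotypes, which is only one of many possible similarity criteria and, as a heuristic, could separate pairs that are in fact IBD. The function names and the `max_mismatch` parameter are assumptions.

```python
from itertools import combinations


def greedy_clusters(genotypes, max_mismatch):
    """Assign each individual to the first cluster whose leader is similar.

    genotypes: one tuple of genotype calls per individual for the region.
    Returns a list of clusters, each a list of individual indices.
    """
    clusters = []  # list of (leader_genotype, member_indices)
    for idx, g in enumerate(genotypes):
        for leader, members in clusters:
            if sum(a != b for a, b in zip(leader, g)) <= max_mismatch:
                members.append(idx)
                break
        else:
            clusters.append((g, [idx]))
    return [members for _, members in clusters]


def candidate_pairs(clusters):
    """Pairs to score: all-by-all within each cluster only."""
    return [pair for members in clusters for pair in combinations(members, 2)]


region = [(0, 1, 2, 0), (0, 1, 2, 1), (2, 2, 0, 0),
          (2, 2, 0, 0), (0, 1, 2, 0), (1, 0, 1, 2)]
clusters = greedy_clusters(region, max_mismatch=1)
pairs = candidate_pairs(clusters)
print(clusters, len(pairs))
```

In this toy region only 4 of the 15 possible pairs would be passed on for scoring; the savings grow with cohort size as long as clusters remain small.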

Finally, since IBD inference and ancestry inference are related problems that capture the same phenomenon on different time scales, there is a great deal of exploration that can be done in the intersection of these two areas. For example, if one can identify that two individuals share an IBD segment at a particular region of the genome, it can be concluded that these two people must also share the same ancestry in that region. Conversely, if one can identify that two individuals share the same ancestry at a particular region of the genome, this increases the prior probability that these individuals share an IBD segment in that same region. Going forward, a model could be constructed that allows both IBD and ancestry states to inform one another, which may result in improved accuracy for solving both inference problems. To this end, Alloy and Parente2 can serve as building blocks for this endeavor.

Bibliography

[1] G. R. Abecasis, S. S. Cherny, W. O. Cookson, and L. R. Cardon. Merlin–rapid analysis of dense genetic maps using sparse gene flow trees. Nat Genet, 30(1):97–101, 2002.

[2] Can Alkan, Bradley P. Coe, and Evan E. Eichler. Genome structural variation discovery and genotyping. Nature reviews. Genetics, 12(5):363–376, May 2011.

[3] Fowzan S Alkuraya. Homozygosity mapping: one more tool in the clinical geneticist’s toolbox. Genet Med, 12(4):236–9, 2010.

[4] David M Altshuler, Richard A Gibbs, Leena Peltonen, Emmanouil Dermitzakis, Stephen F Schaffner, Fuli Yu, Penelope E Bonnen, Paul I W De Bakker, Panos Deloukas, Stacey B Gabriel, and et al. Integrating common and rare genetic variation in diverse human populations. Nature, 467(7311):52–58, 2010.

[5] T M Baye and R A Wilke. Mapping genes that predict treatment outcome in admixed populations. The pharmacogenomics journal, 10(6):465–477, 2010.

[6] Sivan Bercovici and Dan Geiger. Inferring ancestries efficiently in admixed populations with linkage disequilibrium. Journal of computational biology : a journal of computational molecular cell biology, 16(8):1141–50, August 2009.

[7] Sivan Bercovici, Christopher Meek, Ydo Wexler, and Dan Geiger. Estimating genome-wide ibd sharing from snp data via an efficient hidden markov model of ld with application to gene mapping. Bioinformatics, 26(12):i175–i182, June 2010.

[8] Brian L. Browning and Sharon R. Browning. A fast, powerful method for detecting identity by descent. American journal of human genetics, 88(2):173–82, February 2011.

[9] Brian L Browning and Sharon R Browning. Detecting identity by descent and estimating genotype error rates in sequence data. The American Journal of Human Genetics, 2013.

[10] Sharon R Browning and Brian L Browning. Rapid and accurate haplotype phasing and missing-data inference for whole-genome association studies by use of localized haplotype clustering. The American Journal of Human Genetics, 81(5):1084–1097, 2007.

[11] Sharon R. Browning and Brian L. Browning. High-Resolution Detection of Identity by Descent in Unrelated Individuals. American Journal of Human Genetics, 86(4):526–539, April 2010.

[12] S.R. Browning and B.L. Browning. Rapid and accurate haplotype phasing and missing-data inference for whole-genome association studies by use of localized haplotype clustering. Am J Hum Genet, 81(5):1084–97, 2007.

[13] S.R. Browning and E.A. Thompson. Detecting Rare Variant Associations by Identity by Descent Mapping in Case-control Studies. Genetics, 190:1521–1531, April 2012.

[14] Vincent J. Carey. Mathematical and statistical methods for genetic analysis (2nd ed.). Kenneth Lange. Journal of the American Statistical Association, 100:712–712, 2005.

[15] Donald F. Conrad, Jonathan E. M. Keebler, Mark A. DePristo, Sarah J. Lindsay, Yujun Zhang, Ferran Casals, Youssef Idaghdour, Chris L. Hartl, Carlos Torroja, Kiran V. Garimella, Martine Zilversmit, Reed Cartwright, Guy A. Rouleau, Mark Daly, Eric A. Stone, Matthew E. Hurles, Philip Awadalla, and for the 1000 Genomes Project. Variation in genome-wide mutation rates within and between human families. Nature Genetics, 2011.

[16] R. Elston and J. Stewart. A general model for the analysis of pedigree data. Hum. Hered., 21:523–542., 1971.

[17] W.C. Farabee. Inheritance of Digital Malformations in Man. Papers of the Peabody Museum of American Archaeology and Ethnology, Harvard University. Museum, 1905.

[18] Zoubin Ghahramani, Michael I. Jordan, and Padhraic Smyth. Factorial hidden markov models. In Machine Learning. MIT Press, 1997.

[19] Simon Gravel, Brenna M. Henn, Ryan N. Gutenkunst, Amit R. Indap, Gabor T. Marth, Andrew G. Clark, Fuli Yu, Richard A. Gibbs, The 1000 Genomes Project, and Carlos D. Bustamante. Demographic history and rare allele sharing among human populations. Proceedings of the National Academy of Sciences, 108(29):11983–11988, July 2011.

[20] Daniel F. Gudbjartsson, Thorvaldur Thorvaldsson, Augustine Kong, Gunnar Gunnarsson, and Anna Ingolfsdottir. Allegro version 2. Nature Genetics, 37(10):1015–1016, October 2005.

[21] Alexander Gusev, Jennifer K Lowe, Markus Stoffel, Mark J Daly, David Altshuler, Jan L Breslow, Jeffrey M Friedman, and Itsik Pe’er. Whole population, genome-wide mapping of hidden relatedness. Genome Research, 19:318–326, Feb 2009. 10.1101/gr.081398.108.

[22] J. B. S. Haldane. The combination of linkage values, and the calculation of distance between the loci of linked factors. J Genet, 8:299–309, 1919.

[23] Brenna M. Henn, Laura R. Botigué, Simon Gravel, Wei Wang, Abra Brisbin, Jake K. Byrnes, Karima Fadhlaoui-Zid, Pierre A. Zalloua, Andres Moreno-Estrada, Jaume Bertranpetit, Carlos D. Bustamante, and David Comas. Genomic Ancestry of North Africans Supports Back-to-Africa Migrations. PLoS Genet, 8(1), January 2012.

[24] Brenna M. Henn, Lawrence Hon, J. Michael Macpherson, Nick Eriksson, Serge Saxonov, Itsik Pe’er, and Joanna L. Mountain. Cryptic distant relatives are common in both isolated and cosmopolitan genetic samples. PLoS ONE, 7(4):e34267, 04 2012.

[25] Anjali G. Hinch, Arti Tandon, Nick Patterson, Yunli Song, Nadin Rohland, Cameron D. Palmer, Gary K. Chen, Kai Wang, Sarah G. Buxbaum, Ermeg L. Akylbekova, Melinda C. Aldrich, Christine B. Ambrosone, Christopher Amos, Elisa V. Bandera, Sonja I. Berndt, Leslie Bernstein, William J. Blot, Cathryn H. Bock, Eric Boerwinkle, Qiuyin Cai, Neil Caporaso, Graham Casey, L. Adrienne Cupples, Sandra L. Deming, W. Ryan Diver, Jasmin Divers, Myriam Fornage, Elizabeth M. Gillanders, Joseph Glessner, Curtis C. Harris, Jennifer J. Hu, Sue A. Ingles, William Isaacs, Esther M. John, W. H. Linda Kao, Brendan Keating, Rick A. Kittles, Laurence N. Kolonel, Emma Larkin, Loic Le Marchand, Lorna H. McNeill, Robert C. Millikan, Murphy, Solomon Musani, Christine Neslund-Dudas, Sarah Nyante, George J. Papanicolaou, Michael F. Press, Bruce M. Psaty, Alex P. Reiner, Stephen S. Rich, Jorge L. Rodriguez-Gil, Jerome I. Rotter, Benjamin A. Rybicki, Ann G. Schwartz, Lisa B. Signorello, Margaret Spitz, Sara S. Strom, Michael J. Thun, Margaret A. Tucker, Zhaoming Wang, John K. Wiencke, John S. Witte, Margaret Wrensch, Xifeng Wu, Yuko Yamamura, Krista A. Zanetti, Wei Zheng, Regina G. Ziegler, Xiaofeng Zhu, Susan Redline, Joel N. Hirschhorn, Brian E. Henderson, Herman A. Taylor, Alkes L. Price, Hakon Hakonarson, Stephen J. Chanock, Christopher A. Haiman, James G. Wilson, David Reich, and Simon R. Myers. The landscape of recombination in African Americans. Nature, 476(7359):170–175, August 2011.

[26] Rodriguez J. Huang L., Bercovici S. and Batzoglou S. An Effective Filter for IBD Detection in Large Datasets. Manuscript submitted for publication, 2013.

[27] Illumina. Moleculo technology. http://www.illumina.com/technology/moleculo-technology.ilmn, 2013.

[28] Anna Ingolfsdottir and Daniel Gudbjartsson. Transactions on computational systems biology iii. In Corrado Priami, Emanuela Merelli, Pablo Gonzalez, and Andrea Omicini, editors, Transactions on Computational Systems Biology III, chapter Genetic linkage analysis algorithms and their implementation, pages 123–144. Springer-Verlag, Berlin, Heidelberg, 2005.

[29] Mattias Jakobsson, Sonja W Scholz, Paul Scheet, J Raphael Gibbs, Jenna M VanLiere, Hon-Chung Fung, Zachary A Szpiech, James H Degnan, Kai Wang, Rita Guerreiro, and et al. Genotype, haplotype and copy-number variation in worldwide human populations. Nature, 451(7181):998–1003, 2008.

[30] Dorna Kashef-Haghighi. In preparation, 2013.

[31] Sofia Kyriazopoulou-Panagiotopoulou, Dorna Kashef Haghighi, Sarah J. Aerni, Andreas Sundquist, Sivan Bercovici, and Serafim Batzoglou. Reconstruction of genealogical relationships with applications to phase iii of hapmap. Bioinformatics, 27(13):i333–i341, July 2011.

[32] E. S. Lander and P. Green. Construction of multilocus genetic maps in humans. Proceedings of the National Academy of Sciences, 84:2363–2367, 1987.

[33] Meng-Hua Li, Ismo Strandén, Timo Tiirikka, Marja-Liisa Sevón-Aimonen, and Juha Kantanen. A comparison of approaches to estimate the inbreeding coefficient and pairwise relatedness using genomic and pedigree data in a sheep population. PLoS ONE, 6(11):e26256, 11 2011.

[34] Jeffrey C. Long. The genetic structure of admixed population. Genetics, (127):417–428, 1991.

[35] K. Markianos, M. J. Daly, and L. Kruglyak. Efficient multipoint linkage analysis through reduction of inheritance space. Am. J. Hum. Genet., 68(4):963–977, 2001.

[36] 1000 Genomes Project. A map of human genome variation from population-scale sequencing. Nature, 467(7319):1061–1073, October 2010.

[37] Ida Moltke, Anders Albrechtsen, Thomas, Finn C. Nielsen, and Rasmus Nielsen. A method for detecting IBD regions simultaneously in multiple individuals with applications to disease genetics. Genome Research, 21(7):1168–1180, July 2011.

[38] Michael A. Nalls, Javier Simon-Sanchez, J. Raphael Gibbs, Coro Paisan-Ruiz, Jose Tomas Bras, Toshiko Tanaka, Mar Matarin, Sonja Scholz, Charles Weitz, Tamara B. Harris, Luigi Ferrucci, John Hardy, and Andrew B. Singleton. Measures of autozygosity in decline: Globalization, urbanization, and its implications for medical genetics. PLoS Genet, 5(3):e1000415, March 2009.

[39] International Society of Genetic Genealogy. Autosomal DNA statistics. http://www.isogg.org/wiki/Autosomal_DNA_statistics, 2013.

[40] International Society of Genetic Genealogy. Identical by descent segment. http://www.isogg.org/wiki/Identical_By_Descent_segment, 2013.

[41] J. Ott. Analysis of Human Genetic Linkage. The Johns Hopkins series in contemporary medicine and public health. Johns Hopkins University Press, 1999.

[42] Bogdan Pasaniuc, Sriram Sankararaman, Gad Kimmel, and Eran Halperin. Inference of locus-specific ancestry in closely related populations. Bioinformatics (Oxford, England), 25(12):i213–21, June 2009.

[43] Bogdan Pasaniuc, Noah Zaitlen, Guillaume Lettre, Gary K Chen, Arti Tandon, W H Linda Kao, Ingo Ruczinski, Myriam Fornage, David S Siscovick, Xiaofeng Zhu, Emma Larkin, Leslie A Lange, L Adrienne Cupples, Qiong Yang, Ermeg L Akylbekova, Solomon K Musani, Jasmin Divers, Joe Mychaleckyj, Mingyao Li, George J Papanicolaou, Robert C Millikan, Christine B Ambrosone, Esther M John, Leslie Bernstein, Wei Zheng, Jennifer J Hu, Regina G Ziegler, Sarah J Nyante, Elisa V Bandera, Sue A Ingles, Michael F Press, Stephen J Chanock, Sandra L Deming, Jorge L Rodriguez-Gil, Cameron D Palmer, Sarah Buxbaum, Lynette Ekunwe, Joel N Hirschhorn, Brian E Henderson, Simon Myers, Christopher A Haiman, David Reich, Nick Patterson, James G Wilson, and Alkes L Price. Enhanced statistical tests for GWAS in admixed populations: assessment using African Americans from CARe and a Breast Cancer Consortium. PLoS genetics, 7(4):e1001371, April 2011.

[44] Nick Patterson, Neil Hattangadi, Barton Lane, Kirk E Lohmueller, David A Hafler, Jorge R Oksenberg, Stephen L Hauser, Michael W Smith, Stephen J O’Brien, David Altshuler, Mark J Daly, and David Reich. Methods for high-density admixture mapping of disease genes. American journal of human genetics, 74(5):979–1000, May 2004.

[45] John E. Pool and Rasmus Nielsen. Inference of Historical Changes in Migration Rate From the Lengths of Migrant Tracts. Genetics, 181(2):711–719, February 2009.

[46] Alkes L. Price, Arti Tandon, Nick Patterson, Kathleen C. Barnes, Nicholas Rafaels, Ingo Ruczinski, Terri H. Beaty, Rasika Mathias, David Reich, and Simon Myers. Sensitive detection of chromosomal segments of distinct ancestry in admixed populations. PLoS Genet, 5(6):e1000519, June 2009.

[47] Shaun Purcell, Benjamin Neale, Kathe Todd-Brown, Lori Thomas, Manuel A. Ferreira, David Bender, Julian Maller, Pamela Sklar, Paul I. de Bakker, Mark J. Daly, and Pak C. Sham. PLINK: a tool set for whole-genome association and population-based linkage analyses. American journal of human genetics, 81(3):559–575, September 2007.

[48] Lawrence R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. In Proceedings of the IEEE, pages 257–286, 1989.

[49] Peter Ralph and Graham Coop. The geography of recent genetic ancestry across Europe, July 2012.

[50] Dana Ron, Yoram Singer, and Naftali Tishby. On the learnability and usage of acyclic probabilistic finite automata. In Journal of Computer and System Sciences, pages 31–40. ACM Press, 1995.

[51] Noah A. Rosenberg, Lei M. Li, Ryk Ward, and Jonathan K. Pritchard. Informativeness of genetic markers for inference of ancestry. The American Journal of Human Genetics, 73:1402–1422, 2003.

[52] Sriram Sankararaman, Srinath Sridhar, Gad Kimmel, and Eran Halperin. Estimating Local Ancestry in Admixed Populations. Journal of Human Genetics, (February):290–303, 2008.

[53] Michael F Seldin, Bogdan Pasaniuc, and Alkes L Price. New approaches to disease mapping in admixed populations. Nature reviews. Genetics, 12(8):523–8, January 2011.

[54] Eun Kyong Shin, Shin-Hwa Lee, Sung-Hwan Cho, Seok Jung, Sang Hyuk Yoon, Sung Woo Park, Jong Sook Park, Soo Taek Uh, Yang Ki Kim, Yong Hoon Kim, Jae-Sung Choi, Byung-Lae Park, Hyoung Doo Shin, and Choon-Sik Park.

[55] Andreas Sundquist, Eugene Fratkin, Chuong B Do, and Serafim Batzoglou. Effect of genetic divergence in identifying ancestral origin using HAPAA. Genome research, 18(4):676–82, April 2008.

[56] Hua Tang, Marc Coram, Pei Wang, Xiaofeng Zhu, and Neil Risch. Reconstructing genetic ancestry blocks in admixed individuals. American journal of human genetics, 79(1):1–12, July 2006.

[57] Alan R. Templeton. Human races: A genetic and evolutionary perspective. American Anthropologist, 100(3):632–650, 1998.

[58] Chao Tian, David A. Hinds, Russell Shigeta, Rick Kittles, Dennis G. Ballinger, and Michael F. Seldin. A genomewide single-nucleotide polymorphism panel with high ancestry information for African American admixture mapping. The American Journal of Human Genetics, 79:640–649, 2006.

[59] Daniel Wegmann, Darren E. Kessner, Krishna R. Veeramah, Rasika A. Mathias, Dan L. Nicolae, Lisa R. Yanek, Yan V. Sun, Dara G. Torgerson, Nicholas Rafaels, Thomas Mosley, Lewis C. Becker, Ingo Ruczinski, Terri H. Beaty, Sharon L. R. Kardia, Deborah A. Meyers, Kathleen C. Barnes, Diane M. Becker, Nelson B. Freimer, and John Novembre. Recombination rates in admixed individuals identified by ancestry-based inference. Nature Genetics, 43(9):847–853, July 2011.

[60] Amy L. Williams, Nick Patterson, Joseph Glessner, Hakon Hakonarson, and David Reich. Phasing of many thousands of genotyped samples. American journal of human genetics, 91(2):238–251, August 2012.

[61] Cheryl A. Winkler, George W Nelson, and Michael W Smith. Admixture mapping comes of age. Annual review of genomics and human genetics, 11:65–89, September 2010.

[62] WTCCC. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature, 447(7145):661–678, June 2007.