DISEASE GENE MAPPING UNDER THE COALESCENT MODEL

DISSERTATION

Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate School of The Ohio State University

By

Lori Hoffman, M.S., B.S.
Graduate Program in Statistics
The Ohio State University
2010

Dissertation Committee:
Prof. Laura Kubatko, Adviser
Prof. Dennis K. Pearl
Prof. H. N. Nagaraja
Prof. Asuman Turkmen

ABSTRACT

Generally speaking, association studies aim to find ties between a given trait, most commonly a disease, and the location of a causative gene. A subset of these studies uses case-control data in which there are both affected and unaffected individuals.

The basic idea behind these studies is to compare genotype frequencies of those who have a disease, or other trait, with those who do not. By analyzing frequency patterns, researchers can pick up signals along each chromosome that indicate an association with disease status. Depending on the type of disease under study, different mapping approaches are necessary. One way to classify diseases is by the underlying structure of mutations that cause them. While association studies may be used to effectively map Mendelian diseases in which a single mutation with high penetrance gives rise to a disease, the more challenging problem is that of mapping complex diseases.

For our purposes, the genotypes are made up of single nucleotide polymorphisms (SNPs) typed along the chromosome for a sample of individuals. The phenotype data reflect the affected status of an individual with respect to a certain disease. Along with the observed genotype and phenotype data, we use information about the sample's unknown common genealogy. The genealogies are estimated under the coalescent model with recombination and are represented by ancestral recombination graphs (ARGs). Such a genealogy, when accurately estimated, can provide information about possible disease-causing mutations that have occurred in the common history.

We propose a new method of disease mapping via the coalescent, which we refer to as ARGlik. Our method implements a fast ARG estimation program and performs likelihood-based association testing. We use an existing algorithm, implemented in the software program MARGARITA [33], to estimate the genealogy. After estimating a genealogy for a given sample, we compute the likelihood of the phenotype data given the genealogy. If there is a disease association at a particular SNP in the data, we expect to see a non-random clustering of cases and controls within the genealogy.

To check the performance of ARGlik, we compared our method against other coalescent-based methods as well as the standard χ2 approach. Our simulation study includes data ranging from simple one-locus disease models to disease models with an external covariate. Results show that ARGlik performs as well as the coalescent methods for the one-locus disease models while maintaining a lower false positive rate for the no-disease model. Moreover, ARGlik performs well in its ability to detect association in the presence of a covariate. As a final check on the program, we test three chromosomes for association with type 1 diabetes.

This is dedicated to my parents, my sister, and Jeremy

ACKNOWLEDGMENTS

First, I would like to thank Jeremy for his constant support while I completed this research. I would also like to thank my friends and family, especially my parents. I am very grateful to Veronica Vieland and Yungui Huang from Nationwide Children's Hospital for sharing their biological expertise and for providing the type 1 diabetes data used in my analyses. I would also like to thank my committee for their suggestions. Finally, I would like to thank Laura Kubatko for her guidance and understanding during this work.

VITA

July 15, 1982 ...... Born - Superior, Wisconsin USA

2004 ...... B.S. Mathematics

2007 ...... M.S. Statistics

2007-present ...... Graduate Research Associate, The Ohio State University.

FIELDS OF STUDY

Major Field: Statistics

TABLE OF CONTENTS

Page

Abstract ...... ii

Dedication ...... iv

Acknowledgments ...... v

Vita ...... vi

List of Tables ...... ix

List of Figures ...... x

Chapters:

1. Introduction and Literature Review ...... 1

1.1 Genetics Overview ...... 2
1.2 Single Marker and Haplotype-Based Methods ...... 6
1.3 Mapping Methods Based on Ancestral Recombination Graphs (ARGs) ...... 9
1.3.1 The Standard Coalescent and Coalescent-with-Recombination ...... 10
1.3.2 Ancestral Recombination Graphs ...... 13
1.3.3 Association Mapping ...... 19
1.4 ARG vs. Haplotype Methods ...... 24
1.5 Outline of Dissertation ...... 25

2. An ARG-Based Disease Mapping Algorithm ...... 27

2.1 ARG Estimation ...... 27
2.2 Branch Length Estimation ...... 32
2.3 Measuring Association ...... 38
2.3.1 Likelihood Calculation ...... 38

2.3.2 Penetrance Model ...... 45
2.3.3 Significance Testing ...... 48
2.4 Computer Implementation ...... 49
2.5 Incorporating Covariates ...... 53
2.6 Model Summary ...... 55

3. Simulation Studies ...... 56

3.1 One-Locus Disease Models ...... 56
3.1.1 Data Simulation ...... 56
3.1.2 Method Settings ...... 61
3.1.3 Results ...... 63
3.1.4 Unphased Data ...... 72
3.1.5 Computing Time ...... 77
3.2 Covariate Models ...... 80
3.2.1 Data Simulation ...... 80
3.2.2 Method Settings ...... 83
3.2.3 Results ...... 84
3.3 Discussion and Summary ...... 86

4. Real Data Application ...... 88

4.1 Type 1 Diabetes Data ...... 88
4.2 ARGlik Results ...... 91
4.3 Discussion and Conclusion ...... 92

5. Conclusions and Future Work ...... 94

Appendices:

A. Data Simulation Pseudocode ...... 97

A.1 Standard One-Locus Model ...... 97
A.2 Covariate Data ...... 103

Bibliography ...... 111

LIST OF TABLES

Table Page

1.1 Number of possible ARGs for varying numbers of recombination events and sequences ...... 15

3.1 Population settings for simulation study ...... 57

3.2 Sample settings for simulation study of one-locus disease model . . . 58

3.3 Results of simulations with disease ...... 66

3.4 Results of simulations on phase-unknown data ...... 77

3.5 Simulation run times ...... 80

3.6 Covariate disease models for simulation studies ...... 82

4.1 Type 1 diabetes data ...... 89

4.2 Type 1 diabetes results ...... 91

4.3 Locations of highest signal for type 1 diabetes data ...... 92

LIST OF FIGURES

Figure Page

1.1 Genetic recombination ...... 4

1.2 Sample nucleotide data converted to SNP format ...... 5

1.3 Gene tree relating five individuals ...... 11

1.4 ARG with corresponding marginal trees ...... 14

2.1 Margarita ARG inference algorithm ...... 31

2.2 Margarita ARG inference algorithm applied to unphased data .... 33

2.3 Most parsimonious reconstruction of a 4-tip tree ...... 35

2.4 Example of trees and data used for association testing ...... 40

2.5 Example of likelihood calculation on a marginal tree ...... 42

2.6 ARGlik SNP input format 1 ...... 51

2.7 ARGlik SNP input format 2 ...... 52

3.1 Type I error rate for Margarita and ARGlik ...... 65

3.2 Simulation results of setting 1, without adjustment to control error rate 67

3.3 Simulation results of setting 1, with adjustment to control error rate . 68

3.5 Results of simulations with disease, with adjustment to control error rate ...... 71

3.6 Simulation results of ARGlik across settings ...... 73

3.7 Simulation results of setting 6 ...... 74

3.8 Simulation results of setting 6, with adjustment to control error rate . 75

3.9 Simulation results of setting 1 for both phased and unphased data . . 78

3.10 Simulation results of setting 2 for both phased and unphased data . . 79

3.11 Simulation results of covariate samples ...... 84

3.12 Simulation results of covariate samples, with adjustment to control error rate ...... 85

CHAPTER 1

INTRODUCTION AND LITERATURE REVIEW

Association studies are conducted in order to find relationships between a given trait, most commonly a disease, and a location of genetic material along the chromosome. A subset of these studies uses case-control data, in which the data are extracted from a combination of affected and unaffected individuals. The basic idea behind these studies is to compare the genotype frequencies of cases and controls in order to pick up signals along the chromosome that indicate an association with the disease being studied. The term linkage disequilibrium (LD) mapping is often used to describe these studies, as it refers to the study of non-random association of alleles at multiple loci in unrelated individuals [2].

Depending on the type of disease under study, different mapping approaches are necessary. One way to classify diseases is by the underlying structure of mutations that cause them. While association studies may be used to effectively map Mendelian diseases, in which a single mutation with high penetrance gives rise to a disease, the more challenging problem is that of mapping complex diseases. Complex diseases may require many mutation combinations to occur before one develops the disease. Alternatively, multiple independent mutations may have occurred at multiple loci in affected individuals. A complex disease may also be characterized by low penetrance, meaning that those with the disease mutation will only be affected with the disease at a low rate. Alternatively, many phenocopies may be present in the population, which are individuals who are affected by the disease but do not carry the disease variant.

With genetic data becoming increasingly available, there has recently been a great deal of work on association studies aimed at mapping complex diseases. The simplest method of analysis of data obtained in an association study is to conduct a χ2 test of association between a single marker and a single phenotype [46]. However, as this type of test does not take into account information from neighboring markers, most association studies use a multi-point approach. Moreover, the standard χ2 approach does not adjust for incomplete penetrance. Thus, we will focus our discussion on methods more suited for complex diseases. There are two general directions one can take in pursuing such an analysis. The first comprises methods that only use some form of the observed genotype and phenotype data, and the second type of approach uses the observed data along with the unobserved shared genealogy of the sample. We focus on methods that can be applied to case-control data. In this chapter, we review prior work on both types of methods after giving a brief overview of some relevant genetic background.
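Before moving on, the baseline single-marker test is worth making concrete. The following is a minimal Python sketch (our own illustration, with made-up genotype counts rather than data from this study) using SciPy:

# A minimal sketch of the single-marker chi-square test of association:
# compare genotype counts between cases and controls at one SNP.
# The counts below are hypothetical.
from scipy.stats import chi2_contingency

#               genotype:  0/0  0/1  1/1
case_counts    = [20, 45, 35]
control_counts = [40, 40, 20]
stat, p, dof, expected = chi2_contingency([case_counts, control_counts])
print(f"chi2 = {stat:.2f}, df = {dof}, p = {p:.4f}")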

1.1 Genetics Overview

In this section we discuss basic principles of human genetics as they apply to association studies. At the core of all living organisms is deoxyribonucleic acid (DNA). DNA is made up of sequences of the four possible nucleotide bases: adenine (A), guanine (G), cytosine (C), and thymine (T). We can represent a DNA sequence by the nucleotide bases that it contains; for example, a segment of DNA may look like AAGTTCC. Such DNA sequences are what make up the twenty-three chromosomes in humans. Since humans are diploid organisms, we carry two copies of each chromosome, one inherited from the mother and one inherited from the father. Because of this paired structure of DNA, there are two nucleotides present at each position (“locus”) along the chromosome, which are referred to as a base pair. Typically the base pair variations present in the population at a given locus are referred to as alleles.

The genotype of an individual is the specific combination of alleles at a given locus. A person's genotype is homozygous at a locus if both chromosomes have the same allele and is heterozygous if the chromosomes have different alleles. Hardy-Weinberg equilibrium describes the frequencies of alleles and genotypes in a population, stating that under random mating they remain constant from generation to generation.
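For reference, a standard statement of the Hardy-Weinberg result for a biallelic locus (our addition, not from the original text): if alleles A and a have frequencies $p$ and $q = 1 - p$, then at equilibrium the genotype frequencies are

$$P(AA) = p^2, \qquad P(Aa) = 2pq, \qquad P(aa) = q^2,$$

and these frequencies remain unchanged in subsequent generations under random mating.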

During reproduction, DNA replicates in order to produce copies of itself. Sometimes this process introduces errors. As Li (1997) explains, these errors may result in single-point mutations, recombinations, insertions, or deletions. We will concentrate on single-point mutations and recombinations. A single-point mutation occurs when one nucleotide is substituted for another. Recombination occurs when paired chromosomes exchange genetic material along some region (see Figure 1.1). The concept of recombination will be key in our ability to detect associations along the chromosome.

With this general overview of genetics, we now have the concepts to present a brief description of typical data for case-control studies of complex diseases. The observed data are the genotypes and the phenotypes of individuals. Phenotypes are typically the individual's binary disease state, but in general may consist of any set of qualitative or quantitative traits. Genotypes are most commonly represented by single nucleotide polymorphisms (SNPs).

Figure 1.1: (a) One pair of chromosomes. Each vertical line represents a chromatid, which is a daughter strand of a replicated chromosome. The paired chromatids of each chromosome are joined by a centromere, illustrated by the solid oval. The horizontal lines connecting the two chromosomes indicate where genetic recombination will occur. (b) Recombination between two chromatids on either chromosome with a recombination breakpoint between the first and second loci.

Figure 1.2: On the left, six sequences of DNA segments coded in terms of the four nucleotides are shown. On the right, the same data are coded as SNP data. We code the less frequent allele as a “1” and the more frequent allele as a “0”. The fourth locus has no variability and is not recorded as a SNP.

SNPs are chromosomal locations (“markers”) at which at least two different alleles are present in the population. Figure 1.2 shows an example of complete nucleotide data along with the same data coded in terms of SNPs.
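The coding illustrated in Figure 1.2 is easy to sketch in code. The following toy Python (with hypothetical sequences, not the figure's actual data) codes the minor allele at each variable locus as 1 and drops invariant loci:

# A small sketch of the 0/1 SNP coding of Figure 1.2: the less frequent
# allele at each locus becomes "1", the more frequent becomes "0", and
# loci with no variability are not recorded as SNPs.
from collections import Counter

def to_snp_matrix(seqs):
    rows = [[] for _ in seqs]
    for j in range(len(seqs[0])):
        counts = Counter(s[j] for s in seqs)
        if len(counts) < 2:
            continue                         # invariant locus: not a SNP
        minor = min(counts, key=counts.get)  # less frequent allele -> "1"
        for i, s in enumerate(seqs):
            rows[i].append(1 if s[j] == minor else 0)
    return rows

seqs = ["AAGTC", "ATGTC", "AAGTA", "ATGCA", "AAGCC", "AAGTC"]
for row in to_snp_matrix(seqs):
    print(row)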

Many methods do not work with the genotypes themselves, but rather use haplotypes – groupings of genotypes along the chromosome. Haplotypes must be inferred from the data because the phase of genotypes at a single locus is typically unknown. When we observe two alleles at a marker, the phase refers to the chromosome from which each allele came. As Clayton et al. (2004) explain, say the observed alleles at two loci are (A,a) and (B,b). The possible haplotypes are given by the pairs A-B and a-b or the pairs A-b and a-B. Notice that if we instead observe alleles (A,a) and (B,B), then there is no uncertainty in the haplotypes. Thus, the genotype phase across a set of loci is only completely unknown if more than one locus is heterozygous. For a more complete treatment of the genetic background, see Waterman (1995) or Lange (1997).
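The ambiguity rule at the end of this paragraph is easy to state in code; a toy sketch (our own illustration):

# Toy sketch of the phase-ambiguity rule: across a set of loci, haplotype
# phase is ambiguous only when more than one locus is heterozygous.
def phase_ambiguous(genotypes):
    """genotypes: one (allele1, allele2) pair per locus."""
    heterozygous = sum(1 for a, b in genotypes if a != b)
    return heterozygous > 1

print(phase_ambiguous([('A', 'a'), ('B', 'b')]))  # True: A-B/a-b or A-b/a-B
print(phase_ambiguous([('A', 'a'), ('B', 'B')]))  # False: phase is certain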

1.2 Single Marker and Haplotype-Based Methods

The first class of methods commonly used for case-control association studies does not explicitly take into account the underlying genealogy of a sample. These methods are based on the marker data or haplotype data only. The advantage of marker data over haplotype data is that using multiple marker data does not require the phase of the genotype to be known. A disadvantage of the marker-based methods is that accounting for high correlation between nearby markers is difficult, leading many to prefer haplotype-based approaches [46]. Clayton et al. (2004) use a marker-based approach in a linear model setting, where they assume additive effects of the two alleles at the unknown disease locus. In this model, they assume a single causal mutation in the region of interest, but provide alterations to the model when there is more than one mutation. When the phenotype under study is binary, their model does not require the assumption of Hardy-Weinberg equilibrium in the population.

Haplotype-based tests are more common and are able to overcome the correlation inherent in marker-based tests because of the structure of haplotype data. The main assumption behind these haplotype-based mapping methods is that near the trait locus there is a non-random association of alleles in case haplotypes [27]. Moreover, the phase is assumed to be known or must be inferred. The software PHASE [42] is a commonly used method of inferring the genotype phase. For larger data sets, the software fastPHASE [40] may be more suitable, but the authors note that the inferred haplotypes may be less accurate than those produced by PHASE.

Many haplotype-based approaches test for association by clustering haplotypes using a similarity metric, which varies by method. Molitor et al. (2003) cluster haplotypes by measuring similarity to a set of haplotype centers that may consist of observed or unobserved values. Their model is designed to include clusters with varying risks as well as phenotypes that are binary or continuous. The resulting method can be described as a disease mapping approach that makes use of spatial statistics. A key limitation of their method is that haplotypes are assumed known, and the model does not adapt to the case when genotype data are known but haplotypes are not.

Waldron et al. (2006) further expand the method of Molitor et al. (2003). A major difference between the two methods is that in the initial method, all chromosomes are assigned a cluster, whereas in the subsequent method the interest lies in the clusters that correspond to the highest risk. Once these high-risk clusters are created, random samples of the location parameter from the posterior distribution are obtained using an MCMC algorithm. This algorithm is designed for a general single-locus disease. Both clustering methods mentioned above are designed for fine-scale mapping over small candidate regions.

Li and Jiang (2005) introduce a model-free data mining method that can be applied to whole-genome studies in addition to studies across smaller regions. They adapt an existing density-based algorithm (DBSCAN) to cluster haplotypes based on similarity with respect to a specified locus. A z-score measuring disease association is computed for each cluster and the cluster with the highest score is recorded. This process is repeated by changing the loci used to create the clusters, and the marker locus with the highest overall z-score is the final estimate of the unknown disease location. The significance of the estimate is assessed by permuting the phenotypes among the data and repeating the mapping process to obtain the null distribution. This procedure is implemented in the program HapMiner. Simulation results indicate that this is an effective tool for a wide range of studies, in terms of both the size of study and the type of disease under review.

Toivonen et al. (2000) also use a data mining approach in which they do not make any assumptions regarding the model of disease inheritance. They refer to this method as haplotype pattern mining (HPM). As the name implies, this method assumes haplotype data are known. HPM efficiently searches through large samples of case-control data to find haplotype patterns that have an association with the disease phenotype. The set of patterns whose association measures exceed some cutoff are recorded and used for fine-mapping. The final prediction of disease location is the marker that is contained in the highest number of the recorded haplotype patterns. They also obtain markerwise p-values to predict the location of the disease locus, but do not correct for multiple comparisons since the p-values are not used in hypothesis testing. As their simulation results indicate, HPM can handle missing data and diseases that arose from multiple founding mutations.

The method of Durrant et al. (2004) is a haplotype-based approach that uses some form of the underlying genealogy. This method is similar in spirit to those discussed in Section 1.3; however, the standard models of constructing genealogies are not used here. Instead, cladograms are created to approximate the common ancestry of a sample of unrelated cases and controls. Haplotype clusters are created with respect to the phased genotype data, and the phenotypes are modeled using logistic regression. Assumptions of the model include Hardy-Weinberg equilibrium, a multiplicative disease model, and a single disease mutation. Moreover, the data are expected to be phase-known or to have the phase inferred in some external program. Like the data mining algorithm HPM, it is applicable to both fine-mapping and whole-genome studies.

1.3 Mapping Methods Based on Ancestral Recombination Graphs (ARGs)

The next class of methods for detecting association in case-control studies consists of those based on Ancestral Recombination Graphs (ARGs). Before we review the literature on this class of methods, we first give an overview of the coalescent process with recombination.

1.3.1 The Standard Coalescent and Coalescent-with-Recombination

Coalescent theory provides a model for describing, looking backwards in time, the process in which a sample of genes merges, or coalesces, back into a single ancestral gene [22, 21, 16]. Often, these genes may be samples from different species. In our treatment of the coalescent model, we will concentrate on the specific case where the sampled genes are all from a single species, and we will interchangeably refer to the sampled units as sequences, individuals, or chromosomes. When we say that two sequences coalesce with one another, we mean that we can trace their ancestry back in time to the point prior to their split into two distinct gene copies. Kingman (1982a, 1982b) did much groundbreaking work with the standard coalescent model, showing that it is the limiting process of the Wright-Fisher population model, as well as proving that in this limit the coalescent times $T_k$ are independent and exponentially distributed with rate parameter $\binom{k}{2}$, where $k$ is the number of sequences present in the population. As defined in the Wright-Fisher model, the standard coalescent model assumes that the population size is constant, with random mating and no selection [41], although the model can be generalized to accommodate these possibilities [13].
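Kingman's distributional result makes the standard coalescent straightforward to simulate. A minimal Python sketch (our own illustration, with time in coalescent units):

# Simulating standard-coalescent waiting times: with k sequences present,
# T_k ~ Exponential(rate = C(k,2)), independently for each k.
import random
from math import comb

def coalescent_times(n):
    """Inter-coalescence times for n sequences, from T_n down to T_2."""
    return [random.expovariate(comb(k, 2)) for k in range(n, 1, -1)]

times = coalescent_times(5)
print(times, "time to MRCA =", sum(times))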

Further work on the standard coalescent model and its properties has been done by Griffiths and Tavaré (1994) and Hudson (1991), among others (see Hein et al. (2005) and Wakeley (2008)).

One illustration of genetic relationships under the standard coalescent model is the gene tree [29], a bifurcating tree that traces a set of sequences backwards in time to their most recent common ancestor (MRCA). When two sequences coalesce, their branches in the tree join together, and the branch lengths represent time as described by the coalescent model.

Figure 1.3: Graph depicting four generations (rows) with twelve individuals (black dots) represented in each generation. Under the coalescent model, individuals in each subsequent generation choose their parents from the previous generation at random with replacement. We can trace the last five individuals at the present time to their MRCA indicated by the block arrow. This is an example of a genealogy.

Figure 1.3 illustrates a gene tree relating five individuals embedded in a population followed over four generations. Note that the population size remains constant (twelve samples) over all four generations.

The coalescent model with recombination builds on the standard coalescent by allowing for recombination among the genes on the same chromosome [15, 18]. Recall from Section 1.1 that recombination is a biological process in which two chromosomes exchange genetic material. As a result of this exchange, the genealogical relationships of DNA sequences sampled from a population will vary as we move along the chromosome. The addition of recombination to the coalescent model adds a complication in that we can no longer describe the genealogical history for a collection of sequences using only one gene tree [15, 13]. Rather, the genealogy may vary across loci as we move along the chromosome.

To accommodate recombination, the standard coalescent model is modified so that, looking backward in time, a sequence may either coalesce with another sequence or it may experience a recombination event. If a recombination event occurs, the ancestral material contained in the sequence will split into two separate pieces at the recombination breakpoint, thereby increasing the number of sequences on the tree by one. The coalescent-with-recombination model assumes that coalescent events happen at rate $\frac{k(k-1)}{2}$, where $k$ is the number of lineages (as in the standard model), and that recombination events happen at rate $\frac{k\rho}{2}$, where $\rho$ is the population recombination rate. Then, we have the following:

$$\text{Waiting time until next event (recombination or coalescence)} \sim \text{Exp}\left\{\frac{k\rho}{2} + \frac{k(k-1)}{2}\right\} \quad (1.1)$$

$$\Pr(\text{Next event is a coalescent event}) = \frac{k-1}{k-1+\rho} \quad (1.2)$$

$$\Pr(\text{Next event is a recombination event}) = \frac{\rho}{k-1+\rho} \quad (1.3)$$

When the next event is determined to be a coalescent event, the two sequences chosen to coalesce are selected at random (all pairs equally likely). Similarly, when the next event is a recombination, the sequence that undergoes recombination is randomly selected, as is the recombination breakpoint. For further information on the coalescent model and its properties, see Hein et al. (2005) or Wakeley (2008).
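One step of this process follows directly from Equations 1.1-1.3; here is a short Python sketch (our own illustration):

# One step of the coalescent-with-recombination: draw the waiting time to
# the next event, then classify it as a coalescence or a recombination.
import random

def next_event(k, rho):
    """k: current number of lineages; rho: population recombination rate."""
    wait = random.expovariate(k * rho / 2 + k * (k - 1) / 2)  # Eq. 1.1
    if random.random() < (k - 1) / (k - 1 + rho):             # Eq. 1.2
        return wait, 'coalescence'      # k decreases by one
    return wait, 'recombination'        # k increases by one (Eq. 1.3)

print(next_event(k=10, rho=5.0))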

12 1.3.2 Ancestral Recombination Graphs

As stated in Section 1.3.1, the possibility of recombination events in the history of a sample complicates the illustration of genealogies. If a recombination event occurs before the sequences have reached their MRCA, pieces of ancestral material on either side of the recombination breakpoint may have distinct gene trees. Thus, as recombination events occur, multiple gene trees will be needed to describe the relationships of the sequences across the entire sampled genetic region. The ancestral recombination graph (ARG) allows genealogical histories to be represented within a single graph in the presence of recombination [11]. Further, the ARG contains all of the information needed to construct the marginal genealogy for a given piece of ancestral material (Figure 1.4). When we are dealing with an ARG, the single resulting ancestor is referred to as the grand-most recent common ancestor (GMRCA) to distinguish it from the possibly different MRCAs of various ancestral segments of the sequence [13].

While the coalescent framework for describing ARGs is a straightforward process, the practice of estimating an ARG from genetic data is an extremely difficult problem. McVean and Cardin (2005) give particular attention to the difficulty of working with the coalescent-with-recombination, highlighting three important issues: (1) the state space of ARGs is extremely large, growing in both the number of sequences and recombination events (see Table 1.1); (2) the data are usually not very informative about the ARG; and (3) likelihood estimation is a difficult problem involving missing data (the ARGs).

13 Figure 1.4: (a) An ARG relating three sequences. The black material is ancestral to the sample and the white is non-ancestral. (b) Marginal trees embedded in the ARG for the two ancestral pieces broken by recombination.

Recombination                     Number of Sequences
Events                  3             4             5             6
1                      72          1152        25,020       715,500
2                    3888       123,408     4,811,580   229,005,900
3                 393,552    21,607,632      1.36E+09      9.92E+10
4              67,184,208      5.85E+09      5.55E+11      5.84E+13
5                1.79E+10      2.33E+12      3.15E+14      4.60E+16

Table 1.1: Number of possible ARGs for varying numbers of recombination events and sequences

Before discussing methods to estimate ARGs, we describe the combinatorics behind calculating the number of possible ARGs in Table 1.1. We need to consider the possible series of events that may occur before the sample reaches its GMRCA. We will illustrate the simple case of three sampled sequences with one recombination event in their shared history. Letting “R” denote a recombination event and “C” denote a coalescent event, the series of events “RCCC” and “CRCC” are the only two possibilities that lead to the GMRCA. For example, “RCCC” means that a recombination event happens most recently, followed by three coalescent events. After finding all such series of events for the given number of sequences and recombinations, we can use simple combinatorics to compute the number of ARGs under the conditions that every pair of sequences has an equal chance of coalescing and each sequence has an equal chance of undergoing recombination (see Equation 1.4).

$$\begin{aligned}
\text{Total ARGs for 3 sequences} &= \text{Number of RCCC} + \text{Number of CRCC} \\
&= \binom{3}{1}\binom{4}{2}\binom{3}{2}\binom{2}{2} + \binom{3}{2}\binom{2}{1}\binom{3}{2}\binom{2}{2} \\
&= 54 + 18 \\
&= 72 \qquad (1.4)
\end{aligned}$$

In general, to compute the number of ARGs given the number of sequences and a fixed number of recombination events, we follow the same logic as Equation 1.4. By listing out all possible series of events that lead to the GMRCA, we can use basic counting techniques to obtain the total number of ARGs.
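The counting logic is mechanical enough to script; a short Python sketch (our own illustration) that reproduces the 3-sequence entry of Table 1.1:

# Count ARGs for one event series ("R"/"C"), multiplying the number of
# choices at each step: each pair is equally likely to coalesce, and each
# sequence is equally likely to recombine.
from math import comb

def count_args(event_series, n):
    k, total = n, 1
    for event in event_series:
        if event == 'R':
            total *= k           # choose the recombining sequence
            k += 1
        else:
            total *= comb(k, 2)  # choose the coalescing pair
            k -= 1
    return total

# Three sequences, one recombination: the two series that reach the GMRCA.
print(count_args('RCCC', 3) + count_args('CRCC', 3))  # 54 + 18 = 72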

We now return to the problem of estimating ARGs from data. There are many proposed methods of estimation that aim to address the difficulty of working with the coalescent-with-recombination, including Monte Carlo-based methods, approximations to the likelihood, and simplifications to the coalescent and tree-building process. We review each briefly here, since this will be an important part of our proposed algorithm.

The Monte Carlo methods we discuss are full likelihood inference methods. Because they use all possible information from the data, these estimation techniques are computationally difficult, making them infeasible for larger data sets [7, 32]. Kuhner et al. (2000) estimate the coalescent likelihood via a Markov chain Monte Carlo (MCMC) algorithm. Specifically, they use a Metropolis-Hastings method to sample from the coalescent genealogies, with the goal of estimating mutation and recombination parameters. While embedded in their method is the estimation of ARGs, the proposed MCMC algorithm is evaluated on the accuracy of the mutation and recombination parameter estimates for different types of sampled SNP data; obtaining the actual genealogies is not the ultimate goal of this algorithm. Fearnhead and Donnelly (2001) use importance sampling (IS) as a method of calculating the tree likelihoods. This IS algorithm improves upon the efficiency of the MCMC method of Kuhner et al. (2000) and also produces better approximations to the likelihood. Larribe et al. (2002) also created an IS algorithm to estimate coalescent likelihoods, but their method is still computationally intensive.

Rather than approximating full likelihoods, some methods work under a simpler likelihood framework, obtained by either summarizing the data or by breaking it down into subsets. Wall (2000) uses the former approach, estimating the likelihood using the number of distinct haplotypes and the minimum number of recombination events needed to explain the data. Fearnhead and Donnelly (2002) discuss two approaches, a marginal likelihood and a composite likelihood. The marginal likelihood is so named because some of the data are ignored and the likelihood is computed for a smaller set of data. These smaller data are the haplotypes defined only by segregating sites where the minor allele frequency is greater than some value. This approach gains efficiency over Monte Carlo methods by ignoring some information in the data, while still retaining most of the information concerning the estimation of recombination rates. The composite likelihood approach breaks the data into regions and computes the product of approximate marginal likelihoods over each interval. This has the effect of simplifying the coalescent model by removing some of the dependencies within the data. In comparing both methods, they recommend the marginal method for shorter sequences and the composite likelihood for longer sequences. Li and Stephens (2003) introduce a product of approximate conditionals (PAC) likelihood in which they allow for a variable recombination rate within the region of interest. This method has the advantage of being applicable to larger data sets.

Additionally, there are methods that tackle ARG estimation by simplifying the manner in which the trees are constructed, thus simplifying the likelihood calculations as well. The first of these is the Sequentially Markov Coalescent (SMC) [32], which addresses the non-Markovian property of the coalescent-with-recombination. Specifically, it is non-Markovian in that, as we move along the chromosome, the genealogy of each site is dependent on the genealogies of all previous sites. To deal with this property, the key aspect of the SMC algorithm is the restriction of events that may happen in the history of the sequences. Most notably, it restricts coalescent events to occur between lineages that have an interval where they share ancestral material. By applying the SMC model, MCMC techniques are used to sample from the marginal genealogies. This algorithm is considered an alternative model to the coalescent since the trees are no longer generated according to the originally defined coalescent-with-recombination model. However, by comparing distributions of various summary statistics, the SMC model is shown not to differ significantly from the standard coalescent.

Other methods that employ similar restrictions in the construction of genealogies exist. A group of these creates what are known as minARGs, which are ARGs with the minimum number of recombination events needed to describe the data [30, 50]. Minichiello and Durbin (2006) use a similar approach implemented in the software MARGARITA (see Section 2.1). The TREELD method [51] approximates the ARG around a focal point on the chromosome and disregards some genetic information as recombination events occur. The idea behind this construction is that markers farther away on the chromosome have less information about the genealogy at a focal point than do markers nearby.

18 1.3.3 Association Mapping

Having gained some background on the coalescent-with-recombination and ARGs, we now return to the discussion of association methods based on the ARG. ARG-based mapping methods make use of information beyond the genotypes and phenotypes. Through clever use of the observed data, these methods attempt to reconstruct the shared ancestry (i.e., the ARG) for a set of sequences. When the ARG is accurately estimated, it can provide information about possible disease-causing mutations that have occurred in the common history of the sample, since it contains within it the genealogies at each locus. If the true genealogy of the sample at the disease marker(s) were known, it would provide the necessary information about association in the region of interest [51, 33]. As the true ARG is unknown, one major disadvantage of these methods is that they tend to suffer from computational restrictions. However, increased computing power and better tree estimation algorithms are making them more feasible. In order to construct a valid ARG, certain assumptions are made on the population according to the coalescent model described in Section 1.3.1. Briefly, these assumptions include a constant population size, where generations are non-overlapping and discrete, with random mating [13]. However, some of these assumptions are relaxed by current disease mapping methods to allow for more realistic applications. For example, Larribe et al. (2002) allow for non-constant population sizes. We should clarify that these ARG-based methods may also depend on knowing or inferring the haplotype data, but we will reserve the term haplotype-based methods for those methods that do not construct a coalescent genealogy.

Many of the early methods of association mapping via the coalescent are computationally intensive and are becoming less desirable as newer methods are more applicable to larger data sets. Rannala and Reeve (2001) use an MCMC approach that integrates over coalescent genealogies, making their method computationally intensive like the Monte Carlo inference algorithms of Kuhner et al. (2000) and Fearnhead and Donnelly (2001) (see Section 1.3.2). This algorithm performs inference through Bayesian LD mapping, which gives a posterior distribution of disease location over a region. Morris et al. (2002) take a similar approach that allows for multiple founding mutations of a disease at the same locus through a model referred to as the shattered coalescent. The resulting inference is still given by a posterior density of each marker being associated with the disease. Additionally, many of the ARG estimation procedures of Section 1.3.2, such as [32], can be applied to disease mapping even if such a purpose is not specifically addressed by the authors.

More recent methods are much faster at ARG estimation, and as a result, they are more competitive with the faster haplotype-based methods of Section 1.2. We will highlight five of the most promising of these methods: TREELD, MARGARITA, TMARG, BETA, and CAMP. In TREELD, Zöllner and Pritchard (2005) use a novel approach to approximating genealogies of the data. Rather than reconstruct the full ARG over the data, they estimate the coalescent tree at a given focal point along the chromosome using the entire genotype data. As recombination events happen in the history of the data, parts of the sequences are discarded as they are detached from the focal point by recombination. Therefore, in the history of large samples where many recombination events have occurred, the sequence region surrounding the focal point at the GMRCA of the tree will be small compared to the original data. To run this tree estimation model, MCMC techniques are applied to sample trees from the posterior distribution of trees given the haplotype data and a designated focal point. Multiple focal points are analyzed across the genotyped region of interest, resulting in a local approximation to the ARG at each point. These focal points are equally spaced within the region and are not positions of actual data markers. Once the trees are created, at each focal point they calculate the likelihood of the tree producing the observed phenotype data. Mutations along the branches of the tree are modeled according to a Poisson process with no back mutations. The penetrances, which are the probabilities that a chromosome comes from an individual with a certain phenotype given a particular mutation status, are selected from a grid of values on the bounded set [0,1] × [0,1]. It is important to note that these penetrances are haploid penetrances, while typical problems involve diploid data. Wu (2008) discusses this issue in depth and provides a theorem stating that diploid phenotype likelihood problems are NP-hard. The resulting test statistic is based on a likelihood at each focal point. With this, one can perform a standard likelihood ratio (LR) test or carry out permutation-based testing. Moreover, this method also provides a Bayesian analysis of association at each focal point, resulting in the approximated posterior distribution of the disease marker location.

Minichiello and Durbin (2006) choose not to sample directly from the actual coalescent-with-recombination, but rather sample plausible ARGs using a heuristic method. This method is similar to the idea of minARGs in that the number of recombination events is limited. By controlling the number of recombination events, the number of sequences at any given time on the tree is also restricted, thereby speeding up the tree construction process. See Section 2.1 for a more complete treatment of this tree estimation procedure. Once the ARG is constructed, a marginal tree is extracted for each marker. To test whether a marker is associated with a disease, hypothetical mutations are placed on each branch of the tree. For each mutation, a χ2 test statistic for testing independence between inferred genotype and phenotype is calculated, and the maximum of these statistics is recorded. Multiple trees are estimated over a region, and the overall association score for a position is calculated by taking the mean of the maximum statistics recorded for each marginal tree. The final output is a p-value based on a permutation test of the phenotypes. This algorithm is implemented in the computer package MARGARITA.
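To make this scoring recipe concrete, here is a rough Python sketch (our own illustration, not MARGARITA's code): each branch is represented by the set of tips that would inherit a mutation placed on it, the per-tree score is the maximum χ2 over branches, and the overall score is the mean over trees. A permutation p-value would come from recomputing the score under shuffled phenotypes.

# Sketch of a MARGARITA-style association score (hypothetical input format).
import numpy as np
from scipy.stats import chi2_contingency

def tree_score(branch_tip_sets, phenotype):
    """Max chi-square over hypothetical branch mutations on one marginal tree.
    branch_tip_sets: for each branch, the tip indices below it (the carriers);
    phenotype: 0/1 control/case labels over the tips."""
    phenotype = np.asarray(phenotype)
    best = 0.0
    for tips in branch_tip_sets:
        carrier = np.zeros(len(phenotype), dtype=bool)
        carrier[list(tips)] = True
        table = np.array([[np.sum(carrier & (phenotype == 1)),
                           np.sum(carrier & (phenotype == 0))],
                          [np.sum(~carrier & (phenotype == 1)),
                           np.sum(~carrier & (phenotype == 0))]])
        if table.sum(axis=0).all() and table.sum(axis=1).all():
            best = max(best, chi2_contingency(table, correction=False)[0])
    return best

def association_score(trees, phenotype):
    """Mean of the per-tree maxima, as described in the text."""
    return np.mean([tree_score(t, phenotype) for t in trees])

# Hypothetical example: one 4-tip tree; branches given by their tip sets.
trees = [[{0}, {1}, {2}, {3}, {0, 1}, {2, 3}]]
print(association_score(trees, [1, 1, 0, 0]))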

Wu (2008) uses uniform sampling of minARGs to estimate genealogies, a process similar to that used in MARGARITA. Two reasons are cited for the approach: many regions in the human genome are believed to have low recombination rates, and the use of minARGs gives formal guidelines for constructing ARGs so as to restrict the number of recombination events that occur. This algorithm also takes a likelihood-based approach to association testing by computing what is referred to as the maximum phenotype likelihood (MPL). The MPL maximizes the phenotype likelihood by choosing the best subset of edges on the tree on which disease mutations occur, rather than summing over all possible mutations. However, the non-maximized phenotype likelihood is used in simulation results, as preliminary results using the MPL are not conclusive. In the likelihood calculation, the penetrance model is based on haploid penetrances similar to those in TREELD, and the final likelihood is the maximum likelihood achieved over the grid of penetrance values. This algorithm is implemented in the computer package TMARG. In simulations, results show that TMARG is as computationally efficient as MARGARITA, and both are much more efficient than TREELD. In terms of accuracy in point estimates of the disease mutation, TMARG is comparable with both methods.

Some methods make use of perfect phylogenies, which are ancestries that do not allow the same mutation to arise independently on different branches of the tree. When constructing perfect phylogenies for SNP data, this constraint means that each SNP is traced back to a single mutation. For a more in-depth treatment of perfect phylogenies, see Fernández-Baca (2001). We highlight two methods in particular that utilize this type of tree. Tachmazidou et al. (2007) first subset the chromosome into regions where the perfect phylogeny assumption holds. Within these regions, they proceed using a Bayesian approach to group the haplotypes into clusters based on the time to their most recent common ancestor in an evolutionary sense. Their MCMC algorithm iterates over different groupings of the data, with the goal of finding the location that best separates the cases from the controls. This optimal location is assumed to be near the unknown disease locus. While not giving specific limitations, the authors note that this method, referred to as BETA, can handle large numbers of subjects typed at many SNPs. Kimmel et al. (2008) focus their attention on the multiple testing issue inherent in coalescent-based approaches due to the multitude of genealogies sampled from the data. Their algorithm (CAMP) constructs perfect phylogeny graphs, which are genealogies constructed under the perfect-phylogeny model with the exception that recombination events can occur. Using the constructed graph, they test for interactions between SNPs and disease where plausible mutations may have occurred. Because they restrict which interactions may be tested, many SNP interactions are not tested, thus reducing the number of hypotheses.

23 1.4 ARG vs. Haplotype Methods

We conclude the discussion of coalescent-based methods by giving an overview of their advantages and disadvantages as compared to haplotype clustering methods to date. Several authors have performed comparison studies; however, it is often difficult to compare ARG-based and haplotype-based approaches, as many methods are restricted in the types and sizes of data they can handle. Methods are often compared based on power, type I error rates, and localization, which measures the distance from the marker with the highest signal to the true disease locus.

Won et al. (2007) provide a comparison across methods, focusing on three in particular. They chose TREELD as the representative of ARG-based approaches since it seemed to be the best of such methods at the time. Similarly, they selected the methods of Molitor et al. (2003) and Waldron et al. (2006) to represent the haplotype-based methods, specifically haplotype-clustering methods. These methods were compared using a data set for which the disease SNP location is known. In addition to noting the computing time for each method, comparisons were made based on disease location estimates, lengths of confidence intervals, and empirical coverage of the disease location. It was concluded that haplotype-clustering methods, similar to [46], are favored in terms of all aforementioned criteria.

Many authors have presented results of their methods along with top competing methods. Minichiello and Durbin (2006) compared MARGARITA with the haplotype-clustering method CLADHC [5] and the single-marker χ2 test. Under a wide range of simulated diseases and models, MARGARITA gives closer estimates of the disease location and has fewer false positives. Moreover, MARGARITA internally handles unphased data without relying on an outside program. Tachmazidou et al. (2007) make comparisons between their perfect phylogeny-based method (BETA) and MARGARITA, HAPCLUSTER, and Fisher's exact test. Results show that BETA performs similarly in terms of localization but has lower false-positive rates and is computationally more efficient. Finally, Kimmel et al. (2008) compare their method CAMP to CLADHC and the standard χ2. The CAMP method has much higher power than the other two methods and has better localization as compared to the standard χ2 test.

In summary, coalescent-based methods have the tools to compete with the haplotype-clustering methods. The general weakness of current coalescent-based methods is computational inefficiency, most often in the tree estimation algorithms. At the other extreme, some methods overcome this inefficiency at the expense of power and of more flexible likelihood models. As we present our proposed method in the next chapter, our goal is to address both the issue of a flexible likelihood model and that of computational efficiency.

1.5 Outline of Dissertation

We now present an outline of the remainder of the thesis. In Chapter 2, we propose a new method of association testing based on the ARG. We give the details of the likelihood calculation, significance testing, and its computer implementation. We further discuss expansions to the model that allow for covariate information to be included. Chapter 3 gives results of simulation studies using our method and compares them to results obtained with other competing methods in the literature. In Chapter 4, we apply our method to real genotype data, where the disease under study is type 1 diabetes. Finally, we summarize our work in Chapter 5 and give future directions that we plan to pursue with this method.

CHAPTER 2

AN ARG-BASED DISEASE MAPPING ALGORITHM

Our proposed method of disease gene mapping is a two-step procedure. First, we estimate ARGs based on the genotype data. Second, we use the marginal genealogies, along with the phenotype data, to test for associations along the chromosome. In this chapter we describe in detail the procedure for sampling the coalescent-based genealogies. We then discuss the likelihood calculations on these genealogies, with a description of the disease model assumptions and extensions to include covariates. Lastly, we describe the computer-implemented version of this algorithm.

2.1 ARG Estimation

In the first step of our method, we consider the manner in which the ARG will be estimated based on the available genotype data. An ideal estimator will provide a good balance of accuracy and efficiency. Since the primary goal in this work is to develop an improved method of testing association, we chose not to create our own ARG estimation method but instead selected an existing estimator. Specifically, we utilize the method of MARGARITA, first described in Section 1.3 [33]. This method is a computationally fast algorithm for sampling trees that can handle thousands of individuals, each typed at hundreds of SNPs. Computational efficiency is a highly attractive quality in a sampler, since many current methods are restricted by the sample sizes they can handle. MARGARITA achieves its speed because it uses a heuristic algorithm that constructs plausible ARGs, rather than sampling from the coalescent-with-recombination model directly. This sampling procedure is similar to constructing minimal ARGs. It is important to note that the ARGs are estimated from information in the genotype data only. Any information contained in the phenotype data would bias the tree construction, giving rise to possibly false genetic associations.

Before outlining the steps in MARGARITA's ARG-building algorithm, we briefly return to the concept of a genetic mutation. Like recombination and coalescence events, mutations are accounted for in the coalescent model. In this case, a mutation refers to a locus of a sequence that undergoes a change in its genetic material. Under the coalescent model, mutations are added to the ARG according to a Poisson process with rate parameter θ = 4Neµ, where Ne is the effective population size and µ is the mutation rate [32].
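As an illustration (our own, under the common assumption that branch lengths are measured in units of 2Ne generations, so that mutations fall at rate θ/2 per unit of branch length), placing Poisson mutation counts on branches is a one-liner:

# Sketch of Poisson mutation placement on ARG branches: assuming time in
# units of 2*Ne generations, the count on a branch of length t is Poisson
# with mean theta*t/2, where theta = 4*Ne*mu.
import numpy as np

rng = np.random.default_rng(7)

def mutation_counts(branch_lengths, theta):
    return rng.poisson([theta * t / 2 for t in branch_lengths])

print(mutation_counts([0.10, 0.40, 0.25], theta=2.0))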

We now describe the method used in MARGARITA. To ensure the proper construction of ARGs, the following set of rules governs which sequences may coalesce, undergo mutations, and incur recombinations in MARGARITA's tree construction algorithm:

(A) When there exist two sequences that have a shared region of genetic material, then they may coalesce.

(B) When there exists a sequence with one locus that has a different allele from all other sequences, then that allele may be removed from the sample via a mutation. The locus will mutate to the ancestral allele present in all other sequences in the sample.

(C) When there exist no sequences that satisfy rules (A) or (B), then a recombination event may occur.

The most important consequence of these rules results from the fact that a recombination event may only occur if there are no possible mutation or coalescence events. The idea is that we do not create new sequences when it is not necessary to do so. Recall that when a recombination event happens, the current sample size increases by one, making the tree estimation problem even more difficult. The rule governing coalescences is similar to that employed by the SMC algorithm [32] for estimating ARGs, described in Section 1.3.2.

Under the rules outlined above, there may be multiple possible events at any given time. To choose between the events, the following set of heuristics determines the order in which they occur:

(1) A recombination occurs only when no mutations or coalescences are allowed.

(2) When multiple mutations or coalescences are possible, the order in which they are performed is random.

(3) Recombination breakpoints occur at the ends of the longest shared ancestral regions.

(4) After recombination occurs, the next event is a coalescence between the two sequences that determined the recombination breakpoint (see Figure 2.1).

The random ordering described in heuristic (2) adds a stochastic nature to the program, in which multiple runs on the same data set will not necessarily produce the same tree. Moreover, because the order of multiple possible events is random, once a set of trees is created under this algorithm, they are all weighted equally.
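To make the rule priority concrete, here is a toy Python sketch (our own illustration under simplifying assumptions, not MARGARITA's implementation; recombination handling is omitted because the demo data never require it):

# Toy sketch of rules (A)-(C): sequences are strings over {'0','1','.'},
# where '.' marks non-ancestral or unknown material.
import random

def shares_tract(a, b):
    """Rule (A), simplified: a pair may coalesce if the sequences agree on
    at least one position and never disagree where both are ancestral."""
    both = [(x, y) for x, y in zip(a, b) if x != '.' and y != '.']
    return bool(both) and all(x == y for x, y in both)

def merge(a, b):
    """Ancestor of a coalescence: keep the defined character from either child."""
    return ''.join(x if x != '.' else y for x, y in zip(a, b))

def singleton(seqs):
    """Rule (B): find a site where exactly one sequence carries the '1' allele."""
    for j in range(len(seqs[0])):
        carriers = [i for i, s in enumerate(seqs) if s[j] == '1']
        if len(carriers) == 1:
            return carriers[0], j
    return None

def choose_event(seqs):
    """Heuristics (1)-(2): recombination only when nothing else is allowed;
    otherwise pick randomly among allowed mutations and coalescences."""
    moves = [('C', i, k) for i in range(len(seqs))
             for k in range(i + 1, len(seqs)) if shares_tract(seqs[i], seqs[k])]
    hit = singleton(seqs)
    if hit is not None:
        moves.append(('M',) + hit)
    return random.choice(moves) if moves else ('R',)

# Demo: two sequences sharing their first two SNPs (cf. Figure 2.1).
seqs = ['001', '000']
while len(seqs) > 1:
    ev = choose_event(seqs)
    if ev[0] == 'R':
        break               # recombination handling omitted in this sketch
    if ev[0] == 'M':        # remove the singleton allele via mutation
        _, i, j = ev
        seqs[i] = seqs[i][:j] + '0' + seqs[i][j + 1:]
    else:                   # coalesce the chosen pair
        _, i, k = ev
        seqs = [s for n, s in enumerate(seqs) if n not in (i, k)] + \
               [merge(seqs[i], seqs[k])]
    print(ev, seqs)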

Here we reiterate that this algorithm does not claim to sample trees from the coalescent-with-recombination model. These heuristics aim to create a simplification of the model by restricting what types of events may occur given the current configuration of sequences. By limiting the number of recombination events, this algorithm borrows from the concept of minARGs (introduced in Section 1.3.2), which construct ARGs using the (estimated) minimum number of recombination events needed to describe the history of a sample. The topic of minARGs has been well studied, and Wu (2008) notes that many areas of the human genome have been shown to have low recombination rates. Therefore, under the simplification employed by MARGARITA, plausible ARGs are constructed.

This ARG inference algorithm has the ability to handle missing and unphased data, both of which are handled during the ARG building process. When there is a missing SNP in the sample, there is no restriction on the sequences with which it can coalesce. For example, if we have a 3-character sequence with a missing character at marker two, given by 0.1, and a fully-known sequence given by 001, then these sequences may coalesce, and the missing character takes on the character of the fully-known sequence. In this example the resulting ancestor of the coalescent event is 001. The algorithm handles unphased data in a slightly more complicated manner. When the data are unphased, the positions of the ambiguous SNPs are represented by a “.”, similar to missing and non-ancestral data. However, the sister chromosomes of an unphased individual are restricted from coalescing with one another until one sequence becomes phase-known.

Figure 2.1: Illustration of ARG construction using the algorithm in MARGARITA. (a) The two sequences have a shared tract spanning the first two SNPs. (b) To coalesce over the region, we must add a recombination breakpoint between SNPs 2 and 3. (c) The undefined material, denoted by a “.”, may coalesce with anything. We can now coalesce the left recombination parent and the other sequence. We can also add a mutation to the right recombination parent. The “3” indicates that the mutation is at SNP 3. (d) We can coalesce the remaining two sequences.

Moreover, if an unphased sequence coalesces with another unphased sequence, then those two become dependent on each other, and the corresponding sister chromosomes are restricted from coalescing with one another until all four chromosomes become phase-known. Once a sequence becomes phased in the construction of the ARG, the phase is carried back down to the tips of the tree. Figure 2.2 shows how the algorithm handles phase-unknown data for two sets of sister chromosomes.

The advantage of MARGARITA's algorithm is that the ARG is constructed rapidly. In addition, it provides heuristics to handle both missing and unphased data as part of the reconstruction process. However, it does not provide estimation of branch lengths within the ARG, as these are not required for inference with MARGARITA. In the next section, we describe how we estimate branch lengths on the trees obtained with this algorithm.

2.2 Branch Length Estimation

To compute the likelihood of the phenotype data on the ARGs estimated by MARGARITA, we need the branch lengths for each inferred tree. The branch lengths will be used in the next section to determine how probable it is for a mutation to occur as we move down the tree. The trees we generate in MARGARITA describe the topologies only, and so we need to estimate the branch lengths on the trees given the genotype data. We describe two approaches: the maximum likelihood method, which is suitable for smaller data sets, and the parsimony approach, which is fast even for trees containing thousands of tips.

We use the function optimization algorithm known as “Brent's Method in One Dimension” to find the maximum likelihood branch lengths on the marginal trees obtained from the ARG.

Figure 2.2: (a) Unphased data for two individuals at the second SNP. Phase is unknown due to the presence of heterozygotes. (b) Two unphased sequences coalesce. Now all four sequences are dependent on each other and cannot coalesce until the phase is resolved. (c) Coalescence with a phase-known sequence. (d) Phase is carried from the derived ancestral sequence of (c) to all four observed sequences.

For a given tree, we set initial branch lengths according to the molecular clock assumption by setting the total time of the tree to 1.0 and assigning fractions of this total to nodes based on their depths. Briefly, this assumption guarantees that the mutation rates are constant along each branch of the tree. Once the initial branch lengths are set, we apply Brent's method to adjust these lengths one at a time in order to achieve the highest likelihood of the tree given the branch lengths. In other words, we are trying to maximize the likelihood function of a given tree topology by finding the best combination of the function's variables – the branch lengths.

Brent’s method uses parabolic interpolation, fitting a parabola through three distinct points within a predetermined interval along the function. This fitted parabola determines where the search algorithm will move to next. As each parabolic step is carried out, the points on which the parabola is fitted are: (1) the point with the greatest function value most recently found; (2) the point with the second greatest function value; and (3) the previous point with the second greatest function value.

There are two criteria within the algorithm that determine whether each parabolic step is accepted. They essentially guarantee that the parabolas are computed within the correct interval and that the algorithm is, in fact, converging to a maximum. The algorithm ends either after the last two function evaluations are within a given tolerance of one another or when the maximum number of preset iterations is reached.

The algorithm outputs the maximum value of the function (in our case the likelihood) and the arguments that achieved this maximum, which are the branch lengths.

See Brent (1973) and Press et al. (1992) for further reading on this method and its implementation in C.
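To make the structure of this optimization concrete, the following self-contained C sketch substitutes a golden-section search for the full Brent routine (the real method adds parabolic-interpolation steps for faster convergence, as described above). All names here are our own, and toy_loglik is a hypothetical stand-in for the likelihood of a tree viewed as a function of a single branch length; in the actual algorithm, each branch length would be maximized in turn in this manner while the others are held fixed.

#include <stdio.h>

/* Golden-section search, used here as a simple stand-in for Brent's method:
 * maximize a one-dimensional unimodal function over a bracketing interval. */
static double golden_maximize(double lo, double hi, double tol,
                              double (*f)(double))
{
    const double g = 0.6180339887498949;   /* (sqrt(5)-1)/2 */
    double a = lo, b = hi;
    double x1 = b - g * (b - a), x2 = a + g * (b - a);
    double f1 = f(x1), f2 = f(x2);
    while (b - a > tol) {
        if (f1 < f2) {          /* maximum lies in [x1, b] */
            a = x1; x1 = x2; f1 = f2;
            x2 = a + g * (b - a); f2 = f(x2);
        } else {                /* maximum lies in [a, x2] */
            b = x2; x2 = x1; f2 = f1;
            x1 = b - g * (b - a); f1 = f(x1);
        }
    }
    return 0.5 * (a + b);
}

/* Toy stand-in for the log-likelihood of a tree as one branch length
 * varies: a smooth concave curve with its peak at b = 0.3. */
static double toy_loglik(double b) { return -(b - 0.3) * (b - 0.3); }

int main(void)
{
    double bhat = golden_maximize(1e-8, 10.0, 1e-7, toy_loglik);
    printf("estimated branch length: %.6f\n", bhat);   /* ~0.3 */
    return 0;
}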

Figure 2.3: Most parsimonious reconstruction (MPR) of a 4-tip tree with tip states 0, 1, 0, 0. The labels of the inner nodes are assigned such that they minimize the number of changes needed to arrive at the observed data at the tips. This MPR has length one, since only one change, along the branch marked with a line, is needed.

The algorithm works by iteratively and sequentially optimizing each node time on the tree. Thus, in each iteration of this maximum likelihood approach, the likelihood of the tree must be re-calculated. For trees with large numbers of tips, these calculations can be time consuming for the algorithm. Thus for large data sets, it is necessary to have an alternative method of estimating the branch lengths. For trees with a large number of tips, we generate parsimony-approximated branch lengths using the Rogers-Swofford Algorithm [39]. This method is much faster than maximum likelihood for large trees, making it possible to run analyses on samples with many cases and controls. While this method is not guaranteed to find the optimal branch lengths, the lengths are generally good approximations to those obtained by maximum likelihood estimation [39].

The underlying concept of this method is that of most parsimonious reconstructions (MPRs). For a fixed tree topology, an MPR is an assignment of (unobserved) internal states that minimizes the parsimony length of a tree, i.e., the number of changes that occur to produce the (observed) tip data. The following steps outline the algorithm used to find an MPR for a given tree with binary state data [14]:

(1) Assign the observed states (genotype data) on the external nodes (tips) of the tree. For each external node, the assigned state is the “state set” for that node.

(2) Go to an internal node for which its state set is undefined, but for which its two descendents’ state sets are defined. If the intersection of the two descendents’ state sets is non-empty, then assign the current internal node’s state set to be that intersection. Else, if the intersection is empty, then assign the state set to be {0,1}.

(3) If the internal node of step (2) is the root, then proceed to step (4). Else, return to step (2).

(4) Assign the root an arbitrary state from its state set, starting with 0. This is its

“optimal assignment state”.

(5) Starting at the root, go to an internal node k for which the state set is not defined by a single state, but for which its immediate ancestor’s optimal state m is

defined. Define k’s optimal assignment to be m.

(6) If all internal nodes have been visited in step (5), then record the number of changes needed to describe the states at the external nodes, where a “change” occurs when a node and its immediate ancestor have different assigned states.

(7) If all possible assignments of the root state have been tried in steps (4)-(6), then go to step (8). Else, return to step (4).

(8) The MPR is the assignment of states with the smallest number of changes recorded in step (6).
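The following compact C sketch walks through steps (1)-(8) for our rooted, binary-state case and reproduces the 4-tip example of Figure 2.3; the node numbering and array layout are illustrative choices of ours, not those of any published implementation.

#include <stdio.h>

#define NNODES 7   /* 4 tips + 3 internal nodes of a rooted binary tree */

/* State sets are bitmasks over the two states: bit 0 = state 0, bit 1 = state 1. */
static int left[NNODES], right[NNODES];   /* children; -1 marks a tip */
static int set[NNODES];                   /* state set of each node */
static int state[NNODES];                 /* optimal assignment of each node */

/* Steps (1)-(3): compute state sets bottom-up. Nodes are numbered so that
 * children precede parents, so a single forward pass visits children first. */
static void compute_state_sets(void)
{
    for (int k = 0; k < NNODES; k++) {
        if (left[k] < 0) continue;                /* tip: set given by the data */
        int isect = set[left[k]] & set[right[k]];
        set[k] = isect ? isect : 3;               /* empty intersection -> {0,1} */
    }
}

/* Steps (4)-(6): fix a root state (the root is node NNODES-1), push optimal
 * states down the tree, and count the changes along branches. */
static int count_changes(int root_state)
{
    state[NNODES - 1] = root_state;
    int changes = 0;
    for (int k = NNODES - 2; k >= 0; k--) {       /* parents precede children here */
        int parent = 0;
        for (int p = 0; p < NNODES; p++)
            if (left[p] == k || right[p] == k) { parent = p; break; }
        /* a singleton set forces the state; {0,1} inherits the ancestor's state */
        state[k] = (set[k] == 3) ? state[parent] : (set[k] >> 1);
        if (state[k] != state[parent]) changes++;
    }
    return changes;
}

int main(void)
{
    /* The 4-tip example of Figure 2.3: tips 0-3 carry states 0,1,0,0;
     * node 4 joins tips 0 and 1, node 5 joins node 4 and tip 2, and
     * node 6 (the root) joins node 5 and tip 3. */
    int tip_state[4] = { 0, 1, 0, 0 };
    for (int k = 0; k < 4; k++) { left[k] = right[k] = -1; set[k] = 1 << tip_state[k]; }
    left[4] = 0; right[4] = 1;
    left[5] = 4; right[5] = 2;
    left[6] = 5; right[6] = 3;

    compute_state_sets();

    /* Steps (7)-(8): try every state in the root's set and keep the best. */
    int best = NNODES;
    for (int s = 0; s < 2; s++)
        if (set[6] & (1 << s)) {
            int c = count_changes(s);
            if (c < best) best = c;
        }
    printf("MPR length = %d\n", best);            /* prints 1 for this example */
    return 0;
}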

Note that the general algorithm for finding an MPR is described for ordered multi-state character data for an unrooted tree. The algorithm presented here is a modified

version for our special case of a rooted tree with binary states. Figure 2.3 illustrates an

MPR for a 4-tip tree. For a given data set, there may be multiple MPRs compatible

with the data. This algorithm finds one such MPR based on the MARGARITA-

inferred tree topology. For each branch on the tree, the number of changes along that

branch is counted across SNPs. Using this count, the branch length is estimated with

the Jukes-Cantor distance formula

d = −(3/4) log(1 − (4/3)p),   (2.1)

where p is the proportion of sites with a mutation along that branch.
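Equation 2.1 translates directly into C; this one-liner (with an illustrative name of our own) is all the parsimony branch lengths require once the per-branch change proportions have been counted:

#include <math.h>

/* Jukes-Cantor distance of Equation 2.1: p is the proportion of sites with
 * a mutation along the branch; valid for p < 3/4. For example,
 * jc_distance(0.1) is approximately 0.1073. */
double jc_distance(double p)
{
    return -0.75 * log(1.0 - (4.0 / 3.0) * p);
}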

In both the maximum likelihood method and the parsimony approach, we estimate the branch lengths of each marginal tree embedded in the ARG. Although each

marginal tree represents the ancestry at one marker in the data, we use the entire

SNP data to make inference on the branch lengths. We do this because there is not

enough information at a single marker to estimate the branch lengths. By comparing

the branch length estimates based on different intervals of SNPs, we found that the

estimates obtained by using the entire data do not differ greatly from the estimates

obtained by using smaller intervals of SNPs around the marker of interest.

2.3 Measuring Association

In this section we address the second step of our disease mapping algorithm: association testing. We explain the likelihood calculations in detail, along with the underlying mutation and penetrance models. We also discuss the resulting test statistics from this method.

2.3.1 Likelihood Calculation

Before describing the actual likelihood calculation, it is important to note the

hypothesis that we are interested in testing. For tests of this nature, we want to

determine if there is enough evidence to conclude an association between a phenotype

and a genetic marker. Thus, for each marker, our null hypothesis is that there is no

association between genotype and phenotype at this marker.

Once we have estimated the ARGs in step 1, our data are the phenotypes observed at the tips of the tree. To measure association along the chromosome, we calculate the likelihood of the phenotype data along the tips of the genealogy. When constructing

the trees (embedded within the ARGs) from genetic data, the genotypes are more likely to be clustered on the tips of the tree. This is because under the coalescent model, mutations more often occur on the longer branches towards the root of the tree and less often on the shorter branches towards the tips. If there is an association between the phenotype and the genotype at a locus, then this clustering should be reflected, to some extent, when we place the phenotype data on the tips as well, resulting in a higher likelihood. The extent of the phenotype clustering, as measured by the likelihood, is an indication of the degree of association.

From the ARGs estimated within MARGARITA, we are able to extract marginal trees at each marker in the data (see Figure 2.4). At a given marker, the marginal tree describes the shared ancestry of the sample at that location. For inference purposes, the data placed on the tips of the tree are the phenotypes of the sample. In practice, we generally do not have the true disease marker in our data. If we have markers that are close enough to the true disease location, then the marginal trees should be highly correlated, as there is less space in which recombination can occur. For markers that are far from the true disease marker, it is more likely that recombination has occurred and the genealogies will essentially be independent of one another.

To calculate the likelihood on a marginal tree, we need the phenotype tip data along with the phenotype states at all internal nodes of the tree. However, the states of internal nodes are unobserved data and we must sum over all possible assignments of phenotypes at the internal nodes. When there are q internal nodes on the tree, there are 2^q such possible assignments. Fortunately, we can use Felsenstein’s peeling algorithm [8] to simplify the computation. Before getting into the specifics of the

Figure 2.4: Four sequences typed for four SNPs along with two estimated ARGs for association testing. Above each SNP are the marginal trees extracted from the estimated ARGs. At the tips of the trees are the sequence labels. In practice, the corresponding phenotypes are placed on the tips.

algorithm, let us first define some notation:

φ_MRCA = vector of phenotypes on all tips of the tree
φ_i = vector of phenotypes of the external nodes descending from node i
m_i = mutation status of node i
b_i = length of branch connecting node i to its parent
M_{j,k}(b_i) = probability of a mutation from state j to k on the branch connecting node i to its parent
x = marker along the chromosome
T_x = vector of inferred genealogies at marker x
p = vector of penetrances

For each marker, the phenotype likelihood of interest is given by Pr(φ_MRCA | X = x, T_x). Refer to Figure 2.5 to follow the likelihood calculations given in Equations 2.3 and 2.7. To calculate this likelihood under the peeling algorithm, in practice, we start at the tips using the given phenotype data and work up to the root of the tree. At each node within the tree, the only quantities needed to calculate that node’s likelihood are the likelihoods of its two descendent nodes. However, to best understand the steps of the algorithm, we will simplify the likelihood expression starting at the root and work our way to the tips. The basic idea behind the likelihood calculation is as follows. Assume the ancestral node has two descendants labeled a and b. Moreover, the mutation state at the ancestor may be 0 or 1, so that

Pr(φ_MRCA | X = x, T_x) = (1/2) Pr(φ_MRCA | m_MRCA = 0) + (1/2) Pr(φ_MRCA | m_MRCA = 1).   (2.2)

Figure 2.5: Marginal tree with four external nodes on which we illustrate our likelihood calculation. Nodes a and b are both internal nodes, and u and v are both external nodes where the phenotype data are known.

Both of the probabilities in Equation 2.2 that condition on the ancestral mutation state are calculated in a similar manner. We illustrate the details of the calculation when m_MRCA = 0 in Equation 2.3. When m_MRCA = 1, we substitute M_{1,k}(·) for M_{0,k}(·) in Equation 2.3 to reflect the different initial mutation state.

Pr(φ_MRCA | m_MRCA = 0) = (Pr(φ_a | m_a = 0) ∗ M_{0,0}(b_a) + Pr(φ_a | m_a = 1) ∗ M_{0,1}(b_a))
∗ (Pr(φ_b | m_b = 0) ∗ M_{0,0}(b_b) + Pr(φ_b | m_b = 1) ∗ M_{0,1}(b_b))   (2.3)

In Equation 2.3, we are multiplying the probabilities of the events leading to each child node, considering both types of mutations that can occur given the ancestral mutation state.

The reason we can perform this multiplication is that once a sequence splits into two descendents, the lineages are now independent. To further simplify the expression in Equation 2.3, we use the same breakdown of events to compute the probabilities

for nodes a and b (e.g., Pr(φ_b | m_b = 1)) using their descendent nodes until we finally reach the tips of the tree. At the tips, we know the phenotype status of each individual, which allows us to substitute the penetrances into the likelihood. The vector

of penetrances is given by p = (P_{1|1}, P_{0|1}, P_{1|0}, P_{0|0}), where P_{φ_i|j} = Pr(node i has phenotype φ_i | j mutations) for φ_i, j = 0 or 1. For example, if s is an external tip with

observed phenotype 1, then we make the substitutions Pr(φ_s | m_s = 1) = P_{1|1} and Pr(φ_s | m_s = 0) = P_{1|0} in the likelihood calculation. See Section 2.3.2 for a detailed

discussion of the penetrances.

The mutation model used in the likelihood calculation is the M2 model of Lewis

(2001). This model assumes two possible mutation states, where lineages may change

states at any time within the genealogy. The instantaneous rate matrix is given by

Q, as shown in Equation 2.4, where α is the instantaneous rate of change between the states.

Q = [ −α    α ]
    [  α   −α ]      (2.4)

From this rate matrix we use the equality M = B e^{Dt} B^{−1} to find the transition probability matrix M (see Equation 2.5), where B is the matrix of eigenvectors of Q and D is the diagonal matrix of corresponding eigenvalues.

M = [ 1/2 + (1/2) exp{−2αt}    1/2 − (1/2) exp{−2αt} ]
    [ 1/2 − (1/2) exp{−2αt}    1/2 + (1/2) exp{−2αt} ]      (2.5)

Under this model, the equilibrium frequencies of the two states are both 1/2, which is reflected in Equation 2.2. Depending on the data, the assumption of equal equilibrium frequencies may not be valid. To allow for unequal frequencies, we need to re-parameterize the rate matrix and solve for the new transition probability matrix.

Under such a model, transitions from 0 to 1 would occur at a different rate than transitions from 1 to 0. However, for the method proposed here, we work under the equal equilibrium frequency model.

The expected number of changes per site over time t is αt. Since the branch lengths are defined to be this expected number of changes per site, we make the substitution b_i = αt when computing the entries of this matrix. One key aspect of this mutation model is that back mutations are allowed, meaning that once a site mutates to state 1 it has positive probability of mutating back to state 0.
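The entries of M are simple enough to compute directly; a minimal C helper (the name is our own), already parameterized by the branch length b = αt:

#include <math.h>

/* M_{j,k}(b): transition probability of the M2 model (Equation 2.5) with
 * the substitution b = alpha * t, so the exponent -2*alpha*t becomes -2b. */
double m2_transition(int j, int k, double b)
{
    double e = exp(-2.0 * b);
    return (j == k) ? 0.5 + 0.5 * e : 0.5 - 0.5 * e;
}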

In practice, we do not know the true ARG of the data and so we estimate multiple

ARGs to account for this uncertainty. Within each estimated ARG, we can extract a marginal tree for each marker. This process results in multiple marginal trees at each marker. To get an overall likelihood at each marker, we need to combine the likelihoods obtained for each marginal tree. As there are no distributional assumptions placed on the trees under the MARGARITA heuristic, we take the median of the likelihoods to get the overall likelihood at marker x, as described in Equation 2.6.

Pr(φ | X = x, T_x, p) = med_i ( Pr(φ | X = x, T_x[i], p) )   (2.6)

2.3.2 Penetrance Model

To complete our description of the likelihood calculation, we now give the details of the penetrance model. As mentioned in Section 1.3, we use a haploid penetrance model. We do this because each tip on the ARG represents only one sequence, which corresponds to one chromosome from a diploid individual. The penetrance model is defined by the four penetrances: P_{1|1}, P_{0|1}, P_{1|0}, and P_{0|0}, where P_{φ_a|j} is the probability that tip a has phenotype φ_a given that the mutation state is j. As there is uncertainty

regarding the penetrance values, our algorithm samples the penetrances from user-

specified priors. When no prior knowledge is assumed about the penetrances, the

algorithm samples the penetrances independently from a U(0.05, 0.95) prior. We do not sample extreme values close to 0.0 and 1.0, as they may cause the likelihood to become 0. In practice, we need only sample values for P_{1|1} and P_{1|0}, as the remaining two penetrances are their complements.
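A minimal sketch of this default sampling scheme in C (names ours; rand() stands in for whatever uniform generator the program actually uses):

#include <stdlib.h>

/* Sample one penetrance set from independent U(0.05, 0.95) priors. Only
 * P(phenotype 1 | j mutations) is drawn for j = 0, 1; the remaining two
 * penetrances are the complements P_{0|j} = 1 - P_{1|j}. */
static double runif(double lo, double hi)
{
    return lo + (hi - lo) * ((double)rand() / RAND_MAX);
}

void sample_penetrances(double *p1_given_1, double *p0_given_1,
                        double *p1_given_0, double *p0_given_0)
{
    *p1_given_1 = runif(0.05, 0.95);
    *p1_given_0 = runif(0.05, 0.95);
    *p0_given_1 = 1.0 - *p1_given_1;
    *p0_given_0 = 1.0 - *p1_given_0;
}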

Given this definition of penetrance, we can now define the likelihood of Equation

2.3 for the tips of the tree. Returning to Figure 2.5, we illustrate how to calculate

the likelihood of node a given that m_a = 0, noting that the same logic is used to calculate the likelihood of node a given that m_a = 1. Substituting the penetrances

into the likelihood now gives us the following:

Pr(φ_a | m_a = 0) = (Pr(φ_u | m_u = 0) ∗ M_{0,0}(b_u) + Pr(φ_u | m_u = 1) ∗ M_{0,1}(b_u))
∗ (Pr(φ_v | m_v = 0) ∗ M_{0,0}(b_v) + Pr(φ_v | m_v = 1) ∗ M_{0,1}(b_v))
= (P_{1|0} ∗ M_{0,0}(b_u) + P_{1|1} ∗ M_{0,1}(b_u)) ∗ (P_{0|0} ∗ M_{0,0}(b_v) + P_{0|1} ∗ M_{0,1}(b_v)).   (2.7)

Under this representation of the likelihood, we can now enter numeric values for both the penetrances and the mutation probabilities. Remember, when calculating the likelihood in practice, Equation 2.7 is the starting point and we work our way to

Equation 2.3.
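Putting the pieces together, the peeling recursion itself is short. The sketch below is our own illustrative rendering for the binary-state model, with the tree stored in arrays (children of node k in left_c[k]/right_c[k], -1 at tips), branch lengths in blen[], tip phenotypes in pheno[], and penetrances in pen[phi][j]; L[k][j] holds the partial likelihood of the tips below node k given mutation state j at node k.

#include <math.h>

#define MAXN 1024
static int left_c[MAXN], right_c[MAXN], pheno[MAXN];
static double blen[MAXN], pen[2][2], L[MAXN][2];

/* Equation 2.5 with b = alpha*t, as in the earlier sketch. */
static double m2(int j, int k, double b)
{
    double e = exp(-2.0 * b);
    return (j == k) ? 0.5 + 0.5 * e : 0.5 - 0.5 * e;
}

static void peel(int k)
{
    if (left_c[k] < 0) {               /* tip: substitute penetrances (Eq. 2.7) */
        L[k][0] = pen[pheno[k]][0];
        L[k][1] = pen[pheno[k]][1];
        return;
    }
    peel(left_c[k]);
    peel(right_c[k]);
    for (int j = 0; j < 2; j++) {      /* mutation state j at node k (Eq. 2.3) */
        double dl = 0.0, dr = 0.0;
        for (int c = 0; c < 2; c++) {  /* sum over each child's state */
            dl += L[left_c[k]][c]  * m2(j, c, blen[left_c[k]]);
            dr += L[right_c[k]][c] * m2(j, c, blen[right_c[k]]);
        }
        L[k][j] = dl * dr;             /* descendent lineages are independent */
    }
}

/* Equation 2.2: weight the two root states by the equilibrium frequency 1/2. */
double phenotype_likelihood(int root)
{
    peel(root);
    return 0.5 * L[root][0] + 0.5 * L[root][1];
}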

Defining the penetrance model in this way allows for flexibility in changing assumptions about the disease model. Constraints may be placed on the priors to account for different disease behavior. For example, one may want to use the constraint

that P_{1|1} > P_{1|0} if it is more likely for someone to have the disease if the mutation

is present. Moreover, the penetrance model can be easily extended to incorporate

multi-level as well as continuous phenotypes. Finally, as we address in Section 2.5,

covariates are easily introduced through the penetrance model.

Because we are now introducing multiple penetrances to the model, the overall

likelihood for a marker needs to be combined across penetrances as well as marginal

trees. Let p_k denote penetrance set k. For a given penetrance set, we calculate a

likelihood for each tree using the peeling algorithm. To get an overall likelihood at

this penetrance set, we take the median of the likelihoods as we did in Equation 2.6.

This gives a likelihood for each penetrance. To get the final likelihood, we then take

the mean of these likelihoods with respect to the priors placed on the penetrances.

In practice we cannot compute this mean exactly and so we sample from the priors

and take the sample mean of the likelihoods. Say we sample K penetrances, then the overall marker likelihood is given by

Pr(φ_MRCA | x, T_x) = ∫_p Pr(φ_MRCA | x, T_x, p) dp ≈ (1/K) Σ_{k=1}^{K} Pr(φ_MRCA | x, T_x, p_k).   (2.8)

The likelihood represented in Equation 2.8 is the final likelihood calculated for each marker in the data. We denote the final likelihood for marker x by L_x. For each marker, the value of L_x will be used to test whether there is an association with the phenotype at that location.
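In code, combining Equations 2.6 and 2.8 amounts to a median over trees followed by a sample mean over penetrance sets. A sketch with illustrative names, where lik[k][i] holds the peeling likelihood for penetrance set k and marginal tree i (note that median() sorts its argument in place):

#include <stdlib.h>

static int cmp_double(const void *a, const void *b)
{
    double d = *(const double *)a - *(const double *)b;
    return (d > 0) - (d < 0);
}

static double median(double *v, int n)
{
    qsort(v, n, sizeof(double), cmp_double);
    return (n % 2) ? v[n / 2] : 0.5 * (v[n / 2 - 1] + v[n / 2]);
}

double marker_likelihood(double **lik, int K, int ntrees)
{
    double sum = 0.0;
    for (int k = 0; k < K; k++)
        sum += median(lik[k], ntrees);   /* Equation 2.6, per penetrance set */
    return sum / K;                      /* Equation 2.8: mean over the sets */
}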

2.3.3 Significance Testing

We present three possible test statistics resulting from this algorithm, all of which

are based on the marker likelihoods, L_x. One may conduct a traditional likelihood ratio test (LRT) by computing log(L_x/L_0), where L_0 is the null likelihood. We are interested in testing if there is a disease association at each marker, and so the likelihood under the null hypothesis would represent the likelihood of no association. If there are n tips on the tree, then L_0 is given by (1/2)^n. The intuition behind this null likelihood is that if there is no association with the disease at this marker, then each individual should have the same chance of having the disease as not having the disease. Under the coalescent model, once a node descends into two branches, the two descendents are treated as independent. Therefore, all tips of the tree are independent of one another. This assumption of independence allows L_0 to be computed by taking the product of 1/2 across the tips. To carry out the test, we compare 2 log(L_x/L_0) to a χ2 distribution with 1 d.f. However, the LRT tends to perform conservatively and has low power in detecting association.
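As a sketch (working with log-likelihoods to avoid numerical underflow, and using the familiar 5% critical value 3.841 of the χ2 distribution with 1 d.f.):

#include <math.h>

/* Likelihood ratio test at one marker: log_Lx is the log marker likelihood
 * and n the number of tips, so the null log-likelihood is n*log(1/2).
 * Returns 1 if the marker is significant at level 0.05. */
int lrt_significant(double log_Lx, int n)
{
    double log_L0 = n * log(0.5);
    double stat = 2.0 * (log_Lx - log_L0);
    return stat > 3.841;
}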

Rather than using the χ2 distribution as the null distribution, we consider approximating the null distribution using permutations. We construct a set of permutations

of the phenotype data and recalculate the marker likelihoods. The likelihoods based

on the permuted data now serve as a null distribution against which to compare

the observed likelihood. Within this framework, we can compute both markerwise

and experimentwise p-values. The difference between these two p-values lies in the null distribution used.

For a markerwise test, the null distribution is simply the set of likelihoods calculated from the permuted phenotypes at that marker. The p-value is then determined by counting the number of likelihoods that exceed the observed likelihood L_x. In this case, the null distribution will vary by marker. This method controls the type I error rate within the individual marker tests. For an experimentwise test, the null distribution is the set of maximum likelihoods obtained for each permutation across all markers in the data. The p-value for each marker is now calculated by comparing the observed likelihood at each marker to this common null distribution. This method controls the type I error rate for the test across all markers [51, 33]. Essentially, the experimentwise null distribution makes it harder to reject the null hypothesis by taking the maximum of all permuted likelihoods. Regardless of which method we use to obtain a p-value, the final output of the algorithm is a p-value for each marker that indicates whether there is evidence of an association with the phenotype.
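Both p-values can be computed in one pass over a matrix of permuted likelihoods. A sketch with illustrative names, where perm[b][x] holds the likelihood at marker x recomputed under the b-th phenotype permutation:

#include <stdlib.h>

void permutation_pvalues(double **perm, const double *obs,
                         int nperm, int nmarkers,
                         double *p_marker, double *p_experiment)
{
    /* Experimentwise null: the maximum permuted likelihood per permutation. */
    double *permmax = malloc(nperm * sizeof(double));
    for (int b = 0; b < nperm; b++) {
        permmax[b] = perm[b][0];
        for (int x = 1; x < nmarkers; x++)
            if (perm[b][x] > permmax[b]) permmax[b] = perm[b][x];
    }

    for (int x = 0; x < nmarkers; x++) {
        int em = 0, ee = 0;
        for (int b = 0; b < nperm; b++) {
            if (perm[b][x] >= obs[x]) em++;   /* markerwise null       */
            if (permmax[b] >= obs[x]) ee++;   /* experimentwise null   */
        }
        p_marker[x] = (double)em / nperm;
        p_experiment[x] = (double)ee / nperm;
    }
    free(permmax);
}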

The main advantage of using the experimentwise p-value is that it controls the error rate by correcting for multiple testing. Within each chromosomal region, there are multiple markers being tested. Because of the correlation among these markers, standard methods of correcting for multiple testing (e.g., the Bonferroni method) may not be appropriate. McIntyre et al. (2000) have studied experimentwise tests based on permutations and have noted that these tests will never be less powerful than tests based on a standard Bonferroni correction. Moreover, when markers are highly correlated, the Bonferroni correction is greatly over-conservative.

2.4 Computer Implementation

As this algorithm estimates ARGs and then computes phenotype likelihoods on them, we refer to this method as ARGlik. We implement the ARGlik algorithm in a program written in C. We first present an outline of the steps of the algorithm:

Step 1: Read in entire SNP data.

Step 2: Read in a tree from the tree file.

Step 3: Estimate branch lengths.

Step 4: Compute likelihood on tree for a given penetrance set.

Step 5: Compute likelihood on tree for a given penetrance set and permutation.

Step 6: Repeat Step 5 for all permutations.

Step 7: Repeat Steps 4 and 5 for all penetrance sets.

Step 8: Repeat Steps 2-7 if there are more trees in the file; otherwise, go to Step 9.

Step 9: Compute p-values.

This program inputs the MARGARITA tree output file and the observed SNP data, and outputs marker likelihoods, markerwise p-values, and experimentwise p-values.

For phase-known (or externally phased) data, the SNP data input files may be formatted in one of two ways. The first is according to the specifications of MARGARITA input files and is shown in Figure 2.6. The second is according to the specifications of

TREELD input files and is shown in Figure 2.7. ARGlik can also input data phased within MARGARITA, which phases data as it builds each ARG. Thus, the data input files contain MARGARITA-formatted data for each ARG estimated. Similarly, if there is missing data, MARGARITA will infer the missing characters and ARGlik will input the resulting data files.

ARGlik internally sets the phenotype values. The default setting is that the first half of the sequences are cases and the second half are controls. If the sample contains unequal numbers of cases and controls, then the number of cases must also be set.

Permutations of the phenotype data are created within ARGlik.

Figure 2.6: SNP input file option according to MARGARITA formatting. The header line gives the number of case sequences, the number of control sequences, and the number of SNPs. The following seven lines are the positions (in base pairs) of the SNPs. Finally, the sequence data are given starting with the cases and then the controls. For diploid data, the chromosomes for an individual must follow one after the other.

Figure 2.7: SNP input file option according to TREELD formatting. The first line gives the marker positions (in base pairs). Individuals are separated by either a “1.0 2” or a “0.0 2”, where the 1.0 denotes a case, the 0.0 denotes a control, and the 2 means these are diploid organisms. An individual’s chromosomes are given in two consecutive lines with the SNPs coded as a 1 or a 2. These specific data represent two cases and two controls typed for seven markers.

Within the program there are many user-defined options that can be set. First, the branch lengths can be estimated either by parsimony or maximum likelihood (see

Section 2.2). The maximum likelihood method follows the molecular clock assumption and is more suitable for genetic data within one species, which is the type of data we have. Estimating branch lengths under maximum likelihood takes longer than parsimony. For larger samples (more than 1,000 individuals or 200 SNPs) it is recommended to use the parsimony option. Second, the priors from which the penetrances are sampled may be changed, along with the number of penetrances sampled.

Our results indicate that the number of penetrances sampled does not greatly affect results. However, it is recommended that at least 100 penetrance sets be generated for each analysis. In addition to the number of penetrances, the number of permutations for generating a null distribution may be set. As the number of permutations increases, the resolution of the p-values improves and we have a better estimate of the null distribution. When possible, at least 1,000 permutations should be performed.

2.5 Incorporating Covariates

For some diseases, it is not simply the genotypes that affect the phenotype status.

Sometimes a specific genotype coupled with another trait may increase disease susceptibility. Such traits, whether they be age or a genotype on another chromosome, can be introduced to association testing as model covariates. We propose a simple extension to the ARGlik method that allows for covariates to be included in the model.

Similar ideas have been used for pedigree-based linkage analysis (see [36]).

By making some adjustments to the penetrance model, the likelihood in Equation

2.8 is capable of including covariates. Rather than setting the penetrances to

reflect information about the phenotype given the mutation status only, the penetrances are set so that they use information about the mutation status coupled with

a covariate level. In general, under a model with m covariates, the covariate penetrances take on the form Pr(φ_a | j, c), where c is a vector of length m representing the covariates, φ_a is the phenotype of an arbitrary external node a, and j is the genotype.

The likelihood of Equation 2.8 remains the same, with the exception that p is the

vector of covariate penetrances.

As in the non-covariate case, penetrances are sampled according to priors. These

priors may be any valid probability distribution such that 0 < Pr(φ_a | j, c) < 1 for all

φ_a, j, and c. Additionally, for a given j and c, the penetrances must sum to one. For a binary phenotype, this means that Pr(0 | j, c) + Pr(1 | j, c) = 1. As in the non-covariate

case, when no prior knowledge is assumed about the penetrances, they are sampled

independently from a U(0.05, 0.95) prior. As the number of covariates increases, the number of distinct penetrance parameters also increases; for example, when there is one covariate with two levels, there are eight penetrances for a binary phenotype model. One may therefore want to increase the number of penetrance sets sampled so as to get a good representation of the parameter space.
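For one binary covariate and a binary phenotype, the eight penetrances fit naturally into a small table. A sketch of the default sampling under these constraints (layout and names ours), with pen[phi][j][c] = Pr(phenotype phi | j mutations, covariate c):

#include <stdlib.h>

static double runif(double lo, double hi)
{
    return lo + (hi - lo) * ((double)rand() / RAND_MAX);
}

void sample_covariate_penetrances(double pen[2][2][2])
{
    for (int j = 0; j < 2; j++)
        for (int c = 0; c < 2; c++) {
            pen[1][j][c] = runif(0.05, 0.95);    /* U(0.05, 0.95) prior    */
            pen[0][j][c] = 1.0 - pen[1][j][c];   /* penetrances sum to one */
        }
}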

Under this covariate model, we make certain assumptions on the covariate data.

The covariates for each individual are assumed to be known and can be coded as

discrete characters. Moreover, the covariates are independent of genotype but may

be associated with the phenotype. To include the covariate information in the ARGlik

program, we need to create a third input file. In addition to the SNP and tree input

files, the program reads in a covariate input file that contains a matrix of covariates

for all sequences. If there are m sequences and n covariates, the covariate input file

will contain an m × n matrix representing the covariates for each sequence.

The covariate model described here is a way of introducing extra information from

the sample that may help locate a disease variant. When including a covariate, the

goal of the model is to test whether there is an association between a phenotype and a

covariate/marker pair. The idea is that a specific combination of a covariate and a

marker’s genotype increases one’s risk of disease, and not solely the genotype. The

evidence of association is measured using the same p-values as in the standard models.

2.6 Model Summary

In this chapter, we have described a new method for ARG-based association testing. We have shown how to obtain a set of estimated trees from the case-control

SNP data. Using these trees, we calculate the likelihood of the phenotype data. This model incorporates uncertainty in the penetrances by sampling in the penetrance model. Significance testing is performed by permutation-based approximations to the null distribution. We discuss this model’s implementation in the C program ARGlik.

Finally, we provide an extension to the model that allows for covariate information to be included in the likelihood.

CHAPTER 3

SIMULATION STUDIES

In this chapter we evaluate ARGlik’s performance on simulated data. We simulate data under a variety of underlying population parameters and disease models. We

first discuss one-locus disease models and then look into more complicated models that include covariates. We compare the results of ARGlik with those of TREELD,

MARGARITA, and the standard χ2 approach adjusted for multiple testing.

3.1 One-Locus Disease Models

In the one-locus disease models, we assume there is one disease variant located in the simulated region of data. Cases and controls are assigned solely according to their genotypes at the given disease locus.

3.1.1 Data Simulation

Data simulation starts at the population level. We use the MS software of Hudson

(2002) to generate populations according to the coalescent-with-recombination model.

Quantities that need to be set to simulate the population are the recombination rate, mutation rate, number of sequences, and sequence length in base pairs (bp). Table

3.1 describes the settings used to generate the population. We choose three settings for the mutation and recombination rates. The first population has moderate levels

of recombination and mutation, the second population has high levels of mutation and low levels of recombination, and the third population has low levels of mutation and high levels of recombination. Finally, all populations contain 20,000 sequences simulated over a region spanning 1,000,000 base pairs, which corresponds to 10,000 diploid individuals.

Population   Size (sequences)   Length (bp)   Recombination Rate   Mutation Rate   Average SNP Count
1            20,000             1,000,000     10^−8                2 × 10^−10      85
2            20,000             1,000,000     0.5 × 10^−8          0.5 × 10^−9     170
3            20,000             1,000,000     1.5 × 10^−8          1.5 × 10^−10    70

Table 3.1: Three populations from which the samples are selected. The recombination rate represents the probability of recombination per generation between the ends of the simulated region. The mutation rate represents the neutral mutation rate per site.

To generate populations, the algorithm starts with 20,000 tips and creates the tree according to the coalescent model described in Equations 1.1 - 1.3. Once the tree topology is constructed, the sequences at the tips are determined by adding mutations to the tree. The ancestral state is assumed to be all zeros, and mutations are added to the branches according to a Poisson process driven by the user-defined mutation rate. The actual number of SNPs in each sequence is a direct effect of the mutation rate (see Table 3.1), so that the higher the mutation rate, the longer the sequences will be. The recombination rate affects how probable it is for any one sequence to incur a recombination event but does not have an effect on the sequence length.

Once a population is generated, we sample cases and controls according to a disease model (see Table 3.2) at a disease locus randomly chosen among the simulated

SNPs. We require that the mutation allele frequency at the disease locus be between

0.1 and 0.2. If the randomly chosen locus does not satisfy this criterion, another locus

is chosen at random until the mutation allele frequency is satisfactory. The disease

locus is used for selecting cases and controls, but then is omitted from the data before

running any analyses. It is unlikely that the true disease variant will be in any real

data set, and so to keep it in a simulation would positively skew the results.

Setting   Population   Replicates   Cases/Controls   Disease Model       Relative Risk
1         1            100          100/100          no disease          NA
2         1            100          100/100          0.544, 0.05, 0.05   1.801
3         2            100          100/100          0.544, 0.05, 0.05   1.801
4         3            100          100/100          0.544, 0.05, 0.05   1.801
5         1            100          100/100          0.18, 0.18, 0.1     1.8
6         1            100          100/100          0.235, 0.05, 0.05   1.3

Table 3.2: Sampling procedure for simulation study under a one-locus disease model. The disease model gives the probabilities of disease given that an individual is homozygous for the mutation, heterozygous, and homozygous for the wild-type allele, respectively.

The disease model is specified by the probabilities of inheriting the disease given

one’s genotype at the disease locus, namely, 00, 01, or 11. Let P_ij represent the

probability of having the disease given the genotype ij. If there are n chromosomes

with allele 0 at the disease locus and m with allele 1 at the disease locus, then we sample cases according to the following model:

Pr(11 | case) = P_11 ∗ m ∗ (m − 1) / [P_11 ∗ m ∗ (m − 1) + 2 ∗ P_01 ∗ m ∗ n + P_00 ∗ n ∗ (n − 1)]   (3.1)

Pr(01 | case) = 2 ∗ P_01 ∗ m ∗ n / [P_11 ∗ m ∗ (m − 1) + 2 ∗ P_01 ∗ m ∗ n + P_00 ∗ n ∗ (n − 1)]   (3.2)

Pr(00 | case) = 1 − [Pr(11 | case) + Pr(01 | case)].   (3.3)

If, for example, it is determined that the currently sampled case should have genotype 11 then two randomly chosen sequences with a 1 at the disease locus are taken to be that individual’s chromosomes. The two sampled sequences are then removed from the population and the process continues until all cases are selected.

The controls are sampled similarly, substituting (1 − P_ij) for P_ij in Equations 3.1

- 3.3. In the samples where there is no disease simulated (see Table 3.2), cases and controls are randomly assigned and there is no disease locus.
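A sketch of the case-genotype draw implied by Equations 3.1-3.3, using the unnormalized weights from the numerators (names ours; rand() stands in for the simulator's actual generator):

#include <stdlib.h>

/* Draw the genotype of one sampled case from Equations 3.1-3.3. p11, p01,
 * p00 are the disease-model penetrances, m and n the counts of chromosomes
 * carrying allele 1 and allele 0. Returns 2 for genotype 11, 1 for 01,
 * and 0 for 00. */
int sample_case_genotype(double p11, double p01, double p00,
                         long m, long n)
{
    double w11 = p11 * m * (m - 1);
    double w01 = 2.0 * p01 * m * n;
    double w00 = p00 * n * (n - 1);
    double u = ((double)rand() / RAND_MAX) * (w11 + w01 + w00);

    if (u < w11) return 2;
    if (u < w11 + w01) return 1;
    return 0;
}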

In this manner, we sampled data sets according to the models in Table 3.2. To

compare performance across a range of disease models, we simulated settings 1, 2, 5,

and 6 from the same population settings. We also are interested in the effects of the

underlying population, and so settings 1, 3, and 4 vary the population parameters

while keeping the disease model fixed. For each of these settings, we simulated 100

replicates, where each replicate was sampled from a newly generated population. The inclusion of a no disease model (setting 1) allows us to compare the type I

error rate among the methods.

Note that setting 3 samples from the population with the highest mutation rate.

This results in more than twice as many SNPs as the other two population settings.

Thus, for computational efficiency in simulations, once each replicate in setting 3 was

selected, we cut the number of SNPs in half by keeping the half of the chromosome that surrounded the true disease locus. The main reason for this adjustment is that the high number of SNPs in the full data caused the programs to run too long, which would not permit all 100 replicates to finish. It was important to keep this data set in the analysis because it gives us an idea of how ARGlik performs on a very dense set of SNPs. To compute the relative risks in Table 3.2, we used the following:

RR = Pr(disease | exposure) / Pr(disease | no exposure).   (3.4)

We define ‘exposure’ to be an individual with at least one mutation at the disease locus. In our model’s notation, this gives us Pr(disease | exposure) = Pr(disease | 11 or 01) and Pr(disease | no exposure) = Pr(disease | 00). To carry out these calculations, we need Pr(1) and Pr(0). All disease loci have a minor allele frequency between 0.1 and 0.2, and so we set Pr(1) = 0.15 and Pr(0) = 0.85 to get an average calculation of relative risk. The relative risk is a commonly used measure in genetic epidemiology that indicates the difficulty of the problem of detecting association. Typically, relative risk is calculated based on information from a particular sample, because the penetrance of the disease in the population is unknown. In our case, we sample according to predetermined population penetrances of the disease.

To accurately classify the relative risk we are using, we may also refer to it as a

“population-based” relative risk as opposed to the more common “sample-based” relative risk. Regardless of the manner in which it is calculated, the closer the relative

risk is to 1, the harder it is to detect association. The relative risks we modeled here

range from moderately hard (settings 2-5) to hard (setting 6) problems.
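As a worked check on Table 3.2 (a calculation of our own, assuming the genotype frequencies follow from the allele frequencies as Pr(11) = Pr(1)^2 = 0.0225 and Pr(01) = 2 Pr(1) Pr(0) = 0.255), the setting 2 model (0.544, 0.05, 0.05) gives

Pr(disease | exposure) = [0.544 ∗ 0.0225 + 0.05 ∗ 0.255] / (0.0225 + 0.255) = 0.02499 / 0.2775 ≈ 0.0901,

so that RR ≈ 0.0901 / 0.05 ≈ 1.801, matching the value reported in the table.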

3.1.2 Method Settings

For each of the methods in our simulation, certain options must be set before

running each analysis. In this section we will discuss the settings used for each of the

four methods we consider (standard χ2, MARGARITA, TREELD, and ARGlik).

The standard χ2 approach is straightforward in its calculation. The purpose of

the method is to test for independence between a phenotype and a SNP, one marker

at a time. For case-control studies with a binary genotype, the test statistic for each

marker is calculated from a two-by-two contingency table. The contingency table

is constructed for each marker and contains the observed counts of the phenotype-

genotype combinations in the data set at that position.

To compute the test statistic, we also need the expected values. To obtain these, we look at the proportion of genotypes 0 and 1 in the sample at the marker of interest.

Under no association, we expect to see these same proportions within the cases and controls. For example, if 40% of the sample has a 0 at the marker, then we would expect 40% of the cases to have a 0 and 40% of the controls to have a 0.

The χ2 test is then used to see if the observed values deviate unreasonably far from the expected values. Given the observed and expected data, the test statistic X^2 is

calculated according to Equation 3.5, where E_{i,j} is the expected number of sequences with genotype i and phenotype j and O_{i,j} is the observed number of sequences with genotype i and phenotype j.

X^2 = (O_{0,0} − E_{0,0})^2 / E_{0,0} + (O_{0,1} − E_{0,1})^2 / E_{0,1} + (O_{1,0} − E_{1,0})^2 / E_{1,0} + (O_{1,1} − E_{1,1})^2 / E_{1,1}   (3.5)

Under the standard approach, the p-value is calculated by comparing the test statistic to a χ2 percentile with 1 d.f. However, this method does not account for multiple testing and resulted in false positive rates near 80% for our data. To adjust for multiple testing, we permute the phenotypes and re-calculate the test statistic. Doing this multiple times gives us an estimate of the null distribution and is similar to the experimentwise p-value approach of ARGlik described in Section

2.3.3.
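In C, the whole test is a few lines. A sketch (names ours) that derives the expected counts from the marginal proportions, exactly as described above:

/* Chi-square statistic for a 2x2 genotype-by-phenotype table (Equation
 * 3.5). obs[i][j] is the observed count of sequences with genotype i and
 * phenotype j; expected counts come from the marginal proportions. */
double chisq_2x2(const double obs[2][2])
{
    double row[2] = { obs[0][0] + obs[0][1], obs[1][0] + obs[1][1] };
    double col[2] = { obs[0][0] + obs[1][0], obs[0][1] + obs[1][1] };
    double total = row[0] + row[1];
    double x2 = 0.0;

    for (int i = 0; i < 2; i++)
        for (int j = 0; j < 2; j++) {
            double e = row[i] * col[j] / total;   /* expected under no assoc. */
            double d = obs[i][j] - e;
            x2 += d * d / e;
        }
    return x2;
}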

For the MARGARITA method, the authors recommend generating between 30 and 100 ARGs for each data set. We chose to sample 50 ARGs, which is a good compromise between keeping the runtime low and extracting adequate information from the genotype data. Note that the only way the genotype data enter the association testing is through the trees. To calculate its markerwise p-values, MARGARITA uses permutation-based testing; we set the number of permutations to 1,000.

The TREELD method also produces markerwise p-values, but these are obtained at equally-spaced focal points within the data region and not at the actual marker locations. Based on simulations the authors ran in their paper [51], we chose 50 focal points for each data set. Rather than sampling the penetrances from priors, TREELD uses penetrances equally-spaced within the closed interval [0.05, 0.95], where the user determines how many penetrances to select. To obtain a good representation of the space, we used the maximum number of penetrances allowed, which is nineteen for P_{1|1} and nineteen for P_{1|0}. This results in a total of 19^2 = 361 penetrance sets. TREELD produces trees using an MCMC algorithm. After a suitable burn-in, the algorithm

sampled 50 trees for each focal point. Finally, to keep the methods consistent, we set

the number of permutations used to determine the markerwise p-values to 1,000.

The ARGlik method has more options in terms of the likelihood model than the

previous methods. We chose to sample 100 penetrance sets from U(0.05, 0.95) priors.

In practice we found that penetrances close to 0 and 1 resulted in likelihoods that were indistinguishable from 0. Therefore, based on the interval guidelines for TREELD, we set the endpoints of the priors to be 0.05 and 0.95. Moreover, we tried increasing the number of penetrances to 250 but it did not change the results noticeably and increased the program’s runtime. For branch length estimation, we used the maximum likelihood approach described in Section 2.2. Finally, we ran 1,000 permutations for the p-value calculations.

3.1.3 Results

In this section we present the results of the simulation runs from the χ2 method,

MARGARITA, TREELD, and ARGlik. The χ2 and ARGlik experimentwise results are based on the correction for multiple testing. The MARGARITA, TREELD, and

ARGlik markerwise results are based on markerwise permutation testing. To distinguish between the two p-values from ARGlik, we will denote the markerwise p-value results by “ARGlik-m” and the experimentwise p-value results by “ARGlik-e”.

We first present the results of setting 1, for which cases and controls were randomly assigned (i.e., the data do not contain a disease locus). The experimentwise tests naturally control the type I error rate due to the manner in which the null distributions are formed. For the markerwise permutation-based testing, we also want all tests to

yield an experimentwise type I error rate of 0.05. Before checking the overall error rate, we first confirmed that the error rate, per marker, is no larger than 0.05. To do this, we counted the number of signals across the 100 replicates, one marker at a time. Here, a “signal” refers to a marker with a p-value less than 0.05. We did this

for the first 40 data markers for the ARGlik-m and MARGARITA results, with the

results given in Figure 3.1. We see that both methods tend to have a type I error

rate around 0.05 when we classify a marker as significant if its p-value is less than

0.05.

Returning to the experimentwise error rate, we want the number of falsely significant replicates to be close to five out of 100. Running all markerwise methods with

a p-value cut-off of 0.05, we found that ARGlik-m had a type I error rate of 0.19

and MARGARITA had a type I error rate of 0.25 (see Figure 3.2). Thus, we need to

adjust our significance cut-off value in order to obtain the desired type I error rate

of 0.05 for these markerwise methods. For the TREELD method, we do not attempt

to make any adjustments as there is not enough information in the 25 replicates to

make a sound adjustment. By re-running MARGARITA and ARGlik-m with different cut-off values, we find that a MARGARITA cut-off of 0.0025 and an ARGlik-m

cut-off of 0.03 both yield experimentwise type I error rates of 0.05 (see Figure 3.3).

Under these adjusted significance cut-off values, all tests can be fairly compared at

level 0.05. To see the effects of making this adjustment to the significance testing,

we present results from both analyses for all methods. We do not adjust ARGlik-e or

the χ2 method because we are already doing an experimentwise adjustment.

Now that we have demonstrated that the error rate is controlled, we refer back

to setting 1. Figures 3.2 and 3.3 show the results of this setting for all methods.

Figure 3.1: Type I error rate for MARGARITA (denoted by an “M”) and ARGlik (denoted by an “A”) for the first 40 markers. The horizontal line is at the desired type I error rate of 0.05. [Scatter plot: marker (0-40) on the horizontal axis, type I error rate (0.00-0.08) on the vertical axis.]

TREELD is omitted from this display, but we report that the number of false signals

out of the 25 replicates is 15. In this setting, it is better to have the number of signals

close to five, which corresponds to a type I error rate of 0.05. The type I error rates of both the ARGlik-e and χ2 method are not exactly at 0.05, but we did not choose

to adjust them because this could be attributed to random error as we only ran 100

replicates. Note that if we did make adjustments to those methods, then ARGlik-e would possibly gain significant signals and the χ2 method might lose significant

signals in the data where there is a disease simulated.

Next, we present the results of the samples where we assume the data span a region

where there is a known disease variant. Table 3.3 gives the results of all methods.

Note that the TREELD results are out of 25 replicates only. The computing time for

TREELD is so great that it did not allow us to perform all 100 replicates. In looking

at these results, note that the higher the proportion of significant replicates, the

better the method is performing. As we can see from the results, across all samples,

ARGlik-m is slightly outperformed by MARGARITA. Moreover, ARGlik-e does not

improve much upon the adjusted χ2 method. For this limited number of replicates,

TREELD has high levels of significance as well. However, referring back to setting 1,

TREELD also detected many falsely significant replicates.

Setting   TREELD*   MARGARITA   ARGlik-m   χ2     ARGlik-e
2         .80       .52         .45        .23    .30
3         .88       .65         .61        .51    .46
4         .68       .48         .43        .20    .21
5         .60       .26         .18        .06    .06

Table 3.3: Results of simulations with a disease: settings 2, 3, 4, and 5. The entries represent the proportion of significant replicates out of 100 (*except TREELD, which is out of 25 replicates).

Figure 3.2: Number of significant replicates out of 100 for setting 1, without adjustment to control error rate. [Bar chart over the methods A-m, MARG, A-e, and χ2; vertical axis: significant replicates (0-30).]

Figure 3.3: Number of significant replicates out of 100 for setting 1, after adjustment to control error rate for A-m and MARG. [Bar chart over the methods A-m, MARG, A-e, and χ2; vertical axis: significant replicates (0-14).]

We also present the results of each sample in Figures 3.4 and 3.5, omitting

TREELD as it is not out of 100 replicates like the other methods. All methods perform best for setting 3. Recall that this is the dense mapping of SNPs contained on half the simulated region. As compared to the other sample settings, we have data from very “close” to the disease, resulting in more information regarding association.

Alternatively, setting 4 samples from the population with the lowest mutation rate, which results in a more sparse mapping of SNPs in the region. This means there is less information for the tree building process. Because of this SNP structure, setting 4 has lower levels of significance as compared to settings 2 and 3.

All methods perform worst on setting 5, which is the dominant disease model.

Although it has the same relative risk as the recessive model used, there is some logic behind why this model is harder for the coalescent-based methods. Under the dominant model, those with genotype 11 have the same chance of having the disease as those with genotype 10. This introduces more individual sequences with genotype

0 and phenotype 1, which makes it harder for the cases to cluster on the tips of the tree.

When we control for the error rate of the markerwise methods, we see that ARGlik-m performs much better than MARGARITA. Before the adjustment, the results were skewed by the fact that MARGARITA had a much higher type I error rate than ARGlik-m, making it appear that it found more signals than it did in reality. We note that this pattern continues for the remaining settings we simulated, as well.

Figure 3.4: Number of significant replicates out of 100: (a) Setting 2; (b) Setting 3; (c) Setting 4; (d) Setting 5. [Four bar charts over the methods A-m, MARG, A-e, and χ2; vertical axes: significant replicates (0-70).]

Figure 3.5: Number of significant replicates out of 100, with adjustment to control error rate for A-m and MARG: (a) Setting 2; (b) Setting 3; (c) Setting 4; (d) Setting 5. [Four bar charts over the methods A-m, MARG, A-e, and χ2; vertical axes: significant replicates (0-50).]

We are not only interested in checking ARGlik’s performance against that of other methods, but also in determining the types of data sets for which ARGlik performs better. Figure 3.6 shows the number of significant replicates of the ARGlik method across samples (we note that these are the before-adjustment results, but the conclusions drawn here remain valid on the results after the error rate adjustment, as well). ARGlik performs best when the SNPs are more dense in setting 3. Moreover, the recessive models make it much easier for ARGlik to detect signals of association.

To improve the performance on the dominant diseases, adjusting the priors on the penetrances may help. For example, P_{1|0} should be higher under a dominant model than under a recessive one.

As TREELD is computationally intensive, we no longer include this method in the remainder of the simulation results. To check the performance of ARGlik on harder disease models, we look at the results of setting 6 (see Table 3.2), which was simulated according to a recessive disease model with a relative risk of 1.3. Figures

3.7 and 3.8 show results of ARGlik-m, ARGlik-e, MARGARITA, and the χ2 test for the 100 replicates. As compared to the recessive disease models with relative risk

1.801, all methods perform worse. This is to be expected. ARGlik-m finds many more signals in the data than MARGARITA. Similarly, ARGlik-e outperforms the

χ2 test, finding more than twice the number of signals.

3.1.4 Unphased Data

Up to this point, we have assumed that the data are already phased. Additionally, we looked at the performance of ARGlik when the data are unphased. We used the same 100 replicates of settings 1 and 2, but we did not assume that the phase of

Figure 3.6: Number of significant replicates out of 100 for all samples under the ARGlik method, without adjustment. [Bar chart comparing ARGlik-m and ARGlik-e across settings 2-5; vertical axis: significant replicates (0-70).]

Figure 3.7: Number of significant replicates out of 100 for setting 6 (relative risk of 1.3), without adjustment. [Bar chart over the methods A-m, MARG, A-e, and χ2; vertical axis: significant replicates (0-35).]

Figure 3.8: Number of significant replicates out of 100 for setting 6, after adjustment to control error rate. [Bar chart over the methods A-m, MARG, A-e, and χ2; vertical axis: significant replicates (0-20).]

the data was known. For the markers where an individual was heterozygous, we

replaced both SNPs with a U in the MARGARITA input data files indicating that the genotype is unknown. Since MARGARITA phases the data as it builds each

ARG, the resulting phased data used for ARGlik consist of 50 different data sets, each phased according to one of the 50 estimated trees. Because of this phased data structure, ARGlik must perform the branch length estimation separately for each tree. Also note that we did not adjust the markerwise methods to control for error rate in this section.

To compare the results of ARGlik on the phased data, we also ran the phase-unknown data through MARGARITA to obtain p-values. TREELD does not handle unphased data and so we omitted that method from the analysis. The χ2 test does not use the phase information because it is a marker-by-marker test, and therefore the results from the previous runs on settings 1 and 2 remain the same for phase-unknown data. Table 3.4 shows the results of settings 1 and 2 when the data are unphased.

For both settings, the unphased data yields much higher levels of significance. (See

Figures 3.9 and 3.10 for before and after comparisons of the two settings.) This difference in power is most likely due to the manner in which the data are phased.

The data are phased as the tree is constructed, which results in more clustering of the similar genotypes, regardless of the setting’s underlying disease model. For data in which there is a genetic association with the disease, it makes sense that this genotype clustering is giving higher levels of significance. However, for setting 1 there is no association with a disease and it is unclear why MARGARITA’s phasing of the data results in much more significance. Even when the genotypes are highly clustered

on the tree, the phenotypes should still remain randomly scattered. Based on these

results, it is recommended to use an alternative method of phasing.

Sample   ARGlik-m   ARGlik-e   MARGARITA
1        54         31         36
2        88         66         66

Table 3.4: Results of simulations on phase-unknown data. Entries represent the number of significant replicates out of 100.

3.1.5 Computing Time

Finally, we note the computer runtime needed for each program on the one-locus disease models with 100 cases and 100 controls. All of the times are based on runs made on the Ohio Supercomputer Center’s e1350 cluster:

http://www.osc.edu/supercomputing/hardware/.

The reported values are meant to serve as an approximation for each data size, as the runtime also varies with the structure of the data set. For these smaller data sets, MARGARITA is a very fast algorithm. Since both TREELD and

ARGlik compute full likelihoods on the trees, they take longer to run than MARGARITA. The number of SNPs has a big effect on the runtime for these programs, and we present some comparisons in Table 3.5. We note that the runtime for ARGlik is less than TREELD for smaller numbers of SNPs but is longer than TREELD for larger numbers of SNPs. The reason for this is that all TREELD runs were conducted with 50 focal points, regardless of the number of SNPs in the data. For every

Figure 3.9: Simulation results of setting 1 for both phased and unphased data, without adjustment to markerwise methods. [Grouped bar chart: phased vs. unphased counts for A-m, A-e, and MARG; vertical axis: significant replicates (0-70).]

Figure 3.10: Simulation results of setting 2 for both phased and unphased data, without adjustment to markerwise methods. [Grouped bar chart: phased vs. unphased counts for A-m, A-e, and MARG; vertical axis: significant replicates (0-100).]

additional SNP, ARGlik must perform additional branch length estimations as well

as likelihood calculations on all marginal trees.

Method      50 SNPs   60 SNPs   70 SNPs   80 SNPs   90 SNPs
ARGlik      3,000     4,000     5,000     6,000     7,000
TREELD      4,000     4,250     4,500     4,750     5,000
MARGARITA   2.5       3.0       3.5       4         4.75

Table 3.5: Runtimes for the one-locus models in minutes.

3.2 Covariate Models

In the covariate models, we still assume there is one disease variant in the region

of interest. However, cases and controls are no longer selected based solely on their

genotype at the disease locus. Now, they are selected based on a combination of

their genotype at the disease location and the value of an additional variable that

influences the risk of disease.

3.2.1 Data Simulation

In this section we are interested in the performance of ARGlik for covariate models.

The covariates come into play when we sample the data rather than when we create

the initial populations. Because of this we sampled all of the following covariate data

sets from Population 1 described in Section 3.1.

For the covariate models simulated, we assume no relationship between the covariate level and the genetics of the individual. While our model is capable of handling multi-level covariates, we chose to simulate a binary covariate for all models described. For each of the 20,000 sequences simulated for the population, we generated

a random number r between 0 and 1. The covariate assignment rule is as follows: if r < 0.5, then the covariate is 0; otherwise, the covariate is 1.

After a disease locus is found that satisfies the criterion that the mutation allele frequency is between 0.1 and 0.2, we subset the data in terms of the covariate and the genotype at the disease locus. This gives us four subsets of the data: (1) covariate=0, genotype=0; (2) covariate=1, genotype=0; (3) covariate=0, genotype=1;

(4) covariate=1, genotype=1. Based on the penetrance model, we select the cases and controls by sampling from the appropriate subset. We will describe the manner in which cases are sampled, noting that controls are sampled similarly. To assign a case a specific (genotype, covariate) pair, we compute P (ijc|case), where ij denotes the genotype and c is the specific covariate level. When there are q sequences in subset

(1), r sequences in subset (2), s sequences in subset (3), and t sequences in subset

(4), we sample according to the following equations:

\begin{align}
\Pr(111 \mid \text{case}) &= \frac{P_{111}\, t(t-1)}{D} \tag{3.6} \\
\Pr(101 \mid \text{case}) &= \frac{2\, P_{101}\, t\, r}{D} \tag{3.7} \\
\Pr(001 \mid \text{case}) &= \frac{P_{001}\, r(r-1)}{D} \tag{3.8} \\
\Pr(110 \mid \text{case}) &= \frac{P_{110}\, s(s-1)}{D} \tag{3.9} \\
\Pr(100 \mid \text{case}) &= \frac{2\, P_{100}\, s\, q}{D} \tag{3.10} \\
\Pr(000 \mid \text{case}) &= 1 - \big[\Pr(111 \mid \text{case}) + \Pr(101 \mid \text{case}) + \Pr(001 \mid \text{case}) \nonumber \\
&\qquad\quad + \Pr(110 \mid \text{case}) + \Pr(100 \mid \text{case})\big], \tag{3.11}
\end{align}

where

\begin{align}
D = {}& P_{111}\, t(t-1) + 2\, P_{101}\, t\, r + P_{001}\, r(r-1) \nonumber \\
& + P_{110}\, s(s-1) + 2\, P_{100}\, s\, q + P_{000}\, q(q-1). \tag{3.12}
\end{align}
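As a concrete illustration of Equations 3.6–3.12, the full sampling distribution for a case's (genotype, covariate) configuration can be computed directly from the subset counts and the penetrances. The sketch below uses illustrative names and is not part of the ARGlik source; P is a named vector of penetrances with, e.g., P["111"] = P(disease | genotype 11, covariate 1):

# Pr(genotype/covariate combination | case), Equations 3.6-3.12.
# q, r, s, t are the sizes of subsets (1)-(4) defined above.
case.probs <- function(P, q, r, s, t) {
  w <- c("111" = P[["111"]] * t * (t - 1),
         "101" = 2 * P[["101"]] * t * r,
         "001" = P[["001"]] * r * (r - 1),
         "110" = P[["110"]] * s * (s - 1),
         "100" = 2 * P[["100"]] * s * q,
         "000" = P[["000"]] * q * (q - 1))
  w / sum(w)   # sum(w) is the normalizing constant D of Equation 3.12
}

For example, under the "Cov 2" penetrances of Table 3.6, case.probs(c("111" = 0.575, "101" = 0.15, "001" = 0.10, "110" = 0.544, "100" = 0.05, "000" = 0.05), q, r, s, t) returns the six probabilities, which sum to 1.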

Our penetrance models need to reflect the assumption that there is an association between the disease and both the genotype and the covariate. We chose three different models that range from hard to easy in terms of the covariate relative risk (see Table 3.6).

Setting   Replicates   Cases/Controls   Disease Model       Disease Model       Relative Risk
                                        (Covariate = 1)     (Covariate = 0)
Cov 1     100          100/100          0.75, 0.05, 0.05    0.20, 0.05, 0.05    1.2994
Cov 2     100          100/100          0.575, 0.15, 0.10   0.544, 0.05, 0.05   2.3237
Cov 3     100          100/100          0.644, 0.25, 0.25   0.544, 0.05, 0.05   4.4439

Table 3.6: Covariate settings used in the simulation studies. The disease model gives the probabilities of disease given that an individual is homozygous for the mutation, heterozygous, and homozygous for the wild-type allele, respectively.

The first model reflects a covariate that increases one's risk only if the corresponding genotype is 11. The idea behind the third covariate sample is that having covariate level 1 only slightly increases one's risk of the disease for genotype 11, but greatly affects the risk when the genotype is either 10 or 00. The second model reflects a covariate that has more moderate effects on all three genotypes.

To compute the relative risks, we follow the standard procedure of Equation 3.4.

Our definition of exposure, however, needs to be adjusted to incorporate the covariate levels as well as the genotypes. In the binary covariate setting with levels 0 and 1, we define exposure as carrying the mutant allele on at least one chromosome or having covariate level 1. Under this definition, the genotype/covariate combinations resulting in exposure are 111, 110, 101, 100, and 001.
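Assuming Equation 3.4 is the usual ratio Pr(disease | exposed)/Pr(disease | unexposed), the relative risk under this exposure definition could be computed as in the sketch below (illustrative names only; pen holds the penetrances and p.combo the population frequencies of the genotype/covariate combinations):

# Relative risk with exposure = {111, 110, 101, 100, 001}.
relative.risk <- function(pen, p.combo) {
  exposed   <- c("111", "110", "101", "100", "001")
  unexposed <- "000"
  p.d.exp   <- sum(pen[exposed] * p.combo[exposed]) / sum(p.combo[exposed])
  p.d.unexp <- sum(pen[unexposed] * p.combo[unexposed]) / sum(p.combo[unexposed])
  p.d.exp / p.d.unexp
}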

3.2.2 Method Settings

None of the other methods of Section 3.1.2 is capable of incorporating the covariate information. For comparison purposes, we ran these data through MARGARITA as well, noting that it is at a natural disadvantage relative to ARGlik. At the time of our analyses, no other coalescent-based program was able to include covariate information. For the MARGARITA runs, we kept the same settings as used in the one-locus models (see Section 3.1.2).

For the ARGlik program, we sampled 100 penetrance sets according to a U(0.05, 0.95) prior. Although we know that covariate level 1 increases the risk, we did not want to positively skew the results by including this information in the penetrance priors. For a sequence with covariate level c, genotype j, and phenotype φ, the penetrance is given by Pr(φ | j, c) = P_{φ|j,c}. Further, for each marker we sampled 50 trees. In our calculations of the p-values, we generated an approximate null distribution from 1,000 permutations of the phenotype data.
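The permutation step itself is simple once a test statistic is in hand. A minimal sketch, where test.stat is a hypothetical stand-in for the ARGlik likelihood-based statistic evaluated for a given phenotype labeling, and larger values are taken to indicate stronger association:

# Approximate permutation p-value from B random relabelings.
perm.pvalue <- function(phenotypes, test.stat, B = 1000) {
  obs  <- test.stat(phenotypes)
  null <- replicate(B, test.stat(sample(phenotypes)))  # permuted labels
  mean(null >= obs)  # fraction of permuted statistics at least as extreme
}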

[Figure: “Covariate Results” — bar chart of significant replicates (0–80) for ARGlik-m, ARGlik-e, and MARGARITA across samples Cov 1, Cov 2, and Cov 3.]

Figure 3.11: Number of significant replicates out of 100 for all three covariate samples across methods.

3.2.3 Results

In this section we review the results of ARGlik on data with a simulated covariate. We also compare results to MARGARITA, which does not take covariate information into account. Figures 3.11 and 3.12 show the results of the three covariate models for both methods.

[Figure: “Covariate Results” — bar chart of significant replicates (0–100) for ARGlik-m, ARGlik-e, and MARGARITA across samples Cov 1, Cov 2, and Cov 3.]

Figure 3.12: Number of significant replicates out of 100 for all three covariate samples across methods. ARGlik-m and MARGARITA have been adjusted to control error rate on the experimentwise level.

Recall that the sample “Cov 1” has the smallest relative risk and “Cov 3” has the largest. As the covariate relative risk increases and detection becomes easier, ARGlik is able to use more information than MARGARITA, allowing it to greatly outperform MARGARITA in terms of power. For the results of Figure 3.11, where there is no adjustment to control error rate, MARGARITA and ARGlik-m have very similar power for “Cov 1”, with ARGlik-m finding a few more significant data sets than MARGARITA. We can already see the covariate information playing a role in the results, since ARGlik-m did not perform as well as MARGARITA for any of the standard one-locus disease models in terms of power (see Table 3.3). For “Cov 3”, both ARGlik-m and ARGlik-e achieve a power of 97%, compared to just 23% for MARGARITA. “Cov 2” serves as the turning point at which both ARGlik-m and ARGlik-e begin to outperform MARGARITA.

For the results of Figure 3.12, where we do control error rate for the markerwise

methods, we see that ARGlik-m and ARGlik-e achieve a much higher power than

MARGARITA. ARGlik-m and ARGlik-e perform similarly, achieving the same power

for the easiest relative risk of “Cov 3”.

3.3 Discussion and Summary

From these simulations we are able to test the performance of our proposed ARG-based association method, ARGlik. We first note that the heuristic method used to generate the ARGs performs as well as the full MCMC-based method of TREELD. It is computationally faster and produces trees that help to correctly identify signals in the data. When looking at the one-locus disease models in Section 3.1, ARGlik-m does not achieve the highest rates of significance as compared to the other methods.

ARGlik-e, on the other hand, finds significance in at least as many data sets as the χ2 test in all but one setting. For the no-disease data, both ARGlik-m and ARGlik-e have the lowest type I error rates of the markerwise and experimentwise tests, respectively. We also see that ARGlik performs best on SNPs that are more densely mapped around the disease locus. Further, under the uninformative priors used in these simulations, the recessive disease models have the highest rates of significance.

In Section 3.2, we have shown that ARGlik performs well when covariate data are incorporated. When comparing ARGlik to MARGARITA, which does not take into account covariate information, we find that as the covariate information becomes more important in affecting disease status, ARGlik greatly outperforms MARGARITA.

CHAPTER 4

REAL DATA APPLICATION

In addition to simulation studies, we ran ARGlik on a data set obtained from

The Wellcome Trust Case Control Consortium (2007) concerning type 1 diabetes. In this chapter we discuss the data in detail, along with the performance of ARGlik on these data. We chose chromosomes with prior evidence of an association, as well as a chromosome with no known association, in order to check false positive rates.

4.1 Type 1 Diabetes Data

The type 1 diabetes (T1D) data used in this analysis are from The Wellcome

Trust Case Control Consortium (WTCCC) (2007). We selected three chromosomes to analyze, each chosen for a different purpose. The main goals in analyzing these data are to replicate the results of past studies and possibly to find new associations.

To do this, we selected regions of chromosomes 12, 2, and 19. Chromosome 12 showed high levels of significance both in previous studies and in WTCCC (2007), specifically around the 12q24 location. This association has since been replicated by Todd et al. (2007).

Chromosome 2 had strong previous evidence of an association around 2q33 [4], but WTCCC (2007) reported only less-than-moderate evidence of a signal in this region.

Chromosome 19 does not contain any region with a known association with T1D.

The entirety of the T1D data consists of 2,938 controls and 1,963 cases. The controls are a combination of two groups: individuals from the 1958 British Birth Cohort and blood donors recruited for this project. The cases were selected from a total of 8,000 cases recruited from diabetes clinics across the mainland United Kingdom. To ensure that the cases were actually T1D and not later-onset type 2 diabetes, all cases were required to have been diagnosed before 17 years of age.

Chr   Region (Mb)     Number of SNPs   Cases/Controls
12    110.39–111.58   101              80/120
12    110.39–111.58   101              160/240
12    110.39–111.58   141              200/300
12    110.39–111.58   101              400/600
2     204.27–204.83   101              80/120
2     204.27–204.83   101              160/240
2     204.27–204.83   101              200/300
19    22.91–23.75     101              80/120
19    22.91–23.75     101              160/240
19    22.91–23.75     101              200/300

Table 4.1: Samples used for each of the three chromosomes.

Both cases and controls are typed for thousands of SNPs along each chromosome.

Specifically, Chromosome 12 contains 23,418 SNPs, Chromosome 2 contains 38,867

SNPs, and Chromosome 19 contains 5,942 SNPs. To run a coalescent-based analysis

on a data set this large is virtually impossible in terms of runtime. Moreover, the

likelihoods computed on such large trees would shrink to zero rapidly. To work in

a more reasonable setting, we selected regions centered around the highest signals

and sampled from the pool of cases and controls. In chromosome 19, where there is no known association, we selected a region at random to analyze.

In all samples, we preserved the 2:3 ratio of cases:controls contained in the entire

sample. Table 4.1 details the samples used for all three chromosomes.
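A subsample preserving this ratio can be drawn directly; a sketch with hypothetical names (case.ids and control.ids index the full WTCCC pools of 1,963 cases and 2,938 controls):

# Random subsample with cases:controls fixed at 2:3.
subsample <- function(case.ids, control.ids, n.cases) {
  n.controls <- round(1.5 * n.cases)  # 2:3 ratio
  list(cases    = sample(case.ids, n.cases),
       controls = sample(control.ids, n.controls))
}
subsample(1:1963, 1:2938, 200)  # e.g., the 200/300 samples of Table 4.1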

For all three chromosomes we have samples containing 200, 400, and 500 individ-

uals. This corresponds to trees with 400, 800, and 1,000 tips, respectively. We also

sampled 1,000 individuals for chromosome 12 to see how ARGlik performs for large

numbers of sequences. For the smaller samples (200 and 400 individuals), the ARGlik

analyses were performed using the maximum likelihood estimated branch lengths. For

the larger samples (500 and 1,000 individuals), the ARGlik analyses were performed

using the parsimony-approximated branch lengths.

An additional adjustment must be made for the larger samples. Due to the nature in which the phenotype likelihood is calculated, we cannot work on the log scale until we reach the tree's root. For trees with many tips, the likelihoods become indistinguishable from zero well before reaching the root, so adjustments must be made to keep them from underflowing to zero. We multiply the penetrances by a constant, c, which is set high enough that the final likelihoods remain greater than zero. The value of c will depend on the number of tips on the tree, but for both the trees of 1,000 and of 2,000 tips we used c = 10^35. This same adjustment is used to calculate the observed likelihoods and the likelihoods based on the permuted data.
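The effect of this adjustment can be seen in a toy example (an illustration of the underflow problem only, not the ARGlik likelihood computation itself): a product of many small per-tip factors underflows to zero in double precision, while scaling each factor by a suitably chosen constant keeps the product positive. Because the observed and permuted likelihoods receive exactly the same scaling, the permutation p-value is unaffected.

n <- 2000
v <- rep(0.05, n)    # hypothetical per-tip factors
prod(v)              # 0: underflow (the true value is about 1e-2602)
c.tip <- 20          # constant chosen with n in mind
prod(v * c.tip)      # 1: positive and representable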

4.2 ARGlik Results

For the T1D data, we are interested in whether ARGlik can pick up signals in the sampled regions. We hope to detect associations within the chromosome 12 and chromosome 2 regions; however, no association has been previously found in the chromosome 19 region. Table 4.2 shows the results for all samples.

Chr   Cases/Controls   ARGlik-m   ARGlik-e   Minimum p-value
12    80/120           Y          Y          0.000
12    160/240          Y          Y          0.006
12    200/300          Y          Y          0.000
12    400/600          Y          Y          0.000
2     80/120           Y          N          0.202
2     160/240          Y          Y          0.004
2     200/300          Y          Y          0.032
19    80/120           Y          N          0.060
19    160/240          N          N          0.688
19    200/300          N          N          0.288

Table 4.2: Results for each of the three chromosomes. For both ARGlik-m and ARGlik-e, a “Y” indicates an association was found and an “N” indicates no association. The minimum p-value is based on ARGlik-e.

For chromosome 12, all four samples confirm the strong signals that have been

replicated elsewhere. For chromosome 2, we were able to find signals within the region

for all samples, with the exception of ARGlik-e missing a signal for the smallest

sample. In this case, we did not expect all samples to find significance along this

chromosome as strong signals in this region have been more difficult to replicate. In

chromosome 19, where no association has been previously found, we also did not find

evidence of an association with the exception of ARGlik-m in the smallest sample.

In looking at the minimum p-values, we see some variation, most likely due to the different samples. This is not an indicator of instability within the ARGlik method, as multiple runs on the same data sets produced the same results.

We also note the distances between the locations of the largest signals found in

ARGlik and the location of the largest signals reported by the WTCCC (2007). Table

4.3 gives the distance comparisons for chromosomes 2 and 12. As there was no signal

in chromosome 19, there is no distance to report. The distances are measured both

in terms of the physical map distance and the number of markers.

Chr   Cases/Controls   ARGlik-e (Mb)   Distance (Mb)   Distance (markers)
12    80/120           110.872797      0.076741        10
12    160/240          110.872797      0.076741        9
12    200/300          110.949538      0.000000        0
12    400/600          110.627435      0.322103        30
2     80/120           204.453834      0.110591        19
2     160/240          204.408129      0.156296        30
2     200/300          204.566672      0.002247        1

Table 4.3: Locations of the highest signal in the T1D data for ARGlik-e, and the distances from the locations reported by WTCCC (2007). Distances are reported in physical map distance and in number of markers.

4.3 Discussion and Conclusion

From these results we see that ARGlik performs well on real data. Even though we sub-sampled from the larger pool of data, we were able to replicate the significant results of WTCCC (2007). For chromosome 12, three of the four samples had their smallest ARGlik-e p-values within ten markers of the highest previously found signal

[4]. This indicates not only that ARGlik is a good method of association testing, but also that it is capable of fine mapping to within a small region.

In the future, it would be preferable to run this analysis on an entire chromosome with the full sample, rather than on selected regions with subsets of the data. Perhaps the best method of analyzing an entire chromosome would be to run ARGlik on overlapping windows of SNPs. Due to recombination, SNPs that are far apart on the chromosome have little correlation between their genealogies, so breaking the data up in this manner would not discard much information. Including more sequences in the sample is a harder issue, since each additional tip added to a marginal tree increases the number of multiplications needed to calculate the likelihood. This will require the development of methods for handling likelihoods that are extremely small for trees that exceed approximately 2,000 tips.

CHAPTER 5

CONCLUSIONS AND FUTURE WORK

In this chapter we summarize the conclusions of this work and explore possible extensions. The proposed ARGlik method is meant to be a more flexible option for modeling disease association than other coalescent-based approaches, while also improving computational efficiency. We have shown that ARGlik is much faster than TREELD when the number of SNPs is comparable to the number of TREELD focal points. For smaller data, ARGlik cannot achieve the speed of MARGARITA; however, for the large T1D data sets, MARGARITA was unable to produce p-values. In terms of detecting association, we have shown that ARGlik is comparable to the leading methods under the one-locus disease models. When correcting for multiple comparisons, it achieves a very low false-positive rate of detection. Moreover, it can handle missing and unphased data, which most other methods cannot. The penetrance model can easily incorporate covariates as well as extra known information regarding the disease characteristics. Under the one-covariate models simulated, ARGlik performs very well, achieving power close to 100 percent for the easier relative risk models. For the type 1 diabetes data, ARGlik was able to pick up previously known signals. In some samples, the highest signals were found as close as a few markers from the previously replicated signal locations.

There are many directions we may take in pursuing this research further. The

first we aim to address is that of computational efficiency. While ARGlik runs faster than many other likelihood-based approaches, it still loses speed when more SNPs and/or individuals are added to the data. To help with computing efficiency, the

ARGlik algorithm can be adjusted so that the program may run in parallel. Once the trees are estimated from the entire genotype data set, the program may be broken up on a SNP-by-SNP basis, depending on the number of processors available. If, say, there are four processors on which to run the program, then the first quarter of the

SNPs may be run on processor one, the second quarter on processor two, and so on.
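As a sketch of this scheme (hypothetical function names, not the ARGlik interface), R's parallel package could distribute the per-SNP likelihood computations once the trees are in hand:

library(parallel)
# test.snp is a hypothetical stand-in for the per-SNP ARGlik analysis
# on trees already estimated from the full genotype data.
run.parallel <- function(trees, snps, n.cores = 4) {
  mclapply(seq_along(snps),
           function(i) test.snp(trees, snps[[i]]),
           mc.cores = n.cores)  # mclapply forks; on Windows use parLapply
}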

Another alternative for data with large numbers of SNPs is to run the analysis on overlapping windows of SNPs, as suggested by Tachmazidou et al. (2007). We will try this method on simulated data in the future, where we run the analysis on the entire set and on the windows of data, noting if and where the signals are found in each setting.
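For illustration, the overlapping windows might be generated as follows (a sketch; the window and overlap sizes are arbitrary choices, not values taken from Tachmazidou et al. (2007)):

# Start/end indices of overlapping SNP windows.
snp.windows <- function(n.snps, window = 100, overlap = 50) {
  starts <- seq(1, max(1, n.snps - overlap), by = window - overlap)
  data.frame(start = starts, end = pmin(starts + window - 1, n.snps))
}
snp.windows(230)  # windows 1-100, 51-150, 101-200, 151-230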

In addition to the overall computing time, when the data contain large numbers of

individuals, the tree likelihoods become harder to calculate as they rapidly approach

zero. Proper adjustments must be made in order to accommodate such large data. We

discuss in Section 4.1 one possible adjustment in that we increase the penetrance by

a common multiple. This adjustment is not an optimal solution as it is necessary to

change the multiplier based on the number of tips. Based on the specific topology of

the tree, it is possible to systematically adjust the likelihood as we move along the

branches. We plan to further explore this aspect of the likelihood.
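One standard device from phylogenetic likelihood computation is node-wise rescaling: after the partial likelihoods at an internal node are formed, they are divided by their maximum and the log of that maximum is banked, so intermediate values stay near 1 while the exact log-likelihood is recovered at the root. A minimal sketch of the idea (not the ARGlik implementation):

# Rescale partial likelihoods at a node, accumulating log scaling factors.
rescale.node <- function(partials, log.scale) {
  m <- max(partials)
  list(partials = partials / m, log.scale = log.scale + log(m))
}
# At the root: log-likelihood = log(sum(root$partials)) + root$log.scale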

In addition to computational issues, we will explore improvements to the likelihood model.

This work focused on binary phenotypes only, but we plan to include multi-level

phenotypes. This extension can be done through a different parameterization of the penetrances: more penetrance parameters must be introduced to account for each phenotype level.
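As an illustration of that parameterization (the names and dimensions here are ours, not the ARGlik code), the penetrances could be stored as an array indexed by phenotype level, genotype, and covariate level, with each (genotype, covariate) slice summing to 1 over phenotype levels:

# Penetrance array for K phenotype levels, 3 genotypes, 2 covariate levels.
K <- 3
pen <- array(1 / K, dim = c(K, 3, 2),
             dimnames = list(phenotype = 1:K,
                             genotype  = c("00", "10", "11"),
                             covariate = c("0", "1")))
pen["2", "11", "1"]   # P(phenotype = 2 | genotype 11, covariate 1)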

Our covariate simulations included only one binary covariate. We plan to examine how well the ARGlik method handles a multi-level covariate, as well as multiple covariates that may themselves have any number of levels. When the number of possible covariate and phenotype combinations increases, more individuals should be included in the sample so that we can still detect clustering at the tips of the tree. For varying numbers of phenotypes and covariates, we will determine how large the sample should be to give satisfactory results. Additionally, we will consider implementing a threshold model for continuous covariates [9].

The key result of this work is that we now have a coalescent-based method that can handle these possible alterations to the likelihood model. This allows us to explore the use of advanced covariate and phenotype models, and whether there is more information to gain from the data when genealogical information is taken into consideration.

APPENDIX A

DATA SIMULATION PSEUDOCODE

A.1 Standard One-Locus Model

# Read in SNP data generated from the ms program (Hudson, 2002)
full <- read.table(file = "data2", header = FALSE, skip = 6)
full <- as.matrix(full)

# Set the number of cases and the number of controls
num.of.cases <- 100
num.of.controls <- 100
case.seqs <- 2 * num.of.cases
control.seqs <- 2 * num.of.controls
total.seqs <- 2 * num.of.cases + 2 * num.of.controls

# Set parameters for the penetrances
# Let A be the mutant allele (1) and a be the wild-type (0)
# PAA is P(disease | AA)
PAA <- 0.544
PAa <- 0.05
Paa <- 0.05

# Loop until the disease location satisfies the mutation allele
# frequency condition
freq.ones <- 0.01   # so that the while loop is entered
count <- 0

while (freq.ones < 0.1 || freq.ones > 0.2) {
  # Pick disease location
  location <- runif(1, 1, ncol(full) + 1)
  location <- trunc(location)

  # Get the frequency count of the minor allele
  ones <- subset(full, full[, location] == 1)
  freq.ones <- nrow(ones) / nrow(full)
  count <- count + 1
}

# Finish splitting the data
zeros <- subset(full, full[, location] == 0)
n <- nrow(zeros)
m <- nrow(ones)

# Get the num.of.cases cases
cases <- matrix(NA, nrow = case.seqs, ncol = ncol(full))
current.n <- n
current.m <- m
row <- 1
for (i in 1:num.of.cases) {
  # Prob of homozygous for the mutation
  P11 <- PAA * current.m * (current.m - 1) /
    (PAA * current.m * (current.m - 1) + 2 * PAa * current.m * current.n +
     Paa * current.n * (current.n - 1))
  # Prob of heterozygous
  P01 <- 2 * PAa * current.m * current.n /
    (PAA * current.m * (current.m - 1) + 2 * PAa * current.m * current.n +
     Paa * current.n * (current.n - 1))
  # Prob of homozygous for the wild-type allele
  P00 <- 1 - P11 - P01

  # Generate a random number to determine the diplotype
  rand <- runif(1, 0, 1)

  if (rand < P11) {
    # Pick first row from "ones" at random
    n1 <- trunc(runif(1, 1, current.m + 1))
    # Enter first sample into cases
    cases[row, ] <- ones[n1, ]
    # Update ones matrix by removing the sampled row
    ones <- ones[-n1, ]
    # Update current.m and row
    current.m <- current.m - 1
    row <- row + 1
    # Pick second row from "ones" at random
    n2 <- trunc(runif(1, 1, current.m + 1))
    # Enter second sample into cases
    cases[row, ] <- ones[n2, ]
    # Update ones matrix by removing the sampled row
    ones <- ones[-n2, ]
    # Update current.m and row
    current.m <- current.m - 1
    row <- row + 1
  }
  if (rand > P11 && rand < P01 + P11) {
    # Pick one row from "ones" and one row from "zeros" at random
    n1 <- trunc(runif(1, 1, current.m + 1))
    n0 <- trunc(runif(1, 1, current.n + 1))
    # Enter the sampled individual into cases
    cases[row, ] <- ones[n1, ]
    cases[row + 1, ] <- zeros[n0, ]
    # Update ones and zeros matrices
    ones <- ones[-n1, ]
    zeros <- zeros[-n0, ]
    # Update current.m, current.n, and row
    current.m <- current.m - 1
    current.n <- current.n - 1
    row <- row + 2
  }
  if (rand > P01 + P11) {
    # Pick first row from "zeros" at random
    n1 <- trunc(runif(1, 1, current.n + 1))
    # Enter first sample into cases
    cases[row, ] <- zeros[n1, ]
    # Update zeros matrix by removing the sampled row
    zeros <- zeros[-n1, ]
    # Update current.n and row
    current.n <- current.n - 1
    row <- row + 1
    # Pick second row from "zeros" at random
    n2 <- trunc(runif(1, 1, current.n + 1))
    # Enter second sample into cases
    cases[row, ] <- zeros[n2, ]
    # Update zeros matrix by removing the sampled row
    zeros <- zeros[-n2, ]
    # Update current.n and row
    current.n <- current.n - 1
    row <- row + 1
  }
}

# Get the num.of.controls controls
controls <- matrix(NA, nrow = control.seqs, ncol = ncol(full))
# Hold the current row of controls
row <- 1
for (i in 1:num.of.controls) {
  # Prob of homozygous for the mutation
  P11 <- (1 - PAA) * current.m * (current.m - 1) /
    ((1 - PAA) * current.m * (current.m - 1) +
     2 * (1 - PAa) * current.m * current.n +
     (1 - Paa) * current.n * (current.n - 1))
  # Prob of heterozygous
  P01 <- 2 * (1 - PAa) * current.m * current.n /
    ((1 - PAA) * current.m * (current.m - 1) +
     2 * (1 - PAa) * current.m * current.n +
     (1 - Paa) * current.n * (current.n - 1))
  # Prob of homozygous for the wild-type allele
  P00 <- 1 - P11 - P01

  # Generate a random number to determine the diplotype
  rand <- runif(1, 0, 1)
  if (rand < P11) {
    # Pick first row from "ones" at random
    n1 <- trunc(runif(1, 1, current.m + 1))
    # Enter first sample into controls
    controls[row, ] <- ones[n1, ]
    # Update ones matrix by removing the sampled row
    ones <- ones[-n1, ]
    # Update current.m and row
    current.m <- current.m - 1
    row <- row + 1
    # Pick second row from "ones" at random
    n2 <- trunc(runif(1, 1, current.m + 1))
    # Enter second sample into controls
    controls[row, ] <- ones[n2, ]
    # Update ones matrix by removing the sampled row
    ones <- ones[-n2, ]
    # Update current.m and row
    current.m <- current.m - 1
    row <- row + 1
  }
  if (rand > P11 && rand < P01 + P11) {
    # Pick one row from "ones" and one row from "zeros" at random
    n1 <- trunc(runif(1, 1, current.m + 1))
    n0 <- trunc(runif(1, 1, current.n + 1))
    # Enter the sampled individual into controls
    controls[row, ] <- ones[n1, ]
    controls[row + 1, ] <- zeros[n0, ]
    # Update ones and zeros matrices
    ones <- ones[-n1, ]
    zeros <- zeros[-n0, ]
    # Update current.m, current.n, and row
    current.m <- current.m - 1
    current.n <- current.n - 1
    row <- row + 2
  }
  if (rand > P01 + P11) {
    # Pick first row from "zeros" at random
    n1 <- trunc(runif(1, 1, current.n + 1))
    # Enter first sample into controls
    controls[row, ] <- zeros[n1, ]
    # Update zeros matrix by removing the sampled row
    zeros <- zeros[-n1, ]
    # Update current.n and row
    current.n <- current.n - 1
    row <- row + 1
    # Pick second row from "zeros" at random
    n2 <- trunc(runif(1, 1, current.n + 1))
    # Enter second sample into controls
    controls[row, ] <- zeros[n2, ]
    # Update zeros matrix by removing the sampled row
    zeros <- zeros[-n2, ]
    # Update current.n and row
    current.n <- current.n - 1
    row <- row + 1
  }
}

#After controls and cases are selected, format data into proper SNP output

A.2 Covariate Data

# Read in data from ms (Hudson, 2002), skipping the header lines
full <- read.table(file = "data2", header = FALSE, skip = 6)
full <- as.matrix(full)

# Set the number of cases and the number of controls
num.of.cases <- 100
num.of.controls <- 100
case.seqs <- 2 * num.of.cases
control.seqs <- 2 * num.of.controls
total.seqs <- 2 * num.of.cases + 2 * num.of.controls

# Set parameters for the disease penetrances
# Let A be the mutant allele (1) and a be the wild-type (0)
# Let the covariate levels be 0 and 1
# Below are the probabilities of disease given genotype and covariate

# Model for covariate = 1
PAA1 <- 0.575
PAa1 <- 0.15
Paa1 <- 0.1

# Model for covariate = 0
# No change from the recessive model => the covariate has no effect
PAA0 <- 0.544
PAa0 <- 0.05
Paa0 <- 0.05

# Generate a column vector to represent the covariate: P(cov = 0) = 0.5
cov.col <- matrix(0, nrow = nrow(full), ncol = 1)
for (c in 1:nrow(cov.col)) {
  ran <- runif(1, 0, 1)
  if (ran < 0.5) {
    cov.col[c, 1] <- 1
  }
}

# Add the covariate data to the genotype data
full.cov <- cbind(full, cov.col)
location.cov <- ncol(full) + 1

# Next, split the data according to the disease genotype.
# Loop until the location satisfies the mutant allele frequency condition
freq.ones <- 0.01   # so that the while loop is entered
count <- 0
while (freq.ones < 0.1 || freq.ones > 0.2) {
  # Pick disease location
  location <- runif(1, 1, ncol(full) + 1)
  location <- trunc(location)

  # Get the frequency count of the minor allele
  ones <- subset(full.cov, full.cov[, location] == 1)
  freq.ones <- nrow(ones) / nrow(full)
  count <- count + 1
}

# Finish splitting the data
zeros <- subset(full.cov, full.cov[, location] == 0)

# Now split zeros and ones based on the covariate
# Naming order = genotype, covariate
ones.ones <- subset(ones, ones[, location.cov] == 1)
ones.zeros <- subset(ones, ones[, location.cov] == 0)
zeros.ones <- subset(zeros, zeros[, location.cov] == 1)
zeros.zeros <- subset(zeros, zeros[, location.cov] == 0)

# Get the counts of each subset of the data
oo <- nrow(ones.ones)
oz <- nrow(ones.zeros)
zo <- nrow(zeros.ones)
zz <- nrow(zeros.zeros)

# Get the num.of.cases cases
cases <- matrix(NA, nrow = case.seqs, ncol = ncol(full) + 1)
# Set the current counts, which are updated as we sample from the data
current.oo <- oo
current.oz <- oz
current.zo <- zo
current.zz <- zz

# Hold the current row of cases
row <- 1

# Sample the cases two rows at a time, one individual at a time
for (i in 1:num.of.cases) {
  # Probabilities of the form
  #   Num:   P(D | geno, cov) * P(geno and cov)
  #   Denom: sum of P(D | geno, cov) * P(geno and cov) over all
  #          (geno, cov) combinations
  # Set the denominator in two pieces:
  #   P(D | geno, cov = 1) * P(geno and cov = 1)
  #     + P(D | geno, cov = 0) * P(geno and cov = 0) = den1 + den0 = den
  den1 <- PAA1 * current.oo * (current.oo - 1) +
          2 * PAa1 * current.oo * current.zo +
          Paa1 * current.zo * (current.zo - 1)
  den0 <- PAA0 * current.oz * (current.oz - 1) +
          2 * PAa0 * current.oz * current.zz +
          Paa0 * current.zz * (current.zz - 1)
  den <- den1 + den0
  # Reminder: "o" stands for ones and "z" stands for zeros; A is the mutation
  # Prob of homozygous for the mutation (11) and covariate = 1
  P111 <- PAA1 * current.oo * (current.oo - 1) / den
  # Prob of genotype 10 and covariate = 1
  P101 <- 2 * PAa1 * current.oo * current.zo / den
  # Prob of genotype 00 and covariate = 1
  P001 <- Paa1 * current.zo * (current.zo - 1) / den
  # Prob of genotype 11 and covariate = 0
  P110 <- PAA0 * current.oz * (current.oz - 1) / den
  # Prob of genotype 10 and covariate = 0
  P100 <- 2 * PAa0 * current.oz * current.zz / den
  # Prob of genotype 00 and covariate = 0
  P000 <- Paa0 * current.zz * (current.zz - 1) / den

  # Generate a random number to determine the genotype/covariate combination
  rand <- runif(1, 0, 1)

  # Check all six probabilities to determine the genotype/covariate
  # combination for the current case individual
  if (rand < P111) {
    # Genotype 11 and covariate 1
    # Pick first row from "ones.ones" at random
    n1 <- trunc(runif(1, 1, current.oo + 1))
    # Enter first sample into cases
    cases[row, ] <- ones.ones[n1, ]
    # Update ones.ones matrix by removing the sampled row
    ones.ones <- ones.ones[-n1, ]
    # Update current.oo and row
    current.oo <- current.oo - 1
    row <- row + 1
    # Pick second row from "ones.ones" at random
    n2 <- trunc(runif(1, 1, current.oo + 1))
    # Enter second sample into cases
    cases[row, ] <- ones.ones[n2, ]
    # Update ones.ones matrix by removing the sampled row
    ones.ones <- ones.ones[-n2, ]

    # Update current.oo and row
    current.oo <- current.oo - 1
    row <- row + 1
  }
  if (rand > P111 && rand < P101 + P111) {
    # Genotype 10 and covariate 1
    # Pick first row from "ones.ones" at random
    n1 <- trunc(runif(1, 1, current.oo + 1))
    # Enter first sample into cases
    cases[row, ] <- ones.ones[n1, ]
    # Update ones.ones matrix by removing the sampled row
    ones.ones <- ones.ones[-n1, ]
    # Update current.oo and row
    current.oo <- current.oo - 1
    row <- row + 1
    # Pick second row from "zeros.ones" at random
    n2 <- trunc(runif(1, 1, current.zo + 1))
    # Enter second sample into cases
    cases[row, ] <- zeros.ones[n2, ]
    # Update zeros.ones matrix by removing the sampled row
    zeros.ones <- zeros.ones[-n2, ]
    # Update current.zo and row
    current.zo <- current.zo - 1
    row <- row + 1
  }

  if (rand > P101 + P111 && rand < P101 + P111 + P001) {
    # Genotype 00 and covariate 1
    # Pick first row from "zeros.ones" at random
    n1 <- trunc(runif(1, 1, current.zo + 1))
    # Enter first sample into cases
    cases[row, ] <- zeros.ones[n1, ]
    # Update zeros.ones matrix by removing the sampled row
    zeros.ones <- zeros.ones[-n1, ]
    # Update current.zo and row
    current.zo <- current.zo - 1
    row <- row + 1
    # Pick second row from "zeros.ones" at random
    n2 <- trunc(runif(1, 1, current.zo + 1))
    # Enter second sample into cases
    cases[row, ] <- zeros.ones[n2, ]
    # Update zeros.ones matrix by removing the sampled row
    zeros.ones <- zeros.ones[-n2, ]

    # Update current.zo and row
    current.zo <- current.zo - 1
    row <- row + 1
  }
  if (rand > P101 + P111 + P001 && rand < P101 + P111 + P001 + P110) {
    # Genotype 11 and covariate 0
    # Pick first row from "ones.zeros" at random
    n1 <- trunc(runif(1, 1, current.oz + 1))
    # Enter first sample into cases
    cases[row, ] <- ones.zeros[n1, ]
    # Update ones.zeros matrix by removing the sampled row
    ones.zeros <- ones.zeros[-n1, ]
    # Update current.oz and row
    current.oz <- current.oz - 1
    row <- row + 1
    # Pick second row from "ones.zeros" at random
    n2 <- trunc(runif(1, 1, current.oz + 1))
    # Enter second sample into cases
    cases[row, ] <- ones.zeros[n2, ]
    # Update ones.zeros matrix by removing the sampled row
    ones.zeros <- ones.zeros[-n2, ]

    # Update current.oz and row
    current.oz <- current.oz - 1
    row <- row + 1
  }
  if (rand > P101 + P111 + P001 + P110 &&
      rand < P101 + P111 + P001 + P110 + P100) {
    # Genotype 10 and covariate 0
    # Pick first row from "ones.zeros" at random
    n1 <- trunc(runif(1, 1, current.oz + 1))
    # Enter first sample into cases
    cases[row, ] <- ones.zeros[n1, ]
    # Update ones.zeros matrix by removing the sampled row
    ones.zeros <- ones.zeros[-n1, ]
    # Update current.oz and row
    current.oz <- current.oz - 1
    row <- row + 1
    # Pick second row from "zeros.zeros" at random
    n2 <- trunc(runif(1, 1, current.zz + 1))
    # Enter second sample into cases
    cases[row, ] <- zeros.zeros[n2, ]
    # Update zeros.zeros matrix by removing the sampled row
    zeros.zeros <- zeros.zeros[-n2, ]
    # Update current.zz and row
    current.zz <- current.zz - 1
    row <- row + 1
  }
  if (rand > P101 + P111 + P001 + P110 + P100 &&
      rand < P101 + P111 + P001 + P110 + P100 + P000) {
    # Genotype 00 and covariate 0
    # Pick first row from "zeros.zeros" at random
    n1 <- trunc(runif(1, 1, current.zz + 1))
    # Enter first sample into cases
    cases[row, ] <- zeros.zeros[n1, ]
    # Update zeros.zeros matrix by removing the sampled row
    zeros.zeros <- zeros.zeros[-n1, ]
    # Update current.zz and row
    current.zz <- current.zz - 1
    row <- row + 1
    # Pick second row from "zeros.zeros" at random
    n2 <- trunc(runif(1, 1, current.zz + 1))
    # Enter second sample into cases
    cases[row, ] <- zeros.zeros[n2, ]
    # Update zeros.zeros matrix by removing the sampled row
    zeros.zeros <- zeros.zeros[-n2, ]
    # Update current.zz and row
    current.zz <- current.zz - 1
    row <- row + 1
  }
}

# Sample the controls in the same manner, where the probability of each
# diplotype is of the form
#   Num:   P(no disease | geno, cov) * P(geno and cov)
#   Denom: sum of P(no disease | geno, cov) * P(geno and cov) over all
#          (geno, cov) combinations
# (i.e., with each penetrance replaced by its complement 1 - P, as in A.1)

#Finally, format cases and controls to proper SNP output

BIBLIOGRAPHY

[1] R. P. Brent. Algorithms for Minimization without Derivatives. Prentice-Hall, first edition, 1973.

[2] A. Clark. Finding genes underlying risk of complex disease by linkage disequilibrium mapping. Current Opinion in Genetics and Development, 13:296–302, 2003.

[3] D. Clayton, J. Chapman, and J. Cooper. Use of unphased multilocus genotype data in indirect association studies. Genetic Epidemiology, 27:415–428, 2004.

[4] The Wellcome Trust Case Control Consortium. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature, 447:661–678, 2007.

[5] C. Durrant, K. T. Zondervan, L. R. Cardon, S. Hunt, P. Deloukas, and A. P. Morris. Linkage disequilibrium mapping via cladistic analysis of single-nucleotide polymorphism haplotypes. The American Journal of Human Genetics, 75:35–43, 2004.

[6] P. Fearnhead and P. Donnelly. Estimating recombination rates from population genetic data. Genetics, 159(3):1299–1318, 2001.

[7] P. Fearnhead and P. Donnelly. Approximate likelihood methods for estimating local recombination rates. Journal of the Royal Statistical Society B, 64(4):657–680, 2002.

[8] J. Felsenstein. Evolutionary trees from DNA sequences: a maximum likelihood approach. Journal of Molecular Evolution, 17(6):368–376, 1981.

[9] J. Felsenstein. Inferring Phylogenies. Sinauer Associates, Inc., 2004.

[10] D. Fernández-Baca. The perfect phylogeny problem. In D. Z. Du and X. Cheng, editors, Steiner Trees in Industry. Kluwer Academic Press, 2001.

[11] R. C. Griffiths and P. Marjoram. Ancestral inference from samples of DNA sequences with recombination. Journal of Computational Biology, 3(4):479–502, 1996.

[12] R. C. Griffiths and S. Tavaré. Simulating probability distributions in the coalescent. Theoretical Population Biology, 46:131–159, 1994.

[13] J. Hein, M. H. Schierup, and C. Wiuf. Gene Genealogies, Variation, and Evolution: A Primer in Coalescent Theory. Oxford University Press, 2005.

[14] D. M. Hillis, C. Moritz, and B. K. Mable. Molecular Systematics. Sinauer Associates, Inc., second edition, 1996.

[15] R. R. Hudson. Properties of a neutral allele model with intragenic recombination. Theoretical Population Biology, 23:183–201, 1983.

[16] R. R. Hudson. Gene genealogies and the coalescent process. Oxford Surveys in Evolutionary Biology, 7:1–49, 1991.

[17] R. R. Hudson. Generating samples under a Wright-Fisher neutral model. Bioin- formatics, 18:337–338, 2002.

[18] R. R. Hudson and N. L. Kaplan. Statistical properties of the number of recombination events in the history of a sample of DNA sequences. Genetics, 111:147–164, 1985.

[19] J. A. Todd et al. Robust associations of four new chromosome regions from genome-wide analyses of type 1 diabetes. Nature Genetics, 39(7):857–864, 2007.

[20] G. Kimmel, R. M. Karp, M. I. Jordan, and E. Halperin. Association mapping and significance estimation via the coalescent. The American Journal of Human Genetics, 83:675–683, 2008.

[21] J. F. C. Kingman. On the genealogy of large populations. Journal of Applied Probability, 19:27–43, 1982.

[22] J. F. C. Kingman. The coalescent. Stochastic Processes and their Applications, 13:235–248, 1982.

[23] M. K. Kuhner, P. Beerli, J. Yamato, and J. Felsenstein. Usefulness of single nucleotide polymorphism data for estimating population parameters. Genetics, 156:439–447, 2000.

[24] K. Lange. Mathematical and Statistical Methods for Genetic Analysis. Springer, 1997.

[25] F. Larribe, S. Lessard, and N. J. Schork. Gene mapping via the ancestral recombination graph. Theoretical Population Biology, 62:215–229, 2002.

[26] P. O. Lewis. A likelihood approach to estimating phylogeny from discrete morphological character data. Systematic Biology, 50(6):913–925, 2001.

[27] J. Li and T. Jiang. Haplotype-based linkage disequilibrium mapping via direct data mining. Bioinformatics, 21(24):4384–4393, 2005.

[28] N. Li and M. Stephens. Modeling linkage disequilibrium and identifying recombination hotspots using single-nucleotide polymorphism data. Genetics, 165:2213–2233, 2003.

[29] W.-H. Li. Molecular Evolution. Sinauer Associates, Inc., 1997.

[30] R. Lyngsø, Y. S. Song, and J. Hein. Minimum recombination histories by branch and bound. Proceedings of Workshop on Algorithms in Bioinformatics 2005, Lecture Notes in Computer Science, 3692:239–250, 2005.

[31] L. M. McIntyre, E. R. Martin, K. L. Simonsen, and N. L. Kaplan. Circumventing multiple testing: a multilocus Monte Carlo approach to testing for association. Genetic Epidemiology, 19:18–29, 2000.

[32] G. A. T. McVean and N. J. Cardin. Approximating the coalescent with recombination. Philosophical Transactions of the Royal Society B, 360:1387–1393, 2005.

[33] M. J. Minichiello and R. Durbin. Mapping trait loci by use of inferred ancestral recombination. The American Journal of Human Genetics, 79:910–922, 2006.

[34] J. Molitor, P. Marjoram, and D. Thomas. Fine-scale mapping of disease genes with multiple mutations via spatial clustering techniques. The American Journal of Human Genetics, 73:1368–1384, 2003.

[35] A. P. Morris, J. C. Whittaker, and D. J. Balding. Fine-scale mapping of disease loci via shattered coalescent modeling of genealogies. The American Journal of Human Genetics, 70:686–707, 2002.

[36] J. Ott. Analysis of Human Genetic Linkage. The Johns Hopkins University Press, third edition, 1999.

[37] W. H. Press, B. P. Flannery, S. A. Teukolsky, and W. T. Vetterling. Numerical Recipes in C: The Art of Scientific Computing. Cambridge University Press, second edition, 1992.

[38] B. Rannala and J. P. Reeve. High-resolution multipoint linkage-disequilibrium mapping in the context of a human genome sequence. The American Journal of Human Genetics, 69:159–178, 2001.

[39] J. S. Rogers and D. L. Swofford. A fast method for approximating maximum likelihoods of phylogenetic trees from nucleotide sequences. Systematic Biology, 47(1):77–89, 1998.

[40] P. Scheet and M. Stephens. A fast and flexible statistical model for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase. American Journal of Human Genetics, 78:629–644, 2006.

[41] M. H. Schierup and J. Hein. Consequences of recombination on traditional phylogenetic analysis. Genetics, 156:879–891, 2000.

[42] M. Stephens, N. J. Smith, and P. Donnelly. A new statistical method for haplotype reconstruction from population data. The American Journal of Human Genetics, 68:978–989, 2001.

[43] I. Tachmazidou, C. J. Verzilli, and M. De Iorio. Genetic association mapping via evolution-based clustering of haplotypes. PLoS Genetics, 3(7):1163–1177, 2007.

[44] H. T. T. Toivonen, P. Onkamo, K. Vasko, V. Ollikainen, P. Sevon, H. Mannila, M. Herr, and J. Kere. Data mining applied to linkage disequilibrium mapping. The American Journal of Human Genetics, 67:133–145, 2000.

[45] J. Wakeley. Coalescent Theory. Roberts & Company Publishers, 2008.

[46] E. R. B. Waldron, J. C. Whittaker, and D. J. Balding. Fine mapping of disease genes via haplotype clustering. Genetic Epidemiology, 30:170–179, 2006.

[47] J. D. Wall. A comparison of estimators of the population recombination rate. Molecular Biology and Evolution, 17(1):156–163, 2000.

[48] M. S. Waterman. Introduction to Computational Biology: Maps, Sequences, and Genomes. Chapman & Hall, first edition, 1995.

[49] S. Won, R. Sinha, and Y. Luo. Fine-scale linkage disequilibrium mapping: A comparison of coalescent-based and haplotype-clustering-based methods. BMC Proceedings, 1(33):1–2, 2007.

[50] Y. Wu. Association mapping of complex diseases with ancestral recombination graphs: models and efficient algorithms. Journal of Computational Biology, 15(7):667–684, 2008.

[51] S. Zöllner and J. K. Pritchard. Coalescent-based association mapping and fine mapping of complex trait loci. Genetics, 169:1071–1092, 2005.
