Identity-By-Descent Estimation and Mapping of Qualitative Traits in Large, Complex Pedigrees
Total Page:16
File Type:pdf, Size:1020Kb
Copyright Ó 2008 by the Genetics Society of America DOI: 10.1534/genetics.108.089912 Identity-by-Descent Estimation and Mapping of Qualitative Traits in Large, Complex Pedigrees Mark Abney1 Department of Human Genetics, University of Chicago, Chicago, Illinois 60637 Manuscript received April 4, 2008 Accepted for publication May 2, 2008 ABSTRACT Computing identity-by-descent sharing between individuals connected through a large, complex pedigree is a computationally demanding task that often cannot be done using exact methods. What I present here is a rapid computational method for estimating, in large complex pedigrees, the probability that pairs of alleles are IBD given the single-point genotype data at that marker for all individuals. The method can be used on pedigrees of essentially arbitrary size and complexity without the need to divide the individuals into separate subpedigrees. I apply the method to do qualitative trait linkage mapping using the nonparametric sharing statistic Spairs. The validity of the method is demonstrated via simulation studies on a 13-generation 3028-person pedigree with 700 genotyped individuals. An analysis of an asthma data set of individuals in this pedigree finds four loci with P-values ,10À3 that were not detected in prior analyses. The mapping method is fast and can complete analyses of 150 affected individuals within this pedigree for thousands of markers in a matter of hours. OMPUTATION of identical-by-descent (IBD) allele tages and provide a valid test for the presence of a trait- C sharing between related individuals is a necessary influencing gene (Bourgain and Genin 2005). Large ingredient in many methods for linkage mapping of pedigrees also arise in other animal systems where breed- complex traits. Typically, IBD allele sharing is used ing is carefully controlled. For example, there is interest either directly to assess whether affected individuals are in methods that are applicable to complex pedigrees for sharing more at a locus than expected under the null both livestock (Thallmanet al. 2001) and dogs (Sutter hypothesis or as a component in the covariance matrix and Ostrander 2004). in a variance component model. A number of algo- What I present here is a rapid computational method rithms for computing IBD exactly exist (e.g.,Elston for estimating, in large complex pedigrees, the proba- and Stewart 1971; Lander and Green 1987; bility that pairs of alleles are IBD given the single-point Kruglyak et al. 1996; Fishelson and Geiger 2002); genotype data at that marker for all individuals. Because however, these methods become computationally in- the method is very fast, it can easily be used on geno- feasible when pedigrees are very large and complex. mewide data with many thousands of markers on hun- Under such circumstances approximate methods be- dreds of related individuals. It can be used directly to come necessary, whether Markov chain Monte Carlo do linkage mapping with affected individuals using the (Thompson et al. 1993; Sobel and Lange 1996; Heath Spairs statistic or to compute approximate multipoint 1997) or regression based (Fulker et al. 1995; Almasy probabilities both for alleles being IBD, using regres- and Blangero 1998). Even these methods, however, sion-based approaches (e.g.,Almasy and Blangero have difficulty when the pedigree is very deep with many 1998), and for alleles being homozygous by descent generations of individuals with no data. (HBD) using a hidden Markov model (HMM) (Abney In humans, very deep, and possibly complex, pedi- et al. 2002). Here, I describe this computational method grees often arise in conjunction with genetic studies and its application to qualitative trait linkage analysis. of isolated populations. Isolated populations are com- Although computing Spairs is straightforward, in prin- monly thought to have characteristics that may prove ciple, a number of challenges must be overcome in advantageous for mapping (Wright et al. 1999; Peltonen creating a practical and valid mapping method for very et al. 2000; Escamilla 2001; Shifman and Darvasi 2001; large, and possibly complex, pedigrees. In particular, it Service et al. 2006), yet may require specialized statis- is common in studies involving large pedigrees to have tical methods to both properly leverage these advan- one, or a few, pedigrees, making the asymptotic distri- bution of the test statistic, which is appropriate when there are many independent pedigrees, not necessarily 1Address for correspondence: Department of Human Genetics, University of Chicago, 920 E. 58th St., Chicago, IL 60637. applicable. Also, the allele-frequency distribution may E-mail: [email protected] have a major influence on the test statistic when Genetics 179: 1577–1590 ( July 2008) 1578 M. Abney inheritance information is obscured by missing data. when there are missing genotypes in the data. Although Unfortunately, the relevant allele-frequency distribu- neither study formulated a version of Equation 1 that tion is that in the founders of the pedigree, which in holds under missing data conditions, they each sug- large pedigrees may be many generations earlier than gested approaches for this case. The most recent version the sampled individuals. As a result, estimation of the of SimIBD (Davis et al. 1996) uses a Monte Carlo founder allele-frequency distribution from the sampled procedure where, for each realization, a random geno- data can result in a large bias of the conditional type is assigned to each missing genotype, and the expected sharing statistic. The difficulties posed by recursive algorithm is applied. The final probability is not knowing the true allele-frequency distribution can the average of the probabilities computed at each Monte be largely overcome through the use of simulations, but Carlo realization. In contrast, Wang et al. (1995) suggest the capacity to do many simulations requires a compu- two different possibilities. In the first, when the recursive tationally efficient method, particularly when a large algorithm encounters an individual who has a missing number of markers are involved. Below, I describe the genotype, the relevant inheritance probability [e.g., ) theoretical basis of the IBD estimation method, the PrðA1 M1jGÞ] is computed by summing over all approximations used, and how it differs from earlier possible genotypes for the missing data weighted by methods that take a similar approach (Wang et al. 1995; the probability of the genotype given the observed Davis et al. 1996). I then show its application to single- genotypes. To simplify the computation, one can use point linkage mapping using Spairs and how the diffi- the probability of the genotype given only the genotypes culties mentioned above are solved. The appendix of close relatives rather than all observed genotypes. The describes how to use the IBD estimation method to second possibility is to find the genotype configuration obtain multipoint estimates of HBD by modifying the for all individuals with missing genotypes that has the HMM of Abney et al. (2002). highest probability and apply the recursive algorithm to that configuration. Finding the highest-probability ge- notype configuration, however, can be computationally METHODS demanding if there are many missing genotypes. IBD estimation A common situation when analyzing large pedigrees is to have several generations of the pedigree completely The objective is to compute the probability of two untyped. None of the above strategies are entirely alleles being IBD given all available genotype data at sufficient in such a situation. The problem is that there that locus and the entire, unbroken pedigree. The is little information in the untyped portion of the method is based on the recursive strategy suggested by genealogy from which to infer the genotype probability ang avis W et al. (1995) and D et al. (1996). In both of distribution in those individuals. Simulating over valid these studies the probability is computed in a manner genotype configurations can then be time consuming analogous to the recurrence relation for kinship coef- and, possibly, inaccurate. Summing over all possible 1 1 ficients, fAB ¼ 2 fMB 1 2 fFB, where M and F are the genotypes, on the other hand, may be computationally mother and the father of individual A, and individual B impracticable. is not a descendant of A. The equivalent recurrence The approach I propose relies on classifying individ- equation when there are genotype data at the locus, as uals into two groups, A and S.AnS individual is someone ang avis given by W et al. (1995) and D et al. (1996), is who either is genotyped or has at least one ancestor who valid only when there are no missing genotypes and is is genotyped, while an A individual is someone who is not among the S group. Note that by this definition the S PrðA [B j GÞ¼PrðA )M j GÞPrðM [B j GÞ 1 1 1 1 1 1 group may contain individuals for whom no data were ) 1 PrðA1 M2 j GÞPrðM2[B1 j GÞ actually collected. I also define a set of individuals called 1 PrðA1)F1 j GÞPrðF1[B1 j GÞ ‘‘quasi-founders,’’ where each quasi-founder is either an S individual with both parents in the A group or an A 1 PrðA1)F2 j GÞPrðF2[B1 j GÞ; ð1Þ individual with a spouse in S. A version of recurrence ) where PrðAi Mj jGÞ is the probability that the ith allele Equation 1, reformulated to hold true even under from A was inherited from the jth allele of A’s mother, missing data, is applied to the S individuals until the given the observed genotype data; and Ai [ Bj means quasi-founders are reached, at which point boundary allele Ai is IBD with allele Bj. This equation is applied conditions are employed to determine the final proba- repeatedly until the founders of the pedigree are bility. This allows one to avoid using the recurrence reached and boundary conditions are used to obtain equation over those generations with no genotype data, the probability.