A Fast and Accurate Method for Detection of IBD Shared Haplotypes in Genome-Wide SNP Data
Total Page:16
File Type:pdf, Size:1020Kb
European Journal of Human Genetics (2017) 25, 617–624 & 2017 Macmillan Publishers Limited, part of Springer Nature. All rights reserved 1018-4813/17 www.nature.com/ejhg ARTICLE A fast and accurate method for detection of IBD shared haplotypes in genome-wide SNP data Douglas W Bjelland1, Uday Lingala1, Piyush S Patel1, Matt Jones2 and Matthew C Keller*,1,2 Identical by descent (IBD) segments are used to understand a number of fundamental issues in genetics. IBD segments are typically detected using long stretches of identical alleles between haplotypes in phased, whole-genome SNP data. Phase or SNP call errors in genomic data can degrade accuracy of IBD detection and lead to false-positive/negative calls and to under/ overextension of true IBD segments. Furthermore, the number of comparisons increases quadratically with sample size, requiring high computational efficiency. We developed a new IBD segment detection program, FISHR (Find IBD Shared Haplotypes Rapidly), in an attempt to accurately detect IBD segments and to better estimate their endpoints using an algorithm that is fast enough to be deployed on very large whole-genome SNP data sets. We compared the performance of FISHR to three leading IBD segment detection programs: GERMLINE, refined IBD, and HaploScore. Using simulated and real genomic sequence data, we show that FISHR is slightly more accurate than all programs at detecting long (43cM) IBD segments but slightly less accurate than refined IBD at detecting short (~1 cM) IBD segments. More centrally, FISHR outperforms all programs in determining the true endpoints of IBD segments, which is crucial for several applications of IBD information. FISHR takes two to three times longer than GERMLINE to run, whereas both GERMLINE and FISHR were orders of magnitude faster than refined IBD and HaploScore. Overall, FISHR provides accurate IBD detection in unrelated individuals and is computationally efficient enough to be utilized on large SNP data sets 460 000 individuals. European Journal of Human Genetics (2017) 25, 617–624; doi:10.1038/ejhg.2017.6; published online 8 February 2017 INTRODUCTION simultaneously meet a number of goals – computational efficiency, Identical by descent (IBD) shared haplotypes – homologous chromo- low false-positive rates, low false-negative rates, and accurate detection somal segments descended from the same common ancestor – have of IBD segment endpoints – that typically tradeoff with one another. been used for genotype imputation,1,2 IBD mapping,3 heritability Many programs have been developed to discover IBD segments in estimation,4 phase inference,1 and inference of population structure.5,6 SNP data sets when expected pedigree relatedness is low. Under Haldane’s7 model of recombination, the length of IBD GERMLINE,8 often considered the benchmark IBD discovery pro- haplotypes shared between two individuals is exponentially distributed gram, is computationally efficient and therefore usable on very large with mean 100/2 g centiMorgans (cM)whereg is the number of samples, but the literature has indicated that its accuracy is lower than generations since the common ancestor. Although a pair of individuals more recently developed programs. Because GERMLINE is fast and sharing a common ancestor 15 generations ago is highly unlikely to can be run in a way that leads to few false-negative calls at the expense share any IBD haplotypes from that ancestor, when they do, the of many false-positive calls, two newer IBD detection programs that expected length of the segment is ~ 3.3 cM. Such IBD shared reportedly outperform GERMLINE in accuracy, refined IBD (rIBD)9 haplotypes are typically inferred from long stretches of identical alleles and HaploScore,10 use GERMLINE to detect candidate IBD segments in phased, whole-genome single nucleotide polymorphism (SNP) that are then post-processed by extending, removing, or slicing the arrays, but accurate and efficient IBD detection from such data is candidate segments in the hope of providing more accurate detection difficult for several reasons. First, phase and SNP call errors can split of IBD segments. rIBD uses a probabilistic hidden Markov model to long IBD segments into two or more shorter segments or lead to give locations along candidate IBD segment obtained from GERM- artificial truncation of IBD segments, inflating the false-negative (miss) LINE a posterior LOD score as to whether they are truly IBD or not. rates of IBD detection. Second, the sheer number of comparisons that rIBD has a lower false-positive rate than GERMLINE with only a must be made at each site (~ twice the squared sample size), combined modest increase in the false-negative rate, but it is computationally with the low base rate of true IBD segments between pairs of unrelated intensive and therefore has a very long run time for large data sets. individuals, means that a substantial fraction of called IBD segments HaploScore uses information on the switch error rate and the SNP can be false positives. Finally, because of the computational complexity error rate to give a posterior probability of whether each candidate of IBD detection, algorithms that sacrifice speed for accuracy can be segment from GERMLINE is truly IBD or not. unusable on the large sample sizes (eg, sample size 450 000 requires The current paper describes a new program, FISHR (Find IBD nearly 5 billion comparisons per genomic site) currently being Shared Haplotypes Rapidly), which improves upon existing IBD accumulated. Thus, successful IBD detection programs must detection programs. Most centrally, as we show below, existing 1Institute for Behavioral Genetics, University of Colorado at Boulder, Boulder, CO, USA; 2Department of Psychology & Neuroscience, University of Colorado at Boulder, Boulder, CO, USA *Correspondence: MC Keller, Institute for Behavioral Genetics, University of Colorado at Boulder, 1480 30th St., Boulder, CO 80303, USA. Tel: +1 303 492 7362; Fax: +1 303 492 8063; E-mail:[email protected] Received 5 May 2016; revised 22 November 2016; accepted 24 December 2016; published online 8 February 2017 Find IBD shared haplotypes rapidly DW Bjelland et al 618 programs tend to either overextend true IBD segments or split true positive and false-negative calls. Furthermore, threshold values for PIE and MA, IBD segments into multiple smaller ones. Because such systematic as well as values for –bits, –err-hom,and–err-het, depend on the density of biases in IBD segment length lead to biased heritability estimates, SNPs and the quality of SNP calls and phasing in the data at hand. Poorly called – incorrect inferences of demographic history, and inaccurate phase/ or poorly phased data would necessitate lower values for the bits (eg 30) and – imputation information near the ends of called segments, our main higher values for the err-hom and -err-het (eg 2) arguments to allow the GERMLINE subroutine to detect a sufficient number of potential IBD goal was to develop an algorithm that accurately estimates endpoints segments. A utility program, parameter_finder, (available for download at and true lengths of IBD segments while achieving a computational https://github.com/matthew-c-keller/GL.parameter.finder.git) is supplied along- fi ef ciency similar to GERMILNE and an accuracy similar to rIBD. side FISHR to help users choose values for –emp-pie-threshold and –emp-ma- threshold, which influence the tradeoff between false positives and false MATERIALS AND METHODS negatives the most strongly. To do this, parameter_finder provides distributions Description of the FISHR algorithm of PIE and MA from segments in the user’s data that are truly IBD (defined as FISHR is written in C++ and is freely available for download at https://github. the middlemost 50% of all stretches of the genome48cM that contained no com/matthew-c-keller/FISHR.git, along with scripts to evaluate IBD detection. opposite homozygotes between pairs of individuals) and similar distributions of FISHR utilizes GERMLINE (described in detail by Gusev et al.8), as an initial PIE and MA from non-IBD segments (defined as segments between random screen to quickly detect candidate segments. By default, FISHR uses the – pairs of individuals with the same start and endpoints as the truly IBD h_extend method in GERMLINE, which incorporates information on phased segments). Users can then compare the IBD and non-IBD distributions of PIE mismatches, to detect candidate segments between two individuals or within and MA (displayed in Supplementary Figure S3) to choose PIE and MA individuals (runs of homozygosity). FISHR then further refines the candidate thresholds that produce a desired sensitivity or specificity, depending on the segments as follows. First, because two long IBD calls separated by a short users’ priorities. distance may actually be a single contiguous IBD segment that was artificially broken apart in GERMLINE due to phase or SNP call errors, FISHR stitches Simulated sequence and SNP data together segments separated by a user-defined number of SNPs (-gap). Next, We simulated genotypic data using the sequence simulator HAPGEN2,12 FISHR finds the locations of ‘implied errors’ (IEs) – likely SNP call or phase which simulated haplotypes by conditioning on a reference set of popu- errors – for all called segments. To do this, FISHR finds the longest exact match lation haplotypes (here, the 1000 Genomes Project13 European ancestry between either of the two phased haplotypes of the first person and either of the (CEU) haplotypes of chromosome 15) and created a new population by two phased haplotypes of the second person (a total of four possible combining haplotypes according to a fine-scaled recombination rate map (from combinations), starting at the first SNP of the called segment. An IE occurs deCODE14). Here, we defined the effective population sizes as 11 418, the at the mismatching SNP after the exact haplotypic match ends. FISHR then typical value used when simulating individuals of European descent,15 and the finds the next longest exact match between any of the four possible sample size (defined as ‘controls’ in HAPGEN2) as 13 000.