Accurate Methods for Ancestry and Relatedness Inference
A dissertation submitted to the Program in Biomedical Informatics and the Committee on Graduate Studies of Stanford University in partial fulfillment of the requirements for the degree of Doctor of Philosophy

Jesse M. Rodriguez
December 2013

© 2013 by Jesse M. Rodriguez. All Rights Reserved. Re-distributed by Stanford University under license with the author.

This work is licensed under a Creative Commons Attribution-Noncommercial 3.0 United States License. http://creativecommons.org/licenses/by-nc/3.0/us/

This dissertation is online at: http://purl.stanford.edu/cn371vd3410

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Serafim Batzoglou, Primary Adviser

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Russ Altman

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Carlos Bustamante

Approved for the Stanford University Committee on Graduate Studies.
Patricia J. Gumport, Vice Provost for Graduate Education

This signature page was generated electronically upon submission of this dissertation in electronic format. An original signed hard copy of the signature page is on file in University Archives.

Abstract

The predisposition to many diseases is strongly influenced by the genome of an individual. However, the association between the genome and most diseases is not fully understood, so there is an ongoing effort to characterize these associations. One way to characterize disease-genome associations is by studying the familial and ancestral origin of individuals in the context of disease.
This kind of study relies on the fact that individuals with shared origins tend to have genomes and phenotypes that are similar to one another. Detailed information regarding familial and ancestral origin is often unknown; however, it can be inferred computationally by examining the genome. Therefore, it is important that we have accurate methods to infer this information in order to facilitate disease-genome associations. In this dissertation, I describe the contributions I have made to accurately inferring the ancestry and relatedness of individuals based on their genomes. First, I describe my work on Alloy, a method to infer the ancestral origin of segments of the genome based on a factorial HMM. Next, I present Parente, a method to infer which individuals in a group are related to one another by detecting genomic segments that are identical-by-descent (IBD) using an embedded likelihood ratio test. Finally, I present Parente2, an extension of Parente that incorporates linkage disequilibrium information and results in significantly higher accuracy.

Acknowledgements

I owe a great deal of thanks to many people who have supported me through the years of my PhD. To Pavel, Nitin, and Josef, for giving me a place to start on this journey. To Tiffany, Dan, Sarah, George, Sam, Tom, Marc, Andreas, Noah, Alex, David, Eugene, Marina, and Karen for your instruction, advice, discussions, friendship, and memories. To my BMI classmates and members of the Batzoglou lab for your support, all of the fun, and for being great colleagues. To Mary Jeanne, Nancy, and Alex Sandra, for your help and guidance. To David Paik, Mark Musen, and Larry Fagan, for your mentorship and advice. To Russ, David, and Carlos, for being on my committee. To Arend, Cheryl, Noah, Lin, and Roy, for teaching me so much and being so great to work with. To Serafim, for being relentlessly supportive and giving me the chance to have fun with my PhD.
To Sivan, for your friendship, bountiful ideas, and hard work. And to Kelly, Audrey, and my family, for your love, patience, and encouragement.

Contents

Abstract
Acknowledgements
1 Introduction
2 Biology, terminology, and technology
2.1 Biology and terminology
2.1.1 Technology
3 Relatedness and ancestry
3.1 Family genetics
3.1.1 IBD
3.2 Ancestry
4 Alloy
4.1 Introduction
4.1.1 Previous work
4.1.2 Overview of methods and results
4.2 Methods
4.2.1 Factorial hidden Markov model
4.2.2 Inference with the forward-backward algorithm
4.2.3 Transition probabilities
4.2.4 Linkage disequilibrium model
4.3 Results
4.3.1 Simulation of admixed individuals and training the background LD models
4.3.2 Evaluating Alloy's accuracy under complex and ancient admixtures.
4.3.3 Exploring background LD models.
4.3.4 Measuring robustness to inaccuracies in model parameters.
4.3.5 Evaluating model accuracy under varying amounts of training data.
4.4 Discussion
4.4.1 Robustness to different admixtures models
4.4.2 Learning admixture parameters
4.4.3 Time complexity reduction
4.4.4 Incorporating additional variation
4.4.5 Conclusion
5 Parente
5.1 Introduction
5.2 Methods
5.2.1 Likelihood ratio test
5.2.2 Embedded likelihood ratio test
5.2.3 Genotyping-error function
5.2.4 Likelihood-ratio test threshold
5.3 Results
5.3.1 Constructing training and testing datasets.
5.3.2 Simulations to evaluate performance.
5.3.3 Parente's accuracy and comparison to fastIBD.
5.3.4 Training Parente's model and thresholds.
5.3.5 Embedded LRT and local thresholds.
5.3.6 Accuracy performance characteristics.
5.4 Discussion
6 Parente2
6.1 Introduction
6.2 Methods
6.2.1 Inner Log-Likelihood Ratio
6.2.2 Outer Log-Likelihood Ratio
6.2.3 Genotyping-error function
6.2.4 Window and window set definitions
6.2.5 Window filter
6.2.6 Decreasing running time with the SpeeDB filter
6.2.7 Facilitating larger window sizes
6.3 Results
6.3.1 Data sets
6.3.2 Simulations
6.3.3 Experimental parameters
6.3.4 Accuracy of Parente2 compared to other methods
6.3.5 Augmented window definitions and window filter yields better performance
6.3.6 ELR yields better performance
6.3.7 Increasing window size increases performance
6.3.8 Parente2 is robust to window set size
6.4 Discussion
6.4.1 Speed and accuracy tradeoff
6.4.2 Amount of training data required
6.4.3 When no training data is available
6.4.4 Recommended settings for Parente2
6.4.5 Applicability to DNA sequencing studies
7 Conclusions and Future Directions
7.1 Potential impact
7.2 Future directions
Bibliography

List of Tables

3.1 Expected number and size of IBD segments based on relationship

5.1 Normality of window scores as a function of window size.

6.1 Example of tiling method used to break up latent IBD. In this example, 6 source individuals are used to create 3 composite individuals, each having 9 genomic segments (e.g., assuming a chromosome of length 1.8 cM with a segment size of 0.2 cM). Each entry in the table contains the index of the source individual used for the jth genomic segment of the ith composite individual.

6.2 Pairwise accuracy of Parente2 and other methods. Table 6.2A shows sensitivity at lower false positive rates and Table 6.2B shows sensitivity at higher false positive rates. fastIBD was run ten times with ten different seeds according to author recommendations. For the GERMLINE-64 and GERMLINE-128 rows, GERMLINE was run on phased data with GERMLINE's seed size set to 64 and 128, respectively.

6.3 Positional accuracy of Parente2 and other methods.
Accuracy was measured based on the portion of the genome that was in or not in a simulated IBD segment for each pair of individuals.

6.4 The accuracy and running time of evaluated IBD inference methods. Each method was used to detect 2 cM IBD segments in 10 trials of the HapMap data set. The Parente2 entry represents when Parente2 was run using the augmented window set with the window filter. Parente2-SpeeDB is the same but with the application of the SpeeDB filter. The Parente2-Std. entry represents when Parente2 was run using the standard window set without the window filter and without SpeeDB. fastIBD was run ten times with ten different random seeds according to the authors' recommendations and the sum of the running time of all ten runs is reported. GERMLINE-64 and GERMLINE-128 refer to running GERMLINE while using seed sizes of 64 and 128, respectively. The phasing pipeline provided with GERMLINE was used to phase the data prior to running GERMLINE and its running time is included in the reported running time. The number of pairs of individuals processed by each method per second is reported in the Pairs/sec column.

6.5 Parente2's accuracy at various thresholds when detecting 2 and 4 cM IBD segments

List of Figures

2.1 Linkage disequilibrium
3.1 Shared DNA by familial relationship
4.1 Alloy's factorial HMM.
4.2 The state space of the HMM.
4.3 Alloy's performance over increasing generations since admixture and comparison to WINPOP.