Algorithms for Human Genetics
Total Page:16
File Type:pdf, Size:1020Kb
Algorithms for Human Genetics Bonnie Kirkpatrick Electrical Engineering and Computer Sciences University of California at Berkeley Technical Report No. UCB/EECS-2011-25 http://www.eecs.berkeley.edu/Pubs/TechRpts/2011/EECS-2011-25.html April 5, 2011 Copyright © 2011, by the author(s). All rights reserved. Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission. Acknowledgement Many wonderful people have contributed to my academic pursuits. Chiefly among them are Richard Karp and Eran Halperin who together coached me through graduate school. Professor Karp, thank you for being very approachable and for constantly asking me to find simple and clear ways to communicate our work. Professor Halperin, thanks for always sharing career advice and for always encouraging me to work on very practical problems. Thanks also go to Yun Song and Michael Jordan whose appreciation of statistics has been a pervasive influence on my work. Perhaps the most formative influence of all, my sister, Kay Kirkpatrick, thank you for your early insistence that I take math courses and your continued mentoring. Algorithms for Human Genetics by Bonnie Beth Kirkpatrick A dissertation submitted in partial satisfaction of the requirements for the degree of Doctor of Philosophy in Computer Science and the Designated Emphasis in Computational and Genomic Biology in the Graduate Division of the University of California, Berkeley Committee in charge: Professor Richard M. Karp, Chair Professor Yun S. Song Professor Michael I. Jordan Professor Rachel Brem Spring 2011 Algorithms for Human Genetics Copyright 2011 by Bonnie Beth Kirkpatrick 1 Abstract Algorithms for Human Genetics by Bonnie Beth Kirkpatrick Doctor of Philosophy in Computer Science University of California, Berkeley Professor Richard M. Karp, Chair Whereas Mendel used breeding experiments and painstakingly counted peas, modern biology increasingly requires computational tools. In the late 1800’s probability and ex- perimental genetics were the critical tools for discovering the gene. Today, the combined use of statistical and computational methods to make genetic and genomic discoveries has increased after the discovery of the DNA double-helix and the development of sequencing methods. By examining relationships among individuals using computational tools, geneti- cists have been able to understand the biological mechanisms that produce genetic diversity, map ancestral movements of populations, reconstruct ancestral genomes, and identify rela- tives. Furthermore, models in genetics have inspired advances in computer science, notably the model for inheritance in families is an early example of a graphical model and helped inspire the sum-product algorithm. The genetic data of interest is single-nucleotide polymorphism (SNP) data, which are positions in the genome known to have nucleotide variation across the population. Humans are diploid individuals having two copies of each chromosome. Data for an individual can come in two forms, either haplotypes or genotypes. The haplotypes are two strings, each giving the sequence of nucleotides that appear together on the same chromosome. The genotypes, for each position in the genome, give an unordered set of nucleotides that appear. In particular the genotype is said to be ‘unphased’ due to the lack of information about which nucleotide appears on which chromosome. In human genetics there are two main ways to model relatedness: evolutionary relation- ships between people and closer, family relationships. Evolutionary relationships, from the domain of population genetics, occur through a distant relative and leave small traces of the relationship in the genome. Family relationships are typically much closer and leave much larger traces in the genome. This thesis examines algorithms for both types of relationships. For evolutionarily related individuals, this thesis presents the perfect phylogeny and co- alescent and then examines two related questions. The first is related to privacy of genetic data used for research purposes. In order to share data from studies while hopefully main- taining the privacy of study participants, geneticists have released the summary statistics of the data. A natural question, whether individuals can be detected in the summary data, is answered in the affirmative by using a perfect phylogeny model. The second question is 2 how to construct perfect phylogenies from haplotypes where there is missing data. We in- troduce a polynomial-time algorithm for enumerating such phylogenies. This algorithm can be used to compute the probability of the data as an expectation over possible coalescent genealogies. Recent relationships are modeled using a family tree, or pedigree graph. Traditionally, geneticists construct these graphs from genealogical records in a very tedious process of examining birth, death, and marriage records. Invariably mistakes are made due to poor record keeping or incorrect paternity information. As an alternative to manual methods, this thesis addresses the problem of automatically constructing pedigree graphs from genetic data. The most obvious way to reconstruct pedigrees from genetic data is to use a structured machine learning approach, similar to phylogenetic reconstruction. That method would involve a search over the space of pedigree graphs where the objective is to find the pedigree graph with the highest likelihood of generating the observed data. Unfortunately, this is not a good way to proceed for two reasons: the space of pedigree graphs is exponential, and the likelihood calculation has exponential running time. The likelihood calculation given genotype data is known to be NP-hard. In an attempt to make use of the likelihood in complex pedigrees, the method PhyloPed uses a Gibbs sampler to infer haplotypes from genotype data. In a second attempt to use likelihood methods, this time for haplotype data, an NP-hardness result is presented. A third attempt to find an efficient algorithm for the likelihood problem results in a state-space reduction method for the pedigree hidden Markov model. Since likelihood-based approaches seem completely infeasible, a completely different ap- proach is introduced. We focus on the problem of inferring relationships between a set of living individuals with available identity-by-descent data. For convenience, we assume that the inferred pedigree is monogamous without inter-generational mating. Two heuristic and practical pedigree reconstruction methods are introduced, one for inbred pedigrees and the other for outbred pedigrees. This work immediately reveals another important problem, that of evaluating the resulting inferred pedigree against a ground-truth pedigree. This can be done either by determining whether the two pedigrees are isomorphic or by finding the edit distance between the two pedigrees. i To my parents, George and Denise Kirkpatrick. ii “A man should learn to detect and watch that gleam of light which flashes across his mind from within, more than the lustre of the firmament of bards and sages. Yet he dis- misses without notice his thought, because it is his.” “There is a time in every man’s education when he arrives at the conviction that envy is ignorance; that imitation is suicide; that he must take himself for better for worse as his portion.” –Ralph Waldo Emerson, Self-Reliance iii Contents List of Figures v 1 Introduction to Human Genetics 1 1.1 Genetics and Inheritance . 1 1.1.1 Genetic Variation . 1 1.1.2 Meiotic Inheritance . 3 1.1.3 Data . 4 1.2 Motivating Questions . 7 1.3 Computational and Statistical Challenges . 8 2 Unrelated Individuals 10 2.1 Populations . 11 2.1.1 Coalescent . 11 2.1.2 Perfect Phylogeny Tree . 13 2.2 Detecting Individuals in Pools . 16 2.2.1 Likelihood-Ratio Test . 18 2.2.2 Perfect Phylogeny Simulation . 18 2.2.3 Estimated Frequencies . 19 2.2.4 Discussion . 22 2.3 Efficiently Constructing Perfect Phylogenies from Binary Characters with Missing Data . 23 2.3.1 Background and Examples . 23 2.3.2 Enumerating Resolutions for Binary Partial Characters under the RDH 25 3 Related Individuals 36 3.1 Introduction . 37 3.1.1 Pedigrees . 37 3.1.2 Inheritance States and Identity by Descent (IBD) . 38 3.1.3 Inheritance Probabilities . 39 3.2 Problems of Interest . 41 3.2.1 The Peeling Algorithm and Elston-Stewart . 43 3.2.2 Hidden Markov Models, Lander-Green, and the Forward-Backward Algorithm . 45 3.3 Hardness . 46 iv 4 Algorithms for Inference 53 4.1 Gibbs Sampler . 53 4.1.1 Methods . 54 4.1.2 Results . 61 4.1.3 Summary . 65 4.2 Haplotype Hidden Markov Model . 67 4.3 State-Space Reduction for HMMs . 69 4.3.1 Introduction . 70 4.3.2 Problem Description . 71 4.3.3 Simulation Results . 78 4.3.4 Summary . 78 5 Algorithms for Pedigree Reconstruction 80 5.1 Introduction . 81 5.2 Pedigree Structure and a Simple Reconstruction Algorithm . 83 5.3 Accuracy Measurements . 86 5.3.1 Isometry between Pedigrees . 86 5.3.2 Edit Distance between Pedigrees . 87 5.4 Two Practical Algorithms for Reconstruction . 97 5.4.1 IBD Model for Constructing Outbred Pedigrees (COP) . 98 5.4.2 IBD Model for Constructing Inbred Pedigrees (CIP) . 98 5.4.3 Heuristic Graph Partitioning Method . 101 5.4.4 Simulation Results . 101 5.5 Discussion . 107 6 Conclusions 109 6.1 Progress on Motivating Questions . 109 6.2 Future Problems . 110 6.2.1 Pedigree Likelihood Calculations . 110 6.2.2 IBD Estimates . 113 Bibliography 115 Index 123 v List of Figures 1.1 Recombination. This figure illustrates the two parental chromosomes and four gametes that result from the recombination event at the indicated recom- bination junction on the parental chromosomes. The parental chromosomes are homologous and distinguished by their colors.