Imputation of Genotype Data

Introduction to theory and implementation of Genomic Selection
Vienna, 11 March 2015
Marco Bink & Rianne van Binsbergen

Background

Scientist Statistical Genetics
“Development and application of statistical methods to model and predict complex traits from genotypic and other types of high-throughput data, in plants & animals”

3rd year PhD student
“Investigate the benefits of using whole genome sequence data in selection and breeding of animals and plants”
. Study accuracy of genotype imputation in case of whole-genome sequence data
. Study accuracy of genomic prediction based on imputed sequence data

Overview
. Context in plants
. What is genotype imputation and why is it needed?
. Which cases of genotype imputation?
. What methods are available?
. How to measure correctness of imputation?
. What factors affect imputation?
. What are the current challenges?
. Impact on Genomic Prediction?

Introduction - Plants: Numerous and huge variation
. Model plant: Arabidopsis (www.1001genomes.org)
. Field crops: Maize, soya, rice, barley, wheat (grasses)
. Tuber crops: Potato
. Vegetable crops: Tomato, lettuce, cucumber
. Fruit trees: Apple, peach, cherry
. Forestry trees: Loblolly pine, maritime pine
. Etc.
. Genome complexity
  . Length
  . Ploidy-level
  . Duplication
  . Repeats
  . Heterogeneity

Example: Genome of Maritime pine (Pinus pinaster)
EU FP7 project ProCoGen
. Diploid, 2n = 24
. Despite this mega-genome, the genetic size is similar to that found in other plants (~150 cM / chromosome)

Impact on genotyping resources
[Figure: levels of genotyping resources against the number of plant species]
. Re-sequencing on multiple accessions
. Reference genome available
. (Commercial) SNP arrays available
. Scarcely any markers available

What is genotype imputation?
“The statistical inference of unobserved genotypes”
or
“The process of predicting genotypes that are not directly assayed in a sample of individuals”

Cases of genotype imputation
1. Impute randomly missing genotypic data
2. Impute genotypic data for alignment of different SNP arrays
3. Impute genotypic data from low-density SNP array to high-density SNP array
4. Impute genotypic data from low coverage sequencing data

[Figure: COMPLETE genotypic data shown next to one panel per case: 1. Impute randomly missing genotypic data; 2. Impute for alignment of different SNP arrays; 3. Impute from low- to high-density; 4. Impute from low coverage sequencing data]

1. Randomly missing genotypic data
. Low percentage of genotypes have not been called, e.g., <5%
. “Missing At Random”
[Figure: genotypes of Individuals 1 to 4 with scattered missing calls (?) that are filled in by imputation]

2. Alignment of different SNP arrays
. Different genotype platforms are used, with only a certain percentage of SNPs common across platforms
. “Missing Not At Random”
[Figure: Individuals 1 and 2 (platform A) and Individuals 3 and 4 (platform B), each with missing calls (?) that are filled in by imputation]

3. Impute from low-density to high-density
. Save money: genotyping at high density is often more expensive than genotyping at low density
. Major incentive: Genomic Prediction
  • Low density array to high density array
  • High density array to whole genome sequence data
[Figure: reference individuals genotyped at high density; an individual genotyped at low density with missing calls (?); the same individual with imputed genotype]

4. Low coverage sequencing data
Coverage = average number of reads per nucleotide
(Figure adapted from: Commins, et al. Biological Procedures Online 2009, 11:52-78)
. A large part of the genome might not be covered
  ● Varies among individuals and among loci
  ● Data are Missing At Random
. Uncertainty in genotype calls!!
. No methods developed (YET) to deal with these data specifically
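
To make the coverage definition for low coverage sequencing data concrete, here is a minimal Python sketch. It is not from the slides: it assumes reads fall uniformly at random on the genome (a Poisson idealisation), that both alleles of a heterozygote are sampled with equal probability, and that there are no sequencing errors. The function names and the example numbers are hypothetical.

```python
import math

def mean_coverage(n_reads: int, read_length: int, genome_size: int) -> float:
    """Coverage = average number of reads per nucleotide."""
    return n_reads * read_length / genome_size

def expected_uncovered_fraction(coverage: float) -> float:
    """Expected fraction of sites hit by zero reads (assumes Poisson read depth)."""
    return math.exp(-coverage)

def prob_het_looks_homozygous(depth: int) -> float:
    """Chance that all `depth` reads at a heterozygous site show the same allele.

    Assumes each read samples either allele with probability 0.5.
    """
    if depth < 1:
        raise ValueError("depth must be at least 1")
    return 0.5 ** (depth - 1)

# Hypothetical run: 10 million reads of 100 bp on a 1 Gb genome.
c = mean_coverage(10_000_000, 100, 1_000_000_000)
print(f"coverage = {c:.1f}x, "
      f"expected uncovered fraction = {expected_uncovered_fraction(c):.1%}")
# prints: coverage = 1.0x, expected uncovered fraction = 36.8%
print(f"het site hidden at depth 4: {prob_het_looks_homozygous(4):.1%}")
# prints: het site hidden at depth 4: 12.5%
```

Under these assumptions, 1x coverage already leaves roughly a third of the sites without any read, and a heterozygote seen through only a few reads can easily look homozygous. That is the kind of missingness and genotype-call uncertainty the bullets above refer to.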
Imputation methods
. Naïve approaches
. General statistical approaches
. Population-based approaches
. Family-based approaches

Naïve approaches
Useful if the software in a subsequent step cannot handle data with missing genotypes
Examples:
- Marker mean value
  • Based on allele frequencies
- Heterozygous value (inbred lines)

General statistical approaches
Random forest
- Machine-learning algorithm that uses an ensemble of decision trees
- Average of multiple trees is the predicted value
k-nearest neighbour (Huang, et al. 2010. Nat Genet 42: 961-967)
a. For a given individual and a single SNP, similarity scores are calculated:
   - 1 if identical genotypes
   - 0.5 if missing genotypes
   - [value lost in extraction] if different genotypes
b. For the whole population of individuals, a matrix of similarity scores is obtained
c. The nearest neighbours of an individual are the individuals with the highest scores
Missing genotypes of an individual are estimated as the major allele of its nearest neighbours (if its frequency exceeds a threshold; the value was lost in extraction)

Population-based approaches
. Commonly used in human studies
. Based on short-range LD information in a population
. Often based on Hidden Markov models (HMM)
  - Relate an observed process (i.e. observed unphased genotypes) to an underlying unobserved or hidden state of interest (i.e. haplotype phase and true genotypes)
. Several methods available
  - Differ in accuracy, computational requirements, speed, ...
    o Beagle (Browning and Browning. 2009. AJHG 84: 210-223)
    o Impute v2 (Howie, et al. 2009. PLoS Genet 5: e1000529)
    o fastPHASE (Scheet and Stephens. 2006. AJHG 78: 629-644)
    o ...

E.g. Beagle (Browning and Browning. 2009. AJHG 84: 210-223)
. The corresponding pair of haplotype clusters will be merged if the smallest merging score < threshold
. Merging score ~ probability that allele sequences at markers l+1, l+2, … differ

Family-based approaches
. Utilize pedigree information (long-range LD)
. Possible to impute ungenotyped individuals
. Several methods available
  - Differ in accuracy, computational requirements, speed, ...
    o findhap (VanRaden, et al. 2011. Genet Sel Evol 43: 10)
    o AlphaImpute (Hickey, et al. 2012. Genet Sel Evol 44: 9)
    o FImpute (Sargolzaei, et al. 2014. BMC Genomics 15: 478)
    o ...

E.g. AlphaImpute (Hickey, et al. 2011. Genet Sel Evol 43: 1-13)
. Expansion of the Long-range phasing (LRP) algorithm: a heuristic method for phasing of marker genotypes, which uses information from both related and seemingly unrelated individuals
. Invokes the concepts of surrogate parents and Erdös numbers
. Combined with haplotype library imputation
. Homozygous loci across both core and tails ⇒ define surrogates

Measures of imputation correctness
. Imputation error rate: % of alleles (or genotypes) imputed incorrectly
. Imputation accuracy: correlation between true and imputed genotypes

Measures of imputation correctness
Validation by imputing “masked” genotypes
[Figure: validation scheme with a Reference set and a Validation set: put SNPs to missing, run imputation, compare the imputed genotype with the masked one, and calculate imputation error rate or imputation accuracy]

I. Impute randomly missing genotypic data
. Typically: some percentage of genotypes have not been called by the genotyping algorithm.
  ● Proportion of uncalled genotypes is usually low (e.g., <5%)
  ● Missing genotypes can be regarded as occurring at random
. Maize (Hickey et al., 2012)
  ● 35,081 SNPs uniquely mapped on the physical map
  ● 1163 inbred lines, mostly from (sub)tropical germplasm
  ● IMPUTE2 software

III. Impute genotypic data from low-density SNP array to high-density SNP array
Cost-saving genotyping strategy:
• High-Density SNP Platform (HDP)
  • Key individuals, e.g., ancestors of selection candidates or mapping parents
• Low-Density SNP Platform (LDP) (= subset of HDP!)
  • Others, e.g., selection candidates or mapping progeny
. Maize (Hickey et al., 2012)
  ● 35,081 SNPs uniquely mapped on the physical map
  ● 1163 inbred lines, mostly from (sub)tropical germplasm
  ● Masking 50%, 75%, ..., 99.2% (275 SNPs left!)

III. Impute genotypic data from LDP to HDP
Maize (Hickey et al., 2012)
. Proportion of correctly imputed genotypes is severely affected by (low) MAF

III. Impute genotypic data from LDP to HDP
EU FP7 project FruitBreedomics
Apple:
. Well documented pedigree records + old cultivars still available (simple via clones)
. 1285 individuals descending from 42 founders
[Figure: apple pedigree chart; node labels include Delicious, Jonathan, McIntosh, Cox, Braeburn, Fuji, Idared, the F_ and PRI selections, and a “22 Full Sib” label]
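
The masking validation described under "Measures of imputation correctness" can be sketched in a few lines of Python. This is only an illustration under stated assumptions: genotypes are coded 0/1/2, the imputation step is the naïve per-marker fill-in (most common observed genotype) standing in for real software such as IMPUTE2, and the toy matrix and all function names are invented. It is not the procedure used in the maize or apple studies.

```python
import math
import random

def mask_genotypes(genotypes, fraction, rng):
    """Copy the matrix and set a random `fraction` of the calls to None."""
    cells = [(i, j) for i, row in enumerate(genotypes) for j in range(len(row))]
    masked = rng.sample(cells, round(fraction * len(cells)))
    copy = [list(row) for row in genotypes]
    for i, j in masked:
        copy[i][j] = None
    return copy, masked

def impute_marker_mode(genotypes):
    """Naive fill-in: most common observed genotype per marker (column)."""
    filled = [list(row) for row in genotypes]
    for j in range(len(genotypes[0])):
        observed = [row[j] for row in genotypes if row[j] is not None]
        # Ties go to the smallest genotype code; 0 if the marker is fully missing.
        fill = max(sorted(set(observed)), key=observed.count) if observed else 0
        for row in filled:
            if row[j] is None:
                row[j] = fill
    return filled

def error_rate(true, imputed, masked):
    """Fraction of masked genotypes that were imputed incorrectly."""
    return sum(true[i][j] != imputed[i][j] for i, j in masked) / len(masked)

def accuracy(true, imputed, masked):
    """Pearson correlation between true and imputed masked genotypes."""
    x = [true[i][j] for i, j in masked]
    y = [imputed[i][j] for i, j in masked]
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")  # correlation is undefined without variation
    return sxy / math.sqrt(sxx * syy)

# Toy data: 6 individuals x 5 SNPs, values invented for illustration.
true_genotypes = [
    [0, 2, 1, 0, 2],
    [0, 2, 1, 1, 2],
    [0, 0, 2, 0, 2],
    [2, 2, 1, 0, 1],
    [0, 2, 1, 0, 2],
    [1, 2, 0, 0, 2],
]
masked_data, masked_cells = mask_genotypes(true_genotypes, 0.2, random.Random(1))
imputed = impute_marker_mode(masked_data)
print("error rate:", error_rate(true_genotypes, imputed, masked_cells))
print("accuracy:  ", accuracy(true_genotypes, imputed, masked_cells))
```

A per-marker mode always predicts the common genotype, so under this naïve scheme the errors concentrate on the rarer genotypes. That is one simple way to see why the proportion of correctly imputed genotypes and the correlation-based accuracy can behave differently at low MAF.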
