Fast and Accurate 1000 Genomes Imputation Using Summary Statistics Or Low‐Coverage Sequencing Data

Bogdan Pasaniuc
Harvard School of Public Health; Geffen School of Medicine at UCLA

GWAS study designs
[Figure: genotype matrices for three study designs, with 1000 Genomes reference haplotypes used for imputation]
Design                     r2 (true, inferred genotypes)
Sequencing: >30x           ~1
Illumina 1M                ~0.9
Sequencing: 4x, <1x?       ~0.7 (0.24x)
• High coverage & lots of samples: too expensive

Imputation from reference panels improves power in GWAS
Howie et al. 2009 PLoS Genet; also see Marchini et al. 2007 Nat Genet

Genotype imputation: “two‐thousand and late”?
1. Key ingredient for increasing power in GWAS [Marchini & Howie, Nat Rev Genet 2010, …]
2. Enables powerful meta‐analyses [Yang et al., Nat Genet 2012; Okada et al., Nat Genet 2012, …]
3. Accurate genotype calls from sequencing data [1000 Genomes Project; Pasaniuc et al., Nat Genet 2012, …]

Array-based imputation: existing methods are accurate but slow
Number of CPU days needed to impute 11.6 million SNPs using a 1000G reference panel of 292 European samples:

Method                          N=10,000 samples    N=50,000 samples
Impute1 (1)                     9,000 days          45,000 days
BEAGLE (2)                      2,500 days          12,500 days
Impute2 (3)                     1,000 days          5,000 days
Impute2 with pre‐phasing (4)    200 days            1,100 days

(1) Marchini et al. 2007 Nat Genet
(2) Browning et al. 2009 Am J Hum Genet
(3) Howie et al. 2009 PLoS Genet
(4) Howie et al. 2012 Nat Genet

Imputation: limitations
• Imputation requires a lot of runtime
• Existing methods cannot be applied to summary statistics directly
  – Individual‐level genotype data is required
  – Obtaining individual‐level data is a challenge in meta‐analysis
• Can we test untyped markers for association without access to individual‐level data?

Array-based imputation: why not use a Gaussian approach?
• Data at nearby SNPs is correlated (linkage disequilibrium)
• Best‐performing methods use HMMs to model haplotype structure in the population
• Model correlations among SNPs with a multivariate Gaussian
• We assume X ~ N(μ, Σ)
• μ, Σ known from the 1000G reference panel
• LD blocks: windows of fixed length (e.g. 0.5 Mb)
• X = individual genotypes (standard imputation)
• X = association statistics (summary‐level imputation)
Conneely & Boehnke AJHG 2007, Wen & Stephens 2010, Kostem et al. GenEpi 2011, Han et al. PLoS Genet 2009, Zaitlen & Pasaniuc et al. AJHG 2010, …

Gaussian imputation
• Step 1. Infer mean μ and covariance Σ for summary data from the reference panel
  – Allele frequencies:
      Σ(p_i, p_j) = 1/(2N−1) (p_ij − p_i p_j)
      μ = population allele frequencies
  – Association z‐scores:
      Σ(z_i, z_j) = r_ij (correlation coefficient)
      μ = 0 (null)
  – Σ is an M×M matrix with entries Σ(p_1, p_1) … Σ(p_1, p_M); …; Σ(p_M, p_1) … Σ(p_M, p_M)
• Multivariate central limit theorem
  – X = summary statistics over a sample of haplotypes
  – X ~ N(μ, Σ)
• Allele frequencies in GWAS data follow N(μ, Σ)
  – (μ, Σ) inferred from the reference panel
• Step 2. Infer the conditional distribution of unobserved given typed
  – p_t = observed frequencies at the typed subset of SNPs
  – X_i = frequencies at the rest of the SNPs
  – The conditional distribution X_{i|t} is also Gaussian: X_{imputed|typed} ~ N(μ_{i|t}, Σ_{i|t})

Conditional distribution is analytically derived
• Conditional distribution is also Normal
  – X_{imputed|typed} ~ N(μ_{i|t}, Σ_{i|t})
  – μ_{i|t} = μ_i + Σ_{i,t} Σ_{t,t}^{-1} (p_t − μ_t)
      μ_i: population frequency
      Σ_{i,t}: correlation among typed & imputed SNPs
      Σ_{t,t}: correlation among typed SNPs
      (p_t − μ_t): deviation from the population frequency at typed SNPs
  [Lynch & Walsh, Genetics and Analysis of Quantitative Traits, 1998]
• Linear transformation with weights pre‐computed from the reference panel
• (Prediction world) Best Linear Unbiased Predictor (BLUP) [Henderson 1975 Biometrics]
• For N=2: imputation of individual‐level data
• Similar approaches but with different var/cov: Best Linear IMPutation (BLIMP) [Wen & Stephens 2010]
• Step 3.
  – Derive μ_{i|t} and use as imputation
• Step 4.
  – Compute association statistics over imputed frequencies (ImpG‐summary)
  – Impute individual‐level data (n=2) (ImpG)
  – Use conditional variance as measure for accuracy

Simulations
• 1000 Genomes data
• 292 Europeans used as reference
• The rest used to simulate case‐control data sets
• HAPGEN [Spencer et al. PLoS Genet 2009]
• Randomly selected 0.5 Mb loci from Chr 1
• Illumina 1M SNPs for array imputation
• Armitage Trend Test for case‐control association (Armitage 1955 Biometrics)
Methods compared:
• Beagle imputation
  – Imputes genotypes
  – Requires individual‐level data
• ImpG
  – Imputes genotypes
  – Requires individual‐level data
• ImpG‐summary
  – Imputes frequencies
  – z‐scores over imputed frequencies
  – Does not need individual‐level data

No inflation under null (odds ratio = 1)
[Figure: series shown are Typed (no imputation), Beagle, ImpG, ImpG‐summary]

Accurate 1000G imputation using Gaussian approach (ImpG)
Average ratio of χ2 statistics for imputed vs. true genotypes in simulations of 1K cases + 1K controls (odds ratio = 1.5):

Method    All SNPs (MAF>1%)    Common SNPs (MAF>5%)    Low‐freq SNPs (1%<MAF<5%)
BEAGLE    0.87                 0.89                    0.65
ImpG      0.85                 0.87                    0.59

• Or, run ImpG genome-wide, then run BEAGLE only on regions of significant or suggestive association.

Accurate 1000G imputation using summary statistics (ImpG-summary)
Average ratio of χ2 statistics for imputed vs. true genotypes in simulations of 1K cases + 1K controls (odds ratio = 1.5):

Method          All SNPs (MAF>1%)    Common SNPs (MAF>5%)    Low‐freq SNPs (1%<MAF<5%)
BEAGLE          0.87                 0.89                    0.65
ImpG            0.85                 0.87                    0.59
ImpG‐summary    0.82                 0.84                    0.52

• Consortia can impute meta-analysis summary statistics into new reference panels without having to repeat imputation separately in each individual cohort.

Accurate 1000G imputation when imputing GBR from rest of EUR
• Used all Great Britain data from 1000G for simulations and the rest as reference panel
• Average ratio of χ2 statistics for imputed vs. true genotypes in simulations of 1K cases + 1K controls (odds ratio = 1.5):

Method          All SNPs (MAF>1%)    Common SNPs (MAF>5%)    Low‐freq SNPs (1%<MAF<5%)
BEAGLE          0.867                0.888                   0.630
ImpG            0.842                0.867                   0.570
ImpG‐summary    0.816                0.843                   0.516

Gaussian imputation is extremely fast
Number of CPU days needed to impute 11.6 million SNPs using a 1000G reference panel of 292 European samples:

Method                      N=10,000 samples    N=50,000 samples
Impute1                     9,000 days          45,000 days
BEAGLE                      2,500 days          12,500 days
Impute2                     1,000 days          5,000 days
Impute2 with pre‐phasing    200 days            1,100 days
ImpG                        4 days              20 days
ImpG‐summary                0.4 days            0.4 days

Note: ImpG/ImpG-summary running time ~ (#reference samples); BEAGLE and Impute2 running time ~ (#reference samples)^2

Sequencing-based imputation is different from array-based imputation
[Figure: reference haplotypes (0/1 alleles) shown above low-coverage sequencing genotypes (0/1/2) with many missing calls (?)]
Li et al.
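
The Gaussian imputation steps described in the slides (Steps 1 to 3, applied to association z‐scores) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the ImpG implementation: the function name impute_zscores, the toy reference panel ref_haps, and the ridge term lam are all hypothetical. The slides do not mention regularization; lam is added here only to keep Σ_{t,t} invertible for a small panel. The conditional variance uses the standard multivariate-normal result Σ_{i|t} = Σ_{i,i} − Σ_{i,t} Σ_{t,t}^{-1} Σ_{t,i}, which the slides name but do not write out.

```python
import numpy as np

def impute_zscores(ref_haps, typed_idx, z_typed, lam=0.1):
    """Conditional-mean imputation of z-scores at untyped SNPs (sketch).

    ref_haps  : (n_haplotypes, M) array of 0/1 reference alleles
    typed_idx : indices of SNPs with observed z-scores
    z_typed   : observed z-scores at those SNPs
    lam       : hypothetical ridge term (an assumption, not from the slides)
    """
    M = ref_haps.shape[1]
    typed_idx = np.asarray(typed_idx)
    untyped_idx = np.setdiff1d(np.arange(M), typed_idx)

    # Step 1: Sigma(z_i, z_j) = r_ij from the reference panel; mu = 0 under the null.
    # Every SNP must be polymorphic in the panel, otherwise r_ij is undefined.
    sigma = np.corrcoef(ref_haps, rowvar=False)

    sigma_tt = sigma[np.ix_(typed_idx, typed_idx)] + lam * np.eye(len(typed_idx))
    sigma_it = sigma[np.ix_(untyped_idx, typed_idx)]

    # Steps 2-3: mu_{i|t} = Sigma_it Sigma_tt^{-1} z_t (mu_i = mu_t = 0).
    # The weights depend only on the reference panel, so they can be precomputed.
    weights = np.linalg.solve(sigma_tt, sigma_it.T).T
    z_imputed = weights @ np.asarray(z_typed, dtype=float)

    # Diagonal of the conditional covariance: 1 - diag(Sigma_it Sigma_tt^{-1} Sigma_ti).
    cond_var = 1.0 - np.einsum("ij,ij->i", weights, sigma_it)
    return untyped_idx, z_imputed, cond_var

# Toy panel of 6 haplotypes x 4 SNPs: SNP 1 copies SNP 0, SNP 2 is its complement.
ref_haps = np.array([
    [1, 1, 0, 1],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
])

idx, z_hat, var_hat = impute_zscores(ref_haps, typed_idx=[0, 3], z_typed=[3.0, 0.5])
print(idx, z_hat, var_hat)
```

Because the weights depend only on the reference panel and the set of typed SNPs, they can be computed once per window and reused for any vector of z‐scores. That fits the slides' observation that ImpG‐summary running time does not grow with the number of GWAS samples.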
