1000 Genomes-Based Imputation Identifies Novel and Refined

European Journal of Human Genetics (2012) 20, 801–805 & 2012 Macmillan Publishers Limited All rights reserved 1018-4813/12 www.nature.com/ejhg SHORT REPORT 1000 Genomes-based imputation identifies novel and refined associations for the Wellcome Trust Case Control Consortium phase 1 Data Jie Huang1, David Ellinghaus2, Andre Franke2, Bryan Howie3 and Yun Li*,4 We hypothesize that imputation based on data from the 1000 Genomes Project can identify novel association signals on a genome-wide scale due to the dense marker map and the large number of haplotypes. To test the hypothesis, the Wellcome Trust Case Control Consortium (WTCCC) Phase I genotype data were imputed using 1000 genomes as reference (20100804 EUR), and seven case/control association studies were performed using imputed dosages. We observed two ‘missed’ disease- associated variants that were undetectable by the original WTCCC analysis, but were reported by later studies after the 2007 WTCCC publication. One is within the IL2RA gene for association with type 1 diabetes and the other in proximity with the CDKN2B gene for association with type 2 diabetes. We also identified two refined associations. One is SNP rs11209026 in exon 9 of IL23R for association with Crohn’s disease, which is predicted to be probably damaging by PolyPhen2. The other refined variant is in the CUX2 gene region for association with type 1 diabetes, where the newly identified top SNP rs1265564 has an association P-value of 1.68Â10À16. The new lead SNP for the two refined loci provides a more plausible explanation for the disease association. We demonstrated that 1000 Genomes-based imputation could indeed identify both novel (in our case, ‘missed’ because they were detected and replicated by studies after 2007) and refined signals. We anticipate the findings derived from this study to provide timely information when individual groups and consortia are beginning to engage in 1000 genomes-based imputation. European Journal of Human Genetics (2012) 20, 801–805; doi:10.1038/ejhg.2012.3; published online 1 February 2012 Keywords: genome-wide association study; the 1000 Genomes project; imputation INTRODUCTION plausible gene with a more significant P-value than that without 1000 It has been four years since the publication of one of the first and genomes imputation. The Sardiana study9 fully utilized the reference largest genome-wise association studies (GWAS), the Wellcome Trust panels from HapMap2, HapMap3, and the 1000 genomes, but did not Case Control Consortium (WTCCC) study.1 Although imputation- specifically evaluate the power gains from each panel. It also has a based analysis was already adopted at that time for refining association smaller discovery sample size (N o1700) compared with the WTCCC signals, its use was limited and few results were reported based on study. In contrast, our study is hypothesis driven, based on a flagship imputations. In one study, an imputed missense mutation in gene study with rich phenotypes and a large sample size. We reason it is GCKR is found to be more strongly associated with triglyceride.2 very important to confirm that genotype imputation based on the Imputation significantly increases the statistical power of GWAS latest 1000 genomes release could identify novel variants on a genome- and allows meta-analysis of studies genotyped on different plat- wide scale, beside refining associations at a regional level, given the forms.3–5 Until recently, most imputation work has been using large amount of efforts needed for scientists around the globe poised HapMap haplotypes as a reference panel.6 However, the recent to apply this emerging tool to their scientific investigation. publication of the 1000 Genomes Pilot Project and the availability We hypothesize that 1000 genomes-based imputation can identify of phased haplotypes from both the pilot and main project bring novel variants beyond what could be seen from purely genotyped data opportunities for much denser imputations and more extensive or HapMap imputed data, due to the much denser SNP coverage and analysis.7 The 1000 genomes-based imputation has already been a much broader representation of reference populations. We test this shown to refine association signals and identify underlying genetic hypothesis by re-examining the WTCCC Phase I data after imputing variants that are in high linkage disequilibrium (LD) with variants that the genotype data to the full set of SNPs present in 1000 genomes were included in the genotyping platform.7 However, there have been latest release (version 20100804). We re-run association analysis for few reports of novel findings using this latest imputation approach on the seven traits based on 1000 genomes imputed dosages and highlight a genome-wide scale.8,9 The Oxford-GSK study8 used 1000 genomes- novel and refined genetic associations that would have been discovered based imputation to refine a single genetic region and successfully by the original study should the 1000 genomes reference panel identified a SNP that is in the promoter region of a biologically be available back then. 1Wellcome Trust Sanger Institute, Hinxton, Cambridge, UK; 2Institute of Clinical Molecular Biology, Christian-Albrechts-University, Kiel, Germany; 3Department of Human Genetics, University of Chicago, Chicago, IL, USA; 4Department of Genetics, Department of Biostatistics, University of North Carolina, Chapel Hill, NC, USA *Correspondence: Dr Y Li, Department of Genetics, Department of Biostatistics, University of North Carolina Chapel Hill, Chapel Hill, NC, 27599-7264, USA. Tel: +919 843 2832; Fax: +919 843 4682; E-mail: [email protected] Received 7 September 2011; revised 30 December 2011; accepted 4 January 2012; published online 1 February 2012 802 European Journal of Human Genetics 6 233 112 SNPs included in our analysis, atSNP the is not exonic and not predicted to have functional consequence. red color in Figure 1). For the refined originally reported. For the refined HapMap2 imputed SNP is less significant than the genotyped SNP program (BEAGLE) lead SNP was re-imputed and then analyzed with an independentthe imputation SNPs included in our analysis. dent analyst (DE; Supplementary Section S2). significant threshold of 5 have been identified with HapMap2-based analysis using BEAGLE and PLINKmethod (JH, (Table DE). 1). Alltwo four refined variants loci that were wethose confirmed identified in through by the this an original latest WTCCC imputation independent are study, shown and in highlighted Supplementary two Figure S1. missedgenomic We and compared inflation. the Association signals Manhattan with plots forto all 1.09, seven which analyses is comparable to the original study and indicates low MiniMac for genotype imputation. reference, we mapped all SNPs to NCBI’s build37 (hg19). inflation factor 4 tagging SNPs using a greedy algorithm similarwould to be that sufficiently conservative in in 1000 genomes imputed analysis, we picked are analysis. reaching the genome-wide significance level (5 but the new lead SNPwhere has there a is a reporteddesignated as association in ‘missed’ instead the of same ‘novel’. Welater define region ‘refined’ in for reported any the by association original study in other the studies original after study that the analyzed original genotyped data. 2007 Novel WTCCC variants paper that were are amino-acid substitutions more relevant gene. Wethe used lead SNP PolyPhen-2 has to eithernumber predict a of the functional new support possible SNPs or tested. impact is For of pinpointing ‘refined’ to association, we biologically further require that all SNPs withanalysis estimated approach presentedwith by the the shared original(thus control taking WTCCC imputation without uncertainty study. into covariatea account) We for adjustment, logistic each included a regression of analysis the similarence seven haplotypes based statistical of traits on the European A imputation ancestry served totaledu/csg/abecasis/MACH/download/1000G-2010-08.html). dosages as of the via 566 reference refer- panel. MACH2DAT We ran University of Michigan Abecasis lab, version 20100804and (http://www.sph.umich. 200 states.approach The and recommended parameters 1000 of 20 genomes iterations of the reference Markov sampler panel was obtained from the 6 233 112 SNPs withinput estimated genotype databuild37, for a total imputation. of AfterAfter 389 827 imputation, SNPs SNP for a quality 16RESULTS 179 total control samples of were and retained as mapping of genomic positions to difference. that the results are similar as thosedisease reported status, in we the ran original study the with case/control association negligible testsin with Online PLINK Supplementary Section and S1). verified Using thiscontrols genotype similar data with as embedded the original studydata to set, and the we downloaded files created one (detailsWe obtained provided harmonized approval genotype for data using set the by raw applying genotypeMETHODS data quality for the original WTCCC 0.01 were used for association analysis. The estimated genomic A total of 1 915 543 tagging SNPs were picked for the total of As shown in the regional plots, the two missed variants would not For all genetic loci identified as novel or refined, a 5-Mb region around the We used the MaCH program to first phase the haplotypes and then ran SNPs were considered in the same region if they are within the same gene or o 1 Mb apart. We define ‘novel’ for any SNP with association 10 To match the genomic position used by the 1000 genomes l 15 14 for the seven case/control GWAS ranges from 1.04 and association analysis tool (PLINK) Â in silico R 10 2 P -value more significant even after correcting the À 4 8 0.3 and minor allele frequency widely used in HapMap2 imputed analysis Powerful association using 1000 genomes imputation . 12 R 11 2 To evaluate whether the genome-wide We used the recommended two-step 4 IL23R 0.3 and minor allele frequency Â 10 region, the best HapMap2 16 À CUX2 8 imputation (shown as ) in a region not reported R ldselect 2 region, the best threshold of 0.9.

1000 Genomes-Based Imputation Identifies Novel and Refined

Impact of Imputation Methods on the Amount of Genetic Variation Captured by a Single-Nucleotide Polymorphism Panel in Soybeans A

Imputation-Based Assessment of Next Generation Rare Exome Variant Arrays

Allele Imputation and Haplotype Determination from Databases Composed of Nuclear Families by Nathan Medina-Rodríguez and Ángelo Santana

Identity-By-Descent Estimation with Population- and Pedigree-Based Imputation in Admixed Family Data Mohamad Saad

Genotype Imputation Via Matrix Completion

Impact of Imputation Methods on the Amount of Genetic Variation Captured by a Single-Nucleotide Polymorphism Panel in Soybeans A

References for Haplotype Imputation in the Big Data

Accurate Genotype Imputation in Multiparental Populations

The Use of Genomic Data and Imputation Methods in Dairy Cattle Breeding

Imputation Accuracy from Low to Moderate Density Single Nucleotide Polymorphism Chips in a Thai Multibreed Dairy Cattle Population

Odyssey: a Semi-Automated Pipeline for Phasing, Imputation, and Analysis of Genome-Wide Genetic Data Ryan J

Genotype Imputation for Genome-Wide Association Studies