Published online 2 May 2019 Nucleic Acids Research, 2019, Vol. 47, No. 14 e79 doi: 10.1093/nar/gkz320 TAGOOS: genome-wide supervised learning of non-coding loci associated to complex phenotypes Aitor Gonzalez´ *, Marie Artufel and Pascal Rihet
Aix Marseille Univ, INSERM, TAGC, Turing Center for Living Systems, 13288 Marseille, France
Received January 29, 2019; Revised April 07, 2019; Editorial Decision April 11, 2019; Accepted April 18, 2019 Downloaded from https://academic.oup.com/nar/article/47/14/e79/5482505 by guest on 27 September 2021 ABSTRACT The large majority of associated SNPs fall in intronic and intergenic regions (6). This association is likely mediated Genome-wide association studies (GWAS) associate through gene regulatory regions and enhancers, as for in- single nucleotide polymorphisms (SNPs) to complex stance, non-coding associated loci are enriched in eQTLs phenotypes. Most human SNPs fall in non-coding and DNase I hypersensitive sites (DHS) (7,8). Enhancers regions and are likely regulatory SNPs, but linkage of gene expression are non-coding regions at any distance disequilibrium (LD) blocks make it difficult to distin- of regulated genes with particular chromatin configurations guish functional SNPs. Therefore, putative functional that facilitate transcription factor (TF) binding and gene SNPs are usually annotated with molecular mark- expression (9). The chromatin of enhancers is DNase I ers of gene regulatory regions and prioritized with hypersensitive, marked by particular histone modifications dedicated prediction tools. We integrated associated such as histone H3 lysine 4 monomethylation (H3K4me1) SNPs, LD blocks and regulatory features into a su- and H3K4me2, histone H3 lysine 27 acetylation (H3K27ac) marks and produces enhancer RNAs. The three dimen- pervised model called TAGOOS (TAG SNP bOOSting) sional structure of the chromosome then brings closer en- and computed scores genome-wide. The TAGOOS hancers and transcription start sites, for the TFs to recruit scores enriched and prioritized unseen associated RNA polymerase and promote gene transcription (10). SNPs with an odds ratio of 4.3 and 3.5 and an area The resolution of associated loci is limited by linkage dis- under the curve (AUC) of 0.65 and 0.6 for intronic equilibrium (LD) blocks of correlated SNPs. In addition, and intergenic regions, respectively. The TAGOOS mechanistic understanding of the association requires the score was correlated with the maximal significance analysis of the loci at the molecular level. To prioritize and of associated SNPs and expression quantitative trait understand functional non-coding associated loci, the sim- loci (eQTLs) and with the number of biological sam- plest option is to annotate the loci with key regulatory fea- ples annotated for key regulatory features. Analysis tures such as H3K27ac or ChromHMM chromatin states of loci and regions associated to cleft lip and human (11). Regarding dedicated computational tools, there are two broad families to prioritize functional loci in LD blocks adult height phenotypes recovered known functional associated to complex diseases (Supplementary Tables S1 loci and predicted new functional loci enriched in and S2). The first family of computational tools, which transcriptions factors related to the phenotypes. we call GWAS loci priorization approaches, link GWAS In conclusion, we trained a supervised model loci, LD blocks and regulatory annotations. Tools like Fun- based on associated SNPs to prioritize putative func- ciSNP or HaploReg retrieve SNPs above an LD cutoff and tional regions. The TAGOOS scores, annotations annotate them with gene regulatory annotations (12,13) and UCSC genome tracks are available here: https: (Supplementary Table S1). More elaborated methods such //tagoos.readthedocs.io. as GWAS3D, GREGOR and GenoWAP calculate proba- bilities for the set of GWAS loci, which take into account the enrichment of the GWAS loci annotated with regula- INTRODUCTION tory features within the LD blocks (Supplementary Table Complex human phenotypes arise from a contribution of S1) (4,14,15). These methods usually do not provide pre- genetic and environmental factors. Genome-wide associa- calculated scores genome-wide and therefore they cannot be tion studies (GWAS) is a key technique to link genetic loci easily compared to each other. The second family of com- to human phenotypes (1). For instance, a single GWAS de- putational tools, which we call regulatory variant prioriza- tected more than 400 loci associated to adult human height tion approaches, use statistical or classification methods to (2). Given the large amount and importance of GWAS, sev- learn models from regulatory variants with known medical eral databases have compiled GWAS results (3–5) impact based on signatures of regulatory annotations. An
*To whom correspondence should be addressed. Tel: +33 491 82 87 48; Fax: +33 491 82 87 01; Email: [email protected]