Identifying Risk Genes and Interpreting Pathogenesis for Parkinson's Disease by a Multiomics Analysis
Total Page:16
File Type:pdf, Size:1020Kb
G C A T T A C G G C A T genes Article Identifying Risk Genes and Interpreting Pathogenesis for Parkinson’s Disease by a Multiomics Analysis Wen-Wen Cheng, Qiang Zhu * and Hong-Yu Zhang Hubei Key Laboratory of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan 430070, China; [email protected] (W.-W.C.); [email protected] (H.-Y.Z.) * Correspondence: [email protected]; Tel.: +86-27-8728-5085 Received: 30 August 2020; Accepted: 18 September 2020; Published: 21 September 2020 Abstract: Genome-wide association studies (GWAS) have identified tens of genetic variants associated with Parkinson’s disease (PD). Nevertheless, the genes or DNA elements that affect traits through these genetic variations are usually undiscovered. This study was the first to combine meta-analysis GWAS data and expression data to identify PD risk genes. Four known genes, CRHR1, KANSL1, NSF and LRRC37A, and two new risk genes, STX4 and BST1, were identified. Among them, CRHR1 is a known drug target, indicating that hydrocortisone may become a potential drug for the treatment of PD. Furthermore, the potential pathogenesis of CRHR1 and LRRC37A was explored by applying DNA methylation (DNAm) data, indicating a pathogenesis whereby the effect of a genetic variant on PD is mediated by genetic regulation of transcription through DNAm. Overall, this research identified the risk genes and pathogenesis that affect PD through genetic variants, which has significance for the diagnosis and treatment of PD. Keywords: Parkinson’s disease; multiomics; risk genes; pathogenesis 1. Introduction Parkinson’s disease (PD) is called the “undead cancer” and is the second most common progressive neurodegenerative disease affecting the elderly population [1]. The global prevalence of PD is expected to double from 6.2 million cases in 2015 to 12.9 million in 2040 [2]. The complexity of PD leads to severe challenges, such as the ambiguity of early diagnosis and later symptoms that are difficult to control [3]. Therefore, finding effective genetic markers and analyzing their pathogenesis have become the focus of future clinical research on PD. Genome-wide association studies (GWAS) identified tens of genetic variants associated with PD, most of which are from patients of European ancestry, and there is relatively little knowledge about PD in other populations [3]. We have a limited understanding of the biological functions of the risk alleles that have been identified, although PD risk variants seem to be in close proximity to known PD genes and lysosomal-related genes. Recently, Nalls et al. examined the risk relationships between PD and other phenotypes based on these genetic variants [4]. However, the pathogenic mechanisms of these genetic variations remain obscure, especially how the variants influence the gene expression through methylation [5]. In fact, the determination of genotype–phenotype causality based on GWAS still faces many difficulties: (i) most of the mutation sites are located in noncoding regions of the genome and are not protein-coding sequences [5]; (ii) the linkage disequilibrium (LD) effect of adjacent single nucleotide polymorphisms (SNPs) will produce a noncausality between genotype and phenotype [6]; and (iii) the limitation of sample size reduces the statistical effect of GWAS. Rare and low frequency variations may not reach the statistical threshold [7,8]. In the past decade, researchers have made many efforts to study the genetic structure of human traits or diseases. The development of emerging technologies such as single-cell sequencing, high-throughput Genes 2020, 11, 1100; doi:10.3390/genes11091100 www.mdpi.com/journal/genes Genes 2020, 11, 1100 2 of 14 screening, and machine learning algorithms will enhance our understanding of genetic risks. Recently, Zhu et al. developed the summary data-based Mendelian randomization (SMR) and heterogeneity in dependent instruments (HEIDI) methods [9], which provide a way forward for identifying functional genes and potential functional elements of genetic variation. The fundamental idea is that GWAS identifies SNPs related to phenotypes, and expression quantitative trait locus (eQTL) analysis obtains SNPs related to gene expression and then obtains risk genes that affect phenotypes by SNPs [10]. SMR allows for the use of GWAS, eQTL and methylation quantitative trait locus (mQTL) from independent studies to aggregate data, so it can expand a sufficient sample size to enhance statistical capabilities [11]. In this study, by integrating GWAS, eQTL and mQTL, the risk genes of PD were identified, and their potential pathogenesis was analyzed. First, we integrated two sets of GWAS data based on meta-analysis and found that the data can reflect the pathological features of PD by using Gene Ontology (GO) enrichment. Second, SMR and HEIDI were used to combine the meta-analysis GWAS result with eQTL data to identify risk genes that affect PD through genetic variation. Last, this study combined GWAS, eQTL and mQTL data to analyze the pathogenesis of risk genes. To determine the functional elements that regulate gene expression, this paper used epigenetic annotation data from the Roadmap Epigenomics Mapping Consortium (REMC) to locate the function of DNAm sites [12,13]. In summary, this research identified the risk genes and potential pathogenesis that affect PD, which has significance for the diagnosis and treatment of PD. 2. Materials and Methods 2.1. GWAS Analysis of Whole-Genome Data from PPMI The whole-genome sequencing (WGS) data used in this study were derived from 606 individuals (412 PD patients and 194 controls) in the Parkinson’s Progression Marker Initiative (PPMI) database [14]. We used PLINK v1.07 toolset [15] for a standard case–control association analysis with age and gender as covariates. In the quality control phase, we selected SNPs with detection rates greater than 95%, sample detection rates greater than 95%, minimum allele frequencies greater than 5 10 2, × − and Hardy–Weinberg equilibrium less than 1 10 4. × − 2.2. Meta-Analysis Based on Effect Size Meta-analysis was used to integrate two sets of GWAS data, including the above-mentioned summary-GWAS results from PPMI and GWAS summary data from another group of 4238 PD patients and 4239 controls, respectively, and meta-analysis was used in METAL tool [16] using an effect-based approach. A total of more than 6 million SNPs were obtained by meta-analysis GWAS, of which 1678 significant SNPs (Bonferroni-corrected p-value < 7.83 10 9) were associated with PD. × − 2.3. Summary Data-Based Mendelian Randomization Analysis For the identification of PD risk genes based on GWAS and eQTL using SMR analysis, y was the phenotype, x was the gene expression, z was the genetic variation, and bxy (bxy = bzy/bzx) represent a non-confounding effect of gene expression on PD. The effect of SNPs on gene expression (bzx, with a positive or negative sign) was the effect value of eQTL, and the effect of SNPs on PD (bzy) was the effect value of meta-analysis GWAS. FDR adjustment at p-value < 5 10 2 was the threshold of SMR. × − Simultaneously, this study also infers the positive or negative effect of genes on PD (bxy < 0 means negative effect, otherwise the opposite). The HEIDI test can exclude causality caused by LD in SMR analysis and obtain pleiotropic mutation sites (i.e., causality between two genomes is caused by the same genetic mutation) [17]. This study used the Bonferroni correction p-value as the threshold for HEIDI. Therefore, for each significant eQTL probe detected by SMR, we detected the bxy of multiple SNPs in the eQTL region to exclude false-positive genes. Genes 2020, 11, 1100 3 of 14 In addition, the QTL data used are the aggregate data compiled by Zhu et al. [9,10]. The original eQTL summary data from the Westra et al. study included an eQTL meta-analysis of 5311 samples from peripheral blood [18] and a summary of the meta-analysis of Europeans from the Brisbane System Genetics Study (n = 614) and the Losian Birth Cohorts of 1921 and 193,631 (n = 1366) [19,20]. After exclusion of probes in the MHC areas and threshold screening (p-value < 5 10 8), a total of × − 5967 eQTL probes and 73,973 DNAm probes were obtained. 2.4. Enrichment Test of Functional Categories In the public data of the REMC database at http://compbio.mit.edu/roadmap/, we downloaded the epigenetic annotation files of 25 chromatin states of 23 blood cells in 127 epigenomics studies [12,13]. The 14 main functional categories were obtained by merging 25 chromatin states with functionally related annotation categories into a single functional category. These annotation files were used to locate the 14 main functional regions of each methylation probe. After functional location of the DNAm probe, we quantified the ratio of transcript-associated DNAm significant probes to all probes in 14 major functional categories. Finally, the chi-square test was used to test the significance and positive or negative effects of enrichment with these significant methylation probes in each domain. 3. Results 3.1. Screening SNPs of PD Based on Meta-Analysis GWAS Data We performed a GWAS using WGS data from the PPMI database including 412 PD patients and 194 controls [14]. Age and gender were used as covariates to obtain GWAS summary data through PLINK [15]. There were 48 SNPs significantly related to PD (p-value range was from 1.25 10 7 to × − 9.21 10 7). × − The effect-based meta-analysis was then used to integrate GWAS data from PPMI with other aggregated GWAS data, which included 4238 PD patients and 4239 controls [21]. The results of the meta-analysis for GWAS obtained more than 6 million SNPs, where 1678 SNPs (Bonferroni-corrected p-value < 0.05/6,385,696 = 7.83 10 9) were significantly associated with PD (Figure1).