Cross-Population Analysis for Functional Characterization of Type II Diabetes Variants
Dalia Elmansy1* and Mehmet Koyutürk2

Elmansy and Koyutürk BMC Bioinformatics 2019, 20(Suppl 12):320
https://doi.org/10.1186/s12859-019-2835-0
RESEARCH, Open Access
From International Workshop on Computational Network Biology: Modeling, Analysis and Control, Washington, D.C., USA. 29 August 2018

Abstract
Background: As Genome-Wide Association Studies (GWAS) have been increasingly used with data from various populations, it has been observed that data from different populations reveal different sets of Single Nucleotide Polymorphisms (SNPs) that are associated with the same disease. Using Type II Diabetes (T2D) as a test case, we develop measures and methods to characterize the functional overlap of SNPs associated with the same disease across populations.
Results: We introduce the notion of an Overlap Matrix as a general means of characterizing the functional overlap between different SNP sets at different genomic and functional granularities. Using SNP-to-gene mapping, functional annotation databases, and functional association networks, we assess the degree of functional overlap across nine populations from Asian and European ethnic origins. We further assess the generalizability of the method by applying it to a dataset for another complex disease, prostate cancer. Our results show that more overlap is captured as more functional data is incorporated along the pipeline, starting from SNPs and ending at network overlap analyses. We hypothesize that these observed differences in the functional mechanisms of T2D across populations can also explain the common use of different prescription drugs in different populations. We show that this hypothesis is concordant with the literature on the functional mechanisms of prescription drugs.
Conclusion: Our results show that although the etiology of a complex disease can be associated with distinct processes that are affected in different populations, network-based annotations can capture more functional overlap across populations. These results support the notion that it can be useful to take ethnicity into account in making personalized treatment decisions for complex diseases.
Keywords: Cross-population analysis, Overlap analysis, Type II diabetes, T2D single nucleotide polymorphism, Functional annotation, Network analysis

* Correspondence: [email protected]
1 Department of Electrical Engineering and Computer Science, Case Western Reserve University, 10900 Euclid Ave, Cleveland, OH 44106, USA
Full list of author information is available at the end of the article

© The Author(s). 2019 Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Background
Genetic variations constitute an important part of the factors that contribute to many complex diseases. To identify genetic variations that are associated with specific complex diseases, genome-wide and whole-genome association studies have been widely performed in recent years. These studies have identified many germline variants (in particular, single nucleotide polymorphisms or SNPs) associated with complex diseases in many different populations. Most of these identified variants exhibit subtle disease associations, and they usually have limited predictive power in risk assessment [1].

While the effect of a single genetic variation, such as a SNP, could be small or large, the collective effect of many variations and their relationships provides valuable insight into the mechanisms of complex diseases. To this end, there is a pressing need to study the collective effect of many SNPs and their interrelationships as we try to characterize the functional underpinnings of the relationship between genotype and phenotype. One useful source of information in this regard is the set of disease associations identified by GWAS on different populations.

Since many genomic variants can be population-specific and their distribution can follow geographical patterns, GWAS on different populations offer different and potentially complementary disease associations, which may enrich our understanding of disease mechanisms. Furthermore, rare variants, which are often thought to be the causal variants for many complex diseases, are usually population-specific [1]. Therefore, the elucidation of the functional relationship between rare variants identified in different populations can be useful for the design of personalized treatment strategies in precision medicine. In the future, the comprehensive knowledge we collect from the use of multiple populations in whole-genome association studies can be utilized by healthcare systems in smoothing out and gradually eliminating health disparities [3, 4].

Systems biology uses computational and mathematical approaches to model the complex interactions among genetic variations as related to a phenotype. Such a holistic approach of characterizing and integrating the widespread genetic variations and the interplay between them yields a more intuitive understanding of complex diseases. Type II Diabetes (T2D) is one such complex disease: most disease-associated variants identified by GWAS differ across populations. Since variants identified on different populations can be linked to T2D through similar functional mechanisms, annotating the significant SNPs from each population separately will not fully utilize the information obtained from different populations. Furthermore, since there are distinct underlying disease processes and treatment regimens for T2D, analyzing the functional overlap between variants identified on different populations can shed light on the differences between populations in terms of disease mechanisms.

In this paper, we propose a computational pipeline to systematically assess the functional overlap between genomic markers of complex diseases that are identified on different populations. For this purpose, we use T2D as a suitable model disease, and we compile results from GWAS that have been performed on nine different populations across the globe. These results mainly represent genomic loci that are found to be associated with T2D on samples collected from these nine populations. Interestingly, there is little overlap between disease-associated loci identified on different populations.

To systematically assess the functional overlap between these different sets of SNPs, we develop computational algorithms and statistical frameworks, expecting that the variants identified in certain [...] protein-coding genes, biological pathways in which these proteins are active, and networks of physical and functional interactions between these proteins are systematically evaluated for potential overlap. Figure 1 [2] illustrates the framework. Our results show that the overlap between different populations grows as the level of abstraction coarsens from genomic location to biological function. More interestingly, our results also show that differences in the biological processes that are implicated in different populations align with targets of T2D first-line therapy in each population.

To further assess the generalizability of our methods, we repeat the same pipeline of analyses on datasets for another complex disease, prostate cancer, as a second test case. Our results on prostate cancer also show that more functional overlap can be detected among populations as the level of abstraction coarsens.

Results
Populations and datasets for type II diabetes
We use nine T2D Genome Wide Association datasets representing nine populations of Asian and European ethnic origins. The basic statistics of these studies and their results are shown in Table 1. The first two datasets are case-control studies for which the genotypes for all samples and genotyped loci are available. The other seven datasets are published results of case-control studies. These datasets consist of the list of significant loci that are identified by the study and the associated statistical significance figures.

The nine datasets are the following:

1. Wellcome Trust Case Control Consortium – WTCCC (W) T2D SNP data, which genotyped 495,477 SNPs for 1999 case and 1504 control samples obtained from the UK population [5]. The control samples represent individuals who were born in 1958.
2. The database of Genotypes and Phenotypes – dbGAP (D), which genotyped 561,656 SNPs for 1007 case and 983 control samples obtained from other parts of Europe and US [6].
3. A Finnish (F) population case-control study which genotyped 329,091 SNPs for 1161 case and 1174 control samples collected in Finland [7, 8].
4. A French (Fr) population case-control study which genotyped 392,365 SNPs for 679 case and 697 control samples collected in France [9].
5. Saudi (S) population studies that pooled 48 Saudi T2D SNPs implicated in previous experiments for a total of 2207 case samples and 2276 control
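
The idea of comparing populations at successively coarser granularities can be illustrated with a minimal sketch. The overlap measure used by the authors is not defined in this section, so the Jaccard index below is an assumption, not the paper's measure. The function names, the SNP and gene identifiers, and the SNP-to-gene map are all hypothetical toy values chosen only to show how an overlap matrix might change when SNP sets are lifted to the gene level.

```python
from itertools import product


def overlap_matrix(sets_by_population):
    """Return (names, matrix) of pairwise Jaccard overlaps.

    Jaccard is an assumed stand-in for the paper's overlap measure.
    """
    names = sorted(sets_by_population)
    matrix = {}
    for a, b in product(names, repeat=2):
        sa, sb = sets_by_population[a], sets_by_population[b]
        union = sa | sb
        matrix[(a, b)] = len(sa & sb) / len(union) if union else 0.0
    return names, matrix


def map_to_genes(snps, snp_to_gene):
    """Lift a SNP set to the gene level; unmapped SNPs are dropped."""
    return {snp_to_gene[s] for s in snps if s in snp_to_gene}


# Hypothetical toy data: these are NOT real SNPs, genes, or study results.
snps_by_pop = {
    "W": {"snp1", "snp2"},
    "F": {"snp3", "snp4"},
    "S": {"snp2", "snp5"},
}
snp_to_gene = {
    "snp1": "GENE_A",
    "snp2": "GENE_B",
    "snp3": "GENE_A",
    "snp4": "GENE_C",
    "snp5": "GENE_C",
}

genes_by_pop = {
    pop: map_to_genes(snps, snp_to_gene) for pop, snps in snps_by_pop.items()
}

names, snp_overlap = overlap_matrix(snps_by_pop)
_, gene_overlap = overlap_matrix(genes_by_pop)

# Print both matrices for each pair of populations.
for a, b in product(names, repeat=2):
    print(a, b, round(snp_overlap[(a, b)], 3), round(gene_overlap[(a, b)], 3))
```

In this toy example populations W and F share no SNPs, yet they share one gene once SNPs are mapped to genes, which mirrors the qualitative trend described above. The same `overlap_matrix` helper could be reused with pathway or network-neighborhood sets, again under the assumption that a set-based measure is appropriate.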
