Research Article

Received 22 March 2011, Accepted 2 November 2011 Published online in Wiley Online Library

(wileyonlinelibrary.com) DOI: 10.1002/sim.4477 A pathway analysis method for genome-wide association studies Babak Shahbaba,a Catherine M. Shachafb and Zhaoxia Yua*†

For genome-wide association studies, we propose a new method for identifying significant biological pathways. In this approach, we aggregate data across single-nucleotide polymorphisms to obtain summary measures at the gene level. We then use a hierarchical Bayesian model, which takes the gene-level summary measures as data, in order to evaluate the relevance of each pathway to an outcome of interest (e.g., disease status). Although shifting the focus of analysis from individual genes to pathways has proven to improve the statistical power and provide more robust results, such methods tend to eliminate a large number of genes whose pathways are unknown. For these genes, we propose to use a Bayesian multinomial logit model to predict the associated pathways by using the genes with known pathways as the training data. Our hierarchical Bayesian model takes the uncertainty regard- ing the pathway predictions into account while assessing the significance of pathways. We apply our method to two independent studies on type 2 diabetes and show that the overlap between the results from the two studies is statistically significant. We also evaluate our approach on the basis of simulated data. Copyright © 2012 John Wiley & Sons, Ltd.

Keywords: hierarchical; pathway; genome-wide; uncertainty

1. Introduction

With the development of high-throughput genotyping technologies, genome-wide association studies (GWAS) have been widely used for identifying common genetic variants that contribute to human com- plex traits. By testing individual single-nucleotide polymorphisms (SNPs) and focusing on the most significant SNPs, genome-wide association studies have successfully detected hundreds of SNPs asso- ciated with complex human diseases [1]. However, in most situations the identified SNPs only jointly explain a small part of heritability. For example, it has been estimated that human height has heritability of about 80%, and more than 40 loci have been identified by GWAS; however, these loci contribute only about 5% to the phenotypic variation [2]. Genetic heterogeneity, the phenomenon that different alleles at different loci might contribute to a disease in different populations, is probably the norm rather than the exception. This makes detecting genetic variants, especially those with small or moderate individual effects, for human complex diseases a challenge. Thus, despite its success, it has been recognized that the single-marker-based association analysis conducted in most GWAS might not have adequate power to detect important genetic variants for human complex traits. Moreover, these methods tend to have low reproducibility so there is little overlap between findings of different research groups studying the same biological system. Recent work has demonstrated that evaluating variations related to predefined sets of interconnected genes (e.g., pathways) often increases statistical power and produces more robust results [3–12]. More recently, several pathway-based analysis methods have been also proposed specifically for GWAS analysis [13–33]. These methods differ from one another in many aspects, such as whether individual-level SNP genotypes are required, whether gene-level summary statistics are computed, and how statistical significance is assessed. Wang et al. [34] reviewed many pathway-based methods for ana- lyzing GWAS. They asserted that shifting the focus of analysis from individual SNPs to pathways has led to the identification of many meaningful biological pathways.

aDepartment of Statistics, University of California, Irvine, CA, USA

988 bDepartment of Radiology, Stanford University, Stanford, CA, USA *Correspondence to: Zhaoxia Yu, Department of Statistics, University of California, Irvine, CA, USA. †E-mail: [email protected]

Copyright © 2012 John Wiley & Sons, Ltd. Statist. Med. 2012, 31 988–1000 B. SHAHBABA, C. M. SHACHAF AND Z. YU

We have recently proposed a hierarchical Bayesian model for analyzing gene expression data to iden- tify pathways differentiating between two biological states [35]. Our method, Bayesian gene-set analysis (BGSA), evaluates the statistical significance of a specific pathway by using the posterior distribution of its corresponding hyperparameter. Here, we propose a hierarchical Bayesian model for GWAS while addressing some issues affecting the performance of pathway analysis methods in general. The pro- posed method in this paper aggregates data at two levels: (i) at the gene level by combining significance measures of SNPs associated with each gene and (ii) at the pathway level by combining significance measures of genes associated with each pathway. Here, we also propose a simple measure of signif- icance that can be used to identify pathways related to the outcome of interest (e.g., disease status). Further, we extend our model to account for possible uncertainty regarding the assignment of genes to pathways. This is especially useful when we deal with genes whose pathway assignments are unknown. For these genes, we use a Bayesian classification model to predict their pathways and use their posterior predictive probabilities as the measure of uncertainty regarding our predictions. We apply our method to two independent studies on the same disease, namely type 2 diabetes, in order to examine the reproducibility of the results obtained from our method. We also evaluate our method by using simulated data and compare its performance with those of two alternative models: Gene Set Enrichment Analysis for SNP data (GSEAsnp) [13] and Association LIst Go AnnoTatOR (ALIGATOR) [16]. To perform pathway-based analysis of GWA data, GSEAsnp uses a modified version of the gene-set enrichment analysis (GSEA). For each SNP, the test statistic is obtained using the 2 test. The statistic for each gene is the highest statistic value among all SNPs mapped to that gene. For any given pathway, the overall significance is measured on the basis of a weighted Kolmogorov–Smirnov statistic by using the distribution of gene-level statistics associated with that pathway. ALIGATOR defines a list of significant SNPs by applying an arbitrary threshold of significance to the GWA study and tests for overrepresen- tation of categories of genes. The categories are based on pathways obtained from public data bases such as Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG). This method corrects for the presence of linkage disequilibrium between SNPs, variable gene size, overlapping genes, and multiple nonindependent categories.

2. Methods

In this section, we describe our method, henceforth called BGSAsnp. Our method obtains gene-level summary measures by aggregating the genotype data over their corresponding SNPs. This is similar to the hierarchical Bayes prioritization method proposed by Lewinger et al. [36]. Next, we aggregate the summary measures over genes within each pathway in order to assess the significance of pathways with respect to their associations with an outcome of interest (e.g., disease status). Further, we discuss how we can take the uncertainty of pathway assignments into account and include genes with unknown pathways in our analysis.

2.1. Gene-level summary statistics We use a simple hierarchical Bayesian model to summarize the evidence for each gene. Suppose that the gene j is associated with mj SNPs. We start by measuring the association between each SNP and disease by computing the unsquared Cochran–Armitage trend test statistic [37], T , which follows the standard normal distribution asymptotically under the null hypothesis of no genetic association. For

SNPs associated with the gene j , we denote these statistics Tj1;:::;Tjmj . Next, we combine these statistics to obtain a gene-level measure of significance. To do this, we propose a simple approach based on a hierarchical Bayesian model. We assume the following model for Tj1;:::;Tjmj (i.e., unsquared Cochran–Armitage trend statistics associated with the gene j ):

2 Tji N.0;j /iD 1;2;:::;mj 2 2 2 j Inv- .; /

2 Note that SNPs associated with the gene j share the same hyperparameter j . The role of this hyper- parameter is to adjust the distribution of T within the gene. If the SNPs are not related to the outcome, 989 2 the absolute values of their corresponding statistics would be small. In this case, the hyperparameter j , which is shared by all the SNPs for the gene, would shrink towards zero to accommodate small values of

Copyright © 2012 John Wiley & Sons, Ltd. Statist. Med. 2012, 31 988–1000 B. SHAHBABA, C. M. SHACHAF AND Z. YU

T . If the SNPs are in fact related to the outcome, we expect their corresponding T to be large in absolute 2 value. As a result, the hyperparameter j becomes large. Because the scaled-inverse-2 distribution is a conjugate prior for the variance of a normal 2 distribution, the posterior distribution of j has a closed form as follows:

2 2 2 C mj Vj j jT;; Inv- . C mj ; / C mj P mj 2 where mj is the number of SNPs associated with the gene, and Vj D 1=mj iD1 Tji. The posterior 2 mean of j is

2 2 C mj Vj E.j jT;; / D ; for C mj >2 C mj 2 2 We use the signed posterior expectation of j as the gene-level summary measure and denote it Zj for gene j .

2 Zj D sign.Tj.1//E.j jT;; / The sign is determined by the sign of the Cochran–Armitage trend statistic [37] of the most significant SNP, which is denoted as Tj.1/. For simplicity, we could use an improper uniform prior distribution on log.j /, which is equivalent 1 2 to P.j / / j and has the form of an inverse- density with zero degrees of freedom [38]. In this case, the mean exists only when mj >2(i.e., at least three SNPs must be associated with the gene). By setting D 0 and 2 D 1, the gene-level summary measure becomes P mj 2 iD1 Tji Zj D sign.Tj.1// mj 2 However, instead of the above improper prior, we usually prefer weakly informative priors in practice. One possibility is to use Inv-2.1; 0:05/, whose 0.025-quantile and 0.975-quantile are approximately 0.01 and 50, respectively. In our model for Tji, we have fixed the mean of the normal distribution at zero and use the vari- 2 2 ance parameter j to capture the overall gene effect. Alternatively, we could use N.j ;j /, which is a more flexible model because the mean of the distribution is not fixed at zero. This, however, complicates our inference regarding the significance of genes because with this alternative model, j captures the 2 location shift of the distribution of Tji, whereas j captures its scale change. Therefore, to have power against both location shift and scale change alternatives, we need to take both parameters into account. Our model (with the mean fixed at zero), on the other hand, captures both location shift and scale change 2 in the distribution of Tji by using one parameter, j . For example, suppose there are two SNPs associ- ated with the gene j ,andTj1 D Tj2 D 1. That is, the maximum likelihood estimate (MLE) of the mean 2 is 1, whereas the MLE of the variance is 0. Assuming D 1 and D 0:05, we obtain Zj D 2:05.On the other hand, if Tj1 D1 and Tj2 D 1 (i.e., the MLE for the mean is 0, whereas the MLE for the variance is 1), the overall summary measure for the gene is still Zj D 2:05.

2.2. Using hierarchical Bayesian models to identify significant pathways After we obtain gene-level summary measures, Z, we use prior biological information regarding the interconnectivity among genes in order to divide genes into subsets. In this way, we shift the focus of hypothesis testing towards pathways, as opposed to individual genes. Suppose we create a collection of pathways S1, S2, ..., SK . Our objective is to identify pathways that are involved in the biological process of interest. For the gene j in the set s, a statistic Zsj is obtained by using the method discussed above. Then, we could assume the following hierarchical model for Zsj :

2 Zsj N.0;s /jD 1;2;:::;ns 2 2 2 s Inv- .; / 990 2 Note that genes within a set, s, share the same parameter, s . The role of this parameter, which is anal- 2 ogous to the role of j used for the gene-level summary measure, is to adjust the distribution of Z’s

Copyright © 2012 John Wiley & Sons, Ltd. Statist. Med. 2012, 31 988–1000 B. SHAHBABA, C. M. SHACHAF AND Z. YU within the pathway s. In this case, if the genes in the pathway are not related to the outcome, the abso- 2 lute values of their corresponding Z’s would be small resulting in the shrinkage of s towards zero to accommodate small values of Z. On the other hand, if the genes are in fact related to the outcome, we 2 expect their corresponding Z’s to be large in absolute value resulting in relatively large s .Again,note 2 that by fixing the mean of the normal distribution at zero, the parameter s captures both location shift and scale change in the distribution of Z. 2 2 In the above model, the assumed scaled-inv- is a conjugate prior for s . The posterior distribution, therefore, has the following closed form:

2 2 2 C nsVs s jZ; ; Inv- . C ns; / C ns P ns 2 2 where ns is the number of genes in the pathway, and Vs D 1=ns iD1 Zsj . The posterior mean of s is

2 2 C nsVs E.s jZ; ; / D ; for C ns >2 C ns 2 To illustrate this approach, we use simulated data where 1000 genes are assigned to 50 pathways such that each consecutive non-overlapping block of 20 genes forms a pathway. For the first pathway, all Z values are set to 2. For the second pathway, half of Z values are set to 2, whereas the other half are set to 2. In the third pathway, Z values are randomly sampled from N.2;1/. In the fourth pathway, we sample Z values from N.0; 4/. The remaining 46 pathways are assumed to be irrelevant to the outcome of interest. For these pathways, we sample Z from the standard normal distribution. 2 2 As before, we could use an improper prior (e.g., D 0 and D 1)fors . We, however, set D 1 and 2 D 0:05 to obtain a weakly informative prior. The 95% interval for the corresponding scaled-inv-2 distribution is approximately Œ0:01; 50, which is reasonably broad. The left panel in Figure 1 shows the 2 posterior mean of s for the 50 simulated pathways. As we can see, the first four pathways are clearly separated from the remaining 46 pathways. This shows that by fixing the mean at zero and using parame- ters for the variance of Z, our model has power against both location shift and scale change alternatives.

The above approach could be useful for ranking pathways in terms of their significance based on, 2 2 for example, posterior expectations of s , such that the higher the posterior expectation of s ,themore significant the pathway. However, in many practical problems, an automated mechanism for selecting significant (i.e., relevant to the biological process of interest) pathways is desirable. To this end, we can use a mixture prior similar to that of [39] and [40]; instead of a simple scaled-inv-2 distribution as the 2 prior of s , we use the following prior: 2 s .1 /F0 C F1; 2 where F0 and F1 are the distributions of s for irrelevant and relevant pathways, respectively, and is the 2 probability that a pathway belongs to the relevant group. For irrelevant groups, s ’s are quite small and 2 p Posterior mean of 0123456 0.0 0.2 0.4 0.6 0.8 1.0 0 1020304050 0 1020304050 Gene set index Gene set index

2 Figure 1. Left panel: The posterior mean of s for 50 simulated pathways. The first four pathways are signif- 991 icant, whereas the remaining 46 pathways are not. Right panel: The corresponding measure of irrelevance, p, obtained using our mixture model.

Copyright © 2012 John Wiley & Sons, Ltd. Statist. Med. 2012, 31 988–1000 B. SHAHBABA, C. M. SHACHAF AND Z. YU

2 close to zero, whereas for the relevant groups, s ’s tend to be large. Therefore, we assume the following 2 mixture prior for s : 2 2 2 2 2 2 s j;0;1 .1 /Inv- .; 0 / C Inv- .; 0 C 1 / 2 2 0 ;1 Gamma.a;b/ Gamma.a;b/

Beta.a;b/

2 This way, the distribution of s is assumed to be a mixture of two simple distributions: (i) a scaled- 2 2 2 inv- distribution with a relatively small scale parameter 0 , and (ii) a scaled-inv- with a relatively 2 2 larger scale parameter 0 C 1 . Note that for given degrees of freedom, as the scale parameter increases, scaled-inv-2 distribution gives higher probability to large values of random variable (i.e., the distribu- 2 tion shifts to the right). Therefore, we separate s into two groups. The first group includes small values 2 2 2 of s with Inv- .; 0 / prior distribution. The corresponding pathways in this group are assumed 2 to be irrelevant to the outcome of interest. The second group includes large values of s with Inv- 2 2 2 .; 0 C 1 / prior distribution. The corresponding pathways in this group are assumed to be relevant to the outcome of interest. For , we could use a conjugate Beta.1; 1/ prior (i.e., a D b D 1), which is equivalent to the Uniform(0, 1) distribution. This, however, means that in prior, we expect that half of pathways are sig- nificant. In practice, we expect that only a small number of pathways are significant. Therefore, it is more reasonable to use informative priors such as Beta.1; 9/. To facilitate posterior sampling, we use the data augmentation approach [41] by introducing binary latent variables vs Bernoulli./ such that 2 2 2 2 2 2 s jvs;0;1 .1 vs/Inv- .; 0 / C vsInv- .; 0 C 1 /: Alternatively, we can write the above prior as ( 2 2 2 s jvs D 0; ; 0 Inv- .; 0 / 2 2 2 2 s jvs D 1; ; 0;1 Inv- .; 0 C 1 / 2 The priors for and s are conditionally conjugate, so we can use the Gibbs sampler to obtain their 2 2 posterior distributions. Because priors for , 0 ,and1 are not conditionally conjugate, we cannot use the Gibbs sampler for these parameters; instead, we use single-variable slice sampling [42]. At each iter- ation, we use the ‘stepping out’ procedure to find the interval around the current point and the ‘shrinkage’ procedure for sampling from the interval. To assess the significance of each pathway, we can use the posterior distribution of vs given the observed data, ´. However, to generate measurements analogous to p-value, we use the following measure instead: > 2 ps D EŒP.T 0=s jZ/ 2 > 2 where T . Note that at each iteration of Markov Chain Monte Carlo (MCMC), P.T 0=s jZ/ 2 is the upper tail probability of 0=s based on the distribution of the irrelevant group. For an irrelevant > 2 2 2 pathway, P.T 0=s jZ/ would have a uniform distribution because 0=s also has (i.e., the 2 same as T ). Thus, ps would be close to 0.5. For a relevant pathway, ps tends to be small because s 2 2 2 would be sampled from an Inv- distribution with a relatively larger scale parameter, 0 C 1 ,than that of the nonsignificant group. The right panel of Figure 1 shows ps for the 50 pathways in the above illustrative example. As we can see, ps for the first four pathways are close to zero, whereas for the remaining 46 pathways, ps are large and scattered around 0.5.

2.3. Accounting for uncertainty regarding pathway assignments Typically, pathway analysis methods assume that pathway assignments are fixed and known. This is not always the case in practice; identifying genes involved in a specific biological pathway is a difficult task, and different studies could lead to different conclusions regarding the structures of pathways. Here, we use y to denote the label (i.e., pathway) assigned to the gene j . To take into account the uncertainty 992 j regarding the membership of the gene j to a set of K predefined pathways, we assume a discrete prob- ability distribution over yj such that P.yj D k/ D Pjk for k D 1;:::;K. If we assume with certainty

Copyright © 2012 John Wiley & Sons, Ltd. Statist. Med. 2012, 31 988–1000 B. SHAHBABA, C. M. SHACHAF AND Z. YU

that the gene j belongs to the pathway s,wesetPjk D 1 for k D s and Pjk D 0 for k ¤ s. We might of course believe that a gene can belong to multiple pathways with different probabilities. For example, suppose that the gene j belongs to three pathways S1, S2,andS3 with probabilities Pj1;Pj2,andPj3, respectively. In this case, Pjk D 0 for k …f1; 2; 3g. In the absence of any information regarding these probabilities (i.e., when all we know is that the gene could belong to one of three possible pathways), we could assume Pj1 D Pj2 D Pj3 D 1=3. To account for our uncertainty regarding the pathway assignments, we modify our proposed hierarchi- cal Bayesian model as follows. At each iteration of MCMC, we select one of the possible classes (i.e., pathway assignments) on the basis of their probability distribution and assign the gene to the selected pathway. This way, the pathway assignments could change from one iteration to another. We then con- tinue our algorithm to update the model parameters as before. Note that the Monte Carlo estimations based on this modified version of our model are obtained by integrating over the probability distributions over pathway assignments.

2.4. Predicting pathways In many situations, we might take the pathway assignments provided by public data bases as certain for simplicity. However, such data bases rarely provide pathway assignments for all genes. This is a major hurdle in pathway analysis. For example, in the next section we discuss the application of our model to two GWAS data sets. Our data includes 20,991 genes. Using Molecular Function Ontology (MFO) avail- able from Molecular Signatures Database, we want to divide these genes into 396 pathways. However, the pathways are known for 3959 genes only. The remaining 17,032 are not assigned to any pathway and would be eliminated before using pathway analysis methods. One possible solution for this problem is to combine multiple data bases to improve the overall cover- age of genes. This approach, however, might undermine one of the attractive aspects of pathway analysis methods, namely, the reduction of the number of hypotheses by using meaningful biological information to create sets of genes with some underlying biological themes. For the above example, by using the gene sets derived from the MFO, we can reduce the number of hypotheses from 20,991 to 396. This is espe- cially crucial for improving the statistical power in many biological studies with small sample size. Also, in many situations, our choice of a data base to create pathways might depend on the specific biological theme (e.g., molecular function, biological process, cellular component) in which we are interested. In such cases, combining multiple data bases with different themes might not be a feasible option. Besides, even if we decide to combine data bases to increase the coverage, it is still possible that some key genes in our study remain unassigned. To address this issue, we use a statistical model to infer pathways for genes that are not assigned to any pathway. To this end, we use a Bayesian multinomial logistic model (MNL) as our classifier,

exp.˛ C x ˇ / P s j s P.yj D sjxj ; ˛; ˇ/ D K (1) kD1 exp.˛k C xˇk/ where xj is a set of predictors (i.e., covariates), .˛; ˇ/ are model parameters, and K is the number of classes (pathways). The number of classes in our case is K D 396.WeuseN.0;2/ prior for model parameters. We set 2 to some large values so the priors are reasonably wide. We use the genes with known pathways as the training data. After we obtain posterior distributions of model parameters, we could use our model to assign each gene with unknown pathway to the class with the highest posterior predictive probability. However, as discussed later, we will consider the top three pathways to account for the uncertainty of pathway predictions based on the above model. To build the above model, we need a set of predictors, x, that can be measured for all genes (whether their membership is known or unknown) and are hopefully informative about the pathway (i.e., class) membership. Following [43] and [44], we use InterPro database [45] and use the predicted -domain annotations as x. To evaluate the performance of our model, we divide the set of genes with known pathways into two groups: training and test data sets. We randomly assign two-thirds of these genes to the training set and the remaining one-third to the test set. We build our classification model by using the training set and evaluate its performance on the basis of the test set. More specifically, we compare the predicted path- 993 ways with the actual pathways for test cases and measure the accuracy rate as the proportion of genes that are correctly classified (note that for genes in the test set, we know the corresponding pathways).

Copyright © 2012 John Wiley & Sons, Ltd. Statist. Med. 2012, 31 988–1000 B. SHAHBABA, C. M. SHACHAF AND Z. YU

For a baseline model that randomly assigns genes to pathways, the estimate of prediction accuracy is 1=396 D 0:2%. If we assign genes to the most frequent pathway, the resulting prediction accuracy is 12.0%. Using our model, we increase the prediction accuracy from 12.0% to 38.2%. Although our model provides a substantial improvement in accuracy rate, it misclassifies 61.8% of the genes. To mitigate the potential negative effect of these misclassifications on our pathway analysis model, we can focus on genes for which we have high confidence in our assignment. For our models, we base confidence on posterior predictive probability, which is defined as the expected probability of each class with regard to the posterior distribution of model parameters. The higher this probability, the more confident we are in classifying the case. We rank the test cases on the basis of how high the highest probability is, and for a coverage of g, we classify only the top g percent of genes. Figure 2 shows how prediction accuracy changes as we increase the coverage. The dashed line shows that by predicting pathways only for the top 20% genes, we obtain 75.7% accuracy rate. Next, we train our model by using all the genes with known pathways and use the resulting model to obtain posterior predictive probabilities for genes with unknown pathways. We rank these genes accord- ing to their highest posterior predictive probabilities and predict the pathways for the top 20% genes only. This increases the number of genes for the pathway analysis part of our model from 3959 to 6155. To account for our uncertainty regarding pathway predictions, instead of considering the top ranking pathway, we find the top three pathways (i.e., the three pathways with the highest probabilities) for each gene. We assume that the gene could belong to one of these three pathways only and normalize their corresponding probabilities so the three probabilities sum to 1. At each iteration of MCMC, for each of the 2196 genes with unknown pathways, we randomly select one of its three top predicted pathways by using the normalized probabilities and assign the gene to the selected pathway. This way, the path- way assignments could change from one iteration to another. The Monte Carlo estimations are therefore obtained by integrating over the probability distributions for predicted pathways.

3. Results for genome-wide association studies of type 2 diabetes

In this section, we use our method for analyzing SNP data to identify pathways involved in type 2 diabetes. Diabetes is a disease that results from uncontrolled and constant high levels of sugar in the blood. Medical complications resulting from this physiological state could be fatal. These include impaired cell and tissue metabolic function, liver enzymatic function, impaired wound healing associ- ated with increased inflammation, and cytokine production [46]. The fundamental pathology of diabetes is the deregulation of insulin, blood glucose levels, and a variety of metabolic functions of the body. The complications of diabetes include dramatically altered energy metabolism and glucose handling, Accuracy rate 0.4 0.5 0.6 0.7 0.8

0.0 0.2 0.4 0.6 0.8 1.0 Fraction of genes predicted

994 Figure 2. Prediction accuracy for different coverage (i.e., proportion of genes predicted). Pathways are defined on the basis of Molecular Function Ontology. The dashed line shows that 20% coverage (i.e., predicting pathways for the top 20% genes only) results in 75.7% accuracy rate.

Copyright © 2012 John Wiley & Sons, Ltd. Statist. Med. 2012, 31 988–1000 B. SHAHBABA, C. M. SHACHAF AND Z. YU protein oxidation, vascular disease, heart disease, retinopathy, neuropathies, nephropathy, and suscepti- bility to infections. The pathology of diabetes exceeds the simple maintenance of plasma glucose and insulin levels. We apply our method to SNP data from two independent GWAS on type 2 diabetes: one conducted by the Wellcome Trust Case–Control Consortium (WTCCC) [47] and the other by the Diabetes Genetics Initiative (DGI) [48]. In the WTCCC study, 2000 cases and 3000 controls are collected and genotyped using the Affymetrix Human Mapping 500K GeneChip. All subjects are Caucasians living in the UK. In the DGI study, about 1400 T2D Scandinavian patients and 1400 matched controls are genotyped using the Affymetrix Human Mapping 500K GeneChip. For the WTCCC data, we used the normalized signal intensity data available from WTCCC database and called the genotypes using a recently published algorithm called Beaglecall [49]. The new genotype calling algorithm simultaneously makes genotype calls and estimates haplotype phase. It improves geno- type calling by incorporating linkage disequilibrium information of SNPs. An iterative filtering method is then exploited to remove SNPs with potentially high genotyping errors. It has been shown that this simultaneous approach improves the genotyping accuracy and call rate significantly. Also, not surpris- ingly, false positives due to genotyping errors are reduced [49]. The default options of Beaglecall were used. For the DGI study, the data available from the DGI web site were genotype counts of cases and controls for each SNP. To reduce the potential effect of linkage disequilibrium on pathway analysis, we first prune SNPs by using PLINK (http://pngu.mgh.harvard.edu/purcell/plink) [50]. The ‘indep-pairwise’ option is used, with the parameters for the window size, step, and R2 set to 50, 5, and 0.5, respectively. About 65,000 SNPs remain in our analysis after SNP pruning. Each SNP in the list is mapped to its nearest gene that is within 20,000 base pairs. We obtain summary statistics, Z, for each gene as described in Section 2. For the prior distribution 2 2 of j , we use the weakly informative distribution Inv- .1; 0:05/. Next, we group genes into pathways based on MFO and use our hierarchical Bayesian model to identify the significant pathways. We use 2 2 Gamma(1, 1) priors for , 0 ,and1 because they are typically small positive numbers. We set the prior distribution of to Beta(1, 9). Therefore, the expected proportion of significant pathways is 0.1 in prior. We obtain the posterior distribution of model parameters and ps on the basis of 5000 MCMC samples after discarding the initial (pre-convergence) 500 samples. By setting the cutoff for ps at 0.05, 20 pathways from the WTCCC study and 31 pathways from the DGI studies are selected as significant. The selected pathways, their individual genes, and the corre- sponding ´-values for genes within each pathway are provided as a web-based supporting document along with this paper. Four of these pathways are common between the two studies. The probability of observing four or more common pathways by chance (i.e., randomly selecting 20 and 31 pathways from the two studies) is 0.06. The four common pathways are integrin binding, intramolecular oxidoreductase activity, carboxypeptidase activity,andmetalloexopeptidase activity. The pathways identified by this study are strongly linked to biological processes impaired during the onset of diabetes. Metalloexopeptidase activity is directly related to the regulation of insulin. For example, mutation in the insulin-degrading enzyme (ICE), a metalloexopeptidase, causes elevated blood glucose and insulin levels leading to the onset of diabetes [51]. The intramolecular oxidoreductase activity pathway is associated with oxidative stress prominent pri- marily in the liver. This in turn induces an inflammatory and cytokine response [52–54]. Inflammation markers are therefore predictive for the risk of developing diabetes. Patients with diabetes are in a con- stant low-grade inflammatory state, specifically in type 2 diabetes. This state is linked to the activation of innate immunity where metabolic, environmental, and genetic factors are involved and are associated with an increased cardiovascular risk. Nephropathy and renal dysfunction is a prevalent complication resulting from constant hyperglycemia. Renal dysfunction could be the result of perturbation of the integrin pathway. Integrins are key elements of the extracellular matrix (ECM) of renal glomeruli and mediate cell matrix interactions [55]. Pertur- bation of interactions between cells in the ECM of the kidney induces death of cell in the glomerular capillaries, impairing blood flow, and then development of renal ischemia characteristic of patients with diabetic nephropathy [56]. Carboxypeptidases are enzymes required to help mature such as insulin or regulate its bio- logical processes. The angiotensin-converting enzyme (ACE)-related carboxypeptidase is involved in 995 diabetic nephropathy [57]. The ACE proteins regulate the renin–angiotensin system which is the target of therapies to slow disease progression in diabetic and nondiabetic kidney disease [58].

Copyright © 2012 John Wiley & Sons, Ltd. Statist. Med. 2012, 31 988–1000 B. SHAHBABA, C. M. SHACHAF AND Z. YU

To asses the effect of predicting pathways on the overall results of our model, we repeat the above analysis, but this time, we include genes with known pathways only. For a set of arbitrary cutoffs, Fig- ure 3 shows the proportions of significant pathways that are common between both WTCCC and DGI studies. Here, the horizontal axis shows the fraction of pathways that are selected as significant in both studies for a given cutoff. The vertical axis shows the proportion of these significant pathways that are common between the two studies. As we can see, the model with pathway prediction (solid line) tends to generate more reproducible results compared to the model that includes genes with known pathways only (dashed line). That is, for a given cutoff, the proportion of significant pathways that are common between the two studies tends to increase when we predict unknown pathways for some genes and include them in our analysis. For the specific cutoff we used for the results presented earlier, removing genes with unknown pathways from our analysis reduces the number of common pathways from four to three. Without pathway prediction, the set of common pathways does not include carboxypeptidase activity anymore.

4. Simulation

To evaluate the performance of our model, BGSAsnp, we conduct two simulation studies. We com- pare BGSAsnp with two alternative methods: GSEAsnp [13] and ALIGATOR [16]. To evaluate the performance of these three methods, we use the area under the receiver operating characteristic (ROC) curve (AUC) for identifying significant pathways. All programs to simulate data and run the models are available online at http://www.ics.uci.edu/~babaks/Homepage/Codes.html. For our model, we need to specify the priors first. For j , which is used to obtain the gene-level sum- 1 2 mary measures Z, we use the improper prior P.j / / j .WeuseGamma(1,1)priorsfor, 0 ,and 2 1 . Finally, we set Beta.1; 1/ (we encourage using weakly informative priors, such as the one used in the previous section, for real data analysis problems). We obtain the posterior distributions by using 5000 iterations of MCMC after discarding the initial 500 iterations. To run GSEAsnp and ALIGATOR, we use the R functions included in the SNPath package, which is available online at linchen.fhcrc.org/grass.html. We use the default settings for these two models. For the first simulation, we assume that all genes have known pathways (we will relax this assump- tion in the second simulation). The genotype data used in our simulations are based on 1500 subjects from the 1958 British Birth Cohort [47]. We randomly choose 500 genes that are mapped to no less than five SNPs and then randomly choose five SNPs for each selected gene. These 500 genes are randomly grouped to create 50 pathways of varying sizes. We make the first five pathways significant while keep- ing the remaining 45 pathways as nonsignificant. To this end, we randomly select three genes from each of the first five pathways (15 genes in total) and identify their corresponding SNPs. We randomly choose 10 of these SNPs to create a binary phenotype variable, Y , according to the following model:

Pathway analysis with prediction Pathway analysis without prediction common between two studies Fraction of significant pathways 0.00 0.02 0.04 0.06 0.08 0.10 0.12 0.14 0.02 0.04 0.06 0.08 0.10 0.12 Fraction of pathways selected as significant

996 Figure 3. Comparing BGSAsnp models with and without pathway predictions. The horizontal axis presents the fraction of pathways selected as significant in both WTCCC and DGI studies. The vertical axis shows the fractions of significant pathways that are common between the two studies.

Copyright © 2012 John Wiley & Sons, Ltd. Statist. Med. 2012, 31 988–1000 B. SHAHBABA, C. M. SHACHAF AND Z. YU

Yi Bernoulli.i /iD 1;:::;1500 Xp qj logit.i / D .1/ ˇj Xij j D1

ˇj Beta.2; 8/ j D 1;:::;p

qj Bernoulli.0:5/ where Xij is the observed value of the j th selected SNP for the ith individual. We choose the number of significant SNPs and the distribution of their effects such that the range of AUC is reasonable and realistic. Using the data generated according to the above model, the pathway analysis methods consid- ered here should select the first five pathways as relatively more significant compared with the remaining 45 pathways. We repeat the above procedure to obtain 100 simulated data sets for which the proportion of each group for the phenotype variable is at least 0.10. The sizes of the pathways, the set of the significant SNPs, and their corresponding ˇ’s could vary among these data sets. Table I shows the averages (over 100 simulated data sets) of AUC on the basis of the three methods. As we can see, our BGSAsnp model outperforms both GSEAsnp and ALIGATOR. The differences between BGSAsnp and the two alternative methods are statistically significant. Using paired t-test, p-values are less than 0.05. For the second simulation, we follow the same procedure as the first simulation, but this time we assume that the assignments of genes to the pathways are unknown for 150 of the genes (out of 500). Further, we assume that the missing information regarding the pathway assignments occurs randomly, regardless of the strength of the association between genes and the phenotype. To this end, we remove 150 genes from their pathways and assume that the correct pathways can be predicted with 0.7 prob- ability; with the probabilities of 0.2 and 0.1, the gene could be assigned to two other pathways by mistake. For GSEAsnp and ALIGATOR, we only use genes with known pathways. For our method, we use two different models; one model uses genes with known pathways only (similar to GSEAsnp and ALIGATOR), and the other model includes all genes while taking the uncertainty regarding the pathway assignments for the 150 genes with unknown pathways into account. As before, we repeat the above simulation procedure to obtain 100 data sets with at least 10% of each phenotype group. The set of genes with unknown pathways, the sizes of the pathways, the set of significant SNPs, and their corresponding ˇ’s could change from one data set to another. Table I shows the averages (over 100 simulated data sets) of AUC for the three methods along with their standard errors. Removing genes with unknown pathways negatively affects the performances of all three methods. Including genes with unknown pathways in BGSAsnp, while taking the uncertainty of the predictions into account, leads to substantial improvement such that the overall performance of the model is only marginally worse than the BGSAsnp model in the first simulation (where the pathways were assumed to be known for all genes).

5. Discussion

We have proposed a new approach for identifying significant pathways on the basis of GWAS. In this approach, we aggregate SNP genotype data to obtain gene-level summary measures. We then use a hier- archical Bayesian model with a mixture prior to divide pathways into relevant and irrelevant groups, while taking the uncertainty regarding the pathway assignments into account. Further, we have proposed

Table I. Comparing our model, BGSAsnp, with GSEAsnp and ALIGATOR on the basis of the area under the ROC curve (AUC). BGSAsnp AUC% GSEAsnp ALIGATOR Without prediction With prediction Simulation 1 89.51 (0.89) 88.82 (0.71) 90.96 (0.80) –

Simulation 2 80.48 (1.20) 79.46 (1.11) 80.97 (1.25) 90.56 (0.83) 997 The corresponding standard errors are shown in parentheses.

Copyright © 2012 John Wiley & Sons, Ltd. Statist. Med. 2012, 31 988–1000 B. SHAHBABA, C. M. SHACHAF AND Z. YU

to use a classification model to assign genes with unknown pathways to a small subset (three) of prede- fined pathways with different probabilities. Using simulated and real data from WTCCC and DGI, we have showed how our method can be useful for identifying significant pathways on the basis of GWAS. It is worth noting that Perry et al. [59] have also used the data from WTCCC and DGI to evaluate 439 pathways (on the basis of GO, BioCarta, and KEGG) with respect to their relationship to type 2 diabetes. They concluded that no pathways were associated with type 2 diabetes at 0.05 significance level after Bonferroni adjustment. When discussing the limitations of their study, they mention that their method is ‘limited by the accuracy and completeness of publicly available pathway databases.’ Indeed, this is the motivation behind our proposed method for incorporating the uncertainty regarding the pathways assign- ments into our model and predicting pathways to obtain a more complete database. We also used ALIGA- TOR for the data from WTCCC. By setting the false discovery rate to 0.05, we identified seven pathways as significant using this method. These pathways are acetylcholine binding, amino acid ligase activity, amine binding, enzyme regulator activity, GTPase regulator activity, receptor signaling protein serine threonine kinase activity,andsmall conjugating protein ligase activity. The identified pathways based on ALIGATOR are different from those we identified using our method. Because extensive research is required to verify these results, comparing the two methods on the basis of real data is not a trivial task. In this paper, we have tried to keep a balance between the performance of our model and its com- plexity so the potential improvement in the performance (in terms of identifying significant pathways) of our model does not lead to high computational cost. Indeed, the computational cost of our model is comparable with that of ALIGATOR, and it is much lower than the implementation of GSEAsnp avail- able from the SNPath package. For each simulated data (discussed in Section 4), the CPU time for 5500 MCMC iterations of our model is 158 s. The CPU times for the R implementations of ALIGATOR and GSEAsnp in the SNPath package (with their default settings) are 142 and 943 s, respectively. For simplicity, our model aggregates SNP genotype data in two steps; first, it aggregates them to obtain gene-level summary measures; then, it treats these summary measures as observed data and aggregates them to obtain pathway-level measures of significance. Future research could involve extending our model so that it performs these two steps within one hierarchical model. That is, at each iteration of 2 MCMC, we can sample j for each gene at the lower level of the hierarchical model and use them 2 (instead of the summary measures Z) to update the higher level hyperparameter s . By doing so, our model takes the uncertainty of estimates at the gene level into account. When predicting unknown pathways, the performance of our model can be improved by using other sources of information besides the InterPro domains. For example, we can use the Gene Functional Classification Tool in DAVID Bioinformatics Resources (National Institute of Allergy and Infectious Diseases/National Institutes of Health) to divide genes into subgroups and use this information along with the InterPro domains to improve classification. The Functional Classification Tool groups genes based on their functional similarity and generates a gene-to-gene similarity matrix on the basis of shared functional annotation by using over 75,000 terms from 14 functional annotation sources. Although one would expect to obtain better model performance (e.g., better prediction of pathways) by combining several sources of information, many researchers have found that simply concatenating data may negatively affect the performance of the model [60]. To address this issue, we previously pro- posed an alternative hierarchical Bayesian model in which different sources of data are combined such that their relative contributions to the classifier are automatically adjusted [61]. We showed that this strategy successfully improved the performance of the predictive model. To further improve the prediction accuracy of our classification model, we can use nonlinear classifiers instead of MNL. For example, we can employ a nonlinear model that uses a Dirichlet process mixture of MNLs [62].

Acknowledgements

We are grateful to Dr. Brian Browning for providing genotype calls of the WTCCC data by using Beaglecall. Z. Y. was supported in part by grant NIH/R01 HG004960. B. S. was supported by Grant Number UL1 RR031985 from the National Center for Research Resources (NCRR), a component of the National Institutes of Health (NIH) and the NIH Roadmap for Medical Research. This study makes use of data generated by the Diabetes Genetics Initia-

998 tive (DGI) and by the Wellcome Trust Case–Control Consortium (WTCCC). A full list of the investigators who contributed to the generation of the WTCCC data is available from www.wtccc.org.uk. Funding for the WTCCC project was provided by the Wellcome Trust under award 076113.

Copyright © 2012 John Wiley & Sons, Ltd. Statist. Med. 2012, 31 988–1000 B. SHAHBABA, C. M. SHACHAF AND Z. YU

References 1. Manolio TA, Collins FS, Cox NJ, et al. Finding the missing heritability of complex diseases. Nature 2009; 461:747–753. DOI: 10.1038/nature08494. 2. Visscher PM. Sizing up human height variation. Nature Genetics 2008; 40:489–490. DOI: 10.1038/ng0508-489. 3. Virtaneva K, Wright FA, Tanner SM, et al. Expression profiling reveals fundamental biological differences in acute myeloid leukemia with isolated trisomy 8 and normal cytogenetics. Proceedings of the National Academy of Sciences 2001; 98:1124–1129. DOI: 10.1073/pnas.98.3.1124. 4. Pavlidis P, Lewis D, Noble W. Exploring gene expression data with class scores. Proceedings of the Seventh Annual Pacific Symposium on Biocomputing, Kauai, Hawaii, 2002; 474–485. 5. Mootha VK, Lindgren CM, Eriksson KF, et al.PGC-1alpha-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nature Genetics 2003; 4:267–273. DOI: 10.1038/ng1180. 6. Smyth GK. Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Statistical Applications in Genetics and Molecular Biology 2004; 3:Article 3. DOI: 10.2202/1544-6115.1027. 7. Rahnenfuhrer J, Domingues F, Maydt J, Lengauer T. Calculating the statistical significance of changes in pathway activ- ity from gene expression data. Statistical Applications in Genetics and Molecular Biology 2004; 3:Article 16. DOI: 10.2202/1544-6115.1055. 8. Subramanian A, Tamayo P, Mootha VK, et al. Gene set enrichment analysis: a knowledge-based approach for interpret- ing genome-wide expression profiles. Proceedings of the National Academy of Sciences 2005; 102:15545–15550. DOI: 10.1073/pnas.0506580102. 9. Barry WT, Nobel AB, Wright FA. Significance analysis of functional categories in gene expression studies: a structured permutation approach. Bioinformatics 2005; 21:1943–1949. DOI: 10.1093/bioinformatics/bti260. 10. Zahn JM, Sonu R, Vogel H, et al. Transcriptional profiling of aging in human muscle reveals a common aging signature. PLoS Genetics 2006; 2:e115. DOI: 10.1371/journal.pgen.0020115. 11. Efron B, Tibshirani R. On testing the significance of sets of genes. Annals of Applied Statistics 2007; 1:107–129. DOI: 10.1214/07-AOAS101. 12. Muller FJ, Laurent LC, Kostka D, et al. Regulatory networks define phenotypic classes of human stem cell lines. Nature 2008; 455:401–405. DOI: 10.1038/nature07213. 13. Wang K, Li M, Bucan M. Pathway-based approaches for analysis of genomewide association studies. American journal of human genetics 2007; 81:1278–1283. DOI: 10.1086/522374. 14. Tintle N, Lantieri F, Lebrec J, Sohns M, Ballard D, Bickeboller H. Inclusion of a priori information in genome-wide association analysis. Genetic Epidemiology 2009; 33(Suppl 1):74–80. DOI: 10.1002/gepi.20476. 15. Peng G, Luo L, Siu H, et al. Gene and pathway-based second-wave analysis of genome-wide association studies. European Journal of Human Genetics 2010; 18:111–117. DOI: 10.1038/ejhg.2009.115. 16. Holmans P, Green EK, Pahwa JS, et al. Gene ontology analysis of GWA study data sets provides insights into the biology of bipolar disorder. The American Journal of Human Genetics 2009; 85:13–24. DOI: 10.1016/j.ajhg.2009.05.011. 17. Yu K, Li Q, Bergen AW, et al. Pathway analysis by adaptive combination of P -values. Genetic Epidemiology 2009; 33:700–709. DOI: 10.1002/gepi.20422. 18. Chen L, Zhang L, Zhao Y, et al. Prioritizing risk pathways: a novel association approach to searching for disease pathways fusing SNPs and pathways. Bioinformatics 2009; 25:237–242. DOI: 10.1093/bioinformatics/btn613. 19. O’Dushlaine C, Kenny E, Heron EA, et al. The SNP ratio test: pathway analysis of genome-wide association datasets. Bioinformatics 2009; 25:2762–2763. DOI: 10.1093/bioinformatics/btp448. 20. Chai HS, Sicotte H, Bailey KR, Turner ST, Asmann YW, Kocher JP. GLOSSI: a method to assess the association of genetic loci-sets with complex diseases. BMC Bioinformatics 2009; 10(102). DOI: 10.1186/1471-2105-10-102. 21. Chasman DI. On the utility of gene set methods in genomewide association studies of quantitative traits. Genetic Epidemiology 2008; 32:658–668. DOI: 10.1038/nrg2884. 22. De la Cruz O, Wen X, Ke B, Song M, Nicolae DL. Gene, region and pathway level analyses in whole-genome studies. Genetic Epidemiology 2010; 34:222–231. DOI: 10.1002/gepi.20452. 23. Zhang K, Cui S, Chang S, Zhang L, Wang J. i-GSEA4GWAS: a web server for identification of pathways/gene sets asso- ciated with traits by applying an improved gene set enrichment analysis to genome-wide association study. Nucleic Acids Research 2010; 38:W90–95. DOI: 10.1093/nar/gkq324. 24. Schwender H, Ruczinski I, Ickstadt K. Testing SNPs and sets of SNPs for importance in association studies. Biostatistics 2011; 12:18–32. DOI: 10.1093/biostatistics/kxq042. 25. Nam D, Kim J, Kim SY, Kim S. GSA-SNP: a general approach for gene set analysis of polymorphisms. Nucleic Acids Research 2010; 38:W749–754. DOI: 10.1093/nar/gkq428. 26. Luo L, Peng G, Zhu Y, Dong H, Amos CI, Xiong M. Genome-wide gene and pathway analysis. European Journal of Human Genetics 2010; 18:1045–1053. DOI: 10.1038/ejhg.2010.62. 27. Guo YF, Li J, Chen Y, Zhang LS, Deng HW. A new permutation strategy of pathway-based approach for genome-wide association study. BMC Bioinformatics 2009; 10:429. DOI: 10.1186/1471-2105-10-429. 28. Cantor RM, Lange K, Sinsheimer JS. Prioritizing GWAS results: a review of statistical methods and recommendations for their application. American Journal of Human Genetics 2010; 86:6–22. DOI: 10.1016/j.ajhg.2009.11.017. 29. Kraft P, Raychaudhuri S. Complex diseases, complex genes: keeping pathways on the right track. Epidemiology 2009; 20:508–511. DOI: 10.1097/EDE.0b013e3181a93b98. 30. Thomas DC, Conti DV, Baurley J, Nijhout F, Reed M, Ulrich CM. Use of pathway information in molecular epidemiology. Human Genomics 2009; 4:21–42. 31. Chen X, Wang L, Hu B, Guo M, Barnard J, Zhu X. Pathway-based analysis for genome-wide association studies using 999 supervised principal components. Genetic Epidemiology 2010; 34:716–724. DOI: 10.1002/gepi.20532.

Copyright © 2012 John Wiley & Sons, Ltd. Statist. Med. 2012, 31 988–1000 B. SHAHBABA, C. M. SHACHAF AND Z. YU

32. Zhong H, Yang X, Kaplan LM, Molony C, Schadt EE. Integrating pathway analysis and genetics of gene expression for genome-wide association studies. American journal of human genetics 2010; 86:581–591. DOI: 10.1016/j.ajhg.2010.02.020. 33. Elbers CC, Eijk KR, Franke L, et al. Using genome-wide pathway analysis to unravel the etiology of complex diseases. Genetic Epidemiology 2009; 33:419–431. DOI: 10.1002/gepi.20395. 34. Wang K, Li M, Hakonarson H. Analysing biological pathways in genome-wide association studies. Nature Review Genetics 2010; 11:843–854. DOI: 10.1038/nrg2884. 35. Shahbaba B, Tibshirani R, Shachaf CM, Plevritis SK. Bayesian gene set analysis for identifying significant biological pathways. Journal of the Royal Statistical Society, Series C 2011; 60:541–557. DOI: 10.1111/j.1467-9876.2011.00765.x. 36. Lewinger JP, Conti DV, Baurley JW, Triche TJ, Thomas DC. Hierarchical Bayes prioritization of marker associa- tions from a genome-wide association scan for further investigation. Genetic Epidemiology 2007; 3:871–882. DOI: 10.1002/gepi.20248. 37. Sasieni PD. From genotypes to genes: doubling the sample size. Biometrics 1997; 53:1253–1261. 38. Gelman A. Prior distributions for variance parameters in hierarchical models. Bayesian Analysis 2006; 1:515–533. DOI: 10.1214/06-BA117A. 39. George EI, McCulloch RE. Variable selection via Gibbs sampling. Journal of the American Statistical Association 1993; 88:881–889. 40. Ishwaran H, Rao JS. Spike and slab gene selection for multigroup microarray data. Journal of the American Statistical Association 2005; 100:764–780. DOI: 10.1198/016214505000000051. 41. Tanner MA, Wong WH. The calculation of posterior distributions by data augmentation. Journal of the American Statistical Association 1987; 82:528–540. 42. Neal RM. Slice sampling. Annals of Statistics 2003; 31:705–767. DOI: 10.1214/aos/1056562461. 43. Hahne F, Mehrle A, Arlt D, Poustka A, Wiemann S, Beissbarth T. Extending pathways based on gene lists using InterPro domain signatures. BMC Bioinformatics 2008; 9:3. DOI: 10.1186/1471-2105-9-3. 44. Froehlich H, Fellmann M, Sueltmann H, Poustka A, Beissbarth T. Predicting pathway membership via domain signatures. Bioinformatics 2008:2137–2142. DOI: 10.1093/bioinformatics/btn403. 45. Mulder NJ, Apweiler R, Attwood TK, et al. New developments in the interPro database. Nucleic Acids Research 2007; 35:224–228. DOI: 10.1093/nar/gkl841. 46. Biro FM, Wien M. Childhood obesity and adult morbidities. American Journal of Clinical Nutrition 2010; 91: 1499S–1505S. DOI: 10.3945/?ajcn.2010.28701B. 47. Burton PR, Clayton DG, Cardon LR, et al. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 2007; 447:661–678. DOI: 10.1038/nature05911. 48. Saxena R, Voight BF, Lyssenko V, et al. Genome-wide association analysis identifies loci for type 2 diabetes and triglyceride levels. Science 2007; 316:1331–1336. 49. Browning BL, Yu Z. Simultaneous genotype calling and haplotype phasing improves genotype accuracy and reduces false-positive associations for genome-wide association studies. American journal of human genetics 2009; 85:847–861. DOI: 10.1016/j.ajhg.2009.11.004. 50. Purcell S, Neale B, Todd-Brown K, et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. American journal of human genetics 2007; 81:559–575. DOI: 10.1086/519795. 51. Cordes CM, Bennett RG, Siford GL, Hamel FG. Redox regulation of insulin degradation by insulin-degrading enzyme. PLoS ONE 2011; 6:e18138. DOI: 10.1371/journal.pone.0018138. 52. Nardai G, Stadler K, Papp E, Korcsmros T, Jakus J, Csermely P. Diabetic changes in the redox status of the microsomal protein folding machinery. Biochemical and Biophysical Research Communications 2005; 334:787–795. 53. Garcia C, Feve B, Ferreacute P, et al. Diabetes and inflammation: fundamental aspects and clinical implications. Diabetes & Metabolism 2010; 36:327–338. DOI: 10.1016/j.bbrc.2005.06.172. 54. Arora P, Garcia-Bailo B, Dastani Z, et al. Genetic polymorphisms of innate immunity-related inflammatory pathways and their association with factors related to type 2 diabetes. BMC Medical Genetics 2011; 12:95. DOI: 10.1186/1471- 2350-12-95. 55. Pedchenko VK, Chetyrkin SV, Chuang P, et al. Mechanism of perturbation of integrin-mediated cell-matrix interactions by reactive carbonyl compounds and its implication for pathogenesis of diabetic nephropathy. Diabetes 2005; 54:2952–2960. DOI: 10.2337/diabetes.54.10.2952. 56. Krieglstein CF, Granger DN. Adhesion molecules and their role in vascular disease. American Journal of Hypertension 2001; 14:44S-54S. DOI: 10.1016/S0895-7061(01)02069-6. 57. Wysocki J, Ye M, Soler MJ, et al. ACE and ACE2 Activity in Diabetic Mice. Diabetes 2006; 55:2132–2139. DOI: 10.2337/db06-0033. 58. Ziyadeh FN, Sharma K. Combating diabetic nephropathy. Journal of the American Society of Nephrology 2003; 14:1355–1357. DOI: 10.1097/01.ASN.0000065608.37756. 59. Perry JRB, McCarthy MI, Hattersley AT, et al. Interrogating type 2 diabetes genome-wide association data using a biological pathway-based approach. Diabetes 2009; 58:1463–1467. DOI: 10.2337/db08-1378. 60. King RD, Karwath A, Clare A, Dehaspe L. The utility of different representations of protein sequence for predicting functional class. Bioinformatics 2001; 17:445–454. DOI: 10.1093/bioinformatics/17.5.445. 61. Shahbaba B, Neal RM. Gene function classification using Bayesian models with hierarchy-based priors. BMC Bioinfor- matics 2006; 7:448. DOI: 10.1186/1471-2105-7-448. 62. Shahbaba B, Neal RM. Nonlinear models using Dirichlet process mixtures. Journal of Machine Learning Research 2009; 10:1829–1850. 1000

Copyright © 2012 John Wiley & Sons, Ltd. Statist. Med. 2012, 31 988–1000