
Intrinsic-Overlapping Co-expression Module Detection with Application to Alzheimer’s Disease

Hazel Nicolette Manners^a, Swarup Roy^{b,a,∗}, Jugal K Kalita^c

^a Dept of Information Technology, North Eastern Hill University, Shillong, Meghalaya, India
^b Department of Computer Applications, Sikkim University, Gangtok, Sikkim, India
^c Dept of Computer Science, University of Colorado, Colorado Springs, USA

Abstract

Genes interact with each other and may cause perturbation in the molecular pathways leading to complex diseases. Often, instead of any single gene, a subset of genes interact, forming a network, to share common biological functions. Such a subnetwork is called a functional module or motif. Identifying such modules and the central key genes in them that may be responsible for a disease may help design patient-specific drugs. In this study, we consider the neurodegenerative Alzheimer’s Disease (AD) and identify potentially responsible genes from functional motif analysis. We start from the hypothesis that central genes in genetic modules are more relevant to a disease that is under investigation and identify hub genes from the modules as potential marker genes. Motifs or modules are often non-exclusive or overlapping in nature. Moreover, they sometimes show intrinsic or hierarchical distributions with overlapping functional roles. To the best of our knowledge, no prior work handles both situations in an integrated way. We propose a non-exclusive clustering approach, CluViaN (Clustering Via Network), that can detect intrinsic as well as overlapping modules from gene co-expression networks constructed using microarray expression profiles. We compare our method with existing methods to evaluate the quality of the modules extracted. CluViaN reports the presence of intrinsic and overlapping motifs in different species not reported by other studies. We further apply our method to extract significant AD specific modules using CluViaN and rank them based on the number of genes from a module involved in the disease pathways. Finally, top central genes are identified by topological analysis of the modules. We use two different AD phenotype datasets for experimentation. We observe that central genes, namely PSEN1, APP, NDUFB2, NDUFA1, UQCR10, PPP3R1 and a few more, play a significant role in AD. Interestingly, our experiments also find a hub gene, PML, which has recently been reported to play a role in plasticity, circadian rhythms and the response to which can cause neurodegenerative disorders. MUC4, another hub gene that we found experimentally, is yet to be investigated for its potential role in AD. A software implementation of CluViaN in Java is available for download at https://sites.google.com/site/swarupnehu/publications/resources/CluViaNSoftware.rar.

Keywords: Functional Module, Clustering, Overlapping, Intrinsic, Co-expression, Alzheimer’s Disease (AD).

∗Corresponding author
Email addresses: [email protected] (Hazel Nicolette Manners), [email protected] (Swarup Roy), [email protected] (Jugal K Kalita)

Preprint submitted to Elsevier November 1, 2018

1. Introduction

Alzheimer’s disease is a neurodegenerative disease. It is a common form of dementia that leads to memory-related problems and changes in thinking and behavior. Studying biological pathways and identifying which genes take part in pathways leading to Alzheimer’s Disease (AD) may help us understand what goes wrong when the disease strikes. In turn, it may help design effective therapeutic drug molecules for AD. To date, studies have revealed three key genes that may be linked to autosomal dominant or familial early onset AD (FAD). These three genes are amyloid precursor protein (APP), presenilin 1 (PS1) and presenilin 2 (PS2). Apolipoprotein E (ApoE) has been found to be linked to late-onset Alzheimer’s disease [1]. Mutations that are linked to APP and PS proteins can lead to the production of Abeta peptides (Abeta42, specifically). The FAD-linked PS1 mutation has been found to lead to Endoplasmic Reticulum (ER) stress. It is important to identify additional key genes or possible regulators responsible for AD, if any.

Biological activities inside a cell are governed by a set of influential genes or proteins that regulate one another, forming network modules called regulatory or functional modules. Biologically, a regulatory or functional module [2] is a set of genes (a group of tightly interconnected nodes) that act collectively to perform a distinct biological function [3]. A module usually performs a useful biological task, but may even be responsible for complex diseases like cancer, Alzheimer’s or Parkinson’s. Regulatory modules are co-expressed, co-evolved and regulated by the same set of transcription factors to respond to different conditions. Some genes may even play multiple roles and become members of more than one module [4]. Much of a cell’s activity is organized as a network of interacting modules. Identifying regulatory modules is crucial in understanding cellular responses to various external or internal stimuli or signals. In turn, the process may help uncover the disease mechanisms in a living organism [5, 6]. Identifying such modules is helpful for a system level understanding of biological and cellular processes.

Clustering is a popular data analysis tool in genomic studies using gene-expression microarrays [7] because of its ability to group co-expressed genes with similar expression patterns, offering insights into various transcriptional and biological processes [8, 9, 10]. A large amount of work has been done to cluster functionally similar genes [11, 7, 12, 13, 14] by applying different forms of classical clustering algorithms along with their variations.

Recent advances in biological research have revealed that some genes or proteins play multiple functional roles in a cell. For example, it has been observed that the yeast gene CMR1/YDL156W participates in many DNA-metabolism processes such as replication, repair and transcription [15]. Out of 1,628 proteins in the hand-curated yeast complex data set [16], 207 proteins participate in more than one complex. It may not be possible to describe all these complexes using disjoint or fuzzy relationships [17, 18]. The genes responsible for such proteins are, therefore, expected to participate in different functional modules or complexes. They exhibit distinct overlapping and embedded structures.

Effectively finding functional modules that exhibit these features is an important step towards unveiling disease sub-modules. Further, it is observed that target factor (TF) genes, which are the possible key genes for AD or other diseases, are central genes in subgraphs associated with modules. Detecting such central genes may help prioritize potential drugs as well as further analysis of disease pathways. To achieve our goal, we contribute the following.

• As opposed to classical clustering techniques, which normally perform exclusive clustering, we propose a new non-exclusive [19] clustering algorithm, CluViaN (Clustering Via Network), to detect functionally enriched regulatory modules from a co-expression network in the presence of overlapping and intrinsic relationships.
• We also propose a new proximity measure to construct a weighted co-expression network.
• We use CluViaN as an intermediate step to discover AD sub-modules and rank them based on AD pathway enrichment scores.
• We further analyze top ranked modules topologically to identify central or hub genes, which are the potential key genes for AD.

We organize the rest of the paper as follows. Prior research on module finding techniques is reported in Section 2. We introduce a new distance measure and the CluViaN algorithm in Section 3. Performance of our method and ranking of genes responsible for AD are presented in Section 4. In Section 5, we summarize our work with concluding remarks.

2. Module Detection Techniques

When a group of genes interact with each other, the deviations in interactions among them in the molecular pathways may lead to complex diseases. Highly interactive genes in these groups, often called network modules, may contribute significantly to the biological function or disease. A network module is a subgraph derived from a gene interaction graph. A subset of functionally cohesive genes is usually topologically highly interconnected in a large gene network. As a result, reconstructing a gene-gene interaction graph or network is the first step towards module detection. The inferred network can then be used for the next step of module extraction or clustering. Next, we discuss the problem of network inference and module detection in a more formal way. A genetic network is a graph, and thus can be analyzed using the mathematical and computational tools associated with graphs.

Definition 1 (Gene co-regulation Network). A gene co-regulation network can be defined as a graph T = (G, E), where G denotes the set of N genes (nodes) {G1, G2, ··· , GN} participating in a common gene product formation process or biological process, and E is the set of edges {e1, e2, ··· , em} that correspond to the known interrelationships among the genes. An arc between two genes has a weight signifying their relative proximity. A higher weight means that two genes are less similar in their expression profiles. A positive weight indicates that the two genes are related.

The graph may be directed or undirected in nature, and called a Gene Regulatory Network (GRN) or Co-expression Network (GCN), respectively. Inferred networks may even be weighted, with the weights representing the strength of the interactions.

Sometimes it is difficult to infer the causality or regulatory relationships among genes or proteins due to lack of sufficient information in the available data sources. Such data allow only the inference of co-expression networks using simple associations, because they hide regulator and regulated relationships. The association in the form of co-expression may be measured in terms of statistical correlations, mutual information or some other similarity measure between the genes’ expressions. A plethora of inference methods have been proposed to infer regulatory or co-expression networks. Bayesian Networks [20] use the joint probability distribution to derive a Directed Acyclic Graph (DAG). Learning a Bayesian network is an NP-hard problem, and hence heuristic search models have to be used, without the guarantee of finding a global optimum. Butte et al. [21] compute comprehensive pair-wise mutual information (MI) for all genes in an expression dataset and construct a Relevance Network (RN). The CLR (Context-Likelihood of Relatedness) [22] algorithm modifies the MI score based on the empirical distribution of all MI scores. MRNET (Minimum Redundancy Networks) [23] uses an iterative feature selection method based on a maximum relevance/minimum redundancy criterion (MRMR) [24, 25]. MIDER (Network Inference with Mutual Information Distance and Entropy Reduction) [26] is another information theoretic approach proposed recently, which combines MI with an entropy reduction scheme to produce a directed graph from time-series expression data. ARACNE (Algorithm for the Reconstruction of Accurate Cellular Networks) [27] is a popular inference method which can be scaled to complex networks. ARACNE filters out indirect interactions from triplets of genes using the Data Processing Inequality. The inequality states that if two genes interact only through a third gene, then the MI between them is no larger than the MI of either of the directly interacting pairs. Based on certain conditions, ARACNE removes the weakest edge in each triplet. Because MI is symmetric, it cannot produce a directed network. ARACNe-AP [28], proposed recently, is an efficient implementation of ARACNE that improves on the computational performance of the original. It uses an adaptive partitioning approach to calculate MI between a pair of genes for constructing the network. Roy et al. proposed GeCoN, a pattern based co-expression network construction method with regulation information [29]. GENIE3 [30] is a Random Forest [31] based method that models the inference as a regression problem. FyNE (Fuzzy NEtworks) [32] is a correlation and fuzzy set theoretic co-expression inference method that incorporates ontological knowledge by using a fuzzy aggregation function. It focuses more on inferring quality networks in the presence of noise.

Any network module detection technique usually takes a network (weighted or unweighted) as input to extract functional modules. Formally, we can define a network module as follows.

Definition 2 (Network Module). Given a network $G$, a network module $M_i = (V', E')$ is a densely connected subgraph of $G$ ($M_i \subseteq G$), where the interconnectivity of $V'$ with respect to $E' \subseteq E$ is higher in comparison to the rest of $V$, i.e., $V - V'$.

Many effective algorithms have been proposed to detect network modules. However, they are not designed to detect both exclusive and intrinsic (or embedded) modules. An exclusive module is a subgraph such that none of its vertices belongs to any other module. In other words, a node belonging to a module cannot simultaneously be a member of any other module.

Definition 3 (Exclusive Module). Given a set of $k$ modules $\mathcal{M} = \{M_1, M_2, \cdots, M_k\}$ derived from $G$, a module $M_i$ is an exclusive module if $M_i(V) \cap M_n(V) = \phi$ for all $n = 1, \cdots, k$ with $i \neq n$.

A majority of the network module extraction methods use two steps to extract the modules. First, the method generates a network (weighted or unweighted) from the gene expression data. However, in the case of protein modules (popularly called complexes), the step of network formation can be skipped as the input is a network itself. Next, the method extracts modules from the network. It applies traditional clustering on the network adjacency matrix by treating the matrix as a distance matrix. Some well-known and widely used techniques are discussed below.

Weighted correlation (or co-expression) network analysis (WGCNA) [33] is a popular fuzzy module detection method which can be used to study relationships among co-expressed modules. It uses soft thresholding for creating the weighted network. Using a hierarchical clustering method, it detects modules from the weighted network. This method can detect overlapping but not intrinsic modules. Another fuzzy method called FUMET (Fuzzy Network Module Extraction Technique) [34] uses the NMRS (Normalized Mean Residue Similarity) measure for network creation and applies the topological similarity measure (TOM) with an incremental approach to cluster genes based on a membership function. One of the drawbacks of this method is that the number of modules needs to be provided as input. Module Miner [35] detects non-overlapping modules using NMRS and TOM for network creation and module detection. It is based on the concepts of connected graphs and minimum spanning trees. It uses an iterative network splitting approach and continues splitting until a pre-specified convergence criterion is satisfied. Spectral-based community identification [36] deals with a local scaling property exhibited by many module detection algorithms, where some modules are tightly correlated and others are loosely correlated. It uses a rank-based approach where the highest ranked genes are connected to the seed gene. Maximizing the modularity is one of the objectives of this method. Similar to FUMET, it also needs the number of modules as input. Wu et al. [6] propose a Markov Clustering (MCL) module detection method that uses the Pearson correlation measure to create a weighted network and uses the concepts of random walks and bootstrapping to extract modules from the network. Molecular Complex Detection (MCODE) [37] is a well-known protein complex detection algorithm. Although it is used for protein-protein interaction networks, it is also effective for gene interaction networks. It uses the concepts of density and seed genes, where traversal is performed outward from a seed to extract locally dense regions. It can be run with the fluff option to increase the size of the module or the haircut option, which removes vertices that are singly connected to the module. Fast Agglomerate Algorithm (FAG-EC) [38] is based on edge clustering coefficients, which in turn are based on the number of common neighbors. It uses an agglomerative approach, which merges modules until a condition is met. It is very fast and can be used on large biological networks.

In reality, biological modules overlap or are non-exclusive in nature, and may also be embedded, depending on the density of the interactions among the modules. A single gene that belongs to a particular group may be a member of another group and play multiple biological roles. Further, a compact module may exhibit intrinsic structure, that is, a module within a module. These embedded modules can be extracted or distinguished based on density. In addition to performing a specific function, an intrinsic module also performs the function of the mother module within which it is embedded. Thus, intrinsic modules can be treated as a special case of overlapping modules.

Table 1: Comparison of different functional module detection techniques

Sl. No.  Technique        Approach                      Detects Overlapping  Detects Intrinsic
1        WGCNA            Hierarchical, Fuzzy           Yes                  No
2        FUMET            Fuzzy                         Yes                  No
3        Module Miner     Spanning Tree-based           No                   No
4        Spectral-Based   Partitioning, Modularity      No                   No
5        MCL-Based        Bootstrapping, Random walk    No                   No
6        MCODE            Density                       Yes                  No
7        FAG-EC           Agglomerative                 Yes                  No

Some existing methods can detect significant modules, whether exclusive or non-exclusive, but none attempts to detect intrinsic structures. A summary of the various module finding methods is given in Table 1.

3. Detecting Non-Exclusive and Intrinsic Modules

We introduce two important types of module structures, namely non-exclusive and intrinsic modules, that can be found not only in biological networks but in most real-life large graphs, including social networks, citation networks and the World Wide Web. As opposed to a vertex exclusive to a module, a vertex in a non-exclusive module can be a member of more than one module at the same time.

Definition 4 (Non-Exclusive Modules). Two modules $M_i = (V_i, E_i)$ and $M_j = (V_j, E_j)$, where $M_i$ and $M_j \in \mathcal{M}$, are non-exclusive if $V_i \cap V_j \neq \phi$.

A module, whether exclusive or non-exclusive, needs a measure to quantify the inter-connectedness of nodes through edges. We use the well-known metric called graph density [39] to quantify connectedness within a module. It is defined as follows.

Definition 5 (Module Density). Given a module $M_i = \{V_i, E_i\}$, the density of the module $\rho(M_i)$ is defined as

$$\rho(M_i) = \frac{\omega(E_i)}{|V_i|}, \tag{1}$$

where $\omega(E_i)$ is the sum of weights of the edges $E_i$, and $|V_i|$ is the number of nodes in $M_i$.
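As a small illustration of Equation 1, the density of a module could be computed as in the following Python sketch. The function name `module_density` and the edge-dictionary input format are assumptions made for this example and are not part of the CluViaN software.

```python
def module_density(nodes, weighted_edges):
    """Density rho(M) = (sum of edge weights inside M) / |V|, following Equation 1.

    nodes: iterable of node identifiers forming the module.
    weighted_edges: dict mapping (u, v) pairs to weights (hypothetical input format).
    """
    node_set = set(nodes)
    if not node_set:
        raise ValueError("a module needs at least one node")
    # Only edges with both endpoints inside the module contribute to omega(E_i).
    total_weight = sum(
        w for (u, v), w in weighted_edges.items()
        if u in node_set and v in node_set
    )
    return total_weight / len(node_set)


# Toy network: the edge (g3, g4) leaves the module and is ignored.
edges = {("g1", "g2"): 0.2, ("g2", "g3"): 0.4, ("g1", "g3"): 0.3, ("g3", "g4"): 0.9}
density = module_density(["g1", "g2", "g3"], edges)
```

For the toy module above, the three internal edges sum to 0.9, so the density is 0.9 / 3 = 0.3.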

Modules within a module, or intrinsic modules, are another form of overlapping modules where modules share a subset-superset relationship.

Definition 6 (Intrinsic Modules). A module $M_i$ is embedded or intrinsic within $M_j$ if $M_i \subset M_j$ and the connectedness density of nodes within $M_i$ is significantly different from the density of $M_j$:

$$\mathrm{Intrinsic}(M_i, M_j) = \begin{cases} 1, & \text{if } M_i \subset M_j \text{ and } |\rho(M_i) - \rho(M_j)| > \tau \\ 0, & \text{otherwise,} \end{cases} \tag{2}$$

where $\tau$ is a user defined significant density difference threshold.

To detect overlapping and intrinsic modules, clustering is a better approach, as graph theoretic approaches for detecting dense subgraphs are expensive. However, it is inappropriate to apply traditional clustering, as traditional approaches are exclusive by definition. Unlike traditional clustering, in non-exclusive clustering an object may simultaneously be a member of more than one cluster.
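The indicator in Equation 2 can be read as a two-part test: a proper-subset check on the node sets followed by a density comparison. The sketch below is a minimal Python rendering under the assumption that the two densities have already been computed; the name `is_intrinsic` and its argument layout are hypothetical.

```python
def is_intrinsic(inner_nodes, outer_nodes, rho_inner, rho_outer, tau):
    """Return 1 if the inner module is intrinsic within the outer one (Equation 2), else 0.

    rho_inner and rho_outer are precomputed module densities; tau is the
    user defined density difference threshold.
    """
    inner, outer = set(inner_nodes), set(outer_nodes)
    is_proper_subset = inner < outer  # strict subset, matching M_i ⊂ M_j
    return int(is_proper_subset and abs(rho_inner - rho_outer) > tau)


embedded = is_intrinsic({"a", "b"}, {"a", "b", "c"}, rho_inner=0.9, rho_outer=0.4, tau=0.3)
```

With a density gap of 0.5 and τ = 0.3, the inner module in this toy case is flagged as intrinsic; raising τ above the gap, or breaking the subset relation, yields 0.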

Definition 7 (Non-Exclusive Clustering). Given a database $D = \{x_1, x_2, \cdots, x_n\}$ of objects and an integer value $k$, the non-exclusive clustering problem is to define a mapping $f : D \rightarrow \{1, \cdots, k\}$ where each $x_i$ may be assigned simultaneously to one or more clusters from a set of clusters $C = \{C_1, C_2, \cdots, C_k\}$. Cluster $C_i$ contains those objects that are mapped to it alone and also those objects that belong to other clusters $C_j$ of $C$ (for all $j = 1, \cdots, k$ with $j \neq i$).

We use a co-expression network as input to the module detection step. To reconstruct co-expression relationships, we calculate the degree of association between pairs of genes based on their relative expression levels. Below we discuss the co-expression network reconstruction step.
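To make the non-exclusive assignment of Definitions 4 and 7 concrete, the following Python sketch inverts a module-to-gene listing into a gene-to-modules map and reports the genes that belong to more than one module. The function names and the toy gene labels are assumptions for illustration only.

```python
from collections import defaultdict


def gene_memberships(modules):
    """Map each gene to the set of module ids that contain it."""
    memberships = defaultdict(set)
    for module_id, genes in modules.items():
        for gene in genes:
            memberships[gene].add(module_id)
    return dict(memberships)


def non_exclusive_genes(modules):
    """Genes assigned to more than one module, i.e. overlapping genes."""
    return {g for g, ids in gene_memberships(modules).items() if len(ids) > 1}


# Toy clustering: g3 is shared by modules 1 and 2, module 3 is exclusive.
modules = {1: {"g1", "g2", "g3"}, 2: {"g3", "g4"}, 3: {"g5"}}
shared = non_exclusive_genes(modules)
```

Modules 1 and 2 are non-exclusive in the sense of Definition 4 because their gene sets intersect in `g3`, while module 3 is exclusive.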

3.1. Weighted co-expression network reconstruction

In order to perform non-exclusive clustering on gene expression data, we first construct a weighted gene co-expression network and use the adjacency matrix of the network as a distance table to apply density-based clustering. Considering all connected pairs, we compute the weighted adjacency matrix as:

$$A(i, j) = \begin{cases} \delta(G_i, G_j), & \text{if } \delta(G_i, G_j) \le \epsilon \\ \infty, & \text{otherwise,} \end{cases} \tag{3}$$

where $\delta(G_i, G_j)$ is the degree of association between a pair of genes. Two genes are strongly connected if the distance between them does not exceed a threshold $\epsilon$. Using the above weighted adjacency matrix, we perform non-exclusive clustering to detect overlapping intrinsic co-expression gene modules. Below, we discuss a new association measure to calculate the proximity between pairs of genes, based on their related expression levels.
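The thresholding step of Equation 3 can be sketched in Python as follows, assuming that a pairwise distance matrix has already been computed with some association measure. The function name `weighted_adjacency` and the list-of-lists representation are assumptions for this example.

```python
import math


def weighted_adjacency(dist, eps):
    """Apply Equation 3: keep distances up to eps, mark all other pairs as infinity.

    dist: square matrix (list of lists) of pairwise gene distances.
    eps: distance threshold for a strong connection.
    """
    n = len(dist)
    adjacency = [[math.inf] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if dist[i][j] <= eps:
                adjacency[i][j] = dist[i][j]
    return adjacency


dist = [
    [0.0, 0.2, 0.8],
    [0.2, 0.0, 0.5],
    [0.8, 0.5, 0.0],
]
adjacency = weighted_adjacency(dist, eps=0.6)
```

In this toy matrix the pair (0, 2) has distance 0.8 > 0.6 and is therefore disconnected, whereas (0, 1) and (1, 2) keep their distances as edge weights.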

3.2. A New Association Measure

A network module extraction method takes a gene network as input. As a result, inference of a gene network from the expression data must be performed first. If the inferred network closely resembles the real network, the extracted modules are more likely to be biologically significant and to mimic real-life data. Even if the module extraction method is close to perfect, it will produce irrelevant results when the network inference gives false interactions, or shows strong interaction between dissimilar pairs of genes or vice versa. Hence, a good association measure is required to capture the interactions and the strength of these interactions, forming a weighted network.

Networks can be inferred directly by using a correlation measure, which assigns to each pair of genes a weight equal to their correlation. Most techniques use either Euclidean distance, Pearson correlation [40, 41], Spearman correlation, Kendall correlation [42] or Mutual Information (MI) as a measure of proximity for a pair of genes. However, Euclidean distance does not perform well in handling gene expression data that includes shifting or scaled patterns (or profiles) [13]. Pearson’s correlation coefficient measures the similarity between the shapes of two expression patterns (profiles). However, it is not robust with respect to outliers [43] and is unable to handle scaling and shifting patterns effectively, thus potentially yielding false positives which assign a high similarity score to a pair of dissimilar patterns. Spearman correlation cannot detect either shifting or scaling patterns. An alternate measure of proximity is Mutual Information (MI). A study has shown that MI-based scores are better compared to the other distance measures [44]. However, the discrete form of the MI measure, used by most techniques, requires discretization of the continuous expression values, leading to information loss. Also, different discretization methods may result in different MI values. A normalized measure, Normalized Mutual Information (NMI), can scale MI to the range 0 (no correlation) to 1 (perfect correlation).

We use a simple parametric distance measure to compute the proximity between two gene expressions.

Definition 8 (Distance). Given two expression vectors $x = \langle x_1, x_2, \cdots, x_n \rangle$ and $y = \langle y_1, y_2, \cdots, y_n \rangle$ for the genes $x$ and $y$, the distance $\delta(x, y)$ between the two genes can be calculated by taking the normalized difference of the deviations from the mean of the two expression profiles:

$$\delta(x, y) = \frac{\sum_{i=1}^{n} |(x_i - \bar{x}) - (y_i - \bar{y})|}{\sum_{i=1}^{n} \left(|x_i - \bar{x}| + |y_i - \bar{y}|\right)}, \tag{4}$$

where $\bar{x}$ and $\bar{y}$ denote the mean expression values of $x$ and $y$.

The value of $\delta(x, y)$ ranges within [0, 1]: $\delta(x, y) = 0$ if $x$ and $y$ are exactly similar in terms of co-expression, and 1 if they are completely different from each other or negatively correlated. Our proximity measure satisfies all three properties of a distance measure, viz., symmetry, non-negativity and the triangle inequality. The proofs of these properties are provided as supplementary material.

For effective network creation, the association score plays an important role, because relative association scores determine the functional proximity between a pair of genes. We perform a short experiment to assess the robustness of different association measures in detecting association among genes with scaling and shifting expression patterns. We use random gene expression patterns (Figure 1(a)) with scaling and shifting patterns. Gene A is a random gene expression pattern. Gene F is negatively correlated with, and gradually shifted from, gene A. The gene expression patterns B to E are intermediately correlated, lying between genes A and F. We report the similarity scores given by different measures in Figure 1(b). The figure shows that Kendall, Pearson and Spearman produce similar scores and fail to detect scaling and shifting patterns. Relatively less similar patterns, such as A-D or A-E, are scored as strongly correlated, positively or negatively.
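As a concrete reading of Equation 4, the distance can be sketched in Python as below. The function name `expression_distance` is an assumption, and returning 0 for two completely flat profiles (where the denominator vanishes) is a choice made for this example, since the text does not specify that case.

```python
def expression_distance(x, y):
    """Distance of Equation 4 between two expression profiles of equal length."""
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("profiles must be non-empty and of equal length")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Numerator: absolute difference of the mean-centered values, summed over samples.
    numerator = sum(abs((a - mean_x) - (b - mean_y)) for a, b in zip(x, y))
    # Denominator: total absolute deviation of both profiles from their means.
    denominator = sum(abs(a - mean_x) + abs(b - mean_y) for a, b in zip(x, y))
    if denominator == 0:
        return 0.0  # assumption: two flat profiles are treated as identical
    return numerator / denominator


base = [1, 2, 3, 4]
shifted = expression_distance(base, [11, 12, 13, 14])   # same shape, shifted by 10
reversed_ = expression_distance(base, [4, 3, 2, 1])     # perfectly anti-correlated
scaled = expression_distance(base, [2, 4, 6, 8])        # same shape, scaled by 2
```

On these toy profiles, the shifted pair has distance 0 and the anti-correlated pair has distance 1, in line with the stated range, while the profile scaled by a factor of 2 obtains an intermediate value of 1/3.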
Mutual Information performs similarly and gives different values for different discretization methods, viz., Equal Frequency, Equal Width and Global Equal Width, whereas Normalized Mutual Information produces perfect scores (of 1) for all expression pairs, even dissimilar ones, as shown in Fig. 1(b). Interestingly, our measure produces scores between 1 and 0, with values decreasing from the most similar patterns to dissimilar patterns. However, our measure fails to distinguish varying negatively co-expressed patterns, and reports all of them with a similarity score of 0 (or δ = 1). Experimentally, a similarity threshold of 0.4 (1 − δ), i.e., δ = 0.6, is found to perform well. Any similarity score above 0 (or δ < 1) would create a network of all positive similarity, including weakly interacting pairs. Below we present a new non-exclusive clustering technique that uses this distance measure.

3.3. Density based non-exclusive and intrinsic clustering

To detect an effective module with overlapping and intrinsic structure, it is important to detect the density of a subgraph using Equation 1. Instead of calculating the density of a subgraph directly to detect modules, we use the concept of density-based clustering to capture modules. Density-based clustering algorithms [45, 46] use the concept of density in the distribution of data. The idea behind a density-based approach is that within each cluster, the typical density of data objects is considerably higher than outside of the cluster. Furthermore, the density within areas of noise is lower than the density in any of the clusters. Typically, the approach uses two parameters, ξ and ε, to decide on core objects and noise. Because of its ability to extract clusters from a highly noisy environment without prior knowledge of the number of clusters, we use density based clustering to extract overlapping and intrinsic modules. Often a cluster is embedded inside another cluster. Existing density based clustering techniques are not suitable for non-exclusive and intrinsic or embedded clustering. They either classify an object as noise or assign it to exactly one cluster, and are unable to detect embedded clusters that may arise due to variations in density. We use the following terms to define non-exclusive and embedded modules in a co-expression network.

3.3.1. Terminologies used

Given T = (G, E), a weighted co-expression network, where E is a set of weighted edges representing the biological interrelationships or co-expression among the genes, various important terms can be defined as follows.

Definition 9 (Strong Neighborhood). For a given gene $G_i \in G$, the strong neighborhood of $G_i$, StrongNeighb, is a collection of genes that are strongly connected to $G_i$ with respect to a user defined threshold distance $\epsilon$:

$$\mathrm{StrongNeighb}(G_i) = \{G_j \mid \exists G_j \in G \wedge \delta(G_i, G_j) \le \epsilon\}. \tag{5}$$
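Equation 5 amounts to a simple threshold query on one row of the distance matrix. A minimal Python sketch is given below; the name `strong_neighborhood` and the use of integer gene indices are assumptions, as is the choice to exclude the gene itself from its own neighborhood.

```python
def strong_neighborhood(i, dist, eps):
    """Indices of genes within distance eps of gene i (Equation 5).

    Assumption: gene i itself is left out of its own neighborhood.
    """
    return {j for j in range(len(dist)) if j != i and dist[i][j] <= eps}


# Toy symmetric distance matrix over four genes.
dist = [
    [0.0, 0.1, 0.2, 0.3],
    [0.1, 0.0, 0.4, 0.7],
    [0.2, 0.4, 0.0, 0.8],
    [0.3, 0.7, 0.8, 0.0],
]
neighbors_of_0 = strong_neighborhood(0, dist, eps=0.5)
neighbors_of_3 = strong_neighborhood(3, dist, eps=0.5)
```

With ε = 0.5, gene 0 is strongly connected to all three other genes, while gene 3 is strongly connected only to gene 0.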

Definition 10 (Core Gene). A gene $G_i$ is said to be a core gene if the number of strongly connected genes within $G_i$’s strong neighborhood is at least the user specified threshold $\xi$:

$$\mathrm{CoreGene}(G_i) = \begin{cases} 1 & \text{if } |\mathrm{StrongNeighb}(G_i)| \ge \xi \\ 0 & \text{otherwise.} \end{cases} \tag{6}$$

Figure 1: Comparison of correlation values of Pearson, Spearman, Mutual Information, Normalized Mutual Information and Proposed Measure using gene patterns from Fig. 1(a). (a) Different gene expression patterns. (b) Comparison of similarity or correlation values of different association measures.

An embedded or intrinsic cluster occurs when density variation is observed within a single cluster, giving a new cluster inside another. To detect density variation inside a cluster, we introduce the concept of Core Neighbourhood, following the well-known core distance concept taken from OPTICS [47] and defined below.

Definition 11 (Core Distance). Given a core gene $G_i$, the core distance w.r.t. $G_i$ is the minimum distance that satisfies the core gene constraint. In other words, it is the maximum distance from $G_i$ to any other gene $G_j$ such that $G_i$ has exactly $\xi$ strongly connected neighbors within that distance:

$$\mathrm{CoreDist}(G_i) = \max\{\delta(G_i, G_j) \mid \forall G_j \in \mathrm{StrongNeighb}(G_i) \wedge |\mathrm{StrongNeighb}(G_i)| = \xi\}. \tag{7}$$

Definition 12 (Core Neighborhood). For a given core gene $G_i$, the core neighborhood of $G_i$ is the set of genes that are within the core distance associated with $G_i$:

$$\mathrm{CoreNeighb}(G_i) = \{G_j \mid \delta(G_i, G_j) \le \mathrm{CoreDist}(G_i),\ \forall G_j \in \mathrm{StrongNeighb}(G_i)\}. \tag{8}$$

The number of core neighbors is equal to ξ, i.e., |StrongNeighb(Gi)| ≥ |CoreNeighb(Gi)| = ξ. We redefine below the related concepts of density connectedness and clusters in a manner similar to DBSCAN [45], based on core distance.
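Definitions 11 and 12 can be read operationally as "the distance to the ξ-th nearest strong neighbor" and "all genes within that distance". The Python sketch below follows this reading; the function names are assumptions, and a value of `None` is used here to stand for the undefined core distance of a non-core gene. Note that with tied distances the returned neighborhood may contain more than ξ genes.

```python
def core_distance(i, dist, eps, xi):
    """Distance from gene i to its xi-th nearest strong neighbor, or None if non-core."""
    strong = sorted(
        dist[i][j] for j in range(len(dist)) if j != i and dist[i][j] <= eps
    )
    if len(strong) < xi:
        return None  # fewer than xi strong neighbors: gene i is not a core gene
    return strong[xi - 1]


def core_neighborhood(i, dist, eps, xi):
    """Genes within the core distance of gene i (empty for a non-core gene)."""
    cd = core_distance(i, dist, eps, xi)
    if cd is None:
        return set()
    return {j for j in range(len(dist)) if j != i and dist[i][j] <= cd}


dist = [
    [0.0, 0.1, 0.2, 0.3],
    [0.1, 0.0, 0.4, 0.7],
    [0.2, 0.4, 0.0, 0.8],
    [0.3, 0.7, 0.8, 0.0],
]
cd0 = core_distance(0, dist, eps=0.5, xi=2)
cn0 = core_neighborhood(0, dist, eps=0.5, xi=2)
```

In this toy matrix with ε = 0.5 and ξ = 2, gene 0 has three strong neighbors, its core distance is 0.2, and its core neighborhood is {1, 2}; gene 3 has a single strong neighbor and is therefore not a core gene.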

Definition 13 (Core Directly Density Reachable). Gene $G_j$ is core directly density-reachable (CDDR) from gene $G_i$ if $G_j$ is a core neighbor of $G_i$ and the difference between the core distances of $G_j$ and $G_i$ is within a tolerance threshold of $\alpha$:

$$\mathrm{CDDR}(G_j, G_i) = \begin{cases} 1 & \text{if } G_j \in \mathrm{CoreNeighb}(G_i) \wedge |\mathrm{CoreDist}(G_j) - \mathrm{CoreDist}(G_i)| \le \alpha \\ 0 & \text{otherwise.} \end{cases} \tag{9}$$
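Equation 9 can be sketched in Python as a membership test followed by a core distance comparison. The sketch assumes that core distances and core neighborhoods have been precomputed into dictionaries (a hypothetical layout), and it treats a gene without a core distance as not reachable, which is an assumption of this example.

```python
def cddr(j, i, core_dist, core_neigh, alpha):
    """Return 1 if gene j is core directly density-reachable from gene i (Equation 9).

    core_dist: dict gene -> core distance, or None for non-core genes.
    core_neigh: dict gene -> set of core neighbors.
    """
    if core_dist.get(i) is None or core_dist.get(j) is None:
        return 0  # assumption: the comparison is undefined without both core distances
    in_core_neighborhood = j in core_neigh.get(i, set())
    similar_density = abs(core_dist[j] - core_dist[i]) <= alpha
    return int(in_core_neighborhood and similar_density)


core_dist = {0: 0.2, 1: 0.4, 2: None}
core_neigh = {0: {1, 2}, 1: {0}, 2: set()}
reachable = cddr(1, 0, core_dist, core_neigh, alpha=0.25)
```

Here gene 1 is a core neighbor of gene 0 and their core distances differ by 0.2, so it is CDDR from gene 0 for α = 0.25 but not for α = 0.1.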

Definition 14 (Core Density Reachable). In a network $T$, a gene $G_i$ is core density reachable from another core gene $G_j$ if there is a chain of genes $G_1, \cdots, G_n$ such that $G_1 = G_j$ and $G_n = G_i$ and $G_{k+1}$ is core directly density-reachable from $G_k$.

Definition 15 (Core Density Connected). Gene $G_i$ is core density connected (CDC) to a gene $G_j$ if there is another gene $G_k$ such that both $G_i$ and $G_j$ are core density-reachable from $G_k$:

$$\mathrm{CDC}(G_i, G_j) = \begin{cases} 1 & \text{if } G_i \text{ core density reachable from } G_j \\ 0 & \text{otherwise.} \end{cases} \tag{10}$$
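Definitions 14 and 15 describe a transitive closure over the CDDR relation. One way to sketch this in Python is a breadth-first search over a precomputed "directly reachable" map. The names below are assumptions, and the sketch follows the prose of Definition 15 (a common source gene) while also counting a gene as trivially reachable from itself, which is an assumption of this example.

```python
from collections import deque


def reachable_from(start, directly_reachable):
    """Genes core density reachable from start via chains of CDDR steps (Definition 14).

    directly_reachable: dict gene -> set of genes that are CDDR from it.
    """
    seen = {start}
    queue = deque([start])
    while queue:
        gene = queue.popleft()
        for nxt in directly_reachable.get(gene, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen - {start}


def core_density_connected(a, b, directly_reachable):
    """True if some gene k reaches both a and b (Definition 15)."""
    for k in directly_reachable:
        reached = reachable_from(k, directly_reachable) | {k}
        if a in reached and b in reached:
            return True
    return False


directly_reachable = {"k": {"a", "b"}, "a": {"c"}, "b": set(), "c": set(), "z": set()}
from_k = reachable_from("k", directly_reachable)
```

In the toy relation above, gene `c` is reachable from `k` through `a`, so `c` and `b` are core density connected via `k`, whereas the isolated gene `z` is connected to neither.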

Definition 16 (Non Exclusive Gene). For a module $M_i$, if $\exists G_i \in M_i$ that is core density connected to $G_j \in M_j$ ($M_i \neq M_j$), then $G_i$ is included simultaneously in both modules $M_i$ and $M_j$. We call $G_i$ an overlapping or non-exclusive gene.

Definition 17 (Non-exclusive Module). Given a network $T$, a non-exclusive module $M_i$ derived from $G$ w.r.t. thresholds $\xi$ and $\epsilon$ is a non-empty subset of genes with the following properties.

1. $\forall G_i, G_j$, if $G_i \in M_i$ and $G_j$ is core density reachable from $G_i$, then $G_j$ is also in $M_i$. $G_i$ in such a case is called a non-overlapping or exclusive gene.
2. $\forall G_i, G_j$, if $G_i \in M_i$ and $G_j$ is core density connected to $G_i$, then $G_j \in M_i$.
3. If $\exists G_i \in M_i$ and $G_i$ is core density connected to $G_j \in M_j$, then $G_i$ is also in $M_j$ (overlapping or non-exclusive gene).

Definition 18 (Noise Gene). A gene Gi is said to be noise, if Gi is neither a core gene nor core density connected to any other gene.

Definition 19 (Boundary Gene). Any gene in the periphery of a module is said to be a boundary gene. We hypothesize that functionally, boundary genes are not very similar to the group they are attached to. A gene Gi is said to be a boundary gene of a module Mi, if Gi is a non-core gene but core density connected to any core gene in Mi.

3.4. CluViaN: The Algorithm

We present CluViaN in Algorithm 1. CluViaN accepts the weighted adjacency matrix D created using Equation 3. In addition to D, it takes ξ, α and ε as input. Initially, all the nodes or genes are marked as unprocessed. For each gene Gi, it computes the core distance using the method GetCoreDist(), which returns NULL if Gi is not a core gene or is noise w.r.t. ξ and ε. Otherwise, Gi is assigned a new module id and is expanded further using the recursive function ExpandModule(), given in Algorithm 2. All the core neighbors of a gene Gj are passed through the same process of expansion until no further expansion is possible, i.e., until no neighbors remain that are core density connected to any core gene of the module under expansion. The process then continues with the remaining unprocessed genes, each new module receiving a new module id. During the expansion of a neighbor Gj of a core gene Gi, if the core distance of Gj differs from that of Gi by more than α, Gj is not expanded. This is the case when Gi and Gj are part of intrinsic modules or when Gj may be part of an external module. A gene initially detected as noise (non-core) becomes non-noise if it is core density connected to any core gene. Finally, the set of modules is returned as output.

To illustrate the above algorithm, we present a co-expression network constructed from the Rat CNS¹ dataset in Figure 2. Nodes shown in yellow are non-exclusive genes belonging to two modules, C1 and C2. Non-exclusive genes are the connecting points between two or more clusters and belong to two or more clusters at the same time. In this example, we set ξ = 4, which lets us treat small groups as noise and the rest as three clusters. The same value of ξ is able to differentiate non-exclusive or overlapping genes.

3.5. Computational complexity

Given N genes, CluViaN involves two separate costs that can be calculated as follows.

1. Network Construction: Creation of the network and calculation of the distance matrix takes N(N − 1)/2 ≈ O(N^2) time.
2. Module Extraction: Some nodes may be visited multiple times. In practice, however, the time complexity depends on the number of neighborhood queries. If an index structure is used, the overall average runtime complexity of this step becomes O(N log N).

¹ http://faculty.washington.edu/kayee/cluster

Algorithm 1: CluViaN: Functional Module Finding Technique

Input : D (Adjacency Matrix), G (Set of Genes), ξ (Core Gene constraint), ε (Noise threshold), α (Density variation threshold)
Output : M = {M1, M2, ···, Mk} // Set of k modules
Method:
    moduleID ← 1;
    for (∀ gene, Gi) do
        Mark Gi as unprocessed;
    end
    for (∀ gene, Gi ≠ Processed) do
        CoreDist ← GetCoreDist(Gi, ξ, ε);
        if CoreDist = NULL then
            Mark Gi as Noise;
        end
        else
            if Gi.moduleID ≠ moduleID then
                Gi.moduleID ← moduleID;
                M_moduleID ← M_moduleID ∪ Gi;
            end
            ExpandModule(Gi, CoreDist, Gi.moduleID);
        end
        moduleID++;
        M ← M ∪ M_moduleID;
    end

Algorithm 2: ExpandModule(Gj, PrevCoreDist, moduleID)

// Gj is a core gene to be expanded
    CoreDist ← GetCoreDist(Gj, ξ, ε);
    if CoreDist = NULL then
        Mark Gj as Noise;
        Return;
    end
    else
        if |CoreDist − PrevCoreDist| < α then
            M_moduleID ← M_moduleID ∪ Gj;
            ExpandModule(Gj, CoreDist, Gj.moduleID);
            Return;
        end
    end
    Mark Gj as Processed;
    Neigh ← GetNeighbors(Gj, CoreDist);
    // Mark all the neighbors which are not included in the current module
    for (∀ gene, Gk ∈ Neigh) do
        Gk.moduleID ← moduleID;
        M_moduleID ← M_moduleID ∪ Gk;
        ExpandModule(Gk, PrevCoreDist, Gk.moduleID);
    end
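The density-based expansion idea behind Definitions 13 to 19 can be illustrated with a small Python sketch. It is a simplified illustration and not the authors' implementation of Algorithms 1 and 2. It assumes that the core distance of a gene is the distance to its ξ-th nearest neighbor within radius ε (undefined otherwise), that core neighbors are core genes within ε, and that non-core genes within ε of a core gene join that module as boundary genes. The function names `core_distance` and `find_modules` and the toy data are hypothetical.

```python
from collections import deque


def core_distance(dist, i, xi, eps):
    """Distance from gene i to its xi-th nearest neighbour within eps.

    Returns None when gene i has fewer than xi neighbours within eps,
    i.e. when it is not a core gene (an assumed reading of the definition).
    """
    near = sorted(d for j, d in enumerate(dist[i]) if j != i and d <= eps)
    return near[xi - 1] if len(near) >= xi else None


def find_modules(dist, xi, eps, alpha):
    """Group genes into possibly overlapping modules; return (modules, noise)."""
    n = len(dist)
    cd = [core_distance(dist, i, xi, eps) for i in range(n)]
    assigned = set()  # core genes already placed in some module
    modules = []
    for seed in range(n):
        if cd[seed] is None or seed in assigned:
            continue
        module, queue = {seed}, deque([seed])
        assigned.add(seed)
        while queue:
            i = queue.popleft()  # i is always a core gene
            for j in range(n):
                if j == i or dist[i][j] > eps:
                    continue
                if cd[j] is None:
                    # Boundary gene: it may be attached to several modules.
                    module.add(j)
                elif j not in assigned and abs(cd[j] - cd[i]) <= alpha:
                    # Core neighbour with a similar core distance (Equation 9).
                    module.add(j)
                    assigned.add(j)
                    queue.append(j)
        modules.append(module)
    noise = set(range(n)) - set().union(*modules)
    return modules, noise


# Toy example: genes placed on a line, distance = absolute difference.
positions = [0, 1, 2, 3, 5, 7, 8, 9, 10, 30]
dist = [[abs(p - q) for q in positions] for p in positions]
modules, noise = find_modules(dist, xi=3, eps=2, alpha=0.5)
```

In this toy example, gene 4 lies within ε of core genes of both groups and is therefore placed in both modules, which mirrors the non-exclusive genes of Definition 16, while gene 9 has no neighbors and is left as noise.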

Figure 2: Rat CNS gene co-expression network modules showing overlapping genes in yellow between C1 and C2. Module C3 has only exclusive genes.

The total cost of CluViaN is therefore O(N^2) + O(N log N) = O(N^2).

4. Effectiveness Assessment of CluViaN

We validate the biological significance of the modules extracted by CluViaN and compare the results with seven other network-module detection techniques. We also assess the goodness of the topological characteristics of the subnetworks associated with the modules. The modules extracted by these eight techniques are compared and validated statistically, biologically and topologically.

4.1. Dataset used

We use eight publicly available gene expression datasets for benchmarking. We include expression data from microorganisms to higher organisms such as Rat and Human. Two expression datasets related to AD, GSE1297 and GSE4097, from the NCBI data repository are used to identify influential genes that may be responsible for AD. For GSE1297, hippocampal gene expression of 9 control and 22 AD subjects of different severity was analyzed. Each gene’s correlated expression was tested on the 31 subjects. Many genes were found to be correlated with AD markers. GSE4097 studies genes dependent on the Amyloid Precursor Protein Intracellular Domain (AICD) when the APP gene is expressed, since APP plays a central role in the pathogenesis of AD. Processing of APP produces plaques, which generate AICD. AICD in turn regulates genes that function in actin cytoskeleton and apoptosis, both of which contribute to AD pathogenesis.

Table 2: Expression datasets used for module extraction.

| Sl. No. | Dataset | Species | No. of Genes | No. of Time Points/Conditions | Source |
|---|---|---|---|---|---|
| 1 | Subset of Yeast cell cycle | Saccharomyces cerevisiae | 387 | 17 | http://faculty.washington.edu/kayee/cluster |
| 2 | Yeast Sporulation | Saccharomyces cerevisiae | 474 | 17 | http://cmgm.stanford.edu/pbrown/sporulation/index.htmll |
| 3 | Rat CNS | Rattus norvegicus | 112 | 9 | http://faculty.washington.edu/kayee/cluster |
| 4 | Human Fibroblasts Serum | Homo sapiens | 517 | 13 | http://www.sciencemag.org/feature/data/984559.shl |
| 5 | GSE53552 | Homo sapiens | 54675 | 99 | https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE53552 |
| 6 | GDS958 | Mus musculus | 693 | 12 | ncbi.nlm.nih.gov/gds-GDS958 |
| 7 | GSE1297 (Alzheimer’s) | Homo sapiens | 22215 | 31 | http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE1297 |
| 8 | GSE4097 (Alzheimer’s) | Homo sapiens | 22283 | 6 | https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE4097 |

We next assess the quality of our proposed distance measure in inferring the input co-expression network for CluViaN.

4.2. Assessment of association measure

Due to the lack of appropriate benchmark gold data, we use synthetic data for validation. In order to assess inference quality, we generate synthetic gene expression data using GeneNetWeaver (GNW) [48]. GNW was developed to assess the performance of network inference methods in the DREAM challenge². For generating the datasets, we use both ODEs (deterministic) and SDEs (noise in dynamics) for the simulation of the experiments. They run one after the other using exactly the same perturbations. The number of time series is set to 1 and the duration of each time series (tmax) to 1,000. The number of measured points in the series is 21. The coefficient of the noise term is set to 0.05, and normalization after addition of noise is checked. A brief description of the synthetic datasets used for network inference is given in Table 3.

Table 3: DREAM challenge datasets used for network inference.

| Sl. No. | Dataset | No. of Genes | No. of Time Points/Conditions | Source |
|---|---|---|---|---|
| 1 | InSilicoSize100-Yeast1 dream4 timeseries | 100 | 21 | Gene Net Weaver |
| 2 | InSilicoSize100-Yeast2 dream4 timeseries | | | |

² http://dreamchallenges.org

Assessment of the resulting networks is done using F-score and accuracy measures. Since there is a trade-off between precision and recall, the F-score, which is the harmonic mean of the two, is helpful. Precision is the positive predictive value, i.e., the number of correctly predicted interactions over all predicted interactions between nodes in the network. Recall, also known as sensitivity, is the number of correctly predicted interactions over all true interactions in the gold standard network. The F-score thus takes both precision and recall into account. Another measure used here is accuracy, which is the number of correct predictions over the total number of observations. It measures the intuitive performance of a method.

We compare our proposed distance measure with other association measures such as Pearson correlation, Spearman correlation, Kendall correlation and Mutual Information (MI). We also use two network inference methods, viz., ARACNE and CLR, for comparison. We execute the network inference methods on the generated datasets with a threshold of approximately (min + max)/2, the midpoint of the minimum and maximum of the generated association scores. In some cases, we use nearby values if they give better results.

As shown in Table 4, for Dataset 1, our proposed measure outperforms the other methods with an F-score of 0.0445 and accuracy of 0.4965, while CLR follows with an F-score of 0.0428. In accuracy, MI comes in second with 0.4949. For Dataset 2, our proposed measure again performs well with an F-score of 0.0862 and accuracy of 0.4907. CLR follows with an F-score of 0.0852, while MI achieves the best accuracy with 0.4946. Hence, our proposed association measure generates networks that are on par with or better than those of the other network inference methods in our study, despite the fact that they use additional steps for inference.
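The evaluation just described can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the evaluation code used in the study: association scores are assumed to be stored per unordered gene pair, a higher score is assumed to mean a stronger association (for a distance measure the comparison would be reversed), and the helper names `midpoint_threshold` and `evaluate_network` as well as the toy scores are hypothetical.

```python
def midpoint_threshold(scores):
    """Threshold at (min + max) / 2 of the association scores."""
    values = list(scores.values())
    return (min(values) + max(values)) / 2


def evaluate_network(scores, gold_edges, threshold):
    """Compare thresholded predictions with a gold standard edge set."""
    tp = fp = fn = tn = 0
    for pair, score in scores.items():
        predicted = score >= threshold  # assumption: higher score = edge
        actual = pair in gold_edges
        if predicted and actual:
            tp += 1
        elif predicted:
            fp += 1
        elif actual:
            fn += 1
        else:
            tn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    accuracy = (tp + tn) / len(scores) if scores else 0.0
    return {"precision": precision, "recall": recall,
            "f_score": f_score, "accuracy": accuracy}


def pair(a, b):
    """Unordered gene pair used as a dictionary key."""
    return frozenset((a, b))


# Toy association scores for four genes (six unordered pairs).
scores = {pair("A", "B"): 1.0, pair("A", "C"): 0.75, pair("B", "C"): 0.25,
          pair("A", "D"): 0.0, pair("B", "D"): 0.75, pair("C", "D"): 0.25}
gold = {pair("A", "B"), pair("B", "D"), pair("C", "D")}
result = evaluate_network(scores, gold, midpoint_threshold(scores))
```

Here the midpoint threshold is 0.5, so three pairs are predicted as edges, two of which appear in the toy gold standard.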

Table 4: Comparison of F-Score and Accuracy of different network inference methods and association measures.

| Methods | Dataset 1 Accuracy | Dataset 1 F-Score | Dataset 2 Accuracy | Dataset 2 F-Score |
|---|---|---|---|---|
| CLR | 0.474 | 0.0428 | 0.4778 | 0.0852 |
| ARACNE | 0.4936 | 0.0302 | 0.4905 | 0.0612 |
| Spearman Rank Correlation | 0.4934 | 0.0336 | 0.4855 | 0.0787 |
| Pearson Correlation | 0.4932 | 0.0368 | 0.4885 | 0.0762 |
| Kendall Rank Correlation | 0.4933 | 0.041 | 0.4899 | 0.0655 |
| Mutual Information | 0.4949 | 0.0404 | 0.4946 | 0.0644 |
| Proposed measure | 0.4965 | 0.0445 | 0.4907 | 0.0862 |

4.2.1. Evaluation of Extracted Modules

For all the candidate module finding methods, we report the best modules obtained by tuning parameters for the lowest p-values or q-values, for fair comparison. We validate functional enrichment of the discovered modules in terms of p and q statistical significance scores. We consider only p < 0.01 as significant. We also compare the performance of the seven techniques based on topological validation scores.

• Enrichment Analysis
The p-value tests the statistical hypothesis that an observed gene is in a module by measuring the false positive rate. Before the test is performed, a threshold value called the significance level of the test, traditionally 5% or 1%, is chosen. A low p-value indicates that the genes are biologically significant and belong to enriched functional categories. Considering the Gene Ontology (GO) and the discovered modules, the p-value is computed as:

$$
p\text{-value} = \sum_{i=x}^{n} \frac{\binom{A}{i}\binom{N-A}{n-i}}{\binom{N}{n}} \tag{11}
$$

where n is the total number of genes in the module, A is the number of genes with a particular annotation and N is the total number of genes within the genome. The p-value is defined as the probability of observing at least x genes with the annotation in a module with n genes. We use FuncAssociate for evaluating the functional enrichment of a module in terms of p-values.

For Yeast Cell Cycle, one of the modules detected by CluViaN is involved in the GO term cell cycle process with enrichment score 1.41E-40. Similarly, in the case of Yeast sporulation, Rat CNS, M. musculus and GSE53552, CluViaN detects biologically significant modules that are associated with meiotic cell cycle process, functional behavior, membrane-bounded organelle and intracellular part related activities, respectively. The over-represented attributes for the different datasets are reported in Table 5 in terms of lowest p-values, indicating the most highly enriched GO terms. ‘-’ indicates lack of biologically significant terms produced by different methods.

Table 5: Comparison of functionally enriched modules derived by different methods, based on p-values. Best scores are highlighted in bold.

| Dataset | GO Annotation | WGCNA | FUMET | Module Miner | MCL-Based | Spectral-Based | MCODE | FAG-EC | CluViaN |
|---|---|---|---|---|---|---|---|---|---|
| Yeast Cell Cycle | Cell Cycle Process | 2.31E-20 | 2.75E-06 | 6.2E-31 | 4.72E-23 | 3.41E-24 | 4.15E-06 | 8.79E-23 | 1.41E-40 |
| Yeast sporulation | Sporulation | 2.15E-33 | 1.74E-40 | 1.4E-26 | 1.36E-40 | 2.67E-34 | 1.05E-11 | 2.79E-45 | 2.27E-47 |
| Rat CNS | Behavior | 3.23E-11 | 6.63E-06 | 8.68E-21 | 1.3E-17 | 1.7E-13 | 1.24E-6 | 1.31E-23 | 5.05E-27 |
| Human Fibroblasts Serum | Cell Cycle Process | 4.57E-16 | - | 1.13E-09 | 6.88E-17 | 9.4E-7 | 4.46E-14 | - | 8.32E-07 |
| GSE53552 | Intracellular Part | 5.31E-15 | - | - | 8.49E-11 | - | 2.25E-05 | 8.49E-11 | 2.40E-24 |
| Mus musculus | Membrane Bounded Organelle | 4.67E-16 | 1.71E-05 | - | 2.75E-11 | 9.37E-17 | 3.04E-15 | 3.04E-15 | 2.34E-42 |

We also report the functional enrichment score in terms of q-value for each module. The q-value is the minimal False Discovery Rate (FDR) at which a gene appears significant. The GO categories and q-values from an FDR corrected hypergeometric test for enrichment are obtained using GeneMANIA. For Yeast Cell Cycle, Rat-CNS and M. musculus, enriched modules derived by CluViaN are involved in cellular response to DNA damage stimulus, regulation of synaptic transmission and vesicle related functions, respectively. However, a module in yeast sporulation detected by FAG-EC is involved in cytoplasmic translation with a q-value of 4.05E-85. For GSE53552, a significantly enriched module is found to be related to nucleoplasm with a q-value of 4.6E-31. A few significant q-values and corresponding GO terms for the extracted modules are given in the supplementary material.

The reported enrichment scores indicate that CluViaN is superior in detecting biologically significant modules. They support the view that biological modules contain overlapping and intrinsic structures, which available methods fail to detect and which therefore lead to inferior results for those methods.

• Topological Validation
Topological validation evaluates the topology or the graph structure of the extracted modules. The TopoGSA [49] web application maps a list of genes to an interaction network. Given a set of gene symbols or protein names, it computes the topological properties of the entire network for the organism, of the uploaded gene set, and of random sets of matched sizes, and compares them against a collection of reference datasets of known molecular functions from public annotation databases including KEGG [50], Gene Ontology (Biological Process, Molecular Function and Cellular Component) [51] and InterPro protein domains [52]. The outputs provide comparative statistics of the average values of the mentioned topological properties computed for the uploaded gene set. In our case, we consider 100 random samples. The topological properties reported by TopoGSA are degree, shortest path length, node betweenness, clustering coefficient and eigenvector centrality, which are described in [49].

Modularity scores exhibited by the various module finding methods are shown in Figure 3, where only scores >0.3 and <1 are considered. The figure shows that CluViaN achieves relatively better modularity than the other methods.

Figure 3: Comparison of modularity scores achieved by different methods for different datasets: (a) Yeast Cell Cycle, (b) Yeast Sporulation, (c) Human Fibroblasts Serum, (d) Mouse.

Topological validation of the modules extracted from the Human Fibroblast Serum dataset by the different candidate methods is given in Table 6. Mean values of topological features are shown in the table for the uploaded and reference networks of TopoGSA. Scores that fall in the valid range are highlighted in bold. Results show that CluViaN extracts relatively better subgraphs or modules, with minor variation from the reference networks. For more topological assessment results on other datasets, please refer to the supplementary pages.
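The enrichment statistics used under Enrichment Analysis can be reproduced on small inputs with the following Python sketch. The first function evaluates the hypergeometric tail of Equation (11) directly. The second converts a list of p-values into q-values using a Benjamini-Hochberg style correction; this particular correction is an assumption made for illustration, and FuncAssociate and GeneMANIA may apply their own procedures. The function names are hypothetical.

```python
from math import comb


def enrichment_p_value(N, A, n, x):
    """Probability of at least x annotated genes in a module of n genes.

    N: genes in the genome, A: genes carrying the annotation,
    n: genes in the module, x: annotated genes observed in the module.
    """
    upper = min(n, A)
    hits = sum(comb(A, i) * comb(N - A, n - i) for i in range(x, upper + 1))
    return hits / comb(N, n)


def bh_q_values(p_values):
    """Benjamini-Hochberg adjusted p-values (assumed FDR procedure)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone q-values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q[i] = running_min
    return q


# Toy example: genome of 10 genes, 4 annotated, module of 3 genes, 2 hits.
p = enrichment_p_value(N=10, A=4, n=3, x=2)
q = bh_q_values([0.01, 0.04, 0.03, 0.005])
```

Exact integer binomials are used so that the very small tail probabilities typical of enrichment analysis are not lost to intermediate rounding.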

4.2.2. Evidence of Non-Exclusive Modules

In real life, modules depend on each other and are complex in nature. Our method is designed to handle genes that may fall in more than one module, exhibiting overlapping biological functions. Genes are grouped in a way that better reflects their natural organization, in which a gene may be associated with more than one functional group. Evidence of overlapping modules is given below.

1. DL154w (Symbol:MSH5) from Yeast Sporulation participates in DNA binding in module 3 while it also participates in reciprocal meiotic recombination in module 2. Hence, it is part of both module 2 and module 3.

Table 6: Topological Validation on Human Fibroblast Serum Sub-networks

| Algorithm | Gene set | Shortest Path Length | Node Betweenness | Degree | Clustering Coefficient | Eigenvector Centrality |
|---|---|---|---|---|---|---|
| WGCNA | Uploaded gene set | 3.994 | 16446.2 | 10.3 | 0.096 | 0.02 |
| WGCNA | 100 random simulations (mean) | 4.064(±0.12) | 16581(±118562) | 9.21(±3.8) | 0.116(±0.034) | 0.02(±0.012) |
| FUMET | Uploaded gene set | 4.003 | 15714 | 10.63 | 0.09 | 0.023 |
| FUMET | 100 random simulations (mean) | 4.12(±0.11) | 17757(±11685) | 8.98(±2.98) | 0.11(±0.03) | 0.02(±0.01) |
| MCL-Based | Uploaded gene set | 3.935 | 15407 | 4.824 | 0.1 | 0.02 |
| MCL-Based | 100 random simulations (mean) | 4.095(±0.1) | 15446.5(±7262) | 8.885(±2.265) | 0.1(±0.025) | 0.025(±0.01) |
| Module Miner | Uploaded gene set | 4 | 15256 | 10.31 | 0.1 | 0.02 |
| Module Miner | 100 random simulations (mean) | 4.115(±0.055) | 15720.5(±6518) | 8.525(±1.61) | 0.09(±0.02) | 0.02(±0.015) |
| Spectral-Based | Uploaded gene set | 4.13 | 14267 | 9.59 | 0.11 | 0.023 |
| Spectral-Based | 100 random simulations (mean) | 3.18(±0.07) | 19466(±12232) | 9.76(±3.23) | 0.11(±0.026) | 0.023(±0) |
| FAG-EC | Uploaded gene set | 3.81 | 12329 | 8 | 0.21 | 0.01 |
| FAG-EC | 100 random simulations (mean) | 0(±0) | 42137(±120750) | 11.4(±23.91) | 0.12(±0.31) | 0.02(±0.04) |
| CluViaN | Uploaded gene set | 4.04 | 14814.5 | 9.85 | 0.08 | 0.05 |
| CluViaN | 100 random simulations (mean) | 4.05(±0.13) | 19165(±16698) | 9.6(±4.07) | 0.09(±0.05) | 0.04(±0.01) |

2. Prps1 in the Mus musculus network is a member of both module 1 and module 5. In module 1, it plays a role in nucleotide binding whereas in module 5 it participates in biosynthetic process.

4.2.3. Intrinsic Network Modules

An example of intrinsic modules detected by our method is shown in Figure 4. Six intrinsic modules of the Mouse dataset are shown in red, blue, yellow, green, brown and pink. Some genes overlap two or more modules. The orange nodes represent outliers and do not belong to any module. The yellow module is biologically significant for the Gene Ontology term ribosome with a p-value of 5.79E-07, whereas for the red module the same term has a p-value of 1.11E-05. The yellow module performs a more specific function, as it is more densely connected than the red module involved in the ribosome process. Similarly, the yellow module is associated with ribosomal subunit structure with a p-value of 2.06E-05, whereas the red module is not involved in similar activity.

Figure 4: Intrinsic modules in the Mouse networks. Six modules are marked in different colors: red, blue, yellow, green, brown and pink. Nodes identified as noise are shown in orange. Nodes that have more than one color are overlapping nodes (bi-color and tri-color). All the 5 modules are embedded in the red module.

4.3. Ranking Alzheimer’s Modules using Pathway Enrichment Analysis

A biological pathway is a series of signals among molecules in cells, leading to some change in the cell. Disruption of a healthy organism’s normal pathways may give rise to diseases. Comparing the same pathways of normal and diseased organisms may help us identify malfunction and possible causes of diseases. It may even help in diagnosing and treating the disease. We rank disease modules extracted by CluViaN based on enrichment scores with respect to p-values. For our experiments, we use two different phenotypes of AD related expression data: GSE1297 and GSE4097. It is well accepted that inference outcomes largely depend on the dataset in hand. To overcome such bias, we use two different phenotypes of AD expression profiles to verify the confidence of our inference of disease responsible genes. Moreover, multiple datasets of the same subject with different environments may be complementary in deriving more insightful biological knowledge that remains undetected from any one of the datasets.

CluViaN detects 28 modules from the disease network based on the input AD dataset GSE1297 and 23 modules in GSE4097. Out of these, 6 and 17 modules from GSE1297 and GSE4097, respectively, are found to be significant with respect to their participation in AD pathways. We use KEGG [50] pathway enrichment analysis for ranking significant modules with the help of the enrichment analysis tool DAVID [53, 54, 55]. We report the top 5 modules for both AD datasets in Table 7. We find that module 2 is the most relevant in GSE1297, with a p-value of 3.44E-66, containing 65 genes that participate in AD pathways. In the case of GSE4097, module 1, with a p-value of 5.4E-44, ranks on top. Module 1 contains 95 genes that participate in the AD pathway. Reference KEGG pathways of AD with participating genes in the top ranked modules are shown in Figures 5 and 6.

Figure 5: Reference KEGG pathway and participating genes in the top module in GSE1297 dataset. The red stars are the genes in the module that participate in the pathway of Alzheimer’s disease. 65 genes from module 2 participate in this pathway.

Next, we analyze further the top ranked modules to identify significant genes that are the possible causes of the disease.

4.4. Identifying Central Genes

It has been observed that essential genes, which are possible regulators, are usually central genes [56], also called hub genes. We adopt a simple approach to rank genes responsible for a disease by identifying hub genes from the sub-network involved in top ranked modules. We use the online tool cytoHubba [57] to find central or hub genes in a subgraph or module. cytoHubba uses the Maximum Clique Centrality (MCC) score to identify hub nodes considering degree or connectivity of a node. The MCC score of a node v is defined as follows.

$$
\mathrm{MCC}(v) = \sum_{C \in S(v)} (|C| - 1)! \tag{12}
$$

where S(v) is the collection of maximal cliques which contain v, and (|C| − 1)! is the product of all positive integers less than |C|. If no edge is present between v’s neighbors, then MCC(v) is equal to its degree.

Figure 6: KEGG pathway with participating genes from dataset GSE4097. The red stars are the genes in the module that participate in the pathway of Alzheimer’s disease. 95 genes from module 1 participate in this pathway.

Table 7: Ranking of modules for Alzheimer’s disease derived by CluViaN (Top 5).

| Dataset | Module No. | Rank | No. of Genes in Module | No. of Genes in Pathway | P-Value | Member Genes |
|---|---|---|---|---|---|---|
| GSE1297 | Module 2 | 1st | 127 | 65 | 3.44E-66 | NDUFV2, PSEN1, UFA3, COX5B, COX5A, UQCRC1, ATP5O, NDUFS4, GRIN2A, CDK5R1, SDHC, CALM1, CHP1, ..... |
| GSE1297 | Module 4 | 2nd | 10 | 9 | 8.24E-12 | UQCRC2, UQCRC1, UQCRQ, NDUFB2, NDUFS3, NDUFC2, ATP5G3, ATP5J, COX7B |
| GSE1297 | Module 3 | 3rd | 10 | 8 | 1.04E-09 | NDUFA1, ATP5H, NDUFA5, ATP6V1G2, ATP5O, ATP5C1, COX8A, COX7B, UQCRQ, ATP5J2 |
| GSE1297 | Module 5 | 4th | 8 | 7 | 6.76E-09 | UQCR10, ATP5A1, ATP6V1G2, NDUFA4, ATP5C1, COX7C, COX6C, ATP5B |
| GSE1297 | Module 1 | 5th | 322 | 17 | 0.050594043 | ATP2A3, CALML3, PLCB2, ITPR2, CASP9, APAF1, CAPN2, COX6A2, NOS1, APP, BACE2, GRIN2B, PPP3CC, SDHD, PLCB1,.... |
| GSE4097 | Module 1 | 1st | 650 | 95 | 5.40E-44 | SNCA, IDE, NDUFAB1, COX5A, COX5B, APP, UQCR11, GRIN2B, APOE, GRIN2C, GRIN2D, IL1B, PSENEN, GRIN2A, FADD, MAPK1, RYR3, CDK5R1, COX7C, LPL, COX8A,... |
| GSE4097 | Module 3 | 2nd | 223 | 57 | 2.88E-37 | UQCRC2, UQCRC1, CYC1, IDE, SNCA, NDUFAB1, UQCRFS1, COX5B, NDUFS5, CASP3, UQCR10, UQCR11, NDUFS8, CASP8, PSENEN, NDUFS3, NDUFS2, ADAM10, ... |
| GSE4097 | Module 4 | 3rd | 317 | 58 | 1.57E-29 | UQCRC2, IDE, SNCA, UQCRFS1, COX5A, NDUFS7, NDUFS6, NDUFS4, PLCB4, CASP9, GRIN2B, GRIN2C, CASP7, MAPT, NDUFS8, IL1B, PSENEN, FAS, NDUFS1, ADAM10, NDUFC2,... |
| GSE4097 | Module 10 | 4th | 206 | 37 | 6.05E-18 | BID, HSD17B10, CDK5R1, UQCRC1, COX7B, SNCA, IDE, MME, COX5B, NDUFS5, UQCR10, APOE, CALML3,... |
| GSE4097 | Module 20 | 5th | 93 | 26 | 2.30E-17 | NDUFB8, IDE, COX7B, COX7C, MME, COX7A2L, UQCRQ, NDUFB1, NDUFB2, APP, PLCB3, GRIN2C, NDUFS8,... |

Figures 7 and 8 show the top ranked AD subgraph in both AD datasets. Highly central nodes are shown in red, and the color fades to orange and then to yellow as the MCC score decreases. We report the ranking of various central genes in different modules with respect to MCC score in Table 8 and Table 9. Next, we investigate the role of central genes in AD based on reported evidence in published literature.

In GSE1297, PSEN1, which is the highest scoring hub gene, is a protein encoding gene. Mutations in three genes, APP, PSEN1 and PSEN2, are the main causes of early-onset familial Alzheimer’s disease. 30% - 70% of early onset of Alzheimer’s disease is due to mutations in the PSEN1 gene [58], causing type 3 AD (AD3) [59]. There are more than 150 mutations in the PSEN1 gene that have been identified in patients with early-onset Alzheimer’s disease³. These mutations in the PSEN1 gene result in abnormal presenilin 1 protein, which in turn interferes with the normal functioning of the γ-secretase complex. This changes APP processing and leads to over-production of a toxic, long amyloid-β peptide. These toxic amyloid-β peptide molecules then stick together to form clumps of amyloid plaques in the brain, a characteristic feature of Alzheimer’s disease. As a consequence, death of neurons is likely to follow, along with the progressive signs and symptoms of this disorder.

PPP3R1 (protein phosphatase 2B regulatory subunit 1) is another prominent hub gene in module 2. It is associated with rapid progression of AD, in which cerebrospinal fluid tau levels and the rate of functional decline increase [60]. Rapid progression of AD is strongly correlated with the single nucleotide polymorphism rs1868402 in the PPP3R1 gene.

NDUFB2 is another hub gene that encodes a protein which is a subunit of the multi-subunit NADH:ubiquinone oxidoreductase (complex I). Links have been established between defects of complex I and Parkinson’s disease and other genetic metabolic diseases [61]. It also plays an important role in the metabolism pathways of AD as well as of Parkinson’s disease (KEGG:K03958).

³ https://ghr.nlm.nih.gov/gene/PSEN1#conditions

NDUFA1, from module 3, is a subunit of the mitochondrial respiratory chain NADH dehydrogenase (Complex I) and encodes an essential component of complex I of this respiratory chain. Some defects in the functioning of mitochondria have been associated with neurodegenerative diseases like Parkinson’s disease, Alzheimer’s disease and Huntington’s disease, and particularly with mitochondrial respiratory chain disorder (MRCD) [62]. NDUFA1 is a part of the Alzheimer’s disease pathway (KEGG Pathway: hsa05010).

UQCR10 from module 5 is a part of mitochondrial complex III (ubiquinol-cytochrome c reductase). It forms a part of the respiratory chain of the inner mitochondrial membrane. Mitochondrial genes are altered in blood in early AD [63].

APP is one of the highest scoring central genes in module 1 for GSE1297 as well as in module 5 of dataset GSE4097. APP is highly active in the Alzheimer’s disease pathway. There are 32 APP gene mutations that result in early-onset Alzheimer’s disease [64]. It plays a large part in AD pathogenesis as it produces β-amyloid peptides (Aβ), which are found in the brains of AD patients [65]. Mutations in the APP gene have been implicated in autosomal dominant AD and cerebroarterial amyloidosis (cerebral amyloid angiopathy).

As per our results, MUC4 is the highest degree node in module 1 of GSE1297, but we do not find any reported evidence of its role in AD. We feel that further lab investigation is required to elucidate any hidden role of MUC4 in AD that has not been revealed so far.

In the case of GSE4097, PML (Promyelocytic Leukemia) is found to be the top hub gene. Recently, it has been observed that PML has some function in the nervous system and in the brain [66]. It also plays a role in plasticity, circadian rhythms and the response to proteins which can cause neurodegenerative disorders like Alzheimer’s disease. The study of PML and its role in the brain is at its beginning phase and more is yet to be discovered.

CFLAR, another hub gene, is a CASP8 and FADD-like apoptosis regulator. CASP8 and FADD also take an active part in the AD pathway. In [67], it is shown that as the regulatory activity of CFLAR increases, the amount of transcription factors increases for AD-affected samples. The MKI67 gene encodes a protein required for cellular proliferation. Many studies have shown that its function is associated with Alzheimer’s disease [68].

LMNA encodes the protein lamin. Over 300 disease-causing mutations of this gene have been identified, and one of these diseases is Alzheimer’s disease [69]. A dysfunction in the lamin-rich meshwork called the nucleoskeleton is a cause of AD. HNRNPA3 (Heterogeneous Nuclear Ribonucleoprotein A3) has been found to be linked to neurodegenerative diseases like AD. Several studies [70, 71] show that its expression levels are reduced in AD datasets, and that it is an important gene whose behavior is correlated with AD. Gene IPO5, which is the highest scoring hub gene in module 3, is not found to be associated with AD as per our findings.

MAPK1 (mitogen-activated protein kinase 1) is the top ranking hub gene found in module 4. It participates in the AD pathway and has been shown to play a significant role in AD. Its expression levels are elevated in AD controls that are in intermediate stages [72]. Many diseases like AD, Parkinson’s disease and various cancers can develop if there is deviation from strict control of the MAPK signalling pathways [73]. Another hub gene, MTUS1 (myotubularin 1), is found to be expressed in the brains of AD patients [74]. The SNCA gene encodes the protein alpha-synuclein. Many studies have reported its role in AD and that it participates in the AD pathway. Its polymorphism may lead to an increased risk of AD [75]. It also causes Lewy bodies, which are protein deposits in the brain that affect memory and thinking capabilities. These Lewy bodies are a pathology in AD [76].

LEPR encodes the protein leptin receptor, which regulates body weight. It is present in many tissues, including the hypothalamus of the brain. A study [77] shows that leptin levels in AD are the same as in control samples, but that its signalling is impaired in AD.

Similar to our earlier AD phenotype dataset, GSE1297, the GSE4097 dataset also shows APP as the highest scoring gene; its role in AD has already been described. MAP1B is another hub gene of module 5, where alteration in its expression contributes to neuritic amyloid plaques and neurofibrillary tangles in AD [78]. The genes NDUFA5, NDUFA4 and NDUFA3 are involved in subunits of the mitochondrial respiratory chain, which are nuclear-encoded oxidative phosphorylation genes. The expression of these genes is altered in the blood of early AD control [63]. All these genes, along with NDUFS2, COX4I1 and EIF2AK3, take part in the AD pathway and play a significant role in AD. Studies show that EIF2AK3 is not independent of APOE (apolipoprotein E), which metabolises fats in the body and plays a very significant role in AD; hence EIF2AK3 is associated with AD risk [79].

A few subnetworks with hub genes are shown in Figures 7 and 8. More significant sub-networks and corresponding hub genes are provided in the supplementary material.

Figure 7: The subnetwork in module 2 for AD dataset GSE1297. The two influential hub genes, PSEN1 and PPP3R1, are shown in red colour.

Figure 8: The subnetwork in module 1 for AD dataset GSE4097. The two influential hub genes, PML and CFLAR, are shown.

Table 8: Top-10 hub genes from each ranked module for AD dataset GSE1297 with their MCC scores.

| Sl. No. | Module 1 | MCC Score | Module 2 | MCC Score | Module 3 | MCC Score | Module 4 | MCC Score | Module 5 | MCC Score | Module 6 | MCC Score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | MUC4 | 10 | PSEN1 | 9 | NDUFA1 | 9 | NDUFB2 | 9 | UQCR10 | 7 | GRIN2C | 9 |
| 2 | APP | 9 | PPP3R1 | 9 | ATP5C1 | 1 | ATP5J2 | 1 | ATP5C1 | 1 | ACAA1 | 1 |
| 3 | PAK4 | 9 | ITPR1 | 8 | ATP5H | 1 | NDUFC2 | 1 | ATP6V1G2 | 1 | KIAA0754 | 1 |
| 4 | FETUB | 9 | ADGRB3 | 7 | ATP5O | 1 | UQCRC1 | 1 | NDUFA4 | 1 | TNFRSF25 | 1 |
| 5 | RNGTT | 9 | ENO1 | 7 | ATP5J2 | 1 | UQCRC2 | 1 | COX6C | 1 | NTRK3 | 1 |
| 6 | HOXB5 | 9 | AUH | 7 | NDUFA5 | 1 | UQCRCQ | 1 | COX7C | 1 | BCORL1 | 1 |
| 7 | DPP4 | 8 | UQCRC2 | 7 | ATP6V1G2 | 1 | COX7B | 1 | ATP5A1 | 1 | CASP9 | 1 |
| 8 | ESR1 | 8 | GRIN1 | 7 | UQCRQ | 1 | ATP5G3 | 1 | ATP5B | 1 | PLEKHF1 | 1 |
| 9 | KIF2C | 8 | DHPS | 6 | COX7B | 1 | ATP5J | 1 | - | - | C1QL1 | 1 |
| 10 | TRAF1 | 8 | GSPT2 | 6 | COX8A | 1 | NDUFS3 | 1 | - | - | RAB36 | 1 |

Table 9: Top-10 hub genes from each ranked module for AD dataset GSE4097 with their MCC scores.

| Sl. No. | Module 1 | MCC Score | Module 3 | MCC Score | Module 4 | MCC Score | Module 10 | MCC Score | Module 20 | MCC Score |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | PML | 19 | IPO5 | 11 | MAPK1 | 14 | LEPR | 12 | APP | 11 |
| 2 | CFLAR | 18 | LMNA | 10 | MTUS1 | 11 | POMZP3 | 10 | MAP1B | 9 |
| 3 | MKI67 | 17 | TRAPPC4 | 10 | PTGDS | 11 | GSE1 | 9 | NDUFA5 | 7 |
| 4 | EDA | 15 | HNRNPA3 | 10 | SNCA | 11 | PLAUR | 9 | EIF2AK3 | 6 |
| 5 | KMT2A | 14 | OAZ1 | 9 | KCTD12 | 10 | BDH1 | 9 | NDUFA4 | 6 |
| 6 | ESR2 | 14 | CAPNS1 | 9 | TNFSF13 | 10 | PRRC2B | 9 | NDUFA3 | 5 |
| 7 | DSTYK | 13 | SNORD139 | 9 | ICA1 | 10 | AGGF1 | 9 | COX4I1 | 5 |
| 8 | SARDH | 12 | ATP6V1G1 | 9 | VCX2 | 10 | CNTN1 | 9 | NDUFS2 | 5 |
| 9 | CDC14B | 12 | RPL10A | 8 | UBXN4 | 10 | VPS13D | 9 | GPR27 | 4 |
| 10 | SULF1 | 12 | CTDNEP1 | 8 | IDE | 9 | HYAL3 | 8 | ATP6V1A | 4 |
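The hub rankings in Tables 8 and 9 rest on a per-gene MCC score. As a minimal sketch, assuming that MCC here denotes the Maximal Clique Centrality provided by cytoHubba [57], in which a node's score is the sum of (|C| - 1)! over all maximal cliques C that contain it, the ranking step could look like the following. The function names (`maximal_cliques`, `mcc_scores`, `top_hubs`) and the toy edge list are illustrative assumptions and are not part of CluViaN or cytoHubba.

```python
from math import factorial


def maximal_cliques(adj):
    """Yield every maximal clique of an undirected graph.

    adj maps each node to the set of its neighbours (no self-loops).
    Uses the Bron-Kerbosch algorithm with pivoting.
    """
    def expand(r, p, x):
        if not p and not x:
            yield frozenset(r)
            return
        # Pivot on the node with the most neighbours in p to prune branches.
        pivot = max(p | x, key=lambda u: len(adj[u] & p))
        for v in list(p - adj[pivot]):
            yield from expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    yield from expand(set(), set(adj), set())


def mcc_scores(edges):
    """Return {node: MCC score} for an undirected edge list."""
    adj = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    scores = {node: 0 for node in adj}
    for clique in maximal_cliques(adj):
        weight = factorial(len(clique) - 1)  # (|C| - 1)!
        for node in clique:
            scores[node] += weight
    return scores


def top_hubs(edges, k=10):
    """Return the k highest-scoring nodes as (node, score) pairs."""
    scores = mcc_scores(edges)
    return sorted(scores.items(), key=lambda item: (-item[1], str(item[0])))[:k]


# Toy subnetwork (hypothetical labels, not real gene interactions):
# a triangle a-b-c plus a pendant node d attached to a.
toy_edges = [("a", "b"), ("a", "c"), ("b", "c"), ("a", "d")]
print(top_hubs(toy_edges, k=2))  # [('a', 3), ('b', 2)]
```

Under this definition, a node whose neighbours share no edges scores exactly its degree, so in a star-shaped subnetwork the centre scores its number of neighbours and every leaf scores 1.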

A brief summary of the hub genes and their roles in AD is given in Table 10.

5. Conclusion

In this paper, we have proposed a method to identify genes that are possibly responsible for Alzheimer’s disease by analyzing modules derived from Alzheimer’s disease co-expression networks. We identify and report a list of genes potentially responsible for AD. The roles of APP and PSEN1 in AD have already been confirmed by prior research. Hub genes are known to be potential candidate genes for a disease, and CluViaN detects them effectively. In addition, we have ranked a few more influential genes that play important roles in the progression of AD and confirmed their relevance through literature-based evidence. We extract central genes inside functional modules and rank them based on MCC scores. We consider only those modules that achieve good enrichment scores with respect to how the member genes participate in AD pathways.

Table 10: A brief summary of the various top ranked hub genes detected by CluViaN.

| Dataset | Gene | Roles in AD |
|---|---|---|
| GSE1297 | PSEN1 | Mutations in this gene cause type 3 AD and account for 30% to 70% of early-onset Alzheimer’s disease. |
| GSE1297 | PPP3R1 | It is responsible for the rapid progress of AD. |
| GSE1297 | NDUFB2 | It has a vital role in metabolism pathways of AD. |
| GSE1297 | NDUFA1 | It participates in the AD pathway. Defects in the mitochondria have been shown to be associated with AD, and NDUFA1 is a subunit of the mitochondrial respiratory chain NADH dehydrogenase (Complex I). Hence, it plays a role in AD. |
| GSE1297 | UQCR10 | It is a part of mitochondrial complex III; mitochondrial genes are altered in blood early in AD. |
| GSE1297 | APP | Mutations in APP are one of the main causes of AD and result in early-onset Alzheimer’s disease. It plays a large part in AD pathogenesis and produces sticky protein plaques in the brain. |
| GSE4097 | PML | It has roles in the nervous system and the brain in functions like plasticity, circadian rhythms and the response to proteins which can cause AD. |
| GSE4097 | CFLAR | Like the CASP8 and FADD genes, which participate in the AD pathway, the CFLAR gene regulates apoptosis. In AD-affected samples, the regulatory activity of CFLAR increases. |
| GSE4097 | MKI67 | Its function has been shown to be associated with AD. It encodes a protein required for cellular proliferation. |
| GSE4097 | LMNA | Mutations in this gene have been shown to cause many diseases, one of which is AD. It produces lamin; malfunction of the lamin-rich nucleoskeleton has been causally associated with AD. |
| GSE4097 | HNRNPA3 | Studies have reported that it is associated with neurodegenerative diseases like AD. In AD samples, its expression levels are reduced. |
| GSE4097 | MAPK1 | Its expression levels have been shown to be elevated in AD patients who are in their intermediate stages. It is one of the genes that participate in the AD pathway. |
| GSE4097 | MTUS1 | It is found to be expressed in the brain of AD patients. |
| GSE4097 | SNCA | It contributes to Lewy body pathology in AD and participates in the AD pathway. |
| GSE4097 | LEPR | Its expression levels have been shown to be the same in AD samples and control samples, but its signalling is impaired in the former. It is also present in the hypothalamus of the brain. |
| GSE4097 | APP | As described above. |
| GSE4097 | EIF2AK3 | Like mutations in APP, mutations in the APOE gene are one of the main causes of AD. The gene EIF2AK3 is found to be dependent on APOE. It also participates in the AD pathway. |
| GSE4097 | COX4I1 | Its expression levels are raised in patients with AD. This gene also participates in the AD pathway. |
| GSE4097 | NDUFS2 | It participates in the AD pathway. |
| GSE4097 | NDUFA3, NDUFA4, NDUFA5 | They participate in the AD pathway. Their expression levels are altered in the blood early in AD. |

We rank the extracted modules by their significance to AD. The method can be applied to other diseases as well. We propose a quality proximity measure to create a weighted co-expression network from gene expression data and a new non-exclusive functional module extraction technique, CluViaN, that can detect overlapping as well as intrinsic modules prevalent in biological networks. We establish that our method is effective in detecting functionally enriched modules. CluViaN is not limited to biological co-expression module extraction and is equally applicable to other graph-related domains, including finding protein complexes in PPI networks. Our current work is limited to co-expression networks. We feel that a causal gene regulatory network may be a better representation for identifying influential genes that disrupt normal pathways and cause diseases. Our future work will focus on addressing this issue.

6. Acknowledgement

References

[1] E. Bagyinszky, Y. C. Youn, S. S. A. An, S. Kim, The genetics of Alzheimer’s disease, Clinical Interventions in Aging 9 (2014).

[2] E. Segal, M. Shapira, A. Regev, D. Pe’er, D. Botstein, D. Koller, N. Friedman, Module networks: identifying regulatory modules and their condition-specific regulators from gene expression data, Nature Genetics 34 (2003) 166–176.

[3] A.-L. Barabasi, Z. N. Oltvai, Network biology: understanding the cell’s functional organization, Nature Reviews Genetics 5 (2004) 101–113.

[4] G. Zhao, L. A. Schriefer, G. D. Stormo, Identification of muscle-specific regulatory modules in Caenorhabditis elegans, Genome Research 17 (2007) 348–357.

[5] B. Liu, L. Fang, F. Liu, X. Wang, J. Chen, K.-C. Chou, Identification of real microRNA precursors with a pseudo structure status composition approach, PloS One 10 (2015) e0121501.

[6] G. Wu, L. Stein, A network module-based method for identifying cancer prognostic signatures, Genome Biology 13 (2012) 1.

[7] M. B. Eisen, P. T. Spellman, P. O. Brown, D. Botstein, Cluster analysis and display of genome-wide expression patterns, Proceedings of the National Academy of Sciences 95 (1998) 14863–14868.

[8] P. D’haeseleer, S. Liang, R. Somogyi, Genetic network inference: from co-expression clustering to reverse engineering, Bioinformatics 16 (2000) 707–726.

[9] H. Ge, Z. Liu, G. M. Church, M. Vidal, Correlation between transcriptome and interactome mapping data from Saccharomyces cerevisiae, Nature Genetics 29 (2001) 482–486.

[10] A. Brazma, J. Vilo, Gene expression data analysis, FEBS Letters 480 (2000) 17–24.

[11] S. Tomida, T. Hanai, H. Honda, T. Kobayashi, Analysis of expression profile using fuzzy adaptive resonance theory, Bioinformatics 18 (2002) 1073–1083.

[12] R. Sharan, R. Shamir, CLICK: a clustering algorithm with applications to gene expression analysis, in: Proc Int Conf Intell Syst Mol Biol, volume 8, p. 16.

[13] J. Wang, J. Delabie, H. C. Aasheim, E. Smeland, O. Myklebost, Clustering of the SOM easily reveals distinct gene expression patterns: results of a reanalysis of lymphoma study, BMC Bioinformatics 3 (2002) 36.

[14] A. M. Newman, J. B. Cooper, AutoSOME: a clustering method for identifying gene expression modules without prior knowledge of cluster number, BMC Bioinformatics 11 (2010) 117.

[15] B. Abu-Jamous, R. Fa, D. J. Roberts, A. K. Nandi, Yeast gene CMR1/YDL156W is consistently co-expressed with genes participating in DNA-metabolic processes in a variety of stringent clustering experiments, Journal of The Royal Society Interface 10 (2013) 20120990.

[16] S. Pu, J. Wong, B. Turner, E. Cho, S. J. Wodak, Up-to-date catalogues of yeast protein complexes, Nucleic Acids Research 37 (2009) 825–831.

[17] T. Nepusz, H. Yu, A. Paccanaro, Detecting overlapping protein complexes in protein-protein interaction networks, Nature Methods 9 (2012) 471–472.

[18] G. Cleuziou, A generalization of k-means for overlapping clustering, Rapport technique (2007) 54.

[19] T. Pang-Ning, M. Steinbach, V. Kumar, Introduction to Data Mining, Pearson Education, 2007.

[20] J. Pearl, Probabilistic reasoning in intelligent systems: networks of plausible inference, Elsevier, 2014.

[21] A. J. Butte, I. S. Kohane, Mutual information relevance networks: functional genomic clustering using pairwise entropy measurements, in: Biocomputing 2000, World Scientific, 1999, pp. 418–429.

[22] J. J. Faith, B. Hayete, J. T. Thaden, I. Mogno, J. Wierzbowski, G. Cottarel, S. Kasif, J. J. Collins, T. S. Gardner, Large-scale mapping and validation of Escherichia coli transcriptional regulation from a compendium of expression profiles, PLoS Biology 5 (2007) e8.

[23] P. E. Meyer, K. Kontos, F. Lafitte, G. Bontempi, Information-theoretic inference of large transcriptional regulatory networks, EURASIP Journal on Bioinformatics and Systems Biology 2007 (2007) 8–8.

[24] G. D. Tourassi, E. D. Frederick, M. K. Markey, C. E. Floyd, Application of the mutual information criterion for feature selection in computer-aided diagnosis, Medical Physics 28 (2001) 2394–2402.

[25] C. Ding, H. Peng, Minimum redundancy feature selection from microarray gene expression data, Journal of Bioinformatics and Computational Biology 3 (2005) 185–205.

[26] A. F. Villaverde, J. Ross, F. Morán, J. R. Banga, MIDER: network inference with mutual information distance and entropy reduction, PloS One 9 (2014) e96732.

[27] A. A. Margolin, I. Nemenman, K. Basso, C. Wiggins, G. Stolovitzky, R. Dalla Favera, A. Califano, ARACNE: an algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context, in: BMC Bioinformatics, volume 7, BioMed Central (2006), p. S7.

[28] A. Lachmann, F. M. Giorgi, G. Lopez, A. Califano, ARACNe-AP: gene network reverse engineering through adaptive partitioning inference of mutual information, Bioinformatics 32 (2016) 2233–2235.

[29] S. Roy, D. K. Bhattacharyya, J. K. Kalita, Reconstruction of gene co-expression network from microarray data using local expression patterns, BMC Bioinformatics 15 (2014) S10.

[30] A. Irrthum, L. Wehenkel, P. Geurts, et al., Inferring regulatory networks from expression data using tree-based methods, PloS One 5 (2010) e12776.

[31] V. Svetnik, A. Liaw, C. Tong, J. C. Culberson, R. P. Sheridan, B. P. Feuston, Random forest: a classification and regression tool for compound classification and QSAR modeling, Journal of Chemical Information and Computer Sciences 43 (2003) 1947–1958.

[32] F. Gómez-Vela, C. D. Barranco, N. Díaz-Díaz, Incorporating biological knowledge for construction of fuzzy networks of gene associations, Applied Soft Computing 42 (2016) 144–155.

[33] P. Langfelder, S. Horvath, WGCNA: an R package for weighted correlation network analysis, BMC Bioinformatics 9 (2008) 1.

[34] P. Mahanta, H. A. Ahmed, D. K. Bhattacharyya, A. Ghosh, FUMET: A fuzzy network module extraction technique for gene expression data, Journal of Biosciences 39 (2014) 351–364.

[35] P. Mahanta, H. A. Ahmed, D. K. Bhattacharyya, J. K. Kalita, An effective method for network module extraction from microarray data, BMC Bioinformatics 13 (2012) 1.

[36] J. Ruan, W. Zhang, Identification and evaluation of functional modules in gene co-expression networks, in: Systems Biology and Computational Proteomics, Springer, 2007, pp. 57–76.

[37] G. D. Bader, C. W. Hogue, An automated method for finding molecular complexes in large protein interaction networks, BMC Bioinformatics 4 (2003) 1.

[38] M. Li, J. Wang, J. Chen, A fast agglomerate algorithm for mining functional modules in protein interaction networks, in: 2008 International Conference on BioMedical Engineering and Informatics, volume 1, IEEE, pp. 3–7.

[39] Y. Asahiro, K. Iwama, H. Tamaki, T. Tokuyama, Greedily finding a dense subgraph, Journal of Algorithms 34 (2000) 203–221.

[40] N. Bansal, A. Blum, S. Chawla, Correlation clustering, Machine Learning 56 (2004) 89–113.

[41] A. Bhattacharya, R. K. De, Divisive Correlation Clustering Algorithm (DCCA) for grouping of genes: detecting varying patterns in expression profiles, Bioinformatics 24 (2008) 1359–1366.

[42] N. J. Salkind, Encyclopedia of measurement and statistics, volume 1, Sage, 2007.

[43] L. J. Heyer, S. Kruglyak, S. Yooseph, Exploring expression data: identification and analysis of coexpressed genes, Genome Research 9 (1999) 1106–1115.

[44] I. Priness, O. Maimon, I. Ben-Gal, Evaluation of gene-expression clustering via mutual information distance measure, BMC Bioinformatics 8 (2007) 111.

[45] M. Ester, H.-P. Kriegel, J. Sander, X. Xu, A density-based algorithm for discovering clusters in large spatial databases with noise, in: Knowledge Discovery Database, volume 96, pp. 226–231.

[46] S. Roy, D. K. Bhattacharyya, An approach to find embedded clusters using density based techniques, in: Distributed Computing and Internet Technology, Springer, 2005, pp. 523–535.

[47] M. Ankerst, M. M. Breunig, H.-P. Kriegel, J. Sander, Optics: ordering points to identify the clustering structure, in: ACM Sigmod Record, volume 28, ACM, pp. 49–60.

[48] T. Schaffter, D. Marbach, D. Floreano, GeneNetWeaver: in silico benchmark generation and performance profiling of network inference methods, Bioinformatics 27 (2011) 2263–2270.

[49] E. Glaab, A. Baudot, N. Krasnogor, A. Valencia, TopoGSA: network topological gene set analysis, Bioinformatics 26 (2010) 1271–1272.

[50] M. Kanehisa, S. Goto, KEGG: kyoto encyclopedia of genes and genomes, Nucleic Acids Research 28 (2000) 27–30.

[51] M. Ashburner, C. A. Ball, J. A. Blake, D. Botstein, H. Butler, J. M. Cherry, A. P. Davis, K. Dolinski, S. S. Dwight, J. T. Eppig, et al., Gene Ontology: tool for the unification of biology, Nature Genetics 25 (2000) 25–29.

[52] R. Apweiler, T. K. Attwood, A. Bairoch, A. Bateman, E. Birney, M. Biswas, P. Bucher, L. Cerutti, F. Corpet, M. D. Croning, et al., The InterPro database, an integrated documentation resource for protein families, domains and functional sites, Nucleic Acids Research 29 (2001) 37–40.

[53] D. A. Hosack, G. Dennis, B. T. Sherman, H. C. Lane, R. A. Lempicki, Identifying biological themes within lists of genes with ease, Genome Biology 4 (2003) R70.

[54] D. W. Huang, B. T. Sherman, R. A. Lempicki, Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists, Nucleic Acids Research 37 (2008) 1–13.

[55] D. W. Huang, B. T. Sherman, R. A. Lempicki, Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources, Nature Protocols 4 (2008) 44.

[56] H. Bolouri, Modeling genomic regulatory networks with big data, Trends in Genetics 30 (2014) 182–191.

[57] C.-H. Chin, S.-H. Chen, H.-H. Wu, C.-W. Ho, M.-T. Ko, C.-Y. Lin, cytoHubba: identifying hub objects and sub-networks from complex interactome, BMC Systems Biology 8 (2014) S11.

[58] M. Cruts, C. Van Broeckhoven, Presenilin mutations in Alzheimer’s disease, Human Mutation 11 (1998) 183–190.

[59] A. Larner, M. Doran, Clinical phenotypic heterogeneity of Alzheimer’s disease associated with mutations of the presenilin–1 gene, Journal of Neurology 253 (2006) 139–158.

[60] D. Peterson, C. Munger, J. Crowley, C. Corcoran, C. Cruchaga, A. M. Goate, M. C. Norton, R. C. Green, R. G. Munger, J. C. Breitner, et al., Variants in PPP3R1 and MAPT are associated with more rapid functional decline in Alzheimer’s disease: The Cache County Dementia Progression Study, Alzheimer’s & Dementia 10 (2014) 366–371.

[61] M. A. Fisher, M. F. Oleksiak, Convergence and divergence in gene expression among natural populations exposed to pollution, BMC Genomics 8 (2007) 108.

[62] N. Uehara, M. Mori, Y. Tokuzawa, Y. Mizuno, S. Tamaru, M. Kohda, Y. Moriyama, Y. Nakachi, N. Matoba, T. Sakai, et al., New MT-ND6 and NDUFA1 mutations in mitochondrial respiratory chain disorders, Annals of Clinical and Translational Neurology 1 (2014) 361–369.

[63] K. Lunnon, A. Keohane, R. Pidsley, S. Newhouse, J. Riddoch-Contreras, E. B. Thubron, M. Devall, H. Soininen, I. Kłoszewska, P. Mecocci, et al., Mitochondrial genes are altered in blood early in Alzheimer’s disease, Neurobiology of Aging (2017).

[64] R. J. O’Brien, P. C. Wong, Amyloid precursor protein processing and Alzheimer’s disease, Annual Review of Neuroscience 34 (2011) 185–204.

[65] H. Zheng, E. H. Koo, Biology and pathophysiology of the amyloid precursor protein, Molecular Neurodegeneration 6 (2011) 27.

[66] E. Korb, S. Finkbeiner, PML in the brain: from development to degeneration, Frontiers in Oncology 3 (2013) 242.

[67] W. Kong, X. Mou, J. Deng, B. Di, R. Zhong, S. Wang, Y. Yang, W. Zeng, Differences of immune disorders between Alzheimer’s disease and breast cancer based on transcriptional regulation, PloS One 12 (2017) e0180337.

[68] A. Grupe, Y. Li, C. Rowland, P. Nowotny, A. L. Hinrichs, S. Smemo, J. S. Kauwe, T. J. Maxwell, S. Cherny, L. Doil, et al., A scan of chromosome 10 identifies a novel locus showing strong association with late-onset Alzheimer disease, The American Journal of Human Genetics 78 (2006) 78–88.

[69] B. Frost, Alzheimer’s disease: An acquired neurodegenerative laminopathy, Nucleus 7 (2016) 275–283.

[70] Y. E. Cruz-Rivera, J. Perez-Morales, Y. M. Santiago, V. M. Gonzalez, L. Morales, M. Cabrera-Rios, C. E. Isaza, A selection of important genes and their correlated behavior in Alzheimer’s disease, Journal of Alzheimer’s Disease (2018) 1–14.

[71] J. Wong, Altered expression of RNA splicing proteins in Alzheimer’s disease patients: evidence from two microarray studies, Dementia and Geriatric Cognitive Disorders extra 3 (2013) 74–85.

[72] A. Gerschuetz, H. Heinsen, E. Grünblatt, A. K. Wagner, J. Bartl, C. Meissner, A. J. Fallgatter, S. Al-Sarraj, C. Troakes, I. Ferrer, et al., Neuron-specific alterations in signal transduction pathways associated with Alzheimer’s disease, Journal of Alzheimer’s Disease 40 (2014) 135–142.

[73] E. K. Kim, E.-J. Choi, Pathological roles of MAPK signaling pathways in human diseases, Biochimica et Biophysica Acta (BBA)-Molecular Basis of Disease 1802 (2010) 396–405.

[74] J. Chung, X. Wang, T. Maruyama, Y. Ma, X. Zhang, J. Mez, R. Sherva, H. Takeyama, K. L. Lunetta, L. A. Farrer, et al., Genome-wide association study of Alzheimer’s disease endophenotypes at prediagnosis stages, Alzheimer’s & Dementia 14 (2018) 623–633.

[75] Q. Wang, Q. Tian, X. Song, Y. Liu, W. Li, SNCA gene polymorphism may contribute to an increased risk of Alzheimer’s disease, Journal of Clinical Laboratory Analysis 30 (2016) 1092–1099.

[76] C. Linnertz, M. W. Lutz, J. F. Ervin, J. Allen, N. R. Miller, K. A. Welsh-Bohmer, A. D. Roses, O. Chiba-Falek, The genetic contributions of SNCA and LRRK2 genes to Lewy body pathology in Alzheimer’s disease, Human Molecular Genetics 23 (2014) 4814–4821.

[77] S. Maioli, M. Lodeiro, P. Merino-Serrais, F. Falahati, W. Khan, E. Puerta, A. Codita, R. Rimondini, M. J. Ramirez, A. Simmons, et al., Alterations in brain leptin signalling in spite of unchanged CSF leptin levels in Alzheimer’s disease, Aging Cell 14 (2015) 122–129.

[78] T. Yokota, M. Mishra, H. Akatsu, Y. Tani, T. Miyauchi, T. Yamamoto, K. Kosaka, Y. Nagai, T. Sawada, K. Heese, Brain site-specific gene expression analysis in Alzheimer’s disease patients, European Journal of Clinical Investigation 36 (2006) 820–830.

[79] Q.-Y. Liu, J.-T. Yu, D. Miao, X.-Y. Ma, H.-F. Wang, W. Wang, L. Tan, An exploratory study on STX6, MOBP, MAPT, and EIF2AK3 and late-onset Alzheimer’s disease, Neurobiology of Aging 34 (2013) 1519–e13.
