NSEA: N-NODE SUBNETWORK ENUMERATION ALGORITHM IDENTIFIES LOWER GRADE GLIOMA SUBTYPES WITH ALTERED SUBNETWORKS AND DISTINCT PROGNOSTICS

by

ZHIHAN ZHANG

Submitted in partial fulfillment of the requirements for the degree of Master of Science

Systems Biology and Bioinformatics

CASE WESTERN RESERVE UNIVERSITY

May 2017

CASE WESTERN RESERVE UNIVERSITY

SCHOOL OF GRADUATE STUDIES

We hereby approve the thesis/dissertation of

ZHIHAN ZHANG

candidate for the degree of Master of Science.

Committee Chair

GURKAN BEBEK

Committee Member

MARK CAMERON

Committee Member

JEAN-EUDES DAZARD

Date of Defense

Jul 22, 2016


Contents

List of Tables

List of Figures

Abstract

Introduction

Gene Expression Analysis

Advantages of Unsupervised Learning

Feature Extraction and Unsupervised Pathway Analysis

Application of Pathway-Based Methodology in Translational Medicine

Diffuse Low-Grade Glioma (LGG)

Overview

Prognosis Markers of LGG

Methodology

Overview

Concept Description

Network Enumeration

Subnetwork Selection and Expansion

Vector Representation of Subnetwork

Pipeline

Data Preparation

Enumeration

Subnetwork Selection, Expansion and Vectorization

Parameter Tuning

Future Validation

Results

Feature Subnetworks and Clustering

Subnetwork Groups

Patient Groups

Discussion

References

List of Tables

Table 1. Subnetwork Clusters and Corresponding Pathways

Table 2. Comparison of Current Patient Groups and Groups from TCGA

Table 3. Mutations and MGMT Methylation Statistics

List of Figures

Figure 1. Diagram of nSEA Algorithm

Figure 2. Vector Representation of Subnetwork

Figure 3. Edge Vector and Edge Score

Figure 4. Definition of Subnetwork Score (Inner-pattern Consistency)

Figure 5. Pipeline Summarization and Parameter Tuning

Figure 6. Heatmap and Clustering of LGG Samples and Subnetworks

Figure 7. Clinical Characteristics of LGG Patient Groups

Figure 8. Patient Group Characterization by Subnetworks

Figure 9. Patient Groups with Mutations and MGMT Methylation

NSEA: n-Node Subnetwork Enumeration Algorithm Identifies Lower Grade Glioma Subtypes with Altered Subnetworks and Distinct Prognostics

Abstract

by ZHIHAN ZHANG

Motivation: The prognosis of low-grade glioma (LGG) patients is very poor. Identifying subnetworks related to LGG can better describe the genetic make-up of the tumor.

Methods: n-Node Subnetwork Enumeration Algorithm (nSEA) was developed to identify significantly dysregulated subnetworks. We utilized a filtered network to enumerate n-node subnetworks exhaustively and score each subnetwork to carry out feature selection. These subnetwork seeds were expanded to identify tumor-specific subnetworks. Clustering these subnetworks provided patient groups with different subnetwork states.

Results: We identified 92 subnetwork features, 8 subnetwork groups and 5 patient groups. A new patient group with favorable outcomes was identified. By decision tree modeling, the new group was characterized by a down-regulated MAPK/B-Raf pathway and an up-regulated Notch pathway. It had fewer mutations of candidate , hypomethylation of NIPBL and hypermethylation of KALRN.

Conclusions: These results could provide opportunities for improved treatment options and personalized interventions of LGG.


Introduction

Gene Expression Analysis

Advantages of Unsupervised Learning

With the massive application of microarrays and high-throughput sequencing, more and more data have been generated to characterize genome-wide gene expression in healthy and diseased states. Researchers work towards understanding the underlying mechanisms of dysregulation. Genome-wide gene expression analysis is especially popular in cancer studies because of the unique etiology of the disease in each tissue type. As of July 2016, there are 1079 cancer-related datasets on Gene Expression Omnibus, accounting for 28% of the whole database [1].

Since gene expression databases keep expanding, the biggest challenge in gene expression studies of cancer is analyzing existing data rather than generating it. Among the research interests of cancer gene expression analysis, one major goal is to identify the important gene expression patterns within a specific type of cancer. Many algorithms have been developed to solve this problem in the last decade [2]. These algorithms can be generally divided into two classes: supervised and unsupervised methods. Supervised algorithms have been widely used to discover gene expression patterns associated with known phenotypes [3-6]. Most of them perform differential gene expression analysis, which aims to identify the most sensitive predictors associated with the target phenotypes [7, 8]. The identified gene set, ranging from a couple of genes to dozens of genes, may perform well based on the standards of machine learning.

However, a major premise of a supervised approach is that the testing data are generated under the same conditions as the training data, which ensures that both have the same statistical properties, such as distribution [9]. This implies that the gene signatures picked by a supervised approach may not have the same statistical power when the protocol or platform is changed. Even when tested within the same dataset, models built with completely supervised approaches are often susceptible to overfitting because of random noise and outliers [2].

In contrast, unsupervised algorithms are more flexible in cross-platform validation and more resistant to noise. Although unsupervised methods do not take advantage of the labels in the data, the generated model is in turn not confined by these known categorical variables. Instead of discovering the top differentiated genes related to a phenotype, an unsupervised approach is better at exploring the global landscape of genome-wide gene expression patterns. This characteristic makes the result not only more robust, since it is based on thousands of genes, but also easier to interpret, since different clusters of genes may represent different biological modules [9, 10]. Moreover, because of the freedom of unsupervised algorithms, they can be used to reveal the hidden structure of the data and discover potential novel subtypes [11, 12]. Although this freedom may also contribute irrelevant information to the model, if an unsupervised algorithm is able to minimize the confounding factors [12] and maximize the target factors in the result, it implies that the algorithm is in agreement with the underlying biological mechanism. In other words, a successful model built from an unsupervised approach is more likely to be a generative model than a discriminative model, and the former is more valuable from a biological perspective [9].

Feature Extraction and Unsupervised Pathway Analysis

One major challenge for unsupervised learning on gene expression data is how to reduce the dimension of the data. In other words, because of the “curse of dimensionality” phenomenon [13], a much smaller set of features characterizing the data is necessary for further studies in genome-wide gene expression analysis. Depending on whether features are original variables within the data or new variables generated from the data, dimension reduction methods can be classified into feature selection and feature extraction, respectively [14]. Since feature selection chooses a subset of features much smaller than the dimension of the original data, it suffers from information loss, just like selecting the top differentiated genes in a supervised approach. By comparison, the basic concept of feature extraction is to transform the original data into a lower-dimensional space while still keeping the global trends [14, 15]. Feature extraction is therefore more suitable for classification purposes, with advantages in both dimension reduction and data representation. Currently there are many feature extraction algorithms, adapted by researchers from traditional linear (LDA [16, 17], PCA [18-20], MDS [21, 22]) or non-linear (KPCA [23], Isomap [24, 25], SOM [26, 27]) dimension reduction algorithms, to fit the dimension scale of whole-genome gene expression data.

Over the past decade, biological knowledge generated by biomedical research has been increasing rapidly. The background information behind the data has been given more and more importance, not only because of the need for data interpretation and visualization, but also because of the great value of interactions between genes and [28]. Various databases [29-33] have been established and their corresponding tools [34-40] have been developed to satisfy the need to incorporate biological background into gene expression analysis.

Pathway analysis has become an essential part of genome-wide pattern recognition and has already helped researchers reach many important conclusions, especially in cancer research [41-45]. Although prior knowledge is widely used today, most researchers use it merely as supporting information to interpret the result. However, the separation of prior knowledge from the learning process itself may lead to biased results due to defects underlying the data, for example, batch effects [46]. To address this issue, various algorithms have been designed. For Gene Ontology (GO), researchers calculate GO-based correlations or distances between genes, which can then be incorporated into various existing algorithms [47, 48]. One major drawback of gene ontology is that genes within each term are not all functionally related across all diseases. Atushi et al. developed an algorithm which measures the information richness within each GO term with a model-free approach [49]. This allows them to select gene sets that are only partially activated. Another kind of commonly used prior knowledge is pre-defined interactions between proteins and/or genes. The network built upon these interactions can direct the learning process and help generate more robust network-based biomarkers. Owing to the development of machine learning and the understanding of pathway alterations, many supervised network algorithms have been designed [50-55]. However, to our knowledge, few unsupervised network selection algorithms are as advanced as the supervised ones, since it is challenging for unsupervised network methods to define a meaningful selection criterion. Common criteria are usually based on network structure and its corresponding parameters. First, the expression data and the pre-defined interaction network are integrated by converting the expression data into node and edge weights [37]. Second, a module identification algorithm is applied to the weighted network [56, 57]. However, both steps have limitations. In the first step, the relations between non-neighboring nodes are lost. In the second step, these algorithms have a strong selection bias towards densely connected modules, which are not necessarily biologically important; moreover, these densities are largely influenced by previous research interests.

In conclusion, although unsupervised pathway analysis provides a global picture of pathway expression patterns and incorporates background knowledge into the algorithm, current methodology is sensitive to criterion selection and vulnerable to overfitting. More effort needs to be put in this area to overcome these problems.


Application of Pathway-Based Methodology in Translational Medicine

Basic and applied research are two categories of clinical research. The former tries to elucidate the features of the disease, while the latter focuses on the improvement of patient diagnosis and therapy, including operations and medicines [58, 59]. Translational medicine aims at bridging the gap between basic research and applied research by integrating molecular-level data into clinical trials. This process can be divided into 3 steps [60, 61]. In the first step, integration, disparate data from genomic, imaging and clinical sources are ordered, compared and combined according to their infrastructures. In the second step, interpretation, the integrated data are analyzed to select potential targets that may be worth considering for clinical intervention. In the third step, evaluation, the efficacies of these targets are measured and successful interventions are scaled before being applied to the entire population. It should be noted that information gathered at the third step can also flow back to the first step, improving the accuracy of data integration and interpretation. This cycle dramatically increases the efficacy of discovering disease biomarkers and therapeutic agents, together with the optimization of disease treatment.

Particularly, for cancer research, translational medicine helped researchers not only find numerous agents with high efficacy and low toxicity, but also the patterns of cancer cell sensitization, which largely facilitated cancer-related therapeutics [59].


Data integration is the first crucial step of translational medicine. Besides combining heterogeneous data from various sources, data integration also helps improve the confidence of data analysis and makes complex or more complete disease modeling possible [62]. The challenge of data integration lies not only in the different infrastructures of the datasets but also in the method used to facilitate further interpretation and validation [63]. Compared with other methods, analysis results following pathway-based data integration are much easier to interpret and visualize [61, 64]. This integration methodology is also a more intuitive and effective way to incorporate biological background into the data. For cancer specifically, it has been discovered that mutations and driver genes are often enriched in certain pathways [65, 66], making this methodology even more suitable. Although pathway-based data integration has many benefits, its most difficult part is overcoming the over-representation caused by certain curated pathways. A robust pipeline is needed to unleash the power of pathway-based methodology in the data integration step of translational medicine.

Diffuse Low-Grade Glioma (LGG)

Overview

Gliomas are primary tumors of the brain and central nervous system that originate from glial cells. Among intracranial tumors, they comprise 31% of all tumors and 81% of malignant tumors. Gliomas can be classified into circumscribed and diffuse tumors based on their infiltrative abilities. Circumscribed gliomas are classified as grade I tumors by the World Health Organization (WHO), whereas diffuse gliomas are further subdivided into grade II-IV tumors based on their aggressiveness. Grade IV gliomas are also named glioblastomas. There is some ambiguity in the definition of low-grade glioma (LGG). Sometimes it is defined as grade I-II gliomas, as opposed to grade III anaplastic gliomas. Sometimes it refers to grade II-III gliomas, i.e. diffuse gliomas less malignant than glioblastoma. In this thesis, the latter definition is used [67-69].

LGGs comprise nearly 40% of all gliomas. Based on histology, LGGs can be divided into astrocytomas, oligodendrogliomas, and a mixture of these two types, oligoastrocytomas. Although they are only grade II-III, LGGs are highly invasive since they can travel far away from the primary tumor site through the neuropil. This feature makes LGGs nearly impossible to resect completely. The remaining tumor cells can result in recurrence and even progression into glioblastoma. Since recurrence and malignancy rates differ among LGG patients, median survival varies widely, ranging from 1 to 15 years, with a 25% 5-year survival rate [67-69].

Current therapies of LGG include surgery, radiation and chemotherapy. Several trials have been conducted to determine the efficacies and parameters of these treatments. For radiotherapy, EORTC 22844 sought to determine the dose-response relationship and EORTC 22845 compared the efficacies of postoperative early therapy versus delayed therapy [70, 71]. For chemotherapy, RTOG 9802 examined the efficacy of procarbazine, lomustine and vincristine (PCV) in patients following radiation therapy, and RTOG 0424 compared the efficacy of temozolomide (TMZ) with historical controls [72, 73]. Besides the improvement of progression-free survival by early radiotherapy, the major finding of these trials was that radiotherapy coupled with chemotherapy improved patient survival significantly. This was also implied by the result of EORTC 26951, which was designed solely for anaplastic oligodendroglioma [74]. Also, 1p/19q co-deletion has been found to play a predictive role in the survival of patients receiving radiochemotherapy in several trials.

Recent results of RTOG 9402 confirmed that radiochemotherapy with PCV is more effective in patients with 1p/19q co-deletion than in patients without it [75]. Several other molecular markers, including IDH mutation and MGMT promoter methylation, were also implied to be significant based on these results. However, although they have prognostic value, whether they are predictive for certain treatments needs more evidence [69].

Prognosis Markers of LGG

Although LGGs can be classified into three subtypes by histology, this traditional classification has been challenged in recent years, especially following the discovery of IDH mutation and 1p/19q co-deletion among LGG patients. LGGs with both an IDH mutation (IDH1 or IDH2) and deletion of chromosome arms 1p and 19q (1p/19q co-deletion) respond better to radiochemotherapy and are associated with longer survival than LGGs without these alterations (Figure 8F) [68].

Methylation status of MGMT (O6-alkylguanine DNA alkyltransferase) is also related to patient survival. It has been demonstrated that patients with methylated MGMT have better survival in all grades of glioma, including LGG [69]. Mutation of the TERT promoter is associated with tumorigenesis in many cancer types [76]. In LGG, this mutation is very prevalent in patients with 1p/19q co-deletion and is thus also an important prognostic marker [77]. Besides their prognostic value, subtypes based on these markers have lower intra- and inter-observer variability than histological subtypes. In practice, molecular subtypes are gradually replacing the traditional histological subtypes to help improve the quality of patient-specific therapies.


Methodology

Overview

In this thesis, we present a novel pathway analysis algorithm, the n-Node Subnetwork Enumeration Algorithm (nSEA) (Figure 1). The goal of nSEA is to identify differentiating patterns among disease samples in an unsupervised process. The algorithm is based on a bottom-up methodology in which a large sparse biological network is enumerated and decomposed into n-node subnetworks exhaustively. These subnetworks are then evaluated, ranked and filtered according to their inner-pattern consistency and network topology. The selected n-node subnetworks are expanded to their neighboring nodes to form more stable network structures. Using principal component analysis of network states, we have identified subnetworks that can discriminate disease states [55, 78]. The final set of subnetworks represents the major dynamics in the protein-protein interaction network and portrays a global picture of pathway dysfunction across cancer subtypes.

Using nSEA, we applied the pipeline to LGG samples. We compared our subtypes with current classification and identified significant subnetworks related to our clustering. We also explored the mutation, copy-number variation and methylation driving force behind this classification.


Figure 1. Diagram of nSEA Algorithm

(A) A raw protein-protein interaction network is converted to a sparse network. Edges are filtered based on the expression difference of their corresponding node pairs. (B) The concept of network enumeration. All possible 4-node subnetworks are extracted from the original network, forming a list. Letters represent proteins. Three 4-node subnetworks and their positions in the list are annotated in colors as examples. (C) Feature selection based on the subnetwork list. Subnetworks are ranked by their inner-pattern consistency in decreasing order. They are then scanned and tested for topology (not shown in the diagram) from top to bottom. If a subnetwork is selected into the feature set, it excludes all other subnetworks that share any node with it. (D) Selected subnetworks are expanded to neighboring nodes that share similar patterns, forming larger subnetworks. Solid lines represent edges at the current step. Dashed lines represent potential edges, which can be added during the expansion. Non-significant edges are omitted in this figure.


Concept Description

Network Enumeration

The core function of nSEA is to enumerate the whole protein-protein interaction (PPI) network to generate all possible n-node subnetworks, where n is the number of nodes, or the size, of the subnetwork. The input network should be a simple graph, whereas the output subnetworks are required to be not only simple graphs but also connected graphs. Enumeration, in contrast to decomposition, means that one node can appear in more than one subnetwork, but the composition of each subnetwork is unique. More strictly, the collection of n-node subnetworks from the original network should satisfy the following conditions:

1) Each subnetwork is a subgraph from the original network.

2) Each subnetwork is a connected simple graph.

3) Each subnetwork in the collection has a unique node set.

4) No other n-node subnetworks besides those in the collection exist in the original network.
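The four conditions above can be checked directly on a small graph. The following is a minimal brute-force sketch in Python (the thesis pipeline itself was written in R); it assumes that a subnetwork is identified by its node set and that connectivity is judged on the induced subgraph. The function names and the toy network are hypothetical and serve only as a reference definition, since testing every node combination is far too slow for a real PPI network.

```python
from itertools import combinations

def is_connected(nodes, adjacency):
    """Return True if `nodes` induce a connected subgraph of `adjacency`."""
    nodes = set(nodes)
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:
        current = stack.pop()
        # Only follow edges that stay inside the candidate node set.
        for neighbor in adjacency.get(current, set()) & nodes:
            if neighbor not in seen:
                seen.add(neighbor)
                stack.append(neighbor)
    return seen == nodes

def enumerate_brute_force(adjacency, n):
    """Collect every n-node set whose induced subgraph is connected."""
    return {
        frozenset(combo)
        for combo in combinations(sorted(adjacency), n)
        if is_connected(combo, adjacency)
    }

# Hypothetical toy network: edges A-B, B-C, C-D and B-D.
toy = {"A": {"B"}, "B": {"A", "C", "D"}, "C": {"B", "D"}, "D": {"B", "C"}}
triples = enumerate_brute_force(toy, 3)
```

Storing each subnetwork as a frozenset makes condition 3 (unique node sets) hold by construction, because a Python set cannot contain the same node set twice.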

Subnetwork Selection and Expansion

The biggest challenge of analyzing n-node subnetworks is prioritizing millions of subnetworks. Two problems need to be addressed. One is how to distinguish target features from noise. This can be accomplished by scoring the subnetworks and keeping the significant ones. In addition, the more evidence, i.e. interactions, supports a subnetwork feature, the more robust this feature is. We introduced the concept of network expansibility, which indicates how far an n-node subnetwork can expand by including neighboring nodes and interactions based on expansion requirements. Subnetwork score and expansibility together determine whether an n-node subnetwork is selected.

The other problem is how to reduce the redundancy of the feature set, since enumeration on a dense network can generate heavily overlapping subnetworks. This can be solved by selecting subnetworks one by one in an exclusive manner. Once a subnetwork is selected, the nodes in this subnetwork are removed from the original network, and any other subnetworks overlapping with it are removed from the subnetwork collection. This guarantees that, among the subnetworks in the feature set, each node appears in only one subnetwork. We summarize the feature selection process as follows:

1) Define a subnetwork scoring function. Score all n-node subnetworks in the list.

2) Rank all subnetworks by their subnetwork scores in decreasing order.

3) Select the first subnetwork. Expand this subnetwork by including neighboring nodes one by one if defined conditions are met.

4) When expansion is finished, evaluate the expansibility (increment of the subnetwork size). If the expansibility is acceptable, include this expanded subnetwork in the feature set.

5) If the expanded subnetwork is selected as a feature, remove from the list the other subnetworks sharing a node with this n-node subnetwork.

6) Remove the nodes of this n-node subnetwork from the original network.

7) Repeat this cycle until the subnetwork list becomes empty.
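The cycle above can be sketched as a single greedy loop. This is a minimal Python illustration under stated assumptions, not the thesis's R implementation: `expand` is a hypothetical callback that returns the expanded node set for a seed, `min_expansibility` is a hypothetical name for the expansibility threshold, and overlapping seeds are skipped lazily instead of being deleted from the list, which gives the same result.

```python
def select_features(scored_seeds, expand, min_expansibility):
    """Greedy, mutually exclusive selection of seed subnetworks.

    scored_seeds: iterable of (score, frozenset of node names)
    expand: hypothetical callback, expand(seed, removed_nodes) -> set of nodes
    min_expansibility: smallest accepted increase in subnetwork size
    """
    # Step 2: rank the seeds by score, highest first.
    ranked = sorted(scored_seeds, key=lambda item: item[0], reverse=True)
    removed_nodes = set()  # nodes already taken out of the original network
    features = []
    for _score, seed in ranked:
        # Step 5, applied lazily: a seed sharing a node with a selected seed is dropped.
        if seed & removed_nodes:
            continue
        # Step 3: expand the current top-ranked seed.
        expanded = expand(seed, removed_nodes)
        # Step 4: accept only if the seed grew enough.
        if len(expanded) - len(seed) >= min_expansibility:
            features.append(expanded)
            removed_nodes |= seed  # Step 6
    return features

# Hypothetical usage: every seed gains exactly one made-up neighbor.
seeds = [(0.9, frozenset("ABCD")), (0.8, frozenset("DEFG")), (0.7, frozenset("HIJK"))]
toy_expand = lambda seed, removed: set(seed) | {"X" + min(seed)}
features = select_features(seeds, toy_expand, min_expansibility=1)
```

In the toy run the second seed is discarded because it shares node D with the first, so only two features remain.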

Vector Representation of Subnetwork

Subsetting the gene expression matrix by the genes in a subnetwork forms a subnetwork gene expression matrix, or subnetwork matrix for short. When using subnetworks as features, it is more convenient to compress the two-dimensional subnetwork matrix into a one-dimensional vector (Figure 2). A common unsupervised method to do this is principal component analysis (PCA). PCA uses an orthogonal transformation to convert a set of observations of possibly correlated variables into linearly uncorrelated variables. The first principal component captures the largest variance in the data and hence contains the major information on gene expression differences among the patients [79]. Therefore, the vector representation of a subnetwork is defined as the first principal component of the subnetwork matrix.
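To make the vector representation concrete, the sketch below computes first-principal-component scores for a small subnetwork matrix in plain Python. It is an illustration under assumptions: rows are genes and columns are samples, the leading eigenvector of the gene-by-gene covariance matrix is found by power iteration (a swapped-in technique chosen to avoid dependencies; the thesis does not state which PCA routine was used), and the function name is hypothetical. The sign of a principal component is arbitrary.

```python
import math

def first_principal_component(matrix, iterations=200):
    """Return one PC1 score per sample for a genes-by-samples matrix."""
    n_genes, n_samples = len(matrix), len(matrix[0])
    # Center each gene across samples.
    centered = [[value - sum(row) / n_samples for value in row] for row in matrix]
    # Gene-by-gene sample covariance matrix.
    covariance = [
        [sum(a * b for a, b in zip(centered[p], centered[q])) / (n_samples - 1)
         for q in range(n_genes)]
        for p in range(n_genes)
    ]
    # Power iteration from a fixed, non-uniform start vector. A start vector
    # exactly orthogonal to PC1 would fail; real code would use an SVD routine.
    loadings = [float(p + 1) for p in range(n_genes)]
    for _ in range(iterations):
        product = [sum(covariance[p][q] * loadings[q] for q in range(n_genes))
                   for p in range(n_genes)]
        norm = math.sqrt(sum(value * value for value in product))
        if norm == 0.0:  # no variance at all: nothing to extract
            break
        loadings = [value / norm for value in product]
    # Project every sample onto the leading direction.
    return [sum(loadings[p] * centered[p][s] for p in range(n_genes))
            for s in range(n_samples)]

# Hypothetical subnetwork of two genes with opposite trends over four samples.
toy_matrix = [[1, 2, 3, 4], [4, 3, 2, 1]]
scores = first_principal_component(toy_matrix)
```

The resulting vector has one entry per sample, which is what allows subnetwork vectors to be stacked into a feature matrix for clustering patients.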


Figure 2. Vector Representation of Subnetwork

(A) A subnetwork consisting of 6 nodes and 8 edges. The subnetwork state, which represents the expression pattern of this subnetwork in sample 1, is colored according to gene expression levels.

(B) Expression matrix of the subnetwork in (A) with 10 samples. Expression values are centered and scaled.

(C) Subnetwork vector is defined as the first principal component of the expression matrix. It is used as the summary of the patterns of this subnetwork across all samples. It is also used to cluster samples in the following steps.


Pipeline

Data Preparation

Since subnetwork-based subtypes of LGG and their related clinical features were explored in this study, the target dataset should contain gene expression data, common LGG biomarkers and clinical information, especially survival times. We selected LGG data from TCGA since this dataset met all the requirements and had the largest number of samples among all LGG studies [68, 80, 81]. The dataset was generated from a combination of LGG samples from different laboratories (Biospecimen Core Resources) and batches. Sequencing data were generated on the Illumina HiSeq 2000 platform. The level-3 expression data were obtained from the UCSC Cancer Browser, a project organizing and visualizing genomic, phenotypic, and clinical data of TCGA [82]. The dataset contains 516 patient samples, with 169 (33%) astrocytomas, 174 (34%) oligodendrogliomas and 114 (22%) oligoastrocytomas. Expression data were then imported into R and non-tumor samples were removed. Gene expression values, represented by RSEM levels, were incremented by 1 and then log2-transformed. For subtype identification, gene-wise normalization is necessary since relative expression levels are more useful than absolute expression levels for comparison between genes. Before this step, outliers for each gene were identified and truncated with the adjboxStats function (coef=1.5, a=-4, b=3) from the robustbase R package [83]. Normalization was then done across all patients, transforming the expression levels of each gene to mean 0 and standard deviation 1.
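The per-gene transformation described above can be illustrated with a short Python sketch (the thesis performed these steps in R). Several points are assumptions: the outlier fences are passed in as optional arguments rather than derived from the adjusted boxplot, which this sketch does not reimplement, the sample standard deviation is used for scaling, and all function names are hypothetical.

```python
import math
import statistics

def log2_plus_one(values):
    """log2(x + 1) transform of raw RSEM levels."""
    return [math.log2(value + 1.0) for value in values]

def standardize(values):
    """Scale one gene to mean 0 and (sample) standard deviation 1."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    if sd == 0:
        return [0.0 for _ in values]  # constant gene: no information
    return [(value - mean) / sd for value in values]

def prepare_gene(rsem_values, lower=None, upper=None):
    """Transform, optionally truncate, then standardize one gene.

    lower/upper are hypothetical fences on the log scale; in the thesis
    they would come from robustbase::adjboxStats, not from the caller.
    """
    logged = log2_plus_one(rsem_values)
    if lower is not None:
        logged = [max(value, lower) for value in logged]
    if upper is not None:
        logged = [min(value, upper) for value in logged]
    return standardize(logged)

# Hypothetical RSEM levels of one gene in four samples.
z_scores = prepare_gene([0, 1, 3, 7])
```

Adding 1 before the logarithm keeps zero counts finite, and standardizing each gene separately puts highly and lowly expressed genes on the same scale before they are compared.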


Protein-protein interaction (PPI) data were downloaded from the STRING database [84]. STRING collects protein interactions based on various types of evidence, including genomic context, high-throughput experiments, co-expression data, text-mining and other interaction databases. Each type of evidence is scaled independently, and a combined score is then calculated that reflects how many sources support the interaction and how confident each piece of evidence is [85]. We selected 0.7 as the threshold to filter the PPI network, since this threshold is used by the STRING database to define high-confidence interactions [86]. The filtered undirected network contains 13,562 nodes and 277,172 edges.

Other clinical and molecular data were extracted from a previous TCGA publication [81].

Enumeration

Because computational power is limited, the network used as input to the enumeration step has to be sparse. Therefore, the PPI network from STRING needs further filtration. Since the subnetwork vector represents the first principal component, i.e. the largest variance of the expression values within the subnetwork, edge filtration should also serve this purpose.

We defined an edge score as

$$S_{e_k} = \mathrm{sd}\left(g_{v_i} - g_{v_j}\right), \quad e_k = (v_i, v_j), \; i \neq j$$

where sd is the standard deviation and $g_{v_i}$ is the expression vector of gene $v_i$.

For each edge, an edge score is calculated (Figure 3). Edge filtration was done by selecting the top proportion of edges ranked by the edge score. The proportion, PE, was set to 5% in this study.
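The edge score and the top-proportion filter can be sketched in a few lines of Python. This is an illustration only: the rounding rule (ceiling, with at least one edge kept) is an assumption, since the thesis gives only the proportion, and the function names and toy data are hypothetical.

```python
import math
import statistics

def edge_score(expr_i, expr_j):
    """Standard deviation of the edge vector (difference of two expression vectors)."""
    return statistics.stdev([a - b for a, b in zip(expr_i, expr_j)])

def filter_edges(edges, expression, proportion=0.05):
    """Keep the top `proportion` of edges ranked by edge score."""
    ranked = sorted(
        edges,
        key=lambda edge: edge_score(expression[edge[0]], expression[edge[1]]),
        reverse=True,
    )
    # Assumed rounding rule: round up and never return an empty network.
    keep = max(1, math.ceil(proportion * len(ranked)))
    return ranked[:keep]

# Hypothetical expression vectors over four samples.
expression = {"A": [1, 2, 3, 4], "B": [4, 3, 2, 1], "C": [1, 2, 3, 4.5]}
edges = [("A", "B"), ("A", "C"), ("B", "C")]
top_edges = filter_edges(edges, expression, proportion=0.05)
```

Edges whose endpoints move in opposite directions across samples, such as A-B and B-C here, receive high scores, whereas the nearly co-varying pair A-C is ranked last and filtered out first.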

Figure 3. Edge Vector and Edge Score

(A) The same subnetwork in Figure 2 (A).

(B) The same subnetwork matrix in Figure 2 (B).

(C) Edge vector is defined as the difference between expression vectors of the corresponding node pair. Edge A-D is used here as an example. The standard deviation of an edge vector is defined as its edge score which is then used for edge filtration.


The number of subnetworks that the enumeration needs to generate grows exponentially with the subnetwork size n. Because computational power is limited, one has to balance the size of the subnetwork against the number of edges in the filtered network. We set n to 4, since setting n to 5 would require further filtration of the network, which would lead to the loss of functional genes.

Enumeration was done recursively, with n growing from 2 to 4. First, the algorithm finds all 2-node subnetworks, whose number equals the number of edges. Then all possible 3-node subnetworks are generated by adding one additional neighboring node to the 2-node subnetworks. Since the uniqueness of a subnetwork is defined only by its nodes, each subnetwork can be represented by a character vector of subnetwork node names, and a collection of subnetworks can be represented by a list of character vectors. The non-redundant collection of 3-node subnetworks is produced by removing the duplicates within this list. The R package data.table, which is designed for fast addition, modification, deletion and aggregation of large datasets, is used for this purpose. Similarly, 4-node subnetworks are generated from the collection of 3-node subnetworks. Removing duplicates dramatically decreases the number of 4-node candidates generated and therefore shortens the running time of the algorithm to an acceptable range. Depending on the circumstances, this approach is also applicable when n>4.
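The grow-and-deduplicate procedure can be sketched as follows. The thesis implements it in R with data.table; this Python version is only an illustration, in which a set of frozensets plays the role of the de-duplicated list of character vectors. The function name and the toy edge list are hypothetical.

```python
def enumerate_subnetworks(edges, n):
    """Enumerate connected n-node subnetworks by growing (k-1)-node ones."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    # The 2-node subnetworks are exactly the edges of the simple graph.
    current = {frozenset(edge) for edge in edges}
    for _ in range(n - 2):
        grown = set()
        for nodes in current:
            # Every node adjacent to the subnetwork but not yet inside it.
            neighbors = set().union(*(adjacency[node] for node in nodes)) - nodes
            for neighbor in neighbors:
                # Adding to a set removes duplicate node sets automatically.
                grown.add(nodes | {neighbor})
        current = grown
    return current

# Hypothetical toy network: edges A-B, B-C, C-D and B-D.
toy_edges = [("A", "B"), ("B", "C"), ("C", "D"), ("B", "D")]
three_node = enumerate_subnetworks(toy_edges, 3)
four_node = enumerate_subnetworks(toy_edges, 4)
```

Growing by one neighbor at a time reaches every connected node set, because any connected subnetwork contains at least one node whose removal leaves the rest connected. De-duplicating after each size keeps the next round from expanding the same node set several times.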


Subnetwork Selection, Expansion and Vectorization

Since the subnetwork vector represents the major variance within the subnetwork, a subnetwork score, which measures the inner-pattern consistency of a subnetwork, is used for feature selection:

$$\Delta g_{e_k} = g_{v_i} - g_{v_j}, \quad e_k = (v_i, v_j), \; i \neq j$$

$$I_{Sbn} = \frac{\sum_{m<n} \left| \mathrm{cor}\left(\Delta g_{e_m}, \Delta g_{e_n}\right) \right|}{N_e}$$

where $g_{v_i}$ denotes the expression vector of node $v_i$ and $\Delta g_{e_k}$ denotes the edge vector of edge $e_k$. cor denotes the Pearson correlation, $N_e$ denotes the total number of edges, and $I_{Sbn}$ denotes the score of the subnetwork (Figure 4). To avoid extreme cases when only one node has degree larger than 1, 4-node subnetworks with average degree less than or equal to 0.75 were scored 0. Inner-pattern consistency assumes that a consistent pair of opposite expression directions between two groups of genes implies functional interactions between them. The corresponding target subnetwork should be the most representative subnetwork of this difference and, during the expansion step, reinforce this expression pattern by including more neighboring nodes.
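A small Python sketch may help to make the score concrete. It follows the description in the caption of Figure 4, taking the mean of the absolute Pearson correlations over all pairs of edge vectors (the lower triangle of the edge correlation matrix); dividing by the number of edge pairs is therefore an assumption of this sketch. The degree-based zeroing rule is not included, and the function names and toy data are hypothetical.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation of two equally long vectors (0.0 if one is constant)."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    denominator = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    if denominator == 0:
        return 0.0
    return sum(a * b for a, b in zip(dx, dy)) / denominator

def inner_pattern_consistency(edges, expression):
    """Mean absolute correlation between all pairs of edge vectors."""
    edge_vectors = [
        [a - b for a, b in zip(expression[u], expression[v])] for u, v in edges
    ]
    pairs = list(combinations(edge_vectors, 2))
    if not pairs:
        return 0.0  # a single edge has no pattern to compare against
    return sum(abs(pearson(x, y)) for x, y in pairs) / len(pairs)

# Hypothetical three-gene subnetwork measured in four samples.
expression = {"A": [1, 2, 3, 4], "B": [4, 3, 2, 1], "C": [2, 4, 6, 8]}
edges = [("A", "B"), ("B", "C"), ("A", "C")]
score = inner_pattern_consistency(edges, expression)
```

Taking absolute values means that edge vectors pointing in opposite directions still count as consistent, which matches the idea that two groups of genes with opposite expression directions form one coherent pattern.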


Figure 4. Definition of Subnetwork Score (Inner-pattern Consistency)

(A) The same edge vectors in Figure 3 (C).

(B) Edge matrix combines all edge vectors from the subnetwork.

(C) Edge correlation matrix is calculated from the edge matrix. Lower triangle (diagonal excluded) of the matrix is used to calculate the inner-pattern consistency which is defined as the mean of the absolute values of the correlations.

The aforementioned feature selection and expansion of 4-node subnetworks are described here in detail:


1) Calculate the score (inner-pattern consistency) $I$ of all 4-node subnetworks. Rank the subnetworks by score in decreasing order, forming a list $L = \{S_1, S_2, S_3, \ldots, S_K\}$, where $S_i$ denotes the $i$th subnetwork in $L$ and $I_1 \geq I_2 \geq I_3 \geq \ldots \geq I_K$.

2) Selection and expansion are done recursively:

   a. If $L$ is empty, stop the feature selection. Otherwise, set $S_1$ as the initial subnetwork $B_0$. Here $B_j$ denotes the subnetwork with $j$ nodes added by the previous expansion loops.

   b. The expansion loop proceeds as follows. Initially, $j=0$.

      i. For the neighboring nodes $\{A_1, A_2, A_3, \ldots, A_N\}$ of $B_j$ in the original network $O$, calculate the subnetwork scores $\{I_j^1, I_j^2, I_j^3, \ldots, I_j^N\}$ of the subnetworks $\{B_j^1, B_j^2, B_j^3, \ldots, B_j^N\}$, where $B_j^m = B_j + A_m$, i.e. node $A_m$ is added to $B_j$.

      ii. A predefined set of conditions $\{C_1, C_2, \ldots, C_P\}$ is used to control the expansion process (defined below). $\{R_j^1, R_j^2, R_j^3, \ldots, R_j^N\}$ is a Boolean vector: $R_j^m = 1$ if $B_j^m$ meets all of $\{C_1, C_2, \ldots, C_P\}$; otherwise, $R_j^m = 0$.

      iii. Define the combined scores as the element-wise product $\{Y_j^1, Y_j^2, Y_j^3, \ldots, Y_j^N\} = \{I_j^1, I_j^2, I_j^3, \ldots, I_j^N\} \cdot \{R_j^1, R_j^2, R_j^3, \ldots, R_j^N\}$.

      iv. Let $Y_j^M$ be the maximum value in $\{Y_j^1, Y_j^2, Y_j^3, \ldots, Y_j^N\}$. If $Y_j^M = 0$, break the loop and set $S'_1 = B_j$ as the fully expanded subnetwork of $S_1$.

      v. Set $B_{j+1} = B_j + A_M$, then set $j = j+1$.

      vi. Go to step (i).

   c. Subnetwork size and expansibility are denoted as $Z$ and $E$ respectively. Calculate $E = Z(S'_1) - Z(S_1)$. The predefined threshold of expansibility is denoted as $E_0$. If $E < E_0$, remove $S_1$ from the subnetwork list $L$, renumber the subnetworks from 1 to $K-1$ and go to step (a).

   d. Add $S'_1$ to the feature set $F$. If $\{V_1, V_2, V_3, V_4\}$ are the nodes in $S_1$, set $O = O - \{V_1, V_2, V_3, V_4\}$, i.e. remove the nodes in $S_1$ from the original network $O$.

   e. Remove from $L$ all $S_i$ that share nodes with $S_1$. Renumber the subnetworks in $L$ from 1 to $K$, where $K$ is the actual length of the new subnetwork list.

   f. Go to step (a).
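The selection and expansion steps above can be sketched as a greedy loop. This is an illustrative Python sketch rather than the thesis implementation: the scoring function and the acceptance predicate are passed in as hypothetical callables, subnetworks are represented by their node sets, and the toy network, score and threshold are invented.

```python
def expand_subnetwork(seed, adj, score_fn, accept_fn):
    """Step 2b: repeatedly add the best-scoring admissible neighbor."""
    current = frozenset(seed)
    while True:
        neighbors = set().union(*(adj.get(n, set()) for n in current)) - current
        best_score, best_node = 0.0, None
        for node in sorted(neighbors):
            candidate = current | {node}
            score = score_fn(candidate)
            # Combined score Y = I * R: zero whenever a condition fails.
            combined = score if accept_fn(current, candidate, score) else 0.0
            if combined > best_score:
                best_score, best_node = combined, node
        if best_node is None:  # maximum combined score is 0: fully expanded
            return current
        current = current | {best_node}


def select_features(ranked_seeds, adj, score_fn, accept_fn, min_expansion):
    """Steps 2a-2f; ranked_seeds is the list L sorted by decreasing score."""
    adj = {n: set(nb) for n, nb in adj.items()}  # private copy of network O
    queue = [frozenset(s) for s in ranked_seeds]
    features = []
    while queue:                                         # step a
        seed = queue.pop(0)
        expanded = expand_subnetwork(seed, adj, score_fn, accept_fn)
        if len(expanded) - len(seed) < min_expansion:    # step c
            continue
        features.append(expanded)                        # step d
        for n in seed:                                   # remove seed nodes from O
            for nb in adj.pop(n, set()):
                adj.get(nb, set()).discard(n)
        queue = [s for s in queue if not (s & seed)]     # step e
    return features


# Toy run on a path 1-2-...-8 with an invented score and threshold.
path_adj = {i: {j for j in (i - 1, i + 1) if 1 <= j <= 8} for i in range(1, 9)}
seeds = [{1, 2, 3, 4}, {4, 5, 6, 7}, {5, 6, 7, 8}]
toy_score = lambda nodes: 1.0 / len(nodes)
toy_accept = lambda current, candidate, score: score > 0.15
features = select_features(seeds, path_adj, toy_score, toy_accept, min_expansion=1)
```

In the toy run the first seed grows to six nodes, the second seed is dropped because it shares a node with the first, and the third cannot expand once the first seed's nodes have been removed from the network.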

During the expansion, the subnetwork score is almost always decreasing. Therefore, three conditions are used to control the expansion process:

$$C_1: \; I_j^m > I_T$$

$$C_2: \; I_1 - I_j^m < I_0$$

$$C_3: \; N_{e,j}^m - N_{e,j} \geq a N_{e,j}$$

$C_1$ ensures that every expanded subnetwork has a subnetwork score larger than $I_T$. $C_2$ controls the extent of expansion by setting a threshold, $I_0$, for the tolerated decrease in subnetwork score. $C_2$ supplements $C_1$ because it prevents subnetworks with high initial scores from always reaching larger sizes. $C_3$ controls the subnetwork topology. $N_e$ denotes the total number of edges in a subnetwork ($N_{e,j}$ for $B_j$ and $N_{e,j}^m$ for $B_j^m$), and $a$ is a constant coefficient. For each expansion step, $C_3$ forces the algorithm to find a neighboring node that has sufficient connections to the current subnetwork. It also prevents the subnetwork from expanding indefinitely. All three conditions have to be satisfied for an expansion step to occur. For 4-node subnetworks in this project, we set $I_T=0.87$, $I_0=0.05$ and $a=0.25$.
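As a rough illustration, the three conditions can be combined into a single predicate. The exact comparison operators are an assumption here: the sketch reads C1 as "candidate score above the threshold", C2 as "drop from the seed score below the tolerance", and C3 as "newly added edges at least a fraction a of the current edge count". The function and argument names are hypothetical; the default values are the ones reported for this project.

```python
def meets_conditions(seed_score, candidate_score, n_edges_before, n_edges_after,
                     i_t=0.87, i_0=0.05, a=0.25):
    """Return True only if a candidate expansion satisfies C1, C2 and C3."""
    # C1: the expanded subnetwork must still score above the threshold I_T.
    c1 = candidate_score > i_t
    # C2: the score may not drop by I_0 or more relative to the seed score.
    c2 = (seed_score - candidate_score) < i_0
    # C3: the added node must bring enough new edges (assumed ">=").
    c3 = (n_edges_after - n_edges_before) >= a * n_edges_before
    return c1 and c2 and c3


# Invented examples: a 3-edge subnetwork gaining one edge passes; an 8-edge
# subnetwork gaining one edge fails C3 because 1 < 0.25 * 8.
ok = meets_conditions(0.95, 0.91, n_edges_before=3, n_edges_after=4)
too_sparse = meets_conditions(0.95, 0.93, n_edges_before=8, n_edges_after=9)
```

Under this reading, C3 becomes stricter as the subnetwork accumulates edges, which is one way the expansion is kept from running indefinitely.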

Subnetwork vectors are generated from the subnetwork feature set F. First principal components are calculated with the prcomp function from the stats R package, with scale and center set to FALSE.
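For readers without R at hand, the first principal component of an uncentered, unscaled matrix can be approximated by power iteration. The sketch below is a stand-in for prcomp, not a reproduction of it; it assumes rows are samples and columns are genes, and the function name and toy matrix are hypothetical. As with any PCA, the sign of the result is arbitrary.

```python
from math import sqrt


def first_principal_component(matrix, iterations=200):
    """Power iteration on X^T X without centering or scaling.

    Returns (loadings, scores), where scores holds one value per row.
    Assumes the start vector is not orthogonal to the leading direction.
    """
    n_cols = len(matrix[0])
    v = [1.0 / sqrt(n_cols)] * n_cols
    for _ in range(iterations):
        scores = [sum(x * w for x, w in zip(row, v)) for row in matrix]   # X v
        v_new = [sum(row[j] * s for row, s in zip(matrix, scores))        # X^T (X v)
                 for j in range(n_cols)]
        norm = sqrt(sum(w * w for w in v_new))
        v = [w / norm for w in v_new]
    scores = [sum(x * w for x, w in zip(row, v)) for row in matrix]
    return v, scores


# Toy rank-1 matrix (invented): every row is a multiple of (3, 4).
toy_matrix = [[3.0, 4.0], [6.0, 8.0], [-3.0, -4.0]]
loadings, sample_scores = first_principal_component(toy_matrix)
```

For the toy matrix the loadings converge to (0.6, 0.8) and the per-sample scores to 5, 10 and -5.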

Previous studies have shown that LGG patients can be clustered into various groups based on gene expression, mutation and methylation data. By vectorization of feature subnetworks, patient samples can also be clustered by subnetwork states. Since samples were generated at different hospitals and processed in different tissue centers, we use consensus clustering to eliminate the influence of inner-method inconsistency and improve the confidence of the clustering result. Consensus clustering was done with the ConsensusClusterPlus package [87]. Hierarchical clustering (Ward's method) with Euclidean distance was used as the clustering algorithm. The parameter pItem, which controls the proportion of items sampled, was set to 0.75. The number of repetitions was set to 10,000 to obtain a stable result.
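The resampling idea behind consensus clustering can be sketched as follows. This is not the ConsensusClusterPlus implementation: the naive Ward agglomeration is a simplified stand-in, and the function names, the toy points and the small number of repetitions are assumptions made for illustration.

```python
import random
from itertools import combinations


def ward_clusters(points, k):
    """Naive Ward agglomeration: repeatedly merge the pair of clusters with the
    smallest increase in within-cluster sum of squares until k clusters remain."""
    clusters = [[i] for i in range(len(points))]

    def centroid(members):
        dim = len(points[0])
        return [sum(points[i][d] for i in members) / len(members) for d in range(dim)]

    def cost(a, b):
        ca, cb = centroid(a), centroid(b)
        dist2 = sum((x - y) ** 2 for x, y in zip(ca, cb))
        return len(a) * len(b) / (len(a) + len(b)) * dist2

    while len(clusters) > k:
        a, b = min(combinations(range(len(clusters)), 2),
                   key=lambda ab: cost(clusters[ab[0]], clusters[ab[1]]))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # b > a, so index a is unaffected
    return clusters


def consensus_matrix(points, k, reps=50, p_item=0.75, seed=0):
    """Fraction of subsamples in which two items co-cluster, among the
    subsamples that contain both items."""
    rng = random.Random(seed)
    n = len(points)
    together = [[0] * n for _ in range(n)]
    sampled = [[0] * n for _ in range(n)]
    size = max(k, round(p_item * n))
    for _ in range(reps):
        idx = rng.sample(range(n), size)
        for members in ward_clusters([points[i] for i in idx], k):
            for a in members:
                for b in members:
                    together[idx[a]][idx[b]] += 1
        for i in idx:
            for j in idx:
                sampled[i][j] += 1
    return [[together[i][j] / sampled[i][j] if sampled[i][j] else 0.0
             for j in range(n)] for i in range(n)]


# Two well-separated toy groups (invented coordinates).
toy_points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
consensus = consensus_matrix(toy_points, k=2)
```

Entries close to 1 or 0 indicate unambiguous assignments; intermediate values are the "background noise" that a stable clustering should minimize.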


Parameter Tuning

The whole pipeline and input data sources are summarized in Figure 5.

Some of the parameter values mentioned above were determined by parameter tuning. These include the edge selection proportion ($P_E$), the lower threshold of the subnetwork score ($I_T$), and the number of clusters for patient clustering ($N_C$).

First, $I_T$ and $N_C$ were tuned while $P_E$ was fixed at 5%. Two indicators were used to optimize $I_T$ and $N_C$: the clustering stability (CS) and the distance from background (DB). CS is the mean of the cluster-consensus values calculated by the ConsensusClusterPlus package [87]. DB is the distance from the background clustering, i.e. the clustering result generated by setting $I_T$ to 0. Specifically, the distance is defined as:

$$\mathrm{DB} = 1 - \mathrm{FM\_index}(C_{I_T}, C_0)$$

where $C_{I_T}$ is the set of clustering labels obtained with threshold $I_T$ and $C_0$ is the set of clustering labels obtained when $I_T=0$. The Fowlkes-Mallows index (FM_index) measures the similarity between two clustering results [88]. By gradually increasing $I_T$, the relationship between $I_T$ and the two indicators, CS and DB, was explored for each number of clusters (k) (Figure 5B and 5C). Notably, DB increases with $I_T$, which indicates that the feature selection step is necessary to generate clustering results that differ from the background. Interestingly, CS reaches its maximum values when $N_C$ is 5. We then further explored the relationship between CS and DB (Figure 5D). Considering both indicators, three $I_T$ values at $N_C=5$ stood out. Among 0.83, 0.85 and 0.87, we chose 0.87 as the final $I_T$ value because, when DB and CS are both similar, CS is a more important indicator than DB.
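The distance from background can be computed directly from two label vectors by pair counting. The sketch below uses the standard pair-counting form of the Fowlkes-Mallows index; the function names and the toy labels are hypothetical.

```python
from itertools import combinations
from math import sqrt


def fowlkes_mallows(labels_a, labels_b):
    """Fowlkes-Mallows index: TP / sqrt((TP + FP) * (TP + FN)) over sample pairs."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        if same_a and same_b:
            tp += 1      # pair grouped together in both clusterings
        elif same_a:
            fp += 1      # together only in the first clustering
        elif same_b:
            fn += 1      # together only in the second clustering
    if tp == 0:
        return 0.0
    return tp / sqrt((tp + fp) * (tp + fn))


def distance_from_background(labels_threshold, labels_background):
    """DB = 1 - FM index between a clustering and the background clustering."""
    return 1.0 - fowlkes_mallows(labels_threshold, labels_background)


# Invented labels: the same partition with renamed clusters, then a different one.
db_same = distance_from_background([0, 0, 1, 1], [1, 1, 0, 0])
db_diff = distance_from_background([0, 0, 1, 1], [0, 1, 0, 1])
```

Because the index counts pairs rather than comparing label names, renaming the clusters does not change the distance.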

Second, the proportion of edge selection ($P_E$) was evaluated. Because of limited computational power, 5% is close to the maximum percentage of edges we could keep. We then gradually decreased $P_E$ to inspect its influence on patient clustering. By fixing DB and CS as mentioned above, FM indices between the clustering results obtained with different $P_E$ values were calculated. In addition, we fixed $P_E$ at 5% but sampled its subnetwork features (using 80% of all the features each time) to evaluate the clustering error caused by random sampling (Figure 5E). Interestingly, the clustering difference caused by $P_E$ was even smaller than the clustering difference caused by 80% random sampling. Based on these results, $P_E$ did not have a significant impact on patient clustering. Therefore, in this project, $P_E$ was set to 5%, since including more edges produces more subnetwork features and therefore provides a better view of the underlying biological background.


Figure 5. Pipeline Summarization and Parameter Tuning

(A) Summary of the nSEA pipeline in this study. Data are represented by squares and processes by squircles. Basic properties of the data between steps are also annotated.

(B) Distance from background (DB) versus score threshold ($I_T$). Except for k=4 (number of clusters), DB is positively related to $I_T$, which implies that admitting more subnetworks to the feature set decreases the specificity of high-score subnetworks.

(C) Clustering stability (CS) versus score threshold ($I_T$). For most $I_T$ values, CS at k=5 was larger than CS at other numbers of clusters. This shows that k=5 is preferred for a stable clustering result.

(D) Distance from background (DB) versus clustering stability (CS). Three thresholds with both large DB and large CS are annotated. 0.87 was selected as the final threshold of the subnetwork score.

(E) Boxplot of FM indices. The between-group box shows the FM indices between clusterings generated with different edge selection proportions ($P_E$). The within-group box shows the FM indices between clusterings of different samplings when $P_E$=5%.

Future Validation

For future validation, we will use a 2-fold hold-out methodology. Specifically, we will divide the TCGA dataset into a training set and a validation set. The validation set contains 20% of all samples while the training set contains the remaining 80%. Samples in the training and validation sets will be drawn randomly from the TCGA data without replacement. With this approach, we can treat the validation set as an independent dataset. We will then do all the training and tuning on the training set. These include:

1) The entire nSEA pipeline including edge filtration, enumeration, feature selection and subnetwork expansion.

2) Result analysis including clustering of subnetworks and patients, examining the subnetwork functions and exploring the clinical characteristics of the patient groups.

3) Decision tree training for patient group prediction by subnetwork states, methylation and mutation data.
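The 80/20 split described above, drawn at random so that no sample appears in both sets, can be sketched as follows. The function name, the seed and the sample identifiers are hypothetical.

```python
import random


def holdout_split(sample_ids, validation_fraction=0.2, seed=0):
    """Shuffle once and cut, so training and validation sets never overlap."""
    rng = random.Random(seed)
    shuffled = list(sample_ids)
    rng.shuffle(shuffled)
    n_validation = round(validation_fraction * len(shuffled))
    return shuffled[n_validation:], shuffled[:n_validation]


# Invented sample identifiers.
ids = [f"sample_{i:03d}" for i in range(50)]
train, validation = holdout_split(ids)
```

All tuning listed above would then see only `train`, with `validation` left untouched until the final prediction step.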

If the above steps are done successfully, we will use the decision tree model from the training set to predict the subtypes in the validation set. For the predicted groups in the validation set, we will examine:

1) The subnetwork states and pathway regulation within each patient group and whether these states match the states of its corresponding group in TCGA data.

2) The clinical characteristics of each patient group and whether these group-specific characteristics match the characteristics from TCGA data.

3) Estimated error of the algorithm. This can be measured by comparing the result from the validation set and the result from the whole dataset.

The 2-fold hold-out method can eliminate the systematic error caused by different protocols and different platforms. Therefore, it can give us a good estimate of the robustness of nSEA on LGG data.


Results

Feature Subnetworks and Clustering

By applying the nSEA pipeline to LGG profiles, 92 feature subnetworks were selected for further analysis (Supplementary Figure). A total of 515 genes were covered by these subnetworks. Gene frequencies ranged from 1 to 3, with an average frequency of 1.33. This shows that nSEA captured a set of subnetworks with distinct gene components. General statistics of the subnetworks were also explored. Subnetwork size was 6.34±1.26, ranging from 5 to 13. The average degree of the subnetworks was 2.06±0.49, varying from 1.6 to 4. The moderate subnetwork sizes and average degrees indicate that nSEA, especially during the expansion step, is robust against extreme values and topologies.

Based on the clustering result, a heatmap was generated to visualize the patient and subnetwork groupings (Figure 6A). The 5 patient groups were annotated as LG1 to LG5. LG2 had the largest proportion of samples. LG3 was slightly smaller than LG2, while LG1 also had more than 100 samples. Together, LG1, LG2 and LG3 comprise 81% of all LGG samples. In comparison, LG5 was the smallest patient group with only 33 samples. LG4 was twice the size of LG5, making it the second smallest group in the clustering. The patient groups were validated with a self-organizing map (SOM), which showed that patients could be divided into 5 separate spaces (Figure 6B). Consensus clustering plots also confirmed the 5-group structure within LGG patients (Figure 6C and 6D). In addition, the tracking plot indicated that LG2, LG3 & LG4 and LG1 & LG5 formed two higher-level groups within which patient groups were more closely related to each other (Figure 6E). No bias from batch effects caused by different tissue source sites (TSS) was observed.

We were also interested in the relationships among feature subnetworks. Because the signs of subnetwork vectors, i.e. the first principal components of subnetwork matrices, are not meaningful from a biological perspective, we used absolute Pearson distance instead of Euclidean distance to cluster the subnetwork vectors. By consensus clustering, subnetworks were clustered into 8 groups, labeled SNG1-8. Then, for each group, the signs of the subnetwork vectors were unified so that activation, i.e. a positive value of that subnetwork vector group, represented the up-regulation of a certain gene group within these subnetworks. The size of the subnetwork groups was 11.5±4.8. SNG4 was the largest subnetwork group, containing 18 subnetworks, while SNG7 was the smallest, containing only 4 subnetworks. The subnetwork states within each subnetwork group were very similar, implying that they were co-regulated during LGG development.
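A sign-insensitive distance and the subsequent sign unification can be sketched as follows. The definition 1 - |r| is one common form of absolute Pearson distance and is assumed here, as is the use of a reference profile (for example, the expression of a selected gene) to orient the vectors; all names and values are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length vectors (non-zero variance assumed)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def abs_pearson_distance(x, y):
    """Distance that ignores sign: a vector and its negation are at distance 0."""
    return 1.0 - abs(pearson(x, y))


def unify_signs(vectors, reference):
    """Flip each vector so that it correlates non-negatively with the reference."""
    return [v if pearson(v, reference) >= 0 else [-a for a in v] for v in vectors]


# Invented subnetwork vectors: the second is the first with its sign reversed.
v1 = [1.0, 2.0, 3.0]
v2 = [-2.0, -4.0, -6.0]
d = abs_pearson_distance(v1, v2)
aligned = unify_signs([v1, v2], reference=[1.0, 2.0, 3.0])
```

With Euclidean distance the two toy vectors would look far apart, although they describe the same pattern up to the arbitrary sign of a principal component.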


Figure 6. Heatmap and Clustering of LGG Samples and Subnetworks

(A) Heatmap of subnetwork versus LGG sample. LGG samples were clustered into 5 groups (LG1~5) by consensus clustering using Euclidean distance. Subnetworks were clustered into 8 clusters by consensus clustering using absolute Pearson correlation distance. Sign of each subnetwork vector was adjusted to positively correlate with selected oncogenes or driver genes shown in the legend.

(B) Self-organizing map with 100 units. LGG samples were mapped to the units, with different shapes representing different patient groups. Units were also annotated with LG1~5 by majority voting.

(C) Consensus matrix showing that k=5 was appropriate. The column order is the same as in (A). Rows and columns both represent samples. The consensus value, scaled from white to dark blue, indicates the consensus level between two samples, i.e. the probability that they belong to the same cluster. Less background noise means fewer ambiguous cluster assignments and better clustering performance.

(D) Relative change of the area under the CDF curve (AUC) versus k (number of clusters). The area under the CDF curve (AUC) is a measure of clustering performance. Although AUC increases with k, one should choose the minimum k at which AUC is close to the maximum AUC in order to avoid overfitting. In this case, there was no decrease of ΔAUC from k=4 to k=5, indicating that k=5 led to a stable clustering structure.

(E) Tracking plot with k (number of clusters) versus LGG samples. The plot indicated the division order of patient groups during consensus clustering.

Subnetwork Groups

By examining the subnetworks within each subnetwork cluster, we found that SNG1 and SNG4 were highly enriched for genes in the MAPK pathway. In particular, SNG1 captured RAF1, which encodes the MAP3K c-Raf. It plays an important role in the MAPK/ERK pathway by initiating the entire kinase cascade [89]. The proto-oncogene ABL1, which encodes a protein tyrosine kinase containing an SH2 domain, was also co-expressed with RAF1. In chronic myelogenous leukemia (CML), the chimeric BCR-ABL oncoprotein has been shown to be essential for maintaining the leukemia phenotype [90]. Moreover, the BCR-ABL oncoprotein frequently disrupts the MAPK pathway, resulting in increased cell proliferation. Other genes that were co-expressed with RAF1 and interact with MAPK included transcription factors (NFATC3, SP1) [91, 92], G proteins (GNG5, GNAI3) [93, 94] and phospholipases (PLCB2, PLCG2) [95, 96]. Interestingly, several genes within the MAPK pathway were negatively correlated with RAF1. These included protein kinase C genes (PRKCE, PRKCZ), PAK1 and MAPK9. Based on previous studies, PAK1 and MAPK9 participate mainly in the MAPK/JNK pathway [97, 98]. This implies that although MAPK was activated in many LGG samples, the samples may have different activation patterns. For SNG4, instead of RAF1, we captured a whole path of MAPK/ERK genes (SOS1, KRAS, BRAF, MAP2K2 and MAPK1) around BRAF, another Raf kinase, which affects cell division, differentiation and secretion. Interestingly, the expression of MAP2K2 was negatively correlated with BRAF, which implies either that another MAP2K protein was involved or that the presence of MAPK1 down-regulated MAP2K2 by feedback [99]. We also observed that PIK3CA, the catalytic subunit of phosphatidylinositol 3-kinase (PIK3), was co-expressed with BRAF. This is in accord with other studies in which the mutation or activation of PIK3CA frequently coexisted with KRAS and BRAF in cancer [100, 101]. Activation of PIK3CA leads to the interaction between PIK3 and AKT, the latter of which acts on other proteins to promote cell growth. The BRAF-associated MAPK/ERK pathway also down-regulated BAD, which encodes a BCL-2 family protein that positively regulates apoptosis [102]. JUND, a transcription factor that promotes p53-dependent cell death, was down-regulated as well [103]. The relationship between BRAF and pro-apoptotic genes indicates that the BRAF-related MAPK pathway may contribute to cancer cell proliferation in LGG. Considering SNG1 and SNG4 together, we found that the expression levels of RAF1 and BRAF were generally negatively correlated across LGG samples (Figure 6A). This again suggests that more than one pattern of MAPK activation exists. Moreover, according to previous studies, RAF1 and BRAF may serve distinct roles in cancer progression [104].

SNG2 and SNG8 reflected the expression patterns of the immune microenvironment within LGG samples. The major pathway hit by SNG2 was the CD28-dependent Vav1 pathway in T cells. Although CD28 itself was not included in SNG2, we found VAV1 (Vav guanine nucleotide exchange factor 1) and its downstream effectors, including RAC2, CDC42 and PAK3. Phosphorylated Vav1 can activate CDC42 and RAC2 [105]. These two proteins then bind to the N-terminal regulatory domain of p21-activated kinase 3 (PAK3), which in turn activates JNK, p38 and cytoskeleton remodeling [106]. It has been reported that CD28 can also activate PI3K, which recruits SNX9 and then interacts with WASP, which regulates the actin cytoskeleton [105]. We confirmed this by observing the co-regulation of PIK3CG, PIK3R5 and WAS in SNG2 subnetworks. We also found that two oncogenes, CSF1R and SPI1, were co-expressed with VAV1. CSF1R encodes colony stimulating factor 1 receptor, which is essential for the survival of tumor-associated macrophages and microglia (TAMs). Inhibition of this protein has been shown to regress glioblastoma in mice [92]. The oppositely expressed genes centered on PAX5, an essential transcription factor for the commitment of lymphoid progenitors to the B lymphocyte lineage [107, 108]. The genes co-expressed with PAX5 include two histone acetyltransferases (HATs), CREBBP and EP300. These two genes encode transcription coactivators with multiple functions in development and haematopoiesis [109]. They have been found to be mutated in various cancers, especially leukemia, and both serve as Pax5-mediated transcription enhancers [110, 111]. In SNG8, we found genes related to two immune pathways, the interferon (IFN) pathway and the TNF pathway. For the IFN-dependent pathway, SNG8 captured essential downstream effectors including OAS proteins (OAS1, OAS2, OAS3 and OASL) and Mx GTPase A (MX1) [112]. MX1 expression has been shown to be strictly dependent on IFN expression. Domains on Mx1 can recognize target viral structures and prevent viral replication. OAS proteins play antiviral roles as 2'-5'-oligoadenylate synthetases that activate RNase L. Their combination leads to the decay of viral RNA. For the TNF pathway, SNG8 directly captured a TNF receptor gene (TNFRSF1A) and genes for other downstream proteins (CASP8, RIPK1, MADD). TNF is a known pathway for apoptosis initiation. Interestingly, the expression of two genes, SMPD3 and MADD, was negatively related to the activation of IFN and TNF respectively. SMPD3 has a debatable role in the regulation of the IFN pathway based on previous literature [113]. Although MADD was traditionally considered a transducer of apoptotic signals that interacts with the death domain of the TNF receptor, it has also been reported that this protein is not co-regulated, or is even negatively regulated, by the TNF pathway in Alzheimer's disease and glioblastoma [114]. The expression patterns of SNG2 and SNG8 showed a complex picture of the regulation of the LGG immune microenvironment.

Chromatin remodeling is the dynamic rearrangement of chromatin architecture so that condensed genomic DNA can be accessed by DNA-binding proteins and transcription factors. In tumors, chromatin and histone regulatory genes are frequently altered to affect transcriptional regulation and let cancer cells acquire their hallmark characteristics [115]. In SNG3, many genes have been shown to control chromatin remodeling. We found genes related to nuclear receptors, including NCOR1, NCOA2 and NCOA6, and 8 differentially expressed mediator complex subunits. They recruit chromatin remodeling complexes, which clear nucleosomes by binding to the promoter of the gene [116, 117]. Moreover, chromatin remodeling has been reported to affect telomere length. According to previous research, Rsf-1 shortens telomere length by interacting with hRap1, and SMC1beta protects telomeres in meiocytes [118, 119]. However, the expression levels of RSF1 and SMC3 in SNG3 were positively correlated, which needs further elucidation. We also captured the nuclear pore protein TPR and its binding partner NUP153 [120]. TPR is an oncogene, and it has been suggested that TPR and NUPs can detach from the nuclear pore complex (NPC) and bind to epigenetic markers or chromatin remodelers to regulate intranuclear transcription activities [121].

In SNG5, we noted that SOX4, NOTCH1 and TCF3 were co-expressed with ribosomal proteins. Previous research indicated that expression of SOX4 can lead to activation of the Notch pathway [122]. Interestingly, TCF3 was reported to repress SOX4, whereas in LGG we found them to be co-regulated [123]. In brain development, Notch1 helps maintain the status of neural stem cells and controls gliogenesis in a complex manner [124]. Sox4 is also considered a crucial transcription activator for neuronal differentiation [125]. In SNG5, we found that expression of SOX4 and NOTCH1 repressed 8 genes related to mental illness, including spinocerebellar ataxia, epilepsy and retardation. For example, GABA receptors (GABRA2 and GABRA6), which are targets of various drugs, have been shown to be associated with anxiety and schizophrenia [126]. MEF2A and MEF2C in SNG5 play important roles in neuron survival and differentiation. Deletions of MEF2C have been reported to lead to severe mental retardation, seizures and hypotonia [127].

Wnt signaling pathways are highly conserved pathways that pass signals into a cell through cell surface receptors. They can be generally divided into canonical and noncanonical pathways, with the former dependent on β-catenin and the latter independent of it [128]. In SNG6, we observed protein components of Wnt pathways including Dishevelled proteins (DVL2, DVL3), a transcription factor (NFATC1) and various regulators (GSK3B, TNKS, TNKS2, BCAS2, PRMT1, HDAC1) [129-131]. Overexpression of Dishevelled proteins has been reported to activate the Wnt/β-catenin pathway in non-small cell lung cancer [132]. Tankyrases in SNG6 were co-expressed with Dishevelled proteins and have been reported to up-regulate the Wnt/β-catenin pathway by degrading Axin in the destruction complex [133]. In addition, we found that the expression of TP53BP1, a p53-binding protein, was up-regulated by the Wnt/β-catenin pathway in SNG6. This contradicts a previous study which claimed that TP53BP1 was elevated after the inhibition of this pathway [134]. Negatively correlated with Wnt/β-catenin pathway activation was the expression of NFATC1 and AP2A1, which are linked to noncanonical pathways [135, 136]. Moreover, we observed the expression of HDAC1, which was reported to inhibit the Wnt/β-catenin pathway in oligodendrocytes [128]. The expression pattern of SNG6 suggests that Wnt pathways in LGG are regulated in a complex manner and that different subtypes may contain different activation paths.

In SNG7, we found many genes related to sister chromatid cohesion. Defects in chromatid cohesion may lead to chromosome instability during mitosis and contribute to tumorigenesis [137]. In particular, we found BUBR1, one of the SAC proteins, which accumulates at kinetochores to prevent the activation of anaphase, in SNG7. Moreover, it has been reported that expression of BUBR1 is necessary for tumor cells in glioblastoma but not for normal mitotic cells [138]. It was therefore suggested that BUBR1 helps suppress lethal kinetochore–microtubule (KT–MT) attachment defects in GBM. Other proteins interacting with BUBR1 include CDC20 and AURKA. Both are considered to inhibit APC/C after binding to BubR1, which prevents premature anaphase onset [139, 140]. Besides BUBR1, we found other proteins regulating KT–MT stability in SNG7, including CDK1 and CLASP2. Cdk1 is an essential regulator of Sororin localization [141]. In addition, both Cdk1 and Plk1 can phosphorylate CLASP2, which then stabilizes the KT–MT structure [142].

Though different subnetwork clusters were assigned to different pathways (Table 1), it should be mentioned that these subnetwork clusters were not isolated modules. Instead, they cross-talked with each other dynamically. For example, the MAPK pathways represented by SNG1 and SNG4 can interact with all the pathways in the other subnetwork clusters under certain circumstances. In addition to subnetwork functions, we were also interested in the relationship between subnetwork activation and phenotypes, including Karnofsky performance score and telomere length (Supplementary Table 1). We found that SNG5 and SNG8 were significantly correlated with Karnofsky performance score (p-value<8.5e-06 and p-value<5.0e-03 respectively). Telomere length was significantly related to SNG3, SNG6 and SNG8 (p-value<0.021). Previous research has also confirmed the biological interactions between telomerase and the Wnt signaling pathway [143].

| Subnetwork Group ID | Pathway | Gene Cluster 1 Representative Genes | Gene Cluster 2 Representative Genes | Oncogenes (Gene Clusters 1 and 2) |
|---|---|---|---|---|
| SNG1 | MAPK pathway | RAF1, ABL1, NFATC3, SP1, GNG5, GNAI3, PLCB2, PLCG2 | PRKCE, PRKCZ, PAK1, MAPK9 | HCK, ABL1, RAF1 |
| SNG2 | CD28-dependent Vav1 pathway | VAV1, RAC2, CDC42, PIK3CG, PIK3R5, WAS, CSF1R, SPI1 | PAX5, CREBBP, EP300 | VAV1, SPI1, CSF1R, PAX5 |
| SNG3 | Chromatin remodeling pathway | NCOR1, NCOA2, NCOA6, MED*, RSF1, SMC3, NUP153, TPR, ABL2 | MED*, TAF10, CEBPB, CEBPD, DMAP1, TAF1, RUVBL2 | TPR, PCM1 |
| SNG4 | MAPK pathway | BRAF, SOS1, KRAS, MAPK1, PIK3CA | MAP2K2, BAD, JUND | BRAF, PIK3CA, KRAS, NCOA1, HRAS |
| SNG5 | Notch pathway | SOX4, NOTCH1, TCF3 | GABRA2, GABRA6, MEF2A, MEF2C, TTBK2, ITPR1, NSF, RSRC | TCF3 |
| SNG6 | Wnt pathway | DVL2, DVL3, GSK3B, TNKS, TNKS2, HDAC1, TP53BP1 | NFATC1, BCAS2, PRMT1, AP2A1 | |
| SNG7 | Sister chromatid cohesion pathway | CDC20, AURKA, CDK1, PLK1 | CLASP2, CDS1 | AURKA |
| SNG8 | IFN pathway and TNF pathway | OAS*, MX1, TNFRSF1A, CASP8, RIPK1 | MADD, SMPD3 | TPM3 |

Table 1. Subnetwork Clusters and Corresponding Pathways

For each subnetwork group, genes in cluster 1 are negatively correlated with genes in cluster 2. For brevity, OAS proteins are designated as OAS* and mediator complex subunits as MED*.

Patient Groups

Previous studies by TCGA have identified different subtypes based on methylation, transcription and IDH mutation status [81]. We compared our patient groups with all subtypes and clusters identified by TCGA. Distances between our patient groups and the other subtypes or clusters were calculated (Table 2; Supplementary Table 2). The between-group distance is defined as the average Euclidean distance over all sample pairs drawn from the two groups. Three patient groups were linked to general subtypes of LGG. The IDH-mutation and codeletion subtype, together with RNA clusters LGr1 and LGr2 and methylation cluster LGm3, was closely related to LG1. In contrast, LG2 was closely related to the IDH-mutation and non-codeletion subtype, together with RNA cluster LGr3 and methylation clusters LGm1 and LGm2. LG4 was linked to the IDH wild-type (IDHwt) subtype, together with RNA cluster LGr4 and methylation clusters LGm4 and LGm5. Although the division between LGr1 and LGr2 was previously regarded as unimportant, we found that LG5 was solely related to LGr2, suggesting that LGr1 and LGr2 may have different pathway regulation patterns. LG3 could not be mapped to any current subtype or cluster, suggesting a new patient group that has not been identified by previous TCGA studies.
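The between-group distance defined above (the average Euclidean distance over all cross-group sample pairs) is simple to compute. The sketch below assumes each sample is a numeric vector in some common feature space, which the text does not specify; the function name and the toy coordinates are hypothetical.

```python
from math import dist  # Euclidean distance, available from Python 3.8


def between_group_distance(group_a, group_b):
    """Average Euclidean distance over every (sample in A, sample in B) pair."""
    total = sum(dist(a, b) for a in group_a for b in group_b)
    return total / (len(group_a) * len(group_b))


# Invented two-dimensional samples.
toy_a = [(0.0, 0.0), (0.0, 0.0)]
toy_b = [(3.0, 4.0), (6.0, 8.0)]
toy_distance = between_group_distance(toy_a, toy_b)
```

Because every cross-group pair contributes equally, the measure reflects overall overlap between two groups rather than the distance between their centroids.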


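| | Cluster Name | LG1 | LG2 | LG3 | LG4 | LG5 |
|---|---|---|---|---|---|---|
| General Subtype | IDHmut-codel | 0.43* | 0.55 | 0.59 | 0.74 | 0.63 |
| General Subtype | IDHmut-non-codel | 0.58 | 0.44* | 0.52 | 0.56 | 0.69 |
| General Subtype | IDHwt | 0.73 | 0.54 | 0.63 | 0.49* | 0.73 |
| Pan-Glioma RNA Expression Cluster | LGr1 | 0.46* | 0.54 | 0.58 | 0.72 | 0.71 |
| Pan-Glioma RNA Expression Cluster | LGr2 | 0.48* | 0.60 | 0.66 | 0.77 | 0.42* |
| Pan-Glioma RNA Expression Cluster | LGr3 | 0.58 | 0.42* | 0.50 | 0.54 | 0.72 |
| Pan-Glioma RNA Expression Cluster | LGr4 | 0.73 | 0.52 | 0.62 | 0.43* | 0.78 |
| Pan-Glioma DNA Methylation Cluster | LGm1 | 0.63 | 0.48* | 0.58 | 0.55 | 0.70 |
| Pan-Glioma DNA Methylation Cluster | LGm2 | 0.54 | 0.44* | 0.51 | 0.58 | 0.66 |
| Pan-Glioma DNA Methylation Cluster | LGm3 | 0.44* | 0.57 | 0.61 | 0.76 | 0.66 |
| Pan-Glioma DNA Methylation Cluster | LGm4 | 0.77 | 0.57 | 0.63 | 0.45* | 0.80 |
| Pan-Glioma DNA Methylation Cluster | LGm5 | 0.74 | 0.53 | 0.62 | 0.44* | 0.76 |
| Pan-Glioma DNA Methylation Cluster | LGm6 | 0.68 | 0.58 | 0.65 | 0.61 | 0.61 |

Table 2. Comparison of Current Patient Groups and Groups from TCGA

Rows correspond to patient groups identified by TCGA. Columns correspond to patient groups clustered by subnetwork vectors. Distances below 0.5 are highlighted (marked here with an asterisk).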

We further examined the clinical characteristics of each patient group. In contrast with the other groups, the majority of patients in LG4 had grade-3 tumors at diagnosis (Figure 7C). Moreover, LG4 had the highest mean age across all groups (Figure 7D). In addition, we found that LG2 contained relatively younger patients compared to LG1, LG3 and LG5. For Karnofsky performance score (Figure 7B), LG4 had the worst performance, as expected. Interestingly, the newly identified group LG3 had the largest proportion of patients with scores above 90. Telomere length was significantly shorter in LG4 than in LG2, which is consistent with previous research (Figure 7E) [81]. Survival analysis showed that LG4 had substantially shorter survival than the other groups (Figure 7A). LG2 and LG5 had intermediate survival compared to LG4 and the other groups. In addition to LG1, LG2 and LG4, which behaved similarly to the IDHmut-codel, IDHmut-non-codel and IDHwt subtypes respectively (Figure 7F), the newly identified LG3 had the best prognosis. This suggests that the patterns of subnetwork states in LG3 may contribute to better survival of LGG patients.

Figure 7. Clinical Characteristics of LGG Patient Groups

(A) Kaplan–Meier curves of different patient groups (p-value of survival difference less than 1e-15).

(B) Percentages of Karnofsky performance scores in each patient group.

(C) Percentages of grade in 5 patient groups.

(D) Boxplot of patient ages divided by patient groups.

(E) Boxplot of log2 transformed telomere lengths separated by patient groups.

(F) Kaplan–Meier curves of IDHmut-codel, IDHmut-non-codel and IDHwt group based on previous TCGA studies. Reproduced with permission from [68], Copyright Massachusetts Medical Society.

Exploring the relationship between subnetworks and patient groups was essential to reveal the underlying mechanisms. To characterize patient groups by subnetworks, we applied decision tree model (C5.0) on the subnetwork state matrix [144]. With 10-fold cross-validation, group labels were predicted from the subnetwork vectors and AUCs (areas under the ROC curves) were calculated.

By tuning the maximum tree depth and the minimum number of samples in each leaf node, we found that all patient groups except LG5 needed at least 2 subnetworks as predictors to avoid a significant loss of AUC (Figure 8A). Ranked by impurity improvement, the top 10 subnetworks of each primary and secondary split were extracted (Supplementary Table 3). We found that every primary or secondary split was enriched for specific groups of subnetworks. In particular, the newly identified group, LG3, was strongly determined by the states of SNG4 in the primary split and of SNG5 and SNG8 in the secondary split. Although subnetworks in SNG5 were not enriched in primary splits, they dominated the secondary splits of LG3 and LG4, the two groups with the highest and the lowest Karnofsky performance scores, respectively. To further elucidate the subnetworks separating the patient groups, we trained a decision tree model with all group labels as the outcome (Figure 8B). SN49 from SNG5 separated LG3 from LG4, suggesting that subnetworks from SNG5 played an important role in the physical function and survival of patients. Moreover, based on pathway identification, SNG4 represented the B-Raf regulated MAPK pathway and SNG5 represented the Notch pathway. In LG3, the B-Raf regulated MAPK pathway was down-regulated whereas the Notch pathway was up-regulated. This indicated that these two pathways may contribute to the high survival rates of LGG patients in LG3.
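Ranking candidate predictors by impurity improvement can be sketched as follows. The sketch uses entropy-based information gain for a single threshold split, which is one common impurity measure. The exact C5.0 criterion and settings used in this study are not reproduced, and the threshold of 0 as well as the names `entropy`, `information_gain` and `rank_subnetworks` are assumptions made for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n)
                for c in Counter(labels).values())


def information_gain(states, labels, threshold=0.0):
    """Entropy reduction obtained by splitting samples on
    state <= threshold versus state > threshold."""
    left = [y for x, y in zip(states, labels) if x <= threshold]
    right = [y for x, y in zip(states, labels) if x > threshold]
    if not left or not right:
        return 0.0  # a split that separates nothing gains nothing
    n = len(labels)
    weighted = (len(left) / n) * entropy(left) \
        + (len(right) / n) * entropy(right)
    return entropy(labels) - weighted


def rank_subnetworks(state_matrix, labels, top=10):
    """Return the names of the `top` subnetworks with the largest gain.
    `state_matrix` maps a subnetwork name to its per-patient states."""
    gains = {name: information_gain(states, labels)
             for name, states in state_matrix.items()}
    return sorted(gains, key=gains.get, reverse=True)[:top]
```

A subnetwork whose state perfectly separates one group from the rest reaches the maximum gain (the entropy of the labels), while a subnetwork unrelated to the labels gains close to zero.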

To elucidate the driver genes behind each patient group, mutation and methylation data of LGG patients were analyzed. Nineteen mutated genes were identified as significantly associated with specific groups (chi-square test p-value below 0.05), including IDH1, TP53 and ATRX (Table 3; Figure 9). We also examined TERT promoter mutation and MGMT promoter methylation, which have previously been identified as biomarkers for LGG patient prognosis. Mutual exclusivity between TERT promoter mutation and ATRX mutation was confirmed by our study. The methylation level of the MGMT promoter was significantly lower in LG4, indicating that MGMT promoter methylation is a marker of better prognosis. We did not find a predominant mutation contributing to LG3. However, EGFR, NF1 and PTEN were rarely mutated in LG3, which may contribute to the better survival of this patient group.
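A test of association between one mutated gene and the five patient groups can be sketched with a Pearson chi-square test on a 2 x 5 table (mutated versus wild-type counts per group). The sketch below is an illustration under stated assumptions, not the analysis code of this study: it assumes every expected count is positive, it uses the closed-form chi-square tail probability that holds for even degrees of freedom (4 for five groups), and the counts in the example are hypothetical.

```python
import math


def chi_square_2xk(mutated, wild_type):
    """Pearson chi-square statistic and degrees of freedom for a
    2 x k table of mutated / wild-type counts per group."""
    col_totals = [m + w for m, w in zip(mutated, wild_type)]
    row_totals = [sum(mutated), sum(wild_type)]
    n = sum(row_totals)
    stat = 0.0
    for row, row_total in zip((mutated, wild_type), row_totals):
        for observed, col_total in zip(row, col_totals):
            expected = row_total * col_total / n
            stat += (observed - expected) ** 2 / expected
    return stat, len(mutated) - 1


def chi_square_sf_even_df(stat, df):
    """Upper-tail probability of a chi-square variable.
    Closed form valid only for even degrees of freedom."""
    if df % 2 != 0:
        raise ValueError("this closed form requires even df")
    half = stat / 2.0
    return math.exp(-half) * sum(half ** i / math.factorial(i)
                                 for i in range(df // 2))


# Hypothetical counts for five groups of 50 patients each.
stat, df = chi_square_2xk([30, 5, 5, 5, 5], [20, 45, 45, 45, 45])
p_value = chi_square_sf_even_df(stat, df)
```

For sparse tables, where some expected counts are small, an exact or simulation-based test is usually preferred over the chi-square approximation.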


Figure 8. Patient Group Characterization by Subnetworks

(A) The maximum depth and the minimum terminal node size were tuned by training decision tree models for each patient group. AUC is on the y-axis and the minimum terminal node size is on the x-axis. The corresponding model is shown below each tuning plot. Subnetworks are labeled with the subnetwork ID and the cluster to which it belongs. The optimal maximum depth was taken as the depth beyond which further increases gave no significant improvement in AUC. Tree models of LG1-4 have 2 layers (2 predictors and 3 terminal nodes), while the LG5 model has only one layer and one predictor.

(B) Overall tree model for all LGG patient groups.

(C) Supervised classification of LG3 by methylation of NIPBL and KALRN.


Gene      LG1    LG2    LG3    LG4    LG5    Total Mutation   p-value
IDH1      1.21   1.13   1.14   0.15   0.78   398              5.00E-04
TP53      0.18   1.77   1.22   0.40   0.50   249              5.00E-04
ATRX      0.12   1.81   1.18   0.31   0.88   195              5.00E-04
CIC       2.85   0.12   1.06   0.00   0.42   110              5.00E-04
FUBP1     3.03   0.14   0.93   0.00   0.32   48               5.00E-04
PIK3CA    1.72   0.38   0.77   1.87   0.69   45               3.00E-03
NOTCH1    2.75   0.47   0.80   0.00   0.36   43               5.00E-04
MUC16     0.32   1.05   1.56   1.09   0.37   42               3.50E-02
EGFR      0.26   0.39   0.20   5.25   1.33   35               5.00E-04
NF1       0.83   0.72   0.31   3.71   0.47   33               5.00E-04
PTEN      0.00   0.95   0.28   4.90   0.00   25               5.00E-04
RYR2      0.00   1.42   0.86   1.91   1.30   24               4.75E-02
IDH2      1.59   0.17   1.38   0.00   3.11   20               3.50E-03
SSPO      2.27   1.19   0.52   0.00   0.00   20               1.05E-02
NIPBL     2.39   0.54   0.91   0.40   0.00   19               2.25E-02
PCLO      0.27   0.80   1.22   2.70   0.00   17               3.25E-02
CSMD3     0.30   2.49   0.69   0.00   0.00   15               8.50E-03
PKHD1     0.35   2.35   0.26   1.18   0.00   13               2.30E-02
DOCK5     0.45   1.70   0.00   3.06   0.00   10               2.30E-02

Others                      LG1    LG2    LG3    LG4    LG5    Total   p-value
TERT Promoter Mutation      2.28   0.34   0.42   1.66   0.96   130     5.00E-04
MGMT Promoter Methylation   1.20   0.96   1.08   0.58   0.99   425     5.00E-04

Table 3. Mutations and MGMT Methylation Statistics

The ratios of mutation count to expected mutation count under the null hypothesis (uniformly distributed) were calculated. P-values were calculated by Fisher’s exact test.
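The ratio reported in Table 3 can be illustrated with a short Python sketch. It assumes that "uniformly distributed" means that the mutations of a gene are spread across patients independently of group, so that the expected count in a group is proportional to the group size. This interpretation, the function name `enrichment_ratios` and the example counts are assumptions for illustration and are not taken from the study data.

```python
def enrichment_ratios(observed, group_sizes):
    """Ratio of observed to expected mutation counts for each group.

    observed    : dict mapping group name -> number of mutated patients
    group_sizes : dict mapping group name -> number of patients
    The expected count of a group is its share of all patients
    multiplied by the total mutation count (assumed null model).
    """
    total_mutations = sum(observed.values())
    total_patients = sum(group_sizes.values())
    ratios = {}
    for group, size in group_sizes.items():
        expected = total_mutations * size / total_patients
        ratios[group] = observed.get(group, 0) / expected
    return ratios


# Hypothetical example: 40 mutations among four groups of 50 patients.
example = enrichment_ratios(
    {"G1": 10, "G2": 30, "G3": 0, "G4": 0},
    {"G1": 50, "G2": 50, "G3": 50, "G4": 50},
)
```

Under this reading, a ratio above 1 marks a group in which the mutation is over-represented and a ratio of 0 marks a group with no mutated patients.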


Figure 9. Patient Groups with Mutations and MGMT Methylation

(A) Comparison of patient groups with current subtypes and clusters.

(B) Relationship between patient groups and significant gene mutations.

(C) Methylation of MGMT promoter and mutation of TERT promoter ordered by patient groups.

We were particularly interested in gene methylation contributing to the newly identified patient group LG3. Supervised classification with the C5.0 algorithm was conducted on the methylation matrix. Besides genes, the matrix also included methylation levels of promoters and CpG islands as features. The primary and secondary splits were dominated by methylation features correlated with SNG4 and SNG5, respectively (Figure 8C; Supplementary Table 4). For the primary split, methylation of the gene NIPBL was identified as the most significant predictor. Interestingly, NIPBL is also a highly mutated gene among LGG patients (3%), especially in LG3 and LG1. Previous studies have shown that NIPBL mutation is strongly related to Cornelia de Lange syndrome [145]. It has been reported that the cohesin-loading factor Nipbl can bind transcriptional coactivators, forming a complex that helps load cohesin onto promoters to regulate cell-specific gene expression [146]. Among the secondary split predictors, KALRN stands out. It encodes the protein Kalirin, which interacts with Huntingtin-associated protein 1 (HAP1). In CNS development, one isoform of Kalirin, Kalirin-7, has been shown to control cortical spine morphogenesis [147]. Knockdown of this gene can reduce spine density and impair activity-dependent spine plasticity [148]. This gene has also been linked to various mental illnesses, including ADHD (attention deficit hyperactivity disorder) and schizophrenia [149, 150]. In conclusion, patient group LG3 can be characterized by methylation of NIPBL and KALRN, two genes related to promoter regulation and neuropathological disorders.


Discussion

Classifications of LGG have been proposed by many researchers for over a decade. Subtype identification based on IDH mutation and chromosome 1p/19q co-deletion is the most widely accepted. However, this has been challenged by recent studies showing that TERT may play an important role in the development of glioma. Although the classification of LGG is becoming more and more specific, the mechanisms underlying these biomarkers are still unclear. For example, patients with IDH mutation alone have the worst survival outcomes. However, patients with both TERT and IDH mutations survive significantly longer, forming the best survival group. This indicates that there are synergistic relationships among the driver genes of LGG.

In this respect, nSEA, the algorithm we developed to capture dysregulation within pathways, provided insight into the characterization of these LGG samples. In contrast to common bioinformatics approaches, which focus on mutation, methylation and copy-number variation, our approach applies a subnetwork-based methodology. By scanning nearly thirty million 4-node subnetworks, we provided a global picture of subnetwork states within LGG. With feature selection based on clustering statistics, we selected 92 subnetworks that divided LGG patients into 5 groups. Three of these groups could be mapped to the general subtypes, showing that our algorithm was able to capture biologically significant signals.

Moreover, we found one patient group, LG3, which not only had distinct subnetwork states but also had clinical significance. Further analysis showed that, compared to other groups, LG3 had the best survival and Karnofsky performance scores. A decision tree model trained against LG3 indicated that SNG4 and SNG5 could be used to separate LG3 from other patients with high accuracy. Considering the pathway functions of SNG4 and SNG5, LG3 could be characterized as patients with a down-regulated B-Raf mediated MAPK pathway and an up-regulated Notch pathway. Mutation analysis showed that the better clinical performance of LG3 may be attributed to the lack of mutations in EGFR, NF1 and PTEN. The tree model based on methylation data highlighted NIPBL and KALRN, the two genes responsible for the primary and secondary splits of the tree, respectively. Besides its biological function of transcription regulation through promoters, NIPBL has been linked to various types of cancer [151], suggesting that cohesin inactivation may play an important role in LGG malignancy. The protein encoded by KALRN, kalirin, belongs to the RhoGEF family. Many genes encoding proteins in this family have been identified as cancer driver genes [152]. The Dbl-homologous domain of this protein may become a potential drug target in LGG therapy [153].

nSEA is an unsupervised algorithm aimed at discovering dysregulation signals within pathways instead of isolated gene expression signals. In this study, we showed that nSEA captured different regulation patterns of subnetworks related to various candidate pathways. nSEA captured not only oncogenes but also their upstream regulators and downstream effectors. Therefore, these subnetwork features were more robust, easier to interpret and more biologically significant than single-gene features. In addition, nSEA is in accordance with the synergistic nature of cancer driver genes. The synergistic effect of driver genes has been studied extensively in the past [154-156]. Subnetworks generated by nSEA can show how driver genes synergistically come together to drive tumor progression. In our approach, by exclusively selecting subnetwork features, we were able to cluster LGG patients into groups with distinct clinical characteristics. This implies that, while numerous driver genes that distinguish patient groups have been identified, driver genes that were not included can be investigated with nSEA subnetworks to further evaluate their roles in gliomagenesis.


References

1. Barrett T, Wilhite SE, Ledoux P, Evangelista C, Kim IF, Tomashevsky M, et al. NCBI GEO: archive for functional genomics data sets—update. 2013. doi: 10.1093/nar/gks1193. 2. Maji P, Paul S, Ohio Library and Information N. Scalable pattern recognition algorithms: applications in computational biology and bioinformatics. New York;Cham;: Springer; 2014. 3. Chen HY, Yu SL, Chen CH, Chang GC, Chen CY, Yuan A, et al. A five-gene signature and clinical outcome in non-small-cell lung cancer. New England Journal of Medicine. 2007;356(1):11-20. doi: 10.1056/NEJMoa060096. PubMed PMID: WOS:000243209600004. 4. Parker JS, Mullins M, Cheang MC, Leung S, Voduc D, Vickery T, et al. Supervised Risk Predictor of Breast Cancer Based on Intrinsic Subtypes. J Clin Oncol. 272009. p. 1160-7. 5. Sanchez-Carbayo M, Socci ND, Lozano J, Saint F, Cordon-Cardo C. Defining molecular profiles of poor outcome in patients with invasive bladder cancer using oligonucleotide microarrays. Journal of Clinical Oncology. 2006;24(5):778-89. doi: 10.1200/jco.2005.03.2375. PubMed PMID: WOS:000235375400008. 6. Network TCGAR. Integrated genomic analyses of ovarian carcinoma. Nature. 2011;474:609-15. doi: doi:10.1038/nature10166. 7. Huang HC, Niu Y, Qin LX. Differential Expression Analysis for RNA-Seq: An Overview of Statistical Methods and Computational Software. Cancer Inform. 2015;14(Suppl 1):57-67. Epub 2015/12/22. doi: 10.4137/cin.s21631. PubMed PMID: 26688660; PubMed Central PMCID: PMCPMC4678998. 8. Rapaport F, Khanin R, Liang Y, Pirun M, Krek A, Zumbo P, et al. Comprehensive evaluation of differential gene expression analysis methods for RNA-seq data. Genome Biology. 2013;14(9). doi: info:pmid/24020486. 9. Libbrecht MW, Noble WS. Machine learning applications in genetics and genomics. Nature Reviews Genetics. 2015;16:321-32. doi: doi:10.1038/nrg3920. 10. Yang ZR. Machine learning approaches to bioinformatics. Hackensack, NJ;Singapore;: World Scientific; 2010. 11. 
Abu Jamous B, Fa R, Nandi AK, Ohio Library and Information N. Integrative cluster analysis in bioinformatics. Chichester, West Sussex, United Kingdom: John Wiley & Sons Inc; 2015. 12. Greene CS, Tan J, Ung M, Moore JH, Cheng C. Big data bioinformatics. J Cell Physiol. 2014;229(12):1896-900. Epub 2014/05/07. doi: 10.1002/jcp.24662. PubMed PMID: 24799088. 13. Bellman RE. Dynamic Programming: Dover Publications; 2003. 14. Hira ZM, Gillies DF. A Review of Feature Selection and Feature Extraction Methods Applied on Microarray Data. Adv Bioinformatics. 2015;2015:198363. Epub 2015/07/15. doi: 10.1155/2015/198363. PubMed PMID: 26170834; PubMed Central PMCID: PMCPMC4480804. 15. Tan CS, Ting WS, Mohamad MS, Chan WH, Deris S, Shah ZA. A review of feature extraction software for microarray gene expression data. Biomed Res Int. 2014;2014:213656. Epub 2014/09/25. doi: 10.1155/2014/213656. PubMed PMID: 25250315; PubMed Central PMCID: PMCPMC4164313.


16. Zhi X-b, Fan J-l, Zhao F. Fuzzy Linear Discriminant Analysis-guided maximum entropy fuzzy clustering algorithm. Pattern Recognition. 2013;46(6):1604-15. doi: 10.1016/j.patcog.2012.12.007. PubMed PMID: WOS:000315369900007. 17. Sharma A, Paliwal KK. Cancer classification by gradient LDA technique using microarray gene expression data. Data & Knowledge Engineering. 2008;66(2):338-47. doi: 10.1016/j.datak.2008.04.004. PubMed PMID: WOS:000258448400008. 18. Ma S, Dai Y. Principal component analysis based methods in bioinformatics studies. Briefings in Bioinformatics. 2011;12(6):714-22. doi: 10.1093/bib/bbq090. PubMed PMID: WOS:000297351800015. 19. Yeung KY, Ruzzo WL. Principal component analysis for clustering gene expression data. 2001. doi: 10.1093/bioinformatics/17.9.763. 20. Foley JW, Katagiri F. Unsupervised reduction of random noise in complex data by a row- specific, sorted principal component-guided method. Bmc Bioinformatics. 2008;9. doi: 10.1186/1471-2105-9-508. PubMed PMID: WOS:000262159700001. 21. Fuller GN, Hess KR, Rhee CH, Yung WKA, Sawaya RA, Bruner JM, et al. Molecular classification of human diffuse gliomas by multidimensional scaling analysis of gene expression profiles parallels morphology-based classification, correlates with survival, and reveals clinically- relevant novel glioma subsets. Brain Pathology. 2002;12(1):108-16. PubMed PMID: WOS:000172634200012. 22. Taguchi Y, Oono Y. Relational patterns of gene expression via non-metric multidimensional scaling analysis. Bioinformatics. 2005;21(6):730-40. doi: 10.1093/bioinformatics/bti067. PubMed PMID: WOS:000227562200005. 23. Liu Z, Chen D, Bensmail H. Gene Expression Data Classification With Kernel Principal Component Analysis. J Biomed Biotechnol. 2005;2005(2):155-9. doi: 10.1155/jbb.2005.155. PubMed PMID: 16046821; PubMed Central PMCID: PMC1184105. 24. Dawson K, Rodriguez RL, Malyj W. 
Sample phenotype clusters in high-density oligonucleotide microarray data sets are revealed using Isomap, a nonlinear algorithm. Bmc Bioinformatics. 2005;6. doi: 10.1186/1471-2105-6-195. PubMed PMID: WOS:000231175100001. 25. Orsenigo C, Vercellis C. An effective double-bounded tree-connected Isomap algorithm for microarray data classification. Pattern Recognition Letters. 2012;33(1):9-16. doi: 10.1016/j.patrec.2011.09.016. PubMed PMID: WOS:000298530400002. 26. Brameier M, Wiuf C. Co-clustering and visualization of gene expression data and gene ontology terms for Saccharomyces cerevisiae using self-organizing maps. Journal of Biomedical Informatics. 2007;40(2):160-73. doi: 10.1016/j.jbi.2006.05.001. PubMed PMID: WOS:000245699600009. 27. Garrigues GE, Cho DR, Rubash HE, Goldring SR, Herndon JH, Shanbhag AS. Gene expression clustering using self-organizing maps: analysis of the macrophage response to particulate biomaterials. Biomaterials. 2005;26(16):2933-45. doi: 10.1016/j.biomaterials.2004.06.034. PubMed PMID: WOS:000226675500015. 28. Alterovitz G, Ramoni MF. Knowledge based bioinformatics: from analysis to interpretation. Chichester, West Sussex, U.K: John Wiley & Sons; 2010. 29. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, et al. Gene Ontology: tool for the unification of biology. Nature Genetics. 2000;25(1):25-9. PubMed PMID: WOS:000086884000011. 30. Kanehisa M, Goto S. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Research. 2000;28(1):27-30. doi: 10.1093/nar/28.1.27. PubMed PMID: WOS:000084896300007.


31. Croft D, Mundo AF, Haw R, Milacic M, Weiser J, Wu G, et al. The Reactome pathway knowledgebase. Nucleic Acids Res. 2014;42(Database issue):D472-7. Epub 2013/11/19. doi: 10.1093/nar/gkt1102. PubMed PMID: 24243840; PubMed Central PMCID: PMCPMC3965010. 32. Stark C, Breitkreutz BJ, Reguly T, Boucher L, Breitkreutz A, Tyers M. BioGRID: a general repository for interaction datasets. Nucleic Acids Res. 2006;34(Database issue):D535-9. doi: 10.1093/nar/gkj109. PubMed PMID: 16381927; PubMed Central PMCID: PMC1347471. 33. Jensen LJ, Kuhn M, Stark M, Chaffron S, Creevey C, Muller J, et al. STRING 8—a global view on proteins and their functional interactions in 630 organisms. Nucleic Acids Res. 372009. p. D412-6. 34. Paull EO, Carlin DE, Niepel M, Sorger PK, Haussler D, Stuart JM. Discovering causal pathways linking genomic events to transcriptional states using Tied Diffusion Through Interacting Events (TieDIE). Bioinformatics. 2013;29(21):2757-64. Epub 2013/08/30. doi: 10.1093/bioinformatics/btt471. PubMed PMID: 23986566; PubMed Central PMCID: PMCPMC3799471. 35. Vaske CJ, Benz SC, Sanborn JZ, Earl D, Szeto C, Zhu J, et al. Inference of patient-specific pathway activities from multi-dimensional cancer genomics data using PARADIGM. 2010. doi: 10.1093/bioinformatics/btq182. 36. Barbie DA, Tamayo P, Boehm JS, Kim SY, Moody SE, Dunn IF, et al. Systematic RNA interference reveals that oncogenic KRAS-driven cancers require TBK1. Nature. 2009;462(7269):108-12. doi: doi:10.1038/nature08460. 37. Langfelder P, Horvath S. WGCNA: an R package for weighted correlation network analysis. Bmc Bioinformatics. 2008;9. doi: 10.1186/1471-2105-9-559. PubMed PMID: WOS:000262999900002. 38. Saito R, Smoot ME, Ono K, Ruscheinski J, Wang P-L, Lotia S, et al. A travel guide to Cytoscape plugins. Nature Methods. 2012;9(11):1069-76. doi: 10.1038/nmeth.2212. PubMed PMID: WOS:000310848700018. 39. Ciriello G, Cerami E, Aksoy BA, Sander C, Schultz N. 
Using MEMo to discover mutual exclusivity modules in cancer. Curr Protoc Bioinformatics. 2013;Chapter 8:Unit 8.17. Epub 2013/03/19. doi: 10.1002/0471250953.bi0817s41. PubMed PMID: 23504936. 40. Kramer A, Green J, Pollard J, Tugendreich S. Causal analysis approaches in Ingenuity Pathway Analysis. Bioinformatics. 2014;30(4):523-30. doi: 10.1093/bioinformatics/btt703. PubMed PMID: WOS:000332032100010. 41. Agrawal N, Akbani R, Ally A, Arachchi H, Balasundaram M, Balu S, et al. Integrated Genomic Characterization of Papillary Thyroid Carcinoma. Cell. 2014;159(3):676-90. doi: 10.1016/j.cell.2014.09.050. PubMed PMID: 25417114. 42. Network TCGA. Comprehensive genomic characterization of head and neck squamous cell carcinomas. Nature. 2015;517:576-82. doi: doi:10.1038/nature14129. 43. Verhaak RGW, The Eli and Edythe L. Broad Institute of Massachusetts Institute of Technology and Harvard University C, MA 02142, USA, Department of Medical Oncology D-FCI, Boston, MA 02115, USA, Hoadley KA, Department of Genetics UoNCaCH, Chapel Hill, NC 27599, USA, Lineberger Comprehensive Cancer Center UoNCaCH, Chapel Hill, NC 27599, USA, et al. Integrated Genomic Analysis Identifies Clinically Relevant Subtypes of Glioblastoma Characterized by Abnormalities in PDGFRA, IDH1, EGFR, and NF1. Cancer Cell. 2010;17(1):98- 110. doi: 10.1016/j.ccr.2009.12.020. PubMed PMID: 20129251. 44. PLOS ONE: A Novel Method Incorporating Gene Ontology Information for Unsupervised Clustering and Feature Selection 2015. Available from: http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0003860.


45. Network TCGA. Comprehensive molecular characterization of human colon and rectal cancer. Nature. 2012;487:330-7. doi: doi:10.1038/nature11252. 46. Leek JT, Scharpf RB, Bravo HC, Simcha D, Langmead B, Johnson WE, et al. Tackling the widespread and critical impact of batch effects in high-throughput data. Nat Rev Genet. 2010;11(10). doi: 10.1038/nrg2825. PubMed PMID: 20838408; PubMed Central PMCID: PMC3880143. 47. Srivastava S, Zhang L, Jin R, Chan C. A Novel Method Incorporating Gene Ontology Information for Unsupervised Clustering and Feature Selection. Plos One. 2008;3(12). doi: 10.1371/journal.pone.0003860. PubMed PMID: WOS:000265452200010. 48. Verbanck M, Lê S, Pagès J. A new unsupervised gene clustering algorithm based on the integration of biological knowledge into expression data. BMC Bioinformatics. 2013;14(1):1. doi: 10.1186/1471-2105-14-42. 49. Niida A, Imoto S, Yamaguchi R, Nagasaki M, Fujita A, Shimamura T, et al. Model-free unsupervised gene set screening based on information enrichment in expression profiles. 2010. doi: 10.1093/bioinformatics/btq592. 50. Chang JT, Carvalho C, Mori S, Bild AH, Gatza ML, Wang Q, et al. A Genomic Strategy to Elucidate Modules of Oncogenic Pathway Signaling Networks. Molecular Cell. 2009;34(1):104- 14. doi: 10.1016/j.molcel.2009.02.030. PubMed PMID: WOS:000265248600011. 51. Wu G, Stein L. A network module-based method for identifying cancer prognostic signatures. Genome Biology. 2012;13(12). doi: 10.1186/gb-2012-13-12-r112. PubMed PMID: WOS:000315869800001. 52. Ciriello G, Cerami E, Sander C, Schultz N. Mutual exclusivity analysis identifies oncogenic network modules. Genome Research. 2012;22(2):398-406. doi: 10.1101/gr.125567.111. PubMed PMID: WOS:000299606400022. 53. Chuang HY, Lee E, Liu YT, Lee D, Ideker T. Network-based classification of breast cancer metastasis. Mol Syst Biol. 2007;3:140. Epub 2007/10/18. doi: 10.1038/msb4100180. PubMed PMID: 17940530; PubMed Central PMCID: PMCPMC2063581. 54. 
Kim Y, Kim T-K, Kim Y, Yoo J, You S, Lee I, et al. Principal network analysis: identification of subnetworks representing major dynamics using gene expression data. Bioinformatics. 2011;27(3):391-8. doi: 10.1093/bioinformatics/btq670. PubMed PMID: WOS:000286991300013. 55. Chowdhury SA, Nibbe RK, Chance MR, Koyutuerk M. Subnetwork State Functions Define Dysregulated Subnetworks in Cancer. Journal of Computational Biology. 2011;18(3):263-81. doi: 10.1089/cmb.2010.0269. PubMed PMID: WOS:000288156000006. 56. Castro MA, Wang X, Fletcher MN, Meyer KB, Markowetz F. RedeR: R/Bioconductor package for representing modular structures, nested networks and multiple levels of hierarchical associations. Genome Biology. 2012;13(4):1. doi: 10.1186/gb-2012-13-4-r29. 57. Rhrissorrakrai K, Gunsalus KC. MINE: Module Identification in Networks. Bmc Bioinformatics. 2011;12. doi: 10.1186/1471-2105-12-192. PubMed PMID: WOS:000292027800001. 58. Machado CM, Rebholz-Schuhmann D, Freitas AT, Couto FM. The semantic web in translational medicine: current applications and future directions. Brief Bioinform. 2015;16(1):89-103. Epub 2013/11/08. doi: 10.1093/bib/bbt079. PubMed PMID: 24197933; PubMed Central PMCID: PMCPMC4293377. 59. Goldblatt EM, Lee WH. From bench to bedside: the growing use of translational research in cancer medicine. Am J Transl Res. 2010;2(1):1-18. Epub 2010/02/26. PubMed PMID: 20182579; PubMed Central PMCID: PMCPMC2826819.


60. Sarkar IN. Biomedical informatics and translational medicine. Journal of Translational Medicine. 2010;8. doi: 10.1186/1479-5876-8-22. PubMed PMID: WOS:000275732200002. 61. Satagopam V, Gu W, Eifes S, Gawron P, Ostaszewski M, Gebel S, et al. Integration and Visualization of Translational Medicine Data for Better Understanding of Human Diseases. Big Data. 2016;4(2):97-108. Epub 2016/07/22. doi: 10.1089/big.2015.0057. PubMed PMID: 27441714; PubMed Central PMCID: PMCPMC4932659. 62. Mathew JP, Taylor BS, Bader GD, Pyarajan S, Antoniotti M, Chinnaiyan AM, et al. From Bytes to Bedside: Data Integration and Computational Biology for Translational Cancer Research. PLoS Comput Biol. 32007. 63. Wanichthanarak K, Fahrmann JF, Grapov D. Genomic, Proteomic, and Metabolomic Data Integration Strategies. Biomark Insights. 2015;10(Suppl 4):1-6. Epub 2015/09/24. doi: 10.4137/bmi.s29511. PubMed PMID: 26396492; PubMed Central PMCID: PMCPMC4562606. 64. Cavill R, Jennen D, Kleinjans J, Briede JJ. Transcriptomic and metabolomic data integration. Brief Bioinform. 2015. Epub 2015/10/16. doi: 10.1093/bib/bbv090. PubMed PMID: 26467821. 65. Lin Y, Wu Z, Guo W, Li J. Gene mutations in gastric cancer: a review of recent next- generation sequencing studies. Tumour Biol. 2015;36(10):7385-94. Epub 2015/09/14. doi: 10.1007/s13277-015-4002-1. PubMed PMID: 26364057. 66. Cervino AR, Burei M, Mansi L, Evangelista L. Molecular pathways and molecular imaging in breast cancer: an update. Nucl Med Biol. 2013;40(5):581-91. Epub 2013/04/23. doi: 10.1016/j.nucmedbio.2013.03.002. PubMed PMID: 23602603. 67. Olar A, Sulman EP. Molecular Markers in Low-Grade Glioma-Toward Tumor Reclassification. Semin Radiat Oncol. 2015;25(3):155-63. Epub 2015/06/09. doi: 10.1016/j.semradonc.2015.02.006. PubMed PMID: 26050585; PubMed Central PMCID: PMCPMC4500036. 68. Brat DJ, Verhaak RG, Aldape KD, Yung WK, Salama SR, Cooper LA, et al. Comprehensive, Integrative Genomic Analysis of Diffuse Lower-Grade Gliomas. N Engl J Med. 
2015;372(26):2481- 98. Epub 2015/06/11. doi: 10.1056/NEJMoa1402121. PubMed PMID: 26061751; PubMed Central PMCID: PMCPMC4530011. 69. Parsa A, Raizer J, Ohio Library and Information N. Current understanding and treatment of gliomas. Cham [Switzerland]: Springer; 2015. 70. Karim AB, Maat B, Hatlevoll R, Menten J, Rutten EH, Thomas DG, et al. A randomized trial on dose-response in radiation therapy of low-grade cerebral glioma: European Organization for Research and Treatment of Cancer (EORTC) Study 22844. Int J Radiat Oncol Biol Phys. 1996;36(3):549-56. Epub 1996/10/01. PubMed PMID: 8948338. 71. van den Bent MJ, Afra D, de Witte O, Ben Hassel M, Schraub S, Hoang-Xuan K, et al. Long-term efficacy of early versus delayed radiotherapy for low-grade astrocytoma and oligodendroglioma in adults: the EORTC 22845 randomised trial. Lancet. 2005;366(9490):985- 90. Epub 2005/09/20. doi: 10.1016/s0140-6736(05)67070-5. PubMed PMID: 16168780. 72. van den Bent MJ. Practice changing mature results of RTOG study 9802: another positive PCV trial makes adjuvant chemotherapy part of standard of care in low-grade glioma. Neuro Oncol. 2014;16(12):1570-4. Epub 2014/10/31. doi: 10.1093/neuonc/nou297. PubMed PMID: 25355680; PubMed Central PMCID: PMCPMC4232091. 73. Fisher BJ, Hu C, Macdonald DR, Lesser GJ, Coons SW, Brachman DG, et al. Phase 2 study of temozolomide-based chemoradiation therapy for high-risk low-grade gliomas: preliminary results of Radiation Therapy Oncology Group 0424. Int J Radiat Oncol Biol Phys. 2015;91(3):497-


504. Epub 2015/02/15. doi: 10.1016/j.ijrobp.2014.11.012. PubMed PMID: 25680596; PubMed Central PMCID: PMCPMC4329190. 74. van den Bent MJ, Brandes AA, Taphoorn MJ, Kros JM, Kouwenhoven MC, Delattre JY, et al. Adjuvant procarbazine, lomustine, and vincristine chemotherapy in newly diagnosed anaplastic oligodendroglioma: long-term follow-up of EORTC brain tumor group study 26951. J Clin Oncol. 2013;31(3):344-50. Epub 2012/10/17. doi: 10.1200/jco.2012.43.2229. PubMed PMID: 23071237. 75. Cairncross G, Wang M, Shaw E, Jenkins R, Brachman D, Buckner J, et al. Phase III trial of chemoradiotherapy for anaplastic oligodendroglioma: long-term results of RTOG 9402. J Clin Oncol. 2013;31(3):337-43. Epub 2012/10/17. doi: 10.1200/jco.2012.43.2674. PubMed PMID: 23071247; PubMed Central PMCID: PMCPMC3732012. 76. Vinagre J, Almeida A, Populo H, Batista R, Lyra J, Pinto V, et al. Frequency of TERT promoter mutations in human cancers. Nat Commun. 2013;4:2185. Epub 2013/07/28. doi: 10.1038/ncomms3185. PubMed PMID: 23887589. 77. Venneti S, Huse JT. The evolving molecular genetics of low-grade glioma. Adv Anat Pathol. 2015;22(2):94-101. Epub 2015/02/11. doi: 10.1097/pap.0000000000000049. PubMed PMID: 25664944; PubMed Central PMCID: PMCPMC4667550. 78. Patel VN, Gokulrangan G, Chowdhury SA, Chen Y, Sloan AE, Koyuturk M, et al. Network signatures of survival in glioblastoma multiforme. PLoS Comput Biol. 2013;9(9):e1003237. Epub 2013/09/27. doi: 10.1371/journal.pcbi.1003237. PubMed PMID: 24068912; PubMed Central PMCID: PMCPMC3777929. 79. Jolliffe IT, Cadima J. Principal component analysis: a review and recent developments. Philos Trans A Math Phys Eng Sci. 2016;374(2065):20150202. Epub 2016/03/10. doi: 10.1098/rsta.2015.0202. PubMed PMID: 26953178; PubMed Central PMCID: PMCPMC4792409. 80. Network TCGAR, Weinstein JN, Collisson EA, Mills GB, Shaw KRM, Ozenberger BA, et al. The Cancer Genome Atlas Pan-Cancer analysis project. Nature Genetics. 2013;45:1113-20. doi: doi:10.1038/ng.2764. 81. 
Ceccarelli M, Barthel FP, Malta TM, Sabedot TS, Salama SR, Murray BA, et al. Molecular Profiling Reveals Biologically Discrete Subsets and Pathways of Progression in Diffuse Glioma. Cell. 2016;164(3):550-63. Epub 2016/01/30. doi: 10.1016/j.cell.2015.12.028. PubMed PMID: 26824661; PubMed Central PMCID: PMCPMC4754110. 82. Goldman M, Craft B, Swatloski T, Ellrott K, Cline M, Diekhans M, et al. The UCSC Cancer Genomics Browser: update 2013. Nucleic Acids Res. 2013;41(Database issue):D949-54. Epub 2012/10/31. doi: 10.1093/nar/gks1008. PubMed PMID: 23109555; PubMed Central PMCID: PMCPMC3531186. 83. Finger R, ETH Zürich S, Institute for Environmental Decisions (IED) AfaAeEG, SOL C7, Sonneggstrasse 33, CH‐8092 Zurich, Switzerland. Review of ‘Robustbase’ software for R. Journal of Applied Econometrics. 2016;25(7):1205-10. doi: 10.1002/jae.1194. 84. Szklarczyk D, Franceschini A, Wyder S, Forslund K, Heller D, Huerta-Cepas J, et al. STRING v10: protein-protein interaction networks, integrated over the tree of life. Nucleic acids research. 2015;43(Database issue):D447-52. Epub 2014/10/30. doi: 10.1093/nar/gku1003. PubMed PMID: 25352553; PubMed Central PMCID: PMCPMC4383874. 85. Szklarczyk D, Franceschini A, Kuhn M, Simonovic M, Roth A, Minguez P, et al. The STRING database in 2011: functional interaction networks of proteins, globally integrated and scored. Nucleic Acids Res. 2011;39(Database issue):D561-8. Epub 2010/11/04. doi: 10.1093/nar/gkq973. PubMed PMID: 21045058; PubMed Central PMCID: PMCPMC3013807.


86. Help/STRING: functional protein association networks 2016. Available from: http://string-db.org/cgi/help.pl?UserId=CFWRJF1EmuVF&sessionId=yjep4ygJeonh. 87. Wilkerson MD, Hayes DN. ConsensusClusterPlus: a class discovery tool with confidence assessments and item tracking. Bioinformatics. 2010;26(12):1572-3. Epub 2010/04/30. doi: 10.1093/bioinformatics/btq170. PubMed PMID: 20427518; PubMed Central PMCID: PMCPMC2881355. 88. Meil M, #462. Comparing clusterings: an axiomatic view. Proceedings of the 22nd international conference on Machine learning; Bonn, Germany. 1102424: ACM; 2005. p. 577-84. 89. Kim EK, Choi E-J. Compromised MAPK signaling in human diseases: an update. Archives of Toxicology. 2015;89(6):867-82. doi: 10.1007/s00204-015-1472-2. 90. Cilloni D, Saglio G. Molecular Pathways: BCR-ABL. Clinical Cancer Research. 2012;18(4):930. 91. Wu CC, Hsu SC, Shih H, Lai MZ. Nuclear Factor of Activated T Cells c Is a Target of p38 Mitogen-Activated Protein Kinase in T Cells. Mol Cell Biol. 232003. p. 6442-54. 92. Curry JM, Eubank TD, Roberts RD, Wang Y, Pore N, Maity A, et al. M-CSF signals through the MAPK/ERK pathway via Sp1 to induce VEGF production and induces angiogenesis in vivo. PLoS One. 2008;3(10):e3405. Epub 2008/10/15. doi: 10.1371/journal.pone.0003405. PubMed PMID: 18852899; PubMed Central PMCID: PMCPMC2566603. 93. Goldsmith ZG, Dhanasekaran DN. G protein regulation of MAPK networks. Oncogene. 2007;26(22):3122-42. Epub 2007/05/15. doi: 10.1038/sj.onc.1210407. PubMed PMID: 17496911. 94. Rieder M, Green G, Park S, Stamper B, Gordon C, Johnson J, et al. A Human Homeotic Transformation Resulting from Mutations in PLCB4 and GNAI3 Causes Auriculocondylar Syndrome. Am J Hum Genet. 902012. p. 907-14. 95. Zhu X, Sano H, Kim KP, Sano A, Boetticher E, Munoz NM, et al. Role of mitogen-activated protein kinase-mediated cytosolic phospholipase A2 activation in arachidonic acid metabolism in human eosinophils. J Immunol. 2001;167(1):461-8. Epub 2001/06/22. 
PubMed PMID: 11418683. 96. Gasulla F, Barreno E, Parages ML, Camara J, Jimenez C, Dormann P, et al. The Role of Phospholipase D and MAPK Signaling Cascades in the Adaption of Lichen Microalgae to Desiccation: Changes in Membrane Lipids and Phosphoproteome. Plant Cell Physiol. 2016;57(9):1908-20. Epub 2016/06/24. doi: 10.1093/pcp/pcw111. PubMed PMID: 27335354. 97. Qing H, Gong W, Che Y, Wang X, Peng L, Liang Y, et al. PAK1-dependent MAPK pathway activation is required for colorectal cancer cell proliferation. Tumour Biol. 2012;33(4):985-94. Epub 2012/01/19. doi: 10.1007/s13277-012-0327-1. PubMed PMID: 22252525. 98. Weston CR, Davis RJ. The JNK signal transduction pathway. Curr Opin Genet Dev. 2002;12(1):14-21. Epub 2002/01/16. PubMed PMID: 11790549. 99. Fritsche-Guenther R, Witzel F, Sieber A, Herr R, Schmidt N, Braun S, et al. Strong negative feedback from Erk to Raf confers robustness to MAPK signalling. Mol Syst Biol. 72011. p. 489. 100. Janku F, Lee JJ, Tsimberidou AM, Hong DS, Naing A, Falchook GS, et al. PIK3CA mutations frequently coexist with RAS and BRAF mutations in patients with advanced cancers. PLoS One. 2011;6(7):e22769. Epub 2011/08/11. doi: 10.1371/journal.pone.0022769. PubMed PMID: 21829508; PubMed Central PMCID: PMCPMC3146490. 101. Trejo CL, Green S, Marsh V, Collisson EA, Iezza G, Phillips WA, et al. Mutationally activated PIK3CA(H1047R) cooperates with BRAF(V600E) to promote lung cancer progression. Cancer Res. 2013;73(21):6448-61. Epub 2013/09/11. doi: 10.1158/0008-5472.can-13-0681. PubMed PMID: 24019382; PubMed Central PMCID: PMCPMC3825323.


102. Lu Z, Xu S. ERK1/2 MAP kinases in cell survival and apoptosis. IUBMB Life. 2006;58(11):621-31. Epub 2006/11/07. doi: 10.1080/15216540600957438. PubMed PMID: 17085381.
103. Weitzman JB, Fiette L, Matsuo K, Yaniv M. JunD protects cells from p53-dependent senescence and apoptosis. Mol Cell. 2000;6(5):1109-19. Epub 2000/12/07. PubMed PMID: 11106750.
104. Desideri E, Cavallo AL, Baccarini M. Alike but Different: RAF Paralogs and Their Signaling Outputs. Cell. 2015;161(5):967-70. Epub 2015/05/23. doi: 10.1016/j.cell.2015.04.045. PubMed PMID: 26000477.
105. Riha P, Rudd CE. CD28 co-signaling in the adaptive immune response. Self Nonself. 2010;1:231-40.
106. Vidal C, Geny B, Melle J, Jandrot-Perrus M, Fontenay-Roupie M. Cdc42/Rac1-dependent activation of the p21-activated kinase (PAK) regulates human platelet lamellipodia spreading: implication of the cortical-actin binding protein cortactin. Blood. 2002;100(13):4462-9. Epub 2002/11/28. doi: 10.1182/blood.V100.13.4462. PubMed PMID: 12453877.
107. Cobaleda C, Schebesta A, Delogu A, Busslinger M. Pax5: the guardian of B cell identity and function. Nat Immunol. 2007;8(5):463-70. Epub 2007/04/19. doi: 10.1038/ni1454. PubMed PMID: 17440452.
108. Schebesta A, McManus S, Salvagiotto G, Delogu A, Busslinger GA, Busslinger M. Transcription factor Pax5 activates the chromatin of key genes involved in B cell signaling, adhesion, migration, and immune function. Immunity. 2007;27(1):49-63. Epub 2007/07/31. doi: 10.1016/j.immuni.2007.05.019. PubMed PMID: 17658281.
109. Mullighan CG, Zhang J, Kasper LH, Lerach S, Payne-Turner D, Phillips LA, et al. CREBBP mutations in relapsed acute lymphoblastic leukaemia. Nature. 2011;471(7337):235-9. doi: 10.1038/nature09727. PubMed PMID: 21390130; PubMed Central PMCID: PMC3076610.
110. Jiang Y, Hatzi K, Shaknovich R. Mechanisms of epigenetic deregulation in lymphoid neoplasms. Blood. 2013;121(21):4271-9. Epub 2013/05/25. doi: 10.1182/blood-2012-12-451799. PubMed PMID: 23704048; PubMed Central PMCID: PMC3663422.
111. He T, Hong SY, Huang L, Xue W, Yu Z, Kwon H, et al. Histone Acetyltransferase p300 Acetylates Pax5 and Strongly Enhances Pax5-mediated Transcriptional Activity. J Biol Chem. 2011;286:14137-45.
112. Sadler AJ, Williams BRG. Interferon-inducible antiviral effectors. Nat Rev Immunol. 2008;8(7):559-68. doi: 10.1038/nri2314. PubMed PMID: 18575461; PubMed Central PMCID: PMC2522268.
113. Garoby-Salom S, Rouahi M, Mucher E, Auge N, Salvayre R, Negre-Salvayre A. Hyaluronan synthase-2 upregulation protects smpd3-deficient fibroblasts against cell death induced by nutrient deprivation, but not against apoptosis evoked by oxidized LDL. Redox Biol. 2015;4:118-26.
114. Del Villar K, Miller CA. Down-regulation of DENN/MADD, a TNF receptor binding protein, correlates with neuronal cell death in Alzheimer's disease brain and hippocampal neurons. Proc Natl Acad Sci U S A. 2004;101:4210-5.
115. Morgan MA, Shilatifard A. Chromatin signatures of cancer. Genes Dev. 2015;29(3):238-49. Epub 2015/02/04. doi: 10.1101/gad.255182.114. PubMed PMID: 25644600; PubMed Central PMCID: PMC4318141.
116. Dilworth FJ, Chambon P. Nuclear receptors coordinate the activities of chromatin remodeling complexes and coactivators to facilitate initiation of transcription. Oncogene. 2001;20(24):3047-54. Epub 2001/06/23. doi: 10.1038/sj.onc.1204329. PubMed PMID: 11420720.
117. Krasnov AN, Mazina MY, Nikolenko JV, Vorobyeva NE. On the way of revealing coactivator complexes cross-talk during transcriptional activation. Cell Biosci. 2016;6:15. Epub 2016/02/26. doi: 10.1186/s13578-016-0081-y. PubMed PMID: 26913181; PubMed Central PMCID: PMC4765067.
118. Wu R-C, Guan B, Kobayashi Y, Wang T-L, Shih I-M. Abstract A63: Rsf-1, a chromatin remodeling protein, interacts with shelterin protein hRap1 and induces telomere shortening. Cancer Research. 2014;73(13 Supplement):A63.
119. Adelfalk C, Janschek J, Revenkova E, Blei C, Liebe B, Gob E, et al. Cohesin SMC1beta protects telomeres in meiocytes. J Cell Biol. 2009;187(2):185-99. Epub 2009/10/21. doi: 10.1083/jcb.200808016. PubMed PMID: 19841137; PubMed Central PMCID: PMC2768837.
120. Hase ME, Cordes VC. Direct interaction with nup153 mediates binding of Tpr to the periphery of the nuclear pore complex. Mol Biol Cell. 2003;14(5):1923-40. Epub 2003/06/13. doi: 10.1091/mbc.E02-09-0620. PubMed PMID: 12802065; PubMed Central PMCID: PMC165087.
121. Ibarra A, Hetzer MW. Nuclear pore proteins and the control of genome functions. Genes Dev. 2015;29(4):337-49. doi: 10.1101/gad.256495.114. PubMed PMID: 25691464; PubMed Central PMCID: PMC4335290.
122. Moreno CS. The Sex-Determining Region Y-Box 4 and Homeobox C6 Transcriptional Networks in Prostate Cancer Progression: Crosstalk with the Wnt, Notch, and PI3K Pathways. Am J Pathol. 2010;176(2):518-27. doi: 10.2353/ajpath.2010.090657. PubMed PMID: 20019190; PubMed Central PMCID: PMC2808058.
123. Kuwahara A, Sakai H, Xu Y, Itoh Y, Hirabayashi Y, Gotoh Y. Tcf3 Represses Wnt–β-Catenin Signaling and Maintains Neural Stem Cell Population during Neocortical Development. PLoS One. 2014;9.
124. Lasky JL, Wu H. Notch signaling, brain development, and human disease. Pediatr Res. 2005;57(5 Pt 2):104r-9r. Epub 2005/04/09. doi: 10.1203/01.pdr.0000159632.70510.3d. PubMed PMID: 15817497.
125. Chen C, Lee GA, Pourmorady A, Sock E, Donoghue MJ. Orchestration of Neuronal Differentiation and Progenitor Pool Expansion in the Developing Cortex by SoxC Genes. J Neurosci. 2015;35(29):10629-42. Epub 2015/07/24. doi: 10.1523/jneurosci.1663-15.2015. PubMed PMID: 26203155; PubMed Central PMCID: PMC4510297.
126. Engin E, Liu J, Rudolph U. α2-containing GABAA receptors: A target for the development of novel treatment strategies for CNS disorders. Pharmacol Ther. 2012;136(2):142-52. doi: 10.1016/j.pharmthera.2012.08.006. PubMed PMID: 22921455; PubMed Central PMCID: PMC3478960.
127. Nowakowska BA, Obersztyn E, Szymanska K, Bekiesinska-Figatowska M, Xia Z, Ricks CB, et al. Severe mental retardation, seizures, and hypotonia due to deletions of MEF2C. Am J Med Genet B Neuropsychiatr Genet. 2010;153b(5):1042-51. Epub 2010/03/25. doi: 10.1002/ajmg.b.31071. PubMed PMID: 20333642.
128. Kim W, Kim M, Jho EH. Wnt/beta-catenin signalling: from plasma membrane to nucleus. Biochem J. 2013;450(1):9-21. Epub 2013/01/25. doi: 10.1042/bj20121284. PubMed PMID: 23343194.
129. Maurer U, Preiss F, Brauns-Schubert P, Schlicher L, Charvet C. GSK-3 - at the crossroads of cell death and survival. J Cell Sci. 2014;127(Pt 7):1369-78. Epub 2014/04/02. doi: 10.1242/jcs.138057. PubMed PMID: 24687186.


130. Huang CW, Chen YW, Lin YR, Chen PH, Chou MH, Lee LJ, et al. Conditional Knockout of Breast Carcinoma Amplified Sequence 2 (BCAS2) in Mouse Forebrain Causes Dendritic Malformation via beta-catenin. Sci Rep. 2016;6:34927. Epub 2016/10/08. doi: 10.1038/srep34927. PubMed PMID: 27713508; PubMed Central PMCID: PMC5054673.
131. Cha B, Kim W, Kim YK, Hwang BN, Park SY, Yoon JW, et al. Methylation by protein arginine methyltransferase 1 increases stability of Axin, a negative regulator of Wnt signaling. Oncogene. 2011;30(20):2379-89. Epub 2011/01/19. doi: 10.1038/onc.2010.610. PubMed PMID: 21242974.
132. Uematsu K, He B, You L, Xu Z, McCormick F, Jablons DM. Activation of the Wnt pathway in non small cell lung cancer: evidence of dishevelled overexpression. Oncogene. 2003;22(46):7218-21. Epub 2003/10/17. doi: 10.1038/sj.onc.1206817. PubMed PMID: 14562050.
133. Huang SM, Mishina YM, Liu S, Cheung A, Stegmeier F, Michaud GA, et al. Tankyrase inhibition stabilizes axin and antagonizes Wnt signalling. Nature. 2009;461(7264):614-20. Epub 2009/09/18. doi: 10.1038/nature08356. PubMed PMID: 19759537.
134. Huang M, Wang Y, Sun D, Zhu H, Yin Y, Zhang W, et al. Identification of genes regulated by Wnt/β-catenin pathway and involved in apoptosis via microarray analysis. BMC Cancer. 2006;6:221.
135. Komiya Y, Habas R. Wnt signal transduction pathways. Organogenesis. 2008;4:68-75.
136. Yu A, Xing Y, Harrison SC, Kirchhausen T. Structural analysis of the interaction between Dishevelled2 and clathrin AP-2 adaptor, a critical step in noncanonical Wnt signaling. Structure. 2010;18(10):1311-20. Epub 2010/10/16. doi: 10.1016/j.str.2010.07.010. PubMed PMID: 20947020; PubMed Central PMCID: PMC2992793.
137. Barber TD, McManus K, Yuen KW, Reis M, Parmigiani G, Shen D, et al. Chromatid cohesion defects may underlie chromosome instability in human colorectal cancers. Proc Natl Acad Sci U S A. 2008;105(9):3443-8. Epub 2008/02/27. doi: 10.1073/pnas.0712384105. PubMed PMID: 18299561; PubMed Central PMCID: PMC2265152.
138. Herman JA, Toledo CM, Olson JM, DeLuca JG, Paddison PJ. Molecular pathways: regulation and targeting of kinetochore-microtubule attachment in cancer. Clin Cancer Res. 2015;21(2):233-9. Epub 2014/08/12. doi: 10.1158/1078-0432.ccr-13-0645. PubMed PMID: 25104085; PubMed Central PMCID: PMC4297562.
139. Ricke RM, Jeganathan KB, van Deursen JM. Bub1 overexpression induces aneuploidy and tumor formation through hyperactivation. J Cell Biol. 2011;193(6):1049-64. Epub 2011/06/08. doi: 10.1083/jcb.201012035. PubMed PMID: 21646403; PubMed Central PMCID: PMC3115799.
140. Izawa D, Pines J. How APC/C-Cdc20 changes its substrate specificity in mitosis. Nat Cell Biol. 2011;13(3):223-33. Epub 2011/02/22. doi: 10.1038/ncb2165. PubMed PMID: 21336306; PubMed Central PMCID: PMC3059483.
141. Zhang N, Pati D. Sororin is a master regulator of sister chromatid cohesion and separation. Cell Cycle. 2012;11(11):2073-83. Epub 2012/05/15. doi: 10.4161/cc.20241. PubMed PMID: 22580470; PubMed Central PMCID: PMC3368859.
142. Maia AR, Garcia Z, Kabeche L, Barisic M, Maffini S, Macedo-Ribeiro S, et al. Cdk1 and Plk1 mediate a CLASP2 phospho-switch that stabilizes kinetochore-microtubule attachments. J Cell Biol. 2012;199(2):285-301. Epub 2012/10/10. doi: 10.1083/jcb.201203091. PubMed PMID: 23045552; PubMed Central PMCID: PMC3471233.


143. Park J-I, Venteicher AS, Hong JY, Choi J, Jun S, Shkreli M, et al. Telomerase modulates Wnt signalling by association with target gene chromatin. Nature. 2009;460(7251):66-72. doi: 10.1038/nature08137.
144. Quinlan JR. C4.5: programs for machine learning: Morgan Kaufmann Publishers Inc.; 1993. 302 p.
145. Krantz ID, McCallum J, DeScipio C, Kaur M, Gillis LA, Yaeger D, et al. Cornelia de Lange syndrome is caused by mutations in NIPBL, the human homolog of Drosophila melanogaster Nipped-B. Nat Genet. 2004;36(6):631-5. Epub 2004/05/18. doi: 10.1038/ng1364. PubMed PMID: 15146186; PubMed Central PMCID: PMC4902017.
146. Kagey MH, Newman JJ, Bilodeau S, Zhan Y, Orlando DA, van Berkum NL, et al. Mediator and cohesin connect gene expression and chromatin architecture. Nature. 2010;467:430-5. doi: 10.1038/nature09380.
147. Cahill ME, Xie Z, Day M, Photowala H, Barbolina MV, Miller CA, et al. Kalirin regulates cortical spine morphogenesis and disease-related behavioral phenotypes. Proc Natl Acad Sci U S A. 2009;106(31):13058-63. Epub 2009/07/25. doi: 10.1073/pnas.0904636106. PubMed PMID: 19625617; PubMed Central PMCID: PMC2722269.
148. Xie Z, Cahill ME, Penzes P. Kalirin loss results in cortical morphological alterations. Mol Cell Neurosci. 2010;43(1):81-9. Epub 2009/10/06. doi: 10.1016/j.mcn.2009.09.006. PubMed PMID: 19800004; PubMed Central PMCID: PMC2818244.
149. Lesch KP, Timmesfeld N, Renner TJ, Halperin R, Roser C, Nguyen TT, et al. Molecular genetics of adult ADHD: converging evidence from genome-wide association and extended pedigree linkage studies. J Neural Transm (Vienna). 2008;115(11):1573-85. Epub 2008/10/08. doi: 10.1007/s00702-008-0119-3. PubMed PMID: 18839057.
150. Kushima I, Nakamura Y, Aleksic B, Ikeda M, Ito Y, Shiino T, et al. Resequencing and association analysis of the KALRN and EPHB1 genes and their contribution to schizophrenia susceptibility. Schizophr Bull. 2012;38(3):552-60. Epub 2010/11/03. doi: 10.1093/schbul/sbq118. PubMed PMID: 21041834; PubMed Central PMCID: PMC3329972.
151. Solomon DA, Kim JS, Waldman T. Cohesin gene mutations in tumorigenesis: from discovery to clinical significance. BMB Rep. 2014;47(6):299-310. Epub 2014/05/27. PubMed PMID: 24856830; PubMed Central PMCID: PMC4163871.
152. Cook DR, Rossman KL, Der CJ. Rho guanine nucleotide exchange factors: regulators of Rho GTPase activity in development and disease. Oncogene. 2013;33(31):4021-35. doi: 10.1038/onc.2013.362.
153. Shang X, Marchioni F, Evelyn CR, Sipes N, Zhou X, Seibel W, et al. Small-molecule inhibitors targeting G-protein-coupled Rho guanine nucleotide exchange factors. Proc Natl Acad Sci U S A. 2013;110(8):3155-60. Epub 2013/02/06. doi: 10.1073/pnas.1212324110. PubMed PMID: 23382194; PubMed Central PMCID: PMC3581902.
154. McMurray HR, Sampson ER, Compitello G, Kinsey C, Newman L, Smith B, et al. Synergistic response to oncogenic mutations defines gene class critical to cancer phenotype. Nature. 2008;453(7198):1112-6. doi: 10.1038/nature06973. PubMed PMID: 18500333; PubMed Central PMCID: PMC2613942.
155. Ming-Shiean H, Yu JC, Wang HW, Chen ST, Hsiung CN, Ding SL, et al. Synergistic effects of polymorphisms in DNA repair genes and endogenous estrogen exposure on female breast cancer risk. Ann Surg Oncol. 2010;17(3):760-71. Epub 2010/02/26. PubMed PMID: 20183911.
156. Wainberg ZA, Anghel A, Desai AJ, Ayala R, Luo T, Safran B, et al. Lapatinib, a dual EGFR and HER2 kinase inhibitor, selectively inhibits HER2-amplified human gastric cancer cells and is synergistic with trastuzumab in vitro and in vivo. Clin Cancer Res. 2010;16(5):1509-19. Epub 2010/02/25. doi: 10.1158/1078-0432.ccr-09-1112. PubMed PMID: 20179222.
