NSEA: N-NODE SUBNETWORK ENUMERATION ALGORITHM IDENTIFIES LOWER GRADE GLIOMA SUBTYPES WITH ALTERED SUBNETWORKS AND DISTINCT PROGNOSTICS

by
ZHIHAN ZHANG

Submitted in partial fulfillment of the requirements for the degree of Master of Science

Systems Biology and Bioinformatics
CASE WESTERN RESERVE UNIVERSITY
May 2017

CASE WESTERN RESERVE UNIVERSITY
SCHOOL OF GRADUATE STUDIES

We hereby approve the thesis/dissertation of ZHIHAN ZHANG, candidate for the degree of Master of Science.

Committee Chair: GURKAN BEBEK
Committee Member: MARK CAMERON
Committee Member: JEAN-EUDES DAZARD
Date of Defense: Jul 22, 2016

Contents

Contents
List of Tables
List of Figures
Abstract
Introduction
    Gene Expression Analysis
        Advantages of Unsupervised Learning
        Feature Extraction and Unsupervised Pathway Analysis
        Application of Pathway-Based Methodology in Translational Medicine
    Diffuse Low-Grade Glioma (LGG)
        Overview
        Prognosis Markers of LGG
Methodology
    Overview
    Concept Description
        Network Enumeration
        Subnetwork Selection and Expansion
        Vector Representation of Subnetwork
    Pipeline
        Data Preparation
        Enumeration
        Subnetwork Selection, Expansion and Vectorization
        Parameter Tuning
        Future Validation
Results
    Feature Subnetworks and Clustering
    Subnetwork Groups
    Patient Groups
Discussion
References

List of Tables

Table 1. Subnetwork Clusters and Corresponding Pathways
Table 2. Comparison of Current Patient Groups and Groups from TCGA
Table 3. Mutations and MGMT Methylation Statistics

List of Figures

Figure 1. Diagram of nSEA Algorithm
Figure 2. Vector Representation of Subnetwork
Figure 3. Edge Vector and Edge Score
Figure 4. Definition of Subnetwork Score (Inner-pattern Consistency)
Figure 5. Pipeline Summarization and Parameter Tuning
Figure 6. Heatmap and Clustering of LGG Samples and Subnetworks
Figure 7. Clinical Characteristics of LGG Patient Groups
Figure 8. Patient Group Characterization by Subnetworks
Figure 9. Patient Groups with Mutations and MGMT Methylation

NSEA: n-Node Subnetwork Enumeration Algorithm Identifies Lower Grade Glioma Subtypes with Altered Subnetworks and Distinct Prognostics

Abstract

by
ZHIHAN ZHANG

Motivation: The prognosis of low-grade-glioma (LGG) patients is very poor. Identifying subnetworks related to LGG can better describe the genetic make-up of the tumor.
Methods: n-Node Subnetwork Enumeration Algorithm (nSEA) was developed to identify significantly dysregulated subnetworks. We used a filtered protein network to enumerate n-node subnetworks exhaustively and scored each subnetwork to carry out feature selection. The selected subnetwork seeds were expanded to identify tumor-specific subnetworks. Clustering these subnetworks yielded patient groups with different subnetwork states.

Results: We identified 92 subnetwork features, 8 subnetwork groups and 5 patient groups. A new patient group with favorable outcomes was identified. Decision tree modeling characterized the new group by a down-regulated MAPK/B-Raf pathway and an up-regulated Notch pathway. The group had fewer mutations of candidate genes, hypomethylation of NIPBL and hypermethylation of KALRN.

Conclusions: These results could provide opportunities for improved treatment options and personalized interventions for LGG.

Introduction

Gene Expression Analysis

Advantages of Unsupervised Learning

With the widespread application of microarrays and high-throughput sequencing, a growing volume of data has been generated to characterize genome-wide gene expression in healthy and diseased states. Researchers use these data to understand the underlying mechanisms of dysregulation. Genome-wide gene expression analysis is especially popular in cancer studies because the etiology of the disease is unique to each tissue type. As of July 2016, there were 1079 cancer-related datasets in the Gene Expression Omnibus, accounting for 28% of the whole database [1]. Because gene expression databases keep expanding, the biggest challenge in gene expression studies of cancer is analyzing existing data rather than generating it. Among the research interests of cancer gene expression analysis, one major goal is to identify the important gene expression patterns within a specific type of cancer. Many algorithms have been developed to solve this problem in the last decade [2].
These algorithms can be broadly divided into two classes: supervised and unsupervised methods. Supervised algorithms have been widely used to discover gene expression patterns associated with known phenotypes [3-6]. Most of them perform differential gene expression analysis, which aims to identify the most sensitive predictors associated with the target phenotypes [7, 8]. The identified gene set, ranging from a couple of genes to dozens of genes, may perform well by machine learning standards. However, a major premise of a supervised approach is that the testing data is generated under the same conditions as the training data, which ensures that the two share the same statistical properties, such as distribution [9]. This implies that the gene signatures picked by a supervised approach may not have the same statistical power when the protocol or platform is changed. Even when tested within the same dataset, models built with completely supervised approaches are often susceptible to overfitting because of random noise and outliers [2].

In contrast, unsupervised algorithms are more flexible in cross-platform validation and more resistant to noise. Although unsupervised methods do not take advantage of the labels in the data, the resulting model is in turn not confined by these known categorical variables. Rather than discovering the top-differentiated genes related to a phenotype, an unsupervised approach is better at exploring the global landscape of genome-wide gene expression patterns. This characteristic makes the result not only more robust, since it is based on thousands of genes, but also easier to interpret, since different clusters of genes may represent different biological modules [9, 10]. Moreover, because of the freedom of unsupervised algorithms,