Attractor Molecular Signatures and Their Applications for Prognostic Biomarkers Wei-Yi Cheng

Attractor Molecular Signatures and Their Applications for Prognostic Biomarkers Wei-Yi Cheng Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy of the Graduate School of Arts and Science COLUMBIA UNIVERSITY 2014 c 2013 Wei-Yi Cheng All rights reserved ABSTRACT Attractor Molecular Signatures and Their Applications for Prognostic Biomarkers Wei-Yi Cheng This dissertation presents a novel data mining algorithm identifying molecular signatures, called attractor metagenes, from large biological data sets. It also presents a computational model for combining such signatures to create prognostic biomarkers. Us- ing the algorithm on multiple cancer data sets, we identified three such gene co-expression signatures that are present in nearly identical form in different tumor types represent- ing biomolecular events in cancer, namely mitotic chromosomal instability, mesenchymal transition, and lymphocyte infiltration. A comprehensive experimental investigation using mouse xenograft models on the mesenchymal transition attractor metagene showed that the signature was expressed in the human cancer cells, but not in the mouse stroma. The attractor metagenes were used to build the winning model of a breast cancer prognosis challenge. When applied on larger data sets from 12 different cancer types from The Cancer Genome Atlas \Pan-Cancer" project, the algorithm identified additional pan-cancer molecular signatures, some of which involve methylation sites, microRNA expression, and protein activity. Contents List of Figures iv List of Tables vi 1 Introduction 1 1.1 Background . 1 1.2 Previous Work on Identifying Cancer-Related Gene Signatures . 3 1.3 Contributions of the Thesis . 9 2 Attractor Metagenes 11 2.1 Derivation of an Attractor Metagene . 12 2.2 Derivation of a Genomically Co-localized Attractor Metagene . 16 2.3 Derivation of Consensus Attractor Signatures . 17 2.4 Evaluating the Significance of an Attractor Signature . 19 2.5 Choice of Parameters in the Iterative Algorithm . 20 2.6 Comparison with Other Unsupervised Algorithms . 23 2.7 Implementation Complexity . 26 3 Pan-Cancer Attractor Signatures 30 3.1 TCGA Pan-Cancer Project . 31 3.2 18 Pan-Cancer Molecular Signatures . 32 3.3 MES mRNA Signature . 36 3.4 Mitotic CIN mRNA Signature . 37 i 3.5 M+, M− Methylation and LYM mRNA Signature . 40 3.6 END mRNA Signature . 45 3.7 Chr8q24.3 Amplicon mRNA Signature . 46 3.8 Other Pan-Cancer Signatures . 49 4 Further Investigation on the Mesenchymal Transition Signature 53 4.1 MES Signature and Tumor Invasiveness . 53 4.2 MES Signature and Recurrence in Glioblastoma . 54 4.3 Association Between MES Signature and CD44 . 60 4.4 Comparing the MES Signature with Other Mesenchymal Subtypes in Glioma 62 4.5 Source of MES Signature . 63 5 Predicting Breast Cancer Prognosis Using Attractor Signatures 70 5.1 The Sage Bionetworks-DREAM BCC . 71 5.2 Identifying Prognostic Features . 73 5.2.1 Mitotic CIN Signature . 75 5.2.2 MES Signature . 76 5.2.3 LYM Signature . 77 5.2.4 FGD3 -SUSD3 Metagene . 78 5.2.5 Other Breast Cancer-Specific Features . 80 5.3 Building an Ensemble Prognostic Model . 82 5.3.1 Derivation of Features . 82 5.3.2 Building Prognostic Models . 84 5.3.3 Combination of Predictions . 87 5.3.4 Combination of OS- and DS-based Predictions . 88 5.4 Performance of the Prognostic Model in the BCC . 88 5.5 Comparison with Random Gene Expression Signatures . 90 6 Conclusions and Future Work 92 6.1 Building Genomic Assays for Cancer Prognosis Using Attractor Signatures 94 ii 6.2 Integrating Attractor Signatures with Additional Information for Biological Discovery . 96 6.3 Comparing Attractor Signatures Identified in Normal Tissues with Pan-Cancer Attractor Signatures . 96 Bibliography 99 A Mutual Information Estimation 111 B Data Sets and Pre-processing 112 B.1 Data Sets Used in This Dissertation . 112 B.2 Pre-processing of Data Sets . 113 B.2.1 Pre-processing of Gene Expression Microarray Data Sets . 113 B.2.2 Pre-processing of RNASeq and miRNASeq Data Sets . 113 B.2.3 Pre-processing of DNA Methylation and RPPA Data Sets . 113 B.2.4 Clustering Attractors and Creating Consensus Molecular Signatures in Pancan 12 Data Sets . 114 C Supplementary Tables 115 iii List of Figures 2.1 Convergence process of an attractor. 15 2.2 Top 5 attractor clusters found in 6 cancer data sets. 28 2.3 Exponent α and the strength of an attractor. 29 3.1 Scatter plots between ESTIMATE stromal scores and MES signature. 38 3.2 Box plots connecting CIN signature with histological tumor grade. 40 3.3 Scatter plots connecting the LYM, M+ and M- signature in 12 cancer types. 41 3.4 Pan-cancer correlation between LYM signature and leukocyte percentage. 42 3.5 Scatter plots demonstrating the pan-cancer similarity of the LYM feature to ESTIMATE immune score. 43 3.6 Scatter plots demonstrating the pan-cancer association among the LYM feature, M-feature, and the miR-142 microRNA expression. 44 3.7 Scatter plots demonstrating the pan-cancer association between MES and END feature. 47 3.8 Co-amplification of the chr8q24.3 amplicon signature in three TCGA data sets. 48 3.9 DLK1 -DIO3 cluster of ncRNA. 50 4.1 Scatter plot showing the association between MES signature and time to tumor recurrence. 55 4.2 Heat map of the top 10 genes of the MES signature in glioblastoma. 56 iv 4.3 Kaplan-Meier curves comparing samples with high vs. low levels of the MES signature . 58 4.4 Scatter plot showing association between CD44 and MES signature . 59 4.5 Scatter plots showing the association among MES signature, COL11A1, and SNAI2................................... 65 4.6 Scatter plots showing the association among MES signature, COL11A1, and SNAI2 in human and mouse cells. 66 4.7 Heat map showing differentially expressed human and mouse genes. 68 5.1 Kaplan-Meier cumulative survival curves of breast cancer patients over a 15-year period on the basis of the mitotic CIN signature expression . 76 5.2 Kaplan-Meier cumulative survival curves of breast cancer patients over a 15-year period on the basis of the LYM signature expression . 78 5.3 FGD3 -SUSD3 metagene. 79 5.4 Schematic of model development. 83 5.5 Kaplan-Meier cumulative survival curve of the final ensemble model. 89 v List of Tables 3.1 Listing of the 18 attractor signatures . 34 4.1 Top genes in terms of the rank sum for the prolonged time to recurrence. 57 4.2 Top differentially expressed genes in GBM versus lower grade glioma. 61 4.3 Cox regression using MES feature or subtypes in GBM as covariates. 63 5.1 Individual gene expression and survival. 74 5.2 Molecular features used in the model . 81 5.3 Cox proportional hazards model trained on molecular and clinical features on the basis of AIC. 86 5.4 Cox proportional hazards models trained on empirically selected features 87 5.5 CIs of the CIN feature and meta-PCNA index in four breast cancer data sets. 90 C.1 Top 100 genes of the mitotic CIN signature . 115 C.2 Top 100 genes of the MES signature . 115 C.3 Top 168 genes with consensus MI > 0:5 in the LYM signature . 116 C.4 Top 27 methylation sites with consensus MI > 0:5 in the M− signature . 117 C.5 Top 27 genes with consensus MI > 0:5 in the END signature . 117 C.6 Association of MES signature with tumor stage . 118 vi Acknowledgements I would like to thank Tai-Hsien Ou Yang for help with some of the work in this thesis, Linda Raynes for editing, Hsu-Cheng Huang, Wen-Chiang Hong, Chien-Yu Lin, Ching- Ting Chen, Ping-Chien Wu, and Yen-Yi Chen for their support and company. Most importantly, I wish to thank my advisor, Professor Dimitris Anastassiou, for being an inspiring mentor, a supportive supervisor, and an enthusiastic project leader, without whom I would have never completed this work. vii To my parents Yi-Chang Cheng and Wei-Ping Tsai, my brother Wei-Chun Cheng and to my wife Tsai-Wei Chiu viii Chapter 1 Introduction The goal of this thesis is to introduce several molecular signatures that are identified by a novel data mining algorithm across multiple genome-scale biomedical data sets, which we called attractor metagenes. These attractor molecular signatures were found in nearly identical form in multiple data sets of different cancer types, and therefore we hypothezie that they represent the unifying characteristics in cancer. The \pan-cancer" molecular signatures provided insights in generating novel hypotheses for cancer biology and helped build models used to predict clinical phenotypes. 1.1 Background Cancer is prevailingly viewed as a process of clonal evolution. In 1976, Peter Nowell proposed that tumor initiation occurs by an induced change in a single, previously normal cell, making the cell neoplastic and providing it with a selective growth advantage over adjacent normal cells [1]. This neoplastic proliferation then proceeds, allowing sequential selection of more aggressive sublines. As evidenced by its pathological manifestations, cancer is also recognized as many diseases, differing in each patient, which progressively evolve through interplay with a changing environment. The unique nature of cancer 1 demands a more individually-tailored treatment − one that identifies the characteristics of a tumor and adopts therapy accordingly. Even before the term \personalized medicine" emerged, oncologists focused on identifying histological or molecular markers in tumor specimens from individual patients which were associated with prognosis or responses to a specific treatment. Ever since the Human Genome Project [2] was completed, the information of individual genetic, genomic, epigenomic, and proteomic characteristics gradually increased their influence on medical decisions, practices, and products.

Attractor Molecular Signatures and Their Applications for Prognostic Biomarkers Wei-Yi Cheng

Case Report a Fetus with Kabuki Syndrome 2 Detected by Chromosomal Microarray Analysis

Genetic Basis of Simple and Complex Traits with Relevance to Avian Evolution

A Computational Approach for Defining a Signature of Β-Cell Golgi Stress in Diabetes Mellitus

Supplementary Table

Supplementary Table S4. FGA Co-Expressed Gene List in LUAD

Genetic Diagnosis and Respiratory Management Of

(P -Value<0.05, Fold Change≥1.4), 4 Vs. 0 Gy Irradiation

TCTE1 Is a Conserved Component of the Dynein Regulatory Complex and Is Required for Motility and Metabolism in Mouse Spermatozoa

Mapping the Landscape of Tandem Repeat Variability by Targeted Long Read Single Molecule Sequencing in Familial X-Linked Intelle

MALE Protein Name Accession Number Molecular Weight CP1 CP2 H1 H2 PDAC1 PDAC2 CP Mean H Mean PDAC Mean T-Test PDAC Vs. H T-Test

Whole Genome Sequencing of Familial Non-Medullary Thyroid Cancer Identiﬁes Germline Alterations in MAPK/ERK and PI3K/AKT Signaling Pathways

Human Induced Pluripotent Stem Cell–Derived Podocytes Mature Into Vascularized Glomeruli Upon Experimental Transplantation