BIOINFORMATICS APPROACHES TO CANCER BIOMARKER DISCOVERY
AND CHARACTERIZATION
By
PETER LEE MING LIAO
Submitted in partial fulfillment of the requirements for the degree of
Doctor of Philosophy
Systems Biology and Bioinformatics Program
CASE WESTERN RESERVE UNIVERSITY
May, 2018
CASE WESTERN RESERVE UNIVERSITY
SCHOOL OF GRADUATE STUDIES
We hereby approve the dissertation of
PETER LEE MING LIAO
candidate for the degree of Doctor of Philosophy.
Committee Chair
MEHMET KOYUTURK
Committee Members
JILL BARNHOLTZ-SLOAN
GOUTHAM NARLA
MASARU MIYAGI
Date of Defense
March 27, 2018
We also certify that written approval has been obtained for any proprietary material contained therein.
Table of Contents
List of Tables
List of Figures
Chapter 1
1.1 Background
1.2 Literature review
1.2.1 Biomarker identification in gliomas
1.2.2 Epigenetic clocks as biomarkers
1.2.3 Phosphorylation signaling pathways in cancer
1.2.4 Phosphorylation as a biomarker
1.2.5 Current phosphoproteomics methods
1.3 Specific Aims
Chapter 2
2.1 Abstract
2.2 Methods
2.2.1 DNA Methylation Data
2.2.2 Calculations of epigenetic age
2.2.3 Statistics and Survival
2.3 Results
2.3.1 Glioma epigenetic age
2.3.2 Horvath’s clock and epiTOC association
2.3.3 Epigenetic clocks and glioma subtype
2.3.4 Validation of Panglioma Epigenetic Clock Associations
2.3.5 Survival Modeling
2.3.6 Epigenetic aging of tumor recurrences
2.4 Discussion
Chapter 3
3.1 Abstract
3.2 Background
3.3 Materials and Methods
3.3.1 Cell culture
3.3.2 Phosphoproteomics
3.3.3 Data preparation
3.3.4 Kinase prediction scoring
3.3.5 Kinase activity scoring
3.3.6 Assessing significance of kinase activity change-scores
3.3.7 pKSEA Plotting
3.3.8 KSEA
3.3.9 Availability and Implementation
3.4 Results
3.5 Discussion
Chapter 4
4.1 Abstract
4.2 Background
4.3 Materials and Methods
4.3.1 Benchmarking data
4.3.2 pKSEA
4.3.3 Evaluating pKSEA performance
4.3.4 GBM data
4.3.5 Xenograft data
4.3.6 Data Preparation (GBM and xenograft data)
4.3.7 pKSEA (GBM and xenograft data)
4.3.8 pKSEA Plotting
4.4 Results
4.4.1 pKSEA Benchmarking
4.4.2 GBM Prognosis
4.4.3 Phosphatase activation and MEK inhibitor combination
4.5 Discussion
Chapter 5
5.1 Conclusions
5.2 Future directions
5.2.1 Epigenetic aging studies in other cancers
5.2.2 Expansion of kinase inference benchmarking
5.2.3 Use of phosphoproteomic data to assess kinase-substrate predictions
5.2.4 Improvement of pKSEA package and tool
List of Tables
Table 2-1 Clinical characteristics of the glioma sample set, arranged by study and by IDH, 1p/19q co-deletion status
Table 2-2 Associations between epiTOC and Horvath epigenetic age markers and patient age at diagnosis
Table 2-3 Associations between epiTOC and Horvath epigenetic age markers
Table 2-4 Clinical characteristics of the clock associations validation set, arranged by study and by IDH mutation status
Table 2-5 Patient cross-classification table across multiple classifications used
Table 2-6 Cox regression results on TCGA glioma survival (complete cases only)
Table 2-7 Cox regression results on TCGA glioma survival (pooled results, multiply imputed 100x)
Table 2-8 Clinical features comparison across complete and imputed cases
Table 3-1 KSEA Analysis Results
List of Figures
Figure 2-1 Model performance of calculation of reference epiTOC clock age for normal brain.
Figure 2-2 Comparison of epigenetic clocks to chronologic age.
Figure 2-3 Variations of epiTOC and Horvath’s clock age across normal samples by normal sample type.
Figure 2-4 Association of Horvath’s epigenetic clock with epiTOC.
Figure 2-5 Horvath clock acceleration of glioma tumors plotted by IDH-1p19q codeletion molecular subtype
Figure 2-6 epiTOC acceleration of glioma tumors plotted by IDH-1p19q codeletion molecular subtype.
Figure 2-7 Epigenetic age acceleration (glioma tissue epigenetic age – patient age at diagnosis) for TCGA glioma samples plotted by supervised methylation subtype.
Figure 2-8 Aggregated epigenetic clock validation data.
Figure 2-9 Epigenetic age acceleration (glioma tissue epigenetic age – patient age at diagnosis) for TCGA glioma samples plotted by supervised methylation subtype.
Figure 2-10 Plot of all primary and recurrent tumor sample epigenetic ages showing intratumoral heterogeneity and aging of tumor from primary to recurrence.
Figure 2-11 Epigenetic clock age versus time to recurrence.
Figure 2-12 Validation dataset for primary-recurrence.
Figure 3-1 Cartoon representation of two kinase inference methods
Figure 3-2 Diagrammatic representation of pKSEA.
Figure 3-3 Principal component analysis (PCA) of raw phosphoproteomics data from MDA-MB-231 cells treated with dasatinib, rapamycin, and combination treatment.
Figure 3-4 Results of pKSEA analysis on MDA-MB-231 cells treated with dasatinib, rapamycin, and dasatinib/rapamycin combination.
Figure 3-5 Filtered results of pKSEA analysis on MDA-MB-231 cells treated with dasatinib, rapamycin, and dasatinib/rapamycin combination.
Figure 3-6 Venn diagram showing overall complementarity of kinase sets identified by pKSEA as downregulated by rapamycin, dasatinib, and combination treatment.
Figure 3-7 Heatmap of kinase activity score correlations in permuted data, reflecting shared predicted substrates in phosphoproteomics data.
Figure 3-8 Heatmap of KSEA inferred kinase activity changes in MDA-MB231 across experimental conditions
Figure 3-9 Filtered heatmap of KSEA inferred kinase activity changes in MDA-MB231 across experimental conditions
Figure 3-10 Heatmap of pKSEA scores on MDA-MB-231 cells treated with dasatinib, rapamycin, and dasatinib/rapamycin combination, including cross-comparisons
Figure 3-11 Prediction-only results of pKSEA analysis on MDA-MB-231 cells. pKSEA results on cells treated with dasatinib, rapamycin, and dasatinib/rapamycin combination, with all known kinase-substrate pairs removed from data.
Figure 4-1 Median ROC curve assessing pKSEA performance on benchmarking data.
Figure 4-2 Median precision-recall curve assessing pKSEA performance on benchmarking data.
Figure 4-3 Median ROC curve assessing pKSEA performance on benchmarking data with all substrates with known kinases removed.
Figure 4-4 Median precision-recall curve assessing pKSEA performance on benchmarking data with all substrates with known kinases removed.
Figure 4-5 Median ROC curve assessing pKSEA performance on benchmarking data restricted to only substrates with known kinases.
Figure 4-6 Median precision-recall curve assessing pKSEA performance on benchmarking data restricted to only substrates with known kinases.
Figure 4-7 Heatmap of pKSEA significance score comparing inferred kinase activity differences between GBM STS and LTS survival groups
Figure 4-8 Filtered heatmap of pKSEA significance score comparing inferred kinase activity differences between GBM STS and LTS survival groups
Figure 4-9 Heatmap of pKSEA significance scores for H358 xenograft experiment.
Figure 4-10 Heatmap of pKSEA significance scores for H358 xenograft experiment, filtered for rows (kinases) that were significant in at least two columns.
List of Abbreviations
ABL Abelson murine leukemia viral oncogene homolog
ACVR1 Activin receptor type 1
ATM ataxia-telangiectasia mutated
BRAF B-RAF
CDK Cyclin-dependent kinase
CK Casein kinase
CLK Cdc-like kinase
CNS Central nervous system
EGFR Epidermal growth factor receptor
EPHA Ephrin receptor
epiTOC Epigenetic Timer of Cancer
ERBB2/HER2 Receptor tyrosine-protein kinase erbB2/human epidermal growth factor receptor 2
FGFR Fibroblast growth factor receptor
GBM Glioblastoma
G-CIMP Glioma CpG island methylator phenotype
IGFR Insulin-like growth factor receptor
JAK Janus kinase
KSEA Kinase substrate enrichment analysis
LC-MS/MS Liquid chromatography-tandem mass spectrometry
MAP2K, MEK Mitogen-activated protein kinase kinase, MAPK/ERK kinase
MAPK, ERK Mitogen-activated protein kinase, extracellular signal-regulated kinase
MGMT O6-methylguanine-DNA methyltransferase
MS Mass spectrometry
MTOR mechanistic target of rapamycin
NSCLC non-small cell lung cancer
PAK p21 activated kinase
PDGFRA Platelet-derived growth factor receptor alpha
PI3K phosphatidylinositol-4,5-bisphosphate 3-kinase
PKB Protein kinase B
pKSEA Prediction-based kinase substrate enrichment analysis
PP2A Protein phosphatase 2A
PTEN Phosphatase and tensin homolog
PTM post translational modification
RAF Rapidly accelerated fibrosarcoma
RAS Rat sarcoma
RTK Receptor tyrosine kinase
TERT Telomerase reverse transcriptase
TGFBR Transforming growth factor beta receptor
TSC Tuberous sclerosis
Bioinformatics Approaches to Cancer Biomarker Discovery and Characterization
Abstract
By
PETER LEE MING LIAO
Cancers are a heterogeneous set of diseases that are defined by uncontrolled cellular growth with the potential to invade or spread to adjacent and distant tissues. While sharing certain biological capabilities that define the development and behavior of all human malignancies, cancers are governed by complex molecular changes that are often tumor-specific. As a result, even tumors arising from the same cell type can exhibit highly divergent prognoses and treatment responses depending upon the underlying molecular mechanisms that are dysregulated and that drive their abnormal growth and cellular processes. New data collection methods grant researchers unprecedented capability to investigate and characterize cancers on a systems level. Rather than being restricted in measurement to a specific target molecule or set of molecules, “-omics” approaches allow experiments to identify and measure thousands of molecules at a time.
These “-omics” approaches can therefore characterize significant proportions of the genetic, transcript, protein, and post-translational modification landscapes that underlie and drive human malignancies. Because cancers represent such a diverse set of diseases, clinicians and researchers rely on biomarkers for a variety of uses in cancer, ranging from diagnosis to prognosis and prediction of treatment response. A good cancer biomarker is a molecular signal that is capable of distinguishing, for example, disease from normal,
high-risk from low-risk disease, or disease cases that may be particularly susceptible to
targeted treatments.
In this dissertation, I demonstrate the use of multiple bioinformatics tools for cancer
biomarker discovery and characterization. Models of epigenetic age, termed epigenetic
clocks, are investigated in gliomas and are shown to be associated with previously
defined prognostic molecular subtypes and are independently predictive of survival. I
introduce a novel method for phosphoproteomics analysis, termed pKSEA, which uses in
silico kinase-substrate predictions to infer changes in kinase activity. pKSEA is
described, benchmarked against previously published data, and compared to existing
methods. Three examples are provided of pKSEA analysis in cancer-related data,
identifying kinase activity signals that may be useful as biomarkers in identifying and
targeting high-risk glioblastomas, as well as identifying treatment-related phosphorylation signaling changes in response to kinase inhibition and phosphatase activation in cancer cells.
Chapter 1
Introduction
1.1 Background
As the improvement of molecular technologies continues to expand both the volume and
diversity of biological data available to researchers, the analytical techniques for
identifying useful and reproducible biomarkers from these data with impact on disease
have also evolved. The National Cancer Institute defines a biomarker as “a biological
molecule found in blood, other body fluids, or tissues that is a sign of a normal or
abnormal process, or of a condition or disease.” Within this broad category, molecules
that have been related to disease processes and outcomes include nucleic acids, peptides,
proteins, antibodies, and metabolites. In addition, the definition of “biomarker” is
commonly extended further to include any clinically useful molecular indicators,
including multi-molecule signatures and features that are predictive of future disease or
prognosis. As such, germline mutations, somatic mutations, changes in gene expression
and post-translational modifications of proteins have all been suggested as potential
biomarkers. In the context of cancer, biomarkers have demonstrated potential to inform
physicians and patients in making critical care choices, and have also simultaneously
guided researchers in the development of novel therapeutic strategies1,2. Cancer
researchers have used biomarkers not only to distinguish between normal and pathological states and processes but also to aid in disease prognosis3,4, to evaluate the pharmacodynamics of drugs5, to predict treatment response6, and to monitor disease progression7.
With the maturation of existing technologies such as DNA and RNA sequencing and the
development of new methods for acquiring biological data such as mass spectrometry,
the ability of researchers to uncover biomarkers is unprecedented. Despite the growing amount of data being generated and studied for the purpose of identifying cancer biomarkers, researchers face numerous obstacles to biomarker discovery, validation, approval, and clinical implementation.8 The reasons for these challenges are manifold,
including the fundamental biological complexities associated with disease, the difficulty
in capturing system-wide responses within the scope of biomarkers, and challenges in
sample standardization that complicate reliable biomarker identification, such as intra-
and intertumoral heterogeneity and differences between primary and disseminated disease9. Furthermore, analysis and quality control (QC) of the data themselves are not always
standardized and may differ between institutions and even within a single research
group.
Bioinformatics, or the application of computational approaches and applied statistics to
biology, can assist in untangling the biological challenges that are rooted in the
complexity and heterogeneity that characterize biological systems. Robust bioinformatics
analysis methods are capable of identifying generalizable biological patterns and are
necessary to make use of the growing body of high-quality, high-dimensional data. This dissertation focuses on the use of bioinformatics approaches and robust analysis methods
to assist in isolating useful signals from complex, heterogeneous biological data and
identifying biomarkers that may be of use to researchers and clinicians for the benefit of
cancer patients.
With the incorporation of bioinformatics approaches, the search for useful biomarkers can be expanded and refined by integrating more complex data types, allowing for a more complete view of biological processes and the capture of biological data on a systems level.
By incorporating novel data with existing knowledge, bioinformatics approaches can improve power to detect biomarker signatures hidden in biological data that would otherwise be missed and simultaneously improve interpretability of results by providing analysis in the context of previously established knowledge. This new generation of cancer biomarkers, like biomarkers already in use, has the potential to predict patient outcome, predict treatment response, monitor progression, or even guide development of novel therapeutics. Novel data acquisition methods such as mass spectrometry-based proteomics and metabolomics remain difficult to implement on a widespread clinical level. Nevertheless, these methods, together with the wider availability of large, high-quality genomic and transcriptomic data sets, require ever-improving bioinformatics approaches to define better-performing biomarkers that are of immediate interest to cancer researchers and that will impact clinical decision making.
1.2 Literature review
1.2.1 Biomarker identification in gliomas
Gliomas represent the most common type of malignant brain tumor, comprising 81% of malignant brain and central nervous system (CNS) tumors and 27% of all brain and CNS tumors diagnosed in the US.10 Gliomas are named for their cells of origin, glial cells, which are the most abundant cell type in the brain and which provide numerous support functions for neurons including structural support, nutrient and oxygen transfer, myelination, and
immune system functions including scavenging of dead cells and infectious agents.11
The World Health Organization (WHO) classifies gliomas into histologically and
molecularly defined categories. These categories are named for the glial cells they most
closely resemble histologically, incorporate molecular features including IDH mutation
and codeletion of chromosome arms 1p/19q, and are assigned one of four grades of
increasing tumor aggressiveness. They include astrocytomas and oligodendrogliomas
(Grade II), which share histological features with astrocytes and oligodendrocytes, as
well as their less differentiated, higher-grade anaplastic forms, anaplastic astrocytomas
and anaplastic oligodendrogliomas (Grade III). Glioblastomas (GBMs) are high-grade
diffuse gliomas (Grade IV). While gliomas are relatively rare in the general population,
with an average annual age-adjusted incidence of 6.2 per 100,000 from 2010 to 2014¹⁰, gliomas contribute significant morbidity and mortality, with GBMs, the most common type of glioma, carrying a 5-year survival rate of less than 6%.¹⁰
Histopathologically defined categories of glioma suffer from significant interobserver
variability, and aside from grade are poorly correlated with clinical outcome.12 As a
result, significant efforts have been directed toward classification of gliomas on a
molecular level to improve prognosis and guide therapeutic decisions. Genetic
aberrations of receptor tyrosine kinases (RTK) including EGFR, PDGFRA, IGFR-1 and
FGFR-1 are detected in over 80% of primary GBMs13, but pursuit of the most frequent alterations, such as those in EGFR, has not yielded a reliable prognostic biomarker, with studies showing equivocal results for these alterations as independent predictors of survival.14-18 Despite
these obstacles, RTKs remain of interest to researchers from a therapeutic standpoint, as
specific aberrations such as EGFRvIII mutation have shown potential as biomarkers for
targeted therapy and prediction of treatment response.17,19 O6-methylguanine-DNA methyltransferase (MGMT) has also been implicated as a useful biomarker for prognosis and prediction of treatment response. MGMT is a DNA repair enzyme that plays a
significant role in the direct repair of alkylated guanine, conferring tumor resistance to alkylating
chemotherapies. Methylation of the MGMT promoter, and the resulting transcriptional silencing of
MGMT, is associated with loss of MGMT expression and increased response to alkylating
chemotherapies such as temozolomide (TMZ).20-24
Large-scale efforts such as The Cancer Genome Atlas (TCGA) have further expanded the
search for biomarkers in gliomas to signatures defined across multiple molecular features.
Expression-based molecular classifications divided GBM into Proneural, Neural,
Classical, and Mesenchymal subtypes, which were enriched for patterns of somatic
mutation and copy number alteration suggestive of distinct cell lineages and development.25,26
Analysis of DNA methylation from gliomas identified a DNA methylation-based
phenotype, G-CIMP27, that is characterized by global hypermethylation of CpG islands,
is highly predictive of increased survival, and is largely determined by isocitrate
dehydrogenase (IDH) mutation status.28,29
As a whole, these efforts have resulted in identification of genomic tumor markers that
correlate with patient prognosis, including IDH mutation and codeletion of chromosomal arms 1p and 19q (1p/19q codeletion). The lack of IDH mutation in histologically defined low-grade gliomas is associated with a poor clinical prognosis that resembles that of GBMs, which
generally lack IDH mutation. Conversely, IDH mutations are observed in the majority of
lower-grade gliomas, and are associated with improved clinical outcomes. Of low-grade
gliomas with IDH mutations, 1p/19q codeletion is further associated with oligodendrogliomas and increased chemotherapeutic response. Additional evidence has
suggested that TERT (telomerase reverse transcriptase) promoter mutations may also
be a useful molecular marker for glioma classification.30 The validation of these
molecular classifications has prompted the WHO to incorporate molecular subclasses into the
classification of diffuse gliomas to prevent histological misclassifications and reduce
interobserver variability. According to 2016 WHO guidelines, non-GBM diffuse gliomas
are classified based on IDH mutation status and 1p/19q codeletion.31 GBMs
are likewise stratified by IDH mutation status under the 2016 WHO classification.
Recently, a study of GBM tumors with protein-level data identified a set of proteins with prognostic significance distinct from other known prognostic factors, emphasizing the potential of novel sources of data that can allow for improved association of biomarker and cancer phenotype32.
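The two-marker stratification described in this section can be written as a simple decision rule. The following is a minimal sketch in Python, assuming that IDH mutation and 1p/19q codeletion calls are already available for each tumor. The function name and label strings are hypothetical and are chosen for illustration; they are not taken from the WHO guidelines.

```python
from typing import Optional


def molecular_subtype(idh_mutant: Optional[bool],
                      codel_1p19q: Optional[bool]) -> str:
    """Return an illustrative IDH / 1p/19q subtype label for a diffuse glioma.

    None means the marker was not assessed. Labels are hypothetical.
    """
    if idh_mutant is None:
        return "unclassified (IDH status missing)"
    if not idh_mutant:
        # IDH-wildtype tumors are grouped together in this sketch,
        # regardless of 1p/19q status.
        return "IDH-wildtype"
    if codel_1p19q is None:
        return "IDH-mutant (1p/19q status missing)"
    if codel_1p19q:
        return "IDH-mutant, 1p/19q codeleted"
    return "IDH-mutant, 1p/19q non-codeleted"


if __name__ == "__main__":
    for idh, codel in [(True, True), (True, False), (False, False)]:
        print(idh, codel, "->", molecular_subtype(idh, codel))
```

Keeping missing marker calls as an explicit category, rather than defaulting them to wild-type, avoids silently misassigning tumors with incomplete molecular data.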
Efforts to identify additional biomarkers from the growing body of available data build
upon past efforts. As new methods allow for collection of more complex types of
molecular data, analytical methods for identifying biomarkers from those new types of
data must also evolve in turn. Novel biomarkers discovered using novel methods in new
and existing data have the potential to continue refining current disease classifications,
and identify key features involved in pathogenesis and progression of glioma and cancer
as a whole for the benefit of researchers, clinicians, and ultimately patients and their
families.
1.2.2 Epigenetic clocks as biomarkers
As evidenced by the important methylation-based prognostic signatures observed in
glioma, the availability of high-quality DNA methylation data in cancer has allowed for
novel investigation of epigenetic modifications that contribute to cancer initiation and
progression. While cancer research has traditionally focused on somatic mutation as a
driving factor, increasing evidence suggests that changes in the cancer epigenome
play a similarly important role in the development and progression of cancer.
DNA methylation, which involves the addition or removal of methyl groups on cytosines
in CpG dinucleotides and is involved in regulation of gene expression, is a major focus of
epigenetic research due to its important regulatory role in normal cell physiology. In
particular, growing evidence suggests that defects in DNA methylation contribute
significantly to cancer biology.33-36 Interestingly, studies have also observed that specific
CpG sites in the genome are methylated in an age-dependent manner37-40. In conjunction,
these two facts are of particular interest not only to the study of aging but to cancer
researchers, as age remains the single most important predictor of cancer incidence and
survival41-43.
Multiple models of “biological” age based on DNA methylation have been developed37-39,44
that show potential for predicting disease risk and survival in precancerous
tissue36,45,46, cancer47-49, and a variety of other disease contexts50. The rationale behind
such measures is that the advancement of biological processes can be measured directly
and may reflect the aging of cells and tissues more accurately than chronological age
does. One such epigenetic age predictor is the epigenetic
clock developed by Steve Horvath39. Horvath trained his epigenetic clock on a wide range of tissue types in order to produce a highly accurate predictor of age independent of tissue-type or mitotic potential. Horvath’s epigenetic clock utilized 82 different DNA methylation array data sets including 51 healthy tissues and cell types. Using an elastic
net regression model, 353 CpG probes were selected that were capable of accurately
predicting age across all tissue and cell types. This work suggests that there are universal
mechanisms at play in the aging of tissues that are independent of cell division number and the specific needs and functions of differentiated cells. Although the biological
processes that dictate Horvath’s clock and associate it with chronologic age were not
elucidated at the time and remain unknown, Horvath’s clock’s universal applicability
across cell types regardless of mitotic potential makes it particularly interesting as a
biomarker for studying epigenetic alteration and dysregulation in cancer. Notably,
Horvath attempted to apply his epigenetic clock to various cancer types in his seminal
study, but was unable to discern any notable utility of his epigenetic clock as a pan-
cancer biomarker. Horvath observed highly heterogeneous changes in epigenetic age
across cancer types, including cancer types that demonstrated highly accelerated aging,
cancer types that regressed in epigenetic age, and cancer types in which any significant
association of epigenetic age of tumor tissue with chronologic age of the patient was
abolished. As a result, the utility of Horvath’s clock and the functional relevance of
changes in epigenetic age when applied to the dysregulated and aberrant machinery of cancer cells remain unknown.
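For orientation, a fitted linear model such as an elastic net regression reduces, at prediction time, to an intercept plus a weighted sum of probe-level methylation values. The sketch below illustrates this form, along with epigenetic age acceleration taken as tissue epigenetic age minus patient age. The probe identifiers, weights, and intercept are invented for illustration and are not Horvath's coefficients, and any calibration steps used by the published clock are omitted.

```python
from typing import Dict, Mapping

# Hypothetical probe IDs and coefficients, for illustration only.
TOY_INTERCEPT = 30.0
TOY_WEIGHTS: Dict[str, float] = {"cg_a": 40.0, "cg_b": -25.0, "cg_c": 10.0}


def epigenetic_age(betas: Mapping[str, float],
                   weights: Mapping[str, float] = TOY_WEIGHTS,
                   intercept: float = TOY_INTERCEPT) -> float:
    """Predict age as intercept + sum(weight * beta) over the clock probes.

    `betas` maps probe ID to methylation fraction in [0, 1].
    """
    missing = [probe for probe in weights if probe not in betas]
    if missing:
        raise KeyError(f"missing clock probes: {missing}")
    return intercept + sum(w * betas[probe] for probe, w in weights.items())


def age_acceleration(epi_age: float, chronological_age: float) -> float:
    """Epigenetic age minus chronological age; positive means 'older' tissue."""
    return epi_age - chronological_age


if __name__ == "__main__":
    sample = {"cg_a": 0.5, "cg_b": 0.2, "cg_c": 0.1}
    age = epigenetic_age(sample)
    print(f"epigenetic age: {age:.1f}")
    print(f"acceleration vs. age 40: {age_acceleration(age, 40.0):+.1f}")
```

Raising an error on missing probes, rather than skipping them, is a deliberate choice in this sketch: dropping a weighted term would shift the predicted age without any visible warning.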
Because of the tissue-agnostic design of Horvath’s clock, the functional relevance of his epigenetic age predictor is hidden in the “black box” nature of the prediction model. In an attempt to more transparently reflect processes of interest to cancer biologists, other groups have since attempted to design epigenetic clocks with specific measures in mind.
One such effort resulted in the epigenetic Timer Of Cancer (epiTOC)45 developed by
Yang and colleagues, which was designed to reflect the number of mitotic divisions a cell has undergone. epiTOC is predicated on the methylation status of 385 CpG probes located at sites previously demonstrated to be unmethylated in fetal tissue and progressively methylated with cell division. With this design, epiTOC was able to reflect mitotic age in a variety of cell types, and Yang and colleagues demonstrated advancement of epiTOC in pre-cancerous and cancerous tissue compared to healthy tissue. Universal predictors of epigenetic age such as Horvath’s clock, in conjunction with more mechanistically derived markers such as epiTOC, provide tools for exploring epigenetic dysregulation in cancer as a source of potential biomarkers that require further study in specific disease contexts.
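Because the selected sites start unmethylated and gain methylation with cell division, one simple way to summarize them is the average methylation fraction across the probe set. The sketch below assumes this averaging form for illustration only; the text above does not give the scoring formula, and the probe identifiers and values are hypothetical.

```python
from statistics import mean
from typing import Iterable, Mapping, Optional


def mitotic_score(betas: Mapping[str, Optional[float]],
                  probes: Iterable[str]) -> float:
    """Average methylation fraction over the clock probes present in a sample.

    Probes that are absent or recorded as None are skipped (an assumption
    of this sketch, not a documented rule).
    """
    values = [betas[p] for p in probes if betas.get(p) is not None]
    if not values:
        raise ValueError("no clock probes measured in this sample")
    return mean(values)


if __name__ == "__main__":
    clock_probes = ["p1", "p2", "p3"]          # hypothetical probe IDs
    normal = {"p1": 0.05, "p2": 0.10, "p3": 0.15}
    tumor = {"p1": 0.30, "p2": 0.45, "p3": None}
    print("normal:", round(mitotic_score(normal, clock_probes), 3))
    print("tumor: ", round(mitotic_score(tumor, clock_probes), 3))
```

Under this assumed form, a higher score in tumor than in matched normal tissue would be read as advancement of the mitotic clock.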
1.2.3 Phosphorylation signaling pathways in cancer
Phosphorylation of proteins comprises one of the most diverse and widespread types of post-translational modification (PTM) in both healthy and cancerous cells, regulating
“nearly every aspect of cell life”.51 Phosphorylation, which involves the transfer of a
phosphate group from ATP (GTP in a small minority of cases) onto a receptive protein
sidechain (most commonly serine, followed by threonine and tyrosine), is capable of
dramatically altering a protein’s function and so acts as an important regulator in many
cellular processes52,53. Due to the addition of a hydrophilic, highly charged molecular
group, phosphorylation can have significant effects on the structural conformation of a
protein and therefore affect a protein’s activity in the case of enzymes and receptors as
well as its ability to interact with other proteins. The enzymes that facilitate the transfer
of phosphate groups onto protein substrates are known as kinases, and the enzymes that
facilitate the removal of phosphate groups in this reversible reaction are known as
phosphatases. In concert, phosphorylation and dephosphorylation events allow for
flexible cellular signal transduction and complex regulation of cellular processes via
signaling cascades, crosstalk, and feedback networks.
As phosphorylation plays an integral role in regulating key cellular processes including
cell survival, apoptosis, growth, and cell division, the dysregulation of phosphorylation
signaling in cancer is critically important and has been studied extensively54-57. The
importance of phosphorylation signaling as a critical player in disease has been
recognized since the dawn of molecular studies of cancer, with the discoveries that the
first oncogene, Src, encoded a kinase58 and that a number of dysregulated growth-related receptors implicated in cancer, including EGFR, were also kinases that transduced their cellular signals through autophosphorylation and phosphorylation-mediated activation of other kinases.59 While the nearly ubiquitous regulatory role of phosphorylation is often
taken for granted today, it is interesting to track the rapid expansion in phosphorylation
research.
While an exhaustive list of phosphorylation signaling pathways implicated in cancer would be
prohibitively long, several major pathways bear expanding upon as examples that
serve as a useful primer for the following discussions.
Common genetic aberrations observed in cancerous cells confer increased kinase activity and implicate kinases as common contributors to oncogenesis and a variety of cancer hallmarks. These genetic mechanisms of kinase activation include point mutations, gene amplifications, and gene fusions.60,61 Among the most frequently amplified and overexpressed kinases are the tyrosine kinases EGFR and ERBB2/HER2, the serine/threonine kinase PAK1, the lipid kinase PIK3CA, and the key cell cycle regulator serine/threonine kinases CDK4 and CDK6⁶². Activating point
mutations are commonly observed in a variety of kinase genes including ACVR1B, AKT1,
ATM, BRAF, EPHA2, JAK2, MAP2K1, TGFBR2, and MTOR⁶³, with common activating
gene fusions observed in ABL1, BRAF, EGFR, PIK3CA, and RAF1.
Of these commonly activated or overexpressed genes, CDK4 and CDK6 are themselves
regulated by the RAS-RAF-MAP2K (MEK)-MAPK1/3 (ERK) pathway, which plays an
important role in proliferation and the cell cycle via a variety of transcription-mediated
regulatory functions. RAS, a family of small GTPases, is activated by upstream receptor
tyrosine kinases (RTKs) such as EGFR. RAS then recruits RAF kinase and promotes its
activation, which in turn phosphorylates and activates MEK, which then phosphorylates
and activates ERK. ERK activates CDK4 and CDK6 indirectly by inducing expression of
Cyclin D1, which is necessary to create an active cyclin D-CDK4/6 complex and prompt
quiescent cells to enter the cell cycle. Furthermore, ERK has been shown to be required
for the proper translocation and activation of CDK2.64 In a parallel fashion, RTK
activation also leads to activation of pro-survival and proliferative cellular functions
through activation of the kinase PI3K, which indirectly activates the kinase AKT. AKT in
turn activates mTORC1 (mTOR), which is a driver of cell cycle progression and
proliferation, through inhibitory phosphorylation of TSC. mTOR then promotes
proliferative and survival programming through downstream effectors including p70S6K
and 4E-BP1⁶⁵, as well as other effectors that have been less extensively described.
Although the RAS-ERK and PI3K-mTOR pathways have been described here in a linear
and parallel fashion, the cross-talk between these two pathways is extensive, with many
intermediate effector substrates allowing for integration of pathway signaling and
feedback regulation66,67. RAS can directly bind and activate PI3K. ERK can also promote
mTOR activation through TSC, a substrate that it shares with AKT. Notably, the RAS-ERK
and PI3K-mTOR pathways converge on many downstream substrates that contribute to proliferation and cell survival. ERK, AKT, and p70S6K, for example, phosphorylate FOXO, BAD, and c-Myc, contributing converging signals that serve to promote survival, suppress apoptosis, and induce transcription of growth genes.
The RAS-ERK and PI3K-mTOR pathways are often dysregulated in cancer.
Approximately a third of all tumors have an activating mutation in a RAS gene68 and
around 8% have activating mutations in BRAF69. More than a third of cancer patients show alteration of PI3K pathway regulation, with PTEN, a key negative regulator of the pathway and tumor suppressor, being lost in 30% of tumors and with mutations in the PI3K catalytic subunit in 37% of endometrial and 31% of breast cancers70. Many kinase members of these
two pathways are well represented in the list of activating
genetic aberrations provided earlier in this section, and the presence of these identifiable
aberrant signaling nodes provides an appealing opportunity for targeted therapy, prompting
further efforts to identify actionable targets and develop compounds that can selectively
inhibit these dysregulated signaling processes.
The search for pharmacologic tools to specifically target these aberrant signaling
cascades has met with limited but notable success60,71. Prime examples include the
development of imatinib to inhibit BCR-ABL1, a driving, constitutively active kinase that results from the fusion of BCR and ABL1 commonly observed in chronic
myelogenous leukemia72, the development of erlotinib for EGFR-mutated cancers73, and the development of vemurafenib for melanomas that express the BRAF V600E mutation74. Although these
examples demonstrate the clinical potential of pharmacologically targeting aberrant
phosphorylation signaling, major obstacles to the development of effective kinase
inhibitor therapy remain, chief among them tumor development of resistance via acquisition of
resistance-conferring mutations, activation of downstream effectors, or exploitation of
signaling networks to rewire pathways and bypass the targeted kinase altogether.75-77 To
circumvent resistance, researchers have suggested both drugs directed against
multiple kinase sites and combination inhibitor therapies. Combination therapies can take
various forms. “Vertical” pathway inhibition involves multiple drugs targeting the same
pathway to prevent acquisition of resistance by reactivation of downstream effectors78.
An example of this strategy is combined BRAF and MEK inhibition in melanoma79.
“Horizontal” pathway inhibition involves targeting multiple parallel pathways to prevent
resistance acquired by pathway rewiring to circumvent a particular drugged kinase.
Parallel pathways can converge and exhibit significant cross-talk and shared regulation; as a result, cells can be capable of compensating for loss of one signaling pathway by augmenting another, a method of resistance that has been observed, for example, in induction of the PI3K-AKT-mTOR pathway in response to MAPK inhibition80.
Whether designing novel kinase inhibitors with refined specificity or mechanism of
action or identifying novel combinations that are capable of targeting multiple pathways
and preventing tumor resistance, a major challenge for kinase inhibitors is an incomplete
understanding of the signaling networks themselves. To better understand the
effects of a kinase inhibitor or combination of kinase inhibitors, new
experimental and analysis techniques may give improved insight by expanding our field of view from a targeted set of proteins within a pathway to systems-level changes in phosphorylation signaling.
1.2.4 Phosphorylation as a biomarker
Because phosphorylation signaling plays a significant role in regulation of many pathologic processes in cancer and kinase inhibition has demonstrated success as a mode of targeted therapy, protein phosphorylation has emerged as its own category of cancer
biomarker. Post-translational modifications (PTMs) serve as an example of the next generation of data that is still in its relative infancy but has significant potential as a biomarker. Inclusion of protein phosphorylation and other PTMs in our understanding of cancer continues the progress of tracing cancer from its origins as a genetic disease to the repercussions of foundational defects on higher biological system levels, namely, the transcriptome, proteome, phosphoproteome, and ultimately the integration and summation of all those interwoven levels in an observed phenotype.
Measurement of protein phosphorylation has the potential to identify tumor-specific vulnerabilities that may be amenable to treatment with existing drugs, track tumor response to a particular treatment, and also aid in identification of novel drug targets that may be of significance to specific cancer populations.
Protein phosphorylation has already demonstrated potential as a diagnostic biomarker in a number of cancers. As study of protein phosphorylation has historically been based on use of antibodies capable of detecting specific protein phosphorylation states, such antibodies have been included in array-based screens of cancers and phosphorylation markers have been implicated in cancer detection81, diagnosis82, and prognosis32,83-85. In
many cases, phosphorylation abundance at sites that regulate kinases themselves is
considered a marker of their activity, as kinases are often regulated by upstream kinases
and phosphatases whose signals are then transduced to downstream substrates.
Phosphorylation has also been used to assess tumor response to kinase inhibition in
cancer. An example of phosphorylation being used as a biomarker is measuring
phosphorylation of ribosomal protein S6 (S6), a substrate of p70S6K further downstream
of mTOR. S6 phosphorylation has been used to assess mTOR inhibition in PTEN-deficient
gliomas treated with rapamycin86. In breast cancer cells, reduced phosphorylation
of a variety of kinase targets including MAPKs and AKT was observed upon
treatment with trastuzumab and lapatinib, which are RTK inhibitors87. Additionally, researchers observed decreases in ERK activation of p70S6K1 in response to lapatinib, and cells resistant to lapatinib treatment exhibited hyperphosphorylation of p70S6K1 that was mitigated by rapamycin inhibition of mTOR, leading to resensitization. Phospho-specific antibodies have also been used as biomarkers of dasatinib inhibition of Src in colon cancer cells88. In these examples, we see use of phosphorylation abundance as
a biomarker for assessing treatment response, which can be of potential use for researchers
and clinicians in evaluating engagement of targeted pharmacologic agents and tailoring
of treatment doses.
1.2.5 Current phosphoproteomics methods
While antibody-based methods have long been the standard for interrogating protein
phosphorylation, advances in mass spectrometry (MS), including phosphorylation
enrichment techniques, have allowed for quantitative measurements to be performed on
thousands of phosphopeptides in a single experiment89-91.
Unlike antibodies, which require extensive resources to design, generate, and validate and
are tailored for specific sites of interest only, mass spectrometry allows for relatively
unbiased measurement of phosphopeptide abundances, and therefore is more suitable for
discovery-directed studies of the phosphoproteome. As such, mass spectrometry-based phosphorylation studies have assisted in identifying novel cancer biomarkers that are useful for identifying new kinase targets and clarifying inhibitor treatment effects on the overall phosphoproteome. Mass spectrometry has assisted in identifying novel implications of oncogenic kinases in non-small cell lung cancer (NSCLC)92.
Characterization of cellular phosphorylation responses to kinase inhibition has also led to identification of previously unknown phosphopeptides related to signaling pathways, leading to improved understanding of the phosphorylation targets affected by inhibition of different kinases within the same pathway93. In one example, a study of three PI3K pathway inhibitors led to identification of phosphorylation sites involved in the PI3K pathway that were affected in common by all three inhibitors as well as sites that were treatment specific. Additionally, this study showed that phosphorylation abundance of select sites implicated by their screen could be used as biomarkers for prediction of treatment response94.
Generally, MS-based phosphoproteomic experiments can be broken down into stages, including sample preparation, enrichment of phosphopeptides, liquid chromatography-tandem MS analysis (LC-MS/MS), and bioinformatics-based identification and quantification of phosphorylation sites and phosphopeptides. Each stage is foundational to the reliability and understanding of the stages that follow it, and the development of current phosphoproteomic experimental protocols as well as the variety of methods with their advantages and disadvantages is reviewed elsewhere89,90.
Following high-confidence identification of phosphorylation sites and phosphopeptides from LC-MS/MS data, analysis of phosphorylation abundances has been reliant upon bioinformatics and systems biology analytical methods. A major challenge faced by the relatively new field of phosphoproteomics is the reality that the vast majority of phosphorylation sites detected using mass-spectrometry methods have unknown functional relevance95 and may in fact be of little or no functional importance96.
Furthermore, validation of novel phosphopeptide sites is resource intensive and follow-up on many potential targets identified with phosphoproteomic screens is often infeasible using traditional genetic and antibody-based methods. As a result, numerous bioinformatics strategies have been developed to better select phosphoproteomic features for in-depth study or to integrate phosphoproteomic data into a pathway, network, or system-level analysis. One such strategy for overcoming issues with analyzing and interpreting individual phosphopeptide sites is to use changes in phosphorylation abundance to infer changes in kinase activity97. A variety of methods have been developed to infer kinase activity, ranging from enrichment analysis based on canonical kinase-substrate relationships, such as Kinase Substrate Enrichment Analysis (KSEA)98, to newer and more complex inference methods, such as IKAP99 (inference of kinase activities from phosphoproteomics), CLUE100 (CLUster evaluation), KinasePA101 (kinase perturbation analysis), and KARP102 (kinase activity ranking using phosphoproteomics data), which variably incorporate alternative scoring methods, time course data, interaction networks, statistical modeling, and machine learning approaches. Due to its simplicity in implementation and intuitive interpretation, KSEA remains a commonly
utilized method103,104, with recent benchmarking studies demonstrating reasonable ability
to identify expected perturbations in the phosphoproteome105.
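As a minimal sketch of the substrate-based enrichment scoring that KSEA-type methods perform, the Python below computes one z-score per kinase from the log2 fold changes of its annotated substrates. The scoring formula (mean substrate change minus the background mean, scaled by the square root of the substrate count over the background standard deviation) is one commonly used formulation and is an assumption here, not a restatement of the cited implementation. The function name, site identifiers, kinase-substrate map, and fold-change values are all hypothetical.

```python
import math
from statistics import mean, stdev


def ksea_zscores(log2fc, kinase_substrates, min_substrates=2):
    """Score each kinase by enrichment of its annotated substrates.

    log2fc: dict mapping phosphosite ID -> log2 fold change (treated vs control).
    kinase_substrates: dict mapping kinase -> list of annotated substrate site IDs.
    Returns a dict mapping kinase -> z-score (assumed formulation, see lead-in).
    """
    background = list(log2fc.values())
    bg_mean = mean(background)
    bg_sd = stdev(background)  # sample SD across all quantified sites
    scores = {}
    for kinase, sites in kinase_substrates.items():
        # Only substrates actually quantified in the experiment contribute.
        observed = [log2fc[s] for s in sites if s in log2fc]
        if len(observed) < min_substrates or bg_sd == 0:
            continue
        scores[kinase] = (
            (mean(observed) - bg_mean) * math.sqrt(len(observed)) / bg_sd
        )
    return scores


# Hypothetical site IDs, annotations, and fold changes, for illustration only.
log2fc = {"SUB1_S10": -2.0, "SUB2_T55": -1.8, "SUB3_S7": -0.2,
          "SUB4_S99": 0.1, "SUB5_Y12": 0.3, "SUB6_S3": 0.0}
kinase_substrates = {
    "KINASE_A": ["SUB1_S10", "SUB2_T55"],
    "KINASE_B": ["SUB4_S99", "SUB5_Y12", "SUB6_S3"],
    "KINASE_C": ["SUB9_S1"],  # no quantified substrates, so it is skipped
}
scores = ksea_zscores(log2fc, kinase_substrates)
```

In this toy input, SUB3_S7 has no annotated kinase and therefore contributes only to the background distribution, never to any kinase score.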
Although simple to implement and interpret, KSEA and similar methods are limited by
canonical definitions and hence ignore the majority of phosphorylation features identified
in any given phosphoproteomic screen. Currently, of roughly 280,000 identified
phosphorylation sites in the PhosphoSitePlus database106, fewer than 4% have even a single
identified kinase. As a result, kinase-substrate prediction has emerged as a separate field of bioinformatics, in which a variety of analytical methods utilize substrate sequence motifs, protein interaction networks, integration of different data types, and a variety of machine learning methods in order to accurately predict
kinase-substrate relationships. These methods include Scansite, iGPS (in vivo group-based prediction
system), PhosphoNET, and NetworKIN107-110.
NetworKIN, as a part of an integrated platform known as KinomeXplorer, was developed by Horn, Linding, and colleagues to predict kinase-substrate relationships by integrating kinase preferences for phosphorylating specific peptide sequences with protein-protein interaction (PPI) data to reflect the cellular context for phosphorylation107,111. Previous
studies have demonstrated that kinase specificity depends in part on recognition of specific
sequence motifs to varying degrees, allowing for limited prediction of novel kinase-substrate
relationships using machine learning-based motifs derived from experimentally
validated substrates108,112,113. These kinase sequence preferences were previously
employed by Linding’s group to develop NetPhorest114, which generated a set of kinase
prediction classifiers based upon a comprehensive analysis of 179 kinases, 104 phospho-
binding domains, and their consensus sequence motifs from experimental data.
NetworKIN improved upon the linear sequence motif-based classification method of
NetPhorest by including cellular contextual information in the form of protein interaction
networks. Given a specific phosphoprotein and the amino acid sequence surrounding a
phosphorylation site in question, NetworKIN provides a combined likelihood prediction
that is a reflection of the phosphoprotein’s network proximity to a kinase in the STRING
database115, as well as the probability of a kinase to phosphorylate at the given site based
upon the sequence-based predictions provided by NetPhorest. Like most kinase-substrate prediction tools, NetworKIN has primarily been assessed and optimized to classify canonically accepted kinase-substrate relationships in cross-validation type computational experiments, and its true performance (i.e., precision and recall) on novel phosphoproteomics data remains unknown.
1.3 Specific Aims
Problem statement: As methods of biological data acquisition and bioinformatics
analysis advance, new opportunities to discover and characterize novel cancer
biomarkers have arisen. Development of novel cancer biomarkers holds potential for
assisting in clinical decision-making as well as improving understanding of cancer
biology, and bioinformatics analytical approaches have become integral to extracting
useful information from increasingly large and complex datasets. The overall
goal of this project is both to apply methods of bioinformatics analysis for biomarker
discovery in novel areas and to develop and test novel methods for bioinformatics
analyses. In this project, I utilized epigenetic clocks to explore epigenetic aging as a
potential biomarker in glioma, and developed a novel method for phosphoproteomics
analysis that utilizes in silico prediction tools for inference of changes in kinase activity
in cancer.
The primary aims of this project are as follows:
Aim 1 (Chapter 2): In this Aim, I will test the hypothesis that the epigenetic age of
gliomas can provide insight into tumor behavior and is an independent predictor of
survival. Gliomas are a highly heterogeneous class of tumors that make up 81%
of malignant brain tumors and 27% of all brain and central nervous system tumors. Due
to their highly varied presentations and prognoses, considerable effort has been invested
in improving glioma prognosis and classification using molecular biomarkers, with
notable successes ranging from classification based upon somatic mutations such as
IDH1/2 or TERT promoter mutation, to epigenetic biomarkers such as MGMT promoter methylation or the G-CIMP hypermethylator phenotype. Numerous recent studies have suggested epigenetic aging can be a useful biomarker in a variety of disease contexts, including as a biomarker of precancerous and cancerous tissue. Additionally, epigenetic clocks have been designed to be reflective of total mitotic aging and proposed as tools for studying cancer. Epigenetic aging holds potential for providing a DNA methylation- based biomarker for use in glioma characterization and prognosis, but to date no focused study of epigenetic aging of gliomas has been performed.
Aim 2a (Chapter 3): In this aim, I will develop a novel inference method for kinase
activity changes based on in silico kinase-substrate predictions, and apply it to a model biological system exploring kinase inhibitor treatment in breast cancer cells. Protein phosphorylation is a critical regulator of cellular functions and is involved in practically
every aspect of cancer biology. While study of protein phosphorylation has traditionally
relied upon targeted antibodies to quantify phosphorylation and interrogate
phosphorylation signaling pathways, modern methods such as mass spectrometry enable
systems-level capture of phosphorylation abundances across thousands of
phosphorylation sites in a single experiment. Due to a dearth of functional annotation of
phosphorylation sites, however, researchers have been faced with challenges prioritizing
biologically relevant phosphorylation features for further study and interpreting
phosphoproteome-level data. One approach bioinformaticians have developed to utilize
existing knowledge to inform analysis and interpretation of phosphoproteomics data is
inferring kinase activity based on enrichment of experimentally validated substrates, but
current methods remain limited in accessibility and performance.
Aim 2b (Chapter 4): In this aim, I will benchmark my novel method for kinase activity
inference, compare it to an existing method, and demonstrate two more applications of
my method for cancer biomarker discovery and characterization. Kinase activity
inference tools are difficult to benchmark in a systematic manner due to specific data
requirements of some methods and limited phosphoproteomic coverage. Additionally,
kinase-substrate prediction methods are benchmarked against canonical substrates in
silico and have not been evaluated for their predictive performance in experimental data.
A method for kinase activity inference that can incorporate in silico kinase-substrate predictions has a dual advantage: it expands the amount of data used in inference of kinase activity, and it can be used to assess the performance of in silico kinase-substrate predictions in detecting expected perturbations in benchmarking experimental data.
Chapter 2
Models of epigenetic age capture patterns of DNA methylation in glioma associated with molecular subtype, survival, and recurrence.
In press as: Peter Liao, Quinn T Ostrom, Lindsay Stetson, Jill S Barnholtz-Sloan; Models of epigenetic age capture patterns of DNA methylation in glioma associated with molecular subtype, survival, and recurrence, Neuro-Oncology, noy003, https://doi.org/10.1093/neuonc/noy003
2.1 Abstract
Background: Models of epigenetic aging (epigenetic clocks) have been implicated as
potentially useful markers for cancer risk and prognosis. Using two previously published
methods for modeling epigenetic age, Horvath’s clock and epiTOC, we investigated
epigenetic aging patterns related to WHO grade and molecular subtype as well as
associations of epigenetic aging with glioma survival and recurrence.
Methods: Epigenetic ages were calculated using Horvath’s clock and epiTOC on 516
lower grade glioma and 141 glioblastoma cases along with 136 non-tumor (normal) brain samples. Associations of tumor epigenetic age with patient chronologic age at diagnosis were assessed with correlation and linear regression, and associations were validated in an independent cohort of 203 gliomas. Contribution of epigenetic age to survival prediction was assessed using Cox proportional hazards modeling. 63 samples from 18 patients with primary-recurrent glioma pairs were also analyzed and epigenetic age difference and rate of epigenetic aging of primary-recurrent tumors were correlated to time to recurrence.
Results: Epigenetic ages of gliomas were near-universally accelerated using both
Horvath’s clock and epiTOC compared to normal tissue. The two independent models of epigenetic aging were highly associated with each other and exhibited distinct aging patterns reflective of molecular subtype. epiTOC was found to be a significant independent predictor of survival. Epigenetic aging of gliomas between primary and
recurrent tumors was found to be highly variable and not significantly associated with time to recurrence.
Conclusions: We demonstrate that epigenetic aging reflects coherent modifications of the epigenome and can potentially provide additional prognostic power for gliomas.
Importance of the study: Epigenetic age, a measure that is modeled using age-associated hyper- or hypo-methylation of specific regions of the genome, has been suggested as a potentially useful marker in cancer prediction and prognosis. Not only is age the greatest single predictor of cancer risk, but studies have demonstrated epigenetic age acceleration in precancerous tissue and suggested that epigenetic aging in cancerous tissue reflects coherent epigenetic modifications. Here we investigate a focused application of two independent epigenetic aging models to high quality DNA methylation data obtained in gliomas, which comprise the majority of malignant brain tumor diagnoses but represent a highly heterogeneous class of tumors in terms of histology, molecular characterization, and prognosis. We demonstrate that this focused approach can yield insight into coherent modifications of the epigenome related to prognostic subtypes of glioma, and show that epigenetic aging of glioma tumor tissue can provide insight into survival and recurrence.
2.2 Methods
2.2.1 DNA Methylation Data
Illumina HumanMethylation450 BeadChip DNA methylation data116 in normalized beta values format (Level 3) was downloaded from The Cancer Genome Atlas (TCGA)
Legacy Archive (http://cancergenome.nih.gov/) for all available lower grade glioma
(LGG) and glioblastoma (GBM) cases. 516 LGG cases in TCGA had methylation data
generated using the Illumina 450k platform, and 142 GBM cases had methylation data
generated with the 450k platform and were included in this analysis. Relevant clinical
data and case annotations (including molecular subtyping) on TCGA glioma cases were
obtained from the most recent TCGA glioma study published by Ceccarelli et al28 (Table
2-1). One GBM case with 450k methylation data was not annotated, and therefore was
excluded from analysis for a total of 657 DNA methylation profiles (516 LGG and 141
GBM). Normalized beta values generated on the Illumina HumanMethylation450
BeadChip on normal brain tissue were obtained on 136 samples (including glial, neural,
and bulk samples) collected post mortem from 58 individuals as part of a previous study
published by Guintivano et al117. All normal brain methylation array data is publicly available under GEO accession GSE41826.
For the primary-recurrent analysis, raw IDAT methylation data generated on the Illumina
HumanMethylation450 BeadChip on glioma recurrences was obtained from 63 samples from 18 patients from a study previously published by Mazor et al118. For this dataset,
normalized beta values were calculated from raw intensity files using the minfi package
(v1.20.2) in R using quantile normalization preprocessing119. These samples represent mostly primary-first recurrence pairs, with additional samples composed of multiple samples from the same tumor. If multiple samples were taken from the same tumor, the epigenetic age of that tumor was estimated as the average epiTOC and Horvath epigenetic age of all samples that were taken from the single tumor. In one case, patient
4, samples were taken from second and third recurrent tumors as well. One patient was dropped from analysis because records indicated that the second sample was residual disease rather than recurrence. All raw primary and recurrence methylation array intensity data is available under European Genome-Phenome Archive accession
EGAS00001001255.
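The per-tumor averaging described above can be sketched as follows. This is a minimal illustration in Python rather than the R workflow used in this study; the function name, tuple layout, patient and tumor labels, and all values are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def tumor_level_ages(samples):
    """Average clock values over all samples taken from the same tumor.

    samples: iterable of (patient_id, tumor_label, horvath_age, epitoc_score).
    Returns {(patient_id, tumor_label): (mean_horvath_age, mean_epitoc_score)}.
    """
    grouped = defaultdict(list)
    for patient, tumor, horvath, epitoc in samples:
        grouped[(patient, tumor)].append((horvath, epitoc))
    return {
        key: (mean(h for h, _ in values), mean(e for _, e in values))
        for key, values in grouped.items()
    }


# Hypothetical values: two samples from one primary tumor, one from its recurrence.
samples = [
    ("P01", "primary", 60.0, 0.10),
    ("P01", "primary", 70.0, 0.14),
    ("P01", "recurrence1", 80.0, 0.20),
]
ages = tumor_level_ages(samples)
```

Keying on the (patient, tumor) pair keeps primary tumors and each recurrence separate, so a patient with second and third recurrences simply yields additional keys.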
Table 2-1 Clinical characteristics of the glioma sample set, arranged by study and by IDH, 1p/19q co-deletion status

                          By Study                                  By IDH and 1p19q Co-deletion Status
Features                  LGG                  GBM                  IDH-wt               IDH-mut-codel        IDH-mut-noncodel     Unknown
                          (n=516)              (n=141)              (n=219)              (n=169)              (n=257)              (n=12)
Histology (n)
  Astrocytoma             169                  0                    52                   4                    112                  1
  Glioblastoma            0                    134                  119                  0                    6                    9
  Oligoastrocytoma        114                  0                    15                   30                   69                   0
  Oligodendroglioma       174                  0                    19                   117                  37                   1
  Unknown                 59                   7                    14                   18                   33                   1
Grade (n)
  G2                      216                  0                    19                   81                   114                  2
  G3                      241                  0                    67                   70                   104                  0
  G4                      0                    134                  117                  0                    6                    9
  Unknown                 59                   7                    14                   18                   33                   1
Age
  Median (LQ-UQ)          41 (33-53)           60 (52-69)           59 (51-66)           45 (35-54)           36 (30-44)           50 (43-58)
  Unknown (n)             59                   7                    26                   30                   45                   1
Survival
  Median (CI)             87.4 (67.4-130.7)    13.1 (11.3-16.7)     14.6 (11.6-18.4)     134.2 (78.2-Inf)     79.9 (63.5-Inf)      40 (17.6-Inf)
  Unknown (n)             59                   7                    26                   30                   45                   1
KPS (n)
  <70                     16                   21                   25                   5                    7                    0
  70-80                   60                   53                   63                   17                   27                   6
  90                      111                  6                    25                   32                   60                   0
  100                     75                   15                   17                   29                   41                   3
  Unknown                 254                  46                   89                   86                   122                  3
MGMT Promoter
  Methylated              425                  60                   84                   168                  227                  6
  Unmethylated            91                   75                   130                  1                    29                   6
  Unknown                 0                    6                    5                    0                    1                    0
Epigenetic Clock
  Horvath Clock Age
    Median (LQ-UQ)        72.8 (55.4-95.3)     76.3 (61.5-91.7)     74.2 (55.6-86.4)     96.7 (76.9-120.3)    63.2 (50.8-76.0)     70.6 (55.3-76.9)
  epiTOC Age
    Median (LQ-UQ)        0.114 (0.092-0.147)  0.142 (0.100-0.196)  0.131 (0.091-0.184)  0.147 (0.117-0.182)  0.100 (0.087-0.116)  0.118 (0.095-0.174)
CI: 95% confidence interval. LQ: lower quartile. UQ: upper quartile. KPS: Karnofsky Performance Score. Survival, Unknown: number of cases where no survival time or status was available (not censored survival).
Validation data for glioma subtyping associations was aggregated from three glioma
DNA methylation studies.29,120,121 Data is available under GEO accessions GSE30339,
GSE36278, and GSE61160. Pediatric gliomas were excluded, resulting in 203 glioma cases available for validation. Validation data for primary-recurrent analysis was obtained from Bai et al122, and is available under European Genome-Phenome Archive accession EGAS00001001588, representing 24 individual primary-recurrent glioma pairs.
2.2.2 Calculations of epigenetic age
Epigenetic ages were calculated on all samples using R v.3.3.2 according to previously published methods39,45. Horvath clock age was calculated for LGG, GBM, normal, and primary-recurrent studies separately with normalization feature applied. Age acceleration for Horvath’s clock was defined as Horvath’s predicted age – chronological age at diagnosis/tumor resection. Age acceleration for epiTOC was defined as epiTOC value – predicted epiTOC value for a given age based on linear regression of epiTOC aging on normal brain samples. Performance of the linear regression model of epiTOC aging in normal brain was assessed and considered adequate for purposes of calculating epiTOC age acceleration (Figure 2-1).
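The two age acceleration definitions above can be sketched as follows. This is an illustrative Python version, not the R code used in this study; it assumes an ordinary least squares fit of epiTOC value on chronological age in normal brain, and the function names and all sample values are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def horvath_acceleration(predicted_age, chronological_age):
    """Horvath acceleration: predicted epigenetic age minus chronological age."""
    return predicted_age - chronological_age


def epitoc_acceleration(epitoc, age, slope, intercept):
    """epiTOC acceleration: observed value minus the value expected for a
    normal brain sample of the same chronological age."""
    return epitoc - (intercept + slope * age)


# Hypothetical normal-brain reference samples (age in years, epiTOC score).
normal_age = [20, 40, 60, 80]
normal_epitoc = [0.04, 0.06, 0.08, 0.10]
slope, intercept = fit_line(normal_age, normal_epitoc)

# Hypothetical tumor diagnosed at age 50.
tumor_epitoc_accel = epitoc_acceleration(0.15, 50, slope, intercept)
tumor_horvath_accel = horvath_acceleration(72.8, 50)
```

Because epiTOC is a unitless score rather than an age, its acceleration is expressed as a residual from the normal-tissue trend instead of a difference in years.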
Figure 2-1 Model performance of calculation of reference epiTOC clock age for normal brain. Linear regression line with 95% confidence interval is shown, plotted with normal samples only color-coded by sample type (above), and in context of glioma (below). While variability exists in the linear model of normal epiTOC aging, model performance was deemed acceptable given the intended use of calculating epiTOC acceleration as the difference between tumor epiTOC age and “normal” predicted epiTOC age using the linear model.
2.2.3 Statistics and Survival
All statistical analysis was performed in R v3.3.2. Multiple imputation of Karnofsky
Performance Score (300 cases), MGMT promoter methylation (6 cases), and Panglioma
DNA methylation cluster (6 cases) was done via the mice package123 in R using predictive
mean matching. Multiple imputation was performed 100 times with each imputation
undergoing 10 iterations to generate pooled estimates for survival modeling. Imputation
performance was assessed by plotting imputed distributions against complete
data distributions, which showed no signs of bias. Linear model diagnostics were performed and linear model assumptions were not shown to be violated. Comparison of clock associations was performed using Pearson correlations and linear modeling, with hypothesis testing performed on Pearson correlations. Comparison of strength of correlations was performed by Williams test between two correlations sharing one variable. Hypothesis testing of the interaction between IDH-mutation/1p19q codeletion status and epiTOC/Horvath association was performed on the coefficients of the interaction terms using linear modeling. Reported R-square values were calculated using linear modeling on reported variables. Multivariable Cox regression analysis was performed on pooled multiple imputations utilizing the survival package. All variables were tested for violations of the proportional hazards assumption, and none were found, either by visual inspection of Schoenfeld residuals or by Schoenfeld test of each variable, to have significant time-dependent coefficient estimates.
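The comparison of two correlations that share one variable can be illustrated with a short sketch of Williams's t statistic. The formula below is the commonly cited form of the test and is an assumption about the exact variant applied in this study; the function name, argument names, and example values are hypothetical.

```python
import math


def williams_t(r_jk, r_jh, r_kh, n):
    """Williams's t for H0: rho_jk == rho_jh, where variable j is shared.

    r_jk, r_jh: the two correlations being compared (both involve variable j).
    r_kh: correlation between the two non-shared variables.
    n: number of samples. Returns (t, degrees_of_freedom) with df = n - 3.
    """
    # Determinant of the 3x3 correlation matrix of j, k, and h.
    det = 1 - r_jk**2 - r_jh**2 - r_kh**2 + 2 * r_jk * r_jh * r_kh
    r_bar = (r_jk + r_jh) / 2
    denom = 2 * ((n - 1) / (n - 3)) * det + r_bar**2 * (1 - r_kh) ** 3
    t = (r_jk - r_jh) * math.sqrt((n - 1) * (1 + r_kh) / denom)
    return t, n - 3


# Hypothetical example: clock A correlates 0.8 with clock B and 0.5 with
# chronological age, while clock B correlates 0.6 with chronological age.
t_stat, df = williams_t(r_jk=0.8, r_jh=0.5, r_kh=0.6, n=100)
```

A two-sided p-value would then be read from the t distribution with n - 3 degrees of freedom.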
2.3 Results
2.3.1 Glioma epigenetic age
In order to investigate changes in epigenetic aging and its possible utility as a prognostic marker in glioma, we applied Horvath’s clock and epiTOC to 657 total glioma DNA methylation profiles classified as 141 glioblastoma (GBM) and 516 lower grade glioma
(LGG) cases in The Cancer Genome Atlas (TCGA). We elected to use Horvath’s clock due to its universal applicability across tissue types39 that makes it uniquely suited for estimating epigenetic age of tumors without bias as to tumor heterogeneity and composition. Horvath’s clock is not significantly attributable to any specific cellular functions, however, so we complemented our study with epiTOC, a DNA methylation- driven mitotic clock previously shown to be advanced in pre-cancerous and cancerous tissues in association with expected mitotic age45. It should be noted that epiTOC values are a unitless score reflecting cumulative DNA methylation that has been previously shown to be correlated to estimated number of cell divisions, but does not have a precise quantitative interpretation.
We first examined the clocks’ correlation with patient chronological age to investigate the extent of epigenetic age dysregulation. When applied to glioma tissue, both Horvath’s clock and epiTOC exhibited near universal acceleration compared to normal brain
(Figure 2-2).
Figure 2-2 Comparison of epigenetic clocks to chronologic age. A) epiTOC age versus chronologic age at diagnosis, color coded by World Health Organization (WHO) grade. B) epiTOC age versus chronologic age at diagnosis, color coded by IDH mutation/1p-19q codeletion status. C) Horvath clock age versus chronologic age at diagnosis, color coded by WHO grade. D) Horvath clock age versus chronologic age at diagnosis, color coded by IDH mutation/1p-19q codeletion status.
Some small variations were observed across cellular components in normal brain (Figure
2-3), but these differences were dwarfed by the changes in epigenetic age observed in
comparison to glioma tissue. Acceleration of Horvath’s clock in brain tumors is
consistent with Horvath’s previous results in smaller, non-TCGA GBM datasets124. LGGs
also followed a general pattern of age acceleration. Both clocks appeared to show age-
dependent variance, with tumors diagnosed in older patients showing greater levels of
acceleration from normal clock values. Despite the observed age acceleration, modest
correlation between age at diagnosis and epiTOC (Pearson correlation coeff = 0.59, R² =
0.35, p < 2.2E-16) as well as age at diagnosis and Horvath’s clock (Pearson coeff = 0.50,
R² = 0.25, p < 2.2E-16) was maintained in LGG (Figure 2-2, Table 2-2). This correlation was notably weaker in GBM than LGG (Pearson correlation coeff = 0.17, 0.32; R² = 0.02,
0.09; p < 2.2E-16 and 2.40E-4 for epiTOC and Horvath, respectively), possibly due to the fact that GBMs are generally diagnosed in older patients and older patients appear to have greater clock variance at diagnosis compared to younger patients. Additionally,
Horvath’s clock appeared to be less accelerated in GBMs compared to LGGs, a distinction that was not observed using epiTOC. Both Horvath’s clock and epiTOC recapitulated their reliability as aging markers in normal tissue, with Horvath’s clock being highly predictive of age and epiTOC showing modest but steady advancement in normal brain tissue with increasing age (Figure 2-3, Table 2-2).
Table 2-2 Associations between epiTOC and Horvath epigenetic age markers and patient age at diagnosis
                   epiTOC                                        Horvath
                   Pearson's Coeff  Adjusted R2  P-val           Pearson's Coeff  Adjusted R2  P-val
Normal             0.48             0.24         1.21E-12        0.95             0.89         < 2.20E-16
LGG                0.50             0.25         < 2.20E-16      0.59             0.35         < 2.20E-16
GBM                0.17             0.02         < 2.20E-16      0.32             0.09         2.40E-04
IDHwt              0.28             0.07         4.80E-05        0.51             0.26         4.33E-15
IDH-mut-codel      0.63             0.39         < 2.20E-16      0.66             0.44         < 2.20E-16
IDH-mut-noncodel   0.46             0.21         2.49E-13        0.64             0.41         < 2.20E-16
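As an illustration of how the quantities in Table 2-2 relate to one another, the following Python sketch computes a Pearson coefficient, an adjusted R2 for a single-predictor linear fit, and epigenetic age acceleration (clock age minus age at diagnosis). The sample values and function names are hypothetical and are not drawn from the TCGA data, and the adjusted R2 formula assumes a simple linear regression of clock age on chronologic age.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def adjusted_r2(r, n, n_predictors=1):
    """Adjusted R2 for a linear fit with n samples and n_predictors predictors."""
    r2 = r * r
    return 1 - (1 - r2) * (n - 1) / (n - n_predictors - 1)


# Hypothetical ages (years); not taken from the study data.
chronologic_age = [25, 34, 41, 52, 60, 68]
clock_age = [58, 49, 88, 95, 79, 131]

r = pearson_r(chronologic_age, clock_age)
adj_r2 = adjusted_r2(r, len(chronologic_age))

# Age acceleration: epigenetic age minus patient age at diagnosis.
acceleration = [c - a for c, a in zip(clock_age, chronologic_age)]

print(f"r = {r:.2f}, adjusted R2 = {adj_r2:.2f}")
print("acceleration:", acceleration)
```

A positive acceleration value means the tissue appears epigenetically older than the patient.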
Figure 2-3 Variations of epiTOC and Horvath’s clock age across normal samples by normal sample type.
2.3.2 Horvath’s clock and epiTOC association
Horvath’s clock is independent of mitosis by design, whereas epiTOC is designed to be a mitotic clock, and the two clocks share only a single CpG probe. Hence, we next investigated whether the age measurements of these two clocks were associated with each other in glioma. We found a stronger association between the two epigenetic age measurements in glioma tumor tissue than between either clock and chronologic age (Figure 2-4, Table 2-3), and this difference was statistically significant (epiTOC:Horvath’s clock versus epiTOC:chronologic age and Horvath’s clock:chronologic age, p = 6.2E-22 and 5.4E-10, respectively, in GBM and p = 5.7E-22 and 7.3E-11, respectively, in LGG). This suggests that glioma DNA methylomes are still coherently modified even though the expected associations with chronologic age in normal tissue have deteriorated. Of further note, the slope of the association between the two clocks is shallowest in normal tissue, indicating relatively slow advancement of the mitotic clock relative to epigenetic age as measured by Horvath’s clock (fitted linear model slope = 1.6E-4, Figure 2-4, Table 2-3). In LGG, however, the slope of this relationship is markedly steeper than in normal tissue, and in GBM it is steeper still (fitted linear model slopes = 1.25E-3 and 2.40E-3 for LGG and GBM, respectively), suggesting that although the clocks display a degree of coherent modification, the association between these changes in methylation differs by tumor type. This effect, modeled as the interaction effect of LGG versus GBM on epiTOC age against Horvath’s clock age, was statistically significant (coeff = -0.0010, p < 2E-16, LGG:Horvath Age on epiTOC).
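The interaction term reported above can be read as a difference in slopes: in a linear model of epiTOC on Horvath age with group-specific intercepts and slopes, the interaction coefficient for LGG (with GBM as reference) equals the LGG slope minus the GBM slope. The sketch below illustrates this with hypothetical, noise-free data; the function name and all values are assumptions for illustration, and the p-value of the interaction is not computed.

```python
def ols_fit(x, y):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical Horvath clock ages (years) for two tumor groups.
horvath_gbm = [60.0, 80.0, 100.0, 120.0]
horvath_lgg = [70.0, 90.0, 110.0, 130.0]

# Hypothetical epiTOC scores generated from exact lines with different slopes.
epitoc_gbm = [0.02 + 0.0024 * h for h in horvath_gbm]
epitoc_lgg = [0.05 + 0.00125 * h for h in horvath_lgg]

slope_gbm, _ = ols_fit(horvath_gbm, epitoc_gbm)
slope_lgg, _ = ols_fit(horvath_lgg, epitoc_lgg)

# Interaction of group with Horvath age, GBM as the reference group.
interaction_lgg_vs_gbm = slope_lgg - slope_gbm
print(f"GBM slope = {slope_gbm:.5f}, LGG slope = {slope_lgg:.5f}")
print(f"interaction (LGG vs GBM) = {interaction_lgg_vs_gbm:.5f}")
```

A negative interaction coefficient under this coding indicates a shallower epiTOC-versus-Horvath slope in LGG than in GBM.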
Figure 2-4 Association of Horvath’s epigenetic clock with epiTOC. A) All glioma samples with epiTOC age plotted against Horvath clock age, color coded by World Health Organization (WHO) grade. B) All glioma samples with epiTOC age plotted against Horvath clock age, color coded by IDH mutation/1p-19q codeletion status. C) Overlap between methylation probe sets used to calculate Horvath’s clock, epiTOC, and used in supervised subtyping of IDH-wt gliomas. D) IDHwt samples with epiTOC age plotted against Horvath clock age, color coded by methylation cluster subtype.
2.3.3 Epigenetic clocks and glioma subtype
Studies have previously demonstrated that a small number of key molecular features are
capable of categorizing gliomas into distinct groups with demonstrable differences in
DNA methylation, RNA expression, and clinical outcome28,125. We examined whether
there were differences in epigenetic age across these known glioma molecular subtypes,
in particular those defined by IDH1/2 mutation status and 1p19q codeletion. Wildtype
IDH (IDH-wt) gliomas make up the vast majority of GBMs, and histologically
categorized LGGs that are IDH-wt have similar clinical outcomes to GBMs. When LGGs
were categorized into known prognostic molecular subtypes, IDH-wt LGGs showed a
similar epigenetic aging profile to GBM, underscoring a growing understanding of IDH-wt
gliomas as members of a common glioma subgroup regardless of WHO grade28
(Figure 2-2, Figure 2-4). Because of this similarity, we reexamined the associations that
were previously observed across WHO grade using IDH mutation/1p-19q codeletion status instead. When broken down by IDH mutation/1p-19q codeletion status, each group exhibited distinct epigenetic aging patterns (Figure 2-2B, Figure 2-5).
Table 2-3 Associations between epiTOC and Horvath epigenetic age markers
epiTOC v Horvath
                   Pearson's Coeff  Adjusted R2  P-val        Linear regression slope
Normal             0.55             0.31         < 2.2E-16    1.60E-04
LGG                0.78             0.61         < 2.2E-16    1.25E-03
GBM                0.83             0.68         < 2.2E-16    2.30E-03
IDHwt              0.83             0.68         < 2.2E-16    2.14E-03
IDH-mut-codel      0.62             0.64         < 2.2E-16    1.31E-03
IDH-mut-noncodel   0.80             0.38         < 2.2E-16    6.43E-04
Figure 2-5 Horvath clock acceleration of glioma tumors plotted by IDH-1p19q codeletion molecular subtype. Welch t-test p-values are displayed above brackets indicating groups being tested for statistical significance.
Tumors with IDH mutation and 1p19q codeletion (IDH-mut-codel) showed the highest
levels of age acceleration as measured using Horvath’s clock and IDH-wt gliomas showed the lowest levels of age acceleration. This finding was interesting not only
because it reflects broad methylome changes that distinguish IDH-mut gliomas, which
are known to be globally hypermethylated at CpG islands (i.e., G-CIMP)27, but also
because these levels of age acceleration reflect an overall negative association of age
acceleration with survival in gliomas, as IDH-wt gliomas and IDH-mut-codel gliomas
have the worst and best prognoses of the three subtypes, respectively. These differences
in epigenetic age, while still statistically significant, were less clear using epiTOC and
did not appear to reflect any apparent trend (Figure 2-6). Notably, the association
observed previously between epiTOC and Horvath’s clock also appeared to depend on IDH mutation/1p19q codeletion status, with IDH-wt showing the steepest epiTOC/Horvath clock slope followed by IDH-mut-codel and then IDH-mut-noncodel gliomas (Figure 2-4B, Table 2-3). The interaction between IDH-mutation/1p19q codeletion status and Horvath’s clock age was statistically significant (coeff = 6.7E-4 and 8.4E-4, p = 2.6E-9 and 1.48E-14, for IDH-mut-noncodel and IDH-wt, respectively; interaction with Horvath’s age on epiTOC).
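The subtype comparisons of age acceleration in Figures 2-5 and 2-6 rely on Welch t-tests. A minimal sketch of the Welch statistic and its Welch-Satterthwaite degrees of freedom is given below. The acceleration values and group labels are hypothetical, and the p-value lookup against the t distribution is omitted because it is not available in the Python standard library.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom for two samples."""
    va = variance(a) / len(a)  # squared standard error of the mean, group a
    vb = variance(b) / len(b)  # squared standard error of the mean, group b
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Hypothetical Horvath age acceleration values (clock age minus age at diagnosis).
accel_idh_mut_codel = [62.0, 55.0, 71.0, 48.0, 66.0]
accel_idh_wt = [21.0, 35.0, 18.0, 40.0, 27.0, 30.0]

t_stat, dof = welch_t(accel_idh_mut_codel, accel_idh_wt)
print(f"t = {t_stat:.2f}, df = {dof:.1f}")
```

Unlike Student's t-test, this form does not assume equal variances between subtypes, which is a reasonable choice when group sizes and spreads differ.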
The IDH-wt subgroup was then further categorized into DNA methylation-based signature subgroupings previously identified by Ceccarelli et al. in a recent TCGA panglioma study.28 Despite sharing only a relatively small proportion of probes between subtyping signature probes and clock probes (Figure 2-4C), these subgroups based on methylation clustering exhibited distinct patterns of epigenetic aging (Figure 2-7). The Classic-like DNA methylation subtype demonstrated the highest level of epiTOC mitotic acceleration compared to other methylation subtypes; however, the IDH-mutant-codel subtype had the highest level of age acceleration according to Horvath’s clock. No significant differences in epigenetic age were observed between the G-CIMP-high and G-CIMP-low IDH-mutant subtypes identified by Ceccarelli et al., which is interesting given the expectation of global methylation differences between the two according to their previously published description. Strong reversals of epigenetic age were observed in the LGm6-GBM subtype and PA-like subtype. Interestingly, these two subgroups were indistinguishable using the CpG probe sets utilized by Ceccarelli et al. and were distinguished by histology. However, our epigenetic clock aging study suggests the PA-like group is additionally distinguishable from LGm6-GBM by a severely regressed epiTOC age compared to other IDH-wt gliomas and even compared to normal brain (Figure 2-4D, Figure 2-7). These findings as a whole contribute evidence that epigenetic age in gliomas may reflect coherent changes in the tumor methylome that are biologically and clinically relevant.
Figure 2-6 epiTOC acceleration of glioma tumors plotted by IDH-1p19q codeletion molecular subtype. Welch t-test p-values are displayed above brackets indicating groups being tested for statistical significance.
Figure 2-7 Epigenetic age acceleration (glioma tissue epigenetic age – patient age at diagnosis) for TCGA glioma samples plotted by supervised methylation subtype.
2.3.4 Validation of Panglioma Epigenetic Clock Associations
To validate the associations we observed in gliomas across tumor epiTOC and Horvath’s clock age, patient age at diagnosis, and methylation subtype, we aggregated Illumina 450k DNA methylation data across several glioma studies, which to our knowledge represent all large-scale 450k DNA methylation data currently available in gliomas29,120,121 (Table 2-4). We ran tests similar to those in our analysis of the TCGA data. Although 1p/19q co-deletion information was not available for the entire validation set, we observed similar patterns of epigenetic clock associations with patient age at diagnosis, as well as epiTOC association with Horvath’s clock depending on IDH-mutation status (Figure 2-8). Furthermore, we observed similar patterns of epigenetic age acceleration across previously identified methylation subtypes in our validation set (Figure 2-9) compared to our initial TCGA dataset. As a whole, these findings suggest that the patterns of epigenetic aging in glioma are reproducible and represent coherent changes in epigenetic modification. Epigenetic aging patterns appear to be associated with IDH-mutation status as well as the methylation subtype classifications previously identified by Ceccarelli et al. Distinguishable patterns in epigenetic age of these different subtypes suggest that outside of signature molecular features such as IDH-mutation status and methylation subtype signature, these glioma subtypes are subject to coherent modifications of their epigenome that can be estimated by their epigenetic age.
Table 2-4 Clinical characteristics of the clock associations validation set, arranged by study and by IDH mutation status
                            By LGG versus GBM                         By IDH and 1p19q Co-deletion Status
Features                    LGG (n=117)         GBM (n=86)          IDH-wt (n=107)      IDH-Mutant (n=96)
Study (n)
  Mur                       46                  0                   14                  32
  Sturm                     0                   77                  61                  16
  Turcan                    71                  9                   32                  48
Histology (n)
  Astrocytoma               31                  0                   14                  17
  Glioblastoma              0                   86                  69                  17
  Oligoastrocytoma          19                  0                   9                   10
  Oligodendroglioma         63                  0                   11                  52
  Other/non-glioma          4                   0                   4                   0
Grade (n)
  G2                        52                  0                   10                  42
  G3                        65                  0                   28                  37
  G4                        0                   86                  69                  17
Age
  Median (LQ-UQ)            48 (37-59)          47 (38-52)          50 (42-58)          45 (35-53)
  Unknown (n)               14                  0                   3                   11
Survival
  Median (CI)               68.1 (45.1-Inf)     14 (12-17)          15.2 (14-19.9)      110.8 (69.4-Inf)
  Unknown (n)               21                  37                  33                  25
Epigenetic Clock
  Horvath Clock Age
    Median (LQ-UQ)          90.3 (69.6-114.6)   83.6 (70.0-98.4)    79.5 (64.1-93.8)    97 (79.5-124.8)
  epiTOC Age
    Median (LQ-UQ)          0.145 (0.117-0.178) 0.157 (0.129-0.200) 0.146 (0.103-0.205) 0.147 (0.129-0.174)
CI- 95% confidence interval. LQ- Lower quartile, UQ- Upper quartile. Survival- Unknown is number of cases where no survival time or status was available; not censored survival
Figure 2-8 Aggregated epigenetic clock validation data with A) epiTOC and B) Horvath’s clock age plotted against patient age at diagnosis, C) epiTOC plotted against Horvath’s clock, color coded by IDH-mutation status. D) IDH-wt validation data plotting epiTOC against Horvath’s clock, color coded by supervised methylation subtype.
Table 2-5 Patient cross-classification table across multiple classifications used.
Cross-classification (n) LGG GBM Complete Cases Incomplete Cases
IDH-wt 94 125 127 92
IDH-mut-codel 169 0 83 86
IDH-mut-noncodel 250 7 134 123
Unknown 3 9 0 12
Figure 2-9 Epigenetic age acceleration (glioma tissue epigenetic age – patient age at diagnosis) for TCGA glioma samples plotted by supervised methylation subtype.
2.3.5 Survival Modeling
To assess whether either epigenetic clock was associated with glioma survival, we
performed Cox regression modeling on the pan-glioma TCGA dataset with known
glioma survival predictors, including age at diagnosis, WHO grade, IDH-mutation/1p-
19q codeletion status, MGMT promoter methylation20, and Karnofsky Performance Score
(KPS)126. A large proportion of glioma cases in TCGA were missing KPS information;
therefore, Cox regression was performed on both complete cases and an imputed
data set for comparison (using multiple imputation123). Cox modeling of survival on
complete cases only affirmed the predictive power of age, IDH-mutation/1p-19q codeletion status, WHO grade, and KPS across all models tested (Table 2-6). Inclusion of epiTOC with the base model in complete cases demonstrated a significant, negatively associated effect on survival (p=0.016), but Horvath’s clock did not (p=0.10). When both clocks were included in the model, however, epiTOC marginally lost its statistically significant association (p=0.06), possibly due to collinearity with Horvath’s clock.
Imputation of missing KPS allowed for inclusion of 236 additional cases with 55 additional recorded death events in the analysis (Table 2-8). Pooled analysis again affirmed the importance of age, IDH-mutation/1p-19q codeletion status, and WHO grade across all models, but significant association of KPS with survival was lost. This larger analysis reaffirmed epiTOC as significantly associated with survival in addition to the base model (p=0.025, Table 2-7). An additional, significant negative association of
Horvath’s clock age with survival was detected (p = 0.0003, Table 2-7). Addition of both clocks to the base model showed no significant contribution of epiTOC to survival, but the negative association of Horvath’s clock to survival persisted (p=0.003). The loss of
significance of epiTOC in the combined epigenetic clocks model is likely due to
collinearity. As previously mentioned, epiTOC and Horvath’s clock are highly correlated
(Pearson’s coeff = 0.76, p <2.2E-16 across all samples). Multivariable model results as
well as detailed demographic information for both complete and imputed cases are
included for completeness (Table 2-7, Table 2-6, Table 2-8). Taken as a whole, these
results suggest a possible negative association between epigenetic age of glioma tissue
and patient survival in addition to known associated clinical variables and molecular
features.
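Pooled estimates from multiply imputed data are conventionally combined with Rubin's rules; the source does not state which pooling procedure was used, so the following is a sketch under that assumption. Each imputed data set yields a coefficient and a standard error from its own Cox fit, and the pooled variance adds the between-imputation variance to the average within-imputation variance. The coefficient values below are hypothetical.

```python
import math
from statistics import mean, variance


def pool_rubin(estimates, std_errors):
    """Pool per-imputation estimates and standard errors using Rubin's rules."""
    m = len(estimates)
    pooled_estimate = mean(estimates)
    within = mean([se ** 2 for se in std_errors])  # average within-imputation variance
    between = variance(estimates)                  # between-imputation variance
    total_variance = within + (1 + 1 / m) * between
    return pooled_estimate, math.sqrt(total_variance)


# Hypothetical Cox coefficients for one covariate across five imputed data sets.
betas = [-0.012, -0.015, -0.011, -0.014, -0.013]
ses = [0.004, 0.004, 0.005, 0.004, 0.004]

pooled_beta, pooled_se = pool_rubin(betas, ses)
hazard_ratio = math.exp(pooled_beta)
print(f"pooled beta = {pooled_beta:.4f}, SE = {pooled_se:.4f}, HR = {hazard_ratio:.3f}")
```

Because the between-imputation term can only add variance, the pooled standard error is never smaller than the root of the average within-imputation variance, which is one way imputation uncertainty enters the reported confidence intervals.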
This survival association could not be independently validated in our validation
dataset, however. With inclusion of age at diagnosis, IDH-mutation status, and WHO
grade as predictive variables, epiTOC and Horvath clock age were not associated
significantly with survival (p = 0.44, 0.97 for epiTOC and Horvath’s clock age,
respectively). However, it should be noted that our validation data was significantly
smaller (n = 203) than even the TCGA complete, non-imputed dataset. Furthermore,
because our validation dataset was aggregated across multiple studies, molecular and clinical annotations were incomplete, forcing us to run a simplified model lacking 1p/19q codeletion status, KPS, and MGMT promoter methylation. While our findings in TCGA data suggest there may be an independent association between epigenetic age and survival in glioma, it remains to be seen whether this finding can be validated in similarly large
DNA methylation studies.
Table 2-6 Cox regression results on TCGA glioma survival (complete cases only)
Base
Variable                                  Beta Coeff     SE  95% CI            p
Age at Diagnosis                                0.05   1.05  [0.03, 0.08]      3.10E-06 *
Gender (male)                                   0.03   1.03  [-0.41, 0.47]     9.04E-01
IDH-mut 1p/19q status (IDHmut-non-codel)        1.00   2.73  [0.17, 1.83]      1.77E-02 *
IDH-mut 1p/19q status (IDHwt)                   1.31   3.70  [0.33, 2.28]      8.51E-03 *
KPS                                            -0.03   0.97  [-0.05, -0.02]    2.06E-04 *
MGMT Promoter (unmethylated)                    0.68   1.97  [0.13, 1.23]      1.48E-02 *
Grade (II)                                     -1.15   0.32  [-2.01, -0.29]    9.03E-03 *
Grade (III)                                    -0.23   0.79  [-0.84, 0.37]     4.49E-01

Base + epiTOC
Variable                                  Beta Coeff     SE  95% CI            p
Age at Diagnosis                                0.06   1.06  [0.03, 0.08]      8.10E-07 *
epiTOC                                         -5.61   0.00  [-10.19, -1.04]   1.61E-02 *
Gender (male)                                  -0.03   0.97  [-0.47, 0.4]      8.78E-01
IDH-mut 1p/19q status (IDHmut-non-codel)        0.69   1.99  [-0.18, 1.55]     1.21E-01
IDH-mut 1p/19q status (IDHwt)                   1.15   3.14  [0.16, 2.13]      2.25E-02 *
KPS                                            -0.03   0.97  [-0.05, -0.02]    7.66E-05 *
MGMT Promoter (unmethylated)                    0.52   1.67  [-0.04, 1.07]     6.80E-02
Grade (II)                                     -1.30   0.27  [-2.17, -0.44]    3.06E-03 *
Grade (III)                                    -0.30   0.74  [-0.91, 0.31]     3.30E-01

Base + Horvath's Clock
Variable                                  Beta Coeff     SE  95% CI            p
Age at Diagnosis                                0.06   1.06  [0.04, 0.09]      5.89E-07 *
Gender (male)                                   0.03   1.03  [-0.41, 0.47]     8.91E-01
Horvath Age                                    -0.01   0.99  [-0.02, 0]        1.01E-01
IDH-mut 1p/19q status (IDHmut-non-codel)        0.78   2.19  [-0.09, 1.65]     7.75E-02 *
IDH-mut 1p/19q status (IDHwt)                   1.01   2.74  [-0.03, 2.04]     5.66E-02 *
KPS                                            -0.03   0.97  [-0.05, -0.02]    1.78E-04 *
MGMT Promoter (unmethylated)                    0.57   1.78  [0.01, 1.14]      4.50E-02 *
Grade (II)                                     -1.18   0.31  [-2.05, -0.32]    7.47E-03 *
Grade (III)                                    -0.21   0.81  [-0.82, 0.41]     5.10E-01

Base + epiTOC + Horvath's Clock
Variable                                  Beta Coeff     SE  95% CI            p
Age at Diagnosis                                0.06   1.06  [0.03, 0.08]      1.81E-05 *
epiTOC                                         -5.92   0.00  [-12.14, 0.29]    6.18E-02
Gender (male)                                  -0.04   0.96  [-0.48, 0.4]      8.67E-01
Horvath Age                                     0.00   1.00  [-0.01, 0.02]     8.86E-01
IDH-mut 1p/19q status (IDHmut-non-codel)        0.69   2.00  [-0.18, 1.57]     1.20E-01
IDH-mut 1p/19q status (IDHwt)                   1.17   3.23  [0.12, 2.23]      2.96E-02 *
KPS                                            -0.03   0.97  [-0.05, -0.02]    7.51E-05 *
MGMT Promoter (unmethylated)                    0.52   1.68  [-0.04, 1.08]     6.70E-02
Grade (II)                                     -1.31   0.27  [-2.17, -0.44]    2.99E-03 *
Grade (III)                                    -0.31   0.73  [-0.93, 0.31]     3.24E-01

N=344, deaths=94. 313 observations deleted due to missing-ness (300 missing KPS, 12 missing IDH information, 66 missing survival information). Reference groups: Gender (female), IDH-mut 1p/19q status (IDHmut-codel), MGMT Promoter (methylated), Grade (IV). Abbreviations: KPS- Karnofsky Performance Score
Table 2-7 Cox regression results on TCGA glioma survival (pooled results, multiply imputed 100x)
Base
Variable                                  Beta Coeff     SE  95% CI            p
Age at Diagnosis                                0.05   0.01  [0.03, 0.06]      4.08E-09 *
Gender (male)                                   0.35   0.18  [-0.01, 0.7]      5.50E-02
IDH-mut 1p/19q status (IDHmut-non-codel)        0.73   0.32  [0.1, 1.35]       2.31E-02 *
IDH-mut 1p/19q status (IDHwt)                   1.64   0.37  [0.91, 2.37]      9.57E-06 *
KPS                                            -0.01   0.01  [-0.03, 0.01]     2.21E-01
MGMT Promoter (unmethylated)                   -0.01   0.21  [-0.43, 0.41]     9.75E-01
Grade (II)                                     -1.04   0.33  [-1.69, -0.39]    1.65E-03 *
Grade (III)                                    -0.31   0.24  [-0.77, 0.15]     1.88E-01

Base + epiTOC
Variable                                  Beta Coeff     SE  95% CI            p
Age at Diagnosis                                0.05   0.01  [0.04, 0.07]      5.86E-10 *
epiTOC                                         -3.85   1.72  [-7.22, -0.49]    2.48E-02 *
Gender (male)                                   0.32   0.18  [-0.03, 0.67]     7.67E-02
IDH-mut 1p/19q status (IDHmut-non-codel)        0.50   0.33  [-0.15, 1.15]     1.34E-01
IDH-mut 1p/19q status (IDHwt)                   1.47   0.38  [0.72, 2.21]      1.24E-04 *
KPS                                            -0.01   0.01  [-0.03, 0.01]     1.79E-01
MGMT Promoter (unmethylated)                   -0.08   0.22  [-0.5, 0.35]      7.23E-01
Grade (II)                                     -1.21   0.34  [-1.88, -0.55]    3.41E-04 *
Grade (III)                                    -0.42   0.24  [-0.89, 0.05]     7.83E-02

Base + Horvath's Clock
Variable                                  Beta Coeff     SE  95% CI            p
Age at Diagnosis                                0.06   0.01  [0.05, 0.08]      3.91E-12 *
Gender (male)                                   0.28   0.18  [-0.08, 0.63]     1.24E-01
Horvath Age                                    -0.01   0.00  [-0.02, -0.01]    2.68E-04 *
IDH-mut 1p/19q status (IDHmut-non-codel)        0.35   0.33  [-0.31, 1]        3.00E-01
IDH-mut 1p/19q status (IDHwt)                   1.07   0.40  [0.29, 1.85]      7.20E-03 *
KPS                                            -0.01   0.01  [-0.03, 0.01]     1.58E-01
MGMT Promoter (unmethylated)                   -0.04   0.21  [-0.46, 0.38]     8.42E-01
Grade (II)                                     -1.24   0.33  [-1.89, -0.58]    2.14E-04 *
Grade (III)                                    -0.42   0.24  [-0.89, 0.04]     7.55E-02

Base + epiTOC + Horvath's Clock
Variable                                  Beta Coeff     SE  95% CI            p
Age at Diagnosis                                0.07   0.01  [0.05, 0.08]      7.88E-12 *
epiTOC                                          0.87   2.33  [-3.69, 5.43]     7.08E-01
Gender (male)                                   0.28   0.18  [-0.08, 0.63]     1.23E-01
Horvath Age                                    -0.02   0.01  [-0.03, -0.01]    3.46E-03 *
IDH-mut 1p/19q status (IDHmut-non-codel)        0.36   0.34  [-0.3, 1.02]      2.82E-01
IDH-mut 1p/19q status (IDHwt)                   1.06   0.40  [0.28, 1.84]      8.04E-03 *
KPS                                            -0.01   0.01  [-0.03, 0.01]     1.62E-01
MGMT Promoter (unmethylated)                   -0.03   0.22  [-0.45, 0.39]     8.87E-01
Grade (II)                                     -1.22   0.34  [-1.88, -0.56]    3.02E-04 *
Grade (III)                                    -0.41   0.24  [-0.88, 0.06]     8.74E-02

N=580, deaths=149. 77 observations deleted due to missing-ness (12 missing IDH information, 66 missing survival information). Reference groups: Gender (female), IDH-mut 1p/19q status (IDHmut-codel), MGMT Promoter (methylated), Grade (IV). Abbreviations: KPS- Karnofsky Performance Score
Table 2-8 Clinical features comparison across complete and imputed cases
Features                    Complete Cases (n = 344)    Imputed Cases (n = 313)
Grade (n)
  G2                        114                         102
  G3                        147                         94
  G4                        83                          51
  Unknown                   0                           66
Age at Diagnosis (years)
  Median (LQ-UQ)            48 (37-59)                  44 (33-58)
  Unknown (n)               0                           66
Gender (n)
  Female                    164                         94
  Male                      180                         153
  Unknown                   0                           66
Survival (mo)
  Median (LQ-UQ)            12.0 (5.7-22.7)             11.5 (3.9-26.9)
  Unknown (n)               0                           66
KPS (n)
  <70                       37                          0
  70-80                     106                         7
  90                        117                         0
  100                       84                          6
  Unknown                   0                           300
MGMT Promoter (n)
  Methylated                248                         237
  Unmethylated              96                          70
  Unknown                   0                           6
CI- 95% confidence interval. LQ- Lower quartile, UQ- Upper quartile. KPS- Karnofsky Performance Score. Survival- Unknown is number of cases where no survival time or status was available; not censored survival
2.3.6 Epigenetic aging of tumor recurrences
We further investigated changes in epigenetic age across low-grade primary gliomas and
paired recurrences from 16 patients in a previous study published by Mazer et al.118. We
found diverse changes in epigenetic age across glioma recurrences. Recurrent tumors
showed a variety of aging changes, with some tumors having marginally regressed age,
some having near equivalent aging compared to normal tissue, and some demonstrating
highly accelerated aging between their primary and recurrent tumors (Figure 2-10). This
diversity was observed across both epigenetic clocks. The rate of this aging was not
associated with histopathological diagnosis of the primary or recurrent tumors, nor
treatment type. The primary-recurrent analysis also yielded interesting insight into
heterogeneity of epigenetic age within a tumor. We observed that even within samples
obtained from the same tumor, epigenetic ages could vary significantly, as seen in the
primary tumors of patients 1, 4, 18, and 90, and the first recurrent tumor of patient 1
(Figure 2-10). Interestingly, comparing epigenetic age of different tumor portions to
published predicted lineage118 using somatic mutation and methylation dynamics shows
that in some cases the epigenetic age reflects similarity of the tumor to germline cells,
which is the case for the “youngest” sample from Patient 90 (identified in original publication as Patient90 Initial C), but can also be counterintuitive, as the outlying primary tumor sample from Patient 18 (Patient18 Initial A) is predicted to be closely related in lineage to at least two of the other primary samples.
Figure 2-10 Plot of all primary and recurrent tumor sample epigenetic ages showing intratumoral heterogeneity and aging of tumor from primary to recurrence. Numbers in the center of points designate patient number from original publication.118
Figure 2-11 Epigenetic clock age versus time to recurrence. A) epiTOC age differences between primary and recurrent tumors plotted against time to recurrence. Pearson’s correlations = -0.19 and -0.50, p = 0.47 and 0.047 between time to recurrence and epiTOC difference and epiTOC aging rate, respectively. B) Horvath’s clock age differences between primary and recurrent tumors plotted against time to recurrence. Pearson’s correlations = -0.56 and -0.64, p = 0.025 and 0.0084 between time to recurrence and Horvath age difference and Horvath aging rate, respectively.
A negative correlation was observed between time to recurrence and epigenetic age difference between primary and recurrent tumors using Horvath’s clock (Figure 2-11,
Pearson’s correlation -0.56, p = 0.025); however, this finding could not be recapitulated in a validation dataset of LGG methylation122 (Figure 2-12, Pearson’s correlation 0.18, p
= 0.39). Although no additional significant associations were found between primary-recurrent epigenetic age differences and time to recurrence, it should be noted that this may still reflect an underlying relationship between epigenetic aging and recurrence. If primary-recurrent gliomas all aged at similar rates, we would have expected an overall
increase in epigenetic age differences as time to recurrence increased; however, this was
not observed in either test or validation dataset. To illustrate this point visually, we
adjusted differences in epigenetic age by time to recurrence to obtain an estimate of
epigenetic aging rate over the period of time between primary-recurrent resections
(Figure 2-11, Figure 2-12). The observation that epigenetic age differences are relatively
unassociated with time to recurrence suggests that tumors that recur quickly might be
aging more quickly, and that tumors that recur slowly are aging at a relatively slower
rate. Neither clock age in primary or recurrent tumors was significantly associated with
survival or time to recurrence when considered alone.
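The reasoning above can be made concrete with a small sketch of the aging rate calculation, taken here as the primary-to-recurrent difference in epigenetic age divided by time to recurrence. The function name, the time unit, and all values are hypothetical. Two scenarios are contrasted: tumors that all age at the same rate, and tumors whose age difference is unrelated to time to recurrence.

```python
def aging_rate(primary_age, recurrent_age, time_to_recurrence):
    """Epigenetic age gained per unit time between primary and recurrent resection."""
    return (recurrent_age - primary_age) / time_to_recurrence


# Hypothetical times to recurrence (arbitrary but consistent time unit).
times = [1.0, 2.0, 4.0, 8.0]
primary = 50.0  # hypothetical epigenetic age of every primary tumor

# Scenario 1: every tumor ages at the same rate, so differences grow with time.
recurrent_same_rate = [primary + 3.0 * t for t in times]
rates_same = [aging_rate(primary, r, t) for r, t in zip(recurrent_same_rate, times)]

# Scenario 2: the age difference is the same regardless of time to recurrence,
# which implies faster aging in tumors that recur sooner.
recurrent_flat = [primary + 12.0 for _ in times]
rates_flat = [aging_rate(primary, r, t) for r, t in zip(recurrent_flat, times)]

print("constant-rate scenario:", rates_same)
print("flat-difference scenario:", rates_flat)
```

In the second scenario the computed rate falls as time to recurrence rises, which is the pattern the text describes as consistent with quickly recurring tumors aging faster.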
Figure 2-12 Validation dataset for primary-recurrence. A) epiTOC age differences between primary and recurrent tumors plotted against time to recurrence. Pearson’s correlations= 0.06 and -0.45, p =0.78 and 0.027 between time to recurrence and epiTOC difference and epiTOC aging rate, respectively. B) Horvath’s clock age differences between primary and recurrent tumors plotted against time to recurrence. Pearson’s correlations 0.18 and -0.010, p= 0.39 and 0.64 between time to recurrence and Horvath age difference and Horvath aging rate, respectively.
2.4 Discussion
By focusing our analysis scope on gliomas, we were able to identify patterns of epigenetic aging in tumor tissue that reflect known prognostic molecular subtypes. In addition, we were able to identify aging patterns that may contribute to glioma survival
independent of molecular subtype and other known prognostic factors, and gained insight into epigenetic aging of tumors between primary and recurrence.
In our investigation of glioma epigenetic aging, we discovered that different subtypes of glioma demonstrated different epigenetic aging patterns. This observation not only contributes an independent line of verification of these subtypes as distinct biological classifications, but also showed that epigenetic age may be a useful measurement for further elucidating differences within molecular subtypes, such as between LGm6-GBM and PA-like gliomas, which were previously distinguished histologically rather than molecularly.
Within gliomas, we observed that overall lower epigenetic age is associated with poor survival. IDH-wt gliomas, which have generally poorer prognosis than IDH-mutant gliomas, had lower epigenetic age acceleration when compared to IDH-mutant gliomas.
These trends are contrary to most epigenetic aging studies, which have generally found that advanced epigenetic age is often associated with higher risks of disease and mortality in normal tissues. This opposite finding within glioma samples may be indicative of biological mechanisms in gliomas that oppose epigenetic aging and that lead to more aggressive or treatment-resistant disease. Another possible explanation for observed
differences in epigenetic age is that age measurements may be reflective of variable
tumor compositions, a factor that must be considered when studying invasive and highly
heterogeneous tumors such as gliomas, particularly given the evidence that stem-like cells have younger epigenetic clocks.39 Currently, the biological drivers that determine epigenetic aging in gliomas remain unknown, and therefore the relationship of epigenetic age to biological mechanisms in glioma survival remains open for further investigation.
There were several limitations to our analysis that bear mentioning. While missing KPS
scores were imputed to provide as complete a dataset as possible and all efforts were
made to prevent bias, there were differences in male/female ratio and in MGMT promoter methylation status between imputed cases and cases with complete data (Table
2-4, Table 2-8). Furthermore, the survival associations with epigenetic clock age could
not be validated in the limited available data, emphasizing the need for additional large,
high-quality methylation data in glioma to investigate these types of associations. We also note
that the strong correlation between the two epigenetic clocks is an interesting finding in
itself, but likely contributes to collinearity in our survival models.
Inclusion of recurrent gliomas in our analysis allowed for study of epigenetic aging in
gliomas over time. Although no reproducible association between primary-recurrent
epigenetic age difference and time to recurrence was observed, the lack of this expected
association intuitively reflects variable aging rates across gliomas. Given that the
majority of primary-recurrent pairs move in the same positive direction measured by both
epigenetic clocks, however, it seems unlikely that the epigenetic aging process is
altogether dysregulated. This lack of association between epigenetic aging and time to
recurrence can be explained by higher aging rate in faster recurrences and slower aging rate in delayed recurrences, but the evidence for this in our data is circumstantial. Our primary-recurrent analysis showed evidence of significant intra-tumoral heterogeneity with regard to epigenetic age, and the variety of treatments and responses of patients after resection of their primary glioma complicates any firm conclusions that can explain these variable aging rates. This observation is therefore not necessarily useful for predicting recurrence using epigenetic age, but is reported here solely to serve as a reference for further investigation into the drivers of epigenetic aging in gliomas.
Although epigenetic clocks appear broadly dysregulated in cancer without any clear pan-cancer utility, application of epigenetic clocks specifically to glioma demonstrated that epigenetic age can be a potentially useful biomarker in isolated cancer contexts.
Furthermore, we identified the need for additional investigation of the mechanisms of epigenetic aging in glioma, where we observed associations between epigenetic age, glioma subtypes and glioma survival and recurrence.
Chapter 3
Inference of kinase activity using in silico substrate prediction scores
Under review as: Peter Liao, Jennifer L Yori, Ruth A Keri, Jill S Barnholtz-Sloan; Inference of kinase activity using in silico substrate prediction scores, PLOS Computational Biology (under review).
3.1 Abstract
Motivation: Protein phosphorylation represents one of the most diverse and widespread types of post-translational modification and is critically involved in the regulation of
many cellular processes. Mass spectrometry-based phosphoproteomics has enabled systems-wide quantitation of phosphorylation, but deriving biological insights from
phosphoproteomics has been hampered by difficulty in interpreting changes in individual
phosphorylation sites. One commonly used method to interpret phosphoproteomics data
is to use individual phosphorylation site quantification to infer changes in kinase activity,
but current methods are restricted to a limited number of known kinase-substrate
relationships or rely on complex machine learning methods requiring specialized
experimental data. We present prediction-based KSEA (pKSEA), an accessible, novel method of kinase activity inference that incorporates existing in silico kinase-substrate predictions and show its potential to overcome the aforementioned limitations in a dataset investigating kinase inhibition in breast cancer cells.
Results: pKSEA identified inhibition of expected kinase targets in breast cancer cells, including Abl, Ephrins, and Src in response to dasatinib treatment and p70S6k in response to rapamycin. Compared to conventional KSEA, pKSEA identified additional additive and complementary changes in kinase activities based solely on predictions across dasatinib/rapamycin combination treatment.
Availability and implementation: https://github.com/pll21/pKSEA or https://cran.r-project.org/web/packages/pKSEA/index.html
3.2 Background
Mass spectrometry-based phosphoproteomics screening enables relative quantification of
thousands of phosphorylation sites per single experiment, but analysis and interpretation
of this type of data poses unique hurdles. The most significant issues include
variable detection coverage of LC-MS/MS methods, which complicates comparisons
across phosphoproteomics experiments, and limited interpretability of raw
phosphorylation site quantifications, because the function of the majority of
phosphorylation sites in the human phosphoproteome remains unknown. In order to
overcome these issues, researchers have attempted to infer changes in kinase activities
from phosphorylation site quantifications using methods such as kinase substrate
enrichment analysis (KSEA) to improve interpretability of phosphoproteomic data and
allow for comparisons across experiments despite variability in the substrates being
detected. While recent benchmarking studies have shown that KSEA is indeed capable of
capturing expected changes in kinase activity, its approach toward analysis suggests
opportunities for improvement. Namely, KSEA relies on defined canonical substrate
annotations and restricts its analysis to kinase-specific subsets of canonical substrates.
Because of these gaps in known kinase-substrate relationships, KSEA cannot include the majority of data obtained in phosphoproteomics screens. Furthermore, because KSEA substrate sets are defined in a binary manner, all substrates are considered equally representative of kinase activity. With these characteristics in mind, an improved method for kinase inference can utilize bioinformatics to expand the usable range of data
and also weight its kinase inference based upon likelihood and strength of direct kinase
regulation (Figure 3-1).
Figure 3-1 Cartoon representation of two kinase inference methods, with cartoon figures representing detected phosphorylation features in a phosphoproteomic experiment. To infer a kinase activity change, KSEA restricts its analysis to a subset of known canonical substrates that are all considered equal, leading to an equivocal result. pKSEA, on the other hand, utilizes predictions to incorporate a larger proportion of the data and weights contributions to kinase activity inference by the predicted likelihood of a phosphorylation site being directly regulated by a kinase.
Herein, we describe the development of prediction-based KSEA (pKSEA) to overcome
the limitations of canonically defined substrate sets. This method provides kinase activity
inferences by incorporating the bulk of detected phosphoproteomic features in any given phosphoproteomic screen, most of which do not have known kinases. We show how this approach can both identify expected perturbations in canonical kinase substrates and identify novel differences in kinase activity based on predicted substrates. We emphasize that, unlike more complex inference methods that require specialized experimental design and data types, our method is accessible for researchers to implement without the need for extensive computational expertise or multi-condition, multiple time-point phosphoproteomics data.
3.3 Materials and Methods
3.3.1 Cell culture
MDA-MB-231 cells were obtained from the American Type Culture Collection (ATCC) and were authenticated using STR profiling (BDC Molecular Biology Core Facility,
University of Colorado, Boulder, CO). MDA-MB-231 cells were maintained at 37 °C at
5% CO2 in RPMI-1640 media supplemented with 10% FBS. Cells were treated with vehicle (DMSO), dasatinib (100 nmol/L), rapamycin (2.5 µmol/L), or combination dasatinib/rapamycin for 24 hours according to previously published IC50 values and time-point optimization127. The experiment was performed with three biological replicates for each treatment group.
3.3.2 Phosphoproteomics
After the treatment incubation period, cells were subjected to quantitative global phosphorylation analysis using label-free ultra-high-performance liquid chromatography-tandem mass spectrometry (LC-MS/MS) without fractionation. Cells were lysed in 2%
SDS, 20 mM Tris buffer at pH 8 with pulse sonication in the presence of protease and phosphatase inhibitors (Thermo Fisher). Following reduction (25 mM DTT, 1 h at 37 °C), alkylation (25 mM iodoacetamide, 30 minutes at 37 °C), and detergent removal (FASP), 800 µg of protein from each sample was digested by two-step Lys-C/trypsin digestion (1:20 enzyme to
protein ratio). Digested peptides were desalted using C18 cartridges (Waters) and then
subjected to phospho-enrichment using commercially available TiO2 enrichment spin tips
(Thermo Fisher). LC-MS/MS was performed on all samples in a single run in randomized
injection order using an ultra-high pressure liquid chromatography system (NanoAcquity,
Waters) coupled to an LTQ-Orbitrap Velos mass spectrometer (Thermo Scientific).
Separation and peptide detection details were as previously described128. Peptide
identification on raw files was performed using Rosetta Elucidator (3.3.0.1.SP.25) and
Mascot (2.3.01) searching against the UniProt database with filters for trypsin cleavage,
phosphorylation, carbamidomethylation and oxidation modifications. Relative
quantification was performed by normalizing peptide ion intensities to within-sample
median intensity values.
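The within-sample median normalization described above can be illustrated with a minimal sketch. It is written in Python purely for illustration (the analysis itself used Rosetta Elucidator and R), and it assumes that normalization means dividing each intensity by the sample median; the text does not state whether a ratio or a log-scale difference was used. The function name and data layout are hypothetical.

```python
from statistics import median


def median_normalize(sample_intensities):
    """Scale one sample's peptide intensities by that sample's median.

    sample_intensities: dict mapping a peptide label to a raw intensity,
    with None marking a missing value. Missing values stay missing.
    Assumption: normalization is a simple ratio to the sample median.
    """
    observed = [v for v in sample_intensities.values() if v is not None]
    sample_median = median(observed)
    return {
        peptide: (value / sample_median if value is not None else None)
        for peptide, value in sample_intensities.items()
    }
```

Applying the function to each sample separately places samples on a common relative scale before group comparisons.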
3.3.3 Data preparation
Because kinase inference can be informed or skewed by even a few phosphopeptides,
imputation was performed to reduce the potential effects of missing values. In total, 225 values
were missing out of 50628 abundance observations, representing 0.4% of the dataset.
The imputeLCMD package from Bioconductor was used to perform a two-part imputation.
Phosphopeptides that were missing in a single sample were considered missing at random
(MAR) and were imputed using a k-nearest neighbor approach. If a phosphopeptide was missing across multiple samples, it was considered to be missing not at random (MNAR), likely because it was below the detection threshold, and was assigned the minimum detected value in that sample in a deterministic fashion. No significant biases or changes in abundance distributions were introduced with imputation. To generate summary statistics for
comparing phosphopeptide abundance across treatment groups, Welch’s t-test was performed comparing dasatinib, rapamycin, and combination-treated normalized sample intensities to corresponding DMSO-treated controls on an individual phosphopeptide abundance basis. This resulted in summary statistic calculations for each treatment group compared to control for each detected phosphopeptide.
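The two-part imputation rule can be sketched as follows. This is an illustrative Python approximation and not the imputeLCMD implementation: the neighbor search (Euclidean distance over the non-missing samples, restricted to fully observed phosphopeptides), the default k, and all names are assumptions made for the example.

```python
import math


def impute_two_part(matrix, k=2):
    """Two-part imputation on a phosphopeptide-by-sample table.

    matrix: dict mapping phosphopeptide -> list of abundances, one per
    sample, with None marking a missing value.
    - Missing in exactly one sample: treated as MAR, filled with the mean
      of the k nearest fully observed phosphopeptides in that sample.
    - Missing in several samples: treated as MNAR, filled with the minimum
      detected value in the corresponding sample.
    """
    n_samples = len(next(iter(matrix.values())))
    complete_rows = [row for row in matrix.values() if None not in row]
    sample_min = [
        min(row[j] for row in matrix.values() if row[j] is not None)
        for j in range(n_samples)
    ]
    imputed = {}
    for peptide, row in matrix.items():
        missing = [j for j, value in enumerate(row) if value is None]
        new_row = list(row)
        if len(missing) == 1:
            if not complete_rows:
                raise ValueError("no fully observed phosphopeptides for kNN")
            j = missing[0]
            others = [i for i in range(n_samples) if i != j]

            def distance(candidate):
                # Distance uses only the samples observed for this peptide.
                return math.sqrt(sum((row[i] - candidate[i]) ** 2 for i in others))

            neighbours = sorted(complete_rows, key=distance)[:k]
            new_row[j] = sum(n[j] for n in neighbours) / len(neighbours)
        elif len(missing) > 1:
            for j in missing:
                new_row[j] = sample_min[j]
        imputed[peptide] = new_row
    return imputed
```

Comparing abundance distributions before and after such a step is one way to check that imputation has not introduced obvious bias.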
3.3.4 Kinase prediction scoring
To obtain kinase prediction scores, we searched detected phosphopeptides against precomputed prediction data from NetworKIN107. This precomputed data included
predictions on all known phosphorylation sites in KinomeXplorer-DB using ENSEMBL
human protein sequences v59 and is available publicly online for download
(http://networkin.info/download.shtml). Phosphorylation sites that have not been
previously observed and for which no precomputed prediction data was available from
NetworKIN were not included in this analysis. Each phosphorylation site detected in the
phosphorylation data was considered as a separate feature; if multiple phosphorylation
sites were detected on the same phosphopeptide, a separate phosphopeptide feature entry
was created with duplicated summary statistics reflecting the abundance of the detected
phosphopeptide that contained the phosphorylation sites. Thus, a single phosphopeptide
with multiple phosphorylation sites would yield multiple phosphorylation site entries
with duplicated abundance-related summary statistics but with different prediction scores reflecting the individual phosphorylation sites. Multiple phosphopeptides containing the same phosphorylation site were treated as separate, independent entries for scoring purposes.
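The expansion of phosphopeptides into site-level features can be sketched as below. This is an illustrative Python example, not the pKSEA package code; the record layout, site labels, and function name are hypothetical.

```python
def expand_to_site_entries(peptides, predictions):
    """Create one feature entry per (phosphopeptide, phosphorylation site).

    peptides: list of dicts with keys "id", "sites" (list of site labels)
              and "t" (the peptide-level summary statistic).
    predictions: dict mapping a site label -> {kinase: prediction score}.
    Sites without precomputed predictions are skipped, and each entry
    carries the duplicated peptide-level statistic.
    """
    entries = []
    for peptide in peptides:
        for site in peptide["sites"]:
            site_predictions = predictions.get(site)
            if site_predictions is None:
                continue  # no precomputed prediction: excluded from analysis
            entries.append({
                "peptide": peptide["id"],
                "site": site,
                "t": peptide["t"],          # duplicated across sites
                "pred": site_predictions,   # differs between sites
            })
    return entries
```

In this layout, a doubly phosphorylated peptide contributes two entries with the same statistic, and two peptides covering the same site remain separate entries.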
3.3.5 Kinase activity scoring
We first calculated summary statistics for each phosphopeptide, comparing single-agent and combination treatment conditions to DMSO controls. Summary statistics, including t-statistics, were calculated for each detected phosphopeptide in our phosphoproteomic screen. The t-statistic was calculated using a pairwise Welch’s t-test comparing phosphopeptide abundances of each treatment group to the DMSO control group. The t-statistic reflects the magnitude of mean abundance change over the standard error observed in measurements, which we used as a reflection of the direction and statistical significance of changes in relative abundance of a detected phosphorylation site. To calculate a raw kinase activity change score (KAC score), we used the calculated summary statistics and incorporated kinase-substrate prediction scores to calculate contribution scores for each kinase-phosphorylation site prediction:
$$\mathrm{contribution}_{psite,\,kinase} = t_{psite} \times \mathrm{Pred}_{psite,\,kinase}$$
In this formula, t is the t-statistic reflecting the direction and estimated statistical significance of changes in relative abundance of the detected phosphorylated site, and
Pred is the NetworKIN prediction score of a single kinase predictor (kinase) against the specified phosphorylation site (psite). Using these summary statistics and kinase-substrate prediction scores, for each kinase, a raw kinase activity change (KAC) score was then calculated:
$$\mathrm{KAC}_{kinase} = \sum_{\text{all } psites} \mathrm{contribution}_{psite,\,kinase}$$
It should be noted that the KAC score is not an absolute measure and is not comparable
across kinases. Because prediction scores are not normalized across kinases and kinases
can have varying levels of predicted substrate preference and promiscuity, the
magnitudes of KAC scores are not directly indicative of significance.
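The scoring steps in this section can be summarized in a short sketch: a Welch's t-statistic per phosphopeptide, a contribution score per kinase-site prediction, and a summed KAC score per kinase. The Python below is illustrative only and is not the pKSEA R implementation; function names and the input layout are assumptions.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(treated, control):
    """Welch's t-statistic: difference in means over the unpooled standard error."""
    standard_error = sqrt(
        variance(treated) / len(treated) + variance(control) / len(control)
    )
    return (mean(treated) - mean(control)) / standard_error


def kac_scores(site_entries):
    """Sum contribution scores (t * Pred) over all sites for each kinase.

    site_entries: iterable of (t_statistic, {kinase: prediction_score}),
    one item per phosphorylation site feature.
    """
    kac = {}
    for t_statistic, site_predictions in site_entries:
        for kinase, prediction in site_predictions.items():
            contribution = t_statistic * prediction
            kac[kinase] = kac.get(kinase, 0.0) + contribution
    return kac
```

Because each kinase sums over a different set of weighted sites, the raw totals are only interpretable relative to a kinase-specific null distribution, which motivates the permutation test in the next section.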
3.3.6 Assessing significance of kinase activity change-scores
To evaluate the significance of KAC scores, we performed a permutation test to compare
observed KAC scores to KAC scores that could be expected from our given data due to random chance. Calculated t-statistics were permuted with respect to phosphopeptide to
account for any within-phosphopeptide prediction biases that may be lost with
permutation on the individual site level. After permuting t-statistics on the
phosphoproteomic dataset, individual site contribution scores and overall KAC score
were recalculated. This permutation and recalculation process was repeated 1000 times.
To provide a significance estimate, observed KAC scores were compared to the results of
permutation and assigned a percentile score in relation to permuted results, reflecting an
empirical estimate of expected results due to random chance. This percentile score is
referred to as the pKSEA significance score. Kinase activity change was considered
significantly down-regulated in the results presented if the observed score was below the
5th percentile (representing a comparison to α= 0.05) of permuted scores, or in other
words, a pKSEA significance score of less than 5. For the analysis of data containing only predictions (without known kinase-substrate pairs), phosphorylation sites were cross-referenced against the database of known kinase-substrate relationships in PhosphoSitePlus; sites with a known kinase were removed, and the remaining sites were analyzed as a separate data set according to the same workflow.
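The permutation procedure can be sketched as follows. This illustrative Python example shuffles t-statistics at the phosphopeptide level, so that all sites on one phosphopeptide receive the same permuted statistic, and reports the percentage of permuted KAC scores that are less than or equal to the observed score. That percentile definition, the seeding, and all names are assumptions of the sketch rather than details taken from the pKSEA package.

```python
import random


def kac_for_kinase(peptide_t, site_entries, kinase):
    """KAC score for one kinase: sum of t * Pred over all site entries.

    peptide_t: dict phosphopeptide -> t-statistic.
    site_entries: list of (phosphopeptide, {kinase: prediction score}).
    """
    return sum(
        peptide_t[peptide] * predictions.get(kinase, 0.0)
        for peptide, predictions in site_entries
    )


def pksea_significance(peptide_t, site_entries, kinase, n_perm=1000, seed=0):
    """Percentile (0-100) of the observed KAC score among permuted scores."""
    rng = random.Random(seed)
    observed = kac_for_kinase(peptide_t, site_entries, kinase)
    peptides = list(peptide_t)
    t_values = [peptide_t[p] for p in peptides]
    n_at_or_below = 0
    for _ in range(n_perm):
        shuffled = t_values[:]
        rng.shuffle(shuffled)  # permute statistics across phosphopeptides
        permuted_t = dict(zip(peptides, shuffled))
        if kac_for_kinase(permuted_t, site_entries, kinase) <= observed:
            n_at_or_below += 1
    return 100.0 * n_at_or_below / n_perm
```

Under this definition, a returned value below 5 corresponds to the down-regulation threshold used in the text.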
3.3.7 pKSEA Plotting
Heatmap figures of pKSEA significance scores were generated showing increased (red) and decreased (blue) inferred kinase activity. pKSEA significance scores were included in figures as space allowed. For visual clarity,
“filtered” figures were also generated by removing kinase rows that did not show significant changes in kinase activity in a specified number of experimental conditions.
3.3.8 KSEA
KSEA on phosphoproteomics summary statistics data was performed using the KSEAapp
R package104.
3.3.9 Availability and Implementation
All data preparation and statistical analyses were performed in R v3.4.2 and RStudio v1.1.383. pKSEA is available for public use as the CRAN package pKSEA, and all code is publicly available at https://github.com/pll21/pKSEA. Cell-line phosphoproteomics data are provided in supplemental data. Raw label-free LC-MS/MS data are available upon request.
Figure 3-2 Diagrammatic representation of pKSEA. (a) Phosphopeptide-level phosphoproteomics data is used to (b) calculate summary statistics reflecting magnitude and confidence of phosphopeptide abundance changes across experimental groups. (c) Kinase prediction scores for all phosphorylation site identifications are obtained, and abundance change summary statistics are weighted by prediction scores to generate kinase-specific contribution scores. (d) Contribution scores are summed across all phosphorylation sites to generate an observed kinase activity change score (KAC). (e) Data is permuted at the summary statistic level, randomizing phosphopeptide assignments and randomizing any observed relationship between abundance changes and kinase predictions. (f) KAC is recalculated for each permuted data set. (g) Observed KAC score is compared to permuted KAC scores to empirically assess significance of observed KAC score. Significance is reported as a percentile score of observed KAC score among permuted KAC scores (pKSEA score), for which <5 was chosen as a threshold for significance. Note that distributions of KAC scores are kinase-specific.
3.4 Results
To generate a kinase-substrate prediction-based score, we weighted summary statistics generated by LC-MS/MS phosphoproteomics with in silico kinase prediction scores from
NetworKIN107 (Figure 3-2). By weighting relative changes in phosphopeptide abundance with kinase prediction scores, we then calculated a phosphoproteome-wide summed kinase activity change (KAC) score that reflected shifts in kinase substrate phosphorylation across experimental conditions. Kinases display a broad range of substrate “promiscuity”, with some kinases having hundreds of substrates and others having narrow specificity. In order to adjust for these unique kinase properties, KAC scores were empirically assessed for significance for each kinase using permutation testing. Permutation test results were reported as a percentile score, referred to here as the pKSEA significance score, which indicates an empirical estimate of the likelihood that an observed KAC score could be expected by random chance. A pKSEA significance score of less than 5 means that an equal or lower KAC score could be expected by random chance less than 5% of the time according to permutation testing; hence, we considered a pKSEA score below 5 as the threshold for significance for this report.
Figure 3-3 Principal component analysis (PCA) of raw phosphoproteomics data from MDA-MB-231 cells treated with dasatinib, rapamycin, and combination treatment. No clear separation of groups is observed, indicating that the largest phosphoproteomic differences observed between samples are not informative for understanding treatment effects.
To test the functionality of pKSEA, we analyzed a novel phosphoproteomic data set
generated to investigate dual kinase inhibitor combination therapy directed against MDA-
MB-231 triple negative breast cancer cells. Combination therapy with dasatinib129, a Src/c-Abl multikinase
inhibitor, and rapamycin, an mTOR inhibitor, has previously been
demonstrated to show increased anti-cancer efficacy in vitro and in vivo compared to
either compound alone127, and investigation into the mechanisms underlying the synergy of this combination is still ongoing and of significant translational interest for the treatment of multiple cancers. Indeed, a clinical trial investigating the combination of an
mTOR inhibitor with dasatinib is currently underway for childhood cancers
(NCT02389309).
In response to dasatinib, rapamycin, and combination treatment, 32, 23, and 36 kinases,
respectively, showed significant downward shifts in abundance of predicted substrates
when compared to vehicle controls according to pKSEA (Figure 3-4, Figure 3-5).
Despite PCA showing poor separation of experimental groups based on raw phosphoproteomic data (Figure 3-3), pKSEA was able to identify changes in inferred kinase activity that included expected targets as well as novel kinases. While a direct statistical comparison between KSEA (Figure 3-9) and pKSEA results was not possible
due to differences in kinase definitions and classifiers between the two methods, inferred
dasatinib-affected kinases consistent with KSEA analysis included Abl and the Ephrin
receptor family, as well as possible downstream effectors MAPK8 and MAPK13.
Inferred rapamycin-affected kinases common across both methods included p70S6K, a
canonically regulated substrate of mTOR, and CDK1.
Figure 3-4 Results of pKSEA analysis on MDA-MB-231 cells treated with dasatinib, rapamycin, and dasatinib/rapamycin combination. Heatmap of pKSEA scores, showing complementary and commonly (between groups) up- and down-regulated kinase activities between dasatinib and rapamycin.
Figure 3-5 Filtered results of pKSEA analysis on MDA-MB-231 cells treated with dasatinib, rapamycin, and dasatinib/rapamycin combination. For readability, kinases that were not significantly up- or down-regulated in at least two treatment arms are hidden. Numeric labels in cells are pKSEA significance scores reflecting percentile score of kinase activity change against permutation testing (1000 permutations).
Figure 3-6 Venn diagram showing overall complementarity of kinase sets identified by pKSEA as downregulated by rapamycin, dasatinib, and combination treatment.
Figure 3-7 Heatmap of kinase activity score correlations in permuted data, reflecting shared predicted substrates in phosphoproteomics data.
Figure 3-8 Heatmap of KSEA inferred kinase activity changes in MDA-MB-231 across experimental conditions
Figure 3-9 Filtered heatmap of KSEA inferred kinase activity changes in MDA-MB-231 across experimental conditions. Filtered for kinases that have statistically significantly altered changes in kinase activity in at least one treatment group.
Table 3-1 KSEA Analysis Results
Kinase    Dasatinib  Combination  Rapamycin  m
ABL1      0.2        3            55.6       1
AKT3      97.9       96.1         89.4       1
CAMK2B    0.5        22           70.8       1
CAMK4     0.5        22           70.8       1
CDK1      18         0.1          0.9        73
CDK2      44.1       2.9          81.2       97
CDK4      2.1        4.7          8          9
CDK6      8.6        1.8          1.3        6
CHEK2     1.3        4.3          13.9       2
CSNK1E    18.4       1.6          0.4        2
DYRK3     0.8        4.4          30.9       1
EPHA2     0          0.1          56.2       1
FYN       0.2        3            55.6       1
GSK3B     1.9        0.5          10.2       8
LATS1     99.8       90.2         27         1
MAP3K8    0          0            0          1
MAPK1     22.6       1.4          6.5        30
MAPK13    0.1        6.8          29.9       4
MAPK15    0.4        1.4          10         2
MAPK8     2          1.2          10.4       4
MAPK9     0.4        1.4          10         2
MTOR      5.8        0            0          19
NEK2      95.4       94.4         64.8       1
PBK       0.4        1.4          10         2
PDK1      52.1       94.1         96.7       2
PDK2      52.1       94.1         96.7       2
PDK3      50.4       97           98.6       1
PDK4      50.4       97           98.6       1
PRKACA    0.5        53.5         90.9       14
PRKCD     50.3       0            0          7
PRKCZ     99.8       90.2         27         1
PRKG1     2.7        14.6         56.1       2
PTK6      0.2        3            55.6       1
ROCK1     89.9       95.1         96.4       3
RPS6KA1   45.2       0.1          0.3        11
RPS6KA3   6.8        0            0.1        7
RPS6KB1   11.3       0            0          9
SGK1      0          12.3         84         5
STK38     99.8       90.2         27         1
VRK1      0.4        1.4          10         2
Kinases with more than 3 detected substrates included in KSEA analysis highlighted.
Figure 3-10 Heatmap of pKSEA scores on MDA-MB-231 cells treated with dasatinib, rapamycin, and dasatinib/rapamycin combination, including cross-comparisons of combination compared directly to single-agent treatments. Cross-comparisons illustrate the relative contributions of each agent to the combination. For readability, kinases that were not significantly up or downregulated in at least two treatment arms are hidden. Numeric labels in cells are pKSEA significance scores reflecting percentile score of kinase activity change against permutation testing (1000 permutations).
Figure 3-11 Prediction-only results of pKSEA analysis on MDA-MB-231 cells. pKSEA results on cells treated with dasatinib, rapamycin, and dasatinib/rapamycin combination, with all known kinase-substrate pairs removed from data. For readability, kinases that were not significantly up- or downregulated in at least one treatment arm are hidden. Numeric labels in cells are pKSEA significance scores reflecting percentile score of kinase activity change against permutation testing (1000 permutations).
pKSEA additionally identified kinase activity shifts that were ostensibly missed by
KSEA. Down-regulated kinases identified by pKSEA but not by KSEA included expected targets of inhibition such as Src, as well as a host of potential downstream effectors, including DNAPK, FRK, and PAK2 in dasatinib-treated cells, and CDK3, ICK, LOK, and
MAPK3 in rapamycin-treated cells. pKSEA was additionally able to identify inhibition patterns common to both single agents as well as the combination, as both dasatinib and rapamycin induced reductions in phosphorylated substrates of CDK1, MAPKs 1, 8, 10-14, and MOK. To visualize the contributions of single agents to the combination, we also ran pKSEA analysis directly comparing combination treatment to each of the single-agent treatments, showing that strong up- and down-regulated kinase activities of single-agent treatments compared to DMSO controls were recapitulated when the agents were combined, compared to single-agent treatments (Figure 3-10). Interestingly, pKSEA identified several kinases with activity changes unique to combination treatment, including up-regulation of ACTR2B and TGFbR2 activity and down-regulation of NEK2 and AuroraB (Figure 3-10).
To further elucidate the contributions of prediction-based data to our kinase activity inference, we repeated our analysis on only phosphorylation sites without established kinases, confirming that many of the kinases identified by pKSEA but not by KSEA, including AuroraB, FRK, MAPK11, MAPK14, NEK2, and PAK2, were identified based on predictions alone (Figure 3-11). KSEA also identified some kinases as “hits” that pKSEA did not. While most of these KSEA identifications were limited to kinases with fewer than 3 detected substrates, notable exceptions include CDK4 and GSK3B for
dasatinib, and CDK6 and PRKCD for rapamycin. Lastly, we assessed potential biases of
NetworKIN predictions in our dataset by calculating score correlations between kinases in our permuted data (Figure 3-7). Identifying correlations in KAC score that persisted even after randomization revealed kinases that shared a significant proportion of predicted substrates in our dataset, such as the Ephrin receptor family. Because selection of phosphopeptides in mass-spec-based phosphoproteomics is stochastic, visualizing and considering these dataset-specific correlations can be helpful for understanding whether score similarities across kinases are due to changes in independent substrates or shared substrates, and to what degree.
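The correlation analysis on permuted data can be illustrated with a small sketch. For two kinases, KAC scores are recomputed over many random permutations of the site-level statistics and then correlated; kinases that share predicted substrates remain correlated even though the statistics are randomized. This Python example is illustrative only, and the names, default permutation count, and Pearson helper are assumptions.

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation undefined for a constant sequence")
    return sxy / sqrt(sxx * syy)


def permuted_kac_correlation(t_values, preds_a, preds_b, n_perm=500, seed=0):
    """Correlate permuted KAC scores of two kinases.

    t_values: per-site statistics; preds_a / preds_b: per-site prediction
    scores for kinase A and kinase B, in the same site order.
    """
    rng = random.Random(seed)
    kac_a, kac_b = [], []
    for _ in range(n_perm):
        shuffled = t_values[:]
        rng.shuffle(shuffled)
        kac_a.append(sum(t * p for t, p in zip(shuffled, preds_a)))
        kac_b.append(sum(t * p for t, p in zip(shuffled, preds_b)))
    return pearson(kac_a, kac_b)
```

Two kinases with identical prediction profiles yield a correlation of 1, whereas kinases predicted on disjoint sites yield a correlation near zero.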
3.5 Discussion
pKSEA offers a means of computationally supplementing the lack of knowledge on
kinase-substrate relationships, but it must be noted that our approach is dependent upon
the accuracy of the predictions used. By incorporating current models of kinase-substrate
prediction, pKSEA can be used to explore regions of the phosphoproteome that might
otherwise be overlooked, but the performance of kinase-prediction methods such as
NetworKIN has not been systematically assessed on the greater than 95% of
phosphorylation sites without a known kinase. Methods such as KSEA are limited to
canonical kinase-substrate relationships and therefore can be sensitive to coverage issues
and can bias conclusions and future studies against less-studied phosphoproteomic perturbations. Prediction and modeling approaches can be similarly biased as they are often trained and evaluated upon those same canonical relationships while penalizing identifications as spurious if not previously characterized.
We emphasize, however, that our analytic approach is agnostic to prediction method.
Any prediction method that supplies prediction scores for kinase-substrate relationships can be used with pKSEA. Therefore, we expect that this approach can not only be useful for bench scientists to analyze and interpret phosphoproteomic data, but can also be used as a framework for bioinformaticians to evaluate and refine predictive methods using existing experimental datasets. pKSEA offers a potential bridge between computational prediction and experimental data that overcomes current limitations and can advance both fields of inquiry by complementing existing methods and providing a means for experimentation and computation to better inform one another.
Chapter 4
pKSEA benchmarking and applications to cancer phosphoproteomics
Under review as: Peter Liao, Jennifer L Yori, Ruth A Keri, Jill S Barnholtz-Sloan; Inference of kinase activity using in silico substrate prediction scores, PLOS Computational Biology (under review).
Contains data generated in collaboration with Goutham Narla, Danica Wiredja.
4.1 Abstract
Kinases are key regulators of cellular processes in tumor survival and progression.
Kinase activity inference therefore possesses potential for use as a cancer biomarker due
to its ability to identify changes in kinase activity and regulation. In this chapter, I first
benchmark the performance of an in silico prediction-based kinase inference method
(pKSEA) using a previously published “gold-standard” phosphoproteomic dataset, and
then apply pKSEA to two novel phosphoproteomics datasets, investigating kinase
activity as a potential biomarker in identifying and targeting high-risk glioblastoma
(GBM) and characterizing phosphorylation responses of cancer xenografts to small
molecule activators of PP2A (SMAPs).
4.2 Background
Despite a variety of kinase inference methods available, benchmarking novel methods
presents researchers with a significant dilemma. Increasingly advanced inference
methods are being developed that can take advantage of time course data and machine
learning, but benchmarking in a systematic manner is problematic due to issues in phosphoproteomic data coverage and method-specific data requirements. Kinase-substrate prediction tools, on the other hand, as distinct from kinase activity inference tools, are themselves most commonly benchmarked in a cross-validation manner. This cross-validation assesses a tool’s ability to predict known kinase-substrate relationships while the tool’s real-world performance on novel data remains unknown. Interestingly, the respective challenges in kinase inference and kinase-substrate prediction complement each other, as applying kinase-substrate predictions to kinase activity inference can provide a measure of their performance in experimental data, and kinase-substrate predictions can overcome the limited coverage of canonical substrates in inference of kinase activity.
Ideally, a kinase activity inference method should be robust to the known coverage issues present in MS phosphoproteomic screens and be able to identify changes in kinase activity despite practical limitations. As previously demonstrated (Chapter 3), methods based upon canonical kinase-substrate relationships can be limited by the rigidity of definitions, and a large proportion of their inferred kinase activity changes may be made based on fewer than 3 detected canonical substrates. To obtain higher confidence kinase activity inferences, researchers may discard inferences made based on a few canonical substrates as a conservative measure that further limits the data being used. Out of thousands of detected substrates in a given phosphoproteomics screen, KSEA routinely disregards more than half of collected data98, even before any calculations or hypothesis testing begin.
A recent publication by Ochoa et al. explored benchmarking of kinase inference methods, including KSEA and several related scoring methods105. In their publication, they provided a curated benchmarking dataset of 62 phosphoproteomics experiments from publications related to phosphorylation perturbations, including stimulation of kinases with known kinase activators and inhibition with known kinase inhibitors. Along with the benchmarking dataset, they included a set of 184 “gold standard” kinase activity changes that were expected from the experimental conditions. For example, one such “gold
standard” kinase-condition pair would be the expectation that EGFR should demonstrate
increased activity when stimulated with epidermal growth factor (EGF). In their work,
they show that KSEA performs significantly better than random, demonstrating a median
AUC of 0.721 with ROC analysis and a median precision value of 0.708 at recall 0.5 in
inferring “gold standard” kinase activity change-experimental condition pairs, indicating that, despite potential coverage issues, KSEA can be a useful tool for kinase activity inference.
Of note, the authors did not include more recent kinase inference methods in their benchmark comparisons, likely because many recent kinase inference methods are incompatible with data that consists of single condition versus control experiments. Due to the specialized data requirements of more recent kinase inference methods, it is currently untenable to benchmark their performance across multiple datasets in a systematic manner.
In this chapter, I present benchmarking results demonstrating a significant performance improvement over KSEA using my in silico prediction-based kinase activity inference method, pKSEA (see Chapter 3 for more details about this method), and perform exploratory analysis of two phosphoproteomic datasets investigating the potential for a kinase-activity inference-based biomarker of survival in GBM as well as investigating kinase activity changes in response to kinase inhibition/phosphatase activation combination treatment of xenograft tumors.
4.3 Materials and Methods
4.3.1 Benchmarking data
Preprocessed, normalized human quantitative phosphoproteomic data prepared using previously published methods103,105 was downloaded from http://phosfate.com,
containing fold-change data on 103 experimental conditions and encompassing 61180
phosphorylation sites observed across 39816 phosphopeptides mapping to 9139
phosphoproteins. A “gold standard” positive set of 184 kinase-condition pairs involving
30 kinases regulated across 62 experimental perturbations was obtained from
supplementary information provided with the previous publication105.
4.3.2 pKSEA
pKSEA was performed on the benchmarking data using direct fold-change as a surrogate
for t-score. Kinase activity change scores were calculated for each experimental
condition using NetworKIN107 prediction scores to weight fold-change contributions as
previously described (see Chapter 3 for more details). pKSEA significance scores were
calculated using 1000 permutations per test, and were reported as previously described.
4.3.3 Evaluating pKSEA performance
NetworKIN is currently unable to predict substrates for ALK, FLT4, LRRK2, and
PIK3CA; therefore, those 4 kinases were dropped from the “gold standard” positive set for evaluating pKSEA. NetworKIN also lacks an mTOR classifier, so p70S6K was used as a well-established surrogate marker that was not already included in the initial positive set. In order to generate a “gold standard” negative set for benchmarking purposes, I
employed a strategy similar to that reported by Ochoa et al. For each experimental condition, an equal number of kinase activities was randomly selected without replacement from the remaining “gold standard” positive set to serve as a “null” set of kinase-condition pairs. This sampling was performed 60 times to create a diverse collection of negative sets with which to assess pKSEA performance, in line with previously published benchmarking of the
KSEA method105.
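One possible reading of this sampling strategy is sketched below in Python. For each condition, it draws as many "null" kinases as there are expected ("gold standard") kinases, sampling without replacement from gold-standard kinases that are not expected to change under that condition. This interpretation, the function names, and the seeding are assumptions made for illustration and may differ in detail from the procedure actually used.

```python
import random


def sample_negative_set(positive_pairs, rng):
    """Draw one null set of (kinase, condition) pairs.

    positive_pairs: set of (kinase, condition) gold-standard pairs.
    """
    all_kinases = sorted({kinase for kinase, _ in positive_pairs})
    expected_by_condition = {}
    for kinase, condition in positive_pairs:
        expected_by_condition.setdefault(condition, set()).add(kinase)
    negatives = set()
    for condition in sorted(expected_by_condition):
        expected = expected_by_condition[condition]
        candidates = [k for k in all_kinases if k not in expected]
        n_draw = min(len(expected), len(candidates))
        for kinase in rng.sample(candidates, n_draw):  # without replacement
            negatives.add((kinase, condition))
    return negatives


def sample_negative_sets(positive_pairs, n_sets=60, seed=0):
    """Repeat the sampling to obtain a collection of negative sets."""
    rng = random.Random(seed)
    return [sample_negative_set(positive_pairs, rng) for _ in range(n_sets)]
```

Repeating the draw 60 times, as described above, allows performance statistics to be summarized across negative sets rather than depending on a single random selection.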
To calculate pKSEA performance, thresholding was used at the bottom and upper ends of pKSEA significance scores to define positive identifications. pKSEA significant scores reflect the empirically calculated significance of an activity change score compared to permuted testing, with a score of 5 representing an observed activity change score lower than 95% of permuted scores, and a score of 95 representing an observed score higher than 95% of permuted scores. With a threshold of 1, kinases with a pKSEA score of greater than 99 would be flagged as significantly upregulated kinases and kinases with a pKSEA significance score of less than 1 would be flagged as significantly downregulated. Gold standard kinase-condition pairs were checked against calculated pKSEA scores, and were assigned as true positives or false negatives depending upon whether pKSEA correctly identified expected changes under given conditions. Null, random kinase-condition pairs were checked against calculated pKSEA scores and were assigned as false positives if pKSEA identified a randomly selected “null” kinase as regulated under the given condition, and as a true negative if pKSEA did not detect the randomly generated kinase-condition pair. Results of scoring assessments were averaged
across the 60 negative set comparisons to reduce bias. Thresholds from 0 to 100 were assessed to ensure complete coverage. ROC curves were generated, and test performance was reported as AUC and precision at recall 0.5. Median statistics and 95% CIs were calculated from the results of the 60 randomly generated negative sets.
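The scoring and thresholding procedure can be illustrated with the sketch below. It is not the pKSEA implementation: the function names are hypothetical, null pairs are assumed to count as false positives when flagged in either direction, the ROC endpoints (0, 0) and (1, 1) are added by assumption, and the AUC uses a simple trapezoidal rule. Precision at recall 0.5 is omitted for brevity.

```python
from statistics import median

def significance_score(observed, permuted):
    """Percent of permuted activity change scores below the observed score."""
    return 100.0 * sum(p < observed for p in permuted) / len(permuted)

def confusion_at(threshold, gold, null):
    """Count TP, FN, FP, TN at one threshold on the 0-100 score scale.

    gold: list of (score, direction); direction is +1 (expected up) or -1 (expected down).
    null: list of scores for randomly drawn kinase-condition pairs.
    """
    high, low = 100 - threshold, threshold
    tp = sum(1 for s, d in gold if (d > 0 and s > high) or (d < 0 and s < low))
    fp = sum(1 for s in null if s > high or s < low)
    return tp, len(gold) - tp, fp, len(null) - fp

def roc_points(gold, null, thresholds=range(0, 101)):
    """Collect (FPR, TPR) points over the threshold sweep, sorted by FPR."""
    points = {(0.0, 0.0), (1.0, 1.0)}  # assumed endpoints
    for t in thresholds:
        tp, fn, fp, tn = confusion_at(t, gold, null)
        points.add((fp / (fp + tn), tp / (tp + fn)))
    return sorted(points)

def auc(points):
    """Trapezoidal area under a sorted list of (FPR, TPR) points."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

def median_auc(gold, null_sets):
    """Median AUC across repeated negative sets (60 in this chapter)."""
    return median(auc(roc_points(gold, null)) for null in null_sets)
```

With a threshold of 1, `confusion_at` flags scores above 99 or below 1, matching the example given in the text.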
4.3.4 GBM data
GBM MS data were obtained from a set of 8 flash-frozen GBM tumor tissue samples, 4 samples each from the upper and lower quartiles of GBM survival for the Ohio Brain Tumor Study population (long-term survivors, LTS, and short-term survivors, STS).
Samples were obtained from the Ohio Brain Tumor Study and subjected to quality-control criteria consistent with or exceeding TCGA specifications, including tumor nuclei and necrotic content. Sample preparation and LC-MS/MS were performed by the Case Center for Proteomics and Bioinformatics. 50mg tumor samples were pulverized and lysed in
300uL of 2% SDS, 50mM Tris, with 3 cycles of probe sonication on ice in the presence of protease and phosphatase inhibitors. Following reduction with DTT and alkylation with iodoacetamide, FASP detergent removal protocol was performed using Amicon
Ultra 0.5mL 10 MWCO filters. Total protein was quantified and 800 mcg of total protein per sample were digested with LysC and Trypsin dual digestion at 1:30 enzyme to protein ratio overnight. Digestion was stopped with titration of sample to pH 3 using HCl, and
C18 cleanup was performed using MacroSpin Columns (The Nest Group). C18 samples were reconstituted using provided buffers from a TiO2 phosphopeptide enrichment kit
(Pierce), and phosphopeptide enrichment was performed according to kit protocols.
Phosphopeptides were eluted, dried, reconstituted in 0.1% TFA, and desalted again using the C18 cleanup procedure. Samples were reconstituted in 0.1% formic acid and analyzed using LC-MS/MS. 15uL of phosphopeptide-enriched sample was loaded onto a Waters nanoACQUITY ultra-high pressure LC system with a 4 hour gradient, followed by MS/MS performed on a Thermo-Finnigan LTQ-Orbitrap Velos. Peptide mapping was performed using Rosetta Elucidator, searching with Mascot against the UniProt database. Relative quantification of phosphopeptide abundance was calculated from the area under the curve of peptide precursors, normalized by median intensity. Summary statistics, including t-statistics and mean fold changes, were calculated across survival groups (STS/LTS).
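As a small illustration of median normalization, the sketch below divides each intensity in a sample by that sample's median detected intensity. The per-sample scope of the median, the handling of undetected peptides, and the function name are assumptions, since the normalization performed in the quantification software is not specified here in detail.

```python
from statistics import median

def median_normalize(sample):
    """Scale one sample's intensities by its median detected intensity.

    sample: dict mapping phosphopeptide -> precursor area (None if undetected).
    """
    observed = [v for v in sample.values() if v is not None]
    center = median(observed)
    # Undetected peptides stay missing so they can be imputed later.
    return {pep: (None if v is None else v / center) for pep, v in sample.items()}
```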
4.3.5 Xenograft data
To generate in vivo xenograft tumors, NCR nu/nu mice were injected subcutaneously in their flanks with H358 NSCLC cells. Upon tumor size reaching 300mm3, mice were randomized into treatment groups and treated with a single dose of vehicle (control),
24mg/kg DT-061 (small molecule activator of PP2A, SMAP), 5mg/kg AZD6244 (MEK inhibitor), or combination DT-061/AZD6244 delivered orally via gavage. Based on previous pharmacodynamic data demonstrating full engagement of AZD6244 with its inhibition target MEK after 3 hours, 3 hours was chosen as the time point for mouse sacrifice and tumor collection. 6 tumor samples were collected for each group, totaling 24 samples across control, DT-061, AZD6244, and combination treatment groups. Samples were flash frozen and delivered to the Case Center for Proteomics and Bioinformatics for processing and LC-MS/MS phosphoproteomics, where they were prepared and analyzed according to the protocols described above. Summary statistics, including t-statistics and mean fold changes, were calculated for each treatment group compared to control, as well as for the combination treatment group compared to each of the single agents.
4.3.6 Data Preparation (GBM and xenograft data)
Imputation of missing phosphoproteomics data was performed according to previously described pKSEA methods. The imputeLCMD R package was used to perform a two-part imputation. Phosphopeptides that were missing in a single sample were considered missing at random (MAR) and were imputed using a k-nearest neighbor approach. If a phosphopeptide was missing across multiple samples, it was considered to be missing not at random (MNAR), likely because it fell below the detection threshold, and was assigned the minimum detected value in that sample in a deterministic fashion. No significant biases or changes in abundance distributions were introduced by imputation.
To generate summary statistics for comparing phosphopeptide abundance across
treatment groups, Welch’s t-test was performed comparing specified groups on an individual phosphopeptide abundance basis. This resulted in summary statistic calculations for each analysis for each detected phosphopeptide.
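The two-part imputation and the per-phosphopeptide Welch's t-statistic can be sketched as below. This is an illustrative reimplementation under stated assumptions, not the imputeLCMD routines used in this work: neighbors are taken from fully observed phosphopeptides using Euclidean distance, `k` is arbitrary, and the sketch assumes at least one fully observed phosphopeptide and at least one detected value per sample.

```python
import math
from statistics import mean, variance

def impute_two_part(matrix, k=2):
    """matrix: dict peptide -> list of intensities (None = missing), one per sample."""
    n_samples = len(next(iter(matrix.values())))
    complete = {p: v for p, v in matrix.items() if None not in v}
    # Per-sample minimum detected value, used for peptides treated as MNAR.
    col_min = [min(v[j] for v in matrix.values() if v[j] is not None)
               for j in range(n_samples)]
    imputed = {}
    for pep, values in matrix.items():
        missing = [j for j, x in enumerate(values) if x is None]
        row = list(values)
        if len(missing) == 1:
            j = missing[0]

            def dist(other):
                # Distance over the samples in which this peptide was observed.
                return math.sqrt(sum((values[i] - other[i]) ** 2
                                     for i in range(n_samples) if i != j))

            # MAR: average the k nearest fully observed peptides.
            nearest = sorted(complete.values(), key=dist)[:k]
            row[j] = mean([o[j] for o in nearest])
        else:
            # MNAR (or nothing missing): fill with the sample minimum.
            for j in missing:
                row[j] = col_min[j]
        imputed[pep] = row
    return imputed

def welch_t(a, b):
    """Welch's t-statistic for two groups with unequal variances."""
    return (mean(a) - mean(b)) / math.sqrt(variance(a) / len(a) + variance(b) / len(b))
```

After imputation, `welch_t` would be applied to each phosphopeptide's values in the two groups being compared, yielding one summary statistic per phosphopeptide per comparison.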
4.3.7 pKSEA (GBM and xenograft data)
pKSEA analysis was performed on GBM data comparing short-term survivor (STS) to long-term survivor (LTS) groups. For purposes of examining contributions of in silico
predictions versus canonical substrates, filtering of analyzed phosphorylation sites was
performed by comparing observed phosphorylation sites to annotated phosphorylation
sites with known kinases in the PhosphositePlus database.
pKSEA analysis was performed on xenograft data comparing each single-agent treatment
group to vehicle control, as well as combination treatment to single-agents.
4.3.8 pKSEA Plotting
Heatmap figures of pKSEA significance scores were generated showing increased kinase activity scores (red) and decreased pKSEA significance scores (blue). pKSEA significance scores were included in figures as space allowed. For visual clarity,
“filtered” figures were generated by removing kinase rows that did not show significant changes in kinase activity in a specified number of experimental conditions. For these filtered figures, significance was defined as either a pKSEA significance score of less than 5 (an observed decreased activity change expected less than 5% of the time compared to permuted data) or greater than 95 (an observed increased activity change expected less than 5% of the time compared to permuted data). Filtered figures, provided for easier viewing of highlighted results, are otherwise identical to full heatmaps, which are provided for completeness and transparency.
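The row-filtering rule for the “filtered” heatmaps can be expressed compactly. The sketch below is illustrative only: the function names and the nested-dictionary layout are assumptions, and the kinase names and scores used with it are made up rather than taken from the data in this chapter.

```python
def is_significant(score, low=5, high=95):
    """True if a pKSEA significance score falls below `low` or above `high`."""
    return score < low or score > high

def filter_rows(scores, min_conditions=1):
    """Keep kinases that are significant in at least `min_conditions` columns.

    scores: dict kinase -> dict condition -> significance score (0-100).
    """
    return {kinase: row for kinase, row in scores.items()
            if sum(is_significant(s) for s in row.values()) >= min_conditions}
```

Setting `min_conditions=1` corresponds to a figure filtered for rows significant in at least one analysis, and `min_conditions=2` to one filtered for rows significant in at least two columns.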
Figure 4-1 Median ROC curve assessing pKSEA performance on benchmarking data. Curve shows median TPR versus FPR curve using defined set of gold standard kinase-condition pairs evaluated against 60 sets of randomly drawn negative kinase-conditions of equal size to the gold standard set. AUC = 0.80 (0.78-0.82 95% CI).
Figure 4-2 Median precision-recall curve assessing pKSEA performance on benchmarking data. Curve shows median precision (positive predictive value) versus recall (TPR) curve using defined set of gold standard kinase-condition pairs evaluated against 60 sets of randomly drawn negative kinase-conditions of equal size to the gold standard set. Precision at recall 0.5 = 0.86 (0.80-0.92 95%CI).
4.4 Results
4.4.1 pKSEA Benchmarking
In order to assess pKSEA performance, pKSEA was run on a previously published phosphoproteomic dataset, and benchmarking tests were performed according to similar parameters105. In using pKSEA significance scoring to identify significantly up- and downregulated kinase activities against a curated set of expected activity perturbations in response to kinase-stimulating and kinase-inhibiting conditions, pKSEA significance scoring yielded a median AUC of 0.80 (0.78-0.82 95% CI) and a precision of 0.86 (0.80-0.92 95% CI) at recall 0.5 (Figure 4-1, Figure 4-2). These results suggest that in these benchmarking data, pKSEA performed better than any previously tested KSEA-based method, including GSEA and Z-test based scoring (median AUC = 0.734 and 0.721, and median precision at recall 0.5 = 0.725 and 0.708, respectively). In order to quantify the contribution of in silico predictions to performance of pKSEA, pKSEA was performed on the same data while excluding from analysis any phosphorylation sites that have established kinases according to PhosphositePlus106.
Assessment of pKSEA based on predictions alone (excluding all known canonical
substrates) yielded an AUC of 0.62 (0.58-0.66 95% CI) and a precision of 0.66 (0.60-0.72
95% CI) at recall 0.5 (Figure 4-3, Figure 4-4). While relying solely on in silico
predictions is not a high-performance method of kinase activity inference in itself, these
findings suggest that pKSEA is able to include these in silico predictions into its overall
analysis in a manner that improves its performance when assessed as a whole.
Interestingly, when pKSEA analysis is restricted to only those substrates that have
canonical kinases, it still performs better than KSEA and near identically to pKSEA
performed on the full data, with a median AUC of 0.80 (0.78-0.82 95% CI) and precision value of 0.87 (0.82-0.92 95% CI) at recall 0.5 (Figure 4-5, Figure 4-6). In this benchmarking dataset, accurate identification of changes in kinase activity seems to be largely driven by canonical kinases. The improvement in performance over KSEA can perhaps be attributed to other aspects of pKSEA scoring, such as weighting by prediction, which may devalue potentially erroneous kinase-substrate relationships present in the
PhosphositePlus database in favor of kinase-substrate relationships with a higher probability of being a direct substrate as predicted by site sequence and PPI network proximity.
Figure 4-3 Median ROC curve assessing pKSEA performance on benchmarking data with all substrates with known kinases removed. Curve shows median TPR versus FPR curve using defined set of gold standard kinase-condition pairs evaluated against 60 sets of randomly drawn negative kinase-conditions of equal size to the gold standard set. AUC = 0.62 (0.58-0.66 95% CI).
Figure 4-4 Median precision-recall curve assessing pKSEA performance on benchmarking data with all substrates with known kinases removed. Curve shows median precision (positive predictive value) versus recall (TPR) curve using defined set of gold standard kinase-condition pairs evaluated against 60 sets of randomly drawn negative kinase-conditions of equal size to the gold standard set. Precision at recall 0.5 = 0.66 (0.60-0.72 95% CI).
Figure 4-5 Median ROC curve assessing pKSEA performance on benchmarking data restricted to only substrates with known kinases. Curve shows median TPR versus FPR curve using defined set of gold standard kinase-condition pairs evaluated against 60 sets of randomly drawn negative kinase-conditions of equal size to the gold standard set. AUC = 0.80 (0.78-0.82 95% CI).
Figure 4-6 Median precision-recall curve assessing pKSEA performance on benchmarking data restricted to only substrates with known kinases. Curve shows median precision (positive predictive value) versus recall (TPR) curve using defined set of gold standard kinase-condition pairs evaluated against 60 sets of randomly drawn negative kinase-conditions of equal size to the gold standard set. Precision at recall 0.5 = 0.87 (0.82-0.92 95% CI).
4.4.2 GBM Prognosis
pKSEA was then applied to data comparing the phosphoproteomes of GBM samples taken from long-term survivors (LTS) versus short-term survivors (STS) in order to characterize differential phosphorylation that might be related to GBM prognosis. Analysis was performed on the full dataset, on the subset restricted to canonical substrates found in PhosphositePlus, and on the subset containing no canonical substrates (predictions only), in order to characterize pKSEA performance and the relative contributions of canonical and prediction-based inference. The resulting heatmaps illustrate that the inferences made by pKSEA follow the direction of canonical substrates, with a few exceptions (Figure 4-7, Figure 4-8).
Compared to LTS, STS GBMs had increased inferred activity in a number of kinases, notably including MAPK1, GSK3alpha, PKCdelta and CDK1/5 (Figure 4-8). Of these, the inference of increased MAPK1 activity appears to be driven by canonical substrates, whereas the inferences of increased GSK3alpha and CDK5 activity are based on predictions alone. CDK1 and PKCdelta were inferred as increased in activity in both the canonical substrate set and the predictions-only set.
Decreased kinase activity inferences in STS compared to LTS included CK1/2alpha, NEK2, PAK1, CLK1, and MAP2K3, with canonical substrates driving predictions for CK1/2, NEK2, and MAP2K3, while PAK1, CLK1, and CK2a2 had similar decreases in kinase activity inference in both the canonical substrate set and the predictions-only set (Figure 4-8).
Figure 4-7 Heatmap of pKSEA significance score comparing inferred kinase activity differences between GBM STS and LTS survival groups. For comparison, results of pKSEA performed on the full dataset and subsets of the data including all substrates with at least one known kinase (KSEA substrates only) and all substrates without any known kinase (predictions only) are also plotted.
Figure 4-8 Heatmap of pKSEA significance score comparing inferred kinase activity differences between GBM STS and LTS survival groups, filtered for rows that were significant (above 95 or below 5 pKSEA significance score) for at least one analysis. For comparison, results of pKSEA performed on the full dataset and subsets of the data including all substrates with at least one known kinase (KSEA substrates only) and all substrates without any known kinase (predictions only) are also plotted.
4.4.3 Phosphatase activation and MEK inhibitor combination
pKSEA was also applied to a phosphoproteomic dataset characterizing phosphorylation
responses to phosphatase activation and kinase inhibition in H358 NSCLC xenograft
tumors. While most efforts in phosphoproteomic drug-discovery have so far focused on kinase inhibition56,97,130,131, recent efforts have begun exploring the roles of phosphatases
as anti-cancer targets. Protein phosphatase 2A (PP2A) is a highly expressed phosphatase
that is inactivated in cancer132,133 and that possesses a broad range of substrates, including
components of the AKT and MAPK signaling pathway134-137. Treatment with small
molecule activators (SMAPs) of PP2A has demonstrated anti-cancer activity in both in vitro and in vivo cancer models138,139, both as a single agent and in combination with
kinase inhibitors, suggesting that phosphatase activation could be a viable but relatively
unexplored avenue for anti-cancer therapy. In order to clarify the phosphorylation
signaling pathways that are responsible for mediating the anti-cancer effects of PP2A
activation, we performed pKSEA on H358 NSCLC xenograft tumors that were treated
with AZD6244, an inhibitor of MAP2K1/2 (MEK), and DT-061, a putative small molecule activator of PP2A (SMAP).
The results of pKSEA analysis on SMAP and MEK inhibitor combination treatment reveal both expected responses to MEK inhibition as well as unintuitive responses in combination treatment that require additional validation. H358 xenograft response to
MAP2K1/2 (MEK) inhibition with AZD6244 is consistent with expected responses, with
inferred kinase activity decreases in MAP2K1 and MAP2K2, AZD6244’s direct targets
of inhibition (Figure 4-10). Additionally, the single agent treatment group showed
downregulation of downstream kinases such as RSK2 and p70S6K. Interestingly, MEK inhibition of H358 xenografts also showed downregulation of tyrosine kinases that are understood to be upstream of or parallel to MEK, including Src and Met, as well as unexpected inhibition of AKT (PKBalpha). While this could be suspected to be due to relatively low detection and prediction of tyrosine phosphorylation sites, similar downregulation is also observed in the combination treatment. Single-agent DT-061 activation of PP2A showed unexpected results according to pKSEA analysis, with reduced phosphorylation of MAP2K6, CK1gamma1, Pim2, and CLK1 substrates, none of which were recapitulated in combination treatment (Figure 4-10). Notably, downregulation of AKT, which is expected upon PP2A activation, narrowly missed significance under SMAP treatment (pKSEA significance score 5.8). RSK2, PAK1/3, and PKBgamma displayed downregulation across both single-agent treatments as well as in combination.
Figure 4-9 Heatmap of pKSEA significance scores for H358 xenograft experiment. Mice with H358 xenografted tumors were treated with AZD6244, DT-061, or combination treatment, and phosphoproteomic responses were quantified and analyzed with pKSEA. AZD, Combination, and DT columns represent pKSEA analysis of treatment group compared to vehicle control group. Combo/DT and Combo/AZD columns represent pKSEA of combination treatment group compared to single agent treatment groups to assess respective contributions of each small molecule to combination kinase activity changes.
To explore the contributions of each small molecule to the combination treatment
phosphoproteomic response, pKSEA was performed comparing combination-treatment
phosphorylation against the single agent treatments as “controls” (Figure 4-9, Figure
4-10). Compared to AZD6244 alone, the addition of SMAP significantly reduced phosphorylation of MEK2, ERK1/2, and CDK2/3 substrates among others.
Paradoxically, these reductions were not observed when SMAP was administered as a single agent. In contrast, compared to DT-061 alone, the addition of AZD6244 largely mirrored the phosphorylation responses observed with AZD6244 alone, including downregulation of MEK1/2 and p70S6K. Interestingly, although ERK1/2 (MAPK3/1) are downstream of MEK and would be expected to be downregulated upon MEK inhibition, pKSEA reported that their substrates were not significantly decreased compared to control in either single-agent treatment, despite MEK being inhibited by AZD6244. In combination, however, pKSEA reported that ERK1 (MAPK3) substrates were significantly decreased compared to control and that both ERK1 and ERK2 (MAPK3/1) substrates were significantly decreased in combination treatment compared to either treatment alone.
These findings demonstrate the ability of pKSEA to bring an additional layer of
interpretability to phosphoproteomic discovery that is capable of identifying unintuitive
phosphorylation responses that are difficult to capture and appreciate when using targeted
approaches.
Figure 4-10 Heatmap of pKSEA significance scores for H358 xenograft experiment, filtered for rows (kinases) that were significant in at least two columns. Mice with H358 xenografted tumors were treated with AZD6244, DT-061, or combination treatment, and phosphoproteomic responses were quantified and analyzed with pKSEA. AZD, Combination, and DT columns represent pKSEA analysis of treatment group compared to vehicle control group. Combo/DT and Combo/AZD columns represent pKSEA of combination treatment group compared to single agent treatment groups to assess respective contributions of each small molecule to combination kinase activity changes.
4.5 Discussion
This study demonstrated the accuracy and predictive power of pKSEA using a previously
published benchmarking dataset, and then applied pKSEA to two example
phosphoproteomics experiments in cancer to characterize phosphorylation changes that may be useful as cancer biomarkers. pKSEA exhibited an improved ability to infer expected changes in kinase activity upon experimental perturbation compared to traditional KSEA methods.
Several factors may explain pKSEA’s improved performance. First, pKSEA
includes prioritization of substrates that better fit a “prediction profile” that may lower
the scoring contribution of substrates that are either not in close PPI network proximity or
do not fit with a kinase’s preference for phosphorylation site sequence. Because pKSEA
includes a weighting of scores in concordance with in silico predictions, its overall
inferences may be improved by lowering the relative contribution of erroneously
attributed direct substrates that KSEA is unable to distinguish. This finding would be in
concordance with previous findings that weighting KSEA scoring by kinase sequence
preferences can modestly improve performance. Second, pKSEA may also demonstrate
improved performance by incorporation of in silico predictions. When pKSEA was
performed on a restricted subset of benchmarking data that excluded all known canonical
substrates, it was still capable of inferring expected perturbations in kinase activity, albeit
at lower performance compared to the full dataset. While previous benchmarking efforts
have similarly shown that in silico predicted substrates perform poorly compared to in
vivo and in vitro validated substrates for kinase activity predictions, pKSEA has
demonstrated that inclusion of weighted predictions does not decrease overall
performance while offering the ability to detect kinase activity changes outside of
canonically defined substrate sets.
Despite the high performance of pKSEA in this benchmarking dataset, it bears
mentioning that the benchmarking dataset and set of “gold standard” kinase-condition
pairs are extremely limited. While both KSEA and pKSEA demonstrate ability to
accurately predict expected perturbations in kinase activity, the current benchmarking
dataset includes only 30 kinases with varying representation in benchmarking conditions,
possibly contributing sources of bias due to the fact that some kinases were tested under
fewer conditions. With only 30 kinases represented, the performance of KSEA and the
incorporation of in silico predictions with pKSEA cannot necessarily be extended to all
kinase activity inferences each method is capable of providing. Additional
phosphoproteomic datasets and an expansion of “gold standard” kinase-condition pairs can assist in further evaluation of these methods.
Despite these caveats, the high performance of pKSEA on current benchmarking data provides a basis of confidence for applying pKSEA to discover useful, novel changes in phosphorylation that can be of use as biomarkers in cancer research. Applied to clinical tumor samples obtained from GBM patients on the extremes of overall survival, pKSEA identified differential phosphorylation of substrates in a number of cancer-related kinases, including increased activity of CDK1/5, GSK3alpha, and MAPK1 in STS compared to LTS. Interestingly, the phosphoproteomic finding that MAPK1 has
increased activity is consistent with prior studies that implicated MAPK1 as
downregulated on the protein level in long-term GBM survivors140. The finding that
CDKs 1 and 5 are upregulated in STS is also of potential interest, as CDK1 mediates
pivotal tumorigenic events141 and CDK5 is dysregulated in a variety of cancers on the
protein and activity level and has been implicated in tumor growth, resistance to
treatment, invasion, immune evasion142, and angiogenesis143. The kinases that appear to be downregulated in STS compared to LTS are more difficult to interpret, as their functional roles in regulating tumor biology have not been as clearly defined, especially regarding how decreased kinase activity may contribute to decreased patient survival.
While further validation in a larger, independent dataset is necessary, these findings provide hypotheses that may both offer insight into biological factors that determine GBM survival and be of use as prognostic biomarkers.
pKSEA also identified potentially useful biomarkers in cancer by inferring changes in kinase activity in H358 NSCLC xenografts in response to novel treatment approaches including DT-061 activation of PP2A and its combination with AZD6244 MEK inhibition. While pKSEA detected expected perturbations such as MEK (MAP2K) inhibition in response to AZD6244, numerous unexpected and even paradoxical changes in kinase activity were also observed, including decreased phosphorylation of parallel pathways and upstream kinases. Additionally, MEK inhibition alone did not appear to significantly inhibit phosphorylation of ERK substrates, although addition of DT-061 in the combination treatment group did. Notably, DT-061 did not appear to have a significant additive effect, as the kinase activity changes observed in DT-061 were
generally not recapitulated in the combination treatment group. Rather, the results suggest that the increased efficacy of combination treatment of H358 xenograft tumors may arise through DT-061 potentiation of the effects of AZD6244, with decreases in MEK2, ERK, and CDK2/3 activity in combination treatment compared to AZD6244 alone.
These examples demonstrate the power of phosphoproteomic approaches, allowing for collection of systems-level phosphorylation data in experiments that are outside the scope of traditional methods. Phosphoproteomics findings necessarily require additional validation and careful consideration when drawing conclusions, but are capable of capturing unexpected, paradoxical biological responses that would not otherwise be considered or tested by targeted approaches. As the amount of high-quality biological data continues to grow, data-driven approaches may uncover emergent properties of biological systems, and additional work will be required to identify and interpret those properties. As a method for identifying phosphorylation signals and consolidating them into inferences of kinase activity, pKSEA shows promise in its ability to capture expected “gold standard” perturbations and novel phosphorylation signals in clinical samples as well as in a pharmaceutical development experimental setting.
Chapter 5
Conclusions and future directions
5.1 Conclusions
In this dissertation, I explored current and novel methods for biomarker discovery and demonstrated a variety of uses for cancer biomarkers, including disease prognosis and characterization of treatment responses. As new types of data become available, the new information provides a great opportunity to advance cancer research, provided that the analytical approaches are robust, accurate and up to date. In this work, I showed that focusing existing bioinformatics methods in a cancer type-specific manner can provide new insight into cancer survival and epigenetic dysregulation. Furthermore, I showed that applying novel bioinformatics approaches to the phosphoproteome can provide new insight into treatment responses and identify phosphorylation signals that are associated with survival and that may be targeted in the future.
In this work, I applied existing bioinformatics methods for estimating epigenetic age to an in-depth analysis of gliomas. Despite overall dysregulation that near-universally accelerates epigenetic aging in gliomas, a high level of association remains between the “universal” epigenetic clock designed by Horvath39 and epiTOC45, an epigenetic clock designed to reflect mitotic divisions in precancerous and cancerous cells. These associations were linked to prognostic molecular subtypes defined by mutation of IDH and codeletion of chromosome arms 1p and 19q, and could provide a method for identification of a subset of Pilocytic Astrocytoma (PA)-like gliomas according to epigenetic age. Additionally, epigenetic age was associated with survival, with IDH-wt tumors exhibiting overall younger epigenetic ages compared to IDH-mut tumors and with younger epigenetic age exhibiting an independent
association with poor survival. Lastly, study of matched primary-recurrent gliomas
demonstrated significant heterogeneity of epigenetic age even within the same tumor,
suggesting variable aging of gliomas between primary tumor diagnosis and later
recurrence. This study showed that epigenetic aging reflects key glioma molecular subtypes, and demonstrated that epigenetic age can be a useful biomarker in disease-specific contexts, despite the current lack of knowledge regarding the exact mechanisms of epigenetic aging.
In order to identify potential phosphorylation-based biomarkers at the phosphoproteomic level, I developed a novel method for inference of kinase activity. Bioinformatics methods should be accessible and applicable to as broad a user base as possible. Keeping this in mind, I designed the new method, prediction-based KSEA (pKSEA), to reflect the simplicity of an existing method, KSEA, while incorporating additional knowledge using bioinformatics tools in order to improve overall performance. pKSEA demonstrated improved performance over KSEA, while preserving an analysis platform that can be benchmarked using publicly available phosphoproteomic data. pKSEA was developed into an R package and is publicly available for researchers on
CRAN (https://cran.r-project.org/web/packages/pKSEA).
pKSEA was applied in this work to explore three phosphoproteomic datasets, illustrating
a broad utility for this type of phosphoproteomic tool. pKSEA was used to characterize
phosphoproteomic responses to proposed treatment combinations in cancer cells and
xenograft tumors, as well as to identify potential biomarkers of survival in GBM. When
applied to MDA-MB-231 breast cancer cells, pKSEA identified expected perturbations in kinase activity in response to the kinase inhibitors dasatinib and rapamycin. By comparing the combination treatment response to the responses of each kinase inhibitor used as a single agent, a profile of each kinase inhibitor’s effects on the MDA-MB-231 phosphoproteomic landscape could be estimated and reported in terms of changes in phosphorylation of a kinase’s predicted substrates. In addition to expected perturbations from the direct targets of kinase inhibitors, changes in the activities of other cancer signaling kinases were observed, including CDK1/3 and MAPK3/1 (ERK1/2). A comparison of phosphoproteomes between combination treatment and each single agent treatment shows increased downregulation of kinase activities that both agents commonly downregulated as well as complementary downregulation of inhibitor-specific targets that were recapitulated in the combination. These results suggest that dual-inhibitor combination treatment of MDA-MB-231 breast cancer cells can demonstrate increased efficacy over either single agent through both additive inhibition of common or convergent phosphorylation signaling and inhibition of multiple parallel signaling pathways.
Application of pKSEA to GBM samples drawn from long-term and short-term survivors revealed potential increases in the activity of MAPK1, GSK3alpha, PKCdelta and CDK1/5 in STS compared to LTS; of these, MAPK1 has previously been implicated at the protein expression level. When applied to H358 NSCLC xenografts, pKSEA identified paradoxical results in response to combination treatment with
AZD6244 and a novel small molecule activator of PP2A, DT-061. While reflecting expected inhibition of MAP2K (MEK) in response to AZD6244, pKSEA identified
downregulation in kinase activity of additional cancer-related kinases normally not considered downstream of MAP2K, including Src and Akt, which were recapitulated in combination treatment. Furthermore, while most direct phosphorylation changes in response to DT-061 treatment were not recapitulated in combination, DT-061 administered in combination with AZD6244 resulted in increased downregulation of
MAP2K2 (MEK2), MAPK1/3 (ERK), and CDK2/3 compared to AZD6244 alone, providing an indication of the signaling effects responsible for the synergistic efficacy of this combination treatment.
In this dissertation, I applied existing bioinformatics tools to explore epigenetic aging as a potential biomarker in glioma, and used bioinformatics to develop my own analysis method, which is now publicly available, for biomarker discovery in phosphoproteomics data. To demonstrate the utility of my phosphoproteomics analysis method, pKSEA, I further analyzed three novel datasets exploring different biomarker applications in cancer research, including prognosis, treatment response, and drug targeting and development.
5.2 Future directions
5.2.1 Epigenetic aging studies in other cancers
Applying epigenetic clocks to glioma demonstrated that cancer type-specific analysis of epigenetic aging can lead to potentially useful insight into cancer biology. Epigenetic age studies may be particularly informative for cancers in which epigenetic factors have already been shown to be important in pathogenesis, or that demonstrate interesting age-related incidence trends. Acute myeloid leukemia (AML) is the most common type of acute leukemia in adults, but it also has a bimodal incidence, appearing in early childhood and then most often in late adulthood. Additionally, epigenetic modifier mutations have been strongly implicated in AML144,145, including commonly observed mutations in DNMT3A, TET2, ASXL1 and IDH146, leading to epigenetic alterations of pathogenic significance and prognostic utility147. Studying epigenetic aging in AML can yield insight into differences in aging between pediatric and adult-onset AML, and into epigenetic dysregulation across highly heterogeneous malignancies such as AML, with potential additional prognostic significance. AML data available from TCGA and other publicly available sources will be used for preliminary analysis, with other published datasets available for validation148.
5.2.2 Expansion of kinase inference benchmarking
While the benchmarking study performed in this dissertation suggests relatively high performance of pKSEA, the benchmarking data employed were considerably limited. Because relatively few of the kinases available for inference in both the KSEA and pKSEA methods were represented, an expanded benchmark dataset is required to assess the performance of these methods across a wider range of kinases and kinase activating/deactivating conditions. The inclusion of our three phosphoproteomic datasets may serve as a small addition to the data available for future benchmarking, and additional phosphoproteomic data may be available in the literature upon deeper searching. Unpublished data may also be available as a resource from the Case Center for Proteomics and Bioinformatics with proper permissions. Phosphoproteomics experiments that carry expected kinase-condition perturbations (e.g., from kinase-specific inhibitors or activators) may serve as valuable contributions to benchmarking currently unevaluated kinase-conditions. pKSEA may in turn prove useful for analyzing these phosphoproteomic studies where existing analysis methods have not. Ideally, a library of phosphoproteomics data and associated kinase-condition perturbations could be compiled to assess kinase-activity inferences over the range of currently known kinases; however, generating such data de novo would require substantial resources.
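As a rough illustration of how an expanded benchmark might be scored, the following Python sketch computes a rank-based area under the ROC curve over kinase-condition pairs. It is a minimal sketch and not part of the pKSEA package: the function name roc_auc, the record layout, the placeholder kinase and condition names, and all numeric values are hypothetical and chosen only for illustration. It assumes that each pair carries an inferred activity change score and a flag for whether inhibition of that kinase is expected under that condition.

```python
def roc_auc(scores, labels):
    """Rank-based AUC: chance that a random positive outscores a random negative.

    Ties between a positive and a negative count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative pair")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical benchmark records (toy values, not results from this work):
# (kinase, condition, inferred activity change score, inhibition expected?)
benchmark = [
    ("KIN_A", "inhibitor_1", -2.1, True),
    ("KIN_B", "inhibitor_2", -1.4, True),
    ("KIN_C", "inhibitor_1", -1.8, False),
    ("KIN_D", "inhibitor_1", -0.3, False),
    ("KIN_E", "inhibitor_2", 0.2, False),
]

# For expected inhibition, a more negative score is stronger evidence,
# so the sign is flipped before ranking.
evidence = [-score for _, _, score, _ in benchmark]
labels = [expected for _, _, _, expected in benchmark]
auc = roc_auc(evidence, labels)
print(f"AUC over {len(benchmark)} kinase-condition pairs: {auc:.3f}")
```

With so few pairs a single misranked kinase shifts the estimate considerably, which is one way of seeing why a larger library of kinase-condition perturbations would be needed.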
5.2.3 Use of phosphoproteomic data to assess kinase-substrate predictions
An interesting consequence of using in silico kinase-substrate predictions to infer kinase-activity changes is that, in benchmarking the kinase activity inference method, the performance of the prediction method is also being assessed against experimental data. An ideal kinase-substrate prediction tool would identify, with high sensitivity and specificity, substrates of a particular kinase whose phosphorylation reflects that kinase’s activity when used in inference. If predicted substrates do not reflect expected perturbations in kinase activity, the predictions themselves are called into doubt. The findings from benchmarking, particularly when the analysis is restricted to predictions alone while excluding known substrates, can provide an evaluation of a given kinase-substrate prediction classifier across multiple experimental conditions. Feedback on performance on a kinase-by-kinase basis can identify specific kinase-substrate classifiers that perform better than others, as well as classifiers that perform poorly across the benchmarking dataset and need further refinement. Benchmarking performance could potentially even be incorporated as an experimental, data-driven contribution to refining kinase-substrate predictions.
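The kinase-by-kinase feedback described above could be tabulated along the following lines. This Python sketch is purely illustrative: the helper names per_kinase_hit_rate and flag_for_refinement, the 0.5 threshold, the placeholder kinase names, and the toy records are all assumptions rather than anything implemented in pKSEA. It assumes each benchmark record gives the expected direction of activity change (+1 or -1) and the score inferred from predicted substrates only.

```python
from collections import defaultdict


def per_kinase_hit_rate(records):
    """Fraction of conditions per kinase where the inferred score has the expected sign."""
    tally = defaultdict(lambda: [0, 0])  # kinase -> [hits, total]
    for kinase, expected_sign, score in records:
        tally[kinase][1] += 1
        if score * expected_sign > 0:
            tally[kinase][0] += 1
    return {kinase: hits / total for kinase, (hits, total) in tally.items()}


def flag_for_refinement(hit_rates, threshold=0.5):
    """Kinases whose classifiers fall below the (arbitrary) hit-rate threshold."""
    return sorted(k for k, rate in hit_rates.items() if rate < threshold)


# Hypothetical records: (kinase, expected direction, score from predicted substrates only)
records = [
    ("KIN_A", -1, -1.9),
    ("KIN_A", -1, -0.7),
    ("KIN_A", +1, 1.2),
    ("KIN_B", -1, 0.4),
    ("KIN_B", -1, -0.2),
    ("KIN_B", +1, -0.9),
]

hit_rates = per_kinase_hit_rate(records)
needs_work = flag_for_refinement(hit_rates)
print(hit_rates, needs_work)
```

A sign-agreement rate is a deliberately crude summary; a real evaluation would presumably also weigh the significance of each score and the number of conditions available per kinase.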
5.2.4 Improvement of pKSEA package and tool
pKSEA is currently published on CRAN as version 0.0.1 (https://cran.r-project.org/web/packages/pKSEA) and is available on GitHub (https://github.com/pll21/pKSEA). While the current functionality is adequate to generate kinase-activity inferences from phosphoproteomics data, its features can be significantly expanded in ways that would be useful for researchers interested in applying it to their own data. Currently, the output of pKSEA is primarily in data table format, requiring users to turn to another program for visualization (such as separate heatmap generation). Integrating a method for visualization of results, particularly across multiple experiments, could be a significant improvement to the package. More important, however, is the integration of features that will assist researchers in extracting intermediate data, such as the total number of predicted substrates included in a pKSEA analysis and substrate-level score contributions to kinase-activity inference. These changes would improve the utility and accessibility of pKSEA for researchers analyzing and interpreting phosphoproteomics data.
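To make the proposed intermediate outputs concrete, the sketch below shows one possible shape for such a report. It is written in Python for illustration only and does not reflect the R package's actual functions or scoring formula: the function kinase_report, the input layouts, the identifiers, and the toy numbers are hypothetical. It assumes, purely for the example, that each substrate contributes the product of its prediction score and its phosphosite change statistic.

```python
def kinase_report(predictions, site_changes):
    """Summarise, per kinase, how many predicted substrates were quantified
    and how much each one contributes to the kinase-level score.

    predictions:  {(kinase, site): prediction score}
    site_changes: {site: change statistic between conditions}
    """
    report = {}
    for (kinase, site), pred_score in predictions.items():
        if site not in site_changes:
            continue  # predicted substrate not observed in this experiment
        entry = report.setdefault(
            kinase, {"n_substrates": 0, "score": 0.0, "contributions": {}}
        )
        contribution = pred_score * site_changes[site]  # assumed weighting
        entry["n_substrates"] += 1
        entry["score"] += contribution
        entry["contributions"][site] = contribution
    return report


# Hypothetical inputs (toy identifiers and values)
predictions = {
    ("KIN_A", "P1_S10"): 0.8,
    ("KIN_A", "P2_T55"): 0.5,
    ("KIN_A", "P3_Y7"): 0.9,  # not quantified below, so excluded
    ("KIN_B", "P2_T55"): 0.4,
}
site_changes = {"P1_S10": -2.0, "P2_T55": -1.0}

report = kinase_report(predictions, site_changes)
for kinase, entry in sorted(report.items()):
    print(kinase, entry["n_substrates"], round(entry["score"], 2))
```

Exposing the per-site contributions alongside the kinase-level score would let a user see whether an inference rests on many modest changes or on a single strongly changing phosphosite.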
Bibliography
1 Goossens, N., Nakagawa, S., Sun, X. & Hoshida, Y. Cancer biomarker discovery and validation. Transl Cancer Res 4, 256-269, doi:10.3978/j.issn.2218- 676X.2015.06.04 (2015). 2 Henry, N. L. & Hayes, D. F. Cancer biomarkers. Mol Oncol 6, 140-146, doi:10.1016/j.molonc.2012.01.010 (2012). 3 Paik, S. et al. A multigene assay to predict recurrence of tamoxifen-treated, node- negative breast cancer. N Engl J Med 351, 2817-2826, doi:10.1056/NEJMoa041588 (2004). 4 Imperiale, T. F. et al. Multitarget stool DNA testing for colorectal-cancer screening. N Engl J Med 370, 1287-1297, doi:10.1056/NEJMoa1311194 (2014). 5 Flaherty, K. T. et al. Inhibition of mutated, activated BRAF in metastatic melanoma. N Engl J Med 363, 809-819, doi:10.1056/NEJMoa1002011 (2010). 6 Slamon, D. J. et al. Use of chemotherapy plus a monoclonal antibody against HER2 for metastatic breast cancer that overexpresses HER2. N Engl J Med 344, 783-792, doi:10.1056/NEJM200103153441101 (2001). 7 Stamey, T. A. et al. Prostate-specific antigen as a serum marker for adenocarcinoma of the prostate. N Engl J Med 317, 909-916, doi:10.1056/NEJM198710083171501 (1987). 8 Taube, S. E. et al. A perspective on challenges and issues in biomarker development and drug and biomarker codevelopment. J Natl Cancer Inst 101, 1453-1463, doi:10.1093/jnci/djp334 (2009). 9 Kulasingam, V. & Diamandis, E. P. Strategies for discovering novel cancer biomarkers through utilization of emerging technologies. Nat Clin Pract Oncol 5, 588-599, doi:10.1038/ncponc1187 (2008). 10 Ostrom, Q. T. et al. CBTRUS Statistical Report: Primary brain and other central nervous system tumors diagnosed in the United States in 2010-2014. Neuro Oncol 19, v1-v88, doi:10.1093/neuonc/nox158 (2017). 11 Gehrmann, J., Matsumoto, Y. & Kreutzberg, G. W. Microglia: intrinsic immuneffector cell of the brain. Brain Res Brain Res Rev 20, 269-287 (1995). 12 Wen, P. Y. & Reardon, D. A. Neuro-oncology in 2015: Progress in glioma diagnosis, classification and treatment. 
Nat Rev Neurol 12, 69-70, doi:10.1038/nrneurol.2015.242 (2016). 13 Ohgaki, H. & Kleihues, P. Genetic alterations and signaling pathways in the evolution of gliomas. Cancer Sci 100, 2235-2241, doi:10.1111/j.1349- 7006.2009.01308.x (2009). 14 Faulkner, C. et al. EGFR and EGFRvIII analysis in glioblastoma as therapeutic biomarkers. Br J Neurosurg, 1-7, doi:10.3109/02688697.2014.950631 (2014). 15 Hobbs, J. et al. Paradoxical relationship between the degree of EGFR amplification and outcome in glioblastomas. Am J Surg Pathol 36, 1186-1193, doi:10.1097/PAS.0b013e3182518e12 (2012). 16 Heimberger, A. B. et al. Prognostic effect of epidermal growth factor receptor and EGFRvIII in glioblastoma multiforme patients. Clin Cancer Res 11, 1462-1466, doi:10.1158/1078-0432.CCR-04-1737 (2005).
17 Montano, N. et al. Expression of EGFRvIII in glioblastoma: prognostic significance revisited. Neoplasia 13, 1113-1121 (2011). 18 Shinojima, N. et al. Prognostic value of epidermal growth factor receptor in patients with glioblastoma multiforme. Cancer Res 63, 6962-6970 (2003). 19 Greenall, S. A. et al. EGFRvIII-mediated transactivation of receptor tyrosine kinases in glioma: mechanism and therapeutic implications. Oncogene 34, 5277- 5287, doi:10.1038/onc.2014.448 (2015). 20 Hegi, M. E. et al. MGMT gene silencing and benefit from temozolomide in glioblastoma. N Engl J Med 352, 997-1003, doi:10.1056/NEJMoa043331 (2005). 21 Esteller, M. et al. Inactivation of the DNA-repair gene MGMT and the clinical response of gliomas to alkylating agents. N Engl J Med 343, 1350-1354, doi:10.1056/NEJM200011093431901 (2000). 22 Minniti, G. et al. Correlation between O6-methylguanine-DNA methyltransferase and survival in elderly patients with glioblastoma treated with radiotherapy plus concomitant and adjuvant temozolomide. J Neurooncol 102, 311-316, doi:10.1007/s11060-010-0324-4 (2011). 23 Malmström, A. et al. Temozolomide versus standard 6-week radiotherapy versus hypofractionated radiotherapy in patients older than 60 years with glioblastoma: the Nordic randomised, phase 3 trial. Lancet Oncol 13, 916-926, doi:10.1016/S1470-2045(12)70265-6 (2012). 24 Wick, W. et al. Temozolomide chemotherapy alone versus radiotherapy alone for malignant astrocytoma in the elderly: the NOA-08 randomised, phase 3 trial. Lancet Oncol 13, 707-715, doi:10.1016/S1470-2045(12)70164-X (2012). 25 Verhaak, R. G. et al. Integrated genomic analysis identifies clinically relevant subtypes of glioblastoma characterized by abnormalities in PDGFRA, IDH1, EGFR, and NF1. Cancer Cell 17, 98-110, doi:10.1016/j.ccr.2009.12.020 (2010). 26 Phillips, H. S. et al. Molecular subclasses of high-grade glioma predict prognosis, delineate a pattern of disease progression, and resemble stages in neurogenesis. 
Cancer Cell 9, 157-173, doi:10.1016/j.ccr.2006.02.019 (2006). 27 Noushmehr, H. et al. Identification of a CpG island methylator phenotype that defines a distinct subgroup of glioma. Cancer Cell 17, 510-522, doi:10.1016/j.ccr.2010.03.017 (2010). 28 Ceccarelli, M. et al. Molecular Profiling Reveals Biologically Discrete Subsets and Pathways of Progression in Diffuse Glioma. Cell 164, 550-563, doi:10.1016/j.cell.2015.12.028 (2016). 29 Turcan, S. et al. IDH1 mutation is sufficient to establish the glioma hypermethylator phenotype. Nature 483, 479-483, doi:10.1038/nature10866 (2012). 30 Eckel-Passow, J. E. et al. Glioma Groups Based on 1p/19q, IDH, and TERT Promoter Mutations in Tumors. N Engl J Med 372, 2499-2508, doi:10.1056/NEJMoa1407279 (2015). 31 Louis, D. N. et al. The 2016 World Health Organization Classification of Tumors of the Central Nervous System: a summary. Acta Neuropathol 131, 803-820, doi:10.1007/s00401-016-1545-1 (2016).
32 Stetson, L. C., Dazard, J. E. & Barnholtz-Sloan, J. S. Protein Markers Predict Survival in Glioma Patients. Mol Cell Proteomics 15, 2356-2365, doi:10.1074/mcp.M116.060657 (2016). 33 Baylin, S. B. & Jones, P. A. A decade of exploring the cancer epigenome - biological and translational implications. Nat Rev Cancer 11, 726-734, doi:10.1038/nrc3130 (2011). 34 Cruickshanks, H. A. et al. Senescent cells harbour features of the cancer epigenome. Nat Cell Biol 15, 1495-1506, doi:10.1038/ncb2879 (2013). 35 Teschendorff, A. E. et al. Age-dependent DNA methylation of genes that are suppressed in stem cells is a hallmark of cancer. Genome Res 20, 440-446, doi:10.1101/gr.103606.109 (2010). 36 Klutstein, M., Moss, J., Kaplan, T. & Cedar, H. Contribution of epigenetic mechanisms to variation in cancer risk among tissues. Proc Natl Acad Sci U S A 114, 2230-2234, doi:10.1073/pnas.1616556114 (2017). 37 Weidner, C. I. et al. Aging of blood can be tracked by DNA methylation changes at just three CpG sites. Genome Biol 15, R24, doi:10.1186/gb-2014-15-2-r24 (2014). 38 Bocklandt, S. et al. Epigenetic predictor of age. PLoS One 6, e14821, doi:10.1371/journal.pone.0014821 (2011). 39 Horvath, S. DNA methylation age of human tissues and cell types. Genome Biol 14, R115, doi:10.1186/gb-2013-14-10-r115 (2013). 40 Christensen, B. C. et al. Aging and environmental exposures alter tissue-specific DNA methylation dependent upon CpG island context. PLoS Genet 5, e1000602, doi:10.1371/journal.pgen.1000602 (2009). 41 Balducci, L. & Ershler, W. B. Cancer and ageing: a nexus at several levels. Nat Rev Cancer 5, 655-662, doi:10.1038/nrc1675 (2005). 42 Fraga, M. F., Agrelo, R. & Esteller, M. Cross-talk between aging and cancer: the epigenetic language. Ann N Y Acad Sci 1100, 60-74, doi:10.1196/annals.1395.005 (2007). 43 Campisi, J. Aging, cellular senescence, and cancer. Annu Rev Physiol 75, 685- 705, doi:10.1146/annurev-physiol-030212-183653 (2013). 44 Hannum, G. et al. 
Genome-wide methylation profiles reveal quantitative views of human aging rates. Mol Cell 49, 359-367, doi:10.1016/j.molcel.2012.10.016 (2013). 45 Yang, Z. et al. Correlation of an epigenetic mitotic clock with cancer risk. Genome Biol 17, 205, doi:10.1186/s13059-016-1064-3 (2016). 46 Curtius, K. et al. A Molecular Clock Infers Heterogeneous Tissue Age Among Patients with Barrett's Esophagus. PLoS Comput Biol 12, e1004919, doi:10.1371/journal.pcbi.1004919 (2016). 47 Agrawal, S. et al. DNA methylation of tumor suppressor genes in clinical remission predicts the relapse risk in acute myeloid leukemia. Cancer Res 67, 1370-1377, doi:10.1158/0008-5472.CAN-06-1681 (2007). 48 Bullinger, L. et al. Quantitative DNA methylation predicts survival in adult acute myeloid leukemia. Blood 115, 636-642, doi:10.1182/blood-2009-03-211003 (2010).
49 Xu, Z. & Taylor, J. A. Genome-wide age-related DNA methylation changes in blood and other tissues relate to histone modification, expression and cancer. Carcinogenesis 35, 356-364, doi:10.1093/carcin/bgt391 (2014). 50 Marioni, R. E. et al. DNA methylation age of blood predicts all-cause mortality in later life. Genome Biol 16, 25, doi:10.1186/s13059-015-0584-6 (2015). 51 Cohen, P. The origins of protein phosphorylation. Nat Cell Biol 4, E127-130, doi:10.1038/ncb0502-e127 (2002). 52 Johnson, L. N. & Barford, D. The effects of phosphorylation on the structure and function of proteins. Annu Rev Biophys Biomol Struct 22, 199-232, doi:10.1146/annurev.bb.22.060193.001215 (1993). 53 Tarrant, M. K. & Cole, P. A. The chemical biology of protein phosphorylation. Annu Rev Biochem 78, 797-825, doi:10.1146/annurev.biochem.78.070907.103047 (2009). 54 Brognard, J. & Hunter, T. Protein kinase signaling networks in cancer. Curr Opin Genet Dev 21, 4-11, doi:10.1016/j.gde.2010.10.012 (2011). 55 Fleuren, E. D., Zhang, L., Wu, J. & Daly, R. J. The kinome 'at large' in cancer. Nat Rev Cancer 16, 83-98, doi:10.1038/nrc.2015.18 (2016). 56 Cohen, P. Protein kinases--the major drug targets of the twenty-first century? Nat Rev Drug Discov 1, 309-315, doi:10.1038/nrd773 (2002). 57 Hanahan, D. & Weinberg, R. A. Hallmarks of cancer: the next generation. Cell 144, 646-674, doi:10.1016/j.cell.2011.02.013 (2011). 58 Collett, M. S. & Erikson, R. L. Protein kinase activity associated with the avian sarcoma virus src gene product. Proc Natl Acad Sci U S A 75, 2021-2024 (1978). 59 Ushiro, H. & Cohen, S. Identification of phosphotyrosine as a product of epidermal growth factor-activated protein kinase in A-431 cell membranes. J Biol Chem 255, 8363-8365 (1980). 60 Gross, S., Rahal, R., Stransky, N., Lengauer, C. & Hoeflich, K. P. Targeting cancer with kinase inhibitors. J Clin Invest 125, 1780-1789, doi:10.1172/JCI76094 (2015). 61 Zack, T. I. et al. 
Pan-cancer patterns of somatic copy number alteration. Nat Genet 45, 1134-1140, doi:10.1038/ng.2760 (2013). 62 Beroukhim, R. et al. The landscape of somatic copy-number alteration across human cancers. Nature 463, 899-905, doi:10.1038/nature08822 (2010). 63 Lawrence, M. S. et al. Mutational heterogeneity in cancer and the search for new cancer-associated genes. Nature 499, 214-218, doi:10.1038/nature12213 (2013). 64 Lents, N. H., Keenan, S. M., Bellone, C. & Baldassare, J. J. Stimulation of the Raf/MEK/ERK cascade is necessary and sufficient for activation and Thr-160 phosphorylation of a nuclear-targeted CDK2. J Biol Chem 277, 47469-47475, doi:10.1074/jbc.M207425200 (2002). 65 Guertin, D. A. & Sabatini, D. M. Defining the role of mTOR in cancer. Cancer Cell 12, 9-22, doi:10.1016/j.ccr.2007.05.008 (2007). 66 Zoncu, R., Efeyan, A. & Sabatini, D. M. mTOR: from growth signal integration to cancer, diabetes and ageing. Nat Rev Mol Cell Biol 12, 21-35, doi:10.1038/nrm3025 (2011). 67 Rodriguez-Viciana, P. et al. Phosphatidylinositol-3-OH kinase as a direct target of Ras. Nature 370, 527-532, doi:10.1038/370527a0 (1994).
68 Roberts, P. J. & Der, C. J. Targeting the Raf-MEK-ERK mitogen-activated protein kinase cascade for the treatment of cancer. Oncogene 26, 3291-3310, doi:10.1038/sj.onc.1210422 (2007). 69 Davies, H. et al. Mutations of the BRAF gene in human cancer. Nature 417, 949- 954, doi:10.1038/nature00766 (2002). 70 Millis, S. Z., Ikeda, S., Reddy, S., Gatalica, Z. & Kurzrock, R. Landscape of Phosphatidylinositol-3-Kinase Pathway Alterations Across 19 784 Diverse Solid Tumors. JAMA Oncol 2, 1565-1573, doi:10.1001/jamaoncol.2016.0891 (2016). 71 Arora, A. & Scholar, E. M. Role of tyrosine kinase inhibitors in cancer therapy. J Pharmacol Exp Ther 315, 971-979, doi:10.1124/jpet.105.084145 (2005). 72 Kantarjian, H. et al. Hematologic and cytogenetic responses to imatinib mesylate in chronic myelogenous leukemia. N Engl J Med 346, 645-652, doi:10.1056/NEJMoa011573 (2002). 73 Ranson, M. et al. ZD1839, a selective oral epidermal growth factor receptor- tyrosine kinase inhibitor, is well tolerated and active in patients with solid, malignant tumors: results of a phase I trial. J Clin Oncol 20, 2240-2250, doi:10.1200/JCO.2002.10.112 (2002). 74 Chapman, P. B. et al. Improved survival with vemurafenib in melanoma with BRAF V600E mutation. N Engl J Med 364, 2507-2516, doi:10.1056/NEJMoa1103782 (2011). 75 Holohan, C., Van Schaeybroeck, S., Longley, D. B. & Johnston, P. G. Cancer drug resistance: an evolving paradigm. Nat Rev Cancer 13, 714-726, doi:10.1038/nrc3599 (2013). 76 Logue, J. S. & Morrison, D. K. Complexity in the signaling network: insights from the use of targeted inhibitors in cancer therapy. Genes Dev 26, 641-650, doi:10.1101/gad.186965.112 (2012). 77 Engelman, J. A. et al. MET amplification leads to gefitinib resistance in lung cancer by activating ERBB3 signaling. Science 316, 1039-1043, doi:10.1126/science.1141478 (2007). 78 Hatzivassiliou, G. et al. ERK inhibition overcomes acquired resistance to MEK inhibitors. 
Mol Cancer Ther 11, 1143-1154, doi:10.1158/1535-7163.MCT-11- 1010 (2012). 79 Flaherty, K. T. et al. Combined BRAF and MEK inhibition in melanoma with BRAF V600 mutations. N Engl J Med 367, 1694-1703, doi:10.1056/NEJMoa1210093 (2012). 80 Hoeflich, K. P. et al. In vivo antitumor activity of MEK and phosphatidylinositol 3-kinase inhibitors in basal-like breast cancer models. Clin Cancer Res 15, 4649- 4664, doi:10.1158/1078-0432.CCR-09-0317 (2009). 81 Bonner, W. M. et al. GammaH2AX and cancer. Nat Rev Cancer 8, 957-967, doi:10.1038/nrc2523 (2008). 82 Nishizuka, S. et al. Diagnostic markers that distinguish colon and ovarian adenocarcinomas: identification by genomic, proteomic, and tissue array profiling. Cancer Res 63, 5243-5250 (2003). 83 Paweletz, C. P. et al. Reverse phase protein microarrays which capture disease progression show activation of pro-survival pathways at the cancer invasion front. Oncogene 20, 1981-1989, doi:10.1038/sj.onc.1204265 (2001).
84 Yu, G. et al. Overexpression of phosphorylated mammalian target of rapamycin predicts lymph node metastasis and prognosis of chinese patients with gastric cancer. Clin Cancer Res 15, 1821-1829, doi:10.1158/1078-0432.CCR-08-2138 (2009). 85 Kreisberg, J. I. et al. Phosphorylation of Akt (Ser473) is an excellent predictor of poor clinical outcome in prostate cancer. Cancer Res 64, 5232-5236, doi:10.1158/0008-5472.CAN-04-0272 (2004). 86 Cloughesy, T. F. et al. Antitumor activity of rapamycin in a Phase I trial for patients with recurrent PTEN-deficient glioblastoma. PLoS Med 5, e8, doi:10.1371/journal.pmed.0050008 (2008). 87 Vazquez-Martin, A., Oliveras-Ferraros, C., Colomer, R., Brunet, J. & Menendez, J. A. Low-scale phosphoproteome analyses identify the mTOR effector p70 S6 kinase 1 as a specific biomarker of the dual-HER1/HER2 tyrosine kinase inhibitor lapatinib (Tykerb) in human breast carcinoma cells. Ann Oncol 19, 1097-1109, doi:10.1093/annonc/mdm589 (2008). 88 Serrels, A. et al. Identification of potential biomarkers for measuring inhibition of Src kinase activity in colon cancer cells following treatment with dasatinib. Mol Cancer Ther 5, 3014-3022, doi:10.1158/1535-7163.MCT-06-0382 (2006). 89 Kanshin, E., Michnick, S. & Thibault, P. Sample preparation and analytical strategies for large-scale phosphoproteomics experiments. Semin Cell Dev Biol 23, 843-853, doi:10.1016/j.semcdb.2012.05.005 (2012). 90 Jünger, M. A. & Aebersold, R. Mass spectrometry-driven phosphoproteomics: patterning the systems biology mosaic. Wiley Interdiscip Rev Dev Biol 3, 83-112, doi:10.1002/wdev.121 (2014). 91 Roux, P. P. & Thibault, P. The coming of age of phosphoproteomics--from large data sets to inference of protein functions. Mol Cell Proteomics 12, 3453-3464, doi:10.1074/mcp.R113.032862 (2013). 92 Rikova, K. et al. Global survey of phosphotyrosine signaling identifies oncogenic kinases in lung cancer. Cell 131, 1190-1203, doi:10.1016/j.cell.2007.11.025 (2007). 93 Kolch, W. 
& Pitt, A. Functional proteomics to dissect tyrosine kinase signalling pathways in cancer. Nat Rev Cancer 10, 618-629, doi:10.1038/nrc2900 (2010). 94 Andersen, J. N. et al. Pathway-based identification of biomarkers for targeted therapeutics: personalized oncology with PI3K pathway inhibitors. Sci Transl Med 2, 43ra55, doi:10.1126/scitranslmed.3001065 (2010). 95 Liu, Y. & Chance, M. R. Integrating phosphoproteomics in systems biology. Comput Struct Biotechnol J 10, 90-97, doi:10.1016/j.csbj.2014.07.003 (2014). 96 Lienhard, G. E. Non-functional phosphorylations? Trends Biochem Sci 33, 351- 352, doi:10.1016/j.tibs.2008.05.004 (2008). 97 Munk, S., Refsgaard, J. C., Olsen, J. V. & Jensen, L. J. From Phosphosites to Kinases. Methods Mol Biol 1355, 307-321, doi:10.1007/978-1-4939-3049-4_21 (2016). 98 Casado, P. et al. Kinase-substrate enrichment analysis provides insights into the heterogeneity of signaling pathway activation in leukemia cells. Sci Signal 6, rs6, doi:10.1126/scisignal.2003573 (2013).
99 Mischnik, M. et al. IKAP: A heuristic framework for inference of kinase activities from Phosphoproteomics data. Bioinformatics 32, 424-431, doi:10.1093/bioinformatics/btv699 (2016). 100 Yang, P. et al. Knowledge-Based Analysis for Detecting Key Signaling Events from Time-Series Phosphoproteomics Data. PLoS Comput Biol 11, e1004403, doi:10.1371/journal.pcbi.1004403 (2015). 101 Yang, P. et al. KinasePA: Phosphoproteomics data annotation using hypothesis driven kinase perturbation analysis. Proteomics 16, 1868-1871, doi:10.1002/pmic.201600068 (2016). 102 Wilkes, E. H., Casado, P., Rajeeve, V. & Cutillas, P. R. Kinase activity ranking using phosphoproteomics data (KARP) quantifies the contribution of protein kinases to the regulation of cell viability. Mol Cell Proteomics 16, 1694-1704, doi:10.1074/mcp.O116.064360 (2017). 103 Ochoa, D. et al. An atlas of human kinase regulation. Mol Syst Biol 12, 888 (2016). 104 Wiredja, D. D., Koyutürk, M. & Chance, M. R. The KSEA App: a web-based tool for kinase activity inference from quantitative phosphoproteomics. Bioinformatics, doi:10.1093/bioinformatics/btx415 (2017). 105 Hernandez-Armenta, C., Ochoa, D., Gonçalves, E., Saez-Rodriguez, J. & Beltrao, P. Benchmarking substrate-based kinase activity inference using phosphoproteomic data. Bioinformatics, doi:10.1093/bioinformatics/btx082 (2017). 106 Hornbeck, P. V. et al. PhosphoSitePlus, 2014: mutations, PTMs and recalibrations. Nucleic Acids Res 43, D512-520, doi:10.1093/nar/gku1267 (2015). 107 Horn, H. et al. KinomeXplorer: an integrated platform for kinome biology studies. Nat Methods 11, 603-604, doi:10.1038/nmeth.2968 (2014). 108 Obenauer, J. C., Cantley, L. C. & Yaffe, M. B. Scansite 2.0: Proteome-wide prediction of cell signaling interactions using short sequence motifs. Nucleic Acids Res 31, 3635-3641 (2003). 109 Song, C. et al. Systematic analysis of protein phosphorylation networks from phosphoproteomic data. 
Mol Cell Proteomics 11, 1070-1083, doi:10.1074/mcp.M111.012625 (2012). 110 Safaei, J., Maňuch, J., Gupta, A., Stacho, L. & Pelech, S. Prediction of 492 human protein kinase substrate specificities. Proteome Sci 9 Suppl 1, S6, doi:10.1186/1477-5956-9-S1-S6 (2011). 111 Linding, R. et al. Systematic discovery of in vivo phosphorylation networks. Cell 129, 1415-1426, doi:10.1016/j.cell.2007.05.052 (2007). 112 Obata, T. et al. Peptide and protein library screening defines optimal substrate motifs for AKT/PKB. J Biol Chem 275, 36108-36115, doi:10.1074/jbc.M005497200 (2000). 113 Hutti, J. E. et al. A rapid method for determining protein kinase phosphorylation specificity. Nat Methods 1, 27-29, doi:10.1038/nmeth708 (2004). 114 Miller, M. L. et al. Linear motif atlas for phosphorylation-dependent signaling. Sci Signal 1, ra2, doi:10.1126/scisignal.1159433 (2008).
115 Franceschini, A. et al. STRING v9.1: protein-protein interaction networks, with increased coverage and integration. Nucleic Acids Res 41, D808-815, doi:10.1093/nar/gks1094 (2013). 116 Bibikova, M. et al. High density DNA methylation array with single CpG site resolution. Genomics 98, 288-295, doi:10.1016/j.ygeno.2011.07.007 (2011). 117 Guintivano, J., Aryee, M. J. & Kaminsky, Z. A. A cell epigenotype specific model for the correction of brain cellular heterogeneity bias and its application to age, brain region and major depression. Epigenetics 8, 290-302, doi:10.4161/epi.23924 (2013). 118 Mazor, T. et al. DNA Methylation and Somatic Mutations Converge on the Cell Cycle and Define Similar Evolutionary Histories in Brain Tumors. Cancer Cell 28, 307-317, doi:10.1016/j.ccell.2015.07.012 (2015). 119 Aryee, M. J. et al. Minfi: a flexible and comprehensive Bioconductor package for the analysis of Infinium DNA methylation microarrays. Bioinformatics 30, 1363- 1369, doi:10.1093/bioinformatics/btu049 (2014). 120 Sturm, D. et al. Hotspot mutations in H3F3A and IDH1 define distinct epigenetic and biological subgroups of glioblastoma. Cancer Cell 22, 425-437, doi:10.1016/j.ccr.2012.08.024 (2012). 121 Mur, P. et al. Codeletion of 1p and 19q determines distinct gene methylation and expression profiles in IDH-mutated oligodendroglial tumors. Acta Neuropathol 126, 277-289, doi:10.1007/s00401-013-1130-9 (2013). 122 Bai, H. et al. Integrated genomic characterization of IDH1-mutant glioma malignant progression. Nat Genet 48, 59-66, doi:10.1038/ng.3457 (2016). 123 van Buuren, S. & Groothuis-Oudshoorn, C. G. M. mice: Multivariate Imputation by Chained Equations in R. 45 (2011). 124 Horvath, S. Erratum to: DNA methylation age of human tissues and cell types. Genome Biol 16, 96, doi:10.1186/s13059-015-0649-6 (2015). 125 Cohen, A. L., Holmen, S. L. & Colman, H. IDH1 and IDH2 mutations in gliomas. Curr Neurol Neurosci Rep 13, 345, doi:10.1007/s11910-013-0345-4 (2013). 126 Walid, M. S. 
Prognostic factors for long-term survival after glioblastoma. Perm J 12, 45-48 (2008). 127 Yori, J. L. et al. Combined SFK/mTOR inhibition prevents rapamycin-induced feedback activation of AKT and elicits efficient tumor regression. Cancer Res 74, 4762-4771, doi:10.1158/0008-5472.CAN-13-3627 (2014). 128 Tomechko, S. E. et al. Tissue specific dysregulated protein subnetworks in type 2 diabetic bladder urothelium and detrusor muscle. Mol Cell Proteomics 14, 635- 645, doi:10.1074/mcp.M114.041863 (2015). 129 Montero, J. C., Seoane, S., Ocaña, A. & Pandiella, A. Inhibition of SRC family kinases and receptor tyrosine kinases by dasatinib: possible combinations in solid tumors. Clin Cancer Res 17, 5546-5552, doi:10.1158/1078-0432.CCR-10-2616 (2011). 130 Pan, C., Olsen, J. V., Daub, H. & Mann, M. Global effects of kinase inhibitors on signaling networks revealed by quantitative phosphoproteomics. Mol Cell Proteomics 8, 2796-2808, doi:10.1074/mcp.M900285-MCP200 (2009).
131 Gnad, F. et al. Systems-wide analysis of K-Ras, Cdc42, and PAK4 signaling by quantitative phosphoproteomics. Mol Cell Proteomics 12, 2070-2080, doi:10.1074/mcp.M112.027052 (2013). 132 Cristóbal, I. et al. PP2A impaired activity is a common event in acute myeloid leukemia and its activation by forskolin has a potent anti-leukemic effect. Leukemia 25, 606-614, doi:10.1038/leu.2010.294 (2011). 133 Neviani, P. et al. The tumor suppressor PP2A is functionally inactivated in blast crisis CML through the inhibitory activity of the BCR/ABL-regulated SET protein. Cancer Cell 8, 355-368, doi:10.1016/j.ccr.2005.10.015 (2005). 134 Kuo, Y. C. et al. Regulation of phosphorylation of Thr-308 of Akt, cell proliferation, and survival by the B55alpha regulatory subunit targeting of the protein phosphatase 2A holoenzyme to Akt. J Biol Chem 283, 1882-1892, doi:10.1074/jbc.M709585200 (2008). 135 Oaks, J. J. et al. Antagonistic activities of the immunomodulator and PP2A- activating drug FTY720 (Fingolimod, Gilenya) in Jak2-driven hematologic malignancies. Blood 122, 1923-1934, doi:10.1182/blood-2013-03-492181 (2013). 136 Ruvolo, P. P. et al. Low expression of PP2A regulatory subunit B55α is associated with T308 phosphorylation of AKT and shorter complete remission duration in acute myeloid leukemia patients. Leukemia 25, 1711-1717, doi:10.1038/leu.2011.146 (2011). 137 Zhou, B., Wang, Z. X., Zhao, Y., Brautigan, D. L. & Zhang, Z. Y. The specificity of extracellular signal-regulated kinase 2 dephosphorylation by protein phosphatases. J Biol Chem 277, 31818-31825, doi:10.1074/jbc.M203969200 (2002). 138 Sangodkar, J. et al. Activation of tumor suppressor protein PP2A inhibits KRAS- driven tumor growth. J Clin Invest 127, 2081-2090, doi:10.1172/JCI89548 (2017). 139 McClinch, K. et al. Small molecule activators of protein phosphatase 2A for the treatment of castration-resistant prostate cancer. Cancer Res, doi:10.1158/0008- 5472.CAN-17-0123 (2018). 140 Patel, V. N. et al. 
Network signatures of survival in glioblastoma multiforme. PLoS Comput Biol 9, e1003237, doi:10.1371/journal.pcbi.1003237 (2013). 141 Diril, M. K. et al. Cyclin-dependent kinase 1 (Cdk1) is essential for cell division and suppression of DNA re-replication but not for liver regeneration. Proc Natl Acad Sci U S A 109, 3826-3831, doi:10.1073/pnas.1115201109 (2012). 142 Dorand, R. D. et al. Cdk5 disruption attenuates tumor PD-L1 expression and promotes antitumor immunity. Science 353, 399-403, doi:10.1126/science.aae0477 (2016). 143 Pozo, K. & Bibb, J. A. The Emerging Role of Cdk5 in Cancer. Trends Cancer 2, 606-618, doi:10.1016/j.trecan.2016.09.001 (2016). 144 Ley, T. J. et al. Genomic and epigenomic landscapes of adult de novo acute myeloid leukemia. N Engl J Med 368, 2059-2074, doi:10.1056/NEJMoa1301689 (2013). 145 Abdel-Wahab, O. & Levine, R. L. Mutations in epigenetic modifiers in the pathogenesis and therapy of acute myeloid leukemia. Blood 121, 3563-3572, doi:10.1182/blood-2013-01-451781 (2013).
146 Marcucci, G. et al. IDH1 and IDH2 gene mutations identify novel molecular subsets within de novo cytogenetically normal acute myeloid leukemia: a Cancer and Leukemia Group B study. J Clin Oncol 28, 2348-2355, doi:10.1200/JCO.2009.27.3730 (2010). 147 Figueroa, M. E. et al. DNA methylation signatures identify biologically distinct subtypes in acute myeloid leukemia. Cancer Cell 17, 13-27, doi:10.1016/j.ccr.2009.11.020 (2010). 148 Kelly, A. D. et al. A CpG island methylator phenotype in acute myeloid leukemia independent of IDH mutations and associated with a favorable outcome. Leukemia 31, 2011-2019, doi:10.1038/leu.2017.12 (2017).