ARTICLE

https://doi.org/10.1038/s41467-019-13528-0 OPEN Pan-cancer molecular subtypes revealed by mass-spectrometry-based proteomic characterization of more than 500 cancers

Fengju Chen1, Darshan S. Chandrashekar2,3, Sooryanarayana Varambally2,3,4 & Chad J. Creighton 1,5,6,7*

Mass-spectrometry-based proteomic profiling of human cancers has the potential for pan-cancer analyses to identify molecular subtypes and associated pathway features that might be otherwise missed using transcriptomics. Here, we classify 532 cancers, representing five tissue-based types (breast, colon, ovarian, renal, uterine), into ten proteome-based, pan-cancer subtypes that cut across tumor lineages. The proteome-based subtypes are observable in external cancer proteomic datasets surveyed. Gene signatures of oncogenic or metabolic pathways can further distinguish between the subtypes. Two distinct subtypes both involve the immune system, one associated with the adaptive immune response and T-cell activation, and the other associated with the humoral immune response. Two additional subtypes each involve the tumor stroma, one of these including the collagen VI interacting network. Three additional proteome-based subtypes, respectively involving proteins related to Golgi apparatus, hemoglobin complex, and endoplasmic reticulum, were not reflected in previous transcriptomics analyses. A data portal is available at the UALCAN website.

1 Dan L. Duncan Comprehensive Cancer Center Division of Biostatistics, Baylor College of Medicine, Houston, TX, USA. 2 Comprehensive Cancer Center, University of Alabama at Birmingham, Birmingham, AL 35233, USA. 3 Molecular and Cellular Pathology, Department of Pathology, University of Alabama at Birmingham, Birmingham, AL 35233, USA. 4 The Informatics Institute, University of Alabama at Birmingham, Birmingham, AL 35233, USA. 5 Department of Bioinformatics and Computational Biology, The University of Texas MD Anderson Cancer Center, Houston, TX, USA. 6 Sequencing Center, Baylor College of Medicine, Houston, TX 77030, USA. 7 Department of Medicine, Baylor College of Medicine, Houston, TX, USA. *email: [email protected]

NATURE COMMUNICATIONS | (2019) 10:5679 | https://doi.org/10.1038/s41467-019-13528-0 | www.nature.com/naturecommunications 1 ARTICLE NATURE COMMUNICATIONS | https://doi.org/10.1038/s41467-019-13528-0

In this age of advanced molecular-profiling technologies, cancer molecular subtype discovery has been one of the more common exercises utilizing transcriptomic or proteomic data on human tumors. Molecular subtypes can deepen our understanding of cancer as representing a collection of diseases rather than a single disease. Molecular subtypes can provide insights into the pathways appearing deregulated within tumor subsets, which may suggest therapeutic opportunities, as well as being indicative of what pathways, as characterized in the experimental setting, would seem particularly relevant in the human disease setting. Historically, most subtype discovery studies in cancer have involved transcriptomic rather than proteomic data, as the advent of DNA microarrays over 20 years ago began the widespread use of transcriptomics among laboratories1. In contrast, proteomics is typically more challenging at a technical level and requires dedicated laboratories with the right expertise. In a recent study, using transcriptome data by RNA sequencing (RNA-seq) from The Cancer Genome Atlas (TCGA) consortium, we classified more than 10,000 cancers, representing 32 major types, into 10 molecular-based classes that cut across tumor lineages and cancer types2. At the same time, while protein abundance levels typically correlate with those of the corresponding mRNA, widespread discordant expression patterns between protein and mRNA are also observable3.

The Clinical Proteomic Tumor Analysis Consortium (CPTAC) aims to accelerate the understanding of the molecular basis of cancer through the application of proteomic technologies and workflows to clinical tumor samples4. While past TCGA consortium efforts involved targeted proteomics involving a set of ~180–250 protein features5,6, CPTAC proteomic profiling has been mass-spectrometry-based, profiling on the order of more than 10,000 protein features. Initial studies led by CPTAC performed mass-spectrometry-based profiling on a subset of cases from TCGA (involving breast, colorectal, and ovarian cancers), which allowed for integrative analysis studies between protein expression and other data types including mRNA and mutation7–9. Subsequent Confirmatory and Discovery datasets generated by CPTAC profile cancer cases not represented in TCGA10,11, with these data, representing several tissue-based cancer types, being made publicly available to the research community for secondary analyses.

Mass-spectrometry-based proteomic-profiling data of human cancers, as recently provided by CPTAC for hundreds of cases, have the potential for pan-cancer analyses to identify molecular subtypes and associated pathway features that previous transcriptomics analyses might have missed. Whereas the initial CPTAC-led marker analysis studies were each focused on a specific cancer type7–11, our present study aims to define pan-cancer proteome-based subtypes that would transcend tissue-based type across the CPTAC cohorts. We also examine cancer proteomic and transcriptomic datasets outside of CPTAC, for patterns of manifestation of our proteome-based subtypes. For each of the subtypes identified, we explore the top differential features for associated functional classes and pathways, to help us understand what these subtypes represent.

Results

CPTAC mass-spectrometry-based proteomic tumor datasets. The primary focus of this study was on a set of mass-spectrometry-based proteomic profiles from the CPTAC Confirmatory/Discovery cohort, of 532 cancer cases (cases not represented in TCGA), comprising 125 breast cancer cases, 97 colon10, 100 ovarian, 110 renal (primarily clear cell renal cell carcinoma)11, and 100 uterine (Supplementary Data S1). This CPTAC Confirmatory/Discovery dataset was used to define proteome-based pan-cancer subtypes cutting across tissue-specific differences between the cancer types. Additional proteome and transcriptome datasets of TCGA cancer cases, which shared no cases with the CPTAC Confirmatory/Discovery cohort, were used to independently confirm the observations initially made using the CPTAC Confirmatory/Discovery dataset (Fig. 1a). In the CPTAC-TCGA proteome dataset, a subset of 364 cases from TCGA, involving breast, colorectal, and ovarian cancers (105, 90, and 169 cases, respectively; Supplementary Data S1), was also profiled by mass-spectrometry-based proteomics7–9. TCGA RPPA dataset represented 7663 TCGA cases and 31 cancer types profiled for a focused panel of proteins (involving 211 protein

[Figure 1: panel a, Venn diagram of cases shared among the TCGA cohorts (CPTAC-TCGA, 364 cases; TCGA RPPA, 7663 cases, 31 cancer types; TCGA pan32 mRNA, 10,224 cases, 32 cancer types), shown alongside the CPTAC Confirmatory/Discovery cohort (532 cases: 125 breast, 97 colon, 100 ovarian, 110 renal, 100 uterine); panel b, Venn diagram of gene features shared among the CPTAC Confirmatory/Discovery (12,247 total proteins), TCGA RPPA (169 unique proteins), and TCGA pan32 mRNA (20,531 mRNAs) datasets, with total proteins tallied by number of cancer types with detection.]

Fig. 1 Proteomic and transcriptomic human tumor datasets and associated gene features used in this study. a We used the CPTAC Confirmatory/Discovery dataset, of 532 cancer cases (cases not represented in TCGA)10 for proteome-based, pan-cancer subtype discovery (see Fig. 2). For the following datasets of TCGA cases, we classified cases according to the CPTAC Confirmatory/Discovery-based subtypes (see Fig. 3): CPTAC-TCGA dataset (TCGA cases for which mass-spectrometry-based proteomic profiling by CPTAC was carried out), TCGA RPPA dataset (TCGA cases profiled for a focused panel of proteins by RPPA platform), and TCGA pan32 mRNA dataset (TCGA cases with RNA-sequencing data). For TCGA datasets, Venn diagram represents shared cases. b Among CPTAC Confirmatory/Discovery, TCGA RPPA, and TCGA pan32 mRNA datasets, numbers of shared gene features (protein or mRNA levels). CPTAC proteomic and TCGA transcriptomic data, as provided by their respective public data portals, were processed at the gene level, rather than at the protein isoform or mRNA transcript levels. See also Supplementary Data S1 and Supplementary Fig. 1.
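The tallies of proteins by number of cancer types with detection (panel b) amount to a cumulative count over per-gene detection sets. The following is a minimal sketch of that bookkeeping, assuming a hypothetical mapping from gene symbol to the set of cancer types in which the protein was detected; the gene names, the toy data, and the function name are illustrative only and are not taken from the study.

```python
from collections import Counter

# Hypothetical toy data: gene symbol -> cancer types with detection.
# These entries are illustrative only, not values from the CPTAC data.
detected_in = {
    "GENE_A": {"breast", "colon", "ovarian", "renal", "uterine"},
    "GENE_B": {"breast", "colon", "ovarian"},
    "GENE_C": {"renal"},
    "GENE_D": {"breast", "uterine"},
}


def cumulative_detection_counts(detected_in, n_types=5):
    """Return {k: number of genes detected in at least k cancer types}."""
    # Count how many genes were detected in exactly n cancer types.
    per_gene = Counter(len(types) for types in detected_in.values())
    # Accumulate: a gene detected in n types counts toward every k <= n.
    return {
        k: sum(count for n, count in per_gene.items() if n >= k)
        for k in range(1, n_types + 1)
    }


counts = cumulative_detection_counts(detected_in)
for k, n_genes in counts.items():
    print(f"detected in {k}+ cancer types: {n_genes} genes")
```

Applied to the real per-gene detection sets, the entry for k = 3 would correspond to the count of genes with detection in at least three of the five cancer types.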

features6 or 169 unique proteins) by antibody-based reverse-phase protein array (RPPA) platform12. TCGA pan32 mRNA dataset featured 10,224 cases and 32 cancer types with RNA-sequencing data. The CPTAC Confirmatory/Discovery total protein dataset represented a total of 12,247 genes, of which 9764 had detection in at least three of the five cancer types (Fig. 1b). For all proteome and transcriptome datasets, we normalized expression values within each cancer type, whereby neither tissue-dominant differences nor inter-laboratory batch effects would drive the downstream analyses2,5,13.

Previous work had identified 10 molecular-based pan-cancer classes based on transcriptome data from TCGA2. Using a previously defined classifier of 854 mRNAs2, we found these transcriptome-based pan-cancer classes to be reflected in the proteome, based on analysis of both CPTAC-TCGA and CPTAC Confirmatory/Discovery datasets (Supplementary Fig. 1a–c). Within the CPTAC-TCGA cohort, correlations between mRNA and protein expression across tumors were generally positive for the proteins surveyed, though with Pearson’s correlation r-values much less than 1 (Supplementary Fig. 1a).

De novo proteome-based pan-cancer molecular subtypes. We sought to determine what molecular subtypes would be discoverable from the proteome, as opposed to pan-cancer subtypes previously identified using the transcriptome2. We used mass-spectrometry-based proteomic data from the CPTAC Confirmatory/Discovery cohort to define 10 different subtypes of cancer (Table 1 and Supplementary Fig. 2a–d) across the 532 cases and five tissue-based cancer types represented. We found several of these proteome-based subtypes to highly overlap with specific mRNA-based pan-cancer classes2 as applied to the same cohort (Fig. 2a and Supplementary Fig. 1c). The 10 proteome-based pan-cancer molecular subtypes, referred to here as k1 through k10, were each characterized by widespread molecular patterns (Fig. 2b and Supplementary Fig. 2e, f and Supplementary Data S2 and S3). For each proteome-based subtype, we identified the top proteins most differentially expressed in the given subtype versus the rest of the tumors (Fig. 2b), including both total and phospho-protein features (Supplementary Fig. 3a, b).

We sought to further explore the relationship between our previously defined mRNA-based pan-cancer classes2 and the proteome-based subtypes of the present study. We assigned each transcriptome profile from TCGA pan32 cohort (n = 10,224 cases and 32 major cancer types) a CPTAC-based pan-cancer subtype. This involved mapping expression values from the top 100 over-expressed proteins for each subtype (Fig. 2b, 1000 proteins in total) to the corresponding normalized mRNA values in TCGA dataset. Similar to the above observations in the CPTAC Confirmatory/Discovery cohort (Fig. 2a), several proteome-based subtypes overlapped significantly (one-sided Fisher’s exact test) with specific mRNA-based pan-cancer classes (Fig. 2c). In particular, proteome-based versus mRNA-based subtyping2 associations, respectively, included (proteome-based) k1 to (mRNA-based) c1, k2 to both c3 and c10, k3 to c3, k4 to c5, k5 to c6, k6 to c7, and k7 to c8. The previously identified neuroendocrine-associated c4 class2 was not well represented among the proteome-based subtypes (Supplementary Fig. 1c). We might attribute this to the most common c4-associated cancer types in TCGA (e.g., cervical, head-and-neck, lung squamous, bladder) not being represented in the CPTAC cohorts. We found each of the proteome-based pan-cancer subtypes to span cases from multiple cancer types (Fig. 2d), with the notable exceptions of k4, which represented the basal-like breast cancer molecular subtype (Supplementary Fig. 4a), and k9, which consisted entirely of clear cell renal cell carcinoma cases. Within the top differentially expressed proteins underscoring each pan-cancer subtype, specific gene categories (by Gene Ontology, or GO, annotation) were over-represented (Fig. 2e and Supplementary Data S4), associations that we explore further below. Differentially expressed proteins within k2, k3, k6, and k7 subtypes in particular represented components of the tumor microenvironment (Supplementary Fig. 4b), which we also explore further below.

In summary, regarding de novo subtype discovery, here we identified 10 pan-cancer subtypes in CPTAC Confirmatory/Discovery cohort, using mass-spectrometry-based proteomics data. To some extent, seven of these ten subtypes are reflected in previously described mRNA-based, pan-cancer subtypes (Table 1).

Proteome-based subtypes are observable in external datasets. We examined proteomic and transcriptomic datasets external to the CPTAC Confirmatory/Discovery dataset, for evidence of the manifestation of our proteome-based pan-cancer subtypes. For this we used the top set of 1000 total proteins distinguishing between our 10 subtypes (Fig. 2b) as a classifier (consisting of the top 100 over-expressed proteins per subtype). The subtypes were

Table 1 Proteome-based pan-cancer molecular subtypes in CPTAC Confirmatory/Discovery cohort.

Subtype | n (%) | Associated TCGA mRNA-based class | Description and notable features
k1 | 22 (4.1) | c1 | Over-expression of proteasome complex proteins, glycolysis proteins, and pentose phosphate pathway proteins.
k2 | 38 (7.1) | c3, c10 | Adaptive immune system-related; associated with T-cell activation; expression of major histocompatibility complex proteins.
k3 | 54 (10.2) | c3, c10 | Innate immune system-related; over-expression of complement system proteins; involvement of eosinophils, neutrophils, mast cells, and macrophages; hypoxia signature.
k4 | 29 (5.5) | c5 | Represents basal-like breast cancer; over-expression of YAP1 and MYC targets.
k5 | 113 (21.2) | c6 | Epithelial signature; normoxia signature; over-expression of YAP1 and MYC targets; over-expression of oxidative phosphorylation and TCA cycle proteins.
k6 | 61 (11.5) | c7 | Stromal-related; over-expression of matrix metallopeptidases; Wnt and Notch pathway signatures; hypoxia signature.
k7 | 95 (17.9) | c8 | Stromal-related; over-expression of collagen VI proteins; Wnt and Notch pathway signatures.
k8 | 43 (8.1) | – | Over-expression of Golgi apparatus-related proteins; Ras pathway signature.
k9 | 29 (5.5) | – | Found in clear cell renal cell carcinoma cases only; over-expression of hemoglobin complex proteins.
k10 | 48 (9.0) | – | Over-expression of endoplasmic reticulum-related proteins and steroid biosynthesis pathway proteins.
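The associations between proteome-based subtypes and mRNA-based classes listed in Table 1 rest on one-sided Fisher's exact tests of overlap. The following is a minimal sketch of such a test for the k4 and c5 pairing, using counts read from panel a of Fig. 2 (27 cases assigned to both k4 and c5, 29 k4 cases, 39 c5 cases, 532 cases in total). The function name and the pure-Python hypergeometric implementation are illustrative choices made here and are not the authors' code.

```python
from math import comb


def fisher_one_sided(a, b, c, d):
    """One-sided (enrichment) Fisher's exact test for the 2x2 table
    [[a, b], [c, d]]: the probability of an overlap of at least `a`
    under the hypergeometric null with fixed margins."""
    row1 = a + b            # cases in the proteome-based subtype
    col1 = a + c            # cases in the mRNA-based class
    total = a + b + c + d   # all cases in the cohort
    max_overlap = min(row1, col1)
    # Sum hypergeometric probabilities for overlaps of size a or larger.
    numerator = sum(
        comb(row1, x) * comb(total - row1, col1 - x)
        for x in range(a, max_overlap + 1)
    )
    return numerator / comb(total, col1)


# Counts for k4 versus c5 (panel a of Fig. 2): 27 cases in both,
# 2 in k4 only, 12 in c5 only, and 491 in neither (532 in total).
p_k4_c5 = fisher_one_sided(27, 2, 12, 491)
print(f"k4 vs c5 one-sided p-value: {p_k4_c5:.3g}")
```

With about two shared cases expected by chance (29 × 39 / 532), an observed overlap of 27 gives a very small p-value, in line with the k4 to c5 association reported in the table.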


[Figure 2: panel a, contingency table of CPTAC protein-based class assignments (k1 through k10, columns) versus TCGA mRNA-based pan32 class assignments (c1 through c10, rows) for the CPTAC Confirmatory/Discovery cohort; panel b, heat maps of 1000 class-specific proteins and 500 phospho-proteins across the CPTAC Confirmatory/Discovery proteomic profiles (n = 532), with cancer type annotated (breast, colon, ovarian, renal, uterine); panel c, contingency table of CPTAC protein-based versus TCGA mRNA-based pan32 class assignments for the TCGA pan32 cohort; panel d, distribution of proteome-based subtypes within each cancer type; panel e, GO term enrichment within the top differential proteins for each subtype. Color scales: differential expression (−1 SD lower to +1 SD higher) and relative enrichment (−log10 p-value).]

observable in the independent CPTAC-TCGA cohort (Fig. 3a and Supplementary Figs. 4c, d and 5a–c), which we had not used in the subtype discovery. In addition to our classifying the TCGA pan32 mRNA profiles according to proteome-based subtype (Fig. 2c and Supplementary Fig. 6a), we also classified 7694 TCGA cases with RPPA data according to proteome-based pan-cancer subtype, based on protein expression (Fig. 3b). Evidence of the presence of specific proteome-based subtypes in the external cohorts came from observations of subtype calls as made by different data platforms showing significant overlap, e.g., CPTAC-TCGA proteomic subtyping based on mass-spectrometry versus RPPA (Fig. 3c), and TCGA pan32 subtyping based on mRNA versus RPPA (Fig. 3d). Where subtypes did not significantly overlap between data platforms (e.g., k1 and k9 subtypes), we might attribute this to a number of factors. These factors would include differences of the cohort considered from that of CPTAC Confirmatory/Discovery (e.g., CPTAC-TCGA had no renal cases, which were exclusively associated with k9 subtype) and platform-specific differences12 (e.g., RPPA platform had few available features specific to k8–k10 subtypes, Fig. 3b). We could identify the k4 basal-like breast cancer subtype in external cohorts; still, as the associated proteomic classifier was enriched for cell cycle proteins (Fig. 2e), other tumors that were not breast could also manifest a k4-associated signature pattern. To some degree, several proteome-based subtypes appeared manifested in vitro in cancer cell lines (Supplementary Fig. 6b–d). Permutation testing


Fig. 2 De novo pan-cancer molecular subtypes as defined by mass-spectrometry-based proteomics. a By ConsensusClusterPlus32 of 532 proteomic profiles in the CPTAC Confirmatory/Discovery cohort, 10 proteome-based subtypes (k1 through k10) were identified (columns). For these same cases, pan-cancer class assignments (c1 through c10) based on the previous pan32 mRNA-based discovery2 were also made (rows, mapping the previous pan32 mRNA classifier to CPTAC protein expression patterns). Significances of overlap between the two sets of classifications are represented. P-values by one-sided Fisher’s exact test. b For CPTAC Confirmatory/Discovery cohort, differential expression patterns (values normalized within each tissue-based cancer type; SD, standard deviation from the median) for a set of 1000 proteins (top heat map) and for a set of 500 phospho-proteins (bottom heat map) found to best distinguish between the 10 proteome-based subtypes (see the “Methods” section, top 100 over-expressed proteins for each subtype). Proteins highlighted by name have GO annotation “cell surface receptor signaling pathway” and DrugBank association (lists provide examples of differentially expressed proteins, but these would not necessarily have more importance than the other proteins in the heat map; full lists provided in Supplementary Data S2 and S3). c For the TCGA pan32 cohort (n = 10,224 cases), we made CPTAC-based pan-cancer subtype assignments (columns, mapping the CPTAC protein expression patterns to TCGA mRNA patterns). Significances of overlap between the CPTAC-based subtypes (columns, k1 through k10) and the previous pan32 mRNA-based pan-cancer class assignments2 (rows, c1 through c10) are represented. P-values by one-sided Fisher’s exact test. d For each cancer type represented in CPTAC Confirmatory/Discovery cohort, distributions by proteome-based subtype. e For the top over-expressed proteins associated with each subtype (from part b, top panel), represented categories by GO were assessed, with selected enriched categories represented here. P-values by one-sided Fisher’s exact test. See also Supplementary Figs. 2–4 and Supplementary Data S2–S4.

demonstrated that the overall strengths of the proteome-based subtype associations in the external datasets were non-random (Supplementary Fig. 7a, b).

In summary, regarding analysis of external profiling datasets of TCGA cases in the context of our proteome-based pan-cancer subtypes, we find most of these subtypes to be observable in TCGA cohort, when considering cases profiled using mass-spectrometry-based proteomics (CPTAC-TCGA), RPPA, or RNA-seq data platforms.

Pathway-level differences across proteome-based subtypes. To gain insight into pathways that would distinguish between the various proteome-based pan-cancer subtypes, we applied several pathway-associated gene signatures to CPTAC proteomic expression profiles, as well as to TCGA mRNA expression profiles (Fig. 4a). Many pathways appeared more or less active for different pan-cancer subtypes, as further explored below. Overall, there was broad correspondence in patterns observed between the CPTAC proteomic and TCGA mRNA datasets, indicative of associations that would span multiple cohorts and molecular levels, including the associations of hypoxia with k3 and k6 tumors, YAP1 and MYC targets with k4 and k5 tumors, and Wnt and Notch pathways with k6 and k7 tumors. However, there were some notable differences between protein-based and mRNA-based results as well. For example, proteins involved in fatty acid metabolism, glycolysis and gluconeogenesis, pentose phosphate pathway, and tricarboxylic acid (TCA) cycle all were elevated in k1 tumors in CPTAC Confirmatory/Discovery dataset but not in the TCGA pan32 mRNA dataset (Fig. 4a and Supplementary Fig. 8). As another example, oxidative phosphorylation genes appeared elevated at the mRNA level but not at the protein level in k1 tumors. The previously identified c1 pan-cancer class, associated in this study with k1 (Fig. 2c), did, however, show elevation of all of the above metabolic pathways at the mRNA level2. As our previous study highlighted the c1, c6, and c8 mRNA-based pan-cancer classes as involving metabolism pathways2, the k1, k5, and k7 proteome-based subtypes were examined in this present study in a similar manner, by pathway diagram (Fig. 4b). This diagram highlighted both common and distinctive patterns involving individual proteins and mRNAs, including the overall patterns described above.

In summary, regarding associated pathways, we find that each proteome-based subtype is characterized by distinctive pathway-level alterations and enriched functional gene categories. Pathway-level differences include those involving metabolism.

Immune-related differences across proteome-based subtypes. The k2 and k3 proteome-based subtypes both involved the immune system, including protein expression patterns attributable to immune cell infiltrates, with k2 associating with the adaptive immune response and T-cell activation, and with k3 associating with the humoral immune response. The k2 and k3 subtypes were each associated with both the c3 and c10 immune-related mRNA-based classes2 (Fig. 2a, c). However, while c3 and c10 appeared similar to each other overall, k2 and k3 each involved differential protein expression patterns and associated gene categories that revealed the two subtypes as being quite distinct from each other (Fig. 2b, e). Differentially over-expressed proteins in both k2 and k3 involved GO annotation categories of “regulation of immune response,” “immune system process,” and “leukocyte activation.” Also, k2 but not k3 was enriched for other categories including “T cell activation,” “adaptive immune response,” and “MHC protein binding”; and k3 was enriched for different categories that included “humoral immune response,” “vesicle-mediated transport,” and “complement activation.”

For CPTAC Confirmatory/Discovery dataset, visualization of the expression patterns for a select set of 162 immune-related proteins, including proteins involving the above GO categories, further demonstrated both common and distinctive expression patterns involving k2 and k3, with a number of the proteins being expressed specifically in immune-related tissues according to Human Protein Atlas14 (Fig. 5a). Analysis of both gene signatures of infiltrating immune cell types15 (Fig. 5b) and canonical immune cell markers (Fig. 5c) also revealed differences between k2 and k3. For example, k2 was enriched for signatures and markers of T-cells and antigen-presenting cells, and k3 was enriched for signatures and markers of macrophages, mast cells, eosinophils, and neutrophils. Activation of the complement system in k3 tumors involved both classical and alternative pathways (Fig. 5d). In general, the above associations were also observable in the independent CPTAC-TCGA proteomic and TCGA pan32 mRNA datasets, although k2 and k3 differences were not as distinguishable at the mRNA level (Supplementary Fig. 9a–c).

In summary, regarding immune system-related differences, k2 subtype involved the adaptive immune response and T-cell activation, and k3 subtype involved the humoral immune response. Proteins that distinguish k2 from k3 subtypes include markers of T-cells (high in k2) and markers of mast cells, neutrophils, or macrophages (all high in k3), as well as complement system pathway proteins (high in k3). These distinctions between k2 and k3 were not evident in previous mRNA-based subtyping2, where k2 and k3 tumors associated together as a single group.

Stroma-related differences across proteome-based subtypes. The k6 and k7 proteome-based subtypes both involved the tumor


[Figure 3: panel a, heat maps of 757 class-specific proteins for the CPTAC Confirmatory/Discovery (n = 532) and CPTAC-TCGA (n = 364) proteomic profiles, with cancer type annotated (breast, colorectal, ovarian); panel b, heat maps of 99 class-specific proteins for the CPTAC Confirmatory/Discovery (n = 532) and TCGA RPPA (n = 7663) proteomic profiles, with TCGA cancer type annotated; panel c, contingency table of CPTAC-TCGA protein-based class assignments (columns) versus TCGA RPPA-based class assignments (rows); panel d, contingency table of TCGA RPPA-based class assignments (columns) versus TCGA mRNA-based class assignments (rows). Color scales: differential expression (−1 SD lower to +1 SD higher) and relative enrichment (−log10 p-value).]

stroma, including protein expression patterns likely attributable to non-cancer cells and the tumor microenvironment2. The k6 and k7 subtypes were, respectively, associated with the c7 and c8 stroma-related mRNA-based classes2 (Fig. 2a, c), further reinforcing the notion of the distinctions between these subtypes representing various biological roles for the stromal component in human cancer2. Differentially over-expressed proteins in both k6 and k7 involved GO annotation categories (Fig. 2e) of “extracellular matrix,” “cell adhesion,” “collagen binding,” and “basement membrane.” Also, k6 but not k7 was enriched for other categories including “growth factor binding,” and k7 was enriched for different categories that included “muscle contraction” and “cytoskeleton.” For CPTAC Confirmatory/Discovery dataset, visualization of the expression patterns for a select set of 606 extracellular matrix-related proteins, including proteins involving the above GO categories, further demonstrated both common and distinctive expression patterns involving k6 and k7 (Fig. 6a). In general, associations involving the 606 proteins were also observable in the independent CPTAC-TCGA proteomic and TCGA pan32 mRNA datasets, although the differences were

6 NATURE COMMUNICATIONS | (2019) 10:5679 | https://doi.org/10.1038/s41467-019-13528-0 | www.nature.com/naturecommunications NATURE COMMUNICATIONS | https://doi.org/10.1038/s41467-019-13528-0 ARTICLE

Fig. 3 Observation of CPTAC pan-cancer proteome-based subtypes in additional multi-cancer protein expression profiling datasets. a The 364 TCGA cases with mass-spectrometry-based proteomic data from CPTAC were classified according to proteome-based pan-cancer subtype as originally defined using CPTAC Confirmatory/Discovery cohort. Expression patterns for the top set of 757 proteins distinguishing between the 10 subtypes (from Fig. 2a, based on available data) are shown for both CPTAC Confirmatory/Discovery and CPTAC-TCGA proteomic datasets (values normalized within each tissue-based cancer type; SD, standard deviation from the median). Gene patterns in the CPTAC-TCGA sample profiles sharing similarity with a subtype-specific signature pattern are highlighted. b The 7694 TCGA cases with reverse-phase protein array (RPPA) data were classified according to proteome-based pan-cancer subtype. Expression patterns for a top set of 99 proteins distinguishing between the 10 subtypes (see the section “Methods”, based on available data) are shown for both CPTAC Confirmatory/Discovery and TCGA RPPA proteomic datasets. Gene patterns in the RPPA sample profiles sharing similarity with a subtype-specific signature pattern are highlighted. Proteins highlighted by name were individually significantly associated with the given subtype (P < 0.001, t-test) in TCGA RPPA dataset. c Significances of overlap between the proteome-based subtype assignments made for the CPTAC-TCGA dataset (columns), with proteome-based subtype assignments for the TCGA RPPA dataset (rows), based on the 345 cases represented in both datasets. P-values by one-sided Fisher’s exact test. d Significances of overlap between the proteome-based subtype assignments made for the TCGA RPPA dataset (columns), with subtype assignments for the transcriptome profiles in TCGA pan32 cohort (rows, mapping the CPTAC protein expression patterns to TCGA mRNA patterns), based on the 7206 cases represented in both datasets. 
P-values by one-sided Fisher’s exact test. Patient-level subtyping and cancer type information for all datasets represented are provided in Supplementary Data 1. See also Supplementary Figs. 5–7.
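As a rough illustration of the overlap statistics in panels c and d, a one-sided Fisher's exact test reduces to the upper tail of a hypergeometric distribution. The sketch below is not the authors' code: the function name `fisher_one_sided` and its argument names are hypothetical, and the example counts are read from the panel d graphic (485 cases called k7 on both platforms, 1086 mRNA-based k7 cases, 1273 RPPA-based k7 cases, 7206 cases in total).

```python
from math import comb


def fisher_one_sided(overlap, size_a, size_b, total):
    """Upper-tail hypergeometric probability P(X >= overlap).

    Hypothetical helper: size_a and size_b are the sizes of two case sets
    drawn from `total` cases, and `overlap` is the number of shared cases.
    This equals a one-sided (enrichment) Fisher's exact test on the
    corresponding 2x2 table.
    """
    upper = min(size_a, size_b)
    denom = comb(total, size_b)
    # Exact integer arithmetic; comb() returns 0 for impossible tables.
    numer = sum(
        comb(size_a, i) * comb(total - size_a, size_b - i)
        for i in range(overlap, upper + 1)
    )
    return numer / denom


# Counts taken from the Fig. 3d graphic for the k7-versus-k7 cell.
p_k7 = fisher_one_sided(485, 1086, 1273, 7206)
print(f"k7 overlap, one-sided p = {p_k7:.3g}")
```

Exact integer arithmetic is used here only to keep the example dependency-free; a statistics library would normally be preferred for large cohorts.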

not as distinguishable at the mRNA level (Supplementary Fig. 10). Also, k6 tumors showed under-expression for proteins related to the TCA cycle and oxidative phosphorylation (Fig. 4a and Supplementary Fig. 8) and over-expression of matrix metallopeptidases including MMP11, MMP13, and MMP14 (Fig. 6a). Integration of the set of proteins high in either k6 or k7 tumors with public databases of protein–protein interactions generated protein interaction networks (Fig. 6b and Supplementary Data S5), which allowed visualization of the potential relationships involving these proteins. While k6 and k7 over-expressed many of the same proteins, one feature distinguishing k7 from k6 was over-expression of collagen VI members, with many collagen VI interacting proteins16 also being preferentially higher in k7 tumors (Fig. 6c).

In summary, regarding tumor stroma-related differences, k6 and k7 subtypes showed both common and distinctive expression patterns with respect to each other, involving the influence of the tumor microenvironment. Protein markers that distinguish k6 from k7 subtypes include: FN1, IGFBP3, ITGAV, LOX, LOXL2, MMP11, MMP13, MMP14, and THBS1 (all high in k6, Fig. 6a); and collagen VI and associated proteins (high in k7, Fig. 6c). The distinctions between k6 and k7 are reflected in profiling datasets external to CPTAC, including RNA-seq-profiling datasets.

Proteome-based subtypes not reflected in the transcriptome. Interestingly, the k8, k9, and k10 proteome-based subtypes as discovered in the CPTAC Confirmation/Discovery cohort were not well-represented in the other proteomic and transcriptomic datasets examined (Fig. 2c, d and Supplementary Fig. 6d). For k8 and k10, however, there was some significant overlap (p < 1E−6 and p < 0.001, respectively, one-sided Fisher’s exact test) between RPPA-based and mRNA-based subtype calling in TCGA pan32 cohort (Fig. 3d). The k9 subtype consisted entirely of clear cell renal cell carcinoma cases, which cancer type was not represented in the CPTAC-TCGA cohort. Besides, the RPPA platform had few available features specific to k8, k9, and k10 subtypes (Fig. 3a), which could complicate assigning proteome-based subtype to RPPA profiles. Nevertheless, significant numbers of top differentially expressed proteins (with associated GO categories being over-represented) underscored each of the above subtypes (Fig. 1e, f), indicative of the subtypes representing biology. Proteins high in k8 included Golgi-apparatus-related proteins (Figs. 1e and 7a), proteins high in k9 included hemoglobin complex proteins (Figs. 1e and 7b), and proteins high in k10 included endoplasmic reticulum (ER)-related proteins and proteins involved in steroid biosynthesis (Figs. 1e, 7c, d). Typically, sterols are synthesized in the ER and transported by non-vesicular mechanisms to the plasma membrane17. Ras pathway signature also appeared manifested in k8 tumors (Fig. 4a). Integration of the set of proteins high in k8, k9, or k10 tumors with public databases of protein–protein interactions generated protein interaction networks (Fig. 7e), where a number of the interacting proteins showed concordant patterns in the CPTAC-TCGA proteomic and TCGA pan32 mRNA datasets.

In summary, regarding the proteome-specific subtypes, k8 involved Golgi apparatus-related proteins, k9 (specific to renal cases) involved hemoglobin complex proteins, and k10 involved ER-related proteins and steroid biosynthesis pathway proteins.

A portal for visualizing proteomic associations by subtype. To facilitate access to CPTAC proteomic results by the general biomedical research community, we have integrated CPTAC data with the UALCAN data portal18. Across tumor and normal samples in the CPTAC Confirmatory/Discovery datasets, the UALCAN interface (http://ualcan.path.uab.edu) allows the user to analyze relative expression levels of a query protein or set of proteins across specified tumor sub-groups. These pre-defined tumor sub-groups may include cancer stage, tumor grade, race, or other clinicopathologic features. Proteome-based subtype comparisons (k1–k10), either across all five cancer types surveyed (“pan-cancer” view) or within a given cancer type, can also be carried out for a given protein. The analysis results (e.g., box plots) can be printed directly or downloaded in several file formats compatible with presentations or figures for publication. Users can also examine TCGA datasets in UALCAN, for differential patterns involving mRNA or DNA methylation features corresponding to the proteins of particular interest arising from the analysis of CPTAC data.

Discussion
Our proteome-based, pan-cancer subtypes represent another view into the molecular landscape of cancer, distinct in many ways from previous transcriptome-centered views. These proteome-based subtypes provide a framework for examining pathways or processes that, at the protein level, would cut across individual cancer types. The molecular landscape of human cancer can help guide us as to what pathways that have been extensively studied in the experimental setting would be particularly relevant in the actual setting of human disease. In this present study, many such pathways are found to be differentially expressed or manifested within specific subtypes of human cancers. Other pathways not examined here may also be explored in the context of our proteome-based subtypes. To this end, we have added the CPTAC datasets to the UALCAN data portal18, as well as provided protein-level statistics for all proteins in the supplementary data of this study, allowing researchers to look up proteins of


[Figure 4 graphic. Panel a: heat maps of pathway signature scores and differential t-statistics (given subtype vs others) for CPTAC proteomic profiles (n = 532) and TCGA pan32 mRNA profiles (n = 10224), subtypes k1-k10; signatures shown: FA metabolism, glycolysis/GNG, pentose phosphate, TCA cycle, OX-PHOS, Ras, YAP1, MYC, EMT, hypoxia, NRF2/KEAP1, WNT, NOTCH. Panel b: diagram of glycolysis, the pentose phosphate pathway, fatty acid metabolism, the TCA cycle, and the electron transport chain, with each gene colored by differential protein (CPTAC) and mRNA (TCGA) expression for k1 vs others, k5 vs others, and k7 vs others.]

Fig. 4 Proteome-based subtype-specific differences involving metabolic pathways. a For CPTAC Confirmatory/Discovery proteomic dataset and for TCGA pan32 mRNA dataset, pathway-associated gene signatures (using values normalized within each cancer type; SD, standard deviation from the median). For each dataset, purple-cyan heat maps denote t-statistics for comparing the given subtype versus the other tumors (bright purple/cyan, highly significant; black, not significant; shades close to black, borderline significant). Selected pathways surveyed by signatures2 included several related to metabolism (FA fatty acid; GNG gluconeogenesis; TCA tricarboxylic acid; OX-PHOS oxidative phosphorylation). b Pathway diagram representing core metabolic pathways, with differential expression patterns represented (using values normalized within cancer type), comparing tumors in pan-cancer subtypes k1, k5, or k7 with the rest of the tumors. For each protein represented, the top portion represents results from differential protein analysis (CPTAC Confirmatory/Discovery proteomic dataset) and the bottom portion represents results from differential mRNA analysis (TCGA pan32 mRNA dataset). Red denotes significantly higher expression in k1/k5/k7 and blue denotes significantly lower expression. See also Supplementary Fig. 8.
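The within-cancer-type normalization and the subtype-versus-rest comparison mentioned in this legend can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: all helper names are hypothetical, the SD is taken to be the ordinary sample standard deviation, and a pooled-variance two-sample t-statistic is assumed because the legend does not specify the variant. The toy values are invented for demonstration only.

```python
from math import sqrt
from statistics import mean, median, stdev, variance


def sd_from_median(values):
    """Express each value as standard deviations from the median."""
    med = median(values)
    sd = stdev(values)  # assumption: ordinary sample SD
    return [(v - med) / sd for v in values]


def normalize_within_type(values, cancer_types):
    """Apply sd_from_median separately within each cancer type."""
    out = [0.0] * len(values)
    for ctype in set(cancer_types):
        idx = [i for i, c in enumerate(cancer_types) if c == ctype]
        scaled = sd_from_median([values[j] for j in idx])
        for j, z in zip(idx, scaled):
            out[j] = z
    return out


def t_statistic(group, rest):
    """Pooled-variance two-sample t-statistic (assumed variant)."""
    n1, n2 = len(group), len(rest)
    pooled = ((n1 - 1) * variance(group) + (n2 - 1) * variance(rest)) / (n1 + n2 - 2)
    return (mean(group) - mean(rest)) / sqrt(pooled * (1 / n1 + 1 / n2))


# Toy data (illustrative only): one signature score across six tumors.
scores = [1.0, 2.0, 3.0, 10.0, 20.0, 30.0]
types = ["breast", "breast", "breast", "colon", "colon", "colon"]
subtypes = ["k1", "k5", "k1", "k5", "k1", "k5"]

z = normalize_within_type(scores, types)
in_k1 = [v for v, s in zip(z, subtypes) if s == "k1"]
others = [v for v, s in zip(z, subtypes) if s != "k1"]
print(t_statistic(in_k1, others))
```

Normalizing within each cancer type before pooling keeps tissue-of-origin differences from dominating the subtype comparison, which is the stated purpose of that step in the Methods.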


[Figure 5 graphic. Panel a: heat map of 162 immune-related proteins across k2 and k3 tumors (n = 92 proteomic profiles) and the remaining subtypes (n = 440 proteomic profiles), with differential t-statistics (k2 vs others, k3 vs others), cancer type (breast, colon, ovarian, renal, uterine), and gene annotations (GO: positive regulation of T-cell activation; GO: MHC protein complex; GO: humoral immune response; GO: complement activation; specifically expressed in immune-related tissues, HPA). Panel b: heat map of immune cell infiltrate signatures for the same profiles. Panel c: diagram of immune cell types and protein markers of the innate and adaptive immune systems. Panel d: diagram of the complement activation pathway (classical and alternative pathways, C3 convertase, membrane attack complex). Color keys: higher or lower in tumor subgroup at FDR < 10%, < 1%, < 0.01%.]

Fig. 5 Immune system-related differences underscore k2 and k3 proteome-based subtypes. a For a set of 162 immune-related proteins (FDR < 5% for either k2 or k3 subtypes and association with one of the indicated GO annotation categories), heat maps of differential protein expression patterns (expression values normalized within cancer type; SD, standard deviation from the median), across CPTAC Confirmatory/Discovery proteomic profiles, ordered by subtype. Purple-cyan heat map denotes t-statistics for comparing the given subtype versus the other tumors (bright purple/cyan, highly significant; black, not significant; shades close to black, borderline significant). Proteins found specifically expressed in immune-related tissues (according to Human Protein Atlas14, or HPA, www.proteinatlas.org) are indicated. b Heat maps of -based signatures15 of immune cell infiltrates, across CPTAC Confirmatory/Discovery proteomic profiles, ordered by subtype (expression values normalized within cancer type; SD, standard deviation from the median). Purple-cyan heat map denotes t-statistics for comparing the given subtype versus the other tumors. APM1/APM2, antigen presentation on MHC class I/class II, respectively; DC dendritic cells; iDC immature DCs; aDC activated DCs; NK cells natural killer cells; Tcm cells T central memory cells; Tem cells T effector memory cells. c Diagram of immune cell types and associated protein markers. Red denotes significantly higher expression in k2 or k3 subtypes as indicated, and blue denotes significantly lower expression. FDR false discovery rate. d Similar to part c, but for complement activation pathway. See also Supplementary Fig. 9.
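The FDR cutoffs quoted in this legend imply a multiple-testing correction across proteins. The legend does not say which estimator was used, so the sketch below assumes the Benjamini-Hochberg step-up procedure purely for illustration; `benjamini_hochberg` is a hypothetical helper name and the p-values are invented.

```python
def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order.

    Assumption: BH is used only as a common example of an FDR procedure;
    the study may have used a different estimator.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Invented per-protein p-values for a single subtype-versus-rest comparison.
pvals = [0.01, 0.04, 0.03, 0.005]
fdr = benjamini_hochberg(pvals)
selected = [i for i, q in enumerate(fdr) if q < 0.05]
print(fdr, selected)
```

A protein would then be retained for a panel such as Fig. 5a when its adjusted value falls below the stated cutoff for either subtype.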


[Figure 6 graphic. Panel a: heat map of 606 extracellular matrix-related proteins across k6 and k7 tumors (n = 156 proteomic profiles) and the remaining subtypes (n = 376 proteomic profiles), with differential t-statistics (k6 vs others, k7 vs others), cancer type, and gene annotations (GO: growth factor binding; GO: extracellular matrix; GO: cell adhesion; GO: cytoskeleton); proteins labeled by name: THBS1, ITGAV, IGFBP3, FN1, LOXL2, MMP13, MMP14, LOX, MMP11, COL1A2, IGFBP4, HSPG2, MMP2, ITGA1, COL6A1, TGFBR3, COL14A1, COL6A2, COL6A3, NCAM1, LYVE1. Panel b: protein-protein interaction networks for proteins high in k6 tumors (FDR < 1E−6) and in k7 tumors (FDR < 1E−14), with edges colored by shared GO term. Panel c: diagram of collagen VI interactions with extracellular matrix, basal lamina, and cell membrane proteins.]

individual interest. In this way, our study and the associated data provide an excellent resource for future studies.

Four of the 10 proteome-based subtypes represent the involvement of non-cancer cells, two subtypes (k2 and k3) involving immune cells and the other two subtypes (k6 and k7) involving the tumor stroma or reactive stroma19. While it is understood that tumor sample purity factors into the global expression profile20, the above four subtypes all appear very distinct from each other. This indicates that the associated non-cancer proteomic patterns represent true biology rather than a purely technical artifact involving sample procurement. Also, multiple processes that we would associate with the tumor microenvironment are


Fig. 6 Tumor stroma-related differences underscore k6 and k7 proteome-based subtypes. a For a set of 606 extracellular matrix-related proteins (FDR < 5% for either k6 or k7 subtypes and association with one of the indicated GO annotation categories), heat maps of differential protein expression patterns (expression values normalized within cancer type; SD, standard deviation from the median), across CPTAC Confirmatory/Discovery proteomic profiles, ordered by subtype. Purple-cyan heat map denotes t-statistics for comparing the given subtype versus the other tumors (bright purple/cyan, highly significant; black, not significant; shades close to black, borderline significant). Selected proteins of interest are listed by name. b Protein–protein interaction networks involving the top proteins over-expressed in k6 tumors (top network, using cutoff of FDR < 1E−6) and the top proteins over-expressed in k7 tumors (bottom network, using cutoff of FDR < 1E−14). Nodes represent proteins that were found over-expressed in either k6 or k7 subtypes as indicated. Red node coloring denotes significantly higher expression in k6 or k7 subtypes as indicated, and blue coloring denotes significantly lower expression. A line between two nodes signifies that the corresponding protein products of the genes can physically interact (according to the literature, from gene interactions database). Colored edges (other than gray) denote a common GO term annotation shared by both of the connected proteins. c Diagram of collagen VI interactions and associated proteins16. Red denotes significantly higher expression in k6 or k7 subtypes as indicated, and blue denotes significantly lower expression. FDR false discovery rate. See also Supplementary Fig. 10 and Supplementary Data S5.

manifested within distinct subtypes of tumors, indicating that these processes are acting independently from each other. While the recent surge of interest in cancer immunotherapy is mainly focused on T cell function or numbers, which cells appear most present within our k2 subtype, relatively less attention has been placed on the complement activation pathway, represented in our k3 subtype. Activation of the complement system is an important component of tumor-promoting inflammation, which in turn has an important role in carcinogenesis and cancer progression21, and which involves macrophages and neutrophils. Notably, hypoxia was found here to be elevated in the k3 subtype, while elsewhere hypoxia is known to alter the regulation of complement proteins in different cellular components of the tumor microenvironment22. Previous surveys of TCGA data for immune cell involvement at the transcriptomic level2,23 appear to have missed the complement pathway as representing an important player. On the other hand, collagens, which appear to have a role in our k6 and k7 subtypes, constitute the scaffold of the tumor microenvironment. Collagens affect this microenvironment such that it regulates extracellular matrix remodeling by collagen degradation and re-deposition, and promotes tumor infiltration, angiogenesis, invasion, and migration24.

Three additional proteome-based subtypes (k8, k9, and k10) were not reflected in previous pan-cancer transcriptomics analyses2 and relatively few of the >10,000 cases in TCGA could be assigned to them based on gene transcription, suggesting that these subtypes might be unique to the proteome. The classes of proteins associated with these subtypes may suggest pathways or processes that may not be as commonly associated with human cancer, but for which links to cancer can be found in the literature. Regarding the k8 subtype, many studies have demonstrated the essential roles of the Golgi in cellular activities as a stress sensor, apoptosis trigger, lipid/protein modifier, mitotic checkpoint, and a of malignant transformation25. Regarding the k9 subtype, specific to renal cell carcinoma in the CPTAC cohort, increased hemoglobin has been associated elsewhere with VEGF inhibitor treatment in advanced renal cell carcinoma26, and elevation in hemoglobin on VEGF-directed therapy has been associated with worse clinical outcomes27. Regarding the k10 subtype, tumor cells are often exposed to intrinsic and external factors that alter protein homeostasis, thus producing ER stress. ER stress, in turn, may be activated by a variety of factors and triggers the unfolded protein response (UPR), which restores homeostasis or activates cell death28,29. In the future, as greater numbers of human cancer cases are profiled using mass-spectrometry-based proteomics, additional subtypes and associated pathways may be uncovered and explored. Such findings could represent key biological insights and potential therapeutic opportunities involving appreciable subsets of human cancer.

In summary, our study has uncovered proteome-based pan-cancer subtypes on the basis of mass-spectrometry-based proteomics, which platform offers far more protein features over other proteomic platforms such as RPPA5. Where specific proteome-based subtypes are found reflected in subtypes defined on the basis of mRNA patterns2, the existence of such subtypes in human cancer has even stronger support with these independent observations involving the proteome in addition to the transcriptome. On the other hand, our study also uncovers cancer subtypes not found in previous transcriptome-based studies. Observations specific to the proteome include two different immune system-related subtypes (reflected as a common subtype in the transcriptome), one involving the adaptive immune response and the other involving the humoral immune response. The pathway associations according to proteome-based subtype may help in drawing attention to pathways and processes, previously examined in the laboratory setting, which are shown here to be involved in appreciable subsets of human cancer. In addition to the more heavily studied pathways, such as cancer metabolism and T , other pathways and processes to be considered further in the context of cancer would include those involving complement activation, collagen VI, Golgi apparatus, hemoglobin complex, and ER.

Methods
CPTAC datasets. The National Cancer Institute CPTAC30 generated the mass spectrometry-based proteomic data used in this publication. The CPTAC Confirmation/Discovery cohort, used to define proteome-based molecular subtypes, consisted of five separate datasets: CPTAC Uterine Corpus Endometrial Carcinoma (UCEC) Discovery Study (comprising n = 100 cases), CPTAC Clear Cell Renal Cell Carcinoma (CCRCC) Discovery Study (n = 110)11, CPTAC Breast Cancer Confirmatory Study (n = 125), CPTAC Ovarian Cancer Confirmatory Study (n = 100), and CPTAC Colon Cancer Confirmatory Study (n = 97)10. The CPTAC-TCGA datasets, used here for independent observations or validations, involved 364 cancer cases in TCGA, including 90 colorectal9 cases, 105 breast8 cases, and 169 ovarian7 cases. The mass-spectrometry-based proteomics methods for profiling CPTAC-TCGA tumors have been previously reported in the associated CPTAC-led studies, as well as broadly summarized below. Tumor samples from CCRCC and UCEC were analyzed by global proteomic and phosphoproteomic mass spectrometry using the 10-plexed isobaric tandem mass tags (TMT-10), following the CPTAC reproducible workflow protocol published by Mertins et al.31. Breast, colon, and ovarian samples were analyzed with liquid chromatography–tandem mass spectrometry (LC–MS/MS) global proteomic and phosphoproteomic profiling. Proteomic-profiling data were generated through informed consent as part of CPTAC efforts and analyzed per CPTAC data use guidelines and restrictions. For CPTAC Confirmatory/Discovery, we obtained processed protein expression data from the CPTAC Data Portal4. For CPTAC-TCGA, we obtained processed protein expression data from the supplementary tables of the associated publications. CPTAC proteomic data, as provided by the CPTAC Data Portal and related publications, were processed at the gene level, rather than at the protein isoform level; as a simplification, we did not consider different isoforms for the same protein in the present study.

Taking the expression values provided in the Protein Report provided by CPTAC Data Portal, we normalized CPTAC Confirmatory/Discovery proteomic data for downstream analyses in the following manner. First, within each proteomic profile, we normalized logged expression values to standard deviations from the median. We carried out the above as, for some cancer types, there was an observed high variability in within-profile expression median or standard deviation across samples, which could influence unsupervised analysis results. Next, we normalized expression values across samples to standard deviations from the median. In the same manner, we separately normalized both total protein and phospho-protein datasets for a given cancer type. For the Colon Confirmatory


[Figure 7 graphic. Panel a: heat map of 59 Golgi apparatus-related proteins in k8 tumors (n = 43 profiles) versus the remaining subtypes (n = 489 proteomic profiles). Panel b: heat map of hemoglobin complex proteins (AHSP, HBA1, HBB, HBD, HBG1, HBG2, HBM, HBQ1, HBZ) in k9 tumors (n = 29 profiles) versus the remaining subtypes (n = 503 proteomic profiles). Panel c: heat map of 154 endoplasmic reticulum-related proteins in k10 tumors (n = 48 profiles) versus the remaining subtypes (n = 484 proteomic profiles). Panel d: diagram of the steroid biosynthesis pathway, colored by higher or lower expression in k10 tumors. Panel e: protein-protein interaction networks for proteins high in k8 tumors (FDR < 1%), k9 tumors (FDR <= 0.001%), and k10 tumors (FDR <= 1%), with nodes colored by comparisons using outside datasets (TCGA-CPTAC protein, p < 0.1; TCGA pan32 mRNA, p < 0.01) and edges colored by shared GO term.]

Fig. 7 Overview of k8, k9, and k10 proteome-based subtypes. a For a set of 59 Golgi apparatus-related proteins elevated in the k8 subtype (FDR < 5%), heat maps of differential protein expression patterns (expression values normalized within cancer type; SD, standard deviation from the median), across CPTAC Confirmatory/Discovery proteomic profiles ordered by subtype. Listed proteins are the subset highest in k8 over all other subtypes. b For a set of nine hemoglobin complex proteins elevated in the k9 subtype (FDR < 1%), heat maps of differential protein expression patterns, across CPTAC Confirmatory/Discovery proteomic profiles ordered by subtype. c For a set of 154 endoplasmic reticulum-related proteins elevated in the k10 subtype (FDR < 1%), heat maps of differential protein expression patterns, across CPTAC Confirmatory/Discovery proteomic profiles ordered by subtype. Listed proteins are those among the top 50 most over-expressed for the k10 subtype. d Diagram of steroid biosynthesis pathway and associated proteins (from KEGG database48). Red denotes significantly higher expression in the k10 subtype. e Protein–protein interaction networks involving the top proteins over-expressed in k8 tumors (top network), the top proteins over-expressed in k9 tumors (middle network), and the top proteins over-expressed in k10 tumors (bottom network). Nodes represent proteins that were found over-expressed in the given subtype. Nodes are colored according to patterns of differential expression in additional cohorts (left, protein data from CPTAC-TCGA cohort; right, mRNA data from TCGA pan32 cohort). A line between two nodes signifies that the corresponding protein products of the genes can physically interact (according to the literature, from Entrez gene interactions database). Colored edges (other than gray) denote a common GO term annotation shared by both of the connected proteins. FDR false discovery rate.

dataset, we used the data from Pacific Northwest National Laboratory (PNNL) in the analysis. For the Ovarian Confirmatory dataset, we averaged normalized values for PNNL and Johns Hopkins University (JHU) in instances where there were duplicate profiles for the same sample. As intended, by normalizing expression within each cancer type and within each CPTAC dataset, neither tissue-dominant differences nor inter-laboratory batch effects would drive the downstream unsupervised subtype discovery.

For the CPTAC Confirmatory/Discovery total protein dataset, a total of 12,247 unique genes by Entrez identifier comprised the compiled dataset of all five cancer types. For unsupervised subtype discovery and for selecting top protein features for the subtype classifier, we considered the subset of 9764 unique proteins represented in at least three of the five cancer types. For the CPTAC Confirmatory/Discovery phospho-protein dataset, we considered the set of protein features detected in more than half of cases for at least three cancer types. For the CPTAC-TCGA total protein dataset, we considered the set of 5863 unique genes/proteins by Entrez identifier that were represented in all three cancer types.

TCGA datasets. Results are based in part upon data generated by TCGA Research Network (http://cancergenome.nih.gov/). We aggregated TCGA transcriptomic and RPPA data from public repositories, listed in the “Data availability” section. RNA-seq expression data were processed by TCGA at the gene level, rather than at the transcript level. Tumors spanned 32 different TCGA projects, each project representing a specific cancer type, listed as follows: LAML, acute myeloid leukemia; ACC, adrenocortical carcinoma; BLCA, bladder urothelial carcinoma; LGG, lower grade glioma; BRCA, breast invasive carcinoma; CESC, cervical squamous cell carcinoma and endocervical adenocarcinoma; CHOL, cholangiocarcinoma; CRC, colorectal adenocarcinoma (combining COAD and READ projects); ESCA, esophageal carcinoma; GBM, glioblastoma multiforme; HNSC, head and neck squamous cell carcinoma; KICH, kidney chromophobe; KIRC, kidney renal clear cell carcinoma; KIRP, kidney renal papillary cell carcinoma; LIHC, liver hepatocellular carcinoma; LUAD, lung adenocarcinoma; LUSC, lung squamous cell carcinoma; DLBC, lymphoid neoplasm diffuse large B-cell lymphoma; MESO, mesothelioma; OV, ovarian serous cystadenocarcinoma; PAAD, pancreatic adenocarcinoma; PCPG, pheochromocytoma and paraganglioma; PRAD, prostate adenocarcinoma; SARC, sarcoma; SKCM, skin cutaneous melanoma; STAD, stomach adenocarcinoma; TGCT, testicular germ cell tumors; THYM, thymoma; THCA, thyroid carcinoma; UCS, uterine carcinosarcoma; UCEC, uterine corpus endometrial carcinoma; UVM, uveal melanoma. Cancer molecular profiling data were generated through informed consent as part of previously published studies and analyzed per each original study’s data use guidelines and restrictions.

Pan-cancer molecular subtype discovery. ConsensusClusterPlus R-package32 was used to identify the structure and relationship of the samples. For unsupervised clustering analysis, we selected the top 2000 most variable proteins from the CPTAC Confirmatory/Discovery total protein dataset (taken from the set of 9764 unique proteins represented in at least three of the five cancer types), according to average standard deviation (using log-transformed expression values centered to standard deviations from the median within each cancer type) across the five CPTAC projects. Consensus Ward linkage hierarchical clustering identified k = 2 to k = 15 subtypes, with the stability of the clustering increasing with increasing k. Consistent with what was carried out for our previous studies2, we selected the k = 10 clustering solution for further investigation. With clustering solutions of higher k, more subtypes may be discoverable, although these additional subtypes would involve fewer cases and would represent less of the global variation (Supplementary Fig. 2).

Analysis of external multi-cancer datasets. We examined external, multi-cancer gene expression profiling datasets, classifying each external tumor profile by pan-cancer class/subtype as defined by TCGA or CPTAC data. Within each cancer type of the external dataset being classified, we normalized log-transformed genes or proteins to standard deviations from the median. As a classifier, we used either the top set of mRNAs (from TCGA-based pan32 study, Fig. 1a and ref. 2) or the top set of proteins (from the present study, Fig. 2) distinguishing between the pan-cancer subtypes, as noted. For each pan-cancer subtype, we computed the average normalized value for each gene or protein, based on the centered expression data matrix. We then computed the Pearson’s correlation between each external profile and each pan-cancer subtype averaged profile. We assigned each external cancer case to a pan-cancer subtype, based on which subtype profile showed the highest correlation with the given external dataset profile. Supplementary Data 3 provides example calculations in Excel, by which the CPTAC-TCGA proteomic profiles are classified according to proteome-based pan-cancer subtype. We re-classified the CPTAC-TCGA mRNA profiles according to TCGA mRNA-based molecular class, with the same set of 198 class-specific features shared between protein and mRNA datasets being used; we assigned profiles for which the best class fit had a Pearson’s correlation of >0.05 to the c2 class (the class lacking strongly associated expression patterns according to the original Chen et al.2 study).

By the above approach, the CPTAC total protein datasets (both CPTAC-TCGA and CPTAC Confirmatory/Discovery) were separately classified according to TCGA mRNA-based pan-cancer class (Fig. 1). The previously defined mRNA-based classifier consisted of 854 genes, of which 198 were represented in the CPTAC-TCGA dataset (taking from the set of 5863 unique genes/proteins represented in all three cancer types), and 532 were represented in CPTAC Confirmatory/Discovery dataset (taking from the set of 12,247 unique genes represented in any one of the five cancer types). We classified TCGA pan32 mRNA profiles (n = 10,224 cases) according to proteome-based subtype as derived using the CPTAC Confirmatory/Discovery dataset (Fig. 2c and Supplementary Fig. 6a, using 990 out of 1000 features in Fig. 2b classifier with available data), mapping protein features to mRNA by Entrez gene identifier. We classified the CPTAC-TCGA proteomic profiles according to the subtype originally derived from the CPTAC Confirmatory/Discovery dataset (Fig. 3a, using 757 out of 1000 features with available data). We classified Cancer Cell Line Encyclopedia (CCLE) mRNA profiles according to subtype derived from CPTAC Confirmatory/Discovery, similarly to that of TCGA pan32 mRNA profiles (Supplementary Fig. 6c). We also classified tumors and cell lines profiled by RPPA according to subtype derived from CPTAC Confirmatory/Discovery, the tumor dataset from TCGA5,6 (Fig. 3b) and the cell line dataset from CCLE33 (Supplementary Fig. 6b). For the RPPA datasets, we used as a classifier the set of represented total protein features for which a significant association with a particular subtype was observable in the CPTAC Confirmatory/Discovery dataset (p < 0.001 by t-test, based on logged and centered protein expression values).

Differential expression and pathway analyses. Differential expression between comparison groups was assessed using t-tests on log-transformed values (base 2). False discovery rates (FDRs) were estimated using the method of Storey and Tibshirani34. Significantly enriched GO annotation categories were computed using one-sided Fisher’s exact tests and SigTerms software35, based on the entire set of 9764 unique proteins represented in at least three of the five cancer types. Protein interaction network analysis used the entire set of human protein–protein interactions cataloged in Entrez Gene (downloaded June 2017). Entrez gene interactions with yeast two-hybrid experiments providing the only support for the interaction were not included in the analysis. Graphical visualization of networks was generated using Cytoscape36.

Gene signature analyses. We computed gene expression signature scores associated with pathways (e.g., scores for EMT, NRF2/KEAP1, hypoxia, KEGG: Glycolysis/Gluconeogenesis, KEGG: Pentose Phosphate pathway, KEGG: Fatty Acid metabolism, KEGG: TCA Cycle, and KEGG: Oxidative Phosphorylation or OX-PHOS, k-ras, MYC, YAP1, WNT, and NOTCH) as follows. We normalized log base 2-transformed values for proteins in the CPTAC dataset within each cancer type (standard deviations from the median of the given cancer type). For NRF2/KEAP1, hypoxia, WNT, NOTCH, and KEGG signatures, we computed the average expression of the set of genes within a given signature. For k-ras, MYC, and YAP1 signatures, normalized expression profiles were scored for the above signatures using our t-score metric37. We generated gene signature scores of NRF2/KEAP1 pathway as described38, based on four different signatures39. The hypoxia signature was the set of canonical HIF1A targets from Harris40. We generated gene transcription signature scores of YAP1 pathway, based on four different signatures39. MYC signature (from data by Coller et al.41) was from ref. 42, and the Settleman k-ras sensitivity signature was from ref. 43. WNT signature was taken directly from ref. 44 (summing up values for WNT antagonist, agonist, and target genes). NOTCH signature was taken from ref. 45. For TCGA pan32 cohort (n = 10,224 RNA-seq profiles), gene expression signature scores associated with the above pathways were also computed as described above, based on transcription data, in a previous study2. For visual display, we normalized pathway signature scores across samples to standard deviations from the median.

Statistical analysis. All p values were two-sided unless otherwise specified. All tests were performed using log2-transformed expression values. Visualization using heat maps was performed using both JavaTreeview (version 1.1.6r4)46 and matrix2png (version 1.2.1)47.

Reporting summary. Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Data availability
All data used in this study are publicly available. The CPTAC datasets (both Confirmatory/Discovery and CPTAC-TCGA) referenced during the study are available from the CPTAC data portal website (https://cptac-data-portal.georgetown.edu/cptacPublic/). TCGA RNA-seq data are available through the Genomic Data Commons (https://gdc.cancer.gov/) and the Broad Institute’s Firehose data portal (https://gdac.broadinstitute.org). The TCGA RPPA dataset is available from the TCPA portal (http://tcpaportal.org/tcpa/). Cancer Cell Line Encyclopedia (CCLE) datasets are available from the CCLE website (http://www.broadinstitute.org/ccle). The source data underlying Figs. 1–7 are provided as a Source Data file. All the other data supporting the findings of this study are available within the article and its supplementary information files and from the corresponding author upon reasonable request. A reporting summary for this article is available as a Supplementary Information file.
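
As an illustration of the classification procedure described under "Analysis of external multi-cancer datasets", the following is a minimal Python sketch, not the authors' code (their R source is in Supplementary Data 6 and their Excel example in Supplementary Data 3). It assumes the sample standard deviation is used for normalization, omits the c2 fallback rule, and uses hypothetical function names and toy values chosen only for illustration.

```python
from math import sqrt
from statistics import mean, median, stdev


def normalize_within_type(values):
    """Rescale one feature's log2 values, across the samples of one cancer
    type, to standard deviations from the median (sample SD is an assumption)."""
    med, sd = median(values), stdev(values)
    return [(v - med) / sd for v in values]


def pearson(x, y):
    """Pearson's correlation between two equal-length profiles."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den if den else 0.0


def subtype_centroids(profiles, labels):
    """Average normalized value of each feature within each subtype."""
    grouped = {}
    for profile, label in zip(profiles, labels):
        grouped.setdefault(label, []).append(profile)
    return {label: [mean(col) for col in zip(*rows)]
            for label, rows in grouped.items()}


def classify(profile, centroids):
    """Return the subtype whose averaged profile correlates best with the
    given profile, together with that correlation."""
    scores = {label: pearson(profile, c) for label, c in centroids.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


# Toy reference set (illustrative values only): four already-normalized
# profiles over four classifier features, labeled with two subtypes.
training = [
    [1.2, 0.8, -1.0, -0.9],
    [0.9, 1.1, -0.7, -1.2],
    [-1.1, -0.8, 1.0, 0.9],
    [-0.9, -1.2, 1.3, 0.7],
]
labels = ["k1", "k1", "k2", "k2"]
centroids = subtype_centroids(training, labels)

# An external profile, assumed to be normalized within its own cancer type
# and restricted to the same features in the same order.
best, r = classify([0.5, 0.7, -0.4, -0.6], centroids)
print(best, round(r, 3))
```

In practice the external dataset would first be passed feature by feature through a step like normalize_within_type, separately for each cancer type, and restricted to the classifier features that it shares with the reference dataset, as the text describes.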


Code availability
R source code written for this study is provided as part of Supplementary Data 6. Example Excel calculations by which the CPTAC-TCGA proteomic profiles were classified according to proteome-based pan-cancer subtype (Fig. 3a) are provided in Supplementary Data 3.

Received: 19 July 2019; Accepted: 11 November 2019;

References
1. Perou, C. et al. Molecular portraits of human breast tumours. Nature 406, 747–752 (2000).
2. Chen, F. et al. Pan-cancer molecular classes transcending tumor lineage across 32 cancer types, multiple data platforms, and over 10,000 cases. Clin. Cancer Res. 24, 2182–2193 (2018).
3. Chen, G. et al. Discordant protein and mRNA expression in lung adenocarcinomas. Mol. Cell. Proteom. 1, 304–313 (2002).
4. Edwards, N. et al. The CPTAC data portal: a resource for cancer proteomics research. J. Proteome Res. 14, 2707–2713 (2015).
5. Akbani, R. et al. A pan-cancer proteomic perspective on The Cancer Genome Atlas. Nat. Commun. 5, 3887 (2014).
6. Zhang, Y. et al. A pan-cancer proteogenomic atlas of PI3K/AKT/mTOR pathway alterations. Cancer Cell 31, 820–832 (2017).
7. Zhang, H. et al. Integrated proteogenomic characterization of human high-grade serous ovarian cancer. Cell 166, 755–765 (2016).
8. Mertins, P. et al. Proteogenomics connects somatic mutations to signalling in breast cancer. Nature 534, 55–62 (2016).
9. Zhang, B. et al. Proteogenomic characterization of human colon and rectal cancer. Nature 513, 382–387 (2014).
10. Vasaikar, S. et al. Proteogenomic analysis of human colon cancer reveals new therapeutic opportunities. Cell 177, 1035–1049.e1019 (2019).
11. Clark, D. et al. Integrated proteogenomic characterization of clear cell renal cell carcinoma. Cell 179, 964–983 (2019).
12. Creighton, C. & Huang, S. Reverse phase protein arrays in signaling pathways: a data integration perspective. Drug Des. Dev. Ther. 9, 3519–3527 (2015).
13. Chen, F. et al. Pan-urologic cancer genomic subtypes that transcend tissue of origin. Nat. Commun. 8, 199 (2017).
14. Uhlén, M. et al. Proteomics. Tissue-based map of the human proteome. Science 347, 1260419 (2015).
15. Bindea, G. et al. Spatiotemporal dynamics of intratumoral immune cells reveal the immune landscape in human cancer. Immunity 39, 782–795 (2013).
16. Cescon, M., Gattazzo, F., Chen, P. & Bonaldo, P. Collagen VI at a glance. J. Cell Sci. 128, 3525–3531 (2015).
17. Quon, E. et al. Endoplasmic reticulum–plasma membrane contact sites integrate sterol and phospholipid regulation. PLoS Biol. 16, e2003864 (2018).
18. Chandrashekar, D. et al. UALCAN: a portal for facilitating tumor subgroup gene expression and survival analyses. Neoplasia 19, 649–658 (2017).
19. Franco, O., Shaw, A., Strand, D. & Hayward, S. Cancer associated fibroblasts in cancer pathogenesis. Semin. Cell Dev. Biol. 21, 33–39 (2010).
20. Aran, D., Sirota, M. & Butte, A. Systematic pan-cancer analysis of tumour purity. Nat. Commun. 6, 8971 (2015).
21. Afshar-Kharghan, V. The role of the complement system in cancer. J. Clin. Invest. 127, 780–789 (2017).
22. Olcina, M., Kim, R., Melemenidis, S., Graves, E. & Giaccia, A. The tumour microenvironment links complement system dysregulation and hypoxic signalling. Br. J. Radiol. 15, 20180069 (2018).
23. Thorsson, V. et al. The immune landscape of cancer. Immunity 48, 812–830 (2018).
24. Fang, M., Yuan, J., Peng, C. & Li, Y. Collagen as a double-edged sword in tumor progression. Tumour Biol. 35, 2871–2882 (2014).
25. Migita, T. & Inoue, S. Implications of the Golgi apparatus in prostate cancer. Int. J. Biochem. Cell Biol. 44, 1872–1876 (2012).
26. Harshman, L., Kuo, C., Wong, B., Vogelzang, N. & Srinivas, S. Increased hemoglobin associated with VEGF inhibitors in advanced renal cell carcinoma. Cancer Invest. 27, 851–856 (2009).
27. Tripathi, A., Jacobus, S., Feldman, H., Choueiri, T. & Harshman, L. Prognostic significance of increases in hemoglobin in renal cell carcinoma patients during treatment with VEGF-directed therapy. Clin. Genitourin. Cancer 15, 396–402 (2017).
28. Urra, H., Dufey, E., Avril, T., Chevet, E. & Hetz, C. Endoplasmic reticulum stress and the hallmarks of cancer. Trends Cancer 2, 252–262 (2016).
29. Yadav, R., Chae, S., Kim, H. & Chae, H. Endoplasmic reticulum stress and cancer. J. Cancer Prev. 19, 75–88 (2014).
30. Ellis, M. et al. Connecting genomic alterations to cancer biology with proteomics: the NCI Clinical Proteomic Tumor Analysis Consortium. Cancer Discov. 3, 1108–1112 (2013).
31. Mertins, P. et al. Reproducible workflow for multiplexed deep-scale proteome and phosphoproteome analysis of tumor tissues by liquid chromatography–mass spectrometry. Nat. Protoc. 13, 1632–1661 (2018).
32. Wilkerson, M. & Hayes, D. ConsensusClusterPlus: a class discovery tool with confidence assessments and item tracking. Bioinformatics 26, 1572–1573 (2010).
33. Li, J. et al. Characterization of human cancer cell lines by reverse-phase protein arrays. Cancer Cell 31, 225–239 (2017).
34. Storey, J. D. & Tibshirani, R. Statistical significance for genomewide studies. Proc. Natl Acad. Sci. USA 100, 9440–9445 (2003).
35. Creighton, C., Nagaraja, A., Hanash, S., Matzuk, M. & Gunaratne, P. A bioinformatics tool for linking gene expression profiling results with public databases of microRNA target predictions. RNA 14, 2290–2296 (2008).
36. Shannon, P. et al. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res. 13, 2498–2504 (2003).
37. The Cancer Genome Atlas Research Network. Comprehensive molecular characterization of clear cell renal cell carcinoma. Nature 499, 43–49 (2013).
38. Abazeed, M. et al. Integrative radiogenomic profiling of squamous cell lung cancer. Cancer Res. 73, 6289–6298 (2013).
39. Chen, F. et al. Multilevel genomics-based taxonomy of renal cell carcinoma. Cell Rep. 14, 2476–2489 (2016).
40. Harris, A. Hypoxia—a key regulatory factor in tumour growth. Nat. Rev. Cancer 2, 38–47 (2002).
41. Coller, H. A. et al. Expression analysis with oligonucleotide microarrays reveals that MYC regulates genes involved in growth, cell cycle, signaling, and adhesion. Proc. Natl Acad. Sci. USA 97, 3260–3265 (2000).
42. Creighton, C. Multiple oncogenic pathway signatures show coordinate expression patterns in human prostate tumors. PLoS One 3, e1816 (2008).
43. Singh, A. et al. A gene expression signature associated with “K-Ras addiction” reveals regulators of EMT and tumor cell survival. Cancer Cell 15, 489–500 (2009).
44. Gingras, M. et al. Ampullary cancers harbor ELF3 tumor suppressor gene mutations and exhibit frequent WNT dysregulation. Cell Rep. 14, 907–919 (2016).
45. Kwon, O. et al. Notch promotes tumor metastasis in a prostate-specific Pten-null mouse model. J. Clin. Invest. 126, 2626–2641 (2016).
46. Saldanha, A. J. Java Treeview—extensible visualization of microarray data. Bioinformatics 20, 3246–3248 (2004).
47. Pavlidis, P. & Noble, W. Matrix2png: a utility for visualizing matrix data. Bioinformatics 19, 295–296 (2003).
48. Kanehisa, M., Furumichi, M., Tanabe, M., Sato, Y. & Morishima, K. KEGG: new perspectives on genomes, pathways, diseases and drugs. Nucleic Acids Res. 45, D353–D361 (2017).

Acknowledgements
This work was supported by National Institutes of Health (NIH) grant P30CA125123 (C.J.C.).

Author contributions
Conceptualization: C.J.C.; Methodology: C.J.C., F.C.; Investigation: C.J.C., F.C.; Formal Analysis: C.J.C., F.C., D.S.C.; Data Curation: C.J.C., S.V., D.S.C.; Visualization: C.J.C.; Writing: C.J.C., S.V.; Manuscript Review: F.C., D.S.C.; Supervision: C.J.C., S.V.

Competing interests
The authors declare no competing interests.

Additional information
Supplementary information is available for this paper at https://doi.org/10.1038/s41467-019-13528-0.

Correspondence and requests for materials should be addressed to C.J.C.

Peer review information Nature Communications thanks John D. Minna and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.

Reprints and permission information is available at http://www.nature.com/reprints

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

© The Author(s) 2019
