bioRxiv preprint doi: https://doi.org/10.1101/865139; this version posted December 13, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

An unsupervised feature extraction and selection strategy for identifying epithelial-mesenchymal transition state metrics in breast cancer and melanoma

David J. Klinke II1-3 and Arezo Torang1,4,5

1Department of Chemical and Biomedical Engineering, West Virginia University, Morgantown, WV 2Department of Microbiology, Immunology and Cell Biology, West Virginia University, Morgantown, WV 3WVU Cancer Institute, West Virginia University, Morgantown, WV 4Amsterdam UMC, University of Amsterdam, Laboratory for Experimental Oncology and Radiobiology, Center for Experimental and Molecular Medicine, Cancer Center Amsterdam, Amsterdam, The Netherlands 5Oncode Institute, UMC, University of Amsterdam, Amsterdam, The Netherlands

Digital cytometry is opening up new avenues to better under- of malignant cells that originate within a particular anatomi- stand the heterogeneous cell types present within the tumor mi- cal organ rather than uncertainty in etiology, we will focus on croenvironment. While the focus is towards elucidating immune breast cancer and melanoma as Li et al. show that melanoma and stromal cells as clinical correlates, there is still a need to and breast cancer cell lines seem to cluster most uniformly better understand how a change in tumor cell phenotype, such while other cell lines defined by anatomical origin seem to as the epithelial-mesenchymal transition, influences the immune have a more heterogeneous composition (6). contexture. To complement existing digital cytometry methods, our objective was to develop an unsupervised signature While the tumor cells that arise in the skin and breast capturing a change in differentiation state that is tailored to the seem to be most similar, patient treatment strategies and specific cellular context of breast cancer and melanoma, as a il- outcomes can be diverse. Initial treatment strategies are lustrative example. Towards this aim, we used principal compo- guided by specific molecular alterations that can be targeted nent analysis coupled with resampling to develop unsupervised by drugs: aromatase inhibitors for ER-positive breast can- -based state metrics specific for the cellular con- cer, anti-HER2 antibodies for HER2-positive breast cancer, text that characterize the state of cellular differentiation within or small molecule inhibitors for BRAF V600E-positive or C- an epithelial to mesenchymal-like state space and independently KIT-positive melanoma (7, 8). However, dissemination of correlate with metastatic potential. First developed using cell line data, the orthogonal state metrics were refined to exclude primary tumors to vital organs like liver, brain, and lungs the contributions of normal fibroblasts and to provide tissue- is a key limiter for patient survival in breast cancer and level state estimates based on bulk tissue RNA-seq measures. melanoma. Specifically, the 5-year survival rate for patients The resulting gene expression-based metrics for differentiation with localized disease versus distant metastases drops from state aim to inform a more holistic view of how the malignant 98% to 23% and from 99% to 27% for melanoma and breast cell phenotype influences the immune contexture within the tu- cancer, respectively (9). In contrast, patient survival for tu- mor microenvironment. mors that originate in vital organs is limited by the degree to which malignant cells locally disrupt organ function. Thus Epithelial-mesenchymal transition | digital cytometry | deconvolution | transcriptomics the importance of distal dissemination in determining patient Correspondence: [email protected] outcomes can vary based on the tissue of origin. The distal dissemination and growth of malignant cells Introduction - metastasis - is a complex process thought to involve dy- namic re-engagement of biological processes used during Tissues are comprised of a diverse set of different cell types development that enable migrating cells to form tissues. that help maintain homeostasis. Oncogenesis is associated For carcinomas, initiating metastasis is thought to occur with a shift in the cellular composition of a tissue that can be through a process called the epithelial-mesenchymal transi- revealed with increasing confidence through direct measure- tion (EMT). EMT is the functional consequence of engaging ment, such as scRNA-seq, or using digital methods to decon- a genetic regulatory network that downregulates the expres- volute bulk tissue samples (1). Given the correlation with sion of associated with an epithelial phenotype and up- response to immunotherapies, the current focus has been on regulates genes associated with a mesenchymal phenotype. quantifying immune cell types present within the tumor mi- Breast carcinoma primarily originates from either luminal ep- croenvironment (2, 3). There is also an increasing appreci- ithelial cells or basal myoepithelial cells within the mammary ation for characterizing the heterogeneity among malignant gland (10). In contrast to breast cancer, melanoma arises cells that may arise in the same anatomical location (4, 5). from the oncogenic transformation of melanocytes, which Given our interest in understanding functional heterogeneity follow a different developmental trajectory along the neural

Klinke et al. | bioRχiv | December 13, 2019 | 1–8 bioRxiv preprint doi: https://doi.org/10.1101/865139; this version posted December 13, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

crest than epithelial cells but also involves a process similar that the information provided by these two platforms is not to EMT (11). While much of cell specification is imprinted entirely the same. The poor correspondence between Agi- epigenetically via DNA methylation and histone modifica- lent two-channel microarray and RNA-seq data has been at- tions, significant functional changes, such modifications in tributed to differences in ratio (Agilent two-channel microar- cell state due to EMT, may occur within these epigenetic ray) versus non-ratio (RNA-seq) representations of transcript constraints. To characterize cell state based on gene expres- abundance by the platforms (13). sion, supervised methods have been predominantly used for We next asked whether the assayed transcript abundance developing gene signatures that characterize the epithelial- corresponds to abundance. First, we compared RNA- mesenchymal transition. While effective, supervised meth- seq counts reported for cell lines associated with the Cancer ods can perform poorly if the strategy is based on misinfor- Cell Line Encyclopedia (CCLE) with protein abundance for mation, such as sample misclassification or prior biases as to the same cell lines measured using Reverse Phase Protein Ar- the number of cell states or defining genes. While used less ray (RPPA). We filtered the respective data sets to those cell frequently, unsupervised methods for feature extraction and lines that were reported in both data sets and for genes where selection are advantageous as they can be data-driven (12). there was a positive correlation coefficient greater than 0.36 Here, our objective was to develop an unsupervised gene sig- between read counts in RPKM and normalized RPPA val- nature capturing this change in phenotype that is tailored to ues. From the initial data sets, 283 cell lines and 146 genes the specific cellular context of breast cancer and melanoma, were retained for analysis after filtering. Next, we deter- as an illustrative example. mined whether the pairs of mRNA and protein measurements share a common value for steady-state transcript abundance Results that corresponds to steady-state protein abundance measured RNA-sequencing provides an estimate of protein above background. To do this, we applied a protein expres- abundance. We first asked whether assaying the same genes sion model to each gene measured across the cell lines where using different transcriptomics profiling platforms provides protein abundance was assumed to be a saturable function of the same information. To do this, we compared gene ex- transcript abundance (Fig. 2A). Using the fitted curve, the pression levels assayed by either Agilent microarray or by threshold of transcript abundance for detecting a change in Illumina RNA-sequencing for the same samples (Figure 1). protein abundance 2.5% above background was back calcu- Expression values obtained by RNA-seq are in units of tran- lated. Example data sets and the corresponding curve fits for scripts per million (TPM) while the Agilent Microarray re- the genes CLDN7, AXL, JAG1, and CDH1 are shown in Fig- sults are in terms of intensity units. Using samples obtained ure 2B. Interestingly, the peak value in the distribution of cal- as part of the breast cancer arm of the TCGA, we focused culated threshold values corresponded to 1 RPKM (Fig 2C). on genes that have been associated with host immunity, as We repeated this analysis for transcript abundance assayed by these genes are likely to span a broad dynamic range within Affymetrix U133+2 microarray, a single-channel approach, these samples. As the TCGA samples are obtained from ho- using RMA-normalized expression values, where 149 genes mogenized bulk samples of tumor and matched normal breast were retained for analysis after filtering. Qualitatively, data tissue, expression of these genes could be from the malignant obtained using the single-channel platform (Affymetrix) ex- cells, like GATA3 expression by breast cancer cells, or from hibited better correspondence with RPPA values than data ob- immune cell infiltrates, like the potential expression of IL4 tained using Agilent’s two-channel platform. However, the and IL5 by infiltrating T helper type 2 cells. distribution in calculated threshold values were more broadly Generally, comparing the same row across the two panels distributed compared to the RNA-seq results (see red line illustrates the poor correspondence between transcript abun- in Fig 2C, F-test p-value = 0.0045). These results imply dance assayed using Agilent microarrays and read counts that the transcript abundance assayed by RNA-seq provides (TPM) obtained by RNA-seq. A subset of genes, like HLA- a higher quality estimate of protein abundance, that is the DRA and HLA-DPA1, exhibit both high microarray inten- signal-to-noise is improved, compared with data obtained us- sity units and read counts while other genes, like TBX21 and ing a single-channel microarray platform. FASLG, exhibit high microarray intensity units but have low Collectively, the common threshold value observed using read counts. In addition, the dynamic range observed among RNA-seq data has two implications. First, there are some these samples is different depending on the platform used, as genes that have a high sensitivity of detection using microar- illustrated in the heatmap by TBX21 and IL17F. Using Illu- rays such that the observed changes may not be function- mina RNA-seq, TBX21 is constrained to the low end of the ally important. From Figure 1, it seems that IL17F, TBX21, color spectrum (dark to royal blue) while the dynamic range FASLG, KLRD1, IFNG, CCL17, and IL10 are but a few ex- spans the middle to upper end of the color spectrum (green amples (i.e., high Agilent microarray intensity but very low to red) when assayed using Agilent microarray. Similarly, read counts) in that dataset. Without knowing the detection IL17F transcripts were not detected by RNA-seq in 87% of sensitivity by microarray, traditional approaches using a z- the samples but the Agilent microarray shows a rather high score metric may give equal weight to changes in gene ex- average intensity with variation among the samples. The pression driven by a biological signal as to changes domi- difference in average intensities among genes and in vari- nated by random noise. Second, the threshold value provides ance among samples assayed by these two platforms suggest a rationale for filtering genes that are likely to have a low in-

2 | bioRχiv Klinke et al. | Future Home Journal bioRxiv preprint doi: https://doi.org/10.1101/865139; this version posted December 13, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

formation content when developing gene signatures for phe- line combinations. The resulting set of eigenvalues represent notypes that are not well defined. the values that could be obtained by random chance if the un- derlying dataset has no information, which are shown as the Gene expression patterns in breast cancer cells are dotted red line in Figure 4A. In comparison to the null distri- captured by a single component. Given the variety bution, only the first two PC were above the threshold. The of breast cancer subtypes reported in the literature, we variance captured by the remaining PCs were below the null next asked how many different genetic regulatory networks PCA distribution suggesting that any potential biological in- (GRN) are at work in breast cancer. GRNs associated with terpretations of these additional PCs could also be explained development commonly contain transcription factors that in- by random chance. Therefore we focused on the first two teract via positive feedback such that the target genes are ei- PCs. ther co-expressed or expressed in a mutually exclusive fash- As variance in read counts is proportional to abundance, ion (14). Given the interest in functional responses, we are gene projections along the PC1 axis were proportional to focusing on patterns of gene expression in response to sig- the average read counts of the corresponding gene among nal processing by the genetic regulatory network rather than the samples. Measured transcript abundance is proportional trying to identify the topology of the GRNs. In motivating to the basal gene expression associated with cell specifica- this study, we made four assumptions. First, we assumed tion and technical artifacts associated with RNA sequencing. that oncogenic mutations alter the peripheral control of GRN Genes were retained for further analysis that were expressed but do not alter the core network topology, where signals above the 1 RPKM threshold in more that 5% of the cell processed by a GRN change cell phenotype by engaging a lines. Next, we focused on the projection of retained genes unique gene expression pattern. Second, malignant cells de- along principal components 2 and 3. The projections were rived from a particular anatomically-defined cancer repre- annotated with horizontal and vertical dotted lines that en- sent the diverse ways that hijacking these GRNs can pro- close 95% of the projections from the null distribution. While vide a fitness advantage to malignant cells within the tumor the majority of the genes were distributed around the origin, microenvironment. Third, culturable tumor cell lines repre- a subset of genes were projected along the extreme of the sent a sampling of these ways in which GRNs are hijacked PC2 axis (outside of the dotted horizontal lines) and had no in a particular anatomical location. Fourth, the process of significant projection along the PC3 axis (inside of the dot- isolating these malignant cells from tumor tissue to gener- ted vertical lines). The list of genes associated with either ate culturable cell lines does not bias this view. It follows the high PC2/null PC3 or the low PC2/null PC3 groups are then that the number of different GRNs can be identified by listed in Supplemental Table S1 and contained 128 and 101 analyzing the transcriptional patterns of genes likely to par- genes, respectively. As the projection of Vimentin (VIM - ticipate in GRNs among an ensemble of tumor cells lines red dot in Figure 4C) and E-cadherin (CDH1 - blue dot in that share a common tissue of origin. We focused our at- Figure 4C) were prototypical for these two groups of genes, tention on 780 genes that have been previously associated the high PC2/null PC3 genes were annotated as a mesenchy- with the epithelial-mesenchymal transition and related gene mal signature (i.e., a de-differentiated state) genes and the sets in MSigDB v4.0. and analyzed the expression of these low PC2/null PC3 group were annotated as an epithelial sig- genes among 57 breast cancer cell lines included in the CCLE nature (i.e., a terminally differentiated state). In contrast to database as assayed by RNA-sequencing using a feature ex- supervised approaches that use Vimentin and E-cadherin as traction/feature selection workflow summarized in Figure 3. the basis to identify associated genes (e.g., (17, 18)), the ap- To identify coordinately expressed genes, we used Principal proach used here is unsupervised whereby the association of Component Analysis (PCA), a linear statistical approach for Vimentin and E-cadherin with these two opposite groups of unsupervised feature extraction and selection that enables the genes emerges naturally from the data. unbiased discovery of clusters of genes that exhibit coher- ent patterns of expression (i.e., features) that are independent The Epithelial and Mesenchymal state measures strat- of other gene clusters (15). The relative magnitude of the ify intrinsic subtypes of breast cancer and metastatic resulting gene expression patterns can be inferred from the potential. Using these two sets of genes, we developed a eigenvalues, which is shown in Figure 4. Specifically, PC1 state metric to quantify the extent of a gene expression signa- and PC2 captured 65% and 14% of the variance, respectively. ture associated with epithelial differentiation and mesenchy- Additional principal components each captured less than 4% mal de-differentiation using a normalized sum over all of the of the variance. genes associated with a signature. While the PCA results One of the challenges with PCA is that no clear rules ex- suggest that these two sets of genes are inversely related, the ist to determine how many principal components to consider, metrics were designed to represent each state independently such as a gap statistic in clustering (16). To select an appro- such that cells that exhibit a pure phenotype would have val- priate number of PCs (i.e., features), we established a thresh- ues of 1 and 0 associated with their respective state metrics old for determining significance relative to a null distribu- and cells with mixed phenotypes could potentially have val- tion. Specifically, we applied the same PCA to a synthetic ues of 1 for both state metrics. Next we calculated the state noise dataset generated from the original data by randomly metric values for all of the breast cancer cell lines, where their resampling with replacement the collection of gene expres- projections in state space are shown in Figure 5. Interestingly, sion values and assigning the values to particular gene-cell the breast cancer cell lines largely followed a linear recipro-

Klinke et al. | Future Home Journal bioRχiv | 3 bioRxiv preprint doi: https://doi.org/10.1101/865139; this version posted December 13, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

cal relationship between epithelial (E) and mesenchymal (M) scRNA-seq data obtained from a digested normal skin sample states (dotted line in Figure 5) and were segregated by intrin- obtained from a human female. This cluster contained about sic PAM50 subtype (19). While HER2, Luminal A, Lumi- 1/3 of the cell samples measured within the CD45-negative nal B, and Basal subtypes all have a high E signature, they population of the digested skin sample (see Supplemental progressively increased in their M signature. The Claudin Figure S1). Using this fibroblast gene list, overlapping genes Low subset spanned the greatest range with some express- were removed from the state metrics and highlighted in yel- ing a high E and moderate M signatures (e.g., HCC1569, low in Supplemental Table S1. All but one of the genes re- MDAMB361, HMEL) and others with a low E and high M moved were contained within the mesenchymal gene list. signatures (e.g., BT549 and HS578T). Of note, a subset of Gene expression assayed from a bulk tissue sample re- the Claudin Low cell lines (e.g., HS742T, HS343T, HS281T, flects the combined contributions of non-malignant cells plus HS606T, and HS274T) with high M and very low E sig- the changes induced by oncogenic transformation, and re- natures have been suggested by the CCLE to be fibroblast- ciprocal changes due to de-differentiation among malignant like (see Cell_lines_annotations_20181226.txt). Function- cells. Observable changes depend on the relative contribu- ally, cells with low E and high M signatures had a high tions of each cell source. As the unsupervised PCA analy- propensity for metastasis while the propensity for metasta- sis of the cell line data suggested that genes associated with sis was low in cell lines with high E and low M signatures EMT can be revealed by identifying a reciprocal pattern of (20). gene expression, we performed Ridge logistic regression us- We next assessed the epithelial and mesenchymal state ing the sample annotation to obtain regression coefficients metrics in breast cancer cells assayed using scRNA-seq for the list of EMT genes that passed the fibroblast filter (n (21) (see Figure 5B). Similar to the cell lines, the samples = 158). The regression coefficients were used to filter the were spread across the epithelial to mesenchymal spectrum list of EMT genes for consistency with the reciprocal gene roughly ordered by their corresponding intrinsic subtype, signature identified in the CCLE analysis. Genes that passed where HER2 subtype had a high E/low M signature and the the consistency filter were used to define the epithelial and basal subtype had the highest M signature without much of a mesenchymal state metrics for bulk tissue samples. Of note, reduction in their E signature. Overall the state values were E-cadherin (CDH1) and CEACAM1 were associated with farther below the reciprocal trendline than any of the cell lines the epithelial state metric while N-cadherin (CDH2), Wnt- sampled. As gene-level reads by scRNA-seq are frequently inducible signaling pathway protein 1 (WISP1/CCN4), and missing (i.e., a dropout read)(22), we imputed missing val- matrix metallopeptidase 3 (MMP3) were associated with the ues to assess whether the distribution in the E/M state values mesenchymal state metric. The list of genes associated with were a result of read dropouts (see Figure 5C). While read the corresponding state metrics are summarized in Supple- imputation shifted the cell state metrics toward the recipro- mental Table S2. cal trendline, the heterogeneity among the cell measurements Next, we projected the tissue samples obtained as part of was lost. Overall, it is unclear whether scRNA-seq measure- the breast cancer arm of the TCGA in EMT space using the ments can be used to identify biological heterogeneity sep- two tissue-based state metrics. Similar to the CCLE analysis, arately from heterogeneity introduced by technical limits of all samples clustered along the reciprocal SME versus SMM the assay. line but exhibited greater dispersion. Samples obtained from While single-cell methods are rapidly emerging as tool to normal breast tissue clustered separately from breast cancer assay human tissue samples, bulk transcriptomic assays of tu- samples (Figure 6), with normal breast tissue samples having mor tissue samples, like those acquired as part of the Cancer the highest values for the epithelial state and lower values, Genome Atlas (TCGA) are more abundant. More samples on average, for the mesenchymal state. Among the different increases the statistical power for identifying clinical, cel- clinical breast cancer subtypes, the median value for SME lular, and genetic correlates of the epithelial-mesenchymal progressively decreased from ER/PR+ (luminal), HER2+, transition. However, applying the epithelial/mesenchymal and triple negative (TN: ER-/PR-/HER2-) subtypes. The state metrics to interpret RNA-seq assays of bulk tumor tissue mesenchymal projections were about equal for both ER/PR+ samples requires some additional filtering steps as bulk RNA- and HER2+ subsets and higher than the TN samples. We seq measurements averages over the heterogeneous normal note that, while the HER2+ samples were roughly equal to and malignant cell types present within the tissue. In terms the ER/PR+ along the SMM axis, the HER2+ samples were of a gene signature for the epithelial-mesenchymal transition, lower than the ER/PR+ samples along the SME axis, which many of the genes commonly associated with acquiring mes- aligns with clinical observations. For instance, patients with enchymal function are also associated with fibroblasts, a rela- HER2+ subtype of breast cancer are at increased risk for de- tively common cell type in epithelial tissues. Thus, an enrich- veloping metastatic lesions compared to TN and luminal sub- ment of genes associated with the epithelial-mesenchymal types (23). The two different state metrics seem to capture transition may be explained solely by a shift in the preva- gene signatures that helps anchor a cell to its’ designated lo- lence of fibroblasts within the tissue sample. To deconvolute cation within the tissue and that promotes active migration, fibroblast genes from the state metrics, we obtained a list of respectively. In other words, reducing SME corresponds to 2500 genes that were uniquely associated (Area under ROC raising the anchor and increasing SMM corresponds to hoist- curve > 0.5) with a cluster annotated as fibroblasts using ing the sail. In summary, both cell-level and tissue-level EMT

4 | bioRχiv Klinke et al. | Future Home Journal bioRxiv preprint doi: https://doi.org/10.1101/865139; this version posted December 13, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

state metrics provide an estimate of metastatic potential and nign melanocytic nevi and untreated primary melanoma tis- a digital measure of malignant cell differentiation state in the sue were projected onto the state space. Of note, CEA- context of breast cancer. CAM1 and MITF were associated with the differentiated state and three genes - CEACAM1, CGN, and HPGD - were Gene expression patterns in melanoma cells are also shared with the breast cancer epithelial state metric. The de- captured by a single component. Using the same fea- differentiated state metric had five genes - WISP1/CCN4, ture extraction/feature selection workflow as the breast can- EDNRA, FOXC2, SERPINE1, and SPOCK1 - that were cer analysis (Figure 3), we applied principal component anal- shared with the breast cancer mesenchymal state metric. ysis to the expression of EMT-related genes assayed in an While samples were more narrowly distributed in state space ensemble of 56 melanoma cell lines associated with the Can- compared to the cell lines (Figure 9), all of the benign nevi cer Cell Line Encyclopedia (Figure 7). We focused on the exhibited higher terminally differentiated (SMT ) and tended first two principal components, PC1 and PC2, that captured to have lower de-differentiated values (SMD). The samples 80% and 6% of the variance, respectively. Additional prin- from primary melanoma were color-coded based on the an- cipal components each captured less than 4% of the vari- notated Breslow’s depth, where higher values were associ- ance. PC1 captured the variance associated with read abun- ated with lower terminal differentiation scores. Using Bres- dance, as gene projections along the PC1 axis were propor- low’s depth as a surrogate measure of metastatic potential tional to the average read counts among the samples. Vi- (24), tissue-level EMT state metrics provide an estimate of mentin (VIM) and fibronectin (FN1) were two of the most metastatic potential and a digital measure of malignant cell highly expressed genes while members of the Wnt family differentiation state in the context of melanoma. were some of the genes with low expression (e.g., WNT1, WNT6, WNT8B, WNT3A, WNT8A, WNT9B). Genes re- Terminal differentiation is associated with distinct tained for further analysis were expressed above the 1 RPKM gene signatures while de-differentiation seems to en- threshold in more than 5% of the cell lines. gage common gene regulatory networks. The sepa- Next, we focused on the projection of retained genes along rate gene signatures generated for breast cancer cells and PC2 and PC3 axes. The projections were annotated with hor- melanoma cells using an unsupervised approach provide an izontal and vertical dotted lines that enclose 95% of the pro- opportunity to identify unique and shared aspects of the ge- jections from the null distribution. While the majority of the netic regulatory mechanisms underpinning cell specification, genes were distributed around the origin, a subset of genes as summarized in Figure 9. Specifically, we used DAVID were projected along the extreme of the PC2 axis (outside of to identify genes with transcription factor activity using the dotted horizontal lines) and had no significant projection the GOTERM_MF_ALL: Sequence_specific_DNA_binding along the PC3 axis (inside of the dotted vertical lines). The + UP_Keywords:DNA_binding. In the breast cancer cell list of genes associated with either the high PC2/null PC3 or lines, nine transcription factors were upregulated in cells with the low PC2/null PC3 groups are listed in Supplemental Table a terminally differentiated phenotype, including GRHL2 and S1 and contained 26 and 90 genes, respectively. In contrast to OVOL2 that have been associated with enforcing epithelial the breast cancer results, the projection of Vimentin (VIM - differentiation (25). Correspondingly, five transcription fac- red dot in Figure 7C) and E-cadherin (CDH1 - blue dot in Fig- tors were upregulated in melanoma cells, including MITF ure 7C) were not associated with either of these two groups that is essential for melanocyte differentiation (26). Inter- of genes. As the high PC2/null PC3 group included MITF, estingly, there was no overlap in the genes with transcription a master regulator of melanocyte differentiation, and the low factor activity in the two differentiated cell signatures. In PC2/null PC3 group included a number of EMT-related genes contrast, melanoma and breast cancer cell lines that exhib- (e.g., FN1, TCF4, ZEB1, TWIST2, and WISP1), these two ited a de-differentiated phenotype shared five transcription gene sets were annotated as a terminally differentiated sig- factors, including TWIST2 and ZEB1. De-differentiation nature (i.e., an epithelial-like state) and a de-differentiated in breast cancer cell lines were also associated with an ad- signature (i.e., a mesenchymal-like state), respectively. ditional six transcription factors, including TWIST1 (27). Projections of the melanoma cell lines in differentiation Overall, the analysis of these transcription factors is consis- state space were calculated using the two state metrics (Fig- tent with specificity in phenotype as a consequence of engag- ure 8). Similar to the breast cancer cell lines, the melanoma ing gene regulatory networks unique to a specialized cell sub- cell lines largely followed a linear reciprocal relationship be- set while de-differentiation seemed to engage common gene tween terminally differentiated (SMT ) and de-differentiated regulatory networks that facilitate the loss of cell specificity. (SMD) states (dotted line in Figure 8). The majority of cell lines exhibited primarily a terminally differentiated signa- ture with some expression of de-differentiated genes while Discussion only a small subset of the cell lines exhibited primarily a Here we used an unsupervised feature extraction and se- de-differentiated signature. The gene signatures for sin- lection approach based on principal component analysis gle melanoma cells were also highly heterogeneous due to and resampling to identify state metrics for the epithelial- dropout of gene reads. mesenchymal transition in breast cancer and melanoma. Using state metrics refined for use with tissue samples Given the importance for identifying patients with tumors (see Supplemental Table S2), samples acquired from be- likely to metastasize, a number of gene signatures have been

Klinke et al. | Future Home Journal bioRχiv | 5 bioRxiv preprint doi: https://doi.org/10.1101/865139; this version posted December 13, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

developed to predict the prevalence of tumor cells with a treated primary melanoma tissue and associated annotation epithelial-mesenchymal transition signature (17, 18, 28–30). were downloaded from GEO entry GSE98394. Supervised approaches are most common (17, 18, 28, 29), where samples are classified a priori. For instance, Koplev et Non-linear regression of protein abundance to mRNA al. (29) develop gene signatures that average over all anatom- expression. All data was analyzed in R (V3.5.1) using the ical locations while Levine and coworkers (28, 31) classify ’stats’ package (V3.5.1). For each gene where complemen- training samples a priori into one of three cell states: ep- tary CCLE transcriptomic and RPPA data exist and for which ithelial, mesenchymal, or hybrid E/M. Rokavec et al. gen- their correlation coefficient was above 0.36, the non-linear erate features based on co-expression with E-cadherin and function, b · XmRNA Vimentin (18). While effective, supervised methods can per- Yprotein = a + , (1) form poorly if the strategy is based on misinformation, such XmRNA + c as sample misclassification or prior biases as to the number was regressed using the nls function to the resulting protein of cell states or defining genes. We also note that state met- (Yprotein) and transcript (XmRNA) abundance data. As the developed using microarray technology (e.g., (17, 29)) RPPA values are normalized, the parameters a and b repre- are not likely relevant for interpreting data based on RNA se- sent the background value and maximum detectable increase quencing, given the unclear relation between transcriptome above background, respectively, while the parameter c rep- and protein abundance as assayed using microarray technol- resents the midpoint in transcript abundance within the dy- ogy. While rarely used, the data-driven nature of unsuper- namic range of the assay. A minimum in the summed squared vised methods for feature extraction and selection are attrac- errors between model-predicted and observed RPPA values tive (12). For instance, Umeyama et al. used an unsuper- were used to determine the optimal values of the model pa- vised approach for feature extraction to identify genes as- rameters. Using the optimal values, a threshold was esti- sociated with metastasis (32). To illustrate this data-driven mated independently for each gene based on the transcript approach, we have focused on breast cancer and melanoma, abundance that yields a 2.5% increase in protein abundance where metastatic dissemination to vital organs are key lim- above background. The regression was repeated using both iters of patient survival. In summary, we hope that our devel- RNA-seq and Affymetrix transcriptomics data. oped state metrics find use alongside other digital cytometry tools to better understand how oncogenic transformation al- Statistical analysis for cell-level signatures. Principal ters the immune contexture within the tumor microenviron- component analysis (PCA) was performed on log base 2 ment. transformed RPKM values using the prcomp function on the CCLE RNA-seq data, which was filtered to 780 genes previ- ously associated with epithelial-mesenchymal transition. The Methods collective list of genes were assembled from prior studies ’Omics Data. Transcriptomics profiling of the same (17, 33–37) and additional gene sets from MSigDB V4.0 samples using both Agilent microarray and Illumina including: “EPITHELIAL TO MESENCHYMAL TRAN- RNA sequencing for the breast cancer arm (BRCA) of SITION" and “REACTOME TGF BETA RECEPTOR SIG- the Cancer Genome Atlas was downloaded from TCGA NALING IN EMT EPITHELIAL TO MESENCHYMAL data commons. Values for gene expression, expressed in TRANSITION". PCA was applied to the genes to extract RPKM for RNA-seq and gene-centric RMA-normalized the features, where the resulting eigenvectors capture the rel- data for Affymetrix U133+2 microarray, for the cell ative influence of a gene’s expression on a specific principal lines contained within the Cancer Cell Line Encyclo- component and the eigenvalues represent how much infor- pedia were downloaded from the Broad data commons mation contained within the dataset is captured by a specific (Website: https://portals.broadinstitute.org/ccle Files: principal component. Drawing upon conventional hypothesis CCLE_RNAseq_081117.rpkm.gct accessed 12/22/2017 testing where significance is established by rejecting the null and CCLE_Expression_Entrez_2012-10-18.res accessed hypothesis that experimental observations could be explained 6/15/2018). Reverse phase protein array (RPPA) results for by random chance, we used a resampling approach to estab- the cancer cell lines were obtained from the M.D. Anderson lish a null hypothesis related to the eigenvalues, that is to proteomics website (Website: https://tcpaportal.org/mclp/ determine the true rank of the noisy expression matrix. The File: MCLP-v1.1-Level4.txt accessed 6/15/2018) (6). resampling approach involved repetitively applying PCA (n = Single-cell gene expression (scRNA-seq) for breast cancer 1000) to a synthetic noise dataset with the same dimensions and melanoma cells expressed in TPM were down- that was generated from the original data by randomly resam- loaded from the Gene Expression Omnibus (GEO) entries pling with replacement from the collection of gene expres- GSE75688 and GSE72056, respectively. 10X Genomics sion values and assigning the values to particular gene-cell scRNA-seq data for CD45-negative cells digested from a line combinations. The resulting distribution of eigenvalues normal human female skin sample and expressed in counts and eigenvectors represent the values that could be obtained of gene-level features was downloaded from European by random chance if the underlying dataset has no informa- Bioinformatics Institute (EMBL-EBI) ArrayExpress entry tion (i.e., the null PCA distribution). Principal components E-MTAB-6831. RNA-seq data expressed in counts assayed with eigenvalues greater than the null PCA distribution were in samples acquired from benign melanocytic nevi and un- used to define the principal subspace for subsequent analysis,

6 | bioRχiv Klinke et al. | Future Home Journal bioRxiv preprint doi: https://doi.org/10.1101/865139; this version posted December 13, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

that is the selection of features. Similarly, the distribution in samples were randomly assigned in an 80:20 ratio between the projection of genes within the null PCA space were used training and testing samples. Regression coefficients were to determine whether the projection of a gene along a particu- captured for each iteration using a lambda value that min- lar PC axis was explained by random chance or not by setting imized the misclassification error of a binomial prediction thresholds along the PC2 and PC3 axes that enclosed 95% of model estimated by cross-validation. Accuracy was assessed the null PCA space. The PC projection of genes relative to using the testing samples. Genes were determined to have a the null PCA space was used to refine the extracted features. consistent expression pattern if greater than 95% of the dis- A metric was developed to estimate the extent that a tribution in regression coefficients had the correct sign. Sim- cell exhibits a gene signature corresponding to a “Epithe- ilarly to the cell-level analysis, state metrics were developed lial/Terminally Differentiated" versus “Mesenchymal/De- for bulk tissue-level RNA-seq measurements to estimate the differentiated" state. The state metrics (SM) quantify the extent that a tissue sample exhibits a gene signature corre- cellular state by averaging over a normalized expression level sponding to a “Epithelial/Terminally Differentiated" versus of each gene in the signature (readsi, expressed in TPM) ac- “Mesenchymal/De-differentiated" state. The genes included cording to the formula: in a signature and their corresponding Ki values are listed in Supplemental Table S2. ngs 1 X readsi SM = K . (2) ACKNOWLEDGEMENTS ngs readsi + 2 i i=1 This work was supported by National Science Foundation (NSF CBET-1644932 to DJK) and National Cancer Institute (NCI 1R01CA193473 to DJK). The content The genes included in a signature with their corresponding is solely the responsibility of the authors and does not necessarily represent the official views of the NSF or NCI. Ki values are listed in Supplemental Table S1 and ngs cor- AUTHOR CONTRIBUTIONS responds to the number of genes within a signature. The Ki These contributions follow the International Committee of Medical Journal Ed- values were estimated by clustering the log2 expression of itors guidelines: http://www.icmje.org/recommendations/. Conceptualization: DJK; Study Design: DJK; AT;; Data Analysis: DJK; AT; Data Interpretation: DJK; Fund- each gene into two groups using the k-means method and the ing acquisition: DJK; Methodology: DJK; AT; Project administration: DJK; Software: value was set as the mid-point in expression between the two DJK; AT; Supervision: DJK; Writing – original draft: DJK; Writing – review & editing: groups. all authors. COMPETING FINANCIAL INTERESTS Statistical analysis for tissue-level signatures. Genes The authors declare no competing financial interests. differentially expressed in normal epithelial fibroblasts were obtained by analyzing single-cell RNA-seq data of normal Bibliography skin obtained using a Genomics 10x platform and a bioin- formatics workflow based on the scater (V1.12.2) and SC3 1. A. M. Newman, C. B. Steen, C. L. Liu, A. J. Gentles, A. A. Chaudhuri, F. Scherer, M. S. Khodadoust, M. S. Esfahani, B. A. Luca, D. Steiner, M. Diehn, and A. A. Alizadeh. Deter- (V1.12.0) packages in R. Briefly, scRNA-seq data were fil- mining cell type abundance and expression from bulk tissues with digital cytometry. Nat. tered to retain samples that had less than 50% of the reads Biotechnol., May 2019. 2. V. Thorsson, D. L. Gibbs, S. D. Brown, D. Wolf, D. S. Bortone, T. H. Ou Yang, E. Porta- in the top 50 genes and to remove outlier samples based on Pardo, G. F. Gao, C. L. Plaisier, J. A. Eddy, E. Ziv, A. C. Culhane, E. O. Paull, I. K. A. PCA analysis. Gene-level features were limited to those that Sivakumar, A. J. Gentles, R. Malhotra, F. Farshidfar, A. Colaprico, J. S. Parker, L. E. Mose, N. S. Vo, J. Liu, Y. Liu, J. Rader, V. Dhankani, S. M. Reynolds, R. Bowlby, A. Califano, A. D. were expressed at greater than 1 count in more than 10 cell Cherniack, D. Anastassiou, D. Bedognetti, A. Rao, K. Chen, A. Krasnitz, H. Hu, T. M. Malta, samples. Read depth was normalized using a variant of CPM H. Noushmehr, C. S. Pedamallu, S. Bullman, A. I. Ojesina, A. Lamb, W. Zhou, H. Shen, T. K. Choueiri, J. N. Weinstein, J. Guinney, J. Saltz, and et al. The Immune Landscape of contained within the scran (V1.12.1) package, which devel- Cancer. Immunity, 48(4):812–830, 04 2018. ops a sample-specific normalization factor repetitive sample 3. I. Tirosh, B. Izar, S. M. Prakadan, M. H. Wadsworth, D. Treacy, J. J. Trombetta, A. Rotem, C. Rodman, C. Lian, G. Murphy, M. Fallahi-Sichani, K. Dutton-Regester, J. R. Lin, O. Co- pooling followed by deconvoluting a sample-specific factor hen, P. Shah, D. Lu, A. S. Genshaft, T. K. Hughes, C. G. Ziegler, S. W. Kazer, A. Gaillard, by linear algebra. Following from Davidson et al. (38), fi- K. E. Kolb, A. C. Villani, C. M. Johannessen, A. Y. Andreev, E. M. Van Allen, M. Bertagnolli, P. K. Sorger, R. J. Sullivan, K. T. Flaherty, D. T. Frederick, J. Jane-Valbuena, C. H. Yoon, broblasts were annotated based on co-expression of COL1A1 O. Rozenblatt-Rosen, A. K. Shalek, A. Regev, and L. A. Garraway. Dissecting the multi- and COL1A2. Samples were clustered and genes differen- cellular ecosystem of metastatic melanoma by single-cell RNA-seq. Science, 352(6282): 189–196, Apr 2016. tially associated with each cluster were identified using the 4. B. Shannan, M. Perego, R. Somasundaram, and M. Herlyn. Heterogeneity in Melanoma. SC3 workflow (V1.14.0) using default parameters (see Fig- Cancer Treat. Res., 167:1–15, 2016. 5. S. Koren and M. Bentires-Alj. Breast Tumor Heterogeneity: Source of Fitness, Hurdle for ure S1). Therapy. Mol. Cell, 60(4):537–546, Nov 2015. Prior to logistic regression analysis, TCGA BRCA data 6. J. Li, W. Zhao, R. Akbani, W. Liu, Z. Ju, S. Ling, C. P. Vellano, P. Roebuck, Q. Yu, A. K. Eterovic, L. A. Byers, M. A. Davies, W. Deng, Y. N. Gopal, G. Chen, E. M. von Euw, D. Sla- and the benign nevi and melanoma data were filtered to re- mon, D. Conklin, J. V. Heymach, A. F. Gazdar, J. D. Minna, J. N. Myers, Y. Lu, G. B. Mills, and move sample outliers and normalized based on housekeeping H. Liang. Characterization of Human Cancer Cell Lines by Reverse-phase Protein Arrays. Cancer Cell, 31(2):225–239, 02 2017. gene expression (39). Using normal versus tumor annotation 7. A. Taghian, M. D. El-Ghamry, and S. D. Merajver. Overview of the treatment of newly diag- associated with the data, ridge logistic regression was per- nosed, non-metastatic breast cancer. UpToDate, 2019. 8. J.A. Sosman. Overview of the management of advanced cutaneous melanoma. UpToDate, formed on log base 2 transformed TPM and median-centered 2019. values using the glmnet package (V2.0-18), which was lim- 9. American Cancer Society. Cancer Facts & Figures 2019. American Cancer Society, Atlanta, Ga, 2019. ited to EMT-related genes identified in the CCLE analysis 10. M. Zhang, A. V. Lee, and J. M. Rosen. The Cellular Origin and Evolution of Breast Cancer. and not associated with normal fibroblasts. To minimize Cold Spring Harb Perspect Med, 7(3), Mar 2017. 11. T. Regad. Molecular and cellular pathogenesis of melanoma initiation and progression. Cell. overfitting, ridge logistic regression was repeated 500 times Mol. Life Sci., 70(21):4055–4065, Nov 2013. using a subsample of the original data set using the genes as- 12. Y. H. Taguchi. Principal Components Analysis Based Unsupervised Feature Extraction Ap- plied to Gene Expression Analysis of Blood from Dengue Haemorrhagic Fever Patients. Sci sociated with each signature separately. In each iteration, the Rep, 7:44016, 03 2017.

Klinke et al. | Future Home Journal bioRχiv | 7 bioRxiv preprint doi: https://doi.org/10.1101/865139; this version posted December 13, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

13. Y. Guo, Q. Sheng, J. Li, F. Ye, D. C. Samuels, and Y. Shyr. Large scale comparison of gene (10):569–574, Oct 2013. expression levels by microarrays and RNAseq using TCGA data. PLoS ONE, 8(8):e71462, 2013. 14. Uri Alon. An introduction to systems biology: design principles of biological circuits, vol- ume 10 of Chapman & Hall/CRC mathematical and computational biology series. Chapman & Hall/CRC, Boca Raton, FL, 2007. 15. I. T. Jolliffe and J. Cadima. Principal component analysis: a review and recent develop- ments. Philos Trans A Math Phys Eng Sci, 374(2065):20150202, Apr 2016. 16. R. Tibshirani, G. Walther, and T. Hastie. Estimating the number of clusters in a data set via the gap statistic. J. R. Statist. Soc. B., 63(2):411–423, 2001. 17. T. Z. Tan, Q. H. Miow, Y. Miki, T. Noda, S. Mori, R. Y. Huang, and J. P. Thiery. Epithelial- mesenchymal transition spectrum quantification and its efficacy in deciphering survival and drug responses of cancer patients. EMBO Mol Med, 6(10):1279–1293, Oct 2014. 18. M. Rokavec, M. Kaller, D. Horst, and H. Hermeking. Pan-cancer EMT-signature identifies RBM47 down-regulation during colorectal cancer progression. Sci Rep, 7(1):4687, 07 2017. 19. J. S. Parker, M. Mullins, M. C. Cheang, S. Leung, D. Voduc, T. Vickery, S. Davies, C. Fauron, X. He, Z. Hu, J. F. Quackenbush, I. J. Stijleman, J. Palazzo, J. S. Marron, A. B. Nobel, E. Mardis, T. O. Nielsen, M. J. Ellis, C. M. Perou, and P.S. Bernard. Supervised risk predictor of breast cancer based on intrinsic subtypes. J. Clin. Oncol., 27(8):1160–1167, Mar 2009. 20. C. L. Yankaskas, K. N. Thompson, C. D. Paul, M. I. Vitolo, P. Mistriotis, A. Mahendra, V. K. Bajpai, D. J. Shea, K. M. Manto, A. C. Chai, N. Varadarajan, A. Kontrogianni- Konstantopoulos, S. S. Martin, and K. Konstantopoulos. A microfluidic assay for the quan- tification of the metastatic propensity of breast cancer specimens. Nat Biomed Eng, May 2019. 21. W. Chung, H. H. Eum, H. O. Lee, K. M. Lee, H. B. Lee, K. T. Kim, H. S. Ryu, S. Kim, J. E. Lee, Y. H. Park, Z. Kan, W. Han, and W. Y. Park. Single-cell RNA-seq enables comprehensive tumour and immune cell profiling in primary breast cancer. Nat Commun, 8:15081, 05 2017. 22. T. S. Andrews and M. Hemberg. False signals induced by single-cell imputation [version 2; peer review: 4 approved]. F1000Res, 7:1740, 2019. 23. H. Kennecke, R. Yerushalmi, R. Woods, M. C. Cheang, D. Voduc, C. H. Speers, T. O. Nielsen, and K. Gelmon. Metastatic behavior of breast cancer subtypes. J. Clin. Oncol., 28 (20):3271–3277, Jul 2010. 24. C. M. Balch, J. E. Gershenwald, S. J. Soong, J. F. Thompson, M. B. Atkins, D. R. Byrd, A. C. Buzaid, A. J. Cochran, D. G. Coit, S. Ding, A. M. Eggermont, K. T. Flaherty, P. A. Gimotty, J. M. Kirkwood, K. M. McMasters, M. C. Mihm, D. L. Morton, M. I. Ross, A. J. Sober, and V. K. Sondak. Final version of 2009 AJCC melanoma staging and classification. J. Clin. Oncol., 27(36):6199–6206, Dec 2009. 25. B. Cieply, P. Riley, P. M. Pifer, J. Widmeyer, J. B. Addison, A. V. Ivanov, J. Denvir, and S. M. Frisch. Suppression of the epithelial-mesenchymal transition by Grainyhead-like-2. Cancer Res., 72(9):2440–2453, May 2012. 26. C. R. Goding and H. Arnheiter. MITF-the first 25 years. Genes Dev., 33(15-16):983–1007, 08 2019. 27. J. Yang, S. A. Mani, J. L. Donaher, S. Ramaswamy, R. A. Itzykson, C. Come, P. Savagner, I. Gitelman, A. Richardson, and R. A. Weinberg. Twist, a master regulator of morphogene- sis, plays an essential role in tumor metastasis. Cell, 117(7):927–939, Jun 2004. 28. J. T. George, M. K. Jolly, S. Xu, J. A. Somarelli, and H. Levine. Survival Outcomes in Cancer Patients Predicted by a Partial EMT Gene Expression Scoring Metric. Cancer Res., 77(22): 6415–6428, 11 2017. 29. S. Koplev, K. Lin, A. B. Dohlman, and A. Ma’ayan. Integration of pan-cancer transcriptomics with RPPA proteomics reveals mechanisms of epithelial-mesenchymal transition. PLoS Comput. Biol., 14(1):e1005911, 01 2018. 30. T. M. Malta, A. Sokolov, A. J. Gentles, T. Burzykowski, L. Poisson, J. N. Weinstein, and et al. Machine Learning Identifies Stemness Features Associated with Oncogenic Dedifferentia- tion. Cell, 173(2):338–354, 04 2018. 31. D. Jia, J. T. George, S. C. Tripathi, D. L. Kundnani, M. Lu, S. M. Hanash, J. N. Onuchic, M. K. Jolly, and H. Levine. Testing the gene expression classification of the EMT spectrum. Phys Biol, 16(2):025002, 01 2019. 32. H. Umeyama, M. Iwadate, and Y. H. Taguchi. TINAGL1 and B3GALNT1 are potential ther- apy target genes to suppress metastasis in non-small cell lung cancer. BMC Genomics, 15 Suppl 9:S2, 2014. 33. D. Sarrio, S. M. Rodriguez-Pinilla, D. Hardisson, A. Cano, G. Moreno-Bueno, and J. Pala- cios. Epithelial-mesenchymal transition in breast cancer relates to the basal-like phenotype. Cancer Res., 68(4):989–997, Feb 2008. 34. J. Carretero, T. Shimamura, K. Rikova, A. L. Jackson, M. D. Wilkerson, C. L. Borgman, M. S. Buttarazzi, B. A. Sanofsky, K. L. McNamara, K. A. Brandstetter, Z. E. Walton, T. L. Gu, J. C. Silva, K. Crosby, G. I. Shapiro, S. M. Maira, H. Ji, D. H. Castrillon, C. F. Kim, C. Garcia- Echeverria, N. Bardeesy, N. E. Sharpless, N. D. Hayes, W. Y. Kim, J. A. Engelman, and K. K. Wong. Integrative genomic and proteomic analyses identify targets for Lkb1-deficient metastatic lung tumors. Cancer Cell, 17(6):547–559, Jun 2010. 35. S. R. Alonso, L. Tracey, P. Ortiz, B. Perez-Gomez, J. Palacios, M. Pollan, J. Linares, S. Ser- rano, A. I. Saez-Castillo, L. Sanchez, R. Pajares, A. Sanchez-Aguilera, M. J. Artiga, M. A. Piris, and J. L. Rodriguez-Peralto. A high-throughput study in melanoma identifies epithelial- mesenchymal transition as a major determinant of metastasis. Cancer Res., 67(7):3450– 3460, Apr 2007. 36. W.-Y. Cheng, J. J. Kandel, D. J. Yamashiro, P. Canoll, and D. Anastassiou. A multi-cancer mesenchymal transition gene expression signature is associated with prolonged time to recurrence in glioblastoma. PLoS ONE, 7(4):e34705, April 2012. doi: 10.1371/journal.pone. 0034705. 37. J. L. Kaiser, C. L. Bland, and D. J. Klinke. Identifying causal networks linking cancer processes and anti-tumor immunity using Bayesian network inference and metagene con- structs. Biotechnol. Prog., 32(2):470–479, 03 2016. 38. S. Davidson, M. Efremova, A. Riedel, B. Mahata, J. Pramanik, J. Huuhtanen, G. Kar, R. Vento-Tormo, T. Hagai, X. Chen, M. A. Haniffa, J. D. Shields, and S. A. Teichmann. Single-cell RNA sequencing reveals a dynamic stromal niche within the evolving tumour microenvironment. bioRxiv, 2018. doi: 10.1101/467225. 39. E. Eisenberg and E. Y. Levanon. Human housekeeping genes, revisited. Trends Genet., 29

8 | bioRχiv Klinke et al. | Future Home Journal bioRxiv preprint doi: https://doi.org/10.1101/865139; this version posted December 13, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Figure Legends Figure 1 - Heatmaps for the expression of a subset of genes in the breast cancer arm of the TCGA study assayed using Illumina RNA-seq (A) and using Agilent microarray (B). Color bar shown at the bottom of the heatmaps indicates samples obtained from tumor tissue (black) versus matched normal tissue (yellow). The genes and samples are similarly ordered in both panels. Values were log2 normalized.

Figure 2 - RPPA measurements were used to determine a threshold for biologically significant changes in gene expression. The model for protein dependence on gene expression (A) where representative data (black circles) and model fits (dotted black line) are shown for CLDN7, AXL, JAG1, and CDH1 (B). (C) The distribution in threshold values calculated for all genes assayed by RNA-seq (black curve, n = 146) and by Affymetrix microarray (red curve, n = 149). The reported for transcript abundance units for RNA-seq corresponds to RPKM and intensity units (I.U.) for Affymetrix microarray. In (B), the vertical red dotted line indicates the threshold value and the melanoma and breast cancer cell lines are highlighted by red and blue circles.

Figure 3 - Data workflow for identifying epithelial/differentiated versus mesenchymal/de-differentiated state met- rics. Workflow contains three decision points: unsupervised feature extraction (FE)/feature selection (FS) based on PCA, a binary fibroblast filter, and a consistency filter based on Ridge logistic regression of annotated samples.

Figure 4 - Two opposing gene signatures were identified among the cohort of breast cancer cell lines. (A ) Scree plot of the percentage of variance explained by each principal component, where the dotted line corresponds to variance ex- plained by the null principal components. (B) Projection of the genes along PC1 and PC2 axes, where the font color corresponds to the mean read counts among cell lines (blue-yellow-red corresponds to high-medium-low read counts). (C) Projection of the genes along PC2 and PC3 axes, where the dotted lines enclose 95% of the null PCA distribution along the corresponding axis.

Figure 5 - The different subsets of breast cancer were clustered along a reciprocal epithelial to mesenchymal state axes. Log2 projections along the epithelial (SME) and mesenchymal (SMM ) state axes for each breast cancer cell line included in the CCLE (A) and primary breast cancer cells (B and C). Values for SME and SMM were estimated by bulk RNA-seq data for cell lines associated with the CCLE and by scRNA-seq data for primary tumor cells (21). (C) Log2 state projections are compared for primary breast cancer cells as originally reported and with dropout values imputed using the values averaged over the rest of the sample population, where grey lines connect the original state values to state values determine after imputation. Symbols were colored based on previously annotated breast cancer PAM50 subtypes: basal - red, claudin low - yellow, HER2 - pink, luminal A - blue, luminal B - black. In panel A, the metastatic potential of a subset of cell lines were annotated based on a recent study (20): low metastatic potential - grey circle, high metastatic potential - red circle. The dotted line corresponds to a reciprocal relationship between the SME and SMM state metrics (i.e., SME = 1 - SMM ).

Figure 6 - The samples from normal breast tissue and breast cancer were clustered separately along a reciprocal epithelial to mesenchymal state axes. Using EMT genes that passed the gene filter workflow, each sample contained within the breast cancer (BrCa) arm of the TCGA was projected along the epithelial (SME) versus mesenchymal (SMM ) state axes using the corresponding bulk RNA-seq data. Symbols were colored based on normal breast tissue (green) or clinical breast cancer subtype: ER/PR+ - blue, HER2 - pink, triple negative (TN) - red. The dotted line corresponds to a reciprocal relationship between the SME and SMM state metrics (i.e., SME = 1 - SMM ).

Figure 7 - Two opposing gene signatures were identified among the cohort of melanoma cell lines. (A) Scree plot of the percentage of variance explained by each principal component, where the dotted line corresponds to variance explained by the null principal components. (B) Projection of the genes along PC1 and PC2 axes, where the font color corresponds to the mean read counts among cell lines (blue-yellow-red corresponds to high-medium-low read counts). (C) Projection of the genes along PC2 and PC3 axes, where the dotted lines enclose 95% of the null PCA distribution along the corresponding axis.

Figure 8 - Melanoma cell lines and primary single melanoma cells are distributed along path between extremes in differentiation states. Projections along the terminally differentiated (SMT ) versus de-differentiated (SMD) state axes for each melanoma cell line included in the CCLE (A) and primary melanoma cells (B). Values for the terminally differentiated and de-differentiated state metrics were estimated by RNA-seq data for cell lines associated with the CCLE and by scRNA-seq data for primary melanoma cells. Symbols for primary melanoma cells were colored differently for each patient sample. The dotted line corresponds to a reciprocal relationship between the SMT and SMD state metrics (i.e., SMT = 1 - SMD).

Figure 9 - Gene expression patterns associated with benign melanocytic nevi and primary melanoma tissue samples are distributed along path between extremes in differentiation states. Projections along the terminally differ- entiated (SMT ) versus de-differentiated (SMD) state axes for 78 tissue samples obtained from common acquired melanocytic

Klinke et al. | Future Home Journal bioRχiv | 9 bioRxiv preprint doi: https://doi.org/10.1101/865139; this version posted December 13, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

nevi (n = 27, green circles) and primary melanoma (n = 51). The primary melanoma samples are colored based on the Breslow’s depth (blue: 0.1 mm to red: 10+ mm). The dotted line corresponds to a reciprocal relationship between the SMT and SMD state metrics (i.e., SMT = 1 - SMD).

Figure 10 - Venn diagram illustrating overlap in genes contained in the opposing state metrics for terminally differentiated/epithelial versus de-differentiated/mesenchymal extracted from breast cancer (blue circle) and melanoma (red circle) cell lines. The subset of the genes listed below the Venn diagram were annotated with transcrip- tion factor GO terms.

10 | bioRχiv Klinke et al. | Future Home Journal Klinke units and abundance melanoma transcript the for and value reported threshold The the 149). indicates line = dotted n threshold red curve, in vertical circles. (red distribution the blue microarray The (B), and Affymetrix (C) In red by microarray. (B). by CDH1 Affymetrix highlighted and for and are 146) (I.U.) JAG1, lines = units AXL, cell n intensity CLDN7, cancer curve, for and breast shown (black RPKM are to RNA-seq line) corresponds by black RNA-seq assayed (dotted expression. for genes gene fits all in model for changes and significant calculated circles) biologically (black values for data threshold representative a where determine (A) to expression used were measurements RPPA 2. Fig. Agilent using and (A) RNA-seq Illumina using assayed study TCGA the of arm normalized. log2 cancer were breast Values the panels. both in in genes ordered of similarly subset are samples a of expression the (B). microarray for Heatmaps 1. Fig. bioRxiv preprint tal. et

Low Genes A: Illumina uueHm Journal Home Future | oo a hw ttebto ftehamp niae ape bandfo uo ise(lc)vru ace omltsu ylo) h ee and genes The (yellow). tissue normal matched versus (black) tissue tumor from obtained samples indicates heatmaps the of bottom the at shown bar Color doi: High not certifiedbypeerreview)istheauthor/funder.Allrightsreserved.Noreuseallowedwithoutpermission. https://doi.org/10.1101/865139 Color Key (log2 TPM) RNA C A - Density seq Using fitted curve, threshold 0.0 0.2 0.4 0.6 corresponds to X to corresponds Protein Expression Model � Tissue Samples 10 Threshold Distribution � Transcript abundance Transcript - 2 � = 1 10 1 + RNA � � + Microarray (I.U.) = - seq seq (RPKM) RNA � 2 � 0 . 025 when: � 10 + 4 ; this versionpostedDecember13,2019. � 10 6 B RPPA RPPA 10 10 - - 2 2 0.1 1 10 10 10 1 0.1 0.1 1 10 10 10 1 0.1 RNA RNA - - B: Agilent Microarray CLDN7 seq JAG1 seq (RPKM) (RPKM) 2 2 10 10 3 3 10 10 4 4 RPPA RPPA 10 10 The copyrightholderforthispreprint(whichwas - Tissue Samples - 2 2 0.1 1 10 10 10 1 0.1 0.1 1 10 10 10 1 0.1 RNA RNA - - seq seq CDH1 AXL (RPKM) (RPKM) h oe o rti eedneo gene on dependence protein for model The 2 2 10 10 3 3 10 10 Low 4 4 High Color Key (Log2 (Log2 A.U.) v|11 | bioRχiv bioRxiv preprint doi: https://doi.org/10.1101/865139; this version posted December 13, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Synthetic null In vitro CCLE EMT Genes Data set Gene Expression (n = 780)

PCA-based Rejected Unsupervised Genes FE/FS Normal skin scRNAseq

Differentiated De-differentiated Gene Set Gene Set scran/SC3 nBrCa = 101 nBrCa = 128 workflow nMela = 26 nMela = 90

Normal Binary Rejected Fibroblast Fibroblast Genes Gene Set Filter (n = 2500)

Differentiated De-differentiated Gene Set Gene Set nBrCa = 100 nBrCa = 59 nMela = 25 nMela = 38

TCGA BRCA and Normal melanocytic nevi Ridge Rejected Primary melanoma Logistic Regression Genes RNAseq data set with Filter normal/tumor sample annotation

Differentiated De-differentiated Gene Set Gene Set nBrCa = 20 nBrCa = 19 nMela = 9 nMela = 18

Fig. 3. Data workflow for identifying epithelial/differentiated versus mesenchymal/de-differentiated state metrics. Workflow contains three decision points: unsuper- vised feature extraction (FE)/feature selection (FS) based on PCA, a binary fibroblast filter, and a consistency filter based on Ridge logistic regression of annotated samples.

12 | bioRχiv Klinke et al. | Future Home Journal bioRxiv preprint doi: https://doi.org/10.1101/865139; this version posted December 13, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

A B C Epithelial CDH1 70 2 60 50 40 30 Variance (%) Variance (%) Variance 20 Principal Component 2 Component Principal Principal Component Component Principal 10

0 VIM Mesenchymal 1 2 3 4 5 6 7 8 9 10 Principal Component Principal Component 1 Principal Component 3 Principal Component

Fig. 4. Two opposing gene signatures were identified among the cohort of breast cancer cell lines. (A) Scree plot of the percentage of variance explained by each principal component, where the dotted line corresponds to variance explained by the null principal components. (B) Projection of the genes along PC1 and PC2 axes, where the font color corresponds to the mean read counts among cell lines (blue-yellow-red corresponds to high-medium-low read counts). (C) Projection of the genes along PC2 and PC3 axes, where the dotted lines enclose 95% of the null PCA distribution along the corresponding axis. .

Klinke et al. | Future Home Journal bioRχiv | 13 bioRxiv preprint doi: https://doi.org/10.1101/865139; this version posted December 13, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

EFM192A MDAMD175VII A UACC893BT483 ZR751 SKBR3 HCC1954 MDAMB361 AU565 HCC1428 MDAMD415 HCC2218 BT474 MDAMD468 ZR7530 UACC812 HCC1937 HCC38 HCC1419 T47D BT20 HCC70 HCC202 KPL1 HCC1500 CAL851HCC1143 CAL148 CAMA1 HCC1806 MCF7 EFM19 MDAMB453 HCC1599 MDAMB134VI HDQP1 HMEL HCC2157 JIMT1 HCC1187

DU4475 HCC1569

CAL51 0 )) E SM ( − 1 log2(BrCa_EMT$Epithelial + 1e − 04)

2 log2(BrCa_EMT$MesenchymalMDAMB231 + 1e−04) CAL120 MDAMB436 HCC1395

MDAMB157

− 2 Metastatic Potential BT549 Low HMC18 HS274T High HS578T HS739T − 3 Epithelial Signal HS606T Basal HS281T Claudin Low HS343T HER2 HS742T

− 4 Luminal A

Epithelial Signal (log Signal Epithelial Luminal B

−5 −4 −3 −2 −1 0 Mesenchymal Signal (log (SM )) B Mesenchymal Signal 2 M 0 )) E ● 1 ●●● ● ● ● ● ●●●●●●●●●●●●●● − ●● ●●●●●●●●●●●●● ● SM ● ● ●● ●● ● ●● ●● ●●●● ● ( ● ● ● ● ●●● ● ●● ●●●●●● ●● 2 ● ● ● ● ● ●● ● ● ●● ● ● 2 ● ● ● − ● ●●

● BC01 3

− BC02 ● ● BC03 BC03LN

4 ● BC04 − Epithelial Signal BC05 BC06 ● BC07 5

− BC07LN BC08 BC10 Epithelial Signal (log Signal Epithelial 6 BC11 −

−8 −6 −4 −2 0 Mesenchymal Signal (log (SM )) C Mesenchymal Signal 2 M BC01BC01 BC05BC05 BC11BC11 0 0 0

●●●●●● ●●●

● ● ● − 1 − 1

● − 1 ●● ●● ● ● ● ●●● ● ● ● ● ● ● − 2 − 2 − 2 − 3 − 3 − 3 − 4 − 4 − 4 Epithelial Signal Epithelial Signal Epithelial Signal − 5 − 5 − 5 Epithelial Signal Epithelial Signal Epithelial Signal Epithelial − 6 − 6 − 6

−8 −6 −4 −2 0 −8 −6 −4 −2 0 −8 −6 −4 −2 0 MesenchymalMesenchymal Signal Signal MesenchymalMesenchymal Signal Signal MesenchymalMesenchymal Signal Signal

Fig. 5. The different subsets of breast cancer were clustered along a reciprocal epithelial to mesenchymal axes. Log2 projections along the epithelial (SME ) and mesenchymal (SMM ) state axes for each breast cancer cell line included in the CCLE (A) and primary breast cancer cells (B and C). Values for SME and SMM were estimated by bulk RNA-seq data for cell lines associated with the CCLE and by scRNA-seq data for primary tumor cells (21). (C) Log2 state projections are compared for primary breast cancer cells as originally reported and with dropout values imputed using the values averaged over the rest of the sample population, where grey lines connect the original state values to state values determine after imputation. Symbols were colored based on previously annotated breast cancer PAM50 subtypes: basal - red, claudin low - yellow, HER2 - pink, luminal A - blue, luminal B - black. In panel A, the metastatic potential of a subset of cell lines were annotated based on a recent study (20): low metastatic potential - grey circle, high metastatic potential - red circle. The dotted line corresponds to a reciprocal relationship between the SME and SMM state metrics (i.e., SME = 1 - SMM ).

14 | bioRχiv Klinke et al. | Future Home Journal bioRxiv preprint doi: https://doi.org/10.1101/865139; this version posted December 13, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. 1.0 0.8 ) E ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ●● ●● ● ● ● ● ● ● ● ● ● ●

(SM ● ● ● ●● ● ● ● ● ● ● ● ● ● ●● ● ● ● ●● ● ● ● ● ●● ●● ● ● ● ● ● ● ●● ●● ●● ● ● ● ● ● ● ●● ● ● ● ● 0.6 ● ● ● ● ● ● ● ● ●● ● ●● ● ● ● ●● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ●● ●●●●●● ● ●● ● ● ●● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●●●● ●● ● ● ●● ● ● ● ● ●● ●● ● ● ● ●● ● ● ● ● ● ● ●●● ● ● ●● ●●●●● ● ● ● ● ● ● ● ● ●● ● ● ●●● ● ●●● ● ● ● ● ● ● ● ●●●● ● ● ● ● ●● ●● ● ●● ● ● ● ● ● ● ● ●● ●● ●●● ● ● ●● ● ● ● ●● ● ●●● ● ● ● ● ● ● ●● ● ●● ●● ●● ● ● ●● ● ●● ●● ● ● ●● ● ● ●●●●● ● ● ● ●● ●● ● ● ● ● ● ● ●● ● ● ●●● ● ●● ●● ● ● ● ● ● ●● ● ● ● ● ●● ●● ● ● ●●● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ●● ● ●● ● ●● ●● ● ●● ● ● ● ●● ● ●● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ●● ●● ● ● ● ● ●● ● ●● ● ● ● ● ● ● ● ●● ●● ● ● ● ●● ● ● ● ● ●● ● ● ● ●●● ● ● ● ● ● ●● ●● ● ● ● ●● ● ● ● ● ●●● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ●● ● ● ● ●● ● ● ● ● ● ● 0.4 ● ● ●● ● ● ● ● ● ●● ● ● ● ● Epithelial Signal ●● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● Epithelial Signal Signal Epithelial

0.2 Normal ER/PR HER2

0.0 TN

0.0 0.2 0.4 0.6 0.8 1.0

MesenchymalMesenchymal Signal Signal (SMM)

Fig. 6. The samples from normal breast tissue and breast cancer were clustered separately along a reciprocal epithelial to mesenchymal state axes. Using EMT genes that passed the gene filter workflow, each sample contained within the breast cancer (BrCa) arm of the TCGA was projected along the epithelial (SME ) versus mesenchymal (SMM ) state axes using the corresponding bulk RNA-seq data. Symbols were colored based on normal breast tissue (green) or clinical breast cancer subtype: ER/PR+ - blue, HER2 - pink, triple negative (TN) - red. The dotted line corresponds to a reciprocal relationship between the SME and SMM state metrics (i.e., SME = 1 - SMM ).

A B C Differentiated EN2 EN2

80 CEACAM1 EDNRBTMC6FOXD3 CEACAM1 20 ESRP1 FOXD3 EDNRB TMC6 FXYD3 CKMT1AALDH3B2 CDH1 20 ESRP1 LEF1 ARAP2TUBBP5 FXYD3 CKMT1AALDH3B2 ERBB3 MTUS1MYH14 CCL3 ARHGAP8 LEF1ARAP2 TUBBP5 MITF DLL3 HPGDB3GAT1SP5OVOL2 ERBB3CCL3MYH14MTUS1 ARHGAP8 EOMES SP5 DLL3MITFHPGDB3GAT1 LLGL2CGNCTLA4 C4orf19 SPRR3 GPR56CTSH GALNT3PLCG2VAV3BIKBMP7DHH LLGL2GPR56CGNBIK CTLA4 HOXC13EFNA1 TMPRSS2PLA2G7 VAV3PLCG2BMP7CTSH GALNT3 10 CDCA7NRCAMKYNUPLS1SORL1ICA1TTC39A WNT6KLRC3FA2HSPRR1BCYP4B1 EFNA1PLA2G7HOXC13 CDK2 FBXO41 CDH3CA2 10 KYNU CDCA7NRCAMPLS1 SORL1 TTC39A 60 UPP1 PRKCZPIK3C2BCDKN1CKLK6LAD1 TUBA8 F11R KLRC3 FBXO41ICA1 CA2CDH3CDK2 HES1ZNF608BCL2L11OAS1SPINT1MAP7CDH1HIST1H4C● DSC2SAA2S100A8SHH SOX1 LAD1BCL2L11UPP1PRKCZCDKN1CKLK6PIK3C2B SCDSAT1AP1S2TNFRSF19MXI1FBXO32RAB11FIP1BLMTSPAN15DSG2TGFAEPS8L1MAOAMYBPTK6CXCR4EVPLSCELBSPRYDSC3CYP4F3CBLCHNF1APAK6 SAA2 TSPAN15DSG2RAB11FIP1TNFRSF19ZNF608DSC2OAS1HES1EPS8L1PTK6SPINT1HIST1H4CMAP7 CDH1● CDC45SLC9A3R1SYNGR2SERINC5CXCL1FABP5EXO1 DAPK1CLDN1SLC37A1JAG2LCN2CDS1MYO5CFOXA1HIST1H4B WNT1 TGFA MAOACXCR4FBXO32SCDAP1S2BLMMXI1SAT1MYB SDCBPMCM7GGCTCDC6POLR3KRFC2RFC3ACSL1RFC4MYCPKMYT1RECQL4RFC5PRIM1DSCC1CITESPL1BRCA1DTLMAP3K1ETV1RAB20RAC3STAP2LSRWNT10BDTX4EGFRBM47ZNF165S100PEHFC1orf116NOX4WNT4MPZL2WNT10ATMPRSS4TOX3BLNK FMO1 LCN2 CXCL1CLDN1MYO5CSYNGR2MAP3K1CDC45LSRCDS1GGCTPRIM1JAG2EXO1RFC3WNT10BSLC9A3R1SERINC5RBM47ACSL1FABP5MYCSLC37A1DAPK1 CTSDCTNNB1HSPE1SLC3A2TKTSDF2L1UBE2TRRM2IFI16MCM3MCM4BRIX1MLF1IPMCM5CDCA5GMNNYRDCASF1BMCM2E2F1ORC6TSPAN13POPDC3CENPANUP155KNTC1ERMP1CDC7RBL1RELTHELLSHIST1H4JALG6VAMP8LPAR6ELMO3PARD6APDGFDITGB4ARHGEF18SLPITNK1GPX2AGR2DENND2DSCNN1ARAB25B3GNT3DLL4GRHL2MMP13OR7E14PCCR2UGT1A1MYCBPFGF4 EHF ITGB4STAP2HIST1H4JS100PAGR2RECQL4NUP155ETV1HSPE1CDC7RFC5MCM7POLR3KRAC3ALG6E2F1RFC4EGFDSCC1PKMYT1RFC2ESPL1TOX3BRCA1MCM3CDC6GMNNRELTERMP1DTLCITCTNNB1DTX4LPAR6RAB20SDCBPPDGFDARHGEF18NOX4DENND2D PSAPTYMSPCNAARPC1BUBE2SPHLDA1SRPXCCNB2ZWINTENOPH1CDK1SH2B3CKS1BMCM6CKS1BTMPOBUB1MAD2L1DHFRRBM14YEATS4UCK2LETM1MSH2POLR3ESKA3DDR1RAD51ZNF367POLD3STILMFHAS1VLDLRAGAP1PCPPFIBP2PRR4RAD51BANXA9MAFBTSPAN1S100A14EPCAMVDAC1P2ATP1A3C1orf106CNKSR1ST14MMP7ESRP2FGFBP1PRSS8ACTA1ELF3EXPH5ITGB6ALOX5 F11RCYP2A13 C1orf106B3GNT3TSPAN13TNK1PHLDA1POPDC3VAMP8SLPIELMO3EPCAMBRIX1CKS1BGPX2MAD2L1CENPAIFI16CKS1BCDCA5MCM2MLF1IPPARD6ASDF2L1SLC3A2YRDCUBE2TSRPXRBL1ASF1BSKA3ANXA9KNTC1BUB1ZWINTRRM2TKTMCM4HELLSMCM6PCNADHFRRBM14PCMSH2TYMSMCM5RAD51BORC6CTSD ENO1 HNRNPABPGK1RPS19BP1TRIM28CCNB1CCT4EIF4BTUBG1SND1TIMM10H2AFYRNF145PRC1SCPEP1STOMPLK1TACC3NEAT1CENPWKIF23MKI67PLEKHO2ETS1MAPK1CREG1ECT2HMMRKIF11SMURF1SOAT1UHRF1PHF19NDC80NFKB1SEMA3BEIF1ADMETGALK1SRXN1FBXO5STEAP1NOTCH1PLK4CRY1EIF5A2CRLF3MAP2K5C1orf74ADAPYCARDBARD1PTENLRRC8CMMEADSSL1JUPRTEL1MSX2SFNSH2D3ACORTTUBA3DEPN3NOTCH4IHHPOF1BWNT7A ZBTB16 PRR4MMP7DDR1SFNRNF145RPS19BP1TSPAN1SH2B3PHF19TRIM28HMMRPOLR3EUBE2SAGAP1HNRNPABSRXN1TIMM10EIF1ADCCNB2MAP2K5ENOPH1YEATS4C1orf74CDK1FBXO5SND1PLK4CCT4ARPC1BH2AFYTACC3PLK1LETM1PGK1CRLF3POLD3TMPOSTILADARTEL1ZNF367CREG1MFHAS1UCK2ADSSL1RAD51PSAPVLDLRPPFIBP2MET MSX2ATP1A3MAFBPYCARD 0 CD63FKBP1AUBA52RANSLC20A1TOP2ASTMN1SRMCCNG1SH3KBP1NFE2L2INSIG1ARPC5LIDH2BTG1ASPMTCF3MYO1DSMAD2PARD3MGST2RGS2TUBB3RNASET2KLF5 MMP25 PDZK1IP1 CCL21 SEMA3B SH2D3ANFKB1MKI67NDC80CCNB1KIF11JUPCENPWENO1UHRF1SMURF1ECT2EIF4BTUBG1BARD1MAPK1GALK1SCPEP1PRC1KIF23ETS1EIF5A2PLEKHO2CRY1SOAT1LRRC8CSTEAP1PTEN MME RPLP1RPS27ATUBB PMP22RPL37MTHFD2XBP1TFDP1AKT1CCNA2H2AFXUBE2HMFAP1GLYR1CEP55CD46MKNK2TGFBR1RELAMAPK14AKT2TOB1DCKUSP15SMAD4JAG1SOD2NCK1ORAOV1SSH3MANSC1TGFBR3GLI3JAG1KRT8IRF7SH3YL1TACSTD2ANK3OCLNAP1M2IRF6ATP2C2EPHA1LY75VGLL1WNT8B WNT9B 0 SLC20A1GLI3AP1M2SH3KBP1NOTCH1PARD3TACSTD2RGS2MMP25KLF5MGST2RPL37BTG1ARPC5LNFE2L2ASPMFKBP1ACEP55TOP2AMKNK2RPLP1STMN1CCNG1INSIG1UBA52MFAP1SMAD2TCF3NCK1TUBB3TFDP1NEAT1AKT2SRMRNASET2RANDCKTOB1IDH2STOMCD63 MYO1D EEF2TUBB4BRAB1AMMP14CDK2AP1NFE2L1CD164PSME1ANLNODC1GLUD1CD68SNAI2PERPCOMTCEP170STAT3IFNGR1AKT3SDC1TNFAIP3CENPESACSPIK3CAPNRC1MAPK7TJP2MAPK8IFI44RABGAP1LSMAD1MSX1TNFSF13TGFB3PCDH9AXIN2SOBPCLDN4CXCL3AQP3PITX2PDGFBGLI1CEACAM6SNAI3MST1RKRT13GRB7TP63TRIM29GDAMMP9FGG WNT3A TP63ANK3GDAIFI44KRT8CENPEIRF7PSME1RPS27AUBE2HOCLNANLNPCDH9H2AFXPMP22RABGAP1LSMAD1MAPK8STAT3MTHFD2CDK2AP1CCNA2SOD2SDC1SACSIRF6SSH3ORAOV1MAPK14RELATUBB4BUSP15SMAD4CD46GLYR1MANSC1COMTGLUD1ODC1EEF2TUBBTGFB3XBP1SH3YL1AKT1CD164TGFBR1SOBPJAG1TNFAIP3JAG1SNAI2TGFBR3AXIN2 PFN1CTSBCTSZHLAMSNYWHAHCAPZBTMEM59TNCPLXNB2−CAPN1CD9EAKAP12BNIP3LASNSNOTCH2DENND5ASTK17AMBOAT7KLF6ANXA4NFKBIAENC1ATP1B1PPARDTMEM158CPT1ADIAPH3TWIST1C1orf63UGGT2MYBL1HIST2H2BEZCCHC6STAMBPL1PLXNB1BCL6HMGA2GAS1MYO6TP53INP1ATF3ANK2WNT3AREGPTPN22C1QTNF3DLL1TMEFF1FGFR2FBP1TRIM31TUBA3CIL1RN PDGFB MMP9TNCAREGAQP3TUBA3CCLDN4ATP1B1HMGA2MST1RCXCL3STK17AGLI1BCL6PLXNB1MBOAT7HLAZCCHC6PNRC1ASNSTNFSF13AKAP12ATF3C1QTNF3TMEM59PERPDENND5ACEP170TJP2PIK3CACD9DLL1UGGT2NOTCH2NFE2L1DIAPH3MAPK7CAPN1MMP14IFNGR1CAPZBAKT3−ANXA4MYO6RAB1APPARDCTSZEYWHAHCD68TWIST1MSX1ANK2PITX2GAS1 SERPINE2RHOACALM3FOSL1RRAGALEPRE1FOSILKKRT18DAB2SMAD3HIST1H2ACERBB2FZD1CYB561DLG5CAMK2N1IFIT3NFIL3SPA17BAG2FJX1SNAI1CHN1MAP3K5C7orf10ARHGEF5BIRC3RHOUPKP3KRT15PPLANXA8L2ADAP1 EPYC KRT15KRT18SERPINE2TMEM158PTPN22PKP3CAMK2N1FOSL1ENC1SNAI1ARHGEF5HIST2H2BESTAMBPL1NFKBIARHOULEPRE1MAP3K5CPT1AERBB2FZD1DLG5C1orf63PFN1SPA17RRAGAPLXNB2MYBL1CALM3BNIP3LMSNWNT3RHOAFGFR2C7orf10TP53INP1NFIL3KLF6CTSBBAG2FOS B2MTUBA1AEMP3HIF1ARHOBITGAVVEGFASMARCA1ERRFI1SMTNMXRA7FOSL2FZD7DSTRUNX1TSC22D3TNFRSF10APLEK2AIM1SPINT2TUBA4ASTRA6WNT2BCLDN7WNT16FGF18CXCL5 WNT8A STRA6 TNFRSF10AERRFI1BIRC3ANXA8L2RUNX1PLEK2CHN1SMAD3CYB561ADAP1SMTNHIST1H2ACPPLIFIT3CLDN7SMARCA1FJX1B2MHIF1AITGAVTUBA1ASPINT2CXCL5RHOBFZD7EMP3FOSL2ILKMXRA7DAB2 40 ● MT2ALAMB2PPIC ARHGDIBGEMBCL3GFPT2GBP2HNMTRARRES1AKAP2 EDN2 AIM1VEGFAGBP2TUBA4AWNT16WNT2BDSTPPICTSC22D3HNMT TUBB6STAT1TGFBR2EGR1PROCRCTSKKDELC1THBS3TRPC1BMP4 WNT7B PC2 ARHGDIBBCL3MT2ARARRES1GFPT2LAMB2● AKAP2 LGALS1VIMTIMP3TNFRSF12AAPLP2EMP1MVPCDH2FHL2LHFPFERMT2ID2SERPINF1TNFAIP6SOX9 WNT11 WNT7B TNFRSF12ACDH2KDELC1PROCRTGFBR2LGALS1GEMTRPC1MVPTIMP3BMP4EMP1SOX9APLP2THBS3FHL2TUBB6LHFPVIMEGR1STAT1CTSKID2SERPINF1 SPARC CLIC4HTRA1PLAUREPAS1FOSBRAB31 BMP2SEPP1COL10A1MMP12IL20RACX3CR1 BMP2PLAURTNFAIP6SEPP1RAB31SPARCFERMT2FOSBCOL10A1HTRA1WNT11 ITGB1IFITM3CALD1JUNTGFB1I1SDC2DDR2LUMPRKCAPDGFANUAK1DSPPTGS2MMP11 VNN1 PDGFAVNN1 IFITM3PRKCAPTGS2MMP11TGFB1I1JUNITGB1CLIC4SDC2MMP12DDR2CALD1EPAS1 LUM ENGSEMA3CCYBRD1TGFB1AEBP1FGFR1RECKF3 COL11A1TGFB2MAPK13WNT9A CDH10 SEMA3CTGFB2TGFB1NUAK1MAPK13F3 COL11A1CYBRD1ENGAEBP1RECK DSP PCA2.data[, 2] RRASCOPZ2KRT19 WNT9ACOPZ2 FGFR1KRT19 Variance (%) Variance GLI2IL11 KCNK1SHANK2 RRASKCNK1 SHANK2 − 10 MAP1B IL11 GLI2 CITED2 OLFML2B − 10 MAP1B CITED2 FN1 FSTL1ITGA5THBS2 EPS8L2CRISPLD2ABCC3MMP3KRT16 HGFSNCAIP ABCC3MMP3KRT16CRISPLD2ITGA5THBS2EPS8L2FN1 OLFML2BFSTL1HGF PTRFPCOLCEFAPC1SCLUKRT7CFH GADD45GROS1 PCOLCEPTRFCLUKRT7C1SFAPGADD45GCFH IGFBP3NT5EIFITM2RCN3FHL1WNT5AFGF2ZEB1PDGFCNOTCH3TFPIEGFRFGF1GLT8D2RHOD DESADH1C NT5EWNT5APDGFCTFPIIGFBP3FGF2EGFRZEB1IFITM2FGF1RCN3GLT8D2RHODNOTCH3FHL1 DES Variance (%) COL6A1COL5A2FSTADAM12PLAUKRT14IL1R1 LGR5 FSTKRT14COL5A2IL1R1ADAM12COL6A1PLAU LGR5 TPM2SPOCK1ACTA2PRRX1 TCF4MALLEDNRACFB GSC TCF4PRRX1TPM2SPOCK1ACTA2CFB MALLEDNRA GSC LOXL2MMP2 INHBALRRC15PTGS1 LOXL2LRRC15INHBAMMP2PTGS1 20 S100A4 VCAN S100A4 VCANVEGFCWNT5B VIM TGFBI AXL SERPINB2VEGFCANKRD1WNT5BCYP1B1NID2PAPPA AXLPAPPATGFBISERPINB2ANKRD1CYP1B1NID2 COL6A2 PDGFRAWISP1 CCL2 COL6A2WISP1 PDGFRASFRP4 CCL2 SFRP4 − 20 − 20 FBN1 GJA1 FBN1GJA1 LOX TNXB TNXBLOX Principal Component 2 Component Principal MYL9 MXRA5 PDGFRBMXRA5MYL9 ASPN ITGBL1 PDGFRBITGBL1FOXC2 ASPN 2 Component Principal FOXC2 COL6A3 De-differentiated COL5A1COL6A3SULF1DCN SERPINE1COL5A1SULF1 DCN SERPINE1 BGN TWIST2 BGNTHY1WNT2 TWIST2 THY1 CDH11 WNT2 CDH11 COL3A1POSTN MFAP5 POSTNCXCL12COL3A1MFAP5 CXCL12 COL1A1 − 30 0 COL1A1 − 30 COMP COMPGREM1 1 2 3 4 5 6 7 8 9 GREM1 −50 0 50 100 −20 −10 0 10 20 Principal Component Principal Component Principal−1 * PCA2.data[,Component 1] 1 PrincipalPC3 Component 3

Fig. 7. Two opposing gene signatures were identified among the cohort of melanoma cell lines. (A) Scree plot of the percentage of variance explained by each principal component, where the dotted line corresponds to variance explained by the null principal components. (B) Projection of the genes along PC1 and PC2 axes, where the font color corresponds to the mean read counts among cell lines (blue-yellow-red corresponds to high-medium-low read counts). (C) Projection of the genes along PC2 and PC3 axes, where the dotted lines enclose 95% of the null PCA distribution along the corresponding axis.

Klinke et al. | Future Home Journal bioRχiv | 15 bioRxiv preprint doi: https://doi.org/10.1101/865139; this version posted December 13, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

UACC257 SKMEL5 K029AX COLO829 WM2664 COLO783 MALME3M A COLO792 COLO679 SKMEL3 HS936T IGR1 UACC62 A101D SH4 SKMEL28 HS944T HS939T 0 MELHO WM983B WM1799 IGR37 A375 HT144 G361 SKMEL31 MEWO RVH421 C32

)) COLO741 SKMEL1 A2058 HS294T T IPC298 MELJUSO COLO800 HS852T SKMEL24 SKMEL30 WM88 HS695T WM115 HMCB WM793 MDAMB435S

SM CJM IGR39

( -1 2

LOXIMVI

-2 RPMI7951

HS940T -3 HS600T

HS834T HS895T

HS839T -4 HS934T

HS688AT

Differentiated Signal (log Signal Differentiated -5 -4 -3 -2 -1 0 De-differentiated Signal (log2(SMD)) B 00

● ● ● ● ● ● ●● ●●● ● ● ●● ● )) ●●●● ●● ● ● ● ● ● ●●●●●●●●●●● ●●●●●●● ● T ● ● ●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●● ● ●●●●●●●●●●●●●●●●●●●● ● ● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ● ●● ● ●●●● ●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ● ● ●● ● ●●●●●●● ●●●●●●●●●●●●● ● ● ●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ● ● ●●●●● ●●●●●●●●●●● ● ●●●● − 2 ● ●●●●●● ●●● -2 ● ● ●●● ● ● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●● ● SM ●●● ● ● ●● ●●●● ● ● ● ● ● ● ● ●●●● ●●●●●●●●●●●●●●●●●●●●●● ●● ● ( ● ● ●● ●●● ● ●●●●●●● ●●● ●●● ● ●●●●●● ●●●●●● ●● 2 ● ●● ● ●●● ●●●●●●●●●●●●●●●● ●●●●●●●●●●● ● ● ●● ● ●●● ●●● ●●● ● ● ● ● ● ●●●●● ● ● ●●●●●● ● ● ●●●●● ● ● ● ●●●●●●●●●● ●● ●● ●●●●●● ● ● ● ●● ● ● ●● ●● ● ●● ● ● ●● ●● ● ● ● ● Mel102 ● ● ● ● ● ● ● ● ● ●● ● ● ●●● ●●● ● ● ● ● ● ●● ● ● ● ● ● Mel103 ● ● ● ● ●

− 4 ● -4 Mel105 ●● ●● ● ● ● ● ● Mel106 ● ●● ●●● ●●● ● Mel110 ● ● ● ● ● Mel112 ● ● ● ● Mel121.1 ● Mel128 Mel129pa ● ● -6− 6 ● Mel194 ● Differentiated Signal Differentiated Mel53 ● Mel60 Mel71 ● Mel78 ● Mel79 Mel80 -8− 8 Mel81 Mel82 ● Mel84 Mel88 Differentiated Signal (log Signal Differentiated ● Mel89 ● Mel94 ● ● ●● ●●● ●● ●● ●●● ● Mel98

-10− 10

-10−10 -8−8 -−66 -−44 -−22 00 De-differentiatedDe−differentiated Signal Signal (log2(SMD))

Fig. 8. Melanoma cell lines and primary single melanoma cells are distributed along path between extremes in differentiation states. Projections along the terminally differentiated (SMT ) versus de-differentiated (SMD ) state axes for each melanoma cell line included in the CCLE (A) and primary melanoma cells (B). Values for the terminally differentiated and de-differentiated state metrics were estimated by RNA-seq data for cell lines associated with the CCLE and by scRNA-seq data for primary melanoma cells. Symbols for primary melanoma cells were colored differently for each patient sample. The dotted line corresponds to a reciprocal relationship between the SMT and SMD state metrics (i.e., SMT = 1 - SMD ).

16 | bioRχiv Klinke et al. | Future Home Journal bioRxiv preprint doi: https://doi.org/10.1101/865139; this version posted December 13, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Common acquired nevi:

) Breslow’s Depth (mm) T 0.1 1.0 10+ Signal(SM Differentiated Differentiated

De-differentiated Signal (SMD)

Fig. 9. Gene expression patterns associated with benign melanocytic nevi and primary melanoma tissue samples are distributed along path between extremes in differentiation states. Projections along the terminally differentiated (SMT ) versus de-differentiated (SMD ) state axes for 78 tissue samples obtained from common acquired melanocytic nevi (n = 27, green circles) and primary melanoma (n = 51). The primary melanoma samples are colored based on the Breslow’s depth (blue: 0.1 mm to red: 10+ mm). The dotted line corresponds to a reciprocal relationship between the SMT and SMD state metrics (i.e., SMT = 1 - SMD ).

De-differentiated/Mesenchymal Terminally Differentiated/Epithelial

11 27 15 63 87 65

Shared Breast Cancer Melanoma Breast Cancer FOXC2 AEBP1 FOSL1 EN2 ELF3 EHF PRRX1 GLI2 HMGA2 FOXD3 FOXA1 GRHL2 TCF4 PITX2 TWIST1 LEF1 HOXC13 IRF6 TWIST2 MITF MSX2 MYB ZEB1 SP5 OVOL2

Fig. 10. Venn diagram illustrating overlap in genes contained in the opposing state metrics for terminally differentiated/epithelial versus de- differentiated/mesenchymal extracted from breast cancer (blue circle) and melanoma (red circle) cell lines. The subset of the genes listed below the Venn diagram were annotated with transcription factor GO terms.

Klinke et al. | Future Home Journal bioRχiv | 17 bioRxiv preprint doi: https://doi.org/10.1101/865139; this version posted December 13, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Supplemental Table S1: List of genes and associated Ki values for state metrics developed separately for breast cancer and melanoma cell lines based on CCLE gene expression. Genes that overlap with the fibroblast gene list are highlighted in yellow.

Breast Cancer Cell Lines Melanoma Cell Lines Epithelial Signature Mesenchymal Signature Differentiated Signature Dedifferentiated Signature Ki (log2 Ki (log2 Ki (log2 Ki (log2 Ki (log2 Ki (log2 Ki (log2 GENE_SYMBOL GENE_SYMBOL GENE_SYMBOL GENE_SYMBOL GENE_SYMBOL GENE_SYMBOL GENE_SYMBOL RPKM) RPKM) RPKM) RPKM) RPKM) RPKM) RPKM) AGR2 4.303 MSX2 0.179 ACTA2 4.645 LRRC15 -1.540 ALDH3B2 -5.189 ACTA2 3.745 PTGS1 -0.242 ALDH3B2 0.520 MYB 1.005 ADAM12 1.213 LUM 2.397 ARAP2 -4.050 ADAM12 2.139 PTRF 6.629 ANK3 1.913 MYH14 0.824 AEBP1 1.713 MAP1B 2.978 B3GAT1 -4.604 ANKRD1 1.628 RCN3 4.772 ANXA9 0.578 MYO5C 0.331 AKAP12 2.053 MFAP5 2.074 BIK -2.759 ASPN -3.222 RHOD -0.186 AP1M2 2.493 OCLN 1.859 AKAP2 3.118 MME 1.766 CCL3 -5.000 BGN 2.914 S100A4 5.840 ARAP2 -0.552 OR7E14P -2.318 AKT3 0.416 MMP14 3.304 CEACAM1 -1.634 C1S 3.186 SERPINB2 2.400 ARHGAP8 1.490 OVOL2 -1.155 ANK2 0.601 MMP2 3.829 CGN -2.178 CDH11 0.196 SERPINE1 5.626 ATP2C2 0.658 PDGFB -0.006 ANKRD1 1.447 MMP3 -0.991 CKMT1A -3.799 CFB -2.397 SFRP4 -2.295 B3GAT1 -4.481 PKP3 3.550 ASPN -2.623 MXRA7 5.608 CTLA4 -3.172 CFH 0.952 SPOCK1 3.602 BIK -0.863 POF1B -1.516 AXL 2.334 MYL9 5.899 DLL3 -0.614 CLU 2.266 SULF1 1.976 BLNK -1.428 PPL 1.218 BAG2 3.569 NID2 2.901 EDNRB 0.359 COL1A1 5.727 TCF4 0.102 BMP7 0.620 PRSS8 1.555 BGN 3.131 OLFML2B 1.328 EN2 -2.698 COL3A1 3.561 TFPI 1.789 BSPRY -0.611 PTK6 0.001 C1S 3.335 PAPPA -0.115 ERBB3 1.506 COL5A1 3.086 TGFBI 7.149 C1orf106 0.560 RAB25 2.438 C7orf10 1.725 PCOLCE 5.904 ESRP1 -1.928 COL5A2 4.597 THBS2 4.502 C4orf19 -0.366 RBM47 2.246 CALD1 6.281 PDGFC 3.280 FOXD3 -2.372 COL6A1 6.709 THY1 2.529 CDH1 2.795 S100A14 4.046 CCL2 1.510 PDGFRA -0.202 FXYD3 -0.741 COL6A2 5.868 TNXB -0.760 CDS1 1.006 SCNN1A 1.824 CD68 3.038 PDGFRB 0.881 GPR56 2.428 COL6A3 2.436 TPM2 5.480 CEACAM1 -0.966 SEPP1 0.991 CDH11 1.460 PHLDA1 4.056 HPGD -3.570 COMP -1.689 TWIST2 -0.122 CEACAM6 1.792 SH2D3A 1.585 CDH2 2.593 PITX2 -0.019 LEF1 0.291 CXCL12 -0.942 VCAN 3.364 CGN 1.288 SHANK2 2.091 CFH 0.413 PLAUR 4.775 LLGL2 -2.142 CYP1B1 0.958 VEGFC 2.386 CKMT1A 0.521 SLC37A1 1.335 CLIC4 6.784 PMP22 4.088 MITF 2.097 DCN 1.695 WISP1 -1.139 CLDN4 3.878 SORL1 0.433 COL1A1 7.342 POSTN 2.347 MTUS1 -1.067 DES -3.426 WNT2 -4.060 CLDN7 3.118 SPINT1 2.939 COL3A1 4.258 PROCR 2.922 MYH14 -1.413 EDNRA -2.070 WNT5A 2.285 CNKSR1 1.476 SPINT2 5.621 COL5A1 4.259 PRRX1 1.135 SP5 -6.085 EGFR 1.402 WNT5B 1.010 CX3CR1 -4.771 ST14 1.517 COL5A2 3.450 RCN3 4.002 TMC6 -0.954 EPS8L2 0.890 ZEB1 1.984 CXCR4 0.197 TMC6 2.345 COL6A1 4.820 RECK 1.117 TUBBP5 -4.963 FAP 3.289 CYP4B1 -2.086 TMPRSS2 -0.925 COL6A2 4.411 S100A4 6.135 FBN1 3.869 DENND2D 1.285 TSPAN1 2.975 COL6A3 3.274 SACS 2.466 FGF1 0.675 DSC2 -0.117 TSPAN15 2.498 COMP -2.248 SDC2 5.076 FGF2 2.188 EDN2 -0.828 TTC39A 1.812 COPZ2 3.117 SERPINB2 1.275 FHL1 3.489 EFNA1 2.368 TUBBP5 -2.023 CTSB 8.252 SERPINE1 5.902 FN1 9.974 EHF 1.234 VAMP8 3.649 CXCL3 -0.315 SERPINE2 5.411 FOXC2 -0.834 ELF3 3.414 VAV3 0.658 CYBRD1 3.701 SFRP4 -1.523 FST 3.630 EPCAM 3.550 WNT3A -4.542 DAB2 3.127 SH3KBP1 3.938 GADD45G -3.399 EPHA1 1.955 WNT4 -1.102 DCN 2.442 SMARCA1 3.121 GJA1 1.946 EPN3 1.083 WNT7B 1.009 DDR2 2.484 SPARC 6.576 GLT8D2 0.526 EPS8L1 3.622 EDNRA -1.472 SPOCK1 3.964 GREM1 3.220 ERBB3 2.885 EMP3 5.080 SRPX 2.604 IFITM2 4.612 ESRP1 1.288 FAP 1.059 SULF1 2.944 IGFBP3 6.190 ESRP2 1.782 FBN1 4.146 TCF4 0.705 IL1R1 1.252 EVPL 0.234 FGF1 -0.793 TFPI 2.062 INHBA 2.436 EXPH5 -0.989 FHL1 2.511 TGFB1 4.092 ITGBL1 2.227 F11R 3.294 FN1 8.443 TGFB1I1 3.009 KRT14 1.237 FA2H -1.889 FOSL1 3.395 THBS2 1.126 KRT16 -1.959 FBP1 1.225 FOXC2 -2.174 THY1 3.625 KRT7 1.856 FOXA1 1.528 FST 2.555 TIMP3 3.708 LGR5 -3.329 FXYD3 4.244 FSTL1 6.206 TMEFF1 0.365 LOX 3.878 GADD45G 0.490 GAS1 -0.239 TMEM158 1.520 LOXL2 5.806 GALNT3 1.907 GEM 2.394 TNC 3.234 LRRC15 -0.003 GPX2 -0.492 GFPT2 1.976 TNFAIP6 -0.824 MALL 0.019 GRB7 1.676 GJA1 2.267 TPM2 6.350 MFAP5 0.136 GRHL2 0.121 GLI2 -0.830 TRPC1 1.437 MMP2 5.236 HOXC13 0.223 GLT8D2 1.557 TUBB6 7.243 MXRA5 -1.889 HPGD -2.173 GREM1 2.165 TWIST1 -0.148 MYL9 4.994 ICA1 1.509 HMGA2 1.081 TWIST2 -2.302 NID2 1.253 IL1RN -1.966 HTRA1 4.465 VCAN 2.709 NOTCH3 1.644 IL20RA -1.730 IFITM3 6.742 VEGFC 3.036 NT5E 5.491 IRF6 0.688 IGFBP3 6.102 VIM 7.107 PAPPA -0.187 JUP 4.203 ITGA5 5.433 WISP1 -1.528 PCOLCE 4.551 KRT8 7.266 ITGB1 9.185 WNT2 -3.475 PDGFC 1.425 LAD1 1.200 LEPRE1 5.248 WNT5A 2.457 PDGFRA 0.445 LLGL2 2.833 LGALS1 10.947 WNT5B 1.454 PDGFRB 1.315 LSR 3.936 LHFP 2.623 ZEB1 1.196 PLAU 2.161 MAP7 1.737 LOX 3.766 POSTN 2.399 MST1R 1.087 LOXL2 5.430 PRRX1 2.462

Table S1. List of genes and corresponding Ki values for state metrics developed separately for breast cancer and melanoma cell lines based on CCLE gene expression. Genes that overlap with the fibroblast gene list are highlighted in yellow.

Klinke et al. | Future Home Journal Supplementary Information | 1 bioRxiv preprint doi: https://doi.org/10.1101/865139; this version posted December 13, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Supplemental Table S2: List of genes and associated Ki values for refined state metrics based on TCGA breast cancer tissue samples and tissue samples of common acquired melanocytic nevi and primary melanoma. Genes that overlap in the state metrics between breast cancer and melanoma are highlighted in green. TCGA Breast Cancer Tissue Samples Melanocytic Nevi and Melanoma Tissue Samples Epithelial Signature Mesenchymal Signature Differentiated Signature De-differentiated Signature GENE_SYMBOL Ki (log2 GENE_SYMBOL Ki (log2 GENE_SYMBOL Ki (log2 GENE_SYMBOL Ki (log2 TPM) TPM) TPM) TPM) ALDH3B2 6.539 ADAM12 5.020 ARAP2 7.068 ACTA2 5.441 ANK3 3.420 ASPN 6.415 CEACAM1 3.160 DES 0.960 B3GAT1 -1.558 CDH2 1.919 CGN 5.204 EDNRA 4.051 BMP7 2.158 CLIC4 8.028 CKMT1A 0.384 FGF1 1.751 C1orf106 1.896 CTSB 9.211 FXYD3 7.425 FOXC2 -3.978 C4orf19 2.445 EDNRA 4.930 HPGD 6.467 GADD45G 1.786 CDH1 8.627 FOXC2 1.352 MITF 7.339 INHBA 1.962 CEACAM1 4.592 IFITM3 10.371 MTUS1 3.385 KRT16 2.625 CGN 5.918 ITGA5 5.987 MYH14 5.677 KRT7 7.880 CLDN4 8.052 MMP3 3.602 NID2 3.067 CLDN7 7.632 POSTN 9.209 NOTCH3 4.381 CX3CR1 3.235 SERPINE1 6.068 PDGFRB 5.254 DSC2 4.658 SPOCK1 4.503 SERPINE1 2.100 EHF 6.185 SULF1 6.699 SPOCK1 1.611 EPHA1 4.359 TGFB1 6.076 TPM2 4.213 EXPH5 2.720 WISP1 3.690 VEGFC 1.694 FA2H 2.587 WISP1 3.202 GPX2 1.897 WNT5A 2.836 GRB7 5.585 HPGD 1.641 ICA1 5.752 IL20RA 3.761 IRF6 7.491 JUP 8.885 MSX2 4.256 POF1B 2.093 PPL 5.551 SH2D3A 3.672 TMPRSS2 3.378 TUBBP5 1.727 WNT3A -2.375 WNT4 2.885

Table S2. List of genes and associated Ki values for refined state metrics based on TCGA breast cancer tissue samples and tissue samples of common acquired melanocytic nevi and primary melanoma. Genes that overlap in the state metrics between breast cancer and melanoma are highlighted in green.

2 | Supplementary Information Klinke et al. | Future Home Journal bioRxiv preprint doi: https://doi.org/10.1101/865139; this version posted December 13, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

cell_type 1 cell_type Fibroblast 0.8 Other

1 0.6

0.4

0.2 2 0

3

Fig. S1. Consensus matrixFigure for similarity S1. Consensus and matrix clustering for sample of similarity cell samples. and clustering.The The symmetric symmetric 1034x1034 1034x1034 matrix is colored in element(i,j) matrix is colored in element(i,j) by similarity in assigning cells i and j to the same cluster when by similarity in assigning cellsthe iclustering and j to parameters the same are cluster changed. when A similarity the clustering score of 0 (blue) parameters indicates that are the changed. two cells A similarity score of 0 (blue) indicates that the two cells are always always assigned assigned to different to different clusters while clusters a score while of 1 (red) a score indicates of 1that (red) the two indicates cells are that the two cells are always assigned to the same cluster.always The assigned similarity to the of same the cluster. samples The aresimilarity also of illustrated the samples by are the also dendrograms illustrated by the shown on the top and side. The dendrograms shown on the top and side. The top bar indicates whether the cell was annotated top bar indicates whether theas cella fibroblast was annotatedbased on COL1A1 as a and fibroblast COL1A2 basedexpression on (aqua COL1A1 – fibroblast, and COL1A2 pink – other). expression (aqua – fibroblast, pink – other).

Klinke et al. | Future Home Journal Supplementary Information | 3