bioRxiv preprint doi: https://doi.org/10.1101/2020.06.25.171793; this version posted June 26, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

A single cell atlas of the healthy breast tissues reveals clinically relevant clusters of breast epithelial cells.

Poornima Bhat-Nakshatri1, Hongyu Gao2,3, Patrick C. McGuire3, Xiaoling Xuei3, Liu Sheng2,3, Jun Wan2,3, Yunlong Liu2,3, Sandra K. Althouse4, Austyn Colter5, George Sandusky5, Anna Maria Storniolo6, and Harikrishna Nakshatri1,2,7,8*

1Department of Surgery, Indiana University School of Medicine, Indianapolis, IN 46202, USA 2Center for Computational Biology and Bioinformatics, Indiana University School of Medicine, Indianapolis, IN 46202, USA 3Department of Medical and Molecular Genetics, Indiana University School of Medicine, Indianapolis, IN 46202, USA 4Department of Biostatistics, Indiana University School of Medicine, Indianapolis, IN 46202, USA 5Department of Pathology and Laboratory Medicine, Indiana University School of Medicine, Indianapolis, IN 46202, USA 6Department of Medicine, Indiana University School of Medicine, Indianapolis, IN 46202, USA 7Department of Biochemistry and Molecular Biology, Indiana University School of Medicine, Indianapolis, IN 46202, USA 8Roudebush VA Medical Center, Indianapolis, IN 46202, USA

Running title: Epithelial cell subtypes of the normal breast.

*Corresponding author: Harikrishna Nakshatri, B.V.Sc., PhD. C218C, 980 West Walnut Street Indianapolis, IN 46202, USA 317 278 2238 (phone) 317 274 0396 (fax) [email protected]

Summary

Single cell RNA sequencing is an evolving field to elucidate cellular architecture of adult

organs. Using normal breast tissues from healthy volunteers and a rapid

procurement/processing/sequencing protocol, 13 breast epithelial cell clusters were identified.

Approximately 90% of breast cancers were enriched for cell-of-origin signatures derived from

differentiated luminal clusters and two minor luminal progenitor clusters. Expression of cell

cycle and chromosome segregation-related genes were higher in one of the minor clusters and

breast tumors with this cluster signature displayed the highest mutation rate and poor outcome.

We identified TBX3 and PDK4 as genes co-expressed with estrogen receptor (ER) in the normal

breasts and their expression analyses in >550 breast cancers enabled prognostically relevant cell-

of-origin based subclassification of ER+ breast cancers.

Keywords: Normal breasts, single cell analyses, epithelial cell clusters, cell-of-origin, breast cancer

Significance: This study elucidates different epithelial cell types of the normal breasts and

identifies a minor subpopulation of cells from which the majority of breast cancers may

originate. This observation should help to develop methods to characterize breast tumors based

on cell-of-origin. Although it was suggested that intrinsic subtypes of breast cancers have distinct

cells of origins, this study suggests multiple cell-of-origin for an intrinsic subtype of breast

cancer, including for hormone responsive breast cancers. Cell-of-origin signatures allowed

survival-associated subclassification of intrinsic subtypes. Critically, this normal breast cell atlas

would allow for the classification of genes differentially expressed in a breast tumor compared to

normal breast due to the cell-of-origin of tumor and those that are acquired due to genomic

aberrations.

Introduction

Breast cancers are subclassified into multiple subtypes based on expression analyses

and genomic aberrations (Curtis et al., 2012; Sotiriou et al., 2003). These classifications have

played a significant role in clinical decision making. Among these classifications, intrinsic

subtype classification based on gene expression, which classifies breast cancer into luminal-A,

luminal-B, HER2+, basal, normal-like and claudin-low, is suggested to reflect cell-of-origin of

breast cancer (Prat and Perou, 2009). Flow cytometry-based marker profiling and gene

expression portraits have identified three major epithelial cell types in the breast including

basal/stem (CD49f+/EpCAM-), luminal progenitors (CD49f+/EpCAM+), and mature luminal

(CD49f-/EpCAM+) cells (Lim et al., 2009; Visvader and Stingl, 2014). Cell-type enriched

transcription factor networks such as TP63/NFIB, ELF5/EHF and FOXA1/ESR1 control gene

expression patterns in basal/stem, luminal progenitor and mature luminal cells, respectively

(Pellacani, 2016). It is suggested that while claudin-low subtype of breast cancers originates

from basal/stem cells, luminal progenitors are the source of basal-like breast cancers (Lim et al.,

2009; Prat et al., 2010; Proia et al., 2011). While HER2+ breast cancers may originate from

luminal progenitors and mature luminal cells, luminal-A/B breast cancers likely originate from

mature luminal cells (Prat and Perou, 2009). However, it is acknowledged that heterogeneity

exists within basal/stem, luminal progenitors and mature luminal cells as defined by

CD49f/EpCAM cell surface marker profiling (Anjanappa et al., 2017; Colacino et al., 2018).

A recent integrative analysis of 10,000 tumors from 33 types of cancer emphasized the

dominant role of cell-of-origin patterns in cancers (Hoadley et al., 2018). Since normal tissue

itself is composed of multiple cell types, fine mapping of these cell types and identifying

potential cancer vulnerable cell population in normal tissues would aid in identifying and

characterizing organ-specific cell-of-origin of cancers. Recent advances in single cell techniques

including scRNA-seq, sc-Epigenetics, scDNA-seq and scProteomics-atlas are enabling further

refinement of cell types within normal and diseased tissues (Lim et al., 2020). For example,

using reduction mammoplasty samples and cells flow sorted based on CD49f/EpCAM, Nguyen

et al identified three epithelial cell types in the normal breasts (Nguyen et al., 2018). Using the

same technique and mouse mammary tissues at different development stages, Pal et al described

seven epithelial cell types in the mouse mammary gland (Pal et al., 2017). However, Bach et al

observed 15 epithelial cell types in the mouse mammary gland (Bach et al., 2017). A concern has

been raised about the reproducibility of data, which is likely influenced by the types of tissues

used, duration between tissue collection and sequencing, and dissociation protocol (Lim et al.,

2020). While these issues can be standardized for studies involving mouse tissues, standardizing

is difficult for studies that utilize human tissues collected after a surgical procedure. In this

regard, in an elegant review, Lim et al recently proposed the need to establish a rapid tissue

dissociation program to advance single cell technology for clinical applications (Lim et al.,

2020).

A decade ago, our institution established a normal breast tissue bank where clinically

healthy women donate breast biopsies for research purposes. This resource has enabled others

and us to demonstrate clear differences between our “normal” and both reduction mammoplasty

and tumor-adjacent normal tissues, which have been the most common sources of “normal

controls” for breast cancer studies in the literature, including the single cell transcriptome study

detailed above. We and others have been able to show clear histologic and molecular

abnormalities in these surrogate sources of normal breast tissue (Degnim et al., 2012; Nakshatri

et al., 2015; Nakshatri et al., 2019). In this study, we first performed single cell RNA sequencing

of five freshly collected samples that included 18704 cells and 20647 genes. Results were

analyzed at both single sample levels as well as in an integrative manner. To confirm the results

of first sequencing, we repeated integrated single cell analyses of five new cryopreserved

samples covering 7582 cells and 25,842 genes. Using the expression patterns of CD49f and

EpCAM as well as basal/stem, luminal progenitor and mature luminal cell transcription factor

networks, we performed refined analyses of epithelial cells. Epithelial cluster specific gene

signatures were then applied on TCGA and METABRIC datasets to determine the impact of

putative “cell-of-origin” on breast cancer outcomes (Cancer Genome Atlas, 2012; Curtis et al.,

2012). Since there is limited subclassification of estrogen receptor positive (ER+) breast cancers

and it is exceedingly difficult to characterize ER+ breast epithelial cells from the normal breasts to

identify genes co-expressed with ER in normal and tumor cells (Rosenbluth et al., 2020), we

performed additional studies on TBX3 and PDK4, two genes that are co-expressed at different

levels in ER+ clusters of the normal breasts.

Results:

Establishment of rapid tissue procurement and single cell analyses protocol: Although the

primary intention of establishing the Susan G. Komen Tissue Bank at IU Simon Cancer Center

(Komen Tissue Bank) was to provide a source of healthy breast tissue to be used as normal

controls for research, we took advantage of a tissue collection procedure that is easily accessible in the clinic

rather than the surgical room to limit the time lapse between tissue collection and use of tissues for

research, a delay typically associated with tissue collection after surgical procedures. Because

these “collection events” have a 1:2 donor:volunteer ratio, we were abundantly staffed and able to

reduce the time from tissue collection to placement in media, or the cold ischemia time for

cryopreservation, to ~6 minutes. Since specimen cellularity varied between individuals, 50% of

fresh or cryopreserved tissues provided high quality data with respect to number of viable cells

and minimal ambient RNA contamination of single cell data. Also note that all tissues used have

undergone histologic characterization and are free of abnormalities. Table S1 provides

information about donors. Nine samples were from Caucasian women, one was from an Asian woman, and

one was from an African American woman. Genetic ancestry mapping has been performed using

41-SNP genetic ancestry informative markers. Two out of 11 were nulliparous and two out of 11

were post-menopausal women. Five donors had a family history of breast cancer.

Normal breast contains 13 epithelial cell types: Unlike the previous studies which purified

breast epithelial cells using CD49f/EpCAM markers prior to single cell analyses (Nguyen et al.,

2018), we subjected single cells after dissociation directly to RNA-sequencing and then used

CD49f/EpCAM as well as transcriptional regulators known to specify basal/stem, luminal

progenitors and mature luminal cells to subcluster epithelial cells (Pellacani, 2016). Uniform

manifold approximation and projection (UMAP) plot of combined samples is shown in Figure

1A. As expected, the normal breasts contained a variety of cell types in addition to epithelial

cells including monocytes, T cells, NK cells, endothelial cells, and fibroblast-like cells (Figure

1A). Fibroblast-like and endothelial cells displayed three closely related clusters suggesting

heterogeneity within these cells. Heterogeneity in endothelial cells, driven largely by metabolic

plasticity, has been previously described in other organ and disease conditions (Rohlenova et al.,

2020). Similarly, heterogeneity in fibroblasts of mouse mammary gland with mammary tumors

has also been described (Bartoschek et al., 2018). Our studies show the heterogeneity in these

cells within the normal breast itself. Epithelial cell types were dominant. Using CD49f/EpCAM

expression pattern as well as TP63/NFIB, ELF5/EHF and FOXA1/ESR1 as functional markers of

basal/stem, luminal progenitors and mature luminal cells, we performed subcluster analyses of

epithelial cells, which revealed 13 different epithelial cell clusters (Figure 1B and C). Number of cells

in each cluster and average expression value of genes that differentiated these clusters are shown

in Table S2. A heatmap of average expression levels of top marker genes of these clusters is

shown in Figure 1D. CD49f+/EpCAM- basal/stem cells contained three closely related

subclusters (clusters 5, 7 and 9). Each of these clusters within basal/stem cells can be

distinguished through expression of specific genes. For example, cluster 5 expressed higher

levels of NDUFA4L2, a mitochondrial NADH dehydrogenase (Figure 2A). This cluster also

expressed higher levels of CD36, a lipid transporter associated with breast cancer metastasis, as

well as Vimentin, a marker of basal cells (Pal et al., 2017; Pascual et al., 2017). Cluster 7

expressed ACKR1, a decoy receptor for CCL2 and IL-8 (Davis et al., 2015). Cluster 9 is

enriched for MECOM (EVI1), a stem cell associated transcription factor (Sato et al., 2014).

CD49f+/EpCAM+ cells contained two clusters that appeared as continuum of cells

(clusters 0 and 2) and five other well-separated clusters (Clusters 6, 8, 10-12). Although cluster 0

and 2 appeared as a continuum of cells, significant differences in gene expression are evident

(Figure 1D and 2B). For example, while all luminal progenitor cells expressed Secreted Frizzled

Related Protein 1 (SFRP1), a modulator of Wnt signaling (Baharudin et al., 2020), genes such as

SLP1, ANXA1, RARRES1, KLK5, and KRT15 were enriched in cluster 2. KRT14 and KRT17

expression were enriched in cluster 2 but not in cluster 0. Cluster 6 was enriched for CXCL14

and ACTA2. Cluster 8 was enriched for SCGB2A1 and CALML5. Cluster 10 was KRT14-positive

and enriched for the expression of GLYATL2. Cluster 11 was enriched for the expression of

multiple genes including TOP2A, NUSAP1, UBE2C, TPX2, SPC25, MKI67, CDK1, CENPF and

CCNA2 (Figure 1D and 2B). In fact, this cluster displayed a higher number of genes that are

differentially expressed than other clusters and constituted a major signaling network associated

with regulation of cell cycle, chromosome segregation, and spindle checkpoint, to name a few

(Table S2). Cluster 12 was characterized by elevated expression of MEG3, IGF1 and PTGDS.

CD49f-/EpCAM+ mature luminal cells were comprised of three clusters, which appeared

as a continuum of cells (clusters 1, 3 and 4), although there were distinct differences in gene

expression. All three of these clusters expressed ESR1 and pioneering factors FOXA1 and

GATA3 (Zaret and Carroll, 2011). TBX3 and PDK4 are two other genes that showed variable

expression in these clusters. XBP1 and STC2, ESR1 target genes (McBryan et al., 2007), were

uniformly expressed at higher levels in all three of these clusters (Figure 3A). In fact, while expression

level of SFRP1 was able to distinguish luminal progenitors from mature luminal cells, expression

levels of XBP1 and STC2 were able to distinguish mature luminal cells from luminal progenitors.

Cluster 1 showed enrichment of RUNX1 and BATF. Cluster 3 was enriched for ANKRD30A,

whereas Cluster 4 was enriched for PIP, MUCL1, TAT and TSPAN8.
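
The three-compartment grouping used above can be illustrated with a minimal sketch that labels a cluster from its average CD49f and EpCAM expression. CD49f is encoded by ITGA6. The cutoff, the function name and the example values are hypothetical and are not taken from Table S2; the actual assignment in this study also relied on transcription factor networks and manual inspection.

```python
from typing import Dict

# Hypothetical cutoff on normalized expression; the source does not state one.
POSITIVE_CUTOFF = 0.5


def assign_compartment(mean_expr: Dict[str, float],
                       cutoff: float = POSITIVE_CUTOFF) -> str:
    """Label a cluster from its mean ITGA6 (CD49f) and EPCAM expression."""
    cd49f_pos = mean_expr.get("ITGA6", 0.0) >= cutoff
    epcam_pos = mean_expr.get("EPCAM", 0.0) >= cutoff
    if cd49f_pos and not epcam_pos:
        return "basal/stem"          # CD49f+/EpCAM-
    if cd49f_pos and epcam_pos:
        return "luminal progenitor"  # CD49f+/EpCAM+
    if epcam_pos:
        return "mature luminal"      # CD49f-/EpCAM+
    return "unassigned"              # CD49f-/EpCAM-


# Illustrative (made-up) cluster averages, not values from this study.
example_clusters = {
    "A": {"ITGA6": 1.8, "EPCAM": 0.1},
    "B": {"ITGA6": 1.2, "EPCAM": 1.5},
    "C": {"ITGA6": 0.2, "EPCAM": 2.0},
}
labels = {name: assign_compartment(expr)
          for name, expr in example_clusters.items()}
```

A single hard cutoff is the simplest possible rule; in practice the threshold would need to be chosen per dataset and checked against marker genes such as SFRP1, XBP1 and STC2.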

To highlight the presence of three clusters of cells expressing ESR1 and to identify genes

co-expressed with ESR1, we performed hierarchical clustering of breast epithelial cells from a

44-year old African American donor (donor 3). Various cell types present in the breast of this

donor are shown in Figure 3B, and gene expression patterns in 10 epithelial cell clusters from

this donor are shown in Figure 3C. Clusters 1, 3 and 4 were ESR1+. JUND, TBX3, PAWR,

PDK4, and HAMP were the genes that distinguished cluster 4 from cluster 3, whereas cluster 1

contained lower levels of GATA3 compared to clusters 3 and 4. Thus, there are at least three

clusters of estradiol-responsive breast epithelial cells with varying levels of ESR1 as well as

pioneer factor expression.

The distribution pattern of epithelial clusters in the individual samples is shown in Figure

S1A and number of cells per cluster is indicated in Table S2. Almost every sample contained

similar levels of most of the clusters; two minor clusters, clusters 11 and 12, showed inter-

sample variability.

Reproducibility of cluster analyses: To determine whether epithelial clusters identified in the

above analyses can be reproduced using cryopreserved tissues from healthy donors, we isolated

cells from five cryopreserved tissues, pooled cells, and analyzed them all together. Since five

samples were combined, there were enough cells to divide samples into two and perform cDNA

synthesis and library preparation by two independent labs. In addition, we used the latest version

of the library preparation with improved chemistry, paired-end sequencing and better efficiency.

Because pooled samples contained more lymphocytes, lymphocyte related cells were removed

from the analyses. Without lymphocyte removal, there were 28 clusters (Table S2). Side-by-side

comparisons of the second set of pooled samples and re-analyses of first five samples are shown

in Figure 4A. Re-analyses of individual samples are shown in Figure S1B and number of cells in

each cluster is shown in Table S2. With increased number of cells and more genes sequenced,

further subclassification of epithelial cells became possible. Since the number of clusters

identified in this new clustering are different (23) from the first analysis (13), new clusters are

named with prefix N (N0-N22). Consistent with our earlier report and a recent report on

organoid-derived single cell data, there were inter-individual differences in proportion of cells in

each cluster (Nakshatri et al., 2015; Rosenbluth et al., 2020). The UMAP cell embeddings and

cell cluster information generated from Seurat analysis were imported into the 10x Genomics Loupe

Browser. By examining the expression of various genes with the Loupe Browser, we first assigned the

subdivided clusters into basal/stem, luminal progenitor, and luminal mature cells. Based on

CD49f and EpCAM expression patterns (Figure 4B), clusters N5-7, N11, N13, N18 and N22

were basal; N3, N4, N9, N14, N16, N19 and N20 were luminal progenitors; and N2, N8, N12

and N17 were mature luminal cells. ALDH1A3, whose expression has been suggested to identify

breast cancer stem cells (Marcato et al., 2011), was expressed mainly in N4, N14 and N16

clusters of luminal progenitor cells (Figure 4C). Among transcription factors that define

basal/stem, luminal progenitor and luminal mature cells, as expected, ELF5 and EHF expression

was restricted to luminal progenitor cells, whereas ESR1 and FOXA1 expression was restricted to

luminal mature cell clusters (Figure S2). While ESR1 expression was widespread across mature

luminal subclusters with cluster N12 displaying stronger signals, the expression of its target gene

PGR was much more restricted within mature luminal cells suggesting the natural existence of

ER+/PR+ and ER+/PR- cells, similar to the features of luminal A and luminal B breast cancers

(Sorlie et al., 2001). Expression of the best-studied ER target gene GREB1 overlapped with PGR

expression suggesting ER has cell type-specific targets within the normal breast. Consistent with

an earlier report (Nolan et al., 2017), RANK (TNFRSF11A) expression was restricted to a few luminal

progenitor cells, whereas its ligand RANKL (TNFSF11) expression was observed in progesterone

receptor positive mature luminal cells (Figure S2). EHF and ELF5 expression showed strong

signals in subclusters N14 and N16, which could be alveolar progenitor cells as these two

transcription factors play a major role in alveolar differentiation during pregnancy (Luk et al.,

2018). However, expression of NFIB and TP63 did not correlate with prior assessment

(Pellacani, 2016) as NFIB expression was not restricted to basal/stem cells, whereas TP63

expression was observed only in cluster N15. Cluster N15 likely represents myoepithelial cells, as this

cluster expressed higher levels of ACTA2 and KRT17, previously described markers of

myoepithelial cells (Nguyen et al., 2018) (Figure S2). Although basal cells are expected to

express KRT14, we found its expression predominantly in a subpopulation of luminal progenitor

cells and in N15 with myoepithelial characteristics (Figure 4D). KRT18 and KRT19 expression

were found equally in luminal progenitor and mature luminal subclusters (Figure S2).

To further document reproducibility, we analyzed a surgical sample from a 33-year old

Hispanic BRCA1 carrier and a core biopsy of a healthy Asian (Chinese) woman. BRCA1 sample

was analyzed from cryopreserved tissues and we included duplicate samples because of

availability of large starting material. One sample was prepared as above involving both

enzymatic and mechanical disruption, whereas another sample utilized only digestion with gentle

hyaluronidase/collagenase cocktail from Stem Cell technologies. Sample preparation using

gentle hyaluronidase/collagenase yielded lower number of basal cells compared to the method

that involved both enzymatic and mechanical disruption. Nonetheless, we did not observe any

clusters unique to the BRCA1 mutated sample (Figure S1B). The breast tissue from

Asian/Chinese donor showed a disproportionately higher number of cells with basal/stem

characteristics compared to other samples.

Gene expression patterns in Clusters 11 and 12 are similar to N19 and N0-N1, respectively,

of the new analysis: Since cluster 11 of the first analysis, despite being a minor cluster, was

enriched for genes associated with cell cycle and chromosome segregation, we next investigated

which among the clusters in the second analysis are enriched for cluster 11 genes. In addition,

since N0-N1 clusters grouped farther away from the remaining clusters, we investigated their

relationship to the first analyses. A significant number of genes in cluster 11 were also enriched

in cluster N19 of the second analysis, whereas cluster 12 genes were enriched in clusters N0 and

N1 (Figure 5). Similar to C11, N19 expressed MKI67, BIRC5 and PCLAF. Cluster N19 cells also

expressed higher levels of APOBEC3B, which is considered a driver of mutations in cancer

(Olson et al., 2018). Similar to C12, N0 and N1 expressed PTGDS and IGF1. These two clusters

are likely enriched for unique stem cells as these cells expressed higher levels of ZEB1, EGFR,

CD44 and low levels of various keratins (Morel et al., 2017). A fraction of these cells as well as

cluster N6 cells among basal/stem cell group were PROCR+, another mammary stem cell marker

(Wang et al., 2014). Note that none of 23 clusters expressed mesenchymal stem cell markers

such as CD90, CD73 and CD105 (Kfoury and Scadden, 2015).
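
The comparison of marker genes between the first and second analyses can be sketched as a simple gene set overlap. The source does not state here which enrichment statistic was used; the Jaccard index and hypergeometric upper tail below are assumptions for illustration. The gene sets are short lists assembled from genes named in the text, not the full marker lists of Table S2, and using the 20647 genes of the first dataset as the background is also an assumption.

```python
from math import comb


def jaccard(set_a, set_b):
    """Fraction of shared genes among all genes in either set."""
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def hypergeom_upper_tail(overlap, size_a, size_b, universe):
    """P(X >= overlap) when size_b genes are drawn without replacement
    from `universe` genes, size_a of which are markers of the other cluster."""
    total = comb(universe, size_b)
    upper = min(size_a, size_b)
    favorable = sum(
        comb(size_a, k) * comb(universe - size_a, size_b - k)
        for k in range(overlap, upper + 1)
    )
    return favorable / total


# Partial marker lists taken from genes named in the text (illustrative only).
c11_markers = {"TOP2A", "NUSAP1", "UBE2C", "TPX2", "SPC25",
               "MKI67", "CDK1", "CENPF", "CCNA2"}
n19_markers = {"MKI67", "BIRC5", "PCLAF", "APOBEC3B"}

BACKGROUND_GENES = 20647  # assumed background: genes in the first dataset

shared = c11_markers & n19_markers
similarity = jaccard(c11_markers, n19_markers)
p_value = hypergeom_upper_tail(len(shared), len(c11_markers),
                               len(n19_markers), BACKGROUND_GENES)
```

With full marker lists, the same two functions could be applied to every pair of first-analysis and second-analysis clusters to produce an overlap matrix.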

Cluster 11 and cluster 12-specific genes are enriched in breast cancer: Although it is not

possible to definitively link cancer to a specific cell type from which it may originate (Gupta et

al., 2019), it is likely that genes that define an epithelial cell cluster are enriched in tumors that

originate from that cluster of epithelial cells. To determine such relationships, we compared gene

scores of each epithelial cluster with METABRIC and TCGA breast cancer gene expression

datasets (Cancer Genome Atlas, 2012; Curtis et al., 2012). To increase number of samples per

cluster, we used clusters 0-12 of the first analysis instead of second analysis clusters N0-N22.

Furthermore, since three clusters among CD49f+/EpCAM- showed modest gene expression

differences, we analyzed them together as cluster 5a. In the CD49f-/EpCAM+ population, we

combined clusters 1, 3, and 4 as cluster 1a. With both datasets, the highest number of breast

cancers mapped to the minor cluster 12. This was followed by cluster 1a and cluster 11 in both

datasets (Figure 6A). Consistent with the literature that the majority of breast cancers originate

from luminal progenitor cells (Lim et al., 2009), gene expression in the highest number of basal

breast cancers overlapped with cluster 11 followed by clusters 12, 2, and 6 (Tables S3 and S4).

In METABRIC dataset, every intrinsic subtype of breast cancer was found to be enriched for

genes of cluster 12. Survival analyses indicated significant differences in overall survival

between breast cancers with overlapping signature of specific normal epithelial clusters in

METABRIC but not in TCGA dataset. Note that METABRIC dataset is much larger than TCGA

dataset. In the TCGA dataset, we observed tumors enriched for cluster 1a (mature luminal)

signature associating with better disease-free survival (DFS) than tumors enriched for cluster 2

(luminal progenitor) signature (Figure 6B). Within luminal progenitor clusters, tumors enriched

for cluster 12 signature were associated with better DFS than tumors with signature of cluster 2.

Although tumors enriched for signatures of cluster 1a and cluster 12 did not show significant

differences in overall survival in the TCGA dataset, tumors with cluster 12 signature were

associated with better overall survival than cluster 1a in METABRIC dataset (Figure 6C).
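
Mapping tumors to normal epithelial clusters can be sketched as a signature scoring step. The source does not detail here how cluster gene scores were computed; the approach assumed below (z-score each gene across tumors, average the z-scores of each signature, and assign the tumor to the highest-scoring signature) is one common choice and may differ from the method actually used. The signatures are short gene lists named in the text and the expression values are made up.

```python
from statistics import mean, pstdev


def zscore_genes(expr):
    """Z-score each gene across tumors; expr maps gene -> list of values."""
    scaled = {}
    for gene, values in expr.items():
        mu, sd = mean(values), pstdev(values)
        scaled[gene] = [(v - mu) / sd if sd > 0 else 0.0 for v in values]
    return scaled


def assign_tumors(expr, signatures):
    """Return, for each tumor, the signature with the highest mean z-score."""
    scaled = zscore_genes(expr)
    n_tumors = len(next(iter(expr.values())))
    assignments = []
    for i in range(n_tumors):
        scores = {}
        for name, genes in signatures.items():
            present = [g for g in genes if g in scaled]
            if present:  # skip signatures with no measured genes
                scores[name] = mean([scaled[g][i] for g in present])
        assignments.append(max(scores, key=scores.get))
    return assignments


# Toy signatures built from genes named in the text (not the full signatures).
toy_signatures = {
    "cluster 11": ["MKI67", "TOP2A"],
    "cluster 12": ["PTGDS", "IGF1"],
    "cluster 1a": ["ESR1", "FOXA1"],
}
# Made-up expression values for three hypothetical tumors.
toy_expr = {
    "MKI67": [9.0, 2.0, 1.0],
    "TOP2A": [8.0, 1.0, 2.0],
    "PTGDS": [1.0, 7.0, 2.0],
    "IGF1":  [2.0, 8.0, 1.0],
    "ESR1":  [1.0, 2.0, 9.0],
    "FOXA1": [2.0, 1.0, 8.0],
}
toy_assignments = assign_tumors(toy_expr, toy_signatures)
```

Assigning each tumor to a single best-matching cluster is a simplification; tumors with near-equal scores for two signatures would need a tie rule or an "unassigned" category.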

We used PAM50 classifier to determine a relationship between intrinsic subtypes and

normal breast epithelial clusters. In both datasets, ~50% of luminal A breast cancers contained

cluster 1a signature and remaining were enriched for cluster 12 signature (Figure S3). In the

METABRIC dataset, 673 out of 679 Luminal A breast cancers were assigned to clusters 1a and

12, suggesting two distinct cells-of-origin of Luminal A breast cancer. Luminal A

breast cancers with cluster 12 signature were associated with better overall survival compared to

luminal A with cluster 1a signatures in METABRIC dataset but not with TCGA dataset (Figure

6C and Figure S3). A disproportionately higher percentage of luminal B breast cancers carried

cluster 1a signature compared to cluster 12 signature in both datasets (Figure S3). Cluster 11

signature was found in ~50% luminal B breast cancers of METABRIC dataset (Table S3 and

S4). Failure to find similar association between luminal B and cluster 11 in TCGA dataset could

be due to low number of tumors with cluster 11 signature in this dataset. HER2+ breast cancers

were enriched for cluster 1a, cluster 12 and cluster 11 (only in METABRIC) signatures, whereas

basal-like breast cancers were enriched predominantly for cluster 11 signature. With respect to

outcome, interactions between cell clusters and oncogenic signals may determine outcome. For

example, while luminal B breast cancers with C1a signatures displayed better outcome than

luminal B breast cancers with cluster 12 signatures, opposite outcomes were observed if the

tumors were HER2+ (Figure S3). With respect to mutation frequency, breast tumors enriched for

cluster 11 signature had highest number of mutations per tumor compared to other cluster

enriched tumors (Table S4). This table also lists names of genes mutated in more than 10% of

tumors.

Basal-like breast cancers with cluster 11 signatures were associated with better overall

survival compared to basal-like breast cancers with cluster 2 signature in the TCGA dataset,

although the sample size is small (Figure 6B). Interestingly, Kaplan-Meier curves of basal breast

cancers enriched for cluster 2 and cluster 12 signatures showed slopes similar to what is

observed with triple negative breast cancer patients with most events occurring within first 100

months (Figures 6 and S3). Cluster 1a or cluster 5a signatures were rarely represented in basal

breast cancers.
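
The survival comparisons above rest on Kaplan-Meier estimates. As a minimal sketch, the product-limit estimator can be written in a few lines; the function name and the follow-up times below are illustrative and are not patient data from TCGA or METABRIC. Comparing two groups would additionally require a test such as the log-rank test, which is not shown.

```python
def kaplan_meier(times, events):
    """Product-limit estimate of survival.

    times  -- follow-up time per patient (e.g. months)
    events -- 1 if the event was observed, 0 if the patient was censored
    Returns a list of (time, survival probability) at each event time.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Group all patients sharing the same follow-up time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths > 0:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving  # events and censored patients leave the risk set
    return curve


# Made-up follow-up times in months for one hypothetical tumor group.
toy_times = [10, 20, 20, 40, 60]
toy_events = [1, 1, 0, 1, 0]
toy_curve = kaplan_meier(toy_times, toy_events)
```

Running the estimator separately on tumors carrying each cluster signature yields the per-group curves that are then compared.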

TBX3 and PDK4 expression patterns determine subtypes of ER+ breast cancers.

We observed expression of ESR1 in three clusters (1, 3, and 4) of the normal breasts,

which are characterized by differential expression of TBX3, PDK4, and GATA3 (Figure 3C).

While cluster 4 expressed similar levels of ESR1, TBX3 and PDK4, PDK4 expression was lower

in cluster 3. Compared to cluster 4, cluster 1 expressed lower levels of both TBX3 and PDK4.

GATA3 expression was highest in cluster 3 followed by clusters 1 and 4. Since ER-related

pioneer factor activity of GATA3 has already been established (Zaret and Carroll, 2011), we

focused our studies on TBX3 and PDK4. Expression of both PDK4 and TBX3 has been linked

to anti-estrogen response (Razavi et al., 2018; Walter et al., 2015). To determine whether there is

a relationship between clinical progression and ER/TBX3/PDK4 status, we immunostained a tissue microarray (TMA)

containing 586 breast tumors with 15 years of follow-up for TBX3 and PDK4. As expected,

TBX3 staining was predominantly nuclear, whereas PDK4 staining was cytoplasmic (Figure

7A). A detailed report of the univariate and multivariate statistical analyses of TBX3 and PDK4

expression, generated in a manner blinded to the statistician, is presented in the STAR Methods file;

only statistically significant data are described below.

PDK4 expression levels correlated with ER+/PR+/HER2-, whereas TBX3 expression

levels correlated with tumor grades and stage. Higher grade/stage tumors had higher TBX3

compared to lower grade/stage tumors. In multivariable models treating the PDK4 H-score as

dichotomous, H-score category was significant for patients with ER+ tumors and for patients with ER+

tumors who were on endocrine therapy (Table 1). For the results that were significant, a higher

PDK4 H-score was correlated with lower survival. In multivariable models treating the H-score

as a continuous variable, the H-score was significant in patients who were ER+, patients on

endocrine therapy, patients who were ER+ and on endocrine therapy, and patients who were


ER+/PR+/HER2-. A few Kaplan-Meier curves of disease-free survival analyses are shown in

Figure 7B.
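To illustrate the type of analysis summarized above, the following minimal Python sketch dichotomizes H-scores at a cutoff and computes a Kaplan-Meier estimate for each resulting group. The cutoff, the function names, and the toy follow-up data are hypothetical and are not taken from this study; the models and cutoffs actually used are those reported in the STAR Methods.

```python
from typing import Dict, List, Sequence, Tuple


def dichotomize(h_scores: Sequence[float], cutoff: float) -> List[str]:
    """Label each H-score 'high' (>= cutoff) or 'low' (< cutoff).

    The cutoff is a hypothetical parameter, not the one used in the study.
    """
    return ["high" if h >= cutoff else "low" for h in h_scores]


def kaplan_meier(times: Sequence[float], events: Sequence[int]) -> List[Tuple[float, float]]:
    """Kaplan-Meier estimate as (event time, survival probability) steps.

    events[i] is 1 if the event (e.g. recurrence) was observed at times[i]
    and 0 if the patient was censored at times[i].
    """
    curve = []
    survival = 1.0
    # Survival only changes at times where at least one event was observed.
    for t in sorted({t for t, e in zip(times, events) if e == 1}):
        at_risk = sum(1 for x in times if x >= t)
        observed = sum(1 for x, e in zip(times, events) if x == t and e == 1)
        survival *= 1.0 - observed / at_risk
        curve.append((t, survival))
    return curve


def km_by_group(
    h_scores: Sequence[float],
    times: Sequence[float],
    events: Sequence[int],
    cutoff: float,
) -> Dict[str, List[Tuple[float, float]]]:
    """Split patients by dichotomized H-score and estimate one curve per group."""
    labels = dichotomize(h_scores, cutoff)
    curves = {}
    for group in ("low", "high"):
        idx = [i for i, g in enumerate(labels) if g == group]
        curves[group] = kaplan_meier([times[i] for i in idx], [events[i] for i in idx])
    return curves


# Toy data (hypothetical): H-score, follow-up in months, event indicator.
h_scores = [20, 150, 40, 200, 10, 180]
months = [12, 60, 30, 90, 45, 120]
events = [1, 0, 1, 1, 0, 0]
curves = km_by_group(h_scores, months, events, cutoff=100)
for group, steps in curves.items():
    print(group, steps)
```

The sketch only estimates the curves. A formal comparison between groups would require a log-rank test or a multivariable model, neither of which is implemented here.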

In the case of TBX3, in multivariable models treating the H-score as dichotomous, H-

score category was significant for patients with tumors that were ER+ and not on endocrine

therapy and patients who were not ER+/PR+/HER2- (Table 1). For the results that were

significant, the lower TBX3 H-score was correlated with lower survival. In multivariable

models treating the H-score as continuous, the H-score was significant in patients who were not

ER+/PR+/HER2-.

Since our TMA had more than 300 ER+ cases, we were able to perform subgroup

analyses that included all three markers: ER, TBX3 and PDK4. The analyses included

ER+/TBX3+/PDK4+, ER+/TBX3+/PDK4low, ER+/TBX3low/PDK4high, and

ER+/TBX3low/PDK4low. Although we did not observe any difference in overall survival between

these groups, disease-free survival was shorter for patients with tumors displaying

ER+/TBX3low/PDK4low expression patterns compared to ER+/TBX3+/PDK4+ expression

patterns (Figure 7B). Among 399 ER+ tumors, 138 displayed ER+/TBX3+/PDK4+

characteristics, whereas 57 showed ER+/TBX3low/PDK4low characteristics. These results indicate

that ER+ breast cancers can be subclassified into at least four distinct subtypes based on TBX3

and PDK4 expression patterns, potentially representing four different cells-of-origin of ER+

breast cancers.
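The four-way subgrouping described above can be sketched as follows. This is an illustrative Python example only: the cutoffs, function names, and toy H-score pairs are hypothetical. The only values taken from the text are the counts 138, 57, and 399, which are used to compute the corresponding proportions.

```python
from collections import Counter
from typing import Iterable, Tuple


def expression_group(tbx3_h: float, pdk4_h: float,
                     tbx3_cutoff: float, pdk4_cutoff: float) -> str:
    """Assign an ER+ tumor to one of four TBX3/PDK4 groups.

    Both cutoffs are hypothetical; the study's thresholds are not assumed here.
    """
    tbx3 = "TBX3+" if tbx3_h >= tbx3_cutoff else "TBX3low"
    pdk4 = "PDK4+" if pdk4_h >= pdk4_cutoff else "PDK4low"
    return f"ER+/{tbx3}/{pdk4}"


def group_counts(h_score_pairs: Iterable[Tuple[float, float]],
                 tbx3_cutoff: float, pdk4_cutoff: float) -> Counter:
    """Count tumors per group from (TBX3 H-score, PDK4 H-score) pairs."""
    return Counter(
        expression_group(t, p, tbx3_cutoff, pdk4_cutoff) for t, p in h_score_pairs
    )


def percent_of(count: int, total: int) -> float:
    """Percentage of `total` represented by `count`, rounded to one decimal."""
    return round(100.0 * count / total, 1)


# Toy (TBX3, PDK4) H-score pairs for five hypothetical ER+ tumors.
toy_pairs = [(150, 120), (150, 10), (20, 130), (5, 8), (10, 0)]
counts = group_counts(toy_pairs, tbx3_cutoff=100, pdk4_cutoff=100)
print(counts)

# Proportions implied by the counts reported in the text (399 ER+ tumors).
print(percent_of(138, 399))  # ER+/TBX3+/PDK4+
print(percent_of(57, 399))   # ER+/TBX3low/PDK4low
```

With the counts reported above, ER+/TBX3+/PDK4+ tumors make up about 34.6% of the ER+ cases and ER+/TBX3low/PDK4low tumors about 14.3%.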


Discussion:

In this study, we present evidence for the presence of at least 13 different clusters of

epithelial cells in the normal breast and we suggest that the majority of breast cancers contain

gene expression signatures that overlap with gene expression signatures of two minor clusters of

the luminal progenitors and the rest with mature luminal cells. It is possible that cells in these

clusters (cluster 11, cluster 12, and cluster 1a, which combines clusters 1, 3, and 4) are the cancer-prone

population of normal breast epithelial cells. The findings of this study may permit breast cancer

classification based on cell-of-origin of tumors. In this respect, a recent study involving 10,000

tumors from 33 cancer types showed that cell-of-origin patterns dominate in the molecular

classification of tumors (Hoadley et al., 2018). Although each intrinsic subtype of breast cancer

ostensibly has a distinct cell-of-origin in the breast stem-progenitor-mature cell hierarchy (Prat

and Perou, 2009), we observed a cluster-enriched signature being represented in more than one

intrinsic subtype and an intrinsic subtype being represented in more than one cluster of epithelial

cells. Existing technologies do not permit experimental validation of cells-of-origin of tumors

but the use of techniques such as single cell RNA-seq may allow further refinement of cancer

classification based on their presumptive cell-of-origin.

Complexities in breast epithelial cell types, past and present: Since scRNA-seq

technology is still evolving and requires constant improvement at every step, from the

source of tissues to dissociation protocols, sequencing techniques, and bioinformatics tools (Lim

et al., 2020), it is likely that clusters that we identified here will undergo further refinement in the

future. Thus, it is appropriate to compare what has been done in the past to the current data.

There has been a limited number of studies that utilized scRNA-seq technology to subclassify

breast epithelial cells. Two publications to our knowledge utilized tissues from reduction


mammoplasty samples and cells were either purified by flow cytometry or grown under organoid

cultures prior to single cell sequencing (Nguyen et al., 2018; Rosenbluth et al., 2020). Although

our source of tissue and methodology differed significantly from those of these studies, as we used breast

tissues from healthy women and were able to prepare single cell cDNA within two hours of

tissue collection, there were several overlapping observations. For example, the L2 luminal

differentiated cell cluster described by Nguyen et al., and luminal differentiated cluster 3 in our

study are enriched for ANKRD30A (Nguyen et al., 2018). Similarly, luminal progenitor cluster

L1 in that study and our luminal progenitor subcluster 10 are enriched for the expression of SLPI

and ANXA1. Similar to that study, our luminal mature cell subcluster 4 was enriched for PIP. A

basal subcluster identified by Nguyen et al., (Nguyen et al., 2018) and our cluster 5, which is

basal, both expressed TCF4. There were a few differences. Nguyen et al. (Nguyen et al., 2018)

suggested that ACTA2, which codes for α-SMA, distinguishes basal/myoepithelial cells from

other cell types. Although we observed expected enrichment of ACTA2 in basal subclusters, it is

also expressed in a distinct subcluster of luminal progenitor cells (Cluster 6). This cluster is

enriched for MYLK, an actin binding protein, regulated by ZEB1/miR-200 feedback loop

associated with epithelial to mesenchymal transition (Sundararajan et al., 2015). It is possible

that cluster 6 cells correspond to naturally occurring luminal/basal hybrid cells that can trans-

differentiate based on environmental cues or to mixed luminal/basal lineage cells described in

the mouse mammary gland (Pal et al., 2017).

Two studies have described distinct cell types in the mouse mammary gland during

different stages of development (Bach et al., 2017; Pal et al., 2017). Similar to differences in the

number of epithelial subgroups identified in two human studies (three by Nguyen et al., and 13

by us), Bach et al identified 15 clusters of mouse mammary epithelial cells through single cell


sequencing of sorted EpCAM+ cells (Bach et al., 2017). Pal et al identified seven clusters of

epithelial cells (Pal et al., 2017). There is some overlap in genes expressed in specific clusters of

the mouse mammary gland and the human breast clusters identified in our study. Similar to our results, Pal et al

showed ACTA2 expression in both basal cells and two small subclusters of luminal cells (Pal et

al., 2017). CXCL14 expression was found in a subset of luminal progenitor/intermediate cells

and basal cells. Gene expression in hormone sensing cells that included ESR1 and FOXA1

showed similarity in expression between Bach et al., and our studies (Bach et al., 2017). Pal et al

identified SFRP1 as a marker of pre-pubertal mammary epithelial cells, which decreased after

puberty (Pal et al., 2017). In our analysis, SFRP1 expressing cells were luminal progenitor cells.

We noted one major difference between mouse and human mammary epithelial cells. In mouse

mammary gland, TSPAN8 expressing cells are considered quiescent stem cells (Fu et al., 2018),

but we found TSPAN8 expression to be restricted to cells in cluster 4, which is a mature luminal

subcluster. Thus, a few of the stem cell markers may show species-specific variability.

Basal, luminal progenitor and mature luminal cells are defined using cell surface markers

CD49f and EpCAM (Visvader and Stingl, 2014). However, these markers are not ideal for in situ

estimation of three cell types. A closer look at genes enriched in each cluster and their signature

genes revealed that XBP1 is expressed predominantly in luminal mature cells, whereas SFRP1 is

expressed predominantly in luminal progenitor cells. Multiple genes are enriched in basal cell

clusters including CD96, MECOM, CLDN5, and CD93. These genes can be used in the future for

in situ estimation of composition of the breast.

Gene signatures of epithelial clusters and their relevance to breast cancer: Our single cell

studies confirmed prior reports of basal breast cancers originating from luminal progenitor cells.

The majority of basal breast cancers in both TCGA and METABRIC datasets were enriched for


genes found in luminal progenitor clusters 2, 11 and 12. Two minor clusters that we identified,

clusters 11 and 12, displayed several unique characteristics. Cluster 11 is enriched for the

expression of proliferation-associated genes, including MKI67, CDK1, and TK1, and apoptosis

regulators (BIRC5, DEPDC1). It co-expressed TOP2A, BIRC5 and NEK2, which form a

functional protein-protein network in luminal A breast cancer (Nuncia-Cantarero et al., 2018). In

our second set of analyses, we found this cluster to be enriched for APOBEC3B, whose elevated

expression is associated with a higher mutation rate (Olson et al., 2018). Indeed, breast tumors

with cluster 11 signatures displayed the highest number of mutations per tumor compared to

tumors enriched for genes of other clusters. In fact, 15 out of 34 genes of this cluster have

been shown to be present in the rarest cluster of breast tumor cells with a highly malignant

phenotype (Gao et al., 2017). Although this cluster represented 1.5% of epithelial cells, ~20% of

breast cancers in TCGA and METABRIC datasets were enriched for genes of this cluster. There

is a trend toward a higher number of cluster 11 cells in young donors, but this observation needs to be

confirmed with a larger number of samples (Table S2). Collectively, these observations reinforce

the possibility that this rare subpopulation of cells is the cell-of-origin of a significant number

of breast cancers. Based on the expression of MKI67 as well as pathway analyses, cluster 11

contains rapidly proliferating cells, which may predispose cells to aberrant chromosome

segregation and mutations.

Cluster 12 was enriched for IGF1, MEG3, PTGDS and SRPX. Among these genes,

MEG3 codes for a long non-coding RNA and is dysregulated in multiple cancers (He et al., 2017).

Although this cluster appeared minor in our first analyses, when we sequenced additional

samples and did combined analyses, the number of cells representing this cluster increased and

we were able to identify additional marker genes including ZEB1 and TCF4. Co-expression of


TCF4 and ZEB1 indicates that these cells are similar to a subset of basal cells described by

Nguyen et al., but these cells lack the expression of CD49f to be considered as basal cells

(Nguyen et al., 2018). To a certain extent, these cells are similar to mammary stem-like cells

described by Morel et al., and ZEB1+ cells we described previously (Morel et al., 2017;

Nakshatri et al., 2019). Because the gene signatures from this cluster are heavily represented in

almost all intrinsic subtypes of breast cancer, additional studies are needed to determine the role

of these cells in tumorigenesis.

With respect to ER+ breast cancers, these cancers can originate from both luminal mature

and luminal progenitor cells as both luminal A and luminal B breast cancers displayed gene

signatures enriched in cluster 1a (clusters 1, 3 and 4), and clusters 11 and 12. There appear to be

quantitative differences in ESR1 between these clusters, as cluster 1a expressed the highest level of

ESR1 followed by clusters 11 and 12 (Table S2). Why ER+ breast cancers with a luminal

progenitor gene expression pattern are associated with better outcome than those with a mature luminal

gene expression pattern is an intriguing question that needs to be explored. Using an independent

TMA, we were able to further subclassify mature luminal cell-derived ER+ tumors based on

TBX3 and PDK4 expression. While the role of TBX3 in ER activity and its mutations in ER+

lobular carcinomas have been described in the literature (Ciriello et al., 2015; Razavi et al.,

2018), there are no studies that functionally linked PDK4 to ER. PDK4 is a cytoplasmic kinase

involved in the TCA cycle and induces metabolic changes in transformed cells (Coloff and

Brugge, 2017). Whether it also modulates transcription by targeting transcription machinery

needs further investigation.

ScRNA-seq as well as single cell protein sequencing have been used to subclassify breast

cancers and to identify treatment-resistant populations. For example, Karaayvaz et al. described an


aggressive disease-associated gene signature related to glycosphingolipid metabolism by

scRNA-seq of six TNBCs (Karaayvaz et al., 2018). However, pathway analyses of our cluster

enriched genes did not identify a normal breast epithelial cluster enriched for this pathway.

Similarly, signatures derived through scRNA-seq of a chemoresistant subpopulation of TNBCs

did not show overlap with any of our normal cell clusters (Kim et al., 2018). Genes in the breast

cancer-specific RNA signature that detect circulating tumor cells did not show overlap with any

specific cluster, but genes like PIP, CXCL14, and SFRP2, which are markers of circulating

tumor cells, are enriched in clusters 4, 6 and 12, respectively (Kwan et al., 2018). Thus, genomic

aberrations and transcription programming rather than cell-of-origin may have given rise to drug

resistant and metastatic subpopulations of tumor cells.

A recent single cell proteomics study observed expression of EMT-associated proteins in triple

negative breast cancers as well as luminal A and luminal B breast cancers (Wagner et al., 2019).

Cluster 12 (and its equivalent in the second analyses N0-N1) naturally expressed higher levels of

EMT-associated genes, and signatures of this cluster were enriched in all intrinsic subtypes. Thus,

expression of EMT-associated genes may not be an acquired phenomenon in all breast cancers.

From a basic research point of view, the results presented here provide an opportunity to

determine whether a gene is truly differentially expressed in tumors compared to normal, as the

expression pattern of a specific gene in the tumor could be a reflection of its cell-of-origin. Using

single cell RT-PCR of tumor adjacent normal and tumor cells from the same individual, we had

previously demonstrated that elevated expression of a few genes in tumors can be attributed to the

cell-of-origin of the tumor rather than to tumor-specific genomic aberrations (Anjanappa et al., 2017).

Resources created here will be made available to researchers and can be mined for

expression patterns of specific genes in various epithelial clusters of the normal breast using a

tool such as the Loupe Browser of 10X Genomics. For example, genes such as MUCL1, PIP, and TP63

are expressed in a specific subpopulation of normal breast epithelial cells and their

overexpression in tumors may reflect cell-of-origin of tumors instead of tumor-specific aberrant

expression. This approach would also allow streamlining of drug discovery efforts by focusing

on targets that are truly differentially expressed in tumors due to genomic aberrations.


ACKNOWLEDGMENTS

We thank the countless number of women who donated normal and malignant breast tissues for

research. We also thank the volunteers who facilitated this tissue collection. Special thanks to

members of the Komen Tissue Bank including Ms. Jill Henry, Alison Hughes, Pam Rockey,

Julia Rose von Arx, Rana German and Dr. Natascia Marino as well as the IU Simon Cancer

Center tissue procurement facility for providing tissues and related data. Funding: The

Catherine Peachy Fund of the Heroes Foundation family (HN), Breast Cancer Research

Foundation (HN), Chan-Zuckerberg Initiative Human Atlas Project (HN, AS, YL), Susan G.

Komen for the Cure (AS), and Vera Bradley Foundation for Breast Cancer Research (IUSM).

Walther Cancer Institute provided support to Cancer Bioinformatics Core.

AUTHOR CONTRIBUTIONS

Conception and design: HN

Development of methodology: PBN, PCM, XX, LS, JW, GS, HG, YL, HN

Acquisition of data: PBN, PCM, AC, GS.

Analysis and interpretation of data: PBN, AMS, HG, YL, LS, JW, HN

Writing, review and/or revision of the manuscript: HG, LS, AMS, HN

Administrative, technical, or material support: AMS, HN

Study supervision: HN

DECLARATION OF INTERESTS

Authors have no conflict of interest to declare.


Figure Legend:

Figure 1: The normal breast contains 13 epithelial clusters. A) Integrated analysis of single

cells of the normal breast biopsies of five healthy donors. Epithelial cells dominate among cell

types. B) Subclustering of epithelial cell types using CD49f/EpCAM as well as NFIB, TP63,

EHF, ELF5, ESR1 and FOXA1 expression patterns. C) Representation of various cell types in

each sample. Subclusters in individual samples are shown in Figure S1A. D) Hierarchical

clustering of top cluster-enriched genes.

Figure 2: Expression patterns of representative cluster-enriched genes. A) Genes enriched

in basal/stem cell clusters. B) Genes enriched in various clusters within luminal progenitor cells.

Figure 3: Mature luminal cells are enriched for ESR1 and XBP1, whereas SFRP1 is

enriched in luminal progenitor cells. A) Genes enriched in mature luminal cells. Note that

cluster 4 within mature luminal cells is distinctly enriched for MUCL1 and PIP. B) Various cell

types in the normal breast of a donor. C) Identification of ESR1 expressing subclusters and genes

co-expressed with ESR1 in the normal breasts.

Figure 4: Recharacterization of epithelial cells of the normal breasts with additional

samples. A) Combined integrated analyses that included samples in Figure 1, a new sample from

an Asian (Chinese) donor, and five pooled new samples. There were 23 clusters of cells, which can be

subdivided into three major groups of basal/stem, luminal progenitor, and mature luminal cells.

Potential myoepithelial cells (Myo) and mammary stem cells (MaSc) distinct from basal/stem

cells are also indicated at the bottom. The bottom panel shows distribution patterns of cell


clusters in five samples of the first set and the five pooled samples of the second set. Clusters in

individual samples are shown in Figure S1B. Expression patterns of various markers that are

used to subclassify clusters are shown in Figure S2. B) CD49f, EpCAM, ALDH1A3, and KRT14

expression in various clusters.

Figure 5: Gene expression in clusters N19 and N0-N1 of Figure 4 overlaps with unique genes

in C11 and C12, respectively. A) MKI67, BIRC5, and PCLAF, which are all overexpressed in

cluster 11 (Figure 1D), are enriched in N19. B) PTGDS and IGF1, which are overexpressed in

cluster 12 (Figure 1D), are enriched in N0-N1 clusters. This cluster also expresses ZEB1 and

EGFR.

Figure 6: Breast cancer subtype specific expression of cluster-signature genes. Breast

cancer gene expression data in TCGA (left) and METABRIC datasets were analyzed for

enrichment of cluster-specific genes described in Table S2. Clusters 1, 3 and 4 were combined to

create cluster 1a because of limited differences. Similarly, clusters 5, 7 and 9 were combined to

create cluster 5a. PAM50 intrinsic subtype classifiers were used to subdivide breast cancers into

luminal A, luminal B, HER2, and basal subtypes. Enrichment of cluster-specific genes in these

subtypes of breast cancer was further analyzed. Additional data can be found in Figure S3.

Figure 7: PDK4 and TBX3 enable further classification of ER+ breast cancers. A)

Immunohistochemistry of breast TMA for PDK4 and TBX3. B) ER+ breast cancers expressing

lower levels of PDK4, compared to tumors with higher PDK4, among patients who had not received endocrine therapy

were associated with poor disease-free survival (DFS). Similarly, ER+ tumors expressing lower


levels of both TBX3 and PDK4 compared to tumors expressing higher levels of PDK4 and

TBX3 were associated with poor DFS.


STAR*METHODS

Detailed methods are provided in this paper and include the following:

• KEY RESOURCES TABLE

• CONTACT FOR REAGENT AND RESOURCE SHARING

• EXPERIMENTAL MODEL AND SUBJECT DETAILS

• METHOD DETAILS

o Tissue collection and processing

o Single cell sequencing and data analyses

o Breast cancer TMA

o Public Database analyses

• STATISTICAL ANALYSIS

SUPPLEMENTAL INFORMATION

Supplemental information includes four figures and four tables and can be found online.


References:

Anjanappa, M., Cardoso, A., Cheng, L., Mohamad, S., Gunawan, A., Rice, S., Dong, Y., Li, L., Sandusky, G. E., Srour, E. F., and Nakshatri, H. (2017). Individualized Breast Cancer Characterization through Single-Cell Analysis of Tumor and Adjacent Normal Cells. Cancer Res 77, 2759-2769. Bach, K., Pensa, S., Grzelak, M., Hadfield, J., Adams, D. J., Marioni, J. C., and Khaled, W. T. (2017). Differentiation dynamics of mammary epithelial cells revealed by single-cell RNA sequencing. Nature communications 8, 2128. Baharudin, R., Tieng, F. Y. F., Lee, L. H., and Ab Mutalib, N. S. (2020). Epigenetics of SFRP1: The Dual Roles in Human Cancers. Cancers (Basel) 12. Bartoschek, M., Oskolkov, N., Bocci, M., Lovrot, J., Larsson, C., Sommarin, M., Madsen, C. D., Lindgren, D., Pekar, G., Karlsson, G., et al. (2018). Spatially and functionally distinct subclasses of breast cancer-associated fibroblasts revealed by single cell RNA sequencing. Nature communications 9, 5150. Bhandari, V., Hoey, C., Liu, L. Y., Lalonde, E., Ray, J., Livingstone, J., Lesurf, R., Shiah, Y. J., Vujcic, T., Huang, X., et al. (2019). Molecular landmarks of tumor hypoxia across cancer types. Nat Genet 51, 308-318. Butler, A., Hoffman, P., Smibert, P., Papalexi, E., and Satija, R. (2018). Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nature biotechnology 36, 411-420. Cancer Genome Atlas, N. (2012). Comprehensive molecular portraits of human breast tumours. Nature 490, 61-70. Cerami, E., Gao, J., Dogrusoz, U., Gross, B. E., Sumer, S. O., Aksoy, B. A., Jacobsen, A., Byrne, C. J., Heuer, M. L., Larsson, E., et al. (2012). The cBio cancer genomics portal: an open platform for exploring multidimensional cancer genomics data. Cancer discovery 2, 401-404. Ciriello, G., Gatza, M. L., Beck, A. H., Wilkerson, M. D., Rhie, S. K., Pastore, A., Zhang, H., McLellan, M., Yau, C., Kandoth, C., et al. (2015). 
Comprehensive Molecular Portraits of Invasive Lobular Breast Cancer. Cell 163, 506-519. Colacino, J. A., Azizi, E., Brooks, M. D., Harouaka, R., Fouladdel, S., McDermott, S. P., Lee, M., Hill, D., Madden, J., Boerner, J., et al. (2018). Heterogeneity of Human Breast Stem and Progenitor Cells as Revealed by Transcriptional Profiling. Stem cell reports 10, 1596-1609. Coloff, J. L., and Brugge, J. S. (2017). Metabolic changes promote rejection of oncogenic cells. Nat Cell Biol 19, 414-415. Curtis, C., Shah, S. P., Chin, S. F., Turashvili, G., Rueda, O. M., Dunning, M. J., Speed, D., Lynch, A. G., Samarajiwa, S., Yuan, Y., et al. (2012). The genomic and transcriptomic architecture of 2,000 breast tumours reveals novel subgroups. Nature 486, 346-352. Davis, M. B., Walens, A., Hire, R., Mumin, K., Brown, A. M., Ford, D., Howerth, E. W., and Monteil, M. (2015). Distinct Transcript Isoforms of the Atypical Chemokine Receptor 1 (ACKR1)/Duffy Antigen Receptor for Chemokines (DARC) Gene Are Expressed in Lymphoblasts and Altered Isoform Levels Are Associated with Genetic Ancestry and the Duffy- Null Allele. PloS one 10, e0140098. Degnim, A. C., Visscher, D. W., Hoskin, T. L., Frost, M. H., Vierkant, R. A., Vachon, C. M., Shane Pankratz, V., Radisky, D. C., and Hartmann, L. C. (2012). Histologic findings in normal breast tissues: comparison to reduction mammaplasty and benign breast disease tissues. Breast Cancer Res Treat 133, 169-177.


Ellrott, K., Bailey, M. H., Saksena, G., Covington, K. R., Kandoth, C., Stewart, C., Hess, J., Ma, S., Chiotti, K. E., McLellan, M., et al. (2018). Scalable Open Science Approach for Mutation Calling of Tumor Exomes Using Multiple Genomic Pipelines. Cell Syst 6, 271-281 e277. Fu, N. Y., Pal, B., Chen, Y., Jackling, F. C., Milevskiy, M., Vaillant, F., Capaldo, B. D., Guo, F., Liu, K. H., Rios, A. C., et al. (2018). Foxp1 Is Indispensable for Ductal Morphogenesis and Controls the Exit of Mammary Stem Cells from Quiescence. Dev Cell 47, 629-644 e628. Gao, J., Aksoy, B. A., Dogrusoz, U., Dresdner, G., Gross, B., Sumer, S. O., Sun, Y., Jacobsen, A., Sinha, R., Larsson, E., et al. (2013). Integrative analysis of complex cancer genomics and clinical profiles using the cBioPortal. Science signaling 6, pl1. Gao, Q., Liang, W. W., Foltz, S. M., Mutharasu, G., Jayasinghe, R. G., Cao, S., Liao, W. W., Reynolds, S. M., Wyczalkowski, M. A., Yao, L., et al. (2018). Driver Fusions and Their Implications in the Development and Treatment of Human Cancers. Cell reports 23, 227-238 e223. Gao, R., Kim, C., Sei, E., Foukakis, T., Crosetto, N., Chan, L. K., Srinivasan, M., Zhang, H., Meric-Bernstam, F., and Navin, N. (2017). Nanogrid single-nucleus RNA sequencing reveals phenotypic diversity in breast cancer. Nature communications 8, 228. Gu, Z., Eils, R., and Schlesner, M. (2016). Complex heatmaps reveal patterns and correlations in multidimensional genomic data. Bioinformatics 32, 2847-2849. Gupta, P. B., Pastushenko, I., Skibinski, A., Blanpain, C., and Kuperwasser, C. (2019). Phenotypic Plasticity: Driver of Cancer Initiation, Progression, and Therapy Resistance. Cell Stem Cell 24, 65-78. He, Y., Luo, Y., Liang, B., Ye, L., Lu, G., and He, W. (2017). Potential applications of MEG3 in cancer diagnosis and prognosis. Oncotarget 8, 73282-73295. Hoadley, K. A., Yau, C., Hinoue, T., Wolf, D. M., Lazar, A. J., Drill, E., Shen, R., Taylor, A. M., Cherniack, A. D., Thorsson, V., et al. 
(2018). Cell-of-Origin Patterns Dominate the Molecular Classification of 10,000 Tumors from 33 Types of Cancer. Cell 173, 291-304 e296. Karaayvaz, M., Cristea, S., Gillespie, S. M., Patel, A. P., Mylvaganam, R., Luo, C. C., Specht, M. C., Bernstein, B. E., Michor, F., and Ellisen, L. W. (2018). Unravelling subclonal heterogeneity and aggressive disease states in TNBC through single-cell RNA-seq. Nature communications 9, 3588. Kassambara, A., Kosinski, M., and Biecek, P (2019). Drawing Survival Curves using 'ggplot2'. R package version 0.4.6. . https://CRANR-projectorg/package=survminer. Kfoury, Y., and Scadden, D. T. (2015). Mesenchymal cell contributions to the stem cell niche. Cell Stem Cell 16, 239-253. Kim, C., Gao, R., Sei, E., Brandt, R., Hartman, J., Hatschek, T., Crosetto, N., Foukakis, T., and Navin, N. E. (2018). Chemoresistance Evolution in Triple-Negative Breast Cancer Delineated by Single-Cell Sequencing. Cell 173, 879-893 e813. Kwan, T. T., Bardia, A., Spring, L. M., Giobbie-Hurder, A., Kalinich, M., Dubash, T., Sundaresan, T., Hong, X., LiCausi, J. A., Ho, U., et al. (2018). A Digital RNA Signature of Circulating Tumor Cells Predicting Early Therapeutic Response in Localized and Metastatic Breast Cancer. Cancer discovery 8, 1286-1299. Lim, B., Lin, Y., and Navin, N. (2020). Advancing Cancer Research and Medicine with Single- Cell Genomics. Cancer Cell 37, 456-470. Lim, E., Vaillant, F., Wu, D., Forrest, N. C., Pal, B., Hart, A. H., Asselin-Labat, M. L., Gyorki, D. E., Ward, T., Partanen, A., et al. (2009). Aberrant luminal progenitors as the candidate target population for basal tumor development in BRCA1 mutation carriers. Nat Med 15, 907-913.

Liu, J., Lichtenberg, T., Hoadley, K. A., Poisson, L. M., Lazar, A. J., Cherniack, A. D., Kovatich, A. J., Benz, C. C., Levine, D. A., Lee, A. V., et al. (2018). An Integrated TCGA Pan-Cancer Clinical Data Resource to Drive High-Quality Survival Outcome Analytics. Cell 173, 400-416 e411.
Luk, I. Y., Reehorst, C. M., and Mariadason, J. M. (2018). ELF3, ELF5, EHF and SPDEF Transcription Factors in Tissue Homeostasis and Cancer. Molecules 23.
Marcato, P., Dean, C. A., Pan, D., Araslanova, R., Gillis, M., Joshi, M., Helyer, L., Pan, L., Leidal, A., Gujar, S., et al. (2011). Aldehyde dehydrogenase activity of breast cancer stem cells is primarily due to isoform ALDH1A3 and its expression is predictive of metastasis. Stem Cells 29, 32-45.
McBryan, J., Howlin, J., Kenny, P. A., Shioda, T., and Martin, F. (2007). ERalpha-CITED1 co-regulated genes expressed during pubertal mammary gland development: implications for breast cancer prognosis. Oncogene 26, 6406-6419.
McCarthy, D. J., Campbell, K. R., Lun, A. T., and Wills, Q. F. (2017). Scater: pre-processing, quality control, normalization and visualization of single-cell RNA-seq data in R. Bioinformatics 33, 1179-1186.
Morel, A. P., Ginestier, C., Pommier, R. M., Cabaud, O., Ruiz, E., Wicinski, J., Devouassoux-Shisheboran, M., Combaret, V., Finetti, P., Chassot, C., et al. (2017). A stemness-related ZEB1-MSRB3 axis governs cellular pliancy and breast cancer genome stability. Nat Med 23, 568-578.
Nakshatri, H., Anjanappa, M., and Bhat-Nakshatri, P. (2015). Ethnicity-Dependent and -Independent Heterogeneity in Healthy Normal Breast Hierarchy Impacts Tumor Characterization. Scientific reports 5, 13526.
Nakshatri, H., Kumar, B., Burney, H. N., Cox, M. L., Jacobsen, M., Sandusky, G. E., D'Souza-Schorey, C., and Storniolo, A. M. V. (2019). Genetic ancestry-dependent differences in breast cancer-induced field defects in the tumor-adjacent normal breast. Clin Cancer Res 25, 2848-2859.
Nguyen, Q. H., Pervolarakis, N., Blake, K., Ma, D., Davis, R. T., James, N., Phung, A. T., Willey, E., Kumar, R., Jabart, E., et al. (2018). Profiling human breast epithelial cells using single cell RNA sequencing identifies cell diversity. Nature communications 9, 2028.
Nolan, E., Lindeman, G. J., and Visvader, J. E. (2017). Out-RANKing BRCA1 in Mutation Carriers. Cancer Res 77, 595-600.
Nuncia-Cantarero, M., Martinez-Canales, S., Andres-Pretel, F., Santpere, G., Ocana, A., and Galan-Moya, E. M. (2018). Functional transcriptomic annotation and protein-protein interaction network analysis identify NEK2, BIRC5, and TOP2A as potential targets in obese patients with luminal A breast cancer. Breast Cancer Res Treat 168, 613-623.
Olson, M. E., Harris, R. S., and Harki, D. A. (2018). APOBEC Enzymes as Targets for Virus and Cancer Therapy. Cell Chem Biol 25, 36-49.
Pal, B., Chen, Y., Vaillant, F., Jamieson, P., Gordon, L., Rios, A. C., Wilcox, S., Fu, N., Liu, K. H., Jackling, F. C., et al. (2017). Construction of developmental lineage relationships in the mouse mammary gland by single-cell RNA profiling. Nature communications 8, 1627.
Parker, J. S., Mullins, M., Cheang, M. C., Leung, S., Voduc, D., Vickery, T., Davies, S., Fauron, C., He, X., Hu, Z., et al. (2009). Supervised risk predictor of breast cancer based on intrinsic subtypes. J Clin Oncol 27, 1160-1167.
Pascual, G., Avgustinova, A., Mejetta, S., Martin, M., Castellanos, A., Attolini, C. S., Berenguer, A., Prats, N., Toll, A., Hueto, J. A., et al. (2017). Targeting metastasis-initiating cells through the fatty acid receptor CD36. Nature 541, 41-45.

Pellacani, D., Bilenky, M., Kannan, N., Heravi-Moussavi, A., Knapp, D.J.H.F., Gakkhar, S., Moksa, M., Carles, A., Moore, R., Mungall, A.J., Marra, M.A., Jones, S.J.M., Aparicio, S., Hirst, M., and Eaves, C.J. (2016). Analysis of normal human mammary epigenomes reveals cell-specific active enhancer states and associated transcription factor networks. Cell reports 17, 2060-2074.
Pereira, B., Chin, S. F., Rueda, O. M., Vollan, H. K., Provenzano, E., Bardwell, H. A., Pugh, M., Jones, L., Russell, R., Sammut, S. J., et al. (2016). The somatic mutation profiles of 2,433 breast cancers refines their genomic and transcriptomic landscapes. Nature communications 7, 11479.
Perkins, S. M., Bales, C., Vladislav, T., Althouse, S., Miller, K. D., Sandusky, G., Badve, S., and Nakshatri, H. (2015). TFAP2C expression in breast cancer: correlation with overall survival beyond 10 years of initial diagnosis. Breast Cancer Res Treat 152, 519-531.
Prat, A., Parker, J. S., Karginova, O., Fan, C., Livasy, C., Herschkowitz, J. I., He, X., and Perou, C. M. (2010). Phenotypic and molecular characterization of the claudin-low intrinsic subtype of breast cancer. Breast Cancer Res 12, R68.
Prat, A., and Perou, C. M. (2009). Mammary development meets cancer genomics. Nat Med 15, 842-844.
Proia, T. A., Keller, P. J., Gupta, P. B., Klebba, I., Jones, A. D., Sedic, M., Gilmore, H., Tung, N., Naber, S. P., Schnitt, S., et al. (2011). Genetic predisposition directs breast cancer phenotype by dictating progenitor cell fate. Cell Stem Cell 8, 149-163.
Razavi, P., Chang, M. T., Xu, G., Bandlamudi, C., Ross, D. S., Vasan, N., Cai, Y., Bielski, C. M., Donoghue, M. T. A., Jonsson, P., et al. (2018). The Genomic Landscape of Endocrine-Resistant Advanced Breast Cancers. Cancer Cell 34, 427-438 e426.
Rohlenova, K., Goveia, J., Garcia-Caballero, M., Subramanian, A., Kalucka, J., Treps, L., Falkenberg, K. D., de Rooij, L., Zheng, Y., Lin, L., et al. (2020). Single-Cell RNA Sequencing Maps Endothelial Metabolic Plasticity in Pathological Angiogenesis. Cell metabolism 31, 862-877 e814.
Rosenbluth, J. M., Schackmann, R. C. J., Gray, G. K., Selfors, L. M., Li, C. M., Boedicker, M., Kuiken, H. J., Richardson, A., Brock, J., Garber, J., et al. (2020). Organoid cultures from normal and cancer-prone human breast tissues preserve complex epithelial lineages. Nature communications 11, 1711.
Sanchez-Vega, F., Mina, M., Armenia, J., Chatila, W. K., Luna, A., La, K. C., Dimitriadoy, S., Liu, D. L., Kantheti, H. S., Saghafinia, S., et al. (2018). Oncogenic Signaling Pathways in The Cancer Genome Atlas. Cell 173, 321-337 e310.
Sato, T., Goyama, S., Kataoka, K., Nasu, R., Tsuruta-Kishino, T., Kagoya, Y., Nukina, A., Kumagai, K., Kubota, N., Nakagawa, M., et al. (2014). Evi1 defines leukemia-initiating capacity and tyrosine kinase inhibitor resistance in chronic myeloid leukemia. Oncogene 33, 5028-5038.
Sorlie, T., Perou, C. M., Tibshirani, R., Aas, T., Geisler, S., Johnsen, H., Hastie, T., Eisen, M. B., van de Rijn, M., Jeffrey, S. S., et al. (2001). Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications. Proc Natl Acad Sci U S A 98, 10869-10874.
Sotiriou, C., Neo, S. Y., McShane, L. M., Korn, E. L., Long, P. M., Jazaeri, A., Martiat, P., Fox, S. B., Harris, A. L., and Liu, E. T. (2003). Breast cancer classification and prognosis based on gene expression profiles from a population-based study. Proc Natl Acad Sci U S A 100, 10393-10398.
Stuart, T., Butler, A., Hoffman, P., Hafemeister, C., Papalexi, E., Mauck, W. M., 3rd, Hao, Y., Stoeckius, M., Smibert, P., and Satija, R. (2019). Comprehensive Integration of Single-Cell Data. Cell 177, 1888-1902 e1821.

Sundararajan, V., Gengenbacher, N., Stemmler, M. P., Kleemann, J. A., Brabletz, T., and Brabletz, S. (2015). The ZEB1/miR-200c feedback loop regulates invasion via actin interacting proteins MYLK and TKS5. Oncotarget 6, 27083-27096.
Taylor, A. M., Shih, J., Ha, G., Gao, G. F., Zhang, X., Berger, A. C., Schumacher, S. E., Wang, C., Hu, H., Liu, J., et al. (2018). Genomic and Functional Approaches to Understanding Cancer Aneuploidy. Cancer Cell 33, 676-689 e673.
Team, R. C. (2020). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.
Tibshirani, R., Hastie, T., Narasimhan, B., and Chu, G. (2002). Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proc Natl Acad Sci U S A 99, 6567-6572.
Visvader, J. E., and Stingl, J. (2014). Mammary stem cells and the differentiation hierarchy: current status and perspectives. Genes Dev 28, 1143-1158.
Wagner, J., Rapsomaniki, M. A., Chevrier, S., Anzeneder, T., Langwieder, C., Dykgers, A., Rees, M., Ramaswamy, A., Muenst, S., Soysal, S. D., et al. (2019). A Single-Cell Atlas of the Tumor and Immune Ecosystem of Human Breast Cancer. Cell 177, 1330-1345 e1318.
Walter, W., Thomalla, J., Bruhn, J., Fagan, D. H., Zehowski, C., Yee, D., and Skildum, A. (2015). Altered regulation of PDK4 expression promotes antiestrogen resistance in human breast cancer cells. Springerplus 4, 689.
Wang, D., Cai, C., Dong, X., Yu, Q. C., Zhang, X. O., Yang, L., and Zeng, Y. A. (2014). Identification of multipotent mammary stem cells by protein C receptor expression. Nature 517, 81-84.
Wickham, H. (2016). ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. ISBN 978-3-319-24277-4, http://ggplot2.org.
Zaret, K. S., and Carroll, J. S. (2011). Pioneer transcription factors: establishing competence for gene expression. Genes Dev 25, 2227-2241.

Table 1: Relationship between PDK4 and TBX3 H-scores and outcome.

                                                 PDK4                                       TBX3
Obs  Category group                H-Score       p-value  Point     Lower 95%  Upper 95%    p-value  Point     Lower 95%  Upper 95%
                                   Parameter*             Estimate  Wald CL    Wald CL               Estimate  Wald CL    Wald CL
1    OS                            High vs Low   0.0431   1.382     1.010      1.890        0.0333   0.721     0.533      0.974
2    OS for ER+                    High vs Low   0.0131   1.600     1.104      2.319        0.1884   0.789     0.554      1.123
3    OS for ER-                    High vs Low   0.5535   1.239     0.610      2.514        0.0709   0.503     0.238      1.060
4    OS for Endocrine Therapy=Yes  High vs Low   0.0847   1.431     0.952      2.152        0.3786   0.837     0.564      1.243
5    OS for Endocrine Therapy=No   High vs Low   0.0976   1.608     0.917      2.822        0.0505   0.589     0.346      1.001
6    OS for ER+ and ET=Yes         High vs Low   0.0300   1.614     1.048      2.488        0.9396   0.983     0.637      1.519
7    OS for ER+ and ET=No          High vs Low   0.1473   1.956     0.789      4.847        0.0323   0.422     0.191      0.930
8    OS for ER- and ET=Yes         High vs Low   0.9707   0.961     0.117      7.887        0.3166   0.324     0.036      2.939
9    OS for ER- and ET=No          High vs Low   0.2016   1.710     0.751      3.893        0.1874   0.559     0.235      1.327
10   OS for ER+/PR+/HER2-=Yes      High vs Low   0.1131   1.510     0.907      2.513        0.4029   0.792     0.458      1.368
11   OS for ER+/PR+/HER2-=No       High vs Low   0.7579   1.146     0.483      2.717        0.0036   0.250     0.099      0.635

*Referent group listed second. CL = Confidence Limit; ET = Endocrine therapy; OS = Overall survival; Obs = Observations.

STAR★METHODS:

REAGENT or RESOURCE                 SOURCE                 IDENTIFIER

ANTIBODIES
PDK4                                Abcam                  ab71240
TBX3                                Abcam                  ab99302

REAGENTS
Collagenase/Hyaluronidase           StemCell Technologies
Cryopreservation media              Lonza                  12-132A
ROCK Inhibitor Y-27632              Tocris                 1254
Gentle Collagenase/Hyaluronidase    StemCell Technologies  07919
Tumor dissociation kit (human)      Miltenyi Biotec        130-095-929
Red cell lysis buffer               Miltenyi Biotec        130-094-183
Debris removal kit                  Miltenyi Biotec        130-109-398
Chromium Single Cell 3’ reagents    10X Genomics           CG00052 Rev B or CG000183 Rev C
Bioanalyzer HSDNA chip              Agilent                G2943CA

SOFTWARE AND ALGORITHMS
CellRanger 2.1.0 or 3.0.2           10X Genomics           http://support.10xgenomics.com
Loupe Browser                       10X Genomics           https://support.10xgenomics.com/single-cell-gene-expression/software/visualization/latest/installation
SAS Version 9.4                                            www.sas.com

CONTACT FOR REAGENT AND RESOURCE SHARING

Further information and requests for reagents should be directed to and will be fulfilled by the

corresponding author: Harikrishna Nakshatri ([email protected])

EXPERIMENTAL MODEL AND SUBJECT DETAILS

Normal breast tissues: All breast tissues from healthy women were collected by the Komen

Normal Tissue Bank with informed consent and with the approval from the institutional review

board. Standard operating procedure for tissue collection is described on the Komen Tissue Bank

website. Per standard operating procedures, the normal breast biopsies were always collected

from the upper outer quadrant of the breasts. Within an average of six minutes from the time of

biopsy, tissues were either placed in growth media and transported to the lab for immediate

single cell sequencing or cryopreserved for single cell sequencing at a later date. Our

cryopreservation protocol has been described previously (Nakshatri et al., 2015). Briefly, tissue

was minced and placed in one ml of 50% growth media and 50% Lonza freezing media with 2

micromolar ROCK inhibitor. Vials with tissues were placed in CoolCell Containers (Nalgene)

and placed in a -80°C freezer overnight and then in liquid nitrogen. Tissue specimens were

thawed rapidly at 37°C and then washed extensively in growth media prior to dissociation.

Tissue specimens represented women of different race, age, parity, menstrual phase, and BMI

(Table S1).

Tissue dissociation procedure, cDNA library preparation and sequencing: We used the

human tumor dissociation kit from Miltenyi Biotec (130-095-929) to generate single cells from

tissue specimens. Red blood cell lysis buffer (130-094-183) and debris removal solution

(130-109-398) were used as needed to improve purity of single cells. Viability and single cell status were

determined via trypan blue staining and phase contrast microscopy. Samples with 80% or more

viability were utilized for the subsequent steps. Cells were suspended at ~100-800 cells/µl

depending on sample and subjected to cDNA library generation using 10X Genomics V2 (initial

study) or V3 (second set of samples) Chromium Single Cell 3’ Reagents (CG00052 Rev B and

CG000183 Rev C, respectively). We used HSDNA Chips on the Bioanalyzer from Agilent

Technologies (G2943CA) to quantify cDNA. cDNA was amplified using the Chromium Single

Cell Library kit v2 or v3. The resulting libraries were sequenced on an Illumina NovaSeq 6000 to a

read depth of ~50,000 reads per cell. For libraries made with the V2 kit, 26 bp of cell barcode

and UMI sequence and 91 bp of RNA read were generated; for libraries made with the V3 kit,

the paired-end reads were 28 bp plus 91 bp.

Analysis of scRNA-seq sequence data

CellRanger 2.1.0 or 3.0.2 (http://support.10xgenomics.com/) was utilized to process the

raw sequence data generated. Briefly, CellRanger used bcl2fastq (https://support.illumina.com/)

to demultiplex raw base sequence calls generated from the sequencer into sample-specific FASTQ

files. The FASTQ files were then aligned to the human reference genome GRCh38 with RNA-seq

aligner STAR. The aligned reads were traced back to individual cells, and the expression level

of each gene was quantified based on the number of UMIs (unique molecular identifiers)

detected in each cell.
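
The UMI-based quantification described above can be illustrated with a minimal Python sketch. It is not CellRanger's implementation, which among other things also corrects sequencing errors in barcodes and UMIs. The input layout, the function name count_umis, and the toy reads are assumptions made only for illustration.

```python
from collections import defaultdict

def count_umis(aligned_reads):
    """Collapse aligned reads into per-cell, per-gene UMI counts.

    aligned_reads: iterable of (cell_barcode, umi, gene) tuples. This layout
    is a hypothetical simplification of an aligner's output.
    """
    seen = defaultdict(set)  # (cell, gene) -> set of distinct UMIs
    for cell, umi, gene in aligned_reads:
        # Reads that share a cell, gene, and UMI are treated as copies of
        # one original molecule, so they are counted once.
        seen[(cell, gene)].add(umi)
    return {key: len(umis) for key, umis in seen.items()}

# Toy input (hypothetical barcodes and UMIs).
reads = [
    ("AAAC", "UMI1", "KRT18"),
    ("AAAC", "UMI1", "KRT18"),  # duplicate read of the same molecule
    ("AAAC", "UMI2", "KRT18"),
    ("TTTG", "UMI1", "KRT14"),
]
counts = count_umis(reads)
print(counts)
```

Counting distinct UMIs rather than reads is what makes the expression estimate robust to amplification duplicates.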

The filtered gene-cell barcode matrices generated with CellRanger were used for further

analysis with the R package Seurat version 2.3.1 and development version 3.0.0.9000 with

RStudio version 1.1.453 and R version 3.5.1 (Butler et al., 2018; Stuart et al., 2019). Quality

control (QC) of the data was implemented as the first step in our analysis. We first filtered out

genes that were detected in fewer than five cells and cells with fewer than 200 genes. To further

exclude low-quality cells in downstream analysis we used the function isOutlier from R package

scater together with visual inspection of the distributions of number of genes, UMIs, and

mitochondrial gene content (McCarthy et al., 2017). Cells with extremely high or low numbers of

detected genes/UMIs were excluded. In addition, cells with a high percentage of mitochondrial

reads were also filtered out. After removing likely doublets/multiplets and low-quality cells, the

gene expression levels for each cell were normalized with the NormalizeData function in Seurat.

To reduce variation arising from differences in the number of UMIs and in mitochondrial gene

expression, we used the ScaleData function to linearly regress out these sources of variation. Highly variable genes

were subsequently identified.
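
The two QC filters described above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the Seurat or scater code used in the study: the nested-dictionary layout, the function names basic_filter and mad_outliers, and the threshold of three median absolute deviations (MADs) are all hypothetical choices, loosely modelled on how a MAD-based outlier rule such as scater's isOutlier operates.

```python
from statistics import median

def basic_filter(counts, min_cells=5, min_genes=200):
    """Drop rarely detected genes, then cells with too few detected genes.

    counts: dict mapping cell -> {gene: UMI count} (hypothetical layout).
    """
    cells_per_gene = {}
    for genes in counts.values():
        for gene, n in genes.items():
            if n > 0:
                cells_per_gene[gene] = cells_per_gene.get(gene, 0) + 1
    kept_genes = {g for g, n in cells_per_gene.items() if n >= min_cells}
    filtered = {}
    for cell, genes in counts.items():
        kept = {g: n for g, n in genes.items() if g in kept_genes and n > 0}
        if len(kept) >= min_genes:
            filtered[cell] = kept
    return filtered

def mad_outliers(values, nmads=3.0):
    """Flag values more than nmads scaled MADs from the median.

    nmads=3 is an assumption; the source does not state the threshold.
    """
    med = median(values)
    mad = 1.4826 * median([abs(v - med) for v in values])
    return [abs(v - med) > nmads * mad for v in values]

# Toy data; thresholds are lowered so the example stays small.
counts = {
    "c1": {"g1": 3, "g2": 1},
    "c2": {"g1": 2, "g2": 0, "g3": 5},
    "c3": {"g1": 1},
}
filtered = basic_filter(counts, min_cells=2, min_genes=1)
genes_per_cell = [1000, 1100, 1050, 980, 1020, 9000]
flags = mad_outliers(genes_per_cell)
print(sorted(filtered), flags)
```

The same mad_outliers helper could be applied to UMI totals or to the percentage of mitochondrial reads; in the study these automated flags were combined with visual inspection of the distributions.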

To integrate the single cell data from individual donor samples, the functions

FindIntegrationAnchors and IntegrateData from Seurat v3 were used. The integrated

data were then scaled and PCA was performed. Clusters were identified with the Seurat functions

FindNeighbors and FindClusters. The FindConservedMarkers function was subsequently used to

identify canonical cell type marker genes. Cell cluster identities were manually defined with the

cluster-specific marker genes or known marker genes. The cell clusters were visualized using

Uniform Manifold Approximation and Projection (UMAP) plots. To help interactively explore

gene expression patterns across cell clusters, the UMAP cell embeddings and cell cluster

information generated from the Seurat analysis were imported into the 10X Genomics Loupe Browser

(https://support.10xgenomics.com/single-cell-gene-expression/software/visualization/latest/installation).

The R package ggplot2 and the Seurat function FeaturePlot were used to generate feature plots to visualize

specific gene expression across clusters. R package ComplexHeatmap was used to generate the

heatmaps (Gu et al., 2016; Wickham, 2016).

TCGA and METABRIC dataset analyses:

The 1741 genes from all 13 clusters of the first analysis were used as signatures. Because

gene expression was highly similar among clusters 1, 3, and 4, they were merged into one cluster,

1a. Similarly, clusters 5, 7, and 9 were merged into one cluster, 5a. Centroids of the nine resulting

clusters were generated for the Prediction Analysis of Microarray (PAM) algorithm using

the 1741 genes (Parker et al., 2009; Tibshirani et al., 2002). Expression, clinical, and mutation

data of METABRIC and TCGA BRCA were retrieved from cBioPortal (Bhandari et al., 2019;

Cerami et al., 2012; Ellrott et al., 2018; Gao et al., 2013; Gao et al., 2018; Hoadley et al., 2018;

Liu et al., 2018; Pereira et al., 2016; Sanchez-Vega et al., 2018; Taylor et al., 2018). Expression

data were median centered and applied to the PAM classifier based on the 1741 genes using

Spearman's rank correlation as distance. Each sample in the METABRIC and TCGA BRCA data

was assigned to one of the nine clusters. The relationship of each sample's cluster membership with

intrinsic subtypes and mutations was analyzed. Survival differences between clusters in the

METABRIC and TCGA BRCA samples were analyzed using the R package survminer v0.4.6

(Kassambara et al., 2019; Team, 2020; Wickham, 2016).
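
The classification step described above (median centering, then assignment to the centroid with the highest Spearman correlation) can be sketched as follows. This is a simplified nearest-centroid illustration, not the PAM software itself: centroid shrinkage is omitted, and the function names, the two toy centroids, and the four-gene profiles are hypothetical.

```python
from statistics import mean, median

def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result

def pearson(x, y):
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den if den > 0 else 0.0

def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def median_center(samples):
    """Subtract each gene's median across samples (genes in fixed order)."""
    n_genes = len(next(iter(samples.values())))
    med = [median([v[g] for v in samples.values()]) for g in range(n_genes)]
    return {s: [v[g] - med[g] for g in range(n_genes)] for s, v in samples.items()}

def assign_clusters(samples, centroids):
    """Assign each sample to the centroid with the highest Spearman correlation."""
    centered = median_center(samples)
    return {s: max(centroids, key=lambda c: spearman(v, centroids[c]))
            for s, v in centered.items()}

# Toy example: two hypothetical centroids over four signature genes.
centroids = {"1a": [2, 1, -1, -2], "5a": [-2, -1, 1, 2]}
samples = {"S1": [9, 7, 3, 1], "S2": [1, 2, 8, 9]}
labels = assign_clusters(samples, centroids)
print(labels)
```

Because Spearman correlation depends only on ranks, the assignment is insensitive to monotone differences in scale between the single cell-derived centroids and the bulk tumor expression data.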

Breast cancer TMA and immunostaining for TBX3 and PDK4: The breast cancer tissue microarray (TMA) with

~15 years of follow-up has been described previously (Perkins et al., 2015). All tissue samples

were collected following a detailed IRB-approved protocol, informed patient consent, and

HIPAA compliance protocol. Tissues were fixed overnight at room temperature in 10% NBF. A

pathologist (GES) utilized light microscopy (Leica) to evaluate the staining in each tissue core

(range from 0 to +3) to make sure there was no overstaining and/or extensive background

staining. The slides were imaged using the Aperio Scanscope CS. Computer-assisted

morphometric analysis of digital images was performed using the Aperio Image Analysis

software that came with the Aperio Whole Slide Digital Imaging System. The Positive Pixel

Count algorithm was used to quantify the amount of a specific stain present in a scanned slide

image. A range of color (range of hues and saturation) and three intensity ranges (weak, positive,

and strong) were masked and evaluated. The algorithm counted the number and intensity‐sum in

each intensity range, along with three additional quantities: average intensity, ratio of strong/total

number, and average intensity of weak positive pixels.

The algorithm was applied to an image by using the TMA Lab algorithm. This program

allowed us to select each core, specify the input parameters, run the algorithm, and view/save the

algorithm results. When using the Image Scope program, a pseudo‐color markup image is also

shown as an algorithm result. The H score was calculated using the Aperio TMA software

algorithm. The formula is:

(100*(weak positive + (2*normal positive) + (3*strong positive)))/Total
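
As a worked illustration of the formula above, the following Python sketch computes an H-score from pixel counts. The function name h_score and the example counts are hypothetical, and the sketch assumes that Total is the total number of pixels evaluated in the core.

```python
def h_score(weak, positive, strong, total):
    """H-score = 100 * (weak + 2*positive + 3*strong) / total.

    weak, positive, strong: pixel counts in the three intensity ranges.
    total: all pixels evaluated (assumed to include negative pixels).
    """
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * (weak + 2 * positive + 3 * strong) / total

# Hypothetical core: 1000 pixels, of which 200 weak, 100 positive, 50 strong.
example = h_score(weak=200, positive=100, strong=50, total=1000)
print(example)
```

Under this formula the score ranges from 0 (no positive pixels) to 300 (every pixel strongly positive).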

Approximately 80 to 90 breast biopsies in each of the 14 immunostained breast TMAs were

evaluated with TBX3 and PDK4 antibodies. Anti-PDK4 (ab71240) and anti-TBX3 (ab99302)

antibodies were obtained from Abcam. The normal tissue controls (TMA orientation cores) were

normal liver, cecum, kidney, spleen, tonsil, and heart.

With TBX3, immunostaining was seen in both the cytoplasm and nucleus in most tumor

cells, and within a few stromal cells in a few cases. In cases with inflammation, a subset of

lymphocytes was strongly stained. Staining patterns in tumor cells ranged from weak to

moderate to strong. The two cores from the same patient in the arrays were often similar

depending on the amount of fat and/or stroma in the core. Little to no background staining was

seen in the other tissues in the core (vascular endothelial cells, smooth muscle cells, adipocytes,

and fibroblasts).

With PDK4, immunostaining was seen in the cytoplasm of most tumor cells and, in some

cases, in that of a few stromal cells. Lymphocyte staining was seen only in cases with

inflammation. Staining of tumor cells in the two cores from the same patient ranged from weak

to moderate to strong. This was consistent in all arrays, with minimal background staining in the other

tissues in the core (vascular endothelial cells, smooth muscle cells, adipocytes, and fibroblasts).

Statistical analyses of TMA data: For subjects with multiple tumor samples available, we

included only the sample with the highest PDK4 or TBX3 H-score. Wilcoxon Rank Sum and

Kruskal-Wallis tests were used to determine if PDK4 or TBX3 H-scores correlated with other

tumor markers. Cox proportional hazards regression models were used to determine whether H-

scores and other variables were related to overall and disease-free survival either univariately or

in multivariable models. In these analyses, TBX3 H-scores were divided into low and high

categories at the score of 27.91721 for overall survival (time from surgery to death or censoring)

and disease-free survival (time from surgery to first recurrence or censoring, excluding patients

with M1 stage at surgery). PDK4 H-scores were divided into low and high categories at the score

of 19.41508 for overall survival (time from surgery to death or censoring) and 34.05692 for

disease-free survival (time from surgery to first recurrence or censoring, excluding patients with

distant metastatic disease at surgery). These cutoff values were determined by using the

maximum chi-square value for all score values between the 25th and 75th percentile as described

previously (Perkins et al., 2015). PDK4 high/low and TBX3 high/low were included in all

multivariable models. As a double check on the direction of the hazard ratio and as a more

powerful test if the H-score effect was truly linear, we also fit multivariable models with the H-

score as continuous.
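
The cutoff selection described above (the maximum chi-square over score values between the 25th and 75th percentiles) can be sketched in Python as follows. This is an illustration only: the analyses in the study were run in SAS, and the use of a two-group log-rank statistic as the chi-square, the convention that scores above the cutoff form the high group, the function names, and the toy data are all assumptions.

```python
from statistics import quantiles

def logrank_chi2(times, events, high):
    """Two-group log-rank chi-square statistic (1 degree of freedom).

    times: follow-up times; events: 1 = event, 0 = censored;
    high: True for subjects in the high H-score group.
    """
    obs_minus_exp = 0.0
    variance = 0.0
    for t in sorted({ti for ti, e in zip(times, events) if e}):
        at_risk = [h for ti, h in zip(times, high) if ti >= t]
        n, n1 = len(at_risk), sum(at_risk)
        d = sum(1 for ti, e in zip(times, events) if ti == t and e)
        d1 = sum(1 for ti, e, h in zip(times, events, high) if ti == t and e and h)
        obs_minus_exp += d1 - d * n1 / n
        if n > 1:
            # Hypergeometric variance of the event count in the high group.
            variance += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    return obs_minus_exp ** 2 / variance if variance > 0 else 0.0

def best_cutoff(scores, times, events):
    """Scan observed scores between the quartiles; keep the largest chi-square."""
    q1, _, q3 = quantiles(scores, n=4)
    best = None
    for c in sorted({s for s in scores if q1 <= s <= q3}):
        chi2 = logrank_chi2(times, events, [s > c for s in scores])
        if best is None or chi2 > best[1]:
            best = (c, chi2)
    return best

# Toy data (hypothetical H-scores, months of follow-up, event indicators).
scores = [5, 10, 15, 20, 25, 30, 35, 40]
times = [60, 55, 58, 50, 12, 10, 8, 6]
events = [0, 0, 0, 1, 1, 1, 1, 1]
cutoff, chi2 = best_cutoff(scores, times, events)
print(cutoff, round(chi2, 3))
```

Because the cutoff is chosen to maximize the test statistic, p-values computed at the selected cutoff tend to be optimistic, which is one reason a model with the H-score as a continuous variable is a useful cross-check.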

We conducted subgroup analyses on overall survival using the ER-positive subgroup,

endocrine therapy group, ER-positive on endocrine therapy, ER-negative, and ER+/PR+/HER2-.

First, log-rank tests were done with the dichotomous H-score variable. Second, multivariable

models with the H-score as dichotomous and then continuous were fit as in the

main analyses. Analyses were conducted using SAS Version 9.4. An α level of 5% was used to

determine statistical significance.

Individual marker statistical reports are below.

Statistical Methods: PDK4

For subjects with multiple tumor samples available, we included only the sample with the highest PDK4 H-score. Wilcoxon Rank Sum and Kruskal-Wallis tests were used to determine if PDK4 H-scores were correlated with other tumor markers. Cox proportional hazards regression models were used to determine whether PDK4 H-scores and other variables were related to overall and disease-free survival either univariately or in multivariable models. In these analyses, PDK4 H-scores were divided into low and high categories at the score of 19.41508 for overall survival (time from surgery to death or censoring) and 34.05692 for disease-free survival (time from surgery to first recurrence or censoring, excluding patients with M1 stage at surgery). These cutoff values were determined by using the maximum chi-square value for all score values between the 25th and 75th percentile (http://www.pharmasug.org/proceedings/2012/SP/PharmaSUG-2012-SP12.pdf). PDK4 high/low was included in all multivariable models. As a double check on the direction of the hazard ratio and as a more powerful test if the H-score effect was truly linear, we also fit multivariable models with the H-score as continuous.

We conducted subgroup analyses on overall survival using the ER-positive subgroup, endocrine therapy group, ER-positive on endocrine therapy, ER-negative, and ER+/PR+/HER2-. First, log-rank tests were done with the dichotomous H-score variable. Second, multivariable models with the H-score as dichotomous and then continuous were fit as in the main analyses.

Analyses were conducted using SAS Version 9.4. An α level of 5% was used to determine statistical significance.

Results:

Of the 586 patients who were part of the TMAs available to be read (TMA1-14) for PDK4, 497 patients (85%) had PDK4 values available/readable. Clinical parameters of the subjects included in the TMAs studied are summarized in Table 1, grouped by whether PDK4 values were available. Age, Tumor Grade, Tumor Stage, Recurrences, and Disease Free Survival were significantly different between the two groups (in the missing group, age was higher, Grade 1 and T1 tumors had more missing values, recurrences were proportionally fewer, and median DFS was higher). Because the number of missing PDK4 values was so low, these significant differences were not investigated further.

Correlation of PDK4 H-score with other disease markers

We compared PDK4 H-scores with ER, PR, HER-2/neu, Nodal Stage, Tumor Stage, and Grade. PDK4 H-scores were correlated with ER+/PR+/HER2- status (Table 2): H-scores were significantly lower in ER+/PR+/HER2- tumors than in tumors that were not ER+/PR+/HER2-.

Overall Survival Analysis

Univariate

In univariate analyses, variables significantly related to overall survival in the Cox proportional hazards regression models were PR Status, HER-2 Status, ER+/PR+/HER2- Status, Tumor Grade, Tumor Stage, and Nodal Stage (Table 3a). PR-negative, having HER2-, not having ER+/PR+/HER2-, Higher Tumor Grade, Higher Tumor Stage, and Nodal Stage-positive were correlated with lower survival. PDK4 H-score was not related to overall survival (log rank test p-value 0.1165).

Multivariable

In the multivariate analysis (Tables 4aa and 4ab), models were run both with and without HER2- since there are a number of patients with missing results. ER+/PR+/HER2- was not included in the model since it would be correlated with the other variables. In the multivariate model with HER2- (Table 4aa), PR Status and Nodal Stage were found to be significant. PR-negative and Nodal Stage-positive were correlated with lower survival. In the multivariate model without ER+/PR+/HER2- (Table 4aa), PR Status, Tumor Grade, Tumor Stage, Nodal Stage, and the categorical PDK4 H-Score were found to be significant. PR-negative, Higher Tumor Grade, Higher Tumor Stage, Nodal Stage-positive, and higher PDK4 H-score were correlated with lower survival.

Subgroup analysis

For unadjusted tests (i.e. univariately), the results were not significant for any of the subgroups (log rank test p-value 0.084 for ER-positive, 0.860 for ER-negative, 0.073 for patients on endocrine therapy, 0.464 for patients not on endocrine therapy, 0.053 for patients who were ER+ and on Endocrine Therapy, and 0.653 for ER-positive and not on endocrine therapy).

In multivariable models treating the H-score as dichotomous, H-score category was significant for ER-positive patients, and patients who were ER-positive and on endocrine therapy (Table 5a). For the results that were significant, the higher PDK4 H-score was correlated with lower survival. In multivariable models treating the H-score as continuous, the H-score was significant in patients who were ER+, on Endocrine Therapy, were ER+ AND on

Endocrine Therapy, and for patients who were ER+/PR+/HER2- (Table 6a). For the results that were significant, the higher PDK4 H-scores were associated with lower overall survival.
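Hazard ratios for a continuous covariate are reported per one-unit change, which is why the estimates in Table 6a sit so close to 1. A short Python sketch, under the assumption that the model is log-linear in the H-score and that the unit is one H-score point, rescales such an estimate to a 10-point change. The helper `rescale_hr` is hypothetical, and the inputs are the ER+ and ET=Yes row of Table 6a.

```python
def rescale_hr(hr, lower, upper, units):
    """Rescale a per-1-unit hazard ratio and its CI to a change of `units`."""
    # log(HR) is linear in the covariate, so the HR for k units is HR ** k.
    return hr ** units, lower ** units, upper ** units

# OS for ER+ and ET=Yes, continuous H-score (Table 6a): 1.009 (1.004, 1.015).
hr10, lo10, hi10 = rescale_hr(1.009, 1.004, 1.015, units=10)
print(f"HR per 10 points: {hr10:.3f} ({lo10:.3f}, {hi10:.3f})")
```

Since the tabulated values are rounded to three decimals, the rescaled figures are approximate and serve only as an aid to interpretation.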

In addition, Kaplan-Meier plots are provided for the univariate analyses using the categorical PDK4 H-score for overall and the subgroup analyses.
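For readers unfamiliar with the product-limit method behind these plots, the following minimal Python sketch computes a Kaplan-Meier curve. The function name `kaplan_meier` and the toy data are hypothetical and do not come from the study.

```python
def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each distinct event time."""
    at_risk = len(times)
    surv = 1.0
    curve = []
    for t in sorted(set(times)):
        deaths = sum(1 for ti, e in zip(times, events) if ti == t and e)
        removed = sum(1 for ti in times if ti == t)  # events plus censored at t
        if deaths:
            # Product-limit step: multiply by the conditional survival at t.
            surv *= 1.0 - deaths / at_risk
            curve.append((t, surv))
        at_risk -= removed
    return curve

# Hypothetical toy data: times in years; event flag 1 = event, 0 = censored.
curve = kaplan_meier([2, 3, 3, 5, 8], [1, 1, 0, 1, 0])
for t, s in curve:
    print(f"t = {t}: S(t) = {s:.2f}")
```

Censored subjects leave the risk set without lowering the curve, which is why the estimate changes only at event times.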

Disease Free Survival Analysis

Univariate

In univariate analyses, the variables significantly related to disease free survival in the Cox proportional hazards regression models were PR Status, Tumor Grade, Tumor Stage, and Nodal Stage (Table 3b). PR-negative, higher Tumor Grade, higher Tumor Stage, and Nodal Stage-positive were associated with lower disease free survival. PDK4 H-score was not related to disease free survival (log rank test p-value 0.0544).

Multivariable

In the multivariable analysis (Table 4b), PR Status, Tumor Grade, Tumor Stage, and Nodal Stage were found to be significant. PR-negative, higher Tumor Grade, higher Tumor Stage, and Nodal Stage-positive were associated with lower disease free survival.

Subgroup analysis

For unadjusted (i.e. univariate) tests, the results were significant for patients not on endocrine therapy (log rank test p-value=0.039) and for patients who were ER-positive and not on endocrine therapy (log rank test p-value=0.013). Other analyses were not statistically significant (log rank test p-value 0.078 for ER-positive, 0.675 for ER-negative, 0.849 for patients on endocrine therapy, and 0.766 for ER-positive and on endocrine therapy). For the significant results, the lower PDK4 H-score was associated with lower disease free survival.

In multivariable models treating the H-score as dichotomous, H-score category was significant for patients who were ER+ and not on endocrine therapy (Table 5b); in this subgroup, a lower PDK4 H-score was associated with lower disease free survival. In multivariable models treating the H-score as continuous, the H-score was significant for patients on endocrine therapy and for patients who were ER-positive and on endocrine therapy (Table 6b). For these significant results, a higher PDK4 H-score was associated with lower disease free survival.
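The direction of a hazard ratio depends on which group is the referent. In Table 5b the referent is the Low category, so a High vs Low estimate below 1 corresponds to a higher hazard in the Low group. The short Python sketch below makes this explicit by inverting the ER+ and ET=No row of Table 5b (0.474, 95% CI 0.228 to 0.983); the helper `invert_hr` is hypothetical.

```python
def invert_hr(hr, lower, upper):
    """Swap the referent group: reciprocal HR, with the CI limits exchanged."""
    return 1.0 / hr, 1.0 / upper, 1.0 / lower

# DFS for ER+ and ET=No, High vs Low (Table 5b), re-expressed as Low vs High.
inv_hr, inv_lo, inv_hi = invert_hr(0.474, 0.228, 0.983)
print(f"Low vs High: {inv_hr:.2f} ({inv_lo:.2f}, {inv_hi:.2f})")
```

The inverted interval still excludes 1, consistent with the reported p-value of 0.0448 for this subgroup.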

In addition, Kaplan-Meier plots are provided for the univariate analyses using the categorical PDK4 H-score for overall and the subgroup analyses.


Table 1. Description of the patients and characteristics of their tumors (n=586)

| Characteristic | Have PDK4 (N=497) | Missing PDK4 (N=89) | p-value* |
|---|---|---|---|
| Age at Diagnosis, y | | | 0.0035 |
| Mean (SD) | 59.2 (15.08) | 64.2 (13.04) | |
| Median | 58.15 | 66.14 | |
| Range | 5.92-94.75 | 35.49-88.33 | |
| Race, n(%) | | | 0.0927 |
| White | 385 (77.46%) | 77 (86.52%) | |
| African American | 108 (21.73%) | 11 (12.36%) | |
| Asian | 2 (0.40%) | 1 (1.12%) | |
| Other | 1 (0.20%) | 0 | |
| Unknown | 1 (0.20%) | 0 | |
| PR Status, n(%) | | | 0.1818 |
| Negative | 158 (31.79%) | 23 (25.84%) | |
| Positive | 295 (59.36%) | 61 (68.54%) | |
| Unknown/Not Done | 44 (8.85%) | 5 (5.62%) | |
| ER Status, n(%) | | | 0.2985 |
| Negative | 107 (21.53%) | 15 (16.85%) | |
| Positive | 364 (73.24%) | 70 (78.65%) | |
| Unknown/Not Done | 26 (5.23%) | 4 (4.50%) | |
| HER-2/neu, n(%) | | | 0.3361 |
| Negative | 228 (45.88%) | 46 (51.69%) | |
| Equivocal | 4 (0.80%) | 0 | |
| Positive | 47 (9.46%) | 5 (5.62%) | |
| Unknown/Not Done | 218 (43.86%) | 38 (42.70%) | |
| Tumor Grade, n(%) | | | 0.0147 |
| I | 131 (26.36%) | 36 (40.45%) | |
| II | 211 (42.45%) | 32 (35.96%) | |
| III | 126 (25.35%) | 13 (14.61%) | |
| IV | 1 (0.20%) | 0 | |
| Unknown | 27 (5.43%) | 7 (7.87%) | |
| Not Applicable | 1 (0.20%) | 1 (1.12%) | |
| T Stage, n(%) | | | 0.0389 |
| T0 | 1 (0.20%) | 1 (1.12%) | |
| T1 | 269 (54.12%) | 61 (68.54%) | |
| T2 | 181 (36.42%) | 20 (22.47%) | |
| T3 | 25 (5.03%) | 4 (4.49%) | |
| T4 | 15 (3.02%) | 3 (3.37%) | |
| TX/Unknown | 6 (1.21%) | 0 | |
| N Stage, n(%) | | | 0.1192 |
| N0 | 295 (59.36%) | 62 (69.66%) | |
| N1 | 126 (25.35%) | 18 (20.22%) | |
| N2 | 29 (5.84%) | 1 (1.12%) | |
| N3 | 10 (2.01%) | 2 (2.25%) | |
| NX/Unknown | 36 (7.25%) | 6 (6.74%) | |
| Not Applicable | 1 (0.20%) | 0 | |
| M Stage, n(%) | | | 0.3296 |
| M0 | 255 (51.31%) | 55 (61.80%) | |
| M1 | 17 (3.42%) | 1 (1.12%) | |
| MX/Unknown | 223 (44.86%) | 33 (37.08%) | |
| Not Applicable | 2 (0.40%) | 0 | |
| Endocrine Therapy, n(%) | | | 0.5009 |
| Yes | 297 (59.76%) | 58 (65.17%) | |
| No | 181 (36.42%) | 30 (33.71%) | |
| Unknown/Missing | 19 (3.82%) | 1 (1.12%) | |
| Death, n(%) | | | 0.0960 |
| Yes | 243 (48.89%) | 35 (39.33%) | |
| No | 254 (51.11%) | 54 (60.67%) | |
| Recurrence, n(%) | | | 0.0253 |
| Yes | 153 (30.78%) | 17 (19.10%) | |
| No | 344 (69.22%) | 72 (80.90%) | |
| TMA | | | 0.2725 |
| TMA-1 | 40 (8.05%) | 3 (3.37%) | |
| TMA-2 | 30 (6.04%) | 8 (8.99%) | |
| TMA-3 | 34 (6.84%) | 7 (7.87%) | |
| TMA-4 | 32 (6.44%) | 7 (7.87%) | |
| TMA-5 | 36 (7.24%) | 6 (6.74%) | |
| TMA-6 | 31 (6.24%) | 7 (7.87%) | |
| TMA-7 | 31 (6.24%) | 12 (13.48%) | |
| TMA-8 | 38 (7.65%) | 4 (4.49%) | |
| TMA-9 | 38 (7.65%) | 7 (7.87%) | |
| TMA-10 | 38 (7.65%) | 3 (3.37%) | |
| TMA-11 | 40 (8.05%) | 3 (3.37%) | |
| TMA-12 | 36 (7.24%) | 9 (10.11%) | |
| TMA-13 | 35 (7.04%) | 7 (7.87%) | |
| TMA-14 | 38 (7.65%) | 6 (6.74%) | |
| Follow-up, y, median (range) | 12.9 (0.04-27.9) | 12.5 (0.29-19.9) | 0.1558 |
| Overall survival, y, median (95% CI) | 12.8 (11.6, 16.5) | 15.4 (11.8, NE) | 0.1061 |
| Disease-free survival, y, median (95% CI)** | 11.8 (9.9, 13.7) | 15.4 (11.2, NE) | 0.0352 |

*from t-test, Chi-square, Fisher’s Exact Test, or Log-Rank Test
**M1 patients excluded from analysis
NE=Not Estimable


Table 2. Bivariate analysis of PDK4 H-score with other tumor markers. Values are PDK4 H-score median (25th percentile, 75th percentile).

| Variable | n | Negative | n | Positive | p-value* |
|---|---|---|---|---|---|
| ER | 107 | 41.9 (22.7, 66.4) | 364 | 36.5 (17.7, 61.6) | 0.1001 |
| PR | 158 | 41.5 (19.4, 65.1) | 295 | 37.2 (18.5, 62.3) | 0.2937 |
| HER-2/neu | 228 | 32.6 (14.6, 54.6) | 47 | 38.7 (18.4, 61.9) | 0.2218 |
| ER+/PR+/HER- | 73 | 41.4 (18.9, 63.9) | 196 | 30.6 (14.4, 51.5) | 0.0114 |
| Nodal Status | 295 | 39.1 (20.0, 61.9) | 165 | 37.0 (17.0, 61.8) | 0.3870 |

| Variable | n | Grade 1 | n | Grade 2 | n | Grade 3 | p-value* |
|---|---|---|---|---|---|---|---|
| Tumor Grade | 131 | 35.7 (17.2, 58.5) | 211 | 39.1 (20.0, 60.9) | 127 | 46.3 (19.9, 70.2) | 0.1399 |

| Variable | n | T0/1 | n | T2 | n | T3/4 | p-value* |
|---|---|---|---|---|---|---|---|
| Tumor Stage | 270 | 36.6 (19.5, 63.1) | 181 | 40.6 (19.4, 61.9) | 40 | 40.7 (19.5, 60.9) | 0.9795 |

*from Wilcoxon Rank Sum test for Status and Kruskal-Wallis test for Tumor Grade and Tumor Stage


Table 3a. Univariate analysis of other tumor markers for Overall Survival

| Variable | n | Comparison* | HR (95% CI) | p-value** |
|---|---|---|---|---|
| ER | 556 | ER- vs ER+ | 1.05 (0.78, 1.41) | 0.7643 |
| PR | 537 | PR- vs PR+ | 1.40 (1.08, 1.82) | 0.0112 |
| HER-2/neu | 326 | HER-2/neu- vs HER-2/neu+ | 0.61 (0.40, 0.94) | 0.0269 |
| ER+/PR+/HER- | 320 | No vs Yes | 1.54 (1.08, 1.82) | 0.0420 |
| Tumor Grade | 550 | Grade 1 vs Grade 2 | 0.90 (0.66, 1.23) | <0.0001 |
| | | Grade 1 vs Grade 3 | 0.44 (0.32, 0.61) | |
| | | Grade 2 vs Grade 3 | 0.49 (0.37, 0.66) | |
| T Stage | 580 | T0/1 vs T2 | 0.69 (0.54, 0.89) | <0.0001 |
| | | T0/1 vs T3/4 | 0.31 (0.22, 0.45) | |
| | | T2 vs T3/4 | 0.45 (0.31, 0.66) | |
| N Stage | 543 | N+ vs N0 | 1.86 (1.44, 2.39) | <0.0001 |

*referent group listed second
**from Wald Chi-square test


Table 3b. Univariate analysis of other tumor markers for Disease Free Survival

| Variable | n | Comparison* | HR (95% CI) | p-value** |
|---|---|---|---|---|
| ER | 538 | ER- vs ER+ | 1.07 (0.80, 1.44) | 0.6395 |
| PR | 520 | PR- vs PR+ | 1.31 (1.01, 1.69) | 0.0399 |
| HER-2/neu | 315 | HER-2/neu- vs HER-2/neu+ | 0.71 (0.46, 1.10) | 0.1266 |
| ER+/PR+/HER- | 309 | No vs Yes | 1.24 (0.85, 1.82) | 0.2735 |
| Tumor Grade | 534 | Grade 1 vs Grade 2 | 0.85 (0.64, 1.14) | <0.0001 |
| | | Grade 1 vs Grade 3 | 0.45 (0.33, 0.62) | |
| | | Grade 2 vs Grade 3 | 0.53 (0.39, 0.70) | |
| T Stage | 562 | T0/1 vs T2 | 0.64 (0.50, 0.81) | <0.0001 |
| | | T0/1 vs T3/4 | 0.36 (0.25, 0.54) | |
| | | T2 vs T3/4 | 0.57 (0.38, 0.86) | |
| N Stage | 527 | N+ vs N0 | 1.79 (1.40, 2.29) | <0.0001 |

*referent group listed second
**from Wald Chi-square test


Table 4aa. Multivariable analysis for Overall Survival including HER-2 (N=274)

| Variable | Comparison* | HR | Lower 95% CI | Upper 95% CI | p-value** |
|---|---|---|---|---|---|
| PR | PR- vs PR+ | 1.690 | 1.094 | 2.611 | 0.0006 |
| | PR- vs Unknown | 0.394 | 0.182 | 0.852 | |
| | PR+ vs Unknown | 0.233 | 0.106 | 0.511 | |
| HER-2/neu | HER-2/neu- vs HER-2/neu+ | 1.056 | 0.623 | 1.792 | 0.8393 |
| Tumor Grade | Grade 1 vs Grade 2 | 0.996 | 0.588 | 1.689 | 0.0594 |
| | Grade 1 vs Grade 3 | 0.727 | 0.409 | 1.291 | |
| | Grade 1 vs Unknown | 0.382 | 0.168 | 0.868 | |
| | Grade 2 vs Grade 3 | 0.730 | 0.447 | 1.191 | |
| | Grade 2 vs Unknown | 0.384 | 0.179 | 0.823 | |
| | Grade 3 vs Unknown | 0.526 | 0.233 | 1.186 | |
| T Stage | T0/1 vs T2 | 0.807 | 0.530 | 1.228 | 0.7074 |
| | T0/1 vs T3/4 | 0.867 | 0.438 | 1.718 | |
| | T0/1 vs TX | 0.566 | 0.128 | 2.497 | |
| | T2 vs T3/4 | 1.075 | 0.547 | 2.112 | |
| | T2 vs TX | 0.702 | 0.159 | 3.101 | |
| | T3/4 vs TX | 0.653 | 0.133 | 3.214 | |
| N Stage | N+ vs N0 | 1.916 | 1.262 | 2.909 | <0.0001 |
| | N+ vs NX | 0.476 | 0.231 | 0.981 | |
| | N0 vs NX | 0.249 | 0.121 | 0.510 | |
| PDK4 H-score Category | High vs Low | 1.473 | 0.960 | 2.258 | 0.0761 |

*referent group listed second
**from Wald Chi-square test


Table 4ab. Multivariable analysis for Overall Survival not including HER-2 (N=494)

| Variable | Comparison* | HR | Lower 95% CI | Upper 95% CI | p-value** |
|---|---|---|---|---|---|
| PR | PR- vs PR+ | 1.370 | 1.017 | 1.846 | 0.0088 |
| | PR- vs Unknown | 0.764 | 0.499 | 1.170 | |
| | PR+ vs Unknown | 0.557 | 0.370 | 0.839 | |
| Tumor Grade | Grade 1 vs Grade 2 | 0.948 | 0.665 | 1.352 | 0.0273 |
| | Grade 1 vs Grade 3 | 0.621 | 0.418 | 0.923 | |
| | Grade 1 vs Unknown | 0.650 | 0.372 | 1.136 | |
| | Grade 2 vs Grade 3 | 0.655 | 0.475 | 0.902 | |
| | Grade 2 vs Unknown | 0.686 | 0.410 | 1.147 | |
| | Grade 3 vs Unknown | 1.047 | 0.611 | 1.793 | |
| T Stage | T0/1 vs T2 | 0.862 | 0.641 | 1.160 | 0.0148 |
| | T0/1 vs T3/4 | 0.553 | 0.352 | 0.869 | |
| | T0/1 vs TX | 3.628 | 0.823 | 16.000 | |
| | T2 vs T3/4 | 0.641 | 0.419 | 0.982 | |
| | T2 vs TX | 4.209 | 0.957 | 18.508 | |
| | T3/4 vs TX | 6.566 | 1.454 | 29.640 | |
| N Stage | N+ vs N0 | 1.709 | 1.272 | 2.295 | <0.0001 |
| | N+ vs NX | 0.538 | 0.338 | 0.856 | |
| | N0 vs NX | 0.315 | 0.200 | 0.495 | |
| PDK4 H-score Category | High vs Low | 1.382 | 1.010 | 1.890 | 0.0431 |

*referent group listed second
**from Wald Chi-square test


Table 4b. Multivariable analysis for Disease Free Survival (n=477)

| Variable | Comparison* | HR | Lower 95% CI | Upper 95% CI | p-value** |
|---|---|---|---|---|---|
| PR | PR- vs PR+ | 1.260 | 0.940 | 1.689 | 0.0401 |
| | PR- vs Unknown | 0.771 | 0.503 | 1.183 | |
| | PR+ vs Unknown | 0.612 | 0.408 | 0.919 | |
| Tumor Grade | Grade 1 vs Grade 2 | 0.871 | 0.623 | 1.217 | 0.0356 |
| | Grade 1 vs Grade 3 | 0.586 | 0.398 | 0.864 | |
| | Grade 1 vs Unknown | 0.754 | 0.433 | 1.313 | |
| | Grade 2 vs Grade 3 | 0.674 | 0.489 | 0.927 | |
| | Grade 2 vs Unknown | 0.865 | 0.515 | 1.455 | |
| | Grade 3 vs Unknown | 1.285 | 0.739 | 2.234 | |
| T Stage | T0/1 vs T2 | 0.735 | 0.555 | 0.974 | 0.0351 |
| | T0/1 vs T3/4 | 0.596 | 0.372 | 0.955 | |
| | T0/1 vs TX | 1.877 | 0.534 | 6.594 | |
| | T2 vs T3/4 | 0.812 | 0.516 | 1.277 | |
| | T2 vs TX | 2.554 | 0.730 | 8.933 | |
| | T3/4 vs TX | 3.146 | 0.859 | 11.520 | |
| N Stage | N+ vs N0 | 1.550 | 1.164 | 2.063 | 0.0001 |
| | N+ vs NX | 0.640 | 0.398 | 1.030 | |
| | N0 vs NX | 0.413 | 0.259 | 0.659 | |
| PDK4 H-score Category | High vs Low | 0.879 | 0.680 | 1.136 | 0.3241 |

*referent group listed second
**from Wald Chi-square test


Table 5a: Overall Survival - PROC PHREG with H score as dichotomous variable in multivariable models

| Obs | Group | H-Score Category Parameter* | p-value | Point Estimate | Lower 95% Wald Confidence Limit | Upper 95% Wald Confidence Limit |
|---|---|---|---|---|---|---|
| 1 | OS | High vs Low | 0.0431 | 1.382 | 1.010 | 1.890 |
| 2 | OS for ER+ | High vs Low | 0.0131 | 1.600 | 1.104 | 2.319 |
| 3 | OS for ER- | High vs Low | 0.5535 | 1.239 | 0.610 | 2.514 |
| 4 | OS for Endocrine Therapy=Yes | High vs Low | 0.0847 | 1.431 | 0.952 | 2.152 |
| 5 | OS for Endocrine Therapy=No | High vs Low | 0.0976 | 1.608 | 0.917 | 2.822 |
| 6 | OS for ER+ and ET=Yes | High vs Low | 0.0300 | 1.614 | 1.048 | 2.488 |
| 7 | OS for ER+ and ET=No | High vs Low | 0.1473 | 1.956 | 0.789 | 4.847 |
| 8 | OS for ER- and ET=Yes | High vs Low | 0.9707 | 0.961 | 0.117 | 7.887 |
| 9 | OS for ER- and ET=No | High vs Low | 0.2016 | 1.710 | 0.751 | 3.893 |
| 10 | OS for ER+/PR+/HER2-=Yes | High vs Low | 0.1131 | 1.510 | 0.907 | 2.513 |
| 11 | OS for ER+/PR+/HER2-=No | High vs Low | 0.7579 | 1.146 | 0.483 | 2.717 |

*referent group listed second


Table 5b: Disease Free Survival - PROC PHREG with H score as dichotomous variable in multivariable models

| Obs | Group | H-Score Category Parameter* | p-value | Point Estimate | Lower 95% Wald Confidence Limit | Upper 95% Wald Confidence Limit |
|---|---|---|---|---|---|---|
| 1 | DFS | High vs Low | 0.3241 | 0.879 | 0.680 | 1.136 |
| 2 | DFS for ER+ | High vs Low | 0.5038 | 0.902 | 0.666 | 1.221 |
| 3 | DFS for ER- | High vs Low | 0.9336 | 0.975 | 0.537 | 1.772 |
| 4 | DFS for Endocrine Therapy=Yes | High vs Low | 0.4082 | 1.151 | 0.824 | 1.608 |
| 5 | DFS for Endocrine Therapy=No | High vs Low | 0.2109 | 0.754 | 0.484 | 1.174 |
| 6 | DFS for ER+ and ET=Yes | High vs Low | 0.3474 | 1.190 | 0.828 | 1.711 |
| 7 | DFS for ER+ and ET=No | High vs Low | 0.0448 | 0.474 | 0.228 | 0.983 |
| 8 | DFS for ER- and ET=Yes | High vs Low | 0.6785 | 1.355 | 0.322 | 5.710 |
| 9 | DFS for ER- and ET=No | High vs Low | 0.6392 | 1.179 | 0.592 | 2.350 |
| 10 | DFS for ER+/PR+/HER2-=Yes | High vs Low | 0.2106 | 0.750 | 0.478 | 1.177 |
| 11 | DFS for ER+/PR+/HER2-=No | High vs Low | 0.4555 | 1.393 | 0.584 | 3.323 |

*referent group listed second


Table 6a: Overall Survival - PROC PHREG with H score as continuous variable in multivariable models

| Obs | Group | Parameter | p-value | Point Estimate | Lower 95% Wald Confidence Limit | Upper 95% Wald Confidence Limit |
|---|---|---|---|---|---|---|
| 1 | OS | score | 0.1082 | 1.003 | 0.999 | 1.007 |
| 2 | OS for ER+ | score | 0.0043 | 1.006 | 1.002 | 1.011 |
| 3 | OS for ER- | score | 0.4747 | 0.997 | 0.988 | 1.005 |
| 4 | OS for Endocrine Therapy=Yes | score | 0.0008 | 1.009 | 1.004 | 1.014 |
| 5 | OS for Endocrine Therapy=No | score | 0.9615 | 1.000 | 0.995 | 1.006 |
| 6 | OS for ER+ and ET=Yes | score | 0.0007 | 1.009 | 1.004 | 1.015 |
| 7 | OS for ER+ and ET=No | score | 0.2803 | 1.005 | 0.996 | 1.013 |
| 8 | OS for ER- and ET=Yes | score | 0.3364 | 1.019 | 0.981 | 1.058 |
| 9 | OS for ER- and ET=No | score | 0.3776 | 0.996 | 0.986 | 1.005 |
| 10 | OS for ER+/PR+/HER2-=Yes | score | 0.0333 | 1.008 | 1.001 | 1.016 |
| 11 | OS for ER+/PR+/HER2-=No | score | 0.5765 | 0.996 | 0.983 | 1.010 |


Table 6b: Disease Free Survival - PROC PHREG with H score as continuous variable in multivariable models

| Obs | Group | Parameter | p-value | Point Estimate | Lower 95% Wald Confidence Limit | Upper 95% Wald Confidence Limit |
|---|---|---|---|---|---|---|
| 1 | DFS | score | 0.4423 | 1.001 | 0.998 | 1.005 |
| 2 | DFS for ER+ | score | 0.2105 | 1.003 | 0.998 | 1.007 |
| 3 | DFS for ER- | score | 0.6455 | 0.998 | 0.991 | 1.006 |
| 4 | DFS for Endocrine Therapy=Yes | score | 0.0103 | 1.007 | 1.002 | 1.012 |
| 5 | DFS for Endocrine Therapy=No | score | 0.7727 | 0.999 | 0.994 | 1.005 |
| 6 | DFS for ER+ and ET=Yes | score | 0.0237 | 1.007 | 1.001 | 1.012 |
| 7 | DFS for ER+ and ET=No | score | 0.8985 | 0.999 | 0.991 | 1.008 |
| 8 | DFS for ER- and ET=Yes | score | 0.2757 | 1.015 | 0.988 | 1.043 |
| 9 | DFS for ER- and ET=No | score | 0.7149 | 0.998 | 0.990 | 1.007 |
| 10 | DFS for ER+/PR+/HER2-=Yes | score | 0.6246 | 1.002 | 0.994 | 1.010 |
| 11 | DFS for ER+/PR+/HER2-=No | score | 0.7799 | 1.002 | 0.989 | 1.015 |


Summary of results for Overall Survival and DFS analysis – Univariate analyses on the PDK4 H-score category

Overall Survival

| Obs | analtype | rankscore | Num of Events | Num Censored | Median (95% CI) | Logrank p-value |
|---|---|---|---|---|---|---|
| 1 | OS | _ | 243 (48.89) | 254 (51.11) | | 0.1155 |
| 2 | OS | Low | 53 (42.06) | 73 (57.94) | 18.54 (11.62, .) | |
| 3 | OS | High | 190 (51.21) | 181 (48.79) | 12.32 (10.59, 14.76) | |

[Kaplan-Meier plot: Product-Limit Survival Estimates With Number of Subjects at Risk. Y-axis: Survival Probability (0.0 to 1.0). X-axis: os_yr (0 to 25). Curves: rankscore High, Low; + Censored.]
Number at risk, High: 371 320 273 227 182 105 65 29 12 3 1
Number at risk, Low: 126 113 95 85 73 34 17 8 3 1 1


Overall Survival for ER+

| Obs | analtype | rankscore | Num of Events | Num Censored | Median (95% CI) | Logrank p-value |
|---|---|---|---|---|---|---|
| 1 | OS for ER+ | _ | 172 (47.25) | 192 (52.75) | | 0.0835 |
| 2 | OS for ER+ | Low | 38 (38.78) | 60 (61.22) | 18.54 (12.82, .) | |
| 3 | OS for ER+ | High | 134 (50.38) | 132 (49.62) | 12.32 (10.76, 14.76) | |

[Kaplan-Meier plot: Product-Limit Survival Estimates With Number of Subjects at Risk. Y-axis: Survival Probability (0.0 to 1.0). X-axis: os_yr (0 to 25). Curves: rankscore High, Low; + Censored.]
Number at risk, High: 266 239 207 175 138 73 39 15 5 2 1
Number at risk, Low: 98 91 76 68 58 28 13 7 2 1 1


Overall Survival for ER-

| Obs | analtype | rankscore | Num of Events | Num Censored | Median (95% CI) | Logrank p-value |
|---|---|---|---|---|---|---|
| 1 | OS for ER- | _ | 51 (47.66) | 56 (52.34) | | 0.8597 |
| 2 | OS for ER- | Low | 12 (48.00) | 13 (52.00) | 11.84 (6.54, .) | |
| 3 | OS for ER- | High | 39 (47.56) | 43 (52.44) | 17.42 (5.09, .) | |

[Kaplan-Meier plot: Product-Limit Survival Estimates With Number of Subjects at Risk. Y-axis: Survival Probability (0.0 to 1.0). X-axis: os_yr (0 to 20). Curves: rankscore High, Low; + Censored.]
Number at risk, High: 82 61 50 41 35 24 20 10 4 0
Number at risk, Low: 25 19 18 16 14 6 4 1 1 0


Overall Survival for Endocrine Therapy=Yes

| Obs | analtype | rankscore | Num of Events | Num Censored | Median (95% CI) | Logrank p-value |
|---|---|---|---|---|---|---|
| 1 | OS for Endocrine Therapy=Yes | _ | 143 (48.15) | 154 (51.85) | | 0.0727 |
| 2 | OS for Endocrine Therapy=Yes | Low | 31 (38.75) | 49 (61.25) | 18.54 (11.84, .) | |
| 3 | OS for Endocrine Therapy=Yes | High | 112 (51.61) | 105 (48.39) | 12.32 (10.59, 14.02) | |

[Kaplan-Meier plot: Product-Limit Survival Estimates With Number of Subjects at Risk. Y-axis: Survival Probability (0.0 to 1.0). X-axis: os_yr (0 to 25). Curves: rankscore High, Low; + Censored.]
Number at risk, High: 217 196 171 140 113 57 29 11 3 0
Number at risk, Low: 80 75 62 57 49 21 10 6 2 1 1


Overall Survival for Endocrine Therapy=No

| Obs | analtype | rankscore | Num of Events | Num Censored | Median (95% CI) | Logrank p-value |
|---|---|---|---|---|---|---|
| 1 | OS for Endocrine Therapy=No | _ | 94 (51.93) | 87 (48.07) | | 0.4635 |
| 2 | OS for Endocrine Therapy=No | Low | 19 (46.34) | 22 (53.66) | 11.84 (7.41, .) | |
| 3 | OS for Endocrine Therapy=No | High | 75 (53.57) | 65 (46.43) | 10.69 (7.19, 18.19) | |

[Kaplan-Meier plot: Product-Limit Survival Estimates With Number of Subjects at Risk. Y-axis: Survival Probability (0.0 to 1.0). X-axis: os_yr (0 to 25). Curves: rankscore High, Low; + Censored.]
Number at risk, High: 140 111 92 77 59 40 31 14 7 3 1
Number at risk, Low: 41 33 30 26 22 11 7 2 1 0


Overall Survival for ER+ and ET=Yes

| Obs | analtype | rankscore | Num of Events | Num Censored | Median (95% CI) | Logrank p-value |
|---|---|---|---|---|---|---|
| 1 | OS for ER+ and ET=Yes | _ | 125 (46.47) | 144 (53.53) | | 0.0534 |
| 2 | OS for ER+ and ET=Yes | Low | 27 (36.49) | 47 (63.51) | 18.54 (12.82, .) | |
| 3 | OS for ER+ and ET=Yes | High | 98 (50.26) | 97 (49.74) | 12.40 (10.76, 14.76) | |

[Kaplan-Meier plot: Product-Limit Survival Estimates With Number of Subjects at Risk. Y-axis: Survival Probability (0.0 to 1.0). X-axis: os_yr (0 to 25). Curves: rankscore High, Low; + Censored.]
Number at risk, High: 195 176 155 130 105 51 24 7 2 0
Number at risk, Low: 74 70 59 54 46 20 10 6 2 1 1


Overall Survival for ER+ and ET=No

| Obs | analtype | rankscore | Num of Events | Num Censored | Median (95% CI) | Logrank p-value |
|---|---|---|---|---|---|---|
| 1 | OS for ER+ and ET=No | _ | 42 (52.50) | 38 (47.50) | | 0.6529 |
| 2 | OS for ER+ and ET=No | Low | 9 (45.00) | 11 (55.00) | . (4.18, .) | |
| 3 | OS for ER+ and ET=No | High | 33 (55.00) | 27 (45.00) | 10.82 (8.48, 18.19) | |

[Kaplan-Meier plot: Product-Limit Survival Estimates With Number of Subjects at Risk. Y-axis: Survival Probability (0.0 to 1.0). X-axis: os_yr (0 to 25). Curves: rankscore High, Low; + Censored.]
Number at risk, High: 60 53 45 38 26 16 11 5 2 2 1
Number at risk, Low: 20 17 14 12 10 6 3 1 0


Overall Survival for ER- and ET=Yes

| Obs | analtype | rankscore | Num of Events | Num Censored | Median (95% CI) | Logrank p-value |
|---|---|---|---|---|---|---|
| 1 | OS for ER- and ET=Yes | _ | 9 (56.25) | 7 (43.75) | | 0.9382 |
| 2 | OS for ER- and ET=Yes | Low | 2 (50.00) | 2 (50.00) | . (1.52, .) | |
| 3 | OS for ER- and ET=Yes | High | 7 (58.33) | 5 (41.67) | 11.40 (2.46, .) | |

[Kaplan-Meier plot: Product-Limit Survival Estimates With Number of Subjects at Risk. Y-axis: Survival Probability (0.0 to 1.0). X-axis: os_yr (0 to 20). Curves: rankscore High, Low; + Censored.]
Number at risk, High: 12 10 7 5 4 3 3 2 0
Number at risk, Low: 4 3 2 2 2 1 0


Overall Survival for ER- and ET=No

| Obs | analtype | rankscore | Num of Events | Num Censored | Median (95% CI) | Logrank p-value |
|---|---|---|---|---|---|---|
| 1 | OS for ER- and ET=No | _ | 42 (47.73) | 46 (52.27) | | 0.7225 |
| 2 | OS for ER- and ET=No | Low | 10 (47.62) | 11 (52.38) | 11.84 (6.54, .) | |
| 3 | OS for ER- and ET=No | High | 32 (47.76) | 35 (52.24) | . (4.60, .) | |

[Kaplan-Meier plot: Product-Limit Survival Estimates With Number of Subjects at Risk. Y-axis: Survival Probability (0.0 to 1.0). X-axis: os_yr (0 to 20). Curves: rankscore High, Low; + Censored.]
Number at risk, High: 67 48 40 33 28 19 16 7 3 0
Number at risk, Low: 21 16 16 14 12 5 4 1 1 0


Overall Survival for ER+/PR+/HER2-=Yes

| Obs | analtype | rankscore | Num of Events | Num Censored | Median (95% CI) | Logrank p-value |
|---|---|---|---|---|---|---|
| 1 | OS for ER+/PR+/HER2-=Yes | _ | 75 (38.27) | 121 (61.73) | | 0.3781 |
| 2 | OS for ER+/PR+/HER2-=Yes | Low | 23 (34.33) | 44 (65.67) | . (11.84, .) | |
| 3 | OS for ER+/PR+/HER2-=Yes | High | 52 (40.31) | 77 (59.69) | 14.76 (11.74, .) | |

[Kaplan-Meier plot: Product-Limit Survival Estimates With Number of Subjects at Risk. Y-axis: Survival Probability (0.0 to 1.0). X-axis: os_yr (0 to 20). Curves: rankscore High, Low; + Censored.]
Number at risk, High: 129 121 106 93 73 27 10 1 0
Number at risk, Low: 67 63 56 50 42 17 5 1 1 0


Overall Survival for ER+/PR+/HER2-=No

| Obs | analtype | rankscore | Num of Events | Num Censored | Median (95% CI) | Logrank p-value |
|---|---|---|---|---|---|---|
| 1 | OS for ER+/PR+/HER2-=No | _ | 33 (45.21) | 40 (54.79) | | 0.5379 |
| 2 | OS for ER+/PR+/HER2-=No | Low | 8 (40.00) | 12 (60.00) | . (4.21, .) | |
| 3 | OS for ER+/PR+/HER2-=No | High | 25 (47.17) | 28 (52.83) | . (4.41, .) | |

[Kaplan-Meier plot: Product-Limit Survival Estimates With Number of Subjects at Risk. Y-axis: Survival Probability (0.0 to 1.0). X-axis: os_yr (0 to 20). Curves: rankscore High, Low; + Censored.]
Number at risk, High: 53 43 34 28 22 13 7 1 1 0
Number at risk, Low: 20 17 14 13 11 4 3 0


DFS

| Obs | analtype | rankscore | Num of Events | Num Censored | Median (95% CI) | Logrank p-value |
|---|---|---|---|---|---|---|
| 1 | DFS | _ | 256 (53.33) | 224 (46.67) | | 0.0538 |
| 2 | DFS | Low | 123 (57.75) | 90 (42.25) | 10.76 (7.18, 12.79) | |
| 3 | DFS | High | 133 (49.81) | 134 (50.19) | 12.84 (10.69, 15.95) | |

[Kaplan-Meier plot: Product-Limit Survival Estimates With Number of Subjects at Risk. Y-axis: Survival Probability (0.0 to 1.0). X-axis: pfs_yr (0 to 25). Curves: rankscore High, Low; + Censored.]
Number at risk, High: 267 223 187 163 125 79 49 22 9 2 1
Number at risk, Low: 213 174 139 114 97 46 27 11 3 1 1


DFS for ER+

| Obs | analtype | rankscore | Num of Events | Num Censored | Median (95% CI) | Logrank p-value |
|---|---|---|---|---|---|---|
| 1 | DFS for ER+ | _ | 184 (51.98) | 170 (48.02) | | 0.0779 |
| 2 | DFS for ER+ | Low | 91 (56.17) | 71 (43.83) | 11.12 (7.71, 15.61) | |
| 3 | DFS for ER+ | High | 93 (48.44) | 99 (51.56) | 12.92 (11.12, 18.19) | |

[Kaplan-Meier plot: Product-Limit Survival Estimates With Number of Subjects at Risk. Y-axis: Survival Probability (0.0 to 1.0). X-axis: pfs_yr (0 to 25). Curves: rankscore High, Low; + Censored.]
Number at risk, High: 192 169 146 128 97 57 31 12 4 1 1
Number at risk, Low: 162 140 112 92 76 33 18 8 2 1 1


DFS for ER-

| Obs | analtype | rankscore | Num of Events | Num Censored | Median (95% CI) | Logrank p-value |
|---|---|---|---|---|---|---|
| 1 | DFS for ER- | _ | 51 (51.00) | 49 (49.00) | | 0.6753 |
| 2 | DFS for ER- | Low | 22 (55.00) | 18 (45.00) | 6.54 (2.91, .) | |
| 3 | DFS for ER- | High | 29 (48.33) | 31 (51.67) | 8.56 (3.87, .) | |

[Kaplan-Meier plot: Product-Limit Survival Estimates With Number of Subjects at Risk. Y-axis: Survival Probability (0.0 to 1.0). X-axis: pfs_yr (0 to 20). Curves: rankscore High, Low; + Censored.]
Number at risk, High: 60 41 33 29 23 17 13 7 2 0
Number at risk, Low: 40 27 21 18 17 10 8 2 1 0


DFS for Endocrine Therapy=Yes

| Obs | analtype | rankscore | Num of Events | Num Censored | Median (95% CI) | Logrank p-value |
|---|---|---|---|---|---|---|
| 1 | DFS for Endocrine Therapy=Yes | _ | 150 (52.08) | 138 (47.92) | | 0.8491 |
| 2 | DFS for Endocrine Therapy=Yes | Low | 73 (53.28) | 64 (46.72) | 11.84 (9.67, 18.06) | |
| 3 | DFS for Endocrine Therapy=Yes | High | 77 (50.99) | 74 (49.01) | 12.32 (9.76, 14.76) | |

[Kaplan-Meier plot: Product-Limit Survival Estimates With Number of Subjects at Risk. Y-axis: Survival Probability (0.0 to 1.0). X-axis: pfs_yr (0 to 25). Curves: rankscore High, Low; + Censored.]
Number at risk, High: 151 132 111 95 71 39 19 7 2 0
Number at risk, Low: 137 120 101 84 71 31 18 10 2 1 1


DFS for Endocrine Therapy=No

| Obs | analtype | rankscore | Num of Events | Num Censored | Median (95% CI) | Logrank p-value |
|---|---|---|---|---|---|---|
| 1 | DFS for Endocrine Therapy=No | _ | 99 (56.90) | 75 (43.10) | | 0.0387 |
| 2 | DFS for Endocrine Therapy=No | Low | 44 (64.71) | 24 (35.29) | 5.19 (3.88, 9.13) | |
| 3 | DFS for Endocrine Therapy=No | High | 55 (51.89) | 51 (48.11) | 12.38 (6.81, 21.67) | |

[Kaplan-Meier plot: Product-Limit Survival Estimates With Number of Subjects at Risk. Y-axis: Survival Probability (0.0 to 1.0). X-axis: pfs_yr (0 to 25). Curves: rankscore High, Low; + Censored.]
Number at risk, High: 106 81 67 59 45 32 25 11 5 2 1
Number at risk, Low: 68 47 35 28 24 13 9 1 1 0


DFS for ER+ and ET=Yes

| Obs | analtype | rankscore | Num of Events | Num Censored | Median (95% CI) | Logrank p-value |
|---|---|---|---|---|---|---|
| 1 | DFS for ER+ and ET=Yes | _ | 130 (50.00) | 130 (50.00) | | 0.7662 |
| 2 | DFS for ER+ and ET=Yes | Low | 64 (51.20) | 61 (48.80) | 14.17 (9.92, 18.06) | |
| 3 | DFS for ER+ and ET=Yes | High | 66 (48.89) | 69 (51.11) | 12.45 (10.75, .) | |

[Kaplan-Meier plot: Product-Limit Survival Estimates With Number of Subjects at Risk. Y-axis: Survival Probability (0.0 to 1.0). X-axis: pfs_yr (0 to 25). Curves: rankscore High, Low; + Censored.]
Number at risk, High: 135 120 104 90 68 37 17 5 1 0
Number at risk, Low: 125 110 92 79 66 27 16 8 2 1 1


DFS for ER+ and ET=No

| Obs | analtype | rankscore | Num of Events | Num Censored | Median (95% CI) | Logrank p-value |
|---|---|---|---|---|---|---|
| 1 | DFS for ER+ and ET=No | _ | 49 (61.25) | 31 (38.75) | | 0.0131 |
| 2 | DFS for ER+ and ET=No | Low | 23 (74.19) | 8 (25.81) | 5.19 (3.62, 8.98) | |
| 3 | DFS for ER+ and ET=No | High | 26 (53.06) | 23 (46.94) | 12.38 (8.48, 21.67) | |

[Kaplan-Meier plot: Product-Limit Survival Estimates With Number of Subjects at Risk. Y-axis: Survival Probability (0.0 to 1.0). X-axis: pfs_yr (0 to 25). Curves: rankscore High, Low; + Censored.]
Number at risk, High: 49 41 35 31 22 14 10 4 2 1 1
Number at risk, Low: 31 24 17 11 8 4 2 0


DFS for ER- and ET=Yes

| Obs | analtype | rankscore | Num of Events | Num Censored | Median (95% CI) | Logrank p-value |
|---|---|---|---|---|---|---|
| 1 | DFS for ER- and ET=Yes | _ | 10 (62.50) | 6 (37.50) | | 0.8076 |
| 2 | DFS for ER- and ET=Yes | Low | 4 (66.67) | 2 (33.33) | 6.05 (1.52, .) | |
| 3 | DFS for ER- and ET=Yes | High | 6 (60.00) | 4 (40.00) | 3.31 (1.01, .) | |

[Kaplan-Meier plot: Product-Limit Survival Estimates With Number of Subjects at Risk. Y-axis: Survival Probability (0.0 to 1.0). X-axis: pfs_yr (0 to 20). Curves: rankscore High, Low; + Censored.]
Number at risk, High: 10 6 4 3 2 1 1 1 0
Number at risk, Low: 6 5 4 2 2 2 1 1 0


DFS for ER- and ET=No

Obs  rankscore  Num of Events  Num Censored  Median (95% CI)        Logrank p-value
1    _          40 (49.38)     41 (50.62)                           0.8314
2    Low        17 (51.52)     16 (48.48)    15.81 (3.71, .)
3    High       23 (47.92)     25 (52.08)    . (3.87, .)

[Figure: Product-Limit Survival Estimates With Number of Subjects at Risk. Survival Probability (y-axis, 0.0 to 1.0) versus pfs_yr (x-axis, 0 to 20), by rankscore (High, Low); + = Censored. Subjects at risk, High: 48 33 27 24 19 14 11 5 1 0; Low: 33 22 17 16 15 8 7 1 1 0.]


DFS for ER+/PR+/HER2-=Yes

Obs  rankscore  Num of Events  Num Censored  Median (95% CI)        Logrank p-value
1    _          91 (47.15)     102 (52.85)                          0.1049
2    Low        58 (53.21)      51 (46.79)   11.62 (8.98, 15.61)
3    High       33 (39.29)      51 (60.71)   14.76 (11.76, .)

[Figure: Product-Limit Survival Estimates With Number of Subjects at Risk. Survival Probability (y-axis, 0.0 to 1.0) versus pfs_yr (x-axis, 0 to 20), by rankscore (High, Low); + = Censored. Subjects at risk, High: 84 77 69 62 45 18 5 0; Low: 109 96 80 66 54 19 8 2 0.]


DFS for ER+/PR+/HER2-=No

Obs  rankscore  Num of Events  Num Censored  Median (95% CI)        Logrank p-value
1    _          29 (44.62)     36 (55.38)                           0.7114
2    Low        12 (42.86)     16 (57.14)    . (3.19, .)
3    High       17 (45.95)     20 (54.05)    . (3.35, .)

[Figure: Product-Limit Survival Estimates With Number of Subjects at Risk. Survival Probability (y-axis, 0.0 to 1.0) versus pfs_yr (x-axis, 0 to 20), by rankscore (High, Low); + = Censored. Subjects at risk, High: 37 27 21 20 15 11 6 1 1 0; Low: 28 22 17 16 15 6 4 0.]


Statistical Methods: TBX3

For subjects with multiple tumor samples available, we included only the sample with the highest TBX3 H-score. Wilcoxon rank-sum and Kruskal-Wallis tests were used to determine whether TBX3 H-scores were associated with other tumor markers. Cox proportional hazards regression models were used to determine whether TBX3 H-scores and other variables were related to overall and disease-free survival, either univariately or in multivariable models. In these analyses, TBX3 H-scores were divided into low and high categories at a score of 27.91721 for overall survival (time from surgery to death or censoring) and disease-free survival (time from surgery to first recurrence or censoring, excluding patients with M1 stage at surgery). The cutoff was determined by using the maximum chi-square value over all score values between the 25th and 75th percentiles (http://www.pharmasug.org/proceedings/2012/SP/PharmaSUG-2012-SP12.pdf). TBX3 high/low was included in all multivariable models. As a check on the direction of the hazard ratio, and as a more powerful test if the H-score effect was truly linear, we also fit multivariable models with the H-score as a continuous variable.
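The cutoff search described above can be illustrated with a minimal Python sketch. It is not the SAS macro used for the analysis: it assumes that the chi-square being maximized is the two-group log-rank statistic, that "high" means a score strictly above the candidate cutoff, and that percentiles are taken by linear interpolation. The function names and the toy data are hypothetical.

```python
from typing import Optional, Sequence, Tuple


def logrank_chi2(times: Sequence[float], events: Sequence[int],
                 high: Sequence[bool]) -> float:
    """Two-group log-rank chi-square statistic (1 degree of freedom)."""
    obs_minus_exp = 0.0
    variance = 0.0
    # Walk through each distinct time at which at least one event occurred.
    for t in sorted({ti for ti, e in zip(times, events) if e}):
        at_risk = [i for i, ti in enumerate(times) if ti >= t]
        n = len(at_risk)
        n_high = sum(1 for i in at_risk if high[i])
        d = sum(1 for i in at_risk if times[i] == t and events[i])
        d_high = sum(1 for i in at_risk
                     if times[i] == t and events[i] and high[i])
        obs_minus_exp += d_high - d * n_high / n
        if n > 1:
            # Hypergeometric variance of the event count in the high group.
            variance += d * (n_high / n) * (1 - n_high / n) * (n - d) / (n - 1)
    return obs_minus_exp ** 2 / variance if variance > 0 else 0.0


def percentile(sorted_values: Sequence[float], q: float) -> float:
    """Percentile by linear interpolation (an assumption; SAS may differ)."""
    pos = q * (len(sorted_values) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (pos - lo) * (sorted_values[hi] - sorted_values[lo])


def max_chi2_cutpoint(scores: Sequence[float], times: Sequence[float],
                      events: Sequence[int]) -> Tuple[Optional[float], float]:
    """Return the candidate cutoff (between the 25th and 75th percentiles)
    with the largest log-rank chi-square, and that chi-square value."""
    ordered = sorted(scores)
    q25, q75 = percentile(ordered, 0.25), percentile(ordered, 0.75)
    best_cut, best_chi2 = None, -1.0
    for cut in sorted({s for s in scores if q25 <= s <= q75}):
        high = [s > cut for s in scores]  # assumption: low means score <= cut
        if all(high) or not any(high):
            continue
        chi2 = logrank_chi2(times, events, high)
        if chi2 > best_chi2:
            best_cut, best_chi2 = cut, chi2
    return best_cut, best_chi2


# Hypothetical toy data, not taken from the study.
scores = [5, 10, 15, 20, 25, 30, 35, 40]
times = [1, 2, 3, 4, 9, 10, 11, 12]
events = [1, 1, 1, 1, 1, 1, 1, 1]
cut, chi2 = max_chi2_cutpoint(scores, times, events)
print(cut, round(chi2, 3))
```

Because the cutoff is chosen to maximize the test statistic, a p-value computed at the selected cutoff is optimistic unless it is adjusted for the search.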

We conducted subgroup analyses on overall survival using the ER-positive subgroup, endocrine therapy group, ER-positive on endocrine therapy, ER-negative, and ER+/PR+/HER2-. First, log-rank tests were done with the dichotomous H-score variable. Second, multivariable models with the H-score as a dichotomous and then as a continuous variable were fit, as in the main analyses.

Analyses were conducted using SAS Version 9.4. An α level of 5% was used to determine statistical significance.

Results:

Of the 557 patients who were part of the TMAs available to be read (TMA1-14) for TBX3, 457 patients (82%) had TBX3 values available/readable. Clinical parameters of the subjects included in the TMAs studied are summarized in Table 1, grouped by whether TBX3 values were available. Age, Tumor Stage, Recurrence, and TMA number differed significantly between the two groups (patients with missing values were older, more often had T1 tumors, and had fewer recurrences; TMA4 and TMA6 had proportionally more missing values). Since the number of patients with missing TBX3 values was low, these significant differences were not investigated further.

Correlation of TBX3 H-score with other disease markers

We compared TBX3 H-scores across categories of ER, PR, HER-2/neu, Nodal Stage, Tumor Stage, and Tumor Grade. TBX3 levels were associated with Tumor Stage and Tumor Grade (Table 2). H-scores for Tumor Stage 0/1 were significantly lower than those for Stage 2 and Stage 3/4, and H-scores differed significantly among all Tumor Grades, with Grade 3 having the highest values.

Overall Survival Analysis

Univariate


In univariate analyses, variables significantly related to overall survival in the Cox proportional hazards regression models were Tumor Grade, Tumor Stage, and Nodal Stage (Table 3a). Higher Tumor Grade, Higher Tumor Stage, and Nodal Stage-positive were correlated with lower survival. TBX3 H-score was not related to overall survival (log rank test p-value 0.2443).

Multivariable

In the multivariable analysis (Table 4a), Tumor Grade, Tumor Stage, Nodal Stage, and TBX3 H-score were found to be significant. Higher Tumor Grade, higher Tumor Stage, positive Nodal Stage, and lower TBX3 H-score were associated with lower survival.

Subgroup analysis

For unadjusted (i.e., univariate) tests, the result was significant for patients who were not ER+/PR+/HER2- (log-rank test p-value=0.035). The other analyses were not statistically significant (log-rank test p-value 0.416 for ER-positive, 0.747 for ER-negative, 0.984 for patients on endocrine therapy, 0.289 for patients not on endocrine therapy, 0.557 for ER-positive and on endocrine therapy, and 0.057 for ER-positive and not on endocrine therapy). For the significant result, a lower TBX3 H-score was associated with lower overall survival.

In multivariable models treating the H-score as dichotomous, H-score category was significant for patients who were ER-positive and not on endocrine therapy and for patients who were not ER+/PR+/HER2- (Table 5a). For the results that were significant, a lower TBX3 H-score was associated with lower survival. In multivariable models treating the H-score as continuous, the H-score was significant in patients who were not ER+/PR+/HER2- (Table 6a).
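As a reading aid for the hazard ratio tables, the following Python sketch shows how a reported hazard ratio and its 95% confidence limits relate to the Wald chi-square p-value for a single-degree-of-freedom comparison. It assumes the limits are symmetric Wald limits on the log scale; the function names are hypothetical, and the sketch does not apply to the multi-level p-values reported for Tumor Grade, T Stage, or N Stage.

```python
import math
from typing import Tuple


def wald_from_ci(hr: float, lower: float, upper: float,
                 z_crit: float = 1.959964) -> Tuple[float, float, float, float]:
    """Recover log(HR), its standard error, the Wald chi-square, and the
    two-sided p-value from a hazard ratio and its 95% Wald limits."""
    log_hr = math.log(hr)
    # The interval is log(HR) +/- z_crit * SE, so its width gives the SE.
    se = (math.log(upper) - math.log(lower)) / (2 * z_crit)
    z = log_hr / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail area
    return log_hr, se, z * z, p_value


def flip_referent(hr: float) -> float:
    """Hazard ratio with the referent group swapped (Low vs High)."""
    return 1.0 / hr


# TBX3 H-score category, High vs Low, overall survival (values from Table 4a).
log_hr, se, chi2, p = wald_from_ci(0.721, 0.533, 0.974)
print(round(log_hr, 3), round(se, 3), round(chi2, 2), round(p, 4))
print(round(flip_referent(0.721), 3))
```

With the Table 4a values, the recomputed p-value lands close to the reported 0.0333; small differences are expected because the published estimates are rounded to three decimals. Since the referent group is listed second, a hazard ratio below 1 for High vs Low means the high TBX3 group had the lower hazard.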

In addition, Kaplan-Meier plots are provided for the univariate analyses using the categorical TBX3 H-score for overall and the subgroup analyses.

Disease Free Survival Analysis

Univariate

In univariate analyses, variables significantly related to disease free survival in the Cox proportional hazards regression models were Tumor Grade, Tumor Stage, and Nodal Stage (Table 3b). Higher Tumor Grade, higher Tumor Stage, and positive Nodal Stage were associated with lower disease free survival. TBX3 H-score was not related to disease free survival (log-rank test p-value 0.1557).

Multivariable

In the multivariable analysis (Table 4b), Tumor Grade, Tumor Stage, Nodal Stage, and TBX3 H-score were found to be significant. Higher Tumor Grade, higher Tumor Stage, positive Nodal Stage, and lower TBX3 H-score were associated with lower disease free survival.

Subgroup analysis

For unadjusted (i.e., univariate) tests, the results were not statistically significant for any subgroup (log-rank test p-value 0.137 for ER-positive, 0.610 for ER-negative, 0.385 for patients on endocrine therapy, 0.675 for patients not on endocrine therapy, 0.522 for ER-positive and on endocrine therapy, and 0.384 for ER-positive and not on endocrine therapy).


In multivariable models treating the H-score as dichotomous, H-score category was significant for patients who were ER+ and for patients who were not ER+/PR+/HER2- (Table 5b); in both groups, a lower TBX3 H-score was associated with lower disease free survival. In multivariable models treating the H-score as continuous, the H-score was significant in patients who were not ER+/PR+/HER2- (Table 6b).

In addition, Kaplan-Meier plots are provided for the univariate analyses using the categorical TBX3 H-score for overall and the subgroup analyses.
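For readers unfamiliar with the quantities shown alongside the Kaplan-Meier plots, the sketch below computes a product-limit survival curve and its median in Python. It is an illustration on hypothetical toy data, not a reproduction of the SAS output: it returns the first event time at which the estimated survival drops to 0.5 or below, and it ignores the averaging convention that some software applies when the curve equals exactly 0.5 over an interval. A result of None corresponds to a median that is not reached (NR).

```python
from typing import List, Optional, Sequence, Tuple


def kaplan_meier(times: Sequence[float],
                 events: Sequence[int]) -> List[Tuple[float, float]]:
    """Product-limit estimate: (time, survival probability) at each
    distinct event time. events[i] is 1 for an event, 0 for censoring."""
    curve = []
    survival = 1.0
    for t in sorted({ti for ti, e in zip(times, events) if e}):
        at_risk = sum(1 for ti in times if ti >= t)
        n_events = sum(1 for ti, e in zip(times, events) if ti == t and e)
        survival *= 1 - n_events / at_risk
        curve.append((t, survival))
    return curve


def median_survival(curve: Sequence[Tuple[float, float]]) -> Optional[float]:
    """First time the survival estimate is <= 0.5; None if not reached."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None


# Hypothetical follow-up times in years, with 0 marking censored subjects.
toy_times = [2, 3, 4, 5, 8, 9]
toy_events = [1, 0, 1, 1, 0, 1]
toy_curve = kaplan_meier(toy_times, toy_events)
print(toy_curve)
print(median_survival(toy_curve))
```

Censored subjects contribute to the number at risk until their censoring time but never lower the curve, which is why heavily censored subgroups can have a median that is not reached.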


Table 1. Description of the patients and characteristics of their tumors (n=557)

Characteristic                              Have TBX3 (N=457)    Missing TBX3 (N=100)   p-value*
Age at Diagnosis, y                                                                     0.0066
  Mean (SD)                                 59.0 (15.01)         63.5 (13.51)
  Median                                    58.12                64.30
  Range                                     5.92-94.75           34.49-88.55
Race, n(%)                                                                              0.1647
  White                                     351 (76.81%)         85 (85.00%)
  African American                          102 (22.32%)         14 (14.00%)
  Asian                                     2 (0.44%)            1 (1.00%)
  Other                                     1 (0.22%)            0
  Unknown                                   1 (0.22%)            0
PR Status, n(%)                                                                         0.1986
  Negative                                  147 (32.17%)         27 (27.00%)
  Positive                                  273 (59.74%)         69 (69.00%)
  Unknown/Not Done                          37 (8.10%)           4 (4.00%)
ER Status, n(%)                                                                         0.2268
  Negative                                  100 (21.88%)         17 (17.00%)
  Positive                                  336 (73.52%)         81 (81.00%)
  Unknown/Not Done                          21 (4.60%)           2 (2.00%)
HER-2/neu, n(%)                                                                         1.0000
  Negative                                  214 (46.83%)         54 (54.00%)
  Equivocal                                 3 (0.66%)            0
  Positive                                  41 (8.97%)           10 (10.00%)
  Unknown/Not Done                          199 (43.54%)         36 (36.00%)
Tumor Grade, n(%)                                                                       0.3696
  I                                         126 (27.57%)         34 (34.00%)
  II                                        194 (42.45%)         40 (40.00%)
  III                                       109 (23.85%)         17 (17.00%)
  IV                                        1 (0.22%)            0
  Unknown                                   26 (5.69%)           8 (8.00%)
  Not Applicable                            1 (0.22%)            1 (1.00%)
T Stage, n(%)                                                                           0.0018
  T0                                        1 (0.22%)            1 (1.00%)
  T1                                        239 (52.30%)         69 (69.00%)
  T2                                        177 (38.73%)         20 (20.00%)
  T3                                        21 (4.60%)           7 (7.00%)
  T4                                        14 (3.06%)           3 (3.00%)
  TX/Unknown                                5 (1.09%)            0
N Stage, n(%)                                                                           0.6023
  N0                                        274 (59.96%)         64 (64.00%)
  N1                                        117 (25.60%)         23 (23.00%)
  N2                                        26 (5.69%)           3 (3.00%)
  N3                                        8 (1.75%)            1 (1.00%)
  NX/Unknown                                31 (6.78%)           9 (9.00%)
  Not Applicable                            1 (0.22%)            0
M Stage, n(%)                                                                           0.7417
  M0                                        228 (49.89%)         62 (62.00%)
  M1                                        12 (2.63%)           2 (2.00%)
  MX/Unknown                                211 (46.17%)         36 (36.00%)
  Not Applicable                            6 (1.31%)            0
Endocrine Therapy, n(%)                                                                 0.4102
  Yes                                       272 (59.52%)         66 (66.00%)
  No                                        165 (36.11%)         33 (33.00%)
  Unknown/Missing                           20 (4.38%)           1 (1.00%)
Death, n(%)                                                                             0.4868
  Yes                                       214 (46.83%)         43 (43.00%)
  No                                        243 (53.17%)         57 (57.00%)
Recurrence, n(%)                                                                        0.0445
  Yes                                       137 (29.98%)         20 (20.00%)
  No                                        320 (70.02%)         80 (80.00%)
TMA                                                                                     0.0414
  TMA-1                                     40 (8.75%)           3 (3.00%)
  TMA-2                                     33 (7.22%)           10 (10.00%)
  TMA-3                                     0                    0
  TMA-4                                     29 (6.35%)           13 (13.00%)
  TMA-5                                     32 (7.00%)           8 (8.00%)
  TMA-6                                     27 (5.91%)           10 (10.00%)
  TMA-7                                     33 (7.22%)           10 (10.00%)
  TMA-8                                     40 (8.75%)           4 (4.00%)
  TMA-9                                     38 (8.32%)           8 (8.00%)
  TMA-10                                    38 (8.32%)           3 (3.00%)
  TMA-11                                    41 (8.97%)           4 (4.00%)
  TMA-12                                    35 (7.66%)           11 (11.00%)
  TMA-13                                    34 (7.44%)           9 (9.00%)
  TMA-14                                    37 (8.10%)           7 (7.00%)
Follow-up, y, median (range)                12.9 (0.04-27.9)     12.5 (0.3-19.2)        0.4455
Overall survival, y, median (95% CI)        13.7 (12.0, 18.5)    18.2 (11.2, 19.1)      0.4395
Disease-free survival, y, median (95% CI)** 12.1 (10.7, 14.8)    15.6 (11.2, 19.1)      0.1321

*from t-test, Chi-square, Fisher’s Exact Test, or Log-Rank Test; Missing values not included.
**M1 patients excluded from analysis


Table 2. Bivariate analysis of TBX3 H-score with other tumor markers

TBX3 H-score values are Median (25th percentile, 75th percentile).

Variable        Negative                        Positive                        p-value*
                n     Values                    n     Values
ER              100   50.49 (24.17, 95.21)      336   46.30 (24.69, 80.31)      0.3715
PR              147   46.37 (23.03, 85.93)      273   46.19 (24.50, 80.09)      0.6406
HER-2/neu       214   47.66 (25.35, 87.92)      14    54.31 (32.72, 78.01)      0.8488
ER+/PR+/HER-    66    53.91 (26.64, 91.04)      184   47.37 (25.70, 86.28)      0.7202
Nodal Status    274   49.66 (23.83, 82.14)      151   45.46 (28.27, 82.79)      0.9714

Variable        Grade 1                         Grade 2                         Grade 3                         p-value*
                n     Values                    n     Values                    n     Values
Tumor Grade     126   39.10 (16.86, 61.25)      194   46.11 (25.38, 78.28)      110   72.81 (39.98, 113.79)     <.0001**

Variable        T0/1                            T2                              T3/4                            p-value*
                n     Values                    n     Values                    n     Values
Tumor Stage     240   45.58 (21.70, 76.81)      177   49.30 (28.27, 87.66)      35    60.94 (36.19, 101.64)     0.0063***

*from Wilcoxon Rank Sum test for Hormone Receptor Status and Kruskal-Wallis test for Tumor Grade and Tumor Stage
**the difference between Grade 1 and 2 was significant (p=0.0068), as were the differences between Grade 1 and 3 (p<0.0001) and between Grade 2 and 3 (p<0.0001).
***the difference between Stage 0/1 and 2 was significant (p=0.0238), as was the difference between Stage 0/1 and 3/4 (p=0.0069).


Table 3a. Univariate analysis of other tumor markers for Overall Survival

Variable        n     Comparison*                 HR (95% CI)          p-value**
ER              534   ER- vs ER+                  0.96 (0.70, 1.31)    0.7757
PR              516   PR- vs PR+                  1.23 (0.94, 1.62)    0.1263
HER-2/neu       319   HER-2/neu- vs HER-2/neu+    0.66 (0.42, 1.02)    0.0594
ER+/PR+/HER-    314   No vs Yes                   1.36 (0.93, 2.00)    0.1161
Tumor Grade     521   Grade 1 vs Grade 2          0.91 (0.66, 1.25)    <.0001
                      Grade 1 vs Grade 3          0.49 (0.35, 0.69)
                      Grade 2 vs Grade 3          0.54 (0.40, 0.73)
T Stage         552   T0/1 vs T2                  0.70 (0.53, 0.91)    <.0001
                      T0/1 vs T3/4                0.34 (0.23, 0.50)
                      T2 vs T3/4                  0.49 (0.33, 0.72)
N Stage         516   N+ vs N0                    1.78 (1.37, 2.31)    <.0001

*referent group listed second
**from Wald Chi-square test


Table 3b. Univariate analysis of other tumor markers for Disease Free Survival

Variable        n     Comparison*                 HR (95% CI)          p-value**
ER              520   ER- vs ER+                  1.01 (0.75, 1.37)    0.9330
PR              503   PR- vs PR+                  1.20 (0.92, 1.56)    0.1758
HER-2/neu       311   HER-2/neu- vs HER-2/neu+    0.74 (0.48, 1.15)    0.1856
ER+/PR+/HER-    306   No vs Yes                   1.15 (0.79, 1.68)    0.4743
Tumor Grade     509   Grade 1 vs Grade 2          0.86 (0.63, 1.16)    <.0001
                      Grade 1 vs Grade 3          0.49 (0.35, 0.68)
                      Grade 2 vs Grade 3          0.57 (0.42, 0.77)
T Stage         538   T0/1 vs T2                  0.62 (0.48, 0.80)    <.0001
                      T0/1 vs T3/4                0.40 (0.26, 0.60)
                      T2 vs T3/4                  0.64 (0.42, 0.97)
N Stage         504   N+ vs N0                    1.68 (1.30, 2.17)    <.0001

*referent group listed second
**from Wald Chi-square test


Table 4a. Multivariable analysis for Overall Survival (N=456)

Variable                 Comparison*           HR      Lower 95% CI   Upper 95% CI   p-value**
Tumor Grade              Grade 1 vs Grade 2    0.881   0.610          1.273          0.0263
                         Grade 1 vs Grade 3    0.574   0.380          0.866
                         Grade 1 vs Unknown    0.665   0.377          1.174
                         Grade 2 vs Grade 3    0.651   0.463          0.915
                         Grade 2 vs Unknown    0.755   0.441          1.291
                         Grade 3 vs Unknown    1.160   0.657          2.047
T Stage                  T0/1 vs T2            0.765   0.558          1.050          0.0093
                         T0/1 vs T3/4          0.449   0.272          0.741
                         T0/1 vs TX            2.128   0.492          9.215
                         T2 vs T3/4            0.586   0.371          0.928
                         T2 vs TX              2.782   0.645          12.006
                         T3/4 vs TX            4.743   1.042          21.593
N Stage                  N+ vs N0              1.658   1.211          2.269          <.0001
                         N+ vs NX              0.495   0.297          0.825
                         N0 vs NX              0.299   0.184          0.485
TBX3 H-score Category    High vs Low           0.721   0.533          0.974          0.0333

*referent group listed second
**from Wald Chi-square test


Table 4b. Multivariable analysis for Disease Free Survival (n=444)

Variable                 Comparison*           HR      Lower 95% CI   Upper 95% CI   p-value**
Tumor Grade              Grade 1 vs Grade 2    0.840   0.594          1.188          0.0129
                         Grade 1 vs Grade 3    0.539   0.362          0.801
                         Grade 1 vs Unknown    0.705   0.398          1.246
                         Grade 2 vs Grade 3    0.641   0.460          0.895
                         Grade 2 vs Unknown    0.839   0.487          1.445
                         Grade 3 vs Unknown    1.308   0.730          2.343
T Stage                  T0/1 vs T2            0.660   0.489          0.890          0.0203
                         T0/1 vs T3/4          0.543   0.323          0.916
                         T0/1 vs TX            1.312   0.380          4.534
                         T2 vs T3/4            0.824   0.507          1.338
                         T2 vs TX              1.988   0.579          6.831
                         T3/4 vs TX            2.414   0.648          8.986
N Stage                  N+ vs N0              1.577   1.169          2.126          0.0002
                         N+ vs NX              0.657   0.387          1.115
                         N0 vs NX              0.417   0.252          0.690
TBX3 H-score Category    High vs Low           0.689   0.516          0.922          0.0121

*referent group listed second
**from Wald Chi-square test


Table 5a: Overall Survival - PROC PHREG with H score as dichotomous variable in multivariable models

Obs  Group                          H-Score Category   p-value   Point      Lower 95% Wald     Upper 95% Wald
                                    Parameter*                   Estimate   Confidence Limit   Confidence Limit
1    OS                             High vs Low        0.0333    0.721      0.533              0.974
2    OS for ER+                     High vs Low        0.1884    0.789      0.554              1.123
3    OS for ER-                     High vs Low        0.0709    0.503      0.238              1.060
4    OS for Endocrine Therapy=Yes   High vs Low        0.3786    0.837      0.564              1.243
5    OS for Endocrine Therapy=No    High vs Low        0.0505    0.589      0.346              1.001
6    OS for ER+ and ET=Yes          High vs Low        0.9396    0.983      0.637              1.519
7    OS for ER+ and ET=No           High vs Low        0.0323    0.422      0.191              0.930
8    OS for ER- and ET=Yes          High vs Low        0.3166    0.324      0.036              2.939
9    OS for ER- and ET=No           High vs Low        0.1874    0.559      0.235              1.327
10   OS for ER+/PR+/HER2-=Yes       High vs Low        0.4029    0.792      0.458              1.368
11   OS for ER+/PR+/HER2-=No        High vs Low        0.0036    0.250      0.099              0.635

*referent group listed second


Table 5b: Disease Free Survival - PROC PHREG with H score as dichotomous variable in multivariable models

Obs  Group                           H-Score Category   p-value   Point      Lower 95% Wald     Upper 95% Wald
                                     Parameter*                   Estimate   Confidence Limit   Confidence Limit
1    DFS                             High vs Low        0.0121    0.689      0.516              0.922
2    DFS for ER+                     High vs Low        0.0393    0.700      0.498              0.983
3    DFS for ER-                     High vs Low        0.4854    0.772      0.373              1.598
4    DFS for Endocrine Therapy=Yes   High vs Low        0.0907    0.722      0.496              1.053
5    DFS for Endocrine Therapy=No    High vs Low        0.1217    0.669      0.402              1.113
6    DFS for ER+ and ET=Yes          High vs Low        0.2091    0.770      0.512              1.158
7    DFS for ER+ and ET=No           High vs Low        0.0954    0.538      0.260              1.115
8    DFS for ER- and ET=Yes          High vs Low        0.1736    7.872      0.403              153.711
9    DFS for ER- and ET=No           High vs Low        0.5484    0.760      0.310              1.862
10   DFS for ER+/PR+/HER2-=Yes       High vs Low        0.1215    0.681      0.419              1.108
11   DFS for ER+/PR+/HER2-=No        High vs Low        0.0475    0.391      0.154              0.990

*referent group listed second


Table 6a: Overall Survival - PROC PHREG with H score as continuous variable in multivariable models

Obs  Group                          Parameter   p-value   Point      Lower 95% Wald     Upper 95% Wald
                                                          Estimate   Confidence Limit   Confidence Limit
1    OS                             score       0.1414    0.997      0.994              1.001
2    OS for ER+                     score       0.6979    0.999      0.995              1.004
3    OS for ER-                     score       0.0534    0.992      0.984              1.000
4    OS for Endocrine Therapy=Yes   score       0.8304    0.999      0.994              1.005
5    OS for Endocrine Therapy=No    score       0.4860    0.998      0.992              1.004
6    OS for ER+ and ET=Yes          score       0.9438    1.000      0.995              1.006
7    OS for ER+ and ET=No           score       0.6027    0.997      0.987              1.007
8    OS for ER- and ET=Yes          score       0.2253    0.952      0.880              1.031
9    OS for ER- and ET=No           score       0.2698    0.995      0.987              1.004
10   OS for ER+/PR+/HER2-=Yes       score       0.9418    1.000      0.993              1.006
11   OS for ER+/PR+/HER2-=No        score       0.0034    0.978      0.964              0.993


Table 6b: Disease Free Survival - PROC PHREG with H score as continuous variable in multivariable models

Obs  Group                           Parameter   p-value   Point      Lower 95% Wald     Upper 95% Wald
                                                           Estimate   Confidence Limit   Confidence Limit
1    DFS                             score       0.2099    0.998      0.994              1.001
2    DFS for ER+                     score       0.2528    0.997      0.993              1.002
3    DFS for ER-                     score       0.4409    0.997      0.990              1.004
4    DFS for Endocrine Therapy=Yes   score       0.3920    0.998      0.992              1.003
5    DFS for Endocrine Therapy=No    score       0.8138    0.999      0.994              1.005
6    DFS for ER+ and ET=Yes          score       0.4605    0.998      0.992              1.003
7    DFS for ER+ and ET=No           score       0.4776    0.997      0.988              1.005
8    DFS for ER- and ET=Yes          score       0.2881    1.056      0.955              1.169
9    DFS for ER- and ET=No           score       0.7594    0.999      0.991              1.007
10   DFS for ER+/PR+/HER2-=Yes       score       0.4490    0.998      0.992              1.004
11   DFS for ER+/PR+/HER2-=No        score       0.0378    0.986      0.973              0.999


Summary of results for Overall Survival and DFS analysis – Univariate analyses on the TBX3 H-score category

Overall Survival

Obs  rankscore  Num of Events  Num Censored  Median (95% CI)        Logrank p-value
1    _          214 (46.83)    243 (53.17)                          0.2437
2    Low         69 (52.67)     62 (47.33)   12.06 (8.98, 16.48)
3    High       145 (44.48)    181 (55.52)   15.37 (12.79, NR)

NR=Not Reached

[Figure: Product-Limit Survival Estimates With Number of Subjects at Risk. Survival Probability (y-axis, 0.0 to 1.0) versus os_yr (x-axis, 0 to 25), by rankscore (High, Low); + = Censored. Subjects at risk, High: 326 286 248 213 173 89 52 30 13 3 1; Low: 131 117 98 83 69 43 23 6 1 0.]


Overall Survival for ER+

Obs  rankscore  Num of Events  Num Censored  Median (95% CI)        Logrank p-value
1    _          155 (46.13)    181 (53.87)                          0.4163
2    Low         49 (51.58)     46 (48.42)   12.32 (10.12, .)
3    High       106 (43.98)    135 (56.02)   14.76 (11.74, 20.17)

[Figure: Product-Limit Survival Estimates With Number of Subjects at Risk. Survival Probability (y-axis, 0.0 to 1.0) versus os_yr (x-axis, 0 to 25), by rankscore (High, Low); + = Censored. Subjects at risk, High: 241 221 188 164 131 63 31 16 5 2 1; Low: 95 85 74 63 52 31 14 4 1 0.]


Overall Survival for ER-

Obs  rankscore  Num of Events  Num Censored  Median (95% CI)        Logrank p-value
1    _          44 (44.00)     56 (56.00)                           0.7467
2    Low        15 (50.00)     15 (50.00)    11.34 (5.09, NR)
3    High       29 (41.43)     41 (58.57)    NR (7.41, NR)

NR=Not Reached

[Figure: Product-Limit Survival Estimates With Number of Subjects at Risk. Survival Probability (y-axis, 0.0 to 1.0) versus os_yr (x-axis, 0 to 20), by rankscore (High, Low); + = Censored. Subjects at risk, High: 70 51 47 39 34 19 16 10 5 0; Low: 30 27 21 18 15 11 8 2 0.]


Overall Survival for Endocrine Therapy=Yes

Obs  rankscore  Num of Events  Num Censored  Median (95% CI)        Logrank p-value
1    _          128 (47.06)    144 (52.94)                          0.9840
2    Low         37 (48.68)     39 (51.32)   12.45 (10.59, NR)
3    High        91 (46.43)    105 (53.57)   13.56 (11.66, 17.42)

NR=Not Reached

[Figure: Product-Limit Survival Estimates With Number of Subjects at Risk. Survival Probability (y-axis, 0.0 to 1.0) versus os_yr (x-axis, 0 to 20), by rankscore (High, Low); + = Censored. Subjects at risk, High: 196 182 156 134 107 47 21 11 3 0; Low: 76 67 58 50 45 27 13 5 1 0.]


Overall Survival for Endocrine Therapy=No

Obs  rankscore  Num of Events  Num Censored  Median (95% CI)        Logrank p-value
1    _          80 (48.48)     85 (51.52)                           0.2887
2    Low        29 (58.00)     21 (42.00)     8.98 (6.73, NR)
3    High       51 (44.35)     64 (55.65)    22.83 (9.74, NR)

NR=Not Reached

[Figure: Product-Limit Survival Estimates With Number of Subjects at Risk. Survival Probability (y-axis, 0.0 to 1.0) versus os_yr (x-axis, 0 to 25), by rankscore (High, Low); + = Censored. Subjects at risk, High: 115 90 80 69 56 34 26 15 8 3 1; Low: 50 45 38 31 22 14 10 1 0.]


Overall Survival for ER+ and ET=Yes

Obs  rankscore  Num of Events  Num Censored  Median (95% CI)        Logrank p-value
1    _          112 (45.34)    135 (54.66)                          0.5573
2    Low         29 (44.62)     36 (55.38)   16.48 (11.76, NR)
3    High        83 (45.60)     99 (54.40)   13.61 (11.66, 20.17)

NR=Not Reached

[Figure: Product-Limit Survival Estimates With Number of Subjects at Risk. Survival Probability (y-axis, 0.0 to 1.0) versus os_yr (x-axis, 0 to 20), by rankscore (High, Low); + = Censored. Subjects at risk, High: 182 169 145 126 101 43 18 9 2 0; Low: 65 58 52 46 41 24 11 3 1 0.]


Overall Survival for ER+ and ET=No

Obs  rankscore  Num of Events  Num Censored  Median (95% CI)        Logrank p-value
1    _          38 (52.05)     35 (47.95)                           0.0574
2    Low        18 (69.23)      8 (30.77)     8.48 (5.19, 12.06)
3    High       20 (42.55)     27 (57.45)    15.37 (9.13, NR)

NR=Not Reached

[Figure: Product-Limit Survival Estimates With Number of Subjects at Risk. Survival Probability (y-axis, 0.0 to 1.0) versus os_yr (x-axis, 0 to 25), by rankscore (High, Low); + = Censored. Subjects at risk, High: 47 41 34 31 23 14 9 4 2 2 1; Low: 26 23 20 15 9 5 3 1 0.]


Overall Survival for ER- and ET=Yes

Obs  rankscore  Num of Events  Num Censored  Median (95% CI)        Logrank p-value
1    _          9 (60.00)      6 (40.00)                            0.5911
2    Low        6 (66.67)      3 (33.33)      4.41 (1.52, NR)
3    High       3 (50.00)      3 (50.00)     17.42 (1.48, 17.42)

NR=Not Reached

[Figure: Product-Limit Survival Estimates With Number of Subjects at Risk. Survival Probability (y-axis, 0.0 to 1.0) versus os_yr (x-axis, 0 to 20), by rankscore (High, Low); + = Censored. Subjects at risk, High: 6 5 4 3 2 1 1 0; Low: 9 7 4 3 3 3 2 2 0.]


Overall Survival for ER- and ET=No

Obs  rankscore  Num of Events  Num Censored  Median (95% CI)        Logrank p-value
1    _          35 (42.68)     47 (57.32)                           0.6392
2    Low         9 (42.86)     12 (57.14)    NR (6.54, NR)
3    High       26 (42.62)     35 (57.38)    NR (6.69, NR)

NR=Not Reached

[Figure: Product-Limit Survival Estimates With Number of Subjects at Risk. Survival Probability (y-axis, 0.0 to 1.0) versus os_yr (x-axis, 0 to 20), by rankscore (High, Low); + = Censored. Subjects at risk, High: 61 43 40 33 29 16 14 9 4 0; Low: 21 20 17 15 12 8 6 0.]


Overall Survival for ER+/PR+/HER2-=Yes

Obs  rankscore  Num of Events  Num Censored  Median (95% CI)        Logrank p-value
1    _          70 (38.04)     114 (61.96)                          0.7218
2    Low        21 (42.00)      29 (58.00)   NR (8.98, NR)
3    High       49 (36.57)      85 (63.43)   15.37 (13.56, NR)

NR=Not Reached

[Figure: Product-Limit Survival Estimates With Number of Subjects at Risk. Survival Probability (y-axis, 0.0 to 1.0) versus os_yr (x-axis, 0 to 20), by rankscore (High, Low); + = Censored. Subjects at risk, High: 134 126 107 97 79 26 7 1 0; Low: 50 45 41 34 28 16 6 1 1 0.]


Overall Survival for ER+/PR+/HER2-=No

| Obs | analtype | rankscore | Num of Events (%) | Num Censored (%) | Median, years (95% CI) | Logrank p-value |
|---|---|---|---|---|---|---|
| 1 | OS for ER+/PR+/HER2-=No | _ | 28 (42.42) | 38 (57.58) | | 0.0349 |
| 2 | OS for ER+/PR+/HER2-=No | Low | 11 (64.71) | 6 (35.29) | 5.09 (3.10, NR) | |
| 3 | OS for ER+/PR+/HER2-=No | High | 17 (34.69) | 32 (65.31) | NR (9.92, NR) | |

NR = Not Reached

[Plot: Product-Limit Survival Estimates With Number of Subjects at Risk. Y-axis: Survival Probability (0.0 to 1.0); x-axis: os_yr (0 to 20); curves by rankscore (High, Low); + marks censored observations. Number at risk: High 49, 41, 37, 32, 26, 11, 5, 1, 1, 0; Low 17, 14, 9, 7, 5, 4, 3, 0.]
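The "Logrank p-value" column compares the High and Low rankscore groups. A minimal sketch of a two-group log-rank test follows, assuming hypothetical toy data; the function name `logrank_test` is an illustrative assumption and this is not the software used to produce the tables. The p-value uses the chi-square distribution with one degree of freedom, computed through the complementary error function.

```python
import math


def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; returns (chi-square statistic, p-value)."""
    data = [(t, e, 0) for t, e in zip(times_a, events_a)]
    data += [(t, e, 1) for t, e in zip(times_b, events_b)]
    event_times = sorted({t for t, e, _ in data if e})
    obs_minus_exp = 0.0
    variance = 0.0
    for t in event_times:
        n = sum(1 for u, _, _ in data if u >= t)              # at risk, both groups
        n_a = sum(1 for u, _, g in data if u >= t and g == 0)  # at risk, group A
        d = sum(1 for u, e, _ in data if u == t and e)         # events at t, both groups
        d_a = sum(1 for u, e, g in data if u == t and e and g == 0)
        # Observed minus expected events in group A under the null hypothesis.
        obs_minus_exp += d_a - d * n_a / n
        if n > 1:
            # Hypergeometric variance of the group A event count at time t.
            variance += d * (n_a / n) * (1 - n_a / n) * (n - d) / (n - 1)
    if variance == 0:
        return 0.0, 1.0
    chi2 = obs_minus_exp ** 2 / variance
    p = math.erfc(math.sqrt(chi2 / 2))  # chi-square survival function, 1 d.f.
    return chi2, p


# Hypothetical toy groups (not data from this study): every subject has an event.
chi2_same, p_same = logrank_test([1, 2, 3], [1, 1, 1], [1, 2, 3], [1, 1, 1])
chi2_diff, p_diff = logrank_test([1, 2, 3], [1, 1, 1], [10, 11, 12], [1, 1, 1])
```

Identical groups give a statistic of zero and a p-value of one, while groups with clearly separated event times give a small p-value.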


DFS

| Obs | analtype | rankscore | Num of Events (%) | Num Censored (%) | Median, years (95% CI) | Logrank p-value |
|---|---|---|---|---|---|---|
| 1 | DFS | _ | 231 (51.91) | 214 (48.09) | | 0.1550 |
| 2 | DFS | Low | 76 (59.84) | 51 (40.16) | 10.75 (7.55, 12.40) | |
| 3 | DFS | High | 155 (48.74) | 163 (51.26) | 13.56 (10.92, 15.95) | |

[Plot: Product-Limit Survival Estimates With Number of Subjects at Risk. Y-axis: Survival Probability (0.0 to 1.0); x-axis: pfs_yr (0 to 25); curves by rankscore (High, Low); + marks censored observations. Number at risk: High 318, 263, 217, 190, 151, 81, 48, 26, 11, 2, 1; Low 127, 109, 91, 73, 59, 36, 21, 6, 0.]


DFS for ER+

| Obs | analtype | rankscore | Num of Events (%) | Num Censored (%) | Median, years (95% CI) | Logrank p-value |
|---|---|---|---|---|---|---|
| 1 | DFS for ER+ | _ | 169 (51.37) | 160 (48.63) | | 0.1367 |
| 2 | DFS for ER+ | Low | 56 (60.22) | 37 (39.78) | 10.76 (7.68, 12.40) | |
| 3 | DFS for ER+ | High | 113 (47.88) | 123 (52.12) | 13.56 (10.92, 15.95) | |

[Plot: Product-Limit Survival Estimates With Number of Subjects at Risk. Y-axis: Survival Probability (0.0 to 1.0); x-axis: pfs_yr (0 to 25); curves by rankscore (High, Low); + marks censored observations. Number at risk: High 236, 207, 171, 150, 117, 57, 29, 14, 5, 1, 1; Low 93, 81, 70, 56, 44, 25, 13, 4, 0.]


DFS for ER-

| Obs | analtype | rankscore | Num of Events (%) | Num Censored (%) | Median, years (95% CI) | Logrank p-value |
|---|---|---|---|---|---|---|
| 1 | DFS for ER- | _ | 46 (48.42) | 49 (51.58) | | 0.6103 |
| 2 | DFS for ER- | Low | 14 (50.00) | 14 (50.00) | NR (4.31, NR) | |
| 3 | DFS for ER- | High | 32 (47.76) | 35 (52.24) | 15.81 (3.10, NR) | |

NR = Not Reached

[Plot: Product-Limit Survival Estimates With Number of Subjects at Risk. Y-axis: Survival Probability (0.0 to 1.0); x-axis: pfs_yr (0 to 20); curves by rankscore (High, Low); + marks censored observations. Number at risk: High 67, 43, 36, 32, 27, 17, 14, 8, 3, 0; Low 28, 24, 18, 15, 13, 10, 7, 2, 0.]


DFS for Endocrine Therapy=Yes

| Obs | analtype | rankscore | Num of Events (%) | Num Censored (%) | Median, years (95% CI) | Logrank p-value |
|---|---|---|---|---|---|---|
| 1 | DFS for Endocrine Therapy=Yes | _ | 137 (51.50) | 129 (48.50) | | 0.3849 |
| 2 | DFS for Endocrine Therapy=Yes | Low | 43 (58.11) | 31 (41.89) | 11.76 (8.27, 16.48) | |
| 3 | DFS for Endocrine Therapy=Yes | High | 94 (48.96) | 98 (51.04) | 12.84 (10.76, 15.45) | |

[Plot: Product-Limit Survival Estimates With Number of Subjects at Risk. Y-axis: Survival Probability (0.0 to 1.0); x-axis: pfs_yr (0 to 20); curves by rankscore (High, Low); + marks censored observations. Number at risk: High 192, 171, 142, 124, 95, 43, 20, 11, 3, 0; Low 74, 62, 54, 44, 38, 22, 12, 5, 0.]


DFS for Endocrine Therapy=No

| Obs | analtype | rankscore | Num of Events (%) | Num Censored (%) | Median, years (95% CI) | Logrank p-value |
|---|---|---|---|---|---|---|
| 1 | DFS for Endocrine Therapy=No | _ | 87 (54.38) | 73 (45.63) | | 0.6753 |
| 2 | DFS for Endocrine Therapy=No | Low | 30 (62.50) | 18 (37.50) | 8.48 (5.38, 15.54) | |
| 3 | DFS for Endocrine Therapy=No | High | 57 (50.89) | 55 (49.11) | 13.67 (5.04, 22.83) | |

[Plot: Product-Limit Survival Estimates With Number of Subjects at Risk. Y-axis: Survival Probability (0.0 to 1.0); x-axis: pfs_yr (0 to 25); curves by rankscore (High, Low); + marks censored observations. Number at risk: High 112, 79, 64, 57, 47, 30, 23, 11, 6, 2, 1; Low 48, 42, 35, 27, 19, 12, 9, 1, 0.]


DFS for ER+ and ET=Yes

| Obs | analtype | rankscore | Num of Events (%) | Num Censored (%) | Median, years (95% CI) | Logrank p-value |
|---|---|---|---|---|---|---|
| 1 | DFS for ER+ and ET=Yes | _ | 119 (49.38) | 122 (50.62) | | 0.5218 |
| 2 | DFS for ER+ and ET=Yes | Low | 35 (55.56) | 28 (44.44) | 12.32 (8.85, 16.48) | |
| 3 | DFS for ER+ and ET=Yes | High | 84 (47.19) | 94 (52.81) | 13.56 (11.12, 18.06) | |

[Plot: Product-Limit Survival Estimates With Number of Subjects at Risk. Y-axis: Survival Probability (0.0 to 1.0); x-axis: pfs_yr (0 to 20); curves by rankscore (High, Low); + marks censored observations. Number at risk: High 178, 160, 134, 119, 92, 40, 18, 9, 2, 0; Low 63, 54, 48, 40, 34, 19, 10, 3, 0.]


DFS for ER+ and ET=No

| Obs | analtype | rankscore | Num of Events (%) | Num Censored (%) | Median, years (95% CI) | Logrank p-value |
|---|---|---|---|---|---|---|
| 1 | DFS for ER+ and ET=No | _ | 45 (61.64) | 28 (38.36) | | 0.3838 |
| 2 | DFS for ER+ and ET=No | Low | 19 (73.08) | 7 (26.92) | 7.62 (5.16, 10.76) | |
| 3 | DFS for ER+ and ET=No | High | 26 (55.32) | 21 (44.68) | 10.40 (4.07, 21.67) | |

[Plot: Product-Limit Survival Estimates With Number of Subjects at Risk. Y-axis: Survival Probability (0.0 to 1.0); x-axis: pfs_yr (0 to 25); curves by rankscore (High, Low); + marks censored observations. Number at risk: High 47, 36, 28, 24, 18, 11, 7, 2, 2, 1, 1; Low 26, 23, 20, 14, 8, 4, 3, 1, 0.]


DFS for ER- and ET=Yes

| Obs | analtype | rankscore | Num of Events (%) | Num Censored (%) | Median, years (95% CI) | Logrank p-value |
|---|---|---|---|---|---|---|
| 1 | DFS for ER- and ET=Yes | _ | 10 (66.67) | 5 (33.33) | | 0.8431 |
| 2 | DFS for ER- and ET=Yes | Low | 6 (66.67) | 3 (33.33) | 3.68 (1.40, NR) | |
| 3 | DFS for ER- and ET=Yes | High | 4 (66.67) | 2 (33.33) | 4.81 (1.01, NR) | |

NR = Not Reached

[Plot: Product-Limit Survival Estimates With Number of Subjects at Risk. Y-axis: Survival Probability (0.0 to 1.0); x-axis: pfs_yr (0 to 20); curves by rankscore (High, Low); + marks censored observations. Number at risk: High 6, 4, 3, 1, 0; Low 9, 6, 4, 3, 3, 3, 2, 2, 0.]


DFS for ER- and ET=No

| Obs | analtype | rankscore | Num of Events (%) | Num Censored (%) | Median, years (95% CI) | Logrank p-value |
|---|---|---|---|---|---|---|
| 1 | DFS for ER- and ET=No | _ | 35 (45.45) | 42 (54.55) | | 0.3539 |
| 2 | DFS for ER- and ET=No | Low | 8 (42.11) | 11 (57.89) | NR (4.67, NR) | |
| 3 | DFS for ER- and ET=No | High | 27 (46.55) | 31 (53.45) | 15.81 (3.68, NR) | |

NR = Not Reached

[Plot: Product-Limit Survival Estimates With Number of Subjects at Risk. Y-axis: Survival Probability (0.0 to 1.0); x-axis: pfs_yr (0 to 20); curves by rankscore (High, Low); + marks censored observations. Number at risk: High 58, 37, 31, 29, 25, 15, 13, 7, 2, 0; Low 19, 18, 14, 12, 10, 7, 5, 0.]


DFS for ER+/PR+/HER2-=Yes

| Obs | analtype | rankscore | Num of Events (%) | Num Censored (%) | Median, years (95% CI) | Logrank p-value |
|---|---|---|---|---|---|---|
| 1 | DFS for ER+/PR+/HER2-=Yes | _ | 86 (47.25) | 96 (52.75) | | 0.3162 |
| 2 | DFS for ER+/PR+/HER2-=Yes | Low | 29 (58.00) | 21 (42.00) | 11.12 (7.57, 14.17) | |
| 3 | DFS for ER+/PR+/HER2-=Yes | High | 57 (43.18) | 75 (56.82) | 14.76 (10.76, 18.06) | |

[Plot: Product-Limit Survival Estimates With Number of Subjects at Risk. Y-axis: Survival Probability (0.0 to 1.0); x-axis: pfs_yr (0 to 20); curves by rankscore (High, Low); + marks censored observations. Number at risk: High 132, 118, 97, 86, 68, 23, 7, 1, 0; Low 50, 43, 40, 31, 23, 11, 4, 1, 0.]


DFS for ER+/PR+/HER2-=No

| Obs | analtype | rankscore | Num of Events (%) | Num Censored (%) | Median, years (95% CI) | Logrank p-value |
|---|---|---|---|---|---|---|
| 1 | DFS for ER+/PR+/HER2-=No | _ | 26 (43.33) | 34 (56.67) | | 0.0923 |
| 2 | DFS for ER+/PR+/HER2-=No | Low | 9 (64.29) | 5 (35.71) | 4.27 (1.52, NR) | |
| 3 | DFS for ER+/PR+/HER2-=No | High | 17 (36.96) | 29 (63.04) | NR (6.24, NR) | |

NR = Not Reached

[Plot: Product-Limit Survival Estimates With Number of Subjects at Risk. Y-axis: Survival Probability (0.0 to 1.0); x-axis: pfs_yr (0 to 20); curves by rankscore (High, Low); + marks censored observations. Number at risk: High 46, 37, 30, 29, 23, 11, 5, 1, 1, 0; Low 14, 10, 6, 5, 5, 4, 3, 0.]


Figure S1: Epithelial clusters in individual samples. A) Epithelial cell clusters in breast tissues of D1 to D5. B) Epithelial cell clusters in the first five samples sequenced individually, additional samples sequenced as a pool in two labs (D6-D10), and individual samples from an Asian (Chinese) donor and a BRCA1 mutation carrier. Clustering was done with Seurat, and Loupe Browser was used to explore the expression of various genes.

Figure S2: Expression patterns of known mature luminal (ML), luminal progenitor (LP) and basal/stem/myoepithelial cell-enriched marker genes and different keratins in various clusters.

Figure S3: Prognostic value of cluster-enriched genes in various intrinsic subtypes of breast cancer. Data were generated using TCGA and METABRIC datasets.

Table S1: Breast tissue donor information.

Table S2: Sheet 1: Number of cells per cluster of the first five samples. Sheet 2: Number of cells per cluster in the second analysis. Sheet 3: Average expression levels of various genes in clusters of the first analysis. Sheet 4: Signaling network generated using cluster 11 enriched genes.

Table S3: Distribution of various intrinsic subtypes of breast cancers in each cluster in TCGA and METABRIC datasets.


Table S4: Mutation frequency in tumors enriched for specific cluster genes.


Figure 1: The normal breast contains 13 epithelial clusters. A) Integrated analysis of single cells of the normal breast biopsies of five healthy donors. Epithelial cells dominate among cell types. B) Subclustering of epithelial cell types using CD49f/EpCAM as well as NFIB, TP63, EHF, ELF5, ESR1 and FOXA1 expression patterns. D1-D5 correspond to the numbering of samples. C) Representation of various cell types in each sample. Subclusters in individual samples are shown in Figure S1A. D) Hierarchical clustering of top cluster-enriched genes.


Figure 2: Expression patterns of representative cluster-enriched genes. A) Genes enriched in basal/stem cell clusters. B) Genes enriched in various clusters within luminal progenitor cells.


Figure 3: Mature luminal cells are enriched for ESR1 and XBP1, whereas SFRP1 is enriched in luminal progenitor cells. A) Genes enriched in mature luminal cells. Note that cluster 4 within mature luminal cells is distinctly enriched for MUCL1 and PIP. B) Various cell types in the normal breast of a donor. C) Identification of ESR1 expressing subclusters and genes co-expressed with ESR1 in the normal breasts.


Figure 4: Recharacterization of epithelial cells of the normal breasts with additional samples. A) Combined integrated analyses that included samples in Figure 1, a new sample from an Asian (Chinese) donor, and five new pooled samples. There were 23 clusters of cells, which can be subdivided into three major groups of basal/stem, luminal progenitor, and mature luminal cells. Potential myoepithelial cells (Myo) and mammary stem cells (MaSc) distinct from basal/stem cells are also indicated at the bottom. The bottom panel shows distribution patterns of cell clusters in five samples of the first set and the five pooled samples of the second set. Clusters in individual samples are shown in Figure S1B. Expression patterns of various markers that are used to subclassify clusters are shown in Figure S2. B) CD49f, EpCAM, ALDH1A3, and KRT14 expression in various clusters.


Figure 5: Gene expression in clusters N19 and N0-N1 of Figure 4 overlaps with unique genes in C11 and C12, respectively. A) MKI67, BIRC5, and PCLAF, which are all overexpressed in cluster 11 (Figure 1D), are enriched in N19. B) PTGDS and IGF1, which are overexpressed in cluster 12 (Figure 1D), are enriched in N0-N1 clusters. This cluster also expresses ZEB1 and EGFR.


Figure 6: Breast cancer subtype-specific expression of cluster-signature genes. Breast cancer gene expression data in TCGA (left) and METABRIC datasets were analyzed for enrichment of cluster-specific genes described in Table S2. Clusters 1, 3 and 4 were combined to create cluster 1a because of limited differences. Similarly, clusters 5, 7 and 9 were combined to create cluster 5a. PAM50 intrinsic subtype classifiers were used to subdivide breast cancers into luminal A, luminal B, HER2, and basal subtypes. Enrichment of cluster-specific genes in these subtypes of breast cancer was further analyzed. Additional data can be found in Figure S3.


Figure 7: PDK4 and TBX3 enable further classification of ER+ breast cancers. A) Immunohistochemistry of breast TMA for PDK4 and TBX3. B) ER+ breast cancers that expressed lower levels of PDK4 and had not received endocrine therapy were associated with poorer disease-free survival (DFS) than tumors with higher PDK4. Similarly, ER+ tumors expressing lower levels of both TBX3 and PDK4 were associated with poorer DFS than tumors expressing higher levels of PDK4 and TBX3.
