[Figure 1 flowchart; box text grouped by study component]

Method development and cancer progression model construction, using METABRIC data (1989 tumor and 144 normal samples)
• Identify cancer related genes using regression analysis and stochastic learning
• Identify molecular subtypes
• Identify genetically homogeneous groups using clustering analysis
• Construct a principal tree using graph modeling
• Construct a progression model of breast cancer

Model validation
• Map clinical and genetic variables back onto models: survival data, histological/molecular grade, somatic mutation rate, genome instability index
• Construct a cancer progression model using TCGA RNA-Seq data (1176 tumor and 111 normal samples)
• Perform validation study on 25 datasets (5831 tumor samples)

Driver mutation identification
• Map mutation data onto disease progression paths
• Estimate mutation rates using kernel regression
• Identify genes with significant changes in mutation incidence along progression paths
• Perform hypothesis testing and FDR control

Figure 1: Overview of the presented stepwise study consisting of three major components. First, a comprehensive bioinformatics pipeline was developed and a progression model of breast cancer was constructed. Then, a large-scale validation study was performed to evaluate the validity of the constructed model. Finally, a cancer genome analysis, focusing primarily on the detection of cancer driver gene mutations, was conducted that demonstrated the utility of the progression model. The study incorporated extensive algorithm development (Online Methods) and the analysis of 27 breast cancer datasets for model construction and validation (Online Methods – Datasets).

[Figure 2 panels (a-d). Panel titles, in order: Cancer Related Gene Identification; Clustering Analysis; Principal Curve Formation; Progression Model Construction. In-figure labels: samples, genes, cluster/subtype, normal, dormant, metastasis, progression path.]

Figure 2: Overview of the bioinformatics pipeline for cancer progression modeling.

[Figure 3 panels (a-f). In-figure labels: luminal tumors (red box in the consensus matrix); Cluster ID; Silhouette width (average silhouette width: 0.07); Sample Count; path labels N-B, N-H, N-LB, N-LA; pie-chart legend Basal, H2, LumA, LumB, N.]

Figure 3: Progression modeling analysis of breast cancer performed on the METABRIC data (n = 2,133). (a) Principal component (PC) analysis provided a general view of sample distribution supported by the selected genes. To aid in visualization, each sample was annotated by its PAM50 subtype label and mapped onto a principal tree (black line) in a three-dimensional space. Supplementary Movie 1 provides a clearer picture of data distribution. (b-d) Clustering analysis performed to detect genetically homogeneous tumor groups. (b) The optimal number of clusters was estimated to be ten by gap statistic. (c) Resampling-based consensus clustering analysis identified ten robust and stable clusters. The samples in the red box are predominantly luminal A/B tumors. The consensus matrix clearly showed that luminal tumors can be further refined; however, they do not form clear-cut clusters and overlap significantly, particularly between adjacent nodes, suggesting that they may share a progression relationship. (d) The robustness of clustering assignment was assessed by silhouette width analysis, which classified 1,652 out of 1,989 tumor samples with a positive silhouette width. (e) A progression model of breast cancer built from the METABRIC data. The analysis revealed four major progression paths, referred to as N-B (normal to basal), N-H (normal through luminal A/B to HER2+), N-LB (normal through luminal A to the luminal B terminus), and N-LA (normal to the luminal A terminus). Each model node represents an identified cluster, and the node size is proportional to the number of samples in that cluster. Two connected nodes indicate a potential progressive relationship, and the length of an edge connecting two nodes is proportional to the distance between the two nodes measured along a progression path. The pie chart in each node depicts the percentage of the samples in the node belonging to each of the five PAM50 subtypes. (f) A progression model built from the TCGA RNA-Seq data (n = 1,287). The overall structure of the progression models constructed using the two independent datasets is almost identical.
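The silhouette width analysis mentioned for Figure 3(d) can be illustrated with a minimal sketch. It is not the authors' implementation: the Euclidean metric, the toy two-dimensional data, and the function name `silhouette_widths` are assumptions made only for illustration. For sample i, a is the mean distance to the other members of its own cluster, b is the smallest mean distance to any other cluster, and the width is (b - a) / max(a, b); a positive width means the sample sits closer to its assigned cluster than to any other.

```python
import math
import random

def silhouette_widths(points, labels):
    """Return s_i = (b_i - a_i) / max(a_i, b_i) per sample (needs >= 2 clusters)."""
    clusters = sorted(set(labels))
    widths = []
    for i, p in enumerate(points):
        # Mean distance from sample i to each cluster, excluding itself.
        mean_dist = {}
        for c in clusters:
            d = [math.dist(p, q) for j, q in enumerate(points)
                 if labels[j] == c and j != i]
            if d:
                mean_dist[c] = sum(d) / len(d)
        if labels[i] not in mean_dist:  # singleton cluster: width 0 by convention
            widths.append(0.0)
            continue
        a = mean_dist[labels[i]]
        b = min(v for c, v in mean_dist.items() if c != labels[i])
        denom = max(a, b)
        widths.append((b - a) / denom if denom > 0 else 0.0)
    return widths

# Toy data (hypothetical): two separated groups, one sample deliberately mislabeled.
rng = random.Random(0)
group_a = [(rng.gauss(0, 0.3), rng.gauss(0, 0.3)) for _ in range(20)]
group_b = [(rng.gauss(3, 0.3), rng.gauss(3, 0.3)) for _ in range(20)]
points = group_a + group_b
labels = [0] * 20 + [1] * 20
labels[0] = 1  # this sample should receive a negative width

widths = silhouette_widths(points, labels)
n_positive = sum(w > 0 for w in widths)
print(f"{n_positive} of {len(widths)} samples have a positive silhouette width")
print(f"average silhouette width: {sum(widths) / len(widths):.2f}")
```

Counting the samples with a positive width, as the last lines do, mirrors the kind of summary reported in the caption (1,652 of 1,989 tumor samples).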

[Figure 5 panels (a-g). Panel (a) step titles: Mutation matrix; Mapping mutations onto progression path; Estimating mutation rates; Determining statistical significance (histogram of null statistics, y-axis Counts). Panels (b-f): x-axis Progression Distance, y-axis Mutation Rate. Panel (g): gene labels placed on the TCGA model, with regions labeled normal, luminal A, luminal B, HER2+, and basal.]

Figure 5: Pseudo-time series analysis performed on the TCGA mutation data (n = 958) to identify gene mutations associated with cancer progression. Fifty-one genes were found to have significant changes in their mutation incidences along progression paths (FDR < 0.05). (a) Overview of the proposed MutationPattern method used to delineate the dynamic patterns of individual gene mutations along a progression path. (b-e) Four distinct mutation patterns were observed. Examples of each are depicted: (b) TTN, (c) TP53, (d) MLL3, and (e) CDH1. The red line depicts the estimated mutation rate, and the blue lines were generated from null models built by assuming that the corresponding gene plays no role in cancer development. Each red or blue line in the bar above the figure represents the presence or absence of a mutation in a sample, respectively. The first and second broken lines in (e) indicate the locations where the N-H path intersects with the LA terminal and LB terminal, respectively. (f) Genes showing an upward mutation trend along the N-LA, N-LB, N-H and N-B progression paths. (g) Mapping of significantly progression-associated genes onto the TCGA model. Genes reported at the end of a path are those with an upward trend along the entire path. Genes with a bell-shaped pattern are marked at the bell-peak location. Genes associated with normal samples are those mutated more frequently than expected by chance but without significant changes along any progression path.

Figure 4: Model validation analysis provided substantial support for the validity of the constructed progression models. (a) Disease-specific survival of ten breast cancer subgroups detected in the METABRIC data. A clear trend of worsening survival was identified that was associated with progression along the four major malignant trajectories: normal to basal (N-B path: node 10 to node 6), to the luminal A side-branch (N-LA path: node 2 to node 8), to the luminal B side-branch (N-LB path: node 2 through nodes 7, 3 to node 9), or to HER2+ (N-H path: node 2 through nodes 7, 3, 1, 5 to node 4). (b-d) Spearman's rank correlation analysis of molecular grade, genome instability index, and overall mutation rate along the progression paths. The results aligned well with current theories of cancer evolution. Since the METABRIC data does not contain mutation information, mutation data analysis was performed on the TCGA data (see Figure 3f for the TCGA model).
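The Spearman's rank correlation used in Figure 4(b-d) can be sketched as follows. This is a minimal illustration, not the authors' code: the helper names (`average_ranks`, `pearson`, `spearman`) and the synthetic "grade" values are hypothetical, and no P-value is computed. Spearman's coefficient is the Pearson correlation of the ranks, so it measures whether a variable rises monotonically with position along a progression path without assuming a linear relationship.

```python
import math
import random

def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical example: a noisy score that tends to rise along a path.
rng = random.Random(1)
distance = [i / 99 for i in range(100)]              # position along the path
grade = [2 * d + rng.gauss(0, 0.3) for d in distance]
r_noisy = spearman(distance, grade)
print(f"Spearman R = {r_noisy:.2f}")
```

Because only ranks enter the calculation, any strictly increasing transformation of the variable (for example cubing it) leaves the coefficient unchanged.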

[Figure 4 panels (b-d): scatter plots of (b) Molecular Grade, (c) genome instability index, and (d) Overall Mutation Rate against position along the Normal→Basal, Normal→HER2, Normal→LumB, and Normal→LumA paths. Spearman correlations annotated in the plots, listed in extraction order: (b) R = 0.82 (P = 8.7e−282), R = 0.91 (P = 5.5e−108), R = 0.65 (P = 3.9e−110), R = 0.89 (P = 0); (c) R = 0.64 (P = 5.9e−36), R = 0.53 (P = 3.7e−66), R = 0.48 (P = 6.7e−70), R = 0.78 (P = 6.7e−196); (d) R = 0.59 (P = 9.8e−47), R = 0.42 (P = 1.2e−08), R = 0.54 (P = 4.6e−36), R = 0.4 (P = 3.5e−13).]

Table 1: Fifty-one genes were identified as having a significant change in mutation incidence in at least one progression path (FDR < 0.05). If a gene was found in multiple paths, only the smallest P-value and FDR are reported. A path highlighted in red means that the gene has a monotonically increasing mutation pattern in that path, and a path highlighted in green means that the gene has a bell-shaped mutation pattern. # Samples: the number of samples with mutations.

Path markers: one • per marked path among N-H, N-LB, N-LA, and N-B; the column positions and the red/green highlighting did not survive extraction.

Rank | Gene | Full Name | Paths marked | # Samples | P-value | FDR
1 | PIK3CA | phosphoinositide-3-kinase, catalytic, alpha polypeptide | • • • | 314 | 0 | 0
2 | TP53 | tumor protein p53 | • • • • | 291 | 0 | 0
3 | CDH1 | cadherin 1, type 1, E-cadherin (epithelial) | • • • | 103 | 0 | 0
4 | GATA3 | GATA binding protein 3 | • • • | 95 | 0 | 0
5 | MAP3K1 | mitogen-activated protein kinase kinase kinase 1 | • • | 70 | 0 | 0
6 | MAP2K4 | mitogen-activated protein kinase kinase 4 | • • • | 32 | 0 | 0
7 | RUNX1 | runt-related transcription factor 1 (acute myeloid leukemia 1; aml1 oncogene) | • • • | 28 | 0 | 0
8 | TBX3 | T-box 3 (ulnar mammary syndrome) | • • • | 27 | 0 | 0
9 | CBFB | core-binding factor, beta subunit | • • • | 23 | 0 | 0
10 | CTCF | CCCTC-binding factor (zinc finger protein) | • • • | 17 | 0 | 0
11 | ATP1A4 | ATPase, Na+/K+ transporting, alpha 4 polypeptide | • | 13 | 0 | 0
12 | ZNF587 | zinc finger protein 587 | • | 10 | 0 | 0
13 | ADAM29 | ADAM metallopeptidase domain 29 | • | 9 | 0 | 0
14 | HLA-DRB1 | major histocompatibility complex, class II, DR beta 1 | • | 8 | 0 | 0
15 | HAUS5 | HAUS augmin-like complex, subunit 5 | • | 5 | 0 | 0
16 | ARID1A | AT rich interactive domain 1A (SWI-like) | • | 26 | 1.00E-04 | 2.80E-03
17 | MYO6 | myosin VI | • | 10 | 1.00E-04 | 2.80E-03
18 | NPAS4 | neuronal PAS domain protein 4 | • | 9 | 1.00E-04 | 2.80E-03
19 | MED23 | mediator complex subunit 23 | • • | 14 | 1.00E-04 | 3.03E-03
20 | HIST1H2BC | histone cluster 1, H2bc | • | 6 | 1.00E-04 | 4.20E-03
21 | TRIM42 | tripartite motif-containing 42 | • | 6 | 1.00E-04 | 4.20E-03
22 | ATP10B | ATPase, class V, type 10B | • | 17 | 2.00E-04 | 5.17E-03
23 | DOCK11 | dedicator of cytokinesis 11 | • | 20 | 2.00E-04 | 6.72E-03
24 | CABYR | calcium binding tyrosine-(Y)-phosphorylation regulated | • | 6 | 2.00E-04 | 6.72E-03
25 | GRHL2 | grainyhead-like 2 (Drosophila) | • | 8 | 3.00E-04 | 7.26E-03
26 | QSER1 | glutamine and serine rich 1 | • | 20 | 3.00E-04 | 7.96E-03
27 | ITSN2 | intersectin 2 | • | 12 | 3.00E-04 | 7.96E-03
28 | N4BP2 | NEDD4 binding protein 2 | • | 10 | 3.00E-04 | 7.96E-03
29 | ABCC1 | ATP-binding cassette, sub-family C (CFTR/MRP), member 1 | • | 11 | 4.00E-04 | 9.60E-03
30 | SF3B1 | splicing factor 3b, subunit 1, 155kDa | • • | 16 | 5.00E-04 | 1.01E-02
31 | MYB | v-myb myeloblastosis viral oncogene homolog (avian) | • | 12 | 5.00E-04 | 1.01E-02
32 | GPS2 | G protein pathway suppressor 2 | • | 11 | 5.00E-04 | 1.10E-02
33 | BBS9 | Bardet-Biedl syndrome 9 | • | 9 | 5.00E-04 | 1.10E-02
34 | RB1 | retinoblastoma 1 (including osteosarcoma) | • | 19 | 7.00E-04 | 1.41E-02
35 | CTTNBP2 | cortactin binding protein 2 | • | 13 | 7.00E-04 | 1.41E-02
36 | PTPRD | protein tyrosine phosphatase, receptor type, D | • | 17 | 7.00E-04 | 1.47E-02
37 | LETM1 | leucine zipper-EF-hand containing transmembrane protein 1 | • | 9 | 7.00E-04 | 1.47E-02
38 | ZMYM4 | zinc finger, MYM-type 4 | • | 12 | 8.00E-04 | 1.55E-02
39 | CDH12 | cadherin 12, type 2 (N-cadherin 2) | • | 10 | 8.00E-04 | 1.58E-02
40 | RP1 | retinitis pigmentosa 1 (autosomal dominant) | • | 18 | 1.00E-03 | 1.87E-02
41 | ERBB3 | v-erb-b2 erythroblastic leukemia viral oncogene homolog 3 (avian) | • | 17 | 1.50E-03 | 2.80E-02
42 | KIF21B | kinesin family member 21B | • | 14 | 1.80E-03 | 2.90E-02
43 | TGS1 | trimethylguanosine synthase homolog (S. cerevisiae) | • | 12 | 2.30E-03 | 3.48E-02
44 | COL19A1 | collagen, type XIX, alpha 1 | • | 12 | 2.60E-03 | 4.37E-02
45 | WSCD2 | WSC domain containing 2 | • | 12 | 2.50E-03 | 4.50E-02
46 | PEG3 | paternally expressed 3 | • | 19 | 2.70E-03 | 4.54E-02
47 | KIAA2022 | KIAA2022 | • • | 14 | 2.70E-03 | 4.54E-02
48 | GNAS | GNAS complex locus | • | 11 | 3.60E-03 | 4.59E-02
49 | CDKN1B | cyclin-dependent kinase inhibitor 1B (p27, Kip1) | • | 10 | 3.50E-03 | 4.59E-02
50 | TBL1XR1 | transducin (beta)-like 1 X-linked receptor 1 | • | 10 | 3.40E-03 | 4.59E-02
51 | CNTLN | centlein, centrosomal protein | • | 13 | 3.10E-03 | 4.96E-02
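The workflow behind Figure 5 and Table 1 (estimate a mutation rate along a progression path by kernel regression, compare it with null models, then control the FDR) can be sketched as below. This is a rough illustration under stated assumptions, not the MutationPattern method itself: the Gaussian kernel, the bandwidth, the range-of-rate test statistic, the label-permutation null, the Benjamini-Hochberg adjustment, and all function names and toy data are choices made here for illustration, since the text above does not specify them.

```python
import math
import random

def kernel_weights(t, grid, bandwidth):
    """Row-normalized Gaussian kernel weights: one row per grid position."""
    rows = []
    for g in grid:
        w = [math.exp(-0.5 * ((g - ti) / bandwidth) ** 2) for ti in t]
        total = sum(w)
        rows.append([wi / total for wi in w])
    return rows

def rate_curve(weights, mutated):
    """Nadaraya-Watson estimate of the mutation rate at each grid position."""
    return [sum(wi * mi for wi, mi in zip(row, mutated)) for row in weights]

def change_statistic(curve):
    """Assumed test statistic: range of the smoothed rate along the path."""
    return max(curve) - min(curve)

def permutation_pvalue(weights, mutated, n_perm, rng):
    """Null model (assumed): mutation labels are unrelated to path position."""
    observed = change_statistic(rate_curve(weights, mutated))
    shuffled = list(mutated)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if change_statistic(rate_curve(weights, shuffled)) >= observed:
            exceed += 1
    return (exceed + 1) / (n_perm + 1)

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted P-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Toy data (hypothetical): 300 samples placed along a path of length 1.4.
rng = random.Random(2)
t = [rng.uniform(0, 1.4) for _ in range(300)]
trend_gene = [1 if rng.random() < 0.02 + 0.68 * ti / 1.4 else 0 for ti in t]
flat_gene = [1 if rng.random() < 0.2 else 0 for _ in t]

grid = [0.1 * k for k in range(15)]
weights = kernel_weights(t, grid, bandwidth=0.15)
curve_trend = rate_curve(weights, trend_gene)
p_trend = permutation_pvalue(weights, trend_gene, n_perm=200, rng=rng)
p_flat = permutation_pvalue(weights, flat_gene, n_perm=200, rng=rng)
fdr = benjamini_hochberg([p_trend, p_flat])
print(f"trend gene: P = {p_trend:.3f}, flat gene: P = {p_flat:.3f}, FDR = {fdr}")
```

The kernel weights depend only on sample positions, so they are computed once and reused for every permutation; only the mutation labels are shuffled. With a finite number of permutations the smallest attainable P-value is 1 / (n_perm + 1), so very small P-values can only be resolved by running more permutations.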