Figure 1: Overview of the Presented Stepwise Study Consisting of Three Major Components

Figure 1: Overview of the presented stepwise study consisting of three major components. First, a comprehensive bioinformatics pipeline was developed and a progression model of breast cancer was constructed. Then, a large-scale validation study was performed to evaluate the validity of the constructed model. Finally, a cancer genome analysis, focusing primarily on the detection of cancer driver gene mutations, was conducted that demonstrated the utility of the progression model. The study incorporated extensive algorithm development (Online Methods) and the analysis of 27 breast cancer datasets for model construction and validation (Online Methods – Datasets).

[Figure 1 flowchart content:]

  • Method development and cancer progression model construction (METABRIC data: 1989 tumor and 144 normal samples): identify cancer related genes using regression analysis and stochastic learning; identify molecular subtypes (genetically homogeneous groups) using clustering analysis; construct a principal tree using graph modeling; construct a progression model of breast cancer; map clinical and genetic variables back onto models.
  • Model validation: construct a cancer progression model using TCGA RNA-Seq data (1176 tumor and 111 normal samples); perform a validation study on 25 datasets (5831 tumor samples) using survival data, histological/molecular grade, somatic mutation rate, and genome instability index.
  • Driver gene mutation identification: map mutation data onto disease progression paths; estimate mutation rates using kernel regression; identify genes with significant changes in mutation incidence along progression paths; perform hypothesis testing and FDR control.

Figure 2: Overview of the bioinformatics pipeline for cancer progression modeling.

[Figure 2 panels (a-d): Cancer Related Gene Identification, Clustering Analysis, Principal Curve Formation, Progression Model Construction. Labels in the drawing: samples, genes, cluster/subtype, progression path, normal, dormant, metastasis.]
Figure 3: Progression modeling analysis of breast cancer performed on the METABRIC data (n = 2,133). (a) Principal component (PC) analysis provided a general view of sample distribution supported by the selected genes. To aid in visualization, each sample was annotated by its PAM50 subtype label, and mapped onto a principal tree (black line) in a three-dimensional space. Supplementary Movie 1 provides a clearer picture of data distribution. (b-d) Clustering analysis performed to detect genetically homogeneous tumor groups. (b) The optimal number of clusters was estimated to be ten by gap statistic. (c) Resampling-based consensus clustering analysis identified ten robust and stable clusters. The samples in the red box are predominantly luminal A/B tumors. The consensus matrix clearly showed that luminal tumors can be further refined; however, they do not form clear-cut clusters and have significant overlaps, particularly between adjacent nodes, suggesting that they may share a progression relationship. (d) The robustness of clustering assignment was assessed by silhouette width analysis that classified 1,652 out of 1,989 tumor samples with a positive silhouette width. (e) A progression model of breast cancer built from the METABRIC data. The analysis revealed four major progression paths, referred to as N-B (normal to basal), N-H (normal through luminal A/B to HER2+), N-LB (normal through luminal A to the luminal B terminus), and N-LA (normal to the luminal A terminus). Each model node represents an identified cluster and the node size is proportional to the number of samples in that cluster. Two connected nodes indicate a potential progressive relationship, and the length of an edge connecting two nodes is proportional to the distance between the two nodes measured along a progression path. The pie chart in each node depicts the percentage of the samples in the node belonging to one of the five PAM50 subtypes. (f) A progression model built from the TCGA RNA-Seq data (n = 1,287). The overall structure of the progression models constructed using the two independent datasets is almost identical.

[Figure 3 panel annotations: (c) red box marked "luminal tumors"; (d) x-axis silhouette width, grouped by cluster ID 1 to 10, average silhouette width 0.07; (e, f) nodes numbered by cluster ID, paths labeled N-B, N-H, N-LB, N-LA, pie-chart legend Basal, H2, LumA, LumB, N; inset bar chart of sample count by cluster ID.]
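The silhouette width check described for Figure 3d can be illustrated with a minimal sketch. It assumes Euclidean distance and the standard definition s(i) = (b − a) / max(a, b); the source does not state which distance measure was used, and the function name `silhouette_widths` and the toy data are hypothetical.

```python
import math

def silhouette_widths(points, labels):
    """Return the silhouette width s(i) of every sample (needs >= 2 clusters)."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)

    widths = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            widths.append(0.0)  # common convention for singleton clusters
            continue
        # a: mean distance to the other members of the sample's own cluster
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        widths.append((b - a) / denom if denom > 0 else 0.0)
    return widths

# Toy data (hypothetical): two tight groups plus one sample midway between them.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0), (5.0, 0.5)]
labels = [0, 0, 1, 1, 0]

widths = silhouette_widths(points, labels)
n_positive = sum(w > 0 for w in widths)
average_width = sum(widths) / len(widths)
print(f"{n_positive} of {len(widths)} samples have a positive silhouette width")
print(f"average silhouette width: {average_width:.2f}")
```

A sample with a positive width sits closer, on average, to its own cluster than to the nearest other cluster. Counting such samples is the kind of summary the caption reports (1,652 of 1,989 tumor samples).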
Figure 4: Model validation analysis provided substantial support for the validity of the constructed progression models. (a) Disease-specific survival of ten breast cancer subgroups detected in the METABRIC data. A clear trend of worsening survival function was identified that was associated with progression along the four major malignant trajectories - normal to either basal (N-B path: node 10 to node 6), the luminal A side-branch (N-LA path: node 2 to node 8), luminal B side-branch (N-LB path: node 2 through nodes 7, 3 to node 9), or to HER2+ (N-H path: node 2 through nodes 7, 3, 1, 5 to node 4). (b-d) Spearman’s rank correlation analysis of molecular grade, genome instability index, and overall mutation rate along the progression paths. The results aligned well with current theories of cancer evolution. Since the METABRIC data does not contain mutation information, mutation data analysis was performed on the TCGA data (see Figure 3f for the TCGA model).

[Figure 4 panel annotations: survival curves keyed by cluster ID; scatter-plot legend LumB, Basal, H2, LumA, NL; correlation labels R = 0.82 (P = 8.7e−282), R = 0.91 (P = 5.5e−108), R = 0.65 (P = 3.9e−110), R = 0.89 (P = 0).]

Figure 5: Pseudo-time series analysis performed on the TCGA mutation data (n = 958) to identify gene mutations associated with cancer progression. Fifty-one genes were found to have significant changes in their mutation incidences along progression paths (FDR<0.05). (a) Overview of the proposed MutationPattern method used to delineate the dynamic patterns of individual gene mutations along a progression path. (b-e) Four distinct mutation patterns were observed. Examples of each are depicted: (b) TTN, (c) TP53, (d) MLL3, and (e) CDH1. The red line depicts the estimated mutation rate, and blue lines were generated from null models built by assuming that the corresponding gene plays no role in cancer development. Each red or blue line in the bar above the figure represents the presence or absence of a mutation in a sample, respectively. The first and second broken lines in (e) indicate the locations where the N-H path intersects with the LA terminal and LB terminal, respectively. (f) Genes showing an upward mutation trend along the N-LA, N-LB, N-H and N-B progression paths. (g) Mapping of significantly progression-associated genes onto the TCGA model. Genes reported at the end of a path are those with an upward trend along the entire path. Genes with a bell-shaped pattern are marked at the bell-peak location. Genes associated with normal samples are those mutated more frequently than random chance, but do not have significant changes along any progression path.

[Figure 5 panel annotations: (a) workflow steps: mutation matrix, mapping mutations onto progression path, estimating mutation rates, determining statistical significance, shown for TP53 with a histogram of null statistics (y-axis: counts); (b-f) x-axis progression distance, y-axis mutation rate; (g) gene symbols placed at the normal, luminal A, luminal B, HER2+ and basal positions of the model.]
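The mutation analysis summarized for Figure 5 (estimate a mutation rate along a progression path, compare it with null models, control the FDR) can be sketched as follows. This is an illustration under stated assumptions, not the MutationPattern implementation. The Gaussian Nadaraya-Watson smoother, the bandwidth of 0.15, the max-minus-min test statistic, the label-permutation null, and the Benjamini-Hochberg adjustment are all choices made here, as are the function names and the toy data.

```python
import math
import random

def kernel_mutation_rate(distances, mutated, grid, bandwidth=0.15):
    """Nadaraya-Watson estimate (Gaussian kernel) of mutation rate on a grid."""
    rates = []
    for g in grid:
        weights = [math.exp(-0.5 * ((d - g) / bandwidth) ** 2) for d in distances]
        rates.append(sum(w * m for w, m in zip(weights, mutated)) / sum(weights))
    return rates

def trend_statistic(rates):
    """Assumed test statistic: spread of the smoothed mutation-rate curve."""
    return max(rates) - min(rates)

def permutation_p_value(distances, mutated, grid, n_perm=200, seed=0):
    """Null model (assumption): mutation labels are unrelated to path position."""
    rng = random.Random(seed)
    observed = trend_statistic(kernel_mutation_rate(distances, mutated, grid))
    shuffled = list(mutated)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        null_stat = trend_statistic(kernel_mutation_rate(distances, shuffled, grid))
        if null_stat >= observed:
            exceed += 1
    return (exceed + 1) / (n_perm + 1)

def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Toy data (hypothetical): 100 samples ordered along a path of length 1.4.
grid = [0.1 * k for k in range(15)]
distances = [1.4 * i / 99 for i in range(100)]
rising_gene = [1 if i >= 60 else 0 for i in range(100)]  # mutated late on the path
flat_gene = [i % 2 for i in range(100)]                  # no trend along the path

p_values = [permutation_p_value(distances, g, grid) for g in (rising_gene, flat_gene)]
q_values = benjamini_hochberg(p_values)
print("p-values:", p_values)
print("BH-adjusted:", q_values)
```

In this toy setup, a gene would be called progression-associated when its adjusted value falls below the chosen FDR threshold (the caption uses FDR<0.05). The smoothed curve plays the role of the red line in panels (b-e), and the curves from shuffled labels play the role of the blue null lines.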
