
Vignette for Fletcher2013b: master regulators of FGFR2 signalling and breast cancer risk.

Mauro AA Castro,* Michael NC Fletcher,* Xin Wang, Ines de Santiago, Martin O’Reilly, Suet-Feung Chin, Oscar M Rueda, Carlos Caldas, Bruce AJ Ponder, Florian Markowetz and Kerstin B Meyer†

May 22, 2021

Contents

1 Description
2 Data sources for regulatory network inference
3 Reconstruction of the breast cancer networks
   3.1 Transcription network inference pipeline
   3.2 Pre-processing of expression data
   3.3 Mutual information (MI) computation
   3.4 Application of data processing inequality (DPI)
4 Master Regulator Analysis (MRA)
5 Transcriptional network of consensus master regulators
6 Enrichment maps
7 Session information

*joint first authors
†Cancer Research UK - Cambridge Research Institute, Robinson Way, Cambridge, CB2 0RE, UK.

1 Description

The package Fletcher2013b contains a set of transcriptional networks and related datasets that can be used to reproduce the results in Fletcher et al. [1]. The first part of this study is available in the package Fletcher2013a, which contains the time-course data and is distributed separately to keep the data better organized. Here we provide the R scripts to reproduce the bioinformatics analysis. Please refer to Fletcher et al. [1] for more details about the biological background and experimental design of the study.

2 Data sources for regulatory network inference

The METABRIC breast cancer gene expression dataset [2] was used in two cohorts, a discovery set (n = 997) and a validation set (n = 995). The METABRIC normal breast expression dataset (n = 144) was used as a non-cancer, tissue control and a T-cell acute lymphoblastic leukaemia gene expression dataset (n = 57) was included as a non-related tissue, cancer control [3]. These data sets are publicly available at:

• METABRIC discovery set: EGAD00010000210

• METABRIC validation set: EGAD00010000211

• METABRIC normals: EGAD00010000212

• T-cell ALL: GSE33469

3 Reconstruction of the breast cancer transcription networks

Because of the size of the datasets and the parallel processing required to compute the transcription networks, this package provides 4 pre-processed networks: rtni1st (METABRIC discovery set), rtni2nd (METABRIC validation set), rtniNormals (METABRIC normals) and rtniTALL (T-cell ALL). These R objects are required to reproduce the analyses throughout this vignette:

> library(Fletcher2013b)
> data(rtni1st)
> data(rtni2nd)
> data(rtniNormals)
> data(rtniTALL)

Next we describe the main methods used to compute the transcription networks. The R package RTN also provides a short tutorial demonstrating the inference pipeline.

3.1 Transcription network inference pipeline

In order to make all methods used in this study available to other users, we implemented the R package RTN (reconstruction of transcriptional networks and analysis of master regulators), which is designed for the reconstruction of transcriptional networks using mutual information [4]. It is implemented with S4 classes in R [5] and extends several methods previously validated for assessing transcriptional regulatory units, or regulons (e.g. MRA [6] and GSEA [7]). The main advantage of using RTN lies in the provision of a statistical pipeline that runs the network inference in a stepwise process, together with parallel computing support for the computationally demanding steps. The RTN package should be installed prior to running this vignette. Additionally, RTN provides a tutorial showing how to compute a transcriptional network using a toy example, which is generated with default options and pValueCutoff=0.05. Here, the pre-processed breast cancer transcription networks were generated with a more stringent threshold, pValueCutoff=1e-6. To reproduce these large networks we suggest, as minimum computational resources, a cluster with >= 8 nodes and RAM >= 8 GB per node (specific routines should be tuned for the available resources). The inference pipeline is executed in four steps: (i) check the consistency of the input data and remove non-informative probes, (ii) compute the mutual information and remove the non-significant associations by permutation analysis, (iii) remove unstable interactions by bootstrap and (iv) apply the data processing inequality filter. These steps are described next.

3.2 Pre-processing of gene expression data

Non-informative microarray probes with low dynamic range of expression were removed from the gene expression matrices. This procedure filters out probes whose coefficient of variation (CV) falls below the median CV. For breast cancer samples, this CV threshold yields a good overlap (>90%) with the corresponding differential expression analysis of cancer vs. normal cohort samples. The differential expression analysis was therefore used for quality control purposes. The advantage of using the CV here is that the same procedure could be applied across all samples, guaranteeing statistical independence between cancer and normal cohorts. In an alternative approach, for a given gene with multiple probes the RTN package selects the probe exhibiting the maximum CV, which yields a better representation of genes. We have carried out both approaches and the overall results converged to the same scenario as described in [1].
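The two probe-filtering approaches described above can be illustrated with a minimal base R sketch. It is not the RTN implementation: the expression matrix is simulated, and the names expr, probe2gene, keep_median and keep_maxcv are hypothetical and introduced only for this example.

```r
# Toy expression matrix: rows are probes, columns are samples (simulated values).
set.seed(1)
expr <- matrix(rlnorm(8 * 20, meanlog = 5, sdlog = 0.3), nrow = 8,
               dimnames = list(paste0("probe", 1:8), paste0("s", 1:20)))

# Hypothetical probe-to-gene annotation; genes A and C have several probes.
probe2gene <- c(probe1 = "A", probe2 = "A", probe3 = "B", probe4 = "C",
                probe5 = "C", probe6 = "C", probe7 = "D", probe8 = "E")

# Coefficient of variation per probe: standard deviation divided by the mean.
cv <- apply(expr, 1, sd) / rowMeans(expr)

# Approach 1: keep only probes whose CV is at or above the median CV.
keep_median <- names(cv)[cv >= median(cv)]

# Approach 2: for each gene, keep the single probe with the maximum CV.
keep_maxcv <- unname(vapply(split(cv, probe2gene[names(cv)]),
                            function(x) names(x)[which.max(x)],
                            character(1)))
```

The first approach halves the number of probes regardless of gene identity, whereas the second keeps exactly one probe per annotated gene.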

3.3 Mutual information (MI) computation

The MI algorithm used in the RTN package extends the methods available in minet [8]. The structure of the regulatory network was derived by mapping all significant interactions between TF and target probes. The TF list was derived from that used in a previous ARACNe/MRA publication [6] by converting Affymetrix probe IDs into the equivalent probes on the Illumina HT-12 Expression BeadChip. Non-significant interactions were removed by permutation analysis. Unstable interactions were additionally removed by bootstrap analysis in order to create a consensus bootstrap network, referred to as the transcriptional network (TN).
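As an illustration of how a permutation analysis separates significant from non-significant MI values, the following base R sketch uses a simple plug-in estimator on equal-width bins. This estimator, the simulated data and the helper names mi_binned and mi_perm_test are assumptions made for the example; they are not the estimator or the code used by RTN or minet.

```r
# Plug-in mutual information (in nats) between two numeric vectors,
# after discretising each one into equal-width bins.
mi_binned <- function(x, y, bins = 5) {
  joint <- table(cut(x, bins), cut(y, bins)) / length(x)
  px <- rowSums(joint)
  py <- colSums(joint)
  nz <- joint > 0                      # skip empty cells (0 * log 0 = 0)
  sum(joint[nz] * log(joint[nz] / outer(px, py)[nz]))
}

# Permutation test: shuffle the target to break any dependence and count
# how often the permuted MI reaches the observed MI.
mi_perm_test <- function(tf, target, n_perm = 1000, bins = 5) {
  obs <- mi_binned(tf, target, bins)
  null <- replicate(n_perm, mi_binned(tf, sample(target), bins))
  list(mi = obs, p.value = (1 + sum(null >= obs)) / (1 + n_perm))
}

set.seed(42)
tf        <- rnorm(200)                 # simulated TF expression
dependent <- tf + rnorm(200, sd = 0.5)  # target driven by the TF
unrelated <- rnorm(200)                 # target independent of the TF

res_dep <- mi_perm_test(tf, dependent)
res_ind <- mi_perm_test(tf, unrelated)
```

Note that with 1000 permutations the smallest attainable p-value is about 1e-3, so a cutoff such as pValueCutoff=1e-6 requires a different strategy for building the null distribution than this per-pair sketch.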

3.4 Application of data processing inequality (DPI)

DPI was applied to the TN with tolerance = 0.0 to remove interactions likely to be mediated by another TF [9]. As DPI removes the weakest edge of each network triplet, the vast majority of indirect interactions are likely to be removed. We also tested DPI tolerance ranging from 0.1 to 0.5 in order to assess the stability of the regulatory units identified in the transcriptional networks. Both the TN and the post-DPI network (filtered transcriptional network) were used in the MRA analysis.
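The effect of the DPI on a single triplet can be sketched in base R as follows. The function dpi_filter is hypothetical and written only for this example; in particular, the multiplicative form of the tolerance used here is one common formulation and is an assumption, not necessarily the one implemented in RTN.

```r
# Apply the DPI to a symmetric MI matrix with at least 3 nodes.
# In every closed triplet the weakest edge is removed, unless it lies
# within the tolerance `eps` of the second weakest edge (assumed form).
dpi_filter <- function(mim, eps = 0) {
  n <- nrow(mim)
  drop <- matrix(FALSE, n, n)
  for (i in 1:(n - 2)) for (j in (i + 1):(n - 1)) for (k in (j + 1):n) {
    edges <- rbind(c(i, j), c(i, k), c(j, k))
    w <- mim[edges]                    # weights of the three edges
    if (any(w <= 0)) next              # triplet is not fully connected
    o <- order(w)
    if (w[o[1]] < w[o[2]] * (1 - eps)) {
      e <- edges[o[1], ]
      drop[e[1], e[2]] <- drop[e[2], e[1]] <- TRUE
    }
  }
  mim[drop] <- 0                       # remove marked edges in one pass
  mim
}

# Toy chain TF1 -> TF2 -> target: the TF1-target edge is indirect.
genes <- c("TF1", "TF2", "target")
mim <- matrix(c(0,   0.8, 0.3,
                0.8, 0,   0.6,
                0.3, 0.6, 0), 3, 3, dimnames = list(genes, genes))

filtered <- dpi_filter(mim)              # tolerance 0: weakest edge removed
relaxed  <- dpi_filter(mim, eps = 0.6)   # large tolerance: edge retained
```

Edges are marked against the original matrix and removed only at the end, so the result does not depend on the order in which triplets are visited.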

4 Master Regulator Analysis (MRA)

The application of MRA has been described in detail in a previous publication [6]. MRA computes the overlap between two lists: the TFs with their candidate regulated genes (referred to as regulons) and gene expression signatures from other sources. In this case, the MRA analytical pipeline uses a hypergeometric test to estimate the statistical significance of the overlap between each regulon in a TN and a given signature. The stability of the MRA results was tested by comparing the results between the filtered and unfiltered TNs and removing master regulators that were inconsistent between the two (i.e. selected regulons must be significant in both networks). Next we retrieve one of the FGFR2 signatures (i.e. differentially expressed genes from Exp1) and run the MRA analysis on the METABRIC discovery set:

> sigt <- Fletcher2013pipeline.deg(what="Exp1",idtype="")
> MRA1 <- Fletcher2013pipeline.mra1st(hits=sigt$E2FGF10, verbose=FALSE)

We provide the following functions to run the MRA analysis on the other 3 TN networks:

> MRA2 <- Fletcher2013pipeline.mra2nd(hits=sigt$E2FGF10)
> MRA3 <- Fletcher2013pipeline.mraNormals(hits=sigt$E2FGF10)
> MRA4 <- Fletcher2013pipeline.mraTALL(hits=sigt$E2FGF10)

Each of these MRA pipelines is a wrapper function that uses the pre-processed transcriptional networks together with the MRA algorithm implemented in the RTN package. Therefore, different signatures can also be interrogated on the METABRIC datasets using these functions (for a detailed description and default settings, please see the package’s documentation).
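The hypergeometric overlap test at the core of MRA can be sketched with base R alone. The helper mra_overlap, the toy gene identifiers and the Benjamini-Hochberg adjustment are assumptions made for this illustration; the wrapper functions above rely on the RTN implementation and its own defaults.

```r
# Hypergeometric test for the overlap between one regulon and one signature.
# universe: all genes in the network; regulon: targets of one TF;
# hits: genes in the signature (e.g. differentially expressed genes).
mra_overlap <- function(universe, regulon, hits) {
  regulon <- intersect(regulon, universe)
  hits    <- intersect(hits, universe)
  k <- length(intersect(regulon, hits))
  # P(overlap >= k) when length(regulon) genes are drawn from the universe
  p <- phyper(k - 1, m = length(hits),
              n = length(universe) - length(hits),
              k = length(regulon), lower.tail = FALSE)
  c(universe = length(universe), regulon = length(regulon),
    hits = length(hits), overlap = k, p.value = p)
}

# Toy data: 1000 genes, a 50-gene signature and two regulons of 100 targets.
universe <- paste0("g", 1:1000)
hits     <- paste0("g", 1:50)
regulons <- list(TF_A = paste0("g", 31:130),   # 20 targets are hits
                 TF_B = paste0("g", 501:600))  # no targets are hits

res <- t(sapply(regulons, mra_overlap, universe = universe, hits = hits))
# Multiple-testing adjustment across regulons (BH is an assumption here).
res <- cbind(res, adj.p = p.adjust(res[, "p.value"], method = "BH"))
```

In this toy setting about 5 hits are expected by chance in a regulon of 100 targets, so the 20 hits observed for TF_A give a very small p-value, while TF_B is not enriched.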

5 Transcriptional network of consensus master regulators

Next, the pipeline function plots a graph representing all regulons identified in the consensus MRA analysis. The network is generated by the R package RedeR [10] and may require some user input in order to tune the layout in the software’s interface (Figure 1).

> Fletcher2013pipeline.consensusnet()

As a suggestion, set ’anchor’ on the master regulators at the end of the ’relax’ algorithm for better layout control: right-click the square nodes and then assign ’transform’ and ’anchor’.

6 Enrichment maps

In addition to the clustering analysis, the regulons were also represented in an association map showing the degree of similarity among them, measured by the number of common targets. The similarity is assessed by the Jaccard coefficient (JC), which is plotted in the association map by the R package RedeR [10]. In the next pipeline, a graph representation is generated for regulons exhibiting JC ≥ 0.4 (Figure 2).

> Fletcher2013pipeline.enrichmap()

Suggestion: zoom in/out with a scroll wheel, and adjust the graph settings interactively!
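For readers unfamiliar with the Jaccard coefficient, the following base R sketch shows how pairwise similarities between regulons could be computed and thresholded at JC ≥ 0.4. The regulons and all object names are toy assumptions; the pipeline function above computes the map from the pre-processed networks.

```r
# Jaccard coefficient: shared targets divided by all targets of both regulons.
jaccard <- function(a, b) length(intersect(a, b)) / length(union(a, b))

# Toy regulons (hypothetical target sets).
regulons <- list(
  TF_A = c("g1", "g2", "g3", "g4", "g5"),
  TF_B = c("g3", "g4", "g5", "g6"),
  TF_C = c("g7", "g8")
)
tfs <- names(regulons)

# Pairwise JC matrix between all regulons.
jc <- outer(tfs, tfs,
            Vectorize(function(i, j) jaccard(regulons[[i]], regulons[[j]])))
dimnames(jc) <- list(tfs, tfs)

# Keep each pair of distinct regulons with JC >= 0.4 as an edge of the map.
idx <- which(jc >= 0.4 & upper.tri(jc), arr.ind = TRUE)
edges <- data.frame(from = tfs[idx[, "row"]],
                    to   = tfs[idx[, "col"]],
                    JC   = jc[idx])
```

Here TF_A and TF_B share 3 of 6 distinct targets (JC = 0.5) and are connected, whereas TF_C shares no targets and remains isolated.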

[Figure 1 graph omitted. Legend: Hits (0, 1, 2, 3); TFs; Targets. Labelled master regulators: PTTG1, GATA3, SPDEF, ESR1, FOXA1.]

Figure 1: Breast cancer transcriptional network (TN) enriched for the FGFR2 responsive genes. The network shows the 5 MRs, each one comprising one TF (square nodes) and all inferred targets (round nodes) applying a DPI threshold of 0.01.

[Figure 2 graph omitted: association map of regulons, each node labelled with its TF. Legend: Jaccard coefficient (0.40, 0.43, 0.46, 0.51, 0.62); Regulon size (20.0, 211.1, 581.3, 1181.0, 2189.1); FGFR2 enrichment (0, 1, 2, 3).]

Figure 2: Enrichment map derived from the relevance network in breast cancer. Edge width depicts the overlap of regulons, and shades of orange indicate the degree of enrichment of a regulon in at least one of the three FGFR2 gene signatures.

7 Session information

R version 4.1.0 beta (2021-05-03 r80259)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.2 LTS

Matrix products: default
BLAS: /home/biocbuild/bbs-3.14-bioc/R/lib/libRblas.so
LAPACK: /home/biocbuild/bbs-3.14-bioc/R/lib/libRlapack.so

attached base packages:
[1] stats graphics grDevices utils datasets methods
[7] base

loaded via a namespace (and not attached):
[1] compiler_4.1.0 tools_4.1.0

References

[1] Michael NC Fletcher, Mauro AA Castro, Suet-Feung Chin, Oscar Rueda, Xin Wang, Carlos Caldas, Bruce AJ Ponder, Florian Markowetz, and Kerstin B Meyer. Master regulators of FGFR2 signalling and breast cancer risk. Nature Communications, 4:2464, 2013.

[2] Christina Curtis, Sohrab P. Shah, Suet-Feung Chin, Gulisa Turashvili, Oscar M. Rueda, Mark J. Dunning, Doug Speed, Andy G. Lynch, Shamith Samarajiwa, Yinyin Yuan, et al. The genomic and transcriptomic architecture of 2,000 breast tumours reveals novel subgroups. Nature, 486:346–352, 2012.

[3] Pieter Van Vlierberghe, Alberto Ambesi-Impiombato, Arianne Perez-Garcia, J. Erika Haydu, Isaura Rigo, Michael Hadler, Valeria Tosello, Giusy Della Gatta, Elisabeth Paietta, Janis Racevskis, Peter H. Wiernik, Selina M. Luger, Jacob M. Rowe, Montserrat Rue, and Adolfo A. Ferrando. ETV6 mutations in early immature human T cell leukemias. The Journal of Experimental Medicine, 208(13):2571–2579, 2011.

[4] Adam Margolin, Ilya Nemenman, Katia Basso, Chris Wiggins, Gustavo Stolovitzky, Riccardo Favera, and Andrea Califano. ARACNE: An algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context. BMC Bioinformatics, 7(Suppl 1):S7, 2006.

[5] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2012. ISBN 3-900051-07-0.

[6] Maria Stella Carro, Wei Keat Lim, Mariano Javier Alvarez, Robert J. Bollo, Xudong Zhao, Evan Y. Snyder, Erik P. Sulman, Sandrine L. Anne, Fiona Doetsch, Howard Colman, Anna Lasorella, Ken Aldape, Andrea Califano, and Antonio Iavarone. The transcriptional network for mesenchymal transformation of brain tumours. Nature, 463(7279):318–325, 2010.

[7] Aravind Subramanian, Pablo Tamayo, Vamsi K. Mootha, Sayan Mukherjee, Benjamin L. Ebert, Michael A. Gillette, Amanda Paulovich, Scott L. Pomeroy, Todd R. Golub, Eric S. Lander, and Jill P. Mesirov. Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences of the United States of America, 102(43):15545–15550, 2005.

[8] Patrick Meyer, Frederic Lafitte, and Gianluca Bontempi. minet: An R/Bioconductor package for inferring large transcriptional networks using mutual information. BMC Bioinformatics, 9(1):461, 2008.

[9] Adam A Margolin, Kai Wang, Wei Keat Lim, Manjunath Kustagi, Ilya Nemenman, and Andrea Califano. Reverse engineering cellular networks. Nat. Protocols, 1(2):662–671, 2006.

[10] Mauro AA Castro, Xin Wang, Michael NC Fletcher, Kerstin B Meyer, and Florian Markowetz. RedeR: R/Bioconductor package for representing modular structures, nested networks and multiple levels of hierarchical associations. Genome Biology, 13(4):R29, 2012.
