Systematic identification of transcriptional regulatory modules from -protein interaction networks

Supplementary Material

Diego Diez1, Andrew Paul Hutchins1 and Diego Miranda-Saavedra1*

1Bioinformatics and Genomics Laboratory, World Premier International (WPI) Immunology Frontier Research Center (IFReC), Osaka University, 3-1 Yamadaoka, Suita 565-0871, Osaka, Japan.

*Author for correspondence. E-mail. [email protected]; Tel. +81 6 6879 4269.

1 SUPPLEMENTARY METHODS

Overview rTRM is a method for predicting transcriptional regulatory modules (TRMs) that combines experimentally determined genomic binding sites for a specific TF (e.g. from ChIP-seq), the computational prediction of enriched TF binding motifs, expression and protein-protein interaction (PPI) data. rTRM was designed with ChIP-seq experiments in mind but any other readout providing genomic locations of regulatory regions can be readily adapted (e.g. DNase I footprinting).

Availability rTRM has been implemented as an R package, is licensed under GPL-3 terms and is freely available for download. The rTRM version used for this manuscript (0.9.4) can be found at https://sourceforge.net/projects/rtrm, including the documentation (rTRM_introduction.pdf; also included in the package). rTRM has also been accepted as a Bioconductor package, and is currently available in the development version of Bioconductor (http://bioconductor.org/packages/devel/bioc/html/rTRM.html) and will be generally available from www.bioconductor.org following the next Bioconductor release (October 2013). For instructions on how to install Bioconductor and companion packages, please check the information at http://bioconductor.org/install/. For instructions on how to install the version available at sourceforge please consult the following section.

Installation

This section details the installation process of rTRM using the packages available at sourceforge. This should only we done by those wishing to reproduce the results presented in the manuscript. For everyone else, please use the package available at Bioconductor, which contains further developments and bug fixes, as well as integration with the Bioconductor infrastructure. For instructions on how to install rTRM from Bioconductor please follow the information at http://bioconductor.org/install/. rTRM runs on the R programming language, which can be installed on Windows and Unix-based machines (including Mac and Linux). Binaries and source code for R can be obtained from www.r- project.org. rTRM has been successfully tested in recent versions of R, including R-2.15.x and R- 3.0.1. rTRM dependencies include RSQLite (for managing the SQLite database, which itself depends on the DBI package) and igraph (for graph manipulation).

2 # To install dependencies, from the R prompt (any platform):

> install.packages(c(“RSQLite”, “igraph”))

# To install rTRM from the command line:

$ R CMD INSTALL rTRM_0.9.4.tar.gz

Documentation rTRM comes with documentation including a reference guide for each command and a vignette with a general introduction to the problem of predicting TRMs, a description of the most common operations as well as some examples. The PDF version of the vignette used in this paper is also available at https://sourceforge.net/projects/rtrm.

# To access the vignette from the R prompt:

> browseVignettes(package="rTRM")

Protein-protein interaction datasets rTRM relies on PPI datasets to predict TRMs, and so the package includes interaction data (as an igraph object) from the BioGRID database (http://www.thebiogrid.org; version 3.2.98) for human and mouse. The rTRM framework provides a facility to retrieve and process updated datasets directly from the BioGRID website. Please consult the package vignette for specific instructions on how to update the PPI dataset.

Computation of network similarities: the Jaccard index

Similarities between TRMs were computed using the Jaccard index, which measures the proportion of shared elements between two sets (intersection) divided by the total number of elements (union; see eq.). In this manuscript we used the Jaccard index of shared edges (Supplementary Figure 5).

3 SUPPLEMENTARY FIGURES

Supplementary Figure 1. Schematic representation of the rTRM framework and workflow. The left panel describes the rTRM database module, which comprises the generation of a library of PWMs with annotations of the PWM gene identifiers and the species of origin, mappings of the original gene to orthologous species (currently human and mouse), and the mapping of TFs to the TFClass scheme. The right panel represents the user pipeline, which starts with the identification of enriched motifs from experimentally determined genomic binding sites (from ChIP-seq data). Enriched motifs are automatically mapped onto specific of the organism under study. Expression data is used to filter out non-expressed genes, and the genes still remaining are mapped onto the organism’s protein-protein interaction network. Finally a TRM is reconstructed from the target gene’s neighbourhood using our own module-finding algorithm (rTRM finder).

rTRM database User pipeline

PWM TF TF TF datasets binding expression interaction

Jaspar Uniprobe SELEX ChIP-seq GEO BioGRID

(a) compile PWMs binding PPI genes PWM sequences network

(b) map to genes gene expressed ≠ species (e) compile enriched filtered database motifs genes network (c) map to rTRM database orthologs API enriched rTRM gene gene Hs Mm genes finder

(d) map to TFclass filtered TFclass TRM genes

4 Supplementary Figure 2. TFClass ontology tree. Concentric layout tree showing the mapping of the TFs included in rTRM onto the ontology classification in TFClass.

Root Superclass Class Family Subfamily Genus Map

5 Supplementary Figure 3. Plots showing the distribution of expression values in ESCs and HSCs to determine the expression cutoffs for ESCs (A) and HSCs (B). The distribution of (log2) intensities was plotted for each individual array. For selected genes, the median intensity across all arrays is indicated. Red lines indicate the filtering cutoff (5.5 for ESCs and 7.0 for HSCs). A B ESC HSC Sfpi1 Gata2 Pou5f1 Sox2 Pou5f1 Sfpi1 Gata2

0.15 0.15

0.10 0.10 Density Density 0.05 0.05

0.00 0.00

2 4 6 8 10 12 14 16 5 10 15

N = 17370 Bandwidth = 0.2918 N = 17370 Bandwidth = 0.3233

Supplementary Figure 4. Identification of TRMs from PPIs. A neighborhood search algorithm was implemented to identify directly interacting with a target TF or through a bridge protein. First, target TFs are mapped onto the PPI network (1) and then the first neighbours of each TF are identified (2). Next, the common subgraph containing all the interactions for the selected nodes is extracted (3), and finally distance restrictions (maximum separation of one node) are applied and the resulting TRM is returned.

11

8

(1) (2) (3) (4) 27

10

28

7

locate target TFs find first neighbors find common subgraph remove distal relations26

9

25

12

11 11

6

11 8 8

8

27 27

27

10 10 23

28 28

24

10

28

7 7

7

1

26 26

26

5

4

22

2

9 9

25 25 9

12 12

25

12

19

6 6

6

3

20

13

23 23

24 24

23

21

24

1 1

1

14

16 5 5

5

17

4 4

22 22 4

18

2 2

22

2

19 19

19

15

3 3

3

20 20 TF from ChIP-seq

20

13 13

13

21 21

21

14 14

16 16

14

16 TF from motif enrichment

17 17

17

18 18

18

15 15 15 bridge TF

6 Supplementary Figure 5. Computing network similarity. In order to compare any two networks, two distinct properties may be used: the number of common nodes and the number of common edges. In this figure Networks A and B have five nodes in common (A-E) but which are connected in distinct ways (shown in blue). Network A has one unique node (F; green) and Network B has another unique node (G; red). The similarity based on the number of common nodes is represented in the Venn diagram in the bottom left corner. In contrast, the analysis of the number of common edges between Networks A and B results in two common edges (indicated in blue). Network A has three unique edges (green) whereas Network B has two unique edges (red), as shown in the Venn diagram for edges in the bottom right corner. Based on this, we used the comparison of edges as a more reliable estimation of the degree of similarity between networks.

Network A Network B

B B A A

F C C G D D

E E

Nodes Edges

1 5 1 3 2 2

A B A B

7 Supplementary Figure 6. Individual TRMs identified in ESCs.

Myc Esrrb

Hif1a Etv6 Max Rxra Id4 Etv6 Mnt Rara Kat5 Clock Nanog Max Lass2 Jun Pou5f1 Kat5 Nr0b1 Id3 Clock Kat2b Med16 Atf4 Rxrb Nrip1 Huwe1 Sp1 Hdac2 Mnt Pou5f1 Park7 Cdk1 Mycn MycEp300Arntl Rela E2f1Arntl Klf4 Esrrb Esrra Srebf2Polr1a Rad21 Zfp292 Sp3 Mycn Tgif1 Pou5f1 Yy1 Ube2n Sp1 Rif1 Tcf3 Rnf14 Sin3a Smarca4Yy1 Tbp Smarca4 Sall4Smarca4 Sin3a Trp53 Srebf1 Zfp281 Tcf4 Tfeb Tfe3 Tfeb Tfe3 Sp3 Klf4 Mycn Nanog Pou2f1 Klf13 Hif1a Sox15 Max Etv6 Mnt Nrip1Nr0b1 Lass2 Clock Rara Etv6 Med16 Kat5 Nr0b1 Polr1a Nanog Sp1 Jun Rad21 Park7 Nanog Rad21 Esrrb Kat2b Klf4 Atf4 Klf4 Klf4 Sp3 Huwe1 Rest Hdac2 Polr1a Hdac2 RxraSall4 Ctcf Esrra Myc CtcfEp300Arntl Sp1 Pou5f1 Esrrb Esrra Pou5f1 Pou5f1 Srebf1 Sall4 Sin3a Sox2 Yy1 Mycn Zfp292 Sox2 Rela Yy1 Tbp Ube2n Smarca4 Wdr5 Srebf2 Rnf14 Sin3a Rxrb TbpTgif1 Sp3 Smarca4Trp53 Smarca4 Tcf3 Sin3a Tfeb Sox15 Sp1 Tcf4 Tfe3 Sp3 Pou5f1 Smad1 Sox2 Pou2f1 Rara Msx1 Rara Nanog Nrip1Nr0b1 Etv6 Sox15 Polr1a Nanog Nr0b1 Kat2b Rad21 Nanog Sp1Rad21 Esrrb Kat2b Nrip1 Klf4 Hdac2 Klf4 Klf4 Arid1a Sall4 Hdac2 Rela CtcfHdac2Esrra Rad21Pou5f1 Esrrb EsrraPou2f1 Pou5f1 Esrrb Esrra Pou5f1 Sp3 Sall4 Sox2 Yy1 Sall4 Sox2 Zmym2 Sox2 Sin3a Wdr5 Sp1 Smad1 Tbp Smarca4 Tbp Smarca4Tbp Tfeb Smarca4 Srebf1 Tfe3 Sp3 Sox15 Stat3 Tfcp2l1 factors (bHLH) Mafk Mafg Sox15 Rara Max Nanog Basic factors (bZIP) Mll2 Klf13 Nr0b1 C2H2 factors Rara NanogNacc1Kat2b Fork head / winged helix factors Jun Nrip1 Nfe2l2 Hdac2Etv6 Kat2b Grainyhead domain factors Nr0b1 Klf4 Klf4 Rxra Pou5f1 Cdk1 Pou5f1 Nrip1 Homeo domain factors EsrrbAsh2lEsrra Sp1Rad21 Esrrb Esrra Nuclear receptors with C4 zinc fingers Ppp1ca Hdac2 Rxrb Sox2 Tgif1 Sox2 domain factors Ppp1cb Stat3 Tfcp2l1 Rel homology region (RHR) factors TbpZfp281 Tbp Rad21 Smarca4 Sall4 STAT domain factors Sp1 RelaRif1 Sin3a TAT Sall4 Tfeb Sin3a Sp3 Smarca4 Tryptophan cluster factors Srebf1Tfe3 Sp3 Tfcp2 unclassified

8 Supplementary Figure 7. Reproducibility of rTRM predicted TRMs in different studies. (A) TRM predicted for Sox2 obtained using information from the Chen and Lodato datasets. (B) Comparison of the number of shared nodes and edges between the Chen and Lodato Sox2 TRMs, referred to the Chen TRM, as well as differences in the number of peaks and enriched motifs. (C) Comparison of the number of shared nodes and edges between the Chen and Lodato Pou5f1 (Oct4) TRMs, referred to the Chen TRM, as well as differences in the number of peaks and enriched motifs.

A Sox2 reproducibility Chen Lodato C2H2 zinc finger factors f factors C2H2 zinc finger factors Homeo domain factors factors Homeo domain factors TAT TAT Nr2f6 Nacc1 Rad21 Nanog Rara Nrip1 Hdac2 Fos Rxra Rad21 Hdac2 Esrra Ctcf Cdk1 Esrra Rif1 Sox2 Sox2 Wdr5 Smarca4 Sirt1 Tgif1 Sox15 Smarca4 Sp1 Sox15 Sp3 Vax2 B % nodes % edges 103 # peaks # motifs 100 100 400

75 75 10 300

50 50 200 5 25 25 100

0 0 0 0 Chen Lodato Chen Lodato Chen Lodato Chen Lodato

C Pou5f1 (Oct4) reproducibility % nodes % edges 103 # peaks # motifs 100 100 300 75 75 10 200 50 50 5 25 25 100

0 0 0 0 Chen Lodato Chen Lodato Chen Lodato Chen Lodato

9 Supplementary Figure 8. The Esrrb and Sox2 regulatory networks are functionally interdependent, and they auto-regulate Esrrb and Sox2, as well as other genes of the TRMs. (A) The Esrrb and Sox2 TRMs, and their common nodes. (B) Change in of the genes within both regulatory networks in ESCs, where Sox2, Esrrb or Sox2/Esrrb (SoES) are knocked down. ‘GFP’ is a GFP-specific shRNA used as a control. Microarray data is from GSE34170. Significance (t- test) is indicated for significantly changing gene sets. (C) Heatmaps showing the fold-change of the genes from the two TRMs in the Esrrb knockdown (left) and the Sox2 knockdown (right).

A Esrrb Sox2

Rxra Msx1 Rara Nr0b1Nanog Kat2b Rad21 Nanog Rxrb Nrip1 Hdac2 Klf4 Pou5f1 4 Klf4 Esrrb Cdk1 Esrra 12 10 Pou2f1Sall4Pou5f1 EsrrbHdac2Esrra Rad21 Tgif1 Sox2 Sp1 Rif1 Tbp Smarca4 Tbp Sall4Smarca4 Zfp281 Sp3 Sox15

3.0 3.0 B p < 0.0009 p < 0.0003 2.5 2.5 p < 0.0009 p < 0.001 2.0 p < 0.016 2.0 1.5 1.5 1.0 1.0 0.5

0.0 0.5 GFP Esrrb Sox2 SoES GFP Esrrb Sox2 SoES C log2 Fold Change 1.0 Sp1 0.8 Esrra Zfp281 0.6 Pou2f1 Tbp 0.4 0.2 Pou2f1 Tgif1 0 Rxrb -0.2 Tbp Rxrb -0.4 Sall4 -0.6 Esrra -0.8 Pou2f1 Rxra -1.0 Smarca4 Sall4 Smarca4 Msx1 Hdac2 Pou5f1 Sall4 Pou5f1 Tgif1 Sox15 Tgif1 Kat2b Nanog Kat2b Esrrb Rif1 Klf4 Pou5f1 Klf4 Pou5f1 Rara Klf4 Nanog Hdac2 Esrrb Sall4 Klf4 Sox2 Klf4 Klf4 Sox2 Esrrb Sox2 SoEs Esrrb Sox2 SoEs

10 Supplementary Figure 9. (Biological Process) terms related to “” in all the ESC modules. The p-value is indicated using a colour scale (darker colours indicating greater significance). Grey indicates missing values (i.e. terms not enriched for that module at p<0.05). Terms and modules are clustered using Euclidean distance and complete linkage.

regulation of differentiation p−value negative regulation of amniotic stem cell differentiation 0.00 negative regulation of stem cell differentiation stem cell development 0.01

stem 0.02

term somatic stem cell maintenance 0.03 neuronal stem cell maintenance stem cell maintenance 0.04 stem cell differentiation 0.05 germ−line stem cell maintenance Klf4 Myc ESC E2f1 Sox2 Stat3 Mycn Esrrb Nanog Pou5f1 Smad1 Tfcp2l1 trm

11 Supplementary Figure 10. Individual TRMs reconstructed for 9 TFs in HSCs.

Erg Fli1 Gata2 Gfi1b

FosEtv6 Fos Etv6 Fos Gata1 Ets1 Gata1 Etv6 Gata1 Etv6 Hoxa5 Elk1 Ets1 Hoxa5 Ets1 Hoxb4 Elf1 Hoxa5 Hdac3Hdac2CreChd4bbp CreCreb1bbpChd4Cbfb Elk1 Crebbp Gfi1bCreCreb1bbpChd4 Hoxb6 Hdac4 Cbfb Ctcf Hdac2 Cbfa2t3 Hoxb4 Hdac2 Cbfb Elk1 Hoxa9 Cbfa2t3 Hoxb4 Hdac3 Cbfa2t3 Jdp2 Creb1 Hdac3 Btrc Gata2 Cbfb Hoxa9 Jak2 Runx1 Btrc Runx1 Runx1 Runx1 Btrc Mapk1 Atf3 Hoxa9 Atf3 Elf1 Hoxb6Kdm1a Atf3 Ctcf Jun Mapk8 Barx2Hoxb6 Jak2 Atf2 Mapk1 Atf2 Mapk8 Atf2 Klf13 Men1 Sfpi1 Fli1 Ash2lArid3a Mafk Sfpi1 Fli1 Ash2lBarx2Gata1Hdac2Sfpi1 Fli1Cbfa2t3Ets1 Jdp2 Sfpi1 Fli1 Ash2lBarx2 Men1 Mms19 Yy1 Yy1 Klf4 Zfp281Jdp2Mapk1 Wdr5 Mms19 Nacc1 Wdr5 Mapk8 Trim33 Jun Wdr5 Stat1 Tal1 Tal1 Stat3 Tal1 Pbx1 Tal1 Trim33 Maff Pbx1 Trim33 Stat3 Men1 Smarce1 Pbx4 Pbx4 Smarce1 Jun Pml Trim33 Pml Stat3 Mafg Pml Smarcb1 Stat1 Pbx1 Smarcb1 Maff Polr1a Smarce1Runx2 Polr1aSin3aSiSmarca4rt1 Pbx4PmlSin3aSirt1 Stat1 Sin3a Rcor2Sin3aSiSmarcb1rt1 Mafk Sp3 Nfe2 Runx2 Mafg Pou2f1 Nfe2 Sp1 Nfe2l2 Nfe2l2Nrf1Pou2f1Runx2 Nrf1Pou2f1 Runx2 Mafk Nfe2Nfe2l2

Lmo2 Meis1 Sfpi1 Runx1

Gata1 Etv6 Hoxa5Gata1 Jdp2 Fos Hoxa5Gata1Fos Hoxa5 Etv6 Etv6 Hoxb5 Etv6 Hoxb4 Ets1 Hoxb5 Ets1 Jak2Id2Hdac2 Hdac3Hdac2Crebbp Ets1 Jun HoHdac3xa9Hdac2Crebbp Hoxb5 Jun Crebbp Jak2 Creb1 Ldb1 Creb1 Hoxb6Hoxa9 Creb1 Hdac2 Ets1 Hoxb6 Chd4 Elf1 Elf1 Hdac3 Crebbp Mafk Cbfb Lmo2 Runx1 Chd4 Meis1 Barx2 Mapk8 Runx1 Jun Cbfb Maff Mapk1 Btrc Jdp2 Cbfa2t3Barx2 Hoxb6Pbx1 Cbfb Hoxb8 Runx1 Men1 Mybbp1a Atf2 Pbx3 Sfpi1 Fli1Cbfa2t3Barx2 Nacc1 Fli1Cbfa2t3Arid3a Mapk8 Sfpi1 Atf3 Elk1 Jun Sfpi1 Fli1 Ash2lArid3a Nacc1 Yy1 Meis3 Pbx4 Trim33 Meis3 Sfpi1 Pbx1 Pbx3 Trim33 Mafg Mms19 Stat3 Meis3 Wdr5Zfp281 Pml Tal1 Trim27 Tal1 Zfp281 Pbx3 Tal1 Tcf4 Pbx4 Sirt1 Trim33 Polr1a Smarce1 Pbx4 Tlx1 Sin3a Stat1 Nfe2 Pbx4 Smarce1 Tcf4 Pknox1 Rbbp4 Smarca4 Pbx1 Pml Smarcb1 Rbbp7Sin3aSmad2 PmlSin3aSmad2 Sp3 Mafk Sin3aSiSmad2rt1Smad3 Tcf3 Nfe2l2 Stat3 Pknox1 Sp1 Pou2f1 Sp1 Pknox1 Sp3 Runx2 Stat3 Pou2f1Runx2 Nfe2 Nfe2l2 Pou2f1Runx2Sp1

ARID domain factors Tal1 factors (bHLH)

Fos Basic leucine zipper factors (bZIP) Gata1 Etv6 Hoxa5 Ets1 C2H2 zinc finger factors Network Layout Hoxb4 HoHdac3xa9Hdac2Ercc8Ep300Crebbp IkbkgId2 Creb1 Elk1 Homeo domain factors Jak2 Chd4 Target TF Hoxb6 Lass2 Cbfb Mafk Runx1 Cbfa2t3 factors Mapk1 Btrc Elf1 Enriched TF Jdp2Mapk8 Atf3 Runt domain factors Men1 Atf2 Pbx1 Sfpi1 Fli1 Ash2lBarx2 Bridge protein Jun Pbx4 Zfp292 factors Pml Zfp277 Polr1a Zdhhc6 Yy1 STAT domain factors Rbbp4 Tal1 Wdr5 Nfe2 Rbbp7 Trip4 Rnf14 Trim33 Sin3a Tle6Trim27 Tcf3 Tryptophan cluster factors Nfe2l2 Smarca4SiSmarcb1rt1Smarce1Tcf4 Stat3 Uncharacterized Nrf1 Stat1 Pou2f1Runx2 unclassified

12 Supplementary Figure 11. Gene Ontology (Biological Process) terms related to “hemopoietic” and “stem cell” in all HSC modules. The p-value is indicated using a color scale (darker colors indicating greater significance). Grey indicates missing values (i.e. terms not enriched for that module at p<0.05). Terms and modules are clustered using Euclidean distance and complete linkage.

hemopoietic or lymphoid organ development

germ−line stem cell maintenance p−value

stem cell differentiation 0.00

stem cell maintenance 0.01

0.02 regulation of stem cell maintenance term 0.03 hemopoietic stem cell differentiation 0.04 somatic stem cell maintenance 0.05 positive regulation of stem cell proliferation

stem cell development Erg Fli1 Tal1 HSC Sfpi1 Gfi1b Lmo2 Meis1 Gata2 Runx1 trm

13 Supplementary Figure 12. TRMs identified in (A) TSCs for Cdx2, (B) NPCs for Sox2, and (C) during CD4+ T cell development for Gata3. Although the NPC dataset contained also data for Pou3f2, and the TSC dataset contained data for Elf5 and Eomes, these TFs could not be found in the PPI network and therefore the corresponding TRMs could not predicted. A Cdx2 Mafk TSCs Mga Rnf2 f f f Ash2l f x f

B Sox2 NPCs Jun f f F Meis2 Nfe2l2 f Olig2 Etv6 Ff Meis3 Ctcf f Sx2 f ax6 x f f Sin3a Stat3 Vax2 STAf Tf Tcf3 Unchaacteiz Sx8 C Th1 Th2 Th17 Etv6 Etv6 Etv6 Etv3 Etv3 Etv3 F F F F F Atf3 Atf3 F Jun Gata3 Ash2l Gata3 Ash2l Ctcf Gata3 Atf3 Jun Stat3 Stat3 Stat3 Sin3a Mafg Sin3a Sin3a Mafk Mafk Jun Runx2 Nfe2l2 Nfe2l2 Runx2 Nfe2l2 CD4+ T cells iTreg f F f Etv6 Etv3 f F f Ff F f f Jun Gata3 Ctcf f f Stat3 Sin3a f Si Nf STAf Nfe2l2 TAT Runx2 Tf Rela

14

Supplementary Figure 13. Combined network for CD4+ T cells. The different TRMs identified in CD4+ T cells were combined and the nodes coloured according to their presence in one or more CD4+ T cell type-specific TRMs.

Common Etv6 Elk1Crebbp Wdr5 iTreg Fli1 Creb1 Sp3 Ctcf Fos Th1 Cbfb Yy1 Irf2 Th1+Th2 Gata1 Lef1 Irf1 Runx1 Smad2 Hdac6 Th2 Gata3 Th2+iTreg Runx2 Nfatc1 Fem1a Hdac2 Btrc Nfkb1 Ctnnb1 Hdac3 Atf3 Mafb Jun Nfkb2 Mta1 Jdp2 Nfe2l2 Nfkbia Sirt1 Mafg Elf4 Foxp3 Rela Cers2 Smarca4 Hdac1 Sin3a Etv3 Ube2i Tbk1 Klf4 Mafk Smarcb1 Mapk1 Pias3 Irf3 Ash2l Sp1 Klf13 Stat1Stat3Ets1

15