bioRxiv preprint doi: https://doi.org/10.1101/2020.03.04.976910; this version posted March 5, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

Pheno-RNA, a method to identify genes linked to a specific phenotype: application to cellular transformation

Rabih Darwiche and Kevin Struhl* Dept. Biological Chemistry and Molecular Pharmacology Harvard Medical School Boston, MA 02115 (617) 432-2104 (617) 432-2529 (FAX) [email protected]

*To whom correspondence should be addressed.

bioRxiv preprint doi: https://doi.org/10.1101/2020.03.04.976910; this version posted March 5, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

ABSTRACT

Cellular transformation is associated with dramatic changes in gene expression, but it is difficult to determine which regulated genes are oncogenically relevant. Here, we describe Pheno-RNA, a general approach to identify candidate genes associated with a specific phenotype. Specifically, we generate a “phenotypic series” by treating a non-transformed breast cell line with a wide variety of molecules that induce cellular transformation to various extents. By performing transcriptional profiling across this phenotypic series, the expression profile of every gene can be correlated with the transformed phenotype. We identify ~200 genes whose expression profiles are very highly correlated with the transformation phenotype, strongly suggesting their importance in transformation. Within biological categories linked to cancer, some genes show high correlations with the transformed phenotype, but others do not. Many genes whose expression profiles are highly correlated with transformation have never been linked to cancer, suggesting the involvement of heretofore unknown genes in cancer.

2 bioRxiv preprint doi: https://doi.org/10.1101/2020.03.04.976910; this version posted March 5, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

A variety of whole-genome approaches have been employed to determine which genes are important for a specific phenotype. Transcriptional profiling of cells under the relevant

developmental or environmental conditions identifies differentially expressed genes1-4. While some of these are likely important for the process of interest, it is unlikely that all of them are. In this regard, there is surprisingly limited conservation of genes in different

species that respond to a given stress5. Alternatively, classical or systematic genome-scale

mutational screens identify genes that affect a given phenotype6-8. However, mutations might confer such phenotypes by indirect and/or artifactual effects, and many important genes might be missed due to the nature of the assay or to functional redundancy. Integration of other forms of information (e.g. genome binding, chromatin status, protein-protein interactions, genetic epistasis, evolutionary conservation) is very helpful in addressing the

connection between genes and phenotypes9. Here, we describe a new genome-scale approach, Pheno-RNA, to identify genes important for a specific phenotype. We apply Pheno-RNA to the process of cellular transformation, using our inducible model in which transient activation of v-Src oncoprotein converts a non-transformed breast

epithelial cell line into a stably transformed state within 24 hours10, 11. This epigenetic switch between stable non-transformed and transformed states is mediated by an inflammatory

positive feedback loop involving NF-kB, STAT3, and AP-1 factors11, 12, and many genes are

directly and jointly co-regulated by these factors13. In addition, we identified > 40 factors important for transformation as well as putative target sites directly

bound by these factors14. These transcriptional regulatory circuits are utilized in other cancer cell types and human cancers, and they are the basis of an inflammation index to type human

cancers by functional criteria14. Nevertheless, and despite the molecular description of these regulatory circuits, it is unclear which genes are important for transformation. Pheno-RNA is designed to distinguish genes that are “drivers” as opposed to “passengers” for a specific phenotype. Cells are subject to a wide variety of experimental perturbations, and the expression profiles of individual genes are correlated with a

3 bioRxiv preprint doi: https://doi.org/10.1101/2020.03.04.976910; this version posted March 5, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

quantitative measurement of a given phenotype. The specific perturbations, individually or in combination, do not matter for Pheno-RNA analysis, although they can provide additional information. In principle, the expression profiles of genes driving the phenotype should be highly correlated with the degree/strength of the phenotype. Conversely, genes whose expression is regulated by one or a few experimental conditions are likely passengers under those conditions and not generally relevant for the phenotype. To identify genes important for breast cellular transformation, we generated a series of transformation phenotypes by treating the parental MCF-10A cell line (i.e. lacking the ER- Src derivative) with a wide variety of signaling molecules - interleukins, growth factors (RGF, FGFs, HGF, TGFs), TNF-⍺, IFN-γ, leptin, insulin, TPA, cAMP - either individually or in combination. The degree of transformation was determined quantitively by growth in

low attachment, an assay highly correlated with the more qualitative soft-agar assay15. With the exception of Oncostatin M, none of signaling molecules tested individually caused transformation. Out of hundreds of combinations tested (Supplementary Table 1), we selected 17 to generate the phenotypic series. These 17 treatments show a wide range of transformation phenotypes (Fig. 1a), but do not significantly affect cell growth in standard (high attachment) conditions (Fig. 1b). We performed RNA-seq experiments to obtain transcriptional profiles under these 17 conditions (Supplementary Table 2). As expected, a few thousand genes were up- or down-regulated (> 2-fold) in each condition as compared to the untreated control (Fig. 1c), and some genes were up- or down-regulated under multiple conditions (range from 1-17; Fig. 1d). For 6651 genes, we determined the Pearson’s correlation coefficient between the mRNA levels and degree of transformation under the 17 experimental conditions (Supplementary Table 3). The correlation coefficients range from 0.99 to -0.60, as compared to the ± 0.4 range observed when mRNA values are randomized among the different conditions (Fig. 2a). 209 and 82 genes have remarkably high positive correlation coefficients (> 0.8 and > 0.9, respectively), strongly suggesting their relevance for cellular

4 bioRxiv preprint doi: https://doi.org/10.1101/2020.03.04.976910; this version posted March 5, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

transformation. The results for negatively regulated genes are less dramatic, but 52 genes have correlation coefficients < -0.5. Importantly, for any individual condition, only a subset of genes (range 0.5 to 5.5%) showing regulated expression have high correlations (> 0.6) with the transformation phenotype. As expected, genes whose expression profiles are highly correlated (either positively or negatively) with transformation are enriched in pathways strongly associated with cancer progress such as the inflammatory response, apoptosis, metastasis, and transcriptional regulators (Fig. 2b, c; Supplementary Table 4). In addition, genes showing strong correlation (> 0.8) with the transformation phenotype are enriched for genes that are up- or down-regulated in the Src-inducible model and/or are part of the transcriptional regulatory

network mediated by the combined action of NF-kB, STAT3, and AP-1 (Fig. 2b, c)13, 14. However, 76% of these high-correlated genes are not regulated in the Src-inducible model of transformation 85% are not part of the transcriptional regulatory network. Thus, Pheno- RNA both confirms genes and pathways previously linked to the transcriptional phenotype and also identifies genes that were not identified by other approaches. Although the inflammatory pathway is critical for transformation, our results distinguish among inflammatory genes whose expression patterns are strongly correlated with the transformation phenotype from those that are not (Fig. 2d). Some inflammatory genes that are induced by Src have expression levels that are highly correlated with the transformation phenotype (examples in blue), whereas other Src-inducible genes show essentially no correlation with the transformation phenotype (examples in green). Conversely, other inflammatory genes have expression profiles highly correlated with transformation, but are not induced in our ER-Src model (examples in red). These results suggest a subset of inflammatory genes that are important for transformation. Similar analyses on non-inflammatory pathways previously linked to transformation (e.g. apoptosis, cell proliferation, and metastasis) also reveal subsets of genes whose expression patterns strongly correlate with transformation.

5 bioRxiv preprint doi: https://doi.org/10.1101/2020.03.04.976910; this version posted March 5, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

As Pheno-RNA is based solely on the connection between specific phenotypes and gene expression patterns in a defined experimental system, it represents a new and unbiased approach to identify genes associated with transformation. We therefore used the following approach to identify genes whose expression patterns are strongly correlated to the transformation phenotype, yet have never previously linked to cancer. First, we performed an automated PubMed search (NCBI’s E-Utilities) on each gene ID and asked whether it was it was linked to studies on PubMed. Second, we reviewed the literature to confirm that the candidate cancer genes were not linked to cancer. Third, we cross-checked the candidate genes against the Network of Cancer Genes (NCG) that contains detailed information on 2,372 cancer genes, the Cancer Gene Census (CGC) database that catalogues genes with mutations causally implicated in cancer, Tumor-Associated Genes (TAG), and COSMIC and cBIOPortal that focus on cancer alterations rather than cancer genes. From this approach, we identified 90 genes that are strongly linked to the transformation phenotype (R > 0.8) but not previously linked to cancer (Supplementary Table 3). Interestingly, these genes are enriched in functional categories including the ribosome, translational elongation, and RNA transport. Some of these 90 genes are up- regulated in the Src-inducible model, whereas others are not. Conversely, we identified 6 genes negatively linked to the transformation phenotype (R < -0.5) and previously unlinked to cancer. Lastly, expression of many small nucleolar RNAs (A45B, D36B, D18A, D7A, D42B, D90, and D49B) show strong correlations (> 0.6) with the transformed phenotype, yet have never been linked to cancer. These genes are new candidates for playing a role in cancer, and hence worthy of functional analysis, either individually or in combination. Our Pheno-RNA analysis distinguish between genes that are strongly associated with cellular transformation from those that are regulated by a specific condition that causes a transformation phenotype, and it also identifies new genes that are candidates for affecting the phenotype. Presumably, genes with high correlation coefficients are important for the transformation phenotype assayed here (growth under conditions of low attachment),

6 bioRxiv preprint doi: https://doi.org/10.1101/2020.03.04.976910; this version posted March 5, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

whereas regulated genes with poor correlation coefficients are more relevant for a non- oncogenic function(s) of the specific experimental condition(s). However, it remains possible that some of these low-correlating genes are important for the transformation phenotype, particularly if pairwise (of higher order) interactions between genes affect their transcriptional profiles and oncogenic function. More generally, Pheno-RNA can be applied to multiple phenotypes generated by the experimental conditions. For example, the experimental conditions and transcriptional profiling data presented here can be linked to other transformation-related phenotypes such as morphology, focus formation, cell motility/invasive growth, mammosphere formation, and metformin sensitivity. The phenotypic series will likely differ among these various phenotypes, making it possible to identify different sets of genes associated with these phenotypes. Lastly, Pheno-RNA should be generally applicable for any phenotype that can be quantitated and for which multiple experimental conditions can be used to generate a phenotypic series.

7 bioRxiv preprint doi: https://doi.org/10.1101/2020.03.04.976910; this version posted March 5, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

MATERIALS AND METHODS

The non-transformed breast cell line MCF-10A16 was cultured as described

previously10, 11. The level of transformation mediated by compounds listed in Table were

determined by growth in low attachment (GILA assay) essentially as described previously15 except that 384-well plates were used and the cell concentration was optimized to 25 cells per well in 30 μl. Cells were assayed for ATP content after 5 days of incubation at 37°C, with each plate having five repetitions, and the entire screen repeated three times. Transcriptome level mRNA profiling was performed by 3’READS as described

previously17. For each condition, reads counts for each gene were normalized to the total number of True Poly(A) Read pairs. Normalized mRNA seq values for the 17 different experimental conditions (Supplementary Table 2) were correlated to the level of transformation using the Excel CORREL function. As a control, mRNA seq values for every gene were randomly shuffled across the 17 treatments and then correlated to the transformation level. Over-representational significance (p-value) for each set of genes was calculated using the Fisher Exact test. Gene sets based on correlation cutoffs were used as input for MetaCoreTM Functional Enrichment by Ontology using pathway maps in order to identify biological pathways associated with transformation (https://portal.genego.com/cgi/data_manager.cgi). An FDR cutoff of 0.05 was used to identify pathway maps significantly associated transformation. Network analysis were also performed using Gene Set Enrichment Analysis (GSEA) and DAVID ontology.

8 bioRxiv preprint doi: https://doi.org/10.1101/2020.03.04.976910; this version posted March 5, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

ACKNOWLEDGEMENTS

We thank Zarmik Moqataderi and Joseph Geisberg for help with the bioinformatic analysis and useful discussions. This work was funded by the Swiss National Science Foundation SNSF (Project P2FRP3_178090 to RD) and by a research grant to K.S. from the National Institutes of Health (CA 107486).

9 bioRxiv preprint doi: https://doi.org/10.1101/2020.03.04.976910; this version posted March 5, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

REFERENCES

1. St. John, T.P. & Davis, R.W. Isolation of galactose-inducible DNA sequences from by differential plaque filter hybridization. Cell 16, 443-452 (1979). 2. Schena, M., Shalon, D., Davis, R.W. & Brown, P.O. Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science 270, 467-470 (1995). 3. Lipshutz, R.J., Fodor, S.P., Gingeras, T.R. & Lockhart, D.J. High density synthetic oligonucleotide arrays. Nat Genet 21, 20-24 (1999). 4. Wang, Z., Gerstein, M. & Snyder, M. RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet 10, 57-63 (2009). 5. Tirosh, I., Wong, K.-H., Barkai, N. & Struhl, K. Extensive divergence of the yeast stress response through transitions between induced and constitutive activation. Proc. Natl. Acad. Sci. U.S.A. 108, 16693-16698 (2011). 6. Hartwell, L.H. Macromolecule synthesis in temperature-sensitive mutants in yeast. J. Bacteriol. 93, 1662-1670 (1967). 7. Nusslein-Volhard, C. & Wieschaus, E. Mutations affecting segment number and polarity in Drosophila. Nature 287, 795-801 (1980). 8. Mohr, S.E., Smith, J.A., Shamu, C.E., Neumuller, R.A. & Perrimon, N. RNAi screening comes of age: improved techniques and complementary approaches. Nat. Rev. Mol. Cell. Biol. 15, 591-600 (2014). 9. Karczewski, K.J. & Snyder, M. Integrative omics for health and disease. Nat. Rev. Genet. 19, 299-310 (2018). 10. Hirsch, H.A. et al. A transcriptional signature and common gene networks link cancer with lipid metabolism and diverse human diseases. Cancer Cell 17, 348-361 (2010).

10 bioRxiv preprint doi: https://doi.org/10.1101/2020.03.04.976910; this version posted March 5, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

11. Iliopoulos, D., Hirsch, H.A. & Struhl, K. An epigenetic switch involving NF-kB, lin 28, let-7 microRNA, and IL6 links inflammation to cell transformation. Cell 139, 693-706 (2009). 12. Iliopoulos, D., Jaeger, S.A., Hirsch, H.A., Bulyk, M.L. & Struhl, K. STAT3 activation of miR-21 and miR-181b, via PTEN and CYLD, are part of the epigenetic switch linking inflammation to cancer. Mol. Cell 39, 493-506 (2010). 13. Ji, Z., He, L., Regev, A. & struhl, K. Inflammatory regulatory network mediated by the joint action of NF-kB, STAT3, and AP-1 factors is involved in many human cancers. Proc. Natl. Acad. Sci. U.S.A. 116, 9453-9462 (2019). 14. Ji, Z. et al. Genome-scale identification of transcription factors that mediate an inflammatory network during breast cellular transformation. Nat. Commun. 9, 2068 (2018). 15. Rotem, A. et al. Alternative to the soft-agar assay that permits high-throughput drug and genetic screens for cellular transformation. Proc. Natl. Acad. Sci. U.S.A. 112, 5708-5713 (2015). 16. Soule, H.D. et al. Isolation and characterization of a spontaneously immortallized human breast epithelial cell line, MCF10. Cancer Res. 50, 6075-6086 (1990). 17. Jin, Y. et al. Mapping 3' mRNA isoforms on a genomic scale. Curr. Protoc. Mol. Biol. 110, 4.23.21-24.23.17 (2015).

11 bioRxiv preprint doi: https://doi.org/10.1101/2020.03.04.976910; this version posted March 5, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

FIGURE LEGENDS

Fig. 1. Transformation levels and differentially expressed genes. (a, b) Relative growth levels of MCF-10A cells grown on low-attachment (a) or high-attachment (b) plates for 5 days in 17 different conditions (A-Q; see Supplementary Table 1). (c) Number of genes whose expression is induced (black) or reduced (gray) 2-fold under the 17 conditions. The control represents the absence of any compound added to the cells. (d) Number of genes whose expression is induced or reduced under the indicated number (1-17) of experimental conditions.

Fig. 2. Correlation of gene expression profiles to transformation phenotype. (a) Number of genes in the indicated bins of correlation coefficients in the experimental (purple) and shuffled (orange) RNA samples. (b) Percent of genes in each functional category (A-L shown in various colors) in the indicated binds of correlation coefficients. The red box indicates high positive correlation (> 0.6) and the blue box indicates high negative correlation (< 0.4). (c) Percent of genes in the indicated category (A-L as above) that have high positive (> 0.6; red) or negative (< 0.5; blue) correlation to the level of transformation. (d) Relationship between the correlation coefficient and fold-induction upon Src induction (log 2) for all genes (black dots) that are part of the NF-kB/STAT3/AP-1 transcriptional network. Selected inflammatory genes with high correlation/high induction (blue), low correlation/high induction (green), and high correlation/low induction (red) are indicated.

Fig. 3. Over-represented gene categories of 90 genes with high correlation (> 0.8) to the transformed phenotype but no previous link to cancer. The significance of each category

is indicated by -Log10(P-value).

12 bioRxiv preprint doi: https://doi.org/10.1101/2020.03.04.976910; this version posted March 5, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

a 12 MCF-10A b 50 MCF-10A 11 45 )

5

10 0 40 x1 )

9 5

P 35 0

8 T x1

30 7 (A P T 6 25 5 20

GILA (A GILA 4 15 3 10

2 High-attachment 1 5 0 0 - A B C D E F G H I J K L M N O P Q - A B C D E F G H I J K L M N O P Q Experimental conditions Experimental conditions c d 3000 Induced 1600 Induced Repressed Repressed 1400 2500 1200 2000 1000 1500 800

600 1000 Number of genes Number of Number of genes Number of 400 500 200

0 0 Ctr A B C D E F G H I J K L M N O P Q 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 Experimental conditions Experimental conditions bioRxiv preprint doi: https://doi.org/10.1101/2020.03.04.976910; this version posted March 5, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

a b 3000 Experiment 14 Control

y 12 2500 r go e

t 10 ca s e

2000 h

n 8 e eac g f in o 1500 6 er es b n e

m 4 g u f

N 1000 o

% 2

500 0 -1.0 -0.8 -0.6 -0.4 -0.2 0.0 0.2 0.4 0.6 0.8 1.0 Pearson Correlation Coeffcient 0 A- Inflammation B- Cell adhesion C- Apotosis D- Signal transduction -1.0 -0.8 -0.6 -0.4 -0.2 0.0 0.2 0.4 0.6 0.8 1.0 E- Cell proliferation F- Transcriptional regulators G- Stem cell differentiation Pearson Correlation Coeffcient H- NF-kB, STAT3, and AP-1 I- 5 Non-Correlated Categories J- Ubiquinone metabolism K- Acetyl-CoA biosynthetic process L- Oxidative phosphorylation c d 8 A SERPINB4 7 B SERPINB3 6 C S100A8 D 5 GSDMC S100A9 E 4 TNFRSF6B IL1R1 F MT1X SERPINA1 3 G IL13RA2 H 2 NR1D1 Categories I Expressionchange 1

J 0 K -1 -0.8 -0.6 -0.4 -0.2 0 0.2 0.4 0.6 0.8 1 -1 Vnn1 CD14 ANO6 L TGFB1 8 6 4 2 0 2 4 6 8 10 12 14 16 -2 % of genes Downregulated Upregulated -3 Correlation Coeffcient Over-Represented Pathways Cytoskeleton remodeling_ACM3 andremodeling_ACM3 Cytoskeleton inmigration ACM4 keratinocyte (which wasnotcertifiedbypeerreview)istheauthor/funder,whohasgrantedbioRxivalicensetodisplaypreprintinperpetuity.Itmade bioRxiv preprint Development_Activation of by Development_Activation ERK Alpha-1 receptors adrenergic Signal transduction_Adenosine signaling receptor A1 pathway Development_Gastrin inDevelopment_Gastrin differentiation of the mucosa gastric G-protein signaling_Regulation of Cyclic AMP by levels ACM G-protein signaling_G-Protein alpha-q signaling cascades Role of Ghrelin in activation eating behavior in obesity doi: https://doi.org/10.1101/2020.03.04.976910 Transport_ACM3 signaling inglands salivary Development_Oxytocin receptor signaling receptor Development_Oxytocin available undera Immune response_LTBR1 signaling Chemotaxis_CCR1 signaling Chemotaxis_CCR1 CC-BY-NC-ND 4.0Internationallicense Riboflavin metabolism GTP-XTP metabolism ATP/ITP metabolism ; dGTP metabolism this versionpostedMarch5,2020. 0 1 . The copyrightholderforthispreprint 2 -Log 3 10 ( P -value) 4 5 6 up 7