Title: oncoEnrichR: -dedicated set interpretation

Authors: Sigve Nakken1,2,*, Sveinung Gundersen3, Fabian L. M. Bernal3, Eivind

Hovig1,4, Jørgen Wesche1,2

Affiliations: 1Department of Tumor Biology, Institute for Cancer Research, Oslo

University Hospital, Norway

2Centre for Cancer Cell Reprogramming, Institute of Clinical Medicine,

Faculty of Medicine, University of Oslo, Norway

3ELIXIR Norway, Department of Informatics, University of Oslo,

Norway

4Centre for Bioinformatics, Department of Informatics, University of Oslo,

Norway

*Corresponding author: [email protected]

Abstract

Summary: Interpretation and prioritization of candidate hits from genome-scale screening experiments represent a significant analytical challenge, particularly when it comes to an understanding of cancer relevance. We have developed a flexible tool that substantially refines gene set interpretability in cancer by leveraging a broad scope of prior knowledge unavailable in existing frameworks, including data on target tractabilities, tumor-type association strengths, complexes and protein-protein interactions, tissue and cell-type expression specificities, subcellular localizations, prognostic associations, cancer dependency maps, and information on of poorly defined or unknown function.

Availability: oncoEnrichR is developed in R, and is freely available as a stand-alone R package. A web interface to oncoEnrichR is provided through the Galaxy framework (https://oncotools.elixir.no).

All code is open-source under the MIT license, with documentation, example datasets and and instructions for usage available at https://github.com/sigven/oncoEnrichR/

Contact: [email protected]

1 Introduction

The search for novel cancer-implicated genes is currently fueled by different types of genome-scale screening experiments, exemplified by gene silencing through siRNA and CRISPR/cas-9, protein interaction screens, or differential gene expression from RNA-seq. Importantly, most high-throughput screens typically produce the same conceptual output, that is dozens or hundreds of candidate genes ranked by effect size. A significant challenge in this respect is to aggregate sufficient prior knowledge regarding the candidate hits, both individually and at a systems level, in order to prioritize likely cancer-implicated genes for in-depth experimental validation. Presently, this effort is mostly supported by various gene set enrichment tools, but also platforms that can prioritize hits through their aberration patterns in different cancer omics datasets. Valuable resources in this regard include

Metascape, DAVID, g:Profiler, and EnrichR, which leverage up-to-date databases for different types of enrichment analyses of gene lists (Huang et al., 2009; Chen et al., 2013; Zhou et al., 2019; Reimand et al., 2007). Furthermore, the web-based cBioPortal platform enables elegant interactive visualizations of custom gene lists in relation to tumor gene expression and aberration datasets from

The Cancer Genome Atlas (TCGA) and other large-sequencing sequencing projects (Cerami et al.,

2012; The Cancer Genome Atlas Research Network et al., 2013).

To our knowledge, there is however a lack of gene set interpretation tools that successfully combine a broad set of cancer-dedicated annotations at the individual gene level with multiple perspectives of system-level properties. Specifically, information that is significant for comprehensive geneset interpretability in cancer, yet not sufficiently integrated in available frameworks, include data on gene-tumor type associations, prognostic gene expression associations, target druggability, oncogene/tumor suppressor/cancer driver gene information, tissue- and cell-type expression specificities, as well data on target subcellular localizations, protein complex memberships, and protein-protein interactions. In order to fill this gap, we have developed oncoEnrichR, a cancer-dedicated gene set interpretation workflow where a diverse suite of gene- centric annotation and prioritization modules are supplemented with system-level modules that offer

insight at the level of signaling pathways, molecular gene signatures and the protein interactome. A structured and interactive gene set analysis report, highly configurable by the user, is readily available as the main output per run. We believe that oncoEnrichR can function as an easily accessible and all- inclusive resource for candidate hit interpretation in cancer.

2 Construction and Content

The oncoEnrichR gene set interpretation tool has been developed as an R package, encompassing four main components (schematically illustrated in Figure 1). In order to make the tool accessible to biologists, oncoEnrichR is also available through the web-based Galaxy platform, which facilitates a reproducible and collaborative analysis framework.

Input

The main input to the tool is a simple list of gene identifiers. To cater for ease of use, multiple common identifier types are permitted, e.g. official gene symbols, UniProt accessions, RefSeq/Ensembl transcript and protein identifiers etc. An optional specification of a background gene set for use in enrichment analysis is further available, which ensures that enrichment results are performed according to the nature of the underlying screen. The user can also configure different parameters of the individual workflow modules (e.g. enrichment and protein-protein interaction settings), and which parts to include in the output.

Data resources & analysis content

The backend of oncoEnrichR is based on a comprehensive collection of prior knowledge on cancer- relevant properties of genes and their interrelationships (a complete overview is provided in

Supplementary Table 1).

Query genes are annotated with status as tumor suppressor genes, oncogenes or potential cancer drivers according to significant support from the CancerMine text-mining resource and

Network of Cancer Genes (Repana et al., 2019; Lever et al., 2019). For gene-disease associations, we exploit the powerful Open Targets Platform, which have established an aggregated association score between phenotypes (i.e. in the form of Experimental Factor Ontology (EFO) terms) and human genes from a multitude of sources, e.g. genetic associations, text mining, or data from models

(Malone et al., 2010; Ochoa et al., 2021). These association scores enable an effective ranking of the candidate hits in terms of their global association to cancer, as well as tumor-type specific rankings.

Importantly, in order to enable the latter functionality, we have established a semi-manually curated mapping between ontology terms and main tumor sites (e.g. breast, lung, colon/rectum etc.), using

OncoTree as the starting point (Kundra et al., 2021). The Open Targets Platform is further utilized to gather information on cancer hallmark evidence, known cancer drugs targeted toward candidate genes, as well as overall target tractability (druggability) data, which provides a prioritization of the hits in terms of their potential as targets for inhibition by small molecules or antibodies. Genes of poorly defined or unknown function, as evident from limited Gene Ontology (GO) annotations or gene summary descriptions, can be conveniently highlighted.

A systems-level view of the candidate set is provided through a standard enrichment/over- representation analysis towards several databases, including GO, the Molecular Signatures Database,

KEGG and NetPath (Kanehisa and Goto, 2000; Liberzon et al., 2011; Kandasamy et al., 2010;

Ashburner et al., 2000). This is supplemented by a protein-protein interaction view of the candidate set and their neighbouring partners (STRING), a view where we also integrate druggable targets and known cancer genes (Von Mering et al., 2005). Information on candidate hit memberships in known protein complexes (CORUM) and their subcellular localization patterns (ComPPI) adds additional dimensions to the global view of the candidate set (Giurgiu et al., 2019; Veres et al., 2015).

Through the use of omics data from TCGA, candidate hits can be interrogated and prioritized for aberration frequencies in different tumor types, and also the potential co-expression of candidate hits with other cancer-relevant genes. Expression profiling data from healthy tissues (Human Protein

Atlas (HPA) and Genotype-Tissue Expression Project) are used to portray tissue- and cell-type

specific expression patterns for the candidate genes, which subsequently shows enrichment of particular tissues/cell-types (Jain and Tuteja, 2019; Uhlen et al., 2010). The tool can further list gene- survival associations for multiple genetic features (expression, mutation, copy-number) based on

HPA and recently published data (Smith and Sheltzer, 2021) . Finally, the oncoEnrichR workflow integrates gene fitness data and target priority scores from CRISPR-Cas9 mediated cell inactivation of human cancer cell lines (Behan et al., 2019), which have lately shown to be an effective approach to prioritize drug target candidates in cancer. Importantly, intersection of the candidate hits with genetic screen data also indicate relevant cell lines to be used in potential experimental follow-up studies.

Output

The output of the oncoEnrichR workflow is two-fold. A structured HTML report is provided with 13 separate analysis modules of the above-mentioned features and datasets, encompassing a variety of visualizations and interactive tables for prioritization of candidate hits. Moreover, a multi-sheet Excel workbook with all annotations and enrichment results is readily available. To cater for transparency, all versions of underlying software and databases are provided in the output files.

In order to demonstrate the power of oncoEnrichR, we provide example outputs

(https://zenodo.org/record/4945414) for a target list of n = 134 high-confidence interacting with the c-MYC oncoprotein (Dingar et al., 2015).

Author Contributions

Design and concept: SN, JW. Data preparation and software implementation: SN. Galaxy platform infrastructure: FB, SG, EH. Wrote the manuscript: SN. All authors read and approved the final manuscript.

Acknowledgements

Funding: This work was supported by the Research Council of Norway through its Centers of

Excellence funding scheme [grant number: 262652]. The authors wish to acknowledge the University

Center for Information Technology (USIT) at the University of Oslo for infrastructure support.

References

Ashburner,M. et al. (2000) Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet., 25, 25–29. Behan,F.M. et al. (2019) Prioritization of cancer therapeutic targets using CRISPR-Cas9 screens. Nature, 568, 511–516. Cerami,E. et al. (2012) The cBio cancer genomics portal: an open platform for exploring multidimensional cancer genomics data. Cancer Discov., 2, 401–404. Chen,E.Y. et al. (2013) Enrichr: interactive and collaborative HTML5 gene list enrichment analysis tool. BMC Bioinformatics, 14, 128. Dingar,D. et al. (2015) BioID identifies novel c-MYC interacting partners in cultured cells and xenograft tumors. J. Proteomics, 118, 95–111. Giurgiu,M. et al. (2019) CORUM: the comprehensive resource of mammalian protein complexes—2019. Nucleic Acids Res., 47, D559–D563. Huang,D.W. et al. (2009) Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat. Protoc., 4, 44–57. Jain,A. and Tuteja,G. (2019) TissueEnrich: Tissue-specific gene enrichment analysis. Bioinformatics, 35, 1966–1967. Kandasamy,K. et al. (2010) NetPath: a public resource of curated signal transduction pathways. Genome Biol., 11, R3. Kanehisa,M. and Goto,S. (2000) KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res., 28, 27–30. Kundra,R. et al. (2021) OncoTree: A Cancer Classification System for Precision Oncology. JCO Clin Cancer Inform, 5, 221–230. Lever,J. et al. (2019) CancerMine: a literature-mined resource for drivers, oncogenes and tumor suppressors in cancer. Nat. Methods, 16, 505–507. Liberzon,A. et al. (2011) Molecular signatures database (MSigDB) 3.0. Bioinformatics, 27, 1739–1740. Malone,J. et al. (2010) Modeling sample variables with an Experimental Factor Ontology. Bioinformatics, 26, 1112–1118. Ochoa,D. et al. (2021) Open Targets Platform: supporting systematic drug-target identification and prioritisation. Nucleic Acids Res., 49, D1302–D1310. Reimand,J. et al. (2007) g:Profiler--a web-based toolset for functional profiling of gene lists from large-scale experiments. Nucleic Acids Res., 35, W193–200. Repana,D. et al. (2019) The Network of Cancer Genes (NCG): a comprehensive catalogue of known and candidate cancer genes from cancer sequencing screens. Genome Biol., 20, 1. Smith,J.C. and Sheltzer,J.M. (2021) Genome-wide identification and analysis of prognostic features in human . bioRxiv, 2021.06.01.446243. The Cancer Genome Atlas Research Network et al. (2013) The Cancer Genome Atlas Pan-Cancer analysis project. Nat. Genet., 45, 1113–1120. Uhlen,M. et al. (2010) Towards a knowledge-based Human Protein Atlas. Nat. Biotechnol., 28, 1248–1250. Veres,D.V. et al. (2015) ComPPI: a cellular compartment-specific database for protein-protein interaction network analysis. Nucleic Acids Res., 43, D485–93. Von Mering,C. et al. (2005) STRING: known and predicted protein--protein associations, integrated and transferred across . Nucleic Acids Res., 33, D433–D437. Zhou,Y. et al. (2019) Metascape provides a biologist-oriented resource for the analysis of systems-level

datasets. Nat. Commun., 10, 1523.

Figures

Figure 1. Schematic overview of the oncoEnrichR workflow. The tool consists of four key processing steps: 1) conversion and harmonization of gene/protein query identifiers, 2) annotation, membership and enrichment analyses of query set against a comprehensive collection of cancer-relevant databases,

3) visualization of analysis results, and 4) report generation. GO = Gene Ontology.