Cancer-Dedicated Gene Set Interpretation Authors
Total Page:16
File Type:pdf, Size:1020Kb
Title: oncoEnrichR: cancer-dedicated gene set interpretation Authors: Sigve Nakken1,2,*, Sveinung Gundersen3, Fabian L. M. Bernal3, Eivind Hovig1,4, Jørgen Wesche1,2 Affiliations: 1Department of Tumor Biology, Institute for Cancer Research, Oslo University Hospital, Norway 2Centre for Cancer Cell Reprogramming, Institute of Clinical Medicine, Faculty of Medicine, University of Oslo, Norway 3ELIXIR Norway, Department of Informatics, University of Oslo, Norway 4Centre for Bioinformatics, Department of Informatics, University of Oslo, Norway *Corresponding author: [email protected] Abstract Summary: Interpretation and prioritization of candidate hits from genome-scale screening experiments represent a significant analytical challenge, particularly when it comes to an understanding of cancer relevance. We have developed a flexible tool that substantially refines gene set interpretability in cancer by leveraging a broad scope of prior knowledge unavailable in existing frameworks, including data on target tractabilities, tumor-type association strengths, protein complexes and protein-protein interactions, tissue and cell-type expression specificities, subcellular localizations, prognostic associations, cancer dependency maps, and information on genes of poorly defined or unknown function. Availability: oncoEnrichR is developed in R, and is freely available as a stand-alone R package. A web interface to oncoEnrichR is provided through the Galaxy framework (https://oncotools.elixir.no). All code is open-source under the MIT license, with documentation, example datasets and and instructions for usage available at https://github.com/sigven/oncoEnrichR/ Contact: [email protected] 1 Introduction The search for novel cancer-implicated genes is currently fueled by different types of genome-scale screening experiments, exemplified by gene silencing through siRNA and CRISPR/cas-9, protein interaction screens, or differential gene expression from RNA-seq. Importantly, most high-throughput screens typically produce the same conceptual output, that is dozens or hundreds of candidate genes ranked by effect size. A significant challenge in this respect is to aggregate sufficient prior knowledge regarding the candidate hits, both individually and at a systems level, in order to prioritize likely cancer-implicated genes for in-depth experimental validation. Presently, this effort is mostly supported by various gene set enrichment tools, but also platforms that can prioritize hits through their aberration patterns in different cancer omics datasets. Valuable resources in this regard include Metascape, DAVID, g:Profiler, and EnrichR, which leverage up-to-date databases for different types of enrichment analyses of gene lists (Huang et al., 2009; Chen et al., 2013; Zhou et al., 2019; Reimand et al., 2007). Furthermore, the web-based cBioPortal platform enables elegant interactive visualizations of custom gene lists in relation to tumor gene expression and aberration datasets from The Cancer Genome Atlas (TCGA) and other large-sequencing sequencing projects (Cerami et al., 2012; The Cancer Genome Atlas Research Network et al., 2013). To our knowledge, there is however a lack of gene set interpretation tools that successfully combine a broad set of cancer-dedicated annotations at the individual gene level with multiple perspectives of system-level properties. Specifically, information that is significant for comprehensive geneset interpretability in cancer, yet not sufficiently integrated in available frameworks, include data on gene-tumor type associations, prognostic gene expression associations, target druggability, oncogene/tumor suppressor/cancer driver gene information, tissue- and cell-type expression specificities, as well data on target subcellular localizations, protein complex memberships, and protein-protein interactions. In order to fill this gap, we have developed oncoEnrichR, a cancer-dedicated gene set interpretation workflow where a diverse suite of gene- centric annotation and prioritization modules are supplemented with system-level modules that offer insight at the level of signaling pathways, molecular gene signatures and the protein interactome. A structured and interactive gene set analysis report, highly configurable by the user, is readily available as the main output per run. We believe that oncoEnrichR can function as an easily accessible and all- inclusive resource for candidate hit interpretation in cancer. 2 Construction and Content The oncoEnrichR gene set interpretation tool has been developed as an R package, encompassing four main components (schematically illustrated in Figure 1). In order to make the tool accessible to biologists, oncoEnrichR is also available through the web-based Galaxy platform, which facilitates a reproducible and collaborative analysis framework. Input The main input to the tool is a simple list of gene identifiers. To cater for ease of use, multiple common identifier types are permitted, e.g. official gene symbols, UniProt accessions, RefSeq/Ensembl transcript and protein identifiers etc. An optional specification of a background gene set for use in enrichment analysis is further available, which ensures that enrichment results are performed according to the nature of the underlying screen. The user can also configure different parameters of the individual workflow modules (e.g. enrichment and protein-protein interaction settings), and which parts to include in the output. Data resources & analysis content The backend of oncoEnrichR is based on a comprehensive collection of prior knowledge on cancer- relevant properties of human genes and their interrelationships (a complete overview is provided in Supplementary Table 1). Query genes are annotated with status as tumor suppressor genes, oncogenes or potential cancer drivers according to significant support from the CancerMine text-mining resource and Network of Cancer Genes (Repana et al., 2019; Lever et al., 2019). For gene-disease associations, we exploit the powerful Open Targets Platform, which have established an aggregated association score between phenotypes (i.e. in the form of Experimental Factor Ontology (EFO) terms) and human genes from a multitude of sources, e.g. genetic associations, text mining, or data from animal models (Malone et al., 2010; Ochoa et al., 2021). These association scores enable an effective ranking of the candidate hits in terms of their global association to cancer, as well as tumor-type specific rankings. Importantly, in order to enable the latter functionality, we have established a semi-manually curated mapping between ontology terms and main tumor sites (e.g. breast, lung, colon/rectum etc.), using OncoTree as the starting point (Kundra et al., 2021). The Open Targets Platform is further utilized to gather information on cancer hallmark evidence, known cancer drugs targeted toward candidate genes, as well as overall target tractability (druggability) data, which provides a prioritization of the hits in terms of their potential as targets for inhibition by small molecules or antibodies. Genes of poorly defined or unknown function, as evident from limited Gene Ontology (GO) annotations or gene summary descriptions, can be conveniently highlighted. A systems-level view of the candidate set is provided through a standard enrichment/over- representation analysis towards several databases, including GO, the Molecular Signatures Database, KEGG and NetPath (Kanehisa and Goto, 2000; Liberzon et al., 2011; Kandasamy et al., 2010; Ashburner et al., 2000). This is supplemented by a protein-protein interaction view of the candidate set and their neighbouring partners (STRING), a view where we also integrate druggable targets and known cancer genes (Von Mering et al., 2005). Information on candidate hit memberships in known protein complexes (CORUM) and their subcellular localization patterns (ComPPI) adds additional dimensions to the global view of the candidate set (Giurgiu et al., 2019; Veres et al., 2015). Through the use of omics data from TCGA, candidate hits can be interrogated and prioritized for aberration frequencies in different tumor types, and also the potential co-expression of candidate hits with other cancer-relevant genes. Expression profiling data from healthy tissues (Human Protein Atlas (HPA) and Genotype-Tissue Expression Project) are used to portray tissue- and cell-type specific expression patterns for the candidate genes, which subsequently shows enrichment of particular tissues/cell-types (Jain and Tuteja, 2019; Uhlen et al., 2010). The tool can further list gene- survival associations for multiple genetic features (expression, mutation, copy-number) based on HPA and recently published data (Smith and Sheltzer, 2021) . Finally, the oncoEnrichR workflow integrates gene fitness data and target priority scores from CRISPR-Cas9 mediated cell inactivation of human cancer cell lines (Behan et al., 2019), which have lately shown to be an effective approach to prioritize drug target candidates in cancer. Importantly, intersection of the candidate hits with genetic screen data also indicate relevant cell lines to be used in potential experimental follow-up studies. Output The output of the oncoEnrichR workflow is two-fold. A structured HTML report is provided with 13 separate analysis modules of the above-mentioned features and datasets, encompassing a variety of visualizations and interactive tables for prioritization of candidate hits. Moreover, a multi-sheet Excel workbook with all annotations and