bioRxiv preprint doi: https://doi.org/10.1101/2020.12.18.423506; this version posted December 18, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

A common genetic architecture enables the lossy compression of large CRISPR libraries

Boyang Zhao1,*, Yiyun Rao2, Luke Gilbert3-5, Justin Pritchard1,2,*

1. Department of Biomedical Engineering, Pennsylvania State University
2. Huck Institute for the Life Sciences, Pennsylvania State University
3. Department of Urology, University of California at San Francisco
4. Department of Cellular & Molecular Pharmacology, University of California, San Francisco, CA, USA
5. Helen Diller Family Comprehensive Cancer Center, San Francisco, San Francisco, CA, USA

* Correspondence and requests for materials should be addressed to JP ([email protected]) and BZ ([email protected])


Abstract

There are thousands of ubiquitously expressed mammalian genes, yet a genetic knockout can be lethal to one cell and harmless to another. This context specificity confounds our understanding of genetics and cell biology. Two large collections of pooled CRISPR screens offer an exciting opportunity to explore cell specificity. One explanation, synthetic lethality, occurs when a single “private” mutation creates a unique genetic dependency. However, by fitting thousands of machine learning models across millions of omic and CRISPR features, we discovered a “public” genetic architecture that is common across cell lines and explains more context specificity than synthetic lethality. This common architecture is built on CRISPR loss-of-function phenotypes that are surprisingly predictive of other loss-of-function phenotypes. Using these insights and inspired by the in silico lossy compression of images, we use machine learning to identify small “lossy compression” sets of in vitro CRISPR constructs where reduced measurements produce genome-scale loss-of-function predictions.


Introduction

Large-scale transcriptomics atlases have identified thousands of ubiquitously expressed mammalian genes1. Single-cell transcriptomics can rapidly enumerate the differential expression of the genome across a tissue2,3. Yet this unprecedented resolution in transcriptomics fails to tell us how genes function in different cell types. Importantly, while transcriptomics easily scales to measure thousands of transcripts in a single cell, functional genomics can only measure the phenotypic effect of a single genotype in a single cell. The knockout of a single gene can cause lethality in one cell and no discernible phenotype in another4. This diversity is even known to exist across KRAS mutant cell lines: some G12X mutated cells are sensitive to loss of KRAS while others are not5. These phenotypic differences constitute a poorly understood but well recognized phenomenon called context specificity that is important in cell biology, genetics, and medicine6. Context specific observations are a classic problem in mammalian cell biology and genetics that dates to the origins of the field7. Understanding them aids our understanding of basic biological and genomic processes.

However, understanding cell specificity means contending with the fact that mammalian cell lines span an incredible diversity of potentially relevant contexts that include nucleotide mutations, chromosomal abnormalities, copy number alterations, transcriptional profiles, and tissues of origin8. Together, these contexts drive the richness and the confusion of mammalian cell biology and genetics. CRISPR screening efforts by the Sanger and Broad institutes have performed genome-wide loss-of-function studies in nearly 1000 unique mammalian cell lines4,9. These studies have identified extensive phenotypic diversity (as measured by differences in CERES scores) in many genes9,10. Synthetic lethality has been the dominant paradigm used to explore cell type specific sensitivity to loss-of-function in mammalian cancer cell lines. Classic synthetic lethality in cancer occurs when a specific somatic mutation confers sensitivity to genetic or chemical genetic loss-of-function11. This is best exemplified by the loss-of-function mutations in BRCA1/2 that predict exquisite sensitivity to the loss of PARP1/2. This relationship has led to the clinical application of PARP inhibitors12. Beyond this paradigmatic example, a number of interesting versions of synthetic lethality have been suggested, including dosage lethality, paralog lethality, and lineage specific sensitivities12–17. All of these have been useful ideas to investigate the origins of cell type specificity and to create potential biomarkers for cancer therapy. However, the number of new and exciting synthetic lethality relationships discovered appears to be smaller than the phenotypic diversity observed in large scale cell line studies6, indicating that there is room for improvement in current association datasets.

One likely additional explanation for context specificity is that some molecular markers remain unmeasured. For instance, an unmeasured covariate is a potential explanation for any selective essential phenotype without a clear explanation. A second explanation is a lack of statistical power to identify rare variants or variants with modest effect sizes across large and heterogeneous groups of cells. Both these explanations can be tested and addressed by examining more data across more cell lines, and we applaud the ongoing efforts to do this. But, beyond these typical explanations, we posit that static genomic features of unperturbed cell lines may not be sufficient to predict context specificity. CRISPR knockouts create dynamic measurements of cellular responses to gene loss. This dynamic information may be useful to understand context specificity.

In this study, we explore the origins of cell type specificity as a question of basic mammalian cell biology and genetics. To do this, we take a data-driven approach that focuses on predicting cell type specificity with machine learning models. We carefully built thousands of machine learning models that incorporated the effects of millions of CRISPR knockout phenotypes in addition to mutations, copy number, lineage, and RNA-seq to predict cell type specific phenotypes. Our analysis revealed that the best models of cell type specific CRISPR loss-of-function phenotypes were composed of other CRISPR loss-of-function phenotypes in a pooled library. Thus, “CRISPR predicts CRISPR” explains more context specificity than previous models. Furthermore, predictive CRISPR features fall into highly clustered and cross predictive subnetworks. When combined, these subnetworks constitute a “common genetic architecture”. While this architecture could be used for gene functional discovery, it also enables a powerful new approach for functional genomics. Inspired by the ideas behind data compression, we propose an approach for dramatically compressing genome-scale CRISPR functional genomics experiments. In lossy compression, orders of magnitude reductions in file sizes are exchanged for some acceptable tradeoff in data quality. Instead of in silico data, “CRISPR predicts CRISPR” models can compress in vitro CRISPR library composition. They identify reduced sets of CRISPR constructs that can predict the loss-of-function effects of unmeasured genes at tunable scales.

Chemogenomic screens bottleneck libraries by 10-fold at a 90% inhibition of cell viability and raise screening coverage requirements for genome-scale screens to as many as 1 billion cells, a challenging bar. Pairwise genetic epistasis maps require $n^2/2$ unique constructs18. A genome-wide pairwise interaction map that uses the common coverage requirements of even 3 sgRNAs per gene, 500 cells/sgRNA, and 500 reads/sgRNA creates an impossibly large screening population in excess of $10^{10}$ cells. Thus, our lossy sets create an opportunity to extract genome-level information from smaller and more tractable libraries, with applications in experiments that are impossible to scale to genome-wide libraries. In doing so, we show it is possible to break the fundamental barrier of functional genomics: that only 1 genotype-to-phenotype relationship can be measured per cell in a pooled screen.
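The scale of a pairwise map can be checked with a few lines of arithmetic. The sketch below is illustrative only: the gene count of 18,333 is the number of genes modeled in the Methods, and counting a single construct per gene pair is a deliberately conservative assumption (a library with 3 sgRNAs per gene would need more).

```python
# Back-of-the-envelope size of a genome-wide pairwise CRISPR screen.
# Assumption (not from the paper): one construct per unordered gene pair,
# which gives a lower bound that ignores multiple sgRNAs per gene.
N_GENES = 18_333           # genes modeled in this study
CELLS_PER_CONSTRUCT = 500  # coverage quoted in the text


def pairwise_constructs(n_genes: int) -> int:
    """Number of unordered gene pairs, n(n-1)/2, which is roughly n^2/2."""
    return n_genes * (n_genes - 1) // 2


pairs = pairwise_constructs(N_GENES)
cells_needed = pairs * CELLS_PER_CONSTRUCT
print(f"{pairs:,} pairs -> at least {cells_needed:.1e} cells")
```

Even under this lower bound the required population exceeds $10^{10}$ cells, consistent with the estimate in the text.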

Methods

Overview of code repository

A fully documented git repository with all source code and notebooks can be accessed at the Pritchard Lab at PSU GitHub page https://github.com/pritchardlabatpsu/cnp_dev. The source code for running the entire pipeline has been provided as a Python package, with all library dependencies defined for complete reproducibility, including the virtual environment used. All original data are already publicly available on the DepMap portal and upon download can be placed in the data folders as described in the documentation. The repo also contains code for the generation of all figures and tables presented in this manuscript.

Datasets

Cancer Dependency Map datasets 19Q3 and 19Q4 were retrieved from the DepMap portal19. This included the Broad and Sanger gene effects (CERES scores), CCLE expression, CCLE gene copy number, CCLE mutations, and lineage (from the sample info file). The classifications of nonessentials and common essentials were also retrieved for the same release periods.

Data preprocessing

The data were preprocessed, and later models were built, using Python. Several steps were performed to preprocess the raw data. Missing copy number values were replaced with zero. Cell lines with any missing CERES values were dropped. Mutations were grouped into damaging (with variant annotation labeled as damaging), hotspot non-damaging (with variant annotation labeled as not damaging and is either a COSMIC or TCGA hotspot), or other (with variant annotation labeled as other conserving/non-conserving).

Feature pruning was performed to remove non-informative variables. This included removal of features with only constant values and of categorical features where a category value was supported by 10 or fewer samples. Non-expressed genes (TPM<1) were also removed. Data were normalized (z-scored) for RNA-seq and copy number values.
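The pruning and normalization steps can be sketched in plain Python. This is a minimal illustration rather than the repository's implementation: the function names and toy feature names are hypothetical, and the use of the population standard deviation for z-scoring is an assumption, since the text does not specify which estimator was used.

```python
from collections import Counter
from statistics import mean, pstdev


def drop_constant(features):
    """Remove features (name -> list of values) that take a single value."""
    return {name: vals for name, vals in features.items() if len(set(vals)) > 1}


def rare_categories(labels, max_support=10):
    """Return category values supported by max_support or fewer samples."""
    counts = Counter(labels)
    return {cat for cat, n in counts.items() if n <= max_support}


def zscore(values):
    """Z-score with the population SD (an assumption, see lead-in)."""
    mu, sd = mean(values), pstdev(values)
    return [(v - mu) / sd for v in values]


# Toy inputs with hypothetical feature names.
features = {"GENE_A_cn": [0.1, 0.4, -0.2], "GENE_B_cn": [0.0, 0.0, 0.0]}
kept = drop_constant(features)        # GENE_B_cn is constant and is dropped
lineages = ["lung"] * 12 + ["bone"] * 3
rare = rare_categories(lineages)      # "bone" has only 3 supporting samples
z = zscore([1.0, 2.0, 3.0])
```

Constant features are removed before z-scoring here so that the standard deviation in `zscore` is never zero.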

Feature selection and model building

The model building was based on an iterative feature selection process. First, the dataset was constructed using CERES (excluding the target gene), RNA-seq, copy number, mutations, and lineage as features, with the goal of predicting the CERES of a target gene. The target genes list was a set of 583 genes with highly variable phenotype based on CERES standard deviation >0.25 and range >0.6. For lossy gene set compression inference (described below), the target genes list was the entire genomic set of 18,333 genes. The data were split 85% and 15% into train and test sets, respectively. The full training dataset was fit using a random forest regressor with 1000 trees, a maximum depth of 15 per tree, a minimum of 5 samples required per leaf node, and the maximum number of features set to log2 of the total number of features. The top quartile of most important features was kept and used to refit a new random forest model. This process was repeated three times. The remaining feature set was further refined for significant features using the Boruta feature selection method20. The resulting features fit using random forest constituted the reduced model. For the purposes of analyses, the top ten most important features were selected and the resulting model constituted the top 10 only features model.
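The iterative top-quartile loop described above can be sketched as follows. This is an illustration under stated assumptions, not the repository's code. `rf_importances` uses scikit-learn's `RandomForestRegressor` with the hyperparameters quoted in the text and is imported lazily, so the rest of the sketch runs without scikit-learn. `abs_corr_importances` is a dependency-free stand-in scorer that is not the paper's method. Interpreting "repeated three times" as three rounds in total is an assumption, and the Boruta refinement step is omitted.

```python
import math
import random


def rf_importances(X, y, n_trees=1000):
    """Random forest importances with the hyperparameters quoted in the text."""
    from sklearn.ensemble import RandomForestRegressor  # lazy, optional import
    model = RandomForestRegressor(
        n_estimators=n_trees, max_depth=15, min_samples_leaf=5,
        max_features="log2", random_state=0,
    )
    model.fit(X, y)
    return list(model.feature_importances_)


def abs_corr_importances(X, y):
    """Stand-in scorer (an assumption): |Pearson r| of each column with y."""
    n = len(y)
    my = sum(y) / n
    sy = math.sqrt(sum((v - my) ** 2 for v in y))
    scores = []
    for j in range(len(X[0])):
        col = [row[j] for row in X]
        mx = sum(col) / n
        sx = math.sqrt(sum((v - mx) ** 2 for v in col))
        cov = sum((a - mx) * (b - my) for a, b in zip(col, y))
        scores.append(abs(cov / (sx * sy)) if sx > 0 and sy > 0 else 0.0)
    return scores


def iterative_top_quartile(X, y, names, scorer, rounds=3):
    """Score the surviving features, keep the top quartile, and repeat."""
    keep = list(range(len(names)))
    for _ in range(rounds):
        sub = [[row[j] for j in keep] for row in X]
        scores = scorer(sub, y)
        n_keep = max(1, math.ceil(len(keep) / 4))
        ranked = sorted(range(len(keep)), key=lambda k: scores[k], reverse=True)
        keep = [keep[k] for k in ranked[:n_keep]]
    return [names[j] for j in keep]


# Toy data: 100 "cell lines", 64 features, only feature g0 drives the target.
rng = random.Random(0)
X = [[rng.gauss(0, 1) for _ in range(64)] for _ in range(100)]
y = [row[0] + rng.gauss(0, 0.1) for row in X]
names = [f"g{j}" for j in range(64)]
selected = iterative_top_quartile(X, y, names, abs_corr_importances)
print(selected)
```

Passing `rf_importances` as the `scorer` argument gives the random forest variant; the scorer is a parameter only so that the pruning loop can be read and run independently of the model.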

In addition to this iterative feature selection and Boruta selection with random forest, other modeling approaches were performed for comparison. These included linear regression, elastic net (with alpha of 0.1), and random forest on the full train dataset followed by taking the top quartile of important features for building the reduced model. The top ten most important features were then taken for follow-up analyses.

Model evaluation

The models were evaluated based on $R^2$ values, which were calculated for the full, reduced, and top 10 only features models. While $R^2$ values provide a sense of model fit, they do not describe model significance. To this end, a recall metric was also calculated to take into account the correlation between predicted versus actual CERES relative to a reference null distribution. More specifically, recall was defined, similar to that in Subramanian et al., as the fraction of null Spearman correlation values that were lower than the Spearman correlation of the given model for the target gene of interest. The null correlation value was calculated as the Spearman correlation between the actual CERES of a randomly drawn gene and the predicted CERES of the target gene. This procedure was repeated 1000 times to generate the null distribution, per target gene. For the analysis set, we focused on predictable models with a recall >0.95 and $R^2$ >0.1 of the top 10 only features models. In addition, a concordance score was calculated per model as the fraction of all predicted CERES values where the actual and predicted were both below or both above -0.6 (a cut-off used for essentiality).
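The recall and concordance definitions above can be made concrete with a short dependency-free sketch. The function names, the toy data, and the choice to draw null genes with replacement from a pool of other genes are assumptions made for illustration; the text does not specify these details.

```python
import math
import random


def _ranks(values):
    """Ranks starting at 1, with tied values given their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(a, b):
    """Spearman correlation, computed as the Pearson correlation of ranks."""
    ra, rb = _ranks(a), _ranks(b)
    ma, mb = sum(ra) / len(ra), sum(rb) / len(rb)
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    va = sum((x - ma) ** 2 for x in ra)
    vb = sum((y - mb) ** 2 for y in rb)
    return cov / math.sqrt(va * vb) if va > 0 and vb > 0 else 0.0


def recall(actual, predicted, other_genes, n_draws=1000, seed=0):
    """Fraction of null correlations that fall below the model's correlation."""
    rng = random.Random(seed)
    model_rho = spearman(actual, predicted)
    pool = list(other_genes.values())
    # Null: actual CERES of a randomly drawn gene vs predicted CERES of target.
    null = [spearman(rng.choice(pool), predicted) for _ in range(n_draws)]
    return sum(rho < model_rho for rho in null) / n_draws


def concordance(actual, predicted, cutoff=-0.6):
    """Fraction of cell lines where both values fall on the same side of cutoff."""
    same = sum((a < cutoff) == (p < cutoff) for a, p in zip(actual, predicted))
    return same / len(actual)


# Toy example: 50 cell lines, a well-predicted target, 200 unrelated genes.
rng = random.Random(1)
actual = [rng.uniform(-2.0, 0.5) for _ in range(50)]
predicted = [a + rng.gauss(0, 0.05) for a in actual]
other_genes = {f"gene{i}": [rng.uniform(-2.0, 0.5) for _ in range(50)]
               for i in range(200)}
recall_value = recall(actual, predicted, other_genes)
concordance_value = concordance(actual, predicted)
```

In this toy case the target is predicted almost perfectly, so its recall passes the 0.95 threshold used for the analysis set.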

Gene set analysis

For the examination of the relations between source and target genes in the models, several gene sets were used. Panther gene sets were downloaded from Enrichr libraries21. Paralog gene lists were retrieved from Ensembl gene trees using biomaRt in R22,23. The HGNC symbols of the most diverse genes were used as filters and the queries were submitted to the GRCh37 Homo sapiens Ensembl Genes biomaRt database. Gene set enrichments were performed using gprofiler2 in R24 with default parameters and a p-value significance threshold of 0.05.

Network analyses

Networks were derived from the model results by linking all related source and target genes using networkx in Python. Communities were extracted from this network using the Louvain method. Network communities were visualized and network statistics were extracted using Cytoscape25 v3.7.2. Power-law analysis was performed on nodes with an undirected degree of at least 2. The number of nodes versus degree was fit to a power-law function $y = a x^{b}$.
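The degree distribution and power-law fit can be sketched without external libraries. The fitting procedure is not specified in the text, so the ordinary least-squares fit in log-log space used here is an assumption; the function names and the synthetic data are likewise illustrative, and the edge list is assumed to contain each undirected edge once.

```python
import math
from collections import Counter


def degree_counts(edges, min_degree=2):
    """Map degree -> number of nodes with that degree (undirected edge list)."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return Counter(d for d in degree.values() if d >= min_degree)


def fit_power_law(xs, ys):
    """Least-squares fit of log(y) = log(a) + b*log(x); returns (a, b)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx, my = sum(lx) / len(lx), sum(ly) / len(ly)
    b = (sum((p - mx) * (q - my) for p, q in zip(lx, ly))
         / sum((p - mx) ** 2 for p in lx))
    a = math.exp(my - b * mx)
    return a, b


# Synthetic check: points that lie exactly on y = 100 * x^-2.
degrees = [2, 3, 4, 5, 8, 10]
node_counts = [100 * d ** -2 for d in degrees]
a, b = fit_power_law(degrees, node_counts)
print(f"a = {a:.2f}, b = {b:.2f}")
```

On real degree counts the log-log fit is sensitive to the sparse high-degree tail, so the fitted exponent should be read as descriptive rather than as a formal test of scale-free structure.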

Lossy gene sets

Lossy sets were derived as centroids in an iterative k-means-based tight-clustering procedure using the tightClust package in R26. The procedure was run to derive L25, L75, L100, L200, and L300 lossy sets, which were then used for the saturation analyses. For compression-based inference, the input dataset consisted of the CERES of these lossy gene sets, as opposed to all genomic features.

Statistical analyses

The enrichment of the contribution of CERES as features was tested using a chi-squared test. The statistical significance for node clustering coefficient and average number of neighbors between functional only and functional+genomic networks was assessed using a two-sided t-test. All statistical tests were performed in Python.

Results

Cell type specific loss-of-function phenotypes are predictable with machine learning.

Recent efforts at the Broad and Sanger institutes have created an unprecedented resource to investigate the origins of context dependence across mammalian cell lines. Context specific behavior in CRISPR loss-of-function phenotypes is defined across cell lines using CERES scores9. Briefly, a CERES score measures the size of the fitness difference that is observed in a pooled screen from a single mammalian cell line. CERES scores are calculated for every gene in the library and they account for multiple confounding effects that bias the direct measurements of individual sgRNA enrichment and depletion. Negative CERES scores denote net gene depletion and positive scores denote enrichment across sgRNAs. Together, the CERES scores for a gene across all cell lines provide a high resolution measurement of cell type specificity.

In considering how to model cell type specific phenotypes from CERES, genes could follow 2 distinct models of cell specificity. For example, in one widely used model, CERES scores are thought to belong to 2 distinct probability distributions that are simply “essential” or “not essential” for growth in an individual cell line. This first model suggests that individual cell lines can be assigned 1 of these 2 classes for any gene. However, CERES scores in common essential genes can harbor large variation across cell lines19,27–29. Moreover, some genes that are labeled “common essential”, “selective essential”, and “non-essential” have large ranges of continuous variation in CERES scores that appear to be biologically meaningful (Figure 1A,B). Thus, an alternative explanation is that CERES scores are continuous outputs of a single cell type specific function across cell lines that represent the degree to which gene loss-of-function changes cell growth rates30.

To be agnostic to these two models, we used standard deviation and range cutoffs across the CERES scores for all genes in all cell lines. We found that genes with more context specificity across cell lines (i.e., increasing standard deviation and range) tend to have highly correlated non-paralogous genes in the dataset (Figure 1B). This relationship was not simply a function of increased variation because permutation tests failed to observe similar trends in the data. We visualized distributions and identified 583 genes out of 18,333 total genes using a cutoff (see Methods) that identified the largest standard deviations and the most expansive ranges of CERES scores across all cell lines (Figure S1-a-A). These highly context specific genes were enriched for functions in the , metabolic processes and mitochondrial processing (Figure S1-b).
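The cutoff-based selection can be illustrated with the thresholds given in the Methods (standard deviation >0.25 and range >0.6). The function name and toy gene names below are hypothetical, and the use of the sample standard deviation is an assumption because the estimator is not stated.

```python
from statistics import stdev


def highly_variable(ceres_by_gene, sd_cutoff=0.25, range_cutoff=0.6):
    """Return genes whose CERES scores pass both the SD and range cutoffs."""
    selected = []
    for gene, scores in ceres_by_gene.items():
        spread = max(scores) - min(scores)
        if stdev(scores) > sd_cutoff and spread > range_cutoff:
            selected.append(gene)
    return selected


# Toy CERES profiles across four cell lines (hypothetical gene names).
toy = {
    "GENE_FLAT": [-0.05, 0.00, 0.02, -0.01],
    "GENE_VARIABLE": [-1.20, -0.10, 0.05, -0.70],
}
variable_genes = highly_variable(toy)
print(variable_genes)
```

Requiring both cutoffs keeps genes that vary broadly across many cell lines and excludes genes whose range is driven by a single outlying cell line with little overall spread.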

There are multiple potentially predictive data types (and millions of data points) assayed in the DepMap project that could be used to infer the origins of context specific loss-of-function phenotypes (Figure 1C,D). All of the datasets (mutation, RNA-seq, CNV, and lineage) that we examined from the CCLE contained a large dynamic range of measurements (Figure S1-a-B). This suggested that variation in any/all of the data types could be useful for making context specific predictions of CRISPR loss-of-function phenotypes and inferring predictive features of interest.

Traditional definitions of synthetic lethality in cancer cell lines examine univariate genetic predictors that include copy number and mutation context (Figure 1D,E). Beyond classic definitions of synthetic lethality, lineage and transcriptional data have also been used to mine large cancer datasets. Because we aim to understand the nature of context specificity and how predictable it is, we took a fundamentally different approach from recent impressive work that ranked potential univariate biomarkers and drug target pairs4. Compared to univariate predictors, our first question was whether multivariate machine learning models can explain larger amounts of context specificity. We also relaxed the requirement for translational utility by allowing a CERES score in one gene to be predicted by other CERES scores across the genome. We did this because we suspected that the collective probing of the loss-of-function dynamics of the entire genome through CRISPR could provide information on the state of the genetic network that is fundamentally different than baseline OMICS measurements in unperturbed cells.

With millions of measurements, we needed to build an extensive feature selection and model validation pipeline for our machine learning models to minimize overfitting while ensuring robustness and predictive power (Figure S1-c). Comparing across multiple machine learning methods, we found that an approach that employed iterative-feature selection combined with random forest regression was robust and superior to other approaches (Figure S1-d and Fig S1-e). We validated our models with additional test sets (from later DepMap releases) and we observed comparable results (Fig S1-e A-C).

With this high-quality pipeline, we built thousands of machine-learning models across millions of measurements. We found that multivariate models virtually always outperform univariate models when predicting diverse cell line specific phenotypes across our set of highly context specific genes (Figure 1E,F). This suggests that multivariate context critically improves predictions of cell type specificity, and that classic synthetic lethality as a univariate paradigm to interpret context specificity can be greatly improved upon. We observed this for models that predicted cell line specificity as essential or not, and models that predicted cell line specificity as a continuous CERES score. However, are our multivariate models better because they add further genetic context to key synthetic lethal mutations? Or is the multivariate prediction accuracy due to something else?

“CRISPR predicts CRISPR” rationalizes cell line specific responses to gene loss.

To investigate why our multivariate machine learning models make better predictions than univariate models, we first examined the types of data that contribute to highly predictive models of context specificity. A plurality of input features (40.0%) were RNA-seq transcripts in the input data set, but the top 10 features in our predictive multivariate models were overwhelmingly composed of CERES scores (73.4%) (p-value < 0.001) (Figure 2A-B, Figure S2-a A, and Figure S2-b). This suggested to us that perturbations of the genetic network (as provoked by Cas9, and measured by CERES scores) were better at predicting cell type specific phenotypes than conventional “omic” measurements from unperturbed cells.

To quantify the degree of these differences and to understand them, we examined our multivariate models of context specific loss-of-function using existing literature models of context specificity. These alternate models include classic synthetic lethality (the genotype of cell x predicts sensitivity to LOF), dosage lethality (the expression level of a gene predicts its own sensitivity to knockout), paralog lethality (the loss of a compensating paralog confers sensitivity to a related gene), lineage lethality (where cell type or developmental context predict genetic sensitivity), and network context (where neighboring genetic nodes in a network diagram are connected using another database that adds signal to noise in context specific predictions)4,14,31–33. Amongst the top 10 features in our multivariate models, all of these existing models of context specificity were observed (Figure 2C-F). This is consistent with exciting recent papers where similar approaches have featured prominently4,31,33. Moreover, it re-emphasizes the importance of these relationships in a select group of context specific genes. However, despite our replication of previous paradigms, we found that >4 times as many context specific CRISPR loss-of-function phenotypes were accurately predicted by CERES scores in another gene in the genome (Figure 2F). We term this finding “CRISPR predicts CRISPR”.

Therefore, prior paradigms do not capture as many context specific relationships as our newly discovered “CRISPR predicts CRISPR” approach. Importantly, this also represents a distinct model of how context specificity can be generated. In classic synthetic lethality, cell specificity is predicted by “private” mutations that are unique to a subset of cells. Synthetic lethality suggests that these private mutations cause a genetic network to function in a different way in 2 different cell types. In contrast, our CRISPR predicts CRISPR models utilize measurements of the function of predominantly wild type genes that are consistently expressed across cell lines; only 6.9% of genes with significant CERES scores have any alterations in the omic data in any cell line. The collective probing of these “public” nodes that are present across most, if not all, cells identifies a different way to explain context specificity. It implies that many genetic rules are shared across all cell lines and that cell line specificity can be driven by the degree of common pathway utilization. Importantly, this is true in genes that are labeled as “selective essential” and “common essential”.

To understand this common genetic architecture in more detail, we also sought to understand the topology of the combined set of all multivariate models. In other words, we asked how many predictive features are shared between models. If multiple models share the same CRISPR features it would suggest that some genes might be especially good at predicting context specificity. It would also suggest that some screening information might be redundant. We divided the total number of predictive relationships by the number of unique predictors to get the average number of predictive relationships per feature. CRISPR features had an average of 2 relationships/feature, meaning that unlike mutation predictors, many CRISPR gene features were shared across multiple models (Figure 2G).
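The ratio described above (total predictive relationships divided by unique predictors, per feature type) can be computed as in the following sketch. The data layout, function name, and toy models are assumptions chosen for illustration.

```python
from collections import defaultdict


def relationships_per_feature(models):
    """models maps a target gene to a list of (feature_name, feature_type)."""
    total = defaultdict(int)   # feature type -> number of relationships
    unique = defaultdict(set)  # feature type -> distinct predictor names
    for features in models.values():
        for name, ftype in features:
            total[ftype] += 1
            unique[ftype].add(name)
    return {ftype: total[ftype] / len(unique[ftype]) for ftype in total}


# Toy models: two CERES predictors are reused, mutation predictors are not.
toy_models = {
    "TARGET_1": [("GENE_A", "CERES"), ("GENE_B", "CERES"), ("GENE_X", "mutation")],
    "TARGET_2": [("GENE_A", "CERES"), ("GENE_Y", "mutation")],
    "TARGET_3": [("GENE_A", "CERES"), ("GENE_B", "CERES")],
}
ratios = relationships_per_feature(toy_models)
print(ratios)
```

A ratio of 1.0 means every predictor of that type appears in exactly one model, while larger values indicate predictors that are shared across models.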

Densely cross predictive CRISPR networks define a common genetic architecture in mammalian cells.

Because of the dichotomy between the “private” features that predict synthetic lethality and the “public” features in our CRISPR predicts CRISPR models, we provide a schematic that describes our hypothesis to account for these differences. In cancer models of synthetic lethality, a variety of possible models exist. These distinct relationships form distinct private functions $f_i(x)$ that predict cell type specific differences in CERES scores $Y_i$ (Figure 3A, Left). The predictive features in these functions are rarely shared between models (Figure 2G). However, in a common genetic architecture these private functions exist in the background of a densely predictive common genetic architecture that ties many individual $Y_i$ predictions together, and many CERES features are shared between models (Figure 3A, Right). This hypothesizes a different network structure. We asked whether this schematic was consistent with our models of the DepMap data by using the union of all models to build networks where genes are nodes, and predictors of cell type specific phenotypes are edges (Figure 3B). When we compared the networks with and without CRISPR CERES scores we observed a striking dichotomy (Figure 3C) that matched our hypothesis from Figure 3A. CRISPR CERES predictions connect many private models into larger “public” subnetworks (Figure 3C,D). This dramatically increases the number of nodes (genes) that exist in the network (i.e., it explains more context specificity), and these nodes have a higher mean number of neighbors, 5.1 vs 1.7 (p-value = 2.0e-24), and a higher clustering coefficient, 0.5 vs 0.04 (p-value = 1.7e-40) (Figure 3E,F). These highly connected and densely clustered subnetworks identify the fabric of genetic interactions that are interwoven to collectively determine cell specificity across cell lines.
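The two network statistics compared above, mean number of neighbors and clustering coefficient, can be computed from an edge list of predictor and target genes as in this sketch. It is a plain-Python illustration, not the Cytoscape or networkx workflow used in the Methods; assigning a clustering coefficient of 0 to nodes with fewer than 2 neighbors is an assumed convention, and the node names are hypothetical.

```python
from itertools import combinations


def build_graph(edges):
    """Undirected adjacency sets from (predictor, target) pairs."""
    adj = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def mean_neighbors(adj):
    """Average number of neighbors per node."""
    return sum(len(nbrs) for nbrs in adj.values()) / len(adj)


def clustering_coefficient(adj, node):
    """Fraction of a node's neighbor pairs that are themselves connected."""
    nbrs = adj[node]
    k = len(nbrs)
    if k < 2:
        return 0.0  # assumed convention for nodes with fewer than 2 neighbors
    links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    return 2 * links / (k * (k - 1))


def mean_clustering(adj):
    """Average clustering coefficient over all nodes."""
    return sum(clustering_coefficient(adj, n) for n in adj) / len(adj)


# Toy network: a cross-predictive triangle (A, B, C) plus one private edge.
graph = build_graph([("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")])
print(mean_neighbors(graph), mean_clustering(graph))
```

In the toy network the triangle contributes the clustering, while the single private edge to D lowers both statistics, mirroring the contrast between densely cross-predictive subnetworks and isolated private relationships.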

These networks obey the power law (Figure 3E), and highlight coherent biology (Figure 3C, S3-a, Supplementary Data). For example, Figure S3-a-A shows a Cyclin-CDK regulatory network that is entirely composed of genes that drive the cell cycle. Figure S3-a-B highlights mitochondrial respiration and the . Thus, our common genetic architecture identifies coherent biological subnetworks with highly cross predictive CERES scores that could be used to annotate the function of unannotated genes. But our most important hypothesis is that not all genes in a subnetwork would need to be measured in order to predict the CERES scores across an entire subnetwork.

Identification of a common genetic architecture leads to the lossy compression of CRISPR libraries.

Measuring sgRNA enrichment or depletion in pooled screens informs gene function by identifying when genes are required for viability27,28,34. However, these libraries can be too large and cumbersome for many types of biological studies. This is because CRISPR libraries require large amounts of coverage (4-10 sgRNAs/gene, 500 cells/sgRNA, and 500 reads/sgRNA) to achieve high quality measurements18,35,36. A “simple” genome-wide dropout screen in a mammalian cell line can require 40 to >100 million cells and a similar number of sequencing reads to get good measurements on nearly 20,000 genes. Thus, an approach that decreases the measurement requirements at the genome-scale would be useful for mammalian cell biology and genetics. This reduction creates a natural analogy to image compression. “CRISPR predicts CRISPR” models might be able to compress CRISPR experiments instead of their data. We aim to identify reduced sets of CRISPR constructs that could be screened at smaller scales to predict the loss-of-function effects of unmeasured genes at the genome-scale.

Image compression can be lossless or lossy. Lossless compression achieves modest reductions in file size but loses no information. Lossy compression can achieve orders of magnitude reductions in file size (Figure 4A), at the cost of tunable reductions in image quality37. For our purposes, a dramatic reduction in library size with some information loss is preferable to lossless compression and a modest reduction in library size. This is because a fundamental barrier to the scalability of modern human functional genomics is library size18. Achieving less than an order of magnitude reduction in library size will be useful, but it will fail to dramatically enhance experimental tractability in many experimental contexts. For instance, even a genome-wide library of perfect sgRNAs cannot decrease CRISPR library sizes to below ~20,000 unique constructs. Eliminating the need to measure redundant genes (and not redundant sgRNAs) is the only approach that has the potential to create libraries that are orders of magnitude smaller.

Inspired by lossy compression and previous landmark analyses in transcriptomics38, we hypothesized that the existence of a common genetic architecture could enable the tunable lossy compression of the genes that are targeted in CRISPR libraries. Here compression does not refer to the in silico data storage requirements, but to the in vitro number of genes that must be measured in a pooled screen to achieve a genome-scale “portrait” of gene function (Figure 4A).

To examine the possibility of lossy compression, we decided to start with the entire genome rather than focus only on the genes that were highly cell type specific. Our aim was to identify a highly informative set of genes for which experimental measurement can accurately predict unmeasured CERES scores at a genomic scale. Using a multi-step machine learning approach, we identified sets of 25, 100, 200, and 300 genes whose CRISPR CERES scores were highly correlated with other genes in the 19Q3 Broad DepMap data. These potential compression sets were enriched for genes that are involved in metabolic processes, nuclear organization, organelle biosynthesis, and transcription factor activity (Figure S4-a-A-C). We then examined whether these small gene sets harbored predictive power across the genome. With models trained on 488 cell lines and then tested on 87 independent cell lines screened months later, we found that the predictive performance of lossy compression genes in a validation set began to saturate at 200 genes (Figure S4-b-A). Across the genome, predictions from a lossy 200 set closely resembled the true test set data (Pearson r = 0.92) (Figure 4B and S4-b-B-D). Importantly, the similarity between predicted and measured CERES scores existed across common essential, common nonessential, and selective essential genes. The predictions were dramatically better than models built from a random set of 200 genes (Pearson r = 0.0) (Figure 4C). To build intuition, two representative gene-specific models are shown in Figure S4-b-D, and all 18,333 models can be reproducibly generated from source on GitHub. Importantly, some common essential and nonessential genes were predicted to have roughly the same phenotype in all cell lines. This is consistent with the interpretation that the cell treats some essential genes in a binary fashion, and it motivates our use of qualitative comparisons of accuracy in addition to observed versus predicted plots.
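The evaluation logic described above (fit on training cell lines, predict unmeasured genes in held-out cell lines, then correlate predicted with measured scores) can be sketched on synthetic data. This is a deliberately simplified stand-in and not the multi-step method used in this work: it swaps in a univariate linear fit on the single best-correlated measured gene, and the gene names (M1, M2, T1-T3), effect sizes, and noise levels are all hypothetical.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx


def predict_unmeasured(train, test, measured, targets):
    """Predict each target gene in the test cell lines from one measured gene."""
    predictions = {}
    for target in targets:
        # Pick the measured gene whose training scores track the target best.
        best = max(measured, key=lambda g: abs(pearson_r(train[g], train[target])))
        slope, intercept = fit_line(train[best], train[target])
        predictions[target] = [slope * x + intercept for x in test[best]]
    return predictions


def simulate(n_cells, rng):
    """Synthetic scores: two measured genes and three targets that depend on them."""
    data = {g: [] for g in ("M1", "M2", "T1", "T2", "T3")}
    for _ in range(n_cells):
        p1, p2 = rng.gauss(0, 1), rng.gauss(0, 1)  # latent pathway activity
        data["M1"].append(p1)
        data["M2"].append(p2)
        data["T1"].append(0.8 * p1 - 0.5 + rng.gauss(0, 0.05))
        data["T2"].append(-0.6 * p2 - 0.2 + rng.gauss(0, 0.05))
        data["T3"].append(0.5 * p1 - 1.0 + rng.gauss(0, 0.05))
    return data


rng = random.Random(0)
train, test = simulate(60, rng), simulate(20, rng)
measured, targets = ["M1", "M2"], ["T1", "T2", "T3"]
predicted = predict_unmeasured(train, test, measured, targets)

# Flatten the gene x cell-line matrices and compare predicted with held-out scores.
flat_pred = [v for g in targets for v in predicted[g]]
flat_true = [v for g in targets for v in test[g]]
r_test = pearson_r(flat_pred, flat_true)
```

Because the synthetic targets are nearly noise-free functions of the measured genes, the held-out correlation here is high by construction. The point of the sketch is the train/test structure and the genome-wide (flattened) Pearson comparison, not the specific value.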

Finally, recent work comparing the Broad dataset with the Sanger dataset has identified high cross-dataset correlations and key differences for many shared genes10. These projects had differences in screen duration, media composition and sgRNA identity. As an especially stringent test, we examined the predictive performance of our lossy 200 gene set, which was built on Broad data, in a second test set from the Sanger Institute (Figure 4D). Consistent with a lossy approach, our lossy 200 set reproduced a genome-scale portrait of CRISPR loss of function in the Sanger Institute data that resembled performance in the DepMap test data (Pearson r = 0.78). This is despite the Sanger Institute’s use of different sgRNAs and different screening conditions, neither of which our lossy 200 gene set was trained on. This suggests that small sets of 200 genes can provide reproducible genome-scale information.

Discussion

What makes different cell types behave differently? This has been a fundamental question of genetics and cell biology since the first cells were cultured. It has guided classic work by Hayflick and Eagle7,39, as well as modern work in systems biology40. Recent systematic efforts to perform genome-scale CRISPR screens across mammalian cell lines are a significant new tool to understand this challenging question. This is due to the exceptionally detailed and comprehensive genomic information that now exists in CRISPR-screened mammalian cell lines across two independent institutes4,8,9. Our work represents the largest systematic effort to use these datasets to understand the basic biological origins of cell type specificity. Understanding this question advances cell biology, genetics and cancer biology. Previous work has focused on ranking translational hypotheses and comparing dataset measurement quality4,10. Surprisingly, well described and clinically important phenomena like synthetic lethality can make strong models, but they are relatively rare explanations for context specificity across all cell lines/genes. We term them “private models”, where a specific event in a subset of cell lines predicts a unique response to CRISPR-mediated loss-of-function. Thus, private genetic mutations constitute a comparatively minor explanation for cell type specific phenotypes.

Instead of private models, the collective wisdom of all CRISPR loss-of-function perturbations across multiple genes constitutes a “public” model of cell type specificity where the elements of genetic logic are shared across cell lines. Cell type specificity is produced by the differential utilization of underlying genetic pathways that are common across cells. These public models are based upon the information encoded in the response to stimuli in widely expressed genes and not mutations in those genes. They can also be collectively assembled into a common genetic map that details the network architecture of cell type specific phenotypes across cells. Interestingly, this common map resembles a decade-old concept in the signal transduction literature called “common effector processing”40. Common effector processing describes cell specific differences in caspase activation as a function of signal integration across a small set of widely expressed and not the differential utilization of cell type restricted pathways. Thus, our genetic findings converge with classic high-profile work on post-translational modification networks. This convergence points to a broader theme in explanations of context specificity that transcends a single data type or phenotypic focus. This theme is that while discrete and qualitative differences between cells can drive cell type specific behaviors, it is often the quantitative degree to which ubiquitous pathways are utilized that determines context specificity. This is a continuous model of cell specificity in public genes, as opposed to the discrete model of synthetic lethality in private contexts. Our systematic analysis across the Broad and Sanger datasets lends support to this model of context specificity at a scale that dramatically exceeds prior work.

In network biology, previous work has focused on the most connected nodes. So-called “network hubs” are often enriched in functional phenotypes and regulate many genes41. In our study, the most connected hubs contain the most redundant features in our predictive models. This dichotomous interpretation of hubs (critical vs dispensable) is counterintuitive in network science because most work has identified network hubs as important. The highly connected genes in our common genetic architecture drive the insight that the lossy compression of CRISPR libraries is possible, and they help us understand which measurements we might be able to eliminate instead of the measurements that we want to keep.

While screens of single genes in well-behaved cancer cell lines are easy to perform at the genome-wide scale, many contexts exist where the experimental coverage requirements of genome-wide libraries are logistically challenging, if not impossible. Lossy compression changes the experimental calculus by allowing a subset of a genetic library to predict unmeasured gene phenotypes. The largest caveat in our approach is that it is lossy, and therefore has less information than a genome-wide library. However, lossy compression also enables orders of magnitude reductions in library size, i.e., 200 genes rather than 20,000. This caveat is most important to consider for the currently accurate null predictions that we make for common nonessential genes. Evidence in yeast and worms suggests that broader environmental and developmental contexts elicit more phenotypes across single gene knockouts42,43. Thus, the number of “common nonessentials” decreases as more screening conditions are tested. A completely novel phenotype from a novel screen in a common nonessential gene might be challenging to predict with our current lossy 200 compression set. However, new screens are being completed all the time. At the time of our writing, nearly 1000 different mammalian cell lines have already been screened by the Broad and Sanger Institutes. Only 1144 contexts were needed to identify a measurable phenotype in every yeast gene43. While the will likely require more contexts to saturate phenotypic predictions, lossy sets can be rederived in real time to capture new biology and extend our lossy approach as new data arise. Despite this, many common nonessential genes belong to paralogous gene families and contain genes that are likely to buffer each other’s effects upon single sgRNA knockout. Thus, many current common nonessential genes may never have a measurable phenotype with a single sgRNA33, and our current lossy 200 approach would be sufficient to predict phenotypes in these genes in any context. This buffering provides a biologically plausible bound on potential false negatives in lossy compression predictions of common nonessentials.

Despite these minor but important caveats, lossy compression has enormous potential. Chemogenomic screens that kill cells can bottleneck libraries by 10-fold at a 90% inhibition of cell viability. These bottlenecks in screens can raise coverage requirements for genome-scale screens to as many as 1 billion cells, a challenging bar. Beyond gene-drug interaction screens, pairwise genetic epistasis maps require n^2/2 unique constructs18. To make a genome-wide pairwise interaction map (even ignoring the challenge of cloning a pairwise library of 1e8 gene pairs), coverage requirements of 3 sgRNAs per gene, 500 cells/sgRNA and 500 reads/sgRNA create an impossibly large screening challenge that requires a population greater than 1e10 cells. Because thousands of unmeasured phenotypes can be predicted with hundreds of CRISPR measurements, lossy compression sets are positioned to expand the experimental landscape of possibilities in mammalian functional genomics by enabling genome-scale experiments that were previously impossible.
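The scaling argument in this paragraph can be made concrete with a short calculation. The sketch below is illustrative arithmetic only: the function names are hypothetical, the 100-million-cell baseline is taken from the upper end of the genome-wide estimate given earlier, and the assumption that every gene pair is targeted by all sgRNA-by-sgRNA combinations (3 x 3 = 9 constructs per pair) is ours and may not match any particular library design.

```python
def bottlenecked_cells(baseline_cells, bottleneck_fold):
    """Cells needed so that coverage survives a drug-induced bottleneck."""
    return baseline_cells * bottleneck_fold


def pairwise_library(n_genes, sgrnas_per_gene=3, cells_per_construct=500):
    """Size of a pairwise knockout library and the cells needed to screen it."""
    n_pairs = n_genes * (n_genes - 1) // 2        # unordered gene pairs, roughly n^2/2
    # Assumption: every sgRNA of one gene is combined with every sgRNA of the other.
    n_constructs = n_pairs * sgrnas_per_gene ** 2
    return {
        "gene_pairs": n_pairs,
        "constructs": n_constructs,
        "cells": n_constructs * cells_per_construct,
    }


# 90% inhibition of viability leaves one cell in ten, a 10-fold bottleneck.
chemogenomic_cells = bottlenecked_cells(100_000_000, 10)

# Genome-wide pairwise map over roughly 20,000 genes.
genome_wide_pairs = pairwise_library(20_000)
```

Under these assumptions the bottlenecked screen needs 1 billion cells, and the genome-wide pairwise map has about 2e8 gene pairs and needs on the order of 1e12 cells, comfortably above the 1e10 threshold mentioned in the text.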

References

1. Ramsköld, D., Wang, E. T., Burge, C. B. & Sandberg, R. An Abundance of Ubiquitously Expressed Genes Revealed by Tissue Transcriptome Sequence Data. PLOS Comput. Biol. 5, e1000598 (2009).

2. Wang, W. et al. Single-cell transcriptomic atlas of the human endometrium during the menstrual cycle. Nat. Med. 26, 1644–1653 (2020).

3. Ma, S. et al. Single-cell transcriptomic atlas of primate cardiopulmonary aging. Cell Res. (2020) doi:10.1038/s41422-020-00412-6.

4. Behan, F. M. et al. Prioritization of cancer therapeutic targets using CRISPR–Cas9 screens. Nature 568, 511–516 (2019).

5. Muzumdar, M. D. et al. Survival of pancreatic cancer cells lacking KRAS function. Nat. Commun. 8, 1090 (2017).

6. Huang, A., Garraway, L. A., Ashworth, A. & Weber, B. Synthetic lethality as an engine for cancer drug target discovery. Nat. Rev. Drug Discov. 19, 23–38 (2020).

7. Eagle, H. Metabolism in Mammalian Cell Cultures. Science 130, 432–437 (1959).

8. Ghandi, M. et al. Next-generation characterization of the Cancer Cell Line Encyclopedia. Nature 569, 503–508 (2019).

9. Meyers, R. M. et al. Computational correction of copy number effect improves specificity of CRISPR-Cas9 essentiality screens in cancer cells. Nat. Genet. 49, 1779–1784 (2017).

10. Dempster, J. M. et al. Agreement between two large pan-cancer CRISPR-Cas9 gene dependency data sets. Nat. Commun. 10, 5817 (2019).

11. O’Neil, N. J., Bailey, M. L. & Hieter, P. Synthetic lethality and cancer. Nat. Rev. Genet. 18, 613–623 (2017).

12. Bryant, H. E. et al. Specific killing of BRCA2-deficient tumours with inhibitors of poly(ADP-ribose) polymerase. Nature 434, 913–917 (2005).

13. Davis, R. E. et al. Chronic active B-cell-receptor signalling in diffuse large B-cell lymphoma. Nature 463, 88–92 (2010).

14. Megchelenbrink, W., Katzir, R., Lu, X., Ruppin, E. & Notebaart, R. A. Synthetic dosage lethality in the human metabolic network is highly predictive of tumor growth and cancer patient survival. Proc. Natl. Acad. Sci. 112, 12217–12222 (2015).

15. Viswanathan, S. R. et al. Genome-scale analysis identifies paralog lethality as a vulnerability of 1p loss in cancer. Nat. Genet. 50, 937–943 (2018).

16. Hoffman, G. R. et al. Functional epigenetics approach identifies BRM/SMARCA2 as a critical synthetic lethal target in BRG1-deficient cancers. Proc. Natl. Acad. Sci. 111, 3128–3133 (2014).

17. Mavrakis, K. J. et al. Disordered methionine metabolism in MTAP/CDKN2A-deleted cancers leads to dependence on PRMT5. Science 351, 1208–1213 (2016).

18. Horlbeck, M. A. et al. Mapping the Genetic Landscape of Human Cells. Cell 174, 953–967.e22 (2018).

19. Shimada, K., Muhlich, J. L. & Mitchison, T. J. A tool for browsing the Cancer Dependency Map reveals functional connections between genes and helps predict the efficacy and selectivity of candidate cancer drugs. bioRxiv 2019.12.13.874776 (2019) doi:10.1101/2019.12.13.874776.

20. Kursa, M. B. & Rudnicki, W. R. Feature selection with the Boruta package. J. Stat. Softw. 36, 1–13 (2010).

21. Kuleshov, M. V. et al. Enrichr: a comprehensive gene set enrichment analysis web server 2016 update. Nucleic Acids Res. 44, W90–W97 (2016).

22. Durinck, S. et al. BioMart and Bioconductor: A powerful link between biological databases and microarray data analysis. Bioinformatics 21, 3439–3440 (2005).

23. Durinck, S., Spellman, P. T., Birney, E. & Huber, W. Mapping identifiers for the integration of genomic datasets with the R/ Bioconductor package biomaRt. Nat. Protoc. 4, 1184–1191 (2009).

24. Raudvere, U. et al. G:Profiler: A web server for functional enrichment analysis and conversions of gene lists (2019 update). Nucleic Acids Res. 47, W191–W198 (2019).

25. Shannon, P. et al. Cytoscape: A software environment for integrated models of biomolecular interaction networks. Genome Res. 13, 2498–2504 (2003).

26. Tseng, G. C. & Wong, W. H. Tight clustering: A resampling-based approach for identifying stable and tight patterns in data. Biometrics 61, 10–16 (2005).

27. Hart, T. et al. High-Resolution CRISPR Screens Reveal Fitness Genes and Genotype-Specific Cancer Liabilities. Cell 163, 1515–1526 (2015).

28. Hart, T. et al. Evaluation and design of genome-wide CRISPR/SpCas9 knockout screens. G3 Genes, Genomes, Genet. 7, 2719–2727 (2017).

29. Dempster, J. et al. Extracting Biological Insights from the Project Achilles Genome-Scale CRISPR Screens in Cancer Cell Lines. bioRxiv 720243 (2019) doi:10.1101/720243.

30. Wainberg, M. et al. A genome-wide almanac of co-essential modules assigns function to uncharacterized genes. bioRxiv 827071 (2019) doi:10.1101/827071.

31. Lord, C. J., Quinn, N. & Ryan, C. J. Integrative analysis of large-scale loss-of-function screens identifies robust cancer-associated genetic interactions. Elife 9, 1–37 (2020).

32. De Kegel, B. & Ryan, C. J. Paralog buffering contributes to the variable essentiality of genes in cancer cell lines. PLOS Genet. 15, e1008466 (2019).

33. Dede, M., McLaughlin, M., Kim, E. & Hart, T. Multiplex enCas12a screens show functional buffering by paralogs is systematically absent from genome-wide CRISPR/Cas9 knockout screens. bioRxiv 2020.05.18.102764 (2020) doi:10.1101/2020.05.18.102764.

34. Wang, T. et al. Identification and characterization of essential genes in the human genome. Science 350, 1096–1101 (2015).

35. Jost, M. et al. Combined CRISPRi/a-Based Chemical Genetic Screens Reveal that Rigosertib Is a Microtubule-Destabilizing Agent. Mol. Cell 68, 210–223.e6 (2017).

36. Canver, M. C. et al. Integrated design, execution, and analysis of arrayed and pooled CRISPR genome-editing experiments. (2018) doi:10.1038/nprot.2018.005.

37. Usevitch, B. E. A tutorial on modern lossy wavelet image compression: foundations of JPEG 2000. IEEE Signal Process. Mag. 18, 22–35 (2001).

38. Subramanian, A. et al. A Next Generation Connectivity Map: L1000 Platform and the First 1,000,000 Profiles. Cell 171, 1437–1452.e17 (2017).

39. Smith, J. R. & Hayflick, L. Variation in the life-span of clones derived from human diploid cell strains. J. Cell Biol. 62, 48–53 (1974).

40. Miller-Jensen, K., Janes, K. A., Brugge, J. S. & Lauffenburger, D. A. Common effector processing mediates cell-specific responses to stimuli. Nature 448, 604–608 (2007).

41. Zotenko, E., Mestre, J., O’Leary, D. P. & Przytycka, T. M. Why Do Hubs in the Yeast Interaction Network Tend To Be Essential: Reexamining the Connection between the Network Topology and Essentiality. PLoS Comput. Biol. 4, e1000140 (2008).

42. Ramani, A. K. et al. The majority of animal genes are required for wild-type fitness. Cell 148, 792–802 (2012).

43. Hillenmeyer, M. E. et al. The chemical genomic portrait of yeast: Uncovering a phenotype for all genes. Science 320, 362–365 (2008).


Figure Legends

Figure 1 Multivariate machine learning models predict context specificity. (A) Examining the largest and most cell type specific genes in the DepMap data set identified genes for which CERES scores were highly correlated with a biologically related gene in the same pathway or complex. This was true in common, selective, and non-essential DepMap genes. This suggested that continuous variation in CERES scores can have biological meaning. (B) Scatter plots (left, middle) have 18,333 unique genes as data points. The standard deviation of CERES score for each gene is plotted on the x-axis, and the point size is proportional to the range of CERES scores across all cell lines. The most cell type specific genes tend to have the strongest correlations with CERES scores in other genes. This is measured via a linear model (red line), and the p-value of this model is <2e-16 in the true data (left) for common, selective, and non-essential genes. To assess significance of the relationship between standard deviation within a gene and biological correlations with the CERES scores of another gene, one example of a permuted CERES dataset and its linear model is shown in the middle. 163 permutations of the CERES scores and the p-values from their respective linear models are shown in the histogram to the far right. Permuted p-values rarely drop below 0.001 and are never below 1e-6, while the real data had a p-value of <2e-16. The more context specific a gene is, the more likely it is to be highly correlated to a second gene. Correlation with a second gene is an indicator that CERES score variation has biological meaning. (C) Heatmaps of datasets available for the prediction of context specificity from the 19Q3 DepMap CERES scores, copy number, RNA-seq, mutation, and lineage. The colors were scaled to be between -1 and 1. We have focused on the use of CERES scores in addition to traditional genomic features as potentially useful for predicting context specificity in loss-of-function phenotypes. (D) The distribution of number of features per data source. (E) After cross validation and validation of test set predictive power, models for each context specific gene are plotted as individual points. Each data point is one gene. Multivariate models (x-axis) are more predictive than univariate models (y-axis) across all gene essentiality classes. (F) Comparisons in model scores between the multivariate and univariate models across different essentiality classes. Boxplots are broken down by gene classification; score is based on performance during cross validation.

Figure 2 CRISPR CERES scores are the most prominent predictive features in multivariate models. (A) The distribution of the top 10 features per data source from the multivariate models in Figure 1, aggregated across all models. CERES scores are highly enriched as features in these multivariate models. (B) A heatmap view of all target genes for the nth feature, colored by data source. (C-E) The relationship between the top 10 features for every cell type specific gene and whether they are from the same gene (C), paralogs (D), or in a shared Panther gene set (E), as visualized in the gray pie charts and heatmaps. The feature-target gene pairs within each group (dark grey) are further broken down by data source (colored pie charts). (F) Examining the top feature in every model, we classified which model of cell type specificity that feature belonged to. (G) Redundancy scores are measured as the ratio of the total number of features to the total number of unique features, measured per data source (CERES, RNA-seq, mutation, CNA, lineage).

Figure 3 A common genetic architecture in mammalian cell lines. (A) Schematic illustrating the difference between classic synthetic lethality and our common genetic architecture. Synthetic lethality consists of many individual Yi functions. These functions are cell type specific “private” models. Our proposed common genetic architecture is hypothesized to connect these private functions with shared CERES features. A common genetic architecture has redundant edges, and more nodes. More nodes suggest that more cell type specific phenotypes are predictable. (B) A network built from the aggregation of all multivariate models. Genes are represented as nodes and feature-target gene relations as edges. Colors represent distinct subnetwork communities that were identified by the Louvain method. (C) Network communities with (right) and without (left) nodes/edges involving functional CRISPR features. Edges are colored based upon the data source, and nodes are colored based on the model score (of the top 10 feature model) of the corresponding gene as target. (D) Illustration, based on the top left network community from (C), in terms of the individual Y and aggregated Z functions. (E) Number of nodes versus node degree based on the network with genomic+functional features. The data are fit to a power law function. (F) Characteristics of networks with and without CERES features.

Figure 4 Lossy genomic compression. (A) Schematic of lossy compression in terms of image compression using JPEG (top) and its similarity to genomic loss-of-function compression. (B) Genome-wide comparison between actual and inferred CERES scores on a held-out test set, shown as heatmaps and scatter plots per gene essentiality class. Models were built on only 200 lossy genes, with their CERES scores as features. (C) The same test set as in (B), but inferred with a random predictor. (D) Genome-wide comparison between actual and inferred CERES scores on a held-out test set from Sanger. Multivariate models were trained on Broad data.


Supplemental Information

S1-a Figure S1 Baseline statistics of DepMap datasets. (A) Scatter plot of standard deviation (SD) versus mean of CERES scores. (B) Relative coefficient of variation (CV) of each data source. Relative CV is calculated as the ratio of standard deviation to the absolute value of the mean.

S1-b Figure S2 Gene set enrichment of input datasets. Gene set enrichment of terms related to biological process (A) and mitochondrial terms (B) of target genes.

S1-c Figure S3 Schematic of model and network building. Data sources (CERES scores, RNA-seq, copy number, mutations, and lineage) are used as features. Prior to model building, invariant/low-variance features and non-expressed genes are pruned. The features are selected via an iterative process followed by Boruta feature selection. Models are fit to the final reduced set of features and further filtered based on model scores. The feature-target genes were aggregated into an undirected gene network, in which communities are identified using the Louvain method. The communities are defined as directed networks based on the feature-target gene relationships. See Methods for more details.

S1-d Figure S4 Comparison in performance across different machine learning algorithms. (A) Comparison of model score on test set using all features based on different ML algorithms: elastic net, linear regression, random forest, or random forest with iterative/Boruta selection. (B) Comparison of model scores across different ML algorithms built based on the top 10 features. (C) Model score for train and test set for elastic net (left) and linear regression (right) models based on all features.

S1-e Figure S5 Validation of model performance. (A) Scatter plot of model scores on the 19Q4 vs 19Q3 test sets. Models are trained on the 19Q3 training set. (B-C) Recall versus correlation for 19Q3 (B) and 19Q4 (C) test sets. (D-E) The top 10 features are sufficient and show model performance comparable to models with all significant features, as shown by box-whisker plot (D) and scatter plot (E).

S2-a Figure S6 Model feature summaries. (A) Classification of all features of all models by data source. (B) Significant lineage features and their target gene. (C) Univariate model scores as violin plots grouped by the data source of features.

S2-b Figure S7 Contribution of data sources for top 10 features. Pie chart showing breakdown of the data source type of features contributing to model prediction, for each nth most important feature.

S3-a Figure S8 Selected network communities. CDK4 (A) and mitochondrial (B) networks extracted from model/network building. Nodes denote genes and edges denote feature-target gene relationships. Node colors are based on the score of the top 10 feature model of the corresponding gene as target.

S4-a Figure S9 Gene set enrichment of L200 lossy gene sets. Gene set enrichment of gene ontology terms related to biological process (A), molecular function (B), and cellular compartment (C) of L200 lossy gene sets, using gProfiler.

S4-b Figure S10 Validation of lossy compression. (A) Saturation analyses based on the L25, L100, L200, and L300 lossy gene sets. Y-axis shows the normalized proportion of predictable gene targets, where predictable is defined as target genes with recall greater than 0.95. (B) Genome-wide comparison between actual and inferred CERES scores on the training set. (C) Concordance between actual and predicted CERES scores for train and test sets. (D) Scatter plot of predicted versus actual for selected genes TP53BP1 and XRCC6.

bioRxiv preprint doi: https://doi.org/10.1101/2020.12.18.423506; this version posted December 18, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made Figure 1 available under aCC-BY-NC-ND 4.0 International license.

A GTPase-GEF pair, Golgi B True data Permuted data All permuted p-values

p-value < 2e-16

Mediator complex

Selective Common p-value < 2e-16

Hippo pathway Essentiality calls

p-value < 2e-16 Non-selective C

Synthetic lethality Y = F(Copy number, Mutation, Lineage) Potential biomarker Y = F(Copy number, Mutation, Lineage, RNA-seq) Understand fundamentals Y = F(Copy number, Mutation, Lineage, RNA-seq, CRISPR LOF)

D E F Data source 1.0 selective_essential 0.8 common_essential 0.8 RNA-seq (36759) CERES (18333) common_nonessential 0.6 40.04% 19.97% 0.6 0.4 0.02% Lineage (18) 0.4 Score 9.87% 0.2 Mut (9065) 0.2 30.10% 0.0

−0.2 0.0 CN (27639) multivariate univariate Model score (univariate, top feature) −0.2 0.0 0.2 0.4 0.6 0.8 1.0 Model Model score (multivariate) bioRxiv preprint doi: https://doi.org/10.1101/2020.12.18.423506; this version posted December 18, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made Figure 2 available under aCC-BY-NC-ND 4.0 International license.

[Figure image not reproduced; the recoverable panel labels are listed below.]

(A) Feature source (top 10 features): CERES (2426), 73.4%; RNA-seq (723), 21.9%; CN (150), 4.5%; Mut (6), 0.2%.
(B) Heatmap of target genes by nth feature (1 to 10), colored by feature source (Mut, CN, RNA-seq, CERES, NaN).
(C) On same gene (27), 5.1%; not on same gene (499), 94.9%. Feature sources for the same-gene group: RNA-seq (23), 79.3%; Mut (6), 20.7%.
(D) In paralog (60), 11.4%; not in paralog (466), 88.6%. Feature sources for the paralog group: RNA-seq (47), 67.1%; CERES (22), 31.4%; CN (1), 1.4%.
(E) In Panther (82), 15.6%; not in Panther (444), 84.4%. Feature sources for the Panther group: CERES (193), 88.5%; RNA-seq (24), 11.0%; CN (1), 0.5%.
(F) Number of genes by lethality type (classic synthetic, same gene, paralog, gene set, functional), shown for the top most important feature and for the top 10 features.
(G) Redundancy score (no. total features / no. unique features) by feature source (Mut, CN, RNA-seq, CERES).

Figure 3

[Figure image not reproduced; the recoverable panel labels are listed below. Node labels in the network panels are gene symbols and are not reproduced here.]

(A) Synthetic lethality. Input: X ∈ {RNA-seq, Mutations, Copy Number}. Output: Y_i = F_i(X_i) for i = 1, ..., 4, where each Y_i is an n x 1 vector (n: no. of cell lines; m = 1: LOF phenotype).
(B) Common genetic architecture. Input: X ∈ {RNA-seq, Mutations, Copy Number, CRISPR}. Output: Z, an n x m matrix (n: no. of cell lines; m: no. of LOF phenotypes).
(C) Network communities built from genomic features only and from genomic + functional features. Legend: each edge connects a feature gene to a target gene and is colored by feature source (CERES, RNA-seq, Copy number, Mutation); node legend: model score, 0.2 to 0.88.
(D) Degree distribution: number of nodes versus degree (log-log axes), with fit y = 674.05x^(-1.57).
(E) Network characteristics:

    Network characteristic         Genomic + functional feat   Genomic feat only
    Clustering coefficient         0.496                       0.043
    Average number of neighbors    5.088                       1.732
    Network heterogeneity          0.837                       0.609

(F) Example subnetworks compared between "No CRISPR data" (Y1, Y2, Y3) and "Common genetic architecture" (Z).

Figure 4

[Figure image not reproduced; the recoverable panel labels are listed below.]

(A) Original (no compression, 488kB) versus lossy (compression, 13kB).
(B) Actual (Broad) versus inferred (Broad): cell lines by ~18K genes, with ~18K genes inferred from 200 genes. Color scale: CERES score, -3 to 3.
(C) Random predictor.
(D) Actual (Sanger) versus inferred (Sanger).

Figure S1-a

[Figure image not reproduced.] (A, B) Relative CV (log scale) by data source: CERES, RNA-seq, Copy number, Mutation, Lineage.

Figure S1-b

[Panels A and B: images not reproduced; no labels are recoverable.]

Figure S1-c

[Workflow diagram not reproduced; the recoverable labels are listed below.]

Input features: CERES, RNA-seq, Copy number, Mutations, Lineages.
Steps: filtering; feature selection; iterative pruning; Boruta feature selection; computational inference.
Model scores: R2 (reduced model); R2 (univariate model).
Networks: directed gene network; undirected gene network; undirected gene network communities (Louvain method).

Figure S1-d

[Figure image not reproduced; the recoverable panel labels are listed below.]

(A) Model using all features: score (on test) for elastic net, linear regression, random forest, and iter select + boruta random forest.
(B) Reduced model with top 10 features: score (median) for the same four models.
(C) Elastic net and linear regression: score on train versus test.

Figure S1-e

[Figure image not reproduced; the recoverable panel labels are listed below.]

(A, B, C) Axis labels: Recall, Correlation, Score (Q3), Score (Q4).

(D) Score by model: all selected features versus top 10 features.
(E) Score (model based on top 10 features) versus score (model with all selected features).

Figure S2-a

[Figure image not reproduced; the recoverable panel labels are listed below.]

(A) Data source summary (all features): CERES (32688), 51.66%; RNA-seq (27700), 43.78%; CN (2858), 4.52%; Mut (17), 0.03%; Lineage (11), 0.02%.
(B) Lineage features:

    Feature                               Feature source   Target gene
    HAEMATOPOIETIC_AND_LYMPHOID_TISSUE    Lineage          PPCDC
    PANCREAS                              Lineage          KRAS
    HAEMATOPOIETIC_AND_LYMPHOID_TISSUE    Lineage          NAMPT
    SKIN                                  Lineage          SOX10
    SKIN                                  Lineage          DUSP4
    HAEMATOPOIETIC_AND_LYMPHOID_TISSUE    Lineage          ATP1B3
    HAEMATOPOIETIC_AND_LYMPHOID_TISSUE    Lineage          CBFB
    LARGE_INTESTINE                       Lineage          CTNNB1
    HAEMATOPOIETIC_AND_LYMPHOID_TISSUE    Lineage          NCKAP1
    SKIN                                  Lineage          BRAF
    SKIN                                  Lineage          ZEB2

Panel labeled B: score (univariate) by feature source (CERES, RNA-seq, CN, Mut).

Figure S2-b

[Figure image not reproduced; the ten pie charts are summarized below.]

Data source summary by feature importance rank (count and share of the features at that rank):

    Rank   CERES         RNA-seq       CN          Mut
    1st    429 (81.6%)   81 (15.4%)    13 (2.5%)   3 (0.6%)
    2nd    390 (76.5%)   98 (19.2%)    20 (3.9%)   2 (0.4%)
    3rd    352 (73.5%)   105 (21.9%)   21 (4.4%)   1 (0.2%)
    4th    304 (69.7%)   111 (25.5%)   21 (4.8%)
    5th    260 (69.1%)   93 (24.7%)    23 (6.1%)
    6th    202 (65.8%)   87 (28.3%)    18 (5.9%)
    7th    171 (69.8%)   60 (24.5%)    14 (5.7%)
    8th    141 (72.7%)   43 (22.2%)    10 (5.2%)
    9th    104 (76.5%)   26 (19.1%)    6 (4.4%)
    10th   73 (76.0%)    19 (19.8%)    4 (4.2%)

Figure S3-a

[Figure image not reproduced; the recoverable panel labels are listed below.]

(A, B) Network communities. Legend: each edge connects a feature gene to a target gene and is colored by feature source (CERES, RNA-seq, Copy number, Mutation); node legend: model score, 0.2 to 0.88.
Node labels: PET117, CCND3, COX4I1, NDUFA11, COA7, NDUFS7, UQCRC2, NDUFB2, NDUFS2, NDUFV2, NDUFA10, CYC1, NDUFS5, NDUFB11, NDUFS1, NDUFB3, COA6, BCS1L, CCND1, CDK4, NDUFS8, RBL1, NDUFB10, NDUFA6, NDUFC2, NDUFB9, NDUFAF4, TIMMDC1, CDK6, RB1, NDUFB8, NDUFA8, E2F3, NDUFA1, NDUFC1, NDUFB6, NDUFA2, GLRX5, TFDP1, SKP2, CDK2, CCNE1, CKS1B, UBE2Q1.

Figure S4-a

[Panels A, B, and C: images not reproduced; no labels are recoverable.]

Figure S4-b

[Figure image not reproduced; the recoverable panel labels are listed below.]

(A) Normalized proportions of predictable gene targets versus Lx (50 to 300).
(B) Actual versus inferred. Color scale: CERES score, -3 to 3.
(C) Concordance (0.4 to 1.0) for train and test.
(D) Predicted versus actual for TP53BP1 and XRCC6.