ARTICLE

https://doi.org/10.1038/s41467-020-14605-5 OPEN Transcriptional effects of copy number alterations in a large set of human cancers

Arkajyoti Bhattacharya 1,2, Rico D. Bense1,2, Carlos G. Urzúa-Traslaviña 1, Elisabeth G.E. de Vries 1, Marcel A.T.M. van Vugt1 & Rudolf S.N. Fehrmann 1*

Copy number alterations (CNAs) can promote tumor progression by altering expression levels. Due to transcriptional adaptive mechanisms, however, CNAs do not always translate

1234567890():,; proportionally into altered expression levels. By reanalyzing >34,000 pro- files, we reveal the degree of transcriptional adaptation to CNAs in a genome-wide fashion, which strongly associate with distinct biological processes. We then develop a platform- independent method—transcriptional adaptation to CNA profiling (TACNA profiling)—that extracts the transcriptional effects of CNAs from gene expression profiles without requiring paired CNA profiles. By applying TACNA profiling to >28,000 patient-derived tumor samples we define the landscape of transcriptional effects of CNAs. The utility of this landscape is demonstrated by the identification of four that are predicted to be involved in tumor immune evasion when transcriptionally affected by CNAs. In conclusion, we provide a novel tool to gain insight into how CNAs drive tumor behavior via altered expression levels.

1 Department of Medical Oncology, University Medical Center Groningen, University of Groningen, Groningen, the Netherlands. 2These authors contributed equally: Arkajyoti Bhattacharya, Rico D. Bense *email: [email protected]

NATURE COMMUNICATIONS | (2020) 11:715 | https://doi.org/10.1038/s41467-020-14605-5 | www.nature.com/naturecommunications 1 ARTICLE NATURE COMMUNICATIONS | https://doi.org/10.1038/s41467-020-14605-5

he genomic composition of a tumor results from defects in CNAs can promote tumor progression via alteration of genome maintenance, leading to the genomic instability expression levels of genes located at the affected genomic T 1 3–5 that characterizes many cancers . Genomic instability can regions . Due to transcriptional adaptive mechanisms, however, result in the accumulation of structural aberrancies, changes in gene copy number at the genomic level do not always including copy-number alterations (CNAs). CNAs are important translate proportionally into altered gene expression levels genomic events in the progressive molecular rewiring that takes (Fig. 1a)6,7. The degree of transcriptional adaptation to CNAs is place during tumor cell evolution2. currently determined with the genetical genomics approach8,

a

Copy number Cancer hallmarks alterations Altered gene expression levels Sustaining Avoiding Low degree of High degree of proliferative immune transcriptional adaptation transcriptional adaptation signaling destruction

Genome Deregulating instability cellular and mutation energetics

Resisting Inducing cell death angiogenesis

Evading Tumor- Gene expression level Gene expression level growth promoting suppressors inflammation

Loss Gain Loss Gain Enabling Activating replicative invasion and Gain Loss No alteration No alteration immortality metastasis

bc

Observed signals X 1..n Identification of consensus sources 15.0 R > 0.9

GEO (21,592) 9.0 Consensus source Š1 Credibility index 100% 3.0 Total (34,494) –3.0 R > 0.9 TCGA (10,817) Standardized mRNA level –9.0 Consensus source Š ≥ m –15.0 Credibility index 50% 1 2 3 4 5 6 7 8 9 10 11 12 13141516171819202122 CCLE (1067) Genomic mapping GDSC (1018) R > 0.9

Unknown Repeat 25 times with random initialization No valid consensus source Credibility index < 50% Mixing matrix Source signals Observed signals Estimated mixing matrix Estimated sources Estimated sources S from ICA run 1 x â â â S 1..m a11 a21 an1 S1 1 11 21 n1 1 Independent component Estimated sources S1..m from ICA run 2 analysis (ICA) Estimated sources S from ICA run 3 x â â â S 1..m a12 a22 an2 S2 2 12 22 n2 2 Estimated sources S1..m from ICA run 4 f(u) = log cosh(u) Estimated sources S1..m from ICA run 5-24 x â â â S a1m a2m anm Sm n 1m 2m nm m

Estimated sources S1..m from ICA run 25

d Concensus source Extreme-valued region indicator 25 25 GEO - patient samples - microarray data (HG-U133 Plus 2.0) CCLE - cell lines - microarray data (HG-U133 Plus 2.0)

1 1 15 15

5 5 0 0

–5 –5 Weight in consensus source Weight in consensus source –15 –15

–1 Extreme-valued region indicator Extreme-valued region indicator –1

–25 –25 12345678910111213141516171819202122 12345678910111213141516171819202122 Genomic mapping Genomic mapping

25 GEO - patient samples - microarray data (HG-U133 Plus 2.0) 25 CCLE - cell lines - microarray data (HG-U133 Plus 2.0)

1 1 15 15

5 5 0 0

–5 –5 Weight in consensus source Weight in consensus source –15 –15 Extreme-valued region indicator Extreme-valued region indicator –1 –1

–25 –25 12345678910111213141516171819202122 1 2 3 4 5 6 7 8 9 10 11 12 13141516171819202122 Genomic mapping Genomic mapping

25 25 TCGA - patient samples - RNA-seq data GDSC - cell lines - microarray data (HG-U219)

1 1 15 15

5 5 0 0

–5 –5 Weight in consensus source –15 Weight in consensus source –15

Extreme-valued region indicator –1 –1 Extreme-valued region indicator

–25 –25 12345678910111213141516171819202122 1 2 3 4 5 6 7 8 9 10 11 12 13141516171819202122 Genomic mapping Genomic mapping

25 25 TCGA - patient samples - RNA-seq data GDSC - cell lines - microarray data (HG-U219)

1 1 15 15

5 5 0 0 –5 –5 Weight in consensus source –15 Weight in consensus source –15 Extreme-valued region indicator –1 Extreme-valued region indicator –1

–25 –25 12345678910111213141516171819202122 1 2 3 4 5 6 7 8 9 10 11 12 13141516171819202122 Genomic mapping Genomic mapping

2 NATURE COMMUNICATIONS | (2020) 11:715 | https://doi.org/10.1038/s41467-020-14605-5 | www.nature.com/naturecommunications NATURE COMMUNICATIONS | https://doi.org/10.1038/s41467-020-14605-5 ARTICLE

Fig. 1 Data acquisition and decomposition of gene expression profiles. a CNAs can promote tumor progression via altering expression levels of genes located at the affected genomic regions. However, due to transcriptional adaptation mechanisms, changes in gene copy number at the genomic level do not always translate proportionally into altered mRNA expression levels. Unraveling the degree of transcriptional adaptation to CNAs for all genes will greatly contribute to our knowledge how CNAs drive tumor progression. b Number of gene expression profiles collected from GEO, TCGA, CCLE, and GDSC. c Identification of underlying regulatory factors of the mRNA transcriptome. We hypothesized that the observed gene expression in a gene expression profile is the result of (i) the effect of underlying regulatory factors (i.e., source signals) on expression levels of individual genes and (ii) the activity of these underlying regulatory factors in a complex biopsy (i.e., mixing matrix). ICA was used to capture the number and nature of these underling regulatory factors for all four datasets separately. This resulted in estimated sources, representing the effects of independent underlying regulatory factors on the expression levels of individual genes, and a mixing matrix reflecting the activity of each estimated source in each gene expression profile. ICA was run 25 times with random initialization, followed by consensus sources estimation using a credibility index ≥50%. d Examples of CNA-CESs harboring a pattern in which only genes mapping to a specific contiguous genomic region had a high absolute weight. The red line shows whether genomic regions were marked as having a significant number of genes with a high absolute weight by the detection algorithm (i.e., extreme-valued region indicator).

which combines paired genomic and gene expression profiles of description of preprocessing, normalization and quality control is tumor samples to explore associations between CNAs and gene provided in the Supplementary Note 1. expression levels. However, we previously demonstrated that this approach is difficult to use for detecting the effects of CNAs on fi 9 Independent sources capture transcriptional effects of CNAs. gene expression levels in a gene expression pro le . This is fi because gene expression profiling is often performed on biopsies In tumor gene expression pro ling, the net measured expression comprising tumor cells and non-tumor cells from the tumor level of an individual gene is determined by the combined effects of microenvironment. A gene expression profile therefore measures various transcriptional regulatory factors, including experimental, the average expression of all cell types present in tumor biopsies. genetic (e.g., CNAs), and non-genetic factors. To identify the effects of these factors on gene expression levels, we first performed This means that the effects of CNAs on gene expression levels, 12 besides being influenced by experimental and other non-genetic consensus-independent component analysis (consensus-ICA) on factors, are often overshadowed by the effects of non-tumor each of the four datasets separately (Fig. 1c, see Supplementary 10,11 Note 1). Consensus-ICA is a computational method to separate cells . Therefore, to accurately determine the degree of tran- fi scriptional adaptation to CNAs using genetical genomics, large gene expression pro les into additive consensus estimated sources numbers of paired genomic and gene expression profiles of tumor (CESs), so that each CES is statistically as independent from the samples are required. Unfortunately, few large-scale genetical other CESs as possible. We hypothesized that each CES describes genomic datasets are available, and many of the publicly acces- the effect of a latent transcriptional regulatory factor on gene sible gene expression profiles from tumor samples do not have a expression levels. In every CES, each gene has a weight that fi describes how strongly and in which direction its expression level is paired genomic pro le. These samples are therefore currently fl excluded from genetical genomics analyses. Despite the current in uenced by a latent transcriptional regulatory factor. Ultimately, efforts in genetical genomics, the degree of transcriptional consensus-ICA resulted in 855 CESs in the GEO dataset, 1383 adaptation to CNAs thus remains unclear for most genes. CESs in the TCGA dataset, 467 CESs in the CCLE dataset, and 466 Unraveling the degree of transcriptional adaptation to CNAs in CESs in the GDSC dataset. In all four datasets, many CESs showed a consistent pattern in a genome-wide fashion would greatly enhance our knowledge of fi how CNAs drive tumor progression. We therefore develop a new, which only genes mapping to a speci c contiguous genomic platform-independent method: transcriptional adaptation to CNA region had a high absolute weight (Fig. 1d). We hypothesized profiling (TACNA profiling). TACNA profiling extracts the effects that these CESs (referred to as CNA-CESs) capture the effects of CNAs on gene expression levels. Using a detection algorithm of CNAs on gene expression levels from a single gene expression fi profile—generated from a tumor biopsy—without the need for a (see Supplementary Note 1), we identi ed 242/855 (28%) of paired genomic CNA profile. Using this method, we define the CNA-CESs in the GEO dataset containing such a pattern, 447/ landscape of transcriptional effects of CNAs in >34,000 expression 1383 (32%) in the TCGA dataset, 186/467 (40%) in the CCLE profiles. To demonstrate this landscape’s utility, we determine dataset and 175/466 (38%) in the GDSC dataset. In these CNA- which genes and biological pathways affected by CNAs might be CESs, 97% of genes present in the GEO dataset, 97% in the TCGA associated with tumor-infiltrating immune cell composition. These dataset, 95% in the CCLE dataset, and 94% in the GDSC dataset were located in at least one genomic region identified by the insights could ultimately contribute to the development of new fi therapeutic strategies. detection algorithm as having a signi cant number of genes with a high absolute weight (Supplementary Data 5). These high percentages indicate that almost all genes are affected by CNAs in Results our datasets. Data acquisition. We collected gene expression data of 34,494 samples from four public repositories: Gene Expression Omnibus (GEO, n = 21,592, generated with Affymetrix HG-U133 TACNA profiles highly correlate with SNP-derived profiles.To Plus 2.0), The Cancer Genome Atlas (TCGA, n = 10,817, gener- continue testing the above hypothesis, we developed a method ated with RNA-seq), Cancer Cell Line Encyclopedia (CCLE, n = called transcriptional adaptation to CNA profiling (TACNA 1067, generated with Affymetrix HG-U133 Plus 2.0), and Geno- profiling—see Supplementary Note 1). In TACNA profiling, the mics of Drug Sensitivity in Cancer (GDSC, n = 1018, generated gene expression profile of an individual sample is reconstructed with Affymetrix HG-U219) (Fig. 1b). In total, 2085 expression using only CNA-CESs. By including only these CNA-CESs, we profiles were generated from cell line samples, 28,200 from patient- automatically excluded other CESs that capture non-genetic fac- derived tumor biopsies, and 4209 from patient-derived tissue tors that can affect gene expression levels from our pipeline for biopsies of normal tissue. The patient-derived tumor gene TACNA profiling (see Supplementary Fig. 1 and Supplementary expression profiles collected from GEO and TCGA represented Note 1). Weights of genes mapping to a contiguous genomic 30 tumor types in total (Supplementary Data 1–4). A detailed region in CNA-CESs marked by the detection algorithm were

NATURE COMMUNICATIONS | (2020) 11:715 | https://doi.org/10.1038/s41467-020-14605-5 | www.nature.com/naturecommunications 3 ARTICLE NATURE COMMUNICATIONS | https://doi.org/10.1038/s41467-020-14605-5

a Transcriptional adaption to CNA-profiling Consensus sources TACNA-profiles š š š y y y y y 11 12 1m Consensus mixing matrix 11 12 11 12 1n m = Number of consensus sources 021 š22 š2m y21 y22 y21 y22 y2n n = Number of samples a a a a 0 š 0 1112 12 1n y y y y y p = Number of probes/genes 31 32 3m 31 32 31 32 3n a a a a š = Scalar for probe/gene p in consensus source m š š 0 21 2222 2n y y y y y pm 41 42 4m 41 42 41 42 4n 0 a š a š a = y o = Zeroed scalar for probe/gene p in consensus source m that × = 21 12 22 22 2m m2 22 pm š 0 0 y y y y y outside a contiguous genomic region with extreme-valued 51 52 5m 51 52 51 52 5n

scalars am1 a m2 am2 amn

amn = Weight for sample n for consensus source m š š š y y y y y Ypn = TACNA signal for probe/gene p in sample n p1 p2 pm p1 p2 p1 p2 pn

TACNA profile CNA profile b 3 5 3 5 TCGA - individual patient sample TCGA - individual patient sample

3 3

1 1 1 1

–1 –1 Copy number TACNA signal TACNA TACNA signal TACNA Copy number –1 –1

–3 –3

–3 –5 –3 –5 12345678910111213141516171819202122 12345678910111213141516171819202122 Genomic mapping Genomic mapping 3 5 CCLE - individual cell line sample 3 5 CCLE - individual cell line sample

3 3

1 1 1 1

–1 –1 Copy number Copy number TACNA signal TACNA TACNA signal TACNA –1 –1

–3 –3

–3 –5 –3 –5 12345678910111213141516171819202122 12345678910111213141516171819202122 Genomic mapping Genomic mapping 10 5 GDSC - individual cell line sample 10 5 GDSC - individual cell line sample

6 3 Tumor samples Normal samples 6 3

2 1 2 1

–2 –1 –2 –1 Copy number Copy number TACNA signal TACNA TACNA signal TACNA

–6 –3 –6 –3

–10 –5 –10 –5 12345678910111213141516171819202122 12345678910111213141516171819202122 Genomic mapping Genomic mapping

cd1st quartile of r [0–0.452) 2nd quartile of r [0.452–0.684) 3rd quartile of r [0.684–0.765) 4th quartile of r [0.765–1] 1000 0.5 TCGA Tumor samples Normal samples TCGA Tumor samples Normal samples Ovarian serous cystadenocarcinoma (n = 304) Lung squamous cell carcinoma (n = 498) 800 Lung adenocarcinoma (n = 514) Triple-negative breast carcinoma (n = 142) 600 −0.25 0.00 0.25 −0.25 0.00 0.25 Uterine carcinosarcoma (n = 56) 0.25 Cervical squamous cell carcinoma (n = 294) Count 400 Glioblastoma multiforme (n = 160)

Variance of CNA-profile Testicular germ cell carcinoma (n = 156) 200 Rectal adenocarcinoma (n = 94) Esophageal carcinoma (n = 184) 0 0 ER−/HER2+ breast carcinoma (n = 43) −0.50 −0.25 0.00 0.25 0.50 0.75 1.00 −0.25 0.00 0.25 0.50 0.75 1.00 Adrenocortical carcinoma (n = 77) Correlation between TACNA- and CNA-profile Correlation between TACNA- and CNA-profile 100 0.5 Clear cell renal cell carcinoma (n = 526) CCLE CCLE Cholangiocarcinoma (n = 36) Chromophobe renal cell carcinoma (n = 66) 80 Head and neck squamous cell carcinoma (n = 516) ER+/HER2− breast carcinoma (n = 544) 60 Hepatocellular carcinoma (n = 366) 0.25 ER+/HER2+ breast carcinoma (n = 138) Count 40 Papillary renal cell carcinoma (n = 288)

Variance of CNA-profile Pheochromocytoma and paraganglioma (n = 166) 20 Breast carcinoma, unknown subtype (n = 204) Mesothelioma (n = 87) 0 0 Skin cutaneous melanoma (n = 470) −0.50 −0.25 0.00 0.25 0.50 0.75 1.00 −0.25 0.00 0.25 0.50 0.75 1.00 ER−/HER2−/PR+ breast carcinoma (n = 11) Correlation between TACNA- and CNA-profile Correlation between TACNA- and CNA-profile Urothelial carcinoma (n = 406) 100 0.5 GDSC GDSC Sarcoma (n = 259) Colon adenocarcinoma (n = 451) 80 Gastric adenocarcinoma (n = 413) Lower grade glioma (n = 527) 60 Pancreatic adenocarcinoma (n = 178) 0.25 Uveal melanoma (n = 80) Count 40 Diffuse large B-cell lymphoma (n = 48) Uterine corpus endometrial carcinoma (n = 362) Variance of CNA-profile 20 Thymoma (n = 119) Prostate adenocarcinoma (n = 492) 0 0 Thyroid carcinoma (n = 504) −0.50 −0.25 0.00 0.25 0.50 0.75 1.00 −0.25 0.00 0.25 0.50 0.75 1.00 Acute myeloid leukemia (n = 166) Correlation between TACNA- and CNA-profile Correlation between TACNA- and CNA-profile Normal (n = 657)

0.0 0.2 0.4 0.6 0.8 1.0 Proportion

Fig. 2 Transcriptional adaptation to CNA profiling (TACNA profiling). a Transcriptional adaptation to CNA (TACNA) profiling. For each dataset, weights of genes mapping to a contiguous genomic region in CNA-CESs marked by the detection algorithm were retained in the CES matrix. Weights of genes that mapped outside marked genomic regions were instead set to zero. Next, TACNA profiles were calculated as the product between this transformed CES matrix and the consensus mixing matrix. b Examples of TACNA profiles illustrating that the resulting patterns clearly matched those in their paired, independently generated CNA profiles. c Left panel: distribution of Pearson correlations between TACNA profiles and paired CNA profiles for the TCGA, CCLE, and GDSC datasets. Right panel: variance observed in CNA profiles versus the Pearson correlation between CNA profiles and paired TACNA profiles for the TCGA, CCLE, and GDSC datasets. d Quartiles distribution plot of Pearson correlations between TACNA profiles and paired CNA profiles, per tumor type in the TCGA dataset. ER: estrogen ; HER2: human epidermal growth factor receptor 2; PR: . retained in the CNA-CES matrix. Weights of genes that mapped derived from SNP arrays, for a subset of samples in the TCGA outside marked genomic regions in every CNA-CES were set to dataset (n = 10,620), CCLE dataset (n = 1011) and the GDSC zero. Next, TACNA profiles were calculated as the product dataset (n = 962). Examples of TACNA profiles showed that the between this transformed CNA-CES matrix and the consensus observed patterns clearly matched those in their paired, mixing matrix (Fig. 2a). independently generated CNA profiles (Fig. 2b). Strong correla- We assumed that the resulting TACNA profile would show the tions were found between TACNA profiles and paired CNA effect of CNAs present in the genome on gene expression levels of profiles in all three datasets (Fig. 2c). Low correlations were found that sample, and thus expected to see a clear correlation between for a subset of samples in the TCGA dataset mainly representing paired TACNA and CNA profiles. We retrieved CNA profiles, normal tissue, which generally does not contain CNAs (see inset

4 NATURE COMMUNICATIONS | (2020) 11:715 | https://doi.org/10.1038/s41467-020-14605-5 | www.nature.com/naturecommunications NATURE COMMUNICATIONS | https://doi.org/10.1038/s41467-020-14605-5 ARTICLE

Fig. 2c). As expected, we found lower correlations between paired between two CNA-CESs in the GEO dataset and TCGA dataset in TACNA and CNA profiles in the TCGA dataset for patient- which the oncogene KRAS had a high weight (Fig. 3b, r = 0.66). derived samples belonging to the more point-mutation-driven Moreover, we performed TACNA profiling using 570 samples tumor types (e.g., acute myeloid leukemia) than for samples from from the TCGA that were subjected to both RNA sequencing and tumor types that are more copy-number-driven (e.g., ovarian microarray profiling (Affymetrix HG-U133A). Paired TACNA cancer) (Fig. 2d)13. In a subset of patient-derived acute myeloid profiles generated with RNA-seq and microarray data were highly leukemia samples in the GEO dataset for which karyotype correlated (Supplementary Fig. 5, mode of distribution r = 0.66). information was provided online (n = 189) we also observed In addition, TACNA profiling was robust to both cross-validation clearly matching patterns between TACNA profiles and the within one platform (Supplementary Figs. 6 and 7, mode of dis- expected karyotype in these samples (Supplementary Fig. 2). tribution r range = 0.45–0.55) and cross-platform cross-validation We obtained gene expression profiles of an experimental model (Supplementary Fig. 8, mode of distribution r range = 0.50–0.61, of aneuploidy introduced in HCT116 cells14. Differences on the see Supplementary Note 1 for details). mRNA level and TACNA level were calculated between the These results show that TACNA profiling can extract similar parental HCT116 cells (n = 3) and an isogenic model with patterns from gene expression profiles generated with different chromosome 5 tetrasomy (n = 3) (Supplementary Fig. 3a and 3b). platforms. For genes located on chromosome 5, we observed larger signal to noise ratio in differences of TACNA expression levels compared with regular mRNA expression levels (Supplementary Fig. 3c). The degree of TACNA is associated with biological processes. We did not observe a correlation between the log2 fold change in Next, we determined whether the patterns observed in CNA- mRNA expression of individual genes and changes in TACNA CESs were associated with biological processes rather than being expression levels (r = 0.04). We have previously shown that the mere mathematical artifacts. For each CNA-CES we transformed effects of CNAs on gene expression levels are often overshadowed the weights of genes mapping to the marked genomic region to a by other (non-genetic) effects that are not associated with specific metric ranging from zero to one, where zero corresponds to the single CNAs9. For example, Stingele et al. showed that aneuploidy highest absolute weight (i.e., low degree of transcriptional adap- invokes a general cellular mRNA response which is not the result tation to CNAs) and one corresponds to the lowest absolute of any specific CNAs but only the presence of aneuploidy14. weight (i.e., high degree of transcriptional adaptation to CNAs). In addition, tumor cells in vivo may develop transcriptional Subsequently, all genes mapping to marked genomic regions in adaptive mechanisms to survive under selective pressure, which the CNAs-CESs were pooled and ranked according to their may be different from the adaptive mechanisms observed in vitro. metric. We then performed gene set enrichment analysis (GSEA) We conducted expression quantitative trait loci (eQTL) on this ranked gene list using 12 gene set databases from the analyses using the mRNA expression profiles, the TACNA Molecular Signatures Database (MSigDB)16. The analysis showed profiles and the CNA profiles (derived from SNP arrays) from strong biological enrichment for all databases, with high con- the TCGA dataset (n = 10,620). In general, we observed a cordance between GSEA results obtained with GEO and TCGA stronger correlation between TACNA profiles and CNA profiles metrics (r ranging between 0.86 and 0.97, Supplementary Data 8). compared with the correlation between mRNA expression Genes with a lower degree of transcriptional adaptation were profiles and CNA profiles on the gene level (Supplementary highly enriched for gene sets involved in proliferation (Fig. 3c). Fig. 4). This indicates that TACNA profiling increases statistical Conversely, genes with a higher degree of transcriptional adap- power to detect eQTLs, i.e., associations between CNAs and tation were highly enriched for immune-related gene sets. This mRNA expression levels. indicates that the expression levels of genes involved in immune We observed that the borders of marked genomic regions in signaling are less affected on average when CNAs occur in their 82/242 (33%) CNA-CESs in the GEO dataset and 141/447 (32%) genomic region. The opposite effect was seen for genes involved CNA-CESs in the TCGA dataset colocalized with a stringent set in signaling pathways important for cell proliferation. of common fragile sites on the (Supplementary We observed that multiple genomic regions were enriched for Data 6)15. Although the location of most of these common fragile genes having a higher or lower degree of transcriptional adaptation sites have been mapped using low-resolution methods, these (Supplementary Fig. 9a and Supplementary Data 9). Epigenetic results are compatible with the notion of genomic instability mechanisms such as DNA methylation are known to affect underlying structural abnormalities such as CNAs. transcription levels of individual genes17,18. When we explored Taken together, these results strongly suggest that TACNA available DNA methylation data for a subset of samples from the profiles capture the effect of CNAs at the gene expression level. TCGA dataset (n = 9317), we observed that for a subset of genes their TACNA expression levels correlated with their mean methylation levels (r range = −0.65–0.45, Supplementary Fig. 9b). The inferred degree of TACNA is concordant across platforms. These results indicate that DNA methylation could be one of the Although strong correlations were found between paired TACNA underlying mechanisms driving the degree of transcriptional and CNA profiles, not all genes mapping to a marked genomic adaptation to CNAs. region in a CNA-CES had equally high weights. We hypothesized The function of complexes might depend on the that the absolute weight of an individual gene in a marked correct proportional levels of protein subunits19. When CNAs genomic region of a CNA-CES represents how much its expres- occur, cancer cells might transcriptionally adapt the mRNA sion level is affected by the underlying CNA: a lower absolute expression of genes that encode protein subunits to maintain weight corresponds with a higher degree of transcriptional adap- their ratios at the protein level. However, in the TCGA dataset we tation. A pair-wise comparison of the weights of genes mapping to observed very little enrichment of genes having a high degree of marked genomic regions in CNA-CESs between the four datasets transcriptional adaptation in 304 protein complexes with 5 or showed that many CNA-CESs strongly correlated with another more members as defined by the CORUM protein complex CNA-CES in each of the three remaining datasets (Fig. 3a and definitions (see Supplementary Note 1 and Supplementary Supplementary Data 7). This indicates that the specific patterns Data 10). We did observe for 18 out of 304 protein complexes observed in many CNA-CESs are robust and independent of (permutation P value < 0.05) that patient-derived cancer cells platform or dataset. For example, we found a high correlation might coordinately adjust the mRNA expression levels of some

NATURE COMMUNICATIONS | (2020) 11:715 | https://doi.org/10.1038/s41467-020-14605-5 | www.nature.com/naturecommunications 5 ARTICLE NATURE COMMUNICATIONS | https://doi.org/10.1038/s41467-020-14605-5

CCLE 39% CCLE 31% a (73/186) (58/186)

GEO to TCGA 47.52% TCGA to GEO 25.95% GEO 57% GEO 49% (138/242) (119/242) 75 75 ) )

GDSC 44% P GDSC 31% P ( 50 ( 50 (77/175) 10 (55/175) 10 −log −log

25 25

0 0 0.00 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00 TCGA 26% TCGA 29% Spearman r Spearman r (118/447) (131/447) CCLE 55% CCLE 45% (102/186) (84/186)

CCLE to GDSC 42.47% GDSC to CCLE 44.57% GEO 30% GEO 31% (73/242) (76/242) 100 100 ) ) P P

GDSC 47% ( GDSC 58% ( (77/175) 10 (102/175) 10 −log 50 −log 50

0 0 0.00 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00 TCGA 13% TCGA 12% Spearman r Spearman r (58/447) (54/447) Spearman r

0 0.5 0.625 0.75 0.875 1

b Spearman r = 0.66

Weight in consensus source Extreme-valued region indicator 40.0 40.0 GEO - consensus estimated source 9 TCGA - consensus estimated source 751 1.00 KRAS KRAS 1.00

20.0 20.0

0.00 0.0 0.00 0.0 Weight in consensus source Weight Weight in consensus source Weight Extreme-valued region indicator Extreme-valued region indicator

–20.0 –20.0 12345678910111213141516171819202122 12345678910111213141516171819202122

Genomic mapping Genomic mapping

c Enriched for genes having a lower degree of transcriptional adaptation Enriched for genes having a higher degree of transcriptional adaptation

TCGA

GEO s s s s y y n n a g p g p g e n n B i i i m ll o ng ng n κ si sm sm i - li li li ace y li lis li ow f om ce - o o o pox d latio is nalin nalin repair β b b y col

unct g gna g gna gna ab j i i A y onse u Hy l t

poptos l si ogenes rox p N s i e transition A response response G

β eas h onse D T R signaling ca r γ i Pe α Myogenesis ng og s dipogenesis m meta T5 signaling p targets OS pathwa c pical sur signaling u F- O M Coagulation p c meta h A A A Complement pathway i otc n id R A S T E A V res G TA a rotein secretion N ot Mitotic spindle ORC1 si ge res U acid IFN- targets V1 MYC targets V2 eme metabolism P en response late

P IFN- d T T RA bi -catenin si Spermatogenesis g G2/M checkpoint H e β K m UV Allograft rejection / Androgen response H L-2/S signaling via NF atty ac Bile I eno nt α KRAS signaling dow stro F X strogen response earl holesterol homeostasis E W xidative phosphor E C Inflammatory response nfolded protein response I3K/AKT/m NF- O U P T IL-6/JAK/STAT3 signaling IL-6/JAK/STAT3

Proliferation-related gene sets Immune-related gene sets P = 1 × 10–50 P = 1 × 10–10 P = 0.0001 P = 0.05

Fig. 3 Degree of transcriptional adaptation to CNAs. a Citrus plots showing the Spearman correlations between CNA-CESs in the reference dataset, depicted in bold, with CNA-CESs in the other three datasets. Blue lines indicate r > 0.5. Correlations are calculated based on the weights of genes in marked genomic regions of the CNA-CESs under investigation. Each scatterplot with marginal histograms shows the correlations versus their −log10 transformed P values. The inset shows correlations > 0.5 having a P value < 0.05. b Example of a CNA-CESs in the GEO dataset that is highly correlated with a CNA-CES in the TCGA dataset with KRAS having a low degree of transcriptional adaptation to CNAs. Spearman correlation coefficient was derived using the genes mapping to the extreme-valued region from either of the CNA-CESs (n = 38). c Enrichment results using two-sided Welch’s t-test for the MSigDB Hallmark collection. A yellow bubble indicates enrichment for genes with a high degree of transcriptional adaptation to CNAs, and a blue bubble indicates a low degree. The size of the bubble corresponds to the significance level. Only CNA-CESs having at least 50 genes in their marked region were included. EMT: epithelial-mesenchymal transition; ROS: reactive oxygen species. protein complex subunits in response to the occurrence of CNAs marked genomic region in only one CNA-CES (Supplementary (Supplementary Fig. 10 and Supplementary Data 11). Data 12). The degree of transcriptional adaptation of these genes (n = 7641) highly correlated between the GEO and TCGA dataset (Fig. 4a, r = 0.84). Of these 7641 genes, 786 (10.2%) had low Most genes show moderate to high adaptation to CNAs. adaptation (metric < 0.25), 3898 (51.0%) moderate adaptation In both the GEO and TCGA dataset, most genes mapped to a (metric 0.25–0.75), and 2957 genes (38.7%) high adaptation

6 NATURE COMMUNICATIONS | (2020) 11:715 | https://doi.org/10.1038/s41467-020-14605-5 | www.nature.com/naturecommunications NATURE COMMUNICATIONS | https://doi.org/10.1038/s41467-020-14605-5 ARTICLE

a c Average degree of transcriptional adaptation 0.0 0.2 0.4 0.6 0.8 1.0

RAF1 NUP98 KAT7 TRRAP RAC1 Spearman r = 0.84 BRD3 1.00 IDH2 MAP2K1 KNSTRN TRIM27 MDM2

ABL1 JAK2 0.75 HRAS ELK4 BIRC6

ATF1 TAF15 STAT6 KMT2A DDX6 NCOA2 0.50 STAT3 MLLT10 AFDN NRAS BCL9 MDM4 HNRNPA2B1 SKI NPM1 0.25 CHD4 GNAQ

DDIT3 PLAG1 DDX5 ERBB3 marked genomic region in only one CNA-CES (TCGA) CCND3

Degree of transcriptional adaptation genes mapping to a WWTR1 HMGA1 0.00 TFEB GNAS FOXP1 0.00 0.25 0.50 0.75 1.00 CTNNB1 SYK CRTC1 Degree of transcriptional adaptation of genes mapping to a ETV4 SGK1 marked genomic region in only one CNA-CES (GEO) FGFR4 PDCD1LG2 RET CDK4 CACNA1D b ACKR3 200 Low Moderate High HOXC13 FSTL3 n = 786 n = 3898 n = 2957 HOXC11 IL6ST HOXD11

AKT3 BCL2 MUC4 TLX1 150 FGFR2 LPP HOXD13 MYB ZNF521 CARD11 ALK CD74 CCND1 PRDM16 100 MYC MITF GLI1 NR4A3 HOXA13

Number of genes PDGFB LMO2 LCK PDGFRB CSF1R 50 KCNJ5 FLI1 LMO1 CD79A MSI2 FLT3 IL7R TLX3 TCL1A FLT4 0 NUTM1 FOXR1 CYSLTR2 0.0 0.2 0.4 0.6 0.8 1.0 CD79B Average degree of transcriptional adaptation

Fig. 4 Degree of transcriptional adaptation to CNAs per individual gene. a Degree of transcriptional adaptation to CNAs for individual genes mapping to a marked genomic region in a single CNA-CES in both the GEO and TCGA dataset (n = 7641). b Distribution of the average degree of transcriptional adaptation to CNAs for genes mapping to a marked genomic region in a single CNA-CES in both the GEO and TCGA dataset. c Average degree of transcriptional adaptation to CNA for a set of oncogenes obtained from the Catalogue of Somatic Mutations in Cancer Gene Census.

(metric > 0.75, Fig. 4b). These results suggest that for most genes transcriptional adaptation in a set of oncogenes obtained from the the transcriptional effect of an underlying CNA at their genomic Catalogue of Somatic Mutations in Cancer Gene Census (Fig. 4c position is largely buffered by non-genetic adaptive mechanisms. and Supplementary Data 13)21. Whereas some oncogenes in this Amplification of oncogenes and the resulting enhanced gene set had a low degree of transcriptional adaptation (e.g., RAF1, expression levels have been implicated in the progression of metric = 0.06), most showed a relatively high degree of many cancers20. We observed a large variability in the degree of transcriptional adaptation to CNAs (e.g., MYC, metric = 0.86).

NATURE COMMUNICATIONS | (2020) 11:715 | https://doi.org/10.1038/s41467-020-14605-5 | www.nature.com/naturecommunications 7 ARTICLE NATURE COMMUNICATIONS | https://doi.org/10.1038/s41467-020-14605-5

Pancancer Chromosome a 0.75 c GEO Positive Negative 0.50

0.25

0.00

–0.25

Spearman correlation r = 0.72 TCGA 0.50

0.25 Average TACNA50 expression TACNA50 Average

0.00

–0.25

–0.50 1 2 3 4 5 6 7 8 9 10 11 12 13141516171819202122

Genomic mapping

b GEO Triple-negative breast carcinoma (n = 146) Adrenocortical carcinoma (n = 79) Colorectal adenocarcinoma (n = 573) Ovarian carcinoma (n = 307) ER+/HER2+ breast carcinoma (n = 140) ER+/HER2– breast carcinoma (n = 554) ER–/HER2+ breast carcinoma (n = 43) Glioblastoma multiforme (n = 166) Renal papillary cell carcinoma (n = 291)

Cholangiocarcinoma (n = 36) 8150 unrelated primary tumor samples (TCGA) Hepatocellular carcinoma (n = 373) Cutaneous melanoma (n = 472) Gastric adenocarcinoma (n = 415) Esophageal carcinoma (n = 185)

TCGA Lung adenocarcinoma (n = 517) Uveal melanoma (n = 80) Diffuse large B−cell lymphoma (n = 48) Cervical caricnoma (n = 306) Prostate carcinoma (n = 498) Renal clear cell carcinoma (n = 534) Renal chromophobe carcinoma (n = 66) Urothelial carcinoma (n = 408) Thyroid carcinoma (n = 509) Pancreatic carcinoma (n = 179) HNSCC (n = 522) Lower grade glioma (n = 530) Acute myeloid leukemia (n = 173) = 4) = 37) = 37) = 40) = 97) = 81) = 39) = 62) n = 97) = 737) = 752) = 506) = 456) = 225) = 364) = 332) = 390) = 398) = 308) = 158) = 187) = 106) n n n n n n n n = 386) = 1678) = 2710) = 2604) n n n n n n n n = 1019) n n n n n n 1 3 5 7 9 11 13 15 17 19 21 n n n n 2 4 6 8 10 12 14 16 18 20 22

Amplification HNSCC Spearman correlation R

HNSCC ( Deletion

–1 0 1 Uveal melanoma breast carcinoma Thyroid carcinoma Ovarian carcinoma Cervical carcinoma – Prostate carcinoma Lower grade glioma Urothelial carcinoma Cholangiocarcinoma Thyroid carcinoma ( Pancreatic carcinoma Uveal melanoma ( Cutaneous melanoma Lung adenocarcinoma Cervical caricnoma ( Esophageal carcinoma Cholangiocarcinoma ( Acute myeloid leukemia Ovarian carcinoma ( Glioblastoma multiforme Gastric adenocarcinoma Urothelial carcinoma ( Lower grade glioma ( Prostate carcinoma ( Adrenocortical carcinoma Hepatocellular carcinoma Pancreatic carcinoma ( /HER2+ breast carcinoma Renal clear cell carcinoma Esophageal carcinoma ( – Colorectal adenocarcinoma Cutaneous melanoma ( Lung adenocarcinoma ( Renal papillary cell carcinoma Diffuse large B-cell lymphoma Adrenocortical carcinoma ( ER+/HER2+ breast carcinoma Glioblastoma multiforme ( ER ER+/HER2 Acute myeloid leukemia ( Gastric adenocarcinoma ( Renal chromophobe carcinoma Hepatocellular carcinoma ( Triple-negative breast carcinoma Renal clear cell carcinoma ( Renal papillary cell carcinoma ( Colorectal adenocarcinoma ( Renal chromophobe carcinoma ( ER–/HER2+ breast carcinoma ( ER+/HER2+ breast carcinoma ( Diffuse large B−cell lymphoma ( ER+/HER2– breast carcinoma ( Triple-negative breast carcinoma ( Fig. 5 The landscape of transcriptional effects of CNAs in a large set of cancer samples. a Average pan-cancer TACNA profiles in GEO and TCGA dataset. Spearman correlation coefficient was derived using the 15,389 genes which were present in both datasets. b Distance matrix of Spearman correlations between average TACNA profiles in the GEO dataset and TCGA dataset for overlapping tumor types. The size and transparency of a square corresponds to the absolute correlation coefficient. HNSCC: head and neck squamous cell carcinoma. c Hierarchical clustering of the landscape of transcriptional effects of CNAs in the TCGA dataset for 8150 cancer samples of tumor types also present in the GEO dataset.

This suggests that also transcription of many oncogenes is under glioblastoma multiforme, and a 1p/19q deletion in low-grade the strong control of non-genetic regulatory factors that buffer gliomas (Supplementary Figs. 11b and c)25,26. The landscape the effects of CNAs. of transcriptional effects of CNAs can be explored at www. genomicinstability.org (Supplementary Fig. 11d). Hierarchical clustering of the landscape of transcriptional Transcriptional effects of CNAs in >21,000 cancer samples. effects of CNAs in the GEO and TCGA dataset showed clear TACNA profiling was applied to samples from overlapping differences and commonalities between tumor types (Fig. 5c and tumor types in the GEO dataset (n = 13,180) and TCGA dataset Supplementary Figs. 12–21), indicating that tumor types share (n = 8150) to define the landscape of transcriptional effects of strong transcriptional effects of CNAs driving biological pathways CNAs. This showed a high correlation between the average pan- relevant for their tumor behavior. The landscape of transcrip- cancer TACNA profiles in the GEO and TCGA dataset (r = 0.73, tional effects of CNAs can enable transcriptomic-wide association Fig. 5a). These average pan-cancer patterns of transcriptional studies (TWASs) to generate strong hypotheses on genes and effects of CNAs closely resemble average pan-cancer copy-num- biological pathways affected by CNAs that might be causally ber patterns reported in previous studies22,23. As expected, high linked to tumor phenotypes. correlations between the average GEO and TCGA TACNA pro- files of copy-number-driven tumors such as triple-negative breast carcinoma (r = 0.80), colorectal adenocarcinoma (r = 0.74), TACNA profiles are associated with immune composition. and ovarian carcinoma (r = 0.73) were observed (Fig. 5b and Recently, a higher CNA burden was found to be negatively Supplementary Data 14). associated with an expression-based metric describing CD8+ T While exploring the landscape of transcriptional effects of cell activity in the tumor microenvironment23,27. The ability to CNAs, we observed common patterns consistent with well- elicit an anti-cancer immune response with immune checkpoint known genomic alterations. For instance, many renal clear cell inhibitors depends on the composition and activity of immune carcinomas in the TCGA dataset contained a pattern consistent cells present in the tumor microenvironment, especially on the with a deletion in chromosome 3p and an irregular amplification presence of CD8+ T cells28. However, it is still unclear how in chromosome 5q (Supplementary Fig. 11a)24. Likewise, we CNAs affect the activity and composition of immune cells present observed patterns consistent with a deletion in in the tumor microenvironment.

8 NATURE COMMUNICATIONS | (2020) 11:715 | https://doi.org/10.1038/s41467-020-14605-5 | www.nature.com/naturecommunications NATURE COMMUNICATIONS | https://doi.org/10.1038/s41467-020-14605-5 ARTICLE

a 0.6 GEO TCGA n = 100 n = 400 n = 900 n = 1600 n = 2500 b Positively correlated Gene enrichment per cytoband on correlations between TACNA expression levels and CD8+ T cell abundance Negatively correlated 0.4 >100

0.2 75 r 0.0 50 −0.2 Spearman −0.4 25 ) P

−0.6 (

10 0 −0.8 –log 25

HNSCC 50

75 Uveal melanoma Thyroid carcinoma Ovarian carcinoma Cervical carcinoma Prostate carcinoma Lower grade glioma Urothelial carcinoma Cholangiocarcinoma Pancreatic carcinoma Cutaneous melanoma Lung adenocarcinoma Esophageal carcinoma

Acute myeloid leukemia >100 Glioblastoma multiforme Gastric adenocarcinoma Hepatocellular carcinoma Adrenocortical carcinoma 12345678910111213141516171819202122 Colorectal adenocarcinoma Papillary renal cell carcinoma Clear cell renal carcinoma ER−/HER2+ breast carcinoma ER+/HER2+ breast carcinoma ER+/HER2− breast carcinoma Diffuse large B−cell lymphoma Genomic mapping Triple−negative breast carcinoma Chromophobe renal cell carcinoma c Gene Predicted mammalian phenotype Distance r Permuted Z-score

IRF2 IRF2 Decreased number 0.63 12.61 Enlarged 0.66 12.42 Enlarged lymph nodes 0.66 11.86 Decreased IFN-γ secretion 0.62 11.74 LOC90768 Increased leukocyte cell number 0.63 11.63 Decreased T cell number 0.63 10.99 Decreased lymphocyte cell number 0.61 10.52 Decreased CD8+/αβ T cell number 0.63 10.21 ZCCHC6 DAPK1 OSTF1 OSTF1 Enlarged spleen 0.54 10.76 Increased leukocyte cell number 0.53 8.89 Impaired natural killer cell mediated cytotoxicity 0.58 8.67 Abnormal cytokine secretion 0.53 8.54 Decreased CD8+/αβ T cell number 0.52 8.53 Enlarged lymph nodes 0.56 8.46 NINJ1 Abnormal macrophage physiology 0.51 8.46 Decreased T cell proliferation 0.52 8.39 ACSL1 LOC90768 Decreased B cell number 0.48 9.80 NFIL3 Enlarged spleen 0.46 9.09 Decreased T cell number 0.47 8.03 Z-transformed P value Decreased CD8+/αβ T cell number 0.46 7.90 4.98 - Myogenesis 8.68 - G2/M checkpoint 3.35 - Allograft rejection 5.42 - Complement Increased leukocyte cell number 0.45 7.76 4.63 - EMT 8.54 - E2F targets 2.95 - KRAS signaling down 5.13 - IFN-γ response Enlarged lymph nodes 0.44 7.49 Decreased lymphocyte cell number 0.46 7.27 3.71 - UV response down 7.45 - MYC targets V1 2.85 - IFN-γ response 5.01 - Inflammatory response THBD Increased spleen weight 0.42 6.92 3.45 - Angiogenesis 6.63 - DNA repair 2.77 - Inflammatory response 4.58 - Allograft rejection KLF3 3.04 - Apical junction 5.48 - MYC targets V2 2.69 - E2F targets 4.47 - IL2/STAT5 signaling ZCCHC6 Enlarged spleen 0.57 10.83 Increased leukocyte cell number 0.54 9.46 2.88 - Hedgehog signaling 4.86 - Mitotic spindle 2.53 - Spermatogenesis 4.23 - TNF-α signaling via NFκB Increased IgG1 level 0.59 9.17 2.69 - Coagulation 4.25 - MTORC1 signaling 2.44 - Myogenesis 3.96 - IL6/JAK/STAT3 signaling Decreased T cell number 0.54 8.91 αβ 2.10 - Notch signaling 4.20 - Spermatogenesis 2.32 - TNF-α signaling via NFκB 3.66 - Heme metabolism Decreased CD4+/ T cell number 0.51 8.86 Enlarged lymph nodes 0.56 8.80 1.95 - Hypoxia 3.25 - Unfolded protein response 2.27 - G2/M checkpoint 3.16 - IFN-α response ANXA1 Decreased CD8+/αβ T cell number 0.53 8.40 1.92 - Estrogen response early 2.50 - KRAS signaling down 2.25 - MYC targets V1 2.81 - KRAS signaling up Abnormal cytokine secretion 0.58 8.30

Fig. 6 Transcriptional effects of CNAs in relation to inferred immunological phenotypes. a Per tumor type, Spearman correlations between inferred CNA burden and an expression-based metric describing CD8+ T cell activity for each sample. b Manhattan plot showing, the enrichment (−log10 P value on the y-axis) for genes with a strong correlation between TACNA expression level and inferred CD8+ T cell abundance per cytogenetic band defined according to the MSigDB Positional gene sets collection. c Left panel: constructed co-functionality network for the top 350 genes with the highest inverse correlation between their TACNA expression levels and CD8+ T cell abundance. Middle panel: cluster in which genes share strong predicted co-functionality (r > 0.8) within the co-functionality network that was enriched with genes predicted to be involved in immunological processes (e.g., complement, inflammatory response, and IFN-γ response). Right panel: a GBA strategy with >106,000 expression profiles in combination with the gene set collection obtained from the Mammalian Phenotype (MP) Ontology predicted for the genes IRF2, OSTF1, LOC90768, and ZCCHC6 that altered transcription results in a decreased CD8+/αβ T cell number.

We therefore developed a univariate measurement describing guilt-by-association (GBA) strategy utilizing >106,000 expression the CNA burden for each sample (see Methods), which confirmed profiles (see Methods). We identified four big clusters in which the negative association between CNA burden and the genes have strong predicted co-functionality (r > 0.8, Fig. 6c). expression-based metric describing CD8+ T cell activity in both One cluster was enriched with genes predicted to be involved the GEO and TCGA datasets (Fig. 6a, Supplementary Fig. 22 and in biological processes related to tumor progression (e.g., E2F Supplementary Data 15). We observed strong negative correla- targets, G2/M checkpoint, and MYC targets), whereas two tions especially in the prototypical copy-number-driven tumors. clusters were enriched for immunological processes (e.g., allograft To determine how CNAs affect the composition of immune cells rejection, complement, and interferon-γ response). This suggests present in the tumor microenvironment, we inferred the that CNAs can transcriptionally activate multiple processes abundance of 22 immune cell types with CIBERSORT for the that benefit tumor growth, in this case, simultaneously lowering 21,592 cancer samples in the GEO dataset and correlated these to CD8+ T cell abundance (i.e., avoiding immune destruction) and TACNA expression levels (Supplementary Data 16). Genes were promoting tumor cell proliferation. ranked based on this correlation and GSEA was performed with We then predicted the phenotypic effects of altered expression the MSigDB Positional gene sets collection. We identified levels of the genes in the cluster with the strongest enrichment for multiple genomic regions that were enriched for genes having a immunological processes using the GBA strategy in combination highly significant correlation with CD8+ T cell abundance, with the gene set collection obtained from the Mammalian indicating that CNAs occurring at these genomic regions are Phenotype Ontology. We predicted that altered expression levels driving the associations (Fig. 6b and Supplementary Data 17). of four genes (IRF2, OSTF1, LOC90768, and ZCCHC6) would To identify the genes in these enriched genomic regions that indeed result in decreased CD8+/αβ T cell numbers (Fig. 6c). may be linked to CD8+ T cell abundance, we constructed a co- The above data strongly suggest that the identified genes which functionality network for the top 350 genes with highest inverse are transcriptionally affected by CNAs underlie the composition correlation between their TACNA expression levels and CD8+ T and activity of immune cells present in the tumor microenviron- cell abundance. This co-functionality network was generated ment. These genes could potentially serve as novel targets to with an integrative tool that predicts gene functions based on a induce or enhance anti-cancer immune responses.

NATURE COMMUNICATIONS | (2020) 11:715 | https://doi.org/10.1038/s41467-020-14605-5 | www.nature.com/naturecommunications 9 ARTICLE NATURE COMMUNICATIONS | https://doi.org/10.1038/s41467-020-14605-5

Discussion p genes from n samples, ICA resulted in (i) the extraction of i independent In the present study we revealed the degree of transcriptional components (hereafter called estimated sources, ESs), (ii) an ES matrix of adaptation to CNAs in a genome-wide fashion using 34,494 gene dimension i × p, where the weight of each ES represents the direction and mag- fi fi fi nitude of its effect on the expression level of each gene, and (iii) a mixing matrix expression pro les. Using TACNA pro ling, we de ned the (MM) of dimension i × n which contains the coefficients (i.e., indirect activity landscape of transcriptional effects of CNAs and determined how measurements) of ESs in each sample. Within each dataset, principal component these effects might influence the composition of immune cells in analysis was performed on the covariance matrix between samples, after which i the tumor microenvironment. was chosen as the minimum number of top principal components which captured at least 85% of the total variance in the dataset. The inner product between the We observed that most oncogenes in our set had a high degree vector of coefficients in the MM and the vector of ES weights per individual gene of transcriptional adaptation to CNAs. This suggests that ele- results in the original mRNA expression level. vation of expression levels of many oncogenes is only beneficial In ICA, an initial random weight vector with a variance of 1 has to be chosen in for tumor progression up to a certain level; excessive expression order to obtain statistically independent ESs. Hence, varying initial random weight vectors could result in different sets of ESs. To retrieve a set of consensus ESs (or of several oncogenes was previously linked to cell senescence and CESs), we performed 25 ICA runs, each with a different random initialization 29 apoptosis . Perturbations of genes whose expression levels are weight vector. After all ICA runs were completed, ESs with an absolute Pearson r > tightly controlled by non-genetic factors (i.e., have a high degree 0.9 were clustered, where the number of ESs in each cluster could be at most the of transcriptional adaptation to CNAs) could thus result in total number of ICA runs. CESs were computed by obtaining the average of ESs inside each cluster. Next, we calculated a credibility index for each CES by more extreme tumor phenotypes. These genes could therefore be determining the ratio between the number of ESs in its cluster and the total potential therapeutic targets. In addition, our results facilitate number of ICA runs (i.e., 25). CESs with a credibility index of ≥50% were used for future research into the underlying mechanisms of transcrip- obtaining the CES matrix and the consensus MM. We hypothesized that each CES tional adaptation of genes to CNAs, which are currently poorly describes the effect of a latent transcriptional regulatory factor on gene expression understood7,30. levels. fi fi A subset of CESs harbored a pattern in which genes having high absolute TACNA pro ling was applied to gene expression pro les in weights mapped to a specific contiguous genomic region (i.e., CNA-CESs). To two patient-derived datasets and two cell line compendia. These identify these CNA-CESs, we developed and used a detection algorithm, which is expression profiles were generated with three platforms using described in detail in the Supplementary Note 1. either RNA-seq or microarray technology. TACNA profiling proved robust, as shown by the highly reproducible average pan- Transcriptional adaptation to CNA profiles. For each dataset, weights of genes cancer and tumor-type-specific TACNA profiles in the GEO and mapping to a contiguous genomic region in CNA-CESs marked by the detection algorithm were retained in the CES matrix. Weights of genes that mapped outside TCGA datasets. This enables reanalysis of the tens of thousands marked genomic regions were instead set to zero. Next, TACNA profiles were publicly available gene expression profiles generated from various calculated as the product between this transformed CES matrix and the consensus platforms that are lacking a paired CNA profile. This approach MM. TACNA profiles in the GEO dataset and TCGA dataset were adjusted by provides a novel tool to determine how CNAs drive the pro- subtracting the Hodges Lehmann estimate (i.e., robust mean) of the TACNA gressive acquisition of tumor phenotypes, such as the hallmarks expression of genes observed in normal tissue samples in each dataset. This 29 ensured that a TACNA expression value of zero corresponded to a situation where of cancer . a gene has exactly two copies on the genome. After generating the landscape of transcriptional effects of fi CNAs, we identi ed four genes which are associated with reduced Human fragile sites. Previously described genomic locations of aphidicolin- CD8+ T cell abundance in the tumor microenvironment. This is induced fragile sites identified through cytogenic analyses were used to assess the especially of interest as a high number of CD8+ T cells is asso- colocalization of borders of marked genomic regions in CNA-CESs with common ciated with higher response rates to immune checkpoint inhibi- fragile sites15. Genomic coordinates were converted to human genome reference 31–33 GRCh38 using the Batch Coordinate Conversion tool from USCS Genome tors . Targeting these potentially causal genes might increase Browser38. Borders of marked genomic regions in CNA-CESs were assumed to CD8+ T cell infiltration. In combination with immune check- colocalize with common fragile sites when one or both borders were located inside point inhibitors, this might induce or enhance an anti-cancer a common fragile site. immune response, and thereby improve disease outcome. In conclusion, our study provides a new tool to explore the Gene set enrichment analyses. Weights of genes in a marked genomic region of a transcriptional effects of CNAs which can lead to novel biological CNA-CES were transformed to metrics ranging from zero to one, where zero insights into how CNAs drive tumor behavior, and guide the would correspond to the highest absolute weight (i.e., low degree of transcriptional adaptation to CNAs) and one would correspond to the lowest absolute weight (i.e., development of new therapeutic strategies. high degree of transcriptional adaptation to CNAs). When genes occurred in marked regions of multiple CNA-CESs, the lowest metric was used. Enrichment of ’ URLs. The degree of transcriptional adaptation, source code of gene sets was tested according to the two-sample Welch s t-test for unequal var- fi iance. The proportion of false discoveries was controlled using a multivariate TACNA pro ling, landscapes of transcriptional effects of CNAs permutation testing framework with 100 permutations, a 1% false discovery rate per dataset, and additional datasets and results are available at and confidence level of 80%. http://www.genomicinstability.org/. Inferred CNA burden. For each sample in the GEO dataset and TCGA dataset, Methods CNA load was estimated as the sum of the coefficients (i.e., indirect activity measurements) of all CNA-CESs in the MM with at least 50 genes in their marked Data acquisition. Microarray expression data was collected from three public data genomic regions. CNA loads were normalized to a 0–1 range across samples of the repositories: GEO platform GPL570 (generated with Affymetrix HG-U133 Plus same tumor type in each dataset separately. 2.0)34, CCLE (generated with Affymetrix HG-U133 Plus 2.0)35, and GDSC (gen- erated with Affymetrix HG-U219)36. Preprocessing and aggregation of raw expression data was performed using the robust multi-array average algorithm with Expression-based immune metric. A previously defined set of genes describing RMAExpress (version 1.1.0)37. Quality control of the processed expression data is CD8+ T cell and natural killer cell activity (CD2, CD3E, CD247, GZMK, NKG7, described in the Supplementary Note 1. Pre-processed and normalized RNA-seq and PRF1) was used to calculate immune metric scores27. Per sample, the rank data was collected from TCGA using the Broad GDAC Firehose portal. In each position of the mRNA expression levels of each of these genes was calculated. dataset, expression levels for every gene were standardized to a mean of zero and Scores for each sample were determined by calculating the mean rank position of variance of one. In addition, CNA profiles generated with Affymetrix Genome- the seven genes. Immune metric scores were normalized to a 0–1 range across Wide Human SNP Array 6.0 were collected for a subset of samples in the TCGA samples of the same tumor type in each dataset separately. dataset, CCLE dataset and GDSC dataset and processed as described in Supple- mentary Note 1. Inference of immune cell type abundance. Inference of the abundance of 22 immune cell types was performed using CIBERSORT39 and confined to the GEO Identification of CNA-CESs. ICA was used to identify a regulatory model for the dataset. As CNAs generally do not occur in non-tumor tissue, we reconstructed mRNA transcriptome12. For each dataset containing mRNA expression profiles of gene expression profiles using all CESs except CNA-CESs with at least 50 genes in

10 NATURE COMMUNICATIONS | (2020) 11:715 | https://doi.org/10.1038/s41467-020-14605-5 | www.nature.com/naturecommunications NATURE COMMUNICATIONS | https://doi.org/10.1038/s41467-020-14605-5 ARTICLE their marked regions to more accurately capture the effects from the tumor 8. Jansen, R. C. & Nap, J. Genetical genomics: the added value from segregation. microenvironment on gene expression levels. Next, the abundance of 22 immune Trends Genet. 17, 388–391 (2001). cell types was inferred by applying the leukocyte gene signature matrix (LM22) on 9. Fehrmann, R. S. N. et al. Gene expression analysis identifies global gene dosage the reconstructed profiles. sensitivity in cancer. Nat. Genet. 47, 115–125 (2015). 10. Zhang, R., Lahens, N. F., Ballance, H. I., Hughes, M. E. & Hogenesch, J. B. A Prediction of gene functionalities. We used a GBA approach to predict likely circadian gene expression atlas in mammals: implications for biology and – functions for genes based on gene co-regulation. For this, we conducted a medicine. Proc. Natl Acad. Sci. USA 111, 16219 16224 (2014). consensus-ICA on an unprecedented scale (manuscript in preparation). In short, a 11. Leek, J. T. et al. Tackling the widespread and critical impact of batch effects in covariance matrix was calculated between 19,635 genes using the expression pat- high-throughput data. Nat. Rev. Genet. 11, 733–739 (2010). terns of 106,462 gene expression profiles generated with Affymetrix HG-U133 Plus 12. Hyvärinen, A. & Oja, E. Independent component analysis: algorithms and 2.0 representing the many disease states, cellular states, and genetic and chemical applications. Neural Netw. 13, 411–430 (2000). perturbations that were obtained. Consensus-ICA was performed on the covar- 13. Ciriello, G. et al. Emerging landscape of oncogenic signatures across human iance matrix, which resulted in the identification of a large set of CESs and a MM cancers. Nat. Genet. 45, 1127–1133 (2013). reflecting the activity of each source in the expression pattern of each gene across 14. Stingele, S. et al. Global analysis of genome, transcriptome and proteome the samples. Next, a GBA approach was used to predict the functionality of reveals the response to aneuploidy in human cells. Mol. Syst. Biol. 8, 608 individual genes. First, we retrieved 16 public gene set collections describing a large (2012). range of biological processes and phenotypes. For each gene set, we calculated its 15. Fungtammasan, A., Walsh, E., Chiaromonte, F., Eckert, Ka & Makova, K. D. A bar code by averaging the MM weights of its member genes per CES. Next, for each genome-wide analysis of common fragile sites: what features determine gene in the MM, the distance correlation was determined between its MM weights chromosomal instability in the human genome? Genome Res. 22, 993–1005 and the gene set bar code. A high correlation between a gene’s MM weight and a (2012). gene set bar code indicated that the gene under investigation shared a functionality 16. Liberzon, A. et al. The Molecular Signatures Database hallmark gene set with the genes of the specific gene set under investigation. Significance levels were collection. Cell Syst. 1, 417–425 (2015). obtained with permutated data (1000 permutations). This strategy was used on 17. Baylin, S. B. DNA methylation and gene silencing in cancer. Nat. Clin. Pract. 23,372 well-described functional gene sets, which enabled us to create a compre- Oncol. 2,S4–S11 (2005). hensive network of predicted functionalities of individual genes. This framework is 18. Du, J., Johnson, L. M., Jacobsen, S. E. & Patel, D. J. DNA methylation available at http://www.genetica-network.com. pathways and their crosstalk with methylation. Nat. Rev. Mol. Cell A more detailed description of the methods used in this study is provided in the Biol. 16, 519–532 (2015). Supplementary Note 1. 19. Birchler, J. A. & Veitia, R. A. Gene balance hypothesis: connecting issues of dosage sensitivity across biological disciplines. Proc. Natl Acad. Sci. USA 109, Reporting summary. Further information on research design is available in 14746–14753 (2012). the Nature Research Reporting Summary linked to this article. 20. Santarius, T., Shipley, J., Brewer, D., Stratton, M. R. & Cooper, C. S. A census of amplified and overexpressed human cancer genes. Nat. Rev. Cancer 10, – Data availability 59 64 (2010). 21. Sondka, Z. et al. The COSMIC Cancer Gene Census: describing genetic Microarray expression data was collected from three public data repositories: Gene dysfunction across all human cancers. Nat. Rev. Cancer 18, 696–705 (2018). Expression Omnibus with accession number GPL570 (generated with Affymetrix HG- fi 22. Zack, T. I. et al. Pan-cancer patterns of somatic copy number alteration. Nat. U133 Plus 2.0), CCLE (generated with Affymetrix HG-U133 Plus 2.0, le Genet. 45, 1134–1140 (2013). CCLE_Expression.Arrays_2013-03-18.tar.gz) available at https://portals.broadinstitute. 23. Davoli, T., Uno, H., Wooten, E. C. & Elledge, S. J. Tumor aneuploidy org/ccle/data and GDSC (generated with Affymetrix HG-U219) available at https://www. correlates with markers of immune evasion and with reduced response to ebi.ac.uk/arrayexpress/experiments/E-MTAB-3610/. Pre-processed and normalized immunotherapy. Science 355, eaaf8399 (2017). RNA-seq data was collected from TCGA using the Broad GDAC Firehose portal (https:// fi 24. Mitchell, T. J. et al. Timing the landmark events in the evolution of clear cell gdac.broadinstitute.org/). Individual identi ers or accession numbers for all samples used renal cell cancer: TRACERx Renal. Cell 173, 611–623.e17 (2018). in this paper can be found in the Download support data section of http://www. 25. Fults, D. & Pedone, C. Deletion mapping of the long arm of chromosome genomicinstability.org/. The datasets generated during and/or analyzed during the 10 in glioblastoma multiforme. Genes, Chromosom. Cancer 7,173–177 current study are available in the website http://www.genomicinstability.org/. TACNA (1993). gene distribution and degree of TACNA can be explored at the gene level in top four 26. Eckel-Passow, J. E. et al. Glioma groups based on 1p/19q, IDH, and TERT panels of the above website. In addition, the tab Download support data in the website – fi fi promoter mutations in tumors. N. Engl. J. Med. 372, 2499 2508 (2015). links to zip les that contain TACNA pro les and degree of TACNA estimates for all 27. Rooney, M. S., Shukla, S. A., Wu, C. J., Getz, G. & Hacohen, N. Molecular and four datasets. genetic properties of tumors associated with local immune cytolytic activity. Cell 160,48–61 (2015). Code availability 28. Binnewies, M. et al. Understanding the tumor immune microenvironment The website http://www.genomicinstability.org/ contains a link to the R code that (TIME) for effective therapy. Nat. Med. 24, 541–550 (2018). contains the pipeline used to calculate the TACNA profiles. 29. Hanahan, D. & Weinberg, R. A. Hallmarks of cancer: the next generation. Cell 144, 646–674 (2011). 30. Weischenfeldt, J., Symmons, O., Spitz, F. & Korbel, J. O. Phenotypic impact of Received: 11 June 2019; Accepted: 20 January 2020; genomic structural variation: insights from and for human disease. Nat. Rev. Genet. 14, 125–138 (2013). 31. Sade-Feldman, M. et al. Defining T cell states associated with response to checkpoint immunotherapy in melanoma. Cell 175, 998–1013.e20 (2018). 32. Chen, P. L. et al. Analysis of immune signatures in longitudinal tumor samples yields insight into biomarkers of response and mechanisms of resistance to References immune checkpoint blockade. Cancer Discov. 6, 827–837 (2016). 1. Negrini,S.,Gorgoulis,V.G.&Halazonetis,T.D.Genomicinstability 33. Tumeh, P. C. et al. PD-1 blockade induces responses by inhibiting adaptive an evolving hallmark of cancer. Nat. Rev. Mol. Cell Biol. 11, 220–228 (2010). immune resistance. Nature 515, 568–571 (2014). 2. Davoli, T. et al. Cumulative haploinsufficiency and triplosensitivity drive 34. Barrett, T. et al. NCBI GEO: archive for functional genomics data sets— aneuploidy patterns and shape the cancer genome. Cell 155, 948–962 update. Nucleic Acids Res 41, 991–995 (2013). (2013). 35. Barretina, J. et al. The Cancer Cell Line Encyclopedia enables predictive 3. Stranger, B. E. et al. Relative impact of nucleotide and copy number variation modelling of anticancer drug sensitivity. Nature 483, 603–307 (2012). on gene expression phenotypes. Science 315, 848–853 (2007). 36. Yang, W. et al. Genomics of Drug Sensitivity in Cancer (GDSC): a resource for 4. Henrichsen, C. N. et al. Segmental copy number variation shapes tissue therapeutic biomarker discovery in cancer cells. Nucleic Acids Res. 41, transcriptomes. Nat. Genet. 41, 424–429 (2009). 955–961 (2013). 5. Tang, Y. C. & Amon, A. Gene copy-number alterations: a cost-benefit analysis. 37. Bolstad, B. M., Irizarry, R. A., Astrand, M. & Speed, T. P. A comparison of Cell 152, 394–405 (2013). normalization methods for high density oligonucleotide array data based on 6. Veitia, R. A., Bottani, S. & Birchler, J. A. Gene dosage effects: nonlinearities, variance and bias. Bioinformatics 19, 185–193 (2003). genetic interactions, and dosage compensation. Trends Genet. 29, 385–393 38. Casper, J. et al. The UCSC Genome Browser database: 2018 update. Nucleic (2013). Acids Res. 46, D762–D769 (2018). 7. El-Brolosy, M. A. & Stainier, D. Y. R. Genetic compensation: a phenomenon 39. Newman, A. M. et al. Robust enumeration of cell subsets from tissue in search of mechanisms. PLoS Genet. 13,1–17 (2017). expression profiles. Nat. Methods 12, 453–457 (2015).

NATURE COMMUNICATIONS | (2020) 11:715 | https://doi.org/10.1038/s41467-020-14605-5 | www.nature.com/naturecommunications 11 ARTICLE NATURE COMMUNICATIONS | https://doi.org/10.1038/s41467-020-14605-5

Acknowledgements Correspondence and requests for materials should be addressed to R.S.N.F. This research was supported by the Netherlands Organization for Scientific Research (NWO-VENI grant 916-16025 to R.S.N.F.), the Dutch Cancer Society (RUG 2013-5960 Peer review information Nature Communications thanks the anonymous reviewer(s) for to R.S.N.F. and RUG 2016-10034 to E.G.E.d.V.), the European Research Council (ERC their contribution to the peer review of this work. Peer reviewer reports are available. Consolidator grant 682421 to M.A.T.M.v.V.), and the Graduate School of Medical Sci- ences of the University Medical Center Groningen (no grant number, to R.D.B.). Reprints and permission information is available at http://www.nature.com/reprints

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in Author contributions published maps and institutional affiliations. R.S.N.F. conceived this study. A.B., R.D.B., and R.S.N.F. collected and assembled data. A.B., R.D.B., C.G.U., and R.S.N.F. performed data analyses. A.B., R.D.B., C.G.U., E.G.E.d.V., M.A.T.M.v.V., and R.S.N.F. contributed to the data interpretation, writing of Open Access This article is licensed under a Creative Commons the paper, and the final decision to submit the paper. Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give Competing interests appropriate credit to the original author(s) and the source, provide a link to the Creative E.G.E.d.V. reports institutional financial support for advisory board/consultancy from Commons license, and indicate if changes were made. The images or other third party Sanofi, Daiichi, Sankyo, NSABP, Pfizer and Merck, and institutional financial support for material in this article are included in the article’s Creative Commons license, unless clinical trials or contracted research from Amgen, Genentech, Roche, AstraZeneca, indicated otherwise in a credit line to the material. If material is not included in the Synthon, Nordic Nanovector, G1 Therapeutics, Bayer, Chugai Pharma, CytomX Ther- article’s Creative Commons license and your intended use is not permitted by statutory apeutics and Radius Health, all unrelated to the submitted work. All other authors regulation or exceeds the permitted use, you will need to obtain permission directly from declare no competing interests. the copyright holder. To view a copy of this license, visit http://creativecommons.org/ licenses/by/4.0/. Additional information Supplementary information is available for this paper at https://doi.org/10.1038/s41467- © The Author(s) 2020 020-14605-5.

12 NATURE COMMUNICATIONS | (2020) 11:715 | https://doi.org/10.1038/s41467-020-14605-5 | www.nature.com/naturecommunications