<<

medRxiv preprint doi: https://doi.org/10.1101/2020.07.14.20129031; this version posted July 17, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC 4.0 International license .

Title: Overexpression of transposable elements underlies immune overdrive and poor

clinical outcome in cancer patients

Authors: Xiaoqiang Zhu1, Hu Fang1, Kornelia Gladysz1, Jayne A. Barbour1, Jason W. H.

Wong*1,2

Affiliations:

1School of Biomedical Sciences, The University of Hong Kong, Pokfulam, Hong Kong SAR

2 Centre for PanorOmic Sciences, The University of Hong Kong, Pokfulam, Hong Kong SAR

*Corresponding authors: Jason W. H. Wong, E-mail: [email protected].

One Sentence Summary: Transposable element expression is an independent predictor of

immune infiltration and poor clinical outcome in cancer patients.

Abstract:

Increased immune infiltration in tumor tissue is usually associated with improved clinical

outcome, but excessive infiltration can lead to worse prognosis. The factors underlying such

an immune overdrive phenotype remain unknown. Here, we investigate the contribution of

transposable element (TE) expression to immune response and clinical outcome in cancer.

Using colorectal cancer as a model, we develop a TE expression score, showing that the highest

scores are predictive of immune overdrive and poor outcome independent of microsatellite

instability and tumor mutation burden. TE expression scores from cell lines treated with DNA

methyltransferase inhibitors show that TEs are directly responsible for driving excess immune

infiltration. A pan-cancer survey of TE expression further identifies a subset of kidney cancer

patients with a similar immune overdrive phenotype and adverse prognosis. Together, our

NOTE: This preprint reports new research that has not been certified by peer review and should not be used to guide clinical practice.

findings reveal that in cancer, TE expression underlies the immune overdrive phenotype and

is an independent predictor of immune infiltration and prognosis.

Introduction:

Transposable elements (TEs) are a major component of the human genome. They are

divided into class 1 retrotransposons and class 2 DNA transposons, which are further

subclassified into subclasses, superfamilies and over 1,000 subfamilies (1). As much as 45% of

the human genome is comprised of TEs, with many copies of near identical TE sequences from

each subfamily located throughout the genome (2-4). In normal somatic tissue, TEs are mainly

epigenetically silenced (5, 6); however, in cancer, TEs can become reactivated due to DNA

hypomethylation (7), resulting in the transcription of retrotransposons into RNA or direct

transposition of DNA transposons. One potential consequence of the reactivation of TEs is to

stimulate the immune system via viral mimicry (8, 9). For instance, upon treatment with

DNA methyltransferase (DNMT) inhibitors, one type of TE, the human endogenous retrovirus

(hERV), was shown to be reactivated, accompanied by the up-regulation of viral

defense pathways in ovarian (10) and colorectal (8) cancer cells. Several hERVs such as LTR21B,

MER57F and HERVL74-int were shown to be positively associated with immune infiltration (e.g.

CD8+ T cell expression) in multiple cancer types (11). Recently, it has been shown that hERVs

can serve as tumor antigen signals and some specific hERVs are of prognostic value and

predictive of response to immunotherapy (9). For instance, ERV3-2 was found to be

correlated with immune checkpoint activation across 11 cancer types (12). Additionally, a hERV

4700-derived epitope may function as a target by which anti-PD1 could trigger anti-tumor

immunity in kidney renal clear cell carcinoma (KIRC) (13). Apart from hERVs, other classes of

TEs could also be the potential source of immunogenic peptides from tumor cells such as long

interspersed nuclear elements (LINE), short interspersed nuclear elements (SINE) and SINE-

VNTR-Alu (SVA) (11). Together, these results have demonstrated the critical roles of TEs in

anti-tumor immunity. Nonetheless, how TE reactivation impacts cancer progression and

clinical outcome still remains unclear.

Patients whose cancers have higher immune cell infiltration tend to have better prognosis. For

instance, Immunoscore has been developed based on the density of CD3+ and cytotoxic CD8+

T cells in the tumor and the invasive margin in colorectal cancer (CRC) (14), and has been

shown to have prognostic value superior to American Joint Committee on Cancer (AJCC) stage

classification (15). Intriguingly, a recent study identified a high risk subgroup of CRC patients

with high tumor immune infiltration as indicated by high CD8A and CD274 expression

(16). Termed the “immune overdrive” signature, this phenotype included tumors of both

microsatellite instability (MSI) and stable (MSS) status, with increased TGF-β activation and

overexpression of immune response and checkpoint genes. Whether such patients are

likely to benefit from immune checkpoint inhibitor therapy remains to be evaluated, and the

underlying factor behind this phenotype and whether a similar immune overdrive phenotype

exists in other cancer types remain unknown.

Given the recent evidence for the role of TEs in triggering cancer immune response, in this

study, we set about investigating whether TE expression may be the underlying mechanism

of the immune overdrive phenotype observed in CRC. To do this, we quantified the expression

of over 1,000 subfamilies of TEs at the RNA level in The Cancer Genome Atlas (TCGA) CRC

cohort. We identified a nine-TE expression signature that classified CRC patients into four

groups with distinctive prognosis. The group with the highest TE score was characterised by

markers of immune overdrive and had the poorest prognosis, while the upper intermediate

TE score group was characterised by moderately elevated immune marker expression and had the

best prognosis – mirroring the risk profile reported by Fakih et al (16). Significantly, our TE

score was predictive of prognosis independent of MSI status and tumour mutation burden

(TMB). To demonstrate that TE expression is directly linked to the immune overdrive

phenotype, we compared immune response and gene expression pathways in high TE score

samples with those in cell lines treated with DNA demethylating agents. Finally, a pan-

cancer analysis of our TE expression signature uncovered a similar immune overdrive

subgroup in high risk KIRC patients, validating the prognostic significance of TE expression

signature across cancer types.

Results:

Identification of TEs associated with survival and immune activation in CRC

To establish whether TEs are associated with CRC prognosis and immune activation, we first

quantified the TE expression landscape in CRC by applying the recently developed

“REdiscoverTE” pipeline (11) on RNA sequencing data from TCGA (Fig. S1A). In brief, the

pipeline quantifies the number of reads mapping to each TE subfamily without

resolving individual instances in the genome. Our downstream analysis focused on 1,072

TE subfamilies, which were classified into six classes: LTR (long terminal repeats), DNA,

LINE, SINE, Retroposon and Satellite (Fig. S1B). The expression pattern of these six classes is

shown in Fig. 1A, indicating that Retroposon and SINE had the highest expression, followed by LINE

and LTR, while Satellite and DNA had the lowest expression (11).
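
As an illustration of the expression unit used later in the analysis (log counts per million mapped reads, logCPM), the following is a minimal sketch of converting subfamily-level read counts for one sample into logCPM. The pseudocount `prior_count`, the function name and the example counts are assumptions made for illustration; the exact normalization applied by the REdiscoverTE pipeline is not specified in this section.

```python
import math


def log_cpm(counts, mapped_reads, prior_count=1.0):
    """Convert per-subfamily read counts to log2 counts per million mapped reads.

    prior_count is an assumed pseudocount that keeps zero counts finite.
    """
    if mapped_reads <= 0:
        raise ValueError("mapped_reads must be positive")
    return {
        subfamily: math.log2((count + prior_count) * 1e6 / mapped_reads)
        for subfamily, count in counts.items()
    }


# Hypothetical counts for one sample; subfamily names are taken from the text,
# the numbers are invented for illustration only.
sample_counts = {"SVA-C": 1499, "SVA-F": 249, "Trigger12A": 0}
print(log_cpm(sample_counts, mapped_reads=50_000_000))
```

Working on a log scale with a pseudocount keeps unexpressed subfamilies in the table rather than dropping them, which matters when most of the 1,072 subfamilies are lowly expressed.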

To comprehensively identify prognostic TEs, we performed survival analysis on each TE for

each of four survival endpoints: overall survival (OS), disease-specific

survival (DSS), disease-free interval (DFI) and progression-free interval (PFI) (17). There

were 372 candidate TEs that showed a survival difference for at least one endpoint (Fig. 1B, Table.

S1) with seven significant in all the four endpoints (Fig. S1C-G). Using a permutation test (see

Methods), we estimated the false discovery rate (FDR) to be 1.35% (Fig. S1H). Interestingly,

almost all of the hazard ratios of the candidate TEs were greater than one, indicating that higher

TE expression generally contributed to worse survival (Fig. 1C). Similar analyses were performed

at the family and class levels. Retroposon showed significant differences in terms of

three endpoints (Fig. S1I and J). Retroposon includes six subfamilies named SVA-A to SVA-F.

Further multivariable Cox regression analysis for each of these subfamilies suggested that five of

them, all except SVA-B, could be independent predictors of survival for at least one endpoint

(Fig. S1K-M).
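
The permutation-based FDR mentioned above can be sketched as follows. This is a generic outline under stated assumptions, not the procedure from the Methods: the outcome labels are shuffled, the same significance rule is re-applied to every TE, and the FDR is estimated as the mean number of hits under permutation divided by the number of hits in the real data. The callback `is_hit`, both function names and the numbers in the usage line are hypothetical.

```python
import random


def null_hit_counts(features, outcome, is_hit, n_permutations=100, seed=0):
    """Count how many features pass `is_hit` after shuffling the outcome.

    features: dict mapping feature name -> list of per-sample values
    outcome:  list of per-sample outcomes (same sample order)
    is_hit:   assumed user-supplied function (values, outcome) -> bool
    """
    rng = random.Random(seed)
    shuffled = list(outcome)
    counts = []
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # break any true feature/outcome association
        counts.append(sum(1 for values in features.values() if is_hit(values, shuffled)))
    return counts


def permutation_fdr(n_observed_hits, null_counts):
    """Estimated FDR = mean hits under permutation / hits in the real data."""
    if n_observed_hits == 0:
        return 0.0
    expected_false = sum(null_counts) / len(null_counts)
    return min(1.0, expected_false / n_observed_hits)


# Hypothetical numbers: 200 observed hits, about 3 hits per permuted data set.
print(permutation_fdr(200, [2, 4, 3, 3]))
```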

Next, to identify immune associated TEs, we evaluated 29 immune indices including 17

immunologically relevant gene sets from ImmPort (18), one overall immune infiltration score,

termed ImmuneScore and calculated by ESTIMATE (19), and 11 immune associated genes (Fig.

S1N). Gene set variation analysis (GSVA) was performed to estimate expression scores of the

17 immune gene sets. Further correlation analysis between individual TEs and the immune

indices indicated that 14 out of the 1,072 TEs had significant positive correlation with at least

one immune index (Spearman’s correlation ≥ 0.4, p < 0.0001, Fig. 1D and E). The FDR of the

significant immune-TE associations was estimated to be 0.7% by permutation test (Fig. S1O).
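
The correlation screen described above can be sketched in a few lines. This is a minimal, dependency-free illustration that applies only the correlation cutoff (Spearman's rho of at least 0.4 with at least one immune index); the accompanying p-value filter (p < 0.0001) is omitted here, and the function names and toy data are assumptions.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) for constant input."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman(x, y):
    """Spearman's rho is the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def immune_associated(te_expr, immune_indices, rho_cutoff=0.4):
    """Return TEs whose rho with at least one immune index meets the cutoff."""
    return {
        te
        for te, x in te_expr.items()
        if any(spearman(x, y) >= rho_cutoff for y in immune_indices.values())
    }


# Toy data (hypothetical): TE "A" tracks the index, TE "B" runs against it.
te_expr = {"A": [1, 2, 3, 4], "B": [4, 3, 2, 1]}
immune = {"ImmuneScore": [2, 4, 6, 8]}
print(immune_associated(te_expr, immune))
```

Intersecting the resulting set with the survival-associated TEs is then an ordinary set intersection.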

To integrate the results of the prognostic and immune TE associations, we overlapped the

372 survival relevant TEs and the 14 immune positively correlated TEs. By doing so, nine TEs

were identified and used for further exploration (Fig. 1F).

Generation of CRC subtypes based on TE score and clinical outcome

Based on the nine TEs identified above, we first explored the relationship between their

expression, finding that most of them were positively correlated with each other (Fig. S2A).

As such, to generate a combined TE score, we averaged the normalized TE expression (log

counts per million mapped reads, logCPM) across the CRC samples. We applied the kaps

algorithm (20) to the normalized TE score and identified 4 clusters based on OS (see Methods,

Fig. S2B-G). Based on these clusters, samples were classified into four risk groups termed TE

clusters from cluster 1 to cluster 4 with increasing TE score (Fig. 2A). Cluster 4 accounted for

9% (n = 51) of the total cohort while cluster 3 for 8% (n = 47), cluster 2 for 19% (n = 113) and

cluster 1 for 64% (n = 379). We found that there were significant differences among these

four clusters in terms of some molecular features (Fig. 2B-E, Fig. S2H, Table. S2). Specifically,

cluster 4 showed a higher fraction of MSI samples (33%) while cluster 1 had the lowest fraction

(11%) (chi-squared P = 0.0001, Fig. 2B). Cluster 4 also had more samples with CpG island

methylator phenotype (CIMP) (30%) (chi-squared P = 0.0228, Fig. 2C). We also observed some

overlap between the TE clusters and two other molecular subtyping schemes: consensus

molecular subtype (CMS) (21) and immune subtypes (22). CMS consists of four subtypes

characterized by MSI, CIMP high and immune infiltration for CMS1, epithelial, WNT and MYC

signals activation for CMS2, metabolic dysregulation for CMS3, mesenchymal and TGF-β

signal activation for CMS4. Notably, a high fraction of samples in cluster 4 belonged to the

CMS1 subtype (35%), while half of cluster 1 were CMS2 and 40% of cluster 2 belonged to

CMS4 (chi-squared P < 0.0001, Fig. 2D). Moreover, as for immune subtypes derived from pan-

cancer analysis, cluster 4 showed a higher proportion of the IFN-gamma dominant (39%) and

inflammatory (10%) subtypes, while cluster 1 displayed the highest proportion of the wound healing

subtype (85%) followed by cluster 2 (69%), cluster 3 (58%) and cluster 4 (51%) (chi-squared P

< 0.0001, Fig. 2E). Lastly, as expected, survival differed significantly among these four clusters

for OS, DSS and PFI but not for DFI (log-rank P = 0.0035 for OS, P = 0.011 for DSS, P = 0.12 for DFI and P = 0.022 for

PFI, Fig. 2F-I). Generally, cluster 4 showed the worst survival while cluster 3 showed the most

favorable outcomes (cluster 4 versus 3 with HR = 2.36, 95%CI = 0.96-5.80, P = 0.05 for OS, HR

= 3.99, 95%CI =1.12-14.16, P = 0.02 for DSS and HR = 3.05, 95%CI = 1.34-6.94, P = 0.005 for

PFI). Notably, cluster 4 had a very poor survival rate after relapse while cluster 3 had superior

survival rate after relapse.
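
The combined TE score and the cluster proportions reported above can be reproduced schematically. The sketch below assumes the score is the per-sample mean logCPM over the signature TEs, which is one reading of the averaging step; the function names and the toy expression values are hypothetical, while the cluster sizes are the ones given in the text.

```python
def te_score(log_cpm_by_te, signature):
    """Per-sample mean logCPM across the TEs in the signature."""
    n_samples = len(log_cpm_by_te[signature[0]])
    return [
        sum(log_cpm_by_te[te][s] for te in signature) / len(signature)
        for s in range(n_samples)
    ]


def cluster_fractions(cluster_sizes):
    """Fraction of the cohort that falls in each cluster."""
    total = sum(cluster_sizes.values())
    return {name: size / total for name, size in cluster_sizes.items()}


# Toy logCPM values for two hypothetical TEs across two samples.
print(te_score({"TE_a": [1.0, 3.0], "TE_b": [3.0, 5.0]}, ["TE_a", "TE_b"]))

# Cluster sizes as reported in the text (n = 590 in total).
sizes = {"cluster 1": 379, "cluster 2": 113, "cluster 3": 47, "cluster 4": 51}
for name, fraction in cluster_fractions(sizes).items():
    print(name, round(100 * fraction))
```

Rounded to whole percentages, the reported sizes give 64%, 19%, 8% and 9%, matching the proportions stated for clusters 1 to 4.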

TE score is a prognostic and immune infiltration predictor independent of MSI and tumor

mutation burden

In CRC, it is well established that patients with tumors that are MSI or have high TMB generally

have better prognosis. As TE cluster 4 is slightly enriched for MSI tumors compared with

other groups (33% versus 19%, 16% and 11% in clusters 3, 2 and 1 respectively), we sought to

determine whether TE score is an independent predictor of prognosis and immune infiltration.

To do so, we performed multivariable Cox regression on the TE clusters adjusted by clinical

features including MSI status. Cluster 4 remained an independent prognostic variable for all

the three endpoints with HR against cluster 3 of 3.98 (95%CI: 1.09-14.57, P = 0.037) for OS,

9.52 (95%CI: 1.18-76.54, P = 0.034) for DSS and 2.79 (95%CI:1.07-7.31, P = 0.036) for PFI (Fig.

3A-C). Notably, MSI was not significant for any of the endpoints. Further analysis was

performed on MSI and MSS samples separately. It was found that the four clusters were well

separated especially in MSI samples (Fig. 3D and E). We also explored the correlation between

TE cluster and TMB, defined as non-silent mutation count per megabase (Mb) (23). TMB was

poorly correlated with TE score (r = 0.13, Fig. 3F) and there was no difference in terms of TMB

among TE clusters (Fig. 3G).
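
As a worked check on the adjusted hazard ratios above, a reported HR and its 95% confidence interval determine an approximate two-sided p-value if one assumes a Wald test on the log hazard ratio. That assumption is ours and the test actually used is not stated in this section; the function name is also hypothetical.

```python
import math


def wald_p_from_ci(hr, ci_low, ci_high, z_crit=1.959964):
    """Two-sided Wald p-value implied by a hazard ratio and its 95% CI.

    The standard error of log(HR) is recovered from the CI width on the
    log scale, then z = log(HR) / SE is referred to the standard normal.
    """
    se = (math.log(ci_high) - math.log(ci_low)) / (2 * z_crit)
    z = math.log(hr) / se
    return math.erfc(abs(z) / math.sqrt(2))


# Cluster 4 versus cluster 3, values as reported in the text.
for endpoint, hr, lo, hi in [
    ("OS", 3.98, 1.09, 14.57),
    ("DSS", 9.52, 1.18, 76.54),
    ("PFI", 2.79, 1.07, 7.31),
]:
    print(endpoint, round(wald_p_from_ci(hr, lo, hi), 3))
```

Under this assumption the implied p-values agree with the reported P = 0.037, 0.034 and 0.036 to rounding, so the three HR, CI and P triplets are internally consistent.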

To test if TE cluster could independently predict immune infiltration, we used CD8A, a

marker of CD8+ T cells, and a T cell-inflamed gene expression profile (GEP) (24). Firstly, we

observed strong correlations between TE score and CD8A (r = 0.43, Fig. 3H) and GEP (r = 0.51,

Fig. 3I), respectively. A multinomial logistic regression analysis incorporating clinical

parameters and MSI or TMB was then performed to predict CD8A and GEP. Cluster 4 was a

significant predictor for CD8A and GEP and notably displayed the highest odds ratio (OR) of

6.3 for CD8A (Fig. 3J) and 5.4 for GEP (Fig. 3K). Furthermore, TE score showed much higher

OR than TMB with 2.3 versus 1.3 for CD8A (Fig. 3L) and 3.0 versus 1.4 for GEP (Fig. 3M). Our

results indicate that TE cluster and TE score predict immune

infiltration independently of MSI and TMB.

Immune overdrive is associated with TE score and expression

Given that the TE signature expression was associated with immune activation, we further

explored the differences of tumor immune microenvironment (TME) among TE clusters. We

first compared 28 cell fractions including 26 immune cell types and two stromal cell types (see Table. S3)

based on GSVA and found that cluster 4 displayed higher fractions of most of these cells

especially for T cells, and dendritic cells, while cluster 1 showed a lack of

immune infiltration (Fig. 4A). Similar results were obtained for 10 cell fractions based on the

MCPcounter method (Fig. S3A). Further, cluster 4 displayed the highest gene signatures of

immune infiltration followed by cluster 3, 2 and lowest for cluster 1 (Fig. 4B). These signatures

included T cell, lymphocyte, leukocyte infiltration signatures, hot tumor signature and tumor

associated macrophage (TAM) ratio. Cluster 4 also displayed the highest expression of T helper 1

(Th-1) immune response and regulatory genes as well as MHC II and MHC I molecules (Fig. 4B,

Fig. S3B and C). Cluster 4 also had higher IFN-γ response rate but exhibited T cell exhaustion

(Fig. 4B). Most of the TCR/BCR receptor indices were also highest in cluster 4 and lowest in

cluster 1 (Fig. 4C).

Next, we investigated association of the TE clusters with genetic changes and did not find

distinctive differences among the clusters except for neo-antigens (Fig. 4D). Specifically, we

investigated associations with hotspot mutations by focusing on CRC specific drivers (25, 26).

To test if any drivers contributed to TE reactivation, we compared the differences in the

frequency of hotspot mutations between cluster 4 and other clusters. Eleven cancer genes

had sufficient mutated samples in each group for further statistical evaluation (see Methods,

Fig. S3D). Only TP53 (29% in cluster 4 versus 17% in other three clusters) and BRAF (14% in

cluster 4 versus 2.4% in other three clusters) displayed significance (chi-square P = 0.0001 and

P = 0.0378 for TP53 and BRAF respectively). As for the copy number changes shown in Fig.

S3E, there were no distinctive differences among TE clusters.
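
The comparison of mutation frequencies between cluster 4 and the remaining clusters is a test on a 2 x 2 table (mutated versus wild type, cluster 4 versus others). A minimal sketch of the Pearson chi-squared test for such a table follows; it applies no continuity correction, which may differ from the test as run in the study, and the counts in the usage line are hypothetical rather than taken from the data.

```python
import math


def chi2_2x2(a, b, c, d):
    """Pearson chi-squared test (no continuity correction) for [[a, b], [c, d]].

    Returns (statistic, p_value). With 1 degree of freedom the chi-squared
    survival function reduces to erfc(sqrt(statistic / 2)).
    """
    observed = ((a, b), (c, d))
    n = a + b + c + d
    rows = (a + b, c + d)
    cols = (a + c, b + d)
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = rows[i] * cols[j] / n
            stat += (observed[i][j] - expected) ** 2 / expected
    return stat, math.erfc(math.sqrt(stat / 2))


# Hypothetical table: rows = cluster 4 / other clusters,
# columns = mutated / wild type.
stat, p = chi2_2x2(10, 20, 30, 40)
print(round(stat, 4), round(p, 3))
```

When expected counts in any cell are small, Fisher's exact test is the usual alternative.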

To elucidate the molecular phenotype associated with the TE clusters, we further analyzed

dysregulated pathways among TE clusters. We found that clusters 4 and 2 displayed stronger

immune evasion-associated signatures including TGF-β response, extracellular matrix (ECM)

gene expression, VEGF target, Epithelial–mesenchymal transition (EMT) and innate anti-PD1

resistance (IPRES) signatures (Fig. 4E, Fig. S3F). By analysis of 50 MSigDB hallmark gene sets

(27) and 39 gene program and canonical drug targetable pathways (28), we found 32 out of

50 hallmark gene sets were dysregulated among TE clusters and most of them were

upregulated in cluster 4 (Fig. 4F). Similarly, 24 out of 39 drug targetable pathways were

significantly different among TE clusters indicating the potential therapy targets for individual

clusters such as ALK and PI3K pathways in cluster 4, anti- and epidermal growth

factor (EGF) signals in cluster 3, plasma membrane signal in cluster 2 and MYC pathway in

cluster 1 (Fig. S3G).

Finally, we compared the expression of two markers including CD8A and CD274 which were

used to identify immune overdrive by Fakih et al (16). Our results demonstrated that our

cluster 4 also displayed higher expression of these two markers (Fig. S3H and I), implying the

comparability of immune overdrive identified by TE cluster and these two markers. More

importantly, based on the classification of risk groups identified by Fakih et al (16), we found

that risk group (IV*) characterized by immune overdrive signature also displayed highest TE

score, followed by risk group III* and I/II (Fig. S3J). Together, these results above reflect that

the immune overdrive phenotype of cluster 4 is characterized by TE reactivation, highest

immune infiltration, poorest survival and immune evasion activity (e.g. TGF-β signal), and

higher expression of immune response and checkpoint genes.

Activation of innate immune response in CRC with high TE score recapitulates TE

reactivation by DNMT inhibitors

DNMT inhibitors such as 5-aza-2’-deoxycytidine (5-aza) have been shown to result in TE

reactivation through induction of genome-wide DNA demethylation (2, 5). This in turn results

in the activation of a range of innate immune response pathways in cell lines (8, 10). To show

that TE expression is also directly responsible for triggering the activation of immune

response in the CRC patient samples, we compared immune pathway activation in 5-aza

treated cells and the CRC TE clusters. Generally, activity of hallmark gene sets was consistent

in these three 5-aza treated cell line datasets (Fig. 4F). Interestingly, we observed large

overlaps of the significant up-regulated pathways in treated groups of these three data sets

and CRC TE cluster 4 (n=21, Fig. 4G, Table. S4). Specifically, these pathways were mainly

associated with immune response such as complement, inflammatory response and IFN γ and

α response signals. Besides, several oncogenic pathways were also up-regulated including

P53 pathway, TGF-β, EMT and apoptosis pathways. Furthermore, by quantifying the TE

expression in GSE80137 derived from RNA sequencing, we found that TE scores were higher

in the 5-aza treated groups than in the control group (Fig. 4H).

As the 5-aza treated samples are all derived from cell lines, to further confirm that TE

reactivation signal is mainly derived from tumor cells in bulk CRC tissue rather than the TME,

we investigated TE expression in a high-depth single cell RNA sequencing (scRNA-seq) breast

cancer dataset (29). Clustering analysis based on the global TE expression profiles after

dimensionality reduction with t-Distributed Stochastic Neighbor Embedding (t-SNE) showed

that tumor and non-tumor cells formed separate clusters (Fig. S3K). Importantly, the

proportion of reads mapping to TEs was highest in tumor cells (Kruskal-Wallis test P = 0.0019,

Fig. S3L). This is consistent with previous studies that compared TE expression profiles across

healthy human tissues, finding that samples from whole blood displayed relatively lower

expression of TE compared with other solid tissues, suggesting that hematopoietic cells (e.g.

T and B cells) contribute less to TE expression (30, 31). Together, these results suggest that

TE reactivation is likely to be the underlying mechanism of the immune overdrive phenotype

of cluster 4 and TE expression signal is mainly derived from tumor cells.
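
The comparison of TE read proportions across cell groups uses the Kruskal-Wallis test, a rank-based alternative to one-way ANOVA. The sketch below computes the H statistic with the standard tie correction; the function name and the toy groups are hypothetical. The p-value comes from a chi-squared distribution with (number of groups - 1) degrees of freedom, which for exactly three groups reduces to exp(-H / 2).

```python
def kruskal_wallis_h(groups):
    """Return (H, degrees_of_freedom) for a list of groups of numeric values.

    Ranks are assigned over the pooled data, with ties sharing the mean rank.
    Raises ZeroDivisionError if every pooled value is identical.
    """
    pooled = sorted((value, g) for g, group in enumerate(groups) for value in group)
    n = len(pooled)
    rank_sums = [0.0] * len(groups)
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j + 1 < n and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        rank = (i + j) / 2 + 1      # mean rank of the tied block
        t = j - i + 1               # size of the tied block
        tie_term += t ** 3 - t
        for k in range(i, j + 1):
            rank_sums[pooled[k][1]] += rank
        i = j + 1
    h = 12 / (n * (n + 1)) * sum(
        r * r / len(g) for r, g in zip(rank_sums, groups)
    ) - 3 * (n + 1)
    correction = 1 - tie_term / (n ** 3 - n)
    return h / correction, len(groups) - 1


# Toy example with three fully separated groups (hypothetical values).
print(kruskal_wallis_h([[1, 2, 3], [4, 5, 6], [7, 8, 9]]))
```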

TEs trigger intracellular immune response by viral mimicry

Increasing evidence indicates that TE reactivation (e.g. hERVs) can stimulate the immune

system via intracellular antiviral responses (8, 10). To comprehensively characterize the pathway

regulation behind TE reactivation in CRC, we applied weighted correlation network analysis

(WGCNA) (32) to find module genes that were associated with TE score. WGCNA is a popular

systems biology approach that not only builds gene networks but also detects gene

modules associated with phenotypic traits (see Methods). Our results suggested that two of

the gene modules were strongly positively correlated with TE score (r = 0.5 for the greenyellow

module, r = 0.46 for the brown module, Fig. 5A, Fig. S4A-G, Table. S5). There were 39 and 389

genes in the greenyellow and brown modules, respectively. To determine the function of these

modules, we performed GO and KEGG enrichment analysis using these genes (Fig. 5B and C).

Genes in the greenyellow module were mainly enriched in pathways associated with innate

immune pathways such as defense to virus, type I IFN response, dsRNA sensing, NOD-like,

RIG-I-like, MDA-5 and TLR signals. Genes in the brown module were more involved in adaptive

immune response such as differentiation, migration and activation of immune cells, IFN γ

response, JAK-STAT and NF-kB signals.
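
The module-trait step above can be illustrated with a simplified stand-in. WGCNA summarizes each module by its eigengene (the first principal component of the module's expression matrix) and correlates that with the trait; the sketch below instead uses the per-sample mean of z-scored gene expression as the module summary, which is an assumption made to keep the example dependency-free. Function names and toy data are hypothetical.

```python
import statistics


def zscore(values):
    """Standardize one gene's expression across samples (needs non-zero spread)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]


def module_summary(expr_by_gene, module_genes):
    """Per-sample mean of z-scored expression over the genes in a module."""
    z = [zscore(expr_by_gene[g]) for g in module_genes]
    n_samples = len(z[0])
    return [statistics.fmean(row[s] for row in z) for s in range(n_samples)]


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


# Toy module of two co-expressed genes and a TE score that follows them.
expr = {"gene1": [1.0, 2.0, 3.0], "gene2": [2.0, 4.0, 6.0]}
te_scores = [10.0, 20.0, 30.0]
print(pearson(module_summary(expr, ["gene1", "gene2"]), te_scores))
```

Z-scoring each gene first stops a few highly expressed genes from dominating the module summary.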

We then further compared the expression of some critical genes that might be involved in

the response to TE reactivation. As expected, most of these genes were highly expressed in

cluster 4 followed by cluster 3 and 2 and lowest in cluster 1 (Fig. 5D). Specifically, these

included RIG-I-like and interferon-stimulated genes (ISGs), OASL/2/3, and IFN secretion and

production processes, which have been shown to be associated with TE reactivation. For instance,

upon TE reactivation, the secretion and production of IFNs were increased, in which IFN α and

β indicated a type I IFN response. It has been shown that the type I IFN response is indicative of

the upregulation of intracellular antiviral pathways, which are generally induced by dsRNA

sensing signals stimulated by TE reactivation (8, 10). Similarly, some TE suppressors reported

in the literature, such as APOBECs, ADAR, NOD2, MOV10, MOV10L1 and CTCFL, were also

upregulated in cluster 4. Specifically, SVA-C and SVA-F are two Retroposons that are part of the

nine-TE signature. It has been shown that CTCFL, a germline-specific transcription factor,

functions as a suppressor of SVA expression by directly binding to and regulating SVA repeats

(33). Together, these results show that our TE score recapitulates the immune response known

to be induced by both endogenous transposable elements and exogenous viruses.

Pan-cancer analysis identified immune overdrive phenotype in KIRC

Finally, to confirm if the TE induced immune overdrive phenotype is also present in other

cancer types, we examined TE expression in an additional 23 cancer types. We first compared

the nine-TE score among cancer types. Several cancer types showed higher TE scores, such as

KIRC, Diffuse Large B-cell Lymphoma (DLBC) and Head and Neck squamous cell carcinoma

(HNSC) while CRC and Adrenocortical carcinoma (ACC) were generally lower (Fig. S5A, Table.

S6). By performing univariable Cox regression analysis on TE score in each cancer type, we

found that, apart from CRC, KIRC also had a significantly increased HR

(HR =1.78, 95%CI = 1.43 - 2.22, P value < 0.0001, Fig. 6A, Table. S7). Several cancers also had

significant HRs below one, including Breast invasive carcinoma (BRCA), Cervical

squamous cell carcinoma and endocervical adenocarcinoma (CESC), HNSC, Liver

hepatocellular carcinoma (LIHC) and Ovarian serous cystadenocarcinoma (OV) (Table. S7).

We then further correlated TE score with immune infiltration based on GEP. Overall, TE score

correlated well with GEP across the cancer types (r = 0.45, Fig. S5B). Individually, the best

correlations were observed in SKCM, HNSC and CESC followed by CRC and PRAD (Fig. 6B).

As some KIRC tumors also appear to exhibit an immune overdrive phenotype, we carried out

further analyses as we had done for CRC. Eight of the nine TEs had similar trends of expression

across the samples except for Trigger12A (Fig. 6C). As with CRC, using the kaps algorithm, we

identified four KIRC clusters with differing prognostic outcomes (Fig. 6D-F) and enrichment of

molecular subtypes (Fig. S5C). Notably, cluster 4 had worse OS (cluster 4 versus 3 with HR =

2.21, 95%CI = 1.29-3.77, P = 0.003, Fig. 6D), DSS (cluster 4 versus 3 with HR = 2.77, 95%CI =

1.38-5.56, P = 0.0027, Fig. 6E), and poor survival rate after relapse (cluster 4 versus 3 with HR

= 2.56, 95%CI = 1.38-4.76, P = 0.0021, Fig. 6F). Cluster 2 also had relatively short OS while cluster

3 had a favorable survival rate after relapse. After adjustment for clinical features, cluster 4

remained significant for all three endpoints and cluster 2 was significant for OS and PFI (Fig.

S5D-F).

Consistently, cluster 4 in KIRC also displayed highest immune infiltration and immune

evasion phenotypes (Fig. S5G-J). Similar to CRC, there were no distinctive differences in

genetic changes among these TE clusters in KIRC (Fig. S5K). Finally, most of the genes involved

in the response to TE reactivation had higher expression in cluster 4, showing immune

regulation patterns similar to those in CRC (Fig. S5L-R).

Discussion:

Molecular subtyping based on genomic and transcriptomic data has facilitated improved

understanding of molecular features in cancers and has guided targeted strategies in cancer

treatment (34-36). For instance, MSI is a critical subtype in CRC which has been associated

with high immune infiltration (e.g. CD8+ T cells) (37-40) and lower risk of relapse (41, 42).

Generally, cancer samples with higher immune infiltration have better survival, which has been

observed in cancers including CRC (14, 22). However, an immune overdrive phenotype is also

observed in CRC characterized by high immune infiltration but poorest survival (16). Here, we

link TE expression to the immune overdrive phenotype in CRC and propose that reactivation

and overexpression of TEs might be the potential cause of this phenotype. The immune

overdrive phenotype can be reproduced directly using our TE signature. TE cluster 4 is

characterized as the immune overdrive phenotype with the highest TE score, poorest survival

but also highest immune infiltration. Importantly, a similar immune overdrive subgroup was

also present in KIRC which to our knowledge has not previously been reported. Our findings

suggest that immune overdrive is mainly contributed by TE reactivation, highlighting the

importance of TE reactivation on immune infiltration in cancers.

Nearly 50% of the human genome consists of TEs which are critical for sustaining genomic

stability, chromosomal structure and transcriptional regulation (43, 44). Generally, TEs are

strictly regulated from early embryonic development and in human differentiated somatic

cells mainly through epigenetic mechanisms such as DNA methylation and histone

modification (45, 46). TEs are highly mutagenic and normally regarded as harmful because

their activation conflicts with the fitness of the human host (47). Compelling evidence

indicates critical roles for reactivated TEs in cancer development and progression, resulting

from the loss of TE suppression (48-50). Generally, epigenetic mechanisms, especially DNA

methylation and histone modification, are the best-known means of TE silencing. Indeed,

studies have demonstrated that epigenetic alterations could lead to in which

TE reactivation might be a potential secondary cause (51). The global loss of methylation can

lead to TE reactivation and is often accompanied by the hypermethylation of tumor

suppressor genes in cancers (52). For instance, the reactivation of LINE1 caused by DNA

hypomethylation has been observed in several cancer types including CRC (53), LIHC (54), and

BRCA (55). It has been shown that 5-aza treatment can stimulate an innate immune response

accompanied by reactivation of TEs, including hERVs and other classes of TEs (8, 10, 11). Our analysis

of cell line data also confirmed that the immune response was stimulated after 5-aza treatment.

More importantly, RNA sequencing data from three GBM cell lines showed significantly higher

TE scores after treatment. Up-regulated hallmark pathways largely overlapped between 5-aza

treated cells and TE cluster 4; most of these were immune-associated signals as well as

oncogenic pathways. These results suggest that the phenotype of TE cluster 4 can be

recapitulated by 5-aza treatment, implying that TE reactivation is the likely cause of immune

overdrive. TE cluster 4, which has the highest TE score, comprises a higher fraction of samples

with CIMP, a phenotype characterized by hypermethylation of promoter CpG island sites that

is importantly also associated with genome-wide hypomethylation (56), which may

contribute to the reactivation of TEs. Currently, few tools are available to specifically

investigate epigenetic regulation of TEs, as the repetitive nature of TEs makes it difficult to

assign reads derived from next-generation sequencing to individual TE copies,

a difficulty termed the multi-mapping problem (57). Bisulfite sequencing (BS-Seq) and methyl-DNA

immunoprecipitation sequencing (MeDIP-Seq) are two well-established methods to measure

DNA methylation at high throughput level (58). However, mapping these short reads from

bisulfite-converted genomic DNA to TEs is still challenging. Although most older TEs have

accumulated sufficient nucleotide diversity to be uniquely identified, the insertions of

younger TEs are often indistinguishable from their source elements using short-read

sequencing (59). Because most DNA methylation quantification in TCGA

cohorts was based on the Infinium HumanMethylation 450K or 27K array platforms,

proper analysis of the correlation between DNA methylation alterations and TE

reactivation was limited in this study. Nevertheless, using TCGA Illumina 450K array data, Kong et al.

found that 431 out of 1,007 TEs displayed an inverse correlation between TE expression and

methylation in at least one of 10 cancer types, while only 13 TEs were significant across

multiple cancer types (11). One recent preprint has demonstrated that it is possible

to accurately assess methylation of TEs using long-read nanopore sequencing (59). This

new technique could facilitate investigation of TE biology in the future. Moreover,

heterochromatin formation has also been shown to regulate TE silencing in cooperation with

DNA methylation and small RNAs (7). TE-associated nucleosomes are generally methylated at

histone 3 lysine 9 (H3K9) representing signals. To date, chromatin immunoprecipitation

sequencing (ChIP-seq) is the best-known method to determine the chromatin landscape of TEs by

mapping reads from ChIP-seq data to consensus sequences of TEs (60). Since no ChIP-seq data

are available for TCGA samples, we could not compare TE-associated chromatin patterns among our

TE clusters at the tissue level.

Previous studies have charted the landscape of TE expression across human tissues and

indicated that TE expression is much higher in solid tissues than in whole blood (30,

31). This is in line with our finding, based on a high-depth breast cancer scRNA-seq dataset,

that cancer cells are the main contributors to TE expression. Thus, TE expression might reflect

cancer-cell intrinsic characteristics (3). In addition, further analysis revealed that the TE score

was generally correlated not only with the proportion of reads mapping to TEs but also with TE

expression at the class level in both the bulk CRC and KIRC datasets and the scRNA-seq dataset (Fig. S6A-D). This

indicates that the TE score might reflect global TE expression. Our TE clusters identified an immune

overdrive phenotype in CRC and KIRC characterised by the highest immune infiltration but the worst

survival. In terms of survival, TE clusters 4 and 2 had shorter OS than clusters 3 and 1.

Cluster 4 had an extremely poor survival rate after relapse, while cluster 3 had a superior survival

rate after relapse. The independent predictive value of cluster 4 was confirmed after adjustment

for clinical factors including MSI status, suggesting the clinical significance of TE clusters as a

prognostic biomarker. TE clusters were also well separated in MSI samples, implying that

MSI tumours might be further classified based on TE score (Fig. 3D). Although it has been

known that higher TMB normally leads to immune activation, no significant difference in TMB

was observed among TE clusters. Importantly, we found two driver genes, TP53 and BRAF,

with enriched hotspot mutations in cluster 4 compared with the other three clusters. Indeed,

studies have reported that p53 can function to restrain TEs and that TP53 mutations may

potentially cause reactivation of TEs (61, 62). Hence, the enrichment of TP53 mutations in

cluster 4 is consistent with these studies. As for the enrichment of BRAF mutations in cluster 4, a

possible reason is the higher fraction of MSI samples in cluster 4. For CRC, although

the fraction of MSI samples differed across TE clusters, no difference was observed

between clusters 4 and 3. These results imply that TE expression, rather than TMB or MSI, is the likely mechanism behind

immune overdrive. Importantly, an immunosuppressive phenotype was

observed in clusters 4 and 2, marked by high expression scores for TGF-β response, ECM genes,

VEGFA and EMT. Indeed, recent studies have demonstrated that an increased TGF-β response

signal is the primary mechanism of immune evasion and could lead to worse survival (63).

Moreover, a pan-cancer investigation demonstrated that up-regulation of 30 ECM genes was

associated with poor prognosis (64). These results to some extent explain why clusters 4 and 2

had the worst overall survival.

The innate immune system is essential for pathogen recognition and initiation of protective

immune responses through the recognition of pathogen-associated molecular patterns

(PAMPs) by its pattern recognition receptors (PRRs) (65). Nucleic acids, including RNA and DNA,

are critical PAMPs, especially for viruses. We found that, upon TE reactivation, some important

PRRs of innate immune signaling were up-regulated. These included TLRs, RIG-I-like receptors,

NOD-like receptors, MDA5 (IFIH1), APOBECs, etc. TLRs including TLR3, 7 and 8 are endosomal

RNA sensors, while other PRRs including RIG-I, MDA5 and NOD-like receptors are cytosolic

RNA sensors (66-68). Indeed, TE-derived RNAs are very prevalent and can form dsRNA in the

nucleus. Annealing of these hybrids is relaxed by adenosine (A)-to-inosine (I) editing through

ADAR or cytidine (C)-to-uridine (U) deamination editing through APOBEC3s (3, 69). However,

the unedited hybrids are prone to bind RNA sensors and further stimulate a downstream

immune response by viral mimicry, represented by an increased IFN response with higher

expression of IFN-stimulated genes (ISGs) and IFN regulatory factors such as IRF3 and IRF7.

To date, it remains unknown whether individual classes of TEs are prone to activate different

PRRs and whether this would cause distinct downstream biological responses (70). In addition

to triggering innate immune activation, other evidence also supports that TE

reactivation can stimulate immune responses through other mechanisms. For instance, some

TEs, such as LTRs, can function as promoters or enhancers of ISGs (71). Moreover, TEs, such as

hERVs, have also been shown to provide a source of antigens in KIRC (13).

It has been suggested that CRC with MSI could benefit from immune checkpoint blockade

(ICB) therapy. Moreover, epigenetic therapy has been shown to increase tumor

immunogenicity and modulate the response to immunotherapy (8, 72). Thus, there are

several potential treatment strategies for patients in different TE clusters. MSI tumors in

cluster 4 are prone to relapse; therefore, patients in this cluster might benefit from combined

ICB and . Currently, some clinical trials, such as ATOMIC, are ongoing with the

aim of investigating the efficacy of ICB for MSI-H CRC. Our findings indicate that patients with the

highest TE score might have a higher risk of recurrence and might benefit from ICB. Furthermore, we

found that activation of the TGF-β, ALK and PI3K pathways was enriched in cluster 4.

Therefore, these specific pathways might be used as therapeutic targets in combination with

ICB. Finally, epigenetic therapy combined with ICB might be suitable for patients in clusters 1

and 2, which have relatively lower TE expression. More studies and clinical trials will be needed to

confirm these strategies.

Our study has some limitations. This is a retrospective study linking TEs with

immune overdrive using the TCGA CRC cohort without independent validation. To our knowledge,

there are currently no other suitable publicly available CRC RNA-seq datasets with a large sample

size that can be used to interrogate TE expression. However, since immune overdrive in CRC

has been previously reported and we have identified a similar phenotype in KIRC, this suggests

that immune overdrive and its association with TE reactivation are not a cohort-specific

phenomenon. Future large cohorts of CRC or KIRC with comprehensive RNA sequencing and

survival information are needed to validate our findings. Although we did not observe an immune

overdrive phenotype in other cancer types, this does not mean that a similar phenotype does not exist

in those cancer types. TEs other than our nine TEs may specifically contribute to immune overdrive in

other cancer types. Thus, more studies are needed to identify other

potential associations between immune infiltration and TE expression in other cancers.

In conclusion, our results suggest that the immune overdrive phenotype is characterized not only

by high immune infiltration and poor survival, but also by overexpression of TEs.

Importantly, our findings suggest that TE reactivation is the potential cause of immune

overdrive in CRC and KIRC. Moreover, patients with relatively lower TE expression might be

suitable for strategies combining ICB with epigenetic therapy, while patients in cluster 4 with the

highest TE expression may need more comprehensive treatment and clinical monitoring.

Materials and Methods

TE expression quantification using REdiscoverTE pipeline

We used the REdiscoverTE pipeline to quantify TE expression based on RNA sequencing

data as described by Kong et al. (11). Briefly, REdiscoverTE uses Salmon (version 0.8.2) to

perform quantification adjusted for GC content bias and sequence-specific bias. The reference

transcriptome includes RNA transcript sequences from GENCODE release 26 basic (73),

RepeatMasker elements (74) and GENCODE RE-containing introns. TE and gene transcripts

were quantified separately and the output read counts were used for further normalization.

TEs were aggregated into TE subfamily, family and class as defined by the human RepeatMasker annotation

for hg38. TEs were also classified based on their individual genomic locations with respect to

genes, including exon, intron and intergenic.

TCGA RNA sequencing fasta files were obtained from the NCI Genomic Data Commons

(https://www.cancer.gov/tcga). Reads were first trimmed using the bbduk.sh pipeline (75) and

then quantified by REdiscoverTE. For normalization, the read counts were aggregated to the gene

level for gene transcripts and to the subfamily level for TEs. The two matrices of gene

and TE expression were combined into one as input for edgeR (76). Normalization was

conducted using the “RLE” algorithm and log2CPM was obtained with a prior count of 5. A

total of 1,204 TEs were then used for further analysis. Additionally, the proportion of reads mapping

to TEs was calculated as the total counts mapped to TEs divided by the total counts mapped

to TEs and genes.
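
As an illustration of the two quantities described above, the following minimal Python sketch computes the proportion of reads mapping to TEs and a simplified log2CPM with a prior count. It is not the edgeR implementation: RLE normalization factors are omitted and the prior count is added directly to each raw count, which is an assumption made for brevity. The function names and the toy counts are hypothetical.

```python
import math


def te_read_proportion(te_counts, gene_counts):
    """Total TE counts divided by total TE plus gene counts for one sample."""
    te_total = sum(te_counts.values())
    total = te_total + sum(gene_counts.values())
    if total == 0:
        raise ValueError("sample has no mapped reads")
    return te_total / total


def simple_log2_cpm(counts, prior_count=5.0):
    """Simplified log2 counts-per-million.

    Assumption: no RLE factors are applied and the prior count is added to
    each raw count; edgeR scales the prior by library size instead.
    """
    library_size = sum(counts.values())
    if library_size == 0:
        raise ValueError("sample has no mapped reads")
    return {
        feature: math.log2((count + prior_count) * 1e6 / library_size)
        for feature, count in counts.items()
    }


# Hypothetical toy counts for a single sample (not data from the study).
te_counts = {"L1HS": 300, "AluYa5": 100}
gene_counts = {"TP53": 1200, "BRAF": 400}
proportion = te_read_proportion(te_counts, gene_counts)
log2_cpm = simple_log2_cpm({**te_counts, **gene_counts})
```

In practice the gene and TE matrices would be normalized together, as described above, so that both share one library size.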

Kong et al. have provided normalized TE expression at the pan-cancer level. However,

not all samples were included for some cancer types. Since our analysis started from CRC,

we applied the REdiscoverTE pipeline only to the CRC cohort. For other cancer types, processed

data from Kong et al. were used.

Search for TEs associated with survival

To select TEs potentially associated with survival, we included four survival endpoints

in CRC: OS, DSS, DFI and PFI. We performed univariable Cox

regression analysis for individual TEs for each of the four endpoints. The median expression of

each TE was used as the cutoff to separate samples into high and low expression groups.

A TE was considered significant if the log-rank p-value was less than 0.05. To estimate the

FDR of the candidate TEs, we shuffled samples and repeated the survival analysis 100 times.

TEs that were significant in at least one endpoint were used for further analysis.
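
The screening procedure above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the authors' code: it uses a two-group log-rank test after a median split (rather than fitting a Cox model), and it estimates the FDR as the mean number of TEs called significant across shuffles divided by the number called significant in the real data. The source does not specify the exact estimator, so this choice is an assumption, and all function names are hypothetical.

```python
import math
import random
import statistics


def median_split(expression):
    """True for samples above the median expression (high group)."""
    cutoff = statistics.median(expression)
    return [value > cutoff for value in expression]


def logrank_p(times, events, high):
    """Two-group log-rank test; p-value from a chi-square with 1 df."""
    observed = expected = variance = 0.0
    event_times = sorted({t for t, e in zip(times, events) if e})
    for t in event_times:
        at_risk = [i for i, u in enumerate(times) if u >= t]
        n = len(at_risk)
        n_high = sum(1 for i in at_risk if high[i])
        died = [i for i in at_risk if events[i] and times[i] == t]
        d = len(died)
        observed += sum(1 for i in died if high[i])
        expected += d * n_high / n
        if n > 1:
            variance += d * (n_high / n) * (1 - n_high / n) * (n - d) / (n - 1)
    if variance == 0:
        return 1.0
    chi2 = (observed - expected) ** 2 / variance
    # Survival function of a chi-square variable with one degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2))


def count_significant(te_expression, times, events, alpha=0.05):
    """Number of TEs whose median split gives a log-rank p-value below alpha."""
    return sum(
        logrank_p(times, events, median_split(values)) < alpha
        for values in te_expression.values()
    )


def permutation_fdr(te_expression, times, events, n_perm=100, alpha=0.05, seed=0):
    """Assumed estimator: mean null hits per shuffle / hits in the real data."""
    rng = random.Random(seed)
    hits = count_significant(te_expression, times, events, alpha)
    if hits == 0:
        return 0.0  # no discoveries, so nothing to call false
    order = list(range(len(times)))
    null_hits = 0
    for _ in range(n_perm):
        rng.shuffle(order)  # break the link between expression and outcome
        shuffled_times = [times[i] for i in order]
        shuffled_events = [events[i] for i in order]
        null_hits += count_significant(
            te_expression, shuffled_times, shuffled_events, alpha
        )
    return min(1.0, (null_hits / n_perm) / hits)
```

Shuffling times and events together keeps each sample's follow-up intact while decoupling it from TE expression, which is what the permutation null requires.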

Search for TEs associated with immune activation

To estimate the correlation of TEs with immune activity, we included a total of 29 immune

activation indices, consisting of (i) ImmuneScore, which was measured by the

ESTIMATE algorithm (19); (ii) 17 immunologically relevant signatures reflecting immune

pathways derived from ImmPort (18); (iii) mRNA expression of 11 markers representing

immune activation or checkpoint pathway up-regulation, including CD8A, CD86, CD80, CTLA4,

PDCD1LG2, CD274, PDCD1, LAG3, TNFRSF14, BTLA and HAVCR2. We performed GSVA to

obtain the scores of the 17 gene signatures (see method below). We calculated Spearman’s correlation

coefficients between individual TEs and immune sets. To estimate the FDR of the candidate

TEs, we shuffled samples and performed Spearman’s correlation analysis 100 times. TEs with

correlation coefficients ≥ 0.4 and p-value < 0.0001 were regarded as immunogenic TEs. TEs

significantly correlated with at least one immune set were used for further analysis.
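
The correlation screen can be illustrated with a short Python sketch. It computes Spearman's coefficient as the Pearson correlation of average ranks and keeps TEs whose coefficient reaches 0.4 for at least one immune index. The p-value filter and the shuffling-based FDR estimate are omitted here for brevity, and the function names are hypothetical.

```python
import math


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho as the Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant vector carries no rank information
    return cov / math.sqrt(var_x * var_y)


def immunogenic_tes(te_expression, immune_indices, min_rho=0.4):
    """TEs correlated (rho >= min_rho) with at least one immune index."""
    return sorted(
        te
        for te, values in te_expression.items()
        if any(spearman_rho(values, index) >= min_rho
               for index in immune_indices.values())
    )
```

Both inputs are dictionaries mapping a name to a list of per-sample values in the same sample order.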

Generation of TE clusters

Using the nine survival- and immune-associated TEs, we defined subgroups with distinct

OS differences based on the kaps algorithm (20). Briefly, this algorithm estimates multi-way split

points simultaneously and finds the optimal set of cut-off points for a prognostic variable (e.g.

OS). A multi-way partition is identified based on subgroup pairwise tests and the best number

of subgroups is defined by a permutation test. In the output, Xk and X1 indicate the

overall and worst-pair test statistics, respectively. The “adj.Pr (|X1|)” represents the Bonferroni-corrected

permuted p-value, which can be used to select the optimal K. In this study, we used OS as the

prognostic variable and found that the worst pairs of comparisons were significant at

significance level α = 0.05 when k = 2 (adjusted P = 0.001) and 4 (adjusted P = 0.04). To

comprehensively compare the survival differences among samples, we chose k = 4 and

identified four subgroups for downstream analysis.

Estimation of gene signature expression score

We included 70 gene signatures from previous publications associated with cell types in

tumour microenvironment (n = 28), immune infiltration (n = 14), immune evasion (n = 4) and

IPRES (n = 24). See Table. S3 for details of each signature. The R package GSVA was applied

to calculate most of these gene signature expression scores. GSVA is a non-parametric and

unsupervised method that can be used to evaluate gene set enrichment based on gene

expression profiles derived from microarrays or RNA-seq data (77). It can evaluate the given

pathway activity variation by transforming the gene by sample matrix into a gene set by

sample matrix. Therefore, it can assess pathway enrichment for individual cases. Importantly,

GSVA also provides a method called “ssgsea”, which can compute a gene set enrichment score

per sample as the normalized difference in empirical cumulative distribution functions of

gene expression ranks inside and outside a given gene set. Single sample GSEA (ssGSEA) was

first described by Barbie et al. (78). To validate the estimation of cell fractions from GSVA,

we additionally used the Microenvironment Cell Population (MCP)-counter algorithm to estimate

the fractions of 10 cell types, including endothelial cells, fibroblasts and eight immune

cell populations. This method can robustly quantify the abundance of these cell types based

on transcriptomic data for each sample (79). As for the IPRES signature, which consists of 24

gene sets, we first calculated the expression score of each gene set and averaged them as

the IPRES score. For the GEP, we followed the previous approach (24) based on 18

inflammatory genes including CCL5, CD27, CD274, CD276, CD8A, CMKLR1, CXCL9, CXCR6, HLA-

DQA1, HLA-DRB1, HLA-E, IDO1, LAG3, NKG7, PDCD1LG2, PSMB10, STAT1 and TIGIT. The GEP

score was estimated by a weighted sum of normalized expression values of these 18 genes

adjusted by the expression of 11 housekeeping genes.
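
A minimal sketch of the two composite scores follows, in Python. The IPRES score is the mean of the per-gene-set scores, as described above. For the GEP, the sketch assumes that the adjustment means subtracting the mean log-scale expression of the housekeeping genes before taking the weighted sum. The weights, the housekeeping gene names and the expression values shown are placeholders, not the published ones.

```python
from statistics import mean


def ipres_score(gene_set_scores):
    """Average of the per-gene-set enrichment scores for one sample."""
    return mean(gene_set_scores)


def gep_score(expression, weights, housekeeping_genes):
    """Weighted sum of housekeeping-adjusted expression values.

    Assumption: adjustment is subtraction of the mean log-scale
    housekeeping expression from each signature gene.
    """
    baseline = mean(expression[gene] for gene in housekeeping_genes)
    return sum(
        weight * (expression[gene] - baseline)
        for gene, weight in weights.items()
    )


# Placeholder values: HK_A and HK_B stand in for housekeeping genes, and
# the weights are illustrative only.
expression = {"CD8A": 5.0, "STAT1": 7.0, "HK_A": 3.0, "HK_B": 5.0}
weights = {"CD8A": 0.5, "STAT1": 0.25}
score = gep_score(expression, weights, ["HK_A", "HK_B"])
```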

Estimation of pathway regulation

To compare pathway regulation, we included pathways from two resources:

hallmark gene sets (n = 50) (27) and gene programs and canonical targetable pathways (n = 39) (28).

We calculated the scores of the 50 hallmark pathways using GSVA. As for the 39 targetable pathways,

we directly downloaded processed gene expression signature score data from a previous TCGA

pan-cancer analysis (28). The Kruskal-Wallis test was used to test for differences in these

pathways amongst TE clusters.
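
For reference, the Kruskal-Wallis comparison of one pathway score across clusters can be sketched in Python as below. This is an illustrative stand-alone implementation with a tie correction and a closed-form chi-square tail for integer degrees of freedom; the analysis itself was run in R, and the function names here are hypothetical.

```python
import math


def chi2_sf(x, df):
    """Upper tail of the chi-square distribution for integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2
    if df % 2 == 0:
        term = total = 1.0
        for r in range(1, df // 2):
            term *= half / r
            total += term
        return math.exp(-half) * total
    chi = math.sqrt(x)
    term, total = chi, 0.0
    for r in range(1, (df - 1) // 2 + 1):
        if r > 1:
            term *= x / (2 * r - 1)
        total += term
    density_part = 2 * math.exp(-half) / math.sqrt(2 * math.pi) * total
    return math.erfc(chi / math.sqrt(2)) + density_part


def kruskal_wallis(groups):
    """Return (H, p) for a list of groups, each a list of scores."""
    pooled = sorted((value, g) for g, grp in enumerate(groups) for value in grp)
    n = len(pooled)
    rank_sums = [0.0] * len(groups)
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j + 1 < n and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # tied values share their mean rank
        size = j - i + 1
        tie_term += size ** 3 - size
        for k in range(i, j + 1):
            rank_sums[pooled[k][1]] += mean_rank
        i = j + 1
    h = 12 / (n * (n + 1)) * sum(
        r * r / len(grp) for r, grp in zip(rank_sums, groups)
    ) - 3 * (n + 1)
    correction = 1 - tie_term / (n ** 3 - n)
    if correction == 0:
        return 0.0, 1.0  # all values identical
    h /= correction
    return h, chi2_sf(h, len(groups) - 1)
```

With four TE clusters the statistic is referred to a chi-square distribution with three degrees of freedom.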

Somatic mutation and copy number variation analysis

To explore the genetic changes, we performed somatic mutation and copy number

variation (CNV) analysis. Specifically, we focused on hotspot mutations in a total of 95 CRC-specific

driver genes (25, 26). The mutation profiles for these 95 genes were downloaded from

cBioPortal (80). Only 11 driver genes that displayed at least 5 hotspot mutations in either

cluster 4 or the other clusters combined were retained for further analysis. Fisher’s exact test

was performed to compare the differences among groups. In terms of CNV, Copy Number

GISTIC2 level 4 data were downloaded from Broad GDAC Firehose

(https://gdac.broadinstitute.org/). The segment file was used as input for GISTIC 2.0 from

GenePattern (81). The derived GISTIC score was used for visualization.
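
The hotspot mutation comparison described above can be sketched in Python as follows. The sketch assumes a 2 x 2 table per gene (mutated versus not mutated, cluster 4 versus the other clusters combined) and a two-sided Fisher's exact test that sums the probabilities of all tables no more likely than the observed one. Function names and any counts passed in are hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total_tables = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / total_tables

    p_observed = prob(a)
    low, high = max(0, col1 - row2), min(row1, col1)
    p_value = sum(
        prob(x)
        for x in range(low, high + 1)
        if prob(x) <= p_observed * (1 + 1e-7)  # tolerance for float ties
    )
    return min(1.0, p_value)


def screen_driver_genes(hotspot_counts, n_cluster4, n_other, min_mutations=5):
    """Test genes with enough hotspot mutations in either group.

    hotspot_counts maps gene -> (mutated in cluster 4, mutated elsewhere).
    """
    results = {}
    for gene, (in_cluster4, in_other) in hotspot_counts.items():
        if max(in_cluster4, in_other) < min_mutations:
            continue  # too few hotspot mutations in both groups
        results[gene] = fisher_exact_two_sided(
            in_cluster4, n_cluster4 - in_cluster4,
            in_other, n_other - in_other,
        )
    return results
```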

Cell line datasets treated with DNA methylation inhibitor

We explored the expression profiles of three cell line datasets treated with 5-aza.

(i) GSE5816, consisting of 11 lung cancer, 1 breast cancer and 1 CRC cell line (82). Cells

were treated with DMSO as the control group, while the low and high groups were treated with

0.1 and 1 µM 5-aza, respectively. Cells were collected after 6 days. (ii) GSE80137, consisting

of three GBM cell lines. Cells were treated with 1 µM 5-aza and collected after 3 days (83). (iii)

GSE22250, consisting of 7 breast cancer cell lines, treated with 1 µM 5-aza and collected

after 4 days (84). The processed transcript data for these datasets were accessed directly

under their individual GEO accession identifiers and used for analysis. Batch effects were

removed based on cell line type in GSE80137, GSE22250 and GSE5816, respectively,

using the ComBat method in the sva R package.

Weighted correlation network analysis (WGCNA)

A WGCNA network was generated for CRC data using the top 5,000 genes with the highest

median absolute deviation (32). This approach first calculates Pearson’s correlation

coefficients for all gene pairs to obtain the correlation matrix. A soft-thresholding power

β is determined based on a scale-free topology criterion (a β of 10 was chosen in

this study). The topological overlap metric (TOM) and dissTOM = 1-TOM are obtained from

the resulting adjacency matrix. Then hierarchical clustering is performed using the

blockwiseModules and dynamic tree cutting functions to obtain a cluster dendrogram

representing the gene co-expression modules, in which the genes are densely interconnected.

Individual gene modules are marked using different colors, while the grey module contains

genes not assigned to any module. To link modules to TE expression,

a module-trait association was performed between each gene module and the phenotype (TE

score). The gene significance (GS) indicates the absolute value of the association between the

expression profile and phenotype while the module membership (MM) represents the

correlation between the expression profile and each module. Finally, to explore the biological

function behind TE expression, module genes from brown and greenyellow modules were

used for GO and KEGG pathway enrichment analyses as these two modules correlated well

with TE score.
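
To make the network construction steps concrete, the following Python sketch computes a soft-thresholded adjacency matrix and the TOM-based dissimilarity for a small expression matrix. It assumes an unsigned network, in which adjacency is the absolute Pearson correlation raised to the power β; whether the study used a signed or unsigned network is not stated. The WGCNA R package performs these steps internally, so this is only an illustration with hypothetical function names.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0
    return cov / (sd_x * sd_y)


def adjacency(expression, beta=10):
    """Unsigned soft-threshold adjacency: |cor|^beta, with a zero diagonal."""
    n = len(expression)
    return [
        [
            0.0 if i == j
            else abs(pearson(expression[i], expression[j])) ** beta
            for j in range(n)
        ]
        for i in range(n)
    ]


def tom_dissimilarity(adj):
    """dissTOM = 1 - TOM, where TOM combines direct and shared connections."""
    n = len(adj)
    connectivity = [sum(row) for row in adj]
    diss = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            shared = sum(
                adj[i][u] * adj[u][j] for u in range(n) if u not in (i, j)
            )
            tom = (shared + adj[i][j]) / (
                min(connectivity[i], connectivity[j]) + 1 - adj[i][j]
            )
            diss[i][j] = 1 - tom
    return diss
```

Here `expression` is a list of gene profiles, one list of per-sample values per gene; the resulting dissimilarity matrix is what hierarchical clustering would take as input.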

Single cell RNA sequencing data processing

Raw data were downloaded under the SRA accession number “SRP066982”, which consists

of 563 fastq files of single cells derived from breast cancer. Only the 515 cells with high

quality and cell type annotation, as indicated by the original paper, were used for further

analysis (29). The REdiscoverTE pipeline was applied to these data to quantify TE expression

as with the CRC cohort.

Statistical analysis

All statistical analyses were carried out using the program R. Enumeration data were

examined by Chi-square test or Fisher’s exact test. The comparisons among multiple groups

were performed by nonparametric Kruskal-Wallis test. Survival analysis was evaluated by the

Kaplan–Meier survival curve and the Log-rank test. Differences were considered significant

with a value of P < 0.05 unless otherwise stated.

Supplementary Materials

Fig. S1. Screening of candidate TEs.

Fig. S2. Clinical and molecular comparison among TE clusters.

Fig. S3. Comparison among TE clusters in terms of immune overdrive.

Fig. S4. Construction of co-expression module using WGCNA.

Fig. S5. Pan cancer analysis of TE score.

Fig. S6. Correlation between TE score and global TE expression.

Table. S1. Results of screening TEs associated with survival.

Table. S2. Summarized clinical data of CRC.

Table. S3. Gene signatures used in this study.

Table. S4. Overlapped significant pathways among 5-aza treated datasets and TE cluster.

Table. S5. Two module gene list derived from WGCNA.

Table. S6. TE score of pan cancer.

Table. S7. Univariable Cox regression analysis of TE score across 24 cancer types.

References and Notes:

1. Bourque G, Burns KH, Gehring M, Gorbunova V, Seluanov A, Hammell M, et al. Ten things you should know about transposable elements. Genome biology. 2018;19(1):199.

2. Gorbunova V, Boeke JD, Helfand SL, and Sedivy JM. Human Genomics. Sleeping dogs of the genome. Science (New York, NY). 2014;346(6214):1187-8.

3. Burns KH. Our Conflict with Transposable Elements and Its Implications for Human Disease. Annual review of pathology. 2020;15:51-70.

4. Deniz Ö, Frost JM, and Branco MR. Regulation of transposable elements by DNA modifications. Nature reviews Genetics. 2019;20(7):417-31.

5. Anwar SL, Wulaningsih W, and Lehmann U. Transposable Elements in Human Cancer: Causes and Consequences of Deregulation. International journal of molecular sciences. 2017;18(5).

6. He J, Fu X, Zhang M, He F, Li W, Abdul MM, et al. Transposable elements are regulated by context-specific patterns of chromatin marks in mouse embryonic stem cells. Nature communications. 2019;10(1):34.

7. Slotkin RK, and Martienssen R. Transposable elements and the epigenetic regulation of the genome. Nature reviews Genetics. 2007;8(4):272-85.

8. Roulois D, Loo Yau H, Singhania R, Wang Y, Danesh A, Shen SY, et al. DNA-Demethylating Agents Target Colorectal Cancer Cells by Inducing Viral Mimicry by Endogenous Transcripts. Cell. 2015;162(5):961-73.

9. Hegde PS, and Chen DS. Top 10 Challenges in . Immunity. 2020;52(1):17-35.

10. Chiappinelli KB, Strissel PL, Desrichard A, Li H, Henke C, Akman B, et al. Inhibiting DNA Methylation Causes an Interferon Response in Cancer via dsRNA Including Endogenous . Cell. 2015;162(5):974-86.

11. Kong Y, Rose CM, Cass AA, Williams AG, Darwish M, Lianoglou S, et al. Transposable element expression in tumors is associated with immune infiltration and increased antigenicity. Nature communications. 2019;10(1):5228.

12. Panda A, de Cubas AA, Stein M, Riedlinger G, Kra J, Mayer T, et al. expression is associated with response to immune checkpoint blockade in clear cell renal cell carcinoma. JCI insight. 2018;3(16).

13. Smith CC, Beckermann KE, Bortone DS, De Cubas AA, Bixby LM, Lee SJ, et al. Endogenous retroviral signatures predict immunotherapy response in clear cell renal cell carcinoma. The Journal of clinical investigation. 2018;128(11):4804-20.

14. Pagès F, Mlecnik B, Marliot F, Bindea G, Ou FS, Bifulco C, et al. International validation of the consensus Immunoscore for the classification of colon cancer: a prognostic and accuracy study. Lancet (London, England). 2018;391(10135):2128-39.

15. Galon J, Mlecnik B, Bindea G, Angell HK, Berger A, Lagorce C, et al. Towards the introduction of the 'Immunoscore' in the classification of malignant tumours. The Journal of pathology. 2014;232(2):199-209.

16. Fakih M, Ouyang C, Wang C, Tu TY, Gozo MC, Cho M, et al. Immune overdrive signature in colorectal tumor subset predicts poor clinical outcome. The Journal of clinical investigation. 2019;129(10):4464-76.

17. Liu J, Lichtenberg T, Hoadley KA, Poisson LM, Lazar AJ, Cherniack AD, et al. An Integrated TCGA Pan-Cancer Clinical Data Resource to Drive High-Quality Survival Outcome Analytics. Cell. 2018;173(2):400-16.e11.

18. Bhattacharya S, Dunn P, Thomas CG, Smith B, Schaefer H, Chen J, et al. ImmPort, toward repurposing of open access immunological assay data for translational and clinical research. Scientific data. 2018;5:180015.

19. Yoshihara K, Shahmoradgoli M, Martínez E, Vegesna R, Kim H, Torres-Garcia W, et al. Inferring tumour purity and stromal and immune cell admixture from expression data. Nature communications. 2013;4:2612.

20. Eo S-H, Kang HJ, Hong S-M, and Cho H. K-adaptive partitioning for survival data, with an application to . arXiv preprint arXiv:13064615. 2013.

21. Guinney J, Dienstmann R, Wang X, de Reyniès A, Schlicker A, Soneson C, et al. The consensus molecular subtypes of colorectal cancer. Nature medicine. 2015;21(11):1350-6.

22. Thorsson V, Gibbs DL, Brown SD, Wolf D, Bortone DS, Yang T-HO, et al. The immune landscape of cancer. Immunity. 2018;48(4):812-30.e14.

23. Garcia-Garijo A, Fajardo CA, and Gros A. Determinants for Neoantigen Identification. Frontiers in immunology. 2019;10:1392.

24. Cristescu R, Mogg R, Ayers M, Albright A, Murphy E, Yearley J, et al. Pan-tumor genomic biomarkers for PD-1 checkpoint blockade-based immunotherapy. Science (New York, NY). 2018;362(6411).

25. Gonzalez-Perez A, Perez-Llamas C, Deu-Pons J, Tamborero D, Schroeder MP, Jene-Sanz A, et al. IntOGen-mutations identifies cancer drivers across tumor types. Nature methods. 2013;10(11):1081-2.

26. Chang MT, Asthana S, Gao SP, Lee BH, Chapman JS, Kandoth C, et al. Identifying recurrent mutations in cancer reveals widespread lineage diversity and mutational specificity. Nature biotechnology. 2016;34(2):155-63.

27. Liberzon A, Birger C, Thorvaldsdóttir H, Ghandi M, Mesirov JP, and Tamayo P. The Molecular Signatures Database (MSigDB) hallmark gene set collection. Cell systems. 2015;1(6):417-25.
28. Hoadley KA, Yau C, Wolf DM, Cherniack AD, Tamborero D, Ng S, et al. Multiplatform analysis of 12 cancer types reveals molecular classification within and across tissues of origin. Cell. 2014;158(4):929-44.
29. Chung W, Eum HH, Lee HO, Lee KM, Lee HB, Kim KT, et al. Single-cell RNA-seq enables comprehensive tumour and immune cell profiling in primary breast cancer. Nature communications. 2017;8:15081.
30. Bogu GK, Reverter F, Marti-Renom MA, Snyder MP, and Guigó R. Atlas of transcriptionally active transposable elements in human adult tissues. bioRxiv. 2019:714212.
31. Larouche J-D, Trofimov A, Hesnard L, Ehx G, Zhao Q, Vincent K, et al. Widespread and tissue-specific expression of endogenous retroelements in human somatic tissues. Genome Medicine. 2020;12:1-16.
32. Langfelder P, and Horvath S. WGCNA: an R package for weighted correlation network analysis. BMC bioinformatics. 2008;9(1):559.
33. Pugacheva EM, Teplyakov E, Wu Q, Li J, Chen C, Meng C, et al. The cancer-associated CTCFL/BORIS protein targets multiple classes of genomic repeats, with a distinct binding and functional preference for humanoid-specific SVA transposable elements. Epigenetics & chromatin. 2016;9(1):35.
34. Roszik J, and Subbiah V. Mining Public Databases for Precision Oncology. Trends in cancer. 2018;4(7):463-5.
35. Yu Y. Molecular classification and precision therapy of cancer: immune checkpoint inhibitors. Frontiers of medicine. 2018;12(2):229-35.
36. Malone ER, Oliva M, Sabatini PJB, Stockley TL, and Siu LL. Molecular profiling for precision cancer therapies. Genome Med. 2020;12(1):8.
37. Boland CR, and Goel A. Microsatellite instability in colorectal cancer. Gastroenterology. 2010;138(6):2073-87.e3.
38. Gryfe R, Kim H, Hsieh ET, Aronson MD, Holowaty EJ, Bull SB, et al. Tumor microsatellite instability and clinical outcome in young patients with colorectal cancer. The New England journal of medicine. 2000;342(2):69-77.
39. Popat S, Hubner R, and Houlston RS. Systematic review of microsatellite instability and colorectal cancer prognosis. Journal of clinical oncology : official journal of the American Society of Clinical Oncology. 2005;23(3):609-18.
40. Maby P, Galon J, and Latouche JB. Frameshift mutations, neoantigens and tumor-specific CD8(+) T cells in microsatellite unstable colorectal cancers. Oncoimmunology. 2016;5(5):e1115943.
41. Ribic CM, Sargent DJ, Moore MJ, Thibodeau SN, French AJ, Goldberg RM, et al. Tumor microsatellite-instability status as a predictor of benefit from fluorouracil-based adjuvant chemotherapy for colon cancer. The New England journal of medicine. 2003;349(3):247-57.
42. Sargent DJ, Marsoni S, Monges G, Thibodeau SN, Labianca R, Hamilton SR, et al. Defective mismatch repair as a predictive marker for lack of efficacy of fluorouracil-based adjuvant therapy in colon cancer. Journal of clinical oncology : official journal of the American Society of Clinical Oncology. 2010;28(20):3219-26.
43. Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, et al. Initial sequencing and analysis of the human genome. Nature. 2001;409(6822):860-921.

44. Lee E, Iskow R, Yang L, Gokcumen O, Haseley P, Luquette LJ, 3rd, et al. Landscape of somatic retrotransposition in human cancers. Science (New York, NY). 2012;337(6097):967-71.
45. Rebollo R, Romanish MT, and Mager DL. Transposable elements: an abundant and natural source of regulatory sequences for host genes. Annual review of genetics. 2012;46:21-42.
46. Gerdes P, Richardson SR, Mager DL, and Faulkner GJ. Transposable elements in the mammalian embryo: pioneers surviving through stealth and service. Genome biology. 2016;17:100.
47. Labrador M, and Corces VG. Transposable element-host interactions: regulation of insertion and excision. Annual review of genetics. 1997;31:381-404.
48. Criscione SW, Zhang Y, Thompson W, Sedivy JM, and Neretti N. Transcriptional landscape of repetitive elements in normal and cancer human cells. BMC genomics. 2014;15:583.
49. Wolff EM, Byun HM, Han HF, Sharma S, Nichols PW, Siegmund KD, et al. Hypomethylation of a LINE-1 promoter activates an alternate transcript of the MET oncogene in bladders with cancer. PLoS genetics. 2010;6(4):e1000917.
50. Daskalos A, Nikolaidis G, Xinarianos G, Savvari P, Cassidy A, Zakopoulou R, et al. Hypomethylation of retrotransposable elements correlates with genomic instability in non-small cell lung cancer. International journal of cancer. 2009;124(1):81-7.
51. Xie M, Hong C, Zhang B, Lowdon RF, Xing X, Li D, et al. DNA hypomethylation within specific transposable element families associates with tissue-specific enhancer landscape. Nature genetics. 2013;45(7):836-41.
52. Esteller M. Epigenetics in cancer. The New England journal of medicine. 2008;358(11):1148-59.
53. Swets M, Zaalberg A, Boot A, van Wezel T, Frouws MA, Bastiaannet E, et al. Tumor LINE-1 Methylation Level in Association with Survival of Patients with Stage II Colon Cancer. International journal of molecular sciences. 2016;18(1).
54. Harada K, Baba Y, Ishimoto T, Chikamoto A, Kosumi K, Hayashi H, et al. LINE-1 methylation level and patient prognosis in a database of 208 hepatocellular carcinomas. Annals of surgical oncology. 2015;22(4):1280-7.
55. Terry MB, Delgado-Cruzata L, Vin-Raviv N, Wu HC, and Santella RM. DNA methylation in white blood cells: association with risk factors in epidemiologic studies. Epigenetics. 2011;6(7):828-37.
56. Issa JP. CpG island methylator phenotype in cancer. Nature reviews Cancer. 2004;4(12):988-93.
57. Lerat E, Casacuberta J, Chaparro C, and Vieira C. On the Importance to Acknowledge Transposable Elements in Epigenomic Analyses. Genes. 2019;10(4).
58. Lea AJ, Vockley CM, Johnston RA, Del Carpio CA, Barreiro LB, Reddy TE, et al. Genome-wide quantification of the effects of DNA methylation on human gene regulation. eLife. 2018;7.
59. Ewing AD, Smits N, Sanchez-Luque FJ, Faivre J, Brennan PM, Cheetham SW, et al. Nanopore sequencing enables comprehensive transposable element epigenomic profiling. bioRxiv. 2020.
60. Day DS, Luquette LJ, Park PJ, and Kharchenko PV. Estimating enrichment of repetitive elements from high-throughput sequence data. Genome biology. 2010;11(6):R69.
61. Tiwari B, Jones AE, and Abrams JM. Transposons, p53 and Genome Security. Trends in genetics : TIG. 2018;34(11):846-55.

62. Wylie A, Jones AE, and Abrams JM. p53 in the game of transposons. BioEssays : news and reviews in molecular, cellular and developmental biology. 2016;38(11):1111-6.
63. Tauriello DVF, Palomo-Ponce S, Stork D, Berenguer-Llergo A, Badia-Ramentol J, Iglesias M, et al. TGFβ drives immune evasion in genetically reconstituted colon cancer metastasis. Nature. 2018;554(7693):538-43.
64. Chakravarthy A, Khan L, Bensler NP, Bose P, and De Carvalho DD. TGF-β-associated extracellular matrix genes link cancer-associated fibroblasts to immune evasion and immunotherapy failure. Nature communications. 2018;9(1):4692.
65. Chen N, Xia P, Li S, Zhang T, Wang TT, and Zhu J. RNA sensors of the innate immune system and their detection of pathogens. IUBMB life. 2017;69(5):297-304.
66. Hayashi F, Means TK, and Luster AD. Toll-like receptors stimulate human neutrophil function. Blood. 2003;102(7):2660-9.
67. Schlee M. Master sensors of pathogenic RNA - RIG-I like receptors. Immunobiology. 2013;218(11):1322-35.
68. Kell AM, and Gale M, Jr. RIG-I in RNA virus recognition. Virology. 2015;479-480:110-21.
69. Schumann GG. APOBEC3 proteins: major players in intracellular defence against LINE-1-mediated retrotransposition. Biochemical Society transactions. 2007;35(Pt 3):637-42.
70. Jones PA, Ohtani H, Chakravarthy A, and De Carvalho DD. Epigenetic therapy in immune-oncology. Nature reviews Cancer. 2019;19(3):151-61.
71. Chuong EB, Elde NC, and Feschotte C. Regulatory evolution of innate immunity through co-option of endogenous retroviruses. Science (New York, NY). 2016;351(6277):1083-7.
72. Brocks D, Schmidt CR, Daskalakis M, Jang HS, Shah NM, Li D, et al. DNMT and HDAC inhibitors induce cryptic transcription start sites encoded in long terminal repeats. Nature genetics. 2017;49(7):1052-60.
73. Harrow J, Frankish A, Gonzalez JM, Tapanari E, Diekhans M, Kokocinski F, et al. GENCODE: the reference human genome annotation for The ENCODE Project. Genome research. 2012;22(9):1760-74.
74. Chen N. Using RepeatMasker to identify repetitive elements in genomic sequences. Current protocols in bioinformatics. 2004;Chapter 4:Unit 4.10.
75. Bushnell B. BBDuk: Adapter/Quality Trimming and Filtering. https://sourceforge.net/projects/bbmap.
76. Robinson MD, McCarthy DJ, and Smyth GK. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics (Oxford, England). 2010;26(1):139-40.
77. Hänzelmann S, Castelo R, and Guinney J. GSVA: gene set variation analysis for microarray and RNA-seq data. BMC bioinformatics. 2013;14(1):7.
78. Barbie DA, Tamayo P, Boehm JS, Kim SY, Moody SE, Dunn IF, et al. Systematic RNA interference reveals that oncogenic KRAS-driven cancers require TBK1. Nature. 2009;462(7269):108-12.
79. Becht E, Giraldo NA, Lacroix L, Buttard B, Elarouci N, Petitprez F, et al. Estimating the population abundance of tissue-infiltrating immune and stromal cell populations using gene expression. Genome biology. 2016;17(1):218.
80. Gao J, Aksoy BA, Dogrusoz U, Dresdner G, Gross B, Sumer SO, et al. Integrative analysis of complex cancer genomics and clinical profiles using the cBioPortal. Science signaling. 2013;6(269):pl1.
81. Reich M, Liefeld T, Gould J, Lerner J, Tamayo P, and Mesirov JP. GenePattern 2.0. Nature genetics. 2006;38(5):500-1.

82. Shames DS, Girard L, Gao B, Sato M, Lewis CM, Shivapurkar N, et al. A genome-wide screen for promoter methylation in lung cancer identifies novel methylation markers for multiple malignancies. PLoS medicine. 2006;3(12).
83. Shraibman B, Kadosh DM, Barnea E, and Admon A. Human leukocyte antigen (HLA) peptides derived from tumor antigens induced by inhibition of DNA methylation for development of drug-facilitated immunotherapy. Molecular & Cellular Proteomics. 2016;15(9):3058-70.
84. Dedeurwaerder S, Desmedt C, Calonne E, Singhal SK, Haibe-Kains B, Defrance M, et al. DNA methylation profiling reveals a predominant immune component in breast cancers. EMBO molecular medicine. 2011;3(12):726-41.

Acknowledgments: Funding: This work is supported by seed funding to JWHW from The University of Hong Kong.

Author contributions: XZ and JWHW conceived this study. HF, KG, and JAB helped to collect the public data. XZ and JWHW conducted statistical analysis. HF, KG, and JAB helped to interpret the results. XZ and JWHW wrote the manuscript.

Data used in this study are derived from the TCGA database (https://cancergenome.nih.gov). Part of the data was analyzed on the Cancer Genomics Cloud (CGC, http://www.cancergenomicscloud.org/). The CGC team helped to build a custom application for the analysis on CGC with pilot funds provided by Seven Bridges Genomics.

Competing interests: The authors have declared that no conflict of interest exists.

Data availability: TCGA CRC RNA sequencing data were directly analysed on the Cancer Genomics Cloud using the custom pipeline [http://www.cancergenomicscloud.org/]. The processed TE expression for pan-cancer was downloaded from [http://research-pub.gene.com/REdiscoverTEpaper/]. CRC Copy Number GISTIC2 level 4 data were downloaded from Broad GDAC Firehose [http://gdac.broadinstitute.org/runs/analyses_2016_01_28/data/COADREAD/20160128/]. The clinical data and gene program pathways were obtained from [https://xenabrowser.net/]. The TCR/BCR index scores and genetic changes were downloaded from the Supplementary table of the pan-cancer immune landscape paper (22). Three cell line datasets were downloaded from [https://www.ncbi.nlm.nih.gov/geo/] under the accession numbers (i) GSE5816, (ii) GSE80137, and (iii) GSE22250.

[Figure 1 graphic omitted. Text recoverable from the panels: (A) normalized expression by repClass (DNA, LINE, LTR, Retroposon, Satellite, SINE). (B) Number of TEs with log-rank p value < 0.05 for DFI, DSS, OS and PFI, stacked by repClass. (C) Hazard ratio by repClass for DFI, DSS, OS and PFI. (D) Potentially immunogenic TEs with their number of immune sets: AluSq4, SVA_F, SVA_C, LTR21B, HERVL74-int, MER57F, HERV1_LTRd, MER92-int, MER65C, L1MEd, L1M6B, MER45A, Tigger12A, Charlie20a. (E) Correlation heatmap (Cor, -1 to 1) of these 14 TEs against the immune score, immune gene sets (antigen processing and presentation, antimicrobials, chemokines, cytokines, interferons, interleukins and their receptors, natural killer cell cytotoxicity, BCR and TCR signaling, TGFb and TNF family members and receptors) and checkpoint genes (LAG3, TNFRSF14, BTLA, CD86, CD80, CTLA4, PDCD1LG2, CD274, PDCD1, CD8A, HAVCR2). (F) Venn diagram: 363 survival-associated only, 9 shared, 5 immune-associated only.]

Figure 1. Identification of TEs associated with survival and immune sets in CRC. (A) Normalized TE expression pattern at TE class level including Retroposon, SINE, LINE, LTR, Satellite and DNA. (B) Stacked plot showing the number of TE subfamilies with a significant log-rank p value (p < 0.05) for each of the four endpoints: DFI, DSS, OS and PFI. TEs were annotated at class level. (C) Distribution of hazard ratios of significant TEs from (B) for each of the four endpoints. TEs were annotated at class level. (D) Number of immune sets significantly correlated with each TE (Cor ≥ 0.4, p < 0.0001). (E) Spearman’s correlation between candidate TE expression (n=14) and 29 immune sets. Heatmap colors indicate the correlation coefficient. (F) Venn diagram showing the 9 TEs shared between candidate prognostic TEs and immunogenic TEs.
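The correlation filter described for Figure 1D can be illustrated with a minimal sketch. It assumes per-sample TE expression and immune set scores are available as plain lists; the function names and the toy data are hypothetical and are not taken from the authors' pipeline. Only the coefficient threshold (Cor ≥ 0.4) is applied here; the p < 0.0001 criterion is omitted because it needs a reference distribution (for example from `scipy.stats.spearmanr`).

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def count_correlated_sets(te_expr, immune_scores, min_rho=0.4):
    """Count immune sets whose rho with the TE meets the threshold."""
    return sum(
        1 for scores in immune_scores.values()
        if spearman_rho(te_expr, scores) >= min_rho
    )


# Toy data (hypothetical): one TE across five samples, three immune sets.
te = [1, 2, 3, 4, 5]
sets = {"set_a": [2, 4, 6, 8, 10], "set_b": [5, 4, 3, 2, 1], "set_c": [1, 3, 2, 5, 4]}
n_sets = count_correlated_sets(te, sets)
```

In this toy input, `set_a` and `set_c` pass the threshold and `set_b` (perfectly anti-correlated) does not.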

[Figure 2 graphic omitted. Text recoverable from the panels:
(A) Z of TE score by sample (x-axis 0 to 600), colored by repClass (DNA, LTR, Retroposon, SINE); expression heatmap of the 9 TEs (HERV1_LTRd, MER57F, MER65C, LTR21B, Tigger12A, MER92-int, SVA_C, SVA_F, AluSq4) annotated by repFamily (Alu, ERV1, SVA, TcMar-Tigger) and TE cluster (cluster1 to cluster4).
(B-E) Percentage stacked bars per TE cluster: MSI status (MSI, MSS; p = 0.0001), CIMP (CIMP.Neg, CIMP.Low, CIMP.High; p = 0.0228), CMS subtype (CMS1 to CMS4; p < 0.0001) and immune subtype (Wound Healing, IFN-gamma Dominant, Inflammatory, Lymphocyte Depleted, Immunologically Quiet, TGF-beta Dominant; p < 0.0001).
(F-I) Kaplan-Meier curves of survival probability against time (days, 0 to 4000), each with pairwise HR (95% CI) and p value:
(F) Overall survival, log-rank p = 0.0035. Cluster 4 vs 3: 2.36 (0.96 - 5.80), p = 0.0534. Cluster 4 vs 2: 1.13 (0.61 - 2.10), p = 0.7008. Cluster 4 vs 1: 2.09 (1.19 - 3.67), p = 0.0087.
(G) Disease-specific survival, log-rank p = 0.011. Cluster 4 vs 3: 3.99 (1.12 - 14.16), p = 0.0207. Cluster 4 vs 2: 1.57 (0.74 - 3.33), p = 0.2335. Cluster 4 vs 1: 2.50 (1.31 - 4.78), p = 0.0041.
(H) Disease-free interval, log-rank p = 0.12. Cluster 4 vs 3: 4.88 (0.56 - 42.23), p = 0.1116. Cluster 4 vs 2: 3.94 (0.93 - 16.67), p = 0.0444. Cluster 4 vs 1: 2.70 (1.00 - 7.66), p = 0.0401.
(I) Progression-free interval, log-rank p = 0.022. Cluster 4 vs 3: 3.05 (1.34 - 6.94), p = 0.0052. Cluster 4 vs 2: 1.78 (0.99 - 3.2), p = 0.0497. Cluster 4 vs 1: 1.88 (1.16 - 3.06), p = 0.0092.]

Figure 2. Generation of TE score-based CRC clusters and comparison of molecular association. (A) Top scatter plot shows the TE score in decreasing order from left to right. Bottom heatmap displays the expression profiles of the 9 TEs across four TE clusters as ordered by the TE score. Each TE was annotated at family and class level, respectively. (B-E) Stacked plots showing the fractions of molecular features across four TE clusters including MSI status (B), CIMP (C), CMS subtypes (D) and immune subtypes (E). (F-I) Prognostic value of four TE clusters with Kaplan-Meier survival analysis for OS (n = 589) (F), DSS (n = 567) (G), DFI (n = 223) (H) and PFI (n = 589) (I). The hazard ratios (HR) and 95% confidence intervals (CIs) for pairwise comparisons in univariable analyses (log-rank test) are displayed in each Kaplan-Meier plot. Numbers below the x-axes represent the number of patients at risk at the selected time points. The tick marks on the Kaplan-Meier curves indicate the censored patients.
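For readers unfamiliar with the quantities plotted in Figure 2F-I, the following is a minimal sketch of the Kaplan-Meier product-limit estimate and of the "number at risk" counts shown below the x-axes. It is an illustration under stated assumptions, not the authors' code: the function names and toy data are hypothetical, events are coded 1 for an observed event and 0 for censoring, and censored patients at a given time are treated as still at risk at that time.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct time with at least one event.

    S(t) is the running product of (1 - d_i / n_i), where d_i is the number
    of events at time t_i and n_i the number of patients still at risk.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_events = 0
        n_leaving = 0
        while i < len(data) and data[i][0] == t:
            n_events += data[i][1]
            n_leaving += 1
            i += 1
        if n_events:
            survival *= 1 - n_events / n_at_risk
            curve.append((t, survival))
        n_at_risk -= n_leaving  # events and censored patients both leave
    return curve


def number_at_risk(times, t):
    """Patients whose follow-up reaches time t (the counts under the x-axis)."""
    return sum(1 for x in times if x >= t)


# Toy data (hypothetical): four patients, the second one censored.
curve = kaplan_meier([1, 2, 3, 4], [1, 0, 1, 1])
at_risk_day3 = number_at_risk([1, 2, 3, 4], 3)
```

The censored patient at time 2 does not lower the curve but does shrink the risk set, which is why the drop at time 3 is larger than the drop at time 1.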

[Figure 3 graphic omitted. Text recoverable from the panels:
(A-C) Forest plots titled "Multicox for OS", "Multicox for DSS" and "Multicox for PFI" (x-axis: hazard ratio). Significance per covariate, given as OS / DSS / PFI: TE.cluster1 (ref=cluster3) ns / ns / ns; TE.cluster2 (ref=cluster3) * / * / ns; TE.cluster4 (ref=cluster3) * / * / *; gender (male vs female) ns / ns / ns; age (high vs low) *** / * / ns; MSI.status (MSS vs MSI) ns / ns / ns; lymphatic (yes vs no) * / * / *; location (proximal vs distal) ns / ns / ns; AJCC stage (III&IV vs I&II) **** / **** / ****.
(D) MSI subgroup, Kaplan-Meier curves by TE cluster against time (days): OS log-rank p = 0.0022, DSS p = 0.017, PFI p = 0.038.
(E) MSS subgroup: OS log-rank p = 0.14, DSS p = 0.15, PFI p = 0.15.
(F) Non.silent.per.Mb against TE.score, R = 0.13. (G) Non.silent.per.Mb by TE cluster, ns.
(H) CD8A against TE score, R = 0.43, P < 0.0001. (I) GEP against TE score, R = 0.51, P < 0.0001.
(J-M) Forest plots titled "Odds ratio on CD8A" (J, L) and "Odds ratio on GEP" (K, M), with rows for TE.cluster2, TE.cluster3, TE.cluster4 or TE.score, MSI.status (MSI) or TMB, stageII, stageIII, stageIV, age (high) and gender (male).]

Figure 3. Prognostic value of TE cluster and immune infiltration prediction. (A-C) Forest plots showing multivariable Cox regression analysis of TE cluster adjusted by clinical features for OS (A), DSS (B) and PFI (C). All variables were treated as categorical variables. Samples with age < 65 were assigned to the low age group and those with age ≥ 65 to the high age group. Solid dots represent the HR of death and open-ended horizontal lines represent the 95% confidence intervals (CIs). All p-values were calculated using Cox proportional hazards analysis (ns: p > 0.05, *: p <= 0.05, **: p <= 0.01, ***: p <= 0.001, ****: p <= 0.0001). (D-E) Prognostic value of four TE clusters with Kaplan-Meier survival analysis in two subgroups separated by MSI status (MSI in D, MSS in E) for three endpoints, respectively. DFI was excluded because of non-comparable sample size among TE clusters. P-values were calculated using the log-rank test. Numbers below the x-axes represent the number of patients at risk at the selected time points. The tick marks on the Kaplan-Meier curves indicate the censored patients. (F) Spearman’s correlation between normalized TE score and non-silent mutation per Mb. (G) Violin plot comparing non-silent mutation per Mb among TE clusters (n.s.: p > 0.05). (H-I) Spearman’s correlation between normalized TE score and CD8A expression (H) and GEP (I), respectively. (J-K) Forest plots showing the odds ratio indicating immune infiltration determined by CD8A expression (J) and GEP (K) using multinomial logistic regression analysis adjusted by MSI status. (L-M) Forest plots showing the odds ratio indicating immune infiltration determined by CD8A expression (L) and GEP (M) using multinomial logistic regression analysis adjusted by TMB. Solid dots represent the adjusted OR and open-ended horizontal lines represent the 95% confidence intervals (CIs). An OR to the right of the dashed line (where OR = 1) indicates higher odds of immune infiltration, while an OR to the left of the dashed line indicates lower odds of immune infiltration.
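The annotation conventions stated in the Figure 3 caption (significance symbols, the age cut-off at 65, and the reading of an odds ratio relative to 1) can be written out as a small sketch. The function names are hypothetical and chosen for illustration only; the thresholds are those given in the caption.

```python
def significance_label(p):
    """Map a p-value to the symbols defined in the Figure 3 caption."""
    if p <= 0.0001:
        return "****"
    if p <= 0.001:
        return "***"
    if p <= 0.01:
        return "**"
    if p <= 0.05:
        return "*"
    return "ns"


def age_group(age):
    """Dichotomize age: below 65 is 'low', 65 and above is 'high'."""
    return "low" if age < 65 else "high"


def odds_direction(odds_ratio):
    """Read an odds ratio relative to the dashed reference line at OR = 1."""
    if odds_ratio > 1:
        return "higher odds of immune infiltration"
    if odds_ratio < 1:
        return "lower odds of immune infiltration"
    return "no difference in odds"


labels = [significance_label(p) for p in (0.0534, 0.0207, 0.0087, 0.0001)]
```

Applied to four example p-values, the mapping gives `ns`, `*`, `**` and `****` in that order.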

[Figure 4 graphic omitted. Text recoverable from the panels: (A) "Cell fraction" heatmap of helper T cell and MDSC signatures (angelova), immune cell types from naive B cells to neutrophils (newman), and endothelial cells and fibroblasts (Becht). (B) "Immune.infiltration.signature" heatmap (GEP, lymphocyte and leukocyte infiltration, IFN gamma signature, hot tumor signature, Th1 signature, T cell infiltration, exhausted CD4 and CD8 T cells, TcClassII score, TAMsurr score and ratio, MHC I, MHC II). (C) "T/B cell receptor score": TCR and BCR Shannon, Richness and Evenness. (D) "Genetic changes": aneuploidy score, homologous recombination defects, number of segments, fraction altered, SNV and indel neoantigens, intratumor heterogeneity. (E) "Immune evasion genesets": TGFB response, UP ECM signature, WESTON_VEGFA_TARGETS, EMT_UP, IPRES. Panels A-E are annotated by TE cluster (cluster1 to cluster4) and P value. (F) Heatmap of the 50 HALLMARK gene sets annotated by GEO ID (GSE5816, GSE80137, GSE22250), cancer type, cell line, time point (D3, D4, D6), treatment (control, low, high) and group. (G) Venn diagram of GSE5816, GSE80137, GSE22250 and CRC bulk. (H) GSE80137: normalized TE score, control vs low dose treatment, P = 0.0051.]

Figure 4. Exploration of the immune overdrive phenotype. (A) Gene set variation analysis showing the fractions of 28 cell types. (B) Gene set variation analysis of immune infiltration signatures. (C) Comparison of TCR/BCR indexes among TE clusters. (D) Comparison of genetic changes among TE clusters. (E) Gene set variation analysis of immune evasion signatures. P-values for each variable were calculated using the Kruskal-Wallis test. For each variable, the median of the normalized values in each cluster is shown. (F) Heatmap of scores for 50 hallmark gene sets, based on gene set variation analysis, in three datasets of 5-aza-treated cell lines (covering multiple cell lines) and in the CRC TE clusters. (G) Venn diagram showing the significant pathways shared among the cell line datasets and bulk CRC. (H) Comparison of TE score between treated and control groups in GSE80137.
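The per-cluster summary described in this legend (the median of the normalized values in each cluster) can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code: it assumes that "normalized" means a z-score computed across all samples, and the function name `cluster_medians_of_z` and the toy values are hypothetical.

```python
from statistics import mean, median, pstdev


def cluster_medians_of_z(values, clusters):
    """Z-score one variable across all samples, then take the median per cluster.

    Assumption: "normalized value" is a z-score over all samples.
    """
    mu = mean(values)
    sd = pstdev(values)
    if sd == 0:
        raise ValueError("variable is constant; z-score is undefined")
    z = [(v - mu) / sd for v in values]

    # Collect the z-scores of each cluster, then summarize by the median.
    grouped = {}
    for score, label in zip(z, clusters):
        grouped.setdefault(label, []).append(score)
    return {label: median(scores) for label, scores in grouped.items()}


# Toy data, not taken from the study.
values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
clusters = ["cluster1"] * 3 + ["cluster4"] * 3
medians = cluster_medians_of_z(values, clusters)
```

Each heatmap cell would then correspond to one entry of `medians` for one variable, with the Kruskal-Wallis test applied separately to the raw values of that variable across clusters.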

[Figure 5 image. Panel A: module-trait relationships, correlation with TE.score (p-value) per module: MEmagenta -0.037 (0.4); MEblue 0.26 (7e-11); MEyellow -0.095 (0.02); MEgreenyellow 0.5 (5e-38); MEbrown 0.46 (9e-33); MEturquoise 0.26 (2e-10); MEpink 0.029 (0.5); MEred -0.13 (0.002); MEgreen 0.13 (0.002); MEblack 0.035 (0.4); MEpurple -0.07 (0.09); MEgrey -0.047 (0.3). Panel B: GO enrichment dot plot (GeneRatio, p.adjust) for the brown and greenyellow modules, dominated by terms for response to virus, type I interferon and interferon-gamma signaling, T cell activation, antigen processing and presentation, and dsRNA sensing. Panel C: KEGG pathway enrichment dot plot (GeneRatio, p.adjust) for the brown and greenyellow modules. Panel D: heatmaps of median z-score per TE cluster, with p-values, for RIG-like pathway genes, APOBECs, ISGs, TLRs, oligoadenylate synthetases, other RNA sensors and IFN secretion/production signatures.]

Figure 5. Weighted correlation network analysis (WGCNA) based on TE score. (A) WGCNA consensus network modules correlated with TE score. Each row corresponds to a module and the column to the TE score. Each cell contains the corresponding correlation coefficient and p-value. Individual gene modules are marked using different colors. (B) GO enrichment analysis of the genes in the brown and greenyellow modules. (C) KEGG pathway enrichment analysis of the genes in the brown and greenyellow modules. The size of each circle indicates the ratio of genes mapped to each pathway. (D) Representative expression of genes or signatures involved in immune response and RNA sensing, including RIG-I-like pathways, APOBECs, oligoadenylate synthetases, RNA sensors, interferon-stimulated genes (ISGs), the interferon secretion process and Toll-like receptors (TLRs). P-values for each variable were calculated using the Kruskal-Wallis test. For each variable, the median of the normalized values in each cluster is shown.

[Figure 6 image. Panel A: forest plot of hazard ratios by cancer type (ACC, BLCA, BRCA, CESC, CRC, DLBC, GBM, HNSC, KICH, KIRC, KIRP, LGG, LIHC, LUAD, LUSC, OV, PAAD, PRAD, SARC, SKCM, STAD, THCA, UCEC, UCS), with sample sizes and significance labels. Panel B: Spearman's correlation (r) against P value for each cancer type. Panel C: z-scored TE score of KIRC samples in decreasing order, above a heatmap of the expression of AluSq4, HERV1_LTRd, LTR21B, MER57F, MER65C, MER92-int, SVA_C, SVA_F and Tigger12A, annotated by repClass, repFamily and TE cluster. Panels D-F: Kaplan-Meier curves by TE cluster over Time (days) for overall survival (D), disease-specific survival (E) and progression-free interval (F), with log-rank p-values and numbers at risk. Pairwise HR (95% CI), p value, in panel order D; E; F:
cluster 4 vs 3: 2.21 (1.29-3.77), 0.003; 2.77 (1.38-5.56), 0.0027; 2.56 (1.38-4.76), 0.0021
cluster 4 vs 2: 1.39 (0.76-2.55), 0.2851; 1.91 (0.85-4.27), 0.1097; 1.40 (0.70-2.78), 0.3361
cluster 4 vs 1: 2.44 (1.73-3.44), < 0.0001; 2.87 (1.89-4.38), < 0.0001; 1.98 (1.37-2.85), < 0.0001]

Figure 6. Pan-cancer analysis of TE score and identification of the overdrive phenotype in KIRC. (A) Forest plot showing the univariable Cox regression analysis of OS on TE score across 24 cancer types. Solid dots represent the HR of death and open-ended horizontal lines represent the 95% CIs. All p-values were calculated using Cox proportional hazards analysis (ns: p > 0.05, *: p <= 0.05, **: p <= 0.01, ***: p <= 0.001, ****: p <= 0.0001). (B) Spearman's correlation between TE score and GEP across 24 cancer types. The x-axis indicates the p-value and the y-axis the correlation coefficient. (C) Top: scatter plot showing the distribution of TE score in KIRC, in decreasing order from left to right. Bottom: heatmap displaying the expression profiles of the 9 TEs across the four TE clusters, ordered by TE score. Each TE is annotated at the family and class level. (D-F) Prognostic value of the four TE clusters by Kaplan-Meier survival analysis in KIRC for OS (n = 494) (D), DSS (n = 483) (E) and PFI (n = 492) (F). The hazard ratios (HR) and 95% CIs for pairwise comparisons in univariable analyses (log-rank test) are displayed in each Kaplan-Meier plot. Numbers below the x-axes represent the number of patients at risk at the selected time points. The tick marks on the Kaplan-Meier curves indicate censored patients.
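The forest plot quantities in panel A (a hazard ratio with its 95% CI, plus a significance label) can be derived from a fitted Cox coefficient as sketched below. This is an illustrative sketch, not the authors' pipeline: it assumes a Wald-type interval with z = 1.96, and the function names `hazard_ratio_ci` and `significance_label`, as well as the example coefficient and standard error, are hypothetical. Only the p-value thresholds come from the legend.

```python
import math


def hazard_ratio_ci(beta, se, z=1.96):
    """Convert a Cox log-hazard coefficient and its standard error
    into (HR, lower, upper) using a Wald interval (assumed here)."""
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)


def significance_label(p):
    """Map a p-value to the notation defined in the legend."""
    if p <= 0.0001:
        return "****"
    if p <= 0.001:
        return "***"
    if p <= 0.01:
        return "**"
    if p <= 0.05:
        return "*"
    return "ns"


# Hypothetical coefficient: log(2) with standard error 0.2.
hr, lower, upper = hazard_ratio_ci(math.log(2), 0.2)
label = significance_label(0.0006)
```

A confidence interval that excludes 1 corresponds to a significant association at the 5% level under this approximation.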

[Supplementary Figure S1 image. Panel A: study workflow, from the "REdiscoverTE" pipeline and TE screening in the TCGA CRC cohort (n = 1204 TEs; association with survival endpoints OS, DSS, DFI and PFI; association with 29 immune sets, Cor >= 0.4 with at least one set) to the 9-TE score, the four TE clusters (kaps), clinical and molecular comparisons, WGCNA, 5-aza treated cell lines (3 datasets), validation in KIRC and the pan-cancer check. Panel B: TE proportion at class level: LTR 48%, DNA 20%, LINE 14%, other repeats 11%, SINE 5%, Satellite 2%, Retroposon 0%. Panel C: Venn diagram of significant TEs overlapping among endpoints. Panels D-G: Kaplan-Meier curves for MSTA-int high versus low expression over Time (days), each with HR (95% CI) and log-rank P. Panel H: density ridgeline plot of FDR percentage (%) for survival screening (DFI, DSS, OS, PFI). Panel I: HR distribution of significant TEs at family level (tRNA, telo, TcMar-Tc2, TcMar, SVA, MIR, L2, hAT-Blackjack, hAT, Gypsy, ERVL-MaLR, ERVL, ERVK, ERV1, centr, Alu). Panel J: HR distribution of TEs at class level (SINE, Satellite, Retroposon, LTR, LINE, DNA). Panels K-M: multivariable Cox regression of OS (K), DSS (L) and PFI (M) for six retroposons, adjusted by age, gender, MSI and stage; HR (95% CI), P value:
SVA-A: OS 1.62 (1.10-2.38), 0.0144; DSS 2.40 (1.43-4.02), 0.0009; PFI 1.49 (1.06-2.09), 0.0213
SVA-B: OS 1.32 (0.91-1.93), 0.1448; DSS 1.23 (0.76-1.99), 0.3893; PFI 0.98 (0.70-1.36), 0.8921
SVA-C: OS 1.57 (1.07-2.30), 0.0207; DSS 2.10 (1.28-3.46), 0.0035; PFI 1.31 (0.94-1.84), 0.1092
SVA-D: OS 1.67 (1.13-2.48), 0.0098; DSS 2.07 (1.25-3.45), 0.0050; PFI 1.41 (1.00-1.99), 0.0488
SVA-E: OS 1.61 (1.10-2.37), 0.0152; DSS 1.55 (0.95-2.53), 0.0804; PFI 1.43 (1.01-2.01), 0.0417
SVA-F: OS 2.00 (1.35-2.97), 0.0006; DSS 2.47 (1.48-4.13), 0.0005; PFI 1.40 (1.00-1.96), 0.0503
Panel N: bar plot of log2(number of immune-related genes + 1) per immune set. Panel O: density ridgeline plot of FDR percentage (%) for immune screening.]

Supplementary Figure S1. Screening of candidate TEs. (A) Schematic workflow of this study. DSS, disease-specific survival; OS, overall survival; DFI, disease-free interval; PFI, progression-free interval; ECM, extracellular matrix; VEGF, vascular endothelial growth factor; EMT, epithelial-mesenchymal transition; IPRES, innate anti-PD1 resistance signature; TMB, tumor mutation burden; CIMP, CpG island methylator phenotype; CNV, copy number variation; SNV, single-nucleotide variant; TLR, Toll-like receptor; ISGs, interferon-stimulated genes. (B) Pie chart showing the fraction of the 1,204 TEs at class level. (C) Venn diagram showing the overlap of significant candidate TEs associated with the four endpoints (OS, DFI, PFI, DSS). (D-G) Prognostic value of one representative TE (MSTA-int) by Kaplan-Meier survival analysis for OS (D), DSS (E), DFI (F) and PFI (G). The hazard ratios (HR) and 95% CIs for pairwise comparisons in univariable analyses (log-rank test) are displayed in each Kaplan-Meier plot. (H) Density ridgeline plot showing the FDR of the univariable Cox regression analysis for each endpoint. The vertical line indicates the median value. (I-J) Forest plots showing univariable Cox regression analysis of TEs for the four endpoints at family (I) and class (J) level. Solid dots represent the HR of death and open-ended horizontal lines represent the 95% CIs. At family level, only families significant for at least one endpoint are shown in (I). At class level, the six main TE classes are shown. (K-M) Forest plots showing multivariable Cox regression analysis of six retroposons for three endpoints: OS (K), DSS (L) and PFI (M). For each retroposon at each endpoint, four clinical features were included in the multivariable Cox regression analysis: age, gender, MSI status and AJCC stage. Solid dots represent the HR of death and open-ended horizontal lines represent the 95% CIs. (N) Bar plot showing the number of genes in the 29 immune sets. (O) Density ridgeline plot showing the FDR of Spearman's correlation between TEs and immune sets. The vertical line indicates the median value.
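For readers less familiar with the Kaplan-Meier curves in panels D-G, the product-limit estimator behind such curves can be sketched as follows. This is a generic, minimal implementation under stated assumptions, not the code used in the study: the function name `kaplan_meier` and the toy follow-up data are hypothetical, and events are coded as 1 (event) or 0 (censored).

```python
def kaplan_meier(times, events):
    """Product-limit estimate of survival.

    times  : follow-up time per patient (e.g. days)
    events : 1 if the event occurred, 0 if the patient was censored
    Returns a list of (time, n_at_risk, n_events, survival) tuples,
    one per distinct time at which at least one event occurred.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_events = 0
        n_leaving = 0
        # Group all patients sharing the same follow-up time.
        while i < len(data) and data[i][0] == t:
            n_events += data[i][1]
            n_leaving += 1
            i += 1
        if n_events > 0:
            survival *= 1 - n_events / n_at_risk
            curve.append((t, n_at_risk, n_events, survival))
        # Both events and censored patients leave the risk set.
        n_at_risk -= n_leaving
    return curve


# Toy follow-up data, not taken from the study.
times = [100, 200, 200, 300, 400, 500]
events = [1, 1, 0, 1, 0, 1]
curve = kaplan_meier(times, events)
```

Censored patients lower the number at risk without producing a step in the curve, which is why they appear only as tick marks in the plots.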


[Supplementary Figure S2 image. Panel A: Spearman's correlation matrix of AluSq4, HERV1_LTRd, LTR21B, MER57F, MER65C, MER92-int, SVA_C, SVA_F and Tigger12A. Panel B: OS time (days) against TE score, with points marked by OS status (censored or event). Panels C-E: Kaplan-Meier curves of overall survival probability over Time (days) for the kaps groups at K = 2, K = 3 and K = 4. Panels F-G: p-values against K (K = 2, 3, 4). Panel H: heatmap of clinical and molecular features across the four TE clusters, with p-values: age 0.0542; gender 0.8623; tumor location 0.0268; AJCC stage 0.3777; MSI status 0.0001; MLH1 silencing 0.0178; hypermutation 0.0129; CIMP 0.0228; TCGA subtype 0.0122; molecular subtype 0.0138; CMS subtype < 0.0001; immune subtype < 0.0001.]

Supplementary Figure S2. Clinical and molecular comparison among TE clusters. (A) Correlation matrix showing pairwise Spearman's correlation coefficients among the 9 TEs. (B) Scatter plot of survival times (OS) against the prognostic factor (TE score). (C-E) Kaplan-Meier survival curves of the selected groups for K = 2 (C), K = 3 (D) and K = 4 (E). (F) Plot of the overall p-values against K with significance level α = 0.05. (G) Plot of the worst-pair p-values against K with significance level α = 0.05. (H) Heatmap showing the distribution of clinical and molecular features among the four TE clusters. Each row represents one feature and each column one sample. P-values were calculated using the chi-square test.
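A pairwise Spearman correlation matrix such as the one in panel A can be computed as the Pearson correlation of rank-transformed values. The sketch below is a generic illustration using only the Python standard library, not the authors' code: the function names and the three toy expression profiles (`TE_a`, `TE_b`, `TE_c`) are hypothetical, and ties receive average ranks.

```python
import math


def average_ranks(xs):
    """Return 1-based ranks, assigning tied values their average rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def correlation_matrix(profiles):
    """Pairwise Spearman correlations for a dict of name -> values."""
    names = list(profiles)
    return {a: {b: spearman(profiles[a], profiles[b]) for b in names}
            for a in names}


# Toy expression profiles, not taken from the study.
profiles = {
    "TE_a": [1, 2, 3, 4, 5],
    "TE_b": [2, 4, 5, 9, 20],   # increases monotonically with TE_a
    "TE_c": [5, 4, 3, 2, 1],    # decreases monotonically with TE_a
}
matrix = correlation_matrix(profiles)
```

Because only ranks are used, any monotonic relationship yields a coefficient of 1 or -1 regardless of the scale of the expression values.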


[Supplementary Figure S3 image. Panel A: cell fractions from MCPcounter (T cells, CD8 T cells, cytotoxic lymphocytes, B lineage, NK cells, monocytic lineage, myeloid dendritic cells, neutrophils, endothelial cells, fibroblasts), shown as the median z-score per TE cluster with P values. Panel B: heatmap of Th1 signature genes, with samples ordered by TE z-score and annotated by TE cluster. Panel C: heatmap of MHC genes (MHC class I and class II), with samples ordered by TE z-score. Panel D: hotspot mutations of 11 drivers (NRAS, PIK3CA, BRAF, TP53, KRAS, PTEN, FBXW7, SMAD4, ARID1A, SMAD2, APC) among TE clusters, annotated by MSI status; BRAF P = 0.0001, TP53 P = 0.0378. Panel E: CNV plot. Panel F: heatmap of IPRES gene signature expression profiles (z-scores), annotated by TE cluster. Panel G: gene program and canonical targetable pathways (24 significant out of 39), shown as the median z-score per TE cluster with P values. Panels H-I: CD8A and CD274 expression by TE cluster. Panel J: TE score by risk group (Group I/II, Group III*, Group IV*). Panel K: t-SNE plot (tSNE1, tSNE2) colored by cell type (B cell, T cell, myeloid, stromal, tumor). Panel L: proportion of reads mapping to TEs by cell type (P = 0.0019).]

Supplementary Figure S3. Comparison among TE clusters in terms of immune overdrive. (A) Cell fractions of 10 cell types estimated using the MCPcounter algorithm. P-values for each variable were calculated using the Kruskal-Wallis test. For each variable, the median of the normalized values in each cluster is shown. (B) Heatmap showing the expression profiles of the Th1 signature, comprising 20 genes. Samples (columns) are ordered by TE score, decreasing from left to right. (C) Heatmap showing the expression profiles of MHC genes. Samples (columns) are ordered by TE score, decreasing from left to right. (D) Heatmap showing hotspot mutation profiles of 11 CRC drivers. Each row indicates one gene and each column one sample. P-values were calculated using the chi-square test, comparing cluster 4 with the other three clusters combined. (E) CNV plot showing the GISTIC score among the four TE clusters. (F) Heatmap showing the expression profiles of the IPRES signatures, comprising 24 pathways. Each row indicates one pathway and each column one sample. Samples (columns) are ordered by TE score, decreasing from left to right. (G) Gene set variation analysis of 39 gene programs and canonical targetable pathways. The 24 significant pathways are shown. P-values for each variable were calculated using the Kruskal-Wallis test. For each variable, the median of the normalized values in each cluster is shown. (H) Violin plot comparing CD8A expression among TE clusters. (I) Violin plot comparing CD274 expression among TE clusters. (J) Violin plot showing the difference in TE score among the risk groups identified by Fakih et al. (13) (****: p <= 0.0001). (K) t-SNE plot of TE expression profiles at subfamily level for 515 single cells. (L) Boxplot showing the proportion of reads mapping to TEs among cell types. P-values were calculated using the Kruskal-Wallis test.
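The comparison in panel D (cluster 4 against the other three clusters combined) reduces, for each driver gene, to a chi-square test on a 2x2 table of cluster membership by hotspot mutation status. The sketch below shows one way to compute it with the Python standard library. It is an illustration under stated assumptions rather than the authors' procedure: it uses the Pearson chi-square statistic without continuity correction, and the function name `chi_square_2x2` and the counts are hypothetical.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test (no continuity correction, assumed here)
    for the 2x2 table [[a, b], [c, d]].

    Example layout: rows = cluster 4 / other clusters,
                    columns = hotspot mutated / not mutated.
    Returns (statistic, p_value) with 1 degree of freedom.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("a row or column total is zero")
    stat = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # Survival function of the chi-square distribution with 1 df.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Hypothetical counts, not taken from the study.
stat, p_value = chi_square_2x2(10, 20, 30, 40)
```

With small expected counts an exact test would usually be preferred over this approximation.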

[Figure image. Recoverable panel titles: "Sample clustering to detect outliers" (dendrogram of TCGA sample barcodes; y-axis: Height) and "Cluster Dendrogram" (y-axis: Height; annotation track: Module colors).]
TCGA−AA−3681−01TCGA−AA−3664−01TCGA−AG−3578−01 TCGA−A6−2678−01TCGA−AA−3556−01TCGA−AA−3555−01 TCGA−AA−3864−01TCGA−AA−3977−01TCGA−AG−3605−01 TCGA−AA−3488−01TCGA−AA−A02F−01TCGA−AG−A011−01 TCGA−TCGA−AA−3549−01TCGA−AA−A02E−01AG−A00Y−01 TCGA−AA−A00W−01TCGA−TCGA−AG−A00C−01AG−A01L−01 TCGA−AA−A01G−01TCGA−AA−A00U−01TCGA−AG−A015−01 TCGA−TCGA−AA−A02J−01TCGA−AA−A00L−01AG−A016−01 TCGA−AA−A00D−01TCGA−AA−3544−01TCGA−AA−3952−01 TCGA−AA−3842−01TCGA−AA−3524−01TCGA−AG−3602−01 TCGA−AA−3862−01TCGA−AA−3666−01TCGA−AA−3561−01 TCGA−AA−3939−01TCGA−AA−3844−01TCGA−AA−3955−01

TCGA−AA−3976−01 istheauthor/funder,whohasgrantedmedRxivalicensetodisplaypreprintinperpetuity. TCGA−AA−A00A−01TCGA−AA−3984−01TCGA−AG−A026−01 TCGA−TCGA−TCGA−AAG−A01N−01G−A01J−01AG−3584−01 TCGA−AA−3530−01TCGA−AA−A02W−01TCGA−AA−3494−01 TCGA−AA−3846−01TCGA−TCGA−AA−3980−01AG−A036−01 TCGA−TCGA−AA−A01K−01TCGA−AA−3517−01AG−A01W−01 TCGA−AA−3521−01TCGA−AA−3972−01TCGA−AF−2692−01 TCGA−AA−A00Q−01TCGA−TCGA−AA−A017−01AG−A032−01 TCGA−AA−A00K−01TCGA−TCGA−AA−A02H−01AG−3586−01 TCGA−TCGA−AG−A025−01G−A023−01AG−3583−01 TCGA−TCGA−AA−3692−01TCGA−AA−3510−01AG−3887−01 TCGA−TCGA−AG−3599−01G−3892−01AG−3598−01 TCGA−AA−3831−01TCGA−AA−3986−01TCGA−AA−3989−01 TCGA−AA−3850−01TCGA−TCGA−AA−3519−01AG−3580−01 TCGA−AA−3678−01TCGA−TCGA−AA−A00Z−01AG−3593−01 TCGA−AA−3819−01TCGA−AA−3837−01TCGA−AA−3818−01 TCGA−AA−3552−01TCGA−TCGA−AA−3679−01AG−4015−01 TCGA−AA−3867−01TCGA−AA−3553−01TCGA−AA−3548−01 TCGA−AA−3855−01TCGA−AF−2691−01TCGA−AA−3538−01 TCGA−TCGA−TCGA−AA−3673−01AG−3581−01AG−3608−01 TCGA−AA−3970−01TCGA−AA−3875−01TCGA−AA−3956−01 TCGA−AA−3971−01TCGA−AA−3982−01TCGA−AA−3856−01 TCGA−TCGA−AA−3968−01TCGA−CM−4752−01AG−3582−01 TCGA−TCGA−AA−3562−01TCGA−AA−3688−01AG−3896−01 TCGA−AA−A00F−01TCGA−TCGA−AA−3858−01AG−3612−01 TCGA−TCGA−AG−3893−01G−4008−01AY−4071−01 TCGA−TCGA−AA−3975−01TCGA−AG−3890−01AG−3909−01 TCGA−TCGA−AA−3852−01TCGA−AA−3869−01AG−4007−01 TCGA−TCGA−A6−3808−01TCGA−AA−A00N−01AG−3901−01 TCGA−AA−3866−01TCGA−AA−3532−01TCGA−AA−3520−01 TCGA−AA−3814−01TCGA−A6−4107−01TCGA−AA−3851−01 TCGA−A6−3810−01TCGA−A6−3807−01TCGA−AG−3609−01 TCGA−A6−2671−01TCGA−AA−3860−01TCGA−AF−3913−01 TCGA−TCGA−A6−2681−01TCGA−AG−4005−01AG−4001−01 TCGA−TCGA−AA−A01C−01AG−A00H−01TCGA−CM−4750−01 TCGA−TCGA−AG−3601−01AG−3587−01G−3898−01 TCGA−TCGA−AA−3531−01TCGA−AA−3696−01AG−3574−01 TCGA−AA−3529−01TCGA−AA−3534−01TCGA−AG−3999−01 TCGA−TCGA−AA−3966−01AG−3894−01TCGA−A6−2686−01 TCGA−AF−3400−01TCGA−A6−2674−01TCGA−AY−4070−01 TCGA−AA−3870−01TCGA−AA−3872−01TCGA−AA−A01D−01 TCGA−G4−6322−01TCGA−TCGA−CM−4746−01AG−3732−01 
TCGA−TCGA−G5−6641−01TCGA−AZ−6599−01AG−A02N−01 TCGA−ATCGA−G−A020−01G−A008−01AG−3611−01 TCGA−AA−3680−01TCGA−AA−3854−01TCGA−AA−A024−01 TCGA−AA−3522−01TCGA−AA−3861−01TCGA−G4−6320−01 TCGA−TCGA−AF−6672−01TCGA−AA−3502−01AG−3591−01 E TCGA−AZ−5407−01 TCGA−QG−A5Z2−01TCGA−G4−6323−01TCGA−EI−6510−01 TCGA−DM−A1DA−01TCGA−DM−A288−01TCGA−AA−A02K−01 TCGA−DM−A1D4−01TCGA−DM−A1DB−01TCGA−DM−A28H−01 TCGA−DM−A28M−01TCGA−G4−6306−01TCGA−AA−A02Y−01 TCGA−DM−A1D0−01TCGA−DM−A28E−01TCGA−AD−6888−01 TCGA−AZ−6608−01TCGA−G4−6307−01TCGA−DY−A1DG−01 TCGA−CM−5862−01TCGA−G4−6317−01TCGA−D5−6538−01 Gene significance for TE score TCGA−EF−5830−01TCGA−CK−5912−01TCGA−A6−5666−01 TCGA−AA−A01Z−01TCGA−DC−6682−01TCGA−AH−6897−01 CC-BY-NC 4.0Internationallicense Module membership significance vs. gene TCGA−DM−A1D8−01 TCGA−DM−A1D9−01TCGA−DY−A0XA−01TCGA−EI−6883−01 TCGA−G5−6235−01TCGA−QG−A5YX−01TCGA−AY−A54L−01 Module Membershipingreenyellow module TCGA−AH−6544−01TCGA−NH−A50T−01TCGA−G4−6626−01 TCGA−DM−A28F−01TCGA−A6−5656−01TCGA−D5−6537−01 TCGA−DC−4749−01 ; TCGA−CL−5918−01TCGA−F4−6808−01TCGA−AD−6963−01 0.2 0.3 0.4 0.5 TCGA−DM−A0XF−01TCGA−CA−6715−01TCGA−D5−6532−01 TCGA−CM−5864−01TCGA−G4−6294−01TCGA−AD−A5EK−01 . . . . 0.9 0.8 0.7 0.6 0.5 TCGA−CK−5914−01 this versionpostedJuly17,2020. 
TCGA−EI−6881−01TCGA−D5−6533−01TCGA−AY−5543−01 TCGA−AZ−6606−01TCGA−EI−6512−01TCGA−AG−3725−01 TCGA−DC−6160−01TCGA−A6−6648−01TCGA−AF−6136−01 TCGA−A6−6140−01TCGA−A6−6650−01TCGA−DM−A1D7−01 cor=0.53,p=0.00052 TCGA−G4−6315−01TCGA−CL−4957−01TCGA−CI−6622−01 TCGA−AZ−4616−01TCGA−CM−4743−01TCGA−AZ−6598−01 TCGA−DM−A0XD−01TCGA−DM−A280−01TCGA−CA−5255−01 TCGA−DM−A28K−01TCGA−CM−6675−01TCGA−CA−5254−01 TCGA−CK−4952−01TCGA−CM−5861−01TCGA−AZ−4313−01 TCGA−AA−3713−01TCGA−ATCGA−AA−3663−01Y−6197−01 TCGA−CK−5913−01TCGA−A6−6653−01TCGA−EI−6882−01 TCGA−AD−A5EJ−01TCGA−G4−6588−01TCGA−DM−A1HB−01 TCGA−A6−5665−01TCGA−A6−5661−01TCGA−G4−6309−01 TCGA−AD−5900−01TCGA−AD−6895−01TCGA−D5−6540−01 TCGA−CM−6674−01TCGA−D5−6930−01TCGA−D5−6530−01 TCGA−F4−6856−01TCGA−CM−6171−01TCGA−AD−6889−01 TCGA−AA−3492−01TCGA−A6−6780−01TCGA−AA−A01Q−01 TCGA−G4−6586−01TCGA−G4−6304−01TCGA−AZ−4614−01 TCGA−TCGA−AA−A01F−01TCGA−AG−A014−01AG−A02G−01 B

Scale Free Topology Model Fit,signed R^2 0.0 0.2 0.4 0.6 0.8 Scale independence 1 2 3 Soft Threshold(po 4 01 20 15 10 5 5 6 7 8 9 10 . 12 The copyrightholderforthispreprint 14

Gene significance for TE score F 16

0.1 0.2 0.3 0.4 0.5 0.6 18 w 20 . . . . . 0.9 0.8 0.7 0.6 0.5 0.4 Module membership significance vs. gene er) Module Membershipinbrown module cor=0.56,p=1.7e−33

Mean Connectivity C 0 200 400 600 800 Mean connectivity 1 2 3 Soft Threshold(po 4 01 20 15 10 5 5 6 7 8 9 10 12 14 16 18 w 20 er) medRxiv preprint doi: https://doi.org/10.1101/2020.07.14.20129031; this version posted July 17, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC 4.0 International license .

Supplementary Figure S4. Construction of co-expression modules using WGCNA. (A) Clustering dendrogram of CRC samples. One sample (TCGA−AA−3947−01) was considered an outlier and was removed from downstream analysis. (B-C) Soft-thresholding power selection in WGCNA: analysis of the scale-free fit index (B) and of the mean connectivity (C) for individual soft-thresholding powers. A power of 10 was chosen, as it is the lowest power at which the scale-free topology fit index curve flattens out upon reaching a high value (above 0.9) while retaining a moderate mean connectivity. (D) Clustering dendrogram of the included genes, with dissimilarity based on topological overlap, together with the assigned module colors. A total of 12 modules were identified and assigned distinct colors. (E-F) Scatter plots of gene significance for TE score versus module membership in the greenyellow (E) and brown (F) modules, respectively. (G) Heatmap showing the topological overlap in WGCNA. Each row and column represents a gene; light color indicates low topological overlap, and progressively darker red indicates higher topological overlap. Darker squares along the diagonal represent modules. The gene dendrogram and module assignment are shown along the left and top. The heatmap in the right panel zooms in on the brown and greenyellow modules.
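The selection rule described for panels B-C can be illustrated with a minimal Python sketch. It is not the WGCNA implementation: it assumes an unsigned network with adjacency |cor|^power (the network type is not stated in the caption), and the function names, the toy correlation matrix and the toy fit values are all hypothetical, invented only to show the rule of taking the lowest power whose fit index exceeds 0.9.

```python
def soft_threshold_adjacency(cor, power):
    """Unsigned soft-threshold adjacency (assumed form): a_ij = |cor_ij| ** power."""
    return [[abs(c) ** power for c in row] for row in cor]


def mean_connectivity(cor, power):
    """Mean over genes of the connectivity k_i = sum of a_ij over j != i."""
    adj = soft_threshold_adjacency(cor, power)
    n = len(adj)
    k = [sum(adj[i][j] for j in range(n) if j != i) for i in range(n)]
    return sum(k) / n


def pick_power(fit_by_power, r2_cutoff=0.9):
    """Return the lowest power whose signed R^2 exceeds the cutoff, or None."""
    passing = [p for p, r2 in fit_by_power.items() if r2 > r2_cutoff]
    return min(passing) if passing else None


# Hypothetical 3-gene correlation matrix (illustration only, not study data).
toy_cor = [
    [1.0, 0.8, 0.2],
    [0.8, 1.0, -0.5],
    [0.2, -0.5, 1.0],
]

# Hypothetical scale-free fit values per power (illustration only, not study data).
toy_fit = {6: 0.71, 8: 0.86, 10: 0.91, 12: 0.93, 14: 0.94}

chosen = pick_power(toy_fit)
connectivity_at_chosen = mean_connectivity(toy_cor, chosen)
```

Raising the power shrinks weak correlations towards zero faster than strong ones, which is why the mean connectivity falls as the power increases and why the lowest power that passes the fit cutoff is preferred.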

[Figure: Supplementary Figure S5, panels A-R.
(A) z of TE score by cancer type. x-axis order: OV, LIHC, LGG, UCS, GBM, CRC, ACC, KIRC, DLBC, HNSC, STAD, CESC, KIRP, LUSC, LUAD, PAAD, BLCA, SKCM, UCEC, BRCA, KICH, SARC, PRAD, THCA.
(C) Annotation heatmap by TE cluster (cluster1 to cluster4): age p=0.2947; gender p=0.6353; tumor location p=0.381; AJCC stage p=0.0003; log2(TMB) p=0.8692; hypermutation p=0.7274; mRNA subtype p<0.0001; microRNA subtype p=0.001; Immune subtype p<0.0001.
(D-F) Hazard ratio of multiCox for OS (D), DSS (E) and PFI (F); x-axis: Hazard ratio. Significance (OS / DSS / PFI): TE.cluster1 (ref=cluster3) ns / ns / ns; TE.cluster2 (ref=cluster3) * / ns / **; TE.cluster4 (ref=cluster3) * / * / *; gender (male vs female) ns / ns / *; age (high vs low) *** / ns / ns; location (left vs right) * / * / **; AJCC stage (III&IV vs I&II) **** / **** / ****.
(G-K) Heatmaps by TE cluster titled Cell fraction, Immune infiltration signature, T/B cell receptor score, Genetic changes and Immune evasion; color scales: median of z score and p value.
(L-R) Heatmaps by TE cluster titled RIG-I like (L), APOBECs (M), Oligoadenylate synthetases (N), other RNA sensors (O), ISGs (P), IFNs secretion (Q) and TLRs (R); color scales: median of z score and p value.]

Supplementary Figure S5. Pan-cancer analysis of TE score. (A) Comparison of TE score across 24 cancer types. (B) Spearman’s correlation between TE score and GEP in pooled cancer samples (n = 6,554). (C) Heatmap showing the comparison of clinical and molecular features among the four TE clusters in KIRC. Each row represents one feature, while each column represents one sample. P-values were calculated using the chi-square test. (D-F) Forest plots showing multivariable Cox regression analysis of TE cluster adjusted for clinical features for OS (D), DSS (E) and PFI (F) in KIRC. All variables were treated as categorical variables. Samples with age < 65 were assigned to the low-age group and those with age ≥ 65 to the high-age group. Solid dots represent the HR of death and open-ended horizontal lines represent the 95% CIs. All P-values were calculated using Cox proportional hazards analysis (ns: p > 0.05, *: p <= 0.05, **: p <= 0.01, ***: p <= 0.001, ****: p <= 0.0001). (G) Gene set variation analysis showing the fraction of 28 cell types in KIRC. (H) Gene set variation analysis showing immune infiltration signatures in KIRC. (I) Comparison of TCR/BCR indices among TE clusters in KIRC. (J) Comparison of genetic changes among TE clusters in KIRC. (K) Gene set variation analysis showing immune evasion signatures in KIRC. The P-value for each variable was calculated using the Kruskal-Wallis test. For each variable, the median of the normalized values in each cluster is shown. (L-R) Representative expression of genes or signatures involved in immune response and RNA sensor signaling in KIRC, including RIG-I-like pathways (L), APOBECs (M), oligoadenylate synthetases (N), RNA sensors (O), interferon-stimulated genes (ISGs) (P), interferon secretion process (Q) and Toll-like receptors (TLRs) (R). The P-value for each variable was calculated using the Kruskal-Wallis test. For each variable, the median of the normalized values in each cluster is shown.
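The significance notation and the age dichotomization stated in this legend can be written out as a short Python sketch. The thresholds are taken directly from the legend; the function names `p_to_stars` and `age_group` are hypothetical and are not from the authors' code.

```python
def p_to_stars(p):
    """Map a P-value to the legend's notation.

    ns: p > 0.05, *: p <= 0.05, **: p <= 0.01,
    ***: p <= 0.001, ****: p <= 0.0001.
    """
    if p <= 0.0001:
        return "****"
    if p <= 0.001:
        return "***"
    if p <= 0.01:
        return "**"
    if p <= 0.05:
        return "*"
    return "ns"


def age_group(age, cutoff=65):
    """Dichotomize age as in the legend: below 65 is 'low', 65 or above is 'high'."""
    return "low" if age < cutoff else "high"


# Illustrative inputs only (not study data).
labels = [p_to_stars(p) for p in (0.2, 0.05, 0.004, 0.0003, 0.00001)]
groups = [age_group(a) for a in (40, 64, 65, 80)]
```

Checking the thresholds from the most stringent to the least stringent ensures that each P-value receives the highest number of stars it qualifies for.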


[Figure: Supplementary Figure S6, panels A-D. (A) Scatter plot; x-axis, TE score; y-axis, the proportion of reads mapping to TEs; R = 0.50, P < 0.0001. (B) Circle plot linking TE score with the TE classes LTR, Retroposon, Satellite, LINE, SINE and DNA. (C) Scatter plot colored by cell type (Tumor, Myeloid, Bcell, Tcell, Stromal); x-axis, TE score; y-axis, the proportion of reads mapping to TEs; R = 0.35, P < 0.0001. (D) Circle plot linking TE score with the TE classes LTR, Retroposon, Satellite, LINE, SINE and DNA.]

Supplementary Figure S6. Correlation between TE score and global TE expression. (A) Spearman’s correlation between TE score and the proportion of reads mapping to TEs in CRC. (B) Circle plot showing Spearman’s correlation between TE score and TE expression at the level of the six main TE classes (DNA, LINE, LTR, SINE, Retroposon and Satellite) in CRC. (C) Spearman’s correlation between TE score and the proportion of reads mapping to TEs in scRNA-seq data of breast cancer. (D) Circle plot showing Spearman’s correlation between TE score and TE expression at the level of the six main TE classes (DNA, LINE, LTR, SINE, Retroposon and Satellite) in KIRC.
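The quantity plotted in panels A and C, a Spearman correlation between TE score and the proportion of reads mapping to TEs, can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the authors' pipeline: the helper names are hypothetical, ties are handled with average ranks, and the five toy samples are invented for the example.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical per-sample values (illustration only, not study data).
te_score = [-1.2, -0.3, 0.1, 0.8, 2.0]
te_reads = [100, 140, 120, 200, 310]
total_reads = [10000, 10000, 10000, 10000, 10000]

te_proportion = [t / n for t, n in zip(te_reads, total_reads)]
rho = spearman(te_score, te_proportion)
```

Because Spearman's rho depends only on ranks, it is unaffected by any monotonic rescaling of the TE score or of the read proportion.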