bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

An atlas of bacterial and viral associations in cancer

Sven Borchmann1,2,3

1University of Cologne, Department I of Internal Medicine, Center for Integrated Oncology Aachen Bonn Cologne Duesseldorf, German Hodgkin Study Group, Cologne, Germany

2University of Cologne, Faculty of Medicine and University Hospital of Cologne, Center for Molecular Medicine, Cologne, Germany

3University of Cologne, Faculty of Medicine and University Hospital of Cologne, Else Kröner Forschungskolleg Clonal Evolution in Cancer, Cologne, Germany

1 bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

ABSTRACT

Bacterial1,2 and viral3–5 infections have widely been recognized as a cause of cancer. Examples

for viral infections are Human Papillomaviridae causing cervical cancer3,6,7 or Hepatitis B virus

causing liver cancer8,9. An example for a bacterial infection is causing

adenocarcinoma of the stomach10,11. Known carcinogenic mechanisms are disruption of the

host genome via genomic integration, expression of oncogenic viral proteins12 or chronic

inflammation1.

Next generation sequencing has made it possible to gain deep insight into the cancer

genome13, while at the same time providing datasets with traces of bacterial and viral nucleic

acids that colonize the examined tissue or are integrated into the host genome. This hidden

resource has so far not been comprehensively studied.

Here, 3000 whole genome sequencing datasets are leveraged to gain insight into possible

novel associations between viruses, and cancer. Apart from confirming known

associations, multiple, previously unknown, possible associations between bacteria, viruses

and tumors are identified. The presence of some bacteria or viruses was associated with

profound differences in patient and tumor characteristics, such as patient age, tumor stage,

survival, defined somatic alterations or gene expression profiles when compared to other

patients with the same cancer.

Overall, these results provide a detailed map of potential associations between viruses,

bacteria and cancer and can be the basis for future investigations aiming to confirm these

associations experimentally and to provide further mechanistic insight.

2 bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

INTRODUCTION

Bacterial1,2 and viral3–5 infections have widely been recognized as a cause of cancer with a

wide range of carcinogenic mechanisms described. Examples for carcinogenic viruses are

Human Papillomaviridae causing head and neck14,15 as well as cervical cancer3,6,7 or Hepatitis

B virus causing liver cancer8,9. The main carcinogenic mechanisms at play are thought to be (1)

viral integration into and disruption of the host genome and (2) expression of oncogenic viral

proteins12.

An important example for a carcinogenic species from the bacterial domain is Helicobacter

pylori causing adenocarcinoma of the stomach10,11. The carcinogenic mechanism at play here

is thought to be a totally different one compared to carcinogenic viruses, namely sustained

inflammation caused by a chronic, mostly sub-clinical infection1. For some associations

between infections and cancer, preliminary evidence has been presented but simultaneous

presence of contradictory findings has led to widespread debate. An example for this is the

finding that Fusobacterium nucleatum is enriched in cancerous tissue of colorectal cancer

patients compared to tissue of benign adenomas or healthy colon mucosa16–18. Given the

diversity of possible carcinogenic mechanisms19 it is likely that other carcinogenic viruses and

bacteria exist which are currently unknown.

Recent advances in next generation sequencing have made it possible to generate large

amounts of data representing the whole genetic makeup of a tissue13. Resulting datasets

contain traces of non-human, or, more general, non-host origin that is present either because

of genomic integration or presence of the virus or bacteria in the tissue itself. While some

studies have already been performed with the goal to repurpose this data to unravel novel

3 bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

associations between infections and cancer4,5,20–22, these resources have so far been

underutilized to discover novel, possible associations between cancers and infections.

With this aim in mind, this study leverages a large, high-quality collection of over 3000 whole

genome sequencing datasets to gain insight into possible novel associations between viruses,

bacteria and cancer.

4 bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

RESULTS

Samples

A total of 3,025 whole genome sequencing datasets comprising 3.79 trillion reads were

included in this study (Figure 1A, Extended Data Table 1). These include 1,330 whole genome

sequencing datasets of tumor tissue samples across 14 different cancers from 19 International

Cancer Genome Consortium (ICGC)23 studies and their corresponding matched normal (e.g.

non-cancerous) tissue controls (n=1,330) (Extended Data Table 1). Included patients were

predominantly male (n=1,028, 60.6%) and elderly, with 47.3% of patients (n=801) at least 60

years old (Figure 1B-C). Additionally, whole genome sequencing datasets from 365 subjects

from the 1,000 genome project24 were selected as a healthy control group substituting for the

lack of negative sequencing controls to examine non-human DNA in the blood of healthy

donors. Only subjects, in which blood-derived DNA was directly subjected to whole genome

sequencing as opposed to DNA derived from immortalized lymphoblastoid cell lines (LCL)

(subset analyzed, n=102) were included in the healthy control cohort as the whole genome

sequencing datasets derived from LCL DNA showed a markedly different species-level taxon

distribution likely representing taxa present due to LCL culture and not present in the donor

itself (Extended Data Figure 2A, Supplementary Table 1, Supplementary Table 2). For

validation purposes and to assess differential gene expression in cancers associated with

certain species-level taxa all available matching RNA-Seq datasets for all included ICGC

studies23 were also analyzed (n=324).

Validation of Pipeline

Based on its high accuracy, especially with regard to precision and high speed necessary for

handling large amounts of input data, a pipeline was built around KRAKEN25, which at its core

5 bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

is based on the exact alignments of k-mers to their least common ancestor (LCA). Before

applying the pipeline to the datasets of interest, it was validated fourfold.

First, it was confirmed that the pipeline was able to identify already known bacterial and viral

taxa in tissue-derived bacterial isolates and cell lines independent from this study. The pipeline

was applied to a gastric mucosa isolate of Helicobacter pylori from a stomach ulcer patient

(sample ID: SRS2800545, run ID: SRR6432709). As expected, 3,606,537/3,607,687 (99.97%) of

read pairs matching any taxon in the database matched Helicobacter pylori (Supplementary

Table 3, Extended Data Figure 2B). Next, the pipeline was applied to 3 cell line samples (HeLa

(run ID: SRR1611000), CaSki (run ID: SRR1611128) and SiHa (run ID: SRR1611127)) with known

integrations of Human papillomavirus26. Applying the pipeline resulted in between 78.14%

and 99,90% of non-human read pairs matching any species-level taxon in the database to

match Human papillomavirus 7 or 9 (Supplementary Table 3, Extended Data Figure 2B).

Interestingly, the majority of non-human read pairs not matching Human papillomavirus 7 or

9 matched Mycoplasma hyorhinis (0.08%-21.74%), a common cell culture contaminant27.

Thus, it was confirmed that the pipeline can identify bacterial and viral taxa irrespective of

both a high level of human background DNA and whether the identifying sequence is present

in the form of host-integrated DNA or by presence of the viral or bacterial taxa itself in the

sample.

Second, it was confirmed that the taxonomic classification of non-human reads by the pipeline

is highly correlated to the classification by an independent approach, namely mapping non-

human reads to putative target genomes with BWA28 (Supplementary Table 4, Extended Data

Figure 2C-E).

6 bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Third, in yet another independent approach, it was confirmed that assembled contigs from

non-human reads show high sequence similarity with the taxa identified by the pipeline

(Supplementary Table 4, Extended Data Figure 2F).

Fourth, the pipeline was independently run with the same settings on available matched RNA-

Seq and WGS data from the same tumor tissue samples. It was shown, that identified taxa in

pairs of RNA-Seq and WGS data from the same tumor tissue sample are correlated both in a

combined dataset of all pairs and for each sample separate for which RNA-Seq and WGS data

was available (n=324) (Extended Data Figure 2 G-H). This validation is especially helpful to

minimize the risk of contamination post sample processing though this was further mitigated

with the data analysis strategy described below.

A map of cancer-associated taxa

Due to the nature of this dataset, which was not created with the purpose of analyzing the

non-human taxonomic composition of cancer tissues, only a small fraction of sequence

information can be used to infer taxa present in the respective sample. Therefore, less

prevalent taxa are likely not detectable. Additionally, considering many samples will have

been collected under sterile, surgical conditions, detected taxa are more likely to indicate true

presence of that taxa in the tissue, limiting the detection of taxa that are part of the naturally

occurring microbiome of non-sterile tissues, such as skin or bowel. Therefore, on average, only

2.2 species-level taxa per sample were detected, though variation was high (Supplementary

Table 5, Extended Data Figure 3 A-D). For these reasons, many conventional methods of

analysis of taxonomic composition are not applicable to this dataset and the focus of the

present study is on identifying highly abundant species-level taxa that are potentially

associated with cancer, either by being present in the patient’s matched normal sample or

7 bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

potentially more directly involved in carcinogenesis by being present in the patient’s tumor

tissue, especially when it is likely to have been collected sterile.

In this study, a total of 3,534,707 read pairs matching bacterial, viral or phage species-level

taxa were detected across 19 studies in 3025 samples (Figure 1D-E, Extended Data Table 1,

Supplementary Table 6). Subsampling of 10% of all read pairs did not alter the detected

species-level taxa and their relative composition compared to analyzing all non-human read

pairs which was validated in a subset of patients (Extended Data Figure 2I-J, Supplementary

Table 7). A total of 218 species-level taxa could be identified in all examined samples (Figure

1F, Supplementary Table 8). and Propionibacterium acnes were the most

commonly detected species in all samples.

In order to control for differences in sequencing depth between samples, all raw read pair

counts were normalized by dividing them by 1,000,000,000 total read pairs. This normalized

count was defined as read pairs per billion (RPPB). Overall, 18,112 (+/- 4,238 standard error

of mean (s.e.m.)), 20.003(+/- 4,295 s.e.m.) and 28,282 (+/- 9,081 s.e.m.) RPPB could be

detected on average in healthy control samples from the 1000 genomes cohort, matched

normal samples and tumor tissue samples, respectively, with a large variation across samples

and sequencing projects (Figure 1G). Of note, particularly high RPPB were detected in matched

normal, saliva-derived, samples from acute myeloid leukemia (AML) patients, as would be

expected from a naturally non-sterile source such as saliva. The taxa with the highest RPPB

are depicted in Figure 1H. Ralstonia pickettii was the species with the highest RPPB and almost

absent in healthy donors, while being detected at higher levels in tumor tissue than in

matched normal samples (Figure 1H). Except for Bacillus subtilis, Acidovorax sp. KKS102 and

Bacillus cereus, all species with very high RPPB were detected predominantly in tumor tissue

8 bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

or matched normal samples. A clustered heatmap of all species-level taxa detected in all

samples is provided in Figure 2.

Next, filtering was performed to exclude taxa that were (i) frequently detected in the healthy

control group, (ii) detected only in very few (<5) tumor tissue or matched normal samples, (iii)

detected phages, (iv) taxa that are commonly detected as part of the normal oral microbiome

and were mainly detected in saliva, oral or esophageal cancer tissue samples (Supplementary

Table 9), (v) taxa that have been previously described as sequencing contaminants

(Supplementary Table 10) and (vi) taxa, for which the detected reads were unevenly

distributed across their genome (Extended Data Figure 4). After all these filtering steps

(Extended Data Figure 5) 27 species-level taxa remained for further analysis (Figure 1 I-J,

Extended Data Figure 6, Supplementary Table 8).

Among these, known tumor-associated taxa, such as Hepatitis B virus8,9 or Salmonella

enterica29,30 were detected. Additionally, taxa were detected that have been implicated in

carcinogenesis before without enough evidence to support a carcinogenic role, such as

Pseudomonas species31,32 and taxa that have never been implicated in carcinogenesis before,

such as Gordonia polyisoprenivorans. Of note, most taxa in the final filtered species list were

detected at much higher RPPB-levels in tumor tissue compared to matched normal samples,

though most matched normal tissues in this study were blood-derived. However lower RPPB

could still be detected in matched normal samples and across all taxa tumor tissue samples

and matched normal samples are highly correlated (Extended Data figure 2K).

In order to better understand the association of detected taxa with different cancers and

among each other, a heuristic approach was used combining the non-linear dimensionality

reduction and visualization method t-distributed stochastic neighborhood embedding (t-SNE)

9 bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

with K-means clustering. To focus on taxa that potentially play a more direct role in

carcinogenesis and considering that RPPB of detected taxa were highly correlated between

tumor tissue and matched normal tissue with lower levels detected in matched normal tissue

(Extended Data Figure 2K), only tumor tissues were included in the following analyses.

Using a pan-cancer approach, a combined dataset of all tumor tissue samples was visualized

using t-SNE using the log2-transformed RPPBs of all detected non-phage taxa (n=204) as input

variables. First, two distinct groups of patients with chronic lymphocytic leukemia (CLL)

emerged. One group (n=24) was associated with presence of Gordonia polyisoprenivorans in

the patient’s tumor tissue and another group (n=15) was associated with tumor tissue

presence of Pseudomonas mendocina. Second, one distinct group (n=23) of pancreatic cancer

patients associated with presence of Cupriavidus metallidurans was observed. Third, one

group (n=22) of bone cancer patients associated with presence of Pseudomonas poae,

Pseudomonas fluorescens and Pseudomonas sp. TKP could be distinguished. Fourth, a group

of patients (n=14) with bone cancer (n=4) and chronic myeloid disorders (n=10) associated

only with presence of Pseudomonas sp. TKP was observable (Figure 3A). Overall, this approach

was helpful to get a first insight into groups of tumors from a pan-cancer perspective that

might be defined by presence or absence of specific taxa.

In order to gain further insight into the association of detected taxa and specific cancers, each

cancer (Supplementary Table 11) was examined separately, visualizing patient groupings using

t-SNE with log2-transformed RPPBs of all the final filtered taxon list (n=27) (Supplementary

Table 1) in tumor tissue as input variables combined with K-means clustering of patients using

the same input variables. Additionally, each cluster was associated with age, survival, gender,

number of alterations in known cancer genes or specific alterations in one of those cancer

genes. 10 bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

For bone cancer patients this dual methodology revealed a cluster of patients with presence

of Pseudomonas fluorescens, Pseudomonas sp. TKP and Pseudomonas poae (cluster 2) in the

tumor tissue, confirming the results of the pan-cancer approach used above (Figure 3B,

Extended Data Figure 7A).

For chronic lymphocytic leukemia patients this approach revealed 3 clusters, one of patients

with presence of Gordonia polyisoprenivorans in the tumor tissue (cluster 3), one of patients

associated with Human Mastadenovirus C (cluster 2) and one with the remainder of patients

(cluster 1), some of which showed presence of Pseudomonas species in the tumor tissue.

Clusters could be identified by both methods, t-SNE and K-means (Figure 3C, Extended Data

Figure 3B). The clusters representing patients associated with Gordonia polyisoprenivorans

(cluster 3) and Human Mastadenovirus C (cluster 2) were mutually exclusive (pFisher = 0.0124)

(Figure 3C, Extended Data Figure 7B). There was a tendency towards different ages at

diagnosis between the clusters (panova=0.0743) (Figure 4A). Additionally, there was a tendency

towards a difference in survival between the different clusters (panova=0.0745). Patients in

cluster 1 (no associations) had worse survival than patients in cluster 2 (Human

Mastadenovirus C) or 3 (Gordonia polyisoprenivorans) (p=0.024626) (Figure 4B). Patients with

an association with Gordonia polyisoprenivorans (cluster 3) were more likely to be of Binet C

stage (5/36 vs. 1/61, C vs. not-C, pFischer = 0.0252) (Figure 4C). These patients were also more

likely to be associated with TP53 mutations than patients in the other clusters (pFisher(cluster

3 vs. other)=0.0335) (Figure 4D).

For esophageal cancer patients 3 clusters were identified. K-means clustering revealed one

cluster of patients with presence of Human Herpesvirus 5 (cluster 3) in the tumor tissue and

one cluster with presence of Bifidobacterium dentium (cluster 2). The remaining cluster

contained patients with no discernable associations. However, these 3 clusters could not be 11 bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

differentiated by t-SNE. (Figure 3D, Extended Data Figure 7C). There was a tendency towards

different numbers of alterations in cancer genes between clusters (panova=0.0858), with

patients in cluster 3 (Human Herpesvirus 5) having less alterations than patients in cluster 1

(no associations) and 2 (Bifidobacterium dentium) (p=0.0392) (Figure 4E). Additionally, there

was a tendency towards a difference in survival between the different clusters (panova=0.1).

Patients in cluster 2 (Bifidobacterium dentium) or 3 (Human Herpesvirus 5) had worse survival

than patients in cluster 1 (no associations) (p=0.040928). (Figure 4F).

Patients with liver cancer could be grouped into 4 clusters with K-means clustering and t-SNE

agreeing. One cluster was defined by presence of Hepatitis B (cluster 1) in the tumor tissue. A

second cluster was defined by sole presence of both Pseudomonas and Serratia species

(cluster 3). A third cluster was defined by additional presence of Parvibaculum

lavamentivorans and Human Herpesvirus 5 as well as two additional Pseudomonas species,

Pseudomonas poae and Pseudomonas protegens (cluster 4), in addition to those detected in

the previous cluster. The fourth cluster contained patients with no discernable associations

(cluster 2) (Figure 3E, Extended Data Figure 7D). There was a tendency towards different ages

at diagnosis between the clusters (panova=0.0015) with patients in cluster 1 (Hepatitis B),

cluster 2 (Pseudomonas and Serratia sp.) and cluster 4 (Pseudomonas sp., Serratia sp.

Parvibaculum lavamentivorans and Human Herpesvirus 5) younger than patients in cluster 3

(no associations) (pt-test=0.000142) (Figure 4G). In a similar pattern, these patients had had a

higher frequency of alterations in RNF21 (pFischer = 0.0121) (Figure 4H).

For patients with pancreatic adenocarcinoma 5 clusters could be identified using K-means

clustering. The first cluster was defined by presence of Pseudomonas protegens in the tumor

tissue (cluster 5), the second by Methylobacterium populi (cluster 4) and a third, prominent

cluster was defined by presence of Cupriavidus metallidurans (cluster 2). The two other 12 bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

clusters contained patients with no discernable associations (cluster 1 and 3). Only the cluster

of patients associated with Cupriavidus metallidurans (cluster 2) could also be identified using

t-SNE (Figure 3F, Extended Data Figure 7E). Differences in the frequency of alterations

between patients in different clusters were observed in KMT2C, CDKN2A and RNF21. Patients

in cluster 2 (Cupriavidus metallidurans) ,4 (Methylobacterium populi) or 5 (Pseudomonas

protegens) had a higher frequency of alterations in these genes compared to the majority of

patients which were in cluster 1 (no associations) (n=187) (pFisher(KMT2C)=0.0308,

pFisher(CDKN2A)=0.0124, pFisher(RNF21)=0.0107) (Figure 4I-K).

For patients with prostate cancer, 5 clusters emerged using K-means clustering, one defined

by simultaneous presence of Thauera sp. MZ1T, Cupriavidus metallidurans and Pseudomonas

mendocina in the tumor tissue (cluster 5), one by presence of Pseudomonas sp. TKP (cluster

3) and another one by presence of Gluconobacter oxydans and Rhodopseudomonas palustris

(cluster 2). The two other clusters contained patients with no discernable associations. Out of

these, the cluster defined by Gluconobacter oxydans and Rhodopseudomonas palustris

(cluster 2) and the one defined by Thauera sp. MZ1T, Cupriavidus metallidurans and

Pseudomonas mendocina (cluster 5) could be confirmed using t-SNE (Figure 3G, Extended

Data Figure 7F). There were age differences at diagnosis observable between the clusters

(panova=0.00994) with patients in cluster 2 (Gluconobacter oxydans) and 5

(Rhodopseudomonas palustris) older than the other patients (pt-test)=0.000544) (Figure 4M).

For patients with renal cancer a cluster of patients associated with

(cluster 2) was identified using K-means clustering, although t-SNE did not separate this group

of patients (Figure 3H, Extended Data Figure 4N). The other cluster contained patients with no

discernable associations (cluster 1). There was a tendency towards a higher frequency of

PBRM1 alterations in patients in cluster 1 (pFisher=0.0723) (Figure 4P). 13 bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

For patients with pancreatic endocrine neoplasms, ovarian cancer, chronic myeloid disorders

and breast cancer, no discernable clusters emerged (Extended Data figure 8A-D).

Unbiased association analysis between taxa and patient features

In addition to associating the above identified clusters with patient features, an unbiased

analysis of possible associations between presence of a species-level taxa and patient

features, such as age, survival, gender, number of alterations in known cancer genes and

specific alterations in one of those cancer genes was performed in both a pan-cancer approach

using the whole dataset and analyzing each cancer separately. In this exploratory analysis, all

non-phage taxa detected (n=204) were included and multiple testing correction was

performed.

Association between taxa and patient age

In the pan-cancer analysis, no association remained significant at the α=0.05 level after

correction for multiple testing. However, looking at each cancer separately, several

associations were identified. A group of bacterial taxa was associated with older patients in

prostate cancer (Figure 4L). In chronic myeloid dysplasia, presence of Pseudomonas sp. TKP

was associated with older age (padj=0.039) (Figure 4Q).

Association between taxa and patient survival

In the pan-cancer analysis, no association remained significant at the α=0.05 level after

correction for multiple testing. However, looking at each cancer separately, presence of

several taxa had a significant impact on survival at the α=0.05 level. In renal cancer, presence

of Ralstonia pickettii was associated with improved survival, in fact no patient died (padj=0.035)

(Figure 4O).

14 bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Association between taxa and patient gender

For both the pan-cancer analysis and looking at each cancer separately, no significant

associations were found at the α=0.05 level after correcting for multiple testing.

Association between taxa and number of genetic alterations in cancer genes

In the pan-cancer analysis, no association remained significant at the α=0.05 level after

correction for multiple testing. However, looking at each cancer separately, presence and

higher RPPB matching P. acne was associated with a decreasing number of alterations in

prostate cancer (padj=0.0041) (Figure 4N).

Association between taxa and specific genetic alterations

For both the pan-cancer analysis and looking at each cancer separately, no significant

associations were found at the α=0.05 level after correcting for multiple testing.

Somatic lateral gene transfer

The integration of viral nucleic acids into the human host genome is well recognized as a

carcinogenic process. High-level evidence exists for the integration of Hepatitis B8,9, Human

Papillomavirus6,7 and Epstein–Barr virus (Human Herpesvirus 4)33,34, which are causally linked

to hepatocellular carcinoma, cervical cancer and lymphoma, respectively. Additionally, some

evidence for somatic lateral gene transfer from bacteria to cancerous tissue has been

presented before20 and subsequently controversially discussed.

Aiming to find evidence for bacterial or viral DNA integrated into the human host genome in

this dataset, a pipeline was developed to do so. In brief, read pairs in which one read mapped

to the human genome and one read to one of the taxa in the final filtered taxon list (n=27)

15 bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

were counted and then compared to the number of read pairs in which both reads mapped

to the respective taxa. The number of divergently mapping read pairs divided by the number

of complete pairs mapping that respective taxa was used as an indicator for potential genomic

integration and lateral gene transfer. This was done separately for the tumor tissue datasets

and the matched normal datasets for all patients for which the pipeline revealed at least 1000

RPPB matching one of the taxa in the final filtered taxon list (n=27) in that respective patient’s

tumor tissue sample (n=79, Supplementary Table 2). The resulting data is shown in

Supplementary Table 1 and Extended Data Figure 4. The highest rate of integration was

observed for Hepatitis B in tumor tissue (9.62%) (Extended Data Figure 9A). Across all analyzed

taxa, putative integrations were more common in matched normal samples than in tumor

tissue samples (p=0.0009, Wilcoxon paired signed rank test) (Extended Data Figure 9B) and

for viral taxa compared to bacterial taxa (p=0.0001, Mann-Whitney U-test) (Extended Data

Figure 9C). In conclusion, there was no evidence for a general phenomenon of lateral gene

transfer for the final filtered taxon list, with the exception of Hepatitis B, for which integration

into the cancer genome has been widely described8,9.

Results of differential gene expression analysis

Differential gene expression analysis was performed for all taxa and cancer combinations with

available RNA-seq data and in which at least 5 patients had at least 100 RPPB matching the

respective taxon (Supplementary table 10). Differentially expressed genes (q<0.05) were

identified for chronic lymphocytic leukemia patients with or without detection of Gordonia

polyisoprenivorans (258 genes) (Figure 5A, Supplementary table 12), Human Mastadenovirus

C (1725 genes) (Figure 5B, Supplementary table 13) and (50 genes,

Supplementary table 14). Furthermore, differentially expressed genes were identified for

ovarian cancer patients with or without detection of Escherichia coli (22 genes) 16 bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

(Supplementary table 15) and pancreatic adenocarcinoma patients with or without detection

of Propionibacterium acnes (3 genes) (Supplementary table 16). Next, differential gene

expression data was used to perform bidirectional functional enrichment in order to identify

pathways that are altered in patients with or without presence of the respective taxon. To no

surprise, only the two associations with a significant number of differentially expressed genes,

chronic lymphocytic leukemia with or without presence of Gordonia polyisoprenivorans

(Supplementary table 17) or Human Mastadenovirus C (Supplementary table 18), respectively,

showed enrichment in some pathways.

Overall, presence of Gordonia polyisoprenivorans lead to a gene expression pattern indicative

of a decreased level of mitotic cell cycling. Also, a gene expression pattern indicative of a

decreased DNA damage response was observed. Additionally, a gene expression pattern

indicative of reduced blood coagulation and reduced macrophage activation was observed in

patient samples with presence of Gordonia polyisoprenivorans (Figure 5C).

Contrary to that, presence of Human Mastadenovirus C lead to a gene expression pattern that

is indicative of an increase in B-cell proliferation and a reduction of various immune defense

functions, such as immune response in general, Tumor necrosis factor and interleukin beta 1

production, activated T cell proliferation and a decrease in cytokine secretion in general. In

addition, the altered gene expression pattern of patients with presence of Human

Mastadenovirus C is indicative of a markedly reduced DNA damage response (Figure 5D).

17 bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

DISCUSSION

The aim of this this study was to leverage a large, high-quality dataset of over 3000 samples

to unravel novel associations between viral and bacterial taxa and cancer. A total of 218

species-level taxa could be identified in tumor tissue, matched normal and healthy donor

samples. Out of these, 27 taxa remained being likely cancer-associated after extensive

filtering. While studies on the viral metagenome of caner tissues and patients have been

performed using datasets from cancer genomics studies4,5,21,22, similar large studies looking at

the bacterial metagenome are lacking.

Studies examining the viral metagenome of cancer tissues mainly identified known

associations of Human Papillomavirus with cervical and head and neck cancer, Hepatitis B in

liver cancer and Human Herpesvirus 5 in a variety of cancers while detection of Human

Mastadenovirus C was controversial4,5,21,22.

Smaller studies examining bacteria-tumor associations in pan-cancer datasets have identified

Escherichia coli, Propionibacterium acne and Ralstonia pickettii in multiple cancers, while

more specifically finding Acinetobacter sp. in AML and Pseudomonas sp. in both AML and

adenocarcinoma of the stomach20,35. Interrogating cancer-specific datasets, Salmonella

enterica, Ralstonia pickettii, Escherichia coli and Pseudomonas sp. were detected in breast

cancer and adjacent tissue36, while Escherichia sp., Propionibacterium sp., Acinetobacter sp.

and Pseudomonas sp. were frequently detected in prostate cancer32. Confirming the findings

of the present study, these smaller studies found similar bacterial taxa, especially Ralstonia

pickettii, Escherichia coli, Propionibacterium acne and Pseudomonas sp.

Interestingly, the present study identified a number of taxa that have not been previously

identified in cancer tissue or matched normal samples of these patients, among them

18 bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Cupriavidus metallidurans, Gordonia polyisoprenivorans, Serratia sp. and Bifidobacterium sp..

Reasons for this are likely the much higher (1) number and diversity of patients and samples

included, (2) the larger amounts of data examined per sample due to having higher-coverage

WGS datasets available for all patients compared to RNA-seq or whole-exome sequencing

(WXS) datasets in previous studies and (3) the optimized and extensively validated

bioinformatics approach used herein. Of note, the filtering strategy used in this study to

exclude taxa likely resulting from contamination has also eliminated bacterial taxa that have

been linked to cancer previously such as Escherichia coli and Propionibacterium acne, mainly

due to the frequent detection of these taxa in healthy donor samples from the 1000 genome

cohort. Stringent filtering for species-level taxa that have been detected at 2 log10-levels

higher in tumor tissue or matched normal tissue compared to healthy donor samples excluded

these species, despite having about 1.4x and 3.1x higher RPPB for matched normal and tumor

tissue, respectively, compared to healthy donors for Propionibacterium acne. These values

were even higher for in the case of Escherichia coli, namely 7.1x and 13.4x for matched normal

and tumor tissue, respectively. Thus, it cannot be excluded that tumor-associated taxa were

eliminated due to the stringent filtering utilized.

Looking at specific cancer-pathogen associations, a subgroup of bone cancer patients with

presence of Pseudomonas sp. in the tumor tissue was identified. Consistent associations

between viral or bacterial taxa and bone cancer have not been described before, apart from

some evidence linking simian virus 40 (SV40) infection to bone cancer37. Interestingly,

Pseudomonas sp. are frequently implied in difficult to treat cases of osteomyelitis. Thus, one

might speculate that chronic, subclinical infection exists and could be carcinogenic.

For CLL patients, two associations were identified, one with Gordonia polyisoprenivorans and

one with Human Mastadenovirus C. Gordonia polyisoprenivorans has not been implied in 19 bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

cancer before. Interestingly, the bacterium has been identified as a rare cause of bacteremia,

so far exclusively in patients with hematological cancers38–40. While the immunosuppression

associated with hematological cancers and subsequent treatment obviously predisposes to

infections with unusual pathogens from the environment, it is at least possible that these

patients had a latent infection with Gordonia polyisoprenivorans which exacerbated into

bacteremia and sepsis upon treatment-induced immunosuppression. Human Mastadenovirus

C has to date not been clearly linked to cancer, though two recent reports have found it

frequently in various cancer tissue, with one treating it as contamination4 and one not21 (see

methods for a detailed discussion). Further indirect evidence for a possible role of both

Gordonia polyisoprenivorans and Human Mastadenovirus C in CLL carcinogenesis is provided

by (1) the observed age difference - patients with detection of Human Mastadenovirus C were

markedly younger, (2) the mutual exclusivity of detection of Gordonia polyisoprenivorans and

Human Mastadenovirus C and (3) the fact that survival was different - patients with detection

of either taxa had improved outcome. This was despite patients associated with Gordonia

polyisoprenivorans having a higher likelihood of having Binet C stage. Providing a possible

explanation for the observed survival benefit, patients with no taxa association (cluster 3) had

a higher likelihood of having TP53 mutations. Strikingly, marked differences in host gene

expression between patients in which one of the taxa was detected and other patients were

observed. The observed gene expression pattern for patients associated with Gordonia

polyisoprenivorans, especially an increased DNA damage response and reduced mitotic cell

cycling, could explain the improved survival of these patients. The gene expression pattern

observed for patients associated with Human Mastadenovirus C pointed towards reduced

immune activity, especially reduced T cell function and a decrease in cytokine production and

secretion, which could explain the observed high detection frequency of Human

20 bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Mastadenovirus C DNA in line with an uncontrolled infection due to a diminished immune

response. Detection of Mastadenovirus C was associated with a decreased DNA damage

response, which has been described as being an important pathomechanism of CLL41. In

addition to CLL patients, both Gordonia polyisoprenivorans and Human Mastadenovirus C

were also detected frequently in other cancers, sometimes at high levels, while never being

detected in healthy donors.

Two taxon-tumor associations in esophageal cancer were identified, one with Bifidobacterium

dentium and one with Human Herpesvirus 5. Interestingly, patients in cluster 3 (Human

Herpesvirus 5) had less alterations in cancer gene consensus genes than the other patients,

pointing to a possible different carcinogenic mechanism, where the accumulation of multiple

somatic alterations is not stringently needed for malignant transformation. Furthermore,

patients in cluster 1 (without any association with Bifidobacterium dentium or Human

herpesvirus 5) had better survival. While the association of Human Herpesvirus 5 with

esophageal cancer is a novel finding, Human Herpesvirus 5 has frequently be detected in

adjacent adenocarcinoma of the stomach5. Of note, overt Human Herpesvirus 5 esophagitis

can occur in immunocompromised hosts42, which could be indicative of a latent infection of

the esophagus by Human Herpesvirus 5 in some hosts.

In patients with liver cancer, one subgroup of patients was defined by detection of Hepatitis

B, a known cause of liver cancer8,9. Interestingly, two other subgroups could be identified.

Pseudomonas sp. and Serratia sp. were detected in both groups, with additional detection of

Parvibaculum lavamentivorans and Human Herpesvirus 5 in one group. There are some hints

that Human Herpesvirus 5 might play a role in the carcinogenesis of liver cancer, among them

presence of CMV DNA in tumor tissue and an increased seroprevalence in liver cancer

patients43 as well as frequent hepatitis in Human Herpesvirus 5 infection underscoring 21 bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

hepatotropism of Human Herpesvirus 544. The other identified taxa have not yet been implied

in liver cancer carcinogenesis. Compared to patients not associated with one of the taxa

identified in liver cancer tumor tissue samples, all patients with presence of any identified taxa

were younger. Hepatocellular carcinoma at younger age is associated with chronic Hepatitis B

infection45. Since younger age was also observed in patients associated with the other taxa,

one might speculate that liver cancers associated with other pathogens are associated with

younger age of onset as well.

Three bacteria-tumor associations were identified in pancreatic adenocarcinoma, one with

Pseudomonas protegens, one with Methylobacterium populi and one with Cupriavidus

metallidurans. While Pseudomonas sp. have been shown to be a contributor to the pancreatic

adenocarcinoma tissue microbiome31, Methylobacterium populi and Cupriavidus

metallidurans have not been detected in cancer tissue. In fact, both taxa have not been

discovered in human hosts but in environmental samples and are thus not considered to be

part of the human microbiome, making it possible that they are contaminants not truly

present in the tumor tissue samples analyzed. On the other hand, few infections of humans

by these taxa have been described46,47 and, of note, the first published report of a Cupriavidus

metallidurans infection was a case of septicemia in a patient with a pancreatic tumor47.

In patients with prostate cancer, the most interesting findings were age differences between

patient clusters defined by the presence of different taxa and a negative correlation of

Propionibacterium acne presence and number of alterations in cancer gene consensus genes.

Patients in cluster 2 (defined by Gluconobacter oxydans and Rhodopseudomonas palustris)

and cluster 5 (defined by Thauera sp. MZ1T, Cupriavidus metallidurans and Pseudomonas

mendocina) were markedly older than the other patients. Gluconobacter sp.,

Rhodopseudomonas sp., Cupriavidus sp. and Pseudomonas sp. have previously been 22 bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

identified in prostate cancer and normal prostate tissue32, but their precise role in prostate

disease is entirely unclear. Propionibacterium acne has been implied as a potential

carcinogenic bacterium in prostate cancer48,49, possibly by creating a chronic inflammatory

microenvironment50. Like Human Herpesvirus 5 in esophageal cancer, the reduced number of

alterations in cancer gene consensus genes upon increased Propionibacterium acne frequency

could point to alternative carcinogenesis by chronic inflammation in the absence of the

accumulation of many mutations in putative cancer driver genes.

For patients with renal cancer, an association with Serratia marcescens was described for a

subgroup of patients. While Serratia marcescens has been described as a frequent cause of

urinary tract infections, especially in immunocompromised hosts in a nosocomial setting51, it

has so far not been implicated in cancer. Interestingly, survival of patients with detection of

Ralstonia pickettii in their tumor tissue was markedly improved. Ralstonia pickettii is a

bacterium that has been filtered out in this study because of frequent detection in healthy

donor samples. It could be speculated, that, while not causing any overt infection, low level

presence of Ralstonia pickettii in the human host is common, and improves immunogenicity

of renal cancer and, thus, improving outcome. It has been shown, that alterations of local and

systemic immunity by a host’s microbiome possibly influence the anticancer immune

response52, which might be highly relevant for a naturally immunogenic tumor, such as renal

cancer53.

The intriguing observation of increased somatic bacteria-human lateral gene transfer by Riley

et al. 20 could not be made in this study. The only taxa for which more integration into the host

genome was observed in tumor tissue compared to matched normal was Hepatitis B, for

which integration into the genome of cancerous cells has been well recognized54.

23 bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

In conclusion, the present study provides an important, unprecedented atlas of potential

associations between both bacterial and viral taxa with cancer. In addition to confirming

known or postulated associations between cancers and viral or bacterial taxa, several novel

associations were identified for the first time, laying the groundwork for experimental

validation and further mechanistic studies.

24 bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

METHODS

Data sources

Mapped sequencing data for the included ICGC studies was obtained via the ICGC DCC23,55 and

downloaded using customized scripts. The use of controlled access ICGC data for this project was

approved by the ICGC Data access compliance office. Mapped sequencing data for the 1000 genome

healthy control samples was obtained from the 1000 genome FTP server56 using customized scripts.

Sequencing data used for validation was obtained from the European Nucleotide Archive (ENA)57.

Computing environment

Data analysis was performed using a HP Z4 workstation in a Unix environment either using

standalone standard tools as mentioned throughout the methods section or customized

scripts. Some analyses were performed employing the Galaxy platform58.

Pipeline for taxonomic classification

First, unmapped (non-human) read pairs were extracted from downloaded sequencing data

using Samtools (version 1.7)59. Bam files were sorted by query name with Picard Tools (version

2.7.1.1)60 and converted to FastQ files using Bedtools (version 2.26.0.0)61. Subsequently,

Trimmomatic (version 0.36.3)62 was used to trim reads with sliding window trimming with an

average base quality of 20 across 4 bases as cutoff and dropping resulting reads with a residual

length < 50. Read pairs, in which one read was dropped according to these rules were dropped

altogether. Remaining read pairs were joined with FASTQ joiner (version 2.0.1)63 and

converted to FASTA files using the build-in function FASTQ to FASTA (version 1.0.0) from

Galaxy58. Next, VSearch (version 1.9.7.0)64 was used to mask repetitive sequences by replacing

them with Ns using standard settings. These masked and joined read pairs were fed into

25 bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Kraken (version 1.2.3)25 using a database of all bacterial and viral genomes in Refseq. The

output of each run was filtered with Kraken (version 1.2.3)25 setting a confidence threshold of

0.5. A report combining the output of all samples and runs was generated using Kraken

(version 1.2.3)25. The output was then arranged using customized scripts in R (beginning with

version R 3.3.2. and subsequently updated)65 to generate the raw metagenome output of

each sample. Next, the raw outputs of each sample running the pipeline on a subset of 10%

of the cancer tissue and matched normal sample read pairs and on all non-human read pairs

of the 1000 genomes samples were filtered by only including species-level taxa and excluding

all species-level taxa that were supported by less than 10 read pairs across all samples using

R (beginning with version R 3.3.2. and subsequently updated)65. Read pairs assigned to

Enterobacteria phage phiX174 which is ubiquitously used as a spike-in control in next

generation sequencing were omitted from all counts as an intended contaminant.

To correct for the variation of sequencing depth across samples, matched read pairs per billion

read pairs raw sequence (RPPB) were calculated for each sample and each taxon. A taxon was

heuristically considered detected, when a respective sample had at least 100 RPPB assigned

to that taxon.

Filtering strategy

First, all taxa that were also highly prevalent in the healthy control group, were excluded from

further analysis. In detail, a taxon had to have an at least 2 log10-levels higher mean RPPB

across either the tumor tissue samples, or the matched normal samples compared to the

healthy control samples. This cutoff excluded all taxa that were also highly prevalent in the

healthy control samples while at the same time allowing to further analyze taxa that were

enriched in matched normal samples such as blood as well as taxa that were dominant in

26 bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

tumor tissue. After this step, 147/218 (67.4%) potential tumor-associated species-level taxa

remained. Next, taxa that were detected in fewer than 5 tumor tissue or matched normal

samples were excluded (n=78), as well as all remaining phages (n=2). Hepatitis B as a known

cancer-associated virus was re-included despite being present in fewer than 5 tumor tissue or

matched normal samples so that 68/218 (31.2%) taxa remained.

Taxa that survived this filtering strategy were likely to be tumor-associated but could also

represent artifacts from contamination. To account for that, all taxa were filtered out that are

known to be regularly present in the oral microbiome66–68 and were at the same time mainly

(>50% of all RPPB matching a respective taxon) detected in samples that are likely

contaminated with the oral microbiome (saliva matched normal, oral cancer tissue and

esophageal cancer tissue samples) (Supplementary Table 9). Exemplary, this filtered out taxa

that are a commonly present in the oral human flora such as Streptococcus mitis and were

indeed mainly present in tumor tissue samples of oral cancer or esophageal cancer and likely

a contaminant based on the biopsy location or detected in the matched normal of leukemia

cases whose matched normal was a saliva sample likely containing taxa of the normal human

microbiome. After this filtering step 49/218 (22.5%) species-level taxa remained.

It has recently emerged that both reagents and kits used in DNA extraction and library

preparation as well as ultrapure water used in laboratories can contain contaminants that can

hamper the detection of truly present taxa in low biomass or high background (e.g. human)

samples. As this study was performed using samples that were both, low in non-human

biomass and in the context of high human background the aim was to further reduce false

positives by compiling a list of common contaminants in microbiome studies from the

literature. Recommended approaches, such as the sequencing of blank controls69 were not

feasible as the present study was conducted utilizing already sequenced primary material. 27 bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Therefore, a database of common contaminants from various studies70–74 examining this issue

was compiled (Supplementary Table 10). All previously recognized contaminant taxa apart

from those that were described as a contaminant on genus-level but where different species

within the genus were associated differently with 1000 genome control samples and matched

normal or tumor tissue samples were excluded. This was the case for the genera Pseudomonas

and Methylobacterium. While the species Pseudomonas aeruginosa, Pseudomonas putida

and Pseudomonas stutzeri were associated with both 1000 genome control and matched

normal or tumor tissue samples and thus likely contaminants, Pseudomonas fluorescens,

Pseudomonas mendocina, Pseudomonas poae, Pseudomonas protegens and Pseudomonas

sp. TKP were not present in 1000 genome healthy control samples. Thus, it is likely, that the

species-level resolution of this analysis was able to differentiate between common

contaminants and possibly tumor-associated taxa. Similarly, differentiation was possible

between Methylobacterium radiotolerans and Methylobacterium extorquens (likely

contaminants) and Methylobacterium populi, which was only found in tumor-associated

samples. The same held true for Cupriavidus metallidurans and Propionibacterium

propionicum. After this filtering step, 27/218 (12.4%) species-level taxa remained (Figure 1 I-

J, Extended Data Figure 6). While it cannot be ruled out that all these filtering steps removed

cancer-associated taxa, the aim was to be cautious and rather accept a false-negative than a

false-positive finding. Of course, experimental validation of the relevance of one of the taxa

eliminated by one of the filters could prove that this taxon is both, a sequencing contaminant

and a relevant taxon in cancer.

If a taxon is indeed present in a tissue, matching read pairs are expected to be uniformly

distributed across its genome. Consequently, a further filtering step examining this was

introduced. The sequencing data from all samples was combined and matched against a

28 bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

reference database constructed out of the genome of these 27 taxa. Next, the coverage

distribution of reads across each genome was assessed (Extended Data Figure 4). If the

respective taxa are indeed present in the sample and not the result from misalignment or

sequence similarity of certain parts of the genome an uneven coverage would result. It was

found that all read pairs matching Human Mastadenovirus C aligned to short parts of its

genome with a maximum length of a few hundred base pairs and abrupt drops in coverage

(Extended Data Figure 4). This was also found in another study using different cancer tissue

sequencing data and resulted in excluding Mastadenovirus C from further analysis4. Using

Blast (beginning with version 2.7.1 and subsequently updated)75 it was found that read pairs

matching Human Mastadenovirus C aligned all equally well to commonly used cloning vectors,

such as pAxCALGL which might have been used in the production of reagents used for

sequencing. This form of contamination was recently analyzed and found to be frequent76.

However, read pairs aligning to Human Mastadenovirus C originated from very few, seemingly

unlinked samples from diverse cancer sequencing projects, making contamination by

recombinant DNA unlikely. Another possible explanation for the observed coverage pattern is

somatic genomic integration of parts of the Human Mastadenovirus C genome into a specific

cancer genome. A recent analysis of PCAWG data focusing on viruses in cancer with overlap

to patients and samples analyzed within this study also identified presence of Human

Mastadenovirus C in cancer tissue and included it in their analysis among other viruses21. On

balance, Human Mastadenovirus C was therefore not excluded from further analysis as an at

least possible cancer associated viral taxon.

Clustering of pipeline hits

T-SNE was performed using a web-based TensorFlow Embedding Projector implementation77.

The learning rate and the perplexity were heuristically set to 10 and 30 for all analyses, 29 bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

respectively, except for the liver cancer subset, for which the perplexity was set to 50 due to

improved cluster discrimination. The number of iterations was heuristically chosen, so that no

major changes of cluster composition occurred upon increasing the number of iterations.

Depending on the subset analysis, between 500 and 1500 iterations were needed to reach

that point. T-SNE was performed in 3 dimensions for the pan-cancer analysis and in 2

dimensions for the cancer-specific analyses.

K-means clustering was performed using Morpheus78. Unsupervised K-means clustering using

Euclidean distance as a similarity measure was employed with the number of clusters being

heuristically informed by combining visual inspection, comparing t-SNE and K-means clusters

and by examining marginal reduction of within-group variance with increasing numbers of

clusters (i.e. the elbow method) (Extended Data Figure 10 A-J).

To combine the information obtained by both complimentary methods, t-SNE clustering was

repeated with the same settings for each analysis, while color-coding clusters inferred from K-

means clustering.

Assessment of taxonomic differences between cell culture-derived DNA samples and blood-

derived DNA samples from the 1000 genome project

All taxa in the final taxon list (n=218) (Supplementary Table 19) which were identified in blood-

derived 1000 genomes healthy control samples were selected (n=76) (Supplementary Table

11). The pipeline was subsequently applied to randomly selected (identifier ending with 8 or

9) LCL-derived 1000 genomes samples (n=102) (Supplementary Table 12). Finally, RPPB in

these LCL-derived samples were calculated for all 76 taxa that were identified in the blood-

derived 1000 genomes samples and compared between blood-derived and LCL-derived

samples.

30 bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Proof that subsampling of read pairs does not alter relative taxon distribution

To show that subsampling does both not alter the detected species-level taxa and their

relative composition compared to analyzing all non-human read pairs, a subset of 184 tumor

tissue samples (Supplementary Table 15) was analyzed completely without any subsampling

and subjected to the pipeline in the same way as in the main analysis. Exemplary, read pairs

matching Human Mastadenovirus C, P. poae, R. pickettii and P. acnes were used to compare

absolute read pair counts between full data and the subsample for all taxon-sample pairs in

which the 10% subsample had at least 10 matches to the respective taxon.

Pipeline validation strategy

In order to perform in-depth validation of the pipeline, two alternative strategies were

compared to the results obtained from it.

In the first alternative strategy, BWA mem (version 0.7.17) 79 was used to align read pairs to

downloaded representative bacterial and viral genomes for the final filtered taxon list

(Supplementary Table 1, Accession numbers in Supplementary Table 16). In detail, Samtools

(version 1.7)59 was used to merge all non-human mapped BAM files. Next, read pairs were

extracted using Samtools (version 1.7)59 and mapped to phiX (Coliphage phi-X174, complete

genome, NC_001422.1) with BWA mem (version 0.7.17)79. Using Samtools (version 1.7)59 all

read pairs not matching phiX were extracted. Picard (version 2.7.1.1)60 and Samtools (version

1.7)59 was used to extract reads from the resulting BAM file and sort them by read name,

respectively. Next, reads were trimmed with Trimmomatic (version 0.36.3)62 with sliding

window trimming with an average base quality of 20 across 4 bases as cutoff and dropping

resulting reads with a residual length < 50. Read pairs, in which one read was dropped

according to these rules were dropped altogether. Resulting read pairs were mapped to a

31 bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

merged FASTA file combining reference genomes of the final filtered taxon list

(Supplementary Table 1, Accession numbers in Supplementary Table 16). Subsequently,

Samtools (version 1.7)59 was used to filter only mapped read pairs with a quality of at least 60.

Customized scripts were used to generate statistics on the number of mapped reads per

species-level taxon in the whole dataset.

In the second alternative strategy, Unicycler (version 0.4.6.0)80 was used to assemble reads

from a merged read database containing all non-human and non-phiX mapped, trimmed read

pairs generated as described above for the first alternative, BWA-based strategy. Unicycler

(version 0.4.6.0)80 was run with standard settings. The assembled contigs were then matched

to sequences in a combined database consisting of the htgs (downloaded on 04/13/2014), nt

(downloaded on 04/17/2014 and wgs (downloaded on 04/20/2014) database using BLAST

megablast (beginning with version 2.7.1 and subsequently updated)75 with an expectation

value cutoff of 0.001. Next, customized scripts were used to filter out all contigs with a

sequence identity to its matched sequence ≤ 80% and to sum the total alignment length of all

remaining contigs by matched sequence template. Finally, for each of the final filtered taxon

list (Supplementary Table 1), the matched sequence template representing the species-level

taxon with the biggest summed up alignment length was selected.

Next, in order to validate the KRAKEN-based pipeline, the log10 of the number of read pairs

25 assigned to a species-level taxon with KRAKEN (version 1.2.3) , the log10 of read pairs mapped

to a taxon with the BWA-based approach and the log10 of the total alignment length of the

assembly-based approach were computed for all taxa in the final filtered taxon list

(Supplementary Table 1) and compared (Supplementary Table 16). To assess the correlation

between the main method and the two tested alternative methods, the Pearson correlation

coefficient was calculated. 32 bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Finally, in order to assess the concordance between RNA-Seq and WGS experiments

performed on the same sample the main pipeline was applied with the same settings to all

samples with RNA-Seq and WGS paired data available (n=324). Pearson correlation

coefficients of log10 transformed data were calculated for both a combined dataset of all RNA-

Seq / WGS pairs and for each sample separately for which RNA-Seq and WGS data was

available (n=324).

Assessment of somatic lateral gene transfer

In order to assess the integration of bacterial DNA into human DNA, read pairs with one read

matching the human genome and one read matching one of the taxa in the final filtered taxon

list (n=27) (Supplementary Table 19) were identified. First, representative bacterial and viral

genomes for the final filtered taxon list were downloaded (Accession numbers in

Supplementary Table 19). These genomes were merged with the human reference genome

(hg1k_v37, downloaded from the 1000 genomes FTP server56) into one FASTA file using

customized scripts. Second, all read pairs in which only one read of a read pair was mapped

to the human genome were filtered using Samtools (version 1.7)59. All tumor tissue samples

and matched normal samples of all patients in which at least 1000 RPPB matching one of the

taxa in the final filtered taxon list were identified in that respective patient’s tumor tissue

sample (n=79, Supplementary table 20) were included in this analysis. BWA mem (version

0.7.17)79 with standard settings in paired end mode was used to align all such read pairs to

the merged FASTA file of all taxa and the human reference genome. Next, all read pairs that

were now divergently mapped were extracted using Samtools (version 1.7)59 and customized

scripts by only including read pairs where one read mapped to one of the included non-human

taxa and the other read mapped to a human sequence. Only read pairs with a mapping quality

of at least 40 were retained. Customized scripts were used to count and tabulate all divergent 33 bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

read pairs obtained by taxa and sample, respectively (Supplementary Table 21,

Supplementary Table 22). In order to normalize read pairs mapping to putative integration

sites (i.e. divergently mapped read pairs as defined above) by correcting for the total number

of read pairs matching a taxon with a similar approach (i.e. non-divergently mapped,

putatively non-integrated reads), a comparable pipeline was applied to the data used for the

main analysis (i.e. both reads in a read pair not mapped to the human genome) of all patients

included in the integration analysis (n=79) (Supplementary Table 20). First, the genomes of all

taxa in the final filtered taxon list were downloaded (Accession numbers in Supplementary

Table 16) and merged without adding any further human sequences into one FASTA file to

create a reference genome containing all taxa in the final filtered taxon list. BWA mem

(version 0.7.17)79 with standard settings in paired end mode was used to align these reads to

the merged FASTA file of all taxa in the final filtered taxon list. Subsequently, Samtools (version

1.7)59 was used to filter the aligned data to only include read pairs that mapped as a proper

pair to only one taxon with a minimum mapping quality of 60. Samtools (version 1.7)59 and

customized scripts were used to count and tabulate all mapped reads. The putative

integration rate of a taxon in either tumor tissue samples or matched normal samples was

calculated by dividing the number of divergently mapping read pairs by the number of read

pairs mapping as a proper pair to the respective taxon.

Analysis of associations of patient features with identified clusters

Differences in age at diagnosis between patient clusters were first analyzed by ANOVA for

each cancer. All results with panova≤0.1 are shown separately for each cancer in Figure 4. Such

clusters or combinations of clusters were then compared to the other clusters by student t-

test.

34 bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Differences in the gender distribution between patient clusters were analyzed by chi-square

test for each cancer.

In order to analyze possible relationships between the number or type of cancer-associated

genetic alterations and presence of specific taxa, all somatic alterations for all patients

included in this study were obtained from the ICGC data portal23,55. All synonymous mutations

were filtered out. Subsequently, only Tier 1 cancer gene census81 genes that were altered in

more than 20 cases were filtered using customized scripts and included in the analysis

(Supplementary Table 23).

Differences in the number of genetic alterations in cancer genes between patient clusters

were first analyzed by ANOVA for each cancer All results with panova≤0.1 are shown separately

for each cancer in Figure 4. Such clusters or combinations of clusters were then compared to

the other clusters by student t-test.

The Kaplan-Meier method was used to estimate survival curves for each patient cluster in each

cancer with available survival data. Differences in survival between patient clusters were

analyzed by log rank test.

Associations between patient clusters and alterations in single cancer genes were analyzed

for each cancer gene in a cancer that was altered in at least 10 patients in that respective

cancer by chi-square test. Clusters or combinations of clusters were then compared to the

other clusters by Fisher’s exact test.

All calculations were performed with R (beginning with version R 3.3.2. and subsequently

updated)65.

Unbiased analysis of association of patient features with the presence of certain taxa

35 bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

For this analysis a taxon was considered detected in a patient if at least 100 RPPB matched

the respective taxa in the patient’s tumor tissue sample. All phages were excluded from the

analysis. All included projects were grouped by cancer (Supplementary Table 18). Cancer

genes were defined as above.

All calculations were performed with R (beginning with version R 3.3.2. and subsequently

updated)65. All p-values were corrected for multiple testing using the FDR method to obtain a

q-value, which was considered significant if < 0.05.

Survival analysis

First, the Kaplan-Meier method was used to estimate survival curves for each cancer-taxon

pair separately that was present in at least 10 patients. Differences in survival between

patients with presence or absence of a respective taxon were analyzed using log rank tests,

stratified by ICGC project. Additionally, a pan-cancer analysis was performed in the same way,

also stratifying by ICGC project.

Association between taxa and patient gender

Associations between presence or absence of a taxa and patient gender were analyzed by

Fisher’s exact test for each cancer-taxon pair separately that was detected in at least 10

patients. Additionally, a pan-cancer analysis was performed in the same way, using a logistic

regression model that included the ICGC project as an independent variable.

Association between taxa and patient age

Associations between presence or absence of a taxon and patient age at diagnosis were

analyzed by student t-test for each cancer-taxon pair separately that was present in at least

36 bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

10 patients. Additionally, a pan-cancer analysis was performed in the same way, using a linear

regression model that was stratified by ICGC project.

Association between taxa and number of genetic alterations in cancer genes

Associations between presence or absence of a taxon and the number of non-synonymous

genetic alterations in cancer gene consensus81 genes of a patient were analyzed by student t-

test for each cancer-taxon pair separately that was present in at least 10 patients. Additionally,

a pan-cancer analysis was performed in the same way, using a linear regression model that

was stratified by ICGC project.

Association between taxa and specific genetic alterations

Associations between presence or absence of a taxon and non-synonymous alterations in one

of the cancer gene consensus81 genes of a patient were analyzed by Fisher exact test for each

cancer-taxon pair separately that was detected in at least 10 patients. Additionally, a pan-

cancer analysis was performed in the same way, using a logistic regression model that

included the ICGC project as an independent variable.

Analysis of differential gene expression

Reads per kilobase of transcript, per million mapped reads (RPKM) data was downloaded from

http://dcc.icgc.org/pcawg for all available tumor tissue samples. Ensemble gene ID was

substituted by the standard Human Genome Organization (HUGO) Gene Nomenclature

Committee (HGNC) symbol downloaded from http://genenames.org/download/custom and

the RPKM data was then linked to the identified species-level taxa in each sample using R

(beginning with version R 3.3.2. and subsequently updated)65. sRAP82 was used to normalize

RPKM values, perform quality control, differential gene expression analysis and to identify

37 bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

pathways that are differentially expressed. For these analyses, each cancer was analyzed

separately and patients that had more than 100 RPPB matching one species-level taxon in the

final filtered taxon list were compared with those who had not. Differential gene expression

analysis was performed for all taxa that were detected with more than 100 RPPB in at least

five patients and RNA-seq data available (Supplementary Table 24). Next, sRAP82 was used to

perform bidirectional functional enrichment of gene expression data to identify pathways up-

or downregulated between patients with or without presence of a taxon. Briefly, the

distribution of activation versus inhibition t-test statistics for all samples associated or not

associated with a specific taxon was compared using ANOVA and corrected for multiple testing

using the FDR method. For details please refer to the original description of BD-Func83. A gene

set was considered functionally enriched, if q<0.05. Gene sets likely not relevant for the

respective cancer were excluded and gene sets provided with sRAP82 were reduced to gene

ontology (GO) gene sets84.

38 bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

FIGURE LEGENDS

Figure 1: Overview over sample characteristics and identified taxa

A, Sample distribution by project. B, sample distribution by age group. C, sample distribution

by gender. D, distribution of read pairs matching any species-level taxa by project. E,

distribution of read pairs matching any species-level taxa by sample type. F, species-level

taxa detected in at least 10 samples (detection limit 100 read pairs per billion (RPPB)) color-

coded by project. G, average RPPB detected across all species-level taxa by project and

sample type. Bars show mean of samples and error bars show standard error of mean. H,

average RPPB detected by species-level taxa and sample type. I, average RPPB detected for

all species-level taxa identified as likely tumor associated after filtering by sample type. J,

species-level taxa identified as likely tumor associated after filtering (detection limit 100

RPPB) color-coded by project.

Figure 2: Heatmap of all taxa detected in all samples

Log2-transformed read pairs per billion (RPPB) of all taxa in all samples. Taxa were

hierarchically clustered using Pearson correlation as a distance measure with average-

linkage. Samples were hierarchically clustered within each project and type subgroup using

Pearson correlation as a distance measure with average-linkage.

Figure 3: Patient clusters defined by detected taxa can be identified across all patients and

cancer-specific subgroups

A, t-SNE visualization of all tumor-tissue samples color coded by project using the log2-

transformed read pairs per billion (RPPBs) of all detected non-phage taxa as input variables.

B-H, t-SNE visualizations of single cancer subgroup tumor-tissues color coded by k-means

39 bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

cluster (Extended data figure 4) using the log2-transformed read pairs per billion (RPPBs) of

all detected non-phage taxa as input variables.

Figure 4: Patient clusters defined by detected taxa associate with patient characteristics

A, association of CLL patient clusters (1: no specific association, 2: Human Mastadenovirus C,

3: Gordonia polyisoprenivorans) with age (panova=0.0743). B, survival by cluster in CLL (p(1 vs.

other)=0.0246). C, Binet stage depending on detection of Gordonia polyisoprenivorans

(pFischer = 0.0252). D, TP53 alteration frequency by cluster in CLL (pFisher(cluster 3 vs.

other)=0.0335). E, number of cancer consensus gene alterations by cluster (1: no specific

association, 2: Bifidobacterium dentium, 3: Human Herpesvirus 5) in esophageal cancer

(panova=0.0858). F, Kaplan-Meier survival curves for each cluster in esophageal cancer (p(1 vs.

other)= 0.0409). G, association of liver cancer patient clusters (1: Hepatitis B virus, 2:

Pseudomonas sp. and Serratia sp., 3: no specific associations, 4: Parvibaculum

lavamentivorans and Human Herpesvirus 5 in addition to taxa from cluster 2) with age

(panova=0.0015). H, RNF21 alteration frequency by cluster in liver cancer (pFischer = 0.0121). I,

KMT2C alteration frequency by cluster (1: no specific associations, 2: Cupriavidus

metallidurans, 3: no specific associations, 4: Methylobacterium populi, 5: Pseudomonas

protegens) in pancreatic cancer (pFisher=0.0308). J, CDKN2A alteration frequency by cluster in

pancreatic cancer (pFisher=0.0124). K, RNF21 alteration frequency by cluster in pancreatic

cancer (pFisherRNF21=0.0107). L, association of indicated taxa with age in prostate cancer (padj

between 0.0015 and 0.0420). M, association of prostate cancer patient clusters (1: no

specific association, 2: Gluconobacter oxydans and Rhodopseudomonas palustris, 3:

Pseudomonas sp. TKP, 4: no specific associations, 5: Thauera sp. MZ1T, Cupriavidus

metallidurans and Pseudomonas mendocina) with age (panova=0.00994). N, association

between detected amount of Propionibacterium acne DNA and number of cancer consensus 40 bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

gene alterations (padj=0.0041). O, Kaplan-Meier survival analysis of Ralstonia pickettii

detection status in renal cancer (padj=0.035). P, PBRM1 alteration frequency by cluster (1: no

specific association, 2: Serratia marcescens) in renal cancer (pFisher=0.0723). Q, association of

Pseudomonas sp. TKP detection with age in chronic myeloid dysplasia (padj=0.039). For all:

The midline of the boxplots shows the median, the box borders show upper and lower

quartile, the whiskers show 5th and 95th percentiles and the dots outliers.

Figure 5: Differential gene expression and pathway analysis of chronic lymphocytic

leukemia patients

A, heatmap of all differentially expressed genes (q<0.05) in tumor tissue samples of chronic

lymphocytic leukemia (CLL) patients with (green) and without (orange) detection of

Gordonia polyisoprenivorans. Dendrograms show clustering with complete linkage and

Euclidian distance measure. B, boxplots of the distribution of activating versus inhibitory

gene t-test statistics for the indicated pathway is shown for tumor tissue samples of CLL

patients with (+, n=23)) and without (-, n=49) detection of Gordonia polyisoprenivorans. The

midline of the boxplot shows the median, the box borders show upper and lower quartile

and the whiskers the maximum and minimum test statistic. A, heatmap of all differentially

expressed genes (q<0.05) in tumor tissue samples of CLL patients with (green) and without

(orange) detection of Human Mastadenovirus C.. D, boxplots of the distribution of activating

versus inhibitory gene t-test statistics for the indicated pathway is shown for tumor tissue

samples of CLL patients with (+, n=7)) and without (-, n=65) detection of Human

Mastadenovirus C. The midline of the boxplot shows the median, the box borders show

upper and lower quartile and the whiskers the maximum and minimum test statistic.

Dendrograms show clustering with complete linkage and Euclidian distance measure.

41 bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Extended Data Table 1: Included patients and samples

Overview of all included patients and samples.

Extended Data Figure 2: Validation of pipeline and analytical approach

A, mean read pairs per billion (RPPB) detected for indicated species in blood derived and

lymphoblastoid cell line 1000 Genome samples sorted by mean of blood derived samples. B,

proportion of non-human read pairs matching the indicated taxon of read pairs matching

any species-level taxon for each external validation sample. C-E, comparison of BWA

mapped reads in Kraken matched read pairs. Each dot represents one species-level taxon

(n=27). The line represents the best-fitted line (log-log linear regression). The dotted lines

depict the 95% confidence interval. Pearson correlation coefficients (log-log) are shown with

two-sided p-values. (C) shows all samples (n=27), (D) only bacterial taxa (n=22) and (E) only

viral taxa (n=5). F, comparison of assembled read length and Kraken matched read pairs.

Each dot represents one species-level taxon (n=27). The line represents the best-fitted line

(log-log linear regression). The dotted lines depict the 95% confidence interval. Pearson

correlation coefficients (log-log) are shown with two-sided p-values. G, comparison of

Kraken matched read pairs in RNA-seq and WGS data of the same sample. Each dot

represents one species-level taxon in one sample with both RNA-seq and WGS data

available. The line represents the best-fitted line (log-log linear regression). Pearson

correlation coefficients (log-log) are shown with two-sided p-values. H, plot of Pearson

correlation coefficient (log-log) distribution of all samples with both RNA-seq and WGS data

available. Each dot represents the Pearson correlation coefficient within a single sample. I,

comparison of 10% subsample and full dataset. Each dot represents one species-level taxon

in one sample for which both the full and the subsampled dataset has been analyzed and

42 bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

indicates the absolute read count identified in both samples. The line represents the best-

fitted line (log-log linear regression). Pearson correlation coefficients (log-log) are shown

with two-sided p-values. J, Ratio of absolute read counts in the full sample to the 10%

subsample for 4 selected taxa. The mean ratio of all samples in which the respective taxon

was detected is indicated by the symbol and the error bars indicate the standard error of the

mean. The dotted line shows the expected ratio of 10. K, comparison of tumor tissue and

matched normal by patient and taxon. Each dot represents one species-level taxon in one

patient with both tumor tissue and matched normal analyzed and indicates the RPPB in both

samples. The line represents the best-fitted line (log-log linear regression). Pearson

correlation coefficients (log-log) are shown with two-sided p-values.

Extended Data Figure 3: Alpha diversity

A, counts of 1000 Genome samples by species-level richness. B, counts of tumor tissue

samples by species-level richness color-coded by project. C, counts of matched normal

samples by species-level richness color-coded by project. D, comparison of richness between

projects and sample type. Bars show mean and error bars standard deviation. N in brackets

indicates total sample number for each project.

Extended Data Figure 4: Coverage distribution for all likely tumor-associated species-level

taxa

Coverage distribution across each species-level taxon identified as likely tumor associated.

Extended Data Figure 5: Flow chart of taxa filtering strategy

Flow chart of filtering strategy to derive likely tumor-associated species-level taxa.

Extended Data Figure 6: Heatmap of final filtered taxon list

43 bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Log2-transformed read pairs per billion (RPPB) of all species-level taxa identified as likely

tumor associated after filtering in all samples. Taxa were hierarchically clustered using

Pearson correlation as a distance measure with average-linkage. Samples were hierarchically

clustered within each project and type subgroup using Pearson correlation as a distance

measure with average-linkage.

Extended Data Figure 7: Heatmaps of likely tumor-associated taxa for all cancers with

discernible clusters

A-F, log2-transformed read pairs per billion (RPPB) of all species-level taxa identified as likely

tumor associated and detected after filtering in all tumor-tissues of the indicated cancer.

Results of K-means clustering of samples are shown.

Extended Data Figure 8: Heatmaps of likely tumor-associated taxa for all cancers without

discernible clusters

A-D, log2-transformed read pairs per billion (RPPB) of all species-level taxa identified as likely

tumor associated and detected after filtering in all tumor-tissues of the indicated cancer.

Results of K-means clustering of samples are shown.

Extended Data Figure 9: Analysis of host integration

A, integration rate by species for tumor tissue and matched normal sample. B, difference in

integration rates between bacterial and viral taxa (p<0.0001, Wilcoxon rank-sum test, two-

tailed). The midline of the boxplot shows the median, the box borders show upper and lower

quartile, the whiskers show 5th and 95th percentiles and the dots outliers of species-specific

integration rates in tumor tissue or matched normal samples. C, difference in integration

rate between tumor tissue and matched normal samples (p=0.0009, Wilcoxon signed-rank

44 bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

test, two-tailed). ). The midline of the boxplot shows the median, the box borders show

upper and lower quartile, the whiskers show 5th and 95th percentiles and the dots outliers of

species-specific integration rates.

Extended Data Figure 10: „Elbow method“ to determine k for k-mean clustering

A-J, plot of reducing within group sum of squares for increasing k (number of clusters) in k-

means clustering (Extended data figure 4, Extended data figure 5) of log2-transformed read

pairs per billion (RPPB) of all species-level taxa identified as likely tumor associated and

detected after filtering in all tumor-tissues for each indicated cancer.

45 bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

FIGURES

Figure 1

A B C D < 4 0 4 0 -4 9 5 0 -5 9 6 0 -6 9 7 0 -7 9  8 0 N /A

T o ta l 3 0 2 5 s a m p le s T o ta l 1 6 9 5 p a tie n ts T o ta l 1 6 9 5 p a tie n ts 2 0 0 0 T o ta l 3 ,5 3 4 ,7 0 7 r e a d s M ale F e m a le 1 5 0 0

s R E C A -E U

t i

h P R A D -U K

E F f

4 0 0 o 1 0 0 0 P R A D -C A

.

o P A E N -IT N 5 0 0 P A E N -A U

3 0 0 P A C A -C A

s t i P A C A -A U

h 0

f li O V -A U o s 2 0 0 c e

o n a c

i h a O R C A -IN . ic r m e iu o h r L IR I-J P c te s c

N E a ib 1 0 0 n L IN C -J P io p o r L IC A -F R P L A M L -K R T o ta l 3 ,5 3 4 ,7 0 7 r e a d s 0 E S A D -U K i a s s i 2 s s 1 2 s i e s a ii a s e s i a i s i a .3 s .1 s s ii s a is s i s s s i s a s P n is a s s i m e i s i is s n r a a a C i i a s ia la a r s f s i i a is s B u i 5 S i l s e m i s tt s u l i u n n h 4 n a d li r 0 n p a n u ic n c o s a n l u M r c n m n n t i c 0 u l l n k m s J u i a n u u n id K a il e u a i i A t 1 a i e o ie t c o t o s e d u a e r i e w u a a i s a n 6 u ic h n s d s u e in i l e o a t x c o e a ic r S o t r o n c e n n t o e e s M e a r o u a v l z t r A r c c n a t t ti a z t i r i k n r o T i p h r c r t a J r p m b T s S r t a c a a g ic r u p i c u e r t s e R i n v z r s ic y e u p i l n n u a i u if b s t o r e le h o s u e y a c in o i t e r p p e r i m a u t p u i i o e a u y k ir s l e n c h C M D I-U K c g o p d p r o p u i e p u B K t e l g m x f l t c i e t a r q m a i t lu a P h g g i k o r n h s g e e e a g iv p i v a s t r o e e d c iv s e p o m e g n v s a c s n i r r p r n s p d v i r s v s n n u f n l p u r i h t r i w s a d s p K t n c s o u o e n a c u e l c o f o e e s a e s p a in n m a i io p o a s i n l m e o tu m o r tis s u e r o s t l x o ll a n i s m a u r n a u l r p s o s l o o o o t t p la o v Ta os ltt av l 3o 0f 2l 5 us ae mo fpr le sa a a e d a a s n a a s n x e n s a p io e a u i la s i d u n a e n t t p t la d s a l u s r g e s p r a e ll n b n u s s m i i p a e y r u a a r o p u c b l iq c e s l t s a y x in l l a l c r t a n r u e o a p i e s i e a r a n r m l t a la r e l m c s d m n n l u t a il e n e e e n j e e o s l e o l s m t e r p l r c l u f iq r s e n c o n d e o f e n p il a p m c a r ie d u c s ll d c t e h n a e n c e t l e t m p a l is m t s l C L L E -E S o s a x o v o o o m s u s x r u o l e lo i a a c o d la n h o o t e a s i r p e e a O a s a u h t a m a s s s s m o s v o c i a a r a ia e t s m e a a r a s d m p m l k n n a c u r n a v c a e s r x h a u u a p c t s a r m u a o a s u a b r t l t c c b y v b n t t m o i e lo r o o c a a i e y B i in il l i m a tc h e d n o rm a ls l n i o l u s m a o i d c n m c l o m a a o o B s o r a a iu ia iu t l m w o a e o n a r h m s r il t c s t o o i c n id n i y u B o o a e a r a le m i i o B n i r r i u o m m e l b r n i b n p n p h a s o a o lf d v h o o u o id l c z o v u i r b r r v a g e h b o e l e m o r o e n iu u o a e p c r u u R u d c c g n o e o i i b s r ic la K a p th e m e d o t o e e v B d o h y o t d d r c n s s a r c c m e io u p o m id o c s h m o r e l o t o r m in l d t V u a S h r c r o m e n l a c a g m s o c o c B T C A -S G o e r li A m in t p c r o d e o b e t s u o n o m o g l c e d S h a r o d o o e n i u b e c t D s e c o o v A P o i t r e S M t e u n P n h a r e u p a y l r m t o T i u d m o S o d a s c o a h ls a l y h c c h l l C R E e p h a R in o P s a c b G d o h c c e i t o p c h e a lth y d o n o rs P V y h d i i y d t o n il l a k h b e t y h C k u a h H N v e c e u P d p u r p a n n A a c K v i e S r m h P s o p u h r H o p a c r o e ic o p S h a b c c m p k lo S s r e k b l ia a to t t s l o e R o p r a O e a S u u o r p P th s r u i y S r H p B R C A -U K h r s u d o r A C S u y o o n L p S P A t r a B X l P B H B h n P P u B h p e e R o P C t y B t d a io p u r r o S h e o B t t n t X p ta C S e G M h o S t e R r S B O C A -U K S M P 1 0 0 0 G e n o m e s

G 1 0 0 0 g e n o m e s H B O C A -U K 1 5 0 0 0 B R C A -U K tu m o r tis s u e B T C A -S G m a tc h e d n o rm a ls

C L L E -E S .

b 1 0 0 0 0 h e a lth y d o n o rs

C M D I-U K . p

E S A D -U K . p

L A M L -K R r L IC A -F R 5 0 0 0 L IN C -J P L IR I-J P tu m o r tis s u e O R C A -IN m a tc h e d n o rm a ls 0 O V -A U h e a lth y d o n o rs i ii s 2 l a s s 1 s a s 4 s t s li 0 s C s m ia e r t n i n o s e u n s u s n l 4 a P A C A -A U e t 1 c o s c e e o u a i . i e a S e u n i c u r i t h B h k r b c a n r c t ir r n ic n t c u u K s i i i y s e a f p R o P A C A -C A i s h g v a l e v c e i o O p id K e a o r s c r t p m l c ic l o s o m it l a l s p r r i m o e t r a s u P A E N -A U i a u c n u m u p lu e n e t ll s a e e i e l r l n f e m a n e i h u d r a f e i e e n o c x m c e c d s d s a p P A E N -IT t m a m ta t h s h a u r s a r ia s c a a l s a a l s o t a s s n B i il u n o l a B E i a a u n a l l u v a h b o th c i o d e P R A D -C A R d o r t m i c a h n i i r o n c m m o m a s d e u b ip o v i R n io o o R o l P b P R A D -U K a c S a c d H t c h e i p o c p l r A m o l u y K p r y e a c o R E C A -E U u s L i r u H P h l t C p P A o a n 1 2 3 4 5 6 7 t e S t 0 0 0 0 0 0 0 S 1 1 1 1 1 1 1 r p p b I 1 5 0 0 0 J 1 5 0 1 0 0 0 0

5 0 0 0

. tu m o r tis s u e

s

t

b i

. 1 0 0

m a tc h e d n o rm a ls h

p .

6 0 f

o

p

. r h e a lth y d o n o rs

4 0 o 5 0 N

2 0

0 0 t s s s s li t s s i s i s P s n e C i a a ii m 5 s n s s A a T m is 1 s s C s 1 e P s a s i a i l 5 i s s a s n n a s r c n n u n 6 c 1 n l u m m l s T n A s K a a n s t i i k u s n a p o n i iu a a s r n n n a n c n k n r u 1 n n c n T e r r o s r c a c u e i e a h Z t r u i s s K u i i u t s a n o 6 i u c o p e u e z i r i d o b g l s t m r v a e e o a e a i p i a a a r u c ir lu t o n i c y u u M n o i i r u u T r c r c s u e Z l h r i p s d iv s s d a v p m e r u e v v c c p i g z t o m i b s d t s e i v a n k o fa x t c i p i n B r r e o u r u o v r ll n a e o p e n i s o y o a v m d t a s u s i s i p o n t e a n l p i i c M u y u s e n c e a p e e m s r s n e is s d m r c v o a r r n a s u r u s ly t d e v e v s v o n t k e a v n a i x i a u t o e s l o p e i li p m e a m e m p i i r a i i n f p y a m B n l e p a a l m r r r q t r a p r u r t l c o s e o a d p s a v o t f m d e e p e li e a s p e i m u e a l n s n p r e m e s s y o o m a n s t c t t a te r a r i r n o e s e s m l n s s m s o t o n a c h a s e i u a r h p a o a e o a p s u u i r i m a i d a o m ia c n o t a te v e e t a e u p r l r m m i p m a l e e a p e t s y i s m n a n t b a a o r h a a t n l r n r l a r r q r i o n u l u t a o b u o i r h c l c H e d f p s e u u i a p t e t d o e a o lm i a a b t p n r a a m e m p e t s i n e e i l e t a m o d r m o r r n o m a T a a o o a t r r c t i a u i p s r d a m e m r o l u o ia e b m b m m t s h o n c a r h u s a e m v u o n t e y a t S o a s o c e a e a o t p o a a P e n S o c u c l d m u o u i s a d m i m o n a n e i a r v s i i S a e d r H S u h P u a u d l d H s t t a t t a h b a e P d r n s u a l t r i u i a n n u o y u o b m n c h i p r a m C b e e r H if c f u a a l i m o c b a a t o r l H u p o p e i G s e i r o e d r l m o a o a r T n e u d u o s n M S B a B d r m o m o r u n a e s r H d P b i m s u e a o n b d l m b a i m C P io i v e m e p t o u e la o t S P o o v n o u P S d o o y u o m c a u G h p r a S a s a c r d e h S P r l R o a i d H i u d H d u u r P r P a C u i s t i l r u P m u n b e e if p e if H e c p u e i s G u o n s B o M B S a H s d P P d b C P r o i o i o v p h r G o R a r P P

46

0.00 3.75 15.00

Project Type 1000 Genomes control BOCA-UKbioRxiv preprintnormal doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which Figure 2 BRCA-UK wastumour not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. BTCA-SG CLLE-ES CMDI-UK ESAD-UK LAML-KR LICA-FR LINC-JP LIRI-JP ORCA-IN OV-AU PACA-AU PACA-CA PAEN-AU PAEN-IT PRAD-CA PRAD-UK RECA-EU

id Methylophaga frappieri Burkholderia phytofirmans Hepatitis B virus Nocardia farcinica Pseudomonas mendocina Sphingopyxis alaskensis Rhodococcus erythropolis Ochrobactrum anthropi Gordonia polyisoprenivorans Sphingomonas wittichii Zymomonas mobilis Chelativorans sp BNC1 Pseudomonas resinovorans Pseudomonas fluorescens Pseudomonas poae TKP Pseudomonas sp paradoxus Variovorax Delftia acidovorans Pseudomonas putida Pseudomonas savastanoi Pseudomonas protegens Stenotrophomonas maltophilia Achromobacter xylosoxidans Human herpesvirus 5 Serratia marcescens Serratia liquefaciens Salmonella enterica Parvibaculum lavamentivorans Plautia stali symbiont Cronobacter sakazakii Enterobacter sp 638 Serratia proteamaculans Serratia plymuthica Rahnella aquatilis Enterobacteria phage HK022 Escherichia phage HK639 Klebsiella variicola Brevundimonas subvibrioides Pantoea ananatis Enterobacteria phage P1 Methylibium petroleiphilum Pseudoxanthomonas spadix Raoultella ornithinolytica Pseudomonas brassicacearum Alteromonas macleodii Rhodopseudomonas palustris Xanthobacter autotrophicus Gluconobacter oxydans Sphingomonas sp MM Sphingobium japonicum Bradyrhizobium sp BTAi1 Bradyrhizobium sp S23321 Methylobacterium populi Methylobacterium extorquens Methylobacterium radiotolerans Xanthomonas euvesicatoria Thauera sp MZ1T Xanthomonas campestris Acidovorax sp JS42 Comamonas testosteroni Pseudomonas stutzeri Ralstonia pickettii Alicycliphilus denitrificans Acidovorax sp KKS102 Cupriavidus pinatubonensis Ralstonia solanacearum Cupriavidus metallidurans Acidovorax ebreus Pseudomonas aeruginosa Micrococcus luteus Propionibacterium acnes Escherichia coli Enterobacteria phage phiX174 sensu lato Alcanivorax dieselolei A0.3 Shewanella sp Psychrobacter sp PRwf.1 petrii Burkholderia vietnamiensis Burkholderia cenocepacia Burkholderia lata Burkholderia ambifaria Burkholderia multivorans Burkholderia phage KS5 Human mastadenovirus C Herbaspirillum seropedicae Corynebacterium urealyticum Corynebacterium kroppenstedtii Shewanella baltica Streptococcus mutans Leuconostoc carnosum Maraba virus Enterobacteria phage lambda Aggregatibacter aphrophilus Bifidobacterium dentium Streptococcus oligofermentans Campylobacter curvus Propionibacterium propionicum Capnocytophaga ochracea Streptococcus constellatus Streptococcus intermedius Fusobacterium nucleatum Atopobium parvulum Prevotella sp oral taxon 299 Streptococcus anginosus Filifactor alocis denticola Treponema forsythia Tannerella Porphyromonas gingivalis Prevotella denticola Selenomonas sputigena Prevotella intermedia Olsenella uli Prevotella dentalis Alistipes shahii Bacteroides salanitronis Bacteroides fragilis Porphyromonas asaccharolytica Streptococcus dysgalactiae Rothia dentocariosa Neisseria lactamica influenzae Haemophilus parainfluenzae Prevotella melaninogenica parvula Veillonella Rothia mucilaginosa Streptococcus salivarius Streptococcus parasanguinis Streptococcus pneumoniae Streptococcus mitis Streptococcus pseudopneumoniae Corynebacterium variabile Streptococcus oralis Streptococcus thermophilus Lactobacillus fermentum Lactobacillus rhamnosus Streptococcus phage 20617 Lactobacillus plantarum Streptococcus sp I.P16 Lactobacillus salivarius Clostridium sp SY8519 Lactobacillus phage Sha1 Streptococcus sp I.G2 Lactobacillus brevis Lactobacillus buchneri Lactobacillus paracasei Lactobacillus casei Lactobacillus phage Lc.Nu Lactobacillus sakei YMC.2011 Streptococcus phage Arcanobacterium haemolyticum Streptococcus gordonii Streptococcus sanguinis Human herpesvirus 1 Staphylococcus haemolyticus Enterococcus faecalis Geobacillus thermoleovorans Geobacillus sp WCH70 Leuconostoc kimchii Anoxybacillus flavithermus Dechloromonas aromatica Meiothermus silvanus Delftia sp Cs1.4 Lactobacillus delbrueckii Burkholderia gladioli Arthrobacter arilaitensis Alicyclobacillus acidocaldarius Deinococcus proteolyticus Exiguobacterium sibiricum Enterococcus faecium Bifidobacterium longum Cryptobacterium curtum Bifidobacterium breve Akkermansia muciniphila Eggerthella lenta Bacteroides thetaiotaomicron Parabacteroides distasonis Bacteroides vulgatus Bacillus amyloliquefaciens Bacillus subtilis Bacillus sp JS Bacillus phage SPbeta Gardnerella vaginalis Staphylococcus aureus Staphylococcus epidermidis Human endogenous retrovirus K Enterobacteria phage M13 Lactococcus lactis Acinetobacter calcoaceticus Staphylococcus saprophyticus Staphylococcus warneri Human herpesvirus 4 Cupriavidus necator Thiobacillus denitrificans Bacillus weihenstephanensis Bacillus thuringiensis Bacillus toyonensis B

47 bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Figure 3

A Chronic Pancreatic B lymphocytic cancer leukemia

Bone cancer P. fluorescens

TSNE TSNE 2 P. poae P. sp TKP

Chronic myeloid disorders & bone TSNE 1 cancer C Bone cancer

Liver

cancer TSNE TSNE 2

Mastadenovirus C G. polyisoprenivorans

TSNE 1 Chronic lymphocytic leukemia D E F C. metallidurans

Cluster 4* M. populi

Cluster 2**

TSNE TSNE 2 TSNE 2 TSNE TSNE 2 P. protegens

Hepatitis B

B dentium Human Herpesvirus 5

TSNE 1 TSNE 1 TSNE 1 Esophageal cancer Liver cancer Pancreatic adenocarcinoma G H G. oxydans R. palustris Cluster 5*

P. sp TKP

TSNE 2 TSNE TSNE TSNE 2

S. marcescens

TSNE 1 TSNE 1 Prostate adenocarcinoma Renal cancer

48 bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Figure 4

A B C D 1 0 0 % 8 0

6 0

s s

r N o G . p o ly is o p re n iv o ra n s

t

a

n

e e

i G . p o ly is o p re n iv o ra n s

y

t

a 4 0

n 6 0

p

i

5 0 %

f

e

o

g

. o

A 2 0 N 4 0

0 0 % B C 1 2 3 / 1 2 3 r r r A t r r r e e e e te te te t n t t t e i s s s s s s n B u u u i lu lu lu l l l B C C C E C C C F G H 1 0 0 % 1 5

e 8 0

n

e

s

r

g

s

a r

n 1 0

e

e

o

y

i

c

t 5 0 % n

n 6 0

a

i

a

r

c

e

e

t

f

l

g a

o 5

A

. o

N 4 0

0 0 % 1 2 3 4 1 2 3 1 2 3 4 r r r r r r r e e e r r r r te te te te t t t te te te te s s s s s s s s s s s u u u u lu lu lu u u u u l l l l l l l l C C C C C C C C C C C I J K 1 0 0 % 1 0 0 % 1 0 0 %

5 0 % 5 0 % 5 0 %

0 % 0 % 0 %

1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 r r r r r r r r r r r r r r r e e e e e e e e e e te te te te te t t t t t t t t t t s s s s s s s s s s s s s s s u u u u u lu lu lu lu lu lu lu lu lu lu l l l l l L C C C C C C C MC C C C N C C C C

8 0 8

s e

r 8 0

n

a

e e

s 6

y

r

g

s

n

a

r

i n

6 0

e

e

e

o

i

y

c

g

t

n

A a

n 6 0 4

i

r

a

e

c

e

t

l

f

g

a

o

A 4 0 . 2 i i i o s s 2 2 s s i s s a a i i s s 4 4 n n n n t t m m i i

u t t N u u a a o o u s s r r 4 0 e e S S r r e e o o e e u u t t r r J J ic ic t t k k c c s s b f f e e u u n n i i e e b i i t t l l i i ic ic n n e e p p r r s s g g p p s s it it o o s s u u p p o o x x t t u u r r p p m m a a x x n n a a a a a a r r a a e e s s c c e e i i j j 0 r r e e c c a a n n c c o o d d t t v v o o o o s s to to m m s s o o v v s s s s c c a a u u a a . . . . u u a a o o ls ls i i 1 2 3 4 5 b d d o o l l r r n n n n . .b .b b i i d d i i n n o o a a b b o o . c c i i h h o o ic ic R R o o r r r r r p p p p A A c c p p m m g g m m e e e e e . . . . i i m m M M o o o t t t t t A A l l a a o o in in p p p p o c c o d d n h h s s s s s r r r r n o y y h h t t n m m n u u p n n u u u u u 0 ic ic o o e e p l l l l l 0 0 0 l l s s S S a a 1 0 0 0 C C X X C C C C C - A A P P o 1 0 0 o o - 1 o n o n 0 - 0 n n n 0 1 1 0 - 0 0 1 0 0 P Q 1 O 1 0 0 %

8 0 s

1 0 0 % r

a

e y

N o t a lte re d

n 6 0

5 0 % i

e g

A A lte re d 4 0

0 %

1 2 d d e r r t te te te c c e s s t te lu lu e e C C d D t 5 0 % o N

0 % 1 2 3 4 r r r r te te te te s s s s lu lu lu lu C C C C

49 bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Figure 5

A C

DNA damage response, signal Mitotic cell cycle transduction by p53 class mediator

+ - + -

Ubiquitin-protein ligase activity Blood coagulation involved in mitotic cell cycle

+ - + -

Cysteine-type endopeptidase Macrophage activity involved in activation apoptotic process

+ - + -

B D Ubiquitin- DNA damage protein response, ligase Interleukin-1 signal activity beta transduction involved in production by p53 class mitotic cell mediator + - + - + - cycle Epidermal Protein growth factor Immune catabolic receptor response signalling process pathway + - + - + - Cysteine-type Sequence- endopeptidas specific DNA Activated T cell e activity binding proliferation involved in transcription apoptotic factor activity + - process + - + -

Signal Cell Cell cycle transduction differention + - + - + -

Tumor Interferon- necrosis B cell Cytokine Apoptotic gamma factor proliferation secretion process production production + - + - + - + - + -

release of epithelial cell cytochrome c cell-substrate MAPK inflammatory proliferation from adhesion cascade response mitochondria + - + - + - + - + -

50 bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Extended Data Table 1

Phage Virus Bacteria All

Project matched tumor- Project matched tumor- matched tumor- matched tumor- matched tumor- healthy total healthy healthy healthy healthy Project control tissue total control tissue control tissue control tissue control tissue normals reads in normals normals normals normals samples samples samples samples samples samples samples samples samples samples samples bn

1000 365 365 85.85 444 13,725 764,787 778,956 Genomes

BOCA-UK 76 76 152 193.74 1 0 25 0 22,502 166,820 22,528 166,820

BRCA-UK 46 46 92 121.42 3 3 4,602 10,404 2,730 1,800 7,335 12,207

BTCA-SG 12 12 24 44.30 0 0 0 0 259 554 259 554

CLLE-ES 97 97 194 179.15 3 4 515 676 11,548 13,256 12,066 13,936

CMDI-UK 32 32 64 84.55 23 20 2 2 28,464 4,759 28,489 4,781

ESAD-UK 97 97 194 343.24 61 5 42 140 9,258 33,838 9,361 33,983

LAML-KR 10 10 20 20.42 173 2 87,268 29 358,442 515 445,883 546

LICA-FR 6 6 12 17.85 0 0 2 0 1,065 1,584 1,067 1,584

LINC-JP 31 31 62 102.88 104 239 37 140 99,733 324,055 99,874 324,434

LIRI-JP 256 256 512 641.41 0 0 3 21 182 376 185 397

ORCA-IN 13 13 26 41.72 0 0 0 2 80 12,589 80 12,591

OV-AU 73 73 146 220.69 11 16 30 24 3,316 3,394 3,357 3,434

PACA-AU 97 97 194 348.55 13 18 145 97 7,142 6,348 7,300 6,463

PACA-CA 147 147 294 337.77 32 19 31 83 377,496 566,397 377,559 566,499

PAEN-AU 49 49 98 165.19 4 4 102 151 1,128 1,371 1,234 1,526

PAEN-IT 37 37 74 82.95 0 0 7 4 733 909 740 913

PRAD-CA 124 124 248 269.85 21 35 8 61 115,213 412,634 115,242 412,730

PRAD-UK 33 33 66 111.30 50 6 3 2 24,961 5,601 25,014 5,609

RECA-EU 94 94 188 377.53 2 4 49 50 15,442 13,624 15,493 13,678

TOTAL 365 1,330 1,330 3,025 3,790.34 444 501 375 13,725 92,871 11,886 764,787 1,079,694 1,570,424 778,956 1,173,066 1,582,685

51 bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Extended Data Figure 2

O th e r s p e c ie s m a tc h e d A 1 0 0 ,0 0 0 B A lp h a p a p illo m a v iru s B lo o d d e riv e d s a m p le s (n = 3 6 5 )

B 1 0 ,0 0 0 H e lic o b a c te r p y lo ri

P ly m p h o b la s to id c e ll lin e s a m p le s (n = 1 0 2 )

P R

1 ,0 0 0 S R R 1 6 1 1 1 2 8

d e

h 1 0 0

c S R R 1 6 1 1 1 2 7

t

a m 1 0 S R R 1 6 1 1 0 0 0

1 i s i 2 s a i s s 2 s s r ii s a 1 i i s 3 s i e a ri s s s s s ri i r s s s s li B ti n s i1 i s 4 . s n i u . a i is n s s i n s s a s i e m i n e is s m m s 0 1 s e o a i i is 0 s e i s n 6 t n u 4 n d to n m h u S id f li r c l s a i n a o n ti m ic li e s u a lu n e il u M i 1 u l l s s n S R R 6 4 3 2 7 0 9 il 1 u n s o o a e a e A i u c s e J t n t i 0 u n s c u z r s u z n c i i tu i e n c in t u u d P l o o i t e a c s e r r S r i a ra r ti u r w a a e t A v n id e o n e a c o i n a t ti n h u u r i a M d c i C D i l io n S r c n in c k u J e c T m c a t r b p u n p y r e e ra t d t c a c e r u e o a q ff u i i it h e ic r e m b K a e fi a u c u l B r e o i i p R i m h p n x u r e s fi l in e o t n e p e g r a i g p p if n M A p s l d o u e i g i i ir i d a p o fa e v e v e s P g a n o h l o u i g s c o l n w d s s g 2 e a s K c g u r h v p li s t p n c w s s a u la p s p o o s t fl o r s g a o s a m c to n i b o n p o ie c s h m r it c l s o e id o a x s a p a l o b n s g t it u a f n u s h a u m u a s a r s p i 1 r o o e r s p s u in e i s a a u x i u s s d s e a u v e r la a u y lo a u in s c il i p o e r s x m s c m a n 4 s h d c t u i r n r e i t c d p u i n a p r ll n s b t p l ll t o s e n c s n c a c n a e u a j r n th e lu s l r u a e e n e a iq e c a n r i o r la r e e t y s c u a t e c u c n e l n e r c iu s s n iu e Y p x s t il il e d rp c r ra l m id a l e o c l a e a x c c r d o u c la o t a p h c u r te a a e e e c x c t h s h to m o o o u s o o v m te e te d s n n n s r lu o a s c o s t m a m id u ll o m m p r d b a c a c t a s c e c v l i u v a h o a r r a i u l c a s m c e c n c a u p u o r c i te u s c r o c a r a s h ls s m y b a i s m B o c c o s lo p ll e i c o p o o o m r l s i i c t c c m i ia a e y a r b o a s n lu lo o o c ri ft o n id d a e a w il i t c o c n lu t a c m t o e l u r s r s o c b r s b t iv i e o B B b u o i E n a u d iu m c l ia a n b B u s c c a r s o l c i p h te e c e e lo o a a c m h t l v i ll h d y i r iz o p e g c u b c e e u a c o u i a h o a d i i t u t y lo c b g o e lu o a n o t c y o n i m a R i h c a u n n m A e o rd o c h a B i t il m c t r ll e u e c s c c c c c n b g s l t a t p d o c ip v te h c D o i s r t V id B b p a L o e tr e a b o a c h to to i o s i e ib c o a 0 5 0 i i a o l m p A s r o C t h u h a e o S o M e h a t te w b c a p o h d in i c n t l p R b c p d c u ia a c u y l s P n c v r p b R n S s le b o b t p c p i e a i e li m . . . B u r t a ll d y l p H c G i o a m t o m o o P s o o o c ta p e a if h c a A r a A o y H p S b i h a S y c l i o i E v u t K t o lo e r L S p N ib g t u 0 0 1 r e ic c a s y r o S h l e p s t y S t B A e S n C P s l u lo r p R A p r m C re il n e u p P S S n r a a B a P h u h e T c i r h e r P A C y B t p c a P c t F e t g h a C a tr e a g G t S t A H B A S P e S M A S M

C D E

s

s s

d

d d

a

a a

e e

e 6 6 6

r

r r

d

d d

e

e e

p

p p

p p

p 4 4 4

a

a a

m

m m

A

A A

2 2 2

W

W W

B B

B r = 0 .6 8 (p = 0 .0 0 0 0 8 ) r = 0 .7 1 (p = 0 .0 0 0 1 7 ) r = 0 .9 7 (p = 0 .0 0 3 8 )

g

g g

o

o o

l l l 0 0 0 2 4 6 2 4 6 2 4 6 lo g K ra k e n m a tc h e d re a d p a irs lo g K ra k e n m a tc h e d re a d p a irs lo g K ra k e n m a tc h e d re a d p a irs

F G H t

4 -1 5 1 .0 n

h r = 0 .5 9 2 3 (p = 1 0 )

e

t

i g

6 c

i

n

f f

e 3

S

l

e

G

y

o

l

c

W b

4 0 .5

2 n

0

m

o

1

e

i

s

g

t

s

o

a

l l

a 2

1 e

r g

r = 0 .5 6 (p = 0 .0 0 7 9 ) r

o

l

o c 0 0 0 .0 2 4 6 0 1 2 3 4 5 lo g K ra k e n m a tc h e d re a d p a irs lo g 1 0 R N A -S e q

I J K 1 5 0 ,0 0 0 8 1 5 -15

r = 0 .5 7 8 2 (p < 1 0 )

s

r

i

s

l

t

a

a

e p

m 6

r

s

o

l

d

l

a n

1 0 0 ,0 0 0 t

a

u

d

f

a

e

e

r

d

h

o

l 4

i l

1 0 c

t

t

a

%

a

a

f

0

m

R

o

5 0 ,0 0 0 1

0

1 2

o

g %

-1 5 t

o l

0 r > 0 ,9 9 9 (p < 1 0 ) 1 0 5 0 0 2 4 6 8 0 5 0 0 0 0 0 1 0 0 0 0 0 0 1 5 0 0 0 0 0 C ii s e tt s a e lo g 1 0 tu m o r tis s u e u o e n A ll re a d p a irs r p k c i ic a v s p o a n a m e n i iu d o n r a m to te t o s c s d l a a u a b m e R i s n n P io a p m ro u P H

52 bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Extended Data Figure 3

A

1 5 0

s

e l

p 1 0 0

m

a

s

f 1 0 0 0 G e n o m e s

o

. 5 0

o N B O C A -U K P A C A -A U 0 P A C A -C A P A E N -A U 0 1 2 3 4 5 6 7 8 9 0 P A E N -IT 1 P R A D -C A  P R A D -U K B L IR I-J P L IC A -F R E S A D -U K C L L E -E S 4 0 0 B R C A -U K

s R E C A -E U e l B OBCTAC-UAK-S G

p C M D I-U K P A CL A-MA UL -K R m L IN C -J P

a P AOC RA -CC A -I s

OP AVE-NA -UA U f 2 0 0 N

o P A E N -IT

. P R A D -C A o

N P RBAODC-UAK-U K L IR I-J P 4 0 0 P A C A -A U s L IC A -F R

e 0 P A C A -C A l E S A D -U K p 0 1 2 3 4 5 6 7 8 9 0 P A E N -A U m 1 C L L E -E S a  P A E N -IT

s B R C A -U K C

f P R A D -C A

o 2 0 0 R E C A -E U

. P R A D -U K

B T C A -S G o

s L IR I-J P

N 4 0 0 C M D I-U K e

l L IC A -F R L A M L -K R p E S A D -U K L IN C -J P m 0 C L L E -E S a O V -A U s 0 0 1 2 3 4 5 6 7 8 9 B R C A -U K 1 f O R C A -IN 2 0 0 

o R E C A -E U

. B T C A -S G

o C M D I-U K N L A M L -K R L IN C -J P 0 O V -A U 0 1 2 3 4 5 6 7 8 9 0 O R C A -I 1 D 3 0  N

s tu m o r tis s u e

s 2 0 e

n m a tc h e d n o rm a ls h

c h e a lth y c o n tro ls i

R 1 0

0

) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) 5 2 2 4 4 4 4 0 2 2 2 6 6 4 4 8 4 8 6 8 6 5 9 2 9 6 9 2 1 6 1 2 4 9 9 9 7 4 6 8 1 = = 1 = 1 = = = 5 = 1 1 2 = = 2 = 1 3 n = = = n = = = = (n n = n = n (n n = ( ( n (n n ( n ( n ( ( n n n n ( n n (n ( ( ( ( ( ( ( ( ( K G K R R P IN U IT K K K F J P - U U A A - A U s -U S S U K - - - -U e U - E I- U - -J A A A C N C E - A A - - L A C I - - - N - D - m D C V A A E D A A C C E D M IC IN R R E A A o C L M A I O C C A A C n R T L A L L L O P R e O B B C S L A A P R P E C E P P P R G B 0 0 0 1

53 bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Extended Data Figure 4

Bifidobacterium animalis

Bifidobacterium dentium

Cronobacter sakazakii

Cupriavidus metallidurans

Gluconobacter oxydans

Gordonia polyisoprenivorans

Hepatitis B virus

Human herpesvirus 1

Human herpesvirus 5

Human herpesvirus 6A

Human mastadenovirus C

Methylobacterium populi

54 bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Parvibaculum lavamentivorans

Plautia stali symbiont

Pseudomonas fluorescens

Pseudomonas mendocina

Pseudomonas poae

Pseudomonas protegens

Pseudomonas sp. TKP

Pseudopropionibacterium propionicum

Rhodopseudomonas palustris

Salmonella enterica subsp. enterica serovar Typhi

Serratia liquefaciens

55 bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Serratia marcescens subsp. marcescens

Serratia plymuthica

Serratia proteamaculans

Thauera sp. MZ1T

56 bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Extended Data Figure 5

n = 218 All identified species-level taxa

n = 147 Excluding taxa also detected in healthy controls

n = 70 Excluding taxa detected in < 5 tumor tissues or matched controls (except Hepatitis B)

n = 68 Excluding all remaining phage hits

n = 49 Excluding likely oral microbiome contamination

n = 27 Excluding likely sequencing / reagent contaminants

57 bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Extended Data Figure 6

58 bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Extended Data Figure 7

A B C D

E F G

59 bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Extended Data Figure 8

A B C D

60 bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Extended Data Figure 9

A 1 5

%

,

s

y

d

l a

e 1 0

e

v i

r tu m o r tis s u e

t

e d

t m a tc h e d n o rm a ls

e

a

t t

a 5

u

r

p

g

e

t

n i 0 i t is ii s s s s 1 5 A C l s s a e s s a s s a s T l m n n u u n n m P i c n n c n 1 a u k n r s s 6 o u n in a n K r i i i a a a a i s p a i e o e t r ie e h a Z t z r r u u s u r b ic c T s e c t l im n u d o v ir ir r o o c o p g t c u M a y u i p n s e p lu a s u n e k d v B v v ir v v m e d s t n f e c p a d li x i s s ti y io r a s a e m a a l o n s v o m n o p e c s s a i e e s n n s p o e n r s a u r ly m m m t r e t p p iu e i o p a l a a r e r ti r r e e r l r lu m o s l iq p a r u u e e t p p d a f s n a e l m e e i i t a e e r e m t p s m o a t r r m c o p h h a t a s o a n n a a i u e e c a s e t c s a n o o i i t o a t t a s i e h s v m a n d m t t a r c c b y n n a a ia n o o a a r p h b u o l H a a n a b l t iu o u m lm r r r a a o d r o e m d o r r e a T b b i n o m m a m lo u m s u a i n v o p m a te m o d S e e S t o o o c u u m n y u l o P d e u S S a d d r ia a a h l P c o d s e r i i r u i H H u t u a d u r f f C l n H e u e P s e i i p m c ib u e p B B u G o u a e s S d M n s s P o C r H ib o P d v i P o r p o G a o h P r R P

B C 1 0

1 0

%

%

,

,

s

s

y

d

l

y

d

l

a e

a 1

e e

1 v

e

i

v

r

i

t

r

t

e

d

e

t

d

t

e

e

a

t

a

t

t

t

a

a u

r 0 .1 u

r 0 .1

p

g

p

g

e

e

t

t

n

n

i i 0 .0 1 0 .0 1

) ) ) ) 4 5 7 7 4 = 2 2 = = = (n n (n n ( a ( a x e s x a u l a t s a t l s m l a i r r t ia i r o r v n e o t d c m e a tu b h tc a m

61 bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Extended Data Figure 10

62 bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

REFERENCES

1. Schwabe, R. F. & Jobin, C. The microbiome and cancer. Nat. Rev. Cancer 13, 800–12

(2013).

2. Goodman, B. & Gardner, H. The microbiome and cancer. J. Pathol. 244, 667–676

(2018).

3. zur Hausen, H. Papillomaviruses and cancer: from basic studies to clinical application.

Nat. Rev. Cancer 2, 342–350 (2002).

4. Tang, K.-W., Alaei-Mahabadi, B., Samuelsson, T., Lindh, M. & Larsson, E. The landscape

of viral expression and host gene fusion and adaptation in human cancer. Nat.

Commun. 4, 2513 (2013).

5. Cantalupo, P. G., Katz, J. P. & Pipas, J. M. Viral sequences in human cancer. Virology

513, 208–216 (2018).

6. Walboomers, J. M. M. et al. Human papillomavirus is a necessary cause of invasive

cervical cancer worldwide. J. Pathol. 189, 12–19 (1999).

7. Schwarz, E. et al. Structure and transcription of human papillomavirus sequences in

cervical carcinoma cells. Nature 314, 111–114 (1985).

8. Perz, J. F., Armstrong, G. L., Farrington, L. A., Hutin, Y. J. F. & Bell, B. P. The

contributions of hepatitis B virus and hepatitis C virus infections to cirrhosis and

primary liver cancer worldwide. J. Hepatol. 45, 529–538 (2006).

9. Shafritz, D. A., Shouval, D., Sherman, H. I., Hadziyannis, S. J. & Kew, M. C. Integration

of Hepatitis B Virus DNA into the Genome of Liver Cells in Chronic Liver Disease and 63 bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Hepatocellular Carcinoma. N. Engl. J. Med. 305, 1067–1073 (1981).

10. Uemura, N. et al. Helicobacter pylori Infection and the Development of Gastric Cancer.

N. Engl. J. Med. 345, 784–789 (2001).

11. Watanabe, T., Tada, M., Nagai, H., Sasaki, S. & Nakao, M. Helicobacter pylori infection

induces gastric cancer in mongolian gerbils. Gastroenterology 115, 642–8 (1998).

12. zur Hausen, H. The search for infectious causes of human cancers: Where and why.

Virology 392, 1–10 (2009).

13. Goodwin, S., McPherson, J. D. & McCombie, W. R. Coming of age: ten years of next-

generation sequencing technologies. Nat. Rev. Genet. 17, 333–351 (2016).

14. Niedobitek, G. et al. Detection of human papillomavirus type 16 DNA in carcinomas of

the palatine tonsil. J. Clin. Pathol. 43, 918–21 (1990).

15. Gillison, M. L. et al. Evidence for a Causal Association Between Human Papillomavirus

and a Subset of Head and Neck Cancers. J. Natl. Cancer Inst. 92, 709–720 (2000).

16. Castellarin, M. et al. Fusobacterium nucleatum infection is prevalent in human

colorectal carcinoma. Genome Res. 22, 299–306 (2012).

17. Kostic, A. D. et al. Genomic analysis identifies association of Fusobacterium with

colorectal carcinoma. Genome Res. 22, 292–8 (2012).

18. Repass, J. et al. Replication Study: Fusobacterium nucleatum infection is prevalent in

human colorectal carcinoma. Elife 7, (2018).

19. Hanahan, D. & Weinberg, R. A. Hallmarks of cancer: the next generation. Cell 144,

646–74 (2011). 64 bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

20. Riley, D. R. et al. Bacteria-Human Somatic Cell Lateral Gene Transfer Is Enriched in

Cancer Samples. PLoS Comput. Biol. 9, e1003107 (2013).

21. Zapatka, M. et al. The landscape of viral associations in human cancers. bioRxiv

465757 (2018). doi:10.1101/465757

22. Khoury, J. D. et al. Landscape of DNA virus associations across human malignant

cancers: analysis of 3,775 cases using RNA-Seq. J. Virol. 87, 8916–26 (2013).

23. International Cancer Genome Consortium, T. I. C. G. et al. International network of

cancer genome projects. Nature 464, 993–8 (2010).

24. Gibbs, R. A. et al. A global reference for human genetic variation. Nature 526, 68–74

(2015).

25. Wood, D. E. & Salzberg, S. L. Kraken: ultrafast metagenomic sequence classification

using exact alignments. Genome Biol. 15, R46 (2014).

26. Hu, Z. et al. Genome-wide profiling of HPV integration in cervical cancer identifies

clustered genomic hot spots and a potential microhomology-mediated integration

mechanism. Nat. Genet. 47, 158–163 (2015).

27. Young, L., Sung, J., Stacey, G. & Masters, J. R. Detection of Mycoplasma in cell

cultures. Nat. Protoc. 5, 929–934 (2010).

28. Li, H. & Durbin, R. Fast and accurate long-read alignment with Burrows-Wheeler

transform. Bioinformatics 26, 589–95 (2010).

29. Caygill, C. P., Hill, M. J., Braddick, M. & Sharp, J. C. Cancer mortality in chronic typhoid

and paratyphoid carriers. Lancet (London, England) 343, 83–4 (1994).

65 bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

30. Scanu, T. et al. Salmonella Manipulation of Host Signaling Pathways Provokes Cellular

Transformation Associated with Gallbladder Carcinoma. Cell Host Microbe 17, 763–

774 (2015).

31. Geller, L. T. et al. Potential role of intratumor bacteria in mediating tumor resistance

to the chemotherapeutic drug gemcitabine. Science (80-. ). 357, 1156–1160 (2017).

32. Feng, Y. et al. Metagenomic and metatranscriptomic analysis of human prostate

microbiota from patients with prostate cancer. BMC Genomics 20, 146 (2019).

33. Kripalani-Joshi, S. & Law, H. Y. Identification of integrated epstein-barr virus in

nasopharyngeal carcinoma using pulse field gel electrophoresis. Int. J. Cancer 56, 187–

192 (1994).

34. Morton, C. et al. Mapping of the human Blym-1 transforming gene activated in Burkitt

lymphomas to chromosome 1. Science (80-. ). 223, 173–175 (1984).

35. Robinson, K. M., Crabtree, J., Mattick, J. S. A., Anderson, K. E. & Dunning Hotopp, J. C.

Distinguishing potential bacteria-tumor associations from contamination in a

secondary data analysis of public cancer genome sequence data. Microbiome 5, 9

(2017).

36. Thompson, K. J. et al. A comprehensive analysis of breast cancer microbiota and host

gene expression. PLoS One 12, e0188873 (2017).

37. Mazzoni, E. et al. Significant Association Between Human Osteosarcoma and Simian

Virus 40. doi:10.1002/cncr.29137

38. Ramanan, P., Deziel, P. J. & Wengenack, N. L. Gordonia bacteremia. J. Clin. Microbiol.

66 bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

51, 3443–7 (2013).

39. Ding, X. et al. Bacteremia due to Gordonia polyisoprenivorans: case report and review

of literature. BMC Infect. Dis. 17, 419 (2017).

40. Gupta, M., Prasad, D., Khara, H. S. & Alcid, D. A rubber-degrading organism growing

from a human body. Int. J. Infect. Dis. 14, e75–e76 (2010).

41. Austen, B. et al. Mutations in the ATM gene lead to impaired overall and treatment-

free survival that is independent of IGVH mutation status in patients with B-CLL.

(2005). doi:10.1182/blood-2004-11-4516

42. Balthazar, E. J., Megibow, A. J. & Hulnick, D. H. Cytomegalovirus esophagitis and

gastritis in AIDS. AJR. Am. J. Roentgenol. 144, 1201–4 (1985).

43. Lepiller, Q., Tripathy, M. K., Di Martino, V., Kantelip, B. & Herbein, G. Increased HCMV

seroprevalence in patients with hepatocellular carcinoma. Virol. J. 8, 485 (2011).

44. Leonardsson, H., Hreinsson, J. P., Löve, A. & Björnsson, E. S. Hepatitis due to Epstein–

Barr virus and cytomegalovirus: clinical features and outcomes. Scand. J.

Gastroenterol. 52, 893–897 (2017).

45. Bruix, J. & Llovet, J. M. Hepatitis B virus and hepatocellular carcinoma. J. Hepatol. 39,

59–63 (2003).

46. Lai, C.-C. et al. Infections caused by unusual Methylobacterium species. J. Clin.

Microbiol. 49, 3329–31 (2011).

47. Langevin, S., Vincelette, J., Bekal, S. & Gaudreau, C. First case of invasive human

infection caused by Cupriavidus metallidurans. J. Clin. Microbiol. 49, 744–5 (2011).

67 bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

48. Ugge, H. et al. Acne in late adolescence and risk of prostate cancer. Int. J. cancer 142,

1580–1585 (2018).

49. Davidsson, S. et al. Frequency and typing of Propionibacterium acnes in prostate

tissue obtained from men with and without prostate cancer. Infect. Agent. Cancer 11,

26 (2016).

50. COHEN, R. J., SHANNON, B. A., McNEAL, J. E., SHANNON, T. & GARRETT, K. L.

PROPIONIBACTERIUM ACNES ASSOCIATED WITH INFLAMMATION IN RADICAL

PROSTATECTOMY SPECIMENS: A POSSIBLE LINK TO CANCER EVOLUTION? J. Urol. 173,

1969–1974 (2005).

51. Mahlen, S. D. Serratia infections: from military experiments to current practice. Clin.

Microbiol. Rev. 24, 755–91 (2011).

52. Fessler, J., Matson, V. & Gajewski, T. F. Exploring the emerging role of the microbiome

in cancer immunotherapy. J. Immunother. Cancer 7, 108 (2019).

53. Motzer, R. J. et al. Nivolumab versus Everolimus in Advanced Renal-Cell Carcinoma. N.

Engl. J. Med. 373, 1803–13 (2015).

54. Sung, W.-K. et al. Genome-wide survey of recurrent HBV integration in hepatocellular

carcinoma. Nat. Genet. Vol. 44, (2012).

55. Welcome | ICGC Data Portal. Available at: https://dcc.icgc.org/. (Accessed: 6th June

2019)

56. Index von /vol1/ftp/. Available at: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/.

(Accessed: 6th June 2019)

68 bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

57. European Nucleotide Archive < EMBL-EBI. Available at: https://www.ebi.ac.uk/ena.

(Accessed: 6th June 2019)

58. Afgan, E. et al. The Galaxy platform for accessible, reproducible and collaborative

biomedical analyses: 2018 update. Nucleic Acids Res. 46, W537–W544 (2018).

59. Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25,

2078–2079 (2009).

60. Picard Tools - By Broad Institute. Available at: http://broadinstitute.github.io/picard/.

(Accessed: 6th June 2019)

61. Quinlan, A. R. & Hall, I. M. BEDTools: a flexible suite of utilities for comparing genomic

features. Bioinformatics 26, 841–842 (2010).

62. Bolger, A. M., Lohse, M. & Usadel, B. Trimmomatic: a flexible trimmer for Illumina

sequence data. Bioinformatics 30, 2114–20 (2014).

63. Blankenberg, D. et al. Manipulation of FASTQ data with Galaxy. Bioinformatics 26,

1783–5 (2010).

64. Rognes, T., Flouri, T., Nichols, B., Quince, C. & Mahé, F. VSEARCH: a versatile open

source tool for metagenomics. PeerJ 4, e2584 (2016).

65. The R Foundation for Statistical Computing R version 3.3.2. (https://www.r-

project.org/).

66. Escapa, I. F. et al. New Insights into Human Nostril Microbiome from the Expanded

Human Oral Microbiome Database (eHOMD): a Resource for the Microbiome of the

Human Aerodigestive Tract. mSystems 3, e00187-18 (2018).

69 bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

67. Dewhirst, F. E. et al. The Human Oral Microbiome. J. Bacteriol. 192, 5002–5017

(2010).

68. HOMD :: Human Oral Microbiome Database. Available at:

http://www.homd.org/index.php?name=HOMD. (Accessed: 6th June 2019)

69. de Goffau, M. C. et al. Recognizing the reagent microbiome. Nat. Microbiol. 3, 851–

853 (2018).

70. Salter, S. J. et al. Reagent and laboratory contamination can critically impact

sequence-based microbiome analyses. BMC Biol. 12, 87 (2014).

71. Jervis-Bardy, J. et al. Deriving accurate microbiota profiles from human samples with

low bacterial content through post-sequencing processing of Illumina MiSeq data.

Microbiome 3, 19 (2015).

72. Leon, L. J. et al. Enrichment of Clinically Relevant Organisms in Spontaneous Preterm-

Delivered Placentas and Reagent Contamination across All Clinical Groups in a Large

Pregnancy Cohort in the United Kingdom. Appl. Environ. Microbiol. 84, e00483-18

(2018).

73. Laurence, M., Hatzis, C. & Brash, D. E. Common Contaminants in Next-Generation

Sequencing That Hinder Discovery of Low-Abundance Microbes. PLoS One 9, e97876

(2014).

74. Glassing, A., Dowd, S. E., Galandiuk, S., Davis, B. & Chiodini, R. J. Inherent bacterial

DNA contamination of extraction and sequencing reagents may affect interpretation

of microbiota in low bacterial biomass samples. Gut Pathog. 8, 24 (2016).

70 bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

75. Camacho, C. et al. BLAST+: architecture and applications. BMC Bioinformatics 10, 421

(2009).

76. Wally, N. et al. Plasmid DNA contaminant in molecular reagents. Sci. Rep. 9, 1652

(2019).

77. Embedding projector - visualization of high-dimensional data. Available at:

https://projector.tensorflow.org/. (Accessed: 6th June 2019)

78. Morpheus. Available at: https://software.broadinstitute.org/morpheus/. (Accessed:

6th June 2019)

79. Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows-Wheeler

transform. Bioinformatics 25, 1754–1760 (2009).

80. Wick, R. R., Judd, L. M., Gorrie, C. L. & Holt, K. E. Unicycler: Resolving bacterial genome

assemblies from short and long sequencing reads. PLOS Comput. Biol. 13, e1005595

(2017).

81. Sondka, Z. et al. The COSMIC Cancer Gene Census: describing genetic dysfunction

across all human cancers. Nat. Rev. Cancer 18, 696–705 (2018).

82. Warden, C. D., Yuan, Y.-C. & Wu, X. Optimal Calculation of RNA-Seq Fold-Change

Values. International Journal of Computational Bioinformatics and In Silico Modeling

2, (2013).

83. Warden, C. D., Kanaya, N., Chen, S. & Yuan, Y.-C. BD-Func: a streamlined algorithm for

predicting activation and inhibition of pathways. PeerJ 1, e159 (2013).

84. Ashburner, M. et al. Gene ontology: tool for the unification of biology. The Gene

71 bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Ontology Consortium. Nat. Genet. 25, 25–9 (2000).

72