bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
An atlas of bacterial and viral associations in cancer
Sven Borchmann1,2,3
1University of Cologne, Department I of Internal Medicine, Center for Integrated Oncology Aachen Bonn Cologne Duesseldorf, German Hodgkin Study Group, Cologne, Germany
2University of Cologne, Faculty of Medicine and University Hospital of Cologne, Center for Molecular Medicine, Cologne, Germany
3University of Cologne, Faculty of Medicine and University Hospital of Cologne, Else Kröner Forschungskolleg Clonal Evolution in Cancer, Cologne, Germany
1 bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
ABSTRACT
Bacterial1,2 and viral3–5 infections have widely been recognized as a cause of cancer. Examples
for viral infections are Human Papillomaviridae causing cervical cancer3,6,7 or Hepatitis B virus
causing liver cancer8,9. An example for a bacterial infection is Helicobacter pylori causing
adenocarcinoma of the stomach10,11. Known carcinogenic mechanisms are disruption of the
host genome via genomic integration, expression of oncogenic viral proteins12 or chronic
inflammation1.
Next generation sequencing has made it possible to gain deep insight into the cancer
genome13, while at the same time providing datasets with traces of bacterial and viral nucleic
acids that colonize the examined tissue or are integrated into the host genome. This hidden
resource has so far not been comprehensively studied.
Here, 3000 whole genome sequencing datasets are leveraged to gain insight into possible
novel associations between viruses, bacteria and cancer. Apart from confirming known
associations, multiple, previously unknown, possible associations between bacteria, viruses
and tumors are identified. The presence of some bacteria or viruses was associated with
profound differences in patient and tumor characteristics, such as patient age, tumor stage,
survival, defined somatic alterations or gene expression profiles when compared to other
patients with the same cancer.
Overall, these results provide a detailed map of potential associations between viruses,
bacteria and cancer and can be the basis for future investigations aiming to confirm these
associations experimentally and to provide further mechanistic insight.
2 bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
INTRODUCTION
Bacterial1,2 and viral3–5 infections have widely been recognized as a cause of cancer with a
wide range of carcinogenic mechanisms described. Examples for carcinogenic viruses are
Human Papillomaviridae causing head and neck14,15 as well as cervical cancer3,6,7 or Hepatitis
B virus causing liver cancer8,9. The main carcinogenic mechanisms at play are thought to be (1)
viral integration into and disruption of the host genome and (2) expression of oncogenic viral
proteins12.
An important example for a carcinogenic species from the bacterial domain is Helicobacter
pylori causing adenocarcinoma of the stomach10,11. The carcinogenic mechanism at play here
is thought to be a totally different one compared to carcinogenic viruses, namely sustained
inflammation caused by a chronic, mostly sub-clinical infection1. For some associations
between infections and cancer, preliminary evidence has been presented but simultaneous
presence of contradictory findings has led to widespread debate. An example for this is the
finding that Fusobacterium nucleatum is enriched in cancerous tissue of colorectal cancer
patients compared to tissue of benign adenomas or healthy colon mucosa16–18. Given the
diversity of possible carcinogenic mechanisms19 it is likely that other carcinogenic viruses and
bacteria exist which are currently unknown.
Recent advances in next generation sequencing have made it possible to generate large
amounts of data representing the whole genetic makeup of a tissue13. Resulting datasets
contain traces of non-human, or, more general, non-host origin that is present either because
of genomic integration or presence of the virus or bacteria in the tissue itself. While some
studies have already been performed with the goal to repurpose this data to unravel novel
3 bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
associations between infections and cancer4,5,20–22, these resources have so far been
underutilized to discover novel, possible associations between cancers and infections.
With this aim in mind, this study leverages a large, high-quality collection of over 3000 whole
genome sequencing datasets to gain insight into possible novel associations between viruses,
bacteria and cancer.
4 bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
RESULTS
Samples
A total of 3,025 whole genome sequencing datasets comprising 3.79 trillion reads were
included in this study (Figure 1A, Extended Data Table 1). These include 1,330 whole genome
sequencing datasets of tumor tissue samples across 14 different cancers from 19 International
Cancer Genome Consortium (ICGC)23 studies and their corresponding matched normal (e.g.
non-cancerous) tissue controls (n=1,330) (Extended Data Table 1). Included patients were
predominantly male (n=1,028, 60.6%) and elderly, with 47.3% of patients (n=801) at least 60
years old (Figure 1B-C). Additionally, whole genome sequencing datasets from 365 subjects
from the 1,000 genome project24 were selected as a healthy control group substituting for the
lack of negative sequencing controls to examine non-human DNA in the blood of healthy
donors. Only subjects, in which blood-derived DNA was directly subjected to whole genome
sequencing as opposed to DNA derived from immortalized lymphoblastoid cell lines (LCL)
(subset analyzed, n=102) were included in the healthy control cohort as the whole genome
sequencing datasets derived from LCL DNA showed a markedly different species-level taxon
distribution likely representing taxa present due to LCL culture and not present in the donor
itself (Extended Data Figure 2A, Supplementary Table 1, Supplementary Table 2). For
validation purposes and to assess differential gene expression in cancers associated with
certain species-level taxa all available matching RNA-Seq datasets for all included ICGC
studies23 were also analyzed (n=324).
Validation of Pipeline
Based on its high accuracy, especially with regard to precision and high speed necessary for
handling large amounts of input data, a pipeline was built around KRAKEN25, which at its core
5 bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
is based on the exact alignments of k-mers to their least common ancestor (LCA). Before
applying the pipeline to the datasets of interest, it was validated fourfold.
First, it was confirmed that the pipeline was able to identify already known bacterial and viral
taxa in tissue-derived bacterial isolates and cell lines independent from this study. The pipeline
was applied to a gastric mucosa isolate of Helicobacter pylori from a stomach ulcer patient
(sample ID: SRS2800545, run ID: SRR6432709). As expected, 3,606,537/3,607,687 (99.97%) of
read pairs matching any taxon in the database matched Helicobacter pylori (Supplementary
Table 3, Extended Data Figure 2B). Next, the pipeline was applied to 3 cell line samples (HeLa
(run ID: SRR1611000), CaSki (run ID: SRR1611128) and SiHa (run ID: SRR1611127)) with known
integrations of Human papillomavirus26. Applying the pipeline resulted in between 78.14%
and 99,90% of non-human read pairs matching any species-level taxon in the database to
match Human papillomavirus 7 or 9 (Supplementary Table 3, Extended Data Figure 2B).
Interestingly, the majority of non-human read pairs not matching Human papillomavirus 7 or
9 matched Mycoplasma hyorhinis (0.08%-21.74%), a common cell culture contaminant27.
Thus, it was confirmed that the pipeline can identify bacterial and viral taxa irrespective of
both a high level of human background DNA and whether the identifying sequence is present
in the form of host-integrated DNA or by presence of the viral or bacterial taxa itself in the
sample.
Second, it was confirmed that the taxonomic classification of non-human reads by the pipeline
is highly correlated to the classification by an independent approach, namely mapping non-
human reads to putative target genomes with BWA28 (Supplementary Table 4, Extended Data
Figure 2C-E).
6 bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
Third, in yet another independent approach, it was confirmed that assembled contigs from
non-human reads show high sequence similarity with the taxa identified by the pipeline
(Supplementary Table 4, Extended Data Figure 2F).
Fourth, the pipeline was independently run with the same settings on available matched RNA-
Seq and WGS data from the same tumor tissue samples. It was shown, that identified taxa in
pairs of RNA-Seq and WGS data from the same tumor tissue sample are correlated both in a
combined dataset of all pairs and for each sample separate for which RNA-Seq and WGS data
was available (n=324) (Extended Data Figure 2 G-H). This validation is especially helpful to
minimize the risk of contamination post sample processing though this was further mitigated
with the data analysis strategy described below.
A map of cancer-associated taxa
Due to the nature of this dataset, which was not created with the purpose of analyzing the
non-human taxonomic composition of cancer tissues, only a small fraction of sequence
information can be used to infer taxa present in the respective sample. Therefore, less
prevalent taxa are likely not detectable. Additionally, considering many samples will have
been collected under sterile, surgical conditions, detected taxa are more likely to indicate true
presence of that taxa in the tissue, limiting the detection of taxa that are part of the naturally
occurring microbiome of non-sterile tissues, such as skin or bowel. Therefore, on average, only
2.2 species-level taxa per sample were detected, though variation was high (Supplementary
Table 5, Extended Data Figure 3 A-D). For these reasons, many conventional methods of
analysis of taxonomic composition are not applicable to this dataset and the focus of the
present study is on identifying highly abundant species-level taxa that are potentially
associated with cancer, either by being present in the patient’s matched normal sample or
7 bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
potentially more directly involved in carcinogenesis by being present in the patient’s tumor
tissue, especially when it is likely to have been collected sterile.
In this study, a total of 3,534,707 read pairs matching bacterial, viral or phage species-level
taxa were detected across 19 studies in 3025 samples (Figure 1D-E, Extended Data Table 1,
Supplementary Table 6). Subsampling of 10% of all read pairs did not alter the detected
species-level taxa and their relative composition compared to analyzing all non-human read
pairs which was validated in a subset of patients (Extended Data Figure 2I-J, Supplementary
Table 7). A total of 218 species-level taxa could be identified in all examined samples (Figure
1F, Supplementary Table 8). Escherichia coli and Propionibacterium acnes were the most
commonly detected species in all samples.
In order to control for differences in sequencing depth between samples, all raw read pair
counts were normalized by dividing them by 1,000,000,000 total read pairs. This normalized
count was defined as read pairs per billion (RPPB). Overall, 18,112 (+/- 4,238 standard error
of mean (s.e.m.)), 20.003(+/- 4,295 s.e.m.) and 28,282 (+/- 9,081 s.e.m.) RPPB could be
detected on average in healthy control samples from the 1000 genomes cohort, matched
normal samples and tumor tissue samples, respectively, with a large variation across samples
and sequencing projects (Figure 1G). Of note, particularly high RPPB were detected in matched
normal, saliva-derived, samples from acute myeloid leukemia (AML) patients, as would be
expected from a naturally non-sterile source such as saliva. The taxa with the highest RPPB
are depicted in Figure 1H. Ralstonia pickettii was the species with the highest RPPB and almost
absent in healthy donors, while being detected at higher levels in tumor tissue than in
matched normal samples (Figure 1H). Except for Bacillus subtilis, Acidovorax sp. KKS102 and
Bacillus cereus, all species with very high RPPB were detected predominantly in tumor tissue
8 bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
or matched normal samples. A clustered heatmap of all species-level taxa detected in all
samples is provided in Figure 2.
Next, filtering was performed to exclude taxa that were (i) frequently detected in the healthy
control group, (ii) detected only in very few (<5) tumor tissue or matched normal samples, (iii)
detected phages, (iv) taxa that are commonly detected as part of the normal oral microbiome
and were mainly detected in saliva, oral or esophageal cancer tissue samples (Supplementary
Table 9), (v) taxa that have been previously described as sequencing contaminants
(Supplementary Table 10) and (vi) taxa, for which the detected reads were unevenly
distributed across their genome (Extended Data Figure 4). After all these filtering steps
(Extended Data Figure 5) 27 species-level taxa remained for further analysis (Figure 1 I-J,
Extended Data Figure 6, Supplementary Table 8).
Among these, known tumor-associated taxa, such as Hepatitis B virus8,9 or Salmonella
enterica29,30 were detected. Additionally, taxa were detected that have been implicated in
carcinogenesis before without enough evidence to support a carcinogenic role, such as
Pseudomonas species31,32 and taxa that have never been implicated in carcinogenesis before,
such as Gordonia polyisoprenivorans. Of note, most taxa in the final filtered species list were
detected at much higher RPPB-levels in tumor tissue compared to matched normal samples,
though most matched normal tissues in this study were blood-derived. However lower RPPB
could still be detected in matched normal samples and across all taxa tumor tissue samples
and matched normal samples are highly correlated (Extended Data figure 2K).
In order to better understand the association of detected taxa with different cancers and
among each other, a heuristic approach was used combining the non-linear dimensionality
reduction and visualization method t-distributed stochastic neighborhood embedding (t-SNE)
9 bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
with K-means clustering. To focus on taxa that potentially play a more direct role in
carcinogenesis and considering that RPPB of detected taxa were highly correlated between
tumor tissue and matched normal tissue with lower levels detected in matched normal tissue
(Extended Data Figure 2K), only tumor tissues were included in the following analyses.
Using a pan-cancer approach, a combined dataset of all tumor tissue samples was visualized
using t-SNE using the log2-transformed RPPBs of all detected non-phage taxa (n=204) as input
variables. First, two distinct groups of patients with chronic lymphocytic leukemia (CLL)
emerged. One group (n=24) was associated with presence of Gordonia polyisoprenivorans in
the patient’s tumor tissue and another group (n=15) was associated with tumor tissue
presence of Pseudomonas mendocina. Second, one distinct group (n=23) of pancreatic cancer
patients associated with presence of Cupriavidus metallidurans was observed. Third, one
group (n=22) of bone cancer patients associated with presence of Pseudomonas poae,
Pseudomonas fluorescens and Pseudomonas sp. TKP could be distinguished. Fourth, a group
of patients (n=14) with bone cancer (n=4) and chronic myeloid disorders (n=10) associated
only with presence of Pseudomonas sp. TKP was observable (Figure 3A). Overall, this approach
was helpful to get a first insight into groups of tumors from a pan-cancer perspective that
might be defined by presence or absence of specific taxa.
In order to gain further insight into the association of detected taxa and specific cancers, each
cancer (Supplementary Table 11) was examined separately, visualizing patient groupings using
t-SNE with log2-transformed RPPBs of all the final filtered taxon list (n=27) (Supplementary
Table 1) in tumor tissue as input variables combined with K-means clustering of patients using
the same input variables. Additionally, each cluster was associated with age, survival, gender,
number of alterations in known cancer genes or specific alterations in one of those cancer
genes. 10 bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
For bone cancer patients this dual methodology revealed a cluster of patients with presence
of Pseudomonas fluorescens, Pseudomonas sp. TKP and Pseudomonas poae (cluster 2) in the
tumor tissue, confirming the results of the pan-cancer approach used above (Figure 3B,
Extended Data Figure 7A).
For chronic lymphocytic leukemia patients this approach revealed 3 clusters, one of patients
with presence of Gordonia polyisoprenivorans in the tumor tissue (cluster 3), one of patients
associated with Human Mastadenovirus C (cluster 2) and one with the remainder of patients
(cluster 1), some of which showed presence of Pseudomonas species in the tumor tissue.
Clusters could be identified by both methods, t-SNE and K-means (Figure 3C, Extended Data
Figure 3B). The clusters representing patients associated with Gordonia polyisoprenivorans
(cluster 3) and Human Mastadenovirus C (cluster 2) were mutually exclusive (pFisher = 0.0124)
(Figure 3C, Extended Data Figure 7B). There was a tendency towards different ages at
diagnosis between the clusters (panova=0.0743) (Figure 4A). Additionally, there was a tendency
towards a difference in survival between the different clusters (panova=0.0745). Patients in
cluster 1 (no associations) had worse survival than patients in cluster 2 (Human
Mastadenovirus C) or 3 (Gordonia polyisoprenivorans) (p=0.024626) (Figure 4B). Patients with
an association with Gordonia polyisoprenivorans (cluster 3) were more likely to be of Binet C
stage (5/36 vs. 1/61, C vs. not-C, pFischer = 0.0252) (Figure 4C). These patients were also more
likely to be associated with TP53 mutations than patients in the other clusters (pFisher(cluster
3 vs. other)=0.0335) (Figure 4D).
For esophageal cancer patients 3 clusters were identified. K-means clustering revealed one
cluster of patients with presence of Human Herpesvirus 5 (cluster 3) in the tumor tissue and
one cluster with presence of Bifidobacterium dentium (cluster 2). The remaining cluster
contained patients with no discernable associations. However, these 3 clusters could not be 11 bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
differentiated by t-SNE. (Figure 3D, Extended Data Figure 7C). There was a tendency towards
different numbers of alterations in cancer genes between clusters (panova=0.0858), with
patients in cluster 3 (Human Herpesvirus 5) having less alterations than patients in cluster 1
(no associations) and 2 (Bifidobacterium dentium) (p=0.0392) (Figure 4E). Additionally, there
was a tendency towards a difference in survival between the different clusters (panova=0.1).
Patients in cluster 2 (Bifidobacterium dentium) or 3 (Human Herpesvirus 5) had worse survival
than patients in cluster 1 (no associations) (p=0.040928). (Figure 4F).
Patients with liver cancer could be grouped into 4 clusters with K-means clustering and t-SNE
agreeing. One cluster was defined by presence of Hepatitis B (cluster 1) in the tumor tissue. A
second cluster was defined by sole presence of both Pseudomonas and Serratia species
(cluster 3). A third cluster was defined by additional presence of Parvibaculum
lavamentivorans and Human Herpesvirus 5 as well as two additional Pseudomonas species,
Pseudomonas poae and Pseudomonas protegens (cluster 4), in addition to those detected in
the previous cluster. The fourth cluster contained patients with no discernable associations
(cluster 2) (Figure 3E, Extended Data Figure 7D). There was a tendency towards different ages
at diagnosis between the clusters (panova=0.0015) with patients in cluster 1 (Hepatitis B),
cluster 2 (Pseudomonas and Serratia sp.) and cluster 4 (Pseudomonas sp., Serratia sp.
Parvibaculum lavamentivorans and Human Herpesvirus 5) younger than patients in cluster 3
(no associations) (pt-test=0.000142) (Figure 4G). In a similar pattern, these patients had had a
higher frequency of alterations in RNF21 (pFischer = 0.0121) (Figure 4H).
For patients with pancreatic adenocarcinoma 5 clusters could be identified using K-means
clustering. The first cluster was defined by presence of Pseudomonas protegens in the tumor
tissue (cluster 5), the second by Methylobacterium populi (cluster 4) and a third, prominent
cluster was defined by presence of Cupriavidus metallidurans (cluster 2). The two other 12 bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
clusters contained patients with no discernable associations (cluster 1 and 3). Only the cluster
of patients associated with Cupriavidus metallidurans (cluster 2) could also be identified using
t-SNE (Figure 3F, Extended Data Figure 7E). Differences in the frequency of alterations
between patients in different clusters were observed in KMT2C, CDKN2A and RNF21. Patients
in cluster 2 (Cupriavidus metallidurans) ,4 (Methylobacterium populi) or 5 (Pseudomonas
protegens) had a higher frequency of alterations in these genes compared to the majority of
patients which were in cluster 1 (no associations) (n=187) (pFisher(KMT2C)=0.0308,
pFisher(CDKN2A)=0.0124, pFisher(RNF21)=0.0107) (Figure 4I-K).
For patients with prostate cancer, 5 clusters emerged using K-means clustering, one defined
by simultaneous presence of Thauera sp. MZ1T, Cupriavidus metallidurans and Pseudomonas
mendocina in the tumor tissue (cluster 5), one by presence of Pseudomonas sp. TKP (cluster
3) and another one by presence of Gluconobacter oxydans and Rhodopseudomonas palustris
(cluster 2). The two other clusters contained patients with no discernable associations. Out of
these, the cluster defined by Gluconobacter oxydans and Rhodopseudomonas palustris
(cluster 2) and the one defined by Thauera sp. MZ1T, Cupriavidus metallidurans and
Pseudomonas mendocina (cluster 5) could be confirmed using t-SNE (Figure 3G, Extended
Data Figure 7F). There were age differences at diagnosis observable between the clusters
(panova=0.00994) with patients in cluster 2 (Gluconobacter oxydans) and 5
(Rhodopseudomonas palustris) older than the other patients (pt-test)=0.000544) (Figure 4M).
For patients with renal cancer a cluster of patients associated with Serratia marcescens
(cluster 2) was identified using K-means clustering, although t-SNE did not separate this group
of patients (Figure 3H, Extended Data Figure 4N). The other cluster contained patients with no
discernable associations (cluster 1). There was a tendency towards a higher frequency of
PBRM1 alterations in patients in cluster 1 (pFisher=0.0723) (Figure 4P). 13 bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
For patients with pancreatic endocrine neoplasms, ovarian cancer, chronic myeloid disorders
and breast cancer, no discernable clusters emerged (Extended Data figure 8A-D).
Unbiased association analysis between taxa and patient features
In addition to associating the above identified clusters with patient features, an unbiased
analysis of possible associations between presence of a species-level taxa and patient
features, such as age, survival, gender, number of alterations in known cancer genes and
specific alterations in one of those cancer genes was performed in both a pan-cancer approach
using the whole dataset and analyzing each cancer separately. In this exploratory analysis, all
non-phage taxa detected (n=204) were included and multiple testing correction was
performed.
Association between taxa and patient age
In the pan-cancer analysis, no association remained significant at the α=0.05 level after
correction for multiple testing. However, looking at each cancer separately, several
associations were identified. A group of bacterial taxa was associated with older patients in
prostate cancer (Figure 4L). In chronic myeloid dysplasia, presence of Pseudomonas sp. TKP
was associated with older age (padj=0.039) (Figure 4Q).
Association between taxa and patient survival
In the pan-cancer analysis, no association remained significant at the α=0.05 level after
correction for multiple testing. However, looking at each cancer separately, presence of
several taxa had a significant impact on survival at the α=0.05 level. In renal cancer, presence
of Ralstonia pickettii was associated with improved survival, in fact no patient died (padj=0.035)
(Figure 4O).
14 bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
Association between taxa and patient gender
For both the pan-cancer analysis and looking at each cancer separately, no significant
associations were found at the α=0.05 level after correcting for multiple testing.
Association between taxa and number of genetic alterations in cancer genes
In the pan-cancer analysis, no association remained significant at the α=0.05 level after
correction for multiple testing. However, looking at each cancer separately, presence and
higher RPPB matching P. acne was associated with a decreasing number of alterations in
prostate cancer (padj=0.0041) (Figure 4N).
Association between taxa and specific genetic alterations
For both the pan-cancer analysis and looking at each cancer separately, no significant
associations were found at the α=0.05 level after correcting for multiple testing.
Somatic lateral gene transfer
The integration of viral nucleic acids into the human host genome is well recognized as a
carcinogenic process. High-level evidence exists for the integration of Hepatitis B8,9, Human
Papillomavirus6,7 and Epstein–Barr virus (Human Herpesvirus 4)33,34, which are causally linked
to hepatocellular carcinoma, cervical cancer and lymphoma, respectively. Additionally, some
evidence for somatic lateral gene transfer from bacteria to cancerous tissue has been
presented before20 and subsequently controversially discussed.
Aiming to find evidence for bacterial or viral DNA integrated into the human host genome in
this dataset, a pipeline was developed to do so. In brief, read pairs in which one read mapped
to the human genome and one read to one of the taxa in the final filtered taxon list (n=27)
15 bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
were counted and then compared to the number of read pairs in which both reads mapped
to the respective taxa. The number of divergently mapping read pairs divided by the number
of complete pairs mapping that respective taxa was used as an indicator for potential genomic
integration and lateral gene transfer. This was done separately for the tumor tissue datasets
and the matched normal datasets for all patients for which the pipeline revealed at least 1000
RPPB matching one of the taxa in the final filtered taxon list (n=27) in that respective patient’s
tumor tissue sample (n=79, Supplementary Table 2). The resulting data is shown in
Supplementary Table 1 and Extended Data Figure 4. The highest rate of integration was
observed for Hepatitis B in tumor tissue (9.62%) (Extended Data Figure 9A). Across all analyzed
taxa, putative integrations were more common in matched normal samples than in tumor
tissue samples (p=0.0009, Wilcoxon paired signed rank test) (Extended Data Figure 9B) and
for viral taxa compared to bacterial taxa (p=0.0001, Mann-Whitney U-test) (Extended Data
Figure 9C). In conclusion, there was no evidence for a general phenomenon of lateral gene
transfer for the final filtered taxon list, with the exception of Hepatitis B, for which integration
into the cancer genome has been widely described8,9.
Results of differential gene expression analysis
Differential gene expression analysis was performed for all taxa and cancer combinations with
available RNA-seq data and in which at least 5 patients had at least 100 RPPB matching the
respective taxon (Supplementary table 10). Differentially expressed genes (q<0.05) were
identified for chronic lymphocytic leukemia patients with or without detection of Gordonia
polyisoprenivorans (258 genes) (Figure 5A, Supplementary table 12), Human Mastadenovirus
C (1725 genes) (Figure 5B, Supplementary table 13) and Pseudomonas aeruginosa (50 genes,
Supplementary table 14). Furthermore, differentially expressed genes were identified for
ovarian cancer patients with or without detection of Escherichia coli (22 genes) 16 bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
(Supplementary table 15) and pancreatic adenocarcinoma patients with or without detection
of Propionibacterium acnes (3 genes) (Supplementary table 16). Next, differential gene
expression data was used to perform bidirectional functional enrichment in order to identify
pathways that are altered in patients with or without presence of the respective taxon. To no
surprise, only the two associations with a significant number of differentially expressed genes,
chronic lymphocytic leukemia with or without presence of Gordonia polyisoprenivorans
(Supplementary table 17) or Human Mastadenovirus C (Supplementary table 18), respectively,
showed enrichment in some pathways.
Overall, presence of Gordonia polyisoprenivorans lead to a gene expression pattern indicative
of a decreased level of mitotic cell cycling. Also, a gene expression pattern indicative of a
decreased DNA damage response was observed. Additionally, a gene expression pattern
indicative of reduced blood coagulation and reduced macrophage activation was observed in
patient samples with presence of Gordonia polyisoprenivorans (Figure 5C).
Contrary to that, presence of Human Mastadenovirus C lead to a gene expression pattern that
is indicative of an increase in B-cell proliferation and a reduction of various immune defense
functions, such as immune response in general, Tumor necrosis factor and interleukin beta 1
production, activated T cell proliferation and a decrease in cytokine secretion in general. In
addition, the altered gene expression pattern of patients with presence of Human
Mastadenovirus C is indicative of a markedly reduced DNA damage response (Figure 5D).
17 bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
DISCUSSION
The aim of this this study was to leverage a large, high-quality dataset of over 3000 samples
to unravel novel associations between viral and bacterial taxa and cancer. A total of 218
species-level taxa could be identified in tumor tissue, matched normal and healthy donor
samples. Out of these, 27 taxa remained being likely cancer-associated after extensive
filtering. While studies on the viral metagenome of caner tissues and patients have been
performed using datasets from cancer genomics studies4,5,21,22, similar large studies looking at
the bacterial metagenome are lacking.
Studies examining the viral metagenome of cancer tissues mainly identified known
associations of Human Papillomavirus with cervical and head and neck cancer, Hepatitis B in
liver cancer and Human Herpesvirus 5 in a variety of cancers while detection of Human
Mastadenovirus C was controversial4,5,21,22.
Smaller studies examining bacteria-tumor associations in pan-cancer datasets have identified
Escherichia coli, Propionibacterium acne and Ralstonia pickettii in multiple cancers, while
more specifically finding Acinetobacter sp. in AML and Pseudomonas sp. in both AML and
adenocarcinoma of the stomach20,35. Interrogating cancer-specific datasets, Salmonella
enterica, Ralstonia pickettii, Escherichia coli and Pseudomonas sp. were detected in breast
cancer and adjacent tissue36, while Escherichia sp., Propionibacterium sp., Acinetobacter sp.
and Pseudomonas sp. were frequently detected in prostate cancer32. Confirming the findings
of the present study, these smaller studies found similar bacterial taxa, especially Ralstonia
pickettii, Escherichia coli, Propionibacterium acne Salmonella enterica and Pseudomonas sp.
Interestingly, the present study identified a number of taxa that have not been previously
identified in cancer tissue or matched normal samples of these patients, among them
18 bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
Cupriavidus metallidurans, Gordonia polyisoprenivorans, Serratia sp. and Bifidobacterium sp..
Reasons for this are likely the much higher (1) number and diversity of patients and samples
included, (2) the larger amounts of data examined per sample due to having higher-coverage
WGS datasets available for all patients compared to RNA-seq or whole-exome sequencing
(WXS) datasets in previous studies and (3) the optimized and extensively validated
bioinformatics approach used herein. Of note, the filtering strategy used in this study to
exclude taxa likely resulting from contamination has also eliminated bacterial taxa that have
been linked to cancer previously such as Escherichia coli and Propionibacterium acne, mainly
due to the frequent detection of these taxa in healthy donor samples from the 1000 genome
cohort. Stringent filtering for species-level taxa that have been detected at 2 log10-levels
higher in tumor tissue or matched normal tissue compared to healthy donor samples excluded
these species, despite having about 1.4x and 3.1x higher RPPB for matched normal and tumor
tissue, respectively, compared to healthy donors for Propionibacterium acne. These values
were even higher for in the case of Escherichia coli, namely 7.1x and 13.4x for matched normal
and tumor tissue, respectively. Thus, it cannot be excluded that tumor-associated taxa were
eliminated due to the stringent filtering utilized.
Looking at specific cancer-pathogen associations, a subgroup of bone cancer patients with
presence of Pseudomonas sp. in the tumor tissue was identified. Consistent associations
between viral or bacterial taxa and bone cancer have not been described before, apart from
some evidence linking simian virus 40 (SV40) infection to bone cancer37. Interestingly,
Pseudomonas sp. are frequently implied in difficult to treat cases of osteomyelitis. Thus, one
might speculate that chronic, subclinical infection exists and could be carcinogenic.
For CLL patients, two associations were identified, one with Gordonia polyisoprenivorans and
one with Human Mastadenovirus C. Gordonia polyisoprenivorans has not been implied in 19 bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
cancer before. Interestingly, the bacterium has been identified as a rare cause of bacteremia,
so far exclusively in patients with hematological cancers38–40. While the immunosuppression
associated with hematological cancers and subsequent treatment obviously predisposes to
infections with unusual pathogens from the environment, it is at least possible that these
patients had a latent infection with Gordonia polyisoprenivorans which exacerbated into
bacteremia and sepsis upon treatment-induced immunosuppression. Human Mastadenovirus
C has to date not been clearly linked to cancer, though two recent reports have found it
frequently in various cancer tissue, with one treating it as contamination4 and one not21 (see
methods for a detailed discussion). Further indirect evidence for a possible role of both
Gordonia polyisoprenivorans and Human Mastadenovirus C in CLL carcinogenesis is provided
by (1) the observed age difference - patients with detection of Human Mastadenovirus C were
markedly younger, (2) the mutual exclusivity of detection of Gordonia polyisoprenivorans and
Human Mastadenovirus C and (3) the fact that survival was different - patients with detection
of either taxa had improved outcome. This was despite patients associated with Gordonia
polyisoprenivorans having a higher likelihood of having Binet C stage. Providing a possible
explanation for the observed survival benefit, patients with no taxa association (cluster 3) had
a higher likelihood of having TP53 mutations. Strikingly, marked differences in host gene
expression between patients in which one of the taxa was detected and other patients were
observed. The observed gene expression pattern for patients associated with Gordonia
polyisoprenivorans, especially an increased DNA damage response and reduced mitotic cell
cycling, could explain the improved survival of these patients. The gene expression pattern
observed for patients associated with Human Mastadenovirus C pointed towards reduced
immune activity, especially reduced T cell function and a decrease in cytokine production and
secretion, which could explain the observed high detection frequency of Human
20 bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
Mastadenovirus C DNA in line with an uncontrolled infection due to a diminished immune
response. Detection of Mastadenovirus C was associated with a decreased DNA damage
response, which has been described as being an important pathomechanism of CLL41. In
addition to CLL patients, both Gordonia polyisoprenivorans and Human Mastadenovirus C
were also detected frequently in other cancers, sometimes at high levels, while never being
detected in healthy donors.
Two taxon-tumor associations in esophageal cancer were identified, one with Bifidobacterium
dentium and one with Human Herpesvirus 5. Interestingly, patients in cluster 3 (Human
Herpesvirus 5) had less alterations in cancer gene consensus genes than the other patients,
pointing to a possible different carcinogenic mechanism, where the accumulation of multiple
somatic alterations is not stringently needed for malignant transformation. Furthermore,
patients in cluster 1 (without any association with Bifidobacterium dentium or Human
herpesvirus 5) had better survival. While the association of Human Herpesvirus 5 with
esophageal cancer is a novel finding, Human Herpesvirus 5 has frequently be detected in
adjacent adenocarcinoma of the stomach5. Of note, overt Human Herpesvirus 5 esophagitis
can occur in immunocompromised hosts42, which could be indicative of a latent infection of
the esophagus by Human Herpesvirus 5 in some hosts.
In patients with liver cancer, one subgroup of patients was defined by detection of Hepatitis
B, a known cause of liver cancer8,9. Interestingly, two other subgroups could be identified.
Pseudomonas sp. and Serratia sp. were detected in both groups, with additional detection of
Parvibaculum lavamentivorans and Human Herpesvirus 5 in one group. There are some hints
that Human Herpesvirus 5 might play a role in the carcinogenesis of liver cancer, among them
presence of CMV DNA in tumor tissue and an increased seroprevalence in liver cancer
patients43 as well as frequent hepatitis in Human Herpesvirus 5 infection underscoring 21 bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
hepatotropism of Human Herpesvirus 544. The other identified taxa have not yet been implied
in liver cancer carcinogenesis. Compared to patients not associated with one of the taxa
identified in liver cancer tumor tissue samples, all patients with presence of any identified taxa
were younger. Hepatocellular carcinoma at younger age is associated with chronic Hepatitis B
infection45. Since younger age was also observed in patients associated with the other taxa,
one might speculate that liver cancers associated with other pathogens are associated with
younger age of onset as well.
Three bacteria-tumor associations were identified in pancreatic adenocarcinoma, one with
Pseudomonas protegens, one with Methylobacterium populi and one with Cupriavidus
metallidurans. While Pseudomonas sp. have been shown to be a contributor to the pancreatic
adenocarcinoma tissue microbiome31, Methylobacterium populi and Cupriavidus
metallidurans have not been detected in cancer tissue. In fact, both taxa have not been
discovered in human hosts but in environmental samples and are thus not considered to be
part of the human microbiome, making it possible that they are contaminants not truly
present in the tumor tissue samples analyzed. On the other hand, few infections of humans
by these taxa have been described46,47 and, of note, the first published report of a Cupriavidus
metallidurans infection was a case of septicemia in a patient with a pancreatic tumor47.
In patients with prostate cancer, the most interesting findings were age differences between
patient clusters defined by the presence of different taxa and a negative correlation of
Propionibacterium acne presence and number of alterations in cancer gene consensus genes.
Patients in cluster 2 (defined by Gluconobacter oxydans and Rhodopseudomonas palustris)
and cluster 5 (defined by Thauera sp. MZ1T, Cupriavidus metallidurans and Pseudomonas
mendocina) were markedly older than the other patients. Gluconobacter sp.,
Rhodopseudomonas sp., Cupriavidus sp. and Pseudomonas sp. have previously been 22 bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
identified in prostate cancer and normal prostate tissue32, but their precise role in prostate
disease is entirely unclear. Propionibacterium acne has been implied as a potential
carcinogenic bacterium in prostate cancer48,49, possibly by creating a chronic inflammatory
microenvironment50. Like Human Herpesvirus 5 in esophageal cancer, the reduced number of
alterations in cancer gene consensus genes upon increased Propionibacterium acne frequency
could point to alternative carcinogenesis by chronic inflammation in the absence of the
accumulation of many mutations in putative cancer driver genes.
For patients with renal cancer, an association with Serratia marcescens was described for a
subgroup of patients. While Serratia marcescens has been described as a frequent cause of
urinary tract infections, especially in immunocompromised hosts in a nosocomial setting51, it
has so far not been implicated in cancer. Interestingly, survival of patients with detection of
Ralstonia pickettii in their tumor tissue was markedly improved. Ralstonia pickettii is a
bacterium that has been filtered out in this study because of frequent detection in healthy
donor samples. It could be speculated, that, while not causing any overt infection, low level
presence of Ralstonia pickettii in the human host is common, and improves immunogenicity
of renal cancer and, thus, improving outcome. It has been shown, that alterations of local and
systemic immunity by a host’s microbiome possibly influence the anticancer immune
response52, which might be highly relevant for a naturally immunogenic tumor, such as renal
cancer53.
The intriguing observation of increased somatic bacteria-human lateral gene transfer by Riley
et al. 20 could not be made in this study. The only taxa for which more integration into the host
genome was observed in tumor tissue compared to matched normal was Hepatitis B, for
which integration into the genome of cancerous cells has been well recognized54.
23 bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
In conclusion, the present study provides an important, unprecedented atlas of potential
associations between both bacterial and viral taxa with cancer. In addition to confirming
known or postulated associations between cancers and viral or bacterial taxa, several novel
associations were identified for the first time, laying the groundwork for experimental
validation and further mechanistic studies.
24 bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
METHODS
Data sources
Mapped sequencing data for the included ICGC studies was obtained via the ICGC DCC23,55 and
downloaded using customized scripts. The use of controlled access ICGC data for this project was
approved by the ICGC Data access compliance office. Mapped sequencing data for the 1000 genome
healthy control samples was obtained from the 1000 genome FTP server56 using customized scripts.
Sequencing data used for validation was obtained from the European Nucleotide Archive (ENA)57.
Computing environment
Data analysis was performed using a HP Z4 workstation in a Unix environment either using
standalone standard tools as mentioned throughout the methods section or customized
scripts. Some analyses were performed employing the Galaxy platform58.
Pipeline for taxonomic classification
First, unmapped (non-human) read pairs were extracted from downloaded sequencing data
using Samtools (version 1.7)59. Bam files were sorted by query name with Picard Tools (version
2.7.1.1)60 and converted to FastQ files using Bedtools (version 2.26.0.0)61. Subsequently,
Trimmomatic (version 0.36.3)62 was used to trim reads with sliding window trimming with an
average base quality of 20 across 4 bases as cutoff and dropping resulting reads with a residual
length < 50. Read pairs, in which one read was dropped according to these rules were dropped
altogether. Remaining read pairs were joined with FASTQ joiner (version 2.0.1)63 and
converted to FASTA files using the build-in function FASTQ to FASTA (version 1.0.0) from
Galaxy58. Next, VSearch (version 1.9.7.0)64 was used to mask repetitive sequences by replacing
them with Ns using standard settings. These masked and joined read pairs were fed into
25 bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
Kraken (version 1.2.3)25 using a database of all bacterial and viral genomes in Refseq. The
output of each run was filtered with Kraken (version 1.2.3)25 setting a confidence threshold of
0.5. A report combining the output of all samples and runs was generated using Kraken
(version 1.2.3)25. The output was then arranged using customized scripts in R (beginning with
version R 3.3.2. and subsequently updated)65 to generate the raw metagenome output of
each sample. Next, the raw outputs of each sample running the pipeline on a subset of 10%
of the cancer tissue and matched normal sample read pairs and on all non-human read pairs
of the 1000 genomes samples were filtered by only including species-level taxa and excluding
all species-level taxa that were supported by less than 10 read pairs across all samples using
R (beginning with version R 3.3.2. and subsequently updated)65. Read pairs assigned to
Enterobacteria phage phiX174 which is ubiquitously used as a spike-in control in next
generation sequencing were omitted from all counts as an intended contaminant.
To correct for the variation of sequencing depth across samples, matched read pairs per billion
read pairs raw sequence (RPPB) were calculated for each sample and each taxon. A taxon was
heuristically considered detected, when a respective sample had at least 100 RPPB assigned
to that taxon.
Filtering strategy
First, all taxa that were also highly prevalent in the healthy control group, were excluded from
further analysis. In detail, a taxon had to have an at least 2 log10-levels higher mean RPPB
across either the tumor tissue samples, or the matched normal samples compared to the
healthy control samples. This cutoff excluded all taxa that were also highly prevalent in the
healthy control samples while at the same time allowing to further analyze taxa that were
enriched in matched normal samples such as blood as well as taxa that were dominant in
26 bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
tumor tissue. After this step, 147/218 (67.4%) potential tumor-associated species-level taxa
remained. Next, taxa that were detected in fewer than 5 tumor tissue or matched normal
samples were excluded (n=78), as well as all remaining phages (n=2). Hepatitis B as a known
cancer-associated virus was re-included despite being present in fewer than 5 tumor tissue or
matched normal samples so that 68/218 (31.2%) taxa remained.
Taxa that survived this filtering strategy were likely to be tumor-associated but could also
represent artifacts from contamination. To account for that, all taxa were filtered out that are
known to be regularly present in the oral microbiome66–68 and were at the same time mainly
(>50% of all RPPB matching a respective taxon) detected in samples that are likely
contaminated with the oral microbiome (saliva matched normal, oral cancer tissue and
esophageal cancer tissue samples) (Supplementary Table 9). Exemplary, this filtered out taxa
that are a commonly present in the oral human flora such as Streptococcus mitis and were
indeed mainly present in tumor tissue samples of oral cancer or esophageal cancer and likely
a contaminant based on the biopsy location or detected in the matched normal of leukemia
cases whose matched normal was a saliva sample likely containing taxa of the normal human
microbiome. After this filtering step 49/218 (22.5%) species-level taxa remained.
It has recently emerged that both reagents and kits used in DNA extraction and library
preparation as well as ultrapure water used in laboratories can contain contaminants that can
hamper the detection of truly present taxa in low biomass or high background (e.g. human)
samples. As this study was performed using samples that were both, low in non-human
biomass and in the context of high human background the aim was to further reduce false
positives by compiling a list of common contaminants in microbiome studies from the
literature. Recommended approaches, such as the sequencing of blank controls69 were not
feasible as the present study was conducted utilizing already sequenced primary material. 27 bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
Therefore, a database of common contaminants from various studies70–74 examining this issue
was compiled (Supplementary Table 10). All previously recognized contaminant taxa apart
from those that were described as a contaminant on genus-level but where different species
within the genus were associated differently with 1000 genome control samples and matched
normal or tumor tissue samples were excluded. This was the case for the genera Pseudomonas
and Methylobacterium. While the species Pseudomonas aeruginosa, Pseudomonas putida
and Pseudomonas stutzeri were associated with both 1000 genome control and matched
normal or tumor tissue samples and thus likely contaminants, Pseudomonas fluorescens,
Pseudomonas mendocina, Pseudomonas poae, Pseudomonas protegens and Pseudomonas
sp. TKP were not present in 1000 genome healthy control samples. Thus, it is likely, that the
species-level resolution of this analysis was able to differentiate between common
contaminants and possibly tumor-associated taxa. Similarly, differentiation was possible
between Methylobacterium radiotolerans and Methylobacterium extorquens (likely
contaminants) and Methylobacterium populi, which was only found in tumor-associated
samples. The same held true for Cupriavidus metallidurans and Propionibacterium
propionicum. After this filtering step, 27/218 (12.4%) species-level taxa remained (Figure 1 I-
J, Extended Data Figure 6). While it cannot be ruled out that all these filtering steps removed
cancer-associated taxa, the aim was to be cautious and rather accept a false-negative than a
false-positive finding. Of course, experimental validation of the relevance of one of the taxa
eliminated by one of the filters could prove that this taxon is both, a sequencing contaminant
and a relevant taxon in cancer.
If a taxon is indeed present in a tissue, matching read pairs are expected to be uniformly
distributed across its genome. Consequently, a further filtering step examining this was
introduced. The sequencing data from all samples was combined and matched against a
28 bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
reference database constructed out of the genome of these 27 taxa. Next, the coverage
distribution of reads across each genome was assessed (Extended Data Figure 4). If the
respective taxa are indeed present in the sample and not the result from misalignment or
sequence similarity of certain parts of the genome an uneven coverage would result. It was
found that all read pairs matching Human Mastadenovirus C aligned to short parts of its
genome with a maximum length of a few hundred base pairs and abrupt drops in coverage
(Extended Data Figure 4). This was also found in another study using different cancer tissue
sequencing data and resulted in excluding Mastadenovirus C from further analysis4. Using
Blast (beginning with version 2.7.1 and subsequently updated)75 it was found that read pairs
matching Human Mastadenovirus C aligned all equally well to commonly used cloning vectors,
such as pAxCALGL which might have been used in the production of reagents used for
sequencing. This form of contamination was recently analyzed and found to be frequent76.
However, read pairs aligning to Human Mastadenovirus C originated from very few, seemingly
unlinked samples from diverse cancer sequencing projects, making contamination by
recombinant DNA unlikely. Another possible explanation for the observed coverage pattern is
somatic genomic integration of parts of the Human Mastadenovirus C genome into a specific
cancer genome. A recent analysis of PCAWG data focusing on viruses in cancer with overlap
to patients and samples analyzed within this study also identified presence of Human
Mastadenovirus C in cancer tissue and included it in their analysis among other viruses21. On
balance, Human Mastadenovirus C was therefore not excluded from further analysis as an at
least possible cancer associated viral taxon.
Clustering of pipeline hits
T-SNE was performed using a web-based TensorFlow Embedding Projector implementation77.
The learning rate and the perplexity were heuristically set to 10 and 30 for all analyses, 29 bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
respectively, except for the liver cancer subset, for which the perplexity was set to 50 due to
improved cluster discrimination. The number of iterations was heuristically chosen, so that no
major changes of cluster composition occurred upon increasing the number of iterations.
Depending on the subset analysis, between 500 and 1500 iterations were needed to reach
that point. T-SNE was performed in 3 dimensions for the pan-cancer analysis and in 2
dimensions for the cancer-specific analyses.
K-means clustering was performed using Morpheus78. Unsupervised K-means clustering using
Euclidean distance as a similarity measure was employed with the number of clusters being
heuristically informed by combining visual inspection, comparing t-SNE and K-means clusters
and by examining marginal reduction of within-group variance with increasing numbers of
clusters (i.e. the elbow method) (Extended Data Figure 10 A-J).
To combine the information obtained by both complimentary methods, t-SNE clustering was
repeated with the same settings for each analysis, while color-coding clusters inferred from K-
means clustering.
Assessment of taxonomic differences between cell culture-derived DNA samples and blood-
derived DNA samples from the 1000 genome project
All taxa in the final taxon list (n=218) (Supplementary Table 19) which were identified in blood-
derived 1000 genomes healthy control samples were selected (n=76) (Supplementary Table
11). The pipeline was subsequently applied to randomly selected (identifier ending with 8 or
9) LCL-derived 1000 genomes samples (n=102) (Supplementary Table 12). Finally, RPPB in
these LCL-derived samples were calculated for all 76 taxa that were identified in the blood-
derived 1000 genomes samples and compared between blood-derived and LCL-derived
samples.
30 bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
Proof that subsampling of read pairs does not alter relative taxon distribution
To show that subsampling does both not alter the detected species-level taxa and their
relative composition compared to analyzing all non-human read pairs, a subset of 184 tumor
tissue samples (Supplementary Table 15) was analyzed completely without any subsampling
and subjected to the pipeline in the same way as in the main analysis. Exemplary, read pairs
matching Human Mastadenovirus C, P. poae, R. pickettii and P. acnes were used to compare
absolute read pair counts between full data and the subsample for all taxon-sample pairs in
which the 10% subsample had at least 10 matches to the respective taxon.
Pipeline validation strategy
In order to perform in-depth validation of the pipeline, two alternative strategies were
compared to the results obtained from it.
In the first alternative strategy, BWA mem (version 0.7.17) 79 was used to align read pairs to
downloaded representative bacterial and viral genomes for the final filtered taxon list
(Supplementary Table 1, Accession numbers in Supplementary Table 16). In detail, Samtools
(version 1.7)59 was used to merge all non-human mapped BAM files. Next, read pairs were
extracted using Samtools (version 1.7)59 and mapped to phiX (Coliphage phi-X174, complete
genome, NC_001422.1) with BWA mem (version 0.7.17)79. Using Samtools (version 1.7)59 all
read pairs not matching phiX were extracted. Picard (version 2.7.1.1)60 and Samtools (version
1.7)59 was used to extract reads from the resulting BAM file and sort them by read name,
respectively. Next, reads were trimmed with Trimmomatic (version 0.36.3)62 with sliding
window trimming with an average base quality of 20 across 4 bases as cutoff and dropping
resulting reads with a residual length < 50. Read pairs, in which one read was dropped
according to these rules were dropped altogether. Resulting read pairs were mapped to a
31 bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
merged FASTA file combining reference genomes of the final filtered taxon list
(Supplementary Table 1, Accession numbers in Supplementary Table 16). Subsequently,
Samtools (version 1.7)59 was used to filter only mapped read pairs with a quality of at least 60.
Customized scripts were used to generate statistics on the number of mapped reads per
species-level taxon in the whole dataset.
In the second alternative strategy, Unicycler (version 0.4.6.0)80 was used to assemble reads
from a merged read database containing all non-human and non-phiX mapped, trimmed read
pairs generated as described above for the first alternative, BWA-based strategy. Unicycler
(version 0.4.6.0)80 was run with standard settings. The assembled contigs were then matched
to sequences in a combined database consisting of the htgs (downloaded on 04/13/2014), nt
(downloaded on 04/17/2014 and wgs (downloaded on 04/20/2014) database using BLAST
megablast (beginning with version 2.7.1 and subsequently updated)75 with an expectation
value cutoff of 0.001. Next, customized scripts were used to filter out all contigs with a
sequence identity to its matched sequence ≤ 80% and to sum the total alignment length of all
remaining contigs by matched sequence template. Finally, for each of the final filtered taxon
list (Supplementary Table 1), the matched sequence template representing the species-level
taxon with the biggest summed up alignment length was selected.
Next, in order to validate the KRAKEN-based pipeline, the log10 of the number of read pairs
25 assigned to a species-level taxon with KRAKEN (version 1.2.3) , the log10 of read pairs mapped
to a taxon with the BWA-based approach and the log10 of the total alignment length of the
assembly-based approach were computed for all taxa in the final filtered taxon list
(Supplementary Table 1) and compared (Supplementary Table 16). To assess the correlation
between the main method and the two tested alternative methods, the Pearson correlation
coefficient was calculated. 32 bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
Finally, in order to assess the concordance between RNA-Seq and WGS experiments
performed on the same sample the main pipeline was applied with the same settings to all
samples with RNA-Seq and WGS paired data available (n=324). Pearson correlation
coefficients of log10 transformed data were calculated for both a combined dataset of all RNA-
Seq / WGS pairs and for each sample separately for which RNA-Seq and WGS data was
available (n=324).
Assessment of somatic lateral gene transfer
In order to assess the integration of bacterial DNA into human DNA, read pairs with one read
matching the human genome and one read matching one of the taxa in the final filtered taxon
list (n=27) (Supplementary Table 19) were identified. First, representative bacterial and viral
genomes for the final filtered taxon list were downloaded (Accession numbers in
Supplementary Table 19). These genomes were merged with the human reference genome
(hg1k_v37, downloaded from the 1000 genomes FTP server56) into one FASTA file using
customized scripts. Second, all read pairs in which only one read of a read pair was mapped
to the human genome were filtered using Samtools (version 1.7)59. All tumor tissue samples
and matched normal samples of all patients in which at least 1000 RPPB matching one of the
taxa in the final filtered taxon list were identified in that respective patient’s tumor tissue
sample (n=79, Supplementary table 20) were included in this analysis. BWA mem (version
0.7.17)79 with standard settings in paired end mode was used to align all such read pairs to
the merged FASTA file of all taxa and the human reference genome. Next, all read pairs that
were now divergently mapped were extracted using Samtools (version 1.7)59 and customized
scripts by only including read pairs where one read mapped to one of the included non-human
taxa and the other read mapped to a human sequence. Only read pairs with a mapping quality
of at least 40 were retained. Customized scripts were used to count and tabulate all divergent 33 bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
read pairs obtained by taxa and sample, respectively (Supplementary Table 21,
Supplementary Table 22). In order to normalize read pairs mapping to putative integration
sites (i.e. divergently mapped read pairs as defined above) by correcting for the total number
of read pairs matching a taxon with a similar approach (i.e. non-divergently mapped,
putatively non-integrated reads), a comparable pipeline was applied to the data used for the
main analysis (i.e. both reads in a read pair not mapped to the human genome) of all patients
included in the integration analysis (n=79) (Supplementary Table 20). First, the genomes of all
taxa in the final filtered taxon list were downloaded (Accession numbers in Supplementary
Table 16) and merged without adding any further human sequences into one FASTA file to
create a reference genome containing all taxa in the final filtered taxon list. BWA mem
(version 0.7.17)79 with standard settings in paired end mode was used to align these reads to
the merged FASTA file of all taxa in the final filtered taxon list. Subsequently, Samtools (version
1.7)59 was used to filter the aligned data to only include read pairs that mapped as a proper
pair to only one taxon with a minimum mapping quality of 60. Samtools (version 1.7)59 and
customized scripts were used to count and tabulate all mapped reads. The putative
integration rate of a taxon in either tumor tissue samples or matched normal samples was
calculated by dividing the number of divergently mapping read pairs by the number of read
pairs mapping as a proper pair to the respective taxon.
Analysis of associations of patient features with identified clusters
Differences in age at diagnosis between patient clusters were first analyzed by ANOVA for
each cancer. All results with panova≤0.1 are shown separately for each cancer in Figure 4. Such
clusters or combinations of clusters were then compared to the other clusters by student t-
test.
34 bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
Differences in the gender distribution between patient clusters were analyzed by chi-square
test for each cancer.
In order to analyze possible relationships between the number or type of cancer-associated
genetic alterations and presence of specific taxa, all somatic alterations for all patients
included in this study were obtained from the ICGC data portal23,55. All synonymous mutations
were filtered out. Subsequently, only Tier 1 cancer gene census81 genes that were altered in
more than 20 cases were filtered using customized scripts and included in the analysis
(Supplementary Table 23).
Differences in the number of genetic alterations in cancer genes between patient clusters
were first analyzed by ANOVA for each cancer All results with panova≤0.1 are shown separately
for each cancer in Figure 4. Such clusters or combinations of clusters were then compared to
the other clusters by student t-test.
The Kaplan-Meier method was used to estimate survival curves for each patient cluster in each
cancer with available survival data. Differences in survival between patient clusters were
analyzed by log rank test.
Associations between patient clusters and alterations in single cancer genes were analyzed
for each cancer gene in a cancer that was altered in at least 10 patients in that respective
cancer by chi-square test. Clusters or combinations of clusters were then compared to the
other clusters by Fisher’s exact test.
All calculations were performed with R (beginning with version R 3.3.2. and subsequently
updated)65.
Unbiased analysis of association of patient features with the presence of certain taxa
35 bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
For this analysis a taxon was considered detected in a patient if at least 100 RPPB matched
the respective taxa in the patient’s tumor tissue sample. All phages were excluded from the
analysis. All included projects were grouped by cancer (Supplementary Table 18). Cancer
genes were defined as above.
All calculations were performed with R (beginning with version R 3.3.2. and subsequently
updated)65. All p-values were corrected for multiple testing using the FDR method to obtain a
q-value, which was considered significant if < 0.05.
Survival analysis
First, the Kaplan-Meier method was used to estimate survival curves for each cancer-taxon
pair separately that was present in at least 10 patients. Differences in survival between
patients with presence or absence of a respective taxon were analyzed using log rank tests,
stratified by ICGC project. Additionally, a pan-cancer analysis was performed in the same way,
also stratifying by ICGC project.
Association between taxa and patient gender
Associations between presence or absence of a taxa and patient gender were analyzed by
Fisher’s exact test for each cancer-taxon pair separately that was detected in at least 10
patients. Additionally, a pan-cancer analysis was performed in the same way, using a logistic
regression model that included the ICGC project as an independent variable.
Association between taxa and patient age
Associations between presence or absence of a taxon and patient age at diagnosis were
analyzed by student t-test for each cancer-taxon pair separately that was present in at least
36 bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
10 patients. Additionally, a pan-cancer analysis was performed in the same way, using a linear
regression model that was stratified by ICGC project.
Association between taxa and number of genetic alterations in cancer genes
Associations between presence or absence of a taxon and the number of non-synonymous
genetic alterations in cancer gene consensus81 genes of a patient were analyzed by student t-
test for each cancer-taxon pair separately that was present in at least 10 patients. Additionally,
a pan-cancer analysis was performed in the same way, using a linear regression model that
was stratified by ICGC project.
Association between taxa and specific genetic alterations
Associations between presence or absence of a taxon and non-synonymous alterations in one
of the cancer gene consensus81 genes of a patient were analyzed by Fisher exact test for each
cancer-taxon pair separately that was detected in at least 10 patients. Additionally, a pan-
cancer analysis was performed in the same way, using a logistic regression model that
included the ICGC project as an independent variable.
Analysis of differential gene expression
Reads per kilobase of transcript, per million mapped reads (RPKM) data was downloaded from
http://dcc.icgc.org/pcawg for all available tumor tissue samples. Ensemble gene ID was
substituted by the standard Human Genome Organization (HUGO) Gene Nomenclature
Committee (HGNC) symbol downloaded from http://genenames.org/download/custom and
the RPKM data was then linked to the identified species-level taxa in each sample using R
(beginning with version R 3.3.2. and subsequently updated)65. sRAP82 was used to normalize
RPKM values, perform quality control, differential gene expression analysis and to identify
37 bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
pathways that are differentially expressed. For these analyses, each cancer was analyzed
separately and patients that had more than 100 RPPB matching one species-level taxon in the
final filtered taxon list were compared with those who had not. Differential gene expression
analysis was performed for all taxa that were detected with more than 100 RPPB in at least
five patients and RNA-seq data available (Supplementary Table 24). Next, sRAP82 was used to
perform bidirectional functional enrichment of gene expression data to identify pathways up-
or downregulated between patients with or without presence of a taxon. Briefly, the
distribution of activation versus inhibition t-test statistics for all samples associated or not
associated with a specific taxon was compared using ANOVA and corrected for multiple testing
using the FDR method. For details please refer to the original description of BD-Func83. A gene
set was considered functionally enriched, if q<0.05. Gene sets likely not relevant for the
respective cancer were excluded and gene sets provided with sRAP82 were reduced to gene
ontology (GO) gene sets84.
38 bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
FIGURE LEGENDS
Figure 1: Overview over sample characteristics and identified taxa
A, Sample distribution by project. B, sample distribution by age group. C, sample distribution
by gender. D, distribution of read pairs matching any species-level taxa by project. E,
distribution of read pairs matching any species-level taxa by sample type. F, species-level
taxa detected in at least 10 samples (detection limit 100 read pairs per billion (RPPB)) color-
coded by project. G, average RPPB detected across all species-level taxa by project and
sample type. Bars show mean of samples and error bars show standard error of mean. H,
average RPPB detected by species-level taxa and sample type. I, average RPPB detected for
all species-level taxa identified as likely tumor associated after filtering by sample type. J,
species-level taxa identified as likely tumor associated after filtering (detection limit 100
RPPB) color-coded by project.
Figure 2: Heatmap of all taxa detected in all samples
Log2-transformed read pairs per billion (RPPB) of all taxa in all samples. Taxa were
hierarchically clustered using Pearson correlation as a distance measure with average-
linkage. Samples were hierarchically clustered within each project and type subgroup using
Pearson correlation as a distance measure with average-linkage.
Figure 3: Patient clusters defined by detected taxa can be identified across all patients and
cancer-specific subgroups
A, t-SNE visualization of all tumor-tissue samples color coded by project using the log2-
transformed read pairs per billion (RPPBs) of all detected non-phage taxa as input variables.
B-H, t-SNE visualizations of single cancer subgroup tumor-tissues color coded by k-means
39 bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
cluster (Extended data figure 4) using the log2-transformed read pairs per billion (RPPBs) of
all detected non-phage taxa as input variables.
Figure 4: Patient clusters defined by detected taxa associate with patient characteristics
A, association of CLL patient clusters (1: no specific association, 2: Human Mastadenovirus C,
3: Gordonia polyisoprenivorans) with age (panova=0.0743). B, survival by cluster in CLL (p(1 vs.
other)=0.0246). C, Binet stage depending on detection of Gordonia polyisoprenivorans
(pFischer = 0.0252). D, TP53 alteration frequency by cluster in CLL (pFisher(cluster 3 vs.
other)=0.0335). E, number of cancer consensus gene alterations by cluster (1: no specific
association, 2: Bifidobacterium dentium, 3: Human Herpesvirus 5) in esophageal cancer
(panova=0.0858). F, Kaplan-Meier survival curves for each cluster in esophageal cancer (p(1 vs.
other)= 0.0409). G, association of liver cancer patient clusters (1: Hepatitis B virus, 2:
Pseudomonas sp. and Serratia sp., 3: no specific associations, 4: Parvibaculum
lavamentivorans and Human Herpesvirus 5 in addition to taxa from cluster 2) with age
(panova=0.0015). H, RNF21 alteration frequency by cluster in liver cancer (pFischer = 0.0121). I,
KMT2C alteration frequency by cluster (1: no specific associations, 2: Cupriavidus
metallidurans, 3: no specific associations, 4: Methylobacterium populi, 5: Pseudomonas
protegens) in pancreatic cancer (pFisher=0.0308). J, CDKN2A alteration frequency by cluster in
pancreatic cancer (pFisher=0.0124). K, RNF21 alteration frequency by cluster in pancreatic
cancer (pFisherRNF21=0.0107). L, association of indicated taxa with age in prostate cancer (padj
between 0.0015 and 0.0420). M, association of prostate cancer patient clusters (1: no
specific association, 2: Gluconobacter oxydans and Rhodopseudomonas palustris, 3:
Pseudomonas sp. TKP, 4: no specific associations, 5: Thauera sp. MZ1T, Cupriavidus
metallidurans and Pseudomonas mendocina) with age (panova=0.00994). N, association
between detected amount of Propionibacterium acne DNA and number of cancer consensus 40 bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
gene alterations (padj=0.0041). O, Kaplan-Meier survival analysis of Ralstonia pickettii
detection status in renal cancer (padj=0.035). P, PBRM1 alteration frequency by cluster (1: no
specific association, 2: Serratia marcescens) in renal cancer (pFisher=0.0723). Q, association of
Pseudomonas sp. TKP detection with age in chronic myeloid dysplasia (padj=0.039). For all:
The midline of the boxplots shows the median, the box borders show upper and lower
quartile, the whiskers show 5th and 95th percentiles and the dots outliers.
Figure 5: Differential gene expression and pathway analysis of chronic lymphocytic
leukemia patients
A, heatmap of all differentially expressed genes (q<0.05) in tumor tissue samples of chronic
lymphocytic leukemia (CLL) patients with (green) and without (orange) detection of
Gordonia polyisoprenivorans. Dendrograms show clustering with complete linkage and
Euclidian distance measure. B, boxplots of the distribution of activating versus inhibitory
gene t-test statistics for the indicated pathway is shown for tumor tissue samples of CLL
patients with (+, n=23)) and without (-, n=49) detection of Gordonia polyisoprenivorans. The
midline of the boxplot shows the median, the box borders show upper and lower quartile
and the whiskers the maximum and minimum test statistic. A, heatmap of all differentially
expressed genes (q<0.05) in tumor tissue samples of CLL patients with (green) and without
(orange) detection of Human Mastadenovirus C.. D, boxplots of the distribution of activating
versus inhibitory gene t-test statistics for the indicated pathway is shown for tumor tissue
samples of CLL patients with (+, n=7)) and without (-, n=65) detection of Human
Mastadenovirus C. The midline of the boxplot shows the median, the box borders show
upper and lower quartile and the whiskers the maximum and minimum test statistic.
Dendrograms show clustering with complete linkage and Euclidian distance measure.
41 bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
Extended Data Table 1: Included patients and samples
Overview of all included patients and samples.
Extended Data Figure 2: Validation of pipeline and analytical approach
A, mean read pairs per billion (RPPB) detected for indicated species in blood derived and
lymphoblastoid cell line 1000 Genome samples sorted by mean of blood derived samples. B,
proportion of non-human read pairs matching the indicated taxon of read pairs matching
any species-level taxon for each external validation sample. C-E, comparison of BWA
mapped reads in Kraken matched read pairs. Each dot represents one species-level taxon
(n=27). The line represents the best-fitted line (log-log linear regression). The dotted lines
depict the 95% confidence interval. Pearson correlation coefficients (log-log) are shown with
two-sided p-values. (C) shows all samples (n=27), (D) only bacterial taxa (n=22) and (E) only
viral taxa (n=5). F, comparison of assembled read length and Kraken matched read pairs.
Each dot represents one species-level taxon (n=27). The line represents the best-fitted line
(log-log linear regression). The dotted lines depict the 95% confidence interval. Pearson
correlation coefficients (log-log) are shown with two-sided p-values. G, comparison of
Kraken matched read pairs in RNA-seq and WGS data of the same sample. Each dot
represents one species-level taxon in one sample with both RNA-seq and WGS data
available. The line represents the best-fitted line (log-log linear regression). Pearson
correlation coefficients (log-log) are shown with two-sided p-values. H, plot of Pearson
correlation coefficient (log-log) distribution of all samples with both RNA-seq and WGS data
available. Each dot represents the Pearson correlation coefficient within a single sample. I,
comparison of 10% subsample and full dataset. Each dot represents one species-level taxon
in one sample for which both the full and the subsampled dataset has been analyzed and
42 bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
indicates the absolute read count identified in both samples. The line represents the best-
fitted line (log-log linear regression). Pearson correlation coefficients (log-log) are shown
with two-sided p-values. J, Ratio of absolute read counts in the full sample to the 10%
subsample for 4 selected taxa. The mean ratio of all samples in which the respective taxon
was detected is indicated by the symbol and the error bars indicate the standard error of the
mean. The dotted line shows the expected ratio of 10. K, comparison of tumor tissue and
matched normal by patient and taxon. Each dot represents one species-level taxon in one
patient with both tumor tissue and matched normal analyzed and indicates the RPPB in both
samples. The line represents the best-fitted line (log-log linear regression). Pearson
correlation coefficients (log-log) are shown with two-sided p-values.
Extended Data Figure 3: Alpha diversity
A, counts of 1000 Genome samples by species-level richness. B, counts of tumor tissue
samples by species-level richness color-coded by project. C, counts of matched normal
samples by species-level richness color-coded by project. D, comparison of richness between
projects and sample type. Bars show mean and error bars standard deviation. N in brackets
indicates total sample number for each project.
Extended Data Figure 4: Coverage distribution for all likely tumor-associated species-level
taxa
Coverage distribution across each species-level taxon identified as likely tumor associated.
Extended Data Figure 5: Flow chart of taxa filtering strategy
Flow chart of filtering strategy to derive likely tumor-associated species-level taxa.
Extended Data Figure 6: Heatmap of final filtered taxon list
43 bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
Log2-transformed read pairs per billion (RPPB) of all species-level taxa identified as likely
tumor associated after filtering in all samples. Taxa were hierarchically clustered using
Pearson correlation as a distance measure with average-linkage. Samples were hierarchically
clustered within each project and type subgroup using Pearson correlation as a distance
measure with average-linkage.
Extended Data Figure 7: Heatmaps of likely tumor-associated taxa for all cancers with
discernible clusters
A-F, log2-transformed read pairs per billion (RPPB) of all species-level taxa identified as likely
tumor associated and detected after filtering in all tumor-tissues of the indicated cancer.
Results of K-means clustering of samples are shown.
Extended Data Figure 8: Heatmaps of likely tumor-associated taxa for all cancers without
discernible clusters
A-D, log2-transformed read pairs per billion (RPPB) of all species-level taxa identified as likely
tumor associated and detected after filtering in all tumor-tissues of the indicated cancer.
Results of K-means clustering of samples are shown.
Extended Data Figure 9: Analysis of host integration
A, integration rate by species for tumor tissue and matched normal sample. B, difference in
integration rates between bacterial and viral taxa (p<0.0001, Wilcoxon rank-sum test, two-
tailed). The midline of the boxplot shows the median, the box borders show upper and lower
quartile, the whiskers show 5th and 95th percentiles and the dots outliers of species-specific
integration rates in tumor tissue or matched normal samples. C, difference in integration
rate between tumor tissue and matched normal samples (p=0.0009, Wilcoxon signed-rank
44 bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
test, two-tailed). ). The midline of the boxplot shows the median, the box borders show
upper and lower quartile, the whiskers show 5th and 95th percentiles and the dots outliers of
species-specific integration rates.
Extended Data Figure 10: „Elbow method“ to determine k for k-mean clustering
A-J, plot of reducing within group sum of squares for increasing k (number of clusters) in k-
means clustering (Extended data figure 4, Extended data figure 5) of log2-transformed read
pairs per billion (RPPB) of all species-level taxa identified as likely tumor associated and
detected after filtering in all tumor-tissues for each indicated cancer.
45 bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
FIGURES
Figure 1
A B C D < 4 0 4 0 -4 9 5 0 -5 9 6 0 -6 9 7 0 -7 9 8 0 N /A
T o ta l 3 0 2 5 s a m p le s T o ta l 1 6 9 5 p a tie n ts T o ta l 1 6 9 5 p a tie n ts 2 0 0 0 T o ta l 3 ,5 3 4 ,7 0 7 r e a d s M ale F e m a le 1 5 0 0
s R E C A -E U
t i
h P R A D -U K
E F f
4 0 0 o 1 0 0 0 P R A D -C A
.
o P A E N -IT N 5 0 0 P A E N -A U
3 0 0 P A C A -C A
s t i P A C A -A U
h 0
f li O V -A U o s 2 0 0 c e
o n a c
i h a O R C A -IN . ic r m e iu o h r L IR I-J P c te s c
N E a ib 1 0 0 n L IN C -J P io p o r L IC A -F R P L A M L -K R T o ta l 3 ,5 3 4 ,7 0 7 r e a d s 0 E S A D -U K i a s s i 2 s s 1 2 s i e s a ii a s e s i a i s i a .3 s .1 s s ii s a is s i s s s i s a s P n is a s s i m e i s i is s n r a a a C i i a s ia la a r s f s i i a is s B u i 5 S i l s e m i s tt s u l i u n n h 4 n a d li r 0 n p a n u ic n c o s a n l u M r c n m n n t i c 0 u l l n k m s J u i a n u u n id K a il e u a i i A t 1 a i e o ie t c o t o s e d u a e r i e w u a a i s a n 6 u ic h n s d s u e in i l e o a t x c o e a ic r S o t r o n c e n n t o e e s M e a r o u a v l z t r A r c c n a t t ti a z t i r i k n r o T i p h r c r t a J r p m b T s S r t a c a a g ic r u p i c u e r t s e R i n v z r s ic y e u p i l n n u a i u if b s t o r e le h o s u e y a c in o i t e r p p e r i m a u t p u i i o e a u y k ir s l e n c h C M D I-U K c g o p d p r o p u i e p u B K t e l g m x f l t c i e t a r q m a i t lu a P h g g i k o r n h s g e e e a g iv p i v a s t r o e e d c iv s e p o m e g n v s a c s n i r r p r n s p d v i r s v s n n u f n l p u r i h t r i w s a d s p K t n c s o u o e n a c u e l c o f o e e s a e s p a in n m a i io p o a s i n l m e o tu m o r tis s u e r o s t l x o ll a n i s m a u r n a u l r p s o s l o o o o t t p la o v Ta os ltt av l 3o 0f 2l 5 us ae mo fpr le sa a a e d a a s n a a s n x e n s a p io e a u i la s i d u n a e n t t p t la d s a l u s r g e s p r a e ll n b n u s s m i i p a e y r u a a r o p u c b l iq c e s l t s a y x in l l a l c r t a n r u e o a p i e s i e a r a n r m l t a la r e l m c s d m n n l u t a il e n e e e n j e e o s l e o l s m t e r p l r c l u f iq r s e n c o n d e o f e n p il a p m c a r ie d u c s ll d c t e h n a e n c e t l e t m p a l is m t s l C L L E -E S o s a x o v o o o m s u s x r u o l e lo i a a c o d la n h o o t e a s i r p e e a O a s a u h t a m a s s s s m o s v o c i a a r a ia e t s m e a a r a s d m p m l k n n a c u r n a v c a e s r x h a u u a p c t s a r m u a o a s u a b r t l t c c b y v b n t t m o i e lo r o o c a a i e y B i in il l i m a tc h e d n o rm a ls l n i o l u s m a o i d c n m c l o m a a o o B s o r a a iu ia iu t l m w o a e o n a r h m s r il t c s t o o i c n id n i y u B o o a e a r a le m i i o B n i r r i u o m m e l b r n i b n p n p h a s o a o lf d v h o o u o id l c z o v u i r b r r v a g e h b o e l e m o r o e n iu u o a e p c r u u R u d c c g n o e o i i b s r ic la K a p th e m e d o t o e e v B d o h y o t d d r c n s s a r c c m e io u p o m id o c s h m o r e l o t o r m in l d t V u a S h r c r o m e n l a c a g m s o c o c B T C A -S G o e r li A m in t p c r o d e o b e t s u o n o m o g l c e d S h a r o d o o e n i u b e c t D s e c o o v A P o i t r e S M t e u n P n h a r e u p a y l r m t o T i u d m o S o d a s c o a h ls a l y h c c h l l C R E e p h a R in o P s a c b G d o h c c e i t o p c h e a lth y d o n o rs P V y h d i i y d t o n il l a k h b e t y h C k u a h H N v e c e u P d p u r p a n n A a c K v i e S r m h P s o p u h r H o p a c r o e ic o p S h a b c c m p k lo S s r e k b l ia a to t t s l o e R o p r a O e a S u u o r p P th s r u i y S r H p B R C A -U K h r s u d o r A C S u y o o n L p S P A t r a B X l P B H B h n P P u B h p e e R o P C t y B t d a io p u r r o S h e o B t t n t X p ta C S e G M h o S t e R r S B O C A -U K S M P 1 0 0 0 G e n o m e s
G 1 0 0 0 g e n o m e s H B O C A -U K 1 5 0 0 0 B R C A -U K tu m o r tis s u e B T C A -S G m a tc h e d n o rm a ls
C L L E -E S .
b 1 0 0 0 0 h e a lth y d o n o rs
C M D I-U K . p
E S A D -U K . p
L A M L -K R r L IC A -F R 5 0 0 0 L IN C -J P L IR I-J P tu m o r tis s u e O R C A -IN m a tc h e d n o rm a ls 0 O V -A U h e a lth y d o n o rs i ii s 2 l a s s 1 s a s 4 s t s li 0 s C s m ia e r t n i n o s e u n s u s n l 4 a P A C A -A U e t 1 c o s c e e o u a i . i e a S e u n i c u r i t h B h k r b c a n r c t ir r n ic n t c u u K s i i i y s e a f p R o P A C A -C A i s h g v a l e v c e i o O p id K e a o r s c r t p m l c ic l o s o m it l a l s p r r i m o e t r a s u P A E N -A U i a u c n u m u p lu e n e t ll s a e e i e l r l n f e m a n e i h u d r a f e i e e n o c x m c e c d s d s a p P A E N -IT t m a m ta t h s h a u r s a r ia s c a a l s a a l s o t a s s n B i il u n o l a B E i a a u n a l l u v a h b o th c i o d e P R A D -C A R d o r t m i c a h n i i r o n c m m o m a s d e u b ip o v i R n io o o R o l P b P R A D -U K a c S a c d H t c h e i p o c p l r A m o l u y K p r y e a c o R E C A -E U u s L i r u H P h l t C p P A o a n 1 2 3 4 5 6 7 t e S t 0 0 0 0 0 0 0 S 1 1 1 1 1 1 1 r p p b I 1 5 0 0 0 J 1 5 0 1 0 0 0 0
5 0 0 0
. tu m o r tis s u e
s
t
b i
. 1 0 0
m a tc h e d n o rm a ls h
p .
6 0 f
o
p
. r h e a lth y d o n o rs
4 0 o 5 0 N
2 0
0 0 t s s s s li t s s i s i s P s n e C i a a ii m 5 s n s s A a T m is 1 s s C s 1 e P s a s i a i l 5 i s s a s n n a s r c n n u n 6 c 1 n l u m m l s T n A s K a a n s t i i k u s n a p o n i iu a a s r n n n a n c n k n r u 1 n n c n T e r r o s r c a c u e i e a h Z t r u i s s K u i i u t s a n o 6 i u c o p e u e z i r i d o b g l s t m r v a e e o a e a i p i a a a r u c ir lu t o n i c y u u M n o i i r u u T r c r c s u e Z l h r i p s d iv s s d a v p m e r u e v v c c p i g z t o m i b s d t s e i v a n k o fa x t c i p i n B r r e o u r u o v r ll n a e o p e n i s o y o a v m d t a s u s i s i p o n t e a n l p i i c M u y u s e n c e a p e e m s r s n e is s d m r c v o a r r n a s u r u s ly t d e v e v s v o n t k e a v n a i x i a u t o e s l o p e i li p m e a m e m p i i r a i i n f p y a m B n l e p a a l m r r r q t r a p r u r t l c o s e o a d p s a v o t f m d e e p e li e a s p e i m u e a l n s n p r e m e s s y o o m a n s t c t t a te r a r i r n o e s e s m l n s s m s o t o n a c h a s e i u a r h p a o a e o a p s u u i r i m a i d a o m ia c n o t a te v e e t a e u p r l r m m i p m a l e e a p e t s y i s m n a n t b a a o r h a a t n l r n r l a r r q r i o n u l u t a o b u o i r h c l c H e d f p s e u u i a p t e t d o e a o lm i a a b t p n r a a m e m p e t s i n e e i l e t a m o d r m o r r n o m a T a a o o a t r r c t i a u i p s r d a m e m r o l u o ia e b m b m m t s h o n c a r h u s a e m v u o n t e y a t S o a s o c e a e a o t p o a a P e n S o c u c l d m u o u i s a d m i m o n a n e i a r v s i i S a e d r H S u h P u a u d l d H s t t a t t a h b a e P d r n s u a l t r i u i a n n u o y u o b m n c h i p r a m C b e e r H if c f u a a l i m o c b a a t o r l H u p o p e i G s e i r o e d r l m o a o a r T n e u d u o s n M S B a B d r m o m o r u n a e s r H d P b i m s u e a o n b d l m b a i m C P io i v e m e p t o u e la o t S P o o v n o u P S d o o y u o m c a u G h p r a S a s a c r d e h S P r l R o a i d H i u d H d u u r P r P a C u i s t i l r u P m u n b e e if p e if H e c p u e i s G u o n s B o M B S a H s d P P d b C P r o i o i o v p h r G o R a r P P
46
0.00 3.75 15.00
Project Type 1000 Genomes control BOCA-UKbioRxiv preprintnormal doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which Figure 2 BRCA-UK wastumour not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. BTCA-SG CLLE-ES CMDI-UK ESAD-UK LAML-KR LICA-FR LINC-JP LIRI-JP ORCA-IN OV-AU PACA-AU PACA-CA PAEN-AU PAEN-IT PRAD-CA PRAD-UK RECA-EU
id Methylophaga frappieri Burkholderia phytofirmans Hepatitis B virus Nocardia farcinica Pseudomonas mendocina Sphingopyxis alaskensis Rhodococcus erythropolis Ochrobactrum anthropi Gordonia polyisoprenivorans Sphingomonas wittichii Zymomonas mobilis Chelativorans sp BNC1 Pseudomonas resinovorans Pseudomonas fluorescens Pseudomonas poae TKP Pseudomonas sp paradoxus Variovorax Delftia acidovorans Pseudomonas putida Pseudomonas savastanoi Pseudomonas protegens Stenotrophomonas maltophilia Achromobacter xylosoxidans Human herpesvirus 5 Klebsiella pneumoniae Enterobacter cloacae Klebsiella oxytoca Serratia marcescens Serratia liquefaciens Salmonella enterica Parvibaculum lavamentivorans Plautia stali symbiont Cronobacter sakazakii Enterobacter sp 638 Serratia proteamaculans Serratia plymuthica Rahnella aquatilis Enterobacteria phage HK022 Escherichia phage HK639 Klebsiella variicola Brevundimonas subvibrioides Pantoea ananatis Enterobacteria phage P1 Methylibium petroleiphilum Pseudoxanthomonas spadix Raoultella ornithinolytica Pseudomonas brassicacearum Alteromonas macleodii Rhodopseudomonas palustris Xanthobacter autotrophicus Gluconobacter oxydans Sphingomonas sp MM Sphingobium japonicum Bradyrhizobium sp BTAi1 Bradyrhizobium sp S23321 Methylobacterium populi Methylobacterium extorquens Methylobacterium radiotolerans Xanthomonas euvesicatoria Thauera sp MZ1T Xanthomonas campestris Acidovorax sp JS42 Comamonas testosteroni Pseudomonas stutzeri Ralstonia pickettii Alicycliphilus denitrificans Acidovorax sp KKS102 Cupriavidus pinatubonensis Ralstonia solanacearum Cupriavidus metallidurans Acidovorax ebreus Pseudomonas aeruginosa Micrococcus luteus Propionibacterium acnes Escherichia coli Enterobacteria phage phiX174 sensu lato Alcanivorax dieselolei A0.3 Shewanella sp Psychrobacter sp PRwf.1 Bordetella petrii Acinetobacter baumannii Burkholderia vietnamiensis Burkholderia cenocepacia Burkholderia lata Burkholderia ambifaria Burkholderia multivorans Burkholderia phage KS5 Human mastadenovirus C Herbaspirillum seropedicae Corynebacterium urealyticum Corynebacterium kroppenstedtii Shewanella baltica Streptococcus mutans Leuconostoc carnosum Maraba virus Enterobacteria phage lambda Aggregatibacter aphrophilus Bifidobacterium dentium Streptococcus oligofermentans Campylobacter curvus Propionibacterium propionicum Capnocytophaga ochracea Streptococcus constellatus Streptococcus intermedius Fusobacterium nucleatum Atopobium parvulum Prevotella sp oral taxon 299 Streptococcus anginosus Filifactor alocis denticola Treponema forsythia Tannerella Porphyromonas gingivalis Prevotella denticola Selenomonas sputigena Prevotella intermedia Olsenella uli Prevotella dentalis Alistipes shahii Bacteroides salanitronis Bacteroides fragilis Porphyromonas asaccharolytica Streptococcus dysgalactiae Rothia dentocariosa Neisseria gonorrhoeae Neisseria lactamica Neisseria meningitidis Haemophilus influenzae Haemophilus parainfluenzae Prevotella melaninogenica parvula Veillonella Rothia mucilaginosa Streptococcus salivarius Streptococcus parasanguinis Streptococcus pneumoniae Streptococcus mitis Streptococcus pseudopneumoniae Corynebacterium variabile Streptococcus oralis Streptococcus thermophilus Lactobacillus fermentum Lactobacillus rhamnosus Streptococcus phage 20617 Lactobacillus plantarum Streptococcus sp I.P16 Lactobacillus salivarius Clostridium sp SY8519 Lactobacillus phage Sha1 Streptococcus sp I.G2 Lactobacillus brevis Lactobacillus buchneri Lactobacillus paracasei Lactobacillus casei Lactobacillus phage Lc.Nu Lactobacillus sakei YMC.2011 Streptococcus phage Arcanobacterium haemolyticum Streptococcus gordonii Streptococcus sanguinis Human herpesvirus 1 Staphylococcus haemolyticus Enterococcus faecalis Geobacillus thermoleovorans Geobacillus sp WCH70 Leuconostoc kimchii Anoxybacillus flavithermus Dechloromonas aromatica Meiothermus silvanus Delftia sp Cs1.4 Lactobacillus delbrueckii Burkholderia gladioli Arthrobacter arilaitensis Alicyclobacillus acidocaldarius Deinococcus proteolyticus Exiguobacterium sibiricum Enterococcus faecium Bifidobacterium longum Cryptobacterium curtum Bifidobacterium breve Akkermansia muciniphila Eggerthella lenta Bacteroides thetaiotaomicron Parabacteroides distasonis Bacteroides vulgatus Bacillus amyloliquefaciens Bacillus subtilis Bacillus sp JS Bacillus phage SPbeta Gardnerella vaginalis Staphylococcus aureus Staphylococcus epidermidis Human endogenous retrovirus K Enterobacteria phage M13 Lactococcus lactis Acinetobacter calcoaceticus Staphylococcus saprophyticus Staphylococcus warneri Human herpesvirus 4 Cupriavidus necator Thiobacillus denitrificans Bacillus weihenstephanensis Bacillus thuringiensis Bacillus toyonensis B
47 bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
Figure 3
A Chronic Pancreatic B lymphocytic cancer leukemia
Bone cancer P. fluorescens
TSNE TSNE 2 P. poae P. sp TKP
Chronic myeloid disorders & bone TSNE 1 cancer C Bone cancer
Liver
cancer TSNE TSNE 2
Mastadenovirus C G. polyisoprenivorans
TSNE 1 Chronic lymphocytic leukemia D E F C. metallidurans
Cluster 4* M. populi
Cluster 2**
TSNE TSNE 2 TSNE 2 TSNE TSNE 2 P. protegens
Hepatitis B
B dentium Human Herpesvirus 5
TSNE 1 TSNE 1 TSNE 1 Esophageal cancer Liver cancer Pancreatic adenocarcinoma G H G. oxydans R. palustris Cluster 5*
P. sp TKP
TSNE 2 TSNE TSNE TSNE 2
S. marcescens
TSNE 1 TSNE 1 Prostate adenocarcinoma Renal cancer
48 bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
Figure 4
A B C D 1 0 0 % 8 0
6 0
s s
r N o G . p o ly is o p re n iv o ra n s
t
a
n
e e
i G . p o ly is o p re n iv o ra n s
y
t
a 4 0
n 6 0
p
i
5 0 %
f
e
o
g
. o
A 2 0 N 4 0
0 0 % B C 1 2 3 / 1 2 3 r r r A t r r r e e e e te te te t n t t t e i s s s s s s n B u u u i lu lu lu l l l B C C C E C C C F G H 1 0 0 % 1 5
e 8 0
n
e
s
r
g
s
a r
n 1 0
e
e
o
y
i
c
t 5 0 % n
n 6 0
a
i
a
r
c
e
e
t
f
l
g a
o 5
A
. o
N 4 0
0 0 % 1 2 3 4 1 2 3 1 2 3 4 r r r r r r r e e e r r r r te te te te t t t te te te te s s s s s s s s s s s u u u u lu lu lu u u u u l l l l l l l l C C C C C C C C C C C I J K 1 0 0 % 1 0 0 % 1 0 0 %
5 0 % 5 0 % 5 0 %
0 % 0 % 0 %
1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 r r r r r r r r r r r r r r r e e e e e e e e e e te te te te te t t t t t t t t t t s s s s s s s s s s s s s s s u u u u u lu lu lu lu lu lu lu lu lu lu l l l l l L C C C C C C C MC C C C N C C C C
8 0 8
s e
r 8 0
n
a
e e
s 6
y
r
g
s
n
a
r
i n
6 0
e
e
e
o
i
y
c
g
t
n
A a
n 6 0 4
i
r
a
e
c
e
t
l
f
g
a
o
A 4 0 . 2 i i i o s s 2 2 s s i s s a a i i s s 4 4 n n n n t t m m i i
u t t N u u a a o o u s s r r 4 0 e e S S r r e e o o e e u u t t r r J J ic ic t t k k c c s s b f f e e u u n n i i e e b i i t t l l i i ic ic n n e e p p r r s s g g p p s s it it o o s s u u p p o o x x t t u u r r p p m m a a x x n n a a a a a a r r a a e e s s c c e e i i j j 0 r r e e c c a a n n c c o o d d t t v v o o o o s s to to m m s s o o v v s s s s c c a a u u a a . . . . u u a a o o ls ls i i 1 2 3 4 5 b d d o o l l r r n n n n . .b .b b i i d d i i n n o o a a b b o o . c c i i h h o o ic ic R R o o r r r r r p p p p A A c c p p m m g g m m e e e e e . . . . i i m m M M o o o t t t t t A A l l a a o o in in p p p p o c c o d d n h h s s s s s r r r r n o y y h h t t n m m n u u p n n u u u u u 0 ic ic o o e e p l l l l l 0 0 0 l l s s S S a a 1 0 0 0 C C X X C C C C C - A A P P o 1 0 0 o o - 1 o n o n 0 - 0 n n n 0 1 1 0 - 0 0 1 0 0 P Q 1 O 1 0 0 %
8 0 s
1 0 0 % r
a
e y
N o t a lte re d
n 6 0
5 0 % i
e g
A A lte re d 4 0
0 %
1 2 d d e r r t te te te c c e s s t te lu lu e e C C d D t 5 0 % o N
0 % 1 2 3 4 r r r r te te te te s s s s lu lu lu lu C C C C
49 bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
Figure 5
A C
DNA damage response, signal Mitotic cell cycle transduction by p53 class mediator
+ - + -
Ubiquitin-protein ligase activity Blood coagulation involved in mitotic cell cycle
+ - + -
Cysteine-type endopeptidase Macrophage activity involved in activation apoptotic process
+ - + -
B D Ubiquitin- DNA damage protein response, ligase Interleukin-1 signal activity beta transduction involved in production by p53 class mitotic cell mediator + - + - + - cycle Epidermal Protein growth factor Immune catabolic receptor response signalling process pathway + - + - + - Cysteine-type Sequence- endopeptidas specific DNA Activated T cell e activity binding proliferation involved in transcription apoptotic factor activity + - process + - + -
Signal Cell Cell cycle transduction differention + - + - + -
Tumor Interferon- necrosis B cell Cytokine Apoptotic gamma factor proliferation secretion process production production + - + - + - + - + -
release of epithelial cell cytochrome c cell-substrate MAPK inflammatory proliferation from adhesion cascade response mitochondria + - + - + - + - + -
50 bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
Extended Data Table 1
Phage Virus Bacteria All
Project matched tumor- Project matched tumor- matched tumor- matched tumor- matched tumor- healthy total healthy healthy healthy healthy Project control tissue total control tissue control tissue control tissue control tissue normals reads in normals normals normals normals samples samples samples samples samples samples samples samples samples samples samples bn
1000 365 365 85.85 444 13,725 764,787 778,956 Genomes
BOCA-UK 76 76 152 193.74 1 0 25 0 22,502 166,820 22,528 166,820
BRCA-UK 46 46 92 121.42 3 3 4,602 10,404 2,730 1,800 7,335 12,207
BTCA-SG 12 12 24 44.30 0 0 0 0 259 554 259 554
CLLE-ES 97 97 194 179.15 3 4 515 676 11,548 13,256 12,066 13,936
CMDI-UK 32 32 64 84.55 23 20 2 2 28,464 4,759 28,489 4,781
ESAD-UK 97 97 194 343.24 61 5 42 140 9,258 33,838 9,361 33,983
LAML-KR 10 10 20 20.42 173 2 87,268 29 358,442 515 445,883 546
LICA-FR 6 6 12 17.85 0 0 2 0 1,065 1,584 1,067 1,584
LINC-JP 31 31 62 102.88 104 239 37 140 99,733 324,055 99,874 324,434
LIRI-JP 256 256 512 641.41 0 0 3 21 182 376 185 397
ORCA-IN 13 13 26 41.72 0 0 0 2 80 12,589 80 12,591
OV-AU 73 73 146 220.69 11 16 30 24 3,316 3,394 3,357 3,434
PACA-AU 97 97 194 348.55 13 18 145 97 7,142 6,348 7,300 6,463
PACA-CA 147 147 294 337.77 32 19 31 83 377,496 566,397 377,559 566,499
PAEN-AU 49 49 98 165.19 4 4 102 151 1,128 1,371 1,234 1,526
PAEN-IT 37 37 74 82.95 0 0 7 4 733 909 740 913
PRAD-CA 124 124 248 269.85 21 35 8 61 115,213 412,634 115,242 412,730
PRAD-UK 33 33 66 111.30 50 6 3 2 24,961 5,601 25,014 5,609
RECA-EU 94 94 188 377.53 2 4 49 50 15,442 13,624 15,493 13,678
TOTAL 365 1,330 1,330 3,025 3,790.34 444 501 375 13,725 92,871 11,886 764,787 1,079,694 1,570,424 778,956 1,173,066 1,582,685
51 bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
Extended Data Figure 2
O th e r s p e c ie s m a tc h e d A 1 0 0 ,0 0 0 B A lp h a p a p illo m a v iru s B lo o d d e riv e d s a m p le s (n = 3 6 5 )
B 1 0 ,0 0 0 H e lic o b a c te r p y lo ri
P ly m p h o b la s to id c e ll lin e s a m p le s (n = 1 0 2 )
P R
1 ,0 0 0 S R R 1 6 1 1 1 2 8
d e
h 1 0 0
c S R R 1 6 1 1 1 2 7
t
a m 1 0 S R R 1 6 1 1 0 0 0
1 i s i 2 s a i s s 2 s s r ii s a 1 i i s 3 s i e a ri s s s s s ri i r s s s s li B ti n s i1 i s 4 . s n i u . a i is n s s i n s s a s i e m i n e is s m m s 0 1 s e o a i i is 0 s e i s n 6 t n u 4 n d to n m h u S id f li r c l s a i n a o n ti m ic li e s u a lu n e il u M i 1 u l l s s n S R R 6 4 3 2 7 0 9 il 1 u n s o o a e a e A i u c s e J t n t i 0 u n s c u z r s u z n c i i tu i e n c in t u u d P l o o i t e a c s e r r S r i a ra r ti u r w a a e t A v n id e o n e a c o i n a t ti n h u u r i a M d c i C D i l io n S r c n in c k u J e c T m c a t r b p u n p y r e e ra t d t c a c e r u e o a q ff u i i it h e ic r e m b K a e fi a u c u l B r e o i i p R i m h p n x u r e s fi l in e o t n e p e g r a i g p p if n M A p s l d o u e i g i i ir i d a p o fa e v e v e s P g a n o h l o u i g s c o l n w d s s g 2 e a s K c g u r h v p li s t p n c w s s a u la p s p o o s t fl o r s g a o s a m c to n i b o n p o ie c s h m r it c l s o e id o a x s a p a l o b n s g t it u a f n u s h a u m u a s a r s p i 1 r o o e r s p s u in e i s a a u x i u s s d s e a u v e r la a u y lo a u in s c il i p o e r s x m s c m a n 4 s h d c t u i r n r e i t c d p u i n a p r ll n s b t p l ll t o s e n c s n c a c n a e u a j r n th e lu s l r u a e e n e a iq e c a n r i o r la r e e t y s c u a t e c u c n e l n e r c iu s s n iu e Y p x s t il il e d rp c r ra l m id a l e o c l a e a x c c r d o u c la o t a p h c u r te a a e e e c x c t h s h to m o o o u s o o v m te e te d s n n n s r lu o a s c o s t m a m id u ll o m m p r d b a c a c t a s c e c v l i u v a h o a r r a i u l c a s m c e c n c a u p u o r c i te u s c r o c a r a s h ls s m y b a i s m B o c c o s lo p ll e i c o p o o o m r l s i i c t c c m i ia a e y a r b o a s n lu lo o o c ri ft o n id d a e a w il i t c o c n lu t a c m t o e l u r s r s o c b r s b t iv i e o B B b u o i E n a u d iu m c l ia a n b B u s c c a r s o l c i p h te e c e e lo o a a c m h t l v i ll h d y i r iz o p e g c u b c e e u a c o u i a h o a d i i t u t y lo c b g o e lu o a n o t c y o n i m a R i h c a u n n m A e o rd o c h a B i t il m c t r ll e u e c s c c c c c n b g s l t a t p d o c ip v te h c D o i s r t V id B b p a L o e tr e a b o a c h to to i o s i e ib c o a 0 5 0 i i a o l m p A s r o C t h u h a e o S o M e h a t te w b c a p o h d in i c n t l p R b c p d c u ia a c u y l s P n c v r p b R n S s le b o b t p c p i e a i e li m . . . B u r t a ll d y l p H c G i o a m t o m o o P s o o o c ta p e a if h c a A r a A o y H p S b i h a S y c l i o i E v u t K t o lo e r L S p N ib g t u 0 0 1 r e ic c a s y r o S h l e p s t y S t B A e S n C P s l u lo r p R A p r m C re il n e u p P S S n r a a B a P h u h e T c i r h e r P A C y B t p c a P c t F e t g h a C a tr e a g G t S t A H B A S P e S M A S M
C D E
s
s s
d
d d
a
a a
e e
e 6 6 6
r
r r
d
d d
e
e e
p
p p
p p
p 4 4 4
a
a a
m
m m
A
A A
2 2 2
W
W W
B B
B r = 0 .6 8 (p = 0 .0 0 0 0 8 ) r = 0 .7 1 (p = 0 .0 0 0 1 7 ) r = 0 .9 7 (p = 0 .0 0 3 8 )
g
g g
o
o o
l l l 0 0 0 2 4 6 2 4 6 2 4 6 lo g K ra k e n m a tc h e d re a d p a irs lo g K ra k e n m a tc h e d re a d p a irs lo g K ra k e n m a tc h e d re a d p a irs
F G H t
4 -1 5 1 .0 n
h r = 0 .5 9 2 3 (p = 1 0 )
e
t
i g
6 c
i
n
f f
e 3
S
l
e
G
y
o
l
c
W b
4 0 .5
2 n
0
m
o
1
e
i
s
g
t
s
o
a
l l
a 2
1 e
r g
r = 0 .5 6 (p = 0 .0 0 7 9 ) r
o
l
o c 0 0 0 .0 2 4 6 0 1 2 3 4 5 lo g K ra k e n m a tc h e d re a d p a irs lo g 1 0 R N A -S e q
I J K 1 5 0 ,0 0 0 8 1 5 -15
r = 0 .5 7 8 2 (p < 1 0 )
s
r
i
s
l
t
a
a
e p
m 6
r
s
o
l
d
l
a n
1 0 0 ,0 0 0 t
a
u
d
f
a
e
e
r
d
h
o
l 4
i l
1 0 c
t
t
a
%
a
a
f
0
m
R
o
5 0 ,0 0 0 1
0
1 2
o
g %
-1 5 t
o l
0 r > 0 ,9 9 9 (p < 1 0 ) 1 0 5 0 0 2 4 6 8 0 5 0 0 0 0 0 1 0 0 0 0 0 0 1 5 0 0 0 0 0 C ii s e tt s a e lo g 1 0 tu m o r tis s u e u o e n A ll re a d p a irs r p k c i ic a v s p o a n a m e n i iu d o n r a m to te t o s c s d l a a u a b m e R i s n n P io a p m ro u P H
52 bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
Extended Data Figure 3
A
1 5 0
s
e l
p 1 0 0
m
a
s
f 1 0 0 0 G e n o m e s
o
. 5 0
o N B O C A -U K P A C A -A U 0 P A C A -C A P A E N -A U 0 1 2 3 4 5 6 7 8 9 0 P A E N -IT 1 P R A D -C A P R A D -U K B L IR I-J P L IC A -F R E S A D -U K C L L E -E S 4 0 0 B R C A -U K
s R E C A -E U e l B OBCTAC-UAK-S G
p C M D I-U K P A CL A-MA UL -K R m L IN C -J P
a P AOC RA -CC A -I s
OP AVE-NA -UA U f 2 0 0 N
o P A E N -IT
. P R A D -C A o
N P RBAODC-UAK-U K L IR I-J P 4 0 0 P A C A -A U s L IC A -F R
e 0 P A C A -C A l E S A D -U K p 0 1 2 3 4 5 6 7 8 9 0 P A E N -A U m 1 C L L E -E S a P A E N -IT
s B R C A -U K C
f P R A D -C A
o 2 0 0 R E C A -E U
. P R A D -U K
B T C A -S G o
s L IR I-J P
N 4 0 0 C M D I-U K e
l L IC A -F R L A M L -K R p E S A D -U K L IN C -J P m 0 C L L E -E S a O V -A U s 0 0 1 2 3 4 5 6 7 8 9 B R C A -U K 1 f O R C A -IN 2 0 0
o R E C A -E U
. B T C A -S G
o C M D I-U K N L A M L -K R L IN C -J P 0 O V -A U 0 1 2 3 4 5 6 7 8 9 0 O R C A -I 1 D 3 0 N
s tu m o r tis s u e
s 2 0 e
n m a tc h e d n o rm a ls h
c h e a lth y c o n tro ls i
R 1 0
0
) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) 5 2 2 4 4 4 4 0 2 2 2 6 6 4 4 8 4 8 6 8 6 5 9 2 9 6 9 2 1 6 1 2 4 9 9 9 7 4 6 8 1 = = 1 = 1 = = = 5 = 1 1 2 = = 2 = 1 3 n = = = n = = = = (n n = n = n (n n = ( ( n (n n ( n ( n ( ( n n n n ( n n (n ( ( ( ( ( ( ( ( ( K G K R R P IN U IT K K K F J P - U U A A - A U s -U S S U K - - - -U e U - E I- U - -J A A A C N C E - A A - - L A C I - - - N - D - m D C V A A E D A A C C E D M IC IN R R E A A o C L M A I O C C A A C n R T L A L L L O P R e O B B C S L A A P R P E C E P P P R G B 0 0 0 1
53 bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
Extended Data Figure 4
Bifidobacterium animalis
Bifidobacterium dentium
Cronobacter sakazakii
Cupriavidus metallidurans
Gluconobacter oxydans
Gordonia polyisoprenivorans
Hepatitis B virus
Human herpesvirus 1
Human herpesvirus 5
Human herpesvirus 6A
Human mastadenovirus C
Methylobacterium populi
54 bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
Parvibaculum lavamentivorans
Plautia stali symbiont
Pseudomonas fluorescens
Pseudomonas mendocina
Pseudomonas poae
Pseudomonas protegens
Pseudomonas sp. TKP
Pseudopropionibacterium propionicum
Rhodopseudomonas palustris
Salmonella enterica subsp. enterica serovar Typhi
Serratia liquefaciens
55 bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
Serratia marcescens subsp. marcescens
Serratia plymuthica
Serratia proteamaculans
Thauera sp. MZ1T
56 bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
Extended Data Figure 5
n = 218 All identified species-level taxa
n = 147 Excluding taxa also detected in healthy controls
n = 70 Excluding taxa detected in < 5 tumor tissues or matched controls (except Hepatitis B)
n = 68 Excluding all remaining phage hits
n = 49 Excluding likely oral microbiome contamination
n = 27 Excluding likely sequencing / reagent contaminants
57 bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
Extended Data Figure 6
58 bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
Extended Data Figure 7
A B C D
E F G
59 bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
Extended Data Figure 8
A B C D
60 bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
Extended Data Figure 9
A 1 5
%
,
s
y
d
l a
e 1 0
e
v i
r tu m o r tis s u e
t
e d
t m a tc h e d n o rm a ls
e
a
t t
a 5
u
r
p
g
e
t
n i 0 i t is ii s s s s 1 5 A C l s s a e s s a s s a s T l m n n u u n n m P i c n n c n 1 a u k n r s s 6 o u n in a n K r i i i a a a a i s p a i e o e t r ie e h a Z t z r r u u s u r b ic c T s e c t l im n u d o v ir ir r o o c o p g t c u M a y u i p n s e p lu a s u n e k d v B v v ir v v m e d s t n f e c p a d li x i s s ti y io r a s a e m a a l o n s v o m n o p e c s s a i e e s n n s p o e n r s a u r ly m m m t r e t p p iu e i o p a l a a r e r ti r r e e r l r lu m o s l iq p a r u u e e t p p d a f s n a e l m e e i i t a e e r e m t p s m o a t r r m c o p h h a t a s o a n n a a i u e e c a s e t c s a n o o i i t o a t t a s i e h s v m a n d m t t a r c c b y n n a a ia n o o a a r p h b u o l H a a n a b l t iu o u m lm r r r a a o d r o e m d o r r e a T b b i n o m m a m lo u m s u a i n v o p m a te m o d S e e S t o o o c u u m n y u l o P d e u S S a d d r ia a a h l P c o d s e r i i r u i H H u t u a d u r f f C l n H e u e P s e i i p m c ib u e p B B u G o u a e s S d M n s s P o C r H ib o P d v i P o r p o G a o h P r R P
B C 1 0
1 0
%
%
,
,
s
s
y
d
l
y
d
l
a e
a 1
e e
1 v
e
i
v
r
i
t
r
t
e
d
e
t
d
t
e
e
a
t
a
t
t
t
a
a u
r 0 .1 u
r 0 .1
p
g
p
g
e
e
t
t
n
n
i i 0 .0 1 0 .0 1
) ) ) ) 4 5 7 7 4 = 2 2 = = = (n n (n n ( a ( a x e s x a u l a t s a t l s m l a i r r t ia i r o r v n e o t d c m e a tu b h tc a m
61 bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
Extended Data Figure 10
62 bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
REFERENCES
1. Schwabe, R. F. & Jobin, C. The microbiome and cancer. Nat. Rev. Cancer 13, 800–12
(2013).
2. Goodman, B. & Gardner, H. The microbiome and cancer. J. Pathol. 244, 667–676
(2018).
3. zur Hausen, H. Papillomaviruses and cancer: from basic studies to clinical application.
Nat. Rev. Cancer 2, 342–350 (2002).
4. Tang, K.-W., Alaei-Mahabadi, B., Samuelsson, T., Lindh, M. & Larsson, E. The landscape
of viral expression and host gene fusion and adaptation in human cancer. Nat.
Commun. 4, 2513 (2013).
5. Cantalupo, P. G., Katz, J. P. & Pipas, J. M. Viral sequences in human cancer. Virology
513, 208–216 (2018).
6. Walboomers, J. M. M. et al. Human papillomavirus is a necessary cause of invasive
cervical cancer worldwide. J. Pathol. 189, 12–19 (1999).
7. Schwarz, E. et al. Structure and transcription of human papillomavirus sequences in
cervical carcinoma cells. Nature 314, 111–114 (1985).
8. Perz, J. F., Armstrong, G. L., Farrington, L. A., Hutin, Y. J. F. & Bell, B. P. The
contributions of hepatitis B virus and hepatitis C virus infections to cirrhosis and
primary liver cancer worldwide. J. Hepatol. 45, 529–538 (2006).
9. Shafritz, D. A., Shouval, D., Sherman, H. I., Hadziyannis, S. J. & Kew, M. C. Integration
of Hepatitis B Virus DNA into the Genome of Liver Cells in Chronic Liver Disease and 63 bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
Hepatocellular Carcinoma. N. Engl. J. Med. 305, 1067–1073 (1981).
10. Uemura, N. et al. Helicobacter pylori Infection and the Development of Gastric Cancer.
N. Engl. J. Med. 345, 784–789 (2001).
11. Watanabe, T., Tada, M., Nagai, H., Sasaki, S. & Nakao, M. Helicobacter pylori infection
induces gastric cancer in mongolian gerbils. Gastroenterology 115, 642–8 (1998).
12. zur Hausen, H. The search for infectious causes of human cancers: Where and why.
Virology 392, 1–10 (2009).
13. Goodwin, S., McPherson, J. D. & McCombie, W. R. Coming of age: ten years of next-
generation sequencing technologies. Nat. Rev. Genet. 17, 333–351 (2016).
14. Niedobitek, G. et al. Detection of human papillomavirus type 16 DNA in carcinomas of
the palatine tonsil. J. Clin. Pathol. 43, 918–21 (1990).
15. Gillison, M. L. et al. Evidence for a Causal Association Between Human Papillomavirus
and a Subset of Head and Neck Cancers. J. Natl. Cancer Inst. 92, 709–720 (2000).
16. Castellarin, M. et al. Fusobacterium nucleatum infection is prevalent in human
colorectal carcinoma. Genome Res. 22, 299–306 (2012).
17. Kostic, A. D. et al. Genomic analysis identifies association of Fusobacterium with
colorectal carcinoma. Genome Res. 22, 292–8 (2012).
18. Repass, J. et al. Replication Study: Fusobacterium nucleatum infection is prevalent in
human colorectal carcinoma. Elife 7, (2018).
19. Hanahan, D. & Weinberg, R. A. Hallmarks of cancer: the next generation. Cell 144,
646–74 (2011). 64 bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
20. Riley, D. R. et al. Bacteria-Human Somatic Cell Lateral Gene Transfer Is Enriched in
Cancer Samples. PLoS Comput. Biol. 9, e1003107 (2013).
21. Zapatka, M. et al. The landscape of viral associations in human cancers. bioRxiv
465757 (2018). doi:10.1101/465757
22. Khoury, J. D. et al. Landscape of DNA virus associations across human malignant
cancers: analysis of 3,775 cases using RNA-Seq. J. Virol. 87, 8916–26 (2013).
23. International Cancer Genome Consortium, T. I. C. G. et al. International network of
cancer genome projects. Nature 464, 993–8 (2010).
24. Gibbs, R. A. et al. A global reference for human genetic variation. Nature 526, 68–74
(2015).
25. Wood, D. E. & Salzberg, S. L. Kraken: ultrafast metagenomic sequence classification
using exact alignments. Genome Biol. 15, R46 (2014).
26. Hu, Z. et al. Genome-wide profiling of HPV integration in cervical cancer identifies
clustered genomic hot spots and a potential microhomology-mediated integration
mechanism. Nat. Genet. 47, 158–163 (2015).
27. Young, L., Sung, J., Stacey, G. & Masters, J. R. Detection of Mycoplasma in cell
cultures. Nat. Protoc. 5, 929–934 (2010).
28. Li, H. & Durbin, R. Fast and accurate long-read alignment with Burrows-Wheeler
transform. Bioinformatics 26, 589–95 (2010).
29. Caygill, C. P., Hill, M. J., Braddick, M. & Sharp, J. C. Cancer mortality in chronic typhoid
and paratyphoid carriers. Lancet (London, England) 343, 83–4 (1994).
65 bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
30. Scanu, T. et al. Salmonella Manipulation of Host Signaling Pathways Provokes Cellular
Transformation Associated with Gallbladder Carcinoma. Cell Host Microbe 17, 763–
774 (2015).
31. Geller, L. T. et al. Potential role of intratumor bacteria in mediating tumor resistance
to the chemotherapeutic drug gemcitabine. Science (80-. ). 357, 1156–1160 (2017).
32. Feng, Y. et al. Metagenomic and metatranscriptomic analysis of human prostate
microbiota from patients with prostate cancer. BMC Genomics 20, 146 (2019).
33. Kripalani-Joshi, S. & Law, H. Y. Identification of integrated epstein-barr virus in
nasopharyngeal carcinoma using pulse field gel electrophoresis. Int. J. Cancer 56, 187–
192 (1994).
34. Morton, C. et al. Mapping of the human Blym-1 transforming gene activated in Burkitt
lymphomas to chromosome 1. Science (80-. ). 223, 173–175 (1984).
35. Robinson, K. M., Crabtree, J., Mattick, J. S. A., Anderson, K. E. & Dunning Hotopp, J. C.
Distinguishing potential bacteria-tumor associations from contamination in a
secondary data analysis of public cancer genome sequence data. Microbiome 5, 9
(2017).
36. Thompson, K. J. et al. A comprehensive analysis of breast cancer microbiota and host
gene expression. PLoS One 12, e0188873 (2017).
37. Mazzoni, E. et al. Significant Association Between Human Osteosarcoma and Simian
Virus 40. doi:10.1002/cncr.29137
38. Ramanan, P., Deziel, P. J. & Wengenack, N. L. Gordonia bacteremia. J. Clin. Microbiol.
66 bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
51, 3443–7 (2013).
39. Ding, X. et al. Bacteremia due to Gordonia polyisoprenivorans: case report and review
of literature. BMC Infect. Dis. 17, 419 (2017).
40. Gupta, M., Prasad, D., Khara, H. S. & Alcid, D. A rubber-degrading organism growing
from a human body. Int. J. Infect. Dis. 14, e75–e76 (2010).
41. Austen, B. et al. Mutations in the ATM gene lead to impaired overall and treatment-
free survival that is independent of IGVH mutation status in patients with B-CLL.
(2005). doi:10.1182/blood-2004-11-4516
42. Balthazar, E. J., Megibow, A. J. & Hulnick, D. H. Cytomegalovirus esophagitis and
gastritis in AIDS. AJR. Am. J. Roentgenol. 144, 1201–4 (1985).
43. Lepiller, Q., Tripathy, M. K., Di Martino, V., Kantelip, B. & Herbein, G. Increased HCMV
seroprevalence in patients with hepatocellular carcinoma. Virol. J. 8, 485 (2011).
44. Leonardsson, H., Hreinsson, J. P., Löve, A. & Björnsson, E. S. Hepatitis due to Epstein–
Barr virus and cytomegalovirus: clinical features and outcomes. Scand. J.
Gastroenterol. 52, 893–897 (2017).
45. Bruix, J. & Llovet, J. M. Hepatitis B virus and hepatocellular carcinoma. J. Hepatol. 39,
59–63 (2003).
46. Lai, C.-C. et al. Infections caused by unusual Methylobacterium species. J. Clin.
Microbiol. 49, 3329–31 (2011).
47. Langevin, S., Vincelette, J., Bekal, S. & Gaudreau, C. First case of invasive human
infection caused by Cupriavidus metallidurans. J. Clin. Microbiol. 49, 744–5 (2011).
67 bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
48. Ugge, H. et al. Acne in late adolescence and risk of prostate cancer. Int. J. cancer 142,
1580–1585 (2018).
49. Davidsson, S. et al. Frequency and typing of Propionibacterium acnes in prostate
tissue obtained from men with and without prostate cancer. Infect. Agent. Cancer 11,
26 (2016).
50. COHEN, R. J., SHANNON, B. A., McNEAL, J. E., SHANNON, T. & GARRETT, K. L.
PROPIONIBACTERIUM ACNES ASSOCIATED WITH INFLAMMATION IN RADICAL
PROSTATECTOMY SPECIMENS: A POSSIBLE LINK TO CANCER EVOLUTION? J. Urol. 173,
1969–1974 (2005).
51. Mahlen, S. D. Serratia infections: from military experiments to current practice. Clin.
Microbiol. Rev. 24, 755–91 (2011).
52. Fessler, J., Matson, V. & Gajewski, T. F. Exploring the emerging role of the microbiome
in cancer immunotherapy. J. Immunother. Cancer 7, 108 (2019).
53. Motzer, R. J. et al. Nivolumab versus Everolimus in Advanced Renal-Cell Carcinoma. N.
Engl. J. Med. 373, 1803–13 (2015).
54. Sung, W.-K. et al. Genome-wide survey of recurrent HBV integration in hepatocellular
carcinoma. Nat. Genet. Vol. 44, (2012).
55. Welcome | ICGC Data Portal. Available at: https://dcc.icgc.org/. (Accessed: 6th June
2019)
56. Index von /vol1/ftp/. Available at: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/.
(Accessed: 6th June 2019)
68 bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
57. European Nucleotide Archive < EMBL-EBI. Available at: https://www.ebi.ac.uk/ena.
(Accessed: 6th June 2019)
58. Afgan, E. et al. The Galaxy platform for accessible, reproducible and collaborative
biomedical analyses: 2018 update. Nucleic Acids Res. 46, W537–W544 (2018).
59. Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25,
2078–2079 (2009).
60. Picard Tools - By Broad Institute. Available at: http://broadinstitute.github.io/picard/.
(Accessed: 6th June 2019)
61. Quinlan, A. R. & Hall, I. M. BEDTools: a flexible suite of utilities for comparing genomic
features. Bioinformatics 26, 841–842 (2010).
62. Bolger, A. M., Lohse, M. & Usadel, B. Trimmomatic: a flexible trimmer for Illumina
sequence data. Bioinformatics 30, 2114–20 (2014).
63. Blankenberg, D. et al. Manipulation of FASTQ data with Galaxy. Bioinformatics 26,
1783–5 (2010).
64. Rognes, T., Flouri, T., Nichols, B., Quince, C. & Mahé, F. VSEARCH: a versatile open
source tool for metagenomics. PeerJ 4, e2584 (2016).
65. The R Foundation for Statistical Computing R version 3.3.2. (https://www.r-
project.org/).
66. Escapa, I. F. et al. New Insights into Human Nostril Microbiome from the Expanded
Human Oral Microbiome Database (eHOMD): a Resource for the Microbiome of the
Human Aerodigestive Tract. mSystems 3, e00187-18 (2018).
69 bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
67. Dewhirst, F. E. et al. The Human Oral Microbiome. J. Bacteriol. 192, 5002–5017
(2010).
68. HOMD :: Human Oral Microbiome Database. Available at:
http://www.homd.org/index.php?name=HOMD. (Accessed: 6th June 2019)
69. de Goffau, M. C. et al. Recognizing the reagent microbiome. Nat. Microbiol. 3, 851–
853 (2018).
70. Salter, S. J. et al. Reagent and laboratory contamination can critically impact
sequence-based microbiome analyses. BMC Biol. 12, 87 (2014).
71. Jervis-Bardy, J. et al. Deriving accurate microbiota profiles from human samples with
low bacterial content through post-sequencing processing of Illumina MiSeq data.
Microbiome 3, 19 (2015).
72. Leon, L. J. et al. Enrichment of Clinically Relevant Organisms in Spontaneous Preterm-
Delivered Placentas and Reagent Contamination across All Clinical Groups in a Large
Pregnancy Cohort in the United Kingdom. Appl. Environ. Microbiol. 84, e00483-18
(2018).
73. Laurence, M., Hatzis, C. & Brash, D. E. Common Contaminants in Next-Generation
Sequencing That Hinder Discovery of Low-Abundance Microbes. PLoS One 9, e97876
(2014).
74. Glassing, A., Dowd, S. E., Galandiuk, S., Davis, B. & Chiodini, R. J. Inherent bacterial
DNA contamination of extraction and sequencing reagents may affect interpretation
of microbiota in low bacterial biomass samples. Gut Pathog. 8, 24 (2016).
70 bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
75. Camacho, C. et al. BLAST+: architecture and applications. BMC Bioinformatics 10, 421
(2009).
76. Wally, N. et al. Plasmid DNA contaminant in molecular reagents. Sci. Rep. 9, 1652
(2019).
77. Embedding projector - visualization of high-dimensional data. Available at:
https://projector.tensorflow.org/. (Accessed: 6th June 2019)
78. Morpheus. Available at: https://software.broadinstitute.org/morpheus/. (Accessed:
6th June 2019)
79. Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows-Wheeler
transform. Bioinformatics 25, 1754–1760 (2009).
80. Wick, R. R., Judd, L. M., Gorrie, C. L. & Holt, K. E. Unicycler: Resolving bacterial genome
assemblies from short and long sequencing reads. PLOS Comput. Biol. 13, e1005595
(2017).
81. Sondka, Z. et al. The COSMIC Cancer Gene Census: describing genetic dysfunction
across all human cancers. Nat. Rev. Cancer 18, 696–705 (2018).
82. Warden, C. D., Yuan, Y.-C. & Wu, X. Optimal Calculation of RNA-Seq Fold-Change
Values. International Journal of Computational Bioinformatics and In Silico Modeling
2, (2013).
83. Warden, C. D., Kanaya, N., Chen, S. & Yuan, Y.-C. BD-Func: a streamlined algorithm for
predicting activation and inhibition of pathways. PeerJ 1, e159 (2013).
84. Ashburner, M. et al. Gene ontology: tool for the unification of biology. The Gene
71 bioRxiv preprint doi: https://doi.org/10.1101/773200; this version posted September 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
Ontology Consortium. Nat. Genet. 25, 25–9 (2000).
72