-specific 16S Bacterial rRNA Microbiome Analyses of the Lung Identifies a Unique Bacterial Species Associated with a Specific TP53 Mutation

Garth D. Ehrlich, Ph.D., F.A.A.A.S
Executive Director, Center for Genomic Sciences
Executive Director, Center for Advanced Microbial Processing
Professor of Microbiology and Immunology, Drexel University College of Medicine
Director, Meta-Omics Shared Resource, Sidney Kimmel Cancer Center

PacBio Users Group Meeting, St. Louis, MO, September 19, 2018.

Thanks and acknowledgements

The Lab Team and The Bioinformatic Team: Jarek Krol, PhD; Azad Ahmed, MD; Josh Earl, ABD; Josh Mell, PhD; Steven Lang, BS; Bhaswati Sen, PhD; Carol Hope, BS; Archana Bhat, MS; Rachael Ehrlich, MS

Objectives of Pan-Domain Technology

• Broad-based identification of Bacteria and Fungi
  – No a priori guessing of what to test for within these domains
  – No culture necessary
• Mixed populations of microbes (polymicrobial assemblages)
• Quantitative
• High resolution genotyping
• Identify emerging agents not previously seen
• Cost-effective, high throughput

16S Microbiome Analyses

• Revolutionized community-based microbial studies
  – Revealed unprecedented levels of complexity
• Natural systems - soils, water
• Organismal systems
  – Holobiont
  – Metazoans and their microbiota
  – Plants and their microbiota

The Problem with NextGen Microbiome Analyses

Lack of specificity – good for global analyses of diversity
• Short read-lengths
  – Genus-level taxonomic characterization at best
  – Sometimes only family level
• High error rates
  – Single-pass sequencing makes error correction very difficult
  – Vastly increases numbers of OTUs

The problem of false OTUs

• They create differences where there are none
• They amplify any real differences

Pairwise comparison (differing OTUs / total OTUs):

                            S1, 0 errors (400 OTUs)   S1, w/ errors (500 OTUs)
S2, 0 errors (400 OTUs)     0/400                     100/500
S2, w/ errors (500 OTUs)    100/500                   200/600
S3, 0 errors (400 OTUs)     200/500                   300/600
S3, w/ errors (500 OTUs)    300/600                   400/700

S1 and S2 are biological replicates. S3 is different from S1 and S2.

Current technical methods fail to provide the necessary taxonomic details to follow up associative studies
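The counts in the false-OTU comparison above can be reproduced with plain set arithmetic. This is only an illustrative sketch: the OTU names are made up, and it assumes (inferred from the slide's numbers, not stated on it) that S3 shares 300 of its 400 true OTUs with S1/S2 and that each sample picks up 100 false OTUs unique to that sample.

```python
from typing import Set, Tuple


def compare(a: Set[str], b: Set[str]) -> Tuple[int, int]:
    """Return (OTUs present in only one sample, OTUs present in either sample)."""
    return len(a ^ b), len(a | b)


def false_otus(sample: str, n: int = 100) -> Set[str]:
    """Hypothetical sample-specific false OTUs created by sequencing errors."""
    return {f"false_{sample}_{i}" for i in range(n)}


# Assumed composition of the true communities (illustrative names only).
shared = {f"shared_{i}" for i in range(300)}      # present in S1, S2 and S3
replicate = {f"rep_{i}" for i in range(100)}      # present in S1 and S2 only
s3_unique = {f"s3_{i}" for i in range(100)}       # present in S3 only

s1 = shared | replicate          # 400 true OTUs
s2 = set(s1)                     # biological replicate of S1
s3 = shared | s3_unique          # 400 true OTUs, 300 shared with S1/S2

s1_err = s1 | false_otus("S1")   # 500 OTUs once errors are included
s2_err = s2 | false_otus("S2")
s3_err = s3 | false_otus("S3")

for label, a, b in [
    ("S1 vs S2, no errors", s1, s2),
    ("S1 vs S2, both with errors", s1_err, s2_err),
    ("S1 vs S3, no errors", s1, s3),
    ("S1 vs S3, both with errors", s1_err, s3_err),
]:
    differing, total = compare(a, b)
    print(f"{label}: {differing}/{total}")
```

Under these assumptions the replicates go from 0/400 to 200/600 differing OTUs, and the real S1 versus S3 difference grows from 200/500 to 400/700, matching the slide.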

How Circular Consensus Sequencing Works

Multiple pass

CCS Analysis Algorithm

1) Bin amplimers via barcodes
2) Filter out all amplimers that are not sequenced for ≥ 5 passes (circular sequencing), and do not have ≥ 90% PacBio quality scores
3) Filter out any reads aligning partially to the human genome
4) Remove all reads of > 1500 bp (< 1% of remaining reads) as likely chimeras
5) Remove all reads not having both forward and reverse primers, and trim primers
6) Correct for strandedness so that all reads are on the same strand
7) Remove any remaining reads which have > 10 expected errors calculated from PHRED scores (< 5% of remaining reads)
8) Use only reads with < 1 expected error to form OTU clusters at the level of 97% sequence identity (so as not to inflate the number of OTUs)
9) Use the OTUs from (8) to map back all of the reads that passed filters 2-7 to provide the final counts for each taxon.

Note: with this approach essentially all of the surviving reads map back to one of the high-fidelity OTUs.

Goals of Microbiome Data Processing

• Minimize numbers of operational taxonomic units (OTUs) = species so as not to artificially inflate the diversity
• This is critical for comparisons among specimens – for if each specimen has its own set of false OTUs it produces a greatly exaggerated picture of differences

Expected Error vs. Average Error
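The per-read filters in steps 2 to 5, 7 and 8 of the CCS analysis algorithm above all reduce to simple thresholds, the last two on the expected-error score. The sketch below is a hedged illustration, not the pipeline's actual code: the `CcsRead` record and its field names are hypothetical, and the length check is applied to the stored quality string for simplicity.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CcsRead:
    """Hypothetical per-read summary; field names are illustrative only."""
    passes: int                 # number of circular passes
    predicted_accuracy: float   # PacBio quality, 0..1
    aligns_to_human: bool       # any (partial) alignment to the human genome
    has_both_primers: bool      # forward and reverse primer both found
    phred: List[int]            # per-base Phred scores


def expected_errors(phred: List[int]) -> float:
    """Sum of per-base error probabilities, P = 10^(-Q/10)."""
    return sum(10 ** (-q / 10) for q in phred)


def passes_read_filters(read: CcsRead) -> bool:
    """Steps 2-5 and 7: reads that survive are later mapped back to OTUs."""
    return (
        read.passes >= 5
        and read.predicted_accuracy >= 0.90
        and not read.aligns_to_human
        and len(read.phred) <= 1500
        and read.has_both_primers
        and expected_errors(read.phred) <= 10
    )


def is_high_fidelity(read: CcsRead) -> bool:
    """Step 8: only reads with < 1 expected error are used to build OTUs."""
    return passes_read_filters(read) and expected_errors(read.phred) < 1


reads = [
    CcsRead(8, 0.999, False, True, [40] * 1450),  # clean full-length read
    CcsRead(8, 0.990, False, True, [30] * 1450),  # passes filters, too noisy for OTU building
    CcsRead(3, 0.999, False, True, [40] * 1450),  # too few passes
]
otu_seeds = [r for r in reads if is_high_fidelity(r)]
mappable = [r for r in reads if passes_read_filters(r)]
print(len(otu_seeds), "reads for OTU building;", len(mappable), "reads to map back")
```

Keeping the two predicates separate mirrors steps 8 and 9: a strict set of reads defines the OTUs, and the larger filtered set supplies the counts.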

The Uclust pipeline recommends using the “Expected Error” Score rather than the average error:

Q scores in read        Avg. Q    Expected number of errors
140 x Q35 + 10 x Q2     33        6.4 !
150 x Q25               25        0.5
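The two rows of the table above can be checked directly from the Phred definition, P = 10^(-Q/10). This is a minimal sketch with illustrative function names, not the filtering tool's own code.

```python
from typing import List


def expected_errors(phred: List[int]) -> float:
    """Expected number of errors: sum of per-base error probabilities."""
    return sum(10 ** (-q / 10) for q in phred)


def average_q(phred: List[int]) -> float:
    """Arithmetic mean of the Phred scores in the read."""
    return sum(phred) / len(phred)


read_a = [35] * 140 + [2] * 10   # mostly excellent bases, ten very poor ones
read_b = [25] * 150              # uniformly mediocre bases

for name, read in [("140 x Q35 + 10 x Q2", read_a), ("150 x Q25", read_b)]:
    print(f"{name}: avg Q = {average_q(read):.1f}, "
          f"expected errors = {expected_errors(read):.2f}")
```

After rounding, this reproduces the slide's values: the read with the higher average Q (about 33) carries roughly 6.4 expected errors, while the Q25 read carries about 0.5.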

We can see that average quality is actually a poor representation of the probable number of errors in a sequence.

Filtering

• Phred Quality
  – Highest quality score in PacBio sequences is 42, where quality Q = -10*log10(P) and P is the probability that the base is wrong. Our longest sequence was trimmed to 1500 bp, therefore:
  – The *best* expected error we can get is 10^(-4.2) * 1500 ≈ 0.0946 (or about 0.1)

Filtering applied to controls

• Perfect (i.e. single OTU) clustering of positive controls happened only at an Expected Error (EE) level of one or less.
• This suggests that we should filter to EE ≤ 1 for OTU clustering and classification, or expect that we will have spurious OTUs based on error profile, rather than true biological sequences

BEI Mock Community (22 species)

BEI Mock Community

• Network of all BEI strains' 16S alignment

Some results and graphs: BEI dataset

Multiple expected error filtering and how it affects the number (and ‘correctness’) of the OTUs

[Figure: "Number of OTUs in BEI". x-axis: expected error threshold (0, 0.01, 0.25, 0.5, 0.75, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10); y-axis: number of OTUs (0 to 500). Data labels legible on the slide, not matched to thresholds: 472, 210, 80, 65, 19, 19, 19, 19, 20, 22, 25, 36, 50, 0.]

BEI Mock Community

• Initial Results
  – Able to resolve 19 out of 21
  – No False Positives

100% species-level identity from mock community analysis

Species                                        % identity
Streptococcus mutans UA159                     100.00%
Streptococcus agalactiae 2603V/R               100.00%
Lactobacillus gasseri ATCC 33323               100.00%
Enterococcus faecalis OG1RF                    100.00%
Actinomyces odontolyticus                      100.00%
Streptococcus pneumoniae TIGR4                 99.90%
Acinetobacter baumannii ATCC 17978             99.90%
Neisseria meningitidis MC58                    99.80%
Listeria monocytogenes strain EGD              99.80%
Bacillus cereus ATCC 10987                     99.80%
Staphylococcus epidermidis ATCC 12228          99.70%
Rhodobacter sphaeroides 2.4.1 chromosome 1     99.70%
Pseudomonas aeruginosa PAO1                    99.70%
Escherichia coli str. K-12 substr. MG1655      99.70%
Clostridium beijerinckii NCIMB 8052            99.70%
Propionibacterium acnes KPA171202              99.60%
Helicobacter pylori 26695                      99.40%
Deinococcus radiodurans                        99.20%
Bacteroides vulgatus ATCC 8482                 98.90%

No species level confidence

• Database had taxonomic classification up to species level
• Command to assign only gave confidences up to the genus level
• We wanted the confidence values for the species levels as well
• Spoof the database
  – Shift taxonomy so Domain is now Phylum etc.
  – Command to assign taxonomy so program thinks of the species level as genus
  – Confidence of ‘g’ is actually species confidence

Species-level confidence

With species-level confidence:

OTUId               genus               genus_conf   species                       species_conf
OTU_10;size=2442;   Listeria            0.987        Listeria_monocytogenes        0.767
OTU_11;size=4118;   Bacteroidaceae      0.9931       Bacteroides_vulgatus          0.9301
OTU_12;size=3798;   Staphylococcus      0.9843       Staphylococcus_epidermidis    0.767
OTU_13;size=597;    Rhodobacter         0.9684       Rhodobacter_sphaeroides       0.6177
OTU_14;size=2281;   Shigella            0.7125       Shigella_flexneri             0.4684
OTU_15;size=417;    Pseudomonas         0.9684       Pseudomonas_aeruginosa        0.9128
OTU_16;size=294;    Deinococcus         0.9936       Deinococcus_radiodurans       0.9301
OTU_17;size=238;    Streptococcus       0.9887       Streptococcus_pneumoniae      0.4684
OTU_18;size=2227;   Clostridium         0.9824       Clostridium_beijerinckii      0.319
OTU_19;size=101;    Actinomyces         0.9878       Actinomyces_odontolyticus     0.9214
OTU_1;size=5193;    Helicobacteraceae   0.9934       Helicobacter_pylori           0.9128
OTU_2;size=2845;    Streptococcus       0.9878       Streptococcus_agalactiae      0.9387
OTU_3;size=2582;    Lactobacillus       0.9922       Lactobacillus_gasseri         0.767
OTU_4;size=2658;    Neisseria           0.9684       Neisseria_meningitidis        0.9128
OTU_5;size=1755;    Propionibacterium   0.9913       Propionibacterium_acnes       0.956
OTU_6;size=1983;    Streptococcus       0.9913       Streptococcus_mutans          0.9474
OTU_7;size=1812;    Acinetobacter       0.9824       Acinetobacter_baumannii       0.9214
OTU_8;size=1901;    Bacillus            0.9852       Bacillus_anthracis            0.4684
OTU_9;size=886;     Enterococcus        0.9835       Enterococcus_faecalis         0.9214

Without species-level confidence (second table overlaid on the same slide):

OTUId               genus               genus_conf   species
OTU_10;size=2442;   Listeria            0.978        Listeria_monocytogenes
OTU_11;size=4118;   Bacteroidaceae      0.9883       Bacteroides_vulgatus
OTU_12;size=3798;   Staphylococcus      0.9729       Staphylococcus_epidermidis
OTU_13;size=597;    Rhodobacter         0.9678       Rhodobacter_sphaeroides
OTU_14;size=2281;   Shigella            0.6849       Shigella_flexneri
OTU_15;size=417;    Pseudomonas         0.9678       Pseudomonas_aeruginosa
OTU_16;size=294;    Deinococcus         0.9891       Deinococcus_radiodurans
OTU_17;size=238;    Streptococcus       0.9814       Streptococcus_pneumoniae
OTU_18;size=2227;   Clostridium         0.9746       Clostridium_beijerinckii
OTU_19;size=101;    Actinomyces         0.9797       Actinomyces_odontolyticus
OTU_1;size=5193;    Helicobacteraceae   0.9887       Helicobacter_pylori
OTU_2;size=2845;    Streptococcus       0.9797       Streptococcus_agalactiae
OTU_3;size=2582;    Lactobacillus       0.9879       Lactobacillus_gasseri
OTU_4;size=2658;    Neisseria           0.9678       Neisseria_meningitidis
OTU_5;size=1755;    Propionibacterium   0.9865       Propionibacterium_acnes
OTU_6;size=1983;    Streptococcus       0.9865       Streptococcus_mutans
OTU_7;size=1812;    Acinetobacter       0.9695       Acinetobacter_baumannii
OTU_8;size=1901;    Bacillus            0.9746       Bacillus_anthracis
OTU_9;size=886;     Enterococcus        0.9712       Enterococcus_faecalis

CAMI Mock Community Sequencing OTU

(Or, How Did We Do?)

• 250 OTUs found (232 mapped to unique species) out of 243 unique NCBI OTUs
  – 6.9% (16/232) labelled with incorrect species name (after correcting with reference OTU*)
  – 93.1% (216/232) correct identification of OTU to species level
  – 88.9% (216/243) correct identification and classification to species level out of entire expected OTU number in community
  – 2.2% (5/232) instances of the incorrect genus (only 2 of which had more than a single read mapping to it)
  – 93.4% (227/243) correct identification and classification to genus level of entire community
  – 66 correct unique species identified from OTU pipeline not present in predicted reference sequences
  – 27 missing/incorrect* 16S genes predicted from species, genome assemblies; 16 were missing/incorrect assigned taxonomy from the genera NCBI.

* Reference OTU: instances where a predicted 16S gene’s taxonomy matched the taxonomic assignment of an OTU from our sequencing were considered a ‘true assignment’ (N=10)

JGI Mock Microbiome

Multiple expected error filtering and how it affects the number (and ‘correctness’) of the OTUs

[Figure: "JGI data with different EE". x-axis: expected error threshold (1 to 10); y-axis: number of OTUs (0 to 350). Data labels legible on the slide, in ascending order and not matched to thresholds: 240, 247, 253, 255, 255, 262, 271, 284, 290, 301.]

Quantitative Species-level Accuracy of the MCSMRT Pipeline

CAMI community composition: observed versus expected relative abundances, based on matching centroid OTU assignments with expected species composition.

CAMI Multi-species genera: Clostridium

Predicted from PacBio Sequencing    True Species
Clostridium_acidisoli               Clostridium acidisoli DSM 12555
(none)                              Clostridium bartlettii DSM 16795
Clostridium_caminithermale          Clostridium caminithermale DSM 15212
Clostridium_cavendishii             Clostridium cavendishii DSM 21758
Clostridium_collagenovorans         Clostridium collagenovorans DSM 3089
Clostridium_grantii                 Clostridium grantii DSM 8605
Clostridium_intestinale             Clostridium intestinale DSM 6191
Clostridium_jejuense                Clostridium jejuense DSM 15929
Clostridium_lactatifermentans       Clostridium lactatifermentans DSM 14214
Clostridium_litorale                Clostridium litorale DSM 5388
Clostridium_magnum                  Clostridium magnum DSM 2767
Clostridium_propionicum             Clostridium propionicum DSM 1682
Clostridium_proteolyticum           Clostridium proteolyticum DSM 3090
Clostridium_tetani                  Clostridium tetani ATCC 19406
Clostridium_sticklandii             (none)

Not too bad. We classified all but one to the species level.

In fact… You may (or may not) recognize this species from a previous slide. 1 species not in NCBI: Clostridium bartlettii (aka Intestinibacter bartlettii).

The NCBI “knows” this species as Intestinibacter bartlettii. So, really we are looking for a hit to Intestinibacter bartlettii. Did we find that? Yes we did.

Therefore, we correctly identified every single ‘Clostridium’ spp to the species level, but had one ‘false positive’ (the sticklandii)

Conclusions on Technology Development

• We have developed a filtering algorithm that produces high fidelity OTUs from full length 16S rRNA gene sequences
• Essentially all reads can be mapped back to these high fidelity OTUs to get taxon-specific abundances
• This technology provides semi-quantitative species-level microbiome data

Greathouse et al. Genome Biology (2018) 19:123 https://doi.org/10.1186/s13059-018-1501-6

Interaction between the microbiome and TP53 in human lung cancer

K. Leigh Greathouse, James R. White, Ashely J. Vargas, Valery V. Bliskovsky, Jessica A. Beck, Natalia von Muhlinen, Eric C. Polley, Elise D. Bowman, Mohammed A. Khan, Ana I. Robles, Tomer Cooks, Bríd M. Ryan, Noah Padgett, Amiran H. Dzutsev, Giorgio Trinchieri, Marbin A. Pineda, Sven Bilke, Paul S. Meltzer, Alexis N. Hokenstad, Tricia M. Stickrod, Marina R. Walther-Antonio, Joshua P. Earl, Joshua C. Mell, Jaroslaw E. Krol, Sergey V. Balashov, Archana S. Bhat, Garth D. Ehrlich, Alex Valm, Clayton Deming, Sean Conlan, Julia Oh, Julie A. Segre and Curtis C. Harris*

Methods of Analysis

• 16S rRNA microbiome analyses
  – Illumina and PacBio
• RNAseq for bacterial taxonomy
  – Remove human reads
  – Assign microbial taxonomy using MetaPhlAn, Kraken, and PathoScope
• Fluorescent in situ hybridization (FISH)
  – Confirm 16S and RNAseq findings

Specimens

• NCI-MD case-control study
  – 143 lung cancer cases
  – 144 non-tumor adjacent tissues
• Control lung
  – Immediate autopsy
  – Hospital biopsy
• The Cancer Genome Atlas (TCGA)
  – n = 1112 tumor (T) and non-tumor (NT) adjacent RNAseq data

Identification of Microbial Changes Associated with Lung Cancer

• Ecological diversity examination (after removal of genera identified through hierarchical clustering as being likely contaminants: Halomonas, Herbaspirillum, Shewanella, Propionibacterium, and Variovorax)
  – within samples (alpha diversity)
  – between samples (beta diversity)

• Findings (at the phylum level): higher Proteobacteria (Kruskal–Wallis p = 0.0002) and lower Firmicutes (Kruskal–Wallis p = 0.04) in lung tissue hospital biopsies, as well as in tumor and associated non-tumor tissues from the NCI-MD study, compared with non-cancer population control lung tissues

• We also observed a similar increase in Proteobacteria (Mann–Whitney p = 0.02) between non-tumor lung tissue and lung cancer in the TCGA study, indicating that this is a recurrent phenomenon in lung cancer

Differences among microbial communities using beta diversity

• Since we were comparing between studies and between types of sequencing (16S rRNA and RNA-seq), we used a method that could be applied to both and that excludes phylogeny (i.e., Bray–Curtis).

• Within the NCI-MD study, we observed significant differences in beta diversity among all tissue types (PERMANOVA F = 2.90, p = 0.001), tumor and non-tumor (PERMANOVA F = 2.94, p = 0.001), and adenocarcinoma (AD) versus squamous cell carcinoma (SCC) (PERMANOVA F = 2.27, p = 0.005), with tumor vs. non-tumor having the largest among-group distance denoted by the higher F value (Fig. 1c–e).

• Similarly, in the TCGA we observed significant differences in beta diversity between tumor and non-tumor (PERMANOVA F = 3.63, p = 0.001) and AD vs. SCC (PERMANOVA F = 27.19, p = 0.001)

• Together, these data illustrate a trend of increasing diversity and richness associated with lung cancer.

Acidovorax are enriched in SCC and are more abundant in smokers

• SCC arises from the bronchial lining and AD arises from the peripheral airways
• In the NCI-MD study, we identified 32 genera that were differentially abundant in SCC (n = 47) versus AD (n = 67) tumors (Student’s t-test; MW p < 0.05):
  – nine of which were significant after multiple testing correction (FDR) (Acidovorax, Brevundimonas, Comamonas, Tepidimonas, Rhodoferax, Klebsiella, Leptothrix, Polaromonas, Anaerococcus)
  – Confirmed with the TCGA data, p < 0.05 after FDR

Logistic Regression Analysis

• Logistic regression analyses of 16S data confirmed 6/9 genera as SCC > AD
  – Acidovorax, Klebsiella, Tepidimonas, Rhodoferax, and Anaerococcus
• Logistic regression analyses of TCGA data confirmed 4/9 genera as SCC > AD
  – Acidovorax, Klebsiella, Rhodoferax, Anaerococcus

Smoking-related bacterial genera

• We identified six genera that were able to distinguish ever smokers (former and current) versus non-smokers in our NCI-MD study (Acidovorax, Ruminococcus, Oscillospira, Duganella, Ensifer, Rhizobium)

• Specifically, Acidovorax was more abundant in former and current smokers as compared with never smokers (Kruskal–Wallis p value < 0.05)

• A similar trend was observed in the TCGA dataset (n never = 120, n former = 551, n current = 217) (Kruskal–Wallis p = 0.27; ANOVA p = 0.02).

Relative Abundance of Acidovorax stratified by smoking status and tumor histological subtype

Acidovorax temperans

• We demonstrated the presence of this bacterium in lung tumors using:
  – FISH
  – Species-specific PacBio 16S sequencing

TP53 mutations and lung microbiome composition

• We analyzed all tumors in the NCI-MD study, regardless of histology, and identified a group of taxa that were more abundant in tumors with TP53 mutations
• To have greater power, we performed the same analysis in the TCGA dataset and observed a significant increase in these same taxa (MW FDR corrected p < 0.05)
• When analyzing only SCC tumors (n = 46), this signature became stronger in tumors with TP53 mutations in both datasets, specifically among the SCC-associated taxa previously identified
• In the NCI-MD study, we found that 5/9 of the genera (Acidovorax, Klebsiella, Rhodoferax, Comamonas, and Polaromonas) that differentiated SCC from AD were also more abundant in the tumors harboring TP53 mutations.
• In the TCGA dataset, the fold change in all five SCC-associated genera was significantly higher in SCC tumors (n = 177) with TP53 mutations (MW corrected FDR p < 0.01)