DIAGNOSTIC METAGENOMICS
The Quest for the One True Test

Tanya Golubchik

Big Data Institute, Oxford University
Wellcome Centre for Human Genetics, Oxford University

What is diagnostic metagenomics?

Pathogen identification

Pathogen identification: needle in a haystack

1. Culture
– Routine practice for many bacteria, but what about viruses?
– 40-50% of childhood meningitis in UK culture-negative (aseptic)

2. Organism-specific PCR
– Extremely high specificity, not always desirable for diverse pathogens
– Several PCRs required, one positive may mean no further testing is done
– Order of testing can introduce bias
– Multiplex PCR can save costs but hard to develop

3. Serological assays
– Assays developed for flu, HIV, other viruses

4. Other molecular diagnostics
– Immunoassays (ELISA), mass spec, etc. good for some specific applications

5. Metagenomics
– “Let’s just sequence everything!”

There are two types of metagenomics:

(1) Hay classification (2) Needle detection

Image credits: https://hackingmaterials.com/2013/11/11/why-hack-materials; U.S. Marine Corps photo by Lance Cpl. James Purschwitz

There are two types of metagenomics:

(1) Standard
From clinical sample (plasma, CSF, …):
• Extract total DNA & RNA
• Prepare sequencing library
• Sequence reads
• Classify by organism/strain

(2) Targeted
From clinical sample [ or culture ]:
• Extract total DNA & RNA
• [ Deplete host material ]
• Prepare sequencing library
• [ Enrich for targets of interest ]
• Sequence reads
• Classify by organism/strain

(1) Hay classification

• Great idea…
– … when you are interested in the “hay”, i.e. predominant components of the sample
• The more you look, the more you find…
– … but most of what you find is “hay”
• More likely to detect larger genomes
– HIV genome: 0.01 Mb
– S. aureus genome: 3 Mb
– Human genome: 6500 Mb
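The genome-size effect can be made concrete with a small calculation. The sketch below uses the genome sizes quoted above and assumes, purely for illustration, that reads are sampled uniformly per base and that each organism is present at one genome copy; real samples differ in copy number, nucleic acid type and extraction efficiency, and the function name is hypothetical.

```python
# Genome sizes in megabases, as quoted on the slide.
GENOME_SIZE_MB = {"HIV": 0.01, "S. aureus": 3.0, "Human": 6500.0}


def expected_read_fractions(copies, sizes_mb=GENOME_SIZE_MB):
    """Expected share of reads per organism.

    ASSUMPTION: reads are drawn uniformly per base, so an organism's share
    is proportional to genome size multiplied by genome copy number.
    """
    mass = {org: sizes_mb[org] * n for org, n in copies.items()}
    total = sum(mass.values())
    return {org: m / total for org, m in mass.items()}


# ASSUMPTION: one genome copy of each organism (illustrative only).
fractions = expected_read_fractions({"HIV": 1, "S. aureus": 1, "Human": 1})
for org, frac in fractions.items():
    print(f"{org}: {frac:.2e} of reads")
```

Under these assumptions the bacterial genome attracts 300 times more reads than the viral genome, and both are dwarfed by the host.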

• YES: Microbial community composition & abundance, biomarkers
• NO: Pathogen detection

(2) Needle detection

• Great idea…
– … when you know (roughly) what you’re looking for, and relative abundance is not important
• Adds extra steps to an already complex protocol…
– … but greater sensitivity and specificity, no longer looking for 1-2 reads in millions

From clinical sample [ or culture ]:
• Extract total DNA & RNA
• [ Deplete host material ]
• Prepare sequencing library
• [ Enrich for targets of interest ]
• Sequence reads
• Classify by organism/strain

• YES: Targeted diagnostics, amplification of low-abundance pathogens (includes most viruses)
• NO: Abundance quantification, finding novel pathogens

Make a sequencing library with enrichment

Clinical sample (e.g. plasma) or culture

1. Extraction
2. DNA/RNA fragmentation
3. RNA: first and second strand cDNA synthesis by random priming
4. Adapter ligation (indexing)
5. PCR amplification (primers specific to adapter sequences)
6. Multiplexing
7. Illumina sequencing

Target-specific probe-based enrichment

At >80% sequence similarity capture is unbiased

• Probes targeting a defined panel of pathogens of interest
– Probe panel can include 100s of sequences
– Same or different organisms
• Viruses
– Complete genome if small
– Multiple subtypes if diverse
• Bacteria
– Select targets from rMLST
– Prevents bias towards larger genome size

Bonsall D, Ansari MA, Ip C et al. ve-SEQ: Robust, unbiased enrichment for streamlined detection and whole-genome sequencing of HCV and other highly diverse pathogens. F1000Research 2015, 4:1062

How well does enrichment work?

• Currently used for all viral sequencing at WHG Oxford Genomics Centre
– Thousands of viral samples where organism is known
– Hepatitis C – StopHCV consortium (Ellie Barnes)
– HIV sequencing – PopART HPTN-071 clinical trial (Christophe Fraser)
• Over 90% success rate
– Success can depend on viral load for low-VL pathogens
• Viral load ~ unique reads
– Number of uniquely mapping reads is proportional to viral load when sequencing specific viruses
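If unique reads are proportional to viral load, a calibration set with known viral loads gives a conversion factor. The sketch below fits a no-intercept line by least squares. The calibration numbers are hypothetical placeholders rather than data from the studies above, and in practice the fit might be done per virus and on a log scale.

```python
def fit_proportionality(unique_reads, viral_loads):
    """Least-squares slope k for the no-intercept model: viral_load = k * unique_reads."""
    num = sum(r * v for r, v in zip(unique_reads, viral_loads))
    den = sum(r * r for r in unique_reads)
    if den == 0:
        raise ValueError("need at least one sample with non-zero unique reads")
    return num / den


# HYPOTHETICAL calibration samples (uniquely mapping reads, measured viral load).
calib_reads = [100, 1_000, 10_000]
calib_vl = [2_000, 20_000, 200_000]

k = fit_proportionality(calib_reads, calib_vl)

# Estimate the viral load of a new sample from its unique read count.
estimate = k * 5_000
print(f"conversion factor: {k:.1f}; estimated viral load: {estimate:.0f}")
```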

Sequencing with enrichment can be used to estimate viral load for both HIV and HCV.

Identifying positives

• Targeted capture gives >50-fold enrichment for pathogens of interest
– Reduces background, boosts signal
– Still detect other organisms, but can’t quantify their relative abundance
– Can enrich multiple samples in a single pool, saves cost
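Fold enrichment can be computed as the ratio of the on-target read fraction after capture to the fraction before capture. This is one common definition and is assumed here; the slide does not say how the >50-fold figure was derived, and the read counts below are hypothetical.

```python
def fold_enrichment(target_before, total_before, target_after, total_after):
    """Ratio of on-target read fractions: after capture / before capture."""
    if target_before == 0 or total_before == 0 or total_after == 0:
        raise ValueError("fold enrichment is undefined with zero counts")
    frac_before = target_before / total_before
    frac_after = target_after / total_after
    return frac_after / frac_before


# HYPOTHETICAL counts: 20 pathogen reads per million before capture,
# 1,500 pathogen reads per million after capture.
fe = fold_enrichment(20, 1_000_000, 1_500, 1_000_000)
print(f"fold enrichment: {fe:.0f}x")
```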

[Figure: true negatives vs true positives; panels: Negative, Positive (Enterovirus)]

Another method: nuclease pre-treatment

• OSCAR project: Oxford Screening for CSF and Respiratory Viruses
– Sequencing done at the Roslin Institute
– Protocol includes DNase & RNase digestion before extraction
– Immense depth (40-60 million reads per sample): high cost & processing time
– Most samples have <1 pathogen read per million (significance unclear)
• Example CSF sample:
– 95% human
– 5% unknown
– 0.002% bacterial
– 0.002% viral
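The link between sequencing depth and the chance of seeing any pathogen reads can be sketched with a Poisson model. This assumes every read is an independent draw with a fixed pathogen fraction, which ignores contamination, index hopping and PCR duplicates. The function names are hypothetical.

```python
import math


def expected_reads(pathogen_fraction, total_reads):
    """Expected number of pathogen reads at a given sequencing depth."""
    return pathogen_fraction * total_reads


def prob_at_least(k, pathogen_fraction, total_reads):
    """P(at least k pathogen reads), ASSUMING read counts are Poisson distributed."""
    lam = expected_reads(pathogen_fraction, total_reads)
    p_fewer = sum(math.exp(-lam) * lam**i / math.factorial(i) for i in range(k))
    return 1.0 - p_fewer


# 0.002% viral reads (the example CSF sample) at a depth of 50 million reads.
viral_reads = expected_reads(0.002 / 100, 50_000_000)

# One pathogen read per million, sequenced to only 1 million reads:
# chance of seeing at least one pathogen read.
p_detect = prob_at_least(1, 1e-6, 1_000_000)
print(f"expected viral reads: {viral_reads:.0f}; P(>=1 read): {p_detect:.2f}")
```

At one read per million, detection with modest depth is close to a coin toss, which is why unenriched protocols need tens of millions of reads per sample.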

Taxonomic classification of reads from CSF sample VS067.

ChiMES project: diagnostic metagenomics

Aim: Identify causes of aseptic meningitis in UK children

• CSF samples from 1000 UK childhood meningitis cases
• Validation on 4 most common pathogens
– EV (enterovirus), HPeV (human parechovirus), Sp (S. pneumoniae), Nm (N. meningitidis) account for 75% of laboratory diagnosed cases
– Negative controls from children without meningitis

[Figure labels: Enterovirus, HPeV, N. meningitidis, S. pneumoniae, GBS, E. coli, H. influenzae, HSV, EBV, VZV, HHV6, Other, Unknown (?)]
Pathogen identification in UK childhood meningitis cases – combined laboratory testing results.

ChiMES project: probe panel development

• Identifying likely suspects sooner rather than later
– Clinically relevant pathogens
– Probes designed to cover pathogen diversity (10% cutoff, similar seqs clustered)
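Clustering similar sequences before probe design can be sketched as a greedy pass over the candidates. The slide does not state the tool or distance measure used, so this example makes two loud assumptions: the 10% cutoff is read as 10% nucleotide divergence, and sequences are already aligned to equal length. The toy sequences and function names are hypothetical.

```python
def divergence(a, b):
    """Fraction of mismatching positions between two aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def greedy_cluster(seqs, cutoff=0.10):
    """Greedy clustering: each sequence joins the first representative within
    `cutoff` divergence; otherwise it becomes a new representative."""
    clusters = {}  # representative name -> list of member names
    for name, seq in seqs.items():
        for rep in clusters:
            if divergence(seq, seqs[rep]) <= cutoff:
                clusters[rep].append(name)
                break
        else:
            clusters[name] = [name]
    return clusters


# HYPOTHETICAL toy sequences (20 nt each).
toy = {
    "seqA": "ACGTACGTACGTACGTACGT",
    "seqB": "ACGTACGTACGTACGTACGA",  # 1 mismatch vs seqA (5% divergence)
    "seqC": "TTTTACGTACGGGGGTACGT",  # 6 mismatches vs seqA (30% divergence)
}
clusters = greedy_cluster(toy)
print(clusters)
```

Probes would then be designed against one representative per cluster, so that diverse pathogens get several probe sets while near-identical sequences share one.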

Bacteria
1. Streptococcus pneumoniae
2. Streptococcus pyogenes
3. Streptococcus agalactiae
4. Staphylococcus aureus
5. Mycoplasma pneumoniae
6. Legionella pneumophila
7. Coxiella burnetii
8. Escherichia coli
9. Klebsiella pneumoniae
10. Klebsiella oxytoca
11. Enterobacter cloacae
12. Enterobacter aerogenes
13. Serratia marcescens
14. Haemophilus influenzae
15. Haemophilus parainfluenzae
16. Chlamydophila pneumoniae
17. Chlamydia psittaci
18. Pseudomonas aeruginosa
19. Moraxella catarrhalis
20. Acinetobacter baumannii
21. Acinetobacter calcoaceticus
22. Mycobacterium tuberculosis
23. Stenotrophomonas maltophilia
24. Bordetella pertussis
25. Neisseria meningitidis
26. Listeria monocytogenes
27. Borrelia burgdorferi
28. Treponema pallidum
29. Leptospira (multiple spp)
30. Bartonella henselae
31. Brucella (multiple spp)

Viruses
1. Mastadenovirus A/B/C/D/E/F/G complete
2. Lassa complete
3. Lymphocytic choriomeningitis mammarenavirus complete
4. California Encephalitis Virus complete
5. Virus complete
6. Sandfly Fever Naples Virus complete
7. Sandfly Fever Sicilian Virus complete
8. Human Coronavirus HCoV-229E complete
9. Human Coronavirus HCoV-NL63 complete
10. Human Coronavirus HCoV-HKU1 Genotypes A, B, C complete
11. MERS-Coronavirus complete
12. Human Coronavirus HCoV-OC43 Genotypes A-E complete
13. SARS-Coronavirus complete
14. Virus Genotype 1/2/3/4 complete
15. Virus - All Genotypes complete
16. Murray Valley Encephalitis Virus - All Genotypes complete
17. St. Louis Encephalitis Virus - All Genotypes complete
18. - All Genotypes complete
19. Tick-borne Encephalitis Virus - All Genotypes complete
20. Virus - All Genotypes complete
21. HHV1 / Herpes Simplex Virus Type 1 (HSV-1) partial
22. HHV2 / Herpes Simplex Virus Type 2 (HSV-2) partial
23. HHV3 / Varicella-Zoster Virus (VZV) partial
24. HHV4 / Epstein-Barr Virus (EBV) partial
25. HHV5 / Human Cytomegalovirus (HCMV) partial
26. HHV6A/B / Human Herpesvirus 6A/B partial
27. HHV7 / Human Herpesvirus 7 partial
28. HHV8 / Kaposi's Sarcoma Herpesvirus (KSHV) partial
29. Influenza A/B/C virus (multiple genotypes)
30. Human Parainfluenza virus 1/2/3/4a/4b/5 complete
31. Mumps virus (multiple genotypes) complete
32. Measles virus (multiple genotypes) complete
33. sosuga virus complete
34. complete
35. – B/M complete
36. Respiratory syncytial virus – A/B complete
37. Human metapneumovirus complete
38. erythroparvovirus 1 complete
39. Primate tetraparvovirus 1 complete
40. Human bocavirus 1 complete
41. Human Parechovirus 1-8 complete
42. Parechovirus B complete
43. Enterovirus A/B/D complete
44. Rhinovirus A/B/C complete
45. Rhinovirus B complete
46. Rhinovirus C complete
47. Cardiovirus A complete
48. Cardiovirus B complete
49. Rubella virus complete
50. Hepatitis A complete
51. Rosavirus 2 complete
52. Salivirus A complete
53. Salivirus FHB complete
54. JC polyomavirus complete
55. BK polyomavirus complete
56. Rotavirus A/B/C complete
57. Rhabdovirus 5 - European Bat 1 (EBLV1) complete
58. Rhabdovirus 6 - European Bat Lyssavirus 2 (EBLV2) complete
59. Rhabdovirus 7 - Australian Bat Lyssavirus(es) complete
60. Rhabdovirus 1 - complete
61. Rhabdovirus 4 - Duvenhage complete
62. Rhabdovirus 2 - Lagos Bat virus complete
63. Rhabdovirus 3 - Mokola virus complete
64. Equine Encephalitis Virus (multiple) complete

Raw reads

• Target quantification: KALLISTO (ref. 1)
– Reference: probe sequences (viral genomes + bacterial rMLST) + human transcriptome
– Human transcripts discarded
– Quantification of enriched taxa (tpm)
• Taxonomic classification: KRAKEN (ref. 2)
– Reference: microbial RefSeq + human genome
– Human reads discarded
– Quantification of all microbial reads (%)
• Model fitting & prediction
• Pathogen identification
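The tpm unit used in the quantification branch can be illustrated with the standard transcripts-per-million formula. Kallisto reports tpm directly, so this is only a sketch of what the unit means, not part of the pipeline. The target names, counts and lengths are hypothetical.

```python
def tpm(counts, eff_lengths):
    """Transcripts per million: length-normalised read rates scaled to sum to 1e6."""
    rates = {t: counts[t] / eff_lengths[t] for t in counts}
    total = sum(rates.values())
    if total == 0:
        return {t: 0.0 for t in counts}
    return {t: r / total * 1e6 for t, r in rates.items()}


# HYPOTHETICAL targets: estimated read counts and effective lengths (bases).
counts = {"EV_target": 500, "human_transcript": 1500}
eff_lengths = {"EV_target": 1000, "human_transcript": 3000}

abundance = tpm(counts, eff_lengths)
print(abundance)
```

Because the rates are divided by target length, the longer target does not look more abundant simply because it collects more reads.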

1. Bray et al. (2016) Near-optimal probabilistic RNA-seq quantification. Nature Biotechnology 34, 525–527.
2. Wood & Salzberg (2014) Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biology 15(3):R46.

Total read numbers after capture

• Metagenomic sequencing of EV, HPeV, Sp, Nm + negative controls
– Number of reads similar regardless of presence of pathogen (mean 3.7 million)
– Number of microbial reads similar regardless of pathogen (mean 2 million)
– Proportion of microbial reads also similar (mean 0.2)
• Microbial reads in negative samples are mostly contaminants
– Alteromonas, Achromobacter, PhiX…
– Some bystander organisms from skin flora

Detecting specific pathogens works well

[Figure panels: Enterovirus, Parechovirus, N. meningitidis, S. pneumoniae]

Concordance with laboratory testing

• Logistic regression, univariate models for EV, HPeV, Nm, Sp
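A univariate logistic model of this kind can be sketched in a few lines. The slide does not specify the predictor or the fitting software, so this example assumes a single hypothetical predictor (log10 of the pathogen's tpm) and fits by plain gradient ascent. The data points are invented for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, lr=0.1, n_iter=5000):
    """Fit P(y=1 | x) = sigmoid(b0 + b1 * x) by batch gradient ascent
    on the mean log-likelihood."""
    b0 = b1 = 0.0
    n = len(xs)
    for _ in range(n_iter):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            err = y - sigmoid(b0 + b1 * x)
            g0 += err
            g1 += err * x
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1


# HYPOTHETICAL data: x = log10(tpm + 1) for one pathogen,
# y = 1 if the laboratory test was positive for that pathogen.
xs = [0.0, 0.2, 0.5, 1.0, 2.5, 3.0, 3.5, 4.0]
ys = [0, 0, 0, 1, 0, 1, 1, 1]

b0, b1 = fit_logistic(xs, ys)
p_high = sigmoid(b0 + b1 * 4.0)  # predicted probability at a high signal
p_zero = sigmoid(b0 + b1 * 0.0)  # predicted probability with no signal
print(f"b0={b0:.2f}, b1={b1:.2f}, P(pos | x=4)={p_high:.2f}, P(pos | x=0)={p_zero:.2f}")
```

One such model would be fitted per pathogen, with concordance assessed by comparing predicted calls against the laboratory result.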

[Figure panels: Enterovirus, HPeV, N. meningitidis, S. pneumoniae]

Distribution of microbial reads differs between negative and positive samples

[Figure panels: Negative, Positive]

Can we confidently identify single-pathogen infection?

Dominant organism in each sample

Samples coloured by lab result

Limit of detection: >60% of microbial reads must support a single taxon
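The dominant-organism rule can be sketched as a simple threshold on per-taxon read counts. The function name and read counts below are hypothetical, and the slide does not say whether known contaminants are removed before the fraction is computed; this sketch leaves them in.

```python
def call_dominant(taxon_reads, min_fraction=0.60):
    """Return the taxon supported by more than `min_fraction` of microbial
    reads, or None if no single taxon dominates."""
    total = sum(taxon_reads.values())
    if total == 0:
        return None
    taxon, reads = max(taxon_reads.items(), key=lambda kv: kv[1])
    return taxon if reads / total > min_fraction else None


# HYPOTHETICAL microbial read counts for two samples.
sample_pos = {"Enterovirus": 700, "Alteromonas": 200, "PhiX": 100}
sample_neg = {"Enterovirus": 400, "Alteromonas": 350, "PhiX": 250}

print(call_dominant(sample_pos))
print(call_dominant(sample_neg))
```

A strict threshold of this kind trades sensitivity for specificity: samples with a real but non-dominant pathogen return no call.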

[Figure: dominant organism detected in each sample]

“Other”: anything not likely to be clinically significant

[Figure: dominant organism in each sample; labelled taxa: Sp, Nm, HPeV, EV, H. paraflu]

Can achieve perfect concordance with lab result, at some sensitivity cost

What’s needed for a “one true test”?

1. Unbiased enrichment
– Method of choice for Oxford Viromics initiative in WHG
• StopHCV consortium, PopART HIV clinical trial
– Unbiased capture of sequences with >80% similarity to probes

2. Sensitivity – how much sequencing do we need to do?
– Probe design is important
– More pathogens: greater sensitivity
– Can include entire microbial RefSeq if required
– Viruses: can sequence samples with viral load <10^4 & quantify viral load

3. Cost?
– Upfront cost of IDT or SureSelect probe panel can be reduced by designing probe panels across related projects

Thank you

Oxford Viromics WHG UK Childhood Meningitis and Encephalitis Study (UK-ChiMES) Rory Bowden Rory Bowden David Bonsall Andrew Pollard Mariateresa de Cesare Manish Sadarangani Azim Ansari Camilla Ip Samples Hubert Slawinski OVG: Annabel Coxon, Gretchen Meddaugh Amy Trebes Probe panel & method development Paolo Piazza Azim Ansari David Buck Ivo Elliott Cyndi Goh Cyndi Goh Ivo Elliott Peter Medawar laboratory Anthony Brown OSCAR study WHG laboratory Philippa Matthews Mariateresa de Cesare Peter Simmonds Hubert Slawinski Anna McNaughton Sequencing Colin Sharp (Roslin) Amy Trebes Paolo Piazza Viral Sequencing David Buck Ellie Barnes & STOPHCV collaborators High-Throughput Genomics WHG Christophe Fraser & PopART HPTN-071 collaborators

Oxford Dept. Zoology & Peter Medawar Institute Paul Klenerman Oliver Pybus James Iles