
Microbial Genomics, Pan-genomics, and Metagenomics in disease and in health

William Hsiao
BCCDC Public Health Labs

Talk dedicated to Francis Ouellette and all the VanBUG organizers/volunteers, current and past!

Microbes Germs Microbes – learn to love them

§ Microbes harbour much higher genetic diversity than eukaryotic organisms

§ Less than 0.5% of the microbial species have been identified – huge potential for discovery of new genes and new functions

§ Most microbes (>99.9%) do not cause disease in humans

§ Microbes can be engineered to act as little factories for energy production, drug production, immune system boosters (probiotics), pollution clean-up, and environmental sensors

The Bacterial/Archaeal Genome

§ Typically contained within a single large, circular chromosome (some are linear)

§ Haploid

§ May contain plasmids (extrachromosomal DNA)

§ No introns in the genes

§ Genome sizes range from 0.5 Mb to ~10 Mb (the average is about 3-5 Mb, containing about 3000-5000 genes)

§ Much easier than eukaryotic genomes to assemble and to annotate

§ The first free-living organism sequenced was a bacterium, Haemophilus influenzae, in 1995

Shotgun Sequencing – 90s style

[Figure: shotgun sequencing reads assembled into contigs]

Fraser et al 2000, Nature

Current Stats on Published Bacterial Genomes

§ Around 3000 published genomes in 17 years (thousands more sequenced)

[Figure: number of published microbial genomes per year, 1995 to 2011; y-axis: number of published genomes, 0 to 1200]

DNA Sequencing Technologies

ER Mardis. Nature 470, 198-203 (2011) doi:10.1038/nature09796

$100M for the first human genome

$10K per human genome or $10 per

Computing Improvements are “slower”

Cluster Computing

Cloud Computing

Next Generation Sequencing

Most microbial genomes are not finished anymore

http://www.genomesonline.org/

Improving Assembly – paired-end and optical map

§ With high depth of coverage from next generation sequencers, the gaps in unfinished genomes are usually due to unresolved repeats. By incorporating long-range information, we can order the contigs better and close the gaps.

§ One of my post-doc projects involved sequencing several bacterial genomes. Since we got back incomplete genomes with a few hundred contigs, we explored other ways to improve the genome assembly.

§ We decided to use optical maps (high-density, whole-genome restriction maps, aka fingerprints) to help us assemble the genomes

Theodore Assembler

Hsiao et al, in preparation

Improved Assembly Results
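The n50 column in these results is the usual N50 contiguity statistic: sort the contigs from longest to shortest and report the length of the contig at which the running total first reaches half of all assembled bases. A minimal Python sketch, assuming only a list of contig lengths; the function name and the example lengths are made up for illustration and are not the PBT assemblies:

```python
def n50(contig_lengths):
    """Return the N50 of an assembly.

    N50 is the length L such that contigs of length >= L together
    contain at least half of all assembled bases.
    """
    if not contig_lengths:
        raise ValueError("need at least one contig length")
    total = sum(contig_lengths)
    running = 0
    # Walk the contigs from longest to shortest, accumulating bases.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Hypothetical contig lengths in bp (illustration only).
example_contigs = [80, 70, 50, 40, 30, 20, 10]
print("N50 =", n50(example_contigs))
```

A higher N50 with fewer contigs, as in the Theodore rows, means the same bases sit in longer pieces.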

Strain  Assembly method  # of contigs  Total bases placed  N50
PBT16   Theodore         30            6661776             6566608
PBT21   Theodore         51            6927391             6739126
PBT91   Theodore         58            6927423             6738230
PBT16   Newbler          133           6564894             147681
PBT21   Newbler          204           6749359             124004
PBT91   Newbler          160           6900498             172612

Automated Genome Annotation

§ Several systems are available to the public, each with sophisticated approaches to assign functions to predicted genes / proteins

§ BASys (http://basys.ca)

§ Prok-annotation pipeline (http://ae.igs.umaryland.edu/cgi/intro_info.cgi)

§ IMG-ER (https://img.jgi.doe.gov/cgi-bin/er/main.cgi)

§ RAST (http://rast.nmpdr.org/rast.cgi)

§ Most of the systems run on large clusters of computers and take less than a day to annotate a genome

BASys Annotation Overview

[Figure: BASys annotation workflow, from contigs to an annotated genome. Labels: protein-encoding genes; non-protein-encoding genes (rRNA, tRNA, others); regional annotation; intergenic scan; functional annotation; automated annotation; manual annotation (extremely time consuming!)]

Van Domselaar et al NAR 2005

Genome Projects – then and now

Conditions for one genome  Then                                          Now
Sequence time              Months to a year to sequence one genome       Days to sequence several genomes
Cost of sequencing         $10,000 - $100,000                            $10-100
Annotation time            A year of manual curation by multiple people  Automated annotation + spot inspection
Finish status              Mostly complete                               Fragmented
Publication                Nature + Science                              SIGS

§ The first comparative genomics paper was published in 1999

§ Two Helicobacter pylori genomes, from strains isolated 7 years apart, were compared

§ More than half of the strain-specific genes were found clustered in hypervariable regions

§ This observation was soon repeated consistently in many other species

Alm et al, Nature 1999

Tools to detect Genomic Islands

§ In Fiona’s Lab, we developed several tools to aid the identification of genomic islands (genomic regions that are likely to be horizontally acquired from another species)

§ IslandPath – based on DNA signatures of the genomes and other features associated with islands (Hsiao et al, 2003)

§ IslandPick – based on comparative genomics (Langille et al BMC Bioinformatics, 2008)

§ IslandViewer – integrated approach to identify and view genomic islands (Langille et al Bioinformatics, 2009)

Proportions of Genes with no COG Assignment in Islands vs. Outside

[Figure: percentage of genes with no COG assignment, inside vs. outside islands, for each organism (x-axis: organisms; y-axis: % no assignment, 0% to 70%). Paired t-test P value: 1.27E-18]

More novel genes inside of islands

Hsiao et al. PLOS e62, Nov. 2005

Pan-genomes

§ Comparative genomics and the observation of gene gain and gene loss in microbes led to the idea of pan-genomes

§ The term was first coined in 2005 in a paper by Tettelin et al., in which they compared the sequenced genomes of six S. agalactiae strains.

§ A pan-genome consists of the core (shared) genes of a species plus its strain-specific (dispensable) genes

§ Pan-genome calculation extrapolates from observations on a limited number of strains to estimate the theoretical number of genomes required to fully capture the pan-genome of a species

Open vs. Closed pan-genome

SNP-phylogeny for very closely related genomes

§ For very closely related isolates or very slowly evolving species, sometimes there is very little gene-gain and gene-loss.

§ In these cases, SNPs detected by aligning these genomes can be used as the basis for comparison and for phylogenetic reconstruction of the evolutionary history of the species

§ Whole-genome SNPs and social network questionnaires were used to reconstruct a TB outbreak in BC

Pangenome + !

§ A TB outbreak occurred in a BC community over a 3-year period

§ Molecular markers suggested that the outbreak was clonal, but traditional contact tracing couldn’t identify a source

§ Whole-genome SNPs and social network questionnaires (including location information) provided higher-resolution data that allowed reconstruction of a likely scenario for the outbreak events.

§ Further epidemiological investigation pointed to increased crack cocaine usage (common locations) in the community

Gardy, Johnston, Ho Sui et al NEJM 2011

Putative Transmission Networks

Pangenome + Metadata!

§ This paper really demonstrated the power of whole genome sequencing

§ But it was the availability of the metadata (disease conditions, locations, contacts, dates, etc.) that facilitated the interpretation of the whole-genome data

Biodiversity

§ In a recent global ocean survey study, ~4000 novel protein families were detected, a significant addition to ~13,000 known protein families (Yooseph et al, PLoS Biology, 03/2007)

§ Sampling the human gut identified >3 million non-redundant bacterial genes and >1000 prevalent species (Qin et al, Nature, 03/2010)

§ In environmental surveys to date, 30% - 70% of the genes identified in the samples are novel

§ >90% of all genetic diversity comes from non-eukaryotic organisms

§ How can we begin to study this diversity and identify important microorganisms?

What is Metagenomics?

§ Meta = beyond

§ Coined by Jo Handelsman (environmental microbiologist) in 1998

§ Has since taken on a more precise definition: studies that analyze genetic material from a mixed population living in the same environment

§ Who’s there? What do they do?

§ How do they interact with each other and with the environment?

Typical Experimental Protocols

Samples from environment or hosts → Enriched for microbes → Extract DNA or RNA from mixed population (no culturing & cloning!)

Targeted Sequencing
• Use PCR primers to target specific regions of the genome
• E.g. 16S rRNA, capsid, 18S
• Able to sequence deeper and broader
• No metabolic functional information
• Good for finding out “Who’s there”

Shotgun Sequencing
• Sequence randomly all the DNA in the sample (RNA is reverse transcribed first)
• Obtain functional information
• Don’t know the exact host of each gene
• Good for finding out “What is the community doing”

Taxonomic Binning

§ After obtaining the 16S or other amplicon sequences, taxonomic binning based on sequence similarity or on k-mer frequency similarity is carried out to assign each read to a taxon

§ Alternatively, reads are clustered to form OTUs (operational taxonomic units), since many reads cannot be assigned to a taxon

§ In the end, we obtain a matrix of count data associated with each taxon/OTU
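Because samples are rarely sequenced to the same depth, the raw counts are usually converted to relative abundances before samples are compared. A small Python sketch using the example counts from this slide; dividing by the sample total is just the simplest choice of normalization and is an assumption here, not necessarily what any particular pipeline does:

```python
# Count matrix from the example on this slide: rows are samples,
# columns are taxa/OTUs.
taxa = ["E. coli", "OTU 1", "B. theta", "P. aeruginosa"]
counts = {
    "Sample 1": [5, 8, 77, 23],
    "Sample 2": [11, 34, 3, 12],
}


def relative_abundance(row):
    """Divide each count by the sample total (simple depth normalization)."""
    total = sum(row)
    if total == 0:
        raise ValueError("sample has no reads")
    return [count / total for count in row]


rel = {sample: relative_abundance(row) for sample, row in counts.items()}

for sample, fractions in rel.items():
    depth = sum(counts[sample])
    summary = ", ".join(f"{t}: {f:.3f}" for t, f in zip(taxa, fractions))
    print(f"{sample} (depth {depth}): {summary}")
```

The two samples have different depths (113 vs. 60 reads), which is why the raw counts are not directly comparable.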

Taxon     E. coli  OTU 1  B. theta  P. aeruginosa
Sample 1  5        8      77        23
Sample 2  11       34     3         12

International Human Microbiome Consortium

§ International efforts to characterize the microbiota associated with human body sites

§ Systematic survey of the bacteria found in each site in healthy individuals – metagenomics

§ Sequencing of reference genomes of bacteria isolated from humans – genomics and pangenome

§ Targeted study of the microbiota associated with various diseases

§ More information: http://www.hmpdacc.org and http://commonfund.nih.gov/hmp/

Endodontics 16S Microbiome

§ Root canal infections are a leading cause of oro-facial pain and tooth loss in western countries

§ No clear etiology; polymicrobial factors

§ Patients with root canal infections and periapical abscess were studied for the transition of microbiota from healthy oral sites to root canal and abscess

§ 3 samples (normal oral, infected root canal, and abscess) were obtained from each of 8 individuals undergoing treatment

§ The first study we know of to sample healthy and diseased oral microbiota from the same individuals

Hsiao et al, submitted

Abundant taxa show different distributions

Diseased sites have lower diversity

More abundant taxa were found in all 3 sites

[Figure panels: All OTUs; Abundant OTUs]

Differentially Distributed Bacteria may be associated with disease

§ We wanted to know which organisms are differentially distributed in healthy vs. diseased sites

§ So after adjusting the count data for variance and sampling depth, we used paired t-tests and ANOVA tests to identify OTUs that are differentially distributed

§ We are especially interested in organisms that are more abundant in diseased samples

§ In short, we were able to identify specific bacteria (some known opportunistic pathogens) that have higher relative abundance in diseased samples

§ These include: Granulicatella adiacens, Eubacterium yurii, Prevotella melaninogenica, Prevotella salivae, Streptococcus mitis, and Atopobium rimae

Watershed Microbiome

§ Genome BC project

§ Project leaders: Dr. Patrick Tang and Dr. Judith Isaac-Renton

§ Two major goals are:

§ 1) To use metagenomics to identify novel microbial biomarkers of watershed health

§ 2) To develop tools to match the microbial fingerprint of a contaminated watershed to the specific source of pollution

Current Water Quality Monitoring Problems

“The most significant problems associated with pathogen measurement are the lag time involved in testing and… the large number of false results… The absence of E. coli does not assure the absence of more resistant fecal pathogens… source protection planning must be carried out on an ecologically meaningful scale – that is, at the watershed level.”

The Honourable Dennis R. O’Connor
Walkerton Inquiry Commissioner

Metagenomics will Provide the Solutions

“DNA analysis offers promise for the future”
Walkerton Inquiry Report

1. We need better tests

§ Water quality test: Is fecal pollution present?

§ Pollution attribution test: Which species is the cause?

2. The tests need better indicators

§ New bacterial, viral, and potentially protozoan markers

3. An environmental survey is needed to find these novel indicators

§ Metagenomics is the only tool that can do this survey

Pilot study looking at 16S microbiome at different sites under wet and dry conditions

§ Two watershed sites

§ Two different conditions (wet day vs. dry day)

§ Multiple time points throughout a day

§ Two replicate samples per sampling event

§ 16S sequences from the samples amplified and sequenced

§ Microbiome profile generated based on 16S sequences

§ Clustering of the samples based on relative abundance of the species (OTUs)

Hierarchical Clustering of samples based on 16S relative abundance

Systems Approach – Mouse Gut Model

§ The host is a dynamic system just like the microbiota, and it is the interaction between host and microbes that really produces the observed outcome.

§ So, we want to be able to study the host changes and the microbiota changes simultaneously.

§ Immunity vs. metabolism in the gut: a trialogue between B lymphocytes, microbiota and the intestinal epithelium

Shulzhenko, Morgun, Hsiao et al, Nat. Med, 2011

Overview of the Systems

[Figure: microbiota, epithelial cells, and immune cells (T cell, B cell), with the interactions between them marked "?". Modified from Lora V. Hooper, Nature Reviews 7, 367-374; 2009. Prepared by N. Shulzhenko]

Mice – B lymphocyte knockout and control

B10.A µMT-/- vs. B10.A WT: 17 pairs (littermates and non-littermates)

BALB/c JH-/- vs. BALB/c WT: 10 pairs (non-littermates)

For all mice:

Take jejunum → Isolate RNA → gene expression by microarrays

Jejunum contents → Isolate DNA → 16S microbiota analysis

Analysis of microarrays

1. Comparing gene expression in the jejunum of µMT vs. heterozygous littermates

2. Excluding B-cell origin genes (microarrays on separated B lymphocytes)

3. Validating on non-littermates (µMT and Jh-/- vs. WT)

Final list of genes: B-cell KO profile

Prepared by N. Shulzhenko

What happens when the B cells are knocked-out?

[Figure: normal host vs. B lymphocyte/antibody-deficient host. Labels: microbiota, dietary lipids, absorption, GATA4, epithelial cells, IgA, T and B cells, immune function, metabolic function, deposition, adipose. Prepared by N. Shulzhenko]

Changes in commensal microbes in the small intestine of B-cell KO mice

Few significant differences detected by paired comparison of absolute amounts
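A paired comparison treats each B-cell KO mouse and its matched control as one unit and asks whether the per-pair differences are centred on zero. The sketch below computes a paired t statistic in plain Python on log10-transformed amounts; the choice of test, the log transform, and the numbers are all assumptions for illustration, not the analysis or data behind this slide:

```python
import math
from statistics import mean, stdev


def paired_t(control, knockout):
    """Return (t statistic, degrees of freedom) for paired samples.

    control[i] and knockout[i] must come from the same matched pair.
    """
    if len(control) != len(knockout) or len(control) < 2:
        raise ValueError("need two equal-length lists with at least 2 pairs")
    diffs = [k - c for c, k in zip(control, knockout)]
    sd = stdev(diffs)
    if sd == 0:
        raise ValueError("all pairwise differences are identical")
    t = mean(diffs) / (sd / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


# Hypothetical amounts of one taxon in four matched pairs (illustration only).
control = [0.01, 0.1, 1.0, 0.1]
knockout = [0.1, 1.0, 10.0, 0.1]

# Amounts span orders of magnitude, so compare them on a log10 scale.
log_control = [math.log10(v) for v in control]
log_knockout = [math.log10(v) for v in knockout]

t, df = paired_t(log_control, log_knockout)
print(f"t = {t:.2f} with {df} degrees of freedom")
```

The statistic is then compared against a t distribution with the returned degrees of freedom, and when many taxa are tested the p-values need a multiple-testing correction.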

Ø Sequencing of DNA coding for 16S rRNA

[Figure: Control vs. B-cell KO amounts on log-scale axes, with panels for Clostridiaceae (family), Paracoccus (genus), and Lactococcus subgroup (genus)]

All three are minor members of the microbiota (<0.4%)

Do microbiota really play a role in the changes? Germ-free vs. conventional B-cell KO

No difference in gene expression between BcKO and control mice under germ-free conditions

Microbiota has a major role in “B-cell KO” intestinal profile

Prepared by N. Shulzhenko

In this trialogue, the adaptive immune system, the intestine, and the microbiota combine to influence a homeostatic metabolic function, in mice and in humans.

[Figure: microbiota, epithelial cells, T cell, and B cell, linked to immune function and metabolic function. Prepared by N. Shulzhenko]

Trans-kingdom Cross Talk (phylochip + antibiotic treatment)

Red = host genes that are differentially expressed

Blue = microbes that have different relative abundance

Lines connect nodes that are correlated across samples (yellow = positive; black = negative)
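A network like this can be built by correlating every host gene with every microbe across the same samples and drawing an edge when the correlation is strong enough. A minimal Python sketch, assuming Pearson correlation and an arbitrary cutoff of 0.8; the profiles, names, and cutoff are hypothetical, and this is not the method used for the figure:

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length profiles."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have the same length (>= 2)")
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("a constant profile has no defined correlation")
    return sxy / math.sqrt(sxx * syy)


def build_edges(host, microbes, threshold=0.8):
    """Link each host gene to each microbe whose |r| reaches the threshold."""
    edges = []
    for gene, expression in host.items():
        for taxon, abundance in microbes.items():
            r = pearson(expression, abundance)
            if abs(r) >= threshold:
                sign = "positive" if r > 0 else "negative"
                edges.append((gene, taxon, sign))
    return edges


# Hypothetical profiles across five samples (illustration only).
host = {"gene_A": [1, 2, 3, 4, 5], "gene_B": [5, 4, 3, 2, 1]}
microbes = {"taxon_X": [2, 4, 6, 8, 10], "taxon_Y": [3, 1, 4, 1, 5]}

for gene, taxon, sign in build_edges(host, microbes):
    print(f"{gene} -- {taxon} ({sign})")
```

Correlation alone says nothing about direction or cause, which is why a perturbation such as the antibiotic treatment is useful alongside it.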

Prepared by A. Morgun

Future and Wishes

§ Microbial genomics, with its rapid advances in the past two decades, has a bright future in helping us to understand the world’s most dominating life forms better!

§ With increasing number of genomes available, tools for comparative microbial genomics and a good comparative genome browser capable of handling hundreds of incomplete genomes will be very useful

§ “World Peace” – combating microbes with broad-spectrum antibiotics = last resort and is often counter-productive (we need our microbiota for health)

§ Many diseases and health issues have polymicrobial origins, and pan-genome and metagenomics can help us solve these mysteries

§ Combination of different data types is key to interpret genomic data

§ Better statistical tools to integrate the data and to help interpret the results are also needed

Acknowledgements

§ Claire Fraser-Liggett

§ Patrick Tang

§ Art Delcher

§ Judy Isaac-Renton

§ Elliott Drábek

§ Fiona Brinkman

§ Zhenqiu Liu

§ Natalie Prystajecky

§ Cheron Jones

§ Miguel Uyaguari

§ Brandi Cantarel

§ Jennifer Gardy

§ Institute for Genome Sciences (sequencing, annotation)

§ Michael Chan

§ Ashraf Fouad

§ Stephen Pleasance

§ Andrey Morgun

§ Natalia Shulzhenko

§ Jeffrey Gordon (and his lab)

Outline

§ Progression from Microbial Genomics, Pangenomics, and Metagenomics

§ Bioinformatics tools used for these analyses

§ My own projects and HMPs as examples

  § Health

  § Diseases

§ Tool developments

  § Database management

  § Assemblers

  § Classifier

§ Future of the field and Wish list for tools