Microbial Genomics, Pan-genomics, and Metagenomics in disease and in health
William Hsiao
BCCDC Public Health Labs
[email protected]
Talk dedicated to Francis Ouellette and all the VanBUG organizers/volunteers, current and past!
Microbes Germs
Microbes – learn to love them
§ Microbes harbour much higher genetic diversity than eukaryotic organisms
§ Less than 0.5% of microbial species have been identified – huge potential for discovery of new genes and new functions
§ Most microbes (>99.9%) do not cause disease in humans
§ Microbes can be engineered to act as little factories for energy production, drug production, immune system boosting (probiotics), pollution clean-up, and environmental sensing
The Bacterial/Archaeal Genome
§ Typically contained within a single large, circular chromosome (some are linear)
§ Haploid genomes
§ May contain plasmids (extrachromosomal DNA)
§ No introns in the genes
§ Genome sizes range from 0.5 Mb to ~10 Mb (the average is about 3-5 Mb, containing about 3000-5000 genes)
§ Much easier than eukaryotic genomes to assemble and to annotate
§ The first free-living organism sequenced was a bacterium – Haemophilus influenzae in 1995
Shotgun Sequencing – 90s style
[Figure label: contigs]
Fraser et al 2000, Nature
Current Stats on Published Bacterial Genomes
§ Around 3000 published genomes in 17 years (thousands more sequenced)
[Figure: bar chart of the number of published microbial genomes per year, 1995-2011; y-axis from 0 to 1200]
DNA Sequencing Technologies
ER Mardis. Nature 470, 198-203 (2011) doi:10.1038/nature09796
$100M for the first human genome
$10K per human genome or $10 per bacterial genome
Computing Improvements are “slower”
Cluster Computing
Cloud Computing
Next Generation Shotgun Sequencing
Most microbial genomes are not finished anymore
http://www.genomesonline.org/
Improving Assembly – paired-end and optical map
§ With the high depth of coverage from next generation sequencers, the gaps in unfinished genomes are usually due to unresolved repeats. By incorporating long-range information, we can order the contigs better and close the gaps
§ One of my post-doc projects involved sequencing several bacterial genomes. Since we got back incomplete genomes with a few hundred contigs, we explored other ways to improve the genome assembly
§ We decided to use optical maps (high density, whole genome restriction maps, aka fingerprints) to help us assemble the genome
Theodore Assembler
Hsiao et al, in preparation
Improved Assembly Results
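The n50 column in the results table on this slide is taken here to follow the usual definition: the contig length L such that contigs of length L or longer together hold at least half of the assembled bases. A minimal Python sketch of that calculation follows. The function name and the toy contig lengths are made up for illustration and are not data from these assemblies.

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths.

    N50 is the length L such that contigs of length >= L together
    contain at least half of the total assembled bases.
    """
    if not contig_lengths:
        raise ValueError("need at least one contig")
    total = sum(contig_lengths)
    running = 0
    # Walk from the longest contig down until half the bases are covered.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Hypothetical toy assembly (illustration only).
toy_contigs = [500, 400, 300, 200, 100]
print(n50(toy_contigs))
```

A higher N50 for a similar total number of bases means that the same sequence is held in fewer, longer contigs.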
Strain   Assembly method   # of contigs   Total bases placed   N50
PBT16    Theodore          30             6661776              6566608
PBT21    Theodore          51             6927391              6739126
PBT91    Theodore          58             6927423              6738230
PBT16    Newbler           133            6564894              147681
PBT21    Newbler           204            6749359              124004
PBT91    Newbler           160            6900498              172612
Automated Genome Annotation
§ Several systems are available to the public, each with sophisticated approaches to assign functions to predicted genes / proteins
§ BASys (http://basys.ca)
§ Prok-annotation pipeline (http://ae.igs.umaryland.edu/cgi/intro_info.cgi)
§ IMG-ER (https://img.jgi.doe.gov/cgi-bin/er/main.cgi)
§ RAST (http://rast.nmpdr.org/rast.cgi)
§ Most of the systems run on large clusters of computers and take less than a day to annotate a genome
BASys Annotation Overview
[Figure: BASys annotation flowchart. Labels: Contigs; Protein-encoding genes; Non-protein-encoding genes (rRNA, tRNA, others); Regional annotation; Functional annotation; Intergenic scan; Automated annotation; Manual annotation; Annotated genome; “Extremely time consuming!”]
Van Domselaar et al NAR 2005
Genome Projects – then and now
Conditions for one genome   Then                                           Now
Sequence time               Months to a year to sequence one genome        Days to sequence several genomes
Cost of sequencing          $10,000 – 100,000                              $10-100
Annotation time             A year of manual curation by multiple people   Automated annotation + spot inspection
Finish status               Mostly complete                                Fragmented
Publication                 Nature + Science                               SIGS
Comparative Genomics
§ First comparative genomics paper published in 1999
§ 2 Helicobacter pylori genomes isolated 7 years apart were compared
§ Found that more than half of the strain-specific genes are clustered in hypervariable regions
§ The same pattern was soon observed consistently in many other species
Alm et al, Nature 1999
Tools to detect Genomic Islands
§ In Fiona’s lab, we developed several tools to aid the identification of genomic islands (genomic regions that are likely to have been horizontally acquired from another species)
§ IslandPath – based on DNA signatures of the genomes and other features associated with islands (Hsiao et al, Bioinformatics, 2003)
§ IslandPick – based on comparative genomics (Langille et al, BMC Bioinformatics, 2008)
§ IslandViewer – integrated approach to identify and view genomic islands (Langille et al, Bioinformatics, 2009)
Proportions of Genes with no COG Assignment in Islands vs. Outside
[Figure: bar chart of the percentage of genes with no COG assignment (y-axis from 0% to 70%), inside vs. outside islands, one pair of bars per organism. Paired t-test P value: 1.27E-18. More novel genes inside of islands.]
Hsiao et al. PLoS Genetics e62, Nov. 2005
Pan-genomes
§ Comparative genomics and the observation of gene gain and gene loss in microbes led to the idea of pan-genomes
§ The term was first coined in 2005 in a paper by Tettelin et al., in which they compared sequenced genomes from six S. agalactiae strains
§ A pan-genome consists of the core (shared) genes of a species + its strain-specific (dispensable) genes
§ Pan-genome calculation extrapolates from observations on a limited number of strains to estimate the theoretical number of genomes required to fully capture the pan-genome of a species
Open vs. Closed pan-genome
SNP-phylogeny for very closely related genomes
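Looking back at the pan-genome bullets above, the core/strain-specific split and the "how does it grow as strains are added" question can be sketched with plain set operations. This is only an illustration: the strain names and gene identifiers are hypothetical, the function names are my own, and the curve-fitting step used to extrapolate to unseen genomes is not shown.

```python
from itertools import combinations


def core_and_pan(genomes):
    """genomes: dict mapping strain name -> set of gene family ids."""
    gene_sets = list(genomes.values())
    core = set.intersection(*gene_sets)  # genes present in every strain
    pan = set.union(*gene_sets)          # genes present in any strain
    return core, pan


def accumulation(genomes):
    """Mean core and pan-genome size for each number of strains k,
    averaged over all combinations of k strains."""
    names = sorted(genomes)
    curve = []
    for k in range(1, len(names) + 1):
        core_sizes, pan_sizes = [], []
        for combo in combinations(names, k):
            core, pan = core_and_pan({n: genomes[n] for n in combo})
            core_sizes.append(len(core))
            pan_sizes.append(len(pan))
        curve.append((k,
                      sum(core_sizes) / len(core_sizes),
                      sum(pan_sizes) / len(pan_sizes)))
    return curve


# Hypothetical strains and gene families (illustration only).
toy_genomes = {
    "strain_A": {"g1", "g2", "g3", "g4"},
    "strain_B": {"g1", "g2", "g3", "g5"},
    "strain_C": {"g1", "g2", "g6"},
}
core, pan = core_and_pan(toy_genomes)
for k, mean_core, mean_pan in accumulation(toy_genomes):
    print(k, round(mean_core, 2), round(mean_pan, 2))
```

If the mean pan-genome size keeps rising as strains are added, the pan-genome looks open. If it levels off, it looks closed.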
§ For very closely related isolates or very slowly evolving species, sometimes there is very little gene-gain and gene-loss.
§ In these cases, SNPs detected by aligning the genomes can be used as the basis for comparison and for phylogenetic reconstruction of the evolutionary history of the species
§ Whole genome SNPs and a social network questionnaire were used to reconstruct a TB outbreak in BC
Pangenome + Metadata!
§ A TB outbreak occurred in a BC community over a 3 year period
§ Molecular markers suggested that the outbreak was clonal, but traditional contact tracing couldn’t identify a source
§ Whole genome sequencing and social network questionnaires (including location information) provided higher resolution data, allowing reconstruction of a likely scenario for the outbreak events
§ Further epidemiological investigation pointed to increased crack cocaine usage (common locations) in the community
Gardy, Johnston, Ho Sui et al NEJM 2011
Putative Transmission Networks
Pangenome + Metadata!
§ This paper really demonstrated the power of whole genome sequencing
§ But it was the availability of the metadata (disease conditions, locations, contacts, dates, etc) that made the whole genome data interpretable
Biodiversity
§ In a recent global ocean survey study, ~4000 novel protein families were detected, a significant addition to ~13,000 known protein families (Yooseph et al, PLoS Biology, 03/2007)
§ Sampling the human gut identified >3 million non-redundant bacterial genes and >1000 prevalent species (Qin et al, Nature, 03/2010)
§ In environmental surveys to date, 30% - 70% of the genes identified in the samples are novel
§ >90% of all genetic diversity comes from non-eukaryotic organisms
§ How can we begin to study this diversity and identify important microorganisms?
What is Metagenomics?
§ Meta = beyond
§ Coined by Jo Handelsman (environmental microbiologist) in 1998
§ Has since taken on a more precise definition: the analysis of genetic material from a mixed population living in the same environment
§ Who’s there? What do they do?
§ How do they interact with each other and with the environment?
Typical Experimental Protocols
Samples from environment or hosts → Enriched for microbes → Extract DNA or RNA from mixed population (no culturing & cloning!)
Targeted Sequencing
• Use PCR primers to target specific regions of the genome
• E.g. 16S rRNA, capsid, 18S
• Able to sequence deeper and broader
• No metabolic functional information
• Good for finding out “Who’s there”
Shotgun Sequencing
• Randomly sequence all the DNA in the sample (RNA is reverse transcribed first)
• Obtain functional information
• Don’t know the exact host of each gene
• Good for finding out “What is the community doing”
Taxonomic Binning
§ After obtaining the 16S or other amplicon sequences, taxonomic binning based on sequence similarity or on k-mer frequency similarity is carried out to assign each read to a taxon
§ Alternatively, reads are clustered to form OTUs (operational taxonomic units), since many reads cannot be assigned to a taxon
§ In the end, we obtain a matrix of count data associated with each taxon/OTU
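The binning steps above (k-mer frequency similarity, then a per-sample count matrix) can be sketched in a few lines of Python. This is a toy sketch under stated assumptions, not any published classifier: cosine similarity on 3-mer counts, the 0.5 cut-off, the reference sequences, and all function names are illustrative choices.

```python
from collections import Counter
from math import sqrt


def kmer_profile(seq, k=3):
    """Count every overlapping k-mer in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def cosine(a, b):
    """Cosine similarity between two k-mer count profiles."""
    dot = sum(a[x] * b[x] for x in a if x in b)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def assign(read, references, k=3, min_sim=0.5):
    """Assign a read to the most similar reference taxon, or leave it
    'unassigned' if nothing exceeds min_sim (an assumed cut-off)."""
    profile = kmer_profile(read, k)
    best_taxon, best_sim = "unassigned", min_sim
    for taxon, ref_profile in references.items():
        sim = cosine(profile, ref_profile)
        if sim > best_sim:
            best_taxon, best_sim = taxon, sim
    return best_taxon


def count_matrix(samples, references, k=3):
    """Tally assigned reads per taxon for each sample."""
    return {name: Counter(assign(r, references, k) for r in reads)
            for name, reads in samples.items()}


# Hypothetical reference sequences and reads (illustration only).
toy_references = {
    "taxon_A": kmer_profile("ACGTACGTACGTACGT"),
    "taxon_B": kmer_profile("GGGCCCGGGCCCGGGCCC"),
}
toy_samples = {
    "Sample 1": ["ACGTACGT", "ACGTACGT", "GGGCCCGGG"],
    "Sample 2": ["GGGCCCGGG", "TTTTTTTT"],
}
for sample, counts in count_matrix(toy_samples, toy_references).items():
    print(sample, dict(counts))
```

Reads that stay "unassigned" are the ones that would go into OTU clustering instead of a named taxon.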
Taxon      E. coli   OTU 1   B. theta   P. aeruginosa
Sample 1   5         8       77         23
Sample 2   11        34      3          12
International Human Microbiome Consortium
§ International efforts to characterize the bacteria associated with human body sites
§ Systematic survey of the bacteria found in each site in healthy individuals – metagenomics
§ Sequencing of reference genomes of bacteria isolated from humans – genomics and pangenome
§ Targeted study of microbiomes associated with various diseases
§ More information: http://www.hmpdacc.org and http://commonfund.nih.gov/hmp/
Endodontics 16S Microbiome
§ Root canal infections are a leading cause of oro-facial pain and tooth loss in western countries
§ No clear etiology; polymicrobial factors
§ Patients with root canal infections and periapical abscesses were studied for the transition of microbiota from healthy oral sites to root canal and abscess
§ 3 samples (normal oral, infected root canal, and abscess) were obtained from each of 8 individuals undergoing treatment
§ First study that we know of to sample healthy and diseased oral microbiota from the same individuals
Hsiao et al, submitted
Abundant taxa show different distributions
Diseased sites have lower diversity
More abundant taxa were found in all 3 sites
All OTUs
Abundant OTUs
Differentially Distributed Bacteria may be associated with disease
§ We wanted to know which organisms are differentially distributed between healthy and diseased sites
§ After adjusting the count data for variance and sampling depth, we used paired t-tests and ANOVA to identify OTUs that are differentially distributed
§ We are especially interested in organisms that are more abundant in diseased samples
§ In short, we were able to identify specific bacteria (some of them known opportunistic pathogens) with higher relative abundance in diseased samples
§ These include: Granulicatella adiacens, Eubacterium yurii, Prevotella melaninogenica, Prevotella salivae, Streptococcus mitis, and Atopobium rimae
Watershed Microbiome
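Before moving on: the paired comparison described for the endodontics data (one healthy and one diseased sample from the same individual) can be sketched as below. This is a minimal illustration, not the analysis from the study. It normalizes only for sampling depth (the variance adjustment mentioned above is not shown), it returns only the t statistic and degrees of freedom, and all numbers and function names are made up.

```python
from math import sqrt
from statistics import mean, stdev


def relative_abundance(counts):
    """Convert raw read counts for one sample into proportions,
    which removes differences in sampling depth."""
    total = sum(counts)
    return [c / total for c in counts]


def paired_t(healthy, diseased):
    """Paired t statistic for one OTU measured at a healthy and a
    diseased site in the same individuals. Returns (t, degrees of freedom)."""
    diffs = [d - h for h, d in zip(healthy, diseased)]
    n = len(diffs)
    sd = stdev(diffs)
    if sd == 0:
        raise ValueError("all paired differences are identical")
    return mean(diffs) / (sd / sqrt(n)), n - 1


# Hypothetical relative abundances of one OTU in four individuals.
toy_healthy = [0.10, 0.20, 0.15, 0.05]
toy_diseased = [0.30, 0.35, 0.40, 0.25]
t_stat, dof = paired_t(toy_healthy, toy_diseased)
print(round(t_stat, 3), dof)
```

A positive t here means the OTU is more abundant at the diseased site. The p-value would come from a t distribution with `dof` degrees of freedom, and testing many OTUs at once would also call for a multiple-testing correction.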
§ Genome BC project
§ Project leaders Dr. Patrick Tang and Dr. Judith Isaac-Renton
§ Two major goals are
§ 1) To use metagenomics to identify novel microbial biomarkers of watershed health
§ 2) To develop tools to match the microbial fingerprint of a contaminated watershed to the specific source of pollution
Current Water Quality Monitoring Problems
“The most significant problems associated with pathogen measurement are the lag time involved in testing and… the large number of false results… The absence of E. coli does not assure the absence of more resistant fecal pathogens… source protection planning must be carried out on an ecologically meaningful scale – that is, at the watershed level.”
The Honourable Dennis R. O’Connor, Walkerton Inquiry Commissioner
Metagenomics will Provide the Solutions
“DNA analysis offers promise for the future” (Walkerton Inquiry Report)
1. We need better tests
§ Water quality test: Is fecal pollution present?
§ Pollution attribution test: Which species is the cause?
2. The tests need better indicators
§ New bacterial, viral, and potentially protozoan markers
3. An environmental survey is needed to find these novel indicators
§ Metagenomics is the only tool that can do this survey
Pilot study looking at 16S microbiome at different sites under wet and dry conditions
§ Two watershed sites
§ Two different conditions (wet day vs. dry day)
§ Multiple time points throughout a day
§ Two replicate samples per sampling event
§ 16S sequences from the samples amplified and sequenced
§ Microbiome profile generated based on 16S sequences
§ Clustering of the samples based on relative abundance of the species (OTUs)
Hierarchical Clustering of samples based on 16S relative abundance
Systems Approach – Mouse Gut Model
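As an aside on the last step of the watershed pilot study (clustering samples by OTU relative abundance), here is a small self-contained sketch of agglomerative clustering. The slides do not say which distance or linkage was used, so Bray-Curtis dissimilarity and average linkage are assumptions on my part. The sample names and abundance profiles are invented.

```python
from itertools import combinations


def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two abundance profiles."""
    num = sum(abs(a - b) for a, b in zip(u, v))
    den = sum(a + b for a, b in zip(u, v))
    return num / den if den else 0.0


def cluster_distance(c1, c2, profiles):
    """Average linkage: mean dissimilarity over all cross-cluster pairs."""
    pairs = [(a, b) for a in c1 for b in c2]
    return sum(bray_curtis(profiles[a], profiles[b])
               for a, b in pairs) / len(pairs)


def average_linkage(profiles):
    """Repeatedly merge the two closest clusters.
    Returns the merges as (cluster_1, cluster_2, distance) tuples."""
    clusters = [(name,) for name in sorted(profiles)]
    merges = []
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda p: cluster_distance(p[0], p[1], profiles))
        d = cluster_distance(c1, c2, profiles)
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
        merges.append((c1, c2, d))
    return merges


# Hypothetical relative abundances of three OTUs in four samples.
toy_profiles = {
    "site1_dry": [0.60, 0.30, 0.10],
    "site1_wet": [0.50, 0.40, 0.10],
    "site2_dry": [0.10, 0.20, 0.70],
    "site2_wet": [0.25, 0.05, 0.70],
}
for c1, c2, d in average_linkage(toy_profiles):
    print(c1, "+", c2, "at", round(d, 3))
```

In this toy data the two samples from each site merge first, which is the kind of pattern a dendrogram of the real samples would be read for: do samples group by site, by wet vs. dry condition, or by time of day.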
§ The host is a dynamic system just like the microbiota, and it is the interaction between host and microbes that really produces the observed outcome
§ So we want to be able to study the host gene expression changes and the microbiota changes simultaneously
§ Immunity vs. metabolism in the gut: a trialogue between B lymphocytes, microbiota and the intestinal epithelium
Shulzhenko, Morgun, Hsiao et al, Nat. Med, 2011
Overview of the Systems
[Figure: overview of the system showing microbiota, epithelial cells, and immune cells (T cell, B cell), with a “?” marking the interactions. Modified from Lora V. Hooper, Nature Reviews Microbiology 7, 367-374; 2009]
Prepared by N. Shulzhenko
Mice – B lymphocyte knockout and control
B10.A µMT-/- vs. B10.A WT: 17 pairs (littermates and non-littermates)
BALB/c JH-/- vs. BALB/c WT: 10 pairs (non-littermates)
For all mice:
Take jejunum → Isolate RNA → gene expression by microarrays
Jejunum contents → Isolate DNA → 16S microbiota analysis
Analysis of microarrays
1. Comparing gene expression in the jejunum of µMT vs. heterozygous littermates
2. Excluding B-cell origin genes (microarrays on separated B lymphocytes)
3. Validating on non-littermates (µMT and JH-/- vs. WT)
Final list of genes: B-cell KO profile
Prepared by N. Shulzhenko
What happens when the B cells are knocked out?
[Figure: two panels, “Normal host” and “B lymphocyte/antibody-deficient host”. Labels in each panel: microbiota, absorption, GATA4, epithelial cells, deposition, adipose. Other labels: dietary lipids, IgA, B, T, immune function, metabolic function]
Prepared by N. Shulzhenko
Changes in commensal microbes in the small intestine of B-cell KO mice
Sequencing of DNA coding for 16S rRNA
Few significant differences detected by paired comparison of absolute amounts
[Figure: three log-scale scatter plots of abundance in control vs. B-cell KO mice, one each for Clostridiaceae (family), Paracoccus (genus), and Lactococcus subgroup (genus)]
All three are minor members of the microbiota (<0.4%)
Do microbiota really play a role in the changes?
Germ-free vs. conventional B-cell KO
No difference in gene expression between BcKO and control mice under germ-free conditions
Microbiota has a major role in “B-cell KO” intestinal profile
Prepared by N. Shulzhenko
In this trialogue, the adaptive immune system, the intestine, and the microbiota combine to influence a homeostatic metabolic function, in mice and in humans.
[Figure: microbiota, epithelial cells (immune function, metabolic function), T cell, B cell]
Prepared by N. Shulzhenko
Trans-kingdom Cross Talk (phylochip + antibiotic treatment)
Red = host genes that are differentially expressed
Blue = microbes that have different relative abundance
Lines connect nodes that are correlated across samples (yellow = positive; black = negative)
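The network legend above (gene nodes, microbe nodes, and edges for correlation across samples) can be mimicked with a few lines of Python. The slide does not state which correlation measure or cut-off was used, so Pearson correlation and the 0.8 threshold are assumptions. The gene and taxon names and their values are invented.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equally long value lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # a constant profile carries no correlation signal
    return cov / sqrt(vx * vy)


def cross_talk_edges(genes, microbes, threshold=0.8):
    """Connect each host gene to each microbe whose abundance is
    strongly correlated with it across the same samples."""
    edges = []
    for gene, expression in genes.items():
        for microbe, abundance in microbes.items():
            r = pearson(expression, abundance)
            if abs(r) >= threshold:
                sign = "positive" if r > 0 else "negative"
                edges.append((gene, microbe, sign, r))
    return edges


# Hypothetical measurements across five samples (illustration only).
toy_genes = {"gene_X": [1, 2, 3, 4, 5], "gene_Y": [3, 1, 4, 1, 5]}
toy_microbes = {"taxon_P": [2, 4, 6, 8, 10], "taxon_Q": [10, 8, 6, 4, 2]}
for gene, microbe, sign, r in cross_talk_edges(toy_genes, toy_microbes):
    print(gene, "--", microbe, sign, round(r, 2))
```

In the figure's colour scheme, a "positive" edge would be drawn in yellow and a "negative" edge in black.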
Prepared by A. Morgun
Future and Wishes
§ Microbial genomics, with its rapid advances in the past two decades, has a bright future and is helping us to understand the world’s most dominant life forms better!
§ With an increasing number of genomes available, tools for comparative microbial genomics and a good comparative genome browser capable of handling hundreds of incomplete genomes will be very useful
§ Better statistical tools to integrate the data and to help interpret the results are also needed
§ “World Peace” – combating microbes with broad-spectrum antibiotics = last resort and often counter-productive (we need our microbiota for health)
§ Many diseases and health issues have polymicrobial origins, and pan-genomics and metagenomics can help us solve these mysteries
§ Combining different data types is key to interpreting genomic data
Acknowledgements
§ Claire Fraser-Liggett
§ Patrick Tang
§ Art Delcher
§ Judy Isaac-Renton
§ Elliott Drábek
§ Fiona Brinkman
§ Zhenqiu Liu
§ Natalie Prystajecky
§ Cheron Jones
§ Miguel Uyaguari
§ Brandi Cantarel
§ Jennifer Gardy
§ Institute for Genome Sciences (sequencing, annotation)
§ Michael Chan
§ Ashraf Fouad
§ Stephen Pleasance
§ Andrey Morgun
§ Natalia Shulzhenko
§ Jeffrey Gordon (and his lab)
Outline
§ Progression from microbial genomics to pangenomics to metagenomics
§ Bioinformatics tools used for these analyses
§ My own projects and HMPs as examples
  § Health
  § Diseases
§ Tool developments
  § Database management
  § Assemblers
  § Classifier
§ Future of the field and wish list for tools