Microbial Genomics, Pan-Genomics, and Metagenomics in Disease and in Health William Hsiao BCCDC Public Health Labs [email protected]
Talk dedicated to Francis Ouellette and all the VanBUG organizers/volunteers, current and past!

Microbes Germs

Microbes – learn to love them
§ Microbes harbour much higher genetic diversity than eukaryotic organisms
§ Less than 0.5% of microbial species have been identified – huge potential for discovery of new genes and new functions
§ Most microbes (>99.9%) do not cause disease in humans
§ Microbes can be engineered to act as little factories for energy production, drug production, immune system boosting (probiotics), pollution clean-up, and environmental sensing

The Bacterial/Archaeal Genome
§ Typically contained within a single large, circular chromosome (some are linear)
§ Haploid genomes
§ May contain plasmids (extrachromosomal DNA)
§ No introns in the genes
§ Genome sizes range from 0.5 Mb to ~10 Mb (average is about 3-5 Mb, containing about 3000-5000 genes)
§ Much easier than eukaryotic genomes to assemble and to annotate
§ The first free-living organism sequenced was a bacterium – Haemophilus influenzae, in 1995

Shotgun Sequencing – 90s style
[Figure; label: contigs] Fraser et al 2000, Nature

Current Stats on Published Bacterial Genomes
§ Around 3000 published genomes in 17 years (thousands more sequenced)
[Chart: number of published microbial genomes per year, 1995-2011; y-axis: number of published genomes, 0-1200]

DNA Sequencing Technologies
ER Mardis.
Nature 470, 198-203 (2011) doi:10.1038/nature09796
$100M for the first human genome
$10K per human genome or $10 per bacterial genome

Computing Improvements are “slower”
Cluster Computing
Cloud Computing

Next Generation Shotgun Sequencing

Most microbial genomes are not finished anymore
http://www.genomesonline.org/

Improving Assembly – paired-end and optical map
§ With high-depth coverage from next generation sequencers, the gaps in unfinished genomes are usually due to unresolved repeats. By incorporating long-range information, we can order the contigs better and close the gaps
§ One of my post-doc projects involved sequencing several bacterial genomes. Since we got back incomplete genomes with a few hundred contigs, we explored other ways to improve the genome assembly
§ We decided to use optical maps (high-density, whole-genome restriction maps, aka fingerprints) to help us assemble the genomes

Theodore Assembler
Hsiao et al, in preparation

Improved Assembly Results
Strain  Assembly method  # of contigs  Total bases placed  N50
PBT16   Theodore         30            6661776             6566608
PBT21   Theodore         51            6927391             6739126
PBT91   Theodore         58            6927423             6738230
PBT16   Newbler          133           6564894             147681
PBT21   Newbler          204           6749359             124004
PBT91   Newbler          160           6900498             172612

Automated Genome Annotation
§ Several systems are available to the public, each with sophisticated approaches to assign functions to predicted genes/proteins
§ BASys (http://basys.ca)
§ Prok-annotation pipeline (http://ae.igs.umaryland.edu/cgi/intro_info.cgi)
§ IMG-ER (https://img.jgi.doe.gov/cgi-bin/er/main.cgi)
§ RAST (http://rast.nmpdr.org/rast.cgi)
§ Most of the systems run on large computer clusters and take less than a day to annotate a genome

BASys Annotation Overview
[Figure labels: Contigs; Protein-encoding genes; Non-protein-encoding genes (rRNA, tRNA, others); Regional annotation; Functional annotation; Intergenic scan; Automated annotation; Manual annotation; Annotated genome. Callout: "Extremely time consuming!"]
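The N50 column in the Improved Assembly Results table above is a contiguity statistic. A minimal sketch of how it is usually computed follows, assuming the common definition (the length of the contig at which the running total, taken from longest to shortest, first reaches half of the assembled bases). The function name `n50` and the toy contig lengths are illustrative assumptions, not data from the talk or the Theodore assembler.

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths (common definition)."""
    if not contig_lengths:
        raise ValueError("need at least one contig length")
    total = sum(contig_lengths)
    running = 0
    # Walk contigs from longest to shortest and stop once the running
    # total covers at least half of all assembled bases.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Hypothetical contig lengths, for illustration only.
toy_contigs = [2, 3, 4, 5, 6, 7, 8, 9, 10]
print(n50(toy_contigs))  # 8
```

A higher N50 at a similar total size means the same bases sit in fewer, longer pieces, which is the pattern the Theodore rows show relative to the Newbler rows.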
Van Domselaar et al NAR 2005

Genome Projects – then and now
Conditions for one genome  Then                                          Now
Sequence time              Months to a year to sequence one genome       Days to sequence several genomes
Cost of sequencing         $10,000 – $100,000                            $10-100
Annotation time            A year of manual curation by multiple people  Automated annotation + spot inspection
Finish status              Mostly complete                               Fragmented
Publication                Nature + Science                              SIGS

Comparative Genomics
§ First comparative genomics paper published in 1999
§ 2 Helicobacter pylori genomes isolated 7 years apart were compared
§ Found that more than half of the strain-specific genes are clustered in hypervariable regions
§ This observation was soon made consistently in many other species
Alm et al, Nature 1999

Tools to detect Genomic Islands
§ In Fiona’s Lab, we developed several tools to aid the identification of genomic islands (genomic regions that are likely to have been horizontally acquired from another species)
§ IslandPath – based on DNA signatures of the genomes and other features associated with islands (Hsiao et al Bioinformatics, 2003)
§ IslandPick – based on comparative genomics (Langille et al BMC Bioinformatics, 2008)
§ IslandViewer – integrated approach to identify and view genomic islands (Langille et al Bioinformatics, 2009)

Chart: Proportions of Genes with no COG Assignment in Islands vs. Outside (y-axis: % with no assignment, 0.00%-70.00%)
More novel genes inside of islands
Organisms: Bacillus subtilis 168, Borrelia burgdorferi B31, Buchnera sp.
APS, Chlamydia trachomatis D, Clostridium acetobutylicum ATCC824, Escherichia coli K12, Escherichia coli O157, Haemophilus influenzae Rd-KW20, Helicobacter pylori 26695, Listeria innocua Clip11262, Mycobacterium leprae, Mycobacterium tuberculosis CDC1551, Mycoplasma pneumoniae M129, Neisseria meningitidis MC58, Pseudomonas aeruginosa PAO1, Salmonella typhimurium LT2, Staphylococcus aureus N315, Streptococcus pneumoniae TIGR4, Sulfolobus solfataricus, Vibrio cholerae chromosome I, Vibrio cholerae chromosome II, Yersinia pestis CO92
Legend: ISLAND, OUTSIDE
Paired t-test P value: 1.27E-18
Hsiao et al, PLoS Genetics, Nov. 2005, e62

Pan-genomes
§ Comparative genomics and the study of gene gain and gene loss in microbes led to the idea of pan-genomes
§ The term was first coined in 2005 in a paper by Tettelin et al., in which they compared sequenced genomes from six S. agalactiae strains.
§ A pan-genome consists of the core (shared) genes of a species + its strain-specific (dispensable) genes
§ Pan-genome calculation extrapolates from observations on a limited number of strains to estimate the theoretical number of genomes required to fully capture the pan-genome of a species

Open vs. Closed pan-genome

SNP-phylogeny for very closely related genomes
§ For very closely related isolates or very slowly evolving species, sometimes there is very little gene gain and gene loss.
§ In these cases, SNPs detected by aligning these genomes can be used as the basis for comparison and for phylogenetic reconstruction of the evolutionary history of the species
§ Whole-genome SNPs and a social network questionnaire were used to reconstruct a TB outbreak in BC

Pangenome + Metadata!
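The SNP-based comparison described above for very closely related isolates can be sketched as a pairwise count of differing positions in a whole-genome alignment. This is a minimal illustration under stated assumptions: the sequences are already aligned to equal length, gaps and ambiguous bases (`-`, `N`) are skipped, and the isolate names and sequences are hypothetical toy data, not from the TB study. The function names are likewise illustrative.

```python
from itertools import combinations


def snp_distance(seq_a, seq_b, ignore="N-"):
    """Count aligned positions where two sequences carry different bases."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(
        1
        for a, b in zip(seq_a.upper(), seq_b.upper())
        # Skip gaps and ambiguous calls so they are not counted as SNPs.
        if a != b and a not in ignore and b not in ignore
    )


def pairwise_snp_distances(alignment):
    """Return {(name_x, name_y): SNP count} for every pair of isolates."""
    return {
        (x, y): snp_distance(alignment[x], alignment[y])
        for x, y in combinations(sorted(alignment), 2)
    }


# Hypothetical aligned sequences, for illustration only.
toy_alignment = {
    "isolate_1": "ACGTACGT",
    "isolate_2": "ACGTACGA",
    "isolate_3": "ACCTACGA",
}
distances = pairwise_snp_distances(toy_alignment)
for pair, count in distances.items():
    print(pair, count)
```

A distance matrix like this is one possible input to tree-building methods; on its own it says nothing about who infected whom, which is where the questionnaire metadata comes in.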
§ A TB outbreak occurred in a BC community over a 3-year period
§ A molecular marker suggested that the outbreak was clonal, but traditional contact tracing couldn’t identify a source
§ Whole genome sequencing and social network questionnaires (including location information) provided higher-resolution data, allowing reconstruction of a likely scenario for the outbreak events.
§ Further epidemiological investigation pointed to increased crack cocaine usage (common locations) in the community
Gardy, Johnston, Ho Sui et al NEJM 2011

Putative Transmission Networks

Pangenome + Metadata!
§ This paper really demonstrated the power of whole genome sequencing
§ But it was the availability of the metadata (disease conditions, locations, contacts, dates, etc.) that facilitated the interpretation of the whole genome data

Biodiversity
§ In a recent global ocean survey study, ~4000 novel protein families were detected, a significant addition to the ~13,000 known protein families (Yooseph et al, PLoS Biology, 03/2007)
§ Sampling the human gut, >3 million non-redundant bacterial genes and >1000 prevalent species were identified (Qin et al, Nature, 03/2010)
§ In environmental surveys to date, 30%-70% of the genes identified in the samples are novel
§ >90% of all genetic diversity comes from non-eukaryotic organisms
§ How can we begin to study this diversity and identify important microorganisms?

What is Metagenomics?
§ Meta = beyond
§ Coined by Jo Handelsman (environmental microbiologist) in 1998
§ Has taken on a more precise definition: studies that analyze genetic material from a mixed population living in the same environment
§ Who’s there? What do they do?
§ How do they interact with each other and with the environment?

Typical Experimental Protocols
Samples from environment or hosts
Enriched for microbes
Extract DNA or RNA from mixed population (no culturing & cloning!)
Targeted Sequencing
• Use PCR primers to target specific regions of the genome
• E.g. 16S rRNA, capsid, 18S
• Able to sequence deeper and broader
• No metabolic functional information
• Good for finding out “Who’s there”

Shotgun Sequencing
• Randomly sequence all the DNA in the sample (RNA is reverse transcribed first)
• Obtain functional information
• Don’t know the exact host of each gene
• Good for finding out “What is the community doing”

Taxonomic Binning
§ After obtaining the 16S or other amplicon sequences, taxonomic binning based on sequence similarity or on k-mer frequency similarity is carried out to assign each read to a taxon
§ Alternatively, reads are clustered to form OTUs (operational taxonomic units), since many reads cannot be assigned to a taxon
§ In the end, we obtain a matrix of count data associated with each taxon/OTU

Taxon     E. coli  OTU 1  B. theta  P. aeruginosa
Sample 1  5        8      77        23
Sample 2  11       34     3         12

International Human Microbiome Consortium
§ International efforts to characterize the bacteria associated