Pathogenomics
From patient to bioinformatician
Dieter Bulach Torsten Seemann
M.Sc(Bioinf) - University of Melbourne - Tue 1 May 2012
The "rules"
● Conversation, not lecture
● Ask questions at any time
● Activities and quizzes are interspersed. These have yellow background like this slide.
● Please turn your phones to silent.
● Let's start!
Overview
● Medical issue ○ sample collection from patient ○ sample preparation
● Genome sequencing ○ experimental design
● Bioinformatics ○ identify SNPs - read mapping, to reference ■ phylogenomic tree ○ identify novel DNA - de novo assembly, no reference ■ annotation
● Biological interpretation
What is a pathogen?
● infectious agent or "germ"
● microbe that causes disease in its host ○ organism ■ virus, bacterium, fungus, protozoa, parasite ■ most are harmless or even beneficial ○ host ■ human, animal, plant ○ transmission ■ any "hole" in you - inhalation, ingestion, wound ○ virulence ■ how bad, how quick, mortality
What type of pathogen are these?
HIV Malaria Golden Staph
Powdery mildew Black death (Plague) Glandular Fever
How do they work?
● Adhesion ○ bind to host cell surface - interferes with normal processes
● Colonization ○ take over parts of the body - upsets processes
● Invasion ○ produce proteins to disrupt host cells, allow entry
● Immunosuppression ○ for example, produce proteins to bind to antibodies
● Toxins ○ proteins/metabolites that are poisonous to the host
Patient scenario
● Hospital patient with indwelling catheter ○ risk of pathogens entering the bloodstream ○ this is not normal, and is called septicemia ○ sepsis is the whole body inflammatory response to it
● Need to defeat the pathogen ○ most likely bacterial in this case ○ need to identify the bacterium and characteristics of the bacterium ■ antibiotic resistance profile eg. MRSA, VRE ■ might even want to know where it came from
Sample collection
● Take patient blood ○ send to pathology
● Centrifuge ○ slow spin to remove human cells ○ fast spin to pellet bacterial cells
● Streak onto agar media ○ first emulsify the pellet to make it spreadable ○ grow for 24 hours, likely to be monoculture
Traditional Microbiology
● Phenotype based: ○ look at cells under microscope ○ Gram staining - cell walls ○ biochemical tests ==> identification of the bacterium ■ genus and species
● PCR based testing: ○ 16S ribosomal RNA ■ common to all bacteria, differs slightly per strain ■ identify genus >90%, species >65%, unknown 10% ○ Multi-locus sequence typing (MLST) ■ sequence ~8 conserved genes ■ each strain/genus has its own MLST pattern ==> faster but limited - need prior knowledge
WGS for diagnostics
● Whole Genome Sequencing ○ fast and no prerequisite knowledge about the pathogen ○ Microbiologist won't be superseded!! ■ just different tools ■ sequence data set: will still do all the 'tests' to identify and profile
Purify DNA
● DNA extraction kit ○ lyse cells and digest (proteinase K) ○ centrifuge to remove cell debris ○ pass lysate through column ■ DNA sticks to a DNA binding matrix ○ wash bound DNA ○ lower salt concentration - release bound DNA ○ precipitate: dubiously familiar stringy white pellet ■ salt and ethanol
● Extract DNA from strawberries at home! ○ detergent - breaks cells (octoploid genome) ○ strainer/pantyhose - remove particulate matter ○ salt - aids DNA precipitation ○ alcohol - precipitates DNA, keeps rest in solution
Library preparation
● Enough DNA? ○ each technology requires different amounts
● Library type ○ shotgun, short paired, or long paired reads? ○ different construction methods eg. circularization
● Size selection ○ nebulization, sonication, enzymatic methods ○ run on gel + scalpel, or fancier methods
● Amplification ○ lots of DNA : use PCR methods eg. emulsion PCR ○ little DNA : multiple displacement amplification ■ random hexamers, high fidelity polymerase ■ whole genome amplification for single-cell seq
High throughput DNA-Seq
● Lots of technologies on the market ○ 454, Illumina, SOLiD, Ion Torrent, PacBio
● Each has its ups and downs ○ speed, yield, length, price, quality, labour, reliability
● Technology trend
○ Illumina is currently the best choice
○ Most mature technology
○ Produces direct "base space" ie. A,G,T,C
○ Easiest data to work with
Current technology
Method       Length (now)  Length (soon)  Paired end?    Mate pairs?     Quality  Yield  "Space"
Illumina     150           250            Yes (→800bp)   Yes (→3kb)      +++++    ++++   base
454          500           900            No             Yes (→8kb)      +++      ++     flow
SOLiD        75            75             No             Yes (~4kb)      +++      +++++  colour
Ion Torrent  100           200            Testing        Testing (~4kb)  ++       +++    flow
PacBio       2000          6000+          No             No              +        +      base?
Read types
● Single end, "shotgun" ○ ===>------ ○ sequence from one end of a fragment
● Paired end ○ ==>------<== ○ sequence from both ends of the same fragment ○ space between mates is the "insert size" (< 800bp)
● Mate pair ○ ==>--~~~~--<== ○ sequence both ends of a pseudo-fragment ○ this allows us to use longer insert sizes (> 800bp)
Read "spaces"
● Example read ○ ACTGGGTCC
● Base space ○ get native bases: A,C,T,G,G,G,T,C,C
● Flow space ○ get base flows: A*1, C*1, T*1, G*3, T*1, C*2 ○ mis-count when n > 3 (homopolymers)
● Colour space ○ get di-base encoding: T:X,X,X,X,X,X,X ○ theoretically useful, but messy overall
Read filtering
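As a rough illustration of the flow-space encoding on the Read "spaces" slide, the sketch below collapses a base-space read into (base, run length) pairs. It is a simplification: real flow-space data comes from a fixed cycle of nucleotide flows and includes zero-length flows, which are ignored here. The function name to_flow_space is made up for this example.

```python
from itertools import groupby


def to_flow_space(read):
    """Collapse a base-space read into (base, run_length) pairs."""
    return [(base, len(list(run))) for base, run in groupby(read)]


flows = to_flow_space("ACTGGGTCC")
# Same notation as the slide: base*count
print(", ".join(f"{base}*{n}" for base, n in flows))
# prints: A*1, C*1, T*1, G*3, T*1, C*2
```

Long runs (the G*3 here, or worse) are exactly where a flow-based instrument has to estimate n from signal intensity, which is why homopolymers get mis-counted.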
● Sequencing is a multi-step process ○ most steps are biological - so there will be errors!
● Bacterial genome sequencing ○ usually excess sequence, can afford to discard
● Why filter? ○ reduce size of data set to deal with ■ need less RAM and CPU ○ improve average read quality ■ better results
What to filter on
● Phred base qualities ○ Q<20 still means >1% error!
● ambiguous bases ie. "N" ○ these should have low Q scores anyway
● reads that are too short ○ too ambiguous to map, too short to assemble
● widowed reads ○ reads that, after filtering, no longer have a mate
Sequenced it, now what?
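The criteria from the "What to filter on" slide can be sketched in a few lines. This is only an illustration under stated assumptions: reads are plain (sequence, list of Phred scores) tuples, the quality test uses the mean score, and the thresholds (mean Q of 20, minimum length 30) are arbitrary choices, not recommendations. All function names are hypothetical.

```python
from statistics import mean


def phred_to_error(q):
    """Error probability implied by Phred score q: P = 10^(-q/10)."""
    return 10 ** (-q / 10)


def keep_read(seq, quals, min_mean_q=20, min_len=30):
    """Apply the per-read filters; thresholds are illustrative assumptions."""
    if len(seq) < min_len:            # too short to map or assemble
        return False
    if "N" in seq.upper():            # ambiguous base call
        return False
    return mean(quals) >= min_mean_q  # average Phred quality


def filter_pairs(pairs, **thresholds):
    """pairs: list of ((seq1, quals1), (seq2, quals2)) tuples.

    Returns (pairs where both mates pass, widowed reads whose mate failed).
    """
    paired, widowed = [], []
    for r1, r2 in pairs:
        ok1 = keep_read(*r1, **thresholds)
        ok2 = keep_read(*r2, **thresholds)
        if ok1 and ok2:
            paired.append((r1, r2))
        elif ok1:
            widowed.append(r1)
        elif ok2:
            widowed.append(r2)
    return paired, widowed
```

Widowed reads are returned separately so they can still be used as single-end reads rather than thrown away.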
● How is it different? ○ compare to known closely related "reference" strain
● Types of differences ○ deleted DNA - in reference, not in ours ○ duplicated DNA - extra copies in ours ○ novel DNA - in ours, not in reference ○ SNPs - single nucleotide polymorphisms (1bp subst) ○ indels - short insertions or deletions (usually 1bp) ■ sometimes indels fall under "SNPs" banner ○ structural variation - rearrangements, inversions ■ small scale, large scale
Read mapping - large scale
[Figure: read coverage across the reference; labels: x1 coverage, Conserved, Deleted, x3, x2, Conserved]
Read mapping - medium scale
Read mapping - small scale
[Figure: labels: Reference sequence, Depth, Errors]
Are we seeing everything?
● Hmm, some of our reads didn't map ○ sequencing artifacts (some) ○ contamination (maybe - RA sneezed into sample?) ○ DNA in our sample but not in reference (yes) ■ need to de novo assemble
● Other comparisons more difficult ○ structural change, rearrangements ■ read length & insert size are limiting factors ○ read mapping is not the answer to every question ■ particularly with non-model organisms
De novo genome assembly
● De novo ○ Latin - "from the beginning", "afresh", "anew" ○ Without reference to any other genomes
● "Genome assembly is impossible." ○ A/Prof. Mihai Pop - leading assembly researcher!
Assembling bacteria
● Genomes ○ DNA, single organism, ~1 sequence
● Transcriptomes ○ RNA (cDNA), single organism, ~4000 sequences
● Meta-genomes ○ DNA, mix of organisms, >10 sequences ○ eg. human gut microbiome, oral cavity
● Meta-transcriptomes ○ RNA (cDNA), mix of organisms, >40000 sequences!
Types of assemblers
● Greedy ○ find two best matching reads, join them, iterate
● Overlap-Layout-Consensus ○ collate all overlaps into a graph and find a path
● de Bruijn graph (pronounced "brown") ○ break reads into k-mers, overlaps are exact (100% identity) over k-1 bases
● String graph ○ represents all that is inferable from the reads ○ encompasses OLC and DBGs
Assembly algorithm
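To make the de Bruijn idea from "Types of assemblers" concrete, here is a minimal sketch. It uses one common formulation, assumed here for simplicity: nodes are (k-1)-mers and every k-mer seen in a read contributes an edge from its prefix to its suffix. It ignores reverse complements, k-mer counts and sequencing errors, all of which a real assembler has to handle. The name de_bruijn_graph is hypothetical.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Nodes are (k-1)-mers; each k-mer adds an edge prefix -> suffix."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


graph = de_bruijn_graph(["ACTGGGTCC"], k=4)
for node in sorted(graph):
    print(node, "->", sorted(graph[node]))
```

A node with more than one outgoing edge is a branch point, which is how repeats show up in this representation.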
● Find all overlaps between all reads ○ naively this is O(N^2) for N reads, but good heuristics exist ○ parameters are: min. overlap, min %identity
● Build a graph from these overlaps ○ nodes/arcs <=> reads/overlaps <=> vertices/edges
● Simplify the graph ○ because real-world reads have errors
● Trace a single path through the graph ○ Read off the consensus of bases as you go
The tyranny of repeats
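The steps on the "Assembly algorithm" slide (find overlaps, build a graph, trace a path) can be sketched as a toy. Assumptions made for brevity: overlaps must be exact (so the min %identity parameter is effectively 100%), reads are error-free and all on one strand, the graph-simplification step is skipped, and the path trace is a greedy walk from the single read that has no incoming overlap. All names are hypothetical.

```python
from itertools import permutations


def overlap(a, b, min_overlap):
    """Length of the longest suffix of a equal to a prefix of b (exact match)."""
    for n in range(min(len(a), len(b)), min_overlap - 1, -1):
        if a[-n:] == b[:n]:
            return n
    return 0


def overlap_graph(reads, min_overlap=3):
    """Naive all-against-all comparison: O(N^2) read pairs."""
    graph = {}
    for a, b in permutations(reads, 2):
        n = overlap(a, b, min_overlap)
        if n:
            graph[(a, b)] = n
    return graph


def trace_path(reads, graph):
    """Greedy walk; assumes exactly one read has no incoming overlap."""
    targets = {b for _, b in graph}
    current = next(r for r in reads if r not in targets)
    contig, seen = current, {current}
    while True:
        options = [(n, b) for (a, b), n in graph.items()
                   if a == current and b not in seen]
        if not options:
            return contig
        n, current = max(options)   # follow the longest overlap
        contig += current[n:]       # append only the non-overlapping tail
        seen.add(current)


reads = ["GGGTCC", "ACTGGG", "TCCAAT"]
print(trace_path(reads, overlap_graph(reads)))
# prints: ACTGGGTCCAAT
```

With a repeat in the reads, a node would have several equally good outgoing overlaps and the greedy choice could be wrong, which is the problem the repeat slides describe.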
● Assembler would output 7 contig sequences ○ path is broken at ambiguous decision points ○ read/pair length limits ability to resolve repeats
How many contigs will be produced?
More complex graph
[Figure: labels: Contigs, Connections]
Reality bites
Shared vertices are repeats.
Scaffolding
● Use paired reads to join contigs ○ pairs whose mates map to different contigs, in a consistent manner, suggest adjacency
● A difficult constraint problem ○ distance between mates ("insert size") variable ○ repeats cause ambiguous mate placement ○ some assemblers do it, separate scaffolders exist
Contig ordering
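The core evidence used in scaffolding, as described on the Scaffolding slide, is a count of read pairs whose mates land in different contigs. The sketch below shows only that counting step, assuming the mates have already been mapped and each pair is given as a (contig of mate 1, contig of mate 2) tuple. Orientation and insert-size checks, which make the real problem hard, are left out. The name contig_links and the min_links threshold are assumptions.

```python
from collections import Counter


def contig_links(mate_hits, min_links=2):
    """Count pairs whose mates map to different contigs.

    mate_hits: iterable of (contig_of_mate1, contig_of_mate2).
    Returns {(contig_a, contig_b): count} for links seen at least min_links times.
    """
    links = Counter()
    for c1, c2 in mate_hits:
        if c1 != c2:                          # same-contig pairs carry no join evidence
            links[tuple(sorted((c1, c2)))] += 1
    return {pair: n for pair, n in links.items() if n >= min_links}


hits = [("c1", "c2"), ("c2", "c1"), ("c1", "c1"), ("c2", "c3")]
print(contig_links(hits))
# prints: {('c1', 'c2'): 2}
```

Requiring more than one supporting pair is a simple guard against chimeric pairs and mates placed ambiguously in repeats.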
● Optical maps ○ wet lab method, real experimental evidence ○ chromosome sized restriction site map
● Align to reference genome ○ fit contigs as well as possible against known reference ○ some contigs will fit if split (DNA rearrangement) ○ expect orphan contigs (novel DNA)
Genome closure
● Finished genome ○ one contig per replicon in original sample ○ bacterial chromosomes/plasmids usually circular
● Labour intensive ○ design primers around gaps, PCR, Sanger ○ Fosmid/BAC libraries for larger inconsistencies
● Why bother? ○ no close reference exists ○ ensures you didn't miss anything ○ understand whole genome architecture ○ simplifies all downstream analysis
Annotation
● Annotation is the process of identifying important features in a genome ○ gene - protein product, promoter, signal sequences ■ ~1000 per Mbp in bacteria, coding dense ○ tRNA - transfer RNA ■ ~30 per bacterium cover all codons (wobble base) ○ rRNA - ribosomal RNA locus ■ 1 to 9 per bacterium, fast vs slow growers ○ And many more... ■ small RNAs, ncRNA, binding sites, tx factors
Annotating proteins
● Homology vs. Similarity ○ homology means shared evolutionary ancestry, which usually implies a related biological function ○ we use sequence similarity as a proxy for homology ○ works well for most situations
● Sequence alignment methods ○ "Exact" - Needleman-Wunsch, Smith-Waterman ○ "Approx" - BLAST, FASTA, and many others! ○ Database is sequences: nr, RefSeq, UniProt
● Sequence profile methods ○ Build an HMM (model) of aligned sequence families ○ HMMER - scan profiles against query protein seq. ○ Database is profiles: Pfam, TIGRFAMs, FIGfams
Curation
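For the "exact" alignment methods named on the Annotating proteins slide, a compact sketch of the Smith-Waterman local alignment score may help. The scoring scheme here (match +2, mismatch -1, linear gap -2) is an assumption chosen to keep the example small. Protein searches normally use a substitution matrix such as BLOSUM62 and affine gap penalties instead. Only the best score is returned, not the alignment itself.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between a and b (dynamic programming)."""
    prev = [0] * (len(b) + 1)   # previous row of the score matrix
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero
            cur.append(max(0, diag, prev[j] + gap, cur[j - 1] + gap))
            best = max(best, cur[j])
        prev = cur
    return best


print(smith_waterman_score("TTACGTTT", "GGACGGG"))
# prints: 6  (the shared ACG, three matches at +2 each)
```

The cost is proportional to len(a) * len(b), which is why heuristic tools such as BLAST are used for database-scale searches.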
● Automatic annotation ○ more quality databases and models now ○ but still flawed
● Manual curation ○ Essential for a quality annotation ○ Find pseudo, missing, bogus, and broken genes ○ Discover mis-assemblies ○ Correct mis-annotated protein families ○ Fix incorrect start codons ■ Bacteria have 3-5 start codons, not just AUG (M)
Practical Exercise
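The Curation point about start codons can be illustrated with a naive open reading frame scan. This is a sketch under several assumptions: only the forward strand is scanned, the sequence is DNA, the alternative starts are limited to GTG and TTG alongside ATG, and the minimum length of 30 codons is arbitrary. It is not how gene finders or Artemis work internally, and find_orfs is a hypothetical name.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, starts=("ATG", "GTG", "TTG"), min_codons=30):
    """Return (start, end) of forward-strand ORFs, 0-based, end exclusive.

    Each ORF runs from the first start codon in a frame to the next
    in-frame stop codon (the stop is included in the interval).
    """
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon in starts:
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return orfs


seq = "GTGGCTGCTGCTTAA"
print(find_orfs(seq, min_codons=3))                    # GTG accepted as a start
print(find_orfs(seq, starts=("ATG",), min_codons=3))   # ATG only: ORF is missed
```

Comparing the two calls shows how an ATG-only search can miss a gene or pick a start that is too far downstream, which is one of the things manual curation has to catch.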
Go to URL: dna.med.monash.edu.au/~torsten/tmp/msc
Install: Artemis and MEGA
Keep this URL open in a tab.
Wait for further instructions.
Annotation
● Start Artemis ○ open Hendra1994.fa
● What is it? ○ Hendra virus - 18 kb viral genome ○ single-stranded negative-sense RNA (not DNA!!) ○ has 6 protein coding regions ("genes")
● Task ○ find these genes using Artemis ○ use similarity searching to assign a name to the gene
The official annotation
● In Artemis ○ download and open Hendra1994.gbk
● Task ○ compare to your annotations
● What did you find? ○ methionine (M) start codon (ATG)
DNA vs Protein Similarity
● Examine relationships between Paramyxoviridae ○ includes Hendravirus already in Artemis
● Open BLAST: http://blast.ncbi.nlm.nih.gov/
● For the Hendravirus: ○ use blastn to search nr database for sequences related to the L gene (DNA) ○ use blastp to search nr database for sequences related to the L protein (amino acids) ○ Any observations??
Phylogeny
● Start MEGA ○ download L_para.fas ○ multifasta with "L" proteins from 37 similar viruses
● Task: ○ Load L_para.fas ○ Align sequences (using MUSCLE) ○ Infer tree (minimum evolution method) ○ Examine relationships
Viral strain comparison
● Hendravirus ○ 11 complete genome sequences ○ different hosts (bats, horses) and times (1994 onwards)
● Task ○ Load hendra11.meg into MEGA ○ multiple alignment already done - examine SNPs ○ What is the impact of the nucleotide differences? ■ look at one SNP in detail ■ use Artemis to see if the SNP is in a gene ■ does the SNP change the encoded protein?
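The last question in the task, whether a SNP changes the encoded protein, comes down to translating the affected codon before and after the change. The sketch below does that with the standard genetic code. It assumes the input is the coding sequence as DNA, in frame and on the coding strand, with a 0-based SNP position; for a negative-sense RNA virus such as Hendra this means working from the mRNA-sense sequence. The function name snp_effect is hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def snp_effect(cds, pos, alt):
    """Classify a single-base substitution at 0-based position pos in cds."""
    start = pos - pos % 3                       # first base of the affected codon
    ref_codon = cds[start:start + 3]
    offset = pos % 3
    alt_codon = ref_codon[:offset] + alt + ref_codon[offset + 1:]
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        kind = "synonymous"
    elif alt_aa == "*":
        kind = "nonsense"
    else:
        kind = "missense"
    return ref_aa, alt_aa, kind


print(snp_effect("ATGGCTTAA", 5, "C"))   # GCT -> GCC
# prints: ('A', 'A', 'synonymous')
print(snp_effect("ATGGCTTAA", 3, "A"))   # GCT -> ACT
# prints: ('A', 'T', 'missense')
```

Third-position changes are often synonymous, so many of the SNPs seen between the 11 genomes may leave the protein unchanged.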