Pathogenomics

From patient to bioinformatician
Dieter Bulach, Torsten Seemann
M.Sc(Bioinf) - University of Melbourne - Tue 1 May 2012

The "rules"
● Conversation, not lecture
● Ask questions at any time
● Activities and quizzes are interspersed. These have a yellow background like this slide.
● Please turn your phones to silent.
● Let's start!

Overview
● Medical issue
  ○ sample collection from patient
  ○ sample preparation
● Genome sequencing
  ○ experimental design
● Bioinformatics
  ○ identify SNPs - read mapping, to reference
    ■ phylogenomic tree
  ○ identify novel DNA - de novo assembly, no reference
    ■ annotation
● Biological interpretation

What is a pathogen?
● infectious agent or "germ"
● microbe that causes disease in its host
  ○ organism
    ■ virus, bacterium, fungus, protozoa, parasite
    ■ most microbes are harmless or even beneficial
  ○ host
    ■ human, animal, plant
  ○ transmission
    ■ any "hole" in you - inhalation, ingestion, wound
  ○ virulence
    ■ how bad, how quick, mortality

What type of pathogen are these?
● HIV
● Malaria
● Golden Staph
● Powdery mildew
● Black death (Plague)
● Glandular Fever

How do they work?
● Adhesion
  ○ bind to host cell surface - interferes with normal processes
● Colonization
  ○ take over parts of the body - upsets processes
● Invasion
  ○ produce proteins to disrupt host cells, allow entry
● Immunosuppression
  ○ for example, produce proteins to bind to antibodies
● Toxins
  ○ proteins/metabolites that are poisonous to the host

Patient scenario
● Hospital patient with indwelling catheter
  ○ risk of pathogens entering the bloodstream
  ○ this is not normal, and is called septicemia
  ○ sepsis is the whole-body inflammatory response to it
● Need to defeat the pathogen
  ○ most likely bacterial in this case
  ○ need to identify the bacterium and its characteristics
    ■ antibiotic resistance profile, e.g. MRSA, VRE
    ■ might even want to know where it came from

Sample collection
● Take patient blood
  ○ send to pathology
● Centrifuge
  ○ slow spin to remove human cells
  ○ fast spin to pellet bacterial cells
● Streak onto agar media
  ○ first emulsify the pellet to make it spreadable
  ○ grow for 24 hours, likely to be monoculture

Traditional Microbiology
● Phenotype based:
  ○ look at cells under microscope
  ○ Gram staining - cell walls
  ○ biochemical tests
  ==> identification of the bacterium
    ■ genus and species
● PCR based testing:
  ○ 16S ribosomal RNA
    ■ common to all bacteria, differs slightly per strain
    ■ identify genus >90%, species >65%, unknown 10%
  ○ Multi-locus sequence typing (MLST)
    ■ sequence ~8 conserved genes
    ■ each strain/genus has its own MLST pattern
  ==> faster but limited - need prior knowledge

WGS for diagnostics
● Whole Genome Sequencing
  ○ fast, and needs no prior knowledge about the pathogen
  ○ Microbiologist won't be superseded!!
    ■ just different tools
    ■ sequence data set: will still do all the 'tests' to identify and profile

Purify DNA
● DNA extraction kit
  ○ lyse cells and digest (proteinase K)
  ○ centrifuge to remove cell debris
  ○ pass lysate through column
    ■ DNA sticks to a DNA binding matrix
  ○ wash bound DNA
  ○ lower salt concentration - release bound DNA
  ○ precipitate: dubiously familiar stringy white pellet
    ■ salt and ethanol
● Extract DNA from strawberries at home!
  ○ detergent - breaks cells (octoploid genome)
  ○ strainer/pantyhose - remove particulate matter
  ○ salt - aids DNA precipitation
  ○ alcohol - precipitates DNA, keeps rest in solution

Library preparation
● Enough DNA?
  ○ each technology requires different amounts
● Library type
  ○ shotgun, short paired, or long paired reads?
  ○ different construction methods, e.g. circularization
● Size selection
  ○ nebulization, sonication, enzymatic methods
  ○ run on gel + scalpel, or fancier methods
● Amplification
  ○ lots of DNA : use PCR methods, e.g. emulsion PCR
  ○ little DNA : multiple displacement amplification
    ■ random hexamers, high fidelity polymerase
    ■ whole genome amplification for single-cell seq

High throughput DNA-Seq
● Lots of technologies on the market
  ○ 454, Illumina, SOLiD, Ion Torrent, PacBio
● Each has its ups and downs
  ○ speed, yield, length, price, quality, labour, reliability
● Technology trend
  ○ Illumina is currently the best choice
  ○ Most mature technology
  ○ Produces direct "base space", i.e. A,G,T,C
  ○ Easiest data to work with

Current technology
Method       Length (now)  Length (soon)  Paired end?   Mate pairs?     Quality  Yield  "Space"
Illumina     150           250            Yes (→800bp)  Yes (→3kb)      +++++    ++++   base
454          500           900            No            Yes (→8kb)      +++      ++     flow
SOLiD        75            75             No            Yes (~4kb)      +++      +++++  colour
Ion Torrent  100           200            Testing       Testing (~4kb)  ++       +++    flow
PacBio       2000          6000+          No            No              +        +      base?

Read types
● Single end, "shotgun"
  ○ ===>---------
  ○ sequence from one end of a fragment
● Paired end
  ○ ==>--------<==
  ○ sequence from both ends of the same fragment
  ○ space between mates is the "insert size" (< 800bp)
● Mate pair
  ○ ==>--~~~~--<==
  ○ sequence both ends of a pseudo-fragment
  ○ this allows us to use longer insert sizes (> 800bp)

Read "spaces"
● Example read
  ○ ACTGGGTCC
● Base space
  ○ get native bases: A,C,T,G,G,G,T,C,C
● Flow space
  ○ get base flows: A*1, C*1, T*1, G*3, T*1, C*2
  ○ mis-count when n > 3 (homopolymers)
● Colour space
  ○ get di-base encoding: T:X,X,X,X,X,X,X
  ○ theoretically useful, but messy overall

Read filtering
● Sequencing is a multi-step process
  ○ most steps are biological - so there will be errors!
● Bacterial genome sequencing
  ○ usually excess sequence, can afford to discard
● Why filter?
  ○ reduce size of data set to deal with
    ■ need less RAM and CPU
  ○ improve average read quality
    ■ better results

What to filter on
● Phred base qualities
  ○ Q<20 still means >1% error!
● ambiguous bases, i.e. "N"
  ○ these should have low Q scores anyway
● reads that are too short
  ○ too ambiguous to map, too short to assemble
● widowed reads
  ○ reads that, after filtering, no longer have a mate

Sequenced it, now what?
● How is it different?
  ○ compare to known closely related "reference" strain
● Types of differences
  ○ deleted DNA - in reference, not in ours
  ○ duplicated DNA - extra copies in ours
  ○ novel DNA - in ours, not in reference
  ○ SNPs - single nucleotide polymorphisms (1bp subst)
  ○ indels - short insertions or deletions (usually 1bp)
    ■ sometimes indels fall under "SNPs" banner
  ○ structural variation - rearrangements, inversions
    ■ small scale, large scale

Read mapping - large scale
[Figure: coverage along the reference; labels: x1, x2, x3 coverage, Conserved, Deleted]

Read mapping - medium scale
[Figure]

Read mapping - small scale
[Figure: labels: Reference sequence, Depth, Errors]

Are we seeing everything?
● Hmm, some of our reads didn't map
  ○ sequencing artifacts (some)
  ○ contamination (maybe - RA sneezed into sample?)
  ○ DNA in our sample but not in reference (yes)
    ■ need to de novo assemble
● Other comparisons more difficult
  ○ structural change, rearrangements
    ■ read length & insert size are limiting factors
  ○ read mapping is not the answer to every question
    ■ particularly with non-model organisms

De novo genome assembly
● De novo
  ○ Latin - "from the beginning", "afresh", "anew"
  ○ Without reference to any other genomes
● "Genome assembly is impossible."
  ○ A/Prof. Mihai Pop - leading assembly researcher!

Assembling bacteria
● Genomes
  ○ DNA, single organism, ~1 sequence
● Transcriptomes
  ○ RNA (cDNA), single organism, ~4000 sequences
● Meta-genomes
  ○ DNA, mix of organisms, >10 sequences
  ○ e.g. human gut microbiome, oral cavity
● Meta-transcriptomes
  ○ RNA (cDNA), mix of organisms, >40000 sequences!
Types of assemblers
● Greedy
  ○ find two best matching reads, join them, iterate
● Overlap-Layout-Consensus
  ○ collate all overlaps into a graph and find a path
● de Bruijn graph (pronounced "brown")
  ○ break reads into k-mers, overlap is 100% identity over k-1 bases
● String graph
  ○ represents all that is inferable from the reads
  ○ encompasses OLC and DBGs

Assembly algorithm
● Find all overlaps between all reads
  ○ naively this is O(N^2) for N reads, but good heuristics exist
  ○ parameters are: min. overlap, min. % identity
● Build a graph from these overlaps
  ○ nodes/arcs <=> reads/overlaps <=> vertices/edges
● Simplify the graph
  ○ because real-world reads have errors
● Trace a single path through the graph
  ○ read off the consensus of bases as you go

The tyranny of repeats
● Assembler would output 7 contig sequences
  ○ path is broken at ambiguous decision points
  ○ read/pair length limits ability to resolve repeats

How many contigs will be produced?
[Figure]

More complex graph
[Figure: labels: Contigs, Connections]

Reality bites
● Shared vertices are repeats.

Scaffolding
● Use paired reads to join contigs
  ○ read pairs whose mates consistently fall in two different contigs suggest those contigs are adjacent
● A difficult constraint problem
  ○ distance between mates ("insert size") is variable
  ○ repeats cause ambiguous mate placement
  ○ some assemblers do it, separate scaffolders exist

Contig ordering
● Optical maps
  ○ wet lab method, real experimental evidence
  ○ chromosome-sized restriction site map
● Align to reference genome
  ○ fit contigs as well as possible against known reference
  ○ some contigs will fit if split (DNA rearrangement)
  ○ expect orphan contigs (novel DNA)

Genome closure
● Finished genome
  ○ one contig per replicon in original sample
  ○ bacterial chromosomes/plasmids usually circular
● Labour intensive
  ○ design primers around gaps, PCR, Sanger
  ○ Fosmid/BAC libraries for larger inconsistencies
● Why bother?
  ○ no close reference exists
  ○ ensures you didn't miss anything
  ○ understand whole genome architecture
  ○ simplifies all downstream analysis

Annotation
● Annotation is the process of identifying important features in a genome
  ○ gene - protein product, promoter, signal sequences
    ■ ~1000 per Mbp in bacteria, coding dense
  ○ tRNA - transfer RNA
    ■ ~30 per bacterium cover all codons (wobble base)
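
The de Bruijn graph idea from "Types of assemblers" (break reads into k-mers, join k-mers that overlap exactly by k-1 bases) can be illustrated with a short sketch. The function names kmers and de_bruijn are hypothetical, and the example reuses the read ACTGGGTCC from the 'Read "spaces"' slide. Nodes here are k-mers, to match the slide's wording; many assemblers use the equivalent formulation in which nodes are (k-1)-mers and k-mers are edges. Error correction and reverse complements are ignored.

```python
def kmers(read, k):
    """Return every length-k substring of a read, in order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def de_bruijn(reads, k):
    """Build a node-centric de Bruijn graph.

    Nodes are the distinct k-mers seen in the reads. There is an edge
    a -> b when the last k-1 bases of a equal the first k-1 bases of b.
    Returns a dict mapping each k-mer to a sorted list of successors.
    """
    nodes = set()
    for read in reads:
        nodes.update(kmers(read, k))
    graph = {}
    for node in nodes:
        suffix = node[1:]
        # Try each possible next base and keep successors that were observed.
        graph[node] = sorted(suffix + b for b in "ACGT" if suffix + b in nodes)
    return graph


if __name__ == "__main__":
    # With k=4 the example read gives a simple unbranched chain.
    for node, successors in sorted(de_bruijn(["ACTGGGTCC"], 4).items()):
        print(node, "->", successors)
    # With k=3 the GGG homopolymer makes TGG branch to both GGG and GGT,
    # a miniature version of the repeat problem.
    print(de_bruijn(["ACTGGGTCC"], 3)["TGG"])
```

The choice of k trades off sensitivity against ambiguity: a shorter k finds more overlaps but collapses more repeats into shared nodes.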

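The first two steps of the "Assembly algorithm" slide (find all overlaps between all reads, then build a graph from them) can be sketched naively as below. This is the O(N^2) all-against-all comparison the slide warns about, with no heuristics. The names overlap and overlap_graph are hypothetical, and the sketch assumes exact matches only, i.e. the slide's "min. % identity" parameter is fixed at 100%.

```python
def overlap(a, b, min_len):
    """Length of the longest suffix of a that exactly equals a prefix of b.

    Returns 0 if no overlap of at least min_len bases exists.
    """
    min_len = max(min_len, 1)  # an overlap of zero bases is meaningless
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-n:] == b[:n]:
            return n
    return 0


def overlap_graph(reads, min_len):
    """Compare every ordered pair of reads: O(N^2) for N reads.

    Returns a dict mapping (i, j) read-index pairs to overlap length,
    meaning the end of reads[i] overlaps the start of reads[j].
    """
    edges = {}
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i == j:
                continue
            n = overlap(a, b, min_len)
            if n:
                edges[(i, j)] = n
    return edges


if __name__ == "__main__":
    reads = ["ACTGGG", "GGGTCC", "TCCAAA"]
    # Each edge says which read can follow which, and by how many bases.
    print(overlap_graph(reads, min_len=3))
```

Keying edges by read index rather than by sequence keeps duplicate reads distinct, which matters because real data sets contain many identical reads.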