Pathogenomics
From patient to bioinformatician
Dieter Bulach
Torsten Seemann
M.Sc(Bioinf) - University of Melbourne - Tue 1 May 2012

The "rules"
● Conversation, not lecture
● Ask questions at any time
● Activities and quizzes are interspersed. These have a yellow background like this slide.
● Please turn your phones to silent.
● Let's start!

Overview
● Medical issue
  ○ sample collection from patient
  ○ sample preparation
● Genome sequencing
  ○ experimental design
● Bioinformatics
  ○ identify SNPs - read mapping, to reference
    ■ phylogenomic tree
  ○ identify novel DNA - de novo assembly, no reference
    ■ annotation
● Biological interpretation

What is a pathogen?
● infectious agent or "germ"
● microbe that causes disease in its host
  ○ organism
    ■ virus, bacterium, fungus, protozoan, parasite
    ■ most are harmless or even beneficial
  ○ host
    ■ human, animal, plant
  ○ transmission
    ■ any "hole" in you - inhalation, ingestion, wound
  ○ virulence
    ■ how bad, how quick, mortality

What type of pathogen are these?
● HIV
● Malaria
● Golden Staph
● Powdery mildew
● Black death (Plague)
● Glandular Fever

How do they work?
● Adhesion
  ○ bind to host cell surface - interferes with normal processes
● Colonization
  ○ take over parts of the body - upsets processes
● Invasion
  ○ produce proteins to disrupt host cells, allow entry
● Immunosuppression
  ○ for example, produce proteins to bind to antibodies
● Toxins
  ○ proteins/metabolites that are poisonous to the host

Patient scenario
● Hospital patient with indwelling catheter
  ○ risk of pathogens entering the bloodstream
  ○ this is not normal, and is called septicemia
  ○ sepsis is the whole body inflammatory response to it
● Need to defeat the pathogen
  ○ most likely bacterial in this case
  ○ need to identify the bacterium and its characteristics
    ■ antibiotic resistance profile eg. MRSA, VRE
    ■ might even want to know where it came from

Sample collection
● Take patient blood
  ○ send to pathology
● Centrifuge
  ○ slow spin to remove human cells
  ○ fast spin to pellet bacterial cells
● Streak onto agar media
  ○ first emulsify the pellet to make it spreadable
  ○ grow for 24 hours, likely to be monoculture

Traditional Microbiology
● Phenotype based:
  ○ look at cells under microscope
  ○ Gram staining - cell walls
  ○ biochemical tests
==> identification of the bacterium
    ■ genus and species
● PCR based testing:
  ○ 16S ribosomal RNA
    ■ common to all bacteria, differs slightly per strain
    ■ identifies genus in >90% of cases, species in >65%; about 10% remain unidentified
  ○ Multi-locus sequence typing (MLST)
    ■ sequence ~8 conserved genes
    ■ each strain/genus has its own MLST pattern
==> faster but limited - need prior knowledge

WGS for diagnostics
● Whole Genome Sequencing
  ○ fast and no prerequisite knowledge about the pathogen
  ○ Microbiologist won't be superseded!!
    ■ just different tools
    ■ sequence data set: will still do all the 'tests' to identify and profile

Purify DNA
● DNA extraction kit
  ○ lyse cells and digest (proteinase K)
  ○ centrifuge to remove cell debris
  ○ pass lysate through column
    ■ DNA sticks to a DNA binding matrix
  ○ wash bound DNA
  ○ lower salt concentration - release bound DNA
  ○ precipitate: dubiously familiar stringy white pellet
    ■ salt and ethanol
● Extract DNA from strawberries at home!
  ○ detergent - breaks cells (octoploid genome)
  ○ strainer/pantyhose - remove particulate matter
  ○ salt - aids DNA precipitation
  ○ alcohol - precipitates DNA, keeps rest in solution

Library preparation
● Enough DNA?
  ○ each technology requires different amounts
● Library type
  ○ shotgun, short paired, or long paired reads?
  ○ different construction methods eg. circularization
● Size selection
  ○ nebulization, sonication, enzymatic methods
  ○ run on gel + scalpel, or fancier methods
● Amplification
  ○ lots of DNA: use PCR methods eg. emulsion PCR
  ○ little DNA: multiple displacement amplification
    ■ random hexamers, high fidelity polymerase
    ■ whole genome amplification for single-cell seq

High throughput DNA-Seq
● Lots of technologies on the market
  ○ 454, Illumina, SOLiD, Ion Torrent, PacBio
● Each has its ups and downs
  ○ speed, yield, length, price, quality, labour, reliability
● Technology trend
  ○ Illumina is currently the best choice
  ○ Most mature technology
  ○ Produces direct "base space" ie. A,G,T,C
  ○ Easiest data to work with

Current technology
Method       Length (now)  Length (soon)  Paired end?    Mate pairs?      Quality  Yield  "Space"
Illumina     150           250            Yes (→800bp)   Yes (→3kb)       +++++    ++++   base
454          500           900            No             Yes (→8kb)       +++      ++     flow
SOLiD        75            75             No             Yes (~4kb)       +++      +++++  colour
Ion Torrent  100           200            Testing        Testing (~4kb)   ++       +++    flow
PacBio       2000          6000+          No             No               +        +      base?

Read types
● Single end, "shotgun"
  ○ ===>---------
  ○ sequence from one end of a fragment
● Paired end
  ○ ==>--------<==
  ○ sequence from both ends of the same fragment
  ○ space between mates is the "insert size" (< 800bp)
● Mate pair
  ○ ==>--~~~~--<==
  ○ sequence both ends of a pseudo-fragment
  ○ this allows us to use longer insert sizes (> 800bp)

Read "spaces"
● Example read
  ○ ACTGGGTCC
● Base space
  ○ get native bases: A,C,T,G,G,G,T,C,C
● Flow space
  ○ get base flows: A*1, C*1, T*1, G*3, T*1, C*2
  ○ mis-count when n > 3 (homopolymers)
● Colour space
  ○ get di-base encoding: T:X,X,X,X,X,X,X
  ○ theoretically useful, but messy overall

Read filtering
● Sequencing is a multi-step process
  ○ most steps are biological - so there will be errors!
● Bacterial genome sequencing
  ○ usually excess sequence, can afford to discard
● Why filter?
  ○ reduce size of data set to deal with
    ■ need less RAM and CPU
  ○ improve average read quality
    ■ better results

What to filter on
● Phred base qualities
  ○ Q<20 still means >1% error!
● ambiguous bases ie. "N"
  ○ these should have low Q scores anyway
● reads that are too short
  ○ too ambiguous to map, too short to assemble
● widowed reads
  ○ reads that, after filtering, no longer have a mate

Sequenced it, now what?
● How is it different?
  ○ compare to known closely related "reference" strain
● Types of differences
  ○ deleted DNA - in reference, not in ours
  ○ duplicated DNA - extra copies in ours
  ○ novel DNA - in ours, not in reference
  ○ SNPs - single nucleotide polymorphisms (1bp subst)
  ○ indels - short insertions or deletions (usually 1bp)
    ■ sometimes indels fall under "SNPs" banner
  ○ structural variation - rearrangements, inversions
    ■ small scale, large scale

Read mapping - large scale
[Figure; labels: x1 coverage, Conserved, Deleted, x3, x2, Conserved]

Read mapping - medium scale
[Figure]

Read mapping - small scale
[Figure; labels: Reference sequence, Depth, Errors]

Are we seeing everything?
● Hmm, some of our reads didn't map
  ○ sequencing artifacts (some)
  ○ contamination (maybe - RA sneezed into sample?)
  ○ DNA in our sample but not in reference (yes)
    ■ need to de novo assemble
● Other comparisons more difficult
  ○ structural change, rearrangements
    ■ read length & insert size are limiting factors
  ○ read mapping is not the answer to every question
    ■ particularly with non-model organisms

De novo genome assembly
● De novo
  ○ Latin - "from the beginning", "afresh", "anew"
  ○ Without reference to any other genomes
● "Genome assembly is impossible."
  ○ A/Prof. Mihai Pop - leading assembly researcher!

Assembling bacteria
● Genomes
  ○ DNA, single organism, ~1 sequence
● Transcriptomes
  ○ RNA (cDNA), single organism, ~4000 sequences
● Meta-genomes
  ○ DNA, mix of organisms, >10 sequences
  ○ eg. human gut microbiome, oral cavity
● Meta-transcriptomes
  ○ RNA (cDNA), mix of organisms, >40000 sequences!

Types of assemblers
● Greedy
  ○ find two best matching reads, join them, iterate
● Overlap-Layout-Consensus
  ○ collate all overlaps into a graph and find a path
● de Bruijn graph (pronounced "brown")
  ○ break reads into k-mers; overlaps are exact (100% identity) over k-1 bases
● String graph
  ○ represents all that is inferable from the reads
  ○ encompasses OLC and DBGs

Assembly algorithm
● Find all overlaps between all reads
  ○ naively this is O(N^2) for N reads, but good heuristics exist
  ○ parameters are: min. overlap, min %identity
● Build a graph from these overlaps
  ○ nodes/arcs <=> reads/overlaps <=> vertices/edges
● Simplify the graph
  ○ because real-world reads have errors
● Trace a single path through the graph
  ○ Read off the consensus of bases as you go

The tyranny of repeats
● Assembler would output 7 contig sequences
  ○ path is broken at ambiguous decision points
  ○ read/pair length limits ability to resolve repeats

How many contigs will be produced?
[Figure]

More complex graph
[Figure; labels: Contigs, Connections]

Reality bites
Shared vertices are repeats.

Scaffolding
● Use paired reads to join contigs
  ○ reads whose mates consistently fall in a different contig suggest the two contigs are adjacent
● A difficult constraint problem
  ○ distance between mates ("insert size") variable
  ○ repeats cause ambiguous mate placement
  ○ some assemblers do it, separate scaffolders exist

Contig ordering
● Optical maps
  ○ wet lab method, real experimental evidence
  ○ chromosome sized restriction site map
● Align to reference genome
  ○ fit contigs as well as possible against known reference
  ○ some contigs will fit if split (DNA rearrangement)
  ○ expect orphan contigs (novel DNA)

Genome closure
● Finished genome
  ○ one contig per replicon in original sample
  ○ bacterial chromosomes/plasmids usually circular
● Labour intensive
  ○ design primers around gaps, PCR, Sanger
  ○ Fosmid/BAC libraries for larger inconsistencies
● Why bother?
  ○ no close reference exists
  ○ ensures you didn't miss anything
  ○ understand whole genome architecture
  ○ simplifies all downstream analysis

Annotation
● Annotation is the process of identifying important features in a genome
  ○ gene - protein product, promoter, signal sequences
    ■ ~1000 per Mbp in bacteria, coding dense
  ○ tRNA - transfer RNA
    ■ ~30 per bacterium cover all codons (wobble base)
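
Sketch (not from the original slides): the de Bruijn idea from "Types of assemblers" (break reads into k-mers, link k-mers that overlap exactly over k-1 bases) can be illustrated in a few lines. This is a toy under stated assumptions: it ignores reverse complements, sequencing errors and coverage, and the function names `kmers`, `build_graph` and `branch_points` are hypothetical, not taken from any assembler.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set


def kmers(read: str, k: int) -> List[str]:
    """All overlapping substrings of length k, in read order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def build_graph(reads: Iterable[str], k: int) -> Dict[str, Set[str]]:
    """Nodes are k-mers; an edge joins consecutive k-mers, which share k-1 bases."""
    graph: Dict[str, Set[str]] = defaultdict(set)
    for read in reads:
        ks = kmers(read, k)
        for left, right in zip(ks, ks[1:]):
            graph[left].add(right)
    return graph


def branch_points(graph: Dict[str, Set[str]]) -> List[str]:
    """k-mers with more than one successor: the ambiguous decision points."""
    return sorted(node for node, successors in graph.items() if len(successors) > 1)
```

For example, `branch_points(build_graph(["ATGCA", "ATGCT"], 3))` returns `['TGC']`: the two reads agree up to `TGC` and then diverge, which is the kind of ambiguous decision point at which an assembler breaks a contig.
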