Pathogenomics

From patient to bioinformatician

Dieter Bulach, Torsten Seemann

M.Sc(Bioinf) - University of Melbourne - Tue 1 May 2012

The "rules"

● Conversation, not lecture
● Ask questions at any time
● Activities and quizzes are interspersed. These have yellow background like this slide.
● Please turn your phones to silent.
● Let's start!

Overview

● Medical issue
  ○ sample collection from patient
  ○ sample preparation
● Genome sequencing
  ○ experimental design
● Bioinformatics
  ○ identify SNPs - read mapping, to reference
    ■ phylogenomic tree
  ○ identify novel DNA - de novo assembly, no reference
    ■ annotation
● Biological interpretation

What is a pathogen?

● infectious agent or "germ"
● microbe that causes disease in its host
  ○ organism
    ■ virus, bacterium, fungus, protozoan, parasite
    ■ most microbes are harmless or even beneficial
  ○ host
    ■ human, animal, plant
  ○ transmission
    ■ any "hole" in you - inhalation, ingestion, wound
  ○ virulence
    ■ how bad, how quick, mortality

What type of pathogen are these?

● HIV
● Malaria
● Golden Staph
● Powdery mildew
● Black death (Plague)
● Glandular Fever

How do they work?

● Adhesion
  ○ bind to host cell surface - interferes with normal processes
● Colonization
  ○ take over parts of the body - upsets processes
● Invasion
  ○ produce proteins to disrupt host cells, allow entry
● Immunosuppression
  ○ for example, produce proteins to bind to antibodies
● Toxins
  ○ proteins/metabolites that are poisonous to the host

Patient scenario

● Hospital patient with indwelling catheter
  ○ risk of pathogens entering the bloodstream
  ○ this is not normal, and is called septicemia
  ○ sepsis is the whole body inflammatory response to it
● Need to defeat the pathogen
  ○ most likely bacterial in this case
  ○ need to identify the bacterium and characteristics of the bacterium
    ■ antibiotic resistance profile eg. MRSA, VRE
    ■ might even want to know where it came from

Sample collection

● Take patient blood
  ○ send to pathology
● Centrifuge
  ○ slow spin to remove human cells
  ○ fast spin to pellet bacterial cells
● Streak onto agar media
  ○ first emulsify the pellet to make it spreadable
  ○ grow for 24 hours, likely to be monoculture

Traditional Microbiology

● Phenotype based:
  ○ look at cells under microscope
  ○ Gram staining - cell walls
  ○ biochemical tests
  ==> identification of the bacterium
    ■ genus and species
● PCR based testing:
  ○ 16S ribosomal RNA
    ■ common to all bacteria, differs slightly per strain
    ■ identify genus >90%, species >65%, unknown 10%
  ○ Multi-locus sequence typing (MLST)
    ■ sequence ~8 conserved genes
    ■ each strain/genus has its own MLST pattern
  ==> faster but limited - need prior knowledge

WGS for diagnostics

● Whole Genome Sequencing
  ○ fast and no prerequisite knowledge about the pathogen
  ○ Microbiologist won't be superseded!!
    ■ just different tools
    ■ with a sequence data set, we will still do all the 'tests' to identify and profile

Purify DNA

● DNA extraction kit
  ○ lyse cells and digest (proteinase K)
  ○ centrifuge to remove cell debris
  ○ pass lysate through column
    ■ DNA sticks to a DNA binding matrix
  ○ wash bound DNA
  ○ lower salt concentration - release bound DNA
  ○ precipitate: dubiously familiar stringy white pellet
    ■ salt and ethanol
● Extract DNA from strawberries at home!
  ○ detergent - breaks cells (octoploid genome)
  ○ strainer/pantyhose - remove particulate matter
  ○ salt - aids DNA precipitation
  ○ alcohol - precipitates DNA, keeps rest in solution

Library preparation

● Enough DNA?
  ○ each technology requires different amounts
● Library type
  ○ shotgun, short paired, or long paired reads?
  ○ different construction methods eg. circularization
● Size selection
  ○ nebulization, sonication, enzymatic methods
  ○ run on gel + scalpel, or fancier methods
● Amplification
  ○ lots of DNA : use PCR methods eg. emulsion PCR
  ○ little DNA : multiple displacement amplification
    ■ random hexamers, high fidelity polymerase
    ■ whole genome amplification for single-cell seq

High throughput DNA-Seq

● Lots of technologies on the market
  ○ 454, Illumina, SOLiD, Ion Torrent, PacBio
● Each has its ups and downs
  ○ speed, yield, length, price, quality, labour, reliability
● Technology trend
  ○ Illumina is currently the best choice
  ○ Most mature technology
  ○ Produces direct "base space" ie. A,G,T,C
  ○ Easiest data to work with

Current technology

Method       Length (now)  Length (soon)  Paired end?    Mate pairs?     Quality  Yield  "Space"
Illumina     150           250            Yes (→800bp)   Yes (→3kb)      +++++    ++++   base
454          500           900            No             Yes (→8kb)      +++      ++     flow
SOLiD        75            75             No             Yes (~4kb)      +++      +++++  colour
Ion Torrent  100           200            Testing        Testing (~4kb)  ++       +++    flow
PacBio       2000          6000+          No             No              +        +      base?

Read types

● Single end, "shotgun"
  ○ ===>------
  ○ sequence from one end of a fragment
● Paired end
  ○ ==>------<==
  ○ sequence from both ends of the same fragment
  ○ space between mates is the "insert size" (< 800bp)
● Mate pair
  ○ ==>--~~~~--<==
  ○ sequence both ends of a pseudo-fragment
  ○ this allows us to use longer insert sizes (> 800bp)

Read "spaces"

● Example read
  ○ ACTGGGTCC
● Base space
  ○ get native bases: A,C,T,G,G,G,T,C,C
● Flow space
  ○ get base flows: A*1, C*1, T*1, G*3, T*1, C*2
  ○ mis-count when n > 3 (homopolymers)
● Colour space
  ○ get di-base encoding: T:X,X,X,X,X,X,X
  ○ theoretically useful, but messy overall

Read filtering
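Looking back at the flow-space bullet on the Read "spaces" slide, the encoding of the example read can be reproduced with a simple run-length collapse. This is a minimal sketch only: the function name `to_flow_space` is made up for illustration, and real flow-space instruments cycle through a fixed nucleotide flow order (including flows with zero incorporations), which is not modelled here.

```python
from itertools import groupby


def to_flow_space(read):
    """Collapse a base-space read into (base, run_length) pairs."""
    return [(base, len(list(run))) for base, run in groupby(read)]


flows = to_flow_space("ACTGGGTCC")
print(flows)
# [('A', 1), ('C', 1), ('T', 1), ('G', 3), ('T', 1), ('C', 2)]

# Runs longer than 3 are the ones the slide flags as prone to mis-counting.
risky = [(base, n) for base, n in to_flow_space("ACGGGGGTT") if n > 3]
print(risky)
# [('G', 5)]
```

The homopolymer problem follows directly from this representation: the instrument has to estimate each run length from signal intensity, so long runs are where the count goes wrong.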

● Sequencing is a multi-step process
  ○ most steps are biological - so there will be errors!
● Bacterial genome sequencing
  ○ usually excess sequence, can afford to discard
● Why filter?
  ○ reduce size of data set to deal with
    ■ need less RAM and CPU
  ○ improve average read quality
    ■ better results

What to filter on

● Phred base qualities
  ○ Q<20 still means >1% error!
● ambiguous bases ie. "N"
  ○ these should have low Q scores anyway
● reads that are too short
  ○ too ambiguous to map, too short to assemble
● widowed reads
  ○ reads that, after filtering, no longer have a mate

Sequenced it, now what?
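Before moving on, the four filters from "What to filter on" can be sketched in a few lines. The Phred relationship (error probability = 10^(-Q/10)) is standard; everything else is an assumption for illustration: the function names, the thresholds, representing a read as a `(sequence, qualities)` tuple, and judging quality by the mean Phred score, which is a simplification.

```python
def phred_to_error(q):
    """Convert a Phred quality score to an error probability."""
    return 10 ** (-q / 10)


def passes_filter(seq, quals, min_mean_q=20, min_len=50):
    """Return True if a single read survives the per-read filters."""
    if len(seq) < min_len:  # too short to map or assemble
        return False
    if "N" in seq:  # ambiguous base call
        return False
    return sum(quals) / len(quals) >= min_mean_q


def filter_pairs(pairs, **thresholds):
    """Keep a pair only when both mates pass, so no widowed reads are left."""
    return [
        (r1, r2)
        for r1, r2 in pairs
        if passes_filter(*r1, **thresholds) and passes_filter(*r2, **thresholds)
    ]


good = ("ACGTACGTAC", [30] * 10)
noisy = ("ACGTACGTAC", [10] * 10)
ambiguous = ("ACGTNCGTAC", [30] * 10)

pairs = [(good, good), (good, noisy), (ambiguous, good)]
kept = filter_pairs(pairs, min_len=5)
print(len(kept))  # 1
```

Dropping the whole pair is only one way to handle widowed reads; some pipelines keep the surviving mate as a single-end read instead.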

● How is it different?
  ○ compare to known closely related "reference" strain
● Types of differences
  ○ deleted DNA - in reference, not in ours
  ○ duplicated DNA - extra copies in ours
  ○ novel DNA - in ours, not in reference
  ○ SNPs - single nucleotide polymorphisms (1bp subst)
  ○ indels - short insertions or deletions (usually 1bp)
    ■ sometimes indels fall under "SNPs" banner
  ○ structural variation - rearrangements, inversions
    ■ small scale, large scale

Read mapping - large scale

[Figure: reads mapped along the reference; labels: x1 coverage, Conserved, Deleted, x3, x2, Conserved]

Read mapping - medium scale

Read mapping - small scale

[Figure: labels: Reference sequence, Depth, Errors]

Are we seeing everything?

● Hmm, some of our reads didn't map
  ○ sequencing artifacts (some)
  ○ contamination (maybe - RA sneezed into sample?)
  ○ DNA in our sample but not in reference (yes)
    ■ need to de novo assemble
● Other comparisons more difficult
  ○ structural change, rearrangements
    ■ read length & insert size are limiting factors
  ○ read mapping is not the answer to every question
    ■ particularly with non-model organisms

De novo genome assembly

● De novo
  ○ Latin - "from the beginning", "afresh", "anew"
  ○ Without reference to any other genomes
● "Genome assembly is impossible."
  ○ A/Prof. Mihai Pop - leading assembly researcher!

Assembling bacteria

● Genomes
  ○ DNA, single organism, ~1 sequence
● Transcriptomes
  ○ RNA (cDNA), single organism, ~4000 sequences
● Meta-genomes
  ○ DNA, mix of organisms, >10 sequences
  ○ eg. human gut microbiome, oral cavity
● Meta-transcriptomes
  ○ RNA (cDNA), mix of organisms, >40000 sequences!

Types of assemblers

● Greedy
  ○ find two best matching reads, join them, iterate
● Overlap-Layout-Consensus
  ○ collate all overlaps into a graph and find a path
● de Bruijn graph (pronounced "brown")
  ○ break reads into k-mers, overlaps are exact (100% identity) over k-1 bases
● String graph
  ○ represents all that is inferable from the reads
  ○ encompasses OLC and DBGs

Assembly algorithm
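To make the de Bruijn bullet from "Types of assemblers" concrete, here is a minimal sketch that breaks reads into k-mers and links consecutive k-mers, which by construction overlap exactly by k-1 bases. The function name `de_bruijn_graph`, the toy reads, and the choice of k are illustrative assumptions; real assemblers also handle reverse complements, k-mer counts, and error correction, none of which appear here.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Map each k-mer to the set of k-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        for left, right in zip(kmers, kmers[1:]):
            graph[left].add(right)  # left[1:] == right[:-1] by construction
    return graph


reads = ["ACGTAC", "GTACGG"]
graph = de_bruijn_graph(reads, k=3)

for kmer in sorted(graph):
    print(kmer, "->", sorted(graph[kmer]))
# ACG -> ['CGG', 'CGT']
# CGT -> ['GTA']
# GTA -> ['TAC']
# TAC -> ['ACG']

# A k-mer with more than one successor is an ambiguous decision point.
branches = sorted(kmer for kmer, nxt in graph.items() if len(nxt) > 1)
print(branches)
# ['ACG']
```

Note that no read-versus-read comparison is needed: the graph is built in a single pass over the reads, which is the main practical attraction over all-pairs overlap finding.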

● Find all overlaps between all reads
  ○ naively this is O(N^2) for N reads, but good heuristics exist
  ○ parameters are: min. overlap, min %identity
● Build a graph from these overlaps
  ○ nodes/arcs <=> reads/overlaps <=> vertices/edges
● Simplify the graph
  ○ because real-world reads have errors
● Trace a single path through the graph
  ○ Read off the consensus of bases as you go

The tyranny of repeats
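The first two steps of the assembly algorithm just listed (find all overlaps, build a graph) can be sketched naively as follows. This is the O(N^2) all-pairs version with no heuristics, and it assumes exact suffix-prefix matches, so the "min %identity" parameter is effectively fixed at 100%. The function names and toy reads are made up for illustration.

```python
def overlap_length(a, b, min_overlap):
    """Length of the longest suffix of a that equals a prefix of b, else 0."""
    if min_overlap < 1:
        raise ValueError("min_overlap must be at least 1")
    for n in range(min(len(a), len(b)), min_overlap - 1, -1):
        if a[-n:] == b[:n]:
            return n
    return 0


def overlap_graph(reads, min_overlap):
    """Compare every ordered pair of reads; edges are (i, j) -> overlap length."""
    edges = {}
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i != j:
                n = overlap_length(a, b, min_overlap)
                if n:
                    edges[(i, j)] = n
    return edges


reads = ["ACGTTG", "TTGCAA", "CAAGGT"]
edges = overlap_graph(reads, min_overlap=3)
print(edges)
# {(0, 1): 3, (1, 2): 3}
```

Here the graph is a single unbranched path (read 0, then 1, then 2), so tracing it gives one contig. A repeat longer than the reads would give some read several equally good successors, and the path would have to be broken there.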

● Assembler would output 7 contig sequences
  ○ path is broken at ambiguous decision points
  ○ read/pair length limits ability to resolve repeats

How many contigs will be produced?

More complex graph

[Figure: assembly graph; labels: Contigs, Connections]

Reality bites

Shared vertices are repeats.

Scaffolding

● Use paired reads to join contigs
  ○ read pairs whose mates consistently fall in two different contigs suggest that those contigs are adjacent
● A difficult constraint problem
  ○ distance between mates ("insert size") is variable
  ○ repeats cause ambiguous mate placement
  ○ some assemblers do it, separate scaffolders exist

Contig ordering

● Optical maps
  ○ wet lab method, real experimental evidence
  ○ chromosome sized restriction site map
● Align to reference genome
  ○ fit contigs as well as possible against known reference
  ○ some contigs will fit if split (DNA rearrangement)
  ○ expect orphan contigs (novel DNA)

Genome closure

● Finished genome
  ○ one contig per replicon in original sample
  ○ bacterial chromosomes/plasmids usually circular
● Labour intensive
  ○ design primers around gaps, PCR, Sanger
  ○ Fosmid/BAC libraries for larger inconsistencies
● Why bother?
  ○ no close reference exists
  ○ ensures you didn't miss anything
  ○ understand whole genome architecture
  ○ simplifies all downstream analysis

Annotation

● Annotation is the process of identifying important features in a genome
  ○ gene - protein product, promoter, signal sequences
    ■ ~1000 per Mbp in bacteria, coding dense
  ○ tRNA - transfer RNA
    ■ ~30 per bacterium cover all codons (wobble base)
  ○ rRNA - ribosomal RNA locus
    ■ 1 to 9 per bacterium, fast vs slow growers
  ○ And many more...
    ■ small RNAs, ncRNA, binding sites, transcription factors

Annotating proteins

● Homology vs. Similarity
  ○ homology means shared evolutionary ancestry, which usually implies similar biological function
  ○ we use sequence similarity as a proxy for homology
  ○ works well for most situations
● Sequence alignment methods
  ○ "Exact" - Needleman-Wunsch, Smith-Waterman
  ○ "Approx" - BLAST, FASTA, and many others!
  ○ Database is sequences: nr, RefSeq, UniProt
● Sequence profile methods
  ○ Build a HMM (model) of aligned sequence families
  ○ HMMER - scan profiles against query protein seq.
  ○ Database is profiles: Pfam, TIGRfams, FigFam

Curation
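As an illustration of the "Exact" alignment methods named under "Annotating proteins", here is a minimal sketch of Needleman-Wunsch global alignment that returns only the optimal score. The scoring values (match +1, mismatch -1, linear gap -2) are assumptions chosen for the example; protein searches in practice use substitution matrices such as BLOSUM and affine gap penalties.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-2):
    """Optimal global alignment score of a and b by dynamic programming."""
    # prev[j] holds the best score for aligning a[:i-1] with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # a[:i] aligned against nothing: i gaps
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap in b
            left = curr[j - 1] + gap  # gap in a
            curr.append(max(diagonal, up, left))
        prev = curr
    return prev[-1]


print(needleman_wunsch_score("GATTACA", "GATTACA"))  # 7
print(needleman_wunsch_score("ACGT", "AGT"))         # 1
```

The table has (len(a)+1) x (len(b)+1) cells, which is why exact methods are too slow for searching a whole database and heuristic tools such as BLAST are used instead.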

● Automatic annotation
  ○ more quality databases and models now
  ○ but still flawed
● Manual curation
  ○ Essential for a quality annotation
  ○ Find pseudo, missing, bogus, and broken genes
  ○ Discover mis-assemblies
  ○ Correct mis-annotated protein families
  ○ Fix incorrect start codons
    ■ Bacteria have 3-5 start codons, not just AUG (M)

Practical Exercise
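On the start codon point from the curation list: a gene finder that only looks for ATG will miss or truncate genes that begin with an alternative start. The sketch below scans the three forward reading frames for open reading frames, with the start codon set as a parameter. ATG, GTG and TTG are the commonly cited bacterial starts; the function name, the minimum length, and the toy sequence are assumptions for illustration, and the reverse strand is not scanned.

```python
BACTERIAL_STARTS = frozenset({"ATG", "GTG", "TTG"})
STOP_CODONS = frozenset({"TAA", "TAG", "TGA"})


def find_orfs(dna, starts=BACTERIAL_STARTS, min_codons=3):
    """Return (start, end) pairs for forward-strand ORFs.

    Coordinates are 0-based, end is exclusive and includes the stop codon.
    """
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon in starts:
                start = i  # first start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return orfs


dna = "AAGTGCCCAAATAGCC"
print(find_orfs(dna))                  # [(2, 14)]
print(find_orfs(dna, starts={"ATG"}))  # []
print(dna[2:14])                       # GTGCCCAAATAG
```

The toy ORF starts with GTG, so the ATG-only search finds nothing. This is the kind of missing or mis-started gene that manual curation is meant to catch.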

Go to URL: dna.med.monash.edu.au/~torsten/tmp/msc

Install: Artemis and MEGA

Keep this URL open in a tab.

Wait for further instructions.

Annotation

● Start Artemis
  ○ open Hendra1994.fa
● What is it?
  ○ Hendra virus - 18 kb viral genome
  ○ single-stranded negative-sense RNA (not DNA!!)
  ○ has 6 protein coding regions ("genes")
● Task
  ○ find these genes using Artemis
  ○ use similarity searching to assign a name to the gene

The official annotation

● In Artemis
  ○ download and open Hendra1994.gbk
● Task
  ○ compare to your annotations
● What did you find?
  ○ methionine (M) start codon (ATG)

DNA vs Protein Similarity

● Examine relationships between Paramyxoviridae
  ○ includes Hendravirus already in Artemis
● Open BLAST: http://blast.ncbi.nlm.nih.gov/
● For the Hendravirus:
  ○ use blastn to search nr database for sequences related to the L gene (DNA)
  ○ use blastp to search nr database for sequences related to the L protein (amino acids)
  ○ Any observations??

Phylogeny

● Start MEGA
  ○ download L_para.fas
  ○ multifasta with "L" proteins from 37 similar viruses
● Task:
  ○ Load L_para.fas
  ○ Align sequences (using MUSCLE)
  ○ Infer tree (minimum evolution method)
  ○ Examine relationships

Viral strain comparison

● Hendravirus
  ○ 11 complete genome sequences
  ○ different hosts (bats, horses) and times (1994 onwards)
● Task
  ○ Load hendra11.meg into MEGA
  ○ multiple alignment already done - examine SNPs
  ○ What is the impact of the nucleotide differences?
    ■ look at one SNP in detail
    ■ use Artemis to see if the SNP is in a gene
    ■ does the SNP change the encoded protein?
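The last question in the task (does the SNP change the encoded protein?) comes down to translating the affected codon before and after the substitution. The sketch below assumes the standard genetic code, a coding sequence given on the coding strand starting in frame at its first base, and a 0-based SNP position within that sequence. The function name `snp_effect` and the toy sequence are hypothetical and are not taken from the Hendra data.

```python
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Standard genetic code, built from the conventional TCAG-ordered string.
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def snp_effect(cds, pos, alt):
    """Classify a single-base substitution at 0-based position pos in cds."""
    codon_start = pos - pos % 3
    offset = pos % 3
    ref_codon = cds[codon_start:codon_start + 3]
    alt_codon = ref_codon[:offset] + alt + ref_codon[offset + 1:]
    ref_aa = CODON_TABLE[ref_codon]
    alt_aa = CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        kind = "synonymous"
    elif alt_aa == "*":
        kind = "nonsense"
    else:
        kind = "missense"
    return ref_codon, alt_codon, ref_aa, alt_aa, kind


cds = "ATGGCTAAA"  # M A K
print(snp_effect(cds, 5, "C"))  # ('GCT', 'GCC', 'A', 'A', 'synonymous')
print(snp_effect(cds, 3, "A"))  # ('GCT', 'ACT', 'A', 'T', 'missense')
print(snp_effect(cds, 6, "T"))  # ('AAA', 'TAA', 'K', '*', 'nonsense')
```

In the exercise the same check is done by eye: Artemis shows whether the SNP falls inside an annotated gene and which codon and amino acid it affects.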