BAM Alignment File — Output: Alignment Counts and RPKM Expression Measurements for Each Exon • Calculate Coverage Profiles Across the Genome with “Sam2wig”

SOLiD™ Bioinformatics Overview (I)
July 2010

[Diagram labels: Primary Analysis; Secondary and Tertiary Analysis; On Instrument; ICS; SETS; Export; BioScope in cluster; BioScope in cloud; other mapping tools; Satay plots; auto correlation; heat maps]

Auto-Export (cycle-by-cycle)
• Instrument Cluster: Collect Images → Primary Analysis (colorcalls) → Generate Csfasta & qual → Secondary Analysis
• Auto Export: delta spch, run definition
• Bioscope Cluster: Merge Spch files → Generate Csfasta & qual → Tertiary Analysis

Manual Export
• Instrument Cluster: Collect Images → Primary Analysis (colorcalls) → Generate Csfasta & qual → Secondary Analysis
• Option in SETS to export all run data or a combination of Spch, csfasta & qual files
• Bioscope Cluster: Tertiary Analysis

Auto-Export database requirement removed
[Architecture diagram labels: Instrument Cluster; Remote Cluster; JMS Broker (ActiveMQ) on each cluster; Postgres on each cluster; Tomcat on each cluster; ICS; Hades; Export Daemon; Bioscope; SETS; Bioscope UI; Disco installer; Auto-export installer; Bioscope installer; System installer]

SOLiD Data Analysis Workflow: Secondary/Tertiary Analysis (off-instrument)
• Primary Analysis: .csfasta / .qual
• Secondary Analysis (BioScope): SAET → Accuracy Enhanced .csfasta; Mapping → Mapped reads (.ma); maToBam → Mapped reads (.bam) → Third Party Tools
• Tertiary Analysis (BioScope):
  - Reseq: SNP, InDel, CNV, Inversion
  - WT: Coverage, Exon Counting, Junction Finder, Fusion Transcripts

SOLiD™ Bioinformatics Overview (II)
July 2010

Outline
• Color space and 2-base-encoding
• Quality values and filtering
• Mapping algorithm and considerations
• SOLiD Webinar and Online Training
• SOLiD Software Community

What Is Color Space?
• Capillary electrophoresis uses single-base color encoding of data
[Figure: Collect color image → Identify peaks → Identify peak colors → Convert to base calls (color space to base space)]

SOLiD Color Space
• SOLiD uses 2-base color encoding of data (2BE)
[Figure: Collect color image → Identify beads → Identify bead color → Record colors for each bead over consecutive cycles (color space to base space); example base sequence: A C G G T C G T C G T G T G C G T]

Properties Of 2-Base Encoding (2BE)
[Figure: color-code matrix indexed by first base and second base]

         1 3 1 3 1 3 2 3
    5'-A C G T A C G A T-3'
    3'-T G C A T G C T A-5'
         1 3 1 3 1 3 2 3

• Two dibases that agree in just one base have different colors
  - color(AC) ≠ color(AG) ≠ color(AT) ≠ color(AA)
• Two dibases that differ in both bases can share a color
  - color(AC) = color(GT) and color(CG) = color(AT)
• A dibase and its reverse have the same color
  - color(AC) = color(CA), color(GT) = color(TG)
• Repeated-base dibases have the same color
  - color(AA) = color(CC) = color(GG) = color(TT)

"Valid" and "Invalid" Adjacent Color Substitutions
• "Invalid" changes are inconsistent with a SNP and are likely sequencing errors

Quality Value (QV) For Color Call
• A score calculated based on the probability of an error call at that base
• Similar to those generated by phred and the KB Basecaller for capillary electrophoresis sequencing

    $q = -10 \log_{10} p$, where $p$ = probability of color call error

• A QV score of 10 represents a 10% error rate, whereas a QV score of 20 represents a 1% error rate

SETS Software Updates
What has changed in SETS?
• Primary Analysis Filtering enabled: removes poor-quality beads, so the primary analysis results file size will be reduced. Mapping will be performed faster and matching % will improve.
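
The 2-base encoding properties listed earlier can be checked with a few lines of code. The following is a minimal Python sketch, assuming the 2-bit base codes A=0, C=1, G=2, T=3 with the dibase color taken as the XOR of the two codes. That assignment is not stated on the slides, but it reproduces the color string 1 3 1 3 1 3 2 3 shown for 5'-ACGTACGAT-3'. The function names are illustrative only and do not come from any SOLiD tool.

```python
# Assumption: 2-bit base codes; the color of a dibase is the XOR of its two codes.
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_BASE = "ACGT"
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def color(dibase):
    """Return the color (0-3) of a two-base string such as 'AC'."""
    first, second = dibase
    return BASE_CODE[first] ^ BASE_CODE[second]


def encode(seq):
    """Translate a base-space sequence into its list of colors."""
    return [color(seq[i:i + 2]) for i in range(len(seq) - 1)]


def decode(first_base, colors):
    """Rebuild the base sequence from a known first base and a color list."""
    bases = [first_base]
    for c in colors:
        # XOR with the previous base code recovers the next base code.
        bases.append(CODE_BASE[BASE_CODE[bases[-1]] ^ c])
    return "".join(bases)


def reverse_complement(seq):
    """Return the reverse complement of a base-space sequence."""
    return "".join(COMPLEMENT[b] for b in reversed(seq))


if __name__ == "__main__":
    colors = encode("ACGTACGAT")
    print(colors)
    print(decode("A", colors))
    # The opposite strand, read 5' to 3', gives the same colors in reverse order.
    print(encode(reverse_complement("ACGTACGAT")))
```

Because decoding chains each base to the previous one, a single wrong color changes every downstream base. This is why mapping is done in color space against a color-space reference rather than after translating reads to bases.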
[Figure: filter separating poor-quality beads from mappable beads]

Why Filter?
• Removing poor-quality beads reduces the primary analysis results by about 15% or more
• Easier to discover novel information from the remaining unmatched beads
• Because a run yields a smaller list of reads, mapping is faster for similar throughput
• Improved matching percentage

Filtering Design
• Used Human data as training set
• Set parameters based on the number of poor-quality beads filtered
  - A value of 20 corresponds to 20% of poor-quality beads filtered out
  - A value of 80 corresponds to 80% of poor-quality beads filtered out
• Tested mapping using BioScope Classic mapping

Configuring Filtering from SETS
• Valid values for Stringency range from 0 to 80
• Default value is 20

Mapping Algorithm
• Challenge:
  - A small word size is needed for continuous word searches in short reads. This is computationally and time intensive.
• Our Approach:
  - Use discontinuous word patterns
    > Allows faster searching and is guaranteed to find all hits up to a certain number of mismatches

Discontinuous Words
• Continuous words: searching for a perfect alignment, 8/8 bases (word size 8, e.g.
used by BLAST)

    ATTTTTT GGGTAGCC CCTTGGATGAGT
            ||||||||
         AG GGGTAGCC TGATGATGGT

• Discontinuous words: searching 8/18 matches (effective word size is also 8)

    ATTTT TT GGGTA GC CCCTT GGAT GAGT
          ||       ||       ||||
          TT GACCG GC ATGGG GGAT
    mask: 110000011000001111

Mapping Tool - mapreads
• General features of mapping tool
  - Aligns in color space
  - Translates reference sequence to color space
  - Allows mismatches (no indels); valid adjacent mismatches can be counted as one
  - Allows masking of certain positions (bad calls)
  - For a fixed reference sequence, running time is linear in the number of reads
• New with SOLiD 3+
  - Seed and extend mapping approach
  - Multi-threaded

Local Mapping
• Motivation
  - Long reads, non-uniform quality
  - Errors tend to accumulate at the end of reads
  - Some applications show sequencing into adaptors

Local Alignment Strategy
• Map the first 25 colors of the read to the reference, allowing 2 mismatches (MM).
• For every hit found (up to the Z-limit), do a local extension
  - Accumulate alignment score (Match = 1, MM = -2 [user defined])
  - Report the best partial alignment (anchored local) based on score
    > Discard if score does not meet minimum cutoff

    Read: 0122130123012303201203021 123012310231203120103120
          ||||||| ||||||||||||||| | ||||| |||||||||| |||
    Ref:  0122130 0230123032012030 1112301 13102312031 0010 1203

• For reads not mapped, shift anchor location and attempt additional mapping

Local Mapping: Anchor Offset
[Figure: read aligned to the reference, with start, end, and (offset) marked]
• start and end mark the start and end of the alignment in the reference.
• The alignment may not encompass the entire read.
• The start of the alignment in the read is called the offset

Mapping QV: Mathematical Definition
• Mapping quality is an estimate of how likely an alignment is to be correct
• First, calculate the posterior probability

    $P(r \mid \text{Alignment}) = (1 - e)^{t - m} \, e^{m} \left(\tfrac{1}{4}\right)^{L - t}$

• If an alignment has a probability of $P(r)$, its mapping QV is defined as
  - $-10 \log_{10}\left(1 - P(r)/P\right)$, where $P = \sum P(r)$ for all reads

What are mapping/pairing quality values?
• Given that a read R of length L can map to n different locations Xi (i = 1…n) in the genome, the mapping quality value represents the probability that the hypothesis "the read maps to location Xi" is true.

    Mapping Quality value ~ Prob[hypothesis (R ≡ X1 | R) is true]

[Figure: read R with candidate locations X1, X2, …, Xn]

Difference between Mapping QV & Pairing QV
• Mapping QV represents the quality of alignment for Fragment reads, or the quality of alignment for individual tags (F3/R3/F5-P2) in paired reads
• Pairing QV represents the quality of alignment for a pair of reads. For example, if the F3 tag has 10 alignments and the F5-P2 tag has 10 alignments, then 100 alignment pairs can be formed for tags F3 and F5-P2 together

Parameters that factor into Pairing Quality Values
• Alignment length
• Number of mismatches
• Offset
• Insert size
• Total number of possible alignments
[Figure: F3 (+) and R3/F5 (-) tags annotated with offset, alignment length, and insert size]

Phred quality score and Pairing Quality Values
The Phred quality score most commonly used in the literature is -10 × log10[prob(error)].
So, to be consistent with the Phred-scaled quality score, we calculate the pairing quality value (PQV) as:

    $PQV = -10 \times \log_{10}\left[1 - Q(r_1, r_2, x_1, x_2)\right]$

Finally, we normalize the PQV by the maximum possible PQV for a given pair of reads of read lengths L1 and L2, to keep the PQVs in the range 1 to 100:

    $PQV = \frac{PQV}{PQV_{max}} \times 100$

Multithreaded Mapreads
• Single mapping job
• A fraction of the reads (1/n) is mapped against the whole reference
• ~20GB of RAM for the human genome
• Limit read mapping to whole genome (-z)
• Combine results (simple merge)
[Diagram: each of CPU 1 to CPU n maps 1/n of the reads (.csfasta) against the full reference, producing mapped results 1 to n, which are merged into the combined results]

Local Mapping: Advantages
• Increased throughput
  - Some data sets have shown a 2-fold increase in mapping using local mapping vs. classical mapping
• Increased speed
  - Up to 15X faster than iterative mapping with trimming
• As read length increases, only a small set of schemas needs to be optimized

Introducing
SOLiD™ University Offerings
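
The extension step of the local alignment strategy described earlier (Match = 1, MM = -2, report the best partial alignment, discard below a minimum cutoff) can be sketched as follows. This is a minimal Python illustration, not the mapreads implementation: it assumes the anchor hit has already been found, compares color strings position by position without indels, and the function name, parameter names, and default cutoff are hypothetical.

```python
def anchored_local_extension(read, ref, ref_start,
                             match=1, mismatch=-2, min_score=1):
    """Extend an anchored hit and return (aligned_length, score), or None.

    Hypothetical helper: `read` and `ref` are color strings, and `ref_start`
    is the reference position where the anchor hit begins. The alignment is
    anchored at the first color of the read and may stop before the read ends.
    """
    score = 0
    best_score = 0
    best_len = 0
    for i, read_color in enumerate(read):
        j = ref_start + i
        if j >= len(ref):
            break  # ran off the end of the reference
        score += match if read_color == ref[j] else mismatch
        # Keep the shortest prefix that reaches the highest running score.
        if score > best_score:
            best_score = score
            best_len = i + 1
    if best_len == 0 or best_score < min_score:
        return None  # discard: no prefix meets the minimum cutoff
    return best_len, best_score


if __name__ == "__main__":
    # One mismatch at read position 8; extending past it does not beat the
    # score already reached, so the 7-color prefix is reported.
    print(anchored_local_extension("0122130123", "0122130023", 0))
    # One more matching color pays back the mismatch penalty.
    print(anchored_local_extension("01221301230", "01221300230", 0))
```

Tracking the best running score rather than the final score is what lets the alignment end early when errors accumulate toward the end of a read, which is the motivation given for local mapping.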
