BAM Alignment File — Output: Alignment Counts and RPKM Expression Measurements for Each Exon • Calculate Coverage Profiles Across the Genome with “Sam2wig”

SOLiD™ Bioinformatics Overview (I)
July 2010

[Diagram labels: Primary Analysis; Secondary and Tertiary Analysis; On Instrument; ICS; SETS; Export; BioScope in cluster; BioScope in cloud; Other mapping tools; Satay Plots; Auto Correlation; Heat maps]

Auto-Export (cycle-by-cycle)
[Diagram] Instrument Cluster: Collect Images, Primary Analysis (colorcalls), Generate Csfasta & qual, Secondary Analysis. Auto Export: delta spch, run definition. Bioscope Cluster: Merge Spch files, Generate Csfasta & qual, Tertiary Analysis.

Manual Export
[Diagram] Instrument Cluster: Collect Images, Primary Analysis (colorcalls), Generate Csfasta & qual, Secondary Analysis. Bioscope Cluster: Tertiary Analysis.
• Option in SETS to export all run data or a combination of Spch, csfasta & qual files

Auto-Export database requirement removed
[Diagram labels: Instrument Cluster; Remote Cluster; JMS Broker (ActiveMQ), on both clusters; Export Daemon; ICS; Hades; Bioscope; Postgres, on both clusters; SETS; Bioscope UI; Tomcat, on both clusters; Disco installer; Auto-export installer; Bioscope installer; System installer]

SOLiD Data Analysis Workflow: Secondary/Tertiary Analysis (off-instrument)
[Diagram]
Primary Analysis: .csfasta / .qual
Secondary Analysis (BioScope): SAET (Accuracy Enhanced .csfasta); Mapping (Mapped reads (.ma)); maToBam (Mapped reads (.bam))
Tertiary Analysis (BioScope):
  Reseq: • SNP • InDel • CNV • Inversion
  WT: • Coverage • Exon Counting • Junction Finder • Fusion Transcripts
Third Party Tools

SOLiD™ Bioinformatics Overview (II)
July 2010

Outline
• Color space and 2-base-encoding
• Quality values and filtering
• Mapping algorithm and considerations
• SOLiD Webinar and Online Training
• SOLiD Software Community

What Is Color Space?
• Capillary electrophoresis uses single-base color encoding of data
[Figure labels: Collect color image; Identify peaks; Identify peak colors; Convert to base calls; Color space; Base space]

SOLiD Color Space
• SOLiD uses 2-base color encoding of data (2BE)
[Figure labels: Collect color image; Identify beads; Identify bead color; Record colors for each bead over consecutive cycles; Color space; Base space; example base sequence A C G G T C G T C G T G T G C G T]

Properties Of 2-Base Encoding (2BE)
[Figure: color matrix indexed by First Base and Second Base, with an example duplex and its colors]
        1 3 1 3 1 3 2 3
    5'-A C G T A C G A T-3'
    3'-T G C A T G C T A-5'
        1 3 1 3 1 3 2 3
• Two dibases that agree in just one base have different colors
  - color(AC) ≠ color(AG) ≠ color(AT) ≠ color(AA)
• Two dibases that do not agree in either base have the same color
  - color(AC) = color(GT) and color(CG) = color(AT)
• A dibase and its reverse have the same color
  - color(AC) = color(CA), color(GT) = color(TG)
• Repeated-base dibases have the same color
  - color(AA) = color(CC) = color(GG) = color(TT)

"Valid" and "Invalid" Adjacent Color Substitutions
• "Invalid" changes are inconsistent with a SNP and are likely sequencing errors

Quality Value (QV) For Color Call
• A score calculated based on the probability of an error call at that base
• Similar to those generated by phred and the KB Basecaller for capillary electrophoresis sequencing

  $q = -10 \log_{10} p$, where $p$ = probability of color call error

• A QV score of 10 represents a 10% error rate, whereas a QV score of 20 represents a 1% error rate

SETS Software Updates
What has changed in SETS?
• Primary Analysis Filtering enabled: removes poor quality beads, so the primary analysis results file size will be reduced. Mapping will be performed faster and matching % will improve.
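As an illustration of the 2-base encoding properties listed earlier, here is a minimal Python sketch. It assumes the commonly described formulation in which each base gets a 2-bit code (A=0, C=1, G=2, T=3) and the color of a dibase is the XOR of the two codes. The slides do not state this formulation, but it reproduces the example colors 1 3 1 3 1 3 2 3 for 5'-ACGTACGAT-3'. The function names are hypothetical.

```python
# Assumed 2-bit base codes; color = XOR of the codes of adjacent bases.
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def dibase_color(first, second):
    """Color (0-3) of a dibase under the assumed XOR formulation."""
    return BASE_CODE[first] ^ BASE_CODE[second]


def encode(seq):
    """Translate a base-space sequence into its list of colors."""
    return [dibase_color(a, b) for a, b in zip(seq, seq[1:])]


def decode(first_base, colors):
    """Recover base space from a known first base plus the colors."""
    out = [first_base]
    for color in colors:
        out.append(BASES[BASE_CODE[out[-1]] ^ color])
    return "".join(out)


colors = encode("ACGTACGAT")
print(colors)               # [1, 3, 1, 3, 1, 3, 2, 3]
print(decode("A", colors))  # ACGTACGAT
```

Decoding needs the first base as a starting point, and each decoded base depends on the one before it, so a single wrong color changes every base after it. That is one reason to align in color space rather than translating reads to bases first.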
[Figure labels: Filter; Poor quality Beads; Mappable Beads]

Why Filter?
• By removing poor quality beads, primary analysis results would be reduced by about 15% or more
• Easier to discover novel information from remaining unmatched beads
• Because a run has a smaller list of reads, mapping would be faster for similar throughput
• Improved matching percentage

Filtering Design
• Used human data as training set
• Set parameters based on the number of poor quality beads filtered
  - A value of 20 corresponds to 20% of poor quality beads filtered out
  - A value of 80 corresponds to 80% of poor quality beads filtered out
• Tested mapping using BioScope Classic mapping

Configuring Filtering from SETS
• Valid Stringency values range from 0 to 80
• Default value is 20

Mapping Algorithm
• Challenge:
  - A small word size is needed for continuous word searches in short reads. This is computationally intensive and time-consuming.
• Our Approach:
  - Use discontinuous word patterns
    > Allows faster searching and is guaranteed to find all hits up to a certain number of mismatches

Discontinuous Words
• Continuous words: searching for a perfect alignment, 8/8 bases (word size 8, e.g. used by BLAST)

    ATTTTTT GGGTAGCC CCTTGGATGAGT
            ||||||||
         AG GGGTAGCC TGATGATGGT

• Discontinuous words: searching 8/18 matches (effective word size is also 8)

    ATTTT TT GGGTA GC CCCTT GGAT GAGT
          ||       ||       ||||
          TT GACCG GC ATGGG GGAT
          11 00000 11 00000 1111    (mask 110000011000001111)

Mapping Tool - mapreads
• General features of mapping tool
  - Aligns in color space
  - Translates reference sequence to color space
  - Allows mismatches (no indels); valid adjacent mismatches can be counted as one
  - Allows masking of certain positions (bad calls)
  - For a fixed reference sequence, running time is linear in the number of reads
• New with SOLiD 3+
  - Seed and extend mapping approach
  - Multi-threaded

Local Mapping
• Motivation
  - Long reads, non-uniform quality
  - Errors tend to accumulate at the end of reads
  - Some applications show sequencing into adaptors

Local Alignment Strategy
• Map the first 25 colors of the read to the reference, allowing 2 mismatches (MM).
• For every hit found (up to the Z-limit), do a local extension
  - Accumulate alignment score (Match = 1, MM = -2 [user defined])
  - Report the best partial alignment (anchored local) based on score
    > Discard if score does not meet minimum cutoff

    Read: 0122130123012303201203021 123012310231203120103120
          ||||||| ||||||||||||||| | ||||| |||||||||| |||
    Ref:  0122130023012303201203011 123011310231203100101203

• For reads not mapped, shift anchor location and attempt additional mapping

Local Mapping: Anchor Offset
[Figure labels: start; end; reference; read; (offset)]
• start and end mark the start and end of the alignment in the reference.
• The alignment may not encompass the entire read.
• The start of the alignment in the read is called the offset.

Mapping QV: Mathematical Definition
• Mapping quality is an estimate of how likely an alignment is to be correct
• First, calculate the posterior probability

  $P(r \mid \text{Alignment}) = (1 - e)^{t - m} \, e^{m} \left(\frac{1}{4}\right)^{L - t}$

• If an alignment has a probability $P(r)$, its mapping QV is defined as
  - $-10 \log_{10}\left(1 - \frac{P(r)}{P}\right)$, where $P = \sum P(r)$ for all reads

What are mapping/pairing quality values?
• Given that a read R of length L can map to n different locations Xi (i = 1…n) in the genome, the mapping quality value represents the probability that the hypothesis "the read maps to location Xi" is true.

  Mapping quality value ~ Prob[hypothesis (R ≡ X1 | R) is true]

[Figure labels: R; X1; X2; Xn]

Difference between Mapping QV & Pairing QV
• Mapping QV represents the quality of alignment for Fragment reads, or the quality of alignment for individual tags (F3/R3/F5-P2) in paired reads
• Pairing QV represents the quality of alignment for a pair of reads. For example, if the F3 tag has 10 alignments and the F5-P2 tag has 10 alignments, then we could form 100 alignment pairs for tags F3 and F5-P2 together

Parameters that factor into Pairing Quality Values
• Alignment length
• Number of mismatches
• Offset
• Insert size
• Total number of possible alignments
[Figure labels: Offset; Alignment Length; Insert Size; F3 (+); R3/F5 (-)]

Phred quality score and Pairing Quality Values
The Phred quality score most commonly used in the literature is $-10 \times \log_{10}[\mathrm{prob(error)}]$. To be consistent with the Phred-scaled quality score, we calculate the pairing quality value (PQV) as:

  $\mathrm{PQV} = -10 \times \log_{10}\left[1 - Q(r_1, r_2, x_1, x_2)\right]$

Finally, we normalize the PQV by the maximum possible PQV for a given pair of reads of read lengths L1 and L2, to keep the PQVs in the range of 1 to 100:

  $\mathrm{PQV} = \frac{\mathrm{PQV}}{\mathrm{PQV}_{\max}} \times 100$

Multithreaded Mapreads
• Single mapping job
• A fraction of the reads (1/n) is mapped against the whole reference on each CPU
• ~20GB of RAM for the human genome
• Limit read mapping to whole genome (-z)
• Combine results (simple merge)
[Figure: CPU 1 … CPU n each map 1/n of the reads (.csfasta) against the Full Reference, producing Mapped Results 1 … Mapped Results n, which are merged into Combined Results]

Local Mapping: Advantages
• Increased throughput
  - Some data sets have shown a 2-fold increase in mapping using local mapping vs. classical mapping
• Increased speed
  - Up to 15X faster than iterative mapping with trimming
• As read length increases, only a small set of schemas needs to be optimized

Introducing
SOLiD™ University Offerings
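The anchored local extension from the Local Alignment Strategy slide can be sketched in Python as follows. This is a simplified illustration and not the mapreads implementation: it assumes the read and reference are already lined up at the anchor position with no indels, uses the slide's default scores (Match = 1, MM = -2), and keeps the highest-scoring prefix of the read. The function name and the `min_score` cutoff parameter are hypothetical. The color strings are the example from that slide with layout spaces removed.

```python
def best_anchored_extension(read, ref, match=1, mismatch=-2, min_score=0):
    """Return (aligned_length, score) for the best-scoring prefix, or None.

    Scores accumulate color by color from the anchored start; the prefix
    with the highest running score is reported as the partial alignment.
    """
    score = 0
    best_score = 0
    best_len = 0
    for i, (r, c) in enumerate(zip(read, ref), start=1):
        score += match if r == c else mismatch
        if score > best_score:
            best_score, best_len = score, i
    # Discard the hit if nothing aligned or the score misses the cutoff.
    if best_len == 0 or best_score < min_score:
        return None
    return best_len, best_score


# Example read and reference colors from the Local Alignment Strategy slide.
read = "0122130123012303201203021" "123012310231203120103120"
ref = "0122130023012303201203011" "123011310231203100101203"

print(best_anchored_extension(read, ref))  # (45, 33)
```

In this example the last four colors all mismatch, so the reported alignment stops at color 45 with a score of 33 (41 matches and 4 mismatches). This matches the idea that the alignment may not encompass the entire read.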
