SOLiD TM Bioinformatics Overview (I) July 2010
1 Secondary and Tertiary Primary Analysis Analysis
BioScope in cluster
SETS ICS
Export BioScope in cloud •Satay Plots •Auto Correlation •Heat maps
On Instrument Other mapping tools
2 Auto Export (cycle by cycle)
Collect Primary Analysis Generate Secondary Images (colorcalls) Csfasta & qual Analysis
Instrument Cluster Auto Export
delta spch run definition
Merge Generate Tertiary Spch files Csfasta & qual Analysis
Bioscope Cluster
3 Manual Export
Collect Primary Analysis Generate Secondary Images (colorcalls) Csfasta & qual Analysis
Instrument Cluster
Option in SETS to export All run data or combination of Spch, csfasta & qual files
Tertiary Analysis
Bioscope Cluster
4 Auto Export database requirement removed Instrument Cluster Remote Cluster
JMS Broker JMS Broker Export ICS Hades Bioscope (ActiveMQ) (ActiveMQ) Daemon
Postgres Postgres
SETS Bioscope UI
Tomcat Tomcat
Disco installer Auto-export installer Bioscope installer
System installer
5 SOLiD Data Analysis Workflow: Secondary/Tertiary Analysis (off instrument)
Primary Analysis Secondary Analysis Tertiary Analysis
BioScope BioScope
Reseq WT SAET •SNP •Coverage .csfasta / .qual Accuracy Enhanced .csfasta •InDel •Exon Counting •CNV •Junction Finder •Inversion •Fusion Transcripts Mapping
Mapped reads (.ma) Third Party Tools
Mapped reads (.bam) maToBam
SOLiD TM Bioinformatics Overview (II) July 2010
7 Outline
• Color space and 2 base encoding • Quality values and filtering • Mapping algorithm and considerations • SOLiD Webinar and Online Training • SOLiD Software Community
8 What Is Color Space?
• Capillary electrophoresis uses single base, color encoding of data
Collect color Identify peak Convert to Identify peaks image colors base calls
Base space
Color space
9 SOLiD Color Space
• SOLiD uses 2 base color encoding of data (2BE)
Collect color Identify bead Identify beads image color
Record colors for each bead over consecutive cycles
Color space Base space A C G G T C G T C G T G T G C G T
10 Properties Of 2 Base Encoding (2BE)
[Figure: color code table with First Base and Second Base axes]
    1 3 1 3 1 3 2 3
5’-A C G T A C G A T-3’
3’-T G C A T G C T A-5’
    1 3 1 3 1 3 2 3
• Two dibases that agree in just one base have different colors
— color(AC) ≠ color(AG) ≠ color(AT) ≠ color(AA)
• Two dibases that differ in both bases can have the same color
— color(AC) = color(GT) and color(CG) = color(AT)
• A dibase and its reverse have the same color
— color(AC) = color(CA), color(GT) = color(TG)
• Repeated base dibases have the same color
— color(AA) = color(CC) = color(GG) = color(TT)
11 “Valid” and “Invalid” Adjacent Color Substitutions
• “Invalid” changes are inconsistent with a SNP and are likely sequencing errors
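The 2 base encoding properties above can be checked with a few lines of Python. This is a minimal sketch, not the instrument's implementation: it assumes the bases are numbered A=0, C=1, G=2, T=3 and that a color is the XOR of the two base codes, which reproduces the colors shown for the example sequence. The function names are hypothetical.

```python
# Illustrative 2-base encoding: number the bases 0..3 and take the XOR of
# adjacent base codes as the color (an assumed but equivalent formulation).
BASE_TO_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_TO_BASE = "ACGT"


def encode_colors(bases):
    """Return the list of colors for each adjacent pair of bases."""
    codes = [BASE_TO_CODE[b] for b in bases]
    return [a ^ b for a, b in zip(codes, codes[1:])]


def decode_colors(first_base, colors):
    """Rebuild the base sequence from a known first base and its colors."""
    code = BASE_TO_CODE[first_base]
    decoded = [first_base]
    for color in colors:
        code ^= color  # each color tells us how to step to the next base
        decoded.append(CODE_TO_BASE[code])
    return "".join(decoded)


colors = encode_colors("ACGTACGAT")
print(colors)
print(decode_colors("A", colors))
```

Decoding needs one known base to start from, which is why a single wrong color changes every base after it when a read is translated to base space.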
12 Outline
• Color space and 2 base encoding • Quality values and filtering • Mapping algorithm and considerations • SOLiD Webinar and Online Training • SOLiD Software Community
13 Quality Value (QV) For Color Call
• A score calculated based on the probability of an error call at that base
• Similar to those generated by phred and the KB Basecaller for capillary electrophoresis sequencing
$q = -10 \log_{10} p$, where $p$ = probability of color call error
• A QV score of 10 represents a 10% error rate, whereas a QV score of 20 represents a 1% error rate
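The relationship between QV and error probability can be sketched as follows. This is only an illustration of the formula $q = -10 \log_{10} p$; the function names are hypothetical and not part of any SOLiD tool.

```python
import math


def qv_to_error_probability(qv):
    """Invert q = -10 * log10(p) to get the error probability p."""
    return 10 ** (-qv / 10)


def error_probability_to_qv(p):
    """Apply q = -10 * log10(p); p must be greater than 0."""
    return -10 * math.log10(p)


for qv in (10, 20, 30):
    print(qv, qv_to_error_probability(qv))
```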
14 SETS Software Updates
What has changed in SETS?
Primary Analysis Filtering enabled Removes poor quality beads, primary analysis results file size will be reduced. Mapping will be performed faster and matching % will improve.
Filter
Poor quality Beads Mappable Beads
15 Why Filter? • By removing poor quality beads, primary analysis results would be reduced by about 15% or more • Easier to discover novel information from remaining unmatched beads • Due to smaller list of reads of a run, mapping would be faster for generating similar throughput • Improved matching percentage
16 Filtering Design • Used Human data as training set • Set parameters based on the number of poor quality beads filtered — 20 value corresponds to 20% of poor quality beads filtered out — 80 value corresponds to 80% of poor quality beads filtered out • Tested mapping using BioScope Classic mapping
17 Configuring Filtering from SETS
•Valid ranges for Stringency are from 0 to 80 •Default value is 20
18 Outline
• Color space and 2 base encoding • Quality values and filtering • Mapping algorithm and considerations • SOLiD Webinar and Online Training • SOLiD Software Community
19 Mapping Algorithm
• Challenge: — A small word size is needed for continuous word searches in short reads. This is computationally and time intensive.
• Our Approach: — Use discontinuous word patterns > Allows faster searching and guaranteed to find all hits up to a certain number of mismatches
20 Discontinuous Words
• Continuous words: searching for a perfect alignment, 8/8 bases (word size 8, e.g. used by BLAST)
  ATTTTTT GGGTAGCC CCTTGGATGAGT
          ||||||||
       AG GGGTAGCC TGATGATGGT
• Discontinuous words: searching 8/18 matches (effective word size is also 8)
  ATTTT TT GGGTA GC CCCTT GGAT GAGT
        ||       ||       ||||
        TT GACCG GC ATGGG GGAT
        11 00000 11 00000 1111
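The discontinuous word idea can be illustrated with a short sketch: only positions marked `1` in the pattern must match, so mismatches in the `0` positions do not prevent a hit. The pattern and the two 18-base windows are taken from the slide example; the function name is hypothetical, and this is not the mapreads implementation.

```python
# Pattern from the slide: 18 positions, 8 of which must match.
MASK = "110000011000001111"


def masked_match(window_a, window_b, mask=MASK):
    """Return True if the two windows agree at every position where mask is '1'."""
    if not (len(window_a) == len(window_b) == len(mask)):
        raise ValueError("windows and mask must have the same length")
    return all(a == b for a, b, m in zip(window_a, window_b, mask) if m == "1")


read_window = "TTGGGTAGCCCCTTGGAT"
ref_window = "TTGACCGGCATGGGGGAT"
print(MASK.count("1"), masked_match(read_window, ref_window))
```

In practice several such patterns are combined so that every placement of up to a given number of mismatches is covered by at least one pattern.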
21 Mapping Tool mapreads
• General features of mapping tool — Aligns in color space — Translates reference sequence to color space — Allows mismatches (no indels), valid adjacent mismatches can be counted as one — Allows masking of certain positions (bad calls) — For fixed reference sequence, running time is linear with number of reads
• New with SOLiD 3+ — Seed and extend mapping approach — Multi threaded
22 Local Mapping
• Motivation
— Long reads, non-uniform quality
— At the end of reads errors tend to accumulate
— Some applications show sequencing into adaptors
23 Local Alignment Strategy
• Map the first 25 colors of the read to the reference, allowing 2 mismatches (MM).
• For every hit found (up to the Z limit), do a local extension
— Accumulate alignment score (Match = 1, MM penalty = 2 [user defined])
— Report the best partial alignment (anchored local) based on score
> Discard if score does not meet minimum cutoff
[Figure: read and reference color sequences aligned from the 25-color anchor, with mismatched positions highlighted along the extension]
• For reads not mapped, shift anchor location and attempt additional mapping
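The anchored extension step can be sketched as below. This is a simplified illustration, assuming the read and reference colors are already lined up at the anchor, a match adds 1, and a mismatch subtracts a penalty of 2. The function name, the `min_score` parameter, and the tie-breaking rule (keep the shortest prefix with the best score) are assumptions, not the BioScope implementation.

```python
def best_anchored_extension(read, ref, match_score=1, mismatch_penalty=2, min_score=0):
    """Return (aligned_length, score) of the best-scoring prefix, or None."""
    best_score = 0
    best_length = 0
    score = 0
    for length, (read_color, ref_color) in enumerate(zip(read, ref), start=1):
        score += match_score if read_color == ref_color else -mismatch_penalty
        if score > best_score:  # strict '>' keeps the shortest best prefix
            best_score, best_length = score, length
    if best_score < min_score:
        return None  # discard alignments below the minimum cutoff
    return best_length, best_score


# One mismatch at position 8: extending past it never beats the score of 7.
print(best_anchored_extension("0122130123", "0122130023"))
```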
24 Local Mapping: Anchor Offset
start end
reference
read (offset)
• start and end mark the start and end of the alignment in the reference. • The alignment may not encompass the entire read. • The start of the alignment in the read is called the offset
25 Mapping QV: Mathematical Definition
• Mapping quality is an estimate of how likely an alignment is to be correct
• First, calculate the posterior probability L−t − 1 P(r | Alignment ) = 1( − e)t m e m 4
• If an alignment has probability $P(r)$, its mapping QV is defined as
$-10 \log_{10}\left(1 - P(r)/P\right)$, where $P = \sum P(r)$ for all reads
26 What are mapping/pairing quality values?
• Given the fact that a read R of length L can map to n different locations Xi (i = 1…n) in the genome, mapping quality value represents the probability of the hypothesis, that the read maps to location Xi is true.
Mapping Quality value ~ Prob hypothesis (R Ξ X1 | R) is true
R
X1 X2 Xn
27 Difference between Mapping QV & Pairing QV
• Mapping QV represents the quality of alignment for fragment reads, or the quality of alignment for individual tags (F3/R3/F5-P2) in paired reads
• Pairing QV represents the quality of alignment for a pair of reads. For example, if the F3 tag has 10 alignments and the F5-P2 tag has 10 alignments, then 100 alignment pairs can be formed for the F3 and F5-P2 tags together
28 Parameters that factor into Pairing Quality Values • Alignment Length • Number of mismatches • Offset • Insert size •Total number of possible alignments
Offset Alignment Length -
R3/F5 Insert Size F3
+
29 Phred quality score and Pairing Quality Values
The Phred quality score most commonly used in the literature is $-10 \times \log_{10}[\mathrm{prob(error)}]$. To be consistent with the Phred-scaled quality score, we calculate the pairing quality value (PQV) as:
$\mathrm{PQV} = -10 \times \log_{10}\left[1 - Q(r_1, r_2, x_1, x_2)\right]$
Finally, we normalize the PQV by the maximum possible PQV for a given pair of reads of read lengths L1 and L2, to keep the PQVs in the range of 1 – 100:
$\mathrm{PQV} = \frac{\mathrm{PQV}}{\mathrm{PQV}_{\max}} \times 100$
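The two PQV steps can be written out as a small numeric sketch. The slide does not say how Q or the maximum PQV is computed, so both are simply passed in here; the function names and the example values are hypothetical.

```python
import math


def pairing_qv(q):
    """Phred-scale a pairing probability q: -10 * log10(1 - q). Requires q < 1."""
    return -10 * math.log10(1 - q)


def normalized_pqv(pqv, pqv_max):
    """Rescale a PQV against the maximum possible PQV for the read pair."""
    return pqv / pqv_max * 100


raw = pairing_qv(0.99)               # assumed example value for Q
print(raw, normalized_pqv(raw, 40))  # 40 is an assumed maximum, for illustration only
```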
30 Multithreaded Mapreads
• Single mapping job • Fraction of reads (1/n) are mapped against the whole reference • ~20GB of RAM for the human genome • Limit read mapping to whole genome (-z) • Combine results (simple merge)
[Figure: the .csfasta reads are split into n parts; each of CPU 1 … CPU n maps its 1/n of the reads against the Full Reference to produce Mapped Results 1 … n, which are merged into the Combined Results]
31 Local Mapping: Advantages
• Increased throughput — Some data sets have observed 2 fold increase in mapping using local mapping vs. classical mapping • Increased speed — Up to 15X Faster than iterative mapping with trimming • As read length increases, only a small set of schemas is needed to be optimized
32 Outline
• Color space and 2 base encoding • Quality values and filtering • Mapping algorithm and considerations • SOLID Webinar and Online Training • SOLiD Software Community
33 Introducing
34 SOLiD™ University Offerings • At Life Technologies Application Support Centers — SOLiD™ 4 System Course — SOLiD™ 4 Bioinformatics Course — SOLiD Libraries Courses: DNA, RNA Seq, Targeted Reseq • SOLiD™ Edge Live Webinar Series — SOLiD™ 4 System Essentials, SOLiD™ 4 Data Analysis Essentials — Advanced Troubleshooting — Success Stories in Cancer — Applications: RNA Seq, Targeted Resequencing, Epigenetics and Whole Genome Sequencing • SOLiD™ Elearning Series — Recorded webinars from SOLiD Edge Series — SOLiD™ EZ Bead™ Videos — SOLiD™ 4 System Bead Deposition Video
35 New at SOLiD™ University
Targeted Reseq Library Construction Course – 3 day intensive course with theory for all methods and hands-on for Agilent’s SureSelect Target Enrichment System
RNA-Seq Library Construction – 5 day intensive course covers construction of small RNA, whole transcriptome and SOLiD™ SAGE™ libraries
Location, Location, Location – Courses available at Foster City, CA and Frederick, MD in the US and Darmstadt, Germany
36 SOLiD™ Edge Live Webinar Series
SOLiD™4 System – includes these topics, SOLiD System & Data Analysis Essentials, Advanced Troubleshooting and Successes in Cancer Research
August: RNA-Seq with the SOLiD™ System – using SOLiD™ Total RNA-Seq and SOLiD™ SAGE™ Kits and SOLiD™ BioScope™ & third party tools for analysis
September: Targeted Reseq with the SOLiD™ System – strategies for enrichment and SOLiD™ BioScope™ & software community tools for analysis
October: Epigenetics with the SOLiD™ System – using SOLiD™ ChIP-Seq Kit and Methylation and SOLiD™ BioScope™ & software community tools for analysis
37 Thank You
Visit learn.appliedbiosystems.com for schedule & registration
For research use only. Not intended for human or animal therapeutic or diagnostic use.
© 2010 Life Technologies Corporation. All rights reserved. The trademarks mentioned herein are the property of Life Technologies Corporation or their respective owners
38 Outline
• Color space and 2 base encoding • Quality values and filtering • Mapping algorithm and considerations • SOLiD Webinar and Online Training • SOLiD Software Community
39 SOLiD Software Community
— SOLID System Software (Separate Session) — Open Source Software — Commercial Partners’ Software — SOLID Software Community Website
40 Solid Software Community solid.appliedbiosystems.com
41 SOLiD Software Community info.appliedbiosystems.com/solidsoftwarecommunity
• Easy Navigation — Application centric — Links to downloads
42 SOLiD Software Download (http://solidsoftwaretools.com)
43 SOLiD Dataset Download (http://solidsoftwaretools.com)
Data Management July 2010
45 Outline
• Key files and formats • File structures on the SOLiD systems • File transfer • Storage requirements
46 Data Analysis Overview: Files
Focal map .csfasta .sam / .bam .qual Input Images
.spch .csfasta.ma (mapping Text files and tab .csfasta output) delimited data files .sam / .bam Output .qual .intensity
47 File Format: .spch
• spch : SOLiD Panel Cache HDF5 • One file per panel • Contains information about the color calls at each ligation for every bead in that panel • The .spch file is in the HDF5 format >HDFView can be used to view structure and content of a HDF5 file — http://hdf.ncsa.uiuc.edu/products/hdf5/index.html. — http://hdf.ncsa.uiuc.edu/hdf java html/hdfview/index.html
48 File Format: .csfasta
• FASTA file with tag headers and sequences • Tag header: >1_88_1830_F3 — 1 = panel number — 88 = X coordinate of bead within panel — 1830 = Y coordinate of bead within panel — F3 = type of tag
>1_88_1830_F3 T2103112003130213233110321
>1_88_1830_R3 G3211312320130023232012112
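A .csfasta record as described above can be split into its parts with a short sketch. The field layout (panel, X, Y, tag) follows the slide; the `ColorRead` type, its field names, and the function name are hypothetical.

```python
from typing import NamedTuple


class ColorRead(NamedTuple):
    panel: int
    x: int
    y: int
    tag: str
    leading_base: str  # the base letter that starts the color sequence
    colors: str        # the color calls that follow it


def parse_csfasta_record(header, sequence):
    """Parse one header line (e.g. '>1_88_1830_F3') and its sequence line."""
    if not header.startswith(">"):
        raise ValueError("header must start with '>'")
    panel, x, y, tag = header[1:].strip().split("_")
    sequence = sequence.strip()
    return ColorRead(int(panel), int(x), int(y), tag, sequence[0], sequence[1:])


print(parse_csfasta_record(">1_88_1830_F3", "T2103112003130213233110321"))
```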
49 File Format: .qual
• FASTA file with tag headers and quality values
• Quality value (QV) calculated on the probability of an error call at that position
— A QV of 10 represents a 10% error rate
— A QV of 20 represents a 1% error rate
>97_2040_1850_F3
38 36 26 33 41 26 24 33 28 31 27 23 5 35 32 31 11 10 24 38 22 24 7 27 11 15 26 13 14 17 17 13 12 8 5 17 5 12
$q = -10 \log_{10} p$
$p$ = probability of color call error
50 Representation of a missing color call
• .csfasta with a ‘.’
>1_88_1830_F3
T2103112003130213233110.21
• .qual with a ‘-1’
>1_88_1830_F3
38 36 26 33 41 26 24 33 28 31 27 23 5 35 32 31 11 10 24 38 22 24 -1 27 11
• A missing color call will be treated as a mismatch by the SOLiD mapping program “MaxMapper”
• Some 3rd party software does not handle ‘.’ or ‘-1’ well; reads with those values need to be filtered out beforehand
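Pre-filtering reads with missing calls, as suggested above, might look like the following. This is a sketch under simple assumptions: records are held in memory as (header, csfasta sequence, qual line) tuples, and all function names are hypothetical.

```python
def has_missing_color(csfasta_sequence):
    """A '.' in the color sequence marks a missing color call."""
    return "." in csfasta_sequence


def has_missing_quality(qual_line):
    """A negative value (written as -1) marks a missing call in .qual."""
    return any(int(value) < 0 for value in qual_line.split())


def keep_complete_reads(records):
    """Keep only records with no missing call in either file."""
    return [
        (header, seq, qual)
        for header, seq, qual in records
        if not has_missing_color(seq) and not has_missing_quality(qual)
    ]


records = [
    (">1_88_1830_F3", "T2103112003130213233110.21", "38 36 -1 27 11"),
    (">1_88_1831_F3", "T2103112003130213233110321", "38 36 26 27 11"),
]
print([header for header, _, _ in keep_complete_reads(records)])
```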
51 File Format: Scaled Intensity Files
• FASTA file with tag headers and intensity values
• Not generated by default (very large, ~100 GB per color)
— Used with SRF file format for sequence submission to NCBI, but are not required • One file generated for each dye color for each tag F3_intensity.ScaledCY3.fasta F3_intensity.ScaledCY5.fasta F3_intensity.ScaledFAM.fasta F3_intensity.ScaledTXR.fasta >1_43_24_R3 0.0471744 0.0140623 0.000482545 0.160932 0.0100427 0.0219512 0.0158016 .... 0.00679131 0.129587 0.000653466 0.00944984
52 File Format: .csfasta.ma
• Tag_ID, 1_-6172.2:(40.4.0):q45, ...
• 1 = reference entry (fasta index)
• ‘-’ = strand (nothing if positive strand)
• 6172 = position of hit on reference
• 2 = number of mismatches in the anchor region
• 40 = alignment length of local alignment (seed+extension)
• 4 = number of mismatches of the total alignment
• 0 = alignment start in the read
• q45 = mapping quality value
>1_88_1830_F3, 1_-6172.2:(40.4.0):q45
T2103112003130213233110321
>2_89_1831_F3
T31220320101322020102301212
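The hit string in a .csfasta.ma header can be pulled apart with a regular expression. This sketch covers only the single-hit layout shown on the slide; the `MaHit` type, its field names, and the function name are hypothetical, and real files may contain variations not handled here.

```python
import re
from typing import NamedTuple


class MaHit(NamedTuple):
    reference_index: int
    strand: str
    position: int
    anchor_mismatches: int
    alignment_length: int
    total_mismatches: int
    read_offset: int
    mapping_qv: int


HIT_RE = re.compile(r"^(\d+)_(-?)(\d+)\.(\d+):\((\d+)\.(\d+)\.(\d+)\):q(\d+)$")


def parse_ma_hit(hit):
    """Parse a hit such as '1_-6172.2:(40.4.0):q45' into named fields."""
    match = HIT_RE.match(hit.strip())
    if match is None:
        raise ValueError(f"unrecognized hit: {hit!r}")
    ref, minus, pos, anchor_mm, length, total_mm, offset, qv = match.groups()
    strand = "-" if minus else "+"
    return MaHit(int(ref), strand, int(pos), int(anchor_mm),
                 int(length), int(total_mm), int(offset), int(qv))


print(parse_ma_hit("1_-6172.2:(40.4.0):q45"))
```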
53 File Format: SRF
• SRF (sequence read format) NCBI trace archive — A block based format (ZTR compressed), http://srf.sourceforge.net/ — See SOLiD™ System SRF Conversion Tool ( solid2srf) > Input: .csfasta, .qual, and .scaled_intensity files (optional) — http://solidsoftwaretools.com/gf/project/srf/
54 File Format: SAM/BAM
• SAM is an alignment format developed by the 1000 Genomes Project that includes pairing information
• Extended CIGAR format for alignment information, compact
• SAM refers to the generic format specification and the text file.
• BAM is the compressed binary version of SAM
• Resources
— Main site http://samtools.sourceforge.net/
— Format specification http://samtools.sourceforge.net/SAM1.pdf
— Mailing lists http://sourceforge.net/mail/?group_id=246254
55 File Format: SAM/BAM Header
• Reference sequence information (SQ)
— SQ lines indicate name and length of each contig in the reference file
— Location and MD5 checksum are optional fields to be included in SOLiD BAM
• Read group information (RG)
— One read group per secondary analysis result
— RG lines describe sample information tied to alignments via read group id
— Library (LB), sample name (SM), pairing insert size (PI) will be included
— LB will include library type that will distinguish mate pair, reverse reads, and fragment libraries
LB:libname-50x50MP LB:libname2-50x25RR LB:lib3-50F
— PI will include pairing range
PI:100-1400
56 File Format: SAM/BAM Header
@HD VN:1.0 GO:none SO:coordinate
@SQ SN:chr20 LN:62434914 UR:file:///share/reference/genomes/hg18/human.fa
@RG ID:RG1 LB:libname-50x50MP PI:100-1410 SM:S1
1417_237_929 115 chr1 446 150 1H49M = 1453 1057 ACCCTAACCCTAACCCTAACCCTCGCGGTACCCTCAGCCGGCCCGCCCG )6B><5/>G?45=CF&&II@27GII##GGIDEIIIIDEIIID?DGGIIF RG:Z:sys.S1 CS:Z:T23003300303032122001310233220010310010320010320011 CQ:Z:166626/14987/6:69514717#6:71',5;/&25//'.26)'/.12%% MD:Z:49M NM:i:0
Specification Details: http://samtools.sourceforge.net/SAM1.pdf
57 File Format: SAM/BAM Header
• 1417_237_929 115 chr1 446 150 1H49M = 1453 1057 ACCCTAACCCTAACCCTAACCCTCGCGGTACCCTCAGCCGGCCCGCCCG )6B><5/>G?45=CF&&II@27GII##GGIDEIIIIDEIIID?DGGIIF RG:Z:sys.S1 CS:Z:T23003300303032122001310233220010310010320010320011 CQ:Z:166626/14987/6:69514717#6:71',5;/&25//'.26)'/.12%% MD:Z:49M NM:i:0
• 1417_237_929: Bead ID
• 115: SAM flag (strand / mate's strand / proper orientation, etc.)
• chr1: Aligned contig
• 446: Start position (always forward strand)
• 150: Mapping quality value
• 1H49M: CIGAR string
• =: Mate contig
• 1453: Mate start position
• 1057: Insert size
• ACCCTAACCCTAACCCTAACCCTCGCGGTACCCTCAGCCGGCCCGCCCG: Base sequence
• )6B><5/>G?45=CF&&II@27GII##GGIDEIIIIDEIIID?DGGIIF: Base quality value
• RG:Z:sys.S1: Read group
• CS:Z:T23003300303032122001310233220010310010320010320011: Color sequence
• CQ:Z:166626/14987/6:69514717#6:71',5;/&25//'.26)'/.12%%: Color quality
• MD:Z:49M: MD string
• NM:i:0: Edit distance
Specification Details: http://samtools.sourceforge.net/SAM1.pdf
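The flag value 115 in the example record packs several yes/no properties into one integer. A small sketch that unpacks it is shown below; the bit values follow the SAM specification, while the short names and the function name are chosen here for illustration.

```python
# Bit values from the SAM specification; the labels are informal.
SAM_FLAG_BITS = [
    (0x1, "paired"),
    (0x2, "proper_pair"),
    (0x4, "unmapped"),
    (0x8, "mate_unmapped"),
    (0x10, "read_reverse_strand"),
    (0x20, "mate_reverse_strand"),
    (0x40, "first_in_pair"),
    (0x80, "second_in_pair"),
    (0x100, "not_primary"),
    (0x200, "fails_qc"),
    (0x400, "duplicate"),
]


def decode_sam_flag(flag):
    """Return the names of all bits that are set in a SAM flag."""
    return [name for bit, name in SAM_FLAG_BITS if flag & bit]


print(decode_sam_flag(115))
```

For 115 (1 + 2 + 16 + 32 + 64) this lists a properly paired read on the reverse strand whose mate is also on the reverse strand, and which is the first read of the pair.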
58 Alignment records
• CIGAR string
— Condensed alignment descriptor nX[nX,nX,…] where X is the operation and n is the number of operations
Op  Description
M   Alignment match (can be a sequence match or mismatch)
I   Insertion to the reference
D   Deletion from the reference
N   Skipped region from the reference
S   Soft clip on the read (clipped sequence present in <seq>)
H   Hard clip on the read (clipped sequence NOT present in <seq>)
P   Padding (silent deletion from the padded reference sequence)
59 Alignment records
• CIGAR examples
Perfect 50mer match 50M
50mer with 1 mismatch 50M
Anchor extend to 47bp 47M3H
Start at 10bp, extend to 50bp 10H40M
Start at 10bp, extend to 50bp, bottom strand 40M10H
2bp deletion 25M2D25M
2bp insertion 24M2I24M
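The CIGAR examples above can be checked by counting how many reference bases and how many stored read bases each string covers. This is a minimal sketch using the operation letters from the SAM specification; the function names are hypothetical.

```python
import re

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP])")


def parse_cigar(cigar):
    """Split a CIGAR string into (count, operation) pairs."""
    ops = [(int(count), op) for count, op in CIGAR_RE.findall(cigar)]
    if "".join(f"{count}{op}" for count, op in ops) != cigar:
        raise ValueError(f"invalid CIGAR: {cigar!r}")
    return ops


def reference_span(cigar):
    """Reference bases covered: M, D and N consume the reference."""
    return sum(count for count, op in parse_cigar(cigar) if op in "MDN")


def stored_read_length(cigar):
    """Bases present in the sequence field: M, I and S (hard clips are not stored)."""
    return sum(count for count, op in parse_cigar(cigar) if op in "MIS")


for example in ("50M", "47M3H", "10H40M", "25M2D25M", "24M2I24M"):
    print(example, reference_span(example), stored_read_length(example))
```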
60 Samtools (SAM manipulation tool) in C
• http://samtools.sourceforge.net/
• import: SAM-to-BAM conversion
• view: BAM-to-SAM conversion and subalignment retrieval
• sort: sorting alignment
• merge: merging multiple sorted alignments
• index: indexing sorted alignment
• faidx: FASTA indexing and subsequence retrieval
• tview: text alignment viewer
• pileup: generating position-based output and consensus/indel calling
61 Samtools ‘tview’ : alignment viewer
62 SAM Format Bioinformatics software
• Aligners natively generating SAM
• BFAST, `Blat-like Fast Accurate Search Tool' for Illumina and SOLiD reads.
• BWA, Burrows-Wheeler Aligner for short and long reads.
• GEM library. Short read aligner. Convertor provided by the developers.
• Karma, the K-tuple Alignment with Rapid Matching Algorithm.
• Novoalign. An accurate aligner capable of gapped alignment for Illumina short reads. Academic free binary.
• SNP-o-matic, short read aligner and SNP caller.
• SSAHA2 (since v2.4). Classical aligner for both short and long reads.
• Stampy, by Gerton Lunter. An accurate read aligner capable of gapped alignment for Illumina short reads. Used for indel discovery on the 1000 genomes data. Not released.
• TopHat for mapping short RNA-seq reads bridging exon junctions.
• Programs processing SAM/BAM
• GAP5, sequence assembly viewer, editor and analyzer. Capable of importing BAM files and outputting SAM.
• GATK, the Genome Analysis Toolkit. Rich functionality including an accurate SNP caller. Built upon Picard.
• GBrowse, generic genome browser. Experimental SAM/BAM alignment viewing. Built upon Perl APIs.
• IGV, the Integrative Genomics Viewer. Elegant alignment viewer supporting multiple tracks and genome annotations. Built upon Picard.
• LookSeq, web-based alignment/annotation viewer.
• samToBed by Aaron Quinlan. Converting alignments in the SAM format to the BED format.
• Vancouver Short Read Analysis Package (in particular FindPeaks), post alignment processing of new sequencing data.
63 Samtools pileup for coverage plots • Using BAM file with Samtools pileup • GNUplot to generate plots
64 Outline
• Key files and formats • File structures on the SOLiD systems • File transfer • Storage requirements
65 SOLiD File System
66 SOLiD File System – Multiplex Sample
67 Outline
• Key files and formats • File structures on the SOLiD systems • File transfer • Storage requirements
68 File Transferring
[Figure: SOLiD system network layout]
• Internal network, isolated from the external network; the headnode is a firewall
• Internet / External network: headnode eth1, LAN/WAN IP address; requires 1 Gigabit LAN
• Gigabit switch (internal network)
— Instrument controller (XP): eth0, 10.1.1.3
— headnode: eth0, 10.1.1.1; Perc 6, Results 4 TB
— compute node 0: eth0, 10.1.1.100
— compute node 1: eth0, 10.1.1.101
— compute node 2: eth0, 10.1.1.102
• Images: MD1000, 9 TB; optional MD1000 for results, to 13 TB
• UPS
• Communication Protocols
— Headnode and XP: SAMBA
— Headnode and Compute nodes: NFS
69 Copy Data To Remote Storage
• rsync • scp/sftp • NFS mount remote storage on SOLiD • Use cp/tar/gzip to copy files • External drive • USB drive mounted on the head node
70 Outline
• Key files and formats • File structures on the SOLiD systems • File transfer • Storage requirements
71 SOLiD™ 4 On Instrument Data Size And Storage
Run (50 nt tag, 300K/panel,     Image data   Primary analysis     Primary analysis data size
2357 panels)                    size         data size (.spch)    (.csfasta, QV.qual, .stats)
1 slide – 1 tag (Frag)          1.84 TB      646 GB               170 GB
1 slide – 2 tags (Mate pair)    3.6 TB       1.29 TB              340 GB
2 slides – 1 tag (Frag)         3.6 TB       1.29 TB              340 GB
2 slides – 2 tags (Mate pair)   7.2 TB       2.58 TB              680 GB
1 slide – 2 tags (Paired End)   2.8 TB       969 GB               255 GB
2 slides – 2 tags (Paired End)  5.6 TB       1.90 TB              510 GB
(Assumes deposition densities of 300K beads/panel, 2357 panels)
72 SOLiD 4 Instrument Data Storage Capacity
• DELL MD1000 — 15x 750 GB SATA hard drives — RAID 5 w/ hot swap capabilities — /data/images/ 8.9 TB
• Head node HD — 6x 1 TB SATA hard drives — RAID 5 — /data/results/ 4 TB
SOLiD BioScope Software Overview June 2010
74 Topics • Overview — General overview
• ReSeq pipelines — Overview of functions — Demo and hands on
• Whole Transcriptome (WT) pipelines — Overview on the functions — Demo and hands on
75 BioScopeBioScope SoftwareSoftware OverviewOverview • Introduction • Flexible Access (Available options) • Graphical User Interface (GUI) • Command Line • Monitoring Job Status • Data Standardization to BAM format • Analysis and Libraries • SAET Tool for error correction
76 SOLiD Analysis Workflow
[Figure: SOLiD analysis workflow stages; the rotated stage labels did not survive extraction. Components shown: ICS, SETS, BioScope (offline)]
Output: .spch, .csfasta, .qual, .ma, .mates, .bam, .sam
77 BioScope Overview
Primary Secondary Tertiary Image Mapping Visualization Analysis Analysis Analysis
Off-Instrument Compute Cluster
mapping ReSequencing
Output in Common file format
Whole Transcriptome
BioScope
78 BioScope Software • Out of the box tools/pipelines for
Secondary Analysis Tertiary Analysis
Mapping SAET - NEW Whole Transcriptome Mapping statistics Resequencing •WT mapping Position Errors •SNP/diBayes •Count known exons • Pairing •Inversion Create UCSC WIG file •CNV Fusion and splicing - NEW •Small Indel ChiPSeq mapping - NEW •Large Indel
• Simple GUI to allow ease of use to run pipelines • Flexible command line to enable custom pipeline development
79 Offline Cluster
Minimal Offline Cluster Spec for 1 billion reads
• Dedicated Head node
• 3+ Compute nodes
• Head Node
— > 2 GHz processors
— 16 GB RAM
— 100 GB local disk space for OS + software installation
• Compute Node
— > 2 GHz processors
— 16+ GB RAM (24 GB recommended) and 8+ cores per node
— > 500 GB scratch space per node
• 1 Gb switch
• OS: CentOS 4.x, 5.x / RedHat 4.x, 5.x
• Job Manager: PBS Torque / SGE
• Storage > 10 TB
80 Development principles for BioScope
• Easy to use and flexible
— GUI
— Command line
• Similar operation for all functions
— GUI
> Global settings
> Application settings
> Advanced settings
— Command line
> bioscope.sh -l log/ instruction.ini
• Consistent file structure for easy tracking
— workingDir/output
— workingDir/temp
— workingDir/intermediate
— workingDir/log
81 BioScope GUI – 1.2 Dashboard
82 BioScope GUI Global Settings
• Define working directory, output locations for results, logs, etc.
83 BioScope GUI Application Settings
• Specify application specific settings, input files, reference files
84 BioScope Advanced Settings
• Specify analysis settings, and optional plugins and analysis
85 BioScope Parameter Validation
• Validates input settings: integer, range, file extension, not null etc
86 BioScope Plugin Selection
87 BioScope Job Submission
88 BioScope Job Monitoring – View Run History
Step 1: Select Step 2: Select the Analysis the folder you History to view want to view
Note: The BioScope Analysis History is based on the time created. User no longer has to go to the cluster and use command line to view log and results files
89 BioScope Job Monitoring – Log File Access
Step 3: Select Step 4. Open or log file to view Save log file
90 BioScope file structure
• Working directory • config/ for all the instruction files such as .ini and .plan files • log/ for all the running log files • output/ for the output results • temp/ temp file for pipeline • intermediate/ intermediate files for pipeline
• Default working directory • /data/result/secondary/
91 BioScope Command Line
• Create the following files
• .ini
• .plan
• Execute
• bioscope.sh -l
Recommendation: • Copy and edit the example .ini and .plan files provided with BioScope • Use BioScope Interface to generate file templates
92 BioScope Files (.ini)
• Defines input / output directories and folders
• Parameter settings
• Can contain multiple pipelines
## global parameters
base.dir=./
output.dir = ${base.dir}/outputs
temp.dir = ${base.dir}/temp
intermediate.dir = ${base.dir}/intermediate
log.dir = ${base.dir}/log
reads.result.dir.1 = ${base.dir}
reads.result.dir.2 = ${base.dir}
reference.dir = /data/results/bioscope_examples/examples/references
scratch.dir=/scratch/solid
## global settings for the pipeline run
import ../../globals/global.ini
reference = ${reference.dir}/chr20.fasta
run.name = myRun
sample.name = chr20
primer.set = F3
read.length = 50
output.dir = ${base.dir}/../outputs
## qv filtering pipeline
classify.run = 1
read.dir = ${base.dir}/../../human_var/secondary/JOAN/mappingF3
read.file.prefix = ${run.name}_${sample.name}_${primer.set}
mapping.tagfiles.dir = ${output.dir}/qvfiltered
filtering.qv.filtered.dir = ${output.dir}/qvfiltered
filtering.qv.failed.dir = ${output.dir}/qvfail
## mapping pipeline
mapping.run = 1
mapping.tagfiles.dir = ${base.dir}/../../human_var/secondary/JOAN/mappingF3
mapping.output.dir = ${output.dir}/s_mappingF3
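The `${name}` references in the .ini example can be expanded with a small sketch like the one below, which is handy for checking where a setting will point before a run. It is an illustration only, not BioScope's own parser: it reads `key = value` lines, skips comments and `import` lines, and substitutes references textually. The function names are hypothetical.

```python
import re

VARIABLE_RE = re.compile(r"\$\{([^}]+)\}")


def parse_ini(text):
    """Collect key = value pairs, ignoring comments and lines without '='."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        values[key.strip()] = value.strip()
    return values


def resolve(values, key, max_depth=10):
    """Expand ${name} references in values[key]; unknown names are left as-is."""
    value = values[key]
    for _ in range(max_depth):
        expanded = VARIABLE_RE.sub(lambda m: values.get(m.group(1), m.group(0)), value)
        if expanded == value:
            break
        value = expanded
    return value


sample = """
## global parameters
base.dir=./
output.dir = ${base.dir}/outputs
## mapping pipeline
mapping.run = 1
mapping.output.dir = ${output.dir}/s_mappingF3
"""
settings = parse_ini(sample)
print(resolve(settings, "mapping.output.dir"))
```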
93 BioScope Files (.plan)
• List of .ini files to run
• “=” denotes jobs to be run in parallel
• “+” breaks up sets of parallel jobs
[Figure: a .plan file pointing to .ini files run serially and in parallel, leading to output]
=example.mappingF3.ini
=example.mappingR3.ini
example.mapStats.ini
+
=example.posErrors.ini
=example.MaToBam.ini
94 BioScope Job Monitoring
• GUI: History Tab & Access to Log Files • Command line: Check job submission status in three ways • Verify job submitted to cluster • qstat • Monitor BioScope logs for errors • Examine output directory
95 BioScope Job Monitoring (Cluster Submission)
• Use the command “qstat” to monitor job progression
• When job is complete, proceed to check log files
96 BioScope Job Monitoring (Log Files)
• Navigate to output directory and look at the bioscope.[ timestamp ].log in the log folder • Look for “Finished successfully”
16 Nov 2009 13:03:00,044 INFO [main] JMSEventReceiver:68 - JMSEventReceiver waitForEvent after taking Whole Transcriptome Counttag completed successfully
16 Nov 2009 13:03:00,046 INFO [main] JMSEventReceiver:70 - Event Whole Transcriptome Counttag completed successfully received on selector 'a1bd33de-2d30-48f5-92b4- e084add31d75' 16 Nov 2009 13:03:00,047 INFO [main] AnalysisJobManager:79 - wt.counttag.run completed. 16 Nov 2009 13:03:00,047 INFO [main] PluginJobManager:118 - Finished successfully 16 Nov 2009 13:03:00,047 INFO [main] PluginJobManager:102 - >>>> END of PluginJobManager >>>> date=2009-11-16 13:03:00.047 PST 16 Nov 2009 13:03:00,048 INFO [main] PluginJobManager:104 - >>>> END of PluginJobManager >>>> date DURATION=4 secs
97 BioScope Job Monitoring (Log Files)
• Hierarchical structure to log files • Check higher level logs for general errors and lower level logs for more specific errors
bioscope.[ timestamp ].log
console.log
mapping.{}.main.[ timestamp ].log
mapping.scatter.[ timestamp ].log
98 BioScope Job Monitoring (Log Files)
• Use log file to diagnose errors
22 Dec 2009 11:52:56,967 FATAL [main] PluginJobManager:99 - Bioscope failed
java.io.IOException: ReferenceFile does not exist: /data/results/users/user\#\#/bioscope_runs/intermediate/spljunctionextraction/junction.fasta
at com.apldbio.aga.analysis.secondary.mapping.reseq.MappingPipeline.setUpRefParams(MappingPipeline.java:374)
at com.apldbio.aga.analysis.secondary.mapping.reseq.MappingPipeline.validateParams(MappingPipeline.java:286)
at com.apldbio.aga.analysis.tertiary.wt.plugin.JunctionMappingPipeline.validateParams(JunctionMappingPipeline.java:43)
at com.apldbio.aga.analysis.exec.PluginRunner.preparePipeline(PluginRunner.java:104)
at com.apldbio.aga.analysis.exec.PluginRunner.doMain(PluginRunner.java:137)
at com.apldbio.aga.analysis.exec.PluginRunner.main(PluginRunner.java:339)
22 Dec 2009 11:52:56,968 INFO [main] PluginJobManager:102 - >>>> END of PluginJobManager >>>> date=2009-12-22 11:52:56.968 PST
22 Dec 2009 11:52:56,968 INFO [main] PluginJobManager:104 - >>>> END of PluginJobManager >>>> date DURATION=1 minutes 43 secs
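Checking a log for the two outcomes shown on these slides can be scripted. The sketch below only looks for the strings that appear in the example logs ("FATAL" as the log level and "Finished successfully"); the function name and the three return labels are hypothetical, and real logs may need additional checks.

```python
def classify_log(text):
    """Classify a BioScope-style log as 'failed', 'succeeded' or 'unknown'."""
    if " FATAL " in text:
        return "failed"  # a FATAL entry takes priority over any success line
    if "Finished successfully" in text:
        return "succeeded"
    return "unknown"


ok_log = "16 Nov 2009 13:03:00,047 INFO [main] PluginJobManager:118 - Finished successfully"
bad_log = "22 Dec 2009 11:52:56,967 FATAL [main] PluginJobManager:99 - Bioscope failed"
print(classify_log(ok_log), classify_log(bad_log))
```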
99 BioScope Software Overview • Introduction • Flexible Access • Graphical User Interface (GUI) • Command Line • Monitoring Job Status • Data Standardization to BAM format • Analysis and Libraries • SAET
100 SOLiD BioScope Data Standard : SAM/BAM • Contains both color space sequence as well as base based sequences
• SAM (sequence alignment map) – BAM is binary version
• http://samtools.sourceforge.net/
• Tab delimited text format
• Contains base space quality values
• Extended CIGAR format for alignment information, compact
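To make the tab-delimited layout and the extended CIGAR string concrete, here is a small sketch in Python. The SAM record is a made-up illustration (not BioScope output), and the helper names `parse_cigar` and `reference_span` are assumptions introduced for this example.

```python
import re

# Operations defined for extended CIGAR strings in the SAM format.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
# Operations that advance the position on the reference.
REF_CONSUMING = set("MDN=X")


def parse_cigar(cigar):
    """Split an extended CIGAR string into (length, operation) pairs."""
    if not re.fullmatch(r"(\d+[MIDNSHP=X])+", cigar):
        raise ValueError(f"not a valid CIGAR string: {cigar!r}")
    return [(int(length), op) for length, op in CIGAR_OP.findall(cigar)]


def reference_span(cigar):
    """Number of reference bases covered by the alignment."""
    return sum(length for length, op in parse_cigar(cigar) if op in REF_CONSUMING)


# Illustrative record: the 11 mandatory SAM columns, tab separated.
record = "\t".join(
    ["read1", "0", "chr20", "106420", "60", "20M2D5M", "*", "0", "0", "A" * 25, "I" * 25]
)
fields = record.split("\t")
position, cigar = int(fields[3]), fields[5]
last_base = position + reference_span(cigar) - 1
print(parse_cigar(cigar), last_base)
```

The 2-base deletion in this example extends the aligned region on the reference beyond the 25 bases of the read.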
101 Visualization: Integrative Genomics Viewer (BAM)
— http://www.broadinstitute.org/igv beta/index.html
102 SOLiD Accuracy Enhancement Tool (SAET)
• Modified version of the spectral alignment error correction algorithm proposed by Pevzner et al. (2001)
• SAET takes quality values and properties of color space into account, significantly improving the performance of the original method for SOLiD data
• Reduces the color calling error rate by a factor of three to five without requiring a reference genome
103 SAET Usage Considerations
• Decrease in error rate demonstrated in
— Targeted Resequencing of large genomes
— Whole Genome Sequencing on smaller genomes (under 200 Mb)
— De Novo Assembly of smaller genomes (under 200 Mb)
• SAET genome considerations
— Size: 1 Kbp to 200 Mbp
— Coverage: 30x to 4000x
— Read length: 25 to 75 bp
• BioScope Software command line only
• For detailed use and considerations see Appendix B of BioScope 1.2 User Manual
104 SOLiD BioScope Software: Resequencing June 2010
105 BioScope Applications
[Diagram: Image Analysis, Primary Analysis, Secondary Analysis (Mapping), Tertiary Analysis, Visualization. On the off-instrument compute cluster, ReSeq mapping produces BAM, which feeds the ReSeq tools: SNP, Small Indel, Large Indel, CNV, Inversions.]
106 Resequencing Tools
• Secondary Analysis — Mapping — Pairing — Position error
• Tertiary Analysis — DiBayes — Large Indel — Small Indel — Inversion — CNV
107 Mapping
• Reads are mapped to reference genome
• Output workingDir/output/
— F3 workingDir/output/F3/s_mapping/
— R3 workingDir/output/R3/s_mapping/
• Mapping state
• Position error
• And optional small indel detection
108 Local Alignment Strategy
• Map the first 25 colors of the read to the reference, allowing 2 mismatches (MM).
• For every hit found (up to the Z limit), do a local extension
— Accumulate alignment score (match = +1, mismatch penalty = 2 [user defined])
— Report the best partial alignment (anchored local) based on score
> Discard if score does not meet minimum cutoff
[Figure: example color-space read aligned against the reference, showing the anchored seed and its local extension.]
• For reads not mapped, shift anchor location and attempt additional mapping
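The scoring idea behind the anchored local extension can be sketched in a few lines of Python. This is an illustration only, assuming a match adds 1 and a mismatch subtracts a penalty of 2 as on the slide; the function name and the `min_score` cutoff value are hypothetical, and the real mapper's seeding and hit limits are not modelled.

```python
def best_anchored_extension(read, ref, match_score=1, mismatch_penalty=2):
    """Score an alignment anchored at position 0 and keep the best prefix.

    Walks along the read and reference in color space, accumulating the
    score, and returns (best_length, best_score) for the highest-scoring
    partial alignment that starts at the anchor.
    """
    best_length, best_score, score = 0, 0, 0
    for i, (read_color, ref_color) in enumerate(zip(read, ref), start=1):
        score += match_score if read_color == ref_color else -mismatch_penalty
        if score > best_score:
            best_length, best_score = i, score
    return best_length, best_score


def passes_cutoff(read, ref, min_score):
    """Apply the minimum-score filter (min_score is an assumed parameter)."""
    return best_anchored_extension(read, ref)[1] >= min_score


# Eight matching colors followed by two mismatches:
# the best partial alignment stops before the mismatches.
length, score = best_anchored_extension("0123012301", "0123012333")
print(length, score)
```

Because mismatches near the end of a read lower the running score, the reported alignment can be shorter than the full read, which is the behaviour the "best partial alignment" bullet describes.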
109 Mapping output file .ma
110 Pairing (if mate pair library)
• Match the mate pairs (F3 and R3) • Output pairing result in workingDir/output/pairing • Pairing states • State for pairing distance • The data will be used in downstream pipelines: Large Indel, Small Indel, Inversion, CNV
111 Hands on • Log on • Working from command line • Work with a mate pair library • Run ReSeq Pipeline — Mapping F3, and R3 — Pairing — Position error — SNP call
112 Command line • Take a look at working directory • File structures in working directory — Config/ — Output/ — Temp/ — Intermediate/ • Config/ contains instruction files. — .plan — .ini file • And start the run — Monitor the run — Important output file .ma file, bam file and gff file
113 BioScope Hands on
• Log on to the cluster (IP: 167.116.104.43) • User name: user## • Password: solid
• Go to user working directory • cd /data/results/users/user##/resequencing/
• Examine plan file (example.plan) • cd config/
• Note Separate Directories for F3 and R3 reads
114 Hands on run at command line
• Check bioscope
$> which bioscope.sh
• Run
$> nohup bioscope.sh example.plan &
• Monitor the run
$> qstat
$> showq
• Find the output files, using the BioScope Manual as reference
115 Continue on the Resequencing tools
116 Small Indel Tool
• Input: BAM file from secondary analysis • Output: candidate indel list file (.pas, .gff)
A C G G T C - - C G T G T G C G T 2 base insertion
A C G G T C G T C G T G T G C G T
117 Small Indel Tool: Two ways to execute
• Can be run with fragment data as a plugin with the mapping step (smallIndelFrag) • Can be run with mate pair data as a standalone module ( GUI – Command line)
Library   Largest deletion size   Largest insertion size
LMP       500                     20
PE        11                      3
FRAG      11                      3
118 Large Indel Tool
Compatible with MatePair data, will call indels up to 100Kb • Input: BAM file from secondary analysis • Output: Large indel GFF file (.gff)
[Figure: mate-pair clones from the sample mapped to the reference. Concordant clone: mapped distance as expected. Discordant clone, insertion: appears smaller. Discordant clone, deletion: appears larger.]
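The concordant / discordant distinction can be expressed as a simple range check on the mapped mate-pair distance. The sketch below is illustrative only: the function name, the three-standard-deviation window, and the example insert sizes are assumptions, not the Large Indel Tool's actual parameters.

```python
def classify_clone(mapped_distance, mean_insert, sd_insert, n_sd=3.0):
    """Classify a mate pair by its mapped distance on the reference.

    A clone spanning an insertion in the sample maps closer together on
    the reference than expected; a clone spanning a deletion maps
    farther apart.
    """
    low = mean_insert - n_sd * sd_insert
    high = mean_insert + n_sd * sd_insert
    if mapped_distance < low:
        return "discordant: insertion candidate"
    if mapped_distance > high:
        return "discordant: deletion candidate"
    return "concordant"


# Hypothetical library: 1500 bp mean insert size, 150 bp standard deviation.
for distance in (1500, 900, 2600):
    print(distance, classify_clone(distance, mean_insert=1500, sd_insert=150))
```

A single discordant clone is weak evidence on its own; a caller would normally require several clones supporting the same event.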
119 Inversion Tool
Compatible with MatePair data • Input: BAM file from secondary analysis (pairing) • Output: Inversion GFF file and auxiliary .txt files
[Figure: F3/R3 orientation diagrams for each pairing code]
AAA: Normal
BAX: Inversion
BBX: Inversion
ABX: Tandem repeat / double inversion
120 Mate pair Descriptions R3 F3 Normal
• Mate-pairs are annotated with a three letter code
121 Inversion Pipeline
Use the Pairing Result from Mate pair library
R3 F3 Normal
F3 R3
R3 F3 Inversion
F3 R3
122 Copy Number Variation (CNV) Tool
Compatible with human data only
• Input: BAM file from secondary analysis, Mappability files
• Output: GFF file, text based files with CNV calls
123 BioScope™ Software Demo – SNP Finding
124 Dashboard
125 Advanced Settings: Find SNPs If needed, modify analysis settings
126 SNP Demo – start
127 Successful Job Submission
Make note (or take screenshot) of the log and output file locations
Log
Output
128 SNP Demo – check log file
129 SNP Demo – check results – gff3 file
130 Human Chr20 Demo data – IGV chr20:106420
131 SNP Detection
132 SOLiD™ System Enables SNP Calls at Low Coverage
ILMN sequence reads: Each color represents 1 base position. A single sequencing error can easily be miscalled as a SNP and requires higher coverage to discriminate.
AB SOLiD™ 3 System sequence reads: Each color represents 2 base positions, and information on each position is represented by 2 colors. A SNP typically will result in 2 color changes; therefore single measurement errors can easily be identified and excluded from analysis.
[Figure: aligned reads for each platform. ILMN panel: "Error? or SNP?" SOLiD panel: correction by 2-base encoding, 2 adjacent color changes = SNP.]
*Schematic not to scale for typical SNP detection coverage
133 Example diBayes SNP report
134 Post processing SNPs examples • P value – filter out SNPs close to 1 • Color QV – is the mean color QV much lower than average ? • Low coverage – remove SNPs with very low coverage • Filter out SNPs with flags • Repeat the analysis with a more stringent setting
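The filtering ideas listed above translate naturally into a small predicate applied to each SNP record. This is a hedged sketch: the dictionary keys (`coverage`, `p_value`, `mean_color_qv`, `flags`) and every threshold are hypothetical placeholders chosen for illustration, not the actual diBayes output columns or recommended cutoffs.

```python
def keep_snp(snp, min_coverage=5, max_p_value=0.9, min_mean_color_qv=15):
    """Return True if a SNP record passes all post-processing filters.

    All field names and thresholds are assumptions for illustration.
    """
    return (
        snp["coverage"] >= min_coverage                # drop very low coverage
        and snp["p_value"] <= max_p_value              # drop p-values close to 1
        and snp["mean_color_qv"] >= min_mean_color_qv  # drop low mean color QV
        and not snp["flags"]                           # drop flagged SNPs
    )


# Made-up records, one per filter outcome.
snps = [
    {"id": "snp1", "coverage": 22, "p_value": 0.001, "mean_color_qv": 24, "flags": []},
    {"id": "snp2", "coverage": 2, "p_value": 0.010, "mean_color_qv": 25, "flags": []},
    {"id": "snp3", "coverage": 30, "p_value": 0.990, "mean_color_qv": 23, "flags": []},
    {"id": "snp4", "coverage": 18, "p_value": 0.002, "mean_color_qv": 26, "flags": ["h4"]},
]
kept = [snp["id"] for snp in snps if keep_snp(snp)]
print(kept)
```

Keeping each criterion as a separate, named threshold makes it easy to repeat the filtering with a more stringent setting.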
135 Post processing / filtering SNPs
Low coverage
Filter for low NovelQV
Flag not called as Het
P-value close to 1
136 Filtered SNP list – what next ?
137 SIFT – filter SNP list for exonic SNPs
138 SIFT – predictions for SNPs / protein
139 SIFT results
140 Nat Genetics paper on second run !!!
141 The Importance of Accuracy
Accuracy will enable
• Detecting variants at low coverage
• Detecting low frequency mutations among pooled samples or heterogeneous samples
• Accurately detecting rare mutations
• Reducing false positives and downstream validations
Rare variants can only be detected with a high accuracy system
Bodmer & Bonilla, Nature Genetics 40, 695-701 (2008)
TA Manolio et al., Nature 461, 747-753 (2009)
142 BioScope DiBayes
• Workflow
• Input files
— Reference sequence — Mapping result or Pairing result or Both (Bam file) — Reads quality — Position error files (created by mapping pipeline) > Error rate at each position on read of two color code
• Output files
— workingDir/output/diBayes/ — SNP output files > report in GFFv3 format > Consensus_call.txt
143 output
• outputs/
— diBayes
> chr_1/
> exampleExperiment_Consensus_Basespace2.fasta
> exampleExperiment_SNP.gff3
— pairing
> F3 R3 Paired.bam
> F3 R3 Paired.bam.bai
> pairDistFreq/
> pairingStats.stats
> unmappedBamFile.bam
— position errors
> F3 R3 Paired_F3_positionErrors.txt
> F3 R3 Paired_F3_positionErrors.txt
— s_mappingF3
> mapping stats.txt
> myRun_chr20_F3.csfasta.ma
— s_mappingR3
> mapping stats.txt
> myRun_chr20_R3.csfasta.ma
144 SOLiD BioScope Software: WT July 2010
145 BioScope Applications
[Diagram: Image Analysis, Primary Analysis, Secondary Analysis (Mapping), Tertiary Analysis, Visualization. On the off-instrument compute cluster, WT mapping produces BAM, which feeds the WT tools: Coverage, Counting, Fusion.]
146 Outline
• Introduction
— RNA Seq (Whole Transcriptome Analysis)
• BioScope™ Software Version 1.2
— Single Read Whole Transcriptome Pipeline
— Paired End Whole Transcriptome Pipeline
• Splicing detection
— Detect exon-exon junctions
— Alternative splicing
— Gene fusions
• Gene expression
— Count expression of exons, transcripts, genes (count tags)
— Count coverage across genome area (wig)
147 RNASeq: Whole Transcriptome Library Prep
Few important points:
− Libraries are strand-specific
− Standard protocol calls for RNase III digestion…
− Can sequence from P1 (“single end”) or from P1 and P2 (“paired end”)
− Barcode sequencing ordinary
[Workflow figure: Total RNA; small RNA depletion (5S, 5.8S rRNA; tRNAs); Poly(A) selection; RiboMinus kit (18S, 25S rRNA); RNase III fragmentation; adapter ligation; reverse transcription; gel purification; PCR, ~15 cycles; column purification. Product: P1, IA, P2 with ~120 bp inserts.]
148 SOLiD™ 4.0 WT Pipeline Overall Flow
[Flow diagram]
WT RNA Library (Ambion)
SOLiD™ 4.0 Instrument: Primary analysis, producing csfasta/qual
Secondary analysis: Map Paired End WT or Map Single Read WT
Tertiary analysis: Count annotations, Coverage, Junction and Fusion Finder
149 Bioscope™ 1.2 Software Single Read WT Pipeline
Splice Junction Extractor Mapping Stages
Genomic Junction Map Filter Map Map*
Merge Stage Legend Merge ma* * = required plugin
Tertiary Stages
Sam2Wig Counttag
150 Gene Expression
• Determine expression of exons with “count tags”.
— Input: A gene annotation file (GTF) and the BAM alignment file
— Output: Alignment counts and RPKM expression measurements for each exon
• Calculate coverage profiles across the genome with “sam2wig”.
— Input: BAM alignment file
— Output: Genomic coverage profiles, one WIG file produced per chromosome strand
• Both tools consider reads and transcript annotations in a strand specific fashion
— Assumption is that the Whole Transcriptome Kit will be used.
• Inference of whole transcript abundance is left to third party tools
— e.g., Cufflinks
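RPKM (reads per kilobase of exon per million mapped reads) normalizes a raw alignment count for exon length and sequencing depth. The sketch below shows the standard formula; the function name and the example numbers are illustrative and are not taken from the count tags tool itself.

```python
def rpkm(read_count, exon_length_bp, total_mapped_reads):
    """Reads per kilobase of exon per million mapped reads.

    RPKM = 10^9 * C / (N * L), where C is the number of reads aligned to
    the exon, N is the total number of mapped reads, and L is the exon
    length in base pairs.
    """
    if exon_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("exon length and total mapped reads must be positive")
    return read_count * 1e9 / (exon_length_bp * total_mapped_reads)


# Hypothetical exon: 500 reads on a 2,000 bp exon in a run of 10 million mapped reads.
value = rpkm(read_count=500, exon_length_bp=2000, total_mapped_reads=10_000_000)
print(value)
```

Because both length and depth are divided out, RPKM values can be compared between exons of different sizes and between runs of different yield.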
151 Gene Expression
• A snapshot of coverage/expression profiles (from the WIG file):
152 Display WT data using Integrative Genomics Viewer (IGV)
• UHR gene region displayed with IGV for positions 3,530,193 to 3,548,355 of Human Chr-1.
• Wig (x2) tracks: Top 2 tracks show the genomic coverage using the negative strand and positive strand generated by the Bam2Wig tool (Max: 100 coverage).
• BAM track: Middle track shows the alignments from the BAM file. For display purposes reads are filtered with MAPQ threshold of 45 (a stringent filter) and bases with quality value 5 to 20 are shaded.
• BED track: Fifth track shows the junctions detected by Junction Finder (BED file). In this case all junctions detected are "known" and so are shaded in green.
153 RNA Seq Summary
• RNA splicing and novel gene fusion detection enabled
• Intersection of paired and single read methods eliminates false positives
• User defined parameters allow tuning sensitivity versus specificity
• RNA Seq shows large dynamic range and high correlation with TaqMan
• Standard output files:
— Gtf annotation counts
— Wig file (coverage)
— Bed file (junctions)
— Junctions tab separated
— Alternative Splicing
— Fusions
• New metrics reported:
— Total/unique junction evidence counts
— RPKM
— JCV
154 Whole Transcriptome (RNA Seq) hands on • Brief review of GUI • Brief review file structure • Command line — ini — Run bioscope with whole transcriptome plan file — Review the output
155 What’s new in Bioscope 1.2 wt pipeline?
156 Review BioScope GUI for Whole Transcriptome
• Global
— Choose library type
— Working directory
• Application
— Input files, max hits, read length
• Advanced
— Plugins
— Settings
• Submit the job and monitor it
• Command line follow up
— qstat
— Looking at the file structure
— ini files
157 Hands on
• Find the working directory /data/results/users/user#/whole_transcriptome
• Look in config/ for the ini files
• Run them step by step
nohup bioscope.sh config/wt_map/example.ini &
nohup bioscope.sh config/knownExon/example.ini &
nohup bioscope.sh config/wig/example.ini &
158 Review the output • Take a look at the bam — samtools view .bam | more • Take a look at other output files
159 Data Analysis Support Resources
• User Documentation — SOLiD 3 Plus to SOLiD 4 — ICS/SETS User Guide — BioScope User Guide — Quick Reference Guide
• SOLiD 4 Data Analysis Webinars
160 Global Service and Support
More Than Just a Technology… A Dedicated Team To Enable Your Success
SOLiD Technical Sales Consultants
• Knowledge of system and components
• Guidance to the right platforms and reagents
SOLiD Field Application Scientists
• Application knowledge & experience
• Experimental design & support
SOLiD Bioinformatics Scientists
• Computer science, IT, and bioinformatics
• Data preparation & analysis
SOLiD Field Service Engineers
• Technical engineering system knowledge
• Installation and maintenance
Helping you to…
• Select the right technology
• Get started quickly
• Continuous operation
• Design & run experiments
• Successful data analysis
• Interpret data sets for publications
161 THANK YOU
For Research Use Only. Not intended for any animal or human therapeutic or diagnostic use. © 2010 Life Technologies Corporation. All rights reserved. The trademarks mentioned herein are the property of Life Technologies Corporation or their respective owners. TaqMan is a registered trademark of Roche Molecular Systems, Inc.
162 Extra slides
163 Junction Confidence Value (JCV)
Equation 1. Junction Confidence Value (JCV)
$$\mathrm{JCV}_{j_{x-y}} = \sum_{i=1}^{n} \mathrm{PQV}_i - 10 \log_{10}\left(\mathrm{EEM}_{j_{x-y}}\right)$$
Equation 2. Error expectation metric (EEM)
$$\mathrm{EEM}_{j_{x-y}} = \frac{RC_x \times RC_y}{\dfrac{\ell_x}{\mu_T + 3 \times \sigma_T} \times \dfrac{\ell_y}{\mu_T + 3 \times \sigma_T}}$$
JCV = Junction Confidence Value
EEM = Error Expectation Metric
PQV = Pairing Quality Value
$j_{x-y}$ = junction between exons x and y
$RC_x$ = Read Coverage of exon x
$\ell_x$ = Length of exon x
$\mu_T$ = Average insert size of library
$\sigma_T$ = Standard deviation of insert size
• Highly expressed genes (with random sequencing errors) and homologous exons are suspected to be major contributors to false positive junctions. JCV assigns a low score to possible random junctions by taking into account exon coverage, exon length and pairing quality of junction evidence.
164 WT Annotation Aided Alignment Rescue
• Why?
— Reads that are not mapped, due to sequencing errors
— Search space too large to allow many mismatches, due to high false positive rate
• How?
— Reads come in pairs that are close together
— If one of the reads is mapped, then we can search in the vicinity of that read
— Small insert size for PE means a reduced search space for rescue
— We use annotation to search in the most likely places on the genome
• When?
— Applied to read pairs that have at least one alignment
— No pair of alignments occurring within a maximum expected range
> The expected range was set to 100,000 bases
— If the anchor read overlaps a gene
165 Accuracy
• Accuracy is affected by the mapping parameters used — Increasing number of mismatches allowed will increase number of reads that map and drive up the error rate
• The accuracy after applying 2 Base Encoding (2BE) rules improves significantly over raw color accuracy
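The 2BE rules can be illustrated with a short Python sketch. It relies on the standard SOLiD dibase code, in which the color for a pair of adjacent bases equals the XOR of their 2-bit codes (A=0, C=1, G=2, T=3). The function names and the category labels are introduced here for illustration and are not BioScope identifiers.

```python
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def to_colors(bases):
    """Encode a base sequence as colors, one color per adjacent base pair."""
    return [BASE_CODE[a] ^ BASE_CODE[b] for a, b in zip(bases, bases[1:])]


def classify_difference(ref_colors, read_colors):
    """Classify how a read differs from the reference in color space.

    A single-base change alters two adjacent colors while leaving their
    XOR unchanged ("valid adjacent"). One isolated color change cannot be
    produced by a single-base change and is treated as a measurement error.
    """
    diffs = [i for i, (r, c) in enumerate(zip(ref_colors, read_colors)) if r != c]
    if not diffs:
        return "match"
    if len(diffs) == 1:
        return "single mismatch"
    if len(diffs) == 2 and diffs[1] == diffs[0] + 1:
        i = diffs[0]
        same_xor = (ref_colors[i] ^ ref_colors[i + 1]) == (read_colors[i] ^ read_colors[i + 1])
        return "valid adjacent" if same_xor else "invalid adjacent"
    return "other"


reference = to_colors("ACGT")   # colors for the reference bases
with_snp = to_colors("ATGT")    # C to T change at the second base
print(reference, with_snp, classify_difference(reference, with_snp))
```

Only the "valid adjacent" pattern is consistent with a true single-base variant, which is why isolated color miscalls can be identified and corrected against a reference.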
166 Accuracy Assessment – With a Known Genome
50 base read mapped with up to 6 MM (DH10B results)
Total number of correct CS calls   97.60%
Mismatched calls                   2.40%
  Single                           2.17%
  Valid Adjacent                   0.14%
  Invalid Adjacent                 0.09%
Raw accuracy (before corrected by 2BE) 97.6%
Accuracy = 99.91%
167 Accuracy
Effect of 2BE correction
[Plot: Base accuracy by position in read. X axis: Base Position in Read (1 to 49). Y axis: Percent Error (log scale, 0.01 to 100), with the corresponding QV (10, 20, 30). Series: raw, corrected.]
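The percent-error and QV axes of the plot are related by the usual Phred-style definition, QV = -10 log10(p), where p is the error probability. A minimal sketch of the conversion follows; the function names are illustrative.

```python
import math


def percent_error_to_qv(percent_error):
    """Convert an error rate in percent to a Phred-style quality value."""
    if not 0 < percent_error <= 100:
        raise ValueError("percent error must be in (0, 100]")
    return -10 * math.log10(percent_error / 100)


def qv_to_percent_error(qv):
    """Convert a quality value back to an error rate in percent."""
    return 100 * 10 ** (-qv / 10)


# The three QV gridlines on the plot correspond to these error rates.
for percent in (10, 1, 0.1):
    print(percent, round(percent_error_to_qv(percent), 2))
```

Each step of 10 in QV corresponds to a tenfold drop in error rate, which is why the percent-error axis is drawn on a log scale.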
168 How Much Coverage Do You Need? – For SNP
• 2 base encoding helps to reduce the coverage needed to detect SNP with high confidence
• Heterozygous SNP will need higher coverage, compared to homozygous, to detect both alleles
— If the coverage at a heterozygous position is less than 10X, the probability that one of the alleles will not be detected is 1% or more
— If the sample preparation method is likely to introduce some bias in allele ratio, coverage should be increased
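The relationship between coverage and the chance of missing one allele can be explored with a simple binomial model. This sketch assumes each read samples either allele with probability 0.5 and that an allele counts as detected only when at least `min_reads` reads support it. The slides do not state the detection criterion, so both the model and the default of 2 reads are assumptions.

```python
from math import comb


def prob_allele_missed(coverage, min_reads=2):
    """Probability that at least one allele of a heterozygous site is missed.

    Counts the read splits in which either allele is supported by fewer
    than `min_reads` reads, out of all 2**coverage equally likely outcomes.
    """
    total = 2 ** coverage
    missed = sum(
        comb(coverage, k)
        for k in range(coverage + 1)
        if k < min_reads or coverage - k < min_reads
    )
    return missed / total


for coverage in (6, 8, 10, 12, 15, 20):
    print(coverage, f"{prob_allele_missed(coverage):.4%}")
```

Any bias in the allele ratio moves the sampling probability away from 0.5 and raises the miss probability, which is why coverage should be increased in that case.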
169 SNP Detection at Different Coverages
Wheeler et al PLoS Computational Biol. 2008 Vol 452| 17 April 2008| doi:10.1038/nature06884
170 Coverage Variance in Different Genomic Regions
• Ideally, the coverage would follow a binomial distribution
• Possible reasons for deviations
— Characteristics of the reference genome (complexity, repeats, etc)
— The samples being sequenced (structural variation)
— Sample preparation and sequencing chemistry
• Mate pair data (rather than fragment data) can largely, but not completely, overcome this problem
• Consistent contiguous regions of over/under coverage may represent copy number variation — Detection of SNPs or InDels in these regions should be treated with caution
171