

SOLiD™ Overview (I), July 2010

Primary Analysis; Secondary and Tertiary Analysis

[Figure: data flow from the instrument to downstream analysis. Labels: On Instrument (ICS, SETS); Export; BioScope in cluster; BioScope in cloud; Other mapping tools; Satay plots, auto correlation, heat maps.]

AutoExport (cycle-by-cycle)

Instrument Cluster: Collect images -> Primary analysis (color calls) -> Generate csfasta & qual -> Secondary analysis

Auto Export: delta spch, run definition

BioScope Cluster: Merge spch files -> Generate csfasta & qual -> Tertiary analysis

Manual Export

Instrument Cluster: Collect images -> Primary analysis (color calls) -> Generate csfasta & qual -> Secondary analysis

Option in SETS to export all run data or a combination of spch, csfasta & qual files

BioScope Cluster: Tertiary analysis

AutoExport: database requirement removed

[Figure: components of the Instrument Cluster and the Remote Cluster. Labels: ICS, SETS, Export Daemon, Hades, BioScope, BioScope UI; JMS Broker (ActiveMQ), Postgres, and Tomcat each appear twice. Installers: Disco installer, Auto-export installer, BioScope installer, System installer.]

SOLiD Data Analysis Workflow: Secondary/Tertiary Analysis (off-instrument)

Primary Analysis -> Secondary Analysis (BioScope) -> Tertiary Analysis (BioScope)

• Primary analysis output: .csfasta / .qual
• SAET: accuracy-enhanced .csfasta
• Mapping: mapped reads (.ma)
• maToBam: mapped reads (.bam), usable by third-party tools
• Tertiary analysis
  - Reseq: SNP, InDel, CNV, Inversion
  - WT: Coverage, Counting, Junction Finder, Fusion Transcripts


SOLiD™ Bioinformatics Overview (II), July 2010

Outline

• Color space and 2-base encoding
• Quality values and filtering
• Mapping algorithm and considerations
• SOLiD Webinar and Online Training
• SOLiD Software Community

What Is Color Space?

• Capillary electrophoresis uses single-base color encoding of data

Collect color image -> Identify peaks -> Identify peak colors -> Convert to base calls

[Figure labels: Color space; Base space]

SOLiD Color Space

• SOLiD uses 2-base color encoding of data (2BE)

Collect color image -> Identify beads -> Identify bead color

Record colors for each bead over consecutive cycles

[Figure: color space converted to base space. Base sequence shown: A C G G T C G T C G T G T G C G T]

Properties Of 2 Base Encoding (2BE)

5'-A C G T A C G A T-3'   colors: 1 3 1 3 1 3 2 3
3'-T G C A T G C T A-5'   colors: 1 3 1 3 1 3 2 3

[Figure: 4 x 4 color table indexed by first base and second base]

• Two dibases that agree in just one base have different colors
  - color(AC) ≠ color(AG) ≠ color(AT) ≠ color(AA)
• Two dibases that differ in both bases can share a color
  - color(AC) = color(GT) and color(CG) = color(AT)
• A dibase and its reverse have the same color
  - color(AC) = color(CA), color(GT) = color(TG)
• Repeated-base dibases have the same color
  - color(AA) = color(CC) = color(GG) = color(TT)

“Valid” and “Invalid” Adjacent Color Substitutions

• “Invalid” changes are inconsistent with a SNP and are likely errors
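The 2-base encoding properties above can be checked with a short sketch. It assumes the bases are numbered A=0, C=1, G=2, T=3, so that the color of a dibase is the XOR of its two base numbers; this reproduces the colors in the ACGTACGAT example. The function names are hypothetical, not part of any SOLiD tool.

```python
# Sketch of 2-base (color space) encoding. Assumption: A, C, G, T are
# numbered 0..3 and color = first_base XOR second_base.
BASES = "ACGT"


def encode(bases: str) -> str:
    """Return the first base followed by one color per adjacent dibase."""
    idx = [BASES.index(b) for b in bases]
    colors = [str(a ^ b) for a, b in zip(idx, idx[1:])]
    return bases[0] + "".join(colors)


def decode(color_read: str) -> str:
    """Rebuild the base sequence from a leading base and its colors."""
    current = BASES.index(color_read[0])
    out = [color_read[0]]
    for color in color_read[1:]:
        current ^= int(color)  # each color determines the next base
        out.append(BASES[current])
    return "".join(out)


print(encode("ACGTACGAT"))   # leading base plus colors
print(decode("A13131323"))   # back to base space
```

Because every base is derived from the previous one, a single wrong color changes all downstream bases on decoding, which is why alignment is done in color space.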


Quality Value (QV) For Color Call

• A score calculated from the probability of an error call at that position
• Similar to the scores generated by phred and the KB Basecaller for capillary electrophoresis sequencing

  q = -10 \log_{10} p, where p = probability of color call error

• A QV of 10 represents a 10% error rate, whereas a QV of 20 represents a 1% error rate
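As a quick check of the QV definition above, the following sketch converts between a quality value and an error probability. The function names are illustrative only.

```python
import math


def qv_from_error(p: float) -> float:
    """Quality value for an error probability p (0 < p <= 1)."""
    return -10.0 * math.log10(p)


def error_from_qv(q: float) -> float:
    """Error probability implied by quality value q."""
    return 10.0 ** (-q / 10.0)


for q in (10, 20, 30):
    print(q, error_from_qv(q))
```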

SETS Software Updates

What has changed in SETS?

Primary analysis filtering is enabled. It removes poor-quality beads, so the primary analysis results files are smaller, mapping runs faster, and the matching percentage improves.

[Figure: a filter separates poor-quality beads from mappable beads]

Why Filter?

• Removing poor-quality beads reduces the primary analysis results by about 15% or more
• It is easier to discover novel information from the remaining unmatched beads
• Because a run has a smaller list of reads, mapping is faster for similar throughput
• Improved matching percentage

Filtering Design

• Used human data as the training set
• Set parameters based on the percentage of poor-quality beads filtered
  - A value of 20 corresponds to 20% of poor-quality beads filtered out
  - A value of 80 corresponds to 80% of poor-quality beads filtered out
• Tested mapping using BioScope Classic mapping

Configuring Filtering from SETS

• Valid values for Stringency range from 0 to 80
• Default value is 20


Mapping Algorithm

• Challenge:
  - A small word size is needed for continuous word searches in short reads. This is computationally intensive and slow.

• Our approach:
  - Use discontinuous word patterns
    > Allows faster searching and is guaranteed to find all hits up to a certain number of mismatches

Discontinuous Words

• Continuous words: searching for a perfect alignment, 8/8 bases (word size 8, e.g. used by BLAST)

  ATTTTTTGGGTAGCCCCTTGGATGAGT
         ||||||||
       AGGGGTAGCCTGATGATGGT

• Discontinuous words: searching 8/18 matches (effective word size is also 8)

  ATTTTTTGGGTAGCCCCTTGGATGAGT
       ||     ||     ||||
       TTGACCGGCATGGGGGAT
       110000011000001111
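The discontinuous word idea above can be sketched as a mask comparison: only the positions marked 1 in the pattern must agree. This is a brute-force illustration under that assumption, not how mapreads indexes the reference; the function names are hypothetical.

```python
def word_matches(read_word: str, ref_word: str, mask: str) -> bool:
    """True when every position marked '1' in the mask is identical."""
    if not (len(read_word) == len(ref_word) == len(mask)):
        raise ValueError("word and mask lengths must agree")
    return all(a == b for a, b, m in zip(read_word, ref_word, mask) if m == "1")


def find_hits(read_word: str, reference: str, mask: str) -> list:
    """Return every reference offset where the masked word matches."""
    span = len(mask)
    return [
        i
        for i in range(len(reference) - span + 1)
        if word_matches(read_word, reference[i:i + span], mask)
    ]


mask = "110000011000001111"          # 8 checked positions out of 18
reference = "ATTTTTTGGGTAGCCCCTTGGATGAGT"
print(find_hits("TTGACCGGCATGGGGGAT", reference, mask))
```

A real mapper would hash the masked positions into an index rather than scanning, but the matching rule is the same.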

Mapping Tool mapreads

• General features of the mapping tool
  - Aligns in color space
  - Translates the reference sequence to color space
  - Allows mismatches (no indels); valid adjacent mismatches can be counted as one
  - Allows masking of certain positions (bad calls)
  - For a fixed reference sequence, running time is linear in the number of reads

• New with SOLiD 3+
  - Seed-and-extend mapping approach
  - Multithreaded

Local Mapping

• Motivation
  - Long reads, non-uniform quality
  - Errors tend to accumulate at the end of reads
  - Some applications show sequencing into adaptors

Local Alignment Strategy

• Map the first 25 colors of the read to the reference, allowing 2 mismatches (MM).
• For every hit found (up to the Z limit), do a local extension
  - Accumulate the alignment score (Match = 1, MM = -2 [user defined])
  - Report the best partial alignment (anchored local) based on score
    > Discard if the score does not meet the minimum cutoff

  Read: 0122130123012303201203021123012310231203120103120
        ||||||| ||||||||||||||| |||||| |||||||||| |||
  Ref:  0122130023012303201203011123011310231203100101203

• For reads not mapped, shift the anchor location and attempt additional mapping
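The extension step above can be sketched as a running score over the read, keeping the best-scoring prefix. This sketch assumes the anchor sits at the start of the read, extension goes only to the right, and the mismatch penalty is -2 (the sign is an interpretation of the slide); the minimum-score cutoff and anchor shifting are left out. The function name is hypothetical.

```python
def best_anchored_alignment(read: str, ref: str, match: int = 1, mismatch: int = -2):
    """Return (aligned_length, score) of the best-scoring prefix alignment."""
    score = best_score = best_len = 0
    for i, (r, g) in enumerate(zip(read, ref), start=1):
        score += match if r == g else mismatch
        if score > best_score:  # keep the earliest prefix with the top score
            best_score, best_len = score, i
    return best_len, best_score


# Read and reference colors from the slide's example.
READ = "0122130123012303201203021123012310231203120103120"
REF = "0122130023012303201203011123011310231203100101203"

print(best_anchored_alignment(READ, REF))
```

With these inputs the last four colors all mismatch, so the best partial alignment stops before them rather than covering the whole read.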

Local Mapping: Anchor Offset

[Figure: a read aligned to the reference, with "start" and "end" marked on the reference and "offset" marked on the read]

• start and end mark the start and end of the alignment in the reference.
• The alignment may not encompass the entire read.
• The start of the alignment in the read is called the offset.

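Mapping QV: Mathematical Definition

• Mapping quality is an estimate of how likely an alignment is to be correct

• First, calculate the posterior probability

  P(r \mid \text{Alignment}) = (1 - e)^{t - m} \, e^{m} \left(\frac{1}{4}\right)^{L - t}

  (The slide does not define the symbols; presumably L is the read length, t the alignment length, m the number of mismatches, and e the per-color error rate.)

• If an alignment has a probability of P(r), its mapping QV is defined as

  \mathrm{QV} = -10 \log_{10}\left(1 - \frac{P(r)}{P}\right), \quad \text{where } P = \sum P(r) \text{ for all reads}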
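A minimal sketch of the mapping QV calculation, assuming the formula reads P(r | Alignment) = (1 - e)^(t - m) * e^m * (1/4)^(L - t) with L the read length, t the alignment length, m the number of mismatches and e the per-color error rate, and QV = -10 log10(1 - P(r)/P). The function names and the `max_qv` cap (used when the ratio reaches 1) are assumptions for illustration.

```python
import math


def alignment_likelihood(L: int, t: int, m: int, e: float) -> float:
    """Probability of the read given one alignment, under the assumed formula."""
    return (1.0 - e) ** (t - m) * e ** m * 0.25 ** (L - t)


def mapping_qv(p_r: float, p_total: float, max_qv: float = 100.0) -> float:
    """Phred-scaled confidence that the alignment with probability p_r is correct."""
    wrong = 1.0 - p_r / p_total
    if wrong <= 0.0:
        return max_qv  # hypothetical cap: a unique hit would give infinity
    return min(max_qv, -10.0 * math.log10(wrong))


# Two candidate alignments of a 50-color read.
candidates = [alignment_likelihood(50, 50, 1, 0.05),
              alignment_likelihood(50, 40, 3, 0.05)]
total = sum(candidates)
print([round(mapping_qv(p, total), 1) for p in candidates])
```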

What are mapping/pairing quality values?

• Given that a read R of length L can map to n different locations Xi (i = 1…n) in the genome, the mapping quality value represents the probability that the hypothesis "the read maps to location Xi" is true.

Mapping quality value ~ Prob(hypothesis "R maps to X1" is true | R)

[Figure: read R with candidate locations X1, X2, …, Xn]

Difference between Mapping QV & Pairing QV

• Mapping QV represents the quality of alignment for fragment reads, or the quality of alignment for individual tags (F3/R3/F5-P2) in paired reads
• Pairing QV represents the quality of alignment for a pair of reads. For example, if the F3 tag has 10 alignments and the F5-P2 tag has 10 alignments, then 100 alignment pairs can be formed for tags F3 and F5-P2 together

Parameters that factor into Pairing Quality Values

• Alignment length
• Number of mismatches
• Offset
• Insert size
• Total number of possible alignments

[Figure: R3/F5 and F3 tags on the - and + strands, with offset, alignment length, and insert size marked]

Phred quality score and Pairing Quality Values

The Phred quality score most commonly used in the literature is -10 × log10[prob(error)]. To be consistent with the Phred-scaled quality score, we calculate the pairing quality value (PQV) as:

  \mathrm{PQV} = -10 \times \log_{10}\left[1 - Q(r_1, r_2, x_1, x_2)\right]

Finally, we normalize the PQV by the maximum possible PQV for a given pair of reads of lengths L1 and L2, to keep the PQVs in the range 1 to 100:

  \mathrm{PQV}_{\text{norm}} = \frac{\mathrm{PQV}}{\mathrm{PQV}_{\max}} \times 100
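The two PQV steps above, Phred scaling followed by normalization, can be sketched as follows. The sketch assumes Q(r1, r2, x1, x2) is already available as a probability; how Q and the maximum PQV are computed is not given on the slide, so both are plain inputs here and the function names are hypothetical.

```python
import math


def pairing_qv(q: float) -> float:
    """Phred-scaled pairing quality for pairing probability q (q < 1)."""
    return -10.0 * math.log10(1.0 - q)


def normalized_pqv(pqv: float, pqv_max: float) -> float:
    """Scale a PQV against the maximum possible PQV for the read pair."""
    return pqv / pqv_max * 100.0


raw = pairing_qv(0.99)
print(raw, normalized_pqv(raw, pqv_max=80.0))
```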

Multithreaded Mapreads

• Single mapping job
• A fraction of the reads (1/n) is mapped against the whole reference
• ~20 GB of RAM for the human genome
• Limit read mapping to whole genome (-z)
• Combine results (simple merge)

[Figure: each of CPU 1 … CPU n maps 1/n of the reads (.csfasta) against the full reference, producing Mapped Results 1 … n, which are merged into Combined Results]

Local Mapping: Advantages

• Increased throughput
  - Some data sets have shown a 2-fold increase in mapping using local mapping vs. classical mapping
• Increased speed
  - Up to 15X faster than iterative mapping with trimming
• As read length increases, only a small set of schemas needs to be optimized


33 Introducing

SOLiD™ University Offerings

• At Life Technologies Application Support Centers
  - SOLiD™ 4 System Course
  - SOLiD™ 4 Bioinformatics Course
  - SOLiD Libraries Courses: DNA, RNA-Seq, Targeted Reseq
• SOLiD™ Edge Live Webinar Series
  - SOLiD™ 4 System Essentials, SOLiD™ 4 Data Analysis Essentials
  - Advanced Troubleshooting
  - Success Stories in Cancer
  - Applications: RNA-Seq, Targeted Resequencing, Epigenetics and Whole Genome Sequencing
• SOLiD™ E-learning Series
  - Recorded webinars from SOLiD Edge Series
  - SOLiD™ EZ Bead™ Videos
  - SOLiD™ 4 System Bead Deposition Video

New at SOLiD™ University

Targeted Reseq Library Construction Course – 3 day intensive course with theory for all methods and hands-on for Agilent’s SureSelect Target Enrichment System

RNA-Seq Library Construction – 5 day intensive course covers construction of small RNA, whole transcriptome and SOLiD™ SAGE™ libraries

Location, Location, Location – Courses available at Foster City, CA and Frederick, MD in the US and Darmstadt, Germany

SOLiD™ Edge Live Webinar Series

SOLiD™ 4 System – includes these topics: SOLiD System & Data Analysis Essentials, Advanced Troubleshooting, and Successes in Cancer Research

August: RNA-Seq with the SOLiD™ System – using SOLiD™ Total RNA-Seq and SOLiD™ SAGE™ Kits and SOLiD™ BioScope™ & third party tools for analysis

September: Targeted Reseq with the SOLiD™ System – strategies for enrichment and SOLiD™ BioScope™ & software community tools for analysis

October: Epigenetics with the SOLiD™ System – using SOLiD™ ChIP-Seq Kit and Methylation and SOLiD™ BioScope™ & software community tools for analysis

Thank You

Visit learn.appliedbiosystems.com for schedule & registration

For research use only. Not intended for human or animal therapeutic or diagnostic use.

© 2010 Life Technologies Corporation. All rights reserved. The trademarks mentioned herein are the property of Life Technologies Corporation or their respective owners.

SOLiD Software Community

• SOLiD System Software (separate session)
• Open Source Software
• Commercial Partners’ Software
• SOLiD Software Community Website

SOLiD Software Community: solid.appliedbiosystems.com

SOLiD Software Community: info.appliedbiosystems.com/solidsoftwarecommunity

• Easy navigation
  - Application centric
  - Links to downloads

SOLiD Software Download (http://solidsoftwaretools.com)

SOLiD Dataset Download (http://solidsoftwaretools.com)


Data Management July 2010

Outline

• Key files and formats
• File structures on the SOLiD systems
• File transfer
• Storage requirements

Data Analysis Overview: Files

[Figure: input and output files per analysis stage]

• Input: images, focal map, .csfasta, .qual, .sam / .bam
• Output: .spch, .csfasta, .qual, .intensity, .csfasta.ma (mapping output), .sam / .bam, text files and tab-delimited data files

File Format: .spch

• spch: SOLiD Panel Cache HDF5
• One file per panel
• Contains information about the color calls at each ligation for every bead in that panel
• The .spch file is in the HDF5 format
  > HDFView can be used to view the structure and content of an HDF5 file
  - http://hdf.ncsa.uiuc.edu/products/hdf5/index.html
  - http://hdf.ncsa.uiuc.edu/hdfjavahtml/hdfview/index.html

File Format: .csfasta

• FASTA file with tag headers and sequences
• Tag header: >1_88_1830_F3
  - 1 = panel number
  - 88 = X coordinate of bead within panel
  - 1830 = Y coordinate of bead within panel
  - F3 = type of tag

>1_88_1830_F3
T2103112003130213233110321

>1_88_1830_R3
G3211312320130023232012112
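The tag header layout described above (panel, X, Y, tag type) can be split with a few lines of code. This is an illustrative sketch; `BeadTag` and `parse_tag_header` are hypothetical names, not part of any SOLiD software.

```python
from typing import NamedTuple


class BeadTag(NamedTuple):
    panel: int
    x: int
    y: int
    tag: str


def parse_tag_header(line: str) -> BeadTag:
    """Split a header such as '>1_88_1830_F3' into its four fields."""
    fields = line.strip().lstrip(">").split("_", 3)
    if len(fields) != 4:
        raise ValueError(f"unexpected tag header: {line!r}")
    panel, x, y = (int(value) for value in fields[:3])
    return BeadTag(panel, x, y, fields[3])


print(parse_tag_header(">1_88_1830_F3"))
```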

File Format: .qual

• FASTA file with tag headers and quality values
• Quality value (QV) calculated from the probability of an error call at that position
  - A QV of 10 represents a 10% error rate
  - A QV of 20 represents a 1% error rate

>97_2040_1850_F3
38 36 26 33 41 26 24 33 28 31 27 23 5 35 32 31 11 10 24 38 22 24 7 27 11 15 26 13 14 17 17 13 12 8 5 17 5 12

  q = -10 \log_{10} p, where p = probability of color call error

Representation of a missing color call

• .csfasta with a ‘.’

>1_88_1830_F3
T2103112003130213233110.21

• .qual with a ‘-1’

>1_88_1830_F3
38 36 26 33 41 26 24 33 28 31 27 23 5 35 32 31 11 10 24 38 22 24 -1 27 11

• A missing color call will be treated as a mismatch by the SOLiD mapping program “MaxMapper”

• Some third-party software does not handle ‘.’ or ‘-1’ well, so reads with those values need to be filtered out beforehand
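A pre-filter for third-party tools can be sketched as below. It assumes each read has already been loaded as a (header, color string, list of QVs) tuple; the record layout and function names are illustrative, not an existing tool's API.

```python
def has_missing_call(color_read: str, quals: list) -> bool:
    """True when the read contains a '.' color or a negative quality value."""
    return "." in color_read or any(q < 0 for q in quals)


def prefilter(records):
    """Keep only (header, color_read, quals) records without missing calls."""
    return [rec for rec in records if not has_missing_call(rec[1], rec[2])]


records = [
    ("1_88_1830_F3", "T2103112003130213233110.21", [38, 36, -1, 27, 11]),
    ("1_88_1831_F3", "T2103112003130213233110321", [38, 36, 26, 27, 11]),
]
print([header for header, _, _ in prefilter(records)])
```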

File Format: Scaled Intensity Files

• FASTA file with tag headers and intensity values
• Not generated by default (very large, ~100 GB per color)
  - Used with the SRF file format for sequence submission to NCBI, but not required
• One file generated for each dye color for each tag

  F3_intensity.ScaledCY3.fasta
  F3_intensity.ScaledCY5.fasta
  F3_intensity.ScaledFAM.fasta
  F3_intensity.ScaledTXR.fasta

>1_43_24_R3
0.0471744 0.0140623 0.000482545 0.160932 0.0100427 0.0219512 0.0158016 .... 0.00679131 0.129587 0.000653466 0.00944984

File Format: .csfasta.ma

• Tag_ID, 1_-6172.2:(40.4.0):q45, ...
  - 1 = reference entry (fasta index)
  - ‘-’ = strand (nothing if positive strand)
  - 6172 = position of hit on reference
  - 2 = number of mismatches in the anchor region
  - 40 = alignment length of local alignment (seed + extension)
  - 4 = number of mismatches of the total alignment
  - 0 = alignment start in the read
  - q45 = mapping quality value

>1_88_1830_F3, 1_6172.2:(40.4.0):q45
T2103112003130213233110321
>2_89_1831_F3
T31220320101322020102301212
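The hit fields listed above can be pulled apart with a regular expression. This is a sketch based only on the single pattern shown on the slide; the field names in the result are made up for illustration, and real .ma files may contain variants this pattern does not cover.

```python
import re

# ref _ [-]pos . anchor_mm : ( length . total_mm . read_start ) : q qv
HIT = re.compile(
    r"(?P<ref>\d+)_(?P<strand>-?)(?P<pos>\d+)\.(?P<anchor_mm>\d+)"
    r":\((?P<length>\d+)\.(?P<total_mm>\d+)\.(?P<read_start>\d+)\)"
    r":q(?P<qv>\d+)"
)


def parse_hit(text: str) -> dict:
    """Parse one hit such as '1_-6172.2:(40.4.0):q45' into named fields."""
    m = HIT.fullmatch(text.strip())
    if m is None:
        raise ValueError(f"unrecognized hit: {text!r}")
    fields = m.groupdict()
    strand = fields.pop("strand") or "+"  # no sign means positive strand
    hit = {name: int(value) for name, value in fields.items()}
    hit["strand"] = strand
    return hit


header = ">1_88_1830_F3, 1_6172.2:(40.4.0):q45"
tag_id, *hits = [part.strip() for part in header.lstrip(">").split(",")]
print(tag_id, [parse_hit(h) for h in hits])
```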

File Format: SRF

• SRF (Sequence Read Format), NCBI trace archive
  - A block-based format (ZTR-compressed), http://srf.sourceforge.net/
  - See the SOLiD™ System SRF Conversion Tool (solid2srf)
    > Input: .csfasta, .qual, and .scaled_intensity files (optional)
  - http://solidsoftwaretools.com/gf/project/srf/

File Format: SAM/BAM

• SAM is an alignment format developed by the 1000 Genomes Project that includes pairing information
• Extended CIGAR format for alignment information, compact
• SAM refers to the generic format specification and the text file
• BAM is the compressed binary version of SAM
• Resources
  - Main site: http://samtools.sourceforge.net/
  - Format specification: http://samtools.sourceforge.net/SAM1.pdf
  - Mailing lists: http://sourceforge.net/mail/?group_id=246254

File Format: SAM/BAM Header

• Reference sequence information (SQ)
  - SQ lines indicate the name and length of each contig in the reference file
  - Location and MD5 checksum are optional fields to be included in SOLiD BAM
• Read group information (RG)
  - One read group per secondary analysis result
  - RG lines describe sample information tied to alignments via the read group id
  - Library (LB), sample name (SM), and pairing insert size (PI) will be included
  - LB will include the library type, which distinguishes mate pair, reverse reads, and fragment libraries
    LB:libname-50x50MP
    LB:libname2-50x25RR
    LB:lib3-50F
  - PI will include the pairing range
    PI:100-1400

File Format: SAM/BAM Header

@HD VN:1.0 GO:none SO:coordinate
@SQ SN:chr20 LN:62434914 UR:file:///share/reference/genomes/hg18/human.fa
@RG ID:RG1 LB:libname-50x50MP PI:100-1410 SM:S1

1417_237_929 115 chr1 446 150 1H49M = 1453 1057 ACCCTAACCCTAACCCTAACCCTCGCGGTACCCTCAGCCGGCCCGCCCG )6B><5/>G?45=CF&&II@27GII##GGIDEIIIIDEIIID?DGGIIF RG:Z:sys.S1 CS:Z:T23003300303032122001310233220010310010320010320011 CQ:Z:166626/14987/6:69514717#6:71',5;/&25//'.26)'/.12%% MD:Z:49M NM:i:0

Specification Details: http://samtools.sourceforge.net/SAM1.pdf

File Format: SAM/BAM Header

• 1417_237_929 115 chr1 446 150 1H49M = 1453 1057 ACCCTAACCCTAACCCTAACCCTCGCGGTACCCTCAGCCGGCCCGCCCG )6B><5/>G?45=CF&&II@27GII##GGIDEIIIIDEIIID?DGGIIF RG:Z:sys.S1 CS:Z:T23003300303032122001310233220010310010320010320011 CQ:Z:166626/14987/6:69514717#6:71',5;/&25//'.26)'/.12%% MD:Z:49M NM:i:0

• 1417_237_929: Bead ID
• 115: SAM mates flag (Strand / Mate's Strand / Proper Orientation, etc.)
• chr1: Aligned contig
• 446: Start position (always forward strand)
• 150: Mapping quality value
• 1H49M: CIGAR string
• =: Mate contig
• 1453: Mate start position
• 1057: Insert size
• ACCCTAACCCTAACCCTAACCCTCGCGGTACCCTCAGCCGGCCCGCCCG: Base sequence
• )6B><5/>G?45=CF&&II@27GII##GGIDEIIIIDEIIID?DGGIIF: Base quality value
• RG:Z:sys.S1: Read group
• CS:Z:T23003300303032122001310233220010310010320010320011: Color sequence
• CQ:Z:166626/14987/6:69514717#6:71',5;/&25//'.26)'/.12%%: Color quality
• MD:Z:49M: MD string
• NM:i:0: Edit distance

Specification Details: http://samtools.sourceforge.net/SAM1.pdf
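The flag value 115 in the record above packs several yes/no properties into one integer. A small sketch that unpacks it, using the bit meanings from the SAM specification (the short names are my own labels):

```python
# Bit values follow the SAM specification; the labels are informal.
FLAG_BITS = [
    (0x1, "paired"),
    (0x2, "proper_pair"),
    (0x4, "unmapped"),
    (0x8, "mate_unmapped"),
    (0x10, "reverse_strand"),
    (0x20, "mate_reverse_strand"),
    (0x40, "first_in_pair"),
    (0x80, "second_in_pair"),
    (0x100, "secondary"),
    (0x200, "qc_fail"),
    (0x400, "duplicate"),
]


def decode_flag(flag: int) -> list:
    """List the properties whose bits are set in a SAM FLAG value."""
    return [name for bit, name in FLAG_BITS if flag & bit]


print(decode_flag(115))
```

For 115 (64 + 32 + 16 + 2 + 1) this reports a properly paired first-in-pair read where both the read and its mate are on the reverse strand.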

Alignment records

• CIGAR string
  - Condensed alignment descriptor nX[nX,nX,…], where X is the operation and n is the number of operations

Op  Description
M   Alignment match (can be a sequence match or mismatch)
I   Insertion to the reference
D   Deletion from the reference
N   Skipped region from the reference
S   Soft clip on the read (clipped sequence present in SEQ)
H   Hard clip on the read (clipped sequence NOT present in SEQ)
P   Padding (silent deletion from the padded reference sequence)

Alignment records

• CIGAR examples

Description                                     CIGAR
Perfect 50-mer match                            50M
50-mer with 1 mismatch                          50M
Anchor-extend to 47 bp                          47M3H
Start at 10 bp, extend to 50 bp                 10H40M
Start at 10 bp, extend to 50 bp, bottom strand  40M10H
2 bp deletion                                   25M2D25M
2 bp insertion                                  24M2I24M
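The CIGAR examples above can be checked with a small parser. This sketch handles only the seven operations in the table; the function names are illustrative. It counts hard-clipped bases toward the original read length because the examples all describe 50-color reads.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP])")


def parse_cigar(cigar: str) -> list:
    """Split a CIGAR string into (count, operation) pairs."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    if "".join(f"{n}{op}" for n, op in ops) != cigar:
        raise ValueError(f"malformed CIGAR: {cigar!r}")
    return ops


def reference_span(ops) -> int:
    """Reference bases covered: M, D and N advance along the reference."""
    return sum(n for n, op in ops if op in "MDN")


def original_read_length(ops) -> int:
    """Read bases including clipped ones: M, I, S and H come from the read."""
    return sum(n for n, op in ops if op in "MISH")


for cigar in ("50M", "47M3H", "10H40M", "25M2D25M", "24M2I24M"):
    ops = parse_cigar(cigar)
    print(cigar, reference_span(ops), original_read_length(ops))
```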

Samtools (SAM manipulation tool) in C

• http://samtools.sourceforge.net/
• import: SAM-to-BAM conversion
• view: BAM-to-SAM conversion and subalignment retrieval
• sort: sorting alignment
• merge: merging multiple sorted alignments
• index: indexing sorted alignment
• faidx: FASTA indexing and subsequence retrieval
• tview: text alignment viewer
• pileup: generating position-based output and consensus/indel calling

Samtools ‘tview’: alignment viewer

SAM Format Bioinformatics software

• Aligners natively generating SAM
  - BFAST, "Blat-like Fast Accurate Search Tool" for Illumina and SOLiD reads.
  - BWA, Burrows-Wheeler Aligner for short and long reads.
  - GEM library. Short read aligner. Converter provided by the developers.
  - Karma, the K-tuple Alignment with Rapid Matching Algorithm.
  - Novoalign. An accurate aligner capable of gapped alignment for Illumina short reads. Academic free binary.
  - SNP-o-matic, short read aligner and SNP caller.
  - SSAHA2 (since v2.4). Classical aligner for both short and long reads.
  - Stampy, by Gerton Lunter. An accurate read aligner capable of gapped alignment for Illumina short reads. Used for indel discovery on the 1000 Genomes data. Not released.
  - TopHat, for mapping short RNA-seq reads bridging exon junctions.
• Programs processing SAM/BAM
  - GAP5, sequence assembly viewer, editor and analyzer. Capable of importing BAM files and outputting SAM.
  - GATK, the Genome Analysis Toolkit. Rich functionality including an accurate SNP caller. Built upon Picard.
  - GBrowse, generic genome browser. Experimental SAM/BAM alignment viewing. Built upon Perl APIs.
  - IGV, the Integrative Genomics Viewer. Elegant alignment viewer supporting multiple tracks and genome annotations. Built upon Picard.
  - LookSeq, web-based alignment/annotation viewer.
  - samToBed by Aaron Quinlan. Converts alignments in the SAM format to the BED format.
  - Vancouver Short Read Analysis Package (in particular FindPeaks), post-alignment processing of new sequencing data.

Samtools pileup for coverage plots

• Using a BAM file with Samtools pileup
• GNUplot to generate plots


SOLiD File System

SOLiD File System – Multiplex Sample


File Transferring

The internal network is isolated from the external network, and the head node acts as a firewall.

[Figure: network diagram. Recoverable labels:]
• Internet / external network: LAN/WAN IP address on the head node; requires 1 Gigabit LAN
• Gigabit switch (internal network), UPS
• Instrument controller (XP): 10.1.1.3 (eth0)
• Head node: 10.1.1.1 (eth0, eth1); Perc 6; results storage 4 TB
• Images: MD1000, 9 TB; optional MD1000 for results, up to 13 TB
• Compute node 0: 10.1.1.100 (eth0)
• Compute node 1: 10.1.1.101 (eth0)
• Compute node 2: 10.1.1.102 (eth0)

Communication protocols
• Head node and XP: SAMBA
• Head node and compute nodes: NFS

Copy Data To Remote Storage

• rsync
• scp/sftp
• NFS mount remote storage on SOLiD
  - Use cp/tar/gzip to copy files
• External drive
  - USB drive mounted on the head node


SOLiD™ 4 On-Instrument Data Size And Storage

50 nt tag                       | Image data size | Primary analysis data size (.spch format) | Primary analysis data size (.csfasta, QV.qual, .stats)
1 slide, 1 tag (Frag)           | 1.84 TB         | 646 GB                                    | 170 GB
1 slide, 2 tags (Mate pair)     | 3.6 TB          | 1.29 TB                                   | 340 GB
2 slides, 1 tag (Frag)          | 3.6 TB          | 1.29 TB                                   | 340 GB
2 slides, 2 tags (Mate pair)    | 7.2 TB          | 2.58 TB                                   | 680 GB
1 slide, 2 tags (Paired-End)    | 2.8 TB          | 969 GB                                    | 255 GB
2 slides, 2 tags (Paired-End)   | 5.6 TB          | 1.90 TB                                   | 510 GB

(Assumes deposition densities of 300K beads/panel, 2357 panels)

SOLiD 4 Instrument Data Storage Capacity

• DELL MD1000
  - 15x 750 GB SATA hard drives
  - RAID5 w/ hot swap capabilities
  - /data/images/ 8.9 TB

• Head node HD
  - 6x 1 TB SATA hard drives
  - RAID5
  - /data/results/ 4 TB


SOLiD BioScope Software Overview June 2010

Topics

• Overview
  - General overview

• ReSeq pipelines
  - Overview of functions
  - Demo and hands-on

• Whole Transcriptome (WT) pipelines
  - Overview of functions
  - Demo and hands-on

BioScope Software Overview

• Introduction
• Flexible Access (available options)
  - Graphical User Interface (GUI)
  - Command Line
• Monitoring Job Status
• Data Standardization to BAM format
• Analysis and Libraries
• SAET Tool for error correction

SOLiD Analysis Workflow

[Figure: pipeline of analysis stages from imaging through mapping; the rotated stage labels are not recoverable from the extraction. Output files: .spch, .csfasta, .qual, .ma, .mates, .bam, .sam. Software: ICS, SETS, BioScope (offline).]

BioScope Overview

[Figure: Primary Analysis (image analysis) -> Secondary Analysis (mapping) -> Tertiary Analysis -> Visualization. On the off-instrument compute cluster, BioScope covers mapping, ReSequencing, and Whole Transcriptome, with output in a common file format.]

BioScope Software

• Out-of-the-box tools/pipelines for

  Secondary Analysis
  - Mapping
  - SAET (NEW)
  - Mapping statistics
  - Position errors
  - Pairing
  - Create UCSC WIG file
  - ChIP-Seq mapping (NEW)

  Tertiary Analysis
  - Whole Transcriptome: WT mapping, Count known, Fusion and splicing (NEW)
  - Resequencing: SNP/diBayes, Inversion, CNV, Small Indel, Large Indel

• Simple GUI to allow ease of use to run pipelines
• Flexible command line to enable custom pipeline development

Offline Cluster

Minimal Offline Cluster Spec for 1 billion reads

• Dedicated head node
• 3+ compute nodes
• Head node
  - > 2 GHz processors
  - 16 GB RAM
  - 100 GB local disk space for OS + software installation
• Compute node
  - > 2 GHz processors
  - 16+ GB RAM (24 GB recommended) and 8+ cores per node
  - > 500 GB scratch space per node
• 1 Gb switch
• OS: CentOS 4.x, 5.x / RedHat 4.x, 5.x
• Job manager: PBS Torque / SGE
• Storage > 10 TB

Development principles for BioScope

• Ease of use and flexibility
  - GUI
  - Command line
• Similar operation for all functions
  - GUI
    > Global settings
    > Application settings
    > Advanced settings
  - Command line
    > bioscope.sh -l log/ instruction.ini
• Consistent file structure for easy tracking
  - workingDir/output
  - workingDir/temp
  - workingDir/intermediate
  - workingDir/log

BioScope GUI – 1.2 Dashboard

BioScope GUI Global Settings

• Define working directory, output locations for results, logs, etc.

BioScope GUI Application Settings

• Specify application-specific settings, input files, reference files

BioScope Advanced Settings

• Specify analysis settings, and optional plugins and analysis

BioScope Parameter Validation

• Validates input settings: integer, range, file extension, not null, etc.

BioScope Plugin Selection

BioScope Job Submission

BioScope Job Monitoring – View Run History

Step 1: Select the Analysis History to view
Step 2: Select the folder you want to view

Note: The BioScope Analysis History is ordered by creation time. Users no longer have to go to the cluster and use the command line to view log and results files.

BioScope Job Monitoring – Log File Access

Step 3: Select the log file to view
Step 4: Open or save the log file

BioScope file structure

• Working directory
  - config/ for all the instruction files, such as .ini and .plan files
  - log/ for all the running log files
  - output/ for the output results
  - temp/ for the pipeline's temporary files
  - intermediate/ for the pipeline's intermediate files

• Default working directory
  - /data/result/secondary/_timestamp
  - /data/result/tertiary/_timestamp

BioScope Command Line

• Create the following files
  - .ini
  - .plan
• Execute
  - bioscope.sh -l <.ini file or .plan file>

Recommendation:
• Copy and edit the example .ini and .plan files provided with BioScope
• Use the BioScope interface to generate file templates

BioScope Files (.ini)

• Defines input / output directories and folders
• Parameter settings
• Can contain multiple pipelines

global.ini (imported by the pipeline file):

  ## global parameters
  base.dir=./
  output.dir = ${base.dir}/outputs
  temp.dir = ${base.dir}/temp
  intermediate.dir = ${base.dir}/intermediate
  log.dir = ${base.dir}/log
  reads.result.dir.1 = ${base.dir}
  reads.result.dir.2 = ${base.dir}
  reference.dir = /data/results/bioscope_examples/examples/references
  scratch.dir=/scratch/solid

Pipeline .ini file:

  ## global settings for the pipeline run
  import ../../globals/global.ini
  reference = ${reference.dir}/chr20.fasta
  run.name = myRun
  sample.name = chr20
  primer.set = F3
  read.length = 50
  output.dir = ${base.dir}/../outputs

  ## qv filtering pipeline
  classify.run = 1
  read.dir = ${base.dir}/../../human_var/secondary/JOAN/mappingF3
  read.file.prefix = ${run.name}_${sample.name}_${primer.set}
  mapping.tagfiles.dir = ${output.dir}/qvfiltered
  filtering.qv.filtered.dir = ${output.dir}/qvfiltered
  filtering.qv.failed.dir = ${output.dir}/qvfail

  ## mapping pipeline
  mapping.run = 1
  mapping.tagfiles.dir = ${base.dir}/../../human_var/secondary/JOAN/mappingF3
  mapping.output.dir = ${output.dir}/s_mappingF3

BioScope Files (.plan)

• List of .ini files to run
• “=” denotes jobs to be run in parallel
• “+” breaks up sets of parallel jobs

[Figure: a .plan file feeding .ini files that run in series or in parallel, producing output]

  =example.mappingF3.ini
  =example.mappingR3.ini
  example.mapStats.ini
  +
  =example.posErrors.ini
  =example.MaToBam.ini
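One way to read the .plan rules above is as a list of stages, where each stage holds the .ini files that may run together. The sketch below is an interpretation of the slide, not BioScope's own parser: it assumes consecutive "=" lines form one parallel group, a "+" line or an unprefixed line closes the current group, and an unprefixed line is a stage of its own. The function name is hypothetical.

```python
def parse_plan(text: str) -> list:
    """Return a list of stages; each stage lists the .ini files run together."""
    stages, parallel = [], []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("="):
            parallel.append(line[1:].strip())  # joins the current parallel group
            continue
        if parallel:                            # '+' or a serial job closes the group
            stages.append(parallel)
            parallel = []
        if line != "+":
            stages.append([line])               # serial job runs on its own
    if parallel:
        stages.append(parallel)
    return stages


PLAN = """
=example.mappingF3.ini
=example.mappingR3.ini
example.mapStats.ini
+
=example.posErrors.ini
=example.MaToBam.ini
"""
for number, stage in enumerate(parse_plan(PLAN), start=1):
    print(number, stage)
```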

BioScope Job Monitoring

• GUI: History tab & access to log files
• Command line: check job submission status in three ways
  - Verify that the job was submitted to the cluster (qstat)
  - Monitor BioScope logs for errors
  - Examine the output directory

BioScope Job Monitoring (Cluster Submission)

• Use the command “qstat” to monitor job progression

• When the job is complete, proceed to check the log files

BioScope Job Monitoring (Log Files)

• Navigate to the output directory and look at the bioscope.[timestamp].log in the log folder
• Look for “Finished successfully”

16 Nov 2009 13:03:00,044 INFO [main] JMSEventReceiver:68 - JMSEventReceiver waitForEvent after taking Whole Transcriptome Counttag completed successfully
16 Nov 2009 13:03:00,046 INFO [main] JMSEventReceiver:70 - Event Whole Transcriptome Counttag completed successfully received on selector 'a1bd33de-2d30-48f5-92b4-e084add31d75'
16 Nov 2009 13:03:00,047 INFO [main] AnalysisJobManager:79 - wt.counttag.run completed.
16 Nov 2009 13:03:00,047 INFO [main] PluginJobManager:118 - Finished successfully
16 Nov 2009 13:03:00,047 INFO [main] PluginJobManager:102 - >>>> END of PluginJobManager >>>> date=2009-11-16 13:03:00.047 PST
16 Nov 2009 13:03:00,048 INFO [main] PluginJobManager:104 - >>>> END of PluginJobManager >>>> date DURATION=4 secs
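Checking a log for the success marker can be scripted. The sketch below is a hypothetical helper that classifies log text by looking for a FATAL level or the "Finished successfully" message seen in the excerpts in this deck; it assumes nothing else about the log format.

```python
def job_status(log_text: str) -> str:
    """Classify a BioScope log as 'failed', 'succeeded', or 'unknown'."""
    if " FATAL " in log_text:
        return "failed"       # a FATAL entry outranks any success message
    if "Finished successfully" in log_text:
        return "succeeded"
    return "unknown"          # job may still be running


sample = (
    "16 Nov 2009 13:03:00,047 INFO [main] AnalysisJobManager:79 - wt.counttag.run completed.\n"
    "16 Nov 2009 13:03:00,047 INFO [main] PluginJobManager:118 - Finished successfully\n"
)
print(job_status(sample))
```

In practice the text would come from reading bioscope.[timestamp].log in the log folder.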

BioScope Job Monitoring (Log Files)

• Hierarchical structure to log files
• Check higher-level logs for general errors and lower-level logs for more specific errors

  bioscope.[timestamp].log
  console.log
  mapping.{}.main.[timestamp].log
  mapping.scatter.[timestamp].log

BioScope Job Monitoring (Log Files)

• Use the log file to diagnose errors

22 Dec 2009 11:52:56,967 FATAL [main] PluginJobManager:99 - Bioscope failed
java.io.IOException: ReferenceFile does not exist: /data/results/users/user\#\#/bioscope_runs/intermediate/spljunctionextraction/junction.fasta
  at com.apldbio.aga.analysis.secondary.mapping.reseq.MappingPipeline.setUpRefParams(MappingPipeline.java:374)
  at com.apldbio.aga.analysis.secondary.mapping.reseq.MappingPipeline.validateParams(MappingPipeline.java:286)
  at com.apldbio.aga.analysis.tertiary.wt.plugin.JunctionMappingPipeline.validateParams(JunctionMappingPipeline.java:43)
  at com.apldbio.aga.analysis.exec.PluginRunner.preparePipeline(PluginRunner.java:104)
  at com.apldbio.aga.analysis.exec.PluginRunner.doMain(PluginRunner.java:137)
  at com.apldbio.aga.analysis.exec.PluginRunner.main(PluginRunner.java:339)
22 Dec 2009 11:52:56,968 INFO [main] PluginJobManager:102 - >>>> END of PluginJobManager >>>> date=2009-12-22 11:52:56.968 PST
22 Dec 2009 11:52:56,968 INFO [main] PluginJobManager:104 - >>>> END of PluginJobManager >>>> date DURATION=1 minutes 43 secs

99 BioScope Software Overview
• Introduction
• Flexible Access
• Graphical User Interface (GUI)
• Command Line
• Monitoring Job Status
• Data Standardization to BAM format
• Analysis and Libraries
• SAET

100 SOLiD BioScope Data Standard: SAM/BAM

• Contains both color-space and base-space sequences

• SAM (Sequence Alignment/Map); BAM is the binary version

• http://samtools.sourceforge.net/

• Tab-delimited text format

• Contains base-space quality values

• Extended CIGAR format gives compact alignment information
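
To make the tab-delimited layout and the CIGAR string concrete, here is a minimal Python sketch that splits one SAM alignment line into its 11 mandatory fields and breaks the CIGAR string into (length, operation) pairs. The field names follow the SAM specification; the example record, including its `CS:Z:` color-sequence tag value, is made up for illustration and is not BioScope output.

```python
import re

# The 11 mandatory SAM columns, in order (names from the SAM specification).
SAM_FIELDS = ("QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL")
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_sam_line(line):
    """Split one SAM alignment line into a dict of mandatory fields plus tags."""
    cols = line.rstrip("\n").split("\t")
    record = dict(zip(SAM_FIELDS, cols[:11]))
    for key in ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN"):
        record[key] = int(record[key])
    record["TAGS"] = cols[11:]  # optional TAG:TYPE:VALUE fields
    return record


def parse_cigar(cigar):
    """Turn a CIGAR string such as '20M2I28M' into [(20, 'M'), (2, 'I'), (28, 'M')]."""
    if cigar == "*":  # '*' means no alignment information
        return []
    return [(int(length), op) for length, op in CIGAR_OP.findall(cigar)]


if __name__ == "__main__":
    # Hypothetical record, for illustration only.
    line = "\t".join(["read1", "0", "chr20", "106420", "60", "20M2I28M",
                      "*", "0", "0", "A" * 50, "I" * 50, "CS:Z:T0120"])
    rec = parse_sam_line(line)
    print(rec["RNAME"], rec["POS"], parse_cigar(rec["CIGAR"]), rec["TAGS"])
```

For real BAM files an established library or `samtools view` is the safer route; the sketch only shows how little structure the text format needs.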

101 Visualization: Integrative Genomics Viewer (BAM)

— http://www.broadinstitute.org/igvbeta/index.html

102 SOLiD Accuracy Enhancement Tool (SAET)

• Modified version of the spectral alignment error correction algorithm proposed by Pevzner et al. (2001)

• SAET takes quality values and properties of color space into account, significantly improving the performance of the original method for SOLiD data

• Reduces the color-calling error rate by a factor of three to five without requiring a reference genome

103 SAET Usage Considerations
• Error-rate improvements demonstrated in
  - Targeted resequencing of large genomes
  - Whole genome sequencing of smaller genomes (under 200 Mb)
  - De novo assembly of smaller genomes (under 200 Mb)
• SAET genome considerations
  - Size: 1 Kbp to 200 Mbp
  - Coverage: 30x to 4000x
  - Read length: 25 to 75 bp
• BioScope Software command line only
• For detailed use and considerations, see Appendix B of the BioScope 1.2 User Manual

104

SOLiD BioScope Software: Resequencing June 2010

105 BioScope Applications

[Diagram: Image → Primary Analysis → Mapping (Secondary Analysis) → Tertiary Analysis → Visualization. On the off-instrument compute cluster, ReSeq mapping produces BAM, which feeds the ReSeq tools: SNP, Small Indel, Large Indel, CNV, Inversions.]

106 Resequencing Tools

• Secondary Analysis
  - Mapping
  - Pairing
  - Position error

• Tertiary Analysis
  - DiBayes
  - Large Indel
  - Small Indel
  - Inversion
  - CNV

107 Mapping

• Reads are mapped to the reference genome
• Output: workingDir/output/
  - F3: workingDir/output/F3/s_mapping/
  - R3: workingDir/output/R3/s_mapping/
• Mapping state
• Position error
• Optional small indel detection

108 Local Alignment Strategy

• Map the first 25 colors of the read to the reference, allowing 2 mismatches (MM).
• For every hit found (up to the Z-limit), do a local extension
  - Accumulate alignment score (Match = 1, MM = -2 [user defined])
  - Report the best partial alignment (anchored local) based on score
    > Discard if score does not meet minimum cutoff

[Figure: example color-space read aligned to the reference]

• For reads not mapped, shift anchor location and attempt additional mapping
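
The extension step can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the BioScope mapper: it scores an ungapped, position-by-position comparison of read colors against reference colors from an anchor that is already placed, uses +1 per match and an assumed mismatch penalty of -2, and reports the prefix with the best running score. The function name and the `min_score` cutoff parameter are hypothetical.

```python
def best_anchored_extension(read, ref, match=1, mismatch=-2, min_score=None):
    """Return (length, score) of the best-scoring prefix of an ungapped alignment.

    `read` and `ref` are strings of color calls aligned position by position,
    starting at the anchor. The score is accumulated left to right and the
    longest prefix that first reaches the maximum score is reported.
    Returns None if `min_score` is given and the best score falls below it.
    """
    score = 0
    best_len, best_score = 0, 0
    for i, (r, g) in enumerate(zip(read, ref), start=1):
        score += match if r == g else mismatch
        if score > best_score:
            best_len, best_score = i, score
    if min_score is not None and best_score < min_score:
        return None  # discard: does not meet the minimum cutoff
    return best_len, best_score


if __name__ == "__main__":
    # One mismatch at position 5; extending past it still pays off.
    print(best_anchored_extension("0123012300", "0123312300"))
```

Because the running score can recover after a mismatch, the reported alignment may extend past an isolated error but is cut short when mismatches accumulate toward the end of the read.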

109 Mapping output file .ma

110 Pairing (if mate pair library)

• Match the mate pairs (F3 and R3)
• Output pairing result in workingDir/output/pairing
• Pairing states
• State for pairing distance
• The data will be used in downstream pipelines: Large Indel, Small Indel, Inversion, CNV

111 Hands on
• Log on
• Working from command line
• Work with a mate pair library
• Run ReSeq Pipeline
  - Mapping F3 and R3
  - Pairing
  - Position error
  - SNP call

112 Command line
• Take a look at the working directory
• File structure in the working directory
  - Config/
  - Output/
  - Temp/
  - Intermediate/
• Config/ contains instruction files
  - .plan file
  - .ini file
• Start the run
  - Monitor the run
  - Important output files: .ma file, BAM file and GFF file

113 BioScope Hands on

• Log on to the cluster (IP: 167.116.104.43)
• User name: user##
• Password: solid

• Go to the user working directory
  cd /data/results/users/user##/resequencing/

• Examine the plan file (example.plan)
  cd config/

• Note the separate directories for F3 and R3 reads

114 Hands on run at command line

• Check bioscope
  $> which bioscope.sh
• Run
  $> nohup bioscope.sh example.plan &
• Monitor the run
  $> qstat
  $> showq
• Find the output files; use the BioScope manual as reference

115 Continue on the Resequencing tools

116 Small Indel Tool

• Input: BAM file from secondary analysis
• Output: candidate indel list file (.pas, .gff)

A C G G T C - - C G T G T G C G T 2 base insertion

A C G G T C G T C G T G T G C G T

117 Small Indel Tool: Two ways to execute

• Can be run with fragment data as a plugin with the mapping step (smallIndelFrag)
• Can be run with mate pair data as a standalone module (GUI or command line)

Library   Largest deletion size   Largest insertion size
LMP       500                     20
PE        11                      3
FRAG      11                      3

118 Large Indel Tool

Compatible with MatePair data; calls indels up to 100 kb
• Input: BAM file from secondary analysis
• Output: Large indel GFF file (.gff)

Concordant clone

sample

reference

Discordant clone -- insertion appears smaller Discordant clone -- deletion appears larger

119 Inversion Tool

Compatible with MatePair data
• Input: BAM file from secondary analysis (pairing)
• Output: Inversion GFF file and auxiliary .txt files

Code   Meaning                          Tags as drawn
AAA    Normal                           R3 F3, F3 R3
BAX    Inversion                        R3 F3, F3 R3
BBX    Inversion                        R3 F3, F3 R3
ABX    Tandem repeat/double inversion   F3 R3, R3 F3

120 Matepair Descriptions R3 F3 Normal

• Mate-pairs are annotated with a three letter code

121 Inversion Pipeline

Use the Pairing Result from Mate pair library

R3 F3 Normal

F3 R3

R3 F3 Inversion

F3 R3

122 Copy Number Variation (CNV) Tool

Compatible with human data only
• Input: BAM file from secondary analysis, mappability files
• Output: GFF file, text-based files with CNV calls

123

BioScope™ Software Demo – SNP Finding

124 Dashboard

125 Advanced Settings: Find SNPs If needed, modify analysis settings

126 SNP Demo – start

127 Successful Job Submission
Make a note (or take a screenshot) of the log and output file locations

Log

Output

128 SNP Demo – check log file

129 SNP Demo – check results – gff3 file

130 Human Chr20 Demo data – IGV chr20:106420

131

SNP Detection

132 SOLiD™ System Enables SNP Calls at Low Coverage

ILMN sequence reads: Each color represents 1 base position. A single sequencing error can easily be miscalled as a SNP and requires higher coverage to discriminate.

AB SOLiD™ 3 System sequence reads: Each color represents 2 base positions, and information on each position is represented by 2 colors. A SNP typically results in 2 color changes, so single measurement errors can easily be identified and excluded from analysis.

[Figure: example read pile-ups for each platform. Single-base encoding: a discrepant call is ambiguous (error or SNP?). 2-base encoding: correction by 2-base encoding, where 2 adjacent color changes = SNP.]
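
The "2 adjacent color changes = SNP" rule can be checked with a short Python sketch. It uses the standard SOLiD color code, in which each color encodes a pair of adjacent bases and equals the XOR of the 2-bit base codes (A=0, C=1, G=2, T=3). The sequences and function names are illustrative assumptions, and the sketch ignores the known primer base that starts a real color-space read.

```python
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def to_colors(seq):
    """Encode a base sequence as colors, one color per pair of adjacent bases."""
    return [BASE_CODE[a] ^ BASE_CODE[b] for a, b in zip(seq, seq[1:])]


def color_differences(ref, sample):
    """Return the color positions at which two equal-length sequences differ."""
    pairs = zip(to_colors(ref), to_colors(sample))
    return [i for i, (x, y) in enumerate(pairs) if x != y]


if __name__ == "__main__":
    ref = "ACGGTCA"
    snp = "ACGATCA"  # single base change (G to A) at index 3
    print(to_colors(ref))
    print(to_colors(snp))
    print(color_differences(ref, snp))
```

A true single-base change inside a read always alters the two colors that overlap it, whereas an isolated color difference with no changed neighbor is more consistent with a measurement error.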

*Schematic not to scale for typical SNP detection coverage

133 Example diBayes SNP report

134 Post processing SNPs examples
• P-value: filter out SNPs with a p-value close to 1
• Color QV: is the mean color QV much lower than average?
• Low coverage: remove SNPs with very low coverage
• Filter out SNPs with flags
• Repeat the analysis with a more stringent setting
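
A filtering pass of this kind might look like the Python sketch below. It is only an illustration: the record keys (`coverage`, `p_value`, `flags`) and the default thresholds are hypothetical placeholders, not diBayes column names or recommended values, and would need to be mapped onto the attributes actually present in the SNP report.

```python
def filter_snps(snps, min_coverage=5, max_p_value=0.05, drop_flagged=True):
    """Keep SNP records that pass simple coverage, p-value and flag filters.

    Each record is a dict with hypothetical keys:
      "coverage" (int), "p_value" (float) and an optional "flags" list.
    """
    kept = []
    for snp in snps:
        if snp["coverage"] < min_coverage:
            continue  # very low coverage
        if snp["p_value"] > max_p_value:
            continue  # p-value too close to 1
        if drop_flagged and snp.get("flags"):
            continue  # caller raised one or more flags
        kept.append(snp)
    return kept


if __name__ == "__main__":
    calls = [
        {"pos": 101, "coverage": 30, "p_value": 0.001, "flags": []},
        {"pos": 202, "coverage": 2, "p_value": 0.001, "flags": []},
        {"pos": 303, "coverage": 25, "p_value": 0.9, "flags": []},
        {"pos": 404, "coverage": 40, "p_value": 0.002, "flags": ["h1"]},
    ]
    print([snp["pos"] for snp in filter_snps(calls)])
```

Keeping the thresholds as parameters makes it easy to rerun the same filter with a more stringent setting.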

135 Postprocessing / filtering SNPs

Low coverage

Filter for low NovelQV

Flag not called as Het

P-value close to 1

136 Filtered SNP list – what next ?

137 SIFT – filter snp list for exonic snps

138 SIFT – predictions for SNPs

139 SIFT results

140 Nat Genetics paper on second run !!!

141 The Importance of Accuracy

Accuracy will enable

• Detecting variants at low coverage

• Detecting low-frequency mutations among pooled samples or heterogeneous samples

• Accurately detecting rare mutations

• Reducing false positives and downstream validations

Rare variants can only be detected with a high accuracy system

Bodmer & Bonilla, Nature Genetics 40, 695-701 (2008)
TA Manolio et al., Nature 461, 747-753 (2009)

142 BioScope DiBayes
• Workflow
• Input files
  - Reference sequence
  - Mapping result or pairing result or both (BAM file)
  - Reads quality
  - Position-error files (created by mapping pipeline)
    > Error rate at each position on read of two color code
• Output files
  - workingDir/output/diBayes/
  - SNP output files
    > Report in GFFv3 format
    > Consensus_call.txt

143 output
• outputs/
  - diBayes
    > chr_1/
    > exampleExperiment_Consensus_Basespace2.fasta
    > exampleExperiment_SNP.gff3
  - pairing
    > F3R3Paired.bam
    > F3R3Paired.bam.bai
    > pairDistFreq/
    > pairingStats.stats
    > unmappedBamFile.bam
  - positionerrors
    > F3R3Paired_F3_positionErrors.txt
    > F3R3Paired_F3_positionErrors.txt
  - s_mappingF3
    > mappingstats.txt
    > myRun_chr20_F3.csfasta.ma
  - s_mappingR3
    > mappingstats.txt
    > myRun_chr20_R3.csfasta.ma

144

SOLiD BioScope Software: WT July 2010

145 BioScope Applications

[Diagram: Image → Primary Analysis → Mapping (Secondary Analysis) → Tertiary Analysis → Visualization. On the off-instrument compute cluster, WT mapping produces BAM, which feeds the WT tools: Coverage, Counting, Fusion.]

146 Outline

• Introduction
  - RNASeq (Whole Transcriptome Analysis)
• BioScope™ Software Version 1.2
  - Single Read Whole Transcriptome Pipeline
  - Paired End Whole Transcriptome Pipeline
• Splicing detection
  - Detect exon-exon junctions
  - Alternative splicing
  - Gene fusions
• Gene expression
  - Count expression of exons, transcripts, genes (count tags)
  - Count coverage across genome area (wig)

147 RNASeq: Whole Transcriptome Library Prep

Few important points:
• Libraries are strand-specific
• Standard protocol calls for RNase III digestion…
• Can sequence from P1 (“single end”) or from P1 and P2 (“paired end”)
• Barcode sequencing ordinary

[Figure: library preparation workflow. Steps shown: total RNA; small RNA depletion (5S, 5.8S rRNA; tRNAs); poly(A) selection; RiboMinus kit (18S, 25S rRNA); RNase III fragmentation; adapter ligation; reverse transcription; gel purification; PCR, ~15 cycles; column purification. Resulting library: P1, IA, P2 with ~120 bp inserts.]

148 SOLiD™ 4.0 WT Pipeline Overall Flow

WT RNA Library (Ambion)

Primary analysis: SOLiD™ 4.0 Instrument

csfasta/qual

Secondary analysis: Map Paired End WT, Map Single Read WT

Tertiary analysis: Count annotations, Coverage, Junction and Fusion Finder

149 Bioscope™ 1.2 Software Single Read WT Pipeline

Splice Junction Extractor Mapping Stages

Genomic Junction Map Filter Map Map*

Merge Stage Legend Merge ma* * = required plugin

Tertiary Stages

Sam2Wig Counttag

150 Gene Expression

• Determine expression of exons with “counttags”.
  - Input: a gene annotation file (GTF) and the BAM alignment file
  - Output: alignment counts and RPKM expression measurements for each exon
• Calculate coverage profiles across the genome with “sam2wig”.
  - Input: BAM alignment file
  - Output: genomic coverage profiles, one WIG file produced per chromosome strand
• Both tools consider reads and transcript annotations in a strand-specific fashion
  - Assumption is that the Whole Transcriptome Kit will be used.
• Inference of whole transcript abundance is left to third-party tools
  - e.g., Cufflinks
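
For reference, RPKM (reads per kilobase of exon per million mapped reads) can be computed as in the Python sketch below. It uses the usual definition of the measure; how BioScope counts alignments per exon and which total it normalizes by are not stated on the slide, so the inputs here are assumptions.

```python
def rpkm(read_count, feature_length_bp, total_mapped_reads):
    """Reads per kilobase of feature per million mapped reads.

    RPKM = read_count * 1e9 / (feature_length_bp * total_mapped_reads)
    """
    if feature_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("feature length and total mapped reads must be positive")
    return read_count * 1e9 / (feature_length_bp * total_mapped_reads)


if __name__ == "__main__":
    # 500 alignments on a 2,000 bp exon in a run with 10 million mapped reads
    print(rpkm(500, 2000, 10_000_000))
```

Normalizing by both feature length and sequencing depth is what makes exon values comparable within and between runs.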

151 Gene Expression

• A snapshot of coverage/expression profiles (from the WIG file):

152 Display WT data using Integrative Genomics Viewer (IGV)

• UHR gene region displayed with IGV for positions 3,530,193 to 3,548,355 of Human Chr-1.
• Wig (x2) tracks: Top 2 tracks show the genomic coverage using the negative strand and positive strand generated by the Bam2Wig tool (Max: 100 coverage).
• BAM track: Middle track shows the alignments from the BAM file. For display purposes reads are filtered with MAPQ threshold of 45 (a stringent filter) and bases with quality value 5 to 20 are shaded.
• BED track: Fifth track shows the junctions detected by Junction Finder (BED file). In this case all junctions detected are "known" and so are shaded in green.

153 RNASeq Summary
• RNA splicing and novel gene fusion detection enabled
• Intersection of paired and single read methods eliminates false positives
• User-defined parameters allow tuning sensitivity versus specificity
• RNASeq shows large dynamic range and high correlation with TaqMan
• Standard output files:
  - GTF annotation counts
  - Wig file (coverage)
  - Bed file (junctions)
  - Junctions, tab-separated
  - Alternative splicing
  - Fusions
• New metrics reported:
  - Total/unique junction evidence counts
  - RPKM
  - JCV

154 Whole Transcriptome (RNASeq) hands on
• Brief review of GUI
• Brief review of file structure
• Command line
  - ini
  - Run bioscope with whole transcriptome plan file
  - Review the output

155 What’s new in Bioscope 1.2 wt pipeline?

156 Review BioScope GUI for Whole Transcriptome
• Global
  - Choose library type
  - Working directory
• Application
  - Input files, max hits, read length
• Advanced
  - Plugins
  - Settings
• Submit the job and monitor it
• Command line follow-up
  - qstat
  - Looking at the file structure
  - ini files

157 Hands on
• Find the working directory /data/results/users/user#/whole_transcriptome
• Look in config/ for the ini files
• Run them step by step
  nohup bioscope.sh config/wt_map/example.ini &
  nohup bioscope.sh config/knownExon/example.ini &
  nohup bioscope.sh config/wig/example.ini &

158 Review the output
• Take a look at the BAM file
  samtools view .bam | more
• Take a look at other output files

159 Data Analysis Support Resources

• User Documentation
  - SOLiD 3 Plus to SOLiD 4
  - ICS/SETS User Guide
  - BioScope User Guide
  - Quick Reference Guide

• SOLiD 4 Data Analysis Webinars

160 Global Service and Support

More Than Just a Technology… A Dedicated Team To Enable Your Success

SOLiD Technical Sales Consultants
• Knowledge of system and components
• Guidance to the right platforms and reagents

SOLiD Field Application Scientists
• Application knowledge & experience
• Experimental design & support

SOLiD Bioinformatics Scientists
• Computer science, IT, and bioinformatics
• Data preparation & analysis

SOLiD Field Service Engineers
• Technical engineering system knowledge
• Installation and maintenance

Helping you to…
• Select the right technology
• Get started quickly
• Continuous operation
• Design & run experiments
• Successful data analysis
• Interpret data sets for publications

161 THANK YOU

For Research Use Only. Not intended for any animal or human therapeutic or diagnostic use. © 2010 Life Technologies Corporation. All rights reserved. The trademarks mentioned herein are the property of Life Technologies Corporation or their respective owners. TaqMan is a registered trademark of Roche Molecular Systems, Inc.

162 Extra slides

163 Junction Confidence Value (JCV)

Equation 1. Junction Confidence Value (JCV)

$$\mathrm{JCV}_{j_{x-y}} = \sum_{i=1}^{n} \mathrm{PQV}_i - 10 \log_{10}\left(\mathrm{EEM}_{j_{x-y}}\right)$$

Equation 2. Error Expectation Metric (EEM)

$\mathrm{EEM}_{j_{x-y}}$ is computed from the read coverages $RC_x$ and $RC_y$, the exon lengths $\ell_x$ and $\ell_y$, and the insert-size term $\mu_T + 3 \times \sigma_T$.

JCV = Junction Confidence Value
EEM = Error Expectation Metric
PQV = Pairing Quality Value
$j_{x-y}$ = junction between exons x and y
$RC_x$ = read coverage of exon x
$\ell_x$ = length of exon x
$\mu_T$ = average insert size of library
$\sigma_T$ = standard deviation of insert size

• Highly expressed genes (with random sequencing errors) and homologous exons are suspected to be major contributors to false positive junctions. JCV assigns a low score to possible random junctions by considering exon coverage, exon length and the pairing quality of the junction evidence.

164 WT Annotation Aided Alignment Rescue
• Why?
  - Reads that are not mapped, due to sequencing errors
  - Search space too large to allow many mismatches, due to high false positive rate
• How?
  - Reads come in pairs that are close together
  - If one of the reads is mapped, then we can search in the vicinity of that read
  - Small insert size for PE means a reduced search space for rescue
  - We use annotation to search in the most likely places on the genome
• When?
  - Applied to read pairs that have at least one alignment
  - No pair of alignments occurring within a maximum expected range
    > The expected range was set to 100,000 bases
  - If the anchor read overlaps a gene

165 Accuracy

• Accuracy is affected by the mapping parameters used
  - Increasing the number of mismatches allowed will increase the number of reads that map and drive up the error rate

• The accuracy after applying 2-Base Encoding (2BE) rules improves significantly over raw color accuracy

166 Accuracy Assessment – With a Known Genome

50 base read mapped with up to 6 MM (DH10B results)

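[Figure: pie chart of color-space (CS) calls. Correct CS calls: 97.60%. Mismatched calls: 2.40% in total, divided among single mismatched calls, valid adjacent and invalid adjacent mismatches; segment values shown: 0.14%, 2.17% and 0.09%.]

Accuracy = 99.91%
Raw accuracy (before correction by 2BE) = 97.6%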

167 Accuracy

Effect of 2BE correction

Base accuracy by position in read

[Figure: percent error (log scale, 0.01 to 100, with the corresponding QV of 10 to 30) versus base position in read (1 to 49), for raw and corrected calls.]

168 How Much Coverage Do You Need? – For SNP

• 2-base encoding helps to reduce the coverage needed to detect a SNP with high confidence
• A heterozygous SNP needs higher coverage than a homozygous SNP, to detect both alleles
  - If the coverage at a heterozygous position is less than 10X, the probability that one of the alleles will not be detected is 1% or more
  - If the sample preparation method is likely to introduce some bias in allele ratio, coverage should be increased
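
The dependence on coverage can be explored with a simple binomial model, sketched in Python below. The model is an assumption made for illustration: reads are drawn independently from the two alleles with a fixed allele fraction, and an allele counts as detected once at least `min_reads` reads support it. The slide does not state its detection criterion, so the sketch is not meant to reproduce the quoted figure exactly; `p_allele_missed` is a hypothetical helper.

```python
from math import comb


def p_allele_missed(coverage, min_reads=1, allele_fraction=0.5):
    """Probability that at least one of two alleles has fewer than min_reads reads.

    The number of reads from the first allele is modeled as
    Binomial(coverage, allele_fraction); the remaining reads come from the
    second allele.
    """
    total = 0.0
    for x in range(coverage + 1):
        if x < min_reads or coverage - x < min_reads:
            total += (comb(coverage, x)
                      * allele_fraction ** x
                      * (1 - allele_fraction) ** (coverage - x))
    return total


if __name__ == "__main__":
    for cov in (5, 10, 15, 20):
        print(cov, p_allele_missed(cov), p_allele_missed(cov, min_reads=2))
```

Lowering `allele_fraction` below 0.5 models the allele-ratio bias mentioned above and shows why biased sample preparation calls for extra coverage.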

169 SNP Detection at Different Coverages

Wheeler et al PLoS Computational Biol. 2008 Vol 452| 17 April 2008| doi:10.1038/nature06884

170 Coverage Variance in Different Genomic Regions

• Ideally, the coverage would follow a binomial distribution

• Possible reasons for deviations
  - Characteristics of the reference genome (complexity, repeats, etc.)
  - The samples being sequenced (structural variation)
  - Sample preparation and sequencing chemistry

• Mate-pair data (rather than fragment data) can largely, but not completely, overcome this problem

• Consistent contiguous regions of over/under coverage may represent copy number variation
  - Detection of SNPs or InDels in these regions should be treated with caution

[Figure: frequency versus coverage]

171