SOLiD Bioinformatics Overview

Yang Wang, Staff Field Bioinformatics Scientist, Life Technologies

April 14, UMASS

Part I SOLiD System Overview
• SOLiD workflow and system components
• Library preparation
• On-bead chemistry
• Data analysis workflow

Part II Color Concept and Mapping
• Color space and 2-base-encoding
• Quality values and filtering
• Mapping algorithm and considerations

SOLiD Workflow: Application-specific sample preparation → Emulsion PCR & slide preparation → Sequencing chemistry → Imaging & basecalling → Application-specific data analysis

Monitor and keyboard to onboard XP box

Reagent handling Waste Containers

1Gbit Switch

Instrument Controller/XP Box

Head Node – Dell 2950

Compute Nodes – Dell 1950

MD1000 – secondary storage

Slide

Reagent Stage with Dual handling Flow Cell

Spot (1/4) Spot (1/1) Spot (1/8)

1 image per panel/ligation

Less real estate for multiple samples

3’-end modification

Beads attached to glass surface in a random array


Fragment Library (Targeted Resequencing, ChIP-Seq, SmallRNA , Multiplexing)

P1 adapter DNA fragment P2 adapter

60-90 Bases

Mate-pair Library (Whole Genome Sequencing, Structural Variation)

P1 adapter F3 Tag Internal Adapter R3 Tag P2 adapter

Complex sample (e.g. genomic DNA, TAG library, concatenated PCR products) → Fragment sample, randomly or targeted (e.g. sonication, mechanical, enzymatic digestion) → Ligate P1 and P2 adaptors

P1 adapter DNA fragment P2 adapter

60-90 Bases

50 bp F3 Tag

P1 adapter T DNA Fragment P2 adapter

Colorspace read output in FASTA format

>1_88_1830_F3
T2103112003130213233110321
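A colorspace record like the one above can be split into a read ID, the leading primer base, and the color calls. The following is a minimal sketch only: the function name `parse_csfasta` and the tuple layout are illustrative choices, not part of any SOLiD tool, and it assumes the header reads `1_88_1830_F3` and that lines starting with `#` are comments.

```python
from typing import Iterable, Iterator, Tuple


def parse_csfasta(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (read_id, first_base, colors) for each colorspace FASTA record.

    Hypothetical helper: assumes every record is one '>' header line
    followed by one sequence line (a base, then color digits).
    """
    read_id = None
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and assumed '#' comment lines
        if line.startswith(">"):
            read_id = line[1:]
        elif read_id is not None:
            # First character is the known primer base, the rest are colors.
            yield read_id, line[0], line[1:]
            read_id = None


records = list(parse_csfasta([">1_88_1830_F3", "T2103112003130213233110321"]))
```

Keeping the first base separate from the colors is convenient because the base is needed only if the read is later translated back to base space.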

Complex sample (e.g. genomic DNA, TAG library, concatenated PCR products) → Fragment sample, randomly or targeted (e.g. sonication, mechanical, enzymatic digestion) → Ligate P1 and modified P2 adaptors

P1 adapter DNA fragment Internal Adapter Barcode P2 adapter

60-90 Bases

F3 Tag BC Tag

P1 adapter T DNA Fragment Internal Adapter G Barcode 1 P2 adapter
P1 adapter T DNA Fragment Internal Adapter G Barcode 2 P2 adapter
……..
P1 adapter T DNA Fragment Internal Adapter G Barcode 16 P2 adapter

>1_88_1830_F3
T2103112003130213233110321
>1_88_1830_BC
G00032

Complex sample → Randomly fragment sample → Size select (e.g. 1, 2, 3, 5, 10 kb) → Ligate internal adapters (IA)

Circularize

Nicked Cleave IA IA +

Ligate P1 and P2

50 bp 50 bp F3 Tag R3 Tag

P1 adapter T DNA Sample Internal Adapter G DNA Sample P2 adapter

>1_88_1830_F3
T2103112003130213233110321
>1_88_1830_R3
G3211312320130023232012112

Library Template

Primers P1<

P1-coupled beads

Polymerase Enzyme

P1-coupled beads

1) Template Anneals to P1

2) Polymerase extends from P1

3) Complementary sequence is extended off bead surface

Beads with no product

Bead contains ~30K amplified products from original single strand molecule

R3 F3 5' 3'

3' 5' F3 R3

Note strand and orientation of the tags per Mate-Pair library construction
- Both F3 and R3 are on the same strand
- R3 is upstream of F3


Identify beads → Collect color image → Identify bead color

Record colors for each bead over consecutive cycles

Color space Base space A C G G T C G T C G T G T G C G T

3’ 3’ Ligation site Fluorescent dye

T C n n n z z z

2nd base

• 1,024 octamer probes (4^5), 4 dyes
• 4 dinucleotides and 256 probes per dye
• Each dinucleotide (1st base, 2nd base) is encoded by a color
• N = degenerate bases, Z = universal bases
• Dyes: CY5, TXR, CY3, FAM

Initialize: 1 µm bead carrying the P1 adapter and template sequence (5' to 3')

[Figure: ligation with the universal seq primer. Ligase joins the matching probe (e.g. 3' TA n n n z z z 5') to the primer (p5'), which reads template positions 1,2. Repeated ligation cycles read positions 1,2; 6,7; 11,12 and so on along the template.]

Reset and primer annealing

[Figure: the same cycles with universal seq primer n-1, which reads positions 0,1; 5,6; 10,11; 15,16 and so on.]

• SOLiD Workflow and Library Creation • On-bead chemistry • Data analysis workflow

• Instrument Control Software (ICS): set up run details on XP box
• Job Manager (JobManager): executes workflows; inserts workflows into, and queries/updates job status in, the SOLiD DB
• SOLiD DB (relational DB): stores SOLiD run workflow information
• Initiate pipeline execution: send jobs to cluster (Primary Analysis, Secondary Analysis)
• PBS resource manager: queues, job submission
• SETS: view real time results

Barcoding/Multiplexing Align to Focal Map

Bead Finding

Primary Analysis Outputs
- Colorspace reads in FASTA format
- Quality value scores
- Panel and bead statistics

On-Instrument BioScope

[Figure: BioScope pipeline. Inputs: Fasta/Qual files. Steps: Filtering Pipeline (optional), Mapping/Pairing, then DiBayes (SNP), Structural Variation, Whole Transcriptome. Outputs: GFF/SAM mapping files and application-specific files.]

© 2009 Applied Biosystems

Auto-export and Offline Analysis
• Job Manager and BioScope mediated
• Configured from on-instrument software package, SETS
• BioScope will launch the remote analysis job when run completes

New features

On-Instrument

Image Acquisition → Primary Analysis → Auto-Export → BioScope

Offline Analysis

Available tools include (and others): • Small RNA Pipeline • SOLiD GFF Conversion Tool • SOLiD Base QV Tool

• Color space and 2-base-encoding • Quality values and filtering • Mapping algorithm and considerations

Base space: Identify peaks → Collect color image → Identify peak colors → Convert to base calls

Color space: Identify beads → Collect color image → Identify bead color

Record colors for each bead over consecutive cycles

Color space Base space A C G G T C G T C G T G T G C G T

1 3 1 3 1 3 2 3 FirstBase

• Two dibases that agree in just one base have different colors
  • color(AC) ≠ color(AG) ≠ color(AT) ≠ color(AA)
• Two dibases that do not agree in either base have same color
  • color(AC) = color(GT) and color(CG) = color(AT)
• A dibase and its reverse have the same color
  • color(AC) = color(CA), color(GT) = color(TG)
• Repeated-base dibases have the same color
  • color(AA) = color(CC) = color(GG) = color(TT)

“Valid” and “Invalid” Adjacent Color Substitutions
• “Invalid” changes are inconsistent with SNP and likely sequencing errors
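The color rules above can be reproduced with a very small model. This is a sketch under stated assumptions: bases are coded A=0, C=1, G=2, T=3 and a color is the XOR of the two base codes. The numeric labels 0 to 3 are an assumed convention (they agree with the color digits in this deck's worked examples); which dye corresponds to which number is not modeled, and the function names are illustrative.

```python
# Assumed numeric coding of bases; a color is the XOR of two base codes.
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_BASE = "ACGT"


def color(dibase: str) -> int:
    """Color (0-3) of a two-base string such as 'AC'."""
    return BASE_CODE[dibase[0]] ^ BASE_CODE[dibase[1]]


def encode(bases: str) -> str:
    """Base space -> color string (one color per adjacent base pair)."""
    return "".join(str(color(bases[i:i + 2])) for i in range(len(bases) - 1))


def decode(first_base: str, colors: str) -> str:
    """Color string plus a known first base -> base space."""
    out = [first_base]
    for c in colors:
        # Each color, combined with the previous base, fixes the next base.
        out.append(CODE_BASE[BASE_CODE[out[-1]] ^ int(c)])
    return "".join(out)


round_trip = decode("A", encode("AGGCACC"))
```

Decoding depends on every earlier color, so a single wrong color call changes all downstream bases. That is one reason to align in color space rather than translating reads first.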


• A score calculated based on the probability of an error call at that base
• Similar to those generated by phred and the KB Basecaller for capillary electrophoresis sequencing

$q = -10 \log_{10} p$, where $p$ = probability of color call error

• A QV score of 10 represents a 10% error rate, whereas a QV score of 20 represents a 1% error rate

• Use angle of vector / intensity of color and a lookup table (pre-computed from training data sets) to predict QV

• Each color call and QV is computed independently
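The QV definition above converts directly in both directions. A minimal sketch (the function names are illustrative):

```python
import math


def qv_from_error_prob(p: float) -> float:
    """q = -10 * log10(p), for an error probability 0 < p <= 1."""
    return -10.0 * math.log10(p)


def error_prob_from_qv(q: float) -> float:
    """Inverse of the definition: p = 10 ** (-q / 10)."""
    return 10.0 ** (-q / 10.0)


p_at_qv20 = error_prob_from_qv(20)
```

Each step of 10 in QV corresponds to a tenfold change in the error probability.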

• Filter • Filter out reads based on QV values and patterns • Reduces total amount of reads and raw error rate • Increases % of reads mapped • Lose good alignments

• Don’t filter • Filter the data by mapping • Low quality reads won’t map


• Challenge: • A small word size is needed for continuous word searches in short reads. This is computationally and time intensive.

• Our Approach: • Use discontinuous word patterns • Allows faster searching and guaranteed to find all hits up to a certain number of mismatches

• Continuous words: searching for a perfect alignment, 8/8 bases (word size 8, e.g. used by BLAST)

ATTTTTTGGGTAGCCCCTTGGATGAGT
       ||||||||
     AGGGGTAGCCTGATGATGGT

• Discontinuous words: searching 8/18 matches (effective word size is also 8)

ATTTTTTGGGTAGCCCCTTGGATGAGT
     ||     ||     ||||
     TTGACCGGCATGGGGGAT
     110000011000001111

• For a read length of 15, we can find all alignments with 1 mismatch (15_1) using discontinuous words in the 3 schemas of word size 10

Schema_15_1

Ref   002321031332122013220
Read     321031333122013
         111111111100000   ?
         000001111111111   ?
         111110000011111   extend

• Using continuous words, word size must be at most 7 to find all alignments with 1 mismatch
• 40 times slower than the three schemas above
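Why three schemas are enough for 15_1 can be checked mechanically: whichever single position is mismatched, at least one schema has a 0 there and so still produces a hit. The sketch below is illustrative only. The helper names are hypothetical, and the read string is a reconstruction of the example above (15 colors, differing from the reference at one position).

```python
SCHEMAS_15_1 = ["111111111100000", "000001111111111", "111110000011111"]


def schema_hits(read: str, ref_window: str, schema: str) -> bool:
    """True if read and reference agree wherever the schema has a '1'."""
    return all(r == g for r, g, s in zip(read, ref_window, schema) if s == "1")


def covers_all_single_mismatches(schemas, length: int) -> bool:
    """True if every possible mismatch position is ignored by some schema."""
    return all(any(s[pos] == "0" for s in schemas) for pos in range(length))


ref = "002321031332122013220"
read = "321031333122013"            # reconstructed example read
window = ref[3:3 + len(read)]       # reference window at offset 3
hits = [schema_hits(read, window, s) for s in SCHEMAS_15_1]
complete = covers_all_single_mismatches(SCHEMAS_15_1, 15)
```

Here only the third schema hits, because the mismatch falls inside the words of the first two; the hit is then extended and verified over the full read.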

# 14 base index on 25, 0 mismatches
000000000011111111111111

# 14 base index on 25, 1 mismatches
111111111111110000000000
111110000000000111111111
000000000011111111111111

# 14 base index on 25, 2 mismatches
000001101111001110110111
110110001000101010111110
111101100111100100001001
100011110011110011011010
011110010100010111010011
101000011100111101101101
010101111011011001100100

More mismatches: longer run time

• A SNP will generate two color mismatches • Consider the SNP frequency in the genome when setting up number of mismatches allowed in a read

• Recommended mismatch levels • 50 base-pair read – 6 mismatches • 35 base-pair read – 3 mismatches • 25 base-pair read – 2 mismatches

Mapping Tool - mapreads
• General features of mapping tool
  • Aligns in color space
  • Translates reference sequence to color space
  • Allows mismatches (no indels), valid adjacent mismatches can be counted as one
  • Allows masking of certain positions (bad calls)
  • For fixed reference sequence, running time is linear with number of reads

• New with SOLiD 3+ • Seed and extend mapping approach • Multi-threaded

• Map the first 25 colors of the read to the reference, allowing 2 mismatches (MM).
• For every hit found (up to the Z-limit), do a local extension
• Accumulate alignment score (Match = 1, MM = -2 [user defined])
• Report the best partial alignment (anchored local) based on score
• Discard if score does not meet minimum cutoff

Read: 0122130123012303201203021123012310231203120103120
      ||||||| ||||||||||||||| |||||| |||||||||| |||
Ref:  0122130023012303201203011123011310231203100101203

• For reads not mapped, shift anchor location and attempt additional mapping
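The scoring rule above (Match = 1, MM = -2, report the best partial alignment anchored at the seed) can be sketched as a running score over an ungapped alignment. This is an illustration only, not the mapreads implementation. The function name is hypothetical, and the read and reference strings are a reconstruction of the example alignment in this section.

```python
def best_anchored_extension(read: str, ref: str, match: int = 1, mismatch: int = -2):
    """Return (aligned_length, score) of the best-scoring prefix alignment.

    The alignment is ungapped and anchored at position 0 of both strings;
    the running score is accumulated and the highest-scoring prefix is kept.
    """
    best_len, best_score, score = 0, 0, 0
    for i, (r, g) in enumerate(zip(read, ref), start=1):
        score += match if r == g else mismatch
        if score > best_score:
            best_len, best_score = i, score
    return best_len, best_score


read = "0122130123012303201203021123012310231203120103120"
ref = "0122130023012303201203011123011310231203100101203"
aligned_length, score = best_anchored_extension(read, ref)
```

A caller would then discard the alignment if `score` falls below a chosen minimum cutoff. Because mismatches cost more than matches gain, a run of mismatches at the end of the read is trimmed off rather than dragging the score down.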

start end

reference

read (offset)

• start and end mark the start and end of the alignment in the reference. • The alignment may not encompass the entire read. • The start of the alignment in the read is called the offset

• Mapping quality is an estimate of how likely an alignment is to be correct
• First, calculate the posterior probability

$$P(r \mid \text{Alignment}) = (1 - e)^{t-m} \, e^{m} \left(\tfrac{1}{4}\right)^{L-t}$$

• If an alignment has a probability of $P(r)$, its mapping QV is defined as

$$-10 \log_{10}\left(1 - \frac{P(r)}{P}\right), \quad \text{where } P = \sum P(r) \text{ for all reads}$$
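The mapping QV formula above can be evaluated directly once a probability is available for each candidate hit. The sketch below is a literal reading of that formula, not the tool's implementation. The function name and the `cap` parameter are assumptions, added because the formula diverges when a read has a single hit, and the values reported by the actual software may differ (for example through rounding or thresholds).

```python
import math
from typing import Sequence


def mapping_qv(p_hit: float, p_all: Sequence[float], cap: float = 100.0) -> float:
    """-10 * log10(1 - p_hit / sum(p_all)), limited to an assumed cap."""
    total = sum(p_all)
    p_wrong = 1.0 - p_hit / total
    if p_wrong <= 0.0:
        return cap  # only one candidate hit: the formula is unbounded
    return min(cap, -10.0 * math.log10(p_wrong))


two_equal_hits = mapping_qv(0.5, [0.5, 0.5])
one_dominant_hit = mapping_qv(0.9, [0.9, 0.1])
```

Two equally likely hits give a low value, while a hit that dominates the alternatives gives a higher one, so the result is driven by the difference between the hits.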

Mapping QV: Working Examples
• Read maps two places, both with one mismatch
• Each has equal chance of being correct

mqv=0
Read: 0122130123012303201203021
      ||||||||||||||||||||||| |
Ref:  0122130123012303201203011

mqv=0
Read: 0122130123012303201203021
      ||||||||||| |||||||||||||
Ref:  0122130123002303201203021

• Read maps two places, one with zero mismatches, and one with two mismatches
• Higher likelihood the true alignment is the perfect alignment

1_2434280.2:(47.3.0):q22, mqv=22
Read: T…0120123012303201203011
        ||||||||||||||||||||||
Ref:   …0120123012303201203011

1_24854171.2:(47.5.0):q0, mqv=0
Read: T…0120123012303201203021
        |||||||| |||||||| ||||
Ref:   …0120123002303201223021

• Mapping QV largely depends on the difference of the hits

Traditional Mapreads
• Split up the mapping jobs
  • All reads mapped against part of reference (IO intensive, processing the same read multiple times)
  • Limit read mapping to each reference entry (-z)
  • Merge results (IO intensive)

[Figure: All reads (.csfasta) + Reference Entry 1 → CPU 1 → Mapped Results 1]

• Single mapping job
  • Fraction of reads (1/n) are mapped against the whole reference
  • ~20GB of RAM for the human genome
  • Limit read mapping to whole genome (-z)
  • Combine results (simple merge)

[Figure: 1/n reads (.csfasta) + Full Reference → CPU 1, CPU 2, ... → Mapped Results 1, Mapped Results 2, ... → Combined Results]

• Increased throughput • Some data sets have observed 2-fold increase in mapping using local mapping vs. classical mapping • Increased speed • Up to 15X Faster than iterative mapping with trimming • As read length increases, only a small set of schemas is needed to be optimized

• Given a read with QVs, one can estimate the expected number of errors in the read

• If the QV of the i-th call is $q_i$, then the expected number of errors in the read is

$$m = \sum_{i=1}^{n} p_i, \qquad p_i = 1.0022 \cdot 10^{-q_i/10}$$

The factor 1.0022 accounts for q values being rounded to integer
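The expected-error estimate is a one-line sum over the QVs of a read. The sketch below also shows one way the 1.0022 factor can arise, assuming the unrounded QV is uniformly distributed within half a unit of the reported integer; that derivation is an assumption for illustration and is not stated in the slides. Names are illustrative.

```python
import math
from typing import Iterable

ROUNDING_FACTOR = 1.0022  # correction factor quoted above


def expected_errors(qvs: Iterable[float]) -> float:
    """Expected number of color-call errors in a read, given its QVs."""
    return sum(ROUNDING_FACTOR * 10.0 ** (-q / 10.0) for q in qvs)


# Assumed derivation: average of 10**(-u/10) for u uniform on [-0.5, 0.5].
derived_factor = (10 ** 0.05 - 10 ** -0.05) / (0.1 * math.log(10))

m = expected_errors([10, 20])
```

The sum is dominated by the lowest-quality calls: in the two-call example, the QV 10 call contributes about ten times as much as the QV 20 call.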

• Accuracy is affected by the mapping parameters used • Increasing number of mismatches allowed will increase number of reads that map and drive up the error rate

• The accuracy after applying 2-Base Encoding (2BE) rules improves significantly over raw color accuracy

50 base read mapped with up to 6 MM (DH10B results)

[Figure: breakdown of color-space (CS) calls. Correct CS calls: 97.60%. Mismatched calls: 2.40%, split among single mismatched calls, invalid adjacent and valid adjacent (segment values 2.17%, 0.14%, 0.09%).]

Accuracy = 99.91%
Raw accuracy (before corrected by 2BE) = 97.6%

[Figure: Base accuracy by position in read. x-axis: base position in read (1 to 49); y-axis: percent error (log scale, 0.01 to 100) with the corresponding QV (10, 20, 30); curves: raw and corrected.]

1000 Genome Project Data Accuracy Comparison
• Using Broad Institute IGV (Integrative Genomics Viewer)
  • NA19240 SOLID YRI Daughter
  • NA NA19240 SLX(Illumina) YRI Daughter
• Install IGV and see the SOLiD and SLX (Illumina) data
  • Download and install from http://www.broadinstitute.org/igv/ (registration required)
  • Open IGV and select Human hg18
  • Go to File --> Load from server (must be connected to the internet)
  • Under the Available Datasets pop-up window, expand 1000 Genomes
  • Expand YRI Trio
  • Select NA19240 SOLID YRI Daughter and NA NA19240 SLX(Illumina) YRI Daughter
  • Pick any chr, move the blue zoom bar (minus/plus sign on top right) all the way to the right (closest to +)
  • Randomly move the solid-red to any region on the chr
  • Count SNP/Error calls by counting the colored bases

In chr2:124,504,606 -124,504,653, a 47 bp region, Illumina has 15 SNP/Error calls,

Wheeler et al PLoS Computational Biol. 2008 Vol 452| 17 April 2008| doi:10.1038/nature06884

• 2-base encoding helps to reduce the coverage needed to detect SNP with high confidence • Heterozygous SNP will need higher coverage, compared to homozygous, to detect both alleles • If the coverage at a heterozygous position is less than 10X, the probability that one of the alleles will not be detected is 1% or more • If the sample preparation method is likely to introduce some bias in allele ratio, coverage should be increased

• Ideally, the coverage would follow a binomial distribution

• Possible reasons for deviations
  • Characteristics of the reference genome (complexity, frequency of repeats, etc.)
  • The samples being sequenced (structural variation)
  • Sample preparation and sequencing chemistry

• Mate-pair data (rather than fragment data) can largely, but not completely, overcome this problem

• Consistent contiguous regions of over/under-coverage may represent copy number variation • Detection of SNPs or InDels in these regions should be treated with caution

Addition table for colors (rows: first color, columns: second color)

⊕ | 0 1 2 3
0 | 0 1 2 3
1 | 1 0 3 2
2 | 2 3 0 1
3 | 3 2 1 0

ref     A G G C A C C    2 0 3 1 1 0
sample  A G G G A C C    2 0 0 2 1 0
3⊕1 = 2, 0⊕2 = 2

ref     A G G C A C C    2 0 3 1 1 0
sample  A G G G C C C    2 0 0 3 0 0
3⊕1⊕1 = 2⊕1 = 3, 0⊕3⊕0 = 3⊕0 = 3

More on Color Consistency
• Isolated color changes do not always correspond to measurement errors
  • e.g. the following reference/read combination results in two single-color mismatches

reference  A G G C A C C    2 0 3 1 1 0
read       A G G G T C C    2 0 0 1 2 0

  • Permitted by addition table because 3⊕1⊕1 = 0⊕1⊕2 = 3

• “Invalid” two position changes do not always correspond to measurement errors
  • e.g. the combination below results in one “forbidden” 2-color change and one isolated single-color change

reference  A A C T T A A    0 1 2 0 3 0
read       A A T G G A A    0 3 1 0 2 0

  • Permitted by addition table because 1⊕2⊕0⊕3 = 3⊕1⊕0⊕2 = 0
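The consistency checks in these examples all follow one rule: combine the colors across the changed window with the addition table, and compare the result for reference and read. With colors labeled 0 to 3, that addition is the same as bitwise XOR. A minimal sketch, with an illustrative function name:

```python
from functools import reduce
from operator import xor


def is_color_consistent(ref_colors: str, read_colors: str) -> bool:
    """True if both color runs combine (XOR) to the same value.

    Equal totals mean the bases flanking the window are unchanged, so the
    difference can be explained by base substitutions inside the window.
    """
    def total(colors: str) -> int:
        return reduce(xor, (int(c) for c in colors), 0)

    return total(ref_colors) == total(read_colors)


snp_like = is_color_consistent("31", "02")      # single-base change
isolated = is_color_consistent("3", "0")        # lone color change
```

A lone color change can never be consistent on its own, which is why an isolated mismatch is usually treated as a measurement error unless a wider window, as in the examples above, restores the balance.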