Barcode Sequence Alignment and Statistical Analysis (Barcas) tool

2016.10.05
Mun, Jihyeob and Kim, Seon-Young
Korea Research Institute of Bioscience and Biotechnology

Barcode-Sequencing

Ø Genome-wide screening method that sequences and counts tens of thousands of individual tags (barcodes), each identifying a gene, under a given condition

Ø Originally developed for yeast deletion libraries, such as those of Saccharomyces cerevisiae and Schizosaccharomyces pombe

Ø Now applied to genome-wide siRNA or shRNA screening to measure the effects of gene knock-down

Ø Also applied, using CRISPR-Cas9, to genome-wide sgRNA screening to measure the effects of gene knock-out

Examples of genome-wide barcode-sequencing libraries

Contents | Organism | # of genes | # of barcodes | References
Yeast deletion consortium | S. cerevisiae | 6,343 | 2 (UP and DN) | www-sequence.Stanford.edu/group/
Bioneer pombe collection | S. pombe | 4,836 | 2 (UP and DN) | http://us.bioneer.com/
MISSION shRNA (human) | H. sapiens | 20,018 | 129,696 shRNA | http://sigmaaldrich.com
MISSION shRNA (mouse) | M. musculus | 21,171 | 118,072 shRNA | http://sigmaaldrich.com
TRC1 (human) shRNA | H. sapiens | 16,019 | 80,717 shRNA | https://portals.broadinstitute.org/gpp/trc1/
TRC1 (mouse) shRNA | M. musculus | 15,960 | 77,819 shRNA | https://portals.broadinstitute.org/gpp/trc1/
Human DECIPHER (shRNA) | H. sapiens | 15,377 | 5+ shRNAs | https://www.cellecta.com
Mouse DECIPHER (shRNA) | M. musculus | 9,145 | 5+ shRNAs | https://www.cellecta.com
Cellecta Genome-wide shRNA | H. sapiens | 19,276 | 8 shRNAs | https://www.cellecta.com
Cellecta Genome-wide CRISPR | H. sapiens | 19,001 | 8 sgRNAs | https://www.cellecta.com
Human GeCKO v2 | H. sapiens | 19,050 | 123,411 sgRNA | https://www.addgene.org/
Mouse GeCKO v2 | M. musculus | 20,611 | 130,209 sgRNA | https://www.addgene.org/
Mouse genome-wide v1 (yusa) | M. musculus | 19,150 | 87,897 sgRNA | https://www.addgene.org/
Oxford fly | Drosophila | 13,501 | 40,279 sgRNA | https://www.addgene.org/
CRISPRa | H. sapiens | 15,977 | 198,810 sgRNA | https://www.addgene.org/
CRISPRi | H. sapiens | 11,219 | 206,421 sgRNA | https://www.addgene.org/

Workflow: barcoded yeast deletion strains

Workflow: genome-wide shRNA screening

Basic format of barcode-seq data

MID (Multiplexing Index, 4-6 bp) | Universal Primer (20-25 bp) | Barcode (20-30 bp)

Steps of barcode-seq data analysis

Pre-processing and QC

Multiplex Index (4-6 bp) | Universal Primer (20 bp) | Barcode (20-30 bp)
Ø Trim index
Ø Trim primer

Map and count each TAG

Tag | sample1 | sample2 | sample3
tag1 | 3400 | 2500 | 2983
tag2 | 120 | 199 | 739
tag3 | 29920 | 3544 | 2232
tag4 | 4300 | 3433 | 3344
......

Normalization
Statistical Analyses
Visualization

Current tools and methods for barcode-seq data analysis

Tool (or method) | Pre-processing | QC | Normalization | Statistical Analysis | Visualization | Software format | Ref.
Barcas | O | O | O | O | O | GUI | Mun 2016 BMC Bioinfo
Barcode Deconvoluter | O | X | X | X | X | Windows or Mac GUI software | www.decipherproject.net/
BiNGS!LS-seq & edgeR | O | O | O | O | X | R package | Kim 2012 Methods Mol Biol
edgeR | O | X | O | O | X | R package | Dai 2014 F1000 Res
HiTSelect | X | X | X | Multi-objective ranking | O | Matlab runtime | Diaz 2015 Nuc Acids Res
MAGeCK | O | O | O | O | X | Python, source code | Li 2014 Genome Bio
MAGeCK-VISPR | O | O | O | Robust rank aggregation | O | Python script | Li 2015 Genome Bio
RIGER | X | X | X | RNAi Gene Enrichment Ranking | O | GENE-E (=> Morpheus) Java GUI | Luo 2008 PNAS
RSA | X | X | X | Iterative hypergeometric P-value | X | Windows GUI (C#), R, | Konig 2007 Nat Methods
ScreenBEAM | X | X | X | Pooled scoring | X | R package | Yu 2015 Bioinformatics
shALIGN & shRNAseq | O | O | O | O | X | Perl and R script | Sims 2011 Genome Bio

Barcas (Barcode sequence Alignment and Statistical Analysis)

- Barcas is an all-in-one program for the analysis of multiplexed barcode sequencing (barcode-seq) data
- Available at http://medical-genome.kribb.re.kr/barseq/

Input: Barcode-seq data
• Genome-wide shRNAs (Cellecta, TRC, Sigma-Aldrich, etc.)
• Genome-wide sgRNAs (Addgene, Cellecta, etc.)
• Barcoded yeast deletion strains: S. cerevisiae or S. pombe

Ø Preprocessing & Mapping
• Filtering, trimming, and mapping with mismatches and indels

Ø Quality Control (of barcodes and samples)
Ø Normalization
Ø Statistical Analysis
• Two-condition comparison, multiple time points

Ø Visualization
• Various graphs and heatmaps

All in one package with user-friendly GUI

Step 1: Pre-processing & Mapping
Step 2: QC of data quality
Step 3: Design experiment
Step 4: Statistical analysis

Step 1: Data preprocessing and mapping

Ø De-multiplexing and trimming (universal primers)
Ø Mapping with imperfect matches (mismatches and indels)
Ø Searching for individual tag sequences
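The de-multiplexing and trimming step can be illustrated with a minimal sketch. It assumes reads laid out as MID + universal primer + barcode and uses exact prefix matching only; the function name, the MID-to-sample dictionary, and the example sequences are hypothetical and are not taken from Barcas, which also tolerates mismatches and indels in this step.

```python
from collections import defaultdict


def demultiplex_and_trim(reads, mids, primer):
    """Group reads by MID and keep only the sequence after the universal primer.

    reads  : iterable of read strings (MID + universal primer + barcode)
    mids   : dict mapping MID sequence -> sample name (hypothetical layout)
    primer : universal primer expected immediately after the MID
    """
    by_sample = defaultdict(list)
    for read in reads:
        for mid, sample in mids.items():
            # MIDs may differ in length, so each one is tested as a prefix.
            if read.startswith(mid) and read.startswith(primer, len(mid)):
                start = len(mid) + len(primer)
                by_sample[sample].append(read[start:])
                break  # a read is assigned to at most one sample
    return dict(by_sample)


# Illustrative sequences only (not from a real library).
mids = {"ACGT": "sample1", "TGCAT": "sample2"}
primer = "GGATCC"
reads = [
    "ACGT" + "GGATCC" + "TCAAAGATAGTC",
    "TGCAT" + "GGATCC" + "CATCGACGAGCT",
    "NNNN" + "GGATCC" + "TCAAAGATAGTC",  # unknown MID: discarded
]
result = demultiplex_and_trim(reads, mids, primer)
```

Reads whose MID or primer does not match exactly are dropped in this sketch, which is the kind of loss that imperfect matching is meant to reduce.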

Step 2: Data quality evaluation

Ø Sequence level: overall sequence quality
Ø Sample level: mapping counts and percentage, etc
Ø Barcode (or tag) level: mapping counts and percentage, etc
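As a rough illustration of the sample-level and barcode-level summaries, the sketch below computes mapped counts and percentages from a count table. The function name and the total read numbers are assumptions made for the example; the tag counts are those shown in the count table earlier in this deck. It is not a description of how Barcas computes its QC metrics.

```python
def mapping_summary(total_reads, counts):
    """Return sample-level and barcode-level (count, percentage) summaries.

    total_reads : dict sample -> number of reads sequenced
    counts      : dict sample -> dict barcode -> mapped read count
    """
    sample_stats = {}
    barcode_totals = {}
    for sample, per_barcode in counts.items():
        mapped = sum(per_barcode.values())
        # Percentage of the sample's reads that mapped to any barcode.
        sample_stats[sample] = (mapped, 100.0 * mapped / total_reads[sample])
        for barcode, n in per_barcode.items():
            barcode_totals[barcode] = barcode_totals.get(barcode, 0) + n
    grand_total = sum(barcode_totals.values())
    # Share of all mapped reads taken by each barcode.
    barcode_stats = {
        b: (n, 100.0 * n / grand_total) for b, n in barcode_totals.items()
    }
    return sample_stats, barcode_stats


counts = {
    "sample1": {"tag1": 3400, "tag2": 120, "tag3": 29920, "tag4": 4300},
    "sample2": {"tag1": 2500, "tag2": 199, "tag3": 3544, "tag4": 3433},
}
total_reads = {"sample1": 40000, "sample2": 12000}  # assumed totals
sample_stats, barcode_stats = mapping_summary(total_reads, counts)
```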

Step 3: Experimental design

Ø Comparison of two conditions
Ø Across several different time points

Step 4: Statistical analysis and Visualization

Ø Calculates z-score and p-value for each barcode
Ø Ranks each barcode by z-score
Ø Plots z-score graph
Ø Plots time dependent intensity heat-map
Ø Allows searching for individual target gene
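One common way to obtain a z-score and p-value per barcode for a two-condition comparison is sketched below. The choices made here (counts-per-million normalization, a pseudocount of 1, log2 ratios standardized across barcodes, a two-sided normal p-value) are assumptions for illustration; the slides do not state which formulas Barcas uses. The counts are the sample1 and sample2 columns of the count table shown earlier, treated here as control and treated samples.

```python
import math
import statistics


def zscores(control, treated, pseudocount=1.0):
    """Return {barcode: (log2 ratio, z-score, two-sided p-value)}."""
    c_total = sum(control.values())
    t_total = sum(treated.values())
    log_ratio = {}
    for b in control:
        # Counts per million, with a pseudocount to avoid log(0).
        c = control[b] / c_total * 1e6 + pseudocount
        t = treated.get(b, 0) / t_total * 1e6 + pseudocount
        log_ratio[b] = math.log2(t / c)
    mean = statistics.mean(log_ratio.values())
    sd = statistics.stdev(log_ratio.values())
    out = {}
    for b, lr in log_ratio.items():
        z = (lr - mean) / sd
        # Two-sided tail probability of a standard normal distribution.
        p = math.erfc(abs(z) / math.sqrt(2))
        out[b] = (lr, z, p)
    return out


control = {"tag1": 3400, "tag2": 120, "tag3": 29920, "tag4": 4300}
treated = {"tag1": 2500, "tag2": 199, "tag3": 3544, "tag4": 3433}
scores = zscores(control, treated)
# Rank barcodes from most depleted to most enriched.
ranked = sorted(scores.items(), key=lambda kv: kv[1][1])
```

With only four tags the p-values are not meaningful; a real screen standardizes across thousands of barcodes.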

Novel functions of Barcas for data pre-processing and QC

Ø Flexible mapping with support for both substitutions and indels

Ø Detection of erroneous barcodes in the library

Ø Checking similarity among barcodes in the library collection

Existing tools for data preprocessing

Name | Mismatches | Shifts of the position | Indel | Backend tool | Ref.
BiNGS!LS-seq | O | X | X | bowtie | Kim (2012) Methods Mol Bio
shALIGN | O | X | X | Perl script (or bowtie) | Sims (2011) Genome Bio
edgeR | O | O | X | edgeR | Dai (2014) F1000Res
Barcas | O | O | O | Trie data structure | Mun (2016) BMC Bioinfo

MID | Universal Primer | Barcode (shRNA)

Original barcode  TCAAAGATAGTCACGCGACCTCATCGACGAGCTACC
Perfect match     TCAAAGATAGTCACGCGACCTCATCGACGAGCTACC
Mismatches        TCAAAGATAGTCACGCGACCTCATCGACGAGCTACC
Position shift    TCAAAGATAGTCACGCGACC-ATCGACGAGCTACC
Indel             TCAAAGATAGTCACGCGACCTCATCGA--AGCTACC

Trie data structure

Ø Data structure based on a prefix tree
Ø Useful for storing a dynamic set or associative array in which the keys are usually strings
Ø More efficient than a hash table (or dictionary) or a list in terms of look-up speed and memory

1:M sequence matching (tree based): maximum time N (N: read count)
1:1 sequence matching (list based): maximum time N * M (N: read count, M: reference count)

1. Data structure of Barcas for mapping

- Based on the trie data structure, Barcas supports imperfect matching, allowing mismatches, base shifting and indels
- Dynamic sequence lengths
- Dynamic start positions
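The idea of imperfect matching on a trie can be sketched as follows. This is a generic prefix-tree search that tolerates a bounded number of substitutions, insertions and deletions; it is not Barcas's implementation, and the class and function names, the second barcode, and the edit limit are assumptions. The first barcode and the two-base deletion read are taken from the alignment example shown earlier.

```python
class TrieNode:
    __slots__ = ("children", "barcode_id")

    def __init__(self):
        self.children = {}      # base -> TrieNode
        self.barcode_id = None  # set on the node that ends a barcode


def build_trie(barcodes):
    """barcodes: dict barcode_id -> sequence."""
    root = TrieNode()
    for bid, seq in barcodes.items():
        node = root
        for base in seq:
            node = node.children.setdefault(base, TrieNode())
        node.barcode_id = bid
    return root


def search(root, read, max_edits=1):
    """Return {barcode_id: fewest edits} for barcodes within max_edits of read."""
    hits = {}

    def walk(node, i, edits):
        if edits > max_edits:
            return
        if i == len(read) and node.barcode_id is not None:
            best = hits.get(node.barcode_id)
            if best is None or edits < best:
                hits[node.barcode_id] = edits
        if i < len(read):
            # Extra base in the read (insertion relative to the barcode).
            walk(node, i + 1, edits + 1)
        for base, child in node.children.items():
            if i < len(read):
                # Match (no cost) or substitution (one edit).
                walk(child, i + 1, edits + (base != read[i]))
            # Base missing from the read (deletion relative to the barcode).
            walk(child, i, edits + 1)

    walk(root, 0, 0)
    return hits


barcodes = {
    "bc1": "TCAAAGATAGTCACGCGACCTCATCGACGAGCTACC",
    "bc2": "GGGTTTCCCAAAGGGTTTCCCAAAGGGTTTCCCAAA",  # hypothetical second barcode
}
root = build_trie(barcodes)
perfect = search(root, "TCAAAGATAGTCACGCGACCTCATCGACGAGCTACC")
mismatch = search(root, "TCAAACATAGTCACGCGACCTCATCGACGAGCTACC")
indel = search(root, "TCAAAGATAGTCACGCGACCTCATCGAAGCTACC", max_edits=2)
```

Because all barcodes share one tree, a read is compared against the whole library in a single traversal rather than once per reference sequence.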

Comparison of speed and mapping rate of Barcas with bowtie and the edgeR package of R

Data
• 215 million reads were mapped to 4,832 heterozygous diploid deletion strains in S. pombe.
• 45-bp sequences were used as the barcode library.

Result
Barcas was 1.7 times faster than bowtie and 13 times faster than edgeR. Owing to indel mapping, Barcas mapped 8-12% more reads than the other two programs.

2. Detection of erroneous barcodes from the genome-wide barcode library

Ø It is tempting to assume that the barcode sequences in the library exactly match the original design

Ø However, errors can creep into the barcodes at many steps, including
• barcode synthesis,
• random mutations during library maintenance,
• erroneous incorporation of barcodes into the genome in the case of yeast strains.

Erroneous barcodes in the yeast library

Eason et al (2004) Characterization of synthetic DNA bar codes in Saccharomyces cerevisiae gene-deletion strains. PNAS 101(30):11046-51
Smith et al (2009) Quantitative phenotyping via deep barcode sequencing. Genome Res 19:1836-42

Measure | U1 | UpTag | U2 | D2 | DnTag | D1
# correct by Smith | 4,242 | 4,369 | 4,045 | 4,207 | 4,320 | 3,867
% correct by Smith | 80.1% | 82.5% | 82.9% | 80.9% | 83.1% | 83.7%
# correct by Eason | 4,185 | 3,764 | 4,057 | 4,343 | 3,807 | 4,095
% correct by Eason | 79.1% | 71.1% | 83.2% | 83.5% | 73.2% | 88.7%
% Agreed | 86% | 84.4% | 89.2% | 92.6% | 85.1% | 92%

A simple method to detect erroneous barcodes

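Measure the gain in read count when mismatched reads are added, i.e. compare the count from perfect matches only (PM) with the count from perfect matches plus mismatches (PM + MM)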
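The gain measure can be written as a small function. This is a sketch: the function names and the cutoff of 2.0 are assumptions, since the slides do not state the threshold Barcas applies. The three TRC entries use PM and MM counts reported in the TRC example of this deck, and the first entry is a hypothetical well-behaved barcode with a dominant perfect match.

```python
def gain(pm_count, mm_count):
    """Ratio of (PM + MM) to PM-only counts; infinite if no perfect match was seen."""
    if pm_count == 0:
        return float("inf")
    return (pm_count + mm_count) / pm_count


def flag_erroneous(counts, threshold=2.0):
    """counts: dict barcode -> (PM count, MM count).

    The threshold is an assumed cutoff, chosen only for illustration.
    """
    return [b for b, (pm, mm) in counts.items() if gain(pm, mm) >= threshold]


counts = {
    "well_behaved": (50000, 65),       # hypothetical: dominant perfect match
    "TRCN0000285144": (10785, 34084),  # PBX2
    "TRCN0000010439": (14, 5935),      # SKI
    "TRCN0000016925": (0, 124),        # KLF13
}
flagged = flag_erroneous(counts)
```

A gain close to 1 means mismatched reads add little, while a large gain suggests that the sequence actually present in the library differs from the designed barcode.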

Dominant Perfect Match with minor Mismatches

Type | Sequence | Counts
Original design | ACTGACTGACTGACTGACTG |
Perfect | ACTGACTGACTGACTGACTG | 50,000
Mismatch 1 | ACTGACTGACTGACTGCCTG | 10
Mismatch 2 | ACTCACTGACTGACTGACTG | 9
Mismatch 3 | ACTGACAGACTGACTGACTG | 20
Mismatch 4 | ACTGACTGACTTACTGACTG | 3
Mismatch 5 | AGTGACTGACTGACTGACTG | 7
Mismatch 6 | ACTGACTGACTGACTGTCTG | 12
Mismatch 7 | ACTGACTGACTAACTGACTG | 5
PM only: 50,000
PM + MM: 50,065
Gain: 50,065/50,000 = 1.0013 (0.13% gain)

One dominant Mismatch with minor Perfect Match and other Mismatches

Type | Sequence | Counts
Original design | ACTGACTGACTGACTGACTG |
Perfect | ACTGACTGACTGACTGACTG | 200
Mismatch 1 | ACTGACTGACTGACTGCCTG | 40,000
Mismatch 2 | ACTCACTGACTGACTGACTG | 11
Mismatch 3 | ACTGACAGACTGACTGACTG | 12
Mismatch 4 | ACTGACTGACTTACTGACTG | 3
Mismatch 5 | AGTGACTGACTGACTGACTG | 12
Mismatch 6 | ACTGACTGACTGACTGTCTG | 9
Mismatch 7 | ACTGACTGACTAACTGACTG | 5
PM only: 200
PM + MM: 40,071
Gain: 40,071/200 = 200.355 (about a 200-fold gain)

Detection of erroneous barcodes

Ø Library: 1,230 shRNA sequences of the TRC library.
Ø Data: Control samples in neuroepithelial (NE), early radial glial (ERG) and mid radial glial (MRG)
Ø We found 25 erroneous barcodes (2.03%).

Ziller, MJ. et al., Nature 2015, 518, 355-9.

Detection of erroneous barcodes (TRC)

Gene | ID | Original sequence | Major mapped (two mismatches/indels) | PM count | MM count
PBX2 | TRCN0000285144 | ATACTCCCACTTGCAACTATT | ATACTCCCACTTGTAACTATT | 10,785 | 34,084
SKI | TRCN0000010439 | GAATCTGCCACTCTCAGAATA | -AATCTGCCACTCTCAGAATA | 14 | 5,935
TERF2IP | TRCN0000010356 | GAGAGTTCTTGCATTGGAACT | -AGAGTTCTTGCATTGGAACT | 4 | 1,244
SKI | TRCN0000010437 | GATCGAAGACCTGCAGGTGAA | -ATCGAAGACCTGCAGGTGAA | 5 | 625
MYC | TRCN0000010390 | GAATGTCAAGAGGCGAACACA | -AATGTCAAGAGGCGAACACA | 3 | 393
JDP2 | TRCN0000019000 | CGGGAGAAGAACAAAGTCGCA | CGGGAGAAGAACAAAAACGCA | 46 | 508
TFAP2B | TRCN0000019659 | CGGTTCTTTCGAGTTTAGTAA | CGGTTCTTTTGAGTTTTGTAA | 87 | 522
NFFKB | TRCN0000014868 | CAGGGAGGTTGCATCATTGTT | CAGGGAGGGTGCATCATTGTT | 98 | 571
KLF13 | TRCN0000016925 | CGGGCGAGAAGAAGTTCAGCT | CGGGCGAGAAGAAGTTCATGGT | 0 | 124

3. Check for sequence similarity among barcodes in a reference

Ø Erroneous barcodes can potentially be generated during the production of many barcodes.

Ø If two barcodes were designed to be similar (e.g. differing by only 1 bp) and mutations or sequencing errors occur, it becomes hard to distinguish errors from true differences.

Ø Thus, barcodes originally designed to be similar should be identified (and flagged) in advance.

Ø For this purpose, Barcas allows checking of sequence similarity among barcode sequences.
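A similarity check of this kind can be sketched with a pairwise comparison. The code below flags barcode pairs that differ by at most one substitution, and optionally by a single inserted or deleted base when their lengths differ by one. It is a simplified illustration: the function names and example barcodes are hypothetical, the quadratic pairwise loop would be slow for libraries with tens of thousands of barcodes, and it does not claim to reproduce the dynamic sequence length comparison used by Barcas.

```python
from itertools import combinations


def within_one_edit(a, b, allow_indels=True):
    """True if a and b differ by at most one substitution (or one indel)."""
    if len(a) == len(b):
        return sum(x != y for x, y in zip(a, b)) <= 1
    if not allow_indels or abs(len(a) - len(b)) != 1:
        return False
    short, long_ = (a, b) if len(a) < len(b) else (b, a)
    # Deleting exactly one base from the longer sequence must give the shorter.
    return any(long_[:i] + long_[i + 1:] == short for i in range(len(long_)))


def similar_pairs(barcodes, allow_indels=True):
    """barcodes: dict barcode_id -> sequence. Returns a list of similar id pairs."""
    return [
        (id1, id2)
        for (id1, s1), (id2, s2) in combinations(barcodes.items(), 2)
        if within_one_edit(s1, s2, allow_indels)
    ]


# Hypothetical barcodes for illustration.
barcodes = {
    "a": "ACTGACTGACTGACTGACTG",
    "b": "ACTGACTGACTGACTGCCTG",  # one substitution relative to "a"
    "c": "ACTGACTGACTGACTGACT",   # one base shorter than "a"
    "d": "TTTTGGGGCCCCAAAATTTT",
}
static_pairs = similar_pairs(barcodes, allow_indels=False)
dynamic_pairs = similar_pairs(barcodes, allow_indels=True)
```

Pairs reported here are candidates to flag before mapping, since a single error in a read could turn one barcode of the pair into the other.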

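Library reference QC

Tested public library sets (11)

Screen | Library | Date | Species | Module | Barcode length | Barcode count | Gene count
shRNA | TRC | 05/Apr/11 | Human | | 21-bp | 61,621 | 15,435
shRNA | Cellecta | 15/Feb/12 | Human | Module1 | 18-bp | 27,500 | 5,046
shRNA | Cellecta | 15/Feb/12 | Human | Module2 | 18-bp | 27,500 | 5,421
shRNA | Cellecta | 15/Feb/12 | Human | Module3 | 18-bp | 27,500 | 4,923
sgRNA | yusa | | Mouse | | 19-bp | 87,437 | 19,149
sgRNA | GeCKOv2 | 09/Mar/15 | Human | Library A | 20-bp | 63,950 | 21,669
sgRNA | GeCKOv2 | 09/Mar/15 | Human | Library B | 20-bp | 56,869 | 19,834
sgRNA | GeCKOv2 | 09/Mar/15 | Mouse | Library A | 20-bp | 65,959 | 22,486
sgRNA | GeCKOv2 | 09/Mar/15 | Mouse | Library B | 20-bp | 61,139 | 21,263
Deletion mutant | Heterozygous diploid strains | | Saccharomyces cerevisiae | | 20-bp | 6,318/UP, 6,126/DN | 6,131
Deletion mutant | Heterozygous diploid strains | | Schizosaccharomyces pombe | | 20-bp | 4,832/UP, 4,832/DN | 4,832

Library reference QC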

Barcode counts having similar pairs within one base

Library | Static sequence length comparison | Dynamic sequence length comparison (indels)
GeCKOv2.Human.A | 517 (0.81%) | 538 (0.84%)
GeCKOv2.Human.B | 437 (0.77%) | 441 (0.78%)
GeCKOv2.Mouse.A | 736 (1.12%) | 755 (1.14%)
GeCKOv2.Mouse.B | 850 (1.39%) | 860 (1.41%)
yusa | 517 (0.59%) | 3,944 (4.51%)
Cellecta.Human.M1 | 0 (0%) | 412 (1.5%)
Cellecta.Human.M2 | 0 (0%) | 398 (1.45%)
Cellecta.Human.M3 | 0 (0%) | 410 (1.49%)
TRC | 790 (1.28%) | 1,909 (3.10%)
S. cerevisiae | 0 (0%) | 0 (0%)
S. pombe | 0 (0%) | 0 (0%)

Conclusions

Ø Barcas is all-in-one software for barcode-seq data analysis, with a user-friendly interface and a few new useful functions for data pre-processing and quality control of the barcode library
Ø Future improvements: support for diverse statistical analyses
• Sophisticated gene-level summary statistics for shRNA and sgRNA
• RSA, RIGER, MAGeCK, HiTSelect, ScreenBEAM, etc
• Multiple-condition comparison (MAGeCK-VISPR)
• Utilization of metadata and gene-set level analysis (HiTSelect)
Ø We hope Barcas will be useful for barcode-seq data analysis to many researchers with minimal bioinformatics skills

Thank you for your attention

Limits of the mapping of edgeR package

1. Indels in the barcode reads are not supported
2. Only shifts of the barcode positions allowed
3. Mismatches in the MID, universal primers not allowed
4. Indels in the MID and universal primers not allowed

Ø Loss of sequences with indels in any of the MID, primers and barcodes
Ø Loss of sequences with mismatches in the MID and primers

Read format: MID | Universal Primer | Barcode (shRNA)

Example 1: TRC Library
Different primer lengths of universal primers: forward 37 bp, reverse 42 bp
Universal Primer (sense) | Barcode (shRNA)
Universal Primer (anti-sense) | Barcode (shRNA)

Example 2: Cellecta library
Different MID lengths: from 9 to 17 bp
MID | Universal Primer | Barcode (shRNA)
