Optimal gRNA Design of Different CRISPR-Cas Systems for DNA and RNA Editing

MIAMI UNIVERSITY The Graduate School

Certificate for Approving the Dissertation

We hereby approve the Dissertation

Houxiang Zhu

Candidate for the Degree

Doctor of Philosophy

______Chun Liang, Director

______Yoshinori Tomoyasu, Reader

______Haifei Shi, Reader

______Meixia Zhao, Reader

______Dhananjai Rao, Graduate School Representative

ABSTRACT

OPTIMAL GRNA DESIGN OF DIFFERENT CRISPR-CAS SYSTEMS FOR DNA AND RNA EDITING

Houxiang Zhu

CRISPR-Cas systems have been successfully applied in DNA and RNA editing, but research, therapeutic, and industrial applications will place high demand on target specificity and efficiency. In this dissertation, we have developed three web services, CT-Finder, CRISPR-DT, and CRISPR-RT, to help scientists design optimal guide RNAs (gRNAs) for different CRISPR-Cas systems with improved target specificity and efficiency. The dissertation is divided into five chapters. Chapter 1 introduces CRISPR-Cas systems. In Chapter 2, we describe CT-Finder, a web service to help users design gRNAs for the wild-type Cas9, Cas9 D10A nickases (Cas9n), and RNA-guided FokI nucleases (RFNs) systems with improved target specificity. CT-Finder supports multiple parameter settings, such as the PAM (protospacer adjacent motif) sequence. Optimal target candidates can be chosen based on the off-target effects. On-target and off-target sites can be visualized in the genome and transcript background. CT-Finder covers major model organisms and can be easily extended to cover other species. Recently, the CRISPR-Cpf1 system has been successfully applied in genome editing with high target specificity. However, target efficiency varies among different gRNA sequences. In Chapter 3, we reanalyzed the published gRNA activity data of the CRISPR-Cpf1 system and found many gRNA sequence and structural features associated with target efficiency. With the aid of Random Forest in feature selection, a support vector machine (SVM) model was built to predict CRISPR-Cpf1 target efficiency for any given gRNA. In addition, we have developed CRISPR-DT, the first web service to help scientists design optimal gRNAs for the CRISPR-Cpf1 system by considering both target efficiency and specificity. More recently, the CRISPR-C2c2 system has been demonstrated as a powerful tool for RNA editing. In Chapter 4, we describe CRISPR-RT, the first web service to help biologists design C2c2 gRNAs with improved target specificity. CRISPR-RT allows users to set up multiple parameters, which makes it highly flexible for current and future research in RNA editing. Optimal gRNAs can be designed based on the target candidate with the fewest off-target effects. CRISPR-RT will empower researchers in CRISPR-based RNA editing. Chapter 5 presents conclusions and future directions.

OPTIMAL GRNA DESIGN OF DIFFERENT CRISPR-CAS SYSTEMS FOR DNA AND RNA EDITING

A DISSERTATION

Presented to the Faculty of

Miami University in partial

fulfillment of the requirements

for the degree of

Doctor of Philosophy

Department of Biology

Houxiang Zhu

The Graduate School Miami University Oxford, Ohio

2019

Dissertation Director: Chun Liang

Houxiang Zhu

2019

TABLE OF CONTENTS

CHAPTER 1. INTRODUCTION
    REFERENCES

CHAPTER 2. CT-FINDER: A WEB SERVICE FOR CRISPR OPTIMAL TARGET PREDICTION AND VISUALIZATION*
    ABSTRACT
    INTRODUCTION
    DESIGN RATIONALE
    GRAPHIC USER INTERFACE
    DISCUSSION
    IMPLEMENTATION
    CONCLUSIONS
    REFERENCES
    LIST OF FIGURES

CHAPTER 3. CRISPR-DT: DESIGNING GRNAS FOR THE CRISPR-CPF1 SYSTEM WITH IMPROVED TARGET EFFICIENCY AND SPECIFICITY*
    ABSTRACT
    INTRODUCTION
    MATERIALS AND METHODS
        Data retrieval and usage
        Computational tools and statistical analysis
        Target efficiency predictive model
        Bioinformatics pipeline for improving target specificity
    RESULTS
        Sequence features of efficient and inefficient gRNAs
        Position-specific nucleotide composition
        Position-nonspecific nucleotide composition
        GC content
        Structural features of efficient and inefficient gRNAs
        Minimum free energy
        Melting temperature
        A SVM model to predict target efficiency
        A web service application for improving target efficiency and specificity
    DISCUSSION
    REFERENCES
    LIST OF TABLES
    LIST OF FIGURES

CHAPTER 4. CRISPR-RT: A WEB APPLICATION FOR DESIGNING CRISPR-C2C2 CRRNA WITH IMPROVED TARGET SPECIFICITY*
    ABSTRACT
    INTRODUCTION
    IMPLEMENTATION
        Graphic Input Interface
        Graphic Output Interface
    METHODS
    REFERENCES
    LIST OF FIGURES

CHAPTER 5. CONCLUSIONS AND FUTURE DIRECTIONS
    REFERENCES

ACKNOWLEDGMENTS

First, I would like to thank my advisor Dr. Chun Liang. Thanks for giving me such a great opportunity to work in the Bioinformatics Lab of Miami University. I feel very lucky and blessed to be a lab member. During the past five years, I have learnt a lot from Dr. Liang, not only on research but also on life. I really appreciate your help, support, and love. I will always remember the great experience in your lab. I also would like to thank my current and previous committee members: Dr. Yoshinori Tomoyasu, Dr. Haifei Shi, Dr. Meixia Zhao, Dr. Dhananjai Rao, Dr. Daniel Gladish, and Dr. Jack Vaughn. Thanks for your strong support and giving me so many valuable comments during my PhD research. I would like to thank my current and previous lab mates: Lin Liu, Sutharzan Sreeskandarajan, Chen Wang, Minghua Li, Cheng Guo, Jieming Shi, Kai Wang, and Abraham Moller. Thanks for your help and support. It has been great working with you in Dr. Liang’s lab. Also, I would like to thank all the friends I met in the USA. Last but most importantly, I would like to thank my parents. Thanks for raising me up and teaching me how to be a better man. No matter what happens, you always help, support, and love me. It is my honor to be your son. I love you forever!

CHAPTER 1. Introduction

Researchers have pursued effective methods for editing DNA and RNA for several decades. In the 1970s, when restriction enzymes were discovered (Smith and Wilcox, 1970; Kelly and Smith, 1970; Danna and Nathans, 1971), scientists began to manipulate DNA in vitro. In the 1980s, studies demonstrated that exogenous DNA can be introduced into the mammalian cell genome by homologous recombination (Smithies et al., 1985; Thomas et al., 1986; Capecchi, 1989). However, the frequency of spontaneous integration with this approach was extremely low (Capecchi, 1989), and random integration was very likely to occur at undesired genomic loci (Lin et al., 1985). In the late 1980s and the 1990s, biologists found that introducing a double-strand break (DSB) at a desired location can greatly increase the frequency of homologous recombination (Rudin et al., 1989; Rouet et al., 1994). From then on, researchers aimed to find a reliable method to generate DSBs at desired genomic loci. The first such method used zinc finger nucleases (ZFNs), which consist of zinc finger proteins and the DNA cleavage domain of the FokI endonuclease (Kim et al., 1996). Each zinc finger protein can recognize a 3-bp DNA sequence (Klug and Rhodes, 1987). Higher target specificity can be achieved by assembling multiple zinc finger proteins to target a longer genomic sequence. Since FokI endonucleases must dimerize to cleave double-stranded DNA, two sets of zinc finger modules targeting proximal sites need to be designed for DNA cleavage. Research showed that ZFNs significantly increased homologous recombination in different organisms (Bibikova et al., 2001; Porteus and Baltimore, 2003). However, ZFNs have some limitations, such as slow recognition of target sites. 
A better method, TALENs (transcription activator-like effector nucleases), then emerged. TALENs contain transcription activator-like effector (TALE) proteins from the plant pathogen Xanthomonas and the DNA cleavage domain of the FokI endonuclease (Christian et al., 2010; Zhang et al., 2011; Miller et al., 2011; Li et al., 2011). Since each TALE protein can recognize one nucleotide, multiple TALE proteins need to be assembled to target a long DNA sequence. Like ZFNs, TALENs require two sets of TALE modules for double-strand DNA cleavage. Although ZFNs and TALENs greatly increased the frequency of homologous recombination, both systems require a new set of proteins to be designed for each genomic locus, and this difficulty of protein engineering has kept ZFNs and TALENs from being broadly adopted by scientists.

Recently, the CRISPR (clustered regularly interspaced short palindromic repeats)-Cas (CRISPR-associated) system has emerged as a powerful DNA and RNA manipulation tool. In the late 1980s and early 1990s, the CRISPR structure was first noticed in Escherichia coli (Ishino et al., 1987) and Haloferax mediterranei (Mojica et al., 1993), and it was predicted to have an important function in prokaryotes (Mojica et al., 1995). By 2000, researchers had found similar structures in 20 different microbes, including Yersinia pestis and Mycobacterium tuberculosis (Mojica et al., 2000). In 2002, the Cas genes were found in the region adjacent to the CRISPR structure (Jansen et al., 2002). At that time, the function of the CRISPR-Cas system in prokaryotes was still unknown. Three years later, researchers reported that the CRISPR-Cas system is an adaptive immune system in prokaryotes (Mojica et al., 2005; Pourcel et al., 2005; Bolotin et al., 2005). In 2007, when scientists used a phage-sensitive Streptococcus thermophilus strain to isolate phage-resistant bacteria, they found that the resistant strains resist phages by inserting phage-derived sequences into the CRISPR locus rather than by generating resistance mutations (Barrangou et al., 2007). At the same time, the role of the Cas9 gene was also studied. The Cas9 gene sequence contains two types of nuclease motifs (RuvC and HNH), and the Cas9 protein is an important component of the immune system (Barrangou et al., 2007). In 2008, researchers found that CRISPR RNAs (crRNAs), which are generated by cleaving the pre-crRNA transcribed from the CRISPR array, are required by the adaptive immune system of prokaryotes (Brouns et al., 2008). Most crRNA sequences consist of the last eight bases of the previous repeat sequence, the complete spacer sequence derived from foreign genetic elements, and the beginning of the next repeat sequence (Brouns et al., 2008). In 2010, a study of plasmid interference in S. thermophilus demonstrated that crRNA can guide the Cas9 protein to a specific genomic location; Cas9 then cuts the DNA to create a blunt-ended DSB three nucleotides upstream of the protospacer adjacent motif (PAM), which demonstrates Cas9’s nuclease activity (Garneau et al., 2010). Meanwhile, researchers using massively parallel sequencing to identify microbial RNAs found a novel RNA that was transcribed from a DNA region adjacent to the CRISPR structure (Sharma et al., 2010). Part of the sequence of this RNA is almost perfectly complementary to the CRISPR repeat (Sharma et al., 2010). Research showed that the novel RNA, later known as trans-activating crRNA (tracrRNA), first hybridizes with the pre-crRNA; the complex is then processed by RNase III to generate mature crRNAs (Deltcheva et al., 2011). Later studies indicated that tracrRNA is also important for Cas9 cleavage (Jinek et al., 2012; Siksnys et al., 2012). In 2011, researchers transferred the entire CRISPR locus from S. thermophilus to a different bacterial species, E. coli; they found that E. coli carrying the transferred CRISPR locus was able to resist both bacteriophage DNA and plasmids (Sapranauskas et al., 2011), which indicated that the crRNA, tracrRNA, and Cas9 protein are sufficient components of the CRISPR-Cas9 immune system in prokaryotes. In addition, a study showed that the purified Cas9-crRNA-tracrRNA complex from S. thermophilus can cut a DNA target in vitro (Gasiunas et al., 2012). The HNH and RuvC nuclease domains of Cas9 are responsible for the double-strand DNA cleavage (Gasiunas et al., 2012; Jinek et al., 2012). Moreover, specific crRNAs can be designed to target specific DNA sequences in vitro (Gasiunas et al., 2012; Jinek et al., 2012). The crRNA can be trimmed to 20 nucleotides and still retain similar target efficiency (Gasiunas et al., 2012). Additionally, research showed that the crRNA and tracrRNA can be fused into a guide RNA (gRNA), which still functions well in vitro (Jinek et al., 2012). Although this fusion worked poorly in vivo, the problem can be solved by using a full-length fusion to restore the critical 3’ hairpin structure (Cong et al., 2013). In 2013, the CRISPR-Cas9 system was first applied in mammalian genome editing (Cong et al., 2013). Within one year, CRISPR-Cas systems had been applied to many species, including yeasts, fruit flies, and mice.

Generally, CRISPR-Cas systems are classified into two classes. Class 1 systems, which include type I, III, and IV CRISPR systems, need a complex of multiple Cas proteins for nucleic acid cleavage, while class 2 systems, which include type II, V, and VI CRISPR systems, require only a single Cas protein to degrade nucleic acids (Wright et al., 2016). Among all these CRISPR systems, the type II CRISPR-Cas9 system from Streptococcus pyogenes has been most widely used for genome editing. 
SpCas9 recognizes a simple PAM sequence (NGG) in the genome and is guided by a gRNA to a specific genomic locus, where it creates a DSB. For therapeutic purposes, researchers identified some smaller Cas9 variants, such as Staphylococcus aureus Cas9 (SaCas9) (Ran et al., 2015; Friedland et al., 2015) and Campylobacter jejuni Cas9 (CjCas9) (Kim et al., 2017a), which are easier to deliver into different cell types. Since SaCas9 recognizes an NNGRRT PAM sequence (Ran et al., 2015; Friedland et al., 2015; Mojica et al., 2009) and CjCas9 needs an NNNNACAC PAM sequence (Kim et al., 2017a), both of them are less flexible for genome targeting than SpCas9. Researchers also found that SpCas9 can be engineered to recognize different PAM sequences (Kleinstiver et al., 2015), which broadens the targeting scope.

Since the CRISPR-Cas9 system can tolerate 1 to 5 mismatches or gaps between the gRNA and the target sequence, especially when the mismatches are in the non-seed (PAM-distal) region, CRISPR-Cas9 editing may cause off-target effects across the whole genome. Researchers used engineered variants of the wild-type Cas9 to improve target specificity, such as the Cas9 D10A nickases (Cas9n) (Ran et al., 2013) and RNA-guided FokI nucleases (RFNs) (Tsai et al., 2014). In Cas9n, since the D10A mutation renders Cas9 able to cut only one strand of DNA, a pair of gRNAs is needed to guide two Cas9n molecules to create a DSB. Compared with the wild-type Cas9, Cas9n can improve target specificity by up to 1,500-fold (Ran et al., 2013). The distance between the two gRNAs of Cas9n ranges from 0 to 1000 bp with PAM-in or PAM-out orientation (Ran et al., 2013; Mali et al., 2013; Cho et al., 2014). RFNs are also a paired-gRNA CRISPR system. However, in RFNs, Cas9 is a fully inactive mutant (called dCas9). Each dCas9 is fused with a FokI nuclease. Two different gRNAs guide two FokI-dCas9 fusion proteins to adjacent target sites to cause FokI dimerization and DNA cleavage. Since RFNs are a true dimeric system rather than just co-localization like Cas9n, RFNs introduce fewer unwanted indel mutations than single Cas9n (Tsai et al., 2014). The distance between the two gRNAs of RFNs ranges from 14 to 17 bp with a specific PAM-out orientation (Tsai et al., 2014).

Recently, the CRISPR-Cpf1 (also known as CRISPR-Cas12a) system has been successfully applied for genome editing and is highly specific in mammals and plants (Kim et al., 2016; Tang et al., 2017; Xu et al., 2017; Hu et al., 2017; Kim et al., 2017b). The CRISPR-Cpf1 system is a class 2 type V system. Cpf1 recognizes a T-rich PAM sequence, which is at the 5’ end of the target sequence (Zetsche et al., 2015). The gRNA of Cpf1 includes only the crRNA and no tracrRNA. Additionally, Cpf1 can generate a staggered cut at the target DNA site (Zetsche et al., 2015).

In 2016, the CRISPR-C2c2 (also known as CRISPR-Cas13a) system was reported as a tool for RNA targeting (Abudayyeh et al., 2016). CRISPR-C2c2 is a class 2 type VI system. C2c2 recognizes a protospacer flanking site (PFS) of H (A, U or C) at the 3’ end of the target RNA. The C2c2 gRNA contains only the crRNA and no tracrRNA. C2c2 cleaves the target RNA at the uracil homopolymer loop (Abudayyeh et al., 2016). The CRISPR-C2c2 system has already been successfully used for specific RNA knockdown in E. coli (Abudayyeh et al., 2016) and RNA detection in human total RNAs (East-Seletsky et al., 2016). The inactive C2c2 (dC2c2) also has several potential applications as an RNA-binding protein, such as bringing effectors to specific RNAs to regulate their translation (Abudayyeh et al., 2016).
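The systems described above differ both in the motif they require and in which side of the target it flanks (3’ for the Cas9 PAM and the C2c2 PFS, 5’ for the Cpf1 PAM). The following Python sketch illustrates that idea under stated assumptions; it is not the code of any tool in this dissertation. The function names, the forward-strand-only scan, and the `TTTN` motif used for Cpf1 in the usage note (the text above says only "T-rich") are all hypothetical choices made for illustration.

```python
# Subset of IUPAC nucleotide codes needed for the motifs discussed above.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def matches(motif: str, seq: str) -> bool:
    """Return True if seq is one concrete instance of the IUPAC motif.

    RNA input is accepted by treating U as T.
    """
    seq = seq.upper().replace("U", "T")
    return len(seq) == len(motif) and all(
        base in IUPAC[code] for code, base in zip(motif, seq)
    )


def find_sites(seq: str, motif: str, target_len: int, motif_side: str):
    """Yield (start, target) for forward-strand targets flanked by the motif.

    motif_side="3prime": target followed by motif (Cas9 PAM, C2c2 PFS).
    motif_side="5prime": motif followed by target (Cpf1 PAM).
    """
    if motif_side not in ("3prime", "5prime"):
        raise ValueError("motif_side must be '3prime' or '5prime'")
    seq = seq.upper().replace("U", "T")
    m = len(motif)
    for i in range(len(seq) - target_len - m + 1):
        if motif_side == "3prime":
            start = i
            flank = seq[i + target_len : i + target_len + m]
        else:
            start = i + m
            flank = seq[i : i + m]
        if matches(motif, flank):
            yield start, seq[start : start + target_len]
```

For example, `list(find_sites("ACGTACGTACGTACGTACGTAGGTTT", "NGG", 20, "3prime"))` returns the single 20-nt target starting at position 0, and `matches("NNGRRT", "AAGAGT")` is true for the SaCas9 motif. A real design tool would also scan the reverse strand and use system-specific target lengths.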

Although CRISPR-Cas systems have been widely applied for DNA and RNA editing, target specificity and efficiency are still major concerns for researchers. Target specificity mainly depends on the PAM sequence, the length of the gRNA sequence, and the mismatches or gaps tolerated between the gRNA and the target sequence. Target efficiency is mainly related to the gRNA sequence and structural features. In this dissertation, we have developed three web services, CT-Finder, CRISPR-DT, and CRISPR-RT, to help users design optimal gRNAs by considering both target specificity and efficiency.
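To make the role of mismatches concrete, the sketch below counts mismatches between a gRNA and a candidate site separately for the seed (PAM-proximal) and non-seed (PAM-distal) regions. It is only an illustration: the function name is hypothetical, the 12-nt seed length is an assumed default rather than a value taken from this chapter, the PAM is assumed to lie at the 3’ end of the protospacer (the Cas9 convention), and gaps (insertions or deletions) are not handled.

```python
def count_mismatches(grna: str, site: str, seed_len: int = 12):
    """Return (seed_mismatches, non_seed_mismatches).

    Both sequences are protospacers of equal length written 5'->3' without
    the PAM. With the PAM at the 3' end, the seed region is the last
    `seed_len` bases and the non-seed region is everything before it.
    """
    if len(grna) != len(site):
        raise ValueError("gRNA and site must have the same length (no gaps)")
    if not 0 <= seed_len <= len(grna):
        raise ValueError("seed_len must be between 0 and the gRNA length")
    grna, site = grna.upper(), site.upper()
    split = len(grna) - seed_len  # index where the seed region begins
    non_seed = sum(a != b for a, b in zip(grna[:split], site[:split]))
    seed = sum(a != b for a, b in zip(grna[split:], site[split:]))
    return seed, non_seed
```

A site that differs from the gRNA only at its first (PAM-distal) and last (PAM-proximal) base would be reported as one seed and one non-seed mismatch. An off-target filter could then apply different thresholds to the two counts.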

References

Abudayyeh,O.O. et al. (2016) C2c2 is a single-component programmable RNA-guided RNA-targeting CRISPR effector. Science, 353, aaf5573.
Barrangou,R. et al. (2007) CRISPR Provides Acquired Resistance Against Viruses in Prokaryotes. Science, 315, 1709–1712.
Bibikova,M. et al. (2001) Stimulation of Homologous Recombination through Targeted Cleavage by Chimeric Nucleases. Molecular and Cellular Biology, 21, 289–297.
Bolotin,A. et al. (2005) Clustered regularly interspaced short palindrome repeats (CRISPRs) have spacers of extrachromosomal origin. Microbiology, 151, 2551–2561.
Brouns,S.J.J. et al. (2008) Small CRISPR RNAs Guide Antiviral Defense in Prokaryotes. Science, 321, 960–964.
Capecchi,M.R. (1989) Altering the genome by homologous recombination. Science, 244, 1288–1292.
Cho,S.W. et al. (2014) Analysis of off-target effects of CRISPR/Cas-derived RNA-guided endonucleases and nickases. Genome Res., 24, 132–141.
Christian,M. et al. (2010) Targeting DNA Double-Strand Breaks with TAL Effector Nucleases. Genetics, 186, 757–761.
Cong,L. et al. (2013) Multiplex Genome Engineering Using CRISPR/Cas Systems. Science, 339, 819–823.
Danna,K. and Nathans,D. (1971) Specific cleavage of simian virus 40 DNA by restriction endonuclease of Hemophilus influenzae. PNAS, 68, 2913–2917.
Deltcheva,E. et al. (2011) CRISPR RNA maturation by trans-encoded small RNA and host factor RNase III. Nature, 471, 602–607.
East-Seletsky,A. et al. (2016) Two distinct RNase activities of CRISPR-C2c2 enable guide-RNA processing and RNA detection. Nature, 538, 270–273.
Friedland,A.E. et al. (2015) Characterization of Staphylococcus aureus Cas9: a smaller Cas9 for all-in-one adeno-associated virus delivery and paired nickase applications. Genome Biology, 16, 257.
Garneau,J.E. et al. (2010) The CRISPR/Cas bacterial immune system cleaves bacteriophage and plasmid DNA. Nature, 468, 67–71.
Gasiunas,G. et al. (2012) Cas9–crRNA ribonucleoprotein complex mediates specific DNA cleavage for adaptive immunity in bacteria. PNAS, 109, E2579–E2586.
Hu,X. et al. (2017) Targeted mutagenesis in rice using CRISPR-Cpf1 system. J Genet Genomics, 44, 71–73.
Ishino,Y. et al. (1987) Nucleotide sequence of the iap gene, responsible for alkaline phosphatase isozyme conversion in Escherichia coli, and identification of the gene product. Journal of Bacteriology, 169, 5429–5433.
Jansen,R. et al. (2002) Identification of genes that are associated with DNA repeats in prokaryotes. Molecular Microbiology, 43, 1565–1575.
Jinek,M. et al. (2012) A Programmable Dual-RNA–Guided DNA Endonuclease in Adaptive Bacterial Immunity. Science, 337, 816–821.
Kelly,T.J. and Smith,H.O. (1970) A restriction enzyme from Hemophilus influenzae: II. Base sequence of the recognition site. Journal of Molecular Biology, 51, 393–409.
Kim,D. et al. (2016) Genome-wide analysis reveals specificities of Cpf1 endonucleases in human cells. Nature Biotechnology, 34, 863–868.
Kim,E. et al. (2017a) In vivo genome editing with a small Cas9 orthologue derived from Campylobacter jejuni. Nature Communications, 8, 14500.
Kim,H. et al. (2017b) CRISPR/Cpf1-mediated DNA-free plant genome editing. Nature Communications, 8, 14406.
Kim,Y.G. et al. (1996) Hybrid restriction enzymes: zinc finger fusions to Fok I cleavage domain. PNAS, 93, 1156–1160.
Kleinstiver,B.P. et al. (2015) Engineered CRISPR-Cas9 nucleases with altered PAM specificities. Nature, 523, 481–485.
Klug,A. and Rhodes,D. (1987) Zinc Fingers: A Novel Protein Fold for Nucleic Acid Recognition. Cold Spring Harb Symp Quant Biol, 52, 473–482.
Li,T. et al. (2011) Modularly assembled designer TAL effector nucleases for targeted gene knockout and gene replacement in eukaryotes. Nucleic Acids Res, 39, 6315–6325.
Lin,F.L. et al. (1985) Recombination in mouse L cells between DNA introduced into cells and homologous chromosomal sequences. PNAS, 82, 1391–1395.
Mali,P. et al. (2013) CAS9 transcriptional activators for target specificity screening and paired nickases for cooperative genome engineering. Nature Biotechnology, 31, 833–838.
Miller,J.C. et al. (2011) A TALE nuclease architecture for efficient genome editing. Nature Biotechnology, 29, 143–148.
Mojica,F.J.M. et al. (2000) Biological significance of a family of regularly spaced repeats in the genomes of Archaea, Bacteria and mitochondria. Molecular Microbiology, 36, 244–246.
Mojica,F.J.M. et al. (2005) Intervening Sequences of Regularly Spaced Prokaryotic Repeats Derive from Foreign Genetic Elements. J Mol Evol, 60, 174–182.
Mojica,F.J.M. et al. (1995) Long stretches of short tandem repeats are present in the largest replicons of the Archaea Haloferax mediterranei and Haloferax volcanii and could be involved in replicon partitioning. Molecular Microbiology, 17, 85–93.
Mojica,F.J.M. et al. (2009) Short motif sequences determine the targets of the prokaryotic CRISPR defence system. Microbiology, 155, 733–740.
Mojica,F.J.M. et al. (1993) Transcription at different salinities of Haloferax mediterranei sequences adjacent to partially modified PstI sites. Molecular Microbiology, 9, 613–621.
Porteus,M.H. and Baltimore,D. (2003) Chimeric Nucleases Stimulate Gene Targeting in Human Cells. Science, 300, 763–763.
Pourcel,C. et al. (2005) CRISPR elements in Yersinia pestis acquire new repeats by preferential uptake of bacteriophage DNA, and provide additional tools for evolutionary studies. Microbiology, 151, 653–663.
Ran,F.A. et al. (2013) Double Nicking by RNA-Guided CRISPR Cas9 for Enhanced Genome Editing Specificity. Cell, 154, 1380–1389.
Ran,F.A. et al. (2015) In vivo genome editing using Staphylococcus aureus Cas9. Nature, 520, 186–191.
Rouet,P. et al. (1994) Introduction of double-strand breaks into the genome of mouse cells by expression of a rare-cutting endonuclease. Molecular and Cellular Biology, 14, 8096–8106.
Rudin,N. et al. (1989) Genetic and physical analysis of double-strand break repair and recombination in Saccharomyces cerevisiae. Genetics, 122, 519–534.
Sapranauskas,R. et al. (2011) The Streptococcus thermophilus CRISPR/Cas system provides immunity in Escherichia coli. Nucleic Acids Res, 39, 9275–9282.
Sharma,C.M. et al. (2010) The primary transcriptome of the major human pathogen Helicobacter pylori. Nature, 464, 250–255.
Siksnys,V., Gasiunas,G. and Karvelis,T. (2012) RNA-directed DNA cleavage by the Cas9-crRNA complex from CRISPR3/Cas immune system of Streptococcus thermophilus. U.S. Provisional Patent Application 61/613,373, filed March 20, 2012; later published as US2015/0045546 (pending).
Smith,H.O. and Wilcox,K.W. (1970) A Restriction enzyme from Hemophilus influenzae: I. Purification and general properties. Journal of Molecular Biology, 51, 379–391.
Smithies,O. et al. (1985) Insertion of DNA sequences into the human chromosomal β-globin locus by homologous recombination. Nature, 317, 230.
Tang,X. et al. (2017) A CRISPR–Cpf1 system for efficient genome editing and transcriptional repression in plants. Nature Plants, 3, 17018.
Thomas,K.R. et al. (1986) High frequency targeting of genes to specific sites in the mammalian genome. Cell, 44, 419–428.
Tsai,S.Q. et al. (2014) Dimeric CRISPR RNA-guided FokI nucleases for highly specific genome editing. Nature Biotechnology, 32, 569–576.
Wright,A.V. et al. (2016) Biology and Applications of CRISPR Systems: Harnessing Nature’s Toolbox for Genome Engineering. Cell, 164, 29–44.
Xu,R. et al. (2017) Generation of targeted mutant rice using a CRISPR-Cpf1 system. Plant Biotechnology Journal, 15, 713–717.
Zetsche,B. et al. (2015) Cpf1 Is a Single RNA-Guided Endonuclease of a Class 2 CRISPR-Cas System. Cell, 163, 759–771.
Zhang,F. et al. (2011) Efficient construction of sequence-specific TAL effectors for modulating mammalian transcription. Nature Biotechnology, 29, 149–153.

CHAPTER 2. CT-Finder: A Web Service for CRISPR Optimal Target Prediction and Visualization*

Abstract

The CRISPR system holds much promise for successful genome engineering, but therapeutic, industrial, and research applications will place high demand on improving the specificity and efficiency of this tool. CT-Finder (http://bioinfolab.miamioh.edu/ct-finder) is a web service to help users design guide RNAs (gRNAs) optimized for specificity. CT-Finder accommodates the original single-gRNA Cas9 system and two specificity-enhancing paired-gRNA systems: Cas9 D10A nickases (Cas9n) and dimeric RNA-guided FokI nucleases (RFNs). Optimal target candidates can be chosen based on the minimization of predicted off-target effects. Graphical visualization of on-target and off-target sites in the genome is provided for target validation. Major model organisms are covered by this web service.

* This chapter is a pre-copyedited, author-produced version of an article accepted for publication in Scientific Reports following peer review. The version of record, Zhu,H. et al. (2016) CT-Finder: A Web Service for CRISPR Optimal Target Prediction and Visualization. Scientific Reports, 6, 25516, is available online at: https://doi.org/10.1038/srep25516. This work is licensed under a Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).

Introduction

Cas nucleases, derived from bacterial adaptive immune systems, can cause double-stranded breaks at targeted locations in genomes. These double-stranded breaks may be repaired either through non-homologous end joining, causing insertions or deletions (indels) that are expected to result in gene knockout, or through homology-directed repair, in which a repair template is provided for insertion at the location of the break (Sander and Joung, 2014). Unlike transcription activator-like effector nucleases (TALENs) and zinc finger nucleases (ZFNs), which are tools for similarly sophisticated genomic editing, the CRISPR/Cas system is highly adaptable and easily programmed. This is because CRISPR/Cas eliminates the need for protein engineering and requires only the design of a simple and short (~20 nt) guide RNA (gRNA) that base pairs with the intended on-target genomic site (Gupta and Musunuru, 2014). Another advantage of CRISPR/Cas is its ability to create multiple simultaneous mutations, as demonstrated in mouse embryonic stem cells (Wang et al., 2013).

Unfortunately, the original Cas9 system raises issues of specificity due to the shortness of the gRNA sequence. Even with a gRNA sequence that is entirely unique within the genome of interest, Cas nucleases may still cleave at unintended off-target sites that bear very high similarity to the target DNA sequence. Off-target site sequences resemble the target site sequences except for the presence of indels, mismatches, or a combination of both. Off-target cleavage is especially concerning for potential future therapeutic applications of genome editing; it is crucial to reduce the possibility of off-target cleavage to the point of virtual elimination before performing genomic editing on human subjects. 
As an improvement over the Cas9 system, the Cas9 D10A nickases (Cas9n) system operates on the biological principle that two single-stranded breaks in close proximity are equivalent to a double-stranded break. Two nickases are equipped with non-identical gRNAs designed to match two proximal DNA targets. The combination of two single-stranded, approximately simultaneous breaks achieves cleavage activity at the target site by an unknown mechanism (Ran et al., 2013). This strategy has been demonstrated to reduce off-target effects by 50- to 1,500-fold as compared to Cas9 (Ran et al., 2013). Although single-stranded breaks at off-target sites are normally repaired with high fidelity via the base excision repair (BER) pathway (Dianov and Hübscher, 2013), concerns have arisen that these nickases could individually introduce indel mutations at off-target sites with high efficiency (Wang et al., 2014). Meanwhile, RNA-guided FokI nucleases (RFNs) extend the concept of paired-gRNA systems to the level of a true dimeric system. RFNs, like the Cas9n system, use two gRNAs. Unlike Cas9n, which can cause single-stranded breaks independently, RFNs strictly require that both gRNAs be present as a pair with an appropriate spacing between them. The spacer length between two gRNAs in the RFNs system is 14-17 nt (Tsai et al., 2014). The stringency of the requirements for RFNs results in acutely specific targeting, essentially eliminating Cas9-induced off-target mutagenesis, but comes at the disadvantage of reduced target space and thus narrower applicability. Both Cas9n and RFNs systems utilize paired gRNAs and claim high efficiency of gene editing due to the effective doubling of the number of base pairs matched to the target sites, but have a more limited set of possible target candidates as compared to Cas9.

So far, a few web services, including CRISPR DESIGN (Hsu et al., 2013), E-CRISP (Heigwer et al., 2014), GT-Scan (O’Brien and Bailey, 2014) and CRISPRdirect (Naito et al., 2015), are available to help biologists design gRNAs and determine target sites for CRISPR systems. To the best of our knowledge, there is no web service designed to accommodate three different types (Cas9, Cas9n and RFNs) of genome editing CRISPR systems. We have designed and developed CRISPR Target Finder (CT-Finder) to cover Cas9, Cas9n and RFNs systems and aid researchers in maximizing target specificity in the application of CRISPR technologies. For each target candidate in any of the three CRISPR systems, CT-Finder is able to predict off-target sites in the genomic region. This allows the optimal target candidates to be chosen based on the minimization of predicted off-target effects. 
Unlike other online tools for CRISPR systems, CT-Finder incorporates a high degree of flexibility while maintaining a simple graphical interface that accommodates user inputs for many of the most important features, such as the gRNA length and protospacer adjacent motif (PAM) sequence. Compared with other tools, CT-Finder is more comprehensive and precise in its determination of possible off-target sites, as it considers mismatches, insertions, and deletions separately for both the seed and non-seed regions. Additionally, CT-Finder is able to provide users with more information (e.g., GC content of the entire target sequence excluding the PAM sequence, and GC content of the 6 nucleotides closest to the PAM sequence, which has been found to be a significant predictor of gRNA efficiency). CT-Finder also introduces, for the first time, a smooth, scrollable genome browser for visualizing on-target and off-target sites in the genomic context, allowing users to validate data and consider gene annotation features when choosing proper targets.

Design Rationale

The Cas nucleases of CRISPR systems consist of an RNA-binding domain, an α-helical recognition lobe, a nuclease lobe, and a PAM-interacting site (van der Oost et al., 2014). The nuclease lobe in Cas9 contains two DNA-cleaving nuclease domains, RuvC and HNH, each of which cleaves one strand of the targeted DNA (van der Oost et al., 2014). Together, these two domains cause a double-stranded break. Cas9n is modified with a mutation in one of these domains, which in turn causes only single-stranded breaks (Ran et al., 2013). The PAM, recognized by the PAM-interacting site of Cas, is critical to binding of Cas to the target site. The PAM sequence is not a part of the gRNA, but must be present immediately at the 3’ end of the targeted DNA site (van der Oost et al., 2014). Mismatches in the PAM sequence are so poorly tolerated that they are typically expected to eliminate detectable target activity by Cas; in fact, base pair mutations in the PAM sequence are a viral defense against microbial CRISPR systems (Fineran et al., 2014). In the Streptococcus pyogenes Cas9 (SpCas9) system, the default on-target PAM sequence is NGG and the default off-target PAM sequence is NAG. The off-target PAM sequence is not efficient enough to be considered for on-target sites, but it retains enough activity to be considered for off-target sites. Therefore, flexibility in the choice of on-target and off-target PAM sequences is incorporated into CT-Finder.

CT-Finder is designed to be unrestricted by the gRNA length. Although many current online tools for choosing gRNAs rely on a fixed 20-nt gRNA length, a range of gRNA lengths is known to be effective for CRISPR/Cas. The natural length of the SpCas9 gRNA is 20 nt, but effective lengths range from 17 to 20 nt. The length of gRNAs also varies across different Cas9 systems; for example, Staphylococcus aureus Cas9 (SaCas9) in mammalian cells is reported to exhibit the greatest editing efficiency with a gRNA length of 21 to 23 nt (Ran et al., 2015). Length does appear to be a determinant of cleavage efficiency and potentially also of off-target cleavage risk. Truncated gRNAs 17 or 18 nt in length, for example, have been demonstrated in SpCas9 to significantly reduce off-target mutagenesis in mammalian cells while retaining cleavage efficiency comparable to that of the canonical 20-nt gRNA (Fu et al., 2014). Therefore, the flexibility in gRNA length offered by CT-Finder enables more applications in genome editing and addresses potential species-specific needs.

The total number of mismatched base pairs is a key factor for cleavage efficiency of SpCas9 CRISPR systems (Hsu et al., 2013). As a general rule, mismatches at the 3’ end of the gRNA, proximal to the PAM sequence, are less tolerated than mismatches closer to the 5’ end of the gRNA (Lin et al., 2014). In SpCas9, 2 concatenated or interspersed mismatches considerably reduce cleavage efficiency; 3 concatenated mismatches cause reduction beyond that of 2 mismatches; and 3 or more interspersed mismatches or 5 concatenated mismatches result in elimination of detectable SpCas9 activity in the vast majority of tested genomic loci (Hsu et al., 2013). Insertions and deletions are somewhat more complex. Cas9 tolerates single gRNA bulges up to 4 nt in length (Lin et al., 2014). Cas9n has been found to tolerate both single gRNA and DNA bulges in one of the guide strands, when one bulge-forming gRNA is utilized with a perfectly matched gRNA (Lin et al., 2014). Accordingly, CT-Finder takes into account insertions, deletions, and mismatches separately in the seed and non-seed regions, and also allows users to set the maximum number of strands (i.e., 0, 1, or 2) that can tolerate bulges in Cas9n and RFNs systems.

Even when comparing single gRNAs targeting the same gene, substantial variance in efficacy exists, indicating the importance of the choice of sequences (Wang et al., 2014). CT-Finder includes GC-content calculations as a measure of target candidate efficiency. In mammalian cell lines, 20-nt single gRNAs with medium GC content are more efficient than those with low or high GC content (Ren et al., 2014).
CT-Finder also includes a measure of the GC content of the 6 nucleotides proximal to the PAM sequence in each target candidate, as a study in Drosophila found a strong positive correlation between efficiency and the GC content of the 6 nucleotides of the single gRNA proximal to the PAM sequence (Ren et al., 2014). Experimental evidence indicates that it is critical for these 6 nucleotides to have a GC content of >50%, i.e., 4 or more of them are G or C, but no further increase in efficiency is seen past this threshold (Ren et al., 2014). This effect on efficiency is so significant that simultaneous mutation of multiple genes is achievable with single gRNAs of high GC content in the 6 nucleotides adjacent to the PAM sequence; 4 genes have been mutated in one step experimentally in a fly model (Ren et al., 2014). In CT-Finder, the GC contents of both the 6 nucleotides proximal to the PAM and the entire gRNA are provided for users to consider when choosing the most efficient target sites.

Among the existing web services for CRISPR systems, none seamlessly integrates visualization of the genomic positions of on-target and off-target sites, along with associated gene structures, within a genome browser. In CT-Finder, we integrated JBrowse (Skinner et al., 2009), a JavaScript-based genome browser, for visualization of predicted on-target and off-target sites. JBrowse makes the visualization customizable by permitting users to build upon the reference sequence annotation through the addition of multiple genome feature tracks (Skinner et al., 2009). Compared to its predecessor GBrowse, JBrowse is exceptionally fast and provides smooth, continuous movement along the genome. This seamless data integration and visualization using JBrowse is important for biologists to validate data and to identify optimal target sites.
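The two GC-content measures described in this section can be illustrated with a minimal Python sketch. It is not CT-Finder's own code (the pipeline is written in Perl); the function names and the example sequence are hypothetical, and a 3’ PAM of length 3 (as for SpCas9) is assumed.

```python
def gc_fraction(seq: str) -> float:
    """Fraction of G/C bases in seq (0.0 for an empty string)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return sum(base in "GC" for base in seq) / len(seq)


def gc_measures(target_with_pam: str, pam_len: int = 3, proximal_len: int = 6):
    """Return (GC of the target without its PAM, GC of the PAM-proximal bases).

    Assumes a 3' PAM, so the PAM-proximal bases are the last
    `proximal_len` bases of the target once the PAM is removed.
    """
    protospacer = target_with_pam[:-pam_len]
    proximal = protospacer[-proximal_len:]
    return gc_fraction(protospacer), gc_fraction(proximal)


# Hypothetical 20-nt target followed by an NGG PAM (here TGG).
overall, proximal = gc_measures("GACGTTACCGGATCAAGCGC" + "TGG")
print(f"overall GC = {overall:.2f}, PAM-proximal GC = {proximal:.2f}")
```

In this example 4 of the 6 PAM-proximal bases are G or C, which meets the >50% threshold reported by Ren et al. (2014).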

Graphic User Interface

Figure 2.1 shows the CT-Finder home page, which has a menu on the left side to support three primary working modes: Cas9, Cas9n, and RFNs systems. If users click “Cas9”, the setting page for the Cas9 system will be displayed (Fig. 2.2a). Users follow the steps to enter their sequence, select a reference genome, and input their specifications for the target search. For the Cas9 system, users can input the on-target and off-target PAM sequences as well as the length of the gRNA and seed region. For indels (gaps) and mismatches, users can choose between “Basic settings” and “Specific settings”. “Basic settings” sets the maximum number of mismatches and gaps tolerated by off targets, both over the whole gRNA and within the seed region. “Specific settings” includes seed region settings and non-seed region settings. In each of these two groups, users can separately set the maximum number of mismatches, insertions, and deletions tolerated by off targets. If users select the “Cas9n” (Fig. 2.3a) or “RFNs” (Fig. 2.4a) mode, two additional paired-gRNA settings are presented: one for the minimum and maximum spacer length between paired gRNAs and another for the maximum number of strands that can tolerate bulges. The setting pages for RFNs and Cas9n modes differ only in the default minimum and maximum spacer length. The spacer between paired gRNAs is most effective within a range of 14 to 17 nt in the RFNs system, while the Cas9n system permits a much wider range, from approximately 0 to 1000 nt. After setting all parameters, users may click “Find optimal targets!” to run the program and view the result page.

The result pages for all three systems include a table viewer (e.g., Fig. 2.2b), a sequence viewer (e.g., Fig. 2.2c) and a genome viewer (e.g., Fig. 2.2d). On the table viewer page, if the user selects “Input Sequence Viewer”, a new window will open; in this new “Sequence Viewer” window, users are able to view their target DNA sequence with or without a ruler, with or without a spacer, and in the forward or reverse complementary strand. The “Sequence Viewer” window also includes a function to search for a user-typed pattern or subsequence (e.g., PAM sequence), which will be highlighted wherever it is found in the original sequence. The sequence viewer also helps users locate subsequences. The table viewer displays a list of target candidates (single gRNAs for the Cas9 system or pairs of gRNAs for the Cas9n and RFNs systems) and relevant information for each candidate, including the strand on which it is found (i.e., forward or reverse), its start position, its end position, GC-content measures, and the predicted number of matched target sites in the genome. A low number of target sites is preferred for high specificity. A value of 1 is highlighted, indicating that the target candidate is highly specific, having only one target site in the whole genome. Each row is clickable to bring up a list of target sites for the chosen target candidate, including information about the sequence, chromosome, strand, start/end position, and number of mismatches and gaps. A link to JBrowse is also included for each target site. Clicking one of these links leads to a genome browser showing the target site within the genomic context. The reference genome sequence is visualized with feature tracks that contain gene annotation information and the target site. Additional tracks may be added on the server or client side.
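The spacer-length setting for paired gRNAs can be sketched as a simple interval check. The following Python example is an illustration only, not the CT-Finder implementation: the coordinate convention (1-based, inclusive, both sites on the same chromosome) and the function names are assumptions, and strand orientation of the two half-sites is deliberately ignored.

```python
from itertools import combinations
from typing import List, Tuple

Site = Tuple[int, int]  # (start, end), 1-based inclusive, same chromosome


def spacer_length(a: Site, b: Site) -> int:
    """Number of bases strictly between two sites (negative if they overlap)."""
    left, right = sorted([a, b])
    return right[0] - left[1] - 1


def pairs_within_spacer(sites: List[Site], min_spacer: int, max_spacer: int):
    """Return all site pairs whose spacer length lies in [min_spacer, max_spacer]."""
    return [
        (a, b)
        for a, b in combinations(sorted(sites), 2)
        if min_spacer <= spacer_length(a, b) <= max_spacer
    ]


# Hypothetical half-site coordinates; 14-17 nt is the RFNs range given above.
sites = [(100, 122), (138, 160), (200, 222)]
rfn_pairs = pairs_within_spacer(sites, 14, 17)
print(rfn_pairs)
```

With the wider Cas9n range (for example 0 to 1000 nt) the same call would accept all three pairs in this toy input.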

Discussion

In applying CRISPR systems to edit genomes, one of the most important considerations is improving target specificity. According to recent research, there are two methods to improve target specificity. One method is to search for all possible off-target sites in the genome (based on experimentally determined off-target features) and then to choose the optimal target candidate with the fewest and least severe predicted off-target effects (Cho et al., 2014). The second method is to use new variants of CRISPR technology to improve target specificity (Ran et al., 2013; Tsai et al., 2014; Guilinger et al., 2014). CT-Finder combines these two methods to enhance target specificity.

Like its predecessors E-CRISP (Heigwer et al., 2014), GT-Scan (O’Brien and Bailey, 2014), and CRISPRdirect (Naito et al., 2015), CT-Finder supports a number of experimentally applicable input parameters, such as the PAM sequence, length of the seed region, and maximum number of mismatches. CRISPR DESIGN (Hsu et al., 2013), in contrast, does not allow users to customize these parameters. In addition to the fundamental settings, CT-Finder also supports settings for the maximum number of gaps. Lin et al. (2014) reported that off-target sites can tolerate insertions and deletions in addition to mismatches, indicating that gaps need to be considered in off-target analysis. CRISPRdirect also supports gaps, but its users cannot set the length of the seed region or alter the number of mismatches and gaps in the seed region; these features are unique to CT-Finder. CT-Finder allows specific settings for the seed region and non-seed region, including the number of mismatches, insertions and deletions, making CT-Finder flexible enough to accommodate off-target features identified in future experiments. CT-Finder also, for the first time, introduces the use of JBrowse to integrate and visualize on-target and off-target sites within the genomic context. This visualization interface presents a smoother experience than that of other tools, especially for data visualization and validation. Additional feature tracks, such as gene references, can be conveniently added on the server or client side to satisfy users’ requirements.

To apply new technologies that improve target specificity, CT-Finder supports Cas9n (Ran et al., 2013) and RFNs (Tsai et al., 2014; Guilinger et al., 2014). Both of these CRISPR variants are based on paired gRNAs and greatly improve target specificity relative to the original single-gRNA CRISPR system. CRISPR DESIGN and E-CRISP also support Cas9n or paired gRNAs, but do not allow users to set gaps or the number of strands that can tolerate bulges; CT-Finder supports all these settings, leading to more comprehensive and precise off-target searching than other tools. In addition, CT-Finder is the first available tool to help users design gRNAs according to any of 3 modes: Cas9, Cas9n and RFNs.

On the other hand, no multivariable scoring algorithm is currently incorporated into CT-Finder because there is not enough experimental evidence available to give rational weighting to the various factors, such as gaps, that affect the likelihood of off-target effects. Prediction of off-target effects is still a new concept, and additional research will be necessary to produce more accurate and quantitative algorithms. Furthermore, it is worth noting that computational algorithms can only predict off-target effects; experimental determination of off-target effects will always be necessary to confirm these predictions. CT-Finder is designed to favor predicted specificity above all other aspects when choosing optimal target candidates. Experimental goals should always be considered when utilizing computational tools, especially ones biased toward specific purposes. In some cases, higher cleavage efficiency may be preferred at the cost of a higher risk of off-target effects. In other cases, specificity is crucial, no matter the cost to efficiency.

Beyond prediction of off-target effects, experimental conditions can also influence off-target cleavage activity. For example, it has been reported that titrating the amount of SpCas9 delivered can aid in minimizing off-target cleavage activity, but with the drawback of reducing on-target cleavage as well (Ran et al., 2015). In addition, cellular environments or genomic factors may, in some cases, preclude efficient targeting of certain genes of interest by CRISPR systems despite the use of an optimal design (Doench et al., 2014). Other technologies, such as RNAi, may provide a preferable alternative in these situations.

Implementation

CT-Finder is composed of web interfaces coded in PHP and JavaScript and a backend pipeline implemented in Perl. The web interfaces accept users’ inputs, including many parameter settings, which are then passed to the backend pipeline for data processing and analysis. The pipeline generates multiple result files that reside on the web server. Afterwards, the results are displayed through highly interactive web interfaces.

As shown in Figure 2.5, the backend pipeline contains multiple steps. CT-Finder first processes the user-input DNA sequence into a list of subsequences ending with the on-target PAM sequence specified by the user, creating a file of all possible target candidates for the given sequence. For the Cas9 option (Fig. 2.5), these target candidates are single sequences; for either the Cas9n or RFNs option (Fig. 2.5), the target candidates are paired sequences. Bowtie2 (Langmead and Salzberg, 2012), a fast and sensitive read aligner, is then called to align each sequence, consisting of the gRNA target candidate sequence and the off-target PAM sequence specified by the user, to the reference genome, generating a SAM file. Samtools (Li, 2011) is applied to convert the Bowtie2 off-target SAM output into a BAM file, which is used to separate alignments into forward and reverse strand alignment files. If the Cas9n or RFNs option is selected, the numbers of insertions, deletions and mismatches are counted for all individual alignments. Then, the alignments are further separated by chromosome. The alignments in each chromosome are filtered according to the user-defined minimum and maximum length of the spacer between paired gRNA targets. For all three options (Cas9, Cas9n, and RFNs), potential off-target sites in alignment files are filtered to exclude alignments that have indels in the PAM sequence. The user may choose either “Basic settings” or “Specific settings” for further filtering of the potential off-target sites. “Basic settings” takes into account the total number of mismatches and gaps in the gRNA sequence and the total number of mismatches and gaps in the seed region. “Specific settings” specifies limits on the number of mismatches, the number of insertions, and the number of deletions for both the seed and non-seed regions. These settings are applied to narrow the set of potential off-target sites, eliminating any alignments with more mismatches, insertions, or deletions than the given thresholds. This produces a final set of predicted off-target sites.
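Two steps of the pipeline described above, enumerating target candidates that end with the on-target PAM and applying “Specific settings” thresholds, can be sketched in Python. This is a hedged illustration rather than the actual Perl pipeline: the names `find_candidates`, `Counts` and `keep_off_target` are hypothetical, the alignment step performed by Bowtie2 is not reproduced, and the mismatch/indel counts are assumed to have been extracted from the alignments already.

```python
import re
from dataclasses import dataclass

# IUPAC nucleotide codes expressed as regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def reverse_complement(seq: str) -> str:
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_candidates(seq: str, pam: str = "NGG", grna_len: int = 20):
    """Yield (strand, start, site) for each grna_len-mer followed by the PAM.

    `start` is 0-based on the scanned strand; `site` includes the PAM.
    """
    pam_re = "".join(IUPAC[base] for base in pam.upper())
    # A lookahead reports overlapping candidates as well.
    pattern = re.compile(rf"(?=([ACGT]{{{grna_len}}}{pam_re}))")
    seq = seq.upper()
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for match in pattern.finditer(s):
            yield strand, match.start(), match.group(1)


@dataclass
class Counts:
    mismatches: int = 0
    insertions: int = 0
    deletions: int = 0


def within(observed: Counts, limit: Counts) -> bool:
    return (observed.mismatches <= limit.mismatches
            and observed.insertions <= limit.insertions
            and observed.deletions <= limit.deletions)


def keep_off_target(seed: Counts, non_seed: Counts,
                    max_seed: Counts, max_non_seed: Counts) -> bool:
    """Keep a site only if no seed or non-seed threshold is exceeded."""
    return within(seed, max_seed) and within(non_seed, max_non_seed)


# Hypothetical input: one 20-nt target followed by a TGG PAM.
candidates = list(find_candidates("GACGTTACCGGATCAAGCGCTGGAT"))
print(candidates)

# Hypothetical thresholds: no seed differences, up to 3 non-seed mismatches.
kept = keep_off_target(Counts(0, 0, 0), Counts(2, 0, 0),
                       Counts(0, 0, 0), Counts(3, 0, 0))
dropped = keep_off_target(Counts(1, 0, 0), Counts(0, 0, 0),
                          Counts(0, 0, 0), Counts(3, 0, 0))
print(kept, dropped)
```

Keeping the seed and non-seed limits as separate objects mirrors the separation of the two regions in the settings page; a “Basic settings” variant would compare summed mismatches and gaps instead.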

Conclusions

CT-Finder is a web service to help users design the optimal gRNAs for Cas9, Cas9n and RFNs systems with a minimal number of off-target effects. The service covers major model organisms, including human, mouse, and Arabidopsis, and can be easily extended to other species. With its capability to accommodate three CRISPR systems, emphasis on target specificity, flexibility in user inputs of multiple parameter settings, and seamless integration into a genome browser, CT-Finder will empower many researchers in the gRNA design and optimal target determination of CRISPR-based genome editing.

References

Cho,S.W. et al. (2014) Analysis of off-target effects of CRISPR/Cas-derived RNA-guided endonucleases and nickases. Genome Res., 24, 132–141.
Dianov,G.L. and Hübscher,U. (2013) Mammalian Base Excision Repair: the Forgotten Archangel. Nucleic Acids Res, 41, 3483–3490.
Doench,J.G. et al. (2014) Rational design of highly active sgRNAs for CRISPR-Cas9–mediated gene inactivation. Nature Biotechnology, 32, 1262–1267.
Fineran,P.C. et al. (2014) Degenerate target sites mediate rapid primed CRISPR adaptation. PNAS, 111, E1629–E1638.
Fu,Y. et al. (2014) Improving CRISPR-Cas nuclease specificity using truncated guide RNAs. Nature Biotechnology, 32, 279–284.
Guilinger,J.P. et al. (2014) Fusion of catalytically inactive Cas9 to FokI nuclease improves the specificity of genome modification. Nature Biotechnology, 32, 577–582.
Gupta,R.M. and Musunuru,K. (2014) Expanding the genetic editing tool kit: ZFNs, TALENs, and CRISPR-Cas9. J Clin Invest, 124, 4154–4161.
Heigwer,F. et al. (2014) E-CRISP: fast CRISPR target site identification. Nature Methods, 11, 122–123.
Hsu,P.D. et al. (2013) DNA targeting specificity of RNA-guided Cas9 nucleases. Nature Biotechnology, 31, 827–832.
Langmead,B. and Salzberg,S.L. (2012) Fast gapped-read alignment with Bowtie 2. Nature Methods, 9, 357–359.
Li,H. (2011) A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics, 27, 2987–2993.
Lin,Y. et al. (2014) CRISPR/Cas9 systems have off-target activity with insertions or deletions between target DNA and guide RNA sequences. Nucleic Acids Res, 42, 7473–7485.
Naito,Y. et al. (2015) CRISPRdirect: software for designing CRISPR/Cas guide RNA with reduced off-target sites. Bioinformatics, 31, 1120–1123.
O’Brien,A. and Bailey,T.L. (2014) GT-Scan: identifying unique genomic targets. Bioinformatics, 30, 2673–2675.
van der Oost,J. et al. (2014) Unravelling the structural and mechanistic basis of CRISPR–Cas systems. Nature Reviews Microbiology, 12, 479–492.
Ran,F.A. et al. (2013) Double Nicking by RNA-Guided CRISPR Cas9 for Enhanced Genome Editing Specificity. Cell, 154, 1380–1389.
Ran,F.A. et al. (2015) In vivo genome editing using Staphylococcus aureus Cas9. Nature, 520, 186–191.
Ren,X. et al. (2014) Enhanced Specificity and Efficiency of the CRISPR/Cas9 System with Optimized sgRNA Parameters in Drosophila. Cell Reports, 9, 1151–1162.
Sander,J.D. and Joung,J.K. (2014) CRISPR-Cas systems for editing, regulating and targeting genomes. Nature Biotechnology, 32, 347–355.
Skinner,M.E. et al. (2009) JBrowse: A next-generation genome browser. Genome Res., 19, 1630–1638.
Tsai,S.Q. et al. (2014) Dimeric CRISPR RNA-guided FokI nucleases for highly specific genome editing. Nature Biotechnology, 32, 569–576.
Wang,H. et al. (2013) One-Step Generation of Mice Carrying Mutations in Multiple Genes by CRISPR/Cas-Mediated Genome Engineering. Cell, 153, 910–918.
Wang,T. et al. (2014) Genetic Screens in Human Cells Using the CRISPR-Cas9 System. Science, 343, 80–84.

List of Figures

Figure 2.1. The home page of CT-Finder web service, including three primary working modes: Cas9, Cas9n, and RFNs systems.

Figure 2.2. Example settings and result pages for the Cas9 system. Panel a: the settings page for the Cas9 system. Panel b: Table Viewer, which displays a list of target candidates (top) and a list of corresponding target sites in the genome for each target candidate (bottom). Panel c: Sequence Viewer, which shows the user input sequence and highlights the subsequence or pattern specified by a user. Panel d: Genome Browser, which visually displays the target sites and gene annotation.

Figure 2.3. Example settings and result pages for the Cas9n system. Panel a: the settings page for the Cas9n system. Panel b: Table Viewer, which displays a list of target candidates (top) and a list of corresponding target sites in the genome for each target candidate (bottom). Panel c: Sequence Viewer, which shows the user input sequence and highlights the subsequence or pattern specified by a user. Panel d: Genome Browser, which visually displays the target sites and gene annotation.

Figure 2.4. Example settings and result pages for the RFNs system. Panel a: the settings page for the RFNs system. Panel b: Table Viewer, which displays a list of target candidates (top) and a list of corresponding target sites in the genome for each target candidate (bottom). Panel c: Sequence Viewer, which shows the user input sequence and highlights the subsequence or pattern specified by a user. Panel d: Genome Browser, which visually displays the target sites and gene annotation.

[Figure 2.5 flowchart; labels recovered from the figure. Panel a (Cas9): users’ nucleotide sequence; Cas9_Candidates_Search.pl; single gRNA candidates (Candidates_Search.fa); Bowtie2; off-target sites for candidates (Bowtie2_Offtargets.sam); Samtools; forward strand (forward.sam) + reverse strand (reverse.sam); filter_pam.pl (filter_pam.sam); filter_seed.pl (filter_seed.sam); target candidates and off-target sites; graphical visualization with JBrowse and DataTables. Panel b (Cas9n and RFNs): user’s nucleotide sequence; Pairs_Candidates_Search.pl; pairs of gRNA candidates (Candidates_Search.fa); Bowtie2; off-target sites for candidates (Bowtie2_Offtargets.sam); Samtools; forward strand (forward.sam) + reverse strand (reverse.sam); countInDel_filterpam.pl (countIDel_forward.sam + countIDel_reverse.sam); chrs_seperate.pl; filter_position.pl (filter_position.sam); filter_seed.pl (filter_seed.sam); target candidates and off-target sites; graphical visualization with JBrowse and DataTables.]

Figure 2.5. Workflow of the CT-Finder web service, which includes web interfaces and a backend bioinformatics pipeline. Panel a: the workflow for finding the optimal target candidates for Cas9. Panel b: the workflow for finding the optimal target candidates for Cas9n and RFNs.

CHAPTER 3. CRISPR-DT: designing gRNAs for the CRISPR-Cpf1 system with improved target efficiency and specificity*

Abstract

The CRISPR-Cpf1 system has been successfully applied in genome editing. However, target efficiency of the CRISPR-Cpf1 system varies among different gRNA sequences. In this study, we reanalyzed the published CRISPR-Cpf1 gRNA data and found many sequence and structural features related to target efficiency. With the aid of Random Forest in feature selection, a support vector machine (SVM) model was created to predict target efficiency for any given gRNA. We have developed the first CRISPR-Cpf1 web service application, CRISPR-DT (CRISPR DNA Targeting), to help users design optimal gRNAs for the CRISPR-Cpf1 system by considering both target efficiency and specificity. CRISPR-DT will empower researchers in genome editing. CRISPR-DT is freely available at http://bioinfolab.miamioh.edu/CRISPR-DT.

* This chapter is a pre-copyedited, author-produced version of an article accepted for publication in Bioinformatics following peer review. The version of record, Zhu,H. and Liang,C. CRISPR-DT: designing gRNAs for the CRISPR-Cpf1 system with improved target efficiency and specificity. Bioinformatics, bty1061, is available online at: https://doi.org/10.1093/bioinformatics/bty1061

Introduction

The CRISPR-Cas9 (Clustered Regularly Interspaced Short Palindromic Repeats - CRISPR associated protein 9) system is a prokaryotic immune system that has been widely applied in eukaryotic genome editing. The basic mechanism of CRISPR-Cas9 editing is that a single guide RNA (gRNA) guides the Cas9 protein to a DNA target sequence, and Cas9 then cleaves both DNA strands at the same position. The DNA double-strand breaks can be repaired either by non-homologous end joining (NHEJ), leading to random insertion/deletion (indel) mutations, which can be applied to mediate gene knockouts (Perez et al., 2008; Ran et al., 2013) and knockins (Auer et al., 2014; Maresca et al., 2013; Suzuki et al., 2016), or by homology directed repair (HDR), resulting in precise gene editing via introducing a repair template, which can be harnessed for gene knockins (Genovese et al., 2014; Lombardo et al., 2011; Luchetti et al., 2014). In 2013, the CRISPR-Cas9 system was first demonstrated as a tool for eukaryotic genome editing (Cong et al., 2013; Mali et al., 2013). Since then, many studies have successfully applied CRISPR-Cas9 to edit genomes of different organisms (Friedland et al., 2013; Bassett et al., 2013; Ma et al., 2014; Platt et al., 2014; Shalem et al., 2014; Wang et al., 2014; Li et al., 2013; Shan et al., 2013; Feng et al., 2013). In addition, researchers found that CRISPR-Cas9 can even be used to correct genetic diseases (Wu et al., 2013; Long et al., 2014) and model cancers by introducing mutations to normal tissues (Platt et al., 2014; Matano et al., 2015; Xue et al., 2014; Heckl et al., 2014). Unfortunately, the CRISPR-Cas9 system still has some shortcomings, such as high off-target effects (Fu et al., 2013; Cradick et al., 2013). The CRISPR-Cpf1 system was first characterized in 2015, and it is a class 2 CRISPR-Cas system with a single Cas protein to mediate cleavage, like CRISPR-Cas9 (Zetsche et al., 2015).
The Cpf1 locus is simply organized, including Cpf1, Cas genes and a CRISPR array (Fig. 3.1). Researchers found that CRISPR-Cpf1 has some characteristics distinct from those of CRISPR-Cas9 (Zetsche et al., 2015). First, Cpf1 has RNase III activity for processing the precursor CRISPR RNA (pre-crRNA) into mature crRNAs (Fonfara et al., 2016; Zetsche et al., 2017). Second, Cpf1 needs only one crRNA for cleavage, whereas Cas9 requires both the crRNA and trans-activating crRNA (tracrRNA) (Zetsche et al., 2015). Third, Cpf1 recognizes a T-rich protospacer adjacent motif (PAM), which is at the 5’ end of the protospacer sequence (Zetsche et al., 2015). Fourth, Cpf1 cleaves double-strand DNA via a staggered cut that generates two sticky ends, and the cleavage sites are further away from the PAM (Zetsche et al., 2015). It was suggested that the gRNA sequence of CRISPR-Cpf1 should be divided into seed (6 nucleotides in the 5’ PAM-proximal end) and non-seed (14 nucleotides in the 3’ PAM-distal end) regions (Kim et al., 2017b; Kim et al., 2018). Mismatches are less tolerated in the seed region than in the non-seed region.

Recently, the CRISPR-Cpf1 system has been successfully applied for genome editing in various organisms. Studies showed that CRISPR-Cpf1 is highly specific in mammals (Kleinstiver et al., 2016; Kim et al., 2016a; Kim et al., 2016b; Hur et al., 2016). Compared with Cas9, Cpf1 induced fewer off-target effects in human cells (Kim et al., 2016a). Cpf1 also demonstrates high target specificity in plant cells (Tang et al., 2017; Xu et al., 2017; Hu et al., 2017; Kim et al., 2017a). In addition, CRISPR-Cpf1 has been used to genetically engineer some important bacteria, such as Corynebacterium glutamicum (Jiang et al., 2017) and Cyanobacteria (Ungerer and Pakrasi, 2016), whose genomes cannot be edited by the CRISPR-Cas9 system, probably due to Cas9 toxicity in these bacteria (Jiang et al., 2017; Wendt et al., 2016). Moreover, researchers demonstrated that CRISPR-Cpf1 is highly suitable for multiplex gene editing (Zetsche et al., 2017; Wang et al., 2017). Although CRISPR-Cas9 can also be used in multiplex gene editing, it requires larger constructs or delivery of several plasmids (Kabadi et al., 2014; Nissim et al., 2014; Sakuma et al., 2014; Xie et al., 2015), which might cause problems in multiplex gene editing (Zetsche et al., 2017). Additionally, the CRISPR-Cpf1 system has been applied in the correction of genetic mutations (Zhang et al., 2017b; Yang et al., 2017) and has shown great therapeutic potential. Furthermore, inactive Cpf1 can be used for gene repression (Tang et al., 2017; Zhang et al., 2017a). All these studies demonstrate that the CRISPR-Cpf1 system has many advantages over CRISPR-Cas9 and is a very powerful tool for genome editing.
Target efficiency and specificity are the two most important aspects of genome editing. However, target efficiency of the CRISPR-Cpf1 system varies among different gRNA sequences (Kim et al., 2017b; Kim et al., 2018), just as in the CRISPR-Cas9 system (Shalem et al., 2014; Wang et al., 2014; Doench et al., 2014; Gilbert et al., 2014; Koike-Yusa et al., 2014; Konermann et al., 2015; Zhou et al., 2014). Kim and colleagues sought to identify gRNA features that are related to the target efficiency of CRISPR-Cpf1 (Kim et al., 2017b; Kim et al., 2018). Recently, they measured the activities of 15000 gRNA sequences for the CRISPR-Cpf1 system and developed an algorithm to predict target efficiency (Kim et al., 2018). Since the favorable PAM sequence for CRISPR-Cpf1 is TTTV (Kim et al., 2017b), where V stands for A, C or G but not T in IUPAC codes, in our study we reanalyzed the 11365 gRNA sequences with the TTTV PAM by comparing the most efficient gRNAs (top 10% in activity ranking) with the least efficient gRNAs (bottom 10%). Excluding gRNAs with modest efficiency helps identify the gRNA features most distinctly related to target efficiency, a strategy that has proven effective in previous studies (Wang et al., 2009; Wong et al., 2015). We found many novel gRNA sequence and structural features that are closely associated with target efficiency of CRISPR-Cpf1. Using machine learning, we created a support vector machine (SVM) model to predict target efficiency by combining all the important features selected by Random Forest and SVM. Previous research showed that SVM had robust performance in predicting target efficiency for the CRISPR-Cas9 system (Wong et al., 2015). Random Forest can be used to calculate the importance of each feature (Saeys et al., 2007) and has been applied in various biological studies (Ebina et al., 2011; Lou et al., 2014; Everson et al., 2015). We have developed the first CRISPR-Cpf1 web service application, CRISPR-DT, to help users design appropriate gRNAs for the CRISPR-Cpf1 system by considering both target efficiency and specificity.
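The TTTV PAM and the seed/non-seed split described in this introduction can be made concrete with a short Python sketch. It is illustrative only and is not part of CRISPR-DT: the function name and example sequence are hypothetical, only the given strand is scanned, and a 20-nt protospacer with a 6-nt seed is assumed as stated above.

```python
import re

# TTTV PAM: V = A, C or G (IUPAC). For Cpf1 the PAM lies 5' of the protospacer.
# The lookahead allows overlapping sites to be reported.
TTTV_SITE = re.compile(r"(?=(TTT[ACG])([ACGT]{20}))")


def cpf1_targets(seq: str, seed_len: int = 6):
    """Return (pam, seed, non_seed) for each TTTV + 20-nt site on the given strand."""
    targets = []
    for match in TTTV_SITE.finditer(seq.upper()):
        pam, protospacer = match.group(1), match.group(2)
        # The seed is the PAM-proximal (5') part; the rest is the non-seed region.
        targets.append((pam, protospacer[:seed_len], protospacer[seed_len:]))
    return targets


# Hypothetical sequence containing a single TTTA PAM followed by 20 nt.
targets = cpf1_targets("CC" + "TTTA" + "GACGTTACCGGATCAAGCGC" + "GG")
print(targets)
```

A TTTT stretch is rejected by the `[ACG]` class, which matches the exclusion of T at the fourth PAM position.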

Materials and Methods

Data retrieval and usage

The activity data for the 15000 gRNAs were downloaded from the website (https://www-nature-com.proxy.lib.miamioh.edu/articles/nbt.4061) (Kim et al., 2018). The 11365 gRNA sequences with the TTTV PAM were extracted using Perl scripts. The top 10% (1137 efficient) and bottom 10% (1137 inefficient) of the 11365 gRNAs in activity ranking were analyzed to explore features related to target efficiency. These 2274 gRNAs were also used to train the SVM model. From the same website, we also obtained 990 independent gRNAs with the TTTV PAM from the HT 1-2 data set (Kim et al., 2018) for validating the SVM model. The reference genomes and corresponding gff3 annotation files for predicting target specificity were downloaded from Ensembl (https://useast.ensembl.org, Ensembl Release 81) and Ensembl Plants (https://plants.ensembl.org, Ensembl Release 27).
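The selection step described above can be illustrated with a short sketch. The original work used Perl scripts that are not shown here, so the Python below is only an approximation under stated assumptions: the record layout (PAM, guide, activity) and the function names are hypothetical, and the tail size is rounded up so that 10% of 11365 gives 1137, matching the numbers reported in the text.

```python
import re

# TTTV in IUPAC notation: TTT followed by A, C, or G (not T).
TTTV = re.compile(r"^TTT[ACG]$")


def tail_size(n_records, percent=10):
    """Number of records in each tail, rounded up (assumption: 11365 -> 1137)."""
    return (n_records * percent + 99) // 100


def select_extremes(records, percent=10):
    """Split (pam, guide, activity) records into efficient and inefficient sets.

    Only records with a TTTV PAM are kept. The top `percent` by activity are
    returned as efficient, and the bottom `percent` as inefficient.
    """
    kept = [r for r in records if TTTV.match(r[0].upper())]
    ranked = sorted(kept, key=lambda r: r[2], reverse=True)
    n = tail_size(len(ranked), percent)
    return ranked[:n], ranked[len(ranked) - n:]
```

With 11365 TTTV records, `tail_size` returns 1137 for each tail, which gives the 2274 gRNAs used for feature analysis and model training.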

Computational tools and statistical analysis

The minimum free energy of each 20-nt gRNA sequence was calculated using RNAfold (Lorenz et al., 2011) with the default parameters. The melting temperatures of the entire 20-nt gRNA sequence, seed region, and non-seed region were calculated using the Biopython Tm_NN function with thermodynamic values from the DNA_NN2 table (Cock et al., 2009; Le Novère, 2001; Sugimoto et al., 1996). Statistical significance was assessed with Welch’s t-test (P-value < 0.05) in R.
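For readers unfamiliar with Welch's t-test, the following sketch computes the test statistic and the Welch-Satterthwaite degrees of freedom using only the Python standard library. It is an illustration, not the analysis code of this study, which was run in R. The function name is hypothetical, and the P-value is deliberately omitted because it requires a t-distribution CDF from a statistics package.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Return (t, df) for Welch's unequal-variance t-test on samples a and b."""
    # Squared standard error of each sample mean (sample variance / n).
    va = variance(a) / len(a)
    vb = variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df
```

In this setting, `a` and `b` would hold one feature's values (for example, the uracil count) for the efficient and inefficient gRNAs, respectively.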

Target efficiency predictive model

The top 10% (1137) of the 11365 gRNAs in activity ranking were labeled as “efficient gRNAs”, while the bottom 10% (1137) were labeled as “inefficient gRNAs”. We calculated 738 features (Fig. 3.2) for these 2274 gRNAs, including the position-specific nucleotide composition (20 × 4 + 19 × 4^2 = 384), position-nonspecific nucleotide composition (4 + 4^2 + 4^3 + 4^4 = 340), repetitive bases (1 + 1 + 1 + 1 + 1 = 5), UUU in the gRNA seed region (1), GC content (1 + 1 + 1 + 1 = 4), minimum free energy (1), and melting temperature (1 + 1 + 1 = 3). Feature selection was performed using Random Forest (default parameters) and SVM before training the SVM models. We used ten-fold cross validation to evaluate the SVM model performance. Specifically, the 2274 gRNAs were randomly divided into ten folds. Each fold was used once as the test data, with the remaining nine folds used as the training data. For each round of the ten-fold cross validation, (1) we used Random Forest and SVM to select important features by comparing the five-fold cross validation results of the top 50, 150, 250, 350, and 450 features in the training data set (features were ranked by the mean decrease in impurity calculated by Random Forest); (2) we built the SVM model with LIBSVM (Chang and Lin, 2011), using a radial basis function (RBF) kernel, based on the training data with the important features; and (3) we evaluated the SVM model on the test data. After repeating the above procedure ten times, we obtained the average performance of the SVM models. The ROCR package of R was used to draw the ROC curves and calculate the AUC values. Last, we built the final SVM model with an RBF kernel on all 2274 gRNAs using the most important features, which were selected using Random Forest and SVM by comparing the five-fold cross validation results of the top 50, 150, 250, 350, and 450 features in the full 2274-gRNA data set.

To further validate the final SVM model, we compared it with the previously published model using the 990 independent gRNAs, in which the top 10% of gRNAs in activity ranking were defined as efficient and the remaining 90% were regarded as inefficient (Kim et al., 2018).
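The feature arithmetic above can be checked with a small sketch that enumerates the 724 sequence-composition features (384 position-specific plus 340 position-nonspecific) and fills them in for one guide. The feature names follow the `Position_6_UU` and `Count_UU` convention used in the tables, but the function names and the 0/1 encoding of position-specific features are assumptions made for illustration. The guide is assumed to contain only A, C, G, and U (T is converted to U).

```python
from itertools import product

BASES = "ACGU"


def feature_names(length=20):
    """Names of the position-specific and position-nonspecific composition features."""
    names = []
    for pos in range(1, length + 1):          # 20 x 4 mononucleotide features
        names += [f"Position_{pos}_{b}" for b in BASES]
    for pos in range(1, length):              # 19 x 4^2 dinucleotide features
        names += [f"Position_{pos}_{''.join(p)}" for p in product(BASES, repeat=2)]
    for k in range(1, 5):                     # 4 + 4^2 + 4^3 + 4^4 k-mer counts
        names += [f"Count_{''.join(p)}" for p in product(BASES, repeat=k)]
    return names


def composition_features(guide):
    """Compute the composition features for one guide (overlapping k-mer counts)."""
    guide = guide.upper().replace("T", "U")
    feats = dict.fromkeys(feature_names(len(guide)), 0)
    for i, base in enumerate(guide, start=1):
        feats[f"Position_{i}_{base}"] = 1
    for i in range(1, len(guide)):
        feats[f"Position_{i}_{guide[i - 1:i + 1]}"] = 1
    for k in range(1, 5):
        for i in range(len(guide) - k + 1):
            feats[f"Count_{guide[i:i + k]}"] += 1
    return feats
```

Adding the remaining 14 features listed in the text (5 + 1 + 4 + 1 + 3) to these 724 gives the reported total of 738.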

Bioinformatics pipeline for improving target specificity

As shown in Figure 3.3, the pipeline consists of the following five major steps. (1) Users input a DNA sequence. (2) A Perl script searches the input DNA sequence for all target candidates based on the on-target PAM sequence selected by the user. (3) Each target candidate is mapped to the reference genome by Bowtie2 (Langmead and Salzberg, 2012) to find all possible target sites in the genome, according to the maximum numbers of mismatches and gaps that off-targets are allowed to tolerate. (4) samtools (Li et al., 2009) separates the alignment results by forward or reverse strand. (5) Several Perl scripts filter the results based on the user's off-target settings. After file format conversion, PHP and JavaScript are used to display the results in DataTables and JBrowse (Skinner et al., 2009). The same strategy was used in our previous work (Zhu et al., 2016; Zhu et al., 2018).
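Step (2) of the pipeline can be sketched as follows. The production pipeline uses a Perl script that is not reproduced here, so this Python version is only a minimal illustration under assumptions: the function names and the returned dictionary layout are hypothetical, the default PAM is TTTV, and the 20-nt protospacer is taken immediately 3' of the PAM, consistent with the first gRNA position being adjacent to the PAM.

```python
import re

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an uppercase DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def find_candidates(dna, pam_regex="TTT[ACG]", guide_len=20):
    """Find PAM + protospacer candidates on both strands of the input DNA."""
    dna = dna.upper()
    # A lookahead is used so that overlapping candidates are all reported.
    pattern = re.compile(rf"(?=({pam_regex})([ACGT]{{{guide_len}}}))")
    hits = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for m in pattern.finditer(seq):
            hits.append({
                "strand": strand,
                # 0-based PAM start on the scanned strand (for "-", this is
                # relative to the reverse complement, not the input sequence).
                "start": m.start(),
                "pam": m.group(1),
                "protospacer": m.group(2),
            })
    return hits
```

Each candidate returned by such a search would then be passed to the alignment step, step (3).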

Results

Sequence features of efficient and inefficient gRNAs

Position-specific nucleotide composition

Figure 3.4 shows that, compared with inefficient gRNAs, efficient gRNAs strongly disfavor uracil (U) at the first position of the 20-nt gRNA sequence, which is immediately adjacent to the PAM sequence (P = 1.61E-44). Meanwhile, guanine (G) and cytosine (C) are strongly favored at the first position (P = 2.76E-22 and P = 2.21E-04, respectively). At the last position, efficient gRNAs prefer cytosine (P = 1.05E-04) and adenine (A) (P = 6.25E-03), but not guanine (P = 3.95E-10). Overall, within the 20-nt gRNA sequence, efficient gRNAs favor adenine but disfavor uracil. Position-specific dinucleotides were also analyzed by comparing efficient and inefficient gRNAs (Table 3.1). At the first two positions of efficient gRNAs, UU is strongly disfavored with an enrichment ratio of 0.10 (P = 9.67E-43), whereas CC, GG, and GA are highly favored with enrichment ratios of 2.50 (P = 1.19E-11), 2.61 (P = 1.53E-07), and 3.09 (P = 4.10E-07), respectively. More interestingly, UU is disfavored at every pair of adjacent positions in efficient gRNAs compared with inefficient gRNAs.
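The enrichment ratio reported in Table 3.1 is the dinucleotide probability in efficient gRNAs divided by that in inefficient gRNAs. A minimal sketch of that calculation is given below. The function names are hypothetical, positions are 1-based as in the table, and returning infinity for a zero denominator is an assumption made for this example.

```python
def dinucleotide_probability(guides, position, dinuc):
    """Fraction of guides carrying `dinuc` at 1-based `position` and `position` + 1."""
    hits = sum(1 for g in guides if g[position - 1:position + 1] == dinuc)
    return hits / len(guides)


def enrichment_ratio(efficient, inefficient, position, dinuc):
    """Probability in efficient guides divided by probability in inefficient guides."""
    p_eff = dinucleotide_probability(efficient, position, dinuc)
    p_ineff = dinucleotide_probability(inefficient, position, dinuc)
    # Assumption: report infinity when the dinucleotide never occurs in inefficient guides.
    return p_eff / p_ineff if p_ineff else float("inf")
```

A ratio below 1, such as the 0.10 reported for UU at position 1, means the dinucleotide is depleted in efficient gRNAs.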

Position-nonspecific nucleotide composition

Table 3.2 shows that the average counts of uracil within the 20-nt gRNA sequence are 4.83 and 7.22 for efficient and inefficient gRNAs, respectively (P = 5.83E-98), which indicates that efficient gRNAs disfavor uracil. In contrast, efficient gRNAs show a preference for adenine (4.26 versus 3.00, P = 1.35E-65) and cytosine (6.08 versus 5.13, P = 3.76E-27) compared with inefficient gRNAs. The most significant position-nonspecific dinucleotide count is the UU count (P = 8.00E-127), which is greatly decreased in efficient gRNAs, with an enrichment ratio of 0.36. This finding is consistent with a previous study of CRISPR-Cas9 (Wong et al., 2015). The position-nonspecific trinucleotide and tetranucleotide counts were also calculated (Supplementary Table 3.3). UUU is significantly enriched in inefficient gRNAs, with average counts of 0.11 and 0.80 in efficient and inefficient gRNAs, respectively (P = 2.13E-123). The average count of GGG is significantly decreased in efficient gRNAs, with an enrichment ratio of 0.71 (P = 1.56E-04), suggesting that GGG is disfavored in efficient gRNAs, which is consistent with a previous CRISPR-Cas9 study (Wong et al., 2015). UUUU is the tetranucleotide most significantly enriched in inefficient gRNAs, with average counts of 0.01 and 0.46 in efficient and inefficient gRNAs, respectively (P = 4.29E-124). Previous work has demonstrated that GGGG within the gRNA sequence can cause poor CRISPR-Cas9 activity because GGGG impairs DNA oligo synthesis and can form a secondary structure called a guanine tetrad in the gRNA sequence, which makes it difficult for gRNAs to bind to target sequences (Wong et al., 2015). Consistently, for the CRISPR-Cpf1 system, we found that far fewer efficient gRNAs than inefficient ones contain the GGGG motif, with an enrichment ratio of 0.52 (P = 2.22E-04). Repetitive bases (at least four A, four C, four G, or four U) have also been shown to be associated with poor CRISPR-Cas9 activity (Wong et al., 2015). Similarly, in our study we found that more inefficient gRNAs than efficient ones contain repetitive bases (52.68% versus 12.31%, P = 5.29E-102), especially UUUU (42.74% versus 0.79%, P = 2.22E-134).

In addition, we examined UUU in the gRNA seed region, defined as the 6 nucleotides in the 5’ PAM-proximal region (Kim et al., 2017b), and found that more inefficient gRNAs than efficient ones contain UUU in the seed region, with an enrichment ratio of 0.08 (P = 2.86E-55), which is consistent with previous CRISPR-Cas9 research (Wong et al., 2015).
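The two binary features discussed above can be sketched as simple string checks. This is an illustration only: the function names are hypothetical, and it assumes that "repetitive bases" means a run of at least four identical consecutive bases and that the seed region is the first 6 nucleotides of the gRNA sequence.

```python
import re


def has_repetitive_bases(guide, run=4):
    """True if the guide contains `run` or more identical consecutive bases."""
    return re.search(rf"([ACGU])\1{{{run - 1}}}", guide) is not None


def seed_has_uuu(guide, seed_len=6):
    """True if UUU occurs within the PAM-proximal seed (first `seed_len` bases)."""
    return "UUU" in guide[:seed_len]


def fraction_with(guides, predicate):
    """Fraction of guides for which `predicate` is true."""
    return sum(1 for g in guides if predicate(g)) / len(guides)
```

Applying `fraction_with` to the efficient and inefficient sets would give percentages comparable in form to those reported above.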

GC content

We separately compared the GC content of the entire 20-nt gRNA sequence, seed region, and non-seed region between efficient and inefficient gRNAs. For the entire 20-nt gRNA sequence, efficient gRNAs have higher GC content than inefficient ones (0.55 versus 0.49, P = 9.27E-24). In addition, the GC content of efficient gRNAs is higher than that of inefficient ones in both the seed and non-seed regions (0.57 versus 0.46, P = 6.67E-36 and 0.54 versus 0.50, P = 2.19E-08, respectively). Since gRNAs with balanced GC content have higher target efficiency for the CRISPR-Cas9 system (Wang et al., 2014; Doench et al., 2014; Gagnon et al., 2014), in our study we considered a GC content of 0.30 to 0.70 as normal and a GC content greater than 0.70 or less than 0.30 as abnormal. The results showed that more efficient gRNAs than inefficient ones have normal GC content (95.60% versus 84.17%, P = 9.31E-20).
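The four GC-content features can be sketched as below. The function names and the returned dictionary keys are hypothetical, the seed region is assumed to be the first 6 nucleotides, and the normal range is treated as inclusive of 0.30 and 0.70 as stated in the text.

```python
def gc_content(seq):
    """Fraction of G and C bases in a sequence."""
    return (seq.count("G") + seq.count("C")) / len(seq)


def gc_summary(guide, seed_len=6):
    """GC content of the full guide, seed, and non-seed regions, plus a normal-range flag."""
    guide = guide.upper()
    full = gc_content(guide)
    return {
        "full": full,
        "seed": gc_content(guide[:seed_len]),
        "non_seed": gc_content(guide[seed_len:]),
        # Normal GC content is defined in the text as 0.30 to 0.70.
        "normal": 0.30 <= full <= 0.70,
    }
```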

Structural features of efficient and inefficient gRNAs

Minimum free energy

The secondary structure stability of the 20-nt gRNA sequence was assessed by its minimum free energy (MFE). Compared with efficient gRNAs, more inefficient gRNAs fall in the lower MFE range and fewer fall in the higher MFE range (Fig. 3.5). In addition, the average MFE of inefficient gRNAs is significantly lower than that of efficient ones (-2.11 versus -1.49, P = 1.88E-11), which means that the secondary structure of inefficient gRNAs is on average more stable than that of efficient ones. This finding suggests that the nucleotide accessibility of the 20-nt gRNA sequence, as determined by secondary structure, is strongly related to target efficiency, which is consistent with a previous CRISPR-Cas9 study (Wong et al., 2015).

Melting temperature

The melting temperature was calculated based on the DNA version of the gRNA sequence (Doench et al., 2016). We separately compared the melting temperatures of the entire 20-nt gRNA sequence, seed region, and non-seed region between efficient and inefficient gRNAs, a strategy similar to that used in previous research (Doench et al., 2016). For the entire 20-nt gRNA sequence, efficient gRNAs have a significantly higher average melting temperature than inefficient ones (57.40 versus 56.04, P = 4.68E-09). Additionally, the average melting temperatures of efficient gRNAs are higher than those of inefficient ones in both the seed (-11.92 versus -15.08, P = 2.39E-13) and non-seed (43.11 versus 42.48, P = 2.05E-02) regions.

An SVM model to predict target efficiency

The top 10% and bottom 10% of the 11365 gRNAs in activity ranking (2274 in total) were used to train and test our support vector machine (SVM) models. Feature selection was performed before training the SVM models. We used ten-fold cross validation to evaluate the SVM models, and the receiver operating characteristic (ROC) curves are shown in Figure 3.6. The average area under the curve (AUC) is 0.92, which indicates that the SVM models perform well in distinguishing efficient from inefficient gRNAs. We built a final SVM model using all 2274 gRNAs with the 150 most important features (Table 3.4) selected by Random Forest and SVM. The evaluation results for different numbers of features are shown in Figure 3.7. To further validate the SVM model, we compared it with the previously published model (Kim et al., 2018). Our model performed better than the deep learning model created by Kim et al. (2018) (AUC = 0.78 versus 0.73) in predicting target efficiency for the 990 independent gRNAs (Fig. 3.8). In summary, our SVM model shows robust performance in predicting target efficiency.
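As a reminder of what the AUC values above measure, the sketch below computes AUC as the probability that a randomly chosen efficient gRNA receives a higher score than a randomly chosen inefficient one, with ties counted as one half. This is a standard equivalent formulation rather than the ROCR code used in this study. The function name is hypothetical, and the pairwise loop is quadratic, so it is only meant for small examples.

```python
def auc(scores, labels):
    """AUC from prediction scores and binary labels (1 = efficient, 0 = inefficient)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    # Count positive/negative pairs ranked correctly; ties contribute 0.5.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which is the scale on which the reported values of 0.92, 0.78, and 0.73 should be read.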

A web service application for improving target efficiency and specificity

Although the CRISPR-Cpf1 system appears to be highly specific in human and plant cells, bioinformatics methods can still be used to maximize its target specificity. By considering both target efficiency and specificity, we have developed a web service application, CRISPR-DT, to help users design optimal gRNAs for the CRISPR-Cpf1 system. On the settings page (Fig. 3.9), users first input the DNA sequence that they want to target in FASTA format. Second, they select a reference genome. Third, they set the on-target and off-target PAMs. Fourth, they choose an off-target setting. They can then click “Find targets!” to run the program. On the result pages, Figure 3.10a shows all the target candidates and relevant information, including the number of target sites within the entire genome, the number of exon targets within the entire genome, and the efficiency score of each target candidate. Users can rank target candidates by clicking the column headers and choose optimal ones by considering both off-target effects and target efficiency. The number of target sites and the number of exon targets can be clicked to show detailed information for each target site (Fig. 3.10b), including alignment information and a JBrowse (Skinner et al., 2009) link, which displays the target site in the context of genomic and transcript features (Fig. 3.10c).

Discussion

The CRISPR-Cas9 system has been widely applied in genome editing. More recently, the CRISPR-Cpf1 system was identified as a powerful new tool for genome editing (Zetsche et al., 2015). Like CRISPR-Cas9, CRISPR-Cpf1 is a class 2 CRISPR-Cas system, but it has some distinct characteristics. For example, Cpf1 recognizes a T-rich PAM and cleaves double-stranded DNA via a staggered cut, resulting in two sticky ends (Zetsche et al., 2015). Compared with CRISPR-Cas9, the CRISPR-Cpf1 system has some advantages, such as higher target specificity in human and plant cells (Kim et al., 2016a; Tang et al., 2017; Xu et al., 2017; Hu et al., 2017; Kim et al., 2017a) and better performance in multiplex gene editing (Zetsche et al., 2017). Target efficiency and specificity are the two most important aspects of genome editing. However, target efficiency for the CRISPR-Cpf1 system varies among gRNAs (Kim et al., 2017b; Kim et al., 2018). So far, gRNA characteristics related to target efficiency have not been well studied. Here, we reanalyzed the published CRISPR-Cpf1 gRNA data (Kim et al., 2018) and found that many sequence and structural features of gRNAs (e.g., the position-specific nucleotide composition, position-nonspecific nucleotide composition, GC content, minimum free energy, and melting temperature) are correlated with target efficiency. With the aid of Random Forest in feature selection, an SVM model was created based on existing gRNAs and their efficiency data to predict target efficiency for any given gRNA. We have developed the first CRISPR-Cpf1 web service application, CRISPR-DT, to help users design optimal gRNAs for the CRISPR-Cpf1 system by considering both target efficiency and specificity. The target efficiency score is available for mammals because the SVM model was built on mammalian data (Kim et al., 2018). We have also updated our three previously published CRISPR-Cas systems (Cas9, Cas9n, and RFN) (Zhu et al., 2016) by incorporating the model developed by Doench et al. (2016) to predict target efficiency for mammals; these systems are also available in CRISPR-DT. CRISPR-DT will empower researchers in genome editing using CRISPR-Cpf1.

References

Auer,T.O. et al. (2014) Highly efficient CRISPR/Cas9-mediated knock-in in zebrafish by homology-independent DNA repair. Genome Res., 24, 142–153.
Bassett,A.R. et al. (2013) Highly Efficient Targeted Mutagenesis of Drosophila with the CRISPR/Cas9 System. Cell Reports, 4, 220–228.
Chang,C.-C. and Lin,C.-J. (2011) LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2, 27.
Cock,P.J.A. et al. (2009) Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics, 25, 1422–1423.
Cong,L. et al. (2013) Multiplex Genome Engineering Using CRISPR/Cas Systems. Science, 339, 819–823.
Cradick,T.J. et al. (2013) CRISPR/Cas9 systems targeting β-globin and CCR5 genes have substantial off-target activity. Nucleic Acids Res, 41, 9584–9592.
Doench,J.G. et al. (2016) Optimized sgRNA design to maximize activity and minimize off-target effects of CRISPR-Cas9. Nat. Biotechnol., 34, 184–191.
Doench,J.G. et al. (2014) Rational design of highly active sgRNAs for CRISPR-Cas9-mediated gene inactivation. Nature biotechnology, 32, 1262.
Ebina,T. et al. (2011) DROP: an SVM domain linker predictor trained with optimal features selected by random forest. Bioinformatics, 27, 487–494.
Everson,T.M. et al. (2015) DNA methylation loci associated with atopy and high serum IgE: a genome-wide application of recursive Random Forest feature selection. Genome Medicine, 7, 89.
Feng,Z. et al. (2013) Efficient genome editing in plants using a CRISPR/Cas system. Cell Research, 23, 1229.
Fonfara,I. et al. (2016) The CRISPR-associated DNA-cleaving enzyme Cpf1 also processes precursor CRISPR RNA. Nature, 532, 517–521.
Friedland,A.E. et al. (2013) Heritable genome editing in C. elegans via a CRISPR-Cas9 system. Nature Methods, 10, 741.
Fu,Y. et al. (2013) High frequency off-target mutagenesis induced by CRISPR-Cas nucleases in human cells. Nature biotechnology, 31, 822.
Gagnon,J.A. et al. (2014) Efficient Mutagenesis by Cas9 Protein-Mediated Oligonucleotide Insertion and Large-Scale Assessment of Single-Guide RNAs. PLOS ONE, 9, e98186.
Genovese,P. et al. (2014) Targeted genome editing in human repopulating haematopoietic stem cells. Nature, 510, 235–240.
Gilbert,L.A. et al. (2014) Genome-Scale CRISPR-Mediated Control of Gene Repression and Activation. Cell, 159, 647–661.
Heckl,D. et al. (2014) Generation of mouse models of myeloid malignancy with combinatorial genetic lesions using CRISPR-Cas9 genome editing. Nature biotechnology, 32, 941.
Hu,X. et al. (2017) Targeted mutagenesis in rice using CRISPR-Cpf1 system. J Genet Genomics, 44, 71–73.
Hur,J.K. et al. (2016) Targeted mutagenesis in mice by electroporation of Cpf1 ribonucleoproteins. Nat. Biotechnol., 34, 807–808.
Jiang,Y. et al. (2017) CRISPR-Cpf1 assisted genome editing of Corynebacterium glutamicum. Nat Commun, 8, 15179.
Kabadi,A.M. et al. (2014) Multiplex CRISPR/Cas9-based genome engineering from a single lentiviral vector. Nucleic Acids Res, 42, e147–e147.
Kim,D. et al. (2016a) Genome-wide analysis reveals specificities of Cpf1 endonucleases in human cells. Nature Biotechnology, 34, 863.
Kim,H. et al. (2017a) CRISPR/Cpf1-mediated DNA-free plant genome editing. Nature Communications, 8, 14406.
Kim,H.K. et al. (2017b) In vivo high-throughput profiling of CRISPR-Cpf1 activity. Nat. Methods, 14, 153–159.
Kim,H.K. et al. (2018) Deep learning improves prediction of CRISPR–Cpf1 guide RNA activity. Nature Biotechnology, 36, 239–241.
Kim,Y. et al. (2016b) Generation of knockout mice by Cpf1-mediated gene targeting. Nat. Biotechnol., 34, 808–810.
Kleinstiver,B.P. et al. (2016) Genome-wide specificities of CRISPR-Cas Cpf1 nucleases in human cells. Nature Biotechnology, 34, 869.
Koike-Yusa,H. et al. (2014) Genome-wide recessive genetic screening in mammalian cells with a lentiviral CRISPR-guide RNA library. Nat. Biotechnol., 32, 267–273.
Konermann,S. et al. (2015) Genome-scale transcriptional activation by an engineered CRISPR-Cas9 complex. Nature, 517, 583.
Langmead,B. and Salzberg,S.L. (2012) Fast gapped-read alignment with Bowtie 2. Nature methods, 9, 357–359.
Le Novère,N. (2001) MELTING, computing the melting temperature of nucleic acid duplex. Bioinformatics, 17, 1226–1227.
Li,H. et al. (2009) The Sequence Alignment/Map format and SAMtools. Bioinformatics, 25, 2078–2079.
Li,J.-F. et al. (2013) Multiplex and homologous recombination-mediated plant genome editing via guide RNA/Cas9. Nature biotechnology, 31, 688.
Lombardo,A. et al. (2011) Site-specific integration and tailoring of cassette design for sustainable gene transfer. Nature Methods, 8, 861–869.
Long,C. et al. (2014) Prevention of muscular dystrophy in mice by CRISPR/Cas9–mediated editing of germline DNA. Science, 345, 1184–1188.
Lorenz,R. et al. (2011) ViennaRNA Package 2.0. Algorithms for Molecular Biology : AMB, 6, 26.
Lou,W. et al. (2014) Sequence Based Prediction of DNA-Binding Proteins Based on Hybrid Feature Selection Using Random Forest and Gaussian Naïve Bayes. PLOS ONE, 9, e86703.
Luchetti,A. et al. (2014) Small Fragment Homologous Replacement (SFHR): sequence-specific modification of genomic DNA in eukaryotic cells by small DNA fragments. Methods Mol. Biol., 1114, 85–101.
Ma,Y. et al. (2014) Generating rats with conditional alleles using CRISPR/Cas9. Cell Res, 24, 122–125.
Mali,P. et al. (2013) RNA-Guided Human Genome Engineering via Cas9. Science, 339, 823–826.
Maresca,M. et al. (2013) Obligate Ligation-Gated Recombination (ObLiGaRe): Custom-designed nuclease-mediated targeted integration through nonhomologous end joining. Genome Res., 23, 539–546.
Matano,M. et al. (2015) Modeling colorectal cancer using CRISPR-Cas9-mediated engineering of human intestinal organoids. Nat. Med., 21, 256–262.
Nissim,L. et al. (2014) Multiplexed and Programmable Regulation of Gene Networks with an Integrated RNA and CRISPR/Cas Toolkit in Human Cells. Molecular Cell, 54, 698–710.
Perez,E.E. et al. (2008) Establishment of HIV-1 resistance in CD4+ T cells by genome editing using zinc-finger nucleases. Nature Biotechnology, 26, 808–816.
Platt,R.J. et al. (2014) CRISPR-Cas9 Knockin Mice for Genome Editing and Cancer Modeling. Cell, 159, 440–455.
Ran,F.A. et al. (2013) Genome engineering using the CRISPR-Cas9 system. Nature Protocols, 8, 2281–2308.
Saeys,Y. et al. (2007) A review of feature selection techniques in bioinformatics. Bioinformatics, 23, 2507–2517.
Sakuma,T. et al. (2014) Multiplex genome engineering in human cells using all-in-one CRISPR/Cas9 vector system. Scientific Reports, 4, 5400.
Shalem,O. et al. (2014) Genome-Scale CRISPR-Cas9 Knockout Screening in Human Cells. Science, 343, 84–87.
Shan,Q. et al. (2013) Targeted genome modification of crop plants using a CRISPR-Cas system. Nat. Biotechnol., 31, 686–688.
Skinner,M.E. et al. (2009) JBrowse: A next-generation genome browser. Genome Res., 19, 1630–1638.
Sugimoto,N. et al. (1996) Improved Thermodynamic Parameters and Helix Initiation Factor to Predict Stability of DNA Duplexes. Nucleic Acids Res, 24, 4501–4505.
Suzuki,K. et al. (2016) In vivo genome editing via CRISPR/Cas9 mediated homology-independent targeted integration. Nature, 540, 144–149.
Tang,X. et al. (2017) A CRISPR-Cpf1 system for efficient genome editing and transcriptional repression in plants. Nat Plants, 3, 17018.
Ungerer,J. and Pakrasi,H.B. (2016) Cpf1 Is A Versatile Tool for CRISPR Genome Editing Across Diverse Species of Cyanobacteria. Scientific Reports, 6, 39681.
Wang,M. et al. (2017) Multiplex Gene Editing in Rice Using the CRISPR-Cpf1 System. Molecular Plant, 10, 1011–1013.
Wang,T. et al. (2014) Genetic Screens in Human Cells Using the CRISPR-Cas9 System. Science, 343, 80–84.
Wang,X. et al. (2009) Selection of hyperfunctional siRNAs with improved potency and specificity. Nucleic Acids Res, 37, e152–e152.
Wendt,K.E. et al. (2016) CRISPR/Cas9 mediated targeted mutagenesis of the fast growing cyanobacterium Synechococcus elongatus UTEX 2973. Microbial Cell Factories, 15, 115.
Wong,N. et al. (2015) WU-CRISPR: characteristics of functional guide RNAs for the CRISPR/Cas9 system. Genome Biology, 16, 218.
Wu,Y. et al. (2013) Correction of a Genetic Disease in Mouse via Use of CRISPR-Cas9. Cell Stem Cell, 13, 659–662.
Xie,K. et al. (2015) Boosting CRISPR/Cas9 multiplex editing capability with the endogenous tRNA-processing system. PNAS, 112, 3570–3575.
Xu,R. et al. (2017) Generation of targeted mutant rice using a CRISPR-Cpf1 system. Plant Biotechnol J, 15, 713–717.
Xue,W. et al. (2014) CRISPR-mediated direct mutation of cancer genes in the mouse liver. Nature, 514, 380.
Yang,M. et al. (2017) Targeted Disruption of V600E-Mutant BRAF Gene by CRISPR-Cpf1. Molecular Therapy - Nucleic Acids, 8, 450–458.
Zetsche,B. et al. (2015) Cpf1 Is a Single RNA-Guided Endonuclease of a Class 2 CRISPR-Cas System. Cell, 163, 759–771.
Zetsche,B. et al. (2017) Multiplex gene editing by CRISPR–Cpf1 using a single crRNA array. Nature Biotechnology, 35, 31.
Zhang,X. et al. (2017a) Multiplex gene regulation by CRISPR-ddCpf1. Cell Discovery, 3, 17018.
Zhang,Y. et al. (2017b) CRISPR-Cpf1 correction of muscular dystrophy mutations in human cardiomyocytes and mice. Science Advances, 3, e1602814.
Zhou,Y. et al. (2014) High-throughput screening of a CRISPR/Cas9 library for functional genomics in human cells. Nature, 509, 487–491.
Zhu,H. et al. (2018) CRISPR-RT: a web application for designing CRISPR-C2c2 crRNA with improved target specificity. Bioinformatics, 34, 117–119.
Zhu,H. et al. (2016) CT-Finder: A Web Service for CRISPR Optimal Target Prediction and Visualization. Scientific Reports, 6, 25516.

List of Tables

Table 3.1. Significant position-specific dinucleotide probability within the 20-nt gRNA sequences

Position-specific Dinucleotides  Efficient gRNAs  Inefficient gRNAs  Enrichment Ratio (a)  P-value (b)
Position_6_UU  0.02  0.21  0.11  9.10E-44
Position_1_UU  0.02  0.20  0.10  9.67E-43
Position_5_UU  0.03  0.19  0.16  1.49E-33
Position_7_UU  0.04  0.20  0.20  1.69E-32
Position_4_UU  0.05  0.20  0.25  1.74E-28
Position_8_UU  0.03  0.16  0.21  4.45E-24
Position_13_UU  0.05  0.18  0.27  2.07E-22
Position_3_UU  0.05  0.18  0.30  4.94E-21
Position_12_UU  0.05  0.18  0.29  5.33E-21
Position_2_UU  0.04  0.16  0.27  1.11E-20
Position_14_UU  0.04  0.15  0.28  3.41E-19
Position_10_UU  0.06  0.18  0.33  3.45E-19
Position_5_CA  0.13  0.03  4.11  4.90E-18
Position_9_UU  0.06  0.17  0.37  7.90E-16
Position_6_AG  0.09  0.02  4.73  4.68E-14
Position_11_UU  0.05  0.14  0.35  1.88E-13
Position_3_UG  0.12  0.04  3.12  8.70E-13
Position_1_CC  0.14  0.06  2.50  1.19E-11
Position_5_GA  0.07  0.02  4.00  2.79E-10
Position_15_UU  0.05  0.13  0.42  6.06E-10
Position_6_AC  0.07  0.02  3.90  7.04E-10
Position_9_AG  0.08  0.03  3.03  1.12E-08
Position_7_CC  0.11  0.05  2.31  2.44E-08
Position_16_UU  0.06  0.13  0.48  2.66E-08
Position_19_AC  0.08  0.03  2.88  2.81E-08
Position_2_CU  0.12  0.06  2.12  4.84E-08
Position_6_CU  0.08  0.15  0.53  1.42E-07
Position_1_GG  0.08  0.03  2.61  1.53E-07
Position_9_CC  0.10  0.05  2.23  2.23E-07
Position_17_UU  0.05  0.11  0.48  3.40E-07
Position_1_GA  0.06  0.02  3.09  4.10E-07
Position_18_UU  0.04  0.10  0.46  6.77E-07
Position_6_AA  0.05  0.01  3.56  1.04E-06
Position_16_CA  0.10  0.05  2.07  2.83E-06
Position_19_UG  0.07  0.12  0.54  6.87E-06
Position_13_CA  0.10  0.05  1.98  9.53E-06
Position_19_GG  0.04  0.08  0.47  1.10E-05
Position_12_AC  0.06  0.03  2.48  1.17E-05
Position_4_GG  0.09  0.04  2.00  2.34E-05
Position_16_GG  0.05  0.09  0.53  3.67E-05
Position_17_AG  0.07  0.03  2.20  4.62E-05
Position_8_AC  0.07  0.03  2.21  5.61E-05
Position_11_AA  0.04  0.01  2.88  6.32E-05
Position_5_CU  0.07  0.11  0.58  7.83E-05
Position_1_GC  0.08  0.04  1.96  1.20E-04
Position_14_AG  0.07  0.04  1.98  1.60E-04
Position_17_GG  0.05  0.09  0.56  1.69E-04
Position_19_UU  0.05  0.08  0.54  1.81E-04
Position_4_CC  0.11  0.06  1.69  2.09E-04
Position_7_AG  0.06  0.03  2.06  2.45E-04
Position_8_CA  0.10  0.06  1.69  2.54E-04
Position_7_UC  0.08  0.12  0.63  2.64E-04
Position_1_AC  0.07  0.04  1.95  2.66E-04
Position_1_GU  0.06  0.03  2.13  3.38E-04
Position_12_CC  0.10  0.06  1.68  4.24E-04
Position_7_GA  0.06  0.03  1.97  4.30E-04
Position_11_AC  0.06  0.03  2.10  4.72E-04
Position_9_AC  0.06  0.03  2.00  6.91E-04
Position_1_UG  0.07  0.11  0.63  7.88E-04
Position_4_GA  0.05  0.03  2.07  7.96E-04
Position_6_CC  0.10  0.06  1.63  8.07E-04
Position_2_GG  0.05  0.08  0.59  1.01E-03
Position_7_CA  0.08  0.05  1.70  1.14E-03
Position_18_AC  0.06  0.03  1.91  1.18E-03
Position_5_UC  0.07  0.11  0.65  1.19E-03
Position_11_CA  0.09  0.06  1.60  1.59E-03
Position_5_UA  0.04  0.02  2.33  1.68E-03
Position_8_CC  0.11  0.07  1.52  1.71E-03
Position_3_UA  0.04  0.01  2.35  2.03E-03
Position_10_CA  0.08  0.05  1.65  2.06E-03
Position_13_CC  0.09  0.06  1.58  2.31E-03
Position_2_GA  0.06  0.03  1.83  2.36E-03
Position_2_UC  0.06  0.09  0.64  2.71E-03
Position_4_CU  0.07  0.11  0.67  2.79E-03
Position_10_AA  0.03  0.01  2.29  2.91E-03
Position_4_AA  0.02  0.04  0.48  3.12E-03
Position_2_AG  0.08  0.05  1.62  3.41E-03
Position_2_CC  0.10  0.07  1.50  3.48E-03
Position_10_UC  0.08  0.12  0.69  3.69E-03
Position_18_UA  0.04  0.02  2.10  3.79E-03
Position_14_GG  0.05  0.09  0.64  3.99E-03
Position_7_AC  0.06  0.03  1.74  4.07E-03
Position_17_AC  0.06  0.03  1.78  4.18E-03
Position_6_AU  0.06  0.04  1.70  4.23E-03
Position_5_AG  0.07  0.04  1.68  4.55E-03
Position_10_AC  0.05  0.03  1.77  5.07E-03
Position_17_CA  0.10  0.06  1.49  5.39E-03
Position_1_UC  0.06  0.09  0.66  5.62E-03
Position_5_CC  0.10  0.07  1.44  7.40E-03
Position_17_AU  0.05  0.03  1.76  7.47E-03
Position_4_GU  0.06  0.04  1.62  8.10E-03
Position_19_CG  0.02  0.04  0.53  8.41E-03
Position_18_GA  0.06  0.04  1.64  8.59E-03
Position_3_GG  0.08  0.05  1.53  8.60E-03
Position_9_AU  0.04  0.07  0.63  9.15E-03
Position_9_GG  0.05  0.08  0.65  9.31E-03
Position_14_CC  0.08  0.05  1.53  9.78E-03
Position_10_CU  0.11  0.08  1.40  1.00E-02
Position_13_AC  0.05  0.03  1.68  1.02E-02
Position_10_GU  0.06  0.04  1.60  1.04E-02
Position_18_GG  0.06  0.08  0.67  1.08E-02
Position_18_AA  0.05  0.03  1.74  1.10E-02
Position_3_AG  0.06  0.04  1.62  1.10E-02
Position_16_GA  0.05  0.03  1.72  1.19E-02
Position_1_CG  0.03  0.04  0.57  1.23E-02
Position_2_GU  0.07  0.04  1.55  1.30E-02
Position_13_AG  0.07  0.04  1.55  1.30E-02
Position_2_GC  0.06  0.09  0.69  1.32E-02
Position_14_AC  0.05  0.03  1.65  1.32E-02
Position_13_AA  0.05  0.03  1.73  1.33E-02
Position_18_GU  0.04  0.06  0.64  1.40E-02
Position_13_UG  0.08  0.11  0.72  1.42E-02
Position_17_GC  0.07  0.10  0.71  1.48E-02
Position_13_GA  0.05  0.03  1.66  1.49E-02
Position_8_AG  0.07  0.04  1.54  1.55E-02
Position_11_CG  0.02  0.04  0.55  1.68E-02
Position_8_GG  0.05  0.07  0.67  1.73E-02
Position_1_UA  0.02  0.04  0.56  1.85E-02
Position_10_CC  0.08  0.05  1.45  1.87E-02
Position_15_AC  0.05  0.03  1.62  2.32E-02
Position_2_CA  0.09  0.07  1.39  2.32E-02
Position_15_GG  0.06  0.09  0.71  2.40E-02
Position_4_GC  0.07  0.05  1.45  2.42E-02
Position_8_CG  0.02  0.04  0.60  2.57E-02
Position_3_AC  0.06  0.04  1.51  2.65E-02
Position_7_AA  0.04  0.03  1.66  2.76E-02
Position_4_UA  0.04  0.02  1.71  3.24E-02
Position_12_CU  0.08  0.11  0.76  3.26E-02
Position_6_GA  0.06  0.04  1.49  3.30E-02
Position_14_AU  0.07  0.04  1.45  3.43E-02
Position_6_CA  0.08  0.06  1.39  3.64E-02
Position_8_AA  0.04  0.02  1.65  3.77E-02
Position_7_UG  0.08  0.10  0.76  3.90E-02
Position_17_UA  0.04  0.02  1.63  4.04E-02
Position_5_GG  0.08  0.06  1.37  4.10E-02
Position_16_UA  0.03  0.01  1.82  4.11E-02
Position_14_CA  0.09  0.07  1.34  4.24E-02
Position_19_GA  0.06  0.04  1.45  4.87E-02
Position_8_GU  0.06  0.04  1.42  4.98E-02
Position_12_GU  0.05  0.07  0.73  4.99E-02

(a) The enrichment ratio was calculated by dividing the average dinucleotide probability of efficient gRNAs by that of inefficient gRNAs.
(b) The P-value was determined by Welch’s t-test (P < 0.05).

Table 3.2. Significant position-nonspecific mononucleotide and dinucleotide counts within the 20-nt gRNA sequences

Mono- and Dinucleotides  Efficient gRNAs  Inefficient gRNAs  Enrichment Ratio (a)  P-value (b)
Count_U  4.83  7.22  0.67  5.83E-98
Count_A  4.26  3.00  1.42  1.35E-65
Count_C  6.08  5.13  1.18  3.76E-27
Count_UU  0.76  2.12  0.36  8.00E-127
Count_AC  1.10  0.61  1.80  2.48E-37
Count_CA  1.66  1.13  1.47  2.15E-33
Count_AG  1.26  0.83  1.51  3.65E-27
Count_GA  1.09  0.70  1.56  7.13E-25
Count_CC  1.44  1.01  1.43  1.80E-23
Count_AA  0.61  0.41  1.47  1.80E-11
Count_UC  1.50  1.70  0.88  2.59E-05
Count_CG  0.53  0.65  0.81  3.41E-04
Count_UA  0.60  0.50  1.20  1.32E-03
Count_CU  1.77  1.88  0.94  2.29E-02
Count_AU  0.98  0.90  1.10  2.50E-02
Count_UG  1.60  1.70  0.94  3.53E-02
Count_GG  0.98  1.07  0.92  3.59E-02

(a) The enrichment ratio was calculated by dividing the average nucleotide counts of efficient gRNAs by that of inefficient gRNAs.
(b) The P-value was determined by Welch’s t-test (P < 0.05).

Table 3.3. Significant position-nonspecific trinucleotides and tetranucleotides and their average counts within the 20-nt gRNA sequences between efficient and inefficient gRNAs

Tri- and Tetranucleotides  Efficient gRNAs  Inefficient gRNAs  Enrichment Ratio (a)  P-value (b)
Count_UUU                  0.11             0.80               0.14                  2.13E-123
Count_UUG                  0.25             0.58               0.44                  2.72E-36
Count_UUC                  0.36             0.76               0.48                  2.86E-36
Count_CUU                  0.36             0.75               0.48                  3.53E-36
Count_ACC                  0.34             0.13               2.57                  9.74E-24
Count_CCA                  0.57             0.31               1.85                  4.43E-23
Count_AGA                  0.26             0.11               2.51                  4.56E-19
Count_UCU                  0.40             0.66               0.61                  3.05E-17
Count_AUU                  0.17             0.33               0.52                  6.78E-15
Count_GAC                  0.27             0.13               2.11                  2.27E-14
Count_CAG                  0.56             0.36               1.55                  4.35E-13
Count_CAC                  0.37             0.22               1.71                  4.03E-12
Count_UUA                  0.09             0.19               0.45                  5.05E-11
Count_CCC                  0.33             0.19               1.71                  1.10E-10
Count_ACA                  0.27             0.15               1.82                  1.49E-10
Count_GUA                  0.18             0.08               2.16                  2.48E-10
Count_GUU                  0.20             0.34               0.60                  6.07E-10
Count_GAA                  0.23             0.12               1.86                  6.56E-10
Count_AUG                  0.32             0.20               1.64                  1.03E-09
Count_AAG                  0.22             0.12               1.87                  2.44E-09
Count_UAC                  0.19             0.11               1.73                  3.56E-07
Count_AGC                  0.38             0.26               1.44                  4.90E-07
Count_AUA                  0.15             0.08               1.85                  1.59E-06
Count_GAG                  0.28             0.19               1.51                  1.65E-06
Count_GCA                  0.37             0.26               1.41                  3.36E-06
Count_AGU                  0.27             0.18               1.49                  4.10E-06
Count_GCG                  0.11             0.19               0.58                  4.89E-06
Count_GGA                  0.32             0.22               1.42                  6.10E-06
Count_CCU                  0.54             0.42               1.30                  8.26E-06
Count_CAA                  0.24             0.16               1.48                  2.65E-05
Count_AAC                  0.17             0.11               1.58                  5.15E-05
Count_CUA                  0.16             0.10               1.58                  6.93E-05
Count_CUC                  0.46             0.36               1.28                  9.74E-05
Count_GGG                  0.19             0.26               0.71                  1.56E-04
Count_ACU                  0.30             0.22               1.34                  3.07E-04
Count_GGC                  0.34             0.42               0.79                  5.95E-04
Count_UGA                  0.33             0.26               1.29                  6.65E-04
Count_CGG                  0.16             0.22               0.73                  1.26E-03
Count_GUC                  0.30             0.24               1.28                  1.33E-03
Count_UCG                  0.11             0.16               0.70                  1.43E-03
Count_CUG                  0.64             0.55               1.17                  2.09E-03
Count_UCC                  0.50             0.42               1.20                  2.10E-03
Count_UGG                  0.42             0.50               0.83                  2.41E-03
Count_CGC                  0.12             0.17               0.70                  3.00E-03
Count_CAU                  0.37             0.30               1.23                  3.56E-03
Count_AGG                  0.28             0.23               1.20                  2.82E-02
Count_UAG                  0.12             0.09               1.31                  3.64E-02
Count_ACG                  0.10             0.08               1.35                  3.76E-02
Count_AUC                  0.28             0.24               1.17                  4.95E-02
Count_UUUU                 0.01             0.46               0.02                  4.29E-124
Count_UUUG                 0.04             0.29               0.14                  1.34E-52
Count_UUUC                 0.05             0.33               0.14                  7.11E-52
Count_CUUU                 0.06             0.32               0.18                  1.64E-46
Count_AUUU                 0.02             0.19               0.12                  1.51E-36
Count_UUCU                 0.10             0.34               0.30                  9.92E-34
Count_UCUU                 0.10             0.30               0.32                  2.34E-27
Count_GUUU                 0.03             0.15               0.21                  2.78E-20
Count_ACCA                 0.11             0.02               5.04                  3.71E-17
Count_UUUA                 0.02             0.09               0.18                  4.76E-15
Count_UUGU                 0.05             0.14               0.36                  2.02E-12
Count_CCAG                 0.21             0.10               2.05                  5.41E-11
Count_CACC                 0.13             0.05               2.53                  1.97E-10
Count_UUGG                 0.07             0.16               0.45                  2.22E-10
Count_AGAC                 0.07             0.02               3.95                  6.61E-10
Count_UUAU                 0.03             0.09               0.30                  1.08E-09
Count_ACCC                 0.08             0.03               3.20                  9.79E-09
Count_UAUU                 0.03             0.08               0.34                  1.37E-08
Count_UGUU                 0.07             0.14               0.48                  4.47E-08
Count_UUGC                 0.07             0.14               0.49                  4.53E-08
Count_AAGA                 0.06             0.02               3.45                  2.14E-07
Count_CAGA                 0.10             0.04               2.33                  2.28E-07
Count_AAUU                 0.02             0.06               0.32                  2.69E-07
Count_CCCA                 0.13             0.06               2.04                  3.88E-07
Count_GACC                 0.08             0.03               2.57                  3.99E-07
Count_GGGC                 0.07             0.14               0.51                  4.33E-07
Count_CUGA                 0.13             0.07               1.92                  7.98E-07
Count_UUCG                 0.02             0.06               0.33                  1.10E-06
Count_ACAC                 0.07             0.03               2.73                  1.13E-06
Count_GAGA                 0.07             0.02               2.88                  1.26E-06
Count_AGAA                 0.06             0.02               2.92                  2.06E-06
Count_GGCG                 0.03             0.07               0.38                  2.28E-06
Count_GUAU                 0.04             0.01               3.92                  2.94E-06
Count_CAGU                 0.12             0.06               1.92                  4.19E-06
Count_GAAG                 0.08             0.04               2.27                  4.21E-06
Count_CCAC                 0.14             0.08               1.83                  4.24E-06
Count_UCCA                 0.17             0.10               1.64                  7.46E-06
Count_AUGU                 0.08             0.04               2.28                  8.44E-06
Count_AGUA                 0.04             0.01               3.40                  9.31E-06
Count_GCUU                 0.07             0.13               0.56                  1.02E-05
Count_AUAC                 0.04             0.01               3.54                  1.32E-05
Count_CCUC                 0.16             0.10               1.66                  1.48E-05
Count_CAGC                 0.18             0.12               1.57                  2.01E-05
Count_CCUG                 0.19             0.12               1.57                  2.41E-05
Count_ACUG                 0.11             0.06               1.85                  2.71E-05
Count_UACC                 0.05             0.02               2.84                  3.08E-05
Count_CAAG                 0.07             0.03               2.22                  4.35E-05
Count_ACUC                 0.08             0.04               2.07                  4.50E-05
Count_CCCU                 0.13             0.07               1.73                  4.85E-05
Count_CCAA                 0.08             0.04               2.10                  4.88E-05
Count_GUAC                 0.05             0.02               2.59                  6.03E-05
Count_AGUC                 0.08             0.04               1.96                  7.46E-05
Count_AUAG                 0.04             0.01               3.14                  8.70E-05
Count_ACUU                 0.05             0.10               0.55                  8.83E-05
Count_AGAG                 0.06             0.02               2.36                  1.05E-04
Count_AAUG                 0.06             0.03               2.18                  1.12E-04
Count_AGCA                 0.10             0.06               1.80                  1.13E-04
Count_GGGG                 0.04             0.08               0.51                  1.65E-04
Count_AGGA                 0.08             0.04               1.90                  1.72E-04
Count_UACA                 0.06             0.03               2.15                  1.82E-04
Count_CAUA                 0.06             0.03               2.19                  1.85E-04
Count_CCAU                 0.12             0.07               1.65                  2.41E-04
Count_AAGG                 0.04             0.02               2.58                  2.71E-04
Count_GGUA                 0.04             0.02               2.55                  2.77E-04
Count_CUUC                 0.16             0.22               0.72                  3.52E-04
Count_CACA                 0.09             0.05               1.78                  3.64E-04
Count_UUGA                 0.05             0.09               0.58                  4.10E-04
Count_AACC                 0.05             0.02               2.29                  4.54E-04
Count_CUCC                 0.16             0.11               1.48                  4.74E-04
Count_UUAG                 0.01             0.03               0.36                  5.08E-04
Count_GUCA                 0.09             0.05               1.74                  5.34E-04
Count_GCAC                 0.09             0.05               1.76                  5.96E-04
Count_GGAC                 0.09             0.05               1.74                  6.34E-04
Count_GAUG                 0.11             0.07               1.63                  6.44E-04
Count_CAAC                 0.07             0.04               1.90                  7.45E-04
Count_CGCG                 0.01             0.04               0.38                  8.02E-04
Count_UCGA                 0.01             0.03               0.38                  9.44E-04
Count_CUAC                 0.07             0.04               1.85                  1.08E-03
Count_ACAA                 0.05             0.02               2.17                  1.08E-03
Count_UGGG                 0.09             0.13               0.66                  1.09E-03
Count_AUGA                 0.07             0.04               1.83                  1.31E-03
Count_GUCC                 0.10             0.06               1.61                  1.34E-03
Count_GACU                 0.07             0.04               1.80                  1.48E-03
Count_GCGG                 0.04             0.07               0.55                  1.58E-03
Count_UUCA                 0.09             0.13               0.68                  1.65E-03
Count_CUCA                 0.11             0.07               1.58                  1.72E-03
Count_GAAC                 0.05             0.02               2.00                  1.84E-03
Count_AUCC                 0.08             0.05               1.69                  1.88E-03
Count_CUUG                 0.09             0.13               0.68                  1.89E-03
Count_ACAG                 0.08             0.05               1.69                  1.97E-03
Count_CCUA                 0.04             0.01               2.35                  2.03E-03
Count_UCCC                 0.12             0.08               1.49                  2.21E-03
Count_CCUU                 0.11             0.15               0.71                  2.37E-03
Count_GAAU                 0.05             0.03               1.91                  2.38E-03
Count_CCGA                 0.04             0.02               2.14                  2.59E-03
Count_CAUG                 0.09             0.06               1.57                  2.91E-03
Count_AGCC                 0.10             0.06               1.56                  2.95E-03
Count_UGAC                 0.07             0.04               1.67                  2.97E-03
Count_GCAG                 0.14             0.10               1.42                  3.15E-03
Count_AAUA                 0.03             0.01               2.43                  3.52E-03
Count_ACUA                 0.03             0.01               2.40                  3.53E-03
Count_GCGC                 0.03             0.05               0.51                  3.64E-03
Count_GGCU                 0.09             0.13               0.70                  4.05E-03
Count_GUAG                 0.04             0.02               2.09                  4.11E-03
Count_GGAA                 0.06             0.04               1.71                  4.76E-03
Count_AACA                 0.05             0.02               1.93                  4.80E-03
Count_CGGG                 0.04             0.06               0.58                  5.04E-03
Count_ACCU                 0.09             0.06               1.53                  5.06E-03
Count_GACG                 0.04             0.02               2.10                  5.31E-03
Count_CAUU                 0.07             0.10               0.67                  5.34E-03
Count_AGAU                 0.06             0.04               1.67                  5.94E-03
Count_GGGA                 0.06             0.04               1.69                  6.45E-03
Count_GGUU                 0.04             0.07               0.61                  7.22E-03
Count_AUGC                 0.08             0.05               1.56                  7.23E-03
Count_GACA                 0.06             0.03               1.71                  7.60E-03
Count_UUCC                 0.13             0.17               0.76                  8.49E-03
Count_ACGA                 0.02             0.01               2.67                  8.53E-03
Count_UGAA                 0.07             0.05               1.57                  9.51E-03
Count_CGCU                 0.03             0.06               0.59                  9.82E-03
Count_GCUA                 0.04             0.02               1.88                  1.02E-02
Count_AAGC                 0.05             0.03               1.74                  1.04E-02
Count_CUAA                 0.03             0.01               2.21                  1.05E-02
Count_UAGC                 0.04             0.02               1.88                  1.15E-02
Count_CCCC                 0.06             0.04               1.59                  1.17E-02
Count_CACU                 0.10             0.07               1.43                  1.23E-02
Count_GUAA                 0.03             0.02               2.00                  1.38E-02
Count_GUGA                 0.06             0.04               1.58                  1.40E-02
Count_CGAC                 0.03             0.01               2.14                  1.73E-02
Count_GGAG                 0.09             0.07               1.42                  1.77E-02
Count_GCCG                 0.04             0.06               0.63                  1.80E-02
Count_CGGC                 0.05             0.07               0.66                  2.02E-02
Count_CUGC                 0.18             0.14               1.27                  2.11E-02
Count_AUGG                 0.08             0.06               1.44                  2.21E-02
Count_UAGA                 0.03             0.01               1.94                  2.21E-02
Count_UGUA                 0.06             0.04               1.56                  2.29E-02
Count_UGAG                 0.09             0.06               1.40                  2.43E-02
Count_UGCG                 0.03             0.05               0.62                  2.46E-02
Count_CGCC                 0.03             0.05               0.63                  2.79E-02
Count_CAUC                 0.12             0.09               1.31                  2.90E-02
Count_UCAC                 0.08             0.06               1.42                  2.97E-02
Count_CGUU                 0.02             0.04               0.58                  2.99E-02
Count_GAGU                 0.06             0.04               1.47                  3.60E-02
Count_GAUA                 0.03             0.02               1.75                  4.06E-02
Count_GUGC                 0.09             0.06               1.37                  4.08E-02
Count_CCCG                 0.05             0.04               1.53                  4.19E-02
Count_UCUG                 0.13             0.17               0.81                  4.26E-02
Count_UGGC                 0.11             0.14               0.80                  4.87E-02

(a) The enrichment ratio was calculated by dividing the average nucleotide counts of efficient gRNAs by that of inefficient gRNAs.
(b) The P-value was determined by Welch’s t-test (P < 0.05).

Table 3.4. The 150 most important features used to build the final SVM model

Feature Name            Feature Type     Mean Gini
Exist_UUUU              Binary (0 or 1)  39.55
Count_UUUU              Numerical        36.21
Minimum_Free_Energy     Numerical        35.97
Count_UU                Numerical        34.37
Count_UUU               Numerical        30.14
Count_U                 Numerical        27.32
Exist_Repetitive_Bases  Binary (0 or 1)  22.46
Melting_Temp_Entire     Numerical        19.49
GC_Content_Entire       Numerical        17.43
Count_A                 Numerical        16.22
Melting_Temp_Nonseed    Numerical        15.27
Position_1_U            Binary (0 or 1)  13.00
Melting_Temp_Seed       Numerical        11.96
GC_Content_Seed         Numerical        11.32
Position_1_UU           Binary (0 or 1)  10.59
Count_UUUG              Numerical        9.92
Seed_UUU                Binary (0 or 1)  9.91
Count_G                 Numerical        9.37
GC_Content_Nonseed      Numerical        8.99
Count_CUUU              Numerical        8.09
Position_6_UU           Binary (0 or 1)  7.91
Count_AC                Numerical        7.76
Position_6_A            Binary (0 or 1)  7.50
Count_UUUC              Numerical        7.32
Count_C                 Numerical        7.23
Count_CG                Numerical        5.88
Count_GG                Numerical        5.79
Position_1_G            Binary (0 or 1)  5.71
Count_AUUU              Numerical        5.71
Count_GC                Numerical        5.49
Count_UUG               Numerical        5.21
Position_6_U            Binary (0 or 1)  5.17
Count_CA                Numerical        5.11
Position_5_UU           Binary (0 or 1)  5.10
Position_7_U            Binary (0 or 1)  4.77
Position_7_UU           Binary (0 or 1)  4.65
Position_20_G           Binary (0 or 1)  4.57
Count_CC                Numerical        4.44
Count_GA                Numerical        4.28
GC_Content_Normal       Binary (0 or 1)  4.20
Count_UUC               Numerical        4.09
Count_AG                Numerical        3.99
Count_ACC               Numerical        3.75
Count_CUU               Numerical        3.67
Count_GGC               Numerical        3.53
Count_UG                Numerical        3.41
Count_UUCU              Numerical        3.37
Count_CU                Numerical        3.35
Count_GU                Numerical        3.34
Count_GCG               Numerical        3.16
Count_UC                Numerical        3.04
Count_AU                Numerical        3.02
Count_CCA               Numerical        3.00
Count_AGA               Numerical        2.98
Count_UGG               Numerical        2.93
Count_GGGC              Numerical        2.71
Count_GGG               Numerical        2.67
Position_17_G           Binary (0 or 1)  2.63
Position_8_UU           Binary (0 or 1)  2.62
Position_5_CA           Binary (0 or 1)  2.53
Count_UA                Numerical        2.51
Position_4_UU           Binary (0 or 1)  2.48
Count_CGG               Numerical        2.47
Count_UCU               Numerical        2.36
Position_12_UU          Binary (0 or 1)  2.33
Position_10_U           Binary (0 or 1)  2.32
Count_UCUU              Numerical        2.31
Count_CCC               Numerical        2.30
Count_CAG               Numerical        2.22
Count_AA                Numerical        2.19
Position_9_G            Binary (0 or 1)  2.18
Position_4_U            Binary (0 or 1)  2.17
Count_GCU               Numerical        2.13
Count_GUG               Numerical        2.12
Count_CUG               Numerical        2.11
Count_CCU               Numerical        2.11
Position_4_G            Binary (0 or 1)  2.07
Position_13_UU          Binary (0 or 1)  2.06
Position_3_UG           Binary (0 or 1)  2.06
Position_2_UU           Binary (0 or 1)  2.05
Count_UGC               Numerical        2.04
Count_GCC               Numerical        2.00
Position_2_U            Binary (0 or 1)  2.00
Count_ACCA              Numerical        1.98
Count_UCC               Numerical        1.95
Position_2_GG           Binary (0 or 1)  1.95
Position_8_U            Binary (0 or 1)  1.93
Position_5_U            Binary (0 or 1)  1.88
Count_AGC               Numerical        1.87
Position_3_U            Binary (0 or 1)  1.87
Count_GGCG              Numerical        1.87
Count_UGU               Numerical        1.78
Position_17_A           Binary (0 or 1)  1.73
Position_14_UU          Binary (0 or 1)  1.72
Count_CUC               Numerical        1.71
Count_GCA               Numerical        1.70
Count_CGC               Numerical        1.69
Position_6_CU           Binary (0 or 1)  1.67
Position_19_UG          Binary (0 or 1)  1.67
Count_GUC               Numerical        1.67
Position_17_GG          Binary (0 or 1)  1.66
Position_2_CU           Binary (0 or 1)  1.65
Count_AUG               Numerical        1.65
Position_11_G           Binary (0 or 1)  1.64
Position_10_UU          Binary (0 or 1)  1.63
Count_UCG               Numerical        1.62
Count_UCA               Numerical        1.62
Position_2_C            Binary (0 or 1)  1.61
Count_GUUU              Numerical        1.60
Position_9_C            Binary (0 or 1)  1.60
Position_13_U           Binary (0 or 1)  1.60
Position_14_U           Binary (0 or 1)  1.59
Count_AUC               Numerical        1.59
Count_GGA               Numerical        1.58
Position_2_G            Binary (0 or 1)  1.57
Count_ACA               Numerical        1.57
Position_3_UU           Binary (0 or 1)  1.56
Position_5_CU           Binary (0 or 1)  1.55
Position_16_GG          Binary (0 or 1)  1.54
Position_17_GC          Binary (0 or 1)  1.54
Count_CAC               Numerical        1.54
Count_UUGG              Numerical        1.53
Count_AGU               Numerical        1.51
Count_GGU               Numerical        1.51
Count_AGG               Numerical        1.50
Position_19_A           Binary (0 or 1)  1.49
Count_CCAG              Numerical        1.49
Count_CAU               Numerical        1.49
Position_3_C            Binary (0 or 1)  1.48
Position_19_GG          Binary (0 or 1)  1.48
Count_GGGG              Numerical        1.47
Position_1_CC           Binary (0 or 1)  1.46
Count_GAU               Numerical        1.46
Count_GAUC              Numerical        1.46
Position_1_C            Binary (0 or 1)  1.46
Count_GUA               Numerical        1.45
Count_GAC               Numerical        1.45
Count_GAG               Numerical        1.44
Position_20_C           Binary (0 or 1)  1.43
Count_CCCU              Numerical        1.42
Position_10_UG          Binary (0 or 1)  1.42
Count_UUUA              Numerical        1.42
Count_ACU               Numerical        1.41
Position_5_C            Binary (0 or 1)  1.41
Position_12_G           Binary (0 or 1)  1.40
Position_9_UU           Binary (0 or 1)  1.39
Count_GCUG              Numerical        1.38
Position_4_A            Binary (0 or 1)  1.38
Count_CUGG              Numerical        1.37
Count_GUU               Numerical        1.36

List of Figures

[Figure 3.1 diagram. Labels: CRISPR-Cpf1 locus (Cpf1, Cas4, Cas1, Cas2, CRISPR array); pre-crRNA; mature crRNA bound to Cpf1; target DNA with PAM and seed region; 5’ and 3’ ends marked on each strand.]

Figure 3.1. The schematic architecture of the CRISPR-Cpf1 system. The CRISPR-Cpf1 locus includes Cpf1, Cas4, Cas1, Cas2 and a CRISPR array. The CRISPR array is first transcribed into pre-crRNA. Then, Cpf1 can process the pre-crRNA into mature crRNAs. The mature crRNA will bind to Cpf1 and guide Cpf1 to target a specific double-strand DNA at a proper target site.

[Figure 3.2 diagram: feature groups]

Features (384 + 340 + 5 + 1 + 4 + 1 + 3 = 738)
  Position-specific nucleotide composition (80 + 304 = 384)
    Position-specific mononucleotides (Binary: 0 or 1, 20 × 4 = 80)
    Position-specific dinucleotides (Binary: 0 or 1, 19 × 4^2 = 304)
  Position-nonspecific nucleotide composition (4 + 16 + 64 + 256 = 340)
    Position-nonspecific mononucleotide counts (Numerical, 4)
    Position-nonspecific dinucleotide counts (Numerical, 4^2 = 16)
    Position-nonspecific trinucleotide counts (Numerical, 4^3 = 64)
    Position-nonspecific tetranucleotide counts (Numerical, 4^4 = 256)
  Repetitive bases (1 + 1 + 1 + 1 + 1 = 5)
    AAAA in the gRNA sequence (Binary: 0 or 1, 1)
    CCCC in the gRNA sequence (Binary: 0 or 1, 1)
    GGGG in the gRNA sequence (Binary: 0 or 1, 1)
    UUUU in the gRNA sequence (Binary: 0 or 1, 1)
    Any repetitive bases in the gRNA sequence (Binary: 0 or 1, 1)
  UUU in the gRNA seed region (Binary: 0 or 1, 1)
  GC content (1 + 1 + 1 + 1 = 4)
    GC content of the entire gRNA sequence (Numerical, 1)
    GC content of the seed region (Numerical, 1)
    GC content of the non-seed region (Numerical, 1)
    Normal GC content of the entire gRNA sequence (Binary: 0 or 1, 1)
  Minimum free energy (Numerical, 1)
  Melting temperature (1 + 1 + 1 = 3)
    Melting temperature of the entire gRNA sequence (Numerical, 1)
    Melting temperature of the seed region (Numerical, 1)
    Melting temperature of the non-seed region (Numerical, 1)

Figure 3.2. 738 features were calculated for all 2274 gRNAs. The features consist of position-specific nucleotide compositions (20 × 4 + 19 × 4^2 = 384), position-nonspecific nucleotide compositions (4 + 4^2 + 4^3 + 4^4 = 340), repetitive bases (1 + 1 + 1 + 1 + 1 = 5), UUU in the gRNA seed region (1), GC contents (1 + 1 + 1 + 1 = 4), minimum free energy (1), and melting temperature (1 + 1 + 1 = 3).
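The sequence-composition part of this feature set can be sketched in a few lines. The example below is an illustration only, not the dissertation's code: it builds the 384 position-specific, 340 position-nonspecific, and 5 repetitive-base features (729 in total), using feature names in the style of Table 3.4. It assumes that positions are numbered from 1 at the 5’ end, that k-mer counts include overlapping occurrences, and that the unlisted homopolymer features are named like `Exist_UUUU`. The seed-region, GC-content, minimum-free-energy, and melting-temperature features are omitted because their exact definitions are not restated here.

```python
from itertools import product

BASES = "ACGU"


def kmers(k):
    """All 4**k RNA k-mers in a fixed order."""
    return ["".join(p) for p in product(BASES, repeat=k)]


def composition_features(grna):
    """Return the sequence-composition features of a 20-nt gRNA as a dict."""
    grna = grna.upper().replace("T", "U")
    if len(grna) != 20 or set(grna) - set(BASES):
        raise ValueError("expected a 20-nt sequence over A, C, G, U")
    feats = {}
    # Position-specific mono- and dinucleotides (binary): 20*4 + 19*16 = 384.
    for k in (1, 2):
        for pos in range(len(grna) - k + 1):
            for km in kmers(k):
                feats[f"Position_{pos + 1}_{km}"] = int(grna[pos:pos + k] == km)
    # Position-nonspecific counts (overlapping): 4 + 16 + 64 + 256 = 340.
    for k in (1, 2, 3, 4):
        for km in kmers(k):
            feats[f"Count_{km}"] = sum(
                grna[i:i + k] == km for i in range(len(grna) - k + 1)
            )
    # Repetitive bases: one flag per homopolymer plus one "any" flag = 5.
    for base in BASES:
        feats[f"Exist_{base * 4}"] = int(base * 4 in grna)
    feats["Exist_Repetitive_Bases"] = int(any(b * 4 in grna for b in BASES))
    return feats


features = composition_features("UUUUACGUACGUACGUACGA")
```

The dictionary size matches the arithmetic in the caption: 384 + 340 + 5 = 729 of the 738 features.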

[Figure 3.3 flowchart. Labels: DNA sequence; candidate search (Perl script); target candidates; Bowtie2; reference genome; target sites within the reference genome; Samtools; forward strand + reverse strand; filter (Perl scripts); final target sites for each target candidate; format conversion (Perl script); Ajax format files; PHP & JavaScript; display results in DataTables & JBrowse.]

Figure 3.3. The flowchart of CRISPR-DT backend pipeline. CRISPR-DT first processes a DNA sequence entered by a user and then displays results on the web interfaces.


Figure 3.4. P-values of single nucleotides at each position of the 20-nt gRNA sequence. The Y axis indicates whether a single nucleotide is favored or disfavored by efficient gRNAs. The red lines are used as the cutoff for significance (P = 0.05).

Figure 3.5. The distribution of efficient and inefficient gRNAs for minimum free energy.

Figure 3.6. Performance evaluation of SVM models. Ten-fold cross validation was used to evaluate the SVM models. Each gray line indicates the ROC curve for each fold. The red line is the average ROC curve. The bar graph indicates the AUC for each fold.
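As a rough illustration of the per-fold AUC values summarized in Figure 3.6, the sketch below computes an AUC from labels and classifier scores using the rank-based (Mann-Whitney) formulation and averages it over folds. It is not the evaluation code used in this work; the function names and the toy labels and scores are hypothetical.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a positive outscores a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))


def mean_auc(folds):
    """Average AUC over folds, each fold given as a (labels, scores) pair."""
    aucs = [roc_auc(labels, scores) for labels, scores in folds]
    return sum(aucs) / len(aucs), aucs


# Toy example: 1 = efficient gRNA, 0 = inefficient gRNA (hypothetical values).
single = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
average, per_fold = mean_auc([([1, 0], [0.8, 0.2]), ([1, 0], [0.3, 0.3])])
```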

Figure 3.7. The evaluation results for different numbers of features. Random Forest and SVM were used to select important features by comparing the five-fold cross validation results of the top 50, 150, 250, 350, and 450 features based on all 2274 gRNAs. Features are ranked by the mean decrease in impurity calculated by Random Forest.

Figure 3.8. Validation of our final SVM model using an independent experimental data set. ROC curves compare the performance of our SVM model (CRISPR-DT) and the deep learning model (DeepCpf1, Kim et al., 2018) in predicting target efficiency using 990 independent gRNAs.

Figure 3.9. The parameter setting page of the CRISPR-Cpf1 system. First, users input a DNA sequence in FASTA format into the text area. Second, users select a reference genome. Third, users set up the PAM requirement. Fourth, users choose an off-target setting (either “Basic settings” or “Specific settings”).


Figure 3.10. The result web interfaces of CRISPR-DT. (a) The detailed information of all the target candidates within the DNA sequence entered by a user. (b) The detailed information of target sites in the reference genome for each target candidate. (c) Visualization of on- and off-target sites with genomic and transcript annotations in JBrowse.

CHAPTER 4. CRISPR-RT: A web application for designing CRISPR-C2c2 crRNA with improved target specificity*

Abstract

CRISPR-Cas systems have been successfully applied in genome editing. Recently, the CRISPR-C2c2 system has been reported as a tool for RNA editing. Here we describe CRISPR-RT (CRISPR RNA-Targeting), the first web application to help biologists design the crRNA with improved target specificity for the CRISPR-C2c2 system. CRISPR-RT allows users to set up a wide range of parameters, making it highly flexible for current and future research in CRISPR-based RNA editing. CRISPR-RT covers major model organisms and can be easily extended to cover other species. CRISPR-RT will empower researchers in RNA editing. CRISPR-RT is freely available at http://bioinfolab.miamioh.edu/CRISPR-RT.

* This chapter is a pre-copyedited, author-produced version of an article accepted for publication in Bioinformatics following peer review. The version of record, Zhu,H. et al. (2018) CRISPR-RT: a web application for designing CRISPR-C2c2 crRNA with improved target specificity. Bioinformatics, 34, 117–119, is available online at: https://doi.org/10.1093/bioinformatics/btx580

Introduction

During the past several years, CRISPR-Cas systems have been successfully applied in genome editing, but no system had been reported for RNA editing. Therefore, new CRISPR-Cas systems that regulate RNA activities are necessary for studying the roles of RNA molecules. Recently, the CRISPR-C2c2 system has been demonstrated as a tool for RNA targeting (Abudayyeh et al., 2016). CRISPR-C2c2 was discovered in 21 bacterial genomes and belongs to type VI of the Class 2 CRISPR systems (Shmakov et al., 2015). Researchers have characterized the CRISPR-C2c2 system from the bacterium Leptotrichia shahii (Abudayyeh et al., 2016). The L. shahii C2c2 locus is simply organized, including C2c2, Cas1, Cas2 and a CRISPR array (Fig. 4.1). C2c2, which contains two higher eukaryotes and prokaryotes nucleotide-binding (HEPN) domains, mainly functions as the sole effector protein mediating single-strand RNA cleavage (Abudayyeh et al., 2016). Similar to other Class 2 systems, the CRISPR array of CRISPR-C2c2 is first transcribed into pre-crRNA (Shmakov et al., 2015). In contrast to some of those systems, the pre-crRNA is processed by C2c2 itself into mature crRNAs without requiring trans-activating crRNAs (East-Seletsky et al., 2016). The mature crRNA binds to C2c2 and guides it to target a specific single-strand RNA. C2c2 mediates cleavage effectively when the target complementarity region of the crRNA is 22-28 nt long, and the secondary structure of the crRNA is also required for RNA cleavage (Abudayyeh et al., 2016). The seed region is located in the center of the crRNA-target duplex, where it is more sensitive to mismatches than the non-seed region (Abudayyeh et al., 2016). A single mismatch can be fully tolerated by C2c2, but if two mismatches are located in the seed region, C2c2 is unable to cleave the single-strand RNA; C2c2 can even tolerate 3 consecutive mismatches in the non-seed region (Abudayyeh et al., 2016). Whether C2c2 can tolerate gaps remains unknown and might be explored in the near future.

The CRISPR-C2c2 system prefers H (A, U, or C) as the single-base 3’ protospacer flanking site (PFS) to mediate single-strand RNA cleavage (Abudayyeh et al., 2016). CRISPR-C2c2 has already been successfully used for specific RNA knockdown in E. coli (Abudayyeh et al., 2016). Researchers found that C2c2 cleaves collateral RNA in addition to the targeted single-strand RNA (Abudayyeh et al., 2016; East-Seletsky et al., 2016), a property that has been applied to RNA detection in human total RNAs (East-Seletsky et al., 2016) and to the detection of viral strains (Gootenberg et al., 2017). The inactive dC2c2, just like dCas9 (Gilbert et al., 2013; Zetsche et al., 2015; Gao et al., 2016), also has many potential applications as an RNA-binding protein, such as bringing effectors to specific RNAs to regulate their translation and tracking specific RNAs with a fluorescent tag (Abudayyeh et al., 2016). Therefore, the CRISPR-C2c2 system has been viewed as a powerful programmable tool for RNA editing (Abudayyeh et al., 2016; East-Seletsky et al., 2016; Nainar et al., 2016; Wang and Qi, 2016; Puchta, 2017).

However, until now no publicly available software has existed for designing crRNAs for the CRISPR-C2c2 system. We have developed CRISPR-RT (CRISPR RNA-Targeting), a web application to help biologists design the crRNA for the CRISPR-C2c2 system. To maximize the flexibility for current and future research in CRISPR-based RNA editing, CRISPR-RT allows users to set up a wide range of parameters, such as the length of the target complementarity region of the crRNA, the length of the seed region, the PFS, and the number of mismatches or gaps tolerated by off targets. After the required parameters are set, CRISPR-RT finds target candidates in the input RNA sequences and employs rigorous alignment algorithms to search for on- and off-target sites for each target candidate within the reference transcriptome. The results are displayed in highly interactive graphical interfaces. Users can rank target candidates by the total number of target sites in the reference transcriptome, which helps them choose the target candidate with the fewest off targets. In addition, users are able to validate the on- and off-target sites in the context of annotated genome and transcript features through visualization in JBrowse (Skinner et al., 2009).
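To make the candidate search concrete, here is a minimal sketch, assuming a sliding window over the input RNA in which a window is kept when the single base immediately 3’ of it matches the PFS code. The function name, the default length of 28 nt, and the 1-based inclusive coordinates are assumptions made for illustration; they are not taken from the CRISPR-RT source code.

```python
# IUPAC nucleotide codes for RNA (standard definitions).
IUPAC_RNA = {
    "A": "A", "C": "C", "G": "G", "U": "U",
    "R": "AG", "Y": "CU", "S": "CG", "W": "AU", "K": "GU", "M": "AC",
    "B": "CGU", "D": "AGU", "H": "ACU", "V": "ACG", "N": "ACGU",
}


def find_candidates(rna, length=28, pfs="H"):
    """List (start, end, protospacer, pfs_base) for windows followed by a valid PFS.

    Coordinates are 1-based and inclusive (an assumption of this sketch).
    """
    rna = rna.upper().replace("T", "U")  # accept cDNA input as well
    allowed = IUPAC_RNA[pfs]
    candidates = []
    for start in range(len(rna) - length):
        flank = rna[start + length]  # base immediately 3' of the window
        if flank in allowed:
            candidates.append(
                (start + 1, start + length, rna[start:start + length], flank)
            )
    return candidates


# Short toy input with a 4-nt window so the result is easy to check by hand.
hits = find_candidates("ACGUACGG", length=4, pfs="H")
```

Only windows followed by A, U, or C are reported here, so the two windows flanked by G are skipped.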

Implementation

Graphic Input Interface

Figure 4.2 shows the CRISPR-C2c2 setting page. First, users input the RNA/cDNA sequence that they want to target in FASTA format. They can also use an example sequence by clicking the “Example Sequence” button. Second, users select a reference transcriptome. They can also click the “custom” link to upload a custom reference transcriptome in FASTA format. Third, based on our current understanding of the CRISPR-C2c2 system architecture (Fig. 4.1), users set up the PFS sequence and crRNA requirements for the CRISPR-C2c2 system. The on- and off-target PFS sequences can be set separately. Users then set the length of the target complementarity region of the crRNA and the length of the seed region. The seed region is located in the center of the crRNA-target duplex, and its length should not be greater than the length of the target complementarity region of the crRNA. Fourth, users choose an off-target setting (“Basic settings” or “Specific settings”). For “Basic settings”, the number of mismatches or gaps tolerated by off targets overall and within the seed region can be set separately. The number of consecutive mismatches or gaps in the seed or non-seed region tolerated by off targets can also be configured. For “Specific settings”, users can set more detailed parameters for the seed and non-seed regions separately, such as the number of mismatches in the seed or non-seed region tolerated by off targets. Users can also set the search sensitivity of Bowtie2 (Langmead and Salzberg, 2012), which corresponds to the alignment options of Bowtie2. A higher sensitivity setting makes alignments more sensitive, but it usually results in a longer search time. After setting all the parameters, users click the “Find targets!” button, which runs programs in the background to produce the results. To retrieve a recent job, users can click “Retrieve Jobs” on the left side menu and enter the “Job ID”, which is generated on the result page. The results are kept on our server for only one week and are deleted automatically afterwards.
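The way the seed and non-seed tolerances interact can be sketched as follows. This is a simplified illustration under stated assumptions, not the filter used by CRISPR-RT: it handles ungapped alignments only, places the seed as a centered window (rounding down when it cannot be centered exactly), and uses hypothetical function and parameter names.

```python
def seed_window(length, seed_len):
    """Half-open [start, end) span of a seed centered in the crRNA-target duplex."""
    if not 0 < seed_len <= length:
        raise ValueError("seed length must be between 1 and the target length")
    start = (length - seed_len) // 2
    return start, start + seed_len


def mismatch_counts(candidate, site, seed_len):
    """Count ungapped mismatches in the seed and non-seed regions."""
    if len(candidate) != len(site):
        raise ValueError("gapped alignments are not modeled in this sketch")
    start, end = seed_window(len(candidate), seed_len)
    seed = nonseed = 0
    for i, (a, b) in enumerate(zip(candidate, site)):
        if a != b:
            if start <= i < end:
                seed += 1
            else:
                nonseed += 1
    return seed, nonseed


def is_potential_off_target(candidate, site, seed_len, max_total, max_seed):
    """Keep a site when it is within both the overall and the seed tolerance."""
    seed, nonseed = mismatch_counts(candidate, site, seed_len)
    return seed + nonseed <= max_total and seed <= max_seed


# A 10-nt toy duplex with a 4-nt seed occupying indices 3 to 6.
in_seed = mismatch_counts("AAAAAAAAAA", "AAAACAAAAA", seed_len=4)
in_nonseed = mismatch_counts("AAAAAAAAAA", "CAAAAAAAAA", seed_len=4)
```

Handling the two regions separately means that a site with a single seed mismatch can be rejected while a site with the same number of mismatches outside the seed is retained.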

Graphic Output Interface

Figure 4.3A shows all the target candidates and relevant information for one RNA query. Users can view the input sequence by clicking the “Input Sequence Viewer” button. In the sequence viewer, users can search for and highlight any subsequence, such as the target candidate sequence (Fig. 4.3B). Users can also download the target candidates file by clicking the “Download” link. In the table, the protospacer and PFS of each target candidate are labeled in different colors. The corresponding crRNA of each target candidate can be accessed by clicking “crRNA”; a graph will appear to help users design their own crRNAs (Fig. 4.4). The table also displays the start position, end position, and GC content of each target candidate sequence. The last two columns of the table show the numbers of targeted transcripts and genes, respectively, for each target candidate. By clicking the table headers, users can rank all target candidates by the number of target sites, including on- and off-target sites. Target candidates with fewer target sites have higher target specificity. If the number of target sites is 1, the corresponding table cell is highlighted with a green background to indicate that the target candidate is highly specific. By clicking the number of targeted transcripts, users can view the detailed information of the targeted transcripts for each target candidate (Fig. 4.3C). Because CRISPR-RT converts the transcriptome mapping result to a genome mapping result, the information on targeted transcripts is displayed in genomic context with genomic coordinates and gene annotations, including the transcript isoform, mapped gene, chromosome, start position, and strand. Users can click the “transcript ID” link or “gene ID” link of each target site to view the detailed description of the transcript or gene in which the target site is located. The number of mismatches or gaps in each target site is also shown in the table. To visualize and manually validate the targeted transcripts, users can click the JBrowse link to view each target site against the genome and transcript features annotated by Ensembl or Phytozome (Fig. 4.3D).

Methods

CRISPR-RT is composed of a set of web interfaces and a backend pipeline. The web interfaces are implemented in PHP and JavaScript; they accept user inputs and display the results interactively. The backend pipeline is implemented in Perl; it processes user input data and generates multiple result files. The same strategy was applied in our previously published web application CT-Finder (Zhu et al., 2016), which helps to design Cas9 gRNAs. After users set the parameters and click the “Find targets!” button on the parameter setting page of CRISPR-C2c2 (Figure 4.2), all of the parameters stored in the PHP code are passed to the main Perl script, which invokes other Perl scripts or commands to execute specific functions. The default parameters are based on the research of Abudayyeh et al. (2016). As shown in Figure 4.5, first, the Perl script for target candidate search is called to find target candidates of the specified length with the PFS in the input RNA sequence. Second, Bowtie2 (Langmead and Salzberg, 2012) is used to map each target candidate sequence to the reference transcriptome, which is extracted from the genome by RSEM (Li and Dewey, 2011) using Ensembl or Phytozome gene annotation, to search for on- and off-target sites within the transcriptome. Bowtie2 is particularly good at aligning short reads to long genomes or transcriptomes and has been used for Cas9 gRNA design in previous studies (Heigwer et al., 2014; Zhu et al., 2016). RSEM is a widely used package for analyzing RNA-Seq data (Haas et al., 2013; Konermann et al., 2015; Shalek et al., 2013). Third, the Perl scripts for result filtration are invoked to filter out the off targets that do not meet the requirements set by users. Since the seed region is more sensitive to mismatches than the non-seed region, the two regions are handled separately. Next, RSEM is used to convert the transcriptome mapping result to a genome mapping result, which can be displayed properly in JBrowse (Skinner et al., 2009). Then, the main Perl script separates the file storing the target sites of all target candidates into individual files, one per target candidate. The main Perl script also processes the file of target candidates and the files of target sites for each target candidate to generate Ajax format files, which are required by DataTables (a table plug-in for jQuery). Finally, the PHP and JavaScript code displays the detailed information of target candidates and their corresponding target sites in DataTables. The on- and off-target sites can be visualized in JBrowse.
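The last two backend steps, splitting target sites per candidate and emitting Ajax files for DataTables, can be sketched as below. The actual pipeline is written in Perl; this Python version is only an illustration, and the record fields (`candidate`, `transcript`, `mismatches`) and identifiers are hypothetical. The one assumption about DataTables is its default Ajax convention of reading rows from a top-level `data` array.

```python
import json
from collections import defaultdict


def split_by_candidate(rows):
    """Group target-site records by the target candidate they belong to."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["candidate"]].append(row)
    return dict(grouped)


def to_datatables_json(rows, columns):
    """Serialize records as {"data": [[...], ...]}, one inner list per table row."""
    return json.dumps({"data": [[row[col] for col in columns] for row in rows]})


# Hypothetical target-site records for two candidates.
sites = [
    {"candidate": "cand1", "transcript": "tx1", "mismatches": 0},
    {"candidate": "cand1", "transcript": "tx2", "mismatches": 2},
    {"candidate": "cand2", "transcript": "tx3", "mismatches": 0},
]
per_candidate = split_by_candidate(sites)
ajax_text = to_datatables_json(per_candidate["cand1"], ["transcript", "mismatches"])
```

The number of records per candidate is also the quantity used for ranking: a candidate with a single target site, such as `cand2` here, is the most specific.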

References

Abudayyeh,O.O. et al. (2016) C2c2 is a single-component programmable RNA-guided RNA-targeting CRISPR effector. Science, 353, aaf5573.
East-Seletsky,A. et al. (2016) Two distinct RNase activities of CRISPR-C2c2 enable guide-RNA processing and RNA detection. Nature, 538, 270–273.
Gao,Y. et al. (2016) Complex transcriptional modulation with orthogonal and inducible dCas9 regulators. Nat Meth, 13, 1043–1049.
Gilbert,L.A. et al. (2013) CRISPR-Mediated Modular RNA-Guided Regulation of Transcription in Eukaryotes. Cell, 154, 442–451.
Gootenberg,J.S. et al. (2017) Nucleic acid detection with CRISPR-Cas13a/C2c2. Science, eaam9321.
Haas,B.J. et al. (2013) De novo transcript sequence reconstruction from RNA-Seq: reference generation and analysis with Trinity. Nat Protoc, 8.
Heigwer,F. et al. (2014) E-CRISP: fast CRISPR target site identification. Nat Meth, 11, 122–123.
Konermann,S. et al. (2015) Genome-scale transcriptional activation by an engineered CRISPR-Cas9 complex. Nature, 517, 583–588.
Langmead,B. and Salzberg,S.L. (2012) Fast gapped-read alignment with Bowtie 2. Nat Meth, 9, 357–359.
Li,B. and Dewey,C.N. (2011) RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC Bioinformatics, 12, 323.
Nainar,S. et al. (2016) Evolving insights into RNA modifications and their functional diversity in the brain. Nat Neurosci, 19, 1292–1298.
Puchta,H. (2017) Applying CRISPR/Cas for genome engineering in plants: the best is yet to come. Current Opinion in Plant Biology, 36, 1–8.
Shalek,A.K. et al. (2013) Single-cell transcriptomics reveals bimodality in expression and splicing in immune cells. Nature, 498, 236–240.
Shmakov,S. et al. (2015) Discovery and Functional Characterization of Diverse Class 2 CRISPR-Cas Systems. Molecular Cell, 60, 385–397.
Skinner,M.E. et al. (2009) JBrowse: A next-generation genome browser. Genome Res., 19, 1630–1638.
Wang,F. and Qi,L.S. (2016) Applications of CRISPR Genome Engineering in Cell Biology. Trends in Cell Biology, 26, 875–888.
Zetsche,B. et al. (2015) A split-Cas9 architecture for inducible genome editing and transcription modulation. Nat Biotech, 33, 139–142.
Zhu,H. et al. (2016) CT-Finder: A Web Service for CRISPR Optimal Target Prediction and Visualization. Scientific Reports, 6, srep25516.

List of Figures

[Figure 4.1 diagram. Labels: Leptotrichia shahii CRISPR-C2c2 locus (C2c2, Cas1, Cas2, CRISPR array); pre-crRNA; mature crRNA bound to C2c2; target RNA with PFS and seed region; 5’ and 3’ ends marked.]

Figure 4.1. The configuration and architecture of the CRISPR-C2c2 system. The CRISPR array of CRISPR-C2c2 is first transcribed into pre-crRNA. Then, the pre-crRNA is processed by C2c2 into mature crRNAs without attaching to trans-activating crRNAs. The mature crRNA binds to C2c2 and guides C2c2 to target a specific single-strand RNA at a proper target site.

Figure 4.2. The parameter setting page of CRISPR-C2c2. First, users input an RNA sequence in FASTA format into the text area. Second, users select a reference transcriptome. Third, users set up the PFS and crRNA requirements for the CRISPR-C2c2 system. Fourth, users choose an off-target setting (either “Basic settings” or “Specific settings”).

Figure 4.3. The result web interfaces of CRISPR-RT. (A) The detailed information of all the target candidates in the user input RNA sequence. (B) User input RNA sequence viewer. (C) The detailed information of target sites in the transcriptome for each target candidate. (D) Visualization of on- and off-targets with gene and transcript annotations in JBrowse.

Figure 4.4. The basic process of crRNA design based on the target candidate sequence (protospacer). The target-complementary sequence of the crRNA is shown in purple. The stem-loop sequence of the crRNA comes from previous research (Abudayyeh et al., 2016). The designed crRNA binds to C2c2 to form a crRNA-C2c2 complex for RNA targeting.

Figure 4.5. Flowchart of the CRISPR-RT backend pipeline, which processes the user input RNA sequence and displays results in the web interfaces. All input parameters, including the RNA sequence, are passed to the main Perl script. The main Perl script invokes other Perl scripts or commands to execute specific functions. Finally, the results are displayed in DataTables and JBrowse in the web interfaces.

CHAPTER 5. Conclusions and future directions

The CRISPR-Cas system is an adaptive immune system of bacteria and archaea that has been successfully applied in DNA and RNA editing. The main CRISPR-Cas systems include Cas9, Cpf1, and C2c2, all of which are Class 2 CRISPR-Cas systems. Cas9 and Cpf1 are applied for genome editing, whereas C2c2 is used to edit RNA. The applications of CRISPR-based editing include genome-wide screens, modified crops in agriculture, drug target identification, and gene therapy (Fellmann et al., 2017; Yin et al., 2017; Waltz, 2018; Zhang et al., 2017; Long et al., 2018). In addition, Cas9 has been used to inactivate defective genes in Huntington’s disease (Staahl et al., 2017) and in amyotrophic lateral sclerosis (Gaj et al., 2017). Furthermore, dCas9 can serve as a DNA-binding protein for transcriptional regulation when fused to a transcription factor (Kampmann, 2018; Pulecio et al., 2017), or for imaging repetitive genomic loci in live cells when fused to a fluorescent reporter (Knight et al., 2018). Cas9n and RFNs are two important engineered variants of the wild-type Cas9 that enhance target specificity (Ran et al., 2013; Tsai et al., 2014). Additionally, Cpf1 has been successfully applied for genome editing in mammals and plants with high target specificity (Kim et al., 2016; Tang et al., 2017; Xu et al., 2017; Hu et al., 2017; Kim et al., 2017). In a recent study, Cpf1 was even used for DNA detection (Chen et al., 2018). As an RNA-targeting tool, C2c2 has been employed for specific RNA knockdown in plant cells (Aman et al., 2018) and mammalian cells (Abudayyeh et al., 2017). C2c2 can also be applied in RNA detection and diagnostics (East-Seletsky et al., 2016; Gootenberg et al., 2017). In addition, dC2c2 can serve as an RNA-binding protein for RNA imaging when fused to a fluorescent reporter (Abudayyeh et al., 2017).
Although CRISPR-Cas systems have been successfully and widely applied in DNA and RNA editing, target specificity and efficiency remain major concerns for researchers. Therefore, we have developed three web services, CT-Finder, CRISPR-DT, and CRISPR-RT, to help users design optimal gRNAs for different CRISPR-Cas systems with improved target specificity and efficiency. Specifically, CT-Finder helps users design gRNAs for the CRISPR-Cas9, Cas9n, and RFN systems with improved target specificity. CRISPR-DT is the first web service to help scientists design optimal gRNAs for the CRISPR-Cpf1 system by considering both target efficiency and specificity. CRISPR-RT is the first web service to help biologists design gRNAs for the CRISPR-C2c2 system with improved target specificity. All three web services support multiple parameter settings, which makes them flexible for current and future research in CRISPR-Cas systems. In addition, major model organisms are covered by the web services, and other species can be easily added. We hope CT-Finder, CRISPR-DT, and CRISPR-RT will empower researchers in CRISPR-based DNA and RNA editing.

Although researchers have worked for several years to improve the target specificity and efficiency of CRISPR-Cas systems (e.g., by engineering the CRISPR-Cas9 system, discovering new CRISPR-Cas systems, and developing predictive software tools), more effort is still needed, especially for therapeutic applications. Bioinformatic methods have played a crucial role in improving target specificity and efficiency for CRISPR-Cas systems. The typical bioinformatic strategy for predicting target specificity is to align each target candidate against the reference genome or transcriptome to search for all possible off-target sites. A target candidate with fewer off-target sites is usually considered a better candidate with higher target specificity. Although Bowtie2 (Langmead and Salzberg, 2012) has been widely applied to target candidate alignment for CRISPR-Cas systems, other alignment tools, such as BWA (Burrows-Wheeler Aligner) (Li and Durbin, 2009), could also be used to compare against and improve the results produced by Bowtie2. Additionally, more data are required to understand clearly the effects of mismatches and indels in both seed and non-seed regions of target sites, as well as in PAM sequences, so that on-target and off-target sites can be predicted more accurately. The typical bioinformatic strategy for predicting target efficiency is to use experimental gRNA activity data to train a machine learning model that predicts target efficiency for any given gRNA. The CRISPR-Cas9 system has been widely studied in different species, and plentiful experimental gRNA activity data are available for Cas9.
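The alignment-based specificity strategy described above can be illustrated with a minimal sketch. It is not the pipeline used by CT-Finder, CRISPR-DT, or CRISPR-RT: a naive mismatch-counting scan stands in for Bowtie2, indels and PAM/PFS requirements are ignored, and the function names `find_sites` and `rank_by_off_targets` are hypothetical names chosen for illustration.

```python
from typing import List, Tuple


def find_sites(candidate: str, reference: str,
               max_mismatches: int = 2) -> List[Tuple[int, int]]:
    """Return (start, mismatches) for every reference window that
    differs from the candidate by at most max_mismatches substitutions."""
    k = len(candidate)
    sites = []
    for start in range(len(reference) - k + 1):
        window = reference[start:start + k]
        mismatches = sum(1 for a, b in zip(candidate, window) if a != b)
        if mismatches <= max_mismatches:
            sites.append((start, mismatches))
    return sites


def rank_by_off_targets(candidates: List[str], reference: str,
                        max_mismatches: int = 2) -> List[Tuple[str, int]]:
    """Rank candidates so that those with the fewest off-target sites come first."""
    ranked = []
    for cand in candidates:
        sites = find_sites(cand, reference, max_mismatches)
        # Assumption: one perfect match, if present, is the intended
        # on-target site; every other hit counts as an off-target site.
        on_target = 1 if any(mm == 0 for _, mm in sites) else 0
        ranked.append((cand, len(sites) - on_target))
    return sorted(ranked, key=lambda item: item[1])
```

In practice the scan would be replaced by an indexed aligner, and hits could be weighted by whether their mismatches fall in the seed or non-seed region.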
Since the target efficiency of the same gRNA might vary among organisms, more specific predictive models could be built in the future from species-specific or even cell-type-specific data to predict CRISPR-Cas9 target efficiency. In addition, deep learning or other novel machine learning technologies might be applied to further improve target efficiency prediction for the CRISPR-Cas9 system. Since few experimental gRNA activity data are available for the CRISPR-Cpf1 system so far, more relevant experiments are needed in the future. By using more gRNA activity data and incorporating more distinctive sequence or structural features, such as position-specific nucleotide accessibility of gRNAs, chromatin accessibility information, and DNA methylation data, more robust predictive models could be created to further improve target efficiency prediction for the CRISPR-Cpf1 system. To the best of our knowledge, there are currently almost no experimental gRNA activity data for the CRISPR-C2c2 system and no target efficiency predictive model available for C2c2. Therefore, it is necessary to measure C2c2 gRNA activity experimentally and then build a model based on those data to predict target efficiency for the CRISPR-C2c2 system. Overall, CRISPR-Cas systems are very promising for DNA and RNA editing. However, target specificity and efficiency still need to be improved further, especially for therapeutic purposes.
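As an illustration of the sequence features that such efficiency models start from, the sketch below encodes a target sequence as GC content plus position-specific one-hot nucleotide indicators. This is an assumed, simplified feature set, not the one used in CRISPR-DT; the function name `encode_features` is hypothetical, and structural, chromatin accessibility, and methylation features would require external data that is not modeled here.

```python
from typing import Dict

NUCLEOTIDES = "ACGT"  # assumption: upper-case DNA target sequence


def encode_features(target: str) -> Dict[str, float]:
    """Encode a target sequence as GC content plus one-hot
    position-specific nucleotide indicators (1-based positions)."""
    if not target:
        raise ValueError("target sequence must not be empty")
    gc = sum(1 for base in target if base in "GC") / len(target)
    features = {"gc_content": gc}
    for pos, base in enumerate(target, start=1):
        for nt in NUCLEOTIDES:
            features[f"pos{pos}_{nt}"] = 1.0 if base == nt else 0.0
    return features
```

The resulting dictionary can be converted into a numeric vector and paired with measured gRNA activity to train a model of choice.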

References
Abudayyeh,O.O. et al. (2017) RNA targeting with CRISPR–Cas13. Nature, 550, 280–284.
Aman,R. et al. (2018) RNA virus interference via CRISPR/Cas13a system in plants. Genome Biology, 19, 1.
Chen,J.S. et al. (2018) CRISPR-Cas12a target binding unleashes indiscriminate single-stranded DNase activity. Science, 360, 436–439.
East-Seletsky,A. et al. (2016) Two distinct RNase activities of CRISPR-C2c2 enable guide-RNA processing and RNA detection. Nature, 538, 270–273.
Fellmann,C. et al. (2017) Cornerstones of CRISPR–Cas in drug discovery and therapy. Nature Reviews Drug Discovery, 16, 89–100.
Gaj,T. et al. (2017) In vivo genome editing improves motor function and extends survival in a mouse model of ALS. Science Advances, 3, eaar3952.
Gootenberg,J.S. et al. (2017) Nucleic acid detection with CRISPR-Cas13a/C2c2. Science, 356, 438–442.
Hu,X. et al. (2017) Targeted mutagenesis in rice using CRISPR-Cpf1 system. J Genet Genomics, 44, 71–73.
Kampmann,M. (2018) CRISPRi and CRISPRa Screens in Mammalian Cells for Precision Biology and Medicine. ACS Chem. Biol., 13, 406–416.
Kim,D. et al. (2016) Genome-wide analysis reveals specificities of Cpf1 endonucleases in human cells. Nature Biotechnology, 34, 863–868.
Kim,H. et al. (2017) CRISPR/Cpf1-mediated DNA-free plant genome editing. Nature Communications, 8, 14406.
Knight,S.C. et al. (2018) Genomes in Focus: Development and Applications of CRISPR-Cas9 Imaging Technologies. Angewandte Chemie International Edition, 57, 4329–4337.
Langmead,B. and Salzberg,S.L. (2012) Fast gapped-read alignment with Bowtie 2. Nature Methods, 9, 357–359.
Li,H. and Durbin,R. (2009) Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics, 25, 1754–1760.
Long,C. et al. (2018) Correction of diverse muscular dystrophy mutations in human engineered heart muscle by single-site genome editing. Science Advances, 4, eaap9004.
Pulecio,J. et al. (2017) CRISPR/Cas9-Based Engineering of the Epigenome. Cell Stem Cell, 21, 431–447.
Ran,F.A. et al. (2013) Double Nicking by RNA-Guided CRISPR Cas9 for Enhanced Genome Editing Specificity. Cell, 154, 1380–1389.
Staahl,B.T. et al. (2017) Efficient genome editing in the mouse brain by local delivery of engineered Cas9 ribonucleoprotein complexes. Nature Biotechnology, 35, 431–434.
Tang,X. et al. (2017) A CRISPR–Cpf1 system for efficient genome editing and transcriptional repression in plants. Nature Plants, 3, 17018.
Tsai,S.Q. et al. (2014) Dimeric CRISPR RNA-guided FokI nucleases for highly specific genome editing. Nature Biotechnology, 32, 569–576.
Waltz,E. (2018) With a free pass, CRISPR-edited plants reach market in record time. Nature Biotechnology, 36, 6–7.
Xu,R. et al. (2017) Generation of targeted mutant rice using a CRISPR-Cpf1 system. Plant Biotechnology Journal, 15, 713–717.
Yin,K. et al. (2017) Progress and prospects in plant genome editing. Nature Plants, 3, 17107.
Zhang,Y. et al. (2017) CRISPR-Cpf1 correction of muscular dystrophy mutations in human cardiomyocytes and mice. Science Advances, 3, e1602814.