Published online 11 April 2014 Nucleic Acids Research, 2014, Vol. 42, No. 10 6091–6105 doi: 10.1093/nar/gku241

SURVEY AND SUMMARY Classification and evolution of type II CRISPR-Cas systems Krzysztof Chylinski1,2, Kira S. Makarova3, Emmanuelle Charpentier1,4,5 and Eugene V. Koonin3,*

1The Laboratory for Molecular Infection Medicine Sweden (MIMS), Umea˚ Centre for Microbial Research (UCMR), Department of Molecular Biology, Umea˚ University, Umea˚ 90187, Sweden, 2Max F. Perutz Laboratories, University of Vienna, Vienna 1030, Austria, 3National Center for Biotechnology Information, NLM, National Institutes of Health, Bethesda, MD 20894, USA, 4Helmholtz Centre for Infection Research, Department of Regulation in Infection Biology, Braunschweig 38124, Germany and 5Hannover Medical School, Hannover 30625, Germany

Received January 13, 2014; Revised March 11, 2014; Accepted March 12, 2014

ABSTRACT INTRODUCTION The CRISPR-Cas systems of archaeal and bacterial The CRISPR (clustered regularly interspaced short palin- adaptive immunity are classified into three types dromic repeats)-Cas (CRISPR-associated genes) system is that differ by the repertoires of CRISPR-associated the first and so far the only adaptive immunity system (cas) genes, the organization of cas operons and discovered in archaea and . Similar to restriction– the structure of repeats in the CRISPR arrays. The modification and several other defense systems, it is based on the self–non-self discrimination principle but unlike simplest among the CRISPR-Cas systems is type other defense mechanisms, CRISPR-Cas imprints pieces of II in which the endonuclease activities required for genetic material as a memory of previously encountered the interference with foreign deoxyribonucleic acid viruses and plasmids [(1) and references therein]. These (DNA) are concentrated in a single multidomain short fragments matching the infectious agent deoxyribonu- protein, Cas9, and are guided by a co-processed cleic acid (DNA) are inserted into an array of CRISPR re- dual-tracrRNA:crRNA molecule. This compact en- peats and are used in the form of guide CRISPR RNA (cr- zymatic machinery and readily programmable site- RNA) to target the cognate virus or plasmid upon new in- specific DNA targeting make type II systems top can- fections (2–5). didates for a new generation of powerful tools for The diversity of the protein components associated with genomic engineering. Here we report an updated the CRISPR-Cas systems is remarkable (6–9), but identifi- census of CRISPR-Cas systems in bacterial and ar- cation of several signature elements in their genomic orga- nization has provided for a relatively simple classification of chaeal genomes. Type II systems are the rarest, miss- ∼ these systems (5). This classification differentiates three ma- ing in archaea, and represented in 5% of bacte- jor types of the CRISPR-Cas systems, with cas3, cas9 (for- rial genomes, with an over-representation among merly csn1)andcas10 being the signature genes for the type pathogens and commensals. Phylogenomic analysis I, type II and type III systems, respectively.Moreover, recent suggests that at least three cas genes, cas1, cas2 bioinformatic analysis and structural studies revealed strik- and cas4, and the CRISPR repeats of the type II-B ing similarity in the organization of effector complexes be- system were acquired via recombination with a type tween CRISPR-Cas systems of type I and type III suggest- I CRISPR-Cas locus. Distant homologs of Cas9 were ing their origin from a common ancestor (10–12). However, identified among proteins encoded by diverse trans- the origin of type II CRISPR-Cas systems remains obscure. posons, suggesting that type II CRISPR-Cas evolved According to the current classification, two subtypes of via recombination of mobile nuclease genes with type II CRISPR-Cas systems are defined on the basis of their characteristic operon organizations (5). Type II-A sys- type I loci. tems encompass an additional gene, known as csn2.The

*To whom correspondence should be addressed. Tel: 301-435-5913; Fax: 301-435-7793; Email: [email protected]

Published by Oxford University Press on behalf of Nucleic Acids Research 2014. This work is written by (a) US Government employee(s) and is in the public in the US. 6092 Nucleic Acids Research, 2014, Vol. 42, No. 10

Csn2 protein is involved in spacer integration but is not re- III and a specialized ribonucleic acid (RNA) molecule, de- quired for interference (13,14). Two distantly related Csn2 noted as tracrRNA. The tracrRNAs have been identified in subfamilies that include, respectively, short (15,16) and long most genomes encoding type II systems and are now con- forms (17) of the protein have been identified. For both of sidered to be an integral component of this CRISPR-Cas these forms, the structures have been solved and the pro- type (22,23). These RNAs are encoded in the vicinity of teins have been biochemically characterized (15–18). The or within the cas operon and CRISPR array, and all tracr- Csn2 structures have been shown to adopt a highly de- RNA orthologs are characterized by the presence of an anti- rived P-loop ATPase fold in which the adenosine triphos- repeat sequence homologous to cognate CRISPR repeats. phate (ATP) binding center appears to be inactivated (15– A first processing event involves base-pairing of tracrRNA 17). The structurally characterized Csn2 proteins form ho- with the pre-crRNA repeats in the presence of Cas9 to form motetrameric rings that bind linear double-stranded (ds) RNA duplexes that are cleaved by the executioner endoge- DNA through the rings central hole. The major difference nous RNase III (23). The intermediate crRNA species un- between the short and long forms of Csn2 is that the short dergo further maturation resulting in mature individual cr- but not the long forms display Ca2+-dependent DNA bind- RNAs that remain duplexed with tracrRNA in a complex ing (15,16,19). The specific function of Csn2 in spacer inte- with the Cas9 protein (23,33). gration remains unclear, but it has been hypothesized that Several type II systems show additional, distinct features Csn2 is an accessory component that binds the dsDNA ends (Figure 1). For example, it has been reported that in type II- and protects them from exonucleolytic degradation, while CofNeisseria meningitidis WUE2594, RNase III is dispens- recruiting DNA-repair proteins to seal the breaks (18). Type able for the DNA targeting function of CRISPR-Cas (38). II-B systems lack the csn2 gene but possess another addi- Certain mature crRNAs have been shown to be transcribed tional gene of the cas4 family (5). The Cas4 proteins belong from promoters embedded within each CRISPR repeat and to the PD-(D/E)xK family of nucleases (8) and indeed have thus would be produced without the requirement of any been shown to possess 5-single-stranded DNA exonuclease maturation event (38). Furthermore, an additional RNA activity (20). Similar to Csn2, the actual role of Cas4 pro- component of type II-B in Francisella novicida U112, de- teins in the CRISPR-Cas systems remains unknown. Unlike noted as scaRNA (small, CRISPR-Cas-associated RNA), csn2, which is found only in association with the type II-A has recently been discovered. Together with scaRNA and system, the cas4 gene is not tightly linked to a particular Cas9, tracrRNA is involved in the repression of a mes- CRISPR-Cas defense mechanism and thus has been pro- senger RNA (mRNA) encoding a lipoprotein implicated posed to play a role in associated immunity (21). Since the in the virulence of F. novicida, a notable case of a non- original classification was developed, numerous additional immune function of the CRISPR-Cas system (39). A link of type II CRISPR-Cas systems with only three genes (cas1, type II CRISPR-Cas to virulence has also been described cas2 and cas9) in the operon have been identified and a third in Campylobacter jejuni (40), N. meningitidis (39)andLe- subtype, type II-C, has been proposed (11,22). gionella pneumophila (41). Type II CRISPR-Cas is considered to be the minimal Given their simplicity and potential for easy RNA pro- CRISPR-Cas system that includes the CRISPR repeat– gramming, type II is the best candidate of all CRISPR-Cas spacer array and only four (but often three) cas genes (Fig- systems for exploitation in genetic engineering. It has been ure 1), although additional bacterial factors, in particu- shown that the dual-tracrRNA:crRNA can be replaced lar trans-activating crRNA (tracrRNA) and RNase III, with a single guide RNA that combines the two RNA contribute to the function of this system (23). All type II molecules, eliminating the maturation steps required for CRISPR-Cas modules contain the pair of cas1 and cas2 RNA-programmable Cas9 activation (33). Thus, a variety genes that are required for spacer acquisition (24–26). These of guide RNAs can be designed to direct the Cas9 endonu- two genes are also present in the vast majority of the clease for site-specific DNA cleavage and further genetic genomes that contain at least one CRISPR-Cas system of manipulations such as gene editing, insertion or deletions type I or type III. Similar to type I, type II CRISPR-Cas (32,33). The easy conversion of Cas9 into a nickase was uti- systems require a well-defined short protospacer adjacent lized to facilitate homology directed repair in mammalian motif (PAM) that is located immediately downstream of genomes with reduced mutagenic activity and reported in- the protospacer on the non-target DNA strand (27–30). creased specificity (42–45). Furthermore, the DNA-binding The PAM sequence is important both for spacer acquisi- capacity of a catalytically inactive Cas9 mutant can be tion and for target recognition and cleavage (14,27,30–33). exploited to engineer diverse RNA-programmable devices The cas9 gene, the signature of type II systems (5), is a large that can be used to mediate transcriptional silencing or ac- multidomain protein that alone is sufficient for targeting tivation, or as DNA modification tools (32,33,46–51). The and cleaving the invader DNA (23,33). Two readily iden- unprecedented versatility in alternatives of genome engi- tifiable, split nuclease domains of Cas9, namely HNH and neering and modulation of gene expression makes RNA- RuvC-like nucleases, are required for target DNA cleavage programmable Cas9 a unique technology in molecular biol- (8,12,27,30,32,33). ogy. At the time of this writing, systems for eukaryotic gene Surprisingly, the minimal type II CRISPR-Cas system targeting using type II CRISPR-Cas systems have been de- employs an elaborate, unique processing mechanism of pre- veloped for human cells (22,42,45,52,53), monkey (54), pig cursor crRNA (pre-crRNA) (Figure 1). Unlike most other (55), mice (56,57), zebrafish (58), Drosophila (59), yeast (60), systems of types I and III that use a dedicated Cas endori- plants (61,62)andCaenorhabditis elegans (63), as well as bonuclease to cleave pre-crRNA by recognizing the repeat bacteria (64). The successful, rapid application of sequence- units (12,34–37), type II systems use endogenous RNase specific RNA-programmable Cas9 for genome editing in Nucleic Acids Research, 2014, Vol. 42, No. 10 6093

Figure 1. General scheme of the mechanism of type II CRISPR-Cas systems. (A) Proteins responsible for new spacer acquisition are shown for different type II subtypes. (B) Typical type II CRISPR-Cas locus architecture for three major subtypes shown together with a representative strain locus scheme. Red and orange arrows: tracrRNA and scaRNA with transcription direction indicated, respectively; black rectangles: repeats; diamonds: spacers; red rectangles: degenerated repeats; black arrows: pre-crRNA promoters. In type II-B, the localization of the pre-crRNA promoter in relation to the scaRNA is not known (see the paragraph ‘Role of type II CRISPR-Cas in virulence and origin of scaRNA’); the arrow represents only the direction of pre-crRNA transcription. Note the differences in the loci architecture with respect to cas gene composition, tracrRNA and repeat–spacer array transcription orientation and tracrRNA position. (C) Mechanisms of type II CRISPR-Cas systems. The classical DNA targeting pathway, common to all type II CRISPR-Cas systems (middle), involves co-processing of Cas9-stabilized tracrRNA:pre-crRNA duplexes by RNase III upon binding of tracrRNA anti-repeat to the pre- crRNA repeat, followed by trimming of crRNA by a yet unknown mechanism. The mature tracrRNA:crRNA guides the Cas9 endonuclease to introduce site-specifically dsDNA breaks in the invading DNA. The mechanism shown here for the type II-Aof S. pyogenes was also shown for the type II-A of S. thermophilus (22,51). The alternative DNA targeting mechanism (right), described in type II-C of N. meningitidis (38), does not involve RNase III co- processing due to transcription of a short crRNA directly from an upstream repeat-encoded promoter. In type II-B of F. novicida (39), the system evolved to possibly target endogenous mRNA expression (left). We hypothesize that similar to tracrRNA:crRNA-Cas9, the tracrRNA:scaRNA-Cas9 complex is first formed. The scaRNA in the complex would undergo trimming by unknown nucleases [the size of most abundant scaRNA forms isshorterthan predicted (39) according to RNAseq data (not shown)]. The tracrRNA:scaRNA-Cas9 further recognizes mRNA upon binding of the tracrRNA 3 region to the target mRNA leading to its degradation by an unknown mechanism. 6094 Nucleic Acids Research, 2014, Vol. 42, No. 10 a broad variety of cells and organisms demonstrates the is compatible with the involvement of RNase III, a primar- power of the system that upon further refinement could ily bacterial enzyme [except for the presence of this enzyme supersede such popular genome engineering tools as zinc in several mesophilic archaea (67)], in the function of these finger nucleases and Transcription Activator-Like Effector variants of CRISPR-Cas. In contrast, random distribution Nucleases (TALENs) (65,66). of type II systems across major bacteria phyla cannot be re- Given the high potential of RNA-guided Cas9 as a jected. However, in both the representative set and a larger tool for genome manipulation, the diversity of the type II set of complete and draft genomes (5865) (that was not used CRISPR-Cas systems across bacterial genomes is of special for statistical analysis because, unlike the representative set interest. We recently demonstrated high variability among of 661 representative genomes, these genomes could Cas9, dual-RNA structure and PAM sequences (22,31). not be considered independent), the type II CRISPR-Cas In addition, we characterized functional exchangeability system was not identified in many phyla, namely Cyanobac- among dual-RNA and Cas9 orthologs according to their teria, Chlorobi, (three major groups of photo- phylogenetically defined co-evolution (31). Thus, the col- synthetic bacteria), , , Deinococcus- lection of bacterial dual-RNA–Cas9 complexes associated Thermus and Chlamydia (Supplementary Table S3). with diverse specific PAM sequences broadens the func- We also detected statistically significant over- tional capabilities of the toolbox for multiplex engineering. representation of type II systems in host-associated Here we present an update on comparative genomics and (parasitic or commensal) bacteria (P = 4.3E−08) and phylogenetic analysis of type II CRISPR-Cas systems and under-representation in thermophilic bacteria (P = develop a hypothesis on the origin and evolution of all ma- 0.0019). These observations suggest that environmental jor components of these systems. traits substantially contribute to the distribution of type II CRISPR-Cas systems among bacteria. In particular, hori- zontal gene transfer (HGT) of this system might have been Type II CRISPR-Cas loci in bacterial genomes favored in environments with diverse bacterial communities As indicated above, cas9 is the signature gene of type II such as animal-associated microbiomes. Conversely, given CRISPR-Cas systems. The typical domain organization of that archaea lack type II CRISPR-Cas systems, HGT from Cas9 proteins is shown in Figure 2. Due to the abundance archaeal to bacterial thermophiles appears to be out of the of the two individual Cas9 nuclease domains in prokary- question, in accordance with the under-representation of otic genomes outside of the CRISPR-Cas loci, Cas9 alone type II among the latter. In agreement with these findings, is not a suitable probe for computational detection of type indications of the presence of type II CRISPR-Cas systems II CRISPR-Cas systems, especially given that stand-alone in diverse human- and animal-associated microbiomes distant Cas9 homologs have been identified12 ( ). To confi- have been reported (68–70). dently predict the presence of a complete type II CRISPR- Another notable observation is the narrow spread of Cas system in a genome, other components, such as cas1 type II-B systems (because of its low abundance, it is and cas2 genes, cognate CRISPR repeats and tracrRNA, not currently amenable for statistical analysis). This sys- have to be identified in the same locus, in addition to the tem is present only in some representatives of several presence of the two nuclease domains within Cas9, although bacterial genera, namely Francisella, Parasutterella, Sut- functionality of some apparently incomplete loci cannot be terella, Legionella, Wolinella () and Lep- ruled out. Here we present comparative genomic data for a tospira (Spirochaeta), most of which are pathogens or com- curated set of Cas9 sequences (including stand-alone cas9 mensals (Supplementary Table S1). genes) identified in currently available complete bacterial It is not uncommon for type II systems to be present in a genomes (Supplementary Table S1). genome along with other CRISPR-Cas systems. In our rep- To explore statistical trends in the distribution of type II resentative set of genomes, type II co-occurred with type I systems in bacteria, we selected a representative set of 661 CRISPR-Cas systems in eight genomes, with type III sys- genomes (the representative with the largest genome was tems in two genomes, and with both in the genome of the chosen for each genus, to minimize the potential bias toward alphaproteobacterium Tistrella mobilis that encompasses extensively sequenced groups) and assigned the three major subtypes III-A, I-C and II-C. The only other example of all CRISPR-Cas types to each of these genomes (Supplemen- three systems present in one genome is Azospirillum B510 tary Table S2). Of the three types, type II is the least abun- (III-A, I-C and II-C) from the full genome set. In addition, dant and is present only in ∼5% of the selected genomes (32 some Streptococcus thermophilus strains that are widely em- in the analyzed representative set), compared with ∼40% for ployed as models in CRISPR research encompass two type type I and ∼12% for type III. Of the 32 bacterial genomes II-A systems (that belong to two distinct groups; see details that encompass type II CRISPR-Cas, 10 possess a type II-A below) along with a type III-A system in LMD9 (23,27), system, 21 have a type II-C system and only one has a type and with a type I-E and a type III-A in DGCC7710 (71,72). II-B system. Type II is the only major type of CRISPR-Cas In Streptococcus pyogenes SF370, similar to some other S. systems that has not been detected in archaea. Although pyogenes strains, type II-A also is present along with I-C sys- type II CRISPR-Cas is relatively rare even in bacteria, sta- tems (23). In addition, a fusion between type I-C and type tistical analysis for the subset of these genomes that encode II-C systems has been detected in the genome of the uncul- at least one CRISPR-Cas system of any type indicates that tured Termite group 1 bacterium phylotype Rs D17 (72). the presence of type II in bacteria but not in archaea is not a random fluctuation (Chi-square P value = 0.005). The ex- clusive presence of type II CRISPR-Cas systems in bacteria Nucleic Acids Research, 2014, Vol. 42, No. 10 6095

A

RuvCR-rich I PAM-binding loop RuvC IIHNH RuvC III 1 60 94 448 501 718 775 909 1099 1368

Recognition lobe PAM interacting

D.G....G [R/K] RG E Y [D/E]H..N..N..K [GA] H..D

B

type II-B

Q S R type II-A

E GP S Q..Y RR G

EQGP S Q..Y RR G

EQ S Q..Y RR type II-C

EQ G

C Cyan7822_0783-like Cyan7822_6324-like YP_004464404

Zn finger R R R I rich II+III I rich II HNH III I rich II HNH III G G

ORF B (IS605) Fansor1

[RD] [RD] HTH Zn finger Zn finger R R I rich II+III* I rich II+III*

Figure 2. Schematic representation of Cas9 domain organization, motifs and relationships with distant homologs. (A) A general view of the domain architecture of Cas9. (B) Comparison of the domain organizations and conserved sequences motifs between the major groups of Cas9 proteins. (C) Domain architectures of distant homologs of Cas9. Homologous regions are shown by the same color. Compare with Supplementary Figure S8. The S. pyogenes Cas9 schematic representation with domains and domain boundaries according to the Cas9 structures (76,77) is shown in (A). See Supplementary Figure S4. Distinct sequence motifs are denoted by the corresponding conserved amino acid residues. The residues indicated in (A) are conserved in all five s9Ca groups and in (B), within the given subtype. Compare with Supplementary Figure S4. The size of a domain or a distinct region is roughly proportional to the length and the motifs are shown in accordance with their approximate position within a respective protein. The scheme was derived from the multiple alignments of each group. The color code to the left of the protein schematics in (B) corresponds to the major branches of the Cas9 phylogenetic tree in Figure 4. HTH: helix turn helix DNA-binding domain; R-rich: arginine-rich region; HNH: nuclease of the corresponding family.

Evolution of type II CRISPR-Cas subtypes 3A and Supplementary Figures S1 and S2; see Supplemen- tary Materials and Methods for details). The resulting tree The cas1 gene phylogeny and cas operon organization are strongly supports monophyly of the Cas1 sequences of type pivotal for the classification of subtypes of CRISPR-Cas II, with the exception of type II-B. The type II-B Cas1 se- systems. In a previous analysis, the majority of type II sys- quences form a clade within a subtree that consists mostly tems formed a clade in the Cas1 tree (5). With many more of type I-A but also includes representatives of other groups sequenced genomes now available, we updated this analysis from type I and type III (Figure 3A). Furthermore, several with a focus on the evolution of distinct subtypes of type II conserved protein sequence motifs that are characteristic of CRISPR-Cas. the Cas1 sequences of the major type II clade are absent in The first question we addressed was whether the mono- the type II-B group (Supplementary Figure S1). Notably, II- phyly of all three subtypes of type II was supported by the B is the only type II subtype that shares cas4 genes with sev- Cas1 phylogeny. We generated a representative set of Cas1 eral type I subtypes. We further addressed the origin of the sequences from all completely sequenced genomes to recon- cas4 and cas2 genes associated with the type II-B systems. struct a multiple alignment and a phylogenetic tree (Figure 6096 Nucleic Acids Research, 2014, Vol. 42, No. 10

A Cas1 0 Cas2 Cas4 76 86 98 94 92 84 65 100 75 89 Cas4 Paralog 98 18 78 63 96 82 95 80 43 12 93 90 87 76 47 96 2 84 31 0 97 85 77 97 79 51 72 54 70 15 24 81 93 8 74 14 97 73 96 43 92 96 66 88 82 92 61 72 59 89 59 99 95 83 52 99 97 33 87 85 93 69 49 78 0 87 23 68 60 88 97 46 52 92 88 91 16 79 88 100 95 86 72 82 76 97 82 72 74 89 73 85 97 84 87 86 51 41 99 96 98 78 17 80 83 98 99 15 60 59 18 80 99 7881

32 83 36 90 91 0.5 69 92 81 73 98 68 74 82 85 78

25 Color code: 94 77 84 86 75 72 99 98 50 84 98 type I type II type III Cas1 solo 4 38 83 80 72 71 A 54 13 A A 73 61 B B B 92 69 84 C X 99 C 95 64 67 100 68 D 31 89 33 88 E type X 55 F

90 77 U/C 88

100 21 35 83 76 35 83 89 32 87

85 93 67 86 79 83 48 76 92 72 93 99 78 80 79 99 81 96 82 86 85 61 82 99 69 B 0 85 94 32 I-A cluster 1 61 2 24-26 nt 38 98 66 86 64 88 42 1 97 0 91 bits 60 70 27 66 A A T TCA A G T A A AA C A C A C C T A C C GT G GG T G T T AA A A 85 CGT TG GATCTTTCT TTG GTCGC 78 80 A A A A 0 C T 98 G G G 74 68 91 63 0 64 87 27 1 93

71 85 I and III cluster 2 96 63 2 37 nt 60 62 68 46 1 100 bits 96 50 97 64 G A 84 57 T C C A C C T CAC G GGAA TGAT GA G A C CC G A A AGG A CT G GTGA T C C G C TCA U TA T ACTC C T T T T TT GT G GA 20 0 C G T T G 49 GT A T 60 90 7 2 90 43 100 25 68 84

62 91 II-B 83 2 37 nt

90 Cas1 solo 61 Cas1 solo2 96 1 99 100 CAS I-F bits 99 CAS I-F T A T A T GC CAS I-E TG G T T GG T A T G T A A A A G A T A A G C A G AG G C A G C C TT G T T T TT 85 76 CAS I-E C CCA G C A T TA C A AG 81 0 GTTTCA

0.5 0.5

Figure 3. Origin of type II-B CRISPR-Cas system. (A) The PSI-BLAST program was used to retrieve Cas1 protein sequences from 2262 complete genomes in the Refseq database. The BLASTCLUST program (length coverage cutoff 0.8; score density threshold 1.0) was used to select 205 representative sequences. The multiple alignment was built using the MUSCLE program (see Supplementary Materials and Methods for details). The FastTree program ([Jones- Taylor-Thornton (JTT) evolutionary model, discrete gamma model for site rates with 20 rate categories; see Supplementary Materials and Methods for details] was used for the tree reconstruction. The Cas2 and Cas4 phylogenetic trees were reconstructed using the FastTree program as indicated for the Cas1 tree above. The sequences of these families were chosen from the same genomic neighborhoods as the selected Cas1 representatives (a few incomplete sequences from both protein families were either omitted or replaced by closely related sequences from other species). Type II-B branches are indicated by the green arrow. The branches are colored according to the assignment of cas1 genes to CRISPR-Cas subtypes based on the analysis of 10 upstream and 10 downstream genes. X denotes systems of unknown type or those that are predicted to be derivatives of the respective system (when colored). The trees are shown only schematically, the complete trees are available in Supplementary Figure S2. (B) Logoplots of CRISPR repeats for the genomes that belong to several branches that are neighbors of the type II-B branch on the Cas1 phylogenetic tree. Clusters 1 and 2 are indicated by dashed lines. The type II-B (cluster 2) logoplot is shown separately. See details in Supplementary Figure S3.

To this end, we collected all cas2 and cas4 genes co-localized CRISPR-Cas systems but cluster with type II-B in the Cas1 with the representative cas1 genes and constructed phyloge- tree (Supplementary Figure S3). The repeats of type II- netic trees for the corresponding proteins. Although the tree B more closely resemble the repeats from these organisms of Cas2 is not as reliable as the Cas1 tree due to the low se- than those of the other type II systems including the char- quence conservation and small size of the Cas2 proteins, the acteristic length of the repeats, 37 nt, in contrast to the re- overall topology of this tree is largely consistent with that of peats of other type II systems most of which consist of 36 Cas1 and also shows clustering of type II-B with type I-A nucleotides and encompass distinct signature motifs (Fig- (along with several type I-B and type III sequences; Figure ure 3B, Figure 4 and Supplementary Figure S3; see also 3A). A similar observation was made as a result of the phy- a more detailed discussion below). These observations are logenetic analysis of Cas4 (Figure 3A). Thus, all three genes compatible with the results of independent clustering of the (cas1, cas2 and cas4) that type II-B CRISPR-Cas systems repeats that was based on repeat length and sequence simi- share with other CRISPR-Cas types clearly originate from larity analysis (73). Thus, cas1, cas2 and cas4 as well as the type I, most likely subtype I-A. CRISPR repeats of type II-B CRISPR-Cas systems appear We further examined the characteristics of CRISPR re- to originate from type I systems, probably via recombina- peats in the organisms that possess type I and/or type III tion. Given the extensive diversity of the type I CRISPR- Nucleic Acids Research, 2014, Vol. 42, No. 10 6097

Cas9Cas9 consensusconsensus llocusocus aarchitecturerchitecture repeatrepeat ssequenceequence aandnd llengthength

11849735218497352 FrancisellaFrancisella novicidanovicida U11212 9 1 2 4 2 II-BII-B 3434 254447899254447899 gammagamma proteobacteriumproteobacterium HTCC5015HTCC5015 as as as as 3737 nntt 9191 cas9c cas1c cas2c cas4c 331001027331001027 ParasutterellaParasutterella eexcrementihominisxcrementihominis YIYIT 118591859 ...... 1 100100 9999 319941583319941583 SutterellaSutterella wwadsworthensisadsworthensis 3 1 45B45B 1515 T A T A T GC TG G T T GG T A T G T tracrRNAtracrRNA A A A A G A Csn2Csn2 T A A G C A G AG G C A G C C TT G T T T TT C CCA G C A T AT C G 5429613854296138 LegionellaLegionella pneumophilapneumophila ststr. PParisaris 0 GTTTTTTCA A A 3455793234557932 WolinellaWolinella succinogenessuccinogenes DSMDSM 17401740 II typetype IIII / 2323 323463801323463801 StaphylococcusStaphylococcus pseudintermediuspseudintermedius ED99ED99 I s9 s1 ss22 n2 389815359389815359 PlanococcusPlanococcus antarcticusantarcticus DSMDSM 1450514505 cas9ca cas1ca caca csn2cs IIIIII II-AII-A ...... 9999 422884106422884106 StreptococcusStreptococcus ssanguinisanguinis SK49SK49 100100 1362219313622193 StreptococcusStreptococcus ppyogenesyogenes M1M1 GGASAS tracrRNAtracrRNA 4848 11662821316628213 StreptococcusStreptococcus tthermophilushermophilus LMD-9LMD-9 7878 2437980924379809 StreptococcusStreptococcus mmutansutans UA159UA159 9797 328956315328956315 CoriobacteriumCoriobacterium glomeransglomerans PW2PW2 4141 336394882336394882 LactobacillusLactobacillus farciminisfarciminis KCTCKCTC 33681681 I 7171 s9 s1 ss22 n2 224543312224543312 CatenibacteriumCatenibacterium mitsuokaimitsuokai DSMDSM 1589715897 cas9ca cas1ca caca csn2cs 258509199258509199 LactobacillusLactobacillus rhamnosusrhamnosus GGGG ...... 6363 310286728310286728 BifidobacteriumBifidobacterium bifidumbifidum S17S17 2 3636 nntt 9494 tracrRNAtracrRNA I 366983953366983953 OenococcusOenococcus kitaharaekitaharae DSMDSM 1733017330 9696 1 9090 7575 339625081339625081 FructobacillusFructobacillus ffructosusructosus KCTCKCTC 33544544 T C G G A AAT AA TT T T A A C C AT G T G C C T A TT A A A G C G C TAG C GC G C CT C TG T G A G G G T G T A T T T A T CT C C C G A T T A C G T G G 169823755169823755 FinegoldiaFinegoldia mmagnaagna ATCCTCC 2932829328 C A A A A AAA 0 GTTT G TGT C 4040 303229466303229466 Veillonellaeillonella atypicaatypica ACS-134-ACS-134-V-Col7a-Col7a 320528778320528778 SolobacteriumSolobacterium mooreimoorei F0204F0204 I 9999 2121 9 1 2 2 227824983227824983 AcidaminococcusAcidaminococcus sp.sp. DD2121 s s s n 2424 cas9ca cas1a cas2ca csn2cs 9090 8787 306821691306821691 EubacteriumEubacterium yuriiyurii ATCCTCC 4371543715 ...... 291520705291520705 CoprococcusCoprococcus catuscatus GD-7GD-7 tracrRNAtracrRNA 4040 3434 3476259234762592 FusobacteriumFusobacterium nucleatumnucleatum ATCCTCC 4925649256 374307738374307738 FilifactorFilifactor alocisalocis ATCCTCC 3589635896 9898 4343 304438954304438954 PeptoniphilusPeptoniphilus duerdeniiduerdenii ATCCTCC BAA-1640BAA-1640 7070 4252584342525843 Treponemareponema denticoladenticola ATCCTCC 3354055405 V II/I/ 9696 315659848315659848 StaphylococcusStaphylococcus lugdunensislugdunensis M23590M23590 V 2 s9 s1 s2 n2n2 3636 nntt 160915782160915782 EubacteriumEubacterium dolichumdolichum DSMDSM 39913991 cas9ca cas1ca cas2a cscs ...... 1 3131 11662754216627542 StreptococcusStreptococcus tthermophilushermophilus LMD-9LMD-9 IIII 315149830315149830 EnterococcusEnterococcus faecalisfaecalis TX0012TX0012 T A A A A C A A C A 8989 G G C AT G AT GTG A G AAG tracrRNAtracrRNA A T ACTG TT TA T TCTTGTT GCT 0 TTTTTTTT TA TTTT T AC 8484 238924075238924075 EubacteriumEubacterium rectalerectale ATCCTCC 3365633656 IVIV 2 4745886847458868 MycoplasmaMycoplasma mobilemobile 163K163K 9 1 2 2 363542550363542550 MycoplasmaMycoplasma ovipneumoniaeovipneumoniae SC01SC01 as9as as as sn 3636 nntt 9696 c cas1c cas2c csc 284931710284931710 MycoplasmaMycoplasma gallisepticumgallisepticum ststr. F ...... 1 9191 IVIV T T CT A 7189459271894592 MycoplasmaMycoplasma synoviaesynoviae 5353 T T AA CG T CA T AG AGG A 9898 9595 GACA TC A C GC TTTGATT tracrRNAtracrRNA 0 GTTTTTT G GTA AT T T AAC 8787 384393286384393286 MycoplasmaMycoplasma caniscanis PGPG 1414 9898 3455779034557790 WolinellaWolinella succinogenessuccinogenes DSMDSM 17401740 II-CII-C 218563121218563121 CampylobacterCampylobacter jejunijejuni NNCTCCTC 11168168 9999 291276265291276265 HelicobacterHelicobacter mustelaemustelae 1219812198 9090 296446027296446027 MethylosinusMethylosinus trichosporiumtrichosporium OB3bOB3b 7777 310780384310780384 IlyobacterIlyobacter ppolytropusolytropus DSMDSM 29262926

8 365156657365156657 BacillusBacillus smithiismithii 7 3 4747FAAAA 2 5555 220930482220930482 ClostridiumClostridium cellulolyticumcellulolyticum H10H10 3636 nntt 222109285222109285 AcidovoraxAcidovorax ebreusebreus TPSYTPSY 1 4949 218767588218767588 NeisseriaNeisseria meningitidismeningitidis Z2491Z2491 T 9494 T CT C TTT G T C AA A C T A C A T A T C A CTT A G A TA AT C G C A G C G T A A C GTC CC GCG CG AA A AAA C C CT T GC 8080 0 T A C 9999 1560299215602992 PasteurellaPasteurella multocidamultocida ststr. PPm70m70 G T C G TA T 9292 312879015312879015 AminomonasAminomonas paucivoranspaucivorans DSMDSM 1226012260 336393381336393381 LactobacillusLactobacillus coryniformiscoryniformis KCTCKCTC 33535535 3535 3939 8888 225377804225377804 RoseburiaRoseburia inulinivoransinulinivorans DSMDSM 1684116841 1919 403744858403744858 AlicyclobacillusAlicyclobacillus hesperidumhesperidum URH17-3-68URH17-3-68 297182908297182908 uncultureduncultured deltadelta proteobacteriumproteobacterium HF0070HF0070 07E1907E19 8484 325677756325677756 RuminococcusRuminococcus aalbuslbus 8 0 384109266384109266 Treponemareponema sp.sp. JJC4C4 407803669407803669 AlcanivoraxAlcanivorax sp.sp. W11-51-5 5656 427429481427429481 CaenispirillumCaenispirillum salinarumsalinarum AK4AK4 1414 8359179383591793 RhodospirillumRhodospirillum rubrumrubrum ATCCTCC 11170170 7373 344171927344171927 RalstoniaRalstonia syzygiisyzygii R24R24 2 402849997402849997 RhodovulumRhodovulum sp.sp. PPH10H10 3636 nntt 8888 1 9595 330822845330822845 AlicycliphilusAlicycliphilus denitrificansdenitrificans K601K601 9 1 2 as as as cas9c cas1c cas2c G C G C T CT G G CG A C C AC CG A A C A C 294086294086111 Cand.Cand. PuniceispirillumPuniceispirillum marinummarinum IMCC1322IMCC1322 T A G C C CTGA A ...... CTCCTTTG CC TGTCTG GTCGT GAAGGTCG C CT 0 GT T 8787 A 288957741288957741 AzospirillumAzospirillum sp.sp. BB510510 9999 tracrRNAtracrRNA 9595 100100 9210926292109262 NitrobacterNitrobacter hhamburgensisamburgensis X14X14 4040 148255343148255343 BradyrhizobiumBradyrhizobium sp.sp. BTAi1Ai1 0 159042956159042956 DinoroseobacterDinoroseobacter shibaeshibae DFDFL 1212 154250555154250555 ParvibaculumParvibaculum lavamentivoranslavamentivorans DS-1DS-1 9999 423317190423317190 BergeyellaBergeyella zoohelcumzoohelcum ATCCTCC 4437673767 30133013118691869 BacteroidesBacteroides sp.sp. 2200 3 9999 9292 38583858116091609 IgnavibacteriumIgnavibacterium albumalbum JCMJCM 16516511 9292 2 6068338960683389 BacteroidesBacteroides fragilisfragilis NCTCNCTC 99343343 3636 nntt 402847315402847315 PorphyromonasPorphyromonas sp.sp. ooralral ttaxonaxon 227979 ssttr. FF04500450 1717 9090 1 44-4844-48 nntt 6969 9999 404487228404487228 BarnesiellaBarnesiella intestinihominisintestinihominis YIYIT 118601860 A A A C AT AG T A T A T T G A A A G T CGA G A T A C A A A G C A A T C C G C GACG GCTAGA GTC TT CC CGCA T TGCT C T C G T T T T T AA A TA A G T A A 374384763374384763 OdoribacterOdoribacter llaneusaneus YIYIT 1206112061 0 G TGT T A C 347536497347536497 FlavobacteriumFlavobacterium branchiophilumbranchiophilum FL-15FL-15 8585 100100 345885718345885718 PrevotellaPrevotella sp.sp. CC561561 4242 282880052282880052 PrevotellaPrevotella timonensistimonensis CRISCRIS 55C-B1C-B1 2323 187250660187250660 ElusimicrobiumElusimicrobium minutumminutum Pei191Pei191 100100 325972003325972003 SphaerochaetaSphaerochaeta gglobuslobus ststr. BBuddyuddy 11792915817929158 AcidothermusAcidothermus cellulolyticuscellulolyticus 11B1B 315605738315605738 ActinomycesActinomyces sp.sp. ooralral ttaxonaxon 118080 ssttr. FF03100310 9999 9696 227494853227494853 ActinomycesActinomyces ccoleocanisoleocanis DSMDSM 1543615436 9191 189440764189440764 BifidobacteriumBifidobacterium longumlongum DJO10ADJO10A 0.20.2 187736489187736489 AkkermansiaAkkermansia muciniphilamuciniphila ATCCTCC BAA-835BAA-835 319957206319957206 NitratifractorNitratifractor salsuginissalsuginis DSMDSM 16516511

Figure 4. Cas9 phylogeny as a basis for type II system classification. The multiple alignment for the representative set of Cas9 sequences was constructed using the MUSCLE program followed by manual adjustment based on the results of pairwise alignments by PSI-BLAST, HHPRED and secondary structure predictions (see Supplementary Materials and Methods for details).

Cas systems, which is in a sharp contrast to the uniformity genomic neighborhoods (Figure 4). Both Cas9 and Cas1 and narrow spread of the type II-B systems, the opposite trees reveal monophyly of the II-A and II-B subtypes (Fig- direction of evolution can be effectively ruled out. ure 4). All the genomes that encode subtype II-A possess To investigate the evolution of type II CRISPR-Cas sys- either a csn2 gene or previously uncharacterized genes that, tems in greater detail, we performed phylogenetic analysis as shown below, encode derived members of the Csn2 fam- of Cas9 as described earlier (31). A representative set of ily (22). In both trees, the II-A branch is embedded within Cas9 sequences from the genomes that encompass type II type II-C in which the cas operon contains only the cas1, systems was constructed (see Supplementary Table S1 and cas2 and cas9 genes. Thus, type II-A appears to be a deriva- Supplementary Materials and Methods for details). This se- tive of type II-C, with the csn2 gene acquired by the type quence set was used to construct a phylogenetic tree from II-A ancestor rather than lost during the evolution of type the conserved sequence blocks that are present in all Cas9 II-C. The conservation of csn2 in all members of the mono- sequences (Figure 4 and Supplementary Figure S4). For phyletic type II-A group is most likely linked to its essential comparison, we also reconstructed a phylogenetic tree for function in adaptation. The absence of the csn2 gene in type the Cas1 proteins from the same operons as the selected II-C implies that in these systems adaptation occurs via a Cas9 (Supplementary Figures S5 and S6) and analyzed their 6098 Nucleic Acids Research, 2014, Vol. 42, No. 10 distinct molecular mechanism that might involve additional bind ATP or Mg2+, consistent with the experimental results bacterial factors. (15,16,19). We also found that previously identified lysine Comparative genomics as well as experimental studies residues involved in DNA binding are not highly conserved indicate that CRISPR-Cas loci are prone to HGT (5,7). even among the short forms of Csn2 (Supplementary Figure Apart from the hybrid origin of the type II-B systems (see S7). The C-terminal regions of the three Csn2 families are above), the present analysis reveals many likely cases of extremely diverse. The short variants lack the C-terminal HGT. Generally, neither the Cas9 nor the Cas1 phyloge- extension whereas in the other subfamilies these regions are netic trees agree with the phylogeny of the respective organ- not alignable. However, based on the secondary structure isms (Figure 4). Even at relatively short evolutionary dis- prediction, it can be suggested that at least three families tances, ample evidence of HGT is apparent (31). Further- are structurally similar whereas the fourth is clearly distinct more, when the Cas9 protein sequences were clustered by and might have an additional subdomain at the C-terminus sequence similarity using BLASTCLUST, many of the re- (Figure 5B). sulting clusters included bacteria of different classes (e.g. Clostridia and Bacilli in cluster 9; Supplementary Table S1). Origin of the enzymatic domains of Cas9 from common However, no clustering of sequences from different phyla transposon-related proteins was observed (e.g. no clusters that include both and Proteobacteria). Thus, barriers appear to exist that have The detailed analysis of Cas9 domains has been reported prevented frequent HGT between distantly related bacteria. previously (8,12). Three regions of homology with the RuvC-like nucleases (RNase H fold) were identified; the predicted RuvC-like domain of Cas9 is interrupted by in- Csn2, a common but fast evolving component of type II-A serts of variable lengths (Figure 1 and Supplementary Fig- CRISPR-Cas systems ure S4). An uninterrupted HNH (McrA-like) nuclease do- The next step in our exploration of the type II-A sub- main is located in the middle of the protein (12). The group (Figure 4) involved an in-depth analysis of several arginine-rich region recently proposed to be involved in genes in the respective operons that could not be confidently RNA binding (39) is present in all Cas9 proteins immedi- identified as Csn2 homologs using either PSI-BLAST or ately downstream of the first region of RuvC homology. search of the Conserved Domain Database (74) which in- The large N-terminal insert is predominantly alpha-helical cludes sequence profiles for all major domains of Cas pro- and shows the greatest variability among subtypes and even teins (5). These uncharacterized genes co-localize with cas9 within subtypes of type II CRISPR-Cas systems (Supple- genes that belong to three deep branches within the type mentary Figure S4). II-A clade, namely , Planococcus antarcticus, After the original version of this Survey and Summary Staphylococcus pseudintermedius and Staphylococcus lug- was submitted for publication, two independent studies dunensis (Figure 4)(22). Using PSI-BLAST, homologs of have been published reporting crystal structures of Cas9 these uncharacterized proteins in the Non-redundant pro- proteins (76,77). The first of these studies describes the tein sequence database were identified and a representa- structures of the type II-A Cas9 from S. pyogenes (PDB: tive set for each subfamily was constructed (see Supplemen- 4CMP) and the type II-C Cas9 from Actinomyces naeslundii tary Table S1 and Supplementary Figure S7 for details). (PDB: 4OGE) (77). The second study describes the struc- To search for remote sequence similarity, we used the HH- ture of the S. pyogenes Cas9 co-crystalized with the guide PRED method with query sequences representing each of RNA and the target DNA, and provides detailed infor- the three groups. In all three cases, HHPRED identified mation on the amino acid residues involved in the inter- significant similarity of these proteins to known Csn2 se- action with both RNA and DNA and the catalysis of the quences, suggesting that all three groups of uncharacterized DNA cleavage (76). Both reports (76,77) describe two-lobed proteins are diverged members of the Csn2 family (Supple- structures, with the target DNA and guide RNA positioned mentary Figure S7). This finding is in agreement with the in the interface between the two lobes. In agreement with positions of the respective Cas9 and Cas1 proteins in the the secondary structure prediction, the N-terminal lobe is phylogenetic trees and shows that Csn2 remains a perfect mostly alpha helical and could be divided into two sub- signature of type II-A. Superposition of the Csn2 protein domains whereas the second lobe encompasses two beta- architectures onto the phylogenetic trees of Cas9 and Cas1 stranded subdomains. Two specific loops in both lobes con- implies that the long form of Csn2 is ancestral whereas the tribute to the recognition of the PAM. Several portions of short form is derived (Figure 4). the Cas9 molecule are intrinsically disordered, at least un- We used the sequence similarity identified with HH- der certain conditions; in particular, this is the case of the PRED to adjust the multiple alignment of all five Csn2 sub- HNH domain prior to its interaction with the dual-RNA families (Figure 5 and Supplementary Figure S7). In agree- and DNA. Outside the RuvC and HNH domains, the Cas9 ment with the structure comparison results, HHPRED structure shows no structural similarity to other proteins. identified the similarity between the Csn2 sequences and The sequence downstream of the last RuvC homology re- various ATP-Binding Cassette (ABC) ATPases (Figure 5A gion belongs to the nuclease lobe of Cas9 and consists of and Supplementary Figure S7). The similarity with AT- two subdomains that adopt distinct alpha/beta secondary Pases is limited to the fold core including the region span- structures, and a loop that is involved in PAM recogni- ning the Walker A and B motifs (75). However, the char- tion. This region of Cas9 showed no significant sequence or acteristic amino acids involved in ATP and Mg2+ binding structural similarity to any known proteins (76,77) but con- are replaced and thus Csn2 proteins are not predicted to tains a small structural motif, the Greek key, that is found Nucleic Acids Research, 2014, Vol. 42, No. 10 6099

A * * * 3TOC_Secondary structure ---E-E------E--EEEEE--HHHHHHHHHHHH------H-H-HE-EEE------H---HHHHHHHHH--HHHHH---HHHHHHHH-HHHHHHHHHHHHH-HH 388603967 3TOC 13 DEPI-PL--R-GG--TILVLEDVCVFSKIVQYCYQ- 9 FDHKXKTIKE-S-EI-XLVTDILG----FD-V-NSS---TILKLIHAD--LESQFNEKPEVKSXID-KLVATITELIVFE-CL 11 374414489 3QHQ 10 DEPL-VL--S-NA--TILTIEDVSVYSSLVKHFYQ- 9 FDDKQKSLKA-T-EL-MLVTDILG----YD-V-NSA---PILKLIHGD--LENQFNEKPEVKSMVE-KLAATITELIAFE-CL 11 339625078 Fructobacillus fructosus 11 VGVL-KM--N-SG-LNLITVANSELYLRLIAKLS-- 9 SYDNKIKPLC-K-SG-LLIGDPVS----GR-D-IRG---FYSKFVPKI--LAQSIS--EEGIHELF-QLNQKIQSVIASE-II 12 372325148 Oenococcus kitaharae 11 D-PV-AL--D-TG-LNILSVEDTQLFFDFCQDF--- 9 TSGGKQLVMQ-K-NM-TFFGDVTR----VF-D-FNQ---LYLKAAIKT--LSKNFD--EMDLQQLD-DLNGQINRLFTKV-IF 12 310286726 Bifidobacterium bifidum 11 D-PV-QL--E-MG-LNVIGCASPKIFYELSAGL--- 9 GVGEKTVLAS-K-SI-IYIGDCAD----EV-D-YNA---LFQKLAVKK--IIEGFT--PEQISRLL-SLQAELKTIIQDE-IW 12 336394885 Lactobacillus farciminis 11 K-PF-EI--N-DG-ITLLNFQNIDEYSNLIFRINKL 18 NDDEPIEEHL-K-DI-SF IGNLSS----FN-L-NAS---SQMRIILKK--IIEDSQ---IMLNEIE-KVNADLNSVVFDS-IT 12 258509196 Lactobacillus rhamnosus 9 Q-LI-TT--K-PSVITVIDTGLPKVLLDLVQAFR-- 9 NEQMQIQELR-K-AS-LWIGDPVL----EL-D-LDK---LFQRLIIKK--VSQLIE--DRRLIELI-EQSQRLAMSLLRE-PL 12 328956318 Coriobacterium glomerans 9 ERPI-DI--E-PGLPTVLQVENSSLFTRLCLSLRSG 11 WRGEKEIATR-D- AL-MFIGDPML----LP-W-DDR---ALMGSISKR--FEKELIEDEELRRTLE-DAAVALSSQVASL-GL 12 374307735 Filifactor alocis 9 NMDI-EL--E-EERLNFLVIENPVKLEDFVLSIHSA 10 FEKFEKINFA-K-QV-DILFSPMS----IS-Y-HTK---EIQKKLLEY--VLEKVQ-DSYLAEDFI-DISGSLVSLCSKI-YL 12 42525846 Treponema denticola 9 KNQI-IF--S-ENKVNVLTIENKTFFTEFVSELFSQ 10 SEDDKELSIP-K-TC-EL ILNPIN----LD-I-NNK---QNLNKIYSI--LKQSIS-DEYNYLKTC-TIKSEILKHIDDV-LL 12 224543315 Catenibacterium mitsuokai 9 E-EI-DF--N-DQDNTILIIKNHKFYAHILSLLSSI 11 VIDGEK-NIT-T-NV-LLITDLYK----ID-F-NDK---NLLKEIYSE--ISKSITSDEEKCMRYQ-ELVDDLVAILENE-LE 12 I 227824980 Acidaminococcus sp. 9 QINL-TW--K-DYSVPTLIIENPHQFFEIINDLVLQ 11 SDKDKPVPLE-K-AF-EIL IDYLS----ID-S-VPK---MLLTKIQKR--MEFQAT-NGPLFENTQ-ILLAQMEHLAEEI-AM 11 169823758 Finegoldia magna 9 NEPF-FI--D-YGDICNITIENKLEYRKIIENIY-- 7 YDDDELITKH-K-----VIENSAI----LD-F-DDK---KITTKIIKD--LMEIIS-DEKHYEQKI-NLNSSIVNFINDI-TL 12 303229343 Veillonella atypica 9 ILEF-EV--L-RDKVTVVSFENRKVFRRMVTELDIQ 10 NDNDKAFNLD-K-YS-HIILNPLY----VD-V-NSK---SLLTKLQNQ--LTKEAL---LMTEEVA-DIVNRLHAFYYSL-EF 1 2 306821688 Eubacterium yurii 9 NFAA-NI--S-QDEMTVIIAENPKTFCDFVSNIREM 10 TDNSGAMIST-D- KC-DIILSPMD----FQ-L-NSK---KILNSIYKE--ISQIVK--EELWMQEQ-ELNNAIMDFLDRA-VL 1 2 320528781 Solobacterium moorei 9 FETL-DF--E-KG-SQTLVVENVELLRTIILQLKFQ 10 SDKDAILDIS-K- NI-LLITDVFE----IS-S-LSK---QLKNKLQQY--VESNYD-NDDLYQDLY-Q---KLIDFGNDL-IN 12 304438951 Peptoniphilus duerdenii 9 ENSL-EI--K-SKTINTLVVEDIRYFSIFLKGLIES 10 IEDYKTVDMT-K-YV-EI IFDIFN----LE-A-NSA---NILKKMYSE--LEEDLN-TQEVYTKKV-ELESIIANITDEL-IY 1 1 34762595 Fusobacterium nucleatum 9 NFKI-DF--E-EKNIFSLIVENKRAYRKIIEDLVNN 10 SKNNKLVIPE-K-EI-FIF SDYFN----FD-I-N-K---FVLNKYYKE--LKNLSE--NEFLNETL-EIKEILKDYINKL-I- 12 291520707 Coprococcus catus 9 SGEI-IS--E-SAAFTEWIIESPKDFSEYLHELYCQ 10 SQRDKILELS-K-YL-EII VNPFT----VE-I-NSR---KILNKLYIK--LEKVSK-EEQMYEETL-KLTAYIHEYLMQL-EQ 12 116101487 Streptococcus thermophilus 9 DEPL-EI--N-QG--TVLVIEDVSVFAQLVKEFYQ- 9 FDSKIRSIRS-S-EL-LLITDILG----YD-I-NTS---QVLKLLHTD--IVSQLNDKPEVRSEID-SLVSLITDIIMAE-CI 11 h h h h h h h h h h h h 3ZTH_Secondary structure -EEE-E------E--EEEE---HHHHHHHHHHHHHH ------EE-EEE--HHH----HHHH------HHHHHHHH--HH------HHHHHHH-HHHHHHHHHHHHH-HH 394986200 3ZTH 12 RIEL-NI--G-AI--TQIVGQNNELKYYTWQILSWY 26 EI---VKRSS-Y-HY-IDISSFKD----LLEQ-XEYK KGTLAQGYLRK--IVNQ-----VDIVGHL-EKINEQVELIEEA-XN 19 365905381 Lactobacillus versmoldensis 12 FFEL-DF--G-DI--VYISGYNHQNMWKIYRSLYYY 24 SP----STKN-T-QT-YFISNRDS----IYQQ-MIYK KGNLLFENLNT--LNDS-----FEITRIL-EELNDTNTKLSLL-VQ 17 422729713 Enterococcus faecalis 13 FIEF-SL--E-DY--VFFYGGENQWRRKILRTLKRF 26 KL----KAKS-I-DF-YFLEDNTS----IFQQ-MNFI KGTLLHQELLY--LQDD-----FDITNQM-SNLNDQLLKLENL-MN 17 406671115 Facklamia hominis 14 FLEF-EI--D-DY--VIITGNTYQLKREVNYALERL 25 NI----SDKE-A-CY-LSITQFSS----LEKV-YSIK KGSLLYDRLIR--LSQE-----KIITQQL-ERVNDELLRLEDQ-IN 17 406838713 Lactobacillus vini 12 YLNF-DA--G-LI--NFFLGPNSEIKWKLFRGLQRF 26 KL----NSKN-N-LI-MILDSRDS----FLKH-YTLS KGSLLLHYFEN--LKND-----FDLNKSI-YELNDKILNLETI-FQ 17 238924078 Eubacterium rectale 11 RFCF-EL--C-PV--TQLCGQNVRKKNYIFESLRRY 22 QV----GRKF-F-SV-LSISNKTD----IIQM-IKMTKQSLLMEYVKN--VIQD-----FDWQLHL-RNLSEEIEIMFQI-IN 16 II 381184142 Listeriaceae bacterium 13 HIEM-GF--S-QL--TQVVGDNTELKQQLWQNISWY 26 DL----SRSK-F-RT-IIIESISD----IHEL-LGAN KGTPTFDVLNQ--VMKN-----LEISKNI-EKINQEIIQIENQ-LN 21 116100822 Streptococcus thermophilus 12 RIEL-NI--G-AI--TQIVGQNNELKYYILQILSWY 26 EI---VKRSS-Y-HY-IDISTFKD----LLEQ-MEYKKGTLAHGYLRK--IVNQ-----VDIVGHL-EKINEQVELIEEA-MN 19 h h h h h G h h h h h h h h K h h h h h h h h --EE-EEE---EE--EEEEE--HHHHHHHHHHHHHH EE---EEE------EEEE------HH-HHHHHH--HHHHH-HHH--HHHHHH---HHHHHHH-HHHHHHHHHHHH--EE 389815356 Planococcus antarcticus 8 FEVI-PLVEG-EV--NVIAFADLEKQFAYRQLVKSW 13 TIEGRE-VAA-K- ELNILSLQGGG----LE-FFKDKH--HAKQL-IEL--MKFELENSPELLEVFT-VLQGELEKAFEQC-VV 14 386318633 Staphylococcus pseudintermedius 8 TEVL-EIKEQ-SL--NIITFEDTDLCGMFIQNFMDY 14 ISDDYETINS-K-NFHPI ILNGGD----RD-VLGQKL--IEEKV-YQI--FEEQIIKCETTNYEFL-KFEQSLTHLKSSL-GI 14 414160480 Staphylococcus simulans 8 NKPI-EIVEQ-KL--NVISFQNTSKLNALTDGLIKY 14 VADNYQTILP-N-NVNLI IVPCGD----IN-LFDSKI--FKEAL-YHR--FEQQIFNEETTNLLYN-KVESTIESFLNSL-EI 14 III c h h E h NhI F hh h a c h h h h G hh Kh h c h h hc h a h h h h ------E--EEEEE----HHHHHHHHHHHH ------H----H--E-EEE---HHHHHHH----HHH--HHHHH-HHH----HH-HHHHHHHHHHH-HHHHHHHHHH--H-HE 193216855 arthritidis 9 ECENPNF--K-GI--VTIATS--DSDAFLRELYLWE 11 EHKDIYL----K-DC-FIFSQLTKSYELFN-L-SAKN--FLTKR-IYDDEL-WD-DSKVINFSVLE-KICENI NENIEED-FL 5 350546883 Mycoplasma iowae 12 NYEI-SF--N-GI--KIFQVR--NPNKFSIDFINQA 5 NDVNIFK----N-NI-ILISDFTKINDLIS-L-NKSS--QIFKE-IIN--Y-IE-TNKILGNDKIH-NIKEYINSKIGFD-VI 5 385326557 Mycoplasma gallisepticum 7 NQEIDNF--N-GL--LLIDVK--NTSEYLKRLFMYE 10 NNVNVDI----S-DC-LIITPFSKYSDLIS-Y-TAKN--VF-TK-LLGNIN-FE-HDKILNEEYLDKEVVAKLNETLGRD-II 5 490549417 Mycoplasma arginini 9 NFNLEAK--Y-NF--KIVEED--NTDYLIRNLIDYE 10 NKNTFSI----R-DC-LIISNLSYVSDILN-S-ISKS--WIQEK-INNNEN-WN-TNIILDQEKVE-TIIEQINLKIGFE-YL 5 363542553 Mycoplasma ovipneumoniae 9 NNTVQLK--T-NV--KIIETE--NTDLFLSHLFNYE 10 MSRKISL----K-DS-VLITNLTKFSEFLS-L-SAKN--WLGDS-IINDEY-WS-ESIFFRNEKID-GIVDKI NQKLGFE-FL 5 IV 71894595 Mycoplasma synoviae 7 KGLVKNY--N-GI--VCIETN--NSDLFIRKLFEFE 10 NNNKYSI----K-DF-IIIDNLTKYHDLYN-F-NSKG--LL-NQ-WINDLD-FE-NQKIANEKLVL-EIKNLLNNKIGFE-FV 5 419703977 Mycoplasma canis 8 NKLIENF--S-GV--ACINVD--NTTAYLRKLYEYE 9 GGNEISI ----E-DF-MIISPLTTLDSLYS-F-TNKN--IL-NK-IIGEHN-LD-TDKLINKNYLQ-NIANKV NKKIGYD-LV 5 h h h hh hhh ho hh h h h hh h h hN h c hh --EE-EE--E----EEEEE----HHHHHHHHHHHHH HHHH--H-----EEE-EEE----H----HH-H-HHHH--HHHHH-HHH--HHHHHHH---HHHHHH-HHHHHHHHHHHHH--- 315659845 Staphylococcus lugdunensis 14 -PNI-EL--N-IGKYTAIYNEYETIEEDLLTIVDGY 16 LNQE--LVSHHLFES-VIIDNNLI----EN-E-HLLG--ASSIL-NKK--IQRDFVNNLESNGYLQ-SINSLLEDLLELLNYT 10 403411239 Staphylococcus aureus 14 -PKV-IL--N-TSVYTSIYNEDSTHEEYILDMLKNY 16 SDHE--RVSHNRFRG-FLLDHSTI----EN-E-HQLS--SSSIL-TKK--IVREMQSDLEIDGYIN-SINVLLDDLRDNLKGT 9 425737239 Staphylococcus massiliensis 26 -NHV-TL--N-LGEYTSIYNENETNEELILSLIYQY 15 IEDV--EVSSKSHVA-FMISHNDI----EQ-E-HQLA--ASSIL-NKK--LQYDMNQNIEVSGYIN-SINVLMEDLLEEVKRK 9 V h L N u YTuIYNE T EE hL hh Y VS a u hhh I E E H Lu uSSIL KK h ch hE GYh SIN LhcDL c h 4I99_Secondary structure EEEE------E--EEEE-----HHHHHHHHHHHH EE HHHH HH HHHHHHHHHHHH 444302296 4I99 20 VVIP-FS--K-GF--TAIVGANGSGKSNIGDAILFV 85 ------ISP-D-GY-NIVLQGDITKF-----IKXSPLERRLLIDdis------53 134105243 2ONK 17 NVDF-EM--GRDY--CVLLGPTGAGKSVFLELIAGI 19 ------PER-R-GI-GFVPQDYALFP-----HLSVYRNIAYGLRNVE------13 126031361 2O5V 19 PGTL-NF--PEGV--TGIYGENGAGKTNLLEAAYLA 58 ------LPR-G-GA-VWIRPEDSELvfgppsgRRAYLDSLLsrlsar------98

* 3TOC_Secondary structure -HHH--HHHHH---- -HHHHHHHHH----H------EEEEE--HHH--HH-HHHHHHHHHHH------EEEEE------EEEE------E 388603967 3TOC TILE--LIKSLGVKV 8 -EKCLEILQI----FKY-----LT----KKKLLIFVNSGAFLTKD-EVASLQEYISLTN-----LTVLFLE------PRELYD--FP------QYILDEDYFLI 219 374414489 3QHQ TILE--LIKALGVKI 8 -EKCFEIIQV----YHY-----LT----KKNLLVFVNSGAYLTKD-EVIKLCEYINLMQ-----KSVLFLE------PRRLYD--LP------QYVIDKDYFLI 216 339625078 Fructobacillus fructosus ELER--LLHSQNIVV 9 -GKIEEVIHV----MGM-----LD----ESRYLFLTNASLYCKLS-DLN QLHECLLAEG-----VNLISID---LSLEKPNFSDEHYC------HFHIDQDFVLY 223 372325148 Oenococcus kitaharae DLME--TIKSRKIVI 9 -DKIRSTIDL----AAE-----LS----DPHLLVFTNAFQYLTMG-QTNQLAEYSRSME-----RNVLFLE----RFDSKIDLTEAAN------SYFFDKDFVQF 220 310286726 Bifidobacterium bifidum DLKA--AIAMAKPRI 9 -DKISTAVDT----AGA-----LA----ESRMLVTLHTTQYCDND-QLLYLHRTLLRYQ-----LQLLDLE---CCSKKVMLTE--GM------SHYVDEDYVQF 219 336394885 Lactobacillus farciminis NYEK--LLKSKDIKI 9 -DKIMDVISF----YSD-----FT----NKKMLIFNNIGRLLNVN-QLNEIHAYLKSVD-----LKLVSLESYPMIFKEEKLN---AK------VYSIDNDHVRF 232 258509196 Lactobacillus rhamnosus KLEQ--IMKYCNVHF 9 -SKVEAAIKT----LAK-----LE----EKRLVIFTNISHYLDVD-SLNKLLDQVKETS-----LELLLIE-FSDVERRKFFER--CQ------YIYIDKDFMDS 221 328956318 Coriobacterium glomerans DLKR--FIKAFGFGA 9 -DNLIDFLSF----VLD-----IQ----CSKVMVFVNLKTFLTEI-ELEKLYEHVFFSK-----TRVLLLE---NKADPNRHEN--EI------KTTIDLDFIEK 226 374307735 Filifactor alocis EVED--ILKLMDIRL 8 -ERLIDYSKT----MFE-----LL----HKNIFILVGCSGFLTKI-DFEYLRQHQFSQK-----ILFLFVE----SHQIDLESP--KK------QTIIDIDLCEI 222 42525846 Treponema denticola DLIS--IFKAVNISF 8 -EKIIDYISA----SVE-----LN----QIKCFIFINLKQF LHFK-DLSELYKFACYIK-----TNLFLIE---GSYTSTKEDC--EK------SYIIDKDLCEI 223 224543315 Catenibacterium mitsuokai TIEN--YLKAVSLKI 9 -DRLMNYVEL----VTE-----LV----KKPVLVLYNCFDYLDDE-ELMELIKYKNYKH-----LHLLFIE-----NKNREVDSIAFR------KYIIDEDLCED 224 I 227824980 Acidaminococcus sp. TVGN--LLKALGITI 8 -EKIFVLMDL----ISR-----FE----GDKIFILVNLRSYLSSE-EIQLFVDTALAHD-----LKFLLID----NRRYPLFHQ--EK------RVLIDNDLCVL 222 169823758 Finegoldia magna NFES--LLKVVPIKP 8 -DKFVSFLDI----YQN-----VL----KIDVFYTINLLQYFTEE-ELKELSDYIKIKN-----ICIINYD------NILVDSNYISK------KLLFDNDLCRV 214 303229343 Veillonella atypica GAAE--IIKLGSFNF 8 -MDLMSYIEV----VDT-----LL----SPMIYIMINLDLVLNDD-EINAFYQNMLSRQ-----LRLVCLT--TGSFDRESLDKNLIN------GYILDNDFCII 224 306821688 Eubacterium yurii DTQD--FLKLYKLNL 8 -EKLINYVKI----ISQ-----VS----FTKCLFLVNIHDFLSFK-DLDMLYEIALYND-----VSLIVIE----SHQREKSSY--EH------TYILDDNNCLI 221 320528781 Solobacterium moorei TKID--MIKLLDIQL 8 -EEIVDYIDI----FSQ-----II----KTKLFVFISLRSYLTGE-EFNNFIKIMDYKG-----IRILLIE-----RYLSLDNSIVDN------VQIIDKDLCVI 219 304438951 Peptoniphilus duerdenii NYQN--LFKAIEIEF 8 -ERLIEYIKV----TSE-----LL----KTKVYIIVNLDSFLSEE-ELVEL EKFLLYND-----IKVLALQ---NAIRREVIPS--EN------LRIVDKDLCEI 222 34762595 Fusobacterium nucleatum DVSQ--ILKAFSIKF 7 -LNLFEWLKI----LNE-----IL----GYEIFFFINIENFLSED-ELLEFSKFIVYNK-----YKVVFLE---NFYRNKLSDN--DN------LIVIDNDLCEI 219 291520707 Coprococcus catus EIPA--LLKAVGIKH 8 -ERIIRYVKI----AVE-----VL----STKVFVFVNIRSYLTDE-QIQEL IKEIRYQE-----AKILFME----SQERACLEG--GM------RYIIDRDGCEI 222 116101487 Streptococcus thermophilus TLLE--LIKALGVRI 8 -EKIFDILQI----FKY-----LV----KKRILVFVNSLSYFSKD-EIYQILEYTKLSQ-----ADVLFLE------PRQIEG--IQ------QFILDKDYILM 215 hh+ h h h h hhhh hh h h hh h hD Dh h 3ZTH_Secondary structure -HHH--HHHH------HHHHHHHHH----HHH-----HH------EEEEEE--HHH--HH-HHHHHHHHHHHHHHH---EEEEEE------HHHEEEE-----EE 394986200 3ZTH TLDQ--LLTKNFSPF 16 -DKLSLFLEX----LDH-----LLSQTTEKYLIVLKNIDGFISEE-SYTIFYRQICHLVKKYPNLTFILFP------SDQGYL--KIDEENSRFVNILSDQ--VE 262 365905381 Lactobacillus versmoldensis TYFD--VLKYFLKVS 16 -SLVDELLNL----IES-----SLKKANNDTWIVLYNLDSFVSRS-AKKTILN KLKEFTDRF-ELKVIYLG------NSLTDC--SITSAETEKIVVAADE--FH 256 422729713 Enterococcus faecalis NFQD--ILKYSLSLN 16 -ELLAEFIYL----LQR-----SISRREQPVWLVITNPESFLETE-SIH YLFQQLKKLAQETKQLKFFIFS------NRSLPL--PYTEEDVEKTILLYDK--YQ 260 406671115 Facklamia hominis EFEE--LIKSYSSYI 16 -LAYQDMLEL----IKF-----RVEEEGRLVILLL-DLEPFRTINFEVAEFITQLIEI ANNTHLIKLIIFD------DGDLNQ--KI-PLSISKTIIVTDD--IQ 259 406838713 Lactobacillus vini EFDE--ILKKLLIIN 16 -KTIEDFCRL----LRN-----FLISESKPIWLWIKNPNSFLAKG-SLKKFIAVLRKITNETKLLRVFIVS------DDFLDI--AYEPEDMENTVLIYNE--IQ 259 II 238924078 Eubacterium rectale DVWD--MVQKSEVS- 13 -ELILILLNI----VDN-----VLKNNPKKTLILFENLDHLVSLK-EYVEIIHAADTISKKY-DLYFIFSS------SLNKYV--CCDIDFLEGISVFGDV--YF 248 381184142 Listeriaceae bacterium EFNSKILLQKNVSLM 16 -TKYMSLLDL----LEI-----NLESSSEMILLMIKNIDDTLSYQ-EYEEVMNSMKRLTTKHPNFYCMVFP------SQIGYV--YISEEHIESILVVGEE--AH 266 116100822 Streptococcus thermophilus TLDQ--LLTKNFSPF 16 -DKLSLFLEM----LDH-----LLLQTTEKYLIVLKNIDGFISEE-SYTIFYRQICHLVKKYPNLTFILFP------SDQGYL--KIDEENSRFVNILSDQ--VE 262 h hh h h h h h hhhh hh h h h hh hh h HHHH--HHHH----H HHHHHHHHHHHHHHH------EEEEE--HHH--HH-HHHHHHHHHHHH------EEEEEEE--HHHHHH------EEEEEE---- 389815356 Planococcus antarcticus SFEQ--LIKIAPIEV 9 AYQYQSFLLKSWLHLVH-----NK----ATNVCFYDFPEN ELNKS-ELKELFQFATSQA-----CTMICLTTSPQVINQVGLS------NVHLIKRTGE 234 386318633 Staphylococcus pseudintermedius KLRK--LLKLFGLDF 9 LNKRKAFIDL----LIH-----PT----KENILIILFPESHLGIH-DIKKFIHLIKRYK-----FTTIIFTNHPSIIINEE------NIFLCKSNKK 230 414160480 Staphylococcus simulans VLKK--LLGMFKPDF 9 IEQRKLFIDT----LTS-----TE----KENVLVLLFPEAHLGLS-DMKAFNKYLQTLN-----MFSIIVTSHPYFIIESD------TLALIKTNGE 230 III hc Lh hh ch h Fh L Nh hh FPE cL chK h hh I hT P hI hhL K c -HHH--HHHHH-H-- -HHHHHHHHH----HH------EEEEEEE------HHHHHHHH------EEEEEEE--HHHHHHHHHHHHHHHHH------EEEEE----- 193216855 Mycoplasma arthritidis -FNK--MLKTF-FKV 8 -DLFFKWLE-----N-Y-----DEV---GQKVVILKNLPWVK-----IEDLVQYL --DK-----FVFIILTDN-FFDNCIRFEALETCAFWEKGR--LVLVHS--- 213 350546883 Mycoplasma iowae -MQK--IISNL-FEI 8 -NIFISLIKN----N------LFD---ERMTFVFCDIDWLN-----INDISEYI--NS-----HNFIFVTND-LRKNIVNITQIESCVICNDF---LVEIYD--- 206 385326557 Mycoplasma gallisepticum -YSK--ILKSI-IKI 7 -EFIYSYLEL----L-H-----WSK---EVKTVILKDFDNID-----LKRLSNLT--QT-----TNLIIMKNDIIYDLEKNFEFFETYCLIDHDYENMIKIDN--- 213 490549417 Mycoplasma arginini -NNK--LIKSI-FTI 8 -NNLLNILDI----L-S-----SSN---FKPLIIVKDLTYIN-----IEEL IKY---SN-----LNFLILTSD-FTKYIQSYNQLELVSFYNGEI--LFDIKS--- 212 363542553 Mycoplasma ovipneumoniae -NVK--LIKAI-FNI 8 -ENFENLAEI----I-F-----STM---KQTIVILKDLDYIK-----FEKLFK Y---NN-----VTFLILTND-FTKYISKFSELELVAFYSQKT--VIDVID--- 212 IV 71894595 Mycoplasma synoviae -NSK--YLKYL-FNL 8 -KSLIKWMEN----Q-K-----YNN---QKINLIIKNFDFVL-----INELIKFS--NN-----FNIIVLTND-FFKHINNFDYIESVAFTNEDLNRIVSIEE--- 212 419703977 Mycoplasma canis -TSK--LFKSM-FDI 8 -SLLYSFLRN----I------SQA---EKQNIIIKDMDDIS-----LSLLTEFS--ND-----FNIIVMTNN-SFNIIKSSSEVLLTNFVDENLYTLKNIDN--- 211 K h h h h hhc hhh h h h hhhh h h h h -HHH--HHHHH-EEE -HHHHHHHHH----HHHHHHHHH------EEEEEEE------HH-HHHHHHHHHH------EEEE------HHHHHHHH------EEE------315659845 Staphylococcus lugdunensis -IKQ--FVKMLKFEY 10 -VRIEQVLPL----VIEELKKQHN----NQLLLIYLYPEANLSPK-EQIRL KQLLISLD-----VNIIVLT------GSMHFLAQQWKNN---NYIRYD----- 235 403411239 Staphylococcus aureus -LKQ--FIKLLSFKY 10 -TRLEQVVPL----VVDEMNYQVK----EKAFVIYLYPEAGLSIK -EQIRFRKILNNLP-----VPVIVLT------ESPRFLSESIKGL---NYFINN----- 234 425737239 Staphylococcus massiliensis -IKQ--FIKQLNFEY 10 -VRLKQLLPL----LVEQMNNLSN----NKGLLIYLYPEANLSIK-ERL IFREVLNDLP-----IKIIVLT------QSRYFLADSIDGL---NYLRNS----- 245 V hKQ FhK L FcY RhcQhhPL hhc h hhhIYLYPEA LS K E h h+ hL L h hIVLT S FLu hc NYh 4I99_Secondary structure --HH--HHH-EEEEE -HHHHHHHHH----HHH-----HHH-HH---EEEEE------HHH-HHHHHHHHHHHHHH-----EEEEE HHHH EEEEEE EEE 444302296 4I99 PEDP--FSGGLEIEA 16 -EKALTALAF----VFA-----IQK-FKPAP FYLFDEIDAHLDDA-NVKRVADLIKESSK---ESQFIVIT------LRDVXX--AN----ADKIIGVSXrdKVV 330 134105243 2ONK ------KL 16 -ERQRVALAR----ALV-----I-----QPRLLLLDEPLSAVDLK-TKGVLMEELRFVQREFD-VPILHVT------HDLIEA--AMl---ADEVAVMLNgriVE 207 126031361 2O5V TVTG--PHRDDLLLT 16 -TVALALRRA----ELEL----LREKFGEDPVLLLDDFTAELDPH-RRQYLLDLAASV------PQAIVTG------TEla------pGAALTLRAQAgrfTP 347 B I Csn2 short ATPase core fold ATPase core fold

II Csn2 long ATPase core fold ATPase core fold

Csn2 long ATPase core fold ATPase core fold III Planococcus subfamily

Csn2 long ATPase core fold ATPase core fold IV Mycoplasma subfamily

Csn2 long ATPase core fold ATPase core fold V Staphylococcus subfamily

Figure 5. Multiple alignment of Csn2 subfamilies and comparison of their specific structural elements.A ( ) The multiple sequence alignment was con- structed using the MUSCLE program for each Csn2 subfamily, separately. The alignments were then superimposed on the basis of conserved regions identified by HHPRED with some manual adjustment based on secondary structure predictions (see Supplementary Materials and Methods fordetails). The alignment with several ATPase sequences is based on Vector Alignment Search Tool (VAST) structural alignments with the structure of Csn2 of S. thermophilus (3ZTH) (17) used as a query (see Supplementary Materials and Methods for details). . The sequences are denoted by their GI numbers and species names. Secondary structure predictions and the secondary structure elements mapped to the respective crystal structures of the Csn2 long and short subfamilies are shown above the alignment for each Csn2 family. The positions of the first and last residues of the aligned region in the corresponding protein are indicated for each sequence. The numbers within the alignment represent poorly conserved inserts that are not shown. Secondary structure prediction is shown as follows: H indicates ␣-helix and E indicates extended conformation (␤-strand). The positions strongly conserved in three families with a larger number of representatives are shown by reverse shading. The coloring is based on the 70% consensus built for a larger alignment (Supple- mentary Figure S7). Specific 90% consensus is also shown underneath the alignment for each family: ‘h’ indicates hydrophobic residues (WFYMLIVA), ‘c’ indicates charged residues (EDKRH) and ‘s’ indicates small residues (AGS). (B) Schematic representation of structures (actual and predicted) of five distinct Csn2 subfamilies. Cylindrical shape represents ␣-helix and arrow ␤-strand. 6100 Nucleic Acids Research, 2014, Vol. 42, No. 10 in numerous OB-fold domains in diverse RNA and DNA- of their small size and low conservation, comparison of binding proteins (77). Here we show several additional mo- these sequences might provide insights into the evolution of tifs that are conserved in different Cas9 subfamilies and these components within the subtypes and reveal potential might be of interest for future mutagenesis studies, espe- exchange of repeats between CRISPR-Cas loci. Recently cially in the context of recent structural data (Figure 2 and we reported a detailed analysis of tracrRNA anti-repeats Supplementary Figure S4). and their base-pairing with cognate repeat sequences within Previously, several families of distant Cas9 homologs the three type II subtypes and delineated several subgroups that are encoded outside of CRISPR-Cas loci have been based on the similarity of repeat lengths and localization identified (12). These proteins contain the RuvC-like and of the tracrRNA genes (22). Here, we identified full-length arginine-rich domains, and in some cases, also contain the tracrRNA sequences for 63 of the 88 representative type II HNH domain (Figure 2). We updated the set of these loci and validated the CRISPR repeat sequences accord- uncharacterized Cas9 homologs that are encoded in the ing to their transcription direction (Supplementary Table current collection of sequenced genomes. Ktedonobacter S4). The updated information about the consensus locus racemifer remains the species with the largest number of architecture is summarized in Figure 4. We also provide stand-alone Cas9 homologs (49). Furthermore, 15 more a thermal map representation of the matrix of similarity genomes encode 10 or more paralogs suggesting that these within the CRISPR repeat and tracrRNA sequences (Sup- genes belong to uncharacterized mobile elements. To ad- plementary Figure S9). The pairwise comparison of repeat dress this possibility, we performed additional searches sequences results in a grouping of repeats that is consistent aimed at the identification of distant homologs of Cas9. with the Cas9 clustering. The repeat sequences within the Indeed, HHPRED searches detected significant similarity groups share similar characteristics, especially at the 5 and between the Cyan7822 6324-like subfamily of Cas9 ho- 3 ends (Figure 4). Generally, CRISPR repeats of type II mologs (GI: 297585104 from Bacillus selenitireducens was system are weakly palindromic (i.e. presumably unfolded), used as the query) and OrfB protein (also known as TnpB) and typically 36 nt in length. However, as already pointed from a variety of transposons of the IS605 family (Supple- out above, type II-B repeats are 37 nt long with a distinct mentary Figure S8). The similarity covers the N-terminal motif and with only the 5-terminal part of the repeats con- RuvC-like motif and the arginine-rich motif. The reverse served. Some longer repeats of up to 48 nt in length were search started from a TnpB protein (GI: 386713960 IS1341- found exclusively in a subset of type II-C loci, mainly in type transposase from Halobacillus halophilus) identified a . Except one analyzed locus, all the repeat se- highly significant similarity with the RuvC nuclease and quences start with G. many RuvC homologs. The TnpB superfamily has recently Despite considerable divergence of the respective Cas9 been analyzed in detail, and many homologs of these pro- proteins and the existence of several clearly defined sub- teins, dubbed Fanzors, have also been detected in diverse groups, which correlate with distinct subfamilies of Csn2 eukaryotic mobile elements (78). However, the similarity of proteins and with distinct position and orientation of tracr- the conserved motifs identified in the TnpB–Fanzor family RNA, all CRISPR repeats of type II-A share a clear pattern to those of RuvC-like nucleases was missed in these analy- of similarity.This is less obvious in the type II-C where a cer- ses. The TnpB–Fanzor proteins do not contain transposase tain degree of similarity is only observed within subclusters domains, and it has been shown that OrfB is not required (Figure 4 and Supplementary Figure S9). for the transposition of IS605 transposons (79). The TnpB As discussed above, type II-C is extremely diverse in genes are often associated with various transposases but terms of sequence similarity of protein components, RNA there are also many transposons that encompass the TnpB components and repeats. Most of the repeats, except the gene only (78). Accordingly, it has been hypothesized that Bacteroidetes cluster with longer repeat sequences, show such TnpB-only transposons employ a transposase in trans conservation of the last nucleotide (T, rather than C like (78). It seems likely that both families of distant homologs in type II-A; Figure 4). The repeat–spacer array is in most of Cas9 retain this ability and are transposable with the cases transcribed from the strand opposite to the cas operon help of transposases of other mobile elements. Such mobil- and the first, not the last (as in many type I and most type ity would explain the extensive proliferation of these genes II-A systems), repeat is degenerated (Supplementary Fig- in many genomes. In accordance with this hypothesis, we ure S10). N. meningitidis and C. jejuni contain functional identified several cases when these proteins are associated promoters within the repeats (38), specifically, inside the with group II introns [e.g. GI: 428315656 Oscillatoria PCC AT-rich, conserved 3 part of the repeats of at least two 7112, which transposes via a reverse transcription mecha- groups of type II-C. However, the lack of perfect conserva- nism (80)]. The numerous Cas9 and TnpB homologs that tion suggests that these promoters either are absent in other contain a RuvC-like nuclease domain and an arginine-rich genomes encoding the type II-C system or are very weak. region can be predicted to possess DNase activity, the role The latter possibility is consistent with recent experimental of which in the life cycle of the respective transposons re- observations (38). mains to be determined. The tracrRNA is required for both RNase III-dependent pre-crRNA processing and interference with DNA and has to be complementary to the cognate repeat (23). Tracr- Evolution of type II associated CRISPR repeats and tracr- RNAs were identified in the majority of representative RNA type II CRISPR-Cas systems in various locations within Although CRISPR repeats and tracrRNA are much less CRISPR loci and in different orientations with respect to amenable to phylogenetic analysis than cas genes because the cas operon and the repeat–spacer array, although usu- Nucleic Acids Research, 2014, Vol. 42, No. 10 6101 ally closely related genomes share similar characteristics tions within RNAs resulting in the maintenance of both (22). The length of the predicted tracrRNAs is highly vari- base-pairing and the secondary structure of the ds region able (#72–171 nt) and cannot be confidently aligned even [Supplementary Figure S11; (23)]. Our recent results further for some closely related loci (Supplementary Table S4). The support the co-evolution of the type II crRNAs with Cas9 pairwise comparison further corroborates the diversity in (31). Orthologous Cas9 proteins can utilize non-cognate tracrRNA sequences showing clusters of similar tracrRNA tracrRNA and crRNA as guide sequences only when these sequences only for very closely related loci (Supplementary RNAs originate from loci with highly similar Cas9 se- Figure S9). Base-pairing of tracrRNA and CRISPR repeats quences, as exemplified for the groups of S. pyogenes, S. typically involves long sequence stretches, although only mutans and S. thermophilus (CRISPR3) and the pair of N. within tracrRNA anti-repeats. The diversity of tracrRNA meningitidis and P. multocida (31). The RNAs and Cas9 length and nucleotide sequences upstream and downstream proteins from more distantly related loci can still be ex- of the anti-repeats (22,23), together with the functionality changeable although with lowered cleavage efficiency (N. of substantially truncated tracrRNA-derived parts of sin- meningitidis or P.multocida with C. jejuni;seeFigure4). The gle guide RNAs that have been used for genome editing, same study suggests the involvement of the secondary struc- suggest that the tracrRNA sequences outside the functional ture of tracrRNA anti-repeat:crRNA repeat duplex in the anti-repeat can be essentially random (33). However, the re- specific recognition by Cas9 based on similar structure char- cent structural study of S. pyogenes Cas9 in complex with acteristics shared among exchangeable RNAs (31). This hy- single guide RNA and target DNA (76) demonstrates the pothesis is corroborated by the recent structural study that importance of certain structural features within tracrRNA. indicates the importance of the asymmetric bulge of tracr- A comparative analysis of DNA cleavage efficiency in hu- RNA within the S. pyogenes anti-repeat:repeat duplex (76). man cells using a series of mutations of the single guide A mutation in the tracrRNA that removes the bulge and RNA based on the original S. pyogenes dual-tracrRNA- a mutation in the bulge sequence both abrogate the DNA CRISPR repeat suggests that Cas9 utilizes most efficiently cleavage activity of the Cas9 complex. Moreover, the bulge long RNAs with structural features of naturally occurring has been shown to directly interact with Cas9 (76). Accord- tracrRNA, namely a three short stem-loop structure fol- ing to in silico predictions, this bulge is a conserved struc- lowing the anti-repeat (76). Notably, the first stem-loop lo- tural feature of RNAs from type II-A loci closely related to cated immediately downstream of the repeat:anti-repeat du- that of S. pyogenes (22) including the exchangeable RNAs plex seems to be essential for the recognition of the guide of S. pyogenes, S. mutans and S. thermophilus (CRISPR3) RNA by Cas9 whereas deletion of the second and third (31). stem-loop structures substantially decreases but does not abolish the DNA cleavage activity. The solved structure of Role of type II CRISPR-Cas in bacterial virulence and origin the complex further shows that all these structural elements of scaRNA interact with Cas9 (76). Mutations that affect the stem se- quence but not the structure have only a slight negative ef- The ability of CRISPR-Cas to limit horizontal transfer of fect on the DNA cleavage efficiency. Even the heavily mu- mobile genetic elements can impact bacterial fitness and tated guide RNA with almost one-third of the sequence ex- pathogenicity. An inverse relationship between the presence changed within the duplex and all three stem loops, but with of a CRISPR-Cas locus and acquired resistance preserved secondary structure, remained functional in the has been described in Enterococcus faecalis (81). Similarly, assay, albeit with a 2-fold lower efficiency (76). Here, we an association between CRISPR spacer content and an- modeled the putative secondary structures of tracrRNAs tibiotic susceptibility in S. pyogenes has been reported (82). base-paired with cognate CRISPR repeats for all the pre- By targeting multiple temperate phages, CRISPR-Cas can dicted sequences (Supplementary Tables S4 and S5). Al- limit the acquisition of virulence genes in S. pyogenes (23). A though Cas9 binding might affect the guide RNA structure comparative analysis of S. pyogenes genomes demonstrates and thus the predictions are unlikely to be fully accurate, we an inverse correlation between the presence of CRISPR-Cas identified the three similar stem-loop structures with short systems and the amount of CRISPR spacers with the num- G/C-rich stems downstream of the repeat:anti-repeat du- ber of integrated prophages (23,83). plex in RNAs from loci closely related to S. pyogenes, e.g. Remarkably, type II CRISPR-Cas systems have been in S. mutans, S. thermophilus, Listeria innocua, Coriobac- shown to affect virulence in C. jejuni, F. novicida and L. terium glomerans or Lactobacillus farciminis. Similar short pneumophila (41). In the type II-B system of F. novicida stem-loops were also found in other type II-A and many U112, scaRNA has been recently identified as part of a type II-C RNAs. Along the same lines, closely related loci complex that includes also Cas9 and tracrRNA. The three of N. meningitidis and P. multocida seem to encode RNAs components are involved in the repression of the mRNA with similar secondary structure characteristics, namely two of a lipoprotein that contributes to the virulence of this stem-loops downstream of the repeat:anti-repeat duplex. bacterium (39). The regulation is likely to involve inter- Despite the diversity and low conservation of tracrRNA action between sequences located within tracrRNA and and crRNA repeat sequences, there are some indications of the mRNA of the lipoprotein FTN 1103, a process that their co-evolution (22,23). The functionality of tracrRNA would trigger mRNA degradation. We analyzed the mul- requires complementarity of the tracrRNA anti-repeat with tiple alignment of the CRISPR loci from the genomes of the CRISPR repeat. A comparison of predicted secondary two F. novicida strains, U112 and 3523. The sequences from structures of tracrRNA antirepeat:crRNA repeat duplexes the two strains contain a variable region upstream of the among closely related species shows compensatory muta- CRISPR array, around the scaRNA sequence. The vari- 6102 Nucleic Acids Research, 2014, Vol. 42, No. 10

scaRNA SAM cas9 cas1cas2cas4 +9 F. novicida U112 ... . tracrRNA F. novicida Fx1 .

+12 F. novicida 3523 ... . .

pseudo transposase F. tularensis SCHU S4

transposase pseudopseudopseudo F. holarctica LVS pseudo

+18 W. succinogenes DSM1740 ...

+29 L. pneumophila str. Paris ...

+8 S. wadsworthensis 3_1_45B ...

? P. excrementihominis YIT11859

+22 gamma proteobact. HTCC5015 ...

Figure 6. A schematic representation of the scaRNA-tracrRNA locus in Francisella strains. The type II-B CRISPR-Cas locus architecture of representative species (see Figure 4) and diverse Francisella species is shown. Red and yellow arrows: tracrRNA and scaRNA with indicated confirmed22 ( ) or predicted transcription direction, accordingly; black rectangles and green diamonds: repeat–spacer arrays; red rectangles: degenerated repeats; white diamonds: putative spacers of degenerated arrays. Degenerated array spacers with the scaRNA promoter and transcriptional terminator are shown in yellow. Putative promoters of repeat–spacer arrays are shown with dotted arrows. The scaRNA-encoding spacer–repeat–spacer unit was found only in two of the analyzed strains and is incomplete in F. novicida 3523, lacking transcriptional terminator-encoding spacer. Note also the degenerate repeats that are commonly found at the 5-end of the repeat–spacer array. See Supplementary Figure S12. able region contains highly degenerated but regularly inter- transcript would be prone to mutations and recombination spaced repeat sequences, most likely a remnant of an old, and accordingly would have the potential for the acquisition not currently active CRISPR array, which is longer in F. of a new function. In the case of F. novicida U112, this re- novicida 3523 than in the U112 strain (Figure 6 and Supple- gion might have acquired its own promoter and terminator mentary Figure S12). The scaRNA sequence covers one de- and started to be transcribed as a single RNA, still retain- generated repeat completely and another one partially. The ing the ability to interact with both tracrRNA and Cas9. scaRNA appears to be transcribed from a promoter located The promoter and terminator sequences could have been in the putative spacer sequence to a transcriptional termi- acquired as CRISPR spacer units while the array was still nator located within the next putative spacer sequence. In F. functional. novicida 3523, only the 5 part of the scaRNA sequence con- taining the degenerated repeat that can base-pair with tracr- RNA is conserved, whereas the 3 part of the sequence is CONCLUSIONS absent or not conserved and the transcriptional terminator Since the demonstration of the activity of type II CRISPR- cannot be confidently predicted. Thus, scaRNA might not Cas system in bacterial adaptive defense against phages and be expressed as a single RNA species in F. novicida 3532. We plasmids (13,14), and the discovery of the essential role of hypothesize that scaRNA is not a novel small RNA tran- tracrRNA in crRNA maturation and DNA targeting by script but rather a fragment of a highly degenerated, old Cas9 (23,33), type II CRISPR-Cas systems have been ac- CRISPR array that might be present only in genomes of tively studied. Thanks to the minimalist architecture of the closely related isolates of Francisella (e.g. F. novicida Fx1). interfering complex that consists of the single, multidomain This fragment would still be transcribed (albeit with sub- endonuclease Cas9 guided by dual-tracrRNA:crRNA, type stantially reduced efficiency) but not processed into mature II is the obvious top choice of a CRISPR-Cas system for crRNAs. The degenerated sequence cannot serve as a tem- application in genome engineering. Despite the major dif- plate for new spacer integration either. Such an inactivated ferences in the mechanisms of crRNA maturation and ap- Nucleic Acids Research, 2014, Vol. 42, No. 10 6103 paratus required for target DNA cleavage, type I, type II FUNDING and type III systems share important mechanistic details. US Department of Health and Human Services (tothe Such shared features between different types of CRISPR- National Library of Medicine) [K.S.M., E.V.K.]; Swedish Cas systems include the role of PAM as both a prerequi- Research Council [K2010-57X-21436-01-3, K2013-57X- site for target DNA binding and a motif that is required for 21436-04-3, 621-2011-5752-LiMS to E.C.]; Kempe Foun- spacer acquisition and self–non-self discrimination (types dation [#SMK-1136.1]; Umea˚ University [Dnr: 223-2728- I and II) as well as R-loop formation during DNA tar- 10, Dnr: 223-2836-10, Dnr: 223-2989-10 to E.C.]; Labora- geting (types I and II). The duplex structure of the natu- tory for Molecular Infection Medicine Sweden, the Umea˚ ral dual-tracrRNA:crRNA and the minimalist single engi- Centre for Microbial Research, the Helmholtz Associa- neered guide RNA (sufficient for Cas9 activation) can both  tion and the Alexander von Humboldt Foundation, Federal be considered mimics of the 3 hairpin of mature crRNAs Ministry of Education and Research [E.C.]. Funding for that are transcribed from palindromic repeats in CRISPR- open access charge: US Department of Health and Human Cas types I and III (Figure 1). Services (to the National Library of Medicine); Swedish We present evidence that reinforces the previously pub- Research Council [K2010-57X-21436-01-3, K2013-57X- lished hypothesis on the origin of type II system (12). 21436-04-3, 621-2011-5752-LiMS]; Kempe Foundation; First, it is shown that type II-B systems evolved by re- Umea˚ University [Dnr: 223-2728-10, Dnr: 223-2836-10, combination between Cas9 and a type I CRISPR-Cas Dnr: 223-2989-10]. locus. Second, we describe the homology between Cas9 Conflict of interest statement. None declared. and transposon-encoded-predicted nucleases. Conceivably, transposons were the original source of the nuclease moi- eties of Cas9, and the mobility of these elements would REFERENCES have greatly facilitated recombination. Although there are 1. Makarova,K.S., Wolf,Y.I. and Koonin,E.V. (2013) Comparative some mechanistic similarities of the interference processes genomics of defense systems in archaea and bacteria. Nucleic Acids between type I and type II CRISPR-Cas systems, the N- Res., 41, 4360–4377. and C-terminal inserts in the recently determined structures 2. Barrangou,R. and Horvath,P. (2012) CRISPR: new horizons in of Cas9 do not show any structural similarity with sub- phage resistance and strain identification. Annu. Rev. Food Sci. units of Cascade complexes or any other proteins. There- Technol., 3, 143–162. 3. Wiedenheft,B., Sternberg,S.H. and Doudna,J.A. (2012) RNA-guided fore, the currently available Cas9 structures do not directly genetic silencing systems in bacteria and archaea. Nature, 482, contribute to deciphering the origin of Cas9 beyond the 331–338. two nuclease domains. In that regard, the structure of type 4. van der Oost,J., Jore,M.M., Westra,E.R., Lundgren,M. and II-B Cas9, which is the largest protein in the family and, Brouns,S.J. (2009) CRISPR-based adaptive and heritable immunity in . Trends Biochem. Sci., 34, 401–407. as shown here, belongs to the system in which other com- 5. Makarova,K.S., Haft,D.H., Barrangou,R., Brouns,S.J., ponents most likely originated from type I CRISPR-Cas, Charpentier,E., Horvath,P., Moineau,S., Mojica,F.J., Wolf,Y.I., could be informative. Of further interest will be structures Yakunin,A.F. et al. (2011) Evolution and classification of the of distant homologs of Cas9 which are encoded outside CRISPR-Cas systems. Nat. Rev. Microbiol., 9, 467–477. CRISPR-Cas loci. 6. Jansen,R., Embden,J.D., Gaastra,W. and Schouls,L.M. (2002) Identification of genes that are associated with DNA repeats in The evolutionary and functional plasticity of type II prokaryotes. Mol. Microbiol., 43, 1565–1575. CRISPR-Cas systems is further demonstrated by the ap- 7. Haft,D.H., Selengut,J., Mongodin,E.F. and Nelson,K.E. (2005) A parent acquisition of the dual-RNA-guided RNA interfer- guild of 45 CRISPR-associated (Cas) protein families and multiple ing enzyme function by the Cas9 protein of Francisella (39). CRISPR/Cas subtypes exist in prokaryotic genomes. PLoS Comput. Biol., 1, e60. This transition would be accompanied by the recruitment of 8. Makarova,K.S., Grishin,N.V., Shabalina,S.A., Wolf,Y.I. and the transcript of a degraded array of CRISPR repeat–spacer Koonin,E.V. (2006) A putative RNA-interference-based immune to tracrRNA as a novel guide dual-RNA. The contribution system in prokaryotes: computational analysis of the predicted of this derived type II system to the pathogenicity of Fran- enzymatic machinery, functional analogies with eukaryotic RNAi, cisella and the involvement of type II CRISPR-Cas in the and hypothetical mechanisms of action. Biol. Direct., 1,7. 9. Makarova,K.S., Aravind,L., Grishin,N.V., Rogozin,I.B. and virulence of C. jejuni (40), N. meningitidis (39)andL. pneu- Koonin,E.V. (2002) A DNA repair system specific for thermophilic mophila (41), taken together with the significant enrichment archaea and bacteria predicted by genomic context analysis. Nucleic of type II systems among pathogens and commensals, sug- Acids Res., 30, 482–496. gest that rewiring of this system for functions distinct from 10. Reeks,J., Naismith,J.H. and White,M.F. (2013) CRISPR interference: a structural perspective. Biochem. J., 453, 155–166. antivirus defence is a general evolutionary trend. The poten- 11. Koonin,E.V. and Makarova,K.S. (2013) CRISPR-Cas: evolution of tial involvement in the regulation of bacterial pathogenicity an RNA-based adaptive immunity system in prokaryotes. RNA Biol., is another important incentive for detailed study of type II 10, 679–686. CRISPR-Cas systems. 12. Makarova,K.S., Aravind,L., Wolf,Y.I. and Koonin,E.V. (2011) Unification of Cas protein families and a simple scenario forthe origin and evolution of CRISPR-Cas systems. Biol. Direct., 6, 38. SUPPLEMENTARY DATA 13. Barrangou,R., Fremaux,C., Deveau,H., Richards,M., Boyaval,P., Supplementary Data are available at NAR Online. Moineau,S., Romero,D.A. and Horvath,P. (2007) CRISPR provides acquired resistance against viruses in prokaryotes. Science, 315, 1709–1712. ACKNOWLEDGMENTS 14. Garneau,J.E., Dupuis,M.E., Villion,M., Romero,D.A., Barrangou,R., Boyaval,P., Fremaux,C., Horvath,P., Magadan,A.H. K.C. was a fellow of the Austrian Doctoral Program in and Moineau,S. (2010) The CRISPR/Cas bacterial immune system RNA Biology. cleaves bacteriophage and plasmid DNA. Nature, 468, 67–71. 6104 Nucleic Acids Research, 2014, Vol. 42, No. 10

15. Nam,K.H., Kurinov,I. and Ke,A. (2011) Crystal structure of 35. Deng,L., Kenchappa,C.S., Peng,X., She,Q. and Garrett,R.A. (2012) clustered regularly interspaced short palindromic repeats Modulation of CRISPR locus transcription by the repeat-binding (CRISPR)-associated Csn2 protein revealed Ca2+-dependent protein Cbp1 in Sulfolobus. Nucleic Acids Res., 40, 2470–2480. double-stranded DNA-binding activity. J. Biol. Chem., 286, 36. Carte,J., Wang,R., Li,H., Terns,R.M. and Terns,M.P. (2008) Cas6 is 30759–30768. an endoribonuclease that generates guide RNAs for invader defense 16. Koo,Y., Jung,D.K. and Bae,E. (2012) Crystal structure of in prokaryotes. Genes Dev., 22, 3489–3496. Streptococcus pyogenes Csn2 reveals calcium-dependent 37. Nam,K.H., Haitjema,C., Liu,X., Ding,F., Wang,H., DeLisa,M.P. and conformational changes in its tertiary and quaternary structure. Ke,A. (2012) Cas5d protein processes pre-crRNA and assembles into PLoS ONE, 7, e33401. a cascade-like interference complex in subtype I-C/Dvulg 17. Lee,K.H., Lee,S.G., Eun Lee,K., Jeon,H., Robinson,H. and Oh,B.H. CRISPR-Cas system. Structure, 20, 1574–1584. (2012) Identification, structural, and biochemical characterization of 38. Zhang,Y., Heidrich,N., Ampattu,B.J., Gunderson,C.W., Seifert,H.S., a group of large Csn2 proteins involved in CRISPR-mediated Schoen,C., Vogel,J. and Sontheimer,E.J. (2013) bacterial immunity. Proteins, 80, 2573–2582. Processing-independent CRISPR RNAs limit natural transformation 18. Arslan,Z., Wurm,R., Brener,O., Ellinger,P., Nagel-Steger,L., in Neisseria meningitidis. Mol. Cell, 50, 488–503. Oesterhelt,F., Schmitt,L., Willbold,D., Wagner,R., Gohlke,H. et al. 39. Sampson,T.R., Saroj,S.D., Llewellyn,A.C., Tzeng,Y.L.and Weiss,D.S. (2013) Double-strand DNA end-binding and sliding of the toroidal (2013) A CRISPR/Cas system mediates bacterial innate immune CRISPR-associated protein Csn2. Nucleic Acids Res., 41, 6347–6359. evasion and virulence. Nature, 497, 254–257. 19. Ellinger,P., Arslan,Z., Wurm,R., Tschapek,B., MacKenzie,C., 40. Louwen,R., Horst-Kreft,D., de Boer,A.G., van der Graaf,L., de Pfeffer,K., Panjikar,S., Wagner,R., Schmitt,L., Gohlke,H. et al. Knegt,G., Hamersma,M., Heikema,A.P., Timms,A.R., Jacobs,B.C., (2012) The crystal structure of the CRISPR-associated protein Csn2 Wagenaar,J.A. et al. (2013) A novel link between Campylobacter from Streptococcus agalactiae. J. Struct. Biol., 178, 350–362. jejuni bacteriophage defence, virulence and Guillain-Barre syndrome. 20. Zhang,J., Kasciukovic,T. and White,M.F. (2012) The CRISPR Eur. J. Clin. Microbiol. Infect. Dis., 32, 207–226. associated protein Cas4 Is a 5 to 3 DNA exonuclease with an 41. Gunderson,F.F. and Cianciotto,N.P. (2013) The CRISPR-associated iron-sulfur cluster. PLoS ONE, 7, e47232. gene cas2 of Legionella pneumophila is required for intracellular 21. Makarova,K.S., Anantharaman,V., Aravind,L. and Koonin,E.V. infection of amoebae. mBio, 4, e00074-13. (2012) Live virus-free or die: coupling of antivirus immunity and 42. Mali,P., Yang,L., Esvelt,K.M., Aach,J., Guell,M., DiCarlo,J.E., programmed suicide or dormancy in prokaryotes. Biol. Direct., 7, 40. Norville,J.E. and Church,G.M. (2013) RNA-guided human genome 22. Chylinski,K., Le Rhun,A. and Charpentier,E. (2013) The tracrRNA engineering via Cas9. Science, 339, 823–826. and Cas9 families of type II CRISPR-Cas immunity systems. RNA 43. Mali,P., Aach,J., Stranges,P.B., Esvelt,K.M., Moosburner,M., Biol., 10, 726–737. Kosuri,S., Yang,L. and Church,G.M. (2013) CAS9 transcriptional 23. Deltcheva,E., Chylinski,K., Sharma,C.M., Gonzales,K., Chao,Y., activators for target specificity screening and paired nickases for Pirzada,Z.A., Eckert,M.R., Vogel,J. and Charpentier,E. (2011) cooperative genome engineering. Nat. Biotechnol., 31, 833–838. CRISPR RNA maturation by trans-encoded small RNA and host 44. Ran,F.A., Hsu,P.D., Lin,C.Y., Gootenberg,J.S., Konermann,S., factor RNase III. Nature, 471, 602–607. Trevino,A.E., Scott,D.A., Inoue,A., Matoba,S., Zhang,Y. et al. 24. Yosef,I., Goren,M.G. and Qimron,U. (2012) Proteins and DNA (2013) Double nicking by RNA-guided CRISPR Cas9 for enhanced elements essential for the CRISPR adaptation process in Escherichia genome editing specificity. Cell, 154, 1380–1389. coli. Nucleic Acids Res., 40, 5569–5576. 45. Cong,L., Ran,F.A., Cox,D., Lin,S., Barretto,R., Habib,N., Hsu,P.D., 25. Datsenko,K.A., Pougach,K., Tikhonov,A., Wanner,B.L., Wu,X., Jiang,W., Marraffini,L.A. et al. (2013) Multiplex genome Severinov,K. and Semenova,E. (2012) Molecular memory of prior engineering using CRISPR/Cas systems. Science, 339, 819–823. infections activates the CRISPR/Cas adaptive bacterial immunity 46. Perez-Pinera,P., Kocak,D.D., Vockley,C.M., Adler,A.F., system. Nat. Commun., 3, 945. Kabadi,A.M., Polstein,L.R., Thakore,P.I., Glass,K.A., 26. Swarts,D.C., Mosterd,C., van Passel,M.W. and Brouns,S.J. (2012) Ousterout,D.G., Leong,K.W. et al. (2013) RNA-guided gene CRISPR interference directs strand specific spacer acquisition. PLoS activation by CRISPR-Cas9-based transcription factors. Nat. ONE, 7, e35888. Methods, 10, 973–976. 27. Horvath,P., Romero,D.A., Coute-Monvoisin,A.C., Richards,M., 47. Gilbert,L.A., Larson,M.H., Morsut,L., Liu,Z., Brar,G.A., Deveau,H., Moineau,S., Boyaval,P., Fremaux,C. and Barrangou,R. Torres,S.E., Stern-Ginossar,N., Brandman,O., Whitehead,E.H., (2008) Diversity, activity, and evolution of CRISPR loci in Doudna,J.A. et al. (2013) CRISPR-mediated modular RNA-guided Streptococcus thermophilus. J. Bacteriol., 190, 1401–1412. regulation of transcription in eukaryotes. Cell, 154, 442–451. 28. Bhaya,D., Davison,M. and Barrangou,R. (2011) CRISPR-Cas 48. Bikard,D., Jiang,W., Samai,P., Hochschild,A., Zhang,F. and systems in bacteria and archaea: versatile small RNAs for adaptive Marraffini,L.A. (2013) Programmable repression and activation of defense and regulation. Annu. Rev. Genet., 45, 273–297. bacterial gene expression using an engineered CRISPR-Cas system. 29. Mojica,F.J., Diez-Villasenor,C., Garcia-Martinez,J. and Nucleic Acids Res., 41, 7429–7437. Almendros,C. (2009) Short motif sequences determine the targets of 49. Qi,L.S., Larson,M.H., Gilbert,L.A., Doudna,J.A., Weissman,J.S., the prokaryotic CRISPR defence system. Microbiology, 155, 733–740. Arkin,A.P. and Lim,W.A. (2013) Repurposing CRISPR as an 30. Deveau,H., Barrangou,R., Garneau,J.E., Labonte,J., Fremaux,C., RNA-guided platform for sequence-specific control of gene Boyaval,P., Romero,D.A., Horvath,P. and Moineau,S. (2008) Phage expression. Cell, 152, 1173–1183. response to CRISPR-encoded resistance in Streptococcus 50. Mali,P., Esvelt,K.M. and Church,G.M. (2013) Cas9 as a versatile tool thermophilus. J. Bacteriol., 190, 1390–1400. for engineering biology. Nat. Methods, 10, 957–963. 31. Fonfara,I., Le Rhun,A., Chylinski,K., Makarova,K.S., 51. Gasiunas,G., Barrangou,R., Horvath,P. and Siksnys,V. (2012) Lecrivain,A.L., Bzdrenga,J., Koonin,E.V. and Charpentier,E. (2013) Cas9-crRNA ribonucleoprotein complex mediates specific DNA Phylogeny of Cas9 determines functional exchangeability of cleavage for adaptive immunity in bacteria. Proc. Natl. Acad. Sci. dual-RNA and Cas9 among orthologous type II CRISPR-Cas U.S.A., 109, E2579–E2586. systems. Nucleic Acids Res., 42, 2577–2590. 52. Cho,S.W., Kim,S., Kim,J.M. and Kim,J.S. (2013) Targeted genome 32. Sapranauskas,R., Gasiunas,G., Fremaux,C., Barrangou,R., engineering in human cells with the Cas9 RNA-guided endonuclease. Horvath,P. and Siksnys,V. (2011) The Streptococcus thermophilus Nat. Biotechnol., 31, 230–232. CRISPR/Cas system provides immunity in Escherichia coli. Nucleic 53. Jinek,M., East,A., Cheng,A., Lin,S., Ma,E. and Doudna,J. (2013) Acids Res., 39, 9275–9282. RNA-programmed genome editing in human cells. Elife, 2, e00471. 33. Jinek,M., Chylinski,K., Fonfara,I., Hauer,M., Doudna,J.A. and 54. Niu,Y., Shen,B., Cui,Y., Chen,Y., Wang,J., Wang,L., Kang,Y., Charpentier,E. (2012) A programmable dual-RNA-guided DNA Zhao,X., Si,W., Li,W. et al. (2014) Generation of gene-modified endonuclease in adaptive bacterial immunity. Science, 337, 816–821. cynomolgus monkey via Cas9/RNA-mediated gene targeting in 34. Haurwitz,R.E., Jinek,M., Wiedenheft,B., Zhou,K. and Doudna,J.A. one-cell embryos. Cell, 156, 836–843. (2010) Sequence- and structure-specific RNA processing by a 55. Hai,T., Teng,F., Guo,R., Li,W. and Zhou,Q. (2014) One-step CRISPR endonuclease. Science, 329, 1355–1358. generation of knockout pigs by zygote injection of CRISPR/Cas system. Cell Res., 24, 372–375. Nucleic Acids Research, 2014, Vol. 42, No. 10 6105

56. Shen,B., Zhang,J., Wu,H., Wang,J., Ma,K., Li,Z., Zhang,X., by a metagenomic survey of the rumen microbiome. Environ. Zhang,P. and Huang,X. (2013) Generation of gene-modified mice via Microbiol., 14, 207–227. Cas9/RNA-mediated gene targeting. Cell Res., 23, 720–723. 71. Horvath,P. and Barrangou,R. (2010) CRISPR/Cas, the immune 57. Wang,H., Yang,H., Shivalila,C.S., Dawlaty,M.M., Cheng,A.W., system of bacteria and archaea. Science, 327, 167–170. Zhang,F. and Jaenisch,R. (2013) One-step generation of mice 72. Makarova,K.S., Wolf,Y.I. and Koonin,E.V. (2013) The basic building carrying mutations in multiple genes by CRISPR/Cas-mediated blocks and evolution of CRISPR-Cas systems. Biochem. Soc. Trans., genome engineering. Cell, 153, 910–918. 41, 1392–1400. 58. Hwang,W.Y., Fu,Y., Reyon,D., Maeder,M.L., Kaini,P., Sander,J.D., 73. Kunin,V., Sorek,R. and Hugenholtz,P. (2007) Evolutionary Joung,J.K., Peterson,R.T. and Yeh,J.R. (2013) Heritable and precise conservation of sequence and secondary structures in CRISPR zebrafish genome editing using a CRISPR-Cas system. PLoS ONE, repeats. Genome Biol., 8, R61. 8, e68708. 74. Marchler-Bauer,A., Anderson,J.B., Chitsaz,F., Derbyshire,M.K., 59. Gratz,S.J., Cummings,A.M., Nguyen,J.N., Hamm,D.C., DeWeese-Scott,C., Fong,J.H., Geer,L.Y., Geer,R.C., Gonzales,N.R., Donohue,L.K., Harrison,M.M., Wildonger,J. and Gwadz,M. et al. (2009) CDD: specific functional annotation with the O’Connor-Giles,K.M. (2013) Genome engineering of Drosophila Conserved Domain Database. Nucleic Acids Res., 37, D205–D210. with the CRISPR RNA-guided Cas9 nuclease. Genetics, 194, 75. Walker,J.E., Saraste,M., Runswick,M.J. and Gay,N.J. (1982) 1029–1035. Distantly related sequences in the alpha- and beta-subunits of ATP 60. DiCarlo,J.E., Norville,J.E., Mali,P., Rios,X., Aach,J. and synthase, myosin, kinases and other ATP-requiring enzymes and a Church,G.M. (2013) Genome engineering in Saccharomyces common nucleotide binding fold. EMBO J., 1, 945–951. cerevisiae using CRISPR-Cas systems. Nucleic Acids Res., 41, 76. Nishimasu,H., Ran,F.A., Hsu,P.D., Konermann,S., Shehata,S.I., 4336–4343. Dohmae,N., Ishitani,R., Zhang,F. and Nureki,O. (2014) Crystal 61. Shan,Q., Wang,Y., Li,J., Zhang,Y., Chen,K., Liang,Z., Zhang,K., structure of Cas9 in complex with guide RNA and target DNA. Cell, Liu,J., Xi,J.J., Qiu,J.L. et al. (2013) Targeted genome modification of 156, 935–949. crop plants using a CRISPR-Cas system. Nat. Biotechnol., 31, 77. Jinek,M., Jiang,F., Taylor,D.W., Sternberg,S.H., Kaya,E., Ma,E., 686–688. Anders,C., Hauer,M., Zhou,K., Lin,S. et al. (2014) Structures of 62. Xie,K. and Yang,Y. (2013) RNA-guided genome editing in plants Cas9 endonucleases reveal RNA-mediated conformational using a CRISPR-Cas system. Mol. Plant, 6, 1975–1983. activation. Science, 343, doi:10.1126/science.1247997. 63. Waaijers,S., Portegijs,V., Kerver,J., Lemmens,B.B., Tijsterman,M., 78. Bao,W. and Jurka,J. (2013) Homologues of bacterial TnpB IS605 are / van den Heuvel,S. and Boxem,M. (2013) CRISPR Cas9-targeted widespread in diverse eukaryotic transposable elements. Mob DNA, 4, mutagenesis in Caenorhabditis elegans. Genetics, 195, 1187–1191. 12. 64. Jiang,W., Bikard,D., Cox,D., Zhang,F. and Marraffini,L.A. (2013) 79. Kersulyte,D., Velapatino,B., Dailide,G., Mukhopadhyay,A.K., Ito,Y., RNA-guided editing of bacterial genomes using CRISPR-Cas Cahuayme,L., Parkinson,A.J., Gilman,R.H. and Berg,D.E. (2002) systems. Nat. Biotechnol., 31, 233–239. Transposable element ISHp608 of Helicobacter pylori: nonrandom 65. Wei,C., Liu,J., Yu,Z., Zhang,B., Gao,G. and Jiao,R. (2013) TALEN geographic distribution, functional organization, and insertion or Cas9 - rapid, efficient and specific choices for genome specificity. J. Bacteriol., 184, 992–1002. modifications. J. Genet. Genomics, 40, 281–289. 80. Lambowitz,A.M. and Zimmerly,S. (2011) Group II introns: mobile 66. Pennisi,E. (2013) The CRISPR craze. Science, 341, 833–836. ribozymes that invade DNA. Cold Spring Harb. Perspect. Biol., 3, 67. Wolf,Y.I., Makarova,K.S., Yutin,N. and Koonin,E.V. (2012) Updated a003616. clusters of orthologous genes for archaea: a complex ancestor of the 81. Palmer,K.L. and Gilmore,M.S. (2010) Multidrug-resistant archaea and the byways of horizontal gene transfer. Biol. Direct., 7, enterococci lack CRISPR-cas. mBio, 1, e00227–10. 46. 82. Zheng,P.X., Chiang-Ni,C., Wang,S.Y., Tsai,P.J., Kuo,C.F., 68. Rho,M., Wu,Y.W., Tang,H., Doak,T.G. and Ye,Y. (2012) Diverse Chuang,W.J., Lin,Y.S., Liu,C.C. and Wu,J.J. (2013) Arrangement and CRISPRs evolving in human microbiomes. PLoS Genet., 8, number of clustered regularly interspaced short palindromic repeat e1002441. spacers are associated with erythromycin susceptibility in emm12, 69. Pride,D.T., Salzman,J. and Relman,D.A. (2012) Comparisons of emm75 and emm92 of group A streptococcus. Clin. Microbiol. clustered regularly interspaced short palindromic repeats and viromes Infect., doi:10.1111/1469-0691.12379. in human saliva reveal bacterial adaptations to salivary viruses. 83. Nozawa,T., Furukawa,N., Aikawa,C., Watanabe,T., Haobam,B., Environ. Microbiol., 14, 2564–2576. Kurokawa,K., Maruyama,F. and Nakagawa,I. (2011) CRISPR 70. Berg Miller,M.E., Yeoman,C.J., Chia,N., Tringe,S.G., Angly,F.E., inhibition of prophage acquisition in Streptococcus pyogenes. PLoS Edwards,R.A., Flint,H.J., Lamed,R., Bayer,E.A. and White,B.A. ONE, 6, e19543. (2012) Phage-bacteria relationships and CRISPR elements revealed