Report for Taikichiro Mori Memorial Research Grants 2019 (2019 年度森基金研究成果報告書)

Discovery of novel enzymes involved in the replication of life and elucidation of their functions

Comprehensive evolutionary analysis of reverse transcriptases in viruses and prokaryotes

Shohei Nagata

Institute for Advanced Biosciences, Keio University, Tsuruoka 997-0035, Japan and Systems Biology Program, Graduate School of Media and Governance, Keio University, Fujisawa 252-0882, Japan.

Abstract

Reverse transcriptases (RTs) are enzymes that polymerize DNA from RNA templates. RTs are usually thought of as viral and eukaryotic elements, but they are also present in prokaryotes. Bacterial RTs appear to be ancestors of eukaryotic RTs, and several types have been identified, i.e., group II introns, retrons, CRISPR/Cas-associated RTs, diversity-generating retroelements (DGRs), and Abi-like genes. Several recent studies have reported the existence of RTs in a recently described bacterial group, the candidate phyla radiation (CPR). These CPR RTs are thought to play important roles in CPR bacterial ecology, since CPR bacteria retain RT genes while lacking numerous biosynthetic pathways. In this study, I comprehensively collected RT-like sequences from CPR genomes and systematically characterized their functions and evolution. Using known functional profiles of RTs as queries, a sequence similarity search was performed against 804 near-complete genomes of CPR bacteria in the database. I obtained 514 RT sequences, which are widely distributed across CPR phyla. CPR bacteria are known to use RTs involved in DGRs to adapt to rapidly changing environments; in addition, I found RTs related to group II introns, retrons, and abortive infection (Abi). I will discuss possible roles and evolution of RTs in CPR bacteria.

Contact: [email protected]

1 Introduction

The central dogma of molecular biology, proposed in 1958, describes the flow of genetic information: information stored in DNA is transcribed into mRNA and translated into protein. However, in 1970, the discovery of an RNA-dependent DNA polymerase (reverse transcriptase; RT), which synthesizes DNA from an RNA template, showed that this flow can be reversed [1,2]. RT was discovered through studies of tumor-associated retroviruses that infect eukaryotes, and the various types of RT enzymes discovered thereafter were primarily related to eukaryotes. In addition to viruses infecting eukaryotic organisms (retroviruses, pararetroviruses, hepadnaviruses), RT-homologous regions have been found in long terminal repeat (LTR) retroelements, non-LTR retroelements, and telomerase.

In 1989, the retron, one type of RT element, was found in bacteria [3,4]. Subsequently, various types of RTs were discovered in bacteria and archaea, including those of group II introns [5–7] and diversity-generating retroelements (DGRs) [8–10]. Retrons consist of an RT and an adjacent repeat sequence, but their function remains unknown.

1 S.Nagata

Group II introns are retroelements consisting of a catalytic RNA and an RT protein, which mediate splicing and mobility reactions [11–13]. DGRs are retroelements that have lost mobility functions and use reverse transcription to generate sequence variation in specific target genes [10]. It was subsequently revealed that RT genes are widely present in the three domains of life (bacteria, archaea, eukaryotes) and in viruses [14–17]. In bacteria, RT-homologous regions are also known to exist in abi genes related to abortive infection (Abi) against phage [18,19] and in the cas1 gene of the CRISPR/Cas immune system [20,21]. Three bacterial RT-related proteins are involved in phage resistance: AbiA, AbiK, and Abi-P2 [15]. AbiA and AbiK are thought to provide phage immunity through abortive infection. In addition, recent reports indicate that many uncharacterized RT-like sequences exist, mainly in bacteria [15,20,21]. However, what functions or activities they possess and how they diverged were unclear.

More recently, technological advances in metagenomic analysis and single-cell genomics have made it clear that a vast group of previously unknown microbial lineages exists in bacteria. Metagenomic approaches revealed a huge diversity of phyla of bacteria and archaea that were previously unknown because they have different forms of 16S rRNA sequences. In bacteria, these metagenomically recovered lineages were described as the candidate phyla radiation (CPR), which comprises at least 15% of all bacteria [22]. The CPR appears to be monophyletic and clearly separated from other bacteria (Figure 1; Castelle and Banfield, 2018; Hug et al., 2016). CPR bacteria are widely distributed across various environments, such as the human microbiome [25], deep subsurface sediments [26], the dolphin mouth [27], drinking water [28], soil [29], marine sediment [30], and other environments [24,31].

CPR bacteria have various unusual features compared to non-CPR bacteria. CPR genomes are smaller than 1.5 Mb, whereas the genome of a non-CPR bacterium such as Escherichia coli is 4.6 Mb. Most of them have lost TCA-cycle genes, and they have intron regions in rRNA genes [22,31]. It is sometimes questioned whether CPR bacteria are cellular organisms; however, CPR genomes encode genetic systems for cell division (e.g., FtsZ-based mechanisms, not found in some symbionts with very reduced genomes), and measurements of replication rates and images showing cell division indicate that the cells are metabolically active. It is also thought that they may adhere to the surface of other microorganisms to survive.

CPR bacteria have been reported to carry RT-like sequences in their genomes; however, the types of RTs, their functions, and the evolutionary scenario of their diversification are not well understood. In this study, a comprehensive analysis was performed on the RT sequences from CPR bacterial genomes to reveal the roles and evolution of RTs in CPR bacteria.

Figure 1. A current view of the tree of life. The phylogenetic tree of bacteria, archaea, and eukaryotes, including 92 named bacterial phyla, 26 archaeal phyla, and all five of the eukaryotic supergroups. The tree was estimated by the maximum-likelihood method using a concatenation of ribosomal protein sequences. The figure is adapted from reference [23].

2 Methods

2.1 Data sources

Complete genome sequences of bacteria and archaea were downloaded from the Reference Sequence Database (RefSeq) [32] at the National Center for Biotechnology Information (NCBI) as of May 2018. The acquired genomes (denoted as RefSeq prokaryotes in this manuscript) comprised 9,078 genomes (8,825 bacterial and 253 archaeal).

Nearly full-length genomes of CPR bacteria (≥ 70% of the estimated full length recovered), 804 genomes (790 species) in total, were obtained from NCBI GenBank based on Hug et al. [23].

Known RT sequences were obtained from a previous study by Simon et al. [20]. Sequences annotated as “Unknown”, “Unclassified”, or “nonRTs” were eliminated, and a total of 930 RT sequences were collected.

2 Comprehensive evolutionary analysis of reverse transcriptases in viruses and prokaryotes

2.2 Identification of RT sequences

From the collected prokaryotic genomes, RT-like sequences containing RT functional domains were identified using HMMER v.3.2 (hmmscan program; E-value ≤ 1e-5) [33], searching against the sequence profiles “RVT_1” (PF00078) and “RVT_2” (PF07727) in Pfam-A 32.0 [34]. In the first pipeline, the Pfam profile “RVT_3” (PF13456) was also included as a query, since the “RVT_3” domain is registered as “Reverse transcriptase-like” in the database. However, the proteins collected with the “RVT_3” profile were RNase H proteins rather than RTs. Therefore, I excluded proteins that exist as RNase H alone, not as part of an RT protein, to observe the diversity and evolution of RT domains and proteins in the analysis including CPR bacteria.

2.3 Network analysis based on sequence similarities

Sequence similarity scores were calculated to construct a weighted undirected graph, the sequence similarity network (SSN). The similarity scores (Basic Local Alignment Search Tool [BLAST] bit scores) [35] for all the collected protein sequences were calculated with an all-against-all BLASTP (BLAST 2.7.1+) analysis [36,37], with a cut-off E-value of ≤ 1e−5. Using the BLAST bit scores, the sequence similarities were normalized to 0.0–1.0 with the following equation [38,39]:

$$
\operatorname{sim}(x, y) = \frac{\min\bigl(\text{bit score}(x, y),\ \text{bit score}(y, x)\bigr)}{\max\bigl(\text{bit score}(x, x),\ \text{bit score}(y, y)\bigr)}
$$

where sim(x, y) represents the normalized sequence similarity between two sequences x and y. If the score was 1.0, the pair was deemed to be identical. A weighted undirected graph was constructed based on the scores of all the pairs of sequences, and the edges were weighted with the scores. I set a threshold on the similarity score and connected two nodes when their score exceeded the threshold. The threshold to be used was determined by comparing the networks constructed with an incremental series of threshold values. The constructed networks were visualized with Cytoscape 3.7.1 [40], using “Prefuse Force-Directed OpenCL Layout” with default parameters except for enabling the “Force deterministic layouts” option.

2.4 Sequence comparison and phylogenetic analysis

To compare the RTs of RefSeq prokaryotes and CPR bacteria, I aligned the RT sequences using MAFFT v.7.407 (L-INS-i algorithm) [41] and estimated a maximum-likelihood tree using RAxML v.8.2.11 [42] with the PROTGAMMAJTT evolutionary model for amino acid sequences. Both analyses were performed and visualized through the Environment for Tree Exploration (ETE) v.3.1.1 [43]. In addition, the identified CPR RTs were mapped onto the phylogenetic tree estimated by Hug et al. [23] using iTOL [44].

2.5 Estimation of frameshift mutations in polymerases

To verify whether the frameshift mutation occurred only in CPR bacterial RTs, DNA polymerase family A proteins were identified and compared to the RTs. DNA polymerase family A proteins were identified using HMMER v.3.2 (hmmscan program; E-value ≤ 1e-5) [33], searching against the sequence profile “DNA_pol_A” (PF00476) in Pfam 32.0 [34]. To increase the phylogenetic coverage of the polymerases in the CPR phylogeny, the retrieved DNA polymerase protein sequences (438 sequences for CPR bacteria) were additionally run against all coding sequences of the datasets using BLAST v.2.8.1+ (blastp program; E-value ≤ 1e-5; query coverage per subject ≥ 50%) [35–37], and 670 sequences were identified for CPR bacteria. With the same pipeline, I also re-identified RT sequences using the 514 CPR RT sequences as queries and retrieved 539 RTs from CPR genomes.

2.6 Domain architecture of related proteins

The domain organization of CPR RTs was visualized with DoMosaics v.0.95 [45]. The visualized domains were extracted using HMMER v.3.2 (hmmscan program) [33], searching against the Pfam-A 32.0 [34] database. HMMER was run and the results were combined by DoMosaics. Other sequences with a specific domain architecture were searched for by InterProScan [46] against the InterPro database [47].

2.7 Identification of RT-related group II introns

Since most bacterial group II introns carry an RT as the intron-encoded protein (IEP) in their open reading frame (ORF), I identified the introns to annotate RT functions. To detect the characteristic RNA secondary structures surrounding the IEP (RT), structures homologous to the specific domains of the introns (domains I-VI) were searched for in CPR bacterial genomes. Domains V and VI were searched for using Infernal v.1.1.2 (cmsearch program with --nohmm option; score > 24) against the RNA secondary structural profile “Intron_gpII” (RF00029) in the Rfam database [48]. For domains I-IV, Infernal v.1.1.2 (cmsearch program with --rfam option; E-value ≤ 1e-10) was used against the profiles “group-II-D1D4-1” (RF01998), “group-II-D1D4-2” (RF01999), “group-II-D1D4-3” (RF02001), “group-II-D1D4-4” (RF02003), “group-II-D1D4-5” (RF02004), “group-II-D1D4-6” (RF02005), and “group-II-D1D4-7” (RF02012) in the database. Based on the search results, and considering the distances between the intron components, the types of group II introns were defined as follows: full-length, which has all of domains I-IV, ORF-RT, and domains V-VI; ORF-less, which lacks ORF-RT but has domains I-IV and V-VI; and others, which lack one of the three components.

3 Results and discussion

3.1 Overall relationships among prokaryotic RTs

To see the overall sequence relationships of RT and RT-related proteins in prokaryotes and viruses, I constructed and visualized a sequence similarity network (SSN) (Figure 3.1). The SSN is a graphical representation of the similarities between sequences. Each sequence is indicated by a point (node), and the similarity between sequences is represented by the length of the line (edge) connecting the points. The smaller the distance between the nodes, the greater the degree of similarity between the sequences. I used the RT and related protein sequences identified from the prokaryotic and viral genomes in the RefSeq dataset. Nodes are colored according to the origin of the sequences: bacteria (non-CPR), archaea, or virus (Figure 3.1A), or according to the types of RT and RT-related proteins (Figure 3.1B). An overview of the entire network structure shows that the RT and RT-related proteins can be divided into four groups, i.e., RTs of bacteria and archaea, RTs of viruses, RNA-dependent RNA polymerases (RdRp), and ribonuclease (RNase) H. The groups of viral RTs and viral RdRp consisted only of sequences derived from viruses, whereas the groups of RNase H and bacterial RT both contained sequences derived from the three domains. Some bacterial types of RTs, such as DGR RTs, have been found in virus (bacteriophage) genomes [49], and they are mainly associated with the bacterial RT group on the network.

Figure 3.1 Sequence similarity network of RTs from RefSeq prokaryotes. Nodes (colored dots) represent the RT protein sequences, and the edge lengths represent the sequence similarities. (A) Nodes are colored according to the origin of sequences: bacteria (non-CPR), archaea, or virus. (B) Nodes are colored according to the types of RT and RT-related proteins: RNase H, unclassified, RdRP 3, RNA-dependent RNA polymerase, RT Rtv, RT ZFREV-like, RdRP 4, RT LTR, viral DNA polymerase, RT retron II, RT group II intron, RT retron I, and RT nLTR-like.

The phylogenetic relationships of the obtained RTs and RT-related proteins were analyzed together with the structure of their protein functional domains (Figure 3.2). Several sequences were selected from each type of RT and used. The color of the tips in the tree corresponds to the color of the nodes in Figure 3.1, and the type of RT and the taxonomic domain of origin (bacteria, archaea, virus) are described together. Retroviral, LTR, non-LTR, and retron II types of RTs were located close together on the phylogenetic tree, while group II intron and retron I RTs were split and located on multiple branches. Many viral RTs possessed various protein domains in addition to the central RT domain, shown as “RVT_1” in the figure, and their sequences were considerably longer than those of prokaryotes. This is probably because viruses often encode one protein with multiple functions.

Figure 3.2 Phylogenetic tree and domain architecture of RTs. Several sequences were selected from each type of RT and RT-related protein identified in the RefSeq prokaryotic genomes. The coloring of the tips of the phylogenetic tree corresponds to the coloring of the nodes in Figure 3.1. The type of RT and the taxonomic domain of origin (bacteria, archaea, virus) are also indicated. Functional protein domains are colored by domain type, and the Pfam names of the domains are indicated.

RdRp and RNase H, which are not RTs themselves, were obtained as RT-related proteins. RdRp has been considered to be evolutionarily related to RT [50], so it is not surprising that the RdRp domain sequences were highly similar to the RT domain. For RNase H, on the other hand, I had selected the Pfam profile “RVT_3” in the process of selecting protein sequences with an RT domain. Although the “RVT_3” domain is registered as “Reverse transcriptase-like” in NCBI CDD, it does not belong to the “RVT_1 Superfamily” or the “RT_like Superfamily” with the other RTs, but to the “RNase_H_like Superfamily”. In many cases, an RT has an RNase H domain region as part of it [51,52]. However, from this point on, I excluded proteins that exist as RNase H alone, not as part of an RT protein, to observe the diversity and evolution of RT domains and proteins in the subsequent analysis including CPR bacteria.

To analyze the characteristics of RTs in CPR bacteria, I first plotted histograms of the amino acid sequence lengths of RTs extracted from CPR bacteria and from non-CPR prokaryotes registered in RefSeq (Figure 3.3). For the histograms only, the RefSeq prokaryotic RTs were represented by cluster representative sequences from clusters containing at least five sequences, to ensure sequence diversity. The minimum sequence length was 78 residues for CPR bacteria and 72 residues for RefSeq prokaryotes, the mean length was 311 residues and 475 residues, and the maximum length was 763 residues and 1879 residues, respectively. The distribution for CPR bacterial RTs was unimodal with a small variation in sequence length, while the distribution for RefSeq prokaryotes was multimodal with roughly three peaks. This suggests that the RTs of the prokaryotes registered in RefSeq include a wide variety of RT types, whereas most of the RTs of CPR bacteria belong to specific types of RT.

Figure 3.3 Distribution of sequence length of the identified RT proteins. Distribution of amino acid sequence length of the identified RTs from (A) CPR bacteria and (B) non-CPR prokaryotes registered in the RefSeq database. Note that panel B uses representative sequences of clusters containing five or more sequences, to reduce the bias in the RefSeq sequence data.

To compare the sequences of RTs in CPR bacteria and non-CPR prokaryotes, I constructed and visualized an SSN of the RTs from both datasets (Figure 3.4). Note that in Figure 3.1, the Pfam profile “RVT_3” was also included in the extraction of RTs, and as a result a considerable number of RNase H sequences were included in that network. This RNase H protein profile (Pfam ID: “RVT_3”) was excluded here, since I wanted to target only sequences close to the RT enzyme. The CPR bacterial RTs, whose nodes are colored blue, formed a cluster on the left side of the network, with further sequences scattered slightly to the lower left (Figure 3.4A). These RTs were classified as group II intron type and retron type, respectively (Figure 3.4B). In addition to these, some CPR bacterial RTs were annotated as RNase H-like proteins (5 sequences) or appeared to be similar to viral RdRp (3 sequences). Nodes annotated as group II intron type RTs from CPR bacteria were clustered on the network (Figure 3.4) and appeared to make up the majority of the CPR RTs. A previous study reported that 75% of the RTs in bacterial genomes belong to group II introns, with 12% belonging to retrons and 3% to DGRs [15,53]. However, it should be noted that, as mentioned above, a detailed discussion requires a more accurate annotation of RT types. This cluster of sequences might represent RTs characteristic of CPR bacteria, given its distance on the network from the RTs of non-CPR bacteria and archaea. If these are RTs associated with group II introns as annotated, new types of group II introns might be present in CPR bacteria. Since this RT annotation was determined only by the best hits among the NCBI CDD profiles, detailed types are identified by phylogenetic analysis with known types of RTs in the next section.

Figure 3.4 Sequence similarity network of RTs from RefSeq (non-CPR) prokaryotes and CPR bacteria. Nodes (colored dots) represent the RT protein sequences, and the edge lengths represent the sequence similarities. (A) Nodes are colored according to the origin of sequences: bacteria (non-CPR), CPR bacteria, archaea, or virus. (B) Nodes are colored according to the types of RT and RT-related proteins: RT group II intron, RT nLTR-like, RT retron I, Cas1, RT LTR, RNA-dependent RNA polymerase, RT retron II, RdRP 4, viral DNA polymerase, RT Rtv, RT ZFREV-like, others and unclassified, RNase H-like, and RdRP 3.

3.2 Functional analysis and classification of CPR RTs

The sequence similarity-based search for RT domains identified 514 RT protein sequences. To observe the phylogenetic distribution of the RTs, they were mapped onto the CPR bacterial phylogeny [23] (Figure 3.5). RTs were widely distributed in CPR bacteria. They appeared in both major superphyla of CPR, Parcubacteria (OD1) and Microgenomates. RTs were found in 313 of the 804 CPR bacterial genomes.

Figure 3.5 Phylogenetic distribution of RTs in CPR bacteria. RTs were found in 313 genomes, and they were mapped onto the CPR phylogeny (804 genomes). Genomes with RT proteins are colored in blue. The CPR phylogeny was taken from Hug et al. and modified.

I combined the CPR RT sequences with the known RT sequences and constructed a phylogenetic tree (Figure 3.6). The CPR RTs were not monophyletic, and RTs related to retrons, abortive infection (AbiK and Abi-P2, but not AbiA), DGRs, group II introns, and group II intron-like elements were observed in CPR. Most CPR RTs (441 sequences) were associated with DGRs, accounting for 86% of the CPR RTs.
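The group II intron categories defined in Section 2.7 and used in Table 3.1 can be illustrated with a short sketch. It is only a minimal illustration of the rule as stated in the text, assuming the presence or absence of each of the three components has already been decided from the cmsearch and hmmscan hits; the `IntronHit` record and `classify_intron` function are hypothetical names, and labelling every remaining combination as "other" is an assumption where the text does not specify.

```python
from typing import NamedTuple


class IntronHit(NamedTuple):
    """Presence of the three group II intron components at one locus (hypothetical record)."""
    domains_1_4: bool  # RNA domains I-IV detected
    orf_rt: bool       # ORF encoding an RT (intron-encoded protein) detected
    domains_5_6: bool  # RNA domains V-VI detected


def classify_intron(hit: IntronHit) -> str:
    """Classify a locus as full-length, ORF-less, or other, following Section 2.7."""
    if hit.domains_1_4 and hit.orf_rt and hit.domains_5_6:
        return "full-length"
    if hit.domains_1_4 and hit.domains_5_6:
        # Both RNA domain sets are present but no RT ORF lies between them.
        return "ORF-less"
    # Assumption: any remaining combination is reported as "other".
    return "other"


# Component patterns taken from three kinds of rows in Table 3.1.
print(classify_intron(IntronHit(True, True, True)))    # full-length
print(classify_intron(IntronHit(True, False, True)))   # ORF-less
print(classify_intron(IntronHit(False, True, True)))   # other
```

Checking the full-length case first keeps the ORF-less test simple, since any locus reaching the second test is already known to lack at least one component.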


Table 3.1 Identified group II introns in CPR bacteria. @94490804 01298039.1 @90407942 01216116.1 Table 1. Identified group II introns in CPR bacteria @147678802 001213017.1 @14887141770280.1 B01000013.1@C 2011 B1 37 5@35495.1@@A- A @1618742@8966 10004 Corresponding RTs 01000025.1@C B1 49 6@96685.1@@ @1798644@5912 6851 Genomes (GenBank accession) Exsistence of RNA domains and RT C01000025.1@C C2 02 49 11@55025.1@@ @1798489@58351 59635 (GenBank accession) 01000012.1@C C2 01 41 20@04041.1@@ @1798657@62416 63481 01000015.1@C C2 01 49 22@A65208.1@@ @1802448@9819 10770 01000003.1@C B C2 01 50 28@64267.1@@ @1797471@21074 22046 Microgenomates_group_bacterium_RBG_16_45_19 (MHDC01000039.1) Full (Domain I-IV, ORF-RT, Domain V,VI) OGV95898.1 01000007.1@C 1 39 35@B16464.1@@ @1802780@4831 5782 CB01000007.1@ 2011 B1 41 6@16149.1@@ @1618869@24444 25329 01000025.1@C C2 01 38 26@13508.1@@ @1798591@1834 2767 Candidatus_Kerfeldbacteria_bacterium_RIFCSPLOWO2_02_FULL_42_19 (MHKF01000008.1) Full (Domain I-IV, ORF-RT, Domain V,VI) OGY84268.1 01000029.1@C B A2 43 10@63869.1@@ @1797472@7478 8378 01000015.1@C B A2 43 10@65175.1@@ @1797472@19464 20781 C005957.1@C @A61879.1@@ A- A @1332188@174268 175129 Candidatus_Uhrbacteria_bacterium_GW2011_GWF2_46_218 (LCMG01000021.1) Full (Domain I-IV, ORF-RT, Domain V,VI) KKU32234.1 @94310415 583625.1 @119857475 01638904.1 Candidatus_Uhrbacteria_bacterium_GW2011_GWF2_39_13 (LBWG01000031.1) ORF-RT, Domain V,VI KKR03365.1 @92113328 573256.1 Candidatus_Vogelbacteria_bacterium_RIFOXYB1_FULL_42_16 (MHTH01000005.1) Domain I-IV, ORF-RT OHA59186.1 @121604255 981584.1 @28871080 793699.1 C01000036.1@C 2011 C2 48 9@01006.1@@ @1619071@3198 4356 Candidate_division_Kazan_bacterium_RIFCSPLOWO2_01_FULL_48_13 (METE01000004.1) Domain I-IV, ORF-RT OGB85388.1 @8100799AA72414.1 01000020.1@C 2 34 120@40791.1@@ @1798007@12897 13848 01000023.1@C C2 12 50 10@88238.1@@ @1798526@2291 3323 Candidatus_Levybacteria_bacterium_RIFCSPHIGHO2_01_FULL_37_33 (MFNM01000037.1) ORF-less (Domain 
I-IV, Domain V,VI) A01000041.1@C C 1 40 9@81593.1@@ @1817731@55742 56720 C01000010.1@ 2011 A2 43 17@06583.1@@ @1618827@984 1947 CA01000003.1@C 2011 B1 41 12@88815.1@@ @1619006@62053 62887 Candidatus_Levybacteria_bacterium_RIFCSPLOWO2_02_FULL_37_10 (MFPB01000045.1) ORF-less (Domain I-IV, Domain V,VI) 01000024.1@C C2 12 42 11@43129.1@@ @1798473@64 991 Candidatus_Colwellbacteria_bacterium_RIFCSPHIGHO2_12_FULL_44_17 (MHIX01000008.1) ORF-less (Domain I-IV, Domain V,VI)

@60681593 211737.1 C01000032.1@C C2 12 45 10@29444.1@@ @1802603@3678 4569 Candidatus_Nealsonbacteria_bacterium_RBG_13_42_11 (MHLY01000006.1) ORF-less (Domain I-IV, Domain V,VI) A01000043.1@C C 1 40 9@81470.1@@ @1817731@4594 5467 Retrons B01000006.1@C 2011 2 40 19@53802.1@@ @1618595@45560 46538 01000001.1@C B 16 42 24@15919.1@@ @1802485@286583 287609 Candidatus_Terrybacteria_bacterium_RIFCSPLOWO2_01_FULL_40_23 (MHSW01000005.1) ORF-less (Domain I-IV, Domain V,VI) 01000006.1@C C2 01 44 14@70272.1@@ @1802525@188878 189886 Candidatus_Gottesmanbacteria_bacterium_GW2011_GWB1_49_7 (LCQD01000034.1) ORF-less (Domain I-IV, Domain V,VI)

BA01000005.1@C 2011 1 35 17@68543.1@@ @1618707@80730 81774 @15925199 372733.1 @20090946 617021.1 @121727482 01680600.1 (PF03167). “zf-CHC2” is a domain of CHC2-type @134093299 001098374.1 @13407823070

01000011.1@C C2 01 37 25@88012.1@@ @1801768@22124 22991 zinc finger domain which bind metals such as zinc, 01000005.1@C C2 01 48 27@45710.1@@ @1802115@104266 105190 01000009.1@C C2 01 43 60@A89186.1@@ @1802736@3217 4132 B01000028.1@C 2011 A2 32 13@35451.1@@ @1618475@2909 3575 01000022.1@ 2 52 8@B25204.1@@ @1817746@2872 3853 @56698727 166298.1 iron, or no metal at all. “UDG” is a domain of uracil @83943143 00955603.1 @126664055 01735049.1 @56475694 157283.1 @54293958 126373.1 C01000017.1@ 2011 A1 59 11@48036.1@@A- A @1618804@28000 29530 @75909461 323757.1 DNA glycosylase (UDG). UDG is an enzyme that @21233055 638972.1 @113937250 01423127.1 @113940973 01426789.1 @90409437 01217503.1 01000030.1@C C C1 02 44 10@91945.1@@ @1805087@7228 9307 01000035.1@C C2 02 43 22@13830.1@@ @1802681@11378 13670 01000032.1@ 3 1 43 17@C79859.1@@ @1802652@17430 19548 reverts mutations in DNA and crucial in DNA repair. 01000006.1@C C2 01 47 25@82851.1@@ @1802402@41594 43373 C01000013.1@ 2011 A2 47 12@59044.1@@ @1618841@6751 8077 CC01000009.1@C 2011 C2 41 9@27177.1@@ @1619029@12144 12894 01000016.1@C A2 50 9@41060.1@@ @1798474@29752 31144 01000010.1@C C2 01 48 27@45070.1@@ @1802115@5846 7247 01000046.1@C B 13 34 9@08540.1@@ @1802477@4752 6297 I further searched sequences by domain architecture 01000042.1@C C1 02 54 53@52926.1@@ @1805323@31900 32515 A@24636606BAC22947.1 A@17227201 478367.1 A@32455447 862563.1 01000048.1@ 3 B 19 CB 34 6@C45144.1@@ @1802612@1374 2922 Abi-like A-2@133728112BA29810.1 C01000010.1@C 2011 C2 48 9@02910.1@@ CA@1619071@14844 16377 which have both RT domain and UDG domain A-2@110642862 670592.1 A-2@84619229CA43154.1 A-2@84619238CA43157.1 A-2@118587264 01544691.1 A-2@150378854 01918051.1 01000025.1@C A B 16 42 10@C81516.1@@ @1817814@4508 6005 against the protein database of almost all public 01000003.1@C 2 36 9@A47706.1@@ @1802338@0 1074 01000005.1@C C2 36 12@32498.1@@ @1798002@13174 14677 A-2@145632406 01788141.1 A-2@145634195 01789906.1 B01000019.1@ 2011 A2 40 
23@54227.1@@A- A @1618816@12412 13744 B01000010.1@ 2011 2 38 18@61081.1@@ @1618949@7972 9313 01000008.1@C C2 01 45 15@00985.1@@ @1798649@5816 7163 available sequences. However, the only sequences I AA@149358AAA25159.1 ………………………………………☆ C01000014.1@C 2011 A2 45 13@91305.1@@A- A @1618662@1107 1704 found was sequences of Candidatus Giovannoni- B01000037.1@ 2011 A1 36 12@91937.1@@A- A @1618782@783 2118 01000024.1@C C2 01 47 17@92105.1@@ @1802558@70735 71734 01000001.1@ C1 02 41 26@50581.1@@ @1805308@1639 2689 01000022.1@C C2 30 33 16@85866.1@@ @1805340@2510 3563 B01000002.1@C C2 30 36 39@74398.1@@ @1805300@110184 111267 01000032.1@C C2 01 46 14@13797.1@@ @1798380@130 1216 @23335577 00120811.1 bacteria, which host species is same as the sequence @150007547 001302290.1 @14584767024588.1 @139439157 01772609.1 B01000037.1@ 2011 A1 36 12@91938.1@@ @1618782@2080 2707 I mentioned (OGF82770.1). Therefore, I concluded @15158092544663.1 @68553139 00592520.1 @42527768 972866.1 @68551181 00590605.1 @78189651 379989.1 C005957.1@C @A62529.1@@ A- A @1332188@819704 820739 @71065017 263744.1 that the RT which have UDG is specific in Candi- @118747050 01594931.1 @126090247 001041702.1 @90580666 01236470.1 DGRs @148359926 001251133.1 01000040.1@C C1 02 54 53@53019.1@@ @1805323@48601 49693 @29347723 811226.1 datus Giovannonibacteria. 
To examine RTs in bacterial group II introns, the introns were detected by searching for their characteristic RNA secondary structures (Table 3.1). In total, 12 group II introns were detected, and six of them had RTs as intron-encoded proteins (IEPs). The remaining six group II introns had both domains I-IV and domains V-VI, but the distance between the domains was so short (~450 nt) that they did not possess corresponding RTs.

The phylogenetic mapping and the aforementioned analysis revealed that several types of RT proteins are widely distributed and conserved in CPR bacterial phyla. The roles and functions of RTs in CPR bacterial ecology are poorly understood except for DGRs [16,54]. DGRs, retroelements that generate sequence variation in specific target genes using an RT, were identified in CPR bacteria and suggested to be used for adaptation to a dynamic host-dependent environment [54]. The observed phylogenetic distribution of CPR RTs was congruent with the previous study that identified DGRs [54]. Although that study focused on RTs as components of DGRs, I newly identified RTs in additional CPR phyla, i.e. Candidatus Collierbacteria, Candidatus Pacebacteria, and subdivisions RIF-10, 15, 16, 17, 20, and 21 of Candidatus Parcubacteria. In particular, RTs in RIF-10, 15, 16, 17, 20, and 21 of Parcubacteria were identified in multiple genomes, which indicates that non-DGR RTs exist in these phyla and may play an important role in their ecology.

The identified group II introns seem to have been acquired recently by horizontal gene transfer, since they have a well-conserved group II intron maturase domain and their RTs are closely related to RTs from non-CPR prokaryotes. RTs involved in abortive infection and retrons were also identified. Abortive infection is a process that provides phage immunity by blocking phage multiplication through programmed death of the cell. Three bacterial RT-related proteins are involved in phage resistance: AbiA, AbiK, and Abi-P2. AbiK and Abi-P2 were identified in CPR bacterial genomes in this study, and several RTs that are close to AbiK but form a distinct clade were also identified. Retrons consist of an RT gene and an adjacent inverted repeat sequence, but their function remains unknown. Detailed sequence properties of these RTs will be investigated in further research.

Although the CPR phyla form a radiation that is clearly separated from other bacteria, the RT tree was polyphyletic. This result implies that CPR bacteria and non-CPR bacteria occasionally exchange RT genes, contributing to RT evolution and diversification. I identified several types of RTs, and although the biological properties of these proteins have not been elucidated, the RTs might contribute to CPR ecology since they are still retained rather than discarded in extremely small CPR genomes.

Figure 3.6 Phylogenetic relationships of CPR RTs and known RTs. 514 sequences of CPR RTs and 930 sequences of known RTs were combined, and the phylogenetic tree was estimated using the maximum-likelihood method. The types of RTs are indicated at the right side of the tree nodes (tips). Collapsed nodes are of known RT sequences except for 416 CPR RT sequences indicated by a star.

Most CPR RTs have only an RT domain, but five RT sequences have other domains (Figure 3.8). Three of them (GenBank accessions KKR03365.1, KKU32234.1, and OGV95898.1) have the "GIIM" domain (Pfam ID, PF08388), which is the maturase-specific domain of group II introns. In addition, protein KKU32234.1 has the "RVT_N" domain (PF13655), the N-terminal domain of reverse transcriptase. Interestingly, the other two proteins have distinctive domains, "zf-CHC2" (PF01807) and "UDG" (PF03167).

Figure 3.7 Phylogenetic relationships and schematic sequence alignment of RT protein sequences. Six CPR RTs and three other RTs (RTs of group II introns, DGRs, and retrons, respectively) were aligned and are represented schematically on the right side. Grey boxes represent the presence of residues, and the other regions are alignment gaps. CPR RTs are enclosed in a blue square.

2.3 Sequence analysis of small CPR RTs implies putative ribosomal frameshifts

To examine the features of CPR RTs, I compared sequences of CPR RTs and typical bacterial RTs (Figure 3.7). Compared with the group II intron RT (in the intron-encoded protein), retron RT, and DGR RT, multiple sequence alignments showed that some CPR RTs have very short sequences. They appeared to be truncated, since their first-half regions were well aligned but they lacked the latter-half regions.

I hypothesized that these truncations in the coding sequences were caused by ribosomal frameshifts, so that the latter-half regions may exist downstream of the annotated coding region. To test the hypothesis, the downstream regions were concatenated with their coding sequences and aligned (Figure 3.9). The downstream regions of those truncated RTs aligned well to the full-length (not truncated) RT, so it seemed that a frameshift mutation had occurred at the end of the coding region.

Figure 3.9 Sequence alignment of RT-coding regions and their neighbors. Representative truncated RTs (RT① to RT⑤, each shown as CDS and as CDS + downstream) and their downstream regions were concatenated and aligned with a full-length (not truncated) RT (protein ID: KKS64476.1 of Parcubacteria group bacterium GW2011_GWB1_42_6). The black dashed box represents the putative translational frameshift site.

Figure 3.8 Domain architecture of CPR RTs with other domains. Functional domains in RT proteins are visualized. Domains are based on the Pfam database and defined as follows: RVT_1 (PF00078), Reverse transcriptase (RNA-dependent DNA polymerase); GIIM (PF08388), Group II intron, maturase-specific domain; RVT_N (PF13655), N-terminal domain of reverse transcriptase; UDG (PF03167), Uracil DNA glycosylase superfamily; zf-CHC2 (PF01807), CHC2 zinc finger.

Protein (GenBank accession): domain architecture
KKR03365.1: RVT_1, GIIM
KKU32234.1: RVT_N, RVT_1, GIIM
OGF82770.1 (Candidatus Giovannonibacteria bacterium RIFCSPLOWO2_01_FULL_44_16): RVT_1, UDG
OGG55025.1 (Candidatus Kaiserbacteria bacterium RIFCSPHIGHO2_02_FULL_49_11): RVT_1, zf-CHC2
OGV95898.1: RVT_1, GIIM
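The domain architectures listed for Figure 3.8 can be held in a small lookup structure, which makes questions such as "which RTs carry a maturase domain" easy to answer. The sketch below is only an illustration: the dictionary name and helper function are hypothetical, and the entries are transcribed from the figure labels rather than recomputed from Pfam.

```python
# Domain order per protein, transcribed from the Figure 3.8 labels.
DOMAIN_ARCHITECTURE = {
    "KKR03365.1": ["RVT_1", "GIIM"],
    "KKU32234.1": ["RVT_N", "RVT_1", "GIIM"],
    "OGF82770.1": ["RVT_1", "UDG"],
    "OGG55025.1": ["RVT_1", "zf-CHC2"],
    "OGV95898.1": ["RVT_1", "GIIM"],
}


def proteins_with(domain):
    """Return the accessions whose architecture contains the given Pfam domain."""
    return sorted(acc for acc, doms in DOMAIN_ARCHITECTURE.items() if domain in doms)


maturase_rts = proteins_with("GIIM")  # the three group II intron-like RTs
udg_rts = proteins_with("UDG")        # the single UDG-fused RT
```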


To examine the phylogenetic distribution of these putative frameshift RTs, the RTs were mapped onto the CPR bacterial phylogeny [23] (Figure 3.10A). Putative frameshift RTs were not confined to specific phyla but were widely distributed in CPR bacteria. They appeared in both major superphyla of CPR, Parcubacteria (OD1) and Microgenomates. RTs were found in 318 of the 804 CPR species, and putative frameshift RTs existed in 38 species. In total, 63 out of 539 RTs (11.7%) in CPR bacteria and 367 out of 7,143 RTs (5.1%) in RefSeq prokaryotes were identified as frameshift RTs.

To confirm that the frameshifts occur specifically in RTs, DNA polymerase family A proteins were retrieved and analyzed for frameshifts in the same way (Figure 3.10B). In contrast to RTs, only 4 out of 670 DNA polymerase family A proteins (0.6%) were identified as frameshift proteins, and they were found in only 4 species.

From this analysis, I found that RT proteins containing frameshift mutations are widely conserved in CPR bacterial phyla. It is difficult to imagine that these proteins are conserved without any reason, so I assumed that the proteins might be fully translated through translational frameshifting, in which the translating ribosome slips, skips nucleotides, and reads a different frame thereafter.

Figure 3.10 Phylogenetic distribution of putative frameshift proteins. The presence of putative frameshift and full-length RTs (A) and DNA polymerase family A proteins (B) was mapped onto the CPR phylogeny (804 genomes). Genomes with full-length proteins are colored in blue and those with putative frameshift proteins are colored in red. RTs were found in 318 species and putative frameshift RTs existed in 38 species; DNA polymerase family A proteins were found in 619 species and putative frameshift proteins existed in 4 species. The CPR phylogeny was taken from Hug et al. and modified.

Few studies have comprehensively identified RTs in CPR bacteria, except for DGRs [16,54]. DGRs, retroelements that consist of an RT and specific types of related sequences, were identified in CPR bacteria and suggested to be used for adaptation to a dynamic host-dependent environment [54]. The phylogenetic distribution of CPR RTs was congruent with the previous study that identified DGRs [54]. Although that study focused on DGR RTs, I newly identified RTs in additional CPR phyla, i.e. Candidatus Amesbacteria, Candidatus Collierbacteria, and Candidatus Pacebacteria. The types of the RTs I identified were not clearly determined and the biological properties of these phyla have not been elucidated, but the RTs might contribute to CPR ecology since they are still retained rather than discarded in extremely small CPR genomes.

Considering that putative translational frameshifts existed in RTs but not in other polymerases, some mechanism specific to RTs might give rise to the frameshifts. I supposed that RTs in bacterial group II introns contributed to the mutation. Group II introns are mobile elements whose mobility reactions are mediated by RTs encoded in the introns [6,7]. The RTs synthesize DNA from transcribed group II introns, and frameshift mutations may arise during reverse transcription. Although RTs in group II introns have high processivity and fidelity, which preserve the functions of the RT [12,55,56], RTs lack the 3' to 5' proofreading capability of other DNA polymerases [57] and are prone to errors.

Since most CPR bacterial genomes were reconstructed by metagenomics, it was possible that the RT frameshifts were merely sequencing artifacts. Therefore, to reduce this possibility, I examined frameshifts in another protein, DNA polymerase. The result showed that frameshifts were found in only a few cases, implying that the frameshifts are unlikely to be an artifact of metagenomic methodologies.
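The comparison between frameshift frequencies in RTs (63 of 539) and in DNA polymerase family A proteins (4 of 670) can be made quantitative with a test on the 2x2 table. The sketch below uses a two-sided Fisher's exact test implemented with the standard library only. The choice of test is an assumption for illustration; this report gives only the counts and percentages and does not state that such a test was performed.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    The p-value is the sum of the hypergeometric probabilities of all tables
    with the same margins that are no more likely than the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Small relative tolerance so that ties with p_obs are included.
    return sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))


# Counts taken from the text: frameshift vs. non-frameshift proteins.
rt_frameshift, rt_total = 63, 539
pola_frameshift, pola_total = 4, 670

rt_percent = round(100 * rt_frameshift / rt_total, 1)
pola_percent = round(100 * pola_frameshift / pola_total, 1)
p_value = fisher_exact_two_sided(
    rt_frameshift, rt_total - rt_frameshift,
    pola_frameshift, pola_total - pola_frameshift,
)
```

A very small `p_value` would indicate that the difference between the two proportions is unlikely under equal frameshift rates, although it would not by itself distinguish biological frameshifts from family-specific assembly or annotation errors.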


4 Conclusion

In this study, I revealed that several types of RT proteins are widely distributed and conserved in CPR bacterial phyla. CPR bacteria have at least RTs related to DGRs, group II introns, retrons, and abortive infection (Abi), and their abundance differs from that in non-CPR bacteria. CPR bacteria are thought to live attached to other microorganisms. The predominance of DGRs, which are a minor RT type in other prokaryotes, suggests that CPR bacteria have successfully used the property of DGRs, introducing mutations into target genes, to adapt to rapidly changing host environments. The polyphyletic tree of CPR RTs implies that CPR bacteria and non-CPR bacteria occasionally exchange RT genes, contributing to RT evolution and diversification. In addition, sequence comparisons among CPR RTs and other prokaryotic and viral RTs showed that there were several truncated RT protein sequences. These were RTs containing frameshift mutations, and they were widely distributed in CPR phyla. Since this phenomenon is RT-specific and it is unlikely that group II introns introduce mutations when replicating their own sequences, it is speculated that these RTs with frameshift mutations may retain some kind of function.

Acknowledgements

I would like to thank Professor Akio Kanai for his great support of my research. He taught me the fascinating aspects of molecular biology and evolution, and he always encouraged my research. I also thank Ms. Megumi Tsurumaki, Mr. Masahiro Miura, and all the members of the RNA Group for their insightful discussions. I would also like to express my gratitude to my family for their moral support and warm encouragement. Finally, I would like to thank Professor Masaru Tomita for providing a stimulating environment for my research at IAB. This work was supported, in part, by Taikichiro Mori Memorial Research Grants.

References

1. Baltimore D. RNA-dependent DNA polymerase in virions of RNA tumour viruses. Nature. 1970;226(5252):1209–11.
2. Temin HM, Mizutani S. RNA-dependent DNA polymerase in virions of Rous sarcoma virus. Nature. 1970;226(5252):1211–3.
3. Lampson BC, Sun J, Hsu MY, Vallejo-Ramirez J, Inouye S, Inouye M. Reverse transcriptase in a clinical strain of Escherichia coli: production of branched RNA-linked msDNA. Science. 1989;243(4894 Pt 1):1033–8.
4. Lim D, Maas WK. Reverse transcriptase-dependent synthesis of a covalently linked, branched DNA-RNA compound in E. coli B. Cell. 1989;56(5):891–904.
5. Michel F, Jacquier A, Dujon B. Comparison of fungal mitochondrial introns reveals extensive homologies in RNA secondary structure. Biochimie. 1982;64(10):867–81.
6. McNeil BA, Semper C, Zimmerly S. Group II introns: Versatile ribozymes and retroelements. Wiley Interdiscip Rev RNA. 2016;7(3):341–55.
7. Lambowitz AM, Belfort M. Mobile Bacterial Group II Introns at the Crux of Eukaryotic Evolution. Microbiol Spectr. 2015;3(2):1–26.
8. Liu M, Deora R, Doulatov SR, Gingery M, Eiserling FA, Preston A, et al. Reverse Transcriptase-Mediated Tropism Switching in Bordetella Bacteriophage. Science. 2002;295(March):2091–4.
9. Liu M, Deora R, Simons RW, Doulatov S, Hodes A, Dai L, et al. Tropism switching in Bordetella bacteriophage defines a family of diversity-generating retroelements. Nature. 2004;431(7007):476–81.
10. Arambula D, Miller JF, Guo H, Ghosh P. Diversity-generating Retroelements in Phage and Bacterial Genomes. Microbiol Spectr. 2014;2(6):1–16.
11. van der Veen R, Arnberg AC, van der Horst G, Bonen L, Tabak HF, Grivell LA. Excised group II introns in yeast mitochondria are lariats and can be formed by self-splicing in vitro. Cell. 1986;44(2):225–34.
12. Cousineau B, Smith D, Lawrence-Cavanagh S, Mueller JE, Yang J, Mills D, et al. Retrohoming of a bacterial group II intron: Mobility via complete reverse splicing, independent of homologous DNA recombination. Cell. 1998;94(4):451–62.
13. Lazowska J, Meunier B, Macadre C. Homing of a group II intron in yeast mitochondrial DNA is accompanied by unidirectional co-conversion of upstream-located markers. EMBO J. 1994;13(20):4963–72.
14. Gladyshev EA, Arkhipova IR. A widespread class of reverse transcriptase-related cellular genes. Proc Natl Acad Sci U S A. 2011;108(51):20311–6.
15. Zimmerly S, Wu L. An Unexplored Diversity of Reverse Transcriptases in Bacteria. Microbiol Spectr. 2015;3(2):1–16.
16. Wu L, Gingery M, Abebe M, Arambula D, Czornyj E, Handa S, et al. Diversity-generating retroelements: Natural variation, classification and evolution inferred from a large-scale genomic survey. Nucleic Acids Res. 2018;46(1):11–24.
17. Menéndez-Arias L, Sebastián-Martín A, Álvarez M. Viral reverse transcriptases. Virus Res. 2017;234:153–76.
18. Wang C, Villion M, Semper C, Coros C, Moineau S, Zimmerly S. A reverse transcriptase-related protein mediates phage resistance and polymerizes untemplated DNA in vitro. Nucleic Acids Res. 2011;39(17):7620–9.
19. Fortier LC, Bouchard JD, Moineau S. Expression and site-directed mutagenesis of the lactococcal abortive phage infection protein AbiK. J Bacteriol. 2005;187(11):3721–30.
20. Simon DM, Zimmerly S. A diversity of uncharacterized reverse transcriptases in bacteria. Nucleic Acids Res. 2008;36(22):7219–29.
21. Kojima KK, Kanehisa M. Systematic survey for novel types of prokaryotic retroelements based on gene neighborhood and protein architecture. Mol Biol Evol. 2008;25(7):1395–404.
22. Brown CT, Hug LA, Thomas BC, Sharon I, Castelle CJ, Singh A, et al. Unusual biology across a group comprising more than 15% of domain Bacteria. Nature. 2015;523(7559):208–11.
23. Hug LA, Baker BJ, Anantharaman K, Brown CT, Probst AJ, Castelle CJ, et al. A new view of the tree of life. Nat Microbiol. 2016;1(5):16048.
24. Castelle CJ, Banfield JF. Major New Microbial Groups Expand Diversity and Alter our Understanding of the Tree of Life. Cell. 2018;172(6):1181–97.
25. He X, McLean JS, Edlund A, Yooseph S, Hall AP, Liu SY, et al. Cultivation of a human-associated TM7 phylotype reveals a reduced genome and epibiotic parasitic lifestyle. Proc Natl Acad Sci U S A. 2015;112(1):244–9.
26. Anantharaman K, Brown CT, Hug LA, Sharon I, Castelle CJ, Probst AJ, et al. Thousands of microbial genomes shed light on interconnected biogeochemical processes in an aquifer system. Nat Commun. 2016;7:1–11.
27. Dudek NK, Sun CL, Burstein D, Kantor RS, Aliaga Goltsman DS, Bik EM, et al. Novel Microbial Diversity and Functional Potential in the Marine Mammal Oral Microbiome. Curr Biol. 2017;27(24):3752-3762.e6.
28. Danczak RE, Johnston MD, Kenah C, Slattery M, Wrighton KC, Wilkins MJ. Members of the Candidate Phyla Radiation are functionally differentiated by carbon- and nitrogen-cycling capabilities. Microbiome. 2017;5(1):112.
29. Starr EP, Shi S, Blazewicz SJ, Probst AJ, Herman DJ, Firestone MK, et al. Stable isotope informed genome-resolved metagenomics reveals that Saccharibacteria utilize microbially-processed plant-derived carbon. Microbiome. 2018;6(1):1–12.
30. Orsi WD, Richards TA, Francis WR. Predicted microbial secretomes and their target substrates in marine sediment. Nat Microbiol. 2018;3(1):32–7.
31. Castelle CJ, Brown CT, Anantharaman K, Probst AJ, Huang RH, Banfield JF. Biosynthetic capacity, metabolic variety and unusual biology in the CPR and DPANN radiations. Nat Rev Microbiol. 2018;
32. Haft DH, DiCuccio M, Badretdin A, Brover V, Chetvernin V, O'Neill K, et al. RefSeq: An update on prokaryotic genome annotation and curation. Nucleic Acids Res. 2018;46(D1):D851–60.
33. Mistry J, Finn RD, Eddy SR, Bateman A, Punta M. Challenges in homology search: HMMER3 and convergent evolution of coiled-coil regions. Nucleic Acids Res. 2013;41(12).
34. El-Gebali S, Mistry J, Bateman A, Eddy SR, Luciani A, Potter SC, et al. The Pfam protein families database in 2019. Nucleic Acids Res. 2018;(8):1–6.
35. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol. 1990;215(3):403–10.
36. Altschul S, Madden T, Schaffer A, Zhang J, Zhang Z, Miller W, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25(17):3389–402.
37. Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, et al. BLAST+: architecture and applications. BMC Bioinformatics. 2009;10(421):1.
38. Dufour YS, Kiley PJ, Donohue TJ. Reconstruction of the core and extended regulons of global transcription factors. PLoS Genet. 2010;6(7):1–20.
39. Matsui M, Tomita M, Kanai A. Comprehensive computational analysis of bacterial CRP/FNR superfamily and its target motifs reveals stepwise evolution of transcriptional networks. Genome Biol Evol. 2013;5(2):267–82.
40. Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, et al. Cytoscape: A software environment for integrated models of biomolecular interaction networks. Genome Res. 2003;13(11):2498–504.
41. Katoh K, Standley DM. MAFFT multiple sequence alignment software version 7: Improvements in performance and usability. Mol Biol Evol. 2013;30(4):772–80.
42. Stamatakis A. RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics. 2014;30(9):1312–3.
43. Huerta-Cepas J, Serra F, Bork P. ETE 3: Reconstruction, Analysis, and Visualization of Phylogenomic Data. Mol Biol Evol. 2016;33(6):1635–8.
44. Letunic I, Bork P. Interactive tree of life (iTOL) v3: an online tool for the display and annotation of phylogenetic and other trees. Nucleic Acids Res. 2016;44(W1):W242–5.
45. Moore AD, Heldy A, Terrapon N, Weiner J, Bornberg-Bauer E. DoMosaics: Software for domain arrangement visualization and domain-centric analysis of proteins. Bioinformatics. 2014;30(2):282–3.
46. Jones P, Binns D, Chang HY, Fraser M, Li W, McAnulla C, et al. InterProScan 5: Genome-scale protein function classification. Bioinformatics. 2014;30(9):1236–40.
47. Mitchell AL, Attwood TK, Babbitt PC, Blum M, Bork P, Bridge A, et al. InterPro in 2019: Improving coverage, classification and access to protein sequence annotations. Nucleic Acids Res. 2019;47(D1):D351–60.
48. Kalvari I, Argasinska J, Quinones-Olvera N, Nawrocki EP, Rivas E, Eddy SR, et al. Rfam 13.0: Shifting to a genome-centric resource for non-coding RNA families. Nucleic Acids Res. 2018;46(D1):D335–42.
49. Miller JL, Le Coq J, Hodes A, Barbalat R, Miller JF, Ghosh P. Selective ligand recognition by a diversity-generating retroelement variable protein. PLoS Biol. 2008;6(6):1195–207.
50. Iyer LM, Koonin EV, Aravind L. Evolutionary connection between the catalytic subunits of DNA-dependent RNA polymerases and eukaryotic RNA-dependent RNA polymerases and the origin of RNA polymerases. 2003;23:1–23.
51. Moelling K, Broecker F, Russo G, Sunagawa S. RNase H As gene modifier, driver of evolution and antiviral defense. Front Microbiol. 2017;8(SEP):1–20.
52. Moelling K, Broecker F. The reverse transcriptase-RNase H: From viruses to antiviral defense. Ann N Y Acad Sci. 2015;1341(1):126–35.
53. Toro N, Nisa-Martínez R. Comprehensive phylogenetic analysis of bacterial reverse transcriptases. PLoS One. 2014;9(11):1–16.
54. Paul BG, Burstein D, Castelle CJ, Handa S, Arambula D, Czornyj E, et al. Retroelement-guided protein diversification abounds in vast lineages of Bacteria and Archaea. Nat Microbiol. 2017;2(April):1–7.
55. Mohr S, Ghanem E, Smith W, Sheeter D, Qin Y, King O, et al. Thermostable group II intron reverse transcriptase fusion proteins and their use in cDNA synthesis and next-generation RNA sequencing. RNA. 2013;19(7):958–70.
56. Conlan LH, Stanger MJ, Ichiyanagi K, Belfort M. Localization, mobility and fidelity of retrotransposed Group II introns in rRNA genes. Nucleic Acids Res. 2005;33(16):5262–70.
57. Kunkel TA, Bebenek K. DNA Replication Fidelity. Annu Rev Biochem. 2000;69(1):497–529.
