Report for Taikichiro Mori Memorial Research Grants 2019 (2019 年度森基金研究成果報告書)
Discovery and functional elucidation of novel enzymes involved in the replication of life (生命の複製に関わる酵素の新規発見と機能解明)
Comprehensive evolutionary analysis of reverse transcriptases in viruses and prokaryotes
Shohei Nagata
Institute for Advanced Biosciences, Keio University, Tsuruoka 997-0035, Japan and Systems Biology Program, Graduate School of Media and Governance, Keio University, Fujisawa 252-0882, Japan.
Abstract
Reverse transcriptases (RTs) are enzymes that polymerize DNA from RNA templates. RTs are usually thought of as viral and eukaryotic elements, but they are also present in bacteria. Bacterial RTs appear to be ancestors of eukaryotic RTs, and several types have been identified, i.e. group II introns, retrons, CRISPR/Cas-associated RTs, diversity-generating retroelements (DGRs), and Abi-like genes. Recently, several studies have reported the existence of RTs in a recently described bacterial group, the candidate phyla radiation (CPR). These CPR RTs are thought to play important roles in CPR bacterial ecology, since CPR bacteria retain RT genes while lacking numerous biosynthetic pathways. In this study, I comprehensively collected RT-like sequences from CPR genomes and systematically characterized their functions and evolution. Using known functional domain profiles of RTs as queries, a sequence similarity search was performed against 804 near-complete genomes of CPR bacteria in the database. I obtained 514 RT sequences, and these RTs are widely distributed across CPR phyla. CPR bacteria are known to use DGR-associated RTs to adapt to rapidly changing environments; in addition, I found RTs related to group II introns, retrons, and abortive infection (Abi). I will discuss possible roles and the evolution of RTs in CPR bacteria.
Contact: [email protected]
1 Introduction
The central dogma of molecular biology, proposed in 1958, describes the flow of genetic information: information stored in DNA is transcribed into mRNA and translated into protein. In 1970, however, the discovery of an RNA-dependent DNA polymerase (reverse transcriptase; RT), which synthesizes DNA from an RNA template, showed that this flow can be reversed [1,2]. RT was discovered through studies of tumor-associated retroviruses that infect eukaryotes, and various types of RT enzymes, primarily related to eukaryotes, were discovered thereafter. In addition to viruses infecting eukaryotic organisms (retroviruses, pararetroviruses, hepadnaviruses), RT-homologous regions have been found in long terminal repeat (LTR) retroelements, non-LTR retroelements, and telomerase.
In 1989, the retron, one type of RT-containing element, was found in bacteria [3,4]. Subsequently, various types of RTs were discovered in bacteria and archaea, including those of group II introns [5–7] and diversity-generating retroelements (DGRs) [8–10]. Retrons consist of an RT and an adjacent repeat sequence, but their function remains unknown.
1 S.Nagata
Group II introns are retroelements consisting of a catalytic RNA and an RT protein, which together mediate splicing and mobility reactions [11–13]. DGRs are retroelements that have lost their mobility functions and use reverse transcription to generate sequence variation in specific target genes [10]. It was subsequently revealed that RT genes are widely present in the three domains of life (bacteria, archaea, eukaryotes) and in viruses [14–17]. In bacteria, RT-homologous regions are also known to exist in abi genes, which are related to abortive infection (Abi) against phages [18,19], and in the cas1 gene of the CRISPR/Cas immune system [20,21]. Three bacterial RT-related proteins are involved in phage resistance: AbiA, AbiK, and Abi-P2 [15]. AbiA and AbiK are thought to provide phage immunity through abortive infection. In addition, recent reports indicate that many uncharacterized RT-like sequences exist, mainly in bacteria [15,20,21]. However, what functions or activities they possess and how they diverged remain unclear.
More recently, technological advances in metagenomic analysis and single-cell genomics have made it clear that a vast group of previously unknown microbial lineages exists within bacteria. Metagenomic approaches revealed a huge diversity of previously unknown bacterial and archaeal phyla whose 16S rRNA sequences differ from those of known phyla. In bacteria, these metagenomically recovered lineages were described as the candidate phyla radiation (CPR), which comprises at least 15% of all bacteria [22]. The CPR seems to be monophyletic and clearly separated from other bacteria (Figure 1.1; Castelle and Banfield, 2018; Hug et al., 2016). CPR bacteria are widely distributed across various environments, such as the human microbiome [25], deep subsurface sediments [26], the dolphin mouth [27], drinking water [28], soil [29], marine sediment [30], and other environments [24,31].
CPR bacteria have various unusual features compared to non-CPR bacteria. CPR genomes are smaller than 1.5 Mb, whereas the genome of a typical non-CPR bacterium, Escherichia coli, is 4.6 Mb. Most CPR bacteria have lost TCA-cycle genes, and they have intron regions in their rRNA genes [22,31]. It is sometimes questioned whether CPR bacteria are cellular organisms; however, CPR genomes encode genetic systems for cell division (e.g. FtsZ-based mechanisms, which are not found in some symbionts with very reduced genomes), and measurements of replication rates and images showing cell division indicate that the cells are metabolically active. It is also thought that they may adhere to the surface of other microorganisms to survive.
CPR bacteria have been reported to carry RT-like sequences in their genomes; however, the types of these RTs, their functions, and their evolutionary scenario of diversification are not well understood. In this study, a comprehensive analysis was performed on the RT sequences from CPR bacterial genomes to reveal the roles and evolution of RTs in CPR bacteria.

Figure 1.1 A current view of the tree of life. The phylogenetic tree of bacteria, archaea, and eukaryotes, including 92 named bacterial phyla, 26 archaeal phyla, and all five of the eukaryotic supergroups. The tree was estimated by the maximum-likelihood method using a concatenation of ribosomal protein sequences. The figure was adapted from reference [23].

2 Methods
2.1 Data sources
Complete genome sequences of bacteria and archaea were downloaded from the Reference Sequence Database (RefSeq) [32] at the National Center for Biotechnology Information (NCBI) as of May 2018. The acquired dataset (denoted as RefSeq prokaryotes in this manuscript) comprised 9,078 genomes (8,825 bacterial and 253 archaeal).
Nearly full-length genomes of CPR bacteria (≥ 70% of the estimated full length recovered), 804 genomes (790 species) in total, were obtained from NCBI GenBank based on Hug et al. [23].
Known RT sequences were obtained from a previous study by Simon et al. [20]. Sequences annotated as “Unknown”, “Unclassified”, or “nonRTs” were eliminated, and a total of 930 RT sequences were collected.
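The elimination of unannotated entries from the known RT sequences can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used in this study: the record type `RTRecord`, its field names, the exact-match comparison of labels, and the example identifiers are all hypothetical; only the three excluded labels are taken from the text.

```python
from typing import Iterable, List, NamedTuple


class RTRecord(NamedTuple):
    """Hypothetical record for one known RT sequence (field names are assumptions)."""
    seq_id: str
    rt_type: str
    sequence: str


# Labels eliminated in the text; every other annotation is kept.
EXCLUDED_LABELS = frozenset({"Unknown", "Unclassified", "nonRTs"})


def keep_annotated_rts(records: Iterable[RTRecord]) -> List[RTRecord]:
    """Return only the records whose annotation is not an excluded label."""
    return [rec for rec in records if rec.rt_type not in EXCLUDED_LABELS]


# Toy input with made-up identifiers and sequences.
example_records = [
    RTRecord("rt_001", "Group II intron", "MKT"),
    RTRecord("rt_002", "Unknown", "MSA"),
    RTRecord("rt_003", "Retron", "MLV"),
    RTRecord("rt_004", "nonRTs", "MQP"),
]

kept = keep_annotated_rts(example_records)
print([rec.seq_id for rec in kept])
```

Keeping the excluded labels in one constant makes the filtering rule easy to audit against the description of the dataset.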
2 Comprehensive evolutionary analysis of reverse transcriptases in viruses and prokaryotes
2.2 Identification of RT sequences
From the prokaryotic genomes collected, RT-like sequences containing RT functional domains were identified using HMMER v.3.2 (hmmscan program; E-value ≤ 1e-5) [33] searches against the sequence profiles corresponding to “RVT_1” (PF00078) or “RVT_2” (PF07727) in Pfam-A 32.0 [34]. In the first pipeline, the Pfam profile “RVT_3” (PF13456) was also included among the query profiles, since the “RVT_3” domain is registered as “Reverse transcriptase-like” in the database. However, the proteins collected with the “RVT_3” profile were RNase H proteins rather than RTs. Therefore, proteins that exist as RNase H alone, and not as part of an RT protein, were excluded in order to observe the diversity and evolution of RT domains and proteins in the analysis including CPR bacteria.

2.3 Network analysis based on sequence similarities
Sequence similarity scores were calculated to construct a weighted undirected graph, the sequence similarity network (SSN). The similarity scores (Basic Local Alignment Search Tool [BLAST] bit scores) [35] for all the collected protein sequences were calculated with an all-against-all BLASTP (BLAST 2.7.1+) analysis [36,37], with a cut-off E-value of ≤ 1e−5. Using the BLAST bit scores, the sequence similarities were normalized to 0.0–1.0 with the following equation [38,39]:

\[
\mathrm{sim}(x, y) = \frac{\min\bigl(\text{bit score}(x, y),\ \text{bit score}(y, x)\bigr)}{\max\bigl(\text{bit score}(x, x),\ \text{bit score}(y, y)\bigr)}
\]

where sim(x,y) represents the normalized sequence similarity between two sequences x and y. If the score was 1.0, the pair was deemed to be identical. A weighted undirected graph was constructed based on the scores of all the pairs of sequences, and the edges were weighted with the scores. I set a threshold similarity value and connected two nodes when their normalized similarity exceeded the threshold. The threshold was determined by comparing the networks constructed with an incremental series of threshold values. The constructed networks were visualized with Cytoscape 3.7.1 [40], using the “Prefuse Force-Directed OpenCL Layout” with default parameters except that the “Force deterministic layouts” option was enabled.

2.4 Sequence comparison and phylogenetic analysis
To compare RefSeq prokaryotic RTs with CPR bacterial RTs, I aligned the RT sequences using MAFFT v.7.407 (L-INS-i algorithm) [41] and estimated a maximum likelihood tree using RAxML v.8.2.11 [42] with the PROTGAMMAJTT evolutionary model for amino acid sequences. Both analyses were performed and visualized through the Environment for Tree Exploration (ETE) v.3.1.1 [43]. In addition, the identified CPR RTs were mapped onto the phylogenetic tree estimated by Hug et al. [23] using iTOL [44].

2.5 Estimation of frameshift mutations in polymerases
To verify whether frameshift mutations occurred only in CPR bacterial RTs, DNA polymerase family A proteins were identified and compared to the RTs. DNA polymerase family A proteins were identified using HMMER v.3.2 (hmmscan program; E-value ≤ 1e-5) [33] searches against the sequence profile corresponding to “DNA_pol_A” (PF00476) in Pfam 32.0 [34]. To increase the phylogenetic coverage of the polymerases in the CPR phylogeny, the retrieved DNA polymerase protein sequences (438 sequences for CPR bacteria) were additionally searched against all coding sequences of the datasets using BLAST v.2.8.1+ (blastp program; E-value ≤ 1e-5; query coverage per subject ≥ 50%) [35–37], and 670 sequences were identified for CPR bacteria. With the same pipeline, I also re-identified RT sequences using the 514 CPR RT sequences as queries and retrieved 539 RTs from CPR genomes.

2.6 Domain architecture of related proteins
The domain organization of CPR RTs was visualized with DoMosaics v.0.95 [45]. The visualized domains were extracted using HMMER v.3.2 (hmmscan program) [33] searches against the Pfam-A 32.0 [34] database. HMMER was run and its results were combined within DoMosaics. Other sequences with specific domain architectures were searched with InterProScan [46] against the InterPro database [47].

2.7 Identification of RT-related group II introns
Since most bacterial group II introns encode an RT as the intron-encoded protein (IEP) in their open reading frame (ORF), I identified the introns in order to annotate RT functions. To detect the characteristic RNA secondary structures surrounding the IEP (RT), structures homologous to the specific domains of the introns (domains I-VI) were searched in CPR bacterial genomes. Domains V and VI were searched using Infernal v.1.1.2 (cmsearch program with --nohmm option; score > 24) against the RNA secondary structural profile corresponding to “Intron_gpII” (RF00029) in the Rfam database [48]. For domains I-IV, Infernal v.1.1.2 (cmsearch program with --rfam option; E-value ≤ 1e-10) was used against the profiles corresponding to “group-II-D1D4-1” (RF01998), “group-II-D1D4-2” (RF01999), “group-II-D1D4-3” (RF02001), “group-II-D1D4-4” (RF02003), “group-II-D1D4-5” (RF02004), “group-II-D1D4-6” (RF02005), and “group-II-D1D4-7” (RF02012) in the database. Based on the search results, and considering the distances between the intron components, the types of group II introns were defined as follows: full-length, which has all of domains I-IV, ORF-RT, and domains V-VI; ORF-less, which lacks ORF-RT but has the RNA domains (I-IV and V-VI); and others, which lack one of the three components.

3 Results and discussion
3.1 Overall relationships among prokaryotic RTs
To see the overall sequence relationships of RTs and RT-related proteins in prokaryotes and viruses, I constructed and visualized a sequence similarity network (SSN) (Figure 3.1). The SSN is a graphical representation of the similarities between sequences. Each sequence is indicated by a point (node), and the similarity between two sequences is represented by the length of the line (edge) connecting the points. The smaller the distance between the nodes, the greater the degree of similarity between the sequences. I used RTs and related protein sequences identified from the prokaryotic and viral genomes in the RefSeq dataset. Nodes are colored according to the origin of the sequences (bacteria (non-CPR), archaea, or virus; Figure 3.1A) or according to the types of RT and RT-related proteins (Figure 3.1B). An overview of the entire network structure shows that the RTs and RT-related proteins can be divided into four groups, i.e., RTs of bacteria and archaea, RTs of viruses, RNA-dependent RNA polymerases (RdRp), and ribonuclease (RNase) H. The groups of viral RTs and viral RdRp consisted only of sequences derived from viruses, whereas the RNase H group and the bacterial RT group both contained sequences derived from all three taxonomic domains considered here (bacteria, archaea, and viruses). Some bacterial types of RTs, such as those of DGRs, have been found in virus (bacteriophage) genomes [49], and they are mainly associated with the bacterial RT group on the network.

Figure 3.1 Sequence similarity network of RTs from RefSeq prokaryotes. Nodes (colored dots) represent the RT protein sequences and the edge lengths represent the sequence similarities. (A) Nodes are colored according to the origin of sequences: bacteria (non-CPR); archaea; virus. (B) Nodes are colored according to the types of RT and RT-related proteins. Legend entries in (B): RNase H; Unclassified; RdRP 3; RNA-dependent RNA polymerase; RT Rtv; RT ZFREV-like; RdRP 4; RT LTR; viral DNA polymerase; RT retron II; RT group II intron; RT retron I; RT nLTR-like.

The phylogenetic relationships of the obtained RTs and RT-related proteins were analyzed together with the structure of their protein functional domains (Figure 3.2). Several sequences were selected from each type of RT for this analysis. The color of each tip in the tree corresponds to the color of the node in Figure 3.1, and the type of RT and the taxonomic domain of origin (bacteria, archaea, virus) are described together. Retroviral, LTR, non-LTR, and retron II types of RTs were located close together on the phylogenetic tree, while group II intron and retron I RTs were split across multiple clades. Many viral RTs possessed various protein domains in addition to the central RT domain, shown as “RVT_1” in the figure, and their sequences were considerably longer than those of prokaryotes. This is probably because viruses often encode one protein with multiple functions.

RdRp and RNase H, which are not RTs themselves, were obtained as RT-related proteins. RdRp has been considered to be evolutionarily related to RT [50], so it is not surprising that the RdRp domain sequences were highly similar to the RT domain. On the other hand, RNase H was obtained because I selected the Pfam profile “RVT_3” in the process of selecting protein sequences having the RT domain. Although the “RVT_3” domain is registered as “Reverse transcriptase-like” in NCBI CDD, it does not belong to the “RVT_1 Superfamily” or the “RT_like Superfamily” with the other RTs but belongs to the
“RNase_H_like Superfamily”. In many cases, an RT has an RNase H domain region as part of it [51,52]. From this point on, however, I exclude proteins that exist as RNase H alone, and not as part of an RT protein, in order to observe the diversity and evolution of RT domains and proteins in the subsequent analysis including CPR bacteria.
To analyze the characteristics of RTs in CPR bacteria, I first plotted histograms of the amino acid sequence lengths of RTs extracted from CPR bacteria and from non-CPR prokaryotes registered in RefSeq (Figure 3.3). For the histograms only, the RefSeq prokaryotic RTs were represented by the representative sequences of clusters containing at least 5 sequences, in order to ensure sequence diversity. The minimum sequence length was 78 residues for CPR bacteria and 72 residues for RefSeq prokaryotes, the mean length was 311 and 475 residues, and the maximum length was 763 and 1879 residues, respectively. The distribution for CPR bacterial RTs was unimodal with a small variation in sequence length, whereas that for RefSeq prokaryotes was multimodal with roughly three peaks. This suggests that the RTs of prokaryotes registered in RefSeq comprise a wide variety of RT types, whereas most RTs of CPR bacteria belong to specific types of RT.

Figure 3.2 Phylogenetic tree and domain architecture of RTs. Based on the RT and RT-related proteins identified in the RefSeq prokaryotic genomes, several sequences were selected from each type of RT. The coloring of the tips of the phylogenetic tree corresponds to the coloring of the nodes in Figure 3.1. The type of RT and the taxonomic domain of origin (bacteria, archaea, virus) are also described. Functional protein domains are colored by domain type, and the names of the domains in the Pfam database are indicated.

Figure 3.3 Distribution of sequence length of the identified RT proteins. Distribution of the amino acid sequence lengths of the identified RTs from (A) CPR bacteria and (B) non-CPR prokaryotes registered in the RefSeq database. Note that panel B shows the representative sequences of clusters containing 5 or more sequences, in order to reduce the bias in the sequence data of RefSeq.

To compare the RT sequences of CPR bacteria with those of non-CPR prokaryotes, I constructed and visualized an SSN of the RTs from both datasets (Figure 3.4). Note that in Figure 3.1, the Pfam profile “RVT_3” was also included in the extraction of RTs, and as a result a considerable number of RNase H sequences were included in the network. This RNase H protein profile (Pfam ID: “RVT_3”) was excluded here, since I wanted to target only sequences close to the RT enzyme. CPR bacterial RTs, whose nodes are colored blue, formed a cluster of sequences on the left side of the network and a group of sequences scattered slightly to the lower left
(Figure 3.4A). These RTs were classified as the group II intron type and the retron type, respectively (Figure 3.4B). In addition, some CPR bacterial RTs were annotated as RNase H-like proteins (5 sequences) or appeared similar to viral RdRp (3 sequences). Nodes annotated as group II intron-type RTs from CPR bacteria were clustered on the network (Figure 3.4) and appeared to constitute the majority of the CPR RTs. Previous studies reported that 75% of the RTs in bacterial genomes belong to group II introns, 12% to retrons, and 3% to DGRs [15,53]. However, it should be noted that, as mentioned above, a detailed discussion requires a more accurate annotation of RT types. This cluster of sequences may represent RTs characteristic of CPR bacteria, given its distance on the network from the bacterial and archaeal RTs outside CPR. If these RTs are indeed associated with group II introns, new types of group II introns may be present in CPR bacteria. Since this RT annotation was determined only by the best hits against the NCBI CDD profiles, detailed types are identified by phylogenetic analysis with known types of RTs in the next section.

Figure 3.4 Sequence similarity network of RTs from RefSeq (non-CPR) prokaryotes and CPR bacteria. Nodes (colored dots) represent the RT protein sequences and the edge lengths represent the sequence similarities. (A) Nodes are colored according to the origin of sequences: bacteria (non-CPR); CPR bacteria; archaea; virus. (B) Nodes are colored according to the types of RT and RT-related proteins. Legend entries in (B): RT group II intron; RT nLTR-like; RT retron I; Cas1; RT LTR; RNA-dependent RNA polymerase; RT retron II; RdRP 4; viral DNA polymerase; RT Rtv; RT ZFREV-like; Others & Unclassified; RNase H-like; RdRP 3.

3.2 Functional analysis and classification of CPR RTs
The sequence similarity-based search of RT domains identified 514 RT protein sequences. To observe the phylogenetic distribution of the RTs, they were mapped onto the CPR bacterial phylogeny [23] (Figure 3.5). RTs were widely distributed in CPR bacteria. They appeared in both major superphyla of CPR, Parcubacteria (OD1) and Microgenomates. RTs were found in 313 of the 804 CPR bacterial genomes.

Figure 3.5 Phylogenetic distribution of RTs in CPR bacteria. RTs were found in 313 genomes and were mapped onto the CPR phylogeny (804 genomes). Genomes with RT proteins are colored in blue. The CPR phylogeny was taken from Hug et al. and modified.

I combined the CPR RT sequences with the known RT sequences and constructed a phylogenetic tree (Figure 3.6). The CPR RTs were not monophyletic, and RTs related to retrons, abortive infection (AbiK and Abi-P2, but not AbiA), DGRs, group II introns, and group II intron-like elements were observed in CPR. Most CPR RTs (441 sequences) were associated with DGRs, accounting for 86% of the CPR RTs.
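The proportions quoted in this section can be checked with a short calculation. This is a sketch under one stated assumption: the 86% figure is taken relative to the 514 RT sequences from the initial search, not the 539 sequences retrieved by the re-identification step in Section 2.5. The helper `percent` and the constant names are illustrative.

```python
def percent(part: int, total: int) -> float:
    """Return part/total as a percentage; total must be positive."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * part / total


# Counts quoted in the text.
TOTAL_CPR_GENOMES = 804
GENOMES_WITH_RT = 313
TOTAL_CPR_RTS = 514   # assumed denominator for the DGR share
DGR_RELATED_RTS = 441

genome_share = percent(GENOMES_WITH_RT, TOTAL_CPR_GENOMES)
dgr_share = percent(DGR_RELATED_RTS, TOTAL_CPR_RTS)

print(f"CPR genomes carrying an RT: {genome_share:.1f}%")
print(f"DGR-related share of CPR RTs: {dgr_share:.1f}%")
```

Under this assumption, roughly 39% of the CPR genomes carry at least one RT, and the DGR-related share rounds to the 86% stated above.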