The Rnase H-Like Superfamily: New Members, Comparative Structural Analysis and Evolutionary Classification Karolina A
Total Page:16
File Type:pdf, Size:1020Kb
4160–4179 Nucleic Acids Research, 2014, Vol. 42, No. 7 Published online 23 January 2014 doi:10.1093/nar/gkt1414 The RNase H-like superfamily: new members, comparative structural analysis and evolutionary classification Karolina A. Majorek1,2,3,y, Stanislaw Dunin-Horkawicz1,y, Kamil Steczkiewicz4, Anna Muszewska4,5, Marcin Nowotny6, Krzysztof Ginalski4 and Janusz M. Bujnicki1,3,* 1Laboratory of Bioinformatics and Protein Engineering, International Institute of Molecular and Cell Biology, ul. Ks. Trojdena 4, PL-02-109 Warsaw, Poland, 2Department of Molecular Physiology and Biological Physics, University of Virginia, 1340 Jefferson Park Avenue, Charlottesville, VA USA-22908, USA, 3Bioinformatics Laboratory, Institute of Molecular Biology and Biotechnology, Adam Mickiewicz University, Umultowska 89, PL-61-614 Poznan, Poland, 4Laboratory of Bioinformatics and Systems Biology, Centre of New Technologies, University of Warsaw, Zwirki i Wigury 93, PL-02-089 Warsaw, Poland, 5Institute of Biochemistry and Biophysics PAS, Pawinskiego 5A, PL-02-106 Warsaw, Poland and 6Laboratory of Protein Structure, International Institute of Molecular and Cell Biology, ul. Ks. Trojdena 4, PL-02-109 Warsaw, Poland Received September 23, 2013; Revised December 12, 2013; Accepted December 26, 2013 ABSTRACT revealed a correlation between the orientation of Ribonuclease H-like (RNHL) superfamily, also called the C-terminal helix with the exonuclease/endo- the retroviral integrase superfamily, groups together nuclease function and the architecture of the numerous enzymes involved in nucleic acid metab- active site. Our analysis provides a comprehensive olism and implicated in many biological processes, picture of sequence-structure-function relation- including replication, homologous recombination, ships in the RNHL superfamily that may guide func- DNA repair, transposition and RNA interference. tional studies of the previously uncharacterized The RNHL superfamily proteins show extensive di- protein families. vergence of sequences and structures. We con- ducted database searches to identify members of INTRODUCTION the RNHL superfamily (including those previously The ribonuclease H-like (RNHL) superfamily is a large unknown), yielding >60 000 unique domain se- group of evolutionarily related, but strongly diverged, quences. Our analysis led to the identification of proteins with different functions. Ribonuclease (RNase) new RNHL superfamily members, such as RRXRR H from E. coli was the first protein of the superfamily for (PF14239), DUF460 (PF04312, COG2433), DUF3010 which the 3D structure was determined, revealing a new (PF11215), DUF429 (PF04250 and COG2410, architecture of the polypeptide chain, subsequently called COG4328, COG4923), DUF1092 (PF06485), the RNase H fold (1,2). Similar spatial architecture of COG5558, OrfB_IS605 (PF01385, COG0675) and the catalytic domain was later identified in other enzymes Peptidase_A17 (PF05380). Based on the clustering involved in nucleic acid metabolism, including retroviral integrases and DNA transposases (3), Holliday junction analysis we grouped all identified RNHL domain se- resolvases (HJRs) (4), Piwi/Argonaute nucleases (5,6), quences into 152 families. Phylogenetic studies numerous exonucleases (7) and Prp8: the largest and revealed relationships between these families, and most highly conserved spliceosomal protein considered suggested a possible history of the evolution of to be a master regulator of the spliceosome (8). RNase RNHL fold and its active site. Our results revealed H-like enzymes are involved in numerous fundamental clear division of the RNHL superfamily into exo- processes, including DNA replication and repair, hom- nucleases and endonucleases. Structural analyses ologous recombination, transposition and RNA of features characteristic for particular groups interference. *To whom correspondence should be addressed. Tel: +48 22 597 0750; Fax: +48 22 597 0715; Email: [email protected] Correspondence may also be addressed to Krzysztof Ginalski. Tel: +48 22 5540800; Fax: +48 22 5540801; Email: [email protected] yThese authors contributed equally to the paper as first authors. ß The Author(s) 2014. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited. Nucleic Acids Research, 2014, Vol. 42, No. 7 4161 RNHL superfamily proteins, despite extensive sequence These representatives were used as queries to search the and function diversity, show significant similarity of the Protein Data Bank (PDB) with DALI (16) to identify global 3D fold, the architecture of the catalytic core and proteins with a similar topology, including those not the catalytic mechanism. The RNase H-like fold has been yet assigned to the RNHL superfamily in the SCOP shown to be one of the evolutionarily oldest protein folds database. We conducted analysis of pairwise structural (9). The main element of the RNase H-like catalytic core is similarities between the previously known members of a b-sheet comprising five b-strands, ordered 32 145, where the RNase H superfamily using the DaliLite server. the b-strand 2 is antiparallel to the other b-strands. On Based on the Z-scores assigned to each compared pair both sides the central b-sheet is flanked by a-helices, the of known homologs, we attempted to determine the number of which differs between related enzymes. RNHL Z-score value threshold that would allow us to automat- superfamily members also share the position and type of ically include newly identified similar structures in further the active site residues, which typically include aspartic steps of the analysis. Initially, new structures identified acid, glutamic acid and in some cases histidine. Positions by DALI with Z score 4 were automatically included in of the two key aspartate residues are most strongly further steps of the analysis, while structures with Z conserved, while the position and identity of the remaining scores <4 were checked for similarity of their function catalytic amino acid residues differ to some extent between and analyzed visually to confirm their similarity to members of particular families. Negatively charged side RNase H fold. In some cases, for the structures of chains in the active sites of the RNase H-like enzymes proteins with unknown function, false positives or false are involved, directly or through the water molecule, in negatives could be generated, due to the relatively high coordination of divalent metal ions. It has been shown similarity of the RNase H-like superfamily members to that RNase H-like enzymes use a two ion-dependent the Actin-like ATPase domain that shares the same fold. mechanism of catalysis (2,10–13). Under physiological Amino acid sequences corresponding to the collected conditions, the preferred ion is Mg2+, but Mn2+ also structures of RNHL superfamily members were clustered supports catalysis, while Ca2+ inhibits the cleavage (14). using CLANS (17), so that proteins showing sequence Many enzymes have been classified as RNHL superfam- similarity could be grouped into families. CLANS uses ily members, including RNases and deoxyribonucleases, the P-values of highly scoring segment pairs obtained exo- and endonucleases, proteins that fulfill numerous from an N Â N BLAST search, to compute attractive functions in Eukaryota, Prokaryota, Archaea and and repulsive forces between each sequence pair in a viruses. For many of these proteins, 3D structures have user-defined data set. A 2D or 3D representation of been solved, e.g. the SCOP database, as of August 2013 sequence families is achieved by randomly seeding the se- lists 14 protein families with experimentally determined quences in the arbitrary distance space, then moving them structures, classified as the members of the RNHL super- within this environment according to the force vectors re- family. However, for most members of the RNHL super- sulting from all pairwise interactions, and repeating the family identified to date, the structural information is process until convergence, based on the Fruchterman– missing and their evolutionary origin is typically Reingold graph layout algorithm (17). The P-value thresh- unknown. In addition, discoveries of unexpected RNase old of 1e-12 was used for the clustering; the clusters H-like structures in various proteins, such as Prp8p, obtained were robust, as clustering with a more permissive suggest that further RNHL superfamily members remain threshold of 1e-2 has yielded qualitatively similar results to be discovered. (data not shown). Sequence similarity-based clustering In this work, we carried out a search for all proteins, was used three times in this analysis, first to group the which, based on their structural and catalytic properties, sequences representing proteins with known structures, can be incorporated into the RNHL superfamily. We have second to group representatives of Pfam, COG and subsequently analyzed sequence-structure-function rela- KOG databases and third to group all identified RNHL- tionships and developed a classification scheme for previ- like domains. ously known and new members of the RNHL superfamily, Based on the results of clustering performed with which will greatly facilitate computational annotations of CLANS, we selected representative sequences for each proteins and domains and the planning of experiments