Bioinformatics Comparisons of RNA-Binding Proteins of Pathogenic and Non-Pathogenic Escherichia Coli Strains Reveal Novel Virule
Total Page:16
File Type:pdf, Size:1020Kb
Ghosh and Sowdhamini BMC Genomics (2017) 18:658 DOI 10.1186/s12864-017-4045-3 RESEARCH ARTICLE Open Access Bioinformatics comparisons of RNA-binding proteins of pathogenic and non-pathogenic Escherichia coli strains reveal novel virulence factors Pritha Ghosh and Ramanathan Sowdhamini* Abstract Background: Pathogenic bacteria have evolved various strategies to counteract host defences. They are also exposed to environments that are undergoing constant changes. Hence, in order to survive, bacteria must adapt themselves to the changing environmental conditions by performing regulations at the transcriptional and/or post-transcriptional levels. Roles of RNA-binding proteins (RBPs) as virulence factors have been very well studied. Here, we have used a sequence search-based method to compare and contrast the proteomes of 16 pathogenic and three non-pathogenic E. coli strains as well as to obtain a global picture of the RBP landscape (RBPome) in E. coli. Results: Our results show that there are no significant differences in the percentage of RBPs encoded by the pathogenic and the non-pathogenic E. coli strains. The differences in the types of Pfam domains as well as Pfam RNA-binding domains, encoded by these two classes of E. coli strains, are also insignificant. The complete and distinct RBPome of E. coli has been established by studying all known E. coli strains till date. We have also identified RBPs that are exclusive to pathogenic strains, and most of them can be exploited as drug targets since they appear to be non-homologous to their human host proteins. Many of these pathogen-specific proteins were uncharacterised and their identities could be resolved on the basis of sequence homology searches with known proteins. Detailed structural modelling, molecular dynamics simulations and sequence comparisons have been pursued for selected examples to understand differences in stability and RNA-binding. Conclusions: The approach used in this paper to cross-compare proteomes of pathogenic and non-pathogenic strains may also be extended to other bacterial or even eukaryotic proteomes to understand interesting differences in their RBPomes. The pathogen-specific RBPs reported in this study, may also be taken up further for clinical trials and/or experimental validations. Keywords: Escherichia coli, RNA-binding proteins, Genome-wide survey, Pathogen, Virulence, Ribonuclease PH, PELOTA, Uncharacterised Background humans [1]. In the pathogenic strains, novel genetic Escherichia coli is one of the most abundant, facultative islands and small clusters of genes are present in anaerobic gram-negative bacterium of the intestinal addition to the core genomic framework and provide the microflora and colonises the mucus layer of the colon. bacteria with increased virulence [2–4]. The extracellular The core genomic structure is common among the com- intestinal pathogen, enterohemorrhagic E. coli (EHEC), mensal strains and the various pathogenic E. coli strains which cause diarrhea, hemorrhagic colitis and the that cause intestinal and extra-intestinal diseases in haemolytic uremic syndrome, is the most devastating of the pathogenic E. coli strains [5, 6]. Pathogenic bacteria have evolved various strategies to * Correspondence: [email protected] National Centre for Biological Sciences, Tata Institute of Fundamental counteract host defences. They are also exposed to Research, Bellary Road, Bangalore, Karnataka 560 065, India © The Author(s). 2017 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. Ghosh and Sowdhamini BMC Genomics (2017) 18:658 Page 2 of 23 environments that are undergoing constant changes. Methods Hence, in order to survive, bacteria must adapt them- Dataset selves to the changing environmental conditions by al- Protein families were grouped on the basis of either tering gene expression levels and in turn adjusting structural homology (structure-centric families) or se- protein levels according to the need of the cell. Such quence homology (sequence-centric families). A dataset regulations may occur at the transcriptional and/or post- of 1285 RNA-protein and 14 DNA/RNA hybrid-protein transcriptional levels [7]. complexes were collected from the Protein Data Bank RNA-binding proteins (RBPs) are a versatile group of (PDB) (May 2015) and were split into protein and RNA proteins that perform a diverse range of functions in the chains. The RNA-interacting protein chains in this data- cell and are ‘master regulators’ of co-transcriptional and set were classified into 182 Structural Classification of post-transcriptional gene expression like RNA modifi- Proteins (SCOP) families, 135 clustered families and 127 cation, export, localization, mRNA translation, turnover orphan families (a total of 437 structure-centric fam- [8–12] and also aid in the folding of RNA into con- ilies), on the basis of structural homology with each formations that are functionally active [13]. In bacteria, other. Sequence-centric RNA-binding families were many different classes of RBPs interact with small RNAs retrieved from Pfam, using an initial keyword search of (sRNA) to form ribonucleoprotein (RNP) complexes that ‘RNA’, followed by manual curation to generate a dataset participate in post-transcriptional gene regulation pro- of 746 families. The structure-centric classification cesses [14–23]. In eukaryotes, noncoding RNAs (ncRNAs) scheme, the generation of structure-centric family are known to be important regulators of gene expression Hidden Markov Models (HMMs) and retrieval of [24–26]. Hence, bacterial RBPs that are capable of inhibit- sequence-centric family HMMs from the Pfam database ing this class of RNAs, are also capable of disrupting the (v 28) were as adapted from our previous study [43]. normal functioning of their host cells, thus acting as vi- Proteomes of 19 E. coli strains were retrieved from rulence factors. Roles of RBPs like the Hfq [27–36], Re- UniProt Proteomes (May 2016) [44] for the comparative pressor of secondary metabolites A (RsmA) [36–41] and study of pathogenic and non-pathogenic strains. The endoribonuclease YbeY [42] as virulence factors, have also names and organism IDs of the E. coli strains, their cor- been very well studied. responding UniProt proteome IDs and the total number Here, we describe the employment of mathematical of proteins in each proteome have been listed in Table 1. profiles of RBP families to study the RBP repertoire, All complete E. coli proteomes were retrieved from henceforth referred to as the ‘RBPome’,inE. coli RefSeq (May 2016) [45] to study the overall RBP strains. The proteomes of 19 E. coli strains (16 patho- landscape in E. coli.ThenamesoftheE. coli strains, genic and three non-pathogenic strains) have been stu- their corresponding assembly IDs and the total num- died to compare and contrast the RBPomes of pathogenic ber of proteins in each proteome and have been listed and non-pathogenic E. coli. More than 40 different kinds in Table 2. of proteins have been found to be present in two or more pathogenic strains, but absent from all the three non- pathogenic ones. Many of these proteins are previously Search method uncharacterised and may be novel virulence factors and The search method was described in our previous study probable candidates for further experimental validations. [43] and is represented schematically in Fig. 1. A library We have also extended our search method to probe to of 1183 RBP family HMMs (437 structure-centric fam- all available E. coli complete proteomes (till the date of ilies and 746 sequence-centric families) were used as the study) for RBPs, and thus obtain a bigger picture of start points to survey the E. coli proteomes for the pres- the RBP landscape in all known E. coli strains. The ence of putative RBPs. The genome-wide survey (GWS) search method can also be adapted in future for compar- for each E. coli proteome was performed with a se- − ing the RBPomes of other species of bacteria as well. In quence E-value cut-off of 10 3 and the hits were filtered addition, our work also discusses case studies on a few with a domain i-Evalue cut-off of 0.5. i-Evalue (inde- interesting RBPs. The first of them is an attempt to pro- pendent E-value) is the E-value that the sequence/profile vide a structural basis for the inactivity of the Ribonu- comparison would have received if this were the only clease PH (RNase PH) protein from E. coli strain K12, domain envelope found in it, excluding any others. This the second study deals with the structural modelling and is a stringent measure of how reliable this particular characterisation of RNA substrates of an ‘uncharac- domain may be. The independent E-value uses the total terised’ protein that is exclusively found in the patho- number of targets in the target database. We have now genic E. coli strains, whereas the third one involves the mentioned this definition in the revised manuscript. The analysis of pathogen-specific Cas6 proteins and compari- Pfam (v 28) domain architectures (DAs) were also resolved son with their non-pathogenic counterparts. at the same sequence E-value and domain i-Evalue cut-offs. Ghosh and Sowdhamini BMC Genomics (2017) 18:658 Page 3 of 23 Table 1 E. coli proteomes for comparative study. The 19 E. coli proteomes from UniProt (May 2016) used in the study for the comparison of RBPomes of pathogenic and non-pathogenic strains have been listed in this table. The pathogenic and the non- pathogenic E. coli strains have been represented in red and green fonts, respectively apoints to the differences among the strains at the genome level, considering the strain K12 (or lab strain) as the standard − Comparison of RNA-binding proteins across strains cut-off of 10 5.