<<

bioRxiv preprint doi: https://doi.org/10.1101/2020.04.03.024935; this version posted April 5, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

PresRAT: A server for identification of bacterial small-RNA sequences and their targets with probable binding region.

Krishna Kumar#1, Abhijit Chakraborty#1+, and Saikat Chakrabarti1*

1Structural Biology and Bioinformatics Division, CSIR-Indian Institute of Chemical Biology, Kolkata, India.

+ Current address: Division of Vaccine-Discovery, La Jolla Institute for Immunology, San Diego, California, USA.

# Equal authorship.

*Corresponding author.

Running title: Identification of bacterial sRNA and their targets.

Keyword: sRNA identification; sRNA target prediction; RNA structure; RNA- interaction; RNA-RNA interaction.

Dr. Saikat Chakrabarti Structural Biology and Bio-informatics Division CSIR-Indian Institute of Chemical Biology Contact: [email protected]; [email protected]

1 bioRxiv preprint doi: https://doi.org/10.1101/2020.04.03.024935; this version posted April 5, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

Abstract

Bacterial small-RNA (sRNA) sequences are functional , which play an important role in regulating the expression of a diverse class of genes. It is thus critical to identify such sRNA sequences and their probable mRNA targets. Here, we discuss new procedures to identify and characterize sRNA and their targets via the introduction of an integrated online platform named PresRAT (Prediction of sRNA and their Targets). PresRAT uses the primary and secondary structural attributes of sRNA sequences to predict sRNA from a given sequence or bacterial genome. PresRAT also finds probable target mRNAs of sRNA sequences from a given bacterial chromosome and further concentrates on identification of the probable sRNA-mRNA binding regions. Using PresRAT we have identified a total of 60,518 potential sRNA sequences from 54 bacterial genomes and 2447 potential targets from 13 bacterial genomes. We have also implemented a protocol to build and refine 3D models of sRNA and sRNA-mRNA duplex regions and generated 3D models of 50 known sRNAs and 81 sRNA-mRNA duplexes using this platform. Along with the server part, PresRAT also contains a database section, which enlists the predicted sRNA sequences, sRNA targets and their corresponding 3D models with structural dynamics information.

2 bioRxiv preprint doi: https://doi.org/10.1101/2020.04.03.024935; this version posted April 5, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

Introduction

small-RNA (sRNA) based regulation of mRNA sequences play a central role in various bacterial gene expressions in response to environmental changes1, 2. These sRNA sequences are extensively studied in context with different bacterial regulatory processes such as quorum sensing3, the formation of biofilms4, expression of diverse virulence factors5, iron uptake6, etc. In prokaryotic system, these functional RNA regulators exist as small transcripts about 30-600 long and modulate the translation and stability of mRNAs by binding with them7. Binding regions of these sRNA sequences with their cognate mRNAs are not fixed. Some sRNA sequences primarily bind at the binding site (RBS) of their target mRNA sequences1 while other sRNA sequences modulate translation of target genes by attaching at distant locations1, 7. The antisense mechanisms exhibited by the sRNA sequences to modulate their targets are the key to understand the function of these RNA regulators1, 2, 7. The significance of these sRNA sequences and their targeted genes in has driven the research community to identify and more of these sRNA sequences and their targets with respect to various biological functions. Different experimental methods such as RNA-seq, tiling arrays, sRNA pulse expression, immunoprecipitation are available to detect sRNA sequences and their targets but they are tedious and expensive and the outputs of these experiments also need exhaustive filtering8-10. Hence, computational approaches can be helpful in augmenting the experimental methods11 in the detection of sRNA sequences and their targets.

Comparative genomics is a standard computational practice to find sRNA sequences within a bacterial genome12. QRNA13 and Intergenic Sequence Inspector14 programs implemented a comparative genomics approach to find sRNAs. Programs like RNAz15 and sRNApredict16 use the conserved RNA secondary structure information to calculate the thermodynamic energy for sRNA prediction. GLASSgo17 combined an iterative BLAST strategy with pairwise identity filtering and structure-based clustering to find sRNA homologs from scratch. Some of the programs also utilize the promoter and terminator sequence signal annotations for sRNA prediction. sRNAscanner18 applies a position weighted matrix of sRNA specific transcriptional signals and

3 bioRxiv preprint doi: https://doi.org/10.1101/2020.04.03.024935; this version posted April 5, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

sRNAfinder19 utilizes a generalized probabilistic model to integrate primary sequence data, transcript expression data, microarray experiments and conserved RNA structure information to to predict sRNA genes. Diverse sets of methods also utilizes different machine learning approaches20,21,, antisense RNA transcript expression profiling22 and motif-based discovery23 algorithms to predict sRNA sequences.

Identification of sRNA target genes are also important and represents a challenging task. Experimental procedures including genetic screens24, knockouts studies25, microarray analysis26 and other systematic proteomic investigations27 are implemented to identify sRNA target genes. However, these methods are expensive, time-consuming and extremely tedious. Therefore, development of sensitive and accurate computational approaches would aid and complement the experimental analysis. The first step in sRNA target prediction is to search for a complementary region between the sRNA and its target mRNA sequence. Comparing to microRNA and its target mRNA interaction, which requires a generally conserved 6-8 long seed region for initiation of binding, the sRNA-mRNA binding region is relatively more extended and more evolutionary flexible28. Non-Watson-Crick base- pairing plays a key role in sRNA-mRNA target interaction and programs such as TargetRNA26, GUUGle29 utilises both Watson-Crick and non-Watson-Crick base pairing models to find sRNA targets and their interacting regions. RNAcofold30 computes the hybridization energy and base-pairing arrangement between a pair of RNA molecules, while RNAplex31 calculates the same hybridization energy using modified thermodynamic energy models. RNAup32 utilises the thermodynamics of RNA-RNA interactions and the hybridization energy while IntaRNA33 incorporates binding site accessibility information for searching the best interaction with minimum extended hybridization energy. SPOT34 incorporated existing computational tools to search for sRNA binding sites and utilises experimental data for filtering.

Despite the availability of various sRNA and its target prediction algorithms, the sensitivity remains quite low (20% - 49%)11. Hence, it is fair to assume that many more sRNA sequences and their target genes remain undiscovered, particularly in less studied prokaryotic organisms. Here, in this

4 bioRxiv preprint doi: https://doi.org/10.1101/2020.04.03.024935; this version posted April 5, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

article, we introduce an integrated platform with alternative approach to identify bacterial small-RNA and their target genes from a given nucleotide sequence and from a whole bacterial chromosome. This online platform, PresRAT (Prediction of sRNA and their Targets) integrates two separate but related procedures of sRNA and their target identification into a simple and effective single dais.

Apart from identifying the sRNA sequences and their targets, a key feature of this platform is to provide sRNA and sRNA-mRNA duplex structures generated via three dimensional (3D) modelling and further refined through molecular dynamics simulation. Predicting accurate 3D structure of RNA from the primary sequence is still a fundamental and challenging problem in computational biology35. There are few notable programs such as iFoldRNA36, FARNA/FARFAR37, 38, MC-Sym39, MANIP40, 3DRNA41, RNA2D3D42, NAST43 and ASSEMBLE44 try to tackle the existing bottlenecks of RNA structure prediction and improve the algorithms for RNA 3D structure generation. As per the latest RCSB protein databank45 statistics, there are 1528 experimentally solved 3D structures of RNA representing only ~0.01% of the whole deposited structures. So far there is only one experimentally solved 3D structures available for sRNA46 (N-terminal SL1 part of DsrA from E.coli) other than few sRNA conjugated with proteins47-51. PresRAT uses RNA2D3D42 program to generate an initial RNA model followed by extensive refinement of the structure using GROMACS v4.6.3 molecular dynamics simulation package52.

The validation and benchmarking results of the PresRAT for sRNA, sRNA- target, and sRNA-mRNA binding regions suggest improvements compared to those achieved by other available programs. In this current study we have studied 54 bacterial genomes to identify a total of 15,568 and 44,950 sRNA sequences from genic and non-genic regions of the respective genomes. We also have identified 2447 potential targets for 50 unique sRNA sequences from 13 bacterial genomes. Further, we have shown the structural details of 50 sRNA and 81 sRNA-mRNA duplex 3D models through molecular dynamics simulation studies. All these procedures and information are automated and

5 bioRxiv preprint doi: https://doi.org/10.1101/2020.04.03.024935; this version posted April 5, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

integrated in the PresRAT platform freely available at http://www.hpppi.iicb.res.in/presrat.

Results

PresRAT: The webserver interfaces PresRAT webserver and the database sections are built using CGI-Perl scripts. Integrated CGI-Perl scripts allow the users to utilize this webserver in an efficient way. The webserver is freely available at www.hpppi.iicb.res.in/presrat. The PresRAT webserver section is divided into four parts accomplishing their individual tasks. The first two parts are called PredsRNA and PredGsRNA. PredsRNA interface can be used to identify sRNA sequences one at a time while PredGsRNA interface directs the user to identify potential sRNA sequences from different bacterial sequences. The next two parts of the PresRAT webserver compromises sRNA target identification protocols. PredTar Interface helps to identify interacting nucleotides between a pair of sRNA and mRNA sequence and PredWG performs the function of sRNA target identification from a given bacterial genome. The following sections briefly describe the methodology and the results of each of these parts of the PresRAT server.

PredsRNA : Prediction of sRNA from its sequence and secondary structural attributes PredsRNA uses sequence and secondary structural information of existing sRNA and non-sRNA sequences to calculate a combined score to predict novel sRNA sequences. In sequence analysis, a directional (5'->3') dinucleotide score is calculated for the whole input sequence from Log Odds (LOD) score matrices of all possible combinations of directional dinucleotides,

which we denote as Sequencescore (see sRNA prediction protocol and Figure

S1 in Supplementary Information File). We also calculate a U-richscore that reflects the presence of uracil (U) nucleotides at the 3' end of the sRNA sequence. PresRAT effectivelyuses the secondary structure information by first calculating a set of local minima conformations of the provided RNA

6 bioRxiv preprint doi: https://doi.org/10.1101/2020.04.03.024935; this version posted April 5, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

sequenc. followed by estimating the two different scores namely ABEscore and

ALEscore. ABEscore and ALEscore indicate the Average Base Energy and Average Loop Energy of local minima secondary structure conformations. The

final combined score i.e. sRNAscore is calculated as follows:

.*- .*- .*- .*- .*-

The combined sRNAscore (details of the scoring schemes can be found in Supplementary Information File) is used as the final scoring scheme for filtering the true positive sRNA instances from the true negative ones.

PredsRNA page accepts query nucleotide sequence and based on its

sRNAscore and predicts whether the given sequence is likely to be sRNA or not above a user defined confidence level. The output page provides the

sRNAscore and its other component’s score along with a distribution of the score with respect to the known sRNA scores. PredsRNA result page also provides a minimum free energy secondary structure of the probable sRNA sequence. PredsRNA also provides a useful option of building the three- dimensional (3D) structural model of the predicted sRNA sequence via the PredsRNA3D module, which generates sRNA model using RNA2D3D42 and RNAfold53 programs, respectively. Following the generation of a 3D sRNA structure, a molecular dynamics (MD) simulation is performed on the RNA structure using GROMACS52 MD simulation package (details of the structural analysis can be found at Supplementary Information File). The initial and final structures are provided from MD simulation trajectory file along with other necessary information. For benchmarking and validation of sRNA prediction by PresRAT, a set of non-redundant 819 sRNA (Additional file 1), 31,403 genic, and 28,733 non- genic non-sRNA sequences were randomly collected from 54 different bacterial species (Supplementary Information File and Additional file 1). Receiver operating characteristic (ROC) statistics were calculated by

comparing the sRNA to non-sRNA sRNAscore in a 1:5 ratio via 1000 fold cross- validation protocol. In each cross-validation step sRNA and non-sRNA

7 bioRxiv preprint doi: https://doi.org/10.1101/2020.04.03.024935; this version posted April 5, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

sequences were randomly collected to calculate the sensitivity and error

rates. The ROC statistics generated using the sRNAscore distribution of sRNA and non-sRNA sequences demonstrate sensitivity values of 13.2%, 28.0%, 39.0% and 50.1% at 1.0% 5.0%, 10.0%, and 20.0% error rates, against non- sRNA sequences collected from genic regions whereas sensitivities of 6.3%, 18.4%, 31.2%, and 48.0% were obtained at 1.0% 5.0%, 10.0%, and 20.0% error rates against non-sRNA sequences collected from non-genic parts of the bacterial chromosomes (Figure 1A), respectively. A comparative study against RNAz15 program with PresRAT showed slightly better or comparable performance in predicting sRNA sequences with respect to genic non-sRNA sequences (Figure 1B) while PresRAT showed much better performance in sRNA prediction with respect to non-genic non-sRNA sequences (Figure 1C). However, it is also important to mention that RNAz requires evolutionary information from homologous sequences to predict functional RNA sequences while PresRAT does not require homologous sequences for sRNA prediction. However, to make comparison meaningful between the two methods, comparative analysis was performed on a limited set of commonly predicted sRNA (199 sequences), genic (5193) and non-genic (4218) non-sRNA sequences as RNAz15 was restricted only to the mentioned number of sRNA and non-sRNA sequences having homologous sequences in same or other bacterial chromosomes.

PredGsRNA : Prediction of sRNA sequences within bacterial chromosome PredGsRNA aims to predict potential small-RNA (sRNA) sequences within a bacterial genome. The result page of PredGsRNA contains the sRNA sequences predicted from the genic sequences and non-genic sequences of the bacterial genome, respectively. The corresponding result page of this program consists of top 100 predicted sRNA sequences along with their start

and end loci, sRNAscore and P-values. Similarly, a pictorial representation of the predicted sRNA sequences on a genome browser view is also provided for better visualization (Please refer the supplementary material for further details of this approach).

8 bioRxiv preprint doi: https://doi.org/10.1101/2020.04.03.024935; this version posted April 5, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

PredGsRNA was used to predict new probable sRNA sequences from the 54 bacterial genomes (Additional file 2) to identify a total of 15,568 and 44,950 hypothetical sRNA sequences from genic and non-genic parts of the bacterial genomes (Additional file 3). sRNAscanner18, a separate program for bacterial chromosomal sRNA detection was used for comparative analysis. sRNAscanner uses a transcriptional signal based computational method for sRNA detection and detected only 55 unique sRNA sequences while our new protocol identified a total of 205 sRNA sequences out of 819 sequences (Additional file 4). Our protocol demonstrates moderate improvement and provides an alternative approach in identifying sRNA sequences from bacterial genomes.

PredTAR : Prediction of sRNA-mRNA target binding regions PredTAR aims to predict potential binding regions between bacterial small- RNA (sRNA) and the target mRNA of that sRNA. The output page provides the alignment of the given sRNA and mRNA sequences along with scores and duplex energy values. PresRAT uses combines two different approaches to find a sRNA-mRNA target binding region. The first approach (Component I) uses the base un-pairing probability of both the sRNA and its target mRAN sequence (Figure 2A) and the subsequent approach (Component II) (Figure 2B) utilizes the local minima energy structures of sRNA and the minimum free energy structure of the target mRNA to find the potential binding region (Figure S2 in Supplementary Information File for details). PredTAR also provides a useful option of building the three-dimensional structural (3D) model of the sRNA-mRNA bound duplex region via the Preddup3D module, which generates the duplex model using RNA2D3D42 program for the secondary structure information followed by molecular dynamics (MD) simulation. Details of the structural analysis can be found at Supplementary Information File. The Initial and final structures are provided from MD simulation trajectory file along with other necessary information. Experimentally known binding regions in 88 out of 91 sRNA and their respective target mRNA sequences (Additional file 5 and Additional file 6) were used to validate the Component I and II algorithms. The selection of these 88 binding regions was done based on the frequency distribution of the

9 bioRxiv preprint doi: https://doi.org/10.1101/2020.04.03.024935; this version posted April 5, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

sRNA binding site on mRNA sequences relative to its translation initiation site (Figure 3A). The frequency distribution of the 91 sRNA binding sites on mRNA sequences reveals that in 97% of the cases the sRNA binding site is located within the -250 to +100 window region of a target mRNA (“0” being the translation start site). Component, I and II algorithms, could correctly identify the binding region for 49% and 67% of the true sRNA-mRNA pair, respectively within the above mentioned window region. The Figure 3B shows a Venn diagram illustrating the distribution of correctly predicted sRNA-target binding region between the two algorithms Component I and II. 74% (65 out of 88) sensitivity was achieved when the binding region with lowest free energy of sRNA-mRNA binding54, 55 predicted by either of the algorithms was considered. 82% (72 out of 88) sensitivity (Figure 3C) was achieved when both the common and uniquely predicted binding regions were considered. A comparative analysis (Figure 3C) with other published methods shows much better performance of PresRAT algorithms in predicting sRNA-mRNA targets (a complete list of predicted sRNA-mRNA binding regions is provided in Additional file 7).

PredWG : Prediction of sRNA target genes and binding regions within a bacterial genome PredWG aims to predict potential bacterial small-RNA (sRNA) targets within a bacterial genome. The result page of PredWG contains the list of the predicted target genes along with individual alignment and the corresponding duplex energy value of the binding region. Similarly, a pictorial representation of the predicted sRNA sequences on a genome browser view is also provided for better visualization.

Our benchmarking exercise considering 91 known sRNA-mRNA target pairs, PresRAT achieved 79% sensitivity in rightly predicting the target genes (within the top 100 predictions/genome) within bacterial genomes compared to similar programs, like IntaRNA33, RNAPredator56, TargetRNA257 and RNAplex31, which achieved sensitivity values of 38%, 23%, 22% and 14%, respectively (Figure 3D) (complete list of targets found by the various programs are provided in Additional files 6 and 7). For a fair comparison, a

10 bioRxiv preprint doi: https://doi.org/10.1101/2020.04.03.024935; this version posted April 5, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

similar search window (-250 to +100 of mRNA translation start site) length was maintained for all the programs. Within the top 100 predictions for individual sRNA sequences, we have predicted 2447 novel mRNA targets with a similar binding energy distribution profile (p values < 0.5) (Figure S3 in Supplementary Information File) as that of the known 91 sRNA-mRNA interactions. The complete list of the identified target mRNAs for 50 sRNA sequences along with their corresponding functions is provided in the Additional file 8. Gene Ontology (GO)59 based molecular functional network of E. coli known sRNA sequences (18 sequences) with their known (27 unique targets) and predicted mRNA target sequences (192) (Figure S4A in Supplementary Information File) suggests involvement of similar functions such as binding, catalysis and transporter activities (Figure S4B in Supplementary Information File). Additionally, a smaller subset of the predicted target genes also offers unique molecular functions like electron transport, guanyl-nucleotide exchange and other regulator activities (Figure S4B in Supplementary Information File).

PresRAT: The database interface PresRAT database section enlists the predicted sRNA sequences, sRNA targets, and the corresponding 3D models of selected sRNAs and sRNA- mRNA duplex structures. Predicted small-RNA sequences from 54 bacterial genomes are accessible to users from a drop-down selection menu provided in the database page. Similarly, the list of predicted target genes for a given sRNA within a bacterial genome is readily accessible in a similar manner. The database also harbors mRNA targets of 50 known sRNA sequences from 13 bacterial genomes along with their details.

Exhaustive structural analysis was performed to predict the 3D structures of the known sRNA sequences and their corresponding binding regions with the target genes. Three-dimensional modelling of sRNA sequences and the target binding regions using their current secondary structure information provides the opportunity to gain structural insight, which in turn may become quite valuable in RNA based therapeutics development. There are 50 sRNA models in PresRAT database. All the sRNA and 81 unique sRNA-mRNA bound

11 bioRxiv preprint doi: https://doi.org/10.1101/2020.04.03.024935; this version posted April 5, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

duplex models are explicitly refined using GROMACS package through one- nanosecond of molecular simulation runs. The final models with other structural detailsare available to users in the database section. Further methodical details and validation results can be found in Supplementary Information File.

Discussion

Here, we introduce a web-based platform PresRAT to identify and analyze bacterial small-RNAs using both individual and genomic sequence level searches. The platform integrates a combination of scoring schemes developed based on primary and secondary structural attributes of small- RNAs. Using PresRAT we have scanned a total of 54 bacterial genomes to identify 15,568 and 44,950 potential sRNA sequences (Supplemental material 3). One of the key features of this platform is the generation of sRNA three- dimensional models and their dynamic analysis. For in-depth structural analysis, availability of energy minimized and refined three-dimensional models of sRNAs are essential. PresRAT provides an option for automated model building and molecular dynamics simulation of sRNA in presence of water. Using this automated procedure we have successfully built 50 different small-RNA structures and analyzed some of their basic properties calculated from their dynamics data. Comparison of our server derived model of DsrA and the NMR structure of the SL1 of the DsrA (pdb code: 5WQ1)46 shows high similarities (RMSD: 2.75Å) between the structures, indicating reliability of our 3D model (Figure S7 in Supplementary Information File). However, the NMR structure covers only a short fraction of the DsrA sequence.

PresRAT enables a user to search for probable target mRNAs of sRNA sequences from a given bacterial chromosome and further concentrate on identification of the probable sRNA-mRNA binding regions. For identification of target binding regions, PresRAT uses base un-pairing probability information of sRNA-mRNA sequences and local minima of sRNA secondary structures. Both the approaches combined with other additional steps provide a simple but effective way of identification of target mRNA sequences from

12 bioRxiv preprint doi: https://doi.org/10.1101/2020.04.03.024935; this version posted April 5, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

the whole bacterial chromosomes. In this way we have scanned 13 bacterial genomes and identified 2447 novel mRNA target sequences for an existing 50 sRNA sequences (Additional file 7). The identified target mRNAs for a given sRNA are provided in the database section of PresRAT along with their interacting nucleotide region and their related functional information. Existing model generation and molecular dynamics simulation procedure are extended to build the energy minimized and refined three-dimensional models of the sRNA-mRNA duplex regions, as well their structural characterization were also performed. Our study revealed that in general these duplex regions are thermodynamically unfavourable, which indicates the involvement of other critical factors for generation of stable sRNA-mRNA complexes.

Acknowledgement

SC acknowledges CSIR-IICB for infrastructural support, CSIR Network Project (BSC-0121), Systems Medicine Cluster (SyMeC) grant (GAP357) from Department of Biotechnology (DBT), High Risk High Reward Grant (HRR/2016/000093) from Department of and Technology (DST) for financial supports.

Additional Information

Author Contributions K.K worked on the web server and drafted the manuscript. A.C. collected and organized the data, developed the program, webserver and drafted the manuscript. S. C. drafted the manuscript and coordinated the project. All authors read and approved the final manuscript.

Competing financial interests

The author(s) declare no competing financial interests.

13 bioRxiv preprint doi: https://doi.org/10.1101/2020.04.03.024935; this version posted April 5, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

Figure legends

Figure 1. Panel A provides the ROC statistics of PresRAT program on the full dataset. Panel B and C show the ROC statistics perform on the limited set of commonly predicted sRNA and non-sRNA sequences.

Figure 2. Panel A demonstrates the sRNA-mRNA target binding region identification protocol (Component I) of PresRAT. The main aim of this protocol is to identify the un-pairing probability of each base from both sRNA and mRNA sequences and utilise this probability information to identify the binding region. Panel B describes the Component II protocol used by PresRAT to identify sRNA-mRNA target binding region. In this protocol energetic value of both sRNA and mRNA secondary structures are taken into account to identify binding regions.

Figure 3. Panel A shows the frequency distribution of sRNA binding sites in the 91 mRNA sequences. In 98% of the cases it is found that the sRNA binding site is present within the window of -250 to +100 nucleotides (0 being the translation initiation site) of the target gene. Panel B shows the overlap of correctly predicted target by the PresRAT Component I and II protocols, respectively. Panel C shows the PresRAT sensitivity in predicting the correct sRNA-mRNA target binding regions in 88 samples compared to other programs. Panel D shows the percentage of true sRNA target gene identification through whole-genome search by PresRAT compared to other existing programs.

14 bioRxiv preprint doi: https://doi.org/10.1101/2020.04.03.024935; this version posted April 5, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

References

1 Storz, G., Vogel, J. & Wassarman, K. M. Regulation by small RNAs in bacteria: expanding frontiers. Mol Cell 43, 880-891, doi:10.1016/j.molcel.2011.08.022 (2011). 2 Modi, S. R., Camacho, D. M., Kohanski, M. A., Walker, G. C. & Collins, J. J. Functional characterization of bacterial sRNAs using a network biology approach. Proc Natl Acad Sci U S A 108, 15522-15527, doi:10.1073/pnas.1104318108 (2011). 3 Liu, X., Zhou, P. & Wang, R. Small RNA-mediated switch-like regulation in bacterial . IET Syst Biol 7, 182-187, doi:10.1049/iet- syb.2012.0059 (2013). 4 Mika, F. & Hengge, R. Small Regulatory RNAs in the Control of Motility and Formation in E. coli and . Int J Mol Sci 14, 4560- 4579, doi:10.3390/ijms14034560 (2013). 5 Geissmann, T. et al. Regulatory RNAs as mediators of virulence in bacteria. Handb Exp Pharmacol, 9-43 (2006). 6 Payne, S. M. et al. Iron and pathogenesis of Shigella: iron acquisition in the intracellular environment. Biometals 19, 173-180, doi:10.1007/s10534-005-4577-x (2006). 7 Waters, L. S. & Storz, G. Regulatory RNAs in bacteria. Cell 136, 615- 628, doi:10.1016/j.cell.2009.01.043 (2009). 8 Croucher, N. J. & Thomson, N. R. Studying bacterial transcriptomes using RNA-seq. Curr Opin Microbiol 13, 619-624, doi:10.1016/j.mib.2010.09.009 (2010). 9 Li, W., Ying, X., Lu, Q. & Chen, L. Predicting sRNAs and their targets in bacteria. Genomics Proteomics Bioinformatics 10, 276-284, doi:10.1016/j.gpb.2012.09.004 (2012). 10 Sharma, C. M. & Vogel, J. Experimental approaches for the discovery and characterization of regulatory small RNA. Curr Opin Microbiol 12, 536-546, doi:10.1016/j.mib.2009.07.006 (2009). 11 Lu, X., Goodrich-Blair, H. & Tjaden, B. Assessing computational tools for the discovery of small RNA genes in bacteria. RNA 17, 1635-1647, doi:10.1261/.2689811 (2011). 12 Sridhar, J. & Gunasekaran, P. Computational small RNA prediction in

15 bioRxiv preprint doi: https://doi.org/10.1101/2020.04.03.024935; this version posted April 5, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

bacteria. Bioinform Biol Insights 7, 83-95, doi:10.4137/BBI.S11213 (2013). 13 Rivas, E., Klein, R. J., Jones, T. A. & Eddy, S. R. Computational identification of noncoding RNAs in E. coli by comparative genomics. Curr Biol 11, 1369-1373 (2001). 14 Pichon, C. & Felden, B. Intergenic sequence inspector: searching and identifying bacterial RNAs. Bioinformatics 19, 1707-1709 (2003). 15 Washietl, S., Hofacker, I. L. & Stadler, P. F. Fast and reliable prediction of noncoding RNAs. Proc Natl Acad Sci U S A 102, 2454-2459, doi:10.1073/pnas.0409169102 (2005). 16 Livny, J., Fogel, M. A., Davis, B. M. & Waldor, M. K. sRNAPredict: an integrative computational approach to identify sRNAs in bacterial genomes. Nucleic Acids Res 33, 4096-4105, doi:10.1093/nar/gki715 (2005). 17 Steffen C. Lott, Richard A Schäfer, Martin Mann, Rolf Backofen, Wolfgang R Hess, Bjoern Voss, Jens Georg. GLASSgo - Automated and reliable detection of sRNA homologs from a single input sequences Frontiers in Genetics, 9, 124, 2018. 18 Sridhar, J. et al. sRNAscanner: a computational tool for intergenic small RNA detection in bacterial genomes. PLoS One 5, e11970, doi:10.1371/journal.pone.0011970 (2010). 19 Tjaden, B. Prediction of small, noncoding RNAs in bacteria using heterogeneous data. J Math Biol 56, 183-200, doi:10.1007/s00285-007- 0079-5 (2008). 20 Carter, R. J., Dubchak, I. & Holbrook, S. R. A computational approach to identify genes for functional RNAs in genomic sequences. Nucleic Acids Res 29, 3928-3938 (2001). 21 Saetrom, P. et al. Predicting non-coding RNA genes in with boosted genetic programming. Nucleic Acids Res 33, 3263-3270, doi:10.1093/nar/gki644 (2005). 22 Herbig, A. & Nieselt, K. nocoRNAc: characterization of non-coding RNAs in prokaryotes. BMC Bioinformatics 12, 40, doi:10.1186/1471- 2105-12-40 (2011). 23 Salari, R. et al. smyRNA: a novel Ab initio ncRNA gene finder. PLoS One 4, e5433, doi:10.1371/journal.pone.0005433 (2009). 24 Mandin, P. Genetic screens to identify bacterial sRNA regulators. Methods Mol Biol 905, 41-60, doi:10.1007/978-1-61779-949-5_4 (2012). 25 Harris, J. F., Micheva-Viteva, S., Li, N. & Hong-Geller, E. Small RNA- mediated regulation of host-pathogen interactions. Virulence 4, 785- 795, doi:10.4161/viru.26119 (2013).

16 bioRxiv preprint doi: https://doi.org/10.1101/2020.04.03.024935; this version posted April 5, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

26 Tjaden, B. et al. Target prediction for small, noncoding RNAs in bacteria. Nucleic Acids Res 34, 2791-2802, doi:10.1093/nar/gkl356 (2006). 27 Ramachandran, R. & Stevens, A. M. Proteomic analysis of the quorum- sensing regulon in Pantoea stewartii and identification of direct targets of EsaR. Appl Environ Microbiol 79, 6244-6252, doi:10.1128/AEM.01744-13 (2013). 28 Backofen, R. & Hess, W. R. Computational prediction of sRNAs and their targets in bacteria. RNA Biol 7, 33-42 (2010).

29 Gerlach, W. & Giegerich, R. GUUGle: a utility for fast exact matching under RNA complementary rules including G-U base pairing. Bioinformatics 22, 762-764, doi:10.1093/bioinformatics/btk041 (2006). 30 Bernhart, S. H. et al. Partition function and base pairing probabilities of RNA heterodimers. Algorithms Mol Biol 1, 3, doi:10.1186/1748-7188-1- 3 (2006). 31 Tafer, H. & Hofacker, I. L. RNAplex: a fast tool for RNA-RNA interaction search. Bioinformatics 24, 2657-2663, doi:10.1093/bioinformatics/btn193 (2008). 32 Muckstein, U. et al. Thermodynamics of RNA-RNA binding. Bioinformatics 22, 1177-1182, doi:10.1093/bioinformatics/btl024 (2006). 33 Busch, A., Richter, A. S. & Backofen, R. IntaRNA: efficient prediction of bacterial sRNA targets incorporating target site accessibility and seed regions. Bioinformatics 24, 2849-2856, doi:10.1093/bioinformatics/btn544 (2008). 34 King AM, Vanderpool CK, Degnan PH. 2019. sRNA target prediction organizing tool (SPOT) integrates computational and experimental data to facilitate functional characterization of bacterial small RNAs. mSphere 4:e00561-18. doi:10.1128/mSphere.00561-18. 35 Sripakdeevong, P., Beauchamp, K. & Das, R. Why can’t we predict RNA structure at atomic resolution? Nucleic Acids and Molecular Biology 27, 43-65 (2011). 36 Sharma, S., Ding, F. & Dokholyan, N. V. iFoldRNA: three-dimensional RNA structure prediction and folding. Bioinformatics 24, 1951-1952, doi:10.1093/bioinformatics/btn328 (2008). 37 Das, R. & Baker, D. Automated de novo prediction of native-like RNA tertiary structures. Proc Natl Acad Sci U S A 104, 14664-14669, doi:10.1073/pnas.0703836104 (2007). 38 Das, R., Karanicolas, J. & Baker, D. Atomic accuracy in predicting and designing noncanonical RNA structure. Nat Methods 7, 291-294, doi:10.1038/nmeth.1433 (2010). 39 Parisien, M. & Major, F. The MC-Fold and MC-Sym pipeline infers RNA

17 bioRxiv preprint doi: https://doi.org/10.1101/2020.04.03.024935; this version posted April 5, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

structure from sequence data. Nature 452, 51-55, doi:10.1038/nature06684 (2008). 40 Massire, C. & Westhof, E. MANIP: an interactive tool for modelling RNA. J Mol Graph Model 16, 197-205, 255-197 (1998). 41 Zhao, Y. et al. Automated and fast building of three-dimensional RNA structures. Sci Rep 2, 734, doi:10.1038/srep00734 (2012).

42 Martinez, H. M., Maizel, J. V., Jr. & Shapiro, B. A. RNA2D3D: a program for generating, viewing, and comparing 3-dimensional models of RNA. J Biomol Struct Dyn 25, 669-683, doi:10.1080/07391102.2008.10531240 (2008). 43 Jonikas, M. A. et al. Coarse-grained modeling of large RNA molecules with knowledge-based potentials and structural filters. RNA 15, 189- 199, doi:10.1261/rna.1270809 (2009). 44 Jossinet, F., Ludwig, T. E. & Westhof, E. Assemble: an interactive graphical tool to analyze and build RNA architectures at the 2D and 3D levels. Bioinformatics 26, 2057-2059, doi:10.1093/bioinformatics/btq321 (2010). 45 Rose, P. W. et al. The RCSB Protein Data Bank: new resources for research and education. Nucleic Acids Res 41, D475-482, doi:10.1093/nar/gks1200 (2013). 46 Wu, P., Liu, X., Yang, L., Sun, Y., Gong, Q., Wu, J. & Shi, Y. The important conformational plasticity of DsrA sRNA for adapting multiple target regulation. Nucleic Acids Res 45, 9625-9639, doi: 10.1093/nar/gkx570 (2017). 47 Morris, E. R. et al. Structural rearrangement in an RsmA/CsrA ortholog of creates a dimeric RNA-binding protein, RsmN. Structure 21, 1659-1671, doi:10.1016/j.str.2013.07.007 (2013). 48 Sauer, E. Structure and RNA-binding properties of the bacterial LSm protein Hfq. RNA Biol 10, 610-618, doi:10.4161/rna.24201 (2013).

49 Wang, W., Wang, L., Wu, J., Gong, Q. & Shi, Y. Hfq-bridged ternary complex is important for translation activation of rpoS by DsrA. Nucleic Acids Res 41, 5938-5948, doi:10.1093/nar/gkt276 (2013). 50 Sauer, E. & Weichenrieder, O. Structural basis for RNA 3'-end recognition by Hfq. Proc Natl Acad Sci U S A 108, 13065-13070, doi:10.1073/pnas.1103420108 (2011). 51 Link, T. M., Valentin-Hansen, P. & Brennan, R. G. Structure of Escherichia coli Hfq bound to polyriboadenylate RNA. Proc Natl Acad Sci U S A 106, 19292-19297, doi:10.1073/pnas.0908744106 (2009). 52 Van Der Spoel, D. et al. GROMACS: fast, flexible, and free. J Comput Chem 26, 1701-1718, doi:10.1002/jcc.20291 (2005).

18 bioRxiv preprint doi: https://doi.org/10.1101/2020.04.03.024935; this version posted April 5, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

53 http://rna.tbi.univie.ac.at/cgi-bin/RNAWebSuite/RNAfold.cgi. 54 Lorenz, R. et al. ViennaRNA Package 2.0. Algorithms Mol Biol 6, 26, doi:10.1186/1748-7188-6-26 (2011). 55 Turner, D. H. & Mathews, D. H. NNDB: the nearest neighbor parameter database for predicting stability of nucleic acid secondary structure. Nucleic Acids Res 38, D280-282, doi:10.1093/nar/gkp892 (2010). 56 Eggenhofer, F., Tafer, H., Stadler, P. F. & Hofacker, I. L. RNApredator: fast accessibility-based prediction of sRNA targets. Nucleic Acids Res 39, W149-154, doi:10.1093/nar/gkr467 (2011). 57 Kery, M. B., Feldman, M., Livny, J. & Tjaden, B. TargetRNA2: identifying targets of small regulatory RNAs in bacteria. Nucleic Acids Res 42, W124-129, doi:10.1093/nar/gku317 (2014). 58 Harris, M. A. et al. The Gene Ontology (GO) database and informatics resource. Nucleic Acids Res 32, D258-261, doi:10.1093/nar/gkh036 (2004).

19 bioRxiv preprint doi: https://doi.org/10.1101/2020.04.03.024935; this version posted April 5, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. bioRxiv preprint doi: https://doi.org/10.1101/2020.04.03.024935; this version posted April 5, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. bioRxiv preprint doi: https://doi.org/10.1101/2020.04.03.024935; this version posted April 5, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.