Computational Biology and Chemistry 84 (2020) 107171


Research Article

Discovering protein-binding RNA motifs with a generative model of RNA sequences

Byungkyu Park, Kyungsook Han*

Department of Computer Engineering, Inha University, 22212 Incheon, South Korea

ARTICLE INFO

Keywords: Protein-RNA interaction; Binding motif; Generator; Long short-term memory network

ABSTRACT

Recent advances in high-throughput experimental technologies have generated a huge amount of data on interactions between proteins and nucleic acids. Motivated by the big experimental data, several computational methods have been developed either to predict binding sites in a sequence or to determine if an interaction exists between protein and nucleic acid sequences. However, most of the methods cannot be used to discover new nucleic acid sequences that bind to a target protein because they are classifiers rather than generators. In this paper we propose a generative model for constructing protein-binding RNA sequences and motifs using a long short-term memory (LSTM) neural network. Testing the model for several target proteins showed that RNA sequences generated by the model have high binding affinity and specificity for their target proteins and that the protein-binding motifs derived from the generated RNA sequences are comparable to the motifs from experimentally validated protein-binding RNA sequences. The results are promising and we believe this approach will help design more efficient in vitro or in vivo experiments by suggesting potential RNA aptamers for a target protein.

⁎ Corresponding author. E-mail address: [email protected] (K. Han).
https://doi.org/10.1016/j.compbiolchem.2019.107171
Received 17 June 2019; Received in revised form 19 October 2019; Accepted 19 November 2019; Available online 07 January 2020
1476-9271/ © 2019 The Authors. Published by Elsevier Ltd. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/BY-NC-ND/4.0/).

1. Introduction

Interactions between proteins and nucleic acids are involved in many biological processes such as transcription, RNA processing, and translation. A variety of in vitro and in vivo experimental methods have been developed to study interactions between proteins and nucleic acids, and the past decade has witnessed a large amount of data generated by the experimental methods. The experimental data have triggered the development of computational methods to predict binding sites in a sequence (Choi et al., 2017; Tuvshinjargal et al., 2016; Walia et al., 2014) or to determine if an interaction exists between a pair of sequences (Akbaripour-Elahabad et al., 2016; Alipanahi et al., 2015; Zhang and Liu, 2017). However, most of the computational methods are classifiers rather than generators, so they cannot be used to discover new protein-binding nucleic acid sequences.

Among the computational methods, neural networks have shown a certain degree of success in predicting the interactions between proteins and nucleic acids. DeepBind (Alipanahi et al., 2015), for example, is a convolutional neural network trained on a large amount of data from RNAcompete experiments (Ray et al., 2009, 2013). For the problem of predicting protein-binding sites of nucleic acid sequences, DeepBind contains hundreds of distinct prediction models, each for a different target protein. What DeepBind predicts is a binding score rather than protein-binding sites in the input nucleic acid sequence. Nonetheless, the binding score provided by DeepBind is informative, so we used the binding score in our study to estimate the affinity and specificity of nucleic acid sequences for a target protein.

A later model known as DeeperBind (Hassanzadeh and Wang, 2016) is a long short-term recurrent convolutional network for predicting the protein-binding specificity of DNA sequences. DeeperBind showed a better performance than DeepBind for some proteins, but its use is limited to the datasets from protein-binding microarrays. Both DeepBind and DeeperBind are classifiers rather than generators, and are not intended for finding protein-binding motifs or for constructing protein-binding nucleic acid sequences.

A more recent model called iDeep (Pan and Shen, 2017) uses a convolutional neural network and deep belief network to predict the RBP binding sites and motifs in RNAs. Five features (region type, clip-cobinding, structure, motif, CNN sequence) were used to train iDeep. In testing on 31 human RBPs, iDeep showed a better performance than DeepBind. However, iDeep has several drawbacks compared to DeepBind. First, iDeep contains only 31 human RBP models whereas DeepBind has 144 human RBP models. Second, iDeep requires input data with the 5 features (region type, clip-cobinding, structure, motif, CNN sequence) derived by iONMF (Stražar et al., 2016) whereas DeepBind takes only a sequence as input. Thus, if data generated by iONMF is not available for a target protein, iDeep cannot be used.

In this paper we propose a new approach to finding protein-binding RNA motifs with a generative model of protein-binding RNA sequences using a long short-term memory (LSTM) network. As an extension of a recurrent neural network (RNN), the LSTM network alleviates the vanishing gradient problem of RNN by introducing a gating mechanism (Hochreiter and Schmidhuber, 1997). LSTM networks are often used in deep learning and show good performance in speech recognition (Graves et al., 2013) and language translation (Sutskever et al., 2014). To the best of our knowledge, our generator is the first attempt to construct protein-binding RNA sequences using an LSTM network. The rest of the paper describes the architecture of the generator and discusses the protein-binding RNA sequences generated by the model and their motifs for several RNA-binding proteins (RBPs).

2. Method

2.1. Generator model of RNA sequences

For protein-binding RNA sequences, a generator model was implemented using char-rnn (https://github.com/karpathy/char-rnn). The model is composed of four layers of LSTM with 512 hidden neurons (Fig. 1). Given an RNA sequence, it reads one nucleotide of the input sequence at a time and predicts the next nucleotide in the sequence. The parameters of the model are updated by the difference between predicted and target nucleotides.

To train the model, we obtained a set of 213,300 unique RNA sequences (GEO: GSE15769), which were identified as binding sequences to RBPs by a custom Agilent 244K microarray experiment known as RNAcompete (Ray et al., 2009). Among the nine RBPs of RNAcompete, we excluded one yeast RBP and used the remaining eight human RBPs as target proteins in our study. Since the RBP-binding RNA sequences were 29–38 nucleotides (nt) long, the length of RNA sequences generated by our model was set to 40 nt.

The generator model was trained in the following way (Eqs. (1)–(5)). Let $x_t$ be a 4-bit vector representing the $t$th nucleotide ($n$) in the input sequence (Supplementary data 1). Only one element of $x_t$ is 1 and the others are 0 (one hot encoding). $y_t$ is a class indicator representing a target nucleotide. The LSTM calculates $z_t$ for $x_t$. Softmax changes $z_t$ to a vector of values between 0 and 1 that sum to 1 (one hot decoding). When training the model, the parameters of the neurons in the LSTM layers were updated using the loss calculated from the difference between the target nucleotide and predicted nucleotide. The loss function, defined by Eq. (5), is the mean of the negative log-likelihood of the prediction:

$$x_t = \text{number for an input nucleotide} \tag{1}$$

$$y_t = \text{number for a target nucleotide} \tag{2}$$

$$z_t = \mathrm{LSTM}(x_t) \tag{3}$$

$$y_t^{n} = \mathrm{softmax}(z_t) = \Pr(x_{t+1} = n \mid y_t) \tag{4}$$

$$\mathrm{loss} = -\sum \left(\log \Pr(x_{t+1} = n \mid y_t)\right) \tag{5}$$

Fig. 2 shows the loss of the model during the first 50 epochs of training for eight target proteins HuR, SLM2, PTB, RBM4, U1A, SF2/ASF, FUSIP1, and YB1. The loss tended to decrease after a certain point as the model was trained longer, but the decreasing trend was not monotonic. We selected a generator model with the minimum loss value.

2.2. Definition of binding affinity and specificity

To assess the RNA sequences generated by our model, we defined the binding affinity and specificity of the sequences using the predictive binding score of DeepBind (Alipanahi et al., 2015). One problem with the DeepBind score is that the scale of the scores is arbitrary, thus the scores from different DeepBind models are not directly comparable.

To make DeepBind scores comparable, we defined the binding affinity (AF) of an RNA sequence s to a target protein p as the probability that s has a larger DeepBind score than a random sequence. To obtain an approximate value of the probability, we ran DeepBind on 200,000 random RNA sequences of 40 nucleotides and computed their binding affinity defined by Eq. (6) (see Fig. 3 for an illustration of deriving the binding affinity). Since the binding affinity of a sequence is a probability that the sequence has a larger DeepBind score than a random sequence, it is guaranteed to be in the range of [0, 1]. In the following equation, $\mathrm{Score}_m(s)$ denotes the score of sequence s computed by DeepBind model m, $x_i$ is the DeepBind score of the $i$th random sequence, and $n$ is the number of random sequences:

$$\mathrm{AF}_p(s) = \frac{1}{n} \sum_{i=1}^{n} \delta\left(x_i \le \mathrm{Score}_m(s)\right) \tag{6}$$

where $\delta(A) = 1$ if an event A occurs; $\delta(A) = 0$ otherwise.

The binding specificity (SP) of RNA sequence s to protein p was defined by Eq. (7). In the equation, M is a set of all generator models trained on data from the same type of experiment as m:

$$M_c = M - \{m\}$$

$$\mathrm{SP}_p(s) = \mathrm{AF}_p(s) - \frac{1}{|M_c|} \sum_{k \in M_c} \mathrm{AF}_k(s) \tag{7}$$
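The two definitions in Section 2.2 can be illustrated with a minimal Python sketch. It assumes that the DeepBind scores have already been computed elsewhere and are passed in as plain numbers; the function names `binding_affinity` and `binding_specificity`, the dictionary layout, and the toy values are hypothetical and are not taken from the authors' code.

```python
import numpy as np


def binding_affinity(score, random_scores_sorted):
    """Eq. (6): fraction of random-sequence scores x_i with x_i <= score.

    `random_scores_sorted` must be sorted in ascending order.
    """
    # side="right" returns the number of elements that are <= score.
    n_le = np.searchsorted(random_scores_sorted, score, side="right")
    return float(n_le) / len(random_scores_sorted)


def binding_specificity(af_by_model, target):
    """Eq. (7): AF for the target model minus the mean AF over the other models."""
    others = [af for name, af in af_by_model.items() if name != target]
    return af_by_model[target] - sum(others) / len(others)


# Toy reference scores standing in for the 200,000 random-sequence scores.
reference = np.sort(np.array([0.1, 0.4, 0.4, 0.9, 1.5]))
af_example = binding_affinity(0.4, reference)  # three of five scores are <= 0.4

# Toy AF values of one sequence under three models (hypothetical numbers).
sp_example = binding_specificity({"HuR": 0.75, "PTB": 0.5, "U1A": 0.25}, "HuR")
```

Sorting the reference scores once and using a binary search keeps each AF lookup cheap, which matters when thousands of generated sequences are scored against the same reference distribution.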

Fig. 1. Framework for generating protein-binding RNA sequences and finding protein-binding motifs in RNAs. Left: The generator model of protein-binding RNA sequences. The parameters of LSTM layers are updated using the loss function defined by Eq. (5). Right: Deriving protein-binding motifs from the generated RNA sequences.


Fig. 2. The loss of the model during the first 50 epochs of training for eight proteins with names in parentheses. The symbol ‘x’ represents the minimum loss point.
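The loss plotted in Fig. 2 is the negative log-likelihood of the next nucleotide (Eqs. (1)–(5)). The sketch below shows only the encoding, softmax, and loss steps in NumPy; it does not implement the LSTM, so the array `logits` is a stand-in for the LSTM outputs $z_t$. The nucleotide ordering `ACGU` and all function names are assumptions made for illustration.

```python
import numpy as np

NUCLEOTIDES = "ACGU"  # assumed ordering; the paper only states a 4-bit one-hot code


def one_hot(seq):
    """Encode each nucleotide as a 4-element vector with a single 1."""
    idx = [NUCLEOTIDES.index(c) for c in seq]
    return np.eye(4)[idx]


def softmax(z):
    """Row-wise softmax; subtracting the row maximum avoids overflow."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)


def mean_nll(logits, targets):
    """Mean negative log-likelihood of the target nucleotides."""
    probs = softmax(logits)
    picked = probs[np.arange(len(targets)), targets]
    return float(-np.log(picked).mean())


seq = "ACGUA"
inputs = one_hot(seq[:-1])                        # x_t for t = 1..4
targets = [NUCLEOTIDES.index(c) for c in seq[1:]]  # next nucleotide at each step
logits = np.zeros((len(targets), 4))              # placeholder for LSTM outputs z_t
loss = mean_nll(logits, targets)                  # uniform prediction gives log(4)
```

With all-zero logits the model predicts each nucleotide with probability 1/4, so the loss equals log 4; a trained model should fall below that baseline.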

2.3. Protein-binding motif

Although the generated RNA sequences are of fixed length (40 nt), finding a motif from the sequences is not straightforward partly due to the large number of sequences generated by our model. Many existing methods for finding sequence motifs have difficulty in scaling to thousands or more sequences.

From the generated sequences, we attempted to find discriminative motifs that are enriched in the sequences with high specificity relative to those with low specificity. Thus, we divided the 3000 RNA sequences (Supplementary data 2) generated by our model for each target protein into two groups: 1500 sequences with relatively high specificity and the remaining 1500 sequences with relatively low specificity. We ran DREME (Bailey, 2011) in the MEME Suite (Bailey et al., 2009) on the two groups of sequences.

Fig. 3. The procedure for computing the binding affinity of a sequence s to a target protein p. After computing DeepBind scores of 200,000 random sequences for a target protein, an empirical cumulative distribution function (cdf) was derived from the DeepBind scores. The binding affinity of a sequence s to a target protein p is the cdf value of the score of sequence s computed by DeepBind model m.

3. Results

3.1. Binding affinity and specificity of generated RNA sequences

Table 1 shows the protein-binding affinity of 3000 RNA sequences (Supplementary data 2) generated by our model for eight proteins. The proteins in the table have one or more RNA-binding domains (RBDs) such as the RNA recognition motif (RRM), hnRNP, K-homology (KH), and the cold-shock domain (CSD). For the eight proteins, DeepBind has 23 models (five HuR models, one KHDRBS3 model, two PTBP1 models, two RBM4 models, one SNRPA model, six SRSF1 models, four SRSF10 models, and two YBX1 models).

For comparison, we examined protein-binding affinity of 200,000 random RNA sequences for each protein. We used the median affinity value instead of the mean affinity because outliers can distort the mean. As shown in Table 1, the RNA sequences generated by our model showed a much higher median affinity than random sequences consistently for all proteins except one protein (FUSIP1 in DeepBind model D00102.001).

The binding affinity of the generated RNA sequences was examined further by dividing them into two groups: sequences with high specificity and sequences with low specificity. We split the sequences with respect to specificity rather than affinity because a sequence with a high binding affinity to a target protein may show a high binding affinity to other proteins as well. For example, some RNA sequences generated for HuR show higher affinity for PTB than HuR. This is because HuR and PTB have similar binding sites that function jointly to stimulate HIF-1α translation in response to CoCl2 treatment (Galbán et al., 2008). Fig. 4 shows the binding affinity of high-specificity RNA sequences, low-specificity RNA sequences, and random RNA sequences. The median binding affinity (a bar inside a box plot) of high-specificity sequences is higher than that of low-specificity sequences for all eight proteins. Except for two proteins (SLM2 and PTB), random RNA sequences show lower median affinity than other groups of sequences.

A generator model that constructs RNA sequences with high affinity but low specificity is less than ideal for discovering potential aptamers for a target protein. As discussed earlier, we define binding affinity and specificity of a sequence using the DeepBind score of the sequence.

As an example of the specificity of the RNA sequences generated by our model, the top plot in Fig. 5 compares the DeepBind scores computed by 8 different DeepBind models for the same RNA sequences generated by our model as HuR-binding sequences. Since the RNA sequences are generated as potential aptamers of HuR, they show much higher scores in the DeepBind HuR model than in other DeepBind models. This implies that RNA sequences generated by our HuR model possess much higher specificity for HuR than for other proteins. In contrast, random RNA sequences show similar scores in all DeepBind models (the bottom plot in Fig. 5), indicating that random sequences do not have binding specificity to any protein.

Table 1 Comparison of generated RNA sequences and random sequences with respect to the median binding affinity (AF) for 8 proteins used in 23 DeepBind models.

Gene      Protein   RNAcompete ID   DeepBind model   Domain   Median AF (generated)   Median AF (random)
ELAVL1    HuR       RNCMPT00032     D00114.001       RRM      0.9751                  0.4999
ELAVL1    HuR       RNCMPT00112     D00114.002       RRM      0.9846                  0.0000
ELAVL1    HuR       RNCMPT00117     D00114.003       RRM      0.9862                  0.4999
ELAVL1    HuR       RNCMPT00136     D00114.004       RRM      0.9828                  0.4999
ELAVL1    HuR       RNCMPT00274     D00114.005       RRM      0.9748                  0.4999
KHDRBS3   SLM2      RNCMPT00034     D00116.001       KH       0.7911                  0.4999
PTBP1     PTB       RNCMPT00268     D00274.001       RRM      0.7358                  0.4999
PTBP1     PTB       RNCMPT00269     D00274.002       RRM      0.7053                  0.4999
RBM4      RBM4      RNCMPT00052     D00133.001       RRM      0.9877                  0.4999
RBM4      RBM4      RNCMPT00113     D00133.002       RRM      0.9843                  0.4999
SNRPA     U1A       RNCMPT00071     D00152.001       RRM      0.8836                  0.4999
SRSF1     SF2/ASF   RNCMPT00106     D00169.001       RRM      0.9824                  0.4999
SRSF1     SF2/ASF   RNCMPT00107     D00169.002       RRM      0.9833                  0.4999
SRSF1     SF2/ASF   RNCMPT00108     D00169.003       RRM      0.9900                  0.4999
SRSF1     SF2/ASF   RNCMPT00109     D00169.004       RRM      0.9898                  0.4999
SRSF1     SF2/ASF   RNCMPT00110     D00169.005       RRM      0.9683                  0.4999
SRSF1     SF2/ASF   RNCMPT00163     D00169.006       RRM      0.9863                  0.0000
SRSF10    FUSIP1    RNCMPT00019     D00102.001       RRM      0.0349                  0.0349
SRSF10    FUSIP1    RNCMPT00088     D00102.002       RRM      0.9167                  0.4999
SRSF10    FUSIP1    RNCMPT00089     D00102.003       RRM      0.7803                  0.4999
SRSF10    FUSIP1    RNCMPT00090     D00102.004       RRM      0.7824                  0.4999
YBX1      YB1       RNCMPT00083     D00163.001       CSD      0.6721                  0.4999
YBX1      YB1       RNCMPT00116     D00163.002       CSD      0.8115                  0.4999
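Most entries in the last column of Table 1 are close to 0.5. This is what the definition of AF would lead one to expect: when the reference scores are all distinct, the AF of a reference sequence is its rank divided by n, so the AF values of the reference set are spread evenly over (0, 1] and their median is about 0.5. The sketch below checks this with synthetic Gaussian numbers standing in for DeepBind scores (an assumption; no real DeepBind output is used), and the function name is hypothetical.

```python
import numpy as np


def reference_affinities(scores):
    """AF (Eq. (6)) of every reference score against the full reference set."""
    sorted_scores = np.sort(scores)
    # For each score, count the reference scores that are <= it, then divide by n.
    return np.searchsorted(sorted_scores, scores, side="right") / len(scores)


rng = np.random.default_rng(0)
toy_scores = rng.normal(size=200_000)  # synthetic stand-in for DeepBind scores
median_af = float(np.median(reference_affinities(toy_scores)))
```

The check relies on the scores being continuous; a reference set with many tied scores would not behave this way.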


Fig. 4. The binding affinity of the three groups of RNA sequences (high-specificity sequences, low-specificity sequences, and random sequences) for eight RNA-binding proteins (HuR, SLM2, PTB, RBM4, U1A, SF2/ASF, FUSIP1, and YB1). Gene names of the proteins are indicated in parentheses.

Fig. 5. Top: DeepBind scores of 3000 RNA sequences generated by our model as HuR-binding sequences. The sequences show much higher scores in the DeepBind HuR model than in other DeepBind models. Bottom: DeepBind scores of 200,000 random RNA sequences, calculated by 8 DeepBind models for 8 proteins (HuR, SLM2, PTB, RBM4, U1A, SF2/ASF, FUSIP1, and YB1). The random sequences show similar scores in all models.

3.2. Protein-binding motifs and comparison with others

Fig. 6 shows protein-binding RNA motifs derived by our method and those derived by RNAcompete. The protein-binding motifs (Supplementary data 3) derived by our method show significant E-values and p-values determined by Fisher's exact test, and are similar to the motifs derived by RNAcompete. For four RBPs (HuR, PTB, SF2/ASF, and FUSIP1) in Fig. 6, our motifs are similar to the binding preferences identified by previous studies (Gao et al., 1994; Tacke and Manley, 1995; Pérez et al., 1997; Shin and Manley, 2002). RBM4 is a splicing regulator (Lai et al., 2003), and the RBM4-binding RNA motif we found is rich in GC. The SLM2-binding motif contains UAA (Danilenko et al., 2017). U1A and YB1 preferred sequences with more complex combinations of A, C, and U.

For further comparison with other methods, we obtained protein-binding sites in human RNAs from POSTAR2 (Zhu et al., 2019) and iDeep (Pan and Shen, 2017), which were experimentally detected by crosslinking and immunoprecipitation combined with high-throughput sequencing (CLIP-seq). Among the eight RNA-binding proteins (HuR, SLM2, PTB, RBM4, U1A, SF2/ASF, FUSIP1, and YB1) shown in Table 1 and Figs. 4–6, POSTAR2 provides protein-binding sites in human RNAs for three RBPs (HuR, SF2/ASF, and PTB) only. For the three RBPs, we located the binding sites in the human genome assembly GRCh38 (hg38) using Ensembl REST API (http://rest.ensembl.org) to extract RBP-binding sequences in the human genome. Fig. 7 compares the motifs of the RNA sequences generated by our method with the motifs in POSTAR2 and iDeep, which were originally derived by HOMER (Heinz et al., 2010), MEME (Bailey et al., 2009), and iONMF (Stražar et al., 2016) from the CLIP-seq data. The HuR-binding motif derived by our method is similar to those derived by HOMER, MEME, and iDeep from the CLIP-seq data. However, our motifs for SF2/ASF and PTB are different from those derived by HOMER and MEME. iDeep predicts an SF2/ASF-binding motif that is somewhat similar to ours, but predicts no PTB-binding motif.

In addition, we performed pairwise alignment of the RNA sequences generated by our method with two CLIP-seq data sets (Supplementary data 4) using the Needleman–Wunsch algorithm (Li et al., 2015) with default parameter values. One CLIP-seq data set contains the top 3000 CLIP-seq sequences with high Piranha scores (Uren et al., 2012) and the other CLIP-seq data set contains the bottom 3000 CLIP-seq sequences with low Piranha scores. As shown in Fig. 7, the RNA sequences generated by our method showed higher alignment scores with the CLIP-seq sequences with high Piranha scores than with the CLIP-seq sequences with low Piranha scores.

For comparison of our model with AptaSim in the AptaSuite collection (Hoinka et al., 2018), we downloaded RNA Bind-n-Seq (RBNS) data (SRA ID: SRX516685) (Lambert et al., 2014). We ran AptaSim with MBNL1-binding RNA sequences. We tested our model on MBNL1-binding RNA sequences, and selected the top 100 RNA sequences with a high binding specificity. MBNL1-binding RNAs are known to contain YGCY motifs in their binding regions, where Y denotes pyrimidine (C or U) (Goers et al., 2010). The MBNL1-binding motif found by our model is GCW where W denotes A or U. The most frequent MBNL1-binding motif found by AptaSim is UGCG (Supplementary data 5). Both motifs derived by our model and AptaSim share the GC core of the known MBNL1-binding motif YGCY.

Among the eight RBPs discussed in Figs. 4–6, a complex of U1A with RNA has a known structure (PDB ID: 1URN) and AUUGCAC is known as the U1A-binding motif in RNA (Tsai et al., 1991). We examined whether the known motif AUUGCAC is included in the high-specificity group of RNA sequences generated by our model (Supplementary data 2). As mentioned above, the binding specificity (SP) of an RNA sequence to a target protein is defined as the difference between the binding affinity (AF) of the RNA sequence to the target protein and the mean AF of the sequence to all other proteins except the target protein. So, the generated RNA sequences are sorted with respect to their DeepBind scores. Eight RNA sequences in the high-specificity group indeed contain the motif AUUGCAC (2 in the top 10 sequences, 6 in the top 100 sequences, 8 in the top 1000 sequences). Unlike the high-specificity group, none of the low-specificity group of RNA sequences contains the motif AUUGCAC.

The U1A-binding motif derived by our model agrees with the known motif AUUGCAC as well as that found by RNAcompete. The structure of a complex of U1A and RNA (PDB ID: 1URN) reveals that the U1A-binding motif found by our method (CAC in Fig. 8) is closer to U1A than the other nucleotides in the known motif AUUGCAC (Fig. 8).

4. Discussion

Many machine learning methods have been employed to predict the interactions between proteins and nucleic acids. However, most of the learning methods treat the problem of nucleic acid–protein interactions as a classification problem. In this paper we proposed a generative model using an LSTM neural network to construct RNA sequences binding to a target protein. The model was trained on a large set of RNA sequences obtained from an in vitro experiment known as RNAcompete, and tested to construct RNA sequences binding to several target proteins. The RNA sequences generated by the model showed much higher binding affinity and specificity for target proteins than random RNA sequences.

In addition to constructing protein-binding RNA sequences with high affinity and specificity, we found motifs conserved in the RNA sequences. The protein-binding motifs derived in this study are not identical but similar to known motifs, and show significant E-values and p-values. Learning sequence features that can dictate global gene regulation remains a major challenge in computational biology (Keene, 2007; Barash et al., 2010; Hogan et al., 2008). Compared to binding motifs in protein-DNA interactions, motifs in protein-RNA interactions are much harder to identify due to high flexibility of RNA-protein interfaces, and it has been questioned whether such RNA-binding recognition codes exist (Allain et al., 2006). Recent studies (Ray et al., 2009, 2013) show that RBP motifs can be readily used to infer human post-transcriptional regulation mechanisms, and can explain evolutionary constraints found within both coding and non-coding regions of transcripts.


Fig. 6. Left: Protein-binding motifs derived from the RNA sequences generated by our model. Right: Protein-binding motifs derived by RNAcompete. Gene names of the proteins are indicated in parentheses.
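Section 3.2 checks generated sequences against known motifs in two ways: by counting how many of the top-ranked sequences contain AUUGCAC, and by comparing degenerate motifs written with IUPAC codes (Y for C or U, W for A or U). A minimal sketch of both checks is given below, assuming the sequences are already ranked from most to least specific. The function names, the partial IUPAC table, and the toy sequences are illustrative and are not taken from the authors' pipeline.

```python
import re

# Partial IUPAC nucleotide code table for RNA (standard codes, U instead of T).
IUPAC_RNA = {
    "A": "A", "C": "C", "G": "G", "U": "U",
    "R": "[AG]", "Y": "[CU]", "W": "[AU]", "S": "[CG]",
    "K": "[GU]", "M": "[AC]", "N": "[ACGU]",
}


def motif_regex(motif):
    """Translate an IUPAC motif such as 'YGCY' into a compiled regular expression."""
    return re.compile("".join(IUPAC_RNA[c] for c in motif))


def count_in_top_k(ranked_sequences, motif, ks):
    """For each k, count how many of the first k sequences contain the motif."""
    pattern = motif_regex(motif)
    hits = [bool(pattern.search(s)) for s in ranked_sequences]
    return {k: sum(hits[:k]) for k in ks}


# Toy ranked list (most specific first); the real lists hold 1500 sequences.
ranked = ["GGAUUGCACGG", "CCCC", "AUUGCAC", "UGCU"]
counts = count_in_top_k(ranked, "AUUGCAC", [1, 2, 4])
has_ygcy = motif_regex("YGCY").search("AUGCUA") is not None
```

Reporting counts at several cut-offs, as the paper does for the top 10, 100, and 1000 sequences, shows whether a motif is concentrated among the most specific sequences or spread evenly through the list.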

In this paper we presented a new computational approach to constructing protein-binding RNA sequences and finding protein-binding motifs. The results of testing the approach for several target proteins demonstrated its potential as a generator of RNA sequences binding to a target protein. In particular, RNA sequences generated by our model have high binding affinity and specificity for their target proteins and the protein-binding motifs derived from the generated RNA sequences are comparable to the motifs from experimentally validated protein-binding RNA sequences. Further analysis of the protein-binding motifs is required, but our approach will help design in vitro or in vivo experiments more efficiently by suggesting protein-binding RNA motifs and an initial pool of RNA sequences for a target protein.

Conflict of interest

The authors declare that they have no competing interests.

Acknowledgements

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Ministry of Science and ICT (NRF-2018K2A9A2A11080914, NRF-2017R1E1A1A03069921) and the Ministry of Education (NRF-2016R1A6A3A11931497).

Fig. 7. Left: Protein-binding motifs derived from the RNA sequences generated by our model. Center: Protein-binding motifs derived from CLIP-seq data by POSTAR2 and iDeep. No PTB-binding motif was predicted by iDeep. Right: Pairwise alignment of the RNA sequences generated by our model with two CLIP-seq data sets.

Fig. 8. (A) U1A protein-binding motif derived by our method. (B) U1A protein-binding motif derived by RNAcompete. (C) The structure of a complex of U1A and RNA (PDB ID: 1URN).

Appendix A. Supplementary data

Supplementary data associated with this article can be found, in the online version, at https://doi.org/10.1016/j.compbiolchem.2019.107171.

References

Akbaripour-Elahabad, M., Zahiri, J., Rafeh, R., Eslami, M., Azari, M., 2016. rpiCOOL: a tool for in silico RNA-protein interaction detection using random forest. J. Theoret. Biol. 402, 1–8. https://doi.org/10.1016/j.jtbi.2016.04.025.

Alipanahi, B., Delong, A., Weirauch, M.T., Frey, B.J., 2015. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nat. Biotechnol. 33, 831. https://doi.org/10.1038/nbt.3300.

Allain, F.H.T., Oberstrass, F.C., Auweter, S.D., 2006. Sequence-specific binding of single-stranded RNA: is there a code for recognition? Nucleic Acids Res. 34, 4943–4959. https://doi.org/10.1093/nar/gkl620.

Bailey, T.L., 2011. DREME: motif discovery in transcription factor chip-seq data. Bioinformatics 27, 1653–1659. https://doi.org/10.1093/bioinformatics/btr261.

Bailey, T.L., Boden, M., Buske, F.A., Frith, M., Grant, C.E., Clementi, L., Ren, J., Li, W.W., Noble, W.S., 2009. MEME Suite: tools for motif discovery and searching. Nucleic Acids Res. 37, W202–W208. https://doi.org/10.1093/nar/gkp335.

Barash, Y., Calarco, J.A., Gao, W., Pan, Q., Wang, X., Shai, O., Blencowe, B.J., Frey, B.J., 2010. Deciphering the splicing code. Nature 465, 53. https://doi.org/10.1038/nature09000.

Choi, D., Park, B., Chae, H., Lee, W., Han, K., 2017. Predicting protein-binding regions in RNA using nucleotide profiles and compositions. BMC Syst. Biol. 11. https://doi.org/10.1186/s12918-017-0386-4.

Danilenko, M., Dalgliesh, C., Pagliarini, V., Naro, C., Ehrmann, I., Feracci, M., Kheirollahi-Chadegani, M., Tyson-Capper, A., Clowry, G.J., Fort, P., Dominguez, C., Sette, C., Elliott, D.J., 2017. Binding site density enables paralog-specific activity of SLM2 and Sam68 proteins in Neurexin2 AS4 splicing control. Nucleic Acids Res. 45, 4120–4130. https://doi.org/10.1093/nar/gkw1277.

Galbán, S., Kuwano, Y., Pullmann, R., Martindale, J.L., Kim, H.H., Lal, A., Abdelmohsen, K., Yang, X., Dang, Y., Liu, J.O., Lewis, S.M., Holcik, M., Gorospe, M., 2008. RNA-binding proteins HuR and PTB promote the translation of hypoxia-inducible factor 1α. Mol. Cell. Biol. 28, 93–107. https://doi.org/10.1128/mcb.00973-07.

Gao, F.B., Carson, C.C., Levine, T., Keene, J.D., 1994. Selection of a subset of mRNAs from combinatorial 3' untranslated region libraries using neuronal RNA-binding protein Hel-N1. Proc. Natl. Acad. Sci. U.S.A. 91, 11207–11211. https://doi.org/10.1073/pnas.91.23.11207.

Goers, E.S., Purcell, J., Voelker, R.B., Gates, D.P., Berglund, J.A., 2010. MBNL1 binds GC motifs embedded in pyrimidines to regulate alternative splicing. Nucleic Acids Res. 38, 2467–2484. https://doi.org/10.1093/nar/gkp1209.

Graves, A., Mohamed, A., Hinton, G.E., 2013. Speech Recognition with Deep Recurrent Neural Networks. arXiv:1303.5778.

Hassanzadeh, H.R., Wang, M., 2016. Deeperbind: enhancing prediction of sequence specificities of DNA binding proteins. Comput. Vision Pattern Recogn. 178–183. https://doi.org/10.1109/BIBM.2016.7822515.

Heinz, S., Benner, C., Spann, N., Bertolino, E., Lin, Y.C., Laslo, P., Cheng, J.X., Murre, C., Singh, H., Glass, C.K., 2010. Simple combinations of lineage-determining transcription factors prime cis-regulatory elements required for macrophage and b cell identities. Mol. Cell 38, 576–589. https://doi.org/10.1016/j.molcel.2010.05.004.

Hochreiter, S., Schmidhuber, J., 1997. Long short-term memory. Neural Comput. 9, 1735–1780. https://doi.org/10.1162/neco.1997.9.8.1735.

Hogan, D.J., Riordan, D.P., Gerber, A.P., Herschlag, D., Brown, P.O., 2008. Diverse RNA-binding proteins interact with functionally related sets of RNAs, suggesting an extensive regulatory system. PLOS Biol. 6, e255. https://doi.org/10.1371/journal.pbio.0060255.

Hoinka, J., Backofen, R., Przytycka, T.M., 2018. AptaSUITE: a full-featured bioinformatics framework for the comprehensive analysis of aptamers from HT-SELEX experiments. Mol. Ther. Nucleic Acids 11, 515–517. https://doi.org/10.1016/j.omtn.2018.04.006.

Keene, J.D., 2007. RNA regulons: coordination of post-transcriptional events. Nat. Rev. Genet. 8, 533. https://doi.org/10.1038/nrg2111.

Lai, M.C., Kuo, H.W., Chang, W.C., Tarn, W.Y., 2003. A novel splicing regulator shares a nuclear import pathway with SR proteins. EMBO J. 22, 1359–1369. https://doi.org/10.1093/emboj/cdg126.

Lambert, N., Robertson, A., Jangi, M., McGeary, S., Sharp, P.A., Burge, C.B., 2014. RNA Bind-n-Seq: quantitative assessment of the sequence and structural binding specificity of RNA binding proteins. Mol. Cell 54, 887–900. https://doi.org/10.1016/j.molcel.2014.04.016.

Li, W., Cowley, A., Uludag, M., Gur, T., McWilliam, H., Squizzato, S., Park, Y.M., Buso, N., Lopez, R., 2015. The EMBL-EBI bioinformatics web and programmatic tools framework. Nucleic Acids Res. 43, W580–W584. https://doi.org/10.1093/nar/gkv279.

Pan, X., Shen, H.B., 2017. RNA-protein binding motifs mining with a new hybrid deep learning based cross-domain knowledge integration approach. BMC Bioinformatics 18, 136. https://doi.org/10.1186/s12859-017-1561-8.

Pérez, I., McAfee, J.G., Patton, J.G., 1997. Multiple RRMs contribute to RNA binding specificity and affinity for polypyrimidine tract binding protein. Biochemistry 36, 11881–11890. https://doi.org/10.1021/bi9711745.

Ray, D., Kazan, H., Chan, E.T., Castillo, L.P., Chaudhry, S., Talukder, S., Blencowe, B.J., Morris, Q., Hughes, T.R., 2009. Rapid and systematic analysis of the RNA recognition specificities of RNA-binding proteins. Nat. Biotechnol. 27, 667. https://doi.org/10.1038/nbt.1550.

Ray, D., Kazan, H., Cook, K.B., Weirauch, M.T., Najafabadi, H.S., Li, X., Gueroussov, S., Albu, M., Zheng, H., Yang, A., Na, H., Irimia, M., Matzat, L.H., Dale, R.K., Smith, S.A., Yarosh, C.A., Kelly, S.M., Nabet, B., Mecenas, D., Li, W., Laishram, R.S., Qiao, M., Lipshitz, H.D., Piano, F., Corbett, A.H., Carstens, R.P., Frey, B.J., Anderson, R.A., Lynch, K.W., Penalva, L.O.F., Lei, E.P., Fraser, A.G., Blencowe, B.J., Morris, Q.D., Hughes, T.R., 2013. A compendium of RNA-binding motifs for decoding gene regulation. Nature 499, 172. https://doi.org/10.1038/nature12311.

Shin, C., Manley, J.L., 2002. The SR protein SRp38 represses splicing in M phase cells. Cell 111, 407–417. https://doi.org/10.1016/S0092-8674(02)01038-3.

Stražar, M., Žitnik, M., Zupan, B., Ule, J., Curk, T., 2016. Orthogonal matrix factorization enables integrative analysis of multiple RNA binding proteins. Bioinformatics 32, 1527–1535. https://doi.org/10.1093/bioinformatics/btw003.

Sutskever, I., Vinyals, O., Le, Q.V., 2014. Sequence to Sequence Learning with Neural Networks. arXiv:1409.3215.

Tacke, R., Manley, J.L., 1995. The human splicing factors ASF/SF2 and SC35 possess distinct, functionally significant RNA binding specificities. EMBO J. 14, 3540–3551. https://doi.org/10.1002/j.1460-2075.1995.tb07360.x.

Tsai, D.E., Harper, D.S., Keene, J.D., 1991. U1-snRNP-A protein selects a ten nucleotide consensus sequence from a degenerate RNA pool presented in various structural contexts. Nucleic Acids Res. 19, 4931–4936. https://doi.org/10.1093/nar/19.18.4931.

Tuvshinjargal, N., Lee, W., Park, B., Han, K., 2016. PRIdictor: protein-RNA interaction predictor. Biosystems 139, 17–22. https://doi.org/10.1016/j.biosystems.2015.10.004.

Uren, P.J., Bahrami-Samani, E., Burns, S.C., Qiao, M., Karginov, F.V., Hodges, E., Hannon, G.J., Sanford, J.R., Penalva, L.O.F., Smith, A.D., 2012. Site identification in high-throughput RNA-protein interaction data. Bioinformatics 28, 3013–3020. https://doi.org/10.1093/bioinformatics/bts569.

Walia, R.R., Xue, L.C., Wilkins, K., El-Manzalawy, Y., Dobbs, D., Honavar, V., 2014. RNABindRPlus: a predictor that combines machine learning and sequence homology-based methods to improve the reliability of predicted RNA-binding residues in proteins. PLOS ONE 9. https://doi.org/10.1371/journal.pone.0097725.

Zhang, X.L., Liu, S.Y., 2017. RBPPred: predicting RNA-binding proteins from sequence using SVM. Bioinformatics 33, 854–862. https://doi.org/10.1093/bioinformatics/btw730.

Zhu, Y.M., Xu, G., Yang, Y.C.T., Xu, Z.Y., Chen, X.D., Shi, B.B., Xie, D.X., Lu, Z.J., Wang, P.Y., 2019. POSTAR2: deciphering the post-transcriptional regulatory logics. Nucleic Acids Res. 47, D203–D211. https://doi.org/10.1093/nar/gky830.
