Discovering Protein-Binding RNA Motifs with a Generative Model of RNA Sequences
Computational Biology and Chemistry 84 (2020) 107171
Contents lists available at ScienceDirect
Computational Biology and Chemistry
journal homepage: www.elsevier.com/locate/cbac

Research Article

Discovering protein-binding RNA motifs with a generative model of RNA sequences

Byungkyu Park, Kyungsook Han*
Department of Computer Engineering, Inha University, 22212 Incheon, South Korea

Keywords: Protein-RNA interaction; Binding motif; Generator; Long short-term memory network

ABSTRACT

Recent advances in high-throughput experimental technologies have generated a huge amount of data on interactions between proteins and nucleic acids. Motivated by the big experimental data, several computational methods have been developed either to predict binding sites in a sequence or to determine if an interaction exists between protein and nucleic acid sequences. However, most of the methods cannot be used to discover new nucleic acid sequences that bind to a target protein because they are classifiers rather than generators. In this paper we propose a generative model for constructing protein-binding RNA sequences and motifs using a long short-term memory (LSTM) neural network. Testing the model for several target proteins showed that RNA sequences generated by the model have high binding affinity and specificity for their target proteins and that the protein-binding motifs derived from the generated RNA sequences are comparable to the motifs from experimentally validated protein-binding RNA sequences. The results are promising and we believe this approach will help design more efficient in vitro or in vivo experiments by suggesting potential RNA aptamers for a target protein.

* Corresponding author. E-mail address: [email protected] (K. Han).
https://doi.org/10.1016/j.compbiolchem.2019.107171
Received 17 June 2019; Received in revised form 19 October 2019; Accepted 19 November 2019. Available online 07 January 2020.
1476-9271/© 2019 The Authors. Published by Elsevier Ltd. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/BY-NC-ND/4.0/).

1. Introduction

Interactions between proteins and nucleic acids are involved in many biological processes such as transcription, RNA processing, and translation. A variety of in vitro and in vivo experimental methods have been developed to study interactions between proteins and nucleic acids, and the past decade has witnessed a large amount of data generated by the experimental methods. The experimental data have triggered the development of computational methods to predict binding sites in a sequence (Choi et al., 2017; Tuvshinjargal et al., 2016; Walia et al., 2014) or to determine if an interaction exists between a pair of sequences (Akbaripour-Elahabad et al., 2016; Alipanahi et al., 2015; Zhang and Liu, 2017). However, most of the computational methods are classifiers rather than generators, so they cannot be used to discover new protein-binding nucleic acid sequences.

Among the computational methods, neural networks have shown a certain degree of success in predicting the interactions between proteins and nucleic acids. DeepBind (Alipanahi et al., 2015), for example, is a convolutional neural network trained on a large amount of data from RNAcompete experiments (Ray et al., 2009, 2013). For the problem of predicting protein-binding sites of nucleic acid sequences, DeepBind contains hundreds of distinct prediction models, each for a different target protein. What DeepBind predicts is a binding score rather than protein-binding sites in the input nucleic acid sequence. Nonetheless, the binding score provided by DeepBind is informative, so we used the binding score in our study to estimate the affinity and specificity of nucleic acid sequences for a target protein.

A later model known as DeeperBind (Hassanzadeh and Wang, 2016) is a long short-term recurrent convolutional network for predicting the protein-binding specificity of DNA sequences. DeeperBind showed a better performance than DeepBind for some proteins, but its use is limited to the datasets from protein-binding microarrays. Both DeepBind and DeeperBind are classifiers rather than generators, and are not intended for finding protein-binding motifs or for constructing protein-binding nucleic acid sequences.

A more recent model called iDeep (Pan and Shen, 2017) uses a convolutional neural network and a deep belief network to predict the RBP binding sites and motifs in RNAs. Five features (region type, clip-cobinding, structure, motif, CNN sequence) were used to train iDeep. In testing on 31 human RBPs, iDeep showed a better performance than DeepBind. However, iDeep has several drawbacks compared to DeepBind. First, iDeep contains only 31 human RBP models whereas DeepBind has 144 human RBP models. Second, iDeep requires input data with the 5 features (region type, clip-cobinding, structure, motif, CNN sequence) derived by iONMF (Stražar et al., 2016), whereas DeepBind takes only a sequence as input. Thus, if data generated by iONMF is not available for a target protein, iDeep cannot be used.

In this paper we propose a new approach to finding protein-binding RNA motifs with a generative model of protein-binding RNA sequences using a long short-term memory (LSTM) network. As an extension of a recurrent neural network (RNN), the LSTM network solves the vanishing gradient problem of RNNs by introducing a gating mechanism (Hochreiter and Schmidhuber, 1997). LSTM networks are often used in deep learning and show good performance in speech recognition (Graves et al., 2013) and language translation (Sutskever et al., 2014). To the best of our knowledge, our generator is the first attempt to construct protein-binding RNA sequences using an LSTM network. The rest of the paper describes the architecture of the generator and discusses the protein-binding RNA sequences generated by the model and their motifs for several RNA-binding proteins (RBPs).

2. Method

2.1. Generator model of RNA sequences

For protein-binding RNA sequences, a generator model was implemented using char-rnn (https://github.com/karpathy/char-rnn). The model is composed of four layers of LSTM with 512 hidden neurons (Fig. 1). Given an RNA sequence, it reads one nucleotide of the input sequence at a time and predicts the next nucleotide in the sequence. The parameters of the model are updated by the difference between predicted and target nucleotides.

To train the model, we obtained a set of 213,300 unique RNA sequences (GEO: GSE15769), which were identified as binding sequences to RBPs by a custom Agilent 244K microarray experiment known as RNAcompete (Ray et al., 2009). Among the nine RBPs of RNAcompete, we excluded one yeast RBP and used the remaining eight human RBPs as target proteins in our study. Since the RBP-binding RNA sequences were 29-38 nucleotides (nt) long, the length of RNA sequences generated by our model was set to 40 nt.

The generator model was trained in the following way (Eqs. (1)-(5)). Let x_t be a 4-bit vector representing the t-th nucleotide (n) in the input sequence (Supplementary data 1). Only one element of x_t is 1 and the others are 0 (one-hot encoding). y_t is a class indicator representing a target nucleotide. The LSTM calculates z_t for x_t. Softmax changes z_t to a vector of values between 0 and 1 that sum to 1 (one-hot decoding). When training the model, the parameters of the neurons in the LSTM layers were updated using the loss calculated from the difference between the target nucleotide and the predicted nucleotide. The loss function, defined by Eq. (5), is the mean of the negative log-likelihood of the prediction:

x_t = number for an input nucleotide   (1)

y_t = number for a target nucleotide   (2)

z_t = LSTM(x_t)   (3)

y_t = softmax(z_t) = Pr(x_{t+1} = n | y_t)   (4)

loss = -Σ_t log Pr(x_{t+1} = n | y_t)   (5)

Fig. 2 shows the loss of the model during the first 50 epochs of training for eight target proteins: HuR, SLM2, PTB, RBM4, U1A, SF2/ASF, FUSIP1, and YB1. The loss tended to decrease as the model was trained longer, but the decreasing trend was not monotonic. We selected the generator model with the minimum loss value.

2.2. Definition of binding affinity and specificity

To assess the RNA sequences generated by our model, we defined the binding affinity and specificity of the sequences using the predictive binding score of DeepBind (Alipanahi et al., 2015). One problem with the DeepBind score is that the scale of the scores is arbitrary, so the scores from different DeepBind models are not directly comparable.

To make DeepBind scores comparable, we defined the binding affinity (AF) of an RNA sequence s to a target protein p as the probability that s has a larger DeepBind score than a random sequence. To obtain an approximate value of the probability, we ran DeepBind on 200,000 random RNA sequences of 40 nucleotides and computed the binding affinity defined by Eq. (6) (see Fig. 3 for an illustration of deriving the binding affinity). Since the binding affinity of a sequence is the probability that the sequence has a larger DeepBind score than a random sequence, it is guaranteed to be in the range [0, 1]. In the following equation, Score_m(s) denotes the score of sequence s computed by DeepBind model m, and x_1, ..., x_n are the DeepBind scores of the n random sequences:

AF_p(s) = (1/n) Σ_{i=1}^{n} δ(x_i ≤ Score_m(s))   (6)

where δ(A) = 1 if an event A occurs; δ(A) = 0 otherwise.

The binding specificity (SP) of RNA sequence s to protein p was defined by Eq. (7). In the equation, M is a set of all generator models trained on data from the same type of experiment as m, and M_c = M - {m}:

SP_p(s) = AF_p(s) - (1/|M_c|) Σ_{k ∈ M_c} AF_k(s)   (7)

Fig.
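The one-hot encoding and decoding of Section 2.1 (Eqs. (1) and (2)) amount to mapping each nucleotide to a 4-bit vector with a single 1 and back. A minimal Python sketch of that mapping; this is an illustration, not the authors' char-rnn code, and the function names are our own:

```python
# Illustrative one-hot encoding/decoding for RNA sequences
# (names are hypothetical; the paper uses char-rnn internally).
NUCLEOTIDES = "ACGU"

def one_hot_encode(seq):
    """Each nucleotide becomes a 4-bit vector with exactly one 1 (Eq. (1))."""
    return [[1 if nt == base else 0 for base in NUCLEOTIDES] for nt in seq]

def one_hot_decode(vectors):
    """Inverse mapping: take the index of the largest entry per vector."""
    return "".join(NUCLEOTIDES[max(range(4), key=v.__getitem__)] for v in vectors)
```

Because decoding takes an argmax, it also works on softmax output vectors, not just exact one-hot vectors.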
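The softmax and negative log-likelihood loss of Eqs. (4) and (5) can be sketched with plain Python lists standing in for the LSTM's outputs (a simplified illustration; the helper names and list-based representation are our assumptions, not the paper's implementation):

```python
import math

def softmax(z):
    """Map raw scores z_t to probabilities in [0, 1] that sum to 1 (Eq. (4))."""
    e = [math.exp(v - max(z)) for v in z]   # subtract max for numerical stability
    s = sum(e)
    return [v / s for v in e]

def nll_loss(logits_seq, targets):
    """Mean negative log-likelihood of the target nucleotides (Eq. (5)).
    logits_seq: one raw score vector per position; targets: class indices."""
    total = 0.0
    for z, t in zip(logits_seq, targets):
        total += -math.log(softmax(z)[t])
    return total / len(targets)
```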
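Generation with the trained model proceeds one nucleotide at a time: the predictive distribution over A, C, G, U is sampled, and the sampled nucleotide is fed back in, until the 40-nt length is reached. A toy sampler under that assumption, with `next_probs` as a hypothetical stand-in for the trained LSTM:

```python
import random

def generate_sequence(next_probs, length=40, seed=0):
    """Sample a sequence nucleotide-by-nucleotide from a predictive model.
    next_probs: callable mapping the prefix generated so far to a
    probability vector over A, C, G, U (stand-in for the trained LSTM)."""
    rng = random.Random(seed)   # seeded for reproducibility
    seq = ""
    for _ in range(length):
        p = next_probs(seq)
        seq += rng.choices("ACGU", weights=p, k=1)[0]
    return seq
```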
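Once the DeepBind scores are in hand, Eqs. (6) and (7) reduce to counting and averaging. A minimal sketch assuming precomputed score lists (function names are illustrative, not from the paper):

```python
def binding_affinity(score, random_scores):
    """AF (Eq. (6)): fraction of random-sequence scores that the
    sequence's DeepBind score meets or exceeds; always in [0, 1]."""
    return sum(1 for x in random_scores if x <= score) / len(random_scores)

def binding_specificity(af_target, af_others):
    """SP (Eq. (7)): affinity for the target model minus the mean
    affinity over the other models M_c from the same experiment type."""
    return af_target - sum(af_others) / len(af_others)
```

In the paper, `random_scores` would hold the DeepBind scores of the 200,000 random 40-nt sequences, so AF is the empirical probability described in Section 2.2.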