-RNA interaction prediction with deep learning: Structure matters

Junkang Wei * 1 Siyuan Chen * 2 Licheng Zong * 1 Xin GaoB 2 Yu LiB 1 3

Abstract their importance, researchers have made significant efforts Protein-RNA interactions are of vital importance to understand the interactions (Corley et al., 2020) and the to a variety of cellular activities. Both experi- related molecular mechanism behind the processes (Lin and mental and computational techniques have been Miles, 2019; Yi et al., 2020). Due to the difficulty to perform developed to study the interactions. Due to the high-throughput structural biological experiments in the last limitation of the previous database, especially the century, the progress of this field was slow (Conn et al., lack of data, most of the exist- 1999). However, with the development and advancement of in vivo ing computational methods rely heavily on the high-throughput assays, such as the RIP-seq (Keene in vitro sequence data, with only a small portion of the et al., 2006) and CLIP-seq (Ule et al., 2005), and the methods utilizing the structural information. Re- RNACompete (Ray et al., 2009) and HT-SELEX (Roulet cently, AlphaFold has revolutionized the entire et al., 2002), we have witnessed the significant progress of protein and biology field. Foreseeably, the protein- this field as well as the large amount of accumulated data RNA interaction prediction will also be promoted (Alipanahi et al., 2015). Computational methods emerge to significantly in the upcoming years. In this work, analyze the data and accelerate the discovery (Hafner et al., we give a thorough review of this field, survey- 2021; Sun et al., 2021; Bernstein et al., 2005; Van Nostrand ing both the binding site and binding preference et al., 2020; Lam et al., 2019; Hentze et al., 2018). prediction problems and covering the commonly Similar to the experimental techniques, which can be di- used datasets, features, and models. We also point vided into the structure-based methods and the assay-based out the potential challenges and opportunities in methods, the computational methods can also be classi- this field. This survey summarizes the develop- fied into two categories, either predicting the RNA binding ment of the RBP-RNA interaction field in the past sites on the protein surface (Miao and Westhof, 2015; Yan and foresees its future development in the post- et al., 2016) or modeling the preferred RNA sequences of AlphaFold era. an RNA-binding protein (RBP) (Trabelsi et al., 2019). In the first category, people essentially resolve a binary clas- sification problem. Given the protein, researchers want to 1. Introduction predict whether it is an RBP, and if it is an RBP, at which amino acids it can interact with an RNA. In the latter one, Protein-RNA interactions are involved in a variety of cellu- given a protein with the high-throughput assay experimen- lar activities, such as expression regulations (Weirauch tal data, people extract the frequency of each nucleotide at et al., 2013), post-transcriptional regulations (Alipanahi each position on the preferred RNA sequences, using k-mer et al., 2015), and protein synthesis (Yan et al., 2016). Per- models (Lee et al., 2015), position weight matrix (PWM)

arXiv:2107.12243v1 [q-bio.BM] 26 Jul 2021 turbation of such interactions can lead to fatal cellular dys- models (Weirauch et al., 2013), or deep learning models function and diseases (Ramanathan et al., 2019). Owing to (Alipanahi et al., 2015). If the computational method targets *Equal contribution 1Department of Computer Science and En- on genome-wide prediction, sometimes, it is also referred as gineering (CSE), The Chinese University of Hong Kong (CUHK), the binding sites prediction on (Pan and Shen, 2018a; Hong Kong, People’s Republic of China 2Computational Bio- Li et al., 2017a), which may cause confusion to the readers. science Research Center (CBRC), Computer, Electrical and Math- In the rest of the paper, binding sites prediction refers to ematical Sciences and Engineering Division (CEMSE), King Ab- dullah University of Science and Technology (KAUST), Thuwal predicting the RNA binding sites on the protein surface, 23955-6900, Saudi Arabia 3The CUHK Shenzhen Research while the binding preference prediction refers to predicting Institute, Hi-Tech Park, Nanshan, Shenzhen 518057, China. the protein binding preference against RNA sequences. On Correspondence to: Xin Gao , Yu Li the other hand, as both of the two main research directions . are protein-centric (Ramanathan et al., 2019), which means Copyright 2021 by the author(s). that there is intrinsic relation between the two research top- Protein-RNA interaction prediction with deep learning ics, researchers are also trying to predict both information sive local structural encoding can significantly improve the simultaneously with a unified deep learning method (Lam model’s performance. For example, people have used vox- et al., 2019). els (Torng and Altman, 2019) and graphs (Xia et al., 2021a) to encode the protein 3D structures.In terms of the RNA Since the first computational method was proposed to tackle features, the logic is similar to the protein ones. Regard- the interaction between RNA and protein specifically (Jeong ing the sequence features, people have been using the se- et al., 2004), a number of algorithms have been developed quence one-hot encodings (Grønning et al., 2020; Sun et al., to handle the problems (Miao and Westhof, 2015; Yan et al., 2021; Alipanahi et al., 2015), k-mer models (Orenstein et al., 2016; Yan and Zhu, 2020; Sagar and Xue, 2019; Pan et al., 2016), and position-weighted matrix (PWM) (Kazan et al., 2019a). They can be divided into the following categories. 2010; Orenstein et al., 2016). However, unlike the protein Firstly, based on the assumption that similar structures may secondary structures, RNA secondary structural information have similar function, people have used the template-based has been significantly emphasized, including both the pre- method to predict the binding sites (Yang et al., 2013; Chen dicted RNA secondary structures and the in vivo structure et al., 2014; Wu et al., 2018; Xie et al., 2020a) and the bind- profiles (Hiller et al., 2006; Maticzka et al., 2014; Kazan ing preference (Zheng et al., 2016). Although such methods et al., 2010; Sun et al., 2021). Meanwhile, the tertiary struc- can perform well on queries with homologs, they have diffi- tures are also shown to be very important (Zhang et al., culty in handling new sequences without homologs (Senior 2016). Despite the large variety of existing features, unfor- et al., 2020). Secondly, people combine hand-crafted fea- tunately, people have not taken full advantage of them for tures, which will be discussed in the next paragraph, with the following two reasons. Firstly, in the binding site predic- shallow-learning methods, such as support vector machine tion, people usually only consider the protein information, (SVM) (Jolma et al., 2020; Maticzka et al., 2014; Zhang while in the binding preference prediction, people usually and Liu, 2017; Su et al., 2019a), logistic regression (Kazan only consider the RNA information. The interactions be- et al., 2010; Orenstein et al., 2016; Hiller et al., 2006; Yan tween RNAs and include at least two molecules, and Kurgan, 2017), and random forest (Li et al., 2017b; Sun and using information from only one side can lead to infe- et al., 2016), to investigate the topic. The commonly used rior performance. Secondly, the RNA and protein structural k-mer models (Orenstein et al., 2016) and position-weighted information has not been fully utilized as well, mainly due matrix (PWM) models (Kazan et al., 2010) are classified to the limitation of previous structure prediction methods into this category, because they are usually combined with and the unsatisfactory structure encoding methods. logistic regression. Notice that this category of methods is still under active development (Zhang and Liu, 2017; Su In recent years, we have witnessed the significant improve- et al., 2019a), even after the surge of deep learning, because ment of both the structure determination methods (Yip et al., it is difficult to represent and encode the raw structural infor- 2020) and prediction methods (Marks et al., 2011; Wang mation, which will be discussed in detail in this paper. The et al., 2017a; 2019a; Senior et al., 2020; Jumper et al., 2021; last category is the deep learning-based methods (Alipanahi Baek et al., 2021; Tunyasuvunakool et al., 2021). Consid- et al., 2015; Lam et al., 2019; Sun et al., 2021), which have ering the success of the previous computational methods been very popular in recent years. With such models, peo- targeting protein-RNA interaction prediction based on struc- ple only need to input the raw representation of the proteins tural information, it is foreseeable that researchers will make or RNAs, and let the models learn and extract useful in- significant progress in this field (Figure1). Given that, we formation by themselves. However, the transparency and review this field thoroughly in this paper, emphasizing the interpretability of the models are usually questioned (Li structural information. In this work, we also consider the et al., 2019). protein-RNA interaction binding site and binding preference prediction simultaneously for the first time, considering their Within the above algorithms, people have been using various intrinsic relationship. We notice that there are some existing features, including the ones from both proteins and RNAs. related reviews focusing on different aspects of this prob- Regarding the protein features, researchers have developed lem. More specifically, Pan et al.(2019b); Yan and Zhu representations from sequences, such as sequence one-hot (2020); Sagar and Xue(2019) list the recently developed encodings (Yan et al., 2016), position-specific scoring ma- deep learning tools for predicting binding preference. Tra- trix (PSSM) (Su et al., 2019a; Liu et al., 2020), and conser- belsi et al.(2019) evaluates the performance of different vation entropy derived from PSSM. The physicochemical deep learning models on predicting the binding preference. properties (Chen and Lim, 2008), including hydrophobicity, Yan et al.(2016); Si et al.(2015); Miao and Westhof(2015) electrostatics, and atom types, are also helpful. Although list and evaluate the tools for predicting binding sites on the individual local protein structural information, such protein, although all the involved methods were developed as residue propensity and solvent accessibility, has been before 2014, which means that the deep learning methods adopted for a while (Miao and Westhof, 2015), recently, are not included. Hafner et al.(2021); Ramanathan et al. researchers have shown that directly using the comprehen- Protein-RNA interaction prediction with deep learning

Binding Preference Prediction 2020 ProbeRating Binding Site 2018 Nucleic acid recommender Prediction 2006 2014 systemPRIME-3D2D iDeepE Structure Prediction MEMERIS 2019 GraphProt Combines global and local CNNs Docks protein 3D structure and Models a single structural Encodes each ThermoNet RNA secondary structure 2016 2017 context based on position SSMART Integrates thermodynamic structure into a RPI-Net weight matrix (PWM) Zhang et. al iDeep Models primary sequence and and sequence contexts graph for classifying Uses a graph neural network(GNN) Can predict RNA Combines CNNs and DBNs structural properties 2021 by SVMs Shen et. al tertiary structural DeepCLIP iDeepS DLPRB Uses capsule network PrismNet 2010 information for First context-aware modeling and 2015 Uses CNNs and BLSTM Jointly uses CNNs and RNNs Integrates RNAContext the first time simultaneously NucleicNet predicting method iDeepV experimental in vivo Integrates DeepBind Predicts from local RCK Wang et. al First learns distributed vectors Jolma et. al RNA structure and sequence and First CNN- physicochemical Uses a k-mer Uses extreme learning Assembled a genome-scale RBP binding data based of k-mers from genome-wide characteristics structure collection of RBPs and their predictors based model machine (ELM) to classify sequences information RBDs using HTR-SELEX

2010 2014 2017 2021 2021 RBPpred Li et.al DRNApred GraphBind AlphaFold-2 Uses amino 2007 Predicts and discriminates 2019 Uses hierarchical Quantifies the contribution 2016 2017 acid sequence, between DNA- and RNA- GNNs EVFold of both sequence- and Wang et.al Torng and Altman RoseTTAFold evolutionary Sun et.al binding residues 2020 First applies structure-based features utilizes deep A protein site represented conservation, Two new features: co-evolutionary learning to DeepSite as four channels (carbon, AlphaFold-1 DeepMind secondary electrostatic information to 2013 predict protein Uses 3D CNNs oxygen, nitrogen and feature and triplet predicts almost the structure and 2020 predict protein structure RBPPred sulfur) entire human solvent Yang et.al interface structure Based on binding-specific Integrates physicochemical aPRBind proteome based on accessibility propensity NucBind substructure comparison properties and evolutionary Combines SVMnuc and Trained with sequence AlphaFold (TM-SITE) and sequence information of protein and structural features 2007 COACH-D profile alignment (S-SITE) sequences RNABindR Uses Naive Bayes classifier

Figure 1. An overview of important works related to binding site and binding preference prediction. Considering structural information is crucial for interaction prediction, and the protein structure prediction methods have made significant progress in recent years, which can even approximate the experimental performance on some proteins, this field will also embrace great advancement in the upcoming years.

(2019); Licatalosi et al.(2020); Corley et al.(2020) summa- tion, we refer to the first one as the binding sites prediction rize the related biological experimental techniques to study and the second one as the binding preference prediction. We the interactions as well as the biological insights and mech- summarize the paradigms in Figure2. anism behind the interactions. Our work, which unifies two intrinsically related computational problems and highlights 2.1. Binding sites prediction the importance of structural information, can provide new insights into the topic. Table1 summarizes the main focuses This problem is related to the first problem that people want of different review papers. to know when investigating the protein and RNA interaction. Given a protein, we first want to know whether this protein This paper is organized as follows. In Section2, we give a is an RNA-binding protein (RBP) or not. If it is not an RBP, clear description of the computational problems related to we could stop here and save the computational resources the interaction between proteins and RNAs. From Section for other proteins. If the protein is an RBP, people further 3 to Section6, we review each component of the com- want to know which amino acids on the protein sequence putational methods targeting against the above problems, can potentially interact with RNAs, which is related to the including datasets (Section3), features (Section4), models function of the protein. In other words, researchers want to (Section5), and model evaluation (Section6). In Section predict the binding sites and binding positions on the protein 7, we provide a thorough review on the challenges and op- surface for RNAs. portunities in this field. Although we emphasize on the importance of structural information to the interaction, for Usually, for this problem, people only consider the infor- the completeness of this review, we also mention the meth- mation from the protein side. The input is a protein, with ods only utilizing sequence encoding. either the sequence information or the structure information, or both. Then, researchers extract some features or define certain scoring functions with the above information. A 2. Computational problems for protein-RNA model or an alignment-based method will interaction thus be developed accordingly with an annotated database. The outputs are binary predictions, either at the protein level In this section, we are going to introduce the two kinds of or the level. Usually, the methods based on computational problems related to the interaction between structure have better performance on this problem than the proteins and RNAs in detail. As discussed in the Introduc- sequence-based methods (Miao and Westhof, 2015), as the Protein-RNA interaction prediction with deep learning

(a) Binding site prediction (b) Binding preference prediction

Alpha helix aacguccaaaggaguaccuacggc

RQTGQVARPVAERNCLGRASVR RNA Sequence Protein-seq Secondary structure

… …

Random SVM CNN SVM Alignment CNN Forest

Model Model

Binding Site (TP) Binding Preference

Non-Binding Site (TN)

() Merged Approach

Random CNN SVM Forest … 0.01%

0.98%

0.01% Protein RNA Model Binding site/preference Prediction

Figure 2. The different paradigms of studying the interactions between proteins and RNAs. a. Binding site prediction. Given the protein information, people predict which locations on the protein surface are the binding sites for RNAs. b. Binding preference prediction. For a given protein, the researchers have already determined the RNA sequences that can bind to the protein by experiments. Here, the models learn the statistical information from the input RNA sequences as the binding preference of that specific protein against RNAs. c. For studying the interaction more comprehensively, it is more desirable to consider the protein and RNA information, including both the sequence and structural information, simultaneously and predict both binding sites and binding preference. Protein-RNA interaction prediction with deep learning

Table 1. Summary and comparison of the existing reviews on the studies of protein-RNA interaction. Paper Journal Main Focus (Konig et al., 2012) Review Genetics State-of-the-art Ultraviolet (UV) crosslinking and immunoprecipitation(CLIP) (Weirauch et al., 2013) Nature Biotechnology Systematical comparison of protein’s DNA-binding specificity (Miao and Westhof, 2015) PLOS CB Comprehensive assessment on RNA-binding sites prediction from multiple web servers, datasets, and protein-nucleic acid complexes (Si et al., 2015) International Journal of Computational approaches for RNA-binding sites Molecular Sciences and RNA-binding proteins (RBPs) prediction (Li et al., 2017b) Briefings in Functional Integrating RNA–protein interaction data with observations of post-transcriptional regulation (Yan et al., 2016) Briefing in RNA- or DNA-binding residues from protein (BIB) sequences (Jones, 2016) Biophysical Reviews 3D structural of protein–RNA complexes at atomic resolution (Lewis et al., 2017) Nature Reviews Molecular The coupling of RNA modifications and structures Biology shapes RNA–protein interactions at different steps of the process (Lee and Ule, 2018) Molecular Cell Rationale for each step in CLIP protocol and discuss the impact of variations technologies (Nithin et al., 2018) Computational methods for and for scoring 3D structural models of RNP complexes (Lin and Miles, 2019) Nucleic Acids Research Assessment of RNA SS and crosslinking and (NAR) immunoprecipitation(CLIP) in detail (Trabelsi et al., 2019) Bioinformatics Deep learning architectures for predicting DNA- and RNA-binding specificity (Moore and t Hoen, 2019) Journal of Biological Statistical inference and machine-learning Chemistry approaches for RNA-binding proteins (RBPs) prediction, analysis of large-scale RNA–protein interaction datasets (Ramanathan et al., 2019) Nature Methods Comparison between RNA-centric and protein-centric methods (Sagar and Xue, 2019) Protein and Peptide Letters Computational predictors for RNA-protein interaction in the aspects of data, prediction, and input features (Yan and Zhu, 2020) IEEE Access Machine learning and deep learning approaches focusing on RNA binding preference (Licatalosi et al., 2020) Wiley Interdisciplinary RNA interactions with Proteins and techniques Reviews: RNA measuring the kinetic dynamics of RNA–protein interactions in vitro (Corley et al., 2020) Molecular Cell Protein-RNA molecular interactions & Software availability (Pan et al., 2019b) Wiley Interdisciplinary Prediction of RNA–protein interaction pairs and Reviews: RNA RBP binding preference (Hafner et al., 2021) Nature Reviews Methods prospect of integrating data obtained by Primers crosslinking and immunoprecipitation(CLIP) Protein-RNA interaction prediction with deep learning local structure can determine whether the protein is accessi- Compete (Ray et al., 2009) and HT-SELEX (Roulet et al., ble for interaction with other molecules. 2002). Although their experimental techniques and proto- cols are very different, the basic principles are the same, 2.2. Binding preference prediction that is, to identify and isolate RNAs that a protein can in- teract with and then sequence those RNAs. Consequently, In this computational problem, we want to know more in- the outputs and the data from those experiments are RNA formation about the interaction from the RNA side. The sequences. As this review does not focus on the experimen- interaction involves two molecules, a protein, and an RNA. tal techniques, we refer the readers to the related reviews In Section 2.1, we have investigated it from the protein side, in case the readers are interested in them (Lin and Miles, determining which amino acids can potentially interact with 2019). RNAs. In this problem, we study which RNAs can interact with a certain protein. If we describe the problem from the In Table2, we list the related datasets. The doRiNA (Anders protein aspect, we want to know the binding preference of et al., 2012) contains 24 experiments of 21 RBPs, which the protein against RNAs. are determined by experimental protocols including PAR- CLIP (Ago/EIF2C1-4, IGF2BP1-3, PUM2, Ago2-MNase, Although we want to predict the binding preference of an ELAVL1, ELAVL1-MNase, ELAVL1A, ESWR1, FUS, RNA-binding protein, seldom would researchers include the TAF15, MOV10) and CLIP-seq (TIAL1, Ago2, ELAVL1, protein information in the prediction model. Usually, the eIF4AIII, SRSF1). The data originated from doRiNA have training data are a set of RNA sequences or RNA secondary been regarded as the benchmark dataset for cross-fold valida- structures, which are proved to interact with a protein. Then, tion in iONMF (Strazarˇ et al., 2016), DeepBind (Alipanahi a machine learning model or a statistical motif model will be et al., 2015), iDeep (Pan and Shen, 2017), iDeepS (Pan et al., constructed based on the data. The inputs of these models 2018), iDeepE (Pan and Shen, 2018a), GraphProt (Maticzka are RNA features, and the models will predict whether they et al., 2014), deepnet-rbp (Pan and Shen, 2018b), deepRAM can interact with the protein. Notice that, in these models, (Trabelsi et al., 2019). people do not use the protein information explicitly. Instead, people believe a large amount of training RNA sequences iCount utilizes iCLIP to create the dataset with 17 RBPs. can describe the target protein implicitly. However, recent iONMF (Strazarˇ et al., 2016) analyzes the data from iCount studies (Lam et al., 2019; Xie et al., 2020a) show that the and doRiNA, building a unified dataset, which has been protein information can be used directly to predict the inter- widely used in different models. AURA 2 (Dassi et al., action preference, even without the high-throughput assay 2014) collects the UTR sequences of 67 RBPs with 502,178 data. binding sites from CLIP-seq. Within the dataset, the num- ber of binding sites for different RBPs is variant. iDeepE 3. Datasets for building the models regards 2000 as the cutoff, filtering 20 RBPs with less than 2000 positive sequences and constructing the dataset of After defining the computational problems, we need to pre- RBP-47. However, the RBP-47 only provides the positive pare the related data, which are the foundation for build- UTRs sequence. For constructing the negative sample, the ing computational models to resolve the above problems. UTR dataset randomly selects the UTRs not interacting The data can be divided into two categories, either the pro- with their RBPs. It is different from the strategy of doRiNA, tein/RNA sequence data or the structure data. In this section, which generates the negative samples by selecting random we give an overview of the data and the related databases. sites excluding positive binding sites in the same gene. In- We also summarize the things in Table2. tuitively, the doRiNA’s tactics would be more rational and have a lower possibility of false-negative samples. The same 3.1. Sequence datasets CLIP-seq experiments detect regions as the binding sites of a gene and the other regions as unbinding sites, which The protein sequences are usually used for predicting the means that experiments verify the negative samples. binding sites, while the RNA sequences are used for pre- dicting the binding preference. The techniques to sequence CLIPdb (Yang et al., 2015) is another database of various proteins are very mature, and the resulted data are stored high-resolution binding sites for RBPs, constructed from in UniProt1, which is one of the most famous databases in published CLIP-seq data. It contains manually curated an- bioinformatics. notations from all CLIP-seq studies across different model organisms with 395 CLIP-seq samples for 11 RBPs. In The techniques to investigate the proteins’ binding prefer- addition, CLIPdb also provides genome-wide binding sites ence against RNAs include the in vivo RIP-seq (Keene et al., for each dataset, which are identified using a unified anal- 2006) and CLIP-seq (Ule et al., 2005), and the in vitro RNA- ysis procedure. The high-resolution binding site data from a large number of RBPs will benefit investigations on the 1https://www.uniprot.org Protein-RNA interaction prediction with deep learning coordination and competition of RBP binding, which has annotations of RNA-binding residues are transferred among not been extensively studied. Because the binding sites of similar chains to alleviate the effect of strand truncations, RBPs are well-annotated in CLIPdb, its negative sample finally, they establish a dataset of 175 representative and setting is similar to that of RBP47. non-redundant RNA-binding protein chains. With the appearance of AlphaFold, Jumper et al.(2021) pro- 3.2. Structure datasets vides AlphaFold Protein Structure Database, which contains Protein structure: For the protein structure, the most com- 23,391 protein structures (Homo sapiens), covering 98.5% prehensive database is (PDB)2. Although of human proteome. Although it is a method of ab initio the database does not contain the structure of all the RNA- protein structure prediction, AlphaFold can already achieve binding proteins and some parts of the RNAs may not be a similar prediction accuracy and resolution as Cryo-EM on very clear, most of the existing structure datasets are ex- some proteins. The structures of RBPs that have not been tracted from structures of protein-RNA complexes from successfully resolved by experimental approaches may have PDB (Sussman et al., 1998). Generally, the criterion of the already been predicted accurately by AlphaFold. amino acid in the protein being considered as RNA-binding RNA secondary structure: Although most of the devel- in a co-crystal complex is that at least one of its backbone oped binding preference prediction methods only utilize atoms or side chains are within a certain distance from atoms the predicted secondary structure, such as RNAstructure ˚ ˚ of the RNA. Specifically, both 3.5A and 5.0A are the usual (Reuter and Mathews, 2010) or SPOT-RNA (Singh et al., threshold (Yan et al., 2016). 2019), to improve the prediction performance, the predicted NPIDB (Kirsanov et al., 2013) is a continually updated secondary structure annotation may not provide the explicit ribonucleoprotein(RNP) database server hosting PDB struc- RNA structure and in vivo profile. Sun et al.(2021) in- tures classified by the binding nucleotides such as RNA troduce icSHAPE (Flynn et al., 2016) to characterize the (668), DNA (1671), RNA & DNA (504). However, homolo- single- and double-stranded regions of RNAs, which is cru- gous protein structures would cause bias in modeling. cial information to protein-RNA interaction. Recently, RNA Atlas of Structure Probing (RASP) (Li et al., 2021a) col- NucleicNet (Lam et al., 2019) has defined two homologous lects transcriptome-wide RNA secondary structure probing redundancy, internal redundancy, and external redundancy. data through 18 experimental methods such as DMS-seq, The internal redundancy is that multiple copies of the same SHAPE-Seq, SHAPE-MaP, and icSHAPE, etc. RNA-binding protein chain can exist within the same PDB entry due to the formation of homo- or hetero-multimeric Intuitively, the experimental and well-annotated RNA sec- complexes. The external redundancy is that homologous ondary structure c provide precise and informative input to chains are shared across different PDB entries dedicated to modeling. For instance, bpRNA (Danaee et al., 2018) col- different binding RNA sequences. These redundant RNA- lects 102,318 known secondary structures from 7 different binding samples, sharing the homologous chains common databases, including Comparative RNA Web Site (Cannone in RNA-binding configurations and physicochemical envi- et al., 2002), tmRNA Database (Zwieb et al., 2003), Sig- ronments, would introduce bias to the evaluation and cause nal Recognition Particle Database (Rosenblad et al., 2003), the overstated generalizability power of the model. To re- Sprinzl tRNA Database, RNase P Database (Brown, 1998), move the internal redundancy, the authors retain the best RNA Family Database (Griffiths-Jones et al., 2003) and locally resolved component and discard the other homolo- PDB, and introduces a novel annotation tool to parse com- gous protein and RNA. For the external redundancy, PDB plex pseudoknot-containing RNAs with 7 annotations, such entries are clustered into groups where each entry is linked as stems, internal loops, bulges, multi-branched loops, ex- with another that shares at least one RNA-binding chain ternal loops, hairpin loops, and pseudoknots. Furthermore, with cutoff=90% BLASTClust (Earl- bpRNA offers a subset of the database with high sequence Mirowski and Rosenberg, 2007). For each cluster, the PDB similarity (90% identity), which helps the model solve the entry with the best resolution is selected, turning the 483 issue of training data replicates. valid PDB entries into 158 clusters. The authors select one representative entry for each cluster. 4. Model inputs and structure encodings RNA T dataset (Yan et al., 2016) is another benchmark The feature and representation of the protein and RNA dataset collected from PDB, which consists of 981 RNA- molecules are crucial for the downstream prediction per- ˚ binding protein chains with the distance cutoff of 3.5A(985 formance. In this section, we summarize the commonly ˚ for 5A). After the authors clustering protein chains with used encodings of protein and RNA features, including both respect to their sequence and structural similarities, where sequence encoding and structure encoding. We also use 2https://www.rcsb.org Figure3 and Table3 as a summary. Protein-RNA interaction prediction with deep learning

Table 2. Accessible datasets for studying the interaction between proteins and RNAs.

Type Dataset Name Samples Availability Benchmark Methods

iONMF(Strazarˇ et al., 2016) DeepBind(Alipanahi et al., 2015) https://dorina.mdc- iDeep(Pan and Shen, 2017) doRiNA 67 RBPs (Anders et al., 2012) berlin.de/ iDeepS(Pan et al., 2018) iDeepE(Pan and Shen, 2018a) GraphProt(Maticzka et al., 2014) deepnet-rbp(Pan and Shen, 2018b) deepRAM(Trabelsi et al., 2019) Sequence Dataset iONMF(Strazarˇ et al., 2016) https://icount.readthedo- iDeepS(Pan et al., 2018) iCount 21 RBPs cs.io/en/latest/cite.html iDeepE(Pan and Shen, 2018a) deepRAM(Trabelsi et al., 2019)

RNAcommender AURA 2 158 RBPs http://aura.science.unitn.it/ (Corrado et al., 2016) (Dassi et al., 2014) iDeepE(Pan and Shen, 2018a)

395 CLIP-seq CLIPdb http://clipdb.ncrnalab.org/ deepnet-rbp(Pan and Shen, 2018b) (Yang et al., 2015) 111 RBPs

NucleicNet (Lam et al., 2019) Protein Data Bank 179,206 protein Protein https://www.rcsb.org aPRBind (Liu et al., 2020) (PDB) structures Structure GraphBind (Xia et al., 2021a) Dataset

8140 protein https://npidb.belozers- NPIDB NucleicNet (Lam et al., 2019) structures ky.msu.ru/

23,391 predicted structures(Homo AlphaFold DB sapiens), all the https://alphafold.ebi.ac.uk/ - UniRef90 proteins (over 100 million)

102,318 secondary RNA bpRNA http://bprna.cgrb.oregonstate.edu/ - Secondary structures Structure Dataset RASP - http://rasp.zhanglab.net - Protein-RNA interaction prediction with deep learning

4.1. RNA sequence encodings which is used to characterize the single- and double-stranded regions of RNAs (Spitale et al., 2015). The raw data One-hot encoding: The RNA sequence can be encoded of icSHAPE can be processed by the bioinformatic tool, into a 4×L matrix, of which columns correspond to the icSHAPE-pipe (Li et al., 2020a). In brief, raw reads are presence of A, C, G, U and N (padding, if necessary) (Xia first collapsed to delete PCR duplicates, and the adapters et al., 2019). Given an RNA sequence s = (s , s , s ...s ) 1 2 3 n are trimmed. Next, the clean reads are mapped to the human with n nucleotides, and the one-hot encoding matrix M for genome using STAR with the default parameters. Then, the sequence is: icSHAPE scores can be calculated using icSHAPE-pipe,  resulting in a 1×L matrix with the value ranging from 0 to  0.25 if si = N, Mi,j = 1 if si = Dj, (1) 1.  0 otherwise, Tertiary structure: JAR3D (Roll et al., 2016) can be used where i is the index of nucleotides; Dj is an ordered to extracted probable tertiary structural motifs from the list of [A, C, G, U]. For the padding sequences, the 4 RNA 3D Motif Atlas (R3DMA) (Petrov et al., 2013), which nucleotides are assumed to be equally distributed and contains 253 representative hairpin loop motifs and 276 [0.25, 0.25, 0.25, 0.25] is for the padding nucleotide N in representative internal loop motifs, once given the corre- the one-hot matrix. sponding RNA base sequence and secondary structural in- formation. For encoding RNA tertiary structure, the target k-mer embedding: The RNA sequence is split into overlap- RNA sequence is first predicted into the probable secondary ping k-mers (Kazan et al., 2010) of length k using a sliding structure using RNAshapes. Then, all the hairpin and inter- window with stride s. People will count the frequency of nal loops that overlap the viewpoint region would be fed to each k-mer and use the frequency as the feature. If neces- JAR3D to calculate the probabilities of folding into the pre- sary, each k-mer in the obtained sequence is mapped into defined corresponding tertiary structural motifs. Thus, RNA a d-dimensional vector space using the word2vec (Church, tertiary structure can be encoded into a binary vector of 529 2017) algorithm as the additional feature. The word2vec dimensions, corresponding to 253 hairpin loop motifs and method is an unsupervised learning algorithm that maps 276 internal loop motifs in the R3MDA. k-mers from the vocabulary to vectors of real numbers in a low-dimensional space. The embedding representation of 4.3. Protein sequence encoding k-mers is computed in such a way that their context is pre- served, i.e., word2vec produces similar embedding vectors One-hot encoding: The protein sequence can be encoded for k-mers that tend to co-occur or similar. Generally, the into a 20xL matrix, of which columns correspond to the k-mer representation is more informative than one-hot en- presence of 20 standard amino acids such as A, R, N, D.... coding because the word2vec algorithm provides contextual The encoding process is similar to that of RNA. information by learning the statistical information of k-mer PSSM: Position-specific scoring matrix (PSSM) (Ahmad co-occurrence relationships in the input sequences. and Sarai, 2005) introduces evolutionary information into the RNA binding site prediction, which quantifies conser- 4.2. RNA structure encoding vation of residues, as the binding residues are shown to be RNA secondary structure: RNA secondary structure of- conserved in the sequence. The encoding can be conducted fers the local and geometric patterns in two approaches by PSI-BLAST (Altschul et al., 1997), where the query depending on whether there is an available protein and RNA sequence is aligned through the NCBI non-redundant (nr) structure in the PDB. If the structure is available, the explicit sequence database, with sequence profile represented in the secondary structure can be calculated by using an assign- matrix of 20×L. Each value in the matrix represents the ment approach, such as RNAstructure (Reuter and Math- frequency of a specific amino acid at a particular position in ews, 2010). If the structure is unavailable, the predicted the multiple sequence alignment (Li et al., 2018; Zou et al., secondary structure can be obtained by using a secondary 2019). structure prediction algorithm, such as SPOT-RNA (Singh et al., 2019), RNAshapes (Steffen et al., 2006), RNAstruc- 4.4. Protein structure encoding ture (Reuter and Mathews, 2010), and E2Efold (Chen et al., Local structure: Individual local structural information 2020). For the RNA secondary structure stored in bpRNA included secondary structure(SS) (Hiller et al., 2006), inter- (Danaee et al., 2018), bpseq file reveals the con- face propensity (IP) (Li et al., 2012), accessible surface area nection of the RNA. (ASA) (Heffernan et al., 2017) and electrostatic patches In vivo structure profile: RNA in vivo structure profile (EP) (Stawiski et al., 2003). The secondary structure re- is produced by in vivo click selective 2’-hydroxyl acyla- veals primary structural information, which has 3/8-class tion and profiling experiment (icSHAPE) (Sun et al., 2021), labeling systems. Dictionary of Secondary structure of Pro- Protein-RNA interaction prediction with deep learning tein (DSSP) (Kabsch and Sander, 1983) assigns eight sec- representing the compound structure. Proteins with the sec- ondary structure states to amino acids, including 310-helix ondary structure information could be encoded in a similar G, alpha-helix H, pi-helix I, beta- bridge B, beta-strand E, way. beta-turn T, and coil C. SPIDER3 (Heffernan et al., 2017) converts the 8-class assignment into the 3-class assignment, 5. Computational models where Helix H is composed of G, H, and I; Beta strand B is composed of B and E, Coil C is composed of T and C. After encoding the proteins and RNAs, we need to build and Li et al.(2012) introduces interface propensity, the residue- train a model to perform the interaction prediction. We di- nucleotide propensities with secondary structure informa- vide the methods into two categories, either template-based tion of proteins and RNAs. The propensity of a specific and shallow-learning methods or deep learning methods, residue-nucleotide pair is calculated from its observed prob- which will be introduced in detail in this section. Table ability at interfaces divided by its expected probability. The 3 also summarizes the models of different representative interface propensity of a residue type with a particular class works. of secondary structures is represented as an average value of its pairwise propensities for the four kinds of nucleotides. 5.1. Template-based and shallow-learning models Accessible surface area is widely used for RNA binding site prediction, which can be calculated by NACCESS (Ding The template-based approach, which is similar to homol- and Arnold, 2006) when the protein structure is available ogy modeling, is applied for the binding site prediction in PDB. For the protein absent in the PDB, there are sev- with known homologous structures. The models, such as eral predictive methods, such as ASAquick, (Faraggi et al., DBD-Threader (Gao and Skolnick, 2009), and SPOT-Seq 2014) to predict ASA. Electrostatic patches can describe the (Singh et al., 2019), can directly adopt the known knowledge protein surface charge status, which is an important factor in without feature extraction and mainly rely on the protein RNA-binding. Generally, RNA binding interfaces are more structure alignment process. As for the protein without likely to be positively charged, and the electrostatic fea- known homologous structure, the approach is incapable of ture can be calculated by PatchFinderPlus (Shazman et al., solving this situation because the template-based approach 2007). tries to copy specific sites from few homologous cases. On the other hand, the shallow-learning methods attempt to gen- For the comprehensive local structural information, atom eralize common rules learned from the known experience features within concentric shells or grid boxes are intro- of a dataset (Yan et al., 2016). Because of the capacity to duced to describe the physicochemical environment in a process big data and good interpretability, shallow-learning specific physical space, which can be calculated by FEA- approaches such as support vector machine (SVM), random TURE (Halperin et al., 2008a) or AutoDock (Forli et al., forest, logistic regression, decision tree, and na¨ıve Bayes are 2016). In FEATURE, 80 physicochemical properties (e.g., widely used in RNA binding sites and binding preference negative/positive charges, hydrophobicity, solvent accessi- prediction. Although shallow learning methods are very bility, etc.) on atoms of the protein with 7.5A˚ of a grid powerful in terms of interpolation, the prediction of extrap- point in a radial distribution are divided into six concentric olation can not be guaranteed since the predefined feature shells of spheres, resulting in a 6×80 matrix. AutoDock uti- limits the module learning from the raw data. The prede- lizes an atom-channel (carbon-, oxygen-, nitrogen-, sulfur-) fined feature provides a fixed explicit insight of the learning framework to define a local 20A˚ cubical box to state the module. However, with the increasing amount of data, the presence of carbon, oxygen, sulfur, and nitrogen atoms in a feature extraction procedure can be flexible and learned by corresponding atom type channel, divided into 1A˚ cubical the model, so-called deep learning, yielding higher perfor- voxel, resulting in a 4×20×20×20 tensor. MaSIF (Gainza mance of the binding site and binding preference prediction, et al., 2020) emphasizes the significance of the protein sur- especially for complex protein. It will be introduced in the face, and presents a method to encode geometric features following section in detail. On the other hand, several works, (shape index and distance-dependent curvature) and chemi- including RNABindRPlus (Walia et al., 2014), and RBRDe- cal features (hydropathy, continuum electrostatics and free tector (Yang et al., 2014) are attempting to incorporate both electrons/protons) on the surface with the geodesic radius template-based and shallow-learning approaches to improve of 9A˚ or 12A,˚ resulting in a 1×80 matrix. the performance. Global structure: Global structural information is rarely used in RNA binding site prediction since the interaction is 5.2. Deep learning models regarded as a local recognition problem. However, global The existing methods emphasize the importance of sequence structural information may play an important role in iden- information. DeepBind (Alipanahi et al., 2015) is the first tifying RBP in future applications. Ishiguro et al.(2019) deep learning approach for RNA binding preference pre- introduces supernodes to connect other nodes in the graph diction, which employs a single layer of convolution and Protein-RNA interaction prediction with deep learning

Protein Sequence Feature PVRQTGQVARPASVRS Protein Comprehensive Local Structure Feature

Pos. 1 2 3 L

TYHMCQFHCRYVNNH ⋯ Oxygen Hydrophobic SQEKLYECNERSKAF A 0 0.1 0 0 0 SCYECNQCGKAFAQP R 0 0 0.83 0.01 0 … 0.86 0.9 0 0.02 0.85 Q 0.01 0.76 0.04 0.82 0.02 Electron Donor Sulfur Sequence Position-Specific Scoring Matrix (PSSM) Knob

Protein Individual Local Structure Feature Carbon MaSIF Feature

helix sheet coli

𝛼𝛼 − 𝛽𝛽 − Nitrogen 80 Biochemical Descriptors

Atom Count, Residue Type… 0, 1, 0.01, 1,… 0, 2, 0.03, -1 … 1, 1, 0.01, 0 … 6 Concentric Shells 1, 3, 0.00, -2 … Secondary Accessible 0, 1, 0.05, 1 Hydrophobicity Electrostatic Structure Surface Area 3D Structure in Patch Voxel FEATURE

RNA Sequence Feature RNA Structure Feature Other Prior Knowledge

auc Stem Stem Pos. 1 2 L vivo A 0.025 0.029 0.012 AACGUCAGGAG ⋯ UAACGGCAACG aucg… U 0.775 0.029 0.015 Hairpin Loop Hairpin Loop vitro … UCAGGAGUAAC C 0.123 0.012 0.015 … G 0.078 0.930 0.958 ucg Pseudoknot Pseudoknot Position Weight In Vivo/Vitro Secondary Contact Tertiary Molecular Molecular Sequence k-mer Matrix Structure Structure Map Structure Docking Dynamics (PWM) Profile

Figure 3. Summary of features from proteins and RNAs, as well as prior knowledge, that can be used to study the interaction between the two molecules. demonstrates the accuracy of CNNs as well as their abil- 2016) utilizes a multimodal deep learning framework, which ity to detect the recapitulate known motifs, taking only the systematically integrates RNA primary sequences, predicted RNA sequences as inputs and identifying the preference of secondary structures using RNAshapes, and tertiary struc- RNA-binding proteins. Based on DeepBind, DeeperBind tural features extracted by JAR3D based on R3DMA. As for (Hassanzadeh and Wang, 2016) introduces the long short- RNA binding preference prediction, DLPRB (Ben-Bassat term memory layers (LSTM) into the DeepBind architecture et al., 2018) also takes the advantage of the predicted sec- to learn the long-range dependencies between the sequence ondary structures to explore RNA structural contexts. The features extracted by the CNN layers. iDeepS (Pan et al., PrismNet (Sun et al., 2021) considers that there are a large 2018) also integrates CNN, and RNN layers since both of number of structurally variable sites across the cell lines. them are helpful for performance, and extra RNA structural Consequently, icSHAPE (Li et al., 2020a) is introduced in motifs are combined into the model. iDeepE (Pan and Shen, PrismNet to describe the in vivo structural profile with 1×L 2018a) considers the local and global sequence information matrix (see 4.2.2). The PrismNet encodes the sequence with for CNNs, and demonstrates that multiple overlapping fixed- the one-hot encoding and extra in vivo structure scores as length subsequences (similar to k-mer) provide informative the fifth dimension and applies a squeeze-and-excitation features for the binding preference prediction. DeepRAM module to adaptively calibrate convolutional channels for (Trabelsi et al., 2019) comprehensively evaluates the model learning channel-wise attention and residual blocks, captur- based on CNNs, RNNs, and hybrid CNN/RNN architec- ing the joint sequence-and-structural determinants of RBP tures, finding that the hybrid architectures outperform the binding. former two methods. Besides, DeepCLIP (Grønning et al., Besides, the protein local structural environment of the 2020) also employs 1D convolution layers and Bidirectional binding sites is also crucial to the RNA-protein interaction. LSTM to capture the mutation profile of protein-RNA bind- Torng and Altman(2019) applies 3DCNNs to protein struc- ing preference. However, single input of sequence limits ture information, generated by AutoDock or FEATURE, the model capacity to capture the authentic mechanism of and demonstrates the comparable performance of the RNA- RNA-protein interaction. protein interaction binding site prediction method. Further- With the developing insight of RNA-protein interaction, more, NucleicNet (Lam et al., 2019) considers the RNA- RNA structural information is discovered to exerts an impor- binding issue from the perspective of three-dimensional tant role in the mechanism. Thus, deepnet-rbp (Zhang et al., protein structure, which is extracted in units of residues. For Protein-RNA interaction prediction with deep learning predicting RNA-binding properties at various locations on the RNA-binding motif dataset, CISBP-RNA (8056 records a protein’s surface, the FEATURE (Halperin et al., 2008b) of RBP binding motifs) (Ray et al., 2013). Besides, the framework is used to encode physicochemical properties RNA binding motifs with particular secondary structures, on the grid point of protein surfaces. For each grid point, including stems, multiloops, hairpins, internal loops, and a high-dimensional feature vector for six concentric shells dangling, are prone to access the surface of RBPs. Thus, the of spheres with 80 physicochemical properties for each structural information extracted from the model can explain shell will be generated. Furthermore, the NucleicNet pre- their binding tendency. dictor, based on ResNet, uses the hierarchical classification of residue sites, first for binding or not, if affirmative, the 6.3. In vitro and in vivo experimental validation possible type of RNA constituent binding to the location. RNAcompete assay (RNAC) (Ray et al., 2017) is a large- To efficiently capture such structural information of RNA scale in vitro experiment that uses the epitope-tagged RBP and protein local environment, people have applied Graph to competitively select RNA sequences from a designed Neural Networks (GNN) to extract the comprehensive fea- pool. In NucleicNet (Lam et al., 2019), the authors obtain tures. RPI-Net (Yan et al., 2020) employs an end-to-end 7-mer RNA-binding profiles summarized as a Z-score for learning approach with GNN from the sequences and struc- the individual RNA sequence. The RBPs with both available tures of RNAs, which provide dense information for binding RNAC data and PDB structure, such as PABPC1, PCBP2, site prediction. For the protein structural context, Graph- PTBP1, RBFOX1, SNRPA, SRSF2, TARDBP, and U2AF2, Bind (Xia et al., 2021b) applies a hierarchical graph neural are tested. The results suggest that NucleicNet is capable of network (HGNN) to learn the latent patterns of structural differentiating between the top and bottom ten sequences and physicochemical characteristics for binding residue indicated by RNAC Z-scores. Thus, RNAC is suitable to recognition. evaluate the model performance. In vivo experimental val- idation in PrismNet (Sun et al., 2021) is to distinguish the 6. Model evaluation relevant affinity of the given RBPs, such as SND1 with specific conformation (hairpin) or single-stranded confor- After building the model, the last step is to evaluate the per- mation. With different melting-and-folding treatments to formance of the model so that to make the users understand perturb RNA structure without altering the sequence, the the usefulness and weak points of the propose methods. In authors can obtain two conformations of the given RNA, the this section, we summarize the commonly used evaluation one refolding into the hairpin structure and the other retain- criteria in this field. ing single-stranded conformation. PrismNet predicts that a double-stranded binding site for SND1, which is consistent 6.1. Cross-fold and cross-dataset validation with the in vivo affinity experiment. Cross-fold (3-, 5-, 10-fold) validation is usually used to evaluate the performance of models with metrics of the 7. Challenges and opportunities area under the receiver operating characteristic (AUROC) We may encounter several challenges when modeling the and F1 score. For the 10-fold cross-validation, the dataset interaction between proteins and RNAs. In terms of the would be divided into ten folds, and for each time, nine inputs to the models, we need to think of how to encode folds of them are used for training while the left one is structural information more efficiently and even considering for testing. One problem is that many works are evaluated the dynamic structural information. Regarding the model, using data within a specific protein category, indicating that we should design novel deep learning models, which can the models only learn protein-specific features instead of process multi-modality data effectively, including the in- general binding features, which limits the application of the formation from proteins and RNAs, as well as our prior models. To assess the generalizability of the model, people knowledge. Furthermore, people also care about the model should also use cross-dataset validation, which means that interpretability, that is, what leads the model to make a general models should be established and evaluated with specific prediction. Revisiting the protein-RNA interaction protein data from different categories and different sources problem and advancement in the related fields, we may want (Trabelsi et al., 2019). to resolve some more sophisticated but appealing tasks. For instance, because of the recent breakthrough in the protein 6.2. Structure visualization structure prediction field, it becomes increasingly possible The specific patterns inferred from these models can be to perform high-resolution Ab initio protein-RNA interac- visualized as the sequence logo diagrams (Weblogo (Crooks tion prediction with only the protein sequence information. et al., 2004)) for the RBP. Generally, these patterns can Finally, based on the predicted interaction results, people be regarded as the RNA motifs, which can be mapped to are also eager to design specific molecules with high bind- Protein-RNA interaction prediction with deep learning

Table 3. Summary and comparison of the representative works for studying the protein-RNA interaction. A more comprehensive list is in the Appendix. Feature Paper Year Prediction Model Sequence Encoding Format Structural Information Fully- (Ahmad et al., Sequence composition, sequence 2004 Binding Site connected Vector 2004) neighbourhood, SA1 NN (Hiller et al., Binding 2006 PWM Single Strand Motif Finding RNA SS2 2006) Preference (Kazan et al., Binding 2010 PWM Motif model Structure annotation profiles 2010) Preference (Yang et al., 2013 Binding Site Alignment Structure alignment TM-SITE, S-SITE, PSSM 2013) (Maticzka Binding 2014 SVM Graph-kernel Sequence and SS et al., 2014) Preference Sequence, evolutionary (Li et al., 2014 Binding Site ANN Vector conservation, surface deformations, 2014) relative SA, side chain contributions (Alipanahi Binding 2015 CNN One-hot encoding Sequence et al., 2015) Preference (Orenstein Binding K-mer, 2016 K-mer RNA SS et al., 2016) Preference PWM (Zhang et al., Binding Multimodal Replicated softmax, 2016 Sequence, SS, tertiary Structure 2016) Preference DBNs R3DMA (Yan and AA3 type, putative intrinsic disorder, 2017 Binding Site HMM Feature Kurgan, 2017) SS, SA, PSSM (Jimenez´ Pharmacophoric properties, voxel 2017 Binding Site 3D CNN 3D Voxel et al., 2017) occupancies (Pan et al., Binding 2018 CNN+LSTM One hot encoding Sequence and SS 2018) Preference (Wu et al., AA sequence, I-TASSER Suite, 2018 Binding Site Docking Structure modeling 2018) consensus predictions, ligand Global (Pan and Binding 2018 and local One hot encoding Sequence Shen, 2018a) Preference CNN (Deng et al., Binding 2019 CNN+RNN One hot encoding Sequence and SS 2019) Preference (Torng and The presence of carbon, oxygen, 2019 Binding Site 3D CNN Voxel, feature Altman, 2019) sulfur and nitrogen atoms Binding Site (Lam et al., 2019 Binding CNN Feature Structure 2019) Preference (Jolma et al., Binding 2020 SVM K-mer Sequence, Structure 2020) Preference Pseudo-positions, atomic features (Xia et al., 2021 Binding Site GNN Graph, feature Vector of residues, SS, evolutionary 2021a) conversation profiles 1 Solvent accessibility 2 Secondary structure 3 Amino acid Protein-RNA interaction prediction with deep learning ing affinity against the target molecule. In this section, we Despite the specific encoding we may use from the machine discuss the challenges and the potential opportunities in this learning field, we still need to consider the chemical back- field in detail. ground of the problem. The structures in the atom-scale are different from the 3D objects in real life. Although 7.1. Structure encodings we may use rigid bodies to approximate and model them, they are not rigid bodies. The physicochemical properties As discussed above, structural information is critical to pre- (Lam et al., 2019) should be considered when we design the dicting the protein-RNA interaction accurately. However, methods. how to encode the structural information efficiently remains to be an open question. Because deep learning models are 7.2. Dynamic structure information also useful to perform feature selection, when encoding the structural information, we should try to preserve as much Another fundamental property of biomolecules that most raw information as possible, especially the spatial informa- machine learning methods fail to consider is their dynamics. tion. As we know, biomolecules are not static, rigid bodies. Every part of the molecule is continuously moving and oscillating Regarding the protein structure, some traditional ways of in high frequency. The apo protein structures would not stay encoding, such as 3/8-class protein secondary structure, lose in the state with the lowest energy all the time. Instead, they too much raw information. FEATURE (Lam et al., 2019), may change from one sub-optimal state to another from defining shells around a location in the 3D space and sum- time to time. When it comes to the interaction between marizing the physicochemical properties within each shell, two molecules, such as the interaction between proteins and is another popular method. However, using such an encod- RNAs, the situation will be even more complex. For exam- ing, we cannot differentiate the properties within each shell. ple, some molecules, such as Argonaute, need to undergo In the machine learning field, people usually use 3D voxels, substantial conformation change to bind to RNA sequences. point clouds, and polygon mesh to represent 3D objects. 3D The other proteins may also have conformation changes voxel encoding is similar to the 2D pixel. And it was shown once incorporating RNAs. This phenomenon leads to two to be better than FEATURE in predicting the functional difficulties when we model the protein-RNA interaction. domain of proteins (Torng and Altman, 2019). However, Firstly, the structure database that we rely on is not per- because we extend the representation to another axes, we fect for providing the structural information that we need. need to design a more efficient algorithm for handling the Simply removing the RNA structure from the protein-RNA increasing dimension. Polygon mesh representation collects complex may not reveal the actual protein apo structure. vertices, edges, and faces to define the surface of the protein Secondly, failing to model molecule dynamics may lead structure. The combination of such a representation and to the performance degradation of the machine learning geodesical CNN is shown to extract the fingerprint of the method when we apply the method to real-life problems. To protein surface, which can be used to predict the interaction resolve the above challenges, we should use both the PDB between different molecules (Gainza et al., 2020). Point structures and the information from molecular dynamic sim- cloud methods sample points from the 3D object, using the ulation (MD simulation). In practice, we may consider the coordinates of those points to represent the structure of the state of a molecule at each time point as a screenshot. The object. Although it has not been widely applied in this field, entire protein dynamics trajectory can be considered as a it has shown great power in the computer vision field for 3D video. Deep learning techniques to process videos, such object classification and segmentation. as multi-instance learning, would be helpful to resolve this In terms of the RNA structure, people usually use the sec- challenge. ondary structure profile to encode them, indicating whether each base is single-strand or double-strand. However, this 7.3. Incorporating prior knowledge encoding loses too much information. For example, we would not know which base forms the with In addition to the data, researchers have accumulated exper- the other specific base. Recently, researchers have shown tise and prior knowledge about this problem. For example, that predicting the RNA secondary structure by predicting we know that Aquifex aeolicus Ribonuclease III (Aa-RNase the contact map matrix can boost the performance signif- III) is most likely bind with double-stranded RNAs. Incor- icantly (Chen et al., 2020). A similar idea can be applied porating such knowledge into the machine learning model to the protein-RNA interaction prediction. Meanwhile, us- can further boost the model’s prediction performance and ing the graph to represent the RNA secondary structure is usefulness. There are multiple ways to achieve that. We can another natural approach (Yan et al., 2020). However, we manipulate the data prepared for training the model by up- need to specify which information we want to extract from sampling the class favored by the prior knowledge. When the graph. we train the model, such knowledge could be incorporated into the model implicitly. But we should handle the data Protein-RNA interaction prediction with deep learning carefully to avoid overfitting. On the other hand, we may people were still struggling with the performance before the design a specific machine learning model that explicitly in- appearance of AlphaFold2 (Jumper et al., 2021). However, corporates prior knowledge. For example, by embedding with the fast performance improvement, it is foreseeable constraint optimization as a module into the deep learn- that the demand for an explanation of the model will soon in- ing model (Chen et al., 2020), we can reduce the data size crease. The model explanation techniques from the machine requirement for training a deep learning model. learning field can be used to identify which input features influence the final prediction. However, such an explana- 7.4. Using information from both RNA and protein tion is too trivial for this field. Building the connection between the feature and the biological insight would be a In the previous studies, when predicting the binding sites more interesting problem, requiring more effort from the on the protein surface, people usually only use the informa- researchers. tion from the protein. On the other hand, researchers often only use the RNA information when modeling the protein’s 7.6. High-resolution prediction binding preference to the RNA sequences. Because the in- teraction is related to both molecules, it is more desirable When predicting the binding sites on the protein surface, to consider both when modeling the process. However, as researchers usually annotate at the amino acid level. Regard- protein and RNA are different molecules, it is not reason- ing the binding preference against the RNA sequences, the able to use just one deep learning model to process them. resolution is usually until the nucleotide. From the struc- Instead, we should use multi-modality models. Essentially, tural aspect, the above prediction resolution is still too low. for each molecule, we have a deep learning module to ex- In reality, when studying the interaction between proteins tract features from it. Then, the features can be combined and RNAs, we want to know the exact binding pocket and to perform the final prediction. In practice, we may pre- even the binding location on the protein and RNA surface. train each module separately first and then fine-tune all the With such information, we can understand the functional modules together in an end-to-end fashion. By considering mechanisms of those important proteins, such as Ago and the two molecules simultaneously, we do not have to train CRISPR-associated proteins. Some recent works are trying a model for each protein, and we are more likely to obtain to increase the resolution of the prediction (Lam et al., 2019; one general model, which deciphers the principle behind Gainza et al., 2020). More works can be done to improve protein-RNA interaction. the existing methods further. For example, although Lam et al.(2019) generates grid points on the protein surface and 7.5. Model interpretability for structural modeling predicts at the grid point level, which increases the predic- tion resolution significantly on the protein side, the authors It is always difficult to explain deep learning models. For have not considered the information from the RNA side at the bio-molecular sequence analysis, after the investigation all. Consequently, the method is unable to determine the in the past few years, people have proposed a number of sequence and orientation of the binding RNA precisely. In- methods to explain the prediction of deep learning models troducing features from the RNA structure should increase (Umarov et al., 2019; Li et al., 2019; 2021b). Such explana- the prediction resolution for the RNA, although the entire tions converge with the motif discovery techniques before framework needs to be redesigned. As discussed in the the surge of deep learning. However, for the prediction previous sections, with more advanced structural encod- at the structure level, the explanation is much more diffi- ing techniques and frameworks considering both protein cult. In the structure field, we encounter a serious dilemma and RNA information, the prediction resolution would be between explanation and performance, no matter utilizing increased significantly in the near future. deep learning or not. For example, those methods with a strong physicochemical foundation and carefully designed 7.7. Ab initio prediction force fields usually have inferior performance compared with the machine learning-based methods. Before the wide Currently, when predicting the interaction between proteins usage of deep learning in this field, threading and similarity- and RNAs with structural information, people usually as- based methods are also often used. Although such methods sume that we have already known the protein structure. cannot handle queries without homologs, researchers know However, in reality, determining the protein and RNA struc- when they will work and when they will not. However, after ture is not a trivial task. Even if we can determine the deep learning methods are applied to this field, people will structure of molecules in nature by biological experiments, use them by default because of their superior performance, it is almost impossible to resolve the structure of molecules although researchers cannot explain what physicochemical with mutations, which is important for drug discovery and and structural biology knowledge are used by the model to development. Under that circumstance, it is desirable that perform the prediction. Currently, the request for model in- we can predict everything from the protein and RNA se- terpretability in the structure field is not very urgent because quences, which is referred as Ab initio prediction here. With Protein-RNA interaction prediction with deep learning the sequences, we may first predict the 3D structures of pro- sequence information and auxiliary structural information teins and RNAs. Then, based on the predicted structures, we to perform the prediction. Considering the recent progress will further predict their interactions. Although this research of AlphaFold and the tremendous amount of structure data paradigm seems to be computational daunting and may accu- produced by it (Tunyasuvunakool et al., 2021), the study of mulate errors in the multiple steps, it becomes increasingly the RBP-RNA interactions will be promoted significantly appealing with the rapid development of the protein struc- by deep learning methods (Lam et al., 2019; Li et al., 2020b) ture prediction algorithms in recent years. For example, operating directly on the structural data. AlphaFold2 (Jumper et al., 2021) can already achieve a similar prediction accuracy and resolution as Cryo-EM on References some proteins. Eventually, we can use one end-to-end deep learning model to address the two steps all at once. If we M. T. Weirauch, A. Cote, R. Norel, M. Annala, Y. Zhao, could predict the structural interaction details only using the T. R. Riley, J. Saez-Rodriguez, T. Cokelaer, A. Vedenko, sequence information, gene regulation and drug discovery S. Talukder, Dream Consortium, H. J. Bussemaker, Q. D. investigation will be accelerated significantly. Morris, M. L. Bulyk, G. Stolovitzky, and T. R. Hughes. Evaluation of methods for modeling transcription factor 7.8. From prediction to design sequence specificity. Nat Biotechnol, 31(2):126–34, 2013. ISSN 1546-1696 (Electronic) 1087-0156 (Linking). doi: After determining the molecular structure, we want to know 10.1038/nbt.2486. URL https://www.ncbi.nlm. the molecular function, that is, how a specific molecule nih.gov/pubmed/23354101. can interact with another. However, only investigating their function is not our ultimate goal. Eventually, we want to B. Alipanahi, A. Delong, M. T. Weirauch, and B. J. Frey. design particular molecules with desirable functions so that Predicting the sequence specificities of - and - to resolve the problems that we encounter in real life, such binding proteins by deep learning. Nat Biotechnol, 33(8): as curing diseases. As the performance of prediction models 831–8, 2015. ISSN 1546-1696 (Electronic) 1087-0156 has been improved significantly in recent years, researchers (Linking). doi: 10.1038/nbt.3300. URL https://www. are increasingly interested in designing. For instance, peo- ncbi.nlm.nih.gov/pubmed/26213851. ple have been using deep learning to optimize the CRISPR guide RNA design (Chuai et al., 2018; Wang et al., 2019b). J. Yan, S. Friedrich, and L. Kurgan. A comprehensive Deep learning has also shown its power in designing new comparative review of sequence-based predictors of dna- antimicrobial peptide (Das et al., 2021). Regarding this spe- and rna-binding residues. Brief Bioinform, 17(1):88–105, cific topic of protein-RNA interaction, people are especially 2016. ISSN 1477-4054 (Electronic) 1467-5463 (Linking). interested in designing RNA sequences with high binding doi: 10.1093/bib/bbv023. URL https://www.ncbi. affinity to protein, similar to the CRISPR guide RNA design- nlm.nih.gov/pubmed/25935161. ing mentioned above. Moreover, a suitable guide RNA for Muthukumar Ramanathan, Douglas F. Porter, and Paul A. Ago can also increase the gene knock-down efficiency (Lam Khavari. Methods to study rna–protein interactions. Na- et al., 2019). In addition to the commonly used generative ture Methods, 16(3):225–234, 2019. ISSN 1548-7091. models, such as GAN and VAE, recently, differentiable al- doi: 10.1038/s41592-019-0330-1PMID-30804549. gorithms (Chen et al., 2020) and energy models (Song et al., 2020) have drawn great attention in the machine learning M. Corley, M. C. Burns, and G. W. Yeo. How rna-binding field, which is potentially useful for designing problems in proteins interact with rna: Molecules and mechanisms. the protein-RNA interaction field. Molecular Cell, 78(1):9–29, 2020. ISSN 1097-2765. doi: 10.1016/j.molcel.2020.03.011. URL :// 8. Conclusion WOS:000523321100004. The interactions between different molecules are essential C. Lin and W. O. Miles. Beyond clip: advances and op- for biological processes in our body. Among them, the portunities to measure rbp-rna and rna-rna interactions. RBP-RNA interactions are of great interest to researchers, Nucleic Acids Res, 47(11):5490–5501, 2019. ISSN 1362- considering their central role in gene expression regulation 4962 (Electronic) 0305-1048 (Linking). doi: 10.1093/ (Weirauch et al., 2013; Dai et al., 2017). People have de- nar/gkz295. URL https://www.ncbi.nlm.nih. veloped a number of computational tools and methods to gov/pubmed/31076772. facilitate the study of the RBP-RNA interaction, usually pre- dicting the binding sites and binding preference. However, W. Yi, J. Li, X. Zhu, X. Wang, L. Fan, W. Sun, L. Liao, as we discussed in detail in the review, due to the limitation J. Zhang, X. Li, J. Ye, F. Chen, J. Taipale, K. M. of the previous data, researchers usually only consider the Chan, L. Zhang, and J. Yan. Crispr-assisted detec- tion of rna-protein interactions in living cells. Nat Protein-RNA interaction prediction with deep learning

Methods, 17(7):685–688, 2020. ISSN 1548-7105 E. L. Van Nostrand, P. Freese, G. A. Pratt, X. Wang, (Electronic) 1548-7091 (Linking). doi: 10.1038/ X. Wei, R. Xiao, S. M. Blue, J. Y. Chen, N. A. L. Cody, s41592-020-0866-0. URL https://www.ncbi. D. Dominguez, S. Olson, B. Sundararaman, L. Zhan, nlm.nih.gov/pubmed/32572232. C. Bazile, L. P. B. Bouvrette, J. Bergalet, M. O. Duff, K. E. Garcia, C. Gelboin-Burkhart, M. Hochman, N. J. Lam- G. L. Conn, D. E. Draper, E. E. Lattman, and A. G. Git- bert, H. Li, M. P. McGurk, T. B. Nguyen, T. Palden, I. Ra- tis. Crystal structure of a conserved ribosomal protein- bano, S. Sathe, R. Stanton, A. Su, R. Wang, B. A. Yee, rna complex. Science, 284(5417):1171–4, 1999. ISSN B. Zhou, A. L. Louie, S. Aigner, X. D. Fu, E. Lecuyer, 0036-8075 (Print) 0036-8075 (Linking). doi: 10.1126/ C. B. Burge, B. R. Graveley, and G. W. Yeo. A large- science.284.5417.1171. URL https://www.ncbi. scale binding and functional map of human rna-binding nlm.nih.gov/pubmed/10325228. proteins. Nature, 583(7818):711–719, 2020. ISSN Jack D Keene, Jordan M Komisarow, and Matthew B 1476-4687 (Electronic) 0028-0836 (Linking). doi: 10. Friedersdorf. Rip-chip: the isolation and identification of 1038/s41586-020-2077-3. URL https://www.ncbi. mrnas, and protein components of ribonucleo- nlm.nih.gov/pubmed/32728246. protein complexes from cell extracts. Nature protocols, 1 J. H. Lam, Y. Li, L. Zhu, R. Umarov, H. Jiang, A. Heliou, (1):302–307, 2006. F. K. Sheong, T. Liu, Y. Long, Y. Li, L. Fang, R. B. Jernej Ule, Kirk Jensen, Aldo Mele, and Robert B Darnell. Altman, W. Chen, X. Huang, and X. Gao. A deep learn- Clip: a method for identifying protein–rna interaction ing framework to predict binding preference of rna con- sites in living cells. Methods, 37(4):376–386, 2005. stituents on protein surface. Nat Commun, 10(1):4941, 2019. ISSN 2041-1723 (Electronic) 2041-1723 (Linking). D. Ray, H. Kazan, E. T. Chan, L. Pena Castillo, S. Chaudhry, doi: 10.1038/s41467-019-12920-0. URL https:// S. Talukder, B. J. Blencowe, Q. Morris, and T. R. Hughes. www.ncbi.nlm.nih.gov/pubmed/31666519. Rapid and systematic analysis of the rna recognition speci- ficities of rna-binding proteins. Nat Biotechnol, 27(7): M. W. Hentze, A. Castello, T. Schwarzl, and T. Preiss. A 667–70, 2009. ISSN 1546-1696 (Electronic) 1087-0156 brave new world of rna-binding proteins. Nat Rev Mol (Linking). doi: 10.1038/nbt.1550. URL https://www. Cell Biol, 19(5):327–341, 2018. ISSN 1471-0080 (Elec- ncbi.nlm.nih.gov/pubmed/19561594. tronic) 1471-0072 (Linking). doi: 10.1038/nrm.2017. 130. URL https://www.ncbi.nlm.nih.gov/ E. Roulet, S. Busso, A. A. Camargo, A. J. Simpson, pubmed/29339797. N. Mermod, and P. Bucher. High-throughput selex sage method for quantitative modeling of transcription- Z. Miao and E. Westhof. A large-scale assessment of factor binding sites. Nat Biotechnol, 20(8):831–5, 2002. nucleic acids binding site prediction programs. PLoS ISSN 1087-0156 (Print) 1087-0156 (Linking). doi: Comput Biol, 11(12):e1004639, 2015. ISSN 1553- 10.1038/nbt718. URL https://www.ncbi.nlm. 7358 (Electronic) 1553-734X (Linking). doi: 10.1371/ nih.gov/pubmed/12101405. journal.pcbi.1004639. URL https://www.ncbi. nlm.nih.gov/pubmed/26681179. Markus Hafner, Maria Katsantoni, Tino Koster,¨ James Marks, Joyita Mukherjee, Dorothee Staiger, Jernej Ule, A. Trabelsi, M. Chaabane, and A. Ben-Hur. Compre- and Mihaela Zavolan. Clip and complementary methods. hensive evaluation of deep learning architectures for Nature Reviews Methods Primers, 1(1):20, 2021. ISSN prediction of dna/rna sequence binding specificities. 2662-8449. doi: 10.1038/s43586-021-00018-1. Bioinformatics, 35(14):i269–i277, 2019. ISSN 1367- 4811 (Electronic) 1367-4803 (Linking). doi: 10.1093/ L. Sun, K. Xu, W. Z. Huang, Y. C. T. Yang, P. Li, L. Tang, bioinformatics/btz339. URL https://www.ncbi. T. L. Xiong, and Q. F. C. Zhang. Predicting dynamic nlm.nih.gov/pubmed/31510640. cellular protein-rna interactions by deep learning using in vivo rna structures. Cell Research, pages 1–22, 2021. Dongwon Lee, David U Gorkin, Maggie Baker, Benjamin J ISSN 1001-0602. doi: 10.1038/s41422-021-00476-y. Strober, Alessandro L Asoni, Andrew S McCallion, and URL ://WOS:000620873100002. Michael A Beer. A method to predict the impact of regulatory variants from dna sequence. Nature genetics, D. Bernstein, B. Hook, A. Hajarnavis, L. Opperman, and 47(8):955, 2015. M. Wickens. Binding specificity and mrna targets of a c. elegans puf protein, fbf-1. RNA, 11(4):447–58, 2005. Xiaoyong Pan and Hong-Bin Shen. Predicting rna–protein ISSN 1355-8382 (Print) 1355-8382 (Linking). doi: 10. binding sites and motifs through combining local and 1261/rna.7255805. URL https://www.ncbi.nlm. global deep convolutional neural networks. Bioinformat- nih.gov/pubmed/15769874. ics, 34(20):3427–3436, 2018a. Protein-RNA interaction prediction with deep learning

Shuya Li, Fanghong Dong, Yuexin Wu, Sai Zhang, Chen Jinfang Zheng, Petras J Kundrotas, Ilya A Vakser, and Shiy- Zhang, Xiao Liu, Tao Jiang, and Jianyang Zeng. A ong Liu. Template-based modeling of protein-rna inter- deep boosting based approach for capturing the sequence actions. PLoS computational biology, 12(9):e1005120, binding preferences of rna-binding proteins from high- 2016. throughput clip-seq data. Nucleic acids research, 45(14): e129–e129, 2017a. Andrew W Senior, Richard Evans, John Jumper, James Kirk- patrick, Laurent Sifre, Tim Green, Chongli Qin, Augustin E. Jeong, I. F. Chung, and S. Miyano. A neural network Zˇ´ıdek, Alexander WR Nelson, Alex Bridgland, et al. Im- method for identification of rna-interacting residues in proved protein structure prediction using potentials from protein. Genome Inform, 15(1):105–16, 2004. ISSN 0919- deep learning. Nature, 577(7792):706–710, 2020. 9454 (Print) 0919-9454 (Linking). URL https:// www.ncbi.nlm.nih.gov/pubmed/15712114. A. Jolma, J. Zhang, E. Mondragon, E. Morgunova, T. Kiv- ioja, K. U. Laverty, Y. Yin, F. Zhu, G. Bourenkov, Q. Mor- J. R. Yan and M. Zhu. A review about rna-protein- ris, T. R. Hughes, 3rd Maher, L. J., and J. Taipale. Bind- binding sites prediction based on deep learning. Ieee ing specificities of human rna-binding proteins toward Access, 8:150929–150944, 2020. ISSN 2169-3536. doi: structured and linear rna sequences. Genome Res, 30(7): 10.1109/Access.2020.3014996. URL :// 962–973, 2020. ISSN 1549-5469 (Electronic) 1088-9051 WOS:000564191300001. (Linking). doi: 10.1101/gr.258848.119. URL https:// www.ncbi.nlm.nih.gov/pubmed/32703884. A. Sagar and B. Xue. Recent advances in machine learning based prediction of rna-protein interactions. D. Maticzka, S. J. Lange, F. Costa, and R. Backofen. Graph- Protein Pept Lett, 26(8):601–619, 2019. ISSN 1875- prot: modeling binding preferences of rna-binding pro- 5305 (Electronic) 0929-8665 (Linking). doi: 10.2174/ teins. Genome Biol, 15(1):R17, 2014. ISSN 1474- 0929866526666190619103853. URL https://www. 760X (Electronic) 1474-7596 (Linking). doi: 10.1186/ ncbi.nlm.nih.gov/pubmed/31215361. gb-2014-15-1-r17. URL https://www.ncbi.nlm. nih.gov/pubmed/24451197. X. Pan, Y. Yang, C. Q. Xia, A. H. Mirza, and H. B. Shen. Recent methodology progress of deep learning Xiaoli Zhang and Shiyong Liu. Rbppred: predicting rna- for rna-protein interaction prediction. Wiley Interdis- binding proteins from sequence using svm. Bioinformat- cip Rev RNA, 10(6):e1544, 2019a. ISSN 1757-7012 ics, 33(6):854–862, 2017. (Electronic) 1757-7004 (Linking). doi: 10.1002/wrna. 1544. URL https://www.ncbi.nlm.nih.gov/ Hong Su, Mengchen Liu, Saisai Sun, Zhenling Peng, and pubmed/31067608. Jianyi Yang. Improving the prediction of protein–nucleic acids binding residues via multiple sequence profiles and Jianyi Yang, Ambrish Roy, and Yang Zhang. Protein–ligand the consensus of complementary methods. Bioinformat- binding site recognition using complementary binding- ics, 35(6):930–936, 2019a. specific substructure comparison and sequence profile alignment. Bioinformatics, 29(20):2588–2595, 2013. Hilal Kazan, Debashish Ray, Esther T Chan, Timothy R Hughes, and Quaid Morris. Rnacontext: a new method Yao Chi Chen, Karen Sargsyan, Jon D Wright, Yi-Shuian for learning the sequence and structure binding prefer- Huang, and Carmay Lim. Identifying rna-binding ences of rna-binding proteins. PLoS Comput Biol, 6(7): residues based on evolutionary conserved structural and e1000832, 2010. energetic features. Nucleic acids research, 42(3):e15–e15, 2014. Y. Orenstein, Y. Wang, and B. Berger. Rck: accu- rate and efficient inference of sequence- and structure- Qi Wu, Zhenling Peng, Yang Zhang, and Jianyi Yang. based protein-rna binding models from rnacompete data. Coach-d: improved protein–ligand binding sites predic- Bioinformatics, 32(12):i351–i359, 2016. ISSN 1367- tion with refined ligand-binding poses through molecular 4811 (Electronic) 1367-4803 (Linking). doi: 10.1093/ docking. Nucleic acids research, 46(W1):W438–W442, bioinformatics/btw259. URL https://www.ncbi. 2018. nlm.nih.gov/pubmed/27307637.

Juan Xie, Jinfang Zheng, Xu Hong, Xiaoxue Tong, and Michael Hiller, Rainer Pudimat, Anke Busch, and Rolf Shiyong Liu. Prime-3d2d is a 3d2d model to predict Backofen. Using rna secondary structures to guide se- binding sites of protein–rna interaction. Communications quence motif finding towards single-stranded regions. Nu- biology, 3(1):1–10, 2020a. cleic acids research, 34(17):e117–e117, 2006. Protein-RNA interaction prediction with deep learning

J. Yan and L. Kurgan. Drnapred, fast sequence-based Michael Birkerod Hansen, Anne-Mette Hartung, Jan method that accurately predicts and discriminates dna- Baumbach, and Brage Storstein Andresen. Deepclip: and rna-binding residues. Nucleic Acids Res, 45(10):e84, predicting the effect of mutations on protein–rna bind- 2017. ISSN 1362-4962 (Electronic) 0305-1048 (Linking). ing with deep learning. Nucleic acids research, 48(13): doi: 10.1093/nar/gkx059. URL https://www.ncbi. 7099–7118, 2020. nlm.nih.gov/pubmed/28132027. S. Zhang, J. Zhou, H. Hu, H. Gong, L. Chen, C. Cheng, S. Li, F. Dong, Y. Wu, S. Zhang, C. Zhang, X. Liu, and J. Zeng. A deep learning framework for mod- T. Jiang, and J. Zeng. A deep boosting based approach eling structural features of rna-binding protein targets. for capturing the sequence binding preferences of rna- Nucleic Acids Res, 44(4):e32, 2016. ISSN 1362- binding proteins from high-throughput clip-seq data. 4962 (Electronic) 0305-1048 (Linking). doi: 10.1093/ Nucleic Acids Res, 45(14):e129, 2017b. ISSN 1362- nar/gkv1025. URL https://www.ncbi.nlm.nih. 4962 (Electronic) 0305-1048 (Linking). doi: 10.1093/ gov/pubmed/26467480. nar/gkx492. URL https://www.ncbi.nlm.nih. gov/pubmed/28575488. Ka Man Yip, Niels Fischer, Elham Paknia, Ashwin Chari, and Holger Stark. Atomic-resolution protein structure M. Sun, X. Wang, C. Zou, Z. He, W. Liu, and H. Li. determination by cryo-em. Nature, 587(7832):157–161, Accurate prediction of rna-binding protein residues 2020. with two discriminative structural descriptors. BMC Bioinformatics, 17(1):231, 2016. ISSN 1471-2105 Debora S Marks, Lucy J Colwell, Robert Sheridan, (Electronic) 1471-2105 (Linking). doi: 10.1186/ Thomas A Hopf, Andrea Pagnani, Riccardo Zecchina, s12859-016-1110-x. URL https://www.ncbi. and . Protein 3d structure computed from nlm.nih.gov/pubmed/27266516. evolutionary sequence variation. PloS one, 6(12):e28766, 2011. Yu Li, Chao Huang, Lizhong Ding, Zhongxiao Li, Yijie Pan, and Xin Gao. Deep learning in bioinformatics: In- Sheng Wang, Siqi Sun, Zhen Li, Renyu Zhang, and Jinbo Xu. troduction, application, and perspective in the big data Accurate de novo prediction of protein contact map by era. Methods, 166:4–21, 2019. ultra-deep learning model. PLoS computational biology, 13(1):e1005324, 2017a. Y. Liu, W. Gong, Y. Zhao, X. Deng, S. Zhang, and C. Li. aprbind: protein-rna interface prediction by Sheng Wang, Shiyang Fei, Zongan Wang, Yu Li, Jinbo Xu, combining sequence and i-tasser model-based structural Feng Zhao, and Xin Gao. Predmp: a web server for de features learned with convolutional neural networks. novo prediction and visualization of membrane proteins. Bioinformatics, 2020. ISSN 1367-4811 (Electronic) Bioinformatics, 35(4):691–693, 2019a. 1367-4803 (Linking). doi: 10.1093/bioinformatics/ John Jumper, Richard Evans, Alexander Pritzel, Tim btaa747. URL https://www.ncbi.nlm.nih. Green, Michael Figurnov, Olaf Ronneberger, Kathryn gov/pubmed/32821925. Tunyasuvunakool, Russ Bates, Augustin Zˇ´ıdek, Anna Yao Chi Chen and Carmay Lim. Predicting rna-binding Potapenko, Alex Bridgland, Clemens Meyer, Simon A. A. sites from the protein structure based on electrostatics, Kohl, Andrew J. Ballard, Andrew Cowie, Bernardino and geometry. Nucleic acids research, 36(5): Romera-Paredes, Stanislav Nikolov, Rishub Jain, Jonas e29, 2008. Adler, Trevor Back, Stig Petersen, David Reiman, Ellen Clancy, Michal Zielinski, Martin Steinegger, Michalina Wen Torng and Russ B Altman. High precision protein Pacholska, Tamas Berghammer, Sebastian Bodenstein, functional site detection using 3d convolutional neural David Silver, Oriol Vinyals, Andrew W. Senior, Ko- networks. Bioinformatics, 35(9):1503–1512, 2019. ray Kavukcuoglu, Pushmeet Kohli, and Demis Hass- Y. Xia, C. Q. Xia, X. Pan, and H. B. Shen. Graphbind: pro- abis. Highly accurate protein structure prediction with tein structural context embedded rules learned by hierar- . Nature, pages 1–11, 2021. chical graph neural networks for recognizing nucleic-acid- Minkyung Baek, Frank DiMaio, Ivan Anishchenko, Justas binding residues. Nucleic Acids Res, pages gkab044–, Dauparas, Sergey Ovchinnikov, Gyu Rie Lee, Jue 2021a. ISSN 1362-4962 (Electronic) 0305-1048 (Link- Wang, Qian Cong, Lisa N. Kinch, R. Dustin Schaeffer, ing). doi: 10.1093/nar/gkab044. URL https://www. Claudia Millan,´ Hahnbeom Park, Carson Adams, ncbi.nlm.nih.gov/pubmed/33577689. Caleb R. Glassman, Andy DeGiovanni, Jose H. Pereira, Alexander Gulliver Bjørnholt Grønning, Thomas Koed Andria V. Rodrigues, Alberdina A. van Dijk, Ana C. Doktor, Simon Jonas Larsen, Ulrika Simone Spangs- Ebrecht, Diederik J. Opperman, Theo Sagmeister, berg Petersen, Lise Lolle Holm, Gitte Hoffmann Bruun, Christoph Buhlheller, Tea Pavkov-Keller, Manoj K. Protein-RNA interaction prediction with deep learning

Rathinaswamy, Udit Dalwadi, Calvin K. Yip, John E. C. J. Lewis, T. Pan, and A. Kalsotra. Rna modifications Burke, K. Christopher Garcia, Nick V. Grishin, Paul D. and structures cooperate to guide rna-protein interac- Adams, Randy J. Read, and David Baker. Accurate tions. Nat Rev Mol Cell Biol, 18(3):202–210, 2017. prediction of protein structures and interactions using ISSN 1471-0080 (Electronic) 1471-0072 (Linking). doi: a three-track neural network. Science, 2021. ISSN 10.1038/nrm.2016.163. URL https://www.ncbi. 0036-8075. doi: 10.1126/science.abj8754. URL https: nlm.nih.gov/pubmed/28144031. //science.sciencemag.org/content/ F. C. Y. Lee and J. Ule. Advances in clip technolo- early/2021/07/19/science.abj8754. gies for studies of protein-rna interactions. Mol Cell, Kathryn Tunyasuvunakool, Jonas Adler, Zachary Wu, 69(3):354–369, 2018. ISSN 1097-4164 (Electronic) Tim Green, Michal Zielinski, Augustin Zˇ´ıdek, Alex 1097-2765 (Linking). doi: 10.1016/j.molcel.2018.01. Bridgland, Andrew Cowie, Clemens Meyer, Agata Lay- 005. URL https://www.ncbi.nlm.nih.gov/ don, Sameer Velankar, Gerard J. Kleywegt, Alex Bate- pubmed/29395060. man, Richard Evans, Alexander Pritzel, Michael Fig- C. Nithin, P. Ghosh, and J. M. Bujnicki. Bioinformatics tools urnov, Olaf Ronneberger, Russ Bates, Simon A. A. and benchmarks for computational docking and 3d struc- Kohl, Anna Potapenko, Andrew J. Ballard, Bernardino ture prediction of rna-protein complexes. Genes (Basel), 9 Romera-Paredes, Stanislav Nikolov, Rishub Jain, Ellen (9):432, 2018. ISSN 2073-4425 (Print) 2073-4425 (Link- Clancy, David Reiman, Stig Petersen, Andrew W. Se- ing). doi: 10.3390/genes9090432. URL https://www. nior, Koray Kavukcuoglu, , Pushmeet ncbi.nlm.nih.gov/pubmed/30149645. Kohli, John Jumper, and Demis Hassabis. Highly ac- curate protein structure prediction for the human pro- K. S. Moore and P. A. C. t Hoen. Computational approaches teome. Nature, 2021. ISSN 1476-4687. doi: 10.1038/ for the analysis of rna-protein interactions: A primer for s41586-021-03828-1. URL https://doi.org/10. biologists. J Biol Chem, 294(1):1–9, 2019. ISSN 1083- 1038/s41586-021-03828-1. 351X (Electronic) 0021-9258 (Linking). doi: 10.1074/jbc. REV118.004842. URL https://www.ncbi.nlm. Xiaoyong Pan, Yang Yang, Chun-Qiu Xia, Aashiq H Mirza, nih.gov/pubmed/30455357. and Hong-Bin Shen. Recent methodology progress of deep learning for rna–protein interaction prediction. Wiley Gerd Anders, Sebastian D Mackowiak, Marvin Jens, Interdisciplinary Reviews: RNA, 10(6):e1544, 2019b. Jonas Maaskola, Andreas Kuntzagk, Nikolaus Rajewsky, Markus Landthaler, and Christoph Dieterich. dorina: a J. Si, J. Cui, J. Cheng, and R. Wu. Computational database of rna interactions in post-transcriptional regula- prediction of rna-binding proteins and binding sites. tion. Nucleic acids research, 40(D1):D180–D186, 2012. Int J Mol Sci, 16(11):26303–17, 2015. ISSN 1422- 0067 (Electronic) 1422-0067 (Linking). doi: 10.3390/ Martin Strazar,ˇ Marinka Zitnik,ˇ Blazˇ Zupan, Jernej Ule, ijms161125952. URL https://www.ncbi.nlm. and Tomazˇ Curk. Orthogonal matrix factorization en- nih.gov/pubmed/26540053. ables integrative analysis of multiple rna binding proteins. Bioinformatics, 32(10):1527–1535, 2016. D. D. Licatalosi, X. Ye, and E. Jankowsky. Approaches for measuring the dynamics of rna-protein interactions. Wi- X. Pan and H. B. Shen. Rna-protein binding mo- ley Interdiscip Rev RNA, 11(1):e1565, 2020. ISSN 1757- tifs mining with a new hybrid deep learning based 7012 (Electronic) 1757-7004 (Linking). doi: 10.1002/ cross-domain knowledge integration approach. BMC wrna.1565. URL https://www.ncbi.nlm.nih. Bioinformatics, 18(1):136, 2017. ISSN 1471-2105 gov/pubmed/31429211. (Electronic) 1471-2105 (Linking). doi: 10.1186/ s12859-017-1561-8. URL https://www.ncbi. J. Konig, K. Zarnack, N. M. Luscombe, and J. Ule. nlm.nih.gov/pubmed/28245811. Protein-rna interactions: new genomic technologies and perspectives. Nat Rev Genet, 13(2):77–83, 2012. X. Y. Pan, P. Rijnbeek, J. C. Yan, and H. B. Shen. Pre- ISSN 1471-0064 (Electronic) 1471-0056 (Linking). doi: diction of rna-protein sequence and structure binding 10.1038/nrg3141. URL https://www.ncbi.nlm. preferences using deep convolutional and recurrent neu- nih.gov/pubmed/22251872. ral networks. Bmc Genomics, 19(1):511, 2018. ISSN 1471-2164. doi: ARTN51110.1186/s12864-018-4889-1. S. Jones. Protein-rna interactions: structural biol- URL ://WOS:000437514200003. ogy and computational modeling techniques. Bio- phys Rev, 8(4):359–367, 2016. ISSN 1867- Xiaoyong Pan and Hong-Bin Shen. Learning distributed 2450 (Print) 1867-2450 (Linking). doi: 10.1007/ representations of rna sequences and its application for s12551-016-0223-9. URL https://www.ncbi. predicting rna-protein binding sites with a convolutional nlm.nih.gov/pubmed/28510023. neural network. Neurocomputing, 305:51–58, 2018b. Protein-RNA interaction prediction with deep learning

Erik Dassi, Angela Re, Sara Leo, Toma Tebaldi, Luigi Jamie J Cannone, Sankar Subramanian, Murray N Schnare, Pasini, Daniele Peroni, and Alessandro Quattrone. Aura 2: James R Collett, Lisa M D’Souza, Yushi Du, Brian Feng, Empowering discovery of post-transcriptional networks. Nan Lin, Lakshmi V Madabusi, Kirsten M Muller,¨ et al. Translation, 2(1):e27738, 2014. The comparative rna web (crw) site: an online database of comparative sequence and structure information for Yu-Cheng T Yang, Chao Di, Boqin Hu, Meifeng Zhou, ribosomal, , and other rnas. BMC bioinformatics, 3 Yifang Liu, Nanxi Song, Yang Li, Jumpei Umetsu, and (1):1–31, 2002. Zhi John Lu. Clipdb: a clip-seq database for protein-rna interactions. BMC genomics, 16(1):1–8, 2015. Christian Zwieb, Jan Gorodkin, Bjarne Knudsen, Jody Burks, and Jacek Wower. tmrdb (tmrna database). Nu- Joel L Sussman, Dawei Lin, Jiansheng Jiang, Nancy O cleic acids research, 31(1):446–447, 2003. Manning, Jaime Prilusky, Otto Ritter, and Enrique E Abola. Protein data bank (pdb): database of three- Magnus Alm Rosenblad, Jan Gorodkin, Bjarne Knudsen, dimensional structural information of biological macro- Christian Zwieb, and Tore Samuelsson. Srpdb: Signal molecules. Acta Crystallographica Section D: Biological recognition particle database. Nucleic acids research, 31 Crystallography, 54(6):1078–1084, 1998. (1):363–364, 2003.

D. D. Kirsanov, O. N. Zanegina, E. A. Aksianov, S. A. James W Brown. The ribonuclease p database. Nucleic Spirin, A. S. Karyagina, and A. V. Alexeevski. Npidb: acids research, 26(1):351–352, 1998. Nucleic acid-protein interaction database. Nucleic Acids Res, 41(Database issue):D517–23, 2013. ISSN 1362- Sam Griffiths-Jones, , Mhairi Marshall, Ajay 4962 (Electronic) 0305-1048 (Linking). doi: 10.1093/ Khanna, and Sean R Eddy. Rfam: an rna family database. nar/gks1199. URL https://www.ncbi.nlm.nih. Nucleic acids research, 31(1):439–441, 2003. gov/pubmed/23193292. Gianluca Corrado, Toma Tebaldi, Fabrizio Costa, Paolo Frasconi, and Andrea Passerini. Rnacommender: Virginia Earl-Mirowski and Michael Rosenberg. Compare genome-wide recommendation of rna–protein interac- and contrast the effects of using less stringent criteria in tions. Bioinformatics, 32(23):3627–3634, 2016. blastclust to a novel iterative method for identifying gene families. 2007. Zhihao Xia, Yu Li, Bin Zhang, Zhongxiao Li, Yuhui Hu, Wei Chen, and Xin Gao. Deerect-polya: a robust and generic Jessica S Reuter and David H Mathews. Rnastructure: soft- deep learning method for pas identification. Bioinformat- ware for rna secondary structure prediction and analysis. ics, 35(14):2371–2379, 2019. BMC bioinformatics, 11(1):1–9, 2010. Kenneth Ward Church. Word2vec. Natural Language Engi- Jaswinder Singh, Jack Hanson, Kuldip Paliwal, and Yaoqi neering, 23(1):155–162, 2017. Zhou. Rna secondary structure prediction using an ensem- ble of two-dimensional deep neural networks and transfer Peter Steffen, Bjorn¨ Voß, Marc Rehmsmeier, Jens Reeder, learning. Nature communications, 10(1):1–13, 2019. and Robert Giegerich. Rnashapes: an integrated rna anal- ysis package based on abstract shapes. Bioinformatics, Ryan A Flynn, Qiangfeng Cliff Zhang, Robert C Spitale, 22(4):500–503, 2006. Byron Lee, Maxwell R Mumbach, and Howard Y Chang. Transcriptome-wide interrogation of rna secondary struc- Xinshi Chen, Yu Li, Ramzan Umarov, Xin Gao, and ture in living cells with icshape. Nature protocols, 11(2): Le Song. Rna secondary structure prediction by learning 273–290, 2016. unrolled algorithms. arXiv preprint arXiv:2002.05810, 2020. Pan Li, Xiaolin Zhou, Kui Xu, and Qiangfeng Cliff Zhang. Rasp: an atlas of transcriptome-wide rna secondary struc- Robert C Spitale, Ryan A Flynn, Qiangfeng Cliff Zhang, ture probing data. Nucleic Acids Research, 49(D1):D183– Pete Crisalli, Byron Lee, Jong-Wha Jung, Hannes Y D191, 2021a. Kuchelmeister, Pedro J Batista, Eduardo A Torre, Eric T Kool, et al. Structural imprints in vivo decode rna regula- Padideh Danaee, Mason Rouches, Michelle Wiley, Dezhong tory mechanisms. Nature, 519(7544):486–490, 2015. Deng, Liang Huang, and David Hendrix. bprna: large- scale automated annotation and analysis of rna secondary Pan Li, Ruoyao Shi, and Qiangfeng Cliff Zhang. icshape- structure. Nucleic acids research, 46(11):5381–5394, pipe: A comprehensive toolkit for icshape data analysis 2018. and evaluation. Methods, 178:96–103, 2020a. Protein-RNA interaction prediction with deep learning

James Roll, Craig L Zirbel, Blake Sweeney, Anton I Petrov, Shula Shazman, Gershon Celniker, Omer Haber, Fabian and Neocles Leontis. Jar3d webserver: Scoring and align- Glaser, and Yael Mandel-Gutfreund. Patch finder plus ing rna loop sequences to known 3d motifs. Nucleic acids (pfplus): a web server for extracting and displaying pos- research, 44(W1):W320–W327, 2016. itive electrostatic patches on protein surfaces. Nucleic acids research, 35(suppl 2):W526–W530, 2007. Anton I Petrov, Craig L Zirbel, and Neocles B Leontis. Automated classification of rna 3d motifs and the rna 3d Inbal Halperin, Dariya S Glazer, Shirley Wu, and Russ B motif atlas. Rna, 19(10):1327–1340, 2013. Altman. The feature framework for protein function anno- tation: modeling new functions, improving performance, Shandar Ahmad and Akinori Sarai. Pssm-based prediction and extending to novel applications. BMC genomics, 9 of dna binding sites in proteins. BMC bioinformatics, 6 (2):1–14, 2008a. (1):1–6, 2005. Stephen F Altschul, Thomas L Madden, Alejandro A Stefano Forli, Ruth Huey, Michael E Pique, Michel F Sanner, Schaffer,¨ Jinghui Zhang, Zheng Zhang, Webb Miller, and David S Goodsell, and Arthur J Olson. Computational David J Lipman. Gapped and psi-blast: a new gener- protein–ligand docking and virtual drug screening with ation of protein database search programs. Nucleic acids the autodock suite. Nature protocols, 11(5):905–919, research, 25(17):3389–3402, 1997. 2016. Yu Li, Sheng Wang, Ramzan Umarov, Bingqing Xie, Ming Pablo Gainza, Freyr Sverrisson, Frederico Monti, Emanuele Fan, Lihua Li, and Xin Gao. Deepre: sequence-based Rodola, D Boscaini, MM Bronstein, and BE Correia. De- enzyme ec number prediction by deep learning. Bioinfor- ciphering interaction fingerprints from protein molecular matics, 34(5):760–769, 2018. surfaces using geometric deep learning. Nature Methods, 17(2):184–192, 2020. Zhenzhen Zou, Shuye Tian, Xin Gao, and Yu Li. mldeepre: Multi-functional enzyme function prediction with hierar- Katsuhiko Ishiguro, Shin-ichi Maeda, and Masanori chical multi-label deep learning. Frontiers in genetics, 9: Koyama. Graph warp module: an auxiliary module for 714, 2019. boosting the power of graph neural networks in molecular graph analysis. arXiv preprint arXiv:1902.01020, 2019. Chun Hua Li, Li Bin Cao, Ji Guo Su, Yong Xiao Yang, and Cun Xin Wang. A new residue-nucleotide propensity po- Mu Gao and Jeffrey Skolnick. A threading-based method for tential with structural information considered for discrim- the prediction of dna-binding proteins with application to inating protein-rna docking decoys. Proteins: Structure, the . PLoS computational biology, 5(11): Function, and Bioinformatics, 80(1):14–24, 2012. e1000567, 2009.

Rhys Heffernan, Yuedong Yang, Kuldip Paliwal, and Yaoqi Rasna R Walia, Li C Xue, Katherine Wilkins, Yasser El- Zhou. Capturing non-local interactions by long short- Manzalawy, Drena Dobbs, and Vasant Honavar. Rnabindr- term memory bidirectional recurrent neural networks for plus: a predictor that combines machine learning and improving prediction of protein secondary structure, back- sequence homology-based methods to improve the relia- bone angles, contact numbers and solvent accessibility. bility of predicted rna-binding residues in proteins. PloS Bioinformatics, 33(18):2842–2849, 2017. one, 9(5):e97725, 2014.

Eric W Stawiski, Lydia M Gregoret, and Yael Mandel- Xiao-Xia Yang, Zhi-Luo Deng, and Rong Liu. Rbrde- Gutfreund. Annotating nucleic acid-binding function tector: Improved prediction of binding residues on rna- based on protein structure. Journal of molecular biology, binding protein structures using complementary feature- 326(4):1065–1079, 2003. and template-based strategies. Proteins: Structure, Func- Wolfgang Kabsch and Christian Sander. Dictionary of pro- tion, and Bioinformatics, 82(10):2455–2471, 2014. tein secondary structure: pattern recognition of hydrogen- Hamid Reza Hassanzadeh and May D Wang. Deeperbind: bonded and geometrical features. : Original Enhancing prediction of sequence specificities of dna Research on Biomolecules, 22(12):2577–2637, 1983. binding proteins. In 2016 IEEE International Conference J Ding and E Arnold. Naccess. 2006. on Bioinformatics and Biomedicine (BIBM), pages 178– 183. IEEE, 2016. Eshel Faraggi, Yaoqi Zhou, and Andrzej Kloczkowski. Ac- curate single-sequence prediction of solvent accessible I. Ben-Bassat, B. Chor, and Y. Orenstein. A deep neural surface area using local and global features. Proteins: network approach for learning intrinsic protein-rna bind- Structure, Function, and Bioinformatics, 82(11):3170– ing preferences. Bioinformatics, 34(17):i638–i646, 2018. 3176, 2014. ISSN 1367-4811 (Electronic) 1367-4803 (Linking). doi: Protein-RNA interaction prediction with deep learning

10.1093/bioinformatics/bty600. URL https://www. compendium of rna-binding motifs for decoding gene ncbi.nlm.nih.gov/pubmed/30423078. regulation. Nature, 499(7457):172–177, 2013.

I. Halperin, D. S. Glazer, S. Wu, and R. B. Altman. The fea- Debashish Ray, Kevin CH Ha, Kate Nie, Hong Zheng, Timo- ture framework for protein function annotation: modeling thy R Hughes, and Quaid D Morris. Rnacompete method- new functions, improving performance, and extending to ology and application to determine sequence preferences novel applications. BMC Genomics, 9 Suppl 2:S2, 2008b. of unconventional rna-binding proteins. Methods, 118: ISSN 1471-2164 (Electronic) 1471-2164 (Linking). doi: 3–15, 2017. 10.1186/1471-2164-9-S2-S2. URL https://www. ncbi.nlm.nih.gov/pubmed/18831785. Ramzan Umarov, Hiroyuki Kuwahara, Yu Li, Xin Gao, and Victor Solovyev. Promoter analysis and prediction in Z. Yan, W. L. Hamilton, and M. Blanchette. Graph the human genome using sequence-based deep learning neural representational learning of rna secondary struc- models. Bioinformatics, 35(16):2730–2737, 2019. tures for predicting rna-protein interactions. Bioin- formatics, 36(Suppl 1):i276–i284, 2020. ISSN 1367- Yu Li, Zeling Xu, Wenkai Han, Huiluo Cao, Ramzan 4811 (Electronic) 1367-4803 (Linking). doi: 10.1093/ Umarov, Aixin Yan, Ming Fan, Huan Chen, Carlos M bioinformatics/btaa456. URL https://www.ncbi. Duarte, Lihua Li, et al. Hmd-arg: hierarchical multi-task nlm.nih.gov/pubmed/32657407. deep learning for annotating antibiotic resistance genes. Microbiome, 9(1):1–12, 2021b. Ying Xia, Chun-Qiu Xia, Xiaoyong Pan, and Hong-Bin Shen. Graphbind: protein structural context embedded Guohui Chuai, Hanhui Ma, Jifang Yan, Ming Chen, Nan- rules learned by hierarchical graph neural networks for fang Hong, Dongyu Xue, Chi Zhou, Chenyu Zhu, recognizing nucleic-acid-binding residues. Nucleic Acids Ke Chen, Bin Duan, et al. Deepcrispr: optimized crispr Research, 2021b. guide rna design by deep learning. Genome biology, 19 (1):1–18, 2018. S. Ahmad, M. M. Gromiha, and A. Sarai. Analysis and pre- diction of dna-binding proteins and their binding residues Daqi Wang, Chengdong Zhang, Bei Wang, Bin Li, Qiang based on composition, sequence and structural infor- Wang, Dong Liu, Hongyan Wang, Yan Zhou, Leming Shi, mation. Bioinformatics, 20(4):477–486, 2004. ISSN Feng Lan, et al. Optimized crispr guide rna design for 1367-4803. doi: 10.1093/bioinformatics/btg432. URL two high-fidelity cas9 variants by deep learning. Nature ://WOS:000220058800006. communications, 10(1):1–14, 2019b. Songling Li, Kazuo Yamashita, Karlou Mar Amada, and Payel Das, Tom Sercu, Kahini Wadhawan, Inkit Padhi, Se- Daron M Standley. Quantifying sequence and structural bastian Gehrmann, Flaviu Cipcigan, Vijil Chenthama- features of protein–rna interactions. Nucleic acids re- rakshan, Hendrik Strobelt, Cicero Dos Santos, Pin-Yu search, 42(15):10086–10098, 2014. Chen, et al. Accelerated antimicrobial discovery via deep J Jimenez,´ S Doerr, G Mart´ınez-Rosell, A S Rose, and generative models and molecular dynamics simulations. G De Fabritiis. Deepsite: protein-binding site predictor Nature Biomedical Engineering, pages 1–11, 2021. using 3d-convolutional neural networks. Bioinformatics, Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, 33(19):3036–3042, 2017. Abhishek Kumar, Stefano Ermon, and Ben Poole. Score- L. Deng, Y. Z. Liu, Y. C. Shi, and H. Liu. A deep neu- based generative modeling through stochastic differential ral network approach using distributed representations equations. arXiv preprint arXiv:2011.13456, 2020. of rna sequence and structure for identifying binding site of rna-binding proteins. 2019 Ieee International Hanjun Dai, Ramzan Umarov, Hiroyuki Kuwahara, Yu Li, Conference on Bioinformatics and Biomedicine (Bibm), Le Song, and Xin Gao. Sequence2vec: a novel embed- 00:12–17, 2019. ISSN 2156-1125. doi: 10.1109/ ding approach for modeling transcription factor binding bibm47256.2019.8983345. URL ://WOS: affinity landscape. Bioinformatics, 33(22):3575–3583, 000555804900002. 2017. Gavin E Crooks, Gary Hon, John-Marc Chandonia, and Haoyang Li, Shuye Tian, Yu Li, Qiming Fang, Renbo Tan, Steven E Brenner. Weblogo: a sequence logo generator. Yijie Pan, Chao Huang, Ying Xu, and Xin Gao. Modern Genome research, 14(6):1188–1190, 2004. deep learning in bioinformatics. Journal of molecular cell biology, 2020b. Debashish Ray, Hilal Kazan, Kate B Cook, Matthew T Weirauch, Hamed S Najafabadi, Xiao Li, Serge Guer- Lei Wang, Zhu-Hong You, De-Shuang Huang, and Fengfeng oussov, Mihai Albu, Hong Zheng, Ally Yang, et al. A Zhou. Combining high speed elm learning with a deep Protein-RNA interaction prediction with deep learning

convolutional neural network feature encoding for pre- dicting protein-rna interactions. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 17(3): 972–980, 2017b. ISSN 1545-5963. doi: 10.1109/tcbb. 2018.2874267PMID-30296240. A. Munteanu, N. Mukherjee, and U. Ohler. Ssmart: sequence-structure motif identification for rna-binding proteins. Bioinformatics, 34(23):3990–3998, 2018. ISSN 1367-4811 (Electronic) 1367-4803 (Linking). doi: 10.1093/bioinformatics/bty404. URL https://www. ncbi.nlm.nih.gov/pubmed/29893814. Y. Su, Y. Luo, X. Zhao, Y. Liu, and J. Peng. In- tegrating thermodynamic and sequence contexts im- proves protein-rna binding prediction. PLoS Comput Biol, 15(9):e1007283, 2019b. ISSN 1553-7358 (Elec- tronic) 1553-734X (Linking). doi: 10.1371/journal.pcbi. 1007283. URL https://www.ncbi.nlm.nih. gov/pubmed/31483777.

Zhen Shen, Su-Ping Deng, and De-Shuang Huang. Cap- sule network for predicting rna-protein binding prefer- ences using hybrid feature. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 17(5):1483– 1492, 2019. ISSN 1545-5963. doi: 10.1109/tcbb.2019. 2943465PMID-31562101. S. Yang, X. Liu, and R. T. Ng. Proberating: a recom- mender system to infer binding profiles for nucleic acid- binding proteins. Bioinformatics, 36(18):4797–4804, 2020. ISSN 1367-4811 (Electronic) 1367-4803 (Linking). doi: 10.1093/bioinformatics/btaa580. URL https:// www.ncbi.nlm.nih.gov/pubmed/32573679. J. Xie, J. F. Zheng, X. Hong, X. X. Tong, and S. Y. Liu. Prime-3d2d is a 3d2d model to predict binding sites of protein-rna interaction. Communications Biology, 3(1): 384, 2020b. doi: ARTN38410.1038/s42003-020-1114-y. URL ://WOS:000552372600001. Protein-RNA interaction prediction with deep learning A. Appendix Table

Table 1: A comprehensive summary and comparison of the representative works for studying the protein-RNA interaction.

Feature Paper Year Prediction Model Sequence Encoding Format Structural Information Fully- (Ahmad et al., Sequence composition, sequence 2004 Binding Site connected Vector 2004) neighbourhood, SA1 NN (Hiller et al., Binding 2006 PWM Single Strand Motif Finding RNA SS2 2006) Preference (Kazan et al., Binding 2010 PWM Motif model Structure annotation profiles 2010) Preference (Yang et al., 2013 Binding Site Alignment Structure alignment TM-SITE, S-SITE, PSSM 2013) Sequence, evolutionary (Li et al., 2014 Binding Site ANN Vector conservation, surface deformations, 2014) relative SA, side chain contributions (Maticzka Binding 2014 SVM Graph-kernel Sequence and SS et al., 2014) Preference (Chen et al., Electrostatic and evolutionary 2014 Binding Site Threading Homologs 2014) ranking, SA residues (Alipanahi Binding 2015 CNN One-hot encoding Sequence et al., 2015) Preference (Orenstein Binding K-mer, 2016 K-mer RNA SS et al., 2016) Preference PWM Sequence and structure (Zheng et al., Binding 2016 Template alignment, SARA and Structure 2016) Preference TM-align Electrostatic feature, triplet (Sun et al., Random interface propensit, PSSM profile, 2016 Binding Site Euclidean distance 2016) Forest geometrical characteristic, physicochemical property (Zhang et al., Binding Multimodal Replicated softmax, 2016 Sequence, SS, tertiary Structure 2016) Preference DBNs R3DMA (Yan and AA3 type, putative intrinsic disorder, 2017 Binding Site HMM Feature Kurgan, 2017) SS, SA, PSSM (Li et al., Binding Deep K-mer encoding into vector 2017 Sequence 2017a) Preference boosting features (Jimenez´ Pharmacophoric properties, voxel 2017 Binding Site 3D CNN 3D Voxel et al., 2017) occupancies (Wang et al., Binding 2017 CNN PSSM and 3-mer Protein and RNA sequence 2017b) Preference Hydrophobicity, SS, normalized van der Waals volume, polarity and (Zhang and Composition, transition and 2017 Binding Site SVM polarizability, SA, charge and Liu, 2017) distribution vector polarity of side chain, evolutionary information, protein sequence. (Pan et al., Binding 2018 CNN+LSTM One hot encoding Sequence and SS 2018) Preference (Munteanu Binding Motif 2018 4-mer seeds Sequence, Structure et al., 2018) Preference finder Protein-RNA interaction prediction with deep learning

(Wu et al., AA sequence, I-TASSER Suite, 2018 Binding Site Docking Structure modeling 2018) consensus predictions, ligand Global (Pan and Binding 2018 and local One hot encoding Sequence Shen, 2018a) Preference CNN Protein sequence, hydrophobicity, (Wu et al., normalized van der Waals volume, 2018 Binding Site CNN Feature Vector 2018) polarity and polarizability, charge and polarity of side chain (Su et al., PSI-BLAST, PSIPRED, HHblits 2019 Binding Site SVM Feature Vector 2019a) profile. (Deng et al., Binding 2019 CNN+RNN One hot encoding Sequence and SS 2019) Preference (Su et al., Binding K-mer, structure matrix 2019 CNN Sequence and Structure 2019b) Preference encoding (Torng and The presence of carbon, oxygen, 2019 Binding Site 3D CNN Voxel, feature Altman, 2019) sulfur and nitrogen atoms (Shen et al., Binding Capsule 2019 One hot encoding Sequence and SS 2019) Preference net Binding Site (Lam et al., 2019 Binding CNN Feature Structure 2019) Preference (Yang et al., Binding Recommendation 2020 FastText Sequence 2020) Preference system (Xie et al., Binding 2020 Alignment Docking Structure 2020b) Preference Dynamics, sequence, number of atoms, electrostatic charges and potential hydrogen bonds, (Liu et al., 2020 Binding Site CNN Feature Vector molecular mass, hydrophobicity, 2020) hydrophilicity, polarity, polarizability, propensity, average accessible surface area (Yan et al., Binding 2020 GNN Base pairing matrix RNA SS 2020) Preference (Jolma et al., Binding 2020 SVM K-mer Sequence, Structure 2020) Preference (Grønning Binding 2020 CNN One hot encoding Sequence et al., 2020) Preference Pseudo-positions, atomic features (Xia et al., 2021 Binding Site GNN Graph, feature Vector of residues, SS, evolutionary 2021a) conversation profiles (Sun et al., Binding 2021 SENet One hot encoding RNA Sequence and SS 2021) Preference 1 Solvent accessibility 2 Secondary structure 3 Amino acid