Protein-RNA Interaction Prediction with Deep Learning: Structure Matters
Total Page:16
File Type:pdf, Size:1020Kb
Protein-RNA interaction prediction with deep learning: Structure matters Junkang Wei * 1 Siyuan Chen * 2 Licheng Zong * 1 Xin GaoB 2 Yu LiB 1 3 Abstract their importance, researchers have made significant efforts Protein-RNA interactions are of vital importance to understand the interactions (Corley et al., 2020) and the to a variety of cellular activities. Both experi- related molecular mechanism behind the processes (Lin and mental and computational techniques have been Miles, 2019; Yi et al., 2020). Due to the difficulty to perform developed to study the interactions. Due to the high-throughput structural biological experiments in the last limitation of the previous database, especially the century, the progress of this field was slow (Conn et al., lack of protein structure data, most of the exist- 1999). However, with the development and advancement of in vivo ing computational methods rely heavily on the high-throughput assays, such as the RIP-seq (Keene in vitro sequence data, with only a small portion of the et al., 2006) and CLIP-seq (Ule et al., 2005), and the methods utilizing the structural information. Re- RNACompete (Ray et al., 2009) and HT-SELEX (Roulet cently, AlphaFold has revolutionized the entire et al., 2002), we have witnessed the significant progress of protein and biology field. Foreseeably, the protein- this field as well as the large amount of accumulated data RNA interaction prediction will also be promoted (Alipanahi et al., 2015). Computational methods emerge to significantly in the upcoming years. In this work, analyze the data and accelerate the discovery (Hafner et al., we give a thorough review of this field, survey- 2021; Sun et al., 2021; Bernstein et al., 2005; Van Nostrand ing both the binding site and binding preference et al., 2020; Lam et al., 2019; Hentze et al., 2018). prediction problems and covering the commonly Similar to the experimental techniques, which can be di- used datasets, features, and models. We also point vided into the structure-based methods and the assay-based out the potential challenges and opportunities in methods, the computational methods can also be classi- this field. This survey summarizes the develop- fied into two categories, either predicting the RNA binding ment of the RBP-RNA interaction field in the past sites on the protein surface (Miao and Westhof, 2015; Yan and foresees its future development in the post- et al., 2016) or modeling the preferred RNA sequences of AlphaFold era. an RNA-binding protein (RBP) (Trabelsi et al., 2019). In the first category, people essentially resolve a binary clas- sification problem. Given the protein, researchers want to 1. Introduction predict whether it is an RBP, and if it is an RBP, at which amino acids it can interact with an RNA. In the latter one, Protein-RNA interactions are involved in a variety of cellu- given a protein with the high-throughput assay experimen- lar activities, such as gene expression regulations (Weirauch tal data, people extract the frequency of each nucleotide at et al., 2013), post-transcriptional regulations (Alipanahi each position on the preferred RNA sequences, using k-mer et al., 2015), and protein synthesis (Yan et al., 2016). Per- models (Lee et al., 2015), position weight matrix (PWM) arXiv:2107.12243v1 [q-bio.BM] 26 Jul 2021 turbation of such interactions can lead to fatal cellular dys- models (Weirauch et al., 2013), or deep learning models function and diseases (Ramanathan et al., 2019). Owing to (Alipanahi et al., 2015). If the computational method targets *Equal contribution 1Department of Computer Science and En- on genome-wide prediction, sometimes, it is also referred as gineering (CSE), The Chinese University of Hong Kong (CUHK), the binding sites prediction on RNAs (Pan and Shen, 2018a; Hong Kong, People’s Republic of China 2Computational Bio- Li et al., 2017a), which may cause confusion to the readers. science Research Center (CBRC), Computer, Electrical and Math- In the rest of the paper, binding sites prediction refers to ematical Sciences and Engineering Division (CEMSE), King Ab- dullah University of Science and Technology (KAUST), Thuwal predicting the RNA binding sites on the protein surface, 23955-6900, Saudi Arabia 3The CUHK Shenzhen Research while the binding preference prediction refers to predicting Institute, Hi-Tech Park, Nanshan, Shenzhen 518057, China. the protein binding preference against RNA sequences. On Correspondence to: Xin Gao <[email protected]>, Yu Li the other hand, as both of the two main research directions <[email protected]>. are protein-centric (Ramanathan et al., 2019), which means Copyright 2021 by the author(s). that there is intrinsic relation between the two research top- Protein-RNA interaction prediction with deep learning ics, researchers are also trying to predict both information sive local structural encoding can significantly improve the simultaneously with a unified deep learning method (Lam model’s performance. For example, people have used vox- et al., 2019). els (Torng and Altman, 2019) and graphs (Xia et al., 2021a) to encode the protein 3D structures.In terms of the RNA Since the first computational method was proposed to tackle features, the logic is similar to the protein ones. Regard- the interaction between RNA and protein specifically (Jeong ing the sequence features, people have been using the se- et al., 2004), a number of algorithms have been developed quence one-hot encodings (Grønning et al., 2020; Sun et al., to handle the problems (Miao and Westhof, 2015; Yan et al., 2021; Alipanahi et al., 2015), k-mer models (Orenstein et al., 2016; Yan and Zhu, 2020; Sagar and Xue, 2019; Pan et al., 2016), and position-weighted matrix (PWM) (Kazan et al., 2019a). They can be divided into the following categories. 2010; Orenstein et al., 2016). However, unlike the protein Firstly, based on the assumption that similar structures may secondary structures, RNA secondary structural information have similar function, people have used the template-based has been significantly emphasized, including both the pre- method to predict the binding sites (Yang et al., 2013; Chen dicted RNA secondary structures and the in vivo structure et al., 2014; Wu et al., 2018; Xie et al., 2020a) and the bind- profiles (Hiller et al., 2006; Maticzka et al., 2014; Kazan ing preference (Zheng et al., 2016). Although such methods et al., 2010; Sun et al., 2021). Meanwhile, the tertiary struc- can perform well on queries with homologs, they have diffi- tures are also shown to be very important (Zhang et al., culty in handling new sequences without homologs (Senior 2016). Despite the large variety of existing features, unfor- et al., 2020). Secondly, people combine hand-crafted fea- tunately, people have not taken full advantage of them for tures, which will be discussed in the next paragraph, with the following two reasons. Firstly, in the binding site predic- shallow-learning methods, such as support vector machine tion, people usually only consider the protein information, (SVM) (Jolma et al., 2020; Maticzka et al., 2014; Zhang while in the binding preference prediction, people usually and Liu, 2017; Su et al., 2019a), logistic regression (Kazan only consider the RNA information. The interactions be- et al., 2010; Orenstein et al., 2016; Hiller et al., 2006; Yan tween RNAs and proteins include at least two molecules, and Kurgan, 2017), and random forest (Li et al., 2017b; Sun and using information from only one side can lead to infe- et al., 2016), to investigate the topic. The commonly used rior performance. Secondly, the RNA and protein structural k-mer models (Orenstein et al., 2016) and position-weighted information has not been fully utilized as well, mainly due matrix (PWM) models (Kazan et al., 2010) are classified to the limitation of previous structure prediction methods into this category, because they are usually combined with and the unsatisfactory structure encoding methods. logistic regression. Notice that this category of methods is still under active development (Zhang and Liu, 2017; Su In recent years, we have witnessed the significant improve- et al., 2019a), even after the surge of deep learning, because ment of both the structure determination methods (Yip et al., it is difficult to represent and encode the raw structural infor- 2020) and prediction methods (Marks et al., 2011; Wang mation, which will be discussed in detail in this paper. The et al., 2017a; 2019a; Senior et al., 2020; Jumper et al., 2021; last category is the deep learning-based methods (Alipanahi Baek et al., 2021; Tunyasuvunakool et al., 2021). Consid- et al., 2015; Lam et al., 2019; Sun et al., 2021), which have ering the success of the previous computational methods been very popular in recent years. With such models, peo- targeting protein-RNA interaction prediction based on struc- ple only need to input the raw representation of the proteins tural information, it is foreseeable that researchers will make or RNAs, and let the models learn and extract useful in- significant progress in this field (Figure1). Given that, we formation by themselves. However, the transparency and review this field thoroughly in this paper, emphasizing the interpretability of the models are usually questioned (Li structural information. In this work, we also consider the et al., 2019). protein-RNA interaction binding site and binding preference prediction simultaneously for the first time, considering their Within the above algorithms, people have been using various intrinsic relationship. We notice that there are some existing features, including the ones from both proteins and RNAs. related reviews focusing on different aspects of this prob- Regarding the protein features, researchers have developed lem. More specifically, Pan et al.(2019b); Yan and Zhu representations from sequences, such as sequence one-hot (2020); Sagar and Xue(2019) list the recently developed encodings (Yan et al., 2016), position-specific scoring ma- deep learning tools for predicting binding preference.