Identification of High-Confidence RNA Regulatory Elements By
Total Page:16
File Type:pdf, Size:1020Kb
Li et al. Genome Biology (2017) 18:169 DOI 10.1186/s13059-017-1298-8 METHOD Open Access Identification of high-confidence RNA regulatory elements by combinatorial classification of RNA–protein binding sites Yang Eric Li1†, Mu Xiao2†, Binbin Shi1†, Yu-Cheng T. Yang1, Dong Wang1, Fei Wang2, Marco Marcia3 and Zhi John Lu1* Abstract Crosslinking immunoprecipitation sequencing (CLIP-seq) technologies have enabled researchers to characterize transcriptome-wide binding sites of RNA-binding protein (RBP) with high resolution. We apply a soft-clustering method, RBPgroup, to various CLIP-seq datasets to group together RBPs that specifically bind the same RNA sites. Such combinatorial clustering of RBPs helps interpret CLIP-seq data and suggests functional RNA regulatory elements. Furthermore, we validate two RBP–RBP interactions in cell lines. Our approach links proteins and RNA motifs known to possess similar biochemical and cellular properties and can, when used in conjunction with additional experimental data, identify high-confidence RBP groups and their associated RNA regulatory elements. Keywords: RNA-binding protein, CLIP-seq, Non-negative matrix factorization, RBPgroup Background protein binding sites with high resolution in different RNA-binding proteins (RBPs) are essential to sustain mammalian cells [12–14]. The CLIP-seq data of mul- fundamental cellular functions, such as splicing, polya- tiple RNA binding proteins have been curated and denylation, transport, translation, and degradation of annotated in specific databases, such as CLIPdb, RNA transcripts [1, 2]. One study estimated that more POSTAR, and STARbase [15–17]. Several significant than 1500 different RBPs exist in human [3]. These studies improved the prediction of individual RBPs’ RBPs cooperate or compete with each other in binding binding sites by training on CLIP-seq and RNAcompete their RNA targets [4–6]. Many RBPs are capable of datasets [18–20]. Systematic assessment of combinatory binding different RNA targets, partially by associating regulation of multiple RBPs would be more beneficial with different co-factors [7–9]. At the same time, some to derive precious biological information from various consensus RNA sequence motifs are recognized by high-throughput CLIP-seq data. homologous RBPs or homologous domains [10]. Thus, Therefore, we analyzed the CLIP-seq data to group to- proteins and RNAs appear to interact in a combinatorial gether RBPs that bind on the same RNA sites, in which manner [11]. proteins interact with RNAs in a combinatorial manner. Recently, the advent of crosslinking immunoprecipita- For our classification, we used a soft-clustering method, tion sequencing (CLIP-seq) technologies, which combine non-negative matrix factorization (NMF), which allows immunoprecipitation with RNA–protein crosslinking each RBP to be clustered into more than one group, as followed by high-throughput sequencing, has enabled it is the case for proteins that participate in multiple researchers to characterize transcriptome-wide RNA– metabolic pathways. Using other RBPs’ binding signals as background also enables us to identify specific bind- ing. Through our approach, we defined RNA–protein * Correspondence: [email protected] †Equal contributors binding sites for the targeted RNAs and classified the 1MOE Key Laboratory of Bioinformatics, Center for Synthetic and Systems corresponding RBPs into groups. Subsequently, we dem- Biology, School of Life Sciences, Tsinghua University, Beijing 100084, China onstrated that the binding sites we defined from RBP Full list of author information is available at the end of the article © The Author(s). 2017 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. Li et al. Genome Biology (2017) 18:169 Page 2 of 16 groups were supported by other types of biological data, ranging broadly from several thousands, i.e. for such as alternative splicing (AS) and RNA degradation HNRNPU and TNRC6A, to tens of thousands, i.e. for data. Furthermore, we compiled a web-based platform CPSF6 and MOV10 (Fig. 1a, Additional file 1: Figure (http://RNAtarget.ncrnalab.org/RBPgroup) to make the S1a). Such broad variance in the number of CLIP-seq binding sequences and enriched motifs easily accessible peaks associated to each RBP is probably caused by by the scientific community. many factors, such as differences in biochemical prop- erties of the corresponding RBPs and in different labs’ Results experimental protocols. Second, our analysis shows Curation of various CLIP-seq data discrepancies in the number of peaks obtained from We collected 327 CLIP-seq datasets generated from different CLIP technologies and different peak calling three technical approaches: PAR-CLIP, HITS-CLIP, and methods (Additional file 1: Figure S1b). We show the eCLIP. The binding peaks of PAR-CLIP data were de- CLIP-seq signal of an example RBP, where the binding fined by Piranha [21] and PARalyzer [22]; the binding peaks defined by two computational tools (Piranha and peaks of HITS-CLIP data were defined by Piranha and PARalyzer) are very different because they rely on two CIMS [23] (see detail in “Methods”). We also down- experimental features: Piranha defines binding peaks loaded the binding peaks of eCLIP data, defined by based on reads abundance, while PARalyzer utilizes the CLIPper [24], from the ENCODE data portal (https:// information of T to C mutation in PAR-CLIP (Fig. 1b, www.encodeproject.org). First, we can see that different Additional file 1: Figure S2). Besides, yielding different RBPs display a very different number of binding peaks, number of peaks for each RBP, peak calling software a Number of binding peaks b Example of binding peaks called by for example RBPs Piranha and PARalyzer for CPSF6 NAPB AGO2 RNA-seq ALKBH5 raw reads CLIP-seq CAPRIN1 CPSF2 Piranha CPSF6 PARalyzer ELAVL1 peaks FBL T>C sites FUS 1 kb FXR1 0 11 IGF2BP1 LIN28B c MOV10 250 bp NOP56 QKI RBPMS HNRNPR TAF15 RNA-seq TNRC6A HEK293/HEK293T FXR1 DGCR8 FXR2 HNRNPA1 CLIP-seq EWSR1 HNRNPA2 FXR1 HNRNPF FXR2 HNRNPH1 EWSR1 HNRNPM peaks HITS-CLIP PAR-CLIP PAR-CLIP HITS-CLIP HNRNPU Site 1 Site 2 AUH FKBP4 HNRNPA1 d 1 kb IGF2BP1 LARP7 HepG2 SLC23A2 LIN28B RNA-seq SF3B4 SMNDC1 CPSF6 eCLIP eCLIP TAF15 CPSF7 CPSF6 NUDT21 FXR1 CLIP-seq HNRNPM CPSF6 TNRC6A K562 CPSF7 U2AF2 peaks NUDT21 PARalyzer Piranha CIMS CLIPper Site 1 Site 2 Fig. 1 CLIP-seq data for different RBPs, CLIP technologies, and peak calling methods. a We show the numbers (log scale) of binding peaks defined by different peak calling methods for example RBPs. b CLIP-seq signals of an example RBP, CPSF6, binding on a gene, NAPB. The binding peaks are defined by Piranha (from signal height) and PARalyzer (from T to C mutation), respectively. c, d Examples of co-binding sites and binding peaks of single RBP. We show the raw reads and peaks identified by Piranha (red bar) Li et al. Genome Biology (2017) 18:169 Page 3 of 16 also differ in terms of peak width. While Piranha and datasets of 48 human RBPs [15, 16] obtained in HEK293/ CIMS typically yield peaks of fixed width due to com- HEK293T cells (Fig. 2a, Additional file 1: Table S1). putational strategies being used [21, 23] (see detail in We first filtered 450 K peaks out of ~4 M total peaks, “Methods”), PARalyzer and CLIPper yield peaks of differ- retaining those that are reproducibly identified by mul- ent length distributions (10–50 nt long for PARalyzer and tiple peak calling methods (PARalyzer and Piranha) 1–90 nt long for CLIPper) (Additional file 1: Figure S3). and that occur in at least two biological replicas In summary, deposited CLIP-seq datasets display sig- (Fig. 2b, see detail in “Methods”). We then merged nificant variety probably due to intrinsic biochemical these filtered peaks into a unified set of binding sites properties of RBPs, to differences in the experimental (235 K in total) (Fig. 2c). Among these merged binding procedures and to the software used to identify RNA– sites, most were 20–60 nt in length, which corresponds protein interaction sites. These observations reflect the to one- to twofold the length of non-merged binding need to make CLIP-seq data analysis uniform and compar- peaks (Additional file 1: Figures S3, S4). Only ~10% of able and to define RNA–protein binding sites confidently the merged binding sites were longer than 100 nt. We and consistently. In this study, we aim to define a set of split these longer sites into 100-nt bins with 50-nt binding sites. We would sacrifice sensitivity and complete- overlap for the following analyses. ness to improve the accuracy of our prediction. Therefore, Next, we generated an occupancy profile matrix from we only use peaks identified by multiple methods (e.g. the merged binding sites (Fig. 2d). We used the CLIP- PARalyzer and Piranha) (see detail in “Methods”). seq signals normalized by total RNA-seq signals ob- tained in HEK293 cell line as the values of our matrix Clustering RNA–protein binding sites potentially identifies (see detail in “Methods”). We discarded binding sites high-confidence interactions with no RNA-seq signals and those associated to only We reason that a strategy to find high-confidence RNA– one RBP. This filtering procedure resulted in 84,222 protein interactions in CLIP-seq data could be to iden- binding sites used for clustering. The final occupancy tify RNA sites that are simultaneously associated with profile matrix V is composed of N rows and M columns, multiple RBPs.