Empowering the Annotation and Discovery of Structured Rnas With
Total Page:16
File Type:pdf, Size:1020Kb
bioRxiv preprint doi: https://doi.org/10.1101/550335; this version posted February 20, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. Empowering the annotation and discovery of structured RNAs with scalable and accessible integrative clustering Milad Miladi 1, Eteri Sokhoyan 1, Torsten Houwaart 2, Steffen Heyne 3, Fabrizio Costa 4, Bj¨orn Gr¨uning 1;5;∗ and Rolf Backofen 1;5;6; ∗ 1Bioinformatics Group, Department of Computer Science, University of Freiburg, Georges-Koehler-Allee 106, D-79110 Freiburg, Germany 2Institute of Medical Microbiology and Hospital Hygiene, University of Dusseldorf, Universitaetsstr. 1, D-40225, Germany 3Max Planck Institute of Immunobiology and Epigenetics, D-79108 Freiburg, Germany 4Department of Computer Science, University of Exeter, Exeter EX4 4QF, UK 5ZBSA Centre for Biological Systems Analysis, University of Freiburg, Habsburgerstr. 49, D-79104 Freiburg, Germany and 6Center for Biological Signaling Studies (BIOSS), University of Freiburg, Germany ABSTRACT quite successfully applied for annotating proteins, is not well- established for ribonucleic acids. RNA plays essential regulatory roles in all known forms of For many ncRNAs and regulatory elements in messenger life. Clustering RNA sequences with common sequence and RNAs (mRNAs), however, it is well known that the secondary structure is an essential step towards studying RNA function. structure is better conserved than the sequence, indicating With the advent of high-throughput sequencing techniques, experimental and genomic data are expanding to complement the paramount importance of structure for the functionality. the predictive methods. However, the existing methods do not This fact has promoted annotation approaches that try to families effectively utilize and cope with the immense amount of data detect structural homologs in the forms of RNA classes becoming available. and (8). Members of an RNA family are similar and Here we present GraphClust2, a comprehensive approach for typically stem from a common ancestor, while RNA classes scalable clustering of RNAs based on sequence and structural combine ncRNAs that overlap in function and structure. A prominent example of an RNA class whose members share similarities. GraphClust2 provides an integrative solution by a common function without a common origin is microRNA. incorporating diverse types of experimental and genomic One common approach to detect ncRNA of the same class data in an accessible fashion via the Galaxy framework. We is to align them first by sequence, then predict and detect demonstrate that the tasks of clustering and annotation of structured RNAs can be considerably improved, through a functionally conserved structures by applying approaches scalable methodology that also supports structure probing like RNAalifold (9), RNAz (10), or Evofold (11). A large data. Based on this, we further introduce an off-the-shelf portion of ncRNAs from the same RNA class, however, procedure to identify locally conserved structure candidates have a sequence identity of less than 70%. In this sequence in long RNAs. In this way, we suggest the presence and the identity range, sequence-based alignments are not sufficiently sparsity of phylogenetically conserved local structures in some accurate (12, 13). Alternatively, approaches for simultaneous long non-coding RNAs. Furthermore, we demonstrate the alignment and folding of RNAs such as Foldalign, Dynalign, LocARNA (14, 15, 16) yield better accuracy. advantage of a scalable clustering for discovering structured Clusters of ncRNA with a conserved secondary structure motifs under inherent and experimental biases and uncover are promising candidates for defining RNA families or classes. prominent targets of the double-stranded RNA binding protein Roquin-1 that are evolutionary conserved. In order to detect RNA families and classes, Will et al. (17) and Havgaard et al. (14) independently proposed to use the sequence-structure alignment scores between all INTRODUCTION input sequence pairs to perform hierarchical clustering of putatively functional RNAs. However, their applicability High throughput RNA sequencing and computational screens is restricted by the input size, due to the high quartic have discovered hundreds of thousands of non-coding RNAs computational complexity of the alignment calculations over a (ncRNAs) with putative cellular functionality (1, 2, 3, 4, 5, 6, quadratic number of pairs. Albeit the complexity of similarity 7). Functional analysis and validation of this vast amount of computation by pairwise sequence-structure alignment can data demand a reliable and scalable annotation system for the been reduced to quadratic O(n2) of the sequence length (18), ncRNAs, which is currently still lacking for several reasons. it is still infeasible for most of the practical purposes with First, it is often challenging to find homologs even for many several thousand sequence pairs. For the scenarios of this validated functional ncRNAs as sequence similarities can be scale, alignment-free approaches such as our previous work very low. Second, the concept of conserved domains, which is GraphClust (19) and Nofold (20) propose solutions. ∗To whom correspondence should be addressed. Email: fbackofen, [email protected]; Tel: +49 761 2037461; Fax: +49 761 2037462 [21:55 20/2/2019 graphclust-2.tex] Page: 1 1{15 bioRxiv preprint doi: https://doi.org/10.1101/550335; this version posted February 20, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. 2 A stochastic context-free grammar (SCFG), also known immunoprecipitation (CLIP) data. By comparing public CLIP as covariance model (CM), encodes the sequence and data from two double-stranded RBPs SLBP and Roquin-1, structure features of a family in a probabilistic profile. we demonstrate the advantage of a scalable approach for CM-base approaches have been extensively used, e.g. for discovering structured elements. Under subjective binding discovering homologs of known families (21) or comparing preferences of Roquin-1 and the protocol biases, a scaled two families (22). Profile-based methods (20, 23) such as clustering uncovers structured targets of Roquin-1 that are Nofold annotate and cluster sequences against a CM database evolutionary conserved. Finally, we propose BCOR’s mRNA of known families, therefore their applicability is limited to as a prominent binding target of Roquin-1 that contains already known families and cannot be used for de novo family multiple stem-loop binding elements. or motif discovery. The GraphClust methodology uses a graph kernel approach to integrate both sequence and structure information into MATERIALS AND METHODS high-dimensional sparse feature vectors. These vectors are Methods overview then rapidly clustered, with a linear-time complexity over the number of sequences, using a locality sensitive hashing The clustering workflow. The GraphClust approach can technique. While this solved the theoretical problem, the use efficiently cluster thousands of RNA sequences. This is case guiding the development of the original GraphClust achieved through a workflow with five major steps: (i) pre- work, here as GraphClust1, was tailored for a user with in- processing the input sequences; (ii) secondary structure depth experience in RNA bioinformatics that has already prediction and graph encoding; (iii) fast linear-time clustering; the set of processed sequences at hand, and now wants to (iv) cluster alignment and refinement, with an accompanying detect RNA family and classes in this set. However, with search with alignment models for extra matches; and finally the increasing amount of sequencing and genomic data, the (v) cluster collection, visualization and annotation; An tasks of detecting RNA family or classes and motif discovery overview of the workflow is presented in Figure 1. have been broadened and are becoming a standard as well as More precisely, the pre-processed sequences are appealing tasks for the analysis of high-throughput sequencing individually folded according to the thermodynamic (HTS) data. free energy models with the structure prediction tools To answer these demands, here we propose GraphClust2 as RNAfold (25) or RNAshapes (26). A decomposition graph a full-fledged solution within the Galaxy framework (24).With kernel is then used to efficiently measure similarity according the development of GraphClust2, we have materialized the to the sequence and structure features that are graph-encoded. following goals, GraphClust2 is: (i) allowing a smooth and The MinHash technique (27) and inverse indexing are used seamless integration of high-throughput experimental data and to identify the initial clusters, which correspond to dense genomic information; (ii) deployable by the end users less neighborhoods of the feature space (formal description experienced with the field of RNA bioinformatics; (iii) easily and mathematical formulations are available in the section expandable for up- and downstream analysis, and allow S1 of the supplementary materials). This fast clustering for enhanced interoperability; (iv) allowing for accessible, approach has a linear runtime complexity