Sequence-Based Microrna Clustering

Sequence-based MicroRNA Clustering Kübra Narcı1, Hasan Oğul2 and Mahinur Akkaya3 1Medical Informatics Department, Informatics, Middle East Technical University, Ankara, Turkey 2Department of Computer Engineering, Faculty of Engineering, Başkent University, Ankara, Turkey 3Department of Chemistry, Faculty of Arts and Sciences, Middle East Technical University, Ankara, Turkey Keywords: MicroRNA, Sequence Clustering, Clustering Algorithms, Pair-wise Sequence Comparison Sequence Similarity. Abstract: MicroRNAs (miRNAs) play important roles in post-transcriptional gene regulation. Altogether, understanding integrative and co-operative activities in gene regulation is conjugated with identification of miRNA families. In current applications, the identification of such groups of miRNAs is only investigated by the projections of their expression patterns and so along with their functional relations. Considering the fact that the miRNA regulation is mediated through its mature sequence by the recognition of the target mRNA sequences in the RISC (RNA-induced silencing complex) binding regions, we argue here that relevant miRNA groups can be obtained by de novo clustering them solely based on their sequence information, by a sequence clustering approach. In this way, a new study can be guided by a set of previously annotated miRNA groups without any preliminary experimentation or literature evidence. In this report, we presents the results of a computational study that considers only mature miRNA sequences to obtain relevant miRNA clusters using various machine learning methods employed with different sequence representation schemes. Both statistical and biological evaluations encourages the use this approach in silico assessment of functional miRNA groups. 1 INTRODUCTION sequence similarity is also known (Altuvia et al. 2005). miRNAs targeted into a specific mRNA MiRNAs are small, 20-22 nucleotides in length, non- region are greeted through biogenesis which is coding RNA products of the corresponding MIR, commonly specific into the organism. At the end of miRNA transcribing, genes. They regulate encoding its biogenesis RNA-induced silencing complex machinery involving in cleavage or translational (RISC) is formed, and by RISC binding regions the events by precise sequence complementation to the miRNA sequence is used as template to complement target RNA sequence (Bartel 2004; Lagos-Quintana the target mRNA sequence. The consequence is et al. 2001). Due to its crucial function in the cell either miRNA cleavage by degradation or identification, miRNA sequences has a great translational repression by blocking the mRNA importance since the earliest breakthrough being translated. Conversely, there would be a accumulated. The let-7 is one of the early identified positive result like sponsoring transcription or miRNA. The role of the miRNA is very important in translation and stabilization of transcription (Asgari function; controlling differentiation in C. elegans. 2011). In the complementation there are some key The let-7 family members generally involve in the regions important for target determination. Second to same processes such as controlling developmental eight nucleotides of the pre-miRNA sequence called timing (Abbott et al. 2005). Recently, it is found out as seed region are known as key nucleotides (Bartel that the sequence of the let-7 family members are 2013). miRNA binding sites in target mRNA region also well conserved. Moreover, some of the MIR is generally in 3 ‘UTR region, occasionally in 5 genes found to be polycistronic transcribed into ‘UTR region of the gene in animals. The percentage miRNAs and located into the same chromosomal of the complementarities changes by, and depends positions; they are called as miRNA clusters. In on type of the organisms (Pratt and MacRae 2009). some of the miRNA clusters a recognizable Through the evolutionary time, as the cell getting change, miRNA to target relation diverse. From this 107 Narcı, K., Ogul,˘ H. and Akkaya, M. Sequence-based MicroRNA Clustering. DOI: 10.5220/0005552901070116 In Proceedings of the 9th International Joint Conference on Biomedical Engineering Systems and Technologies (BIOSTEC 2016) - Volume 3: BIOINFORMATICS, pages 107-116 ISBN: 978-989-758-170-0 Copyright c 2016 by SCITEPRESS – Science and Technology Publications, Lda. All rights reserved BIOINFORMATICS 2016 - 7th International Conference on Bioinformatics Models, Methods and Algorithms context, studies to observe sequence fingerprints in amino acid sequences. Sequential simulation of each miRNA families has been started (Hertel et al. 2012; miRNAs in like the mentioned studies presented by Shi et al. 2012) . The investigations on let-7 family numeric kernels constructed though dynamic bring the consideration up that there are some programming pair-wise sequence alignment conserved patterns alongside with base composition algorithms; Smith-Waterman (Smith and Waterman similarities between miRNA sequences (Hertel et al. 1981) and Needleman-Wunsch (Needleman and 2012; Newman et al. 2008). Furthermore, it is found Wunsch 1970) or by calculating their k-mer that there is evolutionary importance of the distributions. Unsupervised clustering approaches mismatches between miRNA and its target as well. then applied into these sequence simulations by The mismatches are tolerating the easier release using the similarity features. The performance of the from the RISC complexes when perfectly located clusters is statistically analyzed by using Dunn Index into its target (Bartel and Bartel 2003). All these (Dunn 1973) calculation and the functionality of the findings in the end support that there is a great pipeline is tested with a well-known human miRNA concurrence between the miRNA and target dataset of Tool for Annotations of miRNA (TAM) sequence even evolutionary favoring the (Lu et al. 2010). The tool also used to test the mismatches. Hence, what is the level of these groups, annotate them into functional categories and similarity between existing miRNA groups? thus calculate the enrichment of the miRNA groups It is well known that a number of miRNA with any purposeful similarities. In conclusion, the element may role in multiple functions, like workflow here represents the method to explore proliferation, cell death and differentiation, sequentially similar miRNAs and their relevance in immunity and fat metabolism by various pattern of groups. expression (He and Hannon 2004). Therefore, the network including miRNA and its targets is highly complex housing several genes. The analysis of 2 MATERIALS AND METHOD these relation may be compromised through advanced tools like miRWalk2.0 (Dweep and Gretz 2.1 Clustering 2015). The latest studies have shown that miRNAs usually operate in a co-operative manner to perform The task is to assign each miRNA into one of their activities (Antonov et al. 2009). This suggests previously unlabeled classes so that a set of non- that some miRNAs can form context-specific overlapping miRNA groups, which are desired to modules, i.e., cluster of entities, while regulating imply a useful relevance, can be obtained. This can gene expression. Since the elucidation of gene be achieved through an unsupervised machine regulatory networks comprising all actors is one of learning technique called as clustering (Flynn 1999; the ultimate goals of systems biology, which Sisodia 2012). As having a large diversity of miRNAs are functionally similar in a certain context clustering algorithms in machine learning society, is high-potential knowledge for the researchers and we consider here four distinct methods which are clinicians working in this domain (Ölçer and Oğul selected based on their common use in 2013). bioinformatics studies; k-means (Macqueen 1967), Recently, as the importance of miRNA directed CLAG (Dib and Carbone 2012), and SOTA (Dopazo gene regulation become clear, computational et al. 1997) and MCL (Enright et al. 2002), which miRNA prediction tools was an active research area are briefly introduced as follows. (Zhao et al. 2010; Lai et al. 2003). Following the K-means (Macqueen 1967) is the classical yet advance, many predicted miRNAs sourced to be the most common in use method of partitional characterized into function. Here in this study, in the clustering. By the technique, the dataset is divided light of the current miRNA literature we used the into k non-overlapping groups by means of sequence clustering approach to group mature minimizing the sum of squares of distances between miRNAs in order identification of miRNA families data points and the corresponding cluster centroids. acting in the same metabolic events. In The logic of the method depends on the iterations of bioinformatics, the attempt of grouping the these steps; (1) determination of the centroid biological sequences is not novel. USEARCH and coordinate, (2) evaluating the distance of each object UBLAST (Edgar 2010)are two algorithms developed to the centroids and, finally (3) grouping into the recently in that concept operating on nucleic acid objects based on minimum distance (Macqueen sequences, and there are TribeMCL (Enright et al. 1967). Prior to these steps however, k must be 2002) and OrthoMCL (Li et al. 2003) operating on specified. Actually, if the dataset is unknown and 108 Sequence-based MicroRNA Clustering analyzer doesn’t know how many grouping will be This makes the centroids of the data fixed; re-runs of done, optimization of k becomes one of the the data do not change the position of the centroids

Sequence-Based Microrna Clustering

Parallel and Scalable Precise Clustering for Homologous Protein Discovery

Representative Based Protein Sequence Clustering

De Novo Clustering of Long-Read Transcriptome Data Using a Greedy, Quality-Value Based Algorithm

Ultrafast and Sensitive Sequence Search and Clustering Methods in the Era of Next Generation Sequencing

VIRMOTIF: a User-Friendly Tool for Viral Sequence Analysis

Spclust: Towards a Fast and Reliable Clustering for Potentially Divergent

De Novo Clustering of Long Reads by Gene from Transcriptomics Data

Application of Subspace Clustering in DNA Sequence Analysis

Downloaded Without User Conclusions Registration At: and Additional Informations in Supplementary Material

Scalable Clustering for Immune Repertoire Sequence Analysis

The Genexpress IMAGE Knowledge Base of the Human Brain Transcriptome: a Prototype Integrated Resource for Functional and Computational Genomics

State-Of-The-Art Bioinformatics Protein Structure Prediction Tools (Review)