Sequence-Based Microrna Clustering

Total Page:16

File Type:pdf, Size:1020Kb

Sequence-Based Microrna Clustering Sequence-based MicroRNA Clustering Kübra Narcı1, Hasan Oğul2 and Mahinur Akkaya3 1Medical Informatics Department, Informatics, Middle East Technical University, Ankara, Turkey 2Department of Computer Engineering, Faculty of Engineering, Başkent University, Ankara, Turkey 3Department of Chemistry, Faculty of Arts and Sciences, Middle East Technical University, Ankara, Turkey Keywords: MicroRNA, Sequence Clustering, Clustering Algorithms, Pair-wise Sequence Comparison Sequence Similarity. Abstract: MicroRNAs (miRNAs) play important roles in post-transcriptional gene regulation. Altogether, understanding integrative and co-operative activities in gene regulation is conjugated with identification of miRNA families. In current applications, the identification of such groups of miRNAs is only investigated by the projections of their expression patterns and so along with their functional relations. Considering the fact that the miRNA regulation is mediated through its mature sequence by the recognition of the target mRNA sequences in the RISC (RNA-induced silencing complex) binding regions, we argue here that relevant miRNA groups can be obtained by de novo clustering them solely based on their sequence information, by a sequence clustering approach. In this way, a new study can be guided by a set of previously annotated miRNA groups without any preliminary experimentation or literature evidence. In this report, we presents the results of a computational study that considers only mature miRNA sequences to obtain relevant miRNA clusters using various machine learning methods employed with different sequence representation schemes. Both statistical and biological evaluations encourages the use this approach in silico assessment of functional miRNA groups. 1 INTRODUCTION sequence similarity is also known (Altuvia et al. 2005). miRNAs targeted into a specific mRNA MiRNAs are small, 20-22 nucleotides in length, non- region are greeted through biogenesis which is coding RNA products of the corresponding MIR, commonly specific into the organism. At the end of miRNA transcribing, genes. They regulate encoding its biogenesis RNA-induced silencing complex machinery involving in cleavage or translational (RISC) is formed, and by RISC binding regions the events by precise sequence complementation to the miRNA sequence is used as template to complement target RNA sequence (Bartel 2004; Lagos-Quintana the target mRNA sequence. The consequence is et al. 2001). Due to its crucial function in the cell either miRNA cleavage by degradation or identification, miRNA sequences has a great translational repression by blocking the mRNA importance since the earliest breakthrough being translated. Conversely, there would be a accumulated. The let-7 is one of the early identified positive result like sponsoring transcription or miRNA. The role of the miRNA is very important in translation and stabilization of transcription (Asgari function; controlling differentiation in C. elegans. 2011). In the complementation there are some key The let-7 family members generally involve in the regions important for target determination. Second to same processes such as controlling developmental eight nucleotides of the pre-miRNA sequence called timing (Abbott et al. 2005). Recently, it is found out as seed region are known as key nucleotides (Bartel that the sequence of the let-7 family members are 2013). miRNA binding sites in target mRNA region also well conserved. Moreover, some of the MIR is generally in 3 ‘UTR region, occasionally in 5 genes found to be polycistronic transcribed into ‘UTR region of the gene in animals. The percentage miRNAs and located into the same chromosomal of the complementarities changes by, and depends positions; they are called as miRNA clusters. In on type of the organisms (Pratt and MacRae 2009). some of the miRNA clusters a recognizable Through the evolutionary time, as the cell getting change, miRNA to target relation diverse. From this 107 Narcı, K., Ogul,˘ H. and Akkaya, M. Sequence-based MicroRNA Clustering. DOI: 10.5220/0005552901070116 In Proceedings of the 9th International Joint Conference on Biomedical Engineering Systems and Technologies (BIOSTEC 2016) - Volume 3: BIOINFORMATICS, pages 107-116 ISBN: 978-989-758-170-0 Copyright c 2016 by SCITEPRESS – Science and Technology Publications, Lda. All rights reserved BIOINFORMATICS 2016 - 7th International Conference on Bioinformatics Models, Methods and Algorithms context, studies to observe sequence fingerprints in amino acid sequences. Sequential simulation of each miRNA families has been started (Hertel et al. 2012; miRNAs in like the mentioned studies presented by Shi et al. 2012) . The investigations on let-7 family numeric kernels constructed though dynamic bring the consideration up that there are some programming pair-wise sequence alignment conserved patterns alongside with base composition algorithms; Smith-Waterman (Smith and Waterman similarities between miRNA sequences (Hertel et al. 1981) and Needleman-Wunsch (Needleman and 2012; Newman et al. 2008). Furthermore, it is found Wunsch 1970) or by calculating their k-mer that there is evolutionary importance of the distributions. Unsupervised clustering approaches mismatches between miRNA and its target as well. then applied into these sequence simulations by The mismatches are tolerating the easier release using the similarity features. The performance of the from the RISC complexes when perfectly located clusters is statistically analyzed by using Dunn Index into its target (Bartel and Bartel 2003). All these (Dunn 1973) calculation and the functionality of the findings in the end support that there is a great pipeline is tested with a well-known human miRNA concurrence between the miRNA and target dataset of Tool for Annotations of miRNA (TAM) sequence even evolutionary favoring the (Lu et al. 2010). The tool also used to test the mismatches. Hence, what is the level of these groups, annotate them into functional categories and similarity between existing miRNA groups? thus calculate the enrichment of the miRNA groups It is well known that a number of miRNA with any purposeful similarities. In conclusion, the element may role in multiple functions, like workflow here represents the method to explore proliferation, cell death and differentiation, sequentially similar miRNAs and their relevance in immunity and fat metabolism by various pattern of groups. expression (He and Hannon 2004). Therefore, the network including miRNA and its targets is highly complex housing several genes. The analysis of 2 MATERIALS AND METHOD these relation may be compromised through advanced tools like miRWalk2.0 (Dweep and Gretz 2.1 Clustering 2015). The latest studies have shown that miRNAs usually operate in a co-operative manner to perform The task is to assign each miRNA into one of their activities (Antonov et al. 2009). This suggests previously unlabeled classes so that a set of non- that some miRNAs can form context-specific overlapping miRNA groups, which are desired to modules, i.e., cluster of entities, while regulating imply a useful relevance, can be obtained. This can gene expression. Since the elucidation of gene be achieved through an unsupervised machine regulatory networks comprising all actors is one of learning technique called as clustering (Flynn 1999; the ultimate goals of systems biology, which Sisodia 2012). As having a large diversity of miRNAs are functionally similar in a certain context clustering algorithms in machine learning society, is high-potential knowledge for the researchers and we consider here four distinct methods which are clinicians working in this domain (Ölçer and Oğul selected based on their common use in 2013). bioinformatics studies; k-means (Macqueen 1967), Recently, as the importance of miRNA directed CLAG (Dib and Carbone 2012), and SOTA (Dopazo gene regulation become clear, computational et al. 1997) and MCL (Enright et al. 2002), which miRNA prediction tools was an active research area are briefly introduced as follows. (Zhao et al. 2010; Lai et al. 2003). Following the K-means (Macqueen 1967) is the classical yet advance, many predicted miRNAs sourced to be the most common in use method of partitional characterized into function. Here in this study, in the clustering. By the technique, the dataset is divided light of the current miRNA literature we used the into k non-overlapping groups by means of sequence clustering approach to group mature minimizing the sum of squares of distances between miRNAs in order identification of miRNA families data points and the corresponding cluster centroids. acting in the same metabolic events. In The logic of the method depends on the iterations of bioinformatics, the attempt of grouping the these steps; (1) determination of the centroid biological sequences is not novel. USEARCH and coordinate, (2) evaluating the distance of each object UBLAST (Edgar 2010)are two algorithms developed to the centroids and, finally (3) grouping into the recently in that concept operating on nucleic acid objects based on minimum distance (Macqueen sequences, and there are TribeMCL (Enright et al. 1967). Prior to these steps however, k must be 2002) and OrthoMCL (Li et al. 2003) operating on specified. Actually, if the dataset is unknown and 108 Sequence-based MicroRNA Clustering analyzer doesn’t know how many grouping will be This makes the centroids of the data fixed; re-runs of done, optimization of k becomes one of the the data do not change the position of the centroids
Recommended publications
  • Parallel and Scalable Precise Clustering for Homologous Protein Discovery
    Parallel and Scalable Precise Clustering for Homologous Protein Discovery Stuart Byma Akash Dhasade Adrian Altenhoff [email protected] [email protected] [email protected] EPFL IIT ETH Zürich Christophe Dessimoz James R. Larus [email protected] [email protected] University of Lausanne EPFL Abstract are referred to as homologs, and their detection allows the This paper presents a new, parallel implementation of clus- transference of knowledge from well-studied genes to newly tering and demonstrates its utility in greatly speeding up sequenced ones. Homologs, despite having accumulated sub- the process of identifying homologous proteins. Clustering stantial differences during evolution, often continue to per- is a technique to reduce the number of comparison needed form the same biological function. In fact, most of today’s to find similar pairs in a set of n elements such as protein molecular-level biological knowledge comes from the study sequences. Precise clustering ensures that each pair of similar of a handful of model organisms, which is then extrapolated elements appears together in at least one cluster, so that to other life forms, primarily through homology detection. similarities can be identified by all-to-all comparison in each Several sequence homology techniques are among the 100 cluster rather than on the full set. This paper introduces most-cited scientific papers of all time [24]. ClusterMerge, a new algorithm for precise clustering that Current approaches to find similar (homologous) proteins uses transitive relationships among the elements to enable are computationally expensive. The baseline is to perform 2 parallel and scalable implementations of this approach.
    [Show full text]
  • Representative Based Protein Sequence Clustering
    REPRESENTATIVE BASED PROTEIN SEQUENCE CLUSTERING M. Riadul Mannan Riad B.Sc. University of Windsor, 2000 A PROJECT SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE in the School of Computing Science O M.Riadul Mannan 2005 SIMON FRASER UNIVERSITY Summer 2005 All rights reserved. This work may not be reproduced in whole or in part, by photocopy or other means, without permission of the author. Approval Name: M.Riadu1 Mannan Riad Degree: Master of Science Title of Project: Representative Based Protein Sequence Clustering Examining Committee: Chair: Dr. Binay Bhattacharya Professor - - Dr. Martin Ester Senior Supervisor Associate Professor Dr. Jian Pei Supervisor Assistant Professor Dr. Anoop Sarkar Examiner Assistant Professor School of Computing Science, SFU Date Approved: June 17,2005 SIMON FRASER UNIVERSITY PARTIAL COPYRIGHT LICENCE The author, whose copyright is declared on the title page of this work, has granted to Simon Fraser University the right to lend this thesis, project or extended essay to users of the Simon Fraser University Library, and to make partial or single copies only for such users or in response to a request from the library of any other university, or other educational institution, on its own behalf or for one of its users. The author has further granted permission to Simon Fraser University to keep or make a digital copy for use in its circulating collection. The author has further agreed that permission for multiple copying of this work for scholarly purposes may be granted by either the author or the Dean of Graduate Studies. It is understood that copying or publication of this work for financial gain shall not be allowed without the author's written permission.
    [Show full text]
  • De Novo Clustering of Long-Read Transcriptome Data Using a Greedy, Quality-Value Based Algorithm
    bioRxiv preprint doi: https://doi.org/10.1101/463463; this version posted November 6, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license. De novo clustering of long-read transcriptome data using a greedy, quality-value based algorithm Kristoffer Sahlin1y and Paul Medvedev1;2;3 1 Department of Computer Science and Engineering, The Pennsylvania State University 2 Department of Biochemistry and Molecular Biology, The Pennsylvania State University 3 Center for Computational Biology and Bioinformatics, The Pennsylvania State University y to whom correspondence should be addressed: [email protected] Abstract. Long-read sequencing of transcripts with PacBio Iso-Seq and Oxford Nanopore Technolo- gies has proven to be central to the study of complex isoform landscapes in many organisms. However, current de novo transcript reconstruction algorithms from long-read data are limited, leaving the po- tential of these technologies unfulfilled. A common bottleneck is the dearth of scalable and accurate algorithms for clustering long reads according to their gene family of origin. To address this challenge, we develop isONclust, a clustering algorithm that is greedy (in order to scale) and makes use of quality values (in order to handle variable error rates). We test isONclust on three simulated and five biological datasets, across a breadth of organisms, technologies, and read depths. Our results demonstrate that isONclust is a substantial improvement over previous approaches, both in terms of overall accuracy and/or scalability to large datasets.
    [Show full text]
  • Ultrafast and Sensitive Sequence Search and Clustering Methods in the Era of Next Generation Sequencing
    TECHNISCHE UNIVERSITAT¨ MUNCHEN¨ Lehrstuhl fur¨ Bioinformatik Ultrafast and sensitive sequence search and clustering methods in the era of next generation sequencing Martin Steinegger Vollstandiger¨ Abdruck der von der Fakultat¨ fur¨ Informatik der Technischen Universitat¨ Munchen¨ zur Erlangung des akademischen Grades eines Doktors der Naturwissenschaften genehmigten Dissertation. Vorsitzender: Prof. Dr. Michael Georg Bader Prufer¨ der Dissertation: 1. Prof. Dr. Burkhard Rost 2. Priv.-Doz. Dr. Hanjo Taubig¨ Die Dissertation wurde am 07.05.2018 bei der Technischen Universitat¨ Munchen¨ eingere- icht und durch die Fakultat¨ fur¨ Informatik am 13.08.2018 angenommen. Abstract Environmental metagenomic studies are a rich source for unraveling genetic diversity of yet un- known and mostly unculturable microbes, with terabytes of short read sequencing data already publicly available. The growth rate of such data has been increasing steadily, driven by the expo- nential increase in throughput, which has been outperforming Moore’s law by two-fold in the last decade. To date, the main bottleneck is not the generation but rather the analysis of sequence data. To address this analysis bottleneck we propose three novel methods: (1) MMseqs, a fast and sensitive clustering method, (2) MMseqs2, a fast and sensitive homology search method, and (3) Linclust, the first sequence clustering algorithm whose runtime scales linearly with the number of sequences and independently of the number of clusters obtained. Clustering is the process of assigning sequences to distinct groups based on similarity. Thus, clustering serves as a means to discover biological connections or define families of homologous sequences. Furthermore, clustering can speed up downstream analysis considerably by reducing highly similar sequences to a single sequence.
    [Show full text]
  • VIRMOTIF: a User-Friendly Tool for Viral Sequence Analysis
    G C A T T A C G G C A T genes Technical Note VIRMOTIF: A User-Friendly Tool for Viral Sequence Analysis Pedram Rajaei 1, Khadijeh Hoda Jahanian 2, Amin Beheshti 3,* , Shahab S. Band 4, Abdollah Dehzangi 5,6 and Hamid Alinejad-Rokny 2,7,* 1 Amirkabir University of Technology, Tehran 346512, Iran; [email protected] 2 AI-enabled Processes (AIP) Research Centre, Health Data Analytics Program, Macquarie University, Sydney, NSW 2109, Australia; [email protected] 3 Department of Computing, Macquarie University, Sydney, NSW 2109, Australia 4 Future Technology Research Center, College of Future, National Yunlin University of Science and Technology, 123 University Road, Yunlin 64002, Taiwan; [email protected] 5 Department of Computer Science, Rutgers University, Camden, NJ 08102, USA; [email protected] 6 Center for Computational and Integrative Biology, Rutgers University, Camden, NJ 08102, USA 7 Biological & Medical Machine Learning Lab (BML), The Graduate School of Biomedical Engineering, The University of New South Wales, UNSW, Sydney, NSW 2052, Australia * Correspondence: [email protected] (A.B.); [email protected] (H.A.-R.) Abstract: Bioinformatics and computational biology have significantly contributed to the generation of vast and important knowledge that can lead to great improvements and advancements in biology and its related fields. Over the past three decades, a wide range of tools and methods have been developed and proposed to enhance performance, diagnosis, and throughput while maintaining feasibility and convenience for users. Here, we propose a new user-friendly comprehensive tool called VIRMOTIF to analyze DNA sequences. VIRMOTIF brings different tools together as one package so that users can perform their analysis as a whole and in one place.
    [Show full text]
  • Spclust: Towards a Fast and Reliable Clustering for Potentially Divergent
    View metadata, citation and similar papers at core.ac.uk brought to you by CORE provided by Archive Ouverte en Sciences de l'Information et de la Communication SpCLUST: Towards a fast and reliable clustering for potentially divergent biological sequences Johnny Matar, Hicham El Khoury, Jean Charr, Christophe Guyeux, Stéphane Chrétien To cite this version: Johnny Matar, Hicham El Khoury, Jean Charr, Christophe Guyeux, Stéphane Chrétien. SpCLUST: Towards a fast and reliable clustering for potentially divergent biological sequences. Computers in Biology and Medecine, 2019, 114, pp.103439 (14). hal-02366767 HAL Id: hal-02366767 https://hal.archives-ouvertes.fr/hal-02366767 Submitted on 16 Nov 2019 HAL is a multi-disciplinary open access L’archive ouverte pluridisciplinaire HAL, est archive for the deposit and dissemination of sci- destinée au dépôt et à la diffusion de documents entific research documents, whether they are pub- scientifiques de niveau recherche, publiés ou non, lished or not. The documents may come from émanant des établissements d’enseignement et de teaching and research institutions in France or recherche français ou étrangers, des laboratoires abroad, or from public or private research centers. publics ou privés. SpCLUST: towards a fast and reliable clustering for potentially divergent biological sequences Johny Matar1;3, Hicham EL Khoury3 Jean-Claude Charr1, Christophe Guyeux1, St´ephaneChr´etien2, 1Universit´ede Bourgogne Franche-Comt´e,UMR 6174 CNRS, 16 route de Gray, Besan¸con,France 2National Physical Laboratory, Hampton Road, Teddington, United Kingdom 3LaRRIS, Faculty of Science, Lebanese University, Fanar, Lebanon October 2018 Abstract This paper presents SpCLUST, a new C++ package that takes a list of sequences as input, aligns them with MUSCLE, computes their similarity matrix in parallel and then performs the clustering.
    [Show full text]
  • De Novo Clustering of Long Reads by Gene from Transcriptomics Data
    De Novo Clustering of Long Reads by Gene from Transcriptomics Data Camille Marchet, Lolita Lecompte, Corinne da Silva, Corinne Cruaud, Jean-Marc Aury, Jacques Nicolas, Pierre Peterlongo To cite this version: Camille Marchet, Lolita Lecompte, Corinne da Silva, Corinne Cruaud, Jean-Marc Aury, et al.. De Novo Clustering of Long Reads by Gene from Transcriptomics Data. Nucleic Acids Research, Oxford University Press, In press, pp.1-12. 10.1093/nar/gky834. hal-01643156v2 HAL Id: hal-01643156 https://hal.archives-ouvertes.fr/hal-01643156v2 Submitted on 14 Sep 2018 HAL is a multi-disciplinary open access L’archive ouverte pluridisciplinaire HAL, est archive for the deposit and dissemination of sci- destinée au dépôt et à la diffusion de documents entific research documents, whether they are pub- scientifiques de niveau recherche, publiés ou non, lished or not. The documents may come from émanant des établissements d’enseignement et de teaching and research institutions in France or recherche français ou étrangers, des laboratoires abroad, or from public or private research centers. publics ou privés. i “output” — 2018/9/13 — 8:22 — page 1 — #1 i i i Published online XXX Nucleic Acids Research, 2018, Vol. X, No. X 1–12 doi:10.1093/nar/gkn000 De Novo Clustering of Long Reads by Gene from Transcriptomics Data Camille Marchet 1, Lolita Lecompte 1, Corinne Da Silva 2, Corinne Cruaud 2, Jean-Marc Aury 2, Jacques Nicolas 1 and Pierre Peterlongo 1 1Univ Rennes, Inria, CNRS, IRISA, F-35000 Rennes, France 2Commissariat `al'Energie´ Atomique (CEA), Institut de Biologie Fran¸coisJacob, Genoscope, 91000, Evry, France Received XXX, 2018; Revised XXX, XXX; Accepted XXX, XXX ABSTRACT that do not have enough support.
    [Show full text]
  • Application of Subspace Clustering in DNA Sequence Analysis
    JOURNAL OF COMPUTATIONAL BIOLOGY Volume 22, Number 00, 2015 Research Article # Mary Ann Liebert, Inc. Pp. 1–13 DOI: 10.1089/cmb.2015.0084 Application of Subspace Clustering in DNA Sequence Analysis TIM WALLACE,1 ALI SEKMEN,1 and XIAOFEI WANG2 ABSTRACT Identification and clustering of orthologous genes plays an important role in developing evolutionary models such as validating convergent and divergent phylogeny and predicting functional proteins in newly sequenced species of unverified nucleotide protein mappings. Here, we introduce an application of subspace clustering as applied to orthologous gene sequences and discuss the initial results. The working hypothesis is based upon the concept that genetic changes between nucleotide sequences coding for proteins among selected species and groups may lie within a union of subspaces for clusters of the orthologous groups. Estimates for the subspace dimensions were computed for a small population sample. A series of experiments was performed to cluster randomly selected sequences. The experimental design allows for both false positives and false negatives, and estimates for the statistical significance are provided. The clustering results are consistent with the main hypothesis. A simple random mutation binary tree model is used to simulate speciation events that show the interdependence of the subspace rank versus time and mutation rates. The simple mutation model is found to be largely consistent with the observed subspace clustering singular value results. Our study indicates that the subspace clustering method may be applied in orthology analysis. Key words: algorithms, statistics. 1. INTRODUCTION dentification of orthologous relationships among protein and nucleotide sequences is of wide Iinterest for biological sequence analysis because such information provides insight into the molecular function and evolutionary history of these sequences.
    [Show full text]
  • Downloaded Without User Conclusions Registration At: and Additional Informations in Supplementary Material
    Lima Nichio et al. BMC Bioinformatics (2019) 20:392 https://doi.org/10.1186/s12859-019-2973-4 SOFTWARE Open Access RAFTS3G: an efficient and versatile clustering software to analyses in large protein datasets Bruno Thiago de Lima Nichio1,2, Aryel Marlus Repula de Oliveira1, Camilla Reginatto de Pierri1,2, Leticia Graziela Costa Santos1, Alexandre Quadros Lejambre1, Ricardo Assunção Vialle1, Nilson Antônio da Rocha Coimbra1, Dieval Guizelini1, Jeroniza Nunes Marchaukoski1, Fabio de Oliveira Pedrosa1,2 and Roberto Tadeu Raittz1* Abstract Background: Clustering methods are essential to partitioning biological samples being useful to minimize the information complexity in large datasets. Tools in this context usually generates data with greed algorithms that solves some Data Mining difficulties which can degrade biological relevant information during the clustering process. The lack of standardization of metrics and consistent bases also raises questions about the clustering efficiency of some methods. Benchmarks are needed to explore the full potential of clustering methods - in which alignment-free methods stand out - and the good choice of dataset makes it essentials. Results: Here we present a new approach to Data Mining in large protein sequences datasets, the Rapid Alignment Free Tool for Sequences Similarity Search to Groups (RAFTS3G), a method to clustering aiming of losing less biological information in the processes of generation groups. The strategy developed in our algorithm is optimized to be more astringent which reflects increase in accuracy and sensitivity in the generation of clusters in a wide range of similarity. RAFTS3G is the better choice compared to three main methods when the user wants more reliable result even ignoring the ideal threshold to clustering.
    [Show full text]
  • Scalable Clustering for Immune Repertoire Sequence Analysis
    SCALABLE CLUSTERING FOR IMMUNE REPERTOIRE SEQUENCE ANALYSIS A thesis submitted in partial fulfillment of the requirements for the degree of Master of Science by PREM BHUSAL B.E., Visvesvaraya Technological University, 2014 2019 Wright State University Wright State University GRADUATE SCHOOL April 23, 2019 I HEREBY RECOMMEND THAT THE THESIS PREPARED UNDER MY SUPER- VISION BY Prem Bhusal ENTITLED Scalable Clustering for Immune Repertoire Sequence Analysis BE ACCEPTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF Master of Science . Keke Chen, Ph.D. Thesis Director Mateen M. Rizki, Ph.D. Chair, Department of Computer Science Committee on Final Examination Keke Chen , Ph.D Krishnaprasad Thirunarayan (T.K.Prasad) , Ph.D Tanvi Banerjee, Ph.D. Barry Milligan, Ph.D. Interim Dean of the Graduate School ABSTRACT Bhusal, Prem. M.S., Department of Computer Science and Engineering, Wright State University, 2019. Scalable Clustering for Immune Repertoire Sequence Analysis. The development of the next-generation sequencing technology has enabled systems immunology researchers to conduct detailed immune repertoire analysis at the molecule level. Large sequence datasets (e.g., millions of sequences) are being collected to com- prehensively understand how the immune system of a patient evolves over different stages of disease development. A recent study has shown that the hierarchical clustering (HC) algorithm gives the best results for B-cell clones analysis - an important type of immune repertoire sequencing (IR-Seq) analysis. However, due to the inherent complexity, the classical hierarchical clustering algorithm does not scale well to large sequence datasets. Surprisingly, no algorithms have been developed to address this scalability issue for im- munology research.
    [Show full text]
  • The Genexpress IMAGE Knowledge Base of the Human Brain Transcriptome: a Prototype Integrated Resource for Functional and Computational Genomics
    Downloaded from genome.cshlp.org on September 26, 2021 - Published by Cold Spring Harbor Laboratory Press Resource The Genexpress IMAGE Knowledge Base of the Human Brain Transcriptome: A Prototype Integrated Resource for Functional and Computational Genomics Genevie`ve Pie´tu,1,5 Re´gine Mariage-Samson,1 Nicole-Adeline Fayein,1 Christiane Matingou,1 Eric Eveno,1 Re´mi Houlgatte,1 Charles Decraene,1 Yves Vandenbrouck,1 Fariza Tahi,1 Marie-Dominique Devignes,1 Ute Wirkner,2 Wilhelm Ansorge,2 David Cox,3 Takahiro Nagase,4 Nobuo Nomura,4 and Charles Auffray1 1Genexpress, Centre National de la Recherche Scientifique (CNRS), 94801 Villejuif, France; 2European Molecular Biology Laboratory, 6900 Heidelberg, Germany; 3Stanford Human Genome Center, Stanford University, Stanford, California 94305 USA; 4Kazusa DNA Research Institute, Kisarazu, Chiba 292 Japan Expression profiles of 5058 human gene transcripts represented by an array of 7451 clones from the first IMAGE Consortium cDNA library from infant brain have been collected by semiquantitative hybridization of the array with complex probes derived by reverse transcription of mRNA from brain and five other human tissues. Twenty-one percent of the clones corresponded to transcripts that could be classified in general categories of low, moderate, or high abundance. These expression profiles were integrated with cDNA clone and sequence clustering and gene mapping information from an upgraded version of the Genexpress Index. For seven gene transcripts found to be transcribed preferentially or specifically in brain, the expression profiles were confirmed by Northern blot analyses of mRNA from eight adult and four fetal tissues, and 15 distinct regions of brain. In four instances, further documentation of the sites of expression was obtained by in situ hybridization of rat-brain tissue sections.
    [Show full text]
  • State-Of-The-Art Bioinformatics Protein Structure Prediction Tools (Review)
    INTERNATIONAL JOURNAL OF MOLECULAR MEDICINE 28: 295-310, 2011 State-of-the-art bioinformatics protein structure prediction tools (Review) ATHANASIA PAVLOPOULOU1 and IOANNIS MICHALOPOULOS2 1Department of Pharmacy, School of Health Sciences, University of Patras, Rion-Patras; 2Cryobiology of Stem Cells, Centre of Immunology and Transplantation, Biomedical Research Foundation, Academy of Athens, Athens, Greece Received April 18, 2011; Accepted May 9, 2011 DOI: 10.3892/ijmm.2011.705 Abstract. Knowledge of the native structure of a protein data, the experimental determination of a protein structure could provide an understanding of the molecular basis of its and/or function is labour intensive, time consuming and function. However, in the postgenomics era, there is a growing expensive. As a result, the ‘sequence-structure gap’, the gap gap between proteins with experimentally determined between the number of protein sequences and the number of structures and proteins without known structures. To deal proteins with experimentally determined structure and with the overwhelming data, a collection of automated function, is growing rapidly. Therefore, the use of computa- methods as bioinformatics tools which determine the structure tional tools for assigning structure to a novel protein represents of a protein from its amino acid sequence have emerged. The the most efficient alternative to experimental methods (1). To aim of this paper is to provide the experimental biologists overcome this problem, a plethora of automated methods to with a set of cutting-edge, carefully evaluated, user-friendly predict protein structure have evolved as computational tools computational tools for protein structure prediction that would over the past decade (2). be helpful for the interpretation of their results and the rational In the present review, we present a set of state-of-the-art design of new experiments.
    [Show full text]