Uniclust databases of clustered and deeply annotated protein sequences and alignments

Milot Mirdita1,†, Lars von den Driesch1,2,†, Clovis Galiez1, Maria J. Martin2, Johannes Söding1,* and Martin Steinegger1,3,4,*

1 Quantitative and Computational Biology Group, Max Planck Institute for Biophysical Chemistry, Göttingen, Germany
2 European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Cambridge, UK
3 Department for Bioinformatics and Computational Biology, Technische Universität München, Munich, Germany
4 Department of Chemistry, Seoul National University, Seoul, Korea

Nucleic Acids Research, 2016. Advance Access published November 28, 2016. doi: 10.1093/nar/gkw1081

Received August 15, 2016; Revised October 14, 2016; Editorial Decision October 24, 2016; Accepted November 01, 2016

*To whom correspondence should be addressed. Email: [email protected] Correspondence may also be addressed to Johannes Söding. Tel: +49 551 201 2890; Email: [email protected]
†These authors contributed equally to the paper as first authors.

© The Author(s) 2016. Published by Oxford University Press on behalf of Nucleic Acids Research. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

ABSTRACT

We present three clustered protein sequence databases, Uniclust90, Uniclust50, Uniclust30 and three databases of multiple sequence alignments (MSAs), Uniboost10, Uniboost20 and Uniboost30, as a resource for protein sequence analysis, function prediction and sequence searches. The Uniclust databases cluster UniProtKB sequences at the level of 90%, 50% and 30% pairwise sequence identity. Uniclust90 and Uniclust50 clusters showed better consistency of functional annotation than those of UniRef90 and UniRef50, owing to an optimised clustering pipeline that runs with our MMseqs2 software for fast and sensitive protein sequence searching and clustering. Uniclust sequences are annotated with matches to Pfam, SCOP domains, and proteins in the PDB, using our HHblits homology detection tool. Due to its high sensitivity, Uniclust contains 17% more Pfam domain annotations than UniProt. Uniboost MSAs of three diversities are built by enriching the Uniclust30 MSAs with local sequence matches from MMseqs2 profile searches through Uniclust30. All databases can be downloaded from the Uniclust server at uniclust.mmseqs.com. Users can search clusters by keywords and explore their MSAs, taxonomic representation, and annotations. Uniclust is updated every two months with the new UniProt release.

INTRODUCTION

The number of protein sequences in public databases such as UniProt (1) or GenBank (2) is growing fast, in part due to various large-scale genomics projects (3–5). The rapid growth makes it attractive for many applications to work with representative subsets, in which the representatives are computed by clustering similar sequences together and choosing only a single representative per cluster. Apart from saving computational resources, the more even coverage of sequence space of such clustered databases can improve the sensitivity of sequence similarity searches (6–8).

The popular UniProt Reference Clusters (UniRef) (9) consist of three databases that are generated by clustering the UniProtKB sequences in three steps using the CD-HIT software (10): UniRef100 combines identical UniProtKB sequences and fragments with 100% sequence identity into common entries. UniRef90 sequences are obtained by clustering UniRef100 sequences together that have at least 90% sequence identity and 80% sequence length overlap, and UniRef50 clusters together UniRef90 sequences with at least 50% sequence identity and 80% sequence length overlap.

Here, we introduce the Uniclust sequence databases which, like UniRef, are clustered, representative sets of UniProtKB sequences at three different clustering levels. But whereas UniRef relies on the CD-HIT software for the clustering, we use our software suite MMseqs2 (github.com/soedinglab/mmseqs2, Steinegger & Söding, to be published). The following characteristics make Uniclust databases unique and useful: First, the sensitivity of MMseqs2 for distantly homologous sequences allows us to cluster the UniProtKB down to 30% sequence identity. Second, we have developed a cascaded clustering workflow within MMseqs2 in order to produce sequence clusters that are as compact and functionally homogeneous as possible. As a result, Uniclust90 and Uniclust50 clusters show higher functional consistency scores than UniRef90 and UniRef50 at similar clustering depths, respectively. Third, we provide deep annotation of Uniclust sequences with Pfam (11) and SCOP (12) domains, and matches to PDB sequences (13) using HH-suite, our remote homology detection software suite. The sensitivity of HH-suite allows us to annotate 17% more Pfam domains than UniProt, which uses InterPro and HMMER3 for these annotations. Fourth, we provide the MSAs of all Uniclust clusters as well as the three Uniboost databases with MSAs of different diversity levels that are obtained by enriching Uniclust30 clusters with local sequence matches.

MATERIALS AND METHODS

We developed an open-source bash pipeline (github.com/soedinglab/uniclust-pipeline) to generate all data described here: the Uniclust clusterings, cluster summary headers, domain annotations for sequences, and the Uniboost databases of multiple sequence alignments. We provide the pipeline scripts as a supplementary archive file to avoid cluttering the descriptions here with command line options and other details irrelevant for the understanding.

Uniclust clustering pipeline

The Uniclust clusters contain all sequences in the UniProt knowledge base (UniProtKB), the union of the Swiss-Prot and TrEMBL databases. Sequences longer than 14 000 amino acid residues are split into multiple individual entries to limit memory usage and improve compatibility with other tools. (This affects 352 sequences in the 2016_03 release.) Once a year we will cluster these sequences from scratch as described in the following.

In order to cluster together sequences of ≥30% pairwise sequence identity, we need high sensitivity, yet the enormous number of pairwise comparisons (on the order of (10^7)^2) requires very high speed at the same time. We developed a cascaded clustering workflow in MMseqs (14) that uses three clustering steps with progressively increasing sensitivity and decreasing speed.

The first step consists of an extremely fast redundancy filtering that can cluster sequences of identical length and 100% overlap ('mmseqs clusthash'). It reduces each sequence to a five-letter alphabet, computes a 64 bit CRC32 hash value for the full-length sequences, and places sequences with identical hash code that satisfy the sequence [...]

[...] full-length sequences we also add sequences to a cluster if they have at least 90% sequence identity to the representative sequence and are also covered by at least 95% of their length, without regard to the E-value.

In the third step, we generate the Uniclust50 and Uniclust30 clustering both directly from the sequences in Uniclust90, using a 50% or 30% sequence identity threshold, respectively, and a minimum sequence length overlap of 80%. A high minimum overlap ensures that all proteins within one cluster have the same or a very similar domain structure and is also an effective criterion to achieve functional homogeneity (15). We avoided the cascaded clustering approach of generating Uniclust30 from Uniclust50 because we found this resulted in slightly inferior clustering quality to the direct approach.

In addition to the simple greedy clustering, we implemented affinity propagation, depth-n single linkage clustering, and the classic greedy set-cover algorithm in MMseqs2 and compared the clustering qualities. We found that the cluster compactness for all algorithms could be further improved by passing over all sequences after the clustering and reassigning each to the cluster whose representative sequence is most similar to it. The greedy set-cover algorithm with sequence reassignment gave best results and is therefore used in the final clustering step. The three-step clustering took 5 days on 10 nodes with two Intel Xeon E5-2640 v3 CPUs and 128GB main memory each.

Updating Uniclust. We will update the Uniclust databases every two months following the new UniProt release. To keep the cluster identifiers stable between updates, we do not recluster from scratch but instead update the clustering incrementally, add new sequences to existing clusters, create new clusters, and remove deprecated sequences (14). We employ the updating workflow 'mmseqs clusterupdate' in the MMseqs2 package for that purpose, which has the added advantage of running in linear time instead of quadratic in the number of sequences. To avoid excessive computational demands, we recompute the MSAs and sequence annotations only during the reclustering step once per year and for major UniProt releases.

Consensus sequences and representative sequences. We pro-
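
As an illustration of the greedy set-cover clustering with sequence reassignment described under "Uniclust clustering pipeline", the following Python sketch may help. It is not the MMseqs2 implementation: the function names, the `neighbors` dictionary (which sequences pass the identity and overlap criteria with which others) and the `similarity` scoring function are hypothetical stand-ins for what a search step would provide, and the toy data are invented for the example.

```python
def greedy_set_cover(neighbors):
    """Greedy set-cover clustering (illustrative sketch only).

    neighbors: dict mapping each sequence id to the set of ids that
    satisfy the clustering criteria with it (hypothetical input).
    Returns a dict mapping representative id -> set of member ids.
    """
    unassigned = set(neighbors)
    clusters = {}
    while unassigned:
        # Choose the sequence covering the most still-unassigned sequences.
        # Iterating in sorted order makes ties resolve to the smallest id.
        rep = max(sorted(unassigned),
                  key=lambda s: len((neighbors[s] | {s}) & unassigned))
        members = (neighbors[rep] | {rep}) & unassigned
        clusters[rep] = members
        unassigned -= members
    return clusters


def reassign(clusters, similarity):
    """Move each non-representative to its most similar representative.

    similarity(a, b): hypothetical score, higher means more similar.
    The set of representatives is left unchanged.
    """
    reps = sorted(clusters)
    refined = {rep: {rep} for rep in reps}
    for members in clusters.values():
        for seq in members:
            if seq in refined:  # representatives stay in their own cluster
                continue
            best = max(reps, key=lambda r: similarity(seq, r))
            refined[best].add(seq)
    return refined


# Toy example: symmetric similarity links a-b, a-c, c-d, d-e.
scores = {
    frozenset("ab"): 0.9,
    frozenset("ac"): 0.4,
    frozenset("cd"): 0.8,
    frozenset("de"): 0.7,
}
neighbors = {s: set() for s in "abcde"}
for pair in scores:
    x, y = sorted(pair)
    neighbors[x].add(y)
    neighbors[y].add(x)


def similarity(a, b):
    # Unlinked pairs score 0.0 in this toy setup.
    return scores.get(frozenset((a, b)), 0.0)


clusters = greedy_set_cover(neighbors)
refined = reassign(clusters, similarity)
print(clusters)
print(refined)
```

In this toy input, sequence `c` is first placed with representative `a`, because `a` is picked first, and is then moved to representative `d`, to which it is more similar. That is the compactness gain the reassignment pass is meant to capture. A real pipeline would also have to check that the reassigned pair still meets the identity and overlap thresholds, which this sketch does not model.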