Kclust: Fast and Sensitive Clustering of Large Protein Sequence Databases Maria Hauser1†, Christian E Mayer2,3† and Johannes Söding1*

Total Page:16

File Type:pdf, Size:1020Kb

Kclust: Fast and Sensitive Clustering of Large Protein Sequence Databases Maria Hauser1†, Christian E Mayer2,3† and Johannes Söding1* Hauser et al. BMC Bioinformatics 2013, 14:248 http://www.biomedcentral.com/1471-2105/14/248 SOFTWARE Open Access kClust: fast and sensitive clustering of large protein sequence databases Maria Hauser1†, Christian E Mayer2,3† and Johannes Söding1* Abstract Background: Fueled by rapid progress in high-throughput sequencing, the size of public sequence databases doubles every two years. Searching the ever larger and more redundant databases is getting increasingly inefficient. Clustering can help to organize sequences into homologous and functionally similar groups and can improve the speed, sensitivity, and readability of homology searches. However, because the clustering time is quadratic in the number of sequences, standard sequence search methods are becoming impracticable. Results: Here we present a method to cluster large protein sequence databases such as UniProt within days down to 20%–30% maximum pairwise sequence identity. kClust owes its speed and sensitivity to an alignment-free prefilter that calculates the cumulative score of all similar 6-mers between pairs of sequences, and to a dynamic programming algorithm that operates on pairs of similar 4-mers. To increase sensitivity further, kClust can run in profile-sequence comparison mode, with profiles computed from the clusters of a previous kClust iteration. kClust is two to three orders of magnitude faster than clustering based on NCBI BLAST, and on multidomain sequences of 20%–30% maximum pairwise sequence identity it achieves comparable sensitivity and a lower false discovery rate. It also compares favorably to CD-HIT and UCLUST in terms of false discovery rate, sensitivity, and speed. Conclusions: kClust fills the need for a fast, sensitive, and accurate tool to cluster large protein sequence databases to below 30% sequence identity. kClust is freely available under GPL at ftp://toolkit.lmb.uni-muenchen.de/pub/kClust/. Background improve the readability of search results [4]. Cluster- In recent years, the amount of sequence data has been ing is also often used in the analysis of metagenomics growing at an accelerating pace. While one would expect experiments, where potentially orthologous sequences are the denser sampling of sequence space to lead to better clustered together in order to define the inventory of bio- performance of sequence searches, the opposite seems to chemical reactions that are likely to be present in the be true: The increase has led to stagnating or even nega- metagenomic community [5-7]. tive returns, as measured by the ability to detect homol- An important motivation to develop kClust was our ogous sequences for structure or function predictions need for a database of profile hidden Markov models [1]. Removing redundant sequences through clustering (HMMs) that contains all UniProt sequences, clustered can partly alleviate this problem: [2] and [3] showed that down to 20%–30% sequence identity (uniprot20). Such sequence searches through clustered databases, which a database is required by HHblits, the most sensitive contain one representative sequence per cluster, could protein sequence search method to date [8]. HHblits cal- improve the sensitivity of homology search methods. Fur- culates a profile HMM from the query sequence and thermore, clustering reduces search times and can greatly compares this query HMM with the uniprot20 database of HMMs. In subsequent iterations, sequences belong- ing to significantly similar database HMMs are added to *Correspondence: [email protected] the query multiple sequence alignment (MSA). Thanks to †Equal contributors 1Gene Center and Center for Integrated Protein Science (CIPSM), the additional information from homologous sequences Ludwig-Maximilians-Universität München, Feodor-Lynen-Str. 25, Munich in the MSAs, HMM-HMM comparison is more sensitive 81377, Germany and accurate than profile-sequence comparison: Com- Full list of author information is available at the end of the article pared to PSI-BLAST [9], HHblits is faster, up to twice © 2013 Hauser et al.; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Hauser et al. BMC Bioinformatics 2013, 14:248 Page 2 of 12 http://www.biomedcentral.com/1471-2105/14/248 as sensitive, and produces alignments with several per- since decrementing k by one results in approximately 20 cent higher accuracy [8]. For HHblits it is critical that times more chance k-mers matches and a 20-fold longer only a very low number of clusters are corrupted by run time. CD-HIT can therefore cluster large databases non-homologous sequences, since these can cause high- such as UniProt only to down to ∼ 50% sequence identity. scoring false-positive matches. The clustering sensitivity UCLUST uses 5-mers at all clustering thresholds. This is also important because MSAs with higher sequence allows it to maintain a high speed even at low thresholds, diversity and higher information content are better at at the cost of a loss of sensitivity. It ranks the represen- finding remotely related sequences. Also, the lower num- tative sequences by the number of 5-mers they have in ber of clusters results in shorter search times. common with the query sequence and aligns them in this Sequence clustering methods first need to compare the order until one of them is similar to the query sequence or sequences in the input set with each other. Most meth- until the highest-ranked eight clusters have been rejected. ods use FASTA or BLAST [10,11] for this purpose. Some An apparent disadvantage of UCLUST is that, in order of these methods use the pairwise sequence similarities to increase sensitivity despite its word length of 5, it uses for hierarchical clustering [12-15]. Others use them to rather loose acceptance criteria. At clustering thresholds cluster sequences into orthologous or functionally simi- below 50%, this leads to a high fraction of corrupted lar groups [16-22]. FASTA and BLAST are two to three clusters containing non-homologous sequences. orders of magnitude faster than Smith-Waterman search Both CD-HIT and UCLUST use banded dynamic pro- [23] due to their fast, k-mer-based heuristic filters, yet gramming to speed up the Smith-Waterman search. In they would require around 80 years of CPU time to cluster addition, UCLUST extends the alignment around identi- the UniProt database version with ∼35.5 million sequence cal 5-mer matches in a way similar to BLAST [11]. from scratch. SIMAP [24] employs a 9-TeraFLOP dis- kClust achieves high sensitivity by allowing matches tributed network of computers to regularly update their between similar k-mers and ranking sequence pairs by the database of precomputed FASTA similarity scores for sum of similarity scores over all similar k-mer pairs. a large set of protein sequences. This database can be We benchmark the performance of kClust and other tapped to avoid the time-consuming sequence alignments tools to cluster large sequence databases well below 50% [22], but it cannot be used for sequences not yet con- sequence identity. Our results show that kClust achieves tained in SIMAP or when the clustering should depend on false discovery rates similar to BLAST at much higher information other than sequence similarity scores. speeds, whereas at these low clustering thresholds CD- CD-HIT [25-27] and UCLUST [28] are sequence clus- HIT and UCLUST manifest severe limitations in sensitiv- tering methods that, like kClust, do not rely on an external ity, false discovery rate, and speed. sequence search tool. All three methods aim at cluster- ing large sequence databases much faster than what is Implementation possible using BLAST or FASTA. They start with zero Overview of kClust algorithm clusters and pick one sequence after the other from the kClust, like CD-HIT and UCLUST [26,28], uses the incre- database. If the query is sufficiently similar to the rep- mental, greedy clustering strategy of [29] (Additional resentative sequence of a cluster, the query is added to file 1: Figure S1). First, all sequences are sorted by length. that cluster. Otherwise it becomes the founding member Starting with the longest one, the next sequence is picked and representative sequence for a new cluster. For the fast from the database as query and is compared with the comparison of the query to the representative sequences representative sequences representing the already created of the clusters, these methods first prefilter the repre- clusters. If the query fulfills kClust’s similarity criteria with sentative sequences using an alignment-free, k-mer-based one of the representative sequences, the query is added to sequence comparison. Sequences that pass the prefilter that cluster, otherwise a new cluster is created for which are aligned to the query using Smith-Waterman align- the query becomes the representative sequence. ment. The clustering finishes when all sequences have A fast, alignment-free prefilter reduces the number of been picked and assigned to a cluster. costly computations of sequence alignments. Whereas For prefiltering, CD-HIT and UCLUST simply count the CD-HIT and UCLUST count exact k-mer matches, kClust number of identical k-mers between sequences. Because calculates the sum of similarity scores over all similar this number drops quickly as the similarity of the com- 6-mers. To increase speed, alignments are constructed pared sequences decreases, CD-HIT uses shorter k-mers with a novel algorithm, k-mer dynamic programming for achieving higher sensitivity: (e. g. 5-mers for cluster- (kDP), which finds the local optimal alignment passing ing thresholds down to 70% sequence identity, and 2-mers through pairs of similar 4-mers (C.E.M. and J.S., to be pub- below 50%). The choice of k follows from the require- lished). Furthermore, kClust employs spaced 6-mers and ment for near-perfect sensitivity, between 95% and 99%.
Recommended publications
  • Parallel and Scalable Precise Clustering for Homologous Protein Discovery
    Parallel and Scalable Precise Clustering for Homologous Protein Discovery Stuart Byma Akash Dhasade Adrian Altenhoff [email protected] [email protected] [email protected] EPFL IIT ETH Zürich Christophe Dessimoz James R. Larus [email protected] [email protected] University of Lausanne EPFL Abstract are referred to as homologs, and their detection allows the This paper presents a new, parallel implementation of clus- transference of knowledge from well-studied genes to newly tering and demonstrates its utility in greatly speeding up sequenced ones. Homologs, despite having accumulated sub- the process of identifying homologous proteins. Clustering stantial differences during evolution, often continue to per- is a technique to reduce the number of comparison needed form the same biological function. In fact, most of today’s to find similar pairs in a set of n elements such as protein molecular-level biological knowledge comes from the study sequences. Precise clustering ensures that each pair of similar of a handful of model organisms, which is then extrapolated elements appears together in at least one cluster, so that to other life forms, primarily through homology detection. similarities can be identified by all-to-all comparison in each Several sequence homology techniques are among the 100 cluster rather than on the full set. This paper introduces most-cited scientific papers of all time [24]. ClusterMerge, a new algorithm for precise clustering that Current approaches to find similar (homologous) proteins uses transitive relationships among the elements to enable are computationally expensive. The baseline is to perform 2 parallel and scalable implementations of this approach.
    [Show full text]
  • Sequence-Based Microrna Clustering
    Sequence-based MicroRNA Clustering Kübra Narcı1, Hasan Oğul2 and Mahinur Akkaya3 1Medical Informatics Department, Informatics, Middle East Technical University, Ankara, Turkey 2Department of Computer Engineering, Faculty of Engineering, Başkent University, Ankara, Turkey 3Department of Chemistry, Faculty of Arts and Sciences, Middle East Technical University, Ankara, Turkey Keywords: MicroRNA, Sequence Clustering, Clustering Algorithms, Pair-wise Sequence Comparison Sequence Similarity. Abstract: MicroRNAs (miRNAs) play important roles in post-transcriptional gene regulation. Altogether, understanding integrative and co-operative activities in gene regulation is conjugated with identification of miRNA families. In current applications, the identification of such groups of miRNAs is only investigated by the projections of their expression patterns and so along with their functional relations. Considering the fact that the miRNA regulation is mediated through its mature sequence by the recognition of the target mRNA sequences in the RISC (RNA-induced silencing complex) binding regions, we argue here that relevant miRNA groups can be obtained by de novo clustering them solely based on their sequence information, by a sequence clustering approach. In this way, a new study can be guided by a set of previously annotated miRNA groups without any preliminary experimentation or literature evidence. In this report, we presents the results of a computational study that considers only mature miRNA sequences to obtain relevant miRNA clusters using various machine learning methods employed with different sequence representation schemes. Both statistical and biological evaluations encourages the use this approach in silico assessment of functional miRNA groups. 1 INTRODUCTION sequence similarity is also known (Altuvia et al. 2005). miRNAs targeted into a specific mRNA MiRNAs are small, 20-22 nucleotides in length, non- region are greeted through biogenesis which is coding RNA products of the corresponding MIR, commonly specific into the organism.
    [Show full text]
  • Representative Based Protein Sequence Clustering
    REPRESENTATIVE BASED PROTEIN SEQUENCE CLUSTERING M. Riadul Mannan Riad B.Sc. University of Windsor, 2000 A PROJECT SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE in the School of Computing Science O M.Riadul Mannan 2005 SIMON FRASER UNIVERSITY Summer 2005 All rights reserved. This work may not be reproduced in whole or in part, by photocopy or other means, without permission of the author. Approval Name: M.Riadu1 Mannan Riad Degree: Master of Science Title of Project: Representative Based Protein Sequence Clustering Examining Committee: Chair: Dr. Binay Bhattacharya Professor - - Dr. Martin Ester Senior Supervisor Associate Professor Dr. Jian Pei Supervisor Assistant Professor Dr. Anoop Sarkar Examiner Assistant Professor School of Computing Science, SFU Date Approved: June 17,2005 SIMON FRASER UNIVERSITY PARTIAL COPYRIGHT LICENCE The author, whose copyright is declared on the title page of this work, has granted to Simon Fraser University the right to lend this thesis, project or extended essay to users of the Simon Fraser University Library, and to make partial or single copies only for such users or in response to a request from the library of any other university, or other educational institution, on its own behalf or for one of its users. The author has further granted permission to Simon Fraser University to keep or make a digital copy for use in its circulating collection. The author has further agreed that permission for multiple copying of this work for scholarly purposes may be granted by either the author or the Dean of Graduate Studies. It is understood that copying or publication of this work for financial gain shall not be allowed without the author's written permission.
    [Show full text]
  • De Novo Clustering of Long-Read Transcriptome Data Using a Greedy, Quality-Value Based Algorithm
    bioRxiv preprint doi: https://doi.org/10.1101/463463; this version posted November 6, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license. De novo clustering of long-read transcriptome data using a greedy, quality-value based algorithm Kristoffer Sahlin1y and Paul Medvedev1;2;3 1 Department of Computer Science and Engineering, The Pennsylvania State University 2 Department of Biochemistry and Molecular Biology, The Pennsylvania State University 3 Center for Computational Biology and Bioinformatics, The Pennsylvania State University y to whom correspondence should be addressed: [email protected] Abstract. Long-read sequencing of transcripts with PacBio Iso-Seq and Oxford Nanopore Technolo- gies has proven to be central to the study of complex isoform landscapes in many organisms. However, current de novo transcript reconstruction algorithms from long-read data are limited, leaving the po- tential of these technologies unfulfilled. A common bottleneck is the dearth of scalable and accurate algorithms for clustering long reads according to their gene family of origin. To address this challenge, we develop isONclust, a clustering algorithm that is greedy (in order to scale) and makes use of quality values (in order to handle variable error rates). We test isONclust on three simulated and five biological datasets, across a breadth of organisms, technologies, and read depths. Our results demonstrate that isONclust is a substantial improvement over previous approaches, both in terms of overall accuracy and/or scalability to large datasets.
    [Show full text]
  • Ultrafast and Sensitive Sequence Search and Clustering Methods in the Era of Next Generation Sequencing
    TECHNISCHE UNIVERSITAT¨ MUNCHEN¨ Lehrstuhl fur¨ Bioinformatik Ultrafast and sensitive sequence search and clustering methods in the era of next generation sequencing Martin Steinegger Vollstandiger¨ Abdruck der von der Fakultat¨ fur¨ Informatik der Technischen Universitat¨ Munchen¨ zur Erlangung des akademischen Grades eines Doktors der Naturwissenschaften genehmigten Dissertation. Vorsitzender: Prof. Dr. Michael Georg Bader Prufer¨ der Dissertation: 1. Prof. Dr. Burkhard Rost 2. Priv.-Doz. Dr. Hanjo Taubig¨ Die Dissertation wurde am 07.05.2018 bei der Technischen Universitat¨ Munchen¨ eingere- icht und durch die Fakultat¨ fur¨ Informatik am 13.08.2018 angenommen. Abstract Environmental metagenomic studies are a rich source for unraveling genetic diversity of yet un- known and mostly unculturable microbes, with terabytes of short read sequencing data already publicly available. The growth rate of such data has been increasing steadily, driven by the expo- nential increase in throughput, which has been outperforming Moore’s law by two-fold in the last decade. To date, the main bottleneck is not the generation but rather the analysis of sequence data. To address this analysis bottleneck we propose three novel methods: (1) MMseqs, a fast and sensitive clustering method, (2) MMseqs2, a fast and sensitive homology search method, and (3) Linclust, the first sequence clustering algorithm whose runtime scales linearly with the number of sequences and independently of the number of clusters obtained. Clustering is the process of assigning sequences to distinct groups based on similarity. Thus, clustering serves as a means to discover biological connections or define families of homologous sequences. Furthermore, clustering can speed up downstream analysis considerably by reducing highly similar sequences to a single sequence.
    [Show full text]
  • VIRMOTIF: a User-Friendly Tool for Viral Sequence Analysis
    G C A T T A C G G C A T genes Technical Note VIRMOTIF: A User-Friendly Tool for Viral Sequence Analysis Pedram Rajaei 1, Khadijeh Hoda Jahanian 2, Amin Beheshti 3,* , Shahab S. Band 4, Abdollah Dehzangi 5,6 and Hamid Alinejad-Rokny 2,7,* 1 Amirkabir University of Technology, Tehran 346512, Iran; [email protected] 2 AI-enabled Processes (AIP) Research Centre, Health Data Analytics Program, Macquarie University, Sydney, NSW 2109, Australia; [email protected] 3 Department of Computing, Macquarie University, Sydney, NSW 2109, Australia 4 Future Technology Research Center, College of Future, National Yunlin University of Science and Technology, 123 University Road, Yunlin 64002, Taiwan; [email protected] 5 Department of Computer Science, Rutgers University, Camden, NJ 08102, USA; [email protected] 6 Center for Computational and Integrative Biology, Rutgers University, Camden, NJ 08102, USA 7 Biological & Medical Machine Learning Lab (BML), The Graduate School of Biomedical Engineering, The University of New South Wales, UNSW, Sydney, NSW 2052, Australia * Correspondence: [email protected] (A.B.); [email protected] (H.A.-R.) Abstract: Bioinformatics and computational biology have significantly contributed to the generation of vast and important knowledge that can lead to great improvements and advancements in biology and its related fields. Over the past three decades, a wide range of tools and methods have been developed and proposed to enhance performance, diagnosis, and throughput while maintaining feasibility and convenience for users. Here, we propose a new user-friendly comprehensive tool called VIRMOTIF to analyze DNA sequences. VIRMOTIF brings different tools together as one package so that users can perform their analysis as a whole and in one place.
    [Show full text]
  • Spclust: Towards a Fast and Reliable Clustering for Potentially Divergent
    View metadata, citation and similar papers at core.ac.uk brought to you by CORE provided by Archive Ouverte en Sciences de l'Information et de la Communication SpCLUST: Towards a fast and reliable clustering for potentially divergent biological sequences Johnny Matar, Hicham El Khoury, Jean Charr, Christophe Guyeux, Stéphane Chrétien To cite this version: Johnny Matar, Hicham El Khoury, Jean Charr, Christophe Guyeux, Stéphane Chrétien. SpCLUST: Towards a fast and reliable clustering for potentially divergent biological sequences. Computers in Biology and Medecine, 2019, 114, pp.103439 (14). hal-02366767 HAL Id: hal-02366767 https://hal.archives-ouvertes.fr/hal-02366767 Submitted on 16 Nov 2019 HAL is a multi-disciplinary open access L’archive ouverte pluridisciplinaire HAL, est archive for the deposit and dissemination of sci- destinée au dépôt et à la diffusion de documents entific research documents, whether they are pub- scientifiques de niveau recherche, publiés ou non, lished or not. The documents may come from émanant des établissements d’enseignement et de teaching and research institutions in France or recherche français ou étrangers, des laboratoires abroad, or from public or private research centers. publics ou privés. SpCLUST: towards a fast and reliable clustering for potentially divergent biological sequences Johny Matar1;3, Hicham EL Khoury3 Jean-Claude Charr1, Christophe Guyeux1, St´ephaneChr´etien2, 1Universit´ede Bourgogne Franche-Comt´e,UMR 6174 CNRS, 16 route de Gray, Besan¸con,France 2National Physical Laboratory, Hampton Road, Teddington, United Kingdom 3LaRRIS, Faculty of Science, Lebanese University, Fanar, Lebanon October 2018 Abstract This paper presents SpCLUST, a new C++ package that takes a list of sequences as input, aligns them with MUSCLE, computes their similarity matrix in parallel and then performs the clustering.
    [Show full text]
  • De Novo Clustering of Long Reads by Gene from Transcriptomics Data
    De Novo Clustering of Long Reads by Gene from Transcriptomics Data Camille Marchet, Lolita Lecompte, Corinne da Silva, Corinne Cruaud, Jean-Marc Aury, Jacques Nicolas, Pierre Peterlongo To cite this version: Camille Marchet, Lolita Lecompte, Corinne da Silva, Corinne Cruaud, Jean-Marc Aury, et al.. De Novo Clustering of Long Reads by Gene from Transcriptomics Data. Nucleic Acids Research, Oxford University Press, In press, pp.1-12. 10.1093/nar/gky834. hal-01643156v2 HAL Id: hal-01643156 https://hal.archives-ouvertes.fr/hal-01643156v2 Submitted on 14 Sep 2018 HAL is a multi-disciplinary open access L’archive ouverte pluridisciplinaire HAL, est archive for the deposit and dissemination of sci- destinée au dépôt et à la diffusion de documents entific research documents, whether they are pub- scientifiques de niveau recherche, publiés ou non, lished or not. The documents may come from émanant des établissements d’enseignement et de teaching and research institutions in France or recherche français ou étrangers, des laboratoires abroad, or from public or private research centers. publics ou privés. i “output” — 2018/9/13 — 8:22 — page 1 — #1 i i i Published online XXX Nucleic Acids Research, 2018, Vol. X, No. X 1–12 doi:10.1093/nar/gkn000 De Novo Clustering of Long Reads by Gene from Transcriptomics Data Camille Marchet 1, Lolita Lecompte 1, Corinne Da Silva 2, Corinne Cruaud 2, Jean-Marc Aury 2, Jacques Nicolas 1 and Pierre Peterlongo 1 1Univ Rennes, Inria, CNRS, IRISA, F-35000 Rennes, France 2Commissariat `al'Energie´ Atomique (CEA), Institut de Biologie Fran¸coisJacob, Genoscope, 91000, Evry, France Received XXX, 2018; Revised XXX, XXX; Accepted XXX, XXX ABSTRACT that do not have enough support.
    [Show full text]
  • Application of Subspace Clustering in DNA Sequence Analysis
    JOURNAL OF COMPUTATIONAL BIOLOGY Volume 22, Number 00, 2015 Research Article # Mary Ann Liebert, Inc. Pp. 1–13 DOI: 10.1089/cmb.2015.0084 Application of Subspace Clustering in DNA Sequence Analysis TIM WALLACE,1 ALI SEKMEN,1 and XIAOFEI WANG2 ABSTRACT Identification and clustering of orthologous genes plays an important role in developing evolutionary models such as validating convergent and divergent phylogeny and predicting functional proteins in newly sequenced species of unverified nucleotide protein mappings. Here, we introduce an application of subspace clustering as applied to orthologous gene sequences and discuss the initial results. The working hypothesis is based upon the concept that genetic changes between nucleotide sequences coding for proteins among selected species and groups may lie within a union of subspaces for clusters of the orthologous groups. Estimates for the subspace dimensions were computed for a small population sample. A series of experiments was performed to cluster randomly selected sequences. The experimental design allows for both false positives and false negatives, and estimates for the statistical significance are provided. The clustering results are consistent with the main hypothesis. A simple random mutation binary tree model is used to simulate speciation events that show the interdependence of the subspace rank versus time and mutation rates. The simple mutation model is found to be largely consistent with the observed subspace clustering singular value results. Our study indicates that the subspace clustering method may be applied in orthology analysis. Key words: algorithms, statistics. 1. INTRODUCTION dentification of orthologous relationships among protein and nucleotide sequences is of wide Iinterest for biological sequence analysis because such information provides insight into the molecular function and evolutionary history of these sequences.
    [Show full text]
  • Downloaded Without User Conclusions Registration At: and Additional Informations in Supplementary Material
    Lima Nichio et al. BMC Bioinformatics (2019) 20:392 https://doi.org/10.1186/s12859-019-2973-4 SOFTWARE Open Access RAFTS3G: an efficient and versatile clustering software to analyses in large protein datasets Bruno Thiago de Lima Nichio1,2, Aryel Marlus Repula de Oliveira1, Camilla Reginatto de Pierri1,2, Leticia Graziela Costa Santos1, Alexandre Quadros Lejambre1, Ricardo Assunção Vialle1, Nilson Antônio da Rocha Coimbra1, Dieval Guizelini1, Jeroniza Nunes Marchaukoski1, Fabio de Oliveira Pedrosa1,2 and Roberto Tadeu Raittz1* Abstract Background: Clustering methods are essential to partitioning biological samples being useful to minimize the information complexity in large datasets. Tools in this context usually generates data with greed algorithms that solves some Data Mining difficulties which can degrade biological relevant information during the clustering process. The lack of standardization of metrics and consistent bases also raises questions about the clustering efficiency of some methods. Benchmarks are needed to explore the full potential of clustering methods - in which alignment-free methods stand out - and the good choice of dataset makes it essentials. Results: Here we present a new approach to Data Mining in large protein sequences datasets, the Rapid Alignment Free Tool for Sequences Similarity Search to Groups (RAFTS3G), a method to clustering aiming of losing less biological information in the processes of generation groups. The strategy developed in our algorithm is optimized to be more astringent which reflects increase in accuracy and sensitivity in the generation of clusters in a wide range of similarity. RAFTS3G is the better choice compared to three main methods when the user wants more reliable result even ignoring the ideal threshold to clustering.
    [Show full text]
  • Scalable Clustering for Immune Repertoire Sequence Analysis
    SCALABLE CLUSTERING FOR IMMUNE REPERTOIRE SEQUENCE ANALYSIS A thesis submitted in partial fulfillment of the requirements for the degree of Master of Science by PREM BHUSAL B.E., Visvesvaraya Technological University, 2014 2019 Wright State University Wright State University GRADUATE SCHOOL April 23, 2019 I HEREBY RECOMMEND THAT THE THESIS PREPARED UNDER MY SUPER- VISION BY Prem Bhusal ENTITLED Scalable Clustering for Immune Repertoire Sequence Analysis BE ACCEPTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF Master of Science . Keke Chen, Ph.D. Thesis Director Mateen M. Rizki, Ph.D. Chair, Department of Computer Science Committee on Final Examination Keke Chen , Ph.D Krishnaprasad Thirunarayan (T.K.Prasad) , Ph.D Tanvi Banerjee, Ph.D. Barry Milligan, Ph.D. Interim Dean of the Graduate School ABSTRACT Bhusal, Prem. M.S., Department of Computer Science and Engineering, Wright State University, 2019. Scalable Clustering for Immune Repertoire Sequence Analysis. The development of the next-generation sequencing technology has enabled systems immunology researchers to conduct detailed immune repertoire analysis at the molecule level. Large sequence datasets (e.g., millions of sequences) are being collected to com- prehensively understand how the immune system of a patient evolves over different stages of disease development. A recent study has shown that the hierarchical clustering (HC) algorithm gives the best results for B-cell clones analysis - an important type of immune repertoire sequencing (IR-Seq) analysis. However, due to the inherent complexity, the classical hierarchical clustering algorithm does not scale well to large sequence datasets. Surprisingly, no algorithms have been developed to address this scalability issue for im- munology research.
    [Show full text]
  • The Genexpress IMAGE Knowledge Base of the Human Brain Transcriptome: a Prototype Integrated Resource for Functional and Computational Genomics
    Downloaded from genome.cshlp.org on September 26, 2021 - Published by Cold Spring Harbor Laboratory Press Resource The Genexpress IMAGE Knowledge Base of the Human Brain Transcriptome: A Prototype Integrated Resource for Functional and Computational Genomics Genevie`ve Pie´tu,1,5 Re´gine Mariage-Samson,1 Nicole-Adeline Fayein,1 Christiane Matingou,1 Eric Eveno,1 Re´mi Houlgatte,1 Charles Decraene,1 Yves Vandenbrouck,1 Fariza Tahi,1 Marie-Dominique Devignes,1 Ute Wirkner,2 Wilhelm Ansorge,2 David Cox,3 Takahiro Nagase,4 Nobuo Nomura,4 and Charles Auffray1 1Genexpress, Centre National de la Recherche Scientifique (CNRS), 94801 Villejuif, France; 2European Molecular Biology Laboratory, 6900 Heidelberg, Germany; 3Stanford Human Genome Center, Stanford University, Stanford, California 94305 USA; 4Kazusa DNA Research Institute, Kisarazu, Chiba 292 Japan Expression profiles of 5058 human gene transcripts represented by an array of 7451 clones from the first IMAGE Consortium cDNA library from infant brain have been collected by semiquantitative hybridization of the array with complex probes derived by reverse transcription of mRNA from brain and five other human tissues. Twenty-one percent of the clones corresponded to transcripts that could be classified in general categories of low, moderate, or high abundance. These expression profiles were integrated with cDNA clone and sequence clustering and gene mapping information from an upgraded version of the Genexpress Index. For seven gene transcripts found to be transcribed preferentially or specifically in brain, the expression profiles were confirmed by Northern blot analyses of mRNA from eight adult and four fetal tissues, and 15 distinct regions of brain. In four instances, further documentation of the sites of expression was obtained by in situ hybridization of rat-brain tissue sections.
    [Show full text]