Representative Based Protein Sequence Clustering

REPRESENTATIVE BASED PROTEIN SEQUENCE CLUSTERING M. Riadul Mannan Riad B.Sc. University of Windsor, 2000 A PROJECT SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE in the School of Computing Science O M.Riadul Mannan 2005 SIMON FRASER UNIVERSITY Summer 2005 All rights reserved. This work may not be reproduced in whole or in part, by photocopy or other means, without permission of the author. Approval Name: M.Riadu1 Mannan Riad Degree: Master of Science Title of Project: Representative Based Protein Sequence Clustering Examining Committee: Chair: Dr. Binay Bhattacharya Professor - - Dr. Martin Ester Senior Supervisor Associate Professor Dr. Jian Pei Supervisor Assistant Professor Dr. Anoop Sarkar Examiner Assistant Professor School of Computing Science, SFU Date Approved: June 17,2005 SIMON FRASER UNIVERSITY PARTIAL COPYRIGHT LICENCE The author, whose copyright is declared on the title page of this work, has granted to Simon Fraser University the right to lend this thesis, project or extended essay to users of the Simon Fraser University Library, and to make partial or single copies only for such users or in response to a request from the library of any other university, or other educational institution, on its own behalf or for one of its users. The author has further granted permission to Simon Fraser University to keep or make a digital copy for use in its circulating collection. The author has further agreed that permission for multiple copying of this work for scholarly purposes may be granted by either the author or the Dean of Graduate Studies. It is understood that copying or publication of this work for financial gain shall not be allowed without the author's written permission. Permission for public performance, or limited permission for private scholarly use, of any multimedia materials forming part of this work, may have been granted by the author. This information may be found on the separately catalogued multimedia material and in the signed Partial Copyright Licence. The original Partial Copyright Licence attesting to these terms, and signed by this author, may be found in the original bound copy of this work, retained in the Simon Fraser University Archive. W. A. C. Bennett Library Simon Fraser University Burnaby, BC, Canada Abstract Over the years, many methods have been developed for clustering protein sequences based on their similarity. However, most of the methods are based on all-against-all sequence comparison that requires at least quadratic computation on the number of sequences. Furthermore, many methods do not address the issues and challenges associated with protein clustering explicitly such as finding distant relatives and detecting multi-domain proteins. Here, we develop a novel clustering technique based on representatives with successfully avoiding the pair-wise sequence comparison. We address the protein clustering issues in details and give a solution for finding distant relatives and multi-domain proteins. We also develop a new similarity measure that captures the significant similarity information embedded in a sequence such as frequent pattern and sequence length. Acknowledgements First, I want to thank Almighty for keeping us in good health and giving me this opportunity to complete my master's degree. Then, I would like to express my deepest gratitude to my senior supervisor Dr. Martin Ester for introducing me into the field of data mining and biological sequence mining, and providing me with tremendous guidance, great support and feedback through out the project work. I would also like to thank Dr. Ke Wang for his advice and guidance during the study of biological sequence mining. I would also like to extend my sincere gratitude to my supervisor Dr.Jian Pei for providing me insightful advice, help and support for completing the project. Finally, I would like thank all the members of Database and Data mining lab for their continuous help and support during my study at SFU. Table of Contents .. Approval ............................................................................................................................11 ... Abstract .............................................................................................................................111 Acknowledgements .........................................................................................................iv Table of Contents ..............................................................................................................v .. List of Tables ...................................................................................................................vn ... List of Figures .................................................................................................................vlu Chapter 1: Introduction ...................................................................................................1 1. 1 What is Clustering? ..............................................................................................................2 1.2 Challenges and issues in Protein Clustering ........................................................................3 1.3 Existing Methods for Protein Clustering .............................................................................4 1.4 Clustering Algorithms .........................................................................................................6 1.4.1 K-means ................................................................................................................................. 6 1.4.2 K-medoids .............................................................................................................................. 7 1.5 Our Clustering Approach .....................................................................................................8 Chapter 2: Related Studies - An Overview ..................................................................10 2.1 Protein Databases ..............................................................................................................10 2.1.1 SWISS-PROT ....................................................................................................................... 10 2.1.2 TrEMBL ............................................................................................................................... 11 2.2 Motif, Pattern and Domain Based Analysis .......................................................................11 2.2.1 PROSITE .............................................................................................................................. 11 2.2.2 PRINTS ................................................................................................................................ 12 2.2.3 BLOCKS .............................................................................................................................. 12 2.2.4 ProDom ................................................................................................................................ 13 2.2.5 DOMO .................................................................................................................................. 13 2.2.6 InterPro ................................................................................................................................. 14 2.2.7 The Study by V . Guralink and G . Karypis ...................................................................... 14 2.3 Sequence Comparison Approach ...................................................................................... 15 2.3.1 SYSTERS ............................................................................................................................. 15 2.3.2 ProtoMap .............................................................................................................................. 16 2.3.3 Structure Prediction by Transitive Homology ...................................................................... 17 2.3.4 CluStr ................................................................................................................................... 17 2.4 Hidden Markov Modeling (HMM) ....................................................................................18 2.4.1 Pfam ..................................................................................................................................... 18 Chapter 3: Preliminaries of Protein Sequence .............................................................20 3.1 Basics of Protein Sequence and Structure .........................................................................20 3.2 Homology and Protein Family ...........................................................................................21 3.3 The Protein Folding Problem .............................................................................................21 3.4 Protein Sequence Analysis .................................................................................................22 Chapter 4: Clustering Method .......................................................................................25 4.1 Problem Definition ............................................................................................................26 4.2 Finding Frequent Pattern ...................................................................................................27 4.3 Selection of Cluster Representative ...................................................................................29

Representative Based Protein Sequence Clustering

Parallel and Scalable Precise Clustering for Homologous Protein Discovery

Sequence-Based Microrna Clustering

De Novo Clustering of Long-Read Transcriptome Data Using a Greedy, Quality-Value Based Algorithm

Ultrafast and Sensitive Sequence Search and Clustering Methods in the Era of Next Generation Sequencing

VIRMOTIF: a User-Friendly Tool for Viral Sequence Analysis

Spclust: Towards a Fast and Reliable Clustering for Potentially Divergent

De Novo Clustering of Long Reads by Gene from Transcriptomics Data

Application of Subspace Clustering in DNA Sequence Analysis

Downloaded Without User Conclusions Registration At: and Additional Informations in Supplementary Material

Scalable Clustering for Immune Repertoire Sequence Analysis

The Genexpress IMAGE Knowledge Base of the Human Brain Transcriptome: a Prototype Integrated Resource for Functional and Computational Genomics

State-Of-The-Art Bioinformatics Protein Structure Prediction Tools (Review)