Protein Function Prediction in UniProt with the Comparison of Structural Domain Arrangements

Tunca Dogan1,*, Alex Bateman1 and Maria J. Martin1

1 European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
* To whom correspondence should be addressed: [email protected]

INTRODUCTION

• SAAS: automatic decision tree-based rule-generating system
• UniRule: manually curated rules created by the curation team
• Annotations from other sources
• Requirement of new approaches for automatic annotation
• DAAC: Domain Architecture Alignment and Classification

METHODS

Schematic representation of DAAC:

Data Preparation
• DA generation: generation of DAs using InterProScan results on UniProtKB proteins
• Definition of domain architecture (DA): concatenation of the InterPro IDs of the domains on the protein sequence
• Grouping proteins under shared DAs and separation of learning and test sets
• Mining functional annotations from source databases (for learning set)

DA alignment
• Treats domain hits as strings (instead of a.a.)
• Pairwise alignment of DAs for similarity detection

Classification
• Supervised classification of query sequences into functional classes

DA Alignment

A modified version of the Needleman-Wunsch global sequence alignment algorithm is used to compare the DAs by:
- treating each domain as a symbol in a sequence
- working over an alphabet of 7,497 InterPro domains instead of 20 amino acids
- running fast due to the reduced total number of operations

Classification

- Each annotation term has its own class, which allows the use of class-specific parameters
- Multi-label classification is used, so query sequences can be members of multiple classes
- The comparison criterion is the DA similarity measure

[Figure: classification schematic. A test protein (P_T) is compared against the reference classes and annotated with a term (example: GO:000001) if S_avT ≥ T1.]

RESULTS

• DAAC system is trained for GO term and EC number prediction
• The method is cross-validated on UniProtKB/Swiss-Prot (ground-truth data) and applied on non-reviewed proteins in UniProtKB/TrEMBL
• The # of unique DAs is nearly 1/10 of the # of proteins (for UniProtKB/SwissProt # of entries: 546,238; # of DAs: 58,834)

                                GO prediction                             EC prediction
                                Cross-validation     Application          Cross-validation     Application
Database                        UniProtKB/SwissProt  UniProtKB/TrEMBL     UniProtKB/SwissProt  UniProtKB/TrEMBL
# of input proteins             55,759               2,802,893            255,948              2,802,893
# of input prot. w IPR domains  41,965               1,724,619            206,730              1,724,619
# of unique func. terms         24,679               -                    3,524                -

Database qualifiers shown in the table: (v2014_01), (reference proteomes).

GO prediction
• Cross-validation: the performance is variable; the method performs better on protein-specific terms
• Cross-validation: high performance (> 0.8 F1-score) for 572 GO terms
• Application: total # of predictions: 1,049,882; # of proteins w pred.: 352,798 (54,847 previously non-annotated)

EC prediction
• Cross-validation: better performance compared to GO, probably due to the simpler structure of the EC system
• Cross-validation: high performance (> 0.8 F1-score) for 645 EC numbers
• Application: total # of predictions: 1,297,871; # of proteins w pred.: 477,777 (426,323 previously non-annotated)

CONCLUSIONS

• We proposed DAAC: a new approach in the field of automatic functional annotation of UniProtKB/TrEMBL proteins, especially to capture the complex relationship between distant multi-domain proteins
• Novel in: (i) DA alignment as the basis of similarity, (ii) supervised multi-label classification where each class represents a unique functional term, (iii) InterPro as the domain annotation resource
• This approach is proposed as complementary to the conventional sequence-based function prediction methods
• The system is planned to be implemented as a part of the UniProt Automatic Annotation Pipeline to increase the coverage and the quality of the functional predictions
• We plan to extend the method to predict other types of annotations as well: recommended protein names, sub-cellular locations, keywords, comments and features

EMBL-EBI, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire, CB10 1SD, UK
Tel. +44 (0) 1223 494 444
[email protected]
www.ebi.ac.uk
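The DA alignment and threshold-based classification described under METHODS can be illustrated with a minimal Python sketch. It is not the DAAC implementation: the poster says only that the Needleman-Wunsch algorithm is "modified" and does not give scoring values, so the match, mismatch and gap scores, the "-" separator between InterPro IDs, the normalisation in da_similarity, the reading of the S_avT ≥ T1 rule as "average similarity to the class members reaches the class threshold", and all function names are assumptions made for illustration. The InterPro IDs in the usage note are placeholders.

```python
from typing import Dict, List, Sequence

# Assumed scoring values; the poster does not state the ones used by DAAC.
MATCH, MISMATCH, GAP = 1.0, -1.0, -1.0


def parse_da(da: str) -> List[str]:
    """Split a DA string such as 'IPR000001-IPR000003' into domain tokens.

    The '-' separator is an assumption about how the IDs are concatenated.
    """
    return [token for token in da.split("-") if token]


def align_score(a: Sequence[str], b: Sequence[str]) -> float:
    """Needleman-Wunsch global alignment score where each symbol is a domain ID."""
    prev = [j * GAP for j in range(len(b) + 1)]  # row 0: only gaps in a
    for i in range(1, len(a) + 1):
        curr = [i * GAP] + [0.0] * len(b)        # column 0: only gaps in b
        for j in range(1, len(b) + 1):
            sub = MATCH if a[i - 1] == b[j - 1] else MISMATCH
            curr[j] = max(prev[j - 1] + sub,     # align domain i with domain j
                          prev[j] + GAP,         # gap in b
                          curr[j - 1] + GAP)     # gap in a
        prev = curr
    return prev[len(b)]


def da_similarity(a: Sequence[str], b: Sequence[str]) -> float:
    """Scale the alignment score to [0, 1] (assumed normalisation)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    return max(0.0, align_score(a, b)) / (longest * MATCH)


def predict_terms(query: Sequence[str],
                  classes: Dict[str, List[List[str]]],
                  thresholds: Dict[str, float]) -> List[str]:
    """Multi-label rule: assign every term whose class passes its own threshold."""
    predicted = []
    for term, members in classes.items():
        average = sum(da_similarity(query, m) for m in members) / len(members)
        if average >= thresholds[term]:
            predicted.append(term)
    return predicted
```

For example, `da_similarity(parse_da("IPR000001-IPR000002-IPR000003"), parse_da("IPR000001-IPR000003"))` aligns two of the three domains with one gap. Because each term carries its own entry in `thresholds`, a query can be assigned to several classes or to none, which mirrors the class-specific parameters and the multi-label setup described above.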

