New Automatically Built Profiles for a Better Understanding of the Peroxidase Superfamily Evolution
University of Geneva

Practical training report submitted for the Master Degree in Proteomics and Bioinformatics

New automatically built profiles for a better understanding of the peroxidase superfamily evolution

presented by Dominique Koua

Supervisors:
Dr Christophe DUNAND, Laboratory of Plant Physiology, University of Geneva
Dr Nicolas HULO and Dr Christian J.A. SIGRIST, Swiss Institute of Bioinformatics, PROSITE group

Geneva, April 18th, 2008

Abstract

Motivation: Peroxidases (EC 1.11.1.x), which are encoded by small or large multigenic families, are involved in several important physiological and developmental processes. These proteins are extremely widespread and present in almost all living organisms. A large number of haem and non-haem peroxidase sequences are annotated and classified in the peroxidase database PeroxiBase (http://peroxibase.isb-sib.ch). PeroxiBase contains about 5800 peroxidase sequences classified as haem peroxidases and non-haem peroxidases and distributed between thirteen superfamilies and fifty subfamilies (Passardi et al., 2007). However, only a few classification tools are available for the characterisation of peroxidase sequences: InterPro motifs, PRINTS and specifically designed PROSITE profiles. Moreover, these PROSITE profiles are very general: they allow neither the differentiation between sequences of closely related subfamilies nor the prediction of specific cellular localisations. Due to the rapid growth in the number of available sequences, there is a need for continual updates and corrections of peroxidase protein sequences as well as for new tools that facilitate the acquisition and classification of existing and new sequences. Currently, the way PROSITE generalised profiles are built and used does not allow the differentiation of sequences from subfamilies showing a high degree of similarity. Profiles for closely related subfamilies frequently overlap.
Several options exist to improve the discriminatory power of PROSITE profiles. For instance, it is possible to use a more conservative BLOSUM matrix or to assign more or less weight to the substitution matrix used to build the profile. However, these two options only partially solve the problem of overlapping profiles. Moreover, because generalised profiles are used to trigger automatic annotation of UniProtKB sequences, this lack of discrimination is an important weakness when generalised profiles are used in large-scale annotation processes. The aim of the project was to propose and test a new approach to building generalised profiles in order to improve their discriminative capacity. The new technique is based on the silencing of residues (positions in the sequences) which are highly conserved among all the subfamilies, giving more importance to the truly discriminative residues of each subfamily. The reliability of this new approach was tested using the peroxidase database PeroxiBase as a source of well-annotated sequences.

Results:
− Construction of nearly 70 new profiles for peroxidases, which allowed a better characterisation of sequences and confirmed or facilitated the assignment of sequences to given subgroups. In addition, these profiles will be a new tool for peroxidase sequence identification in UniProtKB, so that a more complete set of correctly annotated peroxidases becomes available in PeroxiBase and in UniProtKB. The new profiles will be added to the PROSITE database as well as to PeroxiBase.
− Addition to PeroxiBase of about a hundred sequences picked up by a profile-based search against UniProtKB.
− Development of an automated method for building PROSITE profiles with high discriminatory power between subclasses of a given set of sequences.
− Detection of annotation and/or classification mistakes in PeroxiBase and UniProtKB sequences.
− Development of a tool that automatically checks the homogeneous annotation lines contained in a set of UniProtKB accessions representing the match list for a given profile, and summarises them in a UniRule format file. This is the first step towards the construction of a ProRule associated with the PROSITE profile.

Availability:
− Seventy-six new profiles built for existing and newly created families and subfamilies of peroxidases.
− A new profile-based classification tool for peroxidase sequences added to PeroxiBase: http://peroxibase.isb-sib.ch/class_scan.php
− 14 ProRule files on non-animal peroxidases.

Content

1. Introduction
1.1. Overview on peroxidases and PeroxiBase
1.1.1. EC-based peroxidase classification
1.1.2. Haem-based classification
1.2. Current situation of peroxidase characterization tools
1.3. PROSITE: a motif database
1.3.1. PROSITE patterns
1.3.2. PROSITE generalised profiles
1.4. Profile calibration
1.5. Significance of generalised profile matches
1.6. Sequence annotation based on generalised profiles and rules
1.7. Improvement of discriminatory power of generalised profiles
1.8. Aim of the project
2. Methods
2.1. Presentation of the proposed approach
2.1.1. Superfamily profile building
2.1.2. Analysis preparation step
2.2. New methodology proposed
2.2.1. Choice of the initial set of trusted sequences
2.2.2. Construction of a multiple alignment
2.2.3. Multiple alignment annotation and profile construction
2.2.4. Profile cut-off value fixation
2.3. ProRule semi-automated construction methodology
3. Results
3.1. Peroxidase profiles for PeroxiBase
3.2. A new profile building methodology for subfamily classification
3.3. Improvement of existing programs
3.4. ProRule file editor
4. Discussion
4.1. Importance of profiles for peroxidase classification and evolution understanding
4.2. Advantages of the global approach
4.3. Human calibration steps
4.4. ProRule skeleton files
4.5. Weaknesses of the proposed methodology
5. Conclusion and perspectives
References
Acknowledgments
Appendices

1. Introduction

1.1. Overview on peroxidases and PeroxiBase.

Peroxidases are enzymes that use various peroxides (ROOH) as electron acceptors to catalyse a number of oxidative reactions. PeroxiBase is a unique