University of Geneva

Practical training report submitted for the Master Degree in Proteomics and Bioinformatics

New automatically built profiles for a better understanding of the peroxidase superfamily evolution

presented by Dominique Koua

Supervisors: Dr Christophe DUNAND (Laboratory of Plant Physiology, University of Geneva); Dr Nicolas HULO and Dr Christian J.A. SIGRIST (Swiss Institute of Bioinformatics, PROSITE group).

Geneva, April 18th, 2008

Abstract

Motivation:

Peroxidases (EC 1.11.1.x), which are encoded by small or large multigenic families, are involved in several important physiological and developmental processes. These proteins are extremely widespread and present in almost all living organisms. A large number of haem and non-haem peroxidase sequences are annotated and classified in the peroxidase database PeroxiBase (http://peroxibase.isb-sib.ch). PeroxiBase contains about 5800 peroxidase sequences classified as haem and non-haem peroxidases and distributed between thirteen superfamilies and fifty subfamilies (Passardi et al., 2007). However, only a few classification tools are available for the characterisation of peroxidase sequences: InterPro motifs, PRINTS and specifically designed PROSITE profiles. Moreover, these PROSITE profiles are very global: they allow neither the differentiation between sequences of very close subfamilies nor the prediction of specific cellular localisations. Due to the rapid growth in the number of available sequences, there is a need for continual updates and corrections of peroxidase protein sequences as well as for new tools that facilitate the acquisition and classification of existing and new sequences.

Currently, the way PROSITE generalised profiles are built and used does not allow the differentiation of sequences from subfamilies showing a high degree of similarity. Profiles for close subfamilies frequently overlap. Several options exist to improve the discriminatory power of PROSITE profiles. For instance, it is possible to use a more conservative BLOSUM matrix or to assign more or less weight to the substitution matrix used to build the profile. However, these two options only partially solve the problem of overlapping profiles.

On the other hand, as generalised profiles are used to trigger automatic annotation of UniProtKB sequences, this lack of discrimination is an important weakness when generalised profiles are used in large-scale annotation processes.

The aim of the project was to propose and to test a new approach to build generalised profiles in order to improve their discriminative capacities. The new technique is based on the silencing of residues (positions in the sequences) which are highly conserved among all the subfamilies, giving more importance to the really discriminative residues of each subfamily. The reliability of this new approach was tested using the Peroxidase database as a source of well annotated sequences.

Results:

− Construction of nearly 70 new profiles for peroxidases, which allowed a better characterisation of sequences and helped to confirm or facilitate the assignment of sequences to given subgroups. In addition, these profiles will be a new tool for peroxidase sequence identification in UniProtKB, increasing the number of correctly annotated peroxidases available in PeroxiBase and in UniProtKB. The new profiles will be added to the PROSITE database as well as to PeroxiBase.

− Addition to PeroxiBase of about a hundred sequences picked up by a profile-based search against UniProtKB.

− Development of an automated method for PROSITE profile building with a high discriminatory power for subclasses of a given set of sequences.

− Detection of annotation and/or classification mistakes in PeroxiBase and UniProtKB sequences.

− Development of a tool that automatically checks and summarises, in a UniRule format file, the homogeneous annotation lines contained in a set of UniProtKB accessions representing the match list for a given profile. This is the first step in the construction of a ProRule associated with the PROSITE profile.

Availability:

− Seventy-six new profiles built for existing and newly created families and subfamilies of peroxidases.

− A new profile-based classification tool for peroxidase sequences added to PeroxiBase: http:// .isb-sib.ch/class_scan.php .

− 14 ProRule files on non-animal peroxidases.

Content

1. Introduction
1.1. Overview on peroxidases and PeroxiBase
1.1.1. EC-based peroxidase classification
1.1.2. Haem-based classification
1.2. Current situation of peroxidase characterization tools
1.3. PROSITE: a motif database
1.3.1. PROSITE patterns
1.3.2. PROSITE generalised profiles
1.4. Profile calibration
1.5. Significance of generalised profile matches
1.6. Sequence annotation based on generalised profiles and rules
1.7. Improvement of discriminatory power of generalised profiles
1.8. Aim of the project
2. Methods
2.1. Presentation of the proposed approach
2.1.1. Superfamily profile building
2.1.2. Analysis preparation step
2.2. New methodology proposed
2.2.1. Choice of the initial set of trusted sequences
2.2.2. Construction of a multiple alignment
2.2.3. Multiple alignment annotation and profile construction
2.2.4. Profile cut-off value fixation
2.3. ProRule semi-automated construction methodology
3. Results
3.1. Peroxidase profiles for PeroxiBase
3.2. A new profile building methodology for subfamily classification
3.3. Improvement of existing programs
3.4. ProRule file editor
4. Discussion
4.1. Importance of profiles for peroxidase classification and evolution understanding
4.2. Advantages of the global approach
4.3. Human calibration steps
4.4. ProRule skeleton files
4.5. Weaknesses of the proposed methodology
5. Conclusion and perspectives
References
Acknowledgments
Appendices

1. Introduction

1.1. Overview on peroxidases and PeroxiBase.

Peroxidases are enzymes that use various peroxides (ROOH) as electron acceptors to catalyse a number of oxidative reactions. PeroxiBase is a unique repository exclusively dedicated to peroxidases and composed of multigenic families from both Eukaryotes and Prokaryotes. Even with the large extension of the database, it is still mainly composed of sequences originating from Viridiplantae. It includes fragments of peroxidase sequences, which are regularly verified for possible annotation updates and sequence completion. Partial genomes are also searched, particularly in bacteria. Old entries of complete sequences are frequently verified and updated if any changes have occurred. The goal of the peroxidase database is to centralise most peroxidase sequences, to follow the evolution of peroxidases among living organisms and to compile information concerning putative functions and transcription regulation.

1.1.1. EC-based peroxidase classification.

Peroxidases can be found under the same classification number EC 1.11.1.x, donor:hydrogen-peroxide (Fleischmann et al., 2004). Currently, 15 different EC numbers have been ascribed to peroxidases: from EC 1.11.1.1 to EC 1.11.1.16 (EC 1.11.1.4 was removed) (Table 1). Due to the presence of dual enzymatic domains, other peroxidase families were classified with the following numbers: EC 1.13.11.44, EC 1.14.99.1, EC 1.6.3.1 and EC 4.1.1.44 (Table 1). To date, certain peroxidases do not possess an EC number (Pxd, Pxt, AnPOX, NAnPrx, DypPrx, APx-CcP and CII) and can only be classified in EC 1.11.1.7. Two particular cases are also observed for EC 1.11.1.2 (NADPH peroxidase) and EC 1.11.1.3 (fatty acid peroxidase). Concerning EC 1.11.1.2, NADPH peroxidase activities have been observed in different studies (Conn et al., 1952; Hochman and Goldberg, 1991); however, no known peroxidase sequence has been assigned to this EC number, probably because none of the peroxidases known so far has a predominant NADPH peroxidase activity.

Peroxidasins, peroxinectins, other non-animal peroxidases, dyp-type peroxidases, hybrid ascorbate-cytochrome c peroxidases and other class II peroxidases do not possess an EC number. The two independent EC numbers (1.11.1.9 and 1.11.1.12) both correspond to glutathione peroxidase and are based on the electron acceptor (hydrogen peroxide or lipid peroxide, respectively).

EC number | Recommended name | Abbreviation in PeroxiBase
EC 1.11.1.1 | NADH peroxidase | NadPrx
EC 1.11.1.2 | NADPH peroxidase | No sequence available
EC 1.11.1.3 | Fatty acid peroxidase | No sequence available (aDox?)
EC 1.13.11.11 (previously EC 1.11.1.4) | Tryptophan 2,3-dioxygenase | Not considered as a peroxidase any longer
EC 1.11.1.5 | Cytochrome-c peroxidase | CcP, DiHCcP
EC 1.11.1.6 |  | Kat, CP
EC 1.11.1.7 | Peroxidase | Haem peroxidases
EC 1.11.1.8 | Iodide peroxidase | TPO
EC 1.11.1.9 |  | GPx
EC 1.11.1.10 |  | HalPrx, HalNPrx, HalVPrx
EC 1.11.1.11 | l- | APx
EC 1.11.1.12 | Phospholipid-hydroperoxide glutathione peroxidase | GPx
EC 1.11.1.13 |  | MnP
EC 1.11.1.14 |  | LiP
EC 1.11.1.15 |  | 1CysPrx, 2CysPrx, PrxII/V/PrxGrx, PrxQ/BCP
EC 1.11.1.16 |  | VP
EC 1.13.11.44 | Linoleate diol synthase | LDS
EC 1.14.99.1 | Prostaglandin-endoperoxide synthase | PGHS
EC 1.6.3.1 | NAD(P)H oxidase | DuOx
EC 4.1.1.44 | 4-Carboxymuconolactone decarboxylase | AhpD, CMD, CMDn, HCMD, HCMDn, DCMD, DCMDn, AlkyPrx, AlkyPrxn

Table 1: The International Union of Biochemistry classification of peroxidases.

1.1.2. Haem-based classification.

In PeroxiBase, sequences are organised according to the presence or absence of haem. Sequences are divided between haem peroxidases and non-haem peroxidases (Figure 1).

Figure 1: Peroxidase classification based on the presence of haem.

Genes encoding haem peroxidases can be found in almost all kingdoms of life. They are grouped in two major superfamilies: one mainly found in bacteria, fungi and plants (Passardi et al., 2007b) and a second mainly found in animals, fungi and bacteria (Daiyasu and Toh, 2000 and Furtmuller et al., 2006).

Members of the superfamily of plant/fungal/bacterial peroxidases (non-animal peroxidases) have been identified in the majority of living organisms except animals.

Three independent classes can be distinguished:

− class I, which includes ascorbate peroxidase (APx), cytochrome c peroxidase (CcP) and catalase peroxidase (CP);

− class II, which includes lignin peroxidases (LiP), manganese peroxidases (MnP), versatile peroxidase (VP);

− class III (Ruiz-Dueñas et al., 2001).

The second superfamily, described as “animal peroxidases”, comprises a group of homologous proteins mainly found in animals and categorised as follows: myeloperoxidase (MPO); eosinophil peroxidase (EPO); lactoperoxidase (LPO); thyroid peroxidase (TPO); prostaglandin H synthase (PGHS); peroxidasin (Pxd) and peroxinectin (Pxt). Homologous animal peroxidase sequences from the fungal kingdom can also be classified as PGHS.

Two other groups of proteins which contain the typical 600 amino acid-long “animal” peroxidase domain have also been added to this widespread superfamily: the dual oxidase or thyroid NADPH oxidase (DuOx) and the alpha dioxygenase found in plants (aDox). Finally, a group of peroxidases that do not fall in the above defined groups (POX) was also included. Because of structural and functional peculiarities that are independent of origin, the “animal peroxidase” superfamily has also been named the “peroxidase-cyclooxygenase superfamily”.

In addition to these two large superfamilies, smaller protein families are classified as capable of reducing peroxide molecules: catalase (Kat), which can also oxidise hydrogen peroxide (a unique feature), di-haem cytochrome c peroxidases (DiHCcP), dyp-type peroxidases (DypPrx), and haloperoxidases with (HalPrx) or without (HalNPrx, HalVPrx) haem.

Non-haem peroxidases are mainly grouped in the following superfamilies: alkylhydroperoxidase D-like superfamily (AhpD, CMD, HCMD, DCMD, and AlkyPrx), NADH peroxidase (NadPrx), manganese catalase (MnCat) and two subfamilies of thiol peroxidases: glutathione peroxidases (GPx) and peroxiredoxins (1CysPrx, 2CysPrx, PrxII/PrxV/PrxGrx and PrxQ/BCP). Haem and non-haem peroxidases are found in all kingdoms; they may hence become key markers for the evolution of living organisms. PeroxiBase represents a powerful tool for an efficient analysis and a better understanding of the evolution of protein superfamilies, catalytic domains and peroxidase activity in all kingdoms of life.

1.2. Current situation of peroxidase characterization tools.

The two large groups of haem peroxidases are characterised by a general InterPro entry IPR010255 (http://www.ebi.ac.uk/interpro/IEntry?ac=IPR010255) representing the whole haem peroxidase family.

Animal peroxidases are all defined by the InterPro accession IPR002007 and the PROSITE profile PS50292.

Currently, a general PROSITE profile (PS50873) and a general fingerprint (InterPro accession IPR002016 and PR00458) establish membership of the “non-animal peroxidase” superfamily. This superfamily is composed of three independent classes, which are defined by four fingerprints:

− PR00459 for ascorbate peroxidase and cytochrome c peroxidase. This fingerprint does not distinguish between the two subfamilies, and even gives similar results with degenerate sequences. Ascorbate peroxidases can be separated, according to their cellular targeting, into cytosolic, peroxisomal, stromal and thylakoid-bound forms. Classification by cellular localisation cannot be performed with the current profile.

− PR00460 allows clear identification of catalase peroxidases.

− PR00462 is the only existing profile to characterise class II peroxidases.

− Finally, class III peroxidases or secreted plant peroxidases are characterised by a specific profile (InterPro IPR000823 and Prints PR00461).

The large panel of animal peroxidase subfamilies is characterised by only one profile (PS50292). Various PFAM profiles and Prints motifs are available but cannot be used to correctly characterise animal peroxidase subclasses.

Among non-haem peroxidases, only glutathione peroxidases have a general pattern in PROSITE (PS00460). A small number of fingerprints that can be used for the characterisation of some other non-haem subfamilies are reported (http://peroxibase.isb-sib.ch/classes.php).

All the details corresponding to the above described families and classes, as well as the other peroxidase sequences presented in PeroxiBase, are summarised in a table representing the situation of available characterisation tools before our project (Appendix 1).

Only a few classification tools are available for peroxidase sequence characterisation: about thirty InterPro motifs and twelve PROSITE profiles and/or patterns for the main superfamilies. In order to address this need, specifically designed profiles have been constructed and used internally in PeroxiBase to characterise sequences. However, these profiles are very global and do not allow the differentiation between sequences of very close subfamilies, or between sequences with different cellular localisations. Due to the rapid growth in the number of available sequences, there is a need for continual updates and corrections of peroxidase protein sequences as well as for new tools to improve the classification of existing and new sequences.

The present project was initiated in order to improve and complete these characterisation tools for all the subfamilies currently available in the database. We focused especially on PROSITE profiles as they give an interesting discriminative power for several other protein families. Nevertheless, the greatest challenge is to produce very discriminative profiles that can separate functional subfamilies and ensure discrimination between sequences with weak residue variations and/or sequences that are differently localised.

1.3. PROSITE: a motif database.

Among the various databases dedicated to the identification of protein families and domains, PROSITE was the first one to be created and has continuously evolved since (Bairoch, 1992; Hulo et al., 2008). PROSITE currently consists of a large collection of biologically meaningful motifs that are described as patterns or profiles, and linked to documentation briefly describing the protein family or domain they are designed to detect (Sigrist et al., 2002).

1.3.1. PROSITE patterns.

Patterns are regular expressions matching short sequence motifs, usually of biological significance (http://www.itc.virginia.edu/achs/molbio/localsupport/PROSITE_Syntax.htm). Patterns are written using the standard IUPAC one-letter codes for the amino acids. For instance, the pattern [AC]-x-V-x(4)-{ED} is translated as: [Ala or Cys]-any-Val-any-any-any-any-{any but not Glu or Asp}.
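As an illustration of the pattern syntax described above, the following minimal sketch translates a PROSITE pattern into a Python regular expression. The function name `prosite_to_regex` is hypothetical, and the sketch covers only the basic elements (x, [..], {..}, repetition counts and the < and > anchors); it is not the implementation used by the PROSITE tools.

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Translate a basic PROSITE pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        # '<' anchors the pattern at the N-terminus, '>' at the C-terminus.
        prefix = "^" if element.startswith("<") else ""
        suffix = "$" if element.endswith(">") else ""
        element = element.strip("<>")
        # Split the element into its core and an optional (n) or (n,m) count.
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = m.group(1), m.group(2), m.group(3)
        if core == "x":
            core = "."                       # any amino acid
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"   # any amino acid except those listed
        # '[AC]' and single letters are already valid regex fragments.
        repeat = ""
        if low:
            repeat = "{" + low + ("," + high if high else "") + "}"
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)

regex = prosite_to_regex("[AC]-x-V-x(4)-{ED}")
print(regex)
print(bool(re.search(regex, "AGVKLMNQ")))  # last residue is neither E nor D
print(bool(re.search(regex, "AGVKLMNE")))  # last residue is E, so no match
```

As the text notes, such a match is purely qualitative: the regular expression either matches or does not, and no score is attached to the hit.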

Patterns are qualitative motif descriptors. As they are regular expressions, they either match or do not: there is no threshold above which a match can be considered statistically significant. Pattern accuracy is evaluated from statistics on the number of hits obtained when scanning a trusted database (UniProtKB, for instance) or a randomised database.

The advantages of patterns are their easy intelligibility for the user and the fact that patterns are directed against the most conserved residues, which are often the most relevant for the biological function of the protein family or domain. Another advantage of patterns is that the scan of a protein database with patterns can be performed in a reasonable time on any computer.

Although patterns are widely useful, they have intrinsic limitations in identifying distant homologues because they do not accept any mismatch. Therefore, patterns are kept for historical reasons and are now rarely used.

1.3.2. PROSITE generalised profiles.

A profile or weight matrix (the two terms are used synonymously here) is a table of position-specific amino acid weights and gap costs (Bucher and Bairoch, 1994). These numbers (also referred to as scores) are used to calculate a similarity score for any alignment between a profile and a sequence, or parts of a profile and a sequence. An alignment with a similarity score higher than or equal to a given cut-off value constitutes a motif occurrence. As with patterns, there may be several matches to a profile in one sequence, but multiple occurrences in the same sequence must be disjoint (non-overlapping) according to a specific definition included in the profile.

The profile structure used in PROSITE is similar to but slightly more general than the one introduced by Gribskov and co-workers (Gribskov et al., 1990). Additional parameters allow representation of other motif descriptors, including the currently popular hidden Markov models (HMMs). A technical description of the profile structure and of the corresponding motif search method is given in the file PROFILE.TXT included in each PROSITE release (ftp://ftp.expasy.org/databases/prosite/profile.txt).

Profiles can be constructed by a large variety of different techniques. The classical method developed by Gribskov and co-workers requires a multiple sequence alignment as input and uses a symbol comparison table to convert residue frequency distributions into weights. The profiles included in the current PROSITE release were generated by this procedure applying modifications described by Lüthy and co-workers (Lüthy et al., 1994).

Currently, PROSITE profiles are constructed based on weighted and annotated multiple alignments (Figure 2). The weighting is achieved with pfw which computes Voronoi weights. Then the alignment is processed by xpsa2annot.pl which adds annotation lines. The annotations are based on weighted counts.

For each column of the multiple alignment, the occurrences of each element of the alphabet (here, the amino acids plus a symbol representing a deletion) are counted, each occurrence being multiplied by the weight of the sequence in which it appears; this yields the weighted counts. Each column is then labelled according to the weighted count of the deletion symbol: MATCH if it is zero (the position does not correspond to a deletion), DEL if it is greater than zero but less than 0.5 (the position is a deletion) and INS if it is greater than 0.5 (the position is an amino acid insertion).
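The column labelling rule can be sketched as follows. This is a minimal illustration and not the code of xpsa2annot.pl: the function name `label_columns`, the use of "-" as the deletion symbol and the normalisation of the weights to a total of 1 are assumptions. The text does not say how a weighted count of exactly 0.5 is handled; the sketch arbitrarily labels it INS.

```python
GAP = "-"  # assumed deletion symbol

def label_columns(alignment, weights):
    """Label each alignment column as MATCH, DEL or INS.

    alignment: list of aligned sequences of equal length
    weights:   one weight per sequence (e.g. Voronoi weights)
    """
    total = sum(weights)
    labels = []
    for col in range(len(alignment[0])):
        # Weighted count of the deletion symbol, normalised to [0, 1].
        gap_weight = sum(
            w for seq, w in zip(alignment, weights) if seq[col] == GAP
        ) / total
        if gap_weight == 0:
            labels.append("MATCH")
        elif gap_weight < 0.5:
            labels.append("DEL")
        else:
            labels.append("INS")  # the 0.5 boundary case is an assumption
    return labels

alignment = ["ACDE", "A-DE", "A--E", "A--E"]
weights = [0.4, 0.2, 0.2, 0.2]
print(label_columns(alignment, weights))
```

In this toy alignment the second column is mostly gaps by weight and is therefore treated as an insertion, whereas the third column carries less than half of the total weight in gaps and is labelled as a deletion.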

Finally, the weight matrix is produced by apsimake.pl. Apsimake profiles are constructed based on a mix between observed frequencies and pseudo-counts, as for PSI-BLAST (Altschul et al., 1997). The pseudo-counts are obtained by processing the weighted counts with substitution frequencies from a BLOSUM matrix and the probabilities of the different amino acids in each position of the alignment. The mix leads to the calculation of a PSI value. Currently, the PSI value is set uniformly for all the positions of the alignment.

Apsimake also integrates information on gap penalties: gap opening and gap extension costs, gap costs for transitions between matches, deletions and insertions. All these profile construction parameters can be set and tuned manually or automatically in an external script to improve the discriminatory power of the profiles.

Profiles built in these conditions are more sensitive and more robust than patterns because they provide discriminatory weights not only for the residues already found at a given position of a motif but also for those not yet found. The weights for those not yet found are extrapolated from the observed amino acid compositions using empirical knowledge about amino acid substitutability (e.g. BLOSUMxx matrices).
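The mixing of observed frequencies and pseudo-counts can be illustrated with the scheme published for PSI-BLAST (Altschul et al., 1997). Whether apsimake.pl uses exactly this form is an assumption; the function name `mixed_frequencies`, the two-letter toy alphabet and the values of `alpha` and `beta` are purely illustrative.

```python
import math

def mixed_frequencies(observed, background, target, alpha, beta):
    """Mix observed frequencies with pseudo-counts for one alignment column.

    observed:   residue -> weighted observed frequency f_i
    background: residue -> background probability P_i
    target:     (i, j)  -> target frequency q_ij of the substitution matrix
    alpha/beta: relative weights of observations and pseudo-counts
    """
    # Pseudo-count frequency: g_i = sum_j (f_j / P_j) * q_ij
    pseudo = {
        i: sum(observed[j] / background[j] * target[(i, j)] for j in observed)
        for i in observed
    }
    # Mixed estimate: Q_i = (alpha * f_i + beta * g_i) / (alpha + beta)
    return {
        i: (alpha * observed[i] + beta * pseudo[i]) / (alpha + beta)
        for i in observed
    }

# Toy two-letter alphabet; all numbers are illustrative only.
background = {"A": 0.5, "B": 0.5}
target = {("A", "A"): 0.4, ("A", "B"): 0.1, ("B", "A"): 0.1, ("B", "B"): 0.4}
observed = {"A": 1.0, "B": 0.0}  # only A was seen in this column

q = mixed_frequencies(observed, background, target, alpha=3, beta=1)
weights = {i: math.log2(q[i] / background[i]) for i in q}  # log-odds weights
print(q)
print(weights)
```

The point of the mix is visible in the toy column: residue B was never observed, yet it receives a finite (negative) weight instead of an infinitely bad one, because the substitution matrix indicates that A can occasionally be replaced by B.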

Unlike patterns, profiles are usually not confined to small regions with high sequence similarity. Rather, they attempt to characterize a protein family or domain over its entire length. This can lead to specific problems not arising with PROSITE patterns. With a profile covering conserved as well as divergent sequence regions, there is a chance to obtain a significant similarity score even with a partially incorrect alignment.

This possibility is taken into account by the current quality evaluation procedures. In order to be acceptable, a profile must not only assign high similarity scores to true motif occurrences and low scores to false matches, but should also correctly align those residues having analogous functions or structural properties according to experimental data.

For a profile and a sequence of typical lengths, there is a very large number of possible alignments. At most a few of them will be biologically relevant.

The profile-alignment scores are called raw scores. In most cases, they will not lend themselves to meaningful biological interpretations and will therefore not be very helpful in the interpretation of a potential match. In practice, one is interested in questions like: What is the probability of finding a match of a certain score in a random sequence? How does the similarity score relate to a measurable property of the biological object?

The purpose of normalised scores is to convert the raw score into directly interpretable units. The normalised scores are used to compare lists of protein-profile matches and to fix cut-off values for each profile (Bucher et al., 1996).

The function of a cut-off value is to a priori exclude a large number of alignments from further consideration by a profile search algorithm. The fate of the remaining alignments with similarity scores greater than or equal to the cut-off value depends on the specific disjointness definition applied. An important aspect of a cut-off value is that it gives a qualitative meaning to a profile. This is a prerequisite for statistics on false positives and false negatives obtained in a database search, as currently provided by PROSITE.

In certain situations, it is useful to supply more than one cut-off value, partitioning the range of alignment scores into multiple areas. The areas may correspond to different degrees of certainty, ranges of evolutionary distance, or levels of physiological activity.

Figure 2. Schematic representation of the construction process of a PROSITE generalised profile.

1.4. Profile calibration.

The standard calibration with pfcalibrate.pl is performed just after the profile construction.

This calibration method is used to estimate the significance of profile matches. Each profile is compared against a random database to produce a list of high-scoring profile matches sorted by score.

Two kinds of random databases are commonly used:

− reversed: formed by taking the reverse sequence of each individual entry. It is the most commonly used but is not appropriate for profiles of regularly spaced repeated amino acids such as cysteines in zinc fingers or hydrophobic residues in helix-loop-helix domains.

− Window20: a regionally shuffled version of UniProtKB proteins preserving the original length distribution and amino acid composition in successive windows of length 20.
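The two randomisation schemes listed above can be sketched in a few lines. This is an illustration only and not the procedure used to build the PROSITE calibration databases; the function names `reversed_entry` and `window_shuffle` are hypothetical.

```python
import random

def reversed_entry(sequence: str) -> str:
    """'reversed' database: each entry is read backwards."""
    return sequence[::-1]

def window_shuffle(sequence: str, window: int = 20, rng=None) -> str:
    """'Window20'-style database: shuffle residues inside successive windows.

    Length and local amino acid composition are preserved because residues
    never leave the window in which they were found.
    """
    rng = rng or random.Random()
    chunks = []
    for start in range(0, len(sequence), window):
        chunk = list(sequence[start:start + window])
        rng.shuffle(chunk)
        chunks.append("".join(chunk))
    return "".join(chunks)

seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQ"
print(reversed_entry(seq))
print(window_shuffle(seq, window=20, rng=random.Random(0)))
```

The sketch also shows why reversal is unsuitable for regularly spaced motifs: a sequence with a cysteine every few residues keeps that spacing when read backwards, whereas shuffling within windows destroys it.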

The score distribution obtained against the randomised database is then analysed by plotting the logarithm of the number of observed matches above a given score against the score itself. Such a plot typically shows an approximately linear relationship between these two variables, which would be expected for an extreme value distribution:

where NDB is the number of residues in the database. The parameters a and b are estimated by linear regression analysis and used to calculate a normalised score:

Note that a and b are characteristic parameters of a profile that need to be re-estimated whenever a profile is modified.

In the profile entries, a and b are called R1 and R2 respectively:

MA /NORMALIZATION: MODE=1; FUNCTION=LINEAR; R1=2.7122869; R2=0.0061759; TEXT='-LogE';
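The regression and normalisation formulas themselves are not reproduced in this text. A minimal sketch, assuming that FUNCTION=LINEAR in the line above means that the normalised score is R1 + R2 × raw score, is given below; the function names are hypothetical and the parameter values are those of the example line.

```python
R1 = 2.7122869  # parameter a of the example profile
R2 = 0.0061759  # parameter b of the example profile

def normalised_score(raw_score: float, r1: float = R1, r2: float = R2) -> float:
    """Convert a raw profile score into a normalised score (assumed linear form)."""
    return r1 + r2 * raw_score

def raw_cutoff(normalised: float, r1: float = R1, r2: float = R2) -> float:
    """Invert the linear relation: raw score needed to reach a normalised score."""
    return (normalised - r1) / r2

print(normalised_score(1000.0))
print(raw_cutoff(8.5))
```

Because a and b are specific to each profile, the same raw score corresponds to different normalised scores in different profiles, which is why the normalised scale is the one used to compare match lists.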

1.5. Significance of generalised profile matches.

When scanning a database with a profile, the challenge is to establish which sequences are really related (true positives) and which are unrelated (false positives). For this purpose, thresholds have to be defined in order to capture the majority of true positive family members and to avoid false positives. This task can be difficult since the distributions of match scores of homologous and non-homologous proteins often overlap to some extent. When the threshold is set too high, some true positives will be missed (false negatives). On the other hand, with a threshold that is too low, false positives will be erroneously included in the match list.

In order to overcome this problem, at least two cut-off values are supplied in PROSITE profiles. The LEVEL = 0 cut-off is placed at a sufficiently high score, above which no false positive matches are expected to be detected. Usually in PROSITE files, the LEVEL = 0 cut-off is placed at a normalised score of 8.5. The LEVEL = -1 cut-off is specified to preclude false negatives and is arbitrarily set at a normalised score of 6.5. Clearly, matches scoring between LEVEL = -1 and LEVEL = 0 need further investigation to assess their validity. In PROSITE, a post-processing competition has been implemented to deal with this.
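The two-level scheme can be summarised by a small helper. The function name `match_level` is hypothetical; the default thresholds are the usual values quoted above, and the comparison is inclusive because a match is defined as a score higher than or equal to the cut-off.

```python
def match_level(normalised, level0=8.5, level_minus1=6.5):
    """Return 0 for a trusted match, -1 for a weak match, None below both cut-offs."""
    if normalised >= level0:
        return 0       # no false positives expected at or above this score
    if normalised >= level_minus1:
        return -1      # weak match: needs further investigation
    return None        # not reported

for score in (9.2, 8.5, 7.0, 6.0):
    print(score, match_level(score))
```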

1.6. Sequence annotation based on generalised profiles and rules.

The ProRule section of PROSITE is constituted of manually created rules. ProRules increase the discriminatory power of PROSITE motifs (generally profiles) by providing additional information about functionally and/or structurally critical amino acids and can automatically generate annotation based on PROSITE profiles (Sigrist et al., 2005).

ProRule uses the UniRule format that is common to all types of rules created to annotate UniProtKB (http://www.expasy.ch/unirule/unirule_web_view.html#General), including the family rules derived from the High-quality Automated and Manual Annotation of microbial Proteomes (HAMAP, http://www.expasy.ch/sprot/hamap/). Each rule contains information used to provide template-based annotation associated with the domain or family detected by the PROSITE profile. ProRule is used to create UniProtKB lines with basic and complex annotation derived from the presence of the domain and of biologically critical amino acids: domain name and boundaries, EC number, function, keywords, associated PROSITE patterns, PTMs, active sites, disulfide bonds, etc. ProRule notably contains the position of structurally and/or functionally critical amino acid(s), as well as the condition(s) they must fulfil to play their biological role(s).

Part of this supplementary data is used by ScanProsite, which provides not only the protein sequence matched by a profile but also information about the relevance of biologically meaningful residues, like active sites, binding sites, post-translational modification sites or disulfide bonds, to help function determination (http://www.expasy.org/tools/scanprosite/).

It appears thus that to be really useful for the automatic annotation of sequences, the PROSITE profiles must have a high discriminatory power to ensure the distinction between sequences belonging to very close subfamilies. Information included in the corresponding rules as well as the triggered annotation will be highly improved if the profiles are discriminative.

1.7. Improvement of discriminatory power of generalised profiles.

Currently, four main methods are used to improve the discriminatory abilities of profiles.

The manual treatment of the initial multiple sequence alignment.

This step is crucial as the quality of the alignment has a considerable impact on the profile quality (Notredame, 2002). The goal is to retain as much diversity as possible in the sequence alignment and to ensure that the conserved residues are well aligned. Several programs for alignment treatment are available. The most frequently used include T-Coffee and CLUSTAL-W, and Jalview for manual editing.

The choice of the substitution matrix.

For the profile construction, BLOSUM matrices are used as substitution scoring matrices (Henikoff and Henikoff, 1994). The choice of the BLOSUM matrix will allow the profile to be more or less predictive. For instance, using a more conservative BLOSUM matrix will lead to a profile very close to the initial multiple alignment used as input.

The choice of the relative weight of the substitution matrix.

For a given BLOSUM matrix, the frequency of a given residue in the initial alignment can be used to balance the weight of this residue at that particular position. Currently, there are tools that weight the observed counts against the "pseudo-counts" using the relative number of independent observations in the multiple alignment, as described for PSI-BLAST. These tools are used to modify the background layer annotation applied to build the profile.

Competition between profiles

A post-processing step has been added in the profile scanning procedure (Hulo et al., 2008). It allows the display of weak matches or the masking of strong ones according to the occurrence of other specific features in the protein. The required features for the post-processing step are stored in a new line type in profiles (PP). Three types of post-processing are defined, but the one of interest for the distinction between subfamilies is formatted as follows:

PP /COMPETES_HIT_WITH: PS-accession1;PS-accession2.

This last method is mainly used when at least two matches overlap. In this case, only the one with the highest score is reported.

1.8. Aim of the project.

Powerful tools are needed to improve the classification of the peroxidase sequences included and curated in PeroxiBase. PROSITE profiles have been chosen as a major tool because of their ability to discriminate between families and functional domains. However, some problems remain in their direct use for closely related subfamilies.
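The competition rule can be sketched as follows: when matches from profiles declared as competitors overlap, only the highest-scoring one is kept. This is an illustration and not the PROSITE post-processing code; the `Match` record, the function names and the accessions PS_A and PS_B are hypothetical.

```python
from typing import NamedTuple

class Match(NamedTuple):
    profile: str   # accession of the matching profile
    start: int     # first matched residue (inclusive)
    end: int       # last matched residue (inclusive)
    score: float   # normalised score

def overlaps(a: Match, b: Match) -> bool:
    """True if the two matches share at least one residue."""
    return a.start <= b.end and b.start <= a.end

def resolve_competition(matches, competitors):
    """Keep, among overlapping matches of competing profiles, the best scoring one.

    competitors: profile accession -> set of accessions it competes with
    """
    kept = []
    # Visit matches from best to worst so that a kept match always outranks
    # any later match it may overlap with.
    for m in sorted(matches, key=lambda x: x.score, reverse=True):
        beaten = any(
            k.profile in competitors.get(m.profile, set()) and overlaps(m, k)
            for k in kept
        )
        if not beaten:
            kept.append(m)
    return kept

matches = [
    Match("PS_A", 10, 200, 9.1),
    Match("PS_B", 15, 190, 12.4),
    Match("PS_A", 300, 400, 7.0),
]
competitors = {"PS_A": {"PS_B"}, "PS_B": {"PS_A"}}
for m in resolve_competition(matches, competitors):
    print(m)
```

In this example the first PS_A match overlaps a better-scoring PS_B match and is dropped, while the second PS_A match lies in a different region of the sequence and is reported.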

Currently, the automated profile building methodology for closely related subclasses is not stably defined. In addition, few profiles are available to characterize peroxidase subfamily sequences and assist the profile­based classification.

This project aims to establish a method for automatic construction of highly discriminative subprofiles and apply it to help the understanding of the peroxidase superfamily evolution. The project focuses on the determination of points in the standard methodology that can be adjusted to improve the discriminative power of generalised profiles. For instance, the weighting steps of sequences in multiple alignments, the choice of BLOSUM substitution matrices, the constraints applied during profile calibration and the cut­off level assignment are object of particular attention. In addition, human steps are introduced in the profile construction protocol. In these manual steps, parameters of the indicated criteria are modified according to the result obtained at different levels of the process and the whole protocol is relaunched.

2. Methods

2.1. Presentation of the proposed approach.

The main idea of the methodology applied to build highly discriminative profiles is based on the splitting of a global annotated alignment containing all the sequences of the whole hierarchy of superfamilies and subfamilies of a given family. All the sequences are aligned based on a general profile, then the residues identified to be common to all the sequences are masked. The general alignment is then split into subfamily sequence files. Each file contains information about the previous global masking step and is used to build the new subprofile; the files serve as an initial set of true positive sequences and will be used for the subprofile testing and validation. The subprofiles are built following nearly the same method used for general profiles. The major difference is that subprofiles cut­off values are fixed after a two­step (first automatic, then manual) process.
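The splitting step can be illustrated with a minimal Python sketch. The "|"-separated header layout and the function name are hypothetical; the original pipeline used Perl scripts whose exact header format is not given here.

```python
from collections import defaultdict


def split_by_subfamily(aligned):
    """Group aligned sequences by the last hierarchy label in their header.

    aligned: dict mapping a header such as "seq1|ClassI|APx" to its aligned
    (gapped) sequence. The header layout is an assumption for illustration.
    """
    groups = defaultdict(dict)
    for header, seq in aligned.items():
        subfamily = header.split("|")[-1]
        groups[subfamily][header] = seq
    return dict(groups)


global_alignment = {
    "seq1|ClassI|APx": "HRDA-",
    "seq2|ClassI|APx": "HRDG-",
    "seq3|ClassI|CcP": "HKE-T",
}
for name, members in split_by_subfamily(global_alignment).items():
    print(name, sorted(members))
```

Because every subset keeps the column layout of the global alignment, annotations computed on the whole alignment (such as the masking) remain valid for each subset.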

To be applicable, this methodology has two major requirements:

− a general profile for the whole family,

− an adequate preparation of the sequences.

2.1.1. Superfamily profile building.

In order to construct each superfamily profile, a multiple cycle process has been performed with the steps shown in Table 2.

This methodology of general profile construction is currently available internally to the PROSITE group as a web tool named PSMaker (http://anabelle.vital-it.ch/psmaker/cgi-bin/psmaker.cgi).

During the procedure, at each cycle, new sequences are added to the initial set and contribute to improve the quality of the resulting profile.

| Step n° | Description |
|---|---|
| 1 | Input: a multiple alignment in MSF format. Remove redundancy in the alignment with Tcoffee_trim. Weight the MSF multiple alignment with pfw. Add annotation layers to the weighted alignment using xpsa2annot.pl. Output: an annotated fasta alignment. |
| 2 | Build the profile from the annotated alignment using APSImake.pl. Calibrate the profile against a database (reversed or Windows20). |
| 3 | Search matches for the newly built profile (pfsearch/ps_scan) in UniProtKB. Compare the domain covered by the new profile with the coverage of other motifs (InterPro). |
| 4 | Use the obtained match list to improve the profile (return to step 1 for a new cycle) or finish. |

Table 2. Generalised profile building steps.

2.1.2. Analysis preparation step.

This step consists of:

− a particular sequence preparation step. During this step, the family hierarchy is included in sequence fasta headers. This makes sequences easily recognizable for further analysis.

− construction of a special data structure for the family tree. The family tree (Figure 1) is represented as a perl hash table containing the whole hierarchy and its organisation. Each level of the tree is associated with an acronym.

These operations are useful for the initial MSA splitting, for automatic cut­off fixation, and for the automatic completion of the competition lists during subprofile construction and calibration.
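A family tree of this kind can be sketched in Python with a nested dictionary, the analogue of a Perl hash. The header layout (identifier followed by hierarchy labels separated by "|") and both function names are assumptions. Treating the siblings of a subfamily as candidates for its competition list is one plausible use of the tree, not a documented rule of the pipeline.

```python
def build_family_tree(headers):
    """Build a nested dict from the hierarchy labels found in fasta headers."""
    tree = {}
    for header in headers:
        node = tree
        for label in header.split("|")[1:]:   # skip the sequence identifier
            node = node.setdefault(label, {})
    return tree


def siblings(tree, path):
    """Return the other labels found at the same level as the last label of path."""
    node = tree
    for label in path[:-1]:
        node = node[label]
    return sorted(k for k in node if k != path[-1])


headers = ["s1|ClassI|APx", "s2|ClassI|CcP", "s3|ClassII|LiP"]
tree = build_family_tree(headers)
print(tree)
print(siblings(tree, ["ClassI", "APx"]))
```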

2.2. New methodology proposed.

To increase the specificity, subprofiles have been constructed following a more complex procedure described in Figure 3. As the general scheme is the same as for general profiles (already presented), only the following methodologically important points are discussed:

− the choice of the initial sequence set,

− the multiple alignment: done based on a superfamily profile,

− the general alignment labelling and

− the cut­off value setting after the raw subprofile is obtained from the split general alignment.

2.2.1. Choice of the initial set of trusted sequences

The construction of each new profile entry begins with a set of sequences, either total proteins or local homology domains, which can be assumed to belong to the same family. It is important not to include sequences with doubtful relationship to the family under consideration since even a single inappropriate sequence can severely degrade profile performance by modifying the score of the obtained profile.

In our case, the set of sequences used to define the new profiles were correctly annotated and well organised. Indeed, the peroxidase­encoding sequences available in PeroxiBase are human curated and annotated. To improve the quality of the profile to be built, only complete sequences were included in the initial sequence sets.

Figure 3: Schematic representation of the method proposed for the construction of subfamily profiles.

2.2.2. Construction of a multiple alignment.

Two different alignment methods were used for the construction of multiple alignments.

For a general profile construction: we started with a multiple alignment generated by ClustalW or MUSCLE. In most cases the alignment includes divergent proteins, so a manual refinement of the initial multiple alignment was necessary. The very divergent sequences were excluded from the initial alignment but were considered (with partial sequences) for the cut­off assignment step.

For subclass profile construction, the initial aligned set of sequences was produced by a pfsearch analysis using the general profile against the complete sequences of the database in fasta format. The pfsearch result is a match list with sequences aligned according to the positions in the given profile. In some cases, the alignments were manually edited and rearranged using Jalview (for example, to remove irregular N or C termini).

2.2.3. Multiple alignment annotation and profile construction.

Annotations have been added with xpsa2annot.pl, which also allowed the masking of highly conserved residues (frequency greater than 0.8) and the modification of annotation lines according to the BLOSUM matrices and background parameters. The profile construction was carried out with apsimake.pl and the calibration with pfcalibrate running on the Expasy cluster machines.
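The selection of positions to mask can be sketched in Python as follows. The 0.8 threshold comes from the text above; the function name, the gap symbols and the choice to compute frequencies over all sequences (gaps included in the denominator) are assumptions, and xpsa2annot.pl may proceed differently.

```python
from collections import Counter


def conserved_columns(alignment, threshold=0.8):
    """Return the indices of columns whose most frequent residue exceeds threshold.

    alignment: list of equal-length aligned sequences; "-" and "." are gaps.
    """
    n = len(alignment)
    masked = []
    for i in range(len(alignment[0])):
        counts = Counter(seq[i] for seq in alignment if seq[i] not in "-.")
        if counts and counts.most_common(1)[0][1] / n > threshold:
            masked.append(i)
    return masked


alignment = ["HRDA", "HRDG", "HRE-", "HKDC", "HRDT", "HRDV"]
print(conserved_columns(alignment))   # columns 0, 1 and 2 are conserved
```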

2.2.4. Profile cut­off value fixation.

Once the profile had been calibrated, it was run against the whole database using pfsearch. The match list was then processed to determine a new cut-off value. The automatic cut-off value was calculated from the scores of the first false positive and the last true positive in the match list. This automatic value was then checked manually and, if accepted, replaced the default profile cut-off value at LEVEL = 0, which is 8.5.

When the cut­off value is fixed, a new normalised score is computed and the profile is updated accordingly.

2.3. ProRule semi-automated construction methodology.

The aim of this part is the development of a tool which automatically checks and summarises, in a UniRule format file, the homogeneous annotation lines contained in a set of UniProtKB entries. The input is a file of UniProtKB sequence IDs representing the match list of the given profile. This file is named with the AC number of the profile. Two intermediate files are produced: a table presentation of the splitting operation and a log file of the data structure. The final output file contains the skeleton of the rules, which will be completed and corrected by UniProtKB annotators. The main steps of the ProRule file pre-construction are:

(i). Creation of a data structure:

Entries of the input file are split and annotations are redistributed under global labels which will be included in the final rule file: DE (Description), CC (Comment lines), KW (keywords), GO (Gene Ontology) and OC (Organism Classification).

Particular attention was given to two points for the data structure building:

− Conserve the order of concepts as adopted for UniProtKB entry annotation (http://beta.uniprot.org/docs/userman.htm). This is important because rules trigger annotation for unannotated entries and must respect a defined information appearance order.

− Report all the different annotations encountered under a given tag. For instance, if two or three GO terms are reported, or if two functions are indicated for at least one of the given sequences, the data structure must contain all of these GO terms and functions in separated columns. This makes the data structure useful for detecting annotation mistakes and/or absence of annotation.

The data structure itself is an indexed perl hash containing the frequencies of the reorganised annotations.
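A data structure of this kind can be sketched in Python with counters grouped by tag. The input used here is a simplified, already-parsed stand-in for UniProtKB entries (a dict from tag to annotation strings); the parsing of real entries, the function name and the example values are assumptions made for illustration.

```python
from collections import Counter, defaultdict

# Tag order follows the labels listed above for the final rule file.
TAGS = ("DE", "CC", "KW", "GO", "OC")


def annotation_frequencies(entries):
    """Count, for each tag, in how many entries each annotation value occurs."""
    table = defaultdict(Counter)
    for entry in entries:
        for tag in TAGS:
            for value in set(entry.get(tag, [])):   # count once per entry
                table[tag][value] += 1
    return {tag: dict(table[tag]) for tag in TAGS}


entries = [
    {"KW": ["Peroxidase", "Heme"], "GO": ["GO:0004601"]},
    {"KW": ["Peroxidase"]},
]
print(annotation_frequencies(entries))
```

An annotation present in only some of the entries, or a tag with no value at all, is immediately visible in the resulting table, which is what makes the structure useful for spotting mistakes and missing annotations.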

(ii). Rule file skeleton building:

The rule skeleton is built by transferring data from the data structure to a text file according to the UniRule requirements. An annotation transferred to the rule under construction is considered confident when it appears in the treated sequence set with a frequency of at least 95%. Annotations with frequencies below 30% are not transferred. Between 30% and 95%, the annotation is transferred and its frequency is indicated to facilitate the subsequent manual curation of the file by annotators.
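The transfer thresholds can be sketched in Python as follows. The thresholds are those given in the text; the function name, the bracketed frequency marker and the example counts are hypothetical, and the real skeleton follows the UniRule format rather than this simplified list.

```python
def triage_annotations(counts, n_entries, confident=0.95, minimum=0.30):
    """Select the annotations to transfer and flag the uncertain ones.

    counts: dict mapping an annotation to the number of entries carrying it.
    n_entries: number of entries in the match list.
    """
    skeleton = []
    for annotation, count in counts.items():
        freq = count / n_entries
        if freq >= confident:
            skeleton.append(annotation)                    # confident: transferred as is
        elif freq >= minimum:
            skeleton.append(f"{annotation} [{freq:.0%}]")  # transferred with its frequency
        # below the minimum frequency: not transferred
    return skeleton


counts = {"KW Peroxidase": 20, "KW Chloroplast": 12, "KW Membrane": 3}
print(triage_annotations(counts, n_entries=20))
```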

3. Results

3.1. Peroxidase profiles for PeroxiBase.

The peroxidase database provided a set of well annotated sequences. In addition, some general peroxidase profiles were already available in PROSITE: PS50292 for the animal peroxidase superfamily and PS50873 for the non­animal peroxidase superfamily. For the glutathione peroxidase superfamily, we used a general profile made by Dr. Falquet L. and available on MyHits in a private domain (http://myhits.isb­sib.ch/cgi­bin/index/).

Running the described procedure produced 76 newly built and calibrated profiles for the different families and subfamilies of peroxidases (Appendix 2). The profiles obtained are very specific and sensitive (Table 3). For some superfamilies, the sensitivity is reduced because of the large number of partial sequences in the initial sequence set. Nevertheless, in most cases, the partial sequences are picked up by the profiles. Very few false positive matches are reported (Appendix 3). The sequences concerned have been checked and confirmed to belong to the families in which they are classified, so they remain false positives. In some cases, the inner profiles are more sensitive than the profile of the superfamily to which they are related. All the detailed results are compiled in Appendix 3.

These profiles also improved the classification of the sequences in the peroxidase database. For example, thanks to localisation-specific profiles (PS52034, PS52035, PS52036), ascorbate peroxidases were correctly classified as chloroplastic, cytosolic or peroxisomal.

During this project, many subclasses were created to improve the general classification. These subdivisions were created based on the match lists of the new profiles and were confirmed by phylogenetic considerations (tree clusters, data not shown) and bibliographic data. For instance, the peroxinectin superfamily was divided according to the origin and the size of the proteins into bacterial peroxinectin, invertebrate peroxinectin, and short peroxinectin (Zamocky, 2008). The DyP-type peroxidase superfamily was divided into subtypes A to D, and the non-haem superfamily has been subdivided according to the metal used to transfer electrons: vanadium-iodoperoxidase (VIPo), vanadium-bromoperoxidase (VBPo) and vanadium-chloroperoxidase (VCPo).

| Family name | Concerned subfamilies | Profiles sensitivity** | Profiles TP* picked/total |
|---|---|---|---|
| Alpha-dioxygenase | DiOx | 100% | 15/15 |
| Dual oxidase | DuOx | 93.94% | 31/33 |
| Linoleate diol synthase | LDS | 100% | 55/55 |
| Peroxidasin | Pxd | 100% | 37/37 |
| Peroxinectin | Pxc, Pxt, PxDo | 100% | 67/67 (+1FP) |
| Prostaglandin H synthase | PGHS | 100% | 73/73 |
| Vertebrate peroxidase | EPO, LPO, MPO, TPO, POX | 96.55% | 56/58 |
| Catalase | Kat, KatLox | 98.68% | 149/151 |
| Di-haem peroxidase superfamily | DiHCcP, MauG, DiHPOX | 100% | 115/115 |
| DyP-type peroxidase | DyPrx A to D | 61.72% | 79/128 |
| Haloperoxidase | HalPrx, HalNPrx | 88.35% | 91/103 |
| Non-animal class I peroxidase | Apx, CP, CcP, APx-CcP | 90.57% | 759/838 |
| Non-animal class II peroxidase | LiP, MnP, VP, CII | 99.16% | 118/119 |
| Non-animal class III peroxidase | Prx | 92.54% | 2442/2639 |
| Non-haem alkylhydroperoxidase D-like superfamily | AhpD, CMD, CMDn, HCMD, HCMDn, DCMD, DCMDn, AlkyPrx, AlkyPrxn | 82.54% | 156/189 |
| Glutathione peroxidase | AmnGpx, InsGpx, PltGpx, FBGpx, OthGpx | 99.75% | 403/404 |
| Peroxiredoxin | 1CysPrx, 2CysPrx / AhpC, PrxII / PrxV, PrxQ / BCP, PrxGrx, Tpx, AhpE | 100% | 610/610 |
| NADH oxidase | NadOxd | 97.56% | 40/41 (+1FP) |
| Ancestral NADPH oxidase | NOx | 100% | 26/26 |
| Respiratory burst oxidase homolog | Rboh | 94.19% | 81/86 |
| NADPH oxidase V | NoxV | 85.71% | 12/14 |
| NADH dehydrogenase | NadDH | 100% | 22/22 (+4FP) |
| NADH peroxidase | NadPrx | 92.31% | 12/13 |
| Manganese catalase | MnCat | 100% | 30/30 |

Table 3: Specificity and sensitivity of the main peroxidase subfamily profiles.

* TP: True Positive, FP: False Positive, FN: False Negative.
** Sensitivity is the proportion of true positives that are correctly identified by the test: TP/(TP+FN).
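As a worked check of the sensitivity formula, the following short Python snippet recomputes two values of Table 3 from their picked/total counts. The function name is illustrative; the number of false negatives is taken as the total minus the number of sequences picked.

```python
def sensitivity(tp, fn):
    """Sensitivity = TP / (TP + FN), returned as a percentage."""
    return 100 * tp / (tp + fn)


# Dual oxidase: 31 of 33 sequences picked, so FN = 2.
print(f"{sensitivity(31, 2):.2f}%")    # prints 93.94%
# DyP-type peroxidase: 79 of 128 sequences picked, so FN = 49.
print(f"{sensitivity(79, 49):.2f}%")   # prints 61.72%
```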

As a major improvement, a new tool has been added to PeroxiBase. The ps_scan interface allows a set of new sequences to be checked against the available profiles. This page may help solve problematic classifications (Figures 4 and 5).

Figure 4: The new Class scan interface to identify the classification of a given peroxidase sequence.

A search is performed for each new sequence against all peroxidase profiles. The output is a list of the profiles matching the given sequence (Figure 5): it directly gives the classification hierarchy of the new sequence and indicates the insertions or mismatches observed when the sequence is superposed on each profile. Another part of the result is shown as a graph representing the quality of the alignment of the sequence with the matching profiles. This graph is under construction. The page is publicly available at http://peroxibase.isb-sib.ch/class_scan.php.

Figure 5: Result screen of the Class scan search against the newly built peroxidase profiles.

3.2. A new profile building methodology for subfamily classification.

The new methodology proposed in this project for the construction of highly discriminative profiles begins with a general alignment, in fasta format, of the complete set of sequences from all the families and subfamilies. The sequences are prepared so that each fasta header contains the precise hierarchy, or at least the subfamily, to which the sequence belongs. The new pipeline first annotates a multiple alignment containing all the subfamily sequences rather than a single subfamily set. This allows general information about residue conservation between subfamilies, and the positions concerned, to be kept throughout the subprofile construction procedure, especially after the general alignment has been split. Based on this initial annotation, the subsets for particular subfamilies can be processed focusing only on family-specific residues.

For the implementation of this procedure, existing programs were adapted into a dedicated, enriched pipeline. In addition to existing profile construction programs, some new tools were developed to facilitate alignment splitting, annotation transfer, match list processing and profile editing. The most important tools developed in this project concern the setting of the LEVEL = 0 cut-off value and the preparation of the post-processing competition. These steps remain the most relevant for improving the discriminative power of profiles for closely related families.

The LEVEL = 0 cut­off value fixation has been carried out in two different steps: an automatic determination and a manual checking of the result of the automatic step.

For the automatic determination of the cut­off, a perl script has been written which contains the following steps:

(i) order the output file of pfsearch by score in descending order,

(ii) determine the highest-scoring false positive based on the hierarchy indicated in the sequence header: it is the first sequence not belonging to the family being processed,

(iii) evaluate a theoretical cut­off value equal to the score of the first false positive encountered + 0.1,

(iv) make a list of subclasses encountered between the last true positive and the last entry of the considered subclass. We then try to remove these false positives by the competition approach,

(v) edit a cut­off result file for the subfamily.
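Steps (i) to (iv) can be sketched in Python as follows; the original tool was a Perl script. The input layout (a score paired with the list of hierarchy labels read from the header) and the function name are assumptions. Step (iv) is interpreted here as collecting the subclasses of the non-member sequences ranked between the first false positive and the last true positive, which is one reading of the description above. Step (v), writing the result file, is left out.

```python
def automatic_cutoff(matches, family, margin=0.1):
    """Propose a LEVEL = 0 cut-off and a competition list from a match list.

    matches: list of (score, hierarchy) pairs, hierarchy being the list of
    labels read from the sequence header (hypothetical layout).
    family: label of the family being processed.
    """
    ranked = sorted(matches, key=lambda m: m[0], reverse=True)      # step (i)
    is_member = [family in hierarchy for _, hierarchy in ranked]
    if all(is_member):
        return None, []            # no false positive: nothing to propose
    first_fp = is_member.index(False)                               # step (ii)
    cutoff = round(ranked[first_fp][0] + margin, 2)                 # step (iii)
    # Step (iv), as interpreted here: subclasses of the false positives
    # that score above the last true positive.
    last_tp = max((i for i, m in enumerate(is_member) if m), default=-1)
    compete = sorted({ranked[i][1][-1] for i in range(first_fp, last_tp)
                      if not is_member[i]})
    return cutoff, compete


matches = [(11.0, ["ClassI", "APx"]), (30.0, ["ClassI", "APx"]),
           (12.4, ["ClassI", "CcP"]), (25.0, ["ClassI", "APx"]),
           (9.0, ["ClassII", "LiP"])]
print(automatic_cutoff(matches, "APx"))
```

In this made-up example the proposed cut-off is 12.5, and the CcP subclass is listed for competition because one APx sequence scores below the first false positive.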

An example of a cut­off file obtained after the automatic determination can be observed in Appendix 4.

After this automatic treatment of the profile match list, the proposed values are checked manually. This manual step is useful to take into account and correct annotation mistakes, biological or phylogenetic aspects and/or erroneous values proposed by the script.

The compete list is also checked. The competition between profiles, made as a post processing analysis, is resource­ and time­consuming, therefore we have tried to limit this analysis to the strict minimum: the problematic subfamily classification. To do so, we have accepted that final profiles will mostly pick up complete sequences because the scores of partial sequences are too small and require too many competitions between profiles to be caught by the correct profile.

After this human step, a new raw score is calculated for the level 0 and the profile is tested again.

The manually fixed cut-off value could be revisited after the test of the competition between profiles. If the fixed cut-off cannot resolve the distinction between sequences after the ps_scan step, the profile has to be recalibrated with pfcalibrate, modifying the -L option. This option allows a profile to produce lower or higher scores and can be useful for resolving inconsistent results after the post-processing step.

When recalibration did not solve the discrimination between subfamily sequences, the whole process was relaunched for the family concerned after new sequences had been added to the database and wrong sequences had been removed.

3.3. Improvement of existing programs.

During this project, existing programs were also improved. For instance, a new option was added to xpsa2annot.pl: -rm_gap. This option deletes empty columns in the annotated multiple sequence alignment before the profile construction. Empty columns arise because the annotation is made considering all the sequences in the initial sequence set. When the initial set is split, the annotation lines can therefore be longer than the sequences of a subset. In these cases, some positions are empty in every sequence of the subset (represented by dashes) and must be removed. Otherwise, these empty positions cause an error in the profile construction, especially if an annotation is linked to them. This problem is especially relevant in subprofile construction, and the present project contributed to solving it.
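The removal of empty columns can be sketched in Python as follows. The representation of the annotation as a string with one symbol per column, and the function name, are assumptions; the actual -rm_gap option of xpsa2annot.pl works on the annotated alignment format used by the PROSITE tools.

```python
def remove_empty_columns(alignment, annotation):
    """Drop the columns that contain only gaps, together with their annotation.

    alignment: list of equal-length aligned sequences ("-" or "." = gap).
    annotation: string of the same length carrying one symbol per column.
    """
    keep = [i for i in range(len(annotation))
            if any(seq[i] not in "-." for seq in alignment)]
    trimmed = ["".join(seq[i] for i in keep) for seq in alignment]
    return trimmed, "".join(annotation[i] for i in keep)


# Columns 2 and 4 are empty in this subset after the split.
subset = ["HR-D-", "HK-E-"]
print(remove_empty_columns(subset, "ab0c0"))
```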

In addition, we tested two ways of masking high-frequency residues: one replaces the given residue by an X (maskpos.pl) and the other puts a value of zero in the annotation lines for the position considered. This second method is a new option implemented in xpsa2annot.pl. This -mask option can be coupled with a file of correspondences between amino acids to take their similarities into account, especially in terms of functional activity. We noticed that the profile quality and predictive ability are better in the second case, when the residues are not silenced by an X replacement. This difference may be due to the fact that residues annotated with a zero value continue to influence the weighting step, whereas residues replaced by X are no longer used, so the residue weights are slightly different.
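The difference between the two masking modes can be illustrated with a small Python sketch. Both functions and the 0/1 column weights are simplified stand-ins chosen for illustration; they do not reproduce the behaviour or file formats of maskpos.pl and xpsa2annot.pl.

```python
def mask_with_x(alignment, columns):
    """First approach: overwrite the residues of the given columns with X."""
    cols = set(columns)
    return ["".join("X" if i in cols and c not in "-." else c
                    for i, c in enumerate(seq))
            for seq in alignment]


def mask_in_annotation(length, columns):
    """Second approach: keep the residues, set the column weight to zero."""
    cols = set(columns)
    return [0 if i in cols else 1 for i in range(length)]


alignment = ["HRDA", "HR-G"]
print(mask_with_x(alignment, [0, 2]))   # the sequences themselves are altered
print(mask_in_annotation(4, [0, 2]))    # the sequences stay intact
```

In the second mode the original residues remain available to later steps such as sequence weighting, which is consistent with the explanation proposed above for the better results.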

3.4. ProRule file editor.

To facilitate the construction of the rule, the UniProtKB annotations are reorganised into a data structure that can be viewed in a spreadsheet editor (SCalc from OpenOffice.org, Microsoft Excel for Windows). This tabular compilation of the entries' annotations constitutes a first important step in the rule construction. It gives a general overview and facilitates the validation and correction of the rule. In addition, a text log file of the data structure itself is available at the end of the ProRule file building process.

The automated ProRule building step proposed in this project was tested on non-animal peroxidases. Peroxidase subprofiles obtained after the first part were run against UniProtKB to produce the match lists used as input. This led to the building of 14 rule skeletons. These files contain the most frequent annotations found in the UniProtKB entries picked up by the corresponding profiles. Thus, the more precise the profile, the more precise and complete the transferred annotation. In other words, the deeper in the classification the subfamily targeted by the profile, the more homogeneous the match list (a list of very closely related entries) and the more precise and complete the annotation that will be triggered. For instance, ProRule files PRU52034 and PRU52036, produced from the match lists of profiles PS52034 (chloroplastic ascorbate peroxidases) and PS52036 (peroxisomal ascorbate peroxidases) respectively, contain information about the subcellular localisation. This information is not taken into account in the PRU52033 file, which results from PS52033, the ascorbate superfamily profile. An example of a ProRule skeleton is provided in Appendix 5.

4. Discussion

4.1. Importance of profiles for peroxidase classification and evolution understanding.

The increasing number of genome sequencing projects for various organisms (Genomes OnLine Database), as well as the numerous EST programmes, has led to the identification of many new peroxidase-encoding sequences. The initial peroxidase family and superfamily classifications, mainly based on prokaryote and opisthokont peroxidase sequences, have thus become obsolete.

The major classification dilemma arises from the peroxidase sequences found in different protist organisms (Excavata, Rhizaria, Alveolata, Stramenopiles, Haptophyceae, Cryptomonads and Glaucophyceae) as well as in dead-end evolutionary branches (basal angiosperms, basal viridiplantae). These sequences have evolved on their own, and their current form does not, at first sight, allow them to be classified precisely within a peroxidase class, even after enzymatic activity experiments. An obvious example is provided by the sequences TcrAPx and LmAPx, initially annotated as ascorbate peroxidases (Wilkinson et al., 2002) but which are in fact hybrid ascorbate and cytochrome c peroxidase sequences. However, current bioinformatic tools are not suited to making these distinctions.

In the coming years, the amount of data on protists and marginal organisms will increase, and many more sequences of this kind will appear.

It therefore seems important to create more accurate profiles to classify upcoming peroxidase sequences with greater precision, in order to prevent further misinterpretations and erroneous peroxidase names and classifications.

The profiles built during this work and the web tool made available thus constitute an important improvement for the peroxidase database which, more than a simple recapitulation of existing peroxidase sequences, is becoming a more complete environment for researchers involved in understanding evolution problems concerning peroxidases.

4.2. Advantages of the global approach.

The proposed approach has allowed the production of highly discriminative profiles to improve the classification of the nearly 5800 peroxidases in PeroxiBase.

As has been shown by Lüthy and co-workers, the introduction of sequence weights improves the performance of the resulting profiles (Lüthy et al., 1994). This effect is particularly pronounced if the initial set of trusted sequences contains both unique sequences and multiple members from closely related sequence families.

For subprofile construction, we managed to obtain a subset of sequences covering as wide a range of species as possible. In addition, the trimming step, carried out after the masking and the splitting of the global alignment, allowed us to retain only clearly distinct (nearly unique) sequences for the subprofile building step. This allows us to validate the masking option of xpsa2annot as a good way to obtain predictive and reliable subprofiles.

On the other hand, this project made it possible to establish and verify the complementarity between the residue masking step during alignment labelling and the profile post-processing competitions. As the subprofiles obtained are highly specific, the post-processing steps were necessary only for very close subfamilies, and especially for localisation or specific differentiation inside a few subfamilies: for instance, the post-processing made it possible to separate chloroplastic, peroxisomal and cytosolic peroxidases. It was also possible to separate the subcategories of vanadium peroxidases (bromoperoxidase, chloroperoxidase and iodoperoxidase), as well as the peroxinectin subcategories (bacterial peroxicin, invertebrate peroxinectin and short peroxidockerin).

In addition, the approach used here is more consistent and powerful. All the subprofiles are constructed simultaneously from a single global annotated alignment and tested simultaneously, including competitions. If the profiles were constructed and tested individually and separately, some sequences could be picked up by more than one profile or by the wrong one. With the adopted approach, the competitions ensure the uniqueness of the protein-profile relationship. Only the profile that produces the best score with the given sequence is reported.

The application of the described procedure to UniProtKB can lead to the annotation of about 5500 sequences. It will directly allow the classification of these peroxidases into well defined subfamilies since at present the classification of peroxidases in UniProtKB is not as precise and detailed as in PeroxiBase.

4.3. Human calibration steps.

The human calibration steps during the subprofile construction made it possible to improve the classification and annotation of sequences in PeroxiBase itself. We also observed numerous incorrect or missing annotations in UniProtKB. During this process, some sequences were found to have been wrongly annotated (a class III sequence named with a class I or class II acronym). Other mistakes, such as the absence of sequence status (complete or partial) or of localisation, were also observed.

4.4. ProRule skeleton files.

The final skeleton files produced here are a good basis for rule editing. They are not complete rules but a first automatic step that needs to be completed manually. The table view of the data structure of the treated sequence annotations is very helpful for completing the skeleton. Very often, some annotations are not transferred because some entries of the treated set contain a mistake or a difference of format in the available annotation, and sometimes because annotations are omitted.

The proposed tool can thus be used to check for annotation mistakes or omissions. This is especially easy with the table view (Appendix 6).

As annotations are transferred according to their frequency in the sequence set representing the match list of a profile, the confidence level in the obtained pre-ProRule files depends on the number of entries in the match list considered. Concerning the peroxidase superfamily, it will be useful to add more sequences to UniProtKB because some of the newly built profiles match only a few UniProtKB entries (fewer than five in some cases). This is not really a problem when the extracted annotation is uniform, because it can be treated as a 100% uniform annotation. But because of this small number, we cannot assume that all future homologues will have the same annotation. Moreover, for a small number of sequences in the input set, when there is an annotation difference (for example 3/5 have one annotation and 2/5 have another), it is extremely difficult to decide which annotation must be transferred.

4.5. Weaknesses of the proposed methodology.

Although the proposed methodology allows the production of highly discriminative profiles for subfamilies, some difficulties exist in its application:

− The sequence preparation step: this step, during which family hierarchies are included in fasta headers, constitutes an obstacle to the direct use of existing sequence sets obtained from different databases. Currently none of them gives the family hierarchy in the fasta header.

− The construction of a special data structure for the family tree: this structure is useful for the initial MSA splitting, for automatic cut-off fixation, ... In this project we used Perl hashes. This is a limitation for people who do not know this programming language. The script will be modified to allow the family tree to be constructed automatically from the sequence set.

− The precision of the proposed pre-ProRule files: as the precision of the proposed pre-ProRule files depends on the "depth" in the hierarchy, only a few of the files obtained can be used directly in the present situation. Moreover, the proposed files do not include the FT lines (Feature Table) because these lines must normally be triggered by superfamily profiles. This means triggering one part of the annotation with the superfamily profile and the rest with the more precise subclass profile. This will require adapting the program that applies rules to UniProtKB entries.

5. Conclusion and perspectives

In this work, we were asked to build highly discriminative profiles for peroxidase characterisation. The challenge was to overcome the weaknesses of PROSITE profiles built in the usual manner, as they do not allow differences to be established between closely related subfamily sequences. To do so, we have developed a new method that focuses on the residues truly specific to each subfamily; residues with a high level of conservation in the upper-level family are silenced. The application of this method to the sequences of the peroxidase database made it possible to build 76 new profiles for all the existing superfamilies and subfamilies. The online availability of a ps_scan-based tool on the database website, as well as on MyHits, is a powerful improvement for the future classification of new sequences.

As a major perspective, the methodological approach itself could be improved. In this project, we have masked the residues showing high conservation at family level (constant among all the subfamilies). This masking has led to a unique annotation used as general and uniform background information for the construction of all subprofiles. An improvement would be to weight family-conserved residues and subfamily-specific residues differently. That would allow a more appropriate background annotation to be applied to each subfamily sequence set.

This project has provided a number of bioinformatics tools, including one for the first step of ProRule file editing from a ps_scan match list. These tools can be optimised to lead to a complete PROSITE profile and ProRule editor. Such an environment can be useful to produce powerful profiles and precise associated rules from a given global alignment of a reliable and well annotated sequence set corresponding to a complete protein family.

The subprofile construction method will be implemented on the PROSITE web page as a public tool for researchers. Internally, this web page can be associated with existing tools used by PROSITE annotators to facilitate rule construction. The weaknesses have to be overcome to allow a simple and intuitive usage for users who are not familiar with programming languages.

References

Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25:3389-402.

Bairoch A (1992) PROSITE: a dictionary of sites and patterns in proteins. Nucleic Acids Res 20 Suppl:2013-8.

Bucher P, Bairoch A (1994) A generalized profile syntax for biomolecular sequence motifs and its function in automatic sequence interpretation. Proc Int Conf Intell Syst Mol Biol 2:53-61.

Bucher P, Karplus K, Moeri N, Hofmann K (1996) A flexible motif search technique based on generalized profiles. Comput Chem 20:3-23.

Conn EE, Kraemer LM, Liu PN, Vennesland B (1952) The aerobic oxidation of reduced triphosphopyridine nucleotide by a wheat germ enzyme system. J Biol Chem 194:143-51.

Daiyasu H, Toh H (2000) Molecular evolution of the myeloperoxidase family. J Mol Evol 51:433-45.

Fleischmann A, Darsow M, Degtyarenko K, Fleischmann W, Boyce S, Axelsen KB, Bairoch A, Schomburg D, Tipton KF, Apweiler R (2004) IntEnz, the integrated relational enzyme database. Nucleic Acids Res 32:D434-7.

Furtmuller PG, Zederbauer M, Jantschko W, Helm J, Bogner M, Jakopitsch C, Obinger C (2006) Active site structure and catalytic mechanisms of human peroxidases. Arch Biochem Biophys 445:199-213.

Gribskov M, Lüthy R, Eisenberg D (1990) Profile analysis. Methods Enzymol 183:146-59.

Henikoff S, Henikoff JG (1994) Protein family classification based on searching a database of blocks. Genomics 19:97-107.

Hochman A, Goldberg I (1991) Purification and characterization of a catalase-peroxidase and a typical catalase from the bacterium Klebsiella pneumoniae. Biochim Biophys Acta 1077:299-307.

Hulo N, Bairoch A, Bulliard V, Cerutti L, Cuche BA, de Castro E, Lachaize C, Langendijk-Genevaux PS, Sigrist CJA (2008) The 20 years of PROSITE. Nucleic Acids Res 36:D245-9.

Lüthy R, Xenarios I, Bucher P (1994) Improving the sensitivity of the sequence profile method. Protein Sci 3:139-46.

Notredame C (2002) Recent progress in multiple sequence alignment: a survey. Pharmacogenomics 3:131-44.

Passardi F, Theiler G, Zamocky M, Cosio C, Rouhier N, Teixera F, Margis-Pinheiro M, Ioannidis V, Penel C, Falquet L, Dunand C (2007a) PeroxiBase: the peroxidase database. Phytochemistry 68:1605-11.

Passardi F, Zamocky M, Favet J, Jakopitsch C, Penel C, Obinger C, Dunand C (2007b) Phylogenetic distribution of catalase-peroxidases: are there patches of order in chaos? Gene 397:101-13.

Ruiz-Dueñas FJ, Camarero S, Pérez-Boada M, Martínez MJ, Martínez AT (2001) A new versatile peroxidase from Pleurotus. Biochem Soc Trans 29:116-22.

Sigrist CJA, Cerutti L, Hulo N, Gattiker A, Falquet L, Pagni M, Bairoch A, Bucher P (2002) PROSITE: a documented database using patterns and profiles as motif descriptors. Brief Bioinform 3:265-74.

Sigrist CJA, De Castro E, Langendijk-Genevaux PS, Le Saux V, Bairoch A, Hulo N (2005) ProRule: a new database containing functional and structural information on PROSITE profiles. Bioinformatics 21:4060-6.

Wilkinson SR, Taylor MC, Touitha S, Mauricio IL, Meyer DJ, Kelly JM (2002) TcGPXII, a glutathione-dependent Trypanosoma cruzi peroxidase with specificity restricted to fatty acid and phospholipid hydroperoxides, is localized to the endoplasmic reticulum. Biochem J 364:787-94.

Zamocky M, Jakopitsch C, Furtmuller PG, Dunand C, Obinger C (2008) The peroxidase-cyclooxygenase superfamily: Reconstructed evolution of critical enzymes of the innate immune system. Proteins

Acknowledgments

I am sincerely grateful to Professor Amos Bairoch for having given me the opportunity to do this training as a collaboration between the Swiss Institute of Bioinformatics and the Laboratory of Plant Physiology. It has been a great opportunity to work in a very stimulating and multidisciplinary scientific environment.

My deepest thanks go to my project supervisors Nicolas Hulo, Christophe Dunand and Christian Sigrist for their constant support and advice during this training. Their scientific competence and encouragement made this training a rich and fruitful time for me. Thanks also to Patricia Palagi for her indefatigable support and advice. She is truly a good mentor.

I am particularly grateful to Virginie Le Saux, Grégory Theiler and Lorenzo Cerutti for their helpful programming advice and orientation.

I also want to thank Béatrice Cuche for the pleasant working environment I was welcomed into, Tania Lima for the time she spent reading my pitiful English, and all the collaborators in the Laboratory of Plant Physiology, in the PROSITE group and at SIB.

A note of thanks to Karin Sonesson, Salvo Paesano and Alessandro Innocenti for their efficiency in solving all the small technical problems I faced during these months.

I am also thankful to my friends of the Champel Residence and to all my colleagues from the Master's degree in Bioinformatics for the great moments we shared and for their support.

Thanks also to all those I do not mention but do not forget.

Appendices

Appendix 1: Identification tools for peroxidases before the present project.

PROSITE PRINT PFAM INTERPRO Haem peroxidase Animal peroxidase­Cyclooxygenase superfamily PS50292 PR00457 PF03098 IPR010255, IPR002007 Alpha­dioxygenase DiOx PF00036, PF08022, IPR011992, IPR002048 (EF hand), Dual oxidase DuOx PF01794, PF08030 IPR013112 (FAD bind), IPR013130, IPR013121 (NAD bind) Linoleate diol synthase (PGHS like) LDS PR00385 (cyt P450) PF00067 (cyt P450) IPR001128, IPR002397 (cyt P450) PF07679, PF00560, IPR007110, IPR013783, IPR013098, PF01462, PF00093 IPR003598 (immunoG), Peroxidasin Pxd PR00019 IPR001611, IPR000483, IPR000372, IPR003591 (LRR), IPR001007 Peroxinectin Bacterial peroxinectin Pxt Invertebrate peroxinectin Pxt Short peroxinectin Pxt PGHS/CyO IPR013032, IPR006210, IPR000742, Prostaglandin H synthase (Cyclooxygenase) x IPR013111 (EGF) Vertebrate peroxidase Eosinophil peroxidase EPO Lactoperoxidase LPO IPR009057 Myeloperoxidase MPO PF07645 IPR000152 Asp hydroxylation site, IPR000742, IPR001881, IPR013091, Thyroid peroxidase TPO IPR013032 (EGF), IPR000436 (sushi domain) Non mammalian vertebrate peroxidase POX PS00437, , IPR011614, IPR010582 Catalase superfamily PF00199, PF06628 IPR002226 PS00438 PR00067 Catalase Kat Catalase­lipoxygenase fusion KatLox IPR004852, IPR009056 (mono heam), Di­haem peroxidase superfamily PF03150 IPR012282 (Cyt c region) Di­haem cytochrome C peroxidase DiHCcP Methylamine utilisation protein MauG Other di­haem peroxidase DiHPOX DyP­type peroxidase PF04261 IPR006314, IPR006311

DyP-type peroxidase A DyPrx DyP-type peroxidase B DyPrx DyP-type peroxidase C DyPrx DyP-type peroxidase D DyPrx Haloperoxidase haem HalPrx PF01328 IPR000028 PS50873, IPR010255, IPR002016 non-animal peroxidase PS00435, PR00458 PF00141 PS00436 Class I peroxidase Ascorbate peroxidase APx PR00459 IPR002207 Catalase peroxidase CP PR00460 IPR000763 Cytochrome C peroxidase CcP PR00459 IPR002207 Hybrid APx-CcP APx-CcP Class II peroxidase PR00462 IPR001621 Lignin peroxidase LiP Manganese peroxidase MnP Versatile peroxidase VP Other class II peroxidase CII Class III peroxidase Prx PR00461 IPR000823 Other non-animal peroxidase NAnPrx non-haem peroxidase Alkylhydroperoxidase D-like superfamily PF02627 IPR003779, IPR004675, IPR012788 Alkylhydroperoxidase D AhpD Carboxymuconolactone decarboxylase(act) CMD Carboxymuconolactone decarboxylase(no ) CMDn PR00111 PF00561 () IPR003089, IPR000073, IPR012790 Hydrolase-CMD fusion (act) HCMD (hydrolase) (hydrolase) PR00111 PF00561 (hydrolase) IPR003089, IPR000073, IPR012790 Hydrolase-CMD fusion (no) HCMDn (hydrolase) (hydrolase) Double CMD (act) DCMD Double CMD (no) DCMDn Other alkylhydroperoxidase (act) AlkyPrx Other alkylhydroperoxidase (no) AlkyPrxn PF07883 (cupin) IPR013096, IPR011051 (cupin) Haloperoxidase PR00111, PF00561, IPR003089, IPR000073, IPR000639 No haem-no metal haloperoxidase HalNPrx PR00412 (hydrolase) No haem-Vanadium haloperoxidase PS00012 PF02681, PF01569 IPR003832, IPR000326, IPR006162 No haem- VBPo No haem-Vanadium chloroperoxidase VCPo No haem-Vanadium iodoperoxidase VIPo Manganese Catalase MnCat PF05067 IPR007760, IPR009078 PR00411, PR00368 IPR001100, IPR013027, IPR001327 (NAD/ PF00070, PF07992, NADH peroxidase NadPrx FAD_pyr_redox), IPR004099; PF02852 (Pyr_redox_dim) Thiol peroxidase IPR012335, IPR012336 PS00460, , IPR013376 Glutathione peroxidase IPR000889 PS00763 PR01011 PF00255 Animal glutathione peroxidase GPx Insect glutathione peroxidase GPx Plant glutathione peroxidase GPx Fungi-Bacteria glutathione peroxidase GPx Other glutathione peroxidase GPx -Cysteine peroxiredoxin 1CysPrx PF00578 IPR000866; 2CysPrx / , ; Typical 2-Cysteine peroxiredoxin PF00578 PF08534 IPR000866 IPR013740 AhpC Atypical 2-Cysteine peroxiredoxin (type II, type V) PrxII / PrxV PF08534 IPR013740 Atypical 2-Cysteine peroxiredoxin (type Q, BCP) PrxQ / BCP PF08534 IPR013740 PS00195, PR00160, , IPR014025, IPR011767, PrxII-glutoredoxin fusion PrxGrx PF08534, IPR013740 PF00462 IPR011906, IPR002109 (Glutaredoxin) NAD(P)H oxidase superfamily PR00411, PR00368, IPR001100, IPR013027, IPR001327 (NAD/ PF00070, PF07992, NADH oxidase NadOxd (PR00419) FAD_pyr_redox), IPR004099; PF02852 (Pyr_redox_dim), (IPR000759) NADPH oxidase NOx IPR000778 PS00018, PF08414, PF08022, IPR013623, IPR013112 (FAD bind), PS50222 PF01794, PF08030, IPR013130 (Ferric red), IPR013121 (Ferric Respiratory burst oxidase homolog Rboh PR00466, PF00036 red, NAD bind), IPR000778 (cyt B245), IPR011992, IPR002048 (EF-hand) PR00411, PR00368, IPR001100, IPR013027, IPR001327 (NAD/ PF00070, PF07992, ND FAD_pyr_redox), IPR004099; PF02852 (Pyr_redox_dim)

Appendix 2: List of newly built and calibrated profiles for the peroxidase families and subfamilies.

HAEM PEROXIDASE
PS50292 = General animal peroxidase
PS52002 = Alpha dioxygenase
PS52003 = Dual oxidase
PS52004 = Linoleate diol synthase
PS52005 = Peroxidasin
PS52006 = Peroxinectin Superfamily
PS52007 = Bacterial peroxinectin
PS52008 = Invertebrate peroxinectin
PS52009 = Short peroxinectin
PS52010 = Prostaglandin H synthase
PS52011 = Vertebrate peroxidase Superfamily
PS52012 = Eosinophil peroxidase
PS52013 = Lactoperoxidase
PS52014 = Myeloperoxidase
PS52015 = Thyroid peroxidase
PS52016 = Non mammalian vertebrate peroxidase
PS52017 = Catalase superfamily
PS52018 = Catalase
PS52019 = Catalase lipoxygenase fusion
PS52020 = Di haem peroxidase superfamily
PS52021 = Di haem cytochrome C peroxidase
PS52022 = Methylamine utilisation protein
PS52023 = Other di haem peroxidase
PS52024 = DyP type peroxidase Superfamily
PS52025 = DyP type peroxidase A
PS52026 = DyP type peroxidase B
PS52027 = DyP type peroxidase C
PS52028 = DyP type peroxidase D
PS52029 = Haloperoxidase Superfamily
PS52030 = Haloperoxidase haem

PS50873 = General non-animal perox
PS52032 = Class I peroxidase superfamily
PS52033 = Ascorbate peroxidase Superfamily
PS52034 = Ascorbate Chloroplastic
PS52035 = Ascorbate Cytosolic
PS52036 = Ascorbate Peroxisomal
PS52037 = Catalase peroxidase
PS52038 = Cytochrome C peroxidase
PS52039 = Hybrid Ascorbate Cytochrome C peroxidase
PS52040 = Class II peroxidase Superfamily
PS52041 = Lignin peroxidase
PS52042 = Manganese peroxidase
PS52043 = Versatile peroxidase
PS52044 = Other class II peroxidase
PS52045 = Class III peroxidase Superfamily
PS52046 = Other non-animal peroxidase

NON-HAEM PEROXIDASE
PS52051 = Alkylhydroperoxidase D like Superfamily
PS52052 = Alkylhydroperoxidase D
PS52053 = Carboxymuconolactone decarboxylase peroxidase activity
PS52054 = Carboxymuconolactone decarboxylase no peroxidase activity
PS52056 = Hydrolase CMD fusion no peroxidase activity
PS52057 = Double CMD peroxidase activity
PS52058 = Double CMD no peroxidase activity
PS52029 = Haloperoxidase Superfamily
PS52061 = No haem no metal haloperoxidase
PS52062 = No haem Vanadium haloperoxidase Superfamily
PS52063 = No haem Vanadium bromoperoxidase
PS52064 = No haem Vanadium chloroperoxidase
PS52065 = No haem Vanadium iodoperoxidase
PS52066 = Manganese Catalase
PS52067 = NADH peroxidase
Thiol peroxidase
PS52069 = Glutathione Peroxidase Superfamily
PS52070 = Animal glutathione peroxidase
PS52071 = Insect glutathione peroxidase
PS52072 = Plant glutathione peroxidase
PS52073 = Fungi Bacteria glutathione peroxidase
PS52074 = Other glutathione peroxidase
PS52075 = General Peroxiredoxin Superfamily
PS52076 = I Cysteine peroxiredoxin
PS52077 = Typical II Cysteine peroxiredoxin
PS52078 = Atypical II Cysteine peroxiredoxin type II type V
PS52079 = Atypical II Cysteine peroxiredoxin type Q BCP
PS52080 = PrxII glutoredoxin fusion
PS52081 = Other Peroxiredoxin
PS52082 = AhpE like peroxiredoxin
NAD(P)H oxidase superfamily
PS52083 = NADH oxidase
PS52084 = Ancestral NADPH oxidase
PS52085 = Respiratory burst oxidase homolog
PS52086 = NADH dehydrogenase

Appendix 3: Results of the match search for the newly built peroxidase profiles against PeroxiBase (March 12th, 2008)

Number of sequences picked by each of the built profiles. For superfamily profiles, the picked sequences are broken down to show the number picked for each subfamily. * False positives are in bold.

Part 1: Animal and haloperoxidases.

PS50292  General animal peroxidase
  'Alpha-dioxygenase' => 15/15
  'Dual oxidase' => 32/33
  'Linoleate diol synthase' => 54/55
  'Peroxidasin' => 37/37
  'Bacterial peroxicin' => 15/15
  'Invertebrate peroxinectin' => 40/40
  'Short peroxidockerin' => 12/12
  'Thyroid peroxidase' => 12/12
  'Eosinophil peroxidase' => 9/9
  'Lactoperoxidase' => 17/17
  'Non mammalian vertebrate peroxidase' => 11/11
  'Prostaglandin H synthase' => 73/73
  'Myeloperoxidase' => 9/9
PS52002  Diox
  'Alpha-dioxygenase' => 15/15
PS52003  DuOx
  'Dual oxidase' => 31/33
PS52004  LDS
  'Linoleate diol synthase' => 55/55
PS52005  Pxd
  'Peroxidasin' => 37/37
PS52006  PxtSF
  'Short peroxidockerin' => 12/12
  'Bacterial peroxicin' => 15/15
  'Invertebrate peroxinectin' => 40/40
  'Lactoperoxidase' => 1*
PS52007  BactPxt
  'Bacterial peroxicin' => 15
PS52008  InvPxt
  'Invertebrate peroxinectin' => 40
PS52009  shrtPxt
  'Short peroxidockerin' => 12/12
PS52010  PGHS
  'Prostaglandin H synthase' => 73
PS52011  VertprxSF
  'Thyroid peroxidase' => 11/12
  'Eosinophil peroxidase' => 9/9
  'Lactoperoxidase' => 16/17
  'Myeloperoxidase' => 9/9
  'Non mammalian vertebrate peroxidase' => 11/11
PS52012  EPO
  'Eosinophil peroxidase' => 8/9
PS52013  LPO
  'Lactoperoxidase' => 16/17
PS52014  MPO
  'Myeloperoxidase' => 8/9
PS52015  TPO
  'Thyroid peroxidase' => 11/12
  'Non mammalian vertebrate peroxidase' => 1
PS52016  POX
  'Non mammalian vertebrate perox' => 9

PS52017  KatSF
  'Catalase' => 146/148
  'Catalase-lipoxygenase fusion' => 3/3
PS52018  Kat
  'Catalase' => 146/148
PS52019  KatLox
  'Cat-lipoxygenase fusion' => 3
PS52020  DiHSF
  'Di-haem cytochrome C peroxidase' => 84/84
  'Methylamine utilisation protein' => 5/5
  'Other di-haem peroxidase' => 26/26
PS52021  DiHCcp
  'Di-haem cyt C perox' => 84
PS52022  MauG
  'Methylamine util prot' => 5
PS52023  DiHPOX
  'Other di-haem perox' => 26
PS52024  DyPrxSF
  'DyP-type peroxidase A' => 26/26
  'DyP-type peroxidase B' => 24/24
  'DyP-type peroxidase C' => 13/13
  'DyP-type peroxidase D' => 16/16
PS52025  DyPrxA
  'DyP-type perox A' => 26
PS52026  DyPrxB
  'DyP-type perox B' => 24
PS52027  DyPrxC
  'DyP-type perox C' => 13
PS52028  DyPrxD
  'DyP-type perox D' => 16
PS52029  HalPrxSF
  'Haloperoxidase (Haem)' => 44/49
  'No haem, no metal haloperoxidase' => 20/26
  'No haem, Vanadium bromoperox' => 17/17
  'No haem, Vanadium chloroperox' => 9/10
  'No haem, Vanadium iodoperox' => 1/1
PS52030  HalPrx
  'Haloperoxidase' => 47/49
PS52061  HalNPrx
  'No haem, no metal haloprx' => 26/26
PS52062  HalVanSF
  'No haem, Vanadium bromoperox' => 17/17
  'No haem, Vanadium chloroperox' => 10/10
  'No haem, Vanadium iodoperox' => 1
PS52063  VBPo
  'No haem, Va bromoperox' => 17
PS52064  VCPo
  'No haem, Va chloroperox' => 10
PS52065  VIPo
  'No haem, Va iodoperox' => 1

Part 2: Plant and non-haem peroxidases

PS50873  General non-animal Peroxidase
  'Ascorbate peroxidase' => 340/381
  'Catalase peroxidase' => 307/329
  'Cytochrome C peroxidase' => 93/99
  'Hybrid Ascorb-Cytochrome C peroxidase' => 26/27
  'Lignin peroxidase' => 17/24
  'Manganese peroxidase' => 56/60
  'Versatile peroxidase' => 11/13
  'Other class II peroxidase' => 22/22
  'Class III peroxidase' => 2626/2639
  'Other non-animal peroxidase' => 3/4
PS52032  ClassIPrxSF
  'Catalase peroxidase' => 315/329
  'Cytochrome C peroxidase' => 93/99
  'Hybrid Ascorb-Cytochr C peroxidase' => 10/27
  'Ascorbate peroxidase' => 338/381
PS52033  APxSF
  'Ascorbate peroxidase' => 365/381
PS52034  APxChl
  'Chloroplastic Ascorb perox' => 83
PS52035  APxCyt
  'Cytosolic Ascorb perox' => 136
PS52036  APxPer
  'Peroxisomal Ascorb perox' => 66
PS52037  CP
  'Catalase peroxidase' => 326/329
PS52038  CcP
  'Cytochrome C peroxidase' => 98/99
  'Hybrid Ascorbate-Cytochrome C peroxidase' => 6
PS52039  APxCcP
  'Hybrid Asc-Cyt C perox' => 20/27
PS52040  ClassIISF
  'Lignin peroxidase' => 24/24
  'Manganese peroxidase' => 59/60
  'Versatile peroxidase' => 13/13
  'Other class II peroxidase' => 22/22
PS52041  LiP
  'Lignin peroxidase' => 23/24
PS52042  MnP
  'Manganese peroxidase' => 59/60
  'Versatile peroxidase' => 3
PS52043  VP
  'Versatile peroxidase' => 10/13
PS52044  CII
  'Other class II peroxidase' => 21
PS52045  ClssIIISF
  'Class III peroxidase' => 2442
PS52046  NAnPrx
  'Other non-animal perox' => 4

non-haem Peroxidases

PS52051  AhpDSF
  'Alkylhydroperoxidase D' => 67/68
  'CMD perox activity' => 17/42
  'CMD no perox activity' => 19/22
  'Double CMD perox activity' => 3/3
  'Double CMD no perox activity' => 21/21
  'Hydrolase-CMD fusion no perox' => 21/22
  'Other alkylhydroperoxidase no perox' => 8/11
PS52052  AhpD
  'Alkylhydroperoxidase D' => 68/68
PS52053  CMD
  'CMD peroxidase activity' => 34/42
PS52054  CMDn
  'CMD no perox activity' => 19
  'Hydrolase-CMD fusion' => 1
PS52056  HCMDn
  'Hydrolase-CMD fusion' => 21
PS52057  DCMD
  'Doubl CMD perox activity' => 3
PS52058  DCMDn
  'Double CMD no perox activ' => 21
  'CMD peroxidase activity' => 1
  'Other alkylhydroperoxidase' => 11
PS52066  MnCat
  'Manganese Catalase' => 30/31
PS52067  NadPrx
  'NADH peroxidase' => 12/13
PS52069  GpxSF
  'Animal glutathione peroxidase' => 89/90
  'Insect glutathione peroxidase' => 24/24
  'Plant glutathione peroxidase' => 193/193
  'Fungi-Bacteria glutat peroxidase' => 73/73
  'Other glutathione peroxidase' => 24/24
PS52070  AmnGpx
  'Animal glutat perox' => 85/90
PS52071  InsGpx
  'Insect glutat perox' => 21/24
PS52072  PltGpx
  'Plant glutat perox' => 189/193
PS52073  FBGpx
  'Fungi-Bact glut perox' => 72/73
  'Plant glutathione peroxidase' => 2
PS52074  OthGpx
  'Other glutat peroxidase' => 24
  'Insect glutathione peroxidase' => 2
  'Animal glutathione peroxidase' => 4
PS52075  PrxRdxSF
  '1-Cysteine peroxiredoxin' => 103/103
  'Typical 2-Cysteine peroxiredoxin' => 219/219
  'Atypical 2-Cysteine type II' => 82
  'Atypical 2-Cysteine type Q' => 110
  'PrxII-glutoredoxin fusion' => 51/51
  'Thioredox-dependent thiol perox' => 22/22
  'AhpE like peroxiredoxin' => 23/23
PS52076  CysPrxI
  '1-Cys prxRedox' => 103/103
PS52077  CysPrxII
  'Typical 2-Cys prxrdx' => 219
PS52078  PrxrdxIIV
  'prxredoxin typII' => 82
PS52079  PrxrdxBCP
  'prxredox BCP' => 105/110
PS52080  PrxGrx
  'PrxII-glutoredox fusion' => 51/51
PS52081  OthPrxRdx
  'Thioredox thiol perox' => 22
PS52082  AhpE
  'AhpE like peroxiredoxin' => 23
PS52083  NadOxd
  'NADH oxidase' => 40/41
  'NADH peroxidase' => 1
PS52084  NOX
  'Ancestral NADPH ox' => 26/26
PS52085  Rboh
  'Resp burst ox homolog' => 81/89
PS52086  NOXV
  'NADPH oxidase V' => 12/14
PS52087  NadhD
  'NADH dehydr' => 22/22
  'NADH oxidase' => 4
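The "picked/total" counts listed above can be turned into a per-subfamily recovery rate. The short sketch below parses entries written in the `'name' => picked/total` notation of this appendix; the function name and the regular expression are illustrative assumptions, and the sample string reuses three entries from Part 1.

```python
import re

# Matches entries such as:  'Catalase' => 146/148
ENTRY = re.compile(r"'([^']+)'\s*=>\s*(\d+)\s*/\s*(\d+)")


def recovery_rates(text):
    """Map each subfamily name to the fraction picked / total."""
    return {
        name.strip(): int(picked) / int(total)
        for name, picked, total in ENTRY.findall(text)
    }


sample = "'Catalase' => 146/148, 'Dual oxidase' => 32/33, 'Peroxidasin' => 37/37"
for name, rate in recovery_rates(sample).items():
    print(f"{name}: {rate:.3f}")
```

Entries given as a single number (without a total) are skipped by this pattern, since no rate can be derived from them.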

Appendix 4: Example of a cut-off file obtained after the automatic determination (Ascorbate peroxidase subfamilies).

################### Tst_chApx_res0 :88####################
*Haem peroxidase/non-animal peroxidase/Class I peroxidase/Ascorbate peroxidase/Chloroplastic OK 1
*** False negative detected!

'last_true_pos' => ' 23.9 3771 pos. 21 - 331 2571|PtrAPx01 [2850:Phaeodactylum tricornutum] Haem peroxidase/non-animal peroxidase/Class I peroxidase/Ascorbate peroxidase/Chloroplastic Stromal; complete.',
'cutoff' => '23.90',
'nber_true_pos' => 66,
'NbSeq' => 376,
'last_false_neg' => ' 8.518 680 pos. 1 - 120 2983|LjAPx04 [34305:Lotus japonicus] Haem peroxidase/non-animal peroxidase/Class I peroxidase/Ascorbate peroxidase/Chloroplastic Thylakoid-bound; partial.',
'Compete' => 'cyApx|peApx|',
'first_false_pos' => ' 31.020 3665 pos. 1 - 279 2627|UfAPx01 [111617:Ulva fasciata] Haem peroxidase/non-animal peroxidase/Class I peroxidase/Ascorbate peroxidase; partial.',
'nber_false_neg' => 22
};

#################### Tst_cyApx_res0 :161####################
***Haem peroxidase/non-animal peroxidase/Class I peroxidase/Ascorbate peroxidase/Cytosolic OK 1
*** False negative detected!

'last_true_pos' => ' 85.786 12289 pos. 1 - 189 3957|RhcAPx01 [128735:Rosa hybrid cultivar] Haem peroxidase/non-animal peroxidase/Class I peroxidase/Ascorbate peroxidase/Cytosolic; partial.',
'cutoff' => '85.77',
'nber_true_pos' => 117,
'NbSeq' => 2798,
'last_false_neg' => ' 17.061 2139 pos. 1 - 187 2570|GsuAPx04 [130081:Galdieria sulphuraria] Haem peroxidase/non-animal peroxidase/Class I peroxidase/Ascorbate peroxidase/Cytosolic; partial.',
'Compete' => 'peApx|chApx|',
'first_false_pos' => ' 85.765 12286 pos. 1 - 284 1818|HarAPx03 [73275:Helianthus argophyllus] Haem peroxidase/non-animal peroxidase/Class I peroxidase/Ascorbate peroxidase/Peroxisomal; complete.',
'nber_false_neg' => 146
};
......
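The fields of such a cut-off record can be related to each other with a small sketch. The rule below is an assumption made for illustration, not necessarily the rule applied by the project scripts: the cut-off is placed at the lowest score of a true subfamily member that still lies above the best-scoring sequence from a competing subfamily, and members scoring below it are counted as false negatives. The function name and the key `first_false_pos_score` are hypothetical; the other keys mirror the record above.

```python
def choose_cutoff(member_scores, non_member_scores):
    """Pick a cut-off separating subfamily members from competitors.

    Assumed rule: the cut-off is the lowest member score that is strictly
    above the highest non-member score. Returns None if no member
    outscores every non-member.
    """
    best_false = max(non_member_scores, default=float("-inf"))
    above = [s for s in member_scores if s > best_false]
    if not above:
        return None
    return {
        "cutoff": min(above),
        "nber_true_pos": len(above),
        "nber_false_neg": len(member_scores) - len(above),
        "first_false_pos_score": best_false,
    }


# 85.786, 17.061 and 85.765 are taken from the cytosolic record above;
# the remaining scores are made up for the example.
members = [120.4, 97.2, 85.786, 17.061]
competitors = [85.765, 40.0]
print(choose_cutoff(members, competitors))
```

With these values the cut-off falls on the last true positive (85.786), just above the first false positive (85.765), and the low-scoring partial sequence (17.061) is reported as a false negative.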

Appendix 5: An example of a ProRule skeleton file (Ascorbate superfamily).

AC   PRU52033;
DC   Protein;
TR   PROSITE; PS52033; 1; level=0
XX
Names: Ascorbate peroxidase Superfamily.
Function: Removal of H(2)O(2), oxidation of toxic reductants
XX
DE   cytosolic #27.7777777777778#
DE   chloroplast precursor #27.7777777777778# ( EC 1.11.1.11)
XX
CC   -!- FUNCTION: Plays a key role in hydrogen peroxide removal #83.3333333333333#
CC   -!- CATALYTIC ACTIVITY: L-ascorbate + H(2)O(2) = dehydroascorbate + 2H(2)O
CC   -!- COFACTOR: Binds 1 potassium or calcium ion per subunit #72.2222222222222#
CC   -!- COFACTOR: Binds 1 heme B (iron-protoporphyrin IX) group #44.4444444444444#
CC   -!- COFACTOR: Binds 1 heme B (iron-protoporphyrin IX) group per subunit #44.4444444444444#
CC   -!- SIMILARITY: Belongs to the peroxidase family
XX
GO   GO: 0016688; F:L-ascorbate peroxidase activity; IEA:EC.#77.7777777777778#
XX
KW   Calcium #83.3333333333333#
KW   Heme #88.8888888888889#
KW   Hydrogen peroxide
KW   Iron #88.8888888888889#
KW   Metal-binding #88.8888888888889#
KW   Oxidoreductase
KW   Peroxidase #94.4444444444444#
KW   Potassium #72.2222222222222#
KW   Membrane #33.3333333333333#
XX
FT   ACT_SITE ? ? Proton acceptor #83.3333333333333#
FT   METAL ? ? Potassium or calcium
FT   METAL ? ? Iron #88.8888888888889#
FT   SITE ? ? Transition state stabilizer #83.3333333333333#
FT   TRANSMEM ? ? Potential #33.3333333333333#
XX
Chop: Nter=0; Cter=0;
Size: unlimited;
Related: None;
Repeats: 1;
Topology: Undefined;
Example: XXXXXXX;
Scope: Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta; Spermatophyta; Magnoliophyta; II#44.4444444444444#; Brassicales#44.4444444444444#; Brassicaceae#44.4444444444444#;
Comments: None

Appendix 6: Table view of the data structure built for ProRule file editing (example of the Ascorbate superfamily).
