Automatic Rule Generation for Protein Annotation with the C4. 5 Data

Vol. 17 no. 10 2001 BIOINFORMATICS Pages 920–926 Automatic rule generation for protein annotation with the C4.5 data mining algorithm applied on SWISS-PROT Ernst Kretschmann, Wolfgang Fleischmann and Rolf Apweiler The EMBL Outstation, The European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK Received on April 20, 2001; revised and accepted on July 8, 2001 ABSTRACT found), the literature (in which it was mentioned), etc. Motivation: The gap between the amount of newly • Information or annotation: the statement of an aspect submitted protein data and reliable functional annotation in that is relevant or important to describe the protein public databases is growing. Traditional manual annotation as a whole or parts of it, e.g. ‘It is expressed in the by literature curation and sequence analysis tools without mitochondrion’, ‘Amino acids 1–26 encode a Signal’, the use of automated annotation systems is not able etc. to keep up with the ever increasing quantity of data • Knowledge: the process that draws conclusions about that is submitted. Automated supplements to manually an unknown protein using gathered information, curated databases such as TrEMBL or GenPept cover raw e.g. ‘The protein sequence contains pattern x. Since data but provide only limited annotation. To improve this all known sequences having this pattern belong situation automatic tools are needed that support manual to transmembrane proteins, this should also be a annotation, automatically increase the amount of reliable transmembrane protein.’ information and help to detect inconsistencies in manually • Data mining: any technique that uses information to generated annotations. gain knowledge on data. Results: A standard data mining algorithm was suc- cessfully applied to gain knowledge about the Keyword INTRODUCTION annotation in SWISS-PROT. 11 306 rules were generated, which are provided in a database and can be applied to How to obtain information about a protein? If the protein yet unannotated protein sequences and viewed using a was biochemically characterized before and this informa- web browser. They rely on the taxonomy of the organism, tion was entered into a database like SWISS-PROT, which in which the protein was found and on signature matches is a completely human expert controlled and maintained of its sequence. The statistical evaluation of the generated database (Bairoch and Apweiler, 2000), one can simply rules by cross-validation suggests that by applying them make use of the provided information, i.e. a human be- on arbitrary proteins 33% of their keyword annotation can ing has used his knowledge to compose annotations on be generated with an error rate of 1.5%. The coverage this very protein data and established a one to one rela- rate of the keyword annotation can be increased to 60% tionship between this data and its annotation which can be by tolerating a higher error rate of 5%. used by others. However, often the information is incom- Availability: The results of the automatic data mining plete, which is a fact for the majority of the known pro- process can be browsed on http://golgi.ebi.ac.uk:8080/ teins. Many of those poorly annotated proteins are stored Spearmint/ Source code is available upon request. in databases like TrEMBL (Bairoch and Apweiler, 2000), Contact: [email protected] which is only partly annotated by human experts but also by automated annotation systems like EDIT to TrEMBL TERMINOLOGY (Moller¨ et al., 1999) and RuleBase (Fleischmann et al., 1999; Apweiler, 2001). The protein can even be hypothet- This paper is about data, information, and knowledge on ical, so there is no information available at all. protein sequences. As far as we know there is no standard In these cases one mostly resorts to sequence similarity definition to distinguish between these concepts. In the or signature searches, hoping to find well annotated following, we are going to use the definitions given below: protein features sharing some similarity with the protein • Data: the measurable or observable facts, e.g. the in question. Apart from similarity searches against com- sequence, the organism (in which the protein was prehensive, non-redundant protein sequence databases 920 c Oxford University Press 2001 Automatic rule generation for protein annotation like SWISS-PROT and TrEMBL (Apweiler, 2000), the use of protein sequence signature databases such as Prosite (Hofmann et al., 1999), PRINTS (Attwood et al., 2000) or Pfam (Bateman et al., 2000) can be helpful as are protein cluster databases like SYSTERS (Krause et al., 1999) and CluSTr (Kriventseva et al., 2001). In those cases, there is a many-to-many relationship between annotation and data, i.e. one annotation is stored for many proteins and one protein sequence might match various signatures and their annotation. Obviously, the process of gathering, analyzing, evaluating, and deriving information is time-consuming and cumbersome. It can be regarded as manual data mining across various databases. Fig. 1. Example of data distribution in InterPro IPR003009 (only We have developed a method to automate this process part of data is shown). The first column contains the SWISS-PROT for a subset of the information available in SWISS-PROT, accession numbers of some proteins in this entry. the Keyword Line. Keywords are particularly useful for analysis because they are controlled, limited in number (at the time of this writing there were 850 different Mammal? Keywords allowed), they show little inherent structure or yes no dependencies and are either annotated or not. These facts do nothing Prosite make automated knowledge acquisition much easier as for (5 instances) PS00487? comment lines and description lines, which often are in unstructured free text. yes no The implementation uses the C4.5 data mining algorithm to detect decision trees which are an equivalent no- do nothing annotate ’FAD’ tation to rules. C4.5 shows particularly good results for (1 Instance) (3 instances) non noisy data, which is the case for SWISS-PROT. The derived rules are not only fitting the training set, but are Fig. 2. Decision tree describing the data in Figure 1. also human readable and kept short. This is obtained by an elaborated heuristic approach inherent to the standard algorithm. Also statistical evidence is given for every rule, Prosite pattern PS00487, three to Pfam pattern PF01493, which can be used to order rules in terms of confidence. five belong to mammalia and three have the Keyword This property can be used to select subsets of rules for dif- ‘FAD’. The distribution is as follows. ferent applications, i.e. only the highly confident ones for A decision tree is generated that has a preferably small error critical purposes where coverage is less important number of leaves to make rules better readable and at the and all of the generated rules where coverage is the main same time more reliable. Less leaves mean that on the concern. average there are more examples per leaf that give the rule better statistical confirmation. In general, there are several possible equivalent decision trees. The example decision SYSTEM AND METHODS tree in Figure 2 covers all the instances in the training set Algorithm in Figure 1. One of the basic ideas of artificial intelligence algorithms But the decision tree in Figure 3 classifies the data is to derive knowledge from training sets and apply it on more compactly. The problem of finding the optimal yet unknown data. The C4.5 algorithm expects input in a decision tree is known to be NP-complete (Hyafil and tabular format where the last column contains the target, Rivest, 1976). C4.5 uses the gain ratio criterion, which in this particular case the information if a given keyword is based on information theory and produces suboptimal is present or not. The previous columns store core data trees heuristically (Quinlan, 1993). Note that if there are about the proteins like taxonomy details or the presence two instances having the same core data but a different of sequence patterns. The algorithm tries to derive the annotation, there is no tree that classifies all examples of a contents of the last column by using the information in training set correctly. the other columns. The precision of a tree can be checked by analyzing To illustrate the procedure, a simple example for the the number of correct and incorrect classifications it SWISS-PROT proteins in InterPro (Apweiler et al., 2000) produces when applied on the training set. This analysis IPR003009 is given. One of those proteins matches to gives the number of True Positives (TPs) (annotation 921 E.Kretschmann et al. estimation has to be performed of how well they will Pfam perform on unknown data in terms of coverage and error PF01493 rate. This aspect was tested by a tenfold cross-validation yes no (see Results). The standard algorithm was designed to classify in- annotate ’FAD’ do nothing stances into groups where there is an interest in all classes. (3 instances) (6 instances) Yet in this particular application there is no need for rules suggesting the non-annotation of certain Keywords. The standard statistical evaluation implemented in C4.5 was Fig. 3. Compact decision tree describing the data in Figure 1. tried as a method to order rules in terms of quality. Parameters to trigger the calculation were adapted to the exists in instance and is predicted), True Negatives (TNs) particular problem but the results were unsatisfactory. (annotation does not exist in instance and is not predicted), Therefore a procedure was chosen that derives confidence False Positives (FPs) (annotation does not exist in instance by exclusively using the number of TP and FP examples. but is predicted) and False Negatives (FNs) (annotation The formula calculates the following value of likelihood: exists in instance but is not predicted).

Automatic Rule Generation for Protein Annotation with the C4. 5 Data

Annual Scientific Report 2013 on the Cover Structure 3Fof in the Protein Data Bank, Determined by Laponogov, I

The HUPO Proteomics Standards Initiative Meeting: Towards Common Standards for Exchanging Proteomics Data Hinxton, Cambridge, UK, 19–20 October 2002

The European Bioinformatics Institute in 2020: Building a Global Infrastructure of Interconnected Data Resources for the Life Sciences Charles E

2003 Mulder Nucl Acids Res {22

MPGM: Scalable and Accurate Multiple Network Alignment

EMBL-EBI-Overview.Pdf

Annual Scientific Report 2011 Annual Scientific Report 2011 Designed and Produced by Pickeringhutchins Ltd

Expert Review of Proteomics

Multi-Platform Discovery of Haplotype-Resolved Structural Variation in Human Genomes

UC Riverside UC Riverside Previously Published Works

Interpro: the Integrative Protein Signature Database

Speaker Biographies