Vol. 20 no. 6 2004, pages 863–873 DOI: 10.1093/bioinformatics/btg494

Optimizing substitution matrices by separating score distributions
Yuichiro Hourai1,∗, Tatsuya Akutsu2 and Yutaka Akiyama3

1Department of Computer Science, University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo 113-0033, Japan, 2Bioinformatics Center, Institute for Chemical Research, Kyoto University, Gokasho, Uji, Kyoto 611-0011, Japan and 3Computational Biology Research Center (CBRC), National Institute of Advanced Industrial Science and Technology (AIST), Aomi Frontier Bldg. 17F, 2-43 Aomi, Koto-ku, Tokyo 135-0064, Japan

Received on February 22, 2003; revised on September 5, 2003; accepted on September 10, 2003. Advance Access publication January 29, 2004.

ABSTRACT
Motivation: Homology search is one of the most fundamental tools in bioinformatics. Typical alignment algorithms use substitution matrices and gap costs. Thus, the improvement of substitution matrices increases the accuracy of homology searches. Generally, substitution matrices are derived from aligned sequences whose relationships are known, and gap costs are determined by trial and error. To discriminate relationships more clearly, we optimize the substitution matrices from a statistical viewpoint, using both positive and negative examples through Bayesian decision theory.
Results: Using the Cluster of Orthologous Group (COG) database, we optimized substitution matrices. The classification accuracy of the obtained matrix on the COG database is better than that of conventional substitution matrices. It also achieves good performance in classification with other databases.
Availability: The optimized substitution matrices and the programs are available from http://olab.is.s.u-tokyo.ac.jp/~hourai/optssd/index.html
Contact: [email protected]

INTRODUCTION
Most homology search methods, e.g. SSEARCH, FASTA (Pearson, 1991) and BLAST (Altschul et al., 1990), seek homologous sequences in a database making use of substitution matrices for their scoring schemes. The substitution matrix used in homology search has a great influence on the results.

Table 1 shows an example of a substitution matrix. It is one of the most widely used substitution matrices, referred to as the blocks substitution matrix 62 (BLOSUM 62) (Henikoff et al., 1992, 1993). The BLOSUM matrix family was derived from many (more than 2000) patterns called blocks. Blocks are composed of sequence segments which are identical in more than a particular percentage of residues (e.g. in the case of BLOSUM 62, 62%). Log odds scores (logarithms of ratios of likelihoods) are calculated by counting the substitutions in blocks as follows. First, the observed probabilities of substitutions,

    q_ij = f_ij / Σ_{i,j ∈ AApairs} f_ij,

are calculated, where f_ij is the frequency of the observed pairs i, j. Then the log odds scores for matches or mismatches between amino acids i and j are calculated as

    s_ij = log( q_ij / (p_i p_j) ),

where p_i, p_j are the probabilities of occurrence of amino acids i and j, respectively, in a database.

The point accepted mutation (PAM) matrix family (Dayhoff et al., 1978) is also common; it was proposed earlier and is based on the probability of single point mutation and the theory of Markov processes. The construction of the PAM matrix family differs from Henikoff's method in the calculation of q_ij. Dayhoff's method was applied to large databases by different research groups independently (Gonnet et al., 1992; Jones et al., 1992). Gonnet et al. proposed a probabilistic model of gap insertion from massively collected alignment data.

The OPTIMA matrix (Kann et al., 2000) is derived from the Cluster of Orthologous Group (COG) database (Tatusov et al., 1997, 2001) using the BLOSUM62 matrix by maximizing the average of confidence parameters C = 1/(1 + E), where E is the E-value of alignment scores between homologous sequences. Although Kann's method introduced a new idea and increased the significance of alignment scores between related sequences, from the viewpoint of the probability of error there is still room for improvement. It should be noted that the term 'error' in this paper means the classification error, not the alignment error.

∗To whom correspondence should be addressed.
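As an illustration of this log-odds construction, the computation can be sketched as follows (a toy example with made-up pair frequencies, not the authors' code; note that the standard BLOSUM construction uses an expected frequency of 2·p_i·p_j for i ≠ j, while this sketch follows the simpler formula quoted above):

```python
import math

def log_odds_scores(pair_freq):
    """Compute s_ij = log(q_ij / (p_i * p_j)) from observed pair frequencies,
    following the formulas in the text."""
    total = sum(pair_freq.values())
    # q_ij: observed substitution probabilities
    q = {pair: f / total for pair, f in pair_freq.items()}
    # p_i: background probability of residue i (marginal of q)
    p = {}
    for (i, j), qij in q.items():
        if i == j:
            p[i] = p.get(i, 0.0) + qij
        else:
            p[i] = p.get(i, 0.0) + qij / 2
            p[j] = p.get(j, 0.0) + qij / 2
    return {(i, j): math.log(qij / (p[i] * p[j])) for (i, j), qij in q.items()}

# toy two-letter alphabet: 6 A-A pairs, 2 A-B pairs, 2 B-B pairs
scores = log_odds_scores({("A", "A"): 6, ("A", "B"): 2, ("B", "B"): 2})
```

Identical pairs that occur more often than their background frequencies predict receive positive scores, and mismatches that occur less often receive negative ones, which is exactly the behavior visible in Table 1.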

Bioinformatics 20(6) © Oxford University Press 2004; all rights reserved.

Table 1. BLOSUM62

A   4
R  -1  5
N  -2  0  6
D  -2 -2  1  6
C   0 -3 -3 -3  9
Q  -1  1  0  0 -3  5
E  -1  0  0  2 -4  2  5
G   0 -2  0 -1 -3 -2 -2  6
H  -2  0  1 -1 -3  0  0 -2  8
I  -1 -3 -3 -3 -1 -3 -3 -4 -3  4
L  -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4
K  -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5
M  -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5
F  -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6
P  -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7
S   1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4
T   0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5
W  -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11
Y  -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7
V   0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4
    A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
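Such a lower-triangular table is conveniently expanded into a symmetric lookup structure; a small sketch (only the first five rows of Table 1 are transcribed):

```python
def expand_triangular(order, rows):
    """Expand a lower-triangular substitution table (as in Table 1)
    into a symmetric lookup dictionary."""
    s = {}
    for i, aa in enumerate(order):
        for j, val in enumerate(rows[i]):
            s[(aa, order[j])] = val  # store both orientations of the pair
            s[(order[j], aa)] = val
    return s

# the first five rows of BLOSUM62 from Table 1
order = ["A", "R", "N", "D", "C"]
rows = [
    [4],
    [-1, 5],
    [-2, 0, 6],
    [-2, -2, 1, 6],
    [0, -3, -3, -3, 9],
]
blosum_fragment = expand_triangular(order, rows)
```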

In information science, it is believed that learning from both positive and negative examples is more powerful than learning from only positive examples (Gold, 1967; Laird, 1988). The PAM and BLOSUM matrix families are composed of substitution probabilities from positive examples, i.e. alignments of related sequences, and background probabilities from the composition of a database as negative examples. OPTIMA uses alignments of random sequences as negative examples in calculating E-values. But it does not seem that the previous methods make full use of negative examples. Linear separation methods such as linear programming or support vector machines may be applied to the optimization of score matrices. However, the minimization of the number of errors in linear separation is known to be computationally hard (Amaldi et al., 1998). Therefore, we need to choose a more appropriate learning model to overcome the difficulty of separating positive and negative data by taking account of negative examples more explicitly.

It is believed that the distribution of normalized optimal alignment scores of unrelated sequences can be approximated by the extreme value distribution (EVD) (Kotz et al., 2001). The E-value is the expected number of sequences in a database whose alignment scores are greater than a given alignment score. It is calculated from a database size D and a normalized score x as E = D · Pr[X ≥ x], where Pr[X ≥ x] is the probability, derived from the EVD, that a normalized alignment score between unrelated sequences exceeds x. Many experiments support that E-values of normalized scores discriminate relationships more clearly (Karlin et al., 1990; Brenner et al., 1998).

Bayesian decision theory is useful when probability distributions are known. We apply it to sequence classification and test our method with the COG database. Since the optimization results depend on the nature of the database, it is important to optimize a substitution matrix with a database which agrees with one's purpose. On the other hand, the score matrix optimized for the COG database should also be useful for other databases (e.g. the PFAM and SCOP databases). Thus, we apply the optimized score matrix to the PFAM and SCOP databases. Furthermore, we apply the proposed method to the optimization of score matrices for the PFAM and SCOP databases too.

SYSTEM AND METHODS
In our system, the input consists of a substitution matrix with gap costs and a classified protein sequence database, in which classes are disjoint from each other. The output is a substitution matrix with gap costs. Given a substitution matrix as an initial value and classified protein sequence data as a training dataset, our goal is to optimize the substitution matrix in order to improve the classification accuracy.

Learning sample
The relationship of two sequences is predicted based on their optimal alignment score. Therefore, positive examples are alignment scores of related sequences and negative examples are those of unrelated sequences. Optimal alignments are calculated by the Smith–Waterman alignment algorithm (Smith et al., 1981). The raw score S of an alignment can be
described as

    S(w, n) = Σ_{i,j ∈ AApair} n_ij w_ij + n_ogap w_ogap + n_egap w_egap
            = Σ_k n_k w_k
            = n · w,    (1)

where n_ij, n_ogap and n_egap denote the numbers of substitutions between amino acids i and j, open gaps and extension gaps, respectively, and w_ij, w_ogap and w_egap are the scores per substitution, open gap and extension gap, respectively.

Fig. 1. The Bayes decision boundary and the Bayes error rate of two distributions. The Bayes decision boundary is the point where the minimizer of the two functions changes. The Bayes error rate is the area of the shadowed region.

When we change the score function w, we can recalculate the alignment score S with the 212-dimensional vector n. This is based on
the assumption that a small change of a substitution matrix does not much affect optimal alignments.

It is important to normalize scores, because scores are affected by the lengths of amino acid sequences. The normalization formula is

    S' = f_w(n) = λ S(w, n) − log KN,    (2)

where λ, K are parameters and N is the search space size. We calculate these parameters based on the ML method (Bailey et al., 2002).

Objective function
The most important part of optimization is to design an objective function. Since we want to reduce the numbers of false positives and false negatives, the objective function must reflect them. We make use of the error rate, because minimizing the error rate is one of the best criteria from a statistical viewpoint.

Error rate  The error rate is defined by

    ε = (#FP + #FN) / (#P + #N),

where #FP, #FN, #P and #N are the numbers of false positives, false negatives, positive examples and negative examples, respectively. It can be written as

    ε = c · (1 − Sensitivity) + (1 − c) · (1 − Specificity),

where Sensitivity = (#P − #FN)/#P, Specificity = (#N − #FP)/#N and c = #P/(#P + #N). (Some other papers define Specificity = (#P − #FN)/(#P − #FN + #FP).) Therefore, the error rate is correlated with sensitivity and specificity, which are major evaluation criteria of discrimination methods. The error rate depends on the choice of a threshold. We approximate the Bayes decision rule to determine thresholds.

Bayes error  We describe Bayesian decision theory (Duda et al., 2000) in brief. We often face the problem of class membership, where we judge the membership of some elements to a certain class using their features, and must answer 'yes' or 'no'. A function is called a discriminant function if its output helps to decide on the class membership. Suppose the data to be dealt with are a random sample from a continuous space. In this case, using a discriminant function for a class c, the probability of error for an observed data point x is

    error(x) = P(c̄ | x), if the function answers x ∈ c,
               P(c | x), if the function answers x ∈ c̄.

Considering the overall sample space, the error rate of the Bayes discriminant function is

    error_B = ∫ min{ p(x | c̄) P(c̄), p(x | c) P(c) } dx.    (3)

See the Appendix for the derivation. This is called the Bayes error, or the Bayes risk for the 0–1 loss. Figure 1 gives some insight. The boundaries, at which the minimizer in error_B(x) changes, should be the thresholds for the membership decision.

In general, the true conditional probabilities P(x | c), P(x | c̄) are unknown. Therefore, we infer these probability distributions by sampling and parameter estimation. The prior probabilities P(c), P(c̄) are unknown, too. To mitigate the bias of a database, and for another practical reason (which will be discussed later), the prior probabilities P(c_i) are set to either 1/2 or |c_i| / Σ_j |c_j|.

If the estimated probability distribution coincides with the true distribution and the threshold coincides with the Bayes discriminant function, the estimated Bayes error coincides with the minimum error rate.

Modification for sequence classification  We consider the multi-class membership problem. Suppose C = {c_1, ..., c_l} is a set of classes. Given a data point n (a fixed alignment) and a score function f_w, where w is a vector of parameters (a substitution matrix), we calculate its observed score by
Equations (1) and (2). Suppose the membership of a sequence to class c_j is determined by sampling a sequence from class c_j and calculating their alignment score. Since we use only one optimal threshold t_j for a class c_j, the probability of error over all database sequences can be calculated from Equation (3) as

    ε_j(w) = P(c̄_j) Pr[S > t_j | c̄_j] + P(c_j) Pr[S ≤ t_j | c_j],    (4)

where S is a random variable for alignment scores between database sequences and sequences in class c_j, and its distribution depends on the substitution matrix w. We should maximize the expected probability that membership determinations over all classes by the Bayes decision boundaries are successful. So, in order to optimize the score function over all classes, we designed the objective function (see the Appendix for details)

    maximize_w  Σ_{i=1}^{l} log[1 − ε_i(w)],    (5)

where the parameter w consists of a substitution matrix and gap costs. It is the logarithm of the expected probability that alignment scores between a given sequence and representative sequences of each group determine their membership correctly.

Decision boundary
We have to decide a threshold as the Bayes decision boundary between two distributions, which we assume to be unimodal for the ease of determining thresholds. In the case where both distributions are normal distributions, it can be solved analytically. But since we assume EVDs, we must use numerical methods to solve the equation

    P(c) p(x | c) = P(c̄) p(x | c̄).

We choose the Van Wijngaarden–Dekker–Brent method (Press et al., 1993), because we are interested only in the solution between the peaks of the two distributions, and the method is a bracketing method and sufficiently fast. If no cross point is found, the score which gives the peak of p(x | c_i) is chosen as the decision boundary.

Extreme value distribution
Many extreme values (maximums or minimums) form extreme value distributions (Kotz et al., 2001). Its cdf is

    Pr[X ≤ x] = exp(−e^{−(x−µ)/σ}),    (6)

where µ, σ (σ > 0) are parameters. To estimate these parameters, we use the moment estimator

    σ̃ = (√6 / π) S,    (7)
    µ̃ = X̃ − γ σ̃,    (8)

where γ is Euler's constant (≈0.57722), S is the standard deviation and X̃ is the average (Kotz et al., 2001). It may be a rough estimator, but it needs less computation time than the ML estimator does.

Optimization method
We adopt the nonlinear conjugate gradient method (Press et al., 1993; Nocedal et al., 1999) for optimizing our objective function. It requires the first differential of the objective function. We approximate it by finite differencing (Nocedal et al., 1999). We adopted Brent's method (Press et al., 1993) for the line search in the conjugate direction. Since conjugate gradient methods find only a local optimum, a good initial value will help the discovery of a good solution.

Optimization procedure
We summarize the optimization procedure (Fig. 2).

1. perform sampling of positive and negative examples for each class;
2. calculate the normalized alignment scores using the current score parameters;
3. estimate the statistical parameters of the score distributions for each class;
4. calculate a decision boundary for each class;
5. calculate the error rate for each class;
6. calculate the objective function and its gradient;
7. move the score parameters in the search direction.

Iterate this procedure until the objective function converges to a local optimum or the number of iterations reaches a certain constant.

Fig. 2. Overview of the optimization procedure. Each iteration consists of sampling, alignment, calculation of score distributions and optimization of a substitution matrix.
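Steps 3–5 of this procedure can be sketched as follows (an illustrative Python reimplementation, not the authors' C/MPI program; plain bisection stands in for the Van Wijngaarden–Dekker–Brent routine):

```python
import math

EULER_GAMMA = 0.5772156649015329

def evd_fit(scores):
    """Moment estimator, Equations (7) and (8):
    sigma = sqrt(6)/pi * std,  mu = mean - gamma * sigma."""
    n = len(scores)
    mean = sum(scores) / n
    std = math.sqrt(sum((s - mean) ** 2 for s in scores) / (n - 1))
    sigma = math.sqrt(6.0) / math.pi * std
    return mean - EULER_GAMMA * sigma, sigma

def evd_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-z - math.exp(-z)) / sigma

def evd_cdf(x, mu, sigma):
    # Equation (6)
    return math.exp(-math.exp(-(x - mu) / sigma))

def decision_boundary(pos, neg, p_pos=0.5):
    """Root of P(c)p(x|c) = P(c_bar)p(x|c_bar) between the two peaks,
    found by bisection (the peak of an EVD sits at its mu parameter)."""
    (mu_p, s_p), (mu_n, s_n) = evd_fit(pos), evd_fit(neg)
    g = lambda x: p_pos * evd_pdf(x, mu_p, s_p) - (1 - p_pos) * evd_pdf(x, mu_n, s_n)
    lo, hi = min(mu_n, mu_p), max(mu_n, mu_p)
    if g(lo) * g(hi) > 0:        # no crossing between the peaks:
        return mu_p              # fall back to the positive peak
    for _ in range(100):
        mid = (lo + hi) / 2
        if g(lo) * g(mid) <= 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2

def error_rate(t, pos, neg, p_pos=0.5):
    # Equation (4): P(neg)Pr[S > t | neg] + P(pos)Pr[S <= t | pos]
    (mu_p, s_p), (mu_n, s_n) = evd_fit(pos), evd_fit(neg)
    return (1 - p_pos) * (1 - evd_cdf(t, mu_n, s_n)) + p_pos * evd_cdf(t, mu_p, s_p)
```

With well-separated positive and negative score samples, the boundary returned by the bisection minimizes the threshold error of Equation (4) for the fitted distributions.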


IMPLEMENTATION
We have written our software in the C language with the MPI library, and it is available from http://olab.is.s.u-tokyo.ac.jp/~hourai/optssd/index.html

EXPERIMENT
Correct relationships
To obtain accurate substitution matrices, correct relationships should be known. Phylogenetic analysis and comparative genomics reveal functions across species. The COG database (Tatusov et al., 1997, 2001) is carefully collected so as to exclude paralogs but to contain remote homologs. It contains more than 3000 groups and more than 70 000 protein sequences. We mainly use this database in our experiments, for the purpose of comparison with Kann's method, because the database has plenty of data with clear relationships.

Some details in optimization
We begin with the published substitution matrices as initial values, and optimize them with alignment data by maximizing the proposed objective function, giving 1/2 to the prior probability for each class. We assume EVDs for both the distributions of positives and negatives.

If a substitution matrix is replaced by another matrix, the optimal alignments will change. Thus, after several steps in optimization, we perform the alignments again. Moreover, we cannot perform all-to-all alignment, because it would need 70 000 × 70 000 sequence alignments per iteration for the COG database. Therefore, as for the positive examples, we sampled alignment pairs from the same group. As for negative examples for a certain class, we sampled each alignment pair, one from the class and the other from another class.

Evaluation method
For fairness, we use the ratio of errors to evaluate score functions, which is different from our objective function. It is calculated as follows: given sampled alignment scores, we calculate the minimum number of errors (false positives + false negatives) achieved by selecting the optimal threshold for each class. Then the numbers are summed up, and the sum is divided by the number of sampled alignments. This metric represents the maximum ability of a substitution matrix to classify the sequences in a database.

RESULTS
We optimized substitution matrices based on the assumption that the calculated probability of error is correlated with the classification accuracy. Our results support this assumption. In this section, we show the experimental results under different conditions. Note that in these experiments, we re-sampled alignment pairs and performed alignments every 10 optimization steps.

Cross validation
We tested the score functions drawn from the training dataset (half of the COG database) using the test dataset (the other half of the COG database). The initial substitution matrix is PAM250. There is no apparent sign of over-fitting in this optimization (Fig. 3). This result also shows the great reduction in both the average error rate [the value of the translated objective function, 1 − exp(g(w)/|C|)] and the ratio of the number of errors to that of alignment pairs. The figure also shows that the objective function is strongly correlated with the ratio of errors.

Differences by initial values
Figure 4 shows the experimental results for different initial values. The initial open/extension gap costs are 12/2 for BLOSUM50, 11/1 for BLOSUM62, 12/1 for GONNET, 12/1 for JONES, 120/20 for OPTIMA and 14/2 for PAM250. In this experiment, about 100 alignment pairs are chosen at random as positive and negative examples, respectively, for each class. As a whole, about 600 000 pairs are used as training data at each optimization step.

All substitution matrices are improved but seem to have converged to distinct local optima. The reductions in the number of errors range from 43% for JONES to 13% for OPTIMA.

Derived substitution matrix
Table 2 shows the optimized score matrix derived by the proposed method, which is referred to as COGOPT. In this learning, about 250 alignment pairs are chosen at random as positive and negative examples, respectively, for each class, and about 1 500 000 alignments are used as a whole. It is derived from OPTIMA (Kann et al., 2000) with the COG database.

Consistency with other databases
We evaluated substitution matrices with the COG, SCOP (Murzin et al., 1995) and PFAM (Bateman et al., 2002) databases. SCOP40%ID (SCOP95%ID) is derived by excluding sequences with more than 40% (95%) identity from the SCOP database (Brenner et al., 1998; Chandonia et al., 2002). SCOP sequences are classified based on superfamily. The ratios of the minimum number of errors are shown in Table 3. The numbers following matrix names in the table are open and extension gap costs. PFAMOPT, SCOP40OPT and SCOP95OPT are derived by the proposed method with the PFAM (325 766 sequences and 3360 classes), SCOP40%ID (4774 sequences and 1109 classes) and SCOP95%ID (8004 sequences and 1109 classes) databases, respectively. Our matrices achieved the best performance on the database with which we optimized, and also achieved good performance


[Figure 3: two panels plotting 'The expected average error rate' (top) and 'Ratio of errors' (bottom) against '#Steps' (0–100), each with curves for the test set and the training set.]

Fig. 3. Cross-validation experiment: groups in the database are divided into two disjoint sets, which are then used as test and training sets alternately. The top figure shows the average of the normalized objective functions versus the number of optimization steps. The bottom figure shows the ratio of the minimum number of errors to the number of alignments versus the number of optimization steps. The points labeled 'training set' are from training data and the points labeled 'test set' are from test data. The test dataset is one half of the database, and the training dataset is the other half. Exchanging the roles (training, test) of the datasets, we calculated average values obtained from the two training (test) datasets. We tested score matrices every five optimization steps.

to other databases. In particular, PFAMOPT showed notable performance on all databases.

Sensitivity and specificity for structural conservation
Sensitivity and specificity can be measured by receiver operating characteristic (ROC) analysis (Gribskov et al., 1996; Brenner et al., 1998). For this purpose, we draw the plot of the fraction of true positives versus false positives per query by performing all-versus-all alignment with the SCOP40%ID database described above (Fig. 5). Each sequence is ranked based on the E-value obtained by the SSEARCH program (Pearson, 1991). The matrices in the figures are the ones with good results in the previous experiment (Table 3). Over a wide range, SCOP95OPT is the best. GONNET is better than COGOPT in the region of higher fractions of true positives


[Figure 4: 'Ratio of errors' versus '#Steps' (0–100), with one curve per initial matrix: BLOSUM50, BLOSUM62, GONNET, JONES, OPTIMA and PAM250.]

Fig. 4. Comparison of optimization processes for different initial values. 'Ratio of errors' is explained in 'Evaluation method'. Each line shows the optimization process in which the named substitution matrix is the initial value.
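The 'Ratio of errors' plotted in Figure 4 is based on the minimum-error threshold sweep described under 'Evaluation method'; one way to sketch the per-class minimum-error count (an illustration, not the authors' code):

```python
import itertools

def min_errors(pos_scores, neg_scores):
    """Minimum number of misclassifications (FP + FN) over all thresholds,
    classifying scores strictly greater than the threshold as positive."""
    events = sorted([(s, 1) for s in pos_scores] + [(s, 0) for s in neg_scores])
    fp, fn = len(neg_scores), 0          # threshold below every score
    best = fp + fn
    # scores tied at the same value must move past the threshold together
    for _, group in itertools.groupby(events, key=lambda e: e[0]):
        for _, is_pos in group:          # threshold moves just above this score
            if is_pos:
                fn += 1
            else:
                fp -= 1
        best = min(best, fp + fn)
    return best
```

Summing this count over classes and dividing by the number of sampled alignments gives the metric of the figure.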

Table 2. The derived substitution matrix (rounded to integers), COGOPT

A   31
R  -11  50
N  -18   6  56
D  -20 -16  18  62
C   10 -28 -29 -30 107
Q    0  14   4   8 -30  42
E  -10   4   1  19 -39  19  38
G   -1 -20  -9 -11 -27 -22 -26  64
H  -18   6  14  -7 -28   5   2 -17  91
I   -3 -30 -33 -39  -4 -29 -34 -43 -28  37
L   -3 -21 -33 -45  -2 -19 -32 -43 -27  26  38
K   -9  28   1  -3 -30  17  10 -16  -7 -34 -24  35
M   -5 -10 -19 -31  -6   1 -21 -30 -19  13  22 -13  52
F  -15 -30 -29 -35 -16 -28 -34 -32  -6  10  24 -30   5  58
P   -8 -17 -14  -7 -29 -11  -8 -16 -17 -31 -32  -8 -22 -38  74
S   12  -8  12   4  -8   2  -2   5  -7 -22 -24  -1 -10 -19  -3  39
T    0  -5   3 -10  -5  -6  -7 -17 -19  -8 -14  -8  -6 -16  -8  21  45
W  -27 -28 -39 -39 -17 -18 -28 -19 -16 -24 -10 -30  -8  19 -38 -28 -18 109
Y  -17 -11 -17 -19 -17  -8 -23 -29  20  -5   3 -17  -7  37 -28 -18 -15  26  70
V    1 -32 -32 -36  -1 -24 -32 -35 -29  32  17 -24   8   8 -21 -20   2 -24  -5  38
     A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
OGAP, EGAP: -120, -8
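A matrix such as COGOPT is consumed by a Smith–Waterman scorer with affine gap costs. A minimal Gotoh-style sketch follows (a toy match/mismatch function stands in for the full 20x20 table; the gap convention used here, open penalty for the first gap residue and extension penalty for each further one, is an assumption):

```python
def smith_waterman_affine(a, b, score, gap_open, gap_ext):
    """Local alignment score with affine gaps (Gotoh's variant of
    Smith-Waterman)."""
    NEG = float("-inf")
    m, n = len(a), len(b)
    H = [[0.0] * (n + 1) for _ in range(m + 1)]  # best score ending at (i, j)
    E = [[NEG] * (n + 1) for _ in range(m + 1)]  # best ending in a gap in a
    F = [[NEG] * (n + 1) for _ in range(m + 1)]  # best ending in a gap in b
    best = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            E[i][j] = max(H[i][j - 1] - gap_open, E[i][j - 1] - gap_ext)
            F[i][j] = max(H[i - 1][j] - gap_open, F[i - 1][j] - gap_ext)
            H[i][j] = max(0.0,
                          H[i - 1][j - 1] + score(a[i - 1], b[j - 1]),
                          E[i][j], F[i][j])
            best = max(best, H[i][j])
    return best
```

With `score = lambda x, y: 2 if x == y else -1` and gap costs 3/1, identical sequences of length four score 8, and a pair with no similar segment stays at the local-alignment floor of 0.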

but a little worse in the region of lower fractions of true positives.

DISCUSSION
We proposed a new optimization criterion. It has the following advantages over existing methods.

• A user need not know any elaborate model of evolution.
• The input consists of only a classified database and an initial substitution matrix.
• The optimization process is automatic.
• The specificity and sensitivity are improved.

But several problems remain.

• The prior probabilities have an influence on the optimization speed.


[Figure 5: 'false positives per query' versus 'fraction of true positives' for BLOSUM50, COGOPT, GONNET and SCOP95OPT; the top panel covers the full range, the bottom panel the low-fraction region with the y axis in log scale.]

Fig. 5. The result of all-versus-all alignment with the SCOP40%ID database. The top figure shows the result for the full range of fractions of true positives. The bottom figure shows the result for the region of lower fractions of true positives (i.e. the region of higher scores), with the y axis in log scale. The COGOPT matrix (optimized with the COG database) shows good performance in the bottom figure, and SCOP95OPT (optimized with the SCOP95%ID database) shows the best performance overall.

• The type of distribution of positive examples is not analytically known.
• If the substitution matrix changes, optimal alignments change, too.
• Databases may contain sequences classified into multiple classes.
• The optimization process finds only a local minimum and does not always converge to the same values.

The first problem concerns the objective function. In our experiments, we assumed 1/2 for the prior probabilities P(c_i), P(c̄_i). It was supposed that the use of accurate P(c_i) should improve results, but this was not true. We experimented


Table 3. Ratio of the minimum number of errors: the numbers following matrix names are open and extension gap costs

                    COG     SCOP40%ID  SCOP95%ID  Pfam
BLOSUM50-12-2       0.0166  0.0866     0.0776     0.0350
BLOSUM62-11-1       0.0177  0.0924     0.0806     0.0363
COGOPT-120-8        0.0145  0.0835     0.0739     0.0333
GONNET-12-1         0.0190  0.0806     0.0709     0.0374
JONES-12-1          0.0265  0.0876     0.0766     0.0450
OPTIMA-120-20       0.0160  0.0900     0.0798     0.0340
PAM250-14-2         0.0260  0.0976     0.0854     0.0452
PFAMOPT-77-6        0.0147  0.0807     0.0732     0.0324
SCOP40OPT-101-6     0.0220  0.0667     0.0622     0.0415
SCOP95OPT-96-6      0.0234  0.0659     0.0612     0.0432

with the case of P(c_i) = |c_i| / Σ_j |c_j|, but the results were worse than those for the case of P(c_i) = 1/2. The main reason is that if we use P(c_i) = |c_i| / Σ_j |c_j|, there is a large difference in scale between the positive and negative distributions, and this seems to have a bad influence on numerical optimization. That is, the objective function places undue importance on negative examples. This is why we choose 1/2 for the prior probabilities.

The second problem concerns our assumption on the probability distributions. In the case that the type of distribution is unknown, we assume a unimodal distribution. In the sequence alignment case, we used the EVD for positive examples. However, since they did not always fit the EVD, we excluded data which have high scores. High-score data may inflate the variance of the distribution too much and weaken the influence of the lower scores on the error rate. We experimented with the normal distribution, but the result was a little worse; its wide foot seemed not to fit the real distribution.

The third problem remains in the optimization method. We fixed alignments during line search and over several optimization steps. But this may lead the score parameters to destructive changes, although such a phenomenon was not observed in our experiments. Since the optimal alignment score is the maximum of many candidate alignment scores, the preservation of alignments which give large scores and can be represented as vertices of a convex hull may help to work around this problem and to reduce the computation of sequence alignment. However, since we re-sample alignment pairs, it is hard to maintain them. This will be future work.

The fourth is that sequences may belong to multiple classes, and it may be difficult to extract such information from databases. Fortunately, the databases we used have plenty of sequences and only a small number of such sequences, so the probability that such sequences are selected as negative example pairs in our method is extremely small. Our experimental results show that the influence on the distributions of negative examples is small.

The last problem can be divided into two further points. The first is that changing the unit scores, by multiplying or shifting by a constant, does not change an optimal alignment. However, restrictions on these degrees of freedom are not always necessary, because a direction which does not change the objective function will not be searched by the steepest descent method. Practically, this seems valid for the conjugate gradient method, too. The other point is that optimized matrices converged to distinct optima, as shown in Figure 4. This is the restriction of most nonlinear optimization methods that they can find only local minima. However, in our experiments, all matrices are improved considerably. We believe it is worth optimizing.

ACKNOWLEDGEMENT
We thank Dr Kentaro Tomii (CBRC) for much discussion to improve our research. This work is the extension of Hourai's master's thesis in the Miyano lab (Human Genome Center, Institute of Medical Science, University of Tokyo). We thank the members of the Miyano lab for supporting this research. We also thank Dr Steven Brenner of UC Berkeley for letting us know of the work by Kann et al. (2000). This work was supported in part by a Grant-in-Aid for Scientific Research on Priority Areas (C) for 'Genome Information Science' and Grant-in-Aid #13680394 from the MEXT of Japan.

REFERENCES
Altschul,S., Gish,W., Miller,W., Myers,E. and Lipman,D. (1990) Basic local alignment search tool. J. Mol. Biol., 215, 403–410.
Altschul,S., Bundschuh,R., Olsen,R. and Hwa,T. (2001) The estimation of statistical parameters for local alignment score distributions. Nucleic Acids Res., 29(2), 351–361.
Amaldi,E. and Kann,V. (1998) On the approximability of minimizing nonzero variables or unsatisfied relations in linear systems. Theoret. Comput. Sci., 209, 237–260.
Bailey,T.L. and Gribskov,M. (2002) Estimating and evaluating the statistics of gapped local-alignment scores. J. Comput. Biol., 9(3), 575–593.
Bateman,A., Birney,E., Cerruti,L., Durbin,R., Etwiller,L., Eddy,S.R., Griffiths-Jones,S., Howe,K.L., Marshall,M. and Sonnhammer,E.L.L. (2002) The Pfam protein families database. Nucleic Acids Res., 30(1), 276–280.
Brenner,S.E., Chothia,C. and Hubbard,T.J.P. (1998) Assessing sequence comparison methods with reliable structurally identified distant evolutionary relationships. Proc. Natl Acad. Sci. USA, 95, 6073–6078.
Brenner,S.E., Koehl,P. and Levitt,M. (2000) The ASTRAL compendium for sequence and structure analysis. Nucleic Acids Res., 28, 254–256.
Chandonia,J.M., Walker,N.S., Lo Conte,L., Koehl,P., Levitt,M. and Brenner,S.E. (2002) ASTRAL compendium enhancements. Nucleic Acids Res., 30, 260–263.
Dayhoff,M.O., Schwartz,R.M. and Orcutt,B.C. (1978) A model of evolutionary change in proteins. Atlas Prot. Seq. Struct., 5(Suppl. 3), 345–352.
Duda,R.O., Hart,P.E. and Stork,D.G. (2000) Pattern Classification, 2nd edn. Wiley-Interscience.


Gold,E.M. (1967) Language identification in the limit. Informat. Control, 10, 447–474.
Gonnet,G.H., Cohen,M.A. and Benner,S.A. (1992) Exhaustive matching of the entire protein sequence database. Science, 256, 1443–1445.
Gribskov,M. and Robinson,N.L. (1996) Use of receiver operating characteristic (ROC) analysis to evaluate sequence matching. Comput. Chem., 20, 25–33.
Henikoff,S. and Henikoff,J.G. (1992) Amino acid substitution matrices from protein blocks. Proc. Natl Acad. Sci., USA, 89, 10915–10919.
Henikoff,S. and Henikoff,J.G. (1993) Performance evaluation of amino acid substitution matrices. Proteins, 17, 49–61.
Jones,D.T., Taylor,W.R. and Thornton,J.M. (1992) The rapid generation of mutation data matrices from protein sequences. CABIOS, 8, 275–282.
Kann,M., Qian,B. and Goldstein,R.A. (2000) Optimization of a new score function for the detection of remote homologs. Proteins, 41, 498–503.
Karlin,S. and Altschul,S.F. (1990) Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proc. Natl Acad. Sci., USA, 87, 2264–2268.
Kotz,S. and Nadarajah,S. (2001) Extreme Value Distributions: Theory and Applications. Imperial College Press.
Laird,P.D. (1988) Learning from Good and Bad Data. Kluwer Academic Publishers.
Murzin,A.G., Brenner,S.E., Hubbard,T. and Chothia,C. (1995) SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol., 247, 536–540.
Nocedal,J. and Wright,S.J. (1999) Numerical Optimization. Springer.
Pearson,W.R. (1991) Searching protein sequence libraries: comparison of the sensitivity and selectivity of the Smith–Waterman and FASTA algorithms. Genomics, 11, 635–650.
Press,W.H., Teukolsky,S.A., Vetterling,W.T. and Flannery,B.P. (1993) Numerical Recipes in C. Cambridge University Press.
Smith,T.F. and Waterman,M.S. (1981) Identification of molecular subsequences. J. Mol. Biol., 147, 195–197.
Tatusov,R.L., Koonin,E.V. and Lipman,D.J. (1997) A genomic perspective on protein families. Science, 278(5338), 631–637.
Tatusov,R.L., Natale,D.A., Garkavtsev,I.V., Tatusova,T.A., Shankavaram,U.T., Rao,B.S., Kiryutin,B., Galperin,M.Y., Fedorova,N.D. and Koonin,E.V. (2001) The COG database: new developments in phylogenetic classification of proteins from complete genomes. Nucleic Acids Res., 29(1), 22–28.

APPENDIX

Derivation of Bayes error

In order to minimize the probability of error (the loss function in Bayesian statistics), a Bayes discriminant function should be

  "x ∈ c",  if P(c | x) > P(c̄ | x),
  "x ∈ c̄",  if P(c | x) < P(c̄ | x).

The predictor which uses the Bayes decision rule achieves the conditional Bayes error probability,

  error_B(x) = min error(x) = min{P(c̄ | x), P(c | x)},

and it is optimal in a probabilistic view. Considering the overall sample space, the error rate of the Bayes discriminant function is

  error_B = ∫ error_B(x) p(x) dx
          = ∫ min{P(c̄ | x), P(c | x)} p(x) dx
          = ∫ min{P(c̄ | x) p(x), P(c | x) p(x)} dx
          = ∫ min{p(x | c̄) P(c̄), p(x | c) P(c)} dx.

The last step follows from Bayes' theorem,

  p(x | c) P(c) = P(c | x) p(x).

In the case of alignment scores, one can judge the significance from thresholds. We can rewrite the error rate by limiting the Bayes decision boundary to a threshold t as follows:

  ε_c(w) = ∫ min{P(c) p_w(s | c), P(c̄) p_w(s | c̄)} ds
         = ∫_{−∞}^{t} P(c) p_w(s | c) ds + ∫_{t}^{∞} P(c̄) p_w(s | c̄) ds
         = P(c̄) Pr[S > t | c̄] + P(c) Pr[S ≤ t | c],

where we defined the pdf of class c at score s as p_w(s | c), that of the negative examples at score s as p_w(s | c̄), and S as a random variable for alignment scores. The notation w means that the pdfs depend on the parameters of the score function.
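The threshold form of the error rate above, ε_c(w) = P(c̄) Pr[S > t | c̄] + P(c) Pr[S ≤ t | c], can be evaluated directly once the two score distributions are fixed. A minimal stdlib-only sketch, using illustrative normal score distributions with equal priors (the paper models positives with an EVD; all parameter values here are hypothetical):

```python
import math

def normal_cdf(x, mean, std):
    """Normal CDF via the error function (standard library only)."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))

def threshold_error(t, pos=(50.0, 8.0), neg=(30.0, 6.0), prior_pos=0.5):
    """Error rate at threshold t: P(cbar)Pr[S > t | cbar] + P(c)Pr[S <= t | c].

    pos/neg are (mean, std) of hypothetical positive/negative score
    distributions; normals stand in for the paper's EVD to keep the
    sketch self-contained."""
    missed_positives = prior_pos * normal_cdf(t, *pos)            # P(c)Pr[S <= t | c]
    accepted_negatives = (1.0 - prior_pos) * (1.0 - normal_cdf(t, *neg))  # P(cbar)Pr[S > t | cbar]
    return missed_positives + accepted_negatives

# Scanning t and taking the minimum recovers the Bayes-optimal threshold,
# which lies where the prior-weighted pdfs cross.
best_err, best_t = min((threshold_error(t), t) for t in range(0, 81))
```

Under these illustrative parameters the minimizing threshold falls between the two means, closer to the wider positive distribution's tail.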
Derivation of objective function

Suppose s = (s_1, ..., s_l) is the scores of the optimal alignments of a query sequence to each representative of the classes c_i (i = 1, ..., l), and m = (m_1, ..., m_l) represents the memberships


Predictor_i(s_i) which obeys the Bayes decision rule. Given the scores, the probability that all membership determinations are successful is

  Pr[(m_1, ..., m_l) ≡ Predictor(s_1, ..., s_l)] = ∏_{i=1}^{l} Pr[m_i ≡ Predictor_i(s_i)].

From the following equality,

  ∫ Pr[m_i ≡ Predictor_i(s_i)] p(s_i) ds_i = 1 − ε_i(w),

we have

  ∫ Pr[m ≡ Predictor(s)] p(s) ds
    = ∫ ⋯ ∫ ∏_{i=1}^{l} Pr[m_i ≡ Predictor_i(s_i)] p(s_i) ds_1 ⋯ ds_l
    = ∏_{i=1}^{l} (1 − ε_i(w)).

In this derivation, we assumed that an optimal alignment score is independent of the optimal alignment scores for the other classes.
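The final product of per-class success probabilities is straightforward to compute; a small sketch with hypothetical per-class error rates ε_i(w), also showing the equivalent log-domain form, which maximizes the same quantity while avoiding floating-point underflow when the number of classes l is large:

```python
import math

def success_probability(errors):
    """prod_i (1 - eps_i(w)): probability that all l independent
    membership determinations succeed."""
    p = 1.0
    for e in errors:
        p *= 1.0 - e
    return p

def log_success(errors):
    """sum_i log(1 - eps_i(w)); maximizing this maximizes the product
    while staying numerically stable for many classes."""
    return sum(math.log1p(-e) for e in errors)

# hypothetical per-class error rates eps_i(w)
eps = [0.10, 0.20, 0.05]
p_all = success_probability(eps)  # 0.9 * 0.8 * 0.95
```

Because log is monotone, an optimizer can work with log_success directly and recover the same parameters w.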
