Vol. 20 no. 6 2004, pages 863–873 DOI: 10.1093/bioinformatics/btg494

Optimizing substitution matrices by separating score distributions
Yuichiro Hourai1,∗, Tatsuya Akutsu2 and Yutaka Akiyama3

1Department of Computer Science, University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo 113-0033, Japan, 2Bioinformatics Center, Institute for Chemical Research, Kyoto University, Gokasho, Uji, Kyoto 611-0011, Japan and 3Computational Biology Research Center (CBRC), National Institute of Advanced Industrial Science and Technology (AIST), Aomi Frontier Bldg. 17F, 2-43 Aomi, Koto-ku, Tokyo 135-0064, Japan

Received on February 22, 2003; revised on September 5, 2003; accepted on September 10, 2003. Advance Access publication January 29, 2004.

ABSTRACT
Motivation: Homology search is one of the most fundamental tools in bioinformatics. Typical alignment algorithms use substitution matrices and gap costs. Thus, the improvement of substitution matrices increases the accuracy of homology searches. Generally, substitution matrices are derived from aligned sequences whose relationships are known, and gap costs are determined by trial and error. To discriminate relationships more clearly, we optimize the substitution matrices from a statistical viewpoint, using both positive and negative examples through Bayesian decision theory.
Results: Using the Cluster of Orthologous Group (COG) database, we optimized substitution matrices. The classification accuracy of the obtained matrix on the COG database is better than that of conventional substitution matrices. It also achieves good performance in classification with other databases.
Availability: The optimized substitution matrices and the programs are available from http://olab.is.s.u-tokyo.ac.jp/~hourai/optssd/index.html
Contact: [email protected]

INTRODUCTION
Most homology search methods, e.g. SSEARCH, FASTA (Pearson, 1991) and BLAST (Altschul et al., 1990), seek homologous sequences in a database making use of substitution matrices for their scoring schemes. The substitution matrix used in homology search has a great influence on the results.

Table 1 shows an example of a substitution matrix. It is one of the most widely used substitution matrices, referred to as the blocks substitution matrix 62 (BLOSUM 62) (Henikoff et al., 1992, 1993). The BLOSUM matrix family was derived from many (more than 2000) patterns called blocks. Blocks are composed of sequence segments which are identical in more than a particular percentage of residues (e.g. in the case of BLOSUM 62, 62%). Log odds scores (logarithms of ratios of likelihoods) are calculated by counting the substitutions in blocks as follows. First, the observed probabilities of substitutions,

    q_ij = f_ij / Σ_{i,j ∈ AApairs} f_ij,

are calculated, where f_ij is the frequency of the observed pairs i, j. Then the log odds scores for matches or mismatches between amino acids i and j are calculated as

    s_ij = log( q_ij / (p_i p_j) ),

where p_i, p_j are the probabilities of occurrence of amino acids i and j, respectively, in a database.

The point accepted mutation (PAM) matrix family (Dayhoff et al., 1978) is also common; it was proposed earlier and is based on the probability of single point mutation and the theory of Markov processes. The construction of the PAM matrix family differs from Henikoff's method in the calculation of q_ij. Dayhoff's method was applied to large databases by different research groups independently (Gonnet et al., 1992; Jones et al., 1992). Gonnet et al. proposed a probabilistic model of gap insertion from massively collected alignment data.

The OPTIMA matrix (Kann et al., 2000) is derived from the Cluster of Orthologous Group (COG) database (Tatusov et al., 1997, 2001) using the BLOSUM62 matrix by maximizing the average of confidence parameters C = 1/(1 + E), where E is the E-value of alignment scores between homologous sequences. Although Kann's method introduced a new idea and increased the significance of alignment scores between related sequences, from the viewpoint of the probability of error there is still room for improvement. It should be noted that the term 'error' in this paper means the classification error, not the alignment error.

∗To whom correspondence should be addressed.
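As an illustration of this log-odds construction, the computation can be sketched as follows (a toy example with made-up pair frequencies, not the authors' code; note that the standard BLOSUM construction uses an expected frequency of 2·p_i·p_j for i ≠ j, while this sketch follows the simpler formula quoted above):

```python
import math

def log_odds_scores(pair_freq):
    """Compute s_ij = log(q_ij / (p_i * p_j)) from observed pair frequencies,
    following the formulas in the text."""
    total = sum(pair_freq.values())
    # q_ij: observed substitution probabilities
    q = {pair: f / total for pair, f in pair_freq.items()}
    # p_i: background probability of residue i (marginal of q)
    p = {}
    for (i, j), qij in q.items():
        if i == j:
            p[i] = p.get(i, 0.0) + qij
        else:
            p[i] = p.get(i, 0.0) + qij / 2
            p[j] = p.get(j, 0.0) + qij / 2
    return {(i, j): math.log(qij / (p[i] * p[j])) for (i, j), qij in q.items()}

# toy two-letter alphabet: 6 A-A pairs, 2 A-B pairs, 2 B-B pairs
scores = log_odds_scores({("A", "A"): 6, ("A", "B"): 2, ("B", "B"): 2})
```

Identical pairs that occur more often than their background frequencies predict receive positive scores, and mismatches that occur less often receive negative ones, which is exactly the behavior visible in Table 1.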

Bioinformatics 20(6) © Oxford University Press 2004; all rights reserved.

Table 1. BLOSUM62

A   4
R  -1  5
N  -2  0  6
D  -2 -2  1  6
C   0 -3 -3 -3  9
Q  -1  1  0  0 -3  5
E  -1  0  0  2 -4  2  5
G   0 -2  0 -1 -3 -2 -2  6
H  -2  0  1 -1 -3  0  0 -2  8
I  -1 -3 -3 -3 -1 -3 -3 -4 -3  4
L  -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4
K  -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5
M  -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5
F  -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6
P  -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7
S   1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4
T   0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5
W  -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11
Y  -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7
V   0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4
    A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
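Such a lower-triangular table is conveniently expanded into a symmetric lookup structure; a small sketch (only the first five rows of Table 1 are transcribed):

```python
def expand_triangular(order, rows):
    """Expand a lower-triangular substitution table (as in Table 1)
    into a symmetric lookup dictionary."""
    s = {}
    for i, aa in enumerate(order):
        for j, val in enumerate(rows[i]):
            s[(aa, order[j])] = val  # store both orientations of the pair
            s[(order[j], aa)] = val
    return s

# the first five rows of BLOSUM62 from Table 1
order = ["A", "R", "N", "D", "C"]
rows = [
    [4],
    [-1, 5],
    [-2, 0, 6],
    [-2, -2, 1, 6],
    [0, -3, -3, -3, 9],
]
blosum_fragment = expand_triangular(order, rows)
```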

In information science, it is believed that learning from both positive and negative examples is more powerful than learning from only positive examples (Gold, 1967; Laird, 1988). The PAM and BLOSUM matrix families are composed of substitution probabilities from positive examples, i.e. alignments of related sequences, and background probabilities from the composition of a database as negative examples. OPTIMA uses alignments of random sequences as negative examples in calculating E-values. But it does not seem that the previous methods make full use of negative examples. Linear separation methods such as linear programming or support vector machines may be applied to the optimization of score matrices. However, the minimization of the number of errors in linear separation is known to be computationally hard (Amaldi et al., 1998). Therefore, we need to choose a more appropriate learning model to overcome the difficulty of separating positive and negative data by taking account of negative examples more explicitly.

It is believed that the distribution of normalized optimal alignment scores of unrelated sequences can be approximated by the extreme value distribution (EVD) (Kotz et al., 2001). The E-value is the expected number of sequences in a database whose alignment scores are greater than a given alignment score. It is calculated from a database size D and a normalized score x as E = D · Pr[X ≥ x], where Pr[X ≥ x] is the probability, derived from the EVD, that a normalized alignment score between unrelated sequences exceeds x. Many experiments support that E-values of normalized scores discriminate relationships more clearly (Karlin et al., 1990; Brenner et al., 1998).

Bayesian decision theory is useful when probability distributions are known. We apply it to sequence classification and test our method with the COG database. Since the optimization results depend on the nature of the database, it is important to optimize a substitution matrix with a database which agrees with one's purpose. On the other hand, the score matrix optimized for the COG database should also be useful for other databases (e.g. the PFAM and SCOP databases). Thus, we apply the optimized score matrix to the PFAM and SCOP databases. Furthermore, we apply the proposed method to the optimization of score matrices for the PFAM and SCOP databases too.

SYSTEM AND METHODS
In our system, the input consists of a substitution matrix with gap costs and a classified protein sequence database, in which classes are disjoint from each other. The output is a substitution matrix with gap costs. Given a substitution matrix as an initial value and classified protein sequence data as a training dataset, our goal is to optimize the substitution matrix in order to improve the classification accuracy.

Learning sample
The relationship of two sequences is predicted based on their optimal alignment score. Therefore, positive examples are alignment scores of related sequences and negative examples are those of unrelated sequences. Optimal alignments are calculated by the Smith–Waterman alignment algorithm (Smith et al., 1981). The raw score S of an alignment can be
described as

    S(w, n) = Σ_{i,j ∈ AApair} n_ij w_ij + n_ogap w_ogap + n_egap w_egap
            = Σ_k n_k w_k
            = n · w,    (1)

where n_ij, n_ogap and n_egap denote the numbers of substitutions between amino acids i and j, open gaps and extension gaps, respectively, and w_ij, w_ogap and w_egap are the scores per substitution, open gap and extension gap, respectively.

Fig. 1. The Bayes decision boundary and the Bayes error rate of two distributions. The Bayes decision boundary is the point where the minimizer of the two functions changes. The Bayes error rate is the area of the shadowed region.

When we change the score function w, we can recalculate the alignment score S with the 212-dimensional vector n. This is based on
the assumption that a small change of a substitution matrix does not much affect optimal alignments.

It is important to normalize scores, because scores are affected by the lengths of amino acid sequences. The normalization formula is

    S' = f_w(n) = λ S(w, n) − log KN,    (2)

where λ, K are parameters and N is the search space size. We calculate these parameters based on the ML method (Bailey et al., 2002).

Objective function
The most important part of optimization is to design an objective function. Since we want to reduce the numbers of false positives and false negatives, the objective function must reflect them. We make use of the error rate, because minimizing the error rate is one of the best criteria from a statistical viewpoint.

Error rate  The error rate is defined by

    ε = (#FP + #FN) / (#P + #N),

where #FP, #FN, #P and #N are the numbers of false positives, false negatives, positive examples and negative examples, respectively. It can be written as

    ε = c · (1 − Sensitivity) + (1 − c) · (1 − Specificity),

where Sensitivity = (#P − #FN)/#P, Specificity = (#N − #FP)/#N and c = #P/(#P + #N). (Some other papers define Specificity = (#P − #FN)/(#P − #FN + #FP).) Therefore, the error rate is correlated with sensitivity and specificity, which are major evaluation criteria of discrimination methods. The error rate depends on the choice of a threshold. We approximate the Bayes decision rule to determine thresholds.

Bayes error  We describe Bayesian decision theory (Duda et al., 2000) in brief. We often face the problem of class membership, where we judge the membership of some elements to a certain class using their features, and must answer 'yes' or 'no'. A function is called a discriminant function if its output helps to decide on the class membership. Suppose the data to be dealt with are a random sample from a continuous space. In this case, using a discriminant function for a class c, the probability of error for an observed data point x is

    error(x) = P(c̄ | x), if the function answers x ∈ c,
               P(c | x), if the function answers x ∈ c̄.

Considering the overall sample space, the error rate of the Bayes discriminant function is

    error_B = ∫ min{ p(x | c̄) P(c̄), p(x | c) P(c) } dx.    (3)

See the Appendix for the derivation. This is called the Bayes error, or the Bayes risk for the 0–1 loss. Figure 1 gives some insight. The boundaries, at which the minimizer in error_B(x) changes, should be the thresholds for the membership decision.

In general, the true conditional probabilities P(x | c), P(x | c̄) are unknown. Therefore, we infer these probability distributions by sampling and parameter estimation. The prior probabilities P(c), P(c̄) are unknown, too. To mitigate the bias of a database, and for another practical reason (which will be discussed later), the prior probabilities P(c_i) are set to either 1/2 or |c_i| / Σ_j |c_j|.

If the estimated probability distribution coincides with the true distribution and the threshold coincides with the Bayes discriminant function, the estimated Bayes error coincides with the minimum error rate.

Modification for sequence classification  We consider the multi-class membership problem. Suppose C = {c_1, ..., c_l} is a set of classes. Given a data point n (a fixed alignment) and a score function f_w, where w is a vector of parameters (a substitution matrix), we calculate its observed score by
Equations (1) and (2). Suppose the membership of a sequence to class c_j is determined by sampling a sequence from class c_j and calculating their alignment score. Since we use only one optimal threshold t_j for a class c_j, the probability of error over all database sequences can be calculated from Equation (3) as

    ε_j(w) = P(c̄_j) Pr[S > t_j | c̄_j] + P(c_j) Pr[S ≤ t_j | c_j],    (4)

where S is a random variable for alignment scores between database sequences and sequences in class c_j, and its distribution depends on the substitution matrix w. We should maximize the expected probability that membership determinations over all classes by the Bayes decision boundaries are successful. So, in order to optimize the score function over all classes, we designed the objective function (see the Appendix for details)

    maximize_w  Σ_{i=1}^{l} log[1 − ε_i(w)],    (5)

where the parameter w consists of a substitution matrix and gap costs. It is the logarithm of the expected probability that alignment scores between a given sequence and representative sequences of each group determine their membership correctly.

Decision boundary
We have to decide a threshold as the Bayes decision boundary between two distributions, which we assume to be unimodal for the ease of determining thresholds. In the case where both distributions are normal distributions, it can be solved analytically. But since we assume EVDs, we must use numerical methods to solve the equation

    P(c) p(x | c) = P(c̄) p(x | c̄).

We choose the Van Wijngaarden–Dekker–Brent method (Press et al., 1993), because we are interested only in the solution between the peaks of the two distributions, and the method is a bracketing method and sufficiently fast. If no cross point is found, the score which gives the peak of p(x | c_i) is chosen as the decision boundary.

Extreme value distribution
Many extreme values (maximums or minimums) form extreme value distributions (Kotz et al., 2001). Its cdf is

    Pr[X ≤ x] = exp(−e^{−(x−µ)/σ}),    (6)

where µ, σ (σ > 0) are parameters. To estimate these parameters, we use the moment estimator

    σ̃ = (√6 / π) S,    (7)
    µ̃ = X̃ − γ σ̃,    (8)

where γ is Euler's constant (≈0.57722), S is the standard deviation and X̃ is the average (Kotz et al., 2001). It may be a rough estimator, but it needs less computation time than the ML estimator does.

Optimization method
We adopt the nonlinear conjugate gradient method (Press et al., 1993; Nocedal et al., 1999) for optimizing our objective function. It requires the first differential of the objective function. We approximate it by finite differencing (Nocedal et al., 1999). We adopted Brent's method (Press et al., 1993) for the line search in the conjugate direction. Since conjugate gradient methods find only a local optimum, a good initial value will help the discovery of a good solution.

Optimization procedure
We summarize the optimization procedure (Fig. 2).

1. perform sampling of positive and negative examples for each class;
2. calculate the normalized alignment scores using the current score parameters;
3. estimate the statistical parameters of the score distributions for each class;
4. calculate a decision boundary for each class;
5. calculate the error rate for each class;
6. calculate the objective function and its gradient;
7. move the score parameters in the search direction.

Iterate this procedure until the objective function converges to a local optimum or the number of iterations reaches a certain constant.

Fig. 2. Overview of the optimization procedure. Each iteration consists of sampling, alignment, calculation of score distributions and optimization of a substitution matrix.
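Steps 3–5 of this procedure can be sketched as follows (an illustrative Python reimplementation, not the authors' C/MPI program; plain bisection stands in for the Van Wijngaarden–Dekker–Brent routine):

```python
import math

EULER_GAMMA = 0.5772156649015329

def evd_fit(scores):
    """Moment estimator, Equations (7) and (8):
    sigma = sqrt(6)/pi * std,  mu = mean - gamma * sigma."""
    n = len(scores)
    mean = sum(scores) / n
    std = math.sqrt(sum((s - mean) ** 2 for s in scores) / (n - 1))
    sigma = math.sqrt(6.0) / math.pi * std
    return mean - EULER_GAMMA * sigma, sigma

def evd_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-z - math.exp(-z)) / sigma

def evd_cdf(x, mu, sigma):
    # Equation (6)
    return math.exp(-math.exp(-(x - mu) / sigma))

def decision_boundary(pos, neg, p_pos=0.5):
    """Root of P(c)p(x|c) = P(c_bar)p(x|c_bar) between the two peaks,
    found by bisection (the peak of an EVD sits at its mu parameter)."""
    (mu_p, s_p), (mu_n, s_n) = evd_fit(pos), evd_fit(neg)
    g = lambda x: p_pos * evd_pdf(x, mu_p, s_p) - (1 - p_pos) * evd_pdf(x, mu_n, s_n)
    lo, hi = min(mu_n, mu_p), max(mu_n, mu_p)
    if g(lo) * g(hi) > 0:        # no crossing between the peaks:
        return mu_p              # fall back to the positive peak
    for _ in range(100):
        mid = (lo + hi) / 2
        if g(lo) * g(mid) <= 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2

def error_rate(t, pos, neg, p_pos=0.5):
    # Equation (4): P(neg)Pr[S > t | neg] + P(pos)Pr[S <= t | pos]
    (mu_p, s_p), (mu_n, s_n) = evd_fit(pos), evd_fit(neg)
    return (1 - p_pos) * (1 - evd_cdf(t, mu_n, s_n)) + p_pos * evd_cdf(t, mu_p, s_p)
```

With well-separated positive and negative score samples, the boundary returned by the bisection minimizes the threshold error of Equation (4) for the fitted distributions.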


IMPLEMENTATION
We have written our software in the C language with the MPI library, and it is available from http://olab.is.s.u-tokyo.ac.jp/~hourai/optssd/index.html

EXPERIMENT
Correct relationships
To obtain accurate substitution matrices, correct relationships should be known. Phylogenetic analysis and comparative genomics reveal functions across species. The COG database (Tatusov et al., 1997, 2001) is carefully collected so as to exclude paralogs but to contain remote homologs. It contains more than 3000 groups and more than 70 000 protein sequences. We mainly use this database in our experiments, for the purpose of comparison with Kann's method, because the database has plenty of data with clear relationships.

Some details in optimization
We begin with the published substitution matrices as initial values, and optimize them with alignment data by maximizing the proposed objective function, giving 1/2 to the prior probability for each class. We assume EVDs for both the distributions of positives and negatives.

If a substitution matrix is replaced by another matrix, the optimal alignments will change. Thus, after several steps in optimization, we perform the alignments again. Moreover, we cannot perform all-to-all alignment, because it would need 70 000 × 70 000 sequence alignments per iteration for the COG database. Therefore, as for the positive examples, we sampled alignment pairs from the same group. As for negative examples for a certain class, we sampled each alignment pair, one from the class and the other from another class.

Evaluation method
For fairness, we use the ratio of errors to evaluate score functions, which is different from our objective function. It is calculated as follows: given sampled alignment scores, we calculate the minimum number of errors (false positives + false negatives) achieved by selecting the optimal threshold for each class. Then the numbers are summed up, and the sum is divided by the number of sampled alignments. This metric represents the maximum ability of a substitution matrix to classify the sequences in a database.

RESULTS
We optimized substitution matrices based on the assumption that the calculated probability of error is correlated with the classification accuracy. Our results support this assumption. In this section, we show the experimental results under different conditions. Note that in these experiments, we re-sampled alignment pairs and performed alignments every 10 optimization steps.

Cross validation
We tested the score functions drawn from the training dataset (half of the COG database) using the test dataset (the other half of the COG database). The initial substitution matrix is PAM250. There is no apparent sign of over-fitting in this optimization (Fig. 3). This result also shows the great reduction in both the average error rate [the value of the translated objective function, 1 − exp(g(w)/|C|)] and the ratio of the number of errors to that of alignment pairs. The figure also shows that the objective function is strongly correlated with the ratio of errors.

Differences by initial values
Figure 4 shows the experimental results for different initial values. The initial open/extension gap costs are 12/2 for BLOSUM50, 11/1 for BLOSUM62, 12/1 for GONNET, 12/1 for JONES, 120/20 for OPTIMA and 14/2 for PAM250. In this experiment, about 100 alignment pairs are chosen at random as positive and negative examples, respectively, for each class. As a whole, about 600 000 pairs are used as training data at each optimization step.

All substitution matrices are improved but seem to have converged to distinct local optima. The reductions in the number of errors range from 43% for JONES to 13% for OPTIMA.

Derived substitution matrix
Table 2 shows the optimized score matrix derived by the proposed method, which is referred to as COGOPT. In this learning, about 250 alignment pairs are chosen at random as positive and negative examples, respectively, for each class, and about 1 500 000 alignments are used as a whole. It is derived from OPTIMA (Kann et al., 2000) with the COG database.

Consistency with other databases
We evaluated substitution matrices with the COG, SCOP (Murzin et al., 1995) and PFAM (Bateman et al., 2002) databases. SCOP40%ID (SCOP95%ID) is derived by excluding sequences with more than 40% (95%) identity from the SCOP database (Brenner et al., 1998; Chandonia et al., 2002). SCOP sequences are classified based on superfamily. The ratios of the minimum number of errors are shown in Table 3. The numbers following matrix names in the table are open and extension gap costs. PFAMOPT, SCOP40OPT and SCOP95OPT are derived by the proposed method with the PFAM (325 766 sequences and 3360 classes), SCOP40%ID (4774 sequences and 1109 classes) and SCOP95%ID (8004 sequences and 1109 classes) databases, respectively. Our matrices achieved the best performance on the database with which we optimized, and also achieved good performance


[Figure 3: two panels plotting 'The expected average error rate' (top) and 'Ratio of errors' (bottom) against '#Steps' (0–100), each with curves for the test set and the training set.]

Fig. 3. Cross-validation experiment: groups in the database are divided into two disjoint sets, which are then used as test and training sets alternately. The top figure shows the average of the normalized objective functions versus the number of optimization steps. The bottom figure shows the ratio of the minimum number of errors to the number of alignments versus the number of optimization steps. The points labeled 'training set' are from training data and the points labeled 'test set' are from test data. The test dataset is one half of the database, and the training dataset is the other half. Exchanging the roles (training, test) of the datasets, we calculated average values obtained from the two training (test) datasets. We tested score matrices every five optimization steps.

to other databases. In particular, PFAMOPT showed notable performance on all databases.

Sensitivity and specificity for structural conservation
Sensitivity and specificity can be measured by receiver operating characteristic (ROC) analysis (Gribskov et al., 1996; Brenner et al., 1998). For this purpose, we draw the plot of the fraction of true positives versus false positives per query by performing all-versus-all alignment with the SCOP40%ID database described above (Fig. 5). Each sequence is ranked based on the E-value obtained by the SSEARCH program (Pearson, 1991). The matrices in the figures are the ones with good results in the previous experiment (Table 3). Over a wide range, SCOP95OPT is the best. GONNET is better than COGOPT in the region of higher fractions of true positives


[Figure 4: 'Ratio of errors' versus '#Steps' (0–100), with one curve per initial matrix: BLOSUM50, BLOSUM62, GONNET, JONES, OPTIMA and PAM250.]

Fig. 4. Comparison of optimization processes for different initial values. 'Ratio of errors' is explained in 'Evaluation method'. Each line shows the optimization process in which the named substitution matrix is the initial value.
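The 'Ratio of errors' plotted in Figure 4 is based on the minimum-error threshold sweep described under 'Evaluation method'; one way to sketch the per-class minimum-error count (an illustration, not the authors' code):

```python
import itertools

def min_errors(pos_scores, neg_scores):
    """Minimum number of misclassifications (FP + FN) over all thresholds,
    classifying scores strictly greater than the threshold as positive."""
    events = sorted([(s, 1) for s in pos_scores] + [(s, 0) for s in neg_scores])
    fp, fn = len(neg_scores), 0          # threshold below every score
    best = fp + fn
    # scores tied at the same value must move past the threshold together
    for _, group in itertools.groupby(events, key=lambda e: e[0]):
        for _, is_pos in group:          # threshold moves just above this score
            if is_pos:
                fn += 1
            else:
                fp -= 1
        best = min(best, fp + fn)
    return best
```

Summing this count over classes and dividing by the number of sampled alignments gives the metric of the figure.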

Table 2. The derived substitution matrix (rounded to integers), COGOPT

A   31
R  -11  50
N  -18   6  56
D  -20 -16  18  62
C   10 -28 -29 -30 107
Q    0  14   4   8 -30  42
E  -10   4   1  19 -39  19  38
G   -1 -20  -9 -11 -27 -22 -26  64
H  -18   6  14  -7 -28   5   2 -17  91
I   -3 -30 -33 -39  -4 -29 -34 -43 -28  37
L   -3 -21 -33 -45  -2 -19 -32 -43 -27  26  38
K   -9  28   1  -3 -30  17  10 -16  -7 -34 -24  35
M   -5 -10 -19 -31  -6   1 -21 -30 -19  13  22 -13  52
F  -15 -30 -29 -35 -16 -28 -34 -32  -6  10  24 -30   5  58
P   -8 -17 -14  -7 -29 -11  -8 -16 -17 -31 -32  -8 -22 -38  74
S   12  -8  12   4  -8   2  -2   5  -7 -22 -24  -1 -10 -19  -3  39
T    0  -5   3 -10  -5  -6  -7 -17 -19  -8 -14  -8  -6 -16  -8  21  45
W  -27 -28 -39 -39 -17 -18 -28 -19 -16 -24 -10 -30  -8  19 -38 -28 -18 109
Y  -17 -11 -17 -19 -17  -8 -23 -29  20  -5   3 -17  -7  37 -28 -18 -15  26  70
V    1 -32 -32 -36  -1 -24 -32 -35 -29  32  17 -24   8   8 -21 -20   2 -24  -5  38
     A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
OGAP, EGAP: -120, -8
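A matrix such as COGOPT is consumed by a Smith–Waterman scorer with affine gap costs. A minimal Gotoh-style sketch follows (a toy match/mismatch function stands in for the full 20x20 table; the gap convention used here, open penalty for the first gap residue and extension penalty for each further one, is an assumption):

```python
def smith_waterman_affine(a, b, score, gap_open, gap_ext):
    """Local alignment score with affine gaps (Gotoh's variant of
    Smith-Waterman)."""
    NEG = float("-inf")
    m, n = len(a), len(b)
    H = [[0.0] * (n + 1) for _ in range(m + 1)]  # best score ending at (i, j)
    E = [[NEG] * (n + 1) for _ in range(m + 1)]  # best ending in a gap in a
    F = [[NEG] * (n + 1) for _ in range(m + 1)]  # best ending in a gap in b
    best = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            E[i][j] = max(H[i][j - 1] - gap_open, E[i][j - 1] - gap_ext)
            F[i][j] = max(H[i - 1][j] - gap_open, F[i - 1][j] - gap_ext)
            H[i][j] = max(0.0,
                          H[i - 1][j - 1] + score(a[i - 1], b[j - 1]),
                          E[i][j], F[i][j])
            best = max(best, H[i][j])
    return best
```

With `score = lambda x, y: 2 if x == y else -1` and gap costs 3/1, identical sequences of length four score 8, and a pair with no similar segment stays at the local-alignment floor of 0.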

but a little worse in the region of lower fractions of true positives.

DISCUSSION
We proposed a new optimization criterion. It has the following advantages over existing methods.

• A user need not know any elaborate model of evolution.
• The input consists of only a classified database and an initial substitution matrix.
• The optimization process is automatic.
• The specificity and sensitivity are improved.

But several problems remain.

• The prior probabilities have an influence on the optimization speed.


[Figure 5: 'false positives per query' versus 'fraction of true positives' for BLOSUM50, COGOPT, GONNET and SCOP95OPT; the top panel covers the full range, the bottom panel the low-fraction region with the y axis in log scale.]

Fig. 5. The result of all-versus-all alignment with the SCOP40%ID database. The top figure shows the result for the full range of fractions of true positives. The bottom figure shows the result for the region of lower fractions of true positives (i.e. the region of higher scores), with the y axis in log scale. The COGOPT matrix (optimized with the COG database) shows good performance in the bottom figure, and SCOP95OPT (optimized with the SCOP95%ID database) shows the best performance overall.

• The type of distribution of positive examples is not analytically known.
• If the substitution matrix changes, optimal alignments change, too.
• Databases may contain sequences classified into multiple classes.
• The optimization process finds only a local minimum and does not always converge to the same values.

The first problem concerns the objective function. In our experiments, we assumed 1/2 for the prior probabilities P(c_i), P(c̄_i). It was supposed that the use of accurate P(c_i) should improve results, but this was not true. We experimented


Table 3. Ratio of the minimum number of errors: the numbers following matrix names are open and extension gap costs

                    COG     SCOP40%ID  SCOP95%ID  Pfam
BLOSUM50-12-2       0.0166  0.0866     0.0776     0.0350
BLOSUM62-11-1       0.0177  0.0924     0.0806     0.0363
COGOPT-120-8        0.0145  0.0835     0.0739     0.0333
GONNET-12-1         0.0190  0.0806     0.0709     0.0374
JONES-12-1          0.0265  0.0876     0.0766     0.0450
OPTIMA-120-20       0.0160  0.0900     0.0798     0.0340
PAM250-14-2         0.0260  0.0976     0.0854     0.0452
PFAMOPT-77-6        0.0147  0.0807     0.0732     0.0324
SCOP40OPT-101-6     0.0220  0.0667     0.0622     0.0415
SCOP95OPT-96-6      0.0234  0.0659     0.0612     0.0432

with the case of P(c_i) = |c_i| / Σ_j |c_j|, but the results were worse than those for the case of P(c_i) = 1/2. The main reason is that if we use P(c_i) = |c_i| / Σ_j |c_j|, there is a large difference in scale between the positive and negative distributions, and this seems to have a bad influence on numerical optimization. That is, the objective function places undue importance on negative examples. This is why we choose 1/2 for the prior probabilities.

The second problem concerns our assumption on the probability distributions. In the case that the type of distribution is unknown, we assume a unimodal distribution. In the sequence alignment case, we used the EVD for positive examples. However, since they did not always fit the EVD, we excluded data which have high scores. High-score data may inflate the variance of the distribution too much and weaken the influence of the lower scores on the error rate. We experimented with the normal distribution, but the result was a little worse; its wide foot seemed not to fit the real distribution.

The third problem remains in the optimization method. We fixed alignments during line search and over several optimization steps. But this may lead the score parameters to destructive changes, although such a phenomenon was not observed in our experiments. Since the optimal alignment score is the maximum of many candidate alignment scores, the preservation of alignments which give large scores and can be represented as vertices of a convex hull may help to work around this problem and to reduce the computation of sequence alignment. However, since we re-sample alignment pairs, it is hard to maintain them. This will be future work.

The fourth is that sequences may belong to multiple classes, and it may be difficult to extract such information from databases. Fortunately, the databases we used have plenty of sequences and only a small number of such sequences, so the probability that such sequences are selected as negative example pairs in our method is extremely small. Our experimental results show that the influence on the distributions of negative examples is small.

The last problem can be divided into two further points. The first is that changing the unit scores, by multiplying or shifting by a constant, does not change an optimal alignment. However, restrictions on these degrees of freedom are not always necessary, because a direction which does not change the objective function will not be searched by the steepest descent method. Practically, this seems valid for the conjugate gradient method, too. The other point is that optimized matrices converged to distinct optima, as shown in Figure 4. This is the restriction of most nonlinear optimization methods that they can find only local minima. However, in our experiments, all matrices are improved considerably. We believe it is worth optimizing.

ACKNOWLEDGEMENT
We thank Dr Kentaro Tomii (CBRC) for much discussion to improve our research. This work is the extension of Hourai's master's thesis in the Miyano lab (Human Genome Center, Institute of Medical Science, University of Tokyo). We thank the members of the Miyano lab for supporting this research. We also thank Dr Steven Brenner of UC Berkeley for letting us know of the work by Kann et al. (2000). This work was supported in part by a Grant-in-Aid for Scientific Research on Priority Areas (C) for 'Genome Information Science' and Grant-in-Aid #13680394 from the MEXT of Japan.

REFERENCES
Altschul,S., Gish,W., Miller,W., Myers,E. and Lipman,D. (1990) Basic local alignment search tool. J. Mol. Biol., 215, 403–410.
Altschul,S., Bundschuh,R., Olsen,R. and Hwa,T. (2001) The estimation of statistical parameters for local alignment score distributions. Nucleic Acids Res., 29(2), 351–361.
Amaldi,E. and Kann,V. (1998) On the approximability of minimizing nonzero variables or unsatisfied relations in linear systems. Theoret. Comput. Sci., 209, 237–260.
Bailey,T.L. and Gribskov,M. (2002) Estimating and evaluating the statistics of gapped local-alignment scores. J. Comput. Biol., 9(3), 575–593.
Bateman,A., Birney,E., Cerruti,L., Durbin,R., Etwiller,L., Eddy,S.R., Griffiths-Jones,S., Howe,K.L., Marshall,M. and Sonnhammer,E.L.L. (2002) The Pfam protein families database. Nucleic Acids Res., 30(1), 276–280.
Brenner,S.E., Chothia,C. and Hubbard,T.J.P. (1998) Assessing sequence comparison methods with reliable structurally identified distant evolutionary relationships. Proc. Natl Acad. Sci. USA, 95, 6073–6078.
Brenner,S.E., Koehl,P. and Levitt,M. (2000) The ASTRAL compendium for sequence and structure analysis. Nucleic Acids Res., 28, 254–256.
Chandonia,J.M., Walker,N.S., Lo Conte,L., Koehl,P., Levitt,M. and Brenner,S.E. (2002) ASTRAL compendium enhancements. Nucleic Acids Res., 30, 260–263.
Dayhoff,M.O., Schwartz,R.M. and Orcutt,B.C. (1978) A model of evolutionary change in proteins. Atlas Prot. Seq. Struct., 5(Suppl. 3), 345–352.
Duda,R.O., Hart,P.E. and Stork,D.G. (2000) Pattern Classification, 2nd edn. Wiley-Interscience.


Gold,E.M. (1967) Language identification in the limit. Informat. Control, 10, 447–474.
Gonnet,G.H., Cohen,M.A. and Benner,S.A. (1992) Exhaustive matching of the entire protein sequence database. Science, 256, 1443–1445.
Gribskov,M. and Robinson,N.L. (1996) Use of receiver operating characteristic (ROC) analysis to evaluate sequence matching. Comput. Chem., 20, 25–33.
Henikoff,S. and Henikoff,J.G. (1992) Amino acid substitution matrices from protein blocks. Proc. Natl Acad. Sci., USA, 89, 10915–10919.
Henikoff,S. and Henikoff,J.G. (1993) Performance evaluation of amino acid substitution matrices. Proteins, 17, 49–61.
Jones,D.T., Taylor,W.R. and Thornton,J.M. (1992) The rapid generation of mutation data matrices from protein sequences. CABIOS, 8, 275–282.
Kann,M., Qian,B. and Goldstein,R.A. (2000) Optimization of a new score function for the detection of remote homologs. Proteins, 41, 498–503.
Karlin,S. and Altschul,S.F. (1990) Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proc. Natl Acad. Sci., USA, 87, 2264–2268.
Kotz,S. and Nadarajah,S. (2001) Extreme Value Distributions: Theory and Applications. Imperial College Press.
Laird,P.D. (1988) Learning from Good and Bad Data. Kluwer Academic Publishers.
Murzin,A.G., Brenner,S.E., Hubbard,T. and Chothia,C. (1995) SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol., 247, 536–540.
Nocedal,J. and Wright,S.J. (1999) Numerical Optimization. Springer.
Pearson,W.R. (1991) Searching protein sequence libraries: comparison of the sensitivity and selectivity of the Smith–Waterman and FASTA algorithms. Genomics, 11, 635–650.
Press,W.H., Teukolsky,S.A., Vetterling,W.T. and Flannery,B.P. (1993) Numerical Recipes in C. Cambridge University Press.
Smith,T.F. and Waterman,M.S. (1981) Identification of molecular subsequences. J. Mol. Biol., 147, 195–197.
Tatusov,R.L., Koonin,E.V. and Lipman,D.J. (1997) A genomic perspective on protein families. Science, 278(5338), 631–637.
Tatusov,R.L., Natale,D.A., Garkavtsev,I.V., Tatusova,T.A., Shankavaram,U.T., Rao,B.S., Kiryutin,B., Galperin,M.Y., Fedorova,N.D. and Koonin,E.V. (2001) The COG database: new developments in phylogenetic classification of proteins from complete genomes. Nucleic Acids Res., 29(1), 22–28.

APPENDIX

Derivation of Bayes error

In order to minimize the probability of error (the loss function in Bayesian statistics), a Bayes discriminant function should be

  "x ∈ c",  if P(c | x) > P(c̄ | x),
  "x ∈ c̄",  if P(c | x) < P(c̄ | x).

The predictor which uses the Bayes decision rule achieves the conditional Bayes error probability,

  error_B(x) = min error(x) = min{P(c̄ | x), P(c | x)},

and it is optimal in a probabilistic view. Considering the overall sample space, the error rate of the Bayes discriminant function is

  error_B = ∫ error_B(x) p(x) dx
          = ∫ min{P(c̄ | x), P(c | x)} p(x) dx
          = ∫ min{P(c̄ | x) p(x), P(c | x) p(x)} dx
          = ∫ min{p(x | c̄) P(c̄), p(x | c) P(c)} dx.

The last step follows from Bayes' theorem,

  p(x | c) P(c) = P(c | x) p(x).

In the case of alignment scores, one can judge the significance from thresholds. We can rewrite the error rate by limiting the Bayes decision boundary to a threshold t as follows:

  ε_c(w) = ∫ min{P(c) p_w(s | c), P(c̄) p_w(s | c̄)} ds
         = ∫_{−∞}^{t} P(c) p_w(s | c) ds + ∫_{t}^{∞} P(c̄) p_w(s | c̄) ds
         = P(c̄) Pr[S > t | c̄] + P(c) Pr[S ≤ t | c],

where we defined the pdf of class c at score s as p_w(s | c), that of the negative examples at score s as p_w(s | c̄), and S as a random variable for alignment scores. The notation w means that the pdfs depend on the parameters of the score function.
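The threshold form of the error rate above, ε_c(w) = P(c̄) Pr[S > t | c̄] + P(c) Pr[S ≤ t | c], can be evaluated directly once the two score distributions are fixed. A minimal stdlib-only sketch, using illustrative normal score distributions with equal priors (the paper models positives with an EVD; all parameter values here are hypothetical):

```python
import math

def normal_cdf(x, mean, std):
    """Normal CDF via the error function (standard library only)."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))

def threshold_error(t, pos=(50.0, 8.0), neg=(30.0, 6.0), prior_pos=0.5):
    """Error rate at threshold t: P(cbar)Pr[S > t | cbar] + P(c)Pr[S <= t | c].

    pos/neg are (mean, std) of hypothetical positive/negative score
    distributions; normals stand in for the paper's EVD to keep the
    sketch self-contained."""
    missed_positives = prior_pos * normal_cdf(t, *pos)            # P(c)Pr[S <= t | c]
    accepted_negatives = (1.0 - prior_pos) * (1.0 - normal_cdf(t, *neg))  # P(cbar)Pr[S > t | cbar]
    return missed_positives + accepted_negatives

# Scanning t and taking the minimum recovers the Bayes-optimal threshold,
# which lies where the prior-weighted pdfs cross.
best_err, best_t = min((threshold_error(t), t) for t in range(0, 81))
```

Under these illustrative parameters the minimizing threshold falls between the two means, closer to the wider positive distribution's tail.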
Derivation of objective function

Suppose s = (s_1, ..., s_l) is the scores of the optimal alignments of a query sequence to each representative of the classes c_i (i = 1, ..., l), and m = (m_1, ..., m_l) represents the memberships


Predictor_i(s_i) which obeys the Bayes decision rule. Given the scores, the probability that all membership determinations are successful is

  Pr[(m_1, ..., m_l) ≡ Predictor(s_1, ..., s_l)] = ∏_{i=1}^{l} Pr[m_i ≡ Predictor_i(s_i)].

From the following equality,

  ∫ Pr[m_i ≡ Predictor_i(s_i)] p(s_i) ds_i = 1 − ε_i(w),

we have

  ∫ Pr[m ≡ Predictor(s)] p(s) ds
    = ∫ ⋯ ∫ ∏_{i=1}^{l} Pr[m_i ≡ Predictor_i(s_i)] p(s_i) ds_1 ⋯ ds_l
    = ∏_{i=1}^{l} (1 − ε_i(w)).

In this derivation, we assumed that an optimal alignment score is independent of the optimal alignment scores for the other classes.
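The final product of per-class success probabilities is straightforward to compute; a small sketch with hypothetical per-class error rates ε_i(w), also showing the equivalent log-domain form, which maximizes the same quantity while avoiding floating-point underflow when the number of classes l is large:

```python
import math

def success_probability(errors):
    """prod_i (1 - eps_i(w)): probability that all l independent
    membership determinations succeed."""
    p = 1.0
    for e in errors:
        p *= 1.0 - e
    return p

def log_success(errors):
    """sum_i log(1 - eps_i(w)); maximizing this maximizes the product
    while staying numerically stable for many classes."""
    return sum(math.log1p(-e) for e in errors)

# hypothetical per-class error rates eps_i(w)
eps = [0.10, 0.20, 0.05]
p_all = success_probability(eps)  # 0.9 * 0.8 * 0.95
```

Because log is monotone, an optimizer can work with log_success directly and recover the same parameters w.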
