
MAKING SUBSTITUTION MATRICES METRIC

Master’s thesis

Jarle Anfinsen

June 2005

NORGES TEKNISK-NATURVITENSKAPELIGE UNIVERSITET FAKULTET FOR INFORMASJONSTEKNOLOGI, MATEMATIKK OG ELEKTROTEKNIKK

Abstract

With the emergence and growth of large databases of information, efficient methods for storage and processing are becoming increasingly important. The existence of a metric distance measure between data entities enables efficient index structures to be applied when storing the data. Unfortunately, this is often not the case. Amino acid substitution matrices, which are used to estimate similarities between proteins, do not yield metric distance measures. Finding efficient methods for converting a non-metric matrix into a metric one is therefore highly desirable. In this work, the problem of finding such conversions is approached by embedding the data contained in the non-metric matrix into a metric space. The embedding is optimized according to a quality measure which takes the original data into account, and a metric distance matrix is then derived using the metric distance function of the space. More specifically, an evolutionary scheme is proposed for constructing such an embedding. The work shows how a coevolutionary algorithm can be used to find a spatial embedding and a metric distance function which try to preserve as much of the proximity structure of the non-metric matrix as possible. The evolutionary scheme is compared to three existing embedding algorithms. Some modifications to the existing algorithms are proposed, with the purpose of handling the data in the non-metric matrix more efficiently. At a higher level, the strategy of deriving a metric distance function from a spatial embedding is compared to an existing algorithm which enforces metricity by manipulating the data in the non-metric matrix directly (the triangle fixing algorithm). The methods presented and compared are general in the sense that they can be applied in any case where a non-metric matrix must be converted into a metric one, regardless of how the data in the non-metric matrix was originally derived. The proposed methods are tested empirically on amino acid substitution matrices, and the derived metric matrices are used to search for similarity in a database of proteins. The results show that the embedding approach outperforms the triangle fixing approach when applied to matrices from the PAM family. Moreover, the evolutionary embedding algorithms perform best among the embedding algorithms. In the case of the PAM250 scoring matrix, a metric distance matrix is found which is more sensitive than the mPAM250 matrix presented in a recent paper. Possible advantages of choosing one method over another are shown to be unclear in the case of matrices from the BLOSUM family.

Preface

This thesis is submitted to the Norwegian University of Science and Technology in fulfillment of the requirements for the degree Master of Science. The work has been conducted at the Department of Computer and Information Science, from January 2005 to June 2005.

Acknowledgments

First of all I would like to thank my supervisors, Assoc. Prof. Magnus Lie Hetland and Prof. Finn Drabløs, for tutoring me through the project. Their suggestions have been invaluable, and the steady stream of background literature they provided me with made it easy for me to get the project started. I would also like to thank Geir Kjetil Sandve, Petter Braute and Jorg Rødsjø for contributing valuable input and ideas, and Vassilis Athitsos for answering questions regarding the BoostMap algorithm. A final thanks goes to friends and fellow MSc candidates for many a good time at Pia’s Café after a long day in front of the computer.

Contents

Abstract

Preface

1 Introduction
1.1 Motivation and problem definition
1.2 Objectives
1.3 Contributions
1.4 Structure of the thesis
1.5 Notation

2 Background
2.1 Sequences in computational biology
2.1.1 Genetic sequences
2.1.2 Proteins - sequences of amino acids
2.2 Computing similarity between symbol sequences
2.2.1 Terminology and concepts
2.2.2 Simple edit distance
2.2.3 The Smith-Waterman local alignment algorithm
2.2.4 Searching in large databases
2.3 Substitution matrices
2.3.1 Some simple substitution matrices
2.3.2 Amino acid substitution matrices
2.3.3 Similarity versus dissimilarity
2.4 Metric spaces and distance functions
2.4.1 Formal definitions
2.4.2 Some metric and non-metric distance functions
2.4.3 Measuring metricity
2.5 Data indexing
2.5.1 Types of queries
2.5.2 Utilizing the triangle inequality
2.5.3 Some illustrative index structures for range queries
2.5.4 Index structures for nearest neighbour queries
2.5.5 The challenge of non-metricity
2.6 Related work

3 Methods and solutions
3.1 The triangle-fixing algorithm
3.2 Deriving metrics from spatial embeddings
3.3 Multi-dimensional scaling
3.3.1 Metric and non-metric multidimensional scaling
3.3.2 NMDS with proposed modification
3.4 FastMap
3.5 BoostMap
3.5.1 AdaBoost - Adaptive Boosting
3.5.2 BoostMap - using AdaBoost to produce embeddings
3.5.3 Proposed BoostMap modifications
3.6 Proposed genetic embedding algorithms
3.6.1 A brief introduction to genetic algorithms
3.6.2 A preliminary genetic algorithm
3.6.3 Algorithm using fixed metric distance function
3.6.4 Algorithm using evolving metric distance function
3.7 Summary of proposed methods and algorithms

4 Experiments and results
4.1 Evaluating generated substitution matrices
4.1.1 Method of evaluation
4.1.2 Dataset
4.1.3 Performance measure
4.1.4 Matrices
4.2 Homology search results
4.2.1 Matrices derived from PAM70
4.2.2 Matrices derived from PAM120
4.2.3 Matrices derived from PAM250
4.2.4 Matrices derived from BLOSUM80
4.2.5 Matrices derived from BLOSUM62
4.2.6 Matrices derived from BLOSUM40
4.3 Test for statistical significance

5 Analysis and discussion
5.1 Analytical comparison of methods
5.1.1 Analysis
5.1.2 Comparison and discussion
5.2 Homology search results
5.2.1 PAM matrices
5.2.2 BLOSUM matrices
5.3 Synthesis and summary

6 Conclusions and future work

A Derivation of PAM and BLOSUM matrices
A.1 PAM matrix derivation
A.2 BLOSUM matrix derivation
A.3 A brief comparison of PAM and BLOSUM
A.4 Other amino acid substitution matrices

B CCA-PAM70 matrix

C GA-PAM250 matrix

Bibliography

List of Figures

2.1 Secretory protein (SSp120p) in Saccharomyces cerevisiae
2.2 Example of a global alignment
2.3 Example of a local alignment
2.4 Edit distance between two character strings
2.5 Identity score matrix
2.6 a) Similarity matrix b) Dissimilarity matrix
2.7 PAM250 score matrix
2.8 An example of a distance matrix
2.9 a) Two first steps of a BKT b) First step c) Second step
2.10 a) First step of the construction of a VPT b) Final tree

3.1 Notation used in the triangle fixing algorithm
3.2 Illustration of the cosine law
3.3 Initial distance matrix (D) and final embedding (X)
3.4 a) Two-dimensional points b) M1 embedding of two-dimensional points c) Calculation of M2 embedding of a point
3.5 Evolutionary process of a generic genetic algorithm
3.6 Mutation operator can guide the search out of local extremes

4.1 Relation between sequence divergence and substitution matrices
4.2 Range of difference from PAM70 ROC50 score
4.3 Range of difference from PAM120 ROC50 score
4.4 Plot of sorted ROC50 scores for PAM70 and CCA-PAM70
4.5 Plot of sorted ROC50 scores for PAM120 and CCA-PAM120
4.6 Range of difference from PAM250 ROC50 scores
4.7 Range of difference from BLOSUM80 ROC50 scores
4.8 Plot of sorted ROC50 scores for PAM250, mPAM250 and GA-PAM250
4.9 Plot of sorted ROC50 scores for BLOSUM80, TF-BLOSUM80 and BM-BLOSUM80
4.10 Range of difference from BLOSUM62 ROC50 scores
4.11 Range of difference from BLOSUM40 ROC50 scores
4.12 Plot of sorted ROC50 scores for BLOSUM40, GA-BLOSUM40 and BM-BLOSUM40
4.13 Plot of sorted ROC50 scores for BLOSUM62, TF-BLOSUM62 and BM-BLOSUM62

5.1 ROC curves for PAM250, mPAM250 and GA-PAM250

A.1 PAM250 score matrix
A.2 Example of a BLOCK of four proteins
A.3 BLOSUM62 score matrix

B.1 Derived CCA-PAM70 matrix, scaled by 0.05

C.1 Derived GA-PAM250 matrix, scaled by 0.1

Chapter 1

Introduction

With the emergence and growth of large databases of information, efficient methods for storage and processing are becoming increasingly important. In the field of bioinformatics, for example, large amounts of proteins are being encoded and stored as strings of symbols. When studying a specific protein, scientists need fast methods of finding proteins in existing databases which are similar to the one at hand. This enables them to gain knowledge of how proteins are related to each other in terms of evolutionary relationships. Similar search methods are required in other scenarios, for example when processing large databases of textual documents. In general, the problem of finding such methods is relevant in all cases where one is dealing with databases where inter-object distances or similarities are defined.

Because of the size and growth rate of current databases, an uninformed sequential scan of the database is not feasible in most cases. As with many other computational challenges, the demands of the problem can be eased by introducing a trade-off between speed and sensitivity (which, informally, can be described as the ability to identify all of the database objects which are similar in some sense to the one at hand). When processing protein databases, heuristic search algorithms are most commonly used. Such algorithms exploit the structures of the inter-object relationships to exclude from further consideration parts of the search space which are not likely to contain candidate solutions. Of course, this introduces a risk of excluding true positive hits from the solution, because the entire search space is not examined.

Instead of resorting to heuristic search methods, one might try to structure the data in a way which enables large subsets of objects to be excluded at an early stage of the search process. The central question is: Can the properties of the inter-object distances be used to identify such subsets? If one object from the subset can be excluded from further consideration, we would like the inter-object distance function to ensure that this holds for all the other objects in the subset as well. This way of structuring the data falls under the category of indexing methods. Such methods generally require the distance function to be metric. An introduction to metric spaces and distance functions will be given in chapter 2, but for now the following informal definition can be used:

• No inter-object distances can be negative.

• Given two objects X and Y, the distance from X to Y should be equal to the distance from Y to X.

• Given three objects X, Y and Z, the distance from X to Y via Z should not be shorter than the direct distance from X to Y.

Unfortunately, these conditions do not hold in many cases. In bioinformatics, the functions which describe the relationships between proteins are not metric. It is therefore highly desirable to find efficient methods for making the functions metric. The metric distance function should produce search results which are as similar as possible to search results obtained using the original distance function. This thesis addresses the problem of converting a non-metric distance function into a metric one. Several methods and algorithms are proposed and compared. The following sections elaborate on the motivation and objectives of the work.

1.1 Motivation and problem definition

Although the methods proposed and investigated in this work are intended to be general to the problem of finding metric mappings for non-metric distance functions, the work has mainly been inspired by the specific problem of non-metricity in bioinformatics. Functions which are used to compute the similarity between two proteins are specified as substitution matrices. As will be presented in depth in chapter 2, proteins can be represented as strings of symbols from a finite alphabet. The substitution matrices specify the costs of replacing one symbol by another, and are used to calculate the total similarity between two such strings. There exists no straightforward way of converting such a substitution matrix into a metric distance matrix while still preserving the sensitivity in relation to database queries. Assuming that the sensitivity is governed by the information contained in the substitution matrix, the problem can be stated as: Find a method for generating a metric distance matrix which preserves as much of the information from the original matrix as possible. As the methods are intended to be general, they should be based on processing the data in substitution matrices without regard to the methods originally used to derive the matrices.

1.2 Objectives

The work aims at proposing and comparing a number of different methods for generating a metric distance matrix from a non-metric one. The methods should be presented and evaluated both analytically and empirically. More specifically, the empirical analysis should be based on protein substitution matrices, and search results should be compared with search results obtained using the original matrices. Finally, the work should give a conclusion as to whether the proposed methods and the generated metric distance matrices could be used to organize proteins in index structures. The design and implementation of such protein index structures will not be considered in the work.

1.3 Contributions

The work presents the idea of embedding the data of a non-metric substitution matrix into a metric space. A metric distance matrix can then be derived directly from this space. Furthermore, the work shows that the fitness of such metric embeddings can be evaluated by comparing ranked inter-object distances in the embedding with ranked inter-object similarities in the original matrix. Several existing algorithms are suggested for the purpose of mapping elements into a metric space. The work compares these algorithms both analytically and empirically in terms of their ability to retain the information inherent in the input substitution matrix. In addition, the mapping algorithms are compared to an algorithm which enforces metricity of a non-metric matrix explicitly by manipulating the matrix values directly. Two evolutionary embedding algorithms are proposed as possible solutions to the metric mapping problem. Their designs are inspired by possible limitations of existing algorithms. The methods and algorithms presented are general and not tied to the case of protein substitution matrices. This means that they can be applied to any such matrix, regardless of how the data in it was originally calculated or derived.

1.4 Structure of the thesis

Chapter 2 presents background theory which is necessary to understand the problem at hand and how it can be solved. Since the methods are tested on protein substitution matrices, the material in this chapter will be biased towards proteins and protein databases. Chapter 3 presents the proposed methods and the concrete algorithms which are used to implement them. It suggests possible modifications to existing algorithms, and proposes two new evolutionary algorithms for solving the problem. Chapter 4 explains how experiments were carried out to evaluate the metric matrices produced by the algorithms. Experimental results are also presented in this chapter. Chapter 5 provides an analytical comparison of the methods and algorithms, a discussion of the experimental results and a synthesis of analytical and empirical results. Finally, chapter 6 recapitulates and concludes the thesis.

1.5 Notation

Some comments must be made regarding the notation used in the remaining chapters of the thesis:

• Normal fonts are used when referring to algorithms and methods in their general and/or original form. Small capital letters are used when referring to an explicit specification in this report. For example, BoostMap refers to the BoostMap algorithm in its general and original form, while BOOSTMAP refers to the specific pseudocode presented in section 3.5. Some small procedures referred to in small capital letters inside algorithms are explained in words in the text rather than in pseudocode.

• In algorithms, blocks of statements are specified by indentation. As an example, consider the following lines of pseudocode:

1: if condition met then
2:     Statement 1
3:     Statement 2
4: Statement 3

Because statements 1 and 2 are indented, they both belong to the body of the if test, and are therefore only run if the condition is met. Statement 3, in contrast, is run regardless of whether the condition is met or not.

• A notation where numerous variable assignments can occur on a single line in an algorithm is used. The ordering of symbols on either side of the assignment operator determines the assignments. For example, line 1 below is equivalent to lines 2 and 3.

1: X, Y ← A, B
2: X ← A
3: Y ← B

Chapter 2

Background

This chapter presents background theory which is necessary in order to understand the context of the problem and the advantages of metric distance functions. The sections are organized in a natural order, where transitions are as smooth as possible. Section 2.1 gives an overview of how strings of symbols are used to represent genetic sequences and proteins. In section 2.2, methods for computing the similarity between two such strings of symbols are introduced. The connection to distance functions is made in section 2.3, which introduces and presents amino acid substitution matrices. Section 2.4 defines and describes metric spaces and distance functions, while section 2.5 shows how the properties of such spaces can be used to organize data in efficient index structures. This latter section also discusses some of the challenges of non-metric distance functions and what characteristics a mapping from a non-metric to a metric distance function should have. Finally, section 2.6 reviews some recent related works.

2.1 Sequences in computational biology

As mentioned, the methods proposed and analyzed in this work for converting non-metric distance functions to metric ones have been tested in the field of bioinformatics. More specifically, amino acid substitution matrices are subjected to the proposed methods, and the resulting distance functions are tested by searching in a database of proteins. Thus, many methods and concepts in this thesis will be illustrated using examples from bioinformatics, which in turn requires a brief introduction to some important concepts to be given. The introduction is concerned with data processing in the field and computational challenges which lie therein. A more thorough introduction to the background of computational biology can be found in a book like [Wat95].

2.1.1 Genetic sequences

A genetic sequence, sometimes also called a DNA sequence, is a string of symbols from an alphabet consisting of the four nucleotide bases of DNA - adenine, cytosine, guanine and thymine. The symbols A, C, G and T are commonly used to represent these nucleotide bases. They encode the chromosomes of an organism, which contain genes and other nucleotide subsequences. A complete DNA sequence of the chromosomes of an organism is called the genome of the organism. The Human Genome Project¹, started in 1986, aimed at mapping the human genome at nucleotide level. Another goal was to identify all genes present in the human genome. The project was completed two years ahead of schedule in April 2003. The genetic code is a set of rules for mapping DNA sequences (genes) to proteins. It determines the structure of the proteins in a process called protein synthesis. Proteins are composed of amino acids, and a combination of three nucleotide bases in a genetic sequence codes a single amino acid. Although there are a total of 64 possible combinations of three nucleotide bases, only 61 are used in the synthesis. In addition, several different triples may code the same amino acid.

2.1.2 Proteins - sequences of amino acids

Proteins are molecular structures in the form of chains of amino acids. A total of 20 amino acids are encoded by the genetic code. These are called proteinogenic or standard amino acids. Thus, a protein may be described as a string of symbols from an alphabet of 20 symbols. The amino acids, together with the symbols commonly used to represent them, are shown in table 2.1.

Table 2.1: The 20 amino acids

Abbr.  Name             Abbr.  Name
A      Alanine          L      Leucine
R      Arginine         K      Lysine
N      Asparagine       M      Methionine
D      Aspartic acid    F      Phenylalanine
C      Cysteine         P      Proline
Q      Glutamine        S      Serine
E      Glutamic acid    T      Threonine
G      Glycine          W      Tryptophan
H      Histidine        Y      Tyrosine
I      Isoleucine       V      Valine

¹ http://gdbwww.gdb.org/

Figure 2.1: Secretory protein (SSp120p) in Saccharomyces cerevisiae

An example of a protein represented as a string of amino acid symbols is shown in figure 2.1. The protein shown is the secretory protein in the yeast Saccharomyces cerevisiae, which is commonly known as baker’s or budding yeast. Since the human genome project was completed in 2003, focus has shifted towards finding functional descriptions of the more than 30,000 proteins encoded by the human genome. This is commonly referred to as the human proteome project. So far, only a small fraction of the proteins has been mapped. Since almost everything in the body involves or is made from proteins, it is very desirable to gain knowledge of how they function and relate to each other. Such an understanding could be the first step towards finding cures for diseases which involve mapped proteins. The amount of data available, and the computational demands of the methods used to compare different proteins, makes this a very challenging task. Much research activity has therefore recently focused on finding more efficient ways of handling the data at hand. The existence of a metric function specifying the distances between different proteins would enable us to store the data in efficient index structures. Metric spaces and distance functions are presented in section 2.4, and an overview of data indexing based on the properties of such spaces is given in section 2.5.

2.2 Computing similarity between symbol sequences

Many problems involve the calculation of similarity between strings of symbols. Examples of such problems include searching for similar patterns in time series, finding replacement words in a spelling corrector, and identifying similar biological sequences like the ones presented in the previous section. Searching for similar patterns in time series can include any kind of string, for example a sequence of discrete signals sampled at a uniform rate. A spelling corrector would obviously work on strings of characters from an alphabet, such as the Latin one. Identifying similar biological sequences involves analysis of strings from an alphabet consisting of biological base units. Table 2.1 presents such an alphabet, where each amino acid is associated with a symbol. This section introduces some important concepts and algorithms necessary to understand how such sequence similarity may be calculated. Subsection 2.2.1 introduces some important concepts and terminology used when describing such calculations. The remaining subsections describe how the calculations may be done.

Figure 2.2: Example of a global alignment

2.2.1 Terminology and concepts

The process of searching for similarity between strings of symbols is often described as an alignment of the two sequences. The actual result of the calculation is called the score of the alignment. There are two main classes of alignments: global alignments and local alignments.

Global alignments treat the two input sequences as potentially equivalent. The alignment process aims to identify conserved regions and regions of difference in the sequences. An alignment can be illustrated by writing the sequences on separate lines and marking the matching symbols. Figure 2.2 shows an example, where the global alignment between the two genetic sequences TTGACACCCTCCCAATTGTA and ACCCCAGGCTTTACACAT is illustrated (adapted from [GD91]). An example application of global alignments is the comparison of genes or proteins with similar functionality. Well known algorithms for computing global alignments include the simple edit distance algorithm presented in section 2.2.2, and the Needleman-Wunsch algorithm introduced in [NW70]. Since this report is mainly concerned with local alignments (to be explained below), the latter algorithm is not presented here. The simple edit distance algorithm is presented as a general example of how dynamic programming can be used to estimate the similarity between two sequences.

In contrast to global alignments, local alignments aim to find the two best matching subsequences. An example of such an alignment is shown in figure 2.3. Local alignments are used when searching for conserved regions in two proteins, or when searching for local similarities in large sequences in general. A well known algorithm for computing the local alignment of two sequences is the Smith-Waterman dynamic programming algorithm. This algorithm is presented in section 2.2.3.

The process of computing alignments between sequences is often called homology search. Such searches are inherently computationally expensive, since the computation of alignments is an order of magnitude more complex than a linear scan to check for an exact match between sequences. Alignment algorithms employ scored matching, where different scores are given for the preservation of different symbols in the alphabet under consideration, and penalties are given for gaps in the alignment.

Figure 2.3: Example of a local alignment

The resulting alignment is the one with the highest score. Because of the computational complexity, finding the best matches between a query sequence and a large database by computing the alignment with all of the stored sequences is often not feasible in practice. Therefore, a number of commonly used heuristic alignment algorithms have been developed. Although this work is not concerned with heuristic high-performance alignment algorithms, brief descriptions of some of these are given in section 2.2.4.

2.2.2 Simple edit distance

One of the simplest measures of similarity between two sequences of symbols is the edit distance. Given the costs of deleting a symbol, inserting a symbol and replacing a symbol by another, this distance represents the minimal cost of transforming one sequence into the other. Such a simple scheme is rarely used in practice, but it serves as a basic example of how dynamic programming is usually used in most alignment algorithms. Three cost constants are given; del represents the cost of deleting a symbol, ins represents the cost of inserting a symbol and sub represents the cost of replacing a symbol with another symbol. Usually, sub is chosen to be less than the sum of del and ins. Representing the edit distance from the first i symbols of sequence s to the first j symbols of sequence t by $d_{i,j}$, equations 2.1 to 2.4 can be used to calculate the total distance between the two sequences.

\[ d_{0,0} = 0 \tag{2.1} \]
\[ d_{i,0} = d_{i-1,0} + del \tag{2.2} \]
\[ d_{0,j} = d_{0,j-1} + ins \tag{2.3} \]
\[
d_{i,j} = \min
\begin{cases}
d_{i-1,j} + del \\
d_{i,j-1} + ins \\
d_{i-1,j-1} & \text{if } s_i = t_j \\
d_{i-1,j-1} + sub & \text{if } s_i \neq t_j
\end{cases}
\tag{2.4}
\]

Equation 2.2 initializes the distances from the first i symbols of s to an empty sequence. For example, i = 3 gives the distance from the first three symbols of s to an empty sequence, which corresponds to three deletions.

Figure 2.4: Edit distance between two character strings

Equation 2.3 initializes the distances from an empty string to the first j symbols of t. This corresponds to j insertions. Equation 2.4 gives the general recurrence relationship from which all other distances are calculated. This relationship states that the cost of transforming $[s_1, s_i]$ to $[t_1, t_j]$ can be seen as the minimum cost of transforming

1. $[s_1, s_{i-1}]$ to $[t_1, t_j]$ and deleting a symbol from s.

2. $[s_1, s_i]$ to $[t_1, t_{j-1}]$ and inserting a symbol into s.

3. $[s_1, s_{i-1}]$ to $[t_1, t_{j-1}]$, if $s_i$ and $t_j$ are equal.

4. $[s_1, s_{i-1}]$ to $[t_1, t_{j-1}]$ and replacing $s_i$ by $t_j$, if these two symbols are unequal.

It is quite clear that this recursive relationship has both overlapping subproblems and optimal substructure, and that dynamic programming can therefore be applied to solve it. The memoization process can be seen as the process of filling in the values of a two-dimensional array where the rows correspond to the symbols of s and the columns correspond to the symbols of t. An example of such an array is given in figure 2.4. For each cell in the array, three neighbouring cells must be examined to determine its value. Therefore, it is obvious that the complexity of the algorithm is $\Theta(mn)$, where m is the length of s and n is the length of t. The total edit distance between the two sequences is found as $d_{m,n}$.
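The recurrence translates directly into a short dynamic-programming routine. The following Python sketch is illustrative and not part of the thesis; the unit values for del, ins and sub are assumed example costs.

def edit_distance(s, t, cost_del=1, cost_ins=1, cost_sub=1):
    # d[i][j] = edit distance from the first i symbols of s to the first j symbols of t
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + cost_del               # equation 2.2
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + cost_ins               # equation 2.3
    for i in range(1, m + 1):
        for j in range(1, n + 1):                      # equation 2.4
            diag = d[i - 1][j - 1]
            if s[i - 1] != t[j - 1]:
                diag += cost_sub
            d[i][j] = min(d[i - 1][j] + cost_del,
                          d[i][j - 1] + cost_ins,
                          diag)
    return d[m][n]

print(edit_distance("industry", "interest"))           # 6 with unit costs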

2.2.3 The Smith-Waterman local alignment algorithm

A widely used method for identifying common subsequences in two strings of symbols is the Smith-Waterman algorithm introduced in [SW81]. This algorithm utilizes dynamic programming to find the pair of subsequences, one from each string, with the highest degree of homology.

To describe the algorithm, let $S = [s_1, s_2, \ldots, s_n]$ denote the first sequence and $T = [t_1, t_2, \ldots, t_m]$ the second one. A substitution matrix $S_{i,j}$ is assumed to be known. This matrix gives a measure of similarity for each pair of symbols in the alphabet. For example, in the case of homology searching in proteins S would be a 20 × 20 matrix. A similarity matrix H of size n × m is set up. An element (i, j) of this matrix represents the highest similarity observed between two subsequences ending in $s_i$ and $t_j$. The matrix is initialized by setting

\[ H_{k,0} = H_{0,l} = 0 \quad \text{for } 0 \le k \le n \text{ and } 0 \le l \le m \tag{2.5} \]

Thereafter, the remaining matrix elements (1 ≤ i ≤ n and 1 ≤ j ≤ m) are calculated using dynamic programming on the recurrence relationship

\[
H_{i,j} = \max
\begin{cases}
H_{i-1,j-1} + S_{s_i,t_j}, \\
\max_{l \ge 1} \{ H_{i,j-l} - W_l \}, \\
\max_{k \ge 1} \{ H_{i-k,j} - W_k \}, \\
0
\end{cases}
\tag{2.6}
\]

where $W_k$ represents the cost of a deletion of length k. As can be seen, there are four possibilities at each position (i, j) of the similarity matrix. Either $s_i$ and $t_j$ are associated (and thus part of the subsequence of highest similarity observed so far), $s_i$ is at the end of a deletion of length k, $t_j$ is at the end of a deletion of length l, or no similarity has been observed so far in this subsequence. The last case (zero similarity observed so far) is enforced because we do not allow negative similarity values for subsequences.

The straightforward Smith-Waterman algorithm is clearly $O(n^2 m)$ if n > m or $O(m^2 n)$ if n < m. However, [Got82] introduces some modifications to the original algorithm which have the effect of reducing the computational complexity to O(nm). In his work, Gotoh introduces two additional matrices P and Q. $P_{i,j}$ holds the score of the highest scoring alignment between $[s_1, s_i]$ and $[t_1, t_j]$, given that the alignment ends with $s_i$ aligned to an insertion or a deletion. Similarly, $Q_{i,j}$ holds the score of the highest scoring alignment between $[s_1, s_i]$ and $[t_1, t_j]$, given that the alignment ends with an insertion or a deletion aligned to $t_j$. The recursive relation assumes that we are given an affine penalty model, i.e. a model which can be written as

\[ W_k = o + (k - 1)e \tag{2.7} \]

where k is the length of the gap, o is the penalty for opening a new gap and e is the penalty for extending an existing gap. Using this affine gap model, Gotoh shows that the recurrence relationship can be rewritten as

\[
H_{i,j} = \max
\begin{cases}
H_{i-1,j-1} + S_{s_i,t_j}, \\
P_{i-1,j-1} + S_{s_i,t_j}, \\
Q_{i-1,j-1} + S_{s_i,t_j}
\end{cases}
\tag{2.8}
\]

\[ P_{i,j} = \max \{ H_{i-1,j} - o, \; P_{i-1,j} - e \} \tag{2.9} \]

\[ Q_{i,j} = \max \{ H_{i,j-1} - o, \; Q_{i,j-1} - e \} \tag{2.10} \]

The derivations of $P_{i,j}$ and $Q_{i,j}$ are based on the original recurrence relationship in equation 2.6. If $P_{i,j}$ is chosen to represent $\max_{1 \le k \le i} \{ H_{i-k,j} - o - (k-1)e \}$, it can be rewritten as

\[
\begin{aligned}
P_{i,j} &= \max_{1 \le k \le i} \{ H_{i-k,j} - o - (k - 1)e \} \\
&= \max \{ H_{i-1,j} - o, \; \max_{2 \le k \le i} \{ H_{i-k,j} - o - (k - 1)e \} \} \\
&= \max \{ H_{i-1,j} - o, \; \max_{1 \le k \le i-1} \{ H_{i-1-k,j} - o - ke \} \} \\
&= \max \{ H_{i-1,j} - o, \; P_{i-1,j} - e \}
\end{aligned}
\tag{2.11}
\]

$Q_{i,j}$ can be rewritten in the same way. Thus, the cost of the most similar subsequences can be computed in O(nm) time if $W_k = o + (k - 1)e$, by storing the $P_{i,j}$ and $Q_{i,j}$ values along the way when computing the values of the similarity matrix H. The final score is found as the maximum local alignment score observed while iterating through the algorithm.
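To make the affine-gap formulation concrete, the Python sketch below implements equations 2.8 to 2.10, keeping the zero floor of equation 2.6 so that the alignment remains local. It is not taken from the thesis; the toy match/mismatch scoring function and the gap parameters o and e are illustrative assumptions.

def smith_waterman_affine(s, t, score, o=10, e=1):
    # Local alignment score with affine gap penalty W_k = o + (k - 1)e (Gotoh).
    n, m = len(s), len(t)
    NEG = float("-inf")
    H = [[0.0] * (m + 1) for _ in range(n + 1)]
    P = [[NEG] * (m + 1) for _ in range(n + 1)]        # alignments ending with s_i against a gap
    Q = [[NEG] * (m + 1) for _ in range(n + 1)]        # alignments ending with a gap against t_j
    best = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            P[i][j] = max(H[i - 1][j] - o, P[i - 1][j] - e)      # equation 2.9
            Q[i][j] = max(H[i][j - 1] - o, Q[i][j - 1] - e)      # equation 2.10
            diag = score(s[i - 1], t[j - 1])
            H[i][j] = max(H[i - 1][j - 1] + diag,                # equation 2.8, with the
                          P[i - 1][j - 1] + diag,                # zero floor of equation 2.6
                          Q[i - 1][j - 1] + diag,
                          0.0)
            best = max(best, H[i][j])
    return best

# Toy scoring function: +5 for a match, -4 for a mismatch (cf. figure 2.6a).
toy_score = lambda a, b: 5 if a == b else -4
print(smith_waterman_affine("HEAGAWGHEE", "PAWHEAE", toy_score))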

2.2.4 Searching in large databases

In practice, aligning a query protein with all proteins in a database using the Smith-Waterman algorithm is not feasible. Many heuristic search algorithms have been developed for easing the computational demands of such linear database scans. Among these, BLAST and FASTA are by far the most commonly used ones. As such heuristic search algorithms are not directly linked to this work, none of them will be presented in detail here. However, brief descriptions of the two mentioned algorithms are included as examples of methods which are commonly used at the present time.

BLAST (Basic Local Alignment Search Tool) was introduced by Altschul et al. in [AGM+90]. It works by breaking the query sequence and all database sequences into fragments of a specified length. The query fragments are then matched with the database fragments. The pairs which score lower than a certain threshold are rejected. For the ones which are selected for further investigation, the alignments are extended in either direction in an attempt to construct a full alignment. 2.3 SUBSTITUTION MATRICES 13

FASTA was introduced by Lipman and Pearson in [LP85]. Like BLAST, FASTA works by attempting to match fragments of the query sequence and database sequences. The difference is that FASTA identifies exact matches. The 10 regions with the highest density of matches in a protein are then selected and re-scored using an appropriate substitution matrix. When the scores of regions exceed a certain threshold, the regions are joined to form a larger region with gaps. Finally, a variation of the Smith-Waterman algorithm is used to re-score the joined region.

2.3 Substitution matrices

As explained in the previous section, algorithms for computing alignments use some kind of scoring scheme to compute the cost of an alignment. Simple edit distance only considers penalties, whereas more sophisticated methods give scores to matching pairs of symbols in the two sequences to be aligned. The Smith-Waterman algorithm introduced in section 2.2.3, for example, requires the scores of all possible pairs from the alphabet of symbols to be specified. Such scores may be specified in the form of a matrix where the cells correspond to the different pairs of symbols. Scoring matrices are often called substitution matrices because a cell specifies the score or penalty obtained when substituting one symbol for another. This section gives an introduction to such substitution matrices. Subsection 2.3.1 provides some examples of basic substitution matrices. Subsection 2.3.2 gives an overview of substitution matrices used in bioinformatics when searching for similarity in proteins. Finally, 2.3.3 points out some issues which occur when attempting to convert a similarity matrix into a dissimilarity matrix.

2.3.1 Some simple substitution matrices

The most simple substitution matrix imaginable is the identity matrix. An example is shown in figure 2.5, where the alphabet consists of the four nucleotide bases of DNA. This scheme awards identical nucleobases in the same position in an alignment with a score of 1.0. Dissimilar nucleobases in the same position receive no award, but no penalty either. The same scheme can be defined for protein scoring by using the 20 × 20 identity matrix corresponding to the 400 possible pairs of amino acids. Two other simple substitution matrices are shown in figure 2.6. Figure 2.6a shows a matrix where identical nucleobases in the same position in an alignment are awarded a score of 5, whereas dissimilar nucleobases in the same position are penalized with a score of -4. Both this matrix and the identity matrix in figure 2.5 are examples of scoring schemes based on similarity; symbols in the same position in an alignment receive scores according to their degree of conformity.

Figure 2.5: Identity score matrix

In contrast, figure 2.6b shows an example of a dissimilarity based substitution scheme. The entries of the matrix correspond to distances between symbols rather than similarity values, so that symbols defined to be similar are close and dissimilar ones are farther apart. This means that we are interested in finding alignments of low dissimilarity instead of high similarity, and the algorithms used when dealing with similarity scores cannot be used unmodified. Such issues are covered in subsection 2.3.3.

Figure 2.6: a) Similarity matrix b) Dissimilarity matrix

2.3.2 Amino acid substitution matrices

A number of commonly used substitution matrices exist for searching in protein databases. Amino acid substitution matrices contain 400 entries, each giving a measure of the similarity or dissimilarity between a pair of amino acids. By far the most commonly used ones are the matrices from the PAM and BLOSUM families. PAM (Point Accepted Mutation) matrices were introduced by Dayhoff et al. in [DSO78]. Each PAM matrix is associated with a number which specifies the expected number of mutations per 100 amino acids. For example, the PAM250 matrix applies to time intervals of 250 mutations per 100 amino acids. In other words, the number associated with a PAM matrix specifies the degree of evolutionary divergence which it applies to. The PAM250 matrix is included as a general example of an amino acid substitution matrix in figure 2.7. The BLOSUM (BLOcks SUbstitution Matrix) family of matrices was introduced in [HH92]. These matrices are derived using another method and another dataset than the PAM model uses. BLOSUM matrices are also associated with numbers. Generally speaking, BLOSUM matrix numbers approximately match the percentage of identical amino acids in the aligned sequences (i.e. BLOSUM62 for sequences which are approximately 62% identical).

Figure 2.7: PAM250 score matrix

The details of how PAM and BLOSUM matrices are derived are not necessary to understand the algorithms and methods presented in this work. Nevertheless, detailed descriptions of the two models are included in appendix A for readers who want a basic understanding of the meaning of the data contained in the matrices. The appendix also contains a brief survey of other commonly used amino acid substitution matrices.

2.3.3 Similarity versus dissimilarity

In many cases, it is desirable to be able to convert similarity matrices like the ones presented in the previous section into dissimilarity (distance) matrices. As will be shown in the next section, metric spaces are based on the properties of the distance function associated with them. The first step towards the construction of a metric matrix from a substitution matrix will therefore often be such a conversion. If the diagonal elements of the similarity matrix are equal, the conversion simply involves subtracting each matrix cell value from this diagonal value to obtain a new cell value. Unfortunately, this is not the case with most amino acid substitution matrices. We would like the conversion to produce a (possibly non-metric) distance matrix with zeros on the diagonal. However, since the diagonal elements in the similarity matrix are not equal, we cannot simply subtract a value from each matrix cell. A conversion method introduced by Linial et al. in [LLTY97] and later generalized by Halperin et al. in [HBK+03] is shown in equation 2.12, where D is the distance matrix and S is the similarity matrix.

\[ D_{u,v} = S_{u,u} + S_{v,v} - 2 S_{u,v} \tag{2.12} \]

Originally proposed as a distance measure for subsequences of fixed length, Halperin et al. extend the definition to subsequences of arbitrary length, and use equation 2.12 to produce distance values from similarity scores. The equation is also adopted in this work in cases where such a conversion is needed. It is questionable whether the method is able to preserve all information from the similarity matrix. This will be elaborated on in section 3.5.3. As will become apparent in chapter 3, however, some of the proposed algorithms use the step only as an initial estimate – the final metric matrix is produced using data from the original similarity matrix.
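As a small illustration (not from the thesis), equation 2.12 can be applied cell by cell to a similarity matrix. The 3 × 3 matrix below is a made-up example.

def similarity_to_distance(S):
    # D[u][v] = S[u][u] + S[v][v] - 2 * S[u][v]   (equation 2.12)
    n = len(S)
    return [[S[u][u] + S[v][v] - 2 * S[u][v] for v in range(n)] for u in range(n)]

# Made-up 3 x 3 similarity matrix (symmetric, largest values on the diagonal).
S = [[5, 1, 0],
     [1, 4, 2],
     [0, 2, 6]]
print(similarity_to_distance(S))   # [[0, 7, 11], [7, 0, 6], [11, 6, 0]]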

2.4 Metric spaces and distance functions

The computational demands of many problems, including the search for similarity in a database of proteins, can be eased considerably when the similarity or dissimilarity measure to be used possesses the properties of a metric system. Unfortunately, this is often not the case. As the concern of this work is to propose and compare methods for deriving distance metrics from non-metric similarity or dissimilarity measures, this section introduces the concept of metric spaces and distance functions. Section 2.4.1 provides the necessary formal definitions, while section 2.4.2 gives some examples of metric and non-metric distance functions. Finally, section 2.4.3 discusses how metricity can be measured.

2.4.1 Formal definitions

Assume that we have a set S of elements and a function d which for all ordered pairs of elements (a, b) from S returns the distance d(a, b) from a to b. Some desirable properties of the function d are defined in equations 2.13 to 2.16.

\[ \forall a, b \in S : d(a, b) = d(b, a) \tag{2.13} \]
\[ \forall a, b \in S : d(a, b) = 0 \Leftrightarrow a = b \tag{2.14} \]
\[ \forall a, b \in S : d(a, b) \ge 0 \tag{2.15} \]
\[ \forall a, b, c \in S : d(a, b) \le d(a, c) + d(c, b) \tag{2.16} \]

Equation 2.13 requires that symmetry is satisfied for all distances between pairs of elements in S. Equation 2.14 requires all distances from elements of S to themselves to be of zero length. Equation 2.15 states that all distances returned by the function d should be non-negative. The triangle inequality is satisfied for function d if condition 2.16 holds. From these four conditions, some classes of spaces can be defined. First, define a space to be a pair (S, d) consisting of a set S of elements and a distance function $d : S \times S \to \mathbb{R}$. A space where the distance function satisfies conditions 2.13 and 2.14 is called a premetric space. A premetric space whose distance function also satisfies condition 2.15 is called a semimetric space. Metric spaces are spaces where the distance functions satisfy all four conditions. If the triangle inequality is strengthened to $d(a, b) \le \max\{d(a, c), d(c, b)\}$ for all a, b, c ∈ S, the space is called ultrametric. Since the distance function determines if a space is metric or not, one often talks about metric or non-metric distance functions instead of spaces. From a computational point of view, the properties of metric spaces are most desirable. This will become apparent in section 2.5, where the concepts of data indexing are introduced.
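The four conditions can be checked directly when the distance function is given as a matrix. The following Python sketch is illustrative only and is not part of the thesis; it assumes the elements of S are identified with the row and column indices of a square matrix D.

def is_metric(D, tol=1e-9):
    # Check conditions 2.13 to 2.16 for a distance matrix D.
    n = len(D)
    for a in range(n):
        for b in range(n):
            if abs(D[a][b] - D[b][a]) > tol:             # symmetry (2.13)
                return False
            if (a == b) != (abs(D[a][b]) <= tol):        # zero distance exactly when a = b (2.14)
                return False
            if D[a][b] < -tol:                           # non-negativity (2.15)
                return False
            for c in range(n):                           # triangle inequality (2.16)
                if D[a][b] > D[a][c] + D[c][b] + tol:
                    return False
    return True

print(is_metric([[0, 1, 2], [1, 0, 1], [2, 1, 0]]))      # True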

2.4.2 Some metric and non-metric distance functions

Many distance functions satisfy the conditions of metricity. The most common metric distance functions are contained in the set of $L_p$ norms, where $p \ge 1$ is required for metricity. These distance functions are defined as shown in equation 2.17, where $x_{ak}$ is the value of element a along the kth axis of the D dimensions in which the elements are embedded.

\[ d_p(a, b) = \left( \sum_{k=1}^{D} |x_{ak} - x_{bk}|^p \right)^{1/p} \tag{2.17} \]

The most commonly used $L_p$ norms are given by $p \in \{1, 2, \infty\}$. They are shown in equations 2.18, 2.19 and 2.20. Setting p = 1 gives a distance function which is often referred to as the Manhattan or city block distance. The equation reduces to the Euclidean distance function when setting p = 2.

\[ d_1(a, b) = \sum_{k=1}^{D} |x_{ak} - x_{bk}| \tag{2.18} \]
\[ d_2(a, b) = \sqrt{ \sum_{k=1}^{D} (x_{ak} - x_{bk})^2 } \tag{2.19} \]

\[ d_\infty(a, b) = \max_{1 \le k \le D} |x_{ak} - x_{bk}| \tag{2.20} \]
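For reference, the three norms can be computed as follows. This is a small illustrative sketch, not part of the thesis.

def lp_distance(a, b, p):
    # L_p distance between two points (equations 2.17 to 2.20).
    diffs = [abs(x - y) for x, y in zip(a, b)]
    if p == float("inf"):
        return max(diffs)
    return sum(diff ** p for diff in diffs) ** (1.0 / p)

a, b = (0.0, 3.0), (4.0, 0.0)
print(lp_distance(a, b, 1), lp_distance(a, b, 2), lp_distance(a, b, float("inf")))   # 7.0 5.0 4.0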

Many physical systems have distance functions which are not metric. For example, consider a distance function which specifies the travel time distance between cities. For a set of four cities, such a distance function could be represented as the distance matrix shown in figure 2.8. This function does not satisfy the triangle inequality, because d(C, D) > d(C, B) + d(B, D).

Figure 2.8: An example of a distance matrix

Also note that although this distance matrix is symmetric, this is not necessarily the case for distance measures such as travel time distance. We could perfectly well have a non-symmetric distance matrix representing travel time distance (e.g. the routes could include unidirectional paths). Another example of non-metric distance functions is the set of mutation probability matrices used when searching for similarity between protein sequences, including the PAM and BLOSUM families of matrices. Although these matrices contain similarity scores instead of distances, they are non-metric in the sense that there exists no straightforward way of converting the similarity scores to metric distance functions without losing information.

2.4.3 Measuring metricity

When evaluating a distance function, it is often desirable to measure its metricity. An exact formal definition of what is meant by metricity is hard to specify. Intuitively, one would like the metricity of a matrix to represent how close the matrix is to being metric. Thus, there are several possible ways of defining the metricity. Assume that the distance function is given in the form of a distance matrix, and that the distance matrix satisfies the conditions given in 2.13, 2.14 and 2.15. Perhaps the most straightforward measure of metricity would be a simple count of the number of triangle inequality violations. This measure does not, however, take into account how much the triangle inequalities are violated. The amount of violation can be seen as the difference between d(a, c) + d(c, b) and d(a, b) for a triangle (a, b, c). Intuitively, one would like a matrix where triangle inequalities are violated by a small amount to have a higher degree of metricity than matrices where the same triangle inequalities are violated by larger amounts. One possible measure of metricity can be derived from [CNBYM99], where a relaxation of the triangle inequality is suggested to enable indexing of data based on non-metric distance functions. Assume that we can find a constant α such that the triangle inequality shown in equation 2.21 holds for all triples (a, b, c) of elements. Then α could be adopted as a measure of metricity.

\[ d(a, b) \le \alpha \, [ d(a, c) + d(c, b) ] \tag{2.21} \]
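A minimal sketch (not from the thesis) of how the smallest constant α satisfying equation 2.21 for every triple could be found from a distance matrix:

def min_alpha(D):
    # Smallest alpha >= 1 with d(a, b) <= alpha * (d(a, c) + d(c, b)) for all triples (equation 2.21).
    # Starting at 1 because alpha < 1 would not be a relaxation of the triangle inequality.
    n = len(D)
    alpha = 1.0
    for a in range(n):
        for b in range(n):
            for c in range(n):
                if a == b or c in (a, b):
                    continue
                denom = D[a][c] + D[c][b]
                if denom > 0:
                    alpha = max(alpha, D[a][b] / denom)
    return alpha

A metric matrix gives α = 1; larger values indicate worse triangle inequality violations.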

The problem with using α as a metricity measure is that the constant must be chosen high enough to relax the worst triangle inequality violation, and it does not take into account how much the other triangle inequalities are violated. The metricity measure introduced in [GKM03] therefore seems like a better choice. This measure can be calculated as 1 − nm, where nm is calculated as given by equation 2.23. The value nm is a real number in the interval [0, 1] measuring the non-metricity of the distance function, i.e. nm = 0 for a metric matrix.

\[ d'(a, b) = \min \left\{ d(a, b), \; \min_{c \ne a, b} \{ d(a, c) + d(c, b) \} \right\} \tag{2.22} \]

\[ nm = \underset{a, b}{\operatorname{average}} \; \frac{d(a, b) - d'(a, b)}{d(a, b)} \tag{2.23} \]

If the distance matrix is metric, equation 2.22 gives $d'(a, b) = d(a, b)$ for all ordered pairs (a, b). But if the matrix is non-metric, one would find that for some of the distances in the matrix there exists a non-empty set M of elements such that $\forall c \{ c \in M \to d(a, c) + d(c, b) < d(a, b) \}$, i.e. the triangle inequality is violated for the triangle (a, b, c). For these distances, equation 2.22 gives $d'(a, b) = \min_{c \ne a, b} \{ d(a, c) + d(c, b) \}$, which increases the value of nm.
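The non-metricity measure itself is straightforward to compute. The sketch below is illustrative, not from the thesis, and assumes a distance matrix with a zero diagonal and at least three elements; pairs with zero distance are skipped to avoid division by zero.

def non_metricity(D):
    # Non-metricity nm of a distance matrix (equations 2.22 and 2.23).
    n = len(D)
    if n < 3:
        return 0.0
    total, count = 0.0, 0
    for a in range(n):
        for b in range(n):
            if a == b or D[a][b] == 0:
                continue
            detour = min(D[a][c] + D[c][b] for c in range(n) if c not in (a, b))
            d_prime = min(D[a][b], detour)               # equation 2.22
            total += (D[a][b] - d_prime) / D[a][b]       # one term of the average in 2.23
            count += 1
    return total / count if count else 0.0               # nm = 0 for a metric matrix

The metricity of the matrix is then obtained as 1 - non_metricity(D), as described above.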

2.5 Data indexing

As mentioned in the previous section, the properties of metric spaces can be used to efficiently index the data. If the distance function used to calculate the similarity or dissimilarity between entities of the database satisfies the triangle inequality, a number of well known indexing methods can be used. This enables queries to be performed in sublinear time. Section 2.2.4 provided an overview of common heuristics used to ease the computational demands of a linear database scan. The problem with such heuristic methods, however, is that the results returned in response to a query are not necessarily correct. In contrast, queries on an index always produce correct results if the data to be organized possesses the properties assumed by the structure. This section gives an informal overview of how the properties of a metric system can be used to organize data in efficient index structures.

2.5.1 Types of queries

Assume that we are given a query string of symbols, q, a database of stored strings, U, and a distance function d(q, u) which for each u ∈ U returns the distance between q and u. In [CNBYM99], Chávez et al. define three basic types of queries against U: Range queries, nearest neighbour queries and k nearest neighbours queries. Since the nearest neighbour query is the same as a k nearest neighbours query with k = 1, the basic types can be further generalized to the following two:

Range queries Find all elements in the database which are within a distance r to the query q. That is, find all u ∈ U such that d(q, u) ≤ r.

Nearest neighbours queries Find the k elements in the database which are clos- est to q. That is, find a set R ⊆ U such that |R| = k and ∀v∀w{v ∈ R ∧ w ∈ (U − R) → d(q, v) ≤ d(q, w)}.

Several additional restrictions may be imposed, such as specifying a maximum distance from the query string in a nearest neighbours query. It is clear that both range queries and nearest neighbour queries can be answered by comparing the query string with every string stored in the database. In general, the cost of evaluating the distance between two strings and the size of the database makes this approach unfeasible. The goal of indexing is to organize the strings of the database in such a way that it is not necessary to compare the query string to all of them. Such a structure, where a number of database strings can be discarded at each stage of the search process based on properties of the distance function, will be beneficial for both types of queries. Both of them seek to minimize the distance function over the database - the difference is merely that range queries keep the maximum distance fixed while nearest neighbours queries keep the size of the result set fixed.
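Stated as code, the two query types amount to the following linear scans over the database. This is an illustrative baseline only, not an index structure, and is not part of the thesis.

import heapq

def range_query(U, q, r, d):
    # All u in U with d(q, u) <= r.
    return [u for u in U if d(q, u) <= r]

def knn_query(U, q, k, d):
    # The k elements of U closest to q.
    return heapq.nsmallest(k, U, key=lambda u: d(q, u))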

2.5.2 Utilizing the triangle inequality

The idea to use the triangle inequality to exclude database elements which cannot satisfy a query was first introduced by Burkhard and Keller in [BK73]. Assume that we are interested in finding the database element which is closest to the query element q. At each step of the search algorithm, we keep track of the best match found so far, y. If a new element x is found at this step which satisfies d(q, x) < d(q, y), we set y ← x. Prior to the search process, a reference element u is chosen. When searching, distance evaluations between database elements and u are used together with the triangle inequality in two cutoff criteria:

First cutoff criterion From d(q, x) + d(x, u) ≥ d(q, u) we see that d(q, x) ≥ d(q, u) − d(x, u). Thus, if d(q, u) − d(x, u) is greater than the distance between q and the best match found so far, x can immediately be excluded.

Second cutoff criterion From d(q, x) + d(q, u) ≥ d(x, u) we see that d(q, x) ≥ d(x, u) − d(q, u). Thus, if d(x, u) − d(q, u) is greater than the distance between q and the best match found so far, x can immediately be excluded.

These two cutoff criteria can be combined into a joint cutoff criterion:

Joint cutoff criterion Element x can be excluded from further consideration if |d(x, u) − d(q, u)| is greater than the distance between q and the best match found so far.

The idea underlying many indexing structures is simply that, given a reference element u, distances between u and all other elements of the database can be precomputed. These precomputed values can be stored in a structure which enables us to exclude elements from consideration without having to compute the distance to the query element. Algorithm 1, adapted from [BFM+96], gives a simple example of such a scheme.

Algorithm 1 Simple search algorithm
1: procedure PREPROCESS(U, u)            ▷ Executed only once
2:     X ← ∅
3:     for all x ∈ U do
4:         X ← X ∪ d(u, x)
5:     X ← SORT(X)
6: end procedure

7: procedure QUERY(q, u)
8:     V ← X
9:     e ← d(u, q)
10:    t ← d(u, q)
11:    y ← u
12:    x ← BINARYSEARCH(V, t)            ▷ Find x with d(x, u) closest to t
13:    while |d(x, u) − d(q, u)| ≤ e do
14:        if d(q, x) < e then
15:            y ← x
16:            e ← d(q, x)
17:        V ← V − x
18:        x ← BINARYSEARCH(V, t)
19: end procedure                        ▷ Variable y now holds element closest to q

In this algorithm, U denotes the initial database of elements. The procedure PREPROCESS is called once to create a sorted list, X, of distances between the reference element u and every element in U. Thereafter, procedure QUERY can be used to search for the database element closest to query element q. As can be seen, variable x is initialized to the element which has d(x, u) closest to d(q, u). Because the list V is sorted, this can be found by a simple binary search. As long as the joint cutoff criterion does not hold, the algorithm proceeds by checking if x is the best match found so far. Then the element pointed to by x is subtracted from V, and x is updated to the database element with d(x, u) next closest to d(q, u). Actually, the latter step can be achieved by simply checking the neighbours of x in V, but a repeated application of the BINARYSEARCH function has been included in the algorithm for simplicity. Since |d(x, u) − d(q, u)| can only increase, a point will eventually be reached where all the remaining elements of V can be discarded because the joint cutoff criterion holds. Thus, algorithm 1 can be seen as a simple example of the general strategy to structure the data in a way which enables us to retrieve the correct answer without having to evaluate the distances between the query element and all database elements.
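The same idea can be rendered in Python. The sketch below is not from the thesis; it replaces the outward scan from the binary-search position in algorithm 1 with a simple pass over the precomputed list, but still uses the joint cutoff criterion to skip candidates without evaluating d(q, x).

def preprocess(U, u, d):
    # Precompute and sort the distances from the reference element u to every x in U.
    return sorted(((d(u, x), x) for x in U), key=lambda pair: pair[0])

def query(table, u, q, d):
    # Return the database element closest to q; the joint cutoff criterion skips
    # candidates without evaluating d(q, x).
    t = d(u, q)
    best, e = u, t
    for dist_xu, x in table:
        if abs(dist_xu - t) > e:
            continue
        dqx = d(q, x)
        if dqx < e:
            best, e = x, dqx
    return best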

2.5.3 Some illustrative index structures for range queries

Many methods for indexing metric spaces exist which are more efficient than the one presented in algorithm 1. This section includes some illustrative examples of common index structures. Readers interested in a more detailed introduction to searching in metric spaces should consult a more extensive survey like [CNBYM99].

BKT - Burkhard-Keller Tree

One of the first general methods for indexing data from metric spaces was presented by Burkhard and Keller in [BK73]. A Burkhard-Keller Tree (BKT) assumes a discrete distance function. Given a database U of elements, the first step of the indexing process is to select a reference element u. This first reference element represents the root of the BKT. Since the distance function is discrete, we have a finite set of distances, D, from u to all other elements of U. For each distance d ∈ D, we define U_d = {v ∈ U | d(v, u) = d}. Each set U_d represents a child node. The tree is completed by simply applying this method recursively on each U_d with more than one element.

Figure 2.9 shows the first two steps of the creation of a BKT. Element D is selected as the first reference element. The other elements are partitioned into four sets, where elements of each set are equally distant from D. Element E is then selected as the reference element for the set {E, G, H, I}, and the process is applied recursively. The corresponding first step of the BKT is shown in figure 2.9b, while figure 2.9c shows the second step.

Given a query element q and a range r, we check nodes to see if they can be excluded. When arriving at a node u in the tree, we know that this node and all its children nodes are at a distance d(p, u) from u’s parent node p. We then check if |d(p, q) − d(p, u)| > r. If this holds, u and the entire subtree below this node can immediately be excluded. To see this, note that d(u, q) ≥ d(p, q) − d(p, u) by the triangle inequality. Thus, if |d(p, q) − d(p, u)| > r, we know that d(u, q) > r.

Figure 2.9: a) Two first steps of a BKT b) First step c) Second step r. Because both u and all its children nodes are equally distant from p, this argument is used to exclude both u and its children from further consideration. For elements which are not excluded, we calculate d(q, u) and include element u in the result set if d(q, u) ≤ r. For a database of n elements, the construction complexity of BKTs is O(n log n). Query complexity is O(nα), where α < 1 (i.e. sublinear). Numerous improve- ments to BKTs have been suggested in the literature. In [BYCMW94], a modifi- cation of BKT called FQT (Fixed Query Trees) is presented. The difference from BKTs is that the same reference element is used at each tree level, regardless of whether it belongs to a subset or not. The database elements are all stored at the leaves of the tree. The authors show by empirical experiments that this approach outperforms BKTs by performing less distance evaluations. Another variant of BKT is FHQT (Fixed Height FQT), presented in [BYCMW94] and [BY97]. In this data structure, all leaves are stored at the same depth. In addition, the same reference element is used at each tree level. It is shown that by making the tree deeper than necessary in this way, more distance evaluations can be avoided.
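To make the pruning rule concrete, a minimal Python sketch of BKT construction and range querying is given below. It assumes a discrete distance function d supplied by the caller, chooses the first list element as each reference element, and is not intended as an efficient implementation.

def build_bkt(elements, d):
    # Pick the first element as reference and group the rest by their
    # (discrete) distance to it; recurse on each group.
    if not elements:
        return None
    root, rest = elements[0], elements[1:]
    groups = {}
    for x in rest:
        groups.setdefault(d(x, root), []).append(x)
    return (root, {dist: build_bkt(group, d) for dist, group in groups.items()})

def bkt_range_query(node, q, r, d):
    # Every element in the subtree stored under key `dist` lies exactly
    # `dist` away from the reference, so the whole subtree is pruned when
    # |d(q, root) - dist| > r.
    if node is None:
        return []
    root, children = node
    dq = d(q, root)
    result = [root] if dq <= r else []
    for dist, child in children.items():
        if abs(dq - dist) <= r:
            result += bkt_range_query(child, q, r, d)
    return result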

VPT - Vantage Point Tree

Data structures like BKT and FQT can only be used when the distance function is discrete, which is often not the case. Vantage-Point Trees (VPTs), presented in [Yia93] and [Chi94], are data structures applicable to continuous distance functions. They are built recursively in the following way. First, a reference element u is chosen as the root node. The distances from u to all other elements of the database U are then calculated, and the median distance m is found. Then the remaining elements of U are divided into two sets L = {p ∈ U | d(p, u) ≤ m} and R = {p ∈ U | d(p, u) > m}, and the procedure is applied recursively on those sets.

Figure 2.10: a) First step of the construction of a VPT b) Final tree

Figure 2.10a shows an example of the first step of the creation of a VPT. Element F is chosen as the root node. The median distance m (d(F, B) in this case) is found, and the remaining elements are partitioned into two sets. The final VPT is shown in figure 2.10b. Given a query q and a range r, the result set is found in the following way. Starting at the root node p, we first check if d(q, p) < r. If so, p is added to the result set. We then check if d(q, p) − r ≤ m. If this is the case, the procedure is applied recursively on the left subtree of p. If not, the entire left subtree can be discarded from further consideration. Similarly, if d(q, p) + r > m, we enter the right subtree. If not, the right subtree can be discarded. Note that we may have to check both the left and the right subtree. For very small ranges r, [Yia93] argues that query complexity on VPTs is O(log n). However, as r increases, the number of times we have to enter both the left and the right subtree also increases. So the argument that VPT query processing is O(log n) is actually only true when searching for exact matches (r = 0). As with BKT, numerous improvements have been suggested for VPT. In [BO97], the MVPT (Multi Vantage Point Tree) is introduced. An MVPT is an m-ary tree, where m > 2 (in contrast to the binary VPT). Empirical experiments show that MVPTs outperform VPTs in some cases. Other advantageous modifications of VPT can be found in [CNBYM99].
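For comparison with the BKT sketch above, the following Python sketch outlines VPT construction and range querying. It assumes a continuous distance function d supplied by the caller and uses the two intersection tests described in the text; names and structure are chosen freely for the example.

import statistics

def build_vpt(elements, d):
    if not elements:
        return None
    root, rest = elements[0], elements[1:]
    if not rest:
        return (root, 0.0, None, None)
    dists = [d(x, root) for x in rest]
    m = statistics.median(dists)
    left = [x for x, dx in zip(rest, dists) if dx <= m]     # inner partition
    right = [x for x, dx in zip(rest, dists) if dx > m]     # outer partition
    return (root, m, build_vpt(left, d), build_vpt(right, d))

def vpt_range_query(node, q, r, d):
    if node is None:
        return []
    root, m, left, right = node
    dq = d(q, root)
    result = [root] if dq <= r else []
    if dq - r <= m:                  # query ball may reach into the inner partition
        result += vpt_range_query(left, q, r, d)
    if dq + r > m:                   # ... and/or into the outer partition
        result += vpt_range_query(right, q, r, d)
    return result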

2.5.4 Index structures for nearest neighbour queries

The index structures presented in section 2.5.3 are designed for range queries. There exist several techniques for using these structures to answer nearest neighbour queries. The simplest possible scheme is to start with range r = 1 and increase r until the size of the result set is appropriate. A common way of increasing r is to start with r = a^i and, if the size of the result set is too small, set r = a^(i+1). For the first iteration we set i = 0. When a point is reached where the size of the result set is too big, the range is refined between a^(i−1) and a^i.

The problem with such schemes is that the complexity of the range queries grows as the range grows. Thus, several methods have been suggested for limiting both the number of range queries and the size of the largest range used in the search. Again, readers are referred to [CNBYM99] for an in-depth survey.
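A hedged sketch of the simple expanding-range strategy is shown below; range_query is assumed to be any range-query procedure returning a list of elements (such as the sketches above), and the refinement of the radius is simplified to a fixed number of bisection steps.

def nearest_by_expanding_range(range_query, q, d, a=2.0, refine_steps=4):
    # Grow r = a**i until the range query returns at least one element,
    # then refine the radius between a**(i-1) and a**i.
    i, hits = 0, []
    while not hits:
        hits = range_query(q, a ** i)
        i += 1
    lo = a ** (i - 2) if i > 1 else 0.0
    hi = a ** (i - 1)
    for _ in range(refine_steps):
        mid = (lo + hi) / 2.0
        trial = range_query(q, mid)
        if trial:
            hits, hi = trial, mid
        else:
            lo = mid
    return min(hits, key=lambda x: d(q, x))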

2.5.5 The challenge of non-metricity

The index structures presented in the previous section rely on the triangle inequality for excluding database elements from consideration without having to compute the distance to the query element. If the distance function is not metric, queries on such structures will not produce correct results. Some simple techniques have been suggested for dealing with distance functions where the triangle inequality does not hold. In [CNBYM99], it is suggested that if one can find constants α, β and δ such that the triangle inequality can be relaxed to d(x, y) ≤ αd(x, z) + βd(z, y) + δ, the algorithms designed for metric spaces can be modified to handle this new inequality. There are two problems with this approach. Firstly, we may not be able to specify such constants at a general level for the distance function at hand. Secondly, the number of exclusions we can make during query processing decreases rapidly as the constants increase. To see this, assume that we are given an α such that d(x, y) ≤ α[d(x, z) + d(z, y)] holds, and that the data is indexed in a BKT. Given a query element q and a node element u, recall from section 2.5.3 that we calculate |d(p, q) − d(p, u)| (where p is u's parent) and eliminate u and the entire subtree below it if this value is greater than the range r. However, the existence of the constant α makes it necessary to check |d(p, q)/α − d(p, u)| > r, which holds in a decreasing number of cases as α increases. Thus, query processing performance degrades sharply when the triangle inequality is relaxed in this way. Therefore, much research effort has recently focused on finding efficient mappings from non-metric to metric distance functions. A number of requirements could be stated for such a mapping:

• The distances produced by the mapping should be as close as possible to the original distances.

• The number of misclassifications made by the distance function produced by the mapping should be as small as possible. In other words, if the original distance function classifies element B as being closer to element A than element C is, this should ideally also be the case for all triples of elements (A, B, C) with the new distance function.

• The distance function produced by the mapping should underestimate the original distance function. If this is the case, we are guaranteed the correct answer to a range query. The index is first used to obtain an intermediate answer. Then the solution set is refined using the original distance function.

Finding such a mapping means that efficient methods like the ones presented in the previous sections can be used to index the data, thereby providing sublinear access to the data. This is very desirable in cases where large amounts of data are available, such as protein databases. However, the problem has proved to be very challenging since there exists no direct mapping from the similarity scores of mutation matrices like PAM and BLOSUM to a metric distance function which preserves all the information contained in the matrix. The methods proposed, presented and compared in this work aim at producing metric distance matrices where the information loss is minimal.

2.6 Related work

The problem of deriving a metric distance matrix from existing mutation matrices has previously been addressed in a few different ways. In [XM04], the mathematics of the original PAM series of mutation matrices have been reworked. Instead of using the empirically derived mutation frequency data to derive a matrix which gives the probability of amino acid i changing to amino acid j within a number of evolutionary steps, this work attempts to derive a metric distance between amino acids by considering the expected time period between transitions from amino acid i to amino acid j. Using such a distance measure, amino acid pairs with a high mutation probability will be associated with a short time interval, thus providing a dissimilarity function instead of a similarity function. The method produces a distance matrix with minimal metric distortions. Some small manual adjustments result in the mPAM250 matrix, which is shown empirically to retain a large fraction of the sensitivity of PAM250 when used in a homology search. There are two main problems with this approach. Firstly, it is specific to the PAM series of matrices, since it is based on the mathematical foundation of this model. Secondly, the derived distance matrices are not guaranteed to be metric, requiring manual adjustments for the triangle inequality to be satisfied. In essence, this brings us back to the original problem. Halperin et al. introduce a metric distance mapping for the BLOSUM family of matrices in [HBK+03]. In their work, peptides (short sequences of amino acids) are mapped into bit vectors. The distance between any two peptides can then be approximated by the Hamming distance between the bit vectors, which is a metric. The problem is slightly different from the problem addressed in this work, because the method is applied to protein data directly rather than being used to produce a metric distance matrix. Furthermore, the method is specific to the BLOSUM series of matrices.

Dhillon et al. state in [DST04] that their triangle fixing algorithm was originally motivated by the need to convert amino acid substitution matrices into metric distance matrices. This algorithm is described in detail in chapter 3. It has been implemented in this work as a point of comparison for the other methods and algorithms proposed. The problem with the algorithm in the form it is presented in the original paper is that it accepts a non-metric distance matrix as input. Thus, we are faced with the question of how to convert a similarity matrix into a distance matrix. Taylor and Jones have previously proposed and compared a number of methods for projecting substitution matrices into low-dimensional metric spaces in [TJ93]. They also propose an inter-row method of transforming the PAM250 matrix into a metric distance matrix, which performs relatively well when used for homology search. The work by Taylor and Jones is one of the main inspirations behind the methods proposed and evaluated in this work.

Chapter 3

Methods and solutions

As shown in chapter 2, the properties of metric distance functions are very desirable, since they enable indexing of the data at hand. Many distance functions do not possess these properties, so algorithms for converting non-metric substitution matrices into metric ones are also very desirable. Assuming that the distance functions are given as matrices, the metric distance matrix produced by such an algorithm should ideally preserve all the sensitivity of the original non-metric one. A sensitive distance matrix will produce result sets with large fractions of true positive hits and small fractions of false positive hits when used for searching in a database of elements. This raises a number of questions:

• What algorithms can be used to convert the non-metric matrix into a metric distance matrix?

• Given a matrix with similarity values instead of distance values, how can this fact be incorporated into the algorithms?

• Which algorithm produces the most sensitive metric distance matrix?

• Which factors determine the sensitivity of a metric matrix relative to the original non-metric one? How are these factors controlled in the algorithms?

• How efficient are the algorithms in terms of computational resource usage? Can they be used to convert large non-metric matrices into metric ones?

To answer these questions, a number of methods and algorithms for generating metric distance matrices from non-metric substitution matrices are presented in this chapter. Although tested and compared in the field of bioinformatics, the methods are intended to be generic to the problem of generating metric distance matrices from non-metric ones. Two main approaches to solving the problem are presented. Section 3.1 presents an existing algorithm for converting a non-metric distance matrix into a metric one directly. The main idea underlying this work is an approach where N elements (e.g. amino acids) are mapped into a metric space, from which a metric distance matrix is thereafter recovered. Section 3.2 presents this idea in general, while sections 3.3 to 3.6 are devoted to specific algorithms. Some of the suggested solutions are based on existing algorithms, while others are unique to this work:

• Multidimensional scaling (section 3.3), FastMap (section 3.4) and BoostMap (section 3.5) are existing algorithms which have been adopted in this work for solving the problem of mapping N elements into a metric space.

• Some modifications specific to the problem at hand are proposed for multidimensional scaling in section 3.3.2 and for BoostMap in section 3.5.3.

• Based on potential limitations of the other methods, two genetic algorithms for solving the problem are proposed in section 3.6.

Finally, section 3.7 rounds off the chapter with a summary of the proposed methods and how they solve the problem. The main conclusions are drawn in chapters 4 and 5, where experimental results are presented and discussed, respectively. Also note that algorithms will be analyzed and compared in terms of computational complexity in chapter 5 rather than in this chapter.

3.1 The triangle-fixing algorithm

The metric nearness problem has been defined by Dhillon et al. in [DST04]. Given a non-metric matrix D of dissimilarity measures, the problem is to find the "nearest" matrix of dissimilarity values which satisfies the triangle inequality. To solve this problem, a class of triangle fixing algorithms has been developed for the Lp norms (see section 2.4.2). An explicit algorithm is given for the L2 norm. It works by iteratively enforcing the triangle inequality for each violated triangle. The pseudocode is shown in algorithm 2, which has been adopted directly from [DST04]. In this algorithm, the dissimilarity measures are seen as edges of a graph G. Associated with the graph is a matrix A representing all possible triangles. T is simply a list of all triangles, corresponding to the rows of A. An example is shown in figure 3.1. In each row of T, the number 1 indicates that the edge is the hypotenuse of the triangle. If the triangle inequality is not satisfied for a specific triangle, the corresponding element of Td0 will be negative. In each iteration of the algorithm, all triangles are looped through and corrected. In essence, Dykstra's algorithm (presented in [Dyk83]) is used to minimize (1/2)||x||² subject to Ax ≤ −Ad in lines 7 to 16 of the algorithm. In line 7, ab represents the column of A labeled 1, while bc and ca represent the columns labeled −1. The mathematical derivations necessary to arrive at steps 7 to 16 of the algorithm are considered to be outside the scope of this work.

Algorithm 2 TRIANGLEFIX
Input: G, T, e
 1: for each (i, j, k) ∈ T do
 2:     δijk = 0                               . Initialize correction terms
 3: ∆ ← e + 1
 4: while ∆ > e do
 5:     ∆ ← 0
 6:     for each (a, b, c) ∈ T do
 7:         (oa, ob, oc) ← (ab, bc, ca)
 8:         ab ← ab + δabc                     . Apply correction
 9:         bc ← bc − δabc
10:         ca ← ca − δabc
11:         δ ← ab − bc − ca
12:         if δ > 0 then
13:             ab ← ab − δ/3
14:             bc ← bc + δ/3
15:             ca ← ca + δ/3
16:         δabc ← δabc − ab + oa
17:         ∆ ← |oa − da|
18: return corrected G

Figure 3.1: Notation used in the triangle fixing algorithm

Here it is sufficient to know that the steps add a correction term to each edge of G. The correction terms together correspond to one iteration of Dykstra's solution to the minimization problem. Readers are referred to [DST04] for details on this derivation. The algorithm terminates and returns the corrected graph G when no triangle receives a significant update. A corrected distance matrix can then be found simply as the edges of the corrected graph G. If e is chosen low enough to reach some degree of convergence, the corrected distance matrix will be metric.
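The following Python sketch mirrors algorithm 2 on a symmetric dissimilarity matrix with zeros on the diagonal. It is only meant to illustrate the iteration; the convergence test of line 17 is simplified to the largest edge update in a sweep, the enumeration of oriented triangles is the naive cubic one, and the function name and variables are chosen for illustration rather than taken from [DST04].

import itertools
import numpy as np

def triangle_fix_l2(D, e=1e-8):
    G = D.astype(float)
    n = G.shape[0]
    # every unordered triple yields three oriented triangles,
    # one per candidate "hypotenuse" edge (a, b)
    triangles = [t for i, j, k in itertools.combinations(range(n), 3)
                 for t in ((i, j, k), (j, k, i), (k, i, j))]
    corr = {t: 0.0 for t in triangles}          # correction terms delta_abc
    change = e + 1.0
    while change > e:
        change = 0.0
        for (a, b, c) in triangles:
            ab, bc, ca = G[a, b], G[b, c], G[c, a]
            oa = ab
            ab += corr[(a, b, c)]               # apply stored correction
            bc -= corr[(a, b, c)]
            ca -= corr[(a, b, c)]
            delta = ab - bc - ca                # > 0: triangle inequality violated
            if delta > 0:
                ab -= delta / 3.0
                bc += delta / 3.0
                ca += delta / 3.0
            corr[(a, b, c)] += oa - ab          # update correction term
            G[a, b] = G[b, a] = ab
            G[b, c] = G[c, b] = bc
            G[c, a] = G[a, c] = ca
            change = max(change, abs(oa - ab))
    return G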

In this work, the L2 triangle fixing algorithm has been implemented to test its ability to derive a metric distance matrix from a non-metric substitution matrix. The fact that substitution matrices like PAM and BLOSUM (see section 2.3) are similarity matrices poses a problem, since the triangle fixing algorithm takes a dissimilarity matrix as input. Amino acid substitution matrices must therefore be transformed into distance matrices before the algorithm can be applied. The transformation specified in equation 2.12 has been used in this work. The fact that the substitution matrix must be transformed into a distance matrix is a potential problem. There is no straightforward way of converting a similarity matrix, like an amino acid substitution matrix, into a distance matrix with zeros on the diagonal without losing information. Section 2.3.3 elaborates further on this. The algorithm finds the best metric fit to a distance matrix, but it is questionable how well this matrix reflects the data in the original similarity matrix in the first place. This suggests that finding methods for deriving a metric distance matrix using the data in the substitution matrix directly could prove to be advantageous. Nevertheless, the triangle fixing algorithm is included as a possible solution to the problem of making substitution matrices metric. The next section will present a conceptually different approach, where metric matrices are derived by first mapping elements (for example amino acids) into a metric space. Analytical and empirical comparisons of the two approaches will be given in chapter 5.

3.2 Deriving metrics from spatial embeddings

Recall from section 2.4 that a space can be defined as a pair (S, d) of a set S of elements and a distance function d : S × S → R. The distance function determines if the space is metric or not. Based on this definition, the problem of deriving a distance metric can be approached using the general idea of constructing a new distance matrix from N points embedded in a metric space. Three basic steps are necessary for this scheme:

1. Define a metric space.

2. Given an N × N input matrix (possibly non-metric), find the embedding of N elements in this space which minimizes some penalty function.

3. Use the distance function of the metric space to generate a metric output matrix.

In the following sections, a number of methods are presented which use this scheme of deriving distance metrics using spatial embeddings. Section 3.3 presents a solution based on multi-dimensional scaling. Section 3.4 shows how the FastMap algorithm can be used. In section 3.5 it is shown how the BoostMap algorithm can be used in the case when the input matrix consists of dissimilarity values, while section 3.5.3 presents some advantageous modifications which can be done to the algorithm when the input matrix is a similarity matrix (e.g. an amino acid substitution matrix). Finally, section 3.6 proposes two genetic algorithms for generating a metric distance matrix. 3.3 MULTI-DIMENSIONALSCALING 33

3.3 Multi-dimensional scaling

3.3.1 Metric and non-metric multidimensional scaling

Multidimensional scaling (MDS) techniques have been known for decades, and are among the most commonly used methods for finding an embedding of objects corresponding to the dissimilarity relations between them. The basic idea is to find a constellation of points in d dimensions corresponding as closely as possible to the data at hand. The term "scaling" stems from the fact that such a transformation can be seen as a scaling of points in a space with unknown dimensionality. Multidimensional scaling techniques are sometimes called dimensionality reduction techniques, because the unknown space in which the data is embedded is assumed to have higher dimensionality than the d-dimensional space which it is to be embedded into. There are two basic types of multidimensional scaling:

Metric MDS makes the assumption that dissimilarity measures between objects to be embedded are available. It is called "metric MDS" because the transformation assumes that the metricity of the data should be preserved, i.e. it implicitly assumes metricity of the data at hand.

Non-metric MDS makes the assumption that the rank order of pairs of objects is available; i.e. it is known for all (i, j) and (k, l) whether Di,j < Dk,l or not. A scaling based on rank orders does not necessarily preserve the assumed metricity of the original space, hence the name "non-metric MDS". In fact, the method makes no assumption as to whether the data at hand is metric or not.

As briefly described in section 2.3.3, converting a similarity matrix into a dissimilarity matrix, metric or not, without loss of information is problematic. Non-metric MDS therefore seems more appealing than metric MDS, since the latter technique assumes that dissimilarities between objects are available. Furthermore, the metricity preserving transformation of metric MDS may be problematic if the dissimilarity measures at hand are not metric. This is true in most cases when equation 2.12 is used to transform an amino acid similarity matrix into a dissimilarity matrix. Thus, non-metric MDS (hereafter referred to as NMDS) has been considered as a candidate solution to the problem of recovering a metric from a non-metric space. An important observation to make is that the rank order of pairs of elements can be derived directly from a similarity substitution matrix rather than from an intermediate dissimilarity matrix; sorting dissimilarity values in ascending order is the same as sorting similarity values in descending order. Thus, it should not be necessary to use equation 2.12 to convert the substitution matrix into a dissimilarity matrix first. This independence from an intermediate dissimilarity matrix makes NMDS especially appealing.

Most NMDS algorithms work in iterations, where a monotonic transformation of the current point constellation is applied to ensure that distances between points are monotone with respect to the original ranking. Further details on this can be found in [GCS70]. For technical reasons, this monotonic transformation has traditionally been implemented by using the data in the distance matrix to generate a set of "pseudo-distances". As noted by Taguchi and Oono in [TO04], this dependence on the data of the distance matrix conflicts with the idea that only the rankings of distance (or similarity) measures should be required. They propose a "pure" NMDS algorithm which does not depend on data in any distance matrix. This independence from intermediate distance measures, together with the idea to rank object pairs according to similarity value, suggests that this algorithm should be the ideal multidimensional scaling technique for the problem at hand. It has previously been applied in the field of bioinformatics to analyse genes from microarray data [TO05].

3.3.2 NMDS with proposed modification

Algorithm 3 is a slightly modified version of the "pure" NMDS algorithm presented in [TO04]. The modification is that an N × N similarity matrix S is used to produce the ranks of pairs of objects instead of a distance matrix. In line 1, a list G of ranked pairs (i, j) of objects is generated. The pair with the highest similarity value is stored at index 0, the pair with the next highest similarity value is stored at index 1, and so on. The N objects (e.g. amino acids) to be embedded are represented as d-dimensional position vectors ri, i ∈ {1, ..., N}. At each iteration of the algorithm, the L1 (Manhattan) distances are computed between position vectors. These distances are then ranked ascendingly, and differences in positions in G and H are calculated for all pairs (i, j). The values Ci,j are measures of how closely ranked a pair is in G and H. They are used in line 13 to calculate displacement vectors for all N position vectors. It is shown in [TO04] that these displacement vectors are directed towards constellations of less potential energy. The parameter s is an empirically derived scaling constant, typically set to s = N^(−3). The algorithm iterates until some convergence goal is reached. Such a goal may be that the sum of magnitudes of the displacement vectors (the "potential energy") is sufficiently small, that the mean of Ci,j is sufficiently small, or that a maximum number of iterations has been reached. As noted in the original paper, the algorithm is not free from the problem of local minima. However, when dealing with a relatively small number of points, it is affordable to run the algorithm repeatedly. The best embedding is then chosen as the one where the Ci,j values show the least difference in ranks. For the purpose of mapping the data of 20 × 20 amino acid substitution matrices into d dimensions to obtain a metric distance matrix, the SIMILARITYNMDS algorithm has been implemented in Matlab. The source code is included on the CD which should accompany this report.

Algorithm 3 SIMILARITYNMDS
Input: S, d, s
 1: G ← [... ≥ Si,j ≥ Sk,l ≥ ...]
 2: Generate N random position vectors ri in R^d
 3: while not converged do
 4:     Scale each position vector ri in R^d so that √(∑i |ri|²) = N
 5:     for all pairs (i, j) do
 6:         δi,j ← |ri − rj|
 7:     H ← [... ≤ δi,j ≤ δk,l ≤ ...]
 8:     for all pairs (i, j) do
 9:         n ← index of Si,j in G
10:         m ← index of δi,j in H
11:         Ci,j ← n − m
12:     for i ∈ {1, ..., N} do
13:         ri ← ri + s ∑j Ci,j (ri − rj) / |ri − rj|
14:     for all pairs (i, j) do
15:         Di,j ← |ri − rj|
16: return D
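For illustration, a compact Python sketch of algorithm 3 is given below. It assumes that S is a symmetric NumPy similarity matrix, replaces the convergence test with a fixed number of iterations, and is not the Matlab implementation referred to above.

import numpy as np

def similarity_nmds(S, d=3, s=None, iters=1000, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    N = S.shape[0]
    s = N ** -3 if s is None else s
    pairs = [(i, j) for i in range(N) for j in range(i + 1, N)]
    # G: rank of each pair when sorted by descending similarity (line 1)
    order = sorted(range(len(pairs)), key=lambda p: -S[pairs[p]])
    g_rank = np.empty(len(pairs))
    g_rank[order] = np.arange(len(pairs))
    r = rng.standard_normal((N, d))
    for _ in range(iters):
        r *= N / np.linalg.norm(r)                               # line 4
        delta = np.abs(r[:, None, :] - r[None, :, :]).sum(-1)    # L1 distances (line 6)
        dvals = np.array([delta[i, j] for i, j in pairs])
        order = sorted(range(len(pairs)), key=lambda p: dvals[p])
        h_rank = np.empty(len(pairs))
        h_rank[order] = np.arange(len(pairs))
        C = np.zeros((N, N))
        for p, (i, j) in enumerate(pairs):
            C[i, j] = C[j, i] = g_rank[p] - h_rank[p]            # line 11
        disp = np.zeros_like(r)
        for i in range(N):
            for j in range(N):
                if j != i and delta[i, j] > 0:
                    disp[i] += C[i, j] * (r[i] - r[j]) / delta[i, j]
        r += s * disp                                            # line 13
    D = np.abs(r[:, None, :] - r[None, :, :]).sum(-1)            # lines 14-15
    return r, D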

3.4 FastMap

FastMap is an algorithm for mapping points into a d-dimensional space which has gained much popularity since its introduction by Faloutsos and Lin in [FL95]. Like MDS, FastMap is called a dimensionality reduction technique, because it solves the problem of mapping points from a space of unknown dimensionality into a d-dimensional space while preserving the dissimilarities as well as possible. It solves the same problem as MDS, but with an order of magnitude lower computational complexity. Like metric MDS, FastMap assumes that N² dissimilarities between N objects are specified, and that these dissimilarities satisfy the triangle inequality. As will be explained below, the algorithm works by recursively projecting points into spaces of lower dimensionality. The triangle inequality must be satisfied for these projections to be correct. However, the resulting distance matrix derived from the embedding is assured to be metric if a metric distance function is chosen for the d-dimensional space. Therefore, FastMap is included as a candidate solution to the problem of deriving a metric distance matrix from a mapping of N points in a d-dimensional space. The effects of providing non-metric data to the algorithm are not known in advance; experimental results will uncover their impact on the ability of the algorithm to preserve the information in the original matrix. FastMap has previously been applied to non-metric data in [YJF98].

Figure 3.2: Illustration of the cosine law

The main idea behind FastMap is to recursively map points from a space of dimensionality d into a hyperplane in this space of dimensionality d − 1. As an example, assume that we are given a set of points in two dimensions. Two points, a and b, are picked as pivot points, and a line is drawn between them. This line can be seen as a one-dimensional hyperplane in the two-dimensional space. All remaining points are projected onto this line using the cosine law. For a point i, its projection xi onto the line (measured as the distance from the first pivot point a) can be found using equation 3.1, where D is a matrix containing the distances between points in the two-dimensional space. The equation is illustrated in figure 3.2.

xi = (Da,i² + Da,b² − Db,i²) / (2Da,b)    (3.1)

As shown in [FL95], this equation can be used in the general case for mapping points from any d-dimensional space into a (d − 1)-dimensional hyperplane. The same paper also gives a formal proof of equation 3.2, which specifies how a new distance function can be computed for the projected points in the (d − 1)-dimensional hyperplane.

D′i,j = √(Di,j² − (xi − xj)²)    (3.2)

The idea of FastMap is as follows. Given an N × N distance matrix D, we pretend that the N points are indeed points in d dimensions, and use equation 3.1 to project the points into a (d − 1)-dimensional hyperplane. For each point i, the xi value is stored as the first coordinate of the final d-dimensional embedding. A new distance matrix is then computed for the projected points using equation 3.2, and the entire procedure is repeated to obtain the second coordinate of the final embedding. We continue recursively in this way until a one-dimensional projection is reached. This projection then represents the last coordinate of the d-dimensional embedding.

Figure 3.3: Initial distance matrix (D) and final embedding (X)

An example is shown in figure 3.3, where D is the initial distance matrix and X is the final embedding produced by FastMap. The example has been adopted from [FL95]. Clearly, d = 3 has been selected in this case. The rows of X represent the N embeddings, and each column of a row represents a coordinate. Some readers may observe a similarity to principal component analysis (PCA), where the principal components are decreasingly capable of preserving information. The same can be said about the embedding produced by FastMap, where the projection into the first hyperplane (of dimensionality d − 1) preserves the largest fraction of information.

Algorithm 4 FASTMAP
Input: d, D
Extern: X, c                                  . X and c are global variables
 1: if d = 0 then
 2:     return X
 3: c ← c + 1
 4: a, b ← CHOOSEDISTANTOBJECTS(D)
 5: if Da,b = 0 then
 6:     for i ← 1, N do
 7:         Xi,c ← 0
 8:     return X
 9: for i ← 1, N do                           . Project points using the cosine law
10:     Xi,c ← (Da,i² + Da,b² − Db,i²) / (2Da,b)
11: for i ← 1, N − 1 do                       . Generate new distance function
12:     for j ← i + 1, N do
13:         D′i,j ← √(Di,j² − (Xi,c − Xj,c)²)
14:         D′j,i ← D′i,j
15: return FASTMAP(d − 1, D′)

Pseudocode for FastMap is shown in algorithm 4. As explained, the algorithm calls itself recursively. Variables X and c are global to all recursive calls. X stores the coordinates of the final embedding, while c simply points to the appropriate column of X. Line 1 checks if the algorithm has reached a point of zero dimensionality, which means that the final embedding can be returned. In line 4, two pivot points are chosen for the current recursive call using the external procedure CHOOSEDISTANTOBJECTS. This procedure is explained in detail below. Lines 5 to 8 simply check if the distance between the two pivot points is zero, which means that all distances between all points are zero (because the two pivot points should ideally be the two points farthest apart in the current space). If this is the case, there is no point in continuing, so the coordinate values are set to zero and the resulting embedding is returned. Lines 9 to 10 project all points onto a hyperplane of dimensionality d − 1, and lines 11 to 14 calculate a new distance matrix for the projected points. Finally, the recursive call is made in line 15 with a decremented dimensionality counter and the new distance matrix as arguments. It should be noted that the calculation of a new distance matrix in lines 11 to 14 is not strictly necessary. Instead, distance values could be computed on the fly for a specific d using equation 3.2 and X in lines 9 to 10. Thus, the pseudocode is not the most efficient one, but the most illustrative one in terms of understanding the algorithm. Using the final embedding returned by the algorithm, any metric distance function can be used to generate a metric distance matrix. For example, the Euclidean distance between two points i and j can be calculated as follows:

δi,j = √((Xi − Xj)(Xi − Xj)^T)    (3.3)
where Xi and Xj are row vectors of X.

Algorithm 5 CHOOSEDISTANTOBJECTS
Input: D
1: b ← randomly selected number from {1, ..., N}
2: a ← argmax_{a∈{1,...,N}} Da,b
3: b ← argmax_{b∈{1,...,N}} Da,b
4: return a, b

A question which remains to be answered is how to pick the pivot points a and b. All other points are to be projected onto a hyperplane perpendicular to the line between these two points, so it is intuitively desirable to choose the two points farthest apart. The straightforward way of finding the two points farthest apart clearly has a complexity which is quadratic in N. This is not ideal in cases where N is large, so a heuristic algorithm which is linear in N is proposed in the original paper. Pseudocode is shown in algorithm 5. The algorithm starts by initializing b to a random point. Then a is selected to be the point farthest apart from b. This can clearly be achieved in Θ(N) time. Finally, b is updated to the point farthest apart from a. Points a and b should now intuitively be a good candidate solution to the problem of finding the two points separated by the greatest distance. As noted in the original paper, the two middle steps of the algorithm can be repeated a constant number of times while still preserving a complexity linear in N.
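A compact Python sketch of algorithms 4 and 5 is shown below. It assumes a symmetric NumPy dissimilarity matrix, performs a single pivot-refinement pass, clamps negative squared residuals to zero (which can occur when the input is not metric, as discussed below), and includes the step of reading a metric matrix off the embedding with equation 3.3; the names are chosen for illustration.

import numpy as np

def choose_distant_objects(D, rng):
    # Heuristic from algorithm 5: pick b at random, then take the two
    # successive farthest points.
    b = rng.integers(D.shape[0])
    a = int(np.argmax(D[:, b]))
    b = int(np.argmax(D[a, :]))
    return a, b

def fastmap(D, d, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    D = D.astype(float)
    N = D.shape[0]
    X = np.zeros((N, d))
    for c in range(d):
        a, b = choose_distant_objects(D, rng)
        if D[a, b] == 0:
            break                                   # all remaining distances are zero
        # projection onto the line through the pivots (equation 3.1)
        X[:, c] = (D[a, :] ** 2 + D[a, b] ** 2 - D[:, b] ** 2) / (2 * D[a, b])
        # residual distances in the (d-1)-dimensional hyperplane (equation 3.2)
        diff2 = D ** 2 - (X[:, c, None] - X[None, :, c]) ** 2
        D = np.sqrt(np.maximum(diff2, 0.0))          # clamp: input may be non-metric
    return X

def metric_matrix_from_embedding(X):
    # Equation 3.3: pairwise Euclidean distances of the embedded points.
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(-1))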

In this work, FastMap is compared with other algorithms for generating a metric distance function from a metric embedding. Since FastMap requires a matrix of distances to be specified, a similarity matrix like an amino acid substitution matrix must first be converted into a dissimilarity matrix before the algorithm can be applied. Thus, the necessary steps to generate a metric distance matrix from a substitution matrix are:

1. Convert similarity matrix S into a dissimilarity matrix D using equation 2.12.

2. X ← FASTMAP(d, D).

3. Generate a metric distance matrix O from X using a metric distance function. The L2 norm was found to produce the best matrices in this work.

4. Return O.

Two points should be made regarding the steps presented above:

• The preservation of information is questionable when converting a similarity matrix into a dissimilarity matrix using a transformation such as equation 2.12. This is explained in detail in section 3.5.3, where the concept of triangular misclassifications is introduced.

• The result of converting a similarity matrix like an amino acid substitution matrix into a dissimilarity matrix using equation 2.12 is not metric. As explained above, FastMap assumes data which satisfy the triangle inequality, and it is uncertain how it will perform when provided with a non-metric distance matrix.

Nevertheless, FastMap represents a straightforward approach to the problem of mapping N points into a d-dimensional space using an N × N distance matrix. Metric multidimensional scaling is a classical approach to this problem, but FastMap was chosen because it essentially solves the same problem with an order of magnitude lower computational complexity. The Matlab source code of the implementation is included on the enclosed CD.

3.5 BoostMap

BoostMap is an algorithm introduced recently by Athitsos et al. in [AASK04b] and [AASK04a]. It is used to map both metric and non-metric data into a metric space. This makes it highly interesting as a possible solution to the problem of mapping non-metric substitution matrices into a metric space. BoostMap is based on the more general optimization algorithm AdaBoost, which is presented first.

3.5.1 AdaBoost - Adaptive Boosting

BoostMap builds upon AdaBoost (Adaptive Boosting), an algorithm introduced in [FS97] for improving the accuracy of prediction rules. In general, boosting refers to the strategy of combining many less accurate prediction rules into a highly accurate combined prediction rule. In AdaBoost, such a prediction rule is a classifier hj. Assume that we are given a set {x1, ..., xt} of entities xi of some kind. The entities may be scalars, vectors or any kind of data structure. Given such an entity xi as input, the classifier hj outputs a value from the set {−1, 1}. For each entity xi, we are also given a label yi ∈ {−1, 1}. This label makes it possible to measure the accuracy of the classifier hj; a perfect classifier would output hj(xi) = yi for all xi. A core idea of the AdaBoost algorithm is to maintain a distribution w of weights, where wi,j corresponds to the weight of the pair (xi, yi) in training round j. The set of pairs (xi, yi) is called a training set. At each iteration of the algorithm, a weak learning algorithm is called. Such a weak learning algorithm takes the current weights of the training set as input, and outputs a weak classifier hj for training round j. The accuracy of this classifier is then calculated, and a parameter αj is assigned to it based on the accuracy. Then the training weights are updated. Weights of incorrectly classified entities are increased, so that the weak learning algorithm is forced to focus on these in the next training round. Intuitively, this would cause the weak learning algorithm to choose a weak classifier which corrects the "mistakes" made by other weak classifiers. A total of J weak classifiers are selected. The final strong classifier is found by taking the sign of the sum of output values from the weak classifiers, where each classifier is weighted by its importance parameter αj. Pseudocode for AdaBoost is shown in algorithm 6.

Algorithm 6 ADABOOST, high-level pseudocode

Input: (x1, y1), ..., (xt, yt)                . xi ∈ X, yi ∈ {−1, 1}
Extern: WEAKLEARNER
1: for i ← 1, t do
2:     wi,1 ← 1/t
3: for j ← 1, J do
4:     Call WEAKLEARNER({wi,j | i ∈ {1, ..., t}})
5:     Get weak classifier hj
6:     Calculate the error of hj
7:     Choose an optimal αj ∈ R
8:     Set wi,j+1 = wi,j exp(−αj yi hj(xi)) / zj        . zj is a normalization factor
9: return H(x) = sign(∑_{j=1}^{J} αj hj(x))

The full details of the significance of the weight update factor (line 8) can be found in [SS99].

Figure 3.4: a) Two-dimensional points b) M1 embedding of two-dimensional points c) Calculation of M2 embedding of a point

3.5.2 BoostMap - using AdaBoost to produce embeddings

The main idea of BoostMap is to use one-dimensional embeddings as classifiers, and combine many such one-dimensional embeddings into a high-dimensional embedding with a low error rate in relation to the given distance matrix. In essence, this is achieved as follows.

• Given an N × N distance matrix, the goal is to find a metric embedding of N objects Z = {z1, ..., zN}.

• The set X consists of triples xi = (za, zb, zc) of distinct objects.

• A proximity classifier PX(za, zb, zc) is defined. PX outputs 1 if zc is closer to za than zb is, −1 if zb is closer to za than zc is, or 0 if zb and zc are equally close to za. For a given triple xi = (za, zb, zc), the associated label is yi = PX(za, zb, zc).

• Each one-dimensional embedding M(x) has a proximity classifier h(x) = h(za, zb, zc) with the same interpretation as PX.

• Using the AdaBoost algorithm presented in the previous section, weak classifiers from one-dimensional embeddings are chosen by a weak learner. The corresponding one-dimensional embeddings are then combined into a high-dimensional embedding. Weak classifiers are weighted so that the number of misclassifications done by the high-dimensional embedding in relation to the PX values is minimized.

Two types of one-dimensional embeddings of objects are specified: M1 and M2. Given an N × N distance matrix D, both M1 and M2 can be calculated in Θ(N) time. For an M1 embedding, a single reference object r is selected. The embeddings of all other objects are then calculated as M1r(x) = Dx,r. For example, consider the set of two-dimensional objects shown in figure 3.4a. If y is selected as reference object, we obtain the embedding M1y, which is shown in figure 3.4b. Figure 3.4 has been reproduced and slightly modified from [AASK04a].

The idea behind M2 is to draw a line between two reference objects x1 and x2. One-dimensional embeddings of all other objects are then found as the orthogonal projections of the objects onto this line. They are calculated as shown in equation 3.4, which is illustrated in figure 3.4c. Observe that this is exactly the same one-dimensional embedding as used by FastMap. Each M1 or M2 embedding has a classifier h(x) associated with it. The classification error of h(x) is expected to be high, but nevertheless better than a random classifier since the embeddings are based on data from the distance matrix.

M2x1,x2(x) = (Dx,x1² + Dx1,x2² − Dx,x2²) / (2Dx1,x2)    (3.4)
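Assuming a NumPy distance matrix D, the two one-dimensional embeddings can be sketched as follows (the M2 projection is simply equation 3.4; the function names are chosen for illustration):

def m1_embedding(D, r):
    # M1: distances to a single reference object r
    return D[:, r].astype(float)

def m2_embedding(D, x1, x2):
    # M2: projection onto the line between reference objects x1 and x2
    return (D[:, x1] ** 2 + D[x1, x2] ** 2 - D[:, x2] ** 2) / (2.0 * D[x1, x2])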

To evaluate the accuracy of a classifier, Schapire et al. have defined the function Zj, which gives a measure of a classifier's appropriateness at iteration j of the algorithm. The function is shown in equation 3.5.

Zj(h, α) = ∑_{i=1}^{|T|} wi,j exp(−α yi h(za,i, zb,i, zc,i))    (3.5)

In this function, T is the training set {x1, ..., x|T|}, where a triple xi is represented as (za,i, zb,i, zc,i). If yi and h(xi) = h(za,i, zb,i, zc,i) have the same value, the product between them equals one, thereby contributing less to Zj than if they were unequal. The full details of Zj can be found in [SS99]. In this context it is sufficient to say that if Zj < 1, the error of the combined strong classifier H(x) is expected to decrease if the weak classifier h(x) is chosen.

BoostMap is shown in algorithm 7. B is a set consisting of pairs (hc, αc) of classifiers and classifier weights. Lines 4 to 20 represent the weak learning steps of the algorithm. Lines 4 to 9 check if the Zj value of any selected classifier, calculated with the negated classifier weight, is less than one. If so, it is advantageous to add this classifier to B using its negated weight, which is obviously the same as removing the classifier. Lines 10 to 16 check all currently selected weak classifiers to see if there exists a new α which gives the classifier a Zj value of less than one. If this is the case, it means that it is advantageous to add this classifier with the new weight, so the weight is effectively added to the classifier which is already included in B. A z value of 0.9999 is used in practice, to avoid many small weight adjustments which do not contribute significantly to the combined strong classifier. Note that the range of possible values for the new α is lower bounded by the negated existing weight. This is required to ensure that the embedding induced by the combined classifier H(x) is a metric space, because non-negative weights are required for non-negative distance values [AASK04b].

Algorithm 7 BOOSTMAP

Input: T = {(x1, y1), ..., (xt, yt)}, |F1|, |F2|        . T is the training set
 1: for all training triples xi do
 2:     wi,1 ← 1/|T|
 3: while termination conditions not met do
 4:     z ← min_{c=1,...,|B|} Zj(hc, −αc)
 5:     if z < 1 then
 6:         g ← argmin_{c=1,...,|B|} Zj(hc, −αc)
 7:         hj, αj ← hg, −αg
 8:         B ← B − (hj, αj)                             . Remove classifier
 9:         goto 24
10:     z ← min_{c=1,...,|B|} {min_{α∈[−αc,∞)} Zj(hc, α)}
11:     if z < 0.9999 then
12:         g ← argmin_{c=1,...,|B|} {min_{α∈[−αc,∞)} Zj(hc, α)}
13:         hj ← hg
14:         αj ← argmin_{α∈[−αc,∞)} Zj(hg, α)
15:         αg ← αg + αj                                  . Modify weight
16:         goto 24
17:     Fj1 ← set of |F1| random M1 embeddings
18:     Fj2 ← set of |F2| random M2 embeddings
19:     Fj ← Fj1 ∪ Fj2
20:     g ← argmin_{c=1,...,|Fj|} {min_{α∈[0,∞)} Zj(hc, α)}
21:     hj ← hg
22:     αj ← argmin_{α∈[0,∞)} Zj(hg, α)
23:     B ← B ∪ (hj, αj)                                  . Add new classifier
24:     for all training triples xi do
25:         wi,j+1 ← wi,j exp(−αj yi hj(xi)) / Zj(hj, αj)
26:     j ← j + 1
27: return H(x) = sign(∑_{j=1}^{J} αj hj(x))

As for the problem of finding an optimal α value, it is convenient that Zj is a convex function with exactly one extremal point [SS99]. Thus, min_{α∈[−αc,∞)} Zj(hc, α) can be implemented as a binary search procedure, a strategy which has been followed in this work. The procedure simply starts at the existing weight value and doubles it until a point is reached where the gradient is positive. This point is then used as an upper bound, and a binary search is applied between the upper and lower bounds until the extremal point is found with the desired amount of precision. The strong classifier H(x) produced by BoostMap is a J-dimensional embedding, where J is the number of weak classifiers. The weights of the weak classifiers are used to define a weighted Manhattan distance. Given two objects u and v, the distance between them in the J-dimensional space can be calculated as shown in equation 3.6. The values uj and vj are found using the one-dimensional embedding associated with classifier hj.

D((u1, ..., uJ), (v1, ..., vJ)) = ∑_{j=1}^{J} αj |uj − vj|    (3.6)
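A sketch of equation 3.5 together with the binary search for α described above is given below. It assumes that h_vals holds the classifier outputs h(xi) for all training triples, y the labels and w the current weights (all NumPy arrays), and it approximates the sign of the derivative of Zj by a small finite difference; the names are chosen for illustration.

import numpy as np

def z_value(h_vals, y, w, alpha):
    # Equation 3.5
    return float(np.sum(w * np.exp(-alpha * y * h_vals)))

def best_alpha(h_vals, y, w, lower, tol=1e-6):
    # Z is convex in alpha, so double the step from the lower bound until
    # Z starts increasing, then bisect on the bracketed interval.
    def increasing(a):
        return z_value(h_vals, y, w, a + tol) >= z_value(h_vals, y, w, a)
    step = 1.0
    hi = lower + step
    while not increasing(hi):
        step *= 2.0
        hi = lower + step
    lo = lower
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if increasing(mid):
            hi = mid
        else:
            lo = mid
    alpha = 0.5 * (lo + hi)
    return alpha, z_value(h_vals, y, w, alpha)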

Given a non-metric distance function in the form of an N × N distance matrix, a straightforward way of deriving a metric distance matrix would be to use algorithm 7 directly. The metric distance matrix is then found by applying equation 3.6 to all N × N pairs of objects. In the case of amino acid substitution matrices, however, we are dealing with similarity matrices. Recall that the one-dimensional M1 and M2 embeddings require distances between objects to be specified to be able to project them into one dimension. Therefore, it is necessary to convert the similarity matrix into a dissimilarity matrix before using BoostMap. Given a similarity matrix, the steps necessary to generate a distance matrix using BoostMap are:

1. Use equation 2.12 to convert the similarity matrix S into a distance matrix Dnon-metric.

2. Use BoostMap to produce an embedding based on the distance matrix Dnon-metric.

3. Obtain a metric distance matrix Dmetric using the embedding produced by BoostMap and equation 3.6.

Readers should note that the presentation of the algorithm in [AASK04b] contains an error. Steps 1 and 2 of the algorithm presented there state that a selected classifier should be removed if the Zj value using its current classifier weight is less than one. The correct action is to remove a classifier if the Zj value using its negated current classifier weight is less than one, which has been confirmed by the authors.

3.5.3 Proposed BoostMap modifications

Recall that a BoostMap classifier checks whether j is closer to i than k is. In this work, such a classification is called a triangular misclassification if it does not correspond with the data in the original substitution matrix. The strategy of minimizing an explicit measure of embedding quality (i.e. triangular misclassifications) separates BoostMap from many other embedding methods like classical multidimensional scaling and FastMap, which do not attempt to optimize a quantitative quality measure. This seems especially appealing when the matrices are given in the form of similarity matrices, for example amino acid substitution matrices. Because the diagonal similarity scores of amino acid substitution matrices are not equal, there is no straightforward way of converting them into a distance matrix with d(i, i) = 0 for all i ∈ {1, ..., N}, metric or not, without losing information. As an example, consider applying the similarity to dissimilarity conversion defined in equation 2.12 to the PAM250 substitution matrix to obtain the distance matrix DPAM250. Then check all triples (i, j, k) where 0 < i < j < k ≤ 20. For PAM250, check whether j is more similar to i than k is. For DPAM250, check whether j is closer to i than k is. A simple program to do these checks reveals that the two matrices disagree on almost 20% of the triples. This problem of transforming a similarity matrix (with unequal diagonal elements) into a distance matrix suggests that BoostMap may be a more appropriate choice than algorithms which try to minimize the distance between two matrices, because it is not clear which matrix we want to minimize the distance from. However, the straightforward way of using the BoostMap algorithm presented in the previous section has the paradoxical disadvantage of using a matrix where a relatively large number of misclassifications has already been made. We are interested in a metric distance matrix which reflects the data in the similarity matrix as closely as possible, not the intermediate distance matrix generated by equation 2.12. To avoid this problem, two modifications to the algorithm are proposed for the purpose of handling similarity matrices like amino acid substitution matrices:

• Training triples x = (i, j, k) are assigned labels based on the original similarity matrix, not the generated distance matrix. If j is more similar to i than k, x is assigned the label 1. If k is more similar to i than j, x is assigned the label −1. If j and k are equally similar to i, x is assigned the label 0.

• One-dimensional embeddings are extracted directly from the similarity matrix, in addition to embeddings from the distance matrix generated by equation 2.12.

The second point requires some further explanation. The one-dimensional embeddings produced by BoostMap are expected to maintain some degree of classification accuracy relative to the distance matrix generated by equation 2.12. However, it is not clear how accurate they will be relative to the original similarity matrix, since the distance matrix already includes a significant amount of classification error. In addition, a larger set of unique one-dimensional embeddings to choose from is intuitively better, since its greater degree of diversity increases the probability of finding a weak classifier which can correct the mistakes made by classifiers already chosen. Therefore, a simple method of extracting one-dimensional embeddings from the similarity matrix is proposed. Assume that a reference object r is given. If the similarity matrix is symmetric, all similarities between r and other elements are given by a single column (or row) of the matrix (s(r, t) = s(t, r) for all t because of symmetry). Thus, this column (or row) can be extracted and converted separately into a one-dimensional embedding. Converting this similarity vector svr into a distance vector is much simpler than converting the similarity matrix into a distance matrix. For amino acid substitution matrices, the largest element of such a similarity vector is svr(r) in most cases. The corresponding distance vector dvr can then be found as dvr(i) = svr(r) − svr(i) for all i ∈ {1, ..., N}. It represents a one-dimensional spatial embedding directly, since the elements of the vector specify the distance to the reference object r. Furthermore, this distance vector makes no misclassifications relative to the similarity vector from which it was generated. The name M3 will be used when referring to such embeddings below. The case of two reference objects is a bit trickier. Using the same rationale as above, two columns (or rows) can be extracted from the similarity matrix. These need to be converted into distance vectors so that the one-dimensional embedding can be computed using equation 3.4. However, a problem occurs if svx1(x1) ≠ svx2(x2). This means that the vector elements will be translated by different amounts when using the method described above for conversion into a distance vector. It is a problem because it leads to a situation where dvx1(x2) ≠ dvx2(x1), which in turn leads to ambiguity in relation to equation 3.4. An ad hoc solution which was found to work well is to calculate the mean of svx1(x1) and svx2(x2). The distance vector elements can then be found as shown in equation 3.7 for j ∈ {1, 2} and i ∈ {1, ..., N}.

dvxj(i) = (svx1(x1) + svx2(x2)) / 2 − svxj(i)    (3.7)

The resulting values of dvx1(x1) and dvx2(x2) are set to zero regardless of the result of the computation in equation 3.7.

Using the two distance vectors dvx1 and dvx2, equation 3.4 is used to generate a one-dimensional embedding. Embeddings generated in this way will be called M4 from here on. In essence, the proposed modification of BoostMap can be summarized as follows:

1. Generate the distance matrix Dnon-metric from the similarity matrix S using equation 2.12.

2. Generate the training set T, using the similarity matrix S to assign labels to triples of objects instead of the distance matrix Dnon-metric.

3. Run BoostMap as shown in algorithm 7, with the modification that |F3| M3 embeddings and |F4| M4 embeddings are generated in addition to the M1 and M2 embeddings. Use Dnon-metric to generate M1 and M2 embeddings. Use S to generate M3 and M4 embeddings. Set Fj ← Fj1 ∪ Fj2 ∪ Fj3 ∪ Fj4.

4. Obtain a metric distance matrix Dmetric using the embedding produced by BoostMap and equation 3.6.

The inclusion of the M3 and M4 embeddings means that the computational complexity of BoostMap is increased. Although this is affordable in most cases when dealing with 20 × 20 matrices, the computation time can be controlled by tuning the number of embeddings considered at each iteration – obviously at the cost of less accuracy. Source code for both the original and the modified versions of BoostMap is included on the enclosed CD.
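A short Python sketch of the two proposed embeddings is shown below. It assumes that S is a symmetric NumPy similarity matrix with distinct reference objects x1 and x2 that are not at distance zero from each other, and the M4 projection follows equations 3.7 and 3.4; the function names are chosen for illustration.

def m3_embedding(S, r):
    # M3: dv_r(i) = sv_r(r) - sv_r(i), taken directly from row r of S
    return S[r, r] - S[r, :]

def m4_embedding(S, x1, x2):
    # M4: build the two distance vectors with equation 3.7, then project
    # onto the line between the reference objects with equation 3.4
    mean_self = (S[x1, x1] + S[x2, x2]) / 2.0
    dv1 = mean_self - S[x1, :]
    dv2 = mean_self - S[x2, :]
    dv1[x1] = 0.0                      # forced to zero as described above
    dv2[x2] = 0.0
    d12 = dv1[x2]                      # = dv2[x1] when S is symmetric
    return (dv1 ** 2 + d12 ** 2 - dv2 ** 2) / (2.0 * d12)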

3.6 Proposed genetic embedding algorithms

Two potential problems can be observed in the embedding algorithms presented in the previous sections:

• Algorithms FASTMAP and SIMILARITYNMDS use a fixed metric distance function. It is unclear whether this function is the best one to preserve the information in the original substitution matrix.

• Algorithms like FASTMAP and BOOSTMAP are based on the idea of projecting elements into spaces of lower dimensionality, and combining these projections into a final embedding. This approach may not be able to represent all possible constellations of points in the d-dimensional space. In BOOSTMAP, each one-dimensional embedding represents one dimension of the d-dimensional embedding. Thus, the number of points which can be represented by the algorithm is determined by the number of possible one-dimensional embeddings.

To overcome these potential limitations, it is desirable to introduce higher degrees of freedom. An algorithm for embedding N points in d dimensions based on a (possibly) non-metric distance function should ideally have the following two degrees of freedom:

• The algorithm should be able to represent all possible constellations of points in d dimensions.

• The algorithm itself should choose the metric distance function which fits the data at hand best.

Higher degrees of freedom mean that the size of the solution space is increased dramatically, in turn leading to problems of higher computational complexity. The problem of finding an optimal spatial embedding with respect to some cost function can be seen as a search in a solution space consisting of all embeddings in a metric space. Removing the limitation of using a fixed distance function means that the size of the solution space is increased exponentially; it now consists of all embeddings in all metric spaces. An exhaustive search scheme is obviously not an option, since the size of the solution space is infinite. To find near-optimal candidate solutions to such problems, some kind of stochastic algorithm must therefore be applied. Genetic algorithms are popular stochastic algorithms for estimating near-optimal solutions to such problems. In this section, two genetic algorithms for embedding N points in a metric space are proposed. First, an introduction to genetic algorithms is given by means of a preliminary attempt to use an evolutionary scheme to generate a metric distance function. Then a genetic algorithm which uses a fixed metric distance function is proposed. Finally, a cooperative coevolutionary genetic algorithm is proposed. This final algorithm eliminates the limitation of using a fixed metric distance function, and thus attempts to find a near-optimal metric embedding in a search space consisting of all possible embeddings in a large set of metric spaces.

3.6.1 A brief introduction to genetic algorithms

Most stochastic search algorithms operate on a single solution of the problem at hand. In contrast, genetic algorithms (often called evolutionary algorithms) operate on populations of many solutions from the search space. The idea is to evolve the population of solutions through a number of evolutionary steps which produce new generations of solutions. Each such step is designed so that it increases the average fitness of the candidate solutions in the population with respect to the problem. Fitness is simply the value of a function which estimates how capable the candidate solution is of solving the problem. For the problem of finding a metric distance matrix, a measure of metricity would obviously be included in the fitness function.

Figure 3.5: Evolutionary process of a generic genetic algorithm

Note that a cost function will be used in the genetic algorithms presented in this work. Obviously, maximizing a fitness function is the same as minimizing a cost function. There are three basic steps in the evolutionary process of a genetic algorithm:

Survival Some of the fittest solutions survive by being copied directly from gen- eration n to generation n + 1.

Crossover Two fit parent solutions are selected from generation n. A new solu- tion is generated for generation n + 1 by applying a binary crossover oper- ator to the two parent solutions. The crossover operator generates the new solution by copying some pieces from each of the parent solutions.

Mutation A small number of new solutions are generated for generation n + 1 by selecting a fit solution from generation n and applying a mutation operator to it. The mutation operator works by changing some pieces of the selected solution.

The same evolutionary process is then applied to the new generation of candidate solutions. Carefully designed genetic algorithms will guide the search into areas of the search space containing good candidate solutions to the problem at hand. A graphical illustration of the three basic steps of the evolutionary process is shown in figure 3.5, where candidate solutions are represented as geometrical constructions. The search stops when the evolutionary process has reached a maximum number of generations, or when the fitness of the best solution found so far has reached an appropriate level. The search process is vulnerable to local extremal points because the crossover operator will typically produce candidate solutions which are similar to their parents. The purpose of the mutation operator is to introduce a jump in diversity in the population, thereby enabling the search process to escape such local extremes. This is illustrated in figure 3.6, where candidate solutions are shown as points of a cost function over the search space. Mutation allows one of the candidate solutions to escape the "valley" which all of the solutions are currently in.

Figure 3.6: Mutation operator can guide the search out of local extremes

If this solution survives further evolutionary steps, a new sub-population can begin to evolve in the other "valley", which contains the global minimum. Note that the mutation step can be dropped if the fitness function contains no local maxima or, alternatively, if the cost function contains no local minima. Readers interested in a more thorough introduction to genetic algorithms are referred to [BBM93a].
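As an illustration of this generic evolutionary loop, the following minimal Python sketch shows the survival, crossover and mutation steps on an abstract population. The function names, the greedy parent selection and the toy usage are illustrative assumptions only; they are not the algorithms proposed later in this chapter.

import random

def evolve(population, cost, crossover, mutate,
           survival_rate=0.1, crossover_rate=0.8, generations=100):
    """Generic evolutionary loop with survival, crossover and mutation steps."""
    size = len(population)
    n_survive = int(survival_rate * size)
    n_cross = int(crossover_rate * size)
    for _ in range(generations):
        ranked = sorted(population, key=cost)           # best (lowest cost) first
        survivors = ranked[:n_survive]                   # survival step
        parents = ranked[:max(2, size // 2)]             # the fitter half may reproduce
        children = [crossover(*random.sample(parents, 2)) for _ in range(n_cross)]
        mutants = [mutate(random.choice(parents))        # mutation step
                   for _ in range(size - n_survive - n_cross)]
        population = survivors + children + mutants
    return min(population, key=cost)

# Toy usage: minimize (x - 3)^2 over real-valued candidates.
pop = [random.uniform(-10, 10) for _ in range(50)]
best = evolve(pop, cost=lambda x: (x - 3) ** 2,
              crossover=lambda a, b: (a + b) / 2,
              mutate=lambda x: x + random.gauss(0, 1))
print(best)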

3.6.2 A preliminary genetic algorithm

As a preliminary solution to the problem of finding a metric distance matrix which corresponds as closely as possible to the data in a non-metric substitution matrix, consider a solution space consisting of positive symmetric matrices. Algorithm 8 shows how the genetic approach could be used to find a metric distance matrix which retains some proximity to the original non-metric matrix. The initial population is generated by sampling values from a normal distribution (line 5). Elements from the non-metric distance matrix D are used as mean values, and input parameter σ is used as standard deviation. The parameter σ must be chosen with care; if the distance matrix D is far from being metric, it should be given a larger value than if D is almost metric.

Each evolutionary step starts by computing the cost of all matrices in the current population (line 12). In the same loop, each solution is checked to see if it is the best one found so far. If so, it is stored in a separate variable. Precomputing all cost values in this way is not strictly necessary. However, since each call to the TOURNAMENTSELECT procedure (to be explained below) implies checking the cost of several matrices from the population, this avoids having to recompute the cost of the same matrix more than once. The COST function is merely an implementation of equation 3.8. In this equation, P[i] is the ith candidate solution matrix of the current population. P[i]′m,n is found using equation 2.22.

cost = \sum_{m=1}^{N-1} \sum_{n=m+1}^{N} \gamma \left( P[i]_{m,n} - D_{m,n} \right)^2 + \upsilon \left( P[i]_{m,n} - P[i]'_{m,n} \right) \qquad (3.8)

Algorithm 8 Preliminary algorithm for generating metric distance matrices
Input: D, S, σ, α, β                     ▷ D is an N × N matrix, S is the population size
 1: P ← ∅                                ▷ P is the initial population
 2: while |P| < S do
 3:     C ← empty N × N matrix
 4:     for all upper elements Cm,n do
 5:         Cm,n ← NORMDIST(Dm,n, σ)
 6:         Cn,m ← Cm,n
 7:     P ← P ∪ C                        ▷ Add matrix to initial population
 8: e, B ← ∞, D                          ▷ e is the cost of best solution B
 9: while e is too large do
10:     Q ← ∅
11:     for i ← 1, |P| do
12:         E[i] ← COST(P[i], D)
13:         if E[i] < e and ISMETRIC(P[i]) then
14:             e, B ← E[i], P[i]
15:     while |Q| < α|P| do              ▷ α controls survival rate
16:         i ← TOURNAMENTSELECT(E)
17:         if i has not been selected previously in this iteration then
18:             Q ← Q ∪ P[i]
19:     Q ← Q ∪ {β|P| elements crossed over from P}
20:     Q ← Q ∪ {(1 − α − β)|P| elements mutated from P}
21:     P ← Q
22: return B                             ▷ Return the best solution found

As can be seen, the constant γ governs the influence of the distance between P[i] and D. Likewise, the constant υ governs the influence of the non-metricity of P[i]. Setting γ = 0 results in a cost function equal to the non-metricity value of the candidate solution matrix. In contrast, υ = 0 results in a cost function equal to the squared distance between the candidate solution matrix and the original distance matrix. Obviously, the latter case is not interesting here, since we are interested in finding metric matrices. Note that the call to the ISMETRIC function in algorithm 8 (line 13) can be skipped by having the COST function return two values, the squared distance between P[i] and D and the non-metricity of P[i]. The call can then be replaced by a simple metricity value check.
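A small NumPy sketch of this cost function is given below. It assumes that the P[i]′ of equation 2.22 is the metric closure of P[i], obtained by all-pairs shortest paths; this is an assumption about equation 2.22 rather than a reproduction of it, and the helper names are hypothetical. The square is placed only on the first term, following the reconstruction of equation 3.8 above.

import numpy as np

def metric_closure(P):
    """All-pairs shortest paths over the complete graph whose edge weights are
    the entries of P (assumed stand-in for the P' of equation 2.22)."""
    C = P.astype(float).copy()
    for k in range(C.shape[0]):
        C = np.minimum(C, C[:, k:k + 1] + C[k:k + 1, :])
    return C

def preliminary_cost(P, D, gamma, upsilon):
    """Weighted sum of distance to the input matrix D and non-metricity of P,
    computed over the upper-triangular cells only (cf. equation 3.8)."""
    P_prime = metric_closure(P)
    m, n = np.triu_indices_from(P, k=1)
    return float(np.sum(gamma * (P[m, n] - D[m, n]) ** 2
                        + upsilon * (P[m, n] - P_prime[m, n])))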

Algorithm 9 TOURNAMENTSELECT
Input: E                                 ▷ E is an array of cost values
Extern: K                                ▷ The value of K is problem dependent
1: min, k ← ∞, 1
2: for i ← 1, K do
3:     j ← ⌊RAND × |E|⌋ + 1
4:     if E[j] < min then
5:         min, k ← E[j], j
6:     end if
7: end for
8: return k

Procedure TOURNAMENTSELECT generates K random indices for the list E of cost values, and returns the index corresponding to the least costly of the corresponding matrices. This is known as tournament selection (hence the name of the procedure), a selection technique commonly used in genetic algorithms. To avoid premature convergence in a local extremal point, we should not simply copy the α|P| best solutions from the current population. Parameter K allows us to adjust the balance between selection greediness and population diversity. Setting K = |P| results in a greedy selection scheme. K = 1 is the same as selecting elements at random. The implementation is shown in algorithm 9, where RAND is a function which returns a number in the interval [0, 1).

The crossover step (line 19) is not shown in detail. It simply selects two parent matrices from the current population by tournament selection. Upper triangular cells of the offspring matrix are then generated by selecting k crossover points, where the sequence of matrix cells between two crossover points is copied from a randomly selected parent. This can be achieved by storing the upper triangular cells of each matrix as a one-dimensional vector. For example, a one-point crossover starts by generating a random cut point. The offspring is then generated by copying the value on one side of the cut from one parent, and the value on the other side from the other parent. An alternative to this scheme is uniform crossover, where each matrix cell value is copied from a randomly selected parent. This is effectively the same as setting k equal to the number of upper triangular cells. Several analyses have been carried out to determine which technique is best, uniform crossover or k-point crossover. In [BBM93b], a summary of important comparison results is given. The conclusions are contradictory, and show that the choice of k is highly problem dependent.

The mutation step (line 20) picks a candidate solution matrix by tournament selection. This matrix is then mutated by sampling new matrix cell values from a normal distribution, where the original cell values are used as means and σ is used as standard deviation. As mentioned earlier, σ must be chosen large enough to allow the search process to escape local extremes.

In essence, algorithm 8 generates a matrix by minimizing a cost function which consists of two parts, the non-metricity of the matrix and the distance to the input matrix. This should guide the search into areas of the search space which contain candidate matrices of high metricity. The algorithm was, however, not found to perform well in practice. For the search to find any metric matrix at all, the non-metricity component of the cost must be weighted much more heavily than the distance component. In addition, σ must be chosen so large that it causes the search to fluctuate out of areas where proximity to the original non-metric matrix is preserved. The problem lies in the crossover operator of the algorithm. Given two parent matrices with some degree of metricity, how do we design a crossover operator which carries this metricity over to the child matrix? Observe that an N × N distance matrix can be seen as a fully connected graph of N nodes. A cut in this graph will affect all triangles which include edges crossing the cut. If there are N1 nodes on one side of the cut and N2 nodes on the other side, a total of N1N2 triangles will be affected.
Thus, the crossover operator presented above does not preserve the triangle inequality well. Nevertheless, algorithm 8 serves as an introduction to the idea of using a genetic scheme to generate a metric distance matrix. It is desirable to modify the algorithm so that metricity is implicit in the way the solutions to the problem are represented. The algorithm can then focus on preserving the information contained in the original non-metric matrix, and does not have to minimize metricity at the same time. The idea of embedding N elements in a metric space fits this requirement perfectly; rather than trying to find a metric distance matrix directly, a genetic scheme is used to find the least costly embedding. The next section presents the first of two such algorithms proposed in this work.
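For concreteness, the vectorized upper-triangle representation and the one-point crossover discussed above can be sketched as follows in Python. The function names are illustrative, not taken from the thesis implementation.

import numpy as np

def upper_to_vector(C):
    """Flatten the upper-triangular cells of a symmetric matrix into a vector."""
    return C[np.triu_indices_from(C, k=1)]

def vector_to_symmetric(v, N):
    """Rebuild a symmetric N x N matrix (zero diagonal) from such a vector."""
    C = np.zeros((N, N))
    C[np.triu_indices(N, k=1)] = v
    return C + C.T

def one_point_crossover(mother, father, rng=np.random.default_rng()):
    """One-point crossover over the vectorized upper triangles of two parents."""
    N = mother.shape[0]
    vm, vf = upper_to_vector(mother), upper_to_vector(father)
    cut = rng.integers(1, vm.size)                # random cut point
    child = np.concatenate([vm[:cut], vf[cut:]])
    return vector_to_symmetric(child, N)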

3.6.3 Algorithm using fixed metric distance function

The algorithm GAEMBEDDING (Genetic Algorithm Embedding) uses a fixed metric distance function to find a near-optimal embedding of N elements in d dimensions, in accordance with a specific cost function. Its pseudocode is shown in algorithm 10.

Algorithm 10 GAEMBEDDING
Input: M, S, σ, α, β, d                  ▷ M is a (possibly) non-metric matrix
 1: R ← CLASSICALMDS(M, d)               ▷ d is the number of dimensions
 2: P ← ∅
 3: while |P| < S do                     ▷ S is the size of populations
 4:     C ← R
 5:     for each matrix element Cm,n do  ▷ Generate initial population
 6:         Cm,n ← NORMDIST(Cm,n, σ)     ▷ σ controls population diversity
 7:     P ← P ∪ C
 8: e, B ← ∞, R                          ▷ e is cost of best solution B
 9: while e is too large do
10:     Q ← ∅
11:     for i ← 1, |P| do
12:         E[i] ← GAEMBEDDINGCOST(P[i], M)
13:         if E[i] < e then
14:             e, B ← E[i], P[i]
15:     while |Q| < α|P| do              ▷ α controls survival rate
16:         i ← TOURNAMENTSELECT(E)
17:         Q ← Q ∪ P[i]
18:     while |Q| < (α + β)|P| do        ▷ β controls crossover rate
19:         i ← TOURNAMENTSELECT(E)
20:         j ← TOURNAMENTSELECT(E)
21:         C ← empty N × d matrix
22:         for each matrix element Cm,n do
23:             if RAND < .5 then
24:                 Cm,n ← P[i]m,n       ▷ Select from mother embedding
25:             else
26:                 Cm,n ← P[j]m,n       ▷ Select from father embedding
27:         Q ← Q ∪ C
28:     while |Q| < |P| do               ▷ Mutation rate is 1 − α − β
29:         i ← TOURNAMENTSELECT(E)
30:         C ← P[i]
31:         for each matrix element Cm,n do
32:             if RAND < .5 then
33:                 Cm,n ← NORMDIST(Cm,n, σ)
34:         Q ← Q ∪ C
35:     P ← Q                            ▷ Update population for next iteration
36: for each row vector Bi of B do
37:     for each row vector Bj of B do
38:         Oi,j = √((Bi − Bj)(Bi − Bj)^T)    ▷ Euclidean distance
39: return O                             ▷ Return metric distance matrix

Line 1 of the algorithm uses classical metric multidimensional scaling to obtain an initial candidate embedding of the N elements in d dimensions. As explained in section 3.3, metric multidimensional scaling could be problematic if the data at hand is not metric. Nevertheless, it quickly produces an embedding which can be seen as an approximate solution. An alternative would be to generate random constellations of N points in d dimensions. Some initial trials showed that the two alternatives were able to find embeddings of near equal minimal cost as long as σ was chosen large enough, but typically after a smaller number of iterations when starting with the metric MDS solution.

Lines 5 to 7 generate an initial population of embeddings. Each embedding C is coded as an N × d matrix, where each row corresponds to the coordinates of one element. This is the same way of representing embeddings as the X matrix used by FastMap (see figure 3.3). Coordinates are sampled from a normal distribution where the corresponding coordinate of the metric MDS solution is used as mean and the user defined parameter σ as standard deviation.

At each iteration of the algorithm, the next generation population Q is initialized as an empty set. The costs of all embeddings in the current population are then calculated using the GAEMBEDDINGCOST procedure, which is explained below. Each embedding is checked to see if it is the best one found so far. Then a total of α|P| embeddings are copied directly from population P to population Q in lines 15 to 17. Each surviving embedding is picked by tournament selection. A total of β|P| new embeddings are generated by crossover as follows. A father and a mother embedding are picked by tournament selection. For each coordinate n ∈ {1, ..., d} of each element m ∈ {1, ..., N}, a uniformly distributed random number from the interval [0, 1] is sampled. If this number is lower than 0.5, the coordinate is copied from the father embedding. If not, it is copied from the mother embedding.

Finally, (1 − α − β)|P| elements of Q are generated in lines 28 to 34 by "mutating" elements picked by tournament selection from P. The mutation is achieved by sampling new coordinate values from a normal distribution, using the original coordinate value as mean. This is a rather radical kind of mutation. In fact, it does not correspond with the biological meaning of the word at all. In most kinds of genetic algorithms where individuals from a population are represented as bit vectors, the mutation step is achieved by flipping the bit value in one position of such a vector. However, the steps used in lines 28 to 34 of GAEMBEDDING were found to work well in practice. Observe that embeddings are "mutated" in the same way as the initial population is generated. If the embedding picked by tournament selection is close to a local extremal point, sampling a new embedding using the same measure of diversity as was used to generate the initial population should help to guide the search out of this local extreme. Changing one coordinate of one embedded element represents a much smaller spatial change, and is intuitively not very likely to guide the search out of the local minimum.
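The crossover and mutation operators described above can be sketched as follows in NumPy; the function names are illustrative assumptions.

import numpy as np

rng = np.random.default_rng()

def uniform_crossover(mother, father):
    """Copy each coordinate from the mother or the father with probability 0.5."""
    mask = rng.random(mother.shape) < 0.5
    return np.where(mask, mother, father)

def mutate(embedding, sigma):
    """Resample roughly half of the coordinates from a normal distribution
    centred on the current values (the 'radical' mutation described above)."""
    mask = rng.random(embedding.shape) < 0.5
    noise = rng.normal(embedding, sigma)
    return np.where(mask, noise, embedding)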

It is assumed that the size of the population remains the same between iterations. Thus, we require α + β ≤ 1. If α + β = 1, the entire mutation step will be skipped. Typical values found to work well are α = 0.1 and β = 0.88, which means that the mutation rate is 0.02.

The cost of an embedding is computed as shown in algorithm 11. In addition to the embedding, it takes the original similarity matrix M as an input parameter. M is first converted into a distance matrix D using equation 2.12. In an implementation, this would of course be done once at the start of GAEMBEDDING, and D would be passed as an extra parameter to GAEMBEDDINGCOST. This has been skipped in the pseudocode for clarity and simplicity. Using the coordinates of the N embedded elements, represented as rows of C, euclidean distances are then calculated between points. The sum of squared differences between the two distance matrices D and O is stored in the variable δ.

Algorithm 11 GAEMBEDDINGCOST
Input: C, M                              ▷ C is an embedding, M is an N × N similarity matrix
Extern: ϕ, θ                             ▷ Scaling constants
 1: δ, m ← 0, 0
 2: D ← distance matrix from using eq. 2.12 on M
 3: for each row vector Ci of C do
 4:     for each row vector Cj of C do
 5:         Oi,j = √((Ci − Cj)(Ci − Cj)^T)    ▷ Euclidean distance
 6:         δ ← δ + (Di,j − Oi,j)²
 7: for each triple (i, j, k) where 0 < i < j < k ≤ N do
 8:     a ← Mi,k − Mi,j
 9:     b ← Oi,j − Oi,k
10:     if (b > 0 and a ≤ 0) or (a > 0 and b ≤ 0) then
11:         m ← m + 1
12: return ϕm + θδ

Lines 7 to 11 of the algorithm are essentially adopted from BoostMap. Using M, all unique triangles (i, j, k) of distinct elements are checked to see whether j is more similar to i than k is. Similarly, O is used to check whether j is closer to i than k is in the embedding. If the two tests disagree, m is incremented. The final cost of the embedding is returned as a weighted sum of δ and m. The constant ϕ is typically chosen to be larger than θ. In fact, some of the best results in this work were obtained by setting θ = 0. In that case, GAEMBEDDING essentially performs the same minimization as BoostMap. They still differ in methodology and distance function, though.
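A direct NumPy transcription of this cost computation might look as follows. This is a sketch: the conversion of M into the distance matrix D via equation 2.12 is assumed to have been done beforehand, and the function name is illustrative.

import numpy as np

def ga_embedding_cost(C, D, M, phi, theta):
    """Weighted sum of squared stress against the distance matrix D and the
    number of triangle misclassifications w.r.t. the similarity matrix M
    (cf. algorithm 11)."""
    diff = C[:, None, :] - C[None, :, :]
    O = np.sqrt((diff ** 2).sum(axis=-1))          # pairwise euclidean distances
    delta = ((D - O) ** 2).sum()
    m = 0
    N = C.shape[0]
    for i in range(N):
        for j in range(i + 1, N):
            for k in range(j + 1, N):
                a = M[i, k] - M[i, j]              # does M say j is more similar to i than k?
                b = O[i, j] - O[i, k]              # does O say j is closer to i than k?
                if (b > 0 and a <= 0) or (a > 0 and b <= 0):
                    m += 1
    return phi * m + theta * delta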

Note that euclidean distances between points are used both in GAEMBEDDINGCOST and in GAEMBEDDING to obtain the final metric distance matrix. Any other metric distance function could be chosen. This freedom to plug any metric distance function into the genetic embedding algorithm was the main inspiration behind the cooperative coevolutionary algorithm proposed in the next section.

3.6.4 Algorithm using evolving metric distance function

This section proposes a genetic algorithm that minimizes the cost function with respect to both spatial embedding and distance function. Some background theory is given before the algorithm itself is presented. If an embedding algorithm is to learn a metric distance function itself, it must basically learn two things at once. Observe that the problems of finding the least costly embedding and finding the least costly distance function are intertwined; they both strive to minimize the same cost function. There should be some sort of cooperation between the process of minimizing the cost function with respect to embeddings and the process of minimizing the cost function with respect to the distance function. Two main questions must be answered to be able to construct an algorithm which achieves this:

• How can such a cooperation scheme be incorporated into a genetic algo- rithm?

• How do we represent a distance function as a data structure in a genetic algorithm?

The first question will be addressed first. Using a genetic scheme, a population of distance functions should be evolved in parallel with the population of spatial embeddings. Genetic algorithms for evolving populations in parallel are called coevolutionary algorithms [Par98]. There are two basic types of coevolutionary genetic algorithms, competitive and cooperative ones [WLDJ01]. Competitive coevolutionary algorithms evolve different populations by having them compete against each other. The fittest individuals are the ones which are most successful in such competitions. In cooperative coevolutionary algorithms, on the other hand, different populations represent different parts of the problem to be solved. A candidate solution consists of one individual from each population. The populations are cooperating in the sense that they all strive to minimize the same cost function (or maximize the same fitness function). It is clear that a cooperative coevolutionary scheme is the most natural one in this case. We have two populations, each of which contains parts of candidate solutions. A candidate solution is merely a pair of a spatial embedding and a metric distance function. The basic structure of a cooperative coevolutionary scheme is shown in algorithm 12. As can be seen, the coevolution lies in the computation of cost values. A central question is: How do we assign a cost value to an individual? Since a complete candidate solution is required to calculate a cost value, the cost of an individual in a population is highly dependent on which individuals we select from other populations.

Algorithm 12 Basic structure of a cooperative coevolutionary algorithm
 1: Initialize all populations
 2: while not terminated do
 3:     for all populations do
 4:         for each individual i in population do
 5:             Select collaborators from all other populations
 6:             Calculate cost of i based on encounters with collaborators
 7:         Select individuals for survival
 8:         Select individuals by crossover
 9:         Select individuals by mutation
10:         Update population

Calculating the average cost value is, in most cases, too computationally expensive to be an option. Only a few individuals can be selected from each population. This problem of selecting collaborators from other populations has been analysed empirically in [WLDJ01]. Questions addressed include:

• What degree of greediness should be used when selecting collaborators? Should the best individuals from the other populations be selected, or should individuals from other populations be selected randomly?

• Given that many collaborators are selected from each of the other populations, which cost value should be chosen? An optimistic approach would be to choose the smallest cost observed. A pessimistic approach would be to choose the highest cost observed.

The analysis concludes that an optimistic approach is generally best for choosing cost value. The degree of greediness, however, is more problem dependent. The second initial question remains to be answered: How do we represent a distance function as a data structure in a genetic algorithm? Given two points x and y with position row vectors x and y, first note that the euclidean distance between them is found as d(x, y) = √((x − y)(x − y)^T) = √((x − y)I(x − y)^T), where I is the identity matrix. If I is replaced with a matrix A whose inverse matrix has the properties of a covariance matrix, we get a distance function known as the Mahalanobis distance between x and y. This distance function is shown in equation 3.9.

d(x, y) = \sqrt{(x - y) A (x - y)^T} \qquad (3.9)

In statistics, the Mahalanobis distance is the dissimilarity measure between two random vectors x and y (from the same distribution) with covariance matrix

A⁻¹. Thus, the difference between the euclidean distance and the Mahalanobis distance is that the latter takes correlations of the data set into account. To ensure metricity, A must be positive semi-definite, i.e. it must satisfy xAx^T ≥ 0 for all x [XNJR03]. If A is restricted to be a matrix with non-negative values on the diagonal and zeros elsewhere, we obtain a distance function where different axes of the d-dimensional space are weighted by different amounts. Such a matrix is clearly positive semi-definite, since the product xAx^T cannot possibly be negative if the elements of A are non-negative on the diagonal and zero elsewhere. Clearly, a Mahalanobis distance function can be represented by its A matrix.

The core idea of the proposed coevolutionary algorithm is to evolve a population of such matrices in parallel with the population of spatial embeddings. The matrices are restricted to have non-negative values on the diagonal and zeros elsewhere. This is because the general case of positive semi-definite matrices causes problems when generating new matrices by crossover and mutation; it is hard to find operators of low computational complexity which preserve the properties of positive semi-definite matrices. In contrast, this is easy to ensure if the Mahalanobis matrices are restricted to be diagonal.

Algorithm 13 shows CCAEMBEDDING, the proposed method of embedding N points using a coevolutionary scheme. Lines 3 to 12 initialize the two populations P1 and P2. The population of N × d embeddings, P1, is initialized as in GAEMBEDDING. The population of d × d Mahalanobis matrices, P2, is initialized by sampling diagonal elements from a normal distribution with the parameters µ and σ2 as mean and standard deviation.

The algorithm then follows a standard coevolutionary scheme. Procedure CCAEMBEDDINGCOST, which is explained in detail below, is used to calculate the cost of individual embeddings and matrices. When calculating the cost of an embedding, the population of matrices must be supplied as an argument to the procedure. This is because collaborators will have to be selected to be able to estimate the cost. Similarly, the population of embeddings must be provided as an argument when calculating the cost of a matrix. Lines 25 to 27 follow the same survival, crossover and mutation steps as GAEMBEDDING. The same operators are also applied to the population of Mahalanobis matrices in lines 28 to 30, with two differences:
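A minimal sketch of such a diagonal Mahalanobis-style distance is shown below. With non-negative diagonal weights, the expression under the square root is a weighted sum of squared coordinate differences and can never be negative; the weight values in the usage lines are arbitrary examples.

import numpy as np

def mahalanobis(x, y, A):
    """Distance sqrt((x - y) A (x - y)^T) for a d x d parameter matrix A."""
    d = x - y
    return float(np.sqrt(d @ A @ d))

# Diagonal A with non-negative weights: a per-axis weighted euclidean distance.
A = np.diag([1.0, 0.5, 2.0])
x, y = np.array([0.0, 1.0, 2.0]), np.array([1.0, 3.0, 2.0])
print(mahalanobis(x, y, A))    # sqrt(1*1 + 0.5*4 + 2*0) = sqrt(3.0)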

• The operators are applied to d × d matrices instead of N × d matrices (embeddings).

• The mutation operator must ensure that no matrix elements are negative.

Finally, lines 33 to 35 calculate the distances between embedded points using the spatial embedding B and the Mahalanobis matrix A, which were found to minimize the cost. The properties of A ensure that the resulting distance matrix is metric.

Algorithm 13 CCAEMBEDDING
Input: M, S1, S2, σ1, σ2, µ, α, β, d
Extern: CCAEMBEDDINGCOST
 1: R ← CLASSICALMDS(M, d)
 2: P1, P2 ← ∅, ∅
 3: while |P1| < S1 do
 4:     C ← R
 5:     for each matrix element Cm,n do
 6:         Cm,n ← NORMDIST(Cm,n, σ1)
 7:     P1 ← P1 ∪ C
 8: while |P2| < S2 do
 9:     F ← empty d × d matrix
10:     for each element Fm,m do
11:         Fm,m ← max(0, NORMDIST(µ, σ2))
12:     P2 ← P2 ∪ F
13: e, B ← ∞, R                          ▷ e is cost of best embedding B
14: A ← d × d identity matrix            ▷ A is the best parameter matrix
15: while e is too large do
16:     Q1, Q2 ← ∅, ∅
17:     for i ← 1, |P1| do
18:         E1[i], F ← CCAEMBEDDINGCOST(P1[i], P2, M)
19:         if E1[i] < e then
20:             e, B, A ← E1[i], P1[i], F
21:     for i ← 1, |P2| do
22:         E2[i], F ← CCAEMBEDDINGCOST(P2[i], P1, M)
23:         if E2[i] < e then
24:             e, B, A ← E2[i], F, P2[i]
25:     Q1 ← Q1 ∪ {α|P1| tournament selected embeddings from P1}
26:     Q1 ← Q1 ∪ {β|P1| embeddings crossed over from P1}
27:     Q1 ← Q1 ∪ {(1 − α − β)|P1| embeddings mutated from P1}
28:     Q2 ← Q2 ∪ {α|P2| tournament selected matrices from P2}
29:     Q2 ← Q2 ∪ {β|P2| matrices crossed over from P2}
30:     Q2 ← Q2 ∪ {(1 − α − β)|P2| matrices mutated from P2}
31:     P1 ← Q1
32:     P2 ← Q2
33: for each row vector Bi of B do
34:     for each row vector Bj of B do
35:         Oi,j = √((Bi − Bj)A(Bi − Bj)^T)
36: return O                             ▷ Return metric distance matrix

Algorithm 14 CCAEMBEDDINGCOST
Input: X, Y, M
Extern: K, ϕ, θ
 1: emb ← 1 if X is an embedding, 0 if not
 2: D ← distance matrix from using eq. 2.12 on M
 3: ε, W ← ∞, []
 4: for l ← 1, K do
 5:     n ← ⌊RAND × |Y|⌋ + 1
 6:     Z ← Y[n]
 7:     if emb = 0 then
 8:         T ← Z
 9:         Z, X ← X, T
10:     δ, m ← 0, 0
11:     for each row vector Xi of X do
12:         for each row vector Xj of X do
13:             Oi,j = √((Xi − Xj)Z(Xi − Xj)^T)
14:             δ ← δ + (Di,j − Oi,j)²
15:     for each triple (i, j, k) where 0 < i < j < k ≤ N do
16:         a ← Mi,k − Mi,j
17:         b ← Oi,j − Oi,k
18:         if (b > 0 and a ≤ 0) or (a > 0 and b ≤ 0) then
19:             m ← m + 1
20:     if (ϕm + θδ) < ε then
21:         ε ← (ϕm + θδ)
22:         if emb = 0 then
23:             W ← X
24:         else
25:             W ← Z
26: return ε, W

Algorithm 14 shows CCAEMBEDDINGCOST, the procedure used to estimate the cost of an individual from one of the two populations. Given an embedding or a Mahalanobis matrix, the idea is to pick K random individuals from the other population. The lowest observed cost is then returned. This is an optimistic approach to the problem of cost assignment. The greediness is, however, not as high as possible; observe that none of the collaborators are selected based on cost values from the previous evolutionary round. Instead, an approach based on random selection has been adopted. Preliminary tests indicated that this approach seems to be better than a greedy approach where the highest ranked individuals from the previous round are chosen.

Lines 7 to 9 and 22 to 25 are needed to determine the meaning of input parameters and variables. If X is an embedding, then Y is the population of Mahalanobis matrices. Similarly, if X is a Mahalanobis matrix, then Y is the population of embeddings. For the core cost calculations in steps 10 to 19 to make sense, X and Z must therefore be swapped (in lines 7 to 9) if X is not an embedding. It is simply a way of enabling the same procedure to be used for both populations.

The core cost calculation steps of the procedure, lines 10 to 19, are exactly the same as the cost calculation steps in GAEMBEDDINGCOST (algorithm 11). In other words, CCAEMBEDDING minimizes a weighted sum of the number of misclassifications and the squared distance between an embedding and an approximate distance matrix.

It should be noted that the introduction of an additional population increases the computational demands of the algorithm significantly. In general, the population sizes of CCAEMBEDDING must be chosen to be smaller than the population size of GAEMBEDDING. The parameter K in CCAEMBEDDINGCOST is also crucial to the performance of the algorithm, since it determines how many times the core cost calculation loop must be run for each individual in each population in each iteration. Promising results have been obtained when setting K = 4.

The GAEMBEDDING and CCAEMBEDDING algorithms proposed in this section have been implemented in Matlab. Although Matlab is an interpreted and untyped programming language, acceptable performance is achieved because many of the calculations can be expressed as matrix operations. The source code is included on the enclosed CD.
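The optimistic cost assignment can be sketched as follows; pair_cost is a hypothetical helper that evaluates one embedding together with one Mahalanobis matrix using the cost of algorithm 11, and the function name is illustrative.

import random

def collaborator_cost(individual, other_population, pair_cost, K=4):
    """Evaluate `individual` against K randomly drawn collaborators from the
    other population and return the lowest observed cost together with the
    collaborator that produced it (optimistic cost assignment)."""
    best_cost, best_partner = float("inf"), None
    for _ in range(K):
        partner = random.choice(other_population)
        c = pair_cost(individual, partner)
        if c < best_cost:
            best_cost, best_partner = c, partner
    return best_cost, best_partner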

3.7 Summary of proposed methods and algorithms

To sum up the methods and algorithms presented in this chapter, two main methods have been identified and presented:

• The triangle fixing algorithm solves the problem of making a non-metric distance matrix metric by enforcing triangle constraints until all of them are satisfied.

• The embedding algorithms solve the problem by mapping an N × N substitution matrix into a metric space. Thereafter, a metric distance matrix is recovered using the distance function of the metric space.

The embedding approach requires a concrete algorithm to be chosen. Five different algorithms are considered in this work:

NMDS attempts to find an optimal metric embedding in the sense that the difference in rank between inter-object distances and similarity values from the original substitution matrix is minimized. Some small modifications of the original algorithm have been proposed and included in the implementation.

FastMap assumes that the substitution matrix values are inter-object distances in a space of a certain dimensionality. Coordinates are then found by recursively projecting objects into hyperplanes of one less dimension. No explicit measures of embedding quality are minimized.

BoostMap combines many one-dimensional embeddings into a high-dimensional embedding using adaptive boosting. The algorithm attempts to minimize the number of triangular misclassifications. Modifications of the original algorithm have been proposed and included in the implementation.

GAEmbedding evolves a population of embeddings of fixed dimensionality by attempting to minimize a cost function. The cost function is a weighted sum of the squared distance to the original substitution matrix and the number of triangular misclassifications.

CCAEmbedding uses a coevolutionary scheme with one population of spatial embeddings and one population of Mahalanobis matrices. The cost function is the same as that of the single-population genetic algorithm.

An analytical comparison of the algorithms in terms of computational complexity is given in chapter 5.

Having identified a set of possible methods for solving the problem, we are naturally interested in knowing which ones perform best in terms of being able to retain the information inherent in the original matrix. Are there any differences between the algorithms? If so, which factors determine these differences? Is the triangle fixing approach better than the embedding approach? To answer these questions, this work compares the methods using empirical experiments. The experiments and the results obtained from them are the subjects of chapter 4.

Chapter 4

Experiments and results

The algorithms presented in chapter 3 are general in the sense that any matrix can be used as input to produce a metric distance matrix. However, the non-metricity of amino acid substitution matrices in the field of bioinformatics has been the main motivation for the development of the methods. As mentioned in section 2.5.5, finding information-preserving metric mappings for such matrices is highly desirable, since it enables data to be indexed efficiently. Much research effort has therefore been put into finding metric conversions for such substitution matrices. As they have been the main motivation behind this work, they have also been chosen as subjects of evaluation here.

Chapter 2 explained how amino acid substitution matrices are used for homology search in databases of proteins. When measuring the performance of such matrices, we are mainly interested in their sensitivity, i.e. their ability to find the biologically correct homologous proteins. This should be reflected in the chosen method of testing and evaluating the proposed algorithms. The algorithms differ in the way they attempt to preserve the information inherent in the original substitution matrix. The experiments must therefore try to uncover which algorithm uses the "correct" way of preserving the information, in the sense that the sensitivity of the input matrix is preserved as well as possible.

This chapter presents the experiments which have been carried out for the purpose of testing the performance of the proposed methods and algorithms. Section 4.1 presents the methods which have been chosen for evaluating the matrices, and the dataset which has been used. Section 4.2 presents the actual results. Finally, section 4.3 performs tests for statistical significance on the results.

4.1 Evaluating generated substitution matrices

To be able to assess the performance of the metric matrices produced by the proposed methods, it is quite clear that they should be used as substitution matrices in a homology search. The results from such a search can then be compared with the results from a search with the original non-metric matrix. This section explains in detail how this has been done, by presenting the method of evaluation and the dataset which has been used. A final section gives a description of the scheme used to measure the sensitivity of the matrices.

4.1.1 Method of evaluation

To compare the generated substitution matrices with the original ones, the Smith-Waterman algorithm for computing local alignments, presented in section 2.2.3, has been implemented. Gotoh's optimizations for affine gap penalty have been included. Search results obtained using the generated substitution matrices can then be compared to search results obtained using the original substitution matrices. To get the true optimal alignment scores, a straightforward implementation using no heuristics was developed. The SSEARCH program from the FASTA package (http://fasta.bioch.virginia.edu/) was considered, but ultimately discarded because it includes heuristics to achieve some computation speed improvements. In addition, the programs in the FASTA package do not allow custom substitution matrices to be specified.

Because the Smith-Waterman algorithm uses similarity scores to compute alignments, the dissimilarity values in the metric matrices must be converted into similarity values before the algorithm can be applied. The version of the algorithm presented in [Got82] uses a matrix of dissimilarity values, but proves to be problematic to use for computing local alignments. When computing the score using similarity scores, observe that a local alignment can immediately be closed if its similarity value drops below zero. In contrast, the version using distance penalties will have to specify a maximal acceptable distance value beyond which further expansion of the alignment can be skipped. Such a maximal distance value is not available in most cases.

Therefore, the strategy of converting the metric distance matrices into similarity matrices has been chosen in this work. This can be achieved by simply subtracting the distance value in a matrix cell from the maximum matrix value, which produces a similarity value. However, the correctness of the Smith-Waterman algorithm requires a mix of positive and negative similarity scores in the substitution matrix. Xu et al. found in [XM04] that the best way of achieving this is to subtract the median matrix element from the positive similarity matrix generated as described above. This strategy has also been adopted in this work. The resulting similarity matrix is also scaled by a constant which minimizes the squared distance to the original matrix. Note that these transformations do not pose any problems to the algorithms which are based on preserving triangular proximities; if j is classified as being closer to i than k is, this classification is unchanged when every matrix value v is replaced by yv + x. In other words, the proximities are invariant to translation and scaling.
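A sketch of this distance-to-similarity conversion in NumPy is given below. The least-squares scaling factor is one reasonable reading of "scaled by a constant which minimizes the squared distance to the original matrix", not necessarily the exact procedure used in the thesis, and the function name is illustrative.

import numpy as np

def distance_to_similarity(D, original):
    """Convert a metric distance matrix into substitution-style similarity scores:
    subtract each distance from the maximum, shift by the median so that both
    positive and negative scores occur, and rescale towards the original matrix."""
    S = D.max() - D                                 # positive similarity matrix
    S = S - np.median(S)                            # mix of positive and negative scores
    scale = (S * original).sum() / (S * S).sum()    # least-squares fit to the original
    return scale * S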

A program was developed for sequentially searching a database of proteins. The program reads a sequence of query proteins in FASTA format. For each of these proteins, the entire database is searched sequentially for homology. Similarity between a query protein and a database protein is measured using the Smith-Waterman local alignment score, which is computed using the input substitution matrix.

As may be apparent, this brute force way of computing local alignments and sequentially searching the entire database is computationally heavy. In this context, however, we are interested in obtaining the most accurate measure of the sensitivity of the substitution matrices. Thus, exact search results are more interesting than heuristic search results obtained using algorithms such as BLAST and FASTA.

The implementation of the database search program allows users to define subsets of the set of query proteins. Parallel processing is thereby enabled – simply by running the program with different subsequences of queries on different computers. Using the test database presented in section 4.1.2, this enables the entire search process to be completed in reasonable time (on the order of minutes) using a couple of workstations. The C++ source code for the search program can be found on the enclosed CD. The CD also contains Matlab source code (which is compatible with GNU Octave, http://www.octave.org/) for the methods and algorithms presented in chapter 3.

4.1.2 Dataset

To test the sensitivity of the generated substitution matrices, three types of data sets are required:

• A database of proteins coded as sequences of amino acid symbols.

• A set of query sequences to be used when searching the database for homology.

• For each query sequence, a list of true positive hits in the database.

In this work, the database used in both [SAM+01] and [XM04] has been adopted. This database consists of 6433 proteins from Saccharomyces cerevisiae, commonly known as baker's or budding yeast. The query set contains a total of 103 sequences. Accompanying these queries are 103 distinct lists of true database hits. These true hits have been identified by human domain experts. All data is publicly available from NCBI (National Center for Biotechnology Information) at ftp://ftp.ncbi.nlm.nih.gov/pub/impala/blasttest/. The data used in this work were downloaded in February 2005.

Some proteins contain the symbols B, Z and X, corresponding to amino acids which have not been uniquely determined. Seven of the 103 query sequences contain such symbols: ACTIN, ARR, CATH, DHHC, DNASE1, HISDAC and S1. Some of the substitution matrices to be tested only specify scores for the 20 amino acids, not for these additional symbols. This is a problem, since it is desirable to be able to compare results from different substitution matrices. A choice was therefore made to remove these seven sequences from the query set. Thus, the query set used in this work consists of 96 query sequences.

4.1.3 Performance measure

When measuring the performance of homology search results, a modified version of receiver operating characteristics (ROC) scoring introduced by Gribskov and Robinson in [GR96] has established itself as a common choice in the field of bioinformatics. Given a ranked list of search results and a list of true positive hits, the ROCn value is calculated as shown in equation 4.1.

ROC_n = \frac{1}{nT} \sum_{i=0}^{n} t_i \qquad (4.1)

In this equation, T is the total number of true positives, and ti is the number of true positive hits ranked ahead of the ith false positive. Obviously, given a list of true positive hits of length m, one needs a search result list of length n + m to compute the ROCn score. A score of 1 indicates a perfect match, where the first m elements of the result list are true positive hits. A score of 0 indicates that no true positive hits were found among the n highest ranked elements of the result list.
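For concreteness, a Python sketch of the ROCn computation is shown below; it walks down the ranked result list and sums, for each of the first n false positives, the number of true positives ranked above it. The function and parameter names are illustrative.

def roc_n(ranked_hits, true_positives, n=50):
    """ROC_n score for a ranked result list (cf. equation 4.1).

    `ranked_hits` is the search result ranked by decreasing score and
    `true_positives` is the set of correct database hits."""
    T = len(true_positives)
    true_seen, false_seen, total = 0, 0, 0
    for hit in ranked_hits:
        if hit in true_positives:
            true_seen += 1
        else:
            false_seen += 1
            total += true_seen          # true positives ranked ahead of this false positive
            if false_seen == n:
                break
    return total / (n * T)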

In bioinformatics, the ROC50 score has been chosen as a performance measurement in many recent works where homology search results are evaluated. As noted in [GB99], "this value has the advantages of yielding a wider spread of values, requiring less storage space, and corresponding to the typical biologist's willingness to sift through only about 50 false positives". The ROC50 score is used in a number of the works referenced from this work. In [XM04], ROC50 is used to measure the performance of the metric mPAM matrix (see section 2.6). It is also used to evaluate the search results in [HHLB04] and [GB99].

ROC50 has also been chosen to measure the performance of the matrices in this work, mainly motivated by its widespread usage in similar works. This makes the results from the methods proposed here comparable to other results where ROC50 has been used. In particular, it enables the results to be directly compared to the results presented in [XM04], which uses the same dataset as this work.

Generated substitution matrices can then be evaluated by comparing the ROC50 score obtained from a database search with the ROC50 score from a search with the original matrix.

4.1.4 Matrices

A representative selection of commonly used substitution matrices must be chosen as subjects of evaluation. The PAM and BLOSUM families of substitution matrices are by far the most commonly used. Different members of these families represent different levels of evolutionary divergence. Choosing a matrix therefore depends on the expected degree of sequence divergence in the data at hand. This is illustrated in figure 4.1.

Figure 4.1: Relation between sequence divergence and substitution matrices

Thus, to evaluate methods across the scale of divergence, the following matrices have been selected:

• PAM70, PAM120, PAM250

• BLOSUM80, BLOSUM62, BLOSUM40

Each of the algorithms presented in chapter 3 is applied to each of these matrices. The resulting metric distance matrices are then supplied as inputs to the database search program described in section 4.1.1. In addition, the mPAM250 matrix introduced in section 2.6 has been included in the evaluation of metric distance matrices derived from the PAM250 matrix. Since seven sequences have been removed from the query set in this work, the ROC50 performance score has been recalculated for the mPAM250 matrix rather than using the score presented in the original paper.

4.2 Homology search results

The homology search results presented in this section were obtained by running the database search program in parallel on three different computers – one running Linux 2.4.30 on an x86/32 processor, a second one running Linux 2.6.11-1 on an x86/64 processor and a third one running Solaris 9 on an UltraSPARC processor. The results were then gathered from the three computers and average ROC50 scores were calculated. Metric distance matrices were derived using the Matlab implementations of the algorithms. All the produced matrices are included on the enclosed CD. For each algorithm, parameters were tuned by repeatedly generating new distance matrices and running trial searches on the database. The parameters are presented in the list below, along with the abbreviations chosen for the algorithms (to simplify notation in tables and figures).

• TF - Triangle Fixing algorithm, presented in section 3.1. The value of the constant e was set to 0.0001, which enabled the algorithm to converge to a metric distance matrix.

• NMDS - Non-metric Multidimensional Scaling, presented with proposed modifications in section 3.3.2. The value of the constant s was set to 0.1 × N−3, in accordance with the advice in [TO04]. The maximum number of iterations was set to 5000, and the number of dimensions was set to 19.

• FM - FastMap, presented in section 3.4. The number of dimensions was set to 19.

• BM - BoostMap with proposed modifications, presented in section 3.5.3. The values of |F1|, |F2|, |F3| and |F4| were all set to N.

• GA - proposed Genetic embedding Algorithm using the euclidean distance function, presented in section 3.6. All triples of objects were included in the training set. Parameter values were chosen as follows: σ = 6, α = 0.1, β = 0.89, ϕ = 1 and θ = 0. The number of dimensions, population size and the maximum number of evolutionary steps were set to 19, 1000 and 100, respectively. This enabled the algorithm to reach convergence.

• CCA - proposed Coevolutionary Cooperative genetic embedding Algorithm, presented in section 3.6. Parameter values were chosen as follows: σ1 = σ2 = 6, µ = 10, α = 0.1, β = 0.88, ϕ = 1, θ = 0 and K = 4. The size of both populations was set to 500. The number of dimensions and the maximum number of evolutionary steps were 19 and 100, respectively.

The following sections present the homology search results for the six selected matrices. Average ROC50 scores are specified for all the matrices. The best results are selected and plotted as differences in ROC50 score from the original matrix and as sorted ROC50 scores. The full search results (i.e. all ROC50 scores for all query proteins for all matrices) are available on the enclosed CD.

4.2.1 Matrices derived from PAM70

Table 4.1: Homology search performance for matrices derived from PAM70

                 PAM70   TF     NMDS   FM     BM     GA     CCA
Open penalty     18      18     18     18     18     18     18
Extend penalty   2       2      2      2      2      2      2
Average ROC50    0.50    0.46   0.45   0.48   0.48   0.50   0.50
Std.dev. ROC50   0.35    0.37   0.36   0.37   0.36   0.36   0.35

The homology search results obtained using matrices derived from PAM70 are shown in table 4.1. The gap open penalty was set to 18 and the extend penalty to 2 in all cases. As can be seen, both of the genetic algorithms produce matrices which give search results with the same average ROC50 score as the original matrix. FastMap and BoostMap also produce very good results compared to PAM70.

The plot in figure 4.2 is calculated by subtracting the ROC50 score of a derived matrix from the ROC50 score of the original matrix. As can be seen, both the TF-PAM70 and CCA-PAM70 matrices achieve the same ROC50 score as PAM70 on roughly half of the queries, with CCA-PAM70 being slightly more symmetric around zero. As shown by the plot of sorted ROC50 values in figure 4.4 (page 73), CCA-PAM70 corresponds closely to PAM70.

Figure 4.2: Range of difference from PAM70 ROC50 score

4.2.2 Matrices derived from PAM120

Table 4.2: Homology search performance for matrices derived from PAM120

                 PAM120  TF     NMDS   FM     BM     GA     CCA
Open penalty     12      12     12     12     12     12     12
Extend penalty   1       1      1      1      1      1      1
Average ROC50    0.59    0.44   0.42   0.49   0.47   0.49   0.51
Std.dev. ROC50   0.35    0.37   0.36   0.36   0.36   0.35   0.36

As table 4.2 shows, the results for matrices derived from PAM120 show a lower degree of information preservation than the ones derived from PAM70. The cooperative coevolutionary algorithm is again able to produce the most sensitive metric distance matrix, closely followed by the single-population genetic algorithm and FastMap. Non-metric multidimensional scaling again produces the least sensitive metric matrix.

In this case, the information loss is easily observable from the difference plot in figure 4.3. Both TF-PAM120 and CCA-PAM120 are strongly biased towards the right side, which indicates that the original matrix scored better. However, the CCA-PAM120 matrix seems to follow a narrower curve than TF-PAM120. This is confirmed by the difference in average ROC50 score. Figure 4.5 on page 73 clearly shows that some of the sensitivity of the original matrix is lost, since the area under the plot of CCA-PAM120 is smaller than the area under the plot of PAM120.

Figure 4.3: Range of difference from PAM120 ROC50 score

Figure 4.4: Plot of sorted ROC50 scores for PAM70 and CCA-PAM70

Figure 4.5: Plot of sorted ROC50 scores for PAM120 and CCA-PAM120

4.2.3 Matrices derived from PAM250

Table 4.3: Homology search performance for matrices derived from PAM250

                 PAM250   TF      NMDS    FM
Open penalty     12       10      10      10
Extend penalty   2        1       1       1
Average ROC50    0.62     0.38    0.47    0.44
Std.dev. ROC50   0.35     0.36    0.36    0.36

                 mPAM250  BM      GA      CCA
Open penalty     10       9       10      10
Extend penalty   1        1       1       2
Average ROC50    0.48     0.49    0.53    0.49
Std.dev. ROC50   0.36     0.37    0.37    0.36

Table 4.3 gives the results produced by matrices derived from PAM250. In this case, the metric mPAM250 matrix mentioned in section 2.6 is also included. As can be seen, one of the evolutionary algorithms (GA) is again capable of producing the most sensitive metric matrix. Interestingly, GA-PAM250, CCA-PAM250 and the metric matrix produced by the modified BoostMap algorithm all perform better than the mPAM250 matrix. In the case of GA-PAM250, both the difference plot in figure 4.6 and the plot of sorted ROC50 scores in figure 4.8 on page 76 confirm that it has a sensitivity somewhere in between PAM250 and mPAM250.

Figure 4.6: Range of difference from PAM250 ROC50 scores

4.2.4 Matrices derived from BLOSUM80

Table 4.4: Homology search performance for matrices derived from BLOSUM80

                 BL80    TF     NMDS   FM     BM     GA     CCA
Open penalty     12      12     12     12     12     12     12
Extend penalty   2       2      2      2      2      2      3
Average ROC50    0.64    0.54   0.49   0.52   0.54   0.52   0.53
Std.dev. ROC50   0.33    0.35   0.36   0.35   0.34   0.34   0.35

As can be seen from table 4.4, the homology search results obtained when searching with matrices derived from BLOSUM80 are relatively similar. All of the derived matrices, with the exception of the one produced by non-metric multidimensional scaling, score roughly 0.1 less than the original BLOSUM80 matrix. In contrast to the PAM-derived matrices, the triangle fixing algorithm is able to generate a matrix which achieves a performance equal to or better than the performance of the embedding algorithms.

The difference plot in figure 4.7 clearly indicates that TF-BLOSUM80, CCA-BLOSUM80 and BM-BLOSUM80 are almost equal in performance. The bias towards the right side of zero, however, shows that some sensitivity from BLOSUM80 is lost. This is confirmed by the plot of sorted ROC50 scores in figure 4.9 (page 76), which also shows that TF-BLOSUM80 and BM-BLOSUM80 have almost identical score profiles.

Figure 4.7: Range of difference from BLOSUM80 ROC50 scores

Figure 4.8: Plot of sorted ROC50 scores for PAM250, mPAM250 and GA-PAM250

Figure 4.9: Plot of sorted ROC50 scores for BLOSUM80, TF-BLOSUM80 and BM-BLOSUM80

4.2.5 Matrices derived from BLOSUM62

Table 4.5: Homology search performance for matrices derived from BLOSUM62

                 BL62    TF     NMDS   FM     BM     GA     CCA
Open penalty     8       8      8      8      8      8      8
Extend penalty   1       1      1      1      1      1      1
Average ROC50    0.66    0.52   0.50   0.52   0.54   0.49   0.50
Std.dev. ROC50   0.32    0.35   0.35   0.35   0.35   0.35   0.34

The search performance of matrices derived from BLOSUM62, shown in table 4.5, shows a large degree of resemblance to the BLOSUM80 results. Again, the algorithms are almost equal in performance. The matrix produced by the triangle fixing algorithm is only beaten by BM-BLOSUM62 in terms of average ROC50 score.

Figure 4.10 clearly illustrates that a significant fraction of the sensitivity of the original BLOSUM62 matrix is lost in the conversion process. TF-BLOSUM62 and BM-BLOSUM62 score better than BLOSUM62 on only a very small number of queries; the original matrix scores better than the derived matrices on the majority of the queries. Figure 4.13 shows this from another point of view; the area under the BLOSUM62 curve is significantly larger than the areas under the two other curves.

Figure 4.10: Range of difference from BLOSUM62 ROC50 scores

4.2.6 Matrices derived from BLOSUM40

Table 4.6: Homology search performance for matrices derived from BLOSUM40

                 BL40    TF     NMDS   FM     BM     GA     CCA
Open penalty     10      10     10     10     10     10     10
Extend penalty   1       1      1      1      1      1      1
Average ROC50    0.62    0.46   0.49   0.48   0.51   0.51   0.50
Std.dev. ROC50   0.32    0.36   0.36   0.37   0.34   0.35   0.34

Results obtained with matrices derived from BLOSUM40 are shown in table 4.6. As can be seen, the three algorithms which are based on minimizing the number of triangular misclassifications (i.e. BoostMap and the two genetic embedding algorithms) perform marginally better than the other approaches.

The difference plot in figure 4.11 shows that both GA-BLOSUM40 and BM-BLOSUM40 perform significantly better than the matrix derived using the triangle fixing algorithm, since they score equal to BLOSUM40 on a higher number of queries and seem to follow a narrower curve. Figure 4.12 shows that although there are some differences in the curve profiles between GA-BLOSUM40 and BM-BLOSUM40, they perform roughly equally well, since the areas under the curves are approximately equal.

Figure 4.11: Range of difference from BLOSUM40 ROC50 scores

Figure 4.12: Plot of sorted ROC50 scores for BLOSUM40, GA-BLOSUM40 and BM-BLOSUM40

Figure 4.13: Plot of sorted ROC50 scores for BLOSUM62, TF-BLOSUM62 and BM-BLOSUM62

4.3 Test for statistical significance

The results presented in the previous sections can be used to test whether the metric matrices produced by the proposed methods perform as well as the original matrices, from a statistical point of view. Assuming that both the database to be searched in and the substitution matrix are fixed, the distribution of ROC50 scores depends on properties of the query proteins. This distribution is not known in advance, so we cannot apply tests for statistical significance based on defined probability distributions. Thus, a non-parametric test must be used.

When comparing the ROC scores obtained using a metric matrix with the ROC scores obtained using the original matrix, we are essentially dealing with paired observations. This means that the sign test presented in [WMMY02] can be applied. A sign test is used to test hypotheses on population medians rather than means. If Moriginal and Mmetric are used to denote the median ROC score obtained using the original and the metric matrix, respectively, the following hypothesis test can be formulated:

H0 : Mmetric = Moriginal

H1 : Mmetric < Moriginal

To test the hypothesis, we start by subtracting Moriginal from each ROC score obtained with the metric matrix, and then take the sign of the resulting values. If H0 holds, we expect the number of plus and minus signs to be approximately equal. Since we are now dealing with a distribution of binary values, this can be tested using a binomial distribution where the success probability of a trial is 0.5 under H0. Representing the number of plus signs by x, the probability of obtaining x or less can be calculated as

P = P(X \le x \text{ when } p = 0.5) = \sum_{i=0}^{x} b(i; 96, 0.5)

where 96 is the number of queries. This P value can then be used to either accept or reject H_0 at some level of significance. Table 4.7 shows the P values of homology search results in this work.
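A minimal sketch of this computation is given below. The function names, and the assumption that the per-query ROC50 scores are available as a plain list, are hypothetical; the binomial sum follows the formula above with n = 96 queries.

```python
import math

def binom_cdf(x, n, p=0.5):
    """P(X <= x) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(x + 1))

def sign_test_p(metric_scores, original_median, n_queries=96):
    """One-sided sign test as described above: x is the number of plus
    signs obtained by subtracting the original median from each ROC50
    score achieved with the metric matrix; P = sum_{i=0}^{x} b(i; n, 0.5)."""
    x = sum(1 for s in metric_scores if s - original_median > 0)
    return binom_cdf(x, n_queries)
```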

Table 4.7: P values of search results

            TF           NMDS     FM           BM      GA      CCA
PAM70       0.238        0.131    0.459        0.380   0.459   0.459
PAM120      0.0037       0.0028   0.063        0.026   0.0052  0.075
PAM250      1.1 · 10^-6  0.0028   6.6 · 10^-6  0.016   0.026   0.026
BLOSUM80    0.041        0.0052   0.016        0.041   0.016   0.016
BLOSUM40    0.0028       0.0092   0.0092       0.016   0.0092  0.016
BLOSUM62    0.0092       0.0014   0.0092       0.0092  0.0007  0.0028

Chapter 5

Analysis and discussion

This chapter discusses the experimental results presented in chapter 4. An analytical comparison of the proposed methods is also given. We are mainly interested in finding patterns in the empirical results which can be related to the design of the algorithms. This will enable questions such as the ones posed at the end of chapter 3 to be answered. The first section of the chapter gives a formal analysis and comparison of the methods and algorithms in terms of design and computational complexity. Then the empirical results are discussed. Finally, the discussions are synthesized and summarized in the last section.

5.1 Analytical comparison of methods

Differences between the proposed methods can be discussed both in terms of computational resource usage and in terms of general advantages and disadvantages. As for comparing algorithms in terms of computational resource usage, first note that the complexity of the algorithms has no connection to the complexity of homology search methods which use the produced metric matrix. Once such a matrix has been generated, it can be used to index the data using a method like the ones presented in section 2.5. The algorithm for producing the metric matrix is only run once, and search procedures subsequently utilize the metricity of the output matrix directly. However, since the methods are intended to be generic to the problem of making a matrix metric, one may come across problems where the complexity of the algorithms is crucial:

• If the algorithm needs to be run repeatedly, perhaps as part of a looping procedure, its computational complexity is obviously crucial.

• If the size of the input matrix is large, algorithms of high computational complexity may not be able to produce results within acceptable time limits. In this work 20 × 20 amino acid substitution matrices have been considered, but it is not hard to imagine that there may exist cases where one would like to apply the algorithms to larger matrices.

Therefore, the following subsection gives a brief analysis of each algorithm in terms of complexity. The algorithms are then discussed in general terms, taking the complexity into account.

5.1.1 Analysis

Table 5.1: Computational complexity of algorithms, where d is the number of dimensions, M is the number of classifiers considered at each iteration, P is the population size, K is the number of collaborators selected from the other population and c is the number of iterations needed for convergence.

Algorithm                              Complexity
Non-metric multidimensional scaling    c × O(N^2 log N^2)
FastMap                                O(d^2 N)
BoostMap                               c × O(M N^3)
Genetic embedding algorithm            c × O(P max(d N^2, N^3))
Coevolutionary embedding algorithm     c × O(P K max(d N^2, N^3))
Triangle fixing algorithm              c × O(N^3)

The algorithm for non-metric multidimensional scaling presented in section 3.3 iterates until convergence. In most cases, it is not possible to find an analytical expression for the total computational complexity of such algorithms. Instead, we focus on the complexity of the procedures which are run at each iteration. As can be seen in algorithm 3, the complexity of each iteration is upper bounded by the need to sort the distance values between embedded points. For an N × N matrix, there are N^2 such values. Sorting these can be done in O(N^2 log N^2) time. The total complexity of the algorithm is therefore c × O(N^2 log N^2), where c is the number of iterations needed for convergence. This notational convention of representing the number of iterations necessary for convergence by c will be used in all following complexity formulas.

The FastMap algorithm introduced in section 3.4 is dominated by the d recursive calls needed to produce a d-dimensional embedding. Algorithm 4 shows that the most complex operation in each recursive call is quadratic in N. However, as explained, the steps which calculate a new distance matrix are only included for clarity in the pseudocode. Instead of computing a distance matrix in O(N^2) time, the distance between two elements in an iteration can be computed “on the fly” in O(d) time using equations 3.1 and 3.2. As can be seen (line 10 in algorithm 4), three such distance values are needed to compute the coordinates of N elements in a particular iteration. Thus, the total complexity of FastMap is O(d^2 N). Note, however, that the construction of a new distance matrix from the final embedding is O(N^2).

As noted by Athitsos et al. in [AASK04b], the complexity of BoostMap is highly sensitive to the size of the training set. Recall from section 3.5 that the training set consists of triples (i, j, k) where 0 < i < j < k ≤ N. Thus, the maximum size is T = Θ(N^3), though the actual size is adjustable by the user. If M is the number of classifiers considered for inclusion in each iteration, a total of Θ(MT) evaluations between embeddings and training triples are needed, because one evaluation of the Z_j function (equation 3.5) takes Θ(T) time. These evaluations dominate the complexity of each iteration, so the total complexity of BoostMap is c × O(MT). If all triples are included in the training set, as is done in the implementation used in this work, the complexity is c × O(MN^3). Note that this analysis assumes that Z_j can be minimized with respect to α in a constant amount of time.

Each iteration of the genetic embedding algorithm proposed in section 3.6 is dominated by the computation of embedding costs, assuming that the tournament selection procedure picks a relatively small constant number of embeddings (7 was used in this work). Algorithm 11 shows that computing the cost of an embedding is Θ(max(dN^2, N^3)), where d is the number of dimensions. The computation of N^2 euclidean distances is Θ(dN^2), while the computation of triangular misclassifications is Θ(N^3) since all triples (i, j, k) of elements where 0 < i < j < k ≤ N are considered. If P is the desired population size, the complexity of the algorithm is c × O(P max(dN^2, N^3)).

The cooperative coevolutionary embedding algorithm presented in section 3.6 is the most complex one in terms of resource usage, since it attempts to find a near-optimal embedding while at the same time finding a near-optimal metric distance function.
This algorithm is also dominated by the computation of embedding costs. As can be seen in algorithm 14, a Mahalanobis matrix or an N-dimensional embedding is compared with K individuals from the other population. Computing the distances between all pairs of points takes Θ(d^2 N^2) time if implemented straightforwardly. However, since it is known in advance that Z is a matrix with non-negative values on the diagonal and zeros elsewhere, the computation can be achieved in Θ(dN^2) time. As with the single-population genetic algorithm, computing triangular misclassifications is Θ(N^3). Thus, the complexity of computing the cost of one matrix or embedding is Θ(K max(dN^2, N^3)). Assuming that a small constant number of elements is picked at random by the tournament selection procedure, this cost computation step is dominant in each iteration. The full complexity of the algorithm is c × O(PK max(dN^2, N^3)).

Algorithm 2 shows that each iteration of the triangle fixing algorithm applies correction terms to each triangle (a, b, c). For an N × N matrix, there are O(N^3) such triangles. Thus, the total complexity of the algorithm is c × O(N^3). Table 5.1 summarizes the computational complexities of the algorithms.
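To make the dominant cost computation of the evolutionary algorithms concrete, the following sketch computes all pairwise distances under a diagonal Mahalanobis matrix (the Θ(dN^2) step) and then counts triangular misclassifications against the original similarity matrix (the Θ(N^3) step). It is an illustration under the assumptions stated in this chapter rather than a reproduction of the thesis implementation; in particular, the tie handling and the exact misclassification criterion are simplified, and all names are hypothetical.

```python
import numpy as np

def embedding_cost(X, z_diag, S):
    """Illustrative cost of an embedding X (N x d) paired with a diagonal
    Mahalanobis matrix diag(z_diag), measured against the original N x N
    similarity matrix S as the number of triples (i, j, k) whose distance
    ordering disagrees with the similarity ordering (ties are ignored)."""
    N = X.shape[0]
    # Weighted squared Euclidean distances (x_i - x_j)^T Z (x_i - x_j):
    # Theta(d N^2) because Z is diagonal.
    diff = X[:, None, :] - X[None, :, :]
    D = np.einsum('ijd,d,ijd->ij', diff, z_diag, diff)
    errors = 0
    # Theta(N^3) sweep over all triples with i < j < k.
    for i in range(N):
        for j in range(i + 1, N):
            for k in range(j + 1, N):
                # Higher similarity should correspond to smaller distance.
                if (S[i, j] > S[i, k]) != (D[i, j] < D[i, k]):
                    errors += 1
    return errors
```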

5.1.2 Comparison and discussion

As the analysis in the previous section shows, FastMap is the best choice when only considering computational complexity. It is linear in N, while the other algorithms are quadratic or cubic. This means that it will be able to handle much larger matrices. FastMap will outperform the other algorithms even when taking into account the N^2 steps necessary to construct the final metric substitution matrix from the d-dimensional embedding. All the other algorithms require some steps to be run repeatedly until some termination condition is met. It is difficult to compare the complexity of these directly, since no analytical expressions for the termination conditions are available. It is, however, clear that the evolutionary algorithms require more computational resources than the other ones in most cases. The population size is typically selected to be much larger than N, leading to computation times on the order of minutes even for small N. Other factors must of course be taken into account when comparing the algorithms. The performance of the metric matrix produced by the algorithm, to be discussed in section 5.2, is obviously essential. But there are a number of inherent differences between the algorithms which are worth pointing out:

• As mentioned in chapter 3, the triangle fixing algorithm is conceptually different from the others in that it constructs a metric matrix by explicitly enforcing triangle constraints. Metricity is implicit in the other algorithms.

• If the substitution matrix to be converted is a similarity matrix (as in this work), both the triangle fixing algorithm and FastMap require this matrix to be transformed into a dissimilarity matrix first. Section 3.5.3 showed that it is questionable how much information such a transformation is able to preserve. The problem is that the diagonal cell values of matrices like amino acid substitution matrices may not be equal, so the transformation is not as easy as subtracting the maximum matrix value from each cell value. The other four algorithms take this into account by utilizing the data in the similarity matrix directly. In the case of non-metric multidimensional scaling and BoostMap, some modifications to the original algorithms have been proposed in this work for the purpose of handling similarity matrices.

• Non-metric multidimensional scaling, FastMap and the first genetic embedding algorithm require a metric distance function to be chosen and fixed before the metric distance matrix is recovered. The cooperative coevolutionary embedding algorithm introduces a new degree of freedom by attempting to evolve a population of distance functions in parallel to the

population of spatial embeddings. BoostMap also uses a distance function which is determined dynamically by the algorithm; recall that the distances recovered from the final embedding are computed using a weighted Manhattan distance, where the weights of the axes are learned by the algorithm. An explicit choice of distance function is not relevant to the triangle fixing algorithm.

• BoostMap determines the dimensionality of the metric space dynamically. Recall that each one-dimensional embedding selected by the algorithm represents one dimension of the final embedding, and that the algorithm has the ability to both add and remove such one-dimensional embeddings during execution. All the other embedding algorithms require a fixed number of dimensions to be specified in advance. Again, the problem is not relevant to the triangle fixing algorithm.

• Non-metric multidimensional scaling, the two genetic/evolutionary algorithms and BoostMap all construct a spatial embedding by minimizing some cost function. The first three of these four can converge to a local minimum of the cost function, and the author is not aware of any proof which shows that this is not also the case for BoostMap. This is clearly a disadvantage. The cooperative coevolutionary embedding algorithm in particular was found to be affected by this problem, requiring a lot of parameter tuning and trials before being able to produce better embeddings than the single-population genetic algorithm. FastMap does not minimize any cost function, so we cannot speak in concrete terms about local minima for this algorithm. As for the triangle fixing algorithm, it can be shown that it computes the globally optimal solution to the metric nearness problem defined in section 3.1 [DST04]. It should be obvious that being able to avoid local minima is highly desirable, especially in cases where successive trials are not an option.

The differences between the algorithms discussed above are expected to have an effect on the ability to preserve the information in the original substitution matrices. Possible effects are uncovered in sections 5.2 and 5.3, where the algorithms are discussed in light of homology search results. Some justifications should be made regarding the actual choice of algorithms, and possible alternative choices should be pointed out:

• Among the algorithms used, all but the two evolutionary ones are based on existing works. These represent only a subset of the many embedding algorithms available. An overview of some alternative algorithms can be found in [HS03]. Multidimensional scaling was selected because of its widespread usage. This was also the case for FastMap, which seems to have gained much popularity since its introduction in 1995. BoostMap was selected because of its idea of minimizing the number of triangular

misclassifications, which in turn was the inspiration behind the two evolutionary algorithms and the modifications of the original BoostMap algorithm. As mentioned in the introduction to chapter 3, the triangle fixing algorithm was chosen so that the idea of deriving distance metrics from spatial embeddings could be compared with a conceptually different approach.

• Both FastMap and the single-population genetic algorithm allow an arbitrary metric distance function to be used for recovering a distance matrix. In this work, only the classical euclidean distance function has been considered. The main motivation for this has been results from preliminary tests showing that the methods were outperformed by other algorithms regardless of distance function. In the case of the single-population genetic algorithm, experimentation with alternative distance functions ultimately resulted in the design of the coevolutionary algorithm. This coevolutionary approach was then expected to outperform the single-population approach in any case.

• The class of Mahalanobis matrices used to represent metric distance functions in the coevolutionary algorithm is restricted to matrices with non-negative values on the diagonal and zeros elsewhere. In theory, any positive semi-definite matrix can be used (see the sketch below). Thus, the limitation could be removed if an efficient crossover operator was found which guarantees that the offspring matrix is positive semi-definite.
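A minimal sketch of what this restriction means in practice is given below; it is illustrative only and the names are hypothetical. A diagonal matrix with non-negative entries is automatically positive semi-definite, so the induced distance satisfies the triangle inequality, whereas an arbitrary candidate matrix produced by a crossover operator would have to be checked, for example via its eigenvalues.

```python
import numpy as np

def is_positive_semidefinite(A, tol=1e-10):
    """Check positive semi-definiteness via the eigenvalues of the
    symmetrized matrix."""
    A = np.asarray(A, float)
    return bool(np.all(np.linalg.eigvalsh((A + A.T) / 2.0) >= -tol))

def mahalanobis_sq(x, y, A):
    """Squared Mahalanobis distance (x - y)^T A (x - y). With a diagonal,
    non-negative A this reduces to a weighted squared Euclidean distance,
    which is the restricted form used by the coevolutionary algorithm."""
    d = np.asarray(x, float) - np.asarray(y, float)
    return float(d @ np.asarray(A, float) @ d)
```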

5.2 Homology search results

Chapter 4 presented homology search results for six commonly used amino acid substitution matrices. As explained in section 4.1.4, these were selected to represent a wide range of divergence between protein sequences. Based on these empirical results, we are mainly interested in knowing:

• Are the metric matrices produced by the proposed methods able to retain the sensitivity of the original matrices?

• Which algorithms produce the best matrices in terms of sensitivity preservation? How do the algorithms based on deriving distance metrics from spatial embeddings compare with the “straightforward” triangle fixing algorithm?

• Do algorithms perform equally well on all matrices? If not, are there any apparent patterns in terms of matrix family (PAM or BLOSUM) or target divergence?

This subsection summarizes and discusses the homology search results in light of these questions. Results from the six chosen substitution matrices are discussed in separate sections. A summary of the findings is given in section 5.3.

5.2.1 PAM matrices

PAM70

All of the algorithms seem to work well on the PAM70 matrix. In fact, matrices produced by the two proposed genetic embedding algorithms are able to produce search results with the same average ROC50 score as the original matrix. They are closely followed by FastMap and BoostMap, which produce results with a marginally lower ROC50 score. The triangle fixing algorithm performs slightly worse than all of the embedding algorithms except non-metric multidimensional scaling. Nevertheless, the P values presented in table 4.7 show that the level of significance must be chosen larger than 0.131 to be able to reject the null hypothesis (see section 4.3) for any of the matrices. Recall that the P value tests the hypothesis by assuming that the observed ROC50 scores for the original and the derived matrices are samples from the same distribution. Using a significance level of 0.05 or 0.10, this cannot be rejected for any of the matrices. In the case of the CCA-PAM70 matrix (i.e. the metric matrix produced by the cooperative coevolutionary algorithm), the difference plot in figure 4.2 confirms this. About half of the score values are equal to the ones obtained using the original PAM70 matrix. Among the ones which are not, about half of them are lower and half of them are higher than the PAM70 scores. The plot of sorted ROC50 scores in figure 4.4 also shows that the two matrices are close to equal in performance.

This apparent ability of the metric matrices to retain the sensitivity of the original matrix was actually quite unexpected. Previous attempts to convert amino acid substitution matrices into metric distance matrices have shown that some loss of information is inevitable. Recall that the two evolutionary algorithms work by trying to minimize the number of triangular misclassifications in relation to the original substitution matrix. BoostMap takes a similar approach, by using triangular misclassifications to evaluate and select weak classifiers. Calculations of the triangular misclassifications made by the final metric matrices show that CCA-PAM70 makes 109 misclassifications, GA-PAM70 makes 111 misclassifications and BM-PAM70 makes 171 misclassifications. There are a total of 1140 triples (i, j, k) where 0 < i < j < k ≤ 20, so the numbers represent a small but significant fraction of all possible misclassifications. This suggests, as might be expected, that there is no direct equivalence between preservation of sensitivity and preservation of triangular classifications.

The average ROC50 score of PAM70 is only 0.50, which is a relatively modest performance. It means that a relatively large number of false positives are ranked ahead of the true positives in many query results. An explanation for the sensitivity preservation may therefore be that the produced metric matrices are able to correct some of the mistakes made by the original PAM70 matrix. Whether this has a formal explanation in the design of the algorithms or has happened by chance in the case of PAM70 has not been further investigated. In any case, the results show that the two evolutionary algorithms, BoostMap and FastMap are able to produce metric matrices which preserve the sensitivity of PAM70.

Because of its good performance, the entire CCA-PAM70 matrix has been included in appendix B. In the figure, the matrix produced by the cooperative coevolutionary algorithm has been scaled by a factor of 0.05 and rounded to one decimal.

PAM120

Contrary to the case of PAM70, none of the algorithms are able to produce a metric matrix with the same sensitivity as the PAM120 substitution matrix. Closest in average ROC50 score is the matrix produced by the cooperative coevolutionary algorithm, with a score of 0.51. This is 0.08 lower than PAM120. Nevertheless, this suggests that a large fraction of the sensitivity is preserved. In relation to the results obtained with the PAM70 matrix, two key observations can be made:

• The rank order of the matrices produced by the different algorithms (in terms of average ROC50 score) seems to be similar. In both cases, all embedding algorithms except for non-metric multidimensional scaling outperform the triangle fixing algorithm. Furthermore, the cooperative coevolutionary algorithm produces the best matrices, while FastMap, BoostMap and the single-population genetic algorithm are roughly equal in performance.

• Triangular misclassifications made by the PAM120-derived matrices suggest that there may be some connection to ROC50 performance. CCA-PAM120 makes 141 misclassifications, GA-PAM120 makes 132 misclassifications and BM-PAM120 makes 180 misclassifications. In comparison, NMDS-PAM120 makes 302 misclassifications. However, triangular misclassifications do not dictate ROC50 performance alone. The fact that FM-PAM120 makes 260 misclassifications while still performing equally to GA-PAM120 is an indication of this.

Table 4.7 shows that we can reject the null hypothesis and conclude that none of the derived matrices have sensitivities equal to PAM120 if a significance level of 0.05 is chosen. If we choose 0.10 as significance level, the null hypothesis cannot be rejected in the cases of CCA-PAM120 and BM-PAM120. Figure 4.3 confirms that the best performing matrix, CCA-PAM120, preserves much of the sensitivity from PAM120, although the difference plot seems to be slightly biased towards positive values (i.e. PAM120 ROC50 score higher than CCA-PAM120 ROC50 score). The plot of sorted scores in figure 4.5 confirms this. Surprisingly, GA-PAM120 is still rejected even though its average ROC50 score is equal to the score of the matrix produced by BoostMap. This is because the hypothesis test uses the median instead of the average ROC50, and shows that GA-PAM120 has a distribution of values which is more biased towards one side of the PAM120 median value than CCA-PAM120 and BM-PAM120.

PAM250

In terms of sensitivity preservation, the PAM250-derived matrices seem to resemble the ones derived from PAM120. Again, the three algorithms based on preserving triangular classifications produce the most sensitive metric matrices. The number of misclassifications in this case is 131 for the matrix produced by the single-population genetic algorithm, 136 for the matrix produced by the cooperative coevolutionary algorithm and 162 for the matrix produced by BoostMap.

As in the cases of PAM70 and PAM120, matrices produced by the embedding algorithms outperform the matrix produced by the triangle fixing algorithm by a significant amount. The difference in average ROC50 score between TF-PAM250 and GA-PAM250 is 0.15. In contrast, the difference in average score between GA-PAM250 and the original matrix is only 0.09. This latter difference is still, however, not small enough to claim that all the sensitivity of PAM250 has been preserved. As table 4.7 shows, we would have to choose a significance level below 0.026 to avoid rejecting the null hypothesis for even the best of the derived matrices. Since significance levels are commonly set to 0.10 or 0.05, it would be very optimistic to claim that any of the derived matrices perform equal to the original matrix.

Even though full sensitivity preservation is not achieved, the matrices produced by the two evolutionary algorithms and BoostMap nevertheless seem to be good approximations. It is interesting to note that these three matrices produce search results with higher average ROC50 scores than the mPAM250 matrix recently proposed by Xu and Miranker in [XM04]. To compare mPAM250 and GA-PAM250, the full ROC curves were calculated and plotted. They are shown in figure 5.1. ROC curves show the average fraction of true positive hits returned per number of false hits. Thus, the area under the curve can be seen as a measure of sensitivity performance. As figure 5.1 shows, the GA-PAM250 matrix produced by the single-population genetic algorithm performs better than the mPAM250 matrix. All three curves have the same shape, which shows that both of the two metric matrices have characteristics similar to PAM250. In relation to the results with PAM70 and PAM120, it is worth noticing that non-metric multidimensional scaling performs better in this case. Another difference is that the cooperative coevolutionary algorithm is not able to produce a metric

matrix which performs better than the one produced by the single-population genetic algorithm. Finding an explanation for these facts is hard, since a formal connection between the philosophy of the algorithms and the data in the substitution matrices is yet to be found. It simply shows that the inherent structure of the input matrix has an impact on the performance of the output matrix. Local minima are also possibly part of the explanation, since both non-metric multidimensional scaling and the cooperative coevolutionary algorithm are exposed to this problem.

Because of its good performance in relation to other attempts to construct a metric matrix based on PAM250, the GA-PAM250 matrix is included in its entirety in appendix C.

Figure 5.1: ROC curves for PAM250, mPAM250 and GA-PAM250
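For concreteness, the sketch below outlines one common way of computing a score of this kind from a ranked hit list (1 marks a true homolog, 0 a false hit). It assumes the usual definition of ROC50 as the fraction of true positives retrieved, averaged over the positions of the first 50 false positives; the function name and the padding behaviour when fewer than 50 false positives occur are assumptions, not details taken from the thesis implementation.

```python
def roc50(ranked_labels, n_true_positives, max_false=50):
    """ROCn-style score for a ranked hit list (1 = true homolog, 0 = false
    hit): the fraction of true positives retrieved, averaged over the
    positions of the first `max_false` false positives."""
    tp = 0
    fractions = []
    for label in ranked_labels:
        if label:
            tp += 1
        else:
            fractions.append(tp / n_true_positives)
            if len(fractions) == max_false:
                break
    # If fewer than `max_false` false positives occur, pad with the final
    # fraction (one possible convention, stated here as an assumption).
    while len(fractions) < max_false:
        fractions.append(tp / n_true_positives)
    return sum(fractions) / max_false
```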

PAM matrices - a summary

The main observations which can be extracted from the homology search results with PAM matrices are:

• In the case of PAM70, the two evolutionary algorithms are able to produce metric matrices which preserve all the sensitivity of the original matrix.

• In all three cases, the two evolutionary algorithms produce matrices performing better than matrices produced by the other algorithms.

• FastMap, BoostMap and the two evolutionary algorithms perform better than the triangle fixing algorithm in all three cases. FastMap is able to produce well-performing distance matrices even though its mathematical foundation requires the input matrix to be metric.

5.2.2 BLOSUM matrices

BLOSUM80

The results from homology search using metric matrices derived from the BLOSUM80 matrix are quite different from results with PAM-derived matrices. As table 4.4 shows, the difference in average score is marginal between different algorithms. The triangle fixing algorithm and BoostMap produce the most sensitive matrices, while the matrix derived using non-metric multidimensional scaling performs worst. The difference in average score from the original matrix varies between 0.10 and 0.15, which must be considered to be a significant difference. This is confirmed by table 4.7, which shows that none of the results have a statistical significance which is high enough to claim that the entire sensitivity of the BLOSUM80 matrix is preserved (assuming a significance level of 0.05).

Nevertheless, the matrices produced by all of the algorithms are able to preserve a large amount of sensitivity.

The possible connection between sensitivity preservation and number of triangular misclassifications seems not to apply in this case. Calculations show that BM-BLOSUM80 makes 161 misclassifications, while the corresponding numbers for GA-BLOSUM80 and CCA-BLOSUM80 are 129 and 133, respectively. Still, the BoostMap-generated matrix is able to outperform the matrices produced by the two evolutionary algorithms.

BLOSUM62

The results obtained with matrices derived from BLOSUM62 are very similar to the results in the case of BLOSUM80. All algorithms produce matrices of roughly equal sensitivity. BoostMap is marginally best, closely followed by the triangle fixing algorithm and FastMap. Choosing a significance level of 0.05, none of the produced matrices can be claimed to have the same performance as the original matrix.

An investigation of the triangular misclassifications made by the three algorithms which are based on minimizing this measure confirms that the connection to sensitivity preservation seems to be very loose in the case of BLOSUM matrices. The number of misclassifications is 253 for BM-BLOSUM62, 233 for GA-BLOSUM62 and 230 for CCA-BLOSUM62. These scores correspond neither with the ranking of the average ROC50 scores, nor with the actual value of the scores. Thus, the patterns observed with the PAM matrices seem not to be present in the case of BLOSUM matrices.

BLOSUM40

Average ROC50 scores from homology search using BLOSUM40-derived matrices show a higher level of resemblance to the PAM results than those for BLOSUM80 and BLOSUM62. Matrices derived using spatial embedding algorithms are roughly equal in performance, while the matrix derived using the triangle fixing algorithm performs slightly worse. Thus, there seems to be an advantage in choosing one of the embedding algorithms in this case. Nevertheless, differences in sensitivity between different matrices are not as high as with the PAM matrices. For example, observe that the difference between the most sensitive and least sensitive matrix is 0.15 in the case of PAM250 (see table 4.3). In this case, the difference is only 0.05. Even though the embedding algorithms are marginally better than the triangle fixing algorithm in this case, it confirms that the differences between the conceptually different methods are questionable in the case of BLOSUM matrices.

BLOSUM matrices - a summary

The main observations which can be extracted from the homology search results with BLOSUM matrices are:

• The advantage of choosing an embedding algorithm over the triangle fixing algorithm is more questionable than in the case of PAM matrices.

• Differences between matrices produced by the different algorithms are smaller than with PAM matrices.

• All of the produced matrices are able to preserve a significant part of the sensitivity of the original matrix.

5.3 Synthesis and summary

As discussed in the previous section, the experimental results suggest that the performance of the produced metric matrices depends on the family of the input matrix:

• When dealing with matrices from the PAM family, there is an advantage in choosing one of the algorithms which derive a metric distance matrix from a spatial embedding over the triangle fixing algorithm. Furthermore, the two genetic algorithms proposed in this work outperform the other embedding algorithms by a slight amount.

• When dealing with matrices from the BLOSUM family, differences between the embedding algorithms and the triangle fixing algorithm are less clear.

The most likely explanation for these differences is that there are inherent differences between the data contained in matrices from the two families. As shown in section 2.3, there are fundamental differences in the way they are derived from empirically collected data.

An interesting aspect of the average ROC50 scores is the fact that they seem to be relatively similar across the six mutation matrices. Based on this, one might wonder if the results are somehow effects of randomness. The difference plots presented in chapter 4 show that this is most probably not the case. For many of the produced matrices, the difference in ROC50 from the original matrix is zero for almost half of the protein queries. This, in turn, shows that there is a significant amount of correlation between the metric matrices and original ones.

The similarity in average ROC50 score also raises the question: Does this score level (just above 0.5) represent a maximum amount of sensitivity which can be preserved from a matrix from the PAM and BLOSUM families? This work does not attempt to answer this question. Some key observations can also be pointed out in relation to the design of the different algorithms:

• Using matrices from the PAM family, BoostMap and the two evolutionary algorithms are among the best performers. Recall that these three algorithms are based on minimizing the number of triangular misclassifications. This suggests that triangular misclassifications are suitable measures of cost when constructing metric substitution matrices from spatial embeddings.

• The idea of using difference in rank between distance and similarity values, as non-metric multidimensional scaling does, seems not to work as well as using triangular misclassifications.

• Even though FastMap requires input matrices to satisfy the triangle inequality, it produces metric embeddings from which well-performing matrices can be derived.

• The difference in performance between the single-population genetic algorithm and the coevolutionary one is only marginal. The most likely explanation for this is the large difference in search space sizes between the algorithms. Since the coevolutionary algorithm must search in a space which is exponentially larger than that of the single-population algorithm, finding a near-optimal solution is obviously a more difficult problem.

These observations suggest that the evolutionary algorithms proposed in this work are appropriate for matrices of modest size, like amino acid substitution matrices. As the size of input matrices increases, the computational resources demanded by the algorithms will reach levels which are not feasible. In such cases, BoostMap or FastMap are better choices. Note, however, that the results from the PAM and BLOSUM families of substitution matrices show that the appropriateness of the algorithms is problem dependent. It is therefore difficult to give any advice as to which one should be chosen in the general case.

Finally, it should be emphasized that the algorithms proposed in this work are able to produce a metric distance matrix from PAM70 which preserves the sensitivity of the original matrix. Furthermore, the proposed genetic algorithm produced a metric matrix from PAM250 which performs better than the mPAM250 matrix presented recently.

Chapter 6

Conclusions and future work

The work presented in this report has investigated different ways of converting a non-metric matrix into a metric distance matrix. In particular, two main methods of achieving this goal have been considered and evaluated. The first method enforces metricity explicitly by manipulating data in the matrix directly. The other one is based on the general idea of embedding the data in the substitution matrix into a metric space, from which a metric distance matrix is thereafter recovered. Metricity is ensured implicitly in this latter approach. Two evolutionary algorithms have been proposed in the work for performing the actual mapping of data into metric space. The algorithms are based on comparing proximities within triples of embedded elements with corresponding proximities in the non-metric matrix. They have been evaluated together with three existing embedding algorithms, two of which have been modified in this work for handling substitution matrices in the form of similarity matrices.

The algorithms and methods presented and compared are general in the sense that they can be applied to any substitution matrix, without regard to how the data in the matrix has been derived.

The methods have been tested on amino acid substitution matrices. These empirical tests show that the embedding approach produces the best performing metric matrices when dealing with input matrices from the PAM family. Furthermore, the proposed evolutionary algorithms outperform the other embedding algorithms by a slight amount. Very promising search results were obtained with the CCA-PAM70 and GA-PAM250 matrices. The latter one is able to outperform the mPAM250 matrix recently presented in a similar work. Results obtained from matrices derived from the BLOSUM family are less clear; there seems to be no immediate advantage in choosing an embedding algorithm instead of the triangle fixing algorithm.

This enables us to conclude that the suitability of the suggested methods is dependent on the inherent nature of the data in the original matrix. For the specific case of PAM substitution matrices, it seems to be suitable to apply the strategy of deriving a distance matrix from a metric space where an attempt is made to preserve triangular proximities. As the derived distance matrix represents a metric distance measure between proteins, it can be utilized to store protein sequences in efficient index structures. Although the entire sensitivity of the original matrix is not preserved, this could prove to be useful in cases where approximate search results are acceptable.

Future work should focus on investigating alternative ways of measuring the quality of spatial embeddings. It is quite possible that there may exist other schemes which are more effective than minimization of triangular misclassifications. Furthermore, the evolutionary embedding scheme proposed in this work should be investigated further; better ways of representing the embedding, recombining embeddings and evaluating embeddings could exist.

Appendix A

Derivation of PAM and BLOSUM matrices

This appendix explains how the PAM and BLOSUM families of matrices are derived from empirically collected data. A brief comparison of the two models is also given. Finally, a brief survey of alternative amino acid substitution matrix models rounds off the appendix.

A.1 PAM matrix derivation

The PAM series of mutation probability matrices were introduced by Dayhoff et al. in [DSO78]. PAM is an acronym for Point Accepted Mutation, which is a mutation (a substitution of amino acids in a protein) that is accepted by evolution in the sense that it has not been rejected by natural selection. PAM matrices are associated with a number which gives the number of mutations per 100 amino acids. For example, the PAM1 matrix contains the probabilities of mutation on a time interval of 1 mutation per 100 amino acids - that is, we expect 1% of the amino acids to undergo mutation. Similarly, the probabilities in the PAM250 matrix apply to time intervals of 250 mutations per 100 amino acids.

Random and independent mutations are assumed. Thus, the mutation probability of a specific amino acid depends only on the amino acid itself, implying a Markovian model of evolution. The PAM matrices are therefore Markov chains. All of the PAM matrices can be derived from the PAM1 matrix. To build this matrix, one needs a collection of phylogenetic trees representing the accepted mutations in families of sequences (i.e. proteins). A phylogenetic tree shows the evolutionary relationships among species or other biological entities which are assumed to have a common ancestor. Given a node in such a tree, the parent node represents the most recent common ancestor. Edge lengths, if given, correspond to some kind of time estimate.


Immediate mutations, like i → j, are assumed (not intermediate ones like i → k → j). To avoid a large degree of ambiguity, Dayhoff et al. therefore restricted the collection to sequence families with more than 85% identity. Using this collection, the number of changes from amino acid i to amino acid j, A_ij, is counted. The total number of changes into amino acid j can then easily be calculated as C_j = \sum_{i=1}^{20} A_{ij}. Using T_j to represent the number of appearances of j in the trees, the relative mutability of j is defined as shown in equation A.1.

m_j = \frac{C_j}{T_j} \qquad (A.1)

The relative mutability of amino acid j is proportional to the probability of j changing into another amino acid, P(j changes) = c m_j, where c is a constant which depends on evolutionary time. Remembering that P(A, B) = P(B|A)P(A), the probability of j changing into i can be found as shown in equation A.3. Equation A.2 gives the probability of j changing into i given that j was observed to change, which can be found using the amino acid change counts.

P(j \text{ changes to } i \mid j \text{ changes}) = \frac{A_{ij}}{C_j} = \frac{A_{ij}}{\sum_{l=1}^{20} A_{lj}} \qquad (A.2)

P_{ij} = P(j \text{ changes to } i) = c\, m_j \frac{A_{ij}}{C_j} = \frac{c\, m_j A_{ij}}{\sum_{l=1}^{20} A_{lj}} \qquad (A.3)

The 400 P_ij values constitute the PAM1 probability matrix. Diagonal elements in this matrix can be seen as the probability that the amino acids do not change over the evolutionary distance of interest. They are calculated as shown in equation A.4.

P(j \text{ does not change}) = P_{jj} = 1 - c\, m_j \qquad (A.4)

The value of c is chosen so that the average number of changes per 100 amino acids is 1. Given the relative frequency f_j of the amino acid j in the data set, it can be calculated as shown in equation A.5.

\frac{1}{100} = c \sum_{j=1}^{20} f_j m_j \qquad (A.5)

It turns out that all PAM matrices can be derived directly from PAM1 by simply multiplying the PAM1 matrix by itself the desired number of times. For example, the PAM250 probability matrix can be calculated by multiplying PAM1 by itself 250 times.
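A compact sketch of equations A.1-A.5 and of this matrix powering step is given below. The input arrays A, T and f are hypothetical placeholders for the counts and frequencies described above, and details of the actual Dayhoff procedure (such as how the counts are extracted from the trees) are glossed over.

```python
import numpy as np

def pam1_from_counts(A, T, f):
    """Sketch of equations A.1-A.5. A[i, j] holds the observed change
    counts between amino acids i and j, T[j] the number of occurrences of
    amino acid j in the trees, and f[j] its relative frequency."""
    A = np.asarray(A, float)
    T = np.asarray(T, float)
    f = np.asarray(f, float)
    C = A.sum(axis=0)                    # C_j = sum_i A_ij
    m = C / T                            # relative mutability, (A.1)
    c = 0.01 / np.sum(f * m)             # from 1/100 = c * sum_j f_j m_j, (A.5)
    P = c * m[None, :] * A / C[None, :]  # P_ij = c m_j A_ij / C_j, (A.3)
    np.fill_diagonal(P, 1.0 - c * m)     # P_jj = 1 - c m_j, (A.4)
    return P

def pam_k(P1, k):
    """PAM-k probability matrix as the k-th power of PAM1 (e.g. k = 250)."""
    return np.linalg.matrix_power(P1, k)
```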

For two amino acids i and j paired in an alignment, P_ij gives the probability of i having changed into j due to mutation. But there is a problem with using the

mutation probabilities directly as alignment scores: We are not sure whether the pair ij was actually a mutation. We cannot simply assume that all pairs in an alignment have a common evolutionary ancestor. Instead, given a pair ij, we are interested in knowing if j replaces i more frequently than it appears by chance. The probability that an amino acid j appears by chance in a sequence can be estimated as the relative frequency f_j of j in the data set. The ratio P_ij / f_j is called the likelihood or odds. Given a mutation probability matrix P, the scoring matrix S can finally be calculated as shown in equation A.6.

Figure A.1: PAM250 score matrix

S_{ij} = 10 \log_{10} \frac{P_{ij}}{f_j} \qquad (A.6)

Thus, the scoring matrix distinguishes real mutations from chance alignments. When computing the score of an alignment, the relatedness odds of all pairs need to be multiplied to give the final probability. Since alignment algorithms like the Smith-Waterman algorithm introduced in section 2.2.3 require the scores to be additive, the logarithm of the odds is taken in equation A.6. The quantity S_ij is often referred to as the similarity between amino acids i and j. An example of a commonly used matrix from the PAM family, the PAM250 score matrix, is shown in figure A.1.
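Continuing the hypothetical sketch above, equation A.6 can be applied directly to a mutation probability matrix. Rounding to integer scores is shown as one possible convention; the exact scaling and rounding used for the published matrices is not reproduced here.

```python
import numpy as np

def pam_score_matrix(P, f):
    """Log-odds scores S_ij = 10 * log10(P_ij / f_j) from a mutation
    probability matrix P and background frequencies f (equation A.6),
    rounded to the nearest integer."""
    P = np.asarray(P, float)
    f = np.asarray(f, float)
    return np.rint(10.0 * np.log10(P / f[None, :])).astype(int)

# Hypothetical usage, continuing the PAM sketch above:
# S250 = pam_score_matrix(pam_k(pam1_from_counts(A, T, f), 250), f)
```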

A.2 BLOSUM matrix derivation

The BLOSUM (BLOcks SUbstitution Matrices) family of substitution matrices was introduced in [HH92]. The work was driven by the need to model protein sequences of a lesser degree of divergence than the PAM family (recall from the previous subsection that these matrices are derived from proteins which are at least 85% similar). The matrices are derived from 2000 BLOCKS coming from over 500 protein families. BLOCKS are sequence regions in proteins which do

not contain any insertions or deletions. A BLOCK can be visualized by stacking the sequences on top of each other. An example is shown in figure A.2, which has been adapted from [Car05]. Given a BLOCK, the first step is to count all possible pairs of amino acids in each column. For a column with n amino acids, the number of such pairs for this column is n(n−1)/2. The frequency of the pair ij is denoted as f_ij. Note that f_ij = f_ji since the pairs are undirected. For example, the column containing amino acids RKRE in figure A.2 contains one RR pair, two KR pairs, two ER pairs and one EK pair. Using these frequencies, the probability of occurrence can be calculated for each pair ij as shown in equation A.7.

Figure A.2: Example of a BLOCK of four proteins

q_{ij} = \frac{f_{ij}}{\sum_{i=1}^{20} \sum_{j=1}^{i} f_{ij}} \qquad (A.7)

The probability of amino acid i occurring in any pair can then be calculated as shown in equation A.8.

p_i = q_{ii} + \sum_{j \ne i} \frac{q_{ij}}{2} = q_{ii} + \sum_{j > i} q_{ij} \qquad (A.8)

Assuming that the amino acids occur independently of each other in pairs, the expected probability of occurrence for each amino acid pair is calculated as shown in A.9.

e_{ij} = \begin{cases} p_i p_j & i = j \\ 2 p_i p_j & i \ne j \end{cases} \qquad (A.9)

As with the PAM matrices, we are interested in the odds or likelihood of each amino acid pair. For a pair ij, this is the ratio between the probability that the transition i ↔ j is a mutation and the probability that the pair occurs by random chance. The former is given by q_ij and the latter is given by e_ij, so the odds for each pair are found as q_ij / e_ij. To make the scoring scheme additive, the logarithm of the odds is taken, so the final BLOSUM scoring matrix is found using equation A.10.

S_{ij} = 2 \log_2 \frac{q_{ij}}{e_{ij}} \qquad (A.10)

Figure A.3: BLOSUM62 score matrix

Each BLOSUM matrix is associated with a number indicating the threshold of identity used when deriving the matrix. For example, the BLOSUM62 matrix was derived using sequences (BLOCKS) more than 62% identical. BLOCKS of less similarity were not considered. This matrix is included as an example of a matrix from the BLOSUM family in figure A.3.
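The BLOCK-based derivation above can be summarized in a short sketch. It is illustrative only: the amino acid ordering, the absence of the sequence clustering and weighting used in the actual BLOSUM construction, and the lack of pseudocounts for unobserved pairs are all simplifying assumptions.

```python
import itertools
import numpy as np

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"   # ordering chosen arbitrarily here
IDX = {a: i for i, a in enumerate(AMINO_ACIDS)}

def pair_counts(block):
    """Count unordered amino acid pairs column by column in a BLOCK, given
    as a list of equally long sequences (cf. the RKRE column example)."""
    F = np.zeros((20, 20))
    for column in zip(*block):
        for a, b in itertools.combinations(column, 2):
            i, j = IDX[a], IDX[b]
            F[i, j] += 1
            if i != j:
                F[j, i] += 1           # keep the count matrix symmetric
    return F

def blosum_scores(F):
    """Equations A.7-A.10 applied to a symmetric pair count matrix F.
    Zero counts would need pseudocounts in practice."""
    q = F / np.triu(F).sum()                              # (A.7)
    p = np.diag(q) + 0.5 * (q.sum(axis=1) - np.diag(q))   # (A.8)
    e = 2.0 * np.outer(p, p)                              # (A.9), i != j
    np.fill_diagonal(e, p * p)                            # (A.9), i == j
    return np.rint(2.0 * np.log2(q / e)).astype(int)      # (A.10)
```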

A.3 A brief comparison of PAM and BLOSUM

The first difference to notice between the PAM and BLOSUM models is that PAM uses an explicit model of evolution, whereas BLOSUM uses an implicit model. Recall from the presentation of PAM that the PAM1 matrix is constructed from phylogenetic trees which have been assembled by human domain experts. In this model, common evolutionary origin is specified explicitly. In contrast, evolutionary origin in BLOSUM matrices is implicit in the amino acid counts derived directly from highly conserved blocks - there is no explicit specification of common evolutionary ancestry.

Another key difference between the two models lies in the nature of the sequences used in the derivation process. The PAM model includes both highly conserved and highly mutated sequence regions, because the observed mutations are based on global rather than local alignments. In contrast, recall from the previous section that BLOSUM uses conserved regions in alignments where insertions or deletions do not occur (BLOCKS).

Empirical tests have shown that both the PAM and BLOSUM series of matrices perform well when used as scoring matrices in database searches [HH93]. However, as a local alignment scheme is often used in such searches, BLOSUM matrices are often found to perform slightly better. A general “rule of thumb” is to use BLOSUM matrices when searching for conserved domains in proteins regardless of evolutionary distance, and to use PAM matrices when searching for evolutionary origin.

A.4 Other amino acid substitution matrices

A number of alternative amino acid substitution matrices exist, differing in derivation method and dataset used. This subsection describes some of these briefly.

An approach different from the PAM and BLOSUM models is taken by Gonnet et al. in [GCB92]. In their work, exhaustive pairwise alignments of proteins in the entire database as it existed at that time are calculated. The results of these alignments are then used to estimate a distance matrix. The authors suggest that their matrix should be used instead of the PAM250 matrix. Refinements on the search results can thereafter be done using an appropriate matrix from the PAM family.

Yet another approach is taken by Risler et al. in [RDDH88] and Overington et al. in [ODJ+92], where proteins are aligned based on three-dimensional structure rather than sequence. Therefore, only proteins for which a spatial structure has been determined are used. Substitution statistics are gathered from the resulting alignments. In theory, the substitution matrices generated using this method should perform better than both PAM, BLOSUM and Gonnet matrices. However, the reliability of the substitution statistics is questionable since a limited number of three-dimensional protein structures are available, so this has not proved to be the case in practice.

A problem with the PAM series of matrices derived in 1978 is that many of the possible amino acid substitutions were not observed in the dataset available. Jones et al. have constructed an updated version of the PAM250 mutation matrix, called PET91, in [JTT92]. This substitution matrix is based on 2621 families of sequences from the Swiss-Prot database1, and covers many of the substitutions which were poorly represented in the original dataset.

1 http://www.ebi.ac.uk/swissprot/

Appendix B

CCA-PAM70 matrix

Figure B.1: Derived CCA-PAM70 matrix, scaled by 0.05


Appendix C

GA-PAM250 matrix

Figure C.1: Derived GA-PAM250 matrix, scaled by 0.1


Bibliography

[AASK04a] V. Athitsos, J. Alon, S. Sclaroff, and G. Kollios. BoostMap: A method for efficient approximate similarity rankings. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2004.

[AASK04b] V. Athitsos, J. Alon, S. Sclaroff, and G. Kollios. Learning euclidean embeddings for indexing and classification. Technical Report 2004-014, Boston University, April 2004.

[AGM+90] S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman. Basic local alignment search tool. J. Mol. Biol., 215(3):403–410, 1990.

[BBM93a] D. Beasley, D. R. Bull, and R. R. Martin. An overview of genetic algorithms: Part 1, fundamentals. University Computing, 15(2):58–69, 1993.

[BBM93b] D. Beasley, D. R. Bull, and R. R. Martin. An overview of genetic algorithms: Part 2, research topics. University Computing, 15(4):170–181, 1993.

[BFM+96] J. E. Barros, J. French, W. Martin, P. M. Kelly, and T. M. Cannon. Using the triangle inequality to reduce the number of comparisons required for similarity-based retrieval. In SPIE Vol. 2670 Storage and Retrieval for Still Image and Video Databases IV, pages 392–403. 1996.

[BK73] W. A. Burkhard and R. M. Keller. Some approaches to best-match file searching. Commun. ACM, 16(4):230–236, 1973.

[BO97] T. Bozkaya and M. Ozsoyoglu. Distance-based indexing for high-dimensional metric spaces. In Proc. ACM SIGMOD International Conference on Management of Data, pages 357–368, 1997.

[BY97] R. Baeza-Yates. Searching: An algorithmic tour, volume 37, pages 331–359. Marcel Dekker Inc., 1997.


[BYCMW94] R. Baeza-Yates, G. Cunto, U. Manber, and S. Wu. Proximity matching using fixed-queries trees. In Proc. 5th Combinatorial Pattern Matching (CPM’94), pages 192–212, 1994.

[Car05] M. Caron. The bioinformatics knowledge base: BLOCKS, 2005. http://apps.bioneq.qc.ca/twiki/bin/view/Knowledgebase/BLOCKS. Cited May 15th 2005.

[Chi94] T. Chiueh. Content-based image indexing. In Proc. 20th Conference on Very Large Databases (VLDB’94), pages 582–593, 1994.

[CNBYM99] E. Chávez, G. Navarro, R. Baeza-Yates, and J. L. Marroquín. Searching in metric spaces. Technical Report TR/DCC-99-3, University of Chile, Department of Computer Science, 1999.

[DSO78] M. O. Dayhoff, R. M. Schwartz, and B. C. Orcutt. A model of evolutionary change in proteins. In Atlas of protein sequence and structure, supplement 3, pages 345–352. National Biomedical Research Foundation, Washington DC, 1978.

[DST04] I. S. Dhillon, S. Sra, and J. A. Tropp. Triangle fixing algorithms for the metric nearness problem. Technical Report TR-04-22, University of Texas, Austin, Department of Computer Sciences, September 2004.

[Dyk83] R. L. Dykstra. An algorithm for restricted least squares regression. Journal of the American Statistical Association, 78(384):837–842, 1983.

[FL95] C. Faloutsos and K. I. Lin. FastMap: A fast algorithm for indexing, data-mining and visualization of traditional and multimedia datasets. In Proceedings of the 1995 ACM SIGMOD International Conference on Management of Data, pages 163–174, San Jose, California, 22–25 1995.

[FS97] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997.

[GB99] W. N. Grundy and T. L. Bailey. Family pairwise search with embedded motif models. Bioinformatics, 15(6):463–470, 1999.

[GCB92] G. H. Gonnet, M. A. Cohen, and S. A. Benner. Exhaustive matching of the entire protein sequence database. Science, 256(5062):1443–1445, June 1992.

[GCS70] P. E. Green, F. J. Carmone, and S. M. Smith. Multidimensional scaling: concepts and applications. Allyn and Bacon, Boston, 1970.

[GD91] M. Gribskov and J. Devereux. Sequence Analysis Primer, chapter 3. Stockton Press, New York, 1991.

[GKM03] P. Gupta, A. B. Kahng, and S. Mantik. Routing-aware scan chain ordering. In Proceedings of the Asia and South Pacific Design Automation Conference, pages 857–862, January 2003.

[Got82] O. Gotoh. An improved algorithm for matching biological sequences. J. Mol. Biol., 162:705–708, 1982.

[GR96] M. Gribskov and M. L. Robinson. Use of receiver operating characteristic (ROC) analysis to evaluate sequence matching. Computers & Chemistry, 20(1):25–33, 1996.

[HBK+03] E. Halperin, J. Buhler, R. Karp, R. Krauthgamer, and B. Westover. Detecting protein sequence conservation via metric embeddings. Bioinformatics, 19(Suppl. 1):i122–i129, 2003.

[HH92] S. Henikoff and J. G. Henikoff. Amino acid substitution matrices from protein blocks. In Proc. Nat. Acad. Sci USA, pages 10915–10919, November 1992.

[HH93] S. Henikoff and J. G. Henikoff. Performance evaluation of amino acid substitution matrices. Proteins, 17(1):49–61, 1993.

[HHLB04] Y. Hou, W. Hsu, M. L. Lee, and C. Bystroff. Remote homology detection using local sequence-structure correlations. Proteins, 57(3):518–530, 2004.

[HS03] G. R. Hjaltason and H. Samet. Properties of embedding methods for similarity searching in metric spaces. IEEE Trans. Pattern Anal. Mach. Intell., 25(5):530–549, 2003.

[JTT92] D. T. Jones, W. R. Taylor, and J. M. Thornton. The rapid generation of mutation data matrices from protein sequences. Comput. Appl. Biosci., 8:275–282, 1992.

[LLTY97] M. Linial, N. Linial, N. Tishby, and G. Yona. Global self organization of all known protein sequences reveals inherent biological signatures. J. Mol. Biol., 268(2):539–556, 1997.

[LP85] D. J. Lipman and W. R. Pearson. Rapid and sensitive protein similarity searches. Science, 227(4693):1435–1441, 1985.

[NW70] S. B. Needleman and C. D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol., 48(3):443–453, March 1970.

[ODJ+92] J. Overington, D. Donnelly, M. S. Johnson, A. Sali, and T. L. Blundell. Environment-specific amino acid substitution tables: Tertiary templates and prediction of protein folds. Protein Sci., 1(2):216–226, Feb 1992.

[Par98] J. Paredis. Coevolutionary algorithms. In T. Bäck, D. Fogel, and Z. Michalewicz, editors, The Handbook of Evolutionary Computation, 1st supplement. Oxford University Press, 1998.

[RDDH88] J. L. Risler, M. O. Delorme, H. Delacroix, and A. Henaut. Amino acid substitutions in structurally related proteins. A pattern recognition approach. J. Mol. Biol., 204:1019–1029, 1988.

[SAM+01] A. A. Schäffer, L. Aravind, T. L. Madden, S. Shavirin, J. L. Spouge, Y. I. Wolf, E. V. Koonin, and S. F. Altschul. Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements. Nucleic Acids Research, 29(14):2994–3005, 2001.

[SS99] R. E. Schapire and Y. Singer. Improved boosting algorithms using confidence-rated predictions. Machine Learning, 37(3):297–336, 1999.

[SW81] T. F. Smith and M. S. Waterman. Identification of common molecular subsequences. J. Mol. Biol., 147:195–197, 1981.

[TJ93] W. R. Taylor and D. T. Jones. Deriving an amino acid distance matrix. J. Theor. Biol., 164:65–83, 1993.

[TO04] Y. H. Taguchi and Y. Oono. Novel non-metric MDS algorithm with confidence level test, 2004. Published at http://www.granular.com/MDS/src/paper.pdf. Cited March 3rd 2005.

[TO05] Y. H. Taguchi and Y. Oono. Relational patterns of gene expression via non-metric multidimensional scaling analysis. Bioinformatics, 21(6):730–740, 2005.

[Wat95] M. S. Waterman. Introduction to Computational Biology: Maps, Sequences and Genomes. CRC Press, 1st edition, 1995.

[WLDJ01] R. P. Wiegand, W. C. Liles, and K. A. De Jong. An empirical analysis of collaboration methods in cooperative coevolutionary algorithms. In Proceedings of the Genetic and Evolutionary Computation Conference (GECCO), pages 1235–1245, 2001.

[WMMY02] R. E. Walpole, R. H. Myers, S. L. Myers, and K. Ye. Probability and Statistics for Engineers and Scientists, Seventh edition, chapter 16, pages 601–605. Prentice Hall, Inc., Upper Saddle River, New Jersey, 2002.

[XM04] W. Xu and D.P. Miranker. A metric model of amino acid substitu- tion. Bioinformatics, 20(8):1214–1221, 2004.

[XNJR03] E. Xing, A. Ng, M. Jordan, and S. Russell. Distance metric learning, with application to clustering with side-information. In S. Becker, S. Thrun, and K. Obermayer, editors, Advances in neural information processing systems 15, pages 505–512. MIT Press, Cambridge, MA, USA, 2003.

[Yia93] P. Yianilos. Data structures and algorithms for nearest neighbour search in general metric spaces. In Proc. 4th ACM-SIAM Symposium on Discrete Algorithms (SODA’93), pages 311–321, 1993.

[YJF98] B.-K. Yi, H. V. Jagadish, and C. Faloutsos. Efficient retrieval of similar time sequences under time warping. In Proceedings of the 14th International Conference on Data Engineering (ICDE’98), pages 201–208, 1998.