Metrics in Similarity Search

MASARYK UNIVERSITY
FACULTY OF INFORMATICS

Metrics in Similarity Search

BACHELOR THESIS

Jiří Jelínek

Brno, Spring 2007

Declaration

I declare that this thesis is my original work and that I wrote it by myself. All sources and literature that I used or was inspired by are properly cited in the thesis, with a complete reference to the respective source.

Supervisor: RNDr. David Novák

Acknowledgement

I would like to thank my thesis supervisor, RNDr. David Novák, for his active support and patience.

Abstract

Metric functions are key to indexing databases based on the similarity of objects. This thesis tries to provide a survey of the metrics and similarity measures currently used in different areas of human activity. To serve as a reference point for future daily use, it provides many direct internet sources for materials concerning metrics and similarity. The thesis also includes an implementation of the Smith-Waterman algorithm for the MESSIF platform developed at the Faculty of Informatics, Masaryk University.

Keywords

similarity, distance, measure, metric, similarity search

Contents

1 Introduction
2 Similarity Searching and Metrics
  2.1 Similarity and Dissimilarity in Similarity Search
  2.2 Metric space and Metric definition
3 Sequences Similarity
  3.1 Sequences Comparison Introduction
  3.2 Sequence Aligning Algorithms
    3.2.1 Sequence Aligning Introduction
    3.2.2 Levenshtein Distance (Edit Distance)
    3.2.3 Weighted Edit Distance
    3.2.4 Sequences Compared in Biology
    3.2.5 Scoring Matrices
    3.2.6 Needleman-Wunsch Algorithm
    3.2.7 Smith-Waterman Algorithm
  3.3 Optimizations and Heuristics
  3.4 Using Smith-Waterman Algorithm to Define a Metric
4 Text Similarity
  4.1 Text Similarity Introduction
  4.2 Word Frequency Vector
    4.2.1 Euclidean Distance
    4.2.2 Cosine Measure
    4.2.3 Augmented Euclidean Distance
  4.3 Corpus-based Measures
    4.3.1 Pointwise Mutual Information
    4.3.2 Latent Semantic Analysis
  4.4 Knowledge-based Measures
    4.4.1 Hirst St-Onge
    4.4.2 Leacock-Chodorow
    4.4.3 Resnik
    4.4.4 Jiang-Conrath
5 3D-object Similarity
  5.1 Introduction
  5.2 Jaccard Coefficient and Modifications
6 Image Similarity
  6.1 Image Similarity Introduction
  6.2 Metrics in CBIR
    6.2.1 Manhattan Distance
    6.2.2 Euclidean Distance
    6.2.3 Chebyshev Distance
    6.2.4 Quadratic Form Distance
    6.2.5 Histogram Intersection Distance
    6.2.6 Mahalanobis Distance
    6.2.7 Canberra Distance
    6.2.8 Bray-Curtis Distance
    6.2.9 Squared Chord Distance
    6.2.10 Square Chi-Squared Distance
    6.2.11 Earth Mover's Distance
7 Implementation of Smith-Waterman algorithm
8 Conclusion
9 Bibliography

Chapter 1 Introduction

In recent years, the rapid growth of data volumes, whether text, images or any other data, has created a need for effective storage and indexing, because data need not only to be stored but also to be found when they are looked for. Therefore, some means of effective database searching need to be found. However, the greatest challenge is that we usually do not know exactly what we are looking for. The most natural queries (requests for data) of a typical user are like "documents dealing with similar issues as this one here" or "images similar to this one I have here", rather than "I want all documents from shelf A beginning with the letter 'C' written in the year 2002". Also, with the amounts of data available today, specifying an exact query becomes nearly impossible. Therefore, techniques that support searches based on the similarity of objects have been developed.
Similarity measures were introduced: functions computing a "score" that tells us how similar two objects are. Conversely, dissimilarity measures (or distance functions) were developed to determine how dissimilar (or "distant") the objects are.

Many different measures and techniques have been developed in many areas, but if these methods are to be used for searching, one should not have to search for them first whenever they are needed. Therefore, a reasonable summary of the techniques and measures that have been developed, and are actually used, becomes a vital part of work in similarity searching.

Thus the first aim of this thesis is to summarize the techniques, and most importantly the measures, used in so-called similarity searching in different areas of human activity, and to provide a comprehensive survey for easy reference to works concerning similarity searching in the most exploited areas: text and image retrieval, and similarity of sequences in biology. To provide a practical source of information, many references to web addresses and other easily accessible sources are included.

This thesis puts most emphasis on metrics (distance functions satisfying certain criteria), which are of most use for indexing data. However, since similarity measures and metrics are closely intertwined, the thesis also mentions similarity measures where they are widely used in practice.

The secondary aim of this thesis is to implement a metric, either used in practice or currently suggested for use, in MESSIF¹, which is developed and used at the Faculty of Informatics, Masaryk University. The Smith-Waterman algorithm with the standard scoring matrix PAM250 was chosen, as biological sequence similarity was not yet included in the system.

Chapter 2 provides a brief summary of terms used in similarity searching and definitions of key terms, such as metric and metric space. Chapter 3 deals with similarity of sequences.
After a general introduction to sequences and their similarity, the most commonly used sequence alignment algorithms are presented. The chapter ends with a brief overview of heuristics used in sequence similarity in biology and references to works proposing metrics that can be used in biology. Chapter 4 summarizes the most common approaches to text similarity and gives an overview of metrics and similarity measures used for determining the similarity of texts. Chapter 5 provides a brief overview of 3D-object similarity techniques. Chapter 6 provides a survey of metrics used in image similarity, along with references to recent publications testing the performance of currently used metrics. Chapter 7 details the implementation part of this thesis.

¹ Metric Similarity Search Implementation Framework

Chapter 2 Similarity Searching and Metrics

2.1 Similarity and Dissimilarity in Similarity Search

A usual way of querying in similarity search is query by example. This means that we are given one object, usually called the query object, and the goal is to find objects similar to it. Searching a database can be done by taking one database object after another and comparing each with the query object. This comparison should give us the desired information about the similarity of the currently tested object and the query object; if the similarity is deemed high, the tested object is returned as a result of the search. However, these comparisons can equally well compute the dissimilarity of two objects, in which case the objects with low dissimilarity are returned as the result. In either case, the comparisons are carried out using special functions, so-called measures.

In the first case, when similarity is computed, the functions are usually called similarity measures and their results, i.e. how similar two objects are, are often called the similarity score or simply the similarity.
Similarity measures are inconvenient in that they generally have no upper bound. For example, consider a similarity measure of two words defined as "the number of characters they have in common at the same position". Such a measure gives the words automobile and automaton a similarity score of 5 (autom in common), while giving a score of 9 for the similarity of automaton with automaton (self-similarity) and 10 for the similarity of automobile with automobile. The maximum possible similarity score thus depends only on the length of the word and is therefore unbounded. This means that from the similarity score alone we generally cannot determine whether the objects are identical or only very similar. On the other hand, similarity measures can often be defined fairly easily and arbitrarily compared to distance functions.

In the latter case, when comparisons are based on dissimilarity, the measures are sometimes called dissimilarity measures. By analogy to geometry and vector spaces, however, they are more often called distance functions, and their result is called the distance of the two objects rather than their dissimilarity. Distance functions are often better suited to applications in computer science, especially to indexing in databases. The most convenient case is when the space of all searched objects, together with the distance function, can be formalized as a metric space; for that, the distance function naturally needs to be a metric (both terms are defined in Section 2.2). This solves, for example, the problem of the self-similarity score growing with the size of the object, as the distance of any object to itself is zero (and the distance is zero for identical objects only).
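The word-similarity example from this section, combined with the object-by-object database scan described earlier, can be sketched in a few lines of Python. This is only an illustrative sketch: the function names positional_similarity and scan_by_similarity, the sample word list and the threshold are assumptions made for this example and are not part of the thesis or of MESSIF.

```python
def positional_similarity(a: str, b: str) -> int:
    """Count the positions at which both words hold the same character."""
    # zip stops at the end of the shorter word, so extra characters
    # of the longer word never contribute to the score.
    return sum(1 for x, y in zip(a, b) if x == y)


def scan_by_similarity(database, query, min_score):
    """Sequential scan: compare every database object with the query
    object and keep those whose similarity reaches min_score."""
    return [obj for obj in database
            if positional_similarity(obj, query) >= min_score]


# Hypothetical sample data, chosen only for illustration.
words = ["automobile", "automaton", "autumn", "metric"]

print(positional_similarity("automobile", "automaton"))    # 5
print(positional_similarity("automaton", "automaton"))     # 9
print(positional_similarity("automobile", "automobile"))   # 10
print(scan_by_similarity(words, "automaton", 5))
# ['automobile', 'automaton']
```

The two self-similarity scores differ (9 versus 10), which illustrates why a score alone does not reveal whether two objects are identical.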
2.2 Metric space and Metric definition

A metric space $\mathcal{M}$ is a pair $\mathcal{M} = (V, d)$, where $V$ is the domain of objects and $d$ is a total distance function $d : V \times V \to \mathbb{R}$ satisfying the following conditions for all objects $x, y, z \in V$:

$d(x, y) \geq 0$ (non-negativity),
$d(x, y) = 0$ iff $x = y$ (identity),
$d(x, y) = d(y, x)$ (symmetry),
$d(x, z) \leq d(x, y) + d(y, z)$ (triangle inequality).
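The four conditions can be tested mechanically on a finite sample of objects. The following Python sketch does so; the helper name check_metric_axioms, the tolerance parameter and the two example functions are assumptions introduced here for illustration only. Note that such a check can only refute the metric property on the given sample; it cannot prove that a function is a metric on the whole domain.

```python
from itertools import product


def check_metric_axioms(d, sample, tol=1e-12):
    """Return the names of the metric conditions that d violates on sample."""
    violated = set()
    for x, y in product(sample, repeat=2):
        if d(x, y) < 0:
            violated.add("non-negativity")
        if (d(x, y) == 0) != (x == y):
            violated.add("identity")
        if abs(d(x, y) - d(y, x)) > tol:
            violated.add("symmetry")
    for x, y, z in product(sample, repeat=3):
        if d(x, z) > d(x, y) + d(y, z) + tol:
            violated.add("triangle inequality")
    return violated


def abs_diff(x, y):
    """Absolute difference of two numbers, a metric on the real line."""
    return abs(x - y)


def squared_diff(x, y):
    """Squared difference: symmetric and non-negative, but not a metric."""
    return (x - y) ** 2


points = [0, 1, 2, 3.5]
print(check_metric_axioms(abs_diff, points))      # set()
print(check_metric_axioms(squared_diff, points))  # {'triangle inequality'}
```

The squared difference fails because, for example, the distance from 0 to 2 is 4, while the detour through 1 costs only 1 + 1 = 2.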
