MASARYK UNIVERSITY FACULTY OF INFORMATICS


Metrics in Similarity Search

BACHELOR THESIS Jiří Jelínek

Brno, Spring 2007

Declaration

I declare that this thesis is my original work and that I wrote it by myself. All sources and literature I used or was inspired by are cited appropriately in the thesis, with a complete reference to the respective source.

Supervisor: RNDr. David Novák

Acknowledgement

I would like to thank my thesis supervisor, RNDr. David Novák, for his active support and patience.

Abstract

Metric functions are key to indexing databases based on the similarity of objects. This thesis tries to provide a survey of currently used metrics and similarity measures in different areas of human activity. To provide a reference point for future daily use, many direct Internet links to materials concerning metrics and similarity are included. A part of this thesis is also the implementation of the Smith-Waterman algorithm into the MESSIF platform developed at the Faculty of Informatics, Masaryk University.

Keywords: similarity, measure, metric, similarity search

Contents

1 Introduction
2 Similarity Searching and Metrics
   2.1 Similarity and Dissimilarity in Similarity Search
   2.2 Metric space and Metric definition
3 Sequences Similarity
   3.1 Sequences Comparison Introduction
   3.2 Sequence Aligning Algorithms
      3.2.1 Sequence Aligning Introduction
      3.2.2 Levenshtein Distance (Edit Distance)
      3.2.3 Weighted Edit Distance
      3.2.4 Sequences Compared in Biology
      3.2.5 Scoring Matrices
      3.2.6 Needleman-Wunsch Algorithm
      3.2.7 Smith-Waterman Algorithm
   3.3 Optimizations and Heuristics
   3.4 Using Smith-Waterman Algorithm to Define a Metric
4 Text Similarity
   4.1 Text Similarity Introduction
   4.2 Word Frequency Vector
      4.2.1 Euclidean Distance
      4.2.2 Cosine Measure
      4.2.3 Augmented Euclidean Distance
   4.3 Corpus-based Measures
      4.3.1 Pointwise Mutual Information
      4.3.2 Latent Semantic Analysis
   4.4 Knowledge-based Measures
      4.4.1 Hirst St-Onge
      4.4.2 Leacock-Chodorow
      4.4.3 Resnik
      4.4.4 Jiang-Conrath
5 3D-object Similarity
   5.1 Introduction
   5.2 Jaccard Coefficient and Modifications
6 Image Similarity
   6.1 Image Similarity Introduction
   6.2 Metrics in CBIR
      6.2.1 Manhattan Distance
      6.2.2 Euclidean Distance
      6.2.3 Chebyshev Distance
      6.2.4 Quadratic Form Distance
      6.2.5 Histogram Intersection Distance
      6.2.6 Mahalanobis Distance
      6.2.7 Canberra Distance
      6.2.8 Bray-Curtis Distance
      6.2.9 Squared Chord Distance
      6.2.10 Squared Chi-Squared Distance
      6.2.11 Earth Mover's Distance
7 Implementation of a metric based on the Smith-Waterman algorithm
8 Conclusion
9 Bibliography

Chapter 1 Introduction

In recent years the rapid growth of data amounts, whether text, image or any other data, has stirred the need for their effective storing and indexing, because data need not only to be stored, but also to be found when they are looked for. Therefore, some means of effective database searching need to be found. However, the greatest challenge is the fact that we usually do not know exactly what we are looking for. The most natural queries (requests for data) of a typical user are like "documents dealing with similar issues as this one here" or "images similar to this one I have here", rather than "I want all documents from shelf A beginning with the letter 'C' written in the year 2002". Also, with the amounts of data available today, specifying an exact query becomes nearly impossible. Therefore, techniques that support searches based on the similarity of objects have been developed. Similarity measures were introduced: functions computing a "score" that tells us how similar two objects are. Inversely, dissimilarity measures (or distance functions) were developed to determine how dissimilar (or "distant") the objects are. Many different measures and techniques have been developed in many areas, but if these methods are to be put to use, one cannot start searching for them only at the moment they are needed. Therefore, a reasonable summary of the techniques and measures that have been developed, and are actually used, becomes a vital part of work in similarity searching.

Thus the first aim of this thesis is to summarize the techniques and, most importantly, the measures used in so-called similarity searching in different areas of human activity, and to provide a comprehensive survey for easy reference to works concerning similarity searching in the most exploited areas: text and image retrieval, and similarity of sequences in biology. To provide the most practical source of information, many references to web addresses and other easily accessible sources are included.

This thesis puts most emphasis on metrics (distance functions satisfying certain criteria), which are of most use in terms of indexing data. However, since similarity measures and metrics are closely intertwined, the thesis mentions similarity measures as well where they are widely used in practice.

The secondary aim of this thesis is to implement a metric, either used in practice or currently suggested for use, into MESSIF¹, developed and used at the Faculty of Informatics, Masaryk University. The Smith-Waterman algorithm using the standard scoring matrix PAM250 was chosen, as biological sequence similarity was not yet included in the system.

Chapter 2 provides a brief summary of the terms used in similarity searching and definitions of key terms, such as metric and metric space. Chapter 3 deals with the similarity of sequences. After a general introduction to sequences and their similarity, the most used sequence aligning algorithms are presented. The chapter ends with a brief overview of heuristics used in sequence similarity in biology and references to works proposing metrics that can be used in biology. Chapter 4 summarizes the most common approaches to text similarity and gives an overview of metrics and similarity measures used for determining the similarity of texts. Chapter 5 provides a brief overview of 3D-object similarity techniques. Chapter 6 provides a survey of metrics used in image similarity, along with references to recent publications testing the performance of currently used metrics. Chapter 7 details the implementation part of this thesis.

1 Metric Similarity Search Implementation Framework

Chapter 2 Similarity Searching and Metrics

2.1 Similarity and Dissimilarity in Similarity Search

A usual way of querying in similarity search is query-by-example. This means that we are given one object, usually called the query object, and the goal is to find objects similar to it. The searching in a database can be done by taking the objects of the database one after another and comparing each of them with the query object. This comparison gives us the desired information about the similarity of the currently tested object and the query object, and if the similarity is deemed high, the tested object is returned as a result of the search. However, these comparisons can equally well compute the dissimilarity of two objects; if that is low, the objects are returned as the result. In either case, the comparisons are conducted using special functions, so-called measures.

In the first case, when similarity is computed, the functions are usually called similarity measures and their results, i.e. how similar two objects are, are often called the similarity score or simply the similarity. Similarity measures are inconvenient in that they generally do not have any upper bound. For example, consider a similarity measure of two words defined as "the number of characters they have in common at the same position". Such a measure would give for the words automobile and automaton the similarity score of 5 (autom in common), while giving scores of 9 for the similarity of automaton with automaton (self-similarity) and 10 for the similarity of automobile with automobile. So the maximum possible similarity score depends only on the length of the word and is thus virtually unbounded. This means that from the similarity score alone we generally cannot determine whether the objects are identical or only very similar¹. However, similarity measures can often be defined fairly easily and arbitrarily, compared to distance functions.

In the latter case of comparisons based on dissimilarity computations, the measures are sometimes called dissimilarity measures. But, as an analogy to geometry and vector spaces, they are more often called distance functions and their results called the distance of the two objects, instead of dissimilarity. Distance functions are often better suited for many applications in computer science, especially for indexing in databases. The most convenient case is when the space of all searched objects along with the distance function can be formalized as a metric space (the definition of a metric space follows in Section 2.2). For that, the distance function naturally needs to be a metric (the definition follows in Section 2.2). This solves, for example, the problem of the growing similarity score for self-similarity (distance to self), as the distance of any object to itself is zero (and is zero for identical objects only).

2.2 Metric space and Metric definition

A metric space M is a pair M = (V, d), where V is the domain of objects and d is a total distance function d : V × V → ℝ satisfying the following conditions for all objects x, y, z ∈ V:

• d(x, y) ≥ 0 (non-negativity),
• d(x, y) = 0 iff x = y (identity),
• d(x, y) = d(y, x) (symmetry),
• d(x, z) ≤ d(x, y) + d(y, z) (triangle inequality).

The distance function d is then called a metric.

1 In the case of our example measure, normalizing the similarity score by the length of the longer word would restrict the similarity scores to the interval [0, 1], where 0 means no similarity and 1 means identical words. Such a normalization or other improvement is generally not found easily, if it can be found at all.
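To make the definition concrete, the following is a minimal sketch, in Java (the language of the MESSIF framework), of a generic metric interface together with a brute-force spot-check of the four axioms on a small sample of objects. The interface and class names are illustrative only and are not part of MESSIF.

interface Metric<T> {
    double distance(T x, T y);
}

public class MetricAxioms {

    // Checks non-negativity, identity, symmetry and the triangle inequality
    // on every pair and triple drawn from the given sample objects.
    static <T> boolean holdsOnSample(Metric<T> d, java.util.List<T> sample) {
        for (T x : sample) {
            for (T y : sample) {
                if (d.distance(x, y) < 0) return false;                   // non-negativity
                if ((d.distance(x, y) == 0) != x.equals(y)) return false; // identity
                if (d.distance(x, y) != d.distance(y, x)) return false;   // symmetry
                for (T z : sample) {                                      // triangle inequality
                    if (d.distance(x, z) > d.distance(x, y) + d.distance(y, z)) return false;
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Metric<Integer> abs = (x, y) -> Math.abs(x - y);  // |x - y| is a metric on the integers
        System.out.println(holdsOnSample(abs, java.util.List.of(0, 3, 7, 10)));  // prints true
    }
}

Such a check can of course only refute the metric properties on the sampled objects; it cannot prove them for the whole domain.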

Chapter 3 Sequences Similarity

3.1 Sequences Comparison Introduction

A sequence is a string s of characters from an alphabet Σ. For example, if the alphabet consists of the characters 0 and 1, Σ = {0, 1}, the possible sequences are the representations of all binary numbers. An empty sequence ε denotes a sequence (string) of zero characters. Another example is the alphabet Σ = {A, T, G, C}, where the possible sequences are all possible¹ DNA codes, including ε, the empty code.

In sequences, a character is generally as important as its position i in the sequence. The sequence 1000 is similar, from the character-count point of view, to 0100, yet they represent dramatically different values and therefore their similarity score should rather be low. Of course, if we were interested in the similarity of the graphical representations of 1000 and 0100, the similarity score would be higher, but even then 0001 would be dissimilar to 1000 and the character count is again only a secondary criterion. Therefore, more complex approaches that take into account both the character and the position it occupies need to be employed in sequence similarity search. The search engines usually use some sort of sequence aligning, see Section 3.2.

Using the sequence similarity approach for the similarity of text strings is very useful for the correction of mistypings. The Levenshtein distance, see Section 3.2.2, then gives us practically the number of keystrokes that went wrong. For example, 'simlarity' is a possible and common mistyping of 'similarity' and the Levenshtein distance of these two terms is one (corresponding to the one insert operation necessary), which tells us the user most probably tried to type the word 'similarity' and not 'simlarity'. However, sequence similarity is not well applicable to text-to-text similarity, as the order of words in a text is not explicitly given and so the number of edit operations necessary for changing one text into the other may be very high even for quite similar (semantically) texts. More in Section 4.

Sequence similarity is most developed in the field of biology for comparing and similarity searching protein and RNA or DNA sequences. Any time a new interesting bit of an RNA/DNA sequence is determined, the most usual question the researcher needs answered first is whether anyone has already seen such a sequence and what it could be related to. In other words, sequences similar to the query sequence need to be found in the database. In biology, the computed similarity score of two sequences usually describes the probability that a mutation of the query sequence into the tested sequence would happen.

1 Possible meaning imaginable, not necessarily actually occurring somewhere.

3.2 Sequence Aligning Algorithms

3.2.1 Sequence Aligning Introduction

To compute the similarity of two sequences we generally need to know how many changes, and how expensive, need to be made to transform one sequence into the other. An alignment of two sequences is a way of arranging the two sequences so as to allow identification of regions of similarity of the sequences. For example, when aligning the sequences TTCC and AATT, their alignments (along with the description of the edit operations necessary in case of the particular alignment) could be:

T T C C
A A T T        and then using four replace operations

- - T T C C
A A T T - -    using two deletions and two insertions

From the intuitive point of view the latter alignment seems better, as a good half of the symbols is exactly matched. However, the cost of removing A from the sequence and adding C into the sequence may be very high, because A may have strong ties to the rest of the sequence while C and T are repulsive to each other. On the other hand, replacing T with A may be quite easy, as they can be similar in nature and thus easily exchangeable². And so the price of the intuitively better solution may be higher than that of the all-replace solution.

If by alignment we understand the mutual position of the sequences before the necessary sequence edit operations are applied, the purpose of sequence aligning algorithms can be described as finding the optimal alignment, where optimal means with the greatest similarity score or the lowest price of the used operations. The task of these algorithms is therefore twofold: finding how to position the tested sequence against the query sequence, and then what operations to perform (inserts, replaces, adding gaps to either the tested sequence or the query sequence³, etc.).

3.2.2 Levenshtein Distance (Edit Distance)

The edit distance, defined by V. I. Levenshtein [1], can be described as the number of basic "edit" operations insert, delete and replace (all considering a single character) necessary for changing one string into the other. More formally, the Levenshtein distance is defined as follows: the distance between two strings x = x_1...x_n and y = y_1...y_m is the minimum number of atomic operations needed to transform the string x into y, where the atomic operations are:

• insert character c into string x at position i,

• delete character at position i from string x,

• replace character at position i in string x with a new character c.

2 Such behaviour of the nucleotides is chosen for the purpose of the example only; it does not necessarily correspond to the real behaviour of the given nucleotides.

3 Adding gaps to the query sequence is equivalent to deletions in the tested sequence, and vice versa.
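To illustrate the definition above, here is a minimal sketch of the Levenshtein distance computed with the standard dynamic-programming recurrence (class and method names are illustrative only):

public class Levenshtein {

    // d[i][j] holds the distance between the first i characters of x
    // and the first j characters of y.
    static int distance(String x, String y) {
        int n = x.length(), m = y.length();
        int[][] d = new int[n + 1][m + 1];
        for (int i = 0; i <= n; i++) d[i][0] = i;   // i deletions
        for (int j = 0; j <= m; j++) d[0][j] = j;   // j insertions
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                int replace = x.charAt(i - 1) == y.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(d[i - 1][j - 1] + replace,   // replace (or match)
                          Math.min(d[i - 1][j] + 1,             // delete from x
                                   d[i][j - 1] + 1));           // insert into x
            }
        }
        return d[n][m];
    }

    public static void main(String[] args) {
        System.out.println(distance("simlarity", "similarity"));  // prints 1 (one insertion)
    }
}

Assigning different costs to the three operations in the recurrence yields the weighted edit distance of Section 3.2.3.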


3.2.3 Weighted Edit Distance

The generalized edit distance function is extended by weights (positive real numbers) assigned to the individual atomic operations. The distance between strings x and y is then the minimum value of the sum of the weights of the atomic operations needed to transform x into y. However, the weights of the insert and delete operations need to be the same if the edit distance is to be a metric (with different weights for the insert and delete operations, the edit distance is not symmetric).

3.2.4 Sequences Compared in Biology

In biology, the frequently compared sequences are amino acid⁴ sequences. Amino acids form proteins⁵, and so amino acid sequence similarity is, in other words, protein similarity. Another type of similarity is RNA/DNA⁶ similarity, where the basic building units are nucleotides⁷. The references in the footnotes all point to Wikipedia only, as all these subjects are described there in detail and such a description should be sufficient for the scope of this thesis, which deals with metrics and similarity.

3.2.5 Scoring Matrices

Because the prices of the insert, delete, and replace operations are, especially in biology, not uniform, a matrix is provided giving the probability of such operations happening. For example, the scoring matrix PAM250⁸ looks as follows:

4 http://en.wikipedia.org/wiki/Amino_acid
5 http://en.wikipedia.org/wiki/Protein
6 http://en.wikipedia.org/wiki/DNA
7 http://en.wikipedia.org/wiki/Nucleotide
8 The complete PAM250 can be found at http://www.icp.ucl.ac.be/~opperd/private/pam250.html and at many other places.


        Ala  Arg  Asn  Asp  Cys
        A    R    N    D    C
Ala A   13    6    9    9    5
Arg R    3   17    4    3    2
Asn N    4    4    6    7    2
Asp D    5    4    8   11    1
Cys C    2    1    1    1   52

where the column Ala means there is a 13% probability that a position containing Ala in the first sequence will contain Ala in the second, a 3% chance that it will contain Arg, and so forth. Similarity scores using PAM250 as the scoring matrix thus actually describe the probability that one amino acid sequence would change into the other. The 250 in PAM250 stands for the data gathered from 71 sets of aligned sequences extrapolated up to the level of 250 amino acid replacements per 100 residues⁹.

There are two mainly used types of matrices, PAM (Point Accepted Mutation) [2] and BLOSUM (BLOcks of Amino Acid Substitution Matrix) [3]¹⁰.

3.2.6 Needleman-Wunsch Algorithm

The Needleman-Wunsch algorithm [4] performs a global alignment of two sequences. It uses the principles of dynamic programming and is guaranteed to find the alignment with the maximum score. Needleman-Wunsch was the first application of dynamic programming to biological sequence comparison. A detailed explanation can be found in [5, 6].

Note: The Needleman-Wunsch algorithm, as used in practice, computes the similarity¹¹ between the argument sequences, not their distance.

9 http://www.med.nyu.edu/rcr/rcr/course/sim-dist.html
10 There is a good report about both PAM and BLOSUM matrices on Wikipedia, http://en.wikipedia.org/wiki/Substitution_matrix.
11 The probability of changing one sequence into the other by means of evolution, mutation, etc.


3.2.7 Smith-Waterman Algorithm

The Smith-Waterman algorithm [7] is a well-known algorithm for performing local sequence alignment, i.e. for determining similar regions between two sequences (nucleotide or protein). Instead of looking at the total sequence, the Smith-Waterman algorithm compares segments of all possible lengths and optimizes the similarity measure.

The motivation for local alignment is the difficulty of obtaining correct alignments in regions of low similarity between distantly related biological sequences, because mutations have added too much 'noise' over evolutionary time to allow for a meaningful comparison of these regions. Local alignment avoids these regions altogether and focuses on those with a positive score, i.e. those with an evolutionarily conserved signal of similarity.

A good report on the Smith-Waterman algorithm, along with links to several implementations of the algorithm, can be found on Wikipedia [8]. A detailed explanation of the algorithm can be found at [9, 10], web addresses belonging to the Dublin and New York universities, cited also at NIST [11].

Note: The Smith-Waterman algorithm used in practice computes the similarity¹² between the two given RNA/DNA sequences, not their distance, and thus does not represent a metric.
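For illustration, the following is a simplified sketch of the Smith-Waterman scoring recurrence. It uses a flat match/mismatch score and a linear gap penalty instead of a real PAM or BLOSUM scoring matrix, so the chosen values have no biological meaning; the point is only the local-alignment recurrence with its zero floor and the maximum taken over all cells of the table.

public class SmithWaterman {

    static final int MATCH = 2, MISMATCH = -1, GAP = -1;   // illustrative values only

    // Returns the best local-alignment similarity score (not a distance).
    static int similarity(String s, String t) {
        int[][] h = new int[s.length() + 1][t.length() + 1];
        int best = 0;
        for (int i = 1; i <= s.length(); i++) {
            for (int j = 1; j <= t.length(); j++) {
                int sub = s.charAt(i - 1) == t.charAt(j - 1) ? MATCH : MISMATCH;
                h[i][j] = Math.max(0,                      // start a new local alignment
                          Math.max(h[i - 1][j - 1] + sub,  // align the two characters
                          Math.max(h[i - 1][j] + GAP,      // gap in t
                                   h[i][j - 1] + GAP)));   // gap in s
                best = Math.max(best, h[i][j]);            // best cell anywhere in the table
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(similarity("AATTCC", "TTCCGG"));  // scores the shared TTCC region
    }
}

A production implementation would replace the flat scores by a lookup in a scoring matrix such as PAM250 and would typically use a more refined gap model (cf. Gotoh's enhancement mentioned in Chapter 7).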

3.3 Optimizations and Heuristics

Since the Smith-Waterman algorithm is highly demanding in terms of both time and space complexity (O(n²)), faster methods have been developed as a kind of heuristic applied to the Smith-Waterman algorithm.

The most common are BLAST¹³ (Basic Local Alignment Search Tool) [12] and FASTA¹⁴ [13]. Both BLAST and FASTA are well introduced in a course by Stuart M. Brown at the NYU School of Medicine¹⁵. BLAST¹⁶ and FASTA¹⁷ are also well described on Wikipedia.

A more detailed description of BLAST and FASTA is lengthy, and as these search engines do not represent metrics but similarity measures, they are not discussed here in detail. For detailed information on BLAST and FASTA see the references given above.

12 The probability of changing one sequence into the other by means of evolution, mutation, etc.
13 An operable version of BLAST can be found at http://130.14.29.110/BLAST/
14 A publicly available FASTA search engine can be found at http://fasta.genome.jp/

3.4 Using Smith-Waterman Algorithm to Define a Metric

As noted above, both the Needleman-Wunsch and Smith-Waterman algorithms compute a similarity score of two sequences. However, for many practical applications a distance function, and at best a metric, would be more convenient. Although there is an intuitive connection between similarity and dissimilarity measures, the process of correctly deriving one from the other is generally highly complicated. One of many proposed approaches uses an analogy with vector spaces and the inner product to define a distance function from the similarity scores [14]:

d(s, t) = \left( (s, s) + (t, t) - 2(s, t) \right)^{\frac{1}{2}}

where (s, t) denotes the similarity score of sequences s and t. The function d(s, t) is a metric if the similarity measure obeys these rules:

1. Similarity of a symbol with itself is always positive.

2. Every symbol is more similar to itself than to any other symbol.

3. The similarity function is symmetrical.

4. Space is less similar to all non-spaces than any other symbol.

5. Similarity score for two spaces equals 0.

6. The triangle inequality is satisfied for all single symbols.

15 http://www.med.nyu.edu/rcr/rcr/course/sim-fasta.html
16 http://en.wikipedia.org/wiki/BLAST
17 http://en.wikipedia.org/wiki/FASTA

In the case of n = 2 and the scoring matrices BLOSUM or PAM, the similarity measure defined as the similarity score returned by the Smith-Waterman algorithm can be used to define the metric d(s, t) correctly.

Another approach lies in using a different scoring matrix. The Needleman-Wunsch algorithm was also originally defined using a distance scoring matrix, not a similarity scoring matrix, see [4], so it actually formed a metric, like the Levenshtein distance. But since the introduction of the Smith-Waterman algorithm, similarity scoring matrices have been used instead of distance scoring matrices. In [15] a metric scoring matrix is proposed.

Chapter 4 Text Similarity

4.1 Text Similarity Introduction

Text similarity is at the foundation of many widely used text operations: text mining, text search and retrieval, text classification, etc. While text similarity can only be accurately determined by comparing the semantic meaning of the texts, current technologies and knowledge do not allow computers to "understand" the meaning of texts, and so algorithmically feasible solutions are looked for.

The texts are often approached as bags of words, and simple counts (absolute, relative, weighted) of the words in the texts are compared to determine the similarity of the texts, see Section 4.2. This is a very common method and many techniques have been developed to improve the so-called word frequency vector (the counts of words are put into a vector for better operability), like weighting the counts of words by the importance of the word (inverse document frequency), stemming the words in the text (Porter stemming algorithm), or completely omitting common and modal words like be, can, etc. More information and references can be found in Section 4.2.

Another possible approach is to count the number of changes necessary for changing one text into the other. This approach, first used by Levenshtein, see Section 3.2.2, has some merits and some limitations. The Levenshtein distance, or edit distance, is a good word-to-word metric, especially for the correction of mistypings. For example, 'mystyping' is a possible and common mistyping of 'mistyping' and the Levenshtein distance of these two "words" is one (corresponding to the one replace operation necessary), which tells us the user most probably tried to type the word 'mistyping' and not 'mystyping'. The Levenshtein distance, however, ignores any semantic meaning of the texts and therefore can be used for any language without any changes¹. However, this generality comes at a high price. For example, the edit distance, however weighted and augmented, will fail to identify any similarity between the words cat and kitten, yet the similarity is obvious.

This problem is, as noted above, a general problem of text similarity. Including more semantics in text similarity measures is currently the greatest challenge. In [16], an interesting improvement of the standard word frequency vector with Euclidean distance is presented, see Section 4.2.3. Other approaches use corpus-based and knowledge-based measures to include semantics in similarity searches.

At this point, WordNet² [17], the lexical database for the English language of Princeton University, should be mentioned. WordNet is a semantic lexicon for the English language. It groups English words into sets of cognitive synonyms called synsets, provides short, general definitions, and records the various semantic relations between these synonym sets. Many similarity measures have been developed for use with WordNet. These similarity measures usually compute the relatedness of two words or texts, i.e. they are not metrics (distance functions). However, similarity measures and metrics are greatly intertwined and, in the case of text similarity, similarity measures are frequently used; some are therefore included in this overview, see Sections 4.3 and 4.4.

4.2 Word Frequency Vector

A word frequency vector is a vector containing the counts of appearances of each word in the text. The frequencies can either be absolute, as in the example below or in [18], or relative, as in [16]. The relative frequencies of words are computed from their absolute frequencies.

For example, consider the text: "I am hungry. - I am not." The corresponding word frequency vector (absolute frequencies) could be wfv = (2, 2, 1, 1), where the words associated with item i of wfv would be:

1 For any two texts of the same language.
2 http://wordnet.princeton.edu/


1. I

2. am

3. hungry

4. not

Notice that the word frequency vector ignores the position of the word in the text. Therefore, the sentences above in reverse order, "I am not. - I am hungry.", have the same word frequency vector as the original text. Also "I am not hungry, am I?" would have the same word frequency vector.

The word frequency vectors are often improved by omitting common words like be, can, etc., and by stemming the words that are included in the vector, for example by the Porter stemming algorithm [19]. Further improvements to word frequency vectors may be achieved by weighting the counts of words in the vector to better reflect the importance of the word for text similarity (how common the word is, or how probable it is to appear in two dissimilar texts). A common method for such weighting is to use the inverse document frequency (IDF) weights [20]. The IDF weight of a word is calculated in the context of a document collection as:

IDF(s) = \ln \frac{M_{total}}{M_s}

where M_{total} is the total number of documents in the collection and M_s is the number of documents containing the word s. Words that are very common will have a low weight, while rare words will have a high weight.
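The following is a minimal sketch of building absolute word frequency vectors and the IDF weights defined above over a toy document collection; the tokenization is deliberately naive (lower-casing and splitting on non-letters) and all names are illustrative only.

import java.util.*;

public class WordFrequency {

    static List<String> tokens(String text) {
        return Arrays.asList(text.toLowerCase().split("[^a-z]+"));
    }

    // Absolute word frequencies of one document.
    static Map<String, Integer> frequencies(String text) {
        Map<String, Integer> f = new HashMap<>();
        for (String w : tokens(text))
            if (!w.isEmpty()) f.merge(w, 1, Integer::sum);
        return f;
    }

    // IDF(s) = ln(M_total / M_s) over the given document collection.
    static Map<String, Double> idf(List<String> collection) {
        Map<String, Integer> docCount = new HashMap<>();
        for (String doc : collection)
            for (String w : new HashSet<>(tokens(doc)))
                if (!w.isEmpty()) docCount.merge(w, 1, Integer::sum);
        Map<String, Double> weights = new HashMap<>();
        docCount.forEach((w, m) -> weights.put(w, Math.log((double) collection.size() / m)));
        return weights;
    }

    public static void main(String[] args) {
        List<String> docs = List.of("I am hungry. - I am not.", "I am not hungry, am I?");
        System.out.println(frequencies(docs.get(0)));  // counts: i=2, am=2, hungry=1, not=1
        System.out.println(idf(docs));                 // every word occurs in both documents, so all IDF = ln(1) = 0
    }
}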

4.2.1 Euclidean Distance

Comparing two word frequency vectors f_1, f_2 can be done simply by computing their Euclidean distance:

d(f_1, f_2) = \sqrt{\sum_{i=1}^{n} (f_{1,i} - f_{2,i})^2}


This distance has the drawback of putting too much importance on words that appear in only one of the texts (the corresponding coordinate is 0 in one of the vectors), which can cause problems in the case of uncommon words. The discriminative power may then be increased by setting to zero the coordinates of all words that appear in only one of the vectors.

4.2.2 Cosine Measure

A popular measure of similarity for text clustering is the cosine of the angle between two vectors. The cosine measure is given by:

d(f_1, f_2) = 1 - \cos(f_1, f_2) = 1 - \frac{f_1 \cdot f_2}{\|f_1\| \, \|f_2\|}

It has been used, for example, in [21, 22, 23] and many others.
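As an illustration, the following sketch computes both the Euclidean distance of Section 4.2.1 and the cosine measure above on word frequency vectors represented as plain double arrays (the values are the toy frequencies from Section 4.2):

public class VectorDistances {

    static double euclidean(double[] f1, double[] f2) {
        double sum = 0;
        for (int i = 0; i < f1.length; i++) sum += (f1[i] - f2[i]) * (f1[i] - f2[i]);
        return Math.sqrt(sum);
    }

    static double cosineDistance(double[] f1, double[] f2) {
        double dot = 0, n1 = 0, n2 = 0;
        for (int i = 0; i < f1.length; i++) {
            dot += f1[i] * f2[i];
            n1  += f1[i] * f1[i];
            n2  += f2[i] * f2[i];
        }
        return 1 - dot / (Math.sqrt(n1) * Math.sqrt(n2));   // 1 - cos(f1, f2)
    }

    public static void main(String[] args) {
        double[] a = {2, 2, 1, 1};   // "I am hungry. - I am not."
        double[] b = {2, 2, 0, 1};   // a hypothetical text without the word "hungry"
        System.out.printf("euclidean = %.3f, cosine = %.3f%n", euclidean(a, b), cosineDistance(a, b));
    }
}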

4.2.3 Augmented Euclidean Distance

Even by weighting the word frequency vectors we can capture only little of the semantic information of the texts, and therefore other methods of including more semantic information in the text comparison are sought after. Peter Andras and Olusolu Idowu in their work Kohonen Networks With Graph-based Augmented Metrics [16] present the use of word consecutiveness graphs to augment the results of the standard Euclidean distance. The word consecutiveness graph is constructed by having as nodes the words in the document; for each pair of consecutive words in the document we add to the graph a directed edge linking the nodes corresponding to these words. In other words, nodes in the word consecutiveness graph represent words in the text, while edges in the graph represent the fact that the words represented by the connected nodes appear in the text consecutively. Their proposed augmented distance is given as follows:

d(f_1, f_2) = \frac{1}{\gamma(G_1, G_2)} \cdot \|f_1 - f_2\|

where \gamma(G_1, G_2) = |E(G_1 \cap G_2)|, i.e. \gamma(G_1, G_2) represents the number of edges the two word consecutiveness graphs have in common.

4.3 Corpus-based Measures

Corpus-based measures of word semantic similarity try to identify the degree of similarity between words using information exclusively derived from large corpora, see [23].

4.3.1 Pointwise Mutual Information

The pointwise mutual information using data collected by information retrieval (PMI-IR) was suggested in [24] as an unsupervised measure for the evaluation of the semantic similarity of words. It is based on word co-occurrence using counts collected over very large corpora (e.g. the Web). Given two words w_1 and w_2, their PMI-IR is measured as:

PMI\text{-}IR(w_1, w_2) = \log_2 \frac{p(w_1 \,\&\, w_2)}{p(w_1) \cdot p(w_2)}

which indicates the degree of statistical dependence between w_1 and w_2, and can be used as a measure of the semantic similarity of w_1 and w_2.

Note: PMI-IR is a similarity measure, not a metric (it computes the similarity between two texts, not their distance).

4.3.2 Latent Semantic Analysis

Latent semantic analysis (LSA) was proposed by Landauer in [25]. In LSA, term co-occurrences in a corpus are captured by means of a dimensionality reduction operated by a singular value decomposition (SVD) on the term-by-document matrix T representing the corpus. SVD is a well-known operation in linear algebra, which can be applied to any rectangular matrix in order to find correlations among its rows and columns. In our case, SVD decomposes the term-by-document matrix T into three matrices T = U \Sigma_k V^T, where \Sigma_k is the diagonal k × k matrix containing the k singular values of T, \sigma_1 \geq \sigma_2 \geq \ldots \geq \sigma_k, and U and V are column-orthogonal matrices. When the three matrices are multiplied together, the original term-by-document matrix is recomposed. Typically we can choose k' \ll k, obtaining the approximation T \approx U \Sigma_{k'} V^T.


LSA can be viewed as a way to overcome some of the drawbacks of the standard vector space model (sparseness and high dimensionality). In fact, the LSA similarity is computed in a lower-dimensional space, in which second-order relations among terms and texts are exploited. The similarity in the resulting vector space is then measured with the standard cosine similarity. Note also that LSA yields a vector space model that allows a homogeneous representation (and hence comparison) of words, word sets, and texts [23, 25].

4.4 Knowledge-based Measures

Knowledge-based measures were developed to quantify the degree to which two words are semantically related using information drawn from semantic networks; see e.g. [23, 26] for a more thorough overview. These measures were developed for use with WordNet³ [17]. In WordNet, nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. Synsets are interlinked by means of conceptual-semantic and lexical relations.

4.4.1 Hirst St-Onge

The idea behind Hirst and St-Onge's measure [27] of semantic relatedness is that two lexicalized concepts are semantically close if their WordNet synsets are connected by a path that is not too long and that "does not change direction too often". The strength of the relationship is given by:

rel_{HS}(c_1, c_2) = C - \mathrm{path\ length} - k \cdot d

where d is the number of changes of direction in the path, and C and k are constants; if no such path exists, rel_{HS}(c_1, c_2) is zero and the synsets are deemed unrelated.

Note: Hirst and St-Onge's measure is a similarity measure; it computes the semantic relatedness of two synsets and is not a metric.

3 See the WordNet homepage, http://wordnet.princeton.edu/.


4.4.2 Leacock-Chodorow

Leacock and Chodorow (1998) rely on the length len(c_1, c_2) of the shortest path between two synsets for their measure of similarity. However, they limit their attention to IS-A links and scale the path length by the overall depth D of the taxonomy:

sim_{LC}(c_1, c_2) = -\log \frac{len(c_1, c_2)}{2D}

4.4.3 Resnik

Resnik's [28] approach brings together ontology and corpus. Guided by the intuition that the similarity between a pair of concepts may be judged by "the extent to which they share information", Resnik defined the similarity between two concepts lexicalized in WordNet to be the information content of their lowest super-ordinate (most specific common subsumer) lso(c_1, c_2):

sim_R(c_1, c_2) = -\log p(lso(c_1, c_2))

where p(c) is the probability of encountering an instance of a synset c in some specific corpus.

4.4.4 Jiang-Conrath

Jiang and Conrath's [29] approach also uses the notion of information content, but in the form of the conditional probability of encountering an instance of a child synset given an instance of a parent synset. Thus the information content of the two nodes, as well as that of their most specific subsumer, plays a part.

dist_{JC}(c_1, c_2) = 2 \log p(lso(c_1, c_2)) - (\log p(c_1) + \log p(c_2))

This formula measures semantic distance, the inverse of similarity.

Chapter 5 3D-object Similarity

5.1 Introduction

It is widely recognized that 3D similarity search is a difficult problem, by far more difficult than 2D similarity search. Database technology does not yet support geometry-based similarity search of 3D objects. In comparison with the available systems that support 2D spatial data, 3D data are much more complex. The currently most widely used techniques for accessing databases of complex objects are feature-based approaches (e.g., [30, 31]), which are mainly used as a simple filter to restrict the search space.

A second approach, which comes from the area of pattern recognition, is the similarity search of 3D objects based on their 2D projections [32, 33].

Another idea to solve the problem, suggested in [34], is to develop an index structure which would support an efficient geometry-based similarity search on large databases of 3D volume objects. The index structure should use the actual geometry of the data objects to support an efficient similarity search of the objects. A problem of using the actual 3D geometry is the complexity of the 3D objects, which is far too high for the geometry to be stored directly in any index structure. A solution which has also proven useful in the case of indexing extended 2D objects is to use approximations of the objects in the index to support efficient pruning of irrelevant objects.

5.2 Jaccard Coefficient and Modifications

A reasonable metric used in [34] is the Jaccard coefficient, defined as:

d(vol_1, vol_2) = \frac{\|vol_1 \cup vol_2\| - \|vol_1 \cap vol_2\|}{\|vol_1 \cup vol_2\|}

where vol_1, vol_2 denote the volumes of objects 1 and 2. Notice that \|vol_1 \cup vol_2\| in the denominator is only necessary for normalizing the resulting volume difference with respect to the overall volume of vol_1 and vol_2. Other normalizing factors may be used, like \max(\|vol_1\|, \|vol_2\|) or 0.5(\|vol_1\| + \|vol_2\|).
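As an illustration of the volume-based Jaccard distance above, the following sketch approximates each 3D object by a set of occupied voxels, so that ||.|| becomes a simple element count; the voxel encoding and the values are illustrative only.

import java.util.HashSet;
import java.util.Set;

public class JaccardDistance {

    // d(vol1, vol2) = (|union| - |intersection|) / |union| on voxel sets.
    static <T> double distance(Set<T> vol1, Set<T> vol2) {
        Set<T> union = new HashSet<>(vol1);
        union.addAll(vol2);
        Set<T> intersection = new HashSet<>(vol1);
        intersection.retainAll(vol2);
        if (union.isEmpty()) return 0;   // two empty objects are identical
        return (double) (union.size() - intersection.size()) / union.size();
    }

    public static void main(String[] args) {
        Set<String> cube = Set.of("0,0,0", "0,0,1", "0,1,0", "0,1,1");  // four voxels
        Set<String> slab = Set.of("0,0,0", "0,0,1");                    // two of them
        System.out.println(distance(cube, slab));                        // prints 0.5
    }
}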

Chapter 6 Image Similarity

6.1 Image Similarity Introduction

The similarity of images is mostly a problem of large image databases, where images are retrieved for the user from the database based on their similarity to a query image. For instance, a picture of a car is supplied by the user, and all pictures with a car in them should be retrieved from the database.

The first solutions used labels to describe the contents of images, and the images were stored in the database indexed by these labels. So image similarity was "reduced" to keyword search. For example, green, car, city could be a query for all images of cars in a city where the dominant colour of the image is green. Such solutions can be very effective, once the database is created (all images have been labeled), if users know exactly what they are looking for and what keywords to choose. There are two major difficulties: firstly, the users cannot be expected always to know precisely what they are looking for, let alone what keywords to choose; secondly, the amounts of images in databases today have far outgrown the human possibilities of properly labeling them all. Therefore, content based image retrieval (CBIR) techniques were introduced.

In CBIR we try to extract information about the content of the image: image features like colour, texture, and shapes in the image. The problem is that to be fully able to find images similar to some query image, we would have to know what is in the query image, not only what colour or shape. For instance, there are as yet no methods of algorithmically finding all images of cats in any conceivable stance. Cats differ in size and colour; both these issues can be handled by comparing colour histograms¹ and scaling the images. However, practically no program will find as similar the image of a cat lying on the ground and the image of a cat in the middle of a jump, as the shape of the cat in the picture changes dramatically and the colour of the cat's fur can be different on its back and its belly.

In a picture, colour, texture and shape (at least to some degree) are extractable automatically. There is no way of algorithmically finding all cats in any conceivable stance and position, but we can find all images with a similar colour distribution and similar shapes in the picture; the chances are that many of these pictures will include a lot of cats in a lot of different positions. They will naturally also include some pictures without cats at all, as false positives.

To this end, different metrics for colour, texture and shape comparison have been developed. Yong Rui et al. [35] made a brilliant survey of CBIR techniques as they evolved in the past thirty years, and it is a good source of reference for colour, texture as well as shape comparison techniques. In CBIR and the metrics used there, two factors need to be taken into consideration: the false acceptance and false rejection rates, and how much the metrics correspond to the human notion of image similarity. In the case of a database with labeled images, the similarity of images is determined by humans (a dog is in the picture, brown and green are the dominant colours, it is in the country, etc.). In the case of automatic content based image retrieval, the similarity is most often computed from colour information alone. Thus images with similar colour distributions may be flagged as similar, although one image could be an image of a forest and the other of a brown and green carpet. Therefore, for better results, comparisons of colour, texture and shape are sometimes combined to form one, more complex metric in order to lower the false acceptance rate and make the resulting metric correspond more closely to the human notion of similarity.

The colour information of an image plays the most dominant role in CBIR. For effective storing and processing of the colour information of an image, histograms are used. Histograms are generally vectors containing the numbers of pixels of the given colours.

1 Defined below.

More formally, a histogram is a vector h = ((h_1, w_{h_1}), ..., (h_n, w_{h_n})), where the pair (h_i, w_{h_i}) represents the information that there are w_{h_i} pixels of colour h_i in the image.

This definition holds perfectly for gray-scale images, where h_i represents the gray level and w_{h_i} the count of pixels at gray level h_i. The metrics shown below are, however, defined for standard vectors; their application to histograms is nevertheless straightforward. Histograms h and g are taken as the vectors (w_{h_1}, ..., w_{h_n}) and (w_{g_1}, ..., w_{g_n}), respectively.

Technically, an image histogram refers to the probability mass function of the image intensities. While for gray-scale images the definition above is sufficient, for colour images we can extend the definition to capture the joint probabilities of the intensities of the three colour channels (RGB or HSV, for example). More formally, the colour histogram is defined by

h_{A,B,C}(a, b, c) = N \cdot \mathrm{Prob}(A = a, B = b, C = c)

where A, B and C represent the three colour channels (R, G, B or H, S, V) and N is the number of pixels in the image. Computationally, the colour histogram is formed by discretizing the colours within an image and counting the number of pixels of each colour. Since a typical computer represents colour images with up to 2^24 colours, this process generally requires substantial quantization of the colour space.

The colour histogram can be thought of as a set of vectors. For gray-scale images these are two-dimensional vectors: one dimension gives the value of the gray level and the other the count of pixels at that gray level. For colour images the colour histograms are composed of 4-D vectors.

6.2 Metrics in CBIR

6.2.1 Manhattan Distance

The L_1 metric from the Minkowski metric family, also called the city-block or taxicab distance, was proposed in [36] for computing dissimilarity scores between colour images. The Manhattan distance is a popular metric in image similarity issues, as it is simple and robust. Furthermore, several studies show that the L_1 metric is often closer to the human sense of similarity than, for example, the standard L_2 metric (Euclidean distance, see Section 6.2.2).

d(h, g) = \sum_{i=1}^{n} |h_i - g_i|

The Manhattan distance is used by many image retrieval engines including FIDS, FOCUS, ImageScape, PicHunter, QuickLook2 and others, see a thorough survey in [37]. Additional information about the Manhattan distance can be found for instance on Wikipedia [38].

6.2.2 Euclidean Distance

The L_2 metric from the Minkowski metric family is the standard metric in geometry (also called the Pythagorean metric in older literature). The Euclidean metric is widely used in many applications, though recent studies show there are metrics better suited to image similarity in terms of a better match with the human notion of similarity. For example, the simple L_1 metric (see Section 6.2.1) shows slightly better results than the Euclidean distance.

d(h, g) = \sqrt{\sum_{i=1}^{n} (h_i - g_i)^2}

The Euclidean metric is used for example by CANDID [39] for comparison of global signatures, or in ASSERT for comparing feature vectors, and in many other applications, e.g. FIR, MetaSEEK, MIR and others, see a thorough survey in [37]. Additional information can also be found on Wikipedia [40] or in a report on histogram use [41]. The Euclidean distance, along with others, is discussed in a very good survey of CBIR techniques altogether [35].

6.2.3 Chebyshev Distance

The L_∞ metric from the Minkowski metric family, also called the maximum metric or chessboard distance. The Chebyshev distance is generally not used for image retrieval, but it was included in a test comparing the quality of metrics in terms of closeness to the human notion of similarity. The results of the Chebyshev distance were rather poor compared to, for example, the L_1 and L_2 metrics, which are commonly used in CBIR.

d(h, g) = \max_{i=1}^{n} |h_i - g_i|

Additional information can also be found on Wikipedia [42].
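The three Minkowski-family metrics of Sections 6.2.1 to 6.2.3 reduce to a few lines each when histograms are stored as plain arrays; the following sketch shows them side by side on toy 4-bin histograms (illustrative values only).

public class MinkowskiDistances {

    static double manhattan(double[] h, double[] g) {
        double sum = 0;
        for (int i = 0; i < h.length; i++) sum += Math.abs(h[i] - g[i]);
        return sum;
    }

    static double euclidean(double[] h, double[] g) {
        double sum = 0;
        for (int i = 0; i < h.length; i++) sum += (h[i] - g[i]) * (h[i] - g[i]);
        return Math.sqrt(sum);
    }

    static double chebyshev(double[] h, double[] g) {
        double max = 0;
        for (int i = 0; i < h.length; i++) max = Math.max(max, Math.abs(h[i] - g[i]));
        return max;
    }

    public static void main(String[] args) {
        double[] h = {10, 20, 30, 40};   // toy 4-bin colour histograms
        double[] g = {12, 18, 33, 37};
        System.out.printf("L1 = %.1f, L2 = %.2f, Linf = %.1f%n",
                manhattan(h, g), euclidean(h, g), chebyshev(h, g));
    }
}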

6.2.4 Quadratic Form Distance

The Quadratic Form distance was used by Blobworld [43] for matching colour histograms [44].

d(h, g) = (h - g)^T A (h - g)

where A = (a_{ij}) is a symmetric matrix of weights representing the similarity between colour bins i and j.

The Quadratic Form metric can become computationally ineffective for histograms of higher dimensions, compared to the Minkowski metrics, for example. However, the quadratic form distance can take into account the cross-correlation between colour bins of the compared histograms, see [41].
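A minimal sketch of the quadratic form distance follows. Formulations differ on whether the square root of the quadratic form is taken; the sketch returns the quadratic form itself, exactly as in the formula above, and with A set to the identity matrix it degenerates to the squared Euclidean distance, which makes a convenient sanity check.

public class QuadraticFormDistance {

    static double distance(double[] h, double[] g, double[][] a) {
        int n = h.length;
        double result = 0;
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                result += (h[i] - g[i]) * a[i][j] * (h[j] - g[j]);   // (h-g)^T A (h-g)
        return result;
    }

    public static void main(String[] args) {
        double[] h = {10, 20, 30};
        double[] g = {12, 18, 33};
        double[][] identity = {{1, 0, 0}, {0, 1, 0}, {0, 0, 1}};      // with A = I this is squared L2
        System.out.println(distance(h, g, identity));                  // prints 17.0 (4 + 4 + 9)
    }
}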

6.2.5 Histogram Intersection Distance

The colour histogram intersection was proposed for colour image retrieval in [45]. There is a good report on histogram use at [41]. This metric, unlike the others presented here, is defined for colour histograms represented as sets of 4-D vectors (see Section 6.1). The intersection of histograms h and g is given by:

d(h, g) = \frac{\sum_{a} \sum_{b} \sum_{c} \min(h(a, b, c), g(a, b, c))}{\min(|h|, |g|)}

where |h| and |g| give the magnitude of each histogram. The histogram intersection distance is used by MARS, SIMBA, or Surfimage [37, 46, 47].


6.2.6 Mahalanobis Distance

The Mahalanobis distance is a standard metric in statistics. Formally, the Mahalanobis distance from a group of values with mean \mu = (\mu_1, \mu_2, \mu_3, \ldots, \mu_p)^T and covariance matrix Cov(x), for a multivariate vector x = (x_1, x_2, x_3, \ldots, x_p)^T, is defined as:

D_M(x) = \sqrt{(x - \mu)^T \, Cov(x)^{-1} \, (x - \mu)}

When defined as a dissimilarity measure and used as a metric, the definition is as follows:

d(h, g) = \sqrt{(h - g)^T \, Cov(h)^{-1} \, (h - g)}

Defined by P. C. Mahalanobis in 1936, it differs from the standard Euclidean distance in that it is scale-invariant and can take into account the correlation of the data set. Additional information can be found on Wikipedia [48]. The Mahalanobis distance is not very common in CBIR systems, and in a comparison with eight other metrics it did not perform well, as is shown in [49].

6.2.7 Canberra Distance

The Canberra distance is given as follows:

d(h, g) = \sum_{i=1}^{n} \frac{|h_i - g_i|}{|h_i| + |g_i|}

The Canberra distance has not yet been widely used in CBIR systems; however, an extensive comparison of the performance of several metrics indicates that the Canberra distance (along with the Bray-Curtis distance, see Section 6.2.8) can perform very well in CBIR, significantly better than the mostly used Manhattan and Euclidean distances, as is shown in [49].

This distance examines the sum of a series of fraction differences between the coordinates of a pair of vectors. Each fraction-difference term has a value between 0 and 1. If one of the coordinates is zero, the term becomes 1 regardless of the other value, and thus the distance will not be affected by it. Note that if both coordinates are zero, the term needs to be defined as 0. The distance is very sensitive to even small changes when both vectors are close to zero.

6.2.8 Bray-Curtis Distance

The Bray-Curtis distance, sometimes also called the Sorensen distance, is commonly used in ecology and environmental sciences. It views the space as a grid, similarly to the city block distance. The Bray-Curtis distance has the nice property that if all coordinates are positive, its value is between zero and one. If both objects are at the zero coordinates, the Bray-Curtis distance is undefined.

d(h, g) = \frac{\sum_{i=1}^{n} |h_i - g_i|}{\sum_{i=1}^{n} (h_i + g_i)}

Like the Canberra distance (see Section 6.2.7), the Bray-Curtis distance is not common in CBIR systems; however, recent performance tests show that the Bray-Curtis distance (as well as the Canberra distance) performs better than the Euclidean or Manhattan metric with respect to the human notion of similarity, see [49].
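The following sketch implements the Canberra (Section 6.2.7) and Bray-Curtis (Section 6.2.8) distances on histogram arrays, handling the zero-valued terms in the way the definitions above require.

public class CanberraBrayCurtis {

    static double canberra(double[] h, double[] g) {
        double sum = 0;
        for (int i = 0; i < h.length; i++) {
            double denom = Math.abs(h[i]) + Math.abs(g[i]);
            if (denom > 0) sum += Math.abs(h[i] - g[i]) / denom;   // term is defined as 0 when both values are 0
        }
        return sum;
    }

    static double brayCurtis(double[] h, double[] g) {
        double num = 0, denom = 0;
        for (int i = 0; i < h.length; i++) {
            num += Math.abs(h[i] - g[i]);
            denom += h[i] + g[i];       // assumes non-negative coordinates (histogram counts)
        }
        return denom > 0 ? num / denom : 0;
    }

    public static void main(String[] args) {
        double[] h = {10, 20, 30, 0};
        double[] g = {12, 18, 33, 0};
        System.out.printf("canberra = %.3f, bray-curtis = %.3f%n", canberra(h, g), brayCurtis(h, g));
    }
}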

6.2.9 Squared Chord Distance

The Squared Chord distance is given as follows:

d(h, g) = \sum_{i=1}^{n} (\sqrt{h_i} - \sqrt{g_i})^2

This distance was included in a performance test of nine metrics used in CBIR and slightly outperformed common metrics, such as the Manhattan or Euclidean metrics, see [49].

6.2.10 Squared Chi-Squared Distance

The Squared Chi-squared distance is given as follows:

d(h, g) = \sum_{i=1}^{n} \frac{(h_i - g_i)^2}{h_i + g_i}

This distance was included in a performance test [49] of nine metrics used in CBIR and slightly outperformed commonly used metrics, such as the Manhattan or Euclidean metrics.

6.2.11 Earth Mover's Distance

Computing the EMD [50] is based on a solution to the old transportation problem [51]. This is a bipartite network flow problem which can be formalized as the following linear programming problem: let I be a set of suppliers, J a set of consumers, and c_{ij} the cost to ship a unit of supply from i \in I to j \in J. We want to find a set of flows f_{ij} that minimize the overall cost

\sum_{i \in I} \sum_{j \in J} c_{ij} f_{ij}

subject to the following constraints:

f_{ij} \geq 0, \quad i \in I, \; j \in J
\sum_{i \in I} f_{ij} = y_j, \quad j \in J
\sum_{j \in J} f_{ij} \leq x_i, \quad i \in I

where x_i is the total supply of supplier i and y_j is the total capacity of consumer j. The first constraint allows shipping of supplies from a supplier to a consumer and not vice versa. The second forces the consumers to fill up all of their capacities and the third limits the supply that a supplier can send to its total amount. A feasibility condition is that the total demand does not exceed the total supply:

\sum_{j \in J} y_j \leq \sum_{i \in I} x_i

The transportation problem can be naturally used for histogram matching by defining one histogram as the supplier and the other as the consumer, and solving the transportation problem where the cost c_{ij} is the ground distance between element i in the first histogram and element j in the second. When the total weights of the histograms are not equal (partial matches), the smaller histogram will be the consumer in order to satisfy the feasibility condition. Once the transportation problem is solved and we have found the optimal flow F, the Earth Mover's Distance is defined as

EMD(h, g) = \frac{\sum_{i \in I} \sum_{j \in J} c_{ij} f_{ij}}{\sum_{i \in I} \sum_{j \in J} f_{ij}}

where the denominator is a normalization factor that avoids favoring signatures with smaller total weights. In general, the ground distance c_{ij} can be any distance and is chosen according to the problem at hand.

Chapter 7 Implementation of a metric based on the Smith-Waterman algorithm

A part of this thesis was the implementation of a metric into the Metric Similarity Search Implementation Framework (MESSIF) [52] developed at the Faculty of Informatics, Masaryk University. The implementation of the Smith-Waterman algorithm [7] (see Section 3.2.7) was chosen because the algorithm had not yet been exploited by the framework. While the Smith-Waterman algorithm using PAM scoring matrices [2] defines a similarity measure, see Section 3.2.7, using the work of Igor Fisher, Similarity-preserving Metrics for Amino-acid Sequences [14], we can define a metric based on the similarity scores computed by the Smith-Waterman algorithm, see Section 3.4. The metric, a distance function d(s, t) of two sequences s and t, is defined as:

d(s, t) = \sqrt{(s, s) + (t, t) - 2(s, t)}

where (s, t) is the similarity score of sequences s and t.

Using this formula we implemented the Smith-Waterman algorithm into MESSIF as the class ObjectStringSmithWaterman for comparing amino acid sequences. The object ObjectStringSmithWaterman is, simply put, a string with added distance computation capabilities. The string stored in the object represents the amino acid sequence, while the object offers a function getDistance(another object) that computes the distance of this object to the supplied argument object. The distance, as defined above, is computed from three similarity computations. Since the (s, s) and (t, t) scores (self-similarities) do not change, they are precomputed when the object is created (i.e. when an object containing sequence s is created, the self-similarity (s, s) is precomputed for the object) and stored in it, so the self-similarity


for each sequence is only computed once for all distance computations.

The implementation is based on an implementation of the Smith-Waterman algorithm with Gotoh's enhancement¹ in JAligner by Ahmed Moustafa [54], licensed under the GNU General Public License (GPL)². The code was adapted and optimized for distance computations. The traceback necessary for the construction of the alignment with the highest similarity score was removed (it has no significance for the distance computation, which is based on the similarity score alone). The distance function has also been implemented to support multiple threads computing distances simultaneously.

The implementation uses the PAM250 scoring matrix [2] (see Section 3.2.5) to compute distances of amino acid sequences (see Section 3.2.4).

1 The enhancement for linear gap functions, see [53].
2 http://www.gnu.org/licenses/gpl.txt
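As an illustrative outline (not the actual MESSIF code), the following sketch mirrors the approach just described: the self-similarity (s, s) is computed once when the object is created, and getDistance then needs only one additional similarity computation per query. SmithWaterman.similarity stands for any routine returning the Smith-Waterman similarity score, for instance the simplified sketch from Section 3.2.7; in the real implementation this role is played by the adapted JAligner code with the PAM250 matrix.

public class SequenceObject {

    private final String sequence;
    private final double selfSimilarity;   // (s, s), precomputed once per object

    public SequenceObject(String sequence) {
        this.sequence = sequence;
        this.selfSimilarity = SmithWaterman.similarity(sequence, sequence);
    }

    // d(s, t) = sqrt((s,s) + (t,t) - 2(s,t))
    public double getDistance(SequenceObject other) {
        double cross = SmithWaterman.similarity(this.sequence, other.sequence);
        return Math.sqrt(this.selfSimilarity + other.selfSimilarity - 2 * cross);
    }
}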

Chapter 8 Conclusion

Similarity search is widely used in many applications.

In biology, the Needleman-Wunsch [4] and Smith-Waterman [7] algorithms are used for determining the similarity of amino acid or DNA sequences. In practice, the systems FASTA [13] and BLAST [12] are used to speed up the slower Smith-Waterman algorithm.

In text classification and retrieval, the Euclidean [40] and cosine [22] measures are used for comparing word frequency vectors [16]. Many corpus-based and knowledge-based measures [23] are employed with WordNet [17].

3D-object similarity can be dealt with using feature vectors [30, 31] or 2D projections of the 3D objects [32, 33]. Another approach is to use the actual geometry of the objects [34].

Image similarity is determined by comparing colour histograms or texture feature vectors; shape may also be included in the comparison [35, 37]. Several different metrics have been found and described, including the Manhattan distance [36] or the Quadratic Form distance [41].

Many of the basic similarity measures and metrics that are used in biological sequence similarity, text and image retrieval, and 3D-object similarity were found and described, providing at least one reference for each. Also, extensive surveys of measures and techniques used in text classification [23] and image retrieval [35, 37, 50] were found. Internet sources are also provided for most of the measures and metrics included in this overview.

The implementation of the Smith-Waterman algorithm into the MESSIF platform was successful; the distance computations are reasonably time-demanding. Test data were downloaded from the UniProt (Universal Protein Resource)¹ database.

1 http://www.expasy.uniprot.org/

Chapter 9 Bibliography

[1] V. I. Levenshtein. Binary codes capable of correcting spurious insertions and deletions of ones, volume 1. 1965.

[2] Margaret O. Dayhoff, R. M. Schwarz, and B. C. Orcutt. A model of evolutionary change in proteins. In Atlas of Protein Sequence and Structure, volume 5, pages 345-352. National Biochemical Research Foundation, 1978.

[3] S. Henikoff and J. G. Henikoff. Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci USA, 89(22):10915-10919, November 1992.

[4] Saul Needleman and Christian Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins, 1970.

[5] Aoife McLysaght. Needleman-Wunsch algorithm, The University of Dublin. http://www.maths.tcd.ie/~lily/pres2/sld003.htm. [Online; accessed 6-May-2007].

[6] Wikipedia. Needleman-Wunsch algorithm — Wikipedia, the free encyclopedia. http://en.wikipedia.org/w/index.php?title=Needleman-Wunsch_algorithm&oldid=116968647, 2007. [Online; accessed 6-May-2007].

[7] T. F. Smith and M. S. Waterman. Identification of common molecular subsequences, 1981.

[8] Wikipedia. Smith-Waterman algorithm — Wikipedia, the free encyclopedia. http://en.wikipedia.org/w/index.php?title=Smith-Waterman_algorithm&oldid=126482240, 2007. [Online; accessed 6-May-2007].


Aoife McLysaght. Needleman-wunsch algorithm — the univer­ sity of dublin. http://www.maths.ted.ie/~lily/pres2/ sldO 0 9 . htm. [Online; accessed 6-May-2007].

Stuart M. Brown. Needleman-wunsch algorithm — nyu school of medicine, http : //www.med.nyu. edu/rcr/rcr/ course/sim-sw.html. [Online; accessed 6-May-2007].

Paul E. Black. Smith-waterman algorithm, nist. http: / /www. nist.gov/dads/HTML/smithWaterman.html, 2006. [On­ line; accessed 6-May-2007]. S. F. Altschul and T. L. Madden et al. Gapped BLAST and PSI- BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25:3389-3402,1997.

[13] D. J. Lipman and W. R. Pearson. Rapid and sensitive protein similarity searches. Science, 227(4693):1435-1441, 1985.

[14] Igor Fisher. Similarity-preserving metrics for amino-acid sequences.

[15] Weijia Xu and Daniel P. Miranker. A metric model of amino acid substitution. Bioinformatics, 20(8):1214-1221, 2004.

[16] Peter Andras and Olusolu Idowu. Kohonen networks with graph-based augmented metrics. In Proceedings of WSOM 2005, pages 179-186, 2005.

[17] WordNet: An Electronic Lexical Database. MIT Press, 1998.

[18] Richard B. Segal and Jeffrey O. Kephart. Incremental learning in SwiftFile. In Proc. 17th International Conf. on Machine Learning, pages 863-870. Morgan Kaufmann, San Francisco, CA, 2000.

[19] Martin Porter. An algorithm for suffix stripping. Program, volume 14, pages 130-137, July 1980.

[20] Paul Ginsparg, Paul Houle, Thorsten Joachims, and Jae-Hoon Sul. Mapping subsets of scholarly information. In PNAS, February 2004.


[21] J. Hopcroft, O. Khan, B. Kulis, and B. Selman. Tracking evolving communities in large linked networks. Proc Natl Acad Sci USA, 101 Suppl 1:5249-5253, April 2004.

[22] Gerard Salton. Automatic text processing: the transformation, analysis, and retrieval of information by computer. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1989.

[23] Rada Mihalcea, Courtney Corley, and Carlo Strapparava. Corpus-based and knowledge-based measures of text semantic similarity. In AAAI, 2006.

[24] Peter D. Turney. Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In EMCL '01: Proceedings of the 12th European Conference on Machine Learning, pages 491-502, London, UK, 2001. Springer-Verlag.

[25] T. K. Landauer, P. W. Foltz, and D. Laham. Introduction to latent semantic analysis. In Discourse Processes, volume 25, pages 259-284, 1998.

[26] A. Budanitsky. Semantic distance in WordNet: An experimental, application-oriented evaluation of five measures, 2001.

[27] G. Hirst and D. St-Onge. Lexical chains as representations of context for the detection and correction of malapropisms, 1997.

[28] Philip Resnik. Using information content to evaluate semantic similarity. In Proceedings of the 14th International Joint Conference on Artificial Intelligence, pages 448-453, 1995.

[29] Jay J. Jiang and David W. Conrath. Semantic similarity based on corpus statistics and lexical taxonomy. In Proceedings of International Conference on Research in Computational Linguistics, 1997.

[30] C. Faloutsos, R. Barber, et al. Efficient and effective querying by image content. Journal of Intelligent Information Systems, volume 3, pages 231-262, 1994.

[31] R. Mehrotra and J. E. Gary. Feature-based retrieval of similar shapes. In Proc. 9th Int. Conf. on Data Engineering, Vienna, Austria, pages 108-115, 1993.


[32] H. Noborio, P. Liang, and S. Hackwood. Construction of the octree approximating three-dimensional objects using multiple views. IEEE Transactions on Pattern Analysis and Machine Intelligence, volume 10, pages 769-782, 1988.

[33] P. Srinivasan, S. Fukusa, and S. Azimoto. Computational Geometric Methods in Volumetric Intersection for 3D Reconstruction, volume 23. 1990.

[34] D. A. Keim. Efficient geometry-based similarity search of 3D spatial databases. In Proc. ACM SIGMOD Int. Conf. on Management of Data, Philadelphia, PA, pages 419-430. ACM Press, June 1999.

[35] Y. Rui, T. Huang, and S. Chang. Image retrieval: current techniques, promising directions and open issues, April 1999.

[36] Michael J. Swain and Dana H. Ballard. Color indexing. Int. J. Comput. Vision, 7(1):11-32, 1991.

[37] R. Veltkamp and M. Tanase. Content-based image retrieval systems: A survey. http://www.aa-lab.cs.uu.nl/cbirsurvey/cbir-survey/cbir-survey.html, 2001.

[38] Wikipedia. Taxicab geometry — Wikipedia, the free encyclopedia. http://en.wikipedia.org/w/index.php?title=Taxicab_geometry&oldid=117611031, 2007. [Online; accessed 2-April-2007].

[39] Computer Research and Applications Group, Los Alamos National Laboratory, USA. CANDID: Comparison algorithm for navigating digital image databases. http://public.lanl.gov/kelly/CANDID/index.shtml, 2007. [Online; accessed 2-April-2007].

[40] Wikipedia. Euclidean distance — Wikipedia, the free encyclopedia. http://en.wikipedia.org/w/index.php?title=Euclidean_distance&oldid=116093887, 2007. [Online; accessed 2-April-2007].


[41] Sangoh Jeong. Histogram-based color image retrieval. http://scien.stanford.edu/class/psych221/projects/02/sojeong/, March 2001. [Online; accessed 30-April-2007].

[42] Wikipedia. Chebyshev distance — Wikipedia, the free encyclopedia. http://en.wikipedia.org/w/index.php?title=Chebyshev_distance&oldid=118252107, 2007. [Online; accessed 2-April-2007].

[43] Computer Science Division, University of California, Berkeley, USA. Blobworld, a CBIR system. http://elib.cs.berkeley.edu/blobworld/, 2004. [Online; accessed 2-April-2007].

[44] Remco C. Veltkamp and Mirela Tanase. Content-based image retrieval systems: A survey. http://www.aa-lab.cs.uu.nl/cbirsurvey/cbir-survey/node9.html, March 2001. [Online; accessed 2-April-2007].

[45] M. J. Swain and D. H. Ballard. Color indexing. In International Journal of Computer Vision, 1991.

[46] Chahab Nastar, Matthias Mitschke, et al. Surfimage: A flexible content-based image retrieval system. In Proceedings of the ACM International Multimedia Conference, pages 339-344, September 1998.

[47] Remco C. Veltkamp and Mirela Tanase. Content-based image retrieval systems: A survey. http://www.aa-lab.cs.uu.nl/cbirsurvey/cbir-survey/node38.html, March 2001. [Online; accessed 28-April-2007].

[48] Wikipedia. Mahalanobis distance — Wikipedia, the free encyclopedia. http://en.wikipedia.org/w/index.php?title=Mahalanobis_distance&oldid=109450247, 2007. [Online; accessed 28-April-2007].

[49] Manesh Kokare, B. N. Chatterji, and P. K. Biswas. Comparison of similarity metrics for texture image retrieval. In TENCON 2003, volume 2, pages 571-575, October 2003.


[50] Yossi Rubner, Carlo Tomasi, and Leonidas J. Guibas. A metric for distributions with applications to image databases. In ICCV '98: Proceedings of the Sixth International Conference on Computer Vision, page 59, Washington, DC, USA, 1998. IEEE Computer Society.

[51] G. B. Dantzig. Application of the simplex method to a transportation problem. In Activity Analysis of Production and Allocation, pages 359-373, 1951.

[52] Michal Batko, David Novak, and Pavel Zezula. MESSIF: Metric similarity search implementation framework. In C. Thanos and F. Borri, editors, DELOS Conference 2007: Working Notes, Pisa, 13-14 February 2007, pages 11-23. Information Society Technologies, 2007.

[53] O. Gotoh. An improved algorithm for matching biological sequences, 1982.

[54] Ahmed Moustafa. JAligner. http://jaligner.sourceforge.net. [Online; accessed 12-May-2007].

Contents of CD

• 3D Object Similarity
  keim_pap.pdf: geometry-based 3D-object similarity

• Biology
  2004-mPAM.pdf: scoring matrix defining a metric
  blast.pdf: report on BLAST
  Byang.pdf: a novel Smith-Waterman implementation (2002)
  sim-pres-metrics.pdf: suggested ways of defining metrics in biology
  smithwaterman.pdf: Smith-Waterman algorithm

• Image Similarity
  emd.pdf: Earth Mover's distance
  kokare.pdf: comparison of metrics used in CBIR
  rui99.pdf: an extensive survey of techniques used in CBIR
  spie.pdf: overview of current search engines and suggestion of a cognitive visual attention model

• Implementation of Smith-Waterman Algorithm
  ObjectStringSmithWaterman.java: the source code of the implementation

• Text Similarity
  BudanitskyHirst.pdf: overview of similarity measures in WordNet
  mihalcea.aaai06.pdf: overview and comparison of corpus-based and knowledge-based measures of text similarity
  PAOIwsom.pdf: suggested graph-augmented Euclidean metric
  swiftfile.pdf: e-mail assistant using text classification
