Similarity Measures for Sequential Data 1
Total Page:16
File Type:pdf, Size:1020Kb
Similarity Measures for Sequential Data 1 SIMILARITY MEASURES FOR SEQUENTIAL DATA Konrad Rieck Technische Universität Berlin Germany This is a preprint of an article published in Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery – http://wires.wiley.com/dmkd ABSTRACT quential data. Instead of operating in the domain of vec- tors, the learning methods are applied to pairwise relation- Expressive comparison of strings is a prerequisite for analysis of ships of sequential data, such as distances, angles or kernels sequential data in many areas of computer science. However, defined over strings. As a consequence, a powerful yet in- comparing strings and assessing their similarity is not a trivial task tuitive interface for learning with sequential data can be and there exists several contrasting approaches for defining simi- established, which is rooted in assessing the similarity or larity measures over sequential data. In this article, we review dissimilarity of strings. three major classes of such similarity measures: edit distances, Comparing strings and assessing their similarity is far bag-of-word models and string kernels. Each of these classes orig- from trivial, as sequential data is usually characterized by inates from a particular application domain and models similar- complex and rich content. As an example, this article itself ity of strings differently. We present these classes and underlying is a string of data and comparison with related texts clearly comparison concepts in detail, highlight advantages and differ- resembles an involved task. As a consequence, several con- ences as well as provide basic algorithms for practical application. trasting approaches have been proposed for comparison of strings, reflecting particular application domains and em- phasizing different aspects of strings. INTRODUCTION In this article, we review three major classes of similar- ity measures for sequential data, which have been widely Strings and sequences are a natural representation of data studied in data mining and machine learning. First, we in- in many areas of computer science. For example, several troduce the edit distance and related measures, which de- applications in bioinformatics are concerned with study- rive from early telecommunication and model similarity ing sequences of DNA, many tasks of information retrieval by transforming strings into each other. We then proceed center on analysis of text documents, and even computer to bag-of-words models, which originate from information security deals with finding attacks in strings of network retrieval and implement comparison of strings by embed- packets and file contents. Application of data mining and ding sequential data in a vector space. Finally, we present machine learning in these areas critically depends on the string kernels, a recent class of similarity measures based availability of an interface to sequential data, which allows on the paradigm of kernel-based learning, which allows for for comparing, analyzing and learning in the domain of designing feature spaces for comparison with almost arbi- strings. trary complexity. For each of the three classes of similarity Algorithms for data mining and machine learning have measures, we discuss underlying concepts, highlight advan- been traditionally designed for vectorial data, whereas tages and differences to related techniques, and introduce strings resemble variable-length objects that do not fit into basic algorithms for application in practice. the rigid representation of vectors. However, a large body of learning algorithms rests on analysis of pairwise rela- NOTATION tionships between objects, which imposes a looser con- straint on the type of data that can be handled. For in- Before presenting similarity measures in detail, we intro- stance, several learning methods, such as nearest neigh- duce some basic notation. Let be a set of symbols de- bor classification, linkage clustering and multi-dimensional noted as alphabet, such as the charactersA in text or the bases scaling, solely depend on computing distances between ob- in DNA. Then, a string x is a concatenation of symbols jects. This implicit abstraction from vectorial representa- from the alphabet . Moreover, we denote the set of all A tions can be exploited for applying machine learning to se- strings by ∗ and refer to the set of strings with length n A Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 2 Similarity Measures for Sequential Data as n. A function f for comparison of two strings takes transformation of strings. In the original formulation of A the form f : ∗ ∗ R where for (dis)similarity the edit distance these operations are defined as measures the returnedA × A quantity! increases with pairwise (dis)similarity of strings. For simplicity, we use the term (a) the insertion of a symbol into x, similarity measure synonymously with dissimilarity mea- (b) the deletion of a symbol from x and sure, as both types of functions mainly differ in the direc- (c) the substitution of a symbol of x with a symbol of y. tion of comparison. To navigate within a string x, we ad- dress the i-th symbol of x by x[i] and refer to the symbols A transformation of strings can now be defined as a se- from position i to j as a substring x[i:j] of x. Finally, we de- quence of edit operations where the costs correspond to note the empty string by ε and define x to be the length the length of this sequence. Assessing the similarity of two of the string x. j j strings x and y thus amounts to determining the shortest Let us, as an example, consider the string z = cute kitty. sequence of edit operations that transforms x to y. In this Depending on how we define the alphabet , the string z view, the edit distance d(x, y) is simply the length of this either corresponds to a concatenation of charactersA (c u shortest sequence and corresponds to the minimal number ◦ ◦ t...) with z = 10 or a concatenation of words (cute kitty) of edit operations required to change x into y. j j ◦ with z = 2. Thus, by selecting the alphabet of strings, Formally, the edit distance d(x, y) for two strings x and we arej j able to refine the scope of the analysis depending y can be defined recursively, starting with small substrings on the application at hand. In the following, we restrict and continuing until the shortest transformation from x ourselves to the simple alphabet = a,b and use the to y has been discovered. As base cases of this recursive A f g strings x = ababba and y = bababb as examples through- definition, we have out the article. However, all of the presented similarity measures are applicable to arbitrary alphabets, which may d(ε,ε) = 0, d(x,ε) = x , d(ε, y) = y range from binary symbols in telecommunication to huge j j j j dictionaries for natural language processing. where for the first base case the distance between empty strings is zero and in the other two cases the shortest se- quence of operations corresponds to the length of the non- EDIT DISTANCE empty string. In all other cases, the edit distance d(x, y) is defined recursively by As the first class of similarity measures for sequential data, we introduce the edit distance and variants thereof. The 8 d(x[1:i 1], y[1:j]) + 1 idea underlying the edit distance is to assess the similarity < − of two strings by transforming one string into the other d(x[1:i], y[1:j]) = min d(x[1:i], y[1:j 1]) + 1 : − and determining the minimal costs for this transformation. d(x[1:i 1], y[1:j 1]) + m(i, j) Strictly speaking, the edit distance is a metric distance and − − thus a dissimilarity measure which returns zero if no trans- in which the indicator function m(i, j) is 1 for a mismatch formation is required and positive values depending on the of symbols x[i] = y[j], and 0 for a match of symbols x[i] = amount of changes otherwise. A generic definition of dis- 6 y[j]. tances for strings is provided in Box 1. This recursive definition builds on the technique of dy- namic programming: the distance computation for x[1:i] Box 1 A distance for strings is a function comparing two and y[1:j] is solved by considering distances of smaller sub- strings and returning a numeric measure of their dissim- strings for each edit operation. In the first case, the symbol ilarity. Formally, a function d : ∗ ∗ is a x[i] is discarded, which corresponds to deleting a symbol A × A ! R distance if and only if for all x, y, z ∗ holds from x. The second case matches an insertion, as the last 2 A symbol of y[j] is omitted. If x[i] = y[j], the third case re- d(x, y) 0 (non-negativity) 6 ≥ flects a substitution, whereas otherwise x[i] = y[j] and no d(x, y) = 0 x = y (isolation) operation is necessary. The final distance value d x, y of x , ( ) d(x, y) = d(y, x) (symmetry). and y is obtained by starting the recursion with i = x and j j j = y . A function d is a metric distance if additionally Inj practice,j this dynamic programming is realized by keeping intermediate distance values in a table, such that d(x, z) d(x, y) + d(y, z) (triangle inequality). ≤ the edit distance of the substrings x[1:i] and y[1:j] can be ef- ficiently determined using results of previous iterations. To To transform a string x into another string y, we require illustrate this concept, we consider the strings x = ababba a set of possible edit operations that provide the basis for and y = bababb and construct a corresponding table.