

SIMILARITY MEASURES FOR SEQUENTIAL DATA

Konrad Rieck
Technische Universität Berlin, Germany

This is a preprint of an article published in Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery – http://wires.wiley.com/dmkd

ABSTRACT

Expressive comparison of strings is a prerequisite for the analysis of sequential data in many areas of computer science. However, comparing strings and assessing their similarity is not a trivial task, and there exist several contrasting approaches for defining similarity measures over sequential data. In this article, we review three major classes of such similarity measures: edit distances, bag-of-words models and string kernels. Each of these classes originates from a particular application domain and models the similarity of strings differently. We present these classes and the underlying comparison concepts in detail, highlight advantages and differences, and provide basic algorithms for practical application.

INTRODUCTION

Strings and sequences are a natural representation of data in many areas of computer science. For example, several applications in bioinformatics are concerned with studying sequences of DNA, many tasks of information retrieval center on the analysis of text documents, and even computer security deals with finding attacks in strings of network packets and file contents. The application of data mining and machine learning in these areas critically depends on the availability of an interface to sequential data which allows for comparing, analyzing and learning in the domain of strings.

Algorithms for data mining and machine learning have traditionally been designed for vectorial data, whereas strings are variable-length objects that do not fit into the rigid representation of vectors. However, a large body of learning algorithms rests on the analysis of pairwise relationships between objects, which imposes a looser constraint on the type of data that can be handled. For instance, several learning methods, such as nearest-neighbor classification, linkage clustering and multi-dimensional scaling, solely depend on computing distances between objects. This implicit abstraction from vectorial representations can be exploited for applying machine learning to sequential data. Instead of operating in the domain of vectors, the learning methods are applied to pairwise relationships of sequential data, such as distances, angles or kernels defined over strings. As a consequence, a powerful yet intuitive interface for learning with sequential data can be established, which is rooted in assessing the similarity or dissimilarity of strings.

Comparing strings and assessing their similarity is far from trivial, as sequential data is usually characterized by complex and rich content. As an example, this article itself is a string of data, and comparing it with related texts is clearly an involved task. As a consequence, several contrasting approaches have been proposed for the comparison of strings, reflecting particular application domains and emphasizing different aspects of strings.

In this article, we review three major classes of similarity measures for sequential data which have been widely studied in data mining and machine learning. First, we introduce the edit distance and related measures, which derive from early telecommunication and model similarity by transforming strings into each other. We then proceed to bag-of-words models, which originate from information retrieval and implement the comparison of strings by embedding sequential data in a vector space. Finally, we present string kernels, a recent class of similarity measures based on the paradigm of kernel-based learning, which allows for designing feature spaces for comparison with almost arbitrary complexity. For each of the three classes of similarity measures, we discuss the underlying concepts, highlight advantages and differences to related techniques, and introduce basic algorithms for application in practice.
NOTATION

Before presenting similarity measures in detail, we introduce some basic notation. Let $\mathcal{A}$ be a set of symbols denoted as the alphabet, such as the characters in text or the bases in DNA. A string $x$ is then a concatenation of symbols from the alphabet $\mathcal{A}$. Moreover, we denote the set of all strings by $\mathcal{A}^*$ and refer to the set of strings with length $n$ as $\mathcal{A}^n$.

A function $f$ for the comparison of two strings takes the form $f : \mathcal{A}^* \times \mathcal{A}^* \to \mathbb{R}$, where for (dis)similarity measures the returned quantity increases with the pairwise (dis)similarity of strings. For simplicity, we use the term similarity measure synonymously with dissimilarity measure, as both types of functions mainly differ in the direction of comparison. To navigate within a string x, we address the i-th symbol of x by x[i] and refer to the symbols from position i to j as a substring x[i:j] of x. Finally, we denote the empty string by $\varepsilon$ and define |x| to be the length of the string x.

Let us, as an example, consider the string z = cute kitty. Depending on how we define the alphabet $\mathcal{A}$, the string z either corresponds to a concatenation of characters (c ∘ u ∘ t ...) with |z| = 10 or a concatenation of words (cute ∘ kitty) with |z| = 2. Thus, by selecting the alphabet of strings, we are able to refine the scope of the analysis depending on the application at hand. In the following, we restrict ourselves to the simple alphabet $\mathcal{A} = \{a, b\}$ and use the strings x = ababba and y = bababb as examples throughout the article. However, all of the presented similarity measures are applicable to arbitrary alphabets, which may range from binary symbols in telecommunication to huge dictionaries for natural language processing.

EDIT DISTANCE

As the first class of similarity measures for sequential data, we introduce the edit distance and variants thereof. The idea underlying the edit distance is to assess the similarity of two strings by transforming one string into the other and determining the minimal costs for this transformation. Strictly speaking, the edit distance is a distance and thus a dissimilarity measure, which returns zero if no transformation is required and positive values depending on the amount of changes otherwise. A generic definition of distances for strings is provided in Box 1.

Box 1. A distance for strings is a function comparing two strings and returning a numeric measure of their dissimilarity. Formally, a function $d : \mathcal{A}^* \times \mathcal{A}^* \to \mathbb{R}$ is a distance if and only if for all $x, y, z \in \mathcal{A}^*$ it holds that

$$d(x, y) \geq 0 \quad \text{(non-negativity)}$$
$$d(x, y) = 0 \Leftrightarrow x = y \quad \text{(isolation)}$$
$$d(x, y) = d(y, x) \quad \text{(symmetry)}.$$

A function $d$ is a metric distance if additionally

$$d(x, z) \leq d(x, y) + d(y, z) \quad \text{(triangle inequality)}.$$

To transform a string x into another string y, we require a set of possible edit operations that provide the basis for the transformation of strings. In the original formulation of the edit distance, these operations are defined as

(a) the insertion of a symbol into x,
(b) the deletion of a symbol from x and
(c) the substitution of a symbol of x with a symbol of y.

A transformation of strings can now be defined as a sequence of edit operations, where the costs correspond to the length of this sequence. Assessing the similarity of two strings x and y thus amounts to determining the shortest sequence of edit operations that transforms x to y. In this view, the edit distance d(x, y) is simply the length of this shortest sequence and corresponds to the minimal number of edit operations required to change x into y.

Formally, the edit distance d(x, y) for two strings x and y can be defined recursively, starting with small substrings and continuing until the shortest transformation from x to y has been discovered. As base cases of this recursive definition, we have

$$d(\varepsilon, \varepsilon) = 0, \quad d(x, \varepsilon) = |x|, \quad d(\varepsilon, y) = |y|,$$

where for the first base case the distance between empty strings is zero and in the other two cases the shortest sequence of operations corresponds to the length of the non-empty string. In all other cases, the edit distance d(x, y) is defined recursively by

$$d(x[1{:}i],\, y[1{:}j]) = \min \begin{cases} d(x[1{:}i-1],\, y[1{:}j]) + 1 \\ d(x[1{:}i],\, y[1{:}j-1]) + 1 \\ d(x[1{:}i-1],\, y[1{:}j-1]) + m(i, j) \end{cases}$$

in which the indicator function m(i, j) is 1 for a mismatch of symbols x[i] ≠ y[j], and 0 for a match of symbols x[i] = y[j].
This recursive definition builds on the technique of dynamic programming: the distance computation for x[1:i] and y[1:j] is solved by considering the distances of smaller substrings for each edit operation. In the first case, the symbol x[i] is discarded, which corresponds to deleting a symbol from x. The second case matches an insertion, as the last symbol y[j] is omitted. If x[i] ≠ y[j], the third case reflects a substitution, whereas otherwise x[i] = y[j] and no operation is necessary. The final distance value d(x, y) of x and y is obtained by starting the recursion with i = |x| and j = |y|. In practice, this is realized by keeping intermediate distance values in a table, such that the edit distance of the substrings x[1:i] and y[1:j] can be efficiently determined using results of previous iterations. To illustrate this concept, we consider the strings x = ababba and y = bababb and construct a corresponding table.


             a   b   a   b   b   a
         0   1   2   3   4   5   6
     b   1   1   1   2   3   4   5
     a   2   1   2   1   2   3   4
     b   3   2   1   2   1   2   3
     a   4   3   2   1   2   2   2
     b   5   4   3   2   1   2   3
     b   6   5   4   3   2   1   2

(The columns correspond to the symbols of x = ababba and the rows to the symbols of y = bababb.)

During the computation of the edit distance, the table is filled from the upper left to the lower right corner. The cell (i, j) contains the distance value d(x[1:i], y[1:j]), which is computed using the elements (i, j−1), (i−1, j) and (i−1, j−1) determined in previous iterations. In our example, we get d(x, y) = 2, as the string x can be transformed to y by inserting "b" at its beginning and removing its last symbol "a". The corresponding shortest sequence of edit operations can be obtained by traversing back from the lower right to the upper left corner of the table using an additional table of back pointers. The respective path in the above table is indicated by bold face font in the original article.

The run-time complexity for computing the edit distance is quadratic in the length of the strings. In particular, (|x| + 1) · (|y| + 1) distance values need to be computed, resulting in a complexity of Θ(|x| · |y|). While there exist techniques to accelerate the computation below quadratic complexity (Masek and Patterson, 1980), no linear-time algorithm is known for calculating the edit distance. The computational effort, however, often pays off, as not only a distance but a transformation is returned, which allows for insights into the nature of the compared strings.
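The dynamic programming scheme just illustrated translates directly into code. The following Python sketch is only a minimal illustration of the recursion with unit costs for all three edit operations; the function name and the table layout are our own choices and not part of the original article.

    def edit_distance(x, y):
        """Edit distance d(x, y) by dynamic programming.

        A table of size (|x|+1) x (|y|+1) holds the distances of all
        prefix pairs; cell (i, j) corresponds to d(x[1:i], y[1:j]).
        """
        n, m = len(x), len(y)
        d = [[0] * (m + 1) for _ in range(n + 1)]
        for i in range(n + 1):
            d[i][0] = i                      # delete all symbols of x[1:i]
        for j in range(m + 1):
            d[0][j] = j                      # insert all symbols of y[1:j]
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                mismatch = 0 if x[i - 1] == y[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,             # deletion
                              d[i][j - 1] + 1,             # insertion
                              d[i - 1][j - 1] + mismatch)  # substitution or match
        return d[n][m]

    print(edit_distance("ababba", "bababb"))  # -> 2

The two nested loops directly mirror the Θ(|x| · |y|) run-time discussed above.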
The edit distance has first been studied by Levenshtein (1966) as a generalization of the Hamming distance—a preceding distance restricted to strings of equal length (Hamming, 1950). The seminal concept of describing similarity in terms of edit operations has been extended in a variety of ways, for example, by the incorporation of different cost functions and weightings for operations, symbols and strings. A discussion of several variants is provided by Sankoff and Kruskal (1983). Furthermore, the technique of sequence alignment is closely related to the edit distance. Although alignments focus on similarities of strings, the underlying framework of dynamic programming is identical in both settings (Needleman and Wunsch, 1970, Smith and Waterman, 1981, Doolittle, 1986). A comprehensive discussion of edit distances and sequence alignments along with implementations is provided in the book of Gusfield (1997).

BAG-OF-WORDS MODELS

From the edit distance, we turn to another concept for the comparison of strings widely used in data mining applications: bag-of-words models. In these models, sequential data x is characterized using a predefined set of strings, such as a dictionary of words or fixed terms. We herein refer to this set as an embedding language $L \subseteq \mathcal{A}^*$ and to a string $w \in L$ as a word of L. The set L enables embedding strings in a vector space, such that vectorial similarity measures can be applied for the analysis of sequential data. Before studying this embedding and the respective similarity measures in detail, we present two definitions of embedding languages frequently used in previous work: words and n-grams.

The main representation of data in information retrieval and natural language processing are strings of regular text. Thus, L is usually defined as the words of a natural language, such as English or Esperanto. In this setting, L is either given explicitly by providing a dictionary of terms or implicitly by partitioning strings according to a set of delimiter symbols $D \subset \mathcal{A}$, such that

$$L = (\mathcal{A} \setminus D)^*$$

where L corresponds to all possible concatenations of non-delimiter symbols. Based on this definition, the contents of a string can be described in terms of the contained words of L, hence the term "bag of words".

Models based on words are intuitive if strings derive from natural language text or similarly structured sequences. In several applications, however, the structure underlying the strings is unknown and hence no delimiters can be defined a priori, for example, in the analysis of DNA and protein sequences. An alternative technique for defining an embedding language L is to move a sliding window of length n over each string and extract n-grams (substrings of length n). Formally, this embedding language can be defined as

$$L = \mathcal{A}^n,$$

where $\mathcal{A}^n$ is the set of all possible concatenations with length n. The contents of a string are then characterized in terms of contained n-grams and, in reference to the original bag-of-words models, referred to as a "bag of n-grams".

Equipped with an appropriate embedding language L, we are ready to embed strings in a vector space. Specifically, a string x is mapped to an |L|-dimensional vector space spanned by the words of L using a function φ defined as

$$\phi : \mathcal{A}^* \to \mathbb{R}^{|L|}, \quad \phi : x \mapsto (\phi_w(x))_{w \in L}$$

where $\phi_w(x)$ returns the number of occurrences of the word w in the string x. In practice, the mapping φ is often refined to particular application contexts. For example, $\phi_w(x)$ may be alternatively defined as a frequency, probability or binary flag for the occurrences of w in x. Additionally, a weighting of individual words can be introduced to the embedding to account for irrelevant terms in strings (Salton and McGill, 1986).
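Both embedding languages can be realized in a few lines of code. The sketch below is a minimal illustration assuming a dictionary-based sparse representation of φ; the helper names embed_words and embed_ngrams are hypothetical, and only the counting variant of φ_w is shown.

    import re
    from collections import Counter

    def embed_words(x, delimiters=" .,;:\n"):
        """phi(x) for the word embedding language L = (A \\ D)*."""
        parts = re.split("[" + re.escape(delimiters) + "]+", x)
        return Counter(w for w in parts if w)

    def embed_ngrams(x, n):
        """phi(x) for the n-gram embedding language L = A^n."""
        return Counter(x[i:i + n] for i in range(len(x) - n + 1))

    # The example strings embedded with 3-grams:
    print(embed_ngrams("ababba", 3))  # Counter({'aba': 1, 'bab': 1, 'abb': 1, 'bba': 1})
    print(embed_ngrams("bababb", 3))  # Counter({'bab': 2, 'aba': 1, 'abb': 1})

Storing only the non-zero counts reflects the sparsity discussed below: the dictionary holds at most |x| entries, regardless of the dimensionality of the induced vector space.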


Besides the capability to adapt the embedding languages to a particular application domain, the proposed embedding brings about another advantage over other similarity measures, such as the edit distance. Several learning methods can be explicitly formulated in the induced vector space, such that not only the input but also the learned models are expressed using the map φ. For example, the centers of k-means clustering and the weights of linear classification can be directly phrased as linear combinations of embedded strings, thereby providing a transparent and efficient representation of learned concepts (Sonnenburg et al., 2007).

The vector space induced by words and n-grams is high-dimensional; that is, for a typical alphabet with around 50 symbols an embedding using 4-grams induces $50^4$ = 6,250,000 distinct dimensions. Computing and comparing vectors in such high-dimensional spaces seems intractable at first glance. However, for both types of languages, words and n-grams, the number of words contained in a string x is linear in its length. Consequently, the vector φ(x) contains at most |x| non-zero dimensions—regardless of the actual dimensionality of the vector space. This sparsity can be exploited to design very efficient techniques for the embedding and comparison of strings, for example, using data structures such as sorted arrays, hashes and suffix trees (Rieck and Laskov, 2008). The run-time complexity for a comparison is O(|x| + |y|), such that embeddings using words and n-grams are a technique of choice in many real-time applications.

The idea of mapping text to vectors using a "bag of words" has first been introduced by Salton et al. (1975) under the term vector space model. Several extensions of this influential work have been studied for information retrieval and natural language processing using different alphabets, similarity measures and learning methods (Salton and McGill, 1986, Joachims, 2002). The concept of using n-grams for modeling strings evolved concurrently to this work, for example for the analysis of textual documents and language structure (Suen, 1979, Damashek, 1995, Robertson and Willett, 1998). The ease of mapping strings to vectors and the subsequent application of vectorial similarity measures also influenced other fields of computer science, such that there are several related applications in the fields of bioinformatics (Leslie et al., 2002, Sonnenburg et al., 2007) and computer security (Liao and Vemuri, 2002, Hofmeyr et al., 1998, Rieck and Laskov, 2007).

As an example, let us consider the strings x = ababba and y = bababb. If we choose the embedding language as $L = \mathcal{A}^3$ and define $\phi_w(x)$ to be the number of occurrences of w in x, we obtain the following vectors for the strings x and y:

              aaa  aab  aba  abb  baa  bab  bba  bbb
    φ(x) = (   0    0    1    1    0    1    1    0  )
    φ(y) = (   0    0    1    1    0    2    0    0  )

The strings are mapped to 8-dimensional vectors associated with all 3-grams of the alphabet $\mathcal{A} = \{a, b\}$. While the edit distance compares the global structure of strings, the embedding characterizes sequential data with a local scope. The substrings aba and abb are shared by both strings and reflected in particular dimensions—irrespective of their position and surrounding in the two strings. Similarly, the absence of the substring bba in y and the repeated occurrences of the substring bab in y are modeled in a local manner. This characterization of local similarity provides the basis for the embedding of strings and enables effective comparison by means of vectorial similarity measures.
A large variety of similarity and dissimilarity measures is defined for vectors of real numbers, such as various distance functions, kernel functions and similarity coefficients. Most of these measures can be extended to handle sequential data using the mapping φ. For example, we can express the Euclidean distance (also denoted as the $\ell_2$ distance) as follows

$$d_{\ell_2}(x, y) = \sqrt{\sum_{w \in L} |\phi_w(x) - \phi_w(y)|^2}.$$

Similarly, we can adapt the Manhattan distance (also denoted as the $\ell_1$ distance) to operate over embedded strings,

$$d_{\ell_1}(x, y) = \sum_{w \in L} |\phi_w(x) - \phi_w(y)|,$$

and define the cosine similarity measure used in information retrieval for the comparison of strings by

$$s_{\cos}(x, y) = \frac{\sum_{w \in L} \phi_w(x) \cdot \phi_w(y)}{\sqrt{\sum_{w \in L} \phi_w(x)^2} \cdot \sqrt{\sum_{w \in L} \phi_w(y)^2}}.$$

An extensive list of vectorial similarity and dissimilarity measures suitable for application to sequential data is provided by Rieck and Laskov (2008).
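Owing to the sparsity of the embeddings, these measures can be evaluated by iterating only over the words that actually occur in the two strings. The following sketch assumes the dictionary-based embeddings from the earlier example; the values noted in the comments refer to the example strings x = ababba and y = bababb embedded with 3-grams.

    from collections import Counter
    from math import sqrt

    def euclidean(phi_x, phi_y):
        """d_l2(x, y) over two sparse embeddings given as Counters."""
        words = set(phi_x) | set(phi_y)
        return sqrt(sum((phi_x[w] - phi_y[w]) ** 2 for w in words))

    def manhattan(phi_x, phi_y):
        """d_l1(x, y) over two sparse embeddings."""
        words = set(phi_x) | set(phi_y)
        return sum(abs(phi_x[w] - phi_y[w]) for w in words)

    def cosine(phi_x, phi_y):
        """s_cos(x, y) over two sparse embeddings."""
        dot = sum(phi_x[w] * phi_y[w] for w in set(phi_x) & set(phi_y))
        norm_x = sqrt(sum(c * c for c in phi_x.values()))
        norm_y = sqrt(sum(c * c for c in phi_y.values()))
        return dot / (norm_x * norm_y)

    phi_x = Counter({'aba': 1, 'abb': 1, 'bab': 1, 'bba': 1})  # phi(ababba), 3-grams
    phi_y = Counter({'aba': 1, 'abb': 1, 'bab': 2})            # phi(bababb), 3-grams
    print(euclidean(phi_x, phi_y))  # sqrt(2) ~ 1.414
    print(manhattan(phi_x, phi_y))  # 2
    print(cosine(phi_x, phi_y))     # 4 / (2 * sqrt(6)) ~ 0.816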

STRING KERNELS

As the last and most recent class of similarity measures for sequential data, we consider string kernels. At first sight, a string kernel is a regular similarity measure for sequential data—though, as we will see shortly, string kernels reach far beyond regular measures of similarity. In contrast to similarity measures specifically designed for the comparison of sequential data, string kernels rest on a mathematical motivation—the concept of kernel functions and the respective field of kernel-based learning. Hence, we first need to explore some underlying theory before moving on to practical kernels for strings.

Box 2. A kernel for strings is a function comparing two strings and returning a numeric measure of their similarity. Formally, a function $k : \mathcal{A}^* \times \mathcal{A}^* \to \mathbb{R}$ is a kernel if and only if for any finite subset of strings $\{x_1, \dots, x_n\} \subset \mathcal{A}^*$ the function k is symmetric and positive semi-definite, that is,

$$\sum_{i,j=1}^{n} c_i c_j \, k(x_i, x_j) \geq 0 \quad \text{for all } c_1, \dots, c_n \in \mathbb{R}.$$

Note that a kernel is always associated with a (possibly implicit) feature space and corresponds to an inner product

$$k(x, y) = \langle \hat{\varphi}(x), \hat{\varphi}(y) \rangle$$

where $\hat{\varphi}$ is a map from strings to the feature space.

Formally, a kernel function, or short kernel, is a function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ that assesses the similarity of objects from a domain $\mathcal{X}$ and is linked with a feature space $\mathcal{F}$—a special form of a vector space (Schölkopf and Smola, 2002). A brief definition of string kernels is provided in Box 2. Simply put, any kernel function k induces a mapping $\hat{\varphi} : \mathcal{X} \to \mathcal{F}$ from the domain $\mathcal{X}$ to the feature space $\mathcal{F}$, where the kernel k equals an inner product in $\mathcal{F}$, that is,

$$k(x, y) = \langle \hat{\varphi}(x), \hat{\varphi}(y) \rangle.$$

This association between a kernel and an inner product allows for establishing a generic interface to the geometry in the feature space $\mathcal{F}$, independent of the type and structure of the input domain $\mathcal{X}$. To illustrate this interface, let us consider some examples, where we assume that the mapping $\hat{\varphi}$ simply embeds strings in $\mathcal{F}$ without giving any details on how this mapping is carried out. We will come back to this point shortly.

As a first example, the length ($\ell_2$-norm) of a vector corresponding to a string x can be expressed in terms of kernels as follows

$$\|\hat{\varphi}(x)\| = \sqrt{k(x, x)}.$$

Similar to the length, the Euclidean distance between two strings x and y in the feature space $\mathcal{F}$ can be formulated using kernels by

$$\|\hat{\varphi}(x) - \hat{\varphi}(y)\| = \sqrt{k(x, x) - 2k(x, y) + k(y, y)}.$$

Moreover, also the angle $\theta_{x,y}$ between two strings x and y embedded in the feature space $\mathcal{F}$ can be solely expressed in terms of kernel functions, using the inverse cosine function, as follows:

$$\theta_{x,y} = \arccos \frac{k(x, y)}{\sqrt{k(x, x) \cdot k(y, y)}}.$$

These examples demonstrate how geometric quantities corresponding to distances, angles and norms in feature space can be defined solely in terms of kernel functions. Several learning methods infer dependencies of data using geometry, such as separating hyperplanes, enclosing hyperspheres or sets of descriptive directions. By virtue of kernels, all these geometric models can be formulated independent of the particular input data $\mathcal{X}$, which builds the basis for kernel-based learning and respective methods, such as support vector machines and kernel principal component analysis (Müller et al., 2001, Schölkopf and Smola, 2002).

It is evident that the mapping $\hat{\varphi}$ is related to the function φ introduced for embedding strings in vector spaces and, clearly, by defining $\hat{\varphi} := \phi$, we obtain kernel functions for bag-of-words models. However, the mapping $\hat{\varphi}$ underlying a kernel is more general: any similarity measure with the properties of being symmetric and positive semi-definite is a kernel and corresponds to an inner product in a feature space—even if the mapping $\hat{\varphi}$ and the space $\mathcal{F}$ are totally unknown. As a result, we gain the ability to design similarity measures for strings which operate in implicit feature spaces and thereby combine the advantages of access to geometry and an expressive characterization of strings.
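The geometric quantities above can be evaluated for any kernel without ever constructing the feature space explicitly. The sketch below is only an illustration: it plugs in a bag-of-3-grams dot product as an exemplary kernel (any symmetric, positive semi-definite function over strings could be substituted) and computes norm, distance and angle purely from kernel values.

    from collections import Counter
    from math import sqrt, acos

    def k(x, y, n=3):
        """An exemplary string kernel: inner product of n-gram count vectors."""
        phi_x = Counter(x[i:i + n] for i in range(len(x) - n + 1))
        phi_y = Counter(y[i:i + n] for i in range(len(y) - n + 1))
        return sum(phi_x[z] * phi_y[z] for z in set(phi_x) & set(phi_y))

    def norm(x):
        """Length of phi(x) in feature space: sqrt(k(x, x))."""
        return sqrt(k(x, x))

    def distance(x, y):
        """Euclidean distance in feature space, expressed via kernel values only."""
        return sqrt(k(x, x) - 2 * k(x, y) + k(y, y))

    def angle(x, y):
        """Angle between phi(x) and phi(y), via the inverse cosine."""
        return acos(k(x, y) / (norm(x) * norm(y)))

    x, y = "ababba", "bababb"
    print(norm(x), distance(x, y), angle(x, y))

Replacing the body of k with a different kernel leaves the geometric computations untouched—this separation is exactly the interface exploited by kernel-based learning methods.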
On the basis of this generic definition of kernel functions, a large variety of string kernels has been devised, ranging from kernels for words (Joachims, 1999) and n-grams (Leslie et al., 2002) to involved string comparison with alignments (Watkins, 2000), gaps (Lodhi et al., 2002), mismatches (Leslie et al., 2003) and wild cards (Leslie and Kuang, 2004). From this wide range of kernels, we select the all-substring kernel (Vishwanathan and Smola, 2004) for presentation, as it generalizes the idea of characterizing strings using n-grams to all possible substrings, while preserving a linear-time comparison. Formally, the all-substring kernel can be defined as follows

$$k(x, y) = \langle \hat{\varphi}(x), \hat{\varphi}(y) \rangle = \sum_{z \in \mathcal{A}^*} \hat{\varphi}_z(x) \cdot \hat{\varphi}_z(y),$$

where $\hat{\varphi}_z(x)$ returns the number of occurrences of the substring z in the string x and the function $\hat{\varphi}$ maps strings to a feature space $\mathcal{F} = \mathbb{R}^\infty$ spanned by all possible strings $\mathcal{A}^*$.

In this generic setting, a string x may contain $(|x|^2 + |x|)/2$ different substrings z, and explicitly accessing all values $\hat{\varphi}_z(x)$ is impossible in linear time. Thus, techniques for the linear-time comparison of sparse vectors, as presented for words and n-grams, are not applicable. The key to efficiently computing the all-substring kernel is implicitly accessing the feature space $\mathcal{F}$ using two data structures: a suffix tree and a matching statistic.
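Before turning to the efficient computation, the definition can be made tangible by a direct, brute-force evaluation: enumerate all substrings of both strings, count their occurrences and sum the products. The sketch below is deliberately quadratic in the string lengths and is not the linear-time algorithm of Vishwanathan and Smola (2004) described in the following.

    from collections import Counter

    def all_substrings(x):
        """Counts of all non-empty substrings of x (at most (|x|^2 + |x|)/2 many)."""
        return Counter(x[i:j] for i in range(len(x))
                              for j in range(i + 1, len(x) + 1))

    def all_substring_kernel(x, y):
        """k(x, y): sum over all z of occurrences in x times occurrences in y."""
        phi_x, phi_y = all_substrings(x), all_substrings(y)
        return sum(phi_x[z] * phi_y[z] for z in set(phi_x) & set(phi_y))

    print(all_substring_kernel("ababba", "bababb"))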


A suffix tree $S_x$ is a tree containing all suffixes of the string x. Every path from the root to a leaf corresponds to one suffix, where edges are labeled with substrings. Suffixes that share the same prefix initially pass along the same edges and nodes. If one path from the root is a suffix of another path, the respective end nodes are connected by a so-called suffix link—a shortcut for traversing the tree. A suffix tree comprises a quadratic amount of information, namely all suffixes of x, in a linear representation and can be constructed in Θ(|x|) time and space (Gusfield, 1997).

A matching statistic $M_y$ is a compact representation of all matches between two strings x and y. The statistic is computed in Θ(|y|) time by traversing the suffix tree $S_x$ with the symbols of y, making use of suffix links on mismatches (Chang and Lawler, 1994). The statistic $M_y$ contains only two vectors n and v of length |y|, where $n_i$ reflects the length of the longest substring of x matching at position i of y, and $v_i$ the node in $S_x$ directly below this match.

Additionally, we augment the nodes of the suffix tree $S_x$ with three fields. For each node v, we store the number of symbols on the path to v in d(v) and the number of leaves below v in l(v). Moreover, we recursively pre-compute s(v) for each node: the number of occurrences of all strings starting at the root and terminating on the path to v.

[Figure 1 about here]
(a) Suffix tree $S_x$ for the string x = ababba (tree drawing with nodes a–k omitted).
(b) Matching statistic $M_y$ for the string y = bababb:
        y:  b  a  b  a  b  b
        n:  3  5  4  3  2  1
        v:  h  j  h  k  d  b
Figure 1: Suffix tree and matching statistic for the exemplary strings x and y. The additional symbol indicates the different suffixes of x. For simplicity, only two suffix links (dotted lines) are shown.

Figure 1 shows a suffix tree and a matching statistic for the exemplary strings x and y. The suffix tree $S_x$ contains six leaves corresponding to the suffixes of x. A suffix link connects node f with node b, as the substring terminating at b is a suffix of the path ending at f. Similarly, node e is linked to node c. For simplicity, only these two suffix links are shown in Figure 1; further suffix links have been omitted. The matching statistic $M_y$ contains six columns corresponding to the positions of matches in the string y. For example, at position 3 of $M_y$, we have $n_3 = 4$ and $v_3 = h$, which indicates a match of length 4 between x and y starting at position 3 of string y and ending on the edge to node h in $S_x$. Obviously, this match is babb.
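For illustration, the length vector n of the matching statistic can also be obtained by brute force, extending the match at every position of y for as long as the current substring still occurs in x. The sketch below is quadratic rather than Θ(|y|), ignores the nodes v, and merely reproduces the values of Figure 1.

    def matching_lengths(x, y):
        """n[i]: length of the longest substring of x matching y at position i."""
        n = []
        for i in range(len(y)):
            length = 0
            while i + length < len(y) and y[i:i + length + 1] in x:
                length += 1
            n.append(length)
        return n

    print(matching_lengths("ababba", "bababb"))  # -> [3, 5, 4, 3, 2, 1]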
Based on the suffix tree $S_x$ and the matching statistic $M_y$, the all-substring kernel can be computed by passing over $M_y$ as follows

$$k(x, y) = \sum_{z \in \mathcal{A}^*} \hat{\varphi}_z(x) \cdot \hat{\varphi}_z(y) = \sum_{i=1}^{|y|} \big[ s(v_i) - l(v_i) \cdot (n_i - d(v_i)) \big].$$

To understand how this computation works, we first note that substrings z present in only one of x and y do not contribute to the kernel, as either $\hat{\varphi}_z(x)$ or $\hat{\varphi}_z(y)$ is zero. Hence, the computation can be restricted to the shared substrings reflected in the matching statistic $M_y$. Let us now consider a match at position i of $M_y$: If the match terminates exactly at the node $v_i$, we have $n_i - d(v_i) = 0$ and $s(v_i)$ corresponds to the occurrences of all shared substrings in x along this path. If the match terminates on an edge, however, $s(v_i)$ is slightly too large and we need to discount all occurrences of substrings below the match by subtracting $l(v_i) \cdot (n_i - d(v_i))$. By finally adding up all occurrences of shared substrings in x, we arrive at the all-substring kernel k(x, y).

The run-time for the kernel computation is linear in the lengths of the strings with Θ(|x| + |y|), as first a suffix tree is constructed and then a matching statistic is computed.

Although the induced feature space has infinite dimension and there exist further involved weightings of substrings (Vishwanathan and Smola, 2004), the all-substring kernel can generally be computed in linear time—a capability shared with many string kernels and rooted in the implicit access to the feature space based on advanced data structures for string comparison.

Due to their versatility and ease of incorporation with kernel-based learning, string kernels have gained considerable attention in research. A large variety of kernels has been developed, starting from the first realizations of Haussler (1999) and Watkins (2000), and extending to domain-specific variants, such as string kernels designed for natural language processing (Joachims, 1999, Lodhi et al., 2002) and bioinformatics (Zien et al., 2000, Leslie et al., 2002). The presented all-substring kernel has been proposed by Vishwanathan and Smola (2004).



                          Edit distance          Bag of words           All-substring kernel
  Run-time complexity     O(|x| · |y|)           O(|x| + |y|)           O(|x| + |y|)
  Implementation          dynamic programming    sparse vectors         suffix trees
  Granularity             characters             words                  substrings
  Comparison              constructive           descriptive            descriptive
  Typical applications    spell checking         text classification    DNA analysis

Table 1: Comparison of similarity measures for sequential data.

The challenge of uncovering structure in DNA has influenced the further advancement of string kernels by incorporating mismatches, gaps and wild cards (Leslie et al., 2003, Leslie and Kuang, 2004). Additionally, string kernels based on generative models (Jaakkola et al., 2000, Tsuda et al., 2002), position-dependent matching (Rätsch et al., 2005, Sonnenburg et al., 2006) and sequence alignments (Vert et al., 2004, Cuturi et al., 2007) have been devised in bioinformatics. An extensive discussion of several string kernels and applications is provided by Shawe-Taylor and Cristianini (2004) and Sonnenburg et al. (2007).

COMPARISON OF SIMILARITY MEASURES

As we have seen in the previous sections, there exist different approaches for assessing the similarity of strings. Each of the presented similarity measures compares strings differently and emphasizes other aspects of the sequential data. Consequently, the choice of a similarity measure in practice depends on the contents of the strings as well as the underlying application. As a first guide for the practitioner, Table 1 compares the presented similarity measures and summarizes relevant properties.

The edit distance is the similarity measure of choice if the sequential data exhibits minor perturbations, such as spelling mistakes or typographical errors. As the comparison is constructive, the edit distance does not only assess similarity, but also provides a trace of operations to transform one string into the other. Hence, the edit distance is often applied for comparing and correcting text strings, for example, in applications of spell checking and optical character recognition (Sankoff and Kruskal, 1983). The constructive comparison, however, comes at a price. The run-time complexity of the edit distance is not linear in the length of the strings, such that the comparison of longer strings is often intractable.

The bag-of-words model comes in particularly handy when analysing natural language text and structured data. The comparison of strings can be controlled using the set of considered words, which allows for adapting the model to the particular context of an application. As a result, bag-of-words models are prevalent in information retrieval and respective applications, such as the searching of web pages, classification of text documents or analysis of log files (Salton and McGill, 1986). Moreover, the bag-of-words model enables a linear-time comparison of strings, where its implementation using sparse vectors is straightforward (Rieck and Laskov, 2008).

The ability to efficiently operate in complex feature spaces finally renders string kernels preferable when learning with sequential data of involved structure. String kernels can be specifically designed for a learning task and equipped with various extensions, such as position-dependent comparison, inexact matching and wild-card symbols (Shawe-Taylor and Cristianini, 2004, Sonnenburg et al., 2007). Due to this flexibility, the majority of string kernels has been designed for the analysis of DNA and protein sequences, which possess highly complex and largely unknown structure. String kernels are the method of choice if simpler similarity measures fail to capture the structure of strings sufficiently.

CONCLUSIONS

In this article we have studied the comparison of strings and corresponding similarity measures for sequential data. We have reviewed three major classes of such measures, with each modeling similarity differently and emphasizing particular aspects of strings. Starting with the edit distance and reaching over to bag-of-words models and string kernels, we have derived different concepts for assessing the similarity of strings along with their implementations, peculiarities and advantages in practice.

The abstraction realized by analysing pairwise relationships provides a generic interface to structured data beyond strings and sequences.
While we have explored different ways for integrating strings into data mining and machine learning, there exist several related approaches modeling the similarity and dissimilarity of other non-vectorial data. For example, kernels and distances for trees (Collins and Duffy, 2002, Rieck et al., 2010) and graphs (Gärtner et al., 2004, Vishwanathan et al., 2009) are one strain of development in this field. This divergence of research ultimately provides the basis for tackling challenging problems and designing novel methods that benefit from both worlds: machine learning and the comparison of structured data.


ACKNOWLEDGMENTS

The author would like to thank Tammo Krueger, Sören Sonnenburg and Klaus-Robert Müller for fruitful discussions and support. Moreover, the author acknowledges funding from the Bundesministerium für Bildung und Forschung under the project REMIND (FKZ 01-IS07007A).

REFERENCES

W. I. Chang and E. L. Lawler. Sublinear approximate string matching and biological applications. Algorithmica, 12(4/5):327–344, 1994.

M. Collins and N. Duffy. Convolution kernels for natural language. In Advances in Neural Information Processing Systems (NIPS), volume 16, pages 625–632, 2002.

M. Cuturi, J.-P. Vert, and T. Matsui. A kernel for time series based on global alignments. In Proc. of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2007.

M. Damashek. Gauging similarity with n-grams: Language-independent categorization of text. Science, 267(5199):843–848, 1995.

R. Doolittle. Of Urfs and Orfs: A Primer on How to Analyse Derived Amino Acid Sequences. University Science Books, 1986.

T. Gärtner, J. Lloyd, and P. Flach. Kernels and distances for structured data. Machine Learning, 57(3):205–232, 2004.

D. Gusfield. Algorithms on strings, trees, and sequences. Cambridge University Press, 1997.

R. W. Hamming. Error-detecting and error-correcting codes. Bell System Technical Journal, 29(2):147–160, 1950.

D. Haussler. Convolution kernels on discrete structures. Technical Report UCSC-CRL-99-10, UC Santa Cruz, July 1999.

S. Hofmeyr, S. Forrest, and A. Somayaji. Intrusion detection using sequences of system calls. Journal of Computer Security, 6(3):151–180, 1998.

T. Jaakkola, M. Diekhans, and D. Haussler. A discriminative framework for detecting remote protein homologies. J. Comp. Biol., 7:95–114, 2000.

T. Joachims. Making large-scale SVM learning practical. In B. Schölkopf, C. Burges, and A. Smola, editors, Advances in Kernel Methods — Support Vector Learning, pages 169–184, Cambridge, MA, 1999. MIT Press.

T. Joachims. Learning to classify text using support vector machines. Kluwer, 2002.

C. Leslie and R. Kuang. Fast string kernels using inexact matching for protein sequences. Journal of Machine Learning Research, 5:1435–1455, 2004.

C. Leslie, E. Eskin, and W. Noble. The spectrum kernel: A string kernel for SVM protein classification. In Proc. Pacific Symp. Biocomputing, pages 564–575, 2002.

C. Leslie, E. Eskin, A. Cohen, J. Weston, and W. Noble. Mismatch string kernels for discriminative protein classification. Bioinformatics, 1(1):1–10, 2003.

V. I. Levenshtein. Binary codes capable of correcting deletions, insertions, and reversals. Doklady Akademii Nauk SSSR, 163(4):845–848, 1966.

Y. Liao and V. R. Vemuri. Using text categorization techniques for intrusion detection. In Proc. of USENIX Security Symposium, pages 51–59, 2002.

H. Lodhi, C. Saunders, J. Shawe-Taylor, N. Cristianini, and C. Watkins. Text classification using string kernels. Journal of Machine Learning Research, 2:419–444, 2002.

W. Masek and M. Patterson. A faster algorithm for computing string edit distance. Journal of Computer and System Sciences, 20(1):18–31, 1980.

K.-R. Müller, S. Mika, G. Rätsch, K. Tsuda, and B. Schölkopf. An introduction to kernel-based learning algorithms. IEEE Neural Networks, 12(2):181–201, May 2001.

S. Needleman and C. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48:443–453, 1970.

G. Rätsch, S. Sonnenburg, and B. Schölkopf. RASE: recognition of alternatively spliced exons in C. elegans. Bioinformatics, 21:i369–i377, June 2005.

K. Rieck and P. Laskov. Language models for detection of unknown attacks in network traffic. Journal in Computer Virology, 2(4):243–256, 2007.

K. Rieck and P. Laskov. Linear-time computation of similarity measures for sequential data. Journal of Machine Learning Research, 9(Jan):23–48, 2008.

K. Rieck, T. Krueger, U. Brefeld, and K.-R. Müller. Approximate tree kernels. Journal of Machine Learning Research, 11(Feb):555–580, 2010.

A. Robertson and P. Willett. Applications of n-grams in textual information systems. Journal of Documentation, 58(1):48–69, 1998.


G. Salton and M. J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill, 1986.

G. Salton, A. Wong, and C. Yang. A vector space model for automatic indexing. Communications of the ACM, 18(11):613–620, 1975.

D. Sankoff and J. Kruskal. Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison. Addison-Wesley Publishing Co., 1983.

B. Schölkopf and A. Smola. Learning with Kernels. MIT Press, Cambridge, MA, 2002.

J. Shawe-Taylor and N. Cristianini. Kernel methods for pattern analysis. Cambridge University Press, 2004.

T. Smith and M. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147:195–197, 1981.

S. Sonnenburg, A. Zien, and G. Rätsch. ARTS: Accurate Recognition of Transcription Starts in Human. Bioinformatics, 22(14):e472–e480, 2006.

S. Sonnenburg, G. Rätsch, and K. Rieck. Large scale learning with string kernels. In L. Bottou, O. Chapelle, D. DeCoste, and J. Weston, editors, Large Scale Kernel Machines, pages 73–103. MIT Press, Cambridge, MA, 2007.

C. Suen. N-gram statistics for natural language understanding and text processing. IEEE Trans. Pattern Analysis and Machine Intelligence, 1(2):164–172, Apr. 1979.

K. Tsuda, M. Kawanabe, G. Rätsch, S. Sonnenburg, and K. Müller. A new discriminative kernel from probabilistic models. Neural Computation, 14:2397–2414, 2002.

J.-P. Vert, H. Saigo, and T. Akutsu. Kernel Methods in Computational Biology, chapter Local alignment kernels for biological sequences, pages 131–154. MIT Press, 2004.

S. Vishwanathan and A. Smola. Fast kernels for string and tree matching. In K. Tsuda, B. Schölkopf, and J. Vert, editors, Kernels and Bioinformatics, pages 113–130. MIT Press, 2004.

S. Vishwanathan, N. Schraudolph, R. Kondor, and K. Borgwardt. Graph kernels. Journal of Machine Learning Research, 2009. To appear.

C. Watkins. Dynamic alignment kernels. In A. Smola, P. Bartlett, B. Schölkopf, and D. Schuurmans, editors, Advances in Large Margin Classifiers, pages 39–50, Cambridge, MA, 2000. MIT Press.

A. Zien, G. Rätsch, S. Mika, B. Schölkopf, T. Lengauer, and K.-R. Müller. Engineering support vector machine kernels that recognize translation initiation sites in DNA. BioInformatics, 16(9):799–807, Sep 2000.
