
Proceedings of the International MultiConference of Engineers and Computer Scientists 2009 Vol I IMECS 2009, March 18 - 20, 2009, Hong Kong

Searching All Seeds of Strings with Hamming Distance using Finite Automata

Ondřej Guth, Bořivoj Melichar ∗

∗ Czech Technical University in Prague, Faculty of Electrical Engineering, Department of Computer Science and Engineering. Email: {gutho1,melichar}@fel.cvut.cz

Abstract: A seed is a type of regularity of strings. A restricted approximate seed $w$ of a string $T$ is a factor of $T$ such that $w$ covers a superstring of $T$ under some distance rule. In this paper, the problem of all restricted seeds with the smallest Hamming distance is studied, and a polynomial time and space algorithm for solving the problem is presented. It searches for all restricted approximate seeds of a string with a given limited approximation using the Hamming distance, and it computes the smallest distance for each found seed. The solution is based on a finite (suffix) automata approach that provides a straightforward way to design algorithms for many problems in stringology. Therefore, it is shown that the set of problems solvable using finite automata includes the one studied in this paper.

Keywords: approximate seed, suffix automaton, Hamming distance, stringology

1 Introduction

Searching for regularities of strings is used in a wide area of applications such as molecular biology, computer-assisted music analysis, or data compression. By regularities, repeated strings are meant. Examples of regularities include repetitions, borders, periods, covers, and seeds.

The algorithm for computing all exact seeds of a string was introduced by Iliopoulos, Moore, and Park [1]. The first algorithm for searching all seeds using finite automata was introduced by Voráček and Melichar [2].

Finding exact regularities is not always sufficient and thus some kind of approximation is used. An algorithm for searching approximate periods, covers, and seeds under Hamming, Levenshtein (also called edit), and weighted distance was presented by Christodoulakis, Iliopoulos, Park, and Sim [3]. The algorithm for computing approximate seeds was originally introduced by these authors in [4]. An algorithm for searching all covers under Hamming, Levenshtein, and Damerau distance using finite automata was introduced by Guth [5]; an optimized algorithm for computing all covers with the smallest Hamming distance was presented by Guth, Melichar, and Balík [6].

Finite automata provide a common formalism for many algorithms in the area of text processing (stringology), involving forward exact and approximate pattern matching and searching for borders, periods, and repetitions [7], backward pattern matching [8], pattern matching in a compressed text [9], the longest common subsequence [10], exact and approximate 2D pattern matching [11], and the already mentioned computing of approximate covers [5, 6], exact covers [12], and seeds [2] in generalized strings. Therefore, we would like to further extend the set of problems solved using finite automata. Such a problem is studied in this paper.

A finite automaton may be easily implemented. Therefore, using it as a base for a similar approach to many algorithms is not only a theoretical matter, as it may make development of software related to the above mentioned areas easier, faster, and cheaper.

This paper is organized as follows: Section 2 gives basic definitions and an overview of previous work. In Section 3, the algorithm for the problem being studied is presented. Its theoretical time and space complexity is derived in Section 4 and experimental results are shown in Section 5.

2 Preliminaries

Some notions and notations used in this paper are commonly used, thus they are not defined here. Exact definitions of such notions and notations can be found in [6].

A symbol of an alphabet is denoted by $a$. Having string $T = a_1 a_2 \ldots a_{|T|}$, the reversed string is denoted by $T^R$ and it is equal to $T^R = a_{|T|} a_{|T|-1} \ldots a_1$. An effective alphabet of a string $T$ is denoted by $A_T$. Only the effective alphabet is considered in this paper.

Suppose $p, s, u, w, p', T \in A^*$. $T$ is a superstring of $w$ if $T = pws$. $p'$ is an approximate prefix of $T$ with maximum Hamming distance $k$ if $T = pu$ and $D_H(p, p') \le k$. The set of all approximate prefixes of $T$ with maximum Hamming distance $k$ is denoted by $\mathrm{Pref}^k(T)$. An approximate suffix and the set of all approximate suffixes of $T$ with respect to $k$ are defined by analogy; the set is denoted by $\mathrm{Suff}^k(T)$.

We say that string $w$ occurs in string $T$ if $w \in \mathrm{Fact}(T)$.
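
The definitions of the Hamming distance and of approximate prefixes and suffixes can be made concrete with a minimal sketch. The function names `hamming`, `is_approx_prefix`, and `is_approx_suffix` are illustrative assumptions, not names used by the authors.

```python
def hamming(x: str, y: str) -> int:
    """Hamming distance D_H; it is defined for strings of equal length only."""
    if len(x) != len(y):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(a != b for a, b in zip(x, y))


def is_approx_prefix(p: str, T: str, k: int) -> bool:
    """True if p differs from the length-|p| prefix of T in at most k symbols."""
    return len(p) <= len(T) and hamming(p, T[:len(p)]) <= k


def is_approx_suffix(s: str, T: str, k: int) -> bool:
    """True if s differs from the length-|s| suffix of T in at most k symbols."""
    return len(s) <= len(T) and hamming(s, T[len(T) - len(s):]) <= k
```

For example, with $T = bbbbbaaa$ the string $ba$ would be an approximate prefix for $k = 1$ (it differs from $bb$ in one symbol), while $aa$ would not.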

ISBN: 978-988-17012-2-0 IMECS 2009

Factor $w$ occurs at position (or end-position) $i$ in string $T$ if $\forall j \in \{1, \ldots, |w|\}: w[j] = T[i - |w| + j]$. An end-set is the set of all $i$ such that $w$ occurs at position $i$ in $T$. String $w$ occurs approximately with maximum Hamming distance $k$ at position $i$ in string $T$ (or $w$ has an approximate occurrence at position $i$ in $T$) if there exists a factor $x$ that occurs at position $i$ in $T$ and $D_H(x, w) \le k$.

String $w$ is a restricted approximate seed of $T$ with maximum Hamming distance $k$ if $w$ is a factor of $T$ and $w$ is a restricted approximate cover of some superstring of $T$ with maximum Hamming distance $k$.

The smallest Hamming distance of a restricted approximate seed $w$ of $T$ is the smallest possible integer $l_m$ such that $w$ is a restricted approximate seed of $T$ with maximum Hamming distance $l_m$.

A DFA is partial if there may exist a pair of state $q_i$ and symbol $a$ such that $\delta(q_i, a)$ is undefined. In this paper, only partial DFA are considered. A deterministic trie is a DFA that may be represented as a tree, i.e. for each state $q_j$, there exists at most one $q_i$ such that $\delta(q_i, a) = q_j$ for some $a \in A$. An extended transition function, denoted by $\delta^*$, is for a DFA defined for $a \in A$, $u \in A^*$ in this way:

$$\delta^*(q, \varepsilon) = q, \qquad \delta^*(q, ua) = \delta(\delta^*(q, u), a)$$

An extended transition function of an NFA is defined as:

$$\delta^*(q, \varepsilon) = \{q\}, \qquad \delta^*(q, au) = \bigcup_{q_i \in \delta(q, a)} \delta^*(q_i, u)$$

A left language of a state $q$ of a DFA is the set of all strings $w$ for which $\delta^*(q_0, w) = q$ holds for initial state $q_0$. The left language of a state $q$ of a trie contains one string, denoted by $\mathrm{factor}(q)$.

A nondeterministic Hamming suffix automaton for string $T$ and maximum Hamming distance $k$ is denoted by $M^k_{SN}(T)$.

A deterministic suffix automaton for string $T$ and maximum Hamming distance $k$, denoted by $M^k_{SD}(T)$, is a DFA that accepts all strings from $\mathrm{Suff}^k(T)$ (see Figure 5). The depth of a state $q$ of the automaton is the length of the longest string $w$ such that $\delta^*(q_0, w) = q$ for initial state $q_0$. An element of a d-subset is denoted by $r_i$, where the subscript $i$ is the index (order) of the element $r_i$ within the d-subset. In figures, states of nondeterministic automata and elements of d-subsets of deterministic automata are denoted by their depths and levels, e.g. $3''$ means a state or element with depth 3 and level 2.

Problem formulation (All restricted seeds with the smallest Hamming distance). Given string $T$ and maximum Hamming distance $k$, find all restricted approximate seeds of $T$ with respect to $k$ and compute their smallest Hamming distances.

The algorithm for searching (exact) seeds in generalized strings presented in [2] obviously works for (non-generalized) strings as well (as a string is a special case of a generalized string). It is based on the following idea. First, $M_{SN}(T)$ is constructed. The equivalent deterministic automaton $M_{SD}(T) = (Q, A_T, \delta, q_0, F)$ is computed using subset construction. One of the conditions for any factor to be a seed of $T$ is its length. Seed $w$ must cover the central part of $T$ (i.e. the part of $T$ between the leftmost and the rightmost position of $w$ within $T$), and it must cover the uncovered suffix of $T$ and the uncovered prefix of $T$. All sufficiently long factors are then checked whether they cover the uncovered suffix and the uncovered prefix of $T$. If $M_{SD}(T)$ accepts some prefix of factor $w$, then $w$ covers the uncovered suffix of $T$. If a suffix automaton $M_{SD}(T^R)$ for the reversed string $T$ accepts some prefix of the reversed factor $w$, then $w$ covers the uncovered prefix of $T$. When $w$ satisfies all the conditions, $w$ is a seed of $T$.

Computation of the smallest Hamming distance of a cover (presented in [6]) is based on the following idea: when the maximum approximation of the first and the last position of cover $w$ in $T$ is $l_{min}$, then $l_m \ge l_{min}$ holds for its smallest distance $l_m$, because a cover is an approximate prefix and suffix of $T$ and thus it cannot cover $T$ without its first and last position. When cover $w$ of $T$ has positions with approximation at most $l$, clearly $l_m \le l$ holds. When the positions of $w$ with the maximum approximation equal to $l$ are no longer considered (the first and the last position must still be considered) and $w$ is still a cover of $T$, then $l_m \le l - 1$ holds. $l_m$ is decremented as long as $w$ still covers $T$. This may be used with modifications for computation of seeds.

The algorithm for searching exact seeds from [2] uses two phases: first, the deterministic suffix automaton is constructed, and then d-subsets are analyzed and seeds are computed. This means that the complete automaton, or at least all the d-subsets to be analyzed, need to be stored in memory at a time. By contrast, the algorithm for searching covers from [6] merges the phases: each d-subset is analyzed just after its construction. A depth-first-search-like algorithm is used and the states that are no longer needed are removed. For approximate seeds searching, there is also no need to store all elements of d-subsets of the automaton at a time.

3 Problem solution

Some properties are common for exact and approximate seeds with Hamming distance. Hence the algorithm presented in [2] is used as a base of the algorithm for the problem studied in this paper, using some (but not all) techniques for searching covers in [6] for further improvements.

Every approximate restricted seed of string $T$ is necessarily an exact factor of $T$ with other possible approximate occurrences. Suffix automaton constructed for $T$
and maximum Hamming distance $k$ has extended transitions defined for all factors of $T$ with respect to $k$. When $M^k_{SD}(T)$ is constructed using subset construction from $M^k_{SN}(T)$, each element of any d-subset of any state of $M^k_{SD}(T)$ contains information not only about position (depth within $M^k_{SN}(T)$) but also about approximation (level within $M^k_{SN}(T)$). Therefore, it may be easily determined whether a string from the left language of any state of $M^k_{SD}(T)$ is an exact factor of $T$.

Note 1. For string $T$, string $p$ such that $p$ is not a factor of $T$, and any string $u$ being a superstring of $p$, the following holds:

$$\forall T, p, u \in A^*,\ p \in \mathrm{Fact}(u): \quad p \notin \mathrm{Fact}(T) \Rightarrow u \notin \mathrm{Fact}(T)$$

Lemma 1. For DFA $M^k_{SD}(T) = (Q_D, A_T, \delta_D, q_0^D, F_D)$ created using subset construction from $M^k_{SN}(T)$ and state $q_i \in Q_D$ with d-subset $d(q_i)$ such that $\forall r \in d(q_i): \mathrm{level}(r) > 0$, it holds that no successor of $q_i$ can contain an element $r_j$ such that $\mathrm{level}(r_j) = 0$.

Corollary. As only an exact factor of $T$ may be a restricted seed of $T$, there is no need to construct any state of $M^k_{SD}(T)$ having only non-zero-level elements in its d-subset, as such a state contains no exact factor of $T$ in its left language. Therefore, when such a state is created during construction, it may be removed and none of its successors need to be constructed. The deterministic suffix automaton that contains only states having at least one zero-level element in their d-subsets is denoted by $\tilde{M}^k_{SD}(T)$.

Note 2. A special type of deterministic suffix automaton, the suffix trie, is considered in this paper. Construction of the trie and left language extraction is simpler than for a general suffix automaton. As the left language of any state (but the initial one) of the trie contains exactly one string, extraction of the left language of any state takes linear time with respect to the length of the string (e.g. using an inverted transition function). See Figure 5 for an example of a suffix trie.

The relation between length and positions of any seed (presented in [2]) holds also for approximate positions with Hamming distance, as the distance is defined for strings of equal lengths only.

Note 3. When searching for covers with Hamming distance [6], it is possible to remove all states $q$ of the deterministic suffix trie that do not represent a prefix, i.e. such $q$ that $|\mathrm{factor}(q)| < \mathrm{depth}(r_1)$, where $d(q) = r_1, \ldots, r_{|d(q)|}$. A similar property between the first position and the length of a factor is used for searching seeds: $|\mathrm{factor}(q)| \le \frac{\mathrm{depth}(r_1)}{2}$. Unlike in computing covers, this condition cannot be used for removing states $q$ and their successors.

Example 1. Let us consider the suffix trie for string $T = bbbbbaaabb$ and maximum Hamming distance $k = 2$. Factor $aaa$ cannot be a seed of $T$ as its first approximate position within $T$ is 6. Factor $aaabb$ is a seed of $T$ with respect to $k$. It is obvious that for states $q_1, q_2$ of the trie, where $\mathrm{factor}(q_1) = aaa$ and $\mathrm{factor}(q_2) = aaabb$, the following holds: $q_2$ is a successor of a state that is a successor of $q_1$. Therefore, $q_1$ must not be removed in order to be able to find $aaabb$.

For computation of the smallest distance $l_m$ of each seed, the idea used for searching covers ([6]) may be used for searching seeds. Unlike in searching covers, any position may be removed, including the first and the last, thus the only lower bound of $l_m$ is 0. Determining whether to continue decrementing $l$ is more complex for seeds than for covers, as covering of the central part of $T$ is not a sufficient condition for seeds. For factor $w$ of $T$, not only positions and their approximation need to be considered, but also the distance between the uncovered prefix of $T$ and some suffix of $w$, and between the uncovered suffix of $T$ and some prefix of $w$. See Algorithm 3 for further information.

Example 2. Let us have string $T = bbbbbaaa$ and maximum Hamming distance $k = 2$. One seed of $T$ with respect to $k$ is $bbba$. It may be a seed of $T$ with Hamming distance 2, because its positions in $T$ are 4, 5, 6, 7, and 8 with maximum approximation 2 (see Figures 1, 5). When position 8 with approximation 2 is removed, $bbba$ is still a seed of $T$ with positions 4, 5, 6, and 7, all with approximation at most 1 (see Figure 2).

Figure 1: Possible covering of string $bbbbbaaa$ with string $bbba$ and Hamming distance 2 from Example 2

Figure 2: Possible covering of a superstring of $bbbbbaaa$ with $bbba$ and Hamming distance 1 from Example 2

The deterministic suffix trie is needed not only to determine positions of each factor $w$ of $T$, but also for checking whether $w$ is able to cover the uncovered prefix and suffix of $T$ (see Algorithm 4). Thus, the trie must be able to accept strings of length at least $|w| - 1$. Therefore, the depth-first search with removing states from [6] cannot be used. By contrast, only elements of d-subset $d(q)$ may be removed after construction of all successors of $q$; transitions must be preserved. Thus, breadth-first search in the automaton is used (see Algorithm 1 and the usage of queues $L$, $L^R$). As no state of trie $\tilde{M}^k_{SD}(T)$ is removed and the
last element of each d-subset is preserved, it is possible to recognize all approximate suffixes of $T$ of length at least $|w| - 1$ and their distance.

Like an exact seed, an approximate one must also cover the uncovered prefix and suffix of $T$ (i.e. some prefix of seed $w$ must be an approximate suffix of $T$ and some suffix of $w$ must be an approximate prefix of $T$). A similar technique as for exact seeds ([2]) is used (Algorithm 4), but with $\tilde{M}^k_{SD}(T)$ and $\tilde{M}^k_{SD}(T^R)$. When some suffix of seed $w$ of $T$ (i.e. some prefix of reversed $w$) is accepted by $\tilde{M}^k_{SD}(T^R)$, i.e. $w = \mathrm{factor}(q)$ for some final state $q$ of $\tilde{M}^k_{SD}(T^R)$, $w$ covers the uncovered prefix of $T$ with approximation equal to the level of the last element $r_{|d(q)|}$ of $d(q)$. Similarly for a prefix of $w$, the uncovered suffix of $T$, and $\tilde{M}^k_{SD}(T)$.

For the complete solution of the problem see Algorithm 1.

Algorithm 2 Compute state of a deterministic suffix trie $\tilde{M} = (Q_D, A_T, \delta_D, q_0^D, F_D)$.
Input: NSA $(Q_N, A_T, \delta_N, q_0^N, F_N)$, state $q^t \in Q_D$, symbol $a \in A_T$, queue $L$ of states.
Output: Modified $\tilde{M}$ with possibly added successor $q^u$ of state $q^t$ for symbol $a$, modified queue $L$.
 1: create new state $q^u$
 2: define $\mathrm{depth}(q^u) = \mathrm{depth}(q^t) + 1$
 3: for all $r_i \in d(q^t)$ (in order as stored in $d(q^t)$) do
 4:   append all $r_j \in \delta_N(r_i, a)$ to $d(q^u)$ in ascending order by $\mathrm{depth}(r_j)$
 5: end for
 6: if exists $r \in d(q^u)$ where $\mathrm{level}(r) = 0$ then
 7:   $Q_D \leftarrow Q_D \cup \{q^u\}$
 8:   enqueue($L$, $q^u$)
 9:   if $r^u_{|d(q^u)|} \in F_N$, where $d(q^u) = r^u_1, \ldots, r^u_{|d(q^u)|}$ then
10:     $F_D \leftarrow F_D \cup \{q^u\}$
11:   end if
12: end if

Algorithm 3 The smallest distance of a seed of $T$.
Input: d-subset $d(q) = r_1, r_2, \ldots, r_{|d(q)|}$ representing seed $w$ of $T$.
Output: The smallest distance $l_m$ of $w$.
 1: $t \leftarrow d(q)$
 2: $l_{max} \leftarrow \max_{r \in t}\{\mathrm{level}(r)\}$
 3: $l \leftarrow l_{max}$
 4: repeat
 5:   for all $r \in t : \mathrm{level}(r) = l$ do
 6:     remove $r$ from $t$
 7:   end for
 8:   $l \leftarrow l - 1$
 9: until $w$ is not a seed of $T$ using positions determined by $t$ with respect to $l$ (Algorithm 4)
10: $l_m \leftarrow l + 1$

Example 3. Let us compute the set of all seeds with maximum Hamming distance $k = 2$ for string $T = bbbbbaaa$. Nondeterministic suffix automata $M^k_{SN}(T)$ (see Figure 3) and $M^k_{SN}(T^R)$ (see Figure 4) are constructed. Next, subset construction of the deterministic suffix trie $M^k_{SD}(T)$ from $M^k_{SN}(T)$ starts state by state (see the transition diagram of $M^k_{SD}(T)$ with all states that need to be constructed in Figure 5); the same is done with trie $M^k_{SD}(T^R)$ from $M^k_{SN}(T^R)$. Some states may have only elements with non-zero level in their d-subsets (e.g. $7''8'$). Such states are removed and their successors are not constructed, as strings from their left languages (e.g. $aaaa$) are not factors of $T$.

All other states need to be checked whether their left languages contain some seeds. For example, the state with d-subset $6''7'8$ contains string $aaa$ in its left language. The string occurs approximately at positions 6, 7 in $T$ and exactly at position 8 in $T$, thus it cannot be a seed of $T$, as its leftmost occurrence within $T$ ends at position 6 and its length is 3 (i.e. the occurrence starts at position 4), so no proper suffix of $aaa$ can cover the uncovered prefix (positions 1 to 3) of $T$.

Another example is the state with d-subset $4'5'67'8''$, which contains string $bbba$ in its left language. This string covers $T$ with Hamming distance 2, and therefore it is a seed of $T$ (see Figure 1). When all positions with the maximum distance (i.e. 8) are not considered, $bbba$ is still a seed of $T$, as proper prefix $b$ of $bbba$ covers the uncovered suffix $a$ of $T$ with Hamming distance 1 (see Figure 2). See the resulting table of all seeds and their distances in Table 1.

Table 1: All seeds of string $T = bbbbbaaa$ with maximum Hamming distance $k = 2$ and their smallest distances $l_m$; $p$ is the used prefix of a seed, $s$ is the used suffix (both computed by Algorithm 4); see Example 3

seed       d-subset          l_m   occurrences      p   s
ba         2′3′4′5′67′8′     1     2,3,4,5,6,7,8    ε   ε
baa        3′′4′′5′′6′78′    2     3,4,5,6,7,8      ε   ε
bba        3′4′5′67′8′′      1     3,4,5,6,7        b   ε
bbb        3456′7′′          2     3,4,5,6,7        b   ε
baaa       6′′7′8            2     6,7,8            ε   aa
bbaa       4′′5′′6′78′       2     4,5,6,7,8        ε   aa
bbba       4′5′67′8′′        1     4,5,6,7          b   ε
bbbb       456′7′′           2     4,5,6,7          b   ε
bbaaa      6′′7′8            2     6,7,8            ε   a
bbbaa      5′′6′78′          1     6,7,8            ε   a
bbbba      5′67′8′′          1     5,6,7            b   ε
bbbbb      56′7′′            2     5,6,7            b   ε
bbbaaa     6′′7′8            1     7,8              ε   a
bbbbaa     6′78′             1     6,7,8            ε   ε
bbbbba     67′8′′            1     6,7              b   ε
bbbbaaa    7′8               1     7,8              ε   ε
bbbbbaa    78′               1     7,8              ε   ε
bbbbbaaa   8                 0     8                ε   ε
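
The d-subsets quoted in Example 3 can be cross-checked directly from the definition of an approximate occurrence, without building any automaton. The following is a small sketch under that assumption; the helper names `d_subset` and `show` are hypothetical, and ASCII apostrophes stand in for the prime marks used in the figures.

```python
def hamming(x, y):
    """Number of mismatching positions of two equal-length strings."""
    return sum(a != b for a, b in zip(x, y))


def d_subset(w, T, k):
    """(depth, level) pairs: end position and distance of every approximate
    occurrence of w in T with at most k mismatches, in ascending depth."""
    m = len(w)
    pairs = []
    for end in range(m, len(T) + 1):          # 1-based end positions
        level = hamming(w, T[end - m:end])
        if level <= k:
            pairs.append((end, level))
    return pairs


def show(pairs):
    """Render pairs in the figures' notation: depth followed by `level` primes."""
    return "".join(str(depth) + "'" * level for depth, level in pairs)


print(show(d_subset("aaa", "bbbbbaaa", 2)))    # 6''7'8
print(show(d_subset("bbba", "bbbbbaaa", 2)))   # 4'5'67'8''
print(show(d_subset("aaaa", "bbbbbaaa", 2)))   # 7''8'
```

The last call has no zero-level element, which corresponds to the state that the example removes because $aaaa$ is not an exact factor of $T$.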

Algorithm 1 Compute set of seeds of $T$ with the smallest Hamming distances.
Input: String $T$, maximum Hamming distance $k$.
Output: Set $\mathrm{hseeds}^k(T)$ of all seeds of $T$.
 1: $\mathrm{hseeds}^k(T) \leftarrow \emptyset$
 2: construct $M^k_{SN}(T) = (Q_N, A_T, \delta_N, q_0^N, F_N)$
 3: construct $M^k_{SN}(T^R) = (Q_N^R, A_T, \delta_N^R, q_0^{NR}, F_N^R)$
 4: create new state $q_0^D$ as the initial one of the deterministic suffix trie $M^k_{SD}(T) = (Q_D, A_T, \delta_D, q_0^D, F_D)$
 5: create new state $q_0^{DR}$ as the initial one of the deterministic suffix trie $M^k_{SD}(T^R) = (Q_D^R, A_T, \delta_D^R, q_0^{DR}, F_D^R)$
 6: define $\mathrm{factor}(q_0^D) = \varepsilon$, $\mathrm{depth}(q_0^D) = 0$, $\mathrm{depth}(q_0^{DR}) = 0$
 7: create $L$, $L^R$ new empty queues of states
 8: enqueue($L$, $q_0^D$), enqueue($L^R$, $q_0^{DR}$)
 9: while $L^R$ is not empty {construct complete $\tilde{M}^k_{SD}(T^R)$ in this loop} do
10:   $q^{tR} \leftarrow$ dequeue($L^R$)
11:   for all $a \in A_T$ do
12:     compute new state $q^{uR}$ as a successor of state $q^{tR}$ for symbol $a$ using Algorithm 2
13:   end for
14:   discard all elements of $d(q^{tR})$ but the last one {all successors of $q^{tR}$ have just been computed}
15: end while
16: while $L$ is not empty {construct $\tilde{M}^k_{SD}(T)$ and compute seeds in this loop} do
17:   $q^t \leftarrow$ dequeue($L$)
18:   for all $a \in A_T$ do
19:     compute new state $q^u$ as a successor of state $q^t$ for symbol $a$ using Algorithm 2
20:     if exists $r \in d(q^u)$ where $\mathrm{level}(r) = 0$ {only state $q^u$ that is part of $\tilde{M}^k_{SD}(T)$ is further processed} then
21:       define $w = \mathrm{factor}(q^u) = \mathrm{factor}(q^t).a$
22:       if $w$ is a seed of $T$ using positions determined by $d(q^u)$ (Algorithm 4) then
23:         compute the smallest distance $l_m$ of $w$ (Algorithm 3)
24:         if $|w| > k$ or $l_m < |w|$ {all strings of length less or equal to $l_m$ are seeds} then
25:           $\mathrm{hseeds}^k(T) \leftarrow \mathrm{hseeds}^k(T) \cup \{(w, l_m)\}$
26:         end if
27:       end if
28:     end if
29:   end for
30:   discard all elements of $d(q^t)$ but the last one
31: end while
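
The breadth-first construction used by Algorithms 1 and 2 can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: d-subsets are lists of (depth, level) pairs, states are keyed by their factor string, the initial state is modelled as "a factor may start after any position with no mismatch yet", and, for clarity, d-subset elements are not discarded after a state has been expanded. The name `build_pruned_trie` is hypothetical.

```python
from collections import deque


def build_pruned_trie(T, k):
    """Map factor(q) -> d-subset for every state that has a zero-level element."""
    n = len(T)
    alphabet = sorted(set(T))                     # effective alphabet A_T
    # Assumption: initial d-subset = every start offset 0..n-1 at level 0.
    trie = {"": [(i, 0) for i in range(n)]}
    queue = deque([""])                           # breadth-first order
    while queue:
        w = queue.popleft()
        for a in alphabet:
            successor = []
            for depth, level in trie[w]:          # already ascending by depth
                if depth == n:
                    continue                      # no symbol left to read in T
                new_level = level + (T[depth] != a)   # a mismatch raises the level
                if new_level <= k:
                    successor.append((depth + 1, new_level))
            # Keep the state only if it holds an exact occurrence (Corollary).
            if any(level == 0 for _, level in successor):
                trie[w + a] = successor
                queue.append(w + a)
    return trie


trie = build_pruned_trie("bbbbbaaa", 2)
print(trie["bbba"])      # [(4, 1), (5, 1), (6, 0), (7, 1), (8, 2)]
print("aaaa" in trie)    # False
```

Because a state is kept only when its d-subset contains a zero-level element, the keys of the returned dictionary are exactly the exact factors of $T$ plus the empty string.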

Algorithm 4 Determine whether string $w$ is a seed of $T$ with maximum Hamming distance $l$.
Input: Already constructed parts of deterministic suffix automata $M^k_{SD}(T) = (Q, A_T, \delta, q_0, F)$ and $M^k_{SD}(T^R) = (Q^R, A_T, \delta_R, q_0^R, F^R)$, d-subset $t = r_1, r_2, \ldots, r_{|t|}$ for $q \in Q$ and $w = \mathrm{factor}(q)$, maximum Hamming distance $l$.
Output: Resolution whether $w$ is a seed of $T$ with respect to $l$ and $t$.
1: if for all $i = 2, 3, \ldots, |t|$: $r_i - r_{i-1} \le |w|$
   and $\exists p \in \mathrm{Pref}^0(w), |p| \ge |T| - r_{|t|}$: $\delta^*(q_0, p) = q^1$, $q^1 \in F \wedge \mathrm{level}(r_{|d(q^1)|}) \le l$
   and $\exists s \in \mathrm{Suff}^0(w), |s| \ge r_1 - |w|$: $\delta_R^*(q_0^R, s) = q^2$, $q^2 \in F^R \wedge \mathrm{level}(r_{|d(q^2)|}) \le l$ then
2:   return true
3: else
4:   return false
5: end if
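
The three conditions of Algorithm 4 can also be checked by brute force on the strings themselves, which gives a way to cross-check small examples. The sketch below is an assumption-laden illustration, not the automata-based method of the paper: it compares prefixes and suffixes directly instead of running the tries, and it finds the smallest distance by trying $l = 0, 1, \ldots, k$ in increasing order. The names `is_seed` and `all_seeds` are hypothetical.

```python
def hamming(x, y):
    """Number of mismatching positions of two equal-length strings."""
    return sum(a != b for a, b in zip(x, y))


def is_seed(w, T, l):
    """Brute-force version of the three conditions of Algorithm 4."""
    n, m = len(T), len(w)
    # End positions (1-based) of occurrences of w with at most l mismatches.
    pos = [i for i in range(m, n + 1) if hamming(w, T[i - m:i]) <= l]
    if not pos:
        return False
    # Consecutive occurrences must overlap or touch (central part covered).
    if any(b - a > m for a, b in zip(pos, pos[1:])):
        return False
    # Some prefix p of w, |p| >= |T| - r_last, is an approximate suffix of T.
    suffix_ok = any(hamming(w[:j], T[n - j:]) <= l
                    for j in range(n - pos[-1], m + 1))
    # Some suffix s of w, |s| >= r_1 - |w|, is an approximate prefix of T.
    prefix_ok = any(hamming(w[m - j:], T[:j]) <= l
                    for j in range(pos[0] - m, m + 1))
    return suffix_ok and prefix_ok


def all_seeds(T, k):
    """Map each reported seed to its smallest distance l_m."""
    factors = {T[i:j] for i in range(len(T)) for j in range(i + 1, len(T) + 1)}
    result = {}
    for w in factors:
        for l in range(k + 1):                    # the first success is l_m
            if is_seed(w, T, l):
                if len(w) > k or l < len(w):      # filter of Algorithm 1, line 24
                    result[w] = l
                break
    return result


seeds = all_seeds("bbbbbaaa", 2)
print(seeds["bbba"], seeds["baaa"], "aaa" in seeds)   # 1 2 False
```

For $T = bbbbbaaa$ and $k = 2$ this sketch returns the same 18 seeds and smallest distances as listed in Table 1. Its running time is far worse than that of the trie-based algorithm, so it is only suitable for checking short strings.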

Figure 5: Transition diagram of the constructed part of the suffix trie $M^k_{SD}(T)$ for string $T = bbbbbaaa$ and maximum Hamming distance $k = 2$ from Example 3; dashed states are removed as their left languages do not contain an exact factor of $T$ and thus they are not states of $\tilde{M}^k_{SD}(T)$

Figure 3: Transition diagram of the nondeterministic approximate suffix automaton $M^k_{SN}(T)$ for string $T = bbbbbaaa$ and maximum Hamming distance $k = 2$ from Example 3

Figure 4: Transition diagram of the nondeterministic approximate suffix automaton $M^k_{SN}(T^R)$ for the reversal of string $T$, i.e. $T^R = aaabbbbb$, and maximum Hamming distance $k = 2$ from Example 3

4 Time and space complexities

Note 4. As parts of $M^k_{SD}(T)$ and $M^k_{SD}(T^R)$ are constructed the same way in Algorithm 1, the time and space complexities of their construction are the same.

Lemma 2. Left languages of states of the deterministic suffix trie $\tilde{M}^k_{SD}(T) = (Q_D, A_T, \delta_D, q_0^D, F_D)$ are distinct, i.e.

$$q_1, q_2 \in Q_D;\quad q_1 \ne q_2 \Rightarrow \mathrm{factor}(q_1) \ne \mathrm{factor}(q_2)$$

Definition 1. Let us consider string $T$ and maximum Hamming distance $k$. When a factor $w$ approximately occurs $e$ times in $T$ with respect to $k$, we say that the number of repetitions of $w$ in $T$ with respect to $k$, denoted by $R^k_w(T)$, is $e - 1$. Then the number of repetitions of all factors of $T$ with respect to $k$, denoted by $R^k(T)$, is defined as

$$R^k(T) = \sum_{w \in \mathrm{Fact}(T)} R^k_w(T)$$

Lemma 3. The number of states of $\tilde{M}^k_{SD}(T)$ is

$$\frac{1}{2} \cdot (|T|^2 + |T|) - R^k(T) + 1$$

Note 5. As restricted approximate seeds of string $T$ are exact factors of $T$, it is meaningful to consider effective
alphabet $A_T$ only, and $|A_T| \le |T|$ always holds (recall that effective alphabet $A_T$ consists only of symbols that occur in $T$). It is also meaningless to consider high $k$, because every factor of $T$ having length less than or equal to $k$ is always an approximate seed of $T$. Thus $k \le |T|$ always holds. Usually $k$ and $|A_T|$ are independent of $|T|$.

Lemma 4. The number of states of $M^k_{SD}(T)$ constructed using Alg. 1 is at most $|A_T| \cdot (\frac{1}{2} \cdot (|T|^2 + |T|) - R^k(T)) + 1$.

Lemma 5. For every d-subset of $\tilde{M}^k_{SD}(T)$ constructed by Algorithm 1, it holds that there are no two elements having the same depth.

Lemma 6. The number of elements of all d-subsets of $\tilde{M}^k_{SD}(T)$ is not greater than $\frac{1}{2} \cdot (|T|^3 + |T|^2) - |T| \cdot R^k(T) + 1$.

Lemma 7. The number of elements of all d-subsets of $M^k_{SD}(T)$ constructed using Alg. 1 is not greater than

$$|A_T| \cdot \left(\frac{1}{2} \cdot (|T|^3 + |T|^2) - |T| \cdot R^k(T)\right) + 1$$

Lemma 8. Time complexity of the check whether d-subset $d(q)$ of $\tilde{M}^k_{SD}(T)$ represents a seed $w = \mathrm{factor}(q)$ of $T$ (Alg. 4) is at most $2 \cdot |d(q)| + 2 \cdot |w| - 2$, that is $O(|T|)$.

Lemma 9. Time complexity of the computation of the smallest distance of seed $w = \mathrm{factor}(q)$ of $T$ (Alg. 3) is at most $|d(q)| + k \cdot (3 \cdot |d(q)| + 2 \cdot |w| - 2)$, that is $O(k \cdot |T|)$.

Note 6. The number of all seeds is $O(|T|^2)$ (like the number of factors). Thus, the sum of their lengths, denoted by $|\mathrm{hseeds}^k(T)|$, is $O(|T|^3)$.

Theorem 1. Time complexity of computation of all seeds with their smallest distance for string $T$ with maximum Hamming distance $k$ (Algorithm 1) is $O(k \cdot |A_T| \cdot |T|^3)$.

Proof. Construction of the nondeterministic suffix automaton $M^k_{SN}(T) = (Q_N, A_T, \delta_N, q_0^N, F_N)$ for $T$ and $k$ takes $O(k \cdot |A_T| \cdot |T|)$. For each state of $\tilde{M}^k_{SD}(T)$ (Lemma 3) and for each symbol of $A_T$, a new d-subset is constructed. As each element of any d-subset may be constructed in constant time (just using the already known $\delta_N$) and the elements are naturally ordered (no need to sort – proven in [6]), all d-subsets are constructed in at most

$$|A_T| \cdot \left(\frac{1}{2} \cdot (|T|^3 + |T|^2) - |T| \cdot R^k(T)\right) + 1$$

time. Each d-subset is checked whether it contains an element with zero level in linear time. The left language extraction of a state takes linear time, and by Lemma 8 and Lemma 3 the theorem holds.

Lemma 10. During construction of $\tilde{M}^k_{SD}(T)$ (Algorithm 1), there are at most $O(|T|^2)$ elements of d-subsets stored in memory at a time.

Theorem 2. Space complexity of computation of all seeds is $O(|T|^2 + |\mathrm{hseeds}^k(T)|)$.

Proof. Space complexity of construction of $M^k_{SN}(T)$ is $O(k \cdot |A| \cdot |T|)$ (proven in [6]). By Lemma 10, the number of elements stored in memory at a time is $O(|T|^2)$, as no more elements of d-subsets than those in $L$ plus $O(|T|)$ new ones are in memory at a time. By Lemma 3, the number of states of $\tilde{M}^k_{SD}(T)$ is $O(|T|^2)$ (they are all stored in memory with one element each). As the constructed automaton is a trie, the number of transitions is also $O(|T|^2)$. The space complexity also depends on the size of the result, $|\mathrm{hseeds}^k(T)|$.

5 Experimental results

The algorithm was implemented in C++ using STL and compiled using GNU C++ 3.4.6 with the O3 optimization level. The dataset used to test the algorithm is the nucleotide sequence of Saccharomyces cerevisiae chromosome IV (1). The string $T$ consists of the first $|T|$ characters of the chromosome.

The first set of tests was run on an AMD Athlon 64 3200+ (2200 MHz) system, with 2.5 GB of RAM, under the Gentoo Linux operating system (see Figures 6 and 7).

Figure 6: Time consumption of the experimental run on the Athlon64 with respect to the text size (see Section 5). [Plot: Athlon64 2.2 GHz, for k=78 (solid) and k=55 (dotted); x-axis: text length |T|; y-axis: time [sec].]

Figure 7: Time consumption of the experimental run on the Athlon64 with respect to the maximum distance (see Section 5). [Plot: Athlon64 2.2 GHz, for |T|=279 (solid) and |T|=159 (dotted); x-axis: maximum distance k; y-axis: time [sec].]

(1) The Saccharomyces cerevisiae chromosome IV dataset could be downloaded from http://www.genome.jp/.

Figure 8: Time consumption of the experimental run on the Athlon with respect to the maximum distance (see Section 5). [Plot: Athlon 1.4 GHz, for |T|=113 (solid) and |T|=149 (dotted); x-axis: maximum distance k; y-axis: time [sec].]

The second set of tests was run on an AMD Athlon (1400 MHz) system, with 1.2 GB of RAM, under the Gentoo Linux operating system (see Figure 8).

Note 7. In comparison to the experimental results presented in [3], the algorithm presented in this paper runs a bit faster for the same data, even on a slightly slower computer (1.3 seconds in [3] for text length 100 vs. at most 0.7 second for text length 113; see Figure 8).

6 Conclusion

In this paper, we have shown that an algorithm design based on determinization of a suffix automaton is appropriate for computation of all restricted seeds with the smallest Hamming distance. The presented algorithm is straightforward, easy to understand and to implement, and its theoretical and experimental time requirements are comparable to the existing approach ([4]).

For future work, we would like to extend the algorithm for searching seeds to other distances and to utilize a similar approach for searching other types of regularities.

Acknowledgements

This research was supported by the Czech Technical University in Prague as grant No. CTU0803113, by the Ministry of Education, Youth and Sports of the Czech Republic under research program MSM 6840770014, and by the Czech Science Foundation as project No. 201/06/1039 and as project No. 201/09/0807.

References

[1] Iliopoulos, C. S., Moore, D., and Park, K. S., “Covering a String,” CPM ’93: Proceedings of the 4th Annual Symposium on Combinatorial Pattern Matching, Springer-Verlag, London, UK, 1993, pp. 54–62.

[2] Voráček, M. and Melichar, B., “Computing Seeds in Generalized Strings,” Proceedings of Workshop 2006, Czech Technical University in Prague, 2006, pp. 138–139.

[3] Christodoulakis, M., Iliopoulos, C. S., Park, K. S., and Sim, J. S., “Implementing Approximate Regularities,” Mathematical and Computer Modelling, Vol. 42, October 2005, pp. 855–866.

[4] Christodoulakis, M., Iliopoulos, C. S., Park, K. S., and Sim, J. S., “Approximate Seeds of Strings,” Journal of Automata, Languages and Combinatorics, Vol. 10, No. 5/6, 2005, pp. 609–626.

[5] Guth, O., “Searching Approximate Covers of Strings Using Finite Automata,” POSTER 2008, Czech Technical University in Prague, Faculty of Electrical Engineering, Praha, 2008.

[6] Guth, O., Melichar, B., and Balík, M., “Searching All Approximate Covers and Their Distance using Finite Automata,” Information Technologies – Applications and Theory, Univerzita P. J. Šafárika, Košice, 2008, pp. 21–26.

[7] Melichar, B., Holub, J., and Polcar, T., “Text Searching Algorithms, Volume I,” November 2005.

[8] Melichar, B., “Text Searching Algorithms, Volume II,” March 2006.

[9] Lahoda, J., Melichar, B., and Žďárek, J., “Pattern Matching in DCA Coded Text,” Proceedings of the 13th International Conference on Implementation and Application of Automata, Springer, Heidelberg, 2008, pp. 151–160.

[10] Melichar, B. and Polcar, T., “The Longest Common Subsequence Problem – A Finite Automata Approach,” Implementation and Application of Automata, Springer, New York, 2003, pp. 294–296.

[11] Žďárek, J., “Automata and 2D Pattern Matching,” Advances on Two-dimensional Language Theory, University of Salerno, Salerno, 2006, p. 15.

[12] Voráček, M., “Computing Covers in Generalized Strings,” POSTER 2005, Czech Technical University in Prague, Faculty of Electrical Engineering, 2005.
