arXiv:2106.10541v2 [cs.DS] 23 Jul 2021 iainpten htcnb iwda rpscalled graphs as viewed d be commu- can that have patterns applications nication processing parallel Many Introduction 1 al- Lee-isometric. an over is four word size a of whether phabet check linear- to a algorithm characterization this time from Lee- derive two We with border errors. a having been words have as alphabet characterized four-letter a over words isometric etasomdinto transformed be odi amn-smti.I sbsdo pat- on based is with It algorithms matching Hamming-isometric. whether tern check is charac- to word this algorithm a from border linear-time derive a a We having terization also mismatches. words two process as with this characterized in been obtained have words new avoid the of all etr nwhich on letters oe r h od flength of words the are nodes Z nyi hydffri xcl n oiin n the and position, one symbols exactly two by in given is differ mismatch they if only word nt word finite A Abstract Vall´ee, France a nrae lwrta nahprue s 9 intro- [9] Hsu hypercube, vertices a of in number than the slower which increases for hypercubes of ants -ary d = ∗ † = IM nvGsaeEffl NS -75 Marne-la- F-77454 CNRS, Eiffel, Gustave Univ LIGM, IMadKn’ olg odn ntdKingdom United London, College King’s and LIGM hcigwehrawr sHmigioerci iertime linear in Hamming-isometric is word a whether Checking b u n { f 0 ± cbs A -cubes. od hc r o Hamming-isometric not are which Words . and , 1 mod 1 d , . . . , v f ftesm eghavoiding length same the of sHmigioerci o n two any for if Hamming-isometric is u − d d nodrt bansm vari- some obtain to order In . iesfrom differs -ary 1 v } w oe r ikdi and if linked are nodes Two . ycagn n yoealthe all one by one changing by n uei graph a is cube ai-ireB´ealMarie-Pierre v n k nsc a that way a such in , a vrtealphabet the over imths Lee- mismatches. and b htverify that Q f n d , uy2,2021 26, July whose u ID can aieCrochemore Maxime , ∗ 1 flength of amn-smti itrsaeivsiae n[3]. in non- investigated and are pictures setting, Hamming-isometric two-dimensional the in considered . Lee the length to same as according the 2 of [2] distance prefix at in a are which and characterized suffix a been having words have words positions. isometric two not exactly the is in with length mismatches word that same suffix A the a of has bor- prefix it 2-error if Lee-isometric. a is has is that it der, if it only and if Hamming-isometric if Hamming-isometric is only word [2]. and in binary if alphabets a general particular, to extended In and 18] in 17, characterized [12, been has words Hamming-isometric all for isometric xhnigoeb n h iso hc hydiffer avoiding they words which word on only bits generating the meanwhile one by one exchanging amn-smti ffraypi fwords of pair any for if Hamming-isometric ocmiaoiso od.Abnr word binary closer A view words. of point on ignor- a combinatorics by adopting to given and equivalently hypercubes be ing can word isometric stesm in same the is itnebtentowords two between distance a eioerclyebde into embedded isometrically be can lhbtadaodtefco 1 h oinof notion The 11. factor the binary avoid a on and are nodes alphabet which in cubes Fibonacci duced adt eLeioercwe,frany for when, Lee-isometric be to said h eeaie ioac ue[0 1 9 ti the is it ; 19] 11, [10, subgraph cube Fibonacci generalized the ary oefactor some nti ae esuyteagrtmccomplex- algorithmic the study we paper this In been also have words Hamming-isometric non-Lee- Binary 4, size of alphabet an of case the In nabnr lhbt h ento faLee- a of definition the alphabet, binary a On n cbshssbeunl eetne odefine to extended be subsequently has -cubes f sHmigioerci tis it if Hamming-isometric is n Q avoiding n 2 f ( nti rmwr,abnr word binary a framework, this In . f Q fa2-ary a of ) n 2 ( n f h tutr fbnr non- binary of structure The . f n in and ) , u ID a etasomdinto transformed be can † n u cb hs oe avoid nodes whose -cube Q and n 2 . v etcsof vertices Q n 2 n hti,the is, that , n -Hamming- ≥ 1, f u f The . Q Q and is n 2 n 2 v f ( ( by f f n d is v ) ) - - ity of checking whether a word is not Hamming- is the length of u and u[i] are its letters. The suffix of isometric. Our approach is based on the character- index i on u, denoted by ui, is the word u[i] u[n 1] ization of 2-error borders of such words. The naive of length n i. A suffix (or prefix) of it is···proper− if algorithm runs clearly in quadratic time. We show it is distinct− from u itself. that known algorithms for matching patterns with Let k be a non-negative integer. We say that a mismatches can be used to solve this problem effi- word u has a k-error border if u has a proper suf- ciently. with k mismatches can be fix s that matches its prefix of the same length with solved by algorithms running in time O(nk) (see [13] exactly k differences. In other words, the Hamming and [7]). These algorithms are mostly based on a distance between the suffix and the prefix is k. technique called the Kangaroo method. This method computes the for every alignment Example 1 The word 1010011 has a 2-error border. in time O(k) by “jumping” from one error to the next Indeed, it has the prefix 101 and the suffix 011 ans error. A faster algorithm for pattern matching with the Hamming distance between 101 and 011 is 2. k mismatches runs in O(n√k log k) [1]. A simpler version of this algorithm is given in [15]. Let f be a finite word and n be a positive integer. We show two methods to check whether a word Then a word u is called f-free if it does not contain is not Hamming-isometric. The first one uses the f as a factor, and f is called n-Hamming-isometric if Kangaroo method which allows to derive an algo- for every f-free words u and v of length n, the follow- O kn O n rithm running in time ( ) and using ( ) space ing holds: u can be transformed into v by changing n k to check whether a word of size has a -error bor- one by one all the letters on which u differs from v, in der. The method has a preprocessing of linear time such a way that all of the new words obtained during and space for computing the suffix tree of the word this process are also f-free. Such a transformation is and to enhance it in order to answer lowest common called an f-free transformation from u to v. Eventu- ancestor queries in constant time. This overall leads ally, a word f is said to be Hamming-isometric if it to a linear-time and linear-space algorithm to check is n-Hamming-isometric for all positive integer n. whether a word is not Hamming-isometric, and hence The d-ary n-cube, denoted by Qd , is the graph also to check whether a binary word is Lee-isometric. n whose vertices are the words of length n over the The second method uses the computation of a k- alphabet Z = 0, 1,...,d 1 , and for which any prefix table that gives, for some word u and every d two words u and{ v are adjacent− } if and only if u and position on u, the length of the longest proper suffix v differ by one unit at exactly one position, say i, of u at this position that matches its prefix of the that is, u[i] = v[i] 1 mod d [2]. The d-ary n-cube same length with at most k differences [5]. The com- avoiding f, where ±f is a word over the alphabet Z putation of this k-prefix table is done in time O(kn) d is the graph Qd (f) obtained from Qd by deleting the using O(n) space. n n vertices containing f as a factor. We also use the Kangaroo method to derive an al- A word f over Z is said to be Lee-isometric if for gorithm running in time O(kn) and using O(n) space d all n 1, Qd (f) is an isometric subgraph of Qd . on a constant size alphabet to check whether a word ≥ n n of size n has a k-Lee-error border and thus check in linear time whether a word over an alphabet of size Example 2 The word 0301 on the alphabet Z4 = 4 is Lee-isometric. 0, 1, 2, 3 is non-Lee-isometric. Indeed, the words {u = 030001} and v = 030201, which do not contain the factor 0301, are at distance 2 but there is no path 4 2 Definitions and background of length 2 from u to v in Q6(0301) since any path of length 1 changing the symbol of index 3 of u goes Let A be a finite alphabet. A word u in A⋆ is a finite from u to 030101 or to 030301 and these two words sequence u[0]u[1] u[n 1] of letters in A, where n both have the word 0301 as factor. ··· −

2 It is shown in [2] that non-Hamming-isometric and containing all the suffixes of u by their keys and non-Lee-isometric words coincide for words on an al- positions on u as their values [6], [4]. The tree has a phabet of size at most three. But this property is no linear number of nodes and edges, each edge contain- more true for greater alphabets. ing a pair of integers identifying a factor of u, e.g. Hamming-isometric words have the following char- (position, length), hence the linear space complexity. acterization obtained in [12, 17] for binary alphabets Suffix arrays can also be used for this problem. They and in [2] for general alphabets. contain essentially the starting positions of suffixes of u sorted in lexicographic order. Proposition 3 A word is not Hamming-isometric if To get the overall running time, we need to an- and only if it has a 2-error border. swer Lowest Common Ancestor (LCA) queries in con- stant time [8], [16]. LCA queries give us the longest Example 4 For instance the words 11, 1n for n 1 common prefix between two suffixes of u, essentially ≥ are Hamming-isometric. The word 1010011 is not telling us where the first mismatch appears between a Hamming-isometric. suffix of u and its prefix of the same length. This can be performed by first constructing a Longest Com- mon Prefix (LCP) array. The LCP array stores the 3 Algorithms for checking length of the longest common prefix between two con- whether a word is Hamming- secutive suffixes in the suffix array (lexicographic con- secutive suffixes). This array can also be constructed isometric in linear time. To compute the length of the longest prefix common to any two suffixes in the suffix tree In this section, we use the characterization of non- (instead of consecutive suffixes), we need to use some Hamming-isometric words in terms of 2-error border range minimum query . (Proposition 3) and assume that the alphabet A has Thus, we assume that our suffix tree is enhanced a constant size. Observe that a quadratic-time naive to answer LCA queries in constant time. This can algorithm can be obtained to check wether a word is be done in linear time and space. We denote by non-Hamming-isometric by computing the Hamming LCA(i, j) the query that returns in O(1) time the distance between each suffix of index i and the prefix length of the commun prefix between the suffix u of the same length. We show that checking if a word i and the suffix u of u. of length n has a k-error border can be done in time j For every index i, we try to find if the suffix of index O(kn) and space O(n). i has k-mismatches with its prefix of the same length. Proposition 5 It can be checked in time O(kn) and We first compute LCA(0,i). Let this length be ℓ0. space O(n) whether a word of length n has a k-error We skip the mismatching character in u0 and ui and border. try to find LCA(ℓ0 +1,i + ℓ0 +1). We repeat this to obtain k mismatches between ui and u[0] u[n i Proof. We give two algorithms for solving this prob- 1] or fail to obtain this condition. ··· − − lem. The first one is based on a technique called the The pseudo code of the technique is given in Al- Kangaroo method used for pattern matching with k gorithm 1. We maintain a variable ℓ which gives, mismatches in O(nk) time (see [13], [7] and [14]). after line 4 of Algorithm 1, the index of ui of the These algorithms compute the Hamming distance for current mismatch between ui and u0. A variable every alignment in O(k) time by “jumping” from one d contains the current Hamming distance between error to the next. We use the Kangaroo method to u[i] u[i + ℓ 1] and u[0] u[ℓ 1]. It is increased check for each index i on a word u of length n whether by 1··· at line 8− since a mismatch··· has− been found. it has a k-error border of length n i in time O(k). Since there are at most O(k) LCA queries for each To do so, we first compute in time− and space O(n) index i, this can be done in O(k) time. The overall the suffix tree of u. The suffix tree is a compacted time complexity is thus O(kn) and the space com-

3 plexity is O(n). at the first step of the loop of line 3 we obtain line 4 We now show a second method to check whether ℓ = LCA(0, 1) = 0; we set ℓ to 1 (”the jump”) and d a word of length n has a k-error border. We use to 1 line 8. At the second step of the loop of line 3 the computation of a k-prefix table as done in [5]. we obtain line 4 ℓ = ℓ + LCA(1, 2)= 1; we set ℓ to 2 For each position i, we compute a table πk for and d to 2 line 8. At the third step of the loop of line which πk(i) is the length ℓ of the longest word 3, we obtain line 4 ℓ = ℓ + LCA(2, 3) = 2 and break u[i] ...u[i + ℓ 1] such that the Hamming distance line 10. For i = 2 the loop of line 3 fails to return between u[0] ...u− [ℓ 1] and u[i] ...u[i + ℓ 1] is at true. For i = 3, at the first step of the loop of line 3 most k and u[0] ...u−[ℓ 1] is proper prefix− of u. we obtain line 4, ℓ = LCA(0, 3) = 0; we set ℓ to 1 and This computation can− be done in time O(kn) and d to 1 line 8. At the second step of the loop of line 3 space O(n) (see [5, Theorem 5]). It needs the compu- we obtain line 4 ℓ = ℓ + LCA(1, 4)= 1; we set ℓ to 2 tations of the prefix array of u and the longest prefix and d to 2 line 8. At the third step of the loop of line array preprocessed for range minimum queries. 3, we obtain line 4 ℓ = ℓ + LCA(2, 5) = 3 and, since The existence of a k-error border is then obtained d = 2 and ℓ = n i = 3, the algorithm returns true as follows. For k 1, a word u has a k-error border if line 6. The algorithm− has thus detected the 2-error and only if there≥ is a position i, 1 i

Example 6 Let u = 101011. Let us check with Al- A combinatorial characterization of Lee-isometric gorithm 1 whether u has a 2-error border. For i = 1, words over an alphabet of size 4 has been obtained in

4 [2]. It uses the notion of Lee distance which is defined time complexity is thus O(kn) and the space com- as follows. The Lee distance, denoted by dL, between plexity is O(n). two letters of the alphabet Zd = 0, 1,...,d 1 is { − }

dL(a,b) = min( a b , d a b ). | − | − | − | Word with a -Lee-error The Lee distance between two words u and v of length Algorithm 3: k border(u) n over Zd is Input: A non empty word u of length n, a −1 n non-negative integer k dL(u, v)= X dL(u[i], v[i]). Output: true if u has a k-Lee-error border i=0 1 for i 1 to n 1 do ← − A word has a k-Lee-error border if it has a suffix u 2 (ℓ, d) (0, 0); ← and a prefix v of same length satisfying dL(u, v)= k. 3 while d k do Z ≤ For words over 4, the Lee-isometric words are 4 ℓ ℓ+ LCA(ℓ,i + ℓ); ← characterized as follows in [2]. 5 if d = k and ℓ = n i then − 6 return true; Proposition 8 A word over a 4-letter alphabet is 7 if (d = k and ℓ k) non-Lee-isometric if and only if it has a 2-Lee-error then − border. 8 break 9 if d < k and ℓ = n i then In this section we show that checking if a word of 10 break − length n has a k-Lee-error border can be done in time 11 d d + dL(u[ℓ],u[i + ℓ]); O(kn) and space O(n). ← 12 ℓ ℓ + 1; ← 13 false Proposition 9 It can be checked in time O(kn) and return ; space O(n) whether a word of length n has a k-Lee- error border. The following corollary follows then directly from Proof. The algorithm is almost the same as Algo- Proposition 8 and the analysis of Algorithm 2 in rithm 1 and the proof is similar to the proof of Propo- Proposition 9. sition 5. Therefore we only discuss the differences. For every index i, we try to find if the suffix of Corollary 10 It can be checked in linear time and index i is at Lee distance k from its prefix of the space whether a word over an alphabet of size 4 is same length. Lee-isometric. The pseudo code of the technique is given in Al- gorithm 3. A variable d contains, after line 11, the current Lee distance between u[i] u[i + ℓ] and References u[0] u[ℓ]. ··· ··· The difference with Algorithm 1 appears when [1] Amihood Amir, Moshe Lewenstein, and Ely Po- there is mismatch between the suffix ui and the prefix rat. Faster algorithms for string matching with k u[0] u[n i 1] at positions ℓ on ui and i + ℓ on mismatches. J. Algorithms, 50(2):257–275, 2004. u[0] ··· u[n −i −1]. The current Lee distance between u[0] ··· u[ℓ]− and−u[i] u[i + ℓ] is augmented this time [2] Marcella Anselmo, Manuela Flores, and Maria ··· ··· by the value of dL(u[ℓ],u[i + ℓ]). Madonia. Quaternary n-cubes and isometric Since there are at most O(k) LCA queries for each words. In WORDS 2021, Lecture Notes in Com- index i, this can be done in O(k) time. The overall puter Science. Springer, 2021. to appear.

5 [3] Marcella Anselmo, Dora Giammarresi, Maria [12] Sandi Klavzar and Sergey V. Shpectorov. Madonia, and Carla Selmi. Bad pictures: Some Asymptotic number of isometric generalized Fi- structural properties related to overlaps. In bonacci cubes. Eur. J. Comb., 33(2):220–226, Galina Jir´askov´aand Giovanni Pighizzini, ed- 2012. itors, Descriptional Complexity of Formal Sys- tems - 22nd International Conference, DCFS [13] Gad M. Landau and Uzi Vishkin. Efficient string 2020, Vienna, Austria, August 24-26, 2020, matching in the presence of errors. In 26th An- Proceedings, volume 12442 of Lecture Notes in nual Symposium on Foundations of Computer Computer Science, pages 13–25. Springer, 2020. Science, Portland, Oregon, USA, 21-23 October 1985, pages 126–136. IEEE Computer Society, [4] Alberto Apostolico, Maxime Crochemore, Mar- 1985. tin Farach-Colton, Zvi Galil, and S. Muthukr- [14] Gonzalo Navarro and Mathieu Raffinot. Flexi- ishnan. 40 years of suffix trees. Commun. ACM, 59(4):66–73, 2016. ble Pattern Matching in Strings - practical on- line search algorithms for texts and biological se- [5] Carl Barton, Costas S. Iliopoulos, Solon P. Pis- quences. Cambridge University Press, 2002. sis, and William F. Smyth. Fast and simple com- [15] Marius Nicolae and Sanguthevar Rajasekaran. putations using prefix tables under Hamming On pattern matching with k mismatches and few and . In Jan Kratochv´ıl, Mirka don’t cares. Inf. Process. Lett., 118:78–82, 2017. Miller, and Dalibor Froncek, editors, Combi- natorial Algorithms - 25th International Work- [16] Baruch Schieber and Uzi Vishkin. On finding shop, IWOCA 2014, Duluth, MN, USA, Octo- lowest common ancestors: Simplification and ber 15-17, 2014, Revised Selected Papers, vol- parallelization. SIAM J. Comput., 17(6):1253– ume 8986 of Lecture Notes in Computer Science, 1262, 1988. pages 49–61. Springer, 2014. [17] Jianxin Wei. The structures of bad words. Eur. [6] Maxime Crochemore, Christophe Hancart, and J. Comb., 59:204–214, 2017. Thierry Lecroq. Algorithms on Sstrings. Cam- bridge University Press, 2007. [18] Jianxin Wei, Yujun Yang, and Xuena Zhu. A characterization of non-isometric binary words. [7] Z Galil and R Giancarlo. Improved string Eur. J. Comb., 78:121–133, 2019. matching with k mismatches. SIGACT News, 17(4):52–54, March 1986. [19] Jianxin Wei and Heping Zhang. Proofs of two conjectures on generalized Fibonacci cubes. Eur. [8] Dov Harel and Robert Endre Tarjan. Fast al- J. Comb., 51:419–432, 2016. gorithms for finding nearest common ancestors. SIAM J. Comput., 13(2):338–355, 1984.

[9] W.-J. Hsu. Fibonacci cubes-a new interconnec- tion topology. IEEE Transactions on Parallel and Distributed Systems, 4(1):3–12, 1993.

[10] Aleksandar Ilic, Sandi Klavzar, and Yoomi Rho. Generalized Fibonacci cubes. Discret. Math., 312(1):2–11, 2012.

[11] Sandi Klavzar. Structure of Fibonacci cubes: a survey. J. Comb. Optim., 25(4):505–522, 2013.

6