
2020 IEEE 61st Annual Symposium on Foundations of Computer Science (FOCS)

Faster Approximate Pattern Matching: A Unified Approach

Panagiotis Charalampopoulos, Department of Informatics, King’s College London, UK, and Institute of Informatics, University of Warsaw, Poland, [email protected]
Tomasz Kociumaka, Department of Computer Science, Bar-Ilan University, Ramat Gan, Israel, [email protected]
Philip Wellnitz, Max Planck Institute for Informatics, Saarland Informatics Campus, Saarbrücken, Germany, [email protected]

Abstract: In the approximate pattern matching problem, given a text T, a pattern P, and a threshold k, the task is to find (the starting positions of) all substrings of T that are at distance at most k from P. We consider the two most fundamental string metrics: Under the Hamming distance, we search for substrings of T that have at most k mismatches with P, while under the edit distance, we search for substrings of T that can be transformed to P with at most k edits.

Exact occurrences of P in T have a very simple structure: If we assume for simplicity that $|P| < |T| \le \tfrac{3}{2}|P|$ and that P occurs both as a prefix and as a suffix of T, then both P and T are periodic with a common period. However, an analogous characterization for occurrences with up to k mismatches was proved only recently by Bringmann et al. [SODA’19]: Either there are $O(k^2)$ k-mismatch occurrences of P in T, or both P and T are at Hamming distance $O(k)$ from strings with a common string period of length $O(m/k)$.

We tighten this characterization by showing that there are $O(k)$ k-mismatch occurrences in the non-periodic case, and we lift it to the edit distance setting, where we tightly bound the number of k-edit occurrences by $O(k^2)$ in the non-periodic case. Our proofs are constructive and let us obtain a unified framework for approximate pattern matching for both considered distances. In particular, we provide meta-algorithms that only rely on a small set of primitive operations. We showcase the generality of our meta-algorithms with results for the fully compressed setting, the dynamic setting, and the standard setting.

Keywords: approximate pattern matching, grammar compression, dynamic strings, Hamming distance, edit distance

A full version of this paper is available at arxiv.org/abs/2004.08350. Proofs of the claims marked with ♠ are presented only in the full version.

I. INTRODUCTION

The pattern matching problem, asking to search for occurrences of a given pattern P in a given text T, is perhaps the most fundamental problem on strings. However, in most applications, finding all exact occurrences of a pattern is not enough: Think of human spelling mistakes or DNA sequencing errors, for example. In this work, we focus on approximate pattern matching, where we are interested in finding substrings of the text that are “similar” to the pattern. While various similarity measures are imaginable, we focus on the two most commonly encountered metrics in this context: the Hamming distance and the edit distance.

Hamming Distance: Recall that the Hamming distance of two (equal-length) strings is the number of positions where the strings differ. Now, given a text T of length n, a pattern P of length m, and an integer threshold $k > 0$, we want to compute the k-mismatch occurrences of P in T, that is, all length-m substrings of T that are at Hamming distance at most k from P. This pattern matching with mismatches problem has been extensively studied. In the late 1980s, Abrahamson [2] and Kosaraju [26] independently proposed an FFT-based $O(n\sqrt{m \log m})$-time algorithm for computing the Hamming distance of P and all the length-m fragments of T. While their algorithms can be used to solve the pattern matching with mismatches problem, the first algorithm to benefit from the threshold k was given by Landau and Vishkin [27] and slightly improved by Galil and Giancarlo [15]: Based on so-called “kangaroo jumping”, they obtained an $O(nk)$-time algorithm, which is faster than $O(n\sqrt{m \log m})$ even for moderately large k. Amir et al. [4] developed two algorithms with running time $O(n\sqrt{k \log k})$ and $\tilde{O}(n + k^3 n/m)$, respectively; the latter algorithm was then improved upon by Clifford et al. [11], who presented an $\tilde{O}(n + k^2 n/m)$-time solution. Subsequently, Gawrychowski and Uznański [17] provided a smooth trade-off between the running times $\tilde{O}(n\sqrt{k})$ and $\tilde{O}(n + k^2 n/m)$ by designing an $\tilde{O}(n + kn/\sqrt{m})$-time algorithm. Very recently, Chan et al. [10] removed most of the polylog n factors in the latter solution at the cost of (Monte-Carlo) randomization. Furthermore, Gawrychowski and Uznański [17] showed that a significantly faster “combinatorial” algorithm would have (unexpected) consequences for the complexity of Boolean matrix multiplication.

Pattern matching with mismatches on strings is thus well understood in the standard setting. Nevertheless, in the settings where the strings are not given explicitly, a similar understanding is yet to be obtained. One of the main contributions of this work is to improve the upper bounds for two such settings, obtaining algorithms with running times analogous to the algorithm of [11].

Edit Distance: Recall that for two strings S and T, the edit distance (also known as Levenshtein distance) is the minimum number of edits required to transform S into T.

2575-8454/20/$31.00 ©2020 IEEE. DOI 10.1109/FOCS46700.2020.00095

Here, an edit is an insertion, a substitution, or a deletion of a single character. In the pattern matching with edits problem, we are given a text T, a pattern P, and an integer threshold $k > 0$, and the task is to find all starting positions of the k-edit (or k-error) occurrences of P in T. Formally, we are to find all positions i in T such that the edit distance between $T[i..j]$ and P is at most k for some position j. Again, a classic algorithm by Landau and Vishkin [28] runs in $O(nk)$ time. Subsequent research [37], [13] resulted in an $O(n + k^4 n/m)$-time algorithm (which is faster for $k \le \sqrt[3]{m}$). From a lower-bound perspective, we can benefit from the discovery that the classic quadratic-time algorithm for computing the edit distance of two strings is essentially optimal: Backurs and Indyk [5] recently proved that a significantly faster algorithm would yield a major breakthrough for the satisfiability problem. For pattern matching with edits, this means that there is no hope for an algorithm that is significantly faster than $O(n + k^2 n/m)$; however, apart from that “trivial” lower bound and the 20-year-old conjecture of Cole and Hariharan [13] that an $O(n + k^3 n/m)$-time algorithm should be possible, nothing is known that would close this gap. While we do not manage to tighten this gap, we do believe that the structural insights we obtain may be useful for doing so. What we do manage, however, is to significantly improve the running time of the known algorithms in two settings where T and P are not given explicitly, thereby obtaining running times that can be seen as analogous to the running time of Cole and Hariharan’s algorithm [13].

Grammar Compression: One of the settings that we consider in this paper is the fully compressed setting, when both the text T and the pattern P are given as straight-line programs. Compressing the text and the pattern is, in general, a natural thing to do: think of huge natural-language texts or genomic databases, which are easily compressible. While one approach to solve pattern matching in the fully compressed setting is to first decompress the strings and then run an algorithm for the standard setting, this voids most benefits of compression in the first place. Hence, there has been a long line of research with the goal of designing text algorithms directly operating on compressed strings. Naturally, such algorithms highly depend on the chosen compression method. In this work, we consider grammar compression, where a string T is represented using a context-free grammar that generates exactly $\{T\}$; such a grammar is also called a straight-line program (SLP).

Straight-line programs are popular due to mathematical elegance and equivalence [35], [25], [24] (up to logarithmic factors and moderate constants) to widely-used dictionary compression schemes, including LZ77 [43] and the run-length-encoded Burrows–Wheeler transform [9]. Many more schemes, such as byte-pair encoding [39], Re-Pair [29], Sequitur [32], and further members of the Lempel–Ziv family [44], [42], to name but a few, can be expressed as straight-line programs. We refer an interested reader to [36], [34], [30], [38] to learn more about grammar compression.

Working directly with a compressed representation of a text in general, intuitively at least, seems to be hard; in fact, Abboud et al. [1] showed that, for some problems, decompress-and-solve is the best we can hope for, under some reasonable assumptions from fine-grained complexity theory. Nevertheless, Jeż [22] managed to prove that exact pattern matching can be solved on grammar-compressed strings in near-linear time: Given an SLP of size n representing a string T and an SLP of size m representing a string P, we can find all exact occurrences of P in T in $O((n + m)\log|P|)$ time. For fully compressed approximate pattern matching, no such near-linear time algorithm is known, though. While the $\tilde{O}((n + |P|)k^4)$-time algorithm by Bringmann et al. [8] for pattern matching with mismatches comes close, it works in an easier setting where only the text is compressed. We fill this void by providing the first algorithm for fully compressed pattern matching with mismatches that runs in near-linear time. Denote by $\mathrm{Occ}^H_k(P, T)$ the set of (starting positions of) k-mismatch occurrences of P in T; then, our result reads as follows.

Theorem I.1 (♠). Let $G_T$ denote an SLP of size n generating a text T, let $G_P$ denote an SLP of size m generating a pattern P, let k denote a threshold, and set $N := |T| + |P|$. Then, we can compute $|\mathrm{Occ}^H_k(P, T)|$ in time $O(m \log N + n k^2 \log^3 N)$. The elements of $\mathrm{Occ}^H_k(P, T)$ can be reported within $O(|\mathrm{Occ}^H_k(P, T)|)$ extra time.

For pattern matching with edits, near-linear time algorithms are not known even in the case that the pattern is given explicitly. Currently, the best pattern matching algorithms on an SLP-compressed text run in time $O(n|P| \log|P|)$ [41] and $O(n(\min\{|P|k,\, k^4 + |P|\} + \log|T|))$ [6]. Moreover, an $\tilde{O}(n\sqrt{|P|}\,k^3)$-time solution [16] is known for (weaker) LZW compression [42]. Again, we obtain a near-linear time algorithm for fully compressed pattern matching with edits. Denote by $\mathrm{Occ}^E_k(P, T)$ the set of all starting positions of k-error occurrences of P in T; then, our result reads as follows.

Theorem I.2 (♠). Let $G_T$ denote an SLP of size n generating a string T, let $G_P$ denote an SLP of size m generating a string P, let k denote a threshold, and set $N := |T| + |P|$. Then, we can compute $|\mathrm{Occ}^E_k(P, T)|$ in time $O(m \log N + n k^4 \log^3 N)$. The elements of $\mathrm{Occ}^E_k(P, T)$ can be reported within $O(|\mathrm{Occ}^E_k(P, T)|)$ extra time.

Note that our algorithms also improve the state of the art when the pattern is given in an uncompressed form; this is because any string P admits a trivial SLP of size $O(|P|)$.

Dynamic Strings: While compression handles large static data sets, a different approach is available if the data changes frequently. Several works on pattern matching in dynamic strings considered the indexing problem, assuming that the text is maintained subject to updates and the pattern

is given explicitly at query time; we refer the interested reader to [19], [14], [37], [3], [33] and references therein. Recently, Clifford et al. [12] considered the problem of maintaining a data structure for a text T and a pattern P, both of which undergo character substitutions, in order to be able to efficiently compute the Hamming distance between P and any given fragment of T. Among other results, for the case where $|T| \le 2|P|$ and constant alphabet size, they presented a data structure with $O(\sqrt{m \log m})$ time per operation, and they proved that, conditional on the Online Boolean Matrix-Vector Multiplication (OMv) Conjecture [20], one cannot simultaneously achieve $O(m^{1/2-\varepsilon})$ for the query and update time for any constant $\varepsilon > 0$.

We consider the following more general setting: We maintain an initially empty collection of strings X that can be modified via the following “update” operations: […] $k > 0$, we can compute the set $\mathrm{Occ}^H_k(P, T)$ in time $O(n/m \cdot k^2 \log^2 N)$ and the set $\mathrm{Occ}^E_k(P, T)$ in time $O(n/m \cdot k^4 \log^2 N)$.

In the Hamming distance case, for $k < \sqrt{m}/\log N$, the data structure of Theorem I.3 is faster than recomputing the occurrences from scratch after each update: Recall that, in the standard setting, the fastest known algorithm for pattern matching with mismatches costs $\tilde{O}(n + kn/\sqrt{m}) = \tilde{O}(n + k\sqrt{m} \cdot n/m)$ time; in particular, the additive $\tilde{O}(n)$ term dominates the time complexity for the considered parameter range. Observe further that, for any k, Theorem I.3 is not slower (ignoring polylog factors) than running the $\tilde{O}(n + k^2 n/m)$-time algorithm by Clifford et al. after every update. In the edit distance case, for $k < (m/\log^2 N)^{1/3}$, the data structure of Theorem I.3 is faster than running the $O(nk)$-time Landau–Vishkin algorithm for the standard setting. Note further that, for any k, Theorem I.3 is not slower (ignoring polylog N factors) than running after every update the $O(n + k^4 n/m)$-time Cole–Hariharan algorithm, whose bottleneck for k […]

1 Strictly speaking, makestring(U) costs $O(|U| + \log N)$ time.
2 All running time bounds hold with high probability (i.e., $1 - 1/N^{\Omega(1)}$).

Theorem I.5 ([8, Theorem 1.2], simplified). Given a pattern P of length m, a text T of length $n \le \tfrac{3}{2} m$, and a threshold $k \le m$, at least one of the following holds:
• The number of k-mismatch occurrences of P in T is bounded by $O(k^2)$.
• There is a primitive string Q of length $O(m/k)$ such that $\delta_H(P, Q^\infty[0..m)) \le 6k$.
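To make the first alternative of the dichotomy more tangible, the following brute-force sketch counts k-mismatch occurrences on a small non-periodic instance. The function names (`hamming`, `k_mismatch_occurrences`, `demo`) and the concrete instance (a block of a's followed by a block of c's, with m = 40) are illustrative assumptions, not part of the paper, and the quadratic-time enumeration is only meant for experimenting with small inputs.

```python
def hamming(s: str, t: str) -> int:
    """Hamming distance of two equal-length strings."""
    assert len(s) == len(t)
    return sum(a != b for a, b in zip(s, t))


def k_mismatch_occurrences(pattern: str, text: str, k: int) -> list:
    """Starting positions i with hamming(pattern, text[i:i+m]) <= k (brute force)."""
    m = len(pattern)
    return [i for i in range(len(text) - m + 1)
            if hamming(pattern, text[i:i + m]) <= k]


def demo(m: int = 40, k: int = 3) -> list:
    # Hypothetical toy instance: neither string is close to a periodic string.
    text = "a" * (3 * m // 4) + "c" * (3 * m // 4)
    pattern = "a" * (m // 2) + "c" * (m // 2)
    return k_mismatch_occurrences(pattern, text, k)


if __name__ == "__main__":
    # Shifting the exact occurrence by s positions costs |s| mismatches,
    # so only 2k + 1 starting positions survive.
    print(demo())
```

On this instance the number of occurrences grows linearly in k, whereas for a perfectly periodic pair such as a pattern of m a's in a text of 3m/2 a's, every starting position is an occurrence.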

Motivated by the absence of examples proving the tightness of their result, Bringmann et al. [8] conjectured that the bound on the number of occurrences in Theorem I.5 can be improved to $O(k)$. We resolve their conjecture positively by proving the following stronger variant of Theorem I.5.

Theorem I.6 (Compare Theorem I.5). Given a pattern P of length m, a text T of length $n \le \tfrac{3}{2} m$, and a threshold $k \le m$, at least one of the following holds.
• The number of k-mismatch occurrences of P in T is bounded by $O(k)$.
• There is a primitive string Q of length $O(m/k)$ that satisfies $\delta_H(P, Q^\infty[0..m)) < 2k$.

Examples from [8], illustrated in Figures 1 and 2, prove the asymptotic tightness of Theorem I.6.

Figure 1. Consider a text $T := a^{3m/4} c^{3m/4}$ and a pattern $P := a^{m/2} c^{m/2}$, neither of which is approximately periodic. Then, shifting the exact occurrence of P in T by up to k positions in either direction still yields a k-mismatch occurrence. Hence, we need $\Omega(k)$ distinct k-mismatch occurrences to derive approximate periodicity of P.

Figure 2. Consider a text T and a pattern P obtained from $a^{3m/2}$ and $a^m$, respectively, by substituting a to c at k/2 random positions. Then, all length-m fragments of T are k-mismatch occurrences of P, but, with high probability, neither T nor P is perfectly periodic. Hence, we need a relaxed periodicity notion allowing for $\Omega(k)$ mismatches.

As in the exact pattern matching case, we can also characterize the (approximately) periodic case in more detail.

Theorem I.7 (Compare [8, Claim 3.1]). Let P denote a pattern of length m, let T denote a text of length $n \le \tfrac{3}{2} m$, and let $0 \le k \le m$ denote a threshold. Suppose that both $T[0..m)$ and $T[n-m..n)$ are k-mismatch occurrences of P. If there is a positive integer $d \ge 2k$ and a primitive string Q with $|Q| \le m/8d$ and $\delta_H(P, Q^\infty[0..m)) \le d$, then each of the following holds:
(a) The string T satisfies $\delta_H(T, Q^\infty[0..n)) \le 3d$.
(b) Every k-mismatch occurrence of P in T starts at a position that is a multiple of $|Q|$.
(c) The set $\mathrm{Occ}^H_k(P, T)$ can be decomposed into $O(d^2)$ arithmetic progressions with difference $|Q|$.

Recall that Theorem I.5, originating from [8], includes a weaker version of Theorem I.7. We also note that part (c) of the new characterization is asymptotically tight, as justified by modifying the example of Figure 2: Let P be obtained from $a^m$ by placing c at $(k+1)/2$ random positions, and let T be obtained from $a^{3m/2}$ by placing c at $(k+1)/2$ random positions within the middle third of $a^{3m/2}$. Then, each k-mismatch occurrence must align at least one c from P with one c from T and, conversely, each such alignment results in a k-mismatch occurrence. Hence, the number of k-mismatch occurrences is $\Theta(k^2)$. Furthermore, for every q, with high probability, $\mathrm{Occ}^H_k(P, T)$ can only be decomposed into $\Theta(k^2)$ progressions with difference q.

Structure of Pattern Matching with Edits: Having understood the structure of pattern matching with mismatches, we turn to the more complicated situation for pattern matching with edits. First, observe that the examples of Figures 1 and 2 are still valid: Any k-mismatch occurrence is also a k-error occurrence. However, as the edit distance allows insertions and deletions of characters, we can construct an example where neither P nor T is approximately periodic, yet the number of k-error occurrences is $\Omega(k^2)$; see Figure 3.

Figure 3. Consider a text $T := a^{n/2}(c \cdot a^{k-1})^{n/2k}$ and a pattern $P := a^{m/2}(c \cdot a^{k-1})^{m/2k}$ for $n := m + 2k^2$. Now, for every $i \in [-k..k]$, an $|i|$-mismatch occurrence of P starts at position $n/2 - m/2 + i \cdot k$ in T. The remaining budget on the number of errors can be spent on shifting the starting positions, so for every $j \in [|i|-k..k-|i|]$, there is a k-error occurrence starting at position $n/2 - m/2 + i \cdot k + j$ in T. Overall, the number of k-error occurrences of P in T is $\Omega(k^2)$, but neither P nor T is approximately periodic.

In the example of Figure 3, there are still only $O(k)$ regions of size $O(k)$ each where k-error occurrences start. In fact, we can show that this is the worst that can happen (we write $\delta_E(S, T)$ for the edit distance of S and T):

Theorem I.8 (♠). Given a pattern P of length m, a text T of length $n \le \tfrac{3}{2} m$, and a threshold $k \le m$, at least one of the following holds:
• The starting positions of all k-error occurrences of P in T lie in $O(k)$ intervals of length $O(k)$ each.
• There is a primitive string Q of length $O(m/k)$ and integers i, j such that $\delta_E(P, Q^\infty[i..j]) < 2k$.

Again, we treat the (approximately) periodic case separately, thereby obtaining a result similar to Theorem I.7:

Theorem I.9 (♠). Let P denote a pattern of length m, let T denote a text of length n, and let $0 \le k \le m$ denote an integer threshold such that $n < \tfrac{3}{2} m + k$. Suppose that the k-error occurrences of P in T include a prefix of T and a suffix of T. If there are positive integers $d \ge 2k$, i, j, and a primitive string Q with $|Q| \le m/8d$ and $\delta_E(P, Q^\infty[i..j]) \le d$, then each of the following holds:
(a) The string T satisfies $\delta_E(T, Q^\infty[i'..j']) \le 3d$ for some integers $i'$ and $j'$.
(b) For every $p \in \mathrm{Occ}^E_k(P, T)$, we have $p \bmod |Q| \le 3d$ or $p \bmod |Q| \ge |Q| - 3d$.
(c) The set $\mathrm{Occ}^E_k(P, T)$ can be decomposed into $O(d^3)$ arithmetic progressions with difference $|Q|$.

Technical Overview

Gaining Structural Insights: To highlight the novelty of our approach, let us first outline the proof of Theorem I.5 by Bringmann et al. [8]. Consider a pattern P of length m and a text T of length $n \le \tfrac{3}{2} m$. Split the pattern into $\Theta(k)$ blocks of length $\Theta(m/k)$ each and process each such block $P_i$ as follows: Compute the shortest string period $Q_i$ of $P_i$ and align P with a substring of $Q_i^\infty$, starting from $P_i = Q_i^\infty[0..|P_i|)$ and extending to both directions, allowing mismatches. If there are $O(k)$ mismatches for any block $P_i$, then P is approximately periodic; otherwise, there are many mismatches for every block $P_i$. In particular, in every k-mismatch occurrence where a block $P_i$ is matched exactly, all but at most k of these mismatches between P and $Q_i^\infty$ must be aligned to the corresponding mismatches between T and $Q_i^\infty$. Observing that, in any k-mismatch occurrence, at least k of the blocks must be matched exactly, this yields an $O(k^2)$ bound on the number of k-mismatch occurrences of P in T.

The main shortcoming of this approach is the initial treatment of the pattern: Since the pattern P is independently aligned with $Q_i^\infty$ for every block $P_i$, the same position in P may be accounted for as a mismatch for multiple blocks $P_i$. In particular, this happens if several adjacent blocks share the same period. This leads to an overcounting of the k-mismatch occurrences that is hard to control.

What we do instead is a more careful analysis of the pattern. Instead of creating all blocks $P_i$ at once, we process P from left to right, as described below. Suppose that $P[j..m)$ is the unprocessed suffix of P. We first consider the length-$m/8k$ prefix $P'$ of $P[j..m)$ and compute its shortest string period Q. If $|Q|$ exceeds a certain constant fraction of $|P'|$, we set $P'$ aside as a break and continue processing $P[j + |P'|..m)$. Now, if $P'$ is the 2k-th break that we set aside, our process stops, and we continue to work only with the breaks. If $P'$ does not form a break, we try extending $P'$ to a prefix R of $P[j..m)$ that satisfies $\delta_H(R, Q^\infty[0..|R|)) = \Theta(k \cdot |R|/m)$. If such a prefix R exists, we set it aside as a repetitive region and continue processing $P[j + |R|..m)$. Now, if all the repetitive regions collected so far have a total length of at least $3/8 \cdot m$, we stop our process and continue to work only with the repetitive regions computed so far. A repetitive region R does not exist only if $P[j..m)$ has too few mismatches with $Q^\infty$. In this case, we try extending $P[j..m)$ to a suffix $R'$ of P that satisfies $\delta_H(R', Q'^\infty[0..|R'|)) = \Theta(k \cdot |R'|/m)$, where $Q'$ is a suitable rotation of Q. If we fail again, we report that P is approximately periodic; otherwise we continue to work with the single repetitive region $R'$ generated. For this, we note that $|R'| \ge 3/8 \cdot m$ because all breaks and repetitive regions found beforehand have a total length of at most $5/8 \cdot m$.

Overall, for every pattern P, we obtain either 2k disjoint breaks, or disjoint repetitive regions of total length at least $3/8 \cdot m$, or a string with period $O(m/k)$ at Hamming distance $O(k)$ from P; consider Lemma III.6 for a formal proof.

If the analysis results in breaks, we observe that at least k breaks need to be matched exactly in each k-mismatch occurrence of P in T. As both the length and the shortest period of each break are $\Theta(n/k)$, there are at most $O(k)$ exact matches of each break in the text. Now, a simple marking argument shows that the number of k-mismatch occurrences of P in T is $O(k)$; consider Lemma III.8 for a formal proof of this case.

If the analysis results in repetitive regions, for each region $R_i$, we consider its $k_i$-mismatch occurrences in T with $k_i := \Theta(k \cdot |R_i|/m)$. Intuitively, this distributes the available budget of k mismatches among the repetitive regions according to their lengths. Next, we try extending each $k_i$-mismatch occurrence of each $R_i$ to an approximate occurrence of P, and we assign $|R_i|$ marks to this extension. Using insights gained in the periodic case, we bound the total number of marks by $O(k \cdot \sum_i |R_i|)$. Independently, we show that each k-mismatch occurrence of P has at least $\sum_i |R_i| - m/4$ marks. Using $\sum_i |R_i| \ge 3/8 \cdot m$, we finally obtain a bound of $O(k)$ on the number of k-mismatch occurrences of P in T; consider Lemma III.11 for the formal proof.

In total, this proves Theorem I.6. For the characterization of the periodic case (Theorem I.7), we use a similar reasoning as [8]. As in the theorem, assume that P has k-mismatch occurrences both as a prefix and as a suffix of T. Further, fix a threshold $d \ge 2k$ and a primitive string Q such that $\delta_H(P, Q^\infty[0..m)) \le d$. First, we show that every k-mismatch occurrence of P in T starts at a multiple of $|Q|$. In particular, $|Q|$ divides $n - m$ and, using this observation, we bound $\delta_H(T, Q^\infty[0..n))$. Finally, to decompose $\mathrm{Occ}^H_k(P, T)$ into $O(k^2)$ arithmetic progressions, we analyze the sequence of Hamming distances between P and the length-m fragments of T starting at the multiples of $|Q|$: we observe that the number of changes in this sequence is bounded by $O(d^2)$, which then yields the claim.

For pattern matching with edits, surprisingly few modifications in our arguments are necessary. In fact, the analysis of the pattern stays essentially the same. The main difference

in the subsequent arguments is that we need to account for shifts of up to $O(k)$ positions; this causes the increase in the bound on the number of occurrences. Unfortunately, for the periodic case of pattern matching with edits, the situation is messier. The key difficulty that we overcome is that an alignment corresponding to a specific edit distance may not be unique. In particular, due to insertions and deletions, combining (the arguments for) two disjoint substrings is not as easy as in the Hamming distance case. We solve these issues by enclosing individual errors between a string and its approximate period with so-called locked fragments, which admit a unique canonic alignment. (A similar idea was used by Cole and Hariharan [13].) Combining this with a more involved marking scheme, we then obtain Theorem I.9.

A Unified Approach to Approximate Pattern Matching: The proofs of our new structural insights are already essentially algorithmic. To obtain algorithms for all the considered settings at once, we proceed in two steps. In the first step, we devise meta-algorithms that only rely on a core set of abstract operations; in the second step, we implement these operations in various settings. Specifically, we introduce the PILLAR model, a novel abstract interface to handle strings represented in a setting-specific manner. For two strings S and T, the following operations are supported:
• Extract(S, ℓ, r): Retrieve a string $S[\ell..r]$.
• LCP(S, T): Compute the length of the longest common prefix of S and T.
• $\mathrm{LCP}^R$(S, T): Compute the length of the longest common suffix of S and T.
• IPM(S, T): Assuming that $|T| \le 2|S|$, compute the starting positions of all exact occurrences of S in T.
• Access(S, i): Retrieve the character $S[i]$.
• Length(S): Compute the length $|S|$ of the string S.

Using the PILLAR-model operations, the meta-algorithms for both pattern matching with mismatches and with errors follow the same overall structure:
• First, we implement the analysis of the pattern. Here, the key difficulty is to detect repetitive regions. Our algorithm finds the shortest repetitive region: Starting from the prefix $P'$ of the unprocessed suffix $P[j..m)$, we enumerate the mismatches (or errors) between $P[j..m)$ and $Q^\infty$. We stop when the number of mismatches (or errors) within the constructed region R exceeds $\Theta(k/m \cdot |R|)$. Intuitively, this is correct because the number of mismatches increases at most as fast as the length $|R|$. We treat the special case when we reach the end of the pattern symmetrically. Note that computing the next mismatch between two strings is a prime application of the LCP operation. For finding a next edit, we adapt the Landau–Vishkin algorithm [28], which builds on LCP operations as well.
• Next, we deal with the periodic case. This turns out to be the main difficulty. For the Hamming distance, implementing the proof of Theorem I.7 is rather straightforward. However, for the edit distance case, the more complicated proof of Theorem I.9 gets complemented with even more sophisticated algorithms. Hence, we do not discuss them in this outline.
• Finding the occurrences in the presence of 2k breaks is easy: We first use IPM operations to find exact occurrences of the breaks in the text and then perform a straightforward marking step; for the Hamming distance, we lose an $O(\log\log k)$ factor for sorting marks.
• Finding the occurrences in the presence of repetitive regions is implemented similarly; the key difference is that we use our algorithm for the periodic case to find the approximate occurrences of the repetitive regions.

Overall, this approach then yields the main technical results of this work (stated below for strings of arbitrary lengths):

Theorem I.10 (♠). Given a pattern P of length m, a text T of length n, and a positive integer $k \le m$, we can compute (a representation of) the set $\mathrm{Occ}^H_k(P, T)$ using $O(n/m \cdot k^2 \log\log k)$ time plus $O(n/m \cdot k^2)$ PILLAR operations.

For pattern matching with edits, the number of PILLAR-model operations matches the time cost of non-PILLAR-model operations; hence the simplified theorem statement.

Theorem I.11 (♠). Given a pattern P of length m, a text T of length n, and a positive integer $k \le m$, we can compute (a representation of) the set $\mathrm{Occ}^E_k(P, T)$ using $O(n/m \cdot k^4)$ time in the PILLAR model.

Finally, we show how to implement the PILLAR model in the settings that we consider:
• As a toy example, we start with the standard setting. Here, implementing the PILLAR-model operations boils down to collecting known tools on strings.
• For the fully compressed setting, we heavily rely on the recompression technique by Jeż [22], [23] (especially for internal pattern matching queries), as well as on other works on straight-line programs [6], [21].
• Finally, for the dynamic setting, we use the data structure by Gawrychowski et al. [18] (for LCP and $\mathrm{LCP}^R$ operations). Further, we reuse some tools from the fully compressed setting; the data structure of [18] actually works with (a form of) straight-line programs.

As the primitive operations of the PILLAR model are rather simple, we believe that they can be implemented in further settings not considered here.

II. PRELIMINARIES

Sets: We write $[n]$ to denote the set $\{1, \ldots, n\}$. Further, we write $[i..j]$ to denote $\{i, \ldots, j\}$ and $[i..j)$ to denote $\{i, \ldots, j-1\}$. The set $\{a + j \cdot d \mid j \in [0..\ell)\}$ is an arithmetic progression with starting value a, difference d, and length ℓ.

Strings: We write $T = T[0]\,T[1] \cdots T[n-1]$ to denote a string of length $|T| = n$ over an alphabet $\Sigma$. The elements of $\Sigma$ are called characters.

For two positions $i \le j$ in $T$, we write $T[i..j+1) := T[i..j] := T[i] \cdots T[j]$ for the fragment of $T$ that starts at position $i$ and ends at position $j$. A prefix of a string $T$ is a fragment that starts at position $0$; a suffix of a string $T$ is a fragment that ends at position $|T|-1$.

For a string $T$ and a positive integer $c$, we write $T^c$ for the concatenation of $c$ copies of $T$, and $T^\infty$ for the infinite string obtained by concatenating infinitely many copies of $T$. A string $T$ is called primitive if it cannot be expressed as $T = Q^c$ for a string $Q$ and an integer $c > 1$.

A positive integer $p$ is called a period of a string $T$ if $T[i] = T[i+p]$ for all $i \in [0..|T|-p)$. We refer to the smallest period as the period $\mathrm{per}(T)$ of the string. The string $T[0..\mathrm{per}(T))$ is called the string period of $T$. We call a string periodic if its period is at most half of its length.

For a string $T$, we define the following rotation operations. The operation $\mathrm{rot}(\cdot)$ takes as input a string, and moves its last character to the front. We write $\mathrm{rot}^{-1}(\cdot)$ to denote the corresponding inverse operation. Note that a primitive string $T$ does not match any of its non-trivial rotations.

Hamming Distance and Pattern Matching with Mismatches: For two strings $S$ and $T$ of the same length $n$, we define the set of mismatches between $S$ and $T$ as $\mathrm{Mis}(S,T) := \{i \in [0..n) \mid S[i] \ne T[i]\}$; the size of $\mathrm{Mis}(S,T)$ is the Hamming distance $\delta_H(S,T)$ between $S$ and $T$. As we are often concerned with the Hamming distance of a string $S$ and a prefix of $T^\infty$ for a string $T$, we write $\mathrm{Mis}(S,T^*) := \mathrm{Mis}(S, T^\infty[0..|S|))$ and $\delta_H(S,T^*) := |\mathrm{Mis}(S,T^*)|$.

For a pattern string $P$, a text string $T$, and a threshold $k \in [0..|P|]$, a fragment $T[i..i+|P|)$ of $T$ is a $k$-mismatch occurrence of $P$ in $T$ if $\delta_H(P, T[i..i+|P|)) \le k$. We write $\mathrm{Occ}^H_k(P,T)$ to denote the set of starting positions of all $k$-mismatch occurrences of $P$ in $T$.

Lastly, we define pattern matching with mismatches.

Problem II.1 (Pattern matching with mismatches). Given a pattern $P$, a text $T$, and a threshold $k$, compute $\mathrm{Occ}^H_k(P,T)$.

III. IMPROVED STRUCTURAL INSIGHTS INTO PATTERN MATCHING WITH MISMATCHES

In this section, we improve the result of [8] and show the following asymptotically tight structural characterization of the $k$-mismatch occurrences of a pattern $P$ in a text $T$.

Theorem III.1. Given a pattern $P$ of length $m$, a text $T$ of length $n$, and a threshold $k \in [1..m]$, at least one of the following holds:
• The number of $k$-mismatch occurrences of $P$ in $T$ is bounded by $|\mathrm{Occ}^H_k(P,T)| \le 576 \cdot n/m \cdot k$.
• There is a primitive string $Q$ of length $|Q| \le m/128k$ that satisfies $\delta_H(P,Q^*) < 2k$.

Theorem III.2. Let $P$ denote a pattern of length $m$, let $T$ denote a text of length $n \le 3/2\,m$, and let $k \in [0..m]$ denote a threshold such that $T[0..m)$ and $T[n-m..n)$ are $k$-mismatch occurrences of $P$ (that is, $\{0, n-m\} \subseteq \mathrm{Occ}^H_k(P,T)$). If there is a positive integer $d \ge 2k$ and a primitive string $Q$ with $|Q| \le m/8d$ and $\delta_H(P,Q^*) \le d$, then each of the following holds:
(a) Every position in $\mathrm{Occ}^H_k(P,T)$ is a multiple of $|Q|$.
(b) The string $T$ satisfies $\delta_H(T,Q^*) \le 3d$.
(c) The set $\mathrm{Occ}^H_k(P,T)$ can be decomposed into $3d(d+1)$ arithmetic progressions with difference $|Q|$.
(d) If $\delta_H(P,Q^*) = d$, then $|\mathrm{Occ}^H_k(P,T)| \le 6d$.

Before proving Theorem III.2, we characterize the values $\delta_H(T[j|Q|..j|Q|+m), P)$ under the extra assumption that $\delta_H(T,Q^*)$ is small as well; this assumption is dropped in Theorem III.2.

Lemma III.3. Let $P$ denote a pattern of length $m$ and let $T$ denote a text of length $n \le 3/2\,m$. Further, let $Q$ denote a string of length $q$ and set $d := \delta_H(P,Q^*)$ and $d' := \delta_H(T,Q^*)$. Then, the sequence of values $h_j := \delta_H(T[jq..jq+m), P)$ for $0 \le j \le \lfloor (n-m)/q \rfloor$ contains at most $d'(2d+1)$ entries $h_j$ with $h_j \ne h_{j+1}$ and, unless $d = 0$, at most $2d'$ entries $h_j$ with $h_j \le d/2$.

Proof: For every $\tau \in \mathrm{Mis}(T,Q^*)$ and $\pi \in \mathrm{Mis}(P,Q^*)$, let us put $2 - \delta_H(P[\pi], T[\tau])$ marks at position $\tau - \pi$ in $T$, if it exists. For each $0 \le j \le \lfloor (n-m)/q \rfloor$, let $\mu_j(\tau,\pi)$ denote the number of marks placed at position $jq$ due to the mismatches $\tau$ in $T$ and $\pi$ in $P$, that is, $\mu_j(\tau,\pi) = 2 - \delta_H(P[\pi], T[\tau])$ if $\pi \in \mathrm{Mis}(P,Q^*)$ and $\tau = jq + \pi \in \mathrm{Mis}(T,Q^*)$, and $\mu_j(\tau,\pi) = 0$ otherwise. Further, define $\mu_j := \sum_{\tau,\pi} \mu_j(\tau,\pi)$ as the total number of marks at position $jq$.

Next, for every $0 \le j \le \lfloor (n-m)/q \rfloor$, we relate the Hamming distance $h_j := \delta_H(T[jq..jq+m), P)$ to the number of marks $\mu_j$ at position $jq$ and the Hamming distances $\delta_H(T[jq..jq+m), Q^*)$ and $\delta_H(P,Q^*)$; consult Figure 4 for an illustration.
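The marking scheme can be checked numerically on small inputs. The following Python sketch is a brute-force illustration only, not one of the paper's algorithms; the function names `mis_to_power`, `hamming`, `marks_at`, and `check_identity` are hypothetical. It computes $h_j$, $d = \delta_H(P,Q^*)$, $d_j = \delta_H(T[jq..jq+m),Q^*)$, and $\mu_j$ directly from the definitions, so that the relation $h_j = d + d_j - \mu_j$ can be verified for every $j$.

```python
def mis_to_power(s, q):
    """Mis(S, Q*): positions where s differs from the length-|s| prefix of q^infinity."""
    return [i for i, c in enumerate(s) if c != q[i % len(q)]]


def hamming(s, t):
    """Hamming distance of two equal-length strings."""
    if len(s) != len(t):
        raise ValueError("strings must have equal length")
    return sum(a != b for a, b in zip(s, t))


def marks_at(p, t, q, j):
    """mu_j: number of marks placed at position j*|q| of t."""
    shift = j * len(q)
    mis_t = set(mis_to_power(t, q))
    total = 0
    for pi in mis_to_power(p, q):
        tau = shift + pi
        if tau in mis_t:
            # Two marks if P[pi] == T[tau], one mark otherwise.
            total += 2 - (p[pi] != t[tau])
    return total


def check_identity(p, t, q):
    """Return the tuple (h_j, d, d_j, mu_j) for every valid shift j."""
    m, n, qlen = len(p), len(t), len(q)
    d = len(mis_to_power(p, q))
    rows = []
    for j in range((n - m) // qlen + 1):
        # The window starts at a multiple of |q|, so it is compared with q^infinity[0..m).
        window = t[j * qlen : j * qlen + m]
        h_j = hamming(window, p)
        d_j = len(mis_to_power(window, q))
        rows.append((h_j, d, d_j, marks_at(p, t, q, j)))
    return rows


if __name__ == "__main__":
    for h_j, d, d_j, mu_j in check_identity("abcbabax", "abcbcbaxaba", "ab"):
        print(h_j, d, d_j, mu_j, h_j == d + d_j - mu_j)
```

The relation holds position by position, so the check does not depend on $Q$ being primitive or on $n \le 3/2\,m$; those assumptions only matter for the counting bounds of Lemma III.3.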

Figure 4. In both strings, all blocks apart from the last one in $T$ are of length $q$. For each string $X \in \{P, T\}$ we show the characters at positions in $\mathrm{Mis}(X,Q^*)$, only. At position $jq$ in $T$, we place $\mu_j(\tau_2,\pi_1) + \mu_j(\tau_4,\pi_2) + \mu_j(\tau_5,\pi_3) = 1 + 2 + 1 = 4$ marks. We have $\delta_H(P,Q^*) = |\{\pi_1,\pi_2,\pi_3\}| = 3$ and $\delta_H(T[jq..jq+m),Q^*) = |\{\tau_2,\tau_3,\tau_4,\tau_5\}| = 4$. Using Claim III.4, we obtain that $h_j = 3$; the three corresponding mismatches are indicated by asterisks.

Claim III.4. For each $0 \le j \le \lfloor (n-m)/q \rfloor$, we have $h_j = \delta_H(P,Q^*) + \delta_H(T[jq..jq+m),Q^*) - \mu_j$.

Proof: We show the following equivalent statement:
\[ |\mathrm{Mis}(T[jq..jq+m),P)| = |\mathrm{Mis}(P,Q^*)| + |\mathrm{Mis}(T[jq..jq+m),Q^*)| - \sum_{\tau,\pi} \mu_j(\tau,\pi). \tag{1} \]
By construction, $\mu_j(\tau,\pi) = 0$ whenever $\tau \ne \pi + jq$. Hence, we can prove (1) by showing that for every position $\pi \in [0..m)$ in $P$ and every position $\tau := jq + \pi$ in $T$, the following equation holds:
\[ \delta_H(T[\tau],P[\pi]) = \delta_H(P[\pi],Q^\infty[\pi]) + \delta_H(T[\tau],Q^\infty[\tau]) - \mu_j(\tau,\pi). \]
By a case distinction on whether $\pi \in \mathrm{Mis}(P,Q^*)$ and whether $\tau \in \mathrm{Mis}(T,Q^*)$, one can see that this is indeed the case. Combining the equations obtained for every pair of positions $\pi$ and $\tau$, we derive (1).

In particular, Claim III.4 yields
\[ h_{j+1} - h_j = |\mathrm{Mis}(T,Q^*) \cap [jq+m..(j+1)q+m)| - |\mathrm{Mis}(T,Q^*) \cap [jq..(j+1)q)| - \mu_{j+1} + \mu_j. \]
Hence, in order for $h_{j+1}$ not to equal $h_j$, at least one of the four terms on the right-hand side of the equation above must be non-zero. Let us analyze when this is possible. To that end, we first observe that the set $\mathrm{Mis}(T,Q^*) \cap [jq+m..(j+1)q+m)$ contains only elements $\tau \in \mathrm{Mis}(T,Q^*)$ with $\tau \ge m$, and that the set $\mathrm{Mis}(T,Q^*) \cap [jq..(j+1)q)$ only contains elements $\tau \in \mathrm{Mis}(T,Q^*)$ with $\tau < (j+1)q \le n - m \le m/2$. Hence, every $\tau \in \mathrm{Mis}(T,Q^*)$ belongs to at most one of these sets over all values of $j$, so the first two terms are non-zero for at most $d'$ values of $j$ in total. Further, each non-zero value in one of the terms $\mu_{j+1}$ and $\mu_j$ can be attributed to a marked position ($(j+1)q$ or $jq$, respectively). The total number of marked positions is at most $dd'$, so $h_{j+1}$ can be different from $h_j$ due to one of the terms $\mu_{j+1}$ or $\mu_j$ at most $2dd'$ times. In total, we conclude that the number of entries $h_j$ with $h_j \ne h_{j+1}$ is at most $d'(2d+1)$.

Next, observe that $\mu_j \le 2|\mathrm{Mis}(T,Q^*) \cap [jq..jq+m)| = 2\delta_H(T[jq..jq+m),Q^*)$, and therefore
\[ h_j = \delta_H(P,Q^*) + \delta_H(T[jq..jq+m),Q^*) - \mu_j \ge d - \mu_j/2. \]
Consequently, $h_j \le d/2$ yields $\mu_j \ge d$, that is, there are at least $d$ marks at position $jq$. Given that the total number of marks is at most $2dd'$, the number of entries $h_j$ with $h_j \le d/2$ is at most $2d'$, assuming that $d > 0$.

Now, we drop the assumption that $\delta_H(T,Q^*)$ is small, and prove Theorem III.2.

Proof of Theorem III.2: Consider any position $\ell \in \mathrm{Occ}^H_k(P,T)$. By the definition of a $k$-mismatch occurrence, we have $\delta_H(T[\ell..\ell+m),P) \le k \le d/2$. Combining this inequality with $\delta_H(P,Q^*) \le d$ via the triangle inequality yields $\delta_H(T[\ell..\ell+m),Q^*) \le 3/2\,d$. Note that, similarly, for the position $0 \in \mathrm{Occ}^H_k(P,T)$, we obtain $\delta_H(T[0..m),Q^*) \le 3/2\,d$, which lets us compare the overlapping parts of $Q^\infty$. Replacing strings by superstrings and applying the triangle inequality yields
\begin{align*}
\delta_H(Q^\infty[\ell..m), Q^\infty[0..m-\ell))
&\le \delta_H(T[\ell..m), Q^\infty[\ell..m)) + \delta_H(T[\ell..m), Q^\infty[0..m-\ell)) \\
&\le \delta_H(T[0..m), Q^\infty[0..m)) + \delta_H(T[\ell..\ell+m), Q^\infty[0..m)) \\
&= \delta_H(T[0..m),Q^*) + \delta_H(T[\ell..\ell+m),Q^*) \le 3d.
\end{align*}
Towards a proof by contradiction, suppose that $\ell$ is not an integer multiple of $|Q|$. As $Q$ is primitive, we have
\[ 3d \ge \delta_H(Q^\infty[\ell..m), Q^\infty[0..m-\ell)) \ge \lfloor (m-\ell)/|Q| \rfloor \ge \lfloor (m/2)/(m/8d) \rfloor = 4d, \]
where the second bound follows from $\ell \le m/2$ and $|Q| \le m/8d$. This contradiction yields Claim (a).

In order to prove Claim (b), we observe that $n - m \in \mathrm{Occ}^H_k(P,T)$ is a multiple of $|Q|$. Consequently,
\[ \delta_H(T,Q^*) = \delta_H(T[0..n-m),Q^*) + \delta_H(T[n-m..n),Q^*) \le \delta_H(T[0..m),Q^*) + 3/2\,d \le 3d, \]
which concludes the proof of Claim (b).

For a proof of Claims (c) and (d), we apply Lemma III.3. Due to Claim (a), each position in $\mathrm{Occ}^H_k(P,T)$ corresponds to an entry $h_j$ with $h_j \le k$. In particular, each maximal block of consecutive entries $h_j$ with $h_j \le k$ corresponds to an arithmetic progression with difference $|Q|$ in $\mathrm{Occ}^H_k(P,T)$. By Lemma III.3 and Claim (b), the number of entries $h_j$ with $h_j \le k < h_{j+1}$ or $h_j > k \ge h_{j+1}$ is in total at most $3d(2d+1)$, so the number of arithmetic progressions is at most $1 + 1/2 \cdot 3d(2d+1) \le 3d(d+1)$, which proves Claim (c).

For Claim (d), we observe that if $d = \delta_H(P,Q^*)$, then each position in $\mathrm{Occ}^H_k(P,T)$ corresponds to an entry $h_j$ with $h_j \le k \le d/2$; thus, by Lemma III.3 and Claim (b), $|\mathrm{Occ}^H_k(P,T)| \le 2 \cdot 3d = 6d$.

Corollary III.5. Let $P$ denote a pattern of length $m$, let $T$ denote a text of length $n$, and let $k \in [0..m]$ denote a threshold. If there is a positive integer $d \ge 2k$ and a primitive string $Q$ with $|Q| \le m/8d$ and $\delta_H(P,Q^*) \le d$, then the set $\mathrm{Occ}^H_k(P,T)$ can be decomposed into $6 \cdot n/m \cdot d(d+1)$ arithmetic progressions with difference $|Q|$. Moreover, if $\delta_H(P,Q^*) = d$, then $|\mathrm{Occ}^H_k(P,T)| \le 12 \cdot n/m \cdot d$.
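As a sanity check of Theorem III.2(a) and (b), the following Python sketch builds one small instance that satisfies the hypotheses with $k = 1$, $d = 2$, $Q = \texttt{ab}$, $m = 32$, and $n = 44$, and then enumerates all $k$-mismatch occurrences by brute force. The instance and the helper names (`replace_at`, `build_instance`, `occ_k`, `dist_to_power`) are assumptions made for illustration; the quadratic-time enumeration is not the method of the paper.

```python
def hamming(s, t):
    """Hamming distance of two equal-length strings."""
    return sum(a != b for a, b in zip(s, t))


def occ_k(p, t, k):
    """Starting positions of all k-mismatch occurrences of p in t (brute force)."""
    m = len(p)
    return [i for i in range(len(t) - m + 1) if hamming(p, t[i:i + m]) <= k]


def dist_to_power(s, q):
    """delta_H(S, Q*): distance of s to the length-|s| prefix of q^infinity."""
    return sum(c != q[i % len(q)] for i, c in enumerate(s))


def replace_at(s, positions, ch):
    """Copy of s with the characters at the given positions replaced by ch."""
    chars = list(s)
    for pos in positions:
        chars[pos] = ch
    return "".join(chars)


def build_instance():
    """A hand-made instance with k = 1, d = 2, |Q| = 2 = m / (8 d)."""
    q = "ab"
    p = replace_at(q * 16, [5, 20], "c")          # delta_H(P, Q*) = 2 = d
    t = replace_at(q * 22, [5, 17, 20, 32], "c")  # n = 44 <= 3/2 * 32
    return p, t, q, 1, 2


if __name__ == "__main__":
    p, t, q, k, d = build_instance()
    occ = occ_k(p, t, k)
    print("occurrences:", occ)
    print("all multiples of |Q|:", all(pos % len(q) == 0 for pos in occ))
    print("delta_H(T, Q*) <= 3d:", dist_to_power(t, q) <= 3 * d)
```

In this instance both $T[0..m)$ and $T[n-m..n)$ are at Hamming distance 1 from $P$, so the theorem applies and every reported position must be even.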

Proof: Partition the string $T$ into $\lfloor 2n/m \rfloor$ blocks $T_0, \ldots, T_{\lfloor 2n/m \rfloor - 1}$ of length less than $3/2\,m$ each, where the $i$th block starts at position $\lfloor i \cdot m/2 \rfloor$; formally, we set $T_i := T[\lfloor i \cdot m/2 \rfloor \,..\, \min\{n, \lfloor (i+3) \cdot m/2 \rfloor - 1\})$. If $\mathrm{Occ}^H_k(P,T_i) \ne \emptyset$, we define $T_i'$ to be the shortest fragment of $T_i$ containing all $k$-mismatch occurrences of $P$ in $T_i$. As a result, $T_i'$ satisfies the assumptions of Theorem III.2. Hence, $\mathrm{Occ}^H_k(P,T_i')$ can be decomposed into $3d(d+1)$ arithmetic progressions with difference $|Q|$, and $|\mathrm{Occ}^H_k(P,T_i')| \le 6d$ if $\delta_H(P,Q^*) = d$.
We conclude that $\mathrm{Occ}^H_k(P,T)$ decomposes into $6 \cdot n/m \cdot d(d+1)$ arithmetic progressions with difference $|Q|$; further, $|\mathrm{Occ}^H_k(P,T)| \le 12 \cdot n/m \cdot d$ if $\delta_H(P,Q^*) = d$.

B. The Non-Periodic Case

Having dealt with the (approximately) periodic case, we now turn to the general case. In particular, we show that whenever the string $P$ is sufficiently far from being periodic, the number of $k$-mismatch occurrences of $P$ in any string $T$ of length $n \le 3/2\,m$ is $O(k)$.

Intuitively, we proceed (and thereby prove Theorem III.1) as follows: We first analyze the string $P$ for useful structure that can help in bounding the number of occurrences of $P$ in any string $T$. If we fail to find any special structure in $P$, then we conclude that the string $P$ is close to a periodic string with a small period (compared to $|P|$), a case that we already understand thanks to the previous subsection.

We start by investigating the structure of any string $P$.

Lemma III.6. Given a string $P$ of length $m$ and a threshold $k \in [1..m]$, at least one of the following holds:
(a) The string $P$ contains $2k$ disjoint breaks $B_1, \ldots, B_{2k}$ each having period $\mathrm{per}(B_i) > m/128k$ and length $|B_i| = \lfloor m/8k \rfloor$.
(b) The string $P$ contains $r$ disjoint repetitive regions $R_1, \ldots, R_r$ of total length $\sum_{i=1}^{r} |R_i| \ge 3/8 \cdot m$ such that each region $R_i$ satisfies $|R_i| \ge m/8k$ and has a primitive approximate period $Q_i$ with $|Q_i| \le m/128k$ and $\delta_H(R_i,Q_i^*) = \lceil 8k/m \cdot |R_i| \rceil$.
(c) The string $P$ has a primitive approximate period $Q$ with $|Q| \le m/128k$ and $\delta_H(P,Q^*) < 8k$.

Proof: We prove the claim constructively, that is, we construct either a set $\mathcal{B}$ of $2k$ breaks, or a set $\mathcal{R}$ of repetitive regions, or, if we fail to construct either, we derive an approximate string period $Q$ of the string $P$ with the desired properties.

We process the string $P$ from left to right as follows: If the fragment $P'$ of the next $\lfloor m/8k \rfloor$ (unprocessed) characters of $P$ has a long period, we have found a new break and continue (or return the found set of $2k$ breaks). Otherwise, if $P'$ has a short string period $Q$, we try to extend the fragment $P'$ (to the right) into a repetitive region. If we succeed, we have found a new repetitive region and continue (or return the found set of repetitive regions if the total length of all repetitive regions found so far is at least $3/8 \cdot m$). If we fail to construct a new repetitive region, then we conclude that the suffix of $P$ starting with $P'$ has an approximate period $Q$. We try to construct a repetitive region by extending this suffix to the left, dropping all other repetitive regions computed so far. If we fail again, we declare that a rotation of $Q$ is an approximate period of the string $P$. Consider Algorithm 1 for a detailed description.

Algorithm 1: A constructive proof of Lemma III.6.
    $\mathcal{B} \leftarrow \{\}$; $\mathcal{R} \leftarrow \{\}$;
    while true do
        Consider the fragment $P' = P[j..j+\lfloor m/8k \rfloor)$ of the next $\lfloor m/8k \rfloor$ unprocessed characters of $P$;
        if $\mathrm{per}(P') > m/128k$ then
            $\mathcal{B} \leftarrow \mathcal{B} \cup \{P'\}$;
            if $|\mathcal{B}| = 2k$ then return breaks $\mathcal{B}$;
        else
            $Q \leftarrow P[j..j+\mathrm{per}(P'))$;
            Search for a prefix $R$ of $P[j..m)$ with $|R| > |P'|$ and $\delta_H(R,Q^*) = \lceil 8k/m \cdot |R| \rceil$;
            if such $R$ exists then
                $\mathcal{R} \leftarrow \mathcal{R} \cup \{(R,Q)\}$;
                if $\sum_{(R,Q) \in \mathcal{R}} |R| \ge 3/8 \cdot m$ then return repetitive regions $\mathcal{R}$;
            else
                Search for a suffix $R'$ of $P$ with $|R'| \ge m - j$ and $\delta_H(R', \mathrm{rot}^{|R'|-m+j}(Q)^*) = \lceil 8k/m \cdot |R'| \rceil$;
                if such $R'$ exists then
                    return repetitive region $(R', \mathrm{rot}^{|R'|-m+j}(Q))$
                else return approximate period $\mathrm{rot}^{j}(Q)$;

Note that, by construction, all breaks in the set $\mathcal{B}$ and repetitive regions in the set $\mathcal{R}$ returned by the algorithm are disjoint and satisfy the claimed properties. To prove that the algorithm is also correct when it fails to find a new repetitive region, we start by bounding from above the length of the processed prefix of $P$.

Claim III.7. Whenever we consider a new fragment $P[j..j+\lfloor m/8k \rfloor)$ of the next $\lfloor m/8k \rfloor$ unprocessed characters of $P$, such a fragment starts at a position $j < 5/8 \cdot m$.

Proof: Observe that whenever we consider a new fragment $P[j..j+\lfloor m/8k \rfloor)$, the string $P[0..j)$ has been partitioned into breaks and repetitive regions. The total length of breaks is less than $2k \lfloor m/8k \rfloor \le 2/8 \cdot m$, and the total length of repetitive regions is less than $3/8 \cdot m$. Hence, $j < 5/8 \cdot m$, yielding the claim.

Note that Claim III.7 also shows that whenever we consider a new fragment $P'$ of $\lfloor m/8k \rfloor$ characters, there is indeed such a fragment, that is, $P'$ is well-defined.

Now, consider the case when, for a fragment $P' = P[j..j+\lfloor m/8k \rfloor)$ (that is not a break) and its string period $Q = P[j..j+\mathrm{per}(P'))$, we fail to obtain a new repetitive region $R$. In this case, we search for a repetitive region $R'$ of length $|R'| \ge m - j$ that is a suffix of $P$ and has an approximate period $Q' := \mathrm{rot}^{|R'|-m+j}(Q)$. If we indeed find
such a region $R'$, then $|R'| \ge m - j \ge m - 5/8 \cdot m = 3/8 \cdot m$ by Claim III.7, so $R'$ is long enough to be reported on its own. However, if we fail to find such $R'$, we need to show that $\mathrm{rot}^{j}(Q)$ can be reported as an approximate period of $P$, that is, $\delta_H(P, \mathrm{rot}^{j}(Q)^*) < 8k$.

We first derive $\delta_H(P[j..m),Q^*) < \lceil 8k/m \cdot (m-j) \rceil$. For this, we inductively prove that the values $\Delta_\rho := \lceil 8k/m \cdot \rho \rceil - \delta_H(P[j..j+\rho),Q^*)$ for $\rho \in [|P'|..m-j]$ are all at least 1. In the base case of $\rho = |P'|$, we have $\Delta_\rho = 1 - 0$ because $Q$ is the string period of $P'$. To carry out an inductive step, suppose that $\Delta_{\rho-1} \ge 1$ for some $\rho \in [|P'|+1..m-j]$. Notice that $\Delta_\rho \ge \Delta_{\rho-1} - 1 \ge 0$: The first term in the definition of $\Delta_\rho$ has not decreased compared to $\Delta_{\rho-1}$, and the term $\delta_H(P[j..j+\rho),Q^*)$ may have increased by at most one. Moreover, $\Delta_\rho \ne 0$ because $R = P[j..j+\rho)$ could not be reported as a repetitive region. Since $\Delta_\rho \in \mathbb{Z}$, we conclude that $\Delta_\rho \ge 1$. This inductive reasoning ultimately shows that $\Delta_{m-j} > 0$, that is, $\delta_H(P[j..m),Q^*) < \lceil 8k/m \cdot (m-j) \rceil$.

A symmetric argument holds for the values $\Delta'_\rho := \lceil 8k/m \cdot \rho \rceil - \delta_H(P[m-\rho..m), \mathrm{rot}^{\rho-m+j}(Q)^*)$ for $\rho \in [m-j..m]$ because no repetitive region $R'$ was found as an extension of $P[j..m)$ to the left. Thus, $\delta_H(P, \mathrm{rot}^{j}(Q)^*) < 8k$, that is, $\mathrm{rot}^{j}(Q)$ is an approximate period of $P$.

In the next steps, we discuss how to exploit the structure obtained by Lemma III.6. First, we discuss the case that a string $P$ contains $2k$ disjoint breaks.

Lemma III.8. Let $P$ denote a pattern of length $m$, let $T$ denote a text of length $n$, and let $k \in [1..m]$ denote a threshold. Suppose that $P$ contains $2k$ disjoint breaks $B_1, \ldots, B_{2k}$ each satisfying $\mathrm{per}(B_i) \ge m/128k$. Then, $|\mathrm{Occ}^H_k(P,T)| \le 256 \cdot n/m \cdot k$.

Proof: For every break $B_i = P[b_i..b_i+|B_i|)$ we mark a position $j$ in $T$ if $j + b_i \in \mathrm{Occ}(B_i,T)$.

Claim III.9. We place at most $256 \cdot n/m \cdot k^2$ marks in total.

Proof: Fix a break $B_i$ and notice that the positions in $\mathrm{Occ}(B_i,T)$ are at distance at least $\mathrm{per}(B_i)$ from each other. Hence, for the break $B_i$, we place at most $128 \cdot n/m \cdot k$ marks in $T$. In total, we therefore place at most $2k \cdot 128 \cdot n/m \cdot k = 256 \cdot n/m \cdot k^2$ marks in $T$.

Next, we show that every $k$-mismatch occurrence of $P$ in $T$ starts at a position with at least $k$ marks.

Claim III.10. Each $\ell \in \mathrm{Occ}^H_k(P,T)$ has at least $k$ marks.

Proof: Fix $\ell \in \mathrm{Occ}^H_k(P,T)$. Out of the $2k$ breaks, at least $k$ breaks are matched exactly, as not matching a break exactly incurs at least one mismatch. If a break $B_i$ is matched exactly, then we have $\ell + b_i \in \mathrm{Occ}(B_i,T)$. Hence, we have placed a mark at position $\ell$. Thus, there is a mark at position $\ell$ for every break $B_i$ matched exactly in the corresponding occurrence of $P$ in $T$. In total, there are at least $k$ marks at position $\ell$ in $T$.

By Claims III.9 and III.10, we have $|\mathrm{Occ}^H_k(P,T)| \le (256 \cdot n/m \cdot k^2)/k = 256 \cdot n/m \cdot k$.

Secondly, we discuss how to use repetitive regions in the string $P$ to bound $|\mathrm{Occ}^H_k(P,T)|$.

Lemma III.11. Let $P$ denote a pattern of length $m$, let $T$ denote a text of length $n$, and let $k \in [1..m]$ denote a threshold. If $P$ contains disjoint repetitive regions $R_1, \ldots, R_r$ of total length at least $\sum_{i=1}^{r} |R_i| \ge 3/8 \cdot m$ such that each region $R_i$ satisfies $|R_i| \ge m/8k$ and has a primitive approximate period $Q_i$ with $|Q_i| \le m/128k$ and $\delta_H(R_i,Q_i^*) = \lceil 8k/m \cdot |R_i| \rceil$, then $|\mathrm{Occ}^H_k(P,T)| \le 576 \cdot n/m \cdot k$.

Proof: Set $m_R := \sum_{i=1}^{r} |R_i|$. For each repetitive region $R_i = P[r_i..r_i+|R_i|)$, set $k_i := \lfloor 4k/m \cdot |R_i| \rfloor$, and place $|R_i|$ marks at each position $j$ with $j + r_i \in \mathrm{Occ}^H_{k_i}(R_i,T)$.

Claim III.12. We place at most $192 \cdot n/m \cdot k \cdot m_R$ marks in total.

Proof: We use Corollary III.5 to bound $|\mathrm{Occ}^H_{k_i}(R_i,T)|$. For this, we set $d_i := \delta_H(R_i,Q_i^*)$ and notice that $d_i = \lceil 8k/m \cdot |R_i| \rceil \le 16 \cdot k/m \cdot |R_i|$ since $|R_i| \ge m/8k$. Moreover, $d_i \ge 2k_i$ and $|Q_i| \le m/128k \le |R_i|/8d_i$ due to $d_i \le 16 \cdot k/m \cdot |R_i|$. Hence, the assumptions of Corollary III.5 are satisfied. Consequently, $|\mathrm{Occ}^H_{k_i}(R_i,T)| \le 12 \cdot n/|R_i| \cdot d_i \le 192 \cdot n/m \cdot k$; the last inequality holds as $d_i \le 16 \cdot k/m \cdot |R_i|$.
The total number of marks placed due to $R_i$ is therefore bounded by $192 \cdot n/m \cdot k \cdot |R_i|$. Across all repetitive regions, this sums up to $192 \cdot n/m \cdot k \cdot m_R$, yielding the claim.

Next, we show that every $k$-mismatch occurrence of $P$ in $T$ starts at a position with many marks.

Claim III.13. Each $\ell \in \mathrm{Occ}^H_k(P,T)$ has at least $m_R - m/4$ marks.

Proof: Let us fix $\ell \in \mathrm{Occ}^H_k(P,T)$ and denote by $k_i' := \delta_H(R_i, T[\ell+r_i..\ell+r_i+|R_i|))$ the number of mismatches incurred by the repetitive region $R_i$. Further, let $I := \{i \in [1..r] \mid k_i' \le k_i\} = \{i \in [1..r] \mid k_i' \le 4k/m \cdot |R_i|\}$ denote the set of indices of all repetitive regions that have $k_i$-mismatch occurrences at the corresponding positions in $T$. By construction, for each $i \in I$, we have placed $|R_i|$ marks at position $\ell$. Hence, the total number of marks at position $\ell$ is at least $\sum_{i \in I} |R_i| = m_R - \sum_{i \notin I} |R_i|$. It remains to bound the term $\sum_{i \notin I} |R_i|$. Using the definition of $I$, we obtain
\[ \sum_{i \notin I} |R_i| = \sum_{i \notin I} \frac{4mk}{4mk} \cdot |R_i| = \frac{m}{4k} \cdot \sum_{i \notin I} (4k/m \cdot |R_i|) < \frac{m}{4k} \cdot \sum_{i \notin I} k_i' \le \frac{m}{4k} \cdot \sum_{i=1}^{r} k_i' \le \frac{m}{4}, \]
where the last bound holds as, in total, all repetitive regions incur at most $\sum_{i=1}^{r} k_i' \le k$ mismatches (since all repetitive regions are pairwise disjoint). Hence, the number of marks placed at position $\ell$ is at least $m_R - m/4$.
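The marking arguments above share one pattern: place marks at positions of $T$, then show that every $k$-mismatch occurrence collects many of them. The following Python sketch illustrates the simpler variant used for breaks (Claim III.10) by brute force. The names `break_marks` and `occ_k` and the toy strings are assumptions for illustration. The pigeonhole step works for any $2k$ disjoint fragments of $P$; the period condition on breaks is only needed to bound the total number of marks.

```python
def hamming(s, t):
    """Hamming distance of two equal-length strings."""
    return sum(a != b for a, b in zip(s, t))


def occ_k(p, t, k):
    """Starting positions of all k-mismatch occurrences of p in t (brute force)."""
    m = len(p)
    return [i for i in range(len(t) - m + 1) if hamming(p, t[i:i + m]) <= k]


def break_marks(p, t, breaks):
    """marks[j] = number of fragments (b, length) of p occurring exactly at j + b in t."""
    marks = [0] * (len(t) - len(p) + 1)
    for b, length in breaks:
        fragment = p[b:b + length]
        for j in range(len(marks)):
            if t[j + b:j + b + length] == fragment:
                marks[j] += 1
    return marks


if __name__ == "__main__":
    k = 1
    pattern = "abcdefgh"
    text = "zabcdefghzabcdefgzz"
    breaks = [(0, 4), (4, 4)]  # 2k disjoint fragments of the pattern
    marks = break_marks(pattern, text, breaks)
    for pos in occ_k(pattern, text, k):
        # Each k-mismatch occurrence misses at most k of the 2k fragments.
        print(pos, marks[pos], marks[pos] >= k)
```

Dividing the total number of marks by the minimum number of marks per occurrence then bounds the number of occurrences, exactly as in the proofs of Lemmas III.8 and III.11.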

In total, by Claims III.12 and III.13, the number of $k$-mismatch occurrences of $P$ in $T$ is at most
\[ |\mathrm{Occ}^H_k(P,T)| \le \frac{192 \cdot n/m \cdot k \cdot m_R}{m_R - m/4} = \frac{192 \cdot n/m \cdot k}{1 - m/(4 m_R)}. \]
As this bound is a decreasing function in $m_R$, the assumption $m_R \ge 3/8 \cdot m$ yields the upper bound
\[ |\mathrm{Occ}^H_k(P,T)| \le \frac{192 \cdot n/m \cdot k \cdot 3/8 \cdot m}{3/8 \cdot m - m/4} = 576 \cdot n/m \cdot k, \]
completing the proof.

Finally, we consider the case that $P$ is approximately periodic, but not too close to the periodic string in scope.

Lemma III.14. Let $P$ denote a string of length $m$, let $T$ denote a string of length $n$, and let $k \in [1..m]$ denote a threshold. If there is a primitive string $Q$ of length at most $|Q| \le m/128k$ that satisfies $2k \le \delta_H(P,Q^*) \le 8k$, then $|\mathrm{Occ}^H_k(P,T)| \le 96 \cdot n/m \cdot k$.

Proof: We apply Corollary III.5 with $d = \delta_H(P,Q^*)$. As $2k \le d \le 8k$ yields $|Q| \le m/128k \le m/8d$, the assumptions of Corollary III.5 are met. Consequently, $|\mathrm{Occ}^H_k(P,T)| \le 12 \cdot n/m \cdot d \le 96 \cdot n/m \cdot k$.

Gathering Lemmas III.6, III.8, III.11, and III.14, we are now ready to prove Theorem III.1.

Proof of Theorem III.1: We apply Lemma III.6 on the string $P$ and proceed depending on the structure found in $P$.
If the string $P$ contains $2k$ disjoint breaks $B_1, \ldots, B_{2k}$ (in the sense of Lemma III.6), we apply Lemma III.8 and obtain that $|\mathrm{Occ}^H_k(P,T)| \le 256 \cdot n/m \cdot k$.
If the string $P$ contains $r$ disjoint repetitive regions $R_1, \ldots, R_r$ (again, in the sense of Lemma III.6), we apply Lemma III.11 and obtain that $|\mathrm{Occ}^H_k(P,T)| \le 576 \cdot n/m \cdot k$.
Otherwise, Lemma III.6 guarantees that there is a primitive string $Q$ of length at most $|Q| \le m/128k$ that satisfies $\delta_H(P,Q^*) < 8k$. If $\delta_H(P,Q^*) \ge 2k$, then Lemma III.14 yields $|\mathrm{Occ}^H_k(P,T)| \le 96 \cdot n/m \cdot k$. If, however, $\delta_H(P,Q^*) < 2k$, then we are in the second alternative of the theorem statement.

ACKNOWLEDGMENTS

P. Charalampopoulos was partially supported by ERC grant TOTAL under the European Union’s Horizon 2020 Research and Innovation Programme (agreement no. 677651). T. Kociumaka was supported by ISF grants no. 1278/16 and 1926/19, by a BSF grant no. 2018364, and by an ERC grant MPM under the EU’s Horizon 2020 Research and Innovation Programme (agreement no. 683064).

REFERENCES

[1] A. Abboud, A. Backurs, K. Bringmann, and M. Künnemann, “Fine-grained complexity of analyzing compressed data: Quantifying improvements over decompress-and-solve,” in 58th Annual IEEE Symposium on Foundations of Computer Science, FOCS 2017, C. Umans, Ed. IEEE Computer Society, 2017, pp. 192–203.
[2] K. R. Abrahamson, “Generalized string matching,” SIAM Journal on Computing, vol. 16, no. 6, pp. 1039–1051, 1987.
[3] S. Alstrup, G. S. Brodal, and T. Rauhe, “Pattern matching in dynamic texts,” in 11th Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2000, D. B. Shmoys, Ed. SIAM, 2000, pp. 819–828.
[4] A. Amir, M. Lewenstein, and E. Porat, “Faster algorithms for string matching with k mismatches,” Journal of Algorithms, vol. 50, no. 2, pp. 257–275, 2004.
[5] A. Backurs and P. Indyk, “Edit distance cannot be computed in strongly subquadratic time (unless SETH is false),” SIAM Journal on Computing, vol. 47, no. 3, pp. 1087–1097, 2018.
[6] P. Bille, G. M. Landau, R. Raman, K. Sadakane, S. R. Satti, and O. Weimann, “Random Access to Grammar-Compressed Strings and Trees,” SIAM Journal on Computing, vol. 44, no. 3, pp. 513–539, 2015.
[7] D. Breslauer and Z. Galil, “Finding all periods and initial palindromes of a string in parallel,” Algorithmica, vol. 14, no. 4, pp. 355–366, 1995.
[8] K. Bringmann, M. Künnemann, and P. Wellnitz, “Few Matches or Almost Periodicity: Faster Pattern Matching with Mismatches in Compressed Texts,” in 30th Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2019, T. M. Chan, Ed. SIAM, 2019, pp. 1126–1145.
[9] M. Burrows and D. J. Wheeler, “A block-sorting lossless data compression algorithm,” Digital Equipment Corporation, Palo Alto, California, Tech. Rep. 124, 1994.
[10] T. M. Chan, S. Golan, T. Kociumaka, T. Kopelowitz, and E. Porat, “Approximating text-to-pattern hamming distances,” in 52nd Annual ACM Symposium on Theory of Computing, STOC 2020, J. Chuzhoy, Ed. ACM, 2020, pp. 643–656.
[11] R. Clifford, A. Fontaine, E. Porat, B. Sach, and T. Starikovskaya, “The k-mismatch problem revisited,” in 27th Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2016, R. Krauthgamer, Ed. SIAM, 2016, pp. 2039–2052.
[12] R. Clifford, A. Grønlund, K. G. Larsen, and T. A. Starikovskaya, “Upper and lower bounds for dynamic data structures on strings,” in 35th Symposium on Theoretical Aspects of Computer Science, STACS 2018, R. Niedermeier and B. Vallée, Eds. Schloss Dagstuhl–Leibniz-Zentrum für Informatik, 2018, pp. 22:1–22:14.
[13] R. Cole and R. Hariharan, “Approximate String Matching: A Simpler Faster Algorithm,” SIAM Journal on Computing, vol. 31, no. 6, pp. 1761–1782, 2002.
[14] P. Ferragina and R. Grossi, “Optimal on-line search and sublinear time update in string matching,” SIAM Journal on Computing, vol. 27, no. 3, pp. 713–736, 1998.
[15] Z. Galil and R. Giancarlo, “Improved string matching with k mismatches,” SIGACT News, vol. 17, no. 4, pp. 52–54, 1986.

[16] P. Gawrychowski and D. Straszak, “Beating O(nm) in approximate LZW-compressed pattern matching,” in 24th International Symposium on Algorithms and Computation, ISAAC 2013, L. Cai, S. Cheng, and T. W. Lam, Eds. Springer, 2013, pp. 78–88.
[17] P. Gawrychowski and P. Uznański, “Towards unified approximate pattern matching for Hamming and L1 distance,” in 45th International Colloquium on Automata, Languages, and Programming, ICALP 2018, I. Chatzigiannakis, C. Kaklamanis, D. Marx, and D. Sannella, Eds. Schloss Dagstuhl–Leibniz-Zentrum für Informatik, 2018, pp. 62:1–62:13.
[18] P. Gawrychowski, A. Karczmarz, T. Kociumaka, J. Łącki, and P. Sankowski, “Optimal dynamic strings,” in 29th Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2018, A. Czumaj, Ed. SIAM, 2018, pp. 1509–1528.
[19] M. Gu, M. Farach, and R. Beigel, “An efficient algorithm for dynamic text indexing,” in 5th Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 1994, D. D. Sleator, Ed. ACM/SIAM, 1994, pp. 697–704.
[20] M. Henzinger, S. Krinninger, D. Nanongkai, and T. Saranurak, “Unifying and strengthening hardness for dynamic problems via the online matrix-vector multiplication conjecture,” in 47th Annual ACM on Symposium on Theory of Computing, STOC 2015, R. Rubinfeld, Ed. ACM, 2015, pp. 21–30.
[21] T. I, “Longest common extensions with recompression,” in 28th Annual Symposium on Combinatorial Pattern Matching, CPM 2017, J. Kärkkäinen, J. Radoszewski, and W. Rytter, Eds. Schloss Dagstuhl–Leibniz-Zentrum für Informatik, 2017, pp. 18:1–18:15.
[22] A. Jeż, “Faster fully compressed pattern matching by recompression,” ACM Transactions on Algorithms, vol. 11, no. 3, pp. 20:1–20:43, 2015.
[23] A. Jeż, “Recompression: A simple and powerful technique for word equations,” Journal of the ACM, vol. 63, no. 1, pp. 4:1–4:51, 2016.
[24] D. Kempa and T. Kociumaka, “Resolution of the Burrows–Wheeler transform conjecture,” in 61st Annual IEEE Symposium on Foundations of Computer Science, FOCS 2020, S. Irani, Ed. IEEE Computer Society, 2020.
[25] D. Kempa and N. Prezza, “At the roots of dictionary compression: string attractors,” in 50th Annual ACM Symposium on Theory of Computing, STOC 2018, M. Henzinger, Ed. ACM, 2018, pp. 827–840.
[26] S. Kosaraju, “Efficient string matching,” 1987, manuscript.
[27] G. M. Landau and U. Vishkin, “Efficient string matching with k mismatches,” Theoretical Computer Science, vol. 43, pp. 239–249, 1986.
[28] G. M. Landau and U. Vishkin, “Fast parallel and serial approximate string matching,” Journal of Algorithms, vol. 10, no. 2, pp. 157–169, 1989.
[29] N. Larsson and A. Moffat, “Off-line dictionary-based compression,” Proceedings of the IEEE, vol. 88, no. 11, pp. 1722–1732, 2000.
[30] M. Lohrey, “Algorithmics on SLP-compressed strings: A survey,” Groups Complexity Cryptology, vol. 4, no. 2, pp. 241–299, 2012.
[31] K. Mehlhorn, R. Sundar, and C. Uhrig, “Maintaining Dynamic Sequences under Equality Tests in Polylogarithmic Time,” Algorithmica, vol. 17, no. 2, pp. 183–198, 1997.
[32] C. G. Nevill-Manning and I. H. Witten, “Compression and explanation using hierarchical grammars,” The Computer Journal, vol. 40, no. 2 and 3, pp. 103–116, 1997.
[33] T. Nishimoto, T. I, S. Inenaga, H. Bannai, and M. Takeda, “Dynamic index and LZ factorization in compressed space,” Discrete Applied Mathematics, vol. 274, pp. 116–129, 2020.
[34] R. Radicioni and A. Bertoni, “Grammatical compression: compressed equivalence and other problems,” Discrete Mathematics and Theoretical Computer Science, vol. 12, no. 4, p. 109, 2010.
[35] W. Rytter, “Application of Lempel-Ziv factorization to the approximation of grammar-based compression,” Theoretical Computer Science, vol. 302, no. 1-3, pp. 211–222, 2003.
[36] W. Rytter, “Grammar compression, LZ-encodings, and string algorithms with implicit input,” in 31st International Colloquium on Automata, Languages, and Programming, ICALP 2004, J. Díaz, J. Karhumäki, A. Lepistö, and D. Sannella, Eds. Springer, 2004, pp. 15–27.
[37] S. C. Sahinalp and U. Vishkin, “Efficient approximate and dynamic matching of patterns using a labeling paradigm (extended abstract),” in 37th Annual IEEE Symposium on Foundations of Computer Science, FOCS 1996, M. Tompa, Ed. IEEE Computer Society, 1996, pp. 320–328.
[38] H. Sakamoto, “Grammar compression: Grammatical inference by compression and its application to real data,” in 12th International Conference on Grammatical Inference, ICGI 2014. JMLR.org, 2014, pp. 3–20.
[39] Y. Shibata, T. Kida, S. Fukamachi, M. Takeda, A. Shinohara, T. Shinohara, and S. Arikawa, “Byte pair encoding: A text compression scheme that accelerates pattern matching,” Technical Report DOI-TR-161, Department of Informatics, Kyushu University, Tech. Rep., 1999.
[40] R. Sundar and R. E. Tarjan, “Unique binary-search-tree representations and equality testing of sets and sequences,” SIAM Journal on Computing, vol. 23, no. 1, pp. 24–44, 1994.
[41] A. Tiskin, “Threshold approximate matching in grammar-compressed strings,” in Prague Stringology Conference, PSC 2014, J. Holub and J. Žďárek, Eds., 2014, pp. 124–138.
[42] T. Welch, “A technique for high-performance data compression,” Computer, vol. 17, pp. 8–19, 1984.
[43] J. Ziv and A. Lempel, “A universal algorithm for sequential data compression,” IEEE Transactions on Information Theory, vol. 23, no. 3, pp. 337–343, 1977.
[44] J. Ziv and A. Lempel, “Compression of individual sequences via variable-rate coding,” IEEE Transactions on Information Theory, vol. 24, no. 5, pp. 530–536, 1978.
