Faster Approximate Pattern Matching: a Unified Approach


2020 IEEE 61st Annual Symposium on Foundations of Computer Science (FOCS)

Panagiotis Charalampopoulos (Department of Informatics, King's College London, UK; Institute of Informatics, University of Warsaw, Poland), [email protected]
Tomasz Kociumaka (Department of Computer Science, Bar-Ilan University, Ramat Gan, Israel), [email protected]
Philip Wellnitz (Max Planck Institute for Informatics, Saarland Informatics Campus, Saarbrücken, Germany), [email protected]

Abstract—In the approximate pattern matching problem, given a text T, a pattern P, and a threshold k, the task is to find (the starting positions of) all substrings of T that are at distance at most k from P. We consider the two most fundamental string metrics: Under the Hamming distance, we search for substrings of T that have at most k mismatches with P, while under the edit distance, we search for substrings of T that can be transformed to P with at most k edits.

Exact occurrences of P in T have a very simple structure: If we assume for simplicity that |P| < |T| ≤ 3/2·|P| and that P occurs both as a prefix and as a suffix of T, then both P and T are periodic with a common period. However, an analogous characterization for occurrences with up to k mismatches was proved only recently by Bringmann et al. [SODA'19]: Either there are O(k²) k-mismatch occurrences of P in T, or both P and T are at Hamming distance O(k) from strings with a common string period of length O(m/k). We tighten this characterization by showing that there are O(k) k-mismatch occurrences in the non-periodic case, and we lift it to the edit distance setting, where we tightly bound the number of k-edit occurrences by O(k²) in the non-periodic case. Our proofs are constructive and let us obtain a unified framework for approximate pattern matching for both considered distances. In particular, we provide meta-algorithms that only rely on a small set of primitive operations. We showcase the generality of our meta-algorithms with results for the fully compressed setting, the dynamic setting, and the standard setting.

Keywords—approximate pattern matching, grammar compression, dynamic strings, Hamming distance, edit distance

I. INTRODUCTION

The pattern matching problem, asking to search for occurrences of a given pattern P in a given text T, is perhaps the most fundamental problem on strings. However, in most applications, finding all exact occurrences of a pattern is not enough: Think of human spelling mistakes or DNA sequencing errors, for example. In this work, we focus on approximate pattern matching, where we are interested in finding substrings of the text that are "similar" to the pattern. While various similarity measures are imaginable, we focus on the two most commonly encountered metrics in this context: the Hamming distance and the edit distance.

Hamming Distance: Recall that the Hamming distance of two (equal-length) strings is the number of positions where the strings differ. Now, given a text T of length n, a pattern P of length m, and an integer threshold k > 0, we want to compute the k-mismatch occurrences of P in T, that is, all length-m substrings of T that are at Hamming distance at most k from P. This pattern matching with mismatches problem has been extensively studied. In the late 1980s, Abrahamson [2] and Kosaraju [26] independently proposed an FFT-based O(n√(m log m))-time algorithm for computing the Hamming distance of P and all the length-m fragments of T. While their algorithms can be used to solve the pattern matching with mismatches problem, the first algorithm to benefit from the threshold k was given by Landau and Vishkin [27] and slightly improved by Galil and Giancarlo [15]: Based on so-called "kangaroo jumping", they obtained an O(nk)-time algorithm, which is faster than O(n√(m log m)) even for moderately large k. Amir et al. [4] developed two algorithms with running times O(n√(k log k)) and Õ(n + k³n/m), respectively; the latter algorithm was then improved upon by Clifford et al. [11], who presented an Õ(n + k²n/m)-time solution. Subsequently, Gawrychowski and Uznański [17] provided a smooth trade-off between the running times Õ(n√k) and Õ(n + k²n/m) by designing an Õ(n + kn/√m)-time algorithm. Very recently, Chan et al. [10] removed most of the polylog n factors in the latter solution at the cost of (Monte-Carlo) randomization. Furthermore, Gawrychowski and Uznański [17] showed that a significantly faster "combinatorial" algorithm would have (unexpected) consequences for the complexity of Boolean matrix multiplication. Pattern matching with mismatches on strings is thus well understood in the standard setting. Nevertheless, in the settings where the strings are not given explicitly, a similar understanding is yet to be obtained. One of the main contributions of this work is to improve the upper bounds for two such settings, obtaining algorithms with running times analogous to the algorithm of [11].

Edit Distance: Recall that for two strings S and T, the edit distance (also known as Levenshtein distance) is the minimum number of edits required to transform S into T. Here, an edit is an insertion, a substitution, or a deletion of a single character. In the pattern matching with edits problem, we are given a text T, a pattern P, and an integer threshold k > 0, and the task is to find all starting positions of the k-edit (or k-error) occurrences of P in T. Formally, we are to find all positions i in T such that the edit distance between T[i..j] and P is at most k for some position j. Again, a classic algorithm by Landau and Vishkin [28] runs in O(nk) time. Subsequent research [37], [13] resulted in an O(n + k⁴n/m)-time algorithm (which is faster for k ≤ ∛m). From a lower-bound perspective, we can benefit from the discovery that the classic quadratic-time algorithm for computing the edit distance of two strings is essentially optimal: Backurs and Indyk [5] recently proved that a significantly faster algorithm would yield a major breakthrough for the satisfiability problem. For pattern matching with edits, this means that there is no hope for an algorithm that is significantly faster than O(n + k²n/m); however, apart from that "trivial" lower bound and the 20-year-old conjecture of Cole and Hariharan [13] that an O(n + k³n/m)-time algorithm should be possible, nothing is known that would close this gap. While we do not manage to tighten this gap, we do believe that the structural insights we obtain may be useful for doing so. What we do manage, however, is to significantly improve the running time of the known algorithms in two settings where T and P are not given explicitly, thereby obtaining running times that can be seen as analogous to the running time of Cole and Hariharan's algorithm [13].

Grammar Compression: One of the settings that we consider in this paper is the fully compressed setting, when both the text T and the pattern P are given as straight-line programs. Compressing the text and the pattern is, in general, a natural thing to do—think of huge natural-language texts, for example. A common framework that captures many compression schemes represents strings as straight-line programs. We refer an interested reader to [36], [34], [30], [38] to learn more about grammar compression.

Working directly with a compressed representation of a text, intuitively at least, seems to be hard—in fact, Abboud et al. [1] showed that, for some problems, decompress-and-solve is the best we can hope for, under some reasonable assumptions from fine-grained complexity theory. Nevertheless, Jeż [22] managed to prove that exact pattern matching can be solved on grammar-compressed strings in near-linear time: Given an SLP of size n representing a string T and an SLP of size m representing a string P, we can find all exact occurrences of P in T in O((n + m) log |P|) time. For fully compressed approximate pattern matching, no such near-linear time algorithm is known, though. While the Õ((n + |P|)k⁴)-time algorithm by Bringmann et al. [8] for pattern matching with mismatches comes close, it works in an easier setting where only the text is compressed. We fill this void by providing the first algorithm for fully compressed pattern matching with mismatches that runs in near-linear time. Denote by Occ_k^H(P, T) the set of (starting positions of) k-mismatch occurrences of P in T; then, our result reads as follows.

Theorem I.1 (♠). Let G_T denote an SLP of size n generating a text T, let G_P denote an SLP of size m generating a pattern P, let k denote a threshold, and set N := |T| + |P|. Then, we can compute |Occ_k^H(P, T)| in time O(m log N + nk² log³ N). The elements of Occ_k^H(P, T) can be reported within O(|Occ_k^H(P, T)|) extra time.

For pattern matching with edits, near-linear time algorithms are not known even in the case that the pattern is given explicitly. Currently, the best pattern matching algorithms on an SLP-compressed text run in time O(n|P| log |P|) [41] and O(n(min{|P|k, k⁴ + |P|} + log |T|)) [6].

A full version of this paper is available at arxiv.org/abs/2004.08350. Proofs of the claims marked with ♠ are presented only in the full version.

2575-8454/20/$31.00 ©2020 IEEE. DOI: 10.1109/FOCS46700.2020.00095
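To make the problem statement concrete, the following is a minimal brute-force sketch of pattern matching with mismatches (it is not an algorithm from this paper): it compares P against every length-m window of T and reports the windows with at most k mismatches, in O(nm) time.

```python
def k_mismatch_occurrences(text: str, pattern: str, k: int) -> list[int]:
    """Brute-force k-mismatch search in O(|text| * |pattern|) time."""
    n, m = len(text), len(pattern)
    occ = []
    for i in range(n - m + 1):
        # Hamming distance between pattern and the window text[i:i+m]
        mismatches = sum(a != b for a, b in zip(text[i:i + m], pattern))
        if mismatches <= k:
            occ.append(i)
    return occ

print(k_mismatch_occurrences("abcabcabca", "abcd", 1))  # → [0, 3, 6]
```

The FFT-based and threshold-sensitive algorithms cited above improve on exactly this quadratic baseline.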
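The "kangaroo jumping" idea behind the O(nk) algorithm of Landau and Vishkin can be sketched as follows. To verify one alignment, one repeatedly jumps over the longest run of matching characters (a longest-common-extension, or LCE, query) and charges one mismatch per jump; an alignment is rejected after k+1 jumps. In the real algorithm, each LCE query takes O(1) time via a suffix tree with lowest-common-ancestor preprocessing; the naive LCE scan below is only for illustration, so this sketch shows the jumping structure but not the O(nk) bound.

```python
def lce(text, pattern, i, j):
    """Naive longest common extension of text[i:] and pattern[j:].
    (Landau-Vishkin answer such queries in O(1) after suffix-tree
    preprocessing; this linear scan is for illustration only.)"""
    l = 0
    while i + l < len(text) and j + l < len(pattern) and text[i + l] == pattern[j + l]:
        l += 1
    return l

def verify_alignment(text, pattern, i, k):
    """Kangaroo jumping: does pattern occur at position i of text with
    at most k mismatches?  Uses at most k+1 LCE jumps."""
    j, mismatches = 0, 0
    while j < len(pattern):
        j += lce(text, pattern, i + j, j)   # jump over the matching run
        if j < len(pattern):                # text[i+j] != pattern[j]
            mismatches += 1
            if mismatches > k:
                return False
            j += 1                           # skip the mismatching position
    return True

print([i for i in range(7) if verify_alignment("abcabcabca", "abcd", i, 1)])  # → [0, 3, 6]
```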
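For the edit distance setting, the textbook O(nm) dynamic program (due to Sellers) makes the definition of k-edit occurrences concrete: initializing the row for the empty pattern prefix to all zeros lets an occurrence start anywhere, and the positions j whose final DP value is at most k are exactly the end positions of k-edit occurrences. This is a baseline sketch, not the O(nk) method of Landau and Vishkin [28].

```python
def k_edit_end_positions(text, pattern, k):
    """O(|text| * |pattern|) DP for approximate matching: return all
    positions j such that some substring of text ending at j has edit
    distance at most k from pattern."""
    n, m = len(text), len(pattern)
    prev = [0] * (n + 1)                 # empty pattern prefix: cost 0 everywhere
    for i in range(1, m + 1):
        cur = [i] + [0] * n              # deleting i pattern characters
        for j in range(1, n + 1):
            sub = prev[j - 1] + (pattern[i - 1] != text[j - 1])
            cur[j] = min(sub,            # substitution or match
                         prev[j] + 1,    # delete pattern[i-1]
                         cur[j - 1] + 1) # insert text[j-1]
        prev = cur
    return [j for j in range(n + 1) if prev[j] <= k]

print(k_edit_end_positions("imissmiss", "miss", 0))  # → [5, 9]
```

Reporting the starting positions i, as in the problem statement above, additionally requires tracing back through the DP table (or running the DP on reversed strings).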
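For readers unfamiliar with straight-line programs: an SLP is a grammar in which every rule produces either a single character or the concatenation of two earlier rules, so each rule generates exactly one string, and a grammar of size n can generate a string of length exponential in n. The toy SLP below (generating a Fibonacci string; the encoding is illustrative, not from the paper) also shows that basic quantities such as generated lengths are computable directly from the grammar, without decompression.

```python
from functools import lru_cache

# A straight-line program: each rule is either a single character or the
# concatenation of two earlier rules (referenced by index).
rules = [
    "b",        # 0 generates "b"
    "a",        # 1 generates "a"
    (1, 0),     # 2 generates "ab"
    (2, 1),     # 3 generates "aba"
    (3, 2),     # 4 generates "abaab"
    (4, 3),     # 5 generates "abaababa"
]

def expand(i):
    """Decompress rule i (exponential output length in the worst case)."""
    r = rules[i]
    return r if isinstance(r, str) else expand(r[0]) + expand(r[1])

@lru_cache(maxsize=None)
def length(i):
    """Length of the string generated by rule i, in O(grammar size) total."""
    r = rules[i]
    return 1 if isinstance(r, str) else length(r[0]) + length(r[1])

print(expand(5))  # → abaababa
print(length(5))  # → 8
```

Algorithms in the fully compressed setting, such as Theorem I.1, must work with such grammar-level information, since expanding T or P may be exponentially expensive.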