arXiv:1809.02517v3 [cs.DS] 20 Jun 2021

Streaming dictionary matching with mismatches

Paweł Gawrychowski¹ and Tatiana Starikovskaya²

¹ University of Wrocław, 50-137 Wrocław, Poland, [email protected]
² DI/ENS, PSL Research University, Paris, France, [email protected]

Abstract

In the k-mismatch problem we are given a pattern of length n and a text, and must find all substrings of the text that are within Hamming distance k from the pattern. A series of recent breakthroughs has resulted in an ultra-efficient streaming algorithm for this problem that requires only O(k log n/k) space and O(log n/k (√(k log k) + log² n)) time per letter [Clifford, Kociumaka, Porat, SODA 2019]. In this work, we consider a strictly harder problem called dictionary matching with k mismatches, formalised as follows: given a dictionary of d patterns, where the length of each pattern is at most n, and a text, find all locations where the Hamming distance between the text and one of the patterns is at most k. We develop a streaming algorithm for this problem that uses O(kd log^k d polylog n) space and O(k log^k d polylog n + |output|) time per position of the text. The algorithm is randomised and outputs correct answers with high probability. On the lower bound side, we show that any streaming algorithm for dictionary matching with k mismatches requires Ω(kd) bits of space.

P. Gawrychowski was partially supported by the Bekker programme of the Polish National Agency for Academic Exchange (PPN/BEK/2020/1/00444). T. Starikovskaya was partially supported by the grant ANR-20-CE48-0001 from the French National Research Agency (ANR). This is a full and extended version of the conference paper [19].

1 Introduction

In the fundamental dictionary matching problem, we are given a dictionary of patterns and a text, and must find all the substrings of the text equal to one of the patterns (such substrings are called occurrences of the patterns). The classical algorithm for dictionary matching is the one by Aho and Corasick [1]. For a dictionary of d patterns of length at most n, the Aho–Corasick algorithm uses O(nd) space and O(1 + |output|) amortised time per letter for constant-size alphabets, where output is the set of occurrences of the patterns that end at this position. Apart from the Aho–Corasick algorithm, other word-RAM algorithms for exact dictionary matching include [3, 4, 7, 13, 14, 17, 21, 25, 28, 29, 35].

However, in many applications one is interested in substrings of the text that are close to but not necessarily equal to one of the patterns. This task can be naturally formalised as dictionary matching with mismatches or with errors, where the distance is either the Hamming or the edit distance. In this work, we focus on the Hamming distance and refer to this problem as dictionary matching with k mismatches.

We give a brief survey of existing solutions in the word-RAM model, ignoring the special case of k = 1 that relies on very different techniques. The case of d = 1, the k-mismatch problem, was considered in [2, 10, 20, 32]. The latest algorithm [20] uses Õ(√k) amortised time per letter for constant-size alphabets. These algorithms can be generalised to d patterns by running d instances of the algorithm in parallel. One can also reduce the problem to dictionary look-up with k mismatches or to text indexing with k mismatches. In the former problem, the task is to preprocess the dictionary of patterns into a data structure that supports the following queries fast: given a string Q, find all patterns in the dictionary within distance k from Q. In the latter, the task is to preprocess the text so that, given a pattern, one must
Agency Research National French the from 0001 kd n a osdrdi 2 0 0 2.Telatest The 32]. 20, 10, [2, in considered was 1 = d ) meo h oihNtoa gnyfrAaei Ex- Academic for Agency National Polish the of amme iso space. of bits ) /n atrs hr h egho ahptenis pattern each of length the where patterns, loih o hspolmta eursonly requires that problem this for algorithm mrie ieprlte o constant-size for letter per time amortised ) e eso htaysraigalgorithm streaming any that show we de, sta n tti oiin pr rmthe from Apart position. this at end that ns e with lem Q [email protected] n l atrsi h itoaywithin dictionary the in patterns all find , lddcinr acigwith matching dictionary lled c usrnsaecalled are substrings uch n eti tmost at is text lsda olw:gvnadcinr of dictionary a given follows: as alised oayo atrsadatx,admust and text, a and patterns of ionary oaymthn nld 3 ,7 3 14, 13, 7, 4, [3, include matching ionary n yAoadCrsc 1.Fra For [1]. Corasick and Aho by one e [email protected] n etadms n l loca- all find must and text a and sac.I hswr,w ou nthe on focus we work, this In istance. nd O pc and space ) ( itoayo atrsit data a into patterns of dictionary kd k k mismatches x htaecoet u not but to close are that ext log gdistance ng igteseilcs of case special the ring imthso etindexing text or mismatches d k d ntne ftealgorithm the of instances 2 k polylog eiso recent of series A . O . 1+ (1 k n rmoeof one from pc and space ) ∗† doutputs nd occurrences | k output rmoeof one from k mis- tento attern | time ) k 1 = of be able to report all substrings of the text within k from the pattern efficiently. These problems were considered in [12, 16, 18, 26, 31, 34]. However, all of the above algorithms must at least store the dictionary in full, which in the worst case requires Ω(nd) bits of space. In this work, we focus on the streaming model of computation that was designed to overcome this restriction and allows developing particularly efficient algorithms. In the streaming model, we assume that the text arrives as a , one letter at a time. The space complexity of an algorithm is defined to be all the space used, including the space we need for storing the information about the pattern(s) and the text. The of an algorithm is defined to be the time we spend to process one letter of the text. The streaming model of computation aims for algorithms that use as little space and time as possible. All streaming algorithms we discuss in this paper are randomised and output correct answers with high probability1. Throughout the paper, we assume that the length of the text is (n). If the text is longer, we can partition it into overlapping blocks of length (n) and process each blockO independently. The first sublinear-space streaming algorithmO for the dictionary matching problem with d = 1 was suggested by Porat and Porat [33]. For a pattern of length n, their algorithm uses (log n) space and (log n) time per letter. Later, Breslauer and Galil gave a (log n)-space and (1)-time algorithmO [8]. For arbitraryO d, Clifford et al. [9] showed a streaming algorithm thatO uses (d log n)O space and (log log(n + d)+ output ) time per letter. Golan and Porat [24] showed an improvedO algorithm that usesO the same amount| of space| and (1 + output ) time per letter for constant-size alphabets. DictionaryO | matching| with k mismatches has been mainly studied for d = 1. The first algorithm was shown by Porat and Porat [33] by reduction to exact dictionary matching. 
The algorithm uses O(k³ log⁷ n / log log n) space and O(k² log⁵ n / log log n) time per letter. The complexity has been subsequently improved in [10, 11, 23]. The current best algorithm uses only O(k log n/k) space and O(log n/k (√(k log k) + log² n)) time per letter [11]. Golan et al. [22] studied space-time trade-offs for this problem. For d > 1, one can obtain the following result by a repeated application of the algorithm for d = 1 [11]:

Corollary 1. For any k ≥ 1, there is a randomised streaming algorithm for dictionary matching with k mismatches that uses Õ(dk) space and Õ(d√k) time per letter². The algorithm has two-sided error and outputs correct answers with high probability.

1.1 Our results

In this work, we consider the problem of streaming dictionary matching with k mismatches for arbitrary d > 1. As can be seen, the time complexity of Corollary 1 depends on d linearly, which is prohibitive for applications where the stream letters arrive at a high speed and the size of the dictionary is large, up to several thousands of patterns, as we must be able to process each letter before the next one arrives to benefit from the space advantages of streaming algorithms. In this work, we show an algorithm that uses Õ(kd log^k d) space and Õ(k log^k d + |output|) time per letter, assuming a polynomial-size alphabet (Theorem 4). Our algorithm makes use of a new randomised variant of the k-errata tree (Section 3), a famous data structure of Cole, Gottlieb, and Lewenstein for dictionary matching with k mismatches [12]. This variant of the k-errata tree allows us to improve both the query time and the space requirements and can be considered as a generalisation of the z-fast tries [5, 6], which have proved to be useful in many streaming applications. We also show that any streaming algorithm for dictionary matching with k mismatches requires Ω(kd) bits of space (Lemma 12). This lower bound implies that for constant values of k our algorithm is optimal up to polylogarithmic factors.

¹ With high probability means with probability at least 1 − 1/n^c for any predefined constant c > 1.
² Hereafter, Õ(·) hides a multiplicative factor polynomial in log n.


Figure 1: The trie (left) and the compact trie (right) for the dictionary {aa, aabc, bac, bbb}. The nodes of the trie that we delete are shown by circles; marked nodes (nodes labelled by the dictionary patterns) are black.

2 Preliminaries

In this section, we give the definitions of strings, tries, and two hash functions that we use throughout the paper: Karp–Rabin fingerprints [27] and sketches for the Hamming distance [11].

2.1 Strings and tries

We assume an integer alphabet {1, 2, …, σ} of size σ = n^{O(1)}. A string is a finite sequence of letters of the alphabet. For a string S = S[1]S[2]…S[m] we denote its length m by |S| and its substring S[i]S[i+1]…S[j], 1 ≤ i ≤ j ≤ m, by S[i, j]. If i = 1, the substring S[1, j] is referred to as a prefix of S. If j = m, S[i, m] is called a suffix of S. We say that a substring S[i, j] is an occurrence of a string X in S if S[i, j] = X, and a k-mismatch occurrence of X in S if the Hamming distance between S[i, j] and X is at most k. (Recall that the Hamming distance between two strings X, Y of equal lengths is defined as the number of mismatches between them, in other words, as the number of positions where X and Y differ.) The reverse of a string S = S[1]S[2]…S[m], denoted by S^R, is defined as S[m]S[m−1]…S[1]. We denote the concatenation of strings X, Y by X ∘ Y. We use the notation X^m for the concatenation of m copies of a string X.

A trie is a basic data structure that can be used to store a set of strings. A trie is a tree satisfying the following properties:

1. Each edge is labelled by a letter of the alphabet;
2. Each two edges outgoing from the same node are labelled by different letters;

3. For each string S in the set there is a node vS of the trie such that the concatenation of the labels of the edges from the root to vS is equal to S, and the concatenation of the labels in any path starting at the root of the trie is equal to a prefix of some string in the set.

The number of nodes in a trie is in the worst case linear in the total length of the strings. To improve the space requirements, we compactify the trie: namely, for every string S we mark the node vS, and then replace each maximal path of unmarked nodes of degree one with an edge labelled by the concatenation of the letters on the edges in the path. The result is called a compact trie (see Fig. 1 for an example). We call the nodes of the compact trie explicit nodes, and the nodes that were deleted during the construction implicit nodes. The number of the explicit nodes of the compact trie is linear in the number of strings it stores.
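For illustration, the construction above can be made concrete in a few lines. The following sketch (ours, not from the paper) builds the plain trie one letter per edge, marks the nodes v_S, and then contracts maximal chains of unmarked degree-one nodes into single labelled edges:

```python
# A minimal sketch of building a compact trie for a set of strings: build the
# plain trie, mark the nodes corresponding to whole strings, then contract
# maximal chains of unmarked degree-one nodes into single edges.

class Node:
    def __init__(self):
        self.children = {}   # first letter of edge label -> (label, child)
        self.marked = False  # end of a dictionary string

def build_compact_trie(strings):
    root = Node()
    for s in strings:                      # 1) plain trie, one letter per edge
        v = root
        for c in s:
            _, v = v.children.setdefault(c, (c, Node()))
        v.marked = True
    def compact(v):                        # 2) contract unmarked unary chains
        for c, (label, u) in list(v.children.items()):
            while len(u.children) == 1 and not u.marked:
                (lab2, w), = u.children.values()
                label, u = label + lab2, w
            v.children[c] = (label, u)
            compact(u)
        return v
    return compact(root)

# Example from Fig. 1: the dictionary {aa, aabc, bac, bbb}.
trie = build_compact_trie(["aa", "aabc", "bac", "bbb"])
```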

2.2 Fingerprints and sketches

Let us first give the definition of Karp–Rabin fingerprints that can be used to decide whether two strings are equal.

Definition 1 (Karp–Rabin fingerprints [27]). For a fixed prime p and r ∈ [0, p−1] chosen u.a.r., the Karp–Rabin fingerprint of a string X = X[1]X[2]…X[m] is defined as a quadruple Φ(X) = (ϕ(X), ϕ(X^R), r^m mod p, r^{−m} mod p), where ϕ(X) = Σ_{i=1}^{m} X[i]·r^{m−i} mod p and ϕ(X^R) = Σ_{i=1}^{m} X[i]·r^{i−1} mod p.

Note that this definition slightly differs from the standard one; in particular, given the Karp–Rabin fingerprint of X we can compute the Karp–Rabin fingerprint of X^R in O(1) time and space (by simply reversing the order of ϕ(X) and ϕ(X^R)).

Fact 1 ([33]). For r ∈ [0, p−1] chosen u.a.r., the probability of two distinct strings of equal lengths m ≤ n over the integer alphabet [0, p−1] to have equal Karp–Rabin fingerprints is at most n/p.

Below we assume that p is chosen sufficiently large to guarantee that the collision probability is inverse-polynomial in n.

Fact 2. We can construct one of the fingerprints Φ(X), Φ(Y), or Φ(X ∘ Y) given the other two in O(1) time and space.

We now remind the definition of k-mismatch sketches that can be used to decide whether two strings are at Hamming distance at most k.

Definition 2 (k-mismatch sketch [11]). For a fixed prime p and r ∈ [0, p−1] chosen u.a.r., the k-mismatch sketch sk_k(S) of a string S of length m is defined as a tuple (φ_0(S), …, φ_{2k}(S), φ′_0(S), …, φ′_k(S), Φ(S)), where φ_j(S) = Σ_{i=1}^{m} S[i]·i^j mod p and φ′_j(S) = Σ_{i=1}^{m} S[i]²·i^j mod p for j ≥ 0.

Lemma 1 ([11, Proposition 3.1]). Let X, Y be two strings of length m ≤ n. Given the sketches sk_k(X) and sk_k(Y), there is a randomised Õ(k)-time and O(k)-space algorithm that reports the set {(p, X[p], Y[p]) : X[p] ≠ Y[p]} if the Hamming distance between X and Y is at most k, and otherwise it simply outputs a message "The distance is larger than k.". The algorithm has two-sided error and outputs correct answers with high probability.
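For illustration, the following snippet (ours, not code from [27] or [11]) implements Definition 1 directly and checks the composition of Fact 2; the concrete prime and the names are our choices:

```python
# A small illustrative implementation of Definition 1 and Fact 2: the quadruple
# Phi(X) = (phi(X), phi(X^R), r^m mod p, r^(-m) mod p), and the O(1)-time
# composition of Phi(X o Y) from Phi(X) and Phi(Y).
import random

p = (1 << 61) - 1            # a fixed prime; the paper only needs p = poly(n)
r = random.randrange(1, p)   # chosen uniformly at random

def fingerprint(X):
    m = len(X)
    phi = sum(x * pow(r, m - 1 - i, p) for i, x in enumerate(X)) % p
    phi_rev = sum(x * pow(r, i, p) for i, x in enumerate(X)) % p
    return (phi, phi_rev, pow(r, m, p), pow(r, -m, p))

def concat(FX, FY):
    # phi(X o Y) = phi(X) * r^|Y| + phi(Y), and symmetrically for the reverses
    (px, pxr, rx, rxi), (py, pyr, ry, ryi) = FX, FY
    return ((px * ry + py) % p, (pyr * rx + pxr) % p, rx * ry % p, rxi * ryi % p)

X, Y = [1, 2, 3], [4, 5]
assert concat(fingerprint(X), fingerprint(Y)) == fingerprint(X + Y)
```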

Corollary 2 (of [11, Proposition 3.1]). We can construct one of the sketches sk_k(X), sk_k(Y), or sk_k(X ∘ Y) given the other two in Õ(k) time using O(k) space, provided that all concerned strings are over the alphabet [0, p−1] and are of length at most n. Furthermore, we can compute sk_k(X^m) in Õ(k) time and O(k) space as well under the same assumption.

3 The randomised k-errata tree

Cole, Gottlieb, and Lewenstein [12] introduced a data structure called the k-errata tree. The k-errata tree is a data structure that supports dictionary look-up with k mismatches queries: given a query string Q, find all patterns in the dictionary that are at Hamming distance at most k from Q or one of its prefixes³. In this section, we introduce a new data structure which we call the randomised k-errata tree. This data structure uses less space than the k-errata tree, has better query time (under a certain assumption, see below), and outputs the desired patterns correctly with high probability.

3.1 Reminder: the k-errata tree

For completeness, we first remind the definition of the k-errata tree and give an outline of the query algorithm. The k-errata tree is based on the heavy-path decomposition of a tree:

Definition 3. The heavy path of a tree T is the path that starts at the root of T and at each node v on the path descends to the child with the largest number of leaves in its subtree (heavy child), with ties broken arbitrarily. The heavy path decomposition is defined recursively, namely, it is defined to be a union of the heavy path of T and the heavy path decompositions of the off-path subtrees of the heavy path.

A well-known property of the heavy-path decomposition is that any root-to-leaf path crosses O(log |T|) heavy paths, where |T| is the number of nodes of T. This property is essential for the analysis of the space complexity and the query time of the k-errata tree.
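For illustration, a heavy-path decomposition can be computed in linear time; the sketch below (ours) assigns to every node the head of its heavy path, so that any root-to-leaf walk meets O(log |T|) distinct heads:

```python
# A compact sketch of a heavy-path decomposition of a tree given as children
# lists (Definition 3). Each node is assigned the head of its heavy path.

def heavy_path_decomposition(children, root=0):
    n = len(children)
    leaves = [0] * n
    order = []                       # nodes in a root-first DFS order
    stack = [root]
    while stack:
        v = stack.pop()
        order.append(v)
        stack.extend(children[v])
    for v in reversed(order):        # count leaves bottom-up
        leaves[v] = max(1, sum(leaves[c] for c in children[v]))
    head = [root] * n
    for v in order:                  # top-down: the heavy child keeps v's head
        if children[v]:
            heavy = max(children[v], key=lambda c: leaves[c])
            for c in children[v]:
                head[c] = head[v] if c == heavy else c
    return head

# Example: a path 0-1-2 with an off-path leaf 3 hanging from node 1.
print(heavy_path_decomposition([[1], [2, 3], [], []]))  # [0, 0, 0, 3]
```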

³ The query algorithm of [12] returns only those patterns that are within Hamming distance k from Q itself, but considering prefixes as well does not change the query time and is more suitable for our purposes. We explain the necessary modifications in Section 3.1.

Data structure. Consider a dictionary D of d patterns of maximal length m. We start with the compact trie T for the dictionary D, and decompose it into heavy paths. During the recursive step, we construct a number of new compact tries. For each heavy path H, and for each node u ∈ H, consider the off-path trees hanging from u. First, we create a vertical substitution trie for u. Let a be the first letter on the edge (u, v) ∈ H. Consider an off-path tree hanging from u, and let b ≠ a be the first letter on the edge from u to this tree. For each pattern in this off-path tree, we replace b with a. We consider a set of patterns obtained by such substitution for all off-path trees hanging from u and build a new compact trie for this set, which we call the vertical substitution trie. Next, we create horizontal substitution tries for the node u. We create a separate horizontal substitution trie for each off-path tree hanging from u. To do so, we take the patterns in it and cut off the first letters up to and including the first letter on the edge from u to this tree, and then build a compact trie on the resulting set of patterns. We finally group the substitution tries. In more detail, for each heavy path we consider its vertical substitution tries and build a weight-balanced tree, where the leaves of the weight-balanced tree are the vertical substitution tries, in the top-down order, and for each node of the tree, we create a new trie by merging the tries below it. For each of these group vertical substitution tries we build the (k−1)-errata tree. We group the horizontal substitution tries in a similar way, namely, we consider each node u and build a weight-balanced tree on the horizontal substitution tries that we created for the node u.

From the construction, it follows that the k-errata tree is a set of compact tries, and each string S in the tries originates from a pattern in the dictionary D. We mark the end of the path labelled by S by the id of the pattern it originates from. Moreover, one can show the following upper bound on the size of the k-errata tree that follows from the property of the heavy-path decomposition mentioned above:

Lemma 2 ([12, Lemma 10]). The size of the k-errata tree is O(d log^k d).

Queries. For simplicity, we first explain how to find the patterns in D that are within Hamming distance k from Q itself, and then explain how to modify the query algorithm to retrieve the patterns that are within Hamming distance k from Q or one of its prefixes. The query algorithm is recursive, and is based on a procedure called PrefixSearch. This procedure takes three arguments: a compact trie, a starting node u (explicit or implicit) in this trie, and a query string Q′, and outputs a pointer to the end of the longest path starting at u and labelled by a prefix of Q′. For the purposes of recursion, we introduce a mismatch credit, that is, the number of mismatches that we are still allowed to make. The algorithm starts with the mismatch credit µ = k and runs a PrefixSearch in the trie T for the query string Q, starting from the root. If µ = 0 and the path is labelled by Q, the algorithm returns the ids of the patterns in D that are associated with the end of the path. Otherwise, consider the heavy paths H_1, H_2, …, H_j traversed by the PrefixSearch. Let u_i be the last node of the heavy path H_i, 1 ≤ i ≤ j, visited by PrefixSearch. Note that for i < j, u_i is necessarily an explicit node of T, and for i = j it can be an implicit node.
Divide all the patterns in T into four groups: (I) patterns hanging off a node u in a heavy path H_i, where u is located above u_i, 1 ≤ i ≤ j; (II) patterns in the subtrees of u_i's children not in the heavy path H_{i+1}, for 1 ≤ i < j; (III) patterns in the subtree of the node in H_j that is just below u_j; (IV) if u_j is a node, then patterns in the subtrees of u_j's children not in the heavy path H_j. The algorithm processes each of the pattern groups independently. Consider a pattern P in group I. Suppose that it hangs from a node u ∈ H_i, where u is above u_i, and let ℓ be the length of the label of u. We have that Q and P have a mismatch at the position ℓ + 1. When creating the vertical substitution trie for u, we removed this mismatch. Consider the weight-balanced tree for H_i and the minimal set of nodes containing the vertical substitution tries for the nodes in H_i above u. To finish the recursive step, we call the algorithm with the mismatch credit µ − 1 for the (k−1)-errata trees that we built for these nodes. The patterns of groups II and IV are processed in a similar way but using the (k−1)-errata trees for the horizontal substitution tries. Finally, to process the patterns of group III, we call the algorithm with mismatch credit µ − 1 starting from the node that follows u_j in H_j.

Lemma 3 ([12, Lemma 11]). The query algorithm makes O(log^k d) PrefixSearch calls.

Cole, Gottlieb, and Lewenstein implemented PrefixSearch deterministically:

Lemma 4 ([12]). Using O(nd + d log^k d) extra space, the O(log^k d) PrefixSearch calls can be answered in O(n + log^k d · log log n) time.

We will use this solution for the case when log^k d ≥ n, but in the general case it is too expensive for our purposes. In the next section, we will show a randomised implementation of PrefixSearch which requires both less space and less time.

Remark 4. We will use the k-errata tree to retrieve the patterns that are within Hamming distance k from the query string Q or from one of its prefixes. Recall that we mark each node of the k-errata tree corresponding to an end of a dictionary pattern. Furthermore, during the preprocessing step, we compute a pointer from each node to its nearest marked ancestor. At the end of each PrefixSearch we follow the pointers and retrieve the patterns corresponding to the marked nodes between the end and the start of the PrefixSearch. The number of the PrefixSearch operations that we perform does not change.

3.2 Randomised implementation of the k-errata tree

We are now ready to define the randomised k-errata tree. Apart from the sketches for the Hamming distance, we will use the following result of Belazzougui et al. [6]:

Theorem 3 (z-fast tries [6]). Consider a string S and suppose that we can compute the Karp–Rabin fingerprint of any prefix of S in t_ϕ time. A compact trie on a set of r strings of length at most n can be stored in O(r) space to support the following queries in O(t_ϕ · log n) time: given S, find the highest node v such that the longest prefix of S present in the trie is a prefix of the label of the root-to-v path. The answer is correct with high probability.

In other words, the z-fast trie can be considered as a randomised implementation of the compact trie. In the randomised implementation of the k-errata tree, we replace each compact trie with a z-fast trie. This gives an efficient implementation of all PrefixSearch queries if u is the root of a compact trie; however, the general case requires more work.
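To build intuition for Theorem 3 before using it, here is a much-simplified sketch of the underlying idea (the actual z-fast trie of [6] is considerably more refined and more space-efficient): since every prefix of a path label is itself a path label, the set of realised prefix lengths is downward-closed, so the longest prefix of S present in the trie can be located by binary search on its length, with fingerprints replacing letter comparisons. All names below are ours:

```python
# A minimal sketch of the idea behind z-fast tries (Theorem 3), not the actual
# structure of [6]. Binary search on the prefix length; fingerprints replace
# letter-by-letter comparisons, so each probe takes t_phi time.

def longest_prefix_in_trie(prefix_fp, label_fps, max_len):
    """prefix_fp(l): fingerprint of S[1..l] (computable in t_phi time);
    label_fps[l]: set of fingerprints of the length-l path labels."""
    lo, hi = 0, max_len        # invariant: the length-lo prefix is in the trie
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if prefix_fp(mid) in label_fps.get(mid, set()):
            lo = mid           # w.h.p. S[1..mid] labels a path in the trie
        else:
            hi = mid - 1
    return lo
```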

Lemma 5. Assume log^k d < n, and suppose that we know the k-mismatch sketches of all prefixes of the query string Q. The randomised k-errata tree uses Õ(kd log^k d) space and answers dictionary look-up with k mismatches queries in Õ(k log^k d + |output|) time. The answers are correct with high probability.

Proof. Recall from above that the k-errata tree is a collection of compact tries. In the randomised version of the k-errata tree, we replace each of them with a z-fast trie. We also store the k-mismatch sketch of the label of every node of the tries, which requires Õ(kd log^k d) space in total.

We now describe the query algorithm. Recall that each dictionary look-up with k mismatches is a sequence of calls to the PrefixSearch procedure, and therefore it suffices to give an efficient implementation of PrefixSearch. We first explain how to implement this operation if it starts at the root of some compact trie of the k-errata tree. Since we know the k-mismatch sketches of the prefixes of Q, we know their Karp–Rabin fingerprints. Hence, by Fact 2 we can retrieve the Karp–Rabin fingerprint of any substring of Q in O(1) time. Theorem 3 immediately implies that a PrefixSearch starting at the root of a compact trie can be implemented in O(log n) time. Note that if the end of the PrefixSearch is an implicit node, then the functionality of the z-fast tries will allow us to retrieve only the edge this node belongs to, but not the node itself. As we show below, this is sufficient for our purposes.

We now give an implementation of a PrefixSearch starting at an arbitrary node of a compact trie by reducing it first to a PrefixSearch that starts at an explicit node of the trie and then to a PrefixSearch that starts at the root of the trie. We first show a reduction from a PrefixSearch that starts at an implicit node to a PrefixSearch that starts at an explicit node. As explained above, we might know the edge this starting node

belongs to, but not the node itself. However, from the description of the query algorithm in Section 3.1 it follows that the algorithm will continue along the edge by running PrefixSearch operations until it either runs out of the mismatch credit or reaches the explicit node at the lower end of the edge. We will fast-forward to the lower end of the edge using the k-mismatch sketches. Namely, let Q′ be the query string when we entered the current tree (note that we do not change the tree when retrieving patterns of group III). Importantly, the string Q′ is a suffix of Q. We want to check whether we can reach the lower end of the edge and not run out of the mismatch credit. In other words, we want to compare the number of mismatches between the label S of the lower end of the edge and the prefix S′ of Q′ of length |S|, and the mismatch credit. We use the k-mismatch sketches for this task. We store the sketch of S, and the sketch of S′ can be computed in Õ(k) time as it is a substring of Q. Having computed the sketches, we can compute the Hamming distance between S and S′ using Lemma 1. If the Hamming distance is larger than the available mismatch credit, we stop; otherwise, we continue the PrefixSearch from the explicit node at the lower end of the edge.

Finally, we show an implementation of a PrefixSearch for a string Q′ that starts at an explicit node u of a trie. Let S be the label of u. Our task is equivalent to performing a PrefixSearch starting from the root of a trie for the string S ∘ Q′. Recall that Theorem 3 assumes that we can extract the Karp–Rabin fingerprint of any prefix of S ∘ Q′. We do not know the Karp–Rabin fingerprints of the prefixes of S ∘ Q′, but we can compute them as follows. First, we use the k-mismatch sketches similar to above to compute the at most k mismatches that occurred on the way from the root of the trie to u. After having computed the mismatches, we can compute any of the fingerprints in Õ(k) time by taking the fingerprint of the corresponding substring of Q and "fixing" it in at most k positions. By fixing we mean that if two strings X, Y of length m differ in positions i_1, i_2, …, i_{k′}, where k′ ≤ k, then ϕ(X) equals ϕ(Y) + Σ_{j=1}^{k′} (X[i_j] − Y[i_j])·r^{m−i_j} mod p, where r, p are as in Definition 1, and therefore can be computed in O(k) time. The value ϕ(X^R) can be computed analogously from ϕ(Y^R), which implies that the Karp–Rabin fingerprint of X can be computed from that of Y in O(k) time.

The bounds on the space occupied by the data structure and the query time follow from Lemmas 2 and 3. The bound on the error probability follows from the assumption log^k d < n.
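The "fixing" step in the proof admits a direct sanity check. The snippet below (ours) verifies the displayed identity for ϕ on a toy example:

```python
# Given phi(Y) and the positions where X and Y differ, phi(X) is recovered
# with k' modular corrections. Here phi(X) = sum X[i] * r^(m-i) mod p, as in
# Definition 1 (0-based indices in the code).

p, r = (1 << 61) - 1, 12345

def phi(X):
    m = len(X)
    return sum(x * pow(r, m - 1 - i, p) for i, x in enumerate(X)) % p

def fix(phi_Y, m, diffs):
    """diffs: list of (0-based position j, X[j], Y[j]) mismatches."""
    for j, xj, yj in diffs:
        phi_Y = (phi_Y + (xj - yj) * pow(r, m - 1 - j, p)) % p
    return phi_Y

X, Y = [5, 2, 9, 4], [5, 7, 9, 1]          # differ at positions 1 and 3
assert fix(phi(Y), 4, [(1, 2, 7), (3, 4, 1)]) == phi(X)
```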

4 Algorithm

In this section, we show our streaming algorithm for dictionary matching with k mismatches. Similar to previous work on streaming pattern matching, we assume that we receive the patterns first, preprocess them (without accounting for the preprocessing time), and then receive the text.

Theorem 4. Assume an alphabet of size n^{O(1)}. There is a randomised streaming algorithm that solves dictionary matching with k mismatches in Õ(kd log^k d) space and Õ(k log^k d + |output|) time per arriving letter. The algorithm has two-sided error and its answers are correct with high probability.

Hereafter we assume k log log d < log^k d (all logs are in base two), which is true for any d ≥ 3 and k ≥ 1. For d = 1, 2 we can use Corollary 1 to achieve the complexities of Theorem 4. We also assume that log^k d < n (see Lemma 5).

Definition 5 (k-period, Clifford et al. [10]). The k-period of a string S = S[1]…S[m] is the minimal integer π > 0 such that the Hamming distance between S[π+1, m] and S[1, m−π] is at most 2k.
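As a reference point, Definition 5 can be evaluated directly in quadratic time; our streaming algorithm never does this, relying on sketches instead, but the naive version below (ours) pins the definition down:

```python
# A direct, quadratic-time reference implementation of Definition 5: the
# k-period is the smallest shift pi whose self-overlap has at most 2k
# mismatches. The loop always returns by pi = m (empty overlap).

def k_period(S, k):
    m = len(S)
    for pi in range(1, m + 1):
        mismatches = sum(1 for i in range(m - pi) if S[pi + i] != S[i])
        if mismatches <= 2 * k:
            return pi

assert k_period("abababab", 0) == 2
```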

Observation 1. If the k-period of S is larger than d, there can be at most one k-mismatch occurrence of S per d consecutive positions of the text.

For patterns of length smaller than or equal to 3d, we use the algorithm based on the randomised k-errata tree (Lemma 6). The remaining patterns form two smaller dictionaries: the first dictionary D1 contains the patterns P_i such that the k-period of their suffix τ_i = P_i[|P_i| − 2d + 1, |P_i|] is larger than d, and the second dictionary D2 contains the patterns P_i such that the k-period of their suffix τ_i is at most d. The algorithm processes the dictionaries in parallel. To process D1, it makes use of Observation 1. To process D2, it uses the fact that the patterns, and therefore the regions of the text containing their k-mismatch occurrences, are periodic and can be encoded in small space. The idea of exploiting periodicity has a flavour similar to [9–11, 24], but we make a significant step forward to allow both mismatches and multiple patterns.

The rest of the section is organised as follows. First, we show two algorithms that we will use as auxiliary routines (Section 4.2). Second, we show an algorithm for the dictionary D1 (Section 4.3) and an algorithm for the dictionary D2 (Section 4.4). These two algorithms, run in parallel, give a streaming algorithm with the complexities of Theorem 4; however, the time complexity is amortised. In Section 4.5 we show how to de-amortise the algorithm, which yields Theorem 4.

4.2 Tools

We now describe two algorithms. The first algorithm is a streaming algorithm that is based on the randomised k-errata tree. The space complexity of this algorithm depends linearly on the maximal length of the patterns, so we cannot use it in the general case, but it will be handy when processing short patterns. The second algorithm is a streaming algorithm that can check, on demand, whether the current stream ends with a k-mismatch occurrence of one of the dictionary patterns.

4.2.1 Tool: Algorithm based on the randomised k-errata tree

Lemma 6. Assume log^k d < n. Given a dictionary of d patterns of length at most ℓ, there is a randomised streaming algorithm that finds all k-mismatch occurrences of the patterns in the text using Õ(kd log^k d + kℓ) space and Õ(k log^k d + |output|) time per letter of the text. The algorithm has two-sided error and its answers are correct with high probability.

4.2.2 Tool: On-demand algorithm

Clifford et al. [10] showed that the problem of detecting k-mismatch occurrences of a pattern P in the text can be reduced to dictionary matching in the following way. Let R be a set of log n primes chosen u.a.r. from the interval [k log² n, 102k log² n]. A subpattern P_r^ℓ of the pattern P is defined by a prime r ∈ R and an integer 1 ≤ ℓ ≤ r, namely, P_r^ℓ = P[ℓ]P[r+ℓ]P[2r+ℓ]… and so on until the end of P. (If P is shorter than r, some subpatterns are left undefined.) Furthermore, let Q be the set of all the primes in the interval [log n, 3 log n], and consider a second-level partitioning of the patterns. Namely, a subpattern P_{r,q}^ℓ

for r ∈ R, q ∈ Q and 1 ≤ ℓ ≤ q·r is defined as P_{r,q}^ℓ = P[ℓ]P[r·q+ℓ]P[2r·q+ℓ]… and so on until the end of P. (Again, some subpatterns can be undefined.)

Corollary 5 (of Lemmas 3.2 and 5.2 of [10]). Consider an alignment of a pattern P of length at most n and the text. Given the subset of the subpatterns of P that match exactly at this alignment, there is a randomised Õ(k)-time algorithm that outputs a message "The distance is larger than k." if the Hamming distance between P and the text is larger than k, and the true value of the Hamming distance otherwise. The algorithm is correct with probability at least 1 − 2/n².

Proof. First, for every r, the algorithm finds the number µ_r of subpatterns P_r^ℓ that do not match. As max_{r∈R} µ_r is a lower bound for the Hamming distance between P and the text, in the case max_{r∈R} µ_r > k the algorithm can simply output the message "The distance is larger than k." Otherwise, if max_{r∈R} µ_r ≤ k, the algorithm computes the Hamming distance as described in [10, Lemma 3.2]. This requires Õ(k) time and gives the correct value of the Hamming distance with probability at least 1 − 1/n² conditioned on the fact that the distance is bounded by 2k. By [10, Lemma 5.2] this condition is violated with probability at most 1/4n². Therefore, the answer is correct with probability at least 1 − 2/n².
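For illustration, the first-level partitioning is just a split of P into arithmetic progressions modulo r; the sketch below (ours) makes explicit why every mismatch falls into exactly one subpattern P_r^ℓ per prime r, which is what makes the counts µ_r valid lower bounds:

```python
# For a prime r, the pattern P splits into r subpatterns P_r^l, l = 1..r
# (0-based offsets 0..r-1 in the code); each position of P belongs to exactly
# one of them.

def subpatterns(P, r):
    """Return the r subpatterns P_r^1, ..., P_r^r."""
    return [P[offset::r] for offset in range(r)]

P = "abcabcabd"
assert subpatterns(P, 3) == ["aaa", "bbb", "ccd"]
```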

We also exploit the following result:

Theorem 6 ([24, Theorem 2]). Given a dictionary of d patterns P_1, P_2, …, P_d of lengths at most n and a text T of length O(n) over an alphabet of size n^{O(1)}, there exists a randomised streaming algorithm for the dictionary matching problem that uses O(d log n) space and Õ(1) time per letter of the text. At each time moment, the algorithm outputs the longest occurrence of a dictionary pattern that is a suffix of the current text. The algorithm has one-sided error and its answers are correct with high probability.

Corollary 7. Given a dictionary of d patterns P_1, P_2, …, P_d of lengths at most n and a text T of length O(n), there is a randomised streaming algorithm that uses Õ(k²d) space and processes each letter of the text in Õ(1) time. On demand, the algorithm can tell in Õ(k) time if there is a k-mismatch occurrence of a pattern P_i that ends at the current position of the text. The algorithm has two-sided error and its answers are correct with high probability.

Proof. During the preprocessing step, we build two compact tries. The first trie contains the reverses of all subpatterns (P_i)_r^ℓ, where r ∈ R, 1 ≤ ℓ ≤ r, and the second one the reverses of all subpatterns (P_i)_{r,q}^ℓ, where r ∈ R, q ∈ Q, 1 ≤ ℓ ≤ q·r. Furthermore, we traverse the tries in the depth-first order and memorise, for each node, the first and the last time when we see it. This allows us to tell in O(1) time if the reverse of one subpattern is a prefix of the reverse of another subpattern. (Recall that the preprocessing time is not accounted for in the time complexity.)

During the main stage, we partition the text stream T into substreams. For each r ∈ R and 1 ≤ ℓ ≤ r, we define the text substream T_r^ℓ = T[ℓ]T[r+ℓ]T[2r+ℓ]… and so on until the end of T. On the substream T_r^ℓ we run the algorithm of Theorem 6 for the dictionary of subpatterns {(P_i)_r^{ℓ′}}, where i = 1, 2, …, d and 1 ≤ ℓ′ ≤ r. Also, for each pair of primes q ∈ Q, r ∈ R and an integer 1 ≤ ℓ ≤ q·r we define a text substream T_{q,r}^ℓ = T[ℓ]T[q·r+ℓ]T[2q·r+ℓ]… and so on until the end of T. On the substream T_{q,r}^ℓ we run the algorithm of Theorem 6 for the dictionary of subpatterns {(P_i)_{q,r}^{ℓ′}}, where i = 1, 2, …, d and 1 ≤ ℓ′ ≤ q·r.

In total there are Õ(k) substreams and Õ(kd) subpatterns per substream, and therefore the algorithm uses Õ(k²d) space. To process each letter of the text, the algorithm requires Õ(1) time: when a new letter T[p] arrives, the algorithm must update one substream T_r^ℓ for each r ∈ R (ℓ = p mod r) and one substream T_{r,q}^ℓ for each pair r ∈ R, q ∈ Q (ℓ = p mod (r·q)).

Using the output of the dictionary matching algorithms and the compact tries built at the preprocessing step, we can check, for any subpattern, if it matches at the current alignment in O(1) time, and therefore we can decide if there is a k-mismatch occurrence of P_i in Õ(k) time by Corollary 5.
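For illustration, the substream bookkeeping from the proof can be sketched as follows (ours; a real implementation would feed each substream into its own instance of the algorithm of Theorem 6 instead of buffering letters):

```python
# The arriving position p feeds exactly one substream T_r^l per prime r
# (l = p mod r), so each letter triggers O(|R|) constant-time updates.

class Substreams:
    def __init__(self, primes):
        self.primes = primes
        # buffers[r][l] collects the letters of T_r^l
        self.buffers = {r: {l: [] for l in range(r)} for r in primes}

    def push(self, p, letter):          # p is the 0-based position of letter
        for r in self.primes:
            self.buffers[r][p % r].append(letter)

streams = Substreams([2, 3])
for p, c in enumerate("abcdef"):
    streams.push(p, c)
assert "".join(streams.buffers[3][1]) == "be"  # T_3^2 in 1-based notation
```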

4.3 Streaming algorithm for patterns with large periods

In this section, we show a streaming algorithm for the dictionary D1 that contains the patterns P_i such that the k-period of their suffix τ_i = P_i[|P_i| − 2d + 1, |P_i|] is at least d.

Lemma 7. Assume log^k d < n. There is a randomised streaming algorithm that finds all k-mismatch occurrences of the patterns of D1 in the text using Õ(kd log^k d) space and Õ(k log^k d + |output|) time per letter of the text. The algorithm has two-sided error and its answers are correct with high probability.

Proof. Note that any k-mismatch occurrence of a pattern P_i ends with a k-mismatch occurrence of τ_i. We retrieve the occurrences of τ_i via the algorithm based on the k-errata tree (Lemma 6). At each position of the text, the algorithm outputs all indices i such that there is a k-mismatch occurrence of τ_i ending at this position. If the same index is output less than d positions apart, i.e. there are two k-mismatch occurrences of τ_i that end at positions p_1 and p_2 such that |p_1 − p_2| ≤ d, there is an error and we stop the execution of the algorithm. When we find a k-mismatch occurrence of τ_i, our second step is to check if it can be extended into a k-mismatch occurrence of P_i, which we do with the help of the on-demand algorithm (Corollary 7).

We now analyse the algorithm. Assume that neither the k-errata tree nor the on-demand algorithm err. The k-errata tree for the suffixes τ_i occupies Õ(kd log^k d) space. To report d′ ≤ d k-mismatch occurrences of τ_i's that end at the current position of the text, we spend Õ(k log^k d + d′) time (Lemma 6). The on-demand algorithm uses Õ(k²d) space and Õ(1) time per letter (Corollary 7). To test if a k-mismatch occurrence of τ_i can be extended into a k-mismatch occurrence of P_i, we need Õ(k) time. Since, by Observation 1, unless the k-errata tree algorithm errs, there is at most one k-mismatch occurrence of τ_i per d positions of the text, the time bound follows. Lemma 6, Corollary 7, and the union bound imply that the answers of the algorithm are correct with high probability (assuming that the error probabilities in Lemma 6 and Corollary 7 are chosen to be small enough).

4.4 Streaming algorithm for patterns with small periods

In this section, we show a streaming algorithm for the second dictionary D2 that contains the patterns P_i such that the k-period of their suffix τ_i = P_i[|P_i| − 2d + 1, |P_i|] is at most d. We distinguish two cases: in Case 1, the k-period of the whole pattern P_i is at most d, and in Case 2, the k-period of P_i is larger than d, but the k-period of its suffix τ_i is at most d. For a pattern P_i in Case 2, we let τ′_i denote the longest suffix of P_i whose k-period is at most d.

Lemma 8. Assume log^k d < n. There is a randomised streaming algorithm that finds all k-mismatch occurrences of the patterns of D2 in the text using Õ(kd log^k d) space and Õ(k log^k d + |output|) amortised time per letter of the text. The algorithm has two-sided error and its answers are correct with high probability.

4.4.1 Algorithm for Case 1

We first assume that Case 1 holds for all the patterns, and then extend the algorithm to Case 2 as well. We start by showing a simple but important property of patterns with small periods.

Lemma 9. Consider a position r of the text T. Let j·d be the largest multiple of d that is smaller than r, and let L be the longest suffix of T[j·d − n + 1, j·d] with the 2k-period at most d. Every k-mismatch occurrence of P_i ∈ D2 in T that ends at the position r is fully contained in L ∘ T[j·d + 1, r] (see Fig. 2).

Proof. Consider an occurrence T[ℓ, r] of a pattern P_i ∈ D2 that ends at the position r. Since the length of P_i is at most n, ℓ ≥ r − n + 1 > j·d − n + 1. Now, let ρ ≤ d be the k-period of P_i and H(X, Y) the Hamming distance between strings X, Y. We have H(T[ℓ + ρ − 1, r], P_i[ρ, |P_i|]) ≤ k and H(T[ℓ, r − ρ + 1], P_i[1, |P_i| − ρ + 1]) ≤ k. Therefore, by the triangle inequality,

H(T[ℓ + ρ − 1, r], T[ℓ, r − ρ + 1]) ≤ 2k + H(P_i[ρ, |P_i|], P_i[1, |P_i| − ρ + 1]).

As H(P_i[ρ, |P_i|], P_i[1, |P_i| − ρ + 1]) ≤ 2k, we obtain that H(T[ℓ + ρ − 1, r], T[ℓ, r − ρ + 1]) ≤ 4k and hence the 2k-period of T[ℓ, j·d] is at most ρ ≤ d. Consequently, it is contained in L. The claim follows.



Figure 2: If there is a k-mismatch occurrence of a pattern P_i that ends at a position r of T, and the k-period of P_i is at most d, then the occurrence is contained in the suffix L ∘ T[j·d + 1, r] of the text.


Figure 3: The string L^R and the retrieval of the 4k-mismatch sketch of a suffix of L^R. There are four streaks of equal blocks shown in different colours; each streak starts with a mismatch-containing block.

In other words, it suffices to know L and the at most d last positions of the text to be able to retrieve the k-mismatch occurrences of the patterns that end at a position r. Of course, in general L can be long and we cannot store it explicitly. However, as it is close to periodic, we will be able to encode it in small space.

Lemma 10. Consider a string L such that its 2k-period ρ is at most d, and recall that L^R stands for the reverse of L. There is an Õ(k²d)-space encoding of L that allows to retrieve the 4k-mismatch sketch of any substring of L^R, given by its endpoints, in Õ(k) time.

Proof. Consider a partitioning of L^R into non-overlapping blocks of length ρ (the last block may be shorter). We say that a block contains a mismatch if, for some i, its i-th letter is different from the i-th letter of the preceding block. For convenience, we also say that the first and the last blocks of L^R are mismatch-containing. Note that the number of blocks containing a mismatch is O(k) (this is because H(L^R[1, |L| − ρ + 1], L^R[ρ + 1, |L|]) = H(L[1, |L| − ρ + 1], L[ρ + 1, |L|]) ≤ 4k, and it upper bounds the number of the blocks containing a mismatch).

We encode L^R as a sorted array of the starting positions of all blocks containing a mismatch. For each mismatch-containing block, we store the 4k-mismatch sketch of each of its suffixes, as well as the sketch of the suffix of L^R that starts right before the block. In total, the encoding occupies Õ(k²d) space. By Corollary 2, it suffices to show that we can compute the 4k-mismatch sketch of any suffix of L^R in Õ(k) time. We retrieve the sketch in the following way. Let ℓ be the starting position of the suffix, and note that each mismatch-containing block starts a streak of equal blocks of L^R. First, we find the streak of blocks ℓ belongs to, and retrieve the sketch of the suffix of L^R starting just after the streak in Õ(k) time. The remaining part consists of a number of repetitions of the block containing the position ℓ prepended with a suffix of the block (see Fig. 3). We can compute the sketch of the block and of its suffix in Õ(k) time, and therefore we can compute the sketch of the remaining part in Õ(k) time using Corollary 2.
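For illustration, the block structure underlying the encoding of Lemma 10 can be computed as follows (ours; the actual encoding stores 4k-mismatch sketches for the O(k) marked blocks rather than the blocks themselves):

```python
# Split L^R into blocks of length rho and record which blocks differ from
# their predecessor; the first and last blocks are marked by convention.

def mismatch_blocks(LR, rho):
    blocks = [LR[i:i + rho] for i in range(0, len(LR), rho)]
    marked = [0, len(blocks) - 1]
    for b in range(1, len(blocks)):
        if any(x != y for x, y in zip(blocks[b], blocks[b - 1])):
            marked.append(b)
    return blocks, sorted(set(marked))

blocks, marked = mismatch_blocks("abxabbabbabb", 3)
print(blocks, marked)  # ['abx', 'abb', 'abb', 'abb'] [0, 1, 3]
```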

We are now ready to explain the algorithm. During the preprocessing stage, we build the randomised k-errata tree for the reverses of all the patterns. During the main stage of the algorithm, we maintain the suffix L and its encoding. We initialize L with an empty string and update it every d letters. While reading the next d letters of the text, that is, a substring T[(j−1)·d + 1, j·d], we compute the 4k-mismatch sketches of the reverses of its d prefixes in Õ(kd) time (Corollary 2). After having reached T[j·d], we update L, the longest suffix of T[j·d − n + 1, j·d] with the 2k-period at most d, and its encoding:

Lemma 11. L and its encoding can be updated in Õ(k²) amortised time per letter.

Proof. L is determined by its endpoints in T. To update L, it suffices to find the longest suffix of T[j·d − n + 1, j·d] such that the Hamming distance between it and its copy shifted by ρ positions, for ρ = 1, …, d, is at most 2k. For a fixed value of ρ, we use binary search and the 4k-mismatch sketches. Suppose we want to decide whether the Hamming distance between T[ℓ, j·d − ρ + 1] and T[ℓ + ρ, j·d] is at most 4k. Note that if T[ℓ, j·d] is longer than L ∘ T[(j−1)·d + 1, j·d], then its 2k-period is larger than d. This is because if the 2k-period of T[ℓ, j·d] is at most d, then the 2k-period of T[ℓ, (j−1)·d] is at most d, and hence T[ℓ, (j−1)·d] must be shorter than L. In other words, we must only consider the case when T[ℓ, j·d] is fully contained in L ∘ T[(j−1)·d + 1, j·d]. In this case, both T[ℓ, j·d − ρ + 1] and T[ℓ + ρ, j·d] can be represented as a concatenation of a suffix of L and a substring of T[(j−1)·d + 1, j·d]. We can retrieve the 4k-mismatch sketch of the reverse of any suffix of L and the 4k-mismatch sketch of the reverse of any substring of T[(j−1)·d + 1, j·d] in Õ(k) time using Lemma 10 and Corollary 2. Therefore, we can compute the 4k-mismatch sketches of (T[ℓ, j·d − ρ + 1])^R and (T[ℓ + ρ, j·d])^R and hence H((T[ℓ, j·d − ρ + 1])^R, (T[ℓ + ρ, j·d])^R) = H(T[ℓ, j·d − ρ + 1], T[ℓ + ρ, j·d]) in Õ(k) time using Lemma 1. In total, we need Õ(dk) time to update L, or Õ(k) amortised time per letter. We can now update the encoding of L in Õ(k²d) time: using the 4k-mismatch sketches for L^R[ρ, |L|] and L^R[1, |L| − ρ + 1], we can find the O(k) blocks containing a mismatch in Õ(k) time. We can then re-build the sorted array of the starting positions of mismatch-containing blocks in Õ(k) time and compute the sketches for them in Õ(k²d) time, or Õ(k²) amortised time per letter.

Let T[r] be the latest arrived letter of the text. To retrieve the k-mismatch occurrences that end at the position r, we use the k-errata tree for the reverses of the patterns in D2 that we built during the preprocessing stage. Let j·d be the largest multiple of d that is at most r and let L be defined as above. By Lemma 9, any k-mismatch occurrence of a pattern P_i ∈ D2 that ends at r must be either equal to a suffix of T[j·d + 1, r], or the concatenation of some suffix of L and T[j·d + 1, r]. The encoding of L allows to compute the 4k-mismatch sketch (and therefore the k-mismatch sketch) of the reverse of any suffix of L in Õ(k) time. We can also compute the 4k-mismatch sketch of the reverse of any of the d latest suffixes of the text in Õ(k) time. Therefore, we can retrieve the k-mismatch occurrences of the patterns for the current position in Õ(k log^k d + |output|) time using the k-errata tree. In total, the algorithm for Case 1 uses Õ(kd log^k d) space and Õ(k log^k d + |output|) amortised time per letter.

4.4.2 Extension to Case 2 and wrapping up

Consider now Case 2. Note first that the 2k-period of the string P_i[|P_i| − |τ′_i|, |P_i|], which is τ′_i extended by one letter, must be at least d, and therefore by Observation 1 there can be at most one k-mismatch occurrence of P_i[|P_i| − |τ′_i|, |P_i|] per d consecutive positions of the text.
We use the techniques of the algorithm for Case 1 to retrieve the k-mismatch occurrences of P_i[|P_i| − |τ′_i|, |P_i|], and then use the on-demand algorithm (Corollary 7) to check which of the retrieved occurrences can be extended into full occurrences of the patterns P_i. In more detail, consider a position r of the text. As before, let j·d be the largest multiple of d that is smaller than r and L be the longest suffix of T[j·d − n + 1, j·d] with the 2k-period at most d. Let now L′ be the suffix L extended by one letter to the left, i.e. L′ = T[j·d − |L|, j·d]. By definition, the (2k+1)-period of L′ is at most d. Furthermore, similar to Lemma 9, we can show that any k-mismatch occurrence of P_i[|P_i| − |τ′_i|, |P_i|] ending at the position r must be fully contained in L′ ∘ T[j·d + 1, r]. Therefore, instead of the suffix L, we can maintain the encoding of the longest suffix L′′ such that its (2k+1)-period is at most d via an algorithm similar to that of the previous section in Õ(k²d) space and Õ(k²) amortised time per letter. (Note that L′′ contains L′.) Using the encoding, we can retrieve the 4k-mismatch (and therefore k-mismatch) sketch of the reverse of any substring of L′′ ∘ T[j·d + 1, r] in Õ(k) time and hence we can find the k-mismatch occurrences of P_i[|P_i| − |τ′_i|, |P_i|] using the k-errata tree in Õ(k log^k d + |output|) time per letter. If the k-errata tree reports two occurrences of P_i[|P_i| − |τ′_i|, |P_i|] at distance less than d from each other, there is an error and we stop. Otherwise, we check whether the reported occurrence can be extended into a k-mismatch occurrence of P_i via the on-demand algorithm in Õ(k) time.

In total, the algorithm for Case 2 uses Õ(kd log^k d + k²d) = Õ(kd log^k d) space and Õ(k log^k d + |output|) amortised time per letter. Lemma 8 follows.

4.5 De-amortisation

Lemma 7 and Lemma 8 yield a streaming algorithm for dictionary matching with k mismatches with the complexities of Theorem 4, except that the time is amortised. Below we explain how to de-amortise the algorithm. We use a standard approach called the tail trick.

4.5.1 De-amortised algorithm with a delay

First, note that there is an easy way to de-amortise the algorithm of Lemma 7 if we allow a delay of d letters. Formally speaking, this means that an occurrence ending at position i in the text does not need to be reported immediately after reading the i-th letter, but instead can be reported after reading any of the subsequent d letters (and we do not require any control on when precisely this happens). To obtain such an algorithm, we divide the text into non-overlapping blocks of length d, and de-amortise the processing time of a block over the next block. We must memorise the k-mismatch occurrences of the suffixes τ_i that end at the last 2d positions of the text, but this requires only O(d) space and we can afford it.

We now show how to de-amortise the algorithm for Case 1 of Lemma 8. This time, we do not need the delay. The only step of the algorithm that requires de-amortisation is updating L and its encoding. We can de-amortise this step in a standard way. Namely, we de-amortise the time we need for an update over the next d letters of the text. We also maintain the sketches of the reverses of the 2d longest prefixes of the text in a round-robin fashion using Õ(kd) space and O(k) time. If we need to extract the sketch of the reverse of some suffix of L before the update is finished, we use the previous version of the data structure and the sketches of the reverses of the prefixes to compute the required values using Corollary 2.

Finally, we show how to de-amortise the algorithm of Case 2 of Lemma 8, again with a delay of d letters. Recall that this algorithm first finds the k-mismatch occurrences of the suffixes P_i[|P_i| − |τ′_i|, |P_i|] using an algorithm similar to the algorithm for Case 1 of Lemma 8, which can be de-amortised with no delay as explained above, and then tests these occurrences using the on-demand algorithm (Corollary 7), which can be de-amortised with a delay of d letters. Importantly, there are at most d occurrences that need to be tested per d letters, so we can memorise them until we can test them. The claim follows.

4.5.2 Removing the delay

We now show how to remove the delay. Recall that we assume the patterns to have lengths larger than 3d. We partition each pattern P_i = H_i ∘ Q_i, where Q_i is the suffix of P_i of length d, and H_i is the remaining prefix. The idea is to find occurrences of the prefixes H_i and of the suffixes Q_i independently, and then to see which of them form an occurrence of P_i.

As above, we have three possible cases: the k-period of H_i[|H_i| − 2d + 1, |H_i|] is larger than d; the k-period of H_i is at most d; the k-period of H_i is larger than d but the k-period of H_i[|H_i| − 2d + 1, |H_i|] is at most d. In the second case, we do not need to change much. For the current position r of the text we consider the largest j·d such that r − j·d ≥ d and define L to be the longest suffix of T[j·d − n + 1, j·d] such that its 2k-period is at most d. We store the k-errata tree on the reverses of P_i = H_i ∘ Q_i and run the de-amortised algorithm described in the previous section that maintains the suffix L. Any k-mismatch occurrence of a pattern P_i is fully contained in L ∘ T[j·d + 1, r], and therefore we can find all such occurrences using the k-errata tree as above.

We now explain how we remove the delay in the first and third cases. To find the occurrences of Q_i we use the algorithm based on the k-errata tree (Lemma 6). To find the occurrences of H_i, we use the de-amortised version of the algorithm of Lemma 7 or of Lemma 8, as appropriate, that reports the occurrences with a delay of at most d letters. It means that at the time when we find an occurrence of Q_i, the corresponding occurrence of H_i has already been reported, so it is easy to check whether they form an occurrence of P_i. The only

technicality is that we need to store the occurrences of H_i that we found while processing the last d letters of the text. To this end, we use a dynamic hashing scheme [15]. The scheme allows to store a dynamic dictionary in linear space and with high probability guarantees constant look-up and update times. The answers to the look-up queries are always correct. Note that we can modify the data structure slightly to have worst-case constant time per operation if we allow the answers to be correct only with high probability (which we can afford); namely, if an operation takes too much time, we can simply abandon it.

We use the scheme for each of the last d positions of the text. Namely, consider a position p of the text and suppose that we found a set of k-mismatch occurrences of the prefixes H_i that end at p. Consider one of the prefixes, H_i, and let the Hamming distance between the prefix H_i and the text be h ≤ k. Recall that by Lemma 3 there are O(log^k d) nodes of the k-errata tree labelled by Q_i. For each such node u of the k-errata tree, we insert a pair (u, h) into the dictionary. In case we insert a pair (u, h) several times for different prefixes H_i's, we associate (u, h) with the set of such prefixes. Note that at any moment the total size of the dictionaries is Õ(d log^k d) as each of the patterns H_i has at most one k-mismatch occurrence over each d consecutive positions of the text.

Suppose we are at a position p of the text and we have run a dictionary look-up query and found the O(log^k d) nodes in the tries of the k-errata tree corresponding to the suffixes Q_i that occur at this position with at most k mismatches. For each such node u we know the Hamming distance h′ between the occurrences and the text. We then go to the dictionary at the position (p − d) and look up the pairs (u, k − h′), (u, k − h′ − 1), …, (u, 0). If they are in the dictionary, we report all H_i's associated with these pairs. This step takes O(k log^k d + |output|) time.

5 Space lower bound

We now show a space lower bound that demonstrates that for constant k our algorithm is optimal up to a polylogarithmic factor:

Lemma 12. Any streaming algorithm for dictionary matching with k mismatches such that its answers are correct with constant probability requires Ω(kd) bits of space.

Proof. In the communication complexity setting, the Index problem is stated as follows. We assume that there are two players, Alice and Bob. Alice holds a binary string of length n, and Bob holds an index i. In a one-round protocol, Alice sends Bob a single message (depending on her input and on her random coin flips) and Bob must compute the i-th bit of Alice's input using her message and his random coin flips correctly with probability at least 2/3. The length of Alice's message (in bits) is called the randomised one-way communication complexity of the problem. The randomised one-way communication complexity of the Index problem is Ω(n) [30].

Given a streaming algorithm for dictionary matching with k mismatches, one can construct a randomised one-way communication protocol for the Index problem as follows. As above, let d be the size of the dictionary, and assume that n = kd. Split Alice's string into d blocks of length k. Let #, $, $_1, …, $_d be distinct letters different from {0, 1}. For the j-th block B_j create a string P_j = ($_j)^{k+1} # B_j. For Bob's input i = k·q + r we create a string T which is equal to ($_q)^{k+1} concatenated with a string of length k + 1 obtained from $^{k+1} by changing the (r + 1)-th letter to 0. A streaming algorithm for dictionary matching with k mismatches for the set of patterns P_j and the text T will output a k-mismatch occurrence of P_q at the position 2k + 2 of the text iff the i-th bit of Alice's input (the r-th bit of B_q) is equal to 0. Therefore, if Alice preprocesses the patterns P_j as in the streaming algorithm and sends the result to Bob, Bob will be able to continue to run the streaming algorithm on T to decide the i-th bit of Alice's input. Therefore, the lower bound for the communication complexity of the Index problem is a space lower bound for streaming dictionary matching with mismatches. Lemma 12 follows.
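For illustration, the reduction in the proof can be rendered concretely; the toy example below (ours) instantiates it for k = 2 and d = 2 and checks that the Hamming distance at the last position of T is k exactly when the probed bit is 0:

```python
# Alice's bits are packed into patterns P_j = ($_j)^(k+1) # B_j; Bob's index
# (q, r) becomes a text whose last k+1 letters mismatch "# B_q" everywhere
# except, possibly, at the probed bit.

k, d = 2, 2
alice = "0110"                           # d blocks of k bits each
blocks = [alice[j * k:(j + 1) * k] for j in range(d)]

def pattern(j):                          # $_j encoded as the letter chr(j)
    return chr(j) * (k + 1) + "#" + blocks[j]

def text(q, r):                          # Bob probes bit r (1-based) of B_q
    probe = ["$"] * (k + 1)
    probe[r] = "0"                       # the (r+1)-th letter becomes 0
    return chr(q) * (k + 1) + "".join(probe)

def hamming(x, y):
    return sum(a != b for a, b in zip(x, y))

assert hamming(pattern(0), text(0, 1)) == k      # B_0[1] = '0': occurrence
assert hamming(pattern(0), text(0, 2)) == k + 1  # B_0[2] = '1': no occurrence
```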

References

[1] Alfred V. Aho and Margaret J. Corasick. Efficient string matching: An aid to bibliographic search. Communications of the ACM, 18(6):333–340, June 1975. doi:10.1145/360825.360855.

[2] Amihood Amir, Moshe Lewenstein, and Ely Porat. Faster algorithms for string matching with k mismatches. J. Algorithms, 50(2):257–275, 2004. doi:10.1016/S0196-6774(03)00097-X.

[3] Djamal Belazzougui. Succinct dictionary matching with no slowdown. In Proc. of the 21st Annual Symposium on Combinatorial Pattern Matching, pages 88–100, 2010. doi:10.1007/978-3-642-13509-5_9.

[4] Djamal Belazzougui. Worst-case efficient single and multiple string matching on packed texts in the word-RAM model. Journal of Discrete Algorithms, 14:91–106, 2012. doi:10.1007/978-3-642-19222-7_10.

[5] Djamal Belazzougui, Paolo Boldi, Rasmus Pagh, and Sebastiano Vigna. Monotone minimal perfect hashing: Searching a sorted table with O(1) accesses. In Proc. of the 20th Annual ACM-SIAM Symposium on Discrete Algorithms, pages 785–794, 2009. doi:10.1137/1.9781611973068.86.

[6] Djamal Belazzougui, Paolo Boldi, and Sebastiano Vigna. Dynamic z-fast tries. In Proc. of the 17th International Symposium on String Processing and Information Retrieval, pages 159–172, 2010. doi:10.1007/978-3-642-16321-0_15.

[7] Djamal Belazzougui and Mathieu Raffinot. Average optimal string matching in packed strings. In Proc. of the 8th International Conference on Algorithms and Complexity, pages 37–48, 2013. doi:10.1007/978-3-642-38233-8_4.

[8] Dany Breslauer and Zvi Galil. Real-time streaming string-matching. ACM Transactions on Algorithms, 10(4):22:1–22:12, August 2014. doi:10.1145/2635814.

[9] Raphaël Clifford, Allyx Fontaine, Ely Porat, Benjamin Sach, and Tatiana Starikovskaya. Dictionary matching in a stream. In Proc. of the 23rd Annual European Symposium on Algorithms, pages 361–372, 2015. doi:10.1007/978-3-662-48350-3_31.

[10] Raphaël Clifford, Allyx Fontaine, Ely Porat, Benjamin Sach, and Tatiana Starikovskaya. The k-mismatch problem revisited. In Proc. of the 27th Annual ACM-SIAM Symposium on Discrete Algorithms, pages 2039–2052, 2016. doi:10.1137/1.9781611974331.ch142.

[11] Raphaël Clifford, Tomasz Kociumaka, and Ely Porat. The streaming k-mismatch problem. In Proc. of the 30th Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1106–1125, 2019. doi:10.1137/1.9781611975482.68.

[12] Richard Cole, Lee-Ad Gottlieb, and Moshe Lewenstein. Dictionary matching and indexing with errors and don't cares. In Proc. of the 36th Annual ACM Symposium on Theory of Computing, pages 91–100, 2004. doi:10.1145/1007352.1007374.

[13] Beate Commentz-Walter. A string matching algorithm fast on the average. In Proc. of the 6th International Colloquium on Automata, Languages and Programming, pages 118–132, 1979. doi:10.1007/3-540-09510-1_10.

[14] Maxime Crochemore, Artur Czumaj, Leszek Gąsieniec, Thierry Lecroq, Wojciech Plandowski, and Wojciech Rytter. Fast practical multi-pattern matching. Information Processing Letters, 71(3):107–113, 1999.

[15] Martin Dietzfelbinger and Friedhelm Meyer auf der Heide. Dynamic hashing in real time. In Informatik: Festschrift zum 60. Geburtstag von Günter Hotz, pages 95–119. 1992. doi:10.1007/978-3-322-95233-2_7.

[16] C. Epifanio, A. Gabriele, F. Mignosi, A. Restivo, and M. Sciortino. Languages with mismatches. Theoretical Computer Science, 385(1):152–166, 2007. doi:10.1016/j.tcs.2007.06.006.

[17] Johannes Fischer, Travis Gagie, Paweł Gawrychowski, and Tomasz Kociumaka. Approximating LZ77 via small-space multiple-pattern matching. In Proc. of the 23rd European Symposium on Algorithms, pages 533–544, 2015. doi:10.1007/978-3-662-48350-3_45.

[18] Paweł Gawrychowski, Gad M. Landau, and Tatiana Starikovskaya. Fast entropy-bounded string dictionary look-up with mismatches. In Proc. of the 43rd International Symposium on Mathematical Foundations of Computer Science, volume 117, pages 66:1–66:15, 2018. doi:10.4230/LIPIcs.MFCS.2018.66.

[19] Paweł Gawrychowski and Tatiana Starikovskaya. Streaming dictionary matching with mismatches. In Proc. of the 30th Annual Symposium on Combinatorial Pattern Matching, pages 21:1–21:15, 2019. doi:10.4230/LIPIcs.CPM.2019.21.

[20] Paweł Gawrychowski and Przemysław Uznański. Towards unified approximate pattern matching for Hamming and L1 distance. In Proc. of the 45th International Colloquium on Automata, Languages, and Programming, volume 107, pages 62:1–62:13, 2018. doi:10.4230/LIPIcs.ICALP.2018.62.

[21] Shay Golan, Tomasz Kociumaka, Tsvi Kopelowitz, and Ely Porat. Dynamic dictionary matching in the online model. In Proc. of the 16th International Symposium on Algorithms and Data Structures, volume 11646 of Lecture Notes in Computer Science, pages 409–422, 2019. doi:10.1007/978-3-030-24766-9_30.

[22] Shay Golan, Tomasz Kociumaka, Tsvi Kopelowitz, and Ely Porat. The streaming k-mismatch problem: Tradeoffs between space and total time. In Proc. of the 31st Annual Symposium on Combinatorial Pattern Matching, volume 161, pages 15:1–15:15, 2020. doi:10.4230/LIPIcs.CPM.2020.15.

[23] Shay Golan, Tsvi Kopelowitz, and Ely Porat. Towards optimal approximate streaming pattern matching by matching multiple patterns in multiple streams. In Proc. of the 45th International Colloquium on Automata, Languages, and Programming, pages 65:1–65:16, 2018. doi:10.4230/LIPIcs.ICALP.2018.65.

[24] Shay Golan and Ely Porat. Real-time streaming multi-pattern search for constant alphabet. In Proc. of the 25th Annual European Symposium on Algorithms, volume 87, pages 41:1–41:15, 2017. doi:10.4230/LIPIcs.ESA.2017.41.

[25] Wing-Kai Hon, Tsung-Han Ku, Rahul Shah, Sharma V. Thankachan, and Jeffrey Scott Vitter. Faster compressed dictionary matching. In Proc. of the 17th International Symposium on String Processing and Information Retrieval, pages 191–200, 2010. doi:10.1007/978-3-642-16321-0_19.

[26] Trinh N. D. Huynh, Wing-Kai Hon, Tak-Wah Lam, and Wing-Kin Sung. Approximate string matching using compressed suffix arrays. Theoretical Computer Science, 352(1):240–249, March 2006. doi:10.1016/j.tcs.2005.11.022.

[27] Richard M. Karp and Michael O. Rabin. Efficient randomized pattern-matching algorithms. IBM Journal of Research and Development, 31(2):249–260, March 1987. doi:10.1147/rd.312.0249.

[28] Tsvi Kopelowitz, Ely Porat, and Yaron Rozen. Succinct online dictionary matching with improved worst-case guarantees. In Proc. of the 27th Annual Symposium on Combinatorial Pattern Matching, volume 54, pages 6:1–6:13, 2016. doi:10.4230/LIPIcs.CPM.2016.6.

[29] Dmitry Kosolobov and Nikita Sivukhin. Compressed multiple pattern matching. In Proc. of the 30th Annual Symposium on Combinatorial Pattern Matching, pages 13:1–13:14, 2019. doi:10.4230/LIPIcs.CPM.2019.13.

[30] Ilan Kremer, Noam Nisan, and Dana Ron. On randomized one-round communication complexity. In Proc. of the 27th Annual ACM Symposium on Theory of Computing, pages 596–605, 1995. doi:10.1007/s000370050018.

[31] Tak-Wah Lam, Wing-Kin Sung, and Swee-Seong Wong. Improved approximate string matching using compressed suffix data structures. Algorithmica, 51(3):298–314, July 2008. doi:10.1007/s00453-007-9104-8.

[32] Gad M. Landau and Uzi Vishkin. Efficient string matching with k mismatches. Theoretical Computer Science, 43:239–249, 1986. doi:10.1016/0304-3975(86)90178-7.

[33] Benny Porat and Ely Porat. Exact and approximate pattern matching in the streaming model. In Proc. of the 50th Annual Symposium on Foundations of Computer Science, pages 315–323, 2009. doi:10.1109/FOCS.2009.11.

[34] Dekel Tsur. Fast index for approximate string matching. Journal of Discrete Algorithms, 8(4):339–345, 2010. doi:10.1016/j.jda.2010.08.002.

[35] Sun Wu and Udi Manber. Agrep - a fast approximate pattern-matching tool. In Proc. of the USENIX Technical Conference, pages 153–162, 1992.
