Wavelet Trees Meet Suffix Trees∗

Maxim Babenko†  Paweł Gawrychowski‡  Tomasz Kociumaka§  Tatiana Starikovskaya¶

∗ Tomasz Kociumaka is supported by Polish budget funds for science in 2013–2017 as a research project under the 'Diamond Grant' program. Tatiana Starikovskaya is partly supported by Dynasty Foundation.
† National Research University Higher School of Economics
‡ Max-Planck-Institut für Informatik
§ Institute of Informatics, University of Warsaw
¶ National Research University Higher School of Economics

Abstract

We present an improved wavelet tree construction algorithm and discuss its applications to a number of rank/select problems for integer keys and strings.

Given a string of length n over an alphabet of size σ ≤ n, our method builds the wavelet tree in O(n log σ / √log n) time, improving upon the state-of-the-art algorithm by a factor of √log n. As a consequence, given an array of n integers we can construct in O(n √log n) time a data structure consisting of O(n) machine words and capable of answering rank/select queries for the subranges of the array in O(log n / log log n) time. This is a log log n-factor improvement in query time compared to Chan and Pătrașcu (SODA 2010) and a √log n-factor improvement in construction time compared to Brodal et al. (Theor. Comput. Sci. 2011).

Next, we switch to the stringological context and propose a novel notion of wavelet suffix trees. For a string w of length n, this data structure occupies O(n) words, takes O(n √log n) time to construct, and simultaneously captures the combinatorial structure of substrings of w while enabling efficient top-down traversal and binary search. In particular, with a wavelet suffix tree we are able to answer in O(log |x|) time the following two natural analogues of rank/select queries for suffixes of substrings:

1) For substrings x and y of w (given by their endpoints), count the number of suffixes of x that are lexicographically smaller than y;
2) For a substring x of w (given by its endpoints) and an integer k, find the k-th lexicographically smallest suffix of x.

We further show that wavelet suffix trees allow us to compute a run-length-encoded Burrows–Wheeler transform of a substring x of w (again, given by its endpoints) in O(s log |x|) time, where s denotes the length of the resulting run-length encoding. This answers a question by Cormode and Muthukrishnan (SODA 2005), who considered an analogous problem for Lempel–Ziv compression.

All our algorithms, except for the construction of wavelet suffix trees, which additionally requires O(n) time in expectation, are deterministic and operate in the word RAM model.

1 Introduction.

Let Σ be a finite ordered non-empty set which we refer to as an alphabet. The elements of Σ are called characters. Characters are treated as integers in a range [0, σ − 1], and we assume that a pair of characters can be compared in O(1) time. A finite ordered sequence of characters (possibly empty) is called a string. Characters in a string are enumerated starting from 1; that is, a string w of length n consists of characters w[1], w[2], ..., w[n].

Wavelet trees. The wavelet tree, invented by Grossi, Gupta, and Vitter [19], is an important data structure with a vast number of applications to stringology, computational geometry, and others (see [27] for an excellent survey). Despite this, the problem of wavelet tree construction has not received much attention in the literature. For a string w of length n, one can derive a construction algorithm with O(n log σ) running time directly from the definition. Apart from this, two recent works [32, 10] present construction algorithms in the setting when limited extra space is allowed. The running time of these algorithms is higher than that of the naive algorithm. The first result of our paper is a novel deterministic algorithm for constructing a wavelet tree in O(n log σ / √log n) time.

Standard wavelet trees form a perfect binary tree with σ leaves, but different shapes have been introduced for several applications: among others this includes wavelet trees for Huffman encoding [17] and wavelet tries [20] that have the shape of a trie for a given collection of strings. Our construction algorithm is capable of building arbitrary-shaped (binary) wavelet trees of O(log σ) height with constant overhead.
Our algorithm for wavelet trees helps to derive some improved bounds for range rank/select queries. Namely, given an array A[1..n] of integers, one could ask either to compute the number of integers in A[i..j] that are smaller than a given A[k] (rank query), or to find, for given i, j, and k, the k-th smallest integer in A[i..j] (select query). These problems have been widely studied; see e.g. Chan and Pătrașcu [8] and Brodal et al. [6]. By slightly tweaking our construction of wavelet trees, we can build in O(n √log n) deterministic time an O(n)-size data structure for answering rank/select queries in O(log n / log log n) time. Our approach yields a √log n-factor improvement to the construction time upon Brodal et al. [6] and a log log n-factor improvement to the query time upon Chan and Pătrașcu [8].

Wavelet suffix trees. Then we switch to the stringological context and extend our approach to the so-called internal string problems (see [25]). This type of problems involves construction of a compact data structure for a given fixed string w capable of answering certain queries for substrings of w. This line of development was originally inspired by suffix trees, which can answer some basic internal string queries (e.g. equality testing and longest common extension computation) in constant time and linear space. Lately a number of studies emerged addressing compressibility [11, 23], range longest common prefixes (range LCP) [1, 28], periodicity [24, 13], minimal/maximal suffixes [4, 3], substring hashing [15, 18], and fragmented pattern matching [2, 18].

Our work focuses on rank/select problems for suffixes of a given substring.
Given a fixed string w of length n, the goal is to preprocess it into a compact data structure for answering the following two types of queries efficiently:

1) Substring suffix rank: For substrings x and y of w (given by their endpoints), count the number of suffixes of x that are lexicographically smaller than y;
2) Substring suffix select: For a substring x of w (given by its endpoints) and an integer k, find the k-th lexicographically smallest suffix of x.

Note that for k = 1 and k = |x| substring suffix select queries reduce to computing the lexicographically minimal and the lexicographically maximal suffixes of a given substring. Study of this problem was started by Duval [14]. He showed that the maximal and the minimal suffixes of all prefixes of a string can be found in linear time and constant additional space. Later this problem was addressed in [4, 3]. In [3] it was shown that the minimal suffix of any substring can be computed in O(τ) time by a linear-space data structure with O(n log n / τ) construction time, for any parameter τ, 1 ≤ τ ≤ log n. As an application of this result, it was shown how to construct the Lyndon decomposition of any substring in O(τs) time, where s stands for the number of distinct factors in the decomposition. For the maximal suffixes, an optimal linear-space data structure with O(1) query and O(n) construction time was presented. We also note that [26] considered a problem with a similar name, namely substring rank and select. However, the goal there is to preprocess a string so that, given any pattern, we can count its occurrences in a specified prefix of the string, and select the k-th occurrence in the whole string. One can easily see that this problem is substantially different than the one we are interested in.

Here, we both generalize the problem to an arbitrary k (thus enabling to answer general substring suffix select queries) and also consider substring suffix rank queries. We devise a linear-space data structure with O(n √log n) expected construction time supporting both types of the queries in O(log |x|) time.

Our approach to substring suffix rank/select problems is based on wavelet trees and attracts a number of additional combinatorial and algorithmic ideas. Combining wavelet trees with suffix trees, we introduce a novel concept of wavelet suffix trees, which forms a foundation of our data structure. Like usual suffix trees, wavelet suffix trees provide a compact encoding for all substrings of a given text; like wavelet trees, they maintain logarithmic height. Our hope is that these properties will make wavelet suffix trees an attractive tool for a large variety of stringological problems.

We conclude with an application of wavelet suffix trees to substring compression, a class of problems introduced by Cormode and Muthukrishnan [11]. Queries of this type ask for a compressed representation of a given substring. The original paper, as well as a more recent work by Keller et al. [23], considered Lempel–Ziv compression schemes. Another family of methods, based on the Burrows–Wheeler transform [7], was left open for further research. We show that wavelet suffix trees allow to compute a run-length-encoded Burrows–Wheeler transform of an arbitrary substring x of w (again, given by its endpoints) in O(s log |x|) time, where s denotes the length of the resulting run-length encoding.
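The two query types can be pinned down by a brute-force reference implementation. The sketch below is ours (the function names are hypothetical) and serves only to fix the semantics; the data structure developed in Section 3 answers both queries in O(log |x|) time after preprocessing, whereas this code takes time polynomial in |x| per query.

```python
# Naive reference implementation of substring suffix rank/select.
# Indices are 1-based with inclusive endpoints, matching the paper.

def substring_suffix_rank(w: str, i: int, j: int, a: int, b: int) -> int:
    """Number of suffixes of x = w[i..j] lexicographically smaller
    than y = w[a..b]."""
    x, y = w[i - 1:j], w[a - 1:b]
    return sum(1 for k in range(len(x)) if x[k:] < y)

def substring_suffix_select(w: str, i: int, j: int, k: int) -> str:
    """k-th lexicographically smallest suffix of x = w[i..j] (k is 1-based)."""
    x = w[i - 1:j]
    return sorted(x[t:] for t in range(len(x)))[k - 1]

w = "ababbabababb"
# x = w[2..7] = "babbab"; its sorted suffixes: ab, abbab, b, bab, babbab, bbab
assert substring_suffix_select(w, 2, 7, 1) == "ab"   # minimal suffix (k = 1)
assert substring_suffix_rank(w, 2, 7, 2, 2) == 2     # suffixes smaller than "b"
```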
2 Wavelet Trees.

Given a string s of length n over an alphabet Σ = [0, σ − 1], where σ ≤ n, we define the wavelet tree of s as follows. First, we create the root node r and construct its bitmask B_r of length n. To build the bitmask, we think of every s[i] as of a binary number consisting of exactly log σ bits (to make the description easier to follow, we assume that σ is a power of 2), and put the most significant bit of s[i] in B_r[i]. Then we partition s into s_0 and s_1 by scanning through s and appending s[i] with the most significant bit removed to either s_1 or s_0, depending on whether the removed bit of s[i] was set or not, respectively. Finally, we recursively define the wavelet trees for s_0 and s_1, which are strings over the alphabet [0, σ/2 − 1], and attach these trees to the root. We stop when the alphabet is unary. The final result is a perfect binary tree on σ leaves with a bitmask attached at every non-leaf node. Assuming that the edges are labelled by 0 or 1 depending on whether they go to the left or to the right, respectively, we can define a label of a node to be the concatenation of the labels of the edges on the path from the root to this node. Then leaf labels are the binary representations of characters in [0, σ − 1]; see Figure 1 for an example.

[Figure 1: Wavelet tree for the string 1100₂ 0111₂ 1011₂ 1111₂ 1001₂ 0110₂ 0100₂ 0000₂ 0001₂ 0010₂ 1010₂ 0011₂ 1101₂ 0101₂ 1000₂ 1110₂; the leaves are labelled with their corresponding characters.]

In virtually all applications, each bitmask B_u is augmented with a rank/select structure. For a bitmask B[1..N] this structure implements operations rank_b(i), which counts the occurrences of a bit b ∈ {0, 1} in B[1..i], and select_b(i), which selects the i-th occurrence of b in the whole B[1..N], for any b ∈ {0, 1}, both in O(1) time.

The bitmasks and their corresponding rank/select structures are stored one after another, each starting at a new machine word. The total space occupied by the bitmasks alone is O(n log σ) bits, because there are log σ levels, and the lengths of the bitmasks for all nodes at one level sum up to n. A rank/select structure built for a bit string B[1..N] takes o(N) bits [9, 21], so the space taken by all of them is o(n log σ) bits. Additionally, we might lose one machine word per node because of the word alignment, which sums up to O(σ).
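The recursive definition translates directly into the naive O(n log σ)-time construction. The sketch below is ours: plain Python lists stand in for packed bitmasks, and a linear scan stands in for the o(N)-bit O(1)-time rank/select structures of [9, 21]. The final assertion checks the root bitmask against the string of Figure 1.

```python
# Direct O(n log sigma)-time wavelet tree construction from the definition.

def build_wavelet_tree(s, sigma_bits):
    """Return {(depth, label): bitmask} for the perfect binary wavelet tree
    of s over the alphabet [0, 2**sigma_bits - 1]."""
    tree = {}

    def rec(seq, depth, label):
        if depth == sigma_bits or not seq:   # unary alphabet: stop at leaves
            return
        shift = sigma_bits - depth - 1       # current most significant bit
        bits = [(c >> shift) & 1 for c in seq]
        tree[(depth, label)] = bits
        rec([c for c, b in zip(seq, bits) if b == 0], depth + 1, 2 * label)
        rec([c for c, b in zip(seq, bits) if b == 1], depth + 1, 2 * label + 1)

    rec(list(s), 0, 0)
    return tree

def rank(bits, b, i):
    """rank_b(i): occurrences of bit b in bits[0..i) (naive stand-in)."""
    return sum(1 for x in bits[:i] if x == b)

# The string of Figure 1, written as 4-bit integers:
s = [0b1100, 0b0111, 0b1011, 0b1111, 0b1001, 0b0110, 0b0100, 0b0000,
     0b0001, 0b0010, 0b1010, 0b0011, 0b1101, 0b0101, 0b1000, 0b1110]
tree = build_wavelet_tree(s, 4)
assert tree[(0, 0)] == [1,0,1,1,1,0,0,0,0,0,1,0,1,0,1,1]  # root bitmask
assert rank(tree[(0, 0)], 1, 4) == 3
```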
For efficient navigation, we number the nodes in a heap-like fashion and, using O(σ) space, store for every node the offset where its bitmask and the corresponding rank/select structure begin. Thus, the total size of a wavelet tree is O(σ + n log σ / log n) words, which for σ ≤ n is O(n log σ / log n).

Directly from the recursive definition, we get a construction algorithm taking O(n log σ) time, but we can hope to achieve a better time complexity, as the size of the output is just O(n log σ / log n) words. In Section 2.1 we show that one can construct all bitmasks augmented with the rank/select structures in total time O(n log σ / √log n).

Arbitrarily-shaped wavelet trees. A generalized view on a wavelet tree is that apart from a string s we are given an arbitrary full binary tree T (i.e., a rooted tree whose nodes have 0 or 2 children) on σ leaves, together with a bijective mapping between the leaves and the characters in Σ. Then, while defining the bitmasks B_v, we do not remove the most significant bit of each character, and instead of partitioning the values based on this bit, we make a decision based on whether the leaf corresponding to the character lies in the left or in the right subtree of v. Our construction algorithm generalizes to such arbitrarily-shaped wavelet trees without increasing the running time, provided that the height of T is O(log σ).

Wavelet trees of larger degree. Binary wavelet trees can be generalized to higher degree trees in a natural way as follows. We think of every s[i] as of a number in base d. The degree-d wavelet tree of s is a tree on σ leaves, where every inner node is of degree d, except for the last level, where the nodes may have smaller degrees. First, we create its root node r and construct its string D_r of length n, setting as D_r[i] the most significant digit of s[i]. We partition s into d strings s_0, s_1, ..., s_{d−1} by scanning through s and appending s[i] with the most significant digit removed to s_a, where a is the removed digit of s[i]. We recursively repeat the construction for every s_a and attach the resulting tree as the a-th child of the root. All strings D_u take O(n log σ) bits in total, and every D_u is augmented with a generalized rank/select structure.

We consider d = log^ε n, for a small constant ε > 0. Remember that we assume σ to be a power of 2, and because of similar technical reasons we assume d to be a power of two as well. Under such assumptions, our construction algorithm for binary wavelet trees can be used to construct a higher degree wavelet tree. More precisely, in Section 2.3 we show how to construct such a higher degree tree in O(n log log n) time, assuming that we are already given the binary wavelet tree, making the total construction time O(n √log n) if σ ≤ n, and allowing us to derive improved bounds on range selection, as explained in Section 2.3.

2.1 Wavelet Tree Construction.

Given a string s of length n over an alphabet Σ, we want to construct its wavelet tree, which requires constructing the bitmasks and their rank/select structures, in O(n log σ / √log n) time.

Input representation. The input string s can be considered as a list of log σ-bit integers. We assume that it is represented in a packed form, as described below. A single machine word of length log n can accommodate (log n)/b b-bit integers, so a list of N such integers can be stored in Nb/log n machine words. (For s, b = log σ, but later we will also consider other values of b.) We call such a representation a packed list. We assume that packed lists are implemented as resizable bitmasks (with each block of b bits representing a single entry) equipped with the size of the packed list. This way a packed list of length N can be appended to another packed list in O(1 + Nb/log n) time, since our model supports bit-shifts in O(1) time. Similarly, splitting into lists of length at most k takes O(Nb/log n + N/k) time.
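A minimal sketch of a packed list, with a single Python arbitrary-precision integer standing in for the sequence of machine words (our own illustration; real word-RAM code would cap words at log n bits, and appending a whole list is one shift-and-or, mirroring the O(1 + Nb/log n) bound):

```python
# A packed list modelled with one Python integer; each entry occupies b bits.

class PackedList:
    def __init__(self, b, data=0, size=0):
        self.b, self.data, self.size = b, data, size

    def append_entry(self, value):
        self.data |= value << (self.b * self.size)
        self.size += 1

    def extend(self, other):                  # append a whole packed list
        self.data |= other.data << (self.b * self.size)
        self.size += other.size

    def entry(self, i):                       # extract the i-th entry
        return (self.data >> (self.b * i)) & ((1 << self.b) - 1)

lst = PackedList(b=3)
for v in (5, 1, 7, 2):
    lst.append_entry(v)
assert [lst.entry(i) for i in range(lst.size)] == [5, 1, 7, 2]
```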
Overview. Let τ be a parameter to be chosen later to minimize the total running time. We call a node u big if its depth is a multiple of τ, and small otherwise. The construction conceptually proceeds in two phases. First, for every big node u we build a list S_u. Remember that the label ℓ_u of a node u at distance α from the root consists of α bits, and the characters corresponding to the leaves below u are exactly the characters whose binary representation starts with ℓ_u. S_u is defined to be a subsequence of s consisting of these characters. Secondly, we construct the bitmasks B_v for every node v using the S_u of the nearest big ancestor u of v (possibly v itself).

First phase. Assume that for a big node u at distance ατ from the root we are given the list S_u. We want to construct the lists S_v for all big nodes v whose deepest (proper) big ancestor is u. There are exactly 2^τ such nodes v; call them v_0, ..., v_{2^τ − 1}. To construct their lists S_{v_t}, we scan through S_u and append S_u[j] to the list of v_t, where t is the bit string of length τ occurring in the binary representation of S_u[j] at position ατ + 1. We can extract t in constant time, so the total complexity is linear in the total number of elements on all lists, i.e., O(n log σ / τ) if τ ≤ log σ. Otherwise the root is the only big node, and the first phase is void.

Second phase. Assume that we are given the list S_u for a big node u at distance ατ from the root. We would like to construct the bitmask B_v for every node v such that u is the nearest big ancestor of v. First, we observe that to construct all these bitmasks we only need to know τ bits of every S_u[j], starting from the (ατ + 1)-th one. Therefore, we will operate on short lists L_v consisting of τ-bit integers instead of the lists S_v of whole log σ-bit integers. Each short list is stored as a packed list.

We start with extracting the appropriate part of each S_u[j] and appending it to L_u. The running time of this step is proportional to the length of L_u. (This step is void if τ > log σ; then L_v = S_v.) Next, we process all nodes v such that u is the deepest big ancestor of v; for every such v we want to construct the short list L_v. Assume that we already have L_v for a node v at distance ατ + β from the root, where β ∈ [0, τ), and we want to construct the short lists of its children and the bitmask B_v. This can be (inefficiently) done by scanning L_v and appending the next element to the short list of the right or the left child of v, depending on whether its (β + 1)-th bit is set or not, respectively. The bitmask B_v simply stores all these (β + 1)-th most significant bits. In order to compute these three lists efficiently, we apply the following claim.

Claim 2.1. Assuming Õ(√n) space and preprocessing shared by all instances of the structure, the following operation can be performed in O(Nb/log n) time: given a packed list L of N b-bit integers, where b = o(log n), and a position t ∈ [0, b − 1], compute packed lists L_0 and L_1 consisting of the elements of L whose t-th most significant bit is 0 or 1, respectively, and a bitmask B being a concatenation of the t-th most significant bits of the elements of L.

Proof. As a preprocessing, we precompute all the answers for lists L of length at most (1/2)(log n)/b. This takes Õ(√n) time. For a query, we split L into lists of length (1/2)(log n)/b, apply the preprocessed mapping, and merge the results, separately for L_0, L_1 and B. This takes O(Nb/log n) time in total.
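The preprocessing behind Claim 2.1 is a four-Russians table: every possible short packed list is split in advance, and a query then only performs table lookups. A toy instantiation of ours with deliberately tiny parameters (the text uses chunks of (1/2)(log n)/b entries):

```python
# Toy version of Claim 2.1: precompute, for every packed list of CHUNK
# entries of B_BITS bits and every bit position t, the lists L0, L1 and
# the bitmask B. A query handles CHUNK entries per table lookup.

from itertools import product

B_BITS, CHUNK = 2, 3

def split_chunk(entries, t):
    """Split by the t-th most significant bit (t = 0 is the MSB)."""
    shift = B_BITS - 1 - t
    l0 = [e for e in entries if not (e >> shift) & 1]
    l1 = [e for e in entries if (e >> shift) & 1]
    bits = [(e >> shift) & 1 for e in entries]
    return l0, l1, bits

TABLE = {
    (entries, t): split_chunk(entries, t)
    for entries in product(range(1 << B_BITS), repeat=CHUNK)
    for t in range(B_BITS)
}

def split_packed(entries, t):
    """Split a long list chunk-by-chunk using only table lookups."""
    l0, l1, bits = [], [], []
    for i in range(0, len(entries), CHUNK):
        chunk = tuple(entries[i:i + CHUNK])
        if len(chunk) < CHUNK:              # ragged tail handled directly
            a, b, c = split_chunk(chunk, t)
        else:
            a, b, c = TABLE[(chunk, t)]
        l0 += a; l1 += b; bits += c
    return l0, l1, bits

assert split_packed([2, 1, 3, 0, 2], 0) == ([1, 0], [2, 3, 2], [1, 0, 1, 0, 1])
```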
Consequently, we spend O(|L_v| τ / log n) time per node v. The total number of the elements of all short lists is O(n log σ), but we also need to take into account the fact that the lengths of some short lists might not be divisible by (log n)/τ, which adds O(1) time per node of the wavelet tree, making the total complexity O(σ + n log σ τ / log n) = O(n log σ τ / log n).

Intermixing the phases. In order to make sure that the space usage of the construction algorithm does not exceed the size of the final structure, the phases are intermixed in the following way. Assuming that we are given the lists S_u of all big nodes u at distance ατ from the root, we construct the lists of all big nodes at distance (α + 1)τ from the root, if any. Then we construct the bitmasks B_u such that the nearest big ancestor of u is at distance ατ from the root, and increase α. To construct the bitmasks, we compute the short lists for all nodes at distances ατ, ..., (α + 1)τ − 1 from the root and keep the short lists only for the nodes at the current distance. Then the peak space of the construction process is just O(n log σ / log n) words.

The total construction time is O(n log σ / τ) for the first phase and O(n log σ τ / log n) for the second phase. The bitmasks, lists and short lists constructed by the algorithm are illustrated in Figure 2. Choosing τ = √log n so as to minimize the total time, we get the following theorem.

Theorem 2.1. Given a string s of length n over an alphabet Σ, we can construct all bitmasks B_u of its wavelet tree in O(n log σ / √log n) time.

[Figure 2: Elements of the wavelet tree construction algorithm for the string from Figure 1. The first figure shows the lists S_u of all big nodes u when τ = 2, and the third shows the short lists L_u of all nodes u, as defined in the proof of Theorem 2.1.]

Additionally, we want to build a rank/select structure for every B_u. While it is well known that given a bit string of length N one can construct an additional structure of size o(N) allowing executing both rank and select in O(1) time [9, 21], we must verify that the construction time is not too high. The following lemma is proved in Appendix A.

Lemma 2.1. Given a bit string B[1..N] packed in N/log n machine words, we can extend it in O(N/log n) time with a rank/select structure occupying additional o(N/log n) space, assuming an Õ(√n) time and space preprocessing shared by all instances of the structure.

2.2 Arbitrarily-Shaped Wavelet Trees.

To generalize the algorithm to arbitrarily-shaped wavelet trees of height O(log σ), instead of operating on the characters we work with the labels of the root-to-leaf paths, appended with 0s so that they all have the same length. The heap-like numbering of the nodes is not enough in such a setting, as we cannot guarantee that all leaves have the same depth, so the first step is to show how to efficiently return a node given the length of its label and the label itself stored in a single machine word. If log σ = o(log n), we can afford storing nodes in a simple array; otherwise we use the deterministic dictionary of Ružić [31], which can be constructed in O(σ (log log σ)²) = O(n (log log n)²) time. In either case, we can return in O(1) time the corresponding pointer, which might be null if the node does not exist. Then the construction algorithm works as described previously: first we construct the list S_u of every big node u, and then we construct every B_v using the S_u of the nearest big ancestor of v. The only difference is that when splitting the list S_u into the lists S_v of all big nodes v such that u is the first proper big ancestor of v, it might happen that the retrieved pointer to v is null. In such a case we simply continue without appending anything to the non-existing S_v. The running time stays the same.

Theorem 2.2. Let s be a string of length n over Σ and T be a full binary tree of height O(log σ) with σ leaves, each assigned a distinct character in Σ. Then the T-shaped wavelet tree of s can be constructed in O(n log σ / √log n) time.
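As a sketch of this navigation scheme, a hash table keyed by (label length, label) — a plain Python dict standing in for the deterministic dictionary of Ružić [31] — returns a node pointer, or None exactly when the node does not exist:

```python
# Node lookup for an arbitrarily-shaped tree, keyed by (depth, label).

def index_nodes(root):
    """Map (depth, label) -> node for a binary tree given as nested dicts
    {'left': ..., 'right': ...} (leaves have neither key)."""
    table = {}
    stack = [(root, 0, 0)]
    while stack:
        node, depth, label = stack.pop()
        table[(depth, label)] = node
        if 'left' in node:
            stack.append((node['left'], depth + 1, label << 1))
        if 'right' in node:
            stack.append((node['right'], depth + 1, (label << 1) | 1))
    return table

# A lopsided full binary tree: ((a, b), c) -- leaves at different depths.
tree = {'left': {'left': {}, 'right': {}}, 'right': {}}
table = index_nodes(tree)
assert (2, 0b01) in table             # leaf b: label 01, length 2
assert table.get((2, 0b10)) is None   # no node with label 10: pointer is null
```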
2.3 Wavelet Trees of Larger Degree.

We move on to constructing a wavelet tree of degree d = log^ε n, where d is a power of two. Such a higher degree tree can also be defined through the binary wavelet tree for s as follows. We remove all inner nodes whose depth is not a multiple of log d. For each surviving node we set its nearest preserved ancestor as a parent. Then each inner node has d children (the lowest-level nodes may have fewer children), and we order them consistently with the left-to-right order in the original tree.

Recall that for each node u of the binary wavelet tree we define the string S_u as a subsequence of s consisting of its characters whose binary representation starts with the label ℓ_u of u, and then we create the bitmask B_u storing, for each character of S_u, the bit following its label ℓ_u. Instead of B_u, a node u of the wavelet tree of degree d now stores a string D_u, which contains the next log d bits following ℓ_u.
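The strings D_u can be read off this definition directly, as in the following unpacked sketch of ours; Lemma 2.2 below instead derives the same strings from the binary bitmasks in O(n log log n) time using word-packed buffers.

```python
# Unpacked computation of the strings D_u of the degree-d wavelet tree:
# S_u consists of the characters whose binary representation starts with
# the label of u, and D_u stores the next log d bits of each of them.

def degree_d_strings(s, sigma_bits, log_d):
    """Return {(depth, label): D_u}; depth counts log_d-bit digits."""
    strings = {}
    levels = sigma_bits // log_d          # sigma and d are powers of two
    for depth in range(levels):
        used = depth * log_d              # bits fixed by the label of u
        for c in s:
            label = c >> (sigma_bits - used) if used else 0
            digit = (c >> (sigma_bits - used - log_d)) & ((1 << log_d) - 1)
            strings.setdefault((depth, label), []).append(digit)
    return strings

s = [0b1100, 0b0111, 0b1011, 0b1111]      # 4-bit characters, d = 4
D = degree_d_strings(s, sigma_bits=4, log_d=2)
assert D[(0, 0)] == [0b11, 0b01, 0b10, 0b11]   # root: top digits
assert D[(1, 0b11)] == [0b00, 0b11]            # node '11': digits of 1100, 1111
```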

The following lemma allows to use binary wavelet tree construction as a black box, and consequently gives an O(n √log n)-time construction algorithm for wavelet trees of degree d.

Lemma 2.2. Given all bitmasks B_u, we can construct all strings D_u in O(n log log n) time.

Proof. Consider a node u at depth k of the wavelet tree of degree d. Its children correspond to all descendants of u in the original wavelet tree at depth (k + 1) log d. For each descendant v of u at depth (k + 1) log d − δ, where δ ∈ [0, log d], we construct a temporary string D′_v over the alphabet [0, 2^δ − 1]. Every character of this temporary string corresponds to a leaf in the subtree of v. The characters are arranged in order of the identifiers of the corresponding leaves and describe prefixes of length δ of the paths from u to the leaves. Clearly D′_u = D_u. Furthermore, if the children of v are v_1 and v_2, then D′_v can be easily defined by looking at D′_{v_1}, D′_{v_2}, and B_v as follows. We prepend 0 to all characters in D′_{v_1}, and 1 to all characters in D′_{v_2}. Then we construct D′_v by appropriately interleaving D′_{v_1} and D′_{v_2} according to the order defined by B_v. We consider the bits of B_v one by one: if the i-th bit is 0, we append the next character of D′_{v_1} to the current D′_v, and otherwise we append the next character of D′_{v_2}. Now we only have to show how to implement this process efficiently.

We pack every (1/4)(log n)/(log d) consecutive characters of D′_v into a single machine word. To simplify the implementation, we reserve log d bits for every character irrespectively of the value of δ. This makes prepending 0s or 1s to all characters in any D′_v trivial, because the result can be preprocessed in Õ(d^{(1/4)(log n)/(log d)}) = o(n) time and space. Interleaving D′_{v_1} and D′_{v_2} is more complicated. Instead of accessing D′_{v_1} and D′_{v_2} directly, we keep two buffers, each containing at most the next (1/4)(log n)/(log d) characters from the corresponding string. Similarly, instead of accessing B_v directly, we keep a buffer of at most the next (1/4) log n bits there. Using the buffers, we can keep processing bits from B_v as long as there are enough characters in the input buffers. The input buffers for D′_{v_1} and D′_{v_2} become empty after generating (1/4)(log n)/(log d) characters, and the input buffer for B_v becomes empty after generating (1/4) log n characters. Hence the total number of times we need to reload one of the input buffers is O(|D′_v| / ((log n)/(log d))).

We preprocess all possible scenarios between two reloads by simulating, for every possible initial content of the input buffers, processing the bits until one of the buffers becomes empty. We store the generated data (which is at most (1/2) log n bits) and the final content of the input buffers. The whole preprocessing takes Õ(2^{(3/4) log n}) = o(n) time and space, and then the number of operations required to generate packed D′_v is proportional to the number of times we need to reload the buffers, so by summing over all v the total complexity is O(n log log n).

Then we extend every D_u with a generalized rank/select data structure. Such a structure for a string D[1..N] implements operations rank_c(i), which counts positions k ∈ [1, i] such that D[k] ≤ c, and select_c(i), which selects the i-th occurrence of c in the whole D[1..N], for any c ∈ Σ, both in O(1) time. Again, it is well known that such a structure can be implemented using just o(n log σ) additional bits if σ = O(polylog(n)) [16], but its construction time is not explicitly stated in the literature. Therefore, we prove the following lemma in Appendix A.

Lemma 2.3. Let d ≤ log^ε n for ε < 1/3. Given a string D[1..N] over the alphabet [0, d − 1] packed in N log d / log n machine words, we can extend it in O(N log d / log n) time with a generalized rank/select data structure occupying additional o(N log d / log n) space, assuming Õ(√n) time and space preprocessing shared by all instances of the structure.

2.4 Range Selection.

A classic application of wavelet trees is that, given an array A[1..n] of integers, we can construct a structure of size O(n) which allows answering any range rank/select query in O(log n) time. A range select query is to return the k-th smallest element in A[i..j], while a range rank query is to count how many of A[i..j] are smaller than a given x = A[k]. Given the wavelet tree for A, any range rank/select query can be answered by traversing a root-to-leaf path of the tree, using the rank/select data structures for the bitmasks B_u at subsequent nodes.

With the O(n √log n) construction algorithm this matches the bounds of Chan and Pătrașcu [8] for range select queries, but is slower by a factor of log log n than their solution for range rank queries.
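For concreteness, here is the root-to-leaf walk for a range rank query on the binary wavelet tree (our sketch: node contents are recomputed naively where the real structure answers each step with an O(1) rank query on a stored bitmask; the degree-d tree of the next section shortens the walk to O(log n / log log n) steps):

```python
# Range rank via the wavelet tree walk: count positions p in [i, j)
# with A[p] < x (0-based, half-open interval for simplicity).

def range_rank(A, sigma_bits, i, j, x):
    seq, answer = list(A), 0
    for depth in range(sigma_bits):
        shift = sigma_bits - depth - 1
        bits = [(c >> shift) & 1 for c in seq]       # bitmask of this node
        zeros_i = sum(1 for b in bits[:i] if b == 0)
        zeros_j = sum(1 for b in bits[:j] if b == 0)
        if (x >> shift) & 1:
            answer += zeros_j - zeros_i      # everything routed left is < x
            seq = [c for c in seq if (c >> shift) & 1]
            i, j = i - zeros_i, j - zeros_j  # positions among the 1s
        else:
            seq = [c for c in seq if not (c >> shift) & 1]
            i, j = zeros_i, zeros_j          # positions among the 0s
    return answer

A = [5, 2, 7, 1, 6, 3, 0, 4]
assert range_rank(A, 3, 2, 7, 4) == 3        # A[2..6] = 7,1,6,3,0: three are < 4
```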
We will show that one can in fact answer any range rank/select query in O(log n / log log n) time with an O(n)-size structure which can be constructed in O(n √log n) time. For range rank queries this is not a new result, but we feel that our proof gives more insight into the structure of the problem. For range select queries, Brodal et al. [6] showed that one can answer a query in O(log n / log log n) time with an O(n)-size structure, but with O(n log n) construction time. Chan and Pătrașcu asked if the methods of [7] can be combined with the efficient construction. We answer this question affirmatively.

A range rank query is easily implemented in O(log n / log log n) time using the wavelet tree of degree log^ε n described in the previous section. To compute the rank of x = A[k] in A[i..j], we traverse the path from the root to the leaf corresponding to A[k]. At every node we use the generalized rank structure to update the answer and the current interval [i..j] before we descend.

Implementing the range select queries in O(log n / log log n) time is more complicated. Similarly to the rank queries, we start the traversal at the root and descend along a root-to-leaf path. At each node we select the child we will descend to, and update the current interval [i..j] using the generalized rank data structure. To make this query algorithm fast, we preprocess each string D_u and store the extracted information in matrix form. As shown by Brodal et al. [6], constant-time access to such information is enough to implement range select queries in O(log n / log log n) time.

The matrices for a string D_u are defined as follows. We partition D_u into superblocks of length d log² n.¹ For each superblock we store the cumulative generalized rank of every character, i.e., for every character c we store the number of positions where characters c′ ≤ c occur in the prefix of the string up to the beginning of the superblock. We think of this as a d × log n matrix M. The matrix is stored in two different ways. In the first copy, every row is stored as a single word. In the second copy, we divide the matrix into sections of (log n)/d columns, and store every section in a single word. We make sure there is an overlap of four columns between the sections, meaning that the first section contains columns 1, ..., (log n)/d, the second section contains columns (log n)/d − 3, ..., 2(log n)/d − 4, and so on.

Each superblock is then partitioned into blocks of length (log n)/(log d) each. For every block, we store the cumulative generalized rank within the superblock of every character, i.e., for every character c we store the number of positions where characters c′ ≤ c occur in the prefix of the superblock up to the beginning of the block. We represent this information in a small matrix M′, which can be packed in a single word, as it requires only O(d log(d log² n)) bits.

¹ In the original paper, superblocks are of length d log n, but this does not change the query algorithm.
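A small sketch of the cumulative generalized ranks underlying the matrices M and M′ (ours: word packing and the two storage formats of M are omitted, and the toy block sizes stand in for (log n)/(log d) and d log² n):

```python
# Cumulative generalized ranks at superblock and block boundaries:
# M[s][c]  = number of positions with value <= c before superblock s;
# Mp[b][c] = the same count within the current superblock, before block b.

def rank_matrices(D, d, block, superblock):
    M, Mp, running, within = [], [], [0] * d, [0] * d
    for i, c in enumerate(D):
        if i % superblock == 0:
            M.append(list(running)); within = [0] * d
        if i % block == 0:
            Mp.append(list(within))
        for x in range(c, d):
            running[x] += 1; within[x] += 1
    return M, Mp

M, Mp = rank_matrices([0, 2, 1, 3, 0, 1, 2, 2], d=4, block=2, superblock=4)
assert M[1] == [1, 2, 3, 4]      # counts of values <= c within D[0..3]
assert Mp[1] == [1, 1, 2, 2]     # within-superblock counts before block 1
```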
Lemma 2.4. Given a string D[1..N] over the alphabet [0, d − 1] packed in N log d / log n machine words, we can extend it in O(N log d / log n) time and space with the following information:
1) the first copy of the matrix M for each superblock (multiple of d log² n);
2) the second copy of the matrix M for each superblock (multiple of d log² n);
3) the small matrix M′ for each block (multiple of (log n)/(log d));
occupying additional o(N log d / log n) space, assuming an Õ(√n) time and space preprocessing shared by all instances of the structure and d = log^ε n.

Proof. To compute the small matrices M′, i.e., the cumulative generalized ranks for blocks, and the first copies of the matrices M, i.e., the cumulative generalized ranks for superblocks, we notice that the standard solution for generalized rank queries in O(1) time is to split the string into superblocks and blocks. Then, for every superblock we store the cumulative generalized rank of every character, and for every block we store the cumulative generalized rank within the superblock for every character. As explained in the proof of Lemma 2.3 presented in the appendix, such data can be computed in O(N log d / log n) time if the sizes of the blocks and the superblocks are chosen to be (1/3)(log n)/(log d) and d log² n, respectively. Therefore, we obtain in the same complexity every first copy of the matrix M, and a small matrix every (1/3)(log n)/(log d) characters. We can simply extract and save every third such small matrix, also in the same complexity.

The second copies of the matrices are constructed from the first copies in O(d²) time each; we simply partition each row into (overlapping) sections and append each part to the appropriate machine words. This takes O(Nd² / (d log² n)) = O(N / log n) in total.

As follows from the lemma, all strings D_u at one level of the tree can be extended in O(n log log n / log n) time, which, together with constructing the tree itself, sums up to O(n √log n) time in total.

3 Wavelet Suffix Trees.

In this section we generalize wavelet trees to obtain wavelet suffix trees. With logarithmic height and a shape resembling the shape of the suffix tree, wavelet suffix trees, augmented with additional stringological data structures, become a very powerful tool. In particular, they allow to answer the following queries efficiently: (1) find the k-th lexicographically minimal suffix of a substring of the given string (substring suffix selection), (2) find the rank of one substring among the suffixes of another substring (substring suffix rank), and (3) compute the run-length encoding of the Burrows–Wheeler transform of a substring.

Organization of Section 3. In Section 3.1 we introduce several, mostly standard, stringological notions and recall some already known algorithmic results. Section 3.2 provides a high-level description of the wavelet suffix trees. It forms an interface between the query algorithms (Section 3.5) and the more technical content: the full description of the data structure (Section 3.3) and its construction algorithm (Section 3.4). Consequently, Sections 3.3 and 3.5 can be read separately. The latter additionally contains cell-probe lower bounds for some of the queries (suffix rank & selection), as well as a description of a generic transformation of the data structure, which allows to replace a dependence on n with a dependence on |x| in the running times of the query algorithms.

3.1 Preliminaries.

Let w be a string of length |w| = n over the alphabet Σ = [0, σ − 1].
For 1 ≤ i ≤ j ≤ n, w[i..j] denotes the substring of w from position i to position j (inclusive). For i = 1 or j = |w| we use the shorthands w[..j] and w[i..]. If x = w[i..j], we say that x occurs in w at position i. Each substring w[..j] is called a prefix of w, and each substring w[i..] is called a suffix of w. A substring which occurs both as a prefix and as a suffix of w is called a border of w. The length of the longest common prefix of two strings x, y is denoted by lcp(x, y).

We extend Σ with a sentinel symbol $, which we assume to be smaller than any other character. The order on Σ can be generalized in a standard way to the lexicographic order of the strings over Σ: a string x is lexicographically smaller than y (denoted x ≺ y) if either x is a proper prefix of y, or there exists a position i, 0 ≤ i < min{|x|, |y|}, such that x[1..i] = y[1..i] and x[i + 1] ≺ y[i + 1]. The following lemma provides one of the standard tools in stringology.

Lemma 3.1. (LCP Queries [12]) A string w of length n can be preprocessed in O(n) time so that the following queries can be answered in O(1) time: given substrings x and y of w, compute lcp(x, y) and decide whether x ≺ y, x = y, or x ≻ y.

We say that a sequence p_0 < p_1 < ... < p_k of positions in a string w is a periodic progression if w[p_0..p_1 − 1] = ... = w[p_{k−1}..p_k − 1]. Periodic progressions p, p′ are called non-overlapping if the maximum term in p is smaller than the minimum term in p′ or, vice versa, the maximum term in p′ is smaller than the minimum term in p. Note that any periodic progression is an arithmetic progression, and consequently it can be represented by three integers: p_0, p_1 − p_0, and k. Periodic progressions appear in our work because of the following result:

Theorem 3.1. ([25]) Using a data structure of size O(n) with O(n)-time randomized (Las Vegas) construction, the following queries can be answered in constant time: given two substrings x and y such that |x| = O(|y|), report the positions of all occurrences of y in x, represented as at most (|x| + 1)/(|y| + 1) non-overlapping periodic progressions.

3.2 Overview of wavelet suffix trees.

For a string w of length n, a wavelet suffix tree of w is a full binary tree of logarithmic height. Each of its n leaves corresponds to a non-empty suffix of w$. The lexicographic order of suffixes is preserved as the left-to-right order of leaves. Each node u of the wavelet suffix tree stores two bitmasks. Bits of the first bitmask correspond to the suffixes below u sorted by their starting positions, and bits of the second bitmask correspond to these suffixes sorted by pairs (preceding character, starting position). The i-th bit of either bitmask is set to 0 if the i-th suffix belongs to the left subtree of u, and to 1 otherwise. Like in the standard wavelet trees, on top of the bitmasks we maintain a rank/select data structure. See Figure 3 for a sample wavelet suffix tree with both bitmasks listed down in the nodes.

Each edge e of the wavelet suffix tree is associated with a sorted list L(e) containing substrings of w. The wavelet suffix tree enjoys an important lexicographic property. Imagine we traverse the tree depth-first, and when going down an edge e we write out the contents of L(e), whereas when visiting a leaf we output the corresponding suffix of w$. Then we obtain the lexicographically sorted list of all substrings of w$ (without repetitions).² This, in particular, implies that the substrings in L(e) are consecutive prefixes of the longest substring in L(e), and that for each substring y of w there is exactly one edge e such that y ∈ L(e).

In the query algorithms, we actually work with L_x(e), containing the suffixes of x among the elements of L(e). For each edge e, the starting positions of these suffixes form O(1) non-overlapping periodic progressions, and consequently the list L_x(e) admits a constant-space representation. Nevertheless, we do not store the lists explicitly, but instead generate some of them on the fly.

² A similar property holds for suffix trees if we define L(e) so that it contains the labels of all implicit nodes on e and the label of the lower explicit endpoint of e.
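To illustrate the output format of Theorem 3.1, the following naive sketch of ours finds the occurrences of y in x by scanning and groups runs with a common overlapping gap into progressions (p_0, p_1 − p_0, k); the structure of [25] produces such groups in O(1) time without any scanning.

```python
# Occurrences of y in x, grouped into periodic progressions represented
# by three integers (first position, common difference, number of terms).

def occurrences_as_progressions(x, y):
    occ = [p for p in range(len(x) - len(y) + 1) if x[p:p + len(y)] == y]
    groups = []
    for p in occ:
        if groups:
            p0, diff, k = groups[-1]
            gap = p - (p0 + (k - 1) * diff)
            # equal gaps of at most |y| keep the run periodic, since then
            # w[p_i..p_{i+1}-1] is the same prefix of y for every i
            if gap <= len(y) and (k == 1 or gap == diff):
                groups[-1] = (p0, gap, k + 1)
                continue
        groups.append((p, 0, 1))
    return groups

# In x = "abababab", y = "abab" occurs at 0, 2, 4: a single progression.
assert occurrences_as_progressions("abababab", "abab") == [(0, 2, 3)]
# Non-overlapping occurrences stay separate progressions.
assert occurrences_as_progressions("aabaa", "aa") == [(0, 0, 1), (3, 0, 1)]
```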

[Figure 3: A wavelet suffix tree of w = ababbabababb. Leaves corresponding to w[i..]$ are labelled with i. Elements of L(e) are listed next to e, with ... denoting further substrings up to the suffix of w. Suffixes of x = bababa are marked red, of x = abababb: blue. Note that the prefixes of w[i..] do not need to lie above the leaf i (see w[1..5] = ababb), and the substrings above the leaf i do not need to be prefixes of w[i..] (see w[10..] and aba).]

This is one of the auxiliary operations, each of which is supported by the wavelet suffix tree in constant time:

(1) For a substring x and an edge e, output the list L_x(e), represented as O(1) non-overlapping periodic progressions;
(2) Count the number of suffixes of x = w[i..j] in the left/right subtree of a node (given along with the segment of its first bitmask corresponding to suffixes that start inside [i, j]);
(3) Count the number of suffixes of x = w[i..j] that are preceded by a character c and lie in the left/right subtree of a node (given along with the segment of its second bitmask corresponding to suffixes that start inside [i, j] and are preceded by c);
(4) For a substring x and an edge e, compute the run-length encoding of the sequence of characters preceding the suffixes in L_x(e).

3.3 Full description of wavelet suffix trees.

We start the description with Section 3.3.1, where we introduce string intervals, a notion central to the definition of wavelet suffix trees. We also present there corollaries of Lemma 3.1 which let us efficiently deal with string intervals. Then, in Section 3.3.2, we give a precise definition of wavelet suffix trees and prove several of its combinatorial consequences. We conclude with Section 3.3.3, where we provide the implementations of the auxiliary operations defined in Section 3.2.

3.3.1 String intervals.

To define wavelet suffix trees, we often need to compare substrings of w trimmed to a certain number of characters. If instead of x and y we compare their counterparts trimmed to ℓ characters, i.e., x[1..min{ℓ, |x|}] and y[1..min{ℓ, |y|}], we use ℓ in the subscript of the operator, e.g., x =_ℓ y or x ≺_ℓ y. For a pair of strings s, t and a positive integer ℓ, we define a string interval [s, t]_ℓ = {z ∈ Σ̄* : s ⪯_ℓ z ⪯_ℓ t} and (s, t)_ℓ = {z ∈ Σ̄* : s ≺_ℓ z ≺_ℓ t}. Intervals [s, t)_ℓ and (s, t]_ℓ are defined analogously. The strings s, t are called the endpoints of these intervals.

In the remainder of this section, we show that the data structure of Lemma 3.1 can answer queries related to string intervals and periodic progressions, which arise in Section 3.3.3. We start with a simple auxiliary result; here y^∞ denotes a (one-sided) infinite string obtained by concatenating an infinite number of copies of y.

Lemma 3.2. The data structure of Lemma 3.1 supports the following queries in O(1) time:
(1) Given substrings x, y of w and an integer ℓ, determine if x ≺_ℓ y, x =_ℓ y, or x ≻_ℓ y.
(2) Given substrings x, y of w, compute lcp(x, y^∞) and determine whether x ≺ y^∞ or x ≻ y^∞.

Proof. (1) By Lemma 3.1, we may assume to know lcp(x, y). If lcp(x, y) ≥ ℓ, then x =_ℓ y. Otherwise, trimming x and y to ℓ characters does not influence the order between these two substrings.

(2) If lcp(x, y) < |y|, i.e., y is not a prefix of x, then lcp(x, y^∞) = lcp(x, y), and the order between x and y^∞ is the same as between x and y. Otherwise, define x′ so that x = yx′. Then lcp(x, y^∞) = |y| + lcp(x′, x), and the order between x and y^∞ is the same as between x′ and x. Consequently, the query can be answered in constant time in both cases.
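Both parts of Lemma 3.2 are easy to mimic with a naive lcp in place of the O(1)-time oracle of Lemma 3.1 (a minimal sketch of ours):

```python
# Lemma 3.2 in miniature: trimmed comparison and lcp with y^infinity.

def lcp(x, y):                     # naive stand-in for the Lemma 3.1 oracle
    k = 0
    while k < min(len(x), len(y)) and x[k] == y[k]:
        k += 1
    return k

def cmp_trimmed(x, y, l):
    """-1, 0, +1 according to the order of x and y trimmed to l characters."""
    k = lcp(x, y)
    if k < l and k < min(len(x), len(y)):
        return -1 if x[k] < y[k] else 1      # mismatch inside the window
    a, b = min(len(x), l), min(len(y), l)    # one trimmed string is a prefix
    return (a > b) - (a < b)

def lcp_power(x, y):
    """lcp(x, y^inf): if y is a prefix of x, recurse on x = y + x'."""
    k = lcp(x, y)
    if k < len(y):
        return k
    return len(y) + lcp(x[len(y):], x)   # lcp(x, y^inf) = |y| + lcp(x', x)

assert lcp_power("ababa", "ab") == 5     # "ababa" is a prefix of (ab)^inf
assert lcp_power("abb", "ab") == 2
assert cmp_trimmed("abcd", "abce", 3) == 0   # equal when trimmed to 3
```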
Lemma 3.3. The data structure of Lemma 3.1 supports the following queries in O(1) time: given a periodic progression p_0 < ... < p_k in w, a position j ≥ p_k, and a string interval I whose endpoints are substrings of w, report, as a single periodic progression, all positions p_i such that w[p_i..j] ∈ I.

Proof. If k = 0, it suffices to apply Lemma 3.2(1). Thus, we assume k ≥ 1 in the remainder of the proof. Let s and t be the endpoints of I, let ℓ be its trimming parameter (so that, e.g., I = [s, t]_ℓ), and let ρ = w[p_0..p_1 − 1] and x_i = w[p_i..j]. Using Lemma 3.2(2) we can compute r_0 = lcp(x_0, ρ^∞) and r′ = lcp(s, ρ^∞). Note that r_i := lcp(x_i, ρ^∞) = r_0 − i|ρ|; in particular r_0 ≥ k|ρ|.

If r′ ≥ ℓ, we distinguish two cases:
1) r_i ≥ ℓ. Then lcp(x_i, s) ≥ ℓ; thus x_i =_ℓ s.
2) r_i < ℓ. Then lcp(x_i, s) = r_i; thus x_i ≺_ℓ s if x_0 ≺ ρ^∞, and x_i ≻_ℓ s otherwise.

On the other hand, if r′ < ℓ, we distinguish three cases:
1) r_i > r′. Then lcp(x_i, s) = r′; thus x_i ≺_ℓ s if ρ^∞ ≺ s, and x_i ≻_ℓ s otherwise.
2) r_i = r′. Then we use Lemma 3.2(2) to determine the order between x_i and s trimmed to ℓ characters. This, however, may happen only for a single value of i.
3) r_i < r′. Then lcp(x_i, s) = r_i; thus x_i ≺_ℓ s if x_0 ≺ ρ^∞, and x_i ≻_ℓ s otherwise.

Consequently, in constant time we can partition the indices i into at most three ranges, and for each range determine whether x_i ≺_ℓ s, x_i =_ℓ s, or x_i ≻_ℓ s for all indices i in the range. We ignore from further computations those ranges for which we already know that x_i ∉ I, and for the remaining ones we repeat the procedure above with t instead of s. We end up with O(1) ranges of positions i for which x_i ∈ I. However, note that as the string sequence (x_i)_{i=0}^{k} is always monotone (decreasing if x_0 ⪰ ρ^∞, increasing otherwise), these ranges (if any) can be merged into a single range, so in the output we end up with a single (possibly empty) periodic progression.

3.3.2 Definition of wavelet suffix trees.

Let w be a string of length n. To define the wavelet suffix tree of w, we start from an auxiliary tree T of height O(log n) with O(n log n) nodes. Its leaves represent the non-empty suffixes of w$, and the left-to-right order of leaves corresponds to the lexicographic order on the suffixes. Internal nodes of T represent all substrings of w whose length is a power of two, with the exception of the root, which represents the empty word. Edges in T are defined so that a node representing v is an ancestor of a node representing v′ if and only if v is a prefix of v′. To each non-root node u we assign a level ℓ(u) := 2|v|, where v is the substring that u represents. For the root r we set ℓ(r) := 1. See Figure 4 for a sample tree T with levels assigned to nodes.

[Figure 4: An auxiliary tree T introduced to define the wavelet suffix tree of w = ababbabababb. Levels are written inside nodes. Gray nodes are dissolved during the construction of the wavelet suffix tree.]

For a node u, we define S(u) to be the set of suffixes of w$ that are represented by descendants of u. Note that S(u) is a singleton if u is a leaf. The following observation characterizes the levels and the sets S(u).

Observation 3.1. For any node u other than the root:
(1) ℓ(parent(u)) ≤ ℓ(u),
(2) if y ∈ S(u) and y′ is a suffix of w$ such that lcp(y, y′) ≥ ℓ(parent(u)), then y′ ∈ S(u),
(3) if y, y′ ∈ S(u), then lcp(y, y′) ≥ ⌊ℓ(u)/2⌋.

Next, we modify T to obtain a binary tree of O(n) nodes. In order to reduce the number of nodes, we dissolve all nodes with exactly one child, i.e., while there is a non-root node u with exactly one child u′, we set parent(u′) := parent(u) and remove u.
To make the tree binary, for each node u with k > 2 children, we remove the edges between u and its children, and introduce a replacement tree, a full binary tree with k leaves whose root is u and whose leaves are the k children of u (preserving the left-to-right order). We choose the replacement trees so that the resulting tree still has depth O(log n). In Section 3.4.1 we provide a constructive proof that such a choice is possible. This procedure introduces new nodes (the inner nodes of the replacement trees); their levels are inherited from the parents.

The obtained tree is the wavelet suffix tree of w; see Figure 5 for an example. Observe that, as claimed in Section 3.2, it is a full binary tree of logarithmic height whose leaves correspond to non-empty suffixes of w$. Moreover, it is not hard to see that this tree still satisfies Observation 3.1.

[Figure 5: The wavelet suffix tree of w = ababbabababb (see also Figures 3 and 4). Levels are written inside nodes. Gray nodes have been introduced as inner nodes of replacement trees. The corresponding suffix is written down below each leaf. Selected edges e are labelled with the intervals I(e).]

As described in Section 3.2, each node u of the tree (except for the leaves) stores two bitmasks. In either bitmask each bit corresponds to a suffix y ∈ S(u), and it is equal to 0 if y ∈ S(lchild(u)) and to 1 if y ∈ S(rchild(u)), where lchild(u) and rchild(u) denote the children of u. In the first bitmask the suffixes y = w[j..]$ are ordered by the starting position j, and in the second bitmask by the pairs (w[j − 1], j) (assuming w[0] = $). Both bitmasks are equipped with rank/select data structures.

Additionally, each node and each edge of the wavelet suffix tree is associated with a string interval whose endpoints are suffixes of w$. Namely, for an arbitrary node u we define I(u) = [min S(u), max S(u)]_{ℓ(u)}.
Additionally, if u is not a leaf, we set I(u, lchild(u)) = [min S(u), y]_{ℓ(u)} and I(u, rchild(u)) = (y, max S(u)]_{ℓ(u)}, where y = max S(lchild(u)) is the suffix corresponding to the rightmost leaf in the left subtree of u; see also Figure 5. For each node u we store the starting positions of min S(u) and max S(u) in order to efficiently retrieve a representation of I(u) and I(e) for adjacent edges e. The following lemma characterizes the intervals.

Lemma 3.4. For any node u we have:
(1) If u is not a leaf, then I(u) is a disjoint union of I(u, lchild(u)) and I(u, rchild(u)).
(2) If y is a suffix of w$, then y ∈ I(u) if and only if y ∈ S(u).
(3) If u is not the root, then I(u) ⊆ I(parent(u), u).

Proof. (1) is a trivial consequence of the definitions.

(2) Clearly y ∈ S(u) iff y ∈ [min S(u), max S(u)]. Therefore, it suffices to show that if lcp(y, y′) ≥ ℓ(u) for y′ = min S(u) or y′ = max S(u), then y ∈ S(u). This is, however, a consequence of points (1) and (2) of Observation 3.1.

(3) Let ℓ_p = ℓ(parent(u)). If u = lchild(parent(u)), then S(u) ⊆ S(parent(u)) and, by Observation 3.1(1), ℓ(u) ≤ ℓ_p, which implies the statement. Therefore, assume that u = rchild(parent(u)), and let u′ be the left sibling of u. Note that I(parent(u), u) = (max S(u′), max S(u)]_{ℓ_p} and I(u) ⊆ [min S(u), max S(u)]_{ℓ_p}, since ℓ(u) ≤ ℓ_p. Consequently, it suffices to prove that max S(u′) ≺_{ℓ_p} min S(u). This is, however, a consequence of Observation 3.1(2) for y = min S(u) and y′ = max S(u′), and the fact that the left-to-right order of leaves coincides with the lexicographic order of the corresponding suffixes of w$.

For each edge e = (parent(u), u) of the wavelet suffix tree, we define L(e) to be the sorted list of those substrings of w which belong to I(e) \ I(u).

Recall that the wavelet suffix tree shall enjoy the lexicographic property: if we traverse the tree, and when going down an edge e we write out the contents of L(e), whereas when visiting a leaf we output the corresponding suffix of w$, we shall obtain a lexicographically sorted list of all substrings of w$. This is proved in the following series of lemmas.

Lemma 3.5. Let e = (parent(u), u) for a node u. Substrings in L(e) are smaller than any string in I(u).

Proof. We use the shorthand ℓ_p for ℓ(parent(u)). Let y = max S(u) be the rightmost suffix in the subtree of u. Consider a substring s = w[k..j] ∈ L(e), and let t = w[k..]$.

We first prove that s ≺ y. Note that I(e) = [x, y]_{ℓ_p} or I(e) = (x, y]_{ℓ_p} for some string x. We have s ∈ L(e) ⊆ I(e), and thus s ⪯_{ℓ_p} y. If lcp(s, y) < ℓ_p, this already implies that s ≺ y. Thus, let us assume that lcp(s, y) ≥ ℓ_p. The suffix t has s as a prefix, so this also means that lcp(t, y) ≥ ℓ_p. By Observation 3.1(2), t ∈ S(u), so t ⪯ y. Thus s ⪯ t ⪯ y, as claimed.

To prove that s ≺ y implies that s is smaller than any string in I(u), it suffices to note that y ∈ S(u) ⊆ I(u), s ∉ I(u), and I(u) is an interval.

Lemma 3.6. The wavelet suffix tree satisfies the lexicographic property.

Proof. Note that for the root r we have I(r) = [$, c]_1, where c is the largest character present in w. Thus, I(r) contains all substrings of w$, and it suffices to show that if we traverse the subtree of r, writing out the contents of L(e) when going down an edge e, and the corresponding suffix when visiting a leaf, we obtain a sorted list of the substrings of w$ contained in I(r). We will show an even stronger claim — that, in fact, this property holds for all nodes u of the tree.

If u is a leaf, this is clear, since I(u) consists of the corresponding suffix of w$ only. Next, if we have already proved the hypothesis for u, then prepending the output with the contents of L(parent(u), u), by Lemmas 3.5 and 3.4(3), we obtain a sorted list of the substrings of w$ contained in I(parent(u), u). Applying this property for both children of a non-leaf u′, we conclude that if the hypothesis holds for the children of u′, then, by Lemma 3.4(1), it also holds for u′.

Corollary 3.1. Each list L(e) contains consecutive prefixes of the largest element of L(e).

Proof. Note that if x ≺ y are substrings of w such that x is not a prefix of y, then x can be extended to a suffix x′ of w$ such that x ≺ x′ ≺ y. However, L(e) does not contain any suffix of w$. By Lemma 3.6, L(e) contains a consecutive collection of substrings of w$, so x and y cannot be both present in L(e). Consequently, each element of L(e) is a prefix of max L(e). Similarly, since L(e) contains a consecutive collection of substrings of w$, it must contain all prefixes of max L(e) no shorter than min L(e).
3.3.3 Implementation of auxiliary queries.

Recall that L_x(e) is the sublist of L(e) containing the suffixes of x. The wavelet suffix tree shall allow the following four types of queries in constant time:

(1) For a substring x and an edge e, output the list L_x(e), represented as O(1) non-overlapping periodic progressions;
(2) Count the number of suffixes of x = w[i..j] in the left/right subtree of a node (given along with the segment of its first bitmask corresponding to suffixes that start inside [i, j]);
(3) Count the number of suffixes of x = w[i..j] that are preceded by a character c and lie in the left/right subtree of a node (given along with the segment of its second bitmask corresponding to suffixes that start inside [i, j] and are preceded by c);
(4) For a substring x and an edge e, compute the run-length encoding of the sequence of characters preceding the suffixes in L_x(e).

We start with an auxiliary lemma applied in the solutions to all four queries.

Lemma 3.7. Let e = (u, u′) be an edge of a wavelet suffix tree of w, with u′ being a child of u. The following operations can be implemented in constant time:
(1) Given a substring x of w, |x| < ℓ(u), return, as a single periodic progression of starting positions, all suffixes s of x such that s ∈ I(e).
(2) Given a range of positions [i, j], j − i ≤ ℓ(u), return all positions k ∈ [i, j] such that w[k..]$ ∈ I(e), represented as at most two non-overlapping periodic progressions.

Proof. Let p be the longest common prefix of all strings in I(u); by Observation 3.1(3), we have |p| ≥ ⌊ℓ(u)/2⌋.

(1) Assume x = w[i..j]. We apply Theorem 3.1 to find all occurrences of p within x, represented as a single periodic progression since |x| + 1 < 2(|p| + 1). Then, using Lemma 3.3, we filter the positions k for which w[k..j] ∈ I(e).

(2) Let x = w[i..j + |p| − 1] (x = w[i..]$ if j + |p| − 1 > |w|). We apply Theorem 3.1 to find all occurrences of p within x, represented as at most two periodic progressions since |x| + 1 ≤ ℓ(u) + |p| + 1 ≤ 2⌊ℓ(u)/2⌋ + |p| + 2 < 3(|p| + 1). Like previously, using Lemma 3.3 we filter the positions k for which w[k..]$ ∈ I(e).

Lemma 3.8. The wavelet suffix tree allows to answer queries (1) in constant time. In more detail, for an edge e = (parent(u), u), the starting positions of the suffixes in L_x(e) form at most three non-overlapping periodic progressions, which can be reported in O(1) time.

Proof. First, we consider short suffixes. We use Lemma 3.7(1) to find all suffixes s of x, |s| < ℓ(parent(u)), such that s ∈ I(parent(u), u). Then, we apply Lemma 3.3 to filter out all suffixes belonging to I(u). By Lemma 3.5, we obtain at most one periodic progression.

Now, it suffices to generate the suffixes s, |s| ≥ ℓ(parent(u)), that belong to L(e). Suppose s = w[k..j]. If s ∈ I(e), then equivalently w[k..]$ ∈ I(e), since s is a long enough prefix of w[k..]$ to determine whether the latter belongs to I(e). Consequently, by Lemma 3.4, w[k..]$ ∈ I(u). This implies |s| < ℓ(u) (otherwise we would have s ∈ I(u)), i.e., k ∈ [j − ℓ(u) + 2 .. j − ℓ(parent(u)) + 1]. We apply Lemma 3.7(2) to compute all positions k in this range for which w[k..]$ ∈ I(e). Then, using Lemma 3.3, we filter out the positions k such that w[k..j] ∈ I(u). By Lemma 3.5, this cannot increase the number of periodic progressions, so we end up with three non-overlapping periodic progressions in total.
Lemma 3.9. The wavelet suffix tree allows us to answer queries (2) in constant time.

Proof. Let u be the given node and u′ be its right/left child (depending on the variant of the query). First, we use Lemma 3.7(1) to find all suffixes s of x, |s| < ℓ(u), such that s ∈ I(u, u′), i.e., such that s lies in the appropriate subtree of u.

Thus, it remains to count the suffixes of length at least ℓ(u). Suppose s = w[k..j] is a suffix of x such that |s| ≥ ℓ(u) and s ∈ I(u, u′). Then w[k..]$ ∈ I(u, u′), and the number of suffixes w[k..]$ ∈ I(u, u′) such that k ∈ [i, j] is simply the number of 1's or 0's in the given segment of the first bitmask in u, which we can compute in constant time. Observe, however, that we have also counted positions k such that |w[k..j]| < ℓ(u), and we need to subtract the number of these positions. For this, we use Lemma 3.7(2) to compute the positions k ∈ [j − ℓ(u) + 2, j] such that w[k..]$ ∈ I(u, u′), count the total size of the obtained periodic progressions, and subtract it from the final result.

Lemma 3.10. The wavelet suffix tree allows us to answer queries (3) and (4) in O(1) time.

Proof. Observe that for any periodic progression p_0, ..., p_k we have w[p_1 − 1] = ... = w[p_k − 1]. Thus, it is straightforward to determine which positions of such a progression are preceded by c.

Answering queries (3) is analogous to answering queries (2); we just use the second bitmask at the given node and consider only positions preceded by c while counting the sizes of the periodic progressions.

To answer queries (4), it suffices to use Lemma 3.8 to obtain L_x(e). By Corollary 3.1, suffixes in L_x(e) are prefixes of one another, so the lexicographic order on these suffixes coincides with the order of ascending lengths. Consequently, the run-length encoding of the piece corresponding to L_x(e) has at most six phrases and can be easily found in O(1) time.

3.4 Construction of wavelet suffix trees. The actual construction algorithm is presented in Section 3.4.2. Before that, in Section 3.4.1, we introduce several auxiliary tools for abstract weighted trees.

3.4.1 Toolbox for weighted trees. Let T be a rooted ordered tree with positive integer weights on edges, n leaves, and no inner nodes of degree one. We say that L_1, ..., L_{n−1} is an LCA sequence of T if L_i is the (weighted) depth of the lowest common ancestor of the i-th and (i + 1)-th leaves. The following fact is usually applied to construct the suffix tree of a string from the suffix array and the LCP table [12].

Fact 3.1. Given a sequence (L_i)_{i=1}^{n−1} of non-negative integers, one can construct in O(n) time a tree whose LCA sequence is (L_i)_{i=1}^{n−1}.

The LCA sequence suffices to detect whether a tree is binary.

Observation 3.2. A tree is a binary tree if and only if its LCA sequence (L_i)_{i=1}^{n−1} satisfies the following property for every i < j: if L_i = L_j, then there exists k, i < k < j, such that L_k < L_i.

Trees constructed by the following lemma can be seen as a variant of the weight-balanced trees, whose existence for arbitrary weights was proved by Blum and Mehlhorn [5].

Lemma 3.11. Given a sequence w_1, ..., w_n of positive integers, one can construct in O(n) time a binary tree T with n leaves, such that the depth of the i-th leaf is O(1 + log(W/w_i)), where W = Σ_{j=1}^{n} w_j.

Proof. For i = 0, ..., n define W_i = Σ_{j=1}^{i} w_j. Let p_i be the position of the most significant bit where the binary representations of W_{i−1} and W_i differ, and let P = max_{i=1}^{n} p_i. Observe that P = Θ(log W) and p_i = Ω(log w_i). Using Fact 3.1, we construct a tree T whose LCA sequence is L_i = P − p_i. Note that this sequence satisfies the condition of Observation 3.2, and thus the tree is binary.

Next, we insert an extra leaf between the two children of any node to make the tree ternary. The i-th of these leaves is inserted at (weighted) depth 1 + L_i = O(1 + log W − log w_i), which is also an upper bound for its unweighted depth. Next, we remove the original leaves. This way we get a tree satisfying the lemma, except for the fact that inner nodes may have between one and three children rather than exactly two. In order to resolve this issue, we remove (dissolve) all inner nodes with exactly one child, and for each node u with three children v_1, v_2, v_3 we introduce a new node u′, setting v_1, v_2 as the children of u′ and u′, v_3 as the children of u. This way we get a full binary tree, and the depth of any node may increase at most twice, i.e., for the i-th leaf it stays O(1 + log(W/w_i)).

Let T be an ordered rooted tree and let u be a node of T which is neither the root nor a leaf. Also, let v be the parent of u. We say that T′ is obtained from T by contracting the edge (v, u) if u is removed and the children of u replace u at its original location in the list of children of v. If T′ is obtained from T by a sequence of edge contractions, we say that T′ is a contraction of T. Note that contraction does not alter the pre-order and post-order of the preserved nodes, which implies that the ancestor-descendant relation also remains unchanged for these nodes.

Corollary 3.2. Let T be an ordered rooted tree of height h, which has n leaves and no inner node with exactly one child. Then, in O(n) time one can construct a full binary ordered rooted tree T′ of height O(h + log n) such that T is a contraction of T′ and T′ has O(n) nodes.

Proof. For any node u of T with three or more children, we replace the star-shaped tree joining it with its children v_1, ..., v_k by an appropriate replacement tree. Let W(u) be the number of leaves in the subtree of u, and let W(v_i) be the number of leaves in the subtree below v_i, 1 ≤ i ≤ k. We use Lemma 3.11 for w_i = W(v_i) to construct the replacement tree. Consequently, a node u with depth d in T has depth O(d + log(n/W(u))) in T′ (an easy top-down induction). The resulting tree has height O(h + log n), as claimed.
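The first step of this proof is easy to make concrete. In the sketch below (ours), p_i is read off as the most significant bit of W_{i−1} XOR W_i, and the resulting sequence L_i = P − p_i is what Fact 3.1 would consume; the convention of one L-value per weight, so that the auxiliary tree has one more leaf than the final one, is our reading of the indexing in the proof.

    # Sketch (ours) of the LCA-sequence computation in the proof of
    # Lemma 3.11; the tree construction itself is delegated to Fact 3.1.

    def lca_sequence(weights):
        """L_i = P - p_i, one value per weight, for the auxiliary tree."""
        prefix = [0]
        for wgt in weights:
            prefix.append(prefix[-1] + wgt)
        # p_i is the MSB of W_{i-1} XOR W_i; adding w_i must flip a bit at
        # position >= floor(log2(w_i)), so p_i = Omega(log w_i), while
        # P = max p_i = Theta(log W).
        p = [(a ^ b).bit_length() - 1 for a, b in zip(prefix, prefix[1:])]
        P = max(p)
        return [P - pi for pi in p]

    # The heavy third weight yields a small L-value, i.e., a shallow LCA,
    # so its leaf ends up close to the root.
    print(lca_sequence([1, 1, 4, 1, 1]))   # [3, 2, 1, 3, 0]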
3.4.2 The algorithm. In this section we show how to construct the wavelet suffix tree of a string w of length n in O(n√log n) time. The algorithm is deterministic, but the data structure of Theorem 3.1, required by the wavelet suffix tree, has a randomized construction only, running in O(n) expected time.

The construction algorithm has two phases: first, it builds the shape of the wavelet suffix tree, following the description in Section 3.3.2, and then it uses the results of Section 2 to obtain the bitmasks.

We start by constructing the suffix array and the LCP table for w$ (see [12]). Under the assumption that σ < n, this takes linear time.

Recall that in the definition of the wavelet suffix tree we started with a tree of size O(n log n). For an o(n log n)-time construction we cannot afford that. Thus, we construct the tree T already without inner nodes having exactly one child. Observe that this tree is closely related to the suffix tree of w$. The only difference is that if the longest common prefix of two consecutive suffixes is d, their root-to-leaf paths diverge at depth ⌊log d⌋ instead of d. To overcome this difficulty, we use Fact 3.1 for L_i = ⌊log LCP[i]⌋, rather than for LCP[i], which we would use for the suffix tree. This way an inner node u at depth j represents a substring of length 2^j. The level ℓ(u) of an inner node u is set to 2^{j+1}, and if u is a leaf representing a suffix s of w$, we have ℓ(u) = 2|s|.

After this operation, the tree T may have inner nodes of large degree, so we use Corollary 3.2 to obtain a binary tree such that T is its contraction. We set this binary tree as the shape of the wavelet suffix tree. Since T has height O(log n), so does the wavelet suffix tree.

To construct the bitmasks, we apply Theorem 2.2 for T with the leaf representing w[i..]$ assigned to i. For the first bitmask, we simply set s[i] = i. For the second bitmask, we sort all positions i with respect to (w[i − 1], i) and take the resulting sequence as s.

This way, we complete the proof of the main theorem concerning wavelet suffix trees.

Theorem 3.2. A wavelet suffix tree of a string w of length n occupies O(n) space and can be constructed in O(n√log n) expected time.
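To make the bitmask filling concrete, the snippet below (ours) computes the two leaf orders described in the construction above; the sentinel used for the missing character preceding position 1 is our own convention, not something the paper specifies.

    # Sketch (ours) of the two leaf orders behind the bitmasks of the
    # construction: identity order for the first, (w[i-1], i) for the second.

    def bitmask_orders(w):
        positions = list(range(1, len(w) + 1))   # w is 1-indexed in the paper
        first = positions[:]                      # first bitmask: s[i] = i
        # character preceding position i; chr(0) is our sentinel for i = 1
        prev = lambda i: w[i - 2] if i >= 2 else chr(0)
        second = sorted(positions, key=lambda i: (prev(i), i))
        return first, second

    print(bitmask_orders("banana")[1])   # [1, 3, 5, 2, 4, 6]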

3.5 Applications.

3.5.1 Substring suffix rank/select. In the substring suffix rank problem, we are asked to find the rank of a substring y among the suffixes of another substring x. The substring suffix selection problem, in contrast, is to find the k-th lexicographically smallest suffix of x for a given integer k and a substring x of w.

Theorem 3.3. The wavelet suffix tree can solve the substring suffix rank problem in O(log n) time.

Proof. Using binary search on the leaves of the wavelet suffix tree of w, we locate the minimal suffix t of w$ such that t ⪰ y. Let π denote the path from the root to the leaf corresponding to t. Due to the lexicographic property, the rank of y among the suffixes of x is equal to the sum of two numbers. The first one is the number of suffixes of x in the left subtrees hanging from the path π, whereas the second summand is the number of suffixes not exceeding y in the lists L_x(e) for e ∈ π.

To compute these two numbers, we traverse π maintaining a segment [ℓ, r] of the first bitmask corresponding to the suffixes of w$ starting within x. When we descend to the left child, we set [ℓ, r] := [rank_0(ℓ), rank_0(r)], while for the right child we set [ℓ, r] := [rank_1(ℓ), rank_1(r)]. In the latter case, we pass [ℓ, r] to type (2) queries, which let us count the suffixes of x in the left subtree hanging from π at the current node. This way, we compute the first summand.

For the second number, we use type (1) queries to generate all lists L_x(e) for e ∈ π. Note that if we concatenated these lists L_x(e) in the root-to-leaf order of edges, we would obtain a sorted list of strings. Thus, while processing the lists in this order (ignoring the empty ones), we add up the sizes of L_x(e) until max L_x(e) ⪰ y. For the first encountered list L_x(e) satisfying this property, we binary search within L_x(e) to determine the number of elements not exceeding y, and also add this value to the final result.

The described procedure requires O(log n) time, since type (1) and (2) queries, as well as substring comparison queries (Lemma 3.1), run in O(1) time.
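The segment maintenance in this proof is exactly the classical wavelet tree descent. As an illustration (ours, not the paper's data structure), the sketch below runs the same [ℓ, r] bookkeeping on a plain wavelet tree over an integer array, counting the entries of a range smaller than y; the wavelet suffix tree performs the identical rank_0/rank_1 updates on its first bitmasks, with suffixes in place of array entries.

    # Minimal wavelet tree over integers (our illustration); positions are
    # 0-indexed and ranges half-open, so count_less(i, j + 1, y) handles
    # the subarray A[i..j].

    class WaveletTree:
        def __init__(self, data, lo, hi):
            self.lo, self.hi = lo, hi            # value range of this node
            if lo == hi or not data:
                self.left = self.right = None
                return
            mid = (lo + hi) // 2
            bits = [v > mid for v in data]
            # ones[k] = number of 1s among the first k bits (rank_1 table)
            self.ones = [0]
            for b in bits:
                self.ones.append(self.ones[-1] + b)
            self.left = WaveletTree([v for v in data if v <= mid], lo, mid)
            self.right = WaveletTree([v for v in data if v > mid], mid + 1, hi)

        def count_less(self, l, r, y):
            """Values < y among positions [l, r) of this node's sequence."""
            if l >= r or y <= self.lo:
                return 0
            if y > self.hi:
                return r - l
            zeros_l, zeros_r = l - self.ones[l], r - self.ones[r]
            return (self.left.count_less(zeros_l, zeros_r, y) +
                    self.right.count_less(self.ones[l], self.ones[r], y))

    A = [3, 1, 4, 1, 5, 9, 2, 6]
    wt = WaveletTree(A, min(A), max(A))
    print(wt.count_less(1, 6, 5))   # 3 entries of A[1..5] are smaller than 5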
Theorem 3.4. The wavelet suffix tree can solve the substring suffix selection problem in O(log n) time.

Proof. The algorithm traverses a path in the wavelet suffix tree of w. It maintains a segment [ℓ, r] of the first bitmask corresponding to suffixes of w starting within x = w[i..j], and a variable k′ counting the suffixes of x represented in the left subtrees hanging from the path and in the lists L_x(e) on the edges of the path. The algorithm starts at the root, initializing [ℓ, r] with [i, j] and k′ = 0.

At each node u, it first decides to which child of u to proceed. For this, it performs a type (2) query to determine k″, the number of suffixes of x in the left subtree of u. If k′ + k″ ≥ k, it chooses to go to the left child, otherwise to the right one; in the latter case it also updates k′ := k′ + k″. The algorithm additionally updates the segment [ℓ, r] using the rank queries on the bitmask.

Let u′ be the child of u that the algorithm has chosen to proceed to. Before reaching u′, the algorithm performs a type (1) query to compute L_x(u, u′). If k′ summed with the size of this list is at least k, then the algorithm terminates, returning the (k − k′)-th element of the list (which is easy to retrieve from the representation as a periodic progression). Otherwise, it sets k′ := k′ + |L_x(u, u′)|, so that k′ satisfies the definition for the extended path from the root to u′.

The correctness of the algorithm follows from the lexicographic property, which implies that at the beginning of each step, the sought suffix of x is the (k − k′)-th smallest suffix in the subtree of u. In particular, the procedure always terminates before reaching a leaf. The running time of the algorithm is O(log n) due to the O(1)-time implementations of type (1) and (2) queries.

We now show that the query time for the two considered problems is almost optimal. We start by recalling lower bounds by Pătrașcu and by Jørgensen and Larsen.

Theorem 3.5. ([30, 29]) In the cell-probe model with W-bit cells, a static data structure of size c·n must take Ω(log n / (log c + log W)) time for orthogonal range counting queries.

Theorem 3.6. ([22]) In the cell-probe model with W-bit cells, a static data structure of size c·n must take Ω(log n / (log c + log W)) time for orthogonal range selection queries.

Both of these results allow the coordinates of the points to be in the rank space, i.e., for each i ∈ {1, ..., n} there is one point (i, A[i]), and the values A[i] are distinct integers in {1, ..., n}. This lets us construct a string w = A[1] ... A[n] for any given point set P. Since w has pairwise distinct characters, comparing suffixes of any substring of w is equivalent to comparing their first characters. Consequently, substring suffix selection in w is equivalent to orthogonal range selection in P, and substring suffix rank in w is equivalent to orthogonal range counting in P (we need to subtract the results of two substring suffix rank queries to answer an orthogonal range counting query). Consequently, we obtain

Corollary 3.3. In the cell-probe model with W-bit cells, a static data structure of size c·n must take Ω(log n / (log c + log W)) time for the substring suffix rank and the substring suffix select queries.

3.5.2 Run-length encoding of the BWT of a substring. Wavelet suffix trees can also be used to compute the run-length encoding of the Burrows-Wheeler transform of a substring. We start by recalling the definitions. The Burrows-Wheeler transform [7] (BWT) of a string x is a string b_0 b_1 ... b_{|x|}, where b_k is the character preceding the k-th lexicographically smallest suffix of x$. The BWT tends to contain long segments of equal characters, called runs. This, combined with run-length encoding, allows one to compress strings efficiently. The run-length encoding of a string is obtained by replacing each maximal run by a pair: the character that forms the run and the length of the run. For example, the BWT of the string banana is annb$aa, and the run-length encoding of annb$aa is a1n2b1$1a2.

Below, we show how to compute the run-length encoding of the BWT of a substring x = w[i..j] using the wavelet suffix tree of w. Let x = w[i..j], and for k ∈ {1, ..., |x|} let s_k = w[i_k..j] be the suffixes of x sorted in lexicographic order. Then the BWT of x is equal to b_0 b_1 ... b_{|x|}, where b_0 = w[j] and, for k ≥ 1, b_k = w[i_k − 1], unless i_k = i, in which case b_k = $.
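These definitions can be checked directly with a few lines of code. The brute-force sketch below (ours, a test harness rather than the paper's algorithm) sorts the suffixes of x$ and run-length encodes the result, reproducing annb$aa and a1n2b1$1a2 for banana.

    def bwt(x):
        s = x + "$"
        order = sorted(range(len(s)), key=lambda k: s[k:])
        # s[k - 1] is the character preceding each suffix; for k = 0 Python's
        # s[-1] yields the last character of x, matching b_0 = w[j].
        return "".join(s[k - 1] for k in order)

    def rle(b):
        out, k = [], 0
        while k < len(b):
            run = k
            while run < len(b) and b[run] == b[k]:
                run += 1
            out.append(f"{b[k]}{run - k}")
            k = run
        return "".join(out)

    print(bwt("banana"))        # annb$aa
    print(rle(bwt("banana")))   # a1n2b1$1a2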
Our algorithm initially generates a string b(x) which, instead of $, contains w[i − 1]. However, we know that $ should occur at the position equal to the rank of x among all the suffixes of x. Consequently, a single substring suffix rank query suffices to find the position which needs to be corrected.

Recall that the wavelet suffix tree satisfies the lexicographic property. Consequently, if we traverse the tree and write out the characters preceding the suffixes in the lists L_x(e), we obtain b(x) (without the first symbol b_0). Our algorithm simulates such a traversal. Assume that the last character appended to b(x) is c, and the algorithm is to move down an edge e = (u, u′). Before deciding to do so, it checks whether all the suffixes of x in the appropriate (left or right) subtree of u are preceded by c. For this, it performs type (2) and (3) queries, and if both results are equal to some value q, it simply appends c^q to b(x) and decides not to proceed to u′. In order to make these queries possible, for each node on the path from the root to u, the algorithm maintains segments corresponding to [i, j] in the first bitmasks, and to (c, [i, j]) in the second bitmasks. These segments are computed using rank queries on the bitmasks while moving down the tree.

Before the algorithm continues at u′, if it decides to do so, the suffixes in L_x(e) need to be handled. We perform a type (4) query to compute the characters preceding these suffixes, and append the result to b(x). This, however, may result in c no longer being the last symbol appended to b(x). If so, the algorithm updates the segments of the second bitmask for all nodes on the path from the root to u′. We assume that the root stores all positions i sorted by (w[i − 1], i), which lets us use binary search to find either endpoint of the segment for the root. For the subsequent nodes on the path, the rank structures on the second bitmasks are applied. Overall, this update takes O(log n) time, and it is necessary at most once per run.

Now, let us estimate the number of edges visited. Observe that if we go down an edge, then the last character of b(x) changes before we go up this edge. Thus, all the edges traversed down between two such character changes form a path. The length of any path is O(log n), and consequently the total number of visited edges is O(s log n), where s is the number of runs.

Theorem 3.7. The wavelet suffix tree can compute the run-length encoding of the BWT of a substring x in O(s log n) time, where s is the size of the encoding.
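When the traversal appends the outcome of a type (4) query to the encoding built so far, a run crossing the boundary has to be merged with the last run already produced, which is how "the last character appended to b(x)" is maintained. A minimal sketch of ours of this append step, with runs stored as (character, length) pairs:

    def append_rle(acc, piece):
        """Append RLE `piece` to RLE `acc` in place, merging the boundary run."""
        for ch, ln in piece:
            if acc and acc[-1][0] == ch:
                acc[-1] = (ch, acc[-1][1] + ln)   # extend the last run
            else:
                acc.append((ch, ln))

    runs = [("a", 1), ("n", 2)]
    append_rle(runs, [("n", 1), ("b", 1)])
    print(runs)   # [('a', 1), ('n', 3), ('b', 1)]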
3.5.3 Speeding up queries. Finally, we note that, building wavelet suffix trees for several substrings of w, we can make the query time adaptive to the length of the query substring x, i.e., replace O(log n) by O(log |x|).

Theorem 3.8. Using a data structure of size O(n), which can be constructed in O(n√log n) expected time, the substring suffix rank and selection problems can be solved in O(log |x|) time. The run-length encoding b(x) of the BWT of a substring x can be found in O(|b(x)| log |x|) time.

Proof. We build wavelet suffix trees for some substrings of length n_k = ⌊n^{2^{−k}}⌋, 0 ≤ k ≤ log log n. For each length n_k we choose every ⌊n_k/2⌋-th substring, starting from the prefix, and, additionally, we choose the suffix. The auxiliary data structures of Lemma 3.1 and Theorem 3.1 are built for w only.

We have n_k = ⌊√n_{k−1}⌋, so n_{k−1} ≤ (n_k + 1)², and thus any substring x of w lies within a substring v, |v| ≤ 2(|x| + 1)², for which we store the wavelet suffix tree. For each m, 1 ≤ m ≤ n, we store such n_k that 2m ≤ n_k ≤ 2(m + 1)². This reduces finding an appropriate substring v to simple arithmetics. Using the wavelet suffix tree for v instead of the tree for the whole string w gives the announced query times. The only thing we must be careful about is that the input for the substring suffix rank problem also consists of a string y, which does not need to be a substring of v. However, looking at the query algorithm, it is easy to see that y is only used through the data structure of Lemma 3.1.

It remains to analyze the space usage and construction time. Observe that the wavelet suffix tree of a substring v is simply a binary tree with two bitmasks at each node and with some pointers to the positions of the string w. In particular, it does not contain any characters of w and, if all pointers are stored as relative values, it can be stored using O(|v| log |v|) bits, i.e., O(|v| log |v| / log n) words. For each n_k the total length of the selected substrings is O(n), and thus the space usage is O(n log n_k / log n) = O(n 2^{−k}), which sums up to O(n) over all lengths n_k. The construction time is O(|v|√log |v|) for any single substring (including alphabet renumbering), and this sums up to O(n√(2^{−k} log n)) for each length, and to O(n√log n) in total.

Acknowledgement. The authors gratefully thank Djamal Belazzougui, Travis Gagie, and Simon J. Puglisi, who pointed out that wavelet suffix trees can be useful for BWT-related applications.
References

[1] Amihood Amir, Alberto Apostolico, Gad M. Landau, Avivit Levy, Moshe Lewenstein, and Ely Porat. Range LCP. Journal of Computer and System Sciences, 80(7):1245–1253, 2014.
[2] Amihood Amir, Gad M. Landau, Moshe Lewenstein, and Dina Sokol. Dynamic text and static pattern matching. ACM Transactions on Algorithms, 3(2), May 2007.
[3] Maxim Babenko, Paweł Gawrychowski, Tomasz Kociumaka, and Tatiana Starikovskaya. Computing minimal and maximal suffixes of a substring revisited. In Alexander S. Kulikov, Sergei O. Kuznetsov, and Pavel Pevzner, editors, Combinatorial Pattern Matching, volume 8486 of LNCS, pages 30–39. Springer International Publishing, 2014.
[4] Maxim Babenko, Ignat Kolesnichenko, and Tatiana Starikovskaya. On minimal and maximal suffixes of a substring. In Johannes Fischer and Peter Sanders, editors, Combinatorial Pattern Matching, volume 7922 of LNCS, pages 28–37. Springer Berlin Heidelberg, 2013.
[5] Norbert Blum and Kurt Mehlhorn. On the average number of rebalancing operations in weight-balanced trees. Theoretical Computer Science, 11(3):303–320, July 1980.
[6] Gerth Stølting Brodal, Beat Gfeller, Allan Grønlund Jørgensen, and Peter Sanders. Towards optimal range median. Theoretical Computer Science: Selected Papers from 36th International Colloquium on Automata, Languages and Programming (ICALP 2009), 412(24):2588–2601, 2011.
[7] Michael Burrows and David J. Wheeler. A block-sorting lossless data compression algorithm. Technical Report 124, Digital Equipment Corporation, Palo Alto, California, 1994.
[8] Timothy M. Chan and Mihai Pătrașcu. Counting inversions, offline orthogonal range counting, and related problems. In 21st Annual ACM-SIAM Symposium on Discrete Algorithms, SODA'10, pages 161–173. SIAM, 2010.
[9] David Richard Clark. Compact Pat Trees. PhD thesis, University of Waterloo, Waterloo, Ontario, Canada, 1996.
[10] Francisco Claude, Patrick K. Nicholson, and Diego Seco. Space efficient wavelet tree construction. In Roberto Grossi, Fabrizio Sebastiani, and Fabrizio Silvestri, editors, String Processing and Information Retrieval, volume 7024 of LNCS, pages 185–196. Springer Berlin Heidelberg, 2011.
[11] Graham Cormode and S. Muthukrishnan. Substring compression problems. In 16th Annual ACM-SIAM Symposium on Discrete Algorithms, SODA'05, pages 321–330. SIAM, 2005.
[12] Maxime Crochemore, Christophe Hancart, and Thierry Lecroq. Algorithms on Strings. Cambridge University Press, Cambridge, 2007.
[13] Maxime Crochemore, Costas S. Iliopoulos, Marcin Kubica, Jakub Radoszewski, Wojciech Rytter, and Tomasz Waleń. Extracting powers and periods in a word from its runs structure. Theoretical Computer Science, 521:29–41, 2014.
[14] Jean-Pierre Duval. Factorizing words over an ordered alphabet. Journal of Algorithms, 4(4):363–381, 1983.
[15] Martin Farach and S. Muthukrishnan. Perfect hashing for strings: Formalization and algorithms. In Dan Hirschberg and Gene Myers, editors, Combinatorial Pattern Matching, volume 1075 of LNCS, pages 130–140. Springer Berlin Heidelberg, 1996.
[16] Paolo Ferragina, Giovanni Manzini, Veli Mäkinen, and Gonzalo Navarro. Compressed representations of sequences and full-text indexes. ACM Transactions on Algorithms, 3(2), May 2007.
[17] Luca Foschini, Roberto Grossi, Ankur Gupta, and Jeffrey Scott Vitter. When indexing equals compression: experiments with compressing suffix arrays and applications. ACM Transactions on Algorithms, 2(4):611–639, October 2006.
[18] Paweł Gawrychowski, Moshe Lewenstein, and Patrick K. Nicholson. Weighted ancestors in suffix trees. In Andreas S. Schulz and Dorothea Wagner, editors, Algorithms – ESA 2014, volume 8737 of LNCS, pages 455–466. Springer Berlin Heidelberg, 2014.
[19] Roberto Grossi, Ankur Gupta, and Jeffrey Scott Vitter. High-order entropy-compressed text indexes. In 14th Annual ACM-SIAM Symposium on Discrete Algorithms, SODA'03, pages 841–850. SIAM, 2003.
[20] Roberto Grossi and Giuseppe Ottaviano. The wavelet trie: maintaining an indexed sequence of strings in compressed space. In 31st Symposium on Principles of Database Systems, PODS'12, pages 203–214. ACM, 2012.
[21] Guy Jacobson. Space-efficient static trees and graphs. In 30th Annual Symposium on Foundations of Computer Science, pages 549–554. IEEE Computer Society, 1989.
[22] Allan Grønlund Jørgensen and Kasper Green Larsen. Range selection and median: tight cell probe lower bounds and adaptive data structures. In 22nd Annual ACM-SIAM Symposium on Discrete Algorithms, SODA'11, pages 805–813. SIAM, 2011.
[23] Orgad Keller, Tsvi Kopelowitz, Shir Landau Feibish, and Moshe Lewenstein. Generalized substring compression. Theoretical Computer Science: Advances in Stringology, 525:42–54, 2014.
[24] Tomasz Kociumaka, Jakub Radoszewski, Wojciech Rytter, and Tomasz Waleń. Efficient data structures for the factor periodicity problem. In Liliana Calderón-Benavides, Cristina González-Caro, Edgar Chávez, and Nivio Ziviani, editors, String Processing and Information Retrieval, volume 7608 of LNCS, pages 284–294. Springer Berlin Heidelberg, 2012.
[25] Tomasz Kociumaka, Jakub Radoszewski, Wojciech Rytter, and Tomasz Waleń. Internal pattern matching queries in a text and applications. In 26th Annual ACM-SIAM Symposium on Discrete Algorithms, SODA'15. SIAM, 2015.
[26] Veli Mäkinen and Gonzalo Navarro. Rank and select revisited and extended. Theoretical Computer Science, 387(3):332–347, 2007.
[27] Gonzalo Navarro. Wavelet trees for all. Journal of Discrete Algorithms: 23rd Annual Symposium on Combinatorial Pattern Matching, 25:2–20, 2014.
[28] Manish Patil, Rahul Shah, and Sharma V. Thankachan. Faster range LCP queries. In Oren Kurland, Moshe Lewenstein, and Ely Porat, editors, String Processing and Information Retrieval, volume 8214 of LNCS, pages 263–270. Springer International Publishing, 2013.
[29] Mihai Pătrașcu. Lower bounds for 2-dimensional range counting. In 39th Annual ACM Symposium on Theory of Computing, STOC'07, pages 40–46. ACM, 2007.
[30] Mihai Pătrașcu. Unifying the landscape of cell-probe lower bounds. SIAM Journal on Computing, 40(3):827–847, 2011.
[31] Milan Ružić. Constructing efficient dictionaries in close to sorting time. In Luca Aceto, Ivan Damgård, Leslie Ann Goldberg, Magnús M. Halldórsson, Anna Ingólfsdóttir, and Igor Walukiewicz, editors, Automata, Languages and Programming – ICALP 2008, volume 5125 of LNCS, pages 84–95, 2008.
[32] German Tischler. On wavelet tree construction. In Raffaele Giancarlo and Giovanni Manzini, editors, Combinatorial Pattern Matching, volume 6661 of LNCS, pages 208–218. Springer Berlin Heidelberg, 2011.
A Constructing rank/select structures.

Lemma 2.1. Given a bit string B[1..N] packed in N/log n machine words, we can extend it in O(N/log n) time with a rank/select data structure occupying o(N/log n) additional space, assuming Õ(√n) time and space preprocessing shared by all instances of the structure.

Proof. We focus on implementing rank_1 and select_1; by repeating the construction, we also get rank_0 and select_0.

Rank. Recall the usual implementation of rank_1. We first split B into superblocks of length log² n, and then split every superblock into blocks of length (1/2) log n. We store the cumulative rank for each superblock, and the cumulative rank within the superblock for each block. All ranks can be computed by reading whole blocks. More precisely, in every step we extract the next (1/2) log n bits, i.e., take either the lower or the higher half of the next word encoding the input, and compute the number of ones inside using a shared table of size O(√n). Hence we need O(N/log n)-time preprocessing for rank queries.
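For illustration, here is an unoptimized rendering of ours of exactly this layout; a real implementation packs the counts into machine words and replaces the final scan by a lookup in the shared table.

    class RankBitvector:
        """Toy rank_1 structure mirroring the superblock/block layout above."""

        def __init__(self, bits, logn):
            self.bits = bits
            self.block = max(1, logn // 2)       # block length ~ (1/2) log n
            self.super = self.block * 2 * logn   # superblock length ~ log^2 n
            self.super_rank, self.block_rank = [0], [0]
            total = in_super = 0
            for k, bit in enumerate(bits, 1):
                total += bit
                in_super += bit
                if k % self.super == 0:          # close a superblock first...
                    self.super_rank.append(total)
                    in_super = 0
                if k % self.block == 0:          # ...then record the block rank
                    self.block_rank.append(in_super)

        def rank1(self, k):
            """Number of 1s among the first k bits."""
            r = self.super_rank[k // self.super] + self.block_rank[k // self.block]
            # a shared table would handle this tail in O(1); we scan instead
            return r + sum(self.bits[(k // self.block) * self.block : k])

    rb = RankBitvector([1, 0, 1, 1, 0, 0, 1, 0] * 8, logn=6)
    print(rb.rank1(10))   # 5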
Select. Now, recall the more involved implementation of select_1. We split B into superblocks by choosing every s-th occurrence of 1, where s = log n log log n, and store the starting position of every superblock together with a pointer to its description, using O((N/s) log n) = o(N) bits in total. This can be done by processing the input word-by-word while maintaining the total number of 1s seen so far in the current superblock. After reading the next word, i.e., the next log n bits, we check if the next superblock should start inside, and if so, compute its starting position. This can be easily done if we can count the 1s inside a word and select the k-th one in O(1) time, which can be preprocessed in a shared table of size O(√n) if we split every word into two parts. There are at most N/s superblocks, so the total construction time so far is O(N/log n + N/s) = O(N/log n). Then we have two cases, depending on the length ℓ of the superblock.

If ℓ > s², we store the position of every 1 inside the superblock explicitly. We generate the positions by sweeping the superblock word-by-word and extracting the 1s one-by-one, which takes ℓ/log n + s time. Because there are at most N/s² such sparse superblocks, this is O(N/log n + N/s) = O(N/log n), and the total number of used bits is O((N/s²) s log n) = o(N).

If ℓ ≤ s², the standard solution is to recurse on the superblock, i.e., to repeat the above with n replaced by s². We need to be more careful, though, because otherwise we would have to pay O(1) time for roughly every (log log n)²-th occurrence of 1 in B, which would be too much.

We want to split every superblock into blocks by choosing every s′-th occurrence of 1, where s′ = (log log n)², and storing the starting position of every block relative to the starting position of its superblock. Then, if a block is of length ℓ′ > s′², we store the position of every 1 inside the block, again relative to the starting position of the superblock, which takes O((N/s′²) s′ log(s²)) = o(N) bits in total. Otherwise, the block is so short that we can extract the k-th occurrence of 1 inside using the shared table. The starting positions of the blocks are saved one after another using O((N/s′) log(s²)) = o(N) bits in total, so that we can access the i-th block in O(1) time. With the starting positions of every 1 in a sparse block, the situation is slightly more complicated, as not every block is sparse, and we cannot afford to store a separate pointer for every block. The descriptions of all sparse blocks are also saved one after another, but additionally we store for every block a single bit denoting whether it is sparse or dense. These bits are concatenated together and enriched with a rank_1 structure, which allows us to compute in O(1) time where the description of a given sparse block begins.

To partition a superblock into blocks efficiently, we must be able to generate multiple blocks simultaneously. Assume that a superblock spans a number of whole words (if not, we can shift its description paying O(1) per word, which is fine). Then we can process these words one-by-one after some preprocessing. First, consider the process of creating the blocks when we read the bits one-by-one. We maintain the description of the current block, which consists of its length ℓ′ ≤ ℓ, the number of 1s seen so far, and their starting positions. The total size of the description is O(log ℓ + s′ log ℓ) bits. Additionally, we maintain our current position inside the superblock, which takes another O(log ℓ) bits. After reading the next bit, we update the current position, and either update the description of the current block or start a new block. In the latter case, we output the position of the first 1 inside the current block, store one bit denoting whether ℓ′ > s′², and if so, additionally save the positions of all 1s inside the current block.

Now we want to accelerate processing the words describing a superblock. Informally, we would like to read the whole next word and generate all the corresponding data in O(1) time. The difficulty is that the starting positions of the blocks, the bits denoting whether a block is sparse or dense, and the descriptions of sparse blocks are all stored in separate locations. To overcome this issue, we can view the whole process as an automaton with one input tape, three separate output tapes, and a state described by O(log ℓ + s′ log ℓ) = O(polyloglog(n)) bits. In every step, the automaton reads the next bit from the input tape, updates its state, and possibly writes into the output tapes. Because the situation is fully deterministic, we can preprocess its behavior, i.e., for every initial state and a sequence of (1/2) log n bits given on the input tape, we can precompute the new state and the data written into the output tapes (observe that, by construction, this is always at most log n bits). The cost of such preprocessing is O(2^{(1/2) log n + polyloglog(n)}) = Õ(√n). The output buffers are implemented as packed lists, so appending at most log n bits to any of them takes just O(1) time. Therefore, we can process (1/2) log n bits in O(1) time, which accelerates generating the blocks by the desired factor of log n.
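The sampling scheme can be prototyped in a few lines. The toy sketch below (ours; it stores positions as plain Python lists rather than packed words, and its construction is not word-parallel) keeps the s answers of a sparse superblock explicitly and scans a dense superblock directly, standing in for the shared-table lookup.

    def build_select1(bits, s):
        """Directory: every s-th 1 opens a superblock; a sparse one
        (spanning more than s*s bits) keeps its s answers explicitly."""
        ones = [k for k, b in enumerate(bits) if b]
        directory = []
        for start in range(0, len(ones), s):
            chunk = ones[start:start + s]
            sparse = chunk[-1] - chunk[0] + 1 > s * s
            directory.append((chunk[0], chunk if sparse else None))
        return directory

    def select1(bits, directory, s, k):
        """Position of the k-th 1 (1-indexed; k must not exceed the total)."""
        begin, explicit = directory[(k - 1) // s]
        if explicit is not None:            # sparse superblock: direct lookup
            return explicit[(k - 1) % s]
        need, pos = (k - 1) % s + 1, begin  # dense superblock: short, scan it
        while True:
            if bits[pos]:
                need -= 1
                if need == 0:
                    return pos
            pos += 1

    bits = [1, 0, 1, 1, 0, 0, 1, 0] * 8
    d = build_select1(bits, s=3)
    print(select1(bits, d, 3, 4))   # 6: the 4th 1 sits at position 6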
Lemma 2.3. Let d ≤ log^ε n for ε < 1/3. Given a string D[1..N] over the alphabet [0, d − 1] packed in N log d / log n machine words, we can extend it in O(N log d / log n) time with a generalized rank/select structure occupying o(N/log n) additional space, assuming Õ(√n) time and space preprocessing shared by all instances of the structure.

Proof. The construction is similar to the one from Lemma 2.1, but requires careful adjustment of the parameters. The main difficulty is that we cannot afford to consider every c ∈ [0, d − 1] separately as in the binary case.

Rank. We split D into superblocks of length d log² n, and then split every superblock into blocks of length (1/3) log n / log d. For every superblock, we store the cumulative generalized rank of every character, i.e., for every character c we store the number of positions where characters c′ ≤ c occur in the prefix of the string up to the beginning of the superblock. This takes O(d log n) bits per superblock, which is O(N/log n) = o(N) in total.

For every block, we store the cumulative generalized rank of every character within the superblock, i.e., for every character c we store the number of positions where characters c′ ≤ c occur in the prefix of the superblock up to the beginning of the block. This takes O(d log(d log² n)) = O(log^ε n log log n) bits per block, which is O((N log d / log n) log^ε n log log n) = o(N) in total.

Finally, we store a shared table which, given a string of (1/3) log n / log d characters packed into a single machine word of length (1/3) log n, a character c, and a number i, returns the number of positions where characters c′ ≤ c occur in the prefix of length i. Both the size and the construction time of the table are Õ(2^{(1/3) log n}). It is clear that, using the table together with the cumulative ranks within the blocks and superblocks, we can compute any generalized rank in O(1) time.

We proceed with the construction algorithm. All ranks can be computed by reading whole blocks. More precisely, the cumulative generalized ranks of the next block can be computed by taking the cumulative generalized ranks of the previous block (if it belongs to the same superblock), stored in one word of length d log(d log² n), and updating them with the next (1/3) log n / log d characters, stored in one word of length (1/3) log n. Such updates can be preprocessed for every possible input in Õ(2^{log^ε n + (1/3) log n}) time and space. Then each block can be handled in O(1) time, which gives O(N log d / log n) in total.

The cumulative generalized ranks of the next superblock can be computed by taking the cumulative generalized ranks of the previous superblock and updating them with the cumulative generalized ranks of the last block in that previous superblock. This can be done by simply iterating over the whole alphabet and updating each generalized rank separately in O(1) time, which is O((N / (d log² n)) · d) = o(N/log n) in total.
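A toy rendering of ours of the generalized rank tables just described: cumulative counts of characters c′ ≤ c at superblock boundaries, within-superblock counts at block boundaries, and a scan standing in for the shared-table lookup on the tail. A real implementation packs each vector of d counts into a single word.

    def build_generalized_rank(D, d, block_len, super_len):
        """Cumulative counts of characters <= c at both boundary kinds."""
        assert super_len % block_len == 0
        super_cnt, block_cnt = [[0] * d], [[0] * d]
        cur_super = [0] * d
        for k, ch in enumerate(D, 1):
            for c in range(ch, d):       # ch counts towards every c >= ch
                cur_super[c] += 1
            if k % super_len == 0:       # close the superblock first...
                super_cnt.append([a + b for a, b in zip(super_cnt[-1], cur_super)])
                cur_super = [0] * d
            if k % block_len == 0:       # ...then snapshot the block counts
                block_cnt.append(cur_super[:])
        return super_cnt, block_cnt

    def generalized_rank(D, tables, block_len, super_len, k, c):
        """Number of positions among the first k with character <= c."""
        super_cnt, block_cnt = tables
        r = super_cnt[k // super_len][c] + block_cnt[k // block_len][c]
        return r + sum(1 for ch in D[(k // block_len) * block_len : k] if ch <= c)

    D = [2, 0, 1, 3, 2, 1, 0, 2]
    tables = build_generalized_rank(D, d=4, block_len=2, super_len=4)
    print(generalized_rank(D, tables, 2, 4, 6, 1))   # 3 characters <= 1 in D[1..6]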

Select. We store a separate structure for every c ∈ [0, d − 1]. Even though the structures must be constructed together to guarantee the promised complexity, we first describe a single such structure. Nevertheless, when stating the total space usage, we mean the sum over all c ∈ [0, d − 1].

We split D into superblocks by choosing every s-th occurrence of c in D, where s = d log² n, and store the starting position of every superblock. This takes O((d + N/s) log n) = o(N) bits in total. As in the binary case, we then proceed differently depending on the length ℓ of the superblock.

If ℓ > s², we store the position of every occurrence of c inside the superblock explicitly. This takes O(d (N/s²) s log n) = o(N) bits.

If ℓ ≤ s², we split the superblock into smaller blocks by choosing every s′-th occurrence of c inside, where s′ = d (log log n)². We write down the starting position of every block relative to the starting position of the superblock, which takes O((d + N/s′) log s) = O(N/(d log log n)) = o(N/d) bits in total. Then, if the block is of length ℓ′ > d·s′², we store the position of every occurrence of c inside, again relative to the starting position of the superblock, which takes O(d (N/(d·s′²)) s′ log s) = O(N/(d log log n)) = o(N/d) bits in total. Otherwise, the block spans at most O(log^{3ε} n (log log n)⁴) = o(log n) bits in D, hence the k-th occurrence of c there can be extracted in O(1) time using a shared table of size O(√n). Descriptions of the sparse blocks are stored one after another as in the binary case.

This completes the description of the structure for a single c ∈ [0, d − 1]. To access the relevant structure when answering a select query, we additionally store pointers to the structures in an array of size O(d). Now we will argue that all the structures can be constructed in O(N log d / log n) time, but first let us recall what data is stored for a single c ∈ [0, d − 1]:

1. an array storing the starting positions of all superblocks and pointers to their descriptions,
2. for every sparse superblock, a list of positions of every occurrence of c inside,
3. for every dense superblock, a bitvector with the i-th bit denoting if the i-th block inside is sparse,
4. an array storing the starting positions of all blocks relative to the starting position of their superblock,
5. for every sparse block, a list of positions of every occurrence of c inside, relative to the starting position of its superblock.
First, we compute the starting positions of all superblocks by scanning the input while maintaining the number of occurrences of every c ∈ [0, d − 1] in its current superblock. These counters take O(d log s) = o(log n) bits together, and are packed in a single machine word. We read the input word-by-word and update the counters, which can be done in O(1) time after an appropriate preprocessing. Whenever we notice that a counter exceeds s, we output a new superblock. This requires just O(1) additional time per superblock, assuming an appropriate preprocessing. The total time complexity is hence O(N log d / log n) so far. Observe that at this point we know which superblocks are sparse.

To compute the occurrences in sparse superblocks, we perform another word-by-word scan of the input. During the scan, we keep track of the current superblock for every c ∈ [0, d − 1], and if it is sparse, we extract and copy the starting positions of all occurrences of c. This requires just O(1) additional time per occurrence of c in a sparse superblock, which sums up to O(N/log² n).

The remaining part is to split all dense superblocks into blocks. Again, we view the process as an automaton with one input tape, three output tapes corresponding to different parts of the structure, and a state consisting of O(log ℓ + s′ log ℓ) bits. Here, we must be more careful than in the binary case: because we will be reading the whole input word-by-word, not just a single dense superblock, it might happen that ℓ is large. Therefore, before feeding the input into the automaton, we mask out all occurrences of c within sparse superblocks, which can be done in O(1) time per input word after an appropriate preprocessing, provided that we keep track of the current superblock for every c ∈ [0, d − 1] while scanning through the input. Then we can combine all d automata into a larger automaton equipped with 3d separate output tapes and a state consisting of o(log n) bits. We preprocess the behavior of the large automaton after reading a sequence of (1/2) log n bits given on the input tape, for every initial state. The result is the new state and the data written into each of the 3d output tapes.

Now we again must be more careful: even though the output tapes are implemented as packed lists, and we need just O(1) time to append the preprocessed data to any of them, we cannot afford to touch every output tape every time we read the next (1/2) log n bits of the input. Therefore, every output tape is equipped with an output buffer consisting of (1/5)(log n)/d bits, and all these output buffers are stored in a single machine word consisting of (1/5) log n bits. Then we modify the preprocessing: for every initial state, a sequence of (1/5) log n input bits, and the initial content of the output buffers, we compute the new state, the new content of the output buffers, and zero or more full chunks of (1/5)(log n)/d bits that should be written into the respective output tapes. Such preprocessing still takes Õ(√n) time, and now the additional time is just O(1) per each such chunk. Because we have chosen the parameters so that the size of the additional data in bits is o(N/d), the total time complexity is O(N log d / log n) as claimed.
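The scans above keep d counters packed in a single machine word and bump one of them in O(1) time. A small simulation of ours with Python integers, assuming a field width f chosen large enough that no counter overflows:

    # Counter c occupies bits [c*f, (c+1)*f); bumping it is one addition.

    def bump(word, c, f):
        """Increment the c-th packed counter (no overflow by choice of f)."""
        return word + (1 << (c * f))

    def read(word, c, f):
        return (word >> (c * f)) & ((1 << f) - 1)

    word, f, d = 0, 16, 4
    for ch in [0, 2, 2, 3, 0, 2]:
        word = bump(word, ch, f)
    print([read(word, c, f) for c in range(d)])   # [2, 0, 3, 1]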