Indexing the Bijective

Indexing the Bijective BWT Hideo Bannai Department of Informatics, Kyushu University, Fukuoka, Japan [email protected] Juha Kärkkäinen Helsinki Institute of Information Technology (HIIT), Finland [email protected].fi Dominik Köppl Department of Informatics, Kyushu University, Japan Society for Promotion of Science (JSPS) [email protected] Marcin Pia¸tkowski Nicolaus Copernicus University, Toruń, Poland [email protected] Abstract The Burrows-Wheeler transform (BWT) is a permutation whose applications are prevalent in data compression and text indexing. The bijective BWT is a bijective variant of it that has not yet been studied for text indexing applications. We fill this gap by proposing a self-index built on the bijective BWT. The self-index applies the backward search technique of the FM-index to find a pattern P with O(|P | lg |P |) backward search steps. 2012 ACM Subject Classification Theory of computation; Mathematics of computing → Combina- torics on words Keywords and phrases Burrows-Wheeler Transform, Lyndon words, Text Indexing Digital Object Identifier 10.4230/LIPIcs.CPM.2019.17 Funding This research received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No 690941. Dominik Köppl: JSPS KAKENHI Grant Number JP18F18120. 1 Introduction The Burrows-Wheeler transform (BWT) [6] is a transformation permuting all symbols of a given string T . It is obtained by sorting all cyclic rotations (conjugates) of T with respect to the lexicographical order and writing the last character of the i-th sorted cyclic rotation in a linear manner from i = 1 to i = |T |. The BWT tends to group identical characters together. All cyclic rotations of a given string share the same BWT. However, there are strings that are not the BWT of any string (e.g., bccaab cannot be reversed). A variant, called the bijective BWT [15], is a bijective transformation. It is based on the Lyndon factorization [7] of the input string. In this variant, the output consists of the last symbols of the lexicographically sorted cyclic rotations of all Lyndon factors of the input. Since the Lyndon factorization of a string is uniquely defined, the bijective BWT induces a bijection between strings of a given length n and multisets of Lyndon words of total length n. In the following, we call the BWT traditional to ease the distinguishability of both transformations. It is well known that the traditional BWT has many applications in data compression [1] and text indexing [9, 10, 11]. For the latter, the algorithms for pattern searching are based on the backward search [20]: given a pattern P and the traditional BWT © Hideo Bannai, Juha Kärkkäinen, Dominik Köppl, and Marcin Pi¸atkowski; licensed under Creative Commons License CC-BY 30th Annual Symposium on Combinatorial Pattern Matching (CPM 2019). Editors: Nadia Pisanti and Solon P. Pissis; Article No. 17; pp. 17:1–17:14 Leibniz International Proceedings in Informatics Schloss Dagstuhl – Leibniz-Zentrum für Informatik, Dagstuhl Publishing, Germany 17:2 Indexing the Bijective BWT of T , the occurrences of P in a text T can be computed with O(|P |) backward search steps. In this light, one may ask whether it is possible to build similar index data structures by exchanging the traditional BWT with the bijective BWT. In this article, we answer affirmatively the above question: We show that searching a pattern P on the bijective BWT can be conducted with O(|P |pˆ) backward search steps, where pˆ is the number of distinct factors in the Lyndon factorization of the longest pre-Lyndon suffix of P , where pˆ is known to be in O(lg |P |) [13]. Thus, we can reduce the number of backward search steps to O(|P | lg |P |). Our results are based on combinatoric properties of Lyndon words and the bijective BWT. They may have applications in distributed implementations of the BWT index [14] or in practical database systems storing dynamic yet compressed data [4, 5]. 2 Preliminaries Our computational model is the word RAM model with word size Ω(lg n). Accessing a word costs O(1) time. We write [b(I)..e(I)] = I for an interval I of natural numbers. 2.1 Strings Let Σ denote a finite alphabet. We call an element T ∈ Σ∗ a string. Its length is denoted by |T |. Given an integer j with 1 ≤ j ≤ |T |, we access the j-th character of T with T [j]. Concatenating a string T ∈ Σ∗ k times is abbreviated by T k.A bit vector is a string on the binary alphabet {0, 1}. When T is represented by the concatenation of X, Y, Z ∈ Σ∗, i.e., T = XYZ, then X, Y and Z are called a prefix, substring and suffix of T , respectively; A prefix X, substring Y , or suffix Z is called proper if X 6= T , Y 6= T , or Z 6= T , respectively. For two integers i, j with 1 ≤ i ≤ j ≤ |T |, let T [i..j] denote the substring of T that begins at position i and ends at position j in T . If i > j, then T [i..j] is the empty string. In particular, the suffix starting at position j of T is called the j-th suffix of T , and denoted with T [j..]. An occurrence of a substring S in T is treated as a sub-interval of [1..|T |] such that S = T [b(S)..e(S)]. The longest common prefix (LCP) of two strings S and T is the longest string that is a prefix of both S and T . The length of the LCP of two strings S and T is given by the function lcp(S, T ) returning an integer ` such that T [1..`] = S[1..`] and either (a) T [` + 1] 6= S[` + 1] or (b) ` = min(|T |, |S|) holds. Lexicographic Order. We denote the lexicographic order with ≺. Given two string S and T , then S ≺ T if S is a prefix of T or there exists an integer ` with 1 ≤ ` ≤ min(|S|, |T |) such that S[1..` − 1] = T [1..` − 1] and S[`] < T [`]. We write S ≺ω T if the infinite concatenation Sω := SSS ··· is lexicographically smaller than T ω := TTT ··· . For instance, ab ≺ aba but aba ≺ω ab. Support Data Structures. Given a string T ∈ Σ∗, a character c ∈ Σ, and an integer j, the rank query T.rankc(j) counts the occurrences of c in T [1..j], and the select query T.selectc(j) gives the position of the j-th c in T . We stipulate that rankc(0) = selectc(0) = 0. 2.2 Lyndon Words Given a string T = T [1..n], its i-th conjugate conji(T ) is defined as T [i + 1..n]T [1..i] for an integer i with 0 ≤ i ≤ n − 1. We say that T and every of its conjugates belongs to the conjugate class conj(T ) := {conj0(T ),..., conjn−1(T )}. If a conjugate class contains exactly H. Bannai, J. Kärkkäinen, D. Köppl, and M. Pi¸atkowski 17:3 one conjugate that is lexicographically smaller than all other conjugates, then this conjugate is called a Lyndon word [16]. Equivalently, a string T is said to be a Lyndon word if and only if T ≺ S for every proper suffix S of T . A consequence is that a Lyndon word is border-free, i.e., there is no Lyndon word T = SUS with S ∈ Σ+ and U ∈ Σ∗.A pre-Lyndon word is a string that is a prefix of a Lyndon word. The Lyndon factorization [7] of T ∈ Σ+ is the factorization of T into a sequence of + lexicographically non-increasing Lyndon words T1 ··· Tt, where (a) each Tx ∈ Σ is a Lyndon word, and (b) Tx ≥ Tx+1 for each 1 ≤ x < t. I Lemma 1 ([8, Algo. 2.1]). The Lyndon-factorization of a string can be computed in linear time. Each Lyndon word Tx is called a Lyndon factor. We denote the multiset of T ’s Lyndon factors by LynF(T ) := {T1,...,Tt}. There is a bijection between LynF(T ) and T in such a sense that LynF(T ) uniquely defines T . That is because we can restore T by 1. sorting the Lyndon factors of LynF(T ) in lexicographically descending order, and 2. subsequently concatenating them. The last factor Tt is special, as it has the following property: I Lemma 2 ([8, Prop. 1.9]). The last Lyndon factor of a string T is the smallest suffix of T . We borrow from [13, Sect. 2.2] the notation lfsT (j) := Tj ··· Tt for the suffix of T starting with the j-th Lyndon factor. For what follows, we fix a string T [1..n] over an alphabet Σ with size σ. We use the string T := acababdababcababbab as our running example. Its Lyndon factorization is LynF(T ) = {ac, ababd, ababc, ababb, ab}. The suffix lfsT (4) is T4T5 = ababbab. 2.3 Bijective Burrows-Wheeler transform We denote the bijective BWT of T by BBWT, where BBWT[i] is the last character of the i-th string in the list storing the conjugates of all Lyndon factors T1,...,Tt of T sorted with respect to ≺ω. Figure 1 shows BBWT for our running example. Gill and Scott [12] and Mantaci et al. [19] postulated that BBWT can be built in linear time. However, we could only verify an algorithm running in O(n lg n/ lg lg n) time in the word RAM model with σ = nO(1), provided by Bonomo et al. [3]. This algorithm processes all Lyndon factors in their lexicographical order, starting with the largest one.

Load more