Alternative Algorithms for Lyndon Factorization⋆
Total Page:16
File Type:pdf, Size:1020Kb
Alternative Algorithms for Lyndon Factorization? Sukhpal Singh Ghuman1, Emanuele Giaquinta2, and Jorma Tarhio1 1 Department of Computer Science and Engineering, Aalto University P.O.B. 15400, FI-00076 Aalto, Finland fSukhpal.Ghuman,[email protected] 2 Department of Computer Science, P.O.B. 68, FI-00014 University of Helsinki, Finland [email protected] Abstract. We present two variations of Duval's algorithm for computing the Lyndon factorization of a word. The first algorithm is designed for the case of small alphabets and is able to skip a significant portion of the characters of the string, for strings containing runs of the smallest character in the alphabet. Experimental results show that it is faster than Duval's original algorithm, more than ten times in the case of long DNA strings. The second algorithm computes, given a run-length encoded string R of length ρ, the Lyndon factorization of R in O(ρ) time and constant space. 1 Introduction Given two strings w and w0, w0 is a rotation of w if w = uv and w0 = vu, for some strings u and v. A string is a Lyndon word if it is lexicographically smaller than all its proper rotations. Every string has a unique factorization in Lyndon words such that the corresponding sequence of factors is nonincreasing with respect to lexico- graphical order. This factorization was introduced by Chen, Fox and Lyndon [2]. Duval's classical algorithm [3] computes the factorization in linear time and constant space. The Lyndon factorization is a key ingredient in a recent method for sorting the suffixes of a text [8], which is a fundamental step in the construction of the Burrows- Wheeler transform and of the suffix array, as well as in the bijective variant of the Burrows-Wheeler transform [4] [6]. The Burrows-Wheeler transform is an invertible transformation of a string, based on the sorting of its rotations, while the suffix array is a lexicographically sorted array of the suffixes of a string. They are the basis for important data compression methods and text indexes. Although Duval's algorithm runs in linear time and is thus efficient, it can still be useful to further improve the time for the computation of the Lyndon factorization in the cases where the string is either huge or compressible and given in a compressed form. Various alternative algorithms for the Lyndon factorization have been proposed in the last twenty years. Apostolico and Crochemore presented a parallel algorithm [1], while Roh et al. described an external memory algorithm [10]. Recently, I et al. showed how to compute the Lyndon factorization of a string given in grammar-compressed form and in Lempel-Ziv 78 encoding [5]. In this paper, we present two variations of Duval's algorithm. The first variation is designed for the case of small alphabets like the DNA alphabet fa, c, g, tg. If the string contains runs of the smallest character, the algorithm is able to skip a significant portion of the characters of the string. In our experiments, the new algorithm is more than ten times faster than the original one for long DNA strings. ? Supported by the Academy of Finland (grant 134287). Paper Submitted to PSC The second variation is for strings compressed with run-length encoding. The run-length encoding of a string is a simple encoding where each maximal consecutive sequence of the same symbol is encoded as a pair consisting of the symbol plus the length of the sequence. Given a run-length encoded string R of length ρ, our algorithm computes the Lyndon factorization of R in O(ρ) time and uses constant space. It is thus preferable to Duval's algorithm in the cases in which the strings are stored or maintained in run-length encoding. 2 Basic definitions Let Σ be a finite ordered alphabet of symbols and let Σ∗ be the set of words (strings) over Σ ordered by lexicographic order. The empty word " is a word of length 0. Let also Σ+ be equal to Σ∗ nf"g. Given a word w, we denote with jwj the length of w and with w[i] the i-th symbol of w, for 0 ≤ i < jwj. The concatenation of two words u and v is denoted by uv. Given two words u and v, v is a substring of u if there are indices 0 ≤ i; j < juj such that v = u[i]:::u[j]. If i = 0 (j = juj−1) then v is a prefix (suffix) of u. We denote by u[i::j] the substring of u starting at position i and ending at position j. For i > j u[i::j] = ". We denote by uk the concatenation of k u's, for u 2 Σ+ and k ≥ 1. The longest border of a word w, denoted with β(w), is the longest proper prefix of w which is also a suffix of w. Let lcp(w; w0) denote the length of the longest common prefix of words w and w0. We write w < w0 if either lcp(w; w0) = jwj < jw0j, i.e., if w is a proper prefix of w0, or if w[lcp(w; w0)] < w0[lcp(w; w0)]. For any 0 ≤ i < jwj, rot(w; i) = w[i::jwj − 1]w[0::i − 1] is a rotation of w. A Lyndon word is a word w such that w < rot(w; i), for 1 ≤ i < jwj. Given a Lyndon word w, the following properties hold: 1. jβ(w)j = 0; 2. either jwj = 1 or w[0] < w[jwj − 1]. Both properties imply that no word ak, for a 2 Σ, k ≥ 2, is a Lyndon word. The following result is due to Chen, Fox and Lyndon [7]: Theorem 1. Any word w admits a unique factorization CFL(w) = w1; w2; : : : ; wm, such that wi is a Lyndon word, for 1 ≤ i ≤ m, and w1 ≥ w2 ≥ ::: ≥ wm. The run-length encoding (RLE) of a word w, denoted by rle(w), is a sequence of pairs (runs) h(c1; l1); (c2; l2; );:::; (cρ; lρ)i such that ci 2 Σ, li ≥ 1, ci 6= ci+1 for 1 ≤ i < r, l1 l2 lρ and w = c1 c2 : : : cρ . The interval of positions in w of the factor wi in the Lyndon Pi−1 Pi factorization of w is [ai; bi], where ai = j=1 jwjj, bi = j=1 jwjj − 1. Similarly, rle rle rle Pi−1 the interval of positions in w of the run (ci; li) is [ai ; bi ] where ai = j=1 lj, rle Pi bi = j=1 lj − 1. 3 Duval's algorithm In this section we briefly describe Duval's algorithm for the computation of the Lyn- don factorization of a word. Let L be the set of Lyndon words and let P = fw j w 2 Σ+ and wΣ∗ \ L 6= ;g ; 2 Alternative Algorithms for Lyndon Factorization LF-Duval(w) 1. k 0 2. while k < jwj do 3. i k + 1 4. j k + 2 5. while true do 6. if j = jwj + 1 or w[j − 1] < w[i − 1] then 7. while k < i do 8. output(w[k::k + j − i]) 9. k k + j − i 10. break 11. else 12. if w[j − 1] > w[i − 1] then 13. i k + 1 14. else 15. i i + 1 16. j j + 1 Figure 1. Duval's algorithm to compute the Lyndon factorization of a string. be the set of nonempty prefixes of Lyndon words. Let also P 0 = P [ fck j k ≥ 2g, where c is the maximum symbol in Σ. Duval's algorithm is based on the following Lemmas, proved in [3]: + 0 Lemma 2. Let w 2 Σ and w1 be the longest prefix of w = w1w which is in L. We 0 have CFL(w) = w1CFL(w ). Lemma 3. P 0 = f(uv)ku j u 2 Σ∗; v 2 Σ+; k ≥ 1 and uv 2 Lg. Lemma 4. Let w = (uav0)ku, with u; v0 2 Σ∗, a 2 Σ, k ≥ 1 and uav0 2 L. The following propositions hold: 1. For a0 2 Σ and a > a0, wa0 2= P 0; 2. For a0 2 Σ and a < a0, wa0 2 L; 3. For a0 = a, wa0 2 P 0 n L. Lemma 2 states that the computation of the Lyndon factorization of a word w 0 can be carried out by computing the longest prefix w1 of w = w1w which is a Lyndon word and then recursively restarting the process from w0. Lemma 3 states that the nonempty prefixes of Lyndon words are all of the form (uv)ku, where u 2 Σ∗; v 2 Σ+; k ≥ 1 and uv 2 L. By the first property of Lyndon words, the longest prefix of (uv)ku which is in L is uv. Hence, if we know that w = (uv)kuav0, (uv)ku 2 P 0 but (uv)kua2 = P 0, then by Lemma 2 and by induction we have CFL(w) = 0 w1w2 : : : wk CFL(uav ), where w1 = w2 = ::: = wk = uv. Finally, Lemma 4 explains how to compute, given a word w 2 P 0 and a symbol a 2 Σ, whether wa 2 P 0, and thus makes it possible to compute the factorization using a left to right parsing. Note that, given a word w 2 P 0 with jβ(w)j = i, we have w[0::jwj − i − 1] 2 L and q jwj w = (w[0::jwj − i − 1]) w[0::r − 1] with q = b jw|−i c and r = jwj mod (jwj − i).