A SIMPLE REPRESENTATION OF SUBWORDS OF THE FIBONACCI WORD

BARTOSZ WALCZAK

Abstract. We introduce a representation of subwords of the infinite Fibonacci word f∞ by a specific concatenation of finite Fibonacci words. It is unique and easily computable by backward processing of the given word. This provides an efficient recognition algorithm for subwords of f∞ as well as a full description of their occurrences in f∞. Our representation yields a natural notion of rank of a subword, which explains main properties of subwords of f∞ in a unified way.

1. Introduction

Fibonacci words are strings over $\{a, b\}$ defined inductively as follows:

$$f_1 = a, \qquad f_2 = ab, \qquad f_n = f_{n-1} f_{n-2} \quad \text{for } n \ge 3.$$
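The inductive definition can be tried out directly. The following is a minimal sketch in Python; the helper name `fib_word` is ours and does not come from the paper.

```python
def fib_word(n: int) -> str:
    """Return the finite Fibonacci word f_n (f_1 = 'a', f_2 = 'ab')."""
    if n < 1:
        raise ValueError("n must be at least 1")
    prev, cur = "a", "ab"  # f_1 and f_2
    if n == 1:
        return prev
    for _ in range(n - 2):
        # f_k = f_{k-1} f_{k-2}
        prev, cur = cur, cur + prev
    return cur


words = [fib_word(n) for n in range(1, 8)]
lengths = [len(w) for w in words]  # 1, 2, 3, 5, 8, 13, 21
```

With this indexing the lengths follow the Fibonacci recurrence starting from 1 and 2, and each word is a prefix of the next one.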

This construction converges to the infinite Fibonacci word $f_\infty$. Every $f_n$ is a prefix of $f_\infty$ and of $f_m$ for $m > n$. The lengths of Fibonacci words are Fibonacci numbers: $F_n = |f_n|$ (so $F_1 = 1$, $F_2 = 2$, $F_3 = 3$, and so on).

Fibonacci words are famous and important due to their many interesting combinatorial properties. See the book [6] for a good survey. In particular, the infinite Fibonacci word is the simplest example of a Sturmian word [4].

We focus on subwords of $f_\infty$. We say that $u$ is a subword of $f_\infty$ if there is a position $i$ such that $f_\infty[i \ldots i + |u| - 1] = u$ (the letters of $f_\infty$ being numbered from zero). Every such $i$ is called an occurrence of $u$ in $f_\infty$.

The structure of the occurrences of subwords in $f_\infty$ was first fully described by Chuan and Ho [3]. However, their methods and proofs are quite involved. Independently, Rytter [5] discovered a very regular structure of the subword graph of $f_\infty$. Using this structure he derived a simpler description of the occurrences of subwords in $f_\infty$ and gave an efficient algorithm for recognizing subwords of $f_\infty$. A different algorithm for finding all occurrences of a word in a finite Fibonacci word is presented in [1].

We show that subwords of $f_\infty$ allow a very simple representation. It has the form of a specific concatenation of finite Fibonacci words together with an integer offset. It is unique and has size logarithmic in the length of the subword. An attempt to construct such a representation for a given string succeeds only if the string is a subword of $f_\infty$. This leads to a recognition algorithm that is similar to Rytter's, but our argument bypasses the subword graph. As another consequence of our representation we obtain an alternative proof of the description of the occurrences of subwords in $f_\infty$ shown in [5]. In fact, the set of occurrences is completely determined by two parameters of the representation: the rank and the offset, while the density of the occurrences depends only on the rank. We provide a formula for the number of distinct subwords of a prefix of $f_\infty$. Finally, we present an efficient algorithm for deciding whether a concatenation of finite Fibonacci words given by their indices is a subword of $f_\infty$.

The structure of the subword graph of $f_\infty$ as described in [5] and our notion of representation turn out to be very similar. The main difference is that in our approach the representation is computed by analyzing the string backwards. Moreover, by avoiding the conceptual complexity of subword graphs and going directly to the representation, we explain the nature of subwords of $f_\infty$ in a transparent way.

Date: February 4, 2010.
Key words and phrases. Fibonacci word, algorithms, combinatorial problems.

2. Representation of prefixes of $f_\infty$

We say that a word $p$ is f-representable if it is a concatenation of Fibonacci words of the form

$$(\ast) \qquad p = f_{k_s} f_{k_{s-1}} \cdots f_{k_1}$$

with $k_1 \in \{1, 2\}$ and $k_i \in \{k_{i-1} + 1, k_{i-1} + 2\}$ for $i = 2, \ldots, s$. We call $(\ast)$ an f-representation of $p$. We will prove later in this section that a word is f-representable if and only if it is a prefix of $f_\infty$. The following algorithm computes an f-representation of a given word $p$ with a simple right-to-left procedure. The algorithm rejects $p$ if it is not f-representable.

Algorithm 1: f-representation of $p$
input: $p$
output: $k_i, k_{i-1}, \ldots, k_1$

  $k_0 := 0$
  for $i := 1, 2, \ldots$ do
    choose $k_i \in \{k_{i-1} + 1, k_{i-1} + 2\}$ so that $f_{k_i}$ and $p$ end with the same letter
    if $p = f_{k_i}$ then accept   {$f_{k_i} f_{k_{i-1}} \cdots f_{k_1}$ is the f-representation}
    if $f_{k_i}$ is not a suffix of $p$ then reject
    remove $f_{k_i}$ from the end of $p$
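Algorithm 1 translates almost line by line into code. The sketch below is one possible Python rendering, not the author's implementation; the names `fib_word` and `f_representation` are ours, and the Fibonacci word is rebuilt from scratch at every step for brevity rather than efficiency.

```python
def fib_word(n: int) -> str:
    """Finite Fibonacci word f_n, with f_1 = 'a' and f_2 = 'ab'."""
    prev, cur = "a", "ab"
    if n == 1:
        return prev
    for _ in range(n - 2):
        prev, cur = cur, cur + prev
    return cur


def f_representation(p: str):
    """Return [k_s, ..., k_1] if p is f-representable, otherwise None."""
    if not p:
        return None
    indices = []  # collected right to left: k_1, k_2, ...
    k = 0
    while True:
        # f_j ends with 'a' for odd j and with 'b' for even j,
        # so the last letter of p fixes the parity of the next index.
        wanted_parity = 1 if p[-1] == "a" else 0
        k = k + 1 if (k + 1) % 2 == wanted_parity else k + 2
        word = fib_word(k)
        indices.append(k)
        if p == word:
            return indices[::-1]  # accept
        if not p.endswith(word):
            return None  # reject
        p = p[: -len(word)]


print(f_representation("abaababaabaababa"))  # [5, 4, 2, 1]
print(f_representation("abab"))              # None
```

The loop always terminates: the candidate words grow at every step, so eventually either the whole remaining word is matched or the suffix test fails.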

Since $f_j$ and $f_{j+1}$ always end with different letters, there is only one possible choice of $k_i$ at each step. Therefore the f-representation is unique. Define the rank of $p$ to be the leftmost index $k_s$ in $(\ast)$. It is the most important parameter of the f-representation.

Example 1. We compute the f-representation for $p = abaababaabaababa$:

  i   current p          k_i   f_{k_i}
  1   abaababaabaababa   1     f_1 = a
  2   abaababaabaabab    2     f_2 = ab
  3   abaababaabaab      4     f_4 = abaab
  4   abaababa           5     f_5 = abaababa

The resulting f-representation of $p$ is $f_5 f_4 f_2 f_1$, and the rank of $p$ is 5.

Denote by $u'$ the word $u$ with the last letter removed, and by $u''$ the word $u$ with the last two letters removed. Let $u \sqsubseteq v$ denote that $u$ is a finite nonempty prefix of $v$.

Theorem 2.
(1) $p \sqsubseteq f''_{n+2}$ iff $p$ is f-representable of rank at most $n$.
(2) $p \sqsubseteq f_\infty$ iff $p$ is f-representable.

Proof. We only need to prove part (1), as part (2) is a direct consequence of it. The cases $n = 1$ ($f''_3 = a$) and $n = 2$ ($f''_4 = aba$) are easy to verify. Now suppose $n \ge 3$ and proceed by induction on $n$.

To see the 'only if' part, first note that $f''_{n+2} = f_{n+1} f''_n$. There are four cases:
  i. $p \sqsubseteq f''_{n+1}$,
  ii. $p = f'_{n+1} = f_n f_{n-2} \cdots f_{i+2} f_i$ ($i \in \{1, 2\}$, $i \equiv n \pmod 2$),
  iii. $p = f_{n+1} = f_n f_{n-2} \cdots f_{i+3} f_{i+1} f_i$ ($i \in \{1, 2\}$, $i \equiv n + 1 \pmod 2$),
  iv. $p = f_{n+1} q$, $q \sqsubseteq f''_n$.
The induction hypothesis applies directly to case i. In cases ii and iii the f-representations are given explicitly. In case iv, by the induction hypothesis, $q$ has an f-representation

$$q = f_{k_r} f_{k_{r-1}} \cdots f_{k_1} \qquad (k_r \le n - 2).$$

It yields the following f-representation of $p$:

$$p = \underbrace{f_n f_{n-2} \cdots f_{\ell+3} f_{\ell+1} f_\ell}_{f_{n+1}} \; \underbrace{f_{k_r} f_{k_{r-1}} \cdots f_{k_1}}_{q},$$

where $\ell \in \{k_r + 1, k_r + 2\}$, $\ell \equiv n + 1 \pmod 2$. This shows the 'only if' part.

For the converse implication let $p = f_{k_s} f_{k_{s-1}} \cdots f_{k_1}$ be the f-representation of $p$. If $k_s \le n - 1$ then, by the induction hypothesis, $p \sqsubseteq f''_{n+1} \sqsubseteq f''_{n+2}$. Otherwise, find the maximal $i$ such that $k_{i+1} = k_i + 1$ to distinguish three possible situations:
  i. $p = f_n f_{n-2} \cdots f_{k_1+2} f_{k_1} = f'_{n+1} \sqsubseteq f''_{n+2}$,
  ii. $p = f_n f_{n-2} \cdots f_{k_1+3} f_{k_1+1} f_{k_1} = f_{n+1} \sqsubseteq f''_{n+2}$,
  iii. $p = \underbrace{f_n f_{n-2} \cdots f_{k_i+3} f_{k_i+1} f_{k_i}}_{f_{n+1}} \; \underbrace{f_{k_{i-1}} \cdots f_{k_1}}_{q}$ ($k_{i-1} \le n - 2$).
In the last case $q \sqsubseteq f''_n$ by the induction hypothesis, and thus $p \sqsubseteq f_{n+1} f''_n = f''_{n+2}$. □

Remark. Theorem 2 exhibits the similarity between Rytter's approach and ours. Regarding the fact (easy to prove inductively) that $f''_{n+2}$ is a palindrome, the f-representation of $f''_{n+2}$ and its 'compacted subword graph' defined in [5] are equivalent. Part (2) of Theorem 2 is also a special case of [2, Lemma 3.10], obtained by taking Chuan's $a_i = 1$ for all $i$.

Theorem 2 implies that there is exactly one f-representable word of each length; the shortest f-representable word of rank $n$ is $f'_{n+1}$, while the longest one is $f''_{n+2}$. Hence the number of all f-representable words of rank $n$ is

$$|f''_{n+2}| - |f'_{n+1}| + 1 = F_{n+2} - F_{n+1} = F_n.$$

Note that any f-representation can be encoded using the differences between indices of Fibonacci words that are consecutive factors in the f-representation. They are always 1 or 2. Consequently, the size of the f-representation of $p$ is $\Theta(\log |p|)$, as $\mathrm{rank}(p)$ is $\Theta(\log |p|)$.
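The count $F_n$ of f-representable words of rank $n$ can be checked experimentally. The sketch below, with helper names of our own choosing, computes the rank of every nonempty prefix of $f''_8$ and tallies the ranks; it is only a sanity check for small $n$, not part of the argument.

```python
from collections import Counter


def fib_word(n: int) -> str:
    """Finite Fibonacci word f_n, with f_1 = 'a' and f_2 = 'ab'."""
    prev, cur = "a", "ab"
    if n == 1:
        return prev
    for _ in range(n - 2):
        prev, cur = cur, cur + prev
    return cur


def rank(p: str):
    """Rank of p (leftmost index of its f-representation), or None."""
    k = 0
    while p:
        wanted_parity = 1 if p[-1] == "a" else 0
        k = k + 1 if (k + 1) % 2 == wanted_parity else k + 2
        word = fib_word(k)
        if p == word:
            return k
        if not p.endswith(word):
            return None
        p = p[: -len(word)]
    return None  # the empty word has no f-representation


max_rank = 6
longest = fib_word(max_rank + 2)[:-2]  # f''_{max_rank + 2}
histogram = Counter(rank(longest[:m]) for m in range(1, len(longest) + 1))
counts = [histogram[n] for n in range(1, max_rank + 1)]
print(counts)  # [1, 2, 3, 5, 8, 13]
```

The printed list agrees with $F_1, \ldots, F_6$ under the paper's indexing $F_1 = 1$, $F_2 = 2$.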

3. Representation of subwords of f∞

It follows from Theorem 2 that a string $u$ is a subword of $f_\infty$ iff $u$ is a suffix of an f-representable word $q$. Any such $q$ has an f-representation

$$q = f_{k_s} \cdots f_{k_{r+1}} f_{k_r} \cdots f_{k_1},$$

where by $r$ we denote the smallest number such that $u$ is a suffix of $f_{k_r} \cdots f_{k_1}$. The letters that determine the choice of $k_1, \ldots, k_r$ in Algorithm 1 come from $u$, so the part $f_{k_r} \cdots f_{k_1}$ depends only on $u$ and is the same for all possible $q$'s. Therefore, $p = f_{k_r} \cdots f_{k_1}$ is the shortest f-representable word containing $u$ as a suffix; any other one must also contain $p$ as a suffix. We call $f_{k_r} \cdots f_{k_1}$ the suffix-representation of $u$. Define $\mathrm{rank}(u) = \mathrm{rank}(p) = k_r$ and $\mathrm{offset}(u) = |p| - |u|$. Since $u$ (considered as a suffix of $p$) must have at least one letter in common with the leftmost factor $f_{k_r}$ in its suffix-representation, we get $\mathrm{offset}(u) \in \{0, \ldots, F_{k_r} - 1\}$. We have thus obtained the representation of $u$ announced in the Introduction: $u$ is uniquely characterized by its suffix-representation and its offset.

Algorithm 2: suffix-representation of $u$
input: $u$
output: $k_i, k_{i-1}, \ldots, k_1$

  $k_0 := 0$
  for $i := 1, 2, \ldots$ do
    choose $k_i \in \{k_{i-1} + 1, k_{i-1} + 2\}$ so that $f_{k_i}$ and $u$ end with the same letter
    if $u$ is a suffix of $f_{k_i}$ then accept
      {$f_{k_i} f_{k_{i-1}} \cdots f_{k_1}$ is the suffix-representation, $F_{k_i} - |u|$ is the offset}
    if $f_{k_i}$ is not a suffix of $u$ then reject
    remove $f_{k_i}$ from the end of $u$

Algorithm 2 computes the suffix-representation of a given word $u$ or decides that $u$ is not a subword of $f_\infty$. It follows the lines of Algorithm 1 with the only difference in the accepting condition: it checks whether $u$ is a suffix of $f_{k_i}$.

Example 3. Apply Algorithm 2 to compute the suffix-representation, the rank, and the offset of $u = babaabaababa$:

  i   current u      k_i   f_{k_i}
  1   babaabaababa   1     f_1 = a
  2   babaabaabab    2     f_2 = ab
  3   babaabaab      4     f_4 = abaab
  4   baba           5     f_5 = abaababa

At step 4 the current string $baba$ is a suffix of $f_5$. Therefore, $p = f_5 f_4 f_2 f_1$ is the shortest f-representable word containing $u$ as a suffix, and $\mathrm{rank}(u) = 5$. Moreover, $baba$ is 4 letters shorter than $f_5$, so $\mathrm{offset}(u) = 4$.
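Algorithm 2 can be sketched in Python in the same way as Algorithm 1. The function name `suffix_representation` and its return convention (a pair of the index list and the offset, or `None` on rejection) are our assumptions; the simple version below recomputes each Fibonacci word and therefore does not attain the constant-space bound.

```python
def fib_word(n: int) -> str:
    """Finite Fibonacci word f_n, with f_1 = 'a' and f_2 = 'ab'."""
    prev, cur = "a", "ab"
    if n == 1:
        return prev
    for _ in range(n - 2):
        prev, cur = cur, cur + prev
    return cur


def suffix_representation(u: str):
    """Return ([k_r, ..., k_1], offset), or None if u is rejected."""
    if not u:
        return None
    indices = []
    k = 0
    while True:
        # The last letter of u fixes the parity of the next index.
        wanted_parity = 1 if u[-1] == "a" else 0
        k = k + 1 if (k + 1) % 2 == wanted_parity else k + 2
        word = fib_word(k)
        indices.append(k)
        if word.endswith(u):
            # Accept: the rest of u fits inside the leftmost factor.
            return indices[::-1], len(word) - len(u)
        if not u.endswith(word):
            return None  # reject
        u = u[: -len(word)]


print(suffix_representation("babaabaababa"))  # ([5, 4, 2, 1], 4)
print(suffix_representation("aaa"))           # None
```

The first call reproduces Example 3; the second rejects $aaa$, which never occurs in $f_\infty$.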

Algorithm 2 works in linear time. Moreover, after checking at step $i$ that $f_{k_i}$ is a suffix of the current word $u$, this suffix serves as a template for $f_{k_{i+1}}$ in the verification at step $i + 1$. This way the algorithm achieves constant space complexity (by constant space we mean a constant number of integers not greater than $|u|$).

Now we describe the structure of occurrences of subwords in $f_\infty$ in terms of rank and offset. Our description is equivalent to those by Chuan and Ho [3, Theorem 3.4] and by Rytter [5, Fact 11]. Define

$$Z_n = \Bigl\{ \sum_{i=n}^{m} \alpha_i F_i \;\Big|\; m \ge n,\ \alpha_i \in \{0, 1\},\ \forall i\ (\alpha_i = 1 \lor \alpha_{i+1} = 1) \Bigr\}.$$

Remark. In [5] this set was defined as

$$Z_n = \Bigl\{ \sum_{i=n}^{m} \beta_i F_i \;\Big|\; m \ge n,\ \beta_i \in \{0, 1\},\ \forall i\ (\beta_i = 0 \lor \beta_{i+1} = 0) \Bigr\}.$$

Therefore, $Z_n$ is the set of numbers that do not use $F_1, \ldots, F_{n-1}$ in their Zeckendorf representation. The conversion between the two representations of numbers in $Z_n$ is easy to obtain.

Theorem 4. The set of positions at which a subword $u$ of $f_\infty$ of rank $n$ occurs in $f_\infty$ is $\{z + \mathrm{offset}(u) \mid z \in Z_{n+1}\}$.

Proof. Each occurrence of $u$ in $f_\infty$ corresponds to a prefix $q$ of $f_\infty$ that contains this occurrence of $u$ as a suffix. In the f-representation of $q$,

$$q = f_{k_s} \cdots f_{k_{r+1}} f_{k_r} \cdots f_{k_1},$$

the part $f_{k_r} \cdots f_{k_1}$ is the suffix-representation of $u$ and hence is determined by $u$. The part $f_{k_s} \cdots f_{k_{r+1}}$ has no restrictions except

$$k_i \in \{k_{i-1} + 1, k_{i-1} + 2\} \quad \text{for } i = r + 1, \ldots, s.$$

The position of the occurrence of $u$ is

$$|q| - |u| = |f_{k_s} \cdots f_{k_{r+1}}| + |f_{k_r} \cdots f_{k_1}| - |u| = |f_{k_s} \cdots f_{k_{r+1}}| + \mathrm{offset}(u).$$

Since $k_r = \mathrm{rank}(u) = n$, the set of possible lengths of $f_{k_s} \cdots f_{k_{r+1}}$ is exactly the set $Z_{n+1}$. □
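Theorem 4 lends itself to a direct experiment. The sketch below enumerates the elements of $Z_{n+1}$ below a bound by following index chains with steps of 1 or 2, shifts them by the offset, and compares the result with the occurrences found by scanning $f_{12}$. All function names are ours, and the rank and offset of each test word are supplied by hand (for instance rank 5 and offset 4 from Example 3).

```python
def fib_word(n: int) -> str:
    """Finite Fibonacci word f_n, with f_1 = 'a' and f_2 = 'ab'."""
    prev, cur = "a", "ab"
    if n == 1:
        return prev
    for _ in range(n - 2):
        prev, cur = cur, cur + prev
    return cur


def fib(n: int) -> int:
    """F_n = |f_n|, so F_1 = 1 and F_2 = 2."""
    a, b = 1, 2
    for _ in range(n - 1):
        a, b = b, a + b
    return a


def z_set(n: int, bound: int) -> set:
    """Elements of Z_n that are smaller than bound (bound > 0)."""
    found = set()

    def extend(last: int, total: int) -> None:
        found.add(total)
        for k in (last + 1, last + 2):  # the next index grows by 1 or 2
            if total + fib(k) < bound:
                extend(k, total + fib(k))

    extend(n - 1, 0)  # the first index used is n or n + 1
    return found


def predicted(rank: int, offset: int, length: int, text_len: int) -> list:
    """Occurrences predicted by Theorem 4 that fit inside the text."""
    return sorted(z + offset for z in z_set(rank + 1, text_len)
                  if z + offset + length <= text_len)


def actual(u: str, text: str) -> list:
    return [i for i in range(len(text) - len(u) + 1) if text.startswith(u, i)]


text = fib_word(12)
u = "babaabaababa"  # Example 3: rank 5, offset 4
print(predicted(5, 4, len(u), len(text)) == actual(u, text))  # True
```

Restricting to occurrences that fit inside $f_{12}$ is legitimate because $f_{12}$ is a prefix of $f_\infty$.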

Remark. In particular, $\mathrm{offset}(u)$ is the first position at which $u$ occurs in $f_\infty$. Since $\mathrm{rank}(u)$ is the least $n$ such that $u$ is a subword of $f''_{n+2}$, this shows that Theorem 4 is in fact a reformulation of [5, Fact 11] in terms of rank and offset. To link Theorem 4 with [3, Theorem 3.4] one has to observe (which is not difficult) that the indices $N_{kj}$ and $J_{kj}$ defined in [3] are equal to $\mathrm{rank}(z^k_j)$ and $\mathrm{offset}(z^k_j)$, respectively, where $z^k_0, \ldots, z^k_k$ are the $k + 1$ subwords of $f_\infty$ of length $k$ listed in the order of their first occurrences in $f_\infty$.

Note that $|Z_{n+1} \cap \{0, \ldots, F_m - 1\}| = F_{m-n}$. Hence the average density of $Z_{n+1}$ (and therefore the density of the occurrences of $u$ in $f_\infty$) is

$$\lim_{m \to \infty} \frac{F_{m-n}}{F_m} = \varphi^{-n},$$

with $\varphi$ denoting the golden ratio. We conclude this section with the following application of our method:

Theorem 5. Let $p$ be a prefix of $f_\infty$ of rank $n$. The number of distinct nonempty subwords of $p$ is $(|p| - F_n + 2) F_n - 1$.

Proof. The number of distinct subwords of $f_\infty$ of any rank $k$ is $F_k^2$: there are $F_k$ possible f-representations of rank $k$ and $F_k$ possible offsets in each of them. Since every word of rank at most $n - 1$ is a subword of $p$, there are $F_1^2 + \ldots + F_{n-1}^2 = F_{n-1} F_n - 1$ subwords of $p$ of rank at most $n - 1$. Thus it suffices to count the subwords of $p$ of rank $n$. Every subword of $p$ of rank $n$ is a suffix of some $q \sqsubseteq p$ of rank $n$. There are $|p| - F_{n+1} + 2$ candidates for $q$ ($f'_{n+1}$ being the shortest one). Each one gives rise to $F_n$ subwords of $p$ of rank $n$ with offsets $0, \ldots, F_n - 1$. Therefore, the total number of distinct nonempty subwords of $p$ is

$$(|p| - F_{n+1} + 2) F_n + F_{n-1} F_n - 1 = (|p| - F_n + 2) F_n - 1.$$ □

In particular, since the Fibonacci word $f_n$ with $n \ge 3$ has rank $n - 1$, the number of its distinct nonempty subwords is $(F_{n-2} + 2) F_{n-1} - 1$.
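The formula of Theorem 5 is easy to compare with a brute-force count on short prefixes. The following sketch uses hypothetical helper names (`rank`, `count_subwords`, `theorem5_count`) and checks every nonempty prefix of $f_8$; it illustrates the statement and is not a proof.

```python
def fib_word(n: int) -> str:
    """Finite Fibonacci word f_n, with f_1 = 'a' and f_2 = 'ab'."""
    prev, cur = "a", "ab"
    if n == 1:
        return prev
    for _ in range(n - 2):
        prev, cur = cur, cur + prev
    return cur


def rank(p: str):
    """Rank of a prefix p of the infinite word (None if p is not one)."""
    k = 0
    while p:
        wanted_parity = 1 if p[-1] == "a" else 0
        k = k + 1 if (k + 1) % 2 == wanted_parity else k + 2
        word = fib_word(k)
        if p == word:
            return k
        if not p.endswith(word):
            return None
        p = p[: -len(word)]
    return None


def count_subwords(p: str) -> int:
    """Brute-force number of distinct nonempty subwords of p."""
    return len({p[i:j] for i in range(len(p)) for j in range(i + 1, len(p) + 1)})


def theorem5_count(p: str) -> int:
    """(|p| - F_n + 2) F_n - 1, where n is the rank of p."""
    f_n = len(fib_word(rank(p)))
    return (len(p) - f_n + 2) * f_n - 1


prefixes = [fib_word(8)[:m] for m in range(1, len(fib_word(8)) + 1)]
print(all(count_subwords(p) == theorem5_count(p) for p in prefixes))  # True
print(theorem5_count(fib_word(5)))  # 24
```

For $f_5 = abaababa$ the special case gives $(F_3 + 2) F_4 - 1 = 5 \cdot 5 - 1 = 24$, matching the second line of output.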

4. Concatenations of Fibonacci words

We now show how the concept of suffix-representation yields an efficient test of whether a given concatenation of finite Fibonacci words $u = f_{\ell_1} \cdots f_{\ell_r}$ (with no restrictions on the indices $\ell_1, \ldots, \ell_r$) is a subword of $f_\infty$. The algorithm we present (Algorithm 3) works in $\Theta(r)$ time. It follows the idea of Algorithm 2 and constructs the suffix-representation of $u$ from right to left.

Algorithm 3 uses three sequences: $I$, $R$ and $S$. The elements of $I$ are indices of Fibonacci words. Every element of $R$ and $S$ is a pair $(k, j)$ of indices representing the concatenation

$$f_{(k,j)} = f_k f_{k-2} \cdots f_{j+2} f_j \qquad (k \ge j,\ k \equiv j \pmod 2).$$

For convenience we identify $(j, j)$ with $j$. A sequence $X = x_1, \ldots, x_s$ (each $x_i$ being an index or a pair of indices) represents the concatenation

$$f_X = f_{x_1} \cdots f_{x_s}.$$

At the beginning of the algorithm $I$ is the input sequence $\ell_1, \ldots, \ell_r$, and in each iteration $f_I$ is some prefix of $u$. Let $u_I$ be such that $u = f_I u_I$. The algorithm gradually shortens $I$ and maintains the suffix-representation of $u_I$. After each iteration the suffix-representation of $u_I$ is stored in $R$, and $S$ represents the word that completes $u_I$ to $f_R$, that is, $f_R = f_S u_I$. Suppose $u$ is a subword of $f_\infty$. Then the suffix-representation of $u$ ends with $R$. Therefore, one of $f_S$, $f_I$ is a suffix of the other. If the algorithm discovers that neither of $f_S$, $f_I$ is a suffix of the other, $u$ is rejected. Otherwise, after complete execution of the algorithm, $R$ stores the suffix-representation of $u$.

Algorithm 3: suffix-representation of concatenation $f_{\ell_1} \cdots f_{\ell_r}$
input: $\ell_1, \ldots, \ell_r$
output: $R$

   1  $I := \ell_1, \ldots, \ell_r$; $R := S := \emptyset$; $n := 0$
   2  while $I$ is not empty do
   3    remove $i$ from the end of $I$
   4    if $S$ is empty then
   5      choose $j \in \{n + 1, n + 2\}$ so that $j \equiv i \pmod 2$
   6      if $j = i$ then $R := j, R$; $n := j$
   7      if $j > i$ then $R := j, R$; $n := j$; $S := S, (j - 1, i + 1)$
   8      if $j < i$ then $R := (i - 1, j + 1), j, R$; $n := i - 1$
   9    else if $S$ ends with $j$ then
  10      remove $j$ from the end of $S$
  11      if $j \not\equiv i \pmod 2$ then reject
  12      if $j > i$ then $S := S, (j - 1, i + 1)$
  13      if $j = i - 2$ then $I := I, i - 1$
  14      if $j < i - 2$ then reject
  15    else {$S$ ends with $(k, j)$ such that $k > j$}
  16      remove $(k, j)$ from the end of $S$
  17      if $j \not\equiv i \pmod 2$ then reject
  18      if $j = i$ then $S := S, (k, j + 2)$
  19      if $j > i$ then $S := S, (k, j + 2), (j - 1, i + 1)$
  20      if $j < i$ then reject

Most cases in Algorithm 3 are easy to analyze given the identity

$$f_t = f_{(t-1,\,s+1)} f_s \qquad (t > s,\ t \equiv s \pmod 2).$$

Only line 14 is a little tricky. To see that it correctly rejects $u$, first note that the algorithm keeps the following invariants:
• $n$ is greater than all indices in $S$,
• every two consecutive elements $(\cdot, k)$, $(j, \cdot)$ in $S$ satisfy $k > j + 1$.
Instead of rejecting in line 14 we could set $I := I, i - 1, i - 3, \ldots, j + 3, j + 1$. Suppose we do so. Let $i'$, $j'$ denote the values of the variables $i$, $j$ in the next iteration. We have $i' = j + 1$, and it follows from the above invariants that $j' > j + 1$. Either $j' \not\equiv i' \pmod 2$ and we reject, or after that iteration $I$ ends with $j + 3$ and $S$ ends with $(j' - 1, i' + 1) = (j' - 1, j + 2)$, so we reject in the subsequent iteration.

The algorithm works in $\Theta(r)$ time, as each iteration decreases the value of $2|I| + |S|$.
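As an illustration, Algorithm 3 can be sketched in Python as follows. This is our reading of the pseudocode, not the author's code: every element of $R$ and $S$ is stored as a pair, with $(j, j)$ standing for the single index $j$, the second component in line 12 is taken to be $i + 1$ (matching lines 7 and 19), and $R$ is built in reverse so that each update costs constant time.

```python
def fib_word(n: int) -> str:
    """Finite Fibonacci word f_n, with f_1 = 'a' and f_2 = 'ab'."""
    prev, cur = "a", "ab"
    if n == 1:
        return prev
    for _ in range(n - 2):
        prev, cur = cur, cur + prev
    return cur


def concat_suffix_representation(indices):
    """Return R as a list of pairs (leftmost first), or None on rejection.

    A pair (k, j) stands for f_k f_{k-2} ... f_{j+2} f_j.
    """
    I = list(indices)
    R_rev, S, n = [], [], 0  # R is kept reversed: rightmost factor first
    while I:
        i = I.pop()
        if not S:
            # Choose j in {n + 1, n + 2} with the same parity as i.
            j = n + 1 if (n + 1 - i) % 2 == 0 else n + 2
            if j == i:
                R_rev.append((j, j))
                n = j
            elif j > i:
                R_rev.append((j, j))
                n = j
                S.append((j - 1, i + 1))  # f_j = f_(j-1,i+1) f_i
            else:
                R_rev.append((j, j))      # f_i = f_(i-1,j+1) f_j
                R_rev.append((i - 1, j + 1))
                n = i - 1
        else:
            k, j = S.pop()
            if (j - i) % 2 != 0:
                return None               # lines 11 and 17
            if k == j:                    # S ended with a single index j
                if j > i:
                    S.append((j - 1, i + 1))
                elif j == i - 2:
                    I.append(i - 1)       # f_i = f_{i-1} f_{i-2}
                elif j < i - 2:
                    return None           # line 14
            else:                         # S ended with (k, j), k > j
                if j == i:
                    S.append((k, j + 2))
                elif j > i:
                    S.append((k, j + 2))
                    S.append((j - 1, i + 1))
                else:
                    return None           # line 20
    return R_rev[::-1]


print(concat_suffix_representation([5, 4, 2, 1]))  # [(5, 5), (4, 4), (2, 2), (1, 1)]
print(concat_suffix_representation([1, 1, 1]))     # None
```

The first call recovers the representation $f_5 f_4 f_2 f_1$ from Example 1; the second rejects $f_1 f_1 f_1 = aaa$.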

References

[1] P. Baturo, W. Rytter, Compressed string matching in standard Sturmian words, Theoretical Computer Science 410 (2009) 2804–2810.
[2] W.-F. Chuan, A representation theorem of the suffixes of characteristic sequences, Discrete Applied Mathematics 85 (1998) 47–57.
[3] W.-F. Chuan, H.-L. Ho, Locating factors of the infinite Fibonacci word, Theoretical Computer Science 349 (2005) 429–442.
[4] M. Lothaire, Algebraic Combinatorics on Words, Cambridge University Press, Cambridge, 2002.
[5] W. Rytter, The structure of subword graphs and suffix trees of Fibonacci words, Theoretical Computer Science 363 (2006) 211–223.
[6] B. Smyth, Computing Patterns in Strings, Addison Wesley, Reading, MA, 2003.

Bartosz Walczak, Department of Theoretical Computer Science, Jagiellonian University, ul. Łojasiewicza 6, 30-348 Kraków, Poland
E-mail address: [email protected]