<<

on Words – A Tutorial ⋆

J. Berstel1 and J. Karhum¨aki2⋆⋆

1Institut Gaspard-Monge, Universit´ede Marne-la-Vall´ee, 77454 Marne-la-Vall´ee Cedex 2, France, email: [email protected] 2Department of and Turku Centre for Computer Science, University of Turku, 20014 Turku, Finland, email: [email protected].fi

Table of Contents

1 Introduction ...... 2 1.1 History...... 3 1.2 Notions and notations...... 4 2 Connections...... 5 2.1 Tomatrices ...... 6 2.2 Toalgebra ...... 7 2.3 To algorithmics...... 8 3 Periodicity...... 9 3.1 Fine and Wilf’s theorem...... 9 3.2 Critical factorization theorem ...... 11 3.3 Characterizations for ultimately periodic words ...... 12 4 Dimension properties...... 15 4.1 Defect theorems ...... 16 4.2 Ehrenfeucht Compactness Property ...... 19 5 Unavoidable regularities ...... 19 5.1 Power-free words ...... 19 5.2 Testsets andtest words ...... 23 5.3 Repetition threshold ...... 23 5.4 Unavoidable patterns...... 24 5.5 Shirshov’s theorem...... 25 5.6 Unavoidable sets of words...... 27 6 Complexity ...... 28 6.1 Subword complexity of finite words ...... 28 6.2 Subword complexity of infinite words ...... 29 6.3 Sturmianwords ...... 30 6.4 Episturmian words...... 33 6.5 Hierarchies of complexities ...... 34 6.6 Subword complexity and transcendence ...... 35 6.7 Descriptive and computational complexity ...... 36 7 From words to finite sets of words ...... 38 7.1 Conway’s Problem ...... 38

⋆ This is a slightly improved version of the paper that appeared in Bulletin EATCS, 79 (february 2003), p. 178-228. The work of the second author was done while visiting Institut Gaspard-Monge Universit´ede Marne-la-Vall´ee ⋆⋆ Supported by the Academy of Finland under the grant 44087 7.2 Characterization of commuting sets ...... 41 7.3 Undecidability results for finite sets ...... 42 8 OpenProblems...... 43

1 Introduction

During the last two decades research on combinatorial problems of words, i.e., on Combinatorics on Words, has grown enormously. Although there has been important contributions on words starting from the very beginning of last cen- tury, they were scattered and typically needed as tools to achieve some other goals in mathematics. A notable exception is combinatorial theory, which studies combinatorial problems on words as representing group elements, see [LS77] and [MKS66]. Now, and particularly after the appearance of Lothaire’s book – Combinatorics on Words – in 1983 the topic has become a challenging research topic of its own. In the latest classification of Mathematical Reviews combinatorics on words constitutes its own section under the chapter discrete mathematics related to computer science. Although the applications of words are, by no means, only in computer science the classification catches the basic of the nature of combinatorics on words. Recent developments of the field culminated in Lothaire’s second book – Algebraic Combinatorics on Words – which appeared in 2002. Its more than 500 pages witness the vital stage of the topic. The new book repeats basically nothing from the first one, and actually most of the results were discovered during the last twenty years. A biannual conference – referred to as WORDS – devoted entirely to combinatorics on words has also been created. The fourth event will be in Turku in 2003. A word is a of symbols, finite of infinite, taken from a finite alphabet. A natural environment of a finite word is a . Consequently, words can be seen as a discrete combinatorial objects or discrete algebraic objects in a noncommutative structure. These two facts – discreteness and noncommutativity – are the two fundamental features of words. At the same time they explain why many problems are so difficult. Words are central objects of automata theory, and in fact in any standard model of computing. Even when computing on numbers computers operate on words, i.e., representations of numbers as words. Consequently, on one hand, it is natural to study algorithmic properties of words. On the other hand, the undecidability of problems is most easily stated in terms of words – the Post Cor- respondence Problem being a splendid example. Both these elements of words – algorithmic aspects and undecidability – are visible, often implicitly, throughout our presentation. The goal of the tutorial is to discuss – without aiming to be exhaustive – several typical problems on words, as well as to try to point out several appli- cations. With a few exceptions the proofs are not presented here, however, in some cases we use examples to illustrate the basic ideas. Open problems form an important part of our presentation.

2 The contents of this tutorial is as follows. At the end of this introductory sec- tion we discuss briefly the history of combinatorics on words, and fix some basic terminology. Then in Section 2 we consider connections to other fields of math- ematics and computer science. Section 3 is devoted to the most fundamental notion of words, namely to periodicity. Dimension properties of words consti- tute Section 4, while Section 5 concentrates one of the most studied and most characteristic feature of words, namely unavoidable regularities. Words, indeed, are very suitable objects to formulate such fundamental properties. In Section 6 complexity issues of infinite words are studied from different points of view. An interesting phenomenon is that what is considered to be complicated in a classical sense, e.g., algebraically, need not be so from the point of view of words. Finally, in Section 7 we discuss about some extensions of the theory to finite sets of words, and in Section 8 collect a list of important open problems, many of those being apparently very difficult. As we said everywhere above algorithmic and decidability issues are present. We conclude by some bibliographic remarks. Combinatorics on words has now become a rich area, with many connections to , to number the- ory, to symbolic dynamics, and to applications in biology and text processing. Several books have appeared quite recently, or will appear in the next months, that emphasize these connections. What have to be mentioned are the book of Allouche and Shallit [AS03], where the emphasis is on relation to automata theory, and the book by Pytheas-Fogg [PF02] which is a nom de plume for the Marseille group. Quite recently the book by Crochemore and Rytter [CR02] ap- peared as a follow-up book to [CR94]. A detailed introduction into algorithms on words is given in the book [CHL01]. Algorithms on words are also described, from a more biological point of view, in Gusfield’s book [Gus97]. Finally, we should point to algebraic applications of combinatorics on words, as they appear in the book of de Luca and Varricchio [dLV99].

1.1 History

The history of combinatorics on words goes back to the beginning of the last century, when A. Thue started to work on repetition-free words. He proved, among other things, the existence of an infinite square-free word over a ternary alphabet. Interestingly, it seems that Thue had no outside motivation for his research on words. He published his results in two long papers [Th06] and [Th12], but unfortunately in a less known Norwegian journal, so that his results became known only much later, cf. [Be95]. Actually many of those were reproved several times. The notion of word is, of course, so natural that it can be found even in several older mathematical works. Even Gauss considered a problem which was nothing but a problem on combinatorics on words, cf. [Ga00] and [KMPS92], and in 1850s Prouhet [Pr51] introduced the most famous infinite word redefined by Thue. However, Thue was clearly the first to study systematically problems on words, and moreover as problems of their own.

3 After Thue during the first half of the previous century there were only a few isolated works on words, such as those of [Ar37], [Mo21], [Mo38] and [MH38]. In these works, as it has been very typical for the whole field, properties of words were not that much the research topic of itself rather than tools for solving problems in other areas. It took till the second half of the last century when the theory of words arose. This happened more or less simultaneously in France and Russia. In France it grew up from research of M.P. Sch¨utzenberger on theory of codes, see [Sch56]. The Russian school, in turn, developed from the seminal work of P.S. Novikov and S. Adian on Burnside Problem for groups, see [Ad79]. Especially, in Russia results on words, not to speak about the theory, were not so explicit, although their studies culminated rather soon to remarkable results, such as Makanin’s for satisfiability of word equations, see [Ma77]. In France, the theory of words became an independent topic of its own rather soon, very much due to the stimulating paper [LS67] from 1967. Other stimulating early works were (hand written) book [Len72] and [Hm71]. Once the foundations of the theory were laid down it developed rapidly. One influencial paper should be mentioned here, from year 1979. In [BEM79] repetition-free words were studied very extensively, and an important notion of an avoidable pattern, as well as many open problems, were formulated. The D0L systems, and particularly the D0L problem [CF77], was an important source of many questions on combinatorics on words, including the Ehrenfeucht Compact- ness Property, see [Ka93]. In fifteen years or so, in 1983 the active research on words culminated to the first book of the topic, namely Lothaire’s book Combinatorics on Words. The starting point of Lothaire’s book was a mimeographed text of lectures given by M.P. Sch¨utzenberger at the University of Paris in 1966 and written down by J.F. Perrot. It had an enormous influence on the further development of the field. Results of this impact, including several jewels of theory, can be seen from Lothaire’s second book – Algebraic Combinatorics on Words – which appeared last year.

1.2 Notions and notations We conclude this Introduction by fixing the terminology and a few notions needed in this presentation. For more detailed definitions we refer to [Lot02] or [CK97]. We denote by A a finite set of symbols referred to as an alphabet. , finite or infinite, of letters from A are called words. The empty sequence is called the empty word and is denoted by 1 or ε. The set of all finite words, in symbols A∗, is the free monoid generated by A under the operation of product or concatenation of words: u v = uv. The free semigroup generated by A, in + · symbols A , is A∗ 1 . The set of one-way infinite words over A is denoted by Aw. Formally, such\{ words} are mappings from N into A. For two words u and v we say that u is a prefix (resp. a suffix or a factor) of u if there exist a word x (or words x and y) such that v = ux (resp., v = xu

4 1 or v = xuy). In the case of prefix we write u = vx− . Note that the mapping 1 (v, x) vx− can be viewed as a partial mapping from A∗ A∗ into A∗. These 7→ × notions extend straightforwardly to subsets, i.e., to languages, of A∗. A very crucial notion in combinatorics on words is that of a morphism from the free monoid A∗ into itself (or into another free monoid B∗), that is to say a mapping h : A∗ A∗ satisfying h(uv) = h(u)h(v) for all words u and v. Examples of important→ morphisms are a ab a ab µ : and ϕ : . b 7→ ba b 7→ a 7→ 7→ The former, discovered by Thue, is so-called Thue-Morse, sometimes referred to Prouhet-Thue-Morse, morphism. It plays an important role in the study of repetition-free words. The other morphism is called Fibonacci morphism. These morphisms has a property that they map a to a word starting with a as a prefix. This implies that the limits t = lim µi(a) and f = lim ϕi(a) i i →∞ →∞ exist. We say that f and t are obtained by iterating morphisms µ and ϕ at the word a. Clearly, f and t are the unique fixed-points of these morphisms. They are called Thue-Morse and Fibonacci words, respectively. We have, for example, f = abaababaabaab ... It is easy to see that alternatively

f = lim fi i →∞ where f0 = a, f1 = ab and fn+1 = fnfn 1 for n 1. − ≥ These formulas explain the name Fibonacci morphism. In order to state some properties of these words, let us say that a word w is k-free if it does not contain as a factor any word of the form uk. This notion extends, cf. [CK97], in a natural way to nonnegative rational numbers, and also to real numbers ζ, when the requirement is that w does not contain a factor of ′ k k the form u with k Q and k > ζ. If w = xu y, with k′ k and u = 1, we say that w contains∈ a repetition of order k. By a k+-free word≥ we mean6 a word which is k′-free for any k′ > k (but not necessarily k-free). What Thue proved was that the Thue-Morse word is 2+-free, i.e. does not contain repetitions of higher order than 2. Indeed, it – like any binary word of length at least four – contains a repetition of order 2. For the the repetitions which are avoided in it are exactly those being of order > ϕ2 +1 √5+1 when ϕ is the number of the , i.e. ϕ = 2 . In other words, the Fibonacci word is (ϕ2 +1)+-free, cf. [MP92]. This is just one of the many special properties of the Fibonacci word. In fact, it is almost a universal counterexam- ple for conjectures or an example showing an optimality, an exception being a problem in [Cas97a].

5 2 Connections

In this section we point out some connections of combinatorics on words and other areas of mathematics and computing. Such connections are quite broad and has been fruitful in both directions. In fact, many important properties of words has been discovered when looking for tools to solve other completely different problems. More concretely, we discuss here three different connections, one to matrices, one to algebra and one to algorithms. These reflects, we hope, different aspects of such connections. Other connections to combinatorial , to algebraic combinatorics, and to general combinatorics are sketched or described in [MKS66] and [LS77]. More specifically, let us just mention Lie Algebras, see [Re93], words as codings of combinatorial structures, and words as codings of . For the beginning, however, let us consider another typical and interesting relation between words and some classical mathematical notions. Hilbert’s space filling curve has played an important role describing an intuitive anomaly in topology. From the point of view of topology it can be seen as quite a complicated object. However, from the point of view of words it is nothing but an infinite word over a four element alphabet which, moreover, is easy to define: it is a morphic image under a length preserving morphism, i.e., a coding, of the fixed-point of an iterated morphism.

2.1 To matrices As the first connection we consider that of words and matrices, more precisely, that of multiplicative semigroups of matrices. Let us denote by Mn n(S) the family of n n matrices with entries in the semigroups S. It has been× known since 1920s that× free monoids can be embedded into the multiplicative semigroup of 2 2 matrices over N, i.e. into M2 2(N). For A = a, b such an embedding is given,× for instance, by the mapping× { }

1 1 1 0 (1) a and b . 7→  0 1  7→  1 1 

In fact, this mapping is an isomorphism between a, b ∗ and SL (N), the set of { } 2 matrices in M2 2(N) having a determinant equal to 1. Arbitrary, even countable, × free semigroups can be embedded in M2 2(N) by employing an embedding from × a i N ∗ into A∗ given by { i| ∈ } a abi for i 0. i 7→ ≥ The above ideas become even more usable when we associate above with a morphism h : A∗ A∗. In order to simplify the notation we set A = 1, 2 and define the mapping→ { } k h(1) 0 k h(2) 0 1 | | and 2 | | , 7→  ν(h(1)) 1  7→  ν(h(2)) 1 

6 where the vertical bars are used to denote the length of a word and ν : A+ N maps a word into the number it represents in base k > 3. It is straightforward→ to see that this mapping too is an embedding, that is to say, for any word u = a a , we have 1 · · · n k h(u) 0 u = a a | | . 1 · · · n 7→  ν(h(u)) 1 

Consequently, questions asking something about images of a morphism of A∗ can be transformed into questions about matrices. More concretely; this allows to transfer the undecidability of the Post Correspondence Problem into unde- cidability results on matrices. Paterson [Pat70] was among the first to use this idea when showing (i) in the following theorem, for the other parts we refer to [HK97], where also further references can be found.

Theorem 2.1. The following questions are undecidable:

(i) Does a given finitely generated multiplicative subsemigroup of M3 3(Z) contain the zero matrix? × (ii) Does a given finitely generated multiplicative subsemigroup of M3 3(Z) contain a matrix having the zero in the right upper corner? × (iii) Is a given finitely generated multiplicative subsemigroup of M3 3(N) free? ×

Part (iii) is undecidable even for upper triangular matrices.

All of the above problems are open in dimension n = 2. Another interesting open problem is the question of part (i) for the identity matrix. The embedding methods used in Problems (i)-(iii) does not seem to work here, see [CHK99]. In the dimension n = 2 this problem is decidable, see [CK02]. Embeddings like (1) are important not only to conclude the above undecid- ability results, but also to obtain results for words from those of matrices. The Ehrenfeucht Compactness Property, discussed in Section 4, is a splendid example of that.

2.2 To algebra

The second connection we consider is that to algebra. We want to give a concrete example rather than recalling that words are, after all, elements of free monoids or that representations of groups are relations on words. This example, due to [MH44], is an application of repetition-free words to solve Burnside Problems for semigroups. The problem asks whether the assumptions (i): the semigroup S is finitely generated; and (ii): each element of S is of a finite order, i.e. generates a finite cyclic subsemigroup, imply that the semigroup S itself is finite.

Theorem 2.2. The Burnside Problem for semigroups has a negative answer.

7 The answer is achieved as follows, cf. [Lot83]. Let A be a three letter al- phabet. Then by a result of Thue, there exists a square-free word over A, and consequently the set of finite square-free words is infinite. Now, adjoin to the free semigroup A+ the zero element, and define a congruence on A+ 0 : each square-free word forms an equivalence class of its own, and≈ the rest belongs∪{ } to the class containing 0. Then the quotient semigroup

S = A+ 0 / ∪{ } ≈ is well defined and has the required properties. For finitely generated groups the Burnside Problem is much more complicated, cf. [Ad79].

2.3 To algorithmics

Finally, we discuss about connections to algorithmics. We consider very natural algorithmic problem, and point out how a certain property of words can be used to obtain an efficient solution to the problem. We ask

Question. How can we decide efficiently whether a given word is primitive?

The problem has a brute force quadratic solution: divide the input into two parts and check whether the right part is a power of the left part. But how to get a faster solution? The answer comes from the property of primitive words: a word w is nonprimitive if and only if it is a factor of •ww•, i.e., w occurs properly as a factor in ww. For the definition of the operator • see the next section. So the problem is reduced to a simple instance of the string matching problem, and hence doable in linear time, see e.g., [CR94]. Despite of the simplicity of the above example it is very illustrative: it shows how the correctness of an algorithm is finally based on a combinatorial property of words. This seems to be a common rule in efficient string algorithms. Or even more strongly, whenever a fundamental property of words is revealed, it has applications in improving algorithms on words, cf. e.g., [MRS95]. String matching and pattern matching are only two – although important – aspects of algorithmics on words, see [CR94], [CHL01] for expositions, and [Gus97]. Other algorithmic problems we do not consider here are: systematic generation of words (e.g., Dyck words), ranking, unranking, and random gener- ation of words, see e.g., [Ru78], [BBG90] and [FZC94]. All of these are used in order to code combinatorial structures. Another important topic on algorithmic combinatorics on words, which is nei- ther considered in our tutorial, is the satisfiability problem for word equations, i.e. the decision question whether a given word equation with constants possesses a solution. The classical paper of Makanin [Ma77] answers this question affir- matively. However, his algorithm is one of the most complicated algorithms ever presented, see Chapter 12 in [Lot02] for a detailed exposition. Rather recently W. Plandowski showed, by his completely new algorithm, that the problem is actually in PSPACE, cf. [Pla99]

8 3 Periodicity

In this section we consider one of the most fundamental notions of words, namely periodicity. We discuss three different topics, all of those being very fundamental.

3.1 Fine and Wilf’s theorem

One of the oldest results in combinatorics on words concerns commutation of words. It is the following well-known statement. Typically this result, in one form or another, has been proved when needed, see e.g., [LS62].

Theorem 3.1. Let x and y be nonempty words. The following properties are equivalent:

(i) xy = yx, (ii) the infinite words xω and yω are equal, (iii) there exists a word z such that x, y z+, ∈ (iv) x, y is not a code, i.e. satisfies a nontrivial relation. { } There is another rather old result, due to Fine and Wilf, strongly related to this theorem. It uses the notion of a period. Let w = a1 an be a word, with a ,...,a letters. An integer p is a period of w, if 1 p · · ·n and a = a for 1 n ≤ ≤ i p+i i = 1,...,n p. Thus a1 an p = ap+1 an, and the word w′ = a1 an p − · · · − · · · · · · − is both a prefix and a suffix of w. Provided p

Theorem 3.2. (Fine and Wilf’s Theorem) Let w be a word of length n. If w has two periods p and q and n p + q gcd(p, q), then also gcd(p, q) is a period of w. ≥ −

For the proof of Fine and Wilf’s theorem, we consider a variation of Theorem 3.1. Given a word w = a1 an, where a1,...,an are letters, we set w• = a1 an 1. · · · · · · − In particular, a• = ε for a letter a, and ε• is undefined.

Lemma 3.3. Let x and y be nonempty words. If xy• = yx•, then xy = yx.

Proof. By induction on xy . If xy = 2, and more generally, if x = y , one gets x = y. Otherwise, one| may| assume| | x > y . Then y is a proper| | prefix| | of x, | | | | and x = yz for some nonempty word z. It follows that zy• = yz•. By induction, zy = yz, and consequently xy = yx. ⊓⊔

9 We now prove Fine and Wilf’s theorem. First, we show that p and q may be assumed to be relatively prime. Indeed, assume d = gcd(p, q) > 1, set w = [i] a0 an 1 and define d words w = aiai+d ai+nid, where ni = (n i 1)/d , for·i · ·= 0−,...,d 1. These words have periods· · · p/d and q/d and length⌊ − − at least⌋ n/d. By the conclusion,− each w[i] is a power of some letter. Thus w is the power of some word of length d, that is w has period d. Suppose now that p and q are relatively prime. Let x and y be the prefixes of w of length p and q, respectively. Then w is a common prefix of xω and yω. Moreover, setting w = xw′, the word w′ is a prefix of w, so w′ is a prefix of yω and thus w is a prefix of xyω. Symmetrically, w is a prefix of yxω. Since the length of w is at least p + q 1, one gets xy• = yx•. By Lemma 3.3, one gets xy = yx. Thus x and y are in− some z+. But since gcd(p, q) = 1, this implies that z is a letter. This completes the proof. ⊓⊔

Fine and Wilf’s original paper [FW65] contains three theorems. The first one is basically Theorem 3.2. It states indeed that if two infinite words u and v have periods p and q respectively, and they share a common prefix of length p + q gcd(p, q), then they are equal. The two other theorems are in the same vein, but− concern real continuous periodic functions. The proof of Theorem 3.2 in Fine and Wilf’s original paper [FW65] is quite different and deserves a short description. Any letter of the alphabet is considered as a number. Any infinite word a0a1a2 an corresponds to a formal series 2 n · · · · · · ω a0 + a1t + a2t + + ant + . Thus, the infinite periodic word x , with · · · · · · p x = a1 ap, corresponds to the formal series F (t)= P (t)/(1 t ), with P (t)= · · · p 1 ω − a1 + + apt − , and similarly y , with y = b1 bq, corresponds to G(t) = · · · q q 1 · · · Q(t)/(1 t ), with Q(t)= b1 + + bqt − . Now, a computation with rational fractions− shows that · · ·

1 tgcd(p,q) H(t)= F (t) G(t)= − R(t) , − (1 tp)(1 tq) − − where R(t) is a polynomial of degree at most p+q gcd(p, q) 1. By assumption, R(t) = 0, and consequently F = G and xω = yω. − − Several other proofs of the theorem are known. Some are by induction on the length of the period (e.g., Chapter 8 of [Lot02] and [CHL01]). There are also proofs that argue directly on congruential properties of the indices [CK97]. Extension to more than two periods are given in [CMR99], [Jus00] and [TZ03]. Further related results are shown in [CMSWY01], [BB99] and [MRS03]. The bound in Fine and Wilf’s theorem is sharp. A concrete example is the word abaababaaba. It has periods 5 and 8 and length 11 = 8+5 2, and is not a power of a single letter. In fact, all words of length p + q 2, with− periods p and q, for coprimes p, q, are known. They are all binary, and,− more precisely, prefixes of infinite standard Sturmian words; see Chapter 2 of [Lot02] for a description and references.

10 3.2 Critical factorization theorem

The Critical Factorization Theorem stated below gives a relation between the global period of a word, and a notion of local period called local repetition, associated to a factorization of the word. As it will appear, global periods are always longer than local periods, but the remarkable fact is that for any word, there is always a factorization whose local period is equal to the global period. Such a factorization is called critical. In order to formalize the above we say that two words x and y are prefix comparable (resp. suffix comparable) if one of the words is a prefix (resp. a suffix) of the other. Further, given a word w and a factorization w = uv into nonempty words, a repetition at (u, v) is a nonempty word z such that z and u are suffix comparable, and z and v are prefix comparable. The (local) period of the factorization (u, v) is the length of the shortest repetition at (u, v). It is easy to see that any local period of a word is shorter that the period. A factorization (u, v) is critical if its period is equal to the period of w. Consider for example the word w = abaab which has the period 3. The factorization (a, baab) has the repetition ba, so its period is 2. The factorization (aba, ab) has period 1. The factorizations (ab, aab) and (abaa, b) both have period 3, and these are the critical factorizations of the word w. The following theorem shows that critical factorizations are unavoidable.

Theorem 3.4. (Critical Factorization Theorem.) Every word of length at least 2 has a critical factorization.

The first statements and proofs of the theorem are given in [CV78] and [Duv79]. A proof of the critical factorization theorem in its present form, and a discussion, is given in Chapter 8 of [Lot02]. A short proof is in [CP91]. Recent results related to this topic appear in [HN02]. We sketch now an interesting application of the Critical Factorization Theo- rem. Consider a finite set X A+ of nonempty words. Given a word w A+, a sequence of nonempty words⊂ ∈

(s, x1,...,xm,p) is an X-interpretation of w if w = px1 xmp, and p is a prefix of a word in X, s is a suffix of a word in X, and x·,...,x · · X. Two X-interpretations 1 m ∈ (s, x ,...,x ,p) and (s′, x′ ,...,x′ ′ ,p′) of a word w are disjoint if sx x = 1 m 1 m 1 · · · i 6 s′x′ x′′ for i =1,...,m, i′ =1,...,m′. 1 · · · i As an example, consider the set X = a3,b,aba,a2ba2 . The word a4ba4ba4b has the X-interpretation (a,a3,b,a3,aba,a{ 3,b). The sequence} (a3, aba, a3,b,a3, ab) is another X-interpretation of the word a4ba4ba4b, and this interpretation is disjoint from the previous one.

Theorem 3.5. Let X be a finite set of nonempty words, and let p be the max- imum of the periods of the words in X. Every word w with the period strictly greater than p has at most Card(X) disjoint X-interpretations.

11 As another example consider the set X = ap , consisting of a single word which has period 1. Clearly, every word w that{ admits} an X-interpretation is a power of a and so also has period 1. The number of its disjoint X-interpretations is bounded by p, and not by Card(X). Proof. (Sketch) Assume that a word w has m>n disjoint X-interpretations, where n = Card(X). Consider any factorization w = uv. There exist m distinct pairs (y,z) of words such that yz X, u and y are suffix comparable, and v and ∈ z are prefix comparable. Since m>n, there are at least two pairs (y1,z1) and (y ,z ) such that y z = y z . Set x = y z = y z . Assume y > y . Next 2 2 1 1 2 2 1 1 2 2 | 1| | 2| y1 and y2 are prefix comparable, or z1 and z2 are suffix comparable. Except for some special cases, which we do not consider in this sketch, the number d = y1 y2 = z2 z1 is a period of x. This implies that there is a repetition of length| | −d | at| (u,| v).| − Thus | | the local period at (u, v) is at most p. Thus, all local periods are smaller than p. However, by the Critical Factor- ization Theorem, at least one local period is strictly greater than p. This yields a contradiction. ⊓⊔ A detailed proof is given in [Lot83]. An application to the order of the sub- groups of the syntactic monoid of the set X∗ was given in [Sch79]. There is a renewal of research on this topic now, see [PR02]. In [Ma02] examples were given showing that the bound for the number of disjoint X-interpretations is optimal in Theorem 3.5. For example, the set aibai i =1, ..., n 1 banb is such a set. { | − }∪{ }

3.3 Characterizations for ultimately periodic words We conclude this section with a third fundamental periodicity result of words de- rived in [MRS95], see also [MRS98]. It characterizes one-way infinite ultimately periodic words in terms of local periodicity. Intuitively, it tells how much lo- cal regularity, i.e., periodicity, is needed to guarantee the global regularity, i.e., ultimate periodicity. Such results are the most basic goals in mathematics. In order to continue let us fix a few notions. Let ρ 1 be a real and p 1 a natural number. We say that a finite word is ρ-legal ≥if it contains as a suffix≥ a repetition of order ρ (or equivalently – by definition – of order larger than or equal to ρ), and that it is (ρ, p)-legal if it contains as a suffix a repetition of order ρ of a word of length at most p. Similarly, an infinite word w is ρ-legal or (ρ, p)-legal if it so for all of its long enough prefixes. Note that (ρ, )-legality can be interpreted as simply ρ-legality. ∞ We describe the usefulness of these notions in the following two examples, which are illustrations of subsequent theorems. Example 1. We claim that the Fibonacci word

f = lim fi = abaababaabaab i →∞ · · · where f0 = a, f1 = ab and fn+1 = fnfn 1 for n 1, is (2, 5)-legal, as first observed by J. Shallit, personal communication.− ≥

12 It is straightforward to see that f can be decomposed uniquely into blocks of ab and aba, such that ab does not occur twice and aba three times in a row. Consequently, suffixes of prefixes of f ending at one of the blocks are of the forms:

Now consider the suffixes ending at the rightmost ab in the left graph, i.e. ending either at a or b there. In the former case there is a suffix aa, a square. In the latter case there is necessarily either square aabaab or square abaababaab, i.e., also a square of a word of length at most five. The similar argumentation applies for the right graph. Note that the word f is not ultimately periodic, see Section 6.3. ⊓⊔ Example 2. In this example we point out a striking difference of (2, 5)-legal and (2 + ε, 5)-legal infinite words for any ε > 0. Let us search for (2, 5)- legal infinite words containing a factor (abaab)2. Now, we try to extend a suffix ending to the above mentioned factor exhaustively symbol by symbol preserving the (2, 5)-legality, and not reporting the extensions leading only to ultimately periodic words. We obtain the graph:

(aabab)2 ab(aba)2 2 ab (ab)(aab)2 2 a (aba)3 ab a a (abaab)2 ab ab a ab ab (ababa)2 aba(ba)2 2 2 a(aba)(ab)2 2 ab (aab)3

Here the labels tell the extensions, and the nodes correspond the suffixes obtained at particular moments. In the suffixes a short square is always shown (proving the legality) together with a sufficient amount of other letters needed in further steps. In some nodes some continues are not shown – in these cases only ultimately periodic words would be (2, 5)-legal; for instance from (abaab)2 by b we would obtain a (2, 5)-legal word, which, however, could be continued only by bs preserving the legality. It follows from the construction that all words spelled from this graph are (2, 5)-legal. In particular, there exist nondenumerably many such infinite words, since the graph contains intersecting loops, labeled by noncommuting words. One can also show that actually this graph gives all (2, 5)-legal nonultimately periodic words. We did the exhaustive search for a particular square, the other squares do not give any other nonultimately periodic (2, 5)-legal infinite words.

13 Now, an interesting observation is that if instead of the (2, 5)-legality the (2+ ε, 5)-legality is considered, then the node (aba)2(ab)2 is no longer legal. Indeed, independently of x the word xabaabaabab does not contain at the end a repetition of order strictly larger than 2 of a word of length at most 5. Consequently, intersecting loops are lost, meaning that any (2 + ε, 5)-legal infinite word is necessarily ultimately periodic. Constructing graphs similar to the above one for (2, 4)-legal words one can conclude that all (2, 4)-legal words are ultimately periodic. ⊓⊔ Above examples are special cases of much deeper results. If in Example 1 cubes instead of squares were asked our approach would not work. Indeed, all 3- legal infinite words are ultimately periodic, or even much strongly ρ-legal infinite words are necessarily ultimately periodic if and only if ρ ϕ2 = ϕ +1=2.6 . . . 1+√5 ≥ where ϕ is the number of golden ratio 2 . This is a remarkable theorem, conjectured by J. Shallit in 1994, and proved in [MRS95] by F. Mignosi, A. Restivo and S. Salemi in 1995: Theorem 3.6. (i) Each ϕ2-legal word is periodic. (ii) The Fibonacci word is (ϕ2 ε)-legal for any ε> 0. − Example 2 considers a similar phenomena is a simple setting yielding the following result, cf. [KLP02]. As outlined in the example the optimality is with respect to both of the parameters.

Theorem 3.7. (i) Each (2, 4)-legal infinite word is ultimately periodic. (ii) For any ε> 0, each (2 + ε, 5)-legal infinite word is ultimately periodic. (iii) There exists nondenumerably many (2, 5)-legal infinite words, including the Fibonacci word.

In [Le02] the similar optimal value of ρ is found for any finite length n of the period. For example, and interestingly, the optimal ρ for n = 5, 6, ..., 11 is 1 the same, namely 2, while for n =12 it is 2 12 . Further, after some anomaly in small values of n, the behaviour of such optimal ρs is regular, but amazing: there exists just one jump in between two values of consecutive Fibonacci numbers, except that every sixth jump is missing. Also surprisingly, it is not exactly the Fibonacci word, but very related one, which determines these jumps. We conclude this section with a few remarks. First the above results are beautiful examples, not only in combinatorics on words, but in a much broader perspective, where a local regularity implies the global one, and, in fact, in an optimal way. In other words, they can be seen as results strictly separating a predictable, i.e., ultimately periodic, behaviour from a chaotic one, i.e., allowing nondenumerably choices. This is more discussed in [KLP02]. As the second final comment we state another similar result, the proof of which is related to the Critical Factorization Theorem. Consider an infinite word w = a a with a A. We say that w contains a square centered at position i 1 2 · · · i ∈ if there exists a t [1, i] such that ai t+1 ai is a prefix of ai+1ai+2 . Then we have: ∈ − · · · · · ·

14 Theorem 3.8. An infinite word w = a1a2 is ultimately periodic if and only if, for any large enough i, there exists a square· · · centered at position i. For a proof of this and related results we refer to Chapter 8 in [Lot02], where also the optimality of the result is shown: no smaller amount than a square, i.e., of order 2, of centered repetition guarantees the ultimate periodicity.

4 Dimension properties

In this section we consider properties of words which can be called dimension properties. We approach this by considering a finite set X as a solution of a constant-free equation. Natural questions to be asked are, what can be said about X if it satisfies a nontrivial equation, or several “different” equations? And how many different equations it can satisfy? Here an equation over the variables Ξ = z1, ..., zn is just a pair (u, v), { } n usually written as u = v, of words of Ξ, and X = (x , ..., x ) (A∗) is a 1 n ⊆ solution if xis substituted for zis makes the equation to be an equality in A∗. More formally, a solution of the equation u = v is a morphism ϕ : Ξ A∗ such that ϕ(u)= ϕ(v). Actually, for simplicity, we sometimes overlook the→ fact that X must be ordered, and consider it only as a set. We say that two systems of equations are equivalent if they have exactly the same solutions, and a system S is independent if it is not equivalent to any of its proper subsystems. We use the independency to formalize the notion that “equations are different”. Note also that we have defined only constant free equations. We start with the following simple example Example 3. As an extension of the well known fact that two words satisfy a nontrivial relation if and only if they are powers of a common word, cf. Section 3, we consider the set X = x, y, z A+ of three nonempty words satisfying the equations { }⊆

(1) xα = yβ and xγ = zδ with α, β, γ and δ in X∗. If two of the words are powers of a common word so is the third, by above. Consequently, assume that x = yt for a nonempty word t. Now, substituting x = yt to the equations in (1) we obtain two relations on words t, y, z A+. More specifically, the first equation comes into the form ∈ uα′ = tβ′ with u y, z and α′, β′ t, y, z ∗, ∈{ } ∈{ } and the second equation comes into the form

tγ′ = zδ′ with γ′, δ′ t, y, z ∗. ∈{ } Since tyz < xyz induction applies showing that t, x and y are powers of a common| | word,| and| so are x, y and z. For the case y = xt with t = 1, the reasoning is the same. Therefore, we have proved that nonempty words satisfying6 the pair (1) is necessarily cyclic, i.e., a subset of w+ for some word w. ⊓⊔

15 4.1 Defect theorems We continue by recalling so-called defect theorem: if a set of n words satisfies a nontrivial relation, then these words can be expressed as products of at most n 1 words. In other words, we can say that a nontrivial equation implies a defect effect− of its solutions. Consequently, the defect theorem states a dimension type property for words. Actually, there does not exist just one, but rather several, defect theorems witnessing the above defect effect. Namely, the set of n 1 words might be defined in different ways. In order to formalize those we recal−l that a submonoid M of A∗ is called right unitary if its minimal generating set is a prefix code, cf. [BP85]. Then we define the free hull and the prefix hull of a finite set X A+ as ⊆ F (X)= M, M is a free\ monoid X M ⊆ and P (X)= M, M is right\ unitary X M ⊆ respectively. By the basic properties of these semigroups, F (X) is free and P (X) is right unitary. Hence, by definition, they are the smallest such semigroups containing X (and hence also X∗). The cardinalities of the minimal generating sets of F (X) and P (X) are called the free rank and the prefix rank of X, and those are denoted by f(X) and p(X), respectively. Before formulating several versions of the defect theorem we still define the combinatorial rank of X, in symbols r(X), as the minimal cardinality of a set F such that X F ∗. Note that contrary to the above minimal generating sets F need not be unique.⊆ Now the defect theorem can be formulated, cf. e.g., [BPPR79], [Lot83] or [CK97]:

Theorem 4.1. For any finite X A+ we have ⊆ c(X) p(X) f(X) card(X), ≤ ≤ ≤ and moreover the last inequality is proper if X satisfies a nontrivial equation.

The methods to compute these different ranks are discussed in [CK97]. Also the following example showing that all of these inequalities can be simultaneously proper is from [CK97].

Example 4. Consider the set X = aa, aaba, bac, cbb, bbaa, baa . It satisfies a nontrivial relation: { } aa. bac. bbaa = aaba. cbb. aa so that f(X) should be < 6. By the well known method, see [BPPR79], one can compute that X(f) = aa, ba, c, bb, baa . But (X(f))+ is not right unitary, { }

16 so that, by methods in [CK97], one computes X(p) = a, ba, c, bb . Hence the cardinality of X(p) is strictly smaller than that of X({f). Finally, X}(p) is of combinatorial rank 3, yielding the following sequence of inequalities 3= c(X)

⊓⊔ The above shows that the notion of the defect effect is quite involved. In order to further emphasize this we recall that even sets generating isomorphic subsemigroups of A+ might have nonisomorphic free hulls, cf. [HK86]. Example 4, and even more Example 3, might suggest that several different equations satisfied by X imply a cumulative defect effect for X. For example, two equalities would imply that r(X) is at most card(X) 2. This indeed was the case in Example 3. However, words do not possess such− a strong dimension property:

Example 5. The pair xzy = yzx and xzzy = yzzx has a solution x = y = a and z = b of (any) rank 2, and still the equations are independent. For example, x = abba, y = a and z = b is a solution of the latter but not of the former. 2

We do not know whether there exist independent systems of equations with three unknowns, containing more than two equations and having a non-cyclic solution. The above indicates that it is very difficult to impose a cumulative defect effect, and indeed there are very few results in that direction. Chapter 6 in [Lot02] gives one example showing that if X is a code, but not ω-code neither to the right nor to the left, then the rank of X is at most card(X) 2. Here being not an ω-code means that some one-way infinite word can be decomposed− in two different ways by X. Another complicated but still very special case is obtained in [KM02]. In the following lines, however, we show a simple but in many cases useful cumulative defect effect. In order to state it we need some terminology. We + associate a finite set X A with a graph GX as follows: the nodes of GX are the elements of X, and⊆ there exists an edge between two nodes x and y if and only if xX+ yX+ = . ∩ 6 ∅ GX is called the dependency graph of X. Then the number of connected com- ponents of GX , in symbols c(GX ), gives an upper bound for the rank of X, cf. [HK86]: Theorem 4.2. Let X A+ be finite. Then we have ⊆ c(X) p(X) c(G ). ≤ ≤ X

17 In particular, and this is typically the power of Theorem 4.2, if c(GX ) is connected, then X is cyclic. Note that Example 3 is a simple special case of this result. Unlike in the usual defect theorems here it is crucial that all words of X are nonempty. Defect theorems hold when X is not a code. A natural question is, can this assumption be weakened, for example, requiring only that X is not an ω-code, i.e., finite relations are replaced one-way infinite relations or even two-way infinite relations. For one-way infinite relations the answer is easy: all results we stated, in particular Theorems 4.1 and 4.2, hold for one-way infinite relations as well. Even the proofs are basically the same. For two-way infinite words the situation is completely different. Strictly speaking no defect theorem holds, as shown by the set X = ab, ba and two factorizations depicted as: { }

However, one can prove the following defect theorem for two-way infinite words. In order to state it we need a few notions. Let X A+ be a finite set and w a two-way infinite word. Any decomposition of w into⊆ consecutive blocks of elements of X is an X-factorization of w, and two X-factorizations of w are disjoint if they do not match at any point of w. Finally, w is nonperiodic if it is not a two-way infinite repetition of a single word. We have, see [KMP02]: Theorem 4.3. Let X A+ be finite. If there exists a nonperiodic two-way infinite word with two disjoint⊆ X-factorizations, then

c(X) < card(X).

An interesting point here is that, unlike in all other known cases, the defect effect is witnessed only by the combinatorial rank, for a counterexample see [Ma02]. We conclude our discussion on defect effect by returning to Example 4. It shows that different notions of a rank of a finite set do not coincide, and this is unavoidable. Fortunately, however, the rank of an equation can be defined in the unique way. Let us define the rank of an equation as the maximal rank of its solutions. In theory, that would lead – in our considerations – to three different notions of the rank of an equation, namely the free rank, the prefix rank and the combinatorial rank. However, we have, see [CK97]: Theorem 4.4. Let u = v be a constant-free equation over variables Ξ and A an alphabet such that card(A) card(Ξ). The following numbers coincide ≥ (i) the maximal of free ranks of solutions of u = v in A∗, (ii) the maximal of prefix ranks of solutions of u = v in A∗, (iii) the maximal of combinatorial ranks of solutions of u = v in A∗. The number specified in Theorem 4.4 is called the rank of an equation u = v.

18 4.2 Ehrenfeucht Compactness Property In above we saw that words have certain dimension type properties, in fact even a rich theory in that direction. However, the dimension properties of words are rather weak. A natural question arises: How weak are they? Or more concretely, how large independent systems of equations can exist? These questions are par- tially answered in the following Ehrenfeucht Compactness Property:

Theorem 4.5. Each system S of equations with a finite number of variables Ξ over free monoid A∗ is equivalent to some of its finite subsets S0.

This compactness claim, conjectured by A. Ehrenfeucht – after studying the D0L sequence equivalence problem, cf. [Ka93] – was proved simultaneously in [AL85] and [Gu86]. The proof is a marvelous example of the usefulness of the embedding (1) in Section 2.1. That allows to conclude the result of Theorem 4.5 from Hilbert’s Basis Theorem for polynomials over commuting variables. Interestingly, no bound for the cardinality of S0, e.g. in the number of vari- ables, is known. That leads to the following fundamental problem. Does there ex- ist any such bound? Or would the bound 2n, where n is the number of unknowns, be enough? The best known lower bounds for the maximal size of independent systems of equations are Θ(card(Ξ)3) and Θ(card(Ξ)4) in free semigroups and in free monoids, respectively, see [KP96]. It is interesting to note that there exist monoids, where Ehrenfeucht Compactness Property holds, but no bound for the maximal size of independent systems exists. An example of this is the variety of finitely generated commutative monoids, see again [KP96]. On the other hand, the property does not hold for all finitely generated monoids, see e.g. [HKP97] and [HK97].

5 Unavoidable regularities

In this section we consider several properties of words related to repetitions. It appears that some repetitions, or more generally some other regularities, are unavoidable, others are avoidable. For instance, it is easily checked that every binary word of length 4 contains a square. So squares are unavoidable over two letters. One may ask whether there exist arbitrarily long cube-free words over two letters. The answer is positive. Thus cubes are avoidable over two letters.

5.1 Power-free words The Thue-Morse word defined earlier is cube-free. As we already mentioned in Section 1.2, the Thue-Morse word is even 2+-free. This means that it is overlap- free, that is it does not contain any factor of the form axaxa, where a is a letter and x is a word. As another example, we consider the infinite word z defined below, and we prove that it is cube-free. We choose to consider this infinite word rather than the Thue-Morse word because the proof is quite simple, although the argument

19 of the proof is rather the same as for the Thue-Morse word. The cube-freeness of z is optimal in the sense that there are repetitions in z that are cubes up to one letter. One may say that z is 3−-free. Consider the morphism a aba ζ : b 7→ abb 7→ Starting with a, it generates the infinite word z = ζω(a)= abaabbabaabaabbabbabaabbabaabaabbabaabaabbabbabaabbabb · · · Inspection shows that there are many factors of the form uuu•, that is cubes up to a final letter. For example, aab, bba, abaabaabb, abbabbaba are almost cubes, and in fact all words ζn(aab) and ζn(bba) for n 1 are almost cube factors of the infinite word z. ≥ Fact 5.1. The infinite word z is cube-free. Proof. We prove that ζ is a cube-free morphism in the sense that it preserves cube-free words: if w is cube-free, then ζ(w) is cube-free. This suffices to show that z is cube-free. Assume the contrary, and consider a finite cube-free word w of minimal length n such that ζ(w) contains a cube uuu. The word ζ(w) has length 3n, and it is a product of n blocks, each being either aba or abb. Observe first that u must be a multiple of 3. Indeed, the initial letter of u appears in w at three| positions| i, j = i + u , k = i +2 u for some i, and if u 0 mod 3, then i, j, k take all possible values| | (mod 3).| This| means that the |initial| 6≡ letter of u must appear in the images ζ(a) = aba and ζ(b) = abb at the first, the second and the third letter. However, b does not appear as an initial letter, and a does not appear as a middle letter. This proves the claim. Next, observe that ζ is one-to-one. Indeed, the argument can be retrieved from the image of a letter because it is just the last letter of the image. This means that if ζ(w) = xuuuy for some x, y, we may assume that x 0 mod 3. We may even assume that x is the empty word because w was| chosen| ≡ to be minimal. But then, there is a unique word v such that u = ζ(v) and w starts with v3. ⊓⊔ As already mentioned earlier, the arguments in the proof are rather typical. Even the construction is typical: iterating a morphism f is a general tool for the construction of infinite words with predictable properties. Next, the property to be proved by induction (or by contradiction) uses the fact that one can infer a property for a word w from a property of the image f(w) (in the previous example, of ζ(w)). This step may be involved, because the morphism f may not always have such a simple form as the morphism ζ. In the present case, we proved that ζ is a cube-free morphisms, that is a morphisms that preserves cube-free words. Another, rather old, example is given e.g., in [Hal64]. It is the morphism a abc b 7→ ac c 7→ b 7→

20 This morphism does not preserve square-free words, because the image of aba is abcacabc. However, iterating the morphism yields an infinite square-free word. The question whether a morphism is k-free, that is whether it preserves k-free words, is not yet completely solved. The simplest, but not the easiest case is that of square-free morphisms. The final answer to this case was given in [Cro82]. For a nonerasing morphism h on A, set

M(h) = max h(a) , m(h) = min h(a) . a A | | a A | | ∈ ∈ Then the following holds:

Theorem 5.2. Let h : A∗ B∗ be a nonerasing morphism. If h preserves square-free words of length K→(h) = max(3, 1+ (M(h) 3)/m(h) ) , then h is square-free. ⌈ − ⌉ As a special case, any uniform morphism h, i.e. such that the images of all letters have the same length, is square-free as soon as it preserves square-free words of length 3. In the ternary case, one has the following theorem [Cro82]: Theorem 5.3. A ternary endomorphism h is square-free if h preserves square- free words of length 5. No general result is known for cube-free morphisms. In the case of binary morphisms, one has the following bound [Kar83]: Theorem 5.4. A binary morphism h is cube-free if h preserves cube-free words of length 10. This bound was improved to the length 7 in [Lec85b]. In the same direction, we mention the following striking result concerning power-free morphisms, i.e. k-frees morphisms for all k [Lec85a]. Theorem 5.5. A morphism h is power-free if h preserves square-free words and if the words h(a2) for a a letter, are cube-free. Deciding whether a given morphism is k-free, for a fixed k 3, or even cube-free, is an difficult open problem. ≥ Abelian repetitions, i.e. repetitions of commutatively equal blocks, constitute another type of interesting repetitions of words. Of course, every repetition of words is also Abelian repetitions, but not vice versa. It was shown in [Ev68] that Abelian squares can be avoided in infinite words if the alphabet contains 25 letters. In [Ple70] the result was improved to five letter alphabets, and finally the four letter case – known as an Erd¨os Problem – was solved in a complicated paper in [Ke92]. The required infinite word is obtained by iterating a uniform morphism, where the images of letters are of length 85! This amazing result has been completed by Carpi [Car98] who shows that the number of Abelian square-free words over 4 letters of given length grows exponentially. He also proves that there are ”many” Abelian square-free morphisms, i.e., morphisms that map Abelian square-free words on Abelian square-free words. He shows

21 that the monoid of Abelian square-free morphisms is not finitely generated. Similarly in [Ju72] it was shown that Abelian fifth powers can be avoided in a binary alphabet, and later [Dek79] filled the gaps: Abelian fourth powers can be avoided in a binary alphabet and cubes in a ternary alphabet. Moreover, all these results are optimal, for instance all binary words avoiding Abelian cubes are finite, in fact, of length at most 7. Most, if not all, known repetition-free words are defined by iterating a mor- phism, or as a morphic image of such. However, this approach can capture only very few repetition-free words, as witnessed by the following result proved in [Br83]. In order to formulate it we denote by ρ Fk(n) the number of ρ-free words of length n over a k-letter alphabet. −

Theorem 5.6. (i) The number of cubefree words of length n over a binary al- phabet is exponential, i.e., there exist constants A,B > 0 and α,β > 1 such that Aαn 2 F (n) Bβn ≤ − 3 ≤ (ii) There exist nondenumerably many squarefree infinite words over a ternary alphabet.

Both parts of the above theorem holds also for squarefree words over a ternary alphabet, see again [Br83]. On the other hand, for 2+-free binary words the situation is different: Theorem 5.7. (i) The number of 2+-free words of length n over a binary al- phabet is polynomial, i.e., there exist constants A,B > 0 and α,β > 1 such that Anα 2+ F (n) Bnβ ≤ − 2 ≤ (ii) There exist nondenumerably many 2+-free infinite words over a binary al- phabet. Part (i) above is shown in [RS85], while part (ii) is an exercise in [Lot83]. The values of the parameters α and β in Theorem 5.7 (ii) are studied in [Ko88] and [Le96], see also [Cas93] and [Car93b]. Very recently in [KS03] Theorems 5.6 and 5.7 were extended to determine the exact borderline between polynomial and exponential growth in numbers of binary ρ-free words of length n:

+ Theorem 5.8. The cardinality of 21/3 F (n) is polynomial while that of 21/3 − 2 − F2(n) is exponential. Amazingly the Thue-Morse morphism plays a central role in the proof of Theorem 5.8. Recently, it has been shown [Ram03] that the only binary 7/3-power-free word that can be obtained by iterating a morphism is the Thue-Morse word. Moreover, the Thue-Morse morphism is the basically the only 7/3-power-free morphism: if h is a binary morphism, and if h(01101001) is 7/3-power-free, then h is a power of the Thue-Morse morphism (up to inverting of 0 and 1)

22 5.2 Test sets and test words

A convenient framework is to define test sets for morphisms. A finite set T A∗ is a test set for overlap-free (resp. square-free, cube-free) morphisms if every⊂ morphism f : A∗ B∗ is overlap-free (resp. square-free, cube-free) as soon as f(w) is overlap-free→ (resp. square-free, cube-free) for all w T . Theorem 5.4 can be rephrased as follows: the set of binary cube-free word∈s of length 10 is a test set for binary cube-free morphisms. Test sets for overlap-free morphisms are characterized in [RS99]. Test sets for k-free morphisms with k 3 are studied in [RW02b]. It has been shown that no test sets exist for the se≥t of square-free morphisms over a 4-letter alphabet. Another interesting notion is that of test word. A word w A∗ is a test word for overlap-free (resp. square-free, cube-free) morphisms if every∈ morphism f : A∗ B∗ is overlap-free (resp. square-free, cube-free) as soon as the word f(w) is→ overlap-free (resp. square-free, cube-free). Thus w is a test word if w is a test set. It has been shown [BS93] that abbabaab is a test word for overlap-free{ } morphisms. We refer to [RS99] and [RW02b] for a detailed study of test words in various cases. Overlap-free morphisms were already characterized by Thue [Th12]. The fol- lowing is Satz 16 of his 1912 paper; the morphism µ was given in the introduction.

Theorem 5.9. Let h be an overlap-free binary endomorphism. Then there is an integer n such that h = µn or h = π µn, where π is the endomorphism that exchanges the two letters of the alphabet.◦

Clearly, this means that the monoid of binary overlap-free endomorphisms is finitely generated. The same does not hold for larger alphabets, neither for overlap-free morphisms, nor for k-power free morphisms: all these monoids of endomorphisms are not finitely generated [Ric02b]. Many other monoids of en- domorphisms are not finitely generated, such as the monoid of primitive mor- phisms (that is, preserving primitive words) or of Lyndon morphisms (that is, preserving Lyndon words), see [Ric02a].

5.3 Repetition threshold

More general repetitions can be considered as well. Thue himself called a word on n letters irreducible if two distinct occurrences of a nonempty factor are always separated by at least n 2 letters. Thus, irreducible means overlap-free if n = 2, and square-free if n =− 3. A more general concept, first considered by F. Dejean [Dej72], is to require that the length of the word y separating two occurrences of x is bounded from below by the length of x times some factor. This is precisely what we called earlier a repetition: a word xyx, where x is nonempty, is a repetition of order k, where k = xyx / xy . | | | | We have seen that every binary word of length 4 contains a square, and that there exist infinite binary overlap-free words (such as the Thue-Morse word): these words are 2+-free.

23 A similar property holds for ternary words: every ternary word of length 39 contains a repetition of order 7/4, and there exists [Dej72] an infinite ternary word that has no repetition of order > 7/4. More precisely, we have:

Theorem 5.10. The word generated by the endomorphism

a abcacbcabcbacbcacba b 7→ bcabacabcacbacabacb c 7→ cabcbabcabacbabcbac 7→ 7 + has no repetitions of order > 7/4. So it is ( 4 ) -free. Call repetition threshold the smallest number s(k) such that there exists an infinite word over k letters that has only repetitions of order less than or equal to s(k). We know that s(2) = 2, s(3) = 7/4. It is conjectured in [Dej72] that s(4) = 7/5 and s(k)= k/(k 1) for k 5. We know that this conjecture is true up to 11, see [Pan84b] and [MO92],− but≥ the general case is still open.

5.4 Unavoidable patterns

Another topic on unavoidable regularities concerns unavoidable patterns. This is a generalization of the notion of square-freeness and k-power-freeness. Interest- ingly, this notion has a rather deep relation to universal algebra, see [BMT89]. We consider two alphabets, the first denoted by A as usual, and the second one, by E, called the pattern alphabet. Given a pattern p E∗, the pattern language associated to p on A is the set of all words h(p), where∈ h is a non-erasing mor- phism from E∗ to A∗. A word w is said to avoid the pattern p, if no factor of w is in the pattern language of p. For example, consider the pattern p = ααββα, where α and β are letters. The word 1(011)(011)(0)(0)(011)1 does not avoid p. On the contrary, 0000100010111 avoids p. A pattern is avoidable on A if there exists an infinite word on A that avoids p, otherwise it is unavoidable. A pattern is k-avoidable if it is avoidable on a k letter alphabet. For instance, since there exist infinite square-free words over three letters, the pattern αα is 3-avoidable. Also, the Thue-Morse infinite word avoids the patterns ααα and αβαβα, the latter one corresponding to overlaps. Clearly, the pattern αβα is unavoidable. More generally, define the Zimin words Zn as follows. Let αn, for n 0, be distinct letters in E. Set Z0 = ε and Z = Z α Z for n 1. ≥ n+1 n n n ≥

Fact 5.11. The Zimin words Zn are all unavoidable. In some sense, these words are all the unavoidable words. Indeed, say that a pattern p divides a pattern p′ if p′, viewed as an ordinary word, has a factor in the pattern language of p.

Theorem 5.12. A pattern p is unavoidable if and only if it divides some Zimin word.

24 There exists an algorithm to decide whether a pattern is avoidable. This has been given independently in almost the same terms by [BEM79] and [Zim82]. Let us just sketch the construction. Let p E∗ be a pattern, and let P the set of variables occurring in p. The adjacency∈ graph of p is the bipartite graph G(p) with two copies of P as vertices, denoted P L and P R, and with an edge between λL and µR if and only if λµ is a factor of p. For example, the adjacency graph of the pattern p = αβαγβα has six vertices and four edges. A subset F of P is free if there is no path in G(p) from a vertex λL to a vertex µR with λ, µ in F . In our example, the free sets are α and β . Given a pattern p and a free set F for p, we reduce p to q by deleting{ in} p all{ occurrences} of the letters of F . In our example, p = αβαγβα reduces to q = βγβ by the free set α , and q itself reduces to γ which itself reduces to the empty word. A pattern{ is reducible} if it can be reduced to the empty word in a finite number of steps. The remarkable theorem that yields the algorithm is: Theorem 5.13. A pattern p is avoidable if and only if it is reducible. As shown by the pattern αα, an avoidable pattern is not necessarily avoidable on two letters. In other terms, the same pattern may be 2-unavoidable, but k- avoidable for some larger k. The avoidability index µ(p) of a pattern p is the smallest integer k such that p is k-avoidable, or if p is unavoidable. Contrary to the previous theorem, there is no known algorithm∞ to compute the avoidability index of a given pattern. Even for short patterns, the exact value of µ(p) may be unknown. For instance, it is not known whether the value of µ(ααββγγ) is 2 or 3, although there is some experimental evidence that the index is 2, see Chapter 3 in [Lot02]. However, the proof of Theorem 5.13, if analyzed carefully, provides an upper bound on the avoidability index of a pattern. Theorem 5.14. Let p be a pattern on k symbols. If p is avoidable, then µ(p) 2k +4. ≤ The above bound is probably far from being optimal. In fact, it is quite difficult to find patterns with high avoidability index. A pattern which has index 4 is

p = αβζ1βγζ2γαζ3βαζ4αγ over a pattern alphabet of 7 symbols. No pattern of index 5 is known, and one may ask whether such a pattern exists. Many results on avoidable and unavoid- able patterns are reported in Chapter 3 of [Lot02], while [Cur93] is a source of several open problems. Recent results concerning complexity of the reduction algorithm appear in [Hei02a], [Hei02b] and [Hei02c].

5.5 Shirshov’s theorem An unavoidable regularity is a property of words that can be observed on any word, provided it is sufficiently long. Several of these unavoidable regularities exist. The most famous are Ramsey’s theorem, van der Waerden’s theorem and

25 Shirshov’s theorem. We just state here Ramsey’s and van der Waerden’s theo- rems in the perspective of words, and then discuss Shirshov’s theorem and some of its consequences. Let A be a finite alphabet, S a set and k 2 an integer. A map f : A∗ S is called k-ramseyan if there exists an integer≥ L(f, k) such that any word→ w of length at least L(f, k) admits a factor u of the form u = w w , with 1 · · · k w1,...,wk nonempty words, such that

f(w w )= f(w ′ w ′ ) i · · · j i · · · j for all 1 i j k, 1 i′ j′ k. Further f is ramseyan if it is k-ramseyan for some≤k. Ramsey’s≤ ≤ theorem,≤ ≤ see≤ [GRS90], can be stated in a number of com- binatorial structures. On words it can be stated as the following unavoidable regularity:

Theorem 5.15. (Ramsey’s theorem) Every map f : A∗ S into a finite set S is ramseyan. →

Given a word w = a a , where a ,...,a are letters, a cadence of w of 1 · · · n 1 n order r is a sequence 0 < t1 < < tr n such that at1 = at2 = = atr . A cadence is arithmetic if the integers· · · are≤ in arithmetic progression,· that· · is if ti+1 ti = ti ti 1 for 1

Theorem 5.16. (van der Waerden’s theorem) Let A be an alphabet with k let- ters. For any positive integer n, there exists an integer W (k,n) such that any word w over A of length at least W (k,n) contains an arithmetic cadence of order n.

Ramsey’s and van der Waerden’s theorems are rather well known. However, the evaluation of the integers W (k,n), as well as the function L, is very difficult and not at all completely known, see [GRS90]. We refer to [Lot83] and to [dLV97] for proofs of our above formulations and further discussions. We now turn to Shirshov’s theorem, which is much less known. It was proved by Shirshov in connection with so-called polynomial identities (see [Lot83] for a discussion of this issue). The unavoidable regularity concerned by this result has also other far reaching applications. Let A be a totally ordered alphabet, and let < denote the lexicographic order on A∗ induced by the order on the alphabet. Denote by Sn the symmetric group on n elements. A sequence (u1,u2,...,un) of nonempty words is called an n- division of the word u = u1u2 un if, for any nontrivial σ of Sn, one has · · · u u u >u u u 1 2 · · · n σ(1) σ(2) · · · σ(n) Example 6. Consider the binary alphabet A = a,b ordered by setting a

26 Now we can formulate our third unavoidable regularity: Theorem 5.17. (Shirshov’s theorem) Let A be an alphabet with k letters. For any positive integers n, k, there exists an integer N(k,p,n) such that any word w over A of length at least N(k,p,n) contains as a factor an n-divided word or a pth power. Again, the integers N(k,p,n) are quite difficult to compute. It is also quite interesting that this result admits an extension to infinite words, in contrast to van der Waerden’s theorem. An infinite word s is called ω-divided if it can be factorized into an infinite product s = s s s of nonempty words such 1 2 · · · n · · · that (s1,...,sn) is an n-division for all n> 0. Now, by denoting by F (s) the set of all (finite) factors of an infinite word s we can formulate: Theorem 5.18. For any infinite word t over A, there exists an infinite word s such that F (s) F (t) and s is ultimately periodic or s is ω-divided. ⊂ This is, in fact, a consequence of another structural result of infinite words. In order to state this, we recall that a Lyndon word is a word w that is primitive and that is smaller (for the lexicographic order) than all of its conjugates, i.e., such that whenever w = xy with x, y nonempty, then xy < yx. Then we have: Theorem 5.19. For any infinite word t over A, there exists an infinite word s such that F (s) F (t) and s is a product of an infinite sequence of non increasing ⊂ Lyndon words: s = ℓ1ℓ2 ℓn , with ℓn ℓn+1 and ℓn is a Lyndon word for all n. · · · · · · ≥ As we already said Shirshov’s theorem has quite interesting consequences. We consider here one of those. A semigroup S is called periodic if every subsemigroup of S generated by a single element is finite. A sequence (s1,...,sn) of n elements of S is called permutable if there exists a nontrivial permutation σ of Sn such that s1 sn = sσ(1) sσ(n). The semigroup is called n-permutable if any sequence of n· ·elements · of ·S · ·is permutable, and S is called permutable if S is n-permutable for some n. A language L is called periodic (resp. permutable) if its syntactic semigroup is periodic (resp. permutable). A striking characterization of rational languages can now be stated: Theorem 5.20. A language L is rational if and only if it is periodic and per- mutable. The proof is based on Shirshov’s theorem, see [RR85], [dLV97]. Also the proofs of Theorems 5.18 and 5.19, as well as more related results can be found in [dLV97].

5.6 Unavoidable sets of words We conclude this section with still another notion of an unavoidability, namely the notion of unavoidable set of words. A set X of words over A is unavoidable if any long enough word over A has a factor in X, that is if A∗ A∗XA∗ is finite. For example, the set a,bb is unavoidable over a,b . Indeed,− any word of length 2 either contains an{ a or} contains bb. Since a{ superset} of an unavoidable

27 set is again unavoidable, it is natural to consider minimal unavoidable sets. Every minimal unavoidable set is finite. Indeed, if X is unavoidable, then let d be the maximal length of the words in the finite set A∗ A∗XA∗. Let Z be the set of words in X of length at most d + 1. Every word− of length d + 1 has a factor in X which actually is in Z. Thus Z is unavoidable and finite. In fact, there exists an algorithm to decide whether a finite set X of words in unavoidable, and the structure of unavoidable sets of words is rather well understood, see Chapter 1 of [Lot02] for a general exposition and for references. A recent result concerns unavoidable sets of words of the same length, i.e., uniform unavoidable sets. For k, q 1, let c(k, q) be the number of conjugacy classes of words of length k on q ≥letters. An unavoidable set of words of length k on q symbols clearly has at least c(k, q) elements. It is quite remarkable that the converse holds as well, see [CHP03]: Theorem 5.21. For any k, q 1, there exists an unavoidable set of words of length k on q letters having exactly≥ c(k, q) elements. As an illustration assume that k = 3 and q = 2, so that c(k, q) = 4. Now the sets aaa,aab,aba,bbb and aaa,aab,bba,bbb , for example, are avoidable, the former{ because it does} not intersect{ all the conjugacy} classes and the latter since it avoids (ab)ω. On the other hand, the set X = aaa,aab,bab,bbb is unavoidable. Indeed, infinite words containing no letter a or{ the factor aa clearly} contains a factor from X, and all the other words contain either bab or bbb.

6 Complexity

Given a set X of words over an alphabet A the complexity function of X is n the function pX defined by pX (n) = Card(X A ). We are interested here in the (subword) complexity (one should say factor∩ complexity!) of a finite or an infinite word u. It is the function pu which is the complexity function for the n set F (u) of factors of u, thus pu(n) = Card(F (u) A ) is just the number of factors of length n in u. The study of infinite words∩ of given complexity has revealed a large set of surprising results. Moreover, there are strong relations to number theory and symbolic dynamics. Subword complexity was studied, in relation with languages generated by D0L-systems, already in [ELR75].

6.1 Subword complexity of finite words

The subword complexity of a finite word w is simply the function pw such that pw(n) is the number of factors of length n of w. Of course, pw(0) = pw( w ) = 1, and p (n) = 0 for n > w . The description of the shape of the function| | w | | pw uses the parameter Gw defined as follows: Gw is the maximal length of a repeated factor, that is a factor that appears at least twice in w. For instance, let w = abccacbccabaab. The word bcca is a repeated factor of maximal length, so Gw = 4. The shape of the function pw is given by the following result, see [dL99], [CdL01b], [CdL01a] and [LS01]:

28 Theorem 6.1. Let w be a word over at least two letters. There is an integer mw such that the function pw(n) behaves as follows:

(i) it is strictly increasing for 0 n mw, (ii) it is constant for m n ≤G +1≤ , w ≤ ≤ w (iii) it is decreasing for Gw +1 n w , and more precisely pw(n +1) = p (n) 1 for n = G +1,..., w≤ . ≤ | | w − w | | Consider again the word w = abccacbccabaab. One gets pw(1) = 3, pw(2) = 8, pw(3) = 10, pw(4) = 11, and this is the maximum because Gw = 4. Then it decreases by 1 at each step: pw(5) = 10 . . . . In our example, the integer mw is equal to Gw. A precise description of the parameter mw is missing. On the contrary, the parameter Gw is strongly related to other structural parameters of the word w. Denote by Rw the minimal number such that there is no factor x of w of length Rw that has two right extensions in w, i.e., such that xa and xb are factors of w for distinct letters a and b. Next, let Kw be the length of the shortest suffix of w which is an unrepeated factor of w. Then it can be shown, see [CdL01b], that Gw +1 = max(Rw,Kw). The parameter Gw has the following interesting property, proved in [CdL01b]: Theorem 6.2. A word w is completely determined by its set of factors of length at most Gw +2. For example, the word w = abccacbccabaab, of length 14, is entirely determined by its factors of length 6. Many combinatorial facts, and properties of distribution of these parameters, are given in [CdL02a] and [CdL02b]. Related to Theorem 6.2 one can ask several questions when a given word is uniquely determined by its factors or sparse subwords. Besides these two variants one can also ask the same when multiplicities are taken into account. This leads to four different problems, see [Ma00] or Chapter 6 in [Lot83]. Among those is a question, sometimes referred to as Milner’s Problem, asking what is the minimal length of sparse subwords with multiplicities which defines the word uniquely.

6.2 Subword complexity of infinite words Let us start with two examples. For the infinite word 01(10)ω, the complexity function is easily computed. One gets p(0) = 1, p(1) = 2, p(2) = 3, p(n)=4 for n 4. More generally, it is easily seen that an ultimately periodic infinite word has≥ a complexity function that is bounded, and therefore is ultimately constant because a complexity function is never decreasing. On the contrary, consider the infinite binary word c known as the Champernowne word: c = 0110111001011101111000 · · · This is obtained by concatenating the binary expansions of all positive integers n in order. Clearly, every binary word is a factor of c and thus pc(n)=2 for all n 0. These examples show that both extreme behaviours are possible for complexity≥ functions. However, there are gaps in the growth of complexity functions. The first such result was given in [MH40,CH73]:

29 Theorem 6.3. Let x be an infinite word. The following are equivalent:

(i) x is ultimately periodic, (ii) px(n)= px(n + 1) for some n, (iii) px(n) < n + k 1 for some n 1, where k is the number of letters appearing in x, − ≥ (iv) px(n) is bounded.

Proof. The implications (1) (4) (3) (2) are easy. For the remaining implication (2) (1) we observe⇒ that⇒ each⇒ factor of length n in x can be extended in exactly⇒ one manner to a factor of length n + 1. This means that each occurrence of a given factor of length n is always followed by the same letter in x. Consider any factor u of length n that appears twice in x, and denote by y the word that separates the two occurrences, so that uyu is a factor of x. Since the letters that follow u are determined by u, this means that in fact uyuy is a factor of x, and that (uy)ω is a suffix of x. ⊓⊔ This result shows a “gap” for complexity functions: either a function p is ultimately constant, or p(n) n + 1 for all n. So an infinite aperiodic word, that is a word that is not ultimately≥ periodic, cannot have a complexity function bounded by n. It appears that aperiodic infinite words of complexity p(n)= n+1 indeed exist (see also [Kn72]). These words are called Sturmian words. For another gap of complexity functions see [Cas97b].

6.3 Sturmian words

A x is always a binary word because px(1) = 2 and px(1) is the number of letters appearing in x. Every factor of length n can be extended to the right into a factor of length n + 1. Since px(n +1)=1+ px(n), this extension is unique for all factors up to one, and this last factor has two extensions. A factor with two extension is called right special. More precisely, call the (right) degree of a word u the number of distinct letters a such that ua is a factor of x. Then, in the binary case, a right special factor is a word u of degree 2. A Sturmian word has exactly one special factor of each length. We give an example: Fact 6.4. The Fibonacci word f = 010010100100101001010 is Sturmian. · · · To check this, we observe that f is the fixed-point of the morphism ϕ given in Section 1.3. Thus f is a product of words 0 and 01. In particular, 11 is not a factor of f and pf (2) = 3. Also, the word 000 is not a factor of f since otherwise it is a factor of the image of a factor of f that must contain 11. Next, we show that for all finite words x, neither 0x0 nor 1x1 is a factor of f. This is clear if x is the empty word or is a single letter. Arguing by induction, assume that both 0x0 or 1x1 are factors of f. Then x starts and ends with 0, so x =0y0 for some word y. Since 10y01 is a factor of f, there exists a factor z of f such that ϕ(z)=0y. Moreover, 00y00 = ϕ(1z1) and 010y01 = ϕ(0z0) are factors of f, showing that 0z0 and 1z1 are factors, a contradiction.

30 We now prove that f has at most one special factor of each length. Assume that u and v are right special factors of f of the same length, and let x be the longest common suffix of u and v. Then the four words 0x0, 0x1, 1x0, 1x1 are factors of f which contradicts our previous observation. We finally prove that f has at least one special factor of each length. For this, it suffices to prove that f is aperiodic. Recall that f is the limit of the sequence of finite Fibonacci words defined by f0 = a, f1 = ab, and fn+2 = fn+1fn. It is easily shown that the f 1 | n|a , n , f → τ → ∞ | n| with τ = (1+√5)/2, whereas in a ultimate periodic word, this limit is a rational number. ⊓⊔ We now give two other descriptions of Sturmian words, namely as balanced words and as mechanical words. Given two binary words u and v of the same length, the balance of u and v is the number b(u, v) = u 1 v 1 , that is the absolute value of the difference of the number of occurrences|| | −of | | the| letter 1 in the words u and v. Since u and v have the same length, one could also have defined this number by taking the number of 0’s instead of the number of 1’s. As an example, for 01001 and 11001, the balance is 1. An infinite word x is balanced if the balance of any two factors of x of the same length is at most 1. Intuitively, a balanced word cannot have big differences in factors. In particular, a balanced word cannot contain simultaneously the factors 0u0 and 1u1. Moreover, it can be shown that a balanced word x has a slope, that is that the limit limn bn/n →∞ exists, where bn is the number of 1’s in the prefix of length n. As an example, the slope of the Fibonacci word is 1/ϕ2, where ϕ = (1+ √5)/2. Another notion strongly related to Sturmian words is of more arithmetical nature. Given two real numbers α and ρ with 0 α 1, we define two infinite ≤ ≤ words s = s (0)s (1) and s′ = s′ (0)s′ (1) by α,ρ α,ρ α,ρ · · · α,ρ α,ρ α,ρ · · ·

sα,ρ(n)= α(n +1)+ ρ αn + ρ ⌊ ⌋−⌊ ⌋ for n> 0. s′ (n)= α(n +1)+ ρ αn + ρ α,ρ ⌈ ⌉−⌈ ⌉

The word sα,ρ is the lower mechanical word and sα,ρ′ is the upper mechanical word with slope α and intercept ρ. It is clear that we may assume 0 ρ 1. If ≤ ≤ α is irrational, sα,ρ and sα,ρ′ differ by at most one factor of length 2. The terminology stems from the following graphical interpretation. Consider the straight line defined by the equation y = αx + ρ. The points with integer coordinates just below this line are P = (n, αn + ρ ). Two consecutive points n ⌊ ⌋ Pn and Pn+1 are joined by a straight line segment that is horizontal if sα,ρ(n)=0 and diagonal if sα,ρ(n) = 1. The same observation holds for the points located just above the line. A special case deserves to be considered separately, namely when 0 <α< 1 and ρ = 0. In this case, sα,0(0) = 0, sα,′ 0(0) = 1, and if α is irrational

sα,0 =0cα, sα,′ 0 =1cα,

31 where the infinite word cα is called the characteristic word of α. Mechanical words can be interpreted in several other ways. One is as cutting sequences, and is as follows. Consider again a straight line y = βx + ρ, for some β > 0 not restricted to be less than 1, and ρ not restricted to be positive. Consider the intersections of this line with the lines of the grid with nonnegative integer coordinates. We get a sequence of intersection points. Writing a 0 for each vertical intersection point and a 1 for each horizontal intersection point, we obtain an infinite word Kβ,ρ that is called the cutting sequence. Then

Kβ,ρ = sβ/(1+β),ρ/(1+β)

Indeed, the transformation (x, y) (x + y, x) of the plane maps the line y = βx + ρ to y = β/(1 + β)x + ρ/(17→ + β). Thus, cutting sequences are just an- other formulation of mechanical words, see also [CMPS93] for a more detailed discussion. Mechanical words can also be generated by rotations. Let 0 <α< 1. The rotation of angle α is the mapping R = Rα from [0, 1[ into itself defined by

R(z)= z + α { } Iterating R, one gets Rn(ρ)= nα + ρ . Thus, defining a partition of [0, 1[ by { } I = [0, 1 α[, I = [1 α, 1[ , 0 − 1 − one gets n 0 if R (ρ) I0 sα,ρ(n)= n ∈ .  1 if R (ρ) I1 ∈ The three properties are related by the following theorem [MH40]: Theorem 6.5. Let s be an infinite word. The following are equivalent: (i) s is Sturmian; (ii) s is balanced and aperiodic; (iii) s is mechanical with an irrational slope. A proof can be found in Chapter 2 of [Lot02]. As a example, the Fibonacci word f is indeed the lower mechanical word with slope and intercept equal to 1/ϕ2. There is a special class of Sturmian words called characteristic Sturmian words. These are the words where the intercept equals the slope. Each characteristic Sturmian word s has a description as the limit of a sequence sn of finite words, quite as the Fibonacci word f is the limit of the finite words fn. The recurrence relation is slightly more complicated. It has the form

dn sn = sn 1sn 2 for n 1 − − ≥ with s 1 = 1, s0 = 0, and where d1, d2,... is a sequence of integers with d1 0 − ≥ and dn > 0 for n > 1. This sequence is related to the slope α of s by the fact that [0, 1+ d1, d2,...] is the expansion of α. In the case of the

32 Fibonacci word, the continued fraction expansion of 1/ϕ2 is indeed [0, 2, 1, 1,...] and the dn are all equal to 1. The sequence sn is called the standard sequence, and every word that appears in a standard sequence is a standard word. A beautiful theorem, proved in [CMPS93], describes those irrational num- bers α for which the standard Sturmian word can be generated by iterating a morphism. We give here the description that follows from a complement given in [All98]. Theorem 6.6. Let 0 <α< 1 be an irrational number. The characteristic Stur- mian word of slope α is a fixed-point of some nontrivial morphism if and only if α is a quadratic such that 1/α<¯ 1. In this theorem,α ¯ denotes the conjugate of the number α, that is the other root of its minimal polynomial. This result shows a relation between a combinatorial property of words and an arithmetical counterpart. Sturmian words have a tremendous amount of combinatorial or arithemetic properties. An account can be found in Chapter 2 of [Lot02] and in [PF02]. A final comment: why Sturmian words are called Sturmian? This term was introduced by Morse and Hedlund in their work on symbolic dynamics. The term is rather unfortunate in that the mathematician Sturm (1803–1855) never worked on these sequences. The argument is as follows ([MH40], page 40 and 41): consider a linear homogeneous second order differential equation

y′′ + φ(x)y =0 where φ is continuous and has period 1. For an arbitrary solution u(x) of this equation, one considers the infinite word a0 an where an is the number of zeros of u in the interval [n,n + 1). According· · · to· · · the well-known Sturmian separation theorem, this infinite word is Sturmian (over a convenient alphabet). This observation motivates the choice of the terminology. For additional bibliographic comments about the origins, see the answer to exercise 1.2.8-36 in Knuth’s volume 1, as well as [St76] which dates back the Fibonacci word to J. Bernoulli III and A.A. Markov, see [Be1772] and [Ma1882].

6.4 Episturmian words There have been several attempts to extend the notion of Sturmian words. By the definition, Sturmian words are binary, and so relaxations of the constraints on the complexity function were looked for. It appears that a very good extension is rather related to what is called Arnoux-Rauzy, or more generally episturmian words. We start with a typical word of this family called Tribonacci word. This word is defined, like the Fibonacci word, as the limit of a sequence of words defined by a recurrence relation:

tn+3 = tn+2tn+1tn, t0 =0, t1 = 01, t2 = 0102 So the Tribonacci word is t = 0102010010201010201 · · ·

33 It is also defined as the fixed-point of the mophism 0 01 1 7→ 02 2 7→ 0 7→ It is not very difficult to see that pt(n)=2n +1 for n 0. This means that for each n, there must be 2 additional extensions to the right≥ of factors of length n. To do this, there are two possibilities: either there are 2 distinct right special factors, each of which has degree 2, or there is just one right special factor which has degree 3. It can be checked that the second property holds for the Tribonacci word. The right special factors are ε, 0, 10, 010, 2010,... . As in the case of the Fibonacci words, the right special factors are the reversal of the prefixes of t. Infinite words that have exactly one right special factor of each length, and each having degree 3, where introduced by Arnoux and Rauzy in [AR91], and are therefore called Arnoux-Rauzy words, or AR-words for short. This terminology has been extended to words over k-letters, see e.g., [WZ01], by requiring that for each length n, there is a unique right special factor with degree k. This definition is relaxed in [DJP01] and [JP02] to allow right special factors of degree at most k. Infinite words with this property are called episturmian words, and AR-words are called strict episturmian [JP02]. Strict episturmian binary words are Sturmian, whereas episturmian binary words may be ultimately periodic, so they are the mechanical words.

6.5 Hierarchies of complexities One of the first papers on subword (factor) complexity is [ELR75]. In this paper, it is shown that the subword complexity of a D0L language is bounded by cn2 (resp. cn log n, cn) if the morphism that generates the languages is arbitrary (resp. growing, uniform). This result was extended in [Pan84a]: Theorem 6.7. The subword complexity of an infinite word generated by iterat- ing a morphism is of one of the following types: Θ(n), Θ(n log n), Θ(n log n log n), Θ(n2), or Θ(1). Each of the complexity classes corresponds to a class of morphisms. We just give some examples. The morphism a ab b 7→ bc c 7→ c 7→ generates the infinite word abbcbc2bc3 bcn · · · · · · having quadratic complexity. Consider next the morphism a abc b 7→ bb c 7→ ccc 7→

34 It generates the infinite word

n n abcb2c3b4c9 b2 c3 · · · · · · which has complexity Θ(n log n). Consider finally the morphism

a abab b 7→ bb 7→ Starting with a, one gets the infinite word

abab3abab7ababbbabab15 · · · which has complexity Θ(n log log n).

6.6 Subword complexity and transcendence Consider the Thue-Morse word t and the Fibonacci word f written over the alphabet 0, 1 : { } t = 01101001100101101001011001101001 f = 010010100100101001010 · · · · · · One may consider the infinite words as binary expansions of real numbers in the interval [0, 1], and ask whether these numbers are algebraic or transcendental. We address this and related questions in this section. It is well known that the expansion in an integral base b 2 of a rational number is ultimately periodic. This means that the subword≥ complexity of a rational number is very low. On the other hand, it has been conjectured by Borel, see [All00] for a more detailed discussion, that the infinite word representing the expansion in base b of an algebraic number is “normal”, which means that all words appear as factors, and that all factors of length n appear with the same frequency 1/bn. This conjecture is presently far from being proved, and even the following, much weaker conjecture, seems to be very difficult: let x be the n expansion in base b of a r. If px(n)