On the Star-Height of Factor Counting Languages and Their Relationship To
Total Page:16
File Type:pdf, Size:1020Kb
ON THE STAR-HEIGHT OF SUBWORD COUNTING LANGUAGES AND THEIR RELATIONSHIP TO REES ZERO-MATRIX SEMIGROUPS TOM BOURNE AND NIK RUSKUCˇ Abstract. Given a word w over a finite alphabet, we consider, in three special cases, the generalised star-height of the languages in which w occurs as a contiguous subword (factor) an exact number of times and of the languages in which w occurs as a contiguous subword modulo a fixed number, and prove that in each case it is at most one. We use these combinatorial results to show that any language recognised by a Rees (zero-)matrix semigroup over an abelian group is of generalised star-height at most one. 1. Introduction and preliminaries The generalised star-height problem, which asks whether or not there exists an algo- rithm to compute the generalised star-height of a regular language, is a long-standing problem in the field of formal language theory. In particular, it is not yet known whether there exist languages of generalised star-height greater than one; see [10, Section I.6.4] and [9]. The aim of the present paper is to present some new contributions concerning this problem. In Section 2, we take a combinatorial approach and find the generalised star-height of languages where a fixed word w appears as a contiguous subword of the words in our language precisely k times or k modulo n times. In Section 3, we apply these results to prove that languages recognised by Rees (zero-)matrix semigroups over abelian groups are of generalised star-height at most one. An alphabet A is a finite, non-empty set; its elements are letters. A finite sequence of letters is a word (over A). The length of a word w, denoted by w , is the total number of letters appearing in w. The empty word, denoted by ", is thej uniquej word of length zero. The set of all words over A is denoted by A∗, and the set of all non-empty words over A is denoted by A+.A semigroup (respectively, monoid) language is a subset of A+ (respectively, A∗). Given an alphabet A, we define the empty set, the empty word, and each of the letters in A to be basic regular expressions. Using these, we recursively define new regular expres- sions by using the (finite) union, concatenation product, (Kleene) star and complement ∗ arXiv:1603.06236v2 [cs.FL] 27 Sep 2016 operations; that is, if E and F are regular expressions then so too are E F , EF , E and Ec. A language is regular if it can be represented by a regular expression.[ The (generalised) star-height of a regular expression E, denoted by h(E), is defined recursively as follows: for the basic regular expressions, h( ) = h(") = h(a) = 0, where a is a letter from A; for union and product, h(E F ) = ;h(EF ) = max h(E); h(F ) ; for the star operation, h(E∗) = h(E) + 1; and for complementation,[ h(Ec)f = h(E). Theg (generalised) star-height of a language L, denoted by h(L), is h(L) = min h(E) E is a regular expression representing L : f j g Key words and phrases. Regular language, star-height, subword, Rees matrix semigroup. 1 Note that we can use De Morgan's laws to express intersection and set difference, and that h(E F ) = h(E F ) = max h(E); h(F ) : \ n f g It is well known that the class of regular languages remains unchanged if comple- mentation is removed from the list of allowed operations. One can define the notion of (restricted) star-height with respect to this signature. In this context, the star-height problem has been solved: there exist languages of arbitrary (restricted) star-height [1], and the (restricted) star-height of a language is algorithmically computable [4]. From this point on, the phrase \star-height" will always refer to generalised star-height. The following simple observation, which allows `removal' of stars, will be used through- out the paper: Observation 1.1 ([8]). For any alphabet A and any subset B of A we have A∗ = c and B∗ = A∗ (A∗(A B)A∗): ; n n Hence, h(A∗) = h(B∗) = 0: Let u; w and x be elements of A∗. If v = uwx then u is a prefix of v, w is a contiguous subword (or factor) of v, and x is a suffix of v. Throughout this paper, the phrase \subword" will always mean contiguous subword. A prefix of a word that is also a suffix of that word is a border, and the proper border of greatest length is said to be maximal. For every word w in A+ and every word v in A∗, we denote the number of times that w appears as a subword of v by v w. When w is a letter, say w = a, the notation v a coincides with its usual meaning;j thatj is, the number of times the letter a appears inj thej word v. For every word w in A+ and every non-negative integer k, we define the language Count(w; k) by ∗ Count(w; k) = v A v w = k ; f 2 j j j g that is, the set of words v over A such that w appears as a subword of v precisely k times. As such, we regard Count(w; 0) as the set of all words that do not feature w as a subword. From this characterisation, we note that v Count(w; 0) v A∗ A∗wA∗ v (A∗wA∗)c v ( cw c)c ; (1) 2 , 2 n , 2 , 2 ; ; where the the final equivalence follows by Observation 1.1. Thus, for a fixed word w, the language Count(w; 0) is representable by a star-free expression and is therefore of star-height zero. In a similar manner, for every word w in A+, every integer n greater than or equal to 2 and every non-negative integer k with 0 k < n, we define the language ModCount(w; k; n) by ≤ ∗ ModCount(w; k; n) = v A v w k (mod n) ; f 2 j j j ≡ g that is, the set of words v over A such that w appears as a subword of v precisely k modulo n times. It should be noted that the languages Count(w; k) and ModCount(w; k; n) are regular. This can be proved directly; for example, by building a finite state automaton accepting the language and appealing to Kleene's Theorem (see, for example, [10, Theorem I.2.3]). For the languages under consideration in this paper, regularity also follows from the proofs in Section 2. In Section 2 we prove the following result: 2 Proposition 1.2. Let A be an alphabet. For any word w in A+ with w 3, the language Count(w; k) is of star-height zero, and the language ModCount(w; k;j nj) ≤is of star-height at most one. In Section 3 we are interested in languages recognised by Rees (zero-)matrix semigroups over abelian groups. A language L A+ is recognised by a semigroup S if there exists a semigroup morphism ' : A+ S⊆and a subset X of S such that L = X'−1. Again by Kleene's Theorem, a language! is recognisable by a finite semigroup if and only if it is regular. We then prove: Theorem 1.3. A language recognised by a Rees (zero-)matrix semigroup over an abelian group is of star-height at most one. In order to prove this, we combine the results of Section 2 and a general result on Rees zero-matrix semigroups over semigroups with the following known results: (AG1) A language L is recognised by a finite abelian group if and only if L is a boolean combination of languages of the form ModCount(a; k; n), where a is a letter from an alphabet A; see, for example, [7, Corollary 2.3.12]. (AG2) A language recognised by a finite abelian group is of star-height at most one; [5]. 2. Counting subwords Throughout this section we will consistently make use of the notation Count(w; k) and ModCount(w; k; n) as introduced in Section 1. We split our analysis into the following three special cases: (1) Counting subwords over a unary alphabet; (2) Counting subwords with maximal border " over a non-unary alphabet; (3) Counting subwords that are a power of a letter over a non-unary alphabet. Case (1) is simple, but we consider it for the sake of establishing some equalities that will be useful subsequently. The substantive difference between cases (2) and (3) is that in (3) the letters of w may appear as components of multiple subwords. For example, if we are counting the number of occurrences of aa and we encounter the expression aaa then we have two occurrences of aa and the central a belongs to both. This is not a problem in the second case as having maximal border " ensures that occurrences of w do not overlap. 2.1. Case 1: a unary alphabet Let A = a be a unary alphabet. A language L over A is regular if and only if L is of the formfXg Y (ar)∗, where X and Y are finite sets and r is an integer greater than or equal to 0; see[ [10, Exercise II.2.4]. Thus, every language over a unary alphabet is of star-height at most one. However, we want to find expressions of minimal star-height for the languages Count(ar; k) and ModCount(ar; k; n), where r is a natural number, to be used later in the non-unary cases.