
An Algebraic Characterization of Strictly Piecewise Languages

Jie Fu, Jeffrey Heinz, and Herbert G. Tanner

University of Delaware {jiefu,heinz,btanner}@udel.edu

Abstract. This paper provides an algebraic characterization of the Strictly Piecewise class of languages studied by Rogers et al. 2010. These languages are a natural subclass of the Piecewise Testable languages (Simon 1975) and are relevant to natural language. The algebraic characterization highlights a similarity between the Strictly Piecewise and Strictly Local languages, and also leads to a procedure which can decide whether a regular language $L$ is Strictly Piecewise in polynomial time in the size of the syntactic monoid for $L$.

1 Introduction

Rogers et al. [12] study the Strictly Piecewise (SP) languages, which are a proper subclass of the Piecewise Testable (PT) languages of Simon [13]. The Strictly Piecewise languages are interesting for two reasons. First, there are several senses in which the SP class is natural. For example, SP is exactly the class of those languages closed under subsequence [12]. Also, they bear the same relation to the Piecewise Testable languages that the Strictly Local (SL) languages bear to the Locally Testable (LT) languages [10, 12]. Second, this class expresses some of the kinds of long-distance dependencies found in natural language [6, 12].

While Rogers et al. provide several characterizations of SP languages, they do not provide an algebraic one. Also, the procedure they give for deciding whether a regular language $L$ belongs to SP is exponential in the size of the smallest deterministic acceptor for $L$. This paper aims to address these issues. An algebraic characterization for the SP class is provided. This result not only reveals an important similarity between the SP and SL languages, but also leads to a procedure which decides whether $L$ belongs to SP in time quadratic in the size of the syntactic monoid for $L$. However, it remains an open question whether a decision procedure exists that is polynomial in the size of the smallest deterministic acceptor.

The rest of this paper is organized as follows. Section 2 reviews foundational concepts and notation. Section 3 defines the Piecewise Testable (PT), Strictly Piecewise (SP), and Strictly Local (SL) classes. Section 4 presents our algebraic characterization of the SP class and Section 5 describes the polynomial-time decision procedure. Finally, Section 6 concludes.

This research is supported by grant #1035577 from the National Science Foundation.

M. Ogihara and J. Tarui (Eds.): TAMC 2011, LNCS 6648, pp. 252–263, 2011. © Springer-Verlag Berlin Heidelberg 2011

2 Preliminaries

A semigroup is a set with an associative operation. A monoid is a semigroup with an identity element (written 1). If $S$ is a semigroup, $S^1$ denotes the monoid equal to $S$ if $1 \in S$ and to $S \cup \{1\}$ otherwise. A zero is an element 0 such that, for every $s \in S$, $s0 = 0s = 0$. The free semigroup (monoid) of a set $S$ is the set of all finite sequences of one (zero) or more elements from $S$.

If $x$ is an element of a set $S$ and $\pi$ a partition of $S$, the block of $\pi$ containing $x$ is $[x]_\pi$. The partition of $S$ induced by an equivalence relation $\rho$ is $S/\rho$. A right (left) congruence is a partition such that if $[x]_\pi = [y]_\pi$ then $[xz]_\pi = [yz]_\pi$ ($[zx]_\pi = [zy]_\pi$). A congruence is both a left and a right congruence.

Following Clifford [2], a left (right) ideal of a semigroup $S$ is a non-empty subset $T$ of $S$ such that $ST \subseteq T$ ($TS \subseteq T$). The left (right) ideal of $S$ generated by $T$ is $T \cup ST = S^1T$ ($T \cup TS = TS^1$). The principal left (right) ideal of $S$ generated by $t \in T$ is $P_L(t) = S^1t$ ($P_R(t) = tS^1$).

Let $\Sigma$ denote a finite set, called the alphabet. Sets $\Sigma^+$ and $\Sigma^*$ denote the free semigroup and free monoid of $\Sigma$, respectively. We refer to the elements of $\Sigma^+$ and $\Sigma^*$ as strings and words interchangeably. The unique string of length zero is denoted $\lambda$. The set $\Sigma^{\le k}$ denotes the set of all words of length at most $k$. The length of a string $u$ is denoted $|u|$, and $|w|_\sigma$ denotes the number of occurrences of $\sigma$ in $w$. A string $v$ is a factor of $w$ iff there exist strings $x, y \in \Sigma^*$ such that $w = xvy$. A string $v$ is a prefix (suffix) of $w$ iff there exists $x \in \Sigma^*$ such that $w = vx$ ($w = xv$). A string $v$ is a subsequence of a string $w$ iff $v = \sigma_1 \cdots \sigma_n$ and $w \in \Sigma^* \sigma_1 \Sigma^* \cdots \Sigma^* \sigma_n \Sigma^*$, and we write $v \sqsubseteq w$. Languages are subsets of $\Sigma^*$. The complement of a language $L$ is $\overline{L} = \{w \in \Sigma^* : w \notin L\}$.

A semiautomaton is a tuple $A = \{Q, \Sigma, T\}$, where $Q$ is a non-empty finite set of states and $\Sigma$ is the alphabet. The transition function is a partial function $T : Q \times \Sigma \to Q$. The domain of the transition function is expanded to $Q \times \Sigma^*$ recursively as follows. For all $q \in Q$, $T(q, \lambda) = q$, and for all $w \in \Sigma^*$ and $\sigma \in \Sigma$, $T(q, w\sigma) = T(T(q, w), \sigma)$. It follows that $T(q, xy) = T(T(q, x), y)$.
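The recursive extension of $T$ to strings can be made concrete with a short Python sketch. The dictionary encoding of the partial transition function and the two-state example automaton are illustrative assumptions and are not taken from the paper.

```python
from typing import Dict, Optional, Tuple

State = int
Transitions = Dict[Tuple[State, str], State]


def extended_transition(T: Transitions, q: State, w: str) -> Optional[State]:
    """Return T(q, w), or None when some step of the walk is undefined."""
    current: Optional[State] = q
    for symbol in w:
        # T is partial: a missing key means the transition is undefined.
        current = T.get((current, symbol))
        if current is None:
            return None
    return current


# Hypothetical semiautomaton over {a, b}; 'a' is undefined in state 2.
T_EXAMPLE: Transitions = {(1, "a"): 2, (1, "b"): 1, (2, "b"): 1}
```

With this encoding, the identity $T(q, xy) = T(T(q, x), y)$ can be checked directly on small words, provided the intermediate state is defined.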
By definition semiautomata are deterministic. A finite-state automaton (FSA) is a tuple $A = \{Q, q_0, Q_f, \Sigma, T\}$, where $\{Q, \Sigma, T\}$ is a semiautomaton, $q_0 \in Q$ is the initial state, and $Q_f \subseteq Q$ is a set of final states. The language recognized by $A$ is $\{w \in \Sigma^* : T(q_0, w) \in Q_f\}$. A language $L$ is regular iff there exists an FSA recognizing it. For every regular language $L$ there is a unique (up to isomorphism) automaton with the fewest states recognizing $L$, called the canonical FSA for $L$. A state $q$ of an automaton is a sink state iff, for all $\sigma \in \Sigma$, if $T(q, \sigma)$ is defined then $T(q, \sigma) = q$. One can always make the transition function total by adding a nonfinal sink state and directing all the missing transitions for each state to this sink. When the sink state is added to a canonical acceptor, it is the only state which is both a sink and nonfinal. The resulting automaton is complete.

For any automaton $A$ and state $q \in Q$, let $\rho_q$ be that relation such that, for all elements $x$ and $y$ of $\Sigma^*$, $x \rho_q y$ iff $T(q, x) = T(q, y)$. More generally, let

$$f_x = \begin{pmatrix} q_1 & \cdots & q_n \\ T(q_1, x) & \cdots & T(q_n, x) \end{pmatrix}.$$

For all $x, y \in \Sigma^*$, let $x \rho y$ iff $f_x = f_y$. The equivalence relation $\rho$ over $\Sigma^*$ induces a congruence over $\Sigma^*$ [15]. The index of $\rho$ is finite because $Q$ is finite. Let $F_A = \{f_x : x \in \Sigma^*\}$ denote the finite monoid of mappings and $\bar{I}(A) = \Sigma^*/\rho$. Then $F_A$ is isomorphic to $\bar{I}(A)$ under the correspondence of $f_x$ of $F_A$ with $[x]$ of $\bar{I}(A)$, where $[x]$ is the $\rho$-congruence coset of $\Sigma^*$ containing $x$. In this paper, when writing $f_x$ and $[x]$, we choose $x$ to be a shortest-length element of the congruence class, which introduces no ambiguity. For an FSA whose associated semiautomaton is $A$, $F_A$ is called the transformation semigroup of $A$ and $\bar{I}(A)$ is the characteristic semigroup of $A$.

Elements $f_x$ of $F_A$ can also be written in matrix form $\mu_x$, where the rows and columns indicate states in $Q = \{q_1, \ldots, q_n\}$ and $\mu_x[i, j] = 1$ iff $T(q_i, x) = q_j$. The set of matrices is another semigroup, the transition semigroup. The name is derived from the fact that each element in this semigroup is a transition matrix associated to a walk $x$ in $A$. We write $U_A = \{\mu_x : x \in \Sigma^*\}$. Clearly $U_A$ is isomorphic to $\bar{I}(A)$ under the correspondence of $\mu_x$ of $U_A$ with $[x]$ of $\bar{I}(A)$.

Definition 1 (Pin 1997). The syntactic semigroup of a regular language $L$ is the transformation semigroup given by its complete canonical semiautomaton.

In the syntactic semigroup of an automaton $A$, the set of generators of $F_A$ is $Gen(F_A) = \{f_\sigma : \sigma \in \Sigma\}$. The syntactic monoid of a regular language $L$ is the syntactic semigroup with identity, $Gen(F_A^1) = \{f_\sigma : \sigma \in \Sigma \cup \{\lambda\}\}$. Pin [11] discusses the correspondence between automata and syntactic semigroups.

Note that since the transition semigroup $U_A$ of $A$ is represented as a semigroup of boolean matrices of order $|Q| \times |Q|$, a word $w$ is recognized by $A$ iff $\mu_w[q_0, q_f] = 1$ for some final state $q_f \in Q_f$. It follows that a finite automaton recognizes a regular language $L$ iff its transition semigroup recognizes $L$.

A "monoid graph" is a useful method employed by contemporary algebraic theorists to visualize monoids. The nodes of the graph are elements in the monoid, though an initial node labeled "$\lambda$" is included by convention. The labels on edges are the elements in the set of generators of the monoid. Given a monoid $M$, $x \xrightarrow{s} y$ iff $xs = y$, where $x, y \in M$ and $s \in Gen(M)$. The monoid graph of $F_A$ is denoted $MG(F_A)$. We mark elements $x$ in the monoid graph as final iff $f_x \in F_A$ and there exists a final state $q$ in the canonical acceptor such that $T(q_0, x) = q$ [11]. Examples of monoid graphs are in Figures 1, 2, and 3.

Definition 2. A unique nonfinal sink state in an automaton $A$ is called zero. An element $f_x$ is a zero element of the transformation semigroup iff

$$f_x = \begin{pmatrix} q_1 & \cdots & q_n \\ 0 & \cdots & 0 \end{pmatrix}.$$

We use the notation $f_x = 0$ for the transformation semigroup, $\mu_x = 0$ for the transition semigroup, and $x = 0$ for the free monoid $\Sigma^*$. The corresponding zero in the characteristic semigroup $\bar{I}(A)$ is denoted $[0]$.

While every complete canonical automaton (except the one recognizing $\Sigma^*$) has a unique nonfinal sink state, not every transformation semigroup has a zero.
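The construction of $F_A$ from its generators, and the test for a zero element, can be sketched in Python as follows. This is a minimal illustration under stated assumptions: transformations are encoded as tuples indexed by state, state 0 plays the role of the nonfinal sink, and the example automaton (words over {a, b} with no factor aa) is a hypothetical two-letter variant chosen for brevity.

```python
from collections import deque


def compose(f, g):
    """Transformation of the word xy, given f = f_x and g = f_y."""
    return tuple(g[s] for s in f)


def generate_monoid(gens, n_states):
    """Map each transformation to a shortest word inducing it (BFS closure)."""
    identity = tuple(range(n_states))
    rep = {identity: ""}
    queue = deque([identity])
    while queue:
        f = queue.popleft()
        for sym in sorted(gens):
            g = compose(f, gens[sym])
            if g not in rep:
                rep[g] = rep[f] + sym
                queue.append(g)
    return rep


def zero_element(monoid, n_states, sink):
    """Return the map sending every state to `sink`, if the monoid contains it."""
    if sink is None:  # no nonfinal sink state, e.g. the automaton for Sigma*
        return None
    zero = (sink,) * n_states
    return zero if zero in monoid else None


# Assumed encoding: state 0 = sink, state 1 = "last symbol was not a",
# state 2 = "last symbol was a".
NO_AA = {"a": (0, 2, 0), "b": (0, 1, 1)}
NO_AA_MONOID = generate_monoid(NO_AA, 3)

# One-state automaton for Sigma*: its transformation monoid is trivial.
SIGMA_STAR_MONOID = generate_monoid({"a": (0,), "b": (0,)}, 1)
```

Because the closure is computed breadth-first, each element is stored with a shortest representative word, matching the convention adopted above for writing $f_x$ and $[x]$.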

3 Piecewise Testable and Strictly Piecewise Languages

The concept of a subsequence is central to the notion of piecewise testability.

Definition 3. The principal shuffle ideal of $v$ is the language of all words for which $v$ is a subsequence. We write $SI(v) = \{w \in \Sigma^* \mid v \sqsubseteq w\}$.

The class of Piecewise Testable languages is the smallest class of languages including $SI(w)$ for all $w \in \Sigma^*$ and closed under Boolean operations [13]. Similarly, the class of Piecewise $k$-Testable ($PT_k$) languages is the smallest class of languages including $SI(w)$ for all $w \in \Sigma^{\le k}$ and closed under Boolean operations.

A well-known characterization of the PT languages is stated in terms of the sets of subsequences within words. If $P_{\le k}(w) \stackrel{\text{def}}{=} \{v : v \sqsubseteq w \text{ and } |v| \le k\}$ then the following characterization (sometimes taken as the definition of PT [3]) holds.

Theorem 1. A language $L$ is Piecewise Testable iff there exists $k$ such that, for all words $w_1, w_2 \in \Sigma^*$, if $P_{\le k}(w_1) = P_{\le k}(w_2)$ then $w_1 \in L$ iff $w_2 \in L$.
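The subsequence relation and the sets $P_{\le k}(w)$ used in Theorem 1 are easy to compute for short words. The following Python sketch is illustrative only; the function names are assumptions introduced here.

```python
from itertools import combinations


def is_subsequence(v, w):
    """True iff v is a subsequence of w, i.e. w belongs to SI(v)."""
    remaining = iter(w)
    # Each symbol of v must be found in what is left of w, in order.
    return all(symbol in remaining for symbol in v)


def subsequences_up_to(w, k):
    """The set P_{<=k}(w) of all subsequences of w of length at most k."""
    result = set()
    for length in range(min(k, len(w)) + 1):
        for positions in combinations(range(len(w)), length):
            result.add("".join(w[i] for i in positions))
    return result
```

For instance, `subsequences_up_to("ab", 1)` and `subsequences_up_to("ba", 1)` coincide, so no Piecewise 1-Testable language can distinguish the words ab and ba.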

When $k$ is known, $L$ is said to be Piecewise $k$-Testable ($L \in PT_k$).

Simon proved one of the first examples of what later became known as Eilenberg's correspondence theorem [11]. One of the relations that Green [4] defines on semigroups is the $\mathcal{J}$ relation, which relates two elements of a semigroup $S$ if they generate the same two-sided principal ideal of $S$: $a \mathcal{J} b$ iff $S^1 a S^1 = S^1 b S^1$. A semigroup $S$ is $\mathcal{J}$-trivial iff, for all $a, b \in S$, if $a \mathcal{J} b$ then $a = b$. Simon proved the following algebraic characterization of piecewise testable languages.

Theorem 2 (Simon 1975). A language is Piecewise Testable iff its syntactic monoid is $\mathcal{J}$-trivial.

As an example, consider the language of all words with exactly one $a$, $L = \{w : |w|_a = 1\}$. The canonical acceptor for this language is shown in Figure 1. There are three elements in the monoid $F^1_{A_1} = \{a, 1, 0\}$ (for simplicity of notation, let $x$ stand for $f_x$). The $\mathcal{J}$-triviality is established by calculating $F^1_{A_1} x F^1_{A_1}$ for all $x \in F^1_{A_1}$: $F^1_{A_1} a F^1_{A_1} = \{0, a\}$, $F^1_{A_1} 1 F^1_{A_1} = \{0, a, 1\}$, and $F^1_{A_1} 0 F^1_{A_1} = \{0\}$. These three ideals are pairwise distinct, so $\mathcal{J}$-triviality is satisfied, which means this language $L$ is piecewise testable.

Rogers et al. [12] study a proper subclass of the Piecewise Testable languages, the Strictly Piecewise class. This paper takes as the definition of Strictly Piecewise languages those languages which are closed under subsequence. (Unknown to Rogers et al., languages closed under subsequence were studied forty years earlier by Haines [5]; see also Higman [7].)

Definition 4. A language is Piecewise Testable in the Strict Sense ($L \in SP$) iff, for all $w \in \Sigma^*$, if $w \in L$ and $v \sqsubseteq w$ then $v \in L$.

Rogers et al. [12] establish the following equivalences (see also [5]).

Fig. 1. The canonical automaton and the monoid graph for $L = \{w : |w|_a = 1\}$, which is the language of all words with exactly one $a$

Theorem 3. The following are equivalent:

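1. $L \in SP$.
2. $L = \overline{SI(X)}$ for some $X \subseteq \Sigma^*$, where $SI(X) = \bigcup_{x \in X} SI(x)$.
3. $L = \bigcap_{w \in S} \overline{SI(w)}$, for $S$ finite.
4. There exists $k$ such that if $P_{\le k}(w) \subseteq P_{\le k}(L)$ then $w \in L$.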

It follows from the third characterization above that any SP language can be characterized by a finite set $S$. Elements of this set are the forbidden subsequences, and the language is all words which do not contain any of these forbidden subsequences. The length of the longest word in $S$ is the $k$ of the fourth characterization above, in which case we say $L$ is Strictly $k$-Piecewise ($L \in SP_k$)¹.

By forbidding subsequences, SP languages resemble the Strictly Local languages, which forbid factors [10]. Any SL language $L$ can be defined as the intersection of the complements of sets defined to be those words which contain a forbidden factor. Formally, let the container of a string $w$ be $C(w) = \{u \in \Sigma^* : w \text{ is a factor of } \rtimes u \ltimes\}$; then a language $L \in SL$ iff there exists a finite set $S$ of forbidden factors, drawn from the factors of $\rtimes \Sigma^* \ltimes$, such that $L = \bigcap_{w \in S} \overline{C(w)}$². Figure 2 shows the canonical acceptor and the monoid graph for the SL language $L = \overline{\Sigma^* aa \Sigma^*} = \overline{C(aa)}$, i.e. all words except those containing the factor $aa$.

To illustrate SP languages, consider the language $L = \overline{SI(bb)} \cap \overline{SI(ca)}$, which is the language of all words except those containing either of the subsequences $bb$ or $ca$; i.e., $bb$ and $ca$ are the forbidden subsequences. Thus this SP language can be characterized by the set $\{bb, ca\}$ of forbidden subsequences (or equivalently by the set $\Sigma^{\le 2} \setminus \{bb, ca\}$ of permissible subsequences [12]). Hence this language belongs to $SP_2$. Figure 3 shows the canonical automaton and the monoid graph for $L$. The 0 element is not shown there, but note that all missing edges go to 0. As with the other piecewise testable languages like the one in Figure 1, it is not difficult to verify that the syntactic monoid of this language is $\mathcal{J}$-trivial.
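The contrast between forbidding subsequences and forbidding factors can be checked by brute force on short words. The sketch below is an illustration under stated assumptions: membership is decided directly from the forbidden sets, the helper names are hypothetical, and closure under subsequence is only tested for words up to a chosen length (deleting one symbol at a time suffices, since every subsequence arises from repeated single deletions).

```python
from itertools import product


def is_subsequence(v, w):
    """True iff v is a subsequence of w."""
    remaining = iter(w)
    return all(symbol in remaining for symbol in v)


def in_sp_language(w, forbidden):
    """Membership in the intersection of the complements of SI(v), v forbidden."""
    return not any(is_subsequence(v, w) for v in forbidden)


def words_up_to(alphabet, n):
    """All words over `alphabet` of length at most n."""
    for length in range(n + 1):
        for letters in product(alphabet, repeat=length):
            yield "".join(letters)


def closed_under_subsequence(member, alphabet, n):
    """Bounded check: deleting one symbol from a member of length <= n
    must again give a member."""
    for w in words_up_to(alphabet, n):
        if member(w):
            for i in range(len(w)):
                if not member(w[:i] + w[i + 1:]):
                    return False
    return True


FORBIDDEN = {"bb", "ca"}


def sp_member(w):
    return in_sp_language(w, FORBIDDEN)


def sl_member(w):
    # The SL example: words that do not contain the factor aa.
    return "aa" not in w
```

On words of length at most 5, the SP example passes the closure check, while the SL example fails it: aba is accepted, but deleting b yields aa.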

¹ While every SP language is convex [14], it is not the case that all convex languages are SP since, for example, there are nonempty subword-convex languages that do not contain $\lambda$, but the only SP language not containing $\lambda$ is the empty one.

² The symbols $\rtimes$ and $\ltimes$ invoke left and right word boundaries and are necessary because SL languages make distinctions at word edges [10].

Fig. 2. The canonical acceptor and the monoid graph for the language $L = \overline{C(aa)}$, which is all words except those containing the factor $aa$

However, this language, like every other SP language, has two additional properties. Furthermore, no non-SP language has both of these properties.

Fig. 3. The canonical automaton and the monoid graph of the syntactic monoid of $L = \overline{SI(bb)} \cap \overline{SI(ca)}$, i.e. the language where the subsequences $bb$ and $ca$ are forbidden

4 Algebraic Characterization of SP

There are two important concepts that need to be introduced.

Definition 5. Let $L$ be a regular language recognized by an FSA, and consider its characteristic semigroup. Language $L$ is wholly nonzero if and only if $\overline{L} = [0]$.

In other words, a language is wholly nonzero if and only if every word not in the language is in the zero block of the characteristic semigroup. In terms of the transformation semigroup, this means that every word $x$ not in the language is zero; i.e., $f_x = 0$.

Theorem 4. A language $L$ is wholly nonzero if and only if $L$ is closed under prefix and closed under suffix.

Proof. Clearly, $[0] \subseteq \overline{L}$. Now suppose $L$ is closed under prefix and suffix, and consider any $x \notin L$. For contradiction, suppose $f_x \ne 0$. Then in the canonical acceptor $A$ for $L$ there are states $q, q'$ in $A$ such that $x$ transforms $q$ to $q'$. Since $A$ is canonical, there exist strings $w, y$ such that $w$ transforms $q_0$ to $q$ and $y$ transforms $q'$ to a final state. Thus $wxy \in L$. Since $L$ is closed under prefix, $wx$ belongs to $L$, and since $L$ is closed under suffix, $x$ belongs to $L$, which contradicts the assumption. Therefore $f_x = 0$, which completes one direction of the proof.

Now suppose $\overline{L} = [0]$ and consider any $w \in L$ and any prefix (suffix) $v$ of $w$, which means there exists $x$ such that $w = vx$ ($w = xv$). If $v \notin L$ then by assumption $f_v = 0$. It follows that $f_w = f_{vx} = 0 f_x = 0$ ($f_w = f_{xv} = f_x 0 = 0$), which contradicts that $w \in L$.

Observe that $L = \Sigma^*$ and the empty language are wholly nonzero vacuously. The following two corollaries are almost immediate.

Corollary 1. The Strictly Piecewise languages are wholly nonzero.

Proof. The Strictly Piecewise languages are closed under subsequence by definition and are therefore closed under prefix and suffix.

Corollary 2. The Strictly Local languages are wholly nonzero.

Proof. Consider any Strictly Local language $L$ and any $w \in L$. Since $w \in L$, there are no forbidden factors in $w$ and therefore there are none in any prefix or suffix of $w$. Hence every prefix and suffix of $w$ belongs to $L$ as well.

That both the Strictly Local and Strictly Piecewise languages are wholly nonzero is a nontrivial property they have in common. To illustrate, recall the SL language $L = \overline{C(aa)}$ (Figure 2). Every string not in this SL language transforms any state in its monoid graph to 0. These are all the strings with the 2-factor $aa$. Similarly, consider again the language $L = \overline{SI(bb)} \cap \overline{SI(ca)}$ (Figure 3). Every string not in this SP language transforms any state in its monoid graph to 0. These are exactly those strings with either subsequence $bb$ or $ca$.
The second property is an algebraic characterization of what Rogers et al. describe in automata-theoretic terms as "missing edges propagate down." This means that if some state $q$ in the canonical acceptor does not have a transition labeled with symbol $\sigma$, then no state reachable from $q$ has an outgoing transition labeled with $\sigma$. To capture this, we need the following concept relating to zeroes.

Definition 6. Let $M$ be a monoid. The set of right annihilators of an element $x \in M$ is $RA(x) = \{a \in M : xa = 0\}$.

In other words, the elements of $RA(x)$ annihilate $x$ from the right. The set of left annihilators can be defined similarly, but it does not play a role here. We now define the following property which captures the notion of "missing edges propagating down."

Definition 7. A language $L$ is right annihilating iff for any element $f_x$ in the syntactic monoid $F_{A(L)}$, and for all $f_w$ in the principal right ideal generated by $f_x$, it is the case that $RA_{F_{A(L)}}(f_x) \subseteq RA_{F_{A(L)}}(f_w)$.

The main theorem of this paper can now be stated and proved.

Theorem 5. A language L is SP iff L is wholly nonzero and right annihilating.

Proof. By Corollary 1, any SP language is wholly nonzero. Next consider any $L \in SP$, any element $f_x$, and any $f_t \in RA_{F_{A(L)}}(f_x)$. It follows that $f_x f_t = 0$; hence, $f_{xt} = 0$. Since $L \in SP$, there must be some $v \sqsubseteq xt$ such that $v$ is forbidden; i.e. $SI(v) \subseteq \overline{L}$. For any $f_w$ in the principal right ideal of $f_x$, it is the case that there exists $f_a$ such that $f_w = f_x f_a$. Thus $f_w f_t = f_x f_a f_t = f_{xat}$. Since $v \sqsubseteq xt$ it follows that $v \sqsubseteq xat$ and therefore $f_w f_t = f_{xat} = 0$, and so $f_t \in RA_{F_{A(L)}}(f_w)$. The generality of $f_w$ and $f_t$ ensures that $\forall w \in P_R(x)$, $RA_{F_{A(L)}}(f_x) \subseteq RA_{F_{A(L)}}(f_w)$.

Now for the other direction. The empty language vacuously satisfies the above conditions and belongs to SP, so consider any nonempty regular language $L$ which is wholly nonzero and right annihilating. We show that $L$ belongs to SP. By contradiction, suppose $L$ is wholly nonzero and right annihilating, but not in SP. By definition of SP, $L$ is not closed under subsequence. So there is some $w$ and $v$ such that $w \in L$ and $v \sqsubseteq w$ but $v \notin L$. Since $v \sqsubseteq w$, there exist $u_0, u_1, \ldots, u_n$ such that for $v = \sigma_1 \sigma_2 \cdots \sigma_n$, $w = u_0 \sigma_1 u_1 \sigma_2 u_2 \cdots \sigma_n u_n$. Since $v \notin L$ and since $L$ is wholly nonzero, $v \in [0]$. It will be useful to refer to the suffixes of $v$ as follows: $v_i = \sigma_i \cdots \sigma_n$ for $1 \le i \le n$. For example, $v = v_1 = \sigma_1 \cdots \sigma_n$, $v_2 = \sigma_2 \cdots \sigma_n$, and $v_n = \sigma_n$.

Now $v_2$ is a right annihilator of $u_0 \sigma_1$ since $u_0 \sigma_1 v_2 = u_0 v = u_0 0 = 0$. Also, since $L$ is right annihilating, $RA(u_0 \sigma_1)$ is a subset of $RA(u_0 \sigma_1 u_1)$, and so $v_2$ right annihilates $u_0 \sigma_1 u_1$ as well. Next consider that $v_3$ is a right annihilator of $u_0 \sigma_1 u_1 \sigma_2$, since $u_0 \sigma_1 u_1 \sigma_2 v_3 = u_0 \sigma_1 u_1 v_2$ and above we showed that $v_2$ right annihilates $u_0 \sigma_1 u_1$. Again, since $L$ is right annihilating, $RA(u_0 \sigma_1 u_1 \sigma_2)$ is a subset of $RA(u_0 \sigma_1 u_1 \sigma_2 u_2)$, and so $v_3$ right annihilates $u_0 \sigma_1 u_1 \sigma_2 u_2$ as well. Carrying this argument through to its conclusion, we see that $v_n = \sigma_n$ is a right annihilator of $u_0 \sigma_1 u_1 \sigma_2 u_2 \cdots u_{n-2} \sigma_{n-1}$. Therefore $\sigma_n$ is a right annihilator of $u_0 \sigma_1 u_1 \sigma_2 u_2 \cdots u_{n-2} \sigma_{n-1} u_{n-1}$ as well. Hence $u_0 \sigma_1 u_1 \sigma_2 u_2 \cdots u_{n-2} \sigma_{n-1} u_{n-1} \sigma_n = 0$.

But this means that $w = u_0 \sigma_1 u_1 \sigma_2 u_2 \cdots u_{n-2} \sigma_{n-1} u_{n-1} \sigma_n u_n = 0 u_n = 0$. Since $[0] \subseteq \overline{L}$, it follows that $w \notin L$, which contradicts the reductio assumption. Therefore there are no $v, w$ such that $w \in L$, $v \sqsubseteq w$, and $v \notin L$. It follows that regular languages that are wholly nonzero and right annihilating are closed under subsequence and are therefore SP.
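Definition 7 can be explored computationally by listing the triples that witness a failure of the right annihilating condition. The sketch below is a hedged illustration: the tuple encoding (index = state, entry = target state, 0 = sink), the helper names, and the two-letter version of the no-$aa$ language are assumptions introduced here; the generator maps for the SP example follow the state numbering of the transformations listed in Section 5.1.

```python
from collections import deque


def compose(f, g):
    """Transformation of the word xy, given f = f_x and g = f_y."""
    return tuple(g[s] for s in f)


def generate_monoid(gens, n_states):
    """Map each transformation to a shortest word inducing it (BFS closure)."""
    identity = tuple(range(n_states))
    rep = {identity: ""}
    queue = deque([identity])
    while queue:
        f = queue.popleft()
        for sym in sorted(gens):
            g = compose(f, gens[sym])
            if g not in rep:
                rep[g] = rep[f] + sym
                queue.append(g)
    return rep


def right_annihilators(x, monoid, zero):
    """RA(x): the elements a of the monoid with xa = 0."""
    return {a for a in monoid if compose(x, a) == zero}


def violations(monoid, zero):
    """Triples (x, w, a) of representative words with w in the principal
    right ideal of x, a in RA(x), and a not in RA(w)."""
    found = []
    for x in monoid:
        ra_x = right_annihilators(x, monoid, zero)
        for s in monoid:
            w = compose(x, s)
            for a in ra_x - right_annihilators(w, monoid, zero):
                found.append((monoid[x], monoid[w], monoid[a]))
    return found


NO_AA = {"a": (0, 2, 0), "b": (0, 1, 1)}
SP_EXAMPLE = {"a": (0, 0, 0, 3, 4), "b": (0, 2, 0, 0, 3), "c": (0, 1, 2, 2, 1)}

NO_AA_VIOLATIONS = violations(generate_monoid(NO_AA, 3), (0, 0, 0))
SP_VIOLATIONS = violations(generate_monoid(SP_EXAMPLE, 5), (0,) * 5)
```

For the no-$aa$ language the list contains the triple (a, ab, a): $a$ annihilates $a$ from the right, but not $ab$. For the SP example the list is empty, as Theorem 5 requires.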

We illustrate this property in the context of the decision procedure we present below for deciding whether a regular language is SP.

5 Algorithms for SP Languages

Theorem 5 provides a polynomial-time procedure for deciding whether any regular language $L$ is SP and, if it is, for finding the finite set of the shortest forbidden subsequences necessary to define $L$.

5.1 Deciding SP

The input to the algorithms below is taken to be the monoid graph of the syntactic monoid for a regular language $L$, with the initial state being the node labeled "$\lambda$" and the final states being marked. Since this graph is deterministic, it is possible to obtain the canonical acceptor in time $O(n \log n)$ [9]. Given a minimal DFA $A$, the syntactic monoid $F_A$ can be obtained through the set of generators $\{f_\sigma : \sigma \in \Sigma\}$. The reader is referred to [1] for the construction method of the syntactic monoid $F_A$.

Theorem 5 provides the basis for the decision procedure, which we call DSP. DSP simply checks whether the syntactic monoid satisfies the wholly nonzero and the right annihilation conditions.

The wholly nonzero condition can be checked in two steps, essentially by checking closure under prefixes and suffixes. To check closure under prefixes, one simply need check whether every state in the canonical acceptor $A$ is final. If they are not, then the syntactic monoid of $A$ is not wholly nonzero. To check closure under suffixes, both the complete canonical acceptor and the transformation semigroup $F_A$ are examined. Let 0 be the nonfinal sink state in the complete automaton. If there exists one nonzero element $f_x$ in $F_A$ and one noninitial state $q$ in the canonical acceptor such that $T(q, x) \ne 0$ but $T(q_0, x) = 0$, then the wholly nonzero condition is violated. If no such $f_x$ or $q$ exist, however, we can conclude the language is wholly nonzero.

Whether the right annihilating condition is satisfied can be determined from the Cayley table for $F_A$. The columns and rows of a Cayley table are labeled with the elements in the syntactic monoid $F_A$, and each cell is the product $x \cdot y$ of the element $x$ labeling its row and the element $y$ labeling its column [2]. Then for each $f_x \in F_A$, the principal right ideal generated by $x$, $P_R(x)$, can be found as the union of all distinct elements in the $x$th row of the table, and the right annihilators of $x$, $RA(x)$, are given by those elements $y$ such that the cell in the $x$th row and $y$th column is 0.
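Then for each $z \in P_R(x)$, it is sufficient to check whether $RA(x) \subseteq RA(z)$. If for any $x \in F_A$ and any $z \in P_R(x)$ it is the case that $RA(x) \not\subseteq RA(z)$, then the algorithm exits and returns "false". Otherwise it returns "true".

We illustrate these procedures with three examples. Consider first the SP language $L = \overline{SI(bb)} \cap \overline{SI(ca)}$ in Figure 3. The elements of its transformation semigroup $F_{A(L)} = \{f_x : x \in \Sigma^*\}$ are:

$$f_a = \begin{pmatrix} 1 & 2 & 3 & 4 \\ 0 & 0 & 3 & 4 \end{pmatrix}, \quad f_b = \begin{pmatrix} 1 & 2 & 3 & 4 \\ 2 & 0 & 0 & 3 \end{pmatrix}, \quad f_c = \begin{pmatrix} 1 & 2 & 3 & 4 \\ 1 & 2 & 2 & 1 \end{pmatrix},$$

$$f_{ab} = \begin{pmatrix} 1 & 2 & 3 & 4 \\ 0 & 0 & 0 & 3 \end{pmatrix}, \quad f_{bc} = \begin{pmatrix} 1 & 2 & 3 & 4 \\ 2 & 0 & 0 & 2 \end{pmatrix}, \quad f_{ac} = \begin{pmatrix} 1 & 2 & 3 & 4 \\ 0 & 0 & 2 & 1 \end{pmatrix},$$

$$f_{abc} = \begin{pmatrix} 1 & 2 & 3 & 4 \\ 0 & 0 & 0 & 2 \end{pmatrix}, \quad 0 = \begin{pmatrix} 1 & 2 & 3 & 4 \\ 0 & 0 & 0 & 0 \end{pmatrix}.$$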

Since $F_A$ is isomorphic to the characteristic semigroup $\bar{I}(A)$, it follows that $\bar{I}(A) = \{[0], [a], [b], [c], [ab], [bc], [ac], [abc]\}$. The transition semigroup $U_A$ is the set of the adjacency matrices given by each string $x$ with $f_x \in F_A$.

Table 1. Cayley table for syntactic monoid for L = SI(bb) ∩ SI(ca)

      | λ    a    b    c    ab   bc   ac   abc
 -----+---------------------------------------
 λ    | λ    a    b    c    ab   bc   ac   abc
 a    | a    a    ab   ac   ab   abc  ac   abc
 b    | b    ab   0    bc   0    0    abc  0
 c    | c    0    bc   c    0    bc   0    0
 ab   | ab   ab   0    abc  0    0    abc  0
 bc   | bc   0    0    bc   0    0    0    0
 ac   | ac   0    abc  ac   0    abc  0    0
 abc  | abc  0    0    abc  0    0    0    0

The monoid graph for this language is in Figure 3. Recall that although the 0 element is not shown, it is understood that all missing edges go to 0. The Cayley table is given in Table 1. With a little abuse of notation, in the following, $x$ is used to denote the element $f_x$ in the syntactic monoid $F_A$.

The wholly nonzero condition can be checked by examining the syntactic monoid. In this canonical acceptor all states are final, and there are no $f_x \in F_A$ and $q \in Q$ such that $T(q, x) \ne 0$ but $T(q_0, x) = 0$.

The next step is to determine whether the right annihilation condition is satisfied with the help of the Cayley table. For example, in the Cayley table, the $ab$-row contains all the elements that are in the right ideal generated by $ab$, $abF_A^1 = \{ab, abc, 0\}$. The elements in those columns corresponding to 0s form the set $RA(ab) = \{b, ab, bc, abc\}$. The right annihilating condition requires that $\forall w \in xF_A$, $RA(x) \subseteq RA(w)$. From the table it is easy to verify that $RA(abc) = \{a, b, ab, bc, ac, abc\}$, which is a superset of $RA(ab)$. Since $RA(0) = F_A$, it likewise follows that $RA(ab) \subseteq RA(0)$. The right annihilation condition for the other elements can be verified in the same manner, and it can be shown that this syntactic monoid is right annihilating.

Now consider the language $L = \{w : |w|_a = 1\}$ (Figure 1). $L$ is not SP because it does not satisfy the wholly nonzero condition. The element $b$ is not in the language but it is not zero in its syntactic semigroup.

For the language $L = \overline{C(aa)}$ in Figure 2, though it satisfies the wholly nonzero condition, the right annihilating condition is violated. Observe that $aa = 0$ and $ab \in P_R(a)$. If $L$ were right annihilating then $RA(a) \subseteq RA(ab)$. However, $aba = a \ne 0$, so $a \in RA(a)$ but $a \notin RA(ab)$, and thus the right annihilating condition is not met. Therefore, $L = \overline{C(aa)}$ is not SP.

What is the time complexity for DSP? Letting $n$ be the size of the syntactic monoid, the wholly nonzero condition can be checked in time $O(n)$ and the right annihilating condition in time $O(n^2)$. Thus DSP runs in $O(n^2)$. Holzer and König study the size of the syntactic monoid as a natural measure of descriptive complexity for regular languages [8].
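The two checks performed by DSP can be sketched in Python and run on the three examples above. This is a minimal sketch under stated assumptions, not the authors' implementation: it starts from the complete canonical acceptor rather than from the monoid graph, it encodes each generator as a tuple (index = state, entry = target, state 0 = the unique nonfinal sink), it assumes every state is reachable, and it makes no attempt to meet the $O(n^2)$ bound. The initial state of the SP example is taken to be state 4, the only state from which $a$, $b$ and $c$ are all defined.

```python
from collections import deque


def compose(f, g):
    """Transformation of the word xy, given f = f_x and g = f_y."""
    return tuple(g[s] for s in f)


def generate_monoid(gens, n_states):
    """All transformations induced by words over the generators."""
    identity = tuple(range(n_states))
    seen, queue = {identity}, deque([identity])
    while queue:
        f = queue.popleft()
        for g in gens.values():
            h = compose(f, g)
            if h not in seen:
                seen.add(h)
                queue.append(h)
    return seen


def wholly_nonzero(gens, n_states, initial, finals):
    """Check closure under prefix and suffix (state 0 is the sink)."""
    # Prefix closure: every non-sink state must be final.
    if set(finals) != set(range(1, n_states)):
        return False
    # Suffix closure: a word sending the initial state to the sink
    # must send every state to the sink.
    return all(f[initial] != 0 or not any(f)
               for f in generate_monoid(gens, n_states))


def right_annihilating(gens, n_states):
    """Check RA(x) <= RA(w) for every w in the principal right ideal of x."""
    elements = generate_monoid(gens, n_states)
    zero = (0,) * n_states
    ra = {x: {a for a in elements if compose(x, a) == zero} for x in elements}
    return all(ra[x] <= ra[compose(x, a)] for x in elements for a in elements)


def dsp(gens, n_states, initial, finals):
    return (wholly_nonzero(gens, n_states, initial, finals)
            and right_annihilating(gens, n_states))


# Assumed encodings of the three example automata.
SP_EXAMPLE = {"a": (0, 0, 0, 3, 4), "b": (0, 2, 0, 0, 3), "c": (0, 1, 2, 2, 1)}
ONE_A = {"a": (0, 2, 0), "b": (0, 1, 2), "c": (0, 1, 2)}
NO_AA = {"a": (0, 2, 0), "b": (0, 1, 1), "c": (0, 1, 1)}
```

Under these encodings, `dsp` accepts the SP example, rejects the exactly-one-$a$ language at the prefix check, and rejects the no-$aa$ language at the right annihilating check.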

5.2 Finding the Shortest Forbidden Subsequences

The following procedure Find-ssq takes the syntactic monoid of an SP language as input and finds the finite set of shortest forbidden subsequences which describe the SP language. In order to link the syntactic monoid and the length of forbidden subsequences, the monoid graph is employed to find the set of the shortest paths from the $\lambda$ node to 0 that covers the graph.

$$\mathcal{P}(F_A) \stackrel{\text{def}}{=} \{x\sigma : f_x \in F_A,\ \sigma \in \Sigma,\ x\sigma = 0, \text{ and } \forall f_y = f_x,\ |x| \le |y|\}$$

Find-ssq begins with the syntactic monoid for some L ∈ SP and k =1.

1. Letting $P_k(S) \stackrel{\text{def}}{=} \{P_k(p) : p \in S\}$, calculate $P_k(\mathcal{P}(F_A))$, i.e. the set of sets of $k$-subsequences for each path in $\mathcal{P}(F_A)$.
2. Find all singleton sets in $P_k(\mathcal{P}(F_A))$ and construct the set $FS_k$, which is the set of hypothesized forbidden subsequences of length $k$. This set is formed by taking the union of the singleton sets in $P_k(\mathcal{P}(F_A))$. If there is no singleton set found, increase $k$ by one and return to step 1.
3. Verify whether each set $P \in P_k(\mathcal{P}(F_A))$ has a nonempty intersection with $FS_k$. If so, then $FS_k$ is a set of forbidden subsequences which can define $L$, and $L \in SP_k$. Otherwise, increase $k$ by one and return to step 1.

Theorem 6. Find-ssq terminates at the shortest $k$ for $L \in SP$.

Proof. Suppose this $k$ is not the shortest one for the SP language $L$, and there exists $k' > k$ such that $L \in SP_{k'}$. This means that there exists at least one path $p' \in \mathcal{P}(F_A)$, with $|p'| > k$, such that $P_k(p') \subseteq P_k(L)$ and $P_{k'}(p') \cap P_{k'}(L) = \emptyset$, for some $k' > k$. The fact that $P_k(p') \subseteq P_k(L)$ implies that $\forall v \in P_k(p')$, $v \in L$, which is guaranteed by the syntactic monoid of $L$ being wholly nonzero. However, termination of the algorithm at $k$ ensures that there exists at least one element $h \in P_k(p')$ with $h \in FS_k$. Since $FS_k$ is the set of all paths of length $k$ that lead to 0, $h \notin L$. This contradicts the previous statement that $\forall v \in P_k(p')$, $v \in L$. Therefore, no such $p'$ exists and thus the algorithm terminates at the shortest $k$ for the strictly piecewise language $L$.

We illustrate this algorithm with the automaton in Figure 3, assuming it has already been verified with DSP that it describes an SP language. We refer to the monoid in Figure 3 with $F_A$. The set of the shortest paths from the $\lambda$ node to 0 that covers the graph is $\mathcal{P}(F_A) = \{bb, ca, bab, bcb, bca, abb, aca, cba, cbb, bacb, baca, abcb, abca, acbb, acba\}$.

1. For $k = 1$, the only singleton set in $P_1(\mathcal{P}(F_A))$ is $\{b\}$ (from the path $bb$), so $FS_1 = \{b\}$. Since $P_1(ca) = \{c, a\}$ has an empty intersection with $FS_1$, increase $k$ by 1.
2. For $k = 2$, $P_2(\mathcal{P}(F_A)) = \{\{bb\}, \{ca\}, \{ba, ab, bb\}, \{bc, cb, bb\}, \{bc, ba, ca\}, \{ab, bb\}, \{ac, aa, ca\}, \{cb, ca, ba\}, \{cb, bb\}, \{ba, bc, ac, bb, ab, cb\}, \{ba, bc, ac, aa, ca\}, \{ab, bc, ac, cb, bb\}, \{ab, ac, aa, bc, ca, ba\}, \{ab, ac, bb, cb\}, \{ac, ab, aa, cb, ca, ba\}\}$. The singleton sets are $\{bb\}$ and $\{ca\}$, and thus $FS_2 = \{bb, ca\}$. It is easy to verify that every $P \in P_2(\mathcal{P}(F_A))$ has a nonempty intersection with $FS_2$. The algorithm terminates and outputs $\{bb, ca\}$, which are the forbidden subsequences which describe this language.

In sum, this procedure tells us that this language is SP for $k = 2$. Together DSP and Find-ssq provide a means to check whether a regular language is SP and, if it is, to find the finite set of the shortest forbidden subsequences.
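The worked example can be replayed with a short Python sketch of Find-ssq. It rests on several assumptions that are not spelled out in the text: the generator tuples encode the automaton of Figure 3 with state 0 as the sink, $x$ in the definition of $\mathcal{P}(F_A)$ ranges over every shortest word of each nonzero element, $P_k(p)$ is read as the set of subsequences of $p$ of length exactly $k$, and a bound `max_k` is added so that the loop always stops.

```python
from itertools import combinations


def compose(f, g):
    """Transformation of the word xy, given f = f_x and g = f_y."""
    return tuple(g[s] for s in f)


def shortest_words(gens, n_states):
    """Map each element of the monoid to the set of all its shortest words."""
    identity = tuple(range(n_states))
    words = {identity: {""}}
    frontier = {identity}
    while frontier:
        new = {}
        for f in frontier:
            for sym, g in gens.items():
                h = compose(f, g)
                if h not in words:  # first reached at this length
                    new.setdefault(h, set()).update(w + sym for w in words[f])
        words.update(new)
        frontier = set(new)
    return words


def zero_paths(gens, n_states):
    """Words x + sigma with x a shortest word of a nonzero element and x sigma = 0."""
    zero = (0,) * n_states
    words = shortest_words(gens, n_states)
    return {w + sym
            for f, ws in words.items() if f != zero
            for sym, g in gens.items() if compose(f, g) == zero
            for w in ws}


def k_subsequences(w, k):
    """Subsequences of w of length exactly k."""
    return {"".join(w[i] for i in idx) for idx in combinations(range(len(w)), k)}


def find_ssq(paths, max_k):
    """Return (k, FS_k) for the first k at which steps 2 and 3 succeed."""
    for k in range(1, max_k + 1):
        families = [k_subsequences(p, k) for p in paths]
        fs_k = set().union(*(s for s in families if len(s) == 1))
        if fs_k and all(s & fs_k for s in families):
            return k, fs_k
    return None


SP_EXAMPLE = {"a": (0, 0, 0, 3, 4), "b": (0, 2, 0, 0, 3), "c": (0, 1, 2, 2, 1)}
PATHS = zero_paths(SP_EXAMPLE, 5)
RESULT = find_ssq(PATHS, 4)
```

Under these assumptions `PATHS` reproduces the fifteen paths listed above, and `RESULT` pairs $k = 2$ with the forbidden subsequences $bb$ and $ca$.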

6 Conclusion

Strictly Piecewise languages are wholly nonzero and right annihilating. The wholly nonzero property is shared by the Strictly Local languages and provides a definition for the "Strict" aspect, independent of the relation to the Testable classes. Also, the algebraic characterization for SP provides a polynomial-time decision procedure for a regular language in the size of its syntactic monoid. This paper also leaves open some interesting questions. In particular, we would like to know whether every wholly nonzero, $\mathcal{J}$-trivial language is right annihilating.

References

1. Anderson, J.A.: Automata Theory with Modern Applications. Cambridge University Press (2006)
2. Clifford, A.: The Algebraic Theory of Semigroups. American Mathematical Society, Providence (1967)
3. García, P., Ruiz, J.: Learning k-testable and k-piecewise testable languages from positive data. Grammars 7, 125–140 (2004)
4. Green, J.A.: On the structure of semigroups. The Annals of Mathematics 54(1), 163–172 (1951)
5. Haines, L.H.: On free monoids partially ordered by embedding. Journal of Combinatorial Theory 6, 94–98 (1969)
6. Heinz, J.: Learning long-distance phonotactics. Linguistic Inquiry 41(4), 623–661 (2010)
7. Higman, G.: Ordering by divisibility in abstract algebras. Proceedings of the London Mathematical Society 3(2), 326–336 (1952)
8. Holzer, M., König, B.: Regular languages, sizes of syntactic monoids, graph colouring, state complexity results, and how these topics are related to each other. EATCS Bulletin 83, 139–155 (June 2004)
9. Hopcroft, J.E.: An n log n algorithm for minimizing states in a finite automaton. Tech. rep., Stanford, CA, USA (1971)
10. McNaughton, R., Papert, S.: Counter-Free Automata. MIT Press (1971)
11. Pin, J.-É.: Syntactic semigroups. In: Rozenberg, G., Salomaa, A. (eds.) Handbook of Formal Languages, vol. 1. Springer Verlag (1997)
12. Rogers, J., Heinz, J., Bailey, G., Edlefsen, M., Visscher, M., Wellcome, D., Wibel, S.: On languages piecewise testable in the strict sense. In: Ebert, C., Jäger, G., Michaelis, J. (eds.) The Mathematics of Language. Lecture Notes in Artificial Intelligence, vol. 6149, pp. 255–265. Springer (2010)
13. Simon, I.: Piecewise testable events. In: Automata Theory and Formal Languages, pp. 214–222 (1975)
14. Thierrin, G.: Convex languages. In: ICALP'72, pp. 481–492 (1972)
15. Watanabe, T., Nakamura, A.: On the transformation semigroups of finite automata. Journal of Computer and System Sciences 26(1), 107–138 (1983)