An Algebraic Characterization of Strictly Piecewise Languages

Jie Fu, Jeffrey Heinz, and Herbert G. Tanner
University of Delaware
{jiefu,heinz,btanner}@udel.edu

Abstract. This paper provides an algebraic characterization of the Strictly Piecewise class of languages studied by Rogers et al. 2010. These languages are a natural subclass of the Piecewise Testable languages (Simon 1975) and are relevant to natural language. The algebraic characterization highlights a similarity between the Strictly Piecewise and Strictly Local languages, and also leads to a procedure which can decide whether a regular language $L$ is Strictly Piecewise in polynomial time in the size of the syntactic monoid for $L$.

1 Introduction

Rogers et al. [12] study the Strictly Piecewise (SP) languages, which are a proper subclass of the Piecewise Testable (PT) languages of Simon [13]. The Strictly Piecewise languages are interesting for two reasons. First, there are several senses in which the SP class is natural. For example, SP is exactly the class of those languages closed under subsequence [12]. Also, they bear the same relation to Piecewise Testable languages that the Strictly Local (SL) languages bear to Locally Testable (LT) languages [10, 12]. Second, this class expresses some of the kinds of long-distance dependencies found in natural language [6, 12].

While Rogers et al. provide several characterizations of SP languages, they do not provide an algebraic one. Also, the procedure they give for deciding whether a regular language $L$ belongs to SP is exponential in the size of the smallest deterministic acceptor for $L$. This paper aims to address these issues. An algebraic characterization for the SP class is provided. This result not only reveals an important similarity between the SP and SL languages, but also leads to a procedure which decides whether $L$ belongs to SP in time quadratic in the size of the syntactic monoid for $L$.
However, it remains an open question whether a decision procedure exists that is polynomial in the size of the smallest deterministic acceptor.

The rest of this paper is organized as follows. Section 2 reviews foundational concepts and notation. Section 3 defines the Piecewise Testable (PT), Strictly Piecewise (SP), and Strictly Local (SL) classes. Section 4 presents our algebraic characterization of the SP class and Section 5 describes the polynomial-time decision procedure. Finally, Section 6 concludes.

This research is supported by grant #1035577 from the National Science Foundation.

M. Ogihara and J. Tarui (Eds.): TAMC 2011, LNCS 6648, pp. 252–263, 2011. © Springer-Verlag Berlin Heidelberg 2011

2 Preliminaries

A semigroup is a set with an associative operation. A monoid is a semigroup with an identity element (written 1). If $S$ is a semigroup, $S^1$ denotes the monoid equal to $S$ if $1 \in S$ and to $S \cup \{1\}$ otherwise. A zero is an element 0 such that, for every $s \in S$, $s0 = 0s = 0$. The free semigroup (monoid) of a set $S$ is the set of all finite sequences of one (zero) or more elements from $S$.

If $x$ is an element of set $S$ and $\pi$ a partition of $S$, the block of $\pi$ containing $x$ is $[x]_\pi$. The partition of $S$ induced by an equivalence relation $\rho$ is $S/\rho$. A right (left) congruence is a partition such that if $[x]_\pi = [y]_\pi$ then $[xz]_\pi = [yz]_\pi$ ($[zx]_\pi = [zy]_\pi$). A congruence is both a left and a right congruence.

Following Clifford [2], a left (right) ideal of a semigroup $S$ is a non-empty subset $T$ of $S$ such that $ST \subseteq T$ ($TS \subseteq T$). The left (right) ideal of $S$ generated by $T$ is $T \cup ST = S^1 T$ ($T \cup TS = T S^1$). The principal left (right) ideal of $S$ generated by $t \in T$ is $P_L(t) = S^1 t$ ($P_R(t) = t S^1$).

Let $\Sigma$ denote a finite set, called the alphabet. Sets $\Sigma^+$ and $\Sigma^*$ denote the free semigroup and free monoid of $\Sigma$, respectively. We refer to the elements of $\Sigma^+$ and $\Sigma^*$ as strings and words interchangeably.
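The ideal definitions above can be checked mechanically on a small finite semigroup. The following is a minimal sketch, not taken from the paper: the function names and the two-element left-zero semigroup (where $s \cdot t = s$) are assumptions chosen only because they make the left and right notions differ.

```python
from typing import Callable, Set

Mul = Callable[[str, str], str]  # the semigroup operation, given as a function

def principal_left_ideal(S: Set[str], mul: Mul, t: str) -> Set[str]:
    """P_L(t) = S^1 t = {t} ∪ {s·t : s ∈ S}."""
    return {t} | {mul(s, t) for s in S}

def principal_right_ideal(S: Set[str], mul: Mul, t: str) -> Set[str]:
    """P_R(t) = t S^1 = {t} ∪ {t·s : s ∈ S}."""
    return {t} | {mul(t, s) for s in S}

def is_left_ideal(S: Set[str], mul: Mul, T: Set[str]) -> bool:
    """T is a left ideal iff T is non-empty and ST ⊆ T."""
    return bool(T) and all(mul(s, t) in T for s in S for t in T)

def is_right_ideal(S: Set[str], mul: Mul, T: Set[str]) -> bool:
    """T is a right ideal iff T is non-empty and TS ⊆ T."""
    return bool(T) and all(mul(t, s) in T for s in S for t in T)

# Hypothetical example: the left-zero semigroup on {"x", "y"}, where s·t = s.
S = {"x", "y"}
mul = lambda s, t: s

left = principal_left_ideal(S, mul, "x")    # contains y·x = y as well as x
right = principal_right_ideal(S, mul, "x")  # x·s = x for every s
```

In this example the principal left ideal of x is the whole semigroup while the principal right ideal is just {x}, so {x} is a right ideal but not a left ideal.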
The unique string of length zero is denoted $\lambda$. The set $\Sigma^{\le k}$ denotes the set of all words of length at most $k$. The length of a string $u$ is denoted $|u|$, and $|w|_\sigma$ denotes the number of occurrences of $\sigma$ in $w$. A string $v$ is a factor of $w$ iff there exist strings $x, y \in \Sigma^*$ such that $w = xvy$. A string $v$ is a prefix (suffix) of $w$ iff there exists $x \in \Sigma^*$ such that $w = vx$ ($w = xv$). A string $v$ is a subsequence of string $w$ iff $v = \sigma_1 \cdots \sigma_n$ and $w \in \Sigma^* \sigma_1 \Sigma^* \cdots \Sigma^* \sigma_n \Sigma^*$, and we write $v \sqsubseteq w$. Languages are subsets of $\Sigma^*$. The complement of a language $L$ is $\overline{L} = \{w \in \Sigma^* : w \notin L\}$.

A semiautomaton is a tuple $A = \{Q, \Sigma, T\}$, where $Q$ is a non-empty finite set of states and $\Sigma$ is the alphabet. The transition function is a partial function $T : Q \times \Sigma \to Q$. The domain of the transition function is expanded to $Q \times \Sigma^*$ recursively as follows. For all $q \in Q$, $T(q, \lambda) = q$, and for all $w \in \Sigma^*$ and $\sigma \in \Sigma$, $T(q, w\sigma) = T(T(q, w), \sigma)$. It follows that $T(q, xy) = T(T(q, x), y)$. By definition semiautomata are deterministic.

A finite-state automaton (FSA) is a tuple $A = \{Q, q_0, Q_f, \Sigma, T\}$, where $\{Q, \Sigma, T\}$ is a semiautomaton, $q_0 \in Q$ is the initial state, and $Q_f \subseteq Q$ is a set of final states. The language recognized by $A$ is $\{w \in \Sigma^* : T(q_0, w) \in Q_f\}$. A language $L$ is regular iff there exists an FSA recognizing it. For every regular language $L$ there is a unique (up to isomorphism) automaton with the fewest states recognizing $L$, called the canonical FSA for $L$.

A state $q$ of an automaton is a sink state iff, for all $\sigma \in \Sigma$, if $T(q, \sigma)$ is defined then $T(q, \sigma) = q$. One can always make the transition function total by adding a nonfinal sink state and directing all the missing transitions for each state to this sink. When the sink state is added to a canonical acceptor, it is the only state which is both a sink and nonfinal. The resulting automaton is complete.

For any automaton $A$ and state $q \in Q$, let $\rho_q$ be the relation such that, for all elements $x$ and $y$ of $\Sigma^*$, $x \rho_q y$ iff $T(q, x) = T(q, y)$.
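The definitions in this section translate directly into code. The sketch below is illustrative only: the helper names (`run`, `complete`, `is_subsequence`), the sink label `"0"`, and the two-state example automaton for $b^* a^*$ are assumptions, not constructions from the paper.

```python
from typing import Dict, Optional, Set, Tuple

Delta = Dict[Tuple[str, str], str]  # partial transition function T, keyed by (state, symbol)

def run(delta: Delta, q: Optional[str], w: str) -> Optional[str]:
    """Extended transition T(q, w); returns None where T is undefined."""
    for sigma in w:
        q = delta.get((q, sigma))
        if q is None:
            return None
    return q

def complete(states: Set[str], alphabet: str, delta: Delta, sink: str = "0"):
    """Make T total by adding a nonfinal sink and redirecting missing transitions to it."""
    all_states = set(states) | {sink}
    total = dict(delta)
    for q in all_states:
        for sigma in alphabet:
            total.setdefault((q, sigma), sink)
    return all_states, total

def is_subsequence(v: str, w: str) -> bool:
    """v ⊑ w: the symbols of v occur in w in the same order, not necessarily adjacent."""
    rest = iter(w)
    return all(sigma in rest for sigma in v)

# Hypothetical example: an acceptor for b*a*, i.e. words without the subsequence "ab".
states = {"q0", "q1"}
delta = {("q0", "b"): "q0", ("q0", "a"): "q1", ("q1", "a"): "q1"}
all_states, total = complete(states, "ab", delta)

reached_partial = run(delta, "q0", "ab")   # undefined: no transition on b from q1
reached_total = run(total, "q0", "ab")     # the added sink state
same_class = run(total, "q0", "a") == run(total, "q0", "baa")  # a and baa are ρ_q0-related
```

Two strings are $\rho_q$-related exactly when `run` returns the same state for both from $q$, which is what the last line tests for $q = q_0$.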
More generally, let

$$f_x = \begin{pmatrix} q_1 & \cdots & q_n \\ T(q_1, x) & \cdots & T(q_n, x) \end{pmatrix}.$$

For all $x, y \in \Sigma^*$, let $x \rho y$ iff $f_x = f_y$. The equivalence relation $\rho$ over $\Sigma^*$ induces a congruence over $\Sigma^*$ [15]. The index of $\rho$ is finite because $Q$ is finite.

Let $F_A = \{f_x : x \in \Sigma^*\}$ denote the finite monoid of mappings and $\bar{I}(A) = \Sigma^*/\rho$. Then $F_A$ is isomorphic to $\bar{I}(A)$ under the correspondence of $f_x$ of $F_A$ with $[x]$ of $\bar{I}(A)$, where $[x]$ is the $\rho$-congruence coset of $\Sigma^*$ containing $x$. In this paper, when writing $f_x$ and $[x]$, we take $x$ to be a shortest element of its congruence class; this causes no ambiguity. For an FSA whose associated semiautomaton is $A$, $F_A$ is called the transformation semigroup and $\bar{I}(A)$ is the characteristic semigroup of $A$.

Elements $f_x$ of $F_A$ can also be written in matrix form $\mu_x$, where the rows and columns indicate states in $Q = \{q_1, \ldots, q_n\}$ and $\mu_x[i, j] = 1$ iff $T(q_i, x) = q_j$. The set of matrices is another semigroup, the transition semigroup. The name is derived from the fact that each element in this semigroup is a transition matrix associated to a walk $x$ in $A$. We write $U_A = \{\mu_x : x \in \Sigma^*\}$. Clearly $U_A$ is isomorphic to $\bar{I}(A)$ under the correspondence of $\mu_x$ of $U_A$ with $[x]$ of $\bar{I}(A)$.

Definition 1 (Pin 1997). The syntactic semigroup of a regular language $L$ is the transformation semigroup given by its complete canonical semiautomaton.

In the syntactic semigroup of an automaton $A$, the set of generators of $F_A$ is $\mathrm{Gen}(F_A) = \{f_\sigma : \sigma \in \Sigma\}$. The syntactic monoid of a regular language $L$ is the syntactic semigroup with identity, $\mathrm{Gen}(F_A^1) = \{f_\sigma : \sigma \in \Sigma \cup \{\lambda\}\}$. Pin [11] discusses the equivalence between automata and semigroups.

Note that since the transition semigroup $U_A$ of $A$ is represented as a semigroup of boolean matrices of order $|Q| \times |Q|$, a word $w$ is recognized by $A$ iff $\mu_w[q_0, q_f] = 1$ for some final state $q_f \in Q_f$. It follows that a finite automaton recognizes a regular language $L$ iff its transition semigroup recognizes $L$.
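The monoid $F_A$ can be enumerated by closing the identity mapping under the generators $f_\sigma$. The sketch below assumes a complete (total) transition function and represents $f_x$ as the tuple $(T(q_1, x), \ldots, T(q_n, x))$; the function names and the three-state automaton for $b^* a^*$ with sink `"0"` are hypothetical, and the breadth-first search is one possible way to obtain a shortest representative word for each class.

```python
from collections import deque
from typing import Dict, List, Tuple

Mapping = Tuple[str, ...]  # f_x stored as (T(q_1, x), ..., T(q_n, x))

def transformation_monoid(states: List[str], alphabet: str,
                          delta: Dict[Tuple[str, str], str]) -> Dict[Mapping, str]:
    """Map every f_x in F_A to a shortest word x; delta is assumed to be total."""
    identity = tuple(states)           # f_lambda fixes every state
    reps = {identity: ""}
    queue = deque([identity])
    while queue:                       # breadth-first, so shorter words are found first
        f = queue.popleft()
        for sigma in alphabet:
            g = tuple(delta[(q, sigma)] for q in f)   # f_{x sigma}(q_i) = T(T(q_i, x), sigma)
            if g not in reps:
                reps[g] = reps[f] + sigma
                queue.append(g)
    return reps

def to_matrix(f: Mapping, states: List[str]) -> List[List[int]]:
    """Boolean matrix mu_x with mu_x[i][j] = 1 iff T(q_i, x) = q_j."""
    return [[1 if image == q else 0 for q in states] for image in f]

# Hypothetical example: complete acceptor for b*a* with sink "0"; q0 is initial.
states = ["q0", "q1", "0"]
delta = {("q0", "a"): "q1", ("q0", "b"): "q0",
         ("q1", "a"): "q1", ("q1", "b"): "0",
         ("0", "a"): "0", ("0", "b"): "0"}
finals = {"q0", "q1"}

reps = transformation_monoid(states, "ab", delta)
by_word = {word: f for f, word in reps.items()}
mu_ba = to_matrix(by_word["ba"], states)
accepted = any(mu_ba[0][states.index(qf)] == 1 for qf in finals)  # row 0 is q0
```

For this example the closure yields five mappings, with shortest representatives λ, a, b, ab, and ba, and the matrix test in the last line is the recognition condition stated above applied to the word ba.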
A “monoid graph” is a useful tool employed by contemporary algebraic theorists to visualize monoids. The nodes of the graph are elements in the monoid, though an initial node labeled “λ” is included by convention. The labels on edges are the elements in the set of generators of the monoid. Given a monoid $M$, $x \xrightarrow{s} y$ iff $xs = y$, where $x, y \in M$ and $s \in \mathrm{Gen}(M)$. The monoid graph of $F_A$ is denoted $MG(F_A)$. We mark elements $x$ in the monoid graph as final iff $f_x \in F_A$ and there exists a final state $q$ in the canonical acceptor such that $T(q_0, x) = q$ [11]. Examples of monoid graphs are in Figures 1, 2, and 3.

Definition 2. A unique nonfinal sink state in an automaton $A$ is called zero. An element $f_x$ is a zero element of the transformation semigroup iff

$$f_x = \begin{pmatrix} q_1 & \cdots & q_n \\ 0 & \cdots & 0 \end{pmatrix}.$$

We use the notation $f_x = 0$ for the transformation semigroup, $\mu_x = 0$ for the transition semigroup, and $x = 0$ for the free semigroup $\Sigma^*$. The corresponding zero in the characteristic semigroup $\bar{I}(A)$ is denoted $[0]$.

While every complete canonical automaton (except the one recognizing $\Sigma^*$) has a unique nonfinal sink state, not every transformation semigroup has a zero.
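The monoid graph and the zero element can both be computed from the generators. The following is a minimal sketch under stated assumptions: states are numbered $0, \ldots, n-1$, a mapping is a tuple of state indices, and both example generator sets (the $b^* a^*$ acceptor with sink index 2, and a parity automaton with an added sink) are hypothetical illustrations rather than the paper's figures.

```python
from typing import Dict, List, Optional, Tuple

Mapping = Tuple[int, ...]  # f[i] is the index of the state reached from state i

def compose(f: Mapping, g: Mapping) -> Mapping:
    """The product fg: apply f first, then g."""
    return tuple(g[i] for i in f)

def generate(gens: Dict[str, Mapping], n: int) -> Dict[Mapping, str]:
    """Close the identity under the generators; values are shortest representative words."""
    identity = tuple(range(n))
    reps = {identity: ""}
    frontier = [identity]
    while frontier:
        nxt = []
        for f in frontier:
            for s, g in gens.items():
                h = compose(f, g)
                if h not in reps:
                    reps[h] = reps[f] + s
                    nxt.append(h)
        frontier = nxt
    return reps

def monoid_graph(gens: Dict[str, Mapping], reps: Dict[Mapping, str]) -> List[Tuple[str, str, str]]:
    """Edges (x, s, y) with xs = y; the identity node is labeled 'λ'."""
    name = lambda f: reps[f] or "λ"
    return [(name(f), s, name(compose(f, g))) for f in reps for s, g in gens.items()]

def zero_element(reps: Dict[Mapping, str]) -> Optional[Mapping]:
    """Return z with zf = fz = z for every element f, or None if there is none."""
    for z in reps:
        if all(compose(z, f) == z and compose(f, z) == z for f in reps):
            return z
    return None

# Hypothetical example 1: b*a* acceptor, states 0 and 1 final, state 2 the sink.
gens = {"a": (1, 1, 2), "b": (0, 2, 2)}
reps = generate(gens, 3)
edges = monoid_graph(gens, reps)
zero = zero_element(reps)          # the mapping sending every state to the sink

# Hypothetical example 2: parity of a's with an unreachable sink added as state 2.
parity = generate({"a": (1, 0, 2)}, 3)
no_zero = zero_element(parity)     # no word sends every state to the sink
```

In the first example the zero is the class of the word ab, which maps every state to the sink. In the second, the sink exists in the automaton but no element of the monoid collapses all states onto it, so the search returns `None`.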
