
Multi-Tilde-Bar Derivatives

Pascal Caron, Jean-Marc Champarnaud, and Ludovic Mignot

LITIS, Université de Rouen, 76801 Saint-Étienne du Rouvray Cedex, {pascal.caron,jean-marc.champarnaud,ludovic.mignot}@univ-rouen.fr

Abstract. Multi-tilde-bar operators allow us to extend regular expressions. The associated extended expressions are compatible with the structure of Glushkov automata and they provide a more succinct representation than standard expressions. The aim of this paper is to examine the derivation of multi-tilde-bar expressions. Two types of computation are investigated: Brzozowski derivation and Antimirov derivation, as well as the construction of the associated automata.

1 Introduction

Regular expression word derivatives were introduced by Brzozowski [5] in order to compute language quotients via expression derivatives: for any word w, the language denoted by the derivative of an expression E w.r.t. w is the left quotient of the language denoted by E w.r.t. w. Regular expression derivation plays a fundamental role in the theory of automata. In particular, under the assumption that the set D of all the derivatives of a regular expression E is finite, it is possible to construct an FA (finite automaton) with D as its set of states that recognizes the language denoted by E. Word derivatives handle unrestricted regular expressions; they are themselves expressions and they provide a DFA (deterministic finite automaton), as long as the ACI (associativity, commutativity and idempotence) properties of the sum of two expressions are used.

Alternative types of derivation have been designed since Brzozowski's seminal work. Partial derivatives, due to Antimirov [2], only address simple regular expressions; they are sets of expressions and they provide both a DFA and an NFA (non-deterministic finite automaton). Antimirov derivatives have recently been extended to unrestricted regular expressions [10]; extended partial derivatives are sets of sets of expressions and they provide a DFA, an NFA and an AFA (alternating finite automaton) [11]. Some derivations are based on the linearization of the (simple) input expression: let us cite the continuations of Berry and Sethi [4], the c-continuations of Champarnaud and Ziadi [14] and the derivatives of Ilie and Yu [18]. Let us mention that Antimirov derivation has been extended to the case of weighted rational expressions [21,13].

As reported in [2], the concept of derivation has been successfully used to investigate the properties of regular expressions [17,15,7,20,3,1]. More recently, Brzozowski introduced a new approach for studying the state complexity of regular languages, based on the counting of their quotients (or of their derivatives) [6].

N. Moreira and R. Reis (Eds.): CIAA 2012, LNCS 7381, pp. 321–328, 2012. © Springer-Verlag Berlin Heidelberg 2012

Moreover, derivatives provide a useful tool to implement regular matching algorithms [23,16], or scanner generators as reported in [22].

A close topic is the derivation of new operators that extend regular expressions. For example, the computation of the derivatives of an approximate regular expression (that denotes a language at a bounded distance from a given language) has been presented in [12]. The aim of this paper is to investigate the derivation of the multi-tilde-bar expressions introduced in [8,9]. These expressions are built upon simple operators and multi-tilde-bar operators and their main interest is that they are compatible with the structure of Glushkov automata and more succinct than standard expressions. We provide formulae for the computation of word and partial derivatives of multi-tilde-bar expressions and investigate the properties of these derivatives.

The next section gathers classical notions concerning regular languages, regular expressions and finite automata; it also recalls the definition and main properties of multi-tilde-bar operators. The quotient of the language denoted by an expression extended to multi-tilde-bar operators is addressed in Section 3. Section 4 is devoted to the computation of the Brzozowski derivatives of an extended expression and Section 5 to the computation of the Antimirov derivatives. In both cases, the construction of the associated automaton is provided.

2 Preliminaries

We recall some definitions and notation concerning regular languages, regular expressions, finite automata and multi-tilde-bar expressions. For further details about these topics, we refer to classical books such as [24].

Languages, Regular Expressions and Automata

An alphabet is a finite set of symbols. Given an alphabet $\Sigma$, any subset of $\Sigma^*$ is a language over $\Sigma$. The set of regular languages over $\Sigma$ is denoted by $\mathrm{Reg}(\Sigma^*)$ and is defined as the smallest family of languages containing $\emptyset$ and $\{a\}$ for every $a$ in $\Sigma$ and closed under union, catenation and Kleene star. A regular expression $E$ over an alphabet $\Sigma$ is inductively defined by $E = 0$, $E = 1$, $E = a$, $E = (F + G)$, $E = (F \cdot G)$, $E = (F^*)$, with $a$ a symbol in $\Sigma$, and $F$ and $G$ two regular expressions over $\Sigma$. The language denoted by a regular expression is inductively defined by $L(0) = \emptyset$, $L(1) = \{\varepsilon\}$, $L(a) = \{a\}$, $L(F + G) = L(F) \cup L(G)$, $L(F \cdot G) = L(F) \cdot L(G)$ and $L(F^*) = L(F)^*$, with $a$ a symbol in $\Sigma$, and $F$ and $G$ two regular expressions over $\Sigma$. By construction, the language denoted by a regular expression is regular. The alphabetic width $|E|$ of $E$ is the number of occurrences of symbols of $\Sigma$ appearing in $E$.

A finite automaton $A$ is a 5-tuple $(\Sigma, Q, I, F, \delta)$ where $\Sigma$ is an alphabet, $Q$ is a finite set of states, $I \subset Q$ a set of initial states, $F \subset Q$ a set of final states and $\delta \subset Q \times \Sigma \times Q$ a set of transitions. The set $\delta$ can be seen as a function from $Q \times \Sigma$ to $2^Q$ defined by $q' \in \delta(q, a) \Leftrightarrow (q, a, q') \in \delta$. The domain of the function $\delta$ can be extended to $2^Q \times \Sigma^*$ by setting, for all $Q' \subset Q$, $\delta(Q', \varepsilon) = Q'$, $\delta(Q', a) = \bigcup_{q \in Q'} \delta(q, a)$ and $\delta(Q', a \cdot w) = \delta(\delta(Q', a), w)$ for every word $w$ in $\Sigma^*$. The language recognized by the automaton $A$ is the set $L(A) = \{w \in \Sigma^* \mid \delta(I, w) \cap F \neq \emptyset\}$. A language $L$ is recognizable if there exists an automaton that recognizes it. The set of recognizable languages over $\Sigma$ is denoted by $\mathrm{Rec}(\Sigma^*)$. Kleene's theorem [19] asserts that $\mathrm{Reg}(\Sigma^*) = \mathrm{Rec}(\Sigma^*)$. Consequently, for every regular language $L$, there exist an automaton $A$ and an expression $E$ such that $L = L(E) = L(A)$.

The Multi-Tilde-Bar Operators [8,9]

The unary operators tilde, denoted by $\widetilde{\ \cdot\ }$, and bar, denoted by $\overline{\ \cdot\ }$, are defined for every expression $E$ by $L(\widetilde{E}) = L(E) \cup \{\varepsilon\}$ and $L(\overline{E}) = L(E) \setminus \{\varepsilon\}$. They are extended to multi-tilde-bar operators, which are applied to a list of expressions, according to the following definitions.

Let $n$ be a positive integer. For convenience, the list $(E_1, \ldots, E_n)$ of expressions is denoted by $E_{1,n}$. Similarly, a catenation $E_1 \cdots E_n$ is denoted by $E_{1 \cdots n}$. The set of integers $\{1, \ldots, n\}$ is denoted by $[\![1,n]\!]$. The subset of pairs $(i, j)$ such that $1 \leq i \leq j \leq n$ is denoted by $[\![1,n]\!]^2_{\leq}$. The set of finite lists of pairs in $[\![1,n]\!]^2_{\leq}$ is denoted by $S_n$. Let $S$ be a list in $S_n$. Let $k$ be in $[\![1,n]\!]$. The list $S_{\leq k}$ (resp. $S_{\geq k}$) is defined by $S_{\leq k} = ((i, f) \in S \mid f \leq k)$ (resp. $S_{\geq k} = ((i - k + 1, f - k + 1) \mid (i, f) \in S \wedge i \geq k)$). Let us notice that a renumbering is performed for the computation of $S_{\geq k}$. A list $S$ is said to be free if for all pairs $(i, f)$, $(i', f')$ in $S$ such that $(i, f) \neq (i', f')$, $[\![i,f]\!] \cap [\![i',f']\!] = \emptyset$.

Let $L_1, \ldots, L_n$ be $n$ nonempty regular languages over $\Sigma$ and $w$ be a word in $L_1 \cdots L_n$. A sequence $(w_1, \ldots, w_n)$ satisfying $w_1 \cdots w_n = w \wedge \forall k \in [\![1,n]\!], w_k \in L_k$ is said to be a split up of $w$ over $(L_1, \ldots, L_n)$. Multi-tilde-bar operators are a natural combination of multi-tilde and multi-bar operators [9]. The respective roles of tildes and bars are made explicit in the two following definitions.
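The list operations above are easy to misread because of the renumbering performed for $S_{\geq k}$. The following Python sketch illustrates them on small lists. The function names `restrict_le`, `restrict_ge` and `is_free`, and the encoding of a list of pairs as a Python list of tuples, are assumptions made for illustration and do not come from the paper.

```python
from itertools import combinations

def restrict_le(S, k):
    """S_{<=k}: the pairs (i, f) of S such that f <= k."""
    return [(i, f) for (i, f) in S if f <= k]

def restrict_ge(S, k):
    """S_{>=k}: the pairs (i, f) of S such that i >= k, renumbered so that k becomes 1."""
    return [(i - k + 1, f - k + 1) for (i, f) in S if i >= k]

def is_free(S):
    """A list is free when the integer intervals of distinct pairs are pairwise disjoint."""
    for (i, f), (i2, f2) in combinations(S, 2):
        if (i, f) != (i2, f2) and not (f < i2 or f2 < i):
            return False
    return True

# Hypothetical sample list over positions 1..3.
S = [(1, 1), (2, 3), (3, 3)]
print(restrict_le(S, 2))   # pairs ending at position 2 or before
print(restrict_ge(S, 2))   # pairs starting at position 2 or after, shifted by 1
print(is_free(S))          # (2, 3) and (3, 3) share position 3
```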

Definition 1. Let $(w_1, \ldots, w_n)$ be a split up of a word $w$ over a list of languages $(L_1 \cup \{\varepsilon\}, \ldots, L_n \cup \{\varepsilon\})$. Let $T$ be a free list in $S_n$. The sequence $(w_1, \ldots, w_n)$ is generated by the list $T$ if it holds: $w_k = \varepsilon$ if $k \in \bigcup_{(i,f) \in T} [\![i,f]\!]$ and $w_k \in L_k$ otherwise.

Bars are used to forbid some combinations of tildes. Consequently, the satisfaction of a bar by a sequence has to be defined with a list of tildes as a context.

Definition 2. Let $E_{1,n}$ be a list of $n$ expressions. Let $(w_1, \ldots, w_n)$ be a split up of a word $w$ over $(L(E_1) \cup \{\varepsilon\}, \ldots, L(E_n) \cup \{\varepsilon\})$ generated by a free list $T$ in $S_n$. Let $b = (i, f)$ be a pair in $[\![1,n]\!]^2_{\leq} \setminus T$. The bar $b$ is said to be satisfied by $(w_1, \ldots, w_n)$ w.r.t. $T$ if at least one of the three following conditions is satisfied: (1) there exists a pair $t$ in $T$ such that $t$ overlaps $b$, (2) there exists a pair $t$ in $T$ such that $b$ is included in $t$, (3) $w_i \cdots w_f \neq \varepsilon$.

According to the two previous definitions, the language denoted by a multi-tilde-bar expression can be expressed as follows:

Definition 3 ([8]). Let $E_{1,n}$ be a list of expressions over an alphabet $\Sigma$ and $\mathcal{L}$ be the list $(L(E_1) \cup \{\varepsilon\}, \ldots, L(E_n) \cup \{\varepsilon\})$ of languages. Let $B$ and $T$ be two lists in $S_n$ such that $B \cap T = \emptyset$. The multi-tilde-bar expression $E = \mathrm{MTB}_{T;B}(E_{1,n})$ denotes the language

$$L(E) = \{\, w \in \Sigma^* \mid \text{there exists a split up of } w \text{ over } \mathcal{L} \text{ generated by a free sublist } T' \text{ of } T \text{ satisfying every bar in } B \text{ w.r.t. } T' \,\}.$$

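Example 1. Let us consider the EMRE $E_1$ defined by $E_1 = \mathrm{MTB}_{(1,1),(2,2);(1,2)}(a^*b, b^*a) \cdot a^*$ (i.e. $\overline{\widetilde{a^*b}\ \widetilde{b^*a}} \cdot a^*$). The language denoted by $E_1$ is the set

$$L(E_1) = \big(((L(a^*b) \cup \{\varepsilon\}) \cdot (L(b^*a) \cup \{\varepsilon\})) \setminus \{\varepsilon\}\big) \cdot L(a^*).$$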
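Definitions 1 to 3 can be checked by brute force on the multi-tilde-bar part of Example 1. The sketch below enumerates the free sublists of $T$ and the split ups of a word, then tests the bars. It rests on assumptions that are not in the paper: languages are encoded as Python membership predicates (built here with `re.fullmatch`), all function names are hypothetical, and "t overlaps b" in Definition 2 is read as "t meets b without lying inside b", which also covers condition (2).

```python
import re
from itertools import combinations

def disjoint(p, q):
    """True when the integer intervals [p0, p1] and [q0, q1] do not meet."""
    return p[1] < q[0] or q[1] < p[0]

def free_sublists(T):
    """Enumerate the sublists of T whose intervals are pairwise disjoint."""
    for r in range(len(T) + 1):
        for sub in combinations(T, r):
            if all(disjoint(p, q) for p, q in combinations(sub, 2)):
                yield sub

def split_ups(w, n):
    """Enumerate the cuts of w into n consecutive, possibly empty, factors."""
    if n == 1:
        yield (w,)
        return
    for cut in range(len(w) + 1):
        for rest in split_ups(w[cut:], n - 1):
            yield (w[:cut],) + rest

def bar_satisfied(b, factors, tildes):
    """Definition 2, with the reading of 'overlaps' stated in the lead-in (assumption)."""
    i, f = b
    if any(factors[k - 1] for k in range(i, f + 1)):         # condition (3)
        return True
    return any(not disjoint(b, t) and not (i <= t[0] and t[1] <= f)
               for t in tildes)                               # conditions (1) and (2)

def mtb_member(w, langs, T, B):
    """Membership of w in the language of Definition 3, by exhaustive search."""
    n = len(langs)
    for tildes in free_sublists(T):
        erased = {k for (i, f) in tildes for k in range(i, f + 1)}
        for factors in split_ups(w, n):
            generated = all(
                factors[k - 1] == "" if k in erased else langs[k - 1](factors[k - 1])
                for k in range(1, n + 1))
            if generated and all(bar_satisfied(b, factors, tildes) for b in B):
                return True
    return False

def regex_lang(pattern):
    """Membership predicate for the language of a Python regular expression."""
    return lambda s: re.fullmatch(pattern, s) is not None

langs = [regex_lang("a*b"), regex_lang("b*a")]
T, B = [(1, 1), (2, 2)], [(1, 2)]
for word in ["", "a", "b", "ab", "aa", "abba"]:
    print(repr(word), mtb_member(word, langs, T, B))
```

The search is exponential in the size of $T$ and in the length of the word, so it is only meant as a reference check on small instances.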

Definition 4. Let $\Sigma$ be an alphabet. An Extended to multi-tilde-bar Regular Expression (EMRE) over $\Sigma$ is inductively defined by: $E = 0$, $E = 1$, $E = a$, $E = E_1 + E_2$, $E = E_1 \cdot E_2$, $E = E_1^*$, $E = \mathrm{MTB}_{T;B}(E_{1,n})$, where $E_1, \ldots, E_n$ are any $n$ EMREs over the alphabet $\Sigma$, $a$ is any symbol in $\Sigma$ and $T$ and $B$ are any two disjoint lists in $S_n$.

Definition 5. An EMRE is said to be total if for any of its multi-tilde-bar subexpressions $\mathrm{MTB}_{T;B}(E_{1,n})$ it holds $T \cup B = [\![1,n]\!]^2_{\leq}$.

Lemma 1 ([8]). Any EMRE admits an equivalent total one.

3 Quotient Formulae

We now recall the inductive computation of the quotient $w^{-1}(L)$ of a language $L$ w.r.t. a word $w$ in $\Sigma^*$, that is the set $\{w' \in \Sigma^* \mid ww' \in L\}$.

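Lemma 2. Let $L$ be a language in $\mathrm{Reg}(\Sigma^*)$ and $w$ be a word in $\Sigma^*$. The quotient $w^{-1}(L)$ of $L$ w.r.t. $w$ is inductively computed as follows:

$$\varepsilon^{-1}(L) = L, \qquad (aw')^{-1}(L) = w'^{-1}(a^{-1}(L)),$$
$$a^{-1}(\emptyset) = a^{-1}(\{\varepsilon\}) = a^{-1}(\{b\}) = \emptyset, \qquad a^{-1}(\{a\}) = \{\varepsilon\},$$
$$a^{-1}(L_1 \cup L_2) = a^{-1}(L_1) \cup a^{-1}(L_2), \qquad a^{-1}(L_1^*) = a^{-1}(L_1) \cdot L_1^*,$$
$$a^{-1}(L_1 \cdot L_2) = \begin{cases} a^{-1}(L_1) \cdot L_2 \cup a^{-1}(L_2) & \text{if } \varepsilon \in L_1, \\ a^{-1}(L_1) \cdot L_2 & \text{otherwise,} \end{cases}$$

where $L_1$ and $L_2$ are any two languages in $\mathrm{Reg}(\Sigma^*)$, $a$ and $b$ are any two distinct symbols in $\Sigma$ and $w'$ is any word in $\Sigma^*$.

Lemma 3. Let $E = \mathrm{MTB}_{T;B}(E_{1,n})$ be a total EMRE over an alphabet $\Sigma$. Then:

$$L(E) = \{\varepsilon \mid (1,n) \in T\} \;\cup\; (L(E_1) \setminus \{\varepsilon\}) \cdot L\big(\mathrm{MTB}_{T_{\geq 2};B_{\geq 2}}(E_{2,n})\big) \;\cup \bigcup_{(1,k-1) \in T} (L(E_k) \setminus \{\varepsilon\}) \cdot L\big(\mathrm{MTB}_{T_{\geq k+1};B_{\geq k+1}}(E_{k+1,n})\big).$$

Corollary 1. Let $E = \mathrm{MTB}_{T;B}(E_{1,n})$ be a total EMRE over an alphabet $\Sigma$ and let $a$ be a symbol in $\Sigma$. Then:

$$a^{-1}(L(E)) = a^{-1}(L(E_1)) \cdot L\big(\mathrm{MTB}_{T_{\geq 2};B_{\geq 2}}(E_{2,n})\big) \;\cup \bigcup_{(1,k-1) \in T} a^{-1}(L(E_k)) \cdot L\big(\mathrm{MTB}_{T_{\geq k+1};B_{\geq k+1}}(E_{k+1,n})\big).$$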
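As an illustration, Corollary 1 can be applied to the multi-tilde-bar part of the expression $E_1$ of Example 1, which is total since $T \cup B = \{(1,1),(2,2),(1,2)\}$. The computation below is a sketch under two assumptions that are not stated in the text: the operator is written $\mathrm{MTB}_{T;B}(\ldots)$, and a multi-tilde-bar applied to an empty list (here $E_{3,2}$) is taken to denote $\{\varepsilon\}$. The only pair of the form $(1,k-1)$ in $T$ is $(1,1)$, which gives $k = 2$.

$$
\begin{aligned}
M &= \mathrm{MTB}_{(1,1),(2,2);(1,2)}(a^*b,\ b^*a), \qquad T_{\geq 2} = ((1,1)), \qquad B_{\geq 2} = (\,),\\
a^{-1}(L(M)) &= a^{-1}(L(a^*b)) \cdot L\big(\mathrm{MTB}_{(1,1);(\,)}(b^*a)\big) \;\cup\; a^{-1}(L(b^*a)) \cdot \{\varepsilon\}\\
&= L(a^*b) \cdot \big(L(b^*a) \cup \{\varepsilon\}\big) \;\cup\; \{\varepsilon\}.
\end{aligned}
$$

The two quotients used in the last step follow from Lemma 2: $a^{-1}(L(a^*b)) = L(a^*b)$ and $a^{-1}(L(b^*a)) = \{\varepsilon\}$. Catenating the result with $L(a^*)$ gives the language of the first derivative computed in Example 2.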

4 Word Derivatives of an EMRE

The set of all the word derivatives of a regular expression can be infinite. However, Brzozowski derivation yields a finite set of derivatives (called dissimilar derivatives) based on the use of the $+_{ACI}$ operator that is associative, commutative and idempotent. We extend these results to the case of EMREs and give the construction of the dissimilar derivative DFA of an EMRE.

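Definition 6. Let $E$ be a regular expression over the alphabet $\Sigma$ and $w$ be a word in $\Sigma^*$. The dissimilar derivative $\frac{d}{d_w}(E)$ of $E$ w.r.t. $w$ is inductively computed as follows:

$$\frac{d}{d_\varepsilon}(E) = E, \qquad \frac{d}{d_{aw'}}(E) = \frac{d}{d_{w'}}\Big(\frac{d}{d_a}(E)\Big),$$
$$\frac{d}{d_a}(0) = \frac{d}{d_a}(1) = \frac{d}{d_a}(b) = 0, \qquad \frac{d}{d_a}(a) = 1,$$
$$\frac{d}{d_a}(F + G) = \frac{d}{d_a}(F) + \frac{d}{d_a}(G), \qquad \frac{d}{d_a}(F^*) = \frac{d}{d_a}(F) \cdot F^*,$$
$$\frac{d}{d_a}(F \cdot G) = \begin{cases} \frac{d}{d_a}(F) \cdot G +_{ACI} \frac{d}{d_a}(G) & \text{if } \varepsilon \in L(F), \\ \frac{d}{d_a}(F) \cdot G & \text{otherwise,} \end{cases}$$

where $F$ and $G$ are any two regular expressions over the alphabet $\Sigma$, $a$ and $b$ are any two distinct symbols of $\Sigma$ and $w'$ is any word in $\Sigma^*$.

Definition 7. Let $E = \mathrm{MTB}_{T;B}(E_{1,n})$ be a total EMRE over an alphabet $\Sigma$, let $a$ be a symbol in $\Sigma$ and $w$ be a word in $\Sigma^*$. Then:

$$\frac{d}{d_a}(E) = \frac{d}{d_a}(E_1) \cdot \mathrm{MTB}_{T_{\geq 2};B_{\geq 2}}(E_{2,n}) \;+_{ACI} \sum_{\substack{ACI \\ (1,k-1) \in T}} \frac{d}{d_a}(E_k) \cdot \mathrm{MTB}_{T_{\geq k+1};B_{\geq k+1}}(E_{k+1,n}),$$
$$\frac{d}{d_w}(E) = \begin{cases} E & \text{if } w = \varepsilon, \\ \frac{d}{d_{w'}}\big(\frac{d}{d_b}(E)\big) & \text{if } w = b \cdot w' \wedge b \in \Sigma \wedge w' \in \Sigma^*. \end{cases}$$

Proposition 1. The derivative of an EMRE $E$ w.r.t. a word $w$ denotes the set $w^{-1}(L(E))$.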

Proposition 2. The set of dissimilar derivatives of an EMRE is finite.

Definition 8. Let $E$ be an EMRE over an alphabet $\Sigma$ and $D_E$ be the set of the dissimilar derivatives of $E$. Let $A = (\Sigma, Q, I, F, \delta)$ be the automaton defined by $Q = D_E$, $I = \{E\}$, $F = \{E' \in Q \mid \varepsilon \in L(E')\}$, $\forall E' \in Q$, $\forall a \in \Sigma$, $\delta(E', a) = \{\frac{d}{d_a}(E')\}$. The automaton $A$ is the dissimilar derivative DFA of $E$.

Proposition 3. The dissimilar derivative DFA of an EMRE $E$ recognizes $L(E)$.
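The construction of Definition 8 can be sketched in Python for standard regular expressions only, that is, with the rules of Definition 6 and without the multi-tilde-bar rule of Definition 7. The tuple encoding of expressions, the function names, and the extra simplifications $0 \cdot F = 0$, $1 \cdot F = F$ and $0 + F = F$ are assumptions made for this sketch. Sums are stored as frozensets, which is one simple way to obtain the ACI identification.

```python
ZERO, ONE = ("0",), ("1",)

def sym(a):
    return ("sym", a)

def star(f):
    return ("star", f)

def mk_sum(*terms):
    """ACI sum: flatten nested sums, drop 0 and duplicates, ignore the order."""
    flat = set()
    for t in terms:
        if t[0] == "sum":
            flat |= t[1]
        elif t != ZERO:
            flat.add(t)
    if not flat:
        return ZERO
    if len(flat) == 1:
        return next(iter(flat))
    return ("sum", frozenset(flat))

def mk_cat(f, g):
    """Catenation with the simplifications 0.F = F.0 = 0 and 1.F = F.1 = F."""
    if f == ZERO or g == ZERO:
        return ZERO
    if f == ONE:
        return g
    if g == ONE:
        return f
    return ("cat", f, g)

def nullable(e):
    """True when the empty word belongs to L(e)."""
    tag = e[0]
    if tag in ("0", "sym"):
        return False
    if tag in ("1", "star"):
        return True
    if tag == "sum":
        return any(nullable(x) for x in e[1])
    return nullable(e[1]) and nullable(e[2])      # cat

def deriv(e, a):
    """Derivative of e w.r.t. the symbol a, following the rules of Definition 6."""
    tag = e[0]
    if tag in ("0", "1"):
        return ZERO
    if tag == "sym":
        return ONE if e[1] == a else ZERO
    if tag == "sum":
        return mk_sum(*(deriv(x, a) for x in e[1]))
    if tag == "star":
        return mk_cat(deriv(e[1], a), e)
    left = mk_cat(deriv(e[1], a), e[2])           # cat
    return mk_sum(left, deriv(e[2], a)) if nullable(e[1]) else left

def derivative_dfa(e, alphabet):
    """States are the derivatives reachable from e; one transition per symbol."""
    states, todo, delta = {e}, [e], {}
    while todo:
        q = todo.pop()
        for a in alphabet:
            r = deriv(q, a)
            delta[q, a] = r
            if r not in states:
                states.add(r)
                todo.append(r)
    finals = {q for q in states if nullable(q)}
    return states, delta, finals

def accepts(e, alphabet, word):
    _, delta, finals = derivative_dfa(e, alphabet)
    q = e
    for a in word:
        q = delta[q, a]
    return q in finals

# Hypothetical sample expression (a + b)* . a over the alphabet {a, b}.
E = mk_cat(star(mk_sum(sym("a"), sym("b"))), sym("a"))
print(len(derivative_dfa(E, "ab")[0]), accepts(E, "ab", "ba"), accepts(E, "ab", "ab"))
```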

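Example 2. Let us consider the total EMRE $E_1 = \overline{\widetilde{a^*b}\ \widetilde{b^*a}} \cdot a^*$ defined in Example 1. Successive dissimilar derivatives of $E_1$ are computed as follows:

$$
\begin{aligned}
\frac{d}{d_a}(E_1) &= a^*b \cdot \widetilde{b^*a} \cdot a^* + a^* = E_2, &\qquad \frac{d}{d_b}(E_1) &= \widetilde{b^*a} \cdot a^* + b^*a \cdot a^* = E_3,\\
\frac{d}{d_a}(E_2) &= a^*b \cdot \widetilde{b^*a} \cdot a^* + a^* = E_2, &\qquad \frac{d}{d_b}(E_2) &= \widetilde{b^*a} \cdot a^* = E_4,\\
\frac{d}{d_a}(E_3) &= a^* = E_5, &\qquad \frac{d}{d_b}(E_3) &= b^*a \cdot a^* = E_6,\\
\frac{d}{d_a}(E_4) &= a^* = E_5, &\qquad \frac{d}{d_b}(E_4) &= b^*a \cdot a^* = E_6,\\
\frac{d}{d_a}(E_5) &= a^* = E_5, &\qquad \frac{d}{d_b}(E_5) &= 0,\\
\frac{d}{d_a}(E_6) &= a^* = E_5, &\qquad \frac{d}{d_b}(E_6) &= b^*a \cdot a^* = E_6.
\end{aligned}
$$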

[Figure: transition diagram over the states E1, ..., E6, with transitions labeled a and b]

Fig. 1. The Dissimilar Derivative DFA of E1
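The automaton of Example 2 can be cross-checked against the language given in Example 1. The sketch below encodes the transitions computed in Example 2 and compares acceptance with a Python regular expression on all short words. The state names as strings, the explicit sink state `"0"`, the set of final states (the derivatives whose language contains the empty word) and the reference pattern, obtained by expanding $L(E_1)$ as $(a^*b \cup b^*a \cup a^*b \cdot b^*a) \cdot a^*$, are assumptions worked out for this illustration.

```python
import re
from itertools import product

# Transitions read off Example 2; "0" is the sink reached by the derivative of E5 w.r.t. b.
DELTA = {
    ("E1", "a"): "E2", ("E1", "b"): "E3",
    ("E2", "a"): "E2", ("E2", "b"): "E4",
    ("E3", "a"): "E5", ("E3", "b"): "E6",
    ("E4", "a"): "E5", ("E4", "b"): "E6",
    ("E5", "a"): "E5", ("E5", "b"): "0",
    ("E6", "a"): "E5", ("E6", "b"): "E6",
    ("0", "a"): "0", ("0", "b"): "0",
}
# Derivatives whose language contains the empty word (E1 and E6 do not).
FINAL = {"E2", "E3", "E4", "E5"}

def dfa_accepts(word):
    """Run the deterministic automaton from the initial state E1."""
    state = "E1"
    for symbol in word:
        state = DELTA[state, symbol]
    return state in FINAL

# Reference description of L(E1), expanded by hand from Example 1.
REFERENCE = re.compile(r"(a*b|b*a|a*bb*a)a*")

def agree_up_to(max_len):
    """Compare the DFA with the reference pattern on every word of length <= max_len."""
    for n in range(max_len + 1):
        for letters in product("ab", repeat=n):
            word = "".join(letters)
            if dfa_accepts(word) != (REFERENCE.fullmatch(word) is not None):
                return False
    return True

print(agree_up_to(8))
```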

5 Partial Derivatives of an EMRE

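Partial derivatives [2] of a regular expression are defined as follows.

Definition 9. The partial derivative of a regular expression $E$ w.r.t. a word $w$ is the set $\frac{\partial}{\partial_w}(E)$ of expressions inductively computed as follows:

$$\frac{\partial}{\partial_\varepsilon}(E) = \{E\}, \qquad \frac{\partial}{\partial_{aw'}}(E) = \frac{\partial}{\partial_{w'}}\Big(\frac{\partial}{\partial_a}(E)\Big),$$
$$\frac{\partial}{\partial_a}(0) = \frac{\partial}{\partial_a}(1) = \frac{\partial}{\partial_a}(b) = \emptyset, \qquad \frac{\partial}{\partial_a}(a) = \{1\},$$
$$\frac{\partial}{\partial_a}(F + G) = \frac{\partial}{\partial_a}(F) \cup \frac{\partial}{\partial_a}(G), \qquad \frac{\partial}{\partial_a}(F^*) = \frac{\partial}{\partial_a}(F) \cdot F^*,$$
$$\frac{\partial}{\partial_a}(F \cdot G) = \begin{cases} \frac{\partial}{\partial_a}(F) \cdot G \cup \frac{\partial}{\partial_a}(G) & \text{if } \varepsilon \in L(F), \\ \frac{\partial}{\partial_a}(F) \cdot G & \text{otherwise,} \end{cases}$$

where $F$ and $G$ are any two regular expressions over the alphabet $\Sigma$, $a$ and $b$ are any two distinct symbols of $\Sigma$, $w'$ is any word in $\Sigma^*$ and, for any set $\mathcal{E}$ of expressions, $\frac{\partial}{\partial_a}(\mathcal{E}) = \bigcup_{E \in \mathcal{E}} \frac{\partial}{\partial_a}(E)$ and $L(\mathcal{E}) = \bigcup_{E \in \mathcal{E}} L(E)$.

We now define the partial derivatives of a total EMRE.

Definition 10. Let $E = \mathrm{MTB}_{T;B}(E_{1,n})$ be a total EMRE over an alphabet $\Sigma$, let $a$ be a symbol in $\Sigma$ and $w$ be a word in $\Sigma^*$. Then:

$$\frac{\partial}{\partial_a}(E) = \frac{\partial}{\partial_a}(E_1) \cdot \mathrm{MTB}_{T_{\geq 2};B_{\geq 2}}(E_{2,n}) \;\cup \bigcup_{(1,k-1) \in T} \frac{\partial}{\partial_a}(E_k) \cdot \mathrm{MTB}_{T_{\geq k+1};B_{\geq k+1}}(E_{k+1,n}),$$
$$\frac{\partial}{\partial_w}(E) = \begin{cases} \{E\} & \text{if } w = \varepsilon, \\ \frac{\partial}{\partial_{w'}}\big(\frac{\partial}{\partial_b}(E)\big) & \text{if } w = b \cdot w' \wedge b \in \Sigma \wedge w' \in \Sigma^*. \end{cases}$$

Proposition 4. Let $E = \mathrm{MTB}_{T;B}(E_{1,n})$ be a total EMRE over an alphabet $\Sigma$ and $w$ be a word in $\Sigma^*$. Then $L(\frac{\partial}{\partial_w}(E)) = w^{-1}(L(E))$.

By definition, a partial derivative of an expression $E$ is a set of expressions and each of these expressions is called a derivated term of $E$. We show that the set $D_E$ of all the derivated terms of an EMRE $E$ is finite and we give the construction of the derivated term NFA.

Lemma 4. Let $E = \mathrm{MTB}_{T;B}(E_{1,n})$ be a total EMRE over an alphabet $\Sigma$ and let $w$ be a word in $\Sigma^+$. Then:

$$\frac{\partial}{\partial_w}(E) \subset \bigcup_{w = uv \,\wedge\, v \neq \varepsilon} \ \bigcup_{k=1}^{n} \frac{\partial}{\partial_v}(E_k) \cdot \mathrm{MTB}_{T_{\geq k+1};B_{\geq k+1}}(E_{k+1,n}).$$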

Proposition 5. Let $E$ be a total EMRE. Then: $\#(D_E) \leq |E| + 1$.

Definition 11. Let $E$ be an EMRE over an alphabet $\Sigma$. Let $A = (\Sigma, Q, I, F, \delta)$ be the automaton defined by $Q = D_E$, $I = \{E\}$, $F = \{E' \in Q \mid \varepsilon \in L(E')\}$ and, for any expression $E' \in Q$ and any symbol $a$ in $\Sigma$, $\delta(E', a) = \frac{\partial}{\partial_a}(E')$. The automaton $A$ is the derivated term NFA of $E$.

Proposition 6. The derivated term NFA of an EMRE $E$ recognizes $L(E)$.
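For standard regular expressions, the construction of Definition 11 can be sketched with the rules of Definition 9 alone (the multi-tilde-bar rule of Definition 10 is not covered). The tuple encoding, the function names, the helper `dot` that catenates an expression to every element of a set, and the simplification $1 \cdot G = G$ are assumptions of this sketch.

```python
ZERO, ONE = ("0",), ("1",)

def sym(a):
    return ("sym", a)

def plus(f, g):
    return ("sum", f, g)

def cat(f, g):
    return ("cat", f, g)

def star(f):
    return ("star", f)

def nullable(e):
    """True when the empty word belongs to L(e)."""
    tag = e[0]
    if tag in ("0", "sym"):
        return False
    if tag in ("1", "star"):
        return True
    if tag == "sum":
        return nullable(e[1]) or nullable(e[2])
    return nullable(e[1]) and nullable(e[2])      # cat

def dot(terms, g):
    """Catenate g to the right of every expression of a set, with 1.g = g."""
    return {g if t == ONE else ("cat", t, g) for t in terms}

def pderiv(e, a):
    """Set of derivated terms of e w.r.t. the symbol a (rules of Definition 9)."""
    tag = e[0]
    if tag in ("0", "1"):
        return set()
    if tag == "sym":
        return {ONE} if e[1] == a else set()
    if tag == "sum":
        return pderiv(e[1], a) | pderiv(e[2], a)
    if tag == "star":
        return dot(pderiv(e[1], a), e)
    res = dot(pderiv(e[1], a), e[2])              # cat
    return res | pderiv(e[2], a) if nullable(e[1]) else res

def derivated_term_nfa(e, alphabet):
    """States are e and its derivated terms; delta maps (state, symbol) to a set."""
    states, todo, delta = {e}, [e], {}
    while todo:
        q = todo.pop()
        for a in alphabet:
            delta[q, a] = pderiv(q, a)
            for r in delta[q, a] - states:
                states.add(r)
                todo.append(r)
    finals = {q for q in states if nullable(q)}
    return states, delta, finals

def accepts(e, alphabet, word):
    """Subset simulation of the NFA, as in the extended transition function."""
    _, delta, finals = derivated_term_nfa(e, alphabet)
    current = {e}
    for a in word:
        current = set().union(*(delta[q, a] for q in current))
    return bool(current & finals)

# Hypothetical sample: the standard expression b*a . a* over {a, b}.
E = cat(cat(star(sym("b")), sym("a")), star(sym("a")))
print(len(derivated_term_nfa(E, "ab")[0]), accepts(E, "ab", "bba"), accepts(E, "ab", "ab"))
```

On this sample the automaton has two states, $b^*a \cdot a^*$ and $a^*$, which is consistent with the bound $|E| + 1 = 4$ of Proposition 5.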

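Example 3. Let us consider the total EMRE $E_1 = \overline{\widetilde{a^*b}\ \widetilde{b^*a}} \cdot a^*$ defined in Example 2. Successive derivated terms of $E_1$ are computed as follows:

$$
\begin{aligned}
\frac{\partial}{\partial_a}(E_1) &= \{a^*b \cdot \widetilde{b^*a} \cdot a^*,\ a^*\} = \{E_2, E_3\}, &\qquad \frac{\partial}{\partial_b}(E_1) &= \{\widetilde{b^*a} \cdot a^*,\ b^*a \cdot a^*\} = \{E_4, E_5\},\\
\frac{\partial}{\partial_a}(E_2) &= \{a^*b \cdot \widetilde{b^*a} \cdot a^*\} = \{E_2\}, &\qquad \frac{\partial}{\partial_b}(E_2) &= \{\widetilde{b^*a} \cdot a^*\} = \{E_4\},\\
\frac{\partial}{\partial_a}(E_3) &= \{a^*\} = \{E_3\}, &\qquad \frac{\partial}{\partial_b}(E_3) &= \emptyset,\\
\frac{\partial}{\partial_a}(E_4) &= \{a^*\} = \{E_3\}, &\qquad \frac{\partial}{\partial_b}(E_4) &= \{b^*a \cdot a^*\} = \{E_5\},\\
\frac{\partial}{\partial_a}(E_5) &= \{a^*\} = \{E_3\}, &\qquad \frac{\partial}{\partial_b}(E_5) &= \{b^*a \cdot a^*\} = \{E_5\}.
\end{aligned}
$$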

[Figure: transition diagram over the states E1, ..., E5, with transitions labeled a and b]

Fig. 2. The Derivated Term NFA of E1

6 Conclusion

We have shown how the Brzozowski derivation and the Antimirov one can be applied to the case of (simple) regular expressions extended to multi-tilde-bar operators. The computation of the c-continuations for such expressions has already been investigated, although it is not presented here. The main interest of c-continuations is that they allow us to efficiently implement Glushkov and Antimirov NFAs. We also intend to generalize these derivations to the case of unrestricted regular expressions extended to multi-tilde-bar operators.

References

1. Almeida, M., Moreira, N., Reis, R.: Antimirov and Mosses's rewrite system revisited. Int. J. Found. Comput. Sci. 20(4), 669–684 (2009)
2. Antimirov, V.: Partial derivatives of regular expressions and finite automaton constructions. Theoret. Comput. Sci. 155, 291–319 (1996)
3. Antimirov, V.M., Mosses, P.D.: Rewriting extended regular expressions. Theor. Comput. Sci. 143(1), 51–72 (1995)
4. Berry, G., Sethi, R.: From regular expressions to deterministic automata. Theoret. Comput. Sci. 48(1), 117–126 (1986)
5. Brzozowski, J.A.: Derivatives of regular expressions. J. Assoc. Comput. Mach. 11(4), 481–494 (1964)
6. Brzozowski, J.A.: Quotient complexity of regular languages. Journal of Automata, Languages and Combinatorics 15(1/2), 71–89 (2010)
7. Brzozowski, J.A., Leiss, E.L.: On equations for regular languages, finite automata, and sequential networks. Theor. Comput. Sci. 10, 19–35 (1980)
8. Caron, P., Champarnaud, J.M., Mignot, L.: Erratum to “acyclic automata and small expressions using multi-tilde-bar operators”. [Theoret. Comput. Sci. 411(38-39), 3423–3435] (2010); Theor. Comput. Sci. 412(29), 3795–3796 (2011)
9. Caron, P., Champarnaud, J.M., Mignot, L.: Multi-bar and multi-tilde regular operators. Journal of Automata, Languages and Combinatorics 16(1), 11–26 (2011)
10. Caron, P., Champarnaud, J.-M., Mignot, L.: Partial Derivatives of an Extended Regular Expression. In: Dediu, A.-H., Inenaga, S., Martín-Vide, C. (eds.) LATA 2011. LNCS, vol. 6638, pp. 179–191. Springer, Heidelberg (2011)
11. Caron, P., Champarnaud, J.M., Mignot, L.: A general frame for the derivation of regular expressions (submitted, 2012)
12. Champarnaud, J.-M., Jeanne, H., Mignot, L.: Approximate Regular Expressions and Their Derivatives. In: Dediu, A.-H., Martín-Vide, C. (eds.) LATA 2012. LNCS, vol. 7183, pp. 179–191. Springer, Heidelberg (2012)
13. Champarnaud, J.M., Ouardi, F., Ziadi, D.: An efficient computation of the equation K-automaton of a regular K-expression. Fundam. Inform. 90(1-2), 1–16 (2009)
14. Champarnaud, J.M., Ziadi, D.: Canonical derivatives, partial derivatives, and finite automaton constructions. Theoret. Comput. Sci. 239(1), 137–163 (2002)
15. Conway, J.H.: Regular algebra and finite machines. Chapman and Hall (1971)
16. Frishert, M.: FIRE Works & FIRE Station: A finite automata and regular expression playground. Ph.D. thesis, Eindhoven University, Netherlands (2005)
17. Ginzburg, A.: A procedure for checking equality of regular expressions. J. ACM 14(2), 355–362 (1967)
18. Ilie, L., Yu, S.: Follow automata. Inf. Comput. 186(1), 140–162 (2003)
19. Kleene, S.: Representation of events in nerve nets and finite automata. Automata Studies Ann. Math. Studies 34, 3–41 (1956)
20. Krob, D.: Differentiation of K-rational expressions. Internat. J. Algebra Comput. 2(1), 57–87 (1992)
21. Lombardy, S., Sakarovitch, J.: Derivatives of rational expressions with multiplicity. Theor. Comput. Sci. 332(1-3), 141–177 (2005)
22. Owens, S., Reppy, J.H., Turon, A.: Regular-expression derivatives re-examined. J. Funct. Program. 19(2), 173–190 (2009)
23. Sulzmann, M., Lu, K.: Partial derivative regular expression pattern matching (December 2007) (manuscript)
24. Yu, S.: Regular languages. In: Rozenberg, G., Salomaa, A. (eds.) Handbook of Formal Languages. Word, Language, Grammar, vol. I, pp. 41–110. Springer, Berlin (1997)