Intelligent Control and Automation, 2013, 4, 309-312 http://dx.doi.org/10.4236/ica.2013.43036 Published Online August 2013 (http://www.scirp.org/journal/ica)

An Automata-Based Approach to Pattern Matching

Ali Sever Pfeiffer University, Misenheimer, USA Email: [email protected]

Received January 21, 2013; revised March 4, 2013; accepted March 11, 2013

Copyright © 2013 Ali Sever. This is an open access article distributed under the Creative Commons Attribution License, which per- mits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

ABSTRACT Due to its importance in security, syntax analysis has found usage in many high-level programming languages. The Lisp language has its share of operations for evaluating regular expressions, but native parsing of Lisp code in this way is unsupported. Matching on lists requires a significantly more complicated model, with a different programmatic ap- proach than that of string matching. This work presents a new automata-based approach centered on a set of functions and macros for identifying sequences of Lisp S-expressions using finite tree automata. The objective is to test that a given list is an element of a given tree language. We use a macro that takes a grammar and generates a function that reads off the leaves of a tree and tries to parse them as a string in a context-free language. The experimental results in- dicate that this approach is a viable tool for parsing Lisp lists and expressions in the abstract interpretation framework.

Keywords: Computation and ; Pattern Matching; Regular Languages

1. Introduction The inability to use regular expressions in a simple format was perhaps the biggest hurdle. We will use stan- There have been many studies on how to generate a tree dard definitions, but we give some of these for conven- parser over Lisp lists, so that the existence of a list as an ience of the reader. element of a can be determined through  Analytic grammar—A set of rules for parsing and equality checks [1]. Matching on lists using Automate- based approach requires a significantly more complicated returning of truth values confirming a string as either model, with a different programmatic approach than that consistent or inconsistent with the rules of a formal of string matching which is studied in [2]. We present language. that one must abandon the use of regular expressions, for  Context-free grammar—A set of rules over an alpha- list pattern matching, in favor of tree parsers. Regular bet, defined by a quad-tuple G = (Vt, Vn, P, S), where expressions are convenient in string matching because of P is a set of production rules; the relative simplicity and predictability of string struc- Vn is a set of non-terminals; tures and their straightforward representation [3,4]. In the Vt is a set of terminals; and realm of trees, the closest we come to this is in compari- S is a starting non-terminal and an element of Vn. sons at tree nodes. Rather, to conclude that a tree is a  Context-free language—All strings which can be member of a regular tree language, we must represent it generated by a context-free grammar. in a useful format for our purposes or else require each  Finite automaton—A set of states and transitions, “cons” cell in the tree to have its own production. commonly expressed in a flow diagram. Also called a Another requirement of this approach is to provide a finite state machine. certain malleability in the interface—the ability for a pro-  Finite state machine: a model of computation com- grammer to define his or her own transition rules (or re- posed of states, a transition function, and an input write rules) on-the-fly—in the make-tree-matcher Figure alphabet. 1 function call, for example. We made sure the code  Transition function: describes a condition that would would run on Common Lisp as well as on different Lisp need to be fulfilled to enable the transition. implementations. This required some bending of the con-  input alphabet: input recognized by the finite state straints of Lisp, like redefinition of Boolean symbols out machine of utility, and working around the organization of nested  —A description of a set of rules for lists. a given alphabet over which a set of finite strings can

Copyright © 2013 SciRes. ICA 310 A. SEVER

(defparameter *falsehood* '*falsehood*) *falsehood*))))

(defun nonterminal? (x) (genproduction (p) (and (symbolp x) (let ((name (car p)) (let* ((str (symbol-name x)) (rules (cdr p))) (fr (char str 0)) (format t “Genproduction:~% name is: (ls (char str (- (length str) 1)))) ~A~% rules is/are: ~A~%~%” name rules) (and `(,name (y) (char= fr #\<) (n-or ,@(mapcar #'gencall rules)))))) (char= ls #\>))))) (format t “Defining function named ~a~%” name) `(defun ,name (y) (labels ,(mapcar #'genproduction (car rules)) (defun yield (tree) (not (eq (,start (yield y)) *falsehood*)))))) (cond ((null tree) nil) (defmacro tst () ((atom tree) (list tree)) (make-tree-matcher plusone (t (append (yield (car tree)) (yield (cdr tree)))))) (( (+ )) ( (1 ) (1))))) ;; Use a special symbol rather than nil for false. ;(defmacro n-and (&rest conjuncts) (defun ld () (load “tyield.lisp”)) ; `(and ,@(mapcar #'(lambda (x) `(if (eq ,x *falsehood*) nil ,x)) conjuncts) t)) (make-tree-matcher booleval (( (and ) (defun n-and (&rest conjuncts) (not ) (if (eq (car conjuncts) *falsehood*) (or ) *falsehood* (or ) (if (eq (cdr conjuncts) nil) (or ) (car conjuncts) (1)) (apply #'n-and (cdr conjuncts))))) ( (and ) (and ) (and ) ;(defmacro n-or (&rest disjuncts) (or ) ; `(or ,@(mapcar #'(lambda (x) `(if (eq ,x *falsehood*) nil ,x)) dis- (not ) juncts) nil)) (0))))

(defmacro n-not-null (v) (defun add-match-tracers () (if (null v) (trace n-and) *falsehood* (trace n-or)) v)) Figure 1. The make-tree-matcher function.

(defun n-or (&rest disjuncts) be defined. (if (null disjuncts) *falsehood* (if (eq (car disjuncts) *falsehood*)  Kleene closure/Kleene star—The set of all possible (apply #'n-or (cdr disjuncts)) combinations of non-terminals of a regular language. (car disjuncts)))) Specifically, the superset of a set of strings containing

(defmacro make-tree-matcher (name start &body rules) the empty string ε and closed on the string concate- (labels nation function. Every string that is part of a regular ((gencall (right) language can be found in its Kleene expansion. (format t “generating expansion for ~A~%” right) `(if (or (null y) (eq *falsehood* y)) *falsehood*  Parse tree—A description of the syntax of a string (let ((l (copy-list y))) within a formal grammar. (if (not (eq *falsehood* (n-and ,@(mapcar  Parser—A method or algorithm or its implementa- #'(lambda (x) (if (nonter- tion which examines the application of a given string minal? x) within an analytic grammar. `(if  Regular tree language—The set of trees accepted by (null l) *falsehood* (setf l (,x a finite tree automaton. l)))  Terminal—A constant, indivisible value that cannot `(if be further reduced to a more simplified form within (null l) *falsehood* (if (eql its own grammar. (quote ,x) (pop l))  Tree automaton—While finite automata typically l act on strings, tree automata are used for tree expres-

*falsehood*)))) sions. right))))  Yield—The string pattern formed from a tree’s leaves l as encountered in an ordered traversal.

Copyright © 2013 SciRes. ICA A. SEVER 311

2. Analysis trees for a context-free grammar [10]. We parse the yield of the tree to show that it’s a mem- Finite tree automata are much more difficult to imple- ber of the context-free language, and use the theorem 1 to ment than finite automata, and regular tree languages do conclude that it’s a member of the regular tree language. not have a nice compact notation like regular (string) This does not provide us any information about the struc- languages do. Instead of reading only one next symbol, ture of the original tree, but we do know whether it is a finite state machines that are used to recognize regular member of the tree language we defined. If it is a mem- expressions can read any finite number of next symbols. ber, then we know that it matches the pattern we are Each of these next symbols can have any finite number looking for. of next states. Since parse trees are not generally unique, we do not know anything about the structure of the tree 3. Experiments when we decide it’s a member of the regular tree lan- guage by parsing its yield. We know it’s a parse tree for a Parsing the string allowed us to test whether it was a string in the grammar, but there can be more than one, member of a given string language. A language is a set of and we don’t know which it is. sentences (strings and trees are “sentences” in the sense Although our initial vision involved the usage of regu- that we use them in theory). A context- lar expressions, which would be appropriate for normal free language is the set of all strings that can be gener- strings [5,6], the nested form of S-expressions required a ated by its grammar. We also know because of the theo- more iterative and comprehensive method. Traversing a rem that there is a regular tree language composed of all Lisp list involves the usage of a parse tree. Through this its parse trees. application, one can back trace the sequence of an ex- We test whether that tree is in the regular tree lan- pression through the parent nodes that generated each guage by testing whether the string is a parse tree of the step [7]. context-free language. Generalized nondeterministic finite automata can be Our tree pattern matching implementation allows for defined as the 5-tuple (S, Σ, δ, s, a), where: parsing of a tree structure as a string for subsequent com- S = A complete set of states; parison. By seeing that a given list is a parse tree ac- Σ = A finite alphabet; cepted by a finite tree automaton, we know, by the gen- δ = A transition function (δ: (S − {a}) × (S − {s}) → eral theorem which states “A regular tree language is the R); set of parse trees for a context-free grammar”. That means s = A start state; and the list in question is a member of the regular language a = A accept state; tested by the automaton. The use of tree yield functions where R is the set of all regular expressions over Σ simplified this step. Conceptually, trees consist of parent (“Automata”). nodes, where child nodes extend from the right leaf of This can be similarly migrated to tree parsing. Recog- each node, with the outermost levels on the left side of nizing trees as elements of a regular language only al- the graph. Our custom yield function generates usable lows us to say that the input is part of the language, or set strings from tree inputs. With the make-tree-matcher of all possible elements of the . This is function, the yield of a given tree is produced, and recur- done through Boolean comparisons in a top-down tree sive-descent parsing is performed on the value by the automaton context. Such automata can be represented same function Figure 1. with the four-tuple (Q, F, Qf, Δ), where: We experimented and provided the results with the Q = A set of states; make-tree-matcher function in Figures 2 and 3. Notice F = A ranked alphabet; that the BOOLEVALUATION function is the language Qf = A subset of terminal states in Q; and of all true Boolean expressions without variables using Δ = A set of transition rules [8,9]. efficient implementations of automata operations. As our implementation uses a nondeterministic push- From the definition of a context-free language, we know that a regular tree language is the set of all parse down approach, we can only test for the presence of a trees of this language’s grammar. Therefore, a tree is in list in a regular tree language. As the language can have the regular tree language if its yield is in the context-free any number of similarly organized tree structures, we language. cannot tell with our algorithm which one of them has been found—only that the pattern in question is present 4. Conclusion in the language. Using a finite tree automaton to match a Lisp list We proposed a symbolic approach for pattern matching would require every cons cell in the pattern to have its on LISP programs. We use a symbolic automata repre- own production, all labeled “cons”. sentation and implement set of functions and macros for Theorem 1. A regular tree language is the set of parse identifying sequences of Lisp S-expressions using finite

Copyright © 2013 SciRes. ICA 312 A. SEVER

(make-tree-matcher boolevaluation tree automata. Our experiments demonstrate that the (( (and ) study has produced a viable tool for parsing Lisp lists (not ) and expressions, and because of the language used, has (or ) (or ) the added bonuses of portability, small size, customiza- (or ) bility, and clarity. The study of formal grammar and re- (1)) gular expressions has shown us with those topics the util- ( (and ) ity, robustness, and sometimes elegance of regular lan- (and ) guages and Lisp. The same approach can also be applied (and ) (or ) to variety of other functional programming languages. (not ) Finally, the use of automata as a symbolic representation (0)))) for verification has been investigated in other contexts (e.g., [9]). Therefore, to obtain these kind of particular Figure 2. The make-tree-matcher implementation of auto- mata. but interesting results are of substantial and growing in- terest for many applied problems in symbolic computa- Defining function named BOOLEVALUATION tions [10]. Genproduction: name is: rules is/are: ((AND ) (NOT ) REFERENCES (OR ) (OR ) (OR ) (1)) [1] M. Sipser, “Theory of Computation,” 3rd Edition, Course generating expansion for (AND ) Technology, 2012. generating expansion for (NOT ) generating expansion for (OR ) [2] M. Bojańczyk and T. Colcombet, “Tree-Walking Auto- generating expansion for (OR ) mata Cannot Be Determinized,” Theoretical Computer generating expansion for (OR ) Science, Vol. 350, No. 2-3, 2006, pp. 164-170. generating expansion for (1) doi:10.1016/j.tcs.2005.10.031 Genproduction: [3] H. Comon, M. Dauchet, R. Gilleron, C. Löding, F. Jac- name is: rules is/are: ((AND ) (AND quemard, D. Lugiez, S. Tison and M. Tommasi, “Tree ) (AND ) (OR Automata Techniques and Applications,” 2007. ) (NOT ) (0)) [4] H. Comon, M. Dauchet, R. Gilleron, C. Löding, F. Jac- generating expansion for (AND ) quemard, D. Lugiez, S. Tison and M. Tommasi, “Tree generating expansion for (AND ) Automata Techniques and Applications II,” 2007. generating expansion for (AND ) generating expansion for (OR ) [5] C.-H. Chen, “A Neural Network Arhitecture for Syntax generating expansion for (NOT ) Analysis,” IEEE Transactions of Neural Networks, Vol. generating expansion for (0) 10, No. 1, 1999, pp. 94-114. doi:10.1109/72.737497 BOOLEVALUATION [6] J. Power, “Notes on Formal Language Theory and Pars- This is the language of all true Boolean expressions (without ing,” National University of Ireland, Maynooth, Kildare, variables). 2002. > (boolevaluation '(or (not 1) (or 1 0))) T [7] I. Bagrak and O. Shivers, “trx: Regular-Tree Expressions, > (boolevaluation '(0)) Now in Scheme,” Scheme Workshop, September 2004. NIL [8] F. Yu, et al., “Symbolic String Verification,” SPIN’08 Pro- > (boolevaluation '(1)) ceedings of the 15th International Workshop on Model T Checking Software, pp. 306-324. > (boolevaluation '(or 0 1)) T [9] A. Bouajjani, B. Johnson, M. Nilsson and T. Touili, > (boolevaluation '(not 0)) “Regular Model Checking,” Proceedings of the 12th In- T ternational Conference on Computer Aided Verification, > (boolevaluation '(and 1 (not 1))) 2007, pp. 403-418. NIL [10] L. Segoufin and V. Vianu, “Validating Streaming XML Documents,” ACM, 2002, pp. 53-64. Figure 3. Sample run of “boolevaluation” function.

Copyright © 2013 SciRes. ICA