Efficient Generation of Table-Driven Scanners

Efficient Generation of Table-DrivenScanners J. Grosch DR. JOSEF GROSCH COCOLAB - DATENVERARBEITUNG GERMANY Cocktail Toolbox for Compiler Construction Efficient Generation of Table-DrivenScanners Josef Grosch May 24, 1987 Document No. 2 Copyright © 1994 Dr.Josef Grosch Dr.Josef Grosch CoCoLab - Datenverarbeitung Breslauer Str.64c 76139 Karlsruhe Germany Phone: +49-721-91537544 Fax: +49-721-91537543 Email: [email protected] 1 SUMMARY General tools for the automatic generation of lexical analysers likeLEX [Les75] convert a specification consisting of a set of regular expressions into a deterministic finite automaton. The main algorithms involved aresubset construction, state minimization, and table compression. Even if these algorithms do not showtheir worst case time behaviour they arestill quite expensive.This paper shows howtosolvethe problem in linear time for practi- cal cases, thus resulting in a significant speed-up. The idea is to combine the automaton introduced by Aho and Corasick [AhC75] with the direct computation of an efficient table representation. Besides the algorithm we present experimental results of the scanner generator Rex [Groa] which uses this technique. KEY WORDS lexical analysis scanner generator INTRODUCTION Today,there exist several tools to generate lexical analysers likee.g. LEX [Les75], flex[Pax88], GLA [Heu86, Wai86], ALEX [Mös86], or Mkscan [HoL87]. This indicates that the automatic implementation of scanners becomes accepted by today’scompiler writers. In the past, the lowperformance of generated scanners in comparison to hand-written ones may have restricted the generation approach to prototyping applications. However, since newer results [Grob, Heu86, Wai86] showthat generated scanners have reached the speed of hand-written ones there is no argument against using automatically generated lexical analysers in production compilers. In the following we present an efficient algorithm to convert a scanner specification based on regular expressions into a deterministic finite automaton. Aspecification of a scanner consists of a set of regular expressions (REs). Each RE usually describes a deterministic finite automaton (DFA) being able to recognize a token. The whole set of REs, however, usually specifies a nonde- terministic finite automaton (NFA). Toallowscanning to be done in linear time the NFAhas to be converted into a DFA. This problem can be solved with well known algorithms for subset construction and state minimization [ASU86, WaG84]. In the worst case subset construction takes time O(2n)and state minimization O(n2)orO(n log n) if nisthe number of states. If the DFAisimplemented as a table-drivenautomaton an additional algorithm for table-compression is needed in practice, usually taking time O(n2). Running example: letter ( letter | digit ) * → IdentSymbol digit + → DecimalSymbol octdigit + Q → OctalSymbol BEGIN → BeginSymbol END → EndSymbol := → AssignSymbol The above specification describes six tokens each by a RE using an abstract notation. The automaton for these tokens is a NFAwhose graphical representation is shown in Figure 1. The conversion of this NFAtoaDFA yields the automaton shown in Figure 2. Definition 1: constant RE AREconsisting only of the concatenation of characters is called a constant RE. A constant RE contains no oper- ators like+*|? The language of a constant RE contains exactly one word. In the above example the constant REs are: BEGIN END := During the construction of tables for a DFAbyhand we observed that the task was solvable easily and in linear time for constant REs. Common prefixes have simply to be factored out thus arriving at a DFAhaving a decision tree as its graphical representation. Only the fewnon-constant REs required real work. Scanner specifications for languages likeModula or C usually contain 90% constant REs: 60% keywords and 30% delimiters. Only 10% are non-constant REs needed to specify identifiers, various constants, and comments (see Appendix 2 for an example). The languages Ada and Pascal are exceptions from this observation because keywords can be written in upper or lower case letters or in anymixture. 2 1 A-Z,a-z,0-9 A-Z,a-z 2 0-9 0-9 0-7 Q 0 3 4 B 0-7 E E G I N 5 6 7 8 9 : N D 10 11 12 = 13 14 Fig. 1: NFAfor the running example 2 0-9 8-9 Q 3 4 0-7 N D 8-9 10 11 12 0-7 E α\N α\D α A-Z,a-z,0-9 A,C,D,F-Z,a-z 0 1 B α\E α\G α\I α\N α : E G I N 5 6 7 8 9 = 13 14 Fig. 2: DFAfor the running example (α \ β stands for A-Z,a-z,0-9 except β) It would be nice to retain the linear time behaviour for constant REs and perform subset construction and state minimization only for the fewnon-constant REs. The following chapter describes howthis can be achievedbyfirst converting the NFAofthe non-constant REs to a DFAand incrementally adding the constant REs afterwards. The results of this method are shown for the running example. Then we generalize the method to automata with several start states. Weconclude with a comparison of some scanner generators and present experimental results. 3 THE METHOD The basic idea is to combine the special automaton of Aho and Corasick [AhC75] with the so-called "comb-vec- tor" or rowdisplacement technique [ASU86] for the table representation. The automaton of Aho and Corasick, called a pattern matching machine, extends a usual DFAbyaso-called failure function and was originally used to search for overlapping occurrences of several strings in a giventext. This automaton can also be used to cope with a certain amount of nondeterminism. To introduce the terminology needed we present a definition of our version of the automaton. Wecall it a tunnel automaton and the tunnel function used corresponds to the failure function of Aho and Corasick. The name tunnel automaton is motivated by the analogy to the tunnel effect from nuclear physics: Electrons can switch overtoanother orbit without supply of energy and the tunnel automaton can change its state without consumption of an input symbol. Definition 2: tunnel automaton A tunnel automaton is an extension of a DFAand consists of: StateSet a finite set of states FinalStates a set of final states FinalStates ⊆ StateSet StartState a distinguished start state StartState ∈ StateSet Vocabulary a finite set of input symbols Control the transition function Control: StateSet × Vocabulary → StateSet NoState a special state to denote undefined NoState ∈ StateSet Semantics a function mapping the states Semantics: StateSet → 2Proc to subsets of a set of procedures Tunnel a function mapping states to states Tunnel: StateSet → StateSet Atunnel automaton usually works likeaDFA: in each step it consumes a symbol and performs a state transition. Additionally the tunnel automaton is allowed to change from state s to state t = Tunnel (s) without consuming a symbol, if no transition is possible in state s. In state t the same rules apply,sothe automaton may change the state several times without consuming anysymbols. After recognizing a token a semantic procedure associated with the final state is exe- cuted. Algorithm 1formalizes the behaviour of a tunnel automaton. To guarantee the termination of the WHILE loop the StateStack is initialized with a special final state called DefaultState. Algorithm 1:tunnel automaton BEGIN Push (StateStack , DefaultState) Push (SymbolStack, Dummy ) State := StartState Symbol := NextSymbol () LOOP IF Control (State, Symbol) ≠ NoState THEN State := Control (State, Symbol) Push (StateStack , State ) Push (SymbolStack, Symbol) Symbol := NextSymbol () ELSE State := Tunnel (State) IF State = NoState THEN EXIT END END END WHILE NOT Top (StateStack) ∈ FinalStates DO State := Pop (StateStack ) UnRead (Pop (SymbolStack)) END Semantics (Top (StateStack)) () END Before we present the algorithm to compute the control function we need some more definitions: 4 Definition 3: trace The trace of a string is the sequence of states that a tunnel automaton passes through giventhe string as input. States at which the automaton does not consume a symbol are omitted. This includes the start state. If necessary we pad the trace with the value NoState (denoted by the character -) to adjust its length to the length of the string. Examples (computed by using the DFAofFig. 2 as tunnel automaton): The trace of IF is 11 The trace of ENDIF is 10 11 12 1 1 The trace of 1789 is 3322 The trace of 777Q is 3334 The trace of ::= is 13 - - Definition 4: ambiguous state We call a state s ambiguous (or ambiguously reachable) if there exist more than one string such that for each string the repeated execution of the control function (first loop in Algorithm 1) ends up in state s. Example: The states 1, 2, 3, and 4 of the DFAofFig. 2 are ambiguous states. Method: Construction of a tunnel automaton from a givenset of regular expressions Step 1: Convert the NFAfor non-constant REs to a DFAusing the usual algorithms for subset construction and state minimization [ASU86, WaG84]. Step 2: Extend the DFAtoatunnel automaton by setting Tunnel (s) to NoState for every state s. Step 3: Compute the set of ambiguous states of the tunnel automaton. Forconvenience Appendix 1 contains an algorithm to compute the ambiguous states. Step 4: Forevery constant RE execute Step 5 which incrementally extends the tunnel automaton. Continue with Step 6. Step 5: Compute the trace of the string specified by the constant RE using the current tunnel automaton. Wehav e to distinguish 4 cases: Case 1: The trace contains neither NoState nor ambiguous states: Let the RE be a1 a2 ... an Let the trace be s1 s2 ... sn As the path for the RE already exists include (add) the semantics of RE to the semantics of sn and include sn into the set of final states. Case 2: The trace does not contain ambiguous states but it contains NoState: Let the RE be a1 a2 ...ai ai+1 ..

Load more