TECHNISCHE UNIVERSITEIT EINDHOVEN

Department of Mathematics and Computer Science

MASTER’S THESIS

ForestFIRE and FIREWood A Toolkit & GUI for Tree Algorithms

by Roger Strolenberg

Date of defense 1st of June 2007

Tutor: ir. L.G.W.A. Cleophas, TU/e
Supervisor: dr.ir. C. Hemerik, TU/e

Abstract

Many fields in computer science use trees to represent hierarchical data. Parsing these trees and performing searches for patterns in them are well-known problems. Tree parsing, for example, is a known issue in compilers, because compilers can parse intermediate representation trees to help in translating such trees into sequences of machine-dependent instructions. Loek Cleophas' PhD research focuses on tree domain problems. He gathered, structured and classified a collection of tree parsing, matching and acceptance algorithms. This thesis discusses a toolkit and GUI that were developed to create an environment to experiment with these algorithms and to collect information on their properties. These properties can be used to select the most promising algorithms for the instruction selection process in compilers. The process of building the toolkit started with a study of existing tree toolkits. The knowledge gained from this study was used to construct the toolkit and user interface. The resulting toolkit implements a subset of the algorithms described by Cleophas. Finally, the toolkit and GUI were used to experiment with two types of algorithms. The description and results of these experiments are included at the end of this report.

Acknowledgments

I devote this section to all the people who provided me with knowledge, inspiration, and support throughout my Master's project. Starting with my tutor Loek Cleophas, PhD student at the Technische Universiteit Eindhoven, who guided me through the project, provided the necessary theory and helped improve the toolkit and thesis. I also want to thank Kees Hemerik, my supervisor at TU/e, who provided valuable input and watched over the progress. Next, I want to thank my close friends, without whom things would have been very different: my old schoolmates, Nicky, Frank, Tim, and Manon, for their encouragement and kind words; Isabelle, for the challenging games and the pleasant conversations during the breaks. There are also some colleagues who may not be forgotten. Rudy, a colleague and friend, worked on a similar Master's project; we had very useful reviews of each other's work and discussed daily programming curiosities. I also want to thank my colleagues Jan and Harald for removing the small language imperfections from my thesis. The project would also not have been the same without the hilarious Wednesday afternoon lunches at the TU/e with Harald, Erik and Rudy. And of course I want to thank my parents and my sister, who have always supported me. Without them, I would not be where I am today.

Acronyms

This section contains the acronyms mentioned in the document. This list can be used as a quick reference.

DAC Data Access Component

DFRTA Deterministic Frontier-to-Root Tree Automaton

DRFTA Deterministic Root-to-Frontier Tree Automaton

FR Frontier-to-Root

JRE Java Runtime Environment

LCL Lazarus Component Library

LHS Left Hand Side

MDV Multiple Data View

NFRTA Nondeterministic Frontier-to-Root Tree Automaton

NRFTA Nondeterministic Root-to-Frontier Tree Automaton

RF Root-to-Frontier

RHS Right Hand Side

RTG Regular Tree Grammar

SDV Single Data View

STF Smallest Tree First

SWT Standard Widget Toolkit

TA Tree Automaton

TTF Tallest Tree First

Contents

Abstract
Acknowledgments
Acronyms

0 Introduction

1 Domain
  1.1 Basic concepts
    1.1.1 Trees
    1.1.2 Tree languages
    1.1.3 Regular tree grammars
    1.1.4 Tree patterns
    1.1.5 Finite tree automata
  1.2 Problems of interest
  1.3 Application areas

2 Research into existing toolkits
  2.1 ATerms
  2.2 BEG
  2.3 BURG
  2.4 iBURG
  2.5 Timbuk
  2.6 Treebag
  2.7 Twig
  2.8 Summary

3 The Toolkit and GUI
  3.1 ForestFIRE
    3.1.1 Trees
    3.1.2 Regular tree grammars
    3.1.3 Tree patterns
    3.1.4 Tree automata
  3.2 FIREWood
    3.2.1 Architecture
    3.2.2 Resulting user interface
  3.3 Implementation details

4 Experiments
  4.1 Used tree grammars
    4.1.1 Thesis grammar
    4.1.2 iBurg standard grammars
    4.1.3 Report of ten Eikelder
    4.1.4 Mono project grammars
  4.2 Grammar transformation experiments
    4.2.1 RED-Z
    4.2.2 RED-U
    4.2.3 Influence of RED-Z/RED-U order
    4.2.4 RED-Z*/U* order with reuse
    4.2.5 RED-Z node selection strategies
    4.2.6 Conclusion
  4.3 Automaton construction experiments
    4.3.1 Measurement techniques
    4.3.2 Automaton construction: general issues
    4.3.3 Constructions of nondeterministic automata
    4.3.4 Constructions of deterministic automata
    4.3.5 Conclusion
  4.4 DFRTA based tree parsing experiments
    4.4.1 The parsing algorithm
    4.4.2 Automaton comparison

5 Conclusions
  5.1 Results
  5.2 Recommendations for future work
  5.3 Evaluation

A MSc Assignment description
  A.1 Original assignment description
  A.2 Additional data structure requirements

B Formal definitions
  B.1 Tree related definitions
  B.2 Tree grammar related definitions
  B.3 Tree automata related definitions

C ForestFIRE library
  C.1 Basic collections
    C.1.1 List
    C.1.2 Dictionary
    C.1.3 Set
  C.2 Trees
    C.2.1 Data structures
    C.2.2 Invariants
    C.2.3 Related algorithms
  C.3 Regular tree grammars
    C.3.1 Data structures
    C.3.2 Invariants
    C.3.3 Related algorithms
  C.4 Tree patterns
    C.4.1 Data structures
    C.4.2 Invariants
    C.4.3 Related algorithms
  C.5 Tree automata
    C.5.1 Data structures
    C.5.2 Invariants
    C.5.3 Related algorithms

D FIREWood file format
  D.1 Alphabets
  D.2 Trees
  D.3 Tree grammars
  D.4 Tree patterns and pattern collections
  D.5 Example

E Tree automaton construction – results

Bibliography

0 Introduction

The Software Engineering and Technology (SET) expertise group at the Department of Mathematics and Computer Science of the Technische Universiteit Eindhoven (TU/e) has as its main objective to create methods and supporting tools for the development and maintenance of reliable software. Compilers, to which this master's project is related, play an important role in supporting the software development process. A few years ago Loek G.W.A. Cleophas started his PhD research, which focuses on algorithms for the tree parsing, matching and acceptance problems. These algorithms can be used in many fields; one of these fields is compilers. To be more precise: the final part of the software compilation process needs to translate a generated parse tree into a sequence of instructions that can be executed on a machine. This instruction selection process can be done efficiently by using tree parsing algorithms. Cleophas collected a large collection of algorithms in this area and described them in a general form in his PhD thesis [Cle07]. This collection contains parsing, matching and acceptance algorithms, but also supporting algorithms and structures that are used to solve these problems efficiently. The goal of this master's project was to build a toolkit containing (a subset of) these algorithms, especially tree grammar transformations, tree automaton construction algorithms and tree parsing algorithms that use tree automata. This toolkit should, in combination with a graphical user interface, provide an environment to experiment with these algorithms. This practical experience was to be used to gain insight into these algorithms, their properties, and their applicability in practice.

The starting point for this Master's project was the original assignment description created by Cleophas and an additional list of requirements (see Appendix A). The assignment description provided three lists of requirements for the toolkit, with different priorities. These requirements describe the domain concepts, like trees and grammars, and the algorithms that needed to be implemented. The additional list of requirements contained specific operations for the data structures that implement the domain structures. During this project, as many of these requirements as possible were implemented. The design and implementation of the toolkit were preceded by an investigation of existing tree (grammar) toolkits. This study was used to see how these toolkits implement basic domain structures, like trees and tree grammars. This knowledge and the formal descriptions of the domain structures and related algorithms in Cleophas' draft PhD thesis were then used to implement the toolkit. Finally, the implemented toolkit and GUI were used to perform a collection of experiments. These experiments focus on tree grammar transformations, tree automaton constructions and tree parsing. In summary, this led to the following project phases:

• Studying toolkits using tree structures and algorithms related to them
• Designing and implementing the toolkit and GUI


• Performing experiments with the toolkit and GUI

These phases were executed in a time frame of about eight months. We will describe each of these phases in detail in the upcoming chapters. However, before discussing the details of the assignment itself there will be an introduction to the domain, including trees, regular tree grammars, tree automata and the problems of tree acceptance, matching and parsing.

1 Domain

This chapter introduces the problem domain in which the toolkit operates: the domain of regular tree languages on ranked trees. Concepts of this domain that are important to the toolkit are introduced in this chapter. These concepts can also be found in more detail in Loek Cleophas' draft PhD thesis [Cle07, Chapter 3]. After introducing these concepts, a description is given of the domain-specific problems described in [Cle07]: tree acceptance, tree parsing and tree pattern matching. Finally, an overview is given of the application areas for these three domain problems.

• Basic concepts
  – Trees
  – Regular tree languages
  – Regular tree grammars
  – Tree patterns
  – Tree automata
• Problems of interest
  – Tree acceptance
  – Tree pattern matching
  – Tree parsing
• Application areas

1.1 Basic concepts

Each of the upcoming sections will discuss one of the basic concepts in the domain of regular tree languages. These concepts are introduced to simplify the explanation of the domain specific problems and to provide the basic knowledge needed to read the rest of the report. All of the concepts discussed in this section are more or less generalizations of similar concepts in the string domain [Lin01]. Tree grammars are for instance comparable to string grammars. This tree-string relation will therefore sometimes be used to clarify the tree concepts.

1.1.1 Trees

As described in the introduction, this report focuses on the domain of regular tree languages on ranked trees; to be more precise, on ranked, ordered, node-labeled trees. Node labeled means that each node in a tree contains a symbol as its label (see Figure 1.1).


Figure 1.1: Node labeled tree (in prefix notation: a(a(b(c), c), c, c))

The term ranked implies that the symbol labeling a node determines the number of child nodes of that node. A symbol is not allowed to appear in nodes with different numbers of child nodes. The tree in Figure 1.1, for example, is not a ranked tree, because there are two a-nodes with a different number of child nodes. Removing one of the c-children of the root a-node turns this tree into a ranked tree. Figure 1.2 shows this modified tree. Symbol a in this tree has rank 2, b has rank 1 and c has rank 0.

Figure 1.2: Ranked tree (in prefix notation: a(a(b(c), c), c))

Ordered addresses the fact that the child nodes of a parent node are ordered. A node with a symbol of rank n has n child nodes, where the leftmost child is considered the first child and the rightmost child the nth child. Exchanging child nodes/subtrees results in a different tree if the exchanged items are not equal (see Figure 1.3).

Figure 1.3: Two different trees (a(c, d) ≠ a(d, c))

This description of trees is quite informal. Their formal definition is based on the concept of tree domains (see Definition B.1.1). A tree domain describes the paths to all nodes and thereby describes the bare structure of a tree (see Example 1.1.1).

Example 1.1.1 This example shows the structure of the tree of Figure 1.2 as a tree domain. This is the tree domain in set notation:

{ε, 1, 2, 1 · 1, 1 · 2, 1 · 1 · 1}


Each of the elements is a path to a node. The ε refers to the root, 1 to the first child of the root, 2 to the second child of the root, 1 · 1 to the first child of the node that is the first child of the root, and so on. This tree domain represents the complete tree structure.

The tree labeling function (Definition B.1.2) is used in addition to the paths to define the labels of the nodes, and thereby creates a labeled tree. This labeling function, based on an alphabet of symbols, defines a relation for each path (see Example 1.1.2).

Example 1.1.2 This example shows the tree domain of Example 1.1.1 with the tree labeling function. This is the original tree domain:

{ε, 1, 2, 1 · 1, 1 · 2, 1 · 1 · 1}

This is the labeling function from the tree domain to the alphabet {a, b, c} that turns the tree domain into the tree of Figure 1.2:

{(ε, a), (1, a), (2, c), (1 · 1, b), (1 · 2, c), (1 · 1 · 1, c)}

Each element from the tree domain is now linked to a symbol from the alphabet. This therefore corresponds to a tree where each node has a label and in this case results in the tree depicted in Figure 1.2.

The representation in Example 1.1.2 does not oblige symbols to have a fixed rank. A ranked labeled tree can be constructed by using a so-called ranked alphabet (Definition B.1.2) in the labeling function instead of a normal alphabet. A ranked alphabet defines a rank for each symbol. Using the ranked alphabet {(a, 2), (b, 1), (c, 0)} in Example 1.1.2 would turn it into a ranked tree.

The tree domain representation, as shown, can be used to represent trees. These definitions are often used in descriptions of algorithms related to trees. However, this report will mostly use the graphical representation or the corresponding prefix notation, because it is more compact and easier to read.
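To make the tree concept concrete in code, the following Java sketch shows one possible encoding of a ranked, ordered, node-labeled tree. The class names and methods are illustrative assumptions, not ForestFIRE's actual data structures (those are described in Chapter 3 and Appendix C):

import java.util.Arrays;
import java.util.List;

// A symbol from a ranked alphabet: the rank fixes the number of children
// of every node labeled with this symbol.
final class Symbol {
    final String name;
    final int rank;
    Symbol(String name, int rank) { this.name = name; this.rank = rank; }
}

// A node of a ranked, ordered, node-labeled tree; children are ordered,
// index 0 being the first (leftmost) child.
final class Node {
    final Symbol symbol;
    final List<Node> children;
    Node(Symbol symbol, Node... children) {
        if (children.length != symbol.rank)
            throw new IllegalArgumentException("rank mismatch for " + symbol.name);
        this.symbol = symbol;
        this.children = Arrays.asList(children);
    }
    // Prefix notation, e.g. a(a(b(c), c), c) for the tree of Figure 1.2.
    public String toString() {
        if (children.isEmpty()) return symbol.name;
        StringBuilder sb = new StringBuilder(symbol.name).append('(');
        for (int i = 0; i < children.size(); i++)
            sb.append(i > 0 ? ", " : "").append(children.get(i));
        return sb.append(')').toString();
    }
}

With symbols a (rank 2), b (rank 1) and c (rank 0), the tree of Figure 1.2 would be built as new Node(a, new Node(a, new Node(b, new Node(c)), new Node(c)), new Node(c)).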

1.1.2 Tree languages

The concept of a tree language is comparable to the concept of a string language. A tree language is defined as a set of trees, whereas a string language is defined as a set of strings. The concept of a tree language is important because it helps describe subsets of all possible trees. These languages facilitate problem descriptions like: given a tree t and a tree language l, determine whether t ∈ l. Tree languages are usually not represented directly by a set of trees, but are defined indirectly by a tree grammar (covered in Section 1.1.3). This is again similar to string languages and string grammars.

1.1.3 Regular tree grammars

A regular tree grammar (rtg) is a grammar that can be used to generate trees. These trees are generated in the same way that strings are generated by string grammars: by applying productions to nonterminal symbols, starting with the start symbol. A regular tree grammar

is defined by a five-tuple consisting of: two symbol alphabets N and Σ (nonterminal and terminal symbols, respectively), a ranking function r, a start symbol S, and a set of production rules Prods (sometimes abbreviated as P). Each production rule is of the form A → α, where A, the left hand side (lhs), is a nonterminal and α, the right hand side (rhs), is a tree. This tree can contain both terminal and nonterminal symbols, with the restriction that nonterminals can only be present at leaves (nonterminals therefore always have rank 0). Such a regular tree grammar is comparable to a regular string grammar in right regular form. Definition B.2.1 of Appendix B contains the formal definition. The nonterminal and terminal alphabets together with the ranking function define which symbols can be used in the production rules. Example 1.1.3 contains an example tree grammar. This grammar will be used as a running example throughout this section. Furthermore, in this report we will use capitals for nonterminals and lowercase characters for terminals.

Example 1.1.3 Let G = (N, Σ, r, Prods, S) where

• N = {S, B}
• Σ = {a, b, c, d}
• r = {(S, 0), (B, 0), (a, 2), (b, 1), (c, 0), (d, 0)}
• Prods = {(1) S → a(B, d), (2) S → a(b(B), c), (3) S → c, (4) B → b(B), (5) B → S, (6) B → d}

Creating a tree with such a grammar starts at the nonterminal start symbol. A rule (whose lhs is this start symbol) is applied to the start symbol, replacing the symbol with the right hand side of the production rule. This process is repeated until no nonterminals remain. An example tree construction can be seen in Example 1.1.4.

Example 1.1.4 Using the grammar of Example 1.1.3, Figure 1.4 shows a possible sequence of applications of productions, starting at the start symbol. The first application is of rule one, which replaces the start symbol S with the rhs tree of that rule. This is followed by the applications of rules four and six, finally resulting in the tree on the right.

S ⇒(1) a(B, d) ⇒(4) a(b(B), d) ⇒(6) a(b(d), d)

Figure 1.4: Example construction of a tree based on the grammar of Example 1.1.3

The tree constructed in Example 1.1.4 is one of the many possible trees that can be constructed using the example grammar. The complete set of trees that can be constructed is the tree language produced by that grammar, which is denoted as Lrtg(G) for a regular tree grammar G.
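The five-tuple itself maps naturally onto a small data structure. The following is a minimal Java sketch (hypothetical names, building on the Symbol and Node classes sketched in Section 1.1.1; the toolkit's actual structures are described in Appendix C) that folds the ranking function r into the Symbol objects:

import java.util.List;
import java.util.Set;

// A production rule A → α: the lhs is a nonterminal (rank 0); the rhs is
// a tree over N and Σ with nonterminals only at the leaves.
final class Production {
    final Symbol lhs;
    final Node rhs;
    Production(Symbol lhs, Node rhs) { this.lhs = lhs; this.rhs = rhs; }
}

// The five-tuple (N, Σ, r, Prods, S); the ranking function r is carried
// by the Symbol objects themselves.
final class RegularTreeGrammar {
    final Set<Symbol> nonterminals;  // N
    final Set<Symbol> terminals;     // Σ
    final List<Production> prods;    // Prods
    final Symbol start;              // S
    RegularTreeGrammar(Set<Symbol> nonterminals, Set<Symbol> terminals,
                       List<Production> prods, Symbol start) {
        this.nonterminals = nonterminals;
        this.terminals = terminals;
        this.prods = prods;
        this.start = start;
    }
}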


Before proceeding with tree automata some detail is given about possible special shapes of regular tree grammars. Two topics are discussed: grammar characteristics, and reachability and productivity.

Grammar characteristics

There are two special characteristics of regular tree grammars described in [Cle07]. These characteristics describe the presence of production rules with distinctively shaped right hand sides. The first characteristic describes whether a grammar may contain (U+) or may not contain (U−) production rules in which the right hand side is a single nonterminal. These rules are called chain rules or unit productions. The grammar in Example 1.1.3 is, for instance, a U+ grammar, because rule number five is a chain rule. Removing this rule would convert the grammar into a U− grammar (and also change the language produced by it). The second characteristic concentrates on the opposite case, where the right hand sides contain large trees with so-called Z-nodes. These Z-nodes are non-root nodes that contain terminals (for example the d-node in the first rule of the example grammar). A grammar that may contain rules with Z-nodes is called a Z+ grammar, and the term Z− grammar is used for the opposite case. The example grammar is a Z+, U+ grammar. Converting this grammar to Z−, U− results in a grammar that only contains production rules where the rhs has one level containing a single terminal (A → a) or has at most two levels with no terminal leaves (A → a(B0, ..., Bn)). The conversions are done by applying the transformations known as RED-Z and RED-U (described in Chapter 4), which respectively convert a Z+ grammar into a Z− grammar and a U+ grammar into a U− grammar. These conversions and their effects on the shape of the grammars were the subject of the experiments and are discussed in detail in Chapter 4 of this report.
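Both characteristics are simple syntactic checks on the right hand sides of the production rules. A minimal sketch, assuming the hypothetical grammar encoding from the previous sketches:

// U+/Z+ checks on a grammar: a chain rule has a lone nonterminal as rhs;
// a Z-node is a non-root rhs node labeled with a terminal.
final class GrammarShape {
    // U+: some production's rhs is a single nonterminal (a chain rule).
    // Nonterminals have rank 0, so such an rhs is just one leaf node.
    static boolean isUPlus(RegularTreeGrammar g) {
        for (Production p : g.prods)
            if (g.nonterminals.contains(p.rhs.symbol)) return true;
        return false;
    }

    // Z+: some production's rhs contains a terminal below the root.
    static boolean isZPlus(RegularTreeGrammar g) {
        for (Production p : g.prods)
            for (Node child : p.rhs.children)
                if (containsTerminal(child, g)) return true;
        return false;
    }

    private static boolean containsTerminal(Node n, RegularTreeGrammar g) {
        if (g.terminals.contains(n.symbol)) return true;
        for (Node c : n.children)
            if (containsTerminal(c, g)) return true;
        return false;
    }
}

Applied to the running example grammar, isUPlus would report true because of rule five, and isZPlus would report true because of the d-node in rule one.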

Reachability and productivity

The subject of reachability and productivity concerns the usability of production rules and symbols of a grammar. Some production rules and symbols in a grammar may be useless for constructing trees. These useless items can be divided into two types: start unreachable rules or symbols, and unproductive rules or symbols. Start reachable symbols are symbols which can be reached from the start symbol, and start reachable production rules are rules whose lhs is a start reachable symbol. A symbol B is called reachable from another symbol A if B can be reached by applying any sequence of production rules from A (see Definitions B.2.2 and B.2.3). Symbols and rules which are not reachable from the start symbol are useless in a grammar. Example 1.1.5 shows a grammar that contains such start unreachable symbols and rules.

Example 1.1.5 The running example grammar of this chapter (Example 1.1.3) does not contain any unreachable symbols or rules. The concept of reachability can be nicely illustrated by adding symbols and production rules to the example grammar. The first items to be added are the nonterminal C and the following production rule:

(7) C → b(d)


Nonterminal C is not present in any of the other six rules and is therefore unreachable from the start symbol S. A consequence of this is that rule seven is also unreachable. Nonterminals that are only reachable via C are also unreachable. This can be shown by adding the following two production rules and nonterminal D:

(8) C → b(D), (9) D → c

Nonterminal D is present in the right hand side of rule eight, but it still qualifies as unreachable, because rule eight is unreachable.

These unreachable productions can be removed from the grammar because they do not contribute to the tree derivation process.

Unproductive rules and symbols cause similar problems. A nonterminal is called productive if there exists a productive production rule in the grammar with that symbol as its lhs; a terminal is always considered productive (see Definition B.2.4). A rule is marked as productive if there is no unproductive symbol in its right hand side. The uselessness of unproductive symbols and rules is illustrated by Example 1.1.6.

Example 1.1.6 This example illustrates unproductive symbols and production rules. As in Example 1.1.5, the standard example grammar does not contain unproductive items. Unproductiveness is illustrated by adding additional rules to the grammar of Example 1.1.3. Only the nonterminals C and D and the following rule are needed to illustrate both types of unproductive items:

(7) C → b(D)

There is no way to obtain a tree from nonterminal D, hence this nonterminal is marked as unproductive. If one were to apply rule seven, the derivation would get stuck, because one cannot replace the nonterminal D. This also explains why a rule with an unproductive nonterminal in its right hand side is called unproductive.

Unproductive rules and symbols are just as useless as unreachable items. One may therefore remove them from a grammar.

However, we must remark that not only unproductive and unreachable symbols/rules are useless. An unproductive symbol can also make other symbols and production rules useless, even when these are not unproductive or unreachable themselves. Assume the rhs of a production rule contains an unproductive symbol X. If this rhs also contains a (reachable and productive) symbol Y that is present in no other rhs, then Y is useless as well. The same holds for the production rules with Y as lhs.
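The set of productive nonterminals can be computed with a straightforward fixpoint iteration over the rules; reachability is computed analogously, starting from the start symbol and following the nonterminals in the right hand sides. A sketch, again under the hypothetical encoding used earlier:

import java.util.HashSet;
import java.util.Set;

final class Productivity {
    // Grow the set of productive nonterminals until nothing changes: a
    // nonterminal becomes productive once some rule with it as lhs has
    // an rhs whose nonterminal leaves are all already productive.
    static Set<Symbol> productiveNonterminals(RegularTreeGrammar g) {
        Set<Symbol> productive = new HashSet<>();
        boolean changed = true;
        while (changed) {
            changed = false;
            for (Production p : g.prods)
                if (!productive.contains(p.lhs) && rhsProductive(p.rhs, g, productive)) {
                    productive.add(p.lhs);
                    changed = true;
                }
        }
        return productive;
    }

    // Terminals are always productive; nonterminal leaves must already
    // be in the productive set.
    private static boolean rhsProductive(Node n, RegularTreeGrammar g, Set<Symbol> prod) {
        if (g.nonterminals.contains(n.symbol)) return prod.contains(n.symbol);
        for (Node c : n.children)
            if (!rhsProductive(c, g, prod)) return false;
        return true;
    }
}

On the grammar of Example 1.1.6 this computation would never add D to the set, marking it (and rule seven) unproductive.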


1.1.4 Tree patterns

Tree patterns are in essence trees in which variable symbols may occur, with the restriction that these variables may only be present in leaf nodes. These trees are called patterns because the variables represent arbitrary subtrees. Figure 1.5 shows such a tree pattern with a variable ν. This pattern matches the tree on the right hand side of that figure.

Figure 1.5: Tree pattern a(ν, c) that matches the tree a(a(b(c), c), c)

Computing whether a pattern matches a subtree is one of the interesting problems related to tree patterns. One of the three major domain problems is a variant of this matching problem. This domain problem will be discussed in detail in Section 1.2.
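A naive matching check follows directly from this description. The sketch below is hypothetical code (reusing the Symbol and Node classes sketched earlier) that treats one distinguished rank-0 symbol as the variable ν:

final class PatternMatching {
    // The variable symbol; in a ranked alphabet it has rank 0 (a leaf).
    static final Symbol NU = new Symbol("ν", 0);

    // Does 'pattern' match the subtree rooted at 'subject'? A variable
    // leaf matches any subtree; otherwise the symbols must agree, and in
    // a ranked alphabet equal symbols imply equally many children.
    static boolean matchesAt(Node pattern, Node subject) {
        if (pattern.symbol == NU) return true;
        if (!pattern.symbol.name.equals(subject.symbol.name)) return false;
        for (int i = 0; i < pattern.children.size(); i++)
            if (!matchesAt(pattern.children.get(i), subject.children.get(i))) return false;
        return true;
    }
}

Finding all occurrences of a pattern in a tree then amounts to calling matchesAt at every node of the input tree; the automaton-based techniques discussed in Section 1.2 address the same problem more systematically.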

1.1.5 Finite tree automata

Finite tree automata (tas) play the same role in the domain of regular tree languages as finite string automata do in the area of regular string languages. Tree automata, like string automata, can be divided into categories based on two criteria: determinism and processing direction. There are deterministic and nondeterministic tree automata, which in turn can be divided into root-to-frontier (top-down) and frontier-to-root (bottom-up) automata. Let us first discuss the general shape of a tree automaton before addressing the differences among the four variants. The formal definition describes a tree automaton as a 6-tuple (see Definition B.3.1). Informally, a tree automaton, like a string automaton, consists of a set of states, a set of transitions, root/leaf accepting states and an alphabet. The root accepting states and leaf accepting states are comparable to the start states and accepting states of string automata, depending on the processing direction of the ta. Root-to-frontier (rf) processing starts at the top, so the root accepting states are used as start states and the leaf accepting states as accepting states, where each leaf with a symbol x must match a leaf accepting state with an incoming x-transition. This is the other way around for frontier-to-root (fr) automata. However, the shape of the transitions in tree automata is different from that in string automata. This is caused by the ranked alphabet of symbols used in this domain.

Figure 1.6: Transitions in tree automata

Figure 1.6 shows such a tree automaton transition, which is clearly different from the ones used in string automata. Transitions for a symbol in a string automaton are defined as an element of the set Q × Q, where Q is the set of states. In a tree automaton a transition is not a relation between two states, but between one state and a vector of states. To be more precise, the transition relation for a symbol a is defined as Q × Q^n, where n is r(a), in other words the rank of a. This is depicted in Figure 1.6, where each transition consists of a single edge, a small circle and n branches. There are n branches because a transition on a symbol a of rank n corresponds to a node with symbol a that has n child nodes. Figure 1.7 shows what a tree automaton with multiple transitions can look like. Tree automata are in fact a generalization of string automata: a tree automaton that only contains transitions for symbols of rank one can be treated as a string automaton. Removing the small circles (where transitions split for symbols of rank two or higher) will more or less turn such a tree automaton into a string automaton.

Figure 1.7: Example (rf) tree automaton

The transitions in such an automaton can be undirected, root-to-frontier directed or frontier-to-root directed. An undirected automaton allows processing in both the root-to-frontier and frontier-to-root direction, and therefore uses undirected transitions. The other two variants are used in rf and fr automata, respectively. Figure 1.7, for instance, shows an rf automaton. The arrows in this automaton point from the parent to the child states. The arrows in an fr automaton point in the other direction. This report will focus on the directed automata, because they are more useful in practice.

There is also a difference when looking at determinism, as discussed in the first paragraph. This difference between nondeterministic and deterministic tree automata will be discussed in the upcoming two sections. These sections will also provide further details on rf and fr variants for both these deterministic and nondeterministic automata.

Nondeterministic tree automata

The phenomenon of nondeterminism can be described in a similar way as for string automata. Assume one has an rf tree automaton. If this tree automaton is nondeterministic, then there can be an arbitrary number of transitions for a symbol a (or for the ε-symbol) from a given state. This can also be seen in Definition B.3.2, which defines the relation for a symbol a (Ra) of a nondeterministic automaton (nrfta) as Ra ∈ Q → P(Q^n), where n is again the rank of symbol a.


Figure 1.7 shows such an nrfta. The a-transitions starting in state qs provide an example of this nondeterminism, because there are two a-transitions that result in different state vectors ((q2, q3) and (q3, q4)). The same can be observed for an nfrta if we convert the rf variant in Figure 1.7 into an fr one. Our starting point is now the state vector (q3, q4). We again look at the possible a-transitions and encounter two of them, leading to states qs and q3. This illustrates how both rf and fr tree automata can be nondeterministic.
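The relation Ra ∈ Q → P(Q^n) can be encoded directly as nested maps. The following Java sketch is an illustrative encoding (integer states, hypothetical names) in which a lookup yields a set of child-state vectors; having more than one vector in such a set is exactly the nondeterminism described above:

import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class Nrfta {
    // For each symbol name and parent state, the set of child-state
    // vectors (each of length r(symbol)) allowed by the relation.
    final Map<String, Map<Integer, Set<List<Integer>>>> relation = new HashMap<>();

    // All child-state vectors reachable from 'state' on 'symbol'.
    Set<List<Integer>> step(String symbol, int state) {
        return relation
            .getOrDefault(symbol, Collections.emptyMap())
            .getOrDefault(state, Collections.emptySet());
    }
}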

Another property shared with string automata is the possible presence of so-called ε-transitions in nondeterministic tree automata. These are transitions that can be taken without reading a symbol. Nondeterministic tree automata that contain such ε-transitions are denoted by the terms εnrfta and εnfrta, depending on the processing direction. Figure 1.8 shows an example of an εnrfta.

Figure 1.8: Example ta with ε-transitions

State qs in Figure 1.8 has three outgoing ε-transitions, so from qs it is possible to proceed to q0, q1 and q2 without reading a symbol. A consequence of this is that an automaton containing ε-transitions is nondeterministic, because being in state qs also means being in the states reachable from qs by an ε-transition.

This nondeterminism has an effect on the usage of these automata. If one, for example, has an nrfta and starts processing the root symbol of the tree in a root accepting state, then this can result in multiple state vectors for the child nodes due to the nondeterminism. All these vectors have to be considered when proceeding to read the child symbols. This makes the use of nondeterministic tree automata less efficient, similar to the use of nondeterministic string automata.

Deterministic tree automata

For each state in a deterministic rf automaton (drfta) there is at most one transition for each possible symbol, and there exist no ε-transitions; likewise, in the fr variant of a deterministic automaton (dfrta) there is at most one transition for each symbol/state vector combination. This determinism can also be seen in the definitions of these tree automata (see Definitions B.3.4 and B.3.5). For instance, the transitions Ra for a symbol a of rank n in a drfta are defined as Ra ∈ Q → Q^n, where the transition set in an nrfta for the same symbol a is defined as Ra ∈ Q → P(Q^n). The absence of the powerset shows that each transition for a certain symbol and state can only result in one vector in a drfta.
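This determinism makes a frontier-to-root run particularly simple: each node receives exactly one state, computed bottom-up from the states of its children. The following Java sketch is a hypothetical encoding of such a dfrta run (integer states, reusing the Node class sketched in Section 1.1.1), not the toolkit's implementation:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class Dfrta {
    // (symbol name, vector of child states) -> the unique resulting state.
    final Map<String, Map<List<Integer>, Integer>> delta = new HashMap<>();
    final Set<Integer> rootAccepting;
    Dfrta(Set<Integer> rootAccepting) { this.rootAccepting = rootAccepting; }

    // Frontier-to-root run: leaves (rank 0) look up the empty vector;
    // returns null when no transition applies and the run gets stuck.
    Integer run(Node t) {
        List<Integer> childStates = new ArrayList<>();
        for (Node c : t.children) {
            Integer q = run(c);
            if (q == null) return null;
            childStates.add(q);
        }
        Map<List<Integer>, Integer> forSymbol = delta.get(t.symbol.name);
        return forSymbol == null ? null : forSymbol.get(childStates);
    }

    // Accept when the state computed for the root is root accepting.
    boolean accepts(Node t) {
        Integer q = run(t);
        return q != null && rootAccepting.contains(q);
    }
}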

It is important to mention that not all tree automata have the same accepting power. It is well known that the accepting power of the DRFTA is less than that of the other three automaton types (see e.g. [Cle07, Lemma 3.6.30]), so in practice most of the attention in the area of deterministic tree automata will go to DFRTAs.

1.2 Problems of interest

As described earlier, the PhD research of Loek Cleophas focuses on solutions for three algorithmic problems in the area of regular tree languages. These problems are defined in more detail here. Let us start with the tree acceptance problem (Definition 1.2.1), the simplest of the three. The research in [Cle07] focused on solving this problem by constructing tree automata based on the rtg and processing the input tree with these automata. If an automaton accepts the tree, then the tree can be generated by the rtg.

Definition 1.2.1 Tree acceptance: Given a regular tree grammar and an input tree, determine whether the input tree can be generated by the regular tree grammar, i.e. is part of the language denoted by the regular tree grammar.

The second problem focuses on tree patterns instead of tree grammars (Definition 1.2.2).

Definition 1.2.2 Tree pattern matching: Given a finite, non-empty set of trees (the pattern set) and an input tree, find all occurrences of the patterns in the input tree.

What needs to be determined, for each node in the input tree, is which patterns match at that node. This problem can again be solved by using tree automata: by constructing a tree automaton from the pattern set and processing the input tree with it. The states of the automaton correspond to certain patterns of the pattern set, so what needs to be done is to record which state is assigned to which node of the tree in an accepting computation of the tree automaton. The final problem, called tree parsing (Definition 1.2.3), is an extension of the tree acceptance problem. However, for tree parsing one does not only want to know whether a tree can be produced by a grammar, but also how it can be produced; in other words, which production rule needs to be applied at each node of the input tree to obtain the complete tree.

Definition 1.2.3 Tree parsing: Given a regular tree grammar and an input tree, determine all parses of the input tree that can be generated by the regular tree grammar. A variation of this problem that is often used is to determine a parse that is optimal (with respect to some cost function).

This can again be solved by creating an automaton based on the rtg and processing the tree with this automaton. However, one has to do some extra bookkeeping during processing (as for the tree pattern matching problem) to obtain and store the parse information.


This report will focus on algorithms related to these problems, especially the tree acceptance and tree parsing problems. The algorithms of interest manipulate the shape of the input grammars (to remove useless items, as discussed in Section 1.1.3) and construct automata from these grammars. These algorithms and the experiments with them are discussed in detail in Chapter 4.

1.3 Application Areas

There is a significant collection of application areas where these three problems in the regular tree language domain play a role:

• Code generation in compilers, particularly for instruction selection or optimization
• Term rewriting and unification
• Genetics, in particular DNA/RNA pattern matching
• XML document processing
• Type inference
• Protocol verification, particularly for cryptography and network protocols

Concepts in these domains can be translated to trees; terms, for instance, can be represented as trees. However, our focus is on the topic of code generation. In a compiler, a string (source code) is translated into a parse tree by lexical and syntactical analysis [ALSU07]. The resulting parse tree has to be translated into a sequence of machine instructions. This translation into machine instructions is called code generation/instruction selection. Figure 1.9 gives a simplified overview of this compilation process for an example expression.

Figure 1.9: Compilation process: the expression R1 := c1 + M[c2 + R2] is converted into the tree +(c1, M(+(c2, R2))), which is then translated into a sequence of five MOV/ADD instructions.

The phase in which the intermediate tree is translated into instructions can be described as a tree parsing problem. To do this, one has to translate the instruction set of the target architecture into a tree grammar. Each instruction has to be converted into a single production. The lhs of the production rule has to correspond to the register in which the result of the instruction is stored, while the rhs of the created production rule must represent the semantics of the instruction. An example of an instruction set with corresponding production rules can be seen in Table 1.1.


Instruction        Production rule
MOV #c, Ri         Ri → c
MOV M[Rj], Ri      Ri → M(Rj)
ADD Ri, Rj         Ri → +(Ri, Rj)

Table 1.1: Three instructions with their corresponding production rules

The derived tree grammar and the intermediate tree can then be used as input to an algorithm that solves the tree parsing problem. This algorithm computes which rules from the grammar have to be applied to obtain the source tree, and thereby describes which instructions have to be selected to compute the expressions that correspond to (parts of) the tree. The experiments on rtg transformations and ta constructions at the end of this report present algorithms that can be used in solutions to these problems. Especially the ta constructions are useful for solving the instruction selection problem. The ta construction algorithms can construct automata from tree grammars, and the resulting automata can be used to parse a tree and thereby compute which production rules correspond to which nodes of the subject tree. Each rule corresponds to an instruction, so the parse result can then be used to determine which instructions need to be executed. More details about how such an automaton is used to parse a tree can be found in the final section of the chapter about the experiments.
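To illustrate just the final step (hypothetical names; the parsing algorithm itself is treated in Chapter 4): once a tree parse has attached a production rule to every node, emitting code reduces to a bottom-up walk that outputs the instruction associated with each node's rule.

import java.util.List;
import java.util.Map;

final class InstructionEmitter {
    // The instruction template attached to each production, as in Table 1.1.
    final Map<Production, String> instructionFor;
    InstructionEmitter(Map<Production, String> instructionFor) {
        this.instructionFor = instructionFor;
    }

    // 'ruleAt' is the parse result: the production chosen for each node.
    // Children are emitted first, so operands are computed before use.
    void emit(Node t, Map<Node, Production> ruleAt, List<String> out) {
        for (Node c : t.children) emit(c, ruleAt, out);
        Production rule = ruleAt.get(t);
        if (rule != null && instructionFor.containsKey(rule))
            out.add(instructionFor.get(rule));
    }
}

This sketch glosses over register assignment, chain rules and cost-optimal rule selection, all of which a real instruction selector has to handle.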

2 Research into existing toolkits

The starting point of the assignment was to investigate existing toolkits in the area of regular tree grammars and related algorithms. The primary goal was to analyze the design of these toolkits and to find out whether some parts of their designs could be reused in the design of our toolkit. Our toolkit contains a large subset of the domain concepts discussed in Chapter 1. The research into existing toolkits therefore focused on the data structures used for representing the main domain concepts: trees, their nodes, tree grammars and tree automata. Before discussing these toolkits in detail, I would like to start with the remark that most of the structures are quite straightforward. Trees, for instance, are defined by a root node; this root node points to its child nodes and in this way recursively defines the structure. Nodes have a terminal symbol and zero or more children, sometimes with a maximum of two due to the area of instruction selection applications, where most instructions have at most two arguments. Many of the systems also use other structures, such as tree automata and tree grammars. Most of the systems talk about trees; others treat them as terms. In the next sections I will give an overview of each system and discuss the details of the data structures used for representing trees, tree grammars and tree automata.

2.1 ATerms

ATerms [vdBdJKO00] is a C library for efficient representation and storage of terms (which can be used to build tree-like structures). The library offers a representation for storing the trees/terms in memory during program execution and a separate binary format for storing trees/terms after execution or exchanging them between programs that use the ATerms library. Some of this storage optimization is achieved by storing subtrees with multiple occurrences only once in a tree (called subtree sharing). Pointers to that single instance are then created for these subtrees. As mentioned, ATerms treats trees as terms. A term in ATerms can be an integer, real, string, function application, list, placeholder, blob or annotation pair. Trees can be constructed using the function application. A function application consists of a function symbol and an arbitrary number of terms as arguments. The function symbol is used to represent a node symbol and the arguments represent the subtrees. For example: 'PLUS'(term1, term2, ..., termn). The function application 'PLUS' represents a node labeled by a terminal with the name 'PLUS' and children term1 to termn. The internal representation in the software is quite straightforward. Each of the seven types of terms is mapped onto a piece of memory. The function application is stored as the name of the function application (PLUS in the example) and a list of pointers to the terms contained in it. The details for all of these terms can be found in [vdBdJKO00, Section 3.4]. This way a tree can be represented by the ATerm that represents its root node. This is very similar to the tree structure sketched in the introduction above.


Our toolkit does not only use trees, but also tree grammars and the like. Unfortunately, there is no built-in support for data structures like tree grammars, or for other data structures and algorithms related to our research field. However, one could construct tree grammars and tree automata using the optimized trees provided by ATerms. The ATerms tree structures also have some other interesting properties besides the efficient storage:

• ATerms can have terms with more than two children.
• Function applications containing the same terminal can have different numbers of arguments, so the terminals have no fixed rank.
• Terms support annotation, which could be used in e.g. tree parsing.
• ATerms has automatic garbage collection for its terms, which simplifies usage.

The annotation feature of the terms is very useful for our toolkit. It offers algorithms and associated data structures the possibility to store node-specific information. It is also possible to store multiple items in each node, because each block of information is stored under a specific key value, as in a dictionary. These key values are used to set and get the data item for that key. This creates a nice way to store information for different purposes in a structured manner. Such storage can be found in the other systems, but not in such a structured form. The dictionary-type annotation makes ATerms nodes more flexible than nodes in other systems. The only disadvantage is that terms can only be annotated with other terms. This makes ATerms annotation not very practical when it is used to store complex annotation data in parsing algorithms. In summary, ATerms contains nice concepts. Some of these, like garbage collection and structured annotation, are interesting features. Other features, like the optimized storage through subtree sharing, are less interesting, because they can result in difficulties when manipulating tree structures. When a subtree, for instance, has multiple occurrences in a tree (and is therefore shared) and is changed, then this subtree is changed everywhere. This is not always what a user wants; ATerms even calls such shared terms immutable. It is therefore not practical either to use ATerms in our toolkit or to reuse this optimization approach.
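The structured, dictionary-style annotation is the idea most worth carrying over. A generic illustration of the concept (deliberately not the ATerms C API, and with hypothetical names) wraps a node with a key-value store, so that algorithm-specific data stays out of the node structure itself:

import java.util.HashMap;
import java.util.Map;

// Key-based per-node annotation in the style described for ATerms: each
// algorithm stores its data under its own key, keeping nodes clean.
final class AnnotatedNode {
    final Node node;
    private final Map<String, Object> annotations = new HashMap<>();
    AnnotatedNode(Node node) { this.node = node; }

    void annotate(String key, Object value) { annotations.put(key, value); }
    Object annotation(String key) { return annotations.get(key); }
}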

2.2 BEG

BEG [ESL89] is a tool that generates, from a tree grammar, a compiler back end (instruction selector/code generator) that computes the minimal cost parse for an intermediate representation tree. The tree grammar specification is written in a language called BEGL. BEG uses such a specification to construct a Modula-2 back end that computes this minimal cost parse for an input tree. The documentation of BEG gives a clear overview of the data structure it uses for representing nodes of the input tree. The structure is very similar to the structure described in the introduction: a tree node is represented by a structure that contains a string for the terminal symbol and a list of other nodes that represent the children. BEG, like ATerms, has annotation possibilities in each node, but in BEG they are only targeted at the parsing algorithm. The parser stores in each node the matching production rules and their costs, in two separate arrays. Little attention is paid to the description of this design, even though there are other possibilities for storing this information. For instance,


BURG (discussed in Section 2.3) uses a slightly more complicated structure for storing parse data, which is more flexible and creates less pollution in the nodes. It could be that other designs were considered, but unfortunately no information about the design decisions can be found in the BEG description [ESL89]. Also, not much information can be found on how BEG represents tree grammars and their production rules. What can be gathered from the code samples in [ESL89] is that the trees on the right hand side of the production rules are comparable to the normal BEG trees. Normal BEG tree nodes contain two arrays for storing match information. These arrays are in theory not necessary for the rhs of a production rule. However, one cannot derive from the paper whether the same tree data structure is used or a stripped-down variant. In total, BEG uses straightforward data structures. The additional features, like the annotation option, are specifically targeted at the parsing algorithm used by BEG. These features are therefore not applicable to our toolkit, because our goal is to be more flexible.

2.3 BURG

BURG [FHP92b] is, just like BEG, a code generator toolkit. BURG emits a tree parser, called BURM, as a C program that computes the lowest cost parse of an input tree. This parser is constructed using the BURS technique [Pro95]. BURG is used by providing a grammar and a description of the tree structure to the toolkit. The BURS technique is then used to create a tree automaton based parser that is used by BURM to parse each input tree.

Let us start by looking at the tree structures used by BURG. The first information about BURG's data structures was found in the BURG input file format. The input file, used for defining the grammar and the shape of the input tree, showed that BURG uses conventional tree structures, where a tree is recursively defined starting at the root node. The user must define the basic shape of the input tree by defining (using the C #define directive) the four main fields of a node: the symbol field, the left and right child node, and an integer annotation field. The latter two items immediately introduce one of the shortcomings of the structures used by the BURG toolkit. The restriction to symbols with a rank of at most 2 is not explained, but it is probably related to the fact that BURG is used for instruction selection. Another remarkable item is the integer field that is used for annotation. Other toolkits, like BEG, store quite a lot of data in the tree itself; BURG uses integer values that refer to states of the BURS automaton used. Such an integer value can then be used to retrieve information related to the node. This information itself is stored in separate structures, like hash tables and arrays. This is a cleaner and more flexible solution than BEG's. Tree grammars are the next structures that need to be discussed. BURG reads the grammars that are specified in the input file and uses them to construct an automaton by applying the BURS technique. The tree grammar data structures used can, in contrast to the trees, not be found in the output BURM C file, but only in BURG itself. The paper describing BURG [FHP92b] does not contain any information about the data structures used for representing grammars; however, the source code of BURG provided an impression of these structures. A grammar is represented by a set of lists: a list of nonterminals, a list of terminals, a list of production rules and some auxiliary lists such as a list of chain rules. The

production rules, in turn, consist of an lhs nonterminal and an rhs pattern. Such a pattern is a tree without any non-root terminal node; it therefore only consists of a root symbol and a list of child symbols. BURG reads the input grammar and converts it into the Z− form, which it uses internally because it simplifies the construction of the BURS tree automaton. BURG further defines for each rule a number that can be used to refer to the production rule. These numbers are also used in the constructed BURS automaton. The BURS (fr) tree automaton in BURG is the final data structure that was important to study. This BURS automaton is represented as a standard tree automaton. It consists of a collection of states and a set of transition tables. The states are used as indices into these tables via a mapping function that translates them to integer indices. The states themselves are created for specific patterns. How these states and transitions are used for tree parsing can be found in [Pro95]. The BURS tree automaton has a conventional form; however, it is targeted at the special BURS tree parsing technique and is therefore not directly applicable to our toolkit. What can be learned from the BURS automaton is the usage of the two-dimensional transition tables. These tables allow transitions based on two indices (the indices refer to terminals/subtrees). This idea can be expanded to n-dimensional tables for symbols of rank n. Summarized, BURG provides implementation ideas for trees, tree grammars and tree automata. The tree and tree grammar structures are modeled similarly to those in other toolkits. The two most important exceptions are that there are no encapsulating grammar objects and that a single integer is used as a reference to annotation data instead of a complete structure. The advantage of BURG is that it also provides implementation details about the BURS tree automata used. This provides the idea of using state-indexed transition tables realized by mapping functions that translate states to unique integer values, sketched below.
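A minimal sketch of that idea, with illustrative names (BURG's own tables are generated C code): states are mapped to consecutive integers by a mapping function, and the transitions for a symbol of rank 2 become a two-dimensional array lookup. For a symbol of rank n the same idea yields an n-dimensional table.

import java.util.HashMap;
import java.util.Map;

// State-indexed transition table for one symbol of rank 2: the mapping
// function assigns each state a unique small integer used as array index.
final class TransitionTable2D {
    private final Map<Object, Integer> stateIndex = new HashMap<>();
    private final int[][] table;  // table[left][right] = resulting state index

    TransitionTable2D(int numStates) {
        table = new int[numStates][numStates];
    }

    // Mapping function: translate a state to its unique integer index.
    int indexOf(Object state) {
        return stateIndex.computeIfAbsent(state, s -> stateIndex.size());
    }

    void set(Object left, Object right, int result) {
        table[indexOf(left)][indexOf(right)] = result;
    }

    int lookup(Object left, Object right) {
        return table[indexOf(left)][indexOf(right)];
    }
}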

2.4 iBURG

iBURG [FHP92a] is a variant of BURG that, like BURG, constructs a tree parser from a tree grammar. iBURG uses the same input language as BURG, but a different parsing technique, because BURS was considered to be inflexible. This relation to BURG can also be seen in the data structures: iBURG uses roughly the same structures as BURG. The shape of the subject trees is exactly the same as in BURG, because iBURG uses the same type of input file, in which the user must define the basic tree structure. The tree grammar data structure used by iBURG, however, is more advanced than its BURG counterpart. The basis of the rules is as one would expect: a rule defines an lhs, which is a nonterminal, an rhs, which is a tree, and a cost value. But rules also have a pointer to another rule with a similar rhs (based on a similar root or children) and a pointer to a chain rule with the same right hand side. Similar additions are made to the nonterminals: they also point to the rules that use them. These extras can simplify the parsing process, but they also create a lot of extra initialization work, because all these pointers have to be set. Like BURG, iBURG generates an output C program, for the specified grammar, that parses trees provided by the user. iBURG also uses a tree automaton to parse the input tree. However, in contrast to BURG, this is not a conventional tree automaton consisting of explicit states and transitions. iBURG defines states in this C program as integers and uses a function containing a complicated generated switch statement instead of the more conventional transition tables. This approach can, however, only be used when generating source code for a parser

and not in a toolkit that constructs these automata in memory (which will be the case for the toolkit designed in this report). Overall, iBURG's basic structures for trees and grammars are as one would expect, but they also contain small extra ideas. One such idea is the extra pointers that connect structures which, for instance, contain the same symbol in the same spot (e.g. two production rules with the same rhs root symbol). These pointers can be practical when computing matches based on these symbols.

2.5 Timbuk

Timbuk [GT01] is part of a collection of three tools. These three tools (Timbuk, Taml, Tabi) form a toolkit for term rewriting systems and tree automata. The Timbuk tool gives the user the ability to specify an automaton or a set of terms (for which an automaton accepting these patterns can be constructed by the toolkit). The user can then test whether his automaton/set of terms matches another automaton or set of terms when they are rewritten by rewrite rules he has defined. The Timbuk manual [GT] does not give much insight into the implementation of Timbuk. To find more detailed information I took a look at the OCaml source code of Timbuk (version 2.0), where the definitions of terms and automata can be found. However, there are not many comments in the source code that describe the structures or explain why they are implemented the way they are. Nevertheless, some things can be distilled from this source code. Timbuk defines terms/trees in a different way than most of the other systems, due to the functional nature of the OCaml programming language. A term can have four shapes: variable (nonterminal), constant (leaf terminal), function (terminal with subterms) and 'special'. The function of this last one is not completely clear; the Timbuk manual does not discuss 'special' terms. The source code of the automata portion suggests that this shape is used to represent automaton states as terms, but its exact role is unclear. Timbuk uses no tree grammars, because it focuses on automata constructed from tree patterns instead of grammars. Examining the structure of these automata was not easy, because only the source code was available, but the names used for variables give a good idea of the function of the different parts of the automaton. However, it was harder to get an idea of how these different parts are used. The type is defined by:

type tree_automata = {
  alphabet: Alphabet_type.t;
  state_ops: Alphabet_type.t;
  states: State_set_type.t;
  final_states: State_set_type.t;
  prior: transition_table;
  transitions: transition_table
}

The roles of parts like alphabet, states, final_states and transitions are clear from their names, but for the other fields this is not so clear. The state_ops alphabet is the collection of names used for the states. These are stored separately and have to be disjoint from the

normal alphabet that is used. The role of the prior field is unclear. There is no comment in the source code that describes its role or gives hints about why it is needed to form an automaton. Studying the source code with someone with more OCaml experience could shed some light on the function of this field. Unfortunately, the only available Timbuk information sources were the Timbuk manual [GT] and paper [GT01], which mostly contain a description of the functionality of the Timbuk tool. There was also source code available; however, the use of OCaml, a language with which I was not familiar, made it hard to analyze the code and get a clear view of how Timbuk functions. In summary, Timbuk offers a large set of structures. Many of these structures are also necessary in a tool like ForestFIRE, but unfortunately the absence of good documentation and code comments makes it hard to get a detailed view of their design and design philosophy. It is therefore hard to decide what can be reused.

2.6 Treebag

Treebag is a toolkit that enables a user to define tree grammars, algebras, transformations on trees and visualizations of trees. A user can define components of each of the four types. These components can then be connected, for instance, to visualize trees that are generated by a self-defined grammar and afterwards transformed by a self-defined transformation. These components can be defined in text files and loaded into the toolkit, where they can be connected with directed edges.

When examining the data structures I had to turn to the source code of the toolkit, because there was not much design/implementation information in the manual [Dre]. Tree nodes in Treebag are called terms and their structure is straightforward. There are no fields for algorithmic data, probably because the tree transformations should be loosely coupled to the trees. Remarkably, Treebag is the only system that uses terminals with ranks and stores the rank in each node explicitly. This results in duplicate information; a dictionary-like data structure would be more space-efficient if one wanted to store ranks explicitly. A reason for this implementation could be that retrieval of the rank is a bit faster, because it can be accessed directly in each node. Trees in Treebag are created using tree grammars that are defined by the user. The system can do this randomly, by applying random production rules, or the user can do it by hand by applying the production rules he likes. The shape of these production rules is very similar to the production rules in the other systems. A production rule is defined by a structure containing a left hand side and a right hand side. The rhs is defined as a term, which can be any form of tree. The lhs is formed by one symbol only. Additionally there is a field called weight, which can be used to give a cost to each of the production rules (Treebag does not oblige the user to specify this weight in his grammar definition). It seems that the grammars in Treebag do not store a separate list of the terminals and nonterminals that are used in the production rules, probably because these lists are not necessary to generate the trees.
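To make the shape of such a production rule concrete, the fragment below sketches it in Java (the language used for ForestFIRE later in this report). The class and field names are illustrative only, not Treebag's actual ones.

    // Illustrative sketch of a Treebag-style production rule: a single
    // lhs symbol, an arbitrary term as rhs, and an optional weight.
    class Term {
        String symbol;        // node label
        int rank;             // rank stored explicitly in each node, as Treebag does
        Term[] children;

        Term(String symbol, Term... children) {
            this.symbol = symbol;
            this.children = children;
            this.rank = children.length;  // duplicates information, see above
        }
    }

    class WeightedProduction {
        String lhs;           // exactly one symbol
        Term rhs;             // any term/tree shape
        Double weight;        // may be null: the weight is optional

        WeightedProduction(String lhs, Term rhs, Double weight) {
            this.lhs = lhs;
            this.rhs = rhs;
            this.weight = weight;
        }
    }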


Overall, Treebag's structures for trees and grammars are quite similar to the structures in other systems and do not contain any striking differences. Yet, as with Timbuk, the absence of design documentation makes it hard to retrieve much information about the structures used and the design decisions made.

2.7 Twig

Twig is a tree manipulation language that enables a user to describe tree grammars and convert them, like BEG and BURG, into a code generator for a compiler. The paper [AGT89] describes the language for defining tree grammars (with costs for each production rule) and an algorithm that uses a special string automaton to compute the minimal cost parse for a given tree. There is not much information on how the trees and grammars are represented in the Twig system. However, from the code samples in the paper one can derive that the tree nodes are defined as described in the introduction of this chapter, although the number of children is, as in other systems, limited to two. A Twig tree node also contains extra fields for storing algorithmic data for the parsing algorithm. As in BEG, each node contains two arrays for storing subtree matches and their cost. There are two further fields that contain a state and a bit string. These are used to create a link between the nodes in the tree and the automaton that Twig uses for computing the match. So, in total there are four extra fields which are directly related to the matching algorithm used. Such a large collection of algorithm-specific fields in each tree node would not be a good idea for a toolkit like ForestFIRE, which targets an expandable collection of algorithms: it would pollute the tree nodes. Unfortunately, information about the structures used for the tree grammars was not available in [AGT89]. It only contains a discussion of the high level conversion of the grammar into an automaton. As already mentioned, Twig constructs automata to check whether a grammar accepts a certain tree. These automata are not standard tree automata, but generalized variants of the Aho-Corasick string automaton [AC75]. However, these string automata are very similar to tree automata. Studying these automata in Twig was not easy, because only example automata are given, without many details about the internal representation of this structure. The only information about the structure was that it contains a function which computes the next state of an automaton for a given input state and the symbol being read. Also, it was not possible to obtain a copy of the Twig system to examine it and retrieve more details about the data structures that Twig uses.

In summary, Twig's tree and node structures are comparable to the structures of BEG and BURG. However, Twig shows that storing algorithmic data inside the tree can create a large number of extra fields in each node and is therefore not a very elegant solution. One can be brief about the structures used for the grammars and automata: the Twig description [AGT89] contains almost no information about them.

2.8 Summary

The systems studied provide some different ideas for implementing the domain concepts that were needed in our toolkit. As seen, many implementations of trees were available, but these were not always well documented. Tree grammars and tree automata were present in far fewer toolkits. Some of the ideas in these data structures, although not always directly inspired by these systems, can also be seen in our toolkit. Other features are deliberately not present in our data structures. Here I summarize the most remarkable features that were encountered for the three domain concepts studied in the systems.

Trees

As already mentioned in the introduction of this chapter, trees were mostly implemented in the same way. Three aspects caught our attention: storage of ranks, rank restriction, and annotation possibilities. Remarkably, there was only one system (Treebag) that explicitly stores ranks in its trees; it stores these ranks in each node of a tree. In our toolkit the choice was made to store the ranks explicitly as well, but in a different way: an explicitly ranked alphabet is coupled to a tree, instead of a rank being stored in each node. The details of this implementation and the reasons for it can be found in Section 3.1.1. The second aspect is the limitation of the rank. Many of the systems studied restrict their nodes to having only two child nodes. This means that all symbols in such a tree are at most of rank two. This is enough for most instruction selection applications, but it also seems unnecessarily restrictive in many cases, because many algorithms used in these systems are also capable of handling symbols of rank larger than two. Our implementation does not have this limitation and is therefore more flexible than most of the systems studied here. The first two observations described aspects of the existing systems that cannot be found in our toolkit. The annotation field, by contrast, is present in our toolkit (see Section 3.1.1), because it is practical in many algorithms. The implementation of this annotation structure is most comparable to the implementation of ATerms. The other systems provide annotation fields that are strongly tied to the algorithms that use them; our implementation, like ATerms, focuses more on flexibility. This flexibility means that an arbitrary algorithm can use the annotation facility provided by the nodes.

Tree grammars

Tree grammars, like trees, are mostly implemented in a straightforward manner by the studied systems. They consist of a collection of production rules that have a symbol as lhs and a tree as rhs. Strangely enough, most systems do not wrap these grammars in classes. This is mostly due to the fact that they use a single grammar, which makes it less profitable to create such an additional structure. This situation does not apply to our toolkit, because the goal is to be capable of handling more than one tree grammar at once. The only non-standard feature encountered for tree grammars was the collection of references used by the iBURG toolkit. iBURG uses these collections to get fast access to symbols or rules with comparable symbols. This could also be used in our toolkit, but was not done due to the additional implementation costs.

Tree automata

Finally there are the implementations of tree automata. Of the domain concepts I was interested in, this unfortunately was the one I encountered the least in the systems studied in this chapter. There was only one system (BURG) that uses tree automata and also documented them in a usable manner. This system described a way of indexing transition tables using states. A similar technique can also be found in the DFRTAs described in Section 4.3.


3 The Toolkit and GUI

The primary goal of this research project was to design and build a new toolkit in the area of regular tree languages. This new toolkit, built with the knowledge gained by studying the existing toolkits, should facilitate experiments with algorithms related to the three domain problems mentioned in Section 1.2. The construction of this toolkit was shaped by two requirements:

• Contain an expandable collection of algorithms related to the domain problems (with focus on the tree grammar transformations, tree automaton construction algorithms and tree parsing algorithms, which were the subjects of the experiments)
• Provide a visual interface to use these algorithms in an easy way

What therefore was needed was a toolkit that implements all standard domain concepts (trees, tree grammars etc.), provides the possibility to implement algorithms that use these concepts and furthermore comes with a user interface to experiment with these algorithms. This resulted in a toolkit consisting of two parts. The first is a library, called ForestFIRE, containing domain specific data structures and a collection of algorithms targeted at the domain problems (with special attention for the three specific groups of algorithms described above). The second part is FIREWood, a graphical user interface that provides the possibility to experiment with these algorithms and with algorithms added in the future. This chapter provides the details of the design and construction of these two parts in two separate sections. The first section discusses the heart of the system: the ForestFIRE library. It describes which data structures and algorithms are implemented and how they relate to the concepts and algorithms that can be found in the domain. The second section provides details on the design of the graphical FIREWood application that should provide easy access to the algorithms inside the library. That section focuses on the major design decisions made during the construction of FIREWood. In a final section some details are given about the practical implementation process, including the programming language chosen, the platforms supported, et cetera.

3.1 ForestFIRE

This section describes how the domain concepts of Chapter 1 are modeled and implemented in the ForestFIRE library. These concepts are divided into four sections: trees, regular tree grammars, tree patterns and tree automata. Each of these sections discusses the concepts that are related to that topic. These concepts are translated to data structures/classes, where each concept can result in one or more classes. A tree, for instance, contains an alphabet, nodes etc. Each of the resulting classes is described in detail to clarify its role and design. Furthermore, the relations between these different classes are visualized by a UML class diagram. Finally there are references to Appendix C, which contains the precise interfaces of these classes. This appendix serves as a dictionary of all the data structures in ForestFIRE. Additionally, each section discusses invariants for these classes. These invariants describe which properties of these classes have to be maintained such that the data structures do not violate properties that are important for the corresponding concept (a tree may, for instance, only use symbols that can be found in its alphabet). The sections conclude with the algorithms that are related to the concept. For each of the concepts it is briefly described which algorithms are implemented. Additionally, Appendix C describes the classes and methods that implement these algorithms. The implementation of these algorithms is in many cases quite straightforward (e.g. counting chain rules); however, others are more complicated. The details of these more complicated algorithms can be found in [Cle07]. The only complicated algorithms discussed in detail in this report are the algorithms that were the subject of the experiments. The description of these algorithms can be found in Chapter 4. Many of the data structures and algorithms discussed in this section can also be found in the original assignment description and additional list of requirements for the toolkit (see Appendix A).

3.1.1 Trees

This section describes all data structures related to the tree concept, like nodes, symbols, alphabets etc. These data structures are modeled in such a way that they fit the descriptions given in Chapter 1 and [Cle07]. A tree is represented by a collection of nodes with links that form the tree structure. The tree contains a reference to its root node and leaf nodes (to support easy rf and fr traversal). Each of the nodes in the tree structure contains a symbol and an unlimited sequence of child nodes. Using an unlimited sequence of nodes means that we do not restrict the number of child nodes to two, as most of the existing toolkits do. The tree also contains an alphabet. This alphabet contains the set of (ranked) symbols that can be used in the tree structure. Each node in this structure refers to its corresponding symbol in the alphabet. These references strongly resemble the tree function in the formal definition (see Definition B.1.2). None of the existing toolkits stored the rank with the symbol. However, we have chosen to do this, because it is efficient (the rank is not stored in each node) and results in easy retrieval from each node. These symbols can be used to represent terminals, nonterminals and variables. They contain a name (the label), a type, and a rank in case of a terminal. The type is used to denote whether a symbol is a terminal, nonterminal or variable. Nonterminals and variables are represented by unranked symbols; the rank of these symbols is considered to be zero.

The choice to model a tree in this conventional way was made because this conventional shape is easier to implement and provides an elegant way to point to subtrees by just using a reference to a node. Figure 3.1 shows a UML class diagram of how the tree, node and symbol classes are related.


Figure 3.1: UML Class diagram for tree and node related classes.

Figure 3.1 shows that the node structure has even more fields than described above. One field that is important to highlight is the annotation field. Many algorithms that use tree-like structures compute information for a certain node, and sometimes require space to store this information. The annotation dictionary was created to facilitate this: it allows an algorithm to store this information under its own key(s). Additionally, a class was added to represent dotted trees. Dotted trees are used to point to a specific node within a tree. The representation of this dotted tree is similar to its formal 2-tuple definition, which combines a tree and a pointer to a node. In the formal definition this node pointer is a path from the root to that node. However, we define it as a direct pointer to that node, because this provides faster access. The details of all these data structures can be found in Section C.2, which describes all classes separately and provides a description of the individual fields of these classes.
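As a rough illustration of how such an annotation dictionary can be used, consider the following simplified Java fragment. The class and method names are made up for the example (the SketchTree class referenced here is defined in a companion fragment below); the exact ForestFIRE interfaces are listed in Appendix C.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Minimal sketch of a tree node with an annotation dictionary.
    class SketchNode {
        String symbol;                        // reference into the tree's alphabet
        SketchNode parent;                    // parent node (null for the root)
        SketchTree tree;                      // back-reference to the owning tree
        List<SketchNode> children = new ArrayList<SketchNode>();
        Map<Object, Object> annotation = new HashMap<Object, Object>();
    }

    class AnnotationDemo {
        // An algorithm stores per-node results under its own key, so the
        // node class needs no knowledge of the algorithms that use it.
        static final Object MATCH_KEY = new Object();

        static void recordMatch(SketchNode n, Object matchInfo) {
            n.annotation.put(MATCH_KEY, matchInfo);
        }

        static Object lookupMatch(SketchNode n) {
            return n.annotation.get(MATCH_KEY);
        }
    }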

There are also some important invariants for these classes, because it is possible to construct illegal trees with these data structures if one does not follow certain rules. These rules are captured in class invariants, which include, for example:

• Each node that points to a tree as its parent tree must also be part of this tree, and vice versa.
• A symbol used in a node of tree t must also be present in the alphabet of t.

• If a node n1 is in the list of child nodes of a node n2, then n1 must refer to n2 as its parent.

All invariants and details about these invariants can be found in Section C.2.2. The data structures themselves offer facilities to avoid violation of these invariants. There exists, for example, a special method for assigning a node as root node. This method makes sure that all pointers to the parent tree in the tree structure are set to the corresponding tree. However, not all of these invariants can be handled by the library itself; the user of the library has to follow certain rules when using the data structures if he wants to make use of these properties. These rules can be found in the source code documentation. The invariants become important when a user starts manipulating the data structures in another way or modifying the implementation of the data structures. This combination of classes and invariants enables us to represent all concepts related to trees in a correct way.
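A sketch of such a root-assignment facility, continuing the simplified classes from the fragment above, could look as follows; the real ForestFIRE method is listed in Appendix C and may differ in detail.

    import java.util.ArrayList;
    import java.util.List;

    // Sketch of a tree that preserves the invariants above: assigning a
    // root walks the whole structure once, pointing every node back to
    // this tree and (re)establishing parent and leaf references.
    class SketchTree {
        SketchNode root;
        List<SketchNode> leaves = new ArrayList<SketchNode>();

        void setRoot(SketchNode newRoot) {
            root = newRoot;
            leaves.clear();
            link(newRoot);
        }

        private void link(SketchNode n) {
            n.tree = this;                    // node points to its parent tree
            if (n.children.isEmpty()) {
                leaves.add(n);                // keep leaf references for fr traversal
            }
            for (SketchNode c : n.children) {
                c.parent = n;                 // child/parent invariant
                link(c);
            }
        }
    }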


3.1.2 Regular tree grammars

This section discusses the implementation of the data structures and algorithms related to regular tree grammars. We divide the discussion into a part on data structures and one on algorithms. Finally there is a third part that discusses the implementation of one algorithm in detail.

Data structures

The regular tree grammar definition in the toolkit is similar to its formal 5-tuple definition (see Definition B.2.1). This formal definition describes the following items:

• N, the nonterminal alphabet.
• T, the terminal alphabet.
• r, a ranking function for the terminals.
• P, the production rules or productions.
• S, the start symbol.

However, there are some differences. The regular tree grammar definition in the toolkit uses one combined alphabet for terminals and nonterminals, instead of two separate alphabets. This is done because the symbols as defined contain a type field which indicates whether a symbol is a terminal or nonterminal, so there is no direct reason for separating them. Another difference is the absence of an explicit ranking function. Ranks are stored within the symbol itself, as described in the previous section about trees, or considered to be zero in case of an unranked symbol (nonterminal or variable). The last two items of the formal definition are found in the toolkit implementation of a grammar: the RegularTreeGrammar class contains a start symbol and a set of production rules. These production rules are also shaped as in their formal definition. They contain a nonterminal lhs and a tree rhs, but there is also a small addition in the form of a cost field. This cost field is added because there exist parse algorithms that use a cost value to compute the minimum cost parse for trees. This field can then be used to store the cost of each production rule. A UML class diagram of these structures can be found in Figure 3.2.

Figure 3.2: UML Class diagram for grammar related classes.

Classes Tree and Symbol in Figure 3.2 were discussed in Section 3.1.1. As can be seen, there is also an additional DottedRule class. The role of this class is similar to the role of the DottedTree class: instead of storing a reference to a node inside a specific tree, it points to a node inside the rhs of a production rule. Dotted rules therefore contain a field that refers to a production rule and another that refers to a node. These dotted rules are not used in our experiments; however, they can play a role in future parsing algorithms.

There are two important invariants for tree grammars. The first states that the left hand side symbol of every production rule must be part of the alphabet of the grammar. The other states that the alphabet of every right hand side tree must be a subset of the alphabet of the grammar itself. The formal definition of these invariants can be found in Section C.3.2. Of course the general tree invariants discussed in Section 3.1.1 hold for the rhs as well.
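As an illustration only, these two invariants could be checked along the following lines (with simplified types; the precise formulation is in Section C.3.2):

    import java.util.Set;

    // Sketch of the two grammar invariants: (1) the lhs symbol is in the
    // grammar's alphabet; (2) the rhs tree's alphabet is a subset of it.
    class GrammarInvariants {
        static boolean holdsFor(Set<String> grammarAlphabet,
                                String lhs,
                                Set<String> rhsAlphabet) {
            return grammarAlphabet.contains(lhs)
                && grammarAlphabet.containsAll(rhsAlphabet);
        }
    }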

Algorithms

The toolkit implements a large collection of algorithms related to tree grammars. The role of these algorithms varies: some of them retrieve simple statistics, like the number of nodes in the right hand sides of the production rules, but others perform more complicated operations, like removing all chain rules or non-root terminal nodes in such a way that the language produced by the grammar is not changed. All these algorithms are separated from the data structures themselves, to avoid garnishing these data structures with additional methods. This results in a new collection of classes. The detailed interfaces of these classes can be found in Section C.3.3. The algorithms are divided into the following categories, each of which is discussed briefly in this section.

• Standard analysis
• Usability
• Grammar transformation
• Dotted rule and subtree retrieval

The first category provides algorithms that compute basic statistics of the grammar. This includes things like the number of nodes (mentioned earlier), but also the number of non-root terminal nodes. These algorithms only perform measurements and do not change the tree grammar.

The usability category also only measures statistics, but related to the reachability and productivity of rules and symbols. The classes provide interfaces to retrieve both unproductive/unreachable/useless and productive/reachable/useful items.

The third category focuses on algorithms that manipulate the grammars. These so called transformations can be used to remove non-root terminal nodes, chain rules or useless symbols and rules. This is realized by two new classes. The first class is targeted at removing chain rules or non-root terminal nodes, thereby converting the grammar into a grammar with the so called U− or Z− characteristic. The transformations related to usability are wrapped by the other class. The usability algorithms are easy to implement, because removing useless items from a grammar will not influence the language produced by the grammar. This is different for the transformations related to the U and Z characteristic. These transformations need to split rhs subtrees (to remove Z-nodes) or replace rules (to remove chain rules) in a special way (described in Section 4.2.1) to obtain the desired characteristic without changing the language produced by the grammar. Experiments using these transformations are discussed in Chapter 4.

The final category contains algorithms that retrieve all subtrees that are present in the production rules of a tree grammar, where these substructures are called items. These items comprise all unique subtrees in the rhs of the production rules and all unique lhs nonterminals. Figure 3.3 shows the item set of an example grammar. The items can be retrieved as subtrees (in this case the subtrees are cloned into new trees) or as dotted rules, where a dotted rule is created for each existing node in the production rules. The first technique is called subtree retrieval and the second one dotted tree retrieval. The difference between the two is that subtree retrieval creates one tree for every unique item (so only one d instead of two for the example in Figure 3.3), while dotted tree retrieval results in dotted trees for all existing items, even when their structure is the same (see the sketch after Figure 3.3). These collections of items become useful when constructing automata from tree grammars. The tree automata that were used in the experiments make use of subtree retrieval to construct automata. Details about these automata can be found in Section 4.3.

Figure 3.3: Items of a tree grammar
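The following fragment sketches one possible way to realize subtree retrieval, reusing the simplified node class from Section 3.1.1; it deduplicates structurally equal subtrees via their prefix notation. This is an illustration of the idea, not necessarily how ForestFIRE implements it (in particular, the cloning of subtrees into new trees is omitted).

    import java.util.Collection;
    import java.util.LinkedHashMap;
    import java.util.Map;

    class SubtreeRetrieval {
        // Collect every unique subtree below the given rhs roots, keyed by
        // its prefix notation so that structurally equal subtrees coincide.
        static Collection<SketchNode> uniqueSubtrees(Iterable<SketchNode> rhsRoots) {
            Map<String, SketchNode> items = new LinkedHashMap<String, SketchNode>();
            for (SketchNode root : rhsRoots) {
                collect(root, items);
            }
            return items.values();
        }

        private static void collect(SketchNode n, Map<String, SketchNode> items) {
            String key = prefix(n);
            if (!items.containsKey(key)) {
                items.put(key, n);            // keep the first occurrence only
            }
            for (SketchNode c : n.children) {
                collect(c, items);
            }
        }

        // Prefix notation of the subtree rooted at n, e.g. "a(b,d(d))".
        private static String prefix(SketchNode n) {
            if (n.children.isEmpty()) {
                return n.symbol;
            }
            StringBuilder sb = new StringBuilder(n.symbol).append('(');
            for (int i = 0; i < n.children.size(); i++) {
                if (i > 0) sb.append(',');
                sb.append(prefix(n.children.get(i)));
            }
            return sb.append(')').toString();
        }
    }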

Appendix C shows the related classes and interfaces of all these algorithms. Some of these algorithms are not very complex, like the retrieval of statistics and item sets; the others are more complex. Information about the usability transformations can be found in [Cle07, Section 3.3.3]. The grammar transformation algorithms that remove chain rules and non-root terminal nodes are discussed in detail in Section 4.2, because they were used in experiments. However, to get an impression of the implementation of such an algorithm, the next section highlights the implementation of the transformation algorithm that removes chain rules.

Example algorithm implementation

The previous section gave a brief overview of the algorithms implemented by ForestFIRE in the area of tree grammars. This section will give an impression of how such an algorithm, the RED-U transformation algorithm, is implemented. Many of the algorithms in ForestFIRE, like the RED-U transformation algorithm, originate from Cleophas' PhD thesis, where they are described using a more formal approach. These formal definitions were translated to the Java implementations (Java being the programming language used for ForestFIRE) that can be found in the toolkit. Many of these implementations bear a close resemblance to their definition. This section compares such a definition to its implementation to show this resemblance. As described, the algorithm chosen to show this resemblance is the RED-U transformation step, which removes a single chain rule (also called unit production) from a grammar. Definition 3.1.1 shows the formal definition of this transformation step.


Definition 3.1.1 (red-u, removing a single unit production). Let G = (N, Σ, r, Prods, S) be an rtg with characteristic value U+. Then there must be a production A → B ∈ Prods with A, B ∈ N. Then G′ = (N, Σ, r, Prods′, S), where

    Prods′ = Prods \ {A → B} ∪ ⟨Set C, γ : B ⇒* C ∧ C → γ ∈ Prods ∧ γ ∉ N : A → γ⟩

is the resulting transformed grammar.

This definition describes that the chain rule A → B is removed and that new rules are added to replace it. These new rules A → γ are created for each rhs γ of a rule C → γ where C is reachable from B via chain rules. The new rules make sure that A can still produce the trees that were producible via the chain rule A → B (an example of this can be seen in Section 4.2.2). Let us now look at the Java implementation in Listing 3.1, which contains the method RemoveChainRule (also listed in Section C.3.3.3). The Java code of this algorithm is, like the definition, divided into two parts. The first part (line 4) removes the chain rule that is provided to the method from the grammar. The remaining lines handle the creation of the rules that replace this chain rule. Lines 9–13 compute, using Warshall's algorithm, all nonterminals C that can be reached from B by using chain rules. Warshall's algorithm computes the transitive closure of the chain rules; this means that B itself is not returned. Line 13 was therefore added to add B itself to the set of nonterminals. The remaining issue is finding all rules C → γ where γ is not a nonterminal. This can be seen in the final lines of code, which check for each rule whether it has a closure nonterminal C as lhs. If this is the case and the rhs is not a nonterminal either, then a new rule is created using A as lhs and γ as rhs. This new rule is added to a temporary collection of rules, and these rules are added to the grammar at the end. This has to be done afterwards, because otherwise it would interfere with the 'foreach' loop.

1  void RemoveChainRule(RegularTreeGrammar grammar, GrammarProductionRule rule)
2  {
3      //Remove chain rule from the set of production rules
4      grammar.productionRules().remove(rule);
5
6      //Compute nonterminal closure of chain rules
7      //(using Warshall) and retrieve nonterminals
8      //reachable from rule.rhs().getRoot().symbol()
9      Warshall closure = new Warshall(grammar);
10     ArrayList<Symbol> closureSymbols =
11         closure.getClosureSymbols(rule.rhs().getRoot().symbol());
12     //add RHS to obtain reflexive closure
13     closureSymbols.add(rule.rhs().getRoot().symbol());
14
15     //Construct new rules that replace the chain rule
16     ArrayList<GrammarProductionRule> newRules =
17         new ArrayList<GrammarProductionRule>();


18
19     for(GrammarProductionRule r : grammar.productionRules())
20     {
21         //Check if LHS is in the symbol closure
22         if (closureSymbols.contains(r.getLhs()))
23
24             //Check if the rule is not a chain rule
25             if (r.rhs().getRoot().symbol().symbolType()
26                     != SymbolType.NonTerminal) {
27
28                 //Make a new rule based on the RHS of r
29                 GrammarProductionRule newRule = new GrammarProductionRule(
30                     rule.getLhs(), r.rhs().clone());
31
32                 newRules.add(newRule);
33             }
34     }
35
36     for(GrammarProductionRule r : newRules)
37     {
38         grammar.productionRules().add(r);
39     }
40 }
Listing 3.1: Java implementation of RED-U transformation step

This shows how closely related the formal definition and the implementation of an algorithm are. The same holds for the other algorithms, including those related to the other domain concepts.

3.1.3 Tree patterns

Chapter 1 shows that a tree pattern is in essence a normal tree. A tree pattern is therefore implemented as a tree, with the exception that it contains not only terminals, but also variables. However, to represent patterns one can still use the normal tree classes, because the symbols used in nodes can represent terminals, nonterminals and variables. The only additional data structure needed in the area of tree patterns is a collection to wrap multiple patterns into a pattern set, because pattern sets are an important part of the tree pattern matching problem. Such a set contains a group of patterns and an alphabet that contains all symbols that may be used inside the patterns. These pattern sets can for instance be used to create automata that accept only patterns from this collection. These automata can then be used to solve the tree pattern matching problem. Figure 3.4 shows the existing tree class and the pattern collection class and how they relate.

Figure 3.4: UML Class diagram for pattern related classes.
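A pattern set can thus be as simple as the following sketch, pairing a collection of trees with a shared alphabet. The names are simplified and the subset check anticipates the invariant discussed next; see Section C.4 for the actual class.

    import java.util.ArrayList;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    // Sketch of a pattern set: the patterns are ordinary trees whose
    // alphabets must be subsets of the set's own alphabet.
    class SketchPatternSet {
        Set<String> alphabet = new HashSet<String>();
        List<SketchTree> patterns = new ArrayList<SketchTree>();

        void addPattern(SketchTree pattern, Set<String> patternAlphabet) {
            if (!alphabet.containsAll(patternAlphabet)) {
                throw new IllegalArgumentException(
                    "pattern uses symbols outside the set's alphabet");
            }
            patterns.add(pattern);
        }
    }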

32 Chapter 3. The Toolkit and GUI 3.1. ForestFIRE

All the invariants of the standard tree also apply in the tree pattern situation. The only extra invariant needed is that the alphabets of the trees inside a pattern set are subsets of the alphabet of the pattern set. This ensures that the patterns do not contain symbols that are not present in the alphabet of the set (for details see Section C.4.2).

The only two algorithms implemented for tree patterns determine the collection of pattern subtrees and of dotted pattern trees, respectively (similar to the subtree and dotted rule retrieval for tree grammars). The first retrieves subtrees as normal trees, which results in a collection of cloned trees in which no duplicate tree structures can be found. The other focuses on retrieval of subtrees as dotted trees. In that case a dotted tree is created for each node in the patterns. This means that there exist dotted trees that point to a similar tree structure if the pattern set contains multiple occurrences of a subtree.

3.1.4 Tree automata

This section discusses how concepts related to tree automata are implemented in the toolkit. First, the data structures are discussed that represent concepts like tree automata and tree automaton states. These data structures are accompanied by a collection of invariants, just as in the previous sections. Finally, all algorithms related to automata are discussed. These algorithms focus on the construction and usage of the automata.

Data structures

This section presents the classes needed for representing the different types of tree automata discussed in this report:

• Nondeterministic Root-to-Frontier Tree Automata (nrfta)
• Deterministic Root-to-Frontier Tree Automata (drfta)
• Nondeterministic Frontier-to-Root Tree Automata (nfrta)
• Deterministic Frontier-to-Root Tree Automata (dfrta)

These automata were implemented with the help of two abstract classes (see Figure 3.5). One abstract class represents the general tree automaton and a specialized abstract dfrta class supports different implementations of dfrtas (the goal of this becomes clear in Chapter 4). The nondeterministic automata (with or without ε-transitions) inherit from the general abstract tree automaton class, while dfrtas inherit from the Abstract dfrta class, which in its turn inherits from the general abstract automaton class. drftas are not implemented, because they have less acceptance power than the other three types, as mentioned in Chapter 1. However, they could be implemented in a similar way to the dfrtas. The general tree automaton class describes a standard tree automaton as the 5-tuple in the formal definition (see Definition B.3.1):

• Q, the state set.
• (V, r), a ranked alphabet.
• R, the set of transition relations.


• Qra, the root accepting states.

• Qla, the leaf accepting states.

The state set, the root accepting states and the ranked alphabet can be found in the AbstractTreeAutomaton class as defined in Section C.5.1.1. However, this is different for the other items. The set of transition relations is not defined by this class. The reason for this is that the optimal shape of these transition relations is different for RF and FR automata. These transitions are also not made accessible to the user in raw form, because we want to hide the implementation from the user and give the user access to the transition relations via a NextState-method. The same is done for adding transitions: a method is defined that allows the user to add a transition for a symbol, relating a state and a vector of states. How these transitions are stored is hidden from the user, because there are many possible ways to do this; it depends on the direction of the automaton, as mentioned before, but also on the type of construction used. The toolkit contains a collection of such implementations. The details of these implementations are discussed in Section 4.3.2, because they are closely related to the different automaton constructions used in the experiments. The other exception is the absence of leaf accepting states. These states can be distilled from the transition relations, because they have incoming transitions for symbols of rank 0 (as described in [Cle07, Remark 3.6.2]). The retrieval of the leaf accepting states is in practice implemented by defining a transition relation for each symbol of rank 0 from the empty tuple '()' to the corresponding leaf accepting state. Calling the NextState-method with the empty tuple for each such symbol will return these states.
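The external interface that results from these choices can be sketched roughly as follows. The names are simplified variants of those in Section C.5, and how the transitions are stored behind nextState is deliberately left to concrete subclasses.

    import java.util.Collections;
    import java.util.List;
    import java.util.Set;

    // Sketch of the abstract tree automaton interface: transitions are
    // added and queried per symbol with a vector of states, hiding their
    // internal representation.
    abstract class SketchTreeAutomaton<State> {
        abstract Set<State> states();
        abstract Set<State> rootAcceptingStates();

        // relate, for 'symbol', a vector of states to a target state
        abstract void addTransition(String symbol, List<State> stateVector,
                                    State target);

        // deterministic variant: the unique next state, or null if none
        abstract State nextState(String symbol, List<State> stateVector);
    }

    class LeafStates {
        // Leaf accepting states need not be stored separately: asking for
        // the next state of a rank-0 symbol with the empty tuple () yields
        // exactly the corresponding leaf accepting state.
        static <S> S leafStateFor(SketchTreeAutomaton<S> a, String rank0Symbol) {
            List<S> emptyTuple = Collections.emptyList();
            return a.nextState(rank0Symbol, emptyTuple);
        }
    }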

Figure 3.5 shows how all the tree automata classes relate to each other. Only a single implementation of the Abstract dfrta class is shown; in practice there are five such implementations. These implementations all use different optimization techniques for the transition relations, but they do not change the external behavior of the automata. They are therefore not described here, but in Appendix C. Details about these optimizations can be found in Section 4.3.

Figure 3.5: UML Class diagram for automaton related classes.


Figure 3.5 also shows an AbstractAutomatonState class with three descendants. These classes are used to represent the states of an automaton. Each state refers to a single item (for nondeterministic automata) or a set of items (for deterministic automata). Such an item or item set indicates which subtrees are matched when that state is reached, and therefore indicates derivability (when the automaton is constructed from a grammar) or pattern matches. The shape of these item sets depends on how the items are represented. As discussed in the sections about tree grammars and tree patterns, it is possible to store subtrees as subtrees, but also as dotted trees or dotted rules, depending on the source (grammar or pattern). The DottedTreeAutomatonState is used by automata constructed from dotted trees (extracted from tree patterns), while the DottedRuleAutomatonState is used by automata constructed from dotted rules (obtained from a tree grammar). The SubtreeAutomatonState is used by automata constructed from subtrees, extracted from either tree grammars or tree patterns. Our tree automaton implementations focus on automaton states based on subtrees and therefore use the SubtreeAutomatonState class.

There are also invariants for these data structures. The first one states that the set of root accepting states is a subset of the state set of the automaton. The second invariant states that the states returned by the NextState-method are present in the state set. These invariants are described in more detail in Section C.5.2.

Algorithms

The ForestFIRE library implements a large collection of algorithms related to tree automata. Most of these algorithms are targeted at the construction of tree automata (from tree grammars) and their usage in tree acceptance and parsing. The implemented construction algorithms can be used to construct (ε)nfrtas, (ε)nrftas and dfrtas from regular tree grammars. This is realized by only two algorithms: one for constructing nondeterministic automata and another for constructing the dfrtas. The algorithm for nondeterministic automata can produce (ε)nfrtas and (ε)nrftas, depending on the parameters used. The classes that implement these two algorithms and the methods that can be used to construct these automata can be found in Section C.5.3. These algorithms were also used in experiments and are therefore discussed in detail in Section 4.3. The construction algorithms in turn make use of another group of algorithms called item set providers. These algorithms play an important role in the construction of both nondeterministic and deterministic automata (see also Section 4.3), because they are involved in the state creation process. These item set providers retrieve a special set of subtrees from the grammar for which the automaton is constructed. The states of the automaton are then constructed based on these subtrees. Another group of algorithms uses these automata to solve the tree acceptance problem: the so called acceptance algorithms. The toolkit implements three acceptance algorithms, one based on nfrtas, another for nrftas and a final one for dfrtas. These different algorithms are needed because the shape of an acceptance algorithm depends on the properties of the automaton used. An FR automaton processes trees in a bottom-up way, while an nrfta works in the other direction. Acceptance algorithms for deterministic automata can, for instance, be optimized compared to those for nondeterministic ones, because they can exploit the determinism of the automaton.
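For a dfrta, for example, acceptance amounts to a single bottom-up pass over the tree; a sketch under the simplified interface above (again with invented names) could read:

    import java.util.ArrayList;
    import java.util.List;

    class FrAcceptance {
        // Bottom-up acceptance with a deterministic FR automaton: compute
        // the state of every node from the states of its children, then
        // check whether the root's state is root accepting.
        static <S> boolean accepts(SketchTreeAutomaton<S> a, SketchNode root) {
            S rootState = stateOf(a, root);
            return rootState != null && a.rootAcceptingStates().contains(rootState);
        }

        private static <S> S stateOf(SketchTreeAutomaton<S> a, SketchNode n) {
            List<S> childStates = new ArrayList<S>();
            for (SketchNode c : n.children) {
                S s = stateOf(a, c);
                if (s == null) {
                    return null;              // no transition: the tree is rejected
                }
                childStates.add(s);
            }
            // leaves reach this call with the empty tuple, matching the
            // rank-0 convention described earlier
            return a.nextState(n.symbol, childStates);
        }
    }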


Finally, a parsing algorithm was implemented that solves the parsing problem using a dfrta. This algorithm stores the parse information (which rule needs to be applied at which node) in the annotation fields of the nodes, such that one can find out which grammar rules need to be applied to derive the complete tree. A set of experiments was also performed with this parsing algorithm in combination with different dfrtas. These experiments are discussed in Section 4.4. Details of the classes that implement all these algorithms can be found in Section C.5.3.

3.2 FIREWood

FIREWood is the application that provides a graphical user interface on top of the ForestFIRE library and supports easy access to tasks that make use of the different data structures and algorithms from the library. This includes, e.g., visualizations of trees in prefix notation, but also access to tree acceptance and parsing algorithms. In this project there was a special interest in tasks related to tree acceptance and parsing. However, there was a broader list of requirements for the FIREWood application:

• Support input/output of domain concepts (trees, tree grammars etc.) from/to files
• Provide an environment for experiments with rtg transformations, ta constructions and tree parsing algorithms
• Support the possibility to add new visualizations for new and existing concepts and related algorithms

This section discusses how these requirements were translated into an architecture and provides an overview of the main building blocks of the FIREWood application. The other part of this section contains a discussion of the resulting application and describes how the application can be used in practice to support the experiments in this domain. Details concerning the implementation (like the programming language and graphical libraries used) can be found in Section 3.3.

3.2.1 Architecture

The architecture of the FIREWood application describes how the requirements were translated into a user interface application and how the application functions on a high level. The application should allow a user to load a collection of domain concepts like trees, tree grammars etc. Furthermore, the user interface needed to provide a set of replaceable views that visualize the loaded concepts. These views provide information about a concept's structure or enable the user to perform operations on it. A tree grammar, for instance, is one of the possible loaded concepts; if such a grammar is selected, the views make it possible to inspect its rules and remove chain rules. This led to three practical tasks that had to be implemented:

1. Reading and writing tree concepts from/to a file
2. Presenting loaded concepts to the user
3. Providing an (expandable) collection of views depending on the type of concept

36 Chapter 3. The Toolkit and GUI 3.2. FIREWood

Before discussing these tasks in detail, we describe the basic concepts that the user can define in files and load into the FIREWood application: alphabet, tree, tree grammar, tree pattern and tree pattern collection. All these concepts play a role in the three domain problems (tree acceptance, tree parsing and tree pattern matching). The alphabet can mostly be found as part of the other four concepts; however, it was added separately because the other concepts are likely to use the same alphabet if they are related to each other. This way a single alphabet definition can be used in the definitions of multiple concepts. The three tasks result in a data access component, a multiple data view component, and a single data view component, which are detailed below.

Data access component

The data access component (dac) is responsible for providing access to concepts stored in the FIREWood file format (for its definition see Appendix D). The component can read files in this format, convert them into the corresponding ForestFIRE data structures and write them back to a file. This component therefore provides access to the user defined trees, tree grammars etc. All the other components access the data structures specified by the user through this component. The dac consists of three groups of data structures (see Figure 3.6). Firstly, there are the file readers and writers. These readers convert a file into a so called main data container. This container holds all data structures read, in smaller containers called single data containers. These single data containers contain the ForestFIRE data structure(s) that correspond to a single concept. All these data structures are wrapped by a container to control access to them. Adding a new concept to a loaded collection of concepts can, for instance, have consequences for the collection of alphabets used, because alphabets can be shared. The containers make sure that such internal dependencies are handled appropriately.

Figure 3.6: Data access component that translates files into containers

The data access component is not restricted to the five concepts described above. The main data container is defined as a collection of abstract single data containers, and an implementation of the abstract container is made for each concept that can be defined in the input files. New concepts can therefore be supported by creating a new implementation of the abstract single data container and extending the reader and writer classes with additional code to handle the new type of concept.
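Extending the dac for a new concept then comes down to something like the following skeleton. The class names are hypothetical stand-ins for the actual FIREWood classes.

    // Skeleton for supporting a new concept in the data access component.
    // AbstractSingleDataContainer stands in for the abstract container
    // class described above; MyConcept is a placeholder for the new
    // ForestFIRE data structure.
    abstract class AbstractSingleDataContainer<T> {
        abstract String name();    // name under which the concept is listed
        abstract T content();      // the wrapped ForestFIRE data structure(s)
    }

    class MyConcept { /* the new data structure */ }

    // 1. Wrap the new concept in its own container...
    class MyConceptContainer extends AbstractSingleDataContainer<MyConcept> {
        private final String name;
        private final MyConcept concept;

        MyConceptContainer(String name, MyConcept concept) {
            this.name = name;
            this.concept = concept;
        }

        String name() { return name; }
        MyConcept content() { return concept; }
    }

    // 2. ...and extend the reader and writer classes with code that
    //    recognizes the new concept in the FIREWood file format and
    //    produces/consumes such containers (omitted here).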

Multiple data view

The multiple data view component (mdv) makes the ForestFIRE data structures loaded by the dac accessible to the user. How this access is realized is not explicitly defined by this component. What the component does define is a way to communicate with the dac. This is realized by an abstract class that defines a number of operations related to the complete collection of concepts. The role of these operations is to react to changes in the main data container and to signal the application when a user selects a concept in the mdv. An implementation that inherits from this abstract class can then implement the visualization of these concepts in the desired way (see Figure 3.7(a)). The idea behind this abstract class is that one can replace the visualization easily. The FIREWood application also provides an implementation of a concrete mdv in the form of a tree view (see Figure 3.7(b)). This tree view contains all loaded concepts, grouped by the different types of concepts. The user can inspect these concepts by clicking on them. The single data view component is then triggered to provide a visualization of the selected concept. This communication with the single data view component is provided by the abstract mdv class.

Figure 3.7: Multiple Data View. (a) mdv visualizes the content of the main container; (b) mdv implemented as tree control.

Single data view

The goal of the single data view component (sdv) is to visualize a single concept and provide access to algorithms and statistics related to that concept. This single data view also has to support new visualizations when additional algorithms are added to the ForestFIRE library. The design philosophy behind this component is similar to that behind the multiple data view component. The component consists of an abstract class and possible implementations of this abstract class. The abstract class provides an interface to respond to visualization requests generated by the mdv. The abstract class describes such a visualization only as an environment with a collection of subviews (mostly related to the different algorithms etc.). It depends on the implementation of the abstract sdv which subviews there are for a type of concept and how these are implemented. The concrete sdv presents these subviews to the user when the mdv signals that the user has selected a concept. Figure 3.8(a) shows how the sdv, mdv and dac are related to each other. As for the multiple data view component, an example implementation of the single data view was included, in the form of a tab control, where the subviews are the tab pages (see Figure 3.8(b)). When a concept is selected in the mdv (the tree view), a collection of tabs is opened in the tab control, depending on the type of the requested concept. Our standard implementation provides standard tabs for each type of concept. A creator of new algorithms for a specific type of concept can then define new tab pages to access his algorithms. These pages are automatically shown to the user if he couples these tab pages to the corresponding concept type. It is, however, also possible to use a completely different kind of control (or set of controls) for an implementation of the abstract single data view class.

Figure 3.8: Single Data View. (a) sdv visualizing a single data container on request; (b) sdv implemented as tab control.

The current toolkit contains a large collection of these tab pages, especially for the tree grammar concept. This concept was important because the toolkit was used as a platform for experiments with tree grammar transformations and tree automaton constructions for tree grammars. The current version of FIREWood therefore contains tab pages that provide access to such algorithms. Examples of this can be seen in the next section, which contains a brief tour through the interface of FIREWood with the tree view mdv and tab control sdv. The coupling of tab pages to concept types, mentioned above, can be pictured as a simple registry, as sketched below.
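This sketch is an illustration of the idea only, not FIREWood's actual API; the interface and class names are invented.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Illustration: tab pages register themselves per concept type, and
    // the sdv opens all pages registered for the type the user selected.
    interface TabPageFactory {
        String title();
        Object createPage(Object concept);  // would build the actual widget
    }

    class TabRegistry {
        private final Map<Class<?>, List<TabPageFactory>> byType =
            new HashMap<Class<?>, List<TabPageFactory>>();

        void register(Class<?> conceptType, TabPageFactory factory) {
            List<TabPageFactory> pages = byType.get(conceptType);
            if (pages == null) {
                pages = new ArrayList<TabPageFactory>();
                byType.put(conceptType, pages);
            }
            pages.add(factory);
        }

        List<TabPageFactory> pagesFor(Object concept) {
            List<TabPageFactory> pages = byType.get(concept.getClass());
            return pages != null ? pages : new ArrayList<TabPageFactory>();
        }
    }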

3.2.2 Resulting user interface

The architecture of FIREWood described three main components: the dac, mdv and sdv. All three components can be found in the user interface. The design already described that the mdv and sdv appear as a tree view and a tab control, respectively. Let us start with an overview of the main window of the application to clarify how these three components are visually present:


Figure 3.9: FIREWood main form

When the application is started, one can open a file through the menu bar at the top. The desired file can then be selected through a file dialog. When the file is selected, the dac is triggered to parse it. After this, the content of the file is visualized by the tree view on the left if the file contains no errors (if this is not the case, then the loading is canceled and the errors are shown to the user). Clicking on one of the loaded concepts opens a collection of tab pages, where each of the tab pages has a different purpose. This can, for instance, be visualizing simple information, such as the properties of the concept in Figure 3.9, but also providing access to the complicated parse algorithm. Let us illustrate the possibilities of FIREWood by showing how the application was used in the experiments. Each of the two experiments discussed a collection of algorithms. These algorithms can be found as tab pages in the tab control. Consider, for instance, the grammar transformations. These transformations remove the chain rules (u-rules) and non-root terminal nodes (z-nodes) that are discussed in Section 1.1.3. The tab page collection for tree grammars was therefore expanded with pages for analyzing and removing u-rules and z-nodes. Figure 3.10 shows the z/u analysis page for the grammar described in Example 1.1.3. This tab page contains a tree where u/z-items are marked with colored squares that indicate whether a rule/node is a chain rule or z-node. This tab page can therefore be used to study the grammar for the existence of such special rules and nodes.


Figure 3.10: z/u analysis page

The row of tab pages in Figure 3.10 also contains three transformation tab pages (only two are visible in the figure). Two of these tab pages are devoted to the removal of u-rules and z-nodes, while the third is targeted at the removal of unproductive and unreachable symbols and rules. As an example, we look at the tab page that can be used to remove non-root terminal nodes by applying the red-z reduction (this reduction is discussed in Section 4.2).

Figure 3.11: z removal page


The red-z reduction page (see Figure 3.11) contains a tree view, just like the analysis page. However, this tree can now be used to mark non-root terminal nodes that should be removed. Additionally, the user can select which red-z reduction style is used. After clicking the preview button, the two text boxes on the right show which rules need to be removed from and added to the grammar to remove these z-nodes. These two boxes can be used to get some insight into the effects of the transformation (number of new rules, number of new nonterminals etc.). This kind of information was used in the chapter about the experiments to learn more about the characteristics of this red-z reduction. Finally, one can use the apply button to remove/add the specified rules and store the new grammar as a new concept. This new grammar can then be inspected in more detail by selecting it in the mdv tree view on the left. The updated collection of concepts can afterwards be stored in the original file or a new file.

This small example shows what these tab pages look like and how the user interface can contribute to the research process. Similar tab pages were created for the other transformations and the automaton construction algorithms. The experiments that were performed with the help of the toolkit are discussed in Chapter 4.

3.3 Implementation details

The implementation of the toolkit had one important technical requirement: the toolkit & GUI should be usable on both the Microsoft Windows XP operating system and Apple Mac OS X. This meant that a cross platform programming language was needed, in combination with a cross platform graphics library. Research was therefore conducted to find combinations of programming languages and libraries that fulfil these requirements. This resulted in three candidates:

• Lazarus in combination with the Lazarus Component Library (LCL)
• Java in combination with the Standard Widget Toolkit (SWT)
• C++ in combination with a cross platform library (e.g. WxWidgets)

Lazarus [laz] was the language selected from these three candidates. The reason for this was its resemblance to the teaching language used at the department of mathematics and computer science at the TU/e. Lazarus is an open source and cross platform implementation of Delphi and was therefore a very promising candidate. However, a number of problems occurred with the LCL library after the first weeks of implementation. The LCL tab control was burdened with a number of bugs, and the LCL library was also not as cross platform as hoped, because controls behaved differently on the different platforms. After a month of programming, the choice was made to abandon Lazarus/LCL and switch to another candidate from the list. The choice was made to continue with Java and SWT, because Java is considered a programming language more suitable for rapid development than C++ and it had the advantage of 'compile once, run anywhere'. After this decision the implementation continued and the already implemented part was translated to Java. Fortunately, this rewriting took only two weeks. The choice of switching to another programming language and library turned out to be a good one, because Java and SWT gave rise to significantly fewer problems than Lazarus with LCL. The complete library and GUI application were implemented in Java version 1.5. In this implementation we mapped ForestFIRE and FIREWood to two Java packages, where each of the packages was divided into a collection of subpackages. This resulted in the following main hierarchy:

• ForestFIRE
  – Trees
  – Grammars
  – Patterns
  – Automata
• FIREWood
  – Data Access Component
  – Multiple Data View
  – Single Data View

The ForestFIRE packages follow the structure of the domain concepts that are described in Section 3.1. Each of the subpackages contains all the classes related to that concept (the trees package, for instance, contains the tree and node classes). These subpackages can also contain child packages that consist of classes implementing algorithms related to that concept. Table 3.1 shows the main packages listed above and the algorithmic subpackages with their corresponding Java names. This table shows, for instance, the trees subpackage and the trees.algorithms subpackage. The FIREWood package is divided into several subpackages, of which three directly correspond to the main components of the FIREWood application; the mdv and sdv subpackages contain both the presented abstract classes and the example implementations.

Table 3.1 provides, next to the names of the (sub)packages, an impression of their size. The toolkit and GUI together contain around 10,000 lines of code and 100 classes. The table lists the number of classes and lines of code for each subpackage in ForestFIRE and FIREWood. The table also contains some subpackages, like base, controls and extensions, that were not mentioned earlier. These packages contain classes that support the main ForestFIRE and FIREWood subpackages. The base subpackage, for example, contains the special collection types that are used in the toolkit (see Section C.1).


Package                  Number of classes   Lines of code
forestfire               57                  5785
  base                   8                   523
  trees                  6                   313
  trees.algorithms       1                   61
  grammars               3                   159
  grammars.algorithms    4                   562
  patterns               5                   103
  patterns.algorithms    3                   88
  automata               10                  2797
  automata.interfaces    3                   101
  automata.acceptance    3                   126
  automata.construction  6                   554
  automata.parsing       4                   345
  extensions             1                   53
firewood                 41                  5634
  controls               4                   192
  dac                    13                  864
  mdv                    1                   173
  sdv                    1                   155
  sdv.tabs               20                  3724
  extensions             1                   29

Table 3.1: Code statistics of toolkit and GUI

The ForestFIRE package (and its subpackages) can be used independently of the FIREWood package, so it is for instance possible to use it in other Java applications. The FIREWood application was also combined with the ForestFIRE library in a single JAR archive. This executable archive was used throughout this project to perform the needed experiments. All the packages and classes also contain JavaDoc documentation. This documentation simplifies the extension of both the toolkit and the GUI. We thereby realized the goal of building a toolkit and GUI that not only helped in performing the experiments, but that can also easily be extended to support new tree algorithms.

4 Experiments

This chapter discusses the experiments that were carried out with the ForestFIRE/FIREWood toolkit. These experiments focused on three subjects:

• Tree grammar transformations
• Tree automaton constructions
• Tree parsing

The tree grammar transformation experiments measure the influence of the grammar transformations that remove the chain rules and non-root terminal nodes. The grammar transformations were a subject of the experiments because certain tree automaton constructions focus on grammars with the Z− and/or U− characteristic. The transformations realize these characteristics by removing existing rules from the grammar and adding new rules and nonterminals to it. It was therefore interesting to discover how the Z− and U− characteristics can be obtained while introducing as few new rules and nonterminals as possible. The second set of experiments measures the characteristics of tree automaton constructions that are used to construct tree automata from tree grammars. The characteristics measured are running time and the size of the resulting automata. These experiments were carried out because tree automata constructed from tree grammars are very useful for solving the tree acceptance and tree parsing problems. The results of these experiments were therefore used to determine which construction algorithm is the most promising for building these automata. The final collection of experiments focuses on an algorithm that uses deterministic tree automata to solve the tree parsing problem. The goal of these experiments was to find out which type of deterministic automaton discussed in the automaton construction experiments is most efficient for solving the parsing problem. Before discussing these experiments it is important to provide some background on the tree grammars that are used. Section 4.1 introduces these grammars. The remaining two sections, 4.2 and 4.3, discuss the two categories of experiments.

4.1 Used tree grammars

The chosen grammars came from four sources: the draft PhD thesis of Loek Cleophas [Cle07], the iBurg software [iBU], a report by Huub ten Eikelder [tE89] and the Burg-files of the Mono project [mon]. These grammars were either grammars related to instruction selection or small grammars that were easy to inspect during the experiments. In the next sections we discuss each of these grammars in more detail. The details presented include basic statistics like the number of nodes, rules, terminals and nonterminals, but also more advanced statistics like the number of non-root terminal nodes and chain rules.


4.1.1 Thesis grammar

The first grammar was taken from the draft PhD thesis of Loek Cleophas. This grammar, ’Example 6.0.3’, has the following characteristics:

example603.ini
Number of rules                     6
Total number of nodes               12
Number of nonterminals              2
Number of terminals                 4
Number of non-root terminal nodes   3
Number of chain rules               1

4.1.2 iBurg standard grammars

Two grammars were taken from the collection of iBurg-files that is bundled with the iBurg software [iBU]. These grammars, ’Sample 4’ and ’Sample 5’, have the following characteristics:

sample4.ini
Number of rules                     12
Total number of nodes               20
Number of nonterminals              5
Number of terminals                 7
Number of non-root terminal nodes   1
Number of chain rules               4

sample5.ini
Number of rules                     10
Total number of nodes               23
Number of nonterminals              4
Number of terminals                 5
Number of non-root terminal nodes   3
Number of chain rules               2

These two grammars are related to instruction selection in compilers, where intermediate representation trees are translated to instructions. The rules in, for instance, ’Sample 5’ are rules that can be expected in a grammar used for instruction selection:

reg → con
loc → reg
reg → PLUS(reg, reg)
reg → PLUS(MEM(loc), reg)

Assume reg stands for register, con for constant and loc for memory location; then it can easily be seen that these rules are common instruction selection rules. These two grammars

are small, but representative of how the transformations perform on production rules that are likely to be found in the area of instruction selection.

4.1.3 Report of ten Eikelder

The report of ten Eikelder [tE89] describes the implementation of the bottom-up acceptor that is discussed by C. Hemerik and J. P. Katoen in [HK89]. This bottom-up acceptor is based on the use of dfrtas. The report by ten Eikelder contained a grammar targeted at the instruction set of the 68000 architecture. This grammar was selected because one of the constructions considered in this report produces the same dfrtas discussed by ten Eikelder. The characteristics of this grammar are:

68000.ini
Number of rules                     33
Total number of nodes               104
Number of nonterminals              3
Number of terminals                 12
Number of non-root terminal nodes   30
Number of chain rules               1

4.1.4 Mono project grammars

Finally, three larger grammars were added, originating from the Mono Project [mon]. The Mono Project provides an open source implementation of the .NET Framework. These Mono grammars are used for translating parse trees that consist of machine independent commands into machine dependent code. Grammars for the following architectures were used: X86, IA64 and Sparc. The characteristics of these grammars are:

mono-x86.ini
Number of rules                     505
Total number of nodes               1412
Number of nonterminals              8
Number of terminals                 269
Number of non-root terminal nodes   371
Number of chain rules               1

mono-ia64.ini
Number of rules                     432
Total number of nodes               1064
Number of nonterminals              8
Number of terminals                 262
Number of non-root terminal nodes   221
Number of chain rules               1


mono-sparc.ini
Number of rules                     484
Total number of nodes               1263
Number of nonterminals              8
Number of terminals                 272
Number of non-root terminal nodes   288
Number of chain rules               1

The files mono-x86.ini, mono-ia64.ini and mono-sparc.ini were built from the Burg-files contained in the Mono source code package. They can be assembled by combining the standard Mono Burg-files inssel.brg and inssel-float.brg with inssel-long.brg or inssel-long32.brg (depending on whether the architecture is 64-bit or 32-bit) and the machine dependent set of production rules that can be found in inssel-x86.brg, inssel-ia64.brg and inssel-sparc.brg. All these ini-files have the same single chain rule: base → reg. This chain rule originates from the inssel.brg file. A pleasant property of these real world grammars is that all terminals and nonterminals have a meaning, just like in the preceding iBurg grammars. The nonterminals, for instance, represent the following concepts:

stmt      Statement
reg       Standard register
lreg      Long register
freg      Floating point register
base      Base address
cflags    Standard compare result flag
fpcflags  Floating point compare result flag
i8con     8-bit integer constant

The advantage of these Mono grammars is that they are used in a real world application for instruction selection. They are therefore good candidates for the experiments.

4.2 Grammar transformation experiments

This section presents the results of the experiments that were carried out in the area of tree grammar transformations. Two transformations were the subject of the experiments:

• Removal of non-root terminal nodes in production rules (RED-Z*)
• Removal of chain rules (RED-U*)

RED-Z* refers to the complete transformation that removes all non-root terminal nodes, whereas RED-Z denotes a single transformation step that removes a single non-root terminal node. The same regular expression style abbreviation is used for the chain rules: RED-U* is the transformation that removes all chain rules and RED-U is one transformation step that removes one chain rule. The goal of the experiments is to find out how these transformations have to be applied such that as few new rules and nonterminals as possible are created. This number can for instance

be reduced by changing the order of transformations and applying rule reuse techniques in the RED-Z transformation. To clarify all this we start with a description of these two transformation algorithms, RED-Z(*) and RED-U(*). This discussion describes how these algorithms work and how they can be optimized by reusing symbols and rules. After these two sections the experiments with these transformations are discussed. There are three groups of experiments:

The first experiment measures the effect on the grammar size (number of rules, number of nonterminals) of the order of transformation steps: RED-Z*;RED-U*, RED-U*;RED-Z* and (RED-Z|RED-U)*. This experiment is carried out without reusing suitable nonterminals. The effect of reuse of suitable nonterminals during the RED-Z transformation steps will be examined in the second experiment. The effects are investigated for all the orders of RED-Z and RED-U transformation steps used in the first experiment. The last section will present the experiment that considers different strategies for selecting the next Z-node to be processed by a RED-Z transformation step: Shortest Tree First (STF), Tallest Tree First (TTF) or Random. This is done to measure the effect that these node selection strategies have on the level of reuse. This experiment will concentrate on the results of the RED-Z* transformations.

4.2.1 RED-Z

A RED-Z transformation step removes a single non-root terminal node from the right hand side of a production rule. This can be done by removing the subtree α rooted at this terminal node from that tree and replacing it with a new nonterminal X. A new production rule must then be created with the nonterminal X as left hand side and the subtree α as right hand side. Figure 4.1 shows such a RED-Z transformation step.

(1) S → a(d, d)   =⇒   (1a) S → a(X, d),  (1b) X → d

Figure 4.1: Removal of a non-root terminal node

This means that a new nonterminal and a new rule are created for each non-root terminal node that is removed from the grammar. We can now also remove the last non-root terminal node from the tree in Figure 4.1 by again applying RED-Z (see Figure 4.2), thereby implicitly having applied RED-Z*.

(1a) S → a(X, d),  (1b) X → d   =⇒   (1a) S → a(X, Y),  (1b) X → d,  (1c) Y → d

Figure 4.2: Removal of all non-root terminal nodes

However, one can also reuse rules and symbols if a grammar has two or more similar subtrees rooted at a non-root terminal node (e.g. the two d nodes in rule (1) of Figure 4.1). Such

subtrees can be replaced by the same nonterminal and therefore result in only one additional nonterminal and production rule (see Figure 4.3). The effects of this reuse are discussed in detail in Section 4.2.4.

(1) S → a(d, d)   =⇒   (1a) S → a(X, X),  (1b) X → d

Figure 4.3: Removal of non-root terminal nodes with reuse

The efficiency of this reuse is determined by the order in which the non-root terminal nodes are removed from a grammar. This is discussed in more detail in Section 4.2.5.
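To make the reuse concrete, the following minimal Java sketch performs RED-Z steps on string-encoded right hand sides (e.g. "a(d,d)" for the tree in Figure 4.1). All names are illustrative stand-ins, not the actual ForestFIRE classes; the reuse is realized by a table mapping already extracted subtrees to the nonterminals introduced for them.

    import java.util.*;

    // Minimal RED-Z sketch on string-encoded trees, e.g. "a(d,d)".
    final class RedZSketch {
        private final Map<String, String> reuse = new HashMap<>(); // subtree -> nonterminal
        private final List<String> newRules = new ArrayList<>();
        private int fresh = 0;

        // Replace one occurrence of 'subtree' in 'rhs' by a nonterminal,
        // reusing a previously introduced nonterminal when possible.
        String removeZNode(String rhs, String subtree) {
            String nt = reuse.get(subtree);
            if (nt == null) {                         // no reuse possible:
                nt = "X" + fresh++;                   // create a fresh nonterminal
                reuse.put(subtree, nt);
                newRules.add(nt + " -> " + subtree);  // add the rule X -> subtree
            }
            return rhs.replaceFirst(java.util.regex.Pattern.quote(subtree), nt);
        }

        public static void main(String[] args) {
            RedZSketch r = new RedZSketch();
            String rhs = "a(d,d)";                    // rule (1) S -> a(d,d)
            rhs = r.removeZNode(rhs, "d");            // first d: new rule X0 -> d
            rhs = r.removeZNode(rhs, "d");            // second d: X0 is reused
            System.out.println("S -> " + rhs);        // S -> a(X0,X0)
            System.out.println(r.newRules);           // [X0 -> d]
        }
    }

With the reuse table, the two d subtrees are mapped to the same nonterminal, as in Figure 4.3; removing the table makes every step introduce a fresh nonterminal, as in Figure 4.2.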

4.2.2 RED-U

The goal of the RED-U transformation step is, as already briefly discussed in Section 3.1.2, to remove chain rules. Let us recapitulate how this transformation step works. A chain rule A → B can be removed if one creates a new rule A → α for each subtree α (which must not consist of a single nonterminal) that is reachable from B. An example of such a RED-U transformation step can be seen in Figure 4.4. These reachable subtrees are gathered by computing the nonterminal closure for the nonterminal B. The nonterminal closure of B is the set of nonterminals that is reachable from B by applying chain rules. Subtrees that are in the rhs of production rules that have as lhs a nonterminal contained in the closure are called the rhs subtrees reachable from B. A small sketch of this closure computation is given below, after Figure 4.4.

(1) S → T,  (2) T → a(c, c, c)   =⇒   (1) S → a(c, c, c),  (2) T → a(c, c, c)

Figure 4.4: Removal of a chain rule
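The nonterminal closure driving a RED-U step can be computed with a simple worklist. The sketch below is only an illustration, not the toolkit's implementation; it assumes the chain rules are given as a map from each nonterminal to the nonterminals its chain rules directly produce.

    import java.util.*;

    // Sketch: the nonterminal closure of B under a set of chain rules.
    final class Closure {
        static Set<String> of(String b, Map<String, List<String>> chain) {
            Set<String> seen = new HashSet<>();
            Deque<String> work = new ArrayDeque<>();
            work.push(b);
            while (!work.isEmpty()) {
                String n = work.pop();
                if (seen.add(n)) {                    // first visit of n
                    for (String m : chain.getOrDefault(n, Collections.emptyList()))
                        work.push(m);                 // follow chain rule n -> m
                }
            }
            return seen;                              // the closure, including B itself
        }

        public static void main(String[] args) {
            Map<String, List<String>> chain = new HashMap<>();
            chain.put("S", Arrays.asList("T"));       // chain rule S -> T
            System.out.println(of("S", chain));       // {S, T} (order may vary)
        }
    }

The rhs subtrees reachable from B are then the right hand sides of all rules whose left hand side lies in this closure, excluding right hand sides that consist of a single nonterminal.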

4.2.3 Influence of RED-Z/RED-U order

The first experiment carried out compared the resulting Z−, U− grammars after applying both transformations in different orders. Three possibilities are discussed: RED-Z*;RED-U*, RED-U*;RED-Z* and (RED-Z | RED-U)*. This last possibility transforms the grammar by randomly applying RED-Z and RED-U transformation steps until the Z− and U− characteristics are reached. Due to this nondeterminism this complete transformation is executed a thousand times to get a better view of the average results. The first two apply the reduction steps in a fixed order, so these only have to be executed once. During this experiment the number of rules and the number of nonterminals were measured after each transformation. This experiment was carried out for all grammars, but the results are not reported for the IA64 and Sparc grammars of Mono, because they are very similar to the results of the X86 Mono grammar. These are the results of this experiment:


[Bar chart comparing the number of rules and nonterminals of the standard grammar with the results of the RED-Z*;RED-U*, RED-U*;RED-Z* and (RED-Z | RED-U)* transformations.]

Figure 4.5: ’Example 6.0.3’

[Bar chart comparing the number of rules and nonterminals of the standard grammar with the results of the RED-Z*;RED-U*, RED-U*;RED-Z* and (RED-Z | RED-U)* transformations.]

Figure 4.6: ’Sample 4’


[Bar chart comparing the number of rules and nonterminals of the standard grammar with the results of the RED-Z*;RED-U*, RED-U*;RED-Z* and (RED-Z | RED-U)* transformations.]

Figure 4.7: ’Sample 5’

[Bar chart comparing the number of rules and nonterminals of the standard grammar with the results of the RED-Z*;RED-U*, RED-U*;RED-Z* and (RED-Z | RED-U)* transformations.]

Figure 4.8: ’Mono X86’

These charts show that the resulting grammar is smaller when the original grammar first undergoes a RED-Z* transformation instead of a RED-U* transformation. This can be explained by the fact that each RED-U step creates additional non-root terminal nodes for each rule with non-root terminal nodes that is reachable from that chain rule. These additional nodes all result in a new nonterminal and a new production rule if one applies RED-Z transformation steps (without reuse of suitable nonterminals and rules) afterwards. The Example 6.0.3 grammar, for instance, contains three non-root terminal nodes. These

three nodes are in production rules that can be directly reached from the only chain rule in that grammar. So applying RED-U* generates three additional non-root terminal nodes, which all result in an extra nonterminal and production rule when applying RED-Z*.

It is however not true that RED-Z*;RED-U* always performs better than RED-U*;RED-Z*. The (RED-Z | RED-U)* experiments produced results that illustrate this. One would expect that the size of the (RED-Z | RED-U)* grammars is somewhere between the sizes of the RED-U*;RED-Z* and RED-Z*;RED-U* grammars. The average results confirm this expectation, but some of the individual runs showed remarkable results. There were resulting grammars that were even larger than the one from the RED-U*;RED-Z* transformation. Figure 4.9 shows this unexpected result for the X86 Mono grammar.

[Bar chart of the number of rules and nonterminals for RED-Z*;RED-U*, RED-U*;RED-Z*, and the minimum, median and maximum of the (RED-Z | RED-U)* runs.]

Figure 4.9: The minimum, median and maximum during the (RED-Z | RED-U)* transformations for the X86 Mono grammar.

Even more important is the fact that the example grammar constructed to demonstrate the bad (RED-Z | RED-U)* performance also shows that RED-U*;RED-Z* can produce a smaller grammar than RED-Z*;RED-U* when a grammar has a certain shape. This proves that the property suggested above (RED-Z*;RED-U* always performs better than RED-U*;RED-Z*) is certainly not true. Example 4.2.1 shows that RED-U*;RED-Z* can outperform RED-Z*;RED-U* and that the random variant can perform even worse than these two.

Example 4.2.1 This example shows that the relation RED-Z*;RED-U* ≤ (RED-Z | RED-U)* ≤ RED-U*;RED-Z* does not always hold: (RED-Z | RED-U)* can produce a larger grammar than RED-U*;RED-Z*, and RED-U*;RED-Z* can perform better than RED-Z*;RED-U*. The following grammar is used to illustrate this:

T = {b, c}
r = {(X,0), (Y,0), (b,1), (c,0)}


Prod = { (1) X → Y,  (2) X → b(c),  (3) Y → b(c) }

First RED-U*;RED-Z* and RED-Z*;RED-U* will be applied to this grammar. This results in the following intermediate and final sets of production rules:

RED-U*;RED-Z*:
  after RED-U*:  { (2) X → b(c),  (3) Y → b(c) }
  after RED-Z*:  { (2) X → b(V),  (3) Y → b(W),  (2a) V → c,  (3a) W → c }

RED-Z*;RED-U*:
  after RED-Z*:  { (1) X → Y,  (2) X → b(V),  (3) Y → b(W),  (2a) V → c,  (3a) W → c }
  after RED-U*:  { (1) X → b(W),  (2) X → b(V),  (3) Y → b(W),  (2a) V → c,  (3a) W → c }

The result is that the grammar from the RED-U*;RED-Z* transformation is smaller than the one from the RED-Z*;RED-U* transformation. This is caused by the RED-U* transformation in the first sequence: the rule that it creates is already present in the grammar. The number of non-root terminals is therefore not increased by this transformation, which has a positive impact on the resulting grammar. The (RED-Z | RED-U)* transformation can produce an even larger grammar by applying the transformation steps in a particular order:

Original set of rules:
  P = { (1) X → Y,  (2) X → b(c),  (3) Y → b(c) }
Apply RED-Z to rule (2):
  P = { (1) X → Y,  (2) X → b(V),  (3) Y → b(c),  (2a) V → c }
Apply RED-U to rule (1):
  P = { (2) X → b(V),  (3) Y → b(c),  (2a) V → c,  (4) X → b(c) }
Apply RED-Z to rule (3):
  P = { (2) X → b(V),  (3) Y → b(W),  (2a) V → c,  (4) X → b(c),  (3a) W → c }


Apply RED-Z to rule (4):
  P = { (2) X → b(V),  (3) Y → b(W),  (2a) V → c,  (4) X → b(Z),  (3a) W → c,  (4a) Z → c }

The result of this (RED-Z | RED-U)* transformation is a grammar that is even larger than the ones after RED-U*;RED-Z* and RED-Z*;RED-U*.

This unexpected behavior is caused by the fact that RED-U steps do not always create new rules. If a new rule, say x, is exactly the same as an existing rule y in the grammar, then the addition of x to the set of rules will not result in an additional rule. This situation changes when rule y contains a non-root terminal node and a RED-Z step is applied to y, changing it into rule y′. If this RED-Z step is followed by the same RED-U step, then this leads to an additional rule, because rule y′ is not equal to rule x anymore.

Summarized, the best transformation order depends on the shape of the grammar. If a grammar contains no chain rules, or only chain rules that are converted into unique rules by RED-U, then one should choose the RED-Z*;RED-U* transformation. However, the larger the number of non-unique rules created by the RED-U* transformation, the wiser it is to opt for the RED-U*;RED-Z* transformation. A more advanced option would be to first remove the chain rules that do not produce new rules and then apply RED-Z* followed by the remaining RED-U steps.

4.2.4 RED-Z*/U* order with reuse

This section discusses the effects of reusing suitable nonterminals and production rules. This experiment considers the same transformation orders as the previous section, except for the (RED-Z | RED-U)* transformation. Reusing new nonterminals and production rules can result in fewer rules and fewer nonterminals after the transformation when the original grammar contains production rules that have similar subtrees in their right hand sides. As discussed earlier, this reuse takes place in the RED-Z transformation steps. The reuse efficiency may depend on the order in which the non-root terminal nodes are selected to undergo a RED-Z transformation step. This experiment uses the STF (Shortest Tree First) node selection strategy. In this selection strategy, the smallest subtrees rooted at non-root terminals are removed first, followed by the nodes with larger subtrees. The effects of using other node selection strategies are investigated in Section 4.2.5. Below are the results of this experiment for the RED-U*;RED-Z* and RED-Z*;RED-U* transformation sequences. The graphs show the results of the previous section (RED-Z*, no reuse) and the new results (RED-Z*, STF with reuse). Additionally we present the results for the Mono IA64 and Sparc grammars.


[Bar chart comparing the number of rules and nonterminals of the standard grammar with the results of RED-Z*, RED-U*;RED-Z* and RED-Z*;RED-U*, each with STF reuse and without reuse.]

Figure 4.10: ’Example 6.0.3’, RED-Z* with and without reuse.

[Bar chart comparing the number of rules and nonterminals of the standard grammar with the results of RED-Z*, RED-U*;RED-Z* and RED-Z*;RED-U*, each with STF reuse and without reuse.]

Figure 4.11: ’Sample 4’, RED-Z* with and without reuse.


[Bar chart comparing the number of rules and nonterminals of the standard grammar with the results of RED-Z*, RED-U*;RED-Z* and RED-Z*;RED-U*, each with STF reuse and without reuse.]

Figure 4.12: ’Sample 5’, RED-Z* with and without reuse.

[Bar chart comparing the number of rules and nonterminals of the standard grammar with the results of RED-Z*, RED-U*;RED-Z* and RED-Z*;RED-U*, each with STF reuse and without reuse.]

Figure 4.13: ’Mono X86’, RED-Z* with and without reuse.


[Bar chart comparing the number of rules and nonterminals of the standard grammar with the results of RED-Z*, RED-U*;RED-Z* and RED-Z*;RED-U*, each with STF reuse and without reuse.]

Figure 4.14: ’Mono IA64’, RED-Z* with and without reuse.

[Bar chart comparing the number of rules and nonterminals of the standard grammar with the results of RED-Z*, RED-U*;RED-Z* and RED-Z*;RED-U*, each with STF reuse and without reuse.]

Figure 4.15: ’Mono Sparc’, RED-Z* with and without reuse.

The graphs show what one would expect: the RED-Z* transformations with reuse produce grammars with fewer nonterminals and production rules than the corresponding ones without reuse. Below, a table is presented to provide a better view of these differences. This table compares the number of rules after such transformations with reuse to the number of rules after such transformations without reuse. This is done by setting the number of rules in the resulting grammar with no reuse to 100%. The table then contains the relative sizes when reuse is enabled in the RED-Z transformation steps:


Number of rules difference in percentages
                RED-Z*   RED-U*;RED-Z*   RED-Z*;RED-U*
Example 6.0.3   100%     79%             100%
Sample 4        100%     92%             100%
Sample 5        92%      89%             94%
Mono X86        64%      67%             70%
Mono IA64       71%      75%             78%
Mono Sparc      68%      70%             74%

The table shows that the reuse of nonterminals and production rules is very attractive. The number of rules in the large Mono grammars drops to 64–78% of its size without reuse. The effect on the number of nonterminals is even more impressive:

Number of nonterminals difference in percentages
                RED-Z*   RED-U*;RED-Z*   RED-Z*;RED-U*
Example 6.0.3   100%     63%             100%
Sample 4        100%     75%             100%
Sample 5        86%      75%             86%
Mono X86        16%      14%             16%
Mono IA64       17%      15%             17%
Mono Sparc      17%      14%             17%

This last table shows that the number of nonterminals drops significantly for the large Mono grammars. The reuse of these nonterminals and rules is, as expected, very useful for large grammars.

The results in Figures 4.10 – 4.15 also show that there is no difference in grammar size between RED-Z*;RED-U* and RED-U*;RED-Z* when reusing nonterminals and production rules. This is also proven in [Cle07, Section 3.3.3.1 and Remark 3.3.37]. First applying the RED-U* transformation can still result in copying non-root terminal nodes. However, these are now, due to STF reuse, replaced by the same nonterminal when applying RED-Z*. So RED-U*;RED-Z* will not, as before, perform worse than RED-Z*;RED-U*. Experiments were also carried out with the (RED-Z | RED-U)* transformation. With reuse enabled it also produced grammars that had exactly the same size as the grammars produced by RED-Z*;RED-U* and RED-U*;RED-Z*. In summary, this means that the RED-Z transformation steps with reuse make the result independent of the application order of RED-U and RED-Z, and that they are more efficient in terms of the size of the resulting grammar.

4.2.5 RED-Z node selection strategies

The final experiment in the area of grammar transformations concentrates on different RED-Z node selection strategies. The node selection strategies differ in the order in which they remove all the non-root terminal nodes and could therefore have an impact on the reuse of existing nonterminals and rules. The three node selection strategies that were investigated are:


• STF (Shortest Tree First)
• TTF (Tallest Tree First)
• Random

The first node selection strategy, STF (used in the previous experiment), removes the non-root terminal nodes based on the size of the subtree rooted at each node, starting with a node that has the smallest subtree. The TTF selection strategy does the opposite by starting with a node that has the tallest subtree. The last variant removes the nodes in a random order; a small sketch of the three strategies is given below. This experiment compares these three node selection strategies with the original grammar and the results of the RED-Z* transformation without reuse. This is done for all the grammars described in Section 4.1.
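The three strategies differ only in how they order the list of candidate non-root terminal nodes before the RED-Z* loop consumes it. In the minimal sketch below the subtree size is a plain integer; the real toolkit derives it from the tree, and the pair representation is purely illustrative.

    import java.util.*;

    // Sketch: ordering Z-node candidates for the three selection strategies.
    // A candidate is a pair {node id, size of the subtree rooted at the node}.
    final class NodeSelection {
        static void order(List<int[]> candidates, String strategy) {
            if ("STF".equals(strategy)) {
                candidates.sort(Comparator.comparingInt(c -> c[1]));                    // smallest subtree first
            } else if ("TTF".equals(strategy)) {
                candidates.sort(Comparator.comparingInt((int[] c) -> c[1]).reversed()); // tallest subtree first
            } else {
                Collections.shuffle(candidates);                                        // random order
            }
        }
    }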

Two remarks on the results displayed below:

• The random selection strategy is not deterministic; the results for this selection strategy are determined by taking the average number of rules and nonterminals over a thousand runs. The minimum, maximum and median of these runs are also visualized for the X86, IA64 and Sparc Mono grammars.
• The results for the ’Sample 4’ grammar are not listed because it has only one non-root terminal node, which does not create different situations for the node selection strategies.

[Bar chart of the number of rules and nonterminals for the standard grammar and for RED-Z* with no reuse, STF, TTF and random node selection.]

Figure 4.16: ’Example 6.0.3’, results of the different RED-Z node selection strategies.


[Bar chart of the number of rules and nonterminals for the standard grammar and for RED-Z* with no reuse, STF, TTF and random node selection.]

Figure 4.17: ’Sample 5’, results of the different RED-Z node selection strategies.


[Two bar charts: (a) the number of rules and nonterminals for the standard grammar and for RED-Z* with no reuse, STF, TTF and random node selection; (b) the minimum, median and maximum over the random runs.]

Figure 4.18: ’Mono X86’, results of the different RED-Z node selection strategies together with minimum, maximum and median of the random runs.


[Two bar charts: (a) the number of rules and nonterminals for the standard grammar and for RED-Z* with no reuse, STF, TTF and random node selection; (b) the minimum, median and maximum over the random runs.]

Figure 4.19: ’Mono IA64’, results of the different RED-Z node selection strategies together with minimum, maximum and median of the random runs.


[Two bar charts: (a) the number of rules and nonterminals for the standard grammar and for RED-Z* with no reuse, STF, TTF and random node selection; (b) the minimum, median and maximum over the random runs.]

Figure 4.20: ’Mono Sparc’, results of the different RED-Z node selection strategies together with minimum, maximum and median of the random runs.

The graphs show that the STF and TTF node selection strategies result in smaller grammars than the random selection strategy on average. However, the difference between STF/TTF and random selection is very small (around 1.5%). This difference can be explained by the random processing of similar subtrees that have a non-root terminal as root. Processing these subtrees by chance in different directions (STF or TTF) can have a negative influence on the reuse. The chance that this happens is quite small because it requires parallel processing of

similar subtrees in opposite order. Example 4.2.2 shows a situation where the random selection strategy performs worse compared to the STF and TTF node selection strategies.

Example 4.2.2 This example shows that the random node selection strategy can perform worse than the STF and TTF strategies. This is illustrated with the following grammar:

T = {a, b, c}
r = {(X,0), (a,2), (b,1), (c,0)}
P = { (1) X → a(b(c), b(c)) }

Applying RED-Z* with TTF based reuse to this grammar results in:

P = { (1) X → a(S, S),  (2) S → b(Z),  (3) Z → c }

More rules and nonterminals can be produced, however, if the RED-Z steps are applied in a particular order. First one of the bottom nodes is removed. This results in the following set of rules:

P = { (1) X → a(b(c), b(Z)),  (2) Z → c }

This is followed by the removal of the two b-nodes, which results in the following rules:

P = { (1) X → a(S, T),  (2) Z → c,  (3) S → b(c),  (4) T → b(Z) }

After this the last non-root terminal node in rule 3 can be removed:

P = { (1) X → a(S, T),  (2) Z → c,  (3) S → b(Z),  (4) T → b(Z) }

The resulting grammar has one additional rule compared to the resulting grammar of the RED-Z* transformation with TTF node selection. This is caused by the fact that the random selection can change subtrees that were exactly the same into subtrees that are visually different due to the replacement of nodes. These changed subtrees are then replaced by different nonterminals instead of the same one (S and T instead of a single S in the grammar above).

Figures 4.16–4.20 also suggest that the STF and TTF node selection strategies perform similarly. Theoretically TTF can be less effective than STF. Unfortunately this was not visible in the results of the experiment, due to the shape of the grammars. Example 4.2.3 was constructed to show that this can happen.

Example 4.2.3 This example illustrates that there exist grammars for which the TTF node selection strategy in the RED-Z* transformation performs worse than the STF selection strategy. The following grammar is used to illustrate this:


Original set of rules:

T = {a, b, c}
r = {(X,0), (Z,0), (a,2), (b,1), (c,0)}
P = { (1) X → a(b(c), b(Z)),  (2) Z → c }

Applying RED-Z* with STF based reuse to this grammar results in:

P = { (1) X → a(S, S),  (2) Z → c,  (3) S → b(Z) }

Applying RED-Z* with TTF based reuse to the original set of rules results in:

P = { (1) X → a(S, T),  (2) Z → c,  (3) S → b(Z),  (4) T → b(Z) }

RED-Z* with TTF reuse created a larger grammar than the transformation with STF node selection. The reason for this is easy to see: starting at the top, the two subtrees below symbol a in rule 1 look different while they actually generate the same single subtree. The STF selection strategy does not have this problem because it first replaced the smaller subtree at the bottom of rule 1.

Summarizing, the STF selection strategy is slightly more effective than the TTF and the random node selection strategies. The differences, mostly small, depend on the grammar used. Nevertheless, the complexity of the three strategies is almost equal, because sorting is not necessary for STF when retrieving the non-root terminals in a bottom-up direction (using the leaves field in our toolkit). The STF selection strategy is therefore the most profitable node selection strategy when reusing production rules.

4.2.6 Conclusion

The three experiments provided much insight into the two transformations, especially into RED-Z/RED-Z*. These experiments indicate that the reuse optimization of the RED-Z transformation performs very well, because it reduces the number of new rules significantly compared to RED-Z without reuse. The first experiment (without RED-Z reuse) showed that RED-U and RED-Z can influence each other negatively. The RED-Z with STF reuse in the second experiment removes this negative influence. This RED-Z reuse also results in grammars that are significantly smaller than the ones transformed without reuse. The final experiment was used to test the efficiency of three reuse strategies (STF, TTF, Random). This experiment showed that the STF tactic performs at least as well as the other two. Summarized, the transformations performed best when the STF reuse strategy is used in the RED-Z algorithm. If reuse is enabled, then the application order of RED-U and RED-Z also makes no difference.

4.3 Automaton construction experiments

In this section, the experiments with tree automaton construction algorithms are discussed. These automaton constructions build automata that can be used for solving the tree

acceptance and tree parsing problems. There are different algorithms for different types of automata. Let us therefore first give a quick overview of the types of tree automata. As discussed in Chapter 1 there are two main types of automata: nondeterministic and deterministic automata, where nondeterministic automata can also contain ε-transitions. The two types can themselves be divided into two categories depending on the acceptance direction of the automaton: root-to-frontier (rf) or frontier-to-root (fr). This results in six types of automata.

• Nondeterministic
  – Root-to-frontier ((ε)nrfta)
  – Frontier-to-root ((ε)nfrta)
• Deterministic
  – Root-to-frontier (drfta)
  – Frontier-to-root (dfrta)

We already mentioned that the accepting power of the drfta is less than the accepting power of the other three automata. The experiments in this section focus on the constructions of the remaining five types of automata. The conducted experiments were used to measure the efficiency of the different construction algorithms and the characteristics of the resulting automata:

• Number of states
• Number of transitions
• Memory use

The measurements were used to compare the different construction algorithms and types of automata. These results may be used to select the most efficient construction technique for automata used in tree acceptance and tree parsing applications. Here we mean most efficient in terms of construction speed and/or automaton memory use.

The construction methods will be discussed separately for deterministic and nondeterministic automata, because they differ in construction algorithms and optimization possibilities. However, some general topics are treated before discussing these two groups of automata. First the measurement techniques used in the experiments are described. This is followed by a discussion of general automaton construction issues. Finally the specific construction algorithms for nondeterministic and deterministic automata are discussed.

4.3.1 Measurement techniques

The automaton construction experiments measure two things: the construction process itself (running time) and the constructed automaton (number of states, number of transitions, memory size). Characteristics like the number of states and the number of transitions are measured easily by counting them. Measuring running time

and memory usage is a bit more complicated due to restrictions of the Java programming language and runtime environment.

Runtime measurements

The time consumption measurements were performed using built-in methods of Java. The Java class library provides the System.nanoTime() method, which returns the difference, measured in nanoseconds, between the current time and a specific point in time (chosen by the JRE). This method is called before and after a construction. The difference between these two measurements is considered to be the running time of the construction algorithm.
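A minimal sketch of this timing approach follows; the measured action is passed in as a Runnable, which in the experiments would wrap one of the construction algorithms.

    // Sketch: running time measurement with System.nanoTime().
    final class Stopwatch {
        static long timeNanos(Runnable action) {
            long start = System.nanoTime();   // sample before the construction
            action.run();
            return System.nanoTime() - start; // elapsed time in nanoseconds
        }

        public static void main(String[] args) {
            long ns = timeNanos(() -> { /* a construction would run here */ });
            System.out.printf("construction took %.3f ms%n", ns / 1e6);
        }
    }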

Memory usage measurements

The memory usage measurements were carried out in a similar way to the time measurements. Java provides the possibility to request the total size (in bytes) of the reserved memory pool. It also provides a method that returns how many bytes are still available in this pool. Subtracting the free space from the total space gives the space that is currently used. This value was computed before and after the construction (before and after the time measurements, to be more precise). However, there were some difficulties with these measurements caused by the garbage collector. Measuring the free memory inside the memory pool is only reliable if the garbage collector has removed all unreferenced objects. This can be achieved by manually triggering the garbage collector until the amount of free memory stabilizes, because a single garbage collection run does not remove all unreferenced objects.
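The sketch below illustrates this: Runtime.totalMemory() and Runtime.freeMemory() provide the pool sizes, and System.gc() is triggered repeatedly until the used-memory figure stops shrinking. The sleep interval and the example allocation are arbitrary illustrative choices.

    // Sketch: measuring used memory after stabilizing the garbage collector.
    final class MemoryProbe {
        static long usedBytesStable() {
            Runtime rt = Runtime.getRuntime();
            long used = rt.totalMemory() - rt.freeMemory();
            long previous;
            do {                               // collect until nothing more is freed
                previous = used;
                System.gc();
                try { Thread.sleep(50); } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
                used = rt.totalMemory() - rt.freeMemory();
            } while (used < previous);
            return used;
        }

        public static void main(String[] args) {
            long before = usedBytesStable();
            int[] data = new int[1_000_000];   // stands in for a constructed automaton
            long after = usedBytesStable();
            System.out.println("delta: " + (after - before) + " bytes for " + data.length + " ints");
        }
    }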

Hardware & Java environment

The last measurement aspect to discuss is the environment of the experiments: the hardware and the Java Runtime Environment. These facts can be found in Table 4.1.

CPU               AMD Athlon64 3500+ (Winchester)
Memory            1024MB Dual Channel DDR400
Operating system  Microsoft Windows XP SP2 (32bit)
Java version      Java SE 6 Update 1

Table 4.1: Used hardware & software

4.3.2 Automaton construction: general issues

Two general tree automaton subjects are treated before the specific deterministic and nondeterministic variants are discussed, namely state construction and transition storage.

State construction

The tree automata in the upcoming experiments are constructed from regular tree grammars. Each of the states in such an automaton corresponds to a subtree of a production rule in the grammar (for nondeterministic automata) or a set of such subtrees (for deterministic automata). Each construction algorithm discussed in this report receives a set of subtrees,

called an Item Set, from the outside world and creates states corresponding to (sets of) these subtrees. The standard set of subtrees used for these algorithms is the All-Sub item set. This set is defined in the following way:

Items = Subtrees(RHS(Prods′))   where   Prods′ = Prods ∪ {S′ → S}

This definition states that the All-Sub item set contains all subtrees in the right hand sides of the production rules and the start symbol. Example 4.3.1 shows such an All-Sub item set for an example grammar. As mentioned, states are constructed from the subtrees in these item sets. A consequence of this is that providing a smaller item set to a construction algorithm can result in an automaton with fewer states (this can be seen in the upcoming sections about the experiments with the construction algorithms). However, if one shrinks such a set, one must ensure that the items in it can still be used to represent all possible subtrees of the production rules. There are two smaller item sets used in the experiments:

• Proper-N
• Proper-S

The Proper-N set is constructed by using the following definition.

Items = N ∪ ProperSubtrees(RHS(Prods′))

The proper subtrees in this definition are subtrees that do not occur only as a complete right hand side. Trees that occur only as complete right hand sides can be removed, because they can also be represented by a nonterminal that directly produces such a tree. Example 4.3.1 shows a Proper-N item set as well. Compared to the All-Sub set, the tree with root a is removed, because it is not a proper subtree of any right hand side. This tree is obsolete because it can be represented by nonterminal A.

The Proper-S item set is similar to the Proper-N item set and can be constructed using the following definition.

Items = S ∪ ProperSubtrees(RHS(Prods′))

This item set contains the same number of items as Proper-N or fewer, depending on the shape of the grammar. Unlike Proper-N, Proper-S does not contain nonterminals that are not the start symbol and are not present as a proper subtree in the right hand side of a production rule. These removed nonterminals can be divided into two types:

• Nonterminals that are only present as the complete right hand side of a rule (and as a left hand side).
• Nonterminals that are only present as the left hand side of a rule.

In the first case the corresponding rule is a chain rule and this nonterminal can be represented by the left hand side nonterminal of that rule. The nonterminals of the second case are unreachable, as discussed in Section 1.1.3. Unreachable (and unproductive) nonterminals

and rules are useless and therefore removed from the grammars before constructing the item set. Example 4.3.1 also provides the Proper-S item set for an example grammar. It shows that nonterminal A is removed compared to Proper-N. This nonterminal can be removed because it is superfluous due to the chain rule and its absence as a proper subtree in the rules.

Example 4.3.1 This example illustrates for the grammar below which subtrees can be found in the All-Sub, Proper-N and Proper-S item sets.

N = {S, A}
Σ = {a, b, c}
r = {(S,0), (A,0), (a,2), (b,1), (c,0)}
Prods = { (1) S → A,  (2) A → a(b(c), c) }

The three item sets that can be constructed for this grammar are:

All-Sub:   S, A, a(b(c), c), b(c), c

Proper-N:  S, A, b(c), c

Proper-S:  S, b(c), c

Experiments will be performed with all three sets in the construction algorithms, because they could have a large impact on the size of the resulting automata. The results can be found in separate sections about the nondeterministic and deterministic construction algorithms.
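As an illustration of how such an item set can be gathered, the sketch below collects the subtrees of a right hand side by a recursive walk, which is the core of building All-Sub. The Node class and the string-based comparison are stand-ins for the toolkit's tree classes; the nonterminal items (here S and A) and the proper-subtree filtering needed for Proper-N and Proper-S are omitted.

    import java.util.*;

    // Sketch: collecting all subtrees of a right hand side tree.
    final class ItemSets {
        static final class Node {
            final String label;
            final List<Node> children;
            Node(String label, Node... cs) { this.label = label; this.children = Arrays.asList(cs); }
            public String toString() {
                return children.isEmpty() ? label : label + children.toString();
            }
        }

        // Adds every subtree of 'rhs' (including rhs itself) to 'items'.
        static void collect(Node rhs, Set<String> items) {
            items.add(rhs.toString());
            for (Node c : rhs.children) collect(c, items);
        }

        public static void main(String[] args) {
            // rule (2) A -> a(b(c), c) from Example 4.3.1
            Node rhs = new Node("a", new Node("b", new Node("c")), new Node("c"));
            Set<String> items = new LinkedHashSet<>();
            collect(rhs, items);
            System.out.println(items);   // [a[b[c], c], b[c], c]
        }
    }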

Transition storage

For each terminal, a transition relation exists that relates a single state to a vector of states of length n, where n is the rank of the terminal. Such a relation can be undirected, fr-directed or rf-directed. The latter two types can be found in frontier-to-root tas and root-to-frontier tas, respectively. Only these two types of automata are discussed in this report, because they provide the possibility to optimize the transition storage for this specific type of transition and thereby optimize the usage of the automata. FR-directed transitions go from a vector of states to a single state; rf ones go the other way around. This means that the transition relation for a symbol a of rank n > 0 can be represented as two types of functions.

70 Chapter 4. Experiments 4.3. Automaton construction experiments

For nondeterministic tas:
  FR: Ra ∈ Q^n → P(Q)
  RF: Ra ∈ Q → P(Q^n)

For deterministic tas:
  FR: Ra ∈ Q^n → Q
  RF: Ra ∈ Q → Q^n

These functions can also be used for symbols of rank 0, where Q^0 is then the empty tuple. However, we chose to remove this empty tuple for the case n = 0, because it contains no information. This leads to the following functions:

For nondeterministic tas:
  FR: Ra ∈ P(Q)
  RF: Ra ∈ Q

For deterministic tas:
  FR: Ra ∈ Q
  RF: Ra ∈ Q

The transition storage implementations for deterministic automata used in the experiments use conventional techniques. As discussed earlier, only dfrtas are implemented, because their rf counterparts have less acceptance power. The storage of the fr-directed transitions for these automata is achieved by creating an n-dimensional transition table for each symbol of rank n > 0 (transition tables for symbols of rank 0 simply consist of a single record). In practice these are integer indexed tables; lookup is therefore done by translating input states to unique integer values (like in BURG, see Section 2.3). These transition tables are constructed by using nested dynamic arrays. Dynamic arrays were chosen due to the characteristics of the construction algorithms for these dfrtas. The algorithms compute the state set in an iterative way, so one does not know the final number of states before the end of the construction and therefore one does not know the length of these arrays. Dynamic arrays provide the possibility to enlarge the transition table as the number of states grows and thereby solve this problem in an easy way (a small sketch of such a table is given below). There is also a penalty paid: dynamic arrays have more overhead than static arrays, and in practice this overhead turned out to be quite large. This is caused by the large number of arrays that have to be used. The formula below shows the number of dynamic arrays used in a dfrta with |Q| states.

Σ_{t ∈ terminals, t.rank > 0}  Σ_{i=0}^{t.rank−1} |Q|^i

There were also transition tables implemented based on a combination of static and dynamic arrays. This implementation is not discussed in detail, because it was only used to compare the efficiency of this storage. This second implementation uses a low number of dynamic arrays:


Σ_{t ∈ terminals} t.rank

When these two implementations were compared, they confirmed the statement that the dynamic arrays use a large amount of memory. An automaton for a large grammar uses almost 150 megabytes when using the transition tables that are completely constructed from dynamic arrays. Switching to the second implementation resulted in a drop of 40 megabytes while nothing else changed in the automaton and construction algorithm. However, the first implementation was chosen for the experiments because the second version could only be used for tables where all dimensions have the same size. This was not true for the filtered dfrtas that are discussed at the end of Section 4.3.4. The first implementation was therefore chosen to be able to compare the different dfrta construction algorithms in a fair way.
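For a rank-2 terminal, the fully dynamic variant can be sketched as follows; each dimension is an ArrayList that grows on demand, so entries can be added while the state set is still being computed. Representing states by integer indices and using −1 for an absent entry are illustrative choices.

    import java.util.ArrayList;

    // Sketch: a growable two-dimensional transition table for a rank-2 terminal.
    final class DynTable2 {
        private final ArrayList<ArrayList<Integer>> rows = new ArrayList<>();

        void set(int q0, int q1, int result) {
            while (rows.size() <= q0) rows.add(new ArrayList<>()); // grow first dimension
            ArrayList<Integer> row = rows.get(q0);
            while (row.size() <= q1) row.add(-1);                  // grow second dimension
            row.set(q1, result);
        }

        int get(int q0, int q1) {
            if (q0 >= rows.size() || q1 >= rows.get(q0).size()) return -1;
            return rows.get(q0).get(q1);
        }
    }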

We now describe the implementation of the transition tables for nondeterministic tas. As described in the beginning of this section, nondeterministic automata can contain multiple transitions on a symbol for one state or state vector.

Figure 4.21: An example nfrta with multiple transitions for the same input vector/symbol combination.

Figure 4.21 shows, for example, how state vector (q1, q2) has two a-transitions originating from it. This means that the transition storage must not store a single result state or result vector, but a set of states/vectors. Two other aspects played a role when designing the transition tables for the nondeterministic tas, namely the direction of the transition relations (fr or rf) and the fact that the tables should be optimized for the relatively low number of transitions (compared to deterministic tas). The transition storage is based on hash tables instead of normal tables, since nondeterministic automata mostly have a low number of transitions for each state. The exact shape of the transition storage depends on whether it is targeted at an fr or rf automaton. The transition tables for nrftas were easy to construct: a single hash table was created for each symbol with a rank larger than zero. This table stores for each state a set of vectors that can be reached with a transition for that symbol from this source state. The transition tables for nfrtas are comparable to those for dfrtas, but nfrtas use nested hash tables instead of nested dynamic arrays, because these are more efficient here. This is caused by the fact that nfrtas, unlike dfrtas, do not have a transition for each possible symbol/state vector combination. Another difference is that the dfrta transition storage points to a single state for each vector stored in the multi-dimensional storage, while the nfrta storage points to a set of result states.
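For a rank-2 symbol, the fr-directed storage just described can be sketched with nested hash maps whose innermost value is the set of result states (a set, because the automaton is nondeterministic). States are plain integers here; the real implementation uses the toolkit's state objects.

    import java.util.*;

    // Sketch: nested hash table transition storage of an nfrta for one rank-2 symbol.
    final class NfrtaTable2 {
        private final Map<Integer, Map<Integer, Set<Integer>>> table = new HashMap<>();

        void add(int q0, int q1, int result) {
            table.computeIfAbsent(q0, k -> new HashMap<>())
                 .computeIfAbsent(q1, k -> new HashSet<>())
                 .add(result);
        }

        Set<Integer> lookup(int q0, int q1) {
            Map<Integer, Set<Integer>> inner = table.get(q0);
            if (inner == null) return Collections.emptySet();
            return inner.getOrDefault(q1, Collections.emptySet());
        }
    }

For the example of Figure 4.21, adding both a-transitions for the vector (q1, q2) simply yields a two-element result set under the keys for q1 and q2.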

The different implementations to store transitions have different properties. These properties will emerge and be discussed in the upcoming experiments.


4.3.3 Constructions of nondeterministic automata

The constructions for nondeterministic tree automata are based on the constructions described in [Cle07, Section 6.6, Construction of tree acceptors]. This section describes fr and rf versions of four constructions for nondeterministic tree automata.

• tga-ta:all-sub
• tga-ta:all-sub:rem-ε
• tga-ta:proper-n:rem-ε
• tga-ta:proper-s:rem-ε

These four constructions are based on the different item sets that can be used for constructing automaton states. The first two are based on the item set All-Sub and the last two on Proper-N and Proper-S. Remarkable is the REM-ε addition to the last three constructions. The first construction is the only one that generates an automaton with ε-transitions. The other three are targeted at ε-free tree automata (and therefore contain this REM-ε addition), because these automata lead to more efficient acceptance and parsing algorithms.

Eight different types of automata can be constructed based on the two directions and four constructions. The upcoming four sections discuss the four construction algorithms and the characteristics of the constructed nfrtas and nrftas.

TGA-TA:ALL-SUB

The tga-ta:all-sub automaton construction creates a nondeterministic automaton based on the All-Sub item set and is described in [Cle07, Construction 6.6.2, TGA-TA:ALL-SUB]. This construction algorithm creates a state for each unique subtree that can be found in the item set. Symbol transitions between these states are generated from the parent-child relations of these subtrees. This collection of transitions is then extended by an ε-transition for each rule between the state for the left hand side nonterminal and the state for the right hand side tree. The first and fifth lines of Tables E.1 – E.14, especially the Basic Statistics tables, contain the measurements for these constructions (both fr and rf). A pleasant property of this first construction algorithm is the short construction time and the small amount of used memory. This is due to the fact that the number of states is proportional to the number of different subtrees and the number of transitions is proportional to the number of unique parent-child node relationships and the number of production rules. This advantage will be seen in all nondeterministic automaton constructions.

There is one difference between the constructed εnfrtas and εnrftas. This difference is found in the way the transitions are stored. The tables in Appendix E show that the fr variant uses slightly more memory than the rf variant. The difference can be explained by the data structures used for transition storage. The fr variant uses a multidimensional hash table for each terminal in the grammar, as described in Section 4.3.2. The dimension of the table depends on the rank of the corresponding terminal. The rf variant uses only a one-dimensional hash table for each terminal. An

rf automaton therefore contains fewer hash tables to store transitions. All these hash tables carry some overhead. This overhead is the major contributor to the small difference in size between nfrtas and nrftas.

TGA-TA:ALL-SUB:REM-ε

The disadvantage of the previous construction is that it produces an automaton with ε-transitions. This results in computing the ε-closure each time a state is visited in an acceptance algorithm that uses these automata. The construction in this section constructs an ε-free automaton based on the All-Sub item set. A description of this construction can be found in [Cle07, Construction 6.6.10, TGA-TA:ALL-SUB:REM-ε]. This construction computes the ε-closure during the construction and adds additional non-ε-transitions to replace the ε-transitions. For example: if state α has an a-transition to state β and state γ can be reached from β by an ε-transition, then an extra a-transition is added from α to γ to replace this ε-transition. The ε-transition can be removed if this is done for all transitions to β. A small sketch of this replacement step is given below.
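In the sketch, transitions are simplified to triples of strings (for an fr automaton the source would be a state vector), and epsClosure is assumed to map every state to its ε-closure including the state itself; this illustrates the idea rather than reproducing the construction from [Cle07].

    import java.util.*;

    // Sketch: replace transitions into a state by copies into every state of
    // that state's ε-closure; the ε-transitions themselves can then be dropped.
    final class RemEps {
        static List<String[]> removeEpsilon(List<String[]> transitions,   // {source, symbol, target}
                                            Map<String, Set<String>> epsClosure) {
            List<String[]> result = new ArrayList<>();
            for (String[] t : transitions) {
                Set<String> closure = epsClosure.getOrDefault(t[2], Collections.singleton(t[2]));
                for (String reach : closure)
                    result.add(new String[] { t[0], t[1], reach });       // extra non-ε transition
            }
            return result;                                                // ε-free transition set
        }
    }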

Lines two and six in Tables E.1 – E.14 show the measurements when these constructions are applied. Obviously, only the number of transitions has changed if one compares these automata to the automata with ε-transitions. There are also some unexpected results: for instance, the occupied memory of the fr automaton shrinks while that of the rf automaton grows. The one-dimensional hash table that contains the ε-transitions disappears in both variants. However, the effect of the replacement of these transitions is different. This too is caused by the different implementations of the fr and rf transition storage. The rf automaton has to store the new transitions as new vectors in the one-dimensional transition tables. These new vectors contain many existing states. This is illustrated with Figure 4.22. The rf variant of this automaton has the a-transition qs → (q0, q1, q2). Removing the ε-transition results in the additional transition qx → (q0, q1, q2). This means storing an additional vector of size three and thereby duplicating pointers to all three states.

Figure 4.22: Undirected ta with an ε-transition.

The fr automaton does not have the disadvantage of the rf variant, due to its multidimensional hash tables. Removing ε-transitions can result in creating additional hash tables, but in many cases it only results in adding the result state to the set of result states for an existing vector. This last case also applies to the example in Figure 4.22: the vector (q0, q1, q2) already exists and only the result set {qs} is expanded with state qx. This addition is in practice smaller than the memory that is freed by the removal of the hash map that stored the ε-transitions.


TGA-TA:PROPER-N:REM-ε

The TGA-TA:ALL-SUB:REM-ε construction produces ε-free automata, but these automata contain states which are not reachable from the root accepting state. These are the states whose corresponding match is a non-proper subtree or an unreachable nonterminal (when not removed from the grammar). Constructing the automaton based on the Proper-N item set will not create states for non-proper subtrees. This section discusses the construction that can be found in [Cle07, Construction 6.6.19, TGA-TA:PROPER-N:REM-ε]. The construction produces an ε-free automaton based on this Proper-N item set. Tables E.1 – E.14 confirm that the construction produces automata with fewer states and transitions. The measurements show that the resulting automaton is constructed in less time and uses only 20% of the memory of the standard automaton based on the All-Sub item set. Table 4.2 compares the fr automata created with the TGA-TA:PROPER-N:REM-ε construction to the fr automata created with the TGA-TA:ALL-SUB:REM-ε construction.

               # States   # Transitions   Memory Use
Example 6.0.3   62.5%       78.6%          78.5%
Sample 4        46.2%       75.9%          67.3%
Sample 5        50.0%       72.7%          71.9%
68000           25.0%       58.5%          62.0%
Mono X86        11.8%       61.1%          59.0%
Mono IA64        9.1%       62.8%          58.4%
Mono Sparc      10.4%       61.6%          58.4%

Table 4.2: nfrta characteristics for the TGA-TA:PROPER-N:REM-ε construction compared to the TGA-TA:ALL-SUB:REM-ε construction

The table shows that there is a significant drop in memory usage for all grammars. Especially the large grammars seem to benefit from this different item set. One can construct the same table for the rf automata. The columns with the numbers of states and transitions would contain the same values. However, there are again some small differences in memory usage. These differences are the same as described in the previous section about the standard All-Sub construction and are caused by the use of different data structures.

TGA-TA:PROPER-S:REM-ε

This construction is closely related to the previous construction. The construction algorithm [Cle07, Construction 6.6.15, TGA-TA:PROPER-S:REM-ε] is the same, with the exception that it uses the Proper-S item set instead of the Proper-N item set. This construction can result in automata that are smaller than the ones constructed by the Proper-N variant. This is the case if the grammar contains nonterminals that are not present as a proper subtree or as start symbol. The tables in Appendix E show that there is no difference between the constructions with the Proper-N and Proper-S item set for our grammars. This is due to the fact that none of the grammars used contains nonterminals that are not present as a proper subtree or start symbol. The behavior and the result of this construction are therefore the same as for the previous one.


4.3.4 Constructions of deterministic automata

This section focuses on the construction of dfrtas. drftas are not covered because their acceptance/parse power is less than that of their dfrta counterparts. This section discusses five construction algorithms for dfrtas.

• DFRTA Standard Construction
• DFRTA Construction with Subtree Filtering
• DFRTA Construction with Index Filtering
• DFRTA Construction with Symbol Filtering
• DFRTA Construction with Index & Symbol Filtering

Deterministic automata in general have more states and transitions than their corresponding nondeterministic variants. The difference can be very large when no optimization techniques are used during the construction of deterministic automata. The first construction technique applies no special optimization techniques; the last four algorithms focus on decreasing the size of the automaton. All these constructions can be based on any of the three item sets (All-Sub, Proper-N and Proper-S). The upcoming subsections discuss the characteristics of these five constructions and the effect of the different item sets.

Standard DFRTA construction

The standard construction of a dfrta differs from the construction of the nondeterministic automata. The construction of a dfrta has two phases. The first phase creates states for all terminals of rank zero. For each such terminal a state q is created and the terminal is added to the match set of this state. Next, all production rules with this terminal as rhs are gathered, and the transitive chain rule closure on nonterminals (computed using Warshall's Algorithm [War62]) is determined. All nonterminals that are reachable from these rhss are then added to the match set of the state. Finally a start transition, of the form () → q, is added for each of these terminals. These are simply stored as q, since () → q ≅ q. Example 4.3.2 provides an example of this first phase.

Example 4.3.2 This example shows how the initial states are created from the input grammar in the first phase of the standard dfrta construction algorithm. This example is based on the following grammar:

N = {S, X, Y}
Σ = {a, b, c, d}
r = {(S,0), (X,0), (Y,0), (a,2), (b,1), (c,0), (d,0)}
Prods = {(1) S → a(X, Y), (2) S → b(Y), (3) X → c, (4) Y → d}

Terminals c and d are the terminals of rank zero, so this results in two initial states.

• q0: {c}


• q1: {d}

The match sets need to be expanded with the nonterminals that produce these terminals. This results in these updated match sets.

• q0: {c, X}

• q1: {d, Y}

Finally the match sets have to be updated with the closures of X and Y, respectively. However, there are no chain rules that produce X or Y, so these two sets/states form the state set after phase 1. Transitions are created for these states and their corresponding terminals as described in the introduction. This results in the following two ‘transitions’.

Tc   q0
Td   q1

These states and transitions can be considered as the start states and transitions of this automaton.
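The chain rule closure of this example grammar is trivial, since there are no chain rules. As a minimal illustration of the Warshall-style closure computation mentioned above, the following Java fragment (with toy data; the names are not ForestFIRE's) computes which nonterminals are reachable from which by chain rules:

    import java.util.*;

    final class ChainClosureSketch {
        public static void main(String[] args) {
            List<String> nts = List.of("S", "X", "Y");
            // reach[i][j] == true iff nonterminal i derives j by chain rules;
            // seeded with the chain rules themselves (toy data: S -> X).
            boolean[][] reach = new boolean[3][3];
            reach[0][1] = true;
            for (int i = 0; i < 3; i++) reach[i][i] = true;   // reflexive closure
            for (int k = 0; k < 3; k++)                       // Warshall: via k
                for (int i = 0; i < 3; i++)
                    for (int j = 0; j < 3; j++)
                        reach[i][j] |= reach[i][k] && reach[k][j];
            for (int i = 0; i < 3; i++)
                for (int j = 0; j < 3; j++)
                    if (reach[i][j])
                        System.out.println(nts.get(i) + " =>* " + nts.get(j));
        }
    }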

The second phase creates new states based on the current state set and the item set (All-Sub, Proper-N or Proper-S). This is done in an iteration over all existing states and the states that are created during this phase. When processing state qi in this iteration, the algorithm looks at all terminals with a rank larger than zero. It creates the following set of state vectors of length n, where n is the rank of the terminal:

{q0, ..., qi}^n \ {q0, ..., qi−1}^n

This means that all vectors of length n are created that contain only states qj where j ≤ i and that have at least one occurrence of qi. The algorithm determines for each of these vectors which subtrees from the item set match the terminal with this child state vector. These matches are stored in a match set for that terminal in combination with that vector. If a subtree in this match set is also present as the complete right hand side of a production rule, then the match set is expanded with the nonterminal left hand side of that rule and the closure of this nonterminal. A new state is created if there is no state with this match set. Transitions between the states from the vector and the (newly) constructed state are added to the automaton. This second phase is illustrated in Example 4.3.3.

Example 4.3.3 This example shows how new states are created from the existing states in the second phase of the standard dfrta construction algorithm that uses the All-Sub item set. States q0 and q1 are the states that came out of the first phase.

• q0: {c, X}

• q1: {d, Y}

The All-Sub item set that contains all possible matches for new states is: {a(X, Y), b(Y), X, Y, c, d}. Now we can start executing phase 2. This means computing new states for possible state vectors (consisting of existing states). This starts with case i = 0, which checks all vectors that only contain states qj where j ≤ i and that have at least one occurrence of qi (this automatically holds for i = 0).

Case i=0: Terminal a will be the first symbol for which the vectors are constructed. Creating these vectors for state q0 and terminal a of rank 2 results in this single vector: (q0, q0). Visualizing this vector as a tree with root a gives a({c, X}, {c, X}). There is no subtree in the All-Sub item set that matches this subtree (all possibilities for the child match sets are considered). This means that a new state q2 is created with an empty match set. This results in this small table.

Ta   q0
q0   q2

The same procedure is followed for terminal b, resulting in vector (q0), which also matches no subtree and therefore results in the same state q2.

Tb
q0   q2

Both symbols are processed for state q0, so one can proceed to the second iteration.

Case i=1:

Proceeding with i = 1 means that we continue by handling state vectors containing q0 and q1. For symbol a this results in three vectors that need to be processed: (q0, q1), (q1, q0) and (q1, q1). The first vector can be visualized as the tree a({c, X}, {d, Y}).

Subtree a(X, Y) from the item set matches this tree. This results in a new state q3 with match set {a(X, Y), S}. The second and third vector result in an empty match, so they result in transitions from these vectors to the existing state q2. The complete iteration for terminal a therefore results in the following transition table.

Ta   q0   q1
q0   q2   q3
q1   q2   q2

The same is done for terminal b, where vector (q1) results in a new state q4 with match set {b(Y), S} and thus gives the following one-dimensional transition table.

Tb
q0   q2
q1   q4


The second phase will be repeated until all states (cases i = 2 to i = 4) are processed. This does not result in new states for this example grammar. The state set below is therefore the complete set of states for the resulting automaton.

             match set
Phase 1  q0  {c, X}
         q1  {d, Y}
Phase 2  q2  ∅
         q3  {a(X, Y), S}
         q4  {b(Y), S}

The only changes in the iterations for q2, q3 and q4 are in the transition tables of terminals a and b. New vectors can be constructed for both terminals based on these states. However, transitions from these vectors do not result in new states. This results in these final transition tables:

Ta   q0   q1   q2   q3   q4
q0   q2   q3   q2   q2   q2
q1   q2   q2   q2   q2   q2
q2   q2   q2   q2   q2   q2
q3   q2   q2   q2   q2   q2
q4   q2   q2   q2   q2   q2

Tb
q0   q2
q1   q4
q2   q2
q3   q2
q4   q2

The final result is an automaton with 5 states and 32 transitions: 25 entries in the table for a, 5 in the table for b, plus the two start transitions for the terminals of rank 0 from phase 1.
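A minimal sketch of the phase-2 vector enumeration in Java (hypothetical names): it generates exactly the set {q0, ..., qi}^n \ {q0, ..., qi−1}^n by enumerating all vectors over {q0, ..., qi} and keeping those with at least one occurrence of qi. For i = 1 and a rank-2 terminal it prints the three vectors processed for terminal a in Example 4.3.3.

    import java.util.*;

    final class VectorEnumerationSketch {
        static List<int[]> vectors(int i, int n) {
            List<int[]> result = new ArrayList<>();
            enumerate(new int[n], 0, i, false, result);
            return result;
        }
        // pos: next position to fill; seenQi: whether qi already occurs
        static void enumerate(int[] v, int pos, int i, boolean seenQi, List<int[]> out) {
            if (pos == v.length) {
                if (seenQi) out.add(v.clone());   // skip vectors without qi
                return;
            }
            for (int q = 0; q <= i; q++) {
                v[pos] = q;
                enumerate(v, pos + 1, i, seenQi || q == i, out);
            }
        }
        public static void main(String[] args) {
            // Prints [0, 1], [1, 0], [1, 1], i.e. (q0,q1), (q1,q0), (q1,q1).
            for (int[] v : vectors(1, 2)) System.out.println(Arrays.toString(v));
        }
    }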

This second phase is very time consuming, because in practice it creates all possible state vectors of length n for each terminal of rank n. These vectors are compared to all the subtrees in the item set. This comparison is not trivial, because each state vector for a symbol has to be compared to all subtrees α in the grammar with that symbol as root. The complexity is increased by the fact that each state at position i of the vector can contain multiple matching subtrees and that each of these subtrees has to be compared with the subtree at position i of subtree α. However, an optimization is applied here: the construction stores all subtrees in the production rules of the grammar indexed by their root symbol, which avoids searching for the subtrees α that need to be checked against the vectors. This optimizes the computation of the matches, but still requires checking many possible matches for a large number of vectors for each symbol.

The first three dfrta lines in the tables in Appendix E show the results of the experiments with this standard dfrta construction, for all three item sets introduced before. The experiments with the All-Sub item set show the most remarkable results. The number of states is almost equal to the number of states of the nondeterministic variants, but the construction time needed for the large Mono grammars is very high. The algorithm needs about five minutes to construct an automaton, and the resulting automaton also needs more than one hundred megabytes to store its states and transitions. The large number of transitions (e.g. 25 million for the Mono X86 grammar) explains this memory consumption.


Switching to the Proper-N or Proper-S item set makes a big difference. The automaton constructed with such an item set uses 98% less memory. Table 4.3 shows the relative memory size and construction time for the Mono automata constructed with the Proper-N set, where 100% is the value for the construction with the All-Sub item set.

         Memory   Construction Time
X86      1.66%    0.63%
IA64     1.22%    0.31%
Sparc    1.47%    0.51%

Table 4.3: Standard Proper-N construction relative to All-Sub

The difference in size and construction time can be easily explained. The standard dfrta construction inspects all combinations of states for each terminal in the grammar to construct all transitions. This results in the following number of checks/transitions for an automaton with |Q| states:

∑_{t ∈ terminals} |Q|^rank(t)

The X86 Mono grammar for example contains around 80 terminals of rank 2. With the 557 states constructed from the All-Sub set, these 80 terminals alone lead to 80 · 557² ≈ 25 million transitions. These transitions are stored in n-dimensional tables which are modeled by nested dynamic arrays. There is a cell in these tables for each transition, and each of these cells contains a pointer to the resulting state. This results in 95 megabytes of pointers for the X86 Mono grammar. The remaining 50 megabytes are occupied by the overhead of the dynamic arrays that represent the tables. These 50 megabytes could be reduced by using static arrays instead of dynamic arrays, as discussed in Section 4.3.2. Dynamic arrays were however used to create a fair comparison between standard dfrtas and filtered dfrtas (which cannot be constructed efficiently using static arrays).

Summarized, the automata created with this standard dfrta construction algorithm have a large collection of transitions, even when using the Proper-N or Proper-S item sets. This leads to impractical automata. However, there are filtering techniques to reduce the number of transitions. The following four sections discuss constructions that use these filtering techniques.

DFRTA with subtree filtering

The dfrta construction with the subtree filtering technique tries to reduce the transition table size. This is done by reducing the possible indices for these tables (which are all states in the standard dfrta construction). The dfrta construction with subtree filtering adds an additional filtered match set table (R-table) and a translation table (φ-table) to the automaton. The filtered match set table contains all unique match sets from the original state set after removing subtrees that are not proper subtrees, so it contains only match sets with proper subtrees. This can result in fewer match sets than states, because two match sets from two original states can become equal after removing non-proper subtree matches. The φ-table describes which filtered match set corresponds to which original state.


These filtered match sets are used as indices for the transition tables instead of the original states. This replacement does not affect the acceptance/parse power of the automaton, because a transition based on a non-proper subtree always leads to the state with the empty match set and is therefore useless: the match set of the resulting state could only contain subtrees in which this non-proper subtree occurs as a child tree, which contradicts the fact that it is a non-proper subtree. Such transitions therefore all result in the state with the empty match set. The result of using these filtered match sets instead of the (match sets of the) original states is that the transition tables for each of the terminals can be based on these smaller filtered match sets instead of the original states. This results in fewer transitions, less memory use and faster constructions. An example of this can be seen in Example 4.3.4.

Example 4.3.4 Examples 4.3.2 and 4.3.3 show the standard dfrta construction. This example shows what happens when applying subtree filtering during the construction. The set {X, Y} is the set of proper subtrees of the grammar and is therefore used for filtering. The table below shows the original match sets and the match sets after filtering with these proper subtrees.

             match set       filtered match set
Phase 1  q0  {c, X}          {X}
         q1  {d, Y}          {Y}
Phase 2  q2  ∅               ∅
         q3  {a(X, Y), S}    ∅
         q4  {b(Y), S}       ∅

This results in the following filter table R and translation table φ:

R-Table      φ-Table
R0: ∅        q0 → R1
R1: {X}      q1 → R2
R2: {Y}      q2 → R0
             q3 → R0
             q4 → R0

The filter table contains only three different sets while there are five states (and corresponding match sets), so using these filtered sets as base for the transition tables reduces the size of each transition table from 5^n entries to 3^n entries, where n is the rank of the terminal. The transition tables for terminal a are printed in Table 4.4 to illustrate this. The left part of the table contains the standard table based on the original states, and the right part contains the transition table based on the subtree filtered match sets.


Ta   q0   q1   q2   q3   q4
q0   q2   q3   q2   q2   q2
q1   q2   q2   q2   q2   q2
q2   q2   q2   q2   q2   q2
q3   q2   q2   q2   q2   q2
q4   q2   q2   q2   q2   q2

Ta   R0   R1   R2
R0   q2   q2   q2
R1   q2   q2   q3
R2   q2   q2   q2

Table 4.4: Transition tables for a; left: standard indexing, right: filtered match set indexing.
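The core of the R-table and φ-table construction can be sketched as follows (a hypothetical representation, with subtrees written as plain strings): each match set is intersected with the set of proper subtrees, and the distinct filtered sets are numbered in order of appearance. Applied to Example 4.3.4 this yields the three R-entries and five φ-entries; only the numbering of the R-entries may differ from the example, since it depends on the insertion order.

    import java.util.*;

    final class SubtreeFilterSketch {
        public static void main(String[] args) {
            // Match sets of Example 4.3.4, subtrees written as strings.
            List<Set<String>> matchSets = List.of(
                Set.of("c", "X"), Set.of("d", "Y"), Set.of(),
                Set.of("a(X,Y)", "S"), Set.of("b(Y)", "S"));
            Set<String> properSubtrees = Set.of("X", "Y");

            Map<Set<String>, Integer> rTable = new LinkedHashMap<>(); // R-table
            int[] phi = new int[matchSets.size()];                    // phi-table
            for (int q = 0; q < matchSets.size(); q++) {
                Set<String> filtered = new TreeSet<>(matchSets.get(q));
                filtered.retainAll(properSubtrees);       // drop non-proper matches
                phi[q] = rTable.computeIfAbsent(filtered, f -> rTable.size());
            }
            rTable.forEach((set, r) -> System.out.println("R" + r + ": " + set));
            for (int q = 0; q < phi.length; q++)
                System.out.println("q" + q + " -> R" + phi[q]);
        }
    }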

Appendix E shows the results of the experiments with this filtering technique for all grammars, and these results confirm the expectations. If we compare the standard construction and the construction with subtree filtering for the All-Sub item set, then we see a large difference. The normal set of states is not influenced, but the number of transitions drops significantly. The R-Entries column in the ‘Table Statistics’ tables in Appendix E shows how large the filtered match set tables are. Take for example the Mono IA64 grammar. The standard dfrta for this grammar contains 438 states. A 2-dimensional transition table of a terminal of rank 2 will contain 438² entries. The filtered match set table in the subtree filtered automaton contains only 41 entries. The transition table for the same terminal of rank 2 will now contain 41² entries, which is 0.9% of the entries of the standard automaton. This difference shrinks when using the Proper-N or Proper-S item set. This is caused by the fact that the states of these automata do not contain non-proper subtrees (except the nonterminals N or the start symbol S) due to the restricted item set. Table 4.5 compares the standard construction and the construction with subtree filtering for the Mono grammars.

                   # Transitions   Memory Use   Construction Time
X86 – All-Sub        1.6%            2.5%          2.8%
X86 – Proper-N      97.0%           99.0%        243.0%
IA64 – All-Sub       0.9%            2.4%          2.0%
IA64 – Proper-N     95.4%           99.1%        217.7%
Sparc – All-Sub      1.1%            2.4%          2.3%
Sparc – Proper-N    92.7%           95.5%        291.6%

Table 4.5: Construction characteristics for the Mono grammars for the subtree filter construction compared to the standard construction

The table confirms that there is a large gain when applying subtree filtering in combination with the All-Sub item set, and that there is almost no gain when Proper-N is used as the item set. The construction time actually increases for this last item set, due to the extra work that is needed for building the filter tables. The results for the Proper-S set are omitted from the table because this item set contains the same subtrees as Proper-N for these Mono grammars and therefore results in the same automata. Overall, subtree filtering has a positive influence on the automaton size, but there are more efficient filtering techniques, even when using the Proper-N or Proper-S item set. These constructions are presented in the upcoming sections.


DFRTA with index filtering

The construction with the index filtering technique is similar to the one with subtree filtering, but instead of creating a single table of filtered match sets that are used for indexing the transition tables, multiple filtered tables (and φ-tables, but these are not discussed in detail) are constructed. The index filtering technique constructs n tables, where n is the maximal rank in the grammar. The ith of these n tables contains the match sets filtered on the subtrees that occur as ith child of any terminal (see Example 4.3.5).

Example 4.3.5 This example shows the effect of index filtering based on Examples 4.3.2 and 4.3.3. The used grammar uses terminals with rank at most two. As described, the ith filter table is based on the subtrees that are present as ith child of a terminal (the fact that they are children of a terminal implies that they are also proper subtrees). The table below shows the possible subtrees for both indices.

Index 1    Index 2
{X, Y}     {Y}

These two sets can now be used to filter the match sets of the states for both indices:

             match set       filtered for Index 1   filtered for Index 2
Phase 1  q0  {c, X}          {X}                    ∅
         q1  {d, Y}          {Y}                    {Y}
Phase 2  q2  ∅               ∅                      ∅
         q3  {a(X, Y), S}    ∅                      ∅
         q4  {b(Y), S}       ∅                      ∅

The table for index 1 contains 3 match sets and the table for index 2 only 2 sets. The index filtering technique thus results in a 3×2 transition table for terminal a, whereas the subtree filtering technique would result in a table of nine entries for a symbol of rank two. This shows the usefulness of the index filtering technique.

The construction with index filtering was performed for all grammars. Appendix E shows the results of these experiments. The focus is mainly on the Mono grammars, due to their size. These grammars do not contain terminals with a rank larger than 2, so the corresponding automata use two filtered match set tables (this can also be seen in the R-Tables column). The total number of entries in these two tables is slightly larger than the number of entries in the single subtree filter table, yet each index table is smaller than the subtree filtering table. The advantage of this can be seen in the number of transitions and the memory use. The number of transitions is halved with respect to subtree filtering. The results for this construction based on the Proper-N item set are summarized in Tables 4.6 and 4.7.

              # Transitions   Memory Use   Construction Time
Mono X86       47.6%           53.4%         32.2%
Mono IA64      42.3%           54.3%         29.0%
Mono Sparc     46.7%           53.9%         29.6%

Table 4.6: Construction characteristics for the index filter construction when compared to the subtree filter construction (Constructions with Proper-N item set)


              # Transitions   Memory Use   Construction Time
Mono X86       46.1%           52.9%         78.4%
Mono IA64      40.4%           53.8%         63.2%
Mono Sparc     43.3%           51.5%         86.2%

Table 4.7: Construction characteristics for the index filter construction when compared to the standard dfrta construction (Constructions with Proper-N item set)

These tables show that the index filtering method is a good technique to reduce both the size of the automaton and the construction time. The construction time overhead introduced by subtree filtering has almost disappeared for the Mono grammars (see the last column of Table 4.7 compared to the same column in Table 4.5). There are two similar filtering techniques that perform even better. These filters are discussed in the next two sections.

DFRTA with symbol filtering

The dfrta construction with symbol filtering is based on filtering match sets with the subtrees that are present as child trees of a certain symbol. This is realized by a filter table and φ-table for each symbol. Such a filter table contains all unique match sets from the states after filtering them with this criterion. This method can be very efficient when subtrees are only present as children of certain terminals, because these subtrees will then only be used as indices in the tables for those symbols. Example 4.3.6 shows the effect of this filtering technique for the example grammar.

Example 4.3.6 This example shows the effect of symbol filtering based on Examples 4.3.2 and 4.3.3. The symbol filtering will create two tables. The first table contains the unique match sets after filtering them with the subtrees that occur as a child of terminal a. The second table does the same for terminal b. The table below shows which subtrees can be found in the filtered match sets.

Symbol a    Symbol b
{X, Y}      {Y}

These two sets can then be used to create the filtered tables for both symbols:

             match set       filtered for Symbol a   filtered for Symbol b
Phase 1  q0  {c, X}          {X}                     ∅
         q1  {d, Y}          {Y}                     {Y}
Phase 2  q2  ∅               ∅                       ∅
         q3  {a(X, Y), S}    ∅                       ∅
         q4  {b(Y), S}       ∅                       ∅

This results in a filter table with 3 match sets for terminal a and a table with 2 match sets for terminal b. This means that the transition table of symbol a contains 3×3 records and the transition table of b only contains 2 records.
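The per-symbol filter criteria can be derived directly from the grammar, as the following sketch illustrates (hypothetical types, not ForestFIRE's API): for every terminal, collect the subtrees that occur as its children anywhere in a rule rhs. For the example grammar this prints, in some order, {X, Y} for a and {Y} for b, matching the table above.

    import java.util.*;

    final class SymbolFilterCriteriaSketch {
        record Tree(String symbol, List<Tree> children) {
            @Override public String toString() {
                return children.isEmpty() ? symbol
                     : symbol + children.toString().replace('[', '(').replace(']', ')');
            }
        }
        static Map<String, Set<Tree>> childSubtreesPerSymbol(List<Tree> rhss) {
            Map<String, Set<Tree>> result = new HashMap<>();
            for (Tree rhs : rhss) collect(rhs, result);
            return result;
        }
        static void collect(Tree t, Map<String, Set<Tree>> acc) {
            for (Tree child : t.children()) {
                acc.computeIfAbsent(t.symbol(), s -> new HashSet<>()).add(child);
                collect(child, acc);   // also visit nested children
            }
        }
        public static void main(String[] args) {
            Tree x = new Tree("X", List.of()), y = new Tree("Y", List.of());
            Tree aXY = new Tree("a", List.of(x, y)), bY = new Tree("b", List.of(y));
            // rhss of the example grammar: a(X, Y) and b(Y)
            childSubtreesPerSymbol(List.of(aXY, bY))
                .forEach((sym, subs) -> System.out.println(sym + ": " + subs));
        }
    }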

The example shows that the symbol filtering technique performs slightly worse than the index filtering technique (index filtering: 9 transitions (see previous section), symbol filtering: 11 transitions), but this changes when using grammars that are targeted at instruction selection (like the Mono grammars). Such grammars contain large numbers of terminals, and many of the subtrees in the rules are only found as child trees of certain terminals. This results in a large collection of very small filter tables. If one applies index filtering to these grammars, then this results in larger transition tables, because subtrees are mostly not restricted to a certain index only. They therefore end up in two (the used instruction selection grammars all use terminals with rank at most two) very large filtered tables. This is illustrated by Example 4.3.7.

Example 4.3.7 This example shows the benefit of symbol filtering compared to index filtering when a grammar is used in which subtrees are only present as children of certain terminals. This is the grammar used:

N = {S, X, Y}
Σ = {a, b, c, d}
r = {(S,0), (X,0), (Y,0), (a,2), (b,1), (c,0), (d,0)}
Prods = {(1) S → a(X, X), (2) S → a(Y, Y), (3) X → b(c), (4) Y → d}

The set of states (with corresponding match sets) for both the index-filtered and symbol-filtered dfrta is as follows if one uses the Proper-N item set in the construction:

q0: {c}
q1: {Y}
q2: ∅
q3: {S}
q4: {X}

Both filter techniques can be applied to this state set to construct the transition tables. This is first done for the index filtering technique. Applying this filtering results in two filter tables, where entry Rx,y contains the yth filtered set for index x:

Index 1        Index 2
R1,0: {c}      R2,0: ∅
R1,1: {Y}      R2,1: {Y}
R1,2: ∅        R2,2: {X}
R1,3: {X}

Using these filtered items as indices for the transition tables results in 4×3 cells for the symbol of rank 2 and a table with 4 cells for the symbol of rank 1. For this grammar this gives a total of 16 + 2 transitions (two additional transitions for the two symbols of rank 0). Applying the symbol filtering technique also results in two tables, but now one for each terminal with a rank larger than zero:

Symbol a       Symbol b
Ra,0: ∅        Rb,0: {c}
Ra,1: {Y}      Rb,1: ∅
Ra,2: {X}

These filter tables are clearly smaller than the filter tables of the index filtering technique.


This is caused by the fact that the symbols X and Y are only present as children of symbol a, while symbol c is only present as a child of b. Index filtering includes both matches in the filter tables for index 1 and 2, because X and Y are present at both indices. These filtered items result in a 3×3 table for the symbol of rank 2 and a table with only 2 records for the symbol of rank 1. The total number of transitions for this automaton is therefore 11 + 2 transitions. This example shows that symbol filtering can perform better than index filtering for a grammar where subtrees are related to specific symbols instead of specific indices. This is often the case for instruction selection grammars.

The Mono tables in Appendix E confirm this statement. The filter tables (R-Tables) contain many entries, but they are divided over a large number of tables, resulting in smaller tables and therefore smaller transition tables. Table 4.8 compares the average number of R-table entries for index filtering and symbol filtering.

              Average Index R-Table   Average Symbol R-Table
Mono X86       44 entries              3.0 entries
Mono IA64      28 entries              2.7 entries
Mono Sparc     35.5 entries            2.9 entries

Table 4.8: Average filter table comparison between index filtering and symbol filtering.

The average size of the tables for these grammars drops by more than a factor of 10. This has an enormous impact on the number of transitions, because the average transition table of a terminal of rank 2 now has no more than 3×3 entries. This also has a large impact on the construction time. Table 4.9 compares the number of transitions, memory usage and construction time between symbol filtering and index filtering. The same comparison is made between symbol filtering and the standard construction in Table 4.10.

              # Transitions   Memory Use   Construction Time
Mono X86        1.3%           69.4%         5.0%
Mono IA64       2.5%          112.5%         6.8%
Mono Sparc      1.5%           92.1%         4.1%

Table 4.9: Construction characteristics for the symbol filter construction when compared to the index filter construction (Constructions with Proper-N item set)

              # Transitions   Memory Use   Construction Time
Mono X86        0.6%           36.7%         3.9%
Mono IA64       1.0%           60.5%         4.3%
Mono Sparc      0.7%           47.4%         3.6%

Table 4.10: Construction characteristics for the symbol filter construction when compared to the standard dfrta construction (Constructions with Proper-N item set)

As expected, the tables show a large reduction in transition table entries and construction time, caused by the smaller number of transitions that need to be constructed. However, the memory usage is surprising; it stays around the level of the index filtering, while the number of transitions drops. This is partly caused by the extra filter tables and so-called φ-tables (these are necessary for translating states to filtered match sets when using the automaton in acceptance/parse applications). These additional data structures generate extra data. The number of φ-entries is for instance 10 times larger than the number of transitions (there is an entry for each state/filter table combination). Another cause of the extra data is the overhead of the large number of dynamic Java data structures that are used for the representation of the filter tables and the φ-tables. Nevertheless, this filtering technique is very attractive due to the reduced construction time. Table 4.9 shows that less than 7% of the construction time of the index filtering remains for the Mono grammars, while the memory usage stays at a similar level.

DFRTA with index & symbol filtering

The final construction algorithm uses a combination of the index and symbol filtering techniques. With this combined technique, filter tables and φ translation tables are created for each index i of each terminal a. The match sets in such a filter table are then filtered with the subtrees that are present at index i of terminal a. Example 4.3.8 illustrates this filtering technique.

Example 4.3.8 This example shows the effect of combined symbol and index filtering for the automaton of Examples 4.3.2 and 4.3.3. This filtering technique will create three tables. The first two tables refer to symbol a, where the first table contains all filtered match sets for symbol a at index 1 and the second table does the same for index 2. The third table contains the filtered match sets for index 1 of symbol b. The table below shows which subtrees can be found in the filtered match sets.

Symbol a, Index 1    Symbol a, Index 2    Symbol b, Index 1
{X}                  {Y}                  {Y}

These three sets can then be used to create the filtered tables:

             match set       Symbol a, Index 1   Symbol a, Index 2   Symbol b, Index 1
Phase 1  q0  {c, X}          {X}                 ∅                   ∅
         q1  {d, Y}          ∅                   {Y}                 {Y}
Phase 2  q2  ∅               ∅                   ∅                   ∅
         q3  {a(X, Y), S}    ∅                   ∅                   ∅
         q4  {b(Y), S}       ∅                   ∅                   ∅

This results in very small filter tables, where each table contains only two match sets. The resulting transition table for symbol a will only contain 4 entries and the table for symbol b only 2.

The example shows that the filter tables shrink even further when applying the combined symbol and index filtering. This can also be seen in the experimental results in Appendix E. The gain from symbol filtering to combined symbol/index filtering is not as large as the gain from index filtering to symbol filtering, but the combination still seems promising.


Table 4.11 shows the average filter table size for index, symbol and combined symbol/index filtering.

              Average Index R-Table   Average Symbol R-Table   Average Index & Symbol R-Table
Mono X86       44 entries              3.0 entries              2.7 entries
Mono IA64      28 entries              2.7 entries              2.5 entries
Mono Sparc     35.5 entries            2.9 entries              2.6 entries

Table 4.11: Average filter table comparison between index filtering, symbol filtering and com- bined filtering.

Not only these average values should be considered; the size of the transition tables is also important. The transition tables for the Mono grammars are halved, but there is also a penalty: the additional filter tables and φ-tables double the memory usage. Table 4.12 below shows the results for combined symbol/index filtering relative to symbol filtering (100%).

              # Transitions   Memory Use   Construction Time
Mono X86       57.6%          175.9%        101.4%
Mono IA64      67.7%          172.0%        110.0%
Mono Sparc     66.6%          175.6%        108.6%

Table 4.12: Construction characteristics for the combined filter construction when compared to the symbol filter construction (Constructions with Proper-N item set)

One may conclude that for these grammars the gain of combining the two filters is largely negated by the overhead incurred. The combined filter reduces the transition tables even further compared to the symbol filtering technique, but for the Mono grammars it is more profitable to opt for symbol filtering alone instead of the combined filtering. The combined filtering technique can however become interesting for grammars that contain many subtrees that are only present at certain indices of certain terminals.

4.3.5 Conclusion

The goal of these experiments was to examine which of the discussed construction algorithms creates an automaton that performs well in tree acceptance and parsing, but is also constructed quickly. Constructions for nondeterministic automata offer small automata combined with very short construction times (even less than 50 milliseconds for the Mono grammars). However, nondeterministic automata are not practical in acceptance and parsing algorithms, due to the fact that the nondeterminism leads to a growing number of states that need to be processed for each transition taken in the automaton. This means that accepting and parsing trees using nondeterministic automata can take a lot of time. The deterministic automata eliminate this problem. However, as shown, the standard dfrta construction is disappointing: it constructs automata that are too large for real life applications, even when one uses the Proper-N or Proper-S item set. Fortunately there are promising results for the constructions that use filtering techniques to reduce the size of the transition tables in dfrtas. Especially the construction with symbol filtering, which takes advantage of subtrees that can only be found as child trees of certain terminals (which is often the case for instruction selection grammars), performs well. This symbol filtering reduces the number of transitions to 0.6% of those of the standard construction for the Mono X86 grammar (using the Proper-N item set in both cases). Figures 4.23 and 4.24 give an overview of the size of the dfrtas and nfrtas constructed from the Mono grammars using the Proper-N item set. They show how the number of transitions and the memory usage change when the different filtering techniques are applied for the constructions of the Mono grammars.

Figure 4.23: Relative number of transitions of the Mono dfrtas and nfrta constructed with the Proper-N item set

Figure 4.24: Relative memory usage of the Mono dfrtas and nfrta constructed with the Proper-N item set

Which filtering technique is the most efficient depends on the shape of the grammar. Automata for grammars that contain subtrees that are only present at certain indices should be constructed using index filtering. If the subtrees are tied to certain terminals, then one should opt for symbol filtering. Combined filtering becomes profitable for grammars that have both properties. Summarized, the most important fact is that the dfrtas constructed with a well-chosen filtering technique are not much larger (around a factor 3 for the Mono grammars) than the nondeterministic variants, while still offering the property of being deterministic. Automata constructed with these filter based construction algorithms are therefore very promising for real world acceptance, matching and parsing applications.

4.4 DFRTA based tree parsing experiments

The automata discussed in this thesis are very useful for solving the tree acceptance and tree parsing problem. This section discusses the experiments that were carried out with an algorithm, implemented in the toolkit, that determines the lowest cost parse of an input tree with the help of a dfrta. This parsing algorithm [FSW94, Section 7] can solve the parsing problem with each type of dfrta discussed in Section 4.3.4. In this section we discuss the parsing algorithm itself and provide example parses for a dfrta constructed with the All-Sub item set and the Proper-S item set. The differences in filtering techniques cannot be seen in the parsing algorithm because they are encapsulated by the automata. These differences only become visible when measuring the execution speeds with the different filtering techniques. These measurements are presented in Section 4.4.2.

4.4.1 The parsing algorithm

The parsing algorithm consists of two phases. The first phase labels the subject tree from bottom to top using the dfrta. This labeling stores for each node which rules can be applied to produce the subtree rooted at that node. Using these labels, the second phase determines from top to bottom which sequence of rules results in the lowest cost parse of the tree. This algorithm is not exactly the same for each type of dfrta. Some small adaptations are needed when using an automaton created from the Proper-N or Proper-S item set instead of the All-Sub item set. We start by discussing the two phases for the All-Sub item set in the two upcoming sections. In the final section we discuss the changes necessary for working with Proper-N and Proper-S.

First phase

The first phase processes the subject tree with the dfrta in a frontier-to-root direction. During the traversal each node is linked to a state (as a normal acceptor would do). Each leaf node with a symbol x is linked to the state with the incoming x-transition. The rest of the tree is linked as one would expect, by taking transitions that match the symbols that are encountered in the tree. Let us illustrate this by constructing a dfrta for the grammar in Example 1.1.3. Figure 4.26 repeats the production rules and shows the match sets of the states, while Figure 4.25 contains the complete automaton (without trap state q2).


Figure 4.25: dfrta constructed with All-Sub

(1) S → a(B, d), (2) S → a(b(c), B), (3) S → c,
(4) B → b(B), (5) B → S, (6) B → d

(a) Production rules of the grammar

q0: {c, B, S}
q1: {d, B}
q3: {b(B), b(c), B}
q4: {a(B, d), B, S}
q5: {b(B), B}
q6: {a(b(c), B), B, S}
q7: {a(b(c), B), a(B, d), B, S}

(b) Match sets of the automaton in Figure 4.25

Figure 4.26: Production rules and match sets of the constructed automaton.

The subject tree that will be parsed with the help of the automaton can be seen in Figure 4.27(a). For this example the algorithm starts by assigning states q0 and q1 to the leaf nodes of the tree. When proceeding on the left, by reading the b node, one arrives in state q3. This process is repeated until the complete tree is visited (see Figure 4.27(b)).


Figure 4.27: Subject tree with and without corresponding states; (a) the subject tree a(b(c), d), (b) the same tree with its matching states (a: q7, b: q3, c: q0, d: q1)

However, the goal of this phase is not just to label all nodes with states, but to retrieve the production rules that should be applied at each node n to produce the subtree α rooted at n as cheaply as possible. The algorithm therefore not only links a state to a node when it visits that node, but also records, by using the match set of the state, which rules could be applied. Let us for now simplify the problem by ignoring chain rules and costs, and by only looking at standard rules. A subtree α rooted at a node n can be produced with an arbitrary nonterminal as starting point. So, what the algorithm does is store in each node, for every nonterminal, which rules can be applied to produce α. These rules are found by comparing each subtree m in the match set of the linked state to each production rule r. If the rhs of r is equal to m, then r can be used to produce α from the lhs nonterminal of r. Let us illustrate this using the example subject tree. Node b is matched by q3, where q3 contains three matches. The match b(B) can be found as the rhs of rule 4 (see Figure 4.28). This means that node b can be produced from nonterminal B (the lhs of rule 4) by applying rule 4. Figure 4.29 shows the result when these production rules are gathered for all the nodes.

Figure 4.28: Subtree b(B) in the match set {b(B), b(c), B} of the b-node matches the rhs of production rule 4

Figure 4.29: Subject tree with applicable rules for each node/nonterminal.
node a:  S: S → a(b(c), B), S → a(B, d);  B: -
node b:  S: -;  B: B → b(B)
node d:  S: -;  B: B → d
node c:  S: S → c;  B: -


The whole subtree rooted at node b can now be obtained from B by applying rule 4 and recursively applying rules for the nonterminals in the rhs of rule 4. Rule 4, for instance, contains the nonterminal B. This nonterminal corresponds to node c in the subject tree. This means that the B production rule for this node must be applied. However, there is no such rule, due to the fact that we ignore chain rules (B → S; S → c would be a possibility). After handling the chain rules, node c will contain B → S; S → c for nonterminal B (see Figure 4.30), and the complete subtree b(c) can be obtained from nonterminal B by applying: B → b(B); B → S; S → c.

A remaining question is: how does the algorithm handle chain rules? As for the automaton constructions, this is done by computing the nonterminal closure. If a production rule r is added for a node, as described above, we compute the closure based on the lhs nonterminal of that rule. For the nonterminals in the closure we store the sequence of chain rules that have to be applied plus the production rule r. The nonterminal closure for nonterminal S is for instance computed when rule S → c is added. This closure computation tells us that chain rule B → S makes it possible to produce subtree c from nonterminal B. Figure 4.30 shows the additional matches that are created by chain rules for the example grammar.

Figure 4.30: Subject tree with matching rules and chain rules.
node a:  S: S → a(b(c), B), S → a(B, d);  B: B → S; S → a(b(c), B) and B → S; S → a(B, d)
node b:  S: -;  B: B → b(B)
node d:  S: -;  B: B → d
node c:  S: S → c;  B: B → S; S → c

The last aspect taken care of by the algorithm is providing the lowest cost parse. It is for instance possible to produce a subtree rooted at a node n by applying different rules from the same nonterminal (see for instance the two possibilities for nonterminal S in the root node in Figure 4.30). What we want is a lowest cost sequence of rules. Let us illustrate the effect of these costs by assigning a cost of 1 to each rule. While determining the matching rules for each node we can then also compute the cost of using each rule. This cost is determined by the cost of the rule itself and the costs that are registered in the nodes that correspond to nonterminal leaves of that production rule. The c-leaf, for example, can be produced from S by a single rule and therefore has cost 1. Producing c from B is more expensive due to the additional chain rule and therefore has cost 2. The b-node above c can be produced from nonterminal B by applying rule B → b(B). The rule itself has cost 1, but we also have to look at the cost for nonterminal B in the rhs of the rule. This nonterminal corresponds to the c-node below, so we have to add the cost of B in this c-node. This results in a final cost of 3. In this way the cost is computed for each sequence of rules added to a node. Figure 4.31 shows the costs for all rules in the tree.


Figure 4.31: Subject tree with minimum cost matching rules.
node a:  S: S → a(b(c), B) (2), S → a(B, d) (4);  B: B → S; S → a(b(c), B) (3) and B → S; S → a(B, d) (5)
node b:  S: - (∞);  B: B → b(B) (3)
node d:  S: - (∞);  B: B → d (1)
node c:  S: S → c (1);  B: B → S; S → c (2)

The root node contains two sequences for each of the nonterminals B and S. However, we are only interested in the lowest cost solution, so we do not have to store the more expensive solutions. The algorithm therefore stores only one record for each nonterminal in a node. This record is updated if a production rule (or a sequence, in case chain rules are involved) is encountered that has a lower cost. This first stage of the algorithm therefore results in the following labeled subject tree:

Figure 4.32: Subject tree after phase 1.
node a:  S: S → a(b(c), B) (2);  B: B → S; S → a(b(c), B) (3)
node b:  S: - (∞);  B: B → b(B) (3)
node d:  S: - (∞);  B: B → d (1)
node c:  S: S → c (1);  B: B → S; S → c (2)

Summarized, this leads to the following approach. For each node n we determine the corresponding state q. For every match m in the match set of q we gather all production rules r that have a rhs equal to m. If applying r is cheaper than the current solution for the lhs nonterminal, then we replace that solution by this new rule r. Finally we compute the nonterminal closure with costs, using the Floyd-Warshall algorithm [Flo62], and update the nonterminals in n if the chain rule sequence plus r is cheaper than the current value for that nonterminal. This results in the following recursive method that visits the nodes of the subject tree:


LabelTree(n, a, rtg)
Input: A node n from the subject tree, a dfrta a and the regular tree grammar rtg from which the automaton is constructed and the parse should be determined.
Output: The automaton state of n (and the subtree rooted at n is annotated).

    vector ← ∅;
    for i ← 0 to n.children.count
        vector[i] ← LabelTree(n.children[i], a, rtg)

    q ← a.nextState(n.symbol, vector);

    foreach match ∈ q.matches
        foreach rule ∈ rtg.productionRules where rule.rhs = match
            cost ← rule.cost + ∑_{l ∈ match.leaves} leafCost(n, l);

            if cost < n.annotation[rule.lhs].cost
                n.annotation[rule.lhs].cost ← cost;
                n.annotation[rule.lhs].rules ← rule;

            foreach s ∈ rtg.alphabet where s.type = nonterminal
                costAndRules ← getClosureCostAndRules(rule.lhs, s);
                if (costAndRules.cost + cost) < n.annotation[s].cost
                    n.annotation[s].cost ← cost + costAndRules.cost;
                    n.annotation[s].rules ← rule concat costAndRules.rules;
    return q;

Second phase

The second phase is less complex than phase one. This phase traverses the subject tree, starting at the root with the start nonterminal of the tree grammar, and applies the rule listed for that nonterminal in the root node. The algorithm then executes the same process for the nonterminals that can be found in the production rule that was applied. These nonterminals correspond to certain nodes in the subject tree, and the rule for that nonterminal in each such node can then again be applied. Let us clarify this with the example subject tree of phase one. The start nonterminal of the used grammar (see Example 1.1.3) is the nonterminal S. The algorithm therefore starts by applying rule S → a(b(c), B) to the start nonterminal (see Figures 4.33 and 4.34). This results in a tree with one nonterminal B. This nonterminal is at the location of the d node in the subject tree. We therefore proceed by applying the production rule listed for B in this node: B → d. This finally results in the complete subject tree.


Figure 4.33: Rules selected by phase 2 (S → a(b(c), B) at the root node and B → d at the d-node)

S ⇒(2) a(b(c), B) ⇒(6) a(b(c), d)

Figure 4.34: Production of the subtree by the selected rules
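A minimal sketch of this second phase in Java; the names are hypothetical, and it assumes for brevity that phase 1 stored, per node, one selected rule per nonterminal and the subject tree node covered by each nonterminal leaf of that rule. Starting from the start nonterminal at the root, it emits the selected rule and recurses for each nonterminal leaf. For the example parse it prints S → a(b(c), B) followed by B → d.

    import java.util.*;

    final class Phase2Sketch {
        record Rule(String lhs, String rhs, List<String> ntLeaves) {}
        static final class Node {
            final Map<String, Rule> annotation = new HashMap<>();  // phase 1 result
            final List<Node> coveredByNtLeaf = new ArrayList<>();  // node per nt leaf
        }
        // Apply the rule selected for `nonterminal` at n, then recurse for
        // every nonterminal leaf of that rule's rhs.
        static void emit(Node n, String nonterminal, List<Rule> out) {
            Rule r = n.annotation.get(nonterminal);
            out.add(r);
            for (int i = 0; i < r.ntLeaves().size(); i++)
                emit(n.coveredByNtLeaf.get(i), r.ntLeaves().get(i), out);
        }
        public static void main(String[] args) {
            Node root = new Node(), dNode = new Node();
            root.annotation.put("S", new Rule("S", "a(b(c), B)", List.of("B")));
            root.coveredByNtLeaf.add(dNode);               // B covers the d node
            dNode.annotation.put("B", new Rule("B", "d", List.of()));
            List<Rule> parse = new ArrayList<>();
            emit(root, "S", parse);
            for (Rule r : parse) System.out.println(r.lhs() + " -> " + r.rhs());
        }
    }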

Adaptations for Proper-N and Proper-S

dfrtas created with the Proper-N and Proper-S item sets differ from those created with All-Sub, because the match sets of the states do not contain all rhs trees of production rules as subtrees. This creates a problem, because in the first phase the algorithm compares all subtrees from the match sets with the rhss of the production rules. The absence of some of these trees means that some rules are not coupled to nodes. To illustrate this problem we again parse the subject tree of Figure 4.27(a), but now using an automaton constructed from the Proper-N item set. Figure 4.35 shows this automaton (again without trap state q2) and Figure 4.36 shows the corresponding match sets.

Figure 4.35: dfrta constructed with Proper-N


q0: {c, B, S}
q1: {d, B}
q3: {b(c), B}
q4: {B, S}
q5: {B}

Figure 4.36: Match sets of the automaton in Figure 4.35

Linking the nodes to states is done in the same way as when using the All-Sub item set. Figure 4.37 shows the resulting labeled subject tree. The problem becomes visible if we look at the match set of state q3 for the b-node of the subject tree. The match set of this state does not contain b(B), as it did in the All-Sub case, because b(B) is not a proper subtree. The result of this is that the rule B → b(B) would not be registered in this node if we proceeded with the standard approach.

Figure 4.37: Subject tree with matching states (a: q4, b: q3, c: q0, d: q1)

Luckily there is a solution to this problem. Instead of comparing the subtrees in the match set of a node n to the rhss of the production rules, we compare the rhss of the production rules to the match sets of the child nodes of n, in combination with the symbol at n. In other words, the algorithm has to check two things at node n for each production rule r:

• Is the symbol of the root node of the rhs of r equal to the symbol of node n?
• Is there, for each subtree at index i of the root node of r, an equal subtree in the match set of the ith child node of n?

If both checks are positive, then the production rule r can be used to produce the subtree rooted at n. Let us look at the b-node in the subject tree and rule B → b(B) to clarify this approach (see Figure 4.38). First we have to compare the symbol b of the node to the root node of the production rule. These are both b, so we proceed by comparing the subtrees in the match set of the c-node (the first and only child node of b) to the first subtree of the root node of the production rule (B). The c-node has match set {c, B, S}, in which B is contained. Thus the first child node of n matches the first child node of the rhs of the production rule. This production rule can therefore be used to produce the b-node.
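These two checks can be sketched compactly (hypothetical types, not ForestFIRE's API): compare the root symbols, then look up each child of the rule's rhs in the match set of the corresponding child node. For rule B → b(B) at the b-node with child match set {c, B, S} this returns true, as argued above.

    import java.util.*;

    final class ProperNMatchSketch {
        record Tree(String symbol, List<Tree> children) {}
        static boolean applies(Tree rhs, String nodeSymbol,
                               List<Set<Tree>> childMatchSets) {
            if (!rhs.symbol().equals(nodeSymbol)) return false;  // check 1: root symbols
            for (int i = 0; i < rhs.children().size(); i++)      // check 2: per index
                if (!childMatchSets.get(i).contains(rhs.children().get(i)))
                    return false;
            return true;
        }
        public static void main(String[] args) {
            // Rule B -> b(B) at the b node; child match set {c, B, S}.
            Tree b_ = new Tree("B", List.of()), c = new Tree("c", List.of());
            Tree s = new Tree("S", List.of());
            Tree rhs = new Tree("b", List.of(b_));
            System.out.println(applies(rhs, "b", List.of(Set.of(c, b_, s)))); // true
        }
    }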


Figure 4.38: Node b and the subtree B in the match set of its child match the rhs of production rule B → b(B)

So instead of comparing two subtrees, the algorithm has to compare two symbols and a collection of subtrees. The rest of the algorithm is equal to the All-Sub variant. This small difference makes the Proper-N/Proper-S variant more complex than the All-Sub variant, which could result in longer running times. However, Proper-N and Proper-S automata can be constructed faster than All-Sub automata. Section 4.4.2 compares the two algorithms and measures whether the gain offered by the construction algorithms for the Proper-N and Proper-S automata is not negated by the more complex parsing algorithm.

4.4.2 Automaton comparison

The parsing algorithm described above can parse trees with the help of all types of dfrtas described in Section 4.3.4, but it can perform differently depending on the type of dfrta used. As discussed earlier, there are two variants of the algorithm: one that can only be used for automata constructed with the All-Sub item set, and another that can also be used for automata constructed with the Proper-N and Proper-S item sets. Furthermore, there are the different filtering techniques used inside the automata. These filtering techniques are not visible to the algorithm. However, they can influence the performance of the algorithm, because they influence the state retrieval complexity. This section therefore measures the running time of the parse algorithm when using the different types of dfrtas. We have chosen to execute this parsing algorithm for the Mono X86 and Mono IA64 grammars, in combination with a collection of generated trees for each of the grammars. These collections contain one hundred trees with a total of 71504 and 73248 nodes, respectively. Figure 4.39 shows the running time of these parses for both grammars.


Figure 4.39: Parse time for two grammars and two example collections of trees; (a) parse time for the Mono X86 grammar and a collection of one hundred subject trees (71504 nodes in total), (b) parse time for the Mono IA64 grammar and a collection of one hundred subject trees (73248 nodes in total)

The graphs show that all filter techniques perform similarly and that differences are mostly found between the automata constructed from the All-Sub item set and the automata constructed from the Proper-N or Proper-S item set. This performance difference between the item sets is caused by the two variants of the parsing algorithm. Table 4.13 shows the relative parse times for the Proper-S item set compared to the All-Sub item set.


Filtering technique          X86      IA64
No filtering                 137.0%   124.1%
Subtree filtering            137.7%   125.8%
Index filtering              137.4%   125.6%
Symbol filtering             136.1%   124.2%
Symbol & index filtering     135.0%   122.7%

Table 4.13: Parse time with dfrtas constructed with Proper-S item set compared to the All-Sub item set, for the different filter techniques

These results show that parsing with dfrtas constructed from the Proper-S (and also the Proper-N) item set is roughly 22% to 38% slower than with their All-Sub counterparts for these two example collections of trees. Section 4.3.4, however, shows that dfrtas can be constructed faster when using the Proper-N or Proper-S item set instead of All-Sub. Whether Proper-N/Proper-S or All-Sub is the most efficient choice depends on the usage. If the dfrta is constructed for parsing a single small tree, then the construction time of the dfrta has the largest impact on the total running time. When a single automaton is used many times for parsing large trees, it could be wiser to opt for the All-Sub item set, because most of the processor time is then consumed by the parsing algorithm. In summary, the filtering technique used in the dfrtas does not have a noticeable impact on the running time of the parsing algorithm. The item set used does have an impact, but which item set is the most profitable depends on the size of the subject trees and the number of parses executed with a single dfrta.

5 Conclusions

This chapter gives an overview of the achievements and results, provides suggestions for future research, and closes with an evaluation of the project.

5.1 Results

The master's project resulted in a toolkit, a GUI, and a collection of results from the conducted experiments. Before providing an overview of the conclusions of the experiments, we highlight the key features of the constructed toolkit and GUI.

Toolkit and GUI

The ForestFIRE toolkit implements a large collection of data structures and algorithms related to trees. This collection contains the data structures necessary for representing the domain concepts (e.g. trees, tree grammars, tree patterns, and tree automata) and implementing the tree algorithms. The tree algorithms themselves mainly focus on three areas:

• Tree grammar transformations
• Tree automaton constructions
• Tree parsing and acceptance

The group of transformation algorithms consists of algorithms that remove chain rules (RED-U), non-root terminal nodes (RED-Z), and useless symbols and/or production rules. The second group contains algorithms that construct (ε)nfrtas, (ε)nrftas and (optimized) dfrtas from tree grammars. The final group consists of a set of algorithms that solve the tree acceptance and tree parsing problems by using tree automata. In addition to these algorithms, a collection of supporting algorithms was implemented.

Let us look back at the original requirements for the toolkit. As described in the introduction of this thesis, the goal was to implement as many as possible of the features listed in Appendix A. The features listed in the original assignment description (Section A.1) are divided into three groups, each with a different priority. Almost all data structures and algorithms listed in the most important group are implemented. Tree automaton constructions that construct automata from pattern sets, and constructions based on dotted rules/trees, are not implemented due to the restricted time frame. Unfortunately there was also not much time left for the other two groups, which resulted in only one implemented feature from these groups: the operation for generating random trees from tree grammars.


However, the remaining features can be implemented very easily, thanks to the large collection of basic data structures and algorithms the toolkit provides. This advantage was also encountered during the implementation of the parsing algorithm: the availability of tree automata and operations to compare trees made it possible to implement the parser within a few hours. The additional list of requirements in Section A.2 describes requirements for the data structures that implement the basic domain concepts. These requirements consisted of special operations and wellformedness rules. All of these are realized in ForestFIRE as class methods and invariants, respectively.

The FIREWood GUI is a very useful addition to the toolkit, because it provides easy access to the algorithms inside the toolkit. Furthermore, the GUI makes it possible to load user-defined trees and tree grammars from special text files. FIREWood converts these structures into corresponding ForestFIRE data structures, which can then be used as input for one of the algorithms. FIREWood was a very useful addition to the toolkit during the experiments, but it is not just targeted at the currently implemented algorithms. FIREWood was, like ForestFIRE, built with extensibility in mind, so new algorithms implemented for ForestFIRE can be made accessible very easily. I therefore want to conclude by saying that ForestFIRE and FIREWood are practical tools for experimenting with the currently implemented algorithms, and that they have large potential for supporting an even larger collection of tree algorithms.

Tree grammar transformation experiments

The tree grammar transformation experiments focused on different optimization techniques and orders for removing chain rules (using RED-U) and non-root terminal nodes (using RED-Z) from a grammar. This section contains a compact overview of the results of these experiments:

1. The optimal transformation order (RED-Z/RED-U) depends on the grammar shape when no reuse is applied in the RED-Z steps.
2. Reusing nonterminals and production rules in RED-Z(*) reduces the number of rules to 64–78% of the no-reuse case for the used Mono grammars.
3. The shortest tree first node selection strategy for the RED-Z transformation steps results in the most efficient reuse.
4. The shortest tree first reuse removes the differences between the different transformation orders.

Tree automaton construction experiments

The tree automaton experiments measured the efficiency and size of the resulting tree automata for different automaton construction techniques. This is an overview of the most important conclusions of these experiments:


1. Constructions of nondeterministic automata are considerably faster than the constructions of deterministic ones.
2. Constructions of nfrtas/nrftas for the Mono grammars result in smaller automata (e.g. 38–39% fewer transitions) when using the Proper-N or Proper-S item set instead of the All-Sub item set.
3. The Standard dfrta construction is unusable in practice due to long running times and extreme memory use.
4. Filtering techniques for the dfrtas reduce memory usage and construction time such that these constructions perform almost as well as the constructions for the nondeterministic automata.
5. For the Mono grammars the symbol filtering technique performed best. This filter technique will probably also perform well for other grammars related to instruction selection.

Tree parsing experiments

The final group of experiments measured the effect of using different types of dfrtas in a parsing algorithm that solves the parsing problem using these automata. These are the main results of these experiments:

1. dfrtas constructed with the All-Sub item set perform better than those constructed with Proper-N or Proper-S, due to the adaptations needed in the parsing algorithm for automata created with Proper-N or Proper-S. These adaptations resulted in roughly 22–38% longer running times.
2. There is no notable difference between the different filter techniques used in the dfrtas.

5.2 Recommendations for future work

This master's project raised some new questions related to the three groups of experiments. This section provides a list of interesting research subjects, grouped by these three experiment areas.

Tree grammar transformations:

1. Measure the effect of reusing original nonterminals with more than one occurrence as lhs in the RED-Z(*) transformation.

Tree automaton constructions:

1. Measure the influence of tree automaton constructions with U−/Z− grammars instead of U+/Z+.


2. Compare constructions based on subtrees to constructions based on dotted rules.
3. Study additional transition table reduction techniques.

Tree parsing:

1. Investigate optimization techniques for parsing with automata constructed from the Proper-N and Proper-S item sets.
2. Compare the implemented parsing algorithm to other existing tree parsing algorithms.

5.3 Evaluation

This master's project was completed in eight months. Like any project, it had some pleasant parts and some less pleasant parts. Before discussing these, it should be mentioned that tree algorithms are a very interesting topic due to their broad application area (instruction selection, term rewriting, genetics, etc.).

The toolkit building process and the experiments were the nicest parts of the project, especially when the construction of the toolkit was near completion. The final algorithms could then be implemented elegantly by reusing already implemented data structures and algorithms. Warshall's algorithm, for instance, is implemented in a single class, which could be reused in many other algorithms.

The experiments had the pleasant property that they helped to tune the implementation of the more complex algorithms, like the automaton construction algorithms. Another advantage of these experiments was that they revealed unexpected behavior. An example of this can be seen with the grammar transformations: the expectation was that a certain RED-Z/RED-U order would always perform equally well or better than another order, but the experiments with these transformations suddenly provided contradictory results. This would probably not have been discovered without the experiments.

Unfortunately, there were also some drawbacks. The initially chosen development environment, Lazarus, caused many problems, because its LCL library was not as platform independent as claimed by its developers. This caused some delay in the development process, because all Lazarus code had to be converted to Java code. Fortunately, this went faster than expected due to my C# knowledge.

Overall, it was a pleasant and instructive project, which introduced me to the topic of tree algorithms and which, with the cooperation of Loek Cleophas and Kees Hemerik, resulted in a practical toolkit and some useful experimental results.

A MSc Assignment description

This appendix contains the description of the original master's assignment and an additional list of requirements that described the main goals of this project. The assignment description introduces the domain and provides three lists of features and experiments, sorted by importance, that needed to be implemented or performed. The list with additional requirements focuses on specific operations that should be implemented for the data structures that represent the main domain concepts. This list contains operations like: 'retrieve the number of chain rules in a grammar'.

A.1 Original Assignment description

Context

The MSc assignment will take place in the context of the PhD research of Loek Cleophas, which is related to regular tree languages. The area of regular tree languages has a rich theory, with many results that are generalizations of regular string languages, and many relations between the two areas. Parts of this theory have broad (potential) applicability in a number of areas, among which are code generation in compilers (particularly for instruction selection or optimization) and term rewriting.

Underlying these and other practical applications are the following three important algorithmic problems within the field of regular tree languages:

1. Tree acceptance. Given a regular tree grammar and an input tree, determine whether the input tree can be generated by the regular tree grammar, i.e. is part of the language denoted by the regular tree grammar.
2. Tree pattern matching. Given a finite, non-empty set of trees (the pattern set) and an input tree, find the set of all occurrences of the patterns in the input tree.
3. Tree parsing. Given a regular tree grammar and an input tree, determine all parses of the input tree that can be generated by the regular tree grammar. A variation of this problem that is often used is to determine a parse that is optimal (with respect to some cost function).

These problems are related (e.g. tree parsing generalizes/extends tree acceptance, and tree pattern matching is used in tree acceptance algorithms) and involve many of the same algorithmic ingredients. Many algorithms solving these problems have been described in the literature. Unfortunately, a number of deficiencies exist:

1. The field is rather inaccessible. Much of the theory is scattered over the literature, with only a few overview publications, and none of them being algorithm oriented.
2. The algorithms that have been published are hard to compare due to differences in presentation style and level of formality.


3. Many practical algorithms have been published with little or no reference to the theory or correctness arguments.
4. No large collection of implementations of the algorithms solving these algorithmic problems exists.
5. It is hard to choose between different algorithms for practical applications.

To solve these deficiencies, an overview of relevant parts of the theory has been constructed and literature research to find algorithms solving the above problems has been performed. Based on these, the algorithms have been rephrased in a common presentation style and a preliminary classification of algorithms for tree acceptance (currently bottom-up/frontier-to-root ones only) and tree pattern matching has been created. Taken together, this work helps to solve the first three deficiencies mentioned.

Assignment Goal & Expected Results

The proposed MSc assignment has the broad goal of providing a starting point for solving the fourth and fifth deficiencies, based on the results from Cleophas's PhD research. Ideally it should result in:

• an (extendable) collection of algorithms & data structures related to trees, in the form of a toolkit ('Forest FIRE'); this collection should implement foundational data structures & algorithms related to tree acceptance, tree pattern matching and tree parsing algorithms; time permitting, some of these algorithms themselves should be implemented;
• a graphical environment ('FIRE Wood') to experiment with and compare the implemented algorithms & data structures;
• a comparison of the efficiency and trade-offs involved in such algorithms & data structures;
• reporting on the assignment, in the form of regular meetings with the tutor and/or supervisor, an intermediate presentation, and an MSc thesis together with its oral presentation and defense.

The extent to which these results can be obtained will likely be limited by the fixed amount of time available for an MSc project (1120 hours, i.e. 40 ECTS * 28 hours). Realistically, the extent to which the first three results are completed is likely to be limited. Further on, features will be classified as either required, desired or nice to have. The required features together with the mandatory reporting form the minimal requirements for completion of the assignment. (Even if desired or nice to have features are excluded due to time constraints, the design & implementation of included features should be extendable so excluded features can be included at a later time.)

Planning

A detailed planning should be created by the student and discussed with his tutors/supervisors during the first phase of the assignment. Roughly, the assignment will be divided into the following phases:

1. read & investigate: draft PhD thesis, existing implementations (ATerms, Timbuk, Treebag, TWIG, BEG, iBURG, ...); create detailed planning


2. define general toolkit & GUI structure, interfaces, representations (for required features, taking desired and nice to have ones into account), decide on implementation language (Delphi? Java? ...?); report on this
3. implement required features; report on this
4. experiment with resulting toolkit & GUI; report on this
5. time permitting, repeat last two steps for desired features and for nice to have features
6. create final report, give presentation, thesis defense

As mentioned earlier, the total amount of time available for the assignment is 1120 hours; how this time is (planned to be) divided over the various phases will be part of the detailed planning created in the first phase.


Required Features

Data structures
• trees over ranked alphabets; without variables, with a single variable, with multiple variables (?)
• sets of trees (pattern sets)
• regular tree grammars (RTGs), consisting of a nonterminal set, a ranked terminal alphabet, production rules (form A → α, or more generally β → α) and a start nonterminal
• tree automata (TAs); directed frontier-to-root (FR), root-to-frontier (RF); nondeterministic with and without epsilon-transitions ((epsilon-less) NFRTA / NRFTA), deterministic (DFRTA / DRFTA)

Issues:
• what alternative representations are there? look at literature to see what is used and why; other representations? (dis)advantages of them
• for TAs in particular, how to represent transition tables; different per TA kind? how to compress transition tables (e.g. use general table compression techniques, use filtering techniques based on the possibility of subpatterns occurring as the i-th son of symbol a, use maximal subterm sharing of ATerms?)

Input/Output
• ability to define ranked alphabets, (ordered ranked) trees, pattern sets, nonterminal sets, production rules, RTGs
• ability to import and export trees, pattern sets, RTGs, TAs

Algorithms
• RTG transformations
  – removing unit productions (e.g. A → B)
  – removing productions whose rhs has any non-root node labeled by a terminal symbol
• TA constructions & transformations
  – construction of TA based on RTG
  – construction of TA based on tree pattern set
  – construction based on subtrees vs. on dotted trees/production rules
  – directed FR, RF; nondeterministic(?), deterministic

Experiments
• investigate influence of RTG transformations on RTG size, influence of order of the transformations; effect of transformations on resulting TAs
• compare time and space efficiency of TA constructions


Desired Features

Data structures
• Aho-Corasick automata (ACAs); basic goto-version, version specific to use for tree stringpath matching (see Prague Stringology Conference 2005 paper)

Issues:
• what alternative representations are there? look at literature to see what is used and why; other representations? (dis)advantages of them

Input/Output
• ability to generate (pseudo-random) trees, pattern sets, RTGs

Algorithms
• TA constructions & transformations
  – allow separate transformations to remove epsilon-transitions, perform (reachability based) subset construction, in addition to direct construction of epsilon-less and deterministic TAs (?)
• ACA constructions (see PSC 2005 paper)
  – basic goto-version
  – stringpath-specific version

Experiments
• compare time and space efficiency of TA and ACA constructions
• compare RTG transformations followed by TA construction to direct TA construction followed by TA transformations


’Nice to Have’ Features

Data structures
• trees over sorts
• rewrite rules
• extension to hedge trees/ordered unranked trees (trees whose nodes have a list of children; these can be used to model XML trees)

Input/Output
• ability to define, import, export rewrite rules
• ability to generate C code for rewriting

Algorithms
• TPMn/TGAn algorithms (tree pattern matching resp. tree (grammar) acceptance, for every subject node)
  – on-the-fly FR/BU
  – FR/BU using DFRTA; with/without filtering, other compression techniques (?)
  – RF/TD (see PSC 2005 paper)
    ∗ using DRFTA
    ∗ using ACA
    ∗ using stringpath-specific ACA

Experiments
• compare speed of TPMn/TGAn algorithms using different TA/ACA kinds
• effect of TA transformations on TPMn/TGAn algorithm speed


A.2 Additional data structure requirements

This section contains a list of requirements for the domain concepts that needed to be implemented in the toolkit. This list was defined in addition to the original assignment description, which can be found in Section A.1.

Original content:

This document gives a brief overview of the types of analyses, transformations and other operations needed for trees, pattern sets and Regular Tree Grammars, together with references giving more details on their definition or use. It does not include very basic analyses and operations, e.g. creating these objects or counting the number of alphabet symbols.

Trees
• analysis
  – wellformedness with respect to a ranked alphabet: does a tree use symbols from the alphabet only, and respect the ranking function? (likely not an analysis to be invoked explicitly; instead, to be enforced at tree creation time)

Pattern sets
• analysis
  – number of patterns
  – wellformedness, see above
• operations
  – obtain dotted tree set for it (see "Tree notions" in handwritten notes, IJFCS paper Definition 11)
    ∗ analysis:
      · number of elements
      · number of elements when flattened (see handwritten notes for definition of flattening function fl)
    ∗ operation: flattening (to obtain subtree set for pattern set, as below)
  – obtain subtree set for it (see IJFCS paper page 4 underneath Def. 4; expressible using fl and previously mentioned operation to obtain dotted tree set)
    ∗ analysis:
      · number of elements

Regular Tree Grammars
• analysis
  – number of
    ∗ rules


    ∗ nodes, summed over all rules
    ∗ chain rules
    ∗ non-root nodes labeled by terminal symbol, summed over all rules
    ∗ rules with non-root nodes labeled by terminal symbol
  – wellformedness, see above
  – existence of (can be expressed in terms of "number of" analyses)
    ∗ chain rules
    ∗ non-root nodes labeled by terminal symbol, over all rules
  – "usability" (see thesis draft, Section 4.3.1.2 Removing Useless Symbols and Productions)
    ∗ reachable/unreachable
      · terminals
      · nonterminals
      · production rules
    ∗ productive/unproductive
      · terminals
      · nonterminals
      · production rules
    ∗ useful/useless: useful if and only if reachable and productive, useless if and only if not useful
• transformations; for each one, allow it to be performed one rule at a time (either with manual or with automatic pseudo-random selection), or for all rules at once; default behavior is to make a copy of the RTG and perform the transformations on this copy; perhaps provide a setting to toggle between this and not making a copy & modifying the original RTG
  – remove useless (see Transformation 4.3.21 in Section 4.3.1.2)
    ∗ productions
    ∗ nonterminals
    ∗ terminals
  – perhaps a variant of remove useless that only removes useless productions & nonterminals, leaving the associated terminal alphabet unchanged
  – remove chain rules (see Sections 6.1.1 and 6.1.2)
  – remove rules with non-root nodes labeled by terminal symbol (see same)
• (other) operations
  – obtain dotted rule set for it (see "Tree notions" in handwritten notes)
    ∗ analysis:
      · number of elements


      · number of elements when right-flattened (see handwritten notes for definition of right-flattening function rfl)
    ∗ operation: right-flattening (to obtain the subtree set for the rules' right hand sides, as below)
  – obtain right hand sides' subtree set for it (see handwritten notes; expressible using rfl and the previously mentioned operation to obtain the dotted rule set)
    ∗ analysis:
      · number of elements


B Formal definitions

This appendix contains all formal definitions of the concepts described in the Domain Chapter. All these definitions are taken from [Cle07].

B.1 Tree related definitions

Definition B.1.1 (Tree domain) Given a set of edge labels E, a tree domain is a finite non-empty subset D of E∗ such that pref(D) ⊆ D, i.e. D is prefix-closed. In particular, ε ∈ D for any tree domain D. We use · to indicate concatenation of elements of E. Unless explicitly noted otherwise, we assume the edge label set E to be N+, the positive natural numbers. □

Definition B.1.2 (Tree) Given a tree domain D and an alphabet Σ, a (node labeled) tree t is a function t ∈ D → Σ. We use t(n) for the label of a node n ∈ D. 

Definition B.1.3 (Ranked alphabet) A ranked alphabet is a pair (Σ, r) such that Σ is an alphabet (a finite, non-empty set of symbols) and r ∈ Σ → N is a ranking function. For a ∈ Σ, we call r(a) the rank or arity of a. 

Definition B.1.4 (Ranked Tree) A ranked tree is a node labeled tree t whose alphabet is a ranked alphabet (Σ, r) and for which, for all n ∈ D,

r(t(n)) = ⟨# i : i ∈ E ∧ n · i ∈ Dt : i⟩. □

Definition B.1.5 (Ordered tree domain, ordered tree) A tree domain D is ordered if and only if the underlying edge label set E is well ordered (i.e. has a minimal element and is totally ordered) and, for all n ∈ D and i ∈ E, n · i ∈ D ⇒ ⟨∀ j : j ∈ E ∧ j ≤ i : n · j ∈ D⟩. A tree is ordered if and only if its tree domain is ordered. □

B.2 Tree grammar related definitions

Definition B.2.1 (rtg, Regular Tree Grammar) A regular tree grammar (rtg) G is a 5-tuple (N, Σ, r, Prod, S) such that

• N is an alphabet, the nonterminals


• Σ is an alphabet, the terminals
• N ∩ Σ = ∅
• (N ∪ Σ, r) is a ranked alphabet with N = N0 (i.e. nonterminals have rank 0)
• Prod ⊆ N × Tree(N ∪ Σ, r), the finite set of (production) rules or productions
• S ∈ N0, the start symbol □

Definition B.2.2 (Reachable symbol) Let G be an rtg. A symbol X ∈ N ∪ Σ is reachable from a nonterminal A if and only if

⟨∃ α : α ∈ Tree(N ∪ Σ, r) ∧ A ⇒* α : X ∈ α(Dα)⟩

(Note that α(Dα) denotes the set of all node labels of tree α.) □

Definition B.2.3 (Start-reachable symbol) Let G be an rtg. A symbol X ∈ N ∪ Σ is start-reachable if and only if it is reachable from the start symbol S. 

Definition B.2.4 (Productive nonterminal, productive terminal) Let G be an rtg. A nonterminal B ∈ N is productive if and only if

⟨∃ t : t ∈ Tree(Σ, r) : B ⇒* t⟩

All terminals are productive (assuming Σ0 ≠ ∅). □

B.3 Tree automata related definitions

Definition B.3.1 (Tree Automaton) A tree automaton (ta) M is a 6-tuple (Q, Σ, r, R, Qra, Qla) such that

• Q is a finite set, the state set
• (Σ, r) is a ranked alphabet
• R = ⟨Set a : a ∈ Σ : Ra⟩ ∪ {Rε} is the set of transition relations, where Ra ⊆ Q × Q^n for all a ∈ Σn, and Rε ⊆ Q × Q (the epsilon transition relation)
• Qra ⊆ Q, the root accepting states
• Qla ⊆ Q, the leaf accepting states, defined by Qla = ⟨Set a, q : a ∈ Σ0 ∧ (q, ()) ∈ Ra : q⟩ □

Definition B.3.2 (ε-Nondeterministic Root-to-Frontier Tree Automaton) An ε-nondeterministic root-to-frontier tree automaton (εnrfta) M = (Q, Σ, r, R, Qra, Qla) is a ta where Ra ∈ Q → P(Q^n) for all a ∈ Σn, such that q⃗ ∈ Ra(p) ≡ p Ra q⃗, and Rε ∈ Q → P(Q) such that q ∈ Rε(p) ≡ p Rε q. □


Definition B.3.3 (ε-Nondeterministic Frontier-to-Root Tree Automaton) An ε-nondeterministic frontier-to-root tree automaton (εnfrta) M = (Q, Σ, r, R, Qra, Qla) is a ta where Ra ∈ Q^n → P(Q) for all a ∈ Σn, such that p ∈ Ra(q⃗) ≡ p Ra q⃗, and Rε ∈ Q → P(Q) such that p ∈ Rε(q) ≡ p Rε q. □

Definition B.3.4 (Deterministic Root-to-Frontier Tree Automaton) A deterministic root-to-frontier tree automaton (drfta) M = (Q, Σ, r, R, Qra, Qla) is an nrfta where Ra ∈ Q → Q^n for all a ∈ Σn (i.e. the Ra are functions yielding a single state tuple for every state) and Qra = {qra} (i.e. there is a unique root accepting state). □

Definition B.3.5 (Deterministic Frontier-to-Root Tree Automaton) A deterministic frontier-to-root tree automaton (dfrta) M = (Q, Σ, r, R, Qra, Qla) is an nfrta where Ra ∈ Q^n → Q for all a ∈ Σn (i.e. the Ra are functions yielding a single state for every state tuple). □


C ForestFIRE library

This appendix contains descriptions of all important classes (related to domain concepts and algorithms) and invariants in the ForestFIRE library. These descriptions are divided over the different concepts: a section is devoted to each concept (tree, tree grammar, tree pattern and tree automata). The chapter starts, however, with an introduction to some basic collection types that are part of the data structures described in this appendix. It must be noted that the classes and interfaces presented here are not implemented exactly as described, due to restrictions of Java, the target programming language. Java, for instance, does not support properties, so the properties in this appendix are implemented as getter and setter methods. However, all the functionality described here is available in the library.
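To illustrate this translation, the following minimal Java fragment shows how a read/write property such as Capacity could be rendered as a getter/setter pair; the class and field names are made up for the example and are not part of the ForestFIRE API.

// Illustrative only: an Object Pascal-style property 'Capacity: Integer'
// rendered in Java as a getter/setter pair.
public class PropertyExample {
    private int capacity;

    public int getCapacity() { return capacity; }            // property read
    public void setCapacity(int value) { capacity = value; } // property write
}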

C.1 Basic collections

ForestFIRE uses a selection of collection types to represent (parts of) domain concepts (e.g. alphabets, lists of nodes). This section will discuss the most important types of collections that are used in ForestFIRE:

• List
• Dictionary
• Set

The next three sections will discuss the characteristics, interfaces and implementation issues for each of these collection types.

C.1.1 List

A list is one of the most well known collection types. A list is a sequence of items, where each item in the list is accessible with an integer index. Many operations are based on these integer indexes: one can add items to the back of such a list or insert an item at a certain index. This was the designed interface (in Object Pascal syntax) of a list for items of type T:

Properties
  Count: Integer
    Returns the number of items that are contained in the list.
  Capacity: Integer
    Returns or sets the maximum number of items that can be stored in the list.
  Item[index: Integer]: T
    Returns or stores (only possible when the index already exists) the item at the defined index.


Methods
  Add(item: T): Integer
    Adds the item to the end of the list and returns the index of that position.
  Insert(item: T, index: Integer)
    Inserts an item into the list at the position defined by the index.
  Remove(item: T): Integer
    Removes the defined item from the list if it is present and returns the position at which the item was stored.
  IndexOf(item: T): Integer
    Returns the index of the defined item if it is present in the list.

Most of the properties and methods that are introduced above are straightforward. However, we also introduce a capacity property. This property provides the possibility to set the maximum size of the list, which can be very useful, for instance, when defining the list of child nodes: the number of child nodes is fixed by the rank of the symbol, and the capacity of the list can then be used to ensure that no more children are added to the node.

The remaining question was: how to implement such a list? There are many ways to do this: a linked list, a dynamic array, etc. However, most modern programming languages already provide this type of collection. The list was therefore implemented by reusing the standard list of the programming language library (ArrayList in Java). A wrapper was created around that list to add the functionality that was not directly supported by the standard implementation. This resulted in a new list that provides the interface described above.
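A minimal sketch of such a wrapper is given below. The class name BoundedList and the details of the capacity check are illustrative assumptions; the actual ForestFIRE list class may differ.

import java.util.ArrayList;

// Sketch: a list backed by java.util.ArrayList, extended with the Capacity
// behavior described above (the standard class does not enforce a maximum size).
public class BoundedList<T> {
    private final ArrayList<T> items = new ArrayList<>();
    private int capacity = Integer.MAX_VALUE;

    public int getCount() { return items.size(); }

    public int getCapacity() { return capacity; }
    public void setCapacity(int capacity) { this.capacity = capacity; }

    // Adds the item at the end and returns its index; rejects overflow.
    public int add(T item) {
        if (items.size() >= capacity)
            throw new IllegalStateException("list is at capacity");
        items.add(item);
        return items.size() - 1;
    }

    public void insert(T item, int index) {
        if (items.size() >= capacity)
            throw new IllegalStateException("list is at capacity");
        items.add(index, item);
    }

    // Removes the item if present and returns the index it occupied (-1 otherwise).
    public int remove(T item) {
        int index = items.indexOf(item);
        if (index >= 0) items.remove(index);
        return index;
    }

    public int indexOf(T item) { return items.indexOf(item); }

    public T get(int index) { return items.get(index); }
}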

C.1.2 Dictionary

A Dictionary is a collection type that stores key/value pairs, where each key can be used only once. These keys are used as a kind of index to store or retrieve the value items. The types used for these keys and values depend on the application: sometimes one wants to use strings as keys (just like a normal word dictionary), but it can also be any other type of data. The following interface (in Object Pascal syntax) was designed for a dictionary that stores items of type V by using keys of type K:

Properties
  Count: Integer
    Returns the number of key/value pairs that are contained in the dictionary.
  Item[key: K]: V
    Returns or stores (only possible when the key already exists) the item for the defined key.


Methods
  Add(key: K, value: V)
    Adds the value with the specified key to the dictionary.
  Remove(key: K)
    Removes the item from the dictionary that is stored for the defined key.
  ContainsKey(key: K)
    Returns a truth value that indicates whether the defined key is used as a key within the dictionary.
  ContainsValue(value: V)
    Returns a truth value that indicates whether the defined value can be found within the dictionary.

We see that the Insert and IndexOf methods are not present, due to the use of a different indexing strategy. There is also no capacity property, because this functionality is not strictly necessary for our goals. The main role of this dictionary type, as annotation container, will be presented in Section C.2. This role does not need the possibility to define a maximum size, because it depends on the user of the tree how many items are stored in this annotation dictionary.

What remains is to discuss how this collection type can be implemented. The most common way to implement such a structure is by using a hash table; one should then define a hash function that computes a hash value for the types of objects that are used for indexing. The target programming language Java already provides an implementation of a hash table. This class was reused and wrapped by a new class, just as with the list class, to provide the missing functionality.

C.1.3 Set

The last major collection type used in the ForestFIRE library is the set. A set is a pool of items, where items are stored without any index and where each item is contained at most once. Items can be added or removed by only passing the item itself to the method. The add operation itself must take care that no duplicate items are added to the collection. This is the designed Object Pascal interface for a set containing items of the type T:

Properties
  Count: Integer
    Returns the number of items that are contained in the set.
  Content: Array of T
    Returns an array that contains all the items in the set.

Methods
  Add(item: T)
    Adds the defined item to the set if it is not present.
  Remove(item: T)
    Removes the defined item from the set if it is present.

The remaining question is: how should one implement such a set? There are two well known ways to implement a set [Ski98, Section 8.1.5]. One is to model the set as a bit vector. To do this one has to know the universe of the items that can be contained in the set. A bit vector is an array of bits, one for each item of the universe: a one in the bit vector means that the corresponding item is contained in the set, and a zero means that it is not present. Another method is to use the standard list as defined above. One stores the items of the set in this list; when an item is added, one first checks whether it is already present, and only if this is not the case is the item added to the list. This check can be done quite quickly if one keeps the list sorted. However, not all items can be sorted; it depends on the type of item stored in the list whether this optimization can be used. This internal list can then be hidden from the user of the set by creating a class that wraps it and that provides the standard interface described above. This last option was chosen, because for many sets in the library it is not possible to determine what the precise universe is. This is, for example, the case for the terminal alphabet of a tree: items can be added to this set of terminals all the time, which would mean that the bit vector would have to be extended every time the alphabet changes. The second method can handle such dynamic sets more easily, because it does not use the universe to set up its internal data structure.
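The chosen, list-backed option can be sketched as follows; the class name ListBackedSet is illustrative, and the sketch omits the sorting optimization mentioned above.

import java.util.ArrayList;
import java.util.List;

// Sketch: a set backed by an internal list, where add() guards against duplicates.
public class ListBackedSet<T> {
    private final List<T> items = new ArrayList<>();

    public int getCount() { return items.size(); }

    // Adds the item only if it is not already present (linear search).
    public void add(T item) {
        if (!items.contains(item)) items.add(item);
    }

    public void remove(T item) { items.remove(item); }

    // Returns the content of the set as an array.
    public Object[] getContent() { return items.toArray(); }
}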

C.2 Trees

This section contains brief descriptions of the data structures related to the concept of ranked, ordered, labeled trees. These are all the classes that can be found in Figure 3.1. Additionally, a set of invariants is described that has to be maintained to ensure that no illegal structures are created.

C.2.1 Data structures

This section contains descriptions of all classes of the library that are used for representing trees and related concepts like nodes, symbols and alphabets.

C.2.1.1 Tree class

The Tree class represents a tree that consists of a set of nodes, where each node contains a symbol that is present in the alphabet of the tree. The alphabet can therefore be used to check whether nodes that are added to the tree contain valid symbols, i.e. symbols contained in the alphabet (see the invariant section). The root node and leaf node references can be used to visit the nodes of the tree in a top-down or bottom-up way.

Properties
  Alphabet: List of Symbol
    Set of symbols that can be used inside the nodes of the tree.
  Root: Node
    The root node of the tree.
  Leaves: List of Node
    The leaf nodes of the tree.


Methods None

C.2.1.2 Node class

The Node class represents a node in the tree structure. A node contains a symbol that tells whether it is a terminal, nonterminal or variable. A node also has references to its children and parent node, to provide easy top-down and bottom-up traversal of the tree. Further, there is a reference to the tree in which this node is contained; this can, for instance, be used to get access to the alphabet of the tree for checking or modification purposes (see the invariant section). Additionally, a node contains an annotation field that can be used by algorithms to store information in a node.

Properties
  Parent: Node
    Parent node of this node; 'null' when the node is the root node.
  ParentTree: Tree
    The tree that contains this node.
  Annotation: Dictionary
    A dictionary-like data structure to store annotations.
  Symbol: Symbol
    Reference to the symbol that contains the name and type of the symbol in this node.
  Children: List of Node
    Child nodes of this node.

Methods None

C.2.1.3 Symbol class

The Symbol class provides a structure for representing symbols like:

• Terminal
• Nonterminal
• Variable

Each symbol contains a type that tells whether the symbol is a terminal, nonterminal or variable, and a name field which contains the name of the symbol.


Properties
  Name: String
    The name of the symbol.
  Type: Enumeration
    Contains the type of the symbol: Terminal, Nonterminal or Variable.

Methods None

C.2.1.4 RankedSymbol class

The RankedSymbol class extends the Symbol class to represent a ranked symbol, such as a ranked terminal. This class is not used for nonterminals and variables, even if a tree uses ranked terminals; these nonterminals and variables are considered to have rank zero.

Properties
  Name: String
    The name of the symbol.
  Type: Enumeration
    Contains the type of the symbol: Terminal, Nonterminal or Variable.
  Rank: Integer
    The rank of the symbol.

Methods None

C.2.1.5 DottedTree class

This class represents a dotted tree according to its 2-tuple definition. It is used for referring to a node within a tree; it therefore contains a reference to the tree and to a node inside this tree.

Properties
  Tree: Tree
    Reference to the original tree.
  Node: Node
    Reference to a node inside the tree.

Methods None

C.2.2 Invariants

This section discusses several invariants for the data structures that are presented in Section C.2.1. These invariants must be kept valid to ensure that only wellformed trees are represented by these classes. We start by giving a couple of short definitions to shorten the notation of the invariants. First we have to define what it means for a node to be part of a tree; this is done by Definitions C.2.1 and C.2.2.

Definition C.2.1 Whether a node is part of a tree (Node n ∈ Tree t) is defined as follows:

Node n ∈ Tree t = NodeInTree(t.root, n)

What remains is to formally define the NodeInTree(m, n) function. This is done by Definition C.2.2.

Definition C.2.2 NodeInTree(m, n) expresses recursively whether node n is part of the subtree that starts in node m:

NodeInTree(m, n) = (m ≡ n) ∨ ⟨∨ r : 1 ≤ r ≤ m.Symbol.Rank : NodeInTree(m.Children[r], n)⟩

The upcoming sections discuss a variety of invariants, where each section covers a certain aspect of trees.
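Definition C.2.2 transcribes almost directly into Java; the Node type below is a stripped-down stand-in for the Node class of Section C.2.1.2, not the actual ForestFIRE class.

import java.util.List;

final class NodeInTreeSketch {
    record Node(String symbol, List<Node> children) {}

    // NodeInTree(m, n): is n part of the subtree rooted at m?
    static boolean nodeInTree(Node m, Node n) {
        if (m == n) return true;            // m ≡ n (reference equality)
        for (Node child : m.children())     // disjunction over all children
            if (nodeInTree(child, n)) return true;
        return false;
    }
}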

C.2.2.1 Trees and symbols

The first and most important invariant is that each node of a tree must use a symbol that is present in the alphabet of that tree. This is formalized by Invariant C.2.3.

Invariant C.2.3 Each node n of a tree t must use a symbol that is present in the alphabet of that tree:

⟨∀ n, t : Tree t ∧ Node n ∈ t : n.Symbol ∈ t.Alphabet⟩

Another invariant in the area of symbols is the invariant that concerns the role of the symbol in individual nodes. Each node may only have as many children as defined by the rank of its symbol. This is described formally by Invariant C.2.4.

Invariant C.2.4 The number of children in a node n must be equal to the rank of the symbol that is stored in the node. The #-symbol in the predicate below denotes the number of items in the list of children.

⟨∀ n : Node n : n.Symbol.Rank = #(n.Children)⟩

The last invariant of this section concerns the symbols that are used for representing nonterminals and variables. Nodes that use these symbols can only occur as leaf nodes. To achieve this we define an invariant stating that nodes with a nonterminal or variable symbol have no child nodes:

Invariant C.2.5 All symbols that represent a nonterminal or variable are of rank zero.

⟨∀ n : Node n ∧ (n.Symbol.Type = NonTerminal ∨ n.Symbol.Type = Variable) : #(n.Children) = 0⟩


C.2.2.2 Child-parent relation

This section contains invariants that describe the parent and child relation between nodes of a tree. The first invariant expresses that the tree has a tree-like shape, because the children of a node n must point to n as their parent node. This property ensures that the data structure cannot take on a different structure. For the formal definition see Invariant C.2.6.

Invariant C.2.6 For each node n of a tree t the parent of each of the children of n is equal to n:

⟨∀ n, t : Tree t ∧ Node n ∈ t : ⟨∀ m : m ∈ n.Children : m.Parent = n⟩⟩

The next invariant ensures that each node n of tree t has a path back to the root node of the tree. This invariant is comparable to Definition C.2.1, with the difference that it ensures the existence of a path from the node to the top, instead of a path from the root to the node n:

Invariant C.2.7 Each node n of a tree t must have a path to the root node of tree t:

⟨∀ n, t : Tree t ∧ Node n ∈ t : ⟨∃ i : i ∈ N : n(.Parent)^i = t.Root⟩⟩

Finally, we introduce a small invariant that makes sure that the ParentTree property of a node is set to the tree in which it is contained:

Invariant C.2.8 The ParentTree property for every node n in tree t must contain t.

⟨∀ n, t : Tree t ∧ Node n ∈ t : n.ParentTree = t⟩
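Taken together, Invariants C.2.3, C.2.4 and C.2.6 can be checked at runtime by a single recursive traversal. The following sketch uses simplified stand-ins for the Symbol/Node/Tree classes of this appendix rather than the real ones.

import java.util.List;
import java.util.Set;

final class WellformednessSketch {
    static final class Symbol { String name; int rank; }
    static final class Node { Symbol symbol; Node parent; List<Node> children; }
    static final class Tree { Set<Symbol> alphabet; Node root; }

    static boolean wellformed(Tree t) { return check(t, t.root); }

    private static boolean check(Tree t, Node n) {
        if (!t.alphabet.contains(n.symbol)) return false;      // Invariant C.2.3
        if (n.symbol.rank != n.children.size()) return false;  // Invariant C.2.4
        for (Node child : n.children) {
            if (child.parent != n) return false;               // Invariant C.2.6
            if (!check(t, child)) return false;
        }
        return true;
    }
}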

C.2.2.3 Dotted tree invariant

There is also an important invariant in the area of dotted trees. A dotted tree contains a node and a tree. This node must be part of the specified tree:

Invariant C.2.9 The specified node of the dotted tree must be part of the specified tree.

⟨∀ dt : DottedTree dt : dt.Node ∈ dt.Tree⟩

C.2.3 Related algorithms

This section describes the classes and methods that implement tree related algorithms. These algorithms were created to support some special tree grammar algorithms and mostly focus on providing tree statistics, like the number of nodes. All these algorithms are implemented by the newly introduced TreeAnalyzer class.


C.2.3.1 Number of nodes

The number of nodes in a tree can be retrieved by the getNumberOfNodes method of the TreeAnalyzer class. The method traverses the tree from the root to the leaves and counts the encountered nodes.

TreeAnalyzer class
  GetNumberOfNodes(t: Tree): Integer
    Returns the number of nodes inside the tree.

C.2.3.2 Number of non-root terminal nodes

The number of non-root terminal nodes can be retrieved by the getNumberOfZPlusNodes method. This method is implemented in a similar way to the getNumberOfNodes method; the only difference is that it counts the nodes with a terminal symbol, with the exception of the root node.

TreeAnalyzer class
  GetNumberOfZPlusNodes(t: Tree): Integer
    Returns the number of non-root nodes that contain a terminal.
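Both traversals are simple recursions over the tree. The sketch below shows one possible shape of the two counting methods; the Node record is again a minimal stand-in that only records whether a node's symbol is a terminal.

import java.util.List;

final class TreeAnalyzerSketch {
    record Node(boolean isTerminal, List<Node> children) {}

    // Counts this node plus all nodes in its subtrees.
    static int getNumberOfNodes(Node n) {
        int count = 1;
        for (Node child : n.children())
            count += getNumberOfNodes(child);
        return count;
    }

    // Counts non-root nodes labeled by a terminal: the root itself is skipped.
    static int getNumberOfZPlusNodes(Node root) {
        int count = 0;
        for (Node child : root.children())
            count += countTerminalNodes(child);
        return count;
    }

    private static int countTerminalNodes(Node n) {
        int count = n.isTerminal() ? 1 : 0;
        for (Node child : n.children())
            count += countTerminalNodes(child);
        return count;
    }
}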

C.3 Regular tree grammars

This section about regular tree grammars describes all classes that implement data structures related to the concept of regular tree grammars over ranked, ordered, labeled trees. Additionally, there is a section that discusses invariants on these data structures, and a third section that discusses classes implementing algorithms related to regular tree grammars.

C.3.1 Data structures

This section contains descriptions of all classes of the library that are used for representing regular tree grammars.

C.3.1.1 RegularTreeGrammar class

The standard RegularTreeGrammar class represents a tree grammar. A tree grammar is constructed from a set of production rules, an alphabet that contains the terminals and nonterminals that can be used as start symbol or inside the production rules, and a start nonterminal.


Properties
  Alphabet: Set of Symbol
    The alphabet of terminals and nonterminals that are used in the grammar and its production rules.
  ProductionRules: Set of ProductionRule
    Production rules of the grammar.
  StartSymbol: Symbol
    The nonterminal symbol that forms the start symbol for the grammar.

Methods None

C.3.1.2 GrammarProductionRule class

The GrammarProductionRule class represents a single grammar production rule. Just like in its formal definition, a production rule has a left hand side symbol (of type nonterminal) and a right hand side tree. Finally, there is a cost field that can be used to store the cost of applying that production rule (mostly used in the area of instruction selection).

Properties
  LHS: Symbol
    The left hand side nonterminal of the production rule.
  RHS: Tree
    The right hand side tree of the production rule.
  Cost: Integer
    The cost of applying this rule.

Methods None

C.3.1.3 DottedRule class

This class represents a dotted rule according to its 2-tuple definition; it can be used to point to a specific node inside the rhs of a rule. It contains a reference to the production rule and to a node inside the rhs tree of that production rule.

Properties
  ProductionRule: GrammarProductionRule
    Reference to the source production rule.
  Node: Node
    Reference to a node inside the rhs tree.

Methods None


C.3.2 Invariants

As can be seen in Figure 3.2, the grammar classes are strongly related. We introduce a set of invariants to ensure that only wellformed grammars are created by these classes. The first invariant addresses the different alphabets in the right hand sides of the production rules: these alphabets may not contain symbols that cannot be found in the general alphabet of the grammar. This invariant is defined precisely by Invariant C.3.1.

Invariant C.3.1 The alphabet of a production rule r of a grammar g must be a subset of the general alphabet of grammar g.

⟨∀ g, r : Grammar g ∧ Rule r ∈ g : r.RHS.Alphabet ⊆ g.Alphabet⟩

Another concern is the left hand side of the production rules. One must ensure that these lhs symbols are nonterminals and that they are present in the alphabet of the grammar. This is defined by Invariant C.3.2.

Invariant C.3.2 The left hand side symbol of a rule r of grammar g must be contained in the alphabet of grammar g, and the symbol must be a nonterminal.

⟨∀ g, r : Grammar g ∧ Rule r ∈ g : r.LHS.Type = NonTerminal ∧ r.LHS ∈ g.Alphabet⟩

The symbols in the right hand side tree also need to be part of the alphabet, but this invariant already holds due to the combination of Invariant C.3.1 and Invariant C.2.3 of the previous section. Finally, there is an invariant in the area of dotted rules. A dotted rule contains a production rule and a node. This node must be part of the right hand side tree of that production rule:

Invariant C.3.3 The specified node of the dotted rule must be part of the tree in the rhs of the specified production rule.

⟨∀ dr : DottedRule dr : dr.Node ∈ dr.ProductionRule.RHS⟩

C.3.3 Related algorithms

This section contains all implemented algorithms that are related to regular tree grammars. A description is given of the classes that implement these algorithms. Detailed descriptions of the algorithms themselves are not given in this section; these can be found in Chapter 4 for the grammar transformation algorithms and in [Cle07] for the algorithms that were not part of the experiments. This section divides the algorithms into several categories: standard analysis, usability, grammar transformation, dotted rule retrieval and subtree retrieval. Each of the upcoming subsections provides details on the algorithms in that particular area.

C.3.3.1 Standard analysis

The standard analysis algorithms measure basic statistics for a tree grammar. This can be simple statistics, like the number of nodes and rules, but also more complicated statistics, like the presence or number of chain rules or non-root terminal nodes. All these algorithms can be found in the newly created RTGStandardAnalyzer class.


Number of nodes

The number of nodes in the production rules of a grammar can be retrieved by inspecting the getNumberOfNodes property of the RTGStandardAnalyzer class. This property is implemented by using the getNumberOfNodes property of the TreeAnalyzer class: the values for all trees in the production rules are added to obtain the complete number of nodes.

RTGStandardAnalyzer class addition
  GetNumberOfNodes(g: RTG): Integer
    Returns the number of nodes inside the trees in the right hand sides of the production rules.

Number of rules

The number of production rules in a grammar can be retrieved by inspecting a standard count property on the list that stores the production rules inside the grammar.

RTGStandardAnalyzer class addition
  Count(g: RTG): Integer
    Returns the number of production rules inside the grammar.

Number of chain rules

The number of chain rules in a grammar can be retrieved by inspecting the getNumberOfChainRules property of the RTGStandardAnalyzer class. This property is implemented by defining an IsChainRule property for each production rule; counting the rules for which this returns a positive result then yields the number of chain rules. We also provide an additional getChainRules method that does not provide the number of chain rules but the rules themselves. This method is used together with the transformation operation that removes chain rules; it can then be used to show which chain rules will/can be removed. The existence of chain rules can be retrieved by inspecting the HasChainRules property in the RTGStandardAnalyzer class. This property is implemented by testing whether the getNumberOfChainRules property returns a nonzero value.

RTGStandardAnalyzer class addition
  IsChainRule(r: ProductionRule): Bool
    Returns a boolean value that describes whether the rule is a chain rule or not.
  GetNumberOfChainRules(g: RTG): Integer
    Returns the number of chain rules inside the grammar.
  GetChainRules(g: RTG): List of ProductionRule
    Returns the list of chain rules within the grammar.
  HasChainRules(g: RTG): Bool
    Checks whether the grammar contains chain rules or not.
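The IsChainRule test itself is tiny: a production rule is a chain rule exactly when its rhs tree is a single node labeled by a nonterminal. A sketch with minimal stand-in types (not the ForestFIRE classes):

final class ChainRuleSketch {
    enum SymbolType { TERMINAL, NONTERMINAL, VARIABLE }
    record Node(SymbolType type, java.util.List<Node> children) {}
    record Rule(String lhs, Node rhs) {}

    // A chain rule has the form A -> B: its rhs is one nonterminal leaf node.
    static boolean isChainRule(Rule r) {
        return r.rhs().type() == SymbolType.NONTERMINAL
            && r.rhs().children().isEmpty();
    }
}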

Non-root nodes labeled by terminals

The getNumberOfZPlusNodes method in the RTGStandardAnalyzer class provides the possibility to retrieve the number of non-root terminal nodes. This is realized by using the TreeAnalyzer class, which contains a method that returns the number of non-root terminal nodes for a tree. Calling this method for all rhs trees of the production rules and adding the return values results in the total number of non-root terminal nodes of the grammar. Just like in the previous section about chain rules, we also provide an additional method that does not only count the non-root terminal nodes, but also lists them. This is done by the getZPlusNodes method, which provides a collection of dotted rules; these dotted rules point to the non-root terminals within the rules. The existence of non-root terminal nodes in the rhs of the production rules can be retrieved by inspecting the HasZPlusNodes property. This property can easily be implemented by testing whether the getNumberOfZPlusNodes property returns a nonzero value.

RTGStandardAnalyzer class
  GetNumberOfZPlusNodes(g: RTG): Integer
    Returns the number of non-root nodes that contain a terminal.
  GetZPlusNodes(g: RTG): List of DottedRule
    Returns the non-root nodes that contain a terminal, over all production rules, as a list of dotted rules.
  HasZPlusNodes(g: RTG): Bool
    Checks whether the grammar contains rules whose rhs has non-root terminal nodes.

Rules that contain non-root nodes labeled by terminals

This operation, which counts the rules that contain non-root nodes labeled by terminals, is comparable to the previous operation. The terminal counting in a tree can be implemented as described in the previous section. The getNumberOfZPlusRules method in the RTGStandardAnalyzer class can then count the rules with trees that contain one or more non-root terminals.

RTGStandardAnalyzer class
  GetNumberOfZPlusRules(g: RTG): Integer
    Returns the number of rules that contain non-root nodes that are labeled by a terminal.

C.3.3.2 Usability

This section describes a set of operations that list rules or symbols which are or are not productive or reachable. A new class is created to separate all these operations from the RegularTreeGrammar class. This class, called RTGUsabilityAnalyzer, can provide these usability statistics for each grammar. More information about the reachability and productivity properties can be found in Section 1.1.3 and [Cle07, Section 3.4.2].

(Un)reachable symbols/production rules

One usability aspect is the reachability of symbols and rules. There are three different grammar operations defined to get the set of reachable symbols and production rules. The same operations are defined for retrieving unreachable symbols and rules; these three additional operations can easily be implemented by computing the difference between the complete set of terminals/nonterminals/rules and the reachable items.

RTGUsabilityAnalyzer class
  GetReachableTerminals(g: RTG): Set of Symbol
    Returns the set of reachable terminals.
  GetReachableNonTerminals(g: RTG): Set of Symbol
    Returns the set of reachable nonterminals.
  GetReachableRules(g: RTG): Set of ProductionRule
    Returns the set of reachable production rules.
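Reachability itself is naturally computed as a fixpoint: starting from the start symbol, keep adding every nonterminal that occurs in the rhs of a rule whose lhs is already reachable. The following sketch (with simplified stand-in types, covering only nonterminals) illustrates the idea; the actual ForestFIRE implementation may differ.

import java.util.*;

final class ReachabilitySketch {
    record Rule(String lhs, Set<String> rhsNonterminals) {}

    static Set<String> reachableNonterminals(String start, List<Rule> rules) {
        Set<String> reachable = new HashSet<>();
        reachable.add(start);
        boolean changed = true;
        while (changed) {                     // iterate until a fixpoint is reached
            changed = false;
            for (Rule r : rules) {
                if (reachable.contains(r.lhs())
                        && reachable.addAll(r.rhsNonterminals())) {
                    changed = true;           // a new nonterminal became reachable
                }
            }
        }
        return reachable;
    }
}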

(Un)productive terminals/nonterminals/production rules

Productiveness is another usability characteristic. The RTGUsabilityAnalyzer class defines operations for this characteristic in a similar way to the reachability characteristic. The non-productive items can again be computed by taking the difference between the complete set of terminals/nonterminals/rules and their productive set.

RTGUsabilityAnalyzer class addition
GetProductiveTerminals(g : RTG) : Set of Symbol
    Returns the set of productive terminals.
GetProductiveNonTerminals(g : RTG) : Set of Symbol
    Returns the set of productive nonterminals.
GetProductiveRules(g : RTG) : Set of ProductionRule
    Returns the set of productive production rules.

Useful/useless symbols/production rules The useful/useless property is a combination of the reachability and productiveness properties. A terminal, nonterminal or production rule is useful if and only if it is reachable and productive. There are three RTGUsabilityAnalyzer class operations defined for retrieving these useful items. The useless items can again be retrieved by computing the difference between all items and the useful items.

RTGUsabilityAnalyzer class addition
GetUsefulTerminals(g : RTG) : Set of Symbol
    Returns the set of useful terminals.
GetUsefulNonTerminals(g : RTG) : Set of Symbol
    Returns the set of useful nonterminals.
GetUsefulRules(g : RTG) : Set of ProductionRule
    Returns the set of useful production rules.


C.3.3.3 Transformations This section describes two classes that implement algorithms related to tree grammar transformations: the RTGUselessItemRemover class, which implements the algorithms for removing unreachable and unproductive items, and the RTGStandardRemover class, which implements the algorithms for removing chain rules and non-root terminal nodes. These last two algorithms can be used to convert grammars such that they have the special characteristics that are described in Chapter 1. They were also the subject of experiments; more details about their implementation can therefore be found in Chapter 4.

Remove useless symbols/production rules The three operations, for removing useless terminals, nonterminals and rules, will be handled by one class that contains a method for each of the removal operations. Each of these methods takes a regular tree grammar as input and creates a new grammar that does not contain the useless terminals, nonterminals or production rules. The three different methods can be used in sequence to remove all useless items.

RTGUselessItemRemover class
RemoveUselessTerminals(g : RTG) : RTG
    Outputs a copy of grammar g that does not contain any useless terminals.
RemoveUselessNonTerminals(g : RTG) : RTG
    Outputs a copy of grammar g that does not contain any useless nonterminals.
RemoveUselessRules(g : RTG) : RTG
    Outputs a copy of grammar g that does not contain any useless production rules.
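A hypothetical usage fragment, chaining the three removal operations in sequence; the method casing is an assumption based on the interface table above.

    // Each call returns a fresh copy; chaining them removes all useless items.
    RTGUselessItemRemover remover = new RTGUselessItemRemover();
    RTG cleaned = remover.removeUselessRules(g);
    cleaned = remover.removeUselessNonTerminals(cleaned);
    cleaned = remover.removeUselessTerminals(cleaned);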

Remove chain rules Chain rules are removed in a similar way as the useless terminals and other useless items. An additional class is introduced that contains an operation that takes a grammar and outputs a copy of the grammar that does not contain chain rules, without changing the language produced by that grammar. This class is called RTGStandardRemover, because it contains transformations that have corresponding analysis operations in the RTGStandardAnalyzer class. We also introduce an additional method, called RemoveChainRule, that provides the possibility to remove a single chain rule from a grammar. This is done to investigate the effect of the removal of certain chain rules. More information about the transformation algorithm implemented by this class can be found in Section 4.2.2 and [Cle07, Transformation 3.4.34].

RTGStandardRemover class
RemoveChainRules(g : RTG) : RTG
    Outputs a copy of grammar g that does not contain any chain rules.
RemoveChainRule(g : RTG, r : GrammarProductionRule)
    Transforms the grammar g in such a way that the chain rule r is not present anymore.


Remove rules containing non-root terminals To remove rules that contain non-root terminals in their right hand side, we add an additional method to the RTGStandardRemover class. This RemoveZPlusRules method takes a regular tree grammar and creates a new grammar that does not contain trees with non-root terminals, without changing the language produced by that grammar. We also introduce an additional method, called RemoveZPlusNode, that allows us to remove only one non-root terminal node in a production rule. This way we can analyze the effect of the removal of different nodes. It takes a grammar and a dotted rule as input parameters and transforms the grammar in such a way that it does not contain that node. More information about the transformation algorithm implemented by this class can be found in Section 4.2.1 and [Cle07, Section 3.4.3].

RTGStandardRemover class
RemoveZPlusRules(g : RTG) : RTG
    Outputs a copy of grammar g that does not contain rules that contain non-root terminals.
RemoveZPlusNode(g : RTG, n : DottedRule)
    Transforms the grammar g in such a way that the non-root terminal node n is not present anymore.

C.3.3.4 Dotted rules This section discusses operations to obtain and analyze dotted rules from the production rules of a grammar. These operations are implemented as separate classes to avoid cluttering the regular tree grammar class with a large collection of methods which are not vital for the grammar itself.

Get dotted rule set This operation produces all possible dotted rules from the set of production rules of a grammar. To realize this, the DottedRuleSetProvider class is constructed, which contains the GetDottedRules method that consumes a grammar and produces these dotted rules. The produced dotted rules are wrapped in a new DottedRuleCollection class, because there are additional operations defined on such a collection of dotted rules.

DottedRuleSetProvider class
GetDottedRules(g : RTG) : DottedRuleCollection
    Produces all possible dotted rules from the production rules inside grammar g.

DottedRuleCollection class
Rules : Set of DottedRule
    The set of dotted rules that are stored in the dotted rule collection.

Flatten The flatten function converts a collection of dotted rules into a set of subtrees. This is realized by creating a new tree for each dotted rule that contains a cloned structure of the subtree referred to by the dotted rule. Flattening a set of dotted rules can result in duplicate tree structures. These duplicate trees are removed such that only unique subtrees are output.


DottedRuleCollection class addition
Flatten() : Set of Tree
    Returns the set of unique subtrees that are defined by the collection of dotted rules.
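A minimal sketch of Flatten, assuming each dotted rule can clone the subtree it points to (the helper name cloneSubtreeAtDot is hypothetical) and that Tree implements structural equals/hashCode, so that a hash set removes the duplicate clones.

    import java.util.Collection;
    import java.util.HashSet;
    import java.util.Set;

    // Sketch: clone the subtree at each dot and deduplicate structurally.
    public Set<Tree> flatten(Collection<DottedRule> rules) {
        Set<Tree> subtrees = new HashSet<>();
        for (DottedRule r : rules)
            subtrees.add(r.cloneSubtreeAtDot());  // assumed helper name
        return subtrees;
    }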

Number of dotted rules The DottedRuleCollection class inherits from the standard list that is defined in Section C.1.1. The number of dotted rules in the collection can therefore be retrieved by inspecting the inherited count property.

DottedRuleCollection class addition
Count : Integer
    Returns the number of dotted rules in the collection.

Number of flattened dotted rules This operation returns the number of subtrees that can be obtained from a set of dotted rules when they are flattened using the flatten function.

DottedRuleCollection class addition
GetNumberOfElementsWhenFlattened() : Integer
    Returns the number of subtrees that would result if the dotted rules in the collection were flattened.

C.3.3.5 Subtrees This section discusses operations to directly obtain all subtrees as standard trees from a grammar. These operations are implemented in the same way as for the dotted rules, by constructing a new class. This is again done to avoid cluttering the grammar class with a large collection of methods.

Get subtree set This operation produces all unique subtrees that can be extracted from the right hand side trees of the production rules. This operation is realized by the new SubtreeSetProvider class, which contains a method that consumes a grammar and outputs a set of trees. There is no special subtree collection class constructed, because there is no special functionality needed from the set of subtrees. The only needed functionality is a function for retrieval of the number of subtrees, but this is provided by the standard collection types.

SubtreeSetProvider class
GetSubTrees(g : RegularTreeGrammar) : Set of Tree
    Produces all unique subtrees from the right hand side trees of the production rules inside grammar g.

Number of Elements The number of elements in a set of subtrees is retrieved by inspecting the standard count property of the used set type (see Section C.1), because there is no separate class defined for storing subtrees.


C.4 Tree patterns

This section contains a description of all data structures and invariants related to tree patterns.

C.4.1 Data structures Most of the data structures for tree patterns are derived directly from trees, because patterns are represented as trees. The only additional data structure needed is a class that wraps a collection of tree patterns such that a group of tree patterns can easily be manipulated.

C.4.1.1 PatternSet class The PatternSet class represents a collection of tree patterns. This collection also contains an alphabet that specifies which terminals, nonterminals and variables can be used in the contained patterns.

Properties
Alphabet : Set of Symbol
    Set of symbols that can be used in patterns.
Patterns : Set of Tree
    The set of tree patterns.

Methods None

C.4.2 Invariants There are not many invariants for this chapter, because patterns also reuse the tree invariants. However, it is important to create an invariant that describes that the symbols in the alphabet of a pattern also have to be present in the alphabet of the pattern collection that this pattern is part of. This is formalized by Invariant C.4.1.

Invariant C.4.1 The alphabet of a pattern p in a pattern set s must be a subset of the alphabet of the collection s.

⟨∀s, p : PatternSet s ∧ Tree p ∈ s : p.Alphabet ⊆ s.Alphabet⟩
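For illustration, this invariant could be checked at runtime as in the following sketch; the getters are hypothetical renderings of the properties listed above.

    // Sketch: every pattern's alphabet must be contained in the set's alphabet.
    public boolean satisfiesInvariantC41(PatternSet s) {
        for (Tree p : s.getPatterns())
            if (!s.getAlphabet().containsAll(p.getAlphabet()))
                return false;  // a pattern uses a symbol outside the set's alphabet
        return true;
    }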

C.4.3 Related algorithms All but one of the algorithms related to tree patterns have as their goal to retrieve (properties of) the subtrees that can be retrieved from a pattern collection. This section describes the classes that implement these algorithms.

C.4.3.1 Number of patterns The number of patterns can be retrieved by inspecting the count property of the PatternSet class, because the class is implemented as a descendant of the list class that provides a standard count property.


PatternSet class addition
Count : Integer
    Returns the number of patterns that are stored in the set.

C.4.3.2 Dotted trees This section discusses operations to obtain dotted trees from a pattern collection. These operations are implemented as separate classes to avoid cluttering the PatternSet class with a large collection of complicated methods.

Get dotted tree set This operation produces all possible dotted trees from a collection of patterns. The previous section already introduced a class that distills all possible dotted rules from the production rules of a grammar (see Section C.3.3.4). A similar class is introduced to retrieve all dotted trees from a collection of tree patterns.

DottedTreeSetProvider class
GetDottedTrees(p : PatternSet) : DottedTreeCollection
    Produces all possible dotted trees from a set of tree patterns.

DottedTreeCollection class
Trees : Set of DottedTree
    The set of dotted trees that are stored in the dotted tree collection.

Flatten The flatten function converts a collection of dotted trees into a set of standard trees, by cloning the referenced tree structure. This function is similar to the Flatten function for dotted rules (see Section C.3.3.4). Therefore a similar solution is provided, by extending the DottedTreeCollection class with a Flatten function.

DottedTreeCollection class addition
Flatten() : Set of Tree
    Returns the set of unique subtrees that are defined by the collection of dotted trees.

Number of dotted trees The DottedTreeCollection class inherits from the standard list that is defined in Section C.1.1. The number of dotted trees in the collection can therefore be retrieved by inspecting the inherited count property.

DottedTreeCollection class addition
Count : Integer
    Returns the number of dotted trees in the collection.

Number of flattened dotted trees This operation returns the number of subtrees that can be obtained from a set of dotted trees when they are flattened using the flatten function. Such an operation is already described for the DottedRuleCollection class (see Section C.3.3.4). The same solution is chosen for the DottedTreeCollection class.

DottedTreeCollection class addition
GetNumberOfElementsWhenFlattened() : Integer
    Returns the number of subtrees that would result if the dotted trees in the collection were flattened.

C.4.3.3 Subtrees This section discusses operations to obtain all unique subtrees as standard trees from a set of tree patterns. These operations are implemented in the same class used for subtree extraction from tree grammars.

Get subtree set This operation retrieves all unique subtrees that can be extracted from a set of tree patterns. This is realized by cloning all unique subtrees that can be found in the patterns.

SubtreeSetProvider class addition
GetSubTrees(p : PatternSet) : Set of Tree
    Produces all unique subtrees from the collection of patterns p.

Number of elements The number of elements in a set/list of subtrees is retrieved by inspecting the standard count property of the used set/list (see Section C.1), because there is no separate class defined for storing a collection of subtrees.

C.5 Tree automata

This section describes the classes that implement tree automata and classes that contain the algorithms that are related to tree automata. Furthermore this section describes a collection of invariants for the tree automata classes.

C.5.1 Data structures The data structures section describes all classes that are implemented to represent (ε)NFRTAs, (ε)NRFTAs and DFRTAs. These are the classes that can be found in Figure 3.5 of Chapter 3.

C.5.1.1 AbstractTreeAutomaton class The AbstractTreeAutomaton class is an abstract class that presents the standard interface for a tree automaton based on the formal 5-tuple definition. Three items of the 5-tuple can be found as properties: the state set, the alphabet and the root accepting states. The reason why the other two items are not present in the base class can be found in Section 3.1.4. This class also defines an abstract method for adding transitions. This is possible because the general shape of a single transition is the same for all automata if one abstracts from the direction. However, it depends on the type of automaton what the most efficient way is to store them. This interface can therefore be implemented differently by each automaton.

Properties
StateSet : Set of AutomatonState
    The states that can be found inside the automaton.
Alphabet : Set of Symbol
    The ranked alphabet used inside the automaton.
RootAccepting : Set of AutomatonState
    The root accepting states of the automaton.

Methods
AddTransition(ss : AutomatonState, rs : AutomatonState^n, s : Symbol)
    Adds a transition based on symbol s with rank n between state ss and the state vector rs of length n.

C.5.1.2 NRFTA class When using an NRFTA (or DRFTA) one walks from the top to the bottom of the tree to find which states match which nodes. If one is performing such a traversal and one wants to use an NRFTA to find the child states for a node for which the state is known, then one has to feed this current state and the symbol of the current node to the automaton. An NRFTA will then return a set of vectors of n states (this set also contains all vectors that can be reached after executing all possible ε-transitions) that represent the possible states for these child nodes. This traversal is realized by the NextState method. Additionally there is a property that expresses whether the automaton contains ε-transitions.

Properties
ContainsEpsilonTransitions : Boolean
    Indicates whether the automaton contains ε-transitions.

Methods
NextState(cs : AutomatonState, s : Symbol) : Set of AutomatonState^n
    See the NRFTA class description.
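The following sketch shows how such a top-down traversal could use NextState to check whether a subtree can be matched from a given state; the type and method names (TreeNode, getSymbol, getChildren, nextState) are assumptions, not the library's exact API.

    import java.util.List;

    // Sketch of a root-to-frontier check with an NRFTA.
    class RFMatchSketch {
        boolean matchFrom(NRFTA ta, AutomatonState q, TreeNode n) {
            // Each vector is one possible assignment of states to n's children;
            // for a rank-0 symbol an accepting state yields the empty vector.
            for (List<AutomatonState> vec : ta.nextState(q, n.getSymbol())) {
                boolean ok = true;
                for (int i = 0; ok && i < n.getChildren().size(); i++)
                    ok = matchFrom(ta, vec.get(i), n.getChildren().get(i));
                if (ok) return true;  // one viable vector suffices
            }
            return false;
        }
    }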

C.5.1.3 NFRTA class When using an NFRTA (or DFRTA) one walks from the bottom to the top of the tree to find which states match which nodes. If one knows all the states for the child nodes of a certain node α and wants to use an NFRTA to find the state of this node α, then one has to feed all these child node states and the symbol of α to the automaton. An NFRTA will then return a set of states for the parent node (this set also contains all states that can be reached after executing all ε-transitions). This traversal is realized by the NextState method. When one uses an NFRTA one starts with the leaf accepting states. As described earlier there is no explicit field that contains these leaf accepting states. The states for a specific symbol of rank 0 can be retrieved by calling the NextState method with an empty set of states and the symbol. The method will then return the leaf accepting states that accept this symbol.

Properties
ContainsEpsilonTransitions : Boolean
    Indicates whether the automaton contains ε-transitions.

Methods
NextState(cs : List of AutomatonState, s : Symbol) : Set of AutomatonState
    See the NFRTA introduction.
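A sketch of how a bottom-up traversal could compute the set of possible states per node with this interface; TreeNode and the method casing are assumptions. Every combination (vector) of child states is fed to NextState and the resulting sets are united; for a leaf the cartesian product contains exactly one empty vector, which matches the leaf-state retrieval described above.

    import java.util.ArrayList;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    // Sketch of frontier-to-root state computation with an NFRTA.
    class FRStatesSketch {
        Set<AutomatonState> states(NFRTA ta, TreeNode n) {
            List<Set<AutomatonState>> perChild = new ArrayList<>();
            for (TreeNode c : n.getChildren())
                perChild.add(states(ta, c));
            Set<AutomatonState> result = new HashSet<>();
            for (List<AutomatonState> vector : vectors(perChild))
                result.addAll(ta.nextState(vector, n.getSymbol()));
            return result;
        }

        // All child-state vectors: the cartesian product of the per-child sets.
        List<List<AutomatonState>> vectors(List<Set<AutomatonState>> sets) {
            List<List<AutomatonState>> out = new ArrayList<>();
            out.add(new ArrayList<>());  // one empty vector for the leaf case
            for (Set<AutomatonState> s : sets) {
                List<List<AutomatonState>> next = new ArrayList<>();
                for (List<AutomatonState> prefix : out)
                    for (AutomatonState q : s) {
                        List<AutomatonState> v = new ArrayList<>(prefix);
                        v.add(q);
                        next.add(v);
                    }
                out = next;
            }
            return out;
        }
    }

A tree would then be accepted if and only if the state set computed for its root has a non-empty intersection with the RootAccepting set.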

C.5.1.4 AbstractDFRTA class The goal of the AbstractDFRTA class is to support different DFRTA implementations. The interface of this abstract class is the same as that of the NFRTA class; even the NextState method is the same. However, its output is restricted to a singleton set containing only one state. There are five types of DFRTA classes that inherit from this abstract class. These five classes all implement different optimization techniques that reduce the size of the automaton and the time needed to construct the automaton. These five types of DFRTAs are directly related to the five tree automaton construction algorithms that are discussed in Section 4.3 of Chapter 4:

• DFRTAStandard class
• DFRTAFilterSubtree class
• DFRTAFilterIndex class
• DFRTAFilterSymbol class
• DFRTAFilterIndexSymbol class

All these automata can be constructed using the DTAGenerator class. Calling the generate method of the generator class with a fresh automaton, a tree grammar and a corresponding item set will fill the automaton with the states and transitions such that it accepts the input grammar. The applied filtering technique depends on the type of automaton that is provided to the generate method. The AbstractDFRTA class also contains methods that are not listed below. These methods are used to facilitate easy construction by the DTAGenerator class and optimize the memory usage of the automata. A description of these methods can be found in the JavaDoc documentation of the library.

Properties None

Methods
NextState(cs : List of AutomatonState, s : Symbol) : (Singleton) Set of AutomatonState
    Similar to the NextState method of an NFRTA, with the exception that it returns a singleton set.

C.5.1.5 AbstractAutomatonState class This class represents an abstract state of an automaton. Each state has a name, and each descendant class defines in its own way how to represent the subtrees that are matched in that state. Each automaton can then use the type of state it wants/needs.

Properties
Name : String
    The name of the state.

Methods None

C.5.1.6 DottedTreeAutomatonState class This class represents a state based on dotted tree matches. This kind of state can be used in automata constructed from pattern sets. The dotted trees in the match set of a state tell which node(s) of which pattern match with that state. This match information can then be used when, for instance, solving the tree pattern matching problem.

Properties
Name : String
    The name of the state.
Matches : Set of DottedTree
    The set of dotted trees that tells which node of a certain pattern matches with this state.

Methods None

141 C.5. Tree automata Chapter C. ForestFIRE library

C.5.1.7 DottedRuleAutomatonState class This class represents a state based on dotted rule matches. This kind of state can be used in automata constructed from tree grammars. The dotted rules in the match set of a state tell which node(s) of which production rule match with that state. This match information can then be used when, for instance, solving the tree parsing problem.

Properties
Name : String
    The name of the state.
Matches : Set of DottedRule
    The set of dotted rules that tells which node of a certain production rule matches with this state.

Methods None

C.5.1.8 SubtreeAutomatonState class This class represents a state based on standard tree matches. This kind of state can be used in automata constructed from either pattern sets or tree grammars. This type of state can therefore be found in many types of automata. The only disadvantage of these standard tree matches is that it is impossible to quickly retrieve their origin (which pattern or production rule), because they are cloned from their source.

Properties
Name : String
    The name of the state.
Matches : Set of Tree
    The set of subtrees that tells which node in a production rule or pattern matches with this state.

Methods None

C.5.2 Invariants This section presents all invariants that ensure that the classes above can only represent well-formed automata. The first invariants that we have to define are the invariants that limit the usage of states to the states defined in the state set of an automaton. We start with defining an invariant that describes the fact that all root accepting states should be contained in the state set:

Invariant C.5.1 The set of root accepting states of an automaton must be a subset of the set of all possible states.

⟨∀a : AbstractTreeAutomaton a : a.RootAccepting ⊆ a.StateSet⟩


We could define a similar invariant for the leaf accepting states, but the problem is that these states are not defined explicitly, because they can be obtained by using the NextState- method. It is however important to mention that all the states that are returned by this method must be part of the state set of the corresponding automaton:

Invariant C.5.2 The states of an automaton that are produced by the NextState method must be a subset of the set of all possible states (State q ∈ a.NextState means that q can be produced by the NextState method of automaton a).

⟨∀a, q : AbstractTreeAutomaton a ∧ State q ∈ a.NextState : q ∈ a.StateSet⟩

The last invariant we introduce expresses the fact that each symbol of the alphabet of an automaton must have a role in the automaton. This means that every symbol must be part of the transition relations of the automaton:

Invariant C.5.3 Each symbol of the alphabet of an automaton must be part of at least one transition (there is a separate definition for the RFTAs and the FRTAs).

⟨∀a, s : NRFTA a ∧ Symbol s ∈ a.Alphabet : ⟨∃q : State q ∈ a.StateSet : a.NextState(q, s) ≠ ∅⟩⟩

⟨∀a, s : NFRTA a ∧ Symbol s ∈ a.Alphabet : ⟨∃qs : List of State qs ⊆ a.StateSet : a.NextState(qs, s) ≠ ∅⟩⟩

C.5.3 Related algorithms This section contains all the interfaces and classes related to tree automaton algorithms. These algorithms are divided into three categories: state construction algorithms, automaton construction algorithms and tree acceptance/parse algorithms. The first part discusses the interfaces of algorithms that provide item sets that are used for state construction in automaton construction algorithms. The second part discusses the interfaces of the construction algorithms themselves, and finally the interfaces of the match and parse algorithms that use tree automata are discussed.

C.5.3.1 Automaton state construction Tree automaton states are, as described earlier, constructed based on sets of subtrees called item sets. These item sets consist of a set of unique subtrees, which can be represented by dotted rules, dotted trees or standard (cloned) trees. This section discusses the classes that can be used to construct such item sets that contain standard trees.

AbstractItemSetProvider class The AbstractItemSetProvider class is an abstract class from which all classes are derived that implement algorithms to obtain item sets, as cloned standard trees, from tree grammars. This class therefore specifies a constructor function with which an instance of the item set provider can be created based on a grammar. The item set can be retrieved by calling the GetSubTrees method. This method returns a tuple that contains the item set (a set of unique subtrees of the grammar) and the subtree from the item set that refers to the start symbol. This second item is interesting when automata are constructed, because they then know which subtree must be present in the match set of the root accepting state. There are three classes that inherit from this abstract class: ProviderAllSub, ProviderProperN and ProviderProperS. These classes implement the retrieval of the three different item sets that are discussed in the tree automaton constructions in Chapter 4.

AbstractItemSetProvider class
Create(RegularTreeGrammar g)
    Creates an instance of this class for a specific tree grammar.
GetSubTrees() : (Tree, Set of Tree)
    Produces a tuple that contains all unique subtrees from g and the subtree corresponding to the start symbol.

C.5.3.2 Automaton construction There are only two automaton construction algorithms implemented in ForestFIRE: one for nondeterministic automata and one for deterministic automata. However, these algorithms can be used to construct many different types of automata. This is caused by the separation of responsibilities: a part of the logic is transferred from the construction algorithm to the automata. The construction algorithm for nondeterministic automata can, for instance, create FR and RF automata based on the type of automaton. The automaton itself defines how the transitions are stored. This resulted in a low amount of duplicated code.

NTAGenerator class The NTAGenerator class is the class that implements the construction algorithm for non-deterministic automata. This algorithm (discussed in detail in Chapter 4) can be used to construct RF and FR automata with and without ε-transitions, based on one of the tree item sets. These different types of automata can be created by using the correct parameters in the GenerateAutomaton-method.

NTAGenerator class
GenerateAutomaton(RTG g, AbstractTA ta, AbstractISP isp, Boolean epsilon)
    Generates a nondeterministic tree automaton based on grammar g and the item set provided by isp, with or without ε-transitions depending on the parameter epsilon, and stores this automaton in ta.

DTAGenerator class The DTAGenerator class is the class that implements the construction algorithm for DFRTAs. The algorithm can be used to construct a DFRTA from a tree grammar based on one of the three item sets. The experiments in Chapter 4 discuss filter techniques that can be used in the construction. These optimizations can be applied by using one of the four special DFRTAs that inherit from the AbstractDFRTA class. These automata hide the filtering techniques from the construction process such that only this single construction algorithm is needed.


DTAGenerator class
GenerateAutomaton(RTG g, AbstractDFRTA ta, AbstractISP isp)
    Generates a DFRTA based on grammar g and the item set provided by isp and stores this automaton in ta.
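A hypothetical usage fragment; the constructor signatures and method casing are assumptions based on the interface tables above, and g is an RTG assumed to be in scope.

    // The concrete automaton class determines the filtering technique applied.
    AbstractDFRTA automaton = new DFRTAFilterIndexSymbol();
    AbstractItemSetProvider provider = new ProviderProperS(g);  // item set for g
    new DTAGenerator().generateAutomaton(g, automaton, provider);
    // automaton is now filled with states and transitions for grammar g.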

C.5.3.3 Tree acceptance & parsing Two of the application areas of tree automata are the tree acceptance and tree parsing problems. This section defines classes that solve these two problems based on a tree automaton. There are three acceptor classes defined that solve the acceptance problem with a specific type of automaton (NFRTA, NRFTA and DFRTA) and one parser class that solves the parsing with a provided DFRTA.

NFRAcceptor class The NFRAcceptor class implements an algorithm that determines whether an NFRTA accepts a certain input tree. The class realizes this by implementing the algorithm described in [Cle07, Algorithm 6.5.2].

NFRAcceptor class
Accept(NFRTA ta, Tree t) : Boolean
    Returns whether tree t is accepted by NFRTA ta.

NRFAcceptor class The NRFAcceptor class implements an algorithm that determines whether an NRFTA accepts a certain input tree. The class realizes this by implementing the algorithm described in [Cle07, Algorithm 6.4.2].

NRFAcceptor class
Accept(NRFTA ta, Tree t) : Boolean
    Returns whether tree t is accepted by NRFTA ta.

DFRAcceptor class The DFRAcceptor class implements an algorithm that determines whether a DFRTA accepts a certain input tree. The class realizes this by implementing the algorithm described in [Cle07, Algorithm 6.5.4].

DFRAcceptor class
Accept(DFRTA ta, Tree t) : Boolean
    Returns whether tree t is accepted by DFRTA ta.

OptimalCostParser class The OptimalCostParser class implements a parser that parses a tree for an input grammar by using a DFRTA. This parsing algorithm delivers the minimal cost parse (based on the costs specified for each production rule) for that tree, if a parse exists. The parsing algorithm internally constructs a DFRTA from the grammar and uses this automaton to compute which subtrees from the production rules are matched in which node. Comparing these subtrees with the complete right hand side of the rules indicates which rules can be applied. These matching production rules are then stored (with the lhs nonterminal as key) in the annotation dictionary of the node. After the parsing action each node will contain, for each nonterminal, a sequence of production rules (caused by chain rules) that describes which rules have to be applied such that the node matches this nonterminal. This is the interface of the OptimalCostParser class.

OptimalCostParser class
Create(RegularTreeGrammar g)
    Constructs a parser that can parse trees that are part of the language generated by g.
Parse(Tree t)
    Parses the tree based on an automaton constructed from the tree grammar and stores in each node which rules have to be applied to get the subtree rooted at this node.
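A hypothetical usage fragment; the annotation accessor shown at the end is an assumption, since the text only states that rules are stored per nonterminal in each node's annotation dictionary.

    // g : RegularTreeGrammar, t : Tree, startSymbol : Symbol (assumed in scope)
    OptimalCostParser parser = new OptimalCostParser(g);
    parser.parse(t);
    // After parsing, each node carries, per nonterminal, the rule sequence of
    // a minimal-cost derivation; reading it for the root (assumed API):
    List<ProductionRule> derivation = t.getRoot().getAnnotations().get(startSymbol);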

D FIREWood file format

This appendix discusses the FIREWood file format for defining trees, tree grammars etc. The goal of this file format is to provide an easy way to define input structures for the FIREWood application. The file format is based on the INI format. We start by defining a def-list. This is a list of user defined structures. Each such definition will be called a def-item. There are five possible structures that can be defined:

• alphabets
• trees
• tree grammars
• tree patterns
• tree pattern collections

This resulted in the following EBNF-grammar for defining different structures:

def-list = def-item{def-item}

def-item = def-alphabet | def-tree | def-grammar | def-pattern | def-patterncollection

string = character{character}

character = a..z,A..Z

The definitions of the five different concepts have a common general shape, based on the general shape of an INI definition. Each definition of a concept consists of a name between brackets (the reference name of the concept) and a list of variable definitions, where one of the variable definitions specifies the type of the concept. This is, for instance, the general shape of a tree definition:

[mytree]
type=Tree
....=....
etc.


Each type of concept will contain a special list of variable definitions, next to the type vari- able definition, that need to be present. A tree for instance needs to define its corresponding alphabet and its structure. See for instance the example specification in Section D.5. The next sections will discuss a format for each of the five different types of structures.

D.1 Alphabets

This section describes an alphabet definition. There were two possibilities for this definition. One could define one default alphabet for the complete file that every tree, tree grammar etc. uses, or one could offer the possibility to define multiple alphabets. The latter option was chosen, to provide the possibility of defining structures with different alphabets in a single file. The list of variable definitions for the alphabet contains two elements: the type definition and the alphabet itself. The type must be set to Alphabet and the alphabet itself is defined by a comma separated sequence of symbols with an optional rank (separated by the ':' symbol), where the symbols of the alphabet are strings of the characters a to z. Symbols with no rank are stored by ForestFIRE as unranked symbols. This is for instance used for nonterminals and variables. This is the EBNF grammar for this definition:

def-alphabet = "["string"]" type=Alphabet symbols=def-alphabet-list

def-alphabet-list = "{"def-symbol {,def-symbol}"}"

def-symbol = string [def-rank]

def-rank = ":" number

number = digit {digit}

digit = 0..9

This is an example definition of an alphabet with name alphabetx:

[alphabetx]
type = Alphabet
symbols = {w:3, r:2, s:1, t:0, u:0}


D.2 Trees

Trees are defined by an alphabet and a definition of the tree structure itself. The alphabet is defined by a string that points to the defined alphabet with the same name. The tree structure is described in a prefix notation, where each node symbol is placed before a sequence of child nodes. This is the grammar of the tree definition, together with an example:

def-tree = "["string"]" type=Tree alphabet=string structure=def-tree-structure

def-tree-structure = string | string "("def-tree-children")"

def-tree-children = def-tree-structure {,def-tree-structure}

This is an example definition of a tree with name treex:

[treex]
type = Tree
alphabet = alphabetx
structure = w(r(t,u),s(u),s(s(t)))
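Reading such a prefix notation back into a tree amounts to a small recursive descent over def-tree-structure. The following sketch illustrates this; the Node helper type is hypothetical and error handling for malformed input is omitted.

    import java.util.ArrayList;
    import java.util.List;

    // Minimal recursive-descent sketch for def-tree-structure.
    class StructureParser {
        static class Node {
            final String symbol;
            final List<Node> children = new ArrayList<>();
            Node(String symbol) { this.symbol = symbol; }
        }

        private final String input;
        private int pos;

        StructureParser(String text) { this.input = text.replaceAll("\\s", ""); }

        Node parse() {
            StringBuilder symbol = new StringBuilder();
            while (pos < input.length() && Character.isLetter(input.charAt(pos)))
                symbol.append(input.charAt(pos++));
            Node node = new Node(symbol.toString());
            if (pos < input.length() && input.charAt(pos) == '(') {
                pos++;                              // consume '('
                node.children.add(parse());
                while (input.charAt(pos) == ',') {  // further children
                    pos++;
                    node.children.add(parse());
                }
                pos++;                              // consume ')'
            }
            return node;
        }
    }

For the example above, new StructureParser("w(r(t,u),s(u),s(s(t)))").parse() yields a root node w with three children.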

D.3 Tree grammars

Grammars are defined by a terminal alphabet, a nonterminal alphabet, a start nonterminal and a set of production rules. The terminal alphabet and nonterminal alphabet are defined in the same way as the alphabet of a tree, by a string that points to a predefined alphabet. There is no separate part of the grammar definition that defines the start symbol. Instead the start symbol is defined as the first symbol in the nonterminal alphabet. This approach shortens the definition of a grammar. The production rule list description is a bit more complicated. The rules are divided by semicolons and each rule is described by a string that represents the lhs, followed by a ':' and a rhs tree structure as used in the tree definition. It is also possible to define a cost for each rule. This is done by putting an integer value after a "#" behind the production rule. This is the grammar of the tree grammar definition, together with an example tree grammar definition:


def-grammar = "["string"]" type=Grammar terminal-alphabet=string nonterminal-alphabet=string rules=def-productions

def-productions = "{"def-prule {; def-prule}"}"

def-prule = string ":" def-tree-structure [def-cost]

def-cost = "#" number

This is an example definition of a tree grammar with name grammarx, together with a nonterminal alphabet called alphabety:

[alphabety]
type = Alphabet
symbols = {S, X, Y}

[grammarx]
type = Grammar
terminal-alphabet = alphabetx
nonterminal-alphabet = alphabety
rules = {S: w(X,Y,Y) # 1; X: r(t,X) # 3; X: u # 0; Y: s(Y) # 1; Y: t # 2; Y: X # 1}

D.4 Tree patterns and pattern collections

Let us start with the definition of tree patterns. The tree pattern definition is almost the same as the normal tree definition. There is only one difference: the pattern definition also contains an alphabet of variables. This alphabet is described just like a standard alphabet of a tree, by pointing to the desired alphabet definition. This is the grammar of the tree pattern definition together with an example of such a definition:

def-pattern = "["string"]" type=Pattern terminal-alphabet=string variable-alphabet=string structure=def-tree-structure

This is an example definition of a tree pattern with name patternx, together with a variable alphabet called alphabetz:


[alphabetz]
type = Alphabet
symbols = {v, w}

[patternx]
type = Pattern
terminal-alphabet = alphabetx
variable-alphabet = alphabetz
structure = r(v,s(w))

Finally the pattern collection definition can be described. Such a collection is defined by the type 'PatternCollection' and a comma separated list of pattern variable names. These patterns should be defined in the same file before the definition of the pattern collection. This is the grammar of this definition together with an example:

def-patterncollection = "["string"]" type=PatternCollection patterns=def-pattern-list

def-pattern-list = "{"string {, string}"}"

This is an example definition of a pattern collection with name patterncolx, with two imaginary names for the patterns:

[patterncolx]
type = PatternCollection
patterns = {patternx, patterny}

D.5 Example

This section shows the definition file for the grammar described in Example 1.1.3:

[alphabetT]
type=Alphabet
symbols={a:2, b:1, c:0, d:0}

[alphabetNT]
type=Alphabet
symbols={S, B}

[mytree]
type=Tree
alphabet=alphabetT
structure = a(b(c), b(b(d)))

[mygrammar]


type=Grammar
terminal-alphabet=alphabetT
nonterminal-alphabet=alphabetNT
rules={S: a(B,d); S: a(b(c),B); S: c; B: b(B); B: S; B: d}

E Tree automaton construction – results

This appendix contains all measurement data collected during the automaton construction experiments. The results are divided over the different tree grammars. A table of measurements is presented for each of these grammars.


Table E.1: Example 6.0.3 – Basic Statistics

                                            # States  # Transitions  Time (ms)  Memory (bytes)  Memory (KB)
εNFRTA - All-Sub                                   8             12       0.07            6344          6.2
NFRTA - All-Sub                                    8             14       0.07            4472          4.4
NFRTA - ProperN                                    5             11       0.07            3512          3.4
NFRTA - ProperS                                    5             11       0.07            3512          3.4
εNRFTA - All-Sub                                   8             12       0.08            5400          5.3
NRFTA - All-Sub                                    8             14       0.09            6472          6.3
NRFTA - ProperN                                    5             11       0.06            4584          4.5
NRFTA - ProperS                                    5             11       0.06            4584          4.5
DFRTA - All-Sub                                    8             74       0.73           17160         16.8
DFRTA - ProperN                                    6             44       0.35            8776          8.6
DFRTA - ProperS                                    6             44        0.3            8776          8.6
DFRTA - SubTree Filtering - All-Sub                8             32        0.4           18048         17.6
DFRTA - SubTree Filtering - ProperN                6             32       0.25            9824          9.6
DFRTA - SubTree Filtering - ProperS                6             32       0.24            9824          9.6
DFRTA - Index Filtering - All-Sub                  8             18       0.73           18816         18.4
DFRTA - Index Filtering - ProperN                  6             18       0.22           10496         10.3
DFRTA - Index Filtering - ProperS                  6             18       0.21           10496         10.3
DFRTA - Symbol Filtering - All-Sub                 8             21       0.39           19768         19.3
DFRTA - Symbol Filtering - ProperN                 6             21       0.22           11208         10.9
DFRTA - Symbol Filtering - ProperS                 6             21       0.22           11208         10.9
DFRTA - Symbol & Index Filtering - All-Sub         8             14       0.43           21048         20.6
DFRTA - Symbol & Index Filtering - ProperN         6             14        0.2           12312         12.0
DFRTA - Symbol & Index Filtering - ProperS         6             14        0.2           12312         12.0

Table E.2: Example 6.0.3 – Table Statistics

                                            # R-Tables  # R-Entries  # φ-Tables  # φ-Entries
εNFRTA - All-Sub                                     -            -           -            -
NFRTA - All-Sub                                      -            -           -            -
NFRTA - ProperN                                      -            -           -            -
NFRTA - ProperS                                      -            -           -            -
εNRFTA - All-Sub                                     -            -           -            -
NRFTA - All-Sub                                      -            -           -            -
NRFTA - ProperN                                      -            -           -            -
NRFTA - ProperS                                      -            -           -            -
DFRTA - All-Sub                                      -            -           -            -
DFRTA - ProperN                                      -            -           -            -
DFRTA - ProperS                                      -            -           -            -
DFRTA - SubTree Filtering - All-Sub                  1            5           1            8
DFRTA - SubTree Filtering - ProperN                  1            5           1            6
DFRTA - SubTree Filtering - ProperS                  1            5           1            6
DFRTA - Index Filtering - All-Sub                    2            7           2           16
DFRTA - Index Filtering - ProperN                    2            7           2           12
DFRTA - Index Filtering - ProperS                    2            7           2           12
DFRTA - Symbol Filtering - All-Sub                   2            7           2           16
DFRTA - Symbol Filtering - ProperN                   2            7           2           12
DFRTA - Symbol Filtering - ProperS                   2            7           2           12
DFRTA - Symbol & Index Filtering - All-Sub           3            9           3           24
DFRTA - Symbol & Index Filtering - ProperN           3            9           3           18
DFRTA - Symbol & Index Filtering - ProperS           3            9           3           18


Table E.3: Sample 4 – Basic Statistics

                                            # States  # Transitions  Time (ms)  Memory (bytes)  Memory (KB)
εNFRTA - All-Sub                                  13             20       0.29           10000          9.8
NFRTA - All-Sub                                   13             29       0.14            6976          6.8
NFRTA - ProperN                                    6             22       0.08            4696          4.6
NFRTA - ProperS                                    6             22       0.07            4696          4.6
εNRFTA - All-Sub                                  13             20       0.11            9048          8.8
NRFTA - All-Sub                                   13             29        0.1           12856         12.6
NRFTA - ProperN                                    6             22       0.07            8424          8.2
NRFTA - ProperS                                    6             22       0.07            8424          8.2
DFRTA - All-Sub                                    9            183       0.84           24728         24.1
DFRTA - ProperN                                    7            115       0.56           10144          9.9
DFRTA - ProperS                                    7            115       0.49           10144          9.9
DFRTA - SubTree Filtering - All-Sub                9             87       0.57           25384         24.8
DFRTA - SubTree Filtering - ProperN                7             87       0.44           11408         11.1
DFRTA - SubTree Filtering - ProperS                7             87       0.43           11408         11.1
DFRTA - Index Filtering - All-Sub                  9             43        0.5           25848         25.2
DFRTA - Index Filtering - ProperN                  7             43       0.33           11776         11.5
DFRTA - Index Filtering - ProperS                  7             43       0.33           11776         11.5
DFRTA - Symbol Filtering - All-Sub                 9             32       0.51           28296         27.6
DFRTA - Symbol Filtering - ProperN                 7             32       0.32           13888         13.6
DFRTA - Symbol Filtering - ProperS                 7             32       0.32           13888         13.6
DFRTA - Symbol & Index Filtering - All-Sub         9             17       0.48           30496         29.8
DFRTA - Symbol & Index Filtering - ProperN         7             17       0.28           15736         15.4
DFRTA - Symbol & Index Filtering - ProperS         7             17       0.27           15736         15.4

Table E.4: Sample 4 – Table Statistics

                                            # R-Tables  # R-Entries  # φ-Tables  # φ-Entries
εNFRTA - All-Sub                                     -            -           -            -
NFRTA - All-Sub                                      -            -           -            -
NFRTA - ProperN                                      -            -           -            -
NFRTA - ProperS                                      -            -           -            -
εNRFTA - All-Sub                                     -            -           -            -
NRFTA - All-Sub                                      -            -           -            -
NRFTA - ProperN                                      -            -           -            -
NRFTA - ProperS                                      -            -           -            -
DFRTA - All-Sub                                      -            -           -            -
DFRTA - ProperN                                      -            -           -            -
DFRTA - ProperS                                      -            -           -            -
DFRTA - SubTree Filtering - All-Sub                  1            6           1            9
DFRTA - SubTree Filtering - ProperN                  1            6           1            7
DFRTA - SubTree Filtering - ProperS                  1            6           1            7
DFRTA - Index Filtering - All-Sub                    2            8           2           18
DFRTA - Index Filtering - ProperN                    2            8           2           14
DFRTA - Index Filtering - ProperS                    2            8           2           14
DFRTA - Symbol Filtering - All-Sub                   4           11           4           36
DFRTA - Symbol Filtering - ProperN                   4           11           4           28
DFRTA - Symbol Filtering - ProperS                   4           11           4           28
DFRTA - Symbol & Index Filtering - All-Sub           6           13           6           54
DFRTA - Symbol & Index Filtering - ProperN           6           13           6           42
DFRTA - Symbol & Index Filtering - ProperS           6           13           6           42


Table E.5: Sample 5 – Basic Statistics

                                            # States  # Transitions  Time (ms)  Memory (bytes)  Memory (KB)
εNFRTA - All-Sub                                  12             18       0.27            9960          9.7
NFRTA - All-Sub                                   12             22       0.18            6936          6.8
NFRTA - ProperN                                    6             16       0.07            4984          4.9
NFRTA - ProperS                                    6             16       0.06            4984          4.9
εNRFTA - All-Sub                                  12             18       0.12            8128          7.9
NRFTA - All-Sub                                   12             22       0.09            9552          9.3
NRFTA - ProperN                                    6             16       0.06            5736          5.6
NRFTA - ProperS                                    6             16       0.06            5736          5.6
DFRTA - All-Sub                                    9            173       0.84           26392         25.8
DFRTA - ProperN                                    7            107       0.53            9976          9.7
DFRTA - ProperS                                    7            107       0.48            9976          9.7
DFRTA - SubTree Filtering - All-Sub                9             80       0.56           27048         26.4
DFRTA - SubTree Filtering - ProperN                7             80       0.44           11240         11.0
DFRTA - SubTree Filtering - ProperS                7             80       0.43           11240         11.0
DFRTA - Index Filtering - All-Sub                  9             32        0.5           27640         27.0
DFRTA - Index Filtering - ProperN                  7             32        0.3           11736         11.5
DFRTA - Index Filtering - ProperS                  7             32        0.3           11736         11.5
DFRTA - Symbol Filtering - All-Sub                 9             38       0.51           29456         28.8
DFRTA - Symbol Filtering - ProperN                 7             38       0.35           13264         13.0
DFRTA - Symbol Filtering - ProperS                 7             38       0.34           13264         13.0
DFRTA - Symbol & Index Filtering - All-Sub         9             18       0.48           31568         30.8
DFRTA - Symbol & Index Filtering - ProperN         7             18       0.28           15104         14.8
DFRTA - Symbol & Index Filtering - ProperS         7             18       0.27           15104         14.8

Table E.6: Sample 5 – Table Statistics

                                            # R-Tables  # R-Entries  # φ-Tables  # φ-Entries
εNFRTA - All-Sub                                     -            -           -            -
NFRTA - All-Sub                                      -            -           -            -
NFRTA - ProperN                                      -            -           -            -
NFRTA - ProperS                                      -            -           -            -
εNRFTA - All-Sub                                     -            -           -            -
NRFTA - All-Sub                                      -            -           -            -
NRFTA - ProperN                                      -            -           -            -
NRFTA - ProperS                                      -            -           -            -
DFRTA - All-Sub                                      -            -           -            -
DFRTA - ProperN                                      -            -           -            -
DFRTA - ProperS                                      -            -           -            -
DFRTA - SubTree Filtering - All-Sub                  1            6           1            9
DFRTA - SubTree Filtering - ProperN                  1            6           1            7
DFRTA - SubTree Filtering - ProperS                  1            6           1            7
DFRTA - Index Filtering - All-Sub                    2            8           2           18
DFRTA - Index Filtering - ProperN                    2            8           2           14
DFRTA - Index Filtering - ProperS                    2            8           2           14
DFRTA - Symbol Filtering - All-Sub                   3           10           3           27
DFRTA - Symbol Filtering - ProperN                   3           10           3           21
DFRTA - Symbol Filtering - ProperS                   3           10           3           21
DFRTA - Symbol & Index Filtering - All-Sub           5           13           5           45
DFRTA - Symbol & Index Filtering - ProperN           5           13           5           35
DFRTA - Symbol & Index Filtering - ProperS           5           13           5           35


Table E.7: ten Eikelder 68000 – Basic Statistics

                                            # States  # Transitions  Time (ms)  Memory (bytes)  Memory (KB)
εNFRTA - All-Sub                                  36             66       0.37           32864         32.1
NFRTA - All-Sub                                   36             65       0.25           23536         23.0
NFRTA - ProperN                                    9             38       0.12           14584         14.2
NFRTA - ProperS                                    9             38       0.12           14584         14.2
εNRFTA - All-Sub                                  36             66        0.3           25064         24.5
NRFTA - All-Sub                                   36             65       0.24           27856         27.2
NRFTA - ProperN                                    9             38       0.12           10504         10.3
NRFTA - ProperS                                    9             38       0.12           10504         10.3
DFRTA - All-Sub                                   39           9208      62.11          171848        167.8
DFRTA - ProperN                                   10            624       3.15           24536         24.0
DFRTA - ProperS                                   10            624       3.12           24536         24.0
DFRTA - SubTree Filtering - All-Sub               39            508       3.49          115776        113.1
DFRTA - SubTree Filtering - ProperN               10            508       2.79           26184         25.6
DFRTA - SubTree Filtering - ProperS               10            508       2.81           26184         25.6
DFRTA - Index Filtering - All-Sub                 39            354       2.74          117120        114.4
DFRTA - Index Filtering - ProperN                 10            354       2.12           26136         25.5
DFRTA - Index Filtering - ProperS                 10            354       2.11           26136         25.5
DFRTA - Symbol Filtering - All-Sub                39            153       2.02          131718        128.6
DFRTA - Symbol Filtering - ProperN                10            153       1.47           33080         32.3
DFRTA - Symbol Filtering - ProperS                10            153        1.5           33080         32.3
DFRTA - Symbol & Index Filtering - All-Sub        39             89       1.86          149960        146.4
DFRTA - Symbol & Index Filtering - ProperN        10             89       1.18           40648         39.7
DFRTA - Symbol & Index Filtering - ProperS        10             89       1.16           40648         39.7

Table E.8: ten Eikelder 68000 – Table Statistics

                                            # R-Tables  # R-Entries  # φ-Tables  # φ-Entries
εNFRTA - All-Sub                                     -            -           -            -
NFRTA - All-Sub                                      -            -           -            -
NFRTA - ProperN                                      -            -           -            -
NFRTA - ProperS                                      -            -           -            -
εNRFTA - All-Sub                                     -            -           -            -
NRFTA - All-Sub                                      -            -           -            -
NRFTA - ProperN                                      -            -           -            -
NRFTA - ProperS                                      -            -           -            -
DFRTA - All-Sub                                      -            -           -            -
DFRTA - ProperN                                      -            -           -            -
DFRTA - ProperS                                      -            -           -            -
DFRTA - SubTree Filtering - All-Sub                  1            9           1           39
DFRTA - SubTree Filtering - ProperN                  1            9           1           10
DFRTA - SubTree Filtering - ProperS                  1            9           1           10
DFRTA - Index Filtering - All-Sub                    2           15           2           78
DFRTA - Index Filtering - ProperN                    2           15           2           20
DFRTA - Index Filtering - ProperS                    2           15           2           20
DFRTA - Symbol Filtering - All-Sub                   8           33           8          312
DFRTA - Symbol Filtering - ProperN                   8           33           8           80
DFRTA - Symbol Filtering - ProperS                   8           33           8           80
DFRTA - Symbol & Index Filtering - All-Sub          14           49          14          546
DFRTA - Symbol & Index Filtering - ProperN          14           49          14          140
DFRTA - Symbol & Index Filtering - ProperS          14           49          14          140


Table E.9: Mono X86 – Basic Statistics

                                            # States  # Transitions  Time (ms)  Memory (bytes)  Memory (MB)
εNFRTA - All-Sub                                 532           1029       5.97          528848          0.5
NFRTA - All-Sub                                  532           1205       4.98          381248          0.4
NFRTA - ProperN                                   63            736       1.32          225056          0.2
NFRTA - ProperS                                   63            736       1.32          225056          0.2
εNRFTA - All-Sub                                 532           1029       7.54          391040          0.4
NRFTA - All-Sub                                  532           1205       4.97          508560          0.5
NRFTA - ProperN                                   63            736       1.33          207880          0.2
NRFTA - ProperS                                   63            736       1.34          207880          0.2
DFRTA - All-Sub                                  557       24907955     273009       152940456        145.9
DFRTA - ProperN                                   65         348299    1714.59         2535240          2.4
DFRTA - ProperS                                   65         348299    1701.26         2535240          2.4
DFRTA - SubTree Filtering - All-Sub              557         337821    7661.07         3833328          3.7
DFRTA - SubTree Filtering - ProperN               65         337821    4166.51         2510232          2.4
DFRTA - SubTree Filtering - ProperS               65         337821    4159.49         2510232          2.4
DFRTA - Index Filtering - All-Sub                557         160651    3157.93         2688280          2.6
DFRTA - Index Filtering - ProperN                 65         160651    1343.52         1341568          1.3
DFRTA - Index Filtering - ProperS                 65         160651    1333.87         1341568          1.3
DFRTA - Symbol Filtering - All-Sub               557           2097     321.78         6099712          5.8
DFRTA - Symbol Filtering - ProperN                65           2097      67.58          931144          0.9
DFRTA - Symbol Filtering - ProperS                65           2097      67.53          931144          0.9
DFRTA - Symbol & Index Filtering - All-Sub       557           1207     381.24        11805000         11.3
DFRTA - Symbol & Index Filtering - ProperN        65           1207      68.53         1637712          1.6
DFRTA - Symbol & Index Filtering - ProperS        65           1207      68.15         1637712          1.6

Table E.10: Mono X86 – Table Statistics

                                            # R-Tables  # R-Entries  # φ-Tables  # φ-Entries
εNFRTA - All-Sub                                     -            -           -            -
NFRTA - All-Sub                                      -            -           -            -
NFRTA - ProperN                                      -            -           -            -
NFRTA - ProperS                                      -            -           -            -
εNRFTA - All-Sub                                     -            -           -            -
NRFTA - All-Sub                                      -            -           -            -
NRFTA - ProperN                                      -            -           -            -
NRFTA - ProperS                                      -            -           -            -
DFRTA - All-Sub                                      -            -           -            -
DFRTA - ProperN                                      -            -           -            -
DFRTA - ProperS                                      -            -           -            -
DFRTA - SubTree Filtering - All-Sub                  1           64           1          557
DFRTA - SubTree Filtering - ProperN                  1           64           1           65
DFRTA - SubTree Filtering - ProperS                  1           64           1           65
DFRTA - Index Filtering - All-Sub                    2           88           2         1114
DFRTA - Index Filtering - ProperN                    2           88           2          130
DFRTA - Index Filtering - ProperS                    2           88           2          130
DFRTA - Symbol Filtering - All-Sub                 238          722         238       132566
DFRTA - Symbol Filtering - ProperN                 238          722         238        15470
DFRTA - Symbol Filtering - ProperS                 238          722         238        15470
DFRTA - Symbol & Index Filtering - All-Sub         318          872         318       177126
DFRTA - Symbol & Index Filtering - ProperN         318          872         318        20670
DFRTA - Symbol & Index Filtering - ProperS         318          872         318        20670


Table E.11: Mono IA64 – Basic Statistics

                                            # States  # Transitions  Time (ms)  Memory (bytes)  Memory (MB)
εNFRTA - All-Sub                                 441            865       4.13          447136          0.4
NFRTA - All-Sub                                  441           1078       3.45          319984          0.3
NFRTA - ProperN                                   40            677       0.91          186880          0.2
NFRTA - ProperS                                   40            677       0.91          186880          0.2
εNRFTA - All-Sub                                 441            865       5.69          330025          0.3
NRFTA - All-Sub                                  441           1078       3.37          456080          0.4
NRFTA - ProperN                                   44            677       0.91          199768          0.2
NRFTA - ProperS                                   44            677       0.91          199768          0.2
DFRTA - All-Sub                                  438       14075158     203738        86757368         82.7
DFRTA - ProperN                                   42         135562     622.21         1060696          1.0
DFRTA - ProperS                                   42         135562     621.49         1060696          1.0
DFRTA - SubTree Filtering - All-Sub              438         129342    4127.71         2073816          2.0
DFRTA - SubTree Filtering - ProperN               42         129342    1354.73         1051240          1.0
DFRTA - SubTree Filtering - ProperS               42         129342    1351.51         1051240          1.0
DFRTA - Index Filtering - All-Sub                438          54706    1659.59         1612144          1.5
DFRTA - Index Filtering - ProperN                 42          54706     393.34          570560          0.5
DFRTA - Index Filtering - ProperS                 42          54706     393.02          570560          0.5
DFRTA - Symbol Filtering - All-Sub               438           1351     193.08         4721856          4.5
DFRTA - Symbol Filtering - ProperN                42           1351      26.92          642160          0.6
DFRTA - Symbol Filtering - ProperS                42           1351      26.73          642160          0.6
DFRTA - Symbol & Index Filtering - All-Sub       438            915     266.39         9121960          8.7
DFRTA - Symbol & Index Filtering - ProperN        42            915      29.59         1104440          1.1
DFRTA - Symbol & Index Filtering - ProperS        42            915      29.41         1104440          1.1

Table E.12: Mono IA64 – Table Statistics

                                            # R-Tables  # R-Entries  # φ-Tables  # φ-Entries
εNFRTA - All-Sub                                     -            -           -            -
NFRTA - All-Sub                                      -            -           -            -
NFRTA - ProperN                                      -            -           -            -
NFRTA - ProperS                                      -            -           -            -
εNRFTA - All-Sub                                     -            -           -            -
NRFTA - All-Sub                                      -            -           -            -
NRFTA - ProperN                                      -            -           -            -
NRFTA - ProperS                                      -            -           -            -
DFRTA - All-Sub                                      -            -           -            -
DFRTA - ProperN                                      -            -           -            -
DFRTA - ProperS                                      -            -           -            -
DFRTA - SubTree Filtering - All-Sub                  1           41           1          438
DFRTA - SubTree Filtering - ProperN                  1           41           1           42
DFRTA - SubTree Filtering - ProperS                  1           41           1           42
DFRTA - Index Filtering - All-Sub                    2           56           2          876
DFRTA - Index Filtering - ProperN                    2           56           2           84
DFRTA - Index Filtering - ProperS                    2           56           2           84
DFRTA - Symbol Filtering - All-Sub                 234          643         234       102492
DFRTA - Symbol Filtering - ProperN                 234          643         234         9828
DFRTA - Symbol Filtering - ProperS                 234          643         234         9828
DFRTA - Symbol & Index Filtering - All-Sub         307          776         307       134466
DFRTA - Symbol & Index Filtering - ProperN         307          776         307        12894
DFRTA - Symbol & Index Filtering - ProperS         307          776         307        12894


Table E.13: Mono Sparc – Basic Statistics

                                            # States  # Transitions  Time (ms)  Memory (bytes)  Memory (MB)
εNFRTA - All-Sub                                 491            967       5.05          494104          0.5
NFRTA - All-Sub                                  491           1145       4.26          352840          0.3
NFRTA - ProperN                                   51            705       1.08          206040          0.2
NFRTA - ProperS                                   51            705       1.09          206040          0.2
εNRFTA - All-Sub                                 491            967       6.77          364456          0.3
NRFTA - All-Sub                                  491           1145       4.25          482160          0.5
NRFTA - ProperN                                   51            705       1.07          200064          0.2
NRFTA - ProperS                                   51            705       1.07          200064          0.2
DFRTA - All-Sub                                  487       18342396     188725       112685992        107.5
DFRTA - ProperN                                   53         225066     965.44         1653248          1.6
DFRTA - ProperS                                   53         225066     957.41         1653248          1.6
DFRTA - SubTree Filtering - All-Sub              487         208720    4371.86         2756664          2.6
DFRTA - SubTree Filtering - ProperN               53         208720    2815.37         1578624          1.5
DFRTA - SubTree Filtering - ProperS               53         208720    2810.04         1578624          1.5
DFRTA - Index Filtering - All-Sub                487          97543    1726.73         2049936          2.0
DFRTA - Index Filtering - ProperN                 53          97543     832.37          851064          0.8
DFRTA - Index Filtering - ProperS                 53          97543     829.99          851064          0.8
DFRTA - Symbol Filtering - All-Sub               487           1502     230.77         5395448          5.1
DFRTA - Symbol Filtering - ProperN                53           1502      34.39          783600          0.7
DFRTA - Symbol Filtering - ProperS                53           1502      34.06          783600          0.7
DFRTA - Symbol & Index Filtering - All-Sub       487           1001     314.37        10456104         10.0
DFRTA - Symbol & Index Filtering - ProperN        53           1001      37.35         1375792          1.3
DFRTA - Symbol & Index Filtering - ProperS        53           1001      37.15         1375792          1.3

Table E.14: Mono Sparc – Table Statistics

                                            # R-Tables  # R-Entries  # φ-Tables  # φ-Entries
εNFRTA - All-Sub                                     -            -           -            -
NFRTA - All-Sub                                      -            -           -            -
NFRTA - ProperN                                      -            -           -            -
NFRTA - ProperS                                      -            -           -            -
εNRFTA - All-Sub                                     -            -           -            -
NRFTA - All-Sub                                      -            -           -            -
NRFTA - ProperN                                      -            -           -            -
NRFTA - ProperS                                      -            -           -            -
DFRTA - All-Sub                                      -            -           -            -
DFRTA - ProperN                                      -            -           -            -
DFRTA - ProperS                                      -            -           -            -
DFRTA - SubTree Filtering - All-Sub                  1           51           1          487
DFRTA - SubTree Filtering - ProperN                  1           51           1           53
DFRTA - SubTree Filtering - ProperS                  1           51           1           53
DFRTA - Index Filtering - All-Sub                    2           71           2          974
DFRTA - Index Filtering - ProperN                    2           71           2          106
DFRTA - Index Filtering - ProperS                    2           71           2          106
DFRTA - Symbol Filtering - All-Sub                 242          700         242       117854
DFRTA - Symbol Filtering - ProperN                 242          700         242        12826
DFRTA - Symbol Filtering - ProperS                 242          700         242        12826
DFRTA - Symbol & Index Filtering - All-Sub         319          841         319       155353
DFRTA - Symbol & Index Filtering - ProperN         319          841         319        16907
DFRTA - Symbol & Index Filtering - ProperS         319          841         319        16907

Bibliography

[AC75] Alfred V. Aho and Margaret J. Corasick. Efficient string matching: an aid to bibliographic search. Communications of the ACM, 18(6):333–340, 1975.

[AGT89] Alfred V. Aho, Mahadevan Ganapathi, and Steven W. K. Tjiang. Code generation using tree matching and dynamic programming. ACM Transactions on Programming Languages and Systems, 11(4):491–516, 1989.

[ALSU07] Alfred V. Aho, Monica S. Lam, Ravi Sethi, and Jeffrey D. Ullman. Compilers, principles, techniques, & tools. Pearson – Addison Wesley, second edition, 2007.

[Cle07] Loek G.W.A. Cleophas. Three Taxonomies of Tree Algorithms. PhD thesis, Technische Universiteit Eindhoven, May 2007. Draft Version.

[Dre] Frank Drewes. The TREEBAG Manual, 1.2 edition. http://www.informatik.uni-bremen.de/theorie/treebag/manual/manual.html.

[ESL89] H. Emmelmann, F.W. Schröer, and L. Landwehr. BEG: a generator for efficient back ends. In PLDI '89: Proceedings of the ACM SIGPLAN 1989 Conference on Programming Language Design and Implementation, pages 227–237, New York, NY, USA, 1989. ACM Press.

[FHP92a] Christopher W. Fraser, David R. Hanson, and Todd A. Proebsting. Engineering a simple, efficient code-generator generator. ACM Letters on Programming Languages and Systems, 1(3):213–226, 1992.

[FHP92b] Christopher W. Fraser, Robert R. Henry, and Todd A. Proebsting. BURG: fast optimal instruction selection and tree parsing. ACM SIGPLAN Notices, 27(4):68–76, 1992.

[Flo62] Robert W. Floyd. Algorithm 97: Shortest path. Communications of the ACM, 5(6):345, 1962.

[FSW94] Christian Ferdinand, Helmut Seidl, and Reinhard Wilhelm. Tree automata for code selection. Acta Informatica, 31(9):741–760, 1994.

[GT] Thomas Genet and Valérie Viet Triem Tong. Timbuk – A Tree Automata Library, 2.0 edition. http://www.irisa.fr/lande/genet/timbuk/Manual.pdf.

[GT01] Thomas Genet and Valérie Viet Triem Tong. Reachability analysis of term rewriting systems with Timbuk. In LPAR '01: Proceedings of the Artificial Intelligence on Logic for Programming, pages 695–706, London, UK, 2001. Springer-Verlag.


[HK89] C. Hemerik and J. P. Katoen. Bottom-up tree acceptors. Science of Computer Programming, 13(1):51–72, December 1989.

[iBU] iBurg. http://www.cs.princeton.edu/software/iburg/.

[laz] Lazarus. http://www.lazarus.freepascal.org/.

[Lin01] Peter Linz. An Introduction to Formal Languages and Automata. Jones and Bartlett Publishers, third edition, 2001.

[mon] Mono project. http://www.mono-project.com/.

[Pro95] Todd A. Proebsting. BURS automata generation. ACM Transactions on Programming Languages and Systems, 17(3):461–486, 1995.

[Ski98] Steven S. Skiena. The Algorithm Design Manual. Springer-Verlag New York, Inc., New York, NY, USA, 1998.

[tE89] H.M.M. ten Eikelder. A simple implementation of a bottom-up tree acceptor. Technical report, Technische Universiteit Eindhoven, 1989.

[vdBdJKO00] M. G. T. van den Brand, H. A. de Jong, P. Klint, and P. A. Olivier. Efficient annotated terms. Software-Practice and Experience, 30(3):259–291, 2000.

[War62] Stephen Warshall. A theorem on boolean matrices. J. ACM, 9(1):11–12, 1962.
