
Faculty of Informatics

Masaryk University


Syntactic analysis of natural languages based on context-free grammar backbone

PhD Thesis

Vladimír Kadlec

Brno, September 2007

Declaration

I hereby declare that this submission is my own work and that, to the best of my knowledge and belief, it contains no material previously published or written by another person nor material which to a substantial extent has been accepted for the award of any other degree or diploma of the university or other institute of higher learning, except where due acknowledgement has been made in the text.

Acknowledgements

I would like to express my gratitude to my supervising professor Karel Pala for his support and feedback during my research. Special thanks must go to Martin Rajman and Jean-Cédric Chappelier for their very valuable comments on this work, especially on Chapter 4. I would also like to thank Aleš Horák, who has been my closest colleague for almost 7 years.

Contents

1 Introduction
  1.1 Objectives of the Thesis
  1.2 Overview
  1.3 Terminology and Notation

2 Related Work
  2.1 Context-Free Parsing
    2.1.1 Tomita's generalized LR parser
    2.1.2 Chart parsers
    2.1.3 Head driven and head corner parsing
    2.1.4 Parser Comparison
  2.2 Robust Parsing
  2.3 Parsing of Czech
  2.4 Conclusions

3 Parsing System with Contextual Constraints
  3.1 System Overview
  3.2 Head-Driven dependent dot move
    3.2.1 Motivation
    3.2.2 HDddm Parsing
    3.2.3 Head-Driven Algorithm with One Dot in Items
    3.2.4 Simplified Items, No Dots Needed
    3.2.5 Data Structures
    3.2.6 Conclusions
  3.3 Optimizing Heads for Parsing
    3.3.1 Optimization Procedure
  3.4 Shared-Packed Forest filtering
    3.4.1 Filtering by rule levels
    3.4.2 Conclusions
  3.5 Contextual Constraints and Semantic Actions
    3.5.1 Representation of values
    3.5.2 Generation of a grammar with values
    3.5.3 Conclusions
  3.6 Verb Valences
    3.6.1 VerbaLex
    3.6.2 Filtering by Valences
  3.7 Conclusions

4 Robust Stochastic Parsing Using OMC
  4.1 Coverage
    4.1.1 Maximum coverage
    4.1.2 Optimal m-coverage
    4.1.3 Probability of a coverage
  4.2 Finding optimal m-coverage
  4.3 Gluing
    4.3.1 Gluing with new rules
    4.3.2 Gluing by means of mapping non-terminals
  4.4 Conclusions

5 Context-free Only Experiments
  5.1 Configuration of Experiments
  5.2 Context-free experiments
    5.2.1 Comparison of Implemented CF Parsing Algorithms
    5.2.2 Head-Driven dependent dot move variants
    5.2.3 Optimizing Heads for Parsing
    5.2.4 Comparison with Different Parsing Systems
    5.2.5 Conclusion
  5.3 Robust Parsing
    5.3.1 Implementation
    5.3.2 Experiments with English
    5.3.3 Speech recognition of Czech
  5.4 Conclusions

6 Implementation – synt Project
  6.1 Grammar
  6.2 Parser
  6.3 Experiments
    6.3.1 Real data test
    6.3.2 Verb Valences
  6.4 Comparison of dependency and phrasal parsers
    6.4.1 Compared Dependency Parsers
    6.4.2 Main Problems
    6.4.3 Comparison method
    6.4.4 Results
    6.4.5 Conclusions
  6.5 Front-ends
    6.5.1 Grammar Development Workbench
    6.5.2 WWW synt
  6.6 Conclusions

7 Conclusions and Future Research

Bibliography

Appendices

A Alternative Definitions of Maximum Coverage
  A.1 Definition of maximum coverage in terms of footage
  A.2 Definition of maximum coverage in terms of equivalence classes

B List of Publications

Chapter 1

Introduction

Syntactic analysis is a "corner-stone" of applications for the automated processing of texts in natural languages. Any such application, be it an automatic grammar checker or an information retrieval system, must be capable of understanding the structure of a sentence. Recognition of the sentence structure is called parsing. A formal theory of natural language syntax was first introduced by Noam Chomsky [Chomsky, 1955, Chomsky, 1957], who came up with the context-free and transformational phrase structure grammar formalisms and their comparison. Context-free parsing techniques are well suited to be incorporated into real-world natural language processing systems because of their time efficiency and low memory requirements. Although it is known that some natural language phenomena cannot be handled with the context-free grammar formalism, researchers often use the context-free backbone as the core of their grammar formalism and enhance it with context-sensitive feature structures (e.g. [Neidle, 1994]).

1.1 Objectives of the Thesis

The goal of this thesis is to design algorithms and methods for an effective and robust syntactic analysis of natural languages. The main contribution of the described work is a syntactic analyser for the Czech language. However, the presented algorithms are language-independent, so other languages with an appropriate grammar can be modelled as well.


The analysis of sentences by the described system is based on a context-free grammar for the given language. The reasons for the choice of this formalism are mainly historical. At the time our system was created, there was intensive research on a constituent grammar for Czech [Smrž and Horák, 1999], but no analyser fast and effective enough existed.

The internal representation of derivation trees allows us to apply contextual constraints, e.g. case agreement. In general, this is an NP-complete problem, so a polynomial algorithm working with weaker constraints than the general ones is described. The result of the constraint application is still maintained in a polynomial-size structure.

The evaluation of semantic actions and contextual constraints helps us to reduce the huge number of derivation trees, and we are also able to compute some new information which is not contained in the context-free part of the grammar. Also, the n best trees (according to a tree rank, e.g. probability) can be selected. This is an important feature for linguists developing a grammar by hand.

There are many NLP applications where it is difficult to create a grammar generating a sufficient subset of the processed language. We describe a robust extension of our parser that is able to return a "correct" derivation tree even if the grammar cannot generate the input sentence. Because the presented system is language-independent, results for English and French grammars are provided, as well as results for Czech.

1.2 Overview

The first chapter introduces the topics discussed in the next parts and also the basic terminology and notation. In the second chapter, a brief survey of related algorithms for syntactic analysis is presented. Current (robust) syntactic analysers for Czech are mentioned there as well. The main result of this work – a parsing system with contextual constraints – is discussed in Chapter Three, which is divided into several sections describing the individual modules of the system. A robust extension of the parsing system is presented in Chapter Four. The fifth chapter provides the results of experiments with the context-free and robust parts of the system. The description of the implementation – the synt project – is provided in Chapter Six. The last, seventh, chapter consists of conclusions and future directions.


1.3 Terminology and Notation

We will use the following terminology. The input grammar G is a quadruple G = ⟨N, Σ, P, S⟩, where N is a finite set of non-terminals, Σ is a finite set of terminals, P is a finite set of rules and S stands for the starting symbol of the grammar. Upper-case letters (A, B, etc.) will designate non-terminals, lower-case letters (i, j, k) will be used for natural numbers and Greek letters (α, β, …) for strings of symbols (terminals or non-terminals). The empty string is denoted by ε. The input sentence is a sequence of words wᵢ ∈ Σ¹, each of which corresponds to a terminal of the input grammar G. The input sentence may also be referred to as the "input string".

¹ wᵢ ∈ Σ applies only to known words. There is usually a special mechanism to handle words not present in the vocabulary.
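As a concrete illustration only, the quadruple ⟨N, Σ, P, S⟩ can be encoded directly, e.g. in Python. This encoding is our own and purely illustrative; it is not the representation used by the synt system.

    from typing import NamedTuple

    class Rule(NamedTuple):
        lhs: str                 # a non-terminal A
        rhs: tuple               # a string of symbols (terminals/non-terminals)

    class Grammar(NamedTuple):
        nonterminals: set        # N
        terminals: set           # Sigma
        rules: list              # P
        start: str               # S

    # A toy grammar: S -> A B, A -> a, B -> b, B -> epsilon
    G = Grammar(
        nonterminals={"S", "A", "B"},
        terminals={"a", "b"},
        rules=[Rule("S", ("A", "B")), Rule("A", ("a",)),
               Rule("B", ("b",)), Rule("B", ())],     # () encodes epsilon
        start="S",
    )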

Chapter 2

Related Work

In the first part of this chapter, selected algorithms for syntactic analysis are presented. Only algorithms related to this thesis are described here. In the second part, we focus on parsing of the Czech language.

2.1 Context-Free Parsing

Parsing algorithms for context-free (CF) grammars play a crucial role in the field of general parsing. Either their basic form is directly employed (usually for parsing a context-free backbone for grammars based on one of the recent formalisms), or an extension of a standard parsing algorithm is proposed that can deal with more complex features of the particular grammar form. Moreover, the position of CF parsing is further strengthened by the interest it has received from the language modelling community [Chelba and Jelinek, 1998]. Various methods have been designed for CF parsing and new variants and enhancements emerge every year. To enable efficient processing, parsing algorithms employ sophisticated structures to store intermediate parsing results. The most popular data structures for this purpose are a chart and a graph-structured stack. In the following sections, the most common algorithms for CF parsing are described. Note that the CF parsing algorithms used in our parsing system are described in Section 3.2.


2.1.1 Tomita's generalized LR parser

The generalized LR (GLR) parser was developed by Masaru Tomita [Tomita, 1986]. It extends the standard LR parser [Knuth, 1965, Aho and Ullman, 1972, Harisson, 1986, Aho et al., 1986] to deal with general CFGs. The algorithm is suitable for grammars which are not very ambiguous, where it is convenient to use techniques designed primarily for unambiguous grammars.

The analysis is very similar to the standard LR analysis. It uses a pre-computed parsing table containing two types of actions, shift and reduce. The parsing table determines the next action from a state, the top symbol on the stack and a prefix of the remainder of the input. For a general CFG there can be more than one action entry in the table; this corresponds to shift/reduce and reduce/reduce conflicts. The GLR parser uses a graph-structured stack to manage the conflicts in the parsing table. The stack is represented by a directed, acyclic, connected graph. An active vertex is a vertex on which an action (shift or reduce) can be carried out; it has no predecessor and it cannot be an error vertex (an invalid analysis). The active vertex of an LR stack would be the top of the stack. The parser stores the set of active vertices.

At the beginning, the graph-structured stack and the set of active vertices contain just one vertex, representing the initial state. The analysis then proceeds with the following steps. While the set of active vertices is not empty, any reducible active vertex is popped. If there is no such vertex (all active vertices contain the shift action), one word from the input is read. All shift actions are done in parallel for the active vertices. The shift action is the same as for the LR algorithm: a new predecessor of the given vertex is added and this predecessor is pushed into the set of active vertices. The reduce action differs more considerably from LR. If n symbols should be popped, starting with a given active vertex, then all successors at distance n from this vertex have to be found, where the distance is the number of symbols on the path. Then a new vertex is created (as a result of the shift or reduce action). This new vertex is created only if it is not in the set of active vertices already; otherwise, a new edge is added to the existing vertex.

Tomita proved [Tomita, 1986] that the number of vertices and the number of edges in the graph-structured stack are polynomial with respect to the number of words in the input. He also showed how to represent all the resulting derivation trees in a polynomial structure. That structure is called a packed shared forest.

The worst-case complexity of the GLR parser is O(n^(δ+1)), where δ is the length of the longest right-hand side of a rule [Johnson, 1989]. The original version of Tomita's algorithm is unable to deal with cyclic and hidden left-recursive grammars. A grammar is hidden left-recursive if there are A, B, α, β such that

• A ⇒* BαAβ,

• Bα ⇒* ε.

This problem was solved in different ways by Rekers [Rekers, 1992] in 1992 and by Nederhof and Sarbo [Nederhof and Sarbo, 1993] in 1993. Several optimizations are also suggested in [Heemels et al., 1991].
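The reduce step described above has to find all vertices at distance n from an active vertex. The following minimal Python sketch (the names and the representation are ours, purely illustrative) shows this search on a graph-structured stack:

    class Vertex:
        """A graph-structured stack vertex; `preds` are the vertices deeper
        in the stack (a plain LR stack would have exactly one of them)."""
        def __init__(self, label, preds=()):
            self.label = label              # an LR state or a grammar symbol
            self.preds = list(preds)

    def at_distance(vertex, n):
        """All vertices reachable from `vertex` along exactly n edges, i.e.
        the possible new stack tops after popping n symbols in a reduce."""
        frontier = {vertex}
        for _ in range(n):
            frontier = {p for v in frontier for p in v.preds}
        return frontier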

2.1.2 Chart parsers

The idea of chart parsing was introduced by Kay in [Kay, 1989a]. A chart parser stores the substructures already parsed to avoid recomputing the same results again. This approach is mostly used in combination with Earley's parsing algorithm [Earley, 1970]. The chart is a table of all viable strings. Dotted CF rules (items) are used to represent the state of the analysis. The dots mark the boundaries between the analyzed and non-analyzed parts of items. If the parsing procedure is unidirectional (left to right or vice versa), only one dot is needed.

General chart parser

A chart parser usually uses two sets – the chart and the agenda. These sets contain edges.¹ An edge is a triple [A → α•β, i, j], where i, j are natural numbers, 0 ≤ i ≤ j ≤ n, n is the number of input terminals and A → αβ ∈ P is a rule. The strings of symbols α and β can be empty. The dot symbol • can appear anywhere on the right-hand side of the rule. The edge is inactive if it is of the form [A → α•, i, j], i.e. the dot is to the right of the rightmost symbol of the rule. Otherwise the edge is active. The edge [A → •, i, j] is inactive (the epsilon rule). The chart (or agenda) can be viewed as an oriented graph, and that is why the word "edge" is used. There are n + 1 vertices for n input terminals, labelled 0, 1, …, n.

¹ Different definitions of edges are possible, e.g. for the head corner chart parser.

Considering the edge [A → α•β, i, j], the natural numbers i and j denote its starting and ending vertices, respectively. An edge is labelled by the rule containing the dot. The analysis proceeds from left to right through the given grammar rule. The numbers i, j delimit the fragment of the input sentence covered by the part of the rule to the left of the dot.

    program Chart parser
    begin
        init chart
        init agenda
        while agenda not empty do
            pop edge e from agenda
            for each edge f producible by combining e
                    and some other edge in chart do
                if f ∉ agenda and f ∉ chart and f ≠ e then
                    push f into agenda
                fi
            od
            push e into chart
        od
    end

Figure 2.1: General chart parser

Figure 2.1 shows the parsing algorithm. At the beginning, the chart and the agenda are initialized. The initial chart is usually empty; the contents of the initial agenda differ for the various algorithms. In the iteration step, some edge e is popped from the agenda. Then each edge producible by combining e and some other edge from the chart is added to the agenda. Such a new edge is added to the agenda only if it is neither in the chart nor in the agenda. Finally, the edge e is added to the chart. Particular chart parsers differ only in the initialization and in the edge-combining procedure. The analysis is successful if there is an edge of the form [S → α•, 0, n] in the chart at the end of the algorithm.

The basic version of the chart parsing algorithm is non-deterministic, because any edge can be popped from the agenda. Different implementations have different strategies for pushing and popping edges from the agenda; e.g. the agenda can be represented as a stack or a (priority) queue [Nijholt, 1994].
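The loop of Figure 2.1 translates almost directly into code. A minimal Python sketch, assuming a caller-supplied combine(e, f) function that yields every edge producible from the ordered pair of edges (the only algorithm-specific part; the interface is ours, for illustration):

    def chart_parse(initial_agenda, combine):
        """Generic agenda-driven chart parser following Figure 2.1."""
        chart = set()
        agenda = list(initial_agenda)       # a stack, i.e. a LIFO strategy
        in_agenda = set(agenda)
        while agenda:
            e = agenda.pop()
            in_agenda.discard(e)
            for g in list(chart):
                # try both orders, since combine() takes an ordered pair
                for f in list(combine(e, g)) + list(combine(g, e)):
                    if f not in chart and f not in in_agenda and f != e:
                        agenda.append(f)
                        in_agenda.add(f)
            chart.add(e)
        return chart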


There are also very sophisticated statistical approaches in [Charniak, 1997, Charniak, 2000, Collins, 1997].

Left Corner

The term left-corner (LC) parsing was first used by Rosenkrantz and Lewis in [Rosenkrantz and Lewis, 1970]. The LC algorithm for a general CFG was described by Martin Kay in [Kay, 1989a] (the version without top-down filtering). The first LC analyser working in polynomial time is the BUP parser [Matsumoto et al., 1983]; this parser can handle non-cyclic CFGs without epsilon rules.

The LC chart parser is based on the left-corner relation for the given grammar. The LC relation is defined as follows: a symbol X is a left corner of A, written X ∈ LC(A), if X = A, or the grammar contains a rule of the form B → Xα, where B ∈ LC(A). The symbol X can be either a terminal or a non-terminal. Left-corner parsers use a combination of bottom-up and top-down strategies. The idea is to work top-down in the predict phases and bottom-up in the complete phases of the iteration.
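The LC relation can be precomputed from the grammar by a simple fixed-point iteration. A minimal sketch, with rules encoded as (lhs, rhs) pairs (our own encoding, for illustration):

    def left_corners(rules):
        """LC(A) for every non-terminal A, as the least fixed point of:
        A ∈ LC(A); if B ∈ LC(A) and B -> X alpha, then X ∈ LC(A)."""
        lc = {lhs: {lhs} for lhs, _ in rules}
        changed = True
        while changed:
            changed = False
            for a in lc:
                for lhs, rhs in rules:
                    if lhs in lc[a] and rhs and rhs[0] not in lc[a]:
                        lc[a].add(rhs[0])
                        changed = True
        return lc

    # Example: S -> NP VP, NP -> det n, VP -> v NP
    rules = [("S", ("NP", "VP")), ("NP", ("det", "n")), ("VP", ("v", "NP"))]
    print(left_corners(rules)["S"])        # {'S', 'NP', 'det'}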

The LC algorithm

Initialization

• For every rule p ∈ P of the form S → α, add the edge [S → •α, 0, 0] to the agenda.

• The initial chart is empty.

Iteration

• Pop an edge e from the agenda.

Fundamental rules:

• If e is an edge of the form [A → α•, j, k], then for every edge in the chart of the form [B → γ•Aβ, i, j], create an edge [B → γA•β, i, k].

• If e is an edge of the form [B → γ•Aβ, i, j], then for every edge in the chart of the form [A → α•, j, k], create an edge [B → γA•β, i, k].


Scanning the input:

• If e is an edge of the form [A → α•wⱼ₊₁β, i, j], create an edge [A → αwⱼ₊₁•β, i, j + 1].

Left Corner:

• If e is an edge of the form [X → γ•, k, j] and there is an edge of the form [A → α•Cβ, i, k] in the chart, then for every grammar rule of the form B → Xδ such that B ∈ LC(C), create an edge [B → X•δ, k, j].

• Symmetrically, if e is an edge of the form [A → α•Cβ, i, k] and there is an edge of the form [X → γ•, k, j] in the chart, then for every grammar rule of the form B → Xδ such that B ∈ LC(C), create an edge [B → X•δ, k, j].

• If e is an edge of the form [A → α•Cβ, i, j − 1], then for every grammar rule of the form B → wⱼδ such that B ∈ LC(C), create an edge [B → wⱼ•δ, j − 1, j].

There are several optimizations of the basic version of the algorithm that improve its time and space complexity in the average case. Wirén [Wirén, 1987] suggested a variant of the left-corner rule which considerably reduces the re-creation of already existing edges. A similar solution for the graph-structured stack was published by Nederhof [Nederhof, 1993]. A different representation of edges is another space-saving optimization [Leermakers, 1992]: the analyser ignores the symbols situated to the left of the dot. In 1991, Shann [Shann, 1991] improved edge filtering; this filtering is applied when a new edge is created. A similar approach, called Cocke-Schwartz filtering, is described in [Graham et al., 1980]. Robert Moore [Moore, 2000a] implemented all these improvements. He also suggested changing the order of the top-down and bottom-up checks; [Moore, 2000a] shows that this change speeds up the analysis.

2.1.3 Head driven and head corner parsing

The head driven parsing principles were introduced by Martin Kay [Kay, 1989b]. Satta and Stock [Satta and Stock, 1989] suggested a bottom-up head-driven chart parser without top-down filtering. The main difference from the other algorithms is that the analysis does not proceed from left to right. It can start from the middle of the sentence instead.

The author of the input grammar can determine where the analysis should start. This is crucial for the effectiveness of the parsing. The start of the analysis is defined by the head of the given rule. The head of a grammar rule is a symbol from its right-hand side; e.g. the rule A → AaC can have either A or a or C as its head. The epsilon rule has head ε. The head symbol of each grammar rule is underlined in the following text.

A head corner (HC) chart parser with prediction [Sikkel and op den Akker, 1993] will be defined. The HC parser is (similarly to the LC parser) based on the head-corner relation. The HC relation is defined as follows: a symbol X is a head corner of A, written X ∈ HC(A), if X = A, or the grammar contains a rule of the form B → αXβ (with head X), where B ∈ HC(A). The symbol X can be either a terminal or a non-terminal.

Because the HC chart parser starts the parsing in the middle of the sentence, it uses a different definition of edges. For the HC parser, an edge is a triple [A → α•βXγ•δ, i, j], where i, j are natural numbers, 0 ≤ i ≤ j ≤ n, and A → αβXγδ is a rule of the grammar with head X. For an epsilon rule, the edge has the form [A → ••, i, i]. The numbers i, j denote which part of the input is covered by the symbols between the dots.

The HC parser uses top-down filtering and it needs a specific kind of item for this filtering. A predict item is a triple [l, r, A], where 0 ≤ l ≤ r ≤ n and A ∈ N. If the non-terminal A will cover input somewhere between l and r, then the predict item [l, r, A] is created. There are two cases: either A covers the input from l to some j, where l ≤ j ≤ r, or A covers the input from some i to r, where l ≤ i ≤ r. The parsing algorithm is slightly different from general chart parsing, because it has to handle these predict items.

Initialization

• Create the predict item [0, n, S] and push it into the chart.

Iteration

• In the following, let B ∈ HC(A), 0 ≤ l ≤ i ≤ j ≤ k ≤ r, and let there be a predict item [l, r, A] in the chart.


Fundamental rules:

• If there are edges [C → •δ•, i, j] and [B → αC•β•γ, j, k] in the chart, create an edge [B → α•Cβ•γ, i, k].

• Symmetrically, if there are edges [B → α•β•Cγ, i, j] and [C → •δ•, j, k], create an edge [B → α•βC•γ, i, k].

Scanning the input:

• If there is an edge [B → αwᵢ•β•γ, i, j] in the chart, create an edge [B → α•wᵢβ•γ, i − 1, j].

• Symmetrically, if there is an edge [B → α•β•wⱼγ, i, j − 1], create an edge [B → α•βwⱼ•γ, i, j].

Predict rules:

• If there is an edge [B → αC•β•γ, i, j] in the chart, create a predict item [l, i, C].

• If there is an edge [B → α•β•Cγ, i, j], create a predict item [j, r, C].

Head corner:

• For all grammar rules of the form B → αwᵢβ (with head wᵢ), create an edge [B → α•wᵢ•β, i − 1, i].

• For all grammar rules of the form B → ε, create an edge [B → ••, i, i] (i.e. for all i, l ≤ i ≤ r).

• If there is an edge [C → •δ•, i, j] in the chart, then for every grammar rule of the form B → αCβ (with head C), create an edge [B → α•C•β, i, j].

If the existence of the predict item is omitted from the prerequisites and the predict rules are eliminated, then the definition is the same as for the bottom-up head-driven parser. The HC algorithm has O(n⁵) worst-case complexity, where n is the number of words in the input sentence. To get the usual O(n³) worst-case complexity, it is suggested to keep a goal table for non-terminals [Sikkel and op den Akker, 1993]. This extra bookkeeping allows replacing l, r with i, k in the prerequisites for the fundamental rules.


The first part of the fundamental rules can also be optimized:

• If there are edges [C → •δ•, i, j] and [B → αC•β•, j, k] in the chart, create an edge [B → α•Cβ•, i, k].

The left dot is not moved to the left until the right dot moves to the right. This improvement prevents creating edges already present in the chart. Satta and Stock published a more general version of this in [Satta and Stock, 1989]. The algorithm can also be optimized by using more sophisticated predict items and an adjustment of the HC relation [Sikkel and op den Akker, 1993]. A very effective implementation of the HC parser was done in Prolog by Gertjan van Noord [van Noord, 1997]. He suggested several optimizations such as selective memoization (store only "expensive" edges/items in memory) or goal weakening (solve only selected, more general goals). Goal weakening is called abstraction in [Johnson and Dorre, 1995]. A detailed analysis of the complexity of the HC algorithm is given by Klaas Sikkel [Sikkel, 1996].

2.1.4 Parser Comparison

There are various parsing algorithms for CF grammars that differ in many aspects. Attempts to compare their pros and cons appeared at the dawn of parser development history, and to this day there is a lively interest in comparing and evaluating natural language parsing methods. Unfortunately, there is no general evaluation procedure acceptable to all researchers and developers. For example, the number of edges/items in a chart is usually given as a measure to compare chart parsing algorithms. However, as shown in [Moore, 2000b], there can be algorithms that do not differ in this respect while their processing times on a given grammar and input differ considerably. The primary method of assessing the efficiency of a parsing algorithm is therefore only empirical – one has to compare the time taken to parse a set of test sentences by each particular parser based on a shared grammar. The comparison of our implementations on the set of common grammars and on the CF grammar for Czech is described in Section 5.2.1.
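Such an empirical comparison is straightforward to set up. A minimal sketch, where each parser is assumed to be a callable taking a grammar and a sentence (the interface is ours, purely illustrative):

    import time

    def compare_parsers(parsers, grammar, sentences):
        """Print the total wall-clock time each parser needs to parse a
        shared test set with a shared grammar."""
        for name, parse in parsers.items():
            start = time.perf_counter()
            for sentence in sentences:
                parse(grammar, sentence)
            print(f"{name}: {time.perf_counter() - start:.3f} s")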


2.2 Robust Parsing

Formal grammars are often used in NLP applications to describe well-formed sentences. But when used in practice, the grammars usually describe only a subset of a natural language; in addition, natural language sentences are not always well-formed, especially in applications. The goal of robust parsing is to return a correct or usefully "close" analysis for almost all of the input sentences [Carroll and Briscoe, 1996].

A variety of approaches have been proposed to robustly handle natural language. Some techniques are based on modifying the input sentence, for example by removing words that disturb the fluency [Bear et al., 1992, Heeman and Allen, 1994]; more recent approaches are based on selecting the right sequence of partial analyses [Worm and Rupp, 1998, van Noord et al., 1999]. Minimum Distance Parsing is a third approach based on relaxing the formal grammar, allowing rules to be modified by insertions, deletions and substitutions [Hipp, 1992].

The problems of robust parsing and grammar checking of Czech are discussed in [Kuboň, 2001, Kuboň and Plátek, 2001]. The implementation of a partial syntactic analyser of Czech in Prolog is described in [Mráková, 2002]. Note that our technique described in Chapter 4 is not targeted at the Czech language (although it produces interesting results for it).

2.3 Parsing of Czech

In this section, several existing syntactic analysers of Czech are briefly described. They use two different approaches: the first method uses a grammar created by a linguistic expert, the second one discovers language characteristics from an annotated corpus.

All analysers described here produce dependency syntax. One of the main reasons lies in the traditional so-called "Prague School", which views sentence syntax as a set of dependencies of one word on another. This dependency syntax is integrated in the Functional Generative Description of language [Sgall et al., 1986]. Also, the biggest syntactically annotated corpus for Czech – the Prague Dependency Treebank (PDT) [Hajič, 1998] – uses this formalism, so statistical parsers (see below) trained on this corpus have to produce its structures.


The only dependency parser for Czech using a procedural grammar [Holan et al., 1995, Kuboň, 1999] is based on the non-projective context-free dependency grammar (NCFDG) formalism from the LATESLAV project [Plátek et al., 1995]. The NCFDG formalism is very strong and suitable for free-word-order languages like Czech.

The rest of the parsers presented here are statistical parsers trained on the PDT. Collins's phrasal parser adapted for the PDT [Hajič et al., 1999] produces one of the best results published so far. [Holan, 2004] shows four push-down parsers. The push-down parsers, during their training phase, create a set of premise-action rules, which they apply during the parsing phase. In the training phase, the parser determines the sequence of actions which leads to the correct tree for each sentence (in case of ambiguity, a pre-specified preference ordering of the actions is used). During the parsing phase, in each situation the parser chooses the premise-action pair with the highest score.

Zeman's PhD thesis [Zeman, 2004] describes several methods for evaluating dependency parsers and also for improving existing results by combining different analysers. Holan's parser ANALOG has no training phase; in the parsing phase, it searches the training data for the most similar local tree configuration [Holan, 2005]. At the moment, the best results on the PDT e-test part are achieved by the combining parser [Zeman and Žabokrtský, 2005, Holan and Žabokrtský, 2006], which combines the results of the described dependency parsers together.

2.4 Conclusions

As written above, most of the published parsers of the Czech language use dependency syntax. The analyser described in this thesis is based on a constituent grammar. The reasons are mainly historical: at the time our system was created, there was intensive work on a constituent grammar for Czech. Another reason for the phrasal formalism is that the algorithms described in this work were developed and tested in several experiments with English (and partially with French). The robust algorithm described in Chapter 4 is targeted only at English. On the other hand, as shown in the following chapters, our results for the Czech language are comparable with the parsers described in this chapter.

Chapter 3

Parsing System with Contextual Constraints

This chapter describes our algorithms and methods, which are the main results of this thesis. These algorithms are combined into one language-independent parsing system. The analysis of sentences by the system is based on a context-free grammar supplemented by context-sensitive structures. Our mechanism of applying contextual constraints allows us to work with a small number of grammar rules.

3.1 System Overview

The described system consists of several independent modules. The modular design makes the system easily extensible and rather flexible. Figure 3.1 shows the data flow through the system. There are several inputs to the system:

• A sentence in a natural language.

• A context-free (CF) grammar.

• Semantic actions and contextual constraints for the grammar rules.

Words in the input sentence can optionally be tagged. If they are not tagged, then the internal "tagger" is used. The term "tagger" is not entirely accurate here, because we leave ambiguities in the tags. For the Czech language, the morphological analyzer ajka [Sedláček, 2005] is used to create the tags.


Figure 3.1: Parsing System with Contextual Constraints

For other languages, the tags are usually read from a static lexicon. These tags are stored as "values" (see below) for every word. The terminals (sometimes called pre-terminals) of the given context-free grammar are created by simplifying the tags, e.g. using only the word category as a terminal.

Once the terminals are created, the context-free parsing algorithm is run. This algorithm produces all possible derivation trees at the output with respect to the input sentence and the input grammar. All these derivation trees are stored in a data structure based on the shared-packed forest [Tomita, 1986]. Because a chart parser is used in our system (see Section 3.2), the derivation trees are stored as a chart data structure directly. Any context-free parsing algorithm could be used; the modularity of the system allows us to compare the effectiveness of these algorithms easily, see Section 5.2.1.

All derivation trees created in the previous step can be filtered by some basic filter that cuts some trees off. In this step, only basic filtering "compatible" with the shared-packed forest data structure is allowed, e.g. only whole sub-forests can be removed. An example of such filtering is given in Section 3.4.

The next step is the application of contextual constraints and semantic actions. In this step a new data structure, a "forest of values", is created by a bottom-up recursive run of the semantic actions, see Section 3.5.

If the input sentence cannot be generated by the input grammar, i.e. there is no derivation tree at the output of the context-free parsing algorithm, the system offers a robust module. In this case, the contextual constraints and semantic actions are applied to every derivation sub-tree in the shared-packed forest. Then the robust algorithm described in Chapter 4 is used to get the derivation tree(s).

The resulting forest of values can be further filtered by constraints that work with the whole forest, not only with local values. An example of this global filtering is the use of the lexicon of verb valencies VerbaLex, see Section 3.6.

In the end, the derivation trees are generated from the filtered forest of values. Only one or several "best" derivation trees can be created, with respect to a ranking function; e.g. the probability of a tree could be used as one input to the ranking function. In the following sections, the above ideas are described in more detail.


3.2 Head-Driven dependent dot move

This section presents our improved form of a head-driven chart parser that is appropriate for large context-free grammars. The basic method — HDddm (Head-Driven dependent dot move) — is introduced first. Both variants that improve the basic approach are based on the same idea — to reduce the number of chart edges by modifying the form of the items (dotted rules). The first one "unifies" the items that share the analyzed part of the relevant rule (thus, only one dot is needed to mark the position before and after the covered part). The second method applies the inverse strategy: it "eliminates" the parts that have not been covered yet (no dot needed). All the discussed alternatives are described in the form of parsing schemata [Sikkel, 1996]. Because of the use of parsing schemata, the word "edge" has the same meaning as the word "item" in this section.

Also, a tricky technique (employing a special trie-like data structure developed originally for Scrabble) that enables minimizing the extra information needed in the algorithms is mentioned. The advantages of the described methods are demonstrated by the significant decrease in the number of edges in the charts (see Section 2.1.2). The results are given for the standard set of testing grammars (and respective inputs) as well as for a large and highly ambiguous Czech grammar.

3.2.1 Motivation

To enable efficient processing, parsing algorithms employ sophisticated structures to store intermediate parsing results. The most popular data structure for this purpose is the chart. The chart can be viewed as a table of all viable strings or as a set of items. Dotted rules (items) are used to represent the state of the analysis. The dots mark the boundary between the analyzed and non-analyzed parts of items. If the parsing procedure is unidirectional (left to right or vice versa), only one dot is needed. There are many advanced techniques aiming at the refinement of basic chart parsing algorithms (see Section 2.1.2). Here, the ones based on head-driven approaches are presented.

In the experiments behind the effort discussed here, special attention is paid to parser robustness. If no complete parse is found for an input (e.g. from a speech recognizer in a dialogue system), a special technique is employed to efficiently retrieve a set of the most probable maximal sub-trees (see Chapter 4) to provide a partial analysis of the input. Therefore, the most efficient (in the general case) approach to head-driven parsing — the head-corner chart parser [Kay, 1989b, Satta and Stock, 1989] — cannot be applied: it would prune chart edges that could be needed in our later processing of an incomplete parse. Moreover, the head-driven bottom-up algorithm discussed above is also more suitable for our research on incremental parsing [Smrž and Kadlec, 2005]. Nevertheless, the refinements depicted in the text are directly applicable to the head-corner case.

As the text is rather technical, parsing schemata [Sikkel, 1996] are used. The formalism provides an algebraic method appropriate for the description of the key ideas of parsing algorithms. The following section brings a basic version of the algorithm that presents a slight improvement over a known chart parsing technique. The next sections discuss two modifications of the basic method aiming at a reduction of the number of edges in the resulting chart. The first method eliminates those parts of the dotted rules that were already analyzed; the second, in reverse, keeps only the analyzed part of the items. Though the approaches work in reverse, both lead to a significant decrease of the number of edges. The optimization technique has been previously used for other variants of chart parsing; its application to head-driven approaches is original.

A smaller number of chart edges does not necessarily entail more efficient parsing. A technique optimizing the search in grammar rules (their right-hand sides) is briefly described. It exploits sophisticated data structures. A standard trie structure is sufficient for the first refinement. The second case requires an efficient procedure enabling a bidirectional search for paths starting from any symbol on the right-hand side of grammar rules. Such a procedure (designed originally for generating possible moves in Scrabble) has been employed in the second case.

Optimization of parsing need not have a dramatic effect on the overall performance if one needs to parse grammars with a relatively small number of rules. However, we aim at applications where very large (and highly ambiguous) grammars are used. The results of the refinements discussed here are demonstrated on both the Czech grammar (see Chapter 6) and the PT grammar generated from the Penn Treebank. The latter is a part of a standard set of testing grammars that are available together with the respective inputs on the Internet.

Thus, the designed methods were tested on all the grammars from the standard set (see Chapter 5). The last section summarizes the size of the resulting chart (in terms of the number of edges) for parsing on the test set. The effect of both optimizations can be significant in some cases (less than 50% of the edges in the chart).

3.2.2 HDddm Parsing

This section describes the basic version of the head-driven algorithm (HD). As in other "head-oriented" approaches to parsing, the direction of the parsing process is not unidirectional (e.g., from left to right). It starts at the head of the given grammar rule and processes it bidirectionally towards the first and the last rule symbols. Similarly to [Satta and Stock, 1989] and [Sikkel and op den Akker, 1993], the HDddm (head-driven with dependent dot move) technique improves the process of viable hypotheses confirmation. HDddm refers to the fact that the move of one "dot" in the head-driven parsing step is dependent on the opposite move of the other one.

The algorithm can be described as a parsing system [Sikkel, 1996]. A parsing system P is a triple P = ⟨I, H, D⟩, in which:

• I is a set of items called the domain or item set of P;

• H is a finite set of items (not necessarily a subset of I), the hypotheses of P;

• D is a set of deduction steps of the form η₁, …, ηₖ ⊢ ξ, with ηᵢ ∈ I ∪ H for 1 ≤ i ≤ k and ξ ∈ I. The items η₁, …, ηₖ are called the antecedents and ξ the consequent of the deduction step.

The parsing schema for the HDddm parsing technique, P_HD = ⟨I_HD, H, D_HD⟩, is defined as follows. The domain I_HD is given as:

I^HD(i) = {[A → α•βXγ•δ, i, j] | A → αβXγδ ∈ P, 0 ≤ i ≤ j ≤ n},

I^HD(ii) = {[A → ••, i, i] | A → ε ∈ P, 0 ≤ i ≤ n},


I_HD = I^HD(i) ∪ I^HD(ii).

The hypotheses set H encodes the input sequence. For input w₁ … wₙ:

H = {[w, i − 1, i] | w ≡ wᵢ, 1 ≤ i ≤ n}.

The set of deduction steps D_HD is defined as follows:

D^Init = {[w, i − 1, i] ⊢ [A → α•w•β, i − 1, i]} ∪ {⊢ [A → ••, i, i]},

D^Pred = {[A → •α•, i, j] ⊢ [B → β•A•γ, i, j]},

D^Scan(i) = {[A → α•β•wδ, i, j], [w, j, j + 1] ⊢ [A → α•βw•δ, i, j + 1]},

D^Scan(ii) = {[w, i − 1, i], [A → αw•β•, i, j] ⊢ [A → α•wβ•, i − 1, j]},

D^Complete(i) = {[A → α•β•Bδ, i, j], [B → •γ•, j, k] ⊢ [A → α•βB•δ, i, k]},

D^Complete(ii) = {[B → •γ•, i, j], [A → αB•β•, j, k] ⊢ [A → α•Bβ•, i, k]},

D_HD = D^Init ∪ D^Pred ∪ D^Scan(i) ∪ D^Scan(ii) ∪ D^Complete(i) ∪ D^Complete(ii).

The string γ in the Complete steps can be empty (ε). Note that the left dot in an edge cannot move leftwards until the right dot has moved fully to the right. This is precisely the difference between HDddm and the technique described in [Satta and Stock, 1989]: the parser never creates edges of the form [A → α•βXγ•δ, i, j] with both β and δ non-empty. This approach avoids redundant work during the analysis.

A simplified form of the algorithm is given here, so various optimizations are possible. For example, the real implementation of the algorithm benefits from an approach that uses so-called CYK items ([A, i, j]) in the above definitions (see [Sikkel and op den Akker, 1993] for details).
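To make the schema concrete, the following is a minimal recognizer sketch in Python. The tuple encoding is our own and purely illustrative — this is not the thesis implementation, which relies on the data structures of Section 3.2.5. Rules are (lhs, rhs, head) triples; an item (r, dl, dr, i, j) states that in rule r the segment rhs[dl:dr], which contains the head, covers words[i:j]; the dependent dot move is enforced by moving the left dot only once the right dot has reached the end of the rule.

    from collections import defaultdict

    def hdddm_recognize(rules, words, start):
        n = len(words)
        nonterms = {lhs for lhs, _, _ in rules}
        comp_from = defaultdict(set)   # (A, i) -> ends j of complete A items
        comp_to = defaultdict(set)     # (A, j) -> starts i of complete A items
        chart, agenda = set(), []

        def add(item):
            if item not in chart:
                chart.add(item)
                agenda.append(item)

        def extend(item):
            """Scan/Complete steps applied to one (possibly partial) item."""
            r, dl, dr, i, j = item
            _, rhs, _ = rules[r]
            if dr < len(rhs):                    # right dot first (ddm)
                x = rhs[dr]
                if j < n and x == words[j]:                # D_Scan(i)
                    add((r, dl, dr + 1, i, j + 1))
                for k in comp_from.get((x, j), ()):        # D_Complete(i)
                    add((r, dl, dr + 1, i, k))
            elif dl > 0:                         # right dot done, move left
                x = rhs[dl - 1]
                if i > 0 and x == words[i - 1]:            # D_Scan(ii)
                    add((r, dl - 1, dr, i - 1, j))
                for g in comp_to.get((x, i), ()):          # D_Complete(ii)
                    add((r, dl - 1, dr, g, j))

        # D_Init: head items for terminal heads, plus all epsilon items.
        for r, (lhs, rhs, h) in enumerate(rules):
            if not rhs:
                for i in range(n + 1):
                    add((r, 0, 0, i, i))
            elif rhs[h] not in nonterms:
                for i, w in enumerate(words):
                    if w == rhs[h]:
                        add((r, h, h + 1, i, i + 1))

        while agenda:
            item = agenda.pop()
            r, dl, dr, i, j = item
            lhs, rhs, _ = rules[r]
            if dl == 0 and dr == len(rhs):       # complete item [A -> .α., i, j]
                comp_from[lhs, i].add(j)
                comp_to[lhs, j].add(i)
                for r2, (_, rhs2, h2) in enumerate(rules):     # D_Pred
                    if rhs2 and rhs2[h2] == lhs:
                        add((r2, h2, h2 + 1, i, j))
                for other in list(chart):        # re-extend waiting items
                    extend(other)
            else:
                extend(item)

        return n in comp_from.get((start, 0), set())

    # S -> A S A (head S), S -> b, A -> a; recognizes e.g. "a b a"
    rules = [("S", ("A", "S", "A"), 1), ("S", ("b",), 0), ("A", ("a",), 0)]
    print(hdddm_recognize(rules, ["a", "b", "a"], "S"))        # True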

3.2.3 Head-Driven Algorithm with One Dot in Items

The idea comes from [Leermakers, 1992] (for Earley's algorithm); it has been applied to the left-corner algorithm in [Moore, 2000a]. The optimization is based on the observation that the non-terminals to the left of the dot in an Earley "one dot" item play no role in the algorithm. The approach can be adapted for the HD algorithm so that the non-terminals between the dots in an HD item can be "forgotten". The algorithm with one dot in items (iHD) can be described as a parsing system P_iHD = ⟨I_iHD, H, D_iHD⟩. The domain I_iHD is given as:

I^iHD(i) = {[A → α•δ, i, j] | A → αβXγδ ∈ P, 0 ≤ i ≤ j ≤ n},

I^iHD(ii) = {[A → •, i, j] | A → α ∈ P, possibly α ≡ ε, 0 ≤ i ≤ j ≤ n},

I_iHD = I^iHD(i) ∪ I^iHD(ii).

The HD item of the form [A → α•β•δ, i, j] is replaced with the iHD item [A → α•δ, i, j]. This approach leads to a lower number of items in the resulting chart. The hypotheses set H is the same as for P_HD. The set of deduction steps D_iHD is given as:

D^Init(i) = {[w, i − 1, i] ⊢ [A → α•β, i − 1, i] | A → αwβ ∈ P},

D^Init(ii) = {⊢ [A → •, i, i] | A → ε ∈ P},

D^Pred = {[A → •, i, j] ⊢ [B → α•β, i, j] | B → αAβ ∈ P},

D^Scan(i) = {[A → α•wδ, i, j], [w, j, j + 1] ⊢ [A → α•δ, i, j + 1]},


D^Scan(ii) = {[w, i, i + 1], [A → αw•, i + 1, j] ⊢ [A → α•, i, j]},

D^Complete(i) = {[A → α•Bδ, i, j], [B → •, j, k] ⊢ [A → α•δ, i, k]},

D^Complete(ii) = {[B → •, i, j], [A → αB•, j, k] ⊢ [A → α•, i, k]},

D_iHD = D^Init(i) ∪ D^Init(ii) ∪ D^Pred ∪ D^Scan(i) ∪ D^Scan(ii) ∪ D^Complete(i) ∪ D^Complete(ii).

The deduction steps are similar to the HD steps. Note that an item [A → •, i, j] now represents the situation that an arbitrary rule with left-hand side A has been recognized between positions i and j. As in the previous case, there is a refined version where the left dot moves left only if the right one is already after the rightmost symbol. The same is true for the replacement of [A → •, i, j] items by CYK items [A, i, j]; the above definition would just be more complex. Items of the form [A, i, j] are called CYK items in the following algorithm.

3.2.4 Simplified Items, No Dots Needed

The algorithm described in the previous section cuts out the already analyzed part, represented by β in the HD items [A → α•β•δ, i, j]. However, the opposite approach is possible as well. One can realize that the remaining parts of the rule (the left-hand side non-terminal A and the parts α and δ) are given by the grammar. Thus, only the analyzed parts are stored in the chart. This general idea from [Nederhof and Satta, 1994] is applied here to HDddm parsing. Note that the dots in the items occurring in the following definitions play no role in the algorithm; they are kept only for better readability and comparison with the previous algorithms. The head-driven algorithm with simplified items (sHD) can be described as a parsing system P_sHD = ⟨I_sHD, H, D_sHD⟩. The domain I_sHD is defined as:

I^sHD(i) = {[•βXγ•, i, j] | A → αβXγδ ∈ P, 0 ≤ i ≤ j ≤ n},


I^CYK = {[A, i, j] | A → α ∈ P, possibly α ≡ ε, 0 ≤ i ≤ j ≤ n},

I_sHD = I^sHD(i) ∪ I^CYK.

Item [•β•, i, j] denotes that:

• β covers the input sequence between i and j;

• there exists a grammar rule A → αβγ ∈ P with the head in β.

A CYK item [A, i, j] represents a complete (inactive) item as discussed in the previous section. Notice the difference between [A, i, j] and [•A•, i, j]. The deduction steps D_sHD are given as:

D^Init(i) = {[w, i − 1, i] ⊢ [•w•, i − 1, i] | A → αwβ ∈ P},

D^Init(ii) = {⊢ [A, i, i] | A → ε ∈ P},

D^Pred = {[A, i, j] ⊢ [•A•, i, j] | B → αAβ ∈ P},

D^Scan(i) = {[•β•, i, j], [w, j, j + 1] ⊢ [•βw•, i, j + 1] | A → αβwγ ∈ P},

D^Scan(ii) = {[w, i, i + 1], [•β•, i + 1, j] ⊢ [•wβ•, i, j] | A → αwβ ∈ P},

D^Complete(i) = {[•β•, i, j], [B, j, k] ⊢ [•βB•, i, k] | A → αβBδ ∈ P},

D^Complete(ii) = {[B, i, j], [•β•, j, k] ⊢ [•Bβ•, i, k] | A → αBβ ∈ P},

D^Complete(iii) = {[•α•, i, j] ⊢ [A, i, j] | A → α ∈ P},

D_sHD = D^Init(i) ∪ D^Init(ii) ∪ D^Pred ∪ D^Scan(i) ∪ D^Scan(ii) ∪ D^Complete(i) ∪ D^Complete(ii) ∪ D^Complete(iii).

3.2.5 Data Structures

To implement an efficient parsing algorithm, appropriate data structures are needed to store the information. Especially the Complete steps in the discussed algorithms ask for special handling. The appropriate data structure has to efficiently represent (parts of) the right-hand sides of grammar rules. For example, the sHD algorithm needs to identify fully analyzed rules in the step D^Complete(iii) above.

The trie [Fredkin, 1960] structure proved to be the most suitable candidate for storing the right-hand sides of grammar rules. This is true especially for parsing with our grammar of Czech, where the average number of right-hand-side symbols in the rules is rather high. Chappelier and Rajman [Chappelier and Rajman, 1998b] apply a trie to store Earley-like simplified items. The same approach works for the basic variant of our algorithm as well as for iHD. A free implementation of finite-state automata and the routines for searching them is described in [Daciuk et al., 1998].

It is much more difficult to come up with an appropriate data structure to efficiently search rules in the case of the last parsing method (sHD). The removal of the non-analyzed parts from the sHD items requires a bidirectional search from the inside of the right-hand side. A sophisticated data structure for this purpose has been found in GADDAG [Gordon, 1994]. Originally, it was designed for generating possible moves in an implementation of the game Scrabble. It allows a bidirectional path starting from each letter (a symbol on the right-hand side in our case) of each word (a grammar rule) in the lexicon (the grammar).
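A GADDAG-like index for rule right-hand sides can be sketched as follows: since HDddm first extends to the right of the head and only then to the left, each rule can be stored along the path rhs[head:], a marker, and then the reversed prefix rhs[:head]. This encoding is our own simplified illustration of the idea, not Gordon's original structure:

    def build_rule_index(rules):
        """A trie over rule right-hand sides: from the head symbol one walks
        right to the end of the rhs, crosses the marker '<', and then walks
        left to its beginning.  Leaves ('$') store (lhs, rule_id)."""
        trie = {}
        for rule_id, (lhs, rhs, head) in enumerate(rules):
            path = list(rhs[head:]) + ["<"] + list(reversed(rhs[:head]))
            node = trie
            for symbol in path:
                node = node.setdefault(symbol, {})
            node["$"] = (lhs, rule_id)
        return trie

Walking such a trie in step with the dot moves answers, per extension, whether a partially recognized segment can still grow into the right-hand side of some rule, and which rule has just been completed.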

3.2.6 Conclusions

The iHD and sHD algorithms process and store exactly inverse information, but both approaches are functional. The degree of their contribution depends on the grammar, the positions of the heads and the input. The sHD method produces about a 50% reduction of the number of edges without any change in the algorithm; see Section 5.2.2 for our experimental results.


A reduction of the chart size does not necessarily imply a more efficient parser. As the complete incorporation of the discussed refinements into our system is not finished yet, there is no comparison of running times for the discussed methods. However, the results clearly demonstrate that the data structures discussed above really allow working with the more complex items efficiently enough to outperform the basic method with the standard items.

The described improvements of the HD algorithm are presented in [Kadlec and Smrž, 2006]. Note that they do not exclude all the other refinements designed for the original form of items (see, e.g., [Sikkel, 1996] and [Sikkel and op den Akker, 1993]). Moreover, our modified items can be directly taken over for the head-corner algorithm. All experiments (with some exceptions) in Chapter 5 use the HDddm method described in this section.

3.3 Optimizing Heads for Parsing

The effectiveness of our HDddm parsing technique (presented in the previous Section 3.2) is highly dependent on the positions of the heads in the grammar rules.¹ It is usually expected that the positions of heads are set according to linguistic intuition, i.e. that the parser should first instantiate the linguistic heads, or governing nodes [Kay, 1989b, Bouma and van Noord, 1993]. However, only a limited number of experiments have been published that would prove the validity of such a heuristic. One of the reasons for this situation is that implementations of head-driven parsing algorithms efficient enough to enable experiments of this kind have appeared only recently.

In the following, a method aiming at choosing the best position of CF rule heads for parsing with various natural-language grammars is presented. The optimization step focuses on the number of edges (see Section 3.2) in the resulting chart. The optimization can have an almost negligible effect when parsing with small or medium-size grammars. However, as the results presented below demonstrate, it can be of crucial importance for parsing with large and highly ambiguous grammars. The goal of optimizing the positions of rule heads for parsing is not completely new.

¹ The experimentally chosen heads are called "keys" in [Oepen and Callmeier, 2000].

It appeared as a rule instantiation strategy in [Oepen and Callmeier, 2000]. Oepen and Callmeier discuss the head selection procedure in the context of HPSG [Pollard and Sag, 1994] parsing and employ their PET platform for the evaluation of the results. The form of the LinGO grammar used in their experiments is rather simple; it contains only binary rules (up to two symbols on the right-hand side of a rule). On the contrary, at least the Czech grammar used in our experiments contains many rules with complex right-hand sides, which makes the optimizing procedure much more demanding. However, the general conclusion is the same in both cases.

3.3.1 Optimization Procedure

The process of finding the positions of rule heads that are optimal for a given parsing algorithm can be summarized as follows. Grammar rules are taken one after the other. The analysis of the given input is run for all possible head positions in the rule. The best head position (the position for which the number of edges in the resulting chart is minimal) is chosen and the rule is given back to the grammar. This is done for all grammar rules.

This "greedy" algorithm finds optimal head positions for a given input sentence. Obviously, one can obtain grammars with different heads for every input sentence. Thus, the most often used head position is chosen for every rule to build the final grammar; a sketch of the whole procedure is given below.

The grammars created by taking the most frequently used head positions need not be optimal for the given set of input sentences. To get an optimal result, the whole input data set, instead of just one sentence, should be parsed in one step of the optimization process. However, such a procedure would be harder to parallelize, and our experiments on a chosen part of the test set suggest that the output of the simple algorithm is close enough to the optimal solution.

Section 5.2.1 contains the results of our experiments with CF grammars with different positions of the heads. Section 5.2.3 presents the application of the optimizing procedure to two different grammars, our CF grammar for Czech and ATIS.
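The greedy procedure can be sketched as follows; count_edges(rules, sentence) is an assumed callback that runs the head-driven parser and returns the resulting chart size, and rules are (lhs, rhs, head) triples as in the earlier sketches:

    from collections import Counter

    def optimize_heads(rules, sentences, count_edges):
        """Greedy head optimization: per sentence, move each rule's head to
        the position minimizing the chart size; the final grammar takes,
        for every rule, the head chosen most often over all sentences."""
        votes = [Counter() for _ in rules]
        for sentence in sentences:
            current = list(rules)
            for r, (lhs, rhs, _) in enumerate(rules):
                if not rhs:
                    continue
                best = min(range(len(rhs)),
                           key=lambda h: count_edges(
                               current[:r] + [(lhs, rhs, h)] + current[r + 1:],
                               sentence))
                current[r] = (lhs, rhs, best)
                votes[r][best] += 1
        return [(lhs, rhs, votes[r].most_common(1)[0][0] if votes[r] else head)
                for r, (lhs, rhs, head) in enumerate(rules)]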


3.4 Shared-Packed Forest filtering

In this section, a method of filtering the shared-packed forest data structure is presented. This filtering is used only to remove some data from the structure; no new information is added. The technique is a step preceding the filtering by contextual constraints; see Figure 3.1 in Section 3.1 for an overview of the whole system. The resulting shared-packed forest is always a sub-forest of the original one. That means that only simple transformations, such as removing a node, are done here.

3.4.1 Filtering by rule levels

One possible filtering of the shared-packed forest is a local one (with respect to a node of the forest), based on what we call "rule levels". The rule level is a function that assigns a natural number to every grammar rule. In the following, the term "rule level of a grammar rule" denotes the number resulting from the application of this function to the rule.

The idea is that, for some grammar rules, if the specific grammar rule succeeds during the parsing process, then an application of some other rule (with the same non-terminal on the left-hand side of the rule) is wrong. To be more precise, if the specific grammar rule covers the same input as some other grammar rule beginning with the same non-terminal, then the rule with the lower rule level is refused.

The chart structure (see Section 2.1.2) is used to represent the shared-packed forest, so the filtering method is described in terms of chart parsing: if there are edges [A → •α•, i, j] and [A → •β•, i, j] in the chart, then delete the edge whose grammar rule has the lower rule level. If the edges have the same rule level, keep them both in the chart. Figure 3.2 shows an example of such rules; w₁, w₂, …, w₆ represent the input sentence.
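A minimal sketch of this filtering over a set of inactive edges (our own encoding, for illustration: an edge is a tuple (lhs, rule_id, i, j) and level(rule_id) returns the rule level assigned to the rule):

    from collections import defaultdict

    def filter_by_rule_levels(edges, level):
        """Among inactive edges with the same non-terminal over the same
        span, keep only those whose rule has the maximal rule level;
        edges sharing that level all survive."""
        groups = defaultdict(list)
        for lhs, rule_id, i, j in edges:
            groups[lhs, i, j].append((lhs, rule_id, i, j))
        kept = set()
        for group in groups.values():
            top = max(level(rule_id) for _, rule_id, _, _ in group)
            kept.update(e for e in group if level(e[1]) == top)
        return kept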

3.4.2 Conclusions

Notice that this kind of filtering is different from a probability of the grammar rule: the presented method is local to the specific node in the shared-packed forest. By default, all grammar rules have the same rule level. The rule levels are set by hand and only in very specific cases; actually, only one rule in our grammar for Czech (see Chapter 6) has a non-default rule level. Only a small number of experiments were performed, because this method is new to our system.


Figure 3.2: Filtering by rule levels. Two sub-forests with grammar rules A → α and A → β in their roots. One of them is filtered out if these rules have different rule levels set.

3.5 Contextual Constraints and Semantic Actions

Our main problem with CF parsing is that there are too many derivation trees for a given input sentence. The contextual constraints are mainly used to prune incorrect derivation trees from the CF parsing result. Some additional information can also be computed by these constraints, which is why we also call them "semantic actions". In the following, the term "contextual constraint" has the same meaning as the term "semantic action". Our algorithms for CF parsing generate the chart structure; thus we use the word "chart" to denote the result of the CF parsing. See Figure 3.1 for an overview of where in the parsing system the constraints are evaluated. The contextual constraints (or actions) defined in the grammar can be divided into four groups:


1. rule-tied actions

2. case agreement constraints

3. post-processing actions

4. actions based on derivation tree

An example of a rule-tied action is rule-based probability estimation. Case agreement constraints serve as chart pruning actions and they are used in generating the expanded grammar; see Chapter 6 for a detailed description of the grammar that we use. The case agreement constraints represent the functional constraints, whose processing can be interleaved with that of the phrasal constraints.

The post-processing actions are not triggered until the chart is already completed. Actions on this level are used mainly for the computation of analysis probabilities for a particular input sentence and a particular analysis. Some such computations (e.g. verb valency probability) demand exponential resources for computation over the whole chart structure. This problem is solved by splitting the calculation process into a pruning part (run on the level of post-processing actions) and a reordering part, which is postponed until the actions based on the derivation tree.

The actions that do not need to work with the whole chart structure are run after the best or the n most probable derivation trees have been selected. These actions are used, for example, for the determination of possible verb valencies within the input sentence (see Section 3.6), which can produce a new ordering of the selected trees, or for the logical analysis of the sentence [Horák, 2002a].

3.5.1 Representation of values

It was shown that parsing is in general NP-complete if grammars are allowed to have agreement features [Barton et al., 1987]. But the pruning constraints in our system are weaker than, for example, general feature structures [Kay, 1985]. We allow a node in the derivation tree to have only a limited number of features. We call the features "values", because they arise as results of our semantic actions. E.g. the number of values for noun groups in our system is at most 56.


Figure 3.3: Example of the forest of values. The edge [0, 2, npnl] → [0, 1, np] [1, 2, np] holds the values value3 and value4; each np edge holds the values value1 and value2, and each value holds a list of its children (combinations of child values).

To compute the values, we build a new structure, a forest of values, instead of pruning or extending the original chart. The forest of values is computed by a depth-first walk through the chart structure (the chart can be viewed as an oriented graph, see Section 2.1.2). Every edge in the chart is visited only once, and an edge can generate at most one node in the new forest of values. The value is computed as a result of the semantic action for the grammar rule given by the current edge. The parameters for the semantic action are filled from the values on the lower level, "lower" with respect to the derivation tree, i.e. closer to the leaves of the tree. So the arguments of the semantic action are also bounded by the same limit (e.g. 56 possibilities in our case). Because there can be more than one derivation tree containing the current edge, all possible combinations of values are passed to the semantic action. The worst-case time complexity for one node in the forest of values is therefore 56^δ, where δ is the length of the longest right-hand side of a grammar rule. Notice that this complexity is independent of the number of words in the input sentence. The values in the forest of values are linked back to the edges. An edge contains a singly linked list of its values. Each value holds a singly linked list of its children. A child is a one-dimensional array of values; this array represents one combination of values that leads to the parent value. Notice that there can be more combinations of values that lead to the same value. The i-th cell of the array contains a reference to a value from the i-th symbol on the RHS of the corresponding grammar rule. The i-th symbol need not be used to compute the parent value; in such a case, the unused cell contains only a reference to the edge.
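A minimal sketch of this representation in C follows (the type and field names are illustrative, not the actual synt implementation):

typedef struct value value_t;

typedef struct child {      /* one combination of child values       */
    int arity;              /* length of the rule's right-hand side  */
    value_t **cells;        /* cells[i]: value from the i-th RHS
                               symbol, or NULL if that symbol was not
                               used to compute the parent value (the
                               cell then refers only to the edge)    */
    struct child *next;     /* next combination yielding this value  */
} child_t;

struct value {
    void *data;             /* the result of the semantic action     */
    child_t *children;      /* combinations that lead to this value  */
    value_t *next;          /* next value in the edge's value list   */
};

typedef struct vedge {      /* a chart edge with its list of values  */
    int rule, from, to;
    value_t *values;        /* singly linked list of the edge's values */
} vedge_t;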


Figure 3.3 shows an example representing the rule npnl → np np and containing three edges ([0, 2, npnl → •np np•], [0, 1, np → •α•], [1, 2, np → •β•]). The right-hand sides of the np rules are not shown in the figure, as they play no role here; np → α and np → β are some rules of the input grammar. Each np edge contains two values, value1 and value2. This gives us four possible combinations. The semantic action computes the same value value4 from the combinations value1 × value2 and value2 × value1. The combination value2 × value2 was classified as incorrect (by the action, i.e. the contextual constraint), so it is not present.

3.5.2 Generation of a grammar with values

It is possible to create a CF grammar without our contextual constraints which generates the same derivation trees as the CF grammar supplemented by the constraints. In the following, a method is provided that, for a given input, generates such a CF grammar without values. This allows us to compare our system, which is able to evaluate the constraints, with other systems able to work only with "pure" CF grammars. We use the following procedure for every inactive edge [i, j, A → X1X2...Xn•] in the chart:

• for every value v in the edge, we generate the rule A → A_value, where value is a unique textual representation of the value v,

• for every child of the value v, we generate the rule A_value → X′1 X′2 ... X′n, where X′i is:

– Xi_valuei if a value valuei from the i-th non-terminal is used to compute the value v,

– Xi otherwise.

Duplicate rules are removed. Figure 3.5 shows the generated grammar for the input 2/1-1 and the grammar with actions from Figure 3.4. The $$ parameter of an action represents the returned value; the $k parameter is a variable where the value of the k-th non-terminal of the rule is stored. The constraint is_not_zero filters out the trees which represent "a division by zero". In the generated grammar, the suffix encodes the computed value, e.g. e_1 denotes an e whose value is 1. The input has two derivation trees in the original grammar (if the actions are omitted), but the corresponding generated grammar gives us only one derivation tree, because of the is_not_zero action.


e -> e "+" e add ( $$ $1 $3 ) e -> e "-" e sub ( $$ $1 $3 ) e -> e "/" e is_not_zero ( $3 ) div ( $$ $1 $3 ) e -> NUMBER value_of ( $$ $1 )

Figure 3.4: Grammar with the contextual constraint is_not_zero and semantic actions.

e -> e_0
e -> e_1
e -> e_2
e_0 -> e_1 "-" e_1
e_1 -> NUMBER
e_1 -> e_2 "-" e_1
e_2 -> NUMBER
e_2 -> e_2 "/" e_1

Figure 3.5: Generated grammar with values for the input 2/1-1.

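The generation procedure itself can be sketched as follows, reusing the vedge_t, value_t and child_t structures from the sketch in Section 3.5.1; the helpers lhs_name(), rhs_name() and value_name() are hypothetical and stand for the rule's non-terminal names and the unique textual representation of a value (duplicate rules would be filtered afterwards):

#include <stdio.h>

const char *lhs_name(int rule);
const char *rhs_name(int rule, int i);
const char *value_name(const value_t *v);

void generate_rules_for_edge(const vedge_t *e)
{
    for (const value_t *v = e->values; v; v = v->next) {
        /* A -> A_value */
        printf("%s -> %s_%s\n", lhs_name(e->rule),
               lhs_name(e->rule), value_name(v));
        for (const child_t *c = v->children; c; c = c->next) {
            /* A_value -> X'1 X'2 ... X'n */
            printf("%s_%s ->", lhs_name(e->rule), value_name(v));
            for (int i = 0; i < c->arity; i++) {
                if (c->cells[i])    /* i-th value used in computation */
                    printf(" %s_%s", rhs_name(e->rule, i),
                           value_name(c->cells[i]));
                else                /* i-th symbol unused             */
                    printf(" %s", rhs_name(e->rule, i));
            }
            printf("\n");
        }
    }
}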

3.5.3 Conclusions

Why are the actions and semantic constraints used when they can be replaced by a grammar with values? There are three main reasons. First of all, the grammar with values for all possible inputs would be extremely large, even if the domain range is limited, e.g. by 56 in our case. Secondly, the actions can be easily changed and debugged when computed separately. The third reason is that some of our experiments use semantic actions with an unlimited domain range, and these actions cannot be substituted by the grammar. See Section 6.4 or Section 3.6 for examples of actions with an unlimited domain range.

3.6 Verb Valences

In this section, an exploitation of the lexicon of verb valencies during the parsing process is shown. Our method works in two steps: in the first step, the parser automatically discovers possible valencies in the input sentence, and then these found valencies are compared with the valencies from the lexicon. This enables us to prune impossible combinations with regard to the particular verb, and to automatically process large corpora in order to discover possible verb valencies that are missing in the lexicon. The lexicon of verb valencies called VerbaLex (see below) is used for our experiments.

3.6.1 VerbaLex

The lexicon of verb valencies, VerbaLex [Hlaváčková et al., 2006], was created in 2005. VerbaLex is based on three valuable language resources for Czech, three independent electronic lexicons of verb valency frames. The first resource, Czech WordNet valency frames, was created during the Balkanet project and contains semantic roles and links to the Czech WordNet. The second resource, VALLEX 1.0 [Žabokrtský and Lopatková, 2004], is a lexicon based on the formalism of the Functional Generative Description and was developed during the Prague Dependency Treebank project [Hajič, 1998].


běžet:1 / utíkat:2
(1) AG(person:1, kdo1)  VERB  SOC(person:1, za+kým7)  MAN-opt(Adv jak)
    example: vnuk běžel radostně za babičkou
(2) AG(person:1, kdo1)  VERB  LOC-opt(Adv kam)  MAN-opt(Adv jak)
    example: utíkal rychle domů
(3) AG(person:1, kdo1)  VERB  OBJ(vehicle:1, za+čím7)
    example: běžel za tramvají

Figure 3.6: An example of a VerbaLex verb frame

The third source of information for VerbaLex is the syntactic lexicon of verb valencies denoted as BRIEF, which originated at FI MU Brno in 1996 [Pala and Sevecek, 1997]. The resulting lexicon VerbaLex incorporates all the information found in these resources plus additional relevant information such as verb aspect, verb synonymy, types of use and semantic verb classes based on the results of the VerbNet project [Dang et al., 1998]. The information in VerbaLex is organized in the form of complex valency frames. All the valency information in VerbaLex is specified with regard to particular verb senses, not only verb lemmata, as was the case in some of the sources. The current work on the lexicon data aims at enlarging the lexicon to the size of about 16,000 Czech verbs. The VerbaLex lexicon displays syntactic dependencies of sentence constituents, their semantic roles and links to the corresponding Czech WordNet classes. An example of such a verb frame is presented in Figure 3.6. For more details about the VerbaLex frame notation, please refer to [Hlaváčková et al., 2006].


3.6.2 Filtering by Valences

The filtering of the forest of values (see Section 3.5) consists of the two steps described below. At the moment, the VerbaLex semantic roles are ignored; only the grammatical features (grammatical case) are used. First of all, all noun groups covered by the particular context-free rule are found. Then compatible groups (compatible in terms of derivation, i.e. groups within the same derivation tree) are processed by a semantic action. Notice that this step suffers from a possibly exponential time complexity, because particular derivation trees are processed, not the packed forest. Only the first derivation (i.e. the root rule of the tree) is used, and our experiments show that on average this is not a problem. If the analyzed verb has a corresponding entry in VerbaLex, we try to match the extracted frame with the frames in the lexicon. When checking the valencies against VerbaLex, the dependence on the surface order is discarded. Before the system confronts the actual verb valencies from the input sentence with the list of valency frames found in the lexicon, all the valency expressions are reordered. By using a standard ordering of participants, the valency frames can be handled as sets independent of the current position of the verb arguments. However, since VerbaLex contains information about the "usual" verb position within the frame, we promote the standard ordering by increasing or decreasing the respective derivation tree probability. The results of experiments with VerbaLex are shown in Table 6.2 in Section 6.3.2.
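The order-independent matching can be sketched as follows (a minimal sketch in C; a valency_t holding only the grammatical case, and the comparison by sorting into a standard ordering, are illustrative simplifications of what synt actually stores):

#include <stdlib.h>
#include <string.h>

typedef struct {
    int gram_case;          /* grammatical case, 1..7 for Czech */
} valency_t;

static int cmp_valency(const void *a, const void *b)
{
    return ((const valency_t *)a)->gram_case
         - ((const valency_t *)b)->gram_case;
}

/* Sort both the frame extracted from the sentence and the lexicon
 * frame into the standard ordering and compare them as sets. */
int frames_match(valency_t *found, int nf, valency_t *lex, int nl)
{
    if (nf != nl)
        return 0;
    qsort(found, nf, sizeof *found, cmp_valency);
    qsort(lex,   nl, sizeof *lex,   cmp_valency);
    return memcmp(found, lex, nf * sizeof *found) == 0;
}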

3.7 Conclusions

In this chapter, a large language-independent parsing system was described. Our presented version of the head-driven context-free parsing algorithm outperforms the best published state-of-the-art context-free parsers; see Chapter 5 for the results of experiments. The evaluation of semantic actions and contextual constraints allows us to work with a relatively small number of grammar rules. Also, some additional information, which is not contained in the context-free part of the grammar, is calculated. Some constraints can be generated automatically with the lexicon of verb valencies VerbaLex. All mentioned system modules are optional, e.g. only a "pure" context-free analysis without semantic actions or robustification can be run.

The implementation of this system is called synt and it is described in Chapter 6. The robust extension of this system is shown in Chapter 4. The results of experiments are presented in Chapter 5. The filtering algorithms working with verb valencies (from Section 3.6) and rule levels (from Section 3.4) are joint work with Aleš Horák; the rest of the algorithms were created by the author of this thesis.

Chapter 4

Robust Stochastic Parsing Using Optimal Maximum Coverage

There are many NLP applications (e.g. speech recognition or dialog systems) where it is difficult to find a context-free (CF) grammar generating a sufficient subset of the processed language (the undergeneration problem). In addition, when the coverage of the grammar is improved, the accuracy usually decreases. Therefore, our goal is to develop a robust syntactic parser that is able to return a "correct" derivation tree even if the grammar cannot generate the input sentence. The definition of correctness is strongly dependent on the target application, and our framework allows us to change the correctness criteria to fit various application needs. The following two-step solution is proposed:

• for the sentence to be analyzed, the finest corresponding most probable optimal maximum coverage (see Sections 4.1 and 4.2) is generated first,

• then the possibly partial trees from this coverage are glued into one resulting tree (see Section 4.3).

Figure 4.1 shows a simple example of a possible result from the robust parsing mechanism. The implementation of the robust parser and results of experiments are discussed in Section 5.3.


Figure 4.1: Glued trees: the partial trees T1, T2 and T3 over the words w1 ... w6, joined under a new root S.

Figure 4.2: Partial trees that cannot be composed into a coverage.

4.1 Coverage

For a given sentence, a coverage with respect to an input grammar G is a sequence of non-overlapping, possibly partial, derivation trees such that the concatenation of the leaves of these trees corresponds to the whole input sentence. Notice that restricting the coverage to derivation trees (i.e. trees verifying the leftmost non-terminal rewriting convention) excludes structures such as those shown in Figure 4.2. For an arbitrary derivation tree T, the foliage f(T) is defined as the sequence of the leaves of T. So for a coverage C = (T1, T2, ..., Tk) of an input sentence consisting of the words w1, w2, ..., wn, the following equation holds:


Figure 4.3: Partial derivation trees, some of which (e.g. T1, T2, T3 and T′1, T4, T′3) can be composed into a coverage.

f(T1), f(T2), ..., f(Tk) = w1, w2, ..., wn.

In other words, if fi(T) is defined as the i-th leaf of T and flast(T) as the last leaf of T, then for a coverage C = (T1, T2, ..., Tk) of the input sentence w1, w2, ..., wn, the following equations hold:

f1(T1) = w1, flast(Tk) = wn, and if flast(Ti) = wj for some 1 ≤ i < k, then f1(Ti+1) = wj+1.

Figure 4.3 shows a coverage C = (T1, T2, T3) consisting of the trees T1, T2 and T3. If there are trees T′1 and T′3 such that T′1 is a sub-tree of T1 and T′3 is a sub-tree of T3, then there is also a coverage C′ = (T′1, T4, T′3). Conversely, (T1, T3) and (T′1, T4, T3) are not coverages. If there are no unknown words in the input sentence, then at least one trivial coverage is always obtained, consisting of trees that all use only lexical rules (i.e. one rule per tree).

4.1.1 Maximum coverage

A maximum coverage (m-coverage) is a coverage that is maximum with respect to the partial order relation ≤, defined as the reflexive and transitive closure of the subsumption relation ≺ defined below. The relation ≺ is a relation over coverages such that, for any coverages C and C′:

C′ ≺ C iff ∃ i, j, k, 1 ≤ i ≤ k, 1 ≤ j, and there exists a rule r in the grammar G such that C = (T1, ..., Ti, ..., Tk), C′ = (T1, ..., Ti−1, T′1, T′2, ..., T′j, Ti+1, ..., Tk) and Ti = r ◦ T′1 ◦ T′2 ◦ ... ◦ T′j,


i.e. if there exists a sub-sequence of trees in C′ that can be connected by the rule r and the resulting tree is an element of C, the other trees in C′ being the same as in C. Notice that the rule r can be a unary rule. The relation ≤ is defined as the reflexive and transitive closure of the relation ≺. The relation ≤ is also antisymmetric: if C ≤ C′ and C′ ≤ C, then:

• If |C| denotes the number of trees in the coverage C, then |C′| ≤ |C| and |C| ≤ |C′|, so |C′| = |C|.

• If C′ ≺ C, then there exist T ∈ C and T′ ∈ C′ such that T = r1 ◦ T′ for some unary rule r1 from the grammar G. If also C ≺ C′, then T′ = r2 ◦ T. But this is not possible, because T = r1 ◦ T′. Notice that all the remaining corresponding trees in C and C′ have to be the same. Thus C ⊀ C′ and C′ ⊀ C, and also C ≡ C′, because the relation ≤ is the reflexive closure of the relation ≺.

As the relation ≤ is reflexive, transitive and antisymmetric, it corresponds to a partial order on the set of all coverages of a given input sentence. A maximum coverage (m-coverage) is a coverage that is maximum with respect to the ≤ relation. The coverage C1 = (T3) in Figure 4.4 is an m-coverage. The coverage C2 = (T1, T2) is not maximum, because C2 ≤ C1. There is also another m-coverage, C3 = (T4). Notice that C1 and C3 are not comparable by the ≤ relation. If there is a successful parse (a single derivation tree that covers the whole input sentence), then there are as many m-coverages as full parse trees, and every m-coverage contains only one tree. We had a long and interesting discussion (the author, Martin Rajman and Jean-Cédric Chappelier) about the present definition of maximum coverage. Alternative definitions are provided in Appendix A. These alternatives lack some features that the definition chosen in this chapter has (e.g. they are not antisymmetric).

4.1.2 Optimal m-coverage

In addition to maximality, we focus on optimal m-coverages, where optimality is defined with respect to different measures. In contrast to maximality, which is defined for coverages in general, the choice of an optimality measure depends on the target application.


Figure 4.4: An example to illustrate a maximum coverage.

The following two measures are proposed:

• The first optimality measure S1 relates to the average width (the number of leaves) of the derivation trees in the coverage. For an m-coverage C = (T1, T2, ..., Tk) of an input sentence w1, w2, ..., wn, where n > 1, the measure is defined as follows:

S1(C) = 1/(n−1) · (n/k − 1).

Notice that 0 ≤ S1(C) ≤ 1 and that n/k is the average width of the derivation trees in the coverage. With this measure, the value of a trivial coverage (i.e. one exclusively composed of lexical rules) is 0 and the value of a successful full parse is 1.

• The second measure favours coverages with the widest trees (trees with the largest number of leaves). Let

lmax(C) = max_{T ∈ C} |f(T)|

and

S2(C) = 1/(n−1) · (lmax(C) − 1)

for the number of input words n > 1. Similarly to S1, 0 ≤ S2(C) ≤ 1, the value obtained for a trivial coverage is 0 and the value of a successful full parse is 1.
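As a worked illustration (the numbers are chosen for this example only): for a sentence of n = 6 words, a coverage C consisting of k = 2 trees whose wider tree has lmax(C) = 4 leaves yields

S1(C) = 1/(6−1) · (6/2 − 1) = 0.4,    S2(C) = 1/(6−1) · (4 − 1) = 0.6,

while the trivial coverage (k = n, lmax(C) = 1) yields 0 for both measures and a successful full parse (k = 1, lmax(C) = n) yields 1.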


Figure 4.5: An example to illustrate the notion of optimal m-coverage.

Several other optimality measures could be defined. For instance, an optimality measure might be sensitive to the internal structure of the trees in a coverage, e.g. counting the number of nodes in the trees. These additional criteria can be used in combination with the measures S1 and S2. Figure 4.5 illustrates the optimal m-coverages C1 = (T1, T2, T3) and C2 = (T4, T5). The coverage C′1 = (T′1, T2, T3) is not an optimal m-coverage. The coverage C2 is better with respect to the measure S1 (S1(C1) < S1(C2)).

4.1.3 Probability of a coverage

The probability of a coverage is defined as the product of the probabilities of the trees it contains, i.e. for a coverage C, let

p(C) = ∏_{T ∈ C} p(T).

Notice that, by construction, the probability of any coverage is always less than or equal to the probability of the corresponding trivial coverage. The probability of a coverage can be viewed as another optimality measure, so the most probable coverages can be found in the same way as the optimal m-coverages. But usually all the optimal m-coverages (OMC) are found first (optimal with respect to some measure other than probability) and then the most probable one is chosen. Neither an OMC nor the most probable OMC is necessarily unique.


4.2 Finding optimal m-coverage

For robust parsing, a CF parsing algorithm that produces all possible incomplete parses has to be chosen (i.e. whenever there is a derivation tree that covers a part of the given input sentence, the algorithm produces that tree). This condition is usually satisfied by bottom-up parsers. Then the incomplete parses can be combined to find the maximum coverage(s). The algorithm described here finds an OMC with respect to the measure S1 (the average width of the derivation trees in the coverage), but it can easily be adapted to different optimality measures. All operations are applied to a set of Earley's items [Earley, 1970]. In particular, no changes are made during the parsing phase (except some initialization of internal structures for better efficiency of the algorithm). Dijkstra's algorithm [Dijkstra, 1959] for the shortest path problem in graphs is used to find OMCs with respect to the measure S1. The input graph for Dijkstra's algorithm consists of weighted edges and vertices: the edges are Earley's items and the weight of each edge is 1; the vertices are word positions, thus for n input words there are n + 1 vertices. Whenever Dijkstra's algorithm finds paths of equal length (i.e. with an identical number of items), the probability of an item is used to select the most probable ones. Notice that if there are no unknown words in the input, then there exists at least one path from position 0 to n, corresponding to the trivial coverage. Figure 4.6 illustrates an example of the input graph for Dijkstra's algorithm for the Earley items [A, 0, 2], [B, 2, 3], [C, 3, 4], [D, 0, 1], [E, 0, 3], [F, 1, 4] and [G, 1, 2]. The shortest paths are [E, 0, 3], [C, 3, 4] and [D, 0, 1], [F, 1, 4]. The paths correspond to two optimal m-coverages with two trees in each coverage. The graph is very similar to the chart structure (see Section 2.1.2) resulting from the CYK [Kasami, 1965, Younger, 1967, Aho and Ullman, 1972, Graham et al., 1980] parsing algorithm, if the CYK table is represented as Earley's items. The output of the algorithm is a list of Earley's items. An Earley item can represent several derivation trees and, to get an OMC, the most probable tree from each item is selected. The resulting OMC is not unique, because there can be several trees with the same probability.
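The search can be sketched as follows (a minimal sketch in C under simplifying assumptions: items are given as plain (from, to) edges, and the simple O(V²) form of Dijkstra's algorithm is used, which for unit weights behaves like breadth-first search; the item_t type and the tie-breaking by probability are illustrative):

#include <limits.h>

typedef struct { int from, to; double prob; } item_t;

/* dist[v]: minimal number of items needed to cover words 0..v;
 * back[v]: index of the last item on such a path (-1 at position 0).
 * Both arrays are caller-allocated with n_words + 1 entries; following
 * back[] from position n_words yields an optimal m-coverage w.r.t. S1
 * (the fewest, hence on average widest, trees). */
void find_omc(const item_t *it, int n_items, int n_words,
              int *dist, int *back)
{
    int done[n_words + 1];
    for (int v = 0; v <= n_words; v++) {
        dist[v] = INT_MAX; back[v] = -1; done[v] = 0;
    }
    dist[0] = 0;
    for (;;) {
        int u = -1;                          /* closest open vertex */
        for (int v = 0; v <= n_words; v++)
            if (!done[v] && dist[v] != INT_MAX
                         && (u < 0 || dist[v] < dist[u]))
                u = v;
        if (u < 0)
            break;
        done[u] = 1;
        for (int i = 0; i < n_items; i++)    /* relax all items */
            if (it[i].from == u && dist[u] + 1 < dist[it[i].to]) {
                dist[it[i].to] = dist[u] + 1;
                back[it[i].to] = i;
            }
        /* paths of equal length would be compared by probability */
    }
}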


Figure 4.6: The input graph for Dijkstra's algorithm (the items A–G as edges between the word positions 0–4) and the corresponding derivation trees.


Figure 4.7: Gluing with new rules added to the grammar. The bottom bold trees are in the OMC.

4.3 Gluing

The intended result for our robust parser is a derivation tree covering the whole input sentence. For this reason our goal is to connect (glue) the trees present in the OMC to construct a single one.

4.3.1 Gluing with new rules

The gluing can be realized by adding new rule(s) to the grammar. The new rules use new "fresh" non-terminals that are not in the original grammar; they just connect the roots of the trees together. The probability of such rules is set to 1. Notice that there might be several other ways of constructing a unique tree, and therefore our choice relies mainly on technical reasons. Figure 4.7 shows an example with the new rules S → X_L, X_L → X_L X, X_L → X and X → A_i, where S is the root of the grammar, X and X_L are new non-terminals and A_i is the root of the i-th tree in the coverage (there are three trees in this example). The dotted lines represent the newly added rules.
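With these rules (as reconstructed above), the three coverage roots A1, A2 and A3 would be connected by the derivation

S ⇒ X_L ⇒ X_L X ⇒ X_L X X ⇒ X X X ⇒ A1 A2 A3,

where each application of X_L → X_L X makes room for one more partial tree and X → A_i attaches the root of the i-th tree.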


Figure 4.8: Gluing by means of mapping non-terminals. The bottom hatched trees are in the OMC.

4.3.2 Gluing by means of mapping non-terminals

Another possibility is to create the top nodes of the resulting tree by a top-down parsing algorithm and then to glue these top nodes to the selected coverage. Notice that for reasonable grammars a tree with the following properties can be generated:

• the root is equal to the root of the grammar

• the number of leaves is equal to the number of trees in the glued coverage.

So, in this case, the gluing would be only a formula for how to connect two non-terminals. This approach is illustrated in Figure 4.8; the dotted lines represent the mapping function. The method has not been implemented, because many unsolved problems remain. The main challenge is to find out how to generate the top nodes with respect to the input sentence. A possible track to explore is to consider approaches derived from the head-corner parsing algorithm.


4.4 Conclusions

In this chapter, our approaches to robust CF stochastic parsing were presented. The optimal maximum coverage framework was introduced, as well as several measures for the optimality of the parser. Our definition of maximality is independent of the target application; on the other hand, the choice of an optimality measure is strongly application dependent. An algorithm that efficiently finds an OMC (with respect to the average width of the derivation trees) was proposed. The described ideas were successfully presented in [Kadlec et al., 2005]. The implementation of this algorithm in the SLP toolkit [Chappelier and Rajman, 1998a] was used by Marita Ailomaa [Ailomaa, 2004, Ailomaa et al., 2005b, Ailomaa et al., 2005a]; the results of her experiments are shown in Section 5.3. The presented algorithm was also integrated into the synt system (see Chapter 6). Experiments with Czech speech recognition data are provided in Section 5.3.3.

Chapter 5

Context-free Only Experiments

In this chapter, experiments with the context-free part of our system are presented, in two main sections. In Section 5.2.1, several results of our HDddm parser (see Section 3.2) are given. The second part of the chapter, Section 5.3, describes the robust parsing experiments.

5.1 Configuration of Experiments

Most of our experiments are reported on the standard data sets for parser comparison, which are available at http://www.cogs.susx.ac.uk/lab/nlp/carroll/cfg-resources/. These web pages resulted from discussions at the Efficiency in Large Scale Parsing Systems Workshop at COLING 2000, where one of the main conclusions was the need for a bank of data for the standardization of parser benchmarking. Three grammars, ATIS, CT and PT, are used in our experiments. The ATIS grammar consists of 4,592 rules, 192 non-terminals and 357 pre-terminals. The data set includes 98 sentences, of which 71 are grammatical and 27 do not belong to the language generated by the grammar. The CT grammar consists of 24,456 rules, 3,946 non-terminals and 1,032 pre-terminals. The data set includes 162 sentences, of which 150 are grammatical and 12 are not. The PT grammar consists of 15,039 rules, 38 non-terminals and 47 pre-terminals. The data set includes 30 grammatical sentences.


All non-grammatical sentences from these three data sets were excluded from our experiments. The fourth grammar used in the tests is the Czech grammar developed within the synt project, see Chapter 6. The grammar consists of 2,915 rules, 135 non-terminals and 40 pre-terminals. The data set usually includes 200 sentences from the corpora; exceptions are mentioned in the descriptions of the experiments. All sentences are grammatically correct.

5.2 Context-free experiments

The experiments shown in this section are related only to the context-free parsing algorithms that are implemented in our system. The experiments for the non-context-free part of our system (e.g. contextual constraints) are described in Chapter 6. Our work is restricted to lexicalised grammars, where terminals can appear only in lexical rules of the form A → wi. Such a non-terminal A is called a pre-terminal in the following text. This restriction allows us to simplify the implementation and it also enables us to separate the lexicon from the grammar. In fact, we do not use a Czech lexicon in our system, because the lexical rules are created by the morphological analyser ajka [Sedláček, 2005].

5.2.1 Comparison of Implemented CF Parsing Algorithms

Four different parsing algorithms are implemented in our system: Earley's top-down and bottom-up chart parsers [Earley, 1970], the HDddm parser (see Section 3.2) and Tomita's GLR parser [Tomita, 1986]. The chart parsers and HDddm were implemented by the author of this thesis; the GLR parser was implemented by the FI MUNI student Ondřej Macek [Macek, 2003]. All these implementations produce the same structures, thus applying contextual constraints or selecting the n best trees can be shared among them. First of all, our implementations of the most widely used parsing algorithms are compared. There is no "best" CF parsing algorithm: the evaluation is always grammar and input dependent, as demonstrated in the next sections. Here we used our grammar for Czech (see Chapter 6). Table 5.1 presents the average running times of the different parsing algorithms analysing 200 random sentences from the Czech corpus DESAM [Pala et al., 1997].


Algorithm                                      Time in seconds
Head-driven chart parser, proper heads                    8.24
Head-driven chart parser, improper heads                251.62
Earley's top-down chart parser                           32.79
Earley's bottom-up chart parser                          39.22
GLR                                                      15.24

Table 5.1: Comparison of different parsing algorithms

The results for the head-driven chart parser (the HDddm technique, see Section 3.2) depend strongly on the positions of the heads in the grammar rules. We developed a tool that generates optimal head positions for a given grammar and a set of input sentences; Section 5.2.3 contains more details about optimizing the positions of heads. Notice that only the time of CF parsing is measured. Because all our parsers produce the same chart structure, the time for evaluating semantic actions is the same for all of them. Clearly, the HDddm algorithm outperforms the others when the head positions are set correctly. However, these results have to be interpreted with respect to our grammar; for some other grammar, the order of the algorithms could be different.

5.2.2 Head-Driven dependent dot move variants

This section shows a comparison of the various HDddm parsers described in Section 3.2. Notice that actual implementations of iHD and sHD do not exist: the results are computed by the HD algorithm and post-processed by an external program, in which the edges generated by the HD algorithm are modified according to the iHD or sHD algorithms and repeated edges are omitted. Thus our results only estimate the effectiveness of the algorithms. The ATIS, PT and CT grammars (see Section 5.1) in the following results refer to the standard grammars from the benchmarking site. As the original data do not provide information about the heads of the rules, a simple heuristic has been employed for setting the heads in these cases. If a rule contains terminals, the leftmost one is chosen as the head. Otherwise, a distance measure from terminals is computed for all the right-hand side non-terminals (how many derivations are needed to get a rule with a terminal) and the leftmost one with the smallest "terminal distance" is taken as the head (a sketch of this heuristic is given at the end of this subsection). Notice that the heuristic does not work too well for ATIS, as the grammar contains many rules with the same non-terminal on the left-hand side and the right-hand side starting with the same terminal.


Grammar     HD          iHD        % of HD    sHD        % of HD
ATIS        882,673     793,370    89.8%      390,860    44.2%
ATIS (H)    401,782     362,568    90.2%      139,935    34.8%
PT          1,227,500   510,175    41.5%      456,736    37.2%
CT          638,276     606,591    95.0%      381,115    59.7%
Czech       994,402     915,004    92.0%      496,129    49.8%

Table 5.2: A comparison of the three discussed variants of the HD algorithm on the number of edges in the resulting chart

ATIS (H) is a variant of the ATIS grammar where the positions of the heads have been set to the best positions according to the chart size (for the HDddm algorithm); the optimization of the head positions can be found in Section 3.3. The benefits of the discussed HD refinements are demonstrated by the reduction of edges in the resulting charts. Table 5.2 summarizes the results. It is obvious that the improved HD parsing methods significantly reduce the size of the resulting chart; the average decrease in the number of chart edges is about 50%. Special attention should also be paid to the two variants of the ATIS grammar: the optimal positions of heads bring a slight downgrade for the iHD algorithm but a considerable improvement for sHD. Also note that the PT grammar is the only one in our experiments where the iHD improvement has a substantial effect.
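The head-selection heuristic described at the beginning of this subsection can be sketched as follows (a minimal sketch in C; the helpers are hypothetical, and terminal_distance(X) is assumed to be pre-computed as the minimal number of derivation steps from the non-terminal X to a rule containing a terminal):

#include <limits.h>

int rhs_length(int rule);
int rhs_symbol(int rule, int i);
int is_terminal(int sym);
int terminal_distance(int sym);   /* >= 1 for non-terminals */

int choose_head(int rule)
{
    int best = 0, best_dist = INT_MAX;
    for (int i = 0; i < rhs_length(rule); i++) {
        int sym = rhs_symbol(rule, i);
        /* terminals have distance 0, so a leftmost terminal wins */
        int d = is_terminal(sym) ? 0 : terminal_distance(sym);
        if (d < best_dist) {      /* strict '<' keeps the leftmost */
            best_dist = d;
            best = i;
        }
    }
    return best;                  /* head position on the RHS */
}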

5.2.3 Optimizing Heads for Parsing

Section 3.3 describes the optimization procedure used for the experiments in this section. The running times of the head-driven parser shown in the previous sections depend strongly on the positions of the heads in the grammar rules. In this section, more detailed results are reported on two natural language grammars and relevant sets of inputs. The first is the ATIS grammar and the corresponding data set described in Section 5.1. The experiment consists of 1,149,703 analyses (71 sentences × 4,592 grammar rules × the number of right-hand side symbols — 3.52 on average).


Grammar                      ATIS        ATIS-H     ATIS-HS
Total time (sec)             10.45       3.16       1.30
# edges                      1 345 544   398 261    148 866
# edges (optimal parser)     15 252      15 252     15 252
# edges per sentence         13 730.04   4 063.89   1 519.04
Parser optimality            1.13%       3.83%      10.25%
# edges / # edges ATIS       1.00        0.29       0.11

Table 5.3: HDddm parsing on ATIS grammars with different head positions

The input sentences were divided into two sets and the optimization process was run in parallel on two Pentium 2.4 GHz workstations. The analyses took 9 hours 23 minutes (summed total time from both machines, without initialization, grammar reading, etc.). The second series of tests was carried out on the Czech grammar; the technical details about this grammar are given in Section 5.1 and Chapter 6. This series consists of 2,469,600 analyses (100 sentences × 2,915 grammar rules × the number of right-hand side symbols — 8.47 on average). The analyses took 8 hours 32 minutes (total time without initialization, grammar reading, etc.) on a Pentium 2.4 GHz workstation. Table 5.3 presents the results of the optimization procedure for the ATIS grammar. The column captioned ATIS gives the parsing time for the baseline — the grammar with heads on the leftmost pre-terminal (if a rule does not contain any pre-terminal, the head is the leftmost symbol on the right-hand side). ATIS-H is the grammar resulting from the optimization described above. ATIS-HS gives the characteristics of parsing with the optimal grammars for individual sentences, so it is the best possible result of our optimization method. The number of edges for an optimal parser is the value which would be produced by an ideal parsing method that could determine in advance which edges will form the resulting chart. The chart itself (with some additional structures) is employed to represent the resulting derivation trees, thus only the edges used in this structure are considered. "Real world" parsers produce additional edges that are not used in any derivation tree; the optimal parser creates only the edges that are necessary to build up some derivation tree (with respect to the input sentence). The "Parser optimality" row shows the ratio between the number of edges produced by the tested parser and those of the hypothetical optimal parser.


Grammar                      Czech      Czech-H    Czech-HS
Total time (sec)             4.57       2.20       1.51
# edges                      822 989    306 937    191 666
# edges (optimal parser)     76 038     76 038     76 038
# edges per sentence         8 229.89   3 069.37   1 916.66
Parser optimality            9.2%       24.8%      39.7%
# edges / # edges Czech      1.00       0.37       0.23

Table 5.4: HDddm parsing on Czech grammars with different head positions

The results show that the parsing process using the generated grammar (ATIS-H) is more than three times faster than the original method with "naive" head positions. The ATIS-HS setting is not practical, as it uses different grammars for different input sentences; however, the comparison of ATIS-H and ATIS-HS clearly demonstrates that it could be worth looking for smart heuristics predicting the optimal head position of a grammar rule based on the particular input. Table 5.4 presents the results of the optimization procedure for the Czech grammar. Here the heuristic (heads on the leftmost pre-terminal) does not need to be used for the baseline: the column captioned Czech refers to the results of parsing with the grammar of Czech where the positions of the rule heads are given by linguistic intuition — the governing nodes are set as heads. The improvement of the parser performance is not so impressive in the case of the Czech grammar; however, the application of the head optimization algorithm still brings a considerable reduction of the time required for parsing. As the baseline grammar employs linguistic heads, the linguistically motivated setting can be compared with the experimentally found positions of heads. Table 5.5 lists the most frequent language phenomena covered by the grammar rules where the position of the linguistic head and the optimized position differ. For example, linguistic intuition suggests instantiating the reflexive verb clause (e.g. Karel se myl / Karel washed himself) from the given verb first; however, the empirical evidence indicates that, from the parser's optimality point of view, the best starting point in such a case is the reflexive particle (se or si in Czech). Similarly, for the other language phenomena listed in the table, the heads determined in our experiments differ from the traditionally defined ones.


Description              Rule example              Linguistic head   Empirical head
reflex. verb constr.     clause → subj R VR        VR                R
proper noun phrase       np → N np_prop_names      np                np_prop_names
conditional clause       clause → condc V_intr     V_intr            condc
genitive construction    npnl → np np_gen          np                np_gen

Table 5.5: Comparison of linguistically motivated and empirically determined heads

This experiment was presented in [Kadlec and Smrž, 2004]. The optimization procedure and the CF part were developed by the author of this thesis; the linguistic interpretation in Table 5.5 belongs to Pavel Smrž.

Optimizing heads for large grammars

The presented optimization procedure from Section 3.3 cannot be applied to huge grammars such as Suzanne108d3, because the method would take too much time. A comparison between head positions on the leftmost symbol of the RHS of each rule and randomly chosen heads demonstrates that in these cases randomly chosen heads can produce a much faster grammar. So machine learning algorithms, such as genetic algorithms, have a great potential here. Table 5.6 shows that our system is able to handle extremely large grammars effectively. A sub-part of the Susanne #3 corpus [Sampson, 1994] was used; only empty productions (traces) and a few obvious mistakes in the annotations (e.g. cycles) have been removed. Notice that we did not restrict ourselves to parsing part-of-speech tag sequences only, but worked on real word strings instead. For more information about these grammars, please see [Sampson, 1994, Ballim et al., 2000]. The data for this experiment were provided by Jean-Cédric Chappelier.


Grammar                  Suzanne108d3   SuzanneFullCF   Isis
# grammar rules          186,915        13,438          68,366
# non-terminals          2,371          760             6,594
# sentences              399            11              513
HDddm1 (time in sec.)    514.63         49.66           57.31
HDddm2 (time in sec.)    342.74         73.76           74.50

Table 5.6: Comparison of different head positions in the input grammars, HDddm1 – heads are on the leftmost symbol from the RHS of the rule, HDddm2 – randomly chosen heads.

5.2.4 Comparison with Different Parsing Systems

In this section, our CF parser using the HDddm technique is compared with a couple of parsers by different authors. Unfortunately, there is no general evaluation procedure acceptable to all researchers and developers. For example, the number of edges in the chart/items, or events [Roark and Charniak, 2000], is usually given as a measure for comparing chart parsing algorithms. However, as shown in [Moore, 2000b], there can be algorithms that do not differ in this respect but whose processing times on a given grammar and input differ considerably. The primary method of assessing the efficiency of a parsing algorithm is therefore only empirical: one has to compare the time taken to parse a set of test sentences by each particular parser based on a shared grammar.

Moore’s Parser The best results reported on standard data sets (ATIS, CT and PT gram- mars) are the comparison data by Moore [Moore, 2004]. The results of the parser comparison appear in Table 5.7. The values in the table give the total CPU times in seconds required by the parser to completely process the test set associated with the grammar. The longer running times of our system for the CT grammar are caused by low ambiguity of the grammar. Our parsing technique is suitable for highly ambiguous grammars such as PT grammar. Notice, that Moore’s parser is implemented in different (Perl) than ours (C language). But only running times of the CF algorithms are compared


Grammar, algorithm                      Time in seconds
ATIS grammar, Moore's LC3 + UTF         11.60
ATIS grammar, our system                 4.19
CT grammar, Moore's LC3 + UTF            2.70
CT grammar, our system                   4.29
PT grammar, Moore's LC3 + UTF           41.80
PT grammar, our system                  17.75

Table 5.7: Running times comparison (in seconds)

These results were presented in [Horák et al., 2002].

SLP toolkit

The second parser compared here with ours is the SLP toolkit. The SLP toolkit [Chappelier and Rajman, 1998a] provides a fast and robust bottom-up chart parsing algorithm derived from Earley's chart parsing [Earley, 1970] and CYK [Kasami, 1965, Younger, 1967, Aho and Ullman, 1972, Graham et al., 1980]. For the current version, please see http://slptk.sourceforge.net. The comparison with the SLP toolkit is provided in this text for the following reasons:

• The SLP toolkit provides similar functionality to the context-free part of our system (e.g. generation of best trees — see below, detailed time statistics, etc.).

• It uses a completely different parsing algorithm (CYK).

• The algorithms from Chapter 4 were implemented in the SLP toolkit as well as in our system (both implementations were created by the author of this thesis).

As in the previous sections, the experiments shown in this section are related only to the context-free part of the synt system (see Chapter 6). Three very different grammars were chosen for the comparison: the "S-grammar", the ATIS grammar and the Isis grammar (see below).


The "S-grammar" consists of two grammar rules:

S -> S S
S -> "1"

The lexicon contains only one entry, the number "1". The input is one sentence containing the number "1" 220 times. The grammar is extremely ambiguous: the number of derivation trees for our input is 2,144,211,376. The ATIS grammar and its input are described in Section 5.1. The Isis grammar [Ballim et al., 2000] consists of 17,141 rules; the large Isis lexicon contains 554,038 words and compounds. Ten highly ambiguous sentences from the Isis corpus were selected as input. Because synt uses a head-driven algorithm, the heads of the grammar rules were selected as follows: the head position for the "S-grammar" is the left "S" in the production rule; for ATIS and Isis, the heads were chosen by the already mentioned heuristic (if a rule contains terminals, the leftmost terminal is chosen as the head, otherwise the leftmost non-terminal is chosen). Table 5.8 shows the results of the experiment. They represent summed times over all sentences of the respective data set. Times of parser initialization are not measured in synt. For a complete overview including initialization and closing, the Linux command time was used (see http://www.gnu.org/software/time/time.html, or type 'man 1 time' on most Unix/Linux systems). The row "real" represents the total time used by the process, the row "user" the CPU time used directly by the process, and the row "sys" the CPU time used by the operating system (i.e. operating system calls such as reading files from the disk or allocating memory). The rows labelled "Best tree" give the time for computing the most probable derivation tree with respect to the probabilities of the rules in the input grammar. In the SLP toolkit, the computation of the best tree is an unavoidable part of the parsing process, thus the time of context-free parsing includes the time of this computation; in synt these steps are separated, so the time for the computation of the best tree alone is given. A further difference between the compared systems is that synt uses a more general algorithm for computing the n best trees, so even in this case, where only one tree is generated, the more general (and thus slower) algorithm is applied. The SLP toolkit operates with a grammar and a lexicon pre-compiled into a binary form. synt does not have such a pre-compilation; it reads the grammar directly from a human-readable text format.


S-grammar                      SLP Toolkit    synt
Initialization (ms)            11160          -
CF parsing (ms)                5120           -
Analysis (ms)                  16280          2940
Best tree (ms)                 -              7820
Overall time measured by the time command:
real                           17.425s        12.374s
user                           16.690s        12.080s
sys                            0.210s         0.290s

ATIS grammar                   SLP Toolkit    synt
Initialization (ms)            930            -
CF parsing (ms)                2730           -
Analysis (ms)                  3660           5320
Best tree (ms)                 -              50
Overall time measured by the time command:
real                           4.276s         7.068s
user                           4.210s         4.100s
sys                            0.040s         2.980s

Isis grammar                   SLP Toolkit    synt
Initialization (ms)            2820           -
CF parsing (ms)                2320           -
Analysis (ms)                  5140           510
Best tree (ms)                 -              0
Compilation of the grammar     27.123s        -
Overall time measured by the time command:
real                           6.725s         7.851s
user                           6.380s         7.100s
sys                            0.330s         0.730s

Table 5.8: Comparison between synt and SLP toolkit.

However, the compilation of the grammar does not affect the results, except in the experiment with the Isis grammar. During the calculation of the best tree in the SLP toolkit, all CYK items are traversed by the algorithm; in synt, only the items (edges) reachable from the "successful" root edge are used. This difference is clearly visible in the results for the "S-grammar". The number of "words" in the input — 220 — also influences the results here, because of the CYK algorithm used in the SLP toolkit: the average time complexity of the CYK algorithm has an n³ component (where n is the number of words in the input sentence). For typical natural language grammars and inputs, this n³ factor is much lower than the factor representing the size of the grammar, but for this artificial "S-grammar" it dominates. The experiment with the ATIS grammar shows an interesting amount of time consumed by the synt parser in operating system calls (the "sys" row). The experiment was repeated several times on different computers with different processors and configurations, but for the ATIS grammar this system time always had an impact on the overall time. The reason for this was not completely discovered; one possible cause is a larger amount of memory allocation and de-allocation during the computation of the best trees, and an improper hashing function for the names of non-terminals could also be the problem. Notice that the time for computing the best tree was less than one millisecond for many sentences of the ATIS data set, which is why the summary is only 50 milliseconds. In the experiment with the Isis grammar, synt spends most of the time reading the input grammar. As said above, the SLP toolkit works with a grammar pre-compiled into a binary form, so it can be read faster; thus the time of the compilation of the grammar is also provided.

5.2.5 Conclusion

The results of the experiments with Moore's parser and with the SLP toolkit show that a comparison of parser effectiveness has to be done with respect to the tested grammar. Our parser is fully comparable with those mentioned. For highly ambiguous grammars, such as the PT grammar or the Isis grammar, our head-driven algorithm proves its strength.


5.3 Robust Parsing

The following section describes the experiments with our robust parsing technique presented in Chapter 4.

5.3.1 Implementation

The SLP toolkit (see Section 5.2.4) and synt (see Chapter 6) were used to implement the ideas from Chapter 4. The experiments for English grammars were performed with the SLP toolkit; the experiments with speech recognition of Czech were computed by synt.

5.3.2 Experiments with English

ATIS and Susanne

The experiment described here was carried out by Marita Ailomaa [Ailomaa, 2004, Ailomaa et al., 2005b, Ailomaa et al., 2005a]. The robust parsing technique was tested on subsets of two tree-banks, ATIS [Hemphill et al., 1990] and Susanne [Sampson, 1994]. From these tree-banks, two separate grammars with different characteristics were extracted. Concretely, each tree-bank was divided into a learning set that was used for producing the probabilistic grammar and a test set that was then parsed with the extracted grammar. About 10% of the sentences in the test set were not covered by the grammar. They represented the real focus of our experiments, as the goal of a robust parser is to process the sentences that the initial grammar fails to describe. For each sentence, the 1-best derivation tree was categorized as good, acceptable or bad, depending on how closely it corresponded to the reference tree in the corpus and how useful the syntactic analysis was for extracting a correct semantic interpretation. The results are presented in Table 5.9. It may be argued that the definition of a "useful" analysis might not be decidable only by observing the syntactic tree. Although we found this to be a quite reasonable hypothesis during our experiments, a more objective procedure should be defined. In a concrete application, usefulness might, for example, be determined by the methods that the system should perform based on the produced syntactic analysis.


                  Good (%)   Acceptable (%)   Bad (%)
ATIS corpus       10         60               30
Susanne corpus    16         29               55

Table 5.9: Experimental results. Percentage of good, acceptable and bad analyses.

From the experimental results, one can see that our technique behaves better with the ATIS grammar, which contains relatively few rules, than with Susanne, which is a considerably larger grammar describing a rich variety of syntactic structures. The number of bad 1-best analyses can be explained by the fact that the probabilistically best analysis is not always the linguistically best one. This is a non-trivial problem related to all types of natural language parsing, not only to robust parsers.

5.3.3 Speech recognition of Czech

The parsing system synt (described later in Chapter 6) with the robust algorithm from Chapter 4 was used to run this experiment. The only modification of the standard processing is the elimination of the simplest form of repetitions of single words in the pre-processing phase. The morphological analysis is performed by the Czech morphological analyser Ajka [Sedláček, 2005]. If a word (form) is not recognized by the morphological analyzer (often due to the colloquial language used), an UNKNOWN pre-terminal is returned to the parser. This pre-terminal is present in special rules of the grammar (e.g. it can form a proper noun if the unknown word starts with an uppercase character). The language model based on our parser has been tested on a very hard task — speech recognition of a recorded lecture. This explains the overall high word error rate (when compared with the recognition accuracy of the best speaker-adapted dictation systems in good acoustic conditions — close microphones, no background noise). The lecture was captured in a standard lecture room at our faculty; a poor camcorder microphone produced the audio stream. The content of the record to be recognized is rather specific: the vocabulary contains a lot of terminology concerning the topic of the course — digital signal processing. The recognition algorithm combines 8-Gaussian HMMs (Hidden Markov Models) with the MLLR (Maximum Likelihood Linear Regression) and MAP (Maximum A Posteriori probability) speaker adaptation techniques (see [Glembek, 2005] for details).


                        WER       SAC
Word 3-gram model       36.99 %   22.37 %
synt robust parsing     33.72 %   23.17 %
3-gram + parsing        32.47 %   24.47 %

Table 5.10: Recognition results of the different language models

The system was trained on the Czech part of SpeechDat-E, which comprises 1,052 speakers recorded over the land-line telephone network. The adaptation was performed on 9 lectures; a text transcription was provided for one of them — about 150 minutes. For testing purposes, we integrated the parser with the speech recognizer in a simple way. The best 200 transcriptions were generated from the word lattices provided by the speech recognizer. These hypotheses were then re-scored by our parsing technique, and the reported results are computed on the output of this process. The tested word N-gram model combines a very large general language model based on a 536-million-word corpus with a specialized model created from 9,308 words of the lecture transcription. Another 6,213 words of the transcription (873 utterances) were used for testing. Table 5.10 presents the results of the speech recognition with the particular language models and their combination. The second column summarizes the word error rate (WER) — the minimum edit distance on the words. The third column contains the sentence accuracy (SAC) on the 873 testing utterances. The language model based on our robust parsing method outperforms the word-based model which combines the general and focused N-gram models; the relative increase of the recognition accuracy is 6.33%. The results are very interesting, because the syntactic model outperforms the 3-gram model, which is unusual. A possible reason could be that the 3-gram model was trained on a different domain. The data for this experiment, the best 200 transcriptions for each of the tested utterances, were provided by Pavel Smrž; the experiment itself was conducted by the author of this thesis.


5.4 Conclusions

The results of the experiments presented in this chapter show that our HDddm algorithm works efficiently on a wide range of context-free grammars. Especially for grammars that are very ambiguous, the algorithm is faster than the compared parsers. The HDddm parsing proceeds bottom-up, which allows us to experiment with the robust extension of the algorithm. The evaluation of the robust parsing technique from Chapter 4 for English grammars was based on manually checking the derivation trees. An important issue is to integrate the technique into a target application, so that we have more realistic ways of measuring the usefulness of the produced robust analyses (like Parseval [Black, 1992] labelled precision and recall). The implementation of our robust algorithm in the synt system also allows us to run robust experiments for Czech. The results of our system are better than those of the N-gram model, which is unexpected, as the N-gram model is a specialization of a very large general language model. Our hypothesis for why our system outperforms the N-gram model is that the N-grams are not properly trained for the specific domain of the experiment. This result is also interesting from another point of view, because our robust method has not been targeted at free word order languages such as Czech.

Chapter 6

Implementation – synt Project

In this chapter, the implementation of the parsing system from Chapter 3 — the synt project — is described. The history of synt started in 1999 and the first results were published in [Kadlec, 2000, Horák et al., 2002]. The project consists of several parts. One of the main parts is the so-called meta-grammar for the Czech language, described in Section 6.1. It is created by hand, and several other grammar formats are generated from it to allow processing. The implementations of semantic actions and contextual constraints for every grammar rule (see Section 3.5) are also essential parts of the grammar. The second part of the synt project is the parsing system itself. The detailed description of this system is given in Chapter 3; the overview of all parsing modules in synt and the data flow through the system are demonstrated in Figure 3.1. Several researchers cooperate on the whole project. The meta-grammar and the content of the semantic actions (the linguistic part of the system) are maintained by Aleš Horák at the moment. The parsing system has been created and developed by the author of this thesis; on some parts, e.g. the evaluation of valency actions, we work together. Selected sub-parts, such as the implementation of the GLR context-free parsing algorithm (see Section 5.2.1) or the front-ends to synt (see Section 6.5), have been developed by our students. In the following sections, the meta-grammar, the parser implementation and the results of experiments are presented. Some experiments with the context-free part of synt are described in Chapter 5; only experiments using the parser's ability to evaluate contextual constraints and semantic actions are presented in this chapter.


6.1 Grammar

The meta-grammar concept [Horák and Kadlec, 2005] in synt consists of three grammar forms denoted as G1, G2 and G3. Human experts work with the meta-grammar form, which encompasses high-level generative constructs reflecting meta-level natural language phenomena like word order constraints, and enables the language to be described with a maintainable number of rules. The meta-grammar serves as a base for the second grammar form, which comes into existence by expanding the constructs. This grammar consists of context-free rules equipped with feature agreement tests and other contextual actions. The last phase of the grammar induction lies in the transformation of the tests into standard rules of the expanded grammar, with the actions remaining to guarantee the contextual requirements.

Meta-grammar (G1) The meta-grammar consists of global order constraints that safeguard the succession of given terminals, of special flags that impose particular restrictions on given non-terminals and terminals on the right hand side (RHS), and of constructs used to generate combinations of rule elements.

The arrow in a rule specifies the rule type (->, -->, ==> or ===>). A hint to the meaning of the arrow form can be expressed as 'the thicker and longer the arrow, the more (complex) actions are to be done in the rule translation'. The smallest arrow (->) denotes an ordinary CFG transcription, whereas the thick extra-long arrow (===>) inserts possible inter-segments between the RHS constituents, checks the correct order of enclitics and supplies several forms of the rule to make the verb phrase into a full sentence.

The global constructs (%enclitic, %order and %merge_actions) represent universal simple regulators, which are used to inhibit some combinations of terminals in rules, or which specify the actions that need some special treatment in the meta-grammar form translation.

The main combining constructs of the meta-grammar are order(), rhs() and first(), which are used for generating variants of assortments of given terminals and non-terminals.

    /* budu se ptat - I will ask */
    clause ===> order(VBU,R,VRI)

    /* ktery ... - which ... */
    relclause ===> first(relprongr) rhs(clause)


The order() construct generates all possible permutations of its components (see the sketch after the following example). The first() and rhs() constructs are employed to implant the content of all the right hand sides of a specified non-terminal into the rule. The rhs(N) construct generates the possible rewritings of the non-terminal N. The resulting terms are then subject to standard constraints, enclitic checking and inter-segment insertion. In some cases, one needs to force a certain constituent to be the first non-terminal on the RHS. The construct first(N) ensures that N is firmly tied to the beginning and can neither be preceded by an inter-segment nor by any other construct.

There are several generative constructs for defining rule templates to simplify the creation and maintenance of the grammar. One group of such constructs is formed by a set of %list_* expressions, which automatically produce new rules for a list of the given non-terminals, either simply concatenated or separated by commas and co-ordinative conjunctions.

A significant portion of the grammar (about 40 %) is made up of the verb group rules. Therefore we have been seeking an instrument that would capture frequent repetitive constructions in verb groups. The obtained addition is the %group keyword, illustrated by the following example:

    %group verbP={
        V:    verb_rule_schema($@,"(#1)") groupflag($1,"head"),
        VR R: verb_rule_schema($@,"(#1 #2)") groupflag($1,"head"),
    }

    /* ctu/ptam se - I am reading/I am asking */
    clause ====> order(group(verbP), vi_list)
        verb_rule_schema($@,"#2")
        depends(getgroupflag($1,"head"), $2)

Here, the group verbP denotes two sets of non-terminals with the corresponding actions that are then substituted for the expression group(verbP) on the RHS of the clause non-terminal. In order to be able to refer to verb group members in the rules where the group is used, any group term can be associated with a flag (any string). By that flag an outside action can refer to the term later with the getgroupflag construct.
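To illustrate how such G1 constructs are expanded, the following Python sketch generates the plain context-free rules that order() contributes to G2. The rule representation is hypothetical, and the real translation additionally handles inter-segment insertion and enclitic checks, which are omitted here.

    from itertools import permutations

    # A G1 rule such as  clause ===> order(VBU, R, VRI)  is expanded in G2
    # into one plain CFG rule per permutation of the listed constituents.

    def expand_order(lhs, constituents):
        """Return the G2 rules generated by order() on the RHS of `lhs`."""
        return [(lhs, list(p)) for p in permutations(constituents)]

    # expand_order("clause", ["VBU", "R", "VRI"]) yields 3! = 6 rules, e.g.
    #   clause -> VBU R VRI,  clause -> VBU VRI R,  clause -> R VBU VRI, ...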


Many rules, e.g. those prescribing the structure of a clause, share the same rule template: they have the same requirements for inter-segment filling, for the enclitics order checking and for the RHS term combinations. To enable a global specification of such a majority of rules, a rule template mechanism is provided. It defines a pattern for each such rule (the rule type and the RHS encapsulation with some generative construct).

Some grammatical phenomena occur very rarely in common texts. The best way to capture this sparseness is to train rule probabilities on a large data bank of derivation trees acquired from corpus sentences. Since the preparation of such a corpus of adequate size (at least tens of thousands of sentences) is a very expensive and tedious process, we currently overcome this difficulty by defining rule levels. Every rule without a level indication is of level 0. The higher the level, the less frequent the corresponding grammatical phenomenon is, according to the guidance of the linguist. Rules of higher levels can be switched on or off according to the chosen level of the whole grammar (see the sketch below).

Apart from the common generative constructs, the meta-grammar comprises feature tagging actions that specify certain local aspects of the denoted (non-)terminal. One of these actions is the specification of the head-dependent relations in the rule — the head() and depends() constructs, which allow the dependency links between rule terms to be expressed.
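A possible reading of the rule-level mechanism as code (hypothetical rule representation, not the synt data structures):

    # Rules marked with a higher level describe rarer phenomena and are
    # included only when the whole grammar is run at (at least) that level.

    def active_rules(rules, grammar_level):
        """rules: iterable of (lhs, rhs, level); level defaults to 0 in G1.
        Returns the rules used for parsing at the given grammar level."""
        return [(lhs, rhs) for (lhs, rhs, level) in rules
                if level <= grammar_level]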

The Second Grammar Form (G2) As mentioned earlier, several predefined grammatical tests and procedures are used in the description of the context actions associated with each grammatical rule of the system. The pruning actions include:

• case test for particular words and noun groups

• agreement test of case in prepositional construction

• agreement test of number and gender for relative pronouns

• agreement test of case, number and gender for noun groups

• type checking of logical constructions [Horák, 2002b]

    np -> adj_group np
        rule_schema($@, "lwtx(awtx(#1) and awtx(#2))")
        rule_schema($@, "lwtx([[awt(#1),#2],x])")


Figure 6.1: Generative construct %list_coord_case_prep in the grammar G1 and the appropriate generated rules and actions in G2 and G3.

The contextual actions propagate_all, agree_* and propagate propagate all relevant grammatical information from the selected non-terminals on the RHS to the non-terminal on the left hand side of the rule. The rule_schema action presents a prescription for building a logical construction out of the sub-constructions from the RHS. Each time, a type checking mechanism is applied and only the type-correct combinations are passed through.

Expanded Grammar Form (G3) The feature agreement tests can be transformed into context-free rules. For instance in Czech, as in other Slavic languages, there are 7 grammatical cases (nominative, genitive, dative, accusative, vocative, locative and instrumental), two numbers (singular and plural) and four genders (masculine, feminine and neuter, where masculine exists in two forms — animate and inanimate). Thus, this produces 56 possible variants for a full agreement between two constituents (a sketch of this expansion follows below).

Figure 6.1 illustrates the generative construct %list_coord_case_prep in G1, which produces two context-free rules with pruning actions in G2 and fourteen context-free rules in G3. The grammars are displayed by the GrammarView module, which is part of the Grammar Development Workbench environment, see Section 6.5.

The number of rules naturally grows in the direction G1 < G2 < G3. The current numbers of rules in the three grammar forms are 253 in G1, 3091 in G2 and 11530 in G3, but the grammar is still being developed and enhanced.
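The following sketch (hypothetical rule encoding, not the actual grammar generator) shows the G2 → G3 expansion of a full agreement test; the 7 × 2 × 4 = 56 combinations each yield one specialized context-free rule:

    from itertools import product

    # Expanding a feature agreement test into plain context-free rules:
    # each combination of case, number and gender yields one rule in which
    # both agreeing constituents carry the same feature values.
    CASES   = ["nom", "gen", "dat", "acc", "voc", "loc", "ins"]  # 7 cases
    NUMBERS = ["sg", "pl"]                                       # 2 numbers
    GENDERS = ["masc_anim", "masc_inan", "fem", "neut"]          # 4 genders

    def expand_agreement(lhs, left, right):
        """Replace `lhs -> left right` + agreement test by 56 CF rules."""
        rules = []
        for c, n, g in product(CASES, NUMBERS, GENDERS):
            feat = f"_{c}_{n}_{g}"
            rules.append((lhs + feat, [left + feat, right + feat]))
        return rules   # len(...) == 7 * 2 * 4 == 56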

6.2 Parser

The parser in synt is a practical implementation of the parsing system described in Chapter 3; see Figure 3.1 for the system overview and data flow. Words in the input sentence can be optionally tagged. If they are not tagged, the Czech morphological analyzer ajka [Sedláček, 2005] is used; in this case, ambiguities are left in the input. The terminals for the given context-free grammar are created by simplifying the tags, e.g. using only the word category as a terminal (sketched below).

For the context-free part of the analysis, the HDddm algorithm (described in Section 3.2) is applied. Other context-free parsers are also implemented in synt, but this one is the fastest for our grammar; see Section 5.2.1 for a comparison.

It was shown [Barton et al., 1987] that parsing is in the general case an NP-complete problem if grammars are allowed to have agreement features. Most of the pruning constraints in synt are weaker than general feature structures, which allows an efficient implementation with the following properties. A node in the derivation tree has only a limited number of values (e.g. the cardinality of the value set for noun groups in our system is at most 56 = 7 cases × 2 numbers × (3 + 1) genders). Some of our experiments use semantic actions with an unlimited domain range; see Section 6.3.2 for such an experiment. A new forest of values is built instead of pruning the original context-free parsing result (the packed shared forest); see Section 3.5 for details.
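The tag simplification can be pictured as follows; the attributive tag format (e.g. k1gInSc1, where the leading k attribute encodes the word category) follows the ajka convention, but the helper itself is a hypothetical illustration:

    # Sketch of deriving CFG terminals from (possibly ambiguous)
    # morphological tags: using only the word category means keeping
    # just the leading "k" attribute of each proposed tag.

    def tags_to_terminals(ambiguous_tags):
        """ambiguous_tags: all tags ajka proposes for one word.
        Returns the set of CFG terminals (word categories)."""
        return {tag[:2] for tag in ambiguous_tags if tag.startswith("k")}

    # tags_to_terminals({"k1gInSc1", "k1gInSc4", "k5eAaImIp3nS"})
    #   == {"k1", "k5"}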


#sentences                                   10000 sentences
#words                                       191034 words
Maximum sentence length                      155 words
Minimum sentence length                      2 words
Average sentence length                      19.1 words
Time of CFG parsing                          5 minutes 52 seconds
Time of evaluating constraints and actions   38 minutes 32 seconds
Overall time with freeing memory             46 minutes 9 seconds
Average #words per second                    68.97
Size of the log file                         1.2 GiB
#accepted sentences                          9208

Table 6.1: Results of running the synt parser on 10,000 DESAM corpus sentences

6.3 Experiments

In this section, two experiments are described. The first one shows the speed of the analyser working on corpus data. In the second part, one of the first automatic experiments with the lexicon of verb valences called VerbaLex is presented.

6.3.1 Real data test

In the current stage of the meta-grammar development, synt has achieved an average of 92.08 % coverage on corpus sentences, with about 84 % of cases where the correct syntactic tree was present in the result. For a comparison with different Czech parsers, see Section 6.4.

The average time of analysis of one sentence from the corpus data was 0.28 seconds on an Intel Xeon 2.2 GHz. The average running time includes the generation of log file messages. Detailed results of analysing the DESAM [Pala et al., 1997] corpus can be found in Table 6.1.


Number of sentences:
    count     4117
Number of words in sentence:
    minimum   2.0
    maximum   68.0
    average   16.8
    median    15.0
Number of discovered valency frames:
    minimum   0
    maximum   37080
    average   380
    median    11
Elapsed time:
    minimum   0.00 s
    maximum   274.98 s
    average   6.86 s
    median    0.07 s

Table 6.2: The results of verb frame extraction from the corpus DESAM.

6.3.2 Verb Valences

In this section, the results of experiments with filtering by verb valences are presented. The algorithm and more details about the valency lexicon VerbaLex [Hlaváčková et al., 2006] can be found in Section 3.6.

The results of the automatic verb frame extraction were measured on 4117 sentences from the Czech corpus DESAM [Pala et al., 1997]. Only sentences which are analysed at rule level 0 were selected; they do not contain analytically difficult phenomena like non-projectivity or adjective noun phrases. Even on those sentences the number of possible valency frames can be quite high (see Table 6.2). However, if we work with intersections of those possible valency frames, we can get a useful reduction of the number of resulting derivation trees (a sketch of this filtering follows below).

These results of the exploitation of VerbaLex in the syntactic analysis of Czech are very promising. Enlarging the lexicon to a representative number of Czech verbs should allow the synt system to detect the correct derivation tree in many cases which were unsolvable so far. The results from this section are presented in [Hlaváčková et al., 2006].
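The intersection-based filtering can be sketched as follows (the frame encoding is purely illustrative; the actual algorithm of Section 3.6 operates on the packed forest of values):

    # Each candidate derivation tree proposes a set of valency frames
    # compatible with it; only trees whose frames survive the intersection
    # with the frames the lexicon licenses for the verb are kept.

    def filter_by_valences(candidate_trees, lexicon_frames):
        """candidate_trees: list of (tree, frames) pairs, frames being the
        set of valency frames compatible with that derivation tree.
        lexicon_frames: frames VerbaLex licenses for the clause's verb."""
        return [tree for tree, frames in candidate_trees
                if frames & lexicon_frames]   # non-empty intersection

    # With lexicon_frames = {"nom-acc"}, a tree proposing {"nom-dat"} is
    # pruned, while a tree proposing {"nom-acc", "nom-ins"} is kept.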

6.4 Comparison of dependency and phrasal parsers

In this section the results of experiments comparing dependency and phrasal parsers are shown, as presented in [Horák et al., 2007]. We compare stochastic parsers that provide dependency trees as their outputs with the meta-grammar parser synt (see Chapter 6), which generates a packed forest of phrasal derivation trees.

6.4.1 Compared Dependency Parsers

The set of dependency parsers selected and denoted as the Prague parsers contains the following representatives:

McD McDonald's maximum spanning tree parser [McDonald, 2006],

COL Collins’s parser adapted for PDT [Hajiˇcet al., 1999],

ZZ Žabokrtský's rule-based dependency parser [Holan and Žabokrtský, 2006],

AN Holan’s parser ANALOG – it has no training phase and in the parsing phase it searches in the training data for the most similar local tree configuration [Holan, 2005],

L2R, R2L, L2R3, R2L3 Holan’s push-down parsers [Holan, 2004],

CP Holan’s and Zabokrtsk´y’sˇ combining parser [Holan and Zabokrtsk´y,ˇ 2006].

The selection of the Prague parsers was limited to the parsers contained in the combining parser (CP), which currently achieves the best known results on PDT. Other parsers, e.g. Hall and Novák's corrective modelling parser [Hall and Novak, 2005] or Nilsson, Nivre and Hall's graph transformation parser [Nilsson et al., 2006], were not included in the comparison, since we currently do not have their results for all sentences of the testing data set.


6.4.2 Main Problems

The most fundamental difference between the parsers is the underlying formalism and methodology of the parsing process. This is, however, not the main difference that would cause problems in the parser comparison. In this section, we concentrate on the problems arising from the different output data structures and the different presuppositions on the input text, which all need to be resolved before we can start the real comparison.

The output of the Prague parsers is formed by dependency trees or graphs, whereas the output of synt is basically formed by a packed shared forest of phrasal trees. In order to be able to compare this forest with the one tree obtained from the PDT 2.0 conversion procedure (see below), the first 100 (or fewer) trees were extracted for each sentence and sorted according to their tree rank. Each of these trees was then compared to the one from PDT and the results are displayed as the following three numbers (a sketch of this evaluation follows below):

• best trees – the tree from the set that is most similar to the desired tree is selected and compared;

• first tree – the tree with the highest tree rank is selected and compared;

• average – the average over all trees is presented.

The next problem is that the output of synt is always in the form of projective trees; a non-projective phrase can, in some cases, be analysed with the mechanism of different rule levels allowing special kinds of phrases to be handled. Nevertheless, synt is not suitable for analysing non-projective sentences at the moment. On the other hand, the output of the Prague parsers, as a set of dependency edges between words, can cross the word surface order without problems. Thus they can represent projective as well as non-projective sentences.
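The three numbers can be computed as in this sketch, where the similarity function is left abstract (in the experiments it is one of the measures defined in Section 6.4.3):

    # Best/first/average evaluation over the (at most) 100 trees extracted
    # for one sentence.  `trees` is assumed to be sorted by tree rank,
    # highest rank first; `similarity` compares a tree to the PDT reference.

    def best_first_average(trees, reference, similarity):
        scores = [similarity(t, reference) for t in trees]
        return {
            "best trees": max(scores),        # most similar tree in the set
            "first tree": scores[0],          # tree with the highest rank
            "average":    sum(scores) / len(scores),
        }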

6.4.3 Comparison method

Since the measurements had to be done on several thousands of sentences, we decided to use the Prague Dependency Treebank, version 2.0 (PDT-2.0), created in the Institute of Formal and Applied Linguistics, http://ufal.mff.cuni.cz [Hajič, 2004]. Since this treebank provides only the dependency trees for more than 80 thousand Czech sentences, we decided to convert them to phrasal trees using Collins's conversion tool [Collins, 1998] and then measure the differences between the synt output and the PDT-2.0 converted "phrasal" tree.

The methodology for measuring the results of dependency parsing is usually defined as the computation of the precision and recall of the particular dependency edges in the resulting graph/tree. These parameters are measured for each lexical item and the result is then computed as an average precision and average recall over the whole set.

In the case of phrasal trees we use the two following measures: PARSEVAL [Black, 1992] and the leaf-ancestor assessment (LAA) [Sampson, 2000, Sampson and Babarczy, 2003]. The PARSEVAL scheme utilizes only the bracketing information from the parser output to compute three values (a sketch follows the list):

• crossing brackets – the number of brackets in the tested analyzer's parse that cross brackets in the treebank parse.

• recall – the ratio of the number of correct brackets in the analyzer's parse to the total number of brackets in the treebank parse.

• precision – the ratio of the number of correct brackets in the analyzer's parse to the total number of brackets in the analyzer's parse.
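A minimal sketch of the three PARSEVAL values over unlabelled bracket spans (each bracket is a (start, end) pair; this is one common variant of the scheme and assumes non-empty bracket sets):

    # Two brackets cross iff they overlap without one containing the other.

    def parseval(test_brackets, gold_brackets):
        test, gold = set(test_brackets), set(gold_brackets)
        correct = len(test & gold)
        crossing = sum(
            1 for (a, b) in test
            if any(a < c < b < d or c < a < d < b for (c, d) in gold)
        )
        return {
            "crossing brackets": crossing,
            "recall":    correct / len(gold),
            "precision": correct / len(test),
        }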

There are several known limitations [Bangalore et al., 1998] of the PARSEVAL technique. It is not clear whether this metric can be used for comparing parsers with different degrees of structural fineness, since the score on this metric is tightly related to the degree of structural detail.

The LAA measure is more complicated than PARSEVAL. It considers a lineage for each word in the sentence, that is, the sequence of node labels found on the path between the leaf and the root node in the respective tree. The lineages are compared by their edit distance, each comparison yielding a score between 0 and 1. The score of the whole sentence is then defined as the mean similarity of the lineage pairs over its leaves. Since it considers more than just the boundaries between phrases, the LAA measure is supposed to be more objective than PARSEVAL, even for non-projective sentences. In this comparison, Geoffrey Sampson's LAA implementation, http://www.grsampson.net/Resources.html, was used.
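The per-sentence part of the LAA computation can be sketched as follows. It is a simplification: Sampson's implementation additionally inserts phrase-boundary markers into the lineages, which is omitted here, and difflib's ratio() stands in for the normalized edit-distance similarity:

    from difflib import SequenceMatcher

    # Compare the sequence of node labels on the leaf-to-root path in the
    # test tree with the same lineage in the gold tree; ratio() yields a
    # similarity in [0, 1].

    def lineage_similarity(test_lineage, gold_lineage):
        return SequenceMatcher(None, test_lineage, gold_lineage).ratio()

    def laa_sentence_score(test_lineages, gold_lineages):
        """Mean lineage similarity over the words of one sentence."""
        scores = [lineage_similarity(t, g)
                  for t, g in zip(test_lineages, gold_lineages)]
        return sum(scores) / len(scores)

    # laa_sentence_score([["N", "NP", "S"]],
    #                    [["N", "NP", "VP", "S"]])  ~ 0.86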


Parser    all sentences   non-projective   projective
R2L       73.845 %        69.823 %         75.735 %
L2R       71.315 %        67.297 %         73.204 %
ANALOG    71.077 %        66.625 %         73.169 %
R2L3      61.648 %        58.276 %         63.233 %
L2R3      53.276 %        49.672 %         64.912 %
zz        75.931 %        74.177 %         76.755 %
col       80.905 %        75.634 %         83.383 %
MST       83.984 %        82.230 %         84.809 %
CP        85.85 %         83.434 %         86.979 %

Table 6.3: The results of the Prague parsers (precision = recall)

Testing Data Set The PDT-2.0 e-test part was used as the testing data. The morphological tags were automatically converted from the Prague tags to the ajka [Sedláček, 2005] tags without ambiguities. The e-test set consists of:

• 10148 sentences (173586 words)

• 7732 projective sentences

• 2416 non-projective sentences

87.7 % of the sentences have at least one derivation tree on the output of synt. The actual comparison was run only on those sentences from e-test that were accepted by synt.

6.4.4 Results

Overall results of the Prague parsers testing are presented in Table 6.3 in the form of the percentage of correct dependencies for the whole set of sentences, for the non-projective and for the projective sentences only. The results of the synt parser on the whole testing set e-test (with manual tagging from PDT-2.0) are displayed in Table 6.4. Notice that the numbers in these two tables cannot be directly compared, because the results for the Prague parsers do not contain the conversion step to the phrasal trees.


                cross-brackets   precision   recall     LAA
all sentences
  Best trees    4.473            60.228 %    60.645 %   71.5 %
  First trees   6.229            47.306 %    50.778 %   69.1 %
  Average       5.799            45.627 %    46.584 %   69.0 %
projective sentences
  Best trees    3.619            66.718 %    68.663 %   73.1 %
  First trees   5.289            53.028 %    57.630 %   70.6 %
  Average       4.942            50.859 %    52.552 %   70.5 %
non-projective sentences
  Best trees    7.251            39.615 %    35.727 %   65.6 %
  First trees   9.325            29.275 %    29.699 %   63.5 %
  Average       8.625            29.112 %    28.097 %   63.3 %

Table 6.4: The results of synt parser on the e-test set

synt:
                cross-brackets   precision   recall     LAA
  Best trees    0.792            89.519 %    92.274 %   97.2 %
  First trees   2.132            70.849 %    74.358 %   92.6 %
  Average       2.311            63.330 %    64.453 %   91.4 %

Prague parsers (precision = recall):
  R2L       81.472 %
  L2R       81.634 %
  ANALOG    76.537 %
  R2L3      63.754 %
  L2R3      57.201 %
  zz        86.650 %
  col       90.129 %
  MST       89.889 %
  CP        91.912 %

Table 6.5: The results of synt parser and Prague parsers on the small tree set

6.4.5 Conclusions

The experiment of comparing the results of parsers with dependency and phrasal outputs has opened several new problems. One of the main causes of these problems was the incompatibility between the "constituent PDT" trees and the derivation trees from synt. This was also the main source of the low precision and recall of the parser. In order to prove this thesis, we (manually) prepared a small set of phrasal trees (for 100 sentences randomly chosen from the e-test projective sentences) in the form of synt parser trees and repeated the measurements for this subset. The improvement of the results of the synt parser on this small subset may be seen in Table 6.5.

The results of these experiments were presented in [Horák et al., 2007]. The conversion procedure and the e-test experiments were conducted by the author of this thesis and Vojtěch Kovář. The experiment with the small set (100 sentences) was performed by Aleš Horák. The testing data and the results for the Prague parsers were provided by Tomáš Holan.

6.5 Front-ends

Because synt is controlled from the command line only, graphical user interfaces were created to allow non-experienced users to work with the tools more comfortably. Two of these front-ends are presented here: the Grammar Development Workbench and WWWsynt.

6.5.1 Grammar Development Workbench

The Grammar Development Workbench (GDW) tool has been created by Radek Vykydal [Vykydal, 2005] in the Centre for Natural Language Processing at the Faculty of Informatics, Masaryk University, Brno (under the supervision of Aleš Horák). It consists of several modules:

• Gsynt – graphical user interface of the synt parser.

• TreeView – viewer for resulting syntactic trees.

• ChartView – browser of resulting chart structure.

• GrammarView – grammar forms viewer.

The GDW is mainly used to build a treebank and to enhance the meta-grammar. For building the treebank of correct syntactic trees we perform the following steps. First of all, we analyze a sentence by the Gsynt module. Figure 6.2 shows the basic window with sentences from a corpus (PDTB-1.0 [Hajič, 1998] in this case). The selected sentence is analyzed by the synt parser and the output of the parser is displayed, together with some additional properties of the sentence. For more information about the GDW see the project manual at http://nlp.fi.muni.cz/projects/grammar_workbench/manual-en/.

Figure 6.2: Gsynt module window showing an analysis of a sentence from a corpus.

6.5.2 WWWsynt

WWWsynt [Golembiovský, 2006] allows anybody to use synt through a web browser.


Figure 6.3: WWWsynt query.

This web interface to the parser has been created by Jiří Golembiovský in the Centre for Natural Language Processing at the Faculty of Informatics, Masaryk University, Brno. The project homepage is at http://nlp.fi.muni.cz/projekty/wwwsynt/.

Figure 6.3 shows the input query form, http://nlp.fi.muni.cz/projekty/wwwsynt/query.cgi. When a user enters a phrase, the parsing process itself is run on the server. Results such as derivation trees or the whole chart can be computed and displayed. The development of the synt web interface has begun only recently, but it is already stable and usable.


6.6 Conclusions

The synt project is still under development and new features are constantly being added. At the moment, most of our experiments are related to the lexicon of verb valences VerbaLex. We are also adding semantic actions for outputting dependency graphs, which allows us to compare our results directly with dependency parsers. Our experiments show that for some sentences, the dependencies generated by synt can be correct even for non-projective sentences, despite the fact that synt is based on a context-free grammar.

The results shown in this chapter were presented several times at international conferences [Horák et al., 2002, Kadlec and Smrž, 2003, Smrž and Kadlec, 2005, Horák and Kadlec, 2005, Kovář et al., 2006, Horák and Kadlec, 2006, Horák et al., 2007].

Chapter 7

Conclusions and Future Research

In this work, a language-independent parsing system is presented. It is based on a context-free parser supplemented by contextual constraints and semantic actions.

For the context-free part of the system, we have developed a new parsing algorithm (described in Chapter 3) based on the head-driven approach. We have also suggested several modifications of our algorithm. These modifications are more efficient than the basic implemented version, and we have estimated their possible benefit.

Because our algorithms are head-driven, the right choice of the heads of grammar rules is crucial. We suggested a general approach for selecting grammar heads. In Chapter 3 two algorithms are described – a general heuristic algorithm and an algorithm that creates optimal heads for the given grammar and corpora.

Our experiments described in Chapter 5 show that the context-free part of the system is fully comparable with the best published context-free parsers. In the case of highly ambiguous grammars, these parsers are outperformed by our parser.

If the language generated by the input grammar is not rich enough to model a natural language, we apply a robust option of our parser. The method described in Chapter 4 solves this problem of undergeneration. Our algorithm is targeted at correct English sentences and it does not seem suitable for free word-order languages such as Czech. However, the results of the performed experiments from Section 5.3 are surprisingly good in combination with the semantic actions from the Czech meta-grammar.

The evaluation of semantic actions and contextual constraints helps us to reduce the huge number of derivation trees, and we are also able to compute new information which is not covered by the context-free part of the grammar. The dependency graph or the filtering by the valency lexicon from Section 3.6 are examples of such information. The experiments with dependency graphs are still at an early stage, but even for some kinds of short non-projective sentences the correct dependencies can be generated within our approach.

All described algorithms are integrated in the parsing system synt. This system and all partial results from this work were published at several international conferences.

Future research is aimed at experiments with verb valences and the lexicon of verb valences for Czech, VerbaLex. The completion of semantic actions for dependency graphs in the meta-grammar will allow us to perform a direct comparison with dependency parsers.

Bibliography

[Aho et al., 1986] Aho, A., Sethi, R., and Ullman, J. (1986). Compilers: Principles, Techniques and Tools. Addison-Wesley, Reading, Mass.

[Aho and Ullman, 1972] Aho, A. and Ullman, J. (1972). The Theory of Parsing, Translation and Compiling, volume I: Parsing. Prentice-Hall, Englewood Cliffs, N.J.

[Ailomaa, 2004] Ailomaa, M. (2004). Two approaches to robust stochastic parsing. Master's thesis, The Swiss Federal Institute of Technology, Lausanne, Switzerland.

[Ailomaa et al., 2005a] Ailomaa, M., Kadlec, V., Chappelier, J.-C., and Rajman, M. (2005a). Efficient processing of extra-grammatical sentences: Comparing and combining two approaches to robust stochastic parsing. In Proceedings of the Applied Stochastic Models and Data Analysis (ASMDA) 2005, pages 81–89, France. ENST Bretagne.

[Ailomaa et al., 2005b] Ailomaa, M., Kadlec, V., Chappelier, J.-C., and Rajman, M. (2005b). Robust stochastic parsing: comparing two approaches for processing extra-grammatical sentences. In Proceedings of the 15th Nordic Conference of Computational Linguistics (NODALIDA) 2005, pages 21–29, Finland. University of Joensuu.

[Ballim et al., 2000] Ballim, A., Chappelier, J.-C., Rajman, M., and Pallotta, V. (2000). ISIS: Interaction through speech with information system. In Proceedings of the Third International Workshop on Text, Speech and Dialogue – TSD 2000, pages 339–344, Brno (Czech Republic).

[Bangalore et al., 1998] Bangalore, S., Sarkar, A., Doran, C., and Hockey, B. A. (1998). Grammar & parser evaluation in the XTAG project. http://www.cs.sfu.ca/~anoop/papers/pdf/eval-final.pdf.


[Barton et al., 1987] Barton, G. E., Berwick, R. C., and Ristad, E. S. (1987). Computational complexity and natural language. MIT Press, Cambridge, Massachusetts.

[Bear et al., 1992] Bear, J., Dowding, J., and Shriberg, E. (1992). Integrating multiple knowledge sources for the detection and correction of repairs in human-computer dialogue. In Proceedings of the 30th ACL, pages 56–63, Newark, Delaware.

[Black, 1992] Black, E. (1992). Meeting of interest group on evaluation of broad-coverage grammars of English. LINGUIST List 3.587. http://www.linguistlist.org/issues/3/3-587.html.

[Bouma and van Noord, 1993] Bouma, G. and van Noord, G. (1993). Head-driven parsing for lexicalist grammars: Experimental results. In Proceedings of the 6th Conference of the EACL, Utrecht, The Netherlands.

[Carroll and Briscoe, 1996] Carroll, J. and Briscoe, T. (1996). Robust parsing — a brief overview. In Carroll, J., editor, Proceedings of the Workshop on Robust Parsing at the 8th European Summer School in Logic, Language and Information (ESSLLI'96), Report CSRP 435, pages 1–7, COGS, University of Sussex.

[Chappelier and Rajman, 1998a] Chappelier, J.-C. and Rajman, M. (1998a). A generalized CYK algorithm for parsing stochastic CFG. In TAPD'98 Workshop, pages 133–137, Paris, France. (http://slptk.sourceforge.net).

[Chappelier and Rajman, 1998b] Chappelier, J.-C. and Rajman, M. (1998b). A practical bottom-up algorithm for on-line parsing with stochastic context-free grammars. Technical Report No 98/284, Département Informatique, EPFL, Lausanne, Switzerland.

[Charniak, 1997] Charniak, E. (1997). Statistical parsing with a context-free grammar and word statistics. In Proceedings of the Fourteenth National Conference on Artificial Intelligence (AAAI'97), pages 598–603.

[Charniak, 2000] Charniak, E. (2000). A maximum-entropy-inspired parser. In Proceedings of the 1st Conference of the North American Chapter of the Association for Computational Linguistics, pages 65–70.


[Chelba and Jelinek, 1998] Chelba, C. and Jelinek, F. (1998). Exploiting syntactic structure for language modeling. In Boitet, C. and Whitelock, P., editors, Proceedings of the Thirty-Sixth Annual Meeting of the Association for Computational Linguistics and Seventeenth International Conference on Computational Linguistics, pages 225–231, San Francisco, California. Morgan Kaufmann Publishers.

[Chomsky, 1955] Chomsky, N. (1955). The Logical Structure of Linguistic Theory. Plenum Press, New York and London, 1973.

[Chomsky, 1957] Chomsky, N. (1957). Syntactic Structures. Mouton & Co., The Hague, The Netherlands.

[Collins, 1997] Collins, M. (1997). Three generative lexicalised models for statistical parsing. In Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics, pages 16–23.

[Collins, 1998] Collins, M. (1998). dep2phr – conversion between dependency and phrase structures. http://ufal.mff.cuni.cz/pdt/Utilities/dep2phr/.

[Daciuk et al., 1998] Daciuk, J., Watson, R. E., and Watson, B. W. (1998). Incremental Construction of Acyclic Finite-State Automata and Transducers. In Finite State Methods in Natural Language Processing, Bilkent University, Ankara, Turkey.

[Dang et al., 1998] Dang, H. T., Kipper, K., Palmer, M., and Rosenzweig, J. (1998). Investigating regular sense extensions based on intersective Levin classes. In Proceedings of Coling-ACL98, August 11–17, Montreal, CA. http://www.cis.upenn.edu/~mpalmer/.

[Dijkstra, 1959] Dijkstra, E. W. (1959). A note on two problems in connection with graphs. Numerische Mathematik, 1:269–271.

[Earley, 1970] Earley, J. (1970). An efficient context-free parsing algorithm. Communications of the ACM, 13:94–102.

[Fredkin, 1960] Fredkin, E. (1960). Trie memory. Communications of the ACM, 3(9):490–499.


[Glembek, 2005] Glembek, O. (2005). Automatic lecture indexing using voice recognition. Master's thesis, Faculty of Information Technology, Brno University of Technology, Brno.

[Golembiovský, 2006] Golembiovský, J. (2006). WWW rozhraní k syntaktickému analyzátoru synt. Master's thesis, Faculty of Informatics, Masaryk University, Brno, Czech Republic.

[Gordon, 1994] Gordon, S. A. (1994). A faster Scrabble move generation algorithm. Software — Practice and Experience, 24(2):219–232.

[Graham et al., 1980] Graham, S., Harrison, M., and Ruzzo, W. (1980). An improved context-free recognizer. ACM Transactions on Programming Languages and Systems, 2(3):415–462.

[Hajič, 2004] Hajič, J. (2004). Complex Corpus Annotation: The Prague Dependency Treebank. Jazykovedný ústav Ľ. Štúra, SAV, Bratislava, Slovakia.

[Hajič, 1998] Hajič, J. (1998). Building a syntactically annotated corpus: The Prague Dependency Treebank. In Issues of Valency and Meaning, pages 106–132, Prague. Karolinum.

[Hajič et al., 1999] Hajič, J., Collins, M., Ramshaw, L., and Tillmann, C. (1999). A Statistical Parser for Czech. In Proceedings of ACL'99, Maryland, USA.

[Hall and Novak, 2005] Hall, K. and Novák, V. (2005). Corrective modeling for non-projective dependency parsing. In Proceedings of IWPT 2005, pages 42–51.

[Harrison, 1986] Harrison, M. (1986). Introduction to Formal Language Theory. Addison-Wesley, Reading, Mass.

[Heeman and Allen, 1994] Heeman, P. A. and Allen, J. F. (1994). Detecting and correcting speech repairs. In Proceedings of the 32nd ACL, pages 295–302, Las Cruces, New Mexico.

[Heemels et al., 1991] Heemels, R., Nijholt, A., and Sikkel, K. (1991). Tomita's algorithm: Extensions and applications. In Proceedings of the First Twente Workshop on Language Technology, pages 99–103, Enschede. Universiteit Twente.


[Hemphill et al., 1990] Hemphill, C. T., Godfrey, J. J., and Doddington, G. R. (1990). The ATIS spoken language systems pilot corpus. In Proceedings of the DARPA Speech and Natural Language Workshop, pages 96–101, Hidden Valley, PA.

[Hipp, 1992] Hipp, D. R. (1992). Design and development of spoken natural language dialog parsing systems. PhD thesis, Duke University.

[Hlaváčková et al., 2006] Hlaváčková, D., Horák, A., and Kadlec, V. (2006). Exploitation of the VerbaLex verb valency lexicon in the syntactic analysis of Czech. In Proceedings of Text, Speech and Dialogue 2006, pages 85–92, Brno, Czech Republic. Springer-Verlag.

[Holan, 2004] Holan, T. (2004). Tvorba závislostního syntaktického analyzátoru. In Sborník semináře MIS 2004. Matfyzpress, Prague, Czech Republic.

[Holan, 2005] Holan, T. (2005). Genetické učení závislostních analyzátorů. In Sborník semináře ITAT 2005. UPJŠ, Košice.

[Holan et al., 1995] Holan, T., Kuboň, V., and Plátek, M. (1995). An implementation of syntactic analysis of Czech. In Proceedings of the 5th IWPT, pages 126–135, Charles University, Prague, Czech Republic.

[Holan and Žabokrtský, 2006] Holan, T. and Žabokrtský, Z. (2006). Combining Czech Dependency Parsers. In Lecture Notes in Artificial Intelligence, Proceedings of TSD 2006, pages 95–102, Brno, Czech Republic. Springer Verlag.

[Horák et al., 2007] Horák, A., Holan, T., Kadlec, V., and Kovář, V. (2007). Dependency and Phrasal Parsers of the Czech Language: A Comparison. In Proceedings of the 10th International Conference on Text, Speech and Dialogue, TSD 2007, volume 4629 of Lecture Notes in Computer Science, pages 76–84. Springer.

[Horák and Kadlec, 2006] Horák, A. and Kadlec, V. (2006). Platform for Full-Syntax Grammar Development Using Meta-grammar Constructs. In Proceedings of the 20th Pacific Asia Conference on Language, Information and Computation, pages 311–318, Beijing, China. Tsinghua University Press.


[Horák, 2002a] Horák, A. (2002a). Analysis of Knowledge in Sentences. PhD thesis, Faculty of Informatics, Masaryk University, Brno, Czech Republic.

[Horák, 2002b] Horák, A. (2002b). The Normal Translation Algorithm in Transparent Intensional Logic for Czech. PhD thesis, Faculty of Informatics, Masaryk University, Brno, Czech Republic.

[Horák and Kadlec, 2005] Horák, A. and Kadlec, V. (2005). New Meta-grammar Constructs in Czech Language Parser synt. In Proceedings of Text, Speech and Dialogue 2005, pages 85–92, Karlovy Vary, Czech Republic. Springer-Verlag.

[Horák et al., 2002] Horák, A., Kadlec, V., and Smrž, P. (2002). Enhancing best analysis selection and parser comparison. In Proceedings of the 5th International Workshop TSD 2002, pages 461–467, Brno, Czech Republic. Springer Verlag, Lecture Notes in Artificial Intelligence, Volume 2448.

[Johnson, 1989] Johnson, M. (1989). The computational complexity of Tomita's algorithm. In Proceedings of the International Workshop on Parsing Technologies (IWPT'89), pages 203–208, Carnegie Mellon University, Pittsburgh, PA.

[Johnson and Dorre, 1995] Johnson, M. and Dorre, J. (1995). Memoization of coroutined constraints. In 33rd Annual Meeting of the Association for Computational Linguistics, pages 100–107, Boston.

[Kadlec, 2000] Kadlec, V. (2000). Syntaktická analýza přirozeného jazyka. Master's thesis, Faculty of Informatics, Masaryk University, Brno, Czech Republic.

[Kadlec et al., 2005] Kadlec, V., Ailomaa, M., Chappelier, J.-C., and Rajman, M. (2005). Robust stochastic parsing using optimal maximum coverage. In Proceedings of The International Conference Recent Advances In Natural Language Processing (RANLP) 2005, pages 258–263, Shoumen, Bulgaria. INCOMA.

[Kadlec and Smrž, 2003] Kadlec, V. and Smrž, P. (2003). PACE – parser comparison and evaluation. In Proceedings of the 8th International Workshop on Parsing Technologies, IWPT 2003, pages 211–212, Le Chesnay Cedex, France. INRIA, Domaine de Voluceau, Rocquencourt.


[Kadlec and Smrž, 2004] Kadlec, V. and Smrž, P. (2004). Grammatical Heads Optimized for Parsing and Their Comparison with Linguistic Intuition. In Text, Speech and Dialogue: Proceedings of the Seventh International Workshop TSD 2004, pages 95–102, Brno, Czech Republic. Springer Verlag, Lecture Notes in Artificial Intelligence.

[Kadlec and Smrž, 2006] Kadlec, V. and Smrž, P. (2006). How many dots are really needed for head-driven chart parsing? In Proceedings of SOFSEM 2006, pages 483–492, Czech Republic. Springer-Verlag.

[Kasami, 1965] Kasami, T. (1965). An efficient recognition and syntax analysis algorithm for context-free languages. Technical report AF CRL-65-758, Air Force Cambridge Research Laboratory, Bedford, Massachusetts.

[Kay, 1985] Kay, M. (1985). Parsing in functional unification grammar. In Natural Language Parsing, pages 251–278, Cambridge, England.

[Kay, 1989a] Kay, M. (1989a). Algorithm schemata and data structures in syntactic processing. Report CSL-80-12, Xerox PARC, Palo Alto, California.

[Kay, 1989b] Kay, M. (1989b). Head driven parsing. In Proceedings of the International Workshop on Parsing Technologies, Pittsburgh.

[Knuth, 1965] Knuth, D. (1965). On the translation of languages from left to right. Information and Control, 8:607–639.

[Kovář et al., 2006] Kovář, V., Kadlec, V., and Horák, A. (2006). Grammar Development for Czech Syntactic Parser with Corpus-based Techniques. In Proceedings of Corpus Linguistics, pages 159–165, Saint-Petersburg, Russia. Saint-Petersburg State University.

[Kuboň, 1999] Kuboň, V. (1999). A robust parser for Czech. Technická zpráva TR-1999-6, MFF UK, Prague, Czech Republic.

[Kuboň, 2001] Kuboň, V. (2001). Problems of Robust Parsing of Czech. PhD thesis, Charles University, MFF, Prague, Czech Republic.

[Kuboň and Plátek, 2001] Kuboň, V. and Plátek, M. (2001). A Method of Accurate Robust Parsing of Czech. In Proceedings of the 4th International Conference on Text, Speech and Dialogue, pages 92–99. Springer, Berlin.


[Leermakers, 1992] Leermakers, R. (1992). A recursive ascent Earley parser. Information Processing Letters, 41(2):87–91.

[Macek, 2003] Macek, O. (2003). Efektivní metody pro syntaktickou analýzu přirozeného jazyka – GLR. Master's thesis, Faculty of Informatics, Masaryk University, Brno, Czech Republic.

[Matsumoto et al., 1983] Matsumoto, Y. et al. (1983). BUP: A bottom-up parser embedded in Prolog. New Generation Computing, 1:145–158.

[McDonald, 2006] McDonald, R. (2006). Discriminative learning and spanning tree algorithms for dependency parsing. PhD thesis, University of Pennsylvania.

[Moore, 2000a] Moore, R. C. (2000a). Improved left-corner chart parsing for large context-free grammars. In Proceedings of the 6th IWPT, pages 171–182, Trento, Italy.

[Moore, 2000b] Moore, R. C. (2000b). Time as a measure of parsing efficiency. In Proceedings of the Efficiency in Large-Scale Parsing Systems Workshop, COLING'2000, pages 23–28, Saarbrücken: Universität des Saarlandes.

[Moore, 2004] Moore, R. C. (2004). Improved Left-Corner Chart Parsing for Large Context-Free Grammars (Revised Version). In Bunt, Carroll, and Satta, editors, New Developments in Parsing Technology, pages 185–201. Kluwer Academic Publishers.

[Mráková, 2002] Mráková, E. (2002). Parciální syntaktická analýza (češtiny). PhD thesis, Faculty of Informatics, Masaryk University, Brno, Czech Republic.

[Nederhof, 1993] Nederhof, M. (1993). Generalized left-corner parsing. In Proceedings of the Sixth Conference of the European Chapter of the Association for Computational Linguistics, pages 305–314, Utrecht, The Netherlands.

[Nederhof and Sarbo, 1993] Nederhof, M. and Sarbo, J. (1993). Increasing the applicability of LR parsing. In Proceedings of the 3rd International Workshop on Parsing Technologies (IWPT'93), pages 187–207, Tilburg and Durbuy, The Netherlands/Belgium.


[Nederhof and Satta, 1994] Nederhof, M.-J. and Satta, G. (1994). An extended theory of head-driven parsing. In Meeting of the Association for Computational Linguistics, pages 210–217.

[Neidle, 1994] Neidle, C. (1994). Lexical-Functional Grammar (LFG). In Asher, R. E., editor, Encyclopedia of Language and Linguistics, volume 3, pages 2147–2153. Pergamon Press, Oxford.

[Nijholt, 1994] Nijholt, A. (1994). Parallel approaches to context-free language parsing. In Hahn, U. and Adraens, G., editors, Parallel Natural Language Processing, pages 135–167, Norwood, NJ. Ablex Publishing Corporation.

[Nilsson et al., 2006] Nilsson, J., Nivre, J., and Hall, J. (2006). Graph transformations in data-driven dependency parsing. In Proceedings of the 21st Conference on Computational Linguistics and 44th Annual Meeting of the ACL, pages 257–264, Sydney.

[Oepen and Callmeier, 2000] Oepen, S. and Callmeier, U. (2000). Measure for measure: Parser cross-fertilization – towards increased component comparability and exchange. In Proceedings of IWPT'2000, pages 140–149, Trento, Italy.

[Pala et al., 1997] Pala, K., Rychlý, P., and Smrž, P. (1997). DESAM — annotated corpus for Czech. In Proceedings of SOFSEM'97, pages 523–530. Springer-Verlag. Lecture Notes in Computer Science 1338.

[Pala and Sevecek, 1997] Pala, K. and Ševeček, P. (1997). Valence českých sloves (Valencies of Czech Verbs). In Proceedings of Works of Philosophical Faculty at the University of Brno, pages 41–54, Brno. Masaryk University.

[Plátek et al., 1995] Plátek, M., Holan, T., Kuboň, V., and Hric, J. (1995). Grammar development and pivot implementation. In JRP PECO 2824 Language Technologies for Slavic Languages, Final research report, Prague, Czech Republic.

[Pollard and Sag, 1994] Pollard, C. and Sag, I. (1994). Head-Driven Phrase Structure Grammar. University of Chicago Press, Chicago.

[Rekers, 1992] Rekers, J. (1992). Parser Generation for Interactive Environments. PhD thesis, University of Amsterdam.


[Roark and Charniak, 2000] Roark, B. and Charniak, E. (2000). Measuring efficiency in high-accuracy, broad-coverage statistical parsing. Computation and Language cs.CL/0008027, pages 29–36.

[Rosenkrantz and Lewis, 1970] Rosenkrantz, D. and Lewis, P. (1970). Deterministic left corner parsing. In IEEE Conference Record of the 11th Annual Symposium on Switching and Automata Theory, pages 139–152.

[Sampson, 1994] Sampson, G. (1994). The Susanne corpus, release 3. School of Cognitive & Computing Sciences, University of Sussex, Falmer, Brighton (England).

[Sampson, 2000] Sampson, G. (2000). A Proposal for Improving the Measurement of Parse Accuracy. International Journal of Corpus Linguistics, 5(1):53–68.

[Sampson and Babarczy, 2003] Sampson, G. and Babarczy, A. (2003). A test of the leaf-ancestor metric for parse accuracy. Natural Language Engineering, 9(4):365–380.

[Satta and Stock, 1989] Satta, G. and Stock, O. (1989). Head-driven bidirectional parsing: A tabular method. In Proceedings of IWPT'1989, pages 43–51, Pittsburgh.

[Sedláček, 2005] Sedláček, R. (2005). Morphemic Analyser for Czech. PhD thesis, Faculty of Informatics, Masaryk University, Brno, Czech Republic.

[Sgall et al., 1986] Sgall, P., Hajičová, E., and Panevová, J. (1986). The Meaning of the Sentence and Its Semantic and Pragmatic Aspects. Academia/Reidel Publishing Company, Prague, Czech Republic/Dordrecht, The Netherlands.

[Shann, 1991] Shann, P. (1991). Experiments with GLR and chart parsing. In Tomita, M., editor, Generalized LR Parsing, pages 17–34, Boston, Massachusetts. Kluwer Academic Publishers.

[Sikkel, 1996] Sikkel, K. (1996). Parsing Schemata: A Framework for Specification and Analysis of Parsing Algorithms. Springer, Berlin.

[Sikkel and op den Akker, 1993] Sikkel, K. and op den Akker, R. (1993). Predictive head-corner parsing. In Proceedings of IWPT'1993, pages 267–276, Tilburg/Durbuy.


[Smrž and Horák, 1999] Smrž, P. and Horák, A. (1999). Implementation of Efficient and Portable Parser for Czech. In TSD '99: Proceedings of the Second International Workshop on Text, Speech and Dialogue, pages 105–108, London, UK. Springer-Verlag.

[Smrž and Kadlec, 2005] Smrž, P. and Kadlec, V. (2005). Incremental Parser for Czech. In Proceedings of the 4th International Symposium on Information and Communication Technologies (WISICT05), pages 1–6, Cape Town, South Africa. Cape Town International Convention Center.

[Tomita, 1986] Tomita, M. (1986). Efficient Parsing for Natural Languages: A Fast Algorithm for Practical Systems. Kluwer Academic Publishers, Boston, MA.

[van Noord, 1997] van Noord, G. (1997). An efficient implementation of the head-corner parser. Computational Linguistics, 23(3).

[van Noord et al., 1999] van Noord, G., Bouma, G., Koeling, R., and Nederhof, M.-J. (1999). Robust grammatical analysis for spoken dialogue systems. Natural Language Engineering, 5(1):45–93.

[Vykydal, 2005] Vykydal, R. (2005). Nástroje pro vývoj gramatik přirozeného jazyka (Tools for developing natural language grammars). Master's thesis, Faculty of Informatics, Masaryk University, Brno, Czech Republic. In Czech.

[Wirén, 1987] Wirén, M. (1987). A comparison of rule-invocation strategies in context-free chart parsing. In Proceedings of the Third Conference of the European Chapter of the Association for Computational Linguistics, pages 226–233, Copenhagen, Denmark.

[Worm and Rupp, 1998] Worm, K. L. and Rupp, C. J. (1998). Towards robust understanding of speech by combination of partial analyses. In Proceedings of the 13th Biennial European Conference on Artificial Intelligence (ECAI'98), August 23–28, pages 190–194, Brighton, UK.

[Younger, 1967] Younger, D. (1967). Recognition of context-free languages in time n^3. Information and Control, 10(2):189–208.

[Zeman, 2004] Zeman, D. (2004). Parsing with a Statistical Dependency Model. PhD thesis, Charles University, MFF, Prague.


[Zeman and Žabokrtský, 2005] Zeman, D. and Žabokrtský, Z. (2005). Improving Parsing Accuracy by Combining Diverse Dependency Parsers. In Proceedings of the 9th International Workshop on Parsing Technologies, pages 171–178, Vancouver, B.C., Canada.

[Žabokrtský and Lopatková, 2004] Žabokrtský, Z. and Lopatková, M. (2004). Valency Frames of Czech Verbs in VALLEX 1.0. In Meyers, A., editor, HLT-NAACL 2004 Workshop: Frontiers in Corpus Annotation, pages 70–77.

Appendices

Appendix A

Alternative Definitions of Maximum Coverage

Alternative definitions of maximum coverage from Chapter 4 are presented here.

A.1 Definition of maximum coverage in terms of foliage

We define a preorder relation $\le$ over coverages. For any coverages $C = (T_1, T_2, \ldots, T_k)$ and $C' = (T'_1, T'_2, \ldots, T'_{k'})$:

$C \le C'$ iff $\exists\, T'_i \in C'$ and a sub-sequence $(T_j, T_{j+1}, \ldots, T_{j+l})$ of $C$ such that $f(T'_i) = f(T_j)\, f(T_{j+1}) \cdots f(T_{j+l})$,

i.e. if there exists a sub-sequence of trees in $C$ that has the same foliage as some tree in $C'$. Please see Section 4.1 for the definition of the foliage $f$.

The relation $\le$ is reflexive and transitive but not antisymmetric, thus it is only a preorder and not an order relation.
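As a small illustration (the grammar, trees and words are invented for this example only): let $C = (T_1, T_2)$ with foliages $f(T_1) = \textit{the dog}$ and $f(T_2) = \textit{barks}$, and let $C' = (T')$ with $f(T') = \textit{the dog barks}$. The sub-sequence $(T_1, T_2)$ of $C$ satisfies $f(T') = f(T_1)\,f(T_2)$, hence $C \le C'$: the single-tree coverage $C'$ is at least as good as the two-tree coverage $C$.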

A.2 Definition of maximum coverage in terms of equivalence classes

First of all we define an equivalence relation $\approx$ over coverages. For any coverages $C = (T_1, T_2, \ldots, T_k)$ and $C' = (T'_1, T'_2, \ldots, T'_{k'})$:

$C \approx C'$ iff $\exists\, T_i \in C$, $T'_j \in C'$ and a rule $r$ in the grammar $G$ such that $T_i = r \circ T'_j$ or $T'_j = r \circ T_i$,

i.e. if there exists a tree in $C$ that can be extended by a unary rule $r$ such that the resulting tree lies in $C'$, or contrariwise. We also define a relation $\approx^*$ as the reflexive and transitive closure of the relation $\approx$. The relation $\approx^*$ is reflexive, transitive and symmetric, thus it is an equivalence relation. A set of class representatives (i.e. a subset of the set of all coverages which contains exactly one element from each equivalence class with respect to the equivalence $\approx^*$) is denoted by $\mathcal{C}_\approx$.

Next we define a relation $\Rightarrow$ over the set of class representatives. For any coverages $C_\# = (T_1, T_2, \ldots, T_k)$ and $C'_\# = (T'_1, T'_2, \ldots, T'_{k'})$ with $C_\# \in \mathcal{C}_\approx$, $C'_\# \in \mathcal{C}_\approx$:

$C_\# \Rightarrow C'_\#$ iff $\exists\, T'_i \in C'_\#$, a sub-sequence $(T_j, T_{j+1}, \ldots, T_{j+l})$ of $C_\#$ and a rule $r$ in the grammar $G$ such that $T'_i = r \circ T_j \circ T_{j+1} \circ \cdots \circ T_{j+l}$,

i.e. if there exists a sub-sequence of trees in $C_\#$ that can be connected by the rule $r$ such that the resulting tree lies in $C'_\#$. We also define a relation $\Rightarrow^*$ as the reflexive and transitive closure of the relation $\Rightarrow$. The relation $\Rightarrow^*$ is a partial order on $\mathcal{C}_\approx$.

After that we define a relation $\le$ over the set of coverages. For any coverages $C$ and $C'$:

$C \le C'$ iff $\exists\, C_\#, C'_\# \in \mathcal{C}_\approx$ such that $C \approx^* C_\#$, $C' \approx^* C'_\#$ and $C_\# \Rightarrow^* C'_\#$,

i.e. if the appropriate class representatives (with respect to the equivalence $\approx^*$) are in the relation $\Rightarrow^*$.

Finally, we define a coverage $C$ to be a maximum coverage (m-coverage) iff for any coverage $C'$:

if $C \le C'$ then $C' \le C$.

The relation $\le$ is reflexive and transitive but not antisymmetric, thus it is only a preorder and not an order relation.

Appendix B

List of Publications

[Horák et al., 2002] Horák, A., Kadlec, V., and Smrž, P. (2002). Enhancing best analysis selection and parser comparison. In Proceedings of the 5th International Workshop TSD 2002, pages 461–467, Brno, Czech Republic. Springer Verlag, Lecture Notes in Artificial Intelligence, Volume 2448.

[Kadlec and Smrž, 2003] Kadlec, V. and Smrž, P. (2003). PACE – parser comparison and evaluation. In Proceedings of the 8th International Workshop on Parsing Technologies, IWPT 2003, pages 211–212, Le Chesnay Cedex, France. INRIA, Domaine de Voluceau, Rocquencourt.

[Kadlec and Smrž, 2004] Kadlec, V. and Smrž, P. (2004). Syntactic analysis of natural languages based on context free grammar backbone. In Proceedings of the 21st Workshop on Information Technologies, MIS 2004, pages 46–51.

[Kadlec et al., 2004] Kadlec, V., Chappelier, J., and Rajman, M. (2004). Tool for robust stochastic parsing using optimal maximum coverage. Technical Report, Swiss Federal Institute of Technology (EPFL), Lausanne, Switzerland.

[Kadlec and Smrž, 2004] Kadlec, V. and Smrž, P. (2004). Grammatical Heads Optimized for Parsing and Their Comparison with Linguistic Intuition. In Text, Speech and Dialogue: Proceedings of the Seventh International Workshop TSD 2004, pages 95–102, Brno, Czech Republic. Springer Verlag, Lecture Notes in Artificial Intelligence.


[Kadlec et al., 2005] Kadlec, V., Ailomaa, M., Chappelier, J.-C., and Rajman, M. (2005). Robust stochastic parsing using optimal maximum coverage. In Proceedings of The International Conference Recent Advances In Natural Language Processing (RANLP) 2005, pages 258–263, Shoumen, Bulgaria. INCOMA.

[Ailomaa et al., 2005a] Ailomaa, M., Kadlec, V., Chappelier, J.-C., and Rajman, M. (2005a). Efficient processing of extra-grammatical sentences: Comparing and combining two approaches to robust stochastic parsing. In Proceedings of the Applied Stochastic Models and Data Analysis (ASMDA) 2005, pages 81–89, France. ENST Bretagne.

[Ailomaa et al., 2005b] Ailomaa, M., Kadlec, V., Chappelier, J.-C., and Rajman, M. (2005b). Robust stochastic parsing: comparing two approaches for processing extra-grammatical sentences. In Proceedings of the 15th Nordic Conference of Computational Linguistics (NODALIDA) 2005, pages 21–29, Finland. University of Joensuu.

[Horák and Kadlec, 2005] Horák, A. and Kadlec, V. (2005). New Meta-grammar Constructs in Czech Language Parser synt. In Proceedings of Text, Speech and Dialogue 2005, pages 85–92, Karlovy Vary, Czech Republic. Springer-Verlag.

[Smrž and Kadlec, 2005] Smrž, P. and Kadlec, V. (2005). Incremental Parser for Czech. In Proceedings of the 4th International Symposium on Information and Communication Technologies (WISICT05), pages 1–6, Cape Town, South Africa. Cape Town International Convention Center.

[Horák et al., 2006] Horák, A., Svoboda, L., Kadlec, V., and Cenek, P. (2006). Language Resources for Intelligent Processing of Dialogues about Electrical Networks. In Proceedings of ElNet 2005, pages 42–49, VŠB TU Ostrava.

[Kadlec and Smrž, 2006] Kadlec, V. and Smrž, P. (2006). How many dots are really needed for head-driven chart parsing? In Proceedings of SOFSEM 2006, pages 483–492, Czech Republic. Springer-Verlag.

[Hlaváčková et al., 2006] Hlaváčková, D., Horák, A., and Kadlec, V. (2006). Exploitation of the VerbaLex verb valency lexicon in the syntactic analysis of Czech. In Proceedings of Text, Speech and Dialogue 2006, pages 85–92, Brno, Czech Republic. Springer-Verlag.

[Horák and Kadlec, 2006] Horák, A. and Kadlec, V. (2006). Platform for Full-Syntax Grammar Development Using Meta-grammar Constructs. In Proceedings of the 20th Pacific Asia Conference on Language, Information and Computation, pages 311–318, Beijing, China. Tsinghua University Press.

[Kovář et al., 2006] Kovář, V., Kadlec, V., and Horák, A. (2006). Grammar Development for Czech Syntactic Parser with Corpus-based Techniques. In Proceedings of Corpus Linguistics, pages 159–165, Saint-Petersburg, Russia. Saint-Petersburg State University.

[Horák et al., 2007] Horák, A., Holan, T., Kadlec, V., and Kovář, V. (2007). Dependency and Phrasal Parsers of the Czech Language: A Comparison. In Proceedings of the 10th International Conference on Text, Speech and Dialogue, TSD 2007, volume 4629 of Lecture Notes in Computer Science, pages 76–84, Pilsen, Czech Republic. Springer-Verlag.
