
Faculty of Informatics

Masaryk University


Syntactic analysis of natural languages based on context-free grammar backbone

PhD Thesis

Vladimír Kadlec

Brno, September 2007

Declaration

I hereby declare that this submission is my own work and that, to the best of my knowledge and belief, it contains no material previously published or written by another person nor material which to a substantial extent has been accepted for the award of any other degree or diploma of the university or other institute of higher learning, except where due acknowledgement has been made in the text.

Acknowledgements

I would like to express my gratitude to my supervising professor Karel Pala for his support and feedback during my research. Special thanks must go to Martin Rajman and Jean-Cédric Chappelier for their very valuable comments on this work, especially on Chapter 4. I would also like to thank Aleš Horák, who has been my closest colleague for almost 7 years.

Contents

1 Introduction
  1.1 Objectives of the Thesis
  1.2 Overview
  1.3 Terminology and Notation

2 Related Work
  2.1 Context-Free Parsing
    2.1.1 Tomita's generalized LR parser
    2.1.2 Chart parsers
    2.1.3 Head driven and head corner parsing
    2.1.4 Parser Comparison
  2.2 Robust Parsing
  2.3 Parsing of Czech
  2.4 Conclusions

3 Parsing System with Contextual Constraints
  3.1 System Overview
  3.2 Head-Driven dependent dot move
    3.2.1 Motivation
    3.2.2 HDddm Parsing
    3.2.3 Head-Driven Algorithm with One Dot in Items
    3.2.4 Simplified Items, No Dots Needed
    3.2.5 Data Structures
    3.2.6 Conclusions
  3.3 Optimizing Heads for Parsing
    3.3.1 Optimization Procedure
  3.4 Shared-Packed Forest filtering
    3.4.1 Filtering by rule levels
    3.4.2 Conclusions
  3.5 Contextual Constraints and Semantic Actions
    3.5.1 Representation of values
    3.5.2 Generation of a grammar with values
    3.5.3 Conclusions
  3.6 Verb Valences
    3.6.1 VerbaLex
    3.6.2 Filtering by Valences
  3.7 Conclusions

4 Robust Stochastic Parsing Using OMC
  4.1 Coverage
    4.1.1 Maximum coverage
    4.1.2 Optimal m-coverage
    4.1.3 Probability of a coverage
  4.2 Finding optimal m-coverage
  4.3 Gluing
    4.3.1 Gluing with new rules
    4.3.2 Gluing by means of mapping non-terminals
  4.4 Conclusions

5 Context-free Only Experiments
  5.1 Configuration of Experiments
  5.2 Context-free experiments
    5.2.1 Comparison of Implemented CF Parsing Algorithms
    5.2.2 Head-Driven dependent dot move variants
    5.2.3 Optimizing Heads for Parsing
    5.2.4 Comparison with Different Parsing Systems
    5.2.5 Conclusion
  5.3 Robust Parsing
    5.3.1 Implementation
    5.3.2 Experiments with English
    5.3.3 Speech recognition of Czech
  5.4 Conclusions

6 Implementation – synt Project
  6.1 Grammar
  6.2 Parser
  6.3 Experiments
    6.3.1 Real data test
    6.3.2 Verb Valences
  6.4 Comparison of dependency and phrasal parsers
    6.4.1 Compared Dependency Parsers
    6.4.2 Main Problems
    6.4.3 Comparison method
    6.4.4 Results
    6.4.5 Conclusions
  6.5 Front-ends
    6.5.1 Grammar Development Workbench
    6.5.2 WWW synt
  6.6 Conclusions

7 Conclusions and Future Research

Bibliography

Appendices

A Alternative Definitions of Maximum Coverage
  A.1 Definition of maximum coverage in terms of footage
  A.2 Definition of maximum coverage in terms of equivalence classes

B List of Publications

Chapter 1

Introduction

Syntactic analysis is a "corner-stone" of applications for the automated processing of texts in natural languages. Any such application, be it an automatic grammar checker or an information retrieval system, must be capable of understanding the structure of a sentence. Recognition of the sentence structure is called parsing. A formal theory of natural language syntax was first introduced by Noam Chomsky [Chomsky, 1955, Chomsky, 1957], who came up with the context-free and transformational phrase structure grammar formalisms and their comparison. Context-free parsing techniques are well suited to be incorporated into real-world natural language processing systems because of their time efficiency and low memory requirements. Although it is known that some natural language phenomena cannot be handled with the context-free grammar formalism, researchers often use the context-free backbone as the core of their grammar formalism and enhance it with context-sensitive feature structures (e.g. [Neidle, 1994]).

1.1 Objectives of the Thesis

The goal of this thesis is to design algorithms and methods for an effective and robust syntactic analysis of natural languages. The main contribution of the described work is a syntactic analyser for the Czech language. However, the presented algorithms are language-independent, so other languages with an appropriate grammar can be modelled as well.


The analysis of sentences by the described system is based on a context-free grammar for the given language. The reasons for the choice of this formalism are mainly historical. At the time our system was created, there was intensive research on a constituent grammar for Czech [Smrž and Horák, 1999], but no analyser fast and effective enough existed.

The internal representation of derivation trees allows us to apply contextual constraints, e.g. case agreement. In general, this is an NP-complete problem, so a polynomial algorithm working with weaker constraints than the general ones is described. The result of the constraint application is still maintained in a polynomial-size structure.

The evaluation of semantic actions and contextual constraints helps us to reduce the huge number of derivation trees, and we are also able to compute some new information which is not contained in the context-free part of the grammar. Also, the n best trees (according to a tree rank, e.g. probability) can be selected. This is an important feature for linguists developing a grammar by hand.

There are many NLP applications where it is difficult to create a grammar generating a sufficient subset of the processed language. We describe a robust extension of our parser that is able to return a "correct" derivation tree even if the grammar cannot generate the input sentence. Because the presented system is language-independent, results for English and French grammars are provided, as well as results for Czech.

1.2 Overview

The first chapter introduces the topics discussed in the next parts and also the basic terminology and notation. In the second chapter, a brief survey of related algorithms for syntactic analysis is presented. Current (robust) syntactic analysers for Czech are mentioned there as well. The main result of this work – a parsing system with contextual constraints – is discussed in Chapter Three, which is divided into several sections describing the individual modules of the system. A robust extension of the parsing system is presented in Chapter Four. The fifth chapter provides the results of experiments with the context-free and robust parts of the system. The description of the implementation – the synt project – is provided in Chapter Six. The last, seventh, chapter consists of conclusions and future directions.


1.3 Terminology and Notation

We will use the following terminology. The input grammar G is a quadruple G = ⟨N, Σ, P, S⟩, where N is a finite set of non-terminals, Σ is a finite set of terminals, P is a finite set of rules and S stands for the starting symbol of the grammar. Upper-case letters (A, B, etc.) will designate non-terminals, lower-case letters (i, j, k) will be used for natural numbers and Greek letters (α, β, …) for strings of symbols (terminals or non-terminals). The empty string is denoted by ε. The input sentence is a sequence of words wᵢ ∈ Σ¹, each of which corresponds to a terminal of the input grammar G. The input sentence may also be referred to as the "input string".

¹ wᵢ ∈ Σ applies only to known words. There is usually a special mechanism to handle words not present in the vocabulary.
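As a concrete illustration only, the quadruple ⟨N, Σ, P, S⟩ can be encoded directly, e.g. in Python. This encoding is our own and purely illustrative; it is not the representation used by the synt system.

    from typing import NamedTuple

    class Rule(NamedTuple):
        lhs: str                 # a non-terminal A
        rhs: tuple               # a string of symbols (terminals/non-terminals)

    class Grammar(NamedTuple):
        nonterminals: set        # N
        terminals: set           # Sigma
        rules: list              # P
        start: str               # S

    # A toy grammar: S -> A B, A -> a, B -> b, B -> epsilon
    G = Grammar(
        nonterminals={"S", "A", "B"},
        terminals={"a", "b"},
        rules=[Rule("S", ("A", "B")), Rule("A", ("a",)),
               Rule("B", ("b",)), Rule("B", ())],     # () encodes epsilon
        start="S",
    )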

Chapter 2

Related Work

In the first part of this chapter, selected algorithms for syntactic analysis are presented. Only algorithms related to this thesis are described here. In the second part, we focus on parsing of the Czech language.

2.1 Context-Free Parsing

Parsing algorithms for context-free (CF) grammars play a crucial role in the field of general parsing. Either their basic form is directly employed (usually for parsing a context-free backbone for grammars based on one of the recent formalisms), or an extension of a standard parsing algorithm is proposed that can deal with more complex features of the particular grammar form. Moreover, the position of CF parsing is further strengthened by the interest it has received from the language modelling community [Chelba and Jelinek, 1998]. Various methods have been designed for CF parsing and new variants and enhancements emerge every year. To enable efficient processing, parsing algorithms employ sophisticated structures to store intermediate parsing results. The most popular data structures for this purpose are a chart and a graph-structured stack. In the following sections, the most common algorithms for CF parsing are described. Note that the CF parsing algorithms used in our parsing system are described in Section 3.2.


2.1.1 Tomita's generalized LR parser

The generalized LR (GLR) parser was developed by Masaru Tomita [Tomita, 1986]. It extends the standard LR parser [Knuth, 1965, Aho and Ullman, 1972, Harisson, 1986, Aho et al., 1986] to deal with general CFGs. The algorithm is suitable for grammars which are not very ambiguous, where it is convenient to use techniques designed primarily for unambiguous grammars.

The analysis is very similar to the standard LR analysis. It uses a pre-computed parsing table containing two types of actions, shift and reduce. The parsing table determines the next action from a state, the top symbol on the stack and a prefix of the remainder of the input. For a general CFG there can be more than one action entry in the table; this corresponds to shift/reduce and reduce/reduce conflicts. The GLR parser uses a graph-structured stack to manage the conflicts in the parsing table. The stack is represented by a directed, acyclic, connected graph. An active vertex is a vertex on which an action (shift or reduce) can be carried out; it has no predecessor and it cannot be an error vertex (an invalid analysis). The active vertex of an LR stack would be the top of the stack. The parser stores the set of active vertices.

At the beginning, the graph-structured stack and the set of active vertices contain just one vertex, representing the initial state. The analysis then proceeds with the following steps. While the set of active vertices is not empty, any reducible active vertex is popped. If there is no such vertex (all active vertices contain the shift action), one word from the input is read. All shift actions are done in parallel for the active vertices. The shift action is the same as for the LR algorithm: a new predecessor of the given vertex is added and this predecessor is pushed into the set of active vertices. The reduce action differs more considerably from LR. If n symbols should be popped, starting with a given active vertex, then all successors at distance n from this vertex have to be found, where the distance is the number of symbols on the path. Then a new vertex is created (as a result of the shift or reduce action). This new vertex is created only if it is not in the set of active vertices already; otherwise, a new edge is added to the existing vertex.

Tomita proved [Tomita, 1986] that the number of vertices and the number of edges in the graph-structured stack are polynomial with respect to the number of words in the input. He also showed how to represent all the resulting derivation trees in a polynomial structure. That structure is called a packed shared forest.

The worst-case complexity of the GLR parser is O(n^(δ+1)), where δ is the length of the longest right-hand side of a rule [Johnson, 1989]. The original version of Tomita's algorithm is unable to deal with cyclic and hidden left-recursive grammars. A grammar is hidden left-recursive if there are A, B, α, β such that

• A ⇒* BαAβ,

• Bα ⇒* ε.

This problem was solved in different ways by Rekers [Rekers, 1992] in 1992 and by Nederhof and Sarbo [Nederhof and Sarbo, 1993] in 1993. Several optimizations are also suggested in [Heemels et al., 1991].
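The reduce step described above has to find all vertices at distance n from an active vertex. The following minimal Python sketch (the names and the representation are ours, purely illustrative) shows this search on a graph-structured stack:

    class Vertex:
        """A graph-structured stack vertex; `preds` are the vertices deeper
        in the stack (a plain LR stack would have exactly one of them)."""
        def __init__(self, label, preds=()):
            self.label = label              # an LR state or a grammar symbol
            self.preds = list(preds)

    def at_distance(vertex, n):
        """All vertices reachable from `vertex` along exactly n edges, i.e.
        the possible new stack tops after popping n symbols in a reduce."""
        frontier = {vertex}
        for _ in range(n):
            frontier = {p for v in frontier for p in v.preds}
        return frontier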

2.1.2 Chart parsers

The idea of chart parsing was introduced by Kay in [Kay, 1989a]. A chart parser stores the substructures already parsed to avoid recomputing the same results again. This approach is mostly used in combination with Earley's parsing algorithm [Earley, 1970]. The chart is a table of all viable strings. Dotted CF rules (items) are used to represent the state of the analysis. The dots mark the boundaries between the analyzed and non-analyzed parts of items. If the parsing procedure is unidirectional (left to right or vice versa), only one dot is needed.

General chart parser

A chart parser usually uses two sets – the chart and the agenda. These sets contain edges.¹ An edge is a triple [A → α•β, i, j], where i, j are natural numbers, 0 ≤ i ≤ j ≤ n, n is the number of input terminals and A → αβ ∈ P is a rule. The strings of symbols α and β can be empty. The dot symbol • can appear anywhere on the right-hand side of the rule. The edge is inactive if it is of the form [A → α•, i, j], i.e. the dot is to the right of the rightmost symbol of the rule. Otherwise the edge is active. The edge [A → •, i, j] is inactive (the epsilon rule). The chart (or agenda) can be viewed as an oriented graph, and that is why the word "edge" is used. There are n + 1 vertices for n input terminals, labelled 0, 1, …, n.

¹ Different definitions of edges are possible, e.g. for the head corner chart parser.

Considering the edge [A → α•β, i, j], the natural numbers i and j denote its starting and ending vertices, respectively. An edge is labelled by the rule containing the dot. The analysis proceeds from left to right through the given grammar rule. The numbers i, j delimit the fragment of the input sentence covered by the part of the rule to the left of the dot.

    program Chart parser
    begin
        init chart
        init agenda
        while agenda not empty do
            pop edge e from agenda
            for each edge f producible by combining e
                    and some other edge in chart do
                if f ∉ agenda and f ∉ chart and f ≠ e then
                    push f into agenda
                fi
            od
            push e into chart
        od
    end

Figure 2.1: General chart parser

Figure 2.1 shows the parsing algorithm. At the beginning, the chart and the agenda are initialized. The initial chart is usually empty; the contents of the initial agenda differ for the various algorithms. In the iteration step, some edge e is popped from the agenda. Then each edge producible by combining e and some other edge from the chart is added to the agenda. Such a new edge is added to the agenda only if it is neither in the chart nor in the agenda. Finally, the edge e is added to the chart. Particular chart parsers differ only in the initialization and in the edge-combining procedure. The analysis is successful if there is an edge of the form [S → α•, 0, n] in the chart at the end of the algorithm.

The basic version of the chart parsing algorithm is non-deterministic, because any edge can be popped from the agenda. Different implementations have different strategies for pushing and popping edges from the agenda; e.g. the agenda can be represented as a stack or a (priority) queue [Nijholt, 1994].
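The loop of Figure 2.1 translates almost directly into code. A minimal Python sketch, assuming a caller-supplied combine(e, f) function that yields every edge producible from the ordered pair of edges (the only algorithm-specific part; the interface is ours, for illustration):

    def chart_parse(initial_agenda, combine):
        """Generic agenda-driven chart parser following Figure 2.1."""
        chart = set()
        agenda = list(initial_agenda)       # a stack, i.e. a LIFO strategy
        in_agenda = set(agenda)
        while agenda:
            e = agenda.pop()
            in_agenda.discard(e)
            for g in list(chart):
                # try both orders, since combine() takes an ordered pair
                for f in list(combine(e, g)) + list(combine(g, e)):
                    if f not in chart and f not in in_agenda and f != e:
                        agenda.append(f)
                        in_agenda.add(f)
            chart.add(e)
        return chart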


There are also very sophisticated statistical approaches in [Charniak, 1997, Charniak, 2000, Collins, 1997].

Left Corner

The term left-corner (LC) parsing was first used by Rosenkrantz and Lewis in [Rosenkrantz and Lewis, 1970]. The LC algorithm for a general CFG was described by Martin Kay in [Kay, 1989a] (the version without top-down filtering). The first LC analyser working in polynomial time is the BUP parser [Matsumoto et al., 1983]; this parser can handle non-cyclic CFGs without epsilon rules.

The LC chart parser is based on the left-corner relation for the given grammar. The LC relation is defined as follows: a symbol X is a left corner of A, written X ∈ LC(A), if X = A, or the grammar contains a rule of the form B → Xα, where B ∈ LC(A). The symbol X can be either a terminal or a non-terminal. Left-corner parsers use a combination of bottom-up and top-down strategies. The idea is to work top-down in the predict phases and bottom-up in the complete phases of the iteration.
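The LC relation can be precomputed from the grammar by a simple fixed-point iteration. A minimal sketch, with rules encoded as (lhs, rhs) pairs (our own encoding, for illustration):

    def left_corners(rules):
        """LC(A) for every non-terminal A, as the least fixed point of:
        A ∈ LC(A); if B ∈ LC(A) and B -> X alpha, then X ∈ LC(A)."""
        lc = {lhs: {lhs} for lhs, _ in rules}
        changed = True
        while changed:
            changed = False
            for a in lc:
                for lhs, rhs in rules:
                    if lhs in lc[a] and rhs and rhs[0] not in lc[a]:
                        lc[a].add(rhs[0])
                        changed = True
        return lc

    # Example: S -> NP VP, NP -> det n, VP -> v NP
    rules = [("S", ("NP", "VP")), ("NP", ("det", "n")), ("VP", ("v", "NP"))]
    print(left_corners(rules)["S"])        # {'S', 'NP', 'det'}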

The LC algorithm

Initialization

• For every rule p ∈ P of the form S → α, add the edge [S → •α, 0, 0] to the agenda.

• The initial chart is empty.

Iteration

• Pop an edge e from the agenda.

Fundamental rules:

• If e is an edge of the form [A → α•, j, k], then for every edge in the chart of the form [B → γ•Aβ, i, j], create an edge [B → γA•β, i, k].

• If e is an edge of the form [B → γ•Aβ, i, j], then for every edge in the chart of the form [A → α•, j, k], create an edge [B → γA•β, i, k].


Scanning the input:

• If e is an edge of the form [A → α•wⱼ₊₁β, i, j], create an edge [A → αwⱼ₊₁•β, i, j + 1].

Left Corner:

• If e is an edge of the form [X → γ•, k, j] and there is an edge of the form [A → α•Cβ, i, k] in the chart, then for every grammar rule of the form B → Xδ such that B ∈ LC(C), create an edge [B → X•δ, k, j].

• Symmetrically, if e is an edge of the form [A → α•Cβ, i, k] and there is an edge of the form [X → γ•, k, j] in the chart, then for every grammar rule of the form B → Xδ such that B ∈ LC(C), create an edge [B → X•δ, k, j].

• If e is an edge of the form [A → α•Cβ, i, j − 1], then for every grammar rule of the form B → wⱼδ such that B ∈ LC(C), create an edge [B → wⱼ•δ, j − 1, j].

There are several optimizations of the basic version of the algorithm that improve its time and space complexity in the average case. Wirén [Wirén, 1987] suggested a variant of the left-corner rule which considerably reduces the re-creation of already existing edges. A similar solution for the graph-structured stack was published by Nederhof [Nederhof, 1993]. A different representation of edges is another space-saving optimization [Leermakers, 1992]: the analyser ignores the symbols situated to the left of the dot. In 1991, Shann [Shann, 1991] improved edge filtering; this filtering is applied when a new edge is created. A similar approach, called Cocke-Schwartz filtering, is described in [Graham et al., 1980]. Robert Moore [Moore, 2000a] implemented all these improvements. He also suggested changing the order of the top-down and bottom-up checks; [Moore, 2000a] shows that this change speeds up the analysis.

2.1.3 Head driven and head corner parsing

The head driven parsing principles were introduced by Martin Kay [Kay, 1989b]. Satta and Stock [Satta and Stock, 1989] suggested a bottom-up head-driven chart parser without top-down filtering. The main difference from the other algorithms is that the analysis does not proceed from left to right. It can start from the middle of the sentence instead.

The author of the input grammar can determine where the analysis should start. This is crucial for the effectiveness of the parsing. The start of the analysis is defined by the head of the given rule. The head of a grammar rule is a symbol from its right-hand side; e.g. the rule A → AaC can have either A or a or C as its head. The epsilon rule has head ε. The head symbol of each grammar rule is underlined in the following text.

A head corner (HC) chart parser with prediction [Sikkel and op den Akker, 1993] will be defined. The HC parser is (similarly to the LC parser) based on the head-corner relation. The HC relation is defined as follows: a symbol X is a head corner of A, written X ∈ HC(A), if X = A, or the grammar contains a rule of the form B → αXβ (with head X), where B ∈ HC(A). The symbol X can be either a terminal or a non-terminal.

Because the HC chart parser starts the parsing in the middle of the sentence, it uses a different definition of edges. For the HC parser, an edge is a triple [A → α•βXγ•δ, i, j], where i, j are natural numbers, 0 ≤ i ≤ j ≤ n, and A → αβXγδ is a rule of the grammar with head X. For an epsilon rule, the edge has the form [A → ••, i, i]. The numbers i, j denote which part of the input is covered by the symbols between the dots.

The HC parser uses top-down filtering and it needs a specific kind of item for this filtering. A predict item is a triple [l, r, A], where 0 ≤ l ≤ r ≤ n and A ∈ N. If the non-terminal A will cover input somewhere between l and r, then the predict item [l, r, A] is created. There are two cases: either A covers the input from l to some j, where l ≤ j ≤ r, or A covers the input from some i to r, where l ≤ i ≤ r. The parsing algorithm is slightly different from general chart parsing, because it has to handle these predict items.

Initialization

• Create the predict item [0, n, S] and push it into the chart.

Iteration

• In the following, let B ∈ HC(A), 0 ≤ l ≤ i ≤ j ≤ k ≤ r, and let there be a predict item [l, r, A] in the chart.


Fundamental rules:

• If there are edges [C → •δ•, i, j] and [B → αC•β•γ, j, k] in the chart, create an edge [B → α•Cβ•γ, i, k].

• Symmetrically, if there are edges [B → α•β•Cγ, i, j] and [C → •δ•, j, k], create an edge [B → α•βC•γ, i, k].

Scanning the input:

• If there is an edge [B → αwᵢ•β•γ, i, j] in the chart, create an edge [B → α•wᵢβ•γ, i − 1, j].

• Symmetrically, if there is an edge [B → α•β•wⱼγ, i, j − 1], create an edge [B → α•βwⱼ•γ, i, j].

Predict rules:

• If there is an edge [B → αC•β•γ, i, j] in the chart, create a predict item [l, i, C].

• If there is an edge [B → α•β•Cγ, i, j], create a predict item [j, r, C].

Head corner:

• For all grammar rules of the form B → αwᵢβ (with head wᵢ), create an edge [B → α•wᵢ•β, i − 1, i].

• For all grammar rules of the form B → ε, create an edge [B → ••, i, i] (i.e. for all i, l ≤ i ≤ r).

• If there is an edge [C → •δ•, i, j] in the chart, then for every grammar rule of the form B → αCβ (with head C), create an edge [B → α•C•β, i, j].

If the existence of the predict item is omitted from the prerequisites and the predict rules are eliminated, then the definition is the same as for the bottom-up head-driven parser. The HC algorithm has O(n⁵) worst-case complexity, where n is the number of words in the input sentence. To get the usual O(n³) worst-case complexity, it is suggested to keep a goal table for non-terminals [Sikkel and op den Akker, 1993]. This extra bookkeeping allows replacing l, r with i, k in the prerequisites for the fundamental rules.


The first part of the fundamental rules can also be optimized:

• If there are edges [C → •δ•, i, j] and [B → αC•β•, j, k] in the chart, create an edge [B → α•Cβ•, i, k].

The left dot is not moved to the left until the right dot moves to the right. This improvement prevents creating edges already present in the chart. Satta and Stock published a more general version of this in [Satta and Stock, 1989]. The algorithm can also be optimized by using more sophisticated predict items and an adjustment of the HC relation [Sikkel and op den Akker, 1993]. A very effective implementation of the HC parser was done in Prolog by Gertjan van Noord [van Noord, 1997]. He suggested several optimizations such as selective memoization (store only "expensive" edges/items in memory) or goal weakening (solve only selected, more general goals). Goal weakening is called abstraction in [Johnson and Dorre, 1995]. A detailed analysis of the complexity of the HC algorithm is given by Klaas Sikkel [Sikkel, 1996].

2.1.4 Parser Comparison

There are various parsing algorithms for CF grammars that differ in many aspects. Attempts to compare their pros and cons appeared at the dawn of parser development history, and to this day there is a lively interest in comparing and evaluating natural language parsing methods. Unfortunately, there is no general evaluation procedure acceptable to all researchers and developers. For example, the number of edges/items in a chart is usually given as a measure to compare chart parsing algorithms. However, as shown in [Moore, 2000b], there can be algorithms that do not differ in this respect while their processing times on a given grammar and input differ considerably. The primary method of assessing the efficiency of a parsing algorithm is therefore only empirical – one has to compare the time taken to parse a set of test sentences by each particular parser based on a shared grammar. The comparison of our implementations on the set of common grammars and on the CF grammar for Czech is described in Section 5.2.1.
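Such an empirical comparison is straightforward to set up. A minimal sketch, where each parser is assumed to be a callable taking a grammar and a sentence (the interface is ours, purely illustrative):

    import time

    def compare_parsers(parsers, grammar, sentences):
        """Print the total wall-clock time each parser needs to parse a
        shared test set with a shared grammar."""
        for name, parse in parsers.items():
            start = time.perf_counter()
            for sentence in sentences:
                parse(grammar, sentence)
            print(f"{name}: {time.perf_counter() - start:.3f} s")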


2.2 Robust Parsing

Formal grammars are often used in NLP applications to describe well-formed sentences. But when used in practice, the grammars usually describe only a subset of a natural language; in addition, natural language sentences are not always well-formed, especially in applications. The goal of robust parsing is to return a correct or usefully "close" analysis for almost all of the input sentences [Carroll and Briscoe, 1996].

A variety of approaches have been proposed to robustly handle natural language. Some techniques are based on modifying the input sentence, for example by removing words that disturb the fluency [Bear et al., 1992, Heeman and Allen, 1994]; more recent approaches are based on selecting the right sequence of partial analyses [Worm and Rupp, 1998, van Noord et al., 1999]. Minimum Distance Parsing is a third approach based on relaxing the formal grammar, allowing rules to be modified by insertions, deletions and substitutions [Hipp, 1992].

The problems of robust parsing and grammar checking of Czech are discussed in [Kuboň, 2001, Kuboň and Plátek, 2001]. The implementation of a partial syntactic analyser of Czech in Prolog is described in [Mráková, 2002]. Note that our technique described in Chapter 4 is not targeted at the Czech language (although it produces interesting results for it).

2.3 Parsing of Czech

In this section, several existing syntactic analysers of Czech are briefly described. They use two different approaches: the first method uses a grammar created by a linguistic expert, the second one discovers language characteristics from an annotated corpus.

All analysers described here produce dependency syntax. One of the main reasons lies in the traditional so-called "Prague School", which views sentence syntax as a set of dependencies of one word on another. This dependency syntax is integrated in the Functional Generative Description of language [Sgall et al., 1986]. Also, the biggest syntactically annotated corpus for Czech – the Prague Dependency Treebank (PDT) [Hajič, 1998] – uses this formalism, so statistical parsers (see below) trained on this corpus have to produce its structures.


The only dependency parser for Czech using a procedural grammar [Holan et al., 1995, Kuboň, 1999] is based on the non-projective context-free dependency grammar (NCFDG) formalism from the LATESLAV project [Plátek et al., 1995]. The NCFDG formalism is very strong and suitable for free-word-order languages like Czech.

The rest of the parsers presented here are statistical parsers trained on the PDT. Collins's phrasal parser adapted for the PDT [Hajič et al., 1999] produces one of the best results published so far. [Holan, 2004] shows four push-down parsers. The push-down parsers, during their training phase, create a set of premise-action rules, which they apply during the parsing phase. In the training phase, the parser determines the sequence of actions which leads to the correct tree for each sentence (in case of ambiguity, a pre-specified preference ordering of the actions is used). During the parsing phase, in each situation the parser chooses the premise-action pair with the highest score.

Zeman's PhD thesis [Zeman, 2004] describes several methods for evaluating dependency parsers and also for improving existing results by combining different analysers. Holan's parser ANALOG has no training phase; in the parsing phase, it searches the training data for the most similar local tree configuration [Holan, 2005]. At the moment, the best results on the PDT e-test part are achieved by the combining parser [Zeman and Žabokrtský, 2005, Holan and Žabokrtský, 2006], which combines the results of the described dependency parsers together.

2.4 Conclusions

As written above, most of the published parsers of the Czech language use dependency syntax. The analyser described in this thesis is based on a constituent grammar. The reasons are mainly historical: at the time our system was created, there was intensive work on a constituent grammar for Czech. Another reason for the phrasal formalism is that the algorithms described in this work were developed and tested in several experiments with English (and partially with French). The robust algorithm described in Chapter 4 is targeted only at English. On the other hand, as shown in the following chapters, our results for the Czech language are comparable with the parsers described in this chapter.

Chapter 3

Parsing System with Contextual Constraints

This chapter describes our algorithms and methods, which are the main results of this thesis. These algorithms are combined into one language-independent parsing system. The analysis of sentences by the system is based on a context-free grammar supplemented by context-sensitive structures. Our mechanism of applying contextual constraints allows us to work with a small number of grammar rules.

3.1 System Overview

The described system consists of several independent modules. The modular design makes the system easily extensible and rather flexible. Figure 3.1 shows the data flow through the system. There are several inputs to the system:

• A sentence in a natural language.

• A context-free (CF) grammar.

• Semantic actions and contextual constraints for the grammar rules.

Words in the input sentence can optionally be tagged. If they are not tagged, then the internal "tagger" is used. The term "tagger" is not entirely accurate here, because we leave ambiguities in the tags. For the Czech language, the morphological analyzer ajka [Sedláček, 2005] is used to create the tags.


Figure 3.1: Parsing System with Contextual Constraints

For other languages, the tags are usually read from a static lexicon. These tags are stored as "values" (see below) for every word. The terminals (sometimes called pre-terminals) of the given context-free grammar are created by simplifying the tags, e.g. using only the word category as a terminal.

Once the terminals are created, the context-free parsing algorithm is run. This algorithm produces all possible derivation trees at the output with respect to the input sentence and the input grammar. All these derivation trees are stored in a data structure based on the shared-packed forest [Tomita, 1986]. Because a chart parser is used in our system (see Section 3.2), the derivation trees are stored as a chart data structure directly. Any context-free parsing algorithm could be used; the modularity of the system allows us to compare the effectiveness of these algorithms easily, see Section 5.2.1.

All derivation trees created in the previous step can be filtered by some basic filter that cuts some trees off. In this step, only basic filtering "compatible" with the shared-packed forest data structure is allowed, e.g. only whole sub-forests can be removed. An example of such filtering is given in Section 3.4.

The next step is the application of contextual constraints and semantic actions. In this step a new data structure, a "forest of values", is created by a bottom-up recursive run of the semantic actions, see Section 3.5.

If the input sentence cannot be generated by the input grammar, i.e. there is no derivation tree at the output of the context-free parsing algorithm, the system offers a robust module. In this case, the contextual constraints and semantic actions are applied to every derivation sub-tree in the shared-packed forest. Then the robust algorithm described in Chapter 4 is used to get the derivation tree(s).

The resulting forest of values can be further filtered by constraints that work with the whole forest, not only with local values. An example of this global filtering is the use of the lexicon of verb valencies VerbaLex, see Section 3.6.

In the end, the derivation trees are generated from the filtered forest of values. Only one or several "best" derivation trees can be created, with respect to a ranking function; e.g. the probability of a tree could be used as one input to the ranking function. In the following sections, the above ideas are described in more detail.


3.2 Head-Driven dependent dot move

This section presents our improved form of a head-driven chart parser that is appropriate for large context-free grammars. The basic method — HDddm (Head-Driven dependent dot move) — is introduced first. Both variants that improve the basic approach are based on the same idea — to reduce the number of chart edges by modifying the form of the items (dotted rules). The first one "unifies" the items that share the analyzed part of the relevant rule (thus, only one dot is needed to mark the position before and after the covered part). The second method applies the inverse strategy: it "eliminates" the parts that have not been covered yet (no dot needed). All the discussed alternatives are described in the form of parsing schemata [Sikkel, 1996]. Because of the use of parsing schemata, the word "edge" has the same meaning as the word "item" in this section.

Also, a tricky technique (employing a special trie-like data structure developed originally for Scrabble) that enables minimizing the extra information needed in the algorithms is mentioned. The advantages of the described methods are demonstrated by the significant decrease in the number of edges in the charts (see Section 2.1.2). The results are given for the standard set of testing grammars (and respective inputs) as well as for a large and highly ambiguous Czech grammar.

3.2.1 Motivation

To enable efficient processing, parsing algorithms employ sophisticated structures to store intermediate parsing results. The most popular data structure for this purpose is the chart. The chart can be viewed as a table of all viable strings or as a set of items. Dotted rules (items) are used to represent the state of the analysis. The dots mark the boundary between the analyzed and non-analyzed parts of items. If the parsing procedure is unidirectional (left to right or vice versa), only one dot is needed. There are many advanced techniques aiming at the refinement of basic chart parsing algorithms (see Section 2.1.2). Here, the ones based on head-driven approaches are presented.

In the experiments behind the effort discussed here, special attention is paid to parser robustness. If no complete parse is found for an input (e.g. from a speech recognizer in a dialogue system), a special technique is employed to efficiently retrieve a set of the most probable maximal sub-trees (see Chapter 4) to provide a partial analysis of the input. Therefore, the most efficient (in the general case) approach to head-driven parsing — the head-corner chart parser [Kay, 1989b, Satta and Stock, 1989] — cannot be applied: it would prune chart edges that could be needed in our later processing of an incomplete parse. Moreover, the head-driven bottom-up algorithm discussed above is also more suitable for our research on incremental parsing [Smrž and Kadlec, 2005]. Nevertheless, the refinements depicted in the text are directly applicable to the head-corner case.

As the text is rather technical, parsing schemata [Sikkel, 1996] are used. The formalism provides an algebraic method appropriate for the description of the key ideas of parsing algorithms. The following section brings a basic version of the algorithm that presents a slight improvement over a known chart parsing technique. The next sections discuss two modifications of the basic method aiming at a reduction of the number of edges in the resulting chart. The first method eliminates those parts of the dotted rules that were already analyzed; the second, in reverse, keeps only the analyzed part of the items. Though the approaches work in reverse, both lead to a significant decrease of the number of edges. The optimization technique has been previously used for other variants of chart parsing; its application to head-driven approaches is original.

A smaller number of chart edges does not necessarily entail more efficient parsing. A technique optimizing the search in grammar rules (their right-hand sides) is briefly described. It exploits sophisticated data structures. A standard trie structure is sufficient for the first refinement. The second case requires an efficient procedure enabling a bidirectional search for paths starting from any symbol on the right-hand side of grammar rules. Such a procedure (designed originally for generating possible moves in Scrabble) has been employed in the second case.

Optimization of parsing need not have a dramatic effect on the overall performance if one needs to parse grammars with a relatively small number of rules. However, we aim at applications where very large (and highly ambiguous) grammars are used. The results of the refinements discussed here are demonstrated on both the Czech grammar (see Chapter 6) and the PT grammar generated from the Penn Treebank. The latter is a part of a standard set of testing grammars that are available together with the respective inputs on the Internet.

Thus, the designed methods were tested on all the grammars from the standard set (see Chapter 5). The last section summarizes the size of the resulting chart (in terms of the number of edges) for parsing on the test set. The effect of both optimizations can be significant in some cases (less than 50% of the edges in the chart).

3.2.2 HDddm Parsing

This section describes the basic version of the head-driven algorithm (HD). As in other "head-oriented" approaches to parsing, the direction of the parsing process is not unidirectional (e.g., from left to right). It starts at the head of the given grammar rule and processes it bidirectionally towards the first and the last rule symbols. Similarly to [Satta and Stock, 1989] and [Sikkel and op den Akker, 1993], the HDddm (head-driven with dependent dot move) technique improves the process of viable hypotheses confirmation. HDddm refers to the fact that the move of one "dot" in the head-driven parsing step is dependent on the opposite move of the other one.

The algorithm can be described as a parsing system [Sikkel, 1996]. A parsing system P is a triple P = ⟨I, H, D⟩, in which:

• I is a set of items called the domain or item set of P;

• H is a finite set of items (not necessarily a subset of I), the hypotheses of P;

• D is a set of deduction steps of the form η₁, …, ηₖ ⊢ ξ, with ηᵢ ∈ I ∪ H for 1 ≤ i ≤ k and ξ ∈ I. The items η₁, …, ηₖ are called the antecedents and ξ the consequent of the deduction step.

The parsing schema for the HDddm parsing technique, P_HD = ⟨I_HD, H, D_HD⟩, is defined as follows. The domain I_HD is given as:

I^HD(i) = {[A → α•βXγ•δ, i, j] | A → αβXγδ ∈ P, 0 ≤ i ≤ j ≤ n},

I^HD(ii) = {[A → ••, i, i] | A → ε ∈ P, 0 ≤ i ≤ n},


I_HD = I^HD(i) ∪ I^HD(ii).

The hypotheses set H encodes the input sequence. For input w₁ … wₙ:

H = {[w, i − 1, i] | w ≡ wᵢ, 1 ≤ i ≤ n}.

The set of deduction steps D_HD is defined as follows:

D^Init = {[w, i − 1, i] ⊢ [A → α•w•β, i − 1, i]} ∪ {⊢ [A → ••, i, i]},

D^Pred = {[A → •α•, i, j] ⊢ [B → β•A•γ, i, j]},

D^Scan(i) = {[A → α•β•wδ, i, j], [w, j, j + 1] ⊢ [A → α•βw•δ, i, j + 1]},

D^Scan(ii) = {[w, i − 1, i], [A → αw•β•, i, j] ⊢ [A → α•wβ•, i − 1, j]},

D^Complete(i) = {[A → α•β•Bδ, i, j], [B → •γ•, j, k] ⊢ [A → α•βB•δ, i, k]},

D^Complete(ii) = {[B → •γ•, i, j], [A → αB•β•, j, k] ⊢ [A → α•Bβ•, i, k]},

D_HD = D^Init ∪ D^Pred ∪ D^Scan(i) ∪ D^Scan(ii) ∪ D^Complete(i) ∪ D^Complete(ii).

The string γ in the Complete steps can be empty (ε). Note that the left dot in an edge cannot move leftwards until the right dot has moved fully to the right. This is precisely the difference between HDddm and the technique described in [Satta and Stock, 1989]: the parser never creates edges of the form [A → α•βXγ•δ, i, j] with both β and δ non-empty. This approach avoids redundant work during the analysis.

A simplified form of the algorithm is given here, so various optimizations are possible. For example, the real implementation of the algorithm benefits from an approach that uses so-called CYK items ([A, i, j]) in the above definitions (see [Sikkel and op den Akker, 1993] for details).
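To make the schema concrete, the following is a minimal recognizer sketch in Python. The tuple encoding is our own and purely illustrative — this is not the thesis implementation, which relies on the data structures of Section 3.2.5. Rules are (lhs, rhs, head) triples; an item (r, dl, dr, i, j) states that in rule r the segment rhs[dl:dr], which contains the head, covers words[i:j]; the dependent dot move is enforced by moving the left dot only once the right dot has reached the end of the rule.

    from collections import defaultdict

    def hdddm_recognize(rules, words, start):
        n = len(words)
        nonterms = {lhs for lhs, _, _ in rules}
        comp_from = defaultdict(set)   # (A, i) -> ends j of complete A items
        comp_to = defaultdict(set)     # (A, j) -> starts i of complete A items
        chart, agenda = set(), []

        def add(item):
            if item not in chart:
                chart.add(item)
                agenda.append(item)

        def extend(item):
            """Scan/Complete steps applied to one (possibly partial) item."""
            r, dl, dr, i, j = item
            _, rhs, _ = rules[r]
            if dr < len(rhs):                    # right dot first (ddm)
                x = rhs[dr]
                if j < n and x == words[j]:                # D_Scan(i)
                    add((r, dl, dr + 1, i, j + 1))
                for k in comp_from.get((x, j), ()):        # D_Complete(i)
                    add((r, dl, dr + 1, i, k))
            elif dl > 0:                         # right dot done, move left
                x = rhs[dl - 1]
                if i > 0 and x == words[i - 1]:            # D_Scan(ii)
                    add((r, dl - 1, dr, i - 1, j))
                for g in comp_to.get((x, i), ()):          # D_Complete(ii)
                    add((r, dl - 1, dr, g, j))

        # D_Init: head items for terminal heads, plus all epsilon items.
        for r, (lhs, rhs, h) in enumerate(rules):
            if not rhs:
                for i in range(n + 1):
                    add((r, 0, 0, i, i))
            elif rhs[h] not in nonterms:
                for i, w in enumerate(words):
                    if w == rhs[h]:
                        add((r, h, h + 1, i, i + 1))

        while agenda:
            item = agenda.pop()
            r, dl, dr, i, j = item
            lhs, rhs, _ = rules[r]
            if dl == 0 and dr == len(rhs):       # complete item [A -> .α., i, j]
                comp_from[lhs, i].add(j)
                comp_to[lhs, j].add(i)
                for r2, (_, rhs2, h2) in enumerate(rules):     # D_Pred
                    if rhs2 and rhs2[h2] == lhs:
                        add((r2, h2, h2 + 1, i, j))
                for other in list(chart):        # re-extend waiting items
                    extend(other)
            else:
                extend(item)

        return n in comp_from.get((start, 0), set())

    # S -> A S A (head S), S -> b, A -> a; recognizes e.g. "a b a"
    rules = [("S", ("A", "S", "A"), 1), ("S", ("b",), 0), ("A", ("a",), 0)]
    print(hdddm_recognize(rules, ["a", "b", "a"], "S"))        # True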

3.2.3 Head-Driven Algorithm with One Dot in Items

The idea comes from [Leermakers, 1992] (for Earley's algorithm); it has been applied to the left-corner algorithm in [Moore, 2000a]. The optimization is based on the observation that the non-terminals to the left of the dot in an Earley "one dot" item play no role in the algorithm. The approach can be adapted for the HD algorithm so that the non-terminals between the dots in an HD item can be "forgotten". The algorithm with one dot in items (iHD) can be described as a parsing system P_iHD = ⟨I_iHD, H, D_iHD⟩. The domain I_iHD is given as:

I^iHD(i) = {[A → α•δ, i, j] | A → αβXγδ ∈ P, 0 ≤ i ≤ j ≤ n},

I^iHD(ii) = {[A → •, i, j] | A → α ∈ P, possibly α ≡ ε, 0 ≤ i ≤ j ≤ n},

I_iHD = I^iHD(i) ∪ I^iHD(ii).

The HD item of the form [A → α•β•δ, i, j] is replaced with the iHD item [A → α•δ, i, j]. This approach leads to a lower number of items in the resulting chart. The hypotheses set H is the same as for P_HD. The set of deduction steps D_iHD is given as:

D^Init(i) = {[w, i − 1, i] ⊢ [A → α•β, i − 1, i] | A → αwβ ∈ P},

D^Init(ii) = {⊢ [A → •, i, i] | A → ε ∈ P},

D^Pred = {[A → •, i, j] ⊢ [B → α•β, i, j] | B → αAβ ∈ P},

D^Scan(i) = {[A → α•wδ, i, j], [w, j, j + 1] ⊢ [A → α•δ, i, j + 1]},


D^Scan(ii) = {[w, i, i + 1], [A → αw•, i + 1, j] ⊢ [A → α•, i, j]},

D^Complete(i) = {[A → α•Bδ, i, j], [B → •, j, k] ⊢ [A → α•δ, i, k]},

D^Complete(ii) = {[B → •, i, j], [A → αB•, j, k] ⊢ [A → α•, i, k]},

D_iHD = D^Init(i) ∪ D^Init(ii) ∪ D^Pred ∪ D^Scan(i) ∪ D^Scan(ii) ∪ D^Complete(i) ∪ D^Complete(ii).

The deduction steps are similar to the HD steps. Note that an item [A → •, i, j] now represents the situation that an arbitrary rule with left-hand side A has been recognized between positions i and j. As in the previous case, there is a refined version where the left dot moves left only if the right one is already after the rightmost symbol. The same is true for the replacement of [A → •, i, j] items by CYK items [A, i, j]; the above definition would just be more complex. Items of the form [A, i, j] are called CYK items in the following algorithm.

3.2.4 Simplified Items, No Dots Needed

The algorithm described in the previous section cuts out the already analyzed part, represented by β in the HD items [A → α•β•δ, i, j]. However, the opposite approach is possible as well. One can realize that the remaining parts of the rule (the left-hand side non-terminal A and the parts α and δ) are given by the grammar. Thus, only the analyzed parts are stored in the chart. This general idea from [Nederhof and Satta, 1994] is applied here to HDddm parsing. Note that the dots in the items occurring in the following definitions play no role in the algorithm; they are kept only for better readability and comparison with the previous algorithms. The head-driven algorithm with simplified items (sHD) can be described as a parsing system P_sHD = ⟨I_sHD, H, D_sHD⟩. The domain I_sHD is defined as:

I^sHD(i) = {[•βXγ•, i, j] | A → αβXγδ ∈ P, 0 ≤ i ≤ j ≤ n},


I^CYK = {[A, i, j] | A → α ∈ P, possibly α ≡ ε, 0 ≤ i ≤ j ≤ n},

I_sHD = I^sHD(i) ∪ I^CYK.

Item [•β•, i, j] denotes that:

• β covers the input sequence between i and j;

• there exists a grammar rule A → αβγ ∈ P with the head in β.

A CYK item [A, i, j] represents a complete (inactive) item as discussed in the previous section. Notice the difference between [A, i, j] and [•A•, i, j]. The deduction steps D_sHD are given as:

D^Init(i) = {[w, i − 1, i] ⊢ [•w•, i − 1, i] | A → αwβ ∈ P},

D^Init(ii) = {⊢ [A, i, i] | A → ε ∈ P},

D^Pred = {[A, i, j] ⊢ [•A•, i, j] | B → αAβ ∈ P},

D^Scan(i) = {[•β•, i, j], [w, j, j + 1] ⊢ [•βw•, i, j + 1] | A → αβwγ ∈ P},

D^Scan(ii) = {[w, i, i + 1], [•β•, i + 1, j] ⊢ [•wβ•, i, j] | A → αwβ ∈ P},

D^Complete(i) = {[•β•, i, j], [B, j, k] ⊢ [•βB•, i, k] | A → αβBδ ∈ P},

D^Complete(ii) = {[B, i, j], [•β•, j, k] ⊢ [•Bβ•, i, k] | A → αBβ ∈ P},

D^Complete(iii) = {[•α•, i, j] ⊢ [A, i, j] | A → α ∈ P},

D_sHD = D^Init(i) ∪ D^Init(ii) ∪ D^Pred ∪ D^Scan(i) ∪ D^Scan(ii) ∪ D^Complete(i) ∪ D^Complete(ii) ∪ D^Complete(iii).

3.2.5 Data Structures

To implement an efficient parsing algorithm, appropriate data structures are needed to store the information. Especially the Complete steps in the discussed algorithms ask for special handling. The appropriate data structure has to efficiently represent (parts of) the right-hand sides of grammar rules. For example, the sHD algorithm needs to identify fully analyzed rules in the step D^Complete(iii) above.

The trie [Fredkin, 1960] structure proved to be the most suitable candidate for storing the right-hand sides of grammar rules. This is true especially for parsing with our grammar of Czech, where the average number of right-hand-side symbols in the rules is rather high. Chappelier and Rajman [Chappelier and Rajman, 1998b] apply a trie to store Earley-like simplified items. The same approach works for the basic variant of our algorithm as well as for iHD. A free implementation of finite-state automata and the routines for searching them is described in [Daciuk et al., 1998].

It is much more difficult to come up with an appropriate data structure to efficiently search rules in the case of the last parsing method (sHD). The removal of the non-analyzed parts from the sHD items requires a bidirectional search from the inside of the right-hand side. A sophisticated data structure for this purpose has been found in GADDAG [Gordon, 1994]. Originally, it was designed for generating possible moves in an implementation of the game Scrabble. It allows a bidirectional path starting from each letter (a symbol on the right-hand side in our case) of each word (a grammar rule) in the lexicon (the grammar).
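A GADDAG-like index for rule right-hand sides can be sketched as follows: since HDddm first extends to the right of the head and only then to the left, each rule can be stored along the path rhs[head:], a marker, and then the reversed prefix rhs[:head]. This encoding is our own simplified illustration of the idea, not Gordon's original structure:

    def build_rule_index(rules):
        """A trie over rule right-hand sides: from the head symbol one walks
        right to the end of the rhs, crosses the marker '<', and then walks
        left to its beginning.  Leaves ('$') store (lhs, rule_id)."""
        trie = {}
        for rule_id, (lhs, rhs, head) in enumerate(rules):
            path = list(rhs[head:]) + ["<"] + list(reversed(rhs[:head]))
            node = trie
            for symbol in path:
                node = node.setdefault(symbol, {})
            node["$"] = (lhs, rule_id)
        return trie

Walking such a trie in step with the dot moves answers, per extension, whether a partially recognized segment can still grow into the right-hand side of some rule, and which rule has just been completed.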

3.2.6 Conclusions

The iHD and sHD algorithms process and store exactly inverse information, but both approaches are functional. The degree of their contribution depends on the grammar, the positions of the heads and the input. The sHD method produces about a 50% reduction of the number of edges without any change in the algorithm; see Section 5.2.2 for our experimental results.


A reduction of the chart size does not necessarily imply a more efficient parser. As the complete incorporation of the discussed refinements into our system is not finished yet, there is no comparison of running times for the discussed methods. However, the results clearly demonstrate that the data structures discussed above really allow working with the more complex items efficiently enough to outperform the basic method with the standard items.

The described improvements of the HD algorithm are presented in [Kadlec and Smrž, 2006]. Note that they do not exclude all the other refinements designed for the original form of items (see, e.g., [Sikkel, 1996] and [Sikkel and op den Akker, 1993]). Moreover, our modified items can be directly taken over for the head-corner algorithm. All experiments (with some exceptions) in Chapter 5 use the HDddm method described in this section.

3.3 Optimizing Heads for Parsing

The effectiveness of our HDddm parsing technique (presented in the previous Section 3.2) is highly dependent on the positions of the heads in the grammar rules.¹ It is usually expected that the positions of heads are set according to linguistic intuition, i.e. that the parser should first instantiate the linguistic heads, or governing nodes [Kay, 1989b, Bouma and van Noord, 1993]. However, only a limited number of experiments have been published that would prove the validity of such a heuristic. One of the reasons for this situation is that implementations of head-driven parsing algorithms efficient enough to enable experiments of this kind have appeared only recently.

In the following, a method aiming at choosing the best position of CF rule heads for parsing with various natural-language grammars is presented. The optimization step focuses on the number of edges (see Section 3.2) in the resulting chart. The optimization can have an almost negligible effect when parsing with small or medium-size grammars. However, as the results presented below demonstrate, it can be of crucial importance for parsing with large and highly ambiguous grammars. The goal of optimizing the positions of rule heads for parsing is not completely new.

¹ The experimentally chosen heads are called "keys" in [Oepen and Callmeier, 2000].

It appeared as a rule instantiation strategy in [Oepen and Callmeier, 2000]. Oepen and Callmeier discuss the head selection procedure in the context of HPSG [Pollard and Sag, 1994] parsing and employ their PET platform for the evaluation of the results. The form of the LinGO grammar used in their experiments is rather simple; it contains only binary rules (up to two symbols on the right-hand side of a rule). On the contrary, at least the Czech grammar used in our experiments contains many rules with complex right-hand sides, which makes the optimizing procedure much more demanding. However, the general conclusion is the same in both cases.

3.3.1 Optimization Procedure

The process of finding the positions of rule heads that are optimal for a given parsing algorithm can be summarized as follows. Grammar rules are taken one after the other. The analysis of the given input is run for all possible head positions in the rule. The best head position (the position for which the number of edges in the resulting chart is minimal) is chosen and the rule is given back to the grammar. This is done for all grammar rules.

This "greedy" algorithm finds optimal head positions for a given input sentence. Obviously, one can obtain grammars with different heads for every input sentence. Thus, the most often used head position is chosen for every rule to build the final grammar; a sketch of the whole procedure is given below.

The grammars created by taking the most frequently used head positions need not be optimal for the given set of input sentences. To get an optimal result, the whole input data set, instead of just one sentence, should be parsed in one step of the optimization process. However, such a procedure would be harder to parallelize, and our experiments on a chosen part of the test set suggest that the output of the simple algorithm is close enough to the optimal solution.

Section 5.2.1 contains the results of our experiments with CF grammars with different positions of the heads. Section 5.2.3 presents the application of the optimizing procedure to two different grammars, our CF grammar for Czech and ATIS.
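The greedy procedure can be sketched as follows; count_edges(rules, sentence) is an assumed callback that runs the head-driven parser and returns the resulting chart size, and rules are (lhs, rhs, head) triples as in the earlier sketches:

    from collections import Counter

    def optimize_heads(rules, sentences, count_edges):
        """Greedy head optimization: per sentence, move each rule's head to
        the position minimizing the chart size; the final grammar takes,
        for every rule, the head chosen most often over all sentences."""
        votes = [Counter() for _ in rules]
        for sentence in sentences:
            current = list(rules)
            for r, (lhs, rhs, _) in enumerate(rules):
                if not rhs:
                    continue
                best = min(range(len(rhs)),
                           key=lambda h: count_edges(
                               current[:r] + [(lhs, rhs, h)] + current[r + 1:],
                               sentence))
                current[r] = (lhs, rhs, best)
                votes[r][best] += 1
        return [(lhs, rhs, votes[r].most_common(1)[0][0] if votes[r] else head)
                for r, (lhs, rhs, head) in enumerate(rules)]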


3.4 Shared-Packed Forest filtering

In this section, a method of filtering the shared-packed forest data structure is presented. This filtering is used only to remove some data from the structure; no new information is added. The technique is a step preceding the filtering by contextual constraints; see Figure 3.1 in Section 3.1 for an overview of the whole system. The resulting shared-packed forest is always a sub-forest of the original one. That means that only simple transformations, such as removing a node, are done here.

3.4.1 Filtering by rule levels

One possible filtering of the shared-packed forest is a local one (with respect to a node of the forest), based on what we call "rule levels". The rule level is a function that assigns a natural number to every grammar rule. In the following, the term "rule level of a grammar rule" denotes the number resulting from the application of this function to the rule.

The idea is that, for some grammar rules, if the specific grammar rule succeeds during the parsing process, then an application of some other rule (with the same non-terminal on the left-hand side of the rule) is wrong. To be more precise, if the specific grammar rule covers the same input as some other grammar rule beginning with the same non-terminal, then the rule with the lower rule level is refused.

The chart structure (see Section 2.1.2) is used to represent the shared-packed forest, so the filtering method is described in terms of chart parsing: if there are edges [A → •α•, i, j] and [A → •β•, i, j] in the chart, then delete the edge whose grammar rule has the lower rule level. If the edges have the same rule level, keep them both in the chart. Figure 3.2 shows an example of such rules; w₁, w₂, …, w₆ represent the input sentence.
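A minimal sketch of this filtering over a set of inactive edges (our own encoding, for illustration: an edge is a tuple (lhs, rule_id, i, j) and level(rule_id) returns the rule level assigned to the rule):

    from collections import defaultdict

    def filter_by_rule_levels(edges, level):
        """Among inactive edges with the same non-terminal over the same
        span, keep only those whose rule has the maximal rule level;
        edges sharing that level all survive."""
        groups = defaultdict(list)
        for lhs, rule_id, i, j in edges:
            groups[lhs, i, j].append((lhs, rule_id, i, j))
        kept = set()
        for group in groups.values():
            top = max(level(rule_id) for _, rule_id, _, _ in group)
            kept.update(e for e in group if level(e[1]) == top)
        return kept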

3.4.2 Conclusions

Notice that this kind of filtering is different from a probability of the grammar rule: the presented method is local to the specific node in the shared-packed forest. By default, all grammar rules have the same rule level. The rule levels are set by hand and only in very specific cases; actually, only one rule in our grammar for Czech (see Chapter 6) has a non-default rule level. Only a small number of experiments were performed, because this method is new to our system.


Figure 3.2: Filtering by rule levels. Two sub-forests with grammar rules A → α and A → β in their roots. One of them is filtered out if these rules have different rule levels set.

3.5 Contextual Constraints and Semantic Actions

Our main problem with CF parsing is that there are too many derivation trees for a given input sentence. The contextual constraints are mainly used to prune incorrect derivation trees from the CF parsing result. Some additional information can also be computed by these constraints, which is why we also call them "semantic actions". In the following, the term "contextual constraint" has the same meaning as the term "semantic action". Our algorithms for CF parsing generate the chart structure; thus we use the word "chart" to denote the result of the CF parsing. See Figure 3.1 for an overview of where in the parsing system the constraints are evaluated. The contextual constraints (or actions) defined in the grammar can be divided into four groups:


1. rule-tied actions

2. case agreement constraints

3. post-processing actions

4. actions based on derivation tree

An example of a rule-tied action is rule-based probability estimation. Case agreement constraints serve as chart pruning actions and they are used in generating the expanded grammar; see Chapter 6 for a detailed description of the grammar that we use. The case agreement constraints represent the functional constraints, whose processing can be interleaved with that of the phrasal constraints.

The post-processing actions are not triggered until the chart is already completed. Actions on this level are used mainly for the computation of analysis probabilities for a particular input sentence and a particular analysis. Some such computations (e.g. verb valency probability) demand exponential resources for computation over the whole chart structure. This problem is solved by splitting the calculation process into a pruning part (run on the level of post-processing actions) and a reordering part, which is postponed until the actions based on the derivation tree.

The actions that do not need to work with the whole chart structure are run after the best or the n most probable derivation trees have been selected. These actions are used, for example, for the determination of possible verb valencies within the input sentence (see Section 3.6), which can produce a new ordering of the selected trees, or for the logical analysis of the sentence [Horák, 2002a].

3.5.1 Representation of values

It was shown that parsing is in general NP-complete if grammars are allowed to have agreement features [Barton et al., 1987]. But the pruning constraints in our system are weaker than, for example, general feature structures [Kay, 1985]. We allow a node in the derivation tree to have only a limited number of features. We call the features "values", because they arise as results of our semantic actions. E.g. the number of values for noun groups in our system is at most 56.


Figure 3.3: Example of the forest of values. The edge [0, 2, npnl] → [0, 1, np] [1, 2, np] holds the values value3 and value4; each np edge holds the values value1 and value2, and each value holds a list of its children (combinations of child values).

To compute the values, we build a new structure, a forest of values, instead of pruning or extending the original chart. The forest of values is computed by a depth-first walk through the chart structure (the chart can be viewed as an oriented graph, see Section 2.1.2). Every edge in the chart is visited only once, and an edge can generate at most one node in the new forest of values. The value is computed as a result of the semantic action for the grammar rule given by the current edge. The parameters for the semantic action are filled from the values on the lower level, "lower" with respect to the derivation tree, i.e. closer to the leaves of the tree. So the arguments of the semantic action are also bounded by the same limit (e.g. 56 possibilities in our case). Because there can be more than one derivation tree containing the current edge, all possible combinations of values are passed to the semantic action. The worst-case time complexity for one node in the forest of values is therefore 56^δ, where δ is the length of the longest right-hand side of a grammar rule. Notice that this complexity is independent of the number of words in the input sentence. The values in the forest of values are linked back to the edges. An edge contains a singly linked list of its values. Each value holds a singly linked list of its children. A child is a one-dimensional array of values; this array represents one combination of values that leads to the parent value. Notice that there can be more combinations of values that lead to the same value. The i-th cell of the array contains a reference to a value from the i-th symbol on the RHS of the corresponding grammar rule. The i-th symbol need not be used to compute the parent value; in such a case, the unused cell contains only a reference to the edge.
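A minimal sketch of this representation in C follows (the type and field names are illustrative, not the actual synt implementation):

typedef struct value value_t;

typedef struct child {      /* one combination of child values       */
    int arity;              /* length of the rule's right-hand side  */
    value_t **cells;        /* cells[i]: value from the i-th RHS
                               symbol, or NULL if that symbol was not
                               used to compute the parent value (the
                               cell then refers only to the edge)    */
    struct child *next;     /* next combination yielding this value  */
} child_t;

struct value {
    void *data;             /* the result of the semantic action     */
    child_t *children;      /* combinations that lead to this value  */
    value_t *next;          /* next value in the edge's value list   */
};

typedef struct vedge {      /* a chart edge with its list of values  */
    int rule, from, to;
    value_t *values;        /* singly linked list of the edge's values */
} vedge_t;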


Figure 3.3 shows an example representing the rule npnl → np np and containing three edges ([0, 2, npnl → •np np•], [0, 1, np → •α•], [1, 2, np → •β•]). The right-hand sides of the np rules are not shown in the figure, as they play no role here; np → α and np → β are some rules of the input grammar. Each np edge contains two values, value1 and value2. This gives us four possible combinations. The semantic action computes the same value value4 from the combinations value1 × value2 and value2 × value1. The combination value2 × value2 was classified as incorrect (by the action, i.e. the contextual constraint), so it is not present.

3.5.2 Generation of a grammar with values

It is possible to create a CF grammar without our contextual constraints which generates the same derivation trees as the CF grammar supplemented by the constraints. In the following, a method is provided that, for a given input, generates such a CF grammar without values. This allows us to compare our system, which is able to evaluate the constraints, with other systems able to work only with "pure" CF grammars. We use the following procedure for every inactive edge [i, j, A → X1X2...Xn•] in the chart:

• for every value v in the edge, we generate the rule A → A_value, where value is a unique textual representation of the value v,

• for every child of the value v, we generate the rule A_value → X′1 X′2 ... X′n, where X′i is:

– Xi_valuei if a value valuei from the i-th non-terminal is used to compute the value v,

– Xi otherwise.

Duplicate rules are removed. Figure 3.5 shows the generated grammar for the input 2/1-1 and the grammar with actions from Figure 3.4. The $$ parameter of an action represents the returned value; the $k parameter is a variable where the value of the k-th non-terminal of the rule is stored. The constraint is_not_zero filters out the trees which represent "a division by zero". In the generated grammar, the suffix encodes the computed value, e.g. e_1 denotes an e whose value is 1. The input has two derivation trees in the original grammar (if the actions are omitted), but the corresponding generated grammar gives us only one derivation tree, because of the is_not_zero action.


e -> e "+" e add ( $$ $1 $3 ) e -> e "-" e sub ( $$ $1 $3 ) e -> e "/" e is_not_zero ( $3 ) div ( $$ $1 $3 ) e -> NUMBER value_of ( $$ $1 )

Figure 3.4: Grammar with the contextual constraint is_not_zero and semantic actions.

e -> e_0
e -> e_1
e -> e_2
e_0 -> e_1 "-" e_1
e_1 -> NUMBER
e_1 -> e_2 "-" e_1
e_2 -> NUMBER
e_2 -> e_2 "/" e_1

Figure 3.5: Generated grammar with values for the input 2/1-1.

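The generation procedure itself can be sketched as follows, reusing the vedge_t, value_t and child_t structures from the sketch in Section 3.5.1; the helpers lhs_name(), rhs_name() and value_name() are hypothetical and stand for the rule's non-terminal names and the unique textual representation of a value (duplicate rules would be filtered afterwards):

#include <stdio.h>

const char *lhs_name(int rule);
const char *rhs_name(int rule, int i);
const char *value_name(const value_t *v);

void generate_rules_for_edge(const vedge_t *e)
{
    for (const value_t *v = e->values; v; v = v->next) {
        /* A -> A_value */
        printf("%s -> %s_%s\n", lhs_name(e->rule),
               lhs_name(e->rule), value_name(v));
        for (const child_t *c = v->children; c; c = c->next) {
            /* A_value -> X'1 X'2 ... X'n */
            printf("%s_%s ->", lhs_name(e->rule), value_name(v));
            for (int i = 0; i < c->arity; i++) {
                if (c->cells[i])    /* i-th value used in computation */
                    printf(" %s_%s", rhs_name(e->rule, i),
                           value_name(c->cells[i]));
                else                /* i-th symbol unused             */
                    printf(" %s", rhs_name(e->rule, i));
            }
            printf("\n");
        }
    }
}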

3.5.3 Conclusions

Why are the actions and semantic constraints used when they can be replaced by a grammar with values? There are three main reasons. First of all, the grammar with values for all possible inputs would be extremely large, even if the domain range is limited, e.g. by 56 in our case. Secondly, the actions can be easily changed and debugged when computed separately. The third reason is that some of our experiments use semantic actions with an unlimited domain range, and these actions cannot be substituted by the grammar. See Section 6.4 or Section 3.6 for examples of actions with an unlimited domain range.

3.6 Verb Valences

In this section, an exploitation of the lexicon of verb valencies during the parsing process is shown. Our method works in two steps: in the first step, the parser automatically discovers possible valencies in the input sentence, and then these found valencies are compared with the valencies from the lexicon. This enables us to prune impossible combinations with regard to the particular verb, and to automatically process large corpora in order to discover possible verb valencies that are missing in the lexicon. The lexicon of verb valencies called VerbaLex (see below) is used for our experiments.

3.6.1 VerbaLex

The lexicon of verb valencies, VerbaLex [Hlaváčková et al., 2006], was created in 2005. VerbaLex is based on three valuable language resources for Czech, three independent electronic lexicons of verb valency frames. The first resource, Czech WordNet valency frames, was created during the Balkanet project and contains semantic roles and links to the Czech WordNet. The second resource, VALLEX 1.0 [Žabokrtský and Lopatková, 2004], is a lexicon based on the formalism of the Functional Generative Description and was developed during the Prague Dependency Treebank project [Hajič, 1998].


běžet:1 / utíkat:2
(1) AG(person:1, kdo1)  VERB  SOC(person:1, za+kým7)  MAN-opt(Adv jak)
    example: vnuk běžel radostně za babičkou
(2) AG(person:1, kdo1)  VERB  LOC-opt(Adv kam)  MAN-opt(Adv jak)
    example: utíkal rychle domů
(3) AG(person:1, kdo1)  VERB  OBJ(vehicle:1, za+čím7)
    example: běžel za tramvají

Figure 3.6: An example of a VerbaLex verb frame

The third source of information for VerbaLex is the syntactic lexicon of verb valencies denoted as BRIEF, which originated at FI MU Brno in 1996 [Pala and Sevecek, 1997]. The resulting lexicon VerbaLex incorporates all the information found in these resources plus additional relevant information such as verb aspect, verb synonymy, types of use and semantic verb classes based on the results of the VerbNet project [Dang et al., 1998]. The information in VerbaLex is organized in the form of complex valency frames. All the valency information in VerbaLex is specified with regard to particular verb senses, not only verb lemmata, as was the case in some of the sources. The current work on the lexicon data aims at enlarging the lexicon to the size of about 16,000 Czech verbs. The VerbaLex lexicon displays syntactic dependencies of sentence constituents, their semantic roles and links to the corresponding Czech WordNet classes. An example of such a verb frame is presented in Figure 3.6. For more details about the VerbaLex frame notation, please refer to [Hlaváčková et al., 2006].


3.6.2 Filtering by Valences

The filtering of the forest of values (see Section 3.5) consists of the two steps described below. At the moment, the VerbaLex semantic roles are ignored; only the grammatical features (grammatical case) are used. First of all, all noun groups covered by the particular context-free rule are found. Then compatible groups (compatible in terms of derivation, i.e. groups within the same derivation tree) are processed by a semantic action. Notice that this step suffers from a possibly exponential time complexity, because particular derivation trees are processed, not the packed forest. Only the first derivation (i.e. the root rule of the tree) is used, and our experiments show that on average this is not a problem. If the analyzed verb has a corresponding entry in VerbaLex, we try to match the extracted frame with the frames in the lexicon. When checking the valencies against VerbaLex, the dependence on the surface order is discarded. Before the system confronts the actual verb valencies from the input sentence with the list of valency frames found in the lexicon, all the valency expressions are reordered. By using a standard ordering of participants, the valency frames can be handled as sets independent of the current position of the verb arguments. However, since VerbaLex contains information about the "usual" verb position within the frame, we promote the standard ordering by increasing or decreasing the respective derivation tree probability. The results of experiments with VerbaLex are shown in Table 6.2 in Section 6.3.2.
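The order-independent matching can be sketched as follows (a minimal sketch in C; a valency_t holding only the grammatical case, and the comparison by sorting into a standard ordering, are illustrative simplifications of what synt actually stores):

#include <stdlib.h>
#include <string.h>

typedef struct {
    int gram_case;          /* grammatical case, 1..7 for Czech */
} valency_t;

static int cmp_valency(const void *a, const void *b)
{
    return ((const valency_t *)a)->gram_case
         - ((const valency_t *)b)->gram_case;
}

/* Sort both the frame extracted from the sentence and the lexicon
 * frame into the standard ordering and compare them as sets. */
int frames_match(valency_t *found, int nf, valency_t *lex, int nl)
{
    if (nf != nl)
        return 0;
    qsort(found, nf, sizeof *found, cmp_valency);
    qsort(lex,   nl, sizeof *lex,   cmp_valency);
    return memcmp(found, lex, nf * sizeof *found) == 0;
}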

3.7 Conclusions

In this chapter, a large language-independent parsing system was described. Our presented version of the head-driven context-free parsing algorithm outperforms the best published state-of-the-art context-free parsers; see Chapter 5 for the results of experiments. The evaluation of semantic actions and contextual constraints allows us to work with a relatively small number of grammar rules. Also, some additional information, which is not contained in the context-free part of the grammar, is calculated. Some constraints can be generated automatically with the lexicon of verb valencies VerbaLex. All mentioned system modules are optional, e.g. only a "pure" context-free analysis without semantic actions or robustification can be run.

The implementation of this system is called synt and it is described in Chapter 6. The robust extension of this system is shown in Chapter 4. The results of experiments are presented in Chapter 5. The filtering algorithms working with verb valencies (from Section 3.6) and rule levels (from Section 3.4) are joint work with Aleš Horák; the rest of the algorithms were created by the author of this thesis.

Chapter 4

Robust Stochastic Parsing Using Optimal Maximum Coverage

There are many NLP applications (e.g. speech recognition or dialog systems) where it is difficult to find a context-free (CF) grammar generating a sufficient subset of the processed language (the undergeneration problem). In addition, when the coverage of the grammar is improved, the accuracy usually decreases. Therefore, our goal is to develop a robust syntactic parser that is able to return a "correct" derivation tree even if the grammar cannot generate the input sentence. The definition of correctness is strongly dependent on the target application, and our framework allows us to change the correctness criteria to fit various application needs. The following two-step solution is proposed:

• for the sentence to be analyzed, the finest corresponding most probable optimal maximum coverage (see Sections 4.1 and 4.2) is generated first,

• then the possibly partial trees from this coverage are glued into one resulting tree (see Section 4.3).

Figure 4.1 shows a simple example of a possible result from the robust parsing mechanism. The implementation of the robust parser and results of experiments are discussed in Section 5.3.


Figure 4.1: Glued trees: the partial trees T1, T2 and T3 over the words w1 ... w6, joined under a new root S.

Figure 4.2: Partial trees that cannot be composed into a coverage.

4.1 Coverage

For a given sentence, a coverage with respect to an input grammar G is a sequence of non-overlapping, possibly partial, derivation trees such that the concatenation of the leaves of these trees corresponds to the whole input sentence. Notice that restricting the coverage to derivation trees (i.e. trees verifying the leftmost non-terminal rewriting convention) excludes structures such as those shown in Figure 4.2. For an arbitrary derivation tree T, the foliage f(T) is defined as the sequence of the leaves of T. So for a coverage C = (T1, T2, ..., Tk) of an input sentence consisting of the words w1, w2, ..., wn, the following equation holds:


Figure 4.3: Partial derivation trees, some of which (e.g. T1, T2, T3 and T′1, T4, T′3) can be composed into a coverage.

f(T1), f(T2), ..., f(Tk) = w1, w2, ..., wn.

In other words, if fi(T) is defined as the i-th leaf of T and flast(T) as the last leaf of T, then for a coverage C = (T1, T2, ..., Tk) of the input sentence w1, w2, ..., wn, the following equations hold:

f1(T1) = w1, flast(Tk) = wn, and if flast(Ti) = wj for some 1 ≤ i < k, then f1(Ti+1) = wj+1.

Figure 4.3 shows a coverage C = (T1, T2, T3) consisting of the trees T1, T2 and T3. If there are trees T′1 and T′3 such that T′1 is a sub-tree of T1 and T′3 is a sub-tree of T3, then there is also a coverage C′ = (T′1, T4, T′3). Conversely, (T1, T3) and (T′1, T4, T3) are not coverages. If there are no unknown words in the input sentence, then at least one trivial coverage is always obtained, consisting of trees that all use only lexical rules (i.e. one rule per tree).

4.1.1 Maximum coverage

A maximum coverage (m-coverage) is a coverage that is maximum with respect to the partial order relation ≤, defined as the reflexive and transitive closure of the subsumption relation ≺ defined below. The relation ≺ is a relation over coverages such that, for any coverages C and C′:

C′ ≺ C iff ∃ i, j, k, 1 ≤ i ≤ k, 1 ≤ j, and there exists a rule r in the grammar G such that C = (T1, ..., Ti, ..., Tk), C′ = (T1, ..., Ti−1, T′1, T′2, ..., T′j, Ti+1, ..., Tk) and Ti = r ◦ T′1 ◦ T′2 ◦ ... ◦ T′j,


i.e. if there exists a sub-sequence of trees in C′ that can be connected by the rule r and the resulting tree is an element of C, the other trees in C′ being the same as in C. Notice that the rule r can be a unary rule. The relation ≤ is defined as the reflexive and transitive closure of the relation ≺. The relation ≤ is also antisymmetric: if C ≤ C′ and C′ ≤ C, then:

• If |C| denotes the number of trees in the coverage C, then |C′| ≤ |C| and |C| ≤ |C′|, so |C′| = |C|.

• If C′ ≺ C, then there exist T ∈ C and T′ ∈ C′ such that T = r1 ◦ T′ for some unary rule r1 from the grammar G. If also C ≺ C′, then T′ = r2 ◦ T. But this is not possible, because T = r1 ◦ T′. Notice that all the remaining corresponding trees in C and C′ have to be the same. Thus C ⊀ C′ and C′ ⊀ C, and also C ≡ C′, because the relation ≤ is the reflexive closure of the relation ≺.

As the relation ≤ is reflexive, transitive and antisymmetric, it corresponds to a partial order on the set of all coverages of a given input sentence. A maximum coverage (m-coverage) is a coverage that is maximum with respect to the ≤ relation. The coverage C1 = (T3) in Figure 4.4 is an m-coverage. The coverage C2 = (T1, T2) is not maximum, because C2 ≤ C1. There is also another m-coverage, C3 = (T4). Notice that C1 and C3 are not comparable by the ≤ relation. If there is a successful parse (a single derivation tree that covers the whole input sentence), then there are as many m-coverages as full parse trees, and every m-coverage contains only one tree. We had a long and interesting discussion (the author, Martin Rajman and Jean-Cédric Chappelier) about the present definition of maximum coverage. Alternative definitions are provided in Appendix A. These alternatives lack some features that the definition chosen in this chapter has (e.g. they are not antisymmetric).

4.1.2 Optimal m-coverage

In addition to maximality, we focus on optimal m-coverages, where optimality is defined with respect to different measures. In contrast to maximality, which is defined for coverages in general, the choice of an optimality measure depends on the target application.


Figure 4.4: An example to illustrate a maximum coverage.

The following two measures are proposed:

• The first optimality measure S1 relates to the average width (the number of leaves) of the derivation trees in the coverage. For an m-coverage C = (T1, T2, ..., Tk) of an input sentence w1, w2, ..., wn, where n > 1, the measure is defined as follows:

S1(C) = 1/(n−1) · (n/k − 1).

Notice that 0 ≤ S1(C) ≤ 1 and that n/k is the average width of the derivation trees in the coverage. With this measure, the value of a trivial coverage (i.e. one exclusively composed of lexical rules) is 0 and the value of a successful full parse is 1.

• The second measure favours coverages with the widest trees (trees with the largest number of leaves). Let

lmax(C) = max_{T ∈ C} |f(T)|

and

S2(C) = 1/(n−1) · (lmax(C) − 1)

for the number of input words n > 1. Similarly to S1, 0 ≤ S2(C) ≤ 1, the value obtained for a trivial coverage is 0 and the value of a successful full parse is 1.
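As a worked illustration (the numbers are chosen for this example only): for a sentence of n = 6 words, a coverage C consisting of k = 2 trees whose wider tree has lmax(C) = 4 leaves yields

S1(C) = 1/(6−1) · (6/2 − 1) = 0.4,    S2(C) = 1/(6−1) · (4 − 1) = 0.6,

while the trivial coverage (k = n, lmax(C) = 1) yields 0 for both measures and a successful full parse (k = 1, lmax(C) = n) yields 1.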


Figure 4.5: An example to illustrate the notion of optimal m-coverage.

Several other optimality measures could be defined. For instance, an optimality measure might be sensitive to the internal structure of the trees in a coverage, e.g. counting the number of nodes in the trees. These additional criteria can be used in combination with the measures S1 and S2. Figure 4.5 illustrates the optimal m-coverages C1 = (T1, T2, T3) and C2 = (T4, T5). The coverage C′1 = (T′1, T2, T3) is not an optimal m-coverage. The coverage C2 is better with respect to the measure S1 (S1(C1) < S1(C2)).

4.1.3 Probability of a coverage

The probability of a coverage is defined as the product of the probabilities of the trees it contains, i.e. for a coverage C, let

p(C) = ∏_{T ∈ C} p(T).

Notice that, by construction, the probability of any coverage is always less than or equal to the probability of the corresponding trivial coverage. The probability of a coverage can be viewed as another optimality measure, so the most probable coverages can be found in the same way as the optimal m-coverages. But usually all the optimal m-coverages (OMC) are found first (optimal with respect to some measure other than probability) and then the most probable one is chosen. Neither an OMC nor the most probable OMC is necessarily unique.


4.2 Finding optimal m-coverage

For robust parsing, a CF parsing algorithm that produces all possible incomplete parses has to be chosen (i.e. whenever there is a derivation tree that covers a part of the given input sentence, the algorithm produces that tree). This condition is usually satisfied by bottom-up parsers. Then the incomplete parses can be combined to find the maximum coverage(s). The algorithm described here finds an OMC with respect to the measure S1 (the average width of the derivation trees in the coverage), but it can easily be adapted to different optimality measures. All operations are applied to a set of Earley's items [Earley, 1970]. In particular, no changes are made during the parsing phase (except some initialization of internal structures for better efficiency of the algorithm). Dijkstra's algorithm [Dijkstra, 1959] for the shortest path problem in graphs is used to find OMCs with respect to the measure S1. The input graph for Dijkstra's algorithm consists of weighted edges and vertices: the edges are Earley's items and the weight of each edge is 1; the vertices are word positions, thus for n input words there are n + 1 vertices. Whenever Dijkstra's algorithm finds paths of equal length (i.e. with an identical number of items), the probability of an item is used to select the most probable ones. Notice that if there are no unknown words in the input, then there exists at least one path from position 0 to n, corresponding to the trivial coverage. Figure 4.6 illustrates an example of the input graph for Dijkstra's algorithm for the Earley items [A, 0, 2], [B, 2, 3], [C, 3, 4], [D, 0, 1], [E, 0, 3], [F, 1, 4] and [G, 1, 2]. The shortest paths are [E, 0, 3], [C, 3, 4] and [D, 0, 1], [F, 1, 4]. The paths correspond to two optimal m-coverages with two trees in each coverage. The graph is very similar to the chart structure (see Section 2.1.2) resulting from the CYK [Kasami, 1965, Younger, 1967, Aho and Ullman, 1972, Graham et al., 1980] parsing algorithm, if the CYK table is represented as Earley's items. The output of the algorithm is a list of Earley's items. An Earley item can represent several derivation trees and, to get an OMC, the most probable tree from each item is selected. The resulting OMC is not unique, because there can be several trees with the same probability.
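The search can be sketched as follows (a minimal sketch in C under simplifying assumptions: items are given as plain (from, to) edges, and the simple O(V²) form of Dijkstra's algorithm is used, which for unit weights behaves like breadth-first search; the item_t type and the tie-breaking by probability are illustrative):

#include <limits.h>

typedef struct { int from, to; double prob; } item_t;

/* dist[v]: minimal number of items needed to cover words 0..v;
 * back[v]: index of the last item on such a path (-1 at position 0).
 * Both arrays are caller-allocated with n_words + 1 entries; following
 * back[] from position n_words yields an optimal m-coverage w.r.t. S1
 * (the fewest, hence on average widest, trees). */
void find_omc(const item_t *it, int n_items, int n_words,
              int *dist, int *back)
{
    int done[n_words + 1];
    for (int v = 0; v <= n_words; v++) {
        dist[v] = INT_MAX; back[v] = -1; done[v] = 0;
    }
    dist[0] = 0;
    for (;;) {
        int u = -1;                          /* closest open vertex */
        for (int v = 0; v <= n_words; v++)
            if (!done[v] && dist[v] != INT_MAX
                         && (u < 0 || dist[v] < dist[u]))
                u = v;
        if (u < 0)
            break;
        done[u] = 1;
        for (int i = 0; i < n_items; i++)    /* relax all items */
            if (it[i].from == u && dist[u] + 1 < dist[it[i].to]) {
                dist[it[i].to] = dist[u] + 1;
                back[it[i].to] = i;
            }
        /* paths of equal length would be compared by probability */
    }
}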


Figure 4.6: The input graph for Dijkstra's algorithm (the items A–G as edges between the word positions 0–4) and the corresponding derivation trees.


Figure 4.7: Gluing with new rules added to the grammar. The bottom bold trees are in the OMC.

4.3 Gluing

The intended result for our robust parser is a derivation tree covering the whole input sentence. For this reason our goal is to connect (glue) the trees present in the OMC to construct a single one.

4.3.1 Gluing with new rules

The gluing can be realized by adding new rule(s) to the grammar. The new rules use new "fresh" non-terminals that are not in the original grammar; they just connect the roots of the trees together. The probability of such rules is set to 1. Notice that there might be several other ways of constructing a unique tree, and therefore our choice relies mainly on technical reasons. Figure 4.7 shows an example with the new rules S → X_L, X_L → X_L X, X_L → X and X → A_i, where S is the root of the grammar, X and X_L are new non-terminals and A_i is the root of the i-th tree in the coverage (there are three trees in this example). The dotted lines represent the newly added rules.
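With these rules (as reconstructed above), the three coverage roots A1, A2 and A3 would be connected by the derivation

S ⇒ X_L ⇒ X_L X ⇒ X_L X X ⇒ X X X ⇒ A1 A2 A3,

where each application of X_L → X_L X makes room for one more partial tree and X → A_i attaches the root of the i-th tree.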


Figure 4.8: Gluing by means of mapping non-terminals. The bottom hatched trees are in the OMC.

4.3.2 Gluing by means of mapping non-terminals

Another possibility is to create the top nodes of the resulting tree by a top-down parsing algorithm and then to glue these top nodes to the selected coverage. Notice that for reasonable grammars a tree with the following properties can be generated:

• the root is equal to the root of the grammar

• the number of leaves is equal to the number of trees in the glued coverage.

So, in this case, the gluing would be only a formula for how to connect two non-terminals. This approach is illustrated in Figure 4.8; the dotted lines represent the mapping function. The method has not been implemented, because many unsolved problems remain. The main challenge is to find out how to generate the top nodes with respect to the input sentence. A possible track to explore is to consider approaches derived from the head-corner parsing algorithm.


4.4 Conclusions

In this chapter, our approaches to robust CF stochastic parsing were presented. The optimal maximum coverage framework was introduced, as well as several measures for the optimality of the parser. Our definition of maximality is independent of the target application; on the other hand, the choice of an optimality measure is strongly application dependent. An algorithm that efficiently finds an OMC (with respect to the average width of the derivation trees) was proposed. The described ideas were successfully presented in [Kadlec et al., 2005]. The implementation of this algorithm in the SLP toolkit [Chappelier and Rajman, 1998a] was used by Marita Ailomaa [Ailomaa, 2004, Ailomaa et al., 2005b, Ailomaa et al., 2005a]; the results of her experiments are shown in Section 5.3. The presented algorithm was also integrated into the synt system (see Chapter 6). Experiments with Czech speech recognition data are provided in Section 5.3.3.

Chapter 5

Context-free Only Experiments

In this chapter, experiments with the context-free part of our system are presented, in two main sections. In Section 5.2.1, several results of our HDddm parser (see Section 3.2) are given. The second part of the chapter, Section 5.3, describes the robust parsing experiments.

5.1 Configuration of Experiments

Most of our experiments are reported on the standard data sets for parser comparison, which are available at http://www.cogs.susx.ac.uk/lab/nlp/carroll/cfg-resources/. These web pages resulted from discussions at the Efficiency in Large Scale Parsing Systems Workshop at COLING 2000, where one of the main conclusions was the need for a bank of data for the standardization of parser benchmarking. Three grammars, ATIS, CT and PT, are used in our experiments. The ATIS grammar consists of 4,592 rules, 192 non-terminals and 357 pre-terminals. The data set includes 98 sentences, of which 71 are grammatical and 27 do not belong to the language generated by the grammar. The CT grammar consists of 24,456 rules, 3,946 non-terminals and 1,032 pre-terminals. The data set includes 162 sentences, of which 150 are grammatical and 12 are not. The PT grammar consists of 15,039 rules, 38 non-terminals and 47 pre-terminals. The data set includes 30 grammatical sentences.


All non-grammatical sentences from these three data sets were excluded from our experiments. The fourth grammar used in the tests is the Czech grammar developed within the synt project, see Chapter 6. The grammar consists of 2,915 rules, 135 non-terminals and 40 pre-terminals. The data set usually includes 200 sentences from the corpora; exceptions are mentioned in the descriptions of the experiments. All sentences are grammatically correct.

5.2 Context-free experiments

The experiments shown in this section are related only to the context-free parsing algorithms that are implemented in our system. The experiments for the non-context-free part of our system (e.g. contextual constraints) are described in Chapter 6. Our work is restricted to lexicalised grammars, where terminals can appear only in lexical rules of the form A → wi. Such a non-terminal A is called a pre-terminal in the following text. This restriction allows us to simplify the implementation and it also enables us to separate the lexicon from the grammar. In fact, we do not use a Czech lexicon in our system, because the lexical rules are created by the morphological analyser ajka [Sedláček, 2005].

5.2.1 Comparison of Implemented CF Parsing Algorithms

Four different parsing algorithms are implemented in our system: Earley's top-down and bottom-up chart parsers [Earley, 1970], the HDddm parser (see Section 3.2) and Tomita's GLR parser [Tomita, 1986]. The chart parsers and HDddm were implemented by the author of this thesis; the GLR parser was implemented by the FI MUNI student Ondřej Macek [Macek, 2003]. All these implementations produce the same structures, thus applying contextual constraints or selecting the n best trees can be shared among them. First of all, our implementations of the most widely used parsing algorithms are compared. There is no "best" CF parsing algorithm: the evaluation is always grammar and input dependent, as demonstrated in the next sections. Here we used our grammar for Czech (see Chapter 6). Table 5.1 presents the average running times of the different parsing algorithms analysing 200 random sentences from the Czech corpus DESAM [Pala et al., 1997].


Algorithm                                      Time in seconds
Head-driven chart parser, proper heads                    8.24
Head-driven chart parser, improper heads                251.62
Earley's top-down chart parser                           32.79
Earley's bottom-up chart parser                          39.22
GLR                                                      15.24

Table 5.1: Comparison of different parsing algorithms

The results for the head-driven chart parser (the HDddm technique, see Section 3.2) depend strongly on the positions of the heads in the grammar rules. We developed a tool that generates optimal head positions for a given grammar and a set of input sentences; Section 5.2.3 contains more details about optimizing the positions of heads. Notice that only the time of CF parsing is measured. Because all our parsers produce the same chart structure, the time for evaluating semantic actions is the same for all of them. Clearly, the HDddm algorithm outperforms the others when the head positions are set correctly. However, these results have to be interpreted with respect to our grammar; for some other grammar, the order of the algorithms could be different.

5.2.2 Head-Driven dependent dot move variants

This section shows a comparison of the various HDddm parsers described in Section 3.2. Notice that actual implementations of iHD and sHD do not exist: the results are computed by the HD algorithm and post-processed by an external program, in which the edges generated by the HD algorithm are modified according to the iHD or sHD algorithms and repeated edges are omitted. Thus our results only estimate the effectiveness of the algorithms. The ATIS, PT and CT grammars (see Section 5.1) in the following results refer to the standard grammars from the benchmarking site. As the original data do not provide information about the heads of the rules, a simple heuristic has been employed for setting the heads in these cases. If a rule contains terminals, the leftmost one is chosen as the head. Otherwise, a distance measure from terminals is computed for all the right-hand side non-terminals (how many derivations are needed to get a rule with a terminal) and the leftmost one with the smallest "terminal distance" is taken as the head (a sketch of this heuristic is given at the end of this subsection). Notice that the heuristic does not work too well for ATIS, as the grammar contains many rules with the same non-terminal on the left-hand side and the right-hand side starting with the same terminal.


Grammar     HD          iHD        % of HD    sHD        % of HD
ATIS        882,673     793,370    89.8%      390,860    44.2%
ATIS (H)    401,782     362,568    90.2%      139,935    34.8%
PT          1,227,500   510,175    41.5%      456,736    37.2%
CT          638,276     606,591    95.0%      381,115    59.7%
Czech       994,402     915,004    92.0%      496,129    49.8%

Table 5.2: A comparison of the three discussed variants of the HD algorithm on the number of edges in the resulting chart

ATIS (H) is a variant of the ATIS grammar where the positions of the heads have been set to the best positions according to the chart size (for the HDddm algorithm); the optimization of the head positions can be found in Section 3.3. The benefits of the discussed HD refinements are demonstrated by the reduction of edges in the resulting charts. Table 5.2 summarizes the results. It is obvious that the improved HD parsing methods significantly reduce the size of the resulting chart; the average decrease in the number of chart edges is about 50%. Special attention should also be paid to the two variants of the ATIS grammar: the optimal positions of heads bring a slight downgrade for the iHD algorithm but a considerable improvement for sHD. Also note that the PT grammar is the only one in our experiments where the iHD improvement has a substantial effect.
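The head-selection heuristic described at the beginning of this subsection can be sketched as follows (a minimal sketch in C; the helpers are hypothetical, and terminal_distance(X) is assumed to be pre-computed as the minimal number of derivation steps from the non-terminal X to a rule containing a terminal):

#include <limits.h>

int rhs_length(int rule);
int rhs_symbol(int rule, int i);
int is_terminal(int sym);
int terminal_distance(int sym);   /* >= 1 for non-terminals */

int choose_head(int rule)
{
    int best = 0, best_dist = INT_MAX;
    for (int i = 0; i < rhs_length(rule); i++) {
        int sym = rhs_symbol(rule, i);
        /* terminals have distance 0, so a leftmost terminal wins */
        int d = is_terminal(sym) ? 0 : terminal_distance(sym);
        if (d < best_dist) {      /* strict '<' keeps the leftmost */
            best_dist = d;
            best = i;
        }
    }
    return best;                  /* head position on the RHS */
}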

5.2.3 Optimizing Heads for Parsing

Section 3.3 describes the optimization procedure used for the experiments in this section. The running times of the head-driven parser shown in the previous sections depend strongly on the positions of the heads in the grammar rules. In this section, more detailed results are reported on two natural language grammars and relevant sets of inputs. The first is the ATIS grammar and the corresponding data set described in Section 5.1. The experiment consists of 1,149,703 analyses (71 sentences × 4,592 grammar rules × the number of right-hand side symbols — 3.52 on average).


Grammar                      ATIS        ATIS-H     ATIS-HS
Total time (sec)             10.45       3.16       1.30
# edges                      1 345 544   398 261    148 866
# edges (optimal parser)     15 252      15 252     15 252
# edges per sentence         13 730.04   4 063.89   1 519.04
Parser optimality            1.13%       3.83%      10.25%
# edges / # edges ATIS       1.00        0.29       0.11

Table 5.3: HDddm parsing on ATIS grammars with different head positions

The input sentences were divided into two sets and the optimization process was run in parallel on two Pentium 2.4 GHz workstations. The analyses took 9 hours 23 minutes (summed total time from both machines, without initialization, grammar reading, etc.). The second series of tests was carried out on the Czech grammar; the technical details about this grammar are given in Section 5.1 and Chapter 6. This series consists of 2,469,600 analyses (100 sentences × 2,915 grammar rules × the number of right-hand side symbols — 8.47 on average). The analyses took 8 hours 32 minutes (total time without initialization, grammar reading, etc.) on a Pentium 2.4 GHz workstation. Table 5.3 presents the results of the optimization procedure for the ATIS grammar. The column captioned ATIS gives the parsing time for the baseline — the grammar with heads on the leftmost pre-terminal (if a rule does not contain any pre-terminal, the head is the leftmost symbol on the right-hand side). ATIS-H is the grammar resulting from the optimization described above. ATIS-HS gives the characteristics of parsing with the optimal grammars for individual sentences, so it is the best possible result of our optimization method. The number of edges for an optimal parser is the value which would be produced by an ideal parsing method that could determine in advance which edges will form the resulting chart. The chart itself (with some additional structures) is employed to represent the resulting derivation trees, thus only the edges used in this structure are considered. "Real world" parsers produce additional edges that are not used in any derivation tree; the optimal parser creates only the edges that are necessary to build up some derivation tree (with respect to the input sentence). The "Parser optimality" row shows the ratio between the number of edges produced by the tested parser and those of the hypothetical optimal parser.


Grammar                      Czech      Czech-H    Czech-HS
Total time (sec)             4.57       2.20       1.51
# edges                      822 989    306 937    191 666
# edges (optimal parser)     76 038     76 038     76 038
# edges per sentence         8 229.89   3 069.37   1 916.66
Parser optimality            9.2%       24.8%      39.7%
# edges / # edges Czech      1.00       0.37       0.23

Table 5.4: HDddm parsing on Czech grammars with different head positions

The results show that the parsing process using the generated grammar (ATIS-H) is more than three times faster than the original method with "naive" head positions. The ATIS-HS setting is not practical, as it uses different grammars for different input sentences; however, the comparison of ATIS-H and ATIS-HS clearly demonstrates that it could be worth looking for smart heuristics predicting the optimal head position of a grammar rule based on the particular input. Table 5.4 presents the results of the optimization procedure for the Czech grammar. Here the heuristic (heads on the leftmost pre-terminal) does not need to be used for the baseline: the column captioned Czech refers to the results of parsing with the grammar of Czech where the positions of the rule heads are given by linguistic intuition — the governing nodes are set as heads. The improvement of the parser performance is not so impressive in the case of the Czech grammar; however, the application of the head optimization algorithm still brings a considerable reduction of the time required for parsing. As the baseline grammar employs linguistic heads, the linguistically motivated setting can be compared with the experimentally found positions of heads. Table 5.5 lists the most frequent language phenomena covered by the grammar rules where the position of the linguistic head and the optimized position differ. For example, linguistic intuition suggests instantiating the reflexive verb clause (e.g. Karel se myl / Karel washed himself) from the given verb first; however, the empirical evidence indicates that, from the parser's optimality point of view, the best starting point in such a case is the reflexive particle (se or si in Czech). Similarly, for the other language phenomena listed in the table, the heads determined in our experiments differ from the traditionally defined ones.


Description              Rule example              Linguistic head   Empirical head
reflex. verb constr.     clause → subj R VR        VR                R
proper noun phrase       np → N np_prop_names      np                np_prop_names
conditional clause       clause → condc V_intr     V_intr            condc
genitive construction    npnl → np np_gen          np                np_gen

Table 5.5: Comparison of linguistically motivated and empirically determined heads

This experiment was presented in [Kadlec and Smrž, 2004]. The optimization procedure and the CF part were developed by the author of this thesis; the linguistic interpretation in Table 5.5 belongs to Pavel Smrž.

Optimizing heads for large grammars

The presented optimization procedure from Section 3.3 cannot be applied to huge grammars such as Suzanne108d3, because the method would take too much time. A comparison between head positions on the leftmost symbol of the RHS of each rule and randomly chosen heads demonstrates that in these cases randomly chosen heads can produce a much faster grammar. So machine learning algorithms, such as genetic algorithms, have a great potential here. Table 5.6 shows that our system is able to handle extremely large grammars effectively. A sub-part of the Susanne #3 corpus [Sampson, 1994] was used; only empty productions (traces) and a few obvious mistakes in the annotations (e.g. cycles) have been removed. Notice that we did not restrict ourselves to parsing part-of-speech tag sequences only, but worked on real word strings instead. For more information about these grammars, please see [Sampson, 1994, Ballim et al., 2000]. The data for this experiment were provided by Jean-Cédric Chappelier.


Grammar                  Suzanne108d3   SuzanneFullCF   Isis
# grammar rules          186,915        13,438          68,366
# non-terminals          2,371          760             6,594
# sentences              399            11              513
HDddm1 (time in sec.)    514.63         49.66           57.31
HDddm2 (time in sec.)    342.74         73.76           74.50

Table 5.6: Comparison of different head positions in the input grammars, HDddm1 – heads are on the leftmost symbol from the RHS of the rule, HDddm2 – randomly chosen heads.

5.2.4 Comparison with Different Parsing Systems

In this section, our CF parser using the HDddm technique is compared with a couple of parsers by different authors. Unfortunately, there is no general evaluation procedure acceptable to all researchers and developers. For example, the number of edges in the chart/items, or events [Roark and Charniak, 2000], is usually given as a measure for comparing chart parsing algorithms. However, as shown in [Moore, 2000b], there can be algorithms that do not differ in this respect but whose processing times on a given grammar and input differ considerably. The primary method of assessing the efficiency of a parsing algorithm is therefore only empirical: one has to compare the time taken to parse a set of test sentences by each particular parser based on a shared grammar.

Moore’s Parser The best results reported on standard data sets (ATIS, CT and PT gram- mars) are the comparison data by Moore [Moore, 2004]. The results of the parser comparison appear in Table 5.7. The values in the table give the total CPU times in seconds required by the parser to completely process the test set associated with the grammar. The longer running times of our system for the CT grammar are caused by low ambiguity of the grammar. Our parsing technique is suitable for highly ambiguous grammars such as PT grammar. Notice, that Moore’s parser is implemented in different (Perl) than ours (C language). But only running times of the CF algorithms are compared


Grammar, algorithm                      Time in seconds
ATIS grammar, Moore's LC3 + UTF         11.60
ATIS grammar, our system                 4.19
CT grammar, Moore's LC3 + UTF            2.70
CT grammar, our system                   4.29
PT grammar, Moore's LC3 + UTF           41.80
PT grammar, our system                  17.75

Table 5.7: Running times comparison (in seconds)

These results were presented in [Horák et al., 2002].

SLP toolkit

The second parser compared here with ours is the SLP toolkit. The SLP toolkit [Chappelier and Rajman, 1998a] provides a fast and robust bottom-up chart parsing algorithm derived from Earley's chart parsing [Earley, 1970] and CYK [Kasami, 1965, Younger, 1967, Aho and Ullman, 1972, Graham et al., 1980]. For the current version, please see http://slptk.sourceforge.net. The comparison with the SLP toolkit is provided in this text for the following reasons:

• The SLP toolkit provides similar functionality to the context-free part of our system (e.g. generation of best trees — see below, detailed time statistics, etc.).

• It uses a completely different parsing algorithm (CYK).

• The algorithms from Chapter 4 were implemented in the SLP toolkit as well as in our system (both implementations were created by the author of this thesis).

As in the previous sections, the experiments shown in this section are related only to the context-free part of the synt system (see Chapter 6). Three very different grammars were chosen for the comparison: the "S-grammar", the ATIS grammar and the Isis grammar (see below).


The "S-grammar" consists of two grammar rules:

S -> S S
S -> "1"

The lexicon contains only one entry, the number "1". The input is one sentence containing the number "1" 220 times. The grammar is extremely ambiguous: the number of derivation trees for our input is 2,144,211,376. The ATIS grammar and its input are described in Section 5.1. The Isis grammar [Ballim et al., 2000] consists of 17,141 rules; the large Isis lexicon contains 554,038 words and compounds. Ten highly ambiguous sentences from the Isis corpus were selected as input. Because synt uses a head-driven algorithm, the heads of the grammar rules were selected as follows: the head position for the "S-grammar" is the left "S" in the production rule; for ATIS and Isis, the heads were chosen by the already mentioned heuristic (if a rule contains terminals, the leftmost terminal is chosen as the head, otherwise the leftmost non-terminal is chosen). Table 5.8 shows the results of the experiment. They represent summed times over all sentences of the respective data set. Times of parser initialization are not measured in synt. For a complete overview including initialization and closing, the Linux command time was used (see http://www.gnu.org/software/time/time.html, or type 'man 1 time' on most Unix/Linux systems). The row "real" represents the total time used by the process, the row "user" the CPU time used directly by the process, and the row "sys" the CPU time used by the operating system (i.e. operating system calls such as reading files from the disk or allocating memory). The rows labelled "Best tree" give the time for computing the most probable derivation tree with respect to the probabilities of the rules in the input grammar. In the SLP toolkit, the computation of the best tree is an unavoidable part of the parsing process, thus the time of context-free parsing includes the time of this computation; in synt these steps are separated, so the time for the computation of the best tree alone is given. A further difference between the compared systems is that synt uses a more general algorithm for computing the n best trees, so even in this case, where only one tree is generated, the more general (and thus slower) algorithm is applied. The SLP toolkit operates with a grammar and a lexicon pre-compiled into a binary form. synt does not have such a pre-compilation; it reads the grammar directly from a human-readable text format.


S-grammar                      SLP Toolkit    synt
Initialization (ms)            11160          -
CF parsing (ms)                5120           -
Analysis (ms)                  16280          2940
Best tree (ms)                 -              7820
Overall time measured by the time command:
real                           17.425s        12.374s
user                           16.690s        12.080s
sys                            0.210s         0.290s

ATIS grammar                   SLP Toolkit    synt
Initialization (ms)            930            -
CF parsing (ms)                2730           -
Analysis (ms)                  3660           5320
Best tree (ms)                 -              50
Overall time measured by the time command:
real                           4.276s         7.068s
user                           4.210s         4.100s
sys                            0.040s         2.980s

Isis grammar                   SLP Toolkit    synt
Initialization (ms)            2820           -
CF parsing (ms)                2320           -
Analysis (ms)                  5140           510
Best tree (ms)                 -              0
Compilation of the grammar     27.123s        -
Overall time measured by the time command:
real                           6.725s         7.851s
user                           6.380s         7.100s
sys                            0.330s         0.730s

Table 5.8: Comparison between synt and SLP toolkit.

However, the compilation of the grammar does not affect the results, except in the experiment with the Isis grammar. During the calculation of the best tree in the SLP toolkit, all CYK items are traversed by the algorithm; in synt, only the items (edges) reachable from the "successful" root edge are used. This difference is clearly visible in the results for the "S-grammar". The number of "words" in the input — 220 — also influences the results here, because of the CYK algorithm used in the SLP toolkit: the average time complexity of the CYK algorithm has an n³ component (where n is the number of words in the input sentence). For typical natural language grammars and inputs, this n³ factor is much lower than the factor representing the size of the grammar, but for this artificial "S-grammar" it dominates. The experiment with the ATIS grammar shows an interesting amount of time consumed by the synt parser in operating system calls (the "sys" row). The experiment was repeated several times on different computers with different processors and configurations, but for the ATIS grammar this system time always had an impact on the overall time. The reason for this was not completely discovered; one possible cause is a larger amount of memory allocation and de-allocation during the computation of the best trees, and an improper hashing function for the names of non-terminals could also be the problem. Notice that the time for computing the best tree was less than one millisecond for many sentences of the ATIS data set, which is why the summary is only 50 milliseconds. In the experiment with the Isis grammar, synt spends most of the time reading the input grammar. As said above, the SLP toolkit works with a grammar pre-compiled into a binary form, so it can be read faster; thus the time of the compilation of the grammar is also provided.

5.2.5 Conclusion

The results of the experiments with Moore's parser and with the SLP toolkit show that a comparison of parser effectiveness has to be done with respect to the tested grammar. Our parser is fully comparable with those mentioned. For highly ambiguous grammars, such as the PT grammar or the Isis grammar, our head-driven algorithm proves its strength.


5.3 Robust Parsing

The following section describes the experiments with our robust parsing technique presented in Chapter 4.

5.3.1 Implementation

The SLP toolkit (see Section 5.2.4) and synt (see Chapter 6) were used to implement the ideas from Chapter 4. The experiments for English grammars were performed with the SLP toolkit; the experiments with speech recognition of Czech were computed by synt.

5.3.2 Experiments with English

ATIS and Susanne

The experiment described here was carried out by Marita Ailomaa [Ailomaa, 2004, Ailomaa et al., 2005b, Ailomaa et al., 2005a]. The robust parsing technique was tested on subsets of two tree-banks, ATIS [Hemphill et al., 1990] and Susanne [Sampson, 1994]. From these tree-banks, two separate grammars with different characteristics were extracted. Concretely, each tree-bank was divided into a learning set that was used for producing the probabilistic grammar and a test set that was then parsed with the extracted grammar. About 10% of the sentences in the test set were not covered by the grammar. They represented the real focus of our experiments, as the goal of a robust parser is to process the sentences that the initial grammar fails to describe. For each sentence, the 1-best derivation tree was categorized as good, acceptable or bad, depending on how closely it corresponded to the reference tree in the corpus and how useful the syntactic analysis was for extracting a correct semantic interpretation. The results are presented in Table 5.9. It may be argued that the definition of a "useful" analysis might not be decidable only by observing the syntactic tree. Although we found this to be a quite reasonable hypothesis during our experiments, a more objective procedure should be defined. In a concrete application, usefulness might, for example, be determined by the methods that the system should perform based on the produced syntactic analysis.


                  Good (%)   Acceptable (%)   Bad (%)
ATIS corpus       10         60               30
Susanne corpus    16         29               55

Table 5.9: Experimental results. Percentage of good, acceptable and bad analyses.

From the experimental results, one can see that our technique behaves better with the ATIS grammar, which contains relatively few rules, than with Susanne, which is a considerably larger grammar describing a rich variety of syntactic structures. The number of bad 1-best analyses can be explained by the fact that the probabilistically best analysis is not always the linguistically best one. This is a non-trivial problem related to all types of natural language parsing, not only to robust parsers.

5.3.3 Speech recognition of Czech

The parsing system synt (described later in Chapter 6) with the robust algorithm from Chapter 4 was used to run this experiment. The only modification of the standard processing is the elimination of the simplest form of repetitions of single words in the pre-processing phase. The morphological analysis is performed by the Czech morphological analyser Ajka [Sedláček, 2005]. If a word (form) is not recognized by the morphological analyzer (often due to the colloquial language used), an UNKNOWN pre-terminal is returned to the parser. This pre-terminal is present in special rules of the grammar (e.g. it can form a proper noun if the unknown word starts with an uppercase character). The language model based on our parser has been tested on a very hard task — speech recognition of a recorded lecture. This explains the overall high word error rate (when compared with the recognition accuracy of the best speaker-adapted dictation systems in good acoustic conditions — close microphones, no background noise). The lecture was captured in a standard lecture room at our faculty; a poor camcorder microphone produced the audio stream. The content of the record to be recognized is rather specific: the vocabulary contains a lot of terminology concerning the topic of the course — digital signal processing. The recognition algorithm combines 8-Gaussian HMMs (Hidden Markov Models) with the MLLR (Maximum Likelihood Linear Regression) and MAP (Maximum A Posteriori probability) speaker adaptation techniques (see [Glembek, 2005] for details).


                        WER       SAC
Word 3-gram model       36.99 %   22.37 %
synt robust parsing     33.72 %   23.17 %
3-gram + parsing        32.47 %   24.47 %

Table 5.10: Recognition results of the different language models

The system was trained on the Czech part of SpeechDat-E, which comprises 1,052 speakers recorded over the land-line telephone network. The adaptation was performed on 9 lectures; a text transcription was provided for one of them — about 150 minutes. For testing purposes, we integrated the parser with the speech recognizer in a simple way. The best 200 transcriptions were generated from the word lattices provided by the speech recognizer. These hypotheses were then re-scored by our parsing technique, and the reported results are computed on the output of this process. The tested word N-gram model combines a very large general language model based on a 536-million-word corpus with a specialized model created from 9,308 words of the lecture transcription. Another 6,213 words of the transcription (873 utterances) were used for testing. Table 5.10 presents the results of the speech recognition with the particular language models and their combination. The second column summarizes the word error rate (WER) — the minimum edit distance on the words. The third column contains the sentence accuracy (SAC) on the 873 testing utterances. The language model based on our robust parsing method outperforms the word-based model which combines the general and focused N-gram models; the relative increase of the recognition accuracy is 6.33%. The results are very interesting, because the syntactic model outperforms the 3-gram model, which is unusual. A possible reason could be that the 3-gram model was trained on a different domain. The data for this experiment, the best 200 transcriptions for each of the tested utterances, were provided by Pavel Smrž; the experiment itself was conducted by the author of this thesis.


5.4 Conclusions

The results of the experiments presented in this chapter show that our HDddm algorithm works efficiently on a wide range of context-free grammars. Especially for grammars that are very ambiguous, the algorithm is faster than the compared parsers. The HDddm parsing proceeds bottom-up, which allows us to experiment with the robust extension of the algorithm. The evaluation of the robust parsing technique from Chapter 4 for English grammars was based on manually checking the derivation trees. An important issue is to integrate the technique into a target application, so that we have more realistic ways of measuring the usefulness of the produced robust analyses (like Parseval [Black, 1992] labelled precision and recall). The implementation of our robust algorithm in the synt system also allows us to run robust experiments for Czech. The results of our system are better than those of the N-gram model, which is unexpected, as the N-gram model is a specialization of a very large general language model. Our hypothesis for why our system outperforms the N-gram model is that the N-grams are not properly trained for the specific domain of the experiment. This result is also interesting from another point of view, because our robust method has not been targeted at free word order languages such as Czech.

Chapter 6

Implementation – synt Project

In this chapter, the implementation of the parsing system from Chapter 3 — the synt project — is described. The history of synt started in 1999 and the first results were published in [Kadlec, 2000, Horák et al., 2002]. The project consists of several parts. One of the main parts is the so-called meta-grammar for the Czech language, described in Section 6.1. It is created by hand, and several other grammar formats are generated from it to allow processing. The implementations of semantic actions and contextual constraints for every grammar rule (see Section 3.5) are also essential parts of the grammar. The second part of the synt project is the parsing system itself. The detailed description of this system is given in Chapter 3; the overview of all parsing modules in synt and the data flow through the system are demonstrated in Figure 3.1. Several researchers cooperate on the whole project. The meta-grammar and the content of the semantic actions (the linguistic part of the system) are maintained by Aleš Horák at the moment. The parsing system has been created and developed by the author of this thesis; on some parts, e.g. the evaluation of valency actions, we work together. Selected sub-parts, such as the implementation of the GLR context-free parsing algorithm (see Section 5.2.1) or the front-ends to synt (see Section 6.5), have been developed by our students. In the following sections, the meta-grammar, the parser implementation and the results of experiments are presented. Some experiments with the context-free part of synt are described in Chapter 5; only experiments using the parser's ability to evaluate contextual constraints and semantic actions are presented in this chapter.


6.1 Grammar

The meta-grammar concept [Horák and Kadlec, 2005] in synt consists of three grammar forms denoted as G1, G2 and G3. Human experts work with the meta-grammar form, which encompasses high-level generative constructs reflecting meta-level natural language phenomena like word order constraints, and enables the language to be described with a maintainable number of rules. The meta-grammar serves as a base for the second grammar form, which comes into existence by expanding the constructs. This grammar consists of context-free rules equipped with feature agreement tests and other contextual actions. The last phase of the grammar induction lies in the transformation of the tests into standard rules of the expanded grammar, with the actions remaining to guarantee the contextual requirements.

Meta-grammar (G1) The meta-grammar consists of global order constraints that safeguard the succession of given terminals, of special flags that impose particular restrictions on given non-terminals and terminals on the right hand side (RHS), and of constructs used to generate combinations of rule elements.

The arrow in a rule specifies the rule type (->, -->, ==> or ===>). A hint to the meaning of the arrow form can be expressed as 'the thicker and longer the arrow, the more (complex) actions are to be done in the rule translation'. The smallest arrow (->) denotes an ordinary CFG transcription, whereas the thick extra-long arrow (===>) inserts possible inter-segments between the RHS constituents, checks the correct order of enclitics and supplies several forms of the rule to make the verb phrase into a full sentence.

The global constructs (%enclitic, %order and %merge_actions) represent universal simple regulators, which are used to inhibit some combinations of terminals in rules, or which specify the actions that need some special treatment in the meta-grammar form translation.

The main combining constructs of the meta-grammar are order(), rhs() and first(), which are used for generating variants of assortments of given terminals and non-terminals.

    /* budu se ptat - I will ask */
    clause ===> order(VBU,R,VRI)

    /* ktery ... - which ... */
    relclause ===> first(relprongr) rhs(clause)


The order() construct generates all possible permutations of its components (see the sketch after the following example). The first() and rhs() constructs are employed to implant the content of all the right hand sides of a specified non-terminal into the rule. The rhs(N) construct generates the possible rewritings of the non-terminal N. The resulting terms are then subject to standard constraints, enclitic checking and inter-segment insertion. In some cases, one needs to force a certain constituent to be the first non-terminal on the RHS. The construct first(N) ensures that N is firmly tied to the beginning and can neither be preceded by an inter-segment nor by any other construct.

There are several generative constructs for defining rule templates to simplify the creation and maintenance of the grammar. One group of such constructs is formed by a set of %list_* expressions, which automatically produce new rules for a list of the given non-terminals, either simply concatenated or separated by commas and co-ordinative conjunctions.

A significant portion of the grammar (about 40 %) is made up of the verb group rules. Therefore we have been seeking an instrument that would capture frequent repetitive constructions in verb groups. The obtained addition is the %group keyword, illustrated by the following example:

    %group verbP={
        V:    verb_rule_schema($@,"(#1)") groupflag($1,"head"),
        VR R: verb_rule_schema($@,"(#1 #2)") groupflag($1,"head"),
    }

    /* ctu/ptam se - I am reading/I am asking */
    clause ====> order(group(verbP), vi_list)
        verb_rule_schema($@,"#2")
        depends(getgroupflag($1,"head"), $2)

Here, the group verbP denotes two sets of non-terminals with the corresponding actions that are then substituted for the expression group(verbP) on the RHS of the clause non-terminal. In order to be able to refer to verb group members in the rules where the group is used, any group term can be associated with a flag (any string). By that flag an outside action can refer to the term later with the getgroupflag construct.
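To illustrate how such G1 constructs are expanded, the following Python sketch generates the plain context-free rules that order() contributes to G2. The rule representation is hypothetical, and the real translation additionally handles inter-segment insertion and enclitic checks, which are omitted here.

    from itertools import permutations

    # A G1 rule such as  clause ===> order(VBU, R, VRI)  is expanded in G2
    # into one plain CFG rule per permutation of the listed constituents.

    def expand_order(lhs, constituents):
        """Return the G2 rules generated by order() on the RHS of `lhs`."""
        return [(lhs, list(p)) for p in permutations(constituents)]

    # expand_order("clause", ["VBU", "R", "VRI"]) yields 3! = 6 rules, e.g.
    #   clause -> VBU R VRI,  clause -> VBU VRI R,  clause -> R VBU VRI, ...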


Many rules, e.g. those prescribing the structure of a clause, share the same rule template: they have the same requirements for inter-segment filling, for the enclitics order checking and for the RHS term combinations. To enable a global specification of such a majority of rules, a rule template mechanism is provided. It defines a pattern for each such rule (the rule type and the RHS encapsulation with some generative construct).

Some grammatical phenomena occur very rarely in common texts. The best way to capture this sparseness is to train rule probabilities on a large data bank of derivation trees acquired from corpus sentences. Since the preparation of such a corpus of adequate size (at least tens of thousands of sentences) is a very expensive and tedious process, we currently overcome this difficulty by defining rule levels. Every rule without a level indication is of level 0. The higher the level, the less frequent the corresponding grammatical phenomenon is, according to the guidance of the linguist. Rules of higher levels can be switched on or off according to the chosen level of the whole grammar (see the sketch below).

Apart from the common generative constructs, the meta-grammar comprises feature tagging actions that specify certain local aspects of the denoted (non-)terminal. One of these actions is the specification of the head-dependent relations in the rule — the head() and depends() constructs, which allow the dependency links between rule terms to be expressed.
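A possible reading of the rule-level mechanism as code (hypothetical rule representation, not the synt data structures):

    # Rules marked with a higher level describe rarer phenomena and are
    # included only when the whole grammar is run at (at least) that level.

    def active_rules(rules, grammar_level):
        """rules: iterable of (lhs, rhs, level); level defaults to 0 in G1.
        Returns the rules used for parsing at the given grammar level."""
        return [(lhs, rhs) for (lhs, rhs, level) in rules
                if level <= grammar_level]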

The Second Grammar Form (G2) As mentioned earlier, several predefined grammatical tests and procedures are used in the description of the context actions associated with each grammatical rule of the system. The pruning actions include:

• case test for particular words and noun groups

• agreement test of case in prepositional construction

• agreement test of number and gender for relative pronouns

• agreement test of case, number and gender for noun groups

• type checking of logical constructions [Horák, 2002b]

    np -> adj_group np
        rule_schema($@, "lwtx(awtx(#1) and awtx(#2))")
        rule_schema($@, "lwtx([[awt(#1),#2],x])")


Figure 6.1: Generative construct %list_coord_case_prep in the grammar G1 and the appropriate generated rules and actions in G2 and G3.

The contextual actions propagate_all, agree_* and propagate propagate all relevant grammatical information from the selected non-terminals on the RHS to the non-terminal on the left hand side of the rule. The rule_schema action presents a prescription for building a logical construction out of the sub-constructions from the RHS. Each time, a type checking mechanism is applied and only the type-correct combinations are passed through.

Expanded Grammar Form (G3) The feature agreement tests can be transformed into context-free rules. For instance in Czech, as in other Slavic languages, there are 7 grammatical cases (nominative, genitive, dative, accusative, vocative, locative and instrumental), two numbers (singular and plural) and four genders (masculine, feminine and neuter, where masculine exists in two forms — animate and inanimate). Thus, this produces 56 possible variants for a full agreement between two constituents (a sketch of this expansion follows below).

Figure 6.1 illustrates the generative construct %list_coord_case_prep in G1, which produces two context-free rules with pruning actions in G2 and fourteen context-free rules in G3. The grammars are displayed by the GrammarView module, which is part of the Grammar Development Workbench environment, see Section 6.5.

The number of rules naturally grows in the direction G1 < G2 < G3. The current numbers of rules in the three grammar forms are 253 in G1, 3091 in G2 and 11530 in G3, but the grammar is still being developed and enhanced.
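The following sketch (hypothetical rule encoding, not the actual grammar generator) shows the G2 → G3 expansion of a full agreement test; the 7 × 2 × 4 = 56 combinations each yield one specialized context-free rule:

    from itertools import product

    # Expanding a feature agreement test into plain context-free rules:
    # each combination of case, number and gender yields one rule in which
    # both agreeing constituents carry the same feature values.
    CASES   = ["nom", "gen", "dat", "acc", "voc", "loc", "ins"]  # 7 cases
    NUMBERS = ["sg", "pl"]                                       # 2 numbers
    GENDERS = ["masc_anim", "masc_inan", "fem", "neut"]          # 4 genders

    def expand_agreement(lhs, left, right):
        """Replace `lhs -> left right` + agreement test by 56 CF rules."""
        rules = []
        for c, n, g in product(CASES, NUMBERS, GENDERS):
            feat = f"_{c}_{n}_{g}"
            rules.append((lhs + feat, [left + feat, right + feat]))
        return rules   # len(...) == 7 * 2 * 4 == 56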

6.2 Parser

The parser in synt is a practical implementation of the parsing system described in Chapter 3; see Figure 3.1 for the system overview and data flow. Words in the input sentence can be optionally tagged. If they are not tagged, the Czech morphological analyzer ajka [Sedláček, 2005] is used; in this case, ambiguities are left in the input. The terminals for the given context-free grammar are created by simplifying the tags, e.g. using only the word category as a terminal (sketched below).

For the context-free part of the analysis, the HDddm algorithm (described in Section 3.2) is applied. Other context-free parsers are also implemented in synt, but this one is the fastest for our grammar; see Section 5.2.1 for a comparison.

It was shown [Barton et al., 1987] that parsing is in the general case an NP-complete problem if grammars are allowed to have agreement features. Most of the pruning constraints in synt are weaker than general feature structures, which allows an efficient implementation with the following properties. A node in the derivation tree has only a limited number of values (e.g. the cardinality of the value set for noun groups in our system is at most 56 = 7 cases × 2 numbers × (3 + 1) genders). Some of our experiments use semantic actions with an unlimited domain range; see Section 6.3.2 for such an experiment. A new forest of values is built instead of pruning the original context-free parsing result (the packed shared forest); see Section 3.5 for details.
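The tag simplification can be pictured as follows; the attributive tag format (e.g. k1gInSc1, where the leading k attribute encodes the word category) follows the ajka convention, but the helper itself is a hypothetical illustration:

    # Sketch of deriving CFG terminals from (possibly ambiguous)
    # morphological tags: using only the word category means keeping
    # just the leading "k" attribute of each proposed tag.

    def tags_to_terminals(ambiguous_tags):
        """ambiguous_tags: all tags ajka proposes for one word.
        Returns the set of CFG terminals (word categories)."""
        return {tag[:2] for tag in ambiguous_tags if tag.startswith("k")}

    # tags_to_terminals({"k1gInSc1", "k1gInSc4", "k5eAaImIp3nS"})
    #   == {"k1", "k5"}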


#sentences                                   10000 sentences
#words                                       191034 words
Maximum sentence length                      155 words
Minimum sentence length                      2 words
Average sentence length                      19.1 words
Time of CFG parsing                          5 minutes 52 seconds
Time of evaluating constraints and actions   38 minutes 32 seconds
Overall time with freeing memory             46 minutes 9 seconds
Average #words per second                    68.97
Size of the log file                         1.2 GiB
#accepted sentences                          9208

Table 6.1: Results of running the synt parser on 10,000 DESAM corpus sentences

6.3 Experiments

In this section, two experiments are described. The first one shows the speed of the analyser working on corpus data. In the second part, one of the first automatic experiments with the lexicon of verb valences called VerbaLex is presented.

6.3.1 Real data test

In the current stage of the meta-grammar development, synt has achieved an average of 92.08 % coverage on corpus sentences, with about 84 % of cases where the correct syntactic tree was present in the result. For a comparison with different Czech parsers, see Section 6.4.

The average time of analysis of one sentence from the corpus data was 0.28 seconds on an Intel Xeon 2.2 GHz. The average running time includes the generation of log file messages. Detailed results of analysing the DESAM [Pala et al., 1997] corpus can be found in Table 6.1.


Number of sentences:
    count     4117
Number of words in sentence:
    minimum   2.0
    maximum   68.0
    average   16.8
    median    15.0
Number of discovered valency frames:
    minimum   0
    maximum   37080
    average   380
    median    11
Elapsed time:
    minimum   0.00 s
    maximum   274.98 s
    average   6.86 s
    median    0.07 s

Table 6.2: The results of verb frame extraction from the corpus DESAM.

6.3.2 Verb Valences

In this section, the results of experiments with filtering by verb valences are presented. The algorithm and more details about the valency lexicon VerbaLex [Hlaváčková et al., 2006] can be found in Section 3.6.

The results of the automatic verb frame extraction were measured on 4117 sentences from the Czech corpus DESAM [Pala et al., 1997]. Only sentences which are analysed at rule level 0 were selected; they do not contain analytically difficult phenomena like non-projectivity or adjective noun phrases. Even on those sentences the number of possible valency frames can be quite high (see Table 6.2). However, if we work with intersections of those possible valency frames, we can get a useful reduction of the number of resulting derivation trees (a sketch of this filtering follows below).

These results of the exploitation of VerbaLex in the syntactic analysis of Czech are very promising. Enlarging the lexicon to a representative number of Czech verbs should allow the synt system to detect the correct derivation tree in many cases which were unsolvable so far. The results from this section are presented in [Hlaváčková et al., 2006].
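The intersection-based filtering can be sketched as follows (the frame encoding is purely illustrative; the actual algorithm of Section 3.6 operates on the packed forest of values):

    # Each candidate derivation tree proposes a set of valency frames
    # compatible with it; only trees whose frames survive the intersection
    # with the frames the lexicon licenses for the verb are kept.

    def filter_by_valences(candidate_trees, lexicon_frames):
        """candidate_trees: list of (tree, frames) pairs, frames being the
        set of valency frames compatible with that derivation tree.
        lexicon_frames: frames VerbaLex licenses for the clause's verb."""
        return [tree for tree, frames in candidate_trees
                if frames & lexicon_frames]   # non-empty intersection

    # With lexicon_frames = {"nom-acc"}, a tree proposing {"nom-dat"} is
    # pruned, while a tree proposing {"nom-acc", "nom-ins"} is kept.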

6.4 Comparison of dependency and phrasal parsers

In this section the results of experiments comparing dependency and phrasal parsers are shown, as presented in [Horák et al., 2007]. We compare stochastic parsers that provide dependency trees as their outputs with the meta-grammar parser synt (see Chapter 6), which generates a packed forest of phrasal derivation trees.

6.4.1 Compared Dependency Parsers

The set of dependency parsers selected and denoted as the Prague parsers contains the following representatives:

McD McDonald's maximum spanning tree parser [McDonald, 2006],

COL Collins’s parser adapted for PDT [Hajiˇcet al., 1999],

ZZ Žabokrtský's rule-based dependency parser [Holan and Žabokrtský, 2006],

AN Holan’s parser ANALOG – it has no training phase and in the parsing phase it searches in the training data for the most similar local tree configuration [Holan, 2005],

L2R, R2L, L2R3, R2L3 Holan’s push-down parsers [Holan, 2004],

CP Holan’s and Zabokrtsk´y’sˇ combining parser [Holan and Zabokrtsk´y,ˇ 2006].

The selection of the Prague parsers was limited to the parsers contained in the combining parser (CP), which currently achieves the best known results on PDT. Other parsers, e.g. Hall and Novák's corrective modelling parser [Hall and Novak, 2005] or Nilsson, Nivre and Hall's graph transformation parser [Nilsson et al., 2006], were not included in the comparison, since we currently do not have their results for all sentences of the testing data set.


6.4.2 Main Problems

The most fundamental difference between the parsers is the underlying formalism and methodology of the parsing process. This is, however, not the main difference that would cause problems in the parser comparison. In this section, we concentrate on the problems arising from the different output data structures and the different presuppositions on the input text, which all need to be resolved before we can start the real comparison.

The output of the Prague parsers is formed by dependency trees or graphs, whereas the output of synt is basically formed by a packed shared forest of phrasal trees. In order to be able to compare this forest with the one tree obtained from the PDT 2.0 conversion procedure (see below), the first 100 (or fewer) trees were extracted for each sentence and sorted according to their tree rank. Each of these trees was then compared to the one from PDT and the results are displayed as the following three numbers (a sketch of this evaluation follows below):

• best trees – the tree from the set that is most similar to the desired tree is selected and compared;

• first tree – the tree with the highest tree rank is selected and compared;

• average – the average over all trees is presented.

The next problem is that the output of synt is always in the form of projective trees; a non-projective phrase can, in some cases, be analysed with the mechanism of different rule levels allowing special kinds of phrases to be handled. Nevertheless, synt is not suitable for analysing non-projective sentences at the moment. On the other hand, the output of the Prague parsers, as a set of dependency edges between words, can cross the word surface order without problems. Thus they can represent projective as well as non-projective sentences.
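The three numbers can be computed as in this sketch, where the similarity function is left abstract (in the experiments it is one of the measures defined in Section 6.4.3):

    # Best/first/average evaluation over the (at most) 100 trees extracted
    # for one sentence.  `trees` is assumed to be sorted by tree rank,
    # highest rank first; `similarity` compares a tree to the PDT reference.

    def best_first_average(trees, reference, similarity):
        scores = [similarity(t, reference) for t in trees]
        return {
            "best trees": max(scores),        # most similar tree in the set
            "first tree": scores[0],          # tree with the highest rank
            "average":    sum(scores) / len(scores),
        }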

6.4.3 Comparison method

Since the measurements had to be done on several thousands of sentences, we decided to use the Prague Dependency Treebank, version 2.0 (PDT-2.0), created in the Institute of Formal and Applied Linguistics, http://ufal.mff.cuni.cz [Hajič, 2004]. Since this treebank provides only the dependency trees for more than 80 thousand Czech sentences, we decided to convert them to phrasal trees using Collins's conversion tool [Collins, 1998] and then measure the differences between the synt output and the PDT-2.0 converted "phrasal" tree.

The methodology for measuring the results of dependency parsing is usually defined as the computation of the precision and recall of the particular dependency edges in the resulting graph/tree. These parameters are measured for each lexical item and the result is then computed as an average precision and average recall over the whole set.

In the case of phrasal trees we use the two following measures: PARSEVAL [Black, 1992] and the leaf-ancestor assessment (LAA) [Sampson, 2000, Sampson and Babarczy, 2003]. The PARSEVAL scheme utilizes only the bracketing information from the parser output to compute three values (a sketch follows the list):

• crossing brackets – the number of brackets in the tested analyzer's parse that cross brackets in the treebank parse.

• recall – the ratio of the number of correct brackets in the analyzer's parse to the total number of brackets in the treebank parse.

• precision – the ratio of the number of correct brackets in the analyzer's parse to the total number of brackets in the analyzer's parse.
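A minimal sketch of the three PARSEVAL values over unlabelled bracket spans (each bracket is a (start, end) pair; this is one common variant of the scheme and assumes non-empty bracket sets):

    # Two brackets cross iff they overlap without one containing the other.

    def parseval(test_brackets, gold_brackets):
        test, gold = set(test_brackets), set(gold_brackets)
        correct = len(test & gold)
        crossing = sum(
            1 for (a, b) in test
            if any(a < c < b < d or c < a < d < b for (c, d) in gold)
        )
        return {
            "crossing brackets": crossing,
            "recall":    correct / len(gold),
            "precision": correct / len(test),
        }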

There are several known limitations [Bangalore et al., 1998] of the PARSEVAL technique. It is not clear whether this metric can be used for comparing parsers with different degrees of structural fineness, since the score on this metric is tightly related to the degree of structural detail.

The LAA measure is more complicated than PARSEVAL. It considers a lineage for each word in the sentence, that is, the sequence of node labels found on the path between the leaf and the root node in the respective tree. The lineages are compared by their edit distance, each comparison yielding a score between 0 and 1. The score of the whole sentence is then defined as the mean similarity of the lineage pairs over its leaves. Since it considers more than just the boundaries between phrases, the LAA measure is supposed to be more objective than PARSEVAL, even for non-projective sentences. In this comparison, Geoffrey Sampson's LAA implementation, http://www.grsampson.net/Resources.html, was used.
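The per-sentence part of the LAA computation can be sketched as follows. It is a simplification: Sampson's implementation additionally inserts phrase-boundary markers into the lineages, which is omitted here, and difflib's ratio() stands in for the normalized edit-distance similarity:

    from difflib import SequenceMatcher

    # Compare the sequence of node labels on the leaf-to-root path in the
    # test tree with the same lineage in the gold tree; ratio() yields a
    # similarity in [0, 1].

    def lineage_similarity(test_lineage, gold_lineage):
        return SequenceMatcher(None, test_lineage, gold_lineage).ratio()

    def laa_sentence_score(test_lineages, gold_lineages):
        """Mean lineage similarity over the words of one sentence."""
        scores = [lineage_similarity(t, g)
                  for t, g in zip(test_lineages, gold_lineages)]
        return sum(scores) / len(scores)

    # laa_sentence_score([["N", "NP", "S"]],
    #                    [["N", "NP", "VP", "S"]])  ~ 0.86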


Parser    all sentences   non-projective   projective
R2L       73.845 %        69.823 %         75.735 %
L2R       71.315 %        67.297 %         73.204 %
ANALOG    71.077 %        66.625 %         73.169 %
R2L3      61.648 %        58.276 %         63.233 %
L2R3      53.276 %        49.672 %         64.912 %
zz        75.931 %        74.177 %         76.755 %
col       80.905 %        75.634 %         83.383 %
MST       83.984 %        82.230 %         84.809 %
CP        85.85 %         83.434 %         86.979 %

Table 6.3: The results of the Prague parsers (precision = recall)

Testing Data Set The PDT-2.0 e-test part was used as the testing data. The morphological tags were automatically converted from the Prague tags to the ajka [Sedláček, 2005] tags without ambiguities. The e-test set consists of:

• 10148 sentences (173586 words)

• 7732 projective sentences

• 2416 non-projective sentences

87.7 % of the sentences have at least one derivation tree on the output of synt. The actual comparison was run only on those sentences from e-test that were accepted by synt.

6.4.4 Results

Overall results of the Prague parsers testing are presented in Table 6.3 in the form of the percentage of correct dependencies for the whole set of sentences, for the non-projective and for the projective sentences only. The results of the synt parser on the whole testing set e-test (with manual tagging from PDT-2.0) are displayed in Table 6.4. Notice that the numbers in these two tables cannot be directly compared, because the results for the Prague parsers do not contain the conversion step to the phrasal trees.


                cross-brackets   precision   recall     LAA
all sentences
  Best trees    4.473            60.228 %    60.645 %   71.5 %
  First trees   6.229            47.306 %    50.778 %   69.1 %
  Average       5.799            45.627 %    46.584 %   69.0 %
projective sentences
  Best trees    3.619            66.718 %    68.663 %   73.1 %
  First trees   5.289            53.028 %    57.630 %   70.6 %
  Average       4.942            50.859 %    52.552 %   70.5 %
non-projective sentences
  Best trees    7.251            39.615 %    35.727 %   65.6 %
  First trees   9.325            29.275 %    29.699 %   63.5 %
  Average       8.625            29.112 %    28.097 %   63.3 %

Table 6.4: The results of synt parser on the e-test set

synt:
                cross-brackets   precision   recall     LAA
  Best trees    0.792            89.519 %    92.274 %   97.2 %
  First trees   2.132            70.849 %    74.358 %   92.6 %
  Average       2.311            63.330 %    64.453 %   91.4 %

Prague parsers (precision = recall):
  R2L       81.472 %
  L2R       81.634 %
  ANALOG    76.537 %
  R2L3      63.754 %
  L2R3      57.201 %
  zz        86.650 %
  col       90.129 %
  MST       89.889 %
  CP        91.912 %

Table 6.5: The results of synt parser and Prague parsers on the small tree set

6.4.5 Conclusions

The experiment of comparing the results of parsers with dependency and phrasal outputs has opened several new problems. One of the main causes of these problems was the incompatibility between the "constituent PDT" trees and the derivation trees from synt. This was also the main source of the low precision and recall of the parser. In order to prove this thesis, we (manually) prepared a small set of phrasal trees (for 100 sentences randomly chosen from the e-test projective sentences) in the form of synt parser trees and repeated the measurements for this subset. The improvement of the results of the synt parser on this small subset may be seen in Table 6.5.

The results of these experiments were presented in [Horák et al., 2007]. The conversion procedure and the e-test experiments were conducted by the author of this thesis and Vojtěch Kovář. The experiment with the small set (100 sentences) was performed by Aleš Horák. The testing data and the results for the Prague parsers were provided by Tomáš Holan.

6.5 Front-ends

Because synt is controlled from the command line only, graphical user interfaces were created to allow non-experienced users to work with the tools more comfortably. Two of these front-ends are presented here: the Grammar Development Workbench and WWWsynt.

6.5.1 Grammar Development Workbench

The Grammar Development Workbench (GDW) tool has been created by Radek Vykydal [Vykydal, 2005] in the Centre for Natural Language Processing at the Faculty of Informatics, Masaryk University, Brno (under the supervision of Aleš Horák). It consists of several modules:

• Gsynt – graphical user interface of the synt parser.

• TreeView – viewer for resulting syntactic trees.

• ChartView – browser of resulting chart structure.

• GrammarView – grammar forms viewer.

The GDW is mainly used to build a treebank and to enhance the meta-grammar. For building the treebank of correct syntactic trees we perform the following steps. First of all, we analyze a sentence by the Gsynt module. Figure 6.2 shows the basic window with sentences from a corpus (PDTB-1.0 [Hajič, 1998] in this case). The selected sentence is analyzed by the synt parser and the output of the parser is displayed, together with some additional properties of the sentence. For more information about the GDW see the project manual at http://nlp.fi.muni.cz/projects/grammar_workbench/manual-en/.

Figure 6.2: Gsynt module window showing an analysis of a sentence from a corpus.

6.5.2 WWWsynt

WWWsynt [Golembiovský, 2006] allows anybody to use synt through a web browser.


Figure 6.3: WWWsynt query.

This web interface to the parser has been created by Jiří Golembiovský in the Centre for Natural Language Processing at the Faculty of Informatics, Masaryk University, Brno. The project homepage is at http://nlp.fi.muni.cz/projekty/wwwsynt/.

Figure 6.3 shows the input query form, http://nlp.fi.muni.cz/projekty/wwwsynt/query.cgi. When a user enters a phrase, the parsing process itself is run on the server. Results such as derivation trees or the whole chart can be computed and displayed. The development of the synt web interface has begun only recently, but it is already stable and usable.


6.6 Conclusions

The synt project is still under development and new features are constantly being added. At the moment, most of our experiments are related to the lexicon of verb valences VerbaLex. We are also adding semantic actions for outputting dependency graphs, which allows us to compare our results directly with dependency parsers. Our experiments show that for some sentences, the dependencies generated by synt can be correct even for non-projective sentences, despite the fact that synt is based on a context-free grammar.

The results shown in this chapter were presented several times at international conferences [Horák et al., 2002, Kadlec and Smrž, 2003, Smrž and Kadlec, 2005, Horák and Kadlec, 2005, Kovář et al., 2006, Horák and Kadlec, 2006, Horák et al., 2007].

Chapter 7

Conclusions and Future Research

In this work, a language-independent parsing system is presented. It is based on a context-free parser supplemented by contextual constraints and semantic actions.

For the context-free part of the system, we have developed a new parsing algorithm (described in Chapter 3) based on the head-driven approach. We have also suggested several modifications of our algorithm. These modifications are more efficient than the basic implemented version, and we have estimated their possible benefit.

Because our algorithms are head-driven, the right choice of the heads of grammar rules is crucial. We suggested a general approach for selecting grammar heads. In Chapter 3 two algorithms are described – a general heuristic algorithm and an algorithm that creates optimal heads for the given grammar and corpora.

Our experiments described in Chapter 5 show that the context-free part of the system is fully comparable with the best published context-free parsers. In the case of highly ambiguous grammars, these parsers are outperformed by our parser.

If the language generated by the input grammar is not rich enough to model a natural language, we apply a robust option of our parser. The method described in Chapter 4 solves this problem of undergeneration. Our algorithm is targeted at correct English sentences and it does not seem suitable for free word-order languages such as Czech. However, the results of the performed experiments from Section 5.3 are surprisingly good in combination with the semantic actions from the Czech meta-grammar.

The evaluation of semantic actions and contextual constraints helps us to reduce the huge number of derivation trees, and we are also able to compute new information which is not covered by the context-free part of the grammar. The dependency graph or the filtering by the valency lexicon from Section 3.6 are examples of such information. The experiments with dependency graphs are still at an early stage, but even for some kinds of short non-projective sentences the correct dependencies can be generated within our approach.

All described algorithms are integrated in the parsing system synt. This system and all partial results from this work were published at several international conferences.

Future research is aimed at experiments with verb valences and the lexicon of verb valences for Czech, VerbaLex. The completion of semantic actions for dependency graphs in the meta-grammar will allow us to perform a direct comparison with dependency parsers.

Bibliography

[Aho et al., 1986] Aho, A., Sethi, R., and Ullman, J. (1986). Compilers: Principles, Techniques and Tools. Addison-Wesley, Reading, Mass.

[Aho and Ullman, 1972] Aho, A. and Ullman, J. (1972). The Theory of Parsing, Translation and Compiling, volume I: Parsing. Prentice-Hall, Englewood Cliffs, N.J.

[Ailomaa, 2004] Ailomaa, M. (2004). Two approaches to robust stochastic parsing. Master's thesis, The Swiss Federal Institute of Technology, Lausanne, Switzerland.

[Ailomaa et al., 2005a] Ailomaa, M., Kadlec, V., Chappelier, J.-C., and Rajman, M. (2005a). Efficient processing of extra-grammatical sentences: Comparing and combining two approaches to robust stochastic parsing. In Proceedings of the Applied Stochastic Models and Data Analysis (ASMDA) 2005, pages 81–89, France. ENST Bretagne.

[Ailomaa et al., 2005b] Ailomaa, M., Kadlec, V., Chappelier, J.-C., and Rajman, M. (2005b). Robust stochastic parsing: comparing two approaches for processing extra-grammatical sentences. In Proceedings of the 15th Nordic Conference of Computational Linguistics (NODALIDA) 2005, pages 21–29, Finland. University of Joensuu.

[Ballim et al., 2000] Ballim, A., Chappelier, J.-C., Rajman, M., and Pallotta, V. (2000). ISIS: Interaction through speech with information system. In Proceedings of the Third International Workshop on Text, Speech and Dialogue – TSD 2000, pages 339–344, Brno (Czech Republic).

[Bangalore et al., 1998] Bangalore, S., Sarkar, A., Doran, C., and Hockey, B. A. (1998). Grammar & parser evaluation in the XTAG project. http://www.cs.sfu.ca/~anoop/papers/pdf/eval-final.pdf.


[Barton et al., 1987] Barton, G. E., Berwick, R. C., and Ristad, E. S. (1987). Computational complexity and natural language. MIT Press, Cambridge, Massachusetts.

[Bear et al., 1992] Bear, J., Dowding, J., and Shriberg, E. (1992). Integrating multiple knowledge sources for the detection and correction of repairs in human-computer dialogue. In Proceedings of the 30th ACL, pages 56–63, Newark, Delaware.

[Black, 1992] Black, E. (1992). Meeting of interest group on evaluation of broad-coverage grammars of English. LINGUIST List 3.587. http://www.linguistlist.org/issues/3/3-587.html.

[Bouma and van Noord, 1993] Bouma, G. and van Noord, G. (1993). Head-driven parsing for lexicalist grammars: Experimental results. In Proceedings of the 6th Conference of the EACL, Utrecht, The Netherlands.

[Carroll and Briscoe, 1996] Carroll, J. and Briscoe, T. (1996). Robust parsing — a brief overview. In Carroll, J., editor, Proceedings of the Workshop on Robust Parsing at the 8th European Summer School in Logic, Language and Information (ESSLLI'96), Report CSRP 435, pages 1–7, COGS, University of Sussex.

[Chappelier and Rajman, 1998a] Chappelier, J.-C. and Rajman, M. (1998a). A generalized CYK algorithm for parsing stochastic CFG. In TAPD'98 Workshop, pages 133–137, Paris, France. (http://slptk.sourceforge.net).

[Chappelier and Rajman, 1998b] Chappelier, J.-C. and Rajman, M. (1998b). A practical bottom-up algorithm for on-line parsing with stochastic context-free grammars. Technical Report No 98/284, Département Informatique, EPFL, Lausanne, Switzerland.

[Charniak, 1997] Charniak, E. (1997). Statistical parsing with a context-free grammar and word statistics. In Proceedings of the Fourteenth National Conference on Artificial Intelligence (AAAI'97), pages 598–603.

[Charniak, 2000] Charniak, E. (2000). A maximum-entropy-inspired parser. In Proceedings of the 1st Conference of the North American Chapter of the Association for Computational Linguistics, pages 65–70.


[Chelba and Jelinek, 1998] Chelba, C. and Jelinek, F. (1998). Exploiting syntactic structure for language modeling. In Boitet, C. and Whitelock, P., editors, Proceedings of the Thirty-Sixth Annual Meeting of the Association for Computational Linguistics and Seventeenth International Conference on Computational Linguistics, pages 225–231, San Francisco, California. Morgan Kaufmann Publishers.

[Chomsky, 1955] Chomsky, N. (1955). The Logical Structure of Linguistic Theory. Plenum Press, New York and London, 1973.

[Chomsky, 1957] Chomsky, N. (1957). Syntactic Structures. Mouton & Co., The Hague, The Netherlands.

[Collins, 1997] Collins, M. (1997). Three generative lexicalised models for statistical parsing. In Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics, pages 16–23.

[Collins, 1998] Collins, M. (1998). dep2phr – conversion between dependency and phrase structures. http://ufal.mff.cuni.cz/pdt/Utilities/dep2phr/.

[Daciuk et al., 1998] Daciuk, J., Watson, R. E., and Watson, B. W. (1998). Incremental Construction of Acyclic Finite-State Automata and Transducers. In Finite State Methods in Natural Language Processing, Bilkent University, Ankara, Turkey.

[Dang et al., 1998] Dang, H. T., Kipper, K., Palmer, M., and Rosenzweig, J. (1998). Investigating regular sense extensions based on intersective Levin classes. In Proceedings of Coling-ACL98, August 11–17, Montreal, CA. http://www.cis.upenn.edu/~mpalmer/.

[Dijkstra, 1959] Dijkstra, E. W. (1959). A note on two problems in connection with graphs. Numerische Mathematik, 1:269–271.

[Earley, 1970] Earley, J. (1970). An efficient context-free parsing algorithm. Communications of the ACM, 13:94–102.

[Fredkin, 1960] Fredkin, E. (1960). Trie memory. Communications of the ACM, 3(9):490–499.


[Glembek, 2005] Glembek, O. (2005). Automatic lecture indexing using voice recognition. Master's thesis, Faculty of Information Technology, Brno University of Technology, Brno.

[Golembiovský, 2006] Golembiovský, J. (2006). WWW rozhraní k syntaktickému analyzátoru synt. Master's thesis, Faculty of Informatics, Masaryk University, Brno, Czech Republic.

[Gordon, 1994] Gordon, S. A. (1994). A faster Scrabble move generation algorithm. Software — Practice and Experience, 24(2):219–232.

[Graham et al., 1980] Graham, S., Harrison, M., and Ruzzo, W. (1980). An improved context-free recognizer. ACM Transactions on Programming Languages and Systems, 2(3):415–462.

[Hajič, 2004] Hajič, J. (2004). Complex Corpus Annotation: The Prague Dependency Treebank. Jazykovedný ústav Ľ. Štúra, SAV, Bratislava, Slovakia.

[Hajič, 1998] Hajič, J. (1998). Building a syntactically annotated corpus: The Prague Dependency Treebank. In Issues of Valency and Meaning, pages 106–132, Prague. Karolinum.

[Hajič et al., 1999] Hajič, J., Collins, M., Ramshaw, L., and Tillmann, C. (1999). A Statistical Parser for Czech. In Proceedings of ACL'99, Maryland, USA.

[Hall and Novak, 2005] Hall, K. and Novák, V. (2005). Corrective modeling for non-projective dependency parsing. In Proceedings of IWPT 2005, pages 42–51.

[Harrison, 1986] Harrison, M. (1986). Introduction to Formal Language Theory. Addison-Wesley, Reading, Mass.

[Heeman and Allen, 1994] Heeman, P. A. and Allen, J. F. (1994). Detecting and correcting speech repairs. In Proceedings of the 32nd ACL, pages 295–302, Las Cruces, New Mexico.

[Heemels et al., 1991] Heemels, R., Nijholt, A., and Sikkel, K. (1991). Tomita's algorithm: Extensions and applications. In Proceedings of the First Twente Workshop on Language Technology, pages 99–103, Enschede. Universiteit Twente.


[Hemphill et al., 1990] Hemphill, C. T., Godfrey, J. J., and Doddington, G. R. (1990). The ATIS spoken language systems pilot corpus. In Proceedings of the DARPA Speech and Natural Language Workshop, pages 96–101, Hidden Valley, PA.

[Hipp, 1992] Hipp, D. R. (1992). Design and development of spoken natural language dialog parsing systems. PhD thesis, Duke University.

[Hlaváčková et al., 2006] Hlaváčková, D., Horák, A., and Kadlec, V. (2006). Exploitation of the VerbaLex verb valency lexicon in the syntactic analysis of Czech. In Proceedings of Text, Speech and Dialogue 2006, pages 85–92, Brno, Czech Republic. Springer-Verlag.

[Holan, 2004] Holan, T. (2004). Tvorba závislostního syntaktického analyzátoru. In Sborník semináře MIS 2004. Matfyzpress, Prague, Czech Republic.

[Holan, 2005] Holan, T. (2005). Genetické učení závislostních analyzátorů. In Sborník semináře ITAT 2005. UPJŠ, Košice.

[Holan et al., 1995] Holan, T., Kuboň, V., and Plátek, M. (1995). An implementation of syntactic analysis of Czech. In Proceedings of the 5th IWPT, pages 126–135, Charles University, Prague, Czech Republic.

[Holan and Žabokrtský, 2006] Holan, T. and Žabokrtský, Z. (2006). Combining Czech Dependency Parsers. In Lecture Notes in Artificial Intelligence, Proceedings of TSD 2006, pages 95–102, Brno, Czech Republic. Springer Verlag.

[Horák et al., 2007] Horák, A., Holan, T., Kadlec, V., and Kovář, V. (2007). Dependency and Phrasal Parsers of the Czech Language: A Comparison. In Proceedings of the 10th International Conference on Text, Speech and Dialogue, TSD 2007, volume 4629 of Lecture Notes in Computer Science, pages 76–84. Springer.

[Horák and Kadlec, 2006] Horák, A. and Kadlec, V. (2006). Platform for Full-Syntax Grammar Development Using Meta-grammar Constructs. In Proceedings of the 20th Pacific Asia Conference on Language, Information and Computation, pages 311–318, Beijing, China. Tsinghua University Press.


[Horák, 2002a] Horák, A. (2002a). Analysis of Knowledge in Sentences. PhD thesis, Faculty of Informatics, Masaryk University, Brno, Czech Republic.

[Horák, 2002b] Horák, A. (2002b). The Normal Translation Algorithm in Transparent Intensional Logic for Czech. PhD thesis, Faculty of Informatics, Masaryk University, Brno, Czech Republic.

[Horák and Kadlec, 2005] Horák, A. and Kadlec, V. (2005). New Meta-grammar Constructs in Czech Language Parser synt. In Proceedings of Text, Speech and Dialogue 2005, pages 85–92, Karlovy Vary, Czech Republic. Springer-Verlag.

[Horák et al., 2002] Horák, A., Kadlec, V., and Smrž, P. (2002). Enhancing best analysis selection and parser comparison. In Proceedings of the 5th International Workshop TSD 2002, pages 461–467, Brno, Czech Republic. Springer Verlag, Lecture Notes in Artificial Intelligence, Volume 2448.

[Johnson, 1989] Johnson, M. (1989). The computational complexity of Tomita's algorithm. In Proceedings of the International Workshop on Parsing Technologies (IWPT'89), pages 203–208, Carnegie Mellon University, Pittsburgh, PA.

[Johnson and Dorre, 1995] Johnson, M. and Dorre, J. (1995). Memoization of coroutined constraints. In 33rd Annual Meeting of the Association for Computational Linguistics, pages 100–107, Boston.

[Kadlec, 2000] Kadlec, V. (2000). Syntaktická analýza přirozeného jazyka. Master's thesis, Faculty of Informatics, Masaryk University, Brno, Czech Republic.

[Kadlec et al., 2005] Kadlec, V., Ailomaa, M., Chappelier, J.-C., and Rajman, M. (2005). Robust stochastic parsing using optimal maximum coverage. In Proceedings of The International Conference Recent Advances In Natural Language Processing (RANLP) 2005, pages 258–263, Shoumen, Bulgaria. INCOMA.

[Kadlec and Smrž, 2003] Kadlec, V. and Smrž, P. (2003). PACE – parser comparison and evaluation. In Proceedings of the 8th International Workshop on Parsing Technologies, IWPT 2003, pages 211–212, Le Chesnay Cedex, France. INRIA, Domaine de Voluceau, Rocquencourt.


[Kadlec and Smrž, 2004] Kadlec, V. and Smrž, P. (2004). Grammatical Heads Optimized for Parsing and Their Comparison with Linguistic Intuition. In Text, Speech and Dialogue: Proceedings of the Seventh International Workshop TSD 2004, pages 95–102, Brno, Czech Republic. Springer Verlag, Lecture Notes in Artificial Intelligence.

[Kadlec and Smrž, 2006] Kadlec, V. and Smrž, P. (2006). How many dots are really needed for head-driven chart parsing? In Proceedings of SOFSEM 2006, pages 483–492, Czech Republic. Springer-Verlag.

[Kasami, 1965] Kasami, T. (1965). An efficient recognition and syntax analysis algorithm for context-free languages. Technical report AF CRL-65-758, Air Force Cambridge Research Laboratory, Bedford, Massachusetts.

[Kay, 1985] Kay, M. (1985). Parsing in functional unification grammar. In Natural Language Parsing, pages 251–278, Cambridge, England.

[Kay, 1989a] Kay, M. (1989a). Algorithm schemata and data structures in syntactic processing. Report CSL-80-12, Xerox PARC, Palo Alto, California.

[Kay, 1989b] Kay, M. (1989b). Head driven parsing. In Proceedings of the International Workshop on Parsing Technologies, Pittsburgh.

[Knuth, 1965] Knuth, D. (1965). On the translation of languages from left to right. Information and Control, 8:607–639.

[Kovář et al., 2006] Kovář, V., Kadlec, V., and Horák, A. (2006). Grammar Development for Czech Syntactic Parser with Corpus-based Techniques. In Proceedings of Corpus Linguistics, pages 159–165, Saint-Petersburg, Russia. Saint-Petersburg State University.

[Kuboň, 1999] Kuboň, V. (1999). A robust parser for Czech. Technická zpráva TR-1999-6, MFF UK, Prague, Czech Republic.

[Kuboň, 2001] Kuboň, V. (2001). Problems of Robust Parsing of Czech. PhD thesis, Charles University, MFF, Prague, Czech Republic.

[Kuboň and Plátek, 2001] Kuboň, V. and Plátek, M. (2001). A Method of Accurate Robust Parsing of Czech. In Proceedings of the 4th International Conference on Text, Speech and Dialogue, pages 92–99. Springer, Berlin.


[Leermakers, 1992] Leermakers, R. (1992). A recursive ascent Earley parser. Information Processing Letters, 41(2):87–91.

[Macek, 2003] Macek, O. (2003). Efektivní metody pro syntaktickou analýzu přirozeného jazyka – GLR. Master's thesis, Faculty of Informatics, Masaryk University, Brno, Czech Republic.

[Matsumoto et al., 1983] Matsumoto, Y. et al. (1983). BUP: A bottom-up parser embedded in Prolog. New Generation Computing, 1:145–158.

[McDonald, 2006] McDonald, R. (2006). Discriminative learning and spanning tree algorithms for dependency parsing. PhD thesis, University of Pennsylvania.

[Moore, 2000a] Moore, R. C. (2000a). Improved left-corner chart parsing for large context-free grammars. In Proceedings of the 6th IWPT, pages 171–182, Trento, Italy.

[Moore, 2000b] Moore, R. C. (2000b). Time as a measure of parsing efficiency. In Proceedings of the Efficiency in Large-Scale Parsing Systems Workshop, COLING'2000, pages 23–28, Saarbrücken: Universität des Saarlandes.

[Moore, 2004] Moore, R. C. (2004). Improved Left-Corner Chart Parsing for Large Context-Free Grammars (Revised Version). In Bunt, Carroll, and Satta, editors, New Developments in Parsing Technology, pages 185–201. Kluwer Academic Publishers.

[Mráková, 2002] Mráková, E. (2002). Parciální syntaktická analýza (češtiny). PhD thesis, Faculty of Informatics, Masaryk University, Brno, Czech Republic.

[Nederhof, 1993] Nederhof, M. (1993). Generalized left-corner parsing. In Proceedings of the Sixth Conference of the European Chapter of the Association for Computational Linguistics, pages 305–314, Utrecht, The Netherlands.

[Nederhof and Sarbo, 1993] Nederhof, M. and Sarbo, J. (1993). Increasing the applicability of LR parsing. In Proceedings of the 3rd International Workshop on Parsing Technologies (IWPT'93), pages 187–207, Tilburg and Durbuy, The Netherlands/Belgium.


[Nederhof and Satta, 1994] Nederhof, M.-J. and Satta, G. (1994). An extended theory of head-driven parsing. In Meeting of the Association for Computational Linguistics, pages 210–217.

[Neidle, 1994] Neidle, C. (1994). Lexical-Functional Grammar (LFG). In Asher, R. E., editor, Encyclopedia of Language and Linguistics, volume 3, pages 2147–2153. Pergamon Press, Oxford.

[Nijholt, 1994] Nijholt, A. (1994). Parallel approaches to context-free language parsing. In Hahn, U. and Adraens, G., editors, Parallel Natural Language Processing, pages 135–167, Norwood, NJ. Ablex Publishing Corporation.

[Nilsson et al., 2006] Nilsson, J., Nivre, J., and Hall, J. (2006). Graph transformations in data-driven dependency parsing. In Proceedings of the 21st Conference on Computational Linguistics and 44th Annual Meeting of the ACL, pages 257–264, Sydney.

[Oepen and Callmeier, 2000] Oepen, S. and Callmeier, U. (2000). Measure for measure: Parser cross-fertilization – towards increased component comparability and exchange. In Proceedings of IWPT'2000, pages 140–149, Trento, Italy.

[Pala et al., 1997] Pala, K., Rychlý, P., and Smrž, P. (1997). DESAM — annotated corpus for Czech. In Proceedings of SOFSEM'97, pages 523–530. Springer-Verlag. Lecture Notes in Computer Science 1338.

[Pala and Sevecek, 1997] Pala, K. and Ševeček, P. (1997). Valence českých sloves (Valencies of Czech Verbs). In Proceedings of Works of Philosophical Faculty at the University of Brno, pages 41–54, Brno. Masaryk University.

[Plátek et al., 1995] Plátek, M., Holan, T., Kuboň, V., and Hric, J. (1995). Grammar development and pivot implementation. In JRP PECO 2824 Language Technologies for Slavic Languages, Final research report, Prague, Czech Republic.

[Pollard and Sag, 1994] Pollard, C. and Sag, I. (1994). Head-Driven Phrase Structure Grammar. University of Chicago Press, Chicago.

[Rekers, 1992] Rekers, J. (1992). Parser Generation for Interactive Environments. PhD thesis, University of Amsterdam.


[Roark and Charniak, 2000] Roark, B. and Charniak, E. (2000). Measuring efficiency in high-accuracy, broad-coverage statistical parsing. Computation and Language cs.CL/0008027, pages 29–36.

[Rosenkrantz and Lewis, 1970] Rosenkrantz, D. and Lewis, P. (1970). Deterministic left corner parsing. In IEEE Conference Record of the 11th Annual Symposium on Switching and Automata Theory, pages 139–152.

[Sampson, 1994] Sampson, G. (1994). The Susanne corpus, release 3. School of Cognitive & Computing Sciences, University of Sussex, Falmer, Brighton (England).

[Sampson, 2000] Sampson, G. (2000). A Proposal for Improving the Measurement of Parse Accuracy. International Journal of Corpus Linguistics, 5(1):53–68.

[Sampson and Babarczy, 2003] Sampson, G. and Babarczy, A. (2003). A test of the leaf-ancestor metric for parse accuracy. Natural Language Engineering, 9(4):365–380.

[Satta and Stock, 1989] Satta, G. and Stock, O. (1989). Head-driven bidirectional parsing: A tabular method. In Proceedings of IWPT'1989, pages 43–51, Pittsburgh.

[Sedláček, 2005] Sedláček, R. (2005). Morphemic Analyser for Czech. PhD thesis, Faculty of Informatics, Masaryk University, Brno, Czech Republic.

[Sgall et al., 1986] Sgall, P., Hajičová, E., and Panevová, J. (1986). The Meaning of the Sentence and Its Semantic and Pragmatic Aspects. Academia/Reidel Publishing Company, Prague, Czech Republic/Dordrecht, The Netherlands.

[Shann, 1991] Shann, P. (1991). Experiments with GLR and chart parsing. In Tomita, M., editor, Generalized LR Parsing, pages 17–34, Boston, Massachusetts. Kluwer Academic Publishers.

[Sikkel, 1996] Sikkel, K. (1996). Parsing Schemata: A Framework for Specification and Analysis of Parsing Algorithms. Springer, Berlin.

[Sikkel and op den Akker, 1993] Sikkel, K. and op den Akker, R. (1993). Predictive head-corner parsing. In Proceedings of IWPT'1993, pages 267–276, Tilburg/Durbuy.


[Smrž and Horák, 1999] Smrž, P. and Horák, A. (1999). Implementation of Efficient and Portable Parser for Czech. In TSD '99: Proceedings of the Second International Workshop on Text, Speech and Dialogue, pages 105–108, London, UK. Springer-Verlag.

[Smrž and Kadlec, 2005] Smrž, P. and Kadlec, V. (2005). Incremental Parser for Czech. In Proceedings of the 4th International Symposium on Information and Communication Technologies (WISICT05), pages 1–6, Cape Town, South Africa. Cape Town International Convention Center.

[Tomita, 1986] Tomita, M. (1986). Efficient Parsing for Natural Languages: A Fast Algorithm for Practical Systems. Kluwer Academic Publishers, Boston, MA.

[van Noord, 1997] van Noord, G. (1997). An efficient implementation of the head-corner parser. Computational Linguistics, 23(3).

[van Noord et al., 1999] van Noord, G., Bouma, G., Koeling, R., and Nederhof, M.-J. (1999). Robust grammatical analysis for spoken dialogue systems. Natural Language Engineering, 5(1):45–93.

[Vykydal, 2005] Vykydal, R. (2005). Nástroje pro vývoj gramatik přirozeného jazyka (Tools for developing natural language grammars). Master's thesis, Faculty of Informatics, Masaryk University, Brno, Czech Republic. In Czech.

[Wirén, 1987] Wirén, M. (1987). A comparison of rule-invocation strategies in context-free chart parsing. In Proceedings of the Third Conference of the European Chapter of the Association for Computational Linguistics, pages 226–233, Copenhagen, Denmark.

[Worm and Rupp, 1998] Worm, K. L. and Rupp, C. J. (1998). Towards robust understanding of speech by combination of partial analyses. In Proceedings of the 13th Biennial European Conference on Artificial Intelligence (ECAI'98), August 23–28, pages 190–194, Brighton, UK.

[Younger, 1967] Younger, D. (1967). Recognition of context-free languages in time n^3. Information and Control, 10(2):189–208.

[Zeman, 2004] Zeman, D. (2004). Parsing with a Statistical Dependency Model. PhD thesis, Charles University, MFF, Prague.


[Zeman and Žabokrtský, 2005] Zeman, D. and Žabokrtský, Z. (2005). Improving Parsing Accuracy by Combining Diverse Dependency Parsers. In Proceedings of the 9th International Workshop on Parsing Technologies, pages 171–178, Vancouver, B.C., Canada.

[Žabokrtský and Lopatková, 2004] Žabokrtský, Z. and Lopatková, M. (2004). Valency Frames of Czech Verbs in VALLEX 1.0. In Meyers, A., editor, HLT-NAACL 2004 Workshop: Frontiers in Corpus Annotation, pages 70–77.

Appendices

Appendix A

Alternative Definitions of Maximum Coverage

Alternative definitions of maximum coverage from Chapter 4 are presented here.

A.1 Definition of maximum coverage in terms of foliage

We define a preorder relation $\le$ over coverages. For any coverages $C = (T_1, T_2, \ldots, T_k)$ and $C' = (T'_1, T'_2, \ldots, T'_{k'})$:

$C \le C'$ iff $\exists\, T'_i \in C'$ and a sub-sequence $(T_j, T_{j+1}, \ldots, T_{j+l})$ of $C$ such that $f(T'_i) = f(T_j)\, f(T_{j+1}) \cdots f(T_{j+l})$,

i.e. if there exists a sub-sequence of trees in $C$ that has the same foliage as some tree in $C'$. Please see Section 4.1 for the definition of the foliage $f$.

The relation $\le$ is reflexive and transitive but not antisymmetric, thus it is only a preorder and not an order relation.
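As a small illustration (the grammar, trees and words are invented for this example only): let $C = (T_1, T_2)$ with foliages $f(T_1) = \textit{the dog}$ and $f(T_2) = \textit{barks}$, and let $C' = (T')$ with $f(T') = \textit{the dog barks}$. The sub-sequence $(T_1, T_2)$ of $C$ satisfies $f(T') = f(T_1)\,f(T_2)$, hence $C \le C'$: the single-tree coverage $C'$ is at least as good as the two-tree coverage $C$.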

A.2 Definition of maximum coverage in terms of equivalence classes

First of all we define an equivalence relation $\approx$ over coverages. For any coverages $C = (T_1, T_2, \ldots, T_k)$ and $C' = (T'_1, T'_2, \ldots, T'_{k'})$:

$C \approx C'$ iff $\exists\, T_i \in C$, $T'_j \in C'$ and a rule $r$ in the grammar $G$ such that $T_i = r \circ T'_j$ or $T'_j = r \circ T_i$,

i.e. if there exists a tree in $C$ that can be extended by a unary rule $r$ such that the resulting tree lies in $C'$, or contrariwise. We also define a relation $\approx^*$ as the reflexive and transitive closure of the relation $\approx$. The relation $\approx^*$ is reflexive, transitive and symmetric, thus it is an equivalence relation. A set of class representatives (i.e. a subset of the set of all coverages which contains exactly one element from each equivalence class with respect to the equivalence $\approx^*$) is denoted by $\mathcal{C}_\approx$.

Next we define a relation $\Rightarrow$ over the set of class representatives. For any coverages $C_\# = (T_1, T_2, \ldots, T_k)$ and $C'_\# = (T'_1, T'_2, \ldots, T'_{k'})$ with $C_\# \in \mathcal{C}_\approx$, $C'_\# \in \mathcal{C}_\approx$:

$C_\# \Rightarrow C'_\#$ iff $\exists\, T'_i \in C'_\#$, a sub-sequence $(T_j, T_{j+1}, \ldots, T_{j+l})$ of $C_\#$ and a rule $r$ in the grammar $G$ such that $T'_i = r \circ T_j \circ T_{j+1} \circ \cdots \circ T_{j+l}$,

i.e. if there exists a sub-sequence of trees in $C_\#$ that can be connected by the rule $r$ such that the resulting tree lies in $C'_\#$. We also define a relation $\Rightarrow^*$ as the reflexive and transitive closure of the relation $\Rightarrow$. The relation $\Rightarrow^*$ is a partial order on $\mathcal{C}_\approx$.

After that we define a relation $\le$ over the set of coverages. For any coverages $C$ and $C'$:

$C \le C'$ iff $\exists\, C_\#, C'_\# \in \mathcal{C}_\approx$ such that $C \approx^* C_\#$, $C' \approx^* C'_\#$ and $C_\# \Rightarrow^* C'_\#$,

i.e. if the appropriate class representatives (with respect to the equivalence $\approx^*$) are in the relation $\Rightarrow^*$.

Finally, we define a coverage $C$ to be a maximum coverage (m-coverage) iff for any coverage $C'$:

if $C \le C'$ then $C' \le C$.

The relation $\le$ is reflexive and transitive but not antisymmetric, thus it is only a preorder and not an order relation.

Appendix B

List of Publications

[Horák et al., 2002] Horák, A., Kadlec, V., and Smrž, P. (2002). Enhancing best analysis selection and parser comparison. In Proceedings of the 5th International Workshop TSD 2002, pages 461–467, Brno, Czech Republic. Springer Verlag, Lecture Notes in Artificial Intelligence, Volume 2448.

[Kadlec and Smrž, 2003] Kadlec, V. and Smrž, P. (2003). PACE – parser comparison and evaluation. In Proceedings of the 8th International Workshop on Parsing Technologies, IWPT 2003, pages 211–212, Le Chesnay Cedex, France. INRIA, Domaine de Voluceau, Rocquencourt.

[Kadlec and Smrž, 2004] Kadlec, V. and Smrž, P. (2004). Syntactic analysis of natural languages based on context free grammar backbone. In Proceedings of the 21st Workshop on Information Technologies, MIS 2004, pages 46–51.

[Kadlec et al., 2004] Kadlec, V., Chappelier, J., and Rajman, M. (2004). Tool for robust stochastic parsing using optimal maximum coverage. Technical Report, Swiss Federal Institute of Technology (EPFL), Lausanne, Switzerland.

[Kadlec and Smrž, 2004] Kadlec, V. and Smrž, P. (2004). Grammatical Heads Optimized for Parsing and Their Comparison with Linguistic Intuition. In Text, Speech and Dialogue: Proceedings of the Seventh International Workshop TSD 2004, pages 95–102, Brno, Czech Republic. Springer Verlag, Lecture Notes in Artificial Intelligence.


[Kadlec et al., 2005] Kadlec, V., Ailomaa, M., Chappelier, J.-C., and Rajman, M. (2005). Robust stochastic parsing using optimal maximum coverage. In Proceedings of The International Conference Recent Advances In Natural Language Processing (RANLP) 2005, pages 258–263, Shoumen, Bulgaria. INCOMA.

[Ailomaa et al., 2005a] Ailomaa, M., Kadlec, V., Chappelier, J.-C., and Rajman, M. (2005a). Efficient processing of extra-grammatical sentences: Comparing and combining two approaches to robust stochastic parsing. In Proceedings of the Applied Stochastic Models and Data Analysis (ASMDA) 2005, pages 81–89, France. ENST Bretagne.

[Ailomaa et al., 2005b] Ailomaa, M., Kadlec, V., Chappelier, J.-C., and Rajman, M. (2005b). Robust stochastic parsing: comparing two approaches for processing extra-grammatical sentences. In Proceedings of the 15th Nordic Conference of Computational Linguistics (NODALIDA) 2005, pages 21–29, Finland. University of Joensuu.

[Horák and Kadlec, 2005] Horák, A. and Kadlec, V. (2005). New Meta-grammar Constructs in Czech Language Parser synt. In Proceedings of Text, Speech and Dialogue 2005, pages 85–92, Karlovy Vary, Czech Republic. Springer-Verlag.

[Smrž and Kadlec, 2005] Smrž, P. and Kadlec, V. (2005). Incremental Parser for Czech. In Proceedings of the 4th International Symposium on Information and Communication Technologies (WISICT05), pages 1–6, Cape Town, South Africa. Cape Town International Convention Center.

[Horák et al., 2006] Horák, A., Svoboda, L., Kadlec, V., and Cenek, P. (2006). Language Resources for Intelligent Processing of Dialogues about Electrical Networks. In Proceedings of ElNet 2005, pages 42–49, VŠB TU Ostrava.

[Kadlec and Smrž, 2006] Kadlec, V. and Smrž, P. (2006). How many dots are really needed for head-driven chart parsing? In Proceedings of SOFSEM 2006, pages 483–492, Czech Republic. Springer-Verlag.

[Hlaváčková et al., 2006] Hlaváčková, D., Horák, A., and Kadlec, V. (2006). Exploitation of the VerbaLex verb valency lexicon in the syntactic analysis of Czech. In Proceedings of Text, Speech and Dialogue 2006, pages 85–92, Brno, Czech Republic. Springer-Verlag.

[Horák and Kadlec, 2006] Horák, A. and Kadlec, V. (2006). Platform for Full-Syntax Grammar Development Using Meta-grammar Constructs. In Proceedings of the 20th Pacific Asia Conference on Language, Information and Computation, pages 311–318, Beijing, China. Tsinghua University Press.

[Kovář et al., 2006] Kovář, V., Kadlec, V., and Horák, A. (2006). Grammar Development for Czech Syntactic Parser with Corpus-based Techniques. In Proceedings of Corpus Linguistics, pages 159–165, Saint-Petersburg, Russia. Saint-Petersburg State University.

[Horák et al., 2007] Horák, A., Holan, T., Kadlec, V., and Kovář, V. (2007). Dependency and Phrasal Parsers of the Czech Language: A Comparison. In Proceedings of the 10th International Conference on Text, Speech and Dialogue, TSD 2007, volume 4629 of Lecture Notes in Computer Science, pages 76–84, Pilsen, Czech Republic. Springer-Verlag.
