ML-Tuned Constraint Grammars

PACLIC-27 ML-Tuned Constraint Grammars Eckhard Bick Institute of Language and Communication University of Southern Denmark, Odense [email protected] Abstract (b) SELECT VFIN (-1 PERS/NOM) (NOT *1 VFIN) In this paper we present a new method for Rule (a), for instance, removes a target finite machine learning-based optimization of verb reading (VFIN) if there is an unambiguous linguist-written Constraint Grammars. The (C) article or determiner 1 position to the left (-), effect of rule ordering/sorting, grammar- while rule (b) selects a finite verb reading, if sectioning and systematic rule changes is there is a personal pronoun in the nominative discussed and quantitatively evaluated. The immediately to the left, and no (NOT) other F-score improvement was 0.41 percentage finite verb is found anywhere to the right (*1). points for a mature (Danish) tagging Mature Constraint Grammars can achieve grammar, and 1.36 percentage points for a half-size grammar, translating into a 7-15% very high accuracy, but contain thousands of error reduction relative to the performance rules and are expensive to build from scratch, of the untuned grammars. traditionally requiring extensive lexica and years of expert labor. Since grammars are not data- driven in the statistical sense of the word, 1 Introduction domain adaptation, for instance for speech (Bick 2011) or historical texts (Bick 2005), is Constraint Grammar (CG) is a rule-based traditionally achieved by extending an existing paradigm for Natural Language Parsing (NLP), general grammar for the language in question, first introduced by Karlsson et al. (1995). Part- and by using specialized lexica or two-level text of-speech tagging and syntactic parses are normalization. However, due to its innate achieved by adding, removing, selecting or complexity, the general underlying grammar as substituting form and function tags on tokens in a whole has properties that do not easily lend running text. Rules express linguistic contextual themselves to manual modification. Changes constraints and are written by hand and applied and extensions will usually be made at the level sequentially and iteratively, ordered in batches of individual rules, not rule interactions or rule of increasing heuristicity and incrementally regrouping. Thus, with thousands of reducing ambiguity from morphologically interacting rules, it is difficult for a human analyzed input by removing (or changing) grammarian to exactly predict the effect of rule readings from so-called readings cohorts placement, i.e. if a rule is run earlier or later in (consisting of all possible readings for a given the sequence. In particular, rules with so-called token), - optimally until only one (correct) C-conditions (asking for unambiguous context), reading remains for each token. The method may profit from another, earlier rule acting on draws robustness from the fact that it is the context tokens involved in the C-condition. reductionist rather than generative - even Feed-back from corpus runs will pinpoint rules unforeseen or erroneous input can be parsed by that make errors, and even allow to trace the letting the last reading survive even if there are effect on other rules applied later on the same rules that would have removed it in a different sentence, but such debugging is cumbersome context. Typical CG rules consist of an operator and will not provide information on missed-out (e.g. REMOVE, SELECT), a target and one or positive, rather than negative, rule interaction. more contextual constraints that may be linked The question is therefore, whether a hand- to each other: corrected gold corpus and machine-learning techniques could be used to improve (a) REMOVE VFIN (-1C ART OR DET) ; 440 Copyright 2013 by Eckhard Bick 27th Pacific Asia Conference on Language, Information, and Computation pages 440－449 PACLIC-27 performance by data-driven rule ordering or rule inactivating certain rules entirely. adaptation, applied to existing, manual Lindberg & Eineborg (1998) conducted a grammars. The method would not only allow to performance evaluation with a CG-learning optimize general-purpose grammars, but also to Progol system on Swedish data from the adapt a grammar in the face of domain variation Stockholm-Umeå corpus. With 7000 induced without actually changing or adding any rules REMOVE rules, their system achieved a recall manually. Of course the technique will only of 98%. An F-Score was not given, but since work if a compatible gold-annotation corpus residual ambiguity was 1.13 readings per word exists for the target domain, but even creating (i.e. a precision of 98/113=86.7%), it can be manually-revised training data from scratch for estimated at 92%. Also, the lexicon was built the task at hand, may be warranted if it then from the corpus, so performance can be allows using an existing unmaintained or "black expected to be lower on lexically independent box" grammar. Other areas where ML rule data. tuning of existing grammars may be of use, is Though all three of the above reports show cross-language porting of grammars between that machine learning can be applied to CG- closely related languages, and so-called bare- style grammars, none of them addresses the bones Constraint Grammars (Bick 2012), where tuning of human-written, complete grammars grammars have to cope with heuristically rather than lists of rule templates1. In this paper, analyzed input and correspondingly skewed we will argue that the latter is possible, too, and ambiguity patterns. In such grammars, linguistic that it can lead to better results than both intuition may not adequately reflect input- automatic and human grammars seen in specific disambiguation needs, and profit from isolation. data-driven tuning. 3 Grammar Tuning Experiments 2 Prior research As target grammar for our experiments we To date, little work on CG rule tuning has been chose the morphological disambiguation module published. A notable exception is the µ-TBL of the Danish DanGram2 system and the CG3 system proposed in (Lager 1999), a Constraint Grammar compiler 3 . For most transformation-based learner working with 4 languages, manually revised CG corpora are different rule operators, and supporting not only small and used only for internal development traditional Brill-taggers but also Constraint purposes, but because Constraint Grammar was Grammars. The system could be seeded with used in the construction of the 400.000 word simple CG rule templates with conditions on Danish Arboretum treebank (Bick 2003), part of numbered context positions, but for complexity the data (70.800 tokens) was still accessible in reasons it did not support more advanced CG CG-input format and could be aligned to the rules with unbounded, sentence-wide contexts, finished treebank, making it possible to barrier conditions or linked contexts, all of automatically mark the correct reading lines in which are common in hand-written Constraint the input cohorts. Of course the current Grammars. Therefore, while capable of building DanGram system has evolved and is quite automatic grammars from rule templates and different from the one used 10 years ago to help modeling them on a gold corpus, the system was with treebank construction, a circumstance not applicable to existing, linguist-designed CGs. That automatic rule tuning can capture 1 One author, Padró (1996), using CG-reminiscent systematic differences between data sets, was constraints made up of close PoS contexts, shown by Rögnvaldsson (2002), who compared envisioned a combination of automatically learned English and Icelandic µ-TBL grammars seeded and linguistically learned rules for his relaxation labelling algorithm, but did not report any actual with the same templates, finding that the system work on human-built grammars. prioritized right context and longer-distance 2 An description of the system, and an online context templates more for English than interface can be found at: Icelandic. For hand-written grammars, rather http://beta.visl.sdu.dk/visl/da/parsing/automatic/parse than template expression, a similar tuning effect .php can be expected by prioritizing/deprioritizing 3 The CG3 compiler is developed by GrammarSoft certain rule or context types by moving them to ApS and supported by the University of Southern higher or lower rule sections, respectively, or by Denmark. It is open source and can be downloaded at http://beta.visl.sdu.dk/cg3.html 441 PACLIC-27 affecting both tokenization (name fusing and 3.2 Exploiting section structure other multiple-word expressions), primary tags Constraint Grammar allows for section- and secondary tags. Primary tags are tags grouping of rules, where the rules in each intended to be disambiguated and evaluated, and section will be iterated, gradually removing differences in e.g. which kind of nouns are ambiguity from the input, until none of the rules regarded as proper nouns, may therefore affect in the section can find any further fully satisfied evaluation. But even secondary tags may have context matches. After that, the next batch of an adverse effect on performance. Secondary rules is run, and the first set repeated, and so on. tags are lexicon-provided tags, e.g. valency For 6 sections, this means running them as 1, 1- and semantic tags not themselves intended for 2, 1-3, 1-4, 1-5, 1-6. CG grammarians use disambiguation, but used by the grammar to sectionizing to prioritize safe rules, and defer contextually assign primary tags. Most heuristic rules, so one obvious machine learning importantly, the gold corpus derived from the technique is to move rules to neighbouring treebank does not contain semantic tags, while sections according to how well they perform, current DanGram rules rely on them for our basic set-up being a so-called PDK-run disambiguation. However, this is not relevant to (Promoting, Demoting, Killing): the experiments we will be discussing in this paper - any accuracy figures are not intended to if a rule does not make errors or if its grade the performance of DanGram as such, but error percentage is lower than a pre-set only to demonstrate possible performance threshold, promote the rule 1 section improvements triggered by our grammar tuning.

ML-Tuned Constraint Grammars

Using Constraint Grammar for Treebank Retokenization

Using Danish As a CG Interlingua: a Wide-Coverage Norwegian-English Machine Translation System

Floresta Sinti(C)Tica : a Treebank for Portuguese

Instructions for Preparing LREC 2006 Proceedings

17Th Nordic Conference of Computational Linguistics (NODALIDA

Frag, a Hybrid Constraint Grammar Parser for French

Prescriptive Infinitives in the Modern North Germanic Languages: An

The VISL System

A Morphological Lexicon of Esperanto with Morpheme Frequencies

Degrees of Orality in Speechlike Corpora

Arborest – a Growing Treebank of Estonian

The English Wikipedia in Esperanto