LREC Style Template
Total Page:16
File Type:pdf, Size:1020Kb
ML-Optimization of Ported Constraint Grammars Eckhard Bick University of Southern Denmark Campusvej 55, DK-5230 Odense M Email: [email protected] Abstract In this paper, we describe how a Constraint Grammar with linguist-written rules can be optimized and ported to another language using a Machine Learning technique. The effects of rule movements, sorting, grammar-sectioning and systematic rule modifications are discussed and quantitatively evaluated. Statistical information is used to provide a baseline and to enhance the core of manual rules. The best-performing parameter combinations achieved part-of-speech F-scores of over 92 for a grammar ported from English to Danish, a considerable advance over both the statistical baseline (85.7), and the raw ported grammar (86.1). When the same technique was applied to an existing native Danish CG, error reduction was 10% (F=96.94). Keywords: Constraint Grammar, Machine Learning, Tagging 1. Introduction of a noun reading (N) - IF an article (ART) or determiner Mature Constraint Grammar (CG) parsers can achieve (DET) is found to the left (*-1) with nothing but attributes very high accuracy (Karlsson et al. 1995), but contain (ALL - ATTR) inbetween (BARRIER), and if there is thousands of manually crafted rules and depend on large gender-number agreement ($$GN variable) between noun amounts of expert labor. Input to a Constraint Grammar is and article/determiner. As a safety measure, a second, traditionally provided by a lexical-morphological negative (NOT) condition is present disallowing a noun analyzer1 that outputs tokens with one or more possible phrase element (NPHR) at position 1 (i.e. immediately to readings (tag lines) attached. The task of a CG rule is then the right of the target). to contextually remove wrong readings, select correct 3 ones, and to add and disambiguate tags that can only be REMOVE VFIN IF (0 N + $$GN) (*-1C ART OR DET inferred in a contextual way, such as syntactic function, BARRIER ALL - ATTR LINK 0 $$GN) (NOT 1 NPHR); dependency and semantic roles. Since Constraint grammars are not data-driven in the A typical CG rule contains an a target (word form, lemma, statistical sense of the word, domain adaptation, for tag or feature set), a tag/reading-manipulating action (e.g. instance for speech (Bick 2012) or historical texts (Bick REMOVE, SELECT, MAP, ADD, SUBSTITUTE, 2005), is traditionally achieved by extending an existing SETRELATION), and a list of one or more context general grammar and/or its lexicon. However, due to its conditions to be fulfilled in order for the action to be innate complexity, the general underlying grammar as a triggered. Context conditions may refer to words or whole has properties that do not easily lend themselves to features at an arbitrary distance in the sentence (or manual modification. Changes and extensions will usually beyond), they may contain variables, regular expressions be made at the level of individual rules, not rule or numerical (e.g. frequency) conditions, and they can be interactions or rule regrouping, the effect of which is very negated, linked or conditioned as safe (referring only to difficult for a human grammarian to predict. unambiguous readings). Finally, unbounded conditions (i.e. without a distance position) may be "BARRIERed" In particular, incremental contextual ambiguity reduction by blocking features. The following (Spanish, French, may activate dormant rules waiting for unambiguous German or Danish) rule, for instance, removes target context conditions to apply. Feed-back from corpus runs readings that contain a finite verb tag (VFIN2) in favour will pinpoint rules that make errors, and even allow to trace the effect on other rules applied later on the same sentence, but such debugging is cumbersome and will not 1 In the CG3 variant of the Constraint Grammar formalism, the introduction of variables and regular expressions makes it provide information on missed-out positive, rather than possible to perform morphological analysis within the negative, rule interaction. Therefore, optimization of grammar itself, but most systems only use this feature as a rule-interaction (i.e. rule management at the reserve method to handle out-of-vocabulary tokens. grammar-level) is a major, and intrinsic, difficulty linked 2 VFIN, NPHR and ATTR are not actual tags, but examples of to the Constraint Grammar approach. tag sets defined by the grammarian. Thus, VFIN comprises tense tags such as PR (present) or PAST, ATTR contains adjectives and participles, and NPHR amounts to all parts of 3 The C means "unambiguous". With a C, the context would speech allowed in a noun phrase chain (N, ADJ, ART, DET match also words that still have other readings on top of ART etc.). or DET. 4483 grammars rather than lists of rule templates4. In this paper In (Bick 2013), we have shown that machine learning we argue that this is possible, too, and that it can lead to techniques can be applied to monolingual grammar better results than both automatic and human grammars tuning, even with reduced grammars. Building on this seen in isolation. In particular we demonstrate that ML research, we intend to show that ML is effective not only CG-tuning is useful to create seeding grammars for new for optimizing grammars, but also for porting them from languages from an existing template grammar in another one language to another, a task that can be regarded as an language. Though such a ported seeding grammar cannot extreme variant of domain adaptation. For the be expected to compete with mature taggers directly, it experiments presented here, we have ported the does provide a robust basis for human grammar creation, part-of-speech/ morphological section of an English and will help to reduce overall development time for grammar (EngGram) into Danish, using a CG-annotated high-quality rule-based taggers for under-ressourced section of the Danish treebank (Arboretum, Bick 2003) languages5. for training and evaluation. 3. Grammar optimization and adaptation 2. Related work techniques To date, little work on CG rule tuning has been published. For our experiments, we divided the Danish gold corpus A notable exception is the µ-TBL system proposed in (70.800 tokens) randomly into 10 sections, using 90% for (Lager 1999), a transformation-based learner working training and 10% for evaluation. Because of CPU runtime with 4 different rule operators, and supporting not only constraints, most parameter variations were tested with traditional Brill-taggers but also Constraint Grammars. only one split, but for the best parameter combinations, The system could be seeded with simple CG rule 10-fold cross-validation was carried out with all 10 templates with conditions on numbered context positions, possible splits. but for complexity reasons it did not support more advanced CG rules with unbounded, sentence-wide For each individual training run, the CG-compiler was run contexts, barrier conditions or linked contexts, all of in trace mode, allowing the optimizer program to count which are common in hand-written Constraint Grammars. how often each individual rule was used, and what its Therefore, while capable of building automatic grammars error percentage was. Based on these counts, the from rule templates and modeling them on a gold corpus, optimizer modified rule order or rule strictness (i.e. the the system was not applicable to existing, degree of context ambiguity allowed for the rule in linguist-designed CGs. question). That automatic rule tuning can capture systematic differences between data sets, was shown by 3.1. Rule movement Rögnvaldsson (2002), who compared English and The following rule movements were implemented: Icelandic µ-TBL grammars seeded with the same templates, finding that the system prioritized right context • (a) promote good rules: moving a rule up one section, and longer-distance context templates more for English i.e. allowing it to be applied earlier, before other, than Icelandic. For hand-written grammars, rather than more heuristic rules template expression, a similar tuning effect can be • (b) demote dubious rules: moving a rule down one expected by prioritizing/deprioritizing certain rule or section, i.e. having it apply later, after other, safer context types by moving them to higher or lower rule rules sections, respectively, or by inactivating certain rules • (c) remove bad rules from the grammar, by moving entirely. them to a special "kill" section. Lindberg & Eineborg (1998) conducted a performance Bad rules were defined as rules that made more wrong evaluation with a CG-learning Progol system on Swedish than correct choice, while rules were regarded as good or data from the Stockholm-Umeå corpus. With 7000 dubious, if their error percentage was below or above induced REMOVE rules, their system achieved a recall of certain empirically established error percentages (12.5 and 98%. An F-Score was not given, but since residual ambiguity was 1.13 readings per word (i.e. a precision of 4 One author, Padró (1996), using CG-reminiscent constraints 98/113=86.7%), it can be estimated at 92%. Also, the made up of close PoS contexts, envisioned a combination of lexicon was built from the corpus, so performance can be automatically learned and linguistically learned rules for his expected to be lower on lexically independent data. relaxation labelling algorithm, but did not report any actual work on human-built grammars. Though all three of the above reports show that machine 5 Danish is a small language, but not generally regarded as under-ressourced. It was chosen as target language only as a learning can be applied to CG-style grammars, none of model, and because a gold corpus did not have to be them addresses the tuning of human-written, complete constructed from scratch. In a production setting, both the creation of a gold corpus and manual completion of the ported grammar would be part of the work flow.