CG-3 - Beyond Classical Constraint Grammar

Eckhard Bick
University of Southern Denmark
[email protected]

Tino Didriksen
GrammarSoft ApS
[email protected]

Proceedings of the 20th Nordic Conference of Computational Linguistics (NODALIDA 2015)

Abstract

This paper discusses methodological strengths and shortcomings of the Constraint Grammar paradigm (CG), showing how the classical CG formalism can be extended to achieve greater expressive power and how it can be enhanced and hybridized with techniques from other parsing paradigms. We present a new, largely theory-independent CG framework and rule compiler (CG-3) that allows the linguist to write CG rules incorporating different types of linguistic information and methodology from a wide range of parsing approaches, covering not only CG's native topological technique, but also dependency grammar, phrase structure grammar and unification grammar. In addition, we allow the integration of statistical-numerical constraints and non-discrete tag and string sets.

1 Introduction

Within Computational Linguistics, Constraint Grammar (CG) is more a methodological than a descriptive paradigm, designed for the robust parsing of running text (Karlsson et al., 1995). The formalism provides a framework for expressing contextual linguistic constraints, allowing the grammarian to assign or disambiguate token-based, morphosyntactic readings. However, CG's primary concern is not the tag inventory itself, or the underlying linguistic theory of the categories and structures used, but rather the efficiency and accuracy of the method used to achieve a given linguistic annotation. Conceptually, a Constraint Grammar can be seen as a declarative whole of contextual possibilities and impossibilities for a language or genre, but in programming terms, it is implemented procedurally as a set of consecutively iterated rules that add, remove or select tag-encoded information. In its classical form (Karlsson, 1990; Karlsson et al., 1995), Constraint Grammar relies on a morphological analyzer providing so-called cohorts of possible readings for a given word, and uses constraints that are largely topological[1] in nature, for both part-of-speech disambiguation and the assignment of syntactic function tags. (a-c) provide examples for close context (a) and wide context (b) POS rules, and syntactic mapping (c).

[1] With "topological" we mean that grammar rules refer to relative, left/right-pointing token positions (or word fields), e.g. -2 = 2 tokens to the left, *1 = anywhere to the right.

(a) REMOVE VFIN IF (0 N) (-1 ART OR <poss> OR GEN) ;
    Remove a finite verb reading if self (0) can also be a noun (N), and if there is an article (ART), possessive (<poss>) or genitive (GEN) 1 position left (-1).

(b) SELECT VFIN IF (NOT *1 VFIN) (*-1C CLB-WORD BARRIER VFIN) ;
    Select a finite verb reading if there is no other finite verb candidate (VFIN) to the right (*1), and if there is an unambiguous (C) clause boundary word (CLB-WORD) somewhere to the left (*-1), with no (BARRIER) finite verb in between.

(c) MAP (@SUBJ) TARGET N (*-1 >>> BARRIER NON-PRE-N) (1C VFIN) ;
    Map a subject reading (@SUBJ) on noun (N) targets if there is a sentence boundary (>>>) to the left without non-prenominals (NON-PRE-N) in between, and an unambiguous (C) finite verb (VFIN) immediately to the right (1C).

As can be seen from the examples, the original formalism refers only to the linear order of tokens, with absolute (>>>) or relative fields counting tokens left (-) or right (+) from a zero/target position in the sentence. Though in principle a methodological limitation, this topological approach also has descriptive "side effects": For instance, it supports local syntactic function tags (such as the @SUBJ tag on the head noun of an NP), but it does not easily lend itself to structural-relational annotation. Thus, dependency relations or constituent brackets can neither be created nor referred to by purely topological CG rules[2]. Even chunking constraints, though topologically more manageable than tree structures, have to be expressed in an indirect way (cp. the NON-PRE-N barrier condition in example rule (c)), and syntactic phrases cannot be addressed as wholes, let alone subjected to rewriting rules.

[2] As a work-around, attachment direction markers (arrows) were introduced in the syntactic function tags, such as @>N or @N> for pre-nominal and @N< or @<N for post-nominal NP-material.

A second design limitation in classical CG concerns the expression of vague, probabilistic truths about language. Thus, the formalism does not allow numerical tags or numerical feature-value pairs, and while many current mainstream NLP tools are based on probabilistic methods and machine learning, classical CG is entirely rule-based, and the only way to integrate likelihoods is through lexical "Rare" tags or by ordering rules in batches with more heuristic rules applying last.

Third, classical CG tags and tokens are discrete units and are handled as string constants. While this design option facilitated efficient processing and even FST methods, it also limited the linguist, who was not allowed to use regular expressions, feature variables or unification. Another aspect of discreteness concerns tokenization: Classical CG regarded token form, number and order as fixed, so the formalism had difficulty in accommodating, for instance, the rule-based creation of a (fused) named-entity token, the insertion or removal of tokens in spell and grammar checking, or the reordering of tokens needed for machine translation.

Finally, when classical CG was designed, it had isolated sentences in mind. Though rule scope can be arbitrarily defined by a "window" delimiter set, and though "global" window rules clearly surpass the scope of HMM n-grams, it was not possible to span several windows at a time or to link referents across sentences, nor was it possible to contextually trigger genre variables or in other ways to make a grammar interact with a given text type. Descriptively, this limitation meant that CG as such could not be used for higher-level annotation such as anaphora or discourse relations, and that grammars were agnostic of genre and task types.

Following Karlsson's original proposal, two standards for CG rule compilers emerged in the late 1990s. The first, CG-1, was used by Karlsson's team at Helsinki University and commercially by the spin-off company LingSoft for English (ENGCG), Swedish and German (GERCG) taggers, as well as for applied products such as Scandinavian grammar checkers (Arppe, 2000; Birn, 2000 for Swedish, Hagen et al., 2001 for Norwegian). The second compiler, CG-2, was programmed and distributed by Pasi Tapanainen (1996), who made several notational improvements[3] to the rule formalisms (in particular, regarding BARRIER conditions, SET definitions and REPLACE operations), but left the basic topological interpretation of constraints unchanged. Five years later, a third company, GrammarSoft ApS, in cooperation with the University of Southern Denmark, launched an open source CG compiler, vislcg, which was backward compatible with CG-2, but also introduced a few new features[4], in particular the SUBSTITUTE and APPEND operators designed to allow system hybridization where input from a probabilistic tagger could be corrected with CG rules in preparation of a syntactic or semantic CG stage, as implemented e.g. in the earliest version of the French FrAG parser (Bick, 2004). Vislcg, too, was used in spell and grammar checkers (Bick, 2006a), but because of its open-source environment it also marked the transition to a wider spectrum of CG users and research languages.

[3] Tapanainen also created a very efficient compiling and run-time interpretation algorithm for cg2, involving finite state transducers, as well as a finite state dependency grammar, FDG (Tapanainen, 1997), for his company Conexor and its Machinese parsers.

[4] The vislcg compiler was programmed over several years by Martin Carlsen for VISL and GrammarSoft. For a technical comparison of CG-2 and vislcg, cf. http://beta.visl.sdu.dk/visl/vislcg-doc.html .

But though constraint grammars using the CG-2/VISLCG compiler standard did achieve a granularity and accuracy that allowed them to support external modules for both constituent and dependency tree generation, they remained topological in nature and did not permit explicit reference to linguistic relations and structure in the formalism itself. The same is true for virtually all related work outside the CG community itself, where the basic idea of CG constraints has sometimes been exploited to enhance or hybridize HMM-style probabilistic methods (e.g. Graña et al., 2003) or combined with machine learning (Lindberg & Eineborg, 1998; Lager, 1999), but always in the form of (mostly close-context) topological rather than structural-relational rules and always with discrete tag and string constants. It is only with the CG-3 compiler presented here that these and most of the other above-mentioned design issues have been addressed in a principled way and inside the CG formalism itself. CG-3[5] (or VISL CG-3 because of its backward compatibility with VISLCG) was developed over a period of 6 years, where new features were [...]

[...] upper-case POS and inflection fields or the @-tag marked syntactic function field[6]:

Both "both" <quant> DET P @>N #1->2
companies "company" <HH> N P @SUBJ> #2->3
said "say" <speak> <mv> V IMPF @FS-STA #3->0
they "they" <clb> PERS 3P NOM @SUBJ> #4->5
would "will" <aux> V IMPF @FS-<ACC #5->3
launch "launch" <mv> V INF @ICL-AUX< #6->5
an "a" <indef> ART S @>N #7->9
electric "electric" <jpert> ADJ POS @>N #8->9
car "car" <Vground> N S NOM @<ACC #9->6
. "." PU @PU #10->0

Instead of the "topological" left/right-pointing position markers, CG rules with dependency contexts can refer to three types of relations: p (parent/head), c (child/dependent) and s (sibling).

ADD (§AG) TARGET @SUBJ (p V-HUM LINK c @ACC LINK 0 N-NON-HUM) ;

(Add an AGENT tag to a subject reading if its parent verb is a human verb that in turn has a child accusative object that is a non-human noun.)
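To make the three dependency relations concrete, the following Python sketch reads the example analysis above into a small token table and evaluates the context of the ADD rule by following p and c links. It is an illustration only, not how the CG-3 compiler works internally. The set definitions are assumptions, since the text does not define them: @SUBJ and @ACC are taken to cover the arrow-marked variants, V-HUM is approximated as a verb tagged <speak>, and N-NON-HUM as a noun without the <HH> tag. The three-token analysis TOY is hypothetical and exists only to show a case where the rule fires.

```python
import re
from dataclasses import dataclass

# Analysis from the example above: form, lemma, tags, "#self->parent" (0 = root).
ANALYSIS = '''\
Both "both" <quant> DET P @>N #1->2
companies "company" <HH> N P @SUBJ> #2->3
said "say" <speak> <mv> V IMPF @FS-STA #3->0
they "they" <clb> PERS 3P NOM @SUBJ> #4->5
would "will" <aux> V IMPF @FS-<ACC #5->3
launch "launch" <mv> V INF @ICL-AUX< #6->5
an "a" <indef> ART S @>N #7->9
electric "electric" <jpert> ADJ POS @>N #8->9
car "car" <Vground> N S NOM @<ACC #9->6
. "." PU @PU #10->0
'''

# Hypothetical analysis (not from the paper) in which the rule context holds.
TOY = '''\
They "they" PERS 3P NOM @SUBJ> #1->2
told "tell" <speak> <mv> V IMPF @FS-STA #2->0
stories "story" N P NOM @<ACC #3->2
'''

@dataclass
class Token:
    form: str
    lemma: str
    tags: set
    idx: int
    parent: int

def parse(text):
    """Build {index: Token} from one-reading-per-line analyses."""
    tokens = {}
    for line in text.splitlines():
        fields = line.split()
        idx, head = map(int, re.fullmatch(r"#(\d+)->(\d+)", fields[-1]).groups())
        tokens[idx] = Token(fields[0], fields[1].strip('"'), set(fields[2:-1]), idx, head)
    return tokens

def parent(tokens, tok):            # relation p
    return tokens.get(tok.parent)   # None if attached to the root (0)

def children(tokens, tok):          # relation c
    return [t for t in tokens.values() if t.parent == tok.idx]

def siblings(tokens, tok):          # relation s
    if tok.parent == 0:             # assumption: root-attached tokens have no siblings
        return []
    return [t for t in tokens.values() if t.parent == tok.parent and t.idx != tok.idx]

# Assumed set definitions (the paper does not spell these out).
SUBJ = {"@SUBJ>", "@<SUBJ"}
ACC = {"@<ACC", "@ACC>"}

def is_v_hum(tok):
    return "V" in tok.tags and "<speak>" in tok.tags

def is_n_non_hum(tok):
    return "N" in tok.tags and "<HH>" not in tok.tags

def add_agent(tokens):
    """ADD (§AG) TARGET @SUBJ (p V-HUM LINK c @ACC LINK 0 N-NON-HUM)"""
    changed = []
    for tok in tokens.values():
        if not tok.tags & SUBJ:
            continue
        head = parent(tokens, tok)                      # p V-HUM
        if head is None or not is_v_hum(head):
            continue
        # LINK c @ACC LINK 0 N-NON-HUM: some child of the head is an
        # accusative object and is itself a non-human noun.
        if any(c.tags & ACC and is_n_non_hum(c) for c in children(tokens, head)):
            tok.tags.add("§AG")
            changed.append(tok.form)
    return changed

if __name__ == "__main__":
    sent = parse(ANALYSIS)
    print("p(companies):", parent(sent, sent[2]).form)
    print("c(car):", [t.form for t in children(sent, sent[9])])
    print("s(an):", [t.form for t in siblings(sent, sent[7])])
    print("tagged in paper example:", add_agent(sent))
    print("tagged in toy example:", add_agent(parse(TOY)))
```

Under these assumed sets the rule tags nothing in the paper's sentence, because the object of "said" is a finite clause (@FS-<ACC) rather than a noun; in the hypothetical analysis the subject of "told" receives §AG. The point of the sketch is that each context step is a lookup along a dependency arc, independent of how many tokens lie between the words involved.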
