CG-3 - Beyond Classical Constraint Grammar
Eckhard Bick, University of Southern Denmark
Tino Didriksen, GrammarSoft ApS
Abstract

This paper discusses methodological strengths and shortcomings of the Constraint Grammar paradigm (CG), showing how the classical CG formalism can be extended to achieve greater expressive power and how it can be enhanced and hybridized with techniques from other parsing paradigms. We present a new, largely theory-independent CG framework and rule compiler (CG-3) that allows the linguist to write CG rules incorporating different types of linguistic information and methodology from a wide range of parsing approaches, covering not [...]

[...] genre, but in programming terms, it is implemented procedurally as a set of consecutively iterated rules that add, remove or select tag-encoded information. In its classical form (Karlsson, 1990; Karlsson et al., 1995), Constraint Grammar relies on a morphological analyzer providing so-called cohorts of possible readings for a given word, and uses constraints that are largely topological[1] in nature, for both part-of-speech disambiguation and the assignment of syntactic function tags. (a-c) provide examples for close context (a) and wide context (b) POS rules, and syntactic mapping (c).

    (a) REMOVE VFIN IF (0 N) (-1 ART OR [...]
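As a rough illustration of the cohort model described above, the following Python sketch applies one close-context rule in the spirit of example (a): a finite-verb reading is dropped when the same word could also be a noun and the word to its left could be an article. This is a toy simulation, not the CG-3 implementation. The `Cohort` class, the function names and the tag inventory are assumptions made for the example. The rule condition is simplified to the part of (a) that is legible, and the guard against removing the last remaining reading follows the usual CG convention.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Cohort:
    """A word form plus its candidate readings (each reading is a set of tags)."""
    wordform: str
    readings: List[Set[str]]


def has_tag(cohort: Cohort, tag: str) -> bool:
    """True if at least one reading of the cohort carries the tag."""
    return any(tag in reading for reading in cohort.readings)


def remove_vfin_after_article(sentence: List[Cohort]) -> List[Cohort]:
    """Toy version of: REMOVE VFIN IF (0 N) (-1 ART).

    Position 0 is the target cohort, position -1 its left neighbour.
    """
    for i in range(1, len(sentence)):
        target, left = sentence[i], sentence[i - 1]
        if has_tag(target, "N") and has_tag(left, "ART"):
            kept = [r for r in target.readings if "VFIN" not in r]
            if kept:  # assumption: a rule never removes the last reading
                target.readings = kept
    return sentence


# Hypothetical input: "the walks were ...", with "walks" ambiguous between N and VFIN.
sentence = [
    Cohort("the", [{"ART"}]),
    Cohort("walks", [{"N", "PL"}, {"V", "VFIN", "PR", "3S"}]),
    Cohort("were", [{"V", "VFIN", "IMPF"}]),
]
remove_vfin_after_article(sentence)
```

After the call, "walks" keeps only its noun reading, while "were" is untouched because it has no noun reading to trigger the (0 N) condition.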
Proceedings of the 20th Nordic Conference of Computational Linguistics (NODALIDA 2015) 31

[...] counting tokens left (-) or right (+) from a zero/target position in the sentence. Though in principle a methodological limitation, this topological approach also has descriptive "side effects": For instance, it supports local syntactic function tags (such as the @SUBJ tag on the head noun of an NP), but it does not easily lend itself to structural-relational annotation. Thus, dependency relations or constituent brackets can neither be created nor referred to by purely topological CG rules[2]. Even chunking constraints, though topologically more manageable than tree structures, have to be expressed in an indirect way (cp. the NONPREN barrier condition in example rule (c)), and syntactic phrases cannot be addressed as wholes, let alone subjected to rewriting rules.

A second design limitation in classical CG concerns the expression of vague, probabilistic truths about language. Thus, the formalism does not allow numerical tags or numerical feature-value pairs, and while many current mainstream NLP tools are based on probabilistic methods and machine learning, classical CG is entirely rule-based, and the only way to integrate likelihoods is through lexical "Rare" tags or by ordering rules in batches with more heuristic rules applying last.

Third, classical CG tags and tokens are discrete units and are handled as string constants. While this design option facilitated efficient processing and even FST methods, it also limited the linguist, who was not allowed to use regular expressions, feature variables or unification. Another aspect of discreteness concerns tokenization: Classical CG regarded token form, number and order as fixed, so the formalism had difficulty in accommodating, for instance, the rule-based creation of a (fused) named-entity token, the insertion or removal of tokens in spell and grammar checking, or the reordering of tokens needed for machine translation.

Finally, when classical CG was designed, it had isolated sentences in mind. Though rule scope can be arbitrarily defined by a "window" delimiter set, and though "global" window rules clearly surpass the scope of HMM n-grams, it was not possible to span several windows at a time or to link referents across sentences, nor was it possible to contextually trigger genre variables or in other ways to make a grammar interact with a given text type. Descriptively, this limitation meant that CG as such could not be used for higher-level annotation such as anaphora or discourse relations, and that grammars were agnostic of genre and task types.

Following Karlsson's original proposal, two standards for CG rule compilers emerged in the late 1990s. The first, CG-1, was used by Karlsson's team at Helsinki University and commercially by the spin-off company LingSoft for English (ENGCG), Swedish and German (GERCG) taggers, as well as for applied products such as Scandinavian grammar checkers (Arppe, 2000; Birn, 2000 for Swedish, Hagen et al., 2001 for Norwegian). The second compiler, CG-2, was programmed and distributed by Pasi Tapanainen (1996), who made several notational improvements[3] to the rule formalisms (in particular, regarding BARRIER conditions, SET definitions and REPLACE operations), but left the basic topological interpretation of constraints unchanged. Five years later, a third company, GrammarSoft ApS, in cooperation with the University of Southern Denmark, launched an open-source CG compiler, vislcg, which was backward compatible with CG-2, but also introduced a few new features[4], in particular the SUBSTITUTE and APPEND operators designed to allow system hybridization where input from a probabilistic tagger could be corrected with CG rules in preparation of a syntactic or semantic CG stage, as implemented e.g. in the earliest version of the French FrAG parser (Bick, 2004). Vislcg, too, was used in spell and grammar checkers (Bick, 2006a), but because of its open-source environment it also marked the transition to a wider spectrum of CG users and research languages.

[2] As a work-around, attachment direction markers (arrows) were introduced in the syntactic function tags, such as @>N or @N> for pre-nominal and @N< or @ [...]

[3] Tapanainen also created a very efficient compiling and run-time interpretation algorithm for cg2, involving finite state transducers, as well as a finite state dependency grammar, FDG (Tapanainen, 1997), for his company Conexor and its Machinese parsers.

[4] The vislcg compiler was programmed over several years by Martin Carlsen for VISL and GrammarSoft. For a technical comparison of CG-2 and vislcg, cf. http://beta.visl.sdu.dk/visl/vislcg-

But though constraint grammars using the CG-2/VISLCG compiler standard did achieve a granularity and accuracy that allowed them to support external modules for both constituent and [...]

[...] uppercase POS and inflection fields or the @-tag marked syntactic function field[6]:

    Both "both" [...]

1.) * (Deep scan) allows a child or parent test to continue searching along a straight line of descendants and ancestors, respectively, until the test condition is matched or until the end of a relation chain is reached.

2.) C (All scan) requires a child or sibling relation to match all children or all siblings, respectively. Note that this is different from the ordinary C (= safe) option which applies to readings. Thus 'cC ADJ' means 'only adjectives as children' (e.g. no articles or PPs), while 'c (*) LINK 0C ADJ' means 'any one daughter with an unambiguous adjective reading'.

3.) S (Self) can be combined with c, p or s to look at the current target as well. For example, 'c @SUBJ LINK cS HUM' looks for a human subject NP, where either the head noun (@SUBJ) itself is human, or where it has a modifier that is tagged as human.

Apart from dependency relations, we also allow general named relations in CG-3, which can be used for arbitrary relation types, such as secondary dependencies between object and object complement, anaphora (Bick, 2010), discourse relations etc. Thus, the following establishes an identity relation between a relative pronoun and its noun antecedent:

    SETRELATION (identity) TARGET ( [...]

3 Constituent structure: Inspiration from the generative paradigm

Because dependency syntax bases its structural description on tokens (words), it is inherently closer to the native CG approach than the competing generative family of syntactic formalisms, which operate with non-terminal nodes and constituent brackets.

3.1 Tree transformation

Classical CG does not support constituent brackets in any form, be it flat chunks or nested constituents, so external modules had to be used to create constituent trees. The oldest example are PSGs with CG functions as terminals (Bick, 2003), used for CALL applications within the VISL project, followed by dependency-to-constituent tree transformation employing an external dependency grammar (Bick, 2005; Bick, 2006b). Of course the same transformation could be used with our new, native CG dependency (cp. previous section), but CG-3 does offer more direct ways to express linguistic structure in generative terms, allowing linguists used to thinking along PSG lines to directly translate generative descriptions and constraints into the CG formalism.

3.2 Chunking

[...]

NP opening markers (a) are inserted before prenominal noun dependents (@>N) or NP heads (N/PROP/PRON), accepting even determiners and numerals if they have no attributive function (@ATTR).
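Looking back at the three dependency scan options of Section 2 (deep scan, all scan and self), their intended semantics can be sketched on a toy dependency tree in Python. This is only an illustration of the prose definitions, not how CG-3 evaluates contexts. The `Token` class and all function names are assumptions for the example. Each token carries a single reading, so the separate "safe" C option on readings is not modelled. The all-scan test is assumed to fail for a token without children.

```python
class Token:
    """A word with one (already disambiguated) reading and dependency links."""

    def __init__(self, form, tags):
        self.form = form
        self.tags = set(tags)
        self.parent = None
        self.children = []

    def attach_to(self, head):
        """Make this token a dependent (child) of `head`."""
        self.parent = head
        head.children.append(self)
        return self


def child_test(tok, tag):
    """Plain child test: some direct child carries the tag."""
    return any(tag in c.tags for c in tok.children)


def deep_child_test(tok, tag):
    """Deep scan downwards: keep searching through descendants until a match."""
    stack = list(tok.children)
    while stack:
        node = stack.pop()
        if tag in node.tags:
            return True
        stack.extend(node.children)
    return False


def deep_parent_test(tok, tag):
    """Deep scan upwards: follow the ancestor chain until a match or the root."""
    node = tok.parent
    while node is not None:
        if tag in node.tags:
            return True
        node = node.parent
    return False


def all_children_test(tok, tag):
    """All scan: every child must carry the tag (assumed false without children)."""
    return bool(tok.children) and all(tag in c.tags for c in tok.children)


def child_or_self_test(tok, tag):
    """Self option: the token itself counts in addition to its children."""
    return tag in tok.tags or child_test(tok, tag)


def has_human_subject(verb):
    """Rough analogue of 'c @SUBJ LINK cS HUM' for a verb token."""
    return any("@SUBJ" in c.tags and child_or_self_test(c, "HUM")
               for c in verb.children)


# Hypothetical tree for "old tired men slept": slept <- men <- old, tired
slept = Token("slept", {"V"})
men = Token("men", {"N", "@SUBJ", "HUM"}).attach_to(slept)
old = Token("old", {"ADJ", "@>N"}).attach_to(men)
tired = Token("tired", {"ADJ", "@>N"}).attach_to(men)
```

In this tree, a plain child test for ADJ fails on "slept" but the deep variant succeeds, and the all-scan test on "men" holds only as long as every dependent is an adjective.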
Likewise, NP closing markers (b) are inserted after the above NP head candidates, in the presence of the left-hand (*-1) NP-chunk opener. The NOT contexts in (a) make sure that the triggering prenominal is in fact the first element in the NP, and not preceded by an adverbial dependent of its own, or part of a coordination. The inserted chunk-opening and closing tokens can then be interpreted as labelled brackets: (np We_PRON /np) had (np very_@>A delicious_@>N icecream_N /np) with (np strawberries_N /np).

[7] Though descriptively undesirable, loops can be explicitly allowed with the ALLOWLOOP and NEAREST options (cf. visl.sdu.dk/cg3.html)

The second method is better suited for layered, deep chunking, because it uses relational tags to individually link chunk edges to each other or to the chunk head. With full layering, this approach can create complete xml-formatted constituent trees from CG dependency-tagged input without the need of an external converter, if chunk brackets are expressed as xml opening/closing markers. However, using relations to delimit topological units such as chunks introduces certain complexities in the face of crossing branches and needs to specify the "handedness" (left/right) and "outermostness" of dependency arcs, features that are normally left underspecified in dependency annotation. In CG-3, we support these features as l/r (left/right) and ll/rr (leftmost/rightmost) additions:

    (a) ADDRELATIONS (npheadl) (npstart) TARGET (*) (c @>N OR @N<) TO (llScc (*)) ;

    (b) ADDRELATIONS (npheadr) (npstop) TARGET (*) (c @>N OR @N<) (r:npheadl (*)) TO (rrScc (*)) ;

Both rules are bidirectional and mark both chunk head and chunk edges. The head target is any word (*) with an adnominal dependent (c @>N OR @N<), and the TO-edge is the leftmost (ll) resp. rightmost (rr) descendant (cc) or self (S). This second method will yield complete, nested structures, including adjective phrases (adjp) and prepositional phrases (pp) in the NPs: (npstart (adjpstart very_@>A delicious_@>N adjpstop) icecream_N (ppstart with_PRP_@N< strawberries_@P< ppstop) npstop)[8].

[8] For clarity, only phrases with 2 or more constituents were bracketed in the 2nd method.

3.3 Phrase templates

Both of the above chunking methods are intended to be used late in the annotation pipe, and exploit existing morphosyntactic markup or even dependencies, so the chunking cannot itself be seen as a methodological part of parsing per se.

However, CG-3 also offers another way of expressing chunks, the template, which can be integrated into CG rules also at early tagging stages. A template is basically a predefined sequence of tokens, POS or functions that can be referred to as a whole in rule contexts, or even in other templates. The basic idea goes back to Karlsson et al. (1995), but was not implemented in either CG-1 or CG-2. For instance, an NP could be defined as

    (a) TEMPLATE np = ([ART, N]) OR ([ART,ADJ,N])
    (b) TEMPLATE np = (? ART LINK 1 N) OR (? ART LINK 1 ADJ LINK 1 N)
    (c) TEMPLATE np = ? ART LINK *1 N BARRIER NONPREN

and then used in ordinary rules with a T: prefix (*1 VFIN LINK *1 T:np).

(a) is closest to the original idea, and reminiscent of generative rewriting rules, while (b) and (c) are shorthand for ordinary CG contexts and harness the full power of the latter. Independently of the format, however, the linguistic motivation behind templates is to allow direct reference to constituent units, to think in terms of phrase structure and to subsume aspects of generative grammar into CG. Thus, constituent templates allow a direct conceptual transfer from generative rules, and a simple generative NP grammar for the NP "a very delicious icecream with strawberries":

    np = adjp? n pp? ;
    adjp = adv? adj ;
    pp = prp np ;

could be expressed in CG-3 as:

    TEMPLATE np = (N) OR (T:adjp LINK 1 N) OR (T:adjp LINK 1 N LINK 1 T:pp) OR (N LINK 1 T:pp) ;
    TEMPLATE adjp = (ADJ) OR (ADV LINK 1 ADJ) ;
    TEMPLATE pp = PRP LINK 1 T:np ;

In the example, "very_ADV delicious_ADJ" matches T:adjp, and "with_PRP strawberries_N" matches T:pp, and the whole expression could then be referred to as a T:np context by CG rules. CG-internally, templates could also simply be interpreted as shorthand (variables) for context parentheses, so-called context templates. As such, they logically need to allow internal, predefined positions, as in the following example for a human verb-template, where the motivation is not a constituent definition, but simply to integrate two context alternatives into one[9], and to label the result with one simple variable.

    TEMPLATE vhum = ((c @SUBJ + HUM) OR (*1 ("that" KS) BARRIER V)) ;

"human verb" defined as either having a subject (@SUBJ) child (c) that is human (HUM), or having a subordinating conjunction (KS) anywhere to the [...]

4 Beyond discrete tags and string constants: Regular expressions, variables and unification

A formal grammar has to strike a balance between [...]

[...]
(c) no C-restriction for BARRIERs

CG-3 adds all of the above[10], but while increasing rule-writing efficiency, these changes do not affect the discreteness of tags and strings. Methodologically more important, therefore, is our introduction of regular expressions and variables. The former can be used instead of sets for open-class items, primarily lemma and word class, e.g. ".*i[zs]e"r V in a transitivity set or ".*ist" N as a heuristic candidate for the [...]

    REMOVE (N) (0 (<(.+)^vp>r INF)) (-1 INFM) (1 ("$1"v PRP)) ; # e.g. to minister to the tribe

With the example given, the second rule can remove the noun reading for 'minister' because the 'to' in the valency marker [...]

[...] instance coordination, as with the following LIST set of semantic roles (agent, patient, theme and location):

    LIST ROLE = §AG §PAT §TH §LOC ;
    SELECT $$ROLE (1 KC) (2C $$ROLE) ;

Sometimes unification has to be vague in order to work. This is the case when underspecified [...]

[...] to encode and use corpus-harvested frequencies. The simplified example rule (a) exploits relative lexical POS frequencies for bigram disambiguation in a way reminiscent of hidden Markov models (HMMs), while (b) is a spell checker fallback rule selecting the word with the highest phonetical similarity value

    REMOVE ( [...]

[...] window boundaries by adding a 'W', e.g. *-1W for scanning left across the window boundary. This feature is especially useful for higher-order relations such as anaphora (Bick, 2010) or discourse relations. Another scope-related innovation are (definable) paired brackets that [...]

[...] BEFORE, MOVE AFTER and SWITCH WITH operators can be used to express syntactic movement rules in machine translation. The example rule will change Danish VS into English SV in the presence of a fronted adverb:

    MOVE WITHCHILD[13] (*) @ [...]

References

Arppe, Antti. 2000. Developing a grammar checker for Swedish. In: Torbjørn Nordgård (ed.). Proceedings of NODALIDA '99. pp. 28–40. Trondheim: Department of Linguistics, University of Trondheim.

Bick, Eckhard. 2003. A CG & PSG Hybrid Approach to Automatic Corpus Annotation. In: Kiril Simow & Petya Osenova (eds.). Proceedings of SProLaC2003 (at Corpus Linguistics 2003, Lancaster), pp. 1–12.

Bick, Eckhard. 2004. Parsing and evaluating the French Europarl corpus. In: Patrick Paroubek, Isabelle Robba & Anne Vilnat (red.): Méthodes et outils pour l'évaluation des analyseurs syntaxiques (Journée ATALA, May 15, 2004). pp. 4–9. Paris: ATALA.

Bick, Eckhard. 2005. Turning Constraint Grammar Data into Running Dependency Treebanks. In: Civit, Montserrat & Kübler, Sandra & Martí, Ma. Antònia (red.). Proceedings of TLT 2005 (4th Workshop on Treebanks and Linguistic Theory, Barcelona), pp. 19–27.

Bick, Eckhard. 2006a. A Constraint Grammar Based Spellchecker for Danish with a Special Focus on Dyslexics. In: Suominen, Mickael et al. (ed.) A Man of Measure: Festschrift in Honour of Fred Karlsson on his 60th Birthday. Special Supplement to SKY Journal of Linguistics, Vol. 19. pp. 387–396. Turku: The Linguistic Association of Finland.

Bick, Eckhard. 2006b. Turning a Dependency Treebank into a PSG-style Constituent Treebank. In: Calzolari, Nicoletta et al. (eds.). Proceedings of LREC 2006. pp. 1961–1964.

Bick, Eckhard. 2010. A Dependency-based Approach to Anaphora Annotation. In: (eds.) Extended Activities Proceedings, 9th International Conference on Computational Processing of the Portuguese Language (Porto Alegre, Brazil). ISSN 2177-3580.

Bick, Eckhard. 2011. A Barebones Constraint Grammar. In: Helena Hong Gao & Minghui Dong (eds), Proceedings of the 25th Pacific Asia Conference on Language, Information and Computation (Singapore). pp. 226–235.

Bick, Eckhard. 2013. Using Constraint Grammar for Chunking. In: S. Oepen, K. Hagen & J. B. Johannessen (eds). Proceedings of the 19th Nordic Conference of Computational Linguistics (NODALIDA 2013). Linköping Electronic Conference Proceedings Vol. 85, pp. 13–26. Linköping: Linköping University Electronic Press.

Bick, Eckhard. 2014. PALAVRAS, a Constraint Grammar-based Parsing System for Portuguese. In: Tony Berber Sardinha & Thelma de Lurdes São Bento Ferreira (eds.), Working with Portuguese Corpora, pp. 279–302. London/New York: Bloomsbury Academic.

Birn, Jussi. 2000. Detecting grammar errors with Lingsoft's Swedish grammar checker. In: Torbjørn Nordgård (ed.). Proceedings of NODALIDA '99. pp. 28–40. Trondheim: Department of Linguistics, University of Trondheim.

Graña, Jorge & Gloria Andrade & Jesús Vilares. 2003. Compilation of constraint-based contextual rules for part-of-speech tagging into finite state transducers. In: Proceedings of the 7th Conference on Implementation and Application of Automata (CIAA 2002). pp. 128–137. Berlin: Springer-Verlag.

Hagen, Kristin & Pia Lane & Trond Trosterud. 2001. En grammatikkontrol for bokmål. In: Kjell Ivar Vannebo & Helge Sandøy (eds). Språknytt 3-2001, pp. 6–9. Oslo: Norsk Språkråd.

Karlsson, Fred. 1990. Constraint Grammar as a Framework for Parsing Running Text. In: Hans Karlgren (ed.). Proceedings of COLING-90, Vol. 3, pp. 168–173.

Karlsson, Fred & Atro Voutilainen & Juha Heikkilä & Arto Anttila. 1995. Constraint Grammar: A language-independent system for parsing unrestricted text. Natural Language Processing 4. Berlin & New York: Mouton de Gruyter.

Lager, Torbjörn. 1999. The µ-TBL System: Logic Programming Tools for Transformation-Based Learning. In: Proceedings of CoNLL'99 (Bergen).

Lindberg, Nikolaj & Martin Eineborg. 1998. Learning Constraint Grammar-style Disambiguation Rules Using Inductive Logic Programming. In: Proceedings of the 36th ACL / 17th COLING (Montreal, Canada). Volume 2, pp. 775–779.

Tapanainen, Pasi. 1996. The Constraint Grammar Parser CG-2. No 27, Publications of the Department of Linguistics, University of Helsinki.

Tapanainen, Pasi. 1997. A Dependency Parser for English. Technical Reports No TR-1. Department of Linguistics, University of Helsinki.

Yli-Jyrä, Anssi Mikael. 2011. An Efficient Constraint Grammar Parser based on Inward Deterministic Automata. In: Proceedings of the NODALIDA 2011 Workshop Constraint Grammar Applications, pp. 50–60. NEALT Proceedings Series, vol. 14.