A Dependency Constraint Grammar for

Eckhard Bick Institute of Language and Communication University of Southern [email protected]

Abstract and prepositional phrases (@P<), a descriptive principle later generalized to This paper presents a rule-based formalism clause level functions and subclauses (Bick for dependency annotation within the 2000). In this convention, some obvious Constraint Grammar framework, underspecifications arise, such as the implemented as an extension of the open distinction between short and long source CG3 compiler. As a proof of concept we have constructed a complete attachment in np's, and the scope of for Esperanto, coordinators. Nevertheless, two different building on morphosyntactically annotated methods were developed to create full input from the EspGram parser. The system syntactic trees from shallow CG function is described and evaluated on a test corpus. tags. The first (Bick 2003) uses higher level With a 4% error rate, and most errors phrase structure grammars with function tags caused by simple error propagation from as terminals, and resolves the morphosyntactic input module, our underspecifications in a generative way. The system has proven robust enough to be second, and more robust (Bick 2005), uses integrated into real life applications, such ordinary CG rules to add secondary as the Lingvohelpilo spell- and grammar- attachment markers (e.g. , , , ) to resolve 1 Introduction underspecification, and creates dependency trees through successive attachment rules. Traditionally, Constraint Grammar (Karlsson However, the method used an external et al. 1995) as a descriptive system, has formalism, with a specially designed regarded as an extension of dependency rule compiler that also handled , with a shallow syntax based on issues like uniqueness, circularity and function tags built on case markers, word coordination chains. order and contextual constraints. This approach to syntax efficiently exploits This paper describes an effort to move this lexico-morphological clues, and the tag- last, tree-building step into the realm of based annotation allows the grammarian to Constraint Grammar proper, thus allowing treat syntax as a disambiguation technique the user to exploit CG's powerful contextual similar to the one used for morphological methodology in the process, to better disambiguation. However, function is only integrate dependency and functional syntax an indirect marker for the relation between and to achieve some control over words, and it is difficult to express the dependency interaction not fully structural relations of deeper syntax in this implementable in an the external formalism. fashion. As a first approximation, The new CG extension was then used to dependency direction markers were used for create a dependency CG grammar for the dependents in noun phrases (e.g. @N> or Esperanto, and it is this grammar that will be @>N), adjective phrases (@A> or @>A) described and evaluated here. The module deep linguistic processing, and can thus be structures, and has been successfully seen as facilitation stepping stone both for converted into various exchange formats, further, syntax-dependent annotation (e.g. such as TIGER xml and the VISL cross- anaphora, semantic roles) and for various language format (constituent trees), as well applicative purposes such as machine as MALT xml and CoNNL field format . Currently, the grammar is used (dependency). in the newly-developed Esperanto grammar checker, Lingvohelpilo The rule below is an example of a (http://lingvohelpilo.ikso.net/), where it dependency-creating rule for prenominal provides important contextual information dependents (@>N), attaching to np-heads for the checking of accusative/nominative (@NP-HEAD) or nouns in the nominative case endings and transitivity affixes, as well (N NOM), to the right (*1). as for the identification of long-distance agreement errors, e.g. between subject and (a) SETPARENT @>N TO (*1 @NP-HEAD subject complement. OR (N NOM) BARRIER PRP) ;

2 The formalism Once established, dependency arcs can be used by later rules – even by other In order to accommodate for dependency, 2 dependency-mapping rules – using three new operators, SETPARENT and types of dependency relators: p (parent), c SETCHILD, were introduced to (child) and s (sibling). The p-, c- and s- GrammarSoft's open-source CG3 compiler relators replace what would otherwise be (Didriksen 2007), establishing dependency position markers in a traditional CG context. arcs from daughter to mother, or mother to Thus, rule (a) exploits semantic prototype daughter, respectively, addressing one in the roles to select +HUM subjects in the SETPARENT/SETCHILD field and the presence of cognitive verbs, while (b) other in a TO field. Both fields of the rule implements the syntactic uniqueness can be independently conditioned with CG principle for direct objects (@ACC). contexts in the usual way. The first field works like the TARGET of a MAPping rule, (a) SELECT (%hum) (0 @SUBJ) (p ) while the TO-end of the dependency is (b) SELECT (@ACC) (NOT s @ACC) specified by a context condition itself – as (c) ... (*-1 N LINK c DEF) -> definite np seen from the TARGET position. In the case recognized through dependent (d) ADD (§AG) TARGET @SUBJ (p V-HUM of a LINKed condition, the attachment point LINK c @ACC LINK 0 N-NON-HUM) ; can be marked (with a special A operator) as any of the individual contexts checked and Rule (c) is an example of a rule context used “passed”. As a default, the dependency arc to recognize a definite np through its will attach to the last condition of the LINK determiner, and (d) assigns the semantic role chain if it can be instantiated. As in the older, tag of agent (§AG) to subjects of “human” external dependency compiler, dependency verbs with a non-human direct objects. arcs are expressed as number tokens of the type #n->m, where n is the token ID of the 3 The Esperanto grammar daughter and m the token ID of the mother. Internally, the CG3 compiler uses unique, The preposition barrier (PRP) in the np rule running IDs (necessary for cross-sentence in the last section is a sensible safety relations such as anaphora or discourse measure for English and French, but fails to relations), but in standard dependency account for pre-nominal pp's as they do output, sentence windows boundaries are occur in e.g. Esperanto and German. The respected, using relative IDs. The notation is next rule therefore allows prenominals to information equivalent to constituent tree search right (**1) across the first np-head to a later one that is not part of a prenominal pp However, the clause boundary barrier (as implied by @P<). Note that the SET discussed before poses a problem where a target has its own condition excluding targets chain of conjuncts contains not only a that already have a parent (using the (*) coordinator, but also commas. Therefore, a convention for “any tag”). Since rule somewhat more complicated rule becomes application order supersedes token order, this necessary to attach comma-isolated will have the effect of not undoing the pp- conjuncts: free prenominal attachments already mapped by the first rule. SETPARENT $$@FUNC (NOT p (V)) TO (*-1 IT BARRIER NON-PRE-N/ADV SETPARENT @>N (NOT p (*)) LINK *-1 $$@FUNC BARRIER @FUNC TO (**1 @NP-HEAD OR (N NOM)) LINK p (V)) ; (NOT 0 @P<) ; This rule exploits the new uniqueness feature At the clause level, it is a fair assumption in CG3 to attach any as yet unattached that all left-pointing functions attach to the function if the same function ($$@FUNC) closest main verb (&MV), unless an can be found to the left of an immediately intervening subclause ending is marked by adjacent (BARRIER NON-PRE-N/ADV) punctuation (CLB): iterator (IT = coordinator or comma), with no other functions in between (BARRIER SETPARENT @), the discussed here, attach the coordinator token blocking condition is a subclause itself, and assign secondary conjunct tags to “complementizer” (relative/interrogative all conjuncts, in order to distinguish between pronoun or a subordinating conjunction), first and later conjuncts should the need for a which – unlike English - is an obligatory Mel'cuk-style transformation arise. feature in Esperanto. In a subsequent rule, long-distant attachment across relative 4 Evaluation clauses can be performed for still unattached subjects (NOT p (V)), by linking to the next Compared to the complexity of main verb that does not already have a morphological and syntactic CGs, our subject (NOT c @SUBJ>): dependency CG module is strikingly rule efficient, achieving robust annotation with SETPARENT @SUBJ> (NOT p (V)) just 66 rules, compared to the thousands of TO (**1 &MV) rules in lower-level CGs, and the couple of (*-1 NON-V LINK NOT 1 PCP) hundred rules in a CG-based PSG. Of course, (NOT c @SUBJ>) it has to be born in mind, that our rules rely heavily on syntactic functions and Note the additional context condition in the attachment direction markers introduced by TO field that identifies the first verb in a preceding CG modules. Also, at the time of possible verb chain and conditions it as not writing, we have not yet incorporated the being a participle – since participle clauses distinction between close and long don't have left subjects. postnominal attachment, ellipsis and quoted sentences which will unavoidably add to the In our grammar, coordination is handled as number of rules. “parallel” attachment, not chained Mel'cuk- style, and in the absence of uniqueness- Speedwise, CG-dependency is also quite demanding contexts, ordinary attachment efficient. A 75.000 word corpus consisting of rules will therefore handle coordination, too. 50% news magazine text and 50% classical texts, was analyzed with the EspGram tagger indicates. (Bick 2007) at the syntactic-functional level, and the annotated corpus was then tagged References with our dependency CG on a 2.4 GHz laptop. In this experiment, the analysis chain Bick, Eckhard (2000), “The System up to the syntactic function level ran at 72 PALAVRAS - Automatic Grammatical Analysis of words/s, while the dependency level alone Portuguese in a Constraint Grammar Framework”, ran at 6336 words/s, using 10.2 % of overall Aarhus: Press processing time. Compared to the external Bick, Eckhard. (2003) A CG & PSG Hybrid Approach dependency system (608 words/s), this to Automatic Corpus Annotation, In: Kiril Simow & Petya Osenova (eds.), Proceedings of implies a speed improvement by almost one SProLaC2003 (at 2003, order of magnitude. Lancaster), pp. 1-12 Bick, Eckhard. (2005) “Turning Constraint Grammar A rough inspection of annotation results for a Data into Running Dependency ”. In: sample of 1000 words indicate an overall Civit, Montserrat & Kübler, Sandra & Martí, Ma. error rate for the dependency annotation of Antònia (red.), Proceedings of TLT 2005 (4th about 4%. Of these, about half were Workshop on Treebanks and Linguistic Theory, attachment failures (no mothernode for non- Barcelona, December 9th - 10th, 2005), pp.19-27 topnode functions), half were wrong Bick, Eckhard (2007), Tagging and Parsing an attachments (wrong daughter-mother Artificial Language: An Annotated Web-Corpus of relation). With most errors being caused by Esperanto, In: Proceedings of Corpus Linguistics 2007, Birmingham, UK. Electronically published syntactic-function errors in the input, the at (http://ucrel.lancs.ac.uk/publications/CL2007/, error rate of the dependency module itself Nov. 2007) was very low, under 1%. Didriksen, Tino (2003). “Constraint Grammar Manual”, http://beta.visl.sdu.dk/cg3/single/ 5 Conclusion and outlook Karlsson, Fred et al. (1995): Constraint Grammar - A Language-Independent System for Parsing Given the necessary formal changes to the Unrestricted Text. Natural Language Processing, CG compiler software, it appears to be No 4. Berlin & New York: Mouton de Gruyter. feasible, even with a relatively small set of rules, to handle the creation of dependency Appendix: Annotation sample tree structures for CG-analyzed input within the CG formalism itself. Our experiments Post 12 jaroj da reformoj, la efikeco de la ĉeĥa with such a grammar for use in an Esperanto ekonomio ne signife transpaŝas la nivelon atingitan en spell- and grammar-checker produced robust la jaro 1989. (Ater 12 years of reforms, the efficiency of the chech results, both quantitatively and qualitatively. economy has not significantly surpassed the level In particular, the dependency module proved reached in [the year of] 1989,) to be considerably more robust than the syntactic function module, inheriting most of Post [post] <*> PRP @ADVL> #1->14 its errors from the former. We therefore 12 [12] NUM P @>N #2->3 believe that CG dependency modules can be jaroj [jaro] N P NOM @P< #3->1 da [da] PRP @N< #4->3 created with comparatively little effort, to reformoj [reformo] N P NOM @P< turn existing CG function annotations into #5->4 dependency treebanks without substantial la [la] ART @>N #6->7 loss of information. Future research should efikeco [efikeco] N S NOM @SUBJ> #7->14 allow us to shed light on the question to what de [de] PRP @N< #8->7 la [la] ART @>N #9->11 degree our dependency grammar, given a cxehxa [cxehxa] ADJ S NOM @>N compatible set of morphological and #10->11 syntactic input tags, is language independent ekonomio [ekonomio] N S NOM @P< - as the size and simple nature of our rule set #11->8 ne [ne] ADV @>A #12->13 signife [signife] ADV @ADVL> #13->14 @syntactic_function, #dependency-link transpasxas [transpasxi] V PR @FS-STA #14->0 (part of speech tags: N=noun, V=verb, la [la] ART @>N #15->16 ADJ=adjective, ADV=adverb, PRP=preposition, nivelon [nivelo] N S ACC @14 ART=article, NUM=numeral; inflexion: S=singular, atingitan [atingi] V PCP PAS IMPF ADJ P=plural, NOM=nominative, ACC=accusative, S ACC @ICL-N< #17->16 PCP=participle, PAS=passive, PR=present tense, en [en] PRP @17 IMPF=past tense; syntactic function: la [la] ART @>N #19->20 @SUBJ=subject, @ADVL=adverbial, @ACC=direct jaro [jaro] N S NOM @P< object, @>N=pre-nomina modifier, @N<=post- #20->18 nominal modifier, @P<=argument of preposition, 1989 [1989] NUM S @N< @ICL=non-finite clause, @FS=finite clause, #21->20 @STA=statement; semantic prototypes: $. duration, abstract countable, domain, semantic product, action, feature, The following fields are used in the annotation nationality, main verb; valency: scheme, and expressed as feature attribute pairs in transitive) xml: wordform, [base form/lemma], ,