<<

Proceedings of the Society for Computation in Linguistics

Volume 2 6

2019 Modeling Clausal Complementation for a Grammar Engineering Resource Olga Zamaraeva University of Washington, [email protected]

Kristen Howell University of Washington, [email protected]

Emily M. Bender University of Washington, [email protected]

Follow this and additional works at: https://scholarworks.umass.edu/scil Part of the Computational Linguistics Commons

Recommended Citation Zamaraeva, Olga; Howell, Kristen; and Bender, Emily M. (2019) "Modeling Clausal Complementation for a Grammar Engineering Resource," Proceedings of the Society for Computation in Linguistics: Vol. 2 , Article 6. DOI: https://doi.org/10.7275/dygn-c796 Available at: https://scholarworks.umass.edu/scil/vol2/iss1/6

This Paper is brought to you for free and open access by ScholarWorks@UMass Amherst. It has been accepted for inclusion in Proceedings of the Society for Computation in Linguistics by an authorized editor of ScholarWorks@UMass Amherst. For more information, please contact [email protected]. Modeling Clausal Complementation for a Grammar Engineering Resource

Olga Zamaraeva Kristen Howell Emily M. Bender University of Washington University of Washington University of Washington [email protected] [email protected] [email protected]

Abstract because of their interaction with other phenomena. A grammar that cannot support subordinate We present a grammar engineering library will fail to provide analyses for a big portion of a for modeling objectival declarative clausal typical corpus and will not support the grammar complementation patterns attested cross- engineer in exploring how well the analyses used linguistically. Our primary contribution is positing a set of syntactico-semantic analyses for modeling simple clauses generalize. couched within a variant of the HPSG syntactic To fill this gap, we incorporate a cross-linguistic formalism and integrating them with a variety account of clausal complements into a grammar of phenomena already implemented in a gram- engineering questionnaire and customization sys- mar engineering toolkit. We evaluate the addi- tem. The questionnaire elicits typological infor- tion to the system on testsuites from genetically mation about hypothetically any language and the diverse languages that were not considered dur- customization system outputs a starter precision ing development. grammar to specifications. In this way, the toolkit 1 Introduction supports the rapid development of precision gram- mars, giving both novice and experienced grammar Grammar engineering is the modeling of language developers a means to create a grammar fragment rules in a machine-readable fashion, such that the customized to their language of interest and ready resulting system (the grammar) can parse gram- for extension to broader coverage. matical strings while not overgenerating (licensing In the meta-grammar engineering context, an incorrect or spurious analyses). Linguistic gram- analysis is a set of theoretically-grounded elements mar engineering prioritizes precision (how many which comply with the requirements of the spe- parses are syntactically and semantically correct cific implementation framework and are used by with respect to the linguistic formalism) over re- the grammar engineering system so as to produce call (how many input strings get parsed). functioning grammars. In our case, the analysis A precision grammar documents and models is couched within Head-Driven Structure linguistic phenomena on the one hand, and is a - Grammar (HPSG; Pollard and Sag, 1994) and gram that parses and generates, on the other. This Minimal Recursion Semantics (MRS; Copestake makes precision grammars a resource for rigorous et al., 2005) and is implemented as an extension to linguistic hypothesis testing as well as for NLP the LinGO Grammar Matrix (Bender et al., 2002, tasks. While precision grammars are expensive— 2010). Our main challenges lie in accounting for broad-coverage grammars like Flickinger 2000, a large space of typological possibilities while at 2011 or Siegel et al. 2016 take years to build— the same time integrating our analysis of clausal there are eorts to automate this process. We complements into a complex system which out- present a contribution to one such initiative, adding puts streamlined grammars. We start by motivat- a clausal complements library to a grammar engi- ing the choice of the grammar engineering system neering starter toolkit. and briefly summarizing the formal underpinnings Clausal complements are a kind of subordinate associated with this choice (§2). We then present . Modeling subordinate clauses in general a concise summary of the typological literature on and clausal complements in particular is important clausal complementation (§3), and describe our not only because of their corpus frequency, but also analysis and implementation in §4. The evaluation

39 Proceedings of the Society for Computation in Linguistics (SCiL) 2019, pages 39-49. New York City, New York, January 3-6, 2019 results featuring held-out languages from dierent clause embedding itself. On the other hand, we are families are given in §5. All of the resources men- able to explore the interaction of our analyses with tioned in the paper are freely available.1 the existing analyses of other phenomena in or- der to validate both the existing and new libraries. 2 Background Finally, because we contribute our library to the Grammar Matrix code base, it can in turn serve 2.1 The Grammar Matrix as part of the background infrastructure for future We develop a cross-linguistic analysis of clausal library development. complements as part of the LinGO Grammar Ma- 2 trix (Bender et al., 2002, 2010). Other examples 2.2 DELPH-IN Joint Reference Formalism, of multilingual grammar engineering projects in- HPSG and MRS clude ParGram (Butt and King, 2002) and Core- The Grammar Matrix uses the DELPH-IN Joint Gram (Müller, 2015). What is unique about the Reference Formalism (Copestake, 2000), a version Grammar Matrix, however, is that it allows the of the HPSG syntactic formalism (Pollard and Sag, user to obtain a starter grammar customized from 1994), developed to balance expressive power with typological choices, making it possible to easily computational eciency. Every part of the gram- test the analysis on multiple languages, including mar (lexical items, lexical rules, and phrase struc- systematic testing of each part of the analysis using ture rules) is encoded with typed feature struc- artificially constructed languages (see §4.3). tures.4 Phrase structure rules can have any fixed The Grammar Matrix consists of a web-based number of daughters in a fixed order,5 but are typ- questionnaire and a back-end customization sys- ically either binary branching or unary construc- tem. The input to the system is user answers to tions. Lexical rules are always unary projections the typological questions (elicited in the question- and are furthermore restricted to the lower parts of naire and serialized as a choices file) and the out- the tree, i.e. below any phrase structure rules. put is an implemented starter precision grammar. A customized grammar consists of a type hier- There are currently multiple libraries, including archy where typed feature structures (types) inherit word order (Bender and Flickinger, 2005; Fokkens, constraints from other types and add their own con- 2014); person, number, gender, and case systems straints. The typed feature structures defined by the (Drellishak, 2009); tense, aspect, and mood (Poul- grammar are combined using the operation of uni- son, 2011); argument optionality (Saleem and Ben- fication in order to build analyses of sentences. If der, 2010); matrix yes/no questions, lexicon, (Ben- no analysis can be found for a string using a gram- der and Flickinger, 2005); morphotactics (O’Hara, mar, then the string is deemed ungrammatical by 2008; Goodman and Bender, 2010); nominalized that grammar. Because the feature structures en- clauses (Howell et al., 2018), and clausal modifiers code both syntactic and semantic constraints, any (Howell and Zamaraeva, 2018).3 The contribution derivation tree produced by the grammar also in- of this paper is the clausal complements library. cludes an MRS semantic representation (Copes- In the context of the Grammar Matrix, our analysis take et al., 2005). When evaluating a grammar consists of a new web questionnaire page with a set created via the Grammar Matrix customization sys- of choices describing clausal complements cross- tem, the correctness of both syntactic and semantic linguistically, a set of new types available for the representations of strings is taken into account. back-end customization logic, and revisions to the types already existing in the system. 2.3 Clausal Complements in HPSG Building on the existing resource of the Gram- mar Matrix to develop our library has dual bene- To our knowledge, there is no previous fits: On the one hand, we can reuse existing im- cross-linguistic account of clausal complements plementations of phenomena that occur in embed- in HPSG. Language-specific analyses include ded clauses and focus our attention primarily on Ginzburg and Sag 2000 for English and Crysmann 2013 for German, among others. We could not di- 1svn://lemur.ling.washington.edu/shared/ matrix/trunk, revision 42067 (code and testsuites). 4These are exemplified in §4.2 in (4)-(10). 2http://matrix.delph-in.net/customize/ 5This contrasts with other versions of HPSG formalisms matrix.cgi which separate immediate dominance from linear precedence 3The nonexhaustive list of cited libraries includes the ones (e.g. Engelkamp et al., 1992) as well as those that allow with which we tested the interaction of our library. variable-arity rules, like Sag et al. 2003.

40 rectly incorporate them for two reasons. First, most its nominal . Complementizers can be seen of the existing theoretical analyses are not within in both (3) and (1). the DELPH-IN JRF. Second, the literature mostly (2) [senin sinema-ya gel-me-n-i] concerns itself with the level of clause complex- [2. cinema- come--2-] ity which is beyond the Grammar Matrix’s cur- isti-yor-um rent scope. The analyses which informed us the want--1 ‘I want you to come to the movies.’ [tur] (adapted most are actual grammar implementations within from Kornfilt 2013, p. 48) the DELPH-IN JRF and include Siegel et al. 2016 for Japanese and Flickinger 2000, 2011 for English. (3) Men bilamen [ki bu Odam ˘o˘a-ni I know-1 [ this man chicken- oirladi] 3 The Typology of Clausal Complements stole-3] ‘I know that the man stole the chicken.’ [uzb] (Noo- Clausal arguments are clauses which serve as core nan, 2007) arguments to (Noonan, 2007). They can be subjects or complements (objects) and typically oc- 4 Developing the Library cur with verbs of thought, perception, knowledge, etc. An example of an objectival clausal argument Development starts with mapping typological de- is given in (1). scriptions of the phenomenon to a set of choices to be presented to the user via the web questionnaire (1) Kim thinks [that Sandy left]. [eng] (§4.1). This characterizes our analysis in the broad sense. In particular, upon reviewing the typologi- Clausal complements can be finite clauses that cal literature and restricting the scope of the library can occur on their own (1) or they can be reduced. as outlined in §3, we provide an analysis for lan- Reduced clauses can share the with the guages that mark clausal complements in a set of matrix clause, have the in a dependent form certain ways, e.g. with complementizers, nominal- or have case marking on the arguments that dif- ization, aspect or other marking on the embedded fers from what is normal for finite clauses in the verb.7 We posit HPSG analyses for all combina- language (ibid.). Complementation strategies in- tions of the choices we target, without claiming clude parataxis (several clauses joined without sub- to have identified or modeled all dimensions of ordination); complementizer + finite ; variation. We start with basic analyses for com- nominalization; infinitival and participial comple- plementizers and clause-embedding verbs (§4.2) ments. and implement a holistic analysis in a test-driven Here we focus on the complementation phe- fashion (§4.3). nomena which involve embedded clauses that are declarative, objectival and clearly marked as subor- 4.1 Web Questionnaire dinate, and that contain a full proposition. Embed- As summarized in §3, clausal complements can be ded clauses that are subjects, interrogative clauses, marked with a complementizer, special morphol- subject sharing, paratactic, infinitival or participial ogy on the embedded verb, or word order changes complements are not covered at this point. Our in either the matrix or the embedded clause. Com- library supports clausal complements marked by binations of these phenomena are also possible. special morphology on the verb, special word or- There may be several distinct strategies in a lan- der, and/or a complementizer attaching to one of guage and clausal complement-taking verbs may the edges of the embedded clause. Special mor- select for specific ones. We add a web page to the phology on the embedded verb is illustrated in (2), questionnaire which elicits all the relevant choices a Turkish example with nominalization. Special 6 from the user. word order in the matrix clause is illustrated by A language may have multiple complementizers (3): Uzbek is a SOV language, but clausal com- which belong to distinct complementation strate- plements can be extraposed to the end of the sen- gies (i.e. with dierent word order consequences, tence. Note that the order of the complementizer dierent selecting verbs, or dierent morphologi- and the embedded clause here also does not follow cal requirements on the embedded verb). To ac- the same rule as, for example, a and 7While our analysis draws on a comprehensive review of 6See §4.2.4 regarding the word order in subordinate typological literature, we do not make any typological claims clauses in German. as part of this work.

41 count for this, we generalize the feature, pre- Grammar Matrix provides us with a theoretical as viously available for verbs to distinguish dierent well as an engineering basis. As is consistent with syntactically relevant inflected forms, to comple- the Grammar Matrix’s overarching goal, the main mentizers, and reflect this in the questionnaire. goal of our analysis is to provide the user-linguist To handle complementation strategies that re- with a range of typologically motivated possibil- quire specific morphology on the embedded verb, ities rather than to focus on specific predictions we leverage several existing libraries: The mor- of what is possible vs. what is not possible in the photactics library (Goodman, 2013) provides a world’s languages. In this section, we describe the means of associating morphological features with building blocks of our analysis. inflected forms of verbs; the tense/aspect/mood li- brary (Poulson, 2011) allows users to define mor- 4.2.1 Lexical Types phology associated with those features (as well as Complementizers can be treated as semantically additional arbitrary features on the ‘Other Features’ empty elements that take a proposition as a com- page) and the nominalization library (Howell et al., plement (e.g. Siegel et al. 2016). The Grammar 2018) accounts for nominalized clauses. In all Matrix already had a complementizer type which cases, the lexical rules defined by these libraries did not contribute its own predication but raised include morphosyntactic or morphosemantic fea- its complement’s semantic identifier via the tures which allow an embedding verb or comple- identity, as shown in (4).9 mentizer to select for a clausal complement headed by a verb with appropriate morphology. This lex- (4) comp-lex-item ically introduced information is available at the comp 2HEAD 3 6 MOD 7 clause level thanks to constraints in the shared core 6 " hi# 7 6 7 grammar that pass the information up the tree. 6SUBJ 7 6 hi 7 6 HEAD verb 7 We also provide an analysis of clause-bounded 6 7 6 7 extraposition of clausal complements. One of the 6 SUBJ 7 6COMPS 1 2 hi 3 7 6 h 6COMPS 7i7 salient syntactic characteristics of clausal com- 6 6 hi 7 7 6 6HOOK 2 7 7 plements is that they tend to be dispreferred in 6 6 7 7 6 6 7 7 6ARG-ST 1 6 7 7 -medial position, as seen in (3). We allow 6 h 6i 7 7 6RELS !!4 5 7 the user to choose this type of extraposition and 6 h i 7 6HOOK 2 7 8 6 7 indicate whether it is obligatory or optional. 6 7 6 7 6 7 Finally, the user must add lexical types for While4 posited originally for question5 parti- clause-embedding verbs, at least one per comple- cles in the yes/no questions library (Bender and mentation strategy, as we operate within a lexicalist Flickinger, 2005), it was intended to generalize view of syntax and the lexical type of the clause- to clausal complementizers. In order to have embedding verb is therefore responsible for choos- clausal complementizers inherit from this super- ing which type of clausal complement to take. type we move the (‘main clause’) value from Here, we extended an already existing part of the the supertype (4) where it was posited originally questionnaire (the Lexicon) so that verb types can to the clausal complementizer subtypes (5). This select for a complementation strategy. way question particles (not shown) can take main clauses as their complements but clausal comple- 4.2 Syntactic Analysis mentizers cannot. Another change implemented to As discussed in §2.3, there appears to be little in support our analysis is that complementizers now the theoretical HPSG literature focused on a cross- have the feature (with user-specified array of linguistic account of basic complementation that values). This is used to license certain comple- can be directly incorporated into our library. How- mentizers only in combination with certain verb ever, we can leverage the existing Grammar Matrix types. 10 analyses of various phenomena as well as those from other implemented grammars and extend or 9Empty semantic relations () lists and identities adapt them to cover clausal complementation; the as in (4) are characteristic of semantically empty elements. 10AVMs are abbreviated to focus on the primary points of 8Displacement to sentence-initial position is less men- interest. Note also that all such feature structures are arranged tioned in the typological literature, though it appears to be into a type hierarchy, such that more specific types, like (5), possible. We leave it to future work. inherit all of the constraints of their supertypes, like (4).

42 (5) ccomp-lex-item clausal complement strategies as argument struc- HEAD FORM form ture choices and allowed the case frames that are 2 3 6 7 available for non-clause-embedding verbs to be 6 h i7 6COMPS MC 7 6 h i 7 available for clausal complement strategies as well. 6 h i 7 6 7 11 The option which specifies a case constraint on the To model6 clausal complement-taking7 verbs, we introduce4 new lexical types. Lexical5 types in verb’s object is not always available: it only makes the Grammar Matrix typically specify both syntac- sense for nominalized clausal objects. tic and semantic requirements on their dependents. 4.2.2 Lexical Rules, Features, Nominalization Because cross-linguistically clausal complement- As noted in §4.1 above, we leverage existing li- taking verbs can take both verbal and nominal- braries to account for constraints on the morphol- ized complements, our supertype in this space (6) ogy of embedded verbs imposed by complemen- leaves both underspecified. Otherwise, it is simi- tizers or clausal-complement verbs. The result is lar to transitive-verb-lex, which also specifies two that users can specify, for example, that the clausal arguments.12 complement be nominalized, at which point the re- (6) cl-verb-lex sulting clausal-complement verb type will include SUBJ 1 the constraint shown in (7). We also extended the 2 h i 3 6 SUBJ 7 Morphotactics library to allow nominalized verbs 6COMPS 2 hi 7 6 COMPS 7 to bear case inflections. 6 h " #i 7 6 hi 7 6 7 6ARG-ST 1 HEAD , 2 7 (7) 6 h i7 cl-verb-lex 6 h i 7 6 7 COMPS NMZ + Further6 subtypes inheriting from cl-verb-lex7 are 2 3 4 5 6 h i7 6 h i 7 based on the specific choices made by the user and 6 7 For a6 specific example7 of how we accommo- make use of one of two already existing types: ei- 4 5 date lexical variation in complementation, consider ther transitive-lex-item, for nominalized comple- Turkish, which has several complementation strate- ments, or clausal-second-arg-trans-lex-item, for gies. The verb isti (‘want’) can not only take nom- all other types of clausal complements. This al- inalized complements like in (2) but also clauses lows us to model the primary semantic eect of headed by verbs in the optative form, as in (8). nominalization: the introduction of an individual- type variable corresponding to the nominalized (8) herkes [yarin ben-im-le sinema-ya everybody [tomorrow I-- cinema- clause, which in turn allows for e.g. adjectival gel-esin] isti-yor modifiers of that clause. In the semantic com- come-2.] want-. position, transitive-lex-item takes the individual- ‘Everybody wants you to come along to the movies type index of its complement as a semantic argu- with me tomorrow.’ [tur] (from Kornfilt 2013, p. 48) ment. Clausal-second-arg-trans-lex-item, on the To model this, the user can add separate comple- other hand, expects an argument with an event-type mentation strategies in the questionnaire for nom- index and combines with it semantically via a han- inalization and for optative on the embedded verb, dle constraint to accommodate the MRS analysis and a corresponding clausal complement-taking of quantifier scope (Copestake et al., 2005). verb type to go with each strategy. The resulting For proper interaction with the case library grammar will include two lexical entries for isti, (Drellishak, 2009), clausal complement-taking each belonging to a dierent type. One will only verbs require a hierarchy which is aware of various take complements with [ +, nonfinite] case frames. In modeling a language with case, and the other [ , opt]. the user must specify a case frame for each verb type. For example, a verb can have nominative- 4.2.3 Phrase Structure Rules, and accusative argument structure, meaning the type The most intricate part of this library is its in- will constrain its syntactic subject to bear nomi- teraction with the existing analysis of word order. native case and its object accusative. We added To account for basic word order, any Grammar 11Often called the , complement-taking predicate, in Matrix-generated grammar will have some basic syntactic and typological literature. head-subject and head-complement rules, one of 12In future work, we will extend the lexical types available to provide for more valence patterns with clausal comple- each when the word order is strict (e.g. SOV) and ments, such as Kim told me that Sandy left. more if the order is flexible.

43 When the complementizer or the main verb par- mentizer the order is verb final. Fokkens (2014) ticipates in a word order that is dierent from other implemented this type of word order variation in verbs, we posit an additional head-complement the Grammar Matrix framework but did not fully rule (HCR) and, in some cases, an additional head- integrate it into the customization system. We in- subject rule (HSR). Then we constrain them (as corporate part of her analysis so that the user can well as all lexical types that use the HCRs and choose this type of word order variation directly HSRs) for one or both Boolean features, and via the questionnaire. . While similar features have been used be- fore (Keller 1995; Crysmann 2013; see also Siegel 4.3 Implementation et al. 2016, 59), we incorporate them into a new Encoding any particular structure is straightfor- customization logic so that sets of correct con- ward; the challenge is in emitting the right sets straints are emitted automatically based on user of structures with the right constraints given user choices. Examples (9)–(10) illustrate the general choices, where the space of possible combinations HCR and the additional one for extraposition that is large. Furthermore, any additions need to be are emitted for one of many possible user-defined well integrated so that the constraints the new code languages: an OVS language with extraposition. emits should not interfere with what other libraries create. At the same time, it is not desirable to (9) Top HCR1 HCR1 add unique structures where an existing one can COMPS 1 2 3 be adapted, since this leads to unnecessarily large, 6 INIT 7 O VHSR 6 7 complex and potentially overgenerating grammars. 6 7 6H-DTR SUBJ 7 6 2 hi 37 The goal is to implement an analysis that is su- 6 6COMPS 3 1 77 V S 6 6 h i 77 ciently general to handle any typologically plausi- 6 6 77 6NH-DTR 63 77 6 6 77 ble language while at the same time emitting rea- 6ARGS 4 3 , 2 57 6 h i 7 sonably streamlined grammars. 6 7 6 7 6 7 (10) 4HCR2 5 TopHCR2 4.3.1 Test-driven Development COMPS 1 2 3 We begin by describing the procedure we use for 6 INIT + 7 VHSR O 6 7 creating test cases, both in development and in 6 7 6H-DTR SUBJ 7 6 2 hi 37 evaluation (§5). Test-driven development in the 6 6COMPS 3 1 77 V S 6 6 h i 77 context of the Grammar Matrix relies on two com- 6 6 77 6NH-DTR 63 77 6 6 77 ponents: the testsuites and the test choices files 6ARGS 4 2 , 3 57 6 h i 7 (i.e. grammar specifications). There is a testsuite 6 7 6 7 We use6 (a feature) on lexical7 types and and a choices file for each language that is used 4 5 on the head daughters of phrase structure rules to in development or evaluation. While the testsuite account for word order variations associated with and the choices file are created separately, both subordinator attachment and with extraposition of are based on the descriptive grammar for the lan- objects from OV languages. [ +] is associ- guage in question (with the exception of artificial ated with head-initial rules and [ ] with head- pseudolanguages discussed below which function final rules. We need a separate feature, more like ‘unit tests’ for the bits of our analysis). (not shown), to handle extraposition in VOS and Specifically, we first collect the sentences illustrat- V-initial languages, because the order in these lan- ing clausal complementation from the descriptive guages remains head-initial despite extraposition grammar to compile the testsuites. Then, in a sep- to the end of sentence. is constrained on arate iteration, we read the descriptive grammar to list elements of clausal-complement select- the extent necessary to fill out the Grammar Matrix ing heads and on the non-head daughter of head- questionnaire for this language. initial HCRs.13 We build testsuites and choices files for pseu- dolanguages (defined by combinations of choices) 4.2.4 German-like V2/V-final Variation and for 5 development languages (see §5 for the In German, the word order is verb second (V2), details). The number of possible pseudolanguages but in clausal complements marked by a comple- is large: a conservative estimate which treats some 13For an example of how is used, see Zamaraeva bundles of choices as single dimensions is 2300, et al. 2018. assuming one strategy per language. In testing we

44 work with a 50 language sample which includes out languages, described in §5 below. several languages with more than one complemen- tation strategy. A testsuite consists of grammatical 4.3.2 Adding Types and ungrammatical sentences illustrating what is We update the customization system as follows. possible/impossible with respect to clausal com- A supertype for clausal-complement verbs (de- plements in a given language. scribed in §4.2) is now added in all cases when For pseudolanguages testsuites, we construct all any clausal complement strategy is specified. A possible nonrecursive sentences14 illustrating the supertype for a complementizer is added whenever clausal complementation strategies that this lan- the user says there is a complementizer associated guage has. For development languages, the sen- with any of the strategies. Appropriate subtypes tences either come directly or are adapted from the are added based on the user choices, one per strat- 16 sections in descriptive grammars explaining how egy, and constrained as described in §4.3.3. clausal complements work in this language. Here Additional phrase structure rules are added ac- the sentences feature interacting phenomena such cording to the user choices. An additional HSR as tense and case.15 Crucially, we include cor- is added for VOS orders with extraposition. An responding impossible (ungrammatical) sentences additional HCR is added whenever the comple- to make sure they are not parsed. Consider a mentizer or the clause-embedding verb cannot use language with one strict strategy: SOV word or- the basic HCR. In general, this means all situations der, obligatory complementizer attaching before when the basic order is e.g. head-final but there is the embedded clause, and obligatory extraposition either extraposition or the complementizer can at- of clausal complements. The testsuite will con- tach clause-initially (or symmetrically, if the basic sist of 2 grammatical sentences, one for a simple order is head-initial but the complementizer can at- SOV sentence and another that has an extraposed tach clause-finally). In practice, it means checking 17 clausal complement with the complementizer. The combinations of choices: 8 for the word orders; ungrammatical sentences will include: a complex 3 for complementizer (obligatory, no, optional); 3 sentence with no complementizer, one with a com- for complementizer attachment (before the clause, plementizer attaching after the clause, one with a after, or both); 3 for extraposition (obligatory, no, non-extraposed clausal complement, and a simple optional). clause with SVO order. The more flexible the lan- 4.3.3 Adding Constraints guage, the more grammatical and fewer ungram- matical examples the testsuite will have. After adding the types, we constrain them for the grammar to generate only grammatical strings (e.g. From the choices files which are created inde- for features ,,,). We need to pendently from the testsuites, grammar fragments add constraints so that they do not clash with those are created with the original customization sys- placed by other libraries while not positing new tem. The grammars are loaded into the LKB soft- types unless necessary. ware(Copestake, 2002) which can parse strings. Then we edit these grammars by hand until they be- The information about whether or have correctly with respect to the yet unsupported (nominalization) constraints are needed on the rel- clausal complements choices (as reflected by the evant types comes directly from the choices files: grammars’ coverage and overgeneration over the the user would have specified a value or a corresponding testsuites). Then we generalize the nominalization strategy associated with the com- solutions in the resulting grammars and incorpo- plementation strategy. The treatment of and rate them in the customization system, continually however must be inferred from a fairly large testing with the regression tests. The result of this space of choices combinations. development is frozen before evaluation on held- Figure 1 shows the logic that we add to the cus- tomization system that lets it decide whether to 14The vocabulary is minimal. use the feature.18 The decision depends on 15In developing these testsuites, we sometimes must adapt the sentences to exclude phenomena that are not supposed to 16A type can be instantiated by multiple lexical entries. be supported by the system or to capture the full spectrum 17Excluding V2 and free. described by the grammar’s author in prose but not fully il- 18Abbreviations for Figure and Tables: comp (comple- lustrated by examples. This, along with the initial selection mentizer); extrap (extraposition); fam (family); morph (mor- of which sentences to include in the grammar, introduces a pheme); neg (ungrammatical sentences); nmz (nominaliza- certain amount of bias to the testsuites. tion); oblig (obligatory); opt (optional); OV (object pre-

45 Complementizer (comp)

oblig. opt. no

OV VO OV VO extrap no extrap

S comp comp S both S comp comp S both extrap no extrap comp S S comp both OV VO no extrap no extrap strict extrap flex. extrap no extrap no comp S S comp both no no

no no no

Figure 1: Decision tree illustrating the logic of using the feature based on user choices whether there is a complementizer, whether it is the phenomena covered by the development lan- obligatory or optional, what is the basic word or- guages.19 der, is there extraposition and is it strict or flex- As for testing the interaction with other libraries, ible, and how does the complementizer attach to we included all the choices that were relevant to the clause. If the feature is used, then all HCR the testsuites. For example, if the language has and all lexical items which can go through the HCR case and the sentences with clausal complements must be properly constrained for + or . The showed this, we filled out the Case portion of the logic associated with the feature is simpler: questionnaire and added appropriate lexical rules it is used only with VOS and V-initial word orders on the Morphology page. Case, word order, mor- when there is extraposition. phology, and tense, aspect and mood choices came up most often. 4.3.4 Summary After addressing a few issues uncovered by the This section described the process and the main pseudo- and development languages, we achieve techniques used to implement the Clausal Com- 100% coverage and 0% overgeneration on their plements library for the Grammar Matrix. We testsuites (Table 1). Furthermore, we checked operate in a large space of theoretically possi- that all parses for each sentence are warranted, i.e. ble user choices combinations (over 2300) where there is no unwanted ambiguity. At the same time, the ultimate goal is to emit streamlined grammars the library was used in a course where students which behave correctly for any valid combination implemented grammar fragments for 5 more lan- of choices (and a language defined by them). The guages,21 which helped us discover an issue in the development was driven by a sample of pseudo- interaction with the information structure library and illustrative (development) languages and was (Song, 2014). evaluated as described in the next section. Finally, we evaluate the library on 5 held-out 5 Testing, Evaluation, and Error languages from dierent language families than 22 Analysis the development languages. The results are pre- sented in Table 2. Coverage and overgeneration To test the typological legitimacy and the rigor of are given in numbers of sentences in the testsuite our analysis, we apply a three-stage process where (4/8 means 4 out of 8 were parsed). we test grammars against testsuites. The process In Jalkunan (SOV), there is a kind of clausal used in all stages is described in §4.3.1. The first two stages involving 50 pseudolanguages and 5 de- 19German (Sapp, 2006), Tagalog (Hoenigswald et al., velopment languages are part of the test-driven de- 1975), Lango (Noonan, 2007), Turkish (Kornfilt, 2013). The first author, a native speaker of Russian, provided the testsuite velopment. We chose the development languages and the choices for Russian. which exhibited the full range of in-scope comple- 20We targeted a fragment of German with obligatory com- mentation phenomena and for which we had both plementizers and V2/V-final word order variation. 21Akuntsu [aqz] (Tupian); Lhasa Tibetan [bod] (Tibeto- some sentences and a description sucient for us Burman), Dzongkha [dzo] (Tibeto-Burman), Dagaare [gur] to fill out the questionnaire. Table 1 summarizes (Niger-Congo), Yolmo [scp] (Tibeto-Burman). 22The number of languages is limited by time constraints. cedes verb); pos (grammatical sentences); S comp (clause We pick held-out languages randomly from a pool of de- followed by complementizer); strat (strategy); VO (verb pre- scriptive grammars and reject them if they come from a lan- cedes object); WO (word order). Language families: AA guage family already used or if the grammars do not cover (Afro-Asiatic); Astrn (Austronesian); Awk (Arawak); IE basic clausal complements. Grammars: Jalkunan (Heath, (Indo-European); NS (Nilo-Saharan); NC (Niger-Congo); PN 2017), Paresi-Haliti (Brandão), Sahaptin (Jansen, 2010), He- (Pama-Nyungan); PP (Plateau Penutian); Tur (Turkic). brew (Zuckermann, 2006), Wangkangurru (Hercus, 1994).

46 Language iso639-3 fam WO comp order morph extrap # strat pos neg Russian rus IE free opt comp S nmz,form - 3 6 11 German deu IE V2/V-fin oblig20 comp S. - - 1 64 Tagalog tgl Astrn. V-in oblig comp S - flexible 1 34 Lango laj NS SVO oblig comp S mood - 3 44 Turkish tur Tur SOV opt both nmz, form strict 4 79

Table 1: Phenomena covered in the development languages. Coverage is 100%, overgeneration is 0%.

Language iso639-3 fam WO comp order morph extrap # strat. Cov. Overgen. Jalkunan bxl NC SOV opt comp S - strict 1 4/8 0/12 Paresi-Haliti pab Awk SOV - - nmz strict 1 5/5 0/5 Yakima Sahaptin yak PP free - - nmz - 1 10/10 0/6 Modern Hebrew heb AA SVO oblig comp S - - 1 2/2 0/9 Wangkangurru wgg PN free - - aspect - 1 10/10 0/3

Table 2: Coverage and overgeneration on sentences from held-out language families. complement extraposition where a “dummy” 3 imate quantitatively how well a system would do (similar to it in English) stays in the for a linguist working from a reasonably thorough sentence-medial position while the clausal com- description of a language we had not worked with. plement is extraposed (Heath, 2017). We had not considered this strategy during the development 6 Conclusion since we hadn’t come across it in the literature, so we could not parse those items. In addition, we We presented a clausal complements library for discovered a bug in the interaction with the word the LinGO Grammar Matrix. To this end, we im- order library which places constraints on the or- plemented an HPSG-based analysis of a range of der of auxiliaries and their complements. Some of complementation phenomena attested in the typo- the Jalkunan sentences had auxiliaries, and we also logical literature. The library allows the grammar could not parse them due to the bug; hence the 50% engineer to include basic clausal complements in coverage. On all other languages we achieve 100% their starter grammar in a streamlined system with coverage and 0% overgeneration (again, with no appropriate interaction with other phenomena, thus spurious ambiguity). extending the initial coverage of the grammar frag- The testsuites represent typologically diverse ment that can be obtained automatically. Available languages, and so one testsuite may emphasize choices define thousands of possible combinations word order variation, another morphology, etc. for which the library needs to emit correct code. The descriptive grammars for held-out languages We make sure we have correct behavior on a sam- only included one in-scope strategy per language. ple of 50 choices combinations and 5 development In general, our held-out grammars contain rela- languages; on testsuites from 5 held-out languages, tively few examples which illustrated basic com- we achieve 88% coverage and no overgeneration. plementation.23 Our source for Modern Hebrew, In the future, we plan to extend the system to Zuckermann 2006, for example, identified a sin- handle such related phenomena as subject shar- gle strict complementation strategy, and so for that ing, sentential subjects, embedded questions, and language we have just one grammatical sentence displacement to the beginning of the clause. In illustrating simple noun complementation and one terms of interactions with other libraries, we will grammatical sentence illustrating clausal comple- look at complex sentences combining clausal com- mentation. The more strategies in a language, the plements and clausal modifiers and coordinated more flexible they are, and the clearer the author clausal complements. Our library will also serve of the grammar describes them in isolation from as a foundation for testing future additions to the other phenomena, the more sentences we can in- Grammar Matrix for their interaction with subor- clude. The point of the evaluation is to approx- dinate clauses, e.g. wh-questions. Finally, we plan to explore how to infer answers to our additions 23 Descriptive grammars seem to prioritize complicated ex- to the questionnaire from collections of interlinear amples of complex clauses. This diers from the data from development languages which were picked so as to better in- glossed text, along the lines of Bender et al. (2013) form the implementation of the library. and Bender et al. (2014).

47 Acknowledgements Ann Copestake. 2000. Appendix: Definitions of typed feature structures. Natural Language Engineering, We thank the anonymous reviewers for SCiL 2019 6(01):109–112. for helpful discussion. Ann Copestake. 2002. Implementing typed feature This material is based upon work supported by structure grammars. CSLI publications Stanford. the National Science Foundation under Grant No. BCS-1561833. Any opinions, findings, and con- Ann Copestake, Dan Flickinger, Carl Pollard, and Ivan A Sag. 2005. Minimal recursion semantics: clusions or recommendations expressed in this ma- An introduction. Research on language and compu- terial are those of the authors and do not necessarily tation, 3(2-3):281–332. reflect the views of the National Science Founda- Berthold Crysmann. 2013. On the of com- tion. plement clause and extraposition. Rightward movement in a comparative perspective, 200:369–395. References Scott Drellishak. 2009. Widespread but Not Universal: Emily M. Bender, Joshua Crowgey, Michael Wayne Improving the Typological Coverage of the Grammar Goodman, and Fei Xia. 2014. Learning grammar Matrix. Ph.D. thesis, University of Washington. specifications from IGT: A case study of Chintang. In Proceedings of the 2014 Workshop on the Use of Judith Engelkamp, Gregor Erbach, and Hans Uszko- Computational Methods in the Study of Endangered reit. 1992. Handling linear precedence constraints Languages, pages 43–53. Association for Computa- by unification. In Proceedings of the 30th annual tional Linguistics. meeting on Association for Computational Linguis- tics, pages 201–208. Association for Computational Emily M Bender, Scott Drellishak, Antske Fokkens, Linguistics. Laurie Poulson, and Safiyyah Saleem. 2010. Gram- mar customization. Research on Language & Com- Dan Flickinger. 2000. On building a more ecient putation, 8(1):23–72. 10.1007/s11168-010-9070-1. grammar by exploiting types. Natural Language En- gineering, 6(01):15–28. Emily M. Bender and Dan Flickinger. 2005. Rapid Dan Flickinger. 2011. Accuracy v. robustness in gram- prototyping of scalable grammars: Towards modu- mar engineering. In Emily M. Bender and Jennifer E. larity in extensions to a language-independent core. Arnold, editors, Language from a Cognitive Perspec- In Proceedings of the 2nd International Joint Con- tive: Grammar, Usage and Processing, pages 31–50. ference on Natural Language Processing IJCNLP-05 CSLI Publications, Stanford, CA. (Posters/Demos), Jeju Island, Korea. Antske Sibelle Fokkens. 2014. Enhancing Empiri- Emily M. Bender, Dan Flickinger, and Stephan Oepen. cal Research for Linguistically Motivated Precision 2002. The Grammar Matrix: An open-source starter- Grammars. Ph.D. thesis, Department of Computa- kit for the rapid development of cross-linguistically tional Linguistics, Universität des Saarlandes. consistent broad-coverage precision grammars. In Proceedings of the Workshop on Grammar Engineer- Jonathan Ginzburg and Ivan Sag. 2000. Interrogative ing and Evaluation at the 19th International Con- investigations. Stanford: CSLI publications. ference on Computational Linguistics, pages 8–14, Taipei, Taiwan. Michael Wayne Goodman. 2013. Generation of machine-readable morphological rules from human- Emily M. Bender, Michael Wayne Goodman, Joshua readable input. Seattle: University of Washington Crowgey, and Fei Xia. 2013. Towards creating pre- Working Papers in Linguistics, 30. cision grammars from interlinear glossed text: Infer- ring large-scale typological properties. In Proceed- Michael Wayne Goodman and Emily M. Bender. 2010. ings of the 7th Workshop on Language Technology What’s in a word? Refining the morphotactic infras- for Cultural Heritage, Social Sciences, and Human- tructure in the LinGO Grammar Matrix customiza- ities, pages 74–83, Sofia, Bulgaria. Association for tion system. In Workshop on Morphology and For- Computational Linguistics. mal Grammar, Paris. Jerey Heath. 2017. A grammar of Jalkunan (Mande, Ana Paula Barros Brandão. A Reference Grammar of Burkina Faso). Language Description Heritage Li- Paresi-Haliti (Arawak). Ph.D. thesis, The University brary. of Texas at Austin. Luise A Hercus. 1994. A grammar of the Arabana- Miriam Butt and Tracy Holloway King. 2002. Urdu Wangkangurru language: Lake Eyre Basin, South and the Parallel Grammar project. In Proceedings of Australia. Australian National Univ. the 3rd workshop on Asian language resources and international standardization-Volume 12, pages 1–3. Henry M Hoenigswald, Paul Schachter, and Fe T Association for Computational Linguistics. Otanes. 1975. Tagalog Reference Grammar.

48 Kristen Howell and Olga Zamaraeva. 2018. Clausal Sanghoun Song. 2014. A grammar library for informa- modifiers in the grammar matrix. In Proceedings of tion structure. Ph.D. thesis, University of Washing- the 27th International Conference on Computational ton. Linguistics, pages 2939–2952. Olga Zamaraeva, Kristen Howell, and Emily M Ben- Kristen Howell, Olga Zamaraeva, and Emily M. Bender. der. 2018. A cross-linguistic account of subordinator 2018. Nominalized clauses in the Grammar Matrix. and subordinate clause position. Poster presented at In Proceedings of the 25th International Conference The 25th International Conference on Head-Driven on Head-Driven Phrase Structure Grammar, pages Phrase Structure Grammar, Tokyo, Japan. 68–88. Ghil‘ad Zuckermann. 2006. Complement clause types Joana Worth Jansen. 2010. A Grammar of Yakima in Israeli. Oxford University Press. Ichishkíin/Sahaptin. Ph.D. thesis, University of Ore- gon.

Frank Keller. 1995. Towards an account of extraposi- tion in HPSG. In Proceedings of the seventh con- ference on European chapter of the Association for Computational Linguistics, pages 301–306. Morgan Kaufmann Publishers Inc.

Jaklin Kornfilt. 2013. Turkish. Routledge, London.

Stefan Müller. 2015. The CoreGram project: Theoreti- cal linguistics, theory development and verification. Journal of Language Modelling, 3(1):21–86.

Michael Noonan. 2007. Complementation. In Timothy Shopen, editor, Language Typology and Syntactic Description, volume 2. Cambridge University Press, Cambridge, UK.

Kelly O’Hara. 2008. A morphotactic infrastructure for a grammar customization system. Ph.D. thesis, Uni- versity of Washington.

Carl Pollard and Ivan A. Sag. 1994. Head-Driven Phrase Structure Grammar. Studies in Contempo- rary Linguistics. The University of Chicago Press and CSLI Publications, Chicago, IL and Stanford, CA.

Laurie Poulson. 2011. Meta-modeling of tense and as- pect in a cross-linguistic grammar engineering plat- form. University of Washington Working Papers in Linguistics (UWWPL), 28.

Ivan A. Sag, Thomas Wasow, and Emily M. Bender. 2003. Syntactic Theory: A Formal Introduction, second edition. CSLI, Stanford, CA.

Safiyyah Saleem and Emily M Bender. 2010. Argu- ment optionality in the LinGO Grammar Matrix. In Proceedings of the 23rd international conference on computational linguistics: posters, pages 1068– 1076. Association for Computational Linguistics.

Christopher D Sapp. 2006. Verb order in subordinate clauses from Early New High German to Modern German. Indiana University.

Melanie Siegel, Emily M. Bender, and Francis Bond. 2016. Jacy: An Implemented Grammar of Japanese. CSLI Studies in Computational Linguistics. CSLI Publications, Stanford CA.

49