Treatment of Multiword Expressions and Compounds in Bulgarian
Petya Osenova and Kiril Simov
Linguistic Modelling Department, IICT-BAS
Acad. G. Bonchev 25A, 1113 Sofia, Bulgaria
[email protected] and [email protected]

Abstract

The paper shows that catena representation together with valence information can provide a good way of encoding Multiword Expressions (beyond idioms). It also discusses a strategy for mapping noun/verb compounds to their counterpart syntactic phrases. The data on Multiword Expressions comes from BulTreeBank, while the data on compounds comes from a morphological dictionary of Bulgarian.

1 Introduction

Our work is based on the annotation of Multiword Expressions (MWE) in the Bulgarian treebank, BulTreeBank (Simov et al., 2004). We use this representation for parsing and analysis of compounds. BulTreeBank exists in two formats: HPSG-based (the original: constituent-based with head annotation and grammatical relations) and Dependency-based (converted from the HPSG-based format). In both of them the representation of the various kinds of Multiword Expressions is an important problem. We need a mechanism for connecting the MWEs in the lexicon with their actual usages within sentences. As an interesting case of MWE at the interface of morphology and syntax we consider Compounds. They are usually derived from several lexical units and have an internal structure with a specified derivation model and semantics. In the paper we are especially interested in the mapping between deverbal compounds and their counterpart syntactic phrases.

Since there is no broadly accepted standard for Multiword Expressions (see the various classifications in (Villavicencio and Kordoni, 2012)), we adopt the Multiword Expressions classification presented in (Sag et al., 2001). They divide MWEs into two groups: lexicalized phrases and institutionalized phrases. The former are further subdivided into fixed expressions, semi-fixed expressions and syntactically-flexible expressions. Fixed expressions are said to be fully lexicalized and to undergo neither morphosyntactic variation nor internal modification. Semi-fixed expressions have a fixed word order, but “undergo some degree of lexical variation, e.g. in the form of inflection, variation in reflexive form, and determiner selection” (non-decomposable idioms, proper names). Syntactically-flexible expressions show more variation in their word order (light verb constructions, decomposable idioms). We follow the understanding of (O’Grady, 1998) that MWEs have an internal syntactic structure which needs to be represented in the lexicon as well as in the sentence analysis. Such a mapping would provide a mechanism for accessing the literal meaning of an MWE when necessary. The inclusion of compounds in the MWE classification raises additional challenges. As mentioned, an important question is the prediction of the compound semantics formed on the basis of the related phrases containing verb + dependents.

In this paper we discuss the usage of the same formal instrument, catena, for their representation and analysis. A catena is a path in the syntactic or morphemic analysis that is continuous in the vertical dimension. Its potential is discussed further in the text. The paper is structured as follows: in the next section a brief review of previous work on catena is presented. In Section 3 a typology of the Multiword Expressions in BulTreeBank is outlined. Section 4 considers possible approaches for consistent analyses of MWE. Section 5 introduces the relation of syntax with compound morphology. Section 6 concludes the paper.

2 Related works on catena

The notion of catena (chain) was introduced in (O’Grady, 1998) as a mechanism for representing the syntactic structure of idioms. He showed that for this task there is a need for a definition of syntactic patterns that do not coincide with constituents. He defines the catena in the following way: The words A, B, and C (order irrelevant) form a chain if and only if A immediately dominates B and C, or if and only if A immediately dominates B and B immediately dominates C. In recent years the notion of catena has been revived and applied also to dependency representations. Catena has been used successfully for modelling problematic language phenomena.

(Gross, 2010) presents the problems in syntax and morphology that have led to the introduction of the subconstituent catena level. Constituency-based analysis faces non-constituent structures in ellipsis, idioms and verb complexes. In morphology the constituent-oriented bracketing paradoxes have also been introduced ([moral] [philosoph-er] vs. [moral philosoph]-er). Catena is viewed as a dependency grammar unit. At the morphological level morphemes (affixes) receive their own nodes, forming chains with the roots (such as tenses: has...-(be)en; be...(be)ing, etc.). In (Gross, 2011) the author again advocates his approach of providing a surface-based account of the non-constituent phenomena via the contribution of catena. Here the author introduces a notion at the morphological level: morph catena. He also presents the morphological analysis in the Meaning-Text Theory framework, where (due to its strata) there is no problem like the one present in constituency.

Apart from linguistic modeling of language phenomena, catena has been used in a number of NLP applications. (Maxwell et al., 2013) presents an approach to Information Retrieval based on catenae. The authors consider catena as a mechanism for semantic encoding which overcomes the problems of long-distance paths and elliptical sentences. The employment of catena in NLP applications is additional motivation for us to use it in the modeling of an interface between the valence lexicon, treebank and syntax.

In this paper we consider catena as a unit of syntax and morphology.¹ In a syntactic or morphological tree (constituent or dependency) a catena is: any element (word) or any combination of elements that are continuous in the vertical dimension (y-axis). In syntax it is applied to idiosyncratic meaning of all sorts, to the syntax of ellipsis mechanisms (e.g. gapping, stripping, VP-ellipsis, pseudogapping, sluicing, answer ellipsis, comparative deletion), to the syntax of predicate-argument structures, and to the syntax of discontinuities (topicalization, wh-fronting, scrambling, extraposition, etc.). In morphology it is applied to the bracketing paradox problem. It provides a mechanism for a (partial) set of interconnected syntactic or morphological relations. The set is partial in cases when the elements of the catena can be extended with additional syntactic or morphological relations to elements outside of the catena. The relations within the catena cannot be changed.

¹ http://en.wikipedia.org/wiki/Catena_(linguistics)

These characteristics of catena make it a good candidate for representing the various types of Multiword Expressions in lexicons and treebanks. In the lexicons each MWE represented as a catena might specify the potential extension of each element of the catena. As part of the morphemic analysis of compounds, catena is also a good candidate for mapping the elements of the syntactic paraphrase of the compound to its morphemic analysis.

3 Multiword Expressions in BulTreeBank

In its inception and development phase, the HPSG-based Treebank adopted the following principles. When an MWE is fixed, that is, inseparable, with a fixed order, and viewable as a single part of speech, it receives lexical treatment. This group concerns the multiword closed-class parts of speech: multiword prepositions, conjunctions, pronouns and adverbs. There are 1081 occurrences of such multiword closed-class parts of speech in the treebank, which makes around 1.9% of the token occurrences in the text. Thus, this group is not problematic. Of course, there are also exceptions. For example, one of the multiword indefinite pronouns in Bulgarian shows variation in its ending part: каквито и да е/са/било (whatever). The varying part is a 3-person-singular-present-tense auxiliary, a 3-person-plural-present-tense auxiliary or its 3-person-neuter-singular past participle. The semi-fixed expressions (mainly proper names) have been interpreted as Multiword Expressions. However, all the idioms, light verb constructions, etc. have been treated syntactically. This means that in the annotations there is no difference between the literal and the idiomatic meaning of an expression: kick the bucket (= kick some object) and kick the bucket (= die). In both cases we indicated that the verb kick takes its nominal complement.

After some exploration of the treebank, such as the extraction of the valency frames and the training of statistical parsers, we discovered that the present annotations of Multiword Expressions are not the most useful ones. In both applications the corresponding generalizations are overloaded with specific cases which are not easy to incorporate in more consistent classifications. The group of lexically treated POS remained stable. However, the other two groups were reconsidered.

as портфейл (wallet) or роднина (relative), the verb takes canonical complements. In the latter cases, verbs like обръщам (‘to turn’, i.e. pay) take only a noun, внимание (‘attention’), to form an idiom: обръщам внимание на някого (literally ‘to turn attention to somebody’, i.e. pay attention to somebody). Another example is the verb вземам (‘to take’), which combines in such cases with дума (‘word’): вземам думата (literally ‘to take the word’, i.e. take the floor). However, light expressions with desemanticized verbs, such as имам (have) or става (happens), as in имам думата (literally ‘I have the word’, i.e. to have the floor) or става дума за нещо (literally ‘it happens word’, i.e. something refers to something), can take numerous semantic classes as dependants. In this case we mark the information only on the head of these Multiword Expressions. In this approach the assumption is that the verb posits its requirements on its dependants. However, a very detailed valency lexicon is required. One problem with this approach is when the dependant elements allow