Treatment of Multiword Expressions and Compounds in Bulgarian Petya Osenova and Kiril Simov Linguistic Modelling Deparment, IICT-BAS Acad. G. Bonchev 25A, 1113 Sofia, Bulgaria
[email protected] and
[email protected] Abstract 2012)), we will adopt the Multiword Expressions The paper shows that catena represen- classification, presented in (Sag et al., 2001). They tation together with valence information divide them into two groups: lexicalized phrases can provide a good way of encoding and institutionalized phrases. The former are fur- Multiword Expressions (beyond idioms). ther subdivided into fixed-expressions, semi-fixed It also discusses a strategy for mapping expressions and syntactically-flexible expressions. noun/verb compounds with their counter- Fixed expressions are said to be fully lexicalized part syntactic phrases. The data on Mul- and undergoing neither morphosyntactic variation tiword Expression comes from BulTree- nor internal modification. Semi-fixed expressions Bank, while the data on compounds comes have a fixed word order, but “undergo some degree from a morphological dictionary of Bul- of lexical variation, e.g. in the form of inflection, garian. variation in reflexive form, and determiner selec- tion” (non-decomposable idioms, proper names). 1 Introduction Syntactically-flexible expressions show more vari- Our work is based on the annotation of Multi- ation in their word order (light verb constructions, word Expressions (MWE) in the Bulgarian tree- decomposable idioms). We follow the understand- bank — BulTreeBank (Simov et al., 2004). We ing of (O’Grady, 1998) that MWEs have their in- use this representation for parsing and analysis of ternal syntactic structure which needs to be rep- compounds. BulTreeBank exists in two formats: resented in the lexicon as well as in the sentence HPSG-based (original - constituent-based with analysis.