Some Aspects of the Morphological Processing of Bulgarian

Milena Slavcheva
Linguistic Modelling Department
Central Laboratory for Parallel Processing
Bulgarian Academy of Sciences
25A, Acad. G. Bonchev St, 1113 Sofia, Bulgaria
[email protected]

Abstract

This paper demonstrates the modelling of morphological knowledge in Bulgarian and applications of the created data sets in an integrated framework for production and manipulation of language resources. The production scenario is exemplified by the Bulgarian verb as the morphologically richest and most problematic part-of-speech category. The definition of the set of morphosyntactic specifications for verbs in the lexicon is described. The application of the tagset in the automatic morphological analysis of text corpora is accounted for. A Type Model of Bulgarian verbs, handling the attachment of short pronominal elements to verbs, is presented.

1 Introduction

The morphological processing of languages is indispensable for most applications in Human Language Technology. Usually, morphological models and their implementations are the primary building blocks in NLP systems.

The development of the computational morphology of a given language has two main stages. The first stage is the building of the morphological database itself. The second stage includes applications of the morphological database in different processing tasks. The interaction and mutual prediction between the two stages determine the linguistic and computational decision-making of each stage.

Bulgarian computational morphology has developed as the result of local (Paskaleva et al. 1993; Popov et al. 1998) and international activities for the compilation of sets of morphosyntactic distinctions and the construction of electronic lexicons (Dimitrova et al. 1998). The need for synchronization and standardization has led to the application to Bulgarian of internationally acknowledged guidelines for morphosyntactic annotation (Slavcheva and Paskaleva 1997), and to the comparison of morphosyntactic tagsets (Slavcheva 1997).

In this paper I demonstrate the production scenario of modelling morphological knowledge in Bulgarian and applications of the created data sets in an integrated framework for production and manipulation of language resources, that is, the BulTreeBank framework (Simov et al. 2002). The production scenario is exemplified by the Bulgarian verb as the morphologically richest and most problematic part-of-speech category. The definition of the set of morphosyntactic specifications for verbs in the lexicon is described. The application of the tagset in the automatic morphological analysis of text corpora is accounted for. Special attention is drawn to the attachment of short pronominal elements to verbs. This phenomenon is difficult to handle in language processing due to its intermediate position between morphology and syntax proper.

The paper is structured as follows. In section 2 the principles of building the latest version of a Bulgarian tagset are pointed out and the subset of the tagset for verbs is exhaustively presented. Section 3 is dedicated to a specially worked out typology of Bulgarian verbs which is suitable for handling the problematic verb forms.

2 Morphosyntactic Specifications for Verbs As a Subset of a Bulgarian Tagset

2.1 Principles of Tagset Construction

The set of morphosyntactic specifications for verbs is a subset of the tagset (Simov, Slavcheva, Osenova 2002) used within the BulTreeBank framework (Simov et al. 2002) for morphological analysis of Bulgarian texts. A tagset for annotating real-world texts can be divided into several subsets according to the types of text units:

1. Tags attached to single word tokens. (These are the common words in the vocabulary: nouns, verbs, adjectives, adverbs, pronouns, prepositions, etc.)

2. Tags attached to multi-word tokens. (Most of them are conjunctions having analytical structure, but also some indefinite pronouns, etc.)

3. Tags attached to abbreviations.

4. Tags attached to named entities which are of various types: person names, topological entities, etc.; film titles, company names, etc.; formulas, single letters, citations of words as used in scientific texts.

5. Tags for punctuation.

The above groups of tags belong to three big divisions of tag types. The tags in items 1 and 2 above include annotation of linguistic items that belong to what is accepted to be a common dictionary. The tags in items 3 and 4 contain annotation of linguistic units that name, generally speaking, different kinds of realities and entities. They are peculiarities of unrestricted, real-life texts. Item 5 refers to the annotation of linguistic items that serve as text formatters. The subject of discussion in this paper is a tagset of the first type, that is, morphosyntactic information attached to words as dictionary units.

As pointed out above, the tagset for Bulgarian is constructed on the basis of the long-term experience acquired in local and international initiatives for the compilation of core sets of morphosyntactic distinctions and the construction of electronic lexicons.

The EAGLES principle of levels of the morphosyntactic annotation is used but it is, so to say, localized. That means that while the EAGLES annotation schemes are constructed for simultaneous application to many languages, in the tagset described here the principle of levels is consistently applied for structuring the morphosyntactic information attached to each part of speech (POS) category in a single language, that is, Bulgarian. The principle of levels is applied in the EAGLES multi-lingual environment as follows. The elaboration of the tags starts with the encoding of the most general information which is applicable to a big range of languages (e.g., the languages spoken in the European Union). It continues with structuring the morphosyntactic information that is considered more or less common to a smaller group of languages. Finally, the single language-specific information is added to the tags. In the monolingual tagset for Bulgarian this scheme of information levels is used for the POS categories.

The next underlying structuring principle that is used in the Bulgarian tagset is that of MULTEXT and MULTEXT-East for ordering the information by defining numbered slots in the tags for the value of each grammatical feature and leaving the slot unoccupied if the feature is not relevant in a given tag. Again, in a multi-lingual environment, the MULTEXT ordering of the grammatical categories is simultaneously applied to a group of languages, while in the Bulgarian tagset it is applied to one and the same POS category of Bulgarian. The ordering of the information starts with the POS category and follows a scale of generality where the more general lexeme features (e.g. type, aspect of the verb) precede the grammatical features describing the wordform (e.g. person, number of verb forms).

The tagset is defined so that the necessary and sufficient information is attached to the word tokens according to the following factors:

• The information is attached on the morphological level (that is, stemming from the lexicon).

• The information is attached to running words in the text (and here is the tricky interplay of form and function of the lexical items).

• There is potential for interfacing this information with the next levels of linguistic representation like, for instance, syntax (that is why we speak about morphosyntactic annotation).

• When defining the specifications in the tags, the levels of linguistic representation (i.e., morphology, syntax, semantics, and pragmatics) are kept distinct as much as possible. That means that the underlying principle is to provide information for the lexical items thinking about them as dictionary units. In connection with the latter principle, another principle is defined, that is, whenever possible, the formal morphological analyses of the lexical items are taken into account, rather than the assignment of functional categories to them, which is the task of the successive levels of linguistic interpretation and representation.

2.2 Format of the Tagset

The information encoded in the morphosyntactic annotation is represented as sets of feature-value pairs. The tags are lists of values of the grammatical features describing the wordforms. The format of the tags is a string of symbols (letters, digits or hyphens) where for each value there is one single symbol that denotes it. The first symbol is a capital letter denoting the POS category. The rest of the string is a mixture of small letters, digits or hyphens. The letters or digits denote the values of the features describing a lexical item. The hyphen means that a given feature is irrelevant for a given lexical item. The hyphen preserves the ordering of the values of features in the tag string by denoting a position. In case the hyphen or hyphens come last in the tag string, that is, no symbol follows them, they are omitted.

2.3 Specifications for the Verb

The grammatical features which are encoded in the verb tagset have ordered positions in the tag strings as shown below.

1:POS, 2:Verb type, 3:Aspect, 4:Transitivity, 5:Clitic attachment, 6:Verb form/Mood, 7:Voice, 8:Tense, 9:Person, 10:Number, 11:Gender, 12:Definiteness

All the descriptions below are in the form of triples where the first element is the name of the grammatical feature, the second element is the value of the grammatical feature and the third element is the abbreviation used in the tag string.

The feature-value pairs describing the verb category are distributed in three levels. The first level of feature-value pairs represents the most general category, that is, the part of speech.

[POS, verb, V]

The second level of description includes features whose values provide the invariant information for a given wordform, that is, the information stemming from the lexeme. This is the information used for the generation of the appropriate type and number of paradigm elements for a given lexeme. For the verb the second level features are: Verb type, Aspect, Transitivity, Clitic attachment. Combinations of those features denote subclasses of verbs. The features, their values, and the abbreviations are given in the following descriptions.

[Verb type, personal, p]
[Verb type, impersonal, n]
[Verb type, auxiliary, x]
[Verb type, semi-impersonal, s]
[Aspect, imperfective, i]
[Aspect, perfective, p]
[Aspect, dual, d]
[Transitivity, transitive, t]
[Transitivity, intransitive, i]
[Clitic attachment, none, 0]
[Clitic attachment, mandatory "se", 1]
[Clitic attachment, mandatory "si", 2]
[Clitic attachment, mandatory acc.pron., 3]
[Clitic attachment, mandatory dat.pron., 4]
[Clitic attachment, mandatory dat.pron.+se, 5]
[Clitic attachment, optional "se", 6]
[Clitic attachment, optional "si", 7]

The values of the third level features define the variant information for a given word form, that is, the grammatical information carried by the various inflections. This is the level of most specific information. The third level features for the verb are: Verb form/Mood, Voice, Tense, Person, Number, Gender, Definiteness.

[Verb form/Mood, Finite indicative, f]
[Verb form/Mood, Finite imperative, z]
[Verb form/Mood, Finite conditional, u]
[Verb form/Mood, Non-finite participle, c]
[Verb form/Mood, Non-finite gerund, g]
[Voice, active, a]
[Voice, passive, v]
[Tense, present, r]
[Tense, aorist, o]
[Tense, imperfect, m]
[Tense, past, t]
[Person, first, 1]
[Person, second, 2]
[Person, third, 3]
[Number, singular, s]
[Number, plural, p]
[Gender, masculine, m]
[Gender, feminine, f]
[Gender, neuter, n]
[Definiteness, indefinite, i]
[Definiteness, definite, d]
[Definiteness, Short definite form, h]
[Definiteness, Full definite form, f]

3 Type Model of Bulgarian Verbs and its Application in Lexicon Construction

The type model is the underlying factor in defining the morphosyntactic schemes for verbs and the scheme transformations necessary in different applications. Four initial Verb Types are defined: personal, impersonal, semi-personal and auxiliary. The definition of the types is triggered by the necessity to determine the relevant and optimal combinations of second level features which generate the correct paradigms of verbs belonging to the respective verb type. A decisive factor for the typology is the combination of verbs with short pronominal elements. It is necessary to differentiate, on the one hand, the attachment of short pronominals as an integral part of the lexeme for some groups of verbs (and consequently of the whole paradigm), and, on the other hand, the generation of combinations between verb forms and short pronominals when grammatical structures of various meanings come out.

At this point it should be noted that the electronic lexicon that is used for automatic morphosyntactic annotation in the BulTreeBank framework (Popov et al. 1998) follows the traditional, "paper-dictionary" subcategorization of verbs into personal, impersonal and auxiliary. Also, the morphological analyzer identifies only single word tokens, that is, strings of symbols between white spaces. Orthographically, the short pronominal elements in Bulgarian are always separate word tokens and change their place around the verb according to language-specific phonological rules. In such a way, the full Type Model which takes into account the pronominal elements is, so to say, "switched off". It can be easily "switched on", since the full Type Model categories are subsets of the categories belonging to the model applied at present. In the BulTreebank tagset the slot for the values of the feature Clitic attachment is filled by a hyphen, that is, the subcategorization according to the attachment of short pronominals is switched off, but it is easily recoverable when required.

Now let us consider the full Type Model proper and the templates of morphosyntactic specifications defined by the possible combinations of second level features, that is, features describing the invariant, lexeme information. The full Type Model is already applied in practice in the paradigmatic dimension of lexicon construction: 17909 Bulgarian verbs have been classified according to the model (Slavcheva 2002a).

3.1 Type Personal Verbs

The greatest number of verbs belong to this type. The personal verbs have a full paradigm of inflected forms. The number of the paradigm members depends on the features Aspect and Transitivity. The possible values of the feature Clitic attachment are: none, mandatory se, mandatory si. In the working variant of the dictionary there exist the values optional se, optional si which are used for the generation of verb lexemes containing a reflexive formant (i.e., se or si).

The combination between a personal verb and the short accusative reflexive pronominal element se defines the following classes of verbs:

1. Intransitive verbs with obligatory accusative reflexive element se (e.g., usmihvam se 'smile'), which have no correlates without se.

2. So-called medium verbs (e.g., karam se 'quarrel') which have correlates without se (e.g., karam 'drive') but the meaning of the two correlates is quite different. The short reflexive pronoun is not interchangeable with the full form of the reflexive pronoun sebe si.

3. Verbs denoting a reciprocal action (e.g., bia se 'fight').

4. Verbs that can be defined as reflexive per se, that is, the subject and the object of the action coincide. The subject is prevailingly animate. The interesting linguistic fact about those verbs is that, theoretically and logically, the alternation of short and full forms of the accusative reflexive pronoun is possible, but in reality the usage of the full form is communicatively very strongly marked and is not common at all. This fact supports the assumption that the combination between a verb and the short reflexive se is lexicalized (e.g., aboniram 'subscribe smb.' / aboniram se 'subscribe self').

The combination between a personal verb and the short dative reflexive pronominal element si defines the following classes of verbs:

1. Transitive and intransitive verbs with obligatory dative reflexive element si (e.g., vaobraziavam si 'imagine'), which have no correlates without si.

2. Medium verbs (e.g., tragvam si 'go home') which have correlates without si (e.g., tragvam 'go') but the meaning of the two correlates is quite different.

3.2 Type Impersonal Verbs

The verbs belonging to this class have the smallest paradigm: the finite forms are only in the third person singular, and the participles are only in the neuter singular. The attribute Transitivity is irrelevant for the impersonal verbs. The possible values of the feature Clitic attachment are: none, mandatory acc pers pron, mandatory dat pers pron, mandatory dat pers pron+se, mandatory se. The combination between an impersonal verb and the short pronominals results in the following classes:

1. Impersonal verbs without short pronominals (e.g., samva 'dawn').

2. Impersonal verbs with se, which are formal variants of some verbs belonging to class 1, that is, there is no difference in the meaning (e.g., samva se 'dawn').

3. Impersonal verbs with obligatory short accusative personal pronoun, short dative personal pronoun or short dative personal pronoun + se (e.g., marzi me 'to be lazy', dozsaliava mi 'to feel pity', gadi mi se 'to feel sick'). The verbs in this class have no correlated forms of personal verbs without pronominals.

4. Impersonal verbs with short pronominals, which have correlated forms of personal verbs without short pronominals, but the attachment of the pronominals changes the meaning and triggers the differentiation of independent verb lexemes of impersonal verbs with pronominals (e.g., trese 'shake' / trese me 'be in a fever', struva 'cost' / struva mi se 'it appears to me').

3.3 Type Semi-personal Verbs

The definition of this innovative type of verbs is triggered by the idiosyncrasies of the paradigm, the argument structure and the obligatory attachment of short personal pronouns. The verbs in this class have features in common both with the personal and the impersonal verbs and it is most convenient to isolate them in a separate class. The semi-personal verbs resemble the personal verbs in having a much bigger paradigm compared to the impersonal ones. In fact, forms in the first and second person singular and plural are not used (e.g., vali 'to rain', boli me 'it hurts me'). The semi-personal verbs can form sentences which structurally coincide with sentences of personal verbs, that is, they have a full-fledged subject, but the set of nouns that can occupy the subject position is rather small, hence the argument structure is rather specific. (E.g., Valiat porojni dazsdove. 'Heavy rains fall.' Krakata me boliat. 'My legs hurt me.')

The semi-personal verbs also have features in common with the impersonal verbs. They have the same possible combinations with the short pronominals as the impersonal verbs have. The feature Transitivity is irrelevant, as it is with the impersonal verbs. The subcategorization of the semi-personal verbs is analogous to that of the impersonal verbs (see items 1-4 for the impersonal verbs above).

3.4 Type Auxiliary Verbs

The small number of auxiliary verbs have an idiosyncratic paradigm. The features Aspect, Transitivity and Clitic attachment are irrelevant for them.

4 Conclusion and Further Development

In section 3, the application of the verb Type Model in a paradigmatic dimension has been considered. A very important practical issue is how the morphosyntactic information encoded in the second level features (i.e., the lexeme information) can be used in a syntagmatic dimension, that is, pattern recognition and annotation in running texts. The issue of crucial importance is the utilization of the Clitic attachment information.

Within the BulTreebank framework, a cascaded regular grammar has been built for the segmentation, pattern recognition and category assignment of Bulgarian compound verb forms as linguistic entities in XML documents (Slavcheva 2002b). In the segmentation model, the short pronominals are included into the compound verb forms of all types of verbs which consist of different combinations among short pronominals, particles and auxiliary verbs. At present the grammar for parsing compound verb forms does not discriminate between cliticized verb forms which are lexemes per se and cliticized verb forms which are purely grammatical. Thus an immediate application of the Type Model and the data set of approximately 18000 subcategorized verbs would be the construction of a discriminating parser for the different types of cliticized verb forms. In its turn, this more detailed morphosyntactic differentiation can be used as a source for predictions of the valency frame alternations in a machine-aided construction of the syntactic structure of sentences.

5 Acknowledgment

The work presented in this paper is supported by the BulTreebank project, funded by the Volkswagen Foundation, Federal Republic of Germany, under the Programme "Cooperation with Natural and Engineering Scientists in Central and Eastern Europe", contract I/76887.

References

Dimitrova, L., Erjavec, T., Ide, N., Kaalep, H.J., Petkevič, V., Tufiş, D. (1998) "Multext-East: Parallel and Comparable Corpora and Lexicons for Six Central and Eastern European Languages." In Proceedings of COLING-ACL'98, Montréal, Québec, Canada, pp.315-319.

Paskaleva, E., Simov, K., Damova, M., Slavcheva, M. (1993) "The Long Journey from the Core to the Real Size of a Large LDB". In Proceedings of ACL Workshop "Acquisition of Lexical Knowledge from Text", Columbus, Ohio, pp.161-169.

Popov, D., Simov, K., Vidinska, S. (1998) Dictionary of Writing, Pronunciation and Punctuation of Bulgarian. Atlantis LK, Sofia.

Simov, K., Peev, Z., Kouylekov, M., Simov, A., Dimitrov, M., Kiryakov, A. (2001) "CLaRK - an XML-based System for Corpora Development." In Proceedings of the Corpus Linguistics 2001 Conference, pp.558-560.

Simov, K., P. Osenova, M. Slavcheva, S. Kolhovska, E. Balabanova, D. Doikov, K. Ivanova, A. Simov, M. Kouylekov. (2002) "Building a Linguistically Interpreted Corpus of Bulgarian: the BulTreeBank." In Proceedings of LREC 2002, Canary Islands, Spain, pp.1729-1736.

Simov, K., Slavcheva, M., Osenova, P. (2002) "BulTreeBank Morphosyntactic Tagset." BulTreeBank Report, Sofia, Bulgaria.

Slavcheva, M. (1997) "A Comparative Representation of Two Bulgarian Morphosyntactic Tagsets and the EAGLES Encoding Standard", TELRI I COPERNICUS Concerted Action 1202, Working Group 3 "Morphosyntactic Annotation", Report. http://www.lml.bas.bg/projects/BG-EUstand/

Slavcheva, M. (2002a) "Language Technology and - Classificational Model of the Verb". In Proceedings of the Conference "Slavistics in 21 Century. Traditions and Expectations.", SEMASH Publishing House, Sofia, Bulgaria, pp.240-247 (in Bulgarian).

Slavcheva, M. (2002b) "Segmentation Layers in the Group of the Predicate: a Case Study of Bulgarian within the BulTreeBank Framework". In Proceedings of the International Workshop "Treebanks and Linguistic Theories", Sozopol, Bulgaria, pp.199-209.

Slavcheva, M., Paskaleva, E. (1997) "Application to Bulgarian. A contribution to the EAGLES Synopsis and Comparison of Morphosyntactic Phenomena Encoded in Lexicons and Corpora. A Common Proposal and Applications to European Languages." TELRI I COPERNICUS Concerted Action 1202, Working Group 3 "Morphosyntactic Annotation", Report. http://www.lml.bas.bg/projects/BG-EUstand/eagles/index.html
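The positional tag format of section 2.2 and the verb slots of section 2.3 can be illustrated with a small sketch. It is only an illustration under stated assumptions: the function names decode and encode and the two sample tags are hypothetical and are not taken from the BulTreeBank tools. The value tables are transcribed from the feature triples of section 2.3, with slot 7 read as Voice and the values participle, aorist and plural read from context.

```python
from typing import Dict, Optional

# Slot order as listed in section 2.3 (slot 7 is read as Voice).
SLOTS = [
    "POS", "Verb type", "Aspect", "Transitivity", "Clitic attachment",
    "Verb form/Mood", "Voice", "Tense", "Person", "Number", "Gender",
    "Definiteness",
]

# symbol -> value for each feature, following the triples of section 2.3.
# "participle", "aorist" and "plural" are readings inferred from context.
VALUES: Dict[str, Dict[str, str]] = {
    "POS": {"V": "verb"},
    "Verb type": {"p": "personal", "n": "impersonal", "x": "auxiliary",
                  "s": "semi-impersonal"},
    "Aspect": {"i": "imperfective", "p": "perfective", "d": "dual"},
    "Transitivity": {"t": "transitive", "i": "intransitive"},
    "Clitic attachment": {
        "0": "none", "1": "mandatory se", "2": "mandatory si",
        "3": "mandatory acc.pron.", "4": "mandatory dat.pron.",
        "5": "mandatory dat.pron.+se", "6": "optional se",
        "7": "optional si",
    },
    "Verb form/Mood": {"f": "finite indicative", "z": "finite imperative",
                       "u": "finite conditional",
                       "c": "non-finite participle",
                       "g": "non-finite gerund"},
    "Voice": {"a": "active", "v": "passive"},
    "Tense": {"r": "present", "o": "aorist", "m": "imperfect", "t": "past"},
    "Person": {"1": "first", "2": "second", "3": "third"},
    "Number": {"s": "singular", "p": "plural"},
    "Gender": {"m": "masculine", "f": "feminine", "n": "neuter"},
    "Definiteness": {"i": "indefinite", "d": "definite",
                     "h": "short definite form", "f": "full definite form"},
}


def decode(tag: str) -> Dict[str, Optional[str]]:
    """Map a tag string to {feature: value}; None marks an irrelevant slot."""
    if not tag or len(tag) > len(SLOTS):
        raise ValueError(f"bad tag length: {tag!r}")
    padded = tag.ljust(len(SLOTS), "-")  # restore omitted trailing hyphens
    features: Dict[str, Optional[str]] = {}
    for name, symbol in zip(SLOTS, padded):
        if symbol == "-":
            features[name] = None
        elif symbol in VALUES[name]:
            features[name] = VALUES[name][symbol]
        else:
            raise ValueError(f"unknown symbol {symbol!r} for {name}")
    return features


def encode(features: Dict[str, Optional[str]]) -> str:
    """Inverse of decode: one symbol per slot, trailing hyphens omitted."""
    symbols = []
    for name in SLOTS:
        value = features.get(name)
        if value is None:
            symbols.append("-")
            continue
        inverse = {v: s for s, v in VALUES[name].items()}
        symbols.append(inverse[value])
    return "".join(symbols).rstrip("-")


if __name__ == "__main__":
    tag = "Vpit-f-r1s"  # hypothetical tag, for illustration only
    print(decode(tag))
    print(encode(decode(tag)))
```

Because every value is a single symbol and irrelevant features keep their position as a hyphen, decoding reduces to pairing the padded string with the slot list, and encoding only has to strip the trailing hyphens again.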
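The constraints that the Type Model of section 3 places on the second level features can likewise be sketched as a small validity check. This is a hedged illustration, not the classification procedure used for the lexicon: the names lexeme_prefix and switch_off_clitics are hypothetical, the aspect values in the examples are illustrative assumptions, and the five-symbol prefix (slots 1 to 5 of the tag) is a device of this sketch. The label semi-personal follows section 3.3 and is mapped to the symbol s of the tagset.

```python
from typing import Optional

TYPE_CODES = {"personal": "p", "impersonal": "n",
              "semi-personal": "s", "auxiliary": "x"}
ASPECT_CODES = {"imperfective": "i", "perfective": "p", "dual": "d"}
TRANSITIVITY_CODES = {"transitive": "t", "intransitive": "i"}
CLITIC_CODES = {
    "none": "0", "mandatory se": "1", "mandatory si": "2",
    "mandatory acc.pron.": "3", "mandatory dat.pron.": "4",
    "mandatory dat.pron.+se": "5", "optional se": "6", "optional si": "7",
}

# Clitic attachment values per verb type, as stated in sections 3.1-3.4.
_IMPERSONAL_CLITICS = {"none", "mandatory se", "mandatory acc.pron.",
                       "mandatory dat.pron.", "mandatory dat.pron.+se"}
ALLOWED_CLITICS = {
    "personal": {"none", "mandatory se", "mandatory si",
                 "optional se", "optional si"},
    "impersonal": _IMPERSONAL_CLITICS,
    "semi-personal": _IMPERSONAL_CLITICS,  # same combinations as impersonal
    "auxiliary": set(),                    # feature irrelevant
}


def lexeme_prefix(verb_type: str,
                  aspect: Optional[str] = None,
                  transitivity: Optional[str] = None,
                  clitic: Optional[str] = None) -> str:
    """Return slots 1-5 of a verb tag, rejecting combinations that the
    Type Model does not license (hypothetical helper)."""
    if verb_type not in TYPE_CODES:
        raise ValueError(f"unknown verb type: {verb_type!r}")
    if verb_type == "auxiliary":
        # Aspect, Transitivity and Clitic attachment are irrelevant.
        if (aspect, transitivity, clitic) != (None, None, None):
            raise ValueError("auxiliary verbs take no second level values")
        return "Vx---"
    if aspect not in ASPECT_CODES:
        raise ValueError(f"unknown aspect: {aspect!r}")
    if clitic not in ALLOWED_CLITICS[verb_type]:
        raise ValueError(f"{clitic!r} not allowed for {verb_type} verbs")
    if verb_type == "personal":
        if transitivity not in TRANSITIVITY_CODES:
            raise ValueError(f"unknown transitivity: {transitivity!r}")
        trans = TRANSITIVITY_CODES[transitivity]
    else:
        # Transitivity is irrelevant for impersonal and semi-personal verbs.
        if transitivity is not None:
            raise ValueError(f"transitivity irrelevant for {verb_type} verbs")
        trans = "-"
    return ("V" + TYPE_CODES[verb_type] + ASPECT_CODES[aspect]
            + trans + CLITIC_CODES[clitic])


def switch_off_clitics(prefix: str) -> str:
    """Fill the Clitic attachment slot (slot 5) with a hyphen."""
    if len(prefix) != 5:
        raise ValueError("expected a five-symbol prefix")
    return prefix[:4] + "-"


if __name__ == "__main__":
    # usmihvam se 'smile': personal, intransitive, mandatory se
    full = lexeme_prefix("personal", "imperfective", "intransitive",
                         "mandatory se")
    print(full, switch_off_clitics(full))
```

The second helper mirrors the "switched off" state described in section 3: the full Type Model prefix keeps the clitic value, and replacing slot 5 by a hyphen gives the coarser form without touching the other slots.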