Enriching Plwordnet with Morphology
Total Page:16
File Type:pdf, Size:1020Kb
Enriching plWordNet with morphology Agnieszka Dziob and Wiktor Walentynowicz G4.19 Research Group, Department of Computational Intelligence Wrocław University of Technology, Wrocław, Poland {agnieszka.dziob,wiktor.walentynowicz}@pwr.edu.pl Abstract tions it enters. Thus, LUs having different rela- tions in the language system cannot be treated as In the paper, we present the process of synonyms and belong to the same synset (Derwo- adding morphological information to the jedowa et al., 2008). Maziarz et al. (2013) for- Polish WordNet (plWordNet). We de- mulated the concept of constitutive relations and scribe the reasons for this connection and constitutive features, i.e. those which differentiate the intuitions behind it. We also draw at- the meaning. They include all synset relations, ex- tention to the specificity of the Polish mor- cept fuzzynymy, and LU features, such as stylistic phology. We show in which tasks the mor- register, aspect for verbs, and semantic classes for phological information is important and adjectives and verbs (Maziarz et al., 2016). There- how the methods can be developed by ex- fore, there are no theoretical and methodological tending them to include combined mor- assumptions, which would allow to define inflec- phological information based on WordNet. tional features as distinguishing meanings of an LU in plWordNet. At the same time, it was also 1 Introduction not explicitly stated that morphological features plWordNet is a very large wordnet of Polish. cannot influence meaning. The 4.1 version presented in Dziob et al. (2019) Lexicographic works are ongoing, and currently contains 192,495 lemmas, 290,366 lexical units focus on completing the structure of plWord- (henceforth, LUs), and 224,179 synsets belonging Net with new LUs, increasing the density of to four parts of speech: verbs, adjectives, adverbs lexico-semantic relations, and correcting errors and nouns. Since 2012, there have been carried resulting from manual work. The most recent out ongoing works on its connection to Princeton work has consisted of connecting morphological WordNet (Rudnicka et al., 2012). descriptions from the Grammatical Dictionary of For the description of synsets and LUs, lexical- Polish (pol. Słownik Gramatyczny J˛ezyka Pol- semantic relations are used, the concept of which skiego, henceforth SGJP) (Saloni et al., 2007). was taken from Princeton WordNet (Fellbaum, This process has four practical objectives: 1) in- 1998) and EuroWordNet (Vossen, 2002). In the vestigating the influence of morphological char- course of those works, the need emerged to create acteristics of LUs on their semantic description; new relations specific to the Polish language (Pi- 2) verifying the membership to parts of speech of asecki et al., 2012). It is related to the necessity morphologically ambiguous LUs and lemmas; 3) of describing derivational dependencies for a lan- completing the semantic description of the exist- guage with a rich derivational morphology. Inflec- ing lemmas with new senses, based on their mor- tional morphology was not dealt with in plWord- phological description in the SGJP; 4) building a Net. Following Miller (1995), we assumed that practical resource, combining semantic and mor- describing inflectional morphology is the task of a phological description, for tasks related to the pro- separate resource that is morphological dictionar- cessing of the Polish language. The result of this ies for the Polish language, which support the use work is plWordNet combined with the morpholog- of plWordNet. ical description from SGJP, created automatically In plWordNet meaning is defined according to with a manual disambiguation of morphologically the assumptions of relational semantics (Lyons, ambiguous lemmas, i.e. those which have more 1977) that is as a bundle of lexico-semantic rela- than one pattern of inflection. The purpose of this paper is to present the re- ogy has also been carried out on a larger scale for sults of combining resources and to indicate the Bulgarian (Koeva, 2008). applications of them. The works that are the closest to those presented in this paper were described in Obradovic´ and 2 The Problem of Morphology Stankovic´ (2007). The authors have developed a Description tool for the creation of complex lexicographic data 2.1 The Morphological Description in obtained from wordnets, morphological dictionar- Wordnets ies, and text corpora. It is possible to highlight several important as- The assumption of Princeton WordNet was a se- pects of morphological description in wordnets: mantic description, i.e. including derivational, not 1) on a wider scale it describes derivational, not inflectional morphology (Miller et al., 1990). Still, inflectional morphology; 2) there is a regular re- morphological data resources are developed as in- lationship between inflection and derivation, but dependent linguistic databases. One of them is these two levels of description are not treated as CELEX – a lexical database for Dutch and English equivalent; 3) the combination of semantic and (Van der Wouden, 1990), extended in 2.0 version morphological description (both, at the inflec- with German (Baayen et al., 1995). It contains tional and derivational level) is useful, e.g. for the orthographic, phonological, morphological (also tasks related to Word Sense Disambiguation and, inflectional), syntactic, and statistical information in connection with it, the extraction of information (frequency in text corpora), but it does not contain from texts and building knowledge bases. semantics. The morphological description of the deriva- 2.2 The Specificity of the Polish Morphology tion and inflection in CELEX offers great op- portunities to enrich it with semantic informa- As already mentioned, plWordNet describes four tion. Hathout (2002) describes combining the parts of speech: verbs, adjectives, adverbs, and morphological resources of CELEX for the En- nouns. In Polish, nouns and adjectives are in- glish language with Princeton WordNetto create flected by seven cases in two numbers: singular a language-independent mechanism for obtaining and plural. Furthermore, adjectives are inflected semantic relational data (synonyms and deriva- for gender, while nouns are always lexically spec- tives). Similar studies with the use of CELEX ified for grammatical gender. and semantic-relational thesauri were conducted There are five genders: three masculine (per- for Dutch (Kraaij and Pohlmann, 1996). This re- sonal, animate, and inanimate), feminine and search is particularly applicable to the extraction neuter (Lewandowska-Tomaszczyk et al., 2012). of information and knowledge from text. For example, pies (zw) ‘dog’ and pies (os) ‘cop’ The inflectional morphology is less of a prob- (depreciative and colloquial) differ in their gen- lem for languages with residual inflection, such ders and, consequently, in the pattern of the inflec- as English, but a serious challenge for highly in- tion. The gender of adjectives depends on the gen- flected languages. In the paper Henrich et al. der of nouns they have syntactic relations in the (2012), semantic data from Princeton Word- text, e.g. szafa ‘wardrobe’, feminine, czerwon-a Net and GermaNet (Hamp and Feldweg, 1997) ‘red’ an adjective, feminine. Moreover, adjectives and morphological data for these languages from and adverbs have three degrees: positive, compar- Wiktionary were used to create a method for ative, and superlative. sense-annotation and automatically annotated text Verbs in Polish are inflected for two numbers corpora. and three persons for each of them, four tenses (in- Slavic languages have an even more compli- cluding two future ones that differ with each other cated inflection than Germanic ones. The paper only grammatically), and three modes of express- of Pala and Hlaváckovᡠ(2007) presents the re- ing modality (indicative, conditional, imperative). sults of the work consisting of adapting a mech- They have an assigned aspect having grammati- anism of the Czech morphological analyzer Ajka cal and semantic functions (Dziob and Piasecki, (Sedlácekˇ and Smrž, 2001) to extend the Czech 2018). They form gerunds and four types of par- WordNet with derivational relations. For Slavic ticiples, which are treated as forms of the verb. A languages, the research on derivational morphol- semantic-syntactic feature of the verb (to a lesser extent also of other parts of speech) is valence, un- Net contains, allows searching for new patterns in derstood as the ability of predicates to attach ar- data. guments in specific forms and syntactic positions (Przepiórkowski et al., 2014). For example, the 3 Resources verb jes´c´ ‘to eat’ opens three syntactic positions, 3.1 Morfeusz and SGJP for a subject, an object, and a circumstance. In this case, the object is expressed as a noun in the SGJP (Saloni, 2012) aims to give grammatical accusative case. Walenty is connected at the se- characteristics of Polish words. The main ele- mantic layer to the plWordNet by using synsets ment of this characteristic is an open description to determine a semantic preference, e.g. ‘jes´c’´ + of the inflection of units taken into account by giv- jedzenie 2 ‘food’. ing all their forms of inflection and determining Traditional Polish grammars cf. (Grzegor- their grammatical functions. The dictionary does czykowa et al., 1998) distinguish a quite large not contain lexeme sense. Morfeusz2 (Wolinski,´ group of numbers and pronouns. In syntactically 2014) is a morphological analyzer that can use oriented grammars, e.g. (Saloni, 2012),