Where Bears Have the Eyes of Currant: Towards a Mansi WordNet
Csilla Horváth¹, Ágoston Nagy¹, Norbert Szilágyi², Veronika Vincze³
¹University of Szeged, Institute of English–American Studies, Egyetem u. 2., 6720 Szeged, Hungary
²University of Szeged, Department of Finno-Ugrian Studies, Egyetem u. 2., 6720 Szeged, Hungary
³Hungarian Academy of Sciences, Research Group on Artificial Intelligence, Tisza Lajos krt. 103., 6720 Szeged, Hungary

Abstract

Here we report the construction of a wordnet for Mansi, an endangered minority language spoken in Russia. We pay special attention to the challenges that we encountered during the building process, the most important of which are the low number of native speakers, the lack of thesauri and the bear language. We discuss our solutions to these issues, which might have some theoretical implications for the methodology of wordnet building in general.

1 Introduction

Wordnets are lexical databases that are organized according to semantic and lexical relations between groups of words. They are supposed to reflect the internal organization of the human mind (Miller et al., 1990). The first wordnet was constructed for English (Miller et al., 1990), and since then wordnets have been built for many languages, including several European languages, mostly in the framework of EuroWordNet and BalkaNet (Alonge et al., 1998; Tufiş et al., 2004), and other languages such as Arabic, Chinese, Persian, Hindi, Tulu, Dravidian, Tamil, Telugu, Sanskrit, Assamese, Filipino, Gujarati, Nepali, Kurdish and Sinhala (Tanács et al., 2008; Bhattacharyya et al., 2010; Fellbaum and Vossen, 2012; Orav et al., 2014). Synsets within wordnets for different languages are usually linked to each other, so concepts from one language can be easily mapped to those in another language. Wordnets can be beneficial for several natural language processing tasks, be they mono- or multilingual, for instance machine translation and information retrieval.

In this paper, we aim at constructing a wordnet for Mansi, an indigenous language spoken in Russia. Mansi is an endangered minority language, with fewer than 1000 native speakers. Most often, minority languages are not recognized as official languages in their respective countries, where there is an official language (in this case, Russian) and one or several minority languages (e.g. Mansi, Nenets, Saami etc.). Hence, the speakers of minority languages are bilingual and usually use the official or majority language in their studies and work, and the language of administration is the majority language as well. The minority language, in contrast, is typically restricted to the private sphere, i.e. it is used among family members and friends, and thus mostly in oral communication, with only sporadic examples of writing in the minority language (Vincze et al., 2015). Also, the cultural and ethnographic background of the Mansi people may affect language use: certain artifacts used by Mansi people that are unknown to Western cultures have their own vocabulary items in Mansi, and vice versa, certain concepts used by Western people are unknown to Mansi people, so there are no lexicalized terms for them.

The construction of a Mansi wordnet helps us explore how a wordnet can be built for a language that is both a minority language and an endangered one. Thus, we will investigate the following issues in this paper:

• What are the special features of constructing a wordnet for a minority language?
• What are the special features of constructing a wordnet for an endangered language?
• What are the special features of constructing a wordnet for Mansi?

The paper has the following structure. First, the Mansi language is briefly presented from linguistic, sociolinguistic and language policy perspectives. Then our methods for building the Mansi wordnet are discussed, with special emphasis on the specific challenges posed by endangered and minority languages in general and by Mansi in particular. Later, statistical data are analysed and our results are discussed in detail. Finally, a summary concludes the paper.

2 The Mansi Language

Mansi (former term: Vogul) is an extremely endangered indigenous Uralic (more precisely Finno-Ugric, Ugric, Ob-Ugric) language, spoken in Western Siberia, especially on the territory of the Khanty-Mansi Autonomous Okrug. Of the approximately 13,000 people who declared themselves ethnic Mansi in the latest Russian federal census in 2010, only 938 stated that they could speak the Mansi language.

The Mansi have traditionally lived on hunting and fishing, and to a lesser extent on reindeer breeding; they became acquainted with agriculture and urban lifestyle mainly during the Soviet period. The principles of Soviet linguistic policy according to which the Mansi literary language was designed kept changing over time. After using Latin transcription for a short period, Mansi language planners had to switch to Cyrillic transcription in 1937. While until the 1950s the more general tendency was to create new Mansi words to describe formerly unknown phenomena, later on the use of Russian loanwords became more dominant. As a result of these tendencies, some of the terms describing the contemporary environment, urban lifestyle and the Russian-dominated culture are Russian loanwords, while others are Mansi neologisms created by Mansi linguists and journalists. It is not uncommon to find two or even three different synonyms describing the same phenomenon (for example, hospital): borrowing the word from Russian (больница), using the Russian loanword in a form adapted to Mansi phonology (пӯльница), or using a Mansi neologism to describe it (ма̄хум пусмалтан кол ‘a house for healing people, hospital’, as opposed to ня̄врам пусмалтан кол ‘children’s hospital, children’s clinic’ or ӯйхул пусмалтан кол ‘veterinary clinic’).

3 Semi-automatic construction of the Mansi WordNet

In this section, we present our methods for constructing the Mansi WordNet. We also pay special attention to the most challenging issues concerning wordnet building.

3.1 Low number of native speakers

The first and greatest problem we met while creating the Mansi wordnet was that only a handful of native speakers have been trained in linguistics. Thus, we worked with specialists of the Mansi language who have been trained in linguistics and technology, but do not have native competence in Mansi.

As it is not cost-effective to build a wordnet from scratch and as our annotators are native speakers of Hungarian, we used the Hungarian WordNet (Miháltz et al., 2008) as a starting point. First, we decided to include basic synsets; the number of synsets is planned to be expanded continuously later on. We used the Basic Concepts, already introduced in EuroWordNet, as a starting point: this set contains the synsets that are considered the most basic conceptual units universally.

3.2 Already existing resources

In order to accelerate the whole task and to ease the work of the Mansi language experts, the wordnet creation process was carried out semi-automatically. Since there is no native speaker available who could solve the problems requiring native competence, we were forced to utilize the available sources as creatively as possible.

First, the basic concept sets were extracted from the Hungarian WordNet XML file and, at the same time, the non-lexicalized elements were filtered out, as in this phase we intend to focus only on lexicalized elements.

Second, we used a Hungarian-Mansi dictionary to create possible translations for the members of the synsets. The dictionary we use in the process is based on different Mansi-Russian dictionaries (e.g. Rombandeeva (2005), Balandin and Vahruševa (1958), Rombandeeva and Kuzakova (1982)). The translation of all Mansi entries into Hungarian and English in the new dictionary is being done independently of the wordnet development (Vincze et al., 2015).

In order to avoid translating all Hungarian entries of the WordNet into Mansi again, a program was developed to replace the Hungarian terms with the already existing translations from the dictionary. Only literals are replaced; definitions and examples are left untouched, so that the linguists can check the actual meaning and can later replace them with their Mansi equivalents. The Mansi specialists’ role is to check the automatic replacement and to propose new term candidates if there is no proper automatic translation.

In this work phase, as there are no synonym dictionaries or thesauri available for the Mansi language, the above-mentioned bilingual student dictionaries are used as primary resources.

[...] WordNet (Miller et al., 1990).

3.3 Bear language

Another very special problem that occurred during wordnet building for Mansi concerns the status of the so-called “bear language”. The bear is a prominently sacred animal venerated by the Mansi, bearing great mythical and ritual significance, and it is also surrounded by a detailed taboo language. Since the bear is believed to understand human speech (and also to have sharp ears), it is respectful and cautious to use taboo words while speaking about the bear, the parts of its body, or any activity connected with the bear (especially bear hunting), so that the bear will not understand what is said. The taboo words of this “bear language” may be divided into two major subgroups: Mansi words which have a different, special meaning when used in connection with the bear (e.g. сосыг ‘currant’ but also meaning ‘eye’, when speaking of the bear’s eyes), and those which may be used solely in connection
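The replacement step described in Section 3.2 (keep only lexicalized Basic Concept synsets, replace Hungarian literals with Mansi candidates from the dictionary, leave definitions untouched, and flag what the specialists still have to fill in) might be sketched as follows. This is only an illustration under stated assumptions, not the program actually used: the XML layout (the `BCS`, `NL`, `LITERAL` and `DEF` elements), the function name `translate_synsets`, the Hungarian definitions and the toy dictionary are all hypothetical, and the real Hungarian WordNet schema may differ. The Mansi forms are the ones quoted in this paper.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the Hungarian WordNet XML file.
SAMPLE_XML = """<WN>
  <SYNSET><ID>s1</ID><BCS>1</BCS>
    <SYNONYM><LITERAL>kórház</LITERAL></SYNONYM>
    <DEF>betegek gyógyítására szolgáló intézmény</DEF>
  </SYNSET>
  <SYNSET><ID>s2</ID><BCS>1</BCS>
    <SYNONYM><LITERAL>ribiszke</LITERAL></SYNONYM>
    <DEF>bogyós gyümölcs</DEF>
  </SYNSET>
  <SYNSET><ID>s3</ID><BCS>1</BCS><NL>yes</NL>
    <SYNONYM><LITERAL>nem lexikalizált fogalom</LITERAL></SYNONYM>
    <DEF>példa</DEF>
  </SYNSET>
  <SYNSET><ID>s4</ID>
    <SYNONYM><LITERAL>ablak</LITERAL></SYNONYM>
    <DEF>példa</DEF>
  </SYNSET>
  <SYNSET><ID>s5</ID><BCS>1</BCS>
    <SYNONYM><LITERAL>számítógép</LITERAL></SYNONYM>
    <DEF>elektronikus adatfeldolgozó gép</DEF>
  </SYNSET>
</WN>"""

# Toy Hungarian-to-Mansi dictionary (assumption: one headword, many candidates).
HU_TO_MANSI = {
    "kórház": ["больница", "пӯльница"],
    "ribiszke": ["сосыг"],
}


def translate_synsets(xml_text, hu_to_mansi):
    """Return draft Mansi synsets for lexicalized Basic Concept synsets."""
    drafts = []
    for synset in ET.fromstring(xml_text).iter("SYNSET"):
        if synset.findtext("BCS") is None:
            continue  # not marked as a Basic Concept in this toy schema
        if synset.findtext("NL") == "yes":
            continue  # non-lexicalized element, filtered out in this phase
        mansi, untranslated = [], []
        for literal in synset.iter("LITERAL"):
            hungarian = (literal.text or "").strip()
            candidates = hu_to_mansi.get(hungarian, [])
            if candidates:
                mansi.extend(candidates)
            else:
                untranslated.append(hungarian)  # left for the specialists
        drafts.append({
            "id": synset.findtext("ID"),
            "mansi_literals": mansi,
            "untranslated": untranslated,
            "definition": synset.findtext("DEF"),  # kept in Hungarian
        })
    return drafts


if __name__ == "__main__":
    for draft in translate_synsets(SAMPLE_XML, HU_TO_MANSI):
        print(draft)
```

In this sketch a synset whose `untranslated` list is non-empty is one for which the Mansi specialists would need to propose new term candidates, while the Hungarian definition stays available for checking the intended meaning.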