Building the Old Javanese Wordnet
Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020), pages 2940–2946
Marseille, 11–16 May 2020
© European Language Resources Association (ELRA), licensed under CC-BY-NC

David Moeljadi, Zakariya Pamuji Aminullah
Palacký University, Gadjah Mada University
Olomouc, Czechia; Yogyakarta, Indonesia
{davidmoeljadi, zakariya.aminullah}@gmail.com

Abstract
This paper discusses the construction and the ongoing development of the Old Javanese Wordnet. The words were extracted from the digitized version of the Old Javanese–English Dictionary (Zoetmulder, 1982). The wordnet is built using the ‘expansion’ approach (Vossen, 1998), leveraging the Princeton Wordnet’s core synsets and semantic hierarchy, as well as scientific names. The main goal of our project was to produce a high-quality, human-curated resource. As of December 2019, the Old Javanese Wordnet contains 2,054 concepts or synsets and 5,911 senses. It is released under a Creative Commons Attribution 4.0 International License (CC BY 4.0). We are still developing it and adding more synsets and senses. We believe that the lexical data made available by this wordnet will be useful for a variety of future uses, such as the development of a Modern Javanese Wordnet, many language processing tasks, and linguistic research on Javanese.

Keywords: Old Javanese, Kawi, wordnet, less resourced language

1. Introduction
This paper discusses the process of constructing a new wordnet for Old Javanese, a language written mainly on the island of Java in Indonesia between 800 AD and 1500 AD. It belongs to the Western Malayo-Polynesian branch of the Austronesian language family. Despite its importance for historical and comparative linguistics, as well as ancient history, Old Javanese remains comparatively poor in digital resources. To the best of our knowledge, the digital resources that exist for Old Javanese include the online Old Javanese–English Dictionary, which is a part of the SEAlang projects, and the Kawi Lexicon (Wojowasito and Mills, 1980). However, there is no wordnet for Old Javanese (nor for Modern Javanese). The absence of an open-source Old Javanese wordnet motivated us to build one.

We aim to build a machine-readable resource for Old Javanese: a wordnet for the language, which will also be the first wordnet for the Javanese branch of Western Malayo-Polynesian. We would like our Old Javanese Wordnet to support many Natural Language Processing (NLP) tasks, such as machine translation and grammar engineering, and at the same time to support linguistic research in areas such as historical linguistics, lexical semantics, and verb subcategorization. In addition, we would like to leverage it to build a wordnet for Modern Javanese.

2. Old Javanese
Old Javanese (or Kawi, ISO 639-2: kaw) is the first stage of the Javanese language, which has been recorded in writing for more than 600 years. The oldest inscription written in Old Javanese is dated 25 March 804 AD, and Old Javanese texts and traditions continue to flourish until the present in religious and social practice, as well as in artistic and ritual contexts (Creese, 2001). It is a Western Malayo-Polynesian language of the Austronesian language family. Within this subgroup, it belongs to the Javanese branch (Eberhard et al., 2019).

The term Old Javanese is given under the agreement that it is the earliest written form of Javanese literature. It was the vehicle of the culture, politics, and religions of ancient Javanese civilization and was written in charters and other inscriptional documents, treatises on religious doctrine and ritual, on ascetic and mystical practice, and on ethical precepts regulating man’s conduct both as an individual and as a member of ancient Javanese society, and treatises on law and lexicography (Zoetmulder, 1974).

There is no standardization in writing, so one lexical item may have more than one written form. It must be stressed that the Old Javanese data was collected from the written sources mentioned above, so we do not have any actual recording of how the language was spoken in the past. Old Javanese has been written in several varieties of the Indonesian branch of Indic ‘Brāhmī’-derived scripts, most abundantly in Balinese script on palm-leaf manuscripts (Zoetmulder, 1974). There is generally no word division in Old Javanese writing (scriptio continua), though scholarly text editions spell Old Javanese with spaces between words.

Old Javanese borrowed a considerable number of words from Sanskrit. Out of more than 25,500 entries in the Old Javanese–English Dictionary (Zoetmulder, 1982), more than 10,000 originate from Sanskrit. However, it must be noted that Sanskrit was adapted into Old Javanese only at the level of lexical and textual material: the borrowed words were formed by derivation according to Old Javanese morphology, so there is no declension or conjugation of the kind that Sanskrit has (Gonda, 1952). In brief, Old Javanese was ‘enriched’ through an infusion of lexemes drawn from Sanskrit (Hunter, 2009), as a form of participation referred to as the “Sanskrit ecumene” or “Sanskrit cosmopolis” in Pollock (1996). These terms imply that Sanskrit, in its further expansion, was adopted for its political, literary, and economic importance in the areas influenced by Hinduism and Buddhism.

Old Javanese is an agglutinative language with various affixes (prefixes, infixes, suffixes, and circumfixes). Its base-words may undergo changes in sounds (nasalization) and in junction (sandhi) under affixation. For example, uttama “excellent”, from Sanskrit, combines with the Old Javanese circumfix ka-...-an to form the abstract noun kottaman “excellence” (Zoetmulder, 1974). Syntactically, Old Javanese is a verb-initial language, while Modern Javanese has Subject-Verb-Object (SVO) structure.

There are several sources written in Old Javanese in the form of inscriptions and manuscripts. Institutions that have succeeded in digitizing original manuscripts and making them openly available include Leiden University Library and the National Library of the Republic of Indonesia, but not all of the manuscripts are accessible in digital form. The École Française d’Extrême-Orient (EFEO), for its part, has succeeded in digitizing many edited texts, in the sense of making them searchable. Open access to digitized records of Old Javanese would certainly be of great benefit to the academic community.

To the present day, there is no general consensus on the romanized transliteration of the Indic scripts in which Old Javanese was written (Acri, 2017). At least two romanization systems have been proposed, widely used, and applied in several editions of Old Javanese texts: the proposal of Zoetmulder (1982) in the Old Javanese–English Dictionary and that of Acri and Griffiths (2014), which is now modified in Dániel and Griffiths (2019). The first was applied in the edition of Arjunawiwāha “Arjuna’s Marriage” by Robson (2008) and of Sumanasāntaka “Death by Sumanasa Flower” by Worsley et al. (2013), with the exception of the use of ng for the velar nasal instead of ŋ (n with palatal hook). As for the second, we can cite, for example, the edition of Dharma Pātañjala “Sacred Teaching of Patañjali” by Acri (2017), of Bhīma Swarga “Bhīma Goes to the Place of Gods” by Gunawan (2016), and of Candrakiraṇa “Rays of Moon” by Aminullah (2019). Both romanization systems are essentially adaptations of the IAST (International Alphabet of Sanskrit Transliteration) and ISO 15919 systems, with some additions and changes. Table 1 lists the differences between the two systems.

Zoetmulder (1982)            ĕ    ö    ṛ/rĕ    lĕ      w    ŋ
Acri and Griffiths (2014)    ə    ə̄    r̥/rə    l̥/lə    v    ṅ

Table 1: Two romanization systems for Old Javanese

... is well known to most users.1

3. Old Javanese–English Dictionary
The Old Javanese–English Dictionary (OJED; Zoetmulder, 1982) was compiled by a Dutch scholar, Dr. Petrus Josephus Zoetmulder, with the collaboration of Dr. Stuart Owen Robson. It is the most comprehensive and authoritative dictionary for Old Javanese. The Old Javanese Wordnet we are building is based on this dictionary. The OJED contains more than 25,500 headword entries, more than 18,000 sub-entries, nearly 8,500 indications of Sanskrit origin, and over 105,000 corpus citations from more than 120 identified sources.

The headword entries are the base-words, arranged in Latin alphabetical order. Words derived by affixation and reduplication are listed as sub-entries under the base-words. Most of the entries and sub-entries have meanings/definitions and corpus citations or examples. The meaning/definition field may contain a question mark, which was added by the dictionary’s author to show some doubt about the correctness of his interpretation.

The digitization of the OJED and its publication online were authorized in 2008 by the Koninklijk Instituut voor Taal-, Land- en Volkenkunde (KITLV). The project was supported by EFEO. The staff of the Sanskrit and Tamil Publishing Service (SPS) keyed the complete text of the OJED. The SEAlang Library processed the data for online publication and hosts the OJED on its website.2

In the online version of the OJED, almost every entry has a page number and an entry number that can serve as an ID number. The character ŋ in the original OJED was replaced with ṅ in the online version to simplify display and cut and paste for users. When entering search queries, users can use the Harvard-Kyoto variant given in Table 2, which is automatically converted into Unicode, as in the Online OJED row.

This online OJED data was employed to build the Old Javanese Wordnet. Its raw data is being cleaned and annotated in order to build a database. The information in each dictionary entry is separated or classified into headwords, homonym numbers, variants, ID numbers, definitions, etymological information, notes, synonyms, equivalents in other related languages (e.g.