Augmented Javanese Speech Levels Machine Translation 45
Total Page:16
File Type:pdf, Size:1020Kb
Augmented Javanese Speech Levels Machine Translation 45 Augmented Javanese Speech Levels Machine Translation Aji P. Wibawa1, Andrew Nafalski2, and Wayan F. Mahmudy3, Non-members ABSTRACT language [1, 7, 8]. Furthermore, the selection of incor- rect vocabularies [4] indicates that they lack mastery This paper presents the development of the hy- of speech levels and do not know how to use them brid corpus-based machine translation for Javanese appropriately in verbal communication. In fact, the language. The system is designed to deal with the acquisition of speech levels among teenagers can be complexity of politeness expression and speech levels classified as very poor: 36.45 out of 100. This finding of Javanese that is considered as a local language with was revealed in a research on the use of speech levels the biggest number of users in Indonesia. Statistical by youngsters in Solo [8] and the result was based on features are embedded to increase the performance written vocabulary translation tests. of the system. The edit shifting distance is applied Realizing that they cannot handle this polite form, due to increase the alignment efficiency. However, im- younger speakers usually switch into Indonesian lan- proper alignment contributed by recorded impossible guage(bahasa Indonesia), which they can handle more pair and insufficient data training is still detected. easily and they believe to be more reliable to use in This paper proposes a new improvement of the de- the global era [7-9]. If this continues, the krama form- veloped alignment algorithm based on the impossible a unique characteristic Javanese-is in the danger of pair restriction. Based on experimental results, the diminishing. In addition, unskilled educators and the new developed algorithm is more accurate (A=93.8%) lack of speech levels' guides may be exacerbating the even though the number of training data is less than problem [10, 11]. While teachers are expected to serve the old one (A=87.9%). as language models at school [1], some of them use inappropriate words and levels [4] - a fact which fur- Keywords: Javanese Speech Levels, Corpus, Ma- ther suggests that Javanese speech levels dying out. chine Translation, Impossible Pair Limitation Hence, a machine translation should be developed to protect the Javanese speech levels from being extinct. 1. INTRODUCTION A machine translator has been developed in order Respecting others properly by using paralinguis- to protect the existence of the speech level [12, 13]. tic forms and proper speech is part of the politeness The translator provides bilingual pragmatic transla- in Javanese culture. The appropriate degree of po- tion between speech levels [12] and also Indonesian liteness is often articulated in the form of deference [13]. The translation knowledge is based on a bi-text in communication. Speech levels [1, 2], speech styles alignment that reinforced by edit shifting distance al- [3] or Javanese language politeness [4] are frequently gorithm [14]. The results are impressive; however, used to name the degree of deference. word repetition [13] and pragmatic translation mis- The levels of speech and associated politeness takes [12] are detected during the translation. forms are being neglected with fewer speakers being This paper is an extended version of [12], focused conversant in them even though Javanese is consid- on modifying the bilingual text alignment due to ered as the most widely used regional language in In- avoid the unwanted mistakes as well as increasing donesia [5, 6].Negative tendency is detected concern- the translation accuracy. The novel algorithm will be ing the use of Javanese speech levels among teenagers. compared to the previous model with extra training They may select an incorrect speech level to address a data. high-status person since they are unable to transform the local politeness value into its equivalent refined 2. THEJAVANESE SPEECH LEVELS Manuscript received on March 31, 2013 ; revised on June 26, Javanese linguists [2, 15, 16] divide the speech 2013. levels into three classes: krama, madya and ngoko. Final manuscript received August 10, 2013. The classes can be further classified into nine 1 The author is with Department of Electrical Engineer- sub-levels: mudha-krama (MK), kramantara (KA), ing, State University of Malang, Indonesia., E-mail: ajipw@ um.ac.id wredha-krama (WK), madya-krama (MdK), madyan- 2 The author is with School of Engineering, Uni- tara (Md A), madya-ngoko (Md Ng), basa-antya versity of South Australia, Australia., E-mail: An- (BA), antya-basa (AB), ngoko-lugu (Ng L). The sub- [email protected] 3 The author is with Department of Computer Science, Braw- levels has been simplified into four categories; ngoko ijaya University, Indonesia ., E-mail: [email protected] (Ng), ngoko alus (NgA), krama (Kr) and krama alus 46 ECTI TRANSACTIONS ON COMPUTER AND INFORMATION TECHNOLOGY VOL.8, NO.1 May 2014 (KrA) in the first Javanese Congress in 1991[4]. The- ample of alignment when translating bapakku man- example of simplified of Javanese speech levels and gan ana omahku into bapak kula dhahar wonten its lexical characteristics is shown in Table 1. griya kula is shown in Figure 1. The top part shows direct alignmentthat produces inaccurate translation Table 1: Example of Simplified Speech Levels and whereas the bottom part shows the correct alignment. Lexical Features. Fig.1: Example of Direct and Correct Javanese Speech Level Alignment. The Javanese sentence has a basic structure of SVO (subject, verb, and object). The meaning of sen- tences is pragmatically diverse and related to subject and verb agreement (SVA). The SVA derives from non-linguistic factors (i.e. social status, ages and re- lationships) [17]. For instance, sentences (1) and (2) show the differences of SVA based onthe social status of the subjects where both are operating at krama level. (1) Murid-muridsawegnedha (students are eat- ing). (2) Guru-guru sawegdhahar (teachers are eating). The underlined words in both sentences have a same meaning (eating). However, the first sentence must use the word nedha because the subject of the first sentence is students who obviously have lower social status than their teachers in sentence (2). In consequence, the word nedha is used instead of dha- har, which is only suitable for honoured people. 3. REVIEW OF MACHINE TRANSLATION Machine translation (MT), a branch of computa- One-to-one word translation in Javanese may pro- tional linguistics, is simply defined as the automatic duce poor, inaccurate and inappropriate results [14] computer-assisted translation of bilingual or multilin- since one word may be literally translated into two gual natural languages[18]. Based on its knowledge words at another speech level, and vice versa.As a base, MT is classified into two categories: Rule-based result, the source and target sentences may consist Machine Translation (RBMT) and Corpus-based Ma- of an unequal number of words, known as asym- chine Translation (CBMT). metric formation phenomenon. For example, the RBMT uses linguistic information such as seman- sentence awake dhewe ditimbali ibumu (Ng) is tic, morphological, and syntactic information as its translated as kita dipuntimbali ibu panjenen- knowledge base. The involvement of linguists in gan (Kr). Both sentences have four words; however, knowledge base development is unavoidable. They translating the words one-to-one based only their or- also build transfer rules between languages that are der in the sentence, produces an inaccurate transla- relatively complex and time consuming, especially for tion. In fact, the sentences are composed of three those with very different structures. As a result, the asymmetric pairs (i.e. awake dhewe$kita; ditim- development costs increases in an effort to achieve the bali?dipuntimbali; ibumu$ibu panjenengan). An ex- required quality threshold. The participation of lan- Augmented Javanese Speech Levels Machine Translation 47 guage experts is desperately needed when more adap- tive RBMT is desired. The linguists have to retrain the developed RBMT by adding new rules, vocabu- lary and other linguistic information, and this con- tributes further to extra development time and costs. On the other hand, the knowledge of CBMT is based on large sets of bilingual text that are recog- nised as corpora[19, 20]. The two kinds of CBMT are example-based machine translation (EBMT) and statistical machine translation (SMT). The availabil- ity of corpora is a key factor in reducing development costs. Once available, the development time is re- duced in parallel with the translation development cost. In contrast with RBMT, the role of linguists in the development of CBMT is purely optional. When Fig.2: The Design of Javanese Machine Transla- a new instance of translation arises and unknown tion. words occur, the corpus may be updated automat- ically. While RBMT is highly efficient for more gen- translation process uses the texts in both source lan- eral translation, CBMT may work better for a specific guage (S) and target language (T ). Each text is di- domain. Although some inconsistency may be found, vided into sentences (s and t ) and can be formu- for example, in pure SMT technique [21, 22] RBMT i j lated as in (1) and (2). Here, n and n represent generates more natural translation than CBMT. i j the number of sentences in the source and target text Both SMT and EBMT derive mainly from large respectively. bilingual texts; that is known as a corpus (i.e. corpora in the plural form) [19, 20]. While SMT focuses more on word combinations and their occurrence, EBMT S = fs : Sj0 ≤ i < n g (1) focuses on text segmentation [20], phrase memorisa- i i f j ≤ g tion [22]and analogical sentence recombination [19]. T = tj : T 0 i < nj (2) Moreover, SMT has a more obvious definition than Afterwards, all correspondence sentences are EBMT in terms of selecting the most appropriate parsed into a set of words (w and w ) from the first translation.