Zero Object Resolution in Korean
Total Page:16
File Type:pdf, Size:1020Kb
PACLIC 29 Zero Object Resolution in Korean Arum Park Seunghee Lim Munpyo Hong Dept. of German Linguistics Dept. of German Linguistics Dept. of German Linguistics & Literature, Sungkyunkwan & Literature, Sungkyunkwan & Literature, Sungkyunkwan University / University / University / 25-2, Sungkyunkwan-Ro, 25-2, Sungkyunkwan-Ro, 25-2, Sungkyunkwan-Ro, Jongno-Gu, Jongno-Gu, Jongno-Gu, Seoul, Korea Seoul, Korea Seoul, Korea [email protected] [email protected] [email protected] Zero object is the second most frequent zero Abstract pronoun type in Korean spoken text. Despite of the frequent use of zero objects, most of the previous Korean is one of the well-known „pro-drop‟ works do not deal with the zero objects in Korean. languages. When translating Korean zero object into languages in which objects have to be In this work, we focus on Korean spoken texts, overtly expressed, the resolution of zero object since zero pronouns occur more frequently than in is crucial. This paper proposes a machine a written text. Ryu (2001) showed that a zero learning method to resolve Korean zero object. pronoun rarely appears in written texts when it is We proposed 8 linguistically motivated features compared with spoken texts in Korean. For this for ML (Machine Learning). Our approach has reason, we conduct a study for Korean zero object been implemented with WEKA 3.6.10 and in Korean spoken text and try to find the linguistic evaluated by using 10-fold cross validation clues for the zero object resolution. method. The accuracy of the proposed method In the context of machine translation, the reached 73.37%. resolution of Korean zero objects could be one of the most important issues in order to translate them 1 Introduction into the target language like English and German. One of the reasons that zero objects in Korean is a Korean is one of the so-called pro-drop languages. problem in MT is that the omitted objects in Certain pronouns may be omitted and the omitted Korean have to be translated into overt objects in pronouns are often called zero pronouns. This kind target languages. Unfortunately, the majority of of pronoun also occurs in other languages, such as MT systems do not deal with this problem, because Japanese or Spanish. The omitted pronouns in most of the current commercial MT systems do not Korean can appear in subject and object position, treat the linguistic phenomena that go beyond a whereas in Spanish or Italian, they can appear only sentence level. To illustrate this issue, let's take a in subject position. A zero subject is the most look at the following example (1). frequent type of anaphoric expressions in Korean. Hong (2000) reported that about 57% of the (1) MT results from Korean(a) to German(b) pronouns are a zero subject pronoun in pronoun occurrences in Korean spoken text. (a) Korean A: 여권 i 을 분실했습니다. Corresponding author 439 29th Pacific Asia Conference on Language, Information and Computation pages 439 - 448 Shanghai, China, October 30 - November 1, 2015 Copyright 2015 by Arum Park, Seunghee Lim and Munpyo Hong PACLIC 29 a passport-OBJ lost 2 Related Works (yekwenul punsilhaysssupnita.) Zero pronouns have already been studied in other “I lost a passport.” languages, such as Japanese (e.g. Nakaiwa and Shirai, 1996; Okumura and Tamura, 1996) and Spanish (Park and Hong, 2014; Palomar et al., B: øi 다시 발급받으셔야 합니다. 2001; Ferrández and Peral, 2000). These studies again have to issue are based on the researches about anaphora (tasi palkuppatusyeya hapnita.) resolution. It has been a wide-open research field since 1970 focusing on English. Regardless of “You have to issue a passport again.” languages, similar strategies for anaphora resolution have been applied. Using linguistic (b1) German - Systran translator information is the most representative technique; constraints and preferences methods are A: Verlor den Pass. distinguished in the related works (Baldwin, 1997; B: Fragen wiederholt. Lappin and Leass, 1994; Carbonell and Brown, 1988). (b2) German - Google translator Constraints discard possible antecedents and are considered as absolute criteria. Preferences being A: Ich habe meinen Pass verloren. proposed as heuristic rules tend to be relative. B: Wir müssen neu aufgelegt zu werden. After applying constraints, if there are still unresolved candidate antecedents, preferences set The omitted object is represented by the symbol priorities among candidate antecedents. Nakaiwa ø. In this example, the Korean object is not overtly and Shirai (1996) focus on semantic and pragmatic expressed in the sentence B and it refers to constraints such as cases, modal expressions, „여권‟(yekwen, “Passport”) in sentence A. To verbal semantic attributes and conjunctions in translate the omitted object into German correctly, order to determine the reference of Japanese zero the gender and number of the antecedent „yekwen‟ pronouns. However, they proposed constraints has to be considered. Since the morphological focusing on zero subjects mainly. Therefore, it is information of „yekwen‟ is „masculine‟ and hard to apply their approach on zero object „singular‟ in German, the omitted object has to be resolution. translated as „ihn‟ considering its case. However, Centering theory (Grosz et al., 1995) is one of the object of the sentence B is not translated in the approaches using heuristic rules. It is claimed German in either MT system. Then, the results that certain entities mentioned in an utterance are would be ungrammatical in German. Therefore, the more central than others, and this property has resolution of Korean zero objects is crucial in MT been applied to determine the antecedent of the systems with Korean as a source language, when anaphor. Walker et al. (1994) applied the centering translating them into languages in which objects model on zero pronoun resolution in Japanese. Roh have to be overtly expressed. and Lee (2003) proposed a generation algorithm of In section 2 we present the related works about zero pronouns using a Cost-based Centering Model anaphora resolution and their limitation. Section 3 which considers the inference cost. It is known that explains zero objects phenomenon in Korean. We the most salient element of the given discourse is suggest the machine learning (ML) method for likely to be realized as a zero pronoun. We take Korean zero object resolution and propose 8 this into account in selecting the features for ML. features for ML method in section 4. In addition, Current anaphora resolution methods rely the effect of using ML is evaluated. Finally, the mainly on constraint and preference heuristics, conclusion is presented in section 5. which employ morpho-syntactic information or shallow semantic analysis. These methods are a deterministic algorithm which always produces the same output in a given particular condition. However, even if the condition is applied, the 440 PACLIC 29 output can be wrong. ML methods which are a Given the hierarchy of the forward looking non-deterministic algorithm have been studied on center ranking, a zero object can be interpreted as anaphora resolution (Connolly et al., 1994; Paul et the topic which is the most salient discourse entity. al., 1999). Since ML learns from data and makes The topic of the sentence contributes to discourse predictions of the most likely candidate on the data, salience and maintains discourse coherence by it can overcome the limitation of the deterministic preferring the CONTINUE transition state. The method. topic of the sentence can be detected easily in Park and Hong (2014) proposed a hybrid Korean using the topic markers „은‟(eun), „는‟(nun) approach to resolve Spanish zero subjects that integrates heuristic rules and ML in the context of and delimiters such as „도‟(to), „만‟(man). Spanish to Korean MT. Since Spanish zero Therefore, it is likely a candidate antecedent is the subjects can be restored from the verb ending, they antecedent of the zero object if the candidate has use morphological flections for verbs. After that, one of the topic markers or delimiters. We can see ML is utilized for some ambiguous cases. Unlike some examples in the following table. this work, our work deals with Korean zero object. Morphological information cannot be utilized for Speaker Korean dialogue Korean because of the difference of the two A 음식 주문 을 languages. For this reason, we use ML method 1 ordering a food-OBJ alone to determine the antecedent of the zero (umsik cumunul objects in spoken Korean. 어떻게 하는 거죠? how can 3 Zero object phenomenon in Korean ettehkey hanun kecyo?) A prominent phenomenon in Korean is the “How can I order a food?” B prevalence of zero pronouns. Unlike English, zero 저 기계 2 에서 메뉴 3 를 pronouns occur very frequently in Korean. In that machine-ADV menu-OBJ Korean, a zero subject is the most frequent type of (ce kikyeyeyse menyulul anaphoric expression. The second most frequent 선택한 후 식권 4 을 type is zero objects, especially when the direct after selecting a meal ticket-OBJ object is omitted. According to Hong (2000), when senthaykhan hu sikkwenul the direct object does not occur in spoken Korean, 뽑으세요. the rate becomes 19.1%. To resolve zero object, centering theory can be buy ppopuseyyo.) utilized. The centering theory endeavors to identify the antecedent of a (zero) pronoun using the idea “You can buy a meal ticket after selecting of the most central entity that a sentence concerns menu from that vending machine.” which tends to be expressed by a (zero) pronoun. A 아침 식사 는 11 시까지만 There are some studies attempting to apply the 5 breakfast-TOP until 11 o‟clock centering theory to anaphora resolution (Choi and (achim siksanun 11sikkaciman Lee, 1999; Hong, 2000; Hong, 2011). The forward looking center rankings for Korean are defined 되는 건가요? differently in the studies.