
An Unsupervised Method for Identifying Loanwords in Korean

Hahn Koo
San Jose State University
[email protected]

Manuscript to appear in Language Resources and Evaluation. The final publication is available at Springer via http://dx.doi.org/10.1007/s10579-015-9296-5.

Abstract

This paper presents an unsupervised method for developing a character-based n-gram classifier that identifies loanwords, or transliterated foreign words, in Korean text. The classifier is trained on an unlabeled corpus using the Expectation-Maximization algorithm, building on seed words extracted from the corpus. Words with high token frequency serve as native seed words. Words with seeming traces of vowel insertion to repair consonant clusters serve as foreign seed words. What counts as a trace of insertion is determined using phoneme co-occurrence statistics in conjunction with ideas and findings in phonology. Experiments show that the method can produce an unsupervised classifier that performs at a level comparable to that of a supervised classifier. In a cross-validation experiment using a corpus of about 9.2 million words and a lexicon of about 71,000 words, mean F-scores of the best unsupervised classifier and the corresponding supervised classifier were 94.77% and 96.67%, respectively. Experiments also suggest that the method can be readily applied to other languages with similar phonotactics, such as Japanese.

Keywords: Loanwords; Transliteration; Detection; N-gram; EM algorithm; Korean


1 Introduction

Loanwords are words whose meaning and pronunciation are borrowed from words in a foreign language. Their forms, both pronunciation and spelling, are often nativized. Their pronunciations adapt to conform to native sound patterns. Their spellings are transliterated using the native script and reflect the adapted pronunciations. For example, flask [flæsk] in English becomes 플라스크 [pʰɨl.ɾa.sɨ.kʰɨ] in Korean.

The present paper is concerned with building a system that scans Korean text and identifies loanwords¹ spelled in Hangul, the Korean alphabet. Such a system can be useful in many ways. First, one can use the system to collect data to study various aspects of loanwords (e.g. Haspelmath and Tadmor, 2009) or develop machine transliteration systems (e.g. Knight and Graehl, 1998; Ravi and Knight, 2009). Loanwords or transliterations (e.g. 플라스크) can be extracted from monolingual corpora by running the system alone. Transliteration pairs (e.g. <flask, 플라스크>) can be extracted from parallel corpora by first identifying the output with the system and then matching input forms based on scoring heuristics such as phonetic similarity (e.g. Yoon et al., 2007). Second, the system allows one to use the etymological origins of words as a feature and be more discriminating in text processing. For example, grapheme-to-phoneme conversion in Korean (Yoon and Brew, 2006) and stemming in Arabic (Nwesri, 2008) can be improved by keeping separate rules for native words and loanwords. The system can be used to classify a given word into either category and apply the proper set of rules.

The loanword identification system envisioned here is a binary, character-based n-gram classifier. Given a word (w) spelled in Hangul, the classifier decides whether the word is of native (N) or foreign (F) origin by Bayesian classification, i.e. solving the following equation:

\hat{c}(w) = \arg\max_{c \in \{N, F\}} P(w \mid c) \cdot P(c)   (1)

The likelihood P(w|c) is calculated using a character n-gram model specific to that class.

¹ In this paper, loanwords in Korean refer to all words of foreign origin that are transliterated in Hangul, except Sino-Korean words, which are ancient borrowings from Chinese. Sino-Korean words are considered more native-like than other words of foreign origin due to their longer history and higher morphological productivity (Sohn, 1999).

The classifier is trained on a corpus in an unsupervised manner, building on seed words extracted from the corpus. The native seed consists of words with high token frequency in the corpus. The idea is that frequent words are more likely to be native words than foreign words. The foreign seed consists of words that contain what appear to be traces of vowel insertion. Korean does not have words that begin or end with consonant clusters. As in many other languages with similar phonotactics (e.g. Japanese), foreign words with consonant clusters are transliterated with vowels inserted to break the clusters. The presence of substrings that resemble traces of insertion suggests that a word may be of foreign origin. An obvious problem is deciding what those traces look like a priori. Here the problem is resolved by a heuristic based on phoneme co-occurrence statistics and rudimentary ideas and findings in phonology.

The rest of the paper is organized as follows. In Section 2, I discuss previous studies on foreign word identification as well as ideas and findings in phonology that the present study builds on. I describe the proposed method for developing the unsupervised classifier in detail in Section 3. I discuss experiments that evaluate the effectiveness of the method in Korean in Section 4 and pilot experiments in Japanese that explore its applicability to other languages in Section 5. I conclude the paper in Section 6.

2 Background

This work is motivated by previous studies on identifying loanwords or foreign words in monolingual data. Many of them rely on the assumption that the distribution of strings of sublexical units such as phonemes, letters, and syllables differs between words of different origins. Some write explicit and categorical rules stating which substrings are characteristic of foreign words (e.g. Bali et al., 2007; Khaltar and Fujii, 2009). Some train letter or syllable n-gram models separately for native words and foreign words and compare the two. It has been shown that the n-gram approach can be very effective in Korean (e.g. Jeong et al., 1999; Oh and Choi, 2001).

Training the n-gram models is straightforward with labeled data in which words are tagged either native or foreign. But creating labeled data can be expensive and tedious. In response, some have proposed methods for generating pseudo-annotated data: Baker and Brew (2008) for Korean and Goldberg and Elhadad (2008) for Hebrew.

In both studies, the authors suggest generating pseudo-loanwords by applying transliteration rules to a foreign lexicon such as the CMU Pronouncing Dictionary. They suggest different methods for generating pseudo-native words. Baker and Brew extract words with high token frequencies in a Korean newswire corpus, assuming that frequent words are more likely to be native than foreign. Goldberg and Elhadad extracted words from a collection of old Hebrew texts, assuming that old texts are much less likely to contain foreign words than recent texts. The approach is effective, and a classifier trained on the pseudo-labeled data can perform comparably to a classifier trained on manually labeled data. Baker and Brew trained a logistic regression classifier using letter trigrams on about 180,000 pseudo-words, half pseudo-Korean and half pseudo-English. Tested on a labeled set of 10,000 native Korean words and 10,000 English loanwords, the classifier showed 92.4% classification accuracy. In comparison, the corresponding classifier trained on manually labeled data showed 96.2% accuracy in a 10-fold cross-validation experiment.

The pseudo-annotation approach obviates the need to manually label data. But one has to write a separate set of transliteration rules for every pair of languages. In addition, the transliteration rules may not be available to begin with, if the very purpose of identifying loanwords is to collect training data for machine transliteration. The foreign seed extraction method proposed in the present study is an attempt to reduce the level of language-specificity and the demand for additional natural language processing capabilities. The method essentially equips one with a subset of transliteration rules by presupposing a generic pattern in pronunciation change, i.e. vowel insertion.

The method should be applicable to many language pairs. The need to repair consonant clusters arises for many language pairs, and vowel insertion is a repair strategy adopted in many languages. Foreign sound sequences that are phonotactically illegal in the native language are usually repaired rather than overlooked. A common source of phonotactic discrepancy involves consonant clusters: different languages allow consonant clusters of different complexity. Maddieson (2013) identifies 151 languages that allow a wide variety of consonant clusters, 274 languages that allow only a highly restricted set of clusters, and 61 languages that do not allow clusters at all. Illegal clusters are repaired by vowel insertion or consonant deletion, but vowel insertion appears to be cross-linguistically more common (Kang, 2011).


The vowel insertion pattern is initially characterized only generically as 'insert vowel X in position Y to repair consonant cluster Z'. The generic nature of the characterization ensures language-neutrality. But in order for the pattern to be of any use, one must eventually flesh out the details and provide instances of the pattern equivalent to specific transliteration rules: 'insert [u] between the consonants to repair [sm]', or [sm] → [sum], for example. Here the language-specific details of vowel insertion are discovered from a corpus in a data-driven manner, but the search process is guided by findings and ideas in phonology. As will be described in detail below, possible values of which vowel is inserted where are constrained based on typological studies of loanword adaptation (e.g. Kang, 2011) and vowel insertion (e.g. Hall, 2011). Possible consonant sequences originating from a cluster are delimited by the idea of the sonority sequencing principle (e.g. Clements, 1990).

3 Proposal

The goal is to build a Bayesian classifier made of two character n-gram models: one for native words (N) and the other for foreign words (F). That is,

\hat{c}(w) = \arg\max_{c \in \{N, F\}} P(c) \cdot P(w \mid c) \approx \arg\max_{c \in \{N, F\}} P(c) \cdot \prod_i P(g_i \mid g_{i-n+1}^{i-1}, c)   (2)

where g_i is the i-th character of w and g_{i-n+1}^{i-1} is the string of n-1 characters preceding it. In this study, the n-gram models use Witten-Bell smoothing (Witten and Bell, 1991) for its ease of implementation. That is,

P(g_i \mid g_{i-n+1}^{i-1}, c) = \left(1 - \lambda_c(g_{i-n+1}^{i-1})\right) \cdot P_{mle}(g_i \mid g_{i-n+1}^{i-1}, c) + \lambda_c(g_{i-n+1}^{i-1}) \cdot P(g_i \mid g_{i-n+2}^{i-1}, c)   (3)

So the parameters of the classifier consist of P(c), P_mle(g_i | g_{i-n+1}^{i-1}, c), and λ_c(g_{i-n+1}^{i-1}). They can be estimated from data as follows:

P(c) = \frac{\sum_w z(w, c)}{\sum_{c'} \sum_w z(w, c')}   (4)


P_{mle}(g_i \mid g_{i-n+1}^{i-1}, c) = \frac{\sum_w freq_w(g_{i-n+1}^{i}) \cdot z(w, c)}{\sum_w freq_w(g_{i-n+1}^{i-1}) \cdot z(w, c)}   (5)

\lambda_c(g_{i-n+1}^{i-1}) = \frac{\sum_w freq_w(g_{i-n+1}^{i-1}) \cdot z(w, c)}{N_{1+}(g_{i-n+1}^{i-1}\,\bullet) + \sum_w freq_w(g_{i-n+1}^{i-1}) \cdot z(w, c)}   (6)

z(w, c) indicates whether w is classified as c: z(w, c) = 1 if it is and z(w, c) = 0 otherwise. freq_w(x) is the number of times x occurs in w. N_{1+}(g_{i-n+1}^{i-1} •) is the number of different n-grams prefixed by g_{i-n+1}^{i-1} that occur at least once. The challenge here is that the training corpus is unlabeled, i.e. z(w, c) is hidden. I use variants of the EM algorithm to iteratively guess z(w, c) and update the parameters. The n-gram models are initialized with seed words extracted from the corpus. For the native class, I use high-frequency words in the corpus as seed words: for example, all words whose token frequency is in the 95th percentile. For the foreign class, I first use sublexical statistics to list phoneme strings that would result from vowel insertion and then use words that contain those phoneme strings as seed words. Below I describe in detail how foreign seed words are extracted and how the seeded classifier is iteratively trained.
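Before turning to those details, the decision rule in equation (2) can be sketched as follows for the bigram case (n = 2). This is a minimal illustration, not the paper's implementation: prior, bigram_prob, and the start-of-word symbol "<" are placeholder assumptions standing in for P(c), the smoothed P(g_i | g_{i-1}, c), and word-boundary padding.

```python
import math

# Minimal sketch of the decision rule in equation (2) with bigram models (n = 2).
# `prior(c)` and `bigram_prob(c, prev, cur)` are placeholders for P(c) and the
# smoothed P(g_i | g_{i-1}, c); the "<" padding symbol is an illustrative assumption.
def classify(word, prior, bigram_prob, classes=("N", "F")):
    def score(c):
        s = math.log(prior(c))
        prev = "<"
        for cur in word:
            s += math.log(bigram_prob(c, prev, cur))
            prev = cur
        return s
    return max(classes, key=score)
```

In practice the two bigram_prob functions would be the class-conditional models of equations (3)-(6), re-estimated after every iteration of the training procedure described in Section 3.2.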

3.1 Foreign seed extraction

The method aims to identify loanwords whose original forms contain consonant clusters and use them as foreign seed words. This is done by string or pattern matching, where the pattern consists of phoneme strings that can result from vowel insertion. Consonant clusters do not begin or end syllables in Korean. When foreign words are borrowed, consonant clusters are repaired by inserting a vowel somewhere next to the consonants to break the cluster into separate syllables. Speakers usually insert the same vowel in the same position to repair a given consonant cluster. As a result, transliterations of different words with the same consonant cluster all share a common substring showing the trace of insertion. For example, 트라이 (try), 트레인 (train), 트리 (tree), 트롤 (troll), and 트루 (true) all contain 트ㄹ, which is pronounced [tʰɨɾ]. The idea is to figure out what those signature substrings are in advance and look for words that have them. There is a risk of false positives, since such substrings may exist for reasons other than vowel insertion.

But the hope is that the seeded classifier will gradually learn to be discriminating and use other substrings in words for further disambiguation.

The phoneme strings defining the pattern are specified below as tuples of the form <C1C2, Vid, Vloc> for ease of description. Each tuple characterizes a phoneme string made of two consonants and a vowel. C1 and C2 are the two consonants. Vid is the identity of the vowel. Vloc is the location of the vowel relative to the consonants, i.e. between, before, or after the consonants. For example, <sn, ɨ, between> means [sɨn] as in [sɨnou] for 스노우 (snow), and <ntʰ, ɨ, after> means [ntʰɨ] as in [hintʰɨ] for 힌트 (hint). The idea is to use C1C2 to specify consonants from a foreign cluster and Vid and Vloc to specify which vowel is inserted where to repair the cluster.

Rather than being manually listed using language expertise, the tuples are discovered from a corpus using the following heuristic:

1. List words that appear atypical compared with the native seed words.

2. Extract <C1C2, Vid, Vloc> tuples from the atypical words, where

(a) C1C2 respects the sonority sequencing principle.

(b) Vid and Vloc most strongly co-occur with C1C2 among all vowels.

3. Identify the most common Vid as the default vowel used for insertion. Keep tuples whose Vid matches the default vowel and throw away the rest.

4. Identify the most common Vloc of the default vowel as its site of insertion for clusters in each syllable position (onset or coda). Keep tuples whose Vloc matches the identified site of insertion and throw away the rest.

The basic idea is to find recurring combinations of two consonants that potentially came from a foreign cluster and a vowel. Step 1 defines the search space. It should be easier to see the target pattern if one zeroes in on loanwords. Native words have various morphological patterns that can obscure the target pattern. Of course, it is not yet known which words are loanwords. So instead the method avoids words similar to what are currently believed to be native words, i.e. the native seed words. Put differently, words dissimilar to the native seed words are tentatively loanwords. Here the similarity is measured by a word's length-normalized probability according to a character n-gram model trained on the native seed words: (1/|w|) · log P(w) for a word w of length |w|.

A word is atypical if its probability ranks below a threshold percentile (e.g. 5%).

Step 2 generates a first-pass list. Condition 2a delimits possible consonant sequences from a foreign cluster. According to the sonority sequencing principle, consonants in a syllable are ordered so that consonants of higher sonority appear closer to the vowel of the syllable. There are different proposals on what sonority is and how different classes of consonants rank on the sonority scale (e.g. Clements, 1990; Selkirk, 1984; Ladefoged, 2001). Here I simply classify consonants as either obstruents or sonorants (see Table 1) and stipulate that sonorants have higher sonority than obstruents. I also assume that the sonority of consonants does not change during transliteration, although their identities may change. For example, 'free' changes from [fɹi] to [pʰɨɾi], but [pʰ] remains an obstruent and [ɾ] remains a sonorant. Accordingly, C1C2 must be obstruent-sonorant if it is from an onset cluster and sonorant-obstruent if it is from a coda cluster. To determine with certainty whether the consonants originally occupied the onset or the coda, I focus on phoneme strings found only at word boundaries. If C1C2 are the first two consonants of a word, they are from an onset. If they are the last two consonants of a word, they are from a coda.

Condition 2b is used to guess the vowel inserted to repair each cluster. Only one vowel is repeatedly used, so its co-occurrence with the consonants should not only be noticeable but most noticeable among all vowels. Here the co-occurrence tendency is measured using pointwise mutual information: PMI(C1C2, V) = log P(C1C2, V) - log(P(C1C2) · P(V)), where V = <Vid, Vloc>.

The list is truncated to avoid false positives in steps 3 and 4. This is done by identifying the default vowel insertion strategy and keeping only the tuples consistent with it. Exactly which vowel is inserted where to repair a consonant cluster is context-specific. But a language that relies on vowel insertion for repair usually has a default vowel inserted in typical locations (cf. Uffmann, 2006). Here it is assumed that the default vowel is the one used to repair the most diverse set of consonant clusters. So it is the most frequent vowel in the list. Similarly, its default site of insertion is in principle its most frequent location in the list. But possible sites of insertion differ for onset clusters and coda clusters: before or between the consonants in an onset, but after or between the consonants in a coda (Hall, 2011).

So the default site of insertion is identified separately for onsets and codas.
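As an illustration of condition 2b, the following minimal sketch picks, for each candidate consonant pair, the vowel identity and location with the highest pointwise mutual information. The input format and all names are assumptions made for the example, not part of the method's actual implementation.

```python
import math
from collections import Counter

# Minimal sketch of condition 2b: for each word-edge consonant pair C1C2,
# select the (vowel identity, vowel location) pair with maximal PMI.
# `observations` is assumed to be an iterable of (c1c2, vid, vloc) triples
# extracted from the atypical words.
def best_vowel_per_cluster(observations):
    pair_counts = Counter()    # counts of C1C2
    vowel_counts = Counter()   # counts of (Vid, Vloc)
    joint_counts = Counter()   # counts of the full triple
    for c1c2, vid, vloc in observations:
        pair_counts[c1c2] += 1
        vowel_counts[(vid, vloc)] += 1
        joint_counts[(c1c2, vid, vloc)] += 1
    n = sum(joint_counts.values())
    best = {}
    for (c1c2, vid, vloc), freq in joint_counts.items():
        pmi = math.log(freq / n) - math.log(
            (pair_counts[c1c2] / n) * (vowel_counts[(vid, vloc)] / n)
        )
        if c1c2 not in best or pmi > best[c1c2][0]:
            best[c1c2] = (pmi, vid, vloc)
    return {c: (vid, vloc) for c, (_, vid, vloc) in best.items()}
```

Steps 3 and 4 would then tally the selected Vid and Vloc values over all clusters and keep only the tuples consistent with the most frequent ones.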

3.2 Bootstrapping with EM

The parameters θ to estimate are P(c), P_mle(g_i | g_{i-n+1}^{i-1}, c), and λ_c(g_{i-n+1}^{i-1}). The first parameter, P(c), is initialized according to some assumption about what proportion of words in the given corpus are loanwords. For example, if one assumes that 5% are loanwords, P(N) = 0.95 and P(F) = 0.05. The latter two parameters, which define the n-gram models, are initialized using the seed words as if they were labeled data: z(w, N) = 1 and z(w, F) = 0 for native seed words, and z(w, N) = 0 and z(w, F) = 1 for foreign seed words. Note that other words in the corpus are not used to initialize the n-gram models. The initial parameters are then updated on the whole corpus by iterating the following two steps until some stopping criterion is met.

E-step: Calculate the expected value of z(w, c) using the current parameters.

E[z(w, c)] = P(c \mid w; \theta^{(t)}) = \frac{P(w \mid c; \theta^{(t)}) \cdot P(c; \theta^{(t)})}{\sum_{c'} P(w \mid c'; \theta^{(t)}) \cdot P(c'; \theta^{(t)})}   (7)

M-step: Transform the expected value into ẑ(w, c), i.e. some estimate of z(w, c), and plug it into equations (4-6) to update the parameters.

I experiment with three versions of the algorithm in the present study: soft EM, hard EM, and smoothstep EM. The three differ with respect to how E[z(w, c)] is transformed into ẑ(w, c). In soft EM, which is the same as the classic EM algorithm (Dempster et al., 1977), there is no transformation, i.e. ẑ(w, c) = E[z(w, c)]. In hard EM, ẑ(w, c) = 1 if c = argmax_{c'} E[z(w, c')] and ẑ(w, c) = 0 otherwise. Since there are only two classes here, this is equivalent to applying a threshold function at 0.5 to E[z(w, c)]. In smoothstep EM, a smooth step function is applied instead of the threshold function: ẑ(w, c) = f³(E[z(w, c)]), where f(x) = -2x³ + 3x². Figure 1 illustrates how E[z(w, c)] is transformed into ẑ(w, c) by the three variants of the EM algorithm.
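The three transformations are simple enough to sketch directly. The helper names below are illustrative, and reading f³ as three-fold composition of f (rather than a cube) is an assumption about the notation.

```python
# Minimal sketch of the E-step posterior for the foreign class and the three
# transformations of E[z(w, c)]; `prior` and `likelihood` are placeholders for
# P(c) and P(w | c) under the current parameters.
def posterior_foreign(w, prior, likelihood):
    num = likelihood(w, "F") * prior("F")
    den = num + likelihood(w, "N") * prior("N")
    return num / den

def soft(e):
    # classic (soft) EM: use the expectation as-is
    return e

def hard(e):
    # hard EM: threshold the expectation at 0.5
    return 1.0 if e >= 0.5 else 0.0

def smoothstep(e):
    # smoothstep EM: apply f(x) = -2x^3 + 3x^2 three times (assumed reading of f^3)
    f = lambda x: -2 * x**3 + 3 * x**2
    return f(f(f(e)))
```

Iterating f preserves the fixed points 0, 0.5, and 1 while steepening the transition, which matches the description of smoothstep EM as a compromise between the soft and hard variants.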


As will be shown in the experiments below, soft EM is aggressive while hard EM is conservative in recruiting words to the foreign class. Soft EM gives partial credit even to words that are very unlikely to be foreign according to the current model. Over time, such words may manage to gain enough confidence and be considered foreign. Some of them may turn out to be false positives. On the other hand, hard EM does not give any credit even to words that are just barely below the threshold to be considered foreign. Some of them may turn out to be false negatives. Smoothstep EM is a compromise between the two extremes. It virtually ignores words that do not stand a chance but gives due credit to words that barely missed.

4 Experiments

Experiments show that the proposed approach can be effective in Korean despite its unsupervised nature. Classifiers built on a raw corpus with minor preprocessing (e.g. removing tokens with non-Hangul characters) identify loanwords in test lexicons well. The foreign seed extraction method correctly identifies the default vowel insertion strategy in Korean loanword phonology. The resulting classifier performs better when initialized with the proposed seeding method than with random seeding. Its performance is not that far behind the corresponding supervised classifier either. Moreover, after exposure to the words (but not their labels) used to train the supervised classifier, the unsupervised classifier performs at a level comparable to the supervised classifier. I discuss the details of the experiments below.

4.1 Methods

I use four datasets, called SEJONG, KAIST, NIKL-1, and NIKL-2 below. SEJONG and KAIST are unlabeled data used to initialize and train the unsupervised classifier. SEJONG consists of 1,019,853 types and 9,206,430 tokens of eojeols, which are character strings delimited by white space, equivalent to words or phrases. The eojeols are from a morphologically annotated corpus developed in the 21st Century Sejong Project under the auspices of the Ministry of Culture, Sports, and Tourism of South Korea, and the National Institute of the Korean Language (2011). They were selected by extracting Hangul character strings delimited by white space after removing punctuation marks.

Strings that contained non-Hangul characters (e.g. 12월의, Farrington으로부터) were excluded in the process. KAIST consists of 2,409,309 types and 31,642,833 tokens of eojeols from the KAIST corpus (Korea Advanced Institute of Science and Technology, 1997), extracted in the same way as SEJONG. NIKL-1 and NIKL-2 are labeled data used to test the classifier. They are made of words from various language resources released by the National Institute of the Korean Language (NIKL). NIKL-1 consists of 49,962 native words and 21,176 foreign words selected from two lexicons (NIKL, 2008, 2013). NIKL-2 consists of 44,214 native words and 18,943 foreign names selected from four reports released by NIKL (2000a,b,c,d) and a list of transliterated names of people and places originally spelled in Latin alphabets (NIKL, 2013). I examined the words manually and labeled them either native or foreign. Words of unknown or ambiguous etymological origin were excluded in the process. SEJONG and NIKL-1 are mainly used to examine the effectiveness of the proposed methods. KAIST and NIKL-2 are used to examine whether the methods are robust to varying data. See Table 2 for a summary of data sizes.

The proposed methods are implemented as follows. All n-gram models are trained on character bigrams, where each Hangul character represents a syllable. The high-frequency words defining the native seed are eojeols whose token frequency is above the 95th percentile in a given corpus. When extracting the foreign seed, the so-called 'atypical words' are eojeols whose length-normalized n-gram probabilities lie in the bottom 5% according to the model trained on the native seed. Their phonetic transcriptions are generated by applying the simple rewrite rules in Appendix A. For bootstrapping, the prior probabilities are initialized to P(c = N) = 0.95 and P(c = F) = 0.05. The parameters of the classifier are iteratively updated until the average likelihood of the data improves by no more than 0.01% or the number of iterations reaches 100.

Classification performance is measured in terms of precision, recall, and F-score. Here, precision (p) is the percentage of words correctly classified as foreign out of all words classified as foreign. Recall (r) is the percentage of words correctly classified as foreign out of all words that should have been classified as foreign. F-score is the harmonic mean of the two with equal emphasis on both, i.e. F = 2·p·r/(p + r).

To put the numbers in perspective, scores of classifiers built using the proposed methods are compared with those of supervised classifiers and randomly seeded classifiers. Supervised classifiers are trained and tested on the labeled data (NIKL-1 or NIKL-2) using five-fold cross-validation. The labeled data is partitioned into five equal-sized subsets. The supervised classifier is trained on four subsets and tested on the remaining subset. This is repeated five times for the five different combinations of subsets. Randomly seeded classifiers are unsupervised classifiers with just a different seeding strategy: 5% of the words in the corpus are randomly chosen as foreign seed words and the rest are native seed words. For a fair comparison, the unsupervised classifiers are also tested five separate times on the five subsets of labeled data that the supervised classifier is tested on. Accordingly, classification scores reported below are the arithmetic means of scores on the five subsets.
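For reference, a minimal sketch of the evaluation metrics as defined above, treating the foreign class as the positive class; the labels and function name are illustrative.

```python
# Minimal sketch of precision, recall, and F-score over the foreign class,
# given parallel lists of gold labels and predictions ("N" or "F").
def prf(gold, pred):
    tp = sum(1 for g, p in zip(gold, pred) if g == "F" and p == "F")
    fp = sum(1 for g, p in zip(gold, pred) if g == "N" and p == "F")
    fn = sum(1 for g, p in zip(gold, pred) if g == "F" and p == "N")
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f
```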

4.2 Results and discussion

The foreign seed extraction method correctly identifies the default vowel insertion strategy. Table 3 lists the number of different consonant clusters for which each vowel in Korean is selected as the top candidate. [ɨ] is predicted to be the default vowel, as it is chosen most often overall. Its predicted site of insertion for onset clusters is between the consonants of each cluster, as it is chosen more often there than before the consonants. Similarly, its predicted site of insertion for coda clusters is after the consonants of each cluster rather than between the consonants. The 28 phoneme strings made of the default vowel and the consonant pairs it allegedly separates are listed in the row labeled SEJONG in Table 4. They specify what traces of vowel insertion would look like and define the pattern matched against the atypical words to extract the foreign seed. All but three of them indeed occur as traces of vowel insertion in one or more loanwords in the entire data used for the present study.

The foreign seed consists of 2,500 eojeols (out of 50,992 atypical ones) that contain one or more of the phoneme strings. The foreign seed does contain false positives, but their proportion is not that large: 489/2,500 (= 19.56%). Since SEJONG is unlabeled and too large, it is hard to tell what percentage of loanwords the foreign seed represents.

But if one extracted all atypical words in NIKL-1 that contained the phoneme strings, it would return a foreign seed containing 458/21,176 = 2.16% of all the loanwords in the dataset. So the foreign seed is small in size and represents a tiny fraction of loanwords.

The seeded classifier can be trained effectively with smoothstep EM (see row 2 in Table 5 for scores). Despite the small seed, recall is high (85.51%) without a compromise in precision (94.21%). The scores are, of course, lower than those of the supervised classifier (see row 1 in Table 5). Precision is lower by 2.67 percentage points and recall is lower by 10.95 percentage points. But considering the unsupervised nature of the approach, the scores are encouraging.

The classifier performs better when trained with smoothstep EM than with the other two variants of EM (see rows 4 and 5 in Table 5). Precision is just as high but recall is a bit lower (80.16%) when trained with hard EM. On the other hand, precision is miserable (47.81%) although recall is higher (91.46%) when trained with soft EM. Figure 2 illustrates how well the classifier performs on NIKL-1 over time as it is iteratively trained on SEJONG with the three variants of EM. Right after initialization, the scores of the classifier are precision = 93.82% and recall = 52.07%. All three variants boost recall significantly within the first several iterations. Soft EM is the most successful, followed by smoothstep EM, and then hard EM. But while the other two not only maintain but also marginally improve precision, soft EM steadily loses precision throughout the whole training session.

Bootstrapping is more effective with the proposed seeding method than with random seeding. Scores of three different randomly seeded classifiers trained with smoothstep EM are listed in rows 6-8 in Table 5. Compared to the proposed classifier, although their precision is higher by around 1 percentage point, their recall is lower by around 14 percentage points. But their performance is rather consistent as well as strong and deserves a closer look. The three randomly seeded classifiers all followed a similar trajectory as they evolved. To briefly describe the process using a clustering analogy, the foreign cluster, which started out as a small random set of 50,992 eojeols, immediately shrank to a much smaller set including those with hapax character bigrams, whose type frequency is one. For one of the three classifiers, the foreign cluster shrank to a set of 5,421 eojeols as soon as training began, and 2,061 of them contained hapax bigrams.

It is likely that many words containing hapax bigrams were loanwords and the foreign cluster eventually grew around them. In fact, among the 4,378 words in NIKL-1 containing character bigrams that appear only once in SEJONG, 1,601 are native words and 2,777 are loanwords. The process makes intuitive sense. At the beginning, the foreign cluster is overwhelmed in size by the native cluster and unlikely to have homogeneous subclusters due to random initialization. Eojeols in the foreign cluster will be absorbed by the native cluster unless they have bigrams that seem alien to the native cluster. Hapax bigrams would be a prime example of such bigrams, and as a result they figure more prominently in the foreign cluster. Loanwords are alien to begin with, so it makes sense that they are more likely to have hapax bigrams than native words. The dynamics involving data size, randomness, hapax bigrams, and loanwords are indeed interesting and did lead to good classifiers. But at the moment, it is not clear if they are reliable and predictable. More importantly, the proposed seeding method led to significantly better classifiers.

Robustness to noise: The proposed methods are effective despite some noise in the training data. There are two sources of noise in SEJONG: crude grapheme-to-phoneme conversion (G2P) and lack of morphological processing. G2P generates the phonetic transcriptions required for foreign seed extraction. In the experiments above, the transcriptions were generated by applying a rather simple set of rules. Grapheme-phoneme correspondence in Hangul is quite regular, but there are phonological patterns such as coda neutralization and tensification (Sohn, 1999) that the rules do not capture. Accordingly, the resulting transcriptions would be decent approximations but occasionally incorrect. In fact, when the rules are tested on 14,007 words randomly chosen from the Standard Korean Dictionary, word accuracy and phoneme accuracy are 67.92% and 94.67%. One could ask if the proposed methods would perform better with more accurate transcriptions. An experiment with a better G2P suggests that the approximate transcriptions are good enough. A joint 5-gram model (Bisani and Ney, 2008) was trained on 126,068 words from the Standard Korean Dictionary. The model transcribes words in SEJONG differently from the rules: by 36.62% in terms of words and 5.53% in terms of phonemes. The model's transcriptions are expected to be more accurate. Its word accuracy and phoneme accuracy on the 14,007 words mentioned above are 95.30% and 99.35%. Building the classifier from scratch using the new transcriptions barely changes the results.

The foreign seed extraction method again correctly identifies the default vowel insertion strategy. It identifies [ɨ] as the default vowel, inserted between the consonants in onsets and after the consonants in codas. It picks 31 phoneme strings including the vowel as potential traces of insertion (see SEJONG-g2p in Table 4). All but four of them have example loanwords in which they occur as traces of vowel insertion. The set of phoneme strings is similar to the one identified before, with a 73.53% overlap between the two. The resulting foreign seed is even more similar to the previous seed, with an 84.35% overlap between the two. The new seed is slightly larger than the previous seed (2,527 vs. 2,500 words) but has a higher proportion of false positives (20.66% vs. 19.56%). The two seeds lead to very similar classifiers trained with smoothstep EM. The two trained classifiers tag 99.39% of words in NIKL-1 in the same way, and their scores differ by only 0.24-0.48 percentage points (see row 9 in Table 5 for the new classification scores).

The training data in the experiments above include eojeols containing both native and foreign morphemes. Loanwords can be suffixed with native morphemes, combine with native words to form compounds, or both. A good example is 투자펀드를 (investment-fund-ACC), where 투자 and 를 are native and 펀드 is foreign. Such items may mislead the classifier to recruit false positives during training. One could ask if the performance of the proposed methods can be improved by stemming or further morpheme segmentation. Experiments suggest that they improve precision but at the sacrifice of recall. Data for the experiments consist of a set of 250,844 stems and a set of 132,430 non-suffix morphemes in SEJONG. Eojeols in SEJONG are morphologically annotated in the original corpus. For example, 투자펀드를 is annotated 투자/NNG + 펀드/NNG + 를/JKO. Stems were extracted by removing substrings tagged as suffixes and particles (e.g. 투자펀드를 → 투자펀드). Non-suffix morphemes were extracted by splitting the derived stems at the specified morpheme boundaries (e.g. 투자펀드 → 투자 and 펀드). Two classifiers were built from scratch with rule-based transcriptions: one using the stems and the other using the morphemes.

The foreign seed extraction method is as effective as when it was applied to eojeols. It correctly identifies the default vowel and its site of insertion in both data sets. The phoneme strings identified as potential traces of insertion are listed in the rows labeled SEJONG-stem and SEJONG-morph in Table 4. As before, many of them are indeed found in loanwords because of vowel insertion, while a few of them are not.

The resulting seeds are much smaller but contain proportionally fewer false positives than before: 59/642 = 9.20% and 58/323 = 17.96% when using stems and morphemes, respectively, vs. 489/2,500 = 19.56% when using eojeols. Scores of the seeded classifiers trained with smoothstep EM are listed in rows 10 and 11 in Table 5. Compared to the classifier trained on eojeols, precision improves by 1.55 and 2.14 percentage points, but recall plummets by 11.62 and 23.81 percentage points. The gain in precision is tiny compared to the loss in recall. Perhaps one could prevent the loss in recall by adding more data. But the current results suggest that the proposed methods are good enough, if not better off, without morphological processing.

Robustness to varying data: Experiments with different Korean data suggest that the proposed methods are effective in Korean in general, rather than only on the particular data used above. A new classifier was built from scratch on KAIST using rule-based transcriptions and smoothstep EM and tested on NIKL-2. Its performance was compared with that of the unsupervised classifier trained on SEJONG and a new supervised classifier trained on subsets of NIKL-2. The foreign seed extraction method again correctly identifies the default vowel and its site of insertion. It picks 26 phoneme strings including the vowel as potential traces of insertion (see KAIST in Table 4). All but one of them have example loanwords in which they occur as traces of vowel insertion. The phoneme strings lead to a foreign seed consisting of 4,179 eojeols. The seed contains relatively more false positives (27.35%) than when using eojeols in SEJONG (19.56%). But the scores of the SEJONG classifier and the resulting KAIST classifier tested on NIKL-2 are barely different (see rows 13 and 15 in Table 5). The SEJONG classifier is behind the supervised classifier by 5.31 percentage points in precision and 11.20 in recall (see row 12 in Table 5 for scores of the supervised classifier). The difference is slightly larger than the difference observed with NIKL-1. This is most likely because SEJONG is more different from NIKL-2 than it is from NIKL-1. The perplexity of a character bigram model trained on SEJONG is higher on NIKL-2 (564.55) than on NIKL-1 (484.18).

Adaptation: Unlike for the supervised classifier, the training data and the test data for the unsupervised classifiers come from different sources. For example, one unsupervised classifier was trained on SEJONG and tested on NIKL-1, while the supervised classifier compared with it was both trained and tested on NIKL-1. So the comparison between the two was not entirely fair. Experiments show that a simple adaptation method such as linear interpolation can fix the problem. In sum, a baseline classifier is interpolated with a new classifier that inherits parameters from the baseline classifier and is iteratively trained on adaptation data.

The classifiers are interpolated and make predictions according to the following equation:

\hat{c}(w) = \arg\max_{c} \, (1 - \lambda) \cdot P_{base}(w, c) + \lambda \cdot P_{new}(w, c)   (8)

Here the baseline classifier is the classifier trained on words from an unlabeled corpus (e.g. SEJONG), and the adaptation data is the portion of the labeled data (e.g. NIKL-1) used to train the comparable supervised classifier. Of course, the adaptation data does not include the labels from the original data. The idea is not to provide feedback but to merely expose the classifier to the kinds of words it will be asked to classify later. In the experiments, the new classifier was trained on 90% of the adaptation data with smoothstep EM, just like the baseline classifier. The interpolation weights were estimated using the remaining 10% with the classic EM algorithm. Applying the method to adapt the SEJONG and KAIST classifiers to the NIKL data significantly improves their performance. F-scores of the unsupervised classifiers after adaptation are behind those of the comparable supervised classifiers by no more than 2.5 percentage points. See rows 3, 14, and 16 in Table 5 for scores after adaptation.
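A minimal sketch of the interpolated decision rule in equation (8) follows; p_base and p_new are placeholder functions standing in for the joint scores P(w, c) of the baseline and adapted classifiers, and lam is the interpolation weight.

```python
# Minimal sketch of the interpolated decision rule in equation (8).
# `p_base(w, c)` and `p_new(w, c)` are placeholders for the joint scores P(w, c)
# of the baseline and adapted classifiers; `lam` is the interpolation weight.
def classify_adapted(w, p_base, p_new, lam, classes=("N", "F")):
    return max(classes, key=lambda c: (1 - lam) * p_base(w, c) + lam * p_new(w, c))
```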

5 Applicability to other languages: a pilot study in Japanese

Ideally, the proposed approach should work with any language that does not allow consonant clusters and relies on vowel insertion to repair foreign clusters. In this section, I demonstrate its potential applicability with a pilot study in Japanese. In addition to not allowing consonant clusters, Japanese does not allow consonants in coda position except the moraic nasal (e.g. [san]) and the first part of a geminate obstruent that straddles two syllables (e.g. [kip.pu]). The vowel inserted for repair is usually [u] (e.g. フランス [huransu] for 'France'), but [o] for the coronal stops [t] and [d] (e.g. トレンド [torendo] for 'trend'). It is inserted between the consonants to repair onset clusters and after the consonants to repair coda clusters beginning with [n]. But for other coda clusters, it is inserted after each consonant of the cluster (e.g. ヘルス [herusu] for 'health').

The patterns are similar to those in Korean, so the approach should work without much modification.

The data for the experiment consist of 108,816 words for training and 148,128 words for testing. The training data came from the JEITA corpus (Hagiwara, 2013). It is not obvious how to tell word boundaries and pronunciation in raw Japanese text. Words are not delimited by white space and are sometimes spelled in kanji, which are logographic, rather than in hiragana or katakana, which are phonographic. Fortunately, the corpus comes with the words segmented and additionally spelled in katakana. It is those katakana spellings that constitute the training data. The test data came from JMDict (Breen, 2004), a lexicon annotated with various information, including pronunciation transcribed in either hiragana or katakana and the source language if a word is a loanword. Since loanwords in Japanese are spelled in katakana, I labeled words spelled without any katakana characters as native and words that had language source information and were spelled only in katakana as foreign. This led to a test set of 130,237 native words and 17,891 foreign words.

Some of the words in the training and test data were respelled to make the classification task non-trivial. First, all words in hiragana were respelled in katakana (e.g. それ → ソレ). Otherwise, one could simply label any word in hiragana as native and avoid false positives. Second, all instances of choonpu were replaced with the proper vowel characters given the context (e.g. ハープーン [haapuun] 'harpoon' → ハアプウン). The choonpu character in katakana indicates long vowels, which in hiragana are indicated by adding an extra vowel character. Without the correction, one could simply label words with choonpu as foreign and identify a significant portion of loanwords.

The n-gram models in the experiment were trained on katakana character bigrams. Phonetic transcriptions for foreign seed extraction were generated essentially by romanization. Katakana symbols were romanized following the Nihon-shiki system (e.g. シャツ → syatu) and each letter was mapped to the corresponding phonetic symbol (e.g. syatu → [sjatu]). All other aspects of the experiment were set up in the same way as in the experiments in Korean.
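As an illustration of the choonpu normalization step described above, a minimal sketch follows. The small vowel-mapping table is only a fragment invented for the example and would need to cover the full katakana inventory in practice.

```python
# Minimal sketch of choonpu normalization: replace each 'ー' with the vowel
# of the preceding katakana character. KATAKANA_VOWEL is a deliberately tiny,
# illustrative fragment of the full mapping.
KATAKANA_VOWEL = {"ハ": "ア", "プ": "ウ", "レ": "エ", "ド": "オ", "ン": None}

def replace_choonpu(word: str) -> str:
    out = []
    for ch in word:
        if ch == "ー" and out:
            vowel = KATAKANA_VOWEL.get(out[-1])
            if vowel:
                out.append(vowel)
                continue
        out.append(ch)
    return "".join(out)

print(replace_choonpu("ハープーン"))  # -> ハアプウン
```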

The results appear promising. The foreign seed extraction method identifies [u] as the default vowel and its site of insertion as between the consonants in onsets and after the consonants in codas. It picks 14 phoneme strings including the vowel as potential traces of insertion (see JEITA in Table 4). Eight of them have example loanwords in which they occur as traces of vowel insertion. The phoneme strings lead to a foreign seed consisting of 173 words, which include 68 false positives (46.26%). It is encouraging that the method correctly identifies the default vowel insertion strategy. But the resulting foreign seed is quite small, partly because the corpus is small to begin with, and less accurate than the seeds in the Korean experiments. Classification scores are listed in rows 17-19 in Table 5. Overall, the scores are lower than the scores achieved in Korean. Considering that the scores are lower even for the supervised classifier, it seems that character bigrams are less effective in Japanese than in Korean. As expected from the size of the foreign seed, recall of the unsupervised classifier is quite low. But after adaptation to the lexicon, recall improves significantly and the F-score is not that far behind that of the supervised classifier.

6 Conclusion

I proposed an unsupervised method for developing a classifier that identifies loanwords in Korean text. As shown in the experiments discussed above, the method can yield an effective classifier that can be made to perform at a level comparable to that of a supervised classifier. The method is cost-efficient as it does not require language resources other than a large monolingual corpus, a grapheme-to-phoneme converter, and perhaps a lexicon to supplement the corpus. The method is in principle applicable to a wide range of languages, i.e. those that rely on vowel insertion to repair illegal consonant clusters. Results from the pilot experiment in Japanese were encouraging. Future studies will further explore applicability of the method to other languages, especially under-resourced languages.

References

Baker, K. and Brew, C. (2008). Statistical identification of English loanwords in Korean using automatically generated training data. In Proceedings of the 6th Language Resources and Evaluation Conference (LREC'08), pages 1159–1163.

Bali, R.-M., Chong, C. C., and Pek, K. N. (2007). Identifying and classifying unknown words in Malay texts. In Proceedings of the 7th International Symposium on Natural Language Processing, pages 493–498.


Bisani, M. and Ney, H. (2008). Joint-sequence models for grapheme-to-phoneme conversion. Speech Communication, 50(5):434–451.

Breen, J. (2004). JMDict: a Japanese-multilingual dictionary. In Proceedings of the Workshop on Multilingual Linguistic Resources, pages 71–79. Association for Computational Linguistics.

Clements, G. N. (1990). The role of the sonority cycle in core syllabification. In Kingston, J. and Beckman, M., editors, Papers in Laboratory Phonology I: Between the Grammar and Physics of Speech, pages 283–333. Cambridge: Cambridge University Press.

Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), pages 1–38.

Goldberg, Y. and Elhadad, M. (2008). Identification of transliterated foreign words in Hebrew script. In Computational Linguistics and Intelligent Text Processing, pages 466–477. Springer, Berlin-Heidelberg.

Hagiwara, M. (2013). JEITA public morphologically tagged corpus (in ChaSen format).

Hall, N. (2011). Vowel epenthesis. In van Oostendorp, M., Ewen, C. J., Hume, E., and Rice, K., editors, The Blackwell Companion to Phonology, pages 1576–1596. Malden, MA & Oxford: Wiley-Blackwell.

Haspelmath, M. and Tadmor, U. (2009). Loanwords in the World's Languages: A Comparative Handbook. Walter de Gruyter.

Jeong, K. S., Myaeng, S. H., Lee, J. S., and Choi, K.-S. (1999). Automatic identification and back-transliteration of foreign words for information retrieval. Information Processing and Management, 35:523–540.

Kang, Y. (2011). Loanword phonology. In van Oostendorp, M., Ewen, C. J., Hume, E., and Rice, K., editors, The Blackwell Companion to Phonology, pages 2258–2281. Malden, MA & Oxford: Wiley-Blackwell.

Khaltar, B.-O. and Fujii, A. (2009). A lemmatization method for Mongolian and its application to indexing for information retrieval. Information Processing & Management, 45(4):438–451.


Knight, K. and Graehl, J. (1998). Machine transliteration. Computational Linguistics, 24(4):599–612.

Korea Advanced Institute of Science and Technology (1997). Automatically analyzed large scale KAIST corpus [Data file].

Ladefoged, P. (2001). A Course in Phonetics, 4th edition. Orlando: Harcourt Brace.

Maddieson, I. (2013). Syllable structure. In Dryer, M. S. and Haspelmath, M., editors, The World Atlas of Language Structures Online. Leipzig: Max Planck Institute for Evolutionary Anthropology.

Ministry of Culture, Sports, and Tourism of South Korea, and National Institute of the Korean Language (2011). The 21st Century Sejong Project [Data file].

NIKL (2000a). gukeo eohwiui bunryu mokrok yeongu. Resource document.

NIKL (2000b). pyojuneo geomtoyong jaryo. Resource document.

NIKL (2000c). pyojungukeodaesajeon pyeonchanyong eowon jeongbo jaryo. Resource document.

NIKL (2000d). yongeon hwalyongpyo. Resource document.

NIKL (2008). Survey of the state of loanword usage [Data file].

NIKL (2013). oeraeeo pyogi yongrye jaryo – romaja inmyeonggwa jimyeong. Resource document.

Nwesri, A. F. A. (2008). Effective Retrieval Techniques for Arabic Text. PhD thesis, RMIT University, Melbourne, Australia.

Oh, J.-H. and Choi, K.-S. (2001). Automatic extraction of transliterated foreign words using hidden Markov model. In Proceedings of the International Conference on Computer Processing of Oriental Languages, pages 433–438.

Ravi, S. and Knight, K. (2009). Learning phoneme mappings for transliteration without parallel data. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 37–45.


Selkirk, E. (1984). On the major class features and syllable theory. In Aronoff, M. and Oehrle, R. T., editors, Language Sound Structure: Studies in Phonology Presented to Morris Halle by His Teachers and Students, pages 107–136. Cambridge, MA: MIT Press.

Sohn, H.-M. (1999). The Korean Language. Cambridge: Cambridge University Press.

Uffmann, C. (2006). Epenthetic vowel quality in loanwords: Empirical and formal issues. Lingua, 116(7):1079–1111.

Witten, I. H. and Bell, T. (1991). The zero-frequency problem: Estimating the probabilities of novel events in adaptive text compression. IEEE Transactions on Information Theory, 37(4):1085–1094.

Yoon, K. and Brew, C. (2006). A linguistically motivated approach to grapheme-to-phoneme conversion for Korean. Computer Speech & Language, 20(4):357–381.

Yoon, S.-Y., Kim, K.-Y., and Sproat, R. (2007). Multilingual transliteration using feature based phonetic method. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 112–119.


Appendix A. Rewrite rules for grapheme-to-phoneme conversion

The table below shows letter-to-phoneme correspondences in Korean. The idea is to transcribe the pronunciation of a spelled word by first decomposing syllable-sized characters into letters and then mapping the letters to their matching phonemes one by one. For example, 한글 → ᄒ + ᅡ + ᄂ + ᄀ + ᅳ + ᄅ → [hankɨl].

Consonant letters: ᄀ → k, ᄁ → k*, ᄂ → n, ᄃ → t, ᄄ → t*, ᄅ (onset) → ɾ, ᄅ (coda) → l, ᄆ → m, ᄇ → p, ᄈ → p*, ᄉ → s, ᄊ → s*, ᄋ (onset) → (null), ᄋ (coda) → ŋ, ᄌ → t͡ʃ, ᄍ → t͡ʃ*, ᄎ → t͡ʃʰ, ᄏ → kʰ, ᄐ → tʰ, ᄑ → pʰ, ᄒ → h

Vowel letters: ㅏ → a, ㅑ → ja, ㅐ → æ, ᅤ → jæ, ᅥ → ʌ, ᅧ → jʌ, ᅦ → e, ᅨ → je, ᅩ → o, ᅭ → jo, ᅪ → wa, ᅫ → wæ, ᅬ → ø, ᅮ → u, ᅲ → ju, ᅯ → wʌ, ᅰ → we, ᅱ → wi, ᅳ → ɨ, ᅵ → i, ᅴ → ɨ͡i
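As an illustration of the decomposition step, the following minimal sketch splits precomposed Hangul syllables into their jamo using standard Unicode arithmetic; the resulting jamo would then be looked up in a letter-to-phoneme table like the one above. The function and variable names are illustrative only.

```python
# Minimal sketch of Hangul syllable decomposition via Unicode arithmetic:
# a precomposed syllable encodes (initial * 21 + medial) * 28 + final
# starting at U+AC00.
INITIALS = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"
MEDIALS = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"
FINALS = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")

def decompose(syllable: str):
    code = ord(syllable) - 0xAC00
    if not 0 <= code < 11172:
        return [syllable]  # not a precomposed Hangul syllable
    initial, rest = divmod(code, 21 * 28)
    medial, final = divmod(rest, 28)
    jamo = [INITIALS[initial], MEDIALS[medial]]
    if final:
        jamo.append(FINALS[final])
    return jamo

print([decompose(ch) for ch in "한글"])  # [['ㅎ','ㅏ','ㄴ'], ['ㄱ','ㅡ','ㄹ']]
```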


Table 1: Korean phonemes and their place in the proposed sonority hierarchy.

Obstruents: p p* pʰ t t* tʰ k k* kʰ s s* h t͡ʃ t͡ʃ* t͡ʃʰ
Sonorants: m n ŋ ɾ l w j
Vowels: a e i o u æ ʌ ø ɨ ɨ͡i


Table 2: Data sizes in number of unique words or eojeols.

Class     SEJONG     KAIST      NIKL-1   NIKL-2   JEITA     JMDict
Native    unknown    unknown    49,962   44,214   unknown   130,237
Foreign   unknown    unknown    21,176   18,943   unknown   17,891
Total     1,019,863  2,409,309  71,138   63,157   108,816   148,128


Table 3: Number of consonant clusters each vowel allegedly repairs via insertion.

                            ɨ    e    ʌ    a    æ    u    ɨ͡i   ø    o    i
Before onset consonants     0    3    4    1    1    2    3    5    1    1
Between onset consonants    17   5    6    8    4    10   11   7    9    6
Between coda consonants     0    3    3    2    1    1    0    0    1    2
After coda consonants       11   9    4    5    10   3    2    4    4    5
Total                       28   20   17   16   16   16   16   16   15   14


Table 4: Phoneme strings chosen as potential traces of insertion. Strings in parentheses were not found in any loanwords as traces of insertion.

Data            Potential traces of insertion
SEJONG          kɨɾ, k*ɨɾ, k*ɨn, kʰɨɾ, pɨɾ, p*ɨɾ, pʰɨɾ, sɨm, sɨn, sɨw, tɨɾ, (tɨŋ), tɨl, t͡ʃ*ɨm, (t͡ʃ*ɨn), tʰɨɾ, tʰɨl, ŋkʰɨ, lpʰɨ, lsɨ, lt͡ʃɨ, lt͡ʃʰɨ, ltʰɨ, mpʰɨ, msɨ, nsɨ, (nt͡ʃ*ɨ), ntʰɨ
SEJONG-g2p      kɨɾ, k*ɨɾ, k*ɨn, kʰɨɾ, kʰɨn, pɨɾ, p*ɨɾ, pʰɨɾ, pʰɨw, sɨm, sɨn, sɨw, tɨɾ, (tɨŋ), tɨl, t͡ʃ*ɨm, (t*ɨj), (t*ɨm), tʰɨɾ, tʰɨl, ŋkʰɨ, lpʰɨ, lsɨ, lt͡ʃɨ, (lt*ɨ), ltʰɨ, mpʰɨ, msɨ, nsɨ, nt͡ʃɨ, ntʰɨ
SEJONG-stem     kɨl, kɨm, kʰɨɾ, pɨɾ, p*ɨɾ, pʰɨɾ, sɨn, sɨw, tɨɾ, tɨl, (t͡ʃ*ɨn), (t͡ʃʰɨŋ), t͡ʃʰɨl, t*ɨl, tʰɨɾ, tʰɨw, ŋkʰɨ, lpʰɨ, lt͡ʃɨ, lt͡ʃʰɨ, ltʰɨ, mpʰɨ, (mt*ɨ), nsɨ, ns*ɨ, (nt͡ʃ*ɨ), nt͡ʃʰɨ
SEJONG-morph    kɨɾ, kɨm, kʰɨɾ, pɨɾ, pʰɨɾ, pʰɨw, sɨɾ, sɨn, sɨw, tɨɾ, (t͡ʃ*ɨn), t͡ʃʰɨl, tʰɨɾ, tʰɨw, ŋkʰɨ, ŋt*ɨ, lpʰɨ, lsɨ, lt͡ʃɨ, lt͡ʃʰɨ, ltʰɨ, msɨ, (mt*ɨ), nsɨ, ns*ɨ, (nt͡ʃ*ɨ), nt͡ʃʰɨ, ntʰɨ
KAIST           kɨɾ, kɨn, k*ɨɾ, k*ɨn, kʰɨɾ, pɨɾ, p*ɨw, pʰɨɾ, pʰɨw, sɨm, sɨn, sɨw, s*ɨɾ, tɨɾ, (tɨŋ), tɨl, t͡ʃ*ɨm, t*ɨl, tʰɨɾ, ŋtʰɨ, lpʰɨ, lt͡ʃɨ, ltʰɨ, mpʰɨ, nsɨ, ntʰɨ
JEITA           (bum), gur, (huj), (huw), (kun), kur, pur, (tuj), (tum), ngu, nhu, nku, nsu, nzu


Table 5: Performance of trained classifiers.

Index  Train (+adapt)     Test    Seeding   Learning        Precision  Recall  F-score
1      NIKL-1             NIKL-1  N/A       Supervised      96.88      96.46   96.67
2      SEJONG             NIKL-1  Proposed  Smoothstep EM   94.21      85.51   89.65
3      SEJONG (+NIKL-1)   NIKL-1  Proposed  Smoothstep EM   95.49      94.05   94.77
4      SEJONG             NIKL-1  Proposed  Hard EM         94.21      80.16   86.62
5      SEJONG             NIKL-1  Proposed  Soft EM         47.81      93.35   60.81
6      SEJONG             NIKL-1  Random    Smoothstep EM   95.30      70.98   81.36
7      SEJONG             NIKL-1  Random    Smoothstep EM   95.37      71.75   81.89
8      SEJONG             NIKL-1  Random    Smoothstep EM   95.20      71.89   81.92
9      SEJONG-g2p         NIKL-1  Proposed  Smoothstep EM   94.45      85.03   89.49
10     SEJONG-stem        NIKL-1  Proposed  Smoothstep EM   95.76      73.89   83.42
11     SEJONG-morph       NIKL-1  Proposed  Smoothstep EM   96.35      61.70   75.22
12     NIKL-2             NIKL-2  N/A       Supervised      95.36      94.12   94.73
13     SEJONG             NIKL-2  Proposed  Smoothstep EM   90.05      82.92   86.34
14     SEJONG (+NIKL-2)   NIKL-2  Proposed  Smoothstep EM   93.85      90.89   92.34
15     KAIST              NIKL-2  Proposed  Smoothstep EM   90.53      82.52   86.34
16     KAIST (+NIKL-2)    NIKL-2  Proposed  Smoothstep EM   93.80      91.17   92.46
17     JMDict             JMDict  N/A       Supervised      88.17      84.62   86.36
18     JEITA              JMDict  Proposed  Smoothstep EM   81.20      61.82   70.20
19     JEITA (+JMDict)    JMDict  Proposed  Smoothstep EM   88.00      80.27   83.96


Figure 1: Transformation of E[z(w, c)] to ẑ(w, c).


Figure 2: Precision and recall of the unsupervised classifier over iterations.
