
An Unsupervised Method for Identifying Loanwords in Korean

Hahn Koo
San Jose State University
[email protected]

Manuscript to appear in Language Resources and Evaluation. The final publication is available at Springer via http://dx.doi.org/10.1007/s10579-015-9296-5.

Abstract

This paper presents an unsupervised method for developing a character-based n-gram classifier that identifies loanwords, or transliterated foreign words, in Korean text. The classifier is trained on an unlabeled corpus using the Expectation-Maximization algorithm, building on seed words extracted from the corpus. Words with high token frequency serve as native seed words. Words with seeming traces of vowel insertion to repair consonant clusters serve as foreign seed words. What counts as a trace of insertion is determined using phoneme co-occurrence statistics in conjunction with ideas and findings in phonology. Experiments show that the method can produce an unsupervised classifier that performs at a level comparable to that of a supervised classifier. In a cross-validation experiment using a corpus of about 9.2 million words and a lexicon of about 71,000 words, mean F-scores of the best unsupervised classifier and the corresponding supervised classifier were 94.77% and 96.67%, respectively. Experiments also suggest that the method can be readily applied to other languages with similar phonotactics, such as Japanese.

Keywords: Loanwords; Transliteration; Detection; N-gram; EM algorithm; Korean


1 Introduction

Loanwords are words whose meaning and pronunciation are borrowed from words in a foreign language. Their forms, both pronunciation and spelling, are often nativized. Their pronunciations adapt to conform to native sound patterns. Their spellings are transliterated using the native script and reflect the adapted pronunciations. For example, flask [flæsk] in English becomes 플라스크 [pʰɨl.ɾa.sɨ.kʰɨ] in Korean.

The present paper is concerned with building a system that scans Korean text and identifies loanwords¹ spelled in Hangul, the Korean alphabet. Such a system can be useful in many ways. First, one can use the system to collect data to study various aspects of loanwords (e.g. Haspelmath and Tadmor, 2009) or develop machine transliteration systems (e.g. Knight and Graehl, 1998; Ravi and Knight, 2009). Loanwords or transliterations (e.g. 플라스크) can be extracted from monolingual corpora by running the system alone. Transliteration pairs (e.g. <flask, 플라스크>) can be extracted from parallel corpora by first identifying the output with the system and then matching input forms based on scoring heuristics such as phonetic similarity (e.g. Yoon et al., 2007). Second, the system allows one to use the etymological origins of words as a feature and be more discriminating in text processing. For example, grapheme-to-phoneme conversion in Korean (Yoon and Brew, 2006) and stemming in Arabic (Nwesri, 2008) can be improved by keeping separate rules for native words and loanwords. The system can be used to classify a given word into either category and apply the proper set of rules.

The loanword identification system envisioned here is a binary, character-based n-gram classifier. Given a word (w) spelled in Hangul, the classifier decides whether the word is of native (N) or foreign (F) origin by Bayesian classification, i.e. solving the following equation:

\hat{c}(w) = \arg\max_{c \in \{N, F\}} P(w \mid c) \cdot P(c)   (1)

The likelihood P(w|c) is calculated using a character n-gram model specific to that class.

¹ In this paper, loanwords in Korean refer to all words of foreign origin that are transliterated in Hangul, except Sino-Korean words, which are ancient borrowings from Chinese. Sino-Korean words are considered more native-like than other words of foreign origin due to their longer history and higher morphological productivity (Sohn, 1999).

The classifier is trained on a corpus in an unsupervised manner, building on seed words extracted from the corpus. The native seed consists of words with high token frequency in the corpus. The idea is that frequent words are more likely to be native words than foreign words. The foreign seed consists of words that contain what appear to be traces of vowel insertion. Korean does not have words that begin or end with consonant clusters. As in many other languages with similar phonotactics (e.g. Japanese), foreign words with consonant clusters are transliterated with vowels inserted to break the clusters. The presence of substrings that resemble traces of insertion suggests that a word may be of foreign origin. An obvious problem is deciding what those traces look like a priori. Here the problem is resolved by a heuristic based on phoneme co-occurrence statistics and rudimentary ideas and findings in phonology.

The rest of the paper is organized as follows. In Section 2, I discuss previous studies on foreign word identification as well as ideas and findings in phonology that the present study builds on. I describe the proposed method for developing the unsupervised classifier in detail in Section 3. I discuss experiments that evaluate the effectiveness of the method in Korean in Section 4 and pilot experiments in Japanese that explore its applicability to other languages in Section 5. I conclude the paper in Section 6.

2 Background

This work is motivated by previous studies on identifying loanwords or foreign words in monolingual data. Many of them rely on the assumption that the distribution of strings of sublexical units such as phonemes, letters, and syllables differs between words of different origins. Some write explicit and categorical rules stating which substrings are characteristic of foreign words (e.g. Bali et al., 2007; Khaltar and Fujii, 2009). Some train letter or syllable n-gram models separately for native words and foreign words and compare the two. It has been shown that the n-gram approach can be very effective in Korean (e.g. Jeong et al., 1999; Oh and Choi, 2001).

Training the n-gram models is straightforward with labeled data in which words are tagged either native or foreign. But creating labeled data can be expensive and tedious. In response, some have proposed methods for generating pseudo-annotated data: Baker and Brew (2008) for Korean and Goldberg and Elhadad (2008) for Hebrew.

In both studies, the authors suggest generating pseudo-loanwords by applying transliteration rules to a foreign lexicon such as the CMU Pronouncing Dictionary. They suggest different methods for generating pseudo-native words. Baker and Brew extract words with high token frequencies in a Korean newswire corpus, assuming that frequent words are more likely to be native than foreign. Goldberg and Elhadad extracted words from a collection of old Hebrew texts, assuming that old texts are much less likely to contain foreign words than recent texts. The approach is effective, and a classifier trained on the pseudo-labeled data can perform comparably to a classifier trained on manually labeled data. Baker and Brew trained a logistic regression classifier using letter trigrams on about 180,000 pseudo-words, half pseudo-Korean and half pseudo-English. Tested on a labeled set of 10,000 native Korean words and 10,000 English loanwords, the classifier showed 92.4% classification accuracy. In comparison, the corresponding classifier trained on manually labeled data showed 96.2% accuracy in a 10-fold cross-validation experiment.

The pseudo-annotation approach obviates the need to manually label data. But one has to write a separate set of transliteration rules for every pair of languages. In addition, the transliteration rules may not be available to begin with, if the very purpose of identifying loanwords is to collect training data for machine transliteration. The foreign seed extraction method proposed in the present study is an attempt to reduce the level of language-specificity and the demand for additional natural language processing capabilities. The method essentially equips one with a subset of transliteration rules by presupposing a generic pattern in pronunciation change, i.e. vowel insertion.

The method should be applicable to many language pairs. The need to repair consonant clusters arises for many language pairs, and vowel insertion is a repair strategy adopted in many languages. Foreign sound sequences that are phonotactically illegal in the native language are usually repaired rather than overlooked. A common source of phonotactic discrepancy involves consonant clusters: different languages allow consonant clusters of different complexity. Maddieson (2013) identifies 151 languages that allow a wide variety of consonant clusters, 274 languages that allow only a highly restricted set of clusters, and 61 languages that do not allow clusters at all. Illegal clusters are repaired by vowel insertion or consonant deletion, but vowel insertion appears to be cross-linguistically more common (Kang, 2011).


The vowel insertion pattern is initially characterized only generically as 'insert vowel X in position Y to repair consonant cluster Z'. The generic nature of the characterization ensures language-neutrality. But in order for the pattern to be of any use, one must eventually flesh out the details and provide instances of the pattern equivalent to specific transliteration rules: 'insert [u] between the consonants to repair [sm]', or [sm] → [sum], for example. Here the language-specific details of vowel insertion are discovered from a corpus in a data-driven manner, but the search process is guided by findings and ideas in phonology. As will be described in detail below, possible values of which vowel is inserted where are constrained based on typological studies of loanword adaptation (e.g. Kang, 2011) and vowel insertion (e.g. Hall, 2011). Possible consonant sequences originating from a cluster are delimited by the idea of the sonority sequencing principle (e.g. Clements, 1990).

3 Proposal

The goal is to build a Bayesian classifier made of two character n-gram models: one for native words (N) and the other for foreign words (F). That is,

\hat{c}(w) = \arg\max_{c \in \{N, F\}} P(c) \cdot P(w \mid c) \approx \arg\max_{c \in \{N, F\}} P(c) \cdot \prod_i P(g_i \mid g_{i-n+1}^{i-1}, c)   (2)

where g_i is the i-th character of w and g_{i-n+1}^{i-1} is the string of n-1 characters preceding it. In this study, the n-gram models use Witten-Bell smoothing (Witten and Bell, 1991) for its ease of implementation. That is,

P(g_i \mid g_{i-n+1}^{i-1}, c) = \left(1 - \lambda_c(g_{i-n+1}^{i-1})\right) \cdot P_{mle}(g_i \mid g_{i-n+1}^{i-1}, c) + \lambda_c(g_{i-n+1}^{i-1}) \cdot P(g_i \mid g_{i-n+2}^{i-1}, c)   (3)

So the parameters of the classifier consist of P(c), P_mle(g_i | g_{i-n+1}^{i-1}, c), and λ_c(g_{i-n+1}^{i-1}). They can be estimated from data as follows:

P(c) = \frac{\sum_w z(w, c)}{\sum_{c'} \sum_w z(w, c')}   (4)


P_{mle}(g_i \mid g_{i-n+1}^{i-1}, c) = \frac{\sum_w freq_w(g_{i-n+1}^{i}) \cdot z(w, c)}{\sum_w freq_w(g_{i-n+1}^{i-1}) \cdot z(w, c)}   (5)

\lambda_c(g_{i-n+1}^{i-1}) = \frac{\sum_w freq_w(g_{i-n+1}^{i-1}) \cdot z(w, c)}{N_{1+}(g_{i-n+1}^{i-1}\,\bullet) + \sum_w freq_w(g_{i-n+1}^{i-1}) \cdot z(w, c)}   (6)

z(w, c) indicates whether w is classified as c: z(w, c) = 1 if it is and z(w, c) = 0 otherwise. freq_w(x) is the number of times x occurs in w. N_{1+}(g_{i-n+1}^{i-1} •) is the number of different n-grams prefixed by g_{i-n+1}^{i-1} that occur at least once. The challenge here is that the training corpus is unlabeled, i.e. z(w, c) is hidden. I use variants of the EM algorithm to iteratively guess z(w, c) and update the parameters. The n-gram models are initialized with seed words extracted from the corpus. For the native class, I use high-frequency words in the corpus as seed words: for example, all words whose token frequency is in the 95th percentile. For the foreign class, I first use sublexical statistics to list phoneme strings that would result from vowel insertion and then use words that contain those phoneme strings as seed words. Below I describe in detail how foreign seed words are extracted and how the seeded classifier is iteratively trained.
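Before turning to those details, the decision rule in equation (2) can be sketched as follows for the bigram case (n = 2). This is a minimal illustration, not the paper's implementation: prior, bigram_prob, and the start-of-word symbol "<" are placeholder assumptions standing in for P(c), the smoothed P(g_i | g_{i-1}, c), and word-boundary padding.

```python
import math

# Minimal sketch of the decision rule in equation (2) with bigram models (n = 2).
# `prior(c)` and `bigram_prob(c, prev, cur)` are placeholders for P(c) and the
# smoothed P(g_i | g_{i-1}, c); the "<" padding symbol is an illustrative assumption.
def classify(word, prior, bigram_prob, classes=("N", "F")):
    def score(c):
        s = math.log(prior(c))
        prev = "<"
        for cur in word:
            s += math.log(bigram_prob(c, prev, cur))
            prev = cur
        return s
    return max(classes, key=score)
```

In practice the two bigram_prob functions would be the class-conditional models of equations (3)-(6), re-estimated after every iteration of the training procedure described in Section 3.2.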

3.1 Foreign seed extraction

The method aims to identify loanwords whose original forms contain consonant clusters and use them as foreign seed words. This is done by string or pattern matching, where the pattern consists of phoneme strings that can result from vowel insertion. Consonant clusters do not begin or end syllables in Korean. When foreign words are borrowed, consonant clusters are repaired by inserting a vowel somewhere next to the consonants to break the cluster into separate syllables. Speakers usually insert the same vowel in the same position to repair a given consonant cluster. As a result, transliterations of different words with the same consonant cluster all share a common substring showing the trace of insertion. For example, 트라이 (try), 트레인 (train), 트리 (tree), 트롤 (troll), and 트루 (true) all contain 트ㄹ, which is pronounced [tʰɨɾ]. The idea is to figure out what those signature substrings are in advance and look for words that have them. There is a risk of false positives, since such substrings may exist for reasons other than vowel insertion.

But the hope is that the seeded classifier will gradually learn to be discriminating and use other substrings in words for further disambiguation.

The phoneme strings defining the pattern are specified below as tuples of the form <C1C2, Vid, Vloc> for ease of description. Each tuple characterizes a phoneme string made of two consonants and a vowel. C1 and C2 are the two consonants. Vid is the identity of the vowel. Vloc is the location of the vowel relative to the consonants, i.e. between, before, or after the consonants. For example, <sn, ɨ, between> means [sɨn] as in [sɨnou] for 스노우 (snow), and <ntʰ, ɨ, after> means [ntʰɨ] as in [hintʰɨ] for 힌트 (hint). The idea is to use C1C2 to specify consonants from a foreign cluster and Vid and Vloc to specify which vowel is inserted where to repair the cluster.

Rather than being manually listed using language expertise, the tuples are discovered from a corpus using the following heuristic:

1. List words that appear atypical compared with the native seed words.

2. Extract <C1C2, Vid, Vloc> tuples from the atypical words, where

(a) C1C2 respects the sonority sequencing principle.

(b) Vid and Vloc most strongly co-occur with C1C2 among all vowels.

3. Identify the most common Vid as the default vowel used for insertion. Keep tuples whose Vid matches the default vowel and throw away the rest.

4. Identify the most common Vloc of the default vowel as its site of insertion for clusters in each syllable position (onset or coda). Keep tuples whose Vloc matches the identified site of insertion and throw away the rest.

The basic idea is to find recurring combinations of two consonants that potentially came from a foreign cluster and a vowel. Step 1 defines the search space. It should be easier to see the target pattern if one zeroes in on loanwords. Native words have various morphological patterns that can obscure the target pattern. Of course, it is not yet known which words are loanwords. So instead the method avoids words similar to what are currently believed to be native words, i.e. the native seed words. Put differently, words dissimilar to the native seed words are tentatively loanwords. Here the similarity is measured by a word's length-normalized probability according to a character n-gram model trained on the native seed words: (1/|w|) · log P(w) for a word w of length |w|.

A word is atypical if its probability ranks below a threshold percentile (e.g. 5%).

Step 2 generates a first-pass list. Condition 2a delimits possible consonant sequences from a foreign cluster. According to the sonority sequencing principle, consonants in a syllable are ordered so that consonants of higher sonority appear closer to the vowel of the syllable. There are different proposals on what sonority is and how different classes of consonants rank on the sonority scale (e.g. Clements, 1990; Selkirk, 1984; Ladefoged, 2001). Here I simply classify consonants as either obstruents or sonorants (see Table 1) and stipulate that sonorants have higher sonority than obstruents. I also assume that the sonority of consonants does not change during transliteration, although their identities may change. For example, 'free' changes from [fɹi] to [pʰɨɾi], but [pʰ] remains an obstruent and [ɾ] remains a sonorant. Accordingly, C1C2 must be obstruent-sonorant if it is from an onset cluster and sonorant-obstruent if it is from a coda cluster. To determine with certainty whether the consonants originally occupied the onset or the coda, I focus on phoneme strings found only at word boundaries. If C1C2 are the first two consonants of a word, they are from an onset. If they are the last two consonants of a word, they are from a coda.

Condition 2b is used to guess the vowel inserted to repair each cluster. Only one vowel is repeatedly used, so its co-occurrence with the consonants should not only be noticeable but most noticeable among all vowels. Here the co-occurrence tendency is measured using pointwise mutual information: PMI(C1C2, V) = log P(C1C2, V) - log(P(C1C2) · P(V)), where V = <Vid, Vloc>.

The list is truncated to avoid false positives in steps 3 and 4. This is done by identifying the default vowel insertion strategy and keeping only the tuples consistent with it. Exactly which vowel is inserted where to repair a consonant cluster is context-specific. But a language that relies on vowel insertion for repair usually has a default vowel inserted in typical locations (cf. Uffmann, 2006). Here it is assumed that the default vowel is the one used to repair the most diverse set of consonant clusters. So it is the most frequent vowel in the list. Similarly, its default site of insertion is in principle its most frequent location in the list. But possible sites of insertion differ for onset clusters and coda clusters: before or between the consonants in an onset, but after or between the consonants in a coda (Hall, 2011).

So the default site of insertion is identified separately for onsets and codas.
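As an illustration of condition 2b, the following minimal sketch picks, for each candidate consonant pair, the vowel identity and location with the highest pointwise mutual information. The input format and all names are assumptions made for the example, not part of the method's actual implementation.

```python
import math
from collections import Counter

# Minimal sketch of condition 2b: for each word-edge consonant pair C1C2,
# select the (vowel identity, vowel location) pair with maximal PMI.
# `observations` is assumed to be an iterable of (c1c2, vid, vloc) triples
# extracted from the atypical words.
def best_vowel_per_cluster(observations):
    pair_counts = Counter()    # counts of C1C2
    vowel_counts = Counter()   # counts of (Vid, Vloc)
    joint_counts = Counter()   # counts of the full triple
    for c1c2, vid, vloc in observations:
        pair_counts[c1c2] += 1
        vowel_counts[(vid, vloc)] += 1
        joint_counts[(c1c2, vid, vloc)] += 1
    n = sum(joint_counts.values())
    best = {}
    for (c1c2, vid, vloc), freq in joint_counts.items():
        pmi = math.log(freq / n) - math.log(
            (pair_counts[c1c2] / n) * (vowel_counts[(vid, vloc)] / n)
        )
        if c1c2 not in best or pmi > best[c1c2][0]:
            best[c1c2] = (pmi, vid, vloc)
    return {c: (vid, vloc) for c, (_, vid, vloc) in best.items()}
```

Steps 3 and 4 would then tally the selected Vid and Vloc values over all clusters and keep only the tuples consistent with the most frequent ones.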

3.2 Bootstrapping with EM

The parameters θ to estimate are P(c), P_mle(g_i | g_{i-n+1}^{i-1}, c), and λ_c(g_{i-n+1}^{i-1}). The first parameter, P(c), is initialized according to some assumption about what proportion of words in the given corpus are loanwords. For example, if one assumes that 5% are loanwords, P(N) = 0.95 and P(F) = 0.05. The latter two parameters, which define the n-gram models, are initialized using the seed words as if they were labeled data: z(w, N) = 1 and z(w, F) = 0 for native seed words, and z(w, N) = 0 and z(w, F) = 1 for foreign seed words. Note that other words in the corpus are not used to initialize the n-gram models. The initial parameters are then updated on the whole corpus by iterating the following two steps until some stopping criterion is met.

E-step: Calculate the expected value of z(w, c) using the current parameters.

E[z(w, c)] = P(c \mid w; \theta^{(t)}) = \frac{P(w \mid c; \theta^{(t)}) \cdot P(c; \theta^{(t)})}{\sum_{c'} P(w \mid c'; \theta^{(t)}) \cdot P(c'; \theta^{(t)})}   (7)

M-step: Transform the expected value into ẑ(w, c), i.e. some estimate of z(w, c), and plug it into equations (4-6) to update the parameters.

I experiment with three versions of the algorithm in the present study: soft EM, hard EM, and smoothstep EM. The three differ with respect to how E[z(w, c)] is transformed into ẑ(w, c). In soft EM, which is the same as the classic EM algorithm (Dempster et al., 1977), there is no transformation, i.e. ẑ(w, c) = E[z(w, c)]. In hard EM, ẑ(w, c) = 1 if c = argmax_{c'} E[z(w, c')] and ẑ(w, c) = 0 otherwise. Since there are only two classes here, this is equivalent to applying a threshold function at 0.5 to E[z(w, c)]. In smoothstep EM, a smooth step function is applied instead of the threshold function: ẑ(w, c) = f³(E[z(w, c)]), where f(x) = -2x³ + 3x². Figure 1 illustrates how E[z(w, c)] is transformed into ẑ(w, c) by the three variants of the EM algorithm.
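The three transformations are simple enough to sketch directly. The helper names below are illustrative, and reading f³ as three-fold composition of f (rather than a cube) is an assumption about the notation.

```python
# Minimal sketch of the E-step posterior for the foreign class and the three
# transformations of E[z(w, c)]; `prior` and `likelihood` are placeholders for
# P(c) and P(w | c) under the current parameters.
def posterior_foreign(w, prior, likelihood):
    num = likelihood(w, "F") * prior("F")
    den = num + likelihood(w, "N") * prior("N")
    return num / den

def soft(e):
    # classic (soft) EM: use the expectation as-is
    return e

def hard(e):
    # hard EM: threshold the expectation at 0.5
    return 1.0 if e >= 0.5 else 0.0

def smoothstep(e):
    # smoothstep EM: apply f(x) = -2x^3 + 3x^2 three times (assumed reading of f^3)
    f = lambda x: -2 * x**3 + 3 * x**2
    return f(f(f(e)))
```

Iterating f preserves the fixed points 0, 0.5, and 1 while steepening the transition, which matches the description of smoothstep EM as a compromise between the soft and hard variants.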


As will be shown in the experiments below, soft EM is aggressive while hard EM is conservative in recruiting words to the foreign class. Soft EM gives partial credit even to words that are very unlikely to be foreign according to the current model. Over time, such words may manage to gain enough confidence and be considered foreign. Some of them may turn out to be false positives. On the other hand, hard EM does not give any credit even to words that are just barely below the threshold to be considered foreign. Some of them may turn out to be false negatives. Smoothstep EM is a compromise between the two extremes. It virtually ignores words that do not stand a chance but gives due credit to words that barely missed.

4 Experiments

Experiments show that the proposed approach can be effective in Korean despite its unsupervised nature. Classifiers built on a raw corpus with minor preprocessing (e.g. removing tokens with non-Hangul characters) identify loanwords in test lexicons well. The foreign seed extraction method correctly identifies the default vowel insertion strategy in Korean loanword phonology. The resulting classifier performs better when initialized with the proposed seeding method than with random seeding. Its performance is not that far behind the corresponding supervised classifier either. Moreover, after exposure to the words (but not their labels) used to train the supervised classifier, the unsupervised classifier performs at a level comparable to the supervised classifier. I discuss the details of the experiments below.

4.1 Methods

I use four datasets, called SEJONG, KAIST, NIKL-1, and NIKL-2 below. SEJONG and KAIST are unlabeled data used to initialize and train the unsupervised classifier. SEJONG consists of 1,019,853 types and 9,206,430 tokens of eojeols, which are character strings delimited by white space, equivalent to words or phrases. The eojeols are from a morphologically annotated corpus developed in the 21st Century Sejong Project under the auspices of the Ministry of Culture, Sports, and Tourism of South Korea, and the National Institute of the Korean Language (2011). They were selected by extracting Hangul character strings delimited by white space after removing punctuation marks.

Strings that contained non-Hangul characters (e.g. 12월의, Farrington으로부터) were excluded in the process. KAIST consists of 2,409,309 types and 31,642,833 tokens of eojeols from the KAIST corpus (Korea Advanced Institute of Science and Technology, 1997), extracted in the same way as SEJONG. NIKL-1 and NIKL-2 are labeled data used to test the classifier. They are made of words from various language resources released by the National Institute of the Korean Language (NIKL). NIKL-1 consists of 49,962 native words and 21,176 foreign words selected from two lexicons (NIKL, 2008, 2013). NIKL-2 consists of 44,214 native words and 18,943 foreign names selected from four reports released by NIKL (2000a,b,c,d) and a list of transliterated names of people and places originally spelled in Latin alphabets (NIKL, 2013). I examined the words manually and labeled them either native or foreign. Words of unknown or ambiguous etymological origin were excluded in the process. SEJONG and NIKL-1 are mainly used to examine the effectiveness of the proposed methods. KAIST and NIKL-2 are used to examine whether the methods are robust to varying data. See Table 2 for a summary of data sizes.

The proposed methods are implemented as follows. All n-gram models are trained on character bigrams, where each Hangul character represents a syllable. The high-frequency words defining the native seed are eojeols whose token frequency is above the 95th percentile in a given corpus. When extracting the foreign seed, the so-called 'atypical words' are eojeols whose length-normalized n-gram probabilities lie in the bottom 5% according to the model trained on the native seed. Their phonetic transcriptions are generated by applying the simple rewrite rules in Appendix A. For bootstrapping, the prior probabilities are initialized to P(c = N) = 0.95 and P(c = F) = 0.05. The parameters of the classifier are iteratively updated until the average likelihood of the data improves by no more than 0.01% or the number of iterations reaches 100.

Classification performance is measured in terms of precision, recall, and F-score. Here, precision (p) is the percentage of words correctly classified as foreign out of all words classified as foreign. Recall (r) is the percentage of words correctly classified as foreign out of all words that should have been classified as foreign. F-score is the harmonic mean of the two with equal emphasis on both, i.e. F = 2·p·r/(p + r).

To put the numbers in perspective, scores of classifiers built using the proposed methods are compared with those of supervised classifiers and randomly seeded classifiers. Supervised classifiers are trained and tested on the labeled data (NIKL-1 or NIKL-2) using five-fold cross-validation. The labeled data is partitioned into five equal-sized subsets. The supervised classifier is trained on four subsets and tested on the remaining subset. This is repeated five times for the five different combinations of subsets. Randomly seeded classifiers are unsupervised classifiers with just a different seeding strategy: 5% of the words in the corpus are randomly chosen as foreign seed words and the rest are native seed words. For a fair comparison, the unsupervised classifiers are also tested five separate times on the five subsets of labeled data that the supervised classifier is tested on. Accordingly, classification scores reported below are the arithmetic means of scores on the five subsets.
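For reference, a minimal sketch of the evaluation metrics as defined above, treating the foreign class as the positive class; the labels and function name are illustrative.

```python
# Minimal sketch of precision, recall, and F-score over the foreign class,
# given parallel lists of gold labels and predictions ("N" or "F").
def prf(gold, pred):
    tp = sum(1 for g, p in zip(gold, pred) if g == "F" and p == "F")
    fp = sum(1 for g, p in zip(gold, pred) if g == "N" and p == "F")
    fn = sum(1 for g, p in zip(gold, pred) if g == "F" and p == "N")
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f
```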

4.2 Results and discussion

The foreign seed extraction method correctly identifies the default vowel insertion strategy. Table 3 lists the number of different consonant clusters for which each vowel in Korean is selected as the top candidate. [ɨ] is predicted to be the default vowel, as it is chosen most often overall. Its predicted site of insertion for onset clusters is between the consonants of each cluster, as it is chosen more often there than before the consonants. Similarly, its predicted site of insertion for coda clusters is after the consonants of each cluster rather than between the consonants. The 28 phoneme strings made of the default vowel and the consonant pairs it allegedly separates are listed in the row labeled SEJONG in Table 4. They specify what traces of vowel insertion would look like and define the pattern matched against the atypical words to extract the foreign seed. All but three of them indeed occur as traces of vowel insertion in one or more loanwords in the entire data used for the present study.

The foreign seed consists of 2,500 eojeols (out of 50,992 atypical ones) that contain one or more of the phoneme strings. The foreign seed does contain false positives, but their proportion is not that large: 489/2,500 (= 19.56%). Since SEJONG is unlabeled and too large, it is hard to tell what percentage of loanwords the foreign seed represents.

But if one extracted all atypical words in NIKL-1 that contained the phoneme strings, it would return a foreign seed containing 458/21,176 = 2.16% of all the loanwords in the dataset. So the foreign seed is small in size and represents a tiny fraction of loanwords.

The seeded classifier can be trained effectively with smoothstep EM (see row 2 in Table 5 for scores). Despite the small seed, recall is high (85.51%) without a compromise in precision (94.21%). The scores are, of course, lower than those of the supervised classifier (see row 1 in Table 5). Precision is lower by 2.67 percentage points and recall is lower by 10.95 percentage points. But considering the unsupervised nature of the approach, the scores are encouraging.

The classifier performs better when trained with smoothstep EM than with the other two variants of EM (see rows 4 and 5 in Table 5). Precision is just as high but recall is a bit lower (80.16%) when trained with hard EM. On the other hand, precision is miserable (47.81%) although recall is higher (91.46%) when trained with soft EM. Figure 2 illustrates how well the classifier performs on NIKL-1 over time as it is iteratively trained on SEJONG with the three variants of EM. Right after initialization, the scores of the classifier are precision = 93.82% and recall = 52.07%. All three variants boost recall significantly within the first several iterations. Soft EM is the most successful, followed by smoothstep EM, and then hard EM. But while the other two not only maintain but also marginally improve precision, soft EM steadily loses precision throughout the whole training session.

Bootstrapping is more effective with the proposed seeding method than with random seeding. Scores of three different randomly seeded classifiers trained with smoothstep EM are listed in rows 6-8 in Table 5. Compared to the proposed classifier, although their precision is higher by around 1 percentage point, their recall is lower by around 14 percentage points. But their performance is rather consistent as well as strong and deserves a closer look. The three randomly seeded classifiers all followed a similar trajectory as they evolved. To briefly describe the process using a clustering analogy, the foreign cluster, which started out as a small random set of 50,992 eojeols, immediately shrank to a much smaller set including those with hapax character bigrams, whose type frequency is one. For one of the three classifiers, the foreign cluster shrank to a set of 5,421 eojeols as soon as training began, and 2,061 of them contained hapax bigrams.

It is likely that many words containing hapax bigrams were loanwords and the foreign cluster eventually grew around them. In fact, among the 4,378 words in NIKL-1 containing character bigrams that appear only once in SEJONG, 1,601 are native words and 2,777 are loanwords. The process makes intuitive sense. At the beginning, the foreign cluster is overwhelmed in size by the native cluster and unlikely to have homogeneous subclusters due to random initialization. Eojeols in the foreign cluster will be absorbed by the native cluster unless they have bigrams that seem alien to the native cluster. Hapax bigrams would be a prime example of such bigrams, and as a result they figure more prominently in the foreign cluster. Loanwords are alien to begin with, so it makes sense that they are more likely to have hapax bigrams than native words. The dynamics involving data size, randomness, hapax bigrams, and loanwords are indeed interesting and did lead to good classifiers. But at the moment, it is not clear if they are reliable and predictable. More importantly, the proposed seeding method led to significantly better classifiers.

Robustness to noise: The proposed methods are effective despite some noise in the training data. There are two sources of noise in SEJONG: crude grapheme-to-phoneme conversion (G2P) and lack of morphological processing. G2P generates the phonetic transcriptions required for foreign seed extraction. In the experiments above, the transcriptions were generated by applying a rather simple set of rules. Grapheme-phoneme correspondence in Hangul is quite regular, but there are phonological patterns such as coda neutralization and tensification (Sohn, 1999) that the rules do not capture. Accordingly, the resulting transcriptions would be decent approximations but occasionally incorrect. In fact, when the rules are tested on 14,007 words randomly chosen from the Standard Korean Dictionary, word accuracy and phoneme accuracy are 67.92% and 94.67%. One could ask if the proposed methods would perform better with more accurate transcriptions. An experiment with a better G2P suggests that the approximate transcriptions are good enough. A joint 5-gram model (Bisani and Ney, 2008) was trained on 126,068 words from the Standard Korean Dictionary. The model transcribes words in SEJONG differently from the rules: by 36.62% in terms of words and 5.53% in terms of phonemes. The model's transcriptions are expected to be more accurate. Its word accuracy and phoneme accuracy on the 14,007 words mentioned above are 95.30% and 99.35%. Building the classifier from scratch using the new transcriptions barely changes the results.

The foreign seed extraction method again correctly identifies the default vowel insertion strategy. It identifies [ɨ] as the default vowel, inserted between the consonants in onsets and after the consonants in codas. It picks 31 phoneme strings including the vowel as potential traces of insertion (see SEJONG-g2p in Table 4). All but four of them have example loanwords in which they occur as traces of vowel insertion. The set of phoneme strings is similar to the one identified before, with a 73.53% overlap between the two. The resulting foreign seed is even more similar to the previous seed, with an 84.35% overlap between the two. The new seed is slightly larger than the previous seed (2,527 vs. 2,500 words) but has a higher proportion of false positives (20.66% vs. 19.56%). The two seeds lead to very similar classifiers trained with smoothstep EM. The two trained classifiers tag 99.39% of words in NIKL-1 in the same way, and their scores differ by only 0.24-0.48 percentage points (see row 9 in Table 5 for the new classification scores).

The training data in the experiments above include eojeols containing both native and foreign morphemes. Loanwords can be suffixed with native morphemes, combine with native words to form compounds, or both. A good example is 투자펀드를 (investment-fund-ACC), where 투자 and 를 are native and 펀드 is foreign. Such items may mislead the classifier to recruit false positives during training. One could ask if the performance of the proposed methods can be improved by stemming or further morpheme segmentation. Experiments suggest that they improve precision but at the sacrifice of recall. Data for the experiments consist of a set of 250,844 stems and a set of 132,430 non-suffix morphemes in SEJONG. Eojeols in SEJONG are morphologically annotated in the original corpus. For example, 투자펀드를 is annotated 투자/NNG + 펀드/NNG + 를/JKO. Stems were extracted by removing substrings tagged as suffixes and particles (e.g. 투자펀드를 → 투자펀드). Non-suffix morphemes were extracted by splitting the derived stems at the specified morpheme boundaries (e.g. 투자펀드 → 투자 and 펀드). Two classifiers were built from scratch with rule-based transcriptions: one using the stems and the other using the morphemes.

The foreign seed extraction method is as effective as when it was applied to eojeols. It correctly identifies the default vowel and its site of insertion in both data sets. The phoneme strings identified as potential traces of insertion are listed in the rows labeled SEJONG-stem and SEJONG-morph in Table 4. As before, many of them are indeed found in loanwords because of vowel insertion, while a few of them are not.

The resulting seeds are much smaller but contain proportionally fewer false positives than before: 59/642 = 9.20% and 58/323 = 17.96% when using stems and morphemes, respectively, vs. 489/2,500 = 19.56% when using eojeols. Scores of the seeded classifiers trained with smoothstep EM are listed in rows 10 and 11 in Table 5. Compared to the classifier trained on eojeols, precision improves by 1.55 and 2.14 percentage points, but recall plummets by 11.62 and 23.81 percentage points. The gain in precision is tiny compared to the loss in recall. Perhaps one could prevent the loss in recall by adding more data. But the current results suggest that the proposed methods are good enough, if not better off, without morphological processing.

Robustness to varying data: Experiments with different Korean data suggest that the proposed methods are effective in Korean in general, rather than only on the particular data used above. A new classifier was built from scratch on KAIST using rule-based transcriptions and smoothstep EM and tested on NIKL-2. Its performance was compared with that of the unsupervised classifier trained on SEJONG and a new supervised classifier trained on subsets of NIKL-2. The foreign seed extraction method again correctly identifies the default vowel and its site of insertion. It picks 26 phoneme strings including the vowel as potential traces of insertion (see KAIST in Table 4). All but one of them have example loanwords in which they occur as traces of vowel insertion. The phoneme strings lead to a foreign seed consisting of 4,179 eojeols. The seed contains relatively more false positives (27.35%) than when using eojeols in SEJONG (19.56%). But the scores of the SEJONG classifier and the resulting KAIST classifier tested on NIKL-2 are barely different (see rows 13 and 15 in Table 5). The SEJONG classifier is behind the supervised classifier by 5.31 percentage points in precision and 11.20 in recall (see row 12 in Table 5 for scores of the supervised classifier). The difference is slightly larger than the difference observed with NIKL-1. This is most likely because SEJONG is more different from NIKL-2 than it is from NIKL-1. The perplexity of a character bigram model trained on SEJONG is higher on NIKL-2 (564.55) than on NIKL-1 (484.18).

Adaptation: Unlike for the supervised classifier, the training data and the test data for the unsupervised classifiers come from different sources. For example, one unsupervised classifier was trained on SEJONG and tested on NIKL-1, while the supervised classifier compared with it was both trained and tested on NIKL-1. So the comparison between the two was not entirely fair. Experiments show that a simple adaptation method such as linear interpolation can fix the problem. In sum, a baseline classifier is interpolated with a new classifier that inherits parameters from the baseline classifier and is iteratively trained on adaptation data.

The classifiers are interpolated and make predictions according to the following equation:

\hat{c}(w) = \arg\max_{c} \, (1 - \lambda) \cdot P_{base}(w, c) + \lambda \cdot P_{new}(w, c)   (8)

Here the baseline classifier is the classifier trained on words from an unlabeled corpus (e.g. SEJONG), and the adaptation data is the portion of the labeled data (e.g. NIKL-1) used to train the comparable supervised classifier. Of course, the adaptation data does not include the labels from the original data. The idea is not to provide feedback but to merely expose the classifier to the kinds of words it will be asked to classify later. In the experiments, the new classifier was trained on 90% of the adaptation data with smoothstep EM, just like the baseline classifier. The interpolation weights were estimated using the remaining 10% with the classic EM algorithm. Applying the method to adapt the SEJONG and KAIST classifiers to the NIKL data significantly improves their performance. F-scores of the unsupervised classifiers after adaptation are behind those of the comparable supervised classifiers by no more than 2.5 percentage points. See rows 3, 14, and 16 in Table 5 for scores after adaptation.
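A minimal sketch of the interpolated decision rule in equation (8) follows; p_base and p_new are placeholder functions standing in for the joint scores P(w, c) of the baseline and adapted classifiers, and lam is the interpolation weight.

```python
# Minimal sketch of the interpolated decision rule in equation (8).
# `p_base(w, c)` and `p_new(w, c)` are placeholders for the joint scores P(w, c)
# of the baseline and adapted classifiers; `lam` is the interpolation weight.
def classify_adapted(w, p_base, p_new, lam, classes=("N", "F")):
    return max(classes, key=lambda c: (1 - lam) * p_base(w, c) + lam * p_new(w, c))
```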

5 Applicability to other languages: a pilot study in Japanese

Ideally, the proposed approach should work with any language that does not allow consonant clusters and relies on vowel insertion to repair foreign clusters. In this section, I demonstrate its potential applicability with a pilot study in Japanese. In addition to not allowing consonant clusters, Japanese does not allow consonants in coda position except the moraic nasal (e.g. [san]) and the first part of a geminate obstruent that straddles two syllables (e.g. [kip.pu]). The vowel inserted for repair is usually [u] (e.g. フランス [huransu] for 'France'), but [o] for the coronal stops [t] and [d] (e.g. トレンド [torendo] for 'trend'). It is inserted between the consonants to repair onset clusters and after the consonants to repair coda clusters beginning with [n]. But for other coda clusters, it is inserted after each consonant of the cluster (e.g. ヘルス [herusu] for 'health').

The patterns are similar to those in Korean, so the approach should work without much modification.

The data for the experiment consist of 108,816 words for training and 148,128 words for testing. The training data came from the JEITA corpus (Hagiwara, 2013). It is not obvious how to tell word boundaries and pronunciation in raw Japanese text. Words are not delimited by white space and are sometimes spelled in kanji, which are logographic, rather than in hiragana or katakana, which are phonographic. Fortunately, the corpus comes with the words segmented and additionally spelled in katakana. It is those katakana spellings that constitute the training data. The test data came from JMDict (Breen, 2004), a lexicon annotated with various information, including pronunciation transcribed in either hiragana or katakana and the source language if a word is a loanword. Since loanwords in Japanese are spelled in katakana, I labeled words spelled without any katakana characters as native and words that had language source information and were spelled only in katakana as foreign. This led to a test set of 130,237 native words and 17,891 foreign words.

Some of the words in the training and test data were respelled to make the classification task non-trivial. First, all words in hiragana were respelled in katakana (e.g. それ → ソレ). Otherwise, one could simply label any word in hiragana as native and avoid false positives. Second, all instances of choonpu were replaced with the proper vowel characters given the context (e.g. ハープーン [haapuun] 'harpoon' → ハアプウン). The choonpu character in katakana indicates long vowels, which in hiragana are indicated by adding an extra vowel character. Without the correction, one could simply label words with choonpu as foreign and identify a significant portion of loanwords.

The n-gram models in the experiment were trained on katakana character bigrams. Phonetic transcriptions for foreign seed extraction were generated essentially by romanization. Katakana symbols were romanized following the Nihon-shiki system (e.g. シャツ → syatu) and each letter was mapped to the corresponding phonetic symbol (e.g. syatu → [sjatu]). All other aspects of the experiment were set up in the same way as in the experiments in Korean.
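As an illustration of the choonpu normalization step described above, a minimal sketch follows. The small vowel-mapping table is only a fragment invented for the example and would need to cover the full katakana inventory in practice.

```python
# Minimal sketch of choonpu normalization: replace each 'ー' with the vowel
# of the preceding katakana character. KATAKANA_VOWEL is a deliberately tiny,
# illustrative fragment of the full mapping.
KATAKANA_VOWEL = {"ハ": "ア", "プ": "ウ", "レ": "エ", "ド": "オ", "ン": None}

def replace_choonpu(word: str) -> str:
    out = []
    for ch in word:
        if ch == "ー" and out:
            vowel = KATAKANA_VOWEL.get(out[-1])
            if vowel:
                out.append(vowel)
                continue
        out.append(ch)
    return "".join(out)

print(replace_choonpu("ハープーン"))  # -> ハアプウン
```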

The results appear promising. The foreign seed extraction method identifies [u] as the default vowel and its site of insertion as between the consonants in onsets and after the consonants in codas. It picks 14 phoneme strings including the vowel as potential traces of insertion (see JEITA in Table 4). Eight of them have example loanwords in which they occur as traces of vowel insertion. The phoneme strings lead to a foreign seed consisting of 173 words, which include 68 false positives (46.26%). It is encouraging that the method correctly identifies the default vowel insertion strategy. But the resulting foreign seed is quite small, partly because the corpus is small to begin with, and less accurate than the seeds in the Korean experiments. Classification scores are listed in rows 17-19 in Table 5. Overall, the scores are lower than the scores achieved in Korean. Considering that the scores are lower even for the supervised classifier, it seems that character bigrams are less effective in Japanese than in Korean. As expected from the size of the foreign seed, recall of the unsupervised classifier is quite low. But after adaptation to the lexicon, recall improves significantly and the F-score is not that far behind that of the supervised classifier.

6 Conclusion

I proposed an unsupervised method for developing a classifier that identifies loanwords in Korean text. As shown in the experiments discussed above, the method can yield an effective classifier that can be made to perform at a level comparable to that of a supervised classifier. The method is cost-efficient as it does not require language resources other than a large monolingual corpus, a grapheme-to-phoneme converter, and perhaps a lexicon to supplement the corpus. The method is in principle applicable to a wide range of languages, i.e. those that rely on vowel insertion to repair illegal consonant clusters. Results from the pilot experiment in Japanese were encouraging. Future studies will further explore applicability of the method to other languages, especially under-resourced languages.

References

Baker, K. and Brew, C. (2008). Statistical identification of English loanwords in Korean using automatically generated training data. In Proceedings of the 6th Language Resources and Evaluation Conference (LREC'08), pages 1159–1163.

Bali, R.-M., Chong, C. C., and Pek, K. N. (2007). Identifying and classifying unknown words in Malay texts. In Proceedings of the 7th International Symposium on Natural Language Processing, pages 493–498.


Bisani, M. and Ney, H. (2008). Joint-sequence models for grapheme-to-phoneme conversion. Speech Communication, 50(5):434–451.

Breen, J. (2004). JMDict: a Japanese-multilingual dictionary. In Proceedings of the Workshop on Multilingual Linguistic Resources, pages 71–79. Association for Computational Linguistics.

Clements, G. N. (1990). The role of the sonority cycle in core syllabification. In Kingston, J. and Beckman, M., editors, Papers in Laboratory Phonology I: Between the Grammar and Physics of Speech, pages 283–333. Cambridge: Cambridge University Press.

Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), pages 1–38.

Goldberg, Y. and Elhadad, M. (2008). Identification of transliterated foreign words in Hebrew script. In Computational Linguistics and Intelligent Text Processing, pages 466–477. Springer, Berlin-Heidelberg.

Hagiwara, M. (2013). JEITA public morphologically tagged corpus (in ChaSen format).

Hall, N. (2011). Vowel epenthesis. In van Oostendorp, M., Ewen, C. J., Hume, E., and Rice, K., editors, The Blackwell Companion to Phonology, pages 1576–1596. Malden, MA & Oxford: Wiley-Blackwell.

Haspelmath, M. and Tadmor, U. (2009). Loanwords in the World's Languages: A Comparative Handbook. Walter de Gruyter.

Jeong, K. S., Myaeng, S. H., Lee, J. S., and Choi, K.-S. (1999). Automatic identification and back-transliteration of foreign words for information retrieval. Information Processing and Management, 35:523–540.

Kang, Y. (2011). Loanword phonology. In van Oostendorp, M., Ewen, C. J., Hume, E., and Rice, K., editors, The Blackwell Companion to Phonology, pages 2258–2281. Malden, MA & Oxford: Wiley-Blackwell.

Khaltar, B.-O. and Fujii, A. (2009). A lemmatization method for Mongolian and its application to indexing for information retrieval. Information Processing & Management, 45(4):438–451.


Knight, K. and Graehl, J. (1998). Machine transliteration. Computational Linguistics, 24(4):599–612.

Korea Advanced Institute of Science and Technology (1997). Automatically analyzed large scale KAIST corpus [Data file].

Ladefoged, P. (2001). A Course in Phonetics, 4th edition. Orlando: Harcourt Brace.

Maddieson, I. (2013). Syllable structure. In Dryer, M. S. and Haspelmath, M., editors, The World Atlas of Language Structures Online. Leipzig: Max Planck Institute for Evolutionary Anthropology.

Ministry of Culture, Sports, and Tourism of South Korea, and National Institute of the Korean Language (2011). The 21st Century Sejong Project [Data file].

NIKL (2000a). gukeo eohwiui bunryu mokrok yeongu. Resource document.

NIKL (2000b). pyojuneo geomtoyong jaryo. Resource document.

NIKL (2000c). pyojungukeodaesajeon pyeonchanyong eowon jeongbo jaryo. Resource document.

NIKL (2000d). yongeon hwalyongpyo. Resource document.

NIKL (2008). Survey of the state of loanword usage [Data file].

NIKL (2013). oeraeeo pyogi yongrye jaryo – romaja inmyeonggwa jimyeong. Resource document.

Nwesri, A. F. A. (2008). Effective Retrieval Techniques for Arabic Text. PhD thesis, RMIT University, Melbourne, Australia.

Oh, J.-H. and Choi, K.-S. (2001). Automatic extraction of transliterated foreign words using hidden Markov model. In Proceedings of the International Conference on Computer Processing of Oriental Languages, pages 433–438.

Ravi, S. and Knight, K. (2009). Learning phoneme mappings for transliteration without parallel data. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 37–45.


Selkirk, E. (1984). On the major class features and syllable theory. In Aronoff, M. and Oehrle, R. T., editors, Language Sound Structure: Studies in Phonology Presented to Morris Halle by His Teachers and Students, pages 107–136. Cambridge, MA: MIT Press.

Sohn, H.-M. (1999). The Korean Language. Cambridge: Cambridge University Press.

Uffmann, C. (2006). Epenthetic vowel quality in loanwords: Empirical and formal issues. Lingua, 116(7):1079–1111.

Witten, I. H. and Bell, T. (1991). The zero-frequency problem: Estimating the probabilities of novel events in adaptive text compression. IEEE Transactions on Information Theory, 37(4):1085–1094.

Yoon, K. and Brew, C. (2006). A linguistically motivated approach to grapheme-to-phoneme conversion for Korean. Computer Speech & Language, 20(4):357–381.

Yoon, S.-Y., Kim, K.-Y., and Sproat, R. (2007). Multilingual transliteration using feature based phonetic method. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 112–119.


Appendix A. Rewrite rules for grapheme-to-phoneme conversion

The table below shows letter-to-phoneme correspondences in Korean. The idea is to transcribe the pronunciation of a spelled word by first decomposing syllable-sized characters into letters and then mapping the letters to their matching phonemes one by one. For example, 한글 → ᄒ + ᅡ + ᄂ + ᄀ + ᅳ + ᄅ → [hankɨl].

Consonant letters: ᄀ → k, ᄁ → k*, ᄂ → n, ᄃ → t, ᄄ → t*, ᄅ (onset) → ɾ, ᄅ (coda) → l, ᄆ → m, ᄇ → p, ᄈ → p*, ᄉ → s, ᄊ → s*, ᄋ (onset) → (null), ᄋ (coda) → ŋ, ᄌ → t͡ʃ, ᄍ → t͡ʃ*, ᄎ → t͡ʃʰ, ᄏ → kʰ, ᄐ → tʰ, ᄑ → pʰ, ᄒ → h

Vowel letters: ㅏ → a, ㅑ → ja, ㅐ → æ, ᅤ → jæ, ᅥ → ʌ, ᅧ → jʌ, ᅦ → e, ᅨ → je, ᅩ → o, ᅭ → jo, ᅪ → wa, ᅫ → wæ, ᅬ → ø, ᅮ → u, ᅲ → ju, ᅯ → wʌ, ᅰ → we, ᅱ → wi, ᅳ → ɨ, ᅵ → i, ᅴ → ɨ͡i
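As an illustration of the decomposition step, the following minimal sketch splits precomposed Hangul syllables into their jamo using standard Unicode arithmetic; the resulting jamo would then be looked up in a letter-to-phoneme table like the one above. The function and variable names are illustrative only.

```python
# Minimal sketch of Hangul syllable decomposition via Unicode arithmetic:
# a precomposed syllable encodes (initial * 21 + medial) * 28 + final
# starting at U+AC00.
INITIALS = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"
MEDIALS = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"
FINALS = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")

def decompose(syllable: str):
    code = ord(syllable) - 0xAC00
    if not 0 <= code < 11172:
        return [syllable]  # not a precomposed Hangul syllable
    initial, rest = divmod(code, 21 * 28)
    medial, final = divmod(rest, 28)
    jamo = [INITIALS[initial], MEDIALS[medial]]
    if final:
        jamo.append(FINALS[final])
    return jamo

print([decompose(ch) for ch in "한글"])  # [['ㅎ','ㅏ','ㄴ'], ['ㄱ','ㅡ','ㄹ']]
```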


Table 1: Korean phonemes and their place in the proposed sonority hierarchy.

Obstruents: p p* pʰ t t* tʰ k k* kʰ s s* h t͡ʃ t͡ʃ* t͡ʃʰ
Sonorants: m n ŋ ɾ l w j
Vowels: a e i o u æ ʌ ø ɨ ɨ͡i


Table 2: Data sizes in number of unique words or eojeols.

Class     SEJONG     KAIST      NIKL-1   NIKL-2   JEITA     JMDict
Native    unknown    unknown    49,962   44,214   unknown   130,237
Foreign   unknown    unknown    21,176   18,943   unknown   17,891
Total     1,019,863  2,409,309  71,138   63,157   108,816   148,128


Table 3: Number of consonant clusters each vowel allegedly repairs via insertion.

                            ɨ    e    ʌ    a    æ    u    ɨ͡i   ø    o    i
Before onset consonants     0    3    4    1    1    2    3    5    1    1
Between onset consonants    17   5    6    8    4    10   11   7    9    6
Between coda consonants     0    3    3    2    1    1    0    0    1    2
After coda consonants       11   9    4    5    10   3    2    4    4    5
Total                       28   20   17   16   16   16   16   16   15   14


Table 4: Phoneme strings chosen as potential traces of insertion. Strings in parentheses were not found in any loanwords as traces of insertion.

Data            Potential traces of insertion
SEJONG          kɨɾ, k*ɨɾ, k*ɨn, kʰɨɾ, pɨɾ, p*ɨɾ, pʰɨɾ, sɨm, sɨn, sɨw, tɨɾ, (tɨŋ), tɨl, t͡ʃ*ɨm, (t͡ʃ*ɨn), tʰɨɾ, tʰɨl, ŋkʰɨ, lpʰɨ, lsɨ, lt͡ʃɨ, lt͡ʃʰɨ, ltʰɨ, mpʰɨ, msɨ, nsɨ, (nt͡ʃ*ɨ), ntʰɨ
SEJONG-g2p      kɨɾ, k*ɨɾ, k*ɨn, kʰɨɾ, kʰɨn, pɨɾ, p*ɨɾ, pʰɨɾ, pʰɨw, sɨm, sɨn, sɨw, tɨɾ, (tɨŋ), tɨl, t͡ʃ*ɨm, (t*ɨj), (t*ɨm), tʰɨɾ, tʰɨl, ŋkʰɨ, lpʰɨ, lsɨ, lt͡ʃɨ, (lt*ɨ), ltʰɨ, mpʰɨ, msɨ, nsɨ, nt͡ʃɨ, ntʰɨ
SEJONG-stem     kɨl, kɨm, kʰɨɾ, pɨɾ, p*ɨɾ, pʰɨɾ, sɨn, sɨw, tɨɾ, tɨl, (t͡ʃ*ɨn), (t͡ʃʰɨŋ), t͡ʃʰɨl, t*ɨl, tʰɨɾ, tʰɨw, ŋkʰɨ, lpʰɨ, lt͡ʃɨ, lt͡ʃʰɨ, ltʰɨ, mpʰɨ, (mt*ɨ), nsɨ, ns*ɨ, (nt͡ʃ*ɨ), nt͡ʃʰɨ
SEJONG-morph    kɨɾ, kɨm, kʰɨɾ, pɨɾ, pʰɨɾ, pʰɨw, sɨɾ, sɨn, sɨw, tɨɾ, (t͡ʃ*ɨn), t͡ʃʰɨl, tʰɨɾ, tʰɨw, ŋkʰɨ, ŋt*ɨ, lpʰɨ, lsɨ, lt͡ʃɨ, lt͡ʃʰɨ, ltʰɨ, msɨ, (mt*ɨ), nsɨ, ns*ɨ, (nt͡ʃ*ɨ), nt͡ʃʰɨ, ntʰɨ
KAIST           kɨɾ, kɨn, k*ɨɾ, k*ɨn, kʰɨɾ, pɨɾ, p*ɨw, pʰɨɾ, pʰɨw, sɨm, sɨn, sɨw, s*ɨɾ, tɨɾ, (tɨŋ), tɨl, t͡ʃ*ɨm, t*ɨl, tʰɨɾ, ŋtʰɨ, lpʰɨ, lt͡ʃɨ, ltʰɨ, mpʰɨ, nsɨ, ntʰɨ
JEITA           (bum), gur, (huj), (huw), (kun), kur, pur, (tuj), (tum), ngu, nhu, nku, nsu, nzu


Table 5: Performance of trained classifiers.

Index  Train (+adapt)     Test    Seeding   Learning        Precision  Recall  F-score
1      NIKL-1             NIKL-1  N/A       Supervised      96.88      96.46   96.67
2      SEJONG             NIKL-1  Proposed  Smoothstep EM   94.21      85.51   89.65
3      SEJONG (+NIKL-1)   NIKL-1  Proposed  Smoothstep EM   95.49      94.05   94.77
4      SEJONG             NIKL-1  Proposed  Hard EM         94.21      80.16   86.62
5      SEJONG             NIKL-1  Proposed  Soft EM         47.81      93.35   60.81
6      SEJONG             NIKL-1  Random    Smoothstep EM   95.30      70.98   81.36
7      SEJONG             NIKL-1  Random    Smoothstep EM   95.37      71.75   81.89
8      SEJONG             NIKL-1  Random    Smoothstep EM   95.20      71.89   81.92
9      SEJONG-g2p         NIKL-1  Proposed  Smoothstep EM   94.45      85.03   89.49
10     SEJONG-stem        NIKL-1  Proposed  Smoothstep EM   95.76      73.89   83.42
11     SEJONG-morph       NIKL-1  Proposed  Smoothstep EM   96.35      61.70   75.22
12     NIKL-2             NIKL-2  N/A       Supervised      95.36      94.12   94.73
13     SEJONG             NIKL-2  Proposed  Smoothstep EM   90.05      82.92   86.34
14     SEJONG (+NIKL-2)   NIKL-2  Proposed  Smoothstep EM   93.85      90.89   92.34
15     KAIST              NIKL-2  Proposed  Smoothstep EM   90.53      82.52   86.34
16     KAIST (+NIKL-2)    NIKL-2  Proposed  Smoothstep EM   93.80      91.17   92.46
17     JMDict             JMDict  N/A       Supervised      88.17      84.62   86.36
18     JEITA              JMDict  Proposed  Smoothstep EM   81.20      61.82   70.20
19     JEITA (+JMDict)    JMDict  Proposed  Smoothstep EM   88.00      80.27   83.96


Figure 1: Transformation of E[z(w, c)] to ẑ(w, c).


Figure 2: Precision and recall of the unsupervised classifier over iterations.
