
Don’t Just Scratch the Surface: Enhancing Word Representations for Korean with Hanja

Kang Min Yoo∗, Taeuk Kim∗ and Sang-goo Lee
Department of Computer Science and Engineering
Seoul National University, Seoul, Korea
{kangminyoo,taeuk,sglee}@europa.snu.ac.kr

Abstract

We propose a simple yet effective approach for improving Korean word representations using additional linguistic annotation (i.e. Hanja). We employ cross-lingual transfer learning in training word representations by leveraging the fact that Hanja is closely related to Chinese. We evaluate the intrinsic quality of representations learned through our approach using the word analogy and similarity tests. In addition, we demonstrate their effectiveness on several downstream tasks, including a novel Korean news headline generation task.

Figure 1: An example of a Korean word showing its form and multi-level meanings. The Sino-Korean word consists of phonograms (KR) and Hanja logograms (HJ). Although annotation of Hanja is optional, it offers deeper insight into the word meaning due to its association with the corresponding Chinese characters (CN).

1 Introduction

There is a strong connection between the Korean and Chinese languages due to cultural and historical reasons (Lee and Ramsey, 2011). Specifically, a set of characters with very similar forms to the Chinese characters, called Hanja,1 served in the past as the only medium for written Korean until Hangul, the Korean alphabet, and Jamo came into existence in 1443 (Sohn, 2001). Considering this etymological background, a substantial portion of Korean words are classified as Sino-Korean, a set of Korean words that originated from Chinese and can be expressed in both Hanja and Hangul (Taylor, 1997), with the latter becoming commonplace in modern Korean.

Based on these facts, we assume the introduction of Hanja-level information will aid in grounding better representations for the meaning of Korean words (e.g. see Fig. 1). To validate our hypothesis, we propose a simple yet effective approach to training Korean word representations, named Hanja-level SISG (Hanja-level Subword Information Skip-Gram), for capturing the semantics of Hanja and the subword structures of Korean and introducing them into the vector space. Note that it is also quite intuitive for native Koreans to resolve the ambiguity of (Sino-)Korean words with the aid of Hanja. We conjecture this heuristic is attributed to the fact that Hanja characters are logograms, each of which contains more lexical meaning when compared against its counterpart, the Hangul character, which is a phonogram. Accordingly, in this work, we focus on exploiting the rich extra information from Hanja as a means of constructing better Korean word embeddings. Furthermore, our approach enables us to empirically investigate the potential of character-level cross-lingual knowledge transfer with a case study of Chinese and Hanja.

To sum up, our contributions are threefold:

• We introduce a novel way of training Korean word embeddings with an expanded Korean alphabet vocabulary using Hanja in addition to the existing Korean characters, Hangul.

• We also explore the possibility of character-level knowledge transfer between two languages by initializing Hanja embeddings with Chinese character embeddings before training skip-gram.

• We prove the effectiveness of our method in two ways, one of which is intrinsic evaluation (the word analogy and similarity tests), and the other being two downstream tasks including a newly proposed Korean news headline generation task.

∗ Equal contribution.
1 Hanja and traditional Chinese characters are very similar but not completely identical. Some differences in strokes and Korea-exclusive characters exist.

2 Model Description

2.1 Skip-Gram (SG)

Given a corpus as a sequence of words (w_1, w_2, \dots, w_T), the goal of a skip-gram model (Mikolov et al., 2013) is to maximize the log-probabilities of context words given a target word w_i:

\sum_{i \in \{1,\dots,T\}} \sum_{c \in C(i)} \log p(w_c \mid w_i),    (1)

where C returns a set of context word indices given a word index. The usual choice for training the parameterized probability function p(w_c | w) is to reframe the problem as a set of binary classification tasks of the target context word c and other negative samples chosen from the entire vocabulary (negative sampling). The objective is thus

L(s(w_i, w_c)) + \sum_{j \in N(i,c)} L(-s(w_i, w_j)),    (2)

where L(x) = \log(1 + e^{-x}) is the binary logistic loss. N returns a set of negative sample indices given target word and context indices. In the vanilla SG model, the scoring function s is the dot product between the two word vectors v_w and v_c.

2.2 Subword Information SG (SISG) and Jamo-level SISG

In Bojanowski et al. (2017), the scoring function incorporates subword information. Specifically, given all possible character n-grams G = \{g_1, \dots, g_G\} in the corpus and the n-gram set G_w \subset G of a particular word w, the scoring function is the sum of dot products between all n-gram vectors z_g and the context vector v_c:

s(w, c) = \sum_{g \in G_w} z_g^\top v_c.    (3)

However, due to the deconstructible structure of Korean characters and the agglutinative nature of the language, the model proposed by Bojanowski et al. (2017) comes short in capturing sub-character and inter-character information specific to Korean. As a remedy, the model proposed by Park et al. (2018) introduces jamo-level n-grams g^{(j)}, whose vectors can be associated with the target context as well. Given that a Korean word w can be split into a set of jamo n-grams G_w^{(j)} \subset G^{(j)} = \{g_1^{(j)}, \dots, g_J^{(j)}\}, where G^{(j)} is the set of all possible jamo n-grams in the corpus, the updated scoring function for jamo-level skip-gram (SISG(cj)) is thus

s^{(j)}(w, c) = s(w, c) + \sum_{j \in G_w^{(j)}} z_j^\top v_c.    (4)

2.3 Hanja-level SISG

Another semantic level in Korean lies in Hanja logograms. Hanja characters can be mapped to Hangul phonograms, hence Hanja annotations are sufficient but not necessary for conveying meanings (Fig. 1). As each Hanja character is associated with semantics, their presence could provide meaningful aid in capturing word-level semantics in the vector space.

We propose incorporating Hanja n-grams into the learning of word representations by allowing the scoring function to associate each Hanja n-gram with the target context word. Concretely, given that a Korean word w contains a set of Hanja sequences H_w = (h_{w,1}, \dots, h_{w,H_w}) (e.g. for the example in Fig. 1, h_{w,1} = “社會” and h_{w,2} = “型”) and that I_h is the set of Hanja n-grams for a Hanja sequence h (e.g. I_{h_{w,1}} = {“社”, “會”, “社會”}), all Hanja n-grams G_w^{(h)} for word w are the union of the Hanja n-grams of the Hanja sequences present in w: \bigcup_{h \in H_w} I_h. The length of the Hanja n-grams in I_h is a hyperparameter. The new score function for Hanja-level SISG is then

s^{(h)}(w, c) = s^{(j)}(w, c) + \sum_{h \in G_w^{(h)}} z_h^\top v_c,    (5)

where z_h is the n-gram vector for a Hanja sequence h.

2.4 Character-level Knowledge Transfer

Since Hanja characters essentially share deep roots with Chinese characters, many of them can be mapped by one-to-one correspondence. By converting Korean characters into simplified Chinese characters or vice versa, it is possible to utilize pre-trained Chinese embeddings in the process of training Hanja-level SISG. In our case, we propose leveraging advances in Chinese word embeddings from the Chinese community by initializing the Hanja n-gram vectors z_h with the state-of-the-art Chinese embeddings (Li et al., 2018) and training with the score function in Equation 5.

2 The code and models are publicly available online at https://github.com/kaniblu/hanja-sisg
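To make the composition of the scoring functions in Equations 3–5 concrete, the following is a minimal sketch that sums dot products between character-, jamo-, and Hanja-level n-gram vectors and a context vector. It is a sketch under stated assumptions rather than the released implementation: the n-gram ranges mirror the settings in Section 3.1.2, but the helper names (char_ngrams, hanja_ngrams, sisg_score), the lazy random initialization, and the toy jamo string in the usage example are illustrative choices.

```python
# Minimal sketch of the Hanja-level SISG score s^(h)(w, c) from Eqs. 3-5.
# Assumption: one dense numpy vector per n-gram, created lazily; no hashing trick.
import numpy as np

def char_ngrams(seq, n_min, n_max):
    """All contiguous n-grams of `seq` with n_min <= length <= n_max."""
    return {seq[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(seq) - n + 1)}

def hanja_ngrams(hanja_seqs, n_min, n_max):
    """Union of n-grams over every Hanja sequence annotated for the word."""
    grams = set()
    for h in hanja_seqs:
        grams |= char_ngrams(h, n_min, n_max)
    return grams

def sisg_score(word, jamo, hanja_seqs, context_vec, ngram_table, dim=300):
    """Sum of dot products between all n-gram vectors of the word and v_c."""
    grams = (char_ngrams(word, 1, 6)            # syllable-level n-grams
             | char_ngrams(jamo, 3, 5)          # jamo-level n-grams
             | hanja_ngrams(hanja_seqs, 1, 3))  # Hanja-level n-grams (SISG(cjh3))
    score = 0.0
    for g in grams:
        if g not in ngram_table:                # lazily initialize unseen n-grams
            ngram_table[g] = np.random.uniform(-0.1, 0.1, dim)
        score += float(ngram_table[g] @ context_vec)
    return score

# Toy usage with the word of Fig. 1: Hangul form, its jamo string, Hanja sequences.
table = {}
v_c = np.random.uniform(-0.1, 0.1, 300)
print(sisg_score("사회형", "ㅅㅏㅎㅚㅎㅕㅇ", ["社會", "型"], v_c, table))
```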

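Section 2.4 boils down to an initialization step before skip-gram training: every Hanja n-gram whose (simplified) Chinese counterpart exists in a pre-trained Chinese vector file is seeded with that vector, and the remainder falls back to random initialization. The sketch below assumes a word2vec-style text file and an externally supplied to_simplified mapping; it is illustrative only, not the authors' code.

```python
# Sketch of character-level knowledge transfer (Section 2.4): seed Hanja n-gram
# vectors with pre-trained Chinese embeddings before skip-gram training starts.
import numpy as np

def load_chinese_vectors(path):
    """Read word2vec-style text vectors: header line, then `token v1 ... vd` per line."""
    vecs = {}
    with open(path, encoding="utf-8") as f:
        next(f)                                  # skip the "<vocab_size> <dim>" header
        for line in f:
            token, *vals = line.rstrip().split(" ")
            vecs[token] = np.asarray(vals, dtype=np.float32)
    return vecs

def init_hanja_ngrams(hanja_ngram_vocab, chinese_vecs, to_simplified, dim=300):
    """Return an n-gram -> vector table, transferring Chinese vectors when possible."""
    table, hits = {}, 0
    for gram in hanja_ngram_vocab:
        key = to_simplified(gram)                # map Hanja form to simplified Chinese
        if key in chinese_vecs:
            table[gram] = chinese_vecs[key].copy()
            hits += 1
        else:                                    # fall back to random initialization
            table[gram] = np.random.uniform(-0.1, 0.1, dim).astype(np.float32)
    print(f"transferred {hits}/{len(hanja_ngram_vocab)} Hanja n-grams")
    return table
```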
Method      | Analogy: Sem.  Syn.  All  | Similarity: Pr.  Sp.
SISG(cj)†   | 0.47  0.38  0.43          | -     0.67
SG          | 0.42  0.49  0.45          | 0.60  0.62
SISG(c)     | 0.45  0.59  0.52          | 0.62  0.61
SISG(cj)    | 0.39  0.48  0.44          | 0.66  0.67
SISG(cj)‡   | 0.41  0.48  0.45          | 0.65  0.67
SISG(cjh3)  | 0.34  0.45  0.39          | 0.63  0.63
SISG(cjh4)  | 0.34  0.45  0.40          | 0.62  0.61
SISG(cjhr)  | 0.35  0.46  0.40          | 0.65  0.64
† reported by Park et al. (2018)
‡ pre-trained embeddings provided by the authors of Park et al. (2018), run with our evaluation

Table 1: Word analogy and similarity evaluation results. For the word analogy test, the table shows cosine distances averaged over semantic and syntactic word pairs (lower is better). For the word similarity test, we report Pearson and Spearman correlations between predictions and ground-truth scores. We observe that our approach (SISG(cjh)) shows significant improvements in the analogy test but some deterioration in the similarity test. Note that the evaluation results reported by the original authors (†) are largely different from our results. This might be due to differences in implementation details, hence we also report the results of the authors' embeddings run with our evaluation script (‡).

3 Experiments

3.1 Word Representation Learning

3.1.1 Korean Corpus

For learning word representations, we utilize the corpus prepared by Park et al. (2018) with small modifications. We perform additional data cleansing (e.g. removing non-Korean sentences in the corpus and unifying number tags) to obtain a cleaner version of the corpus. All of our comparative studies are based on this corpus. For training Hanja-level SISG, we use the Hanjaro tagger3 to automatically annotate all of our datasets with Hanja. It is the state-of-the-art Hanja tagger available to the public, to the best of our knowledge.

3.1.2 Models

We compare our approach with three other baselines. The Skip-Gram (SG) model (Mikolov et al., 2013) does not incorporate any n-gram information in the training loss. Subword Information Skip-Gram (SISG(c)) (Bojanowski et al., 2017) uses character-level n-gram information to enrich word representations; for the Korean language, character-level n-grams correspond to syllable n-grams. Jamo-level SISG (SISG(cj)) (Park et al., 2018) uses jamo-level n-grams in addition to character-level n-grams. Our ablation studies confirm that the hyperparameter settings proposed by the authors (character n-grams ranging from 1 to 6 and jamo n-grams ranging from 3 to 5) are indeed optimal, hence our subsequent studies all employ the same settings. Our approach is denoted by SISG(cjh). As for the specific hyperparameter settings of our approach, SISG(cjh3) denotes Hanja n-grams ranging from 1 to 3, while SISG(cjh4) denotes Hanja n-grams ranging from 1 to 4. SISG(cjhr) is a special version of our approach in which Hanja n-grams are randomly initialized as opposed to being pre-trained (Section 2.4). The details of our ablation studies and the model implementation are included in the supplementary material.

3.2 Intrinsic Evaluation

3.2.1 Word Analogy Test

Following the experimental protocol of Park et al. (2018), given an analogy pair a : b ↔ c : d, we compute the cosine distance between the word vectors v_a + v_b − v_c and v_d using the following equation: 1 − cos(x, y), where cos is the cosine similarity. The Korean word analogy dataset (Park et al., 2018) consists of 10,000 quadruples, designed with a set of semantic and syntactic features specific to the Korean language. We report our findings in Table 1. We observe that our approach improves significantly compared to the previous state-of-the-art models in both semantic and syntactic word analogy detection.

3.2.2 Word Similarity Test

The word similarity test aims to evaluate the correlation between word vector distances and human-annotated scores. We use the Korean WS353 dataset provided by Park et al. (2018). Our results are shown in Table 1. Although our approaches do not outperform the jamo-level baseline (SISG(cj)), we observe that Hanja-level information provides some advantage compared to just using pure character-level n-grams (SISG(c)). Also note that WS353 is a relatively small dataset, which could be prone to statistical bias.

3 http://hanjaro.juntong.or.kr/
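For reference, the two intrinsic evaluations reduce to a few lines: the analogy score is the cosine distance between v_a + v_b − v_c and v_d (lower is better), and the similarity score is the correlation between predicted cosine similarities and human ratings. The sketch below uses SciPy for the correlations; the function names are illustrative assumptions.

```python
# Sketch of the intrinsic evaluations in Sections 3.2.1-3.2.2 (illustrative only).
import numpy as np
from scipy.stats import pearsonr, spearmanr

def cos(x, y):
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

def analogy_distance(va, vb, vc, vd):
    """Cosine distance 1 - cos(va + vb - vc, vd); averaged over quadruples."""
    return 1.0 - cos(va + vb - vc, vd)

def similarity_correlations(word_pairs, human_scores, vectors):
    """Pearson and Spearman correlations between predicted and human similarities."""
    preds = [cos(vectors[w1], vectors[w2]) for w1, w2 in word_pairs]
    return pearsonr(preds, human_scores)[0], spearmanr(preds, human_scores).correlation
```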

Embeddings  | BLEU-1  BLEU-2  BLEU-3  BLEU-4 | PPL
None        | 26.02   7.76    3.08    1.38   | 5.335
SG          | 30.33   10.20   4.29    1.98   | 4.122
SISG(c)     | 31.34   10.96   4.69    2.19   | 3.942
SISG(cj)    | 31.78   11.17   4.80    2.25   | 3.938
SISG(cj)†   | 31.77   11.16   4.81    2.27   | 3.940
SISG(cjh3)  | 32.03   11.25   4.83    2.27   | 3.941
SISG(cjh4)  | 32.02   11.34   4.92    2.30   | 3.909
† pre-trained embeddings by Park et al. (2018)

Table 2: Korean news headline generation results.

3.2.3 Analysis

How much of the improvement can be explained by transfer learning from Chinese? Word analogy test results show that word representations trained with pre-trained Chinese n-grams perform better than those trained without them (SISG(cjhr)), supporting our claim that our approach is able to transfer relevant knowledge from the Chinese language for detecting analogical relationships. However, for the word similarity test, word vectors trained without Chinese embeddings perform better, suggesting that there are some trade-offs.

3.3 Downstream Tasks

In this section, we demonstrate the effectiveness of our approach on downstream tasks with two specific cases.

3.3.1 Korean News Headline Generation

To show that our approach helps supervised models gain a deeper understanding of Korean texts, we devise a new task using news articles, which requires an understanding of the semantics of Korean texts. We collected Korean news articles published from January to February 2017. The dataset covers balanced categories (e.g. politics, sports, world, etc.) and contains the headline title as the annotation, akin to the CNN/Daily Mail dataset (Hermann et al., 2015). The dataset is relatively large, containing 840,205 news article and title pairs.

We use the encoder-decoder architecture supported by the OpenNMT framework (Klein et al., 2017) for the task, where the encoder is a bidirectional LSTM cell and the decoder is an LSTM cell (the hidden size is 512 for both) with a bridging feed-forward layer between the two RNNs. We employ soft attention (Bahdanau et al., 2014) to generate news headlines given the first three sentences of an article body. We tokenize each headline using a Korean morpheme tokenizer4 such that decoder tokens are relatively dense. The copy mechanism (Gu et al., 2016), a popular technique of sequence-to-sequence models for translation and summarization, cannot be applied directly to our model, as the encoder tokens (space tokenized) and the decoder tokens (morpheme tokenized) do not share the same token space. Word vectors obtained from both the baselines and our approach are used to initialize the encoder embeddings before training begins. The dataset is split into training, validation, and test sets (8:1:1). We report performances on the test set after validating our training models on the validation set. We evaluate the titles generated by our models with BLEU (Papineni et al., 2002).5 We also report the per-word perplexity obtained from the models.

As shown in Table 2, models trained with our embeddings perform better and have better language modeling capability than those trained with the previous state-of-the-art embeddings. This improvement can be partially explained by the fact that formal Korean texts, such as news articles, are more likely to contain Sino-Korean words, allowing models with awareness of Chinese to gain an edge in Korean understanding.

4 https://github.com/shin285/KOMORAN
5 Although ROUGE is a more widely adopted measure in the English literature, no equivalent exists for Korean.
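The pre-trained vectors enter the generation pipeline only through the encoder embedding matrix. A generic PyTorch stand-in for that initialization step is sketched below; it is not the OpenNMT-internal code, and the function name and the 300-dimensional default are assumptions.

```python
# Sketch: seed a seq2seq encoder's embedding table with pre-trained word vectors.
import torch
import torch.nn as nn

def build_encoder_embedding(vocab, pretrained, dim=300):
    """vocab: token -> index; pretrained: token -> numpy vector of length `dim`."""
    weight = torch.empty(len(vocab), dim).uniform_(-0.1, 0.1)
    for token, idx in vocab.items():
        if token in pretrained:                  # copy the pre-trained vector
            weight[idx] = torch.as_tensor(pretrained[token])
    emb = nn.Embedding(len(vocab), dim)
    emb.weight.data.copy_(weight)                # used as the encoder's input embedding
    return emb
```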

3.3.2 Sentiment Analysis

To estimate the performance of our approach on a different task than the previous experiment, we conduct an experiment on the Naver Sentiment Movie Corpus (NSMC).6 This dataset consists of 200K movie reviews, each of which is labeled with its sentiment, i.e. positive or negative. We split the corpus into training (100K), validation (50K), and test sets (50K). It is worth noting that this case study is designed to figure out to what extent our embeddings are generally applicable, even for cases where most of the sentences in the dataset consist of spoken words rather than written words and often do not contain Sino-Korean words.

We utilize a basic LSTM (Hochreiter and Schmidhuber, 1997) module as a sentence encoder to exclude possible exceptional gains from sophisticated encoders, concentrating on the usefulness of the input word embeddings. As in the previous experiment, word vectors obtained from both the baselines and our approach are used as input for the encoder. We regard the last hidden state of the LSTM as the sentence representation, which is consumed by a feed-forward network followed by a softmax classifier. The dimension of the LSTM cells is fixed at 300. Each result from the baselines and our models is the average of 3 independent runs initialized with different random seeds, and the outcome of each run is chosen by the performance (F1-score) on the validation set.

Embeddings  | Acc.   P      R      F1
SISG(c)     | 77.43  75.89  80.41  78.08
SISG(cj)    | 83.16  82.36  84.66  83.50
SISG(cjh3)  | 81.61  81.23  82.28  81.75
SISG(cjh4)  | 82.25  82.57  81.77  82.17

Table 3: Naver Sentiment Movie Corpus results (model: LSTM). We report accuracy (Acc.), precision (P), recall (R), and F1-score (F1) following Park et al. (2018). Each reported result is the average of several runs initialized with different random seeds.

From Table 3, we confirm our approach is comparable to the best previous one and general enough to be employed in most downstream tasks, even though the performance of our approach is somewhat unsatisfactory. While not strictly proven, we conjecture one possible reason for the unsatisfying performance is the low reliability of the Hanja tagger we leveraged, which is not guaranteed to work well with spoken language. Specifically, we have manually inspected the tagging results and observed that there are some errors which can lead to performance degradation. This observation points out a limitation of our current approach, namely the dependence on external taggers, encouraging us to develop an integrated approach that does not resort to the taggers, which we leave as future work.

6 https://github.com/e9t/nsmc
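A minimal PyTorch sketch of the classifier described above follows: a single LSTM whose last hidden state is fed to a feed-forward layer and a softmax over the two sentiment classes. The 300-dimensional hidden size comes from the text; the embedding dimension, class names, and other details are assumptions.

```python
# Minimal sketch of the NSMC sentence classifier (Section 3.3.2), not the exact setup.
import torch
import torch.nn as nn

class SentimentLSTM(nn.Module):
    def __init__(self, vocab_size, emb_dim=300, hidden=300, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)   # initialized from word vectors
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, token_ids):
        x = self.embed(token_ids)                 # (batch, seq_len, emb_dim)
        _, (h_n, _) = self.lstm(x)                # h_n: (1, batch, hidden)
        return self.classifier(h_n[-1])           # last hidden state -> 2-way logits

# usage: loss = nn.CrossEntropyLoss()(SentimentLSTM(50000)(batch_ids), labels)
```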
4 Related Work

Although the word is usually regarded as the smallest basic unit for most NLP pipelines, there is a recent trend of utilizing subword information for enriching word representations (Sennrich et al. 2016; Bojanowski et al. 2017, to name a few), or considering subwords themselves as direct input for NLP models (Zhang et al., 2015; Kim et al., 2016; Peters et al., 2018; Devlin et al., 2018). As Korean is agglutinative (Song, 2006), the current literature on Korean word representations mainly focuses on subword structures such as morphemes (Edmiston and Stratos, 2018), syllables (Choi et al., 2017), and jamo (Choi et al., 2016; Stratos, 2017; Park et al., 2018). We move one step further by incorporating Hanja information explicitly together with the aforementioned subword information.

On the other hand, cross-lingual representations are a trending topic in the literature (Lample et al. 2018; Conneau et al. 2018; Lample and Conneau 2019, to name a few). Nevertheless, to the best of our knowledge, our work is the first to introduce character-level cross-lingual transfer learning based on etymological grounds. Furthermore, the novelty of our work lies in the fact that we use a separated vocabulary instead of a shared vocabulary such as Byte Pair Encoding (BPE) (Sennrich et al., 2016).

5 Conclusion

We have presented a method of training Korean word representations with Hanja. In our extensive experiments, we have demonstrated that our approach is effective in infusing more semantics into Korean word embeddings. One potential issue with our method, as already mentioned, is that it relies on external Hanja annotation, even though this can be mitigated to some extent by off-the-shelf taggers, as in this work. Thus, it would be an attractive direction for future work to develop an end-to-end approach.

Acknowledgments

We thank Daniel Edmiston and the anonymous reviewers for their helpful feedback. This work was supported by BK21 Plus for Pioneers in Innovative Computing (Dept. of Computer Science and Engineering, SNU) funded by the National Research Foundation of Korea (NRF) (21A20151113068).

References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the ACL, 5:135–146.

Jihun Choi, Jonghem Youn, and Sang-goo Lee. 2016. A grapheme-level approach for constructing a Korean morphological analyzer without linguistic knowledge. In 2016 IEEE International Conference on Big Data (Big Data), pages 3872–3879. IEEE.

Sanghyuk Choi, Taeuk Kim, Jinseok Seol, and Sang-goo Lee. 2017. A syllable-based technique for word embeddings of Korean words. In Proceedings of the First Workshop on Subword and Character Level Models in NLP, pages 36–40.

Alexis Conneau, Guillaume Lample, Ruty Rinott, Adina Williams, Samuel R. Bowman, Holger Schwenk, and Veselin Stoyanov. 2018. XNLI: Evaluating cross-lingual sentence representations. arXiv preprint arXiv:1809.05053.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Daniel Edmiston and Karl Stratos. 2018. Compositional morpheme embeddings with affixes as functions and stems as arguments. In Proceedings of the Workshop on the Relevance of Linguistic Structure in Neural Architectures for NLP, pages 1–5.

Jiatao Gu, Zhengdong Lu, Hang Li, and Victor O.K. Li. 2016. Incorporating copying mechanism in sequence-to-sequence learning. arXiv preprint arXiv:1603.06393.

Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems, pages 1693–1701.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Yoon Kim, Yacine Jernite, David Sontag, and Alexander M. Rush. 2016. Character-aware neural language models. In Thirtieth AAAI Conference on Artificial Intelligence.

Guillaume Klein, Yoon Kim, Yuntian Deng, Jean Senellart, and Alexander M. Rush. 2017. OpenNMT: Open-source toolkit for neural machine translation. In Proceedings of ACL.

Guillaume Lample and Alexis Conneau. 2019. Cross-lingual language model pretraining. arXiv preprint arXiv:1901.07291.

Guillaume Lample, Alexis Conneau, Marc'Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou. 2018. Word translation without parallel data. In ICLR.

Ki-Moon Lee and S. Robert Ramsey. 2011. A History of the Korean Language. Cambridge University Press.

Shen Li, Zhe Zhao, Renfen Hu, Wensi Li, Tao Liu, and Xiaoyong Du. 2018. Analogical reasoning on Chinese morphological and semantic relations. In Proceedings of the 56th Annual Meeting of the ACL, volume 2, pages 138–143.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the ACL, pages 311–318. Association for Computational Linguistics.

Sungjoon Park, Jeongmin Byun, Sion, Yongseok Cho, and Alice Oh. 2018. Subword-level word vector representations for Korean. In Proceedings of the 56th Annual Meeting of the ACL, volume 1, pages 2429–2438.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the NAACL: HLT, Volume 1 (Long Papers), pages 2227–2237.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the ACL, volume 1, pages 1715–1725.

Ho-Min Sohn. 2001. The Korean Language. Cambridge University Press.

Jae Jung Song. 2006. The Korean Language: Structure, Use and Context. Routledge.

Karl Stratos. 2017. A sub-character architecture for Korean language processing. In Proceedings of the 2017 Conference on EMNLP, pages 721–726.

Insup Taylor. 1997. Psycholinguistic reasons for keeping Chinese characters in Korean and Japanese. Cognitive Processing of Chinese and Related Asian Languages, 319.

Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems, pages 649–657.
