Investigating an Effective Character-level Embedding in Korean Sentence Classification

Won Ik Cho, Seok Min Kim, and Nam Soo Kim
Human Interface Laboratory
Department of Electrical and Computer Engineering and INMC, Seoul National University
1 Gwanak-ro, Gwanak-gu, Seoul, Korea, 08826
{wicho,smkim}@hi.snu.ac.kr, [email protected]

Abstract

Different from the writing systems of many Romance and Germanic languages, some languages or language families show complex conjunct forms in character composition. For such cases, where the conjuncts consist of components representing consonant(s) and a vowel, various character encoding schemes can be adopted beyond merely making up a one-hot vector. However, there has been little work done on intra-language comparison regarding the performance of each representation. In this study, utilizing the Korean language, which is character-rich and agglutinative, we investigate which encoding scheme is the most effective among Jamo[1]-level one-hot, character-level one-hot, character-level dense, and character-level multi-hot. Classification performance with each scheme is evaluated on two corpora: one on binary sentiment analysis of movie reviews, and the other on multi-class identification of intention types. The result displays that the character-level features show higher performance in general, although the Jamo-level features may show compatibility with the attention-based models if guaranteed an adequate parameter set size.

1 Introduction

Ever since an early approach exploiting character features for neural network-based natural language processing (NLP) (Zhang et al., 2015), character-level embedding[2] has been widely used for many tasks such as machine translation (Ling et al., 2015), noisy document representation (Dhingra et al., 2016), language correction (Xie et al., 2016), and word segmentation (Cho et al., 2018a). However, little consideration has been given to intra-language performance comparison across variant representation types. Unlike English, a Germanic language written with an alphabet comprising 26 characters, many languages used in East Asia are written with scripts whose characters can be further decomposed into sub-parts representing individual consonants or vowels. This conveys that (sub-)character-level representation for such languages has the potential to be managed with more than just a simple one-hot encoding.

In this paper, a comparative experiment is conducted on Korean, a representative language with a featural writing system (Daniels and Bright, 1996). To be specific, the Korean alphabet Hangul consists of the letters Jamo denoting consonants and vowels. The letters are composed into a morpho-syllabic block, referred to as a character, which is in effect equivalent to the phonetic unit syllable in terms of Korean morpho-phonology. The conjunct form of a character is {Syllable: CV(C)}; this notation implies that there should be at least one consonant (namely cho-seng, the first sound) and one vowel (namely cwung-seng, the second sound) in a character. An additional consonant (namely cong-seng, the third sound) is auxiliary. However, in decomposition of the characters, three slots are fully held to guarantee a space for each component; an empty cell comes in place of the third entry if there is no auxiliary consonant. The number of possible sub-characters, or (composite) consonants/vowels, that can come for each slot is 19, 21, and 27, respectively. For instance, in the syllable '간 (kan)', the three clockwise-arranged components ㄱ, ㅏ, and ㄴ, which sound kiyek (stands for k; among 19 candidates), ah (stands for a; among 21 candidates), and niun (stands for n; among 27 candidates), refer to the first, the second, and the third sound, respectively.

[1] Letters of the Korean alphabet Hangul.
[2] Throughout this paper, the terms embedding and encoding are used interchangeably depending on the context.
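The slot-wise decomposition described above follows directly from the Unicode layout of precomposed Hangul syllable blocks (U+AC00 to U+D7A3). The following minimal Python sketch, which is not part of the paper, illustrates it; the inventory lists CHO, JUNG, JONG and the helper decompose are our own names, not the authors' code.

```python
# Illustrative sketch (not from the paper): decomposing a Hangul syllable block
# into its three Jamo slots via Unicode arithmetic. The slot inventories hold
# 19 first sounds, 21 second sounds, and 27 third sounds (plus an empty slot).
CHO  = list("ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ")                      # 19 cho-seng
JUNG = list("ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ")                  # 21 cwung-seng
JONG = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")  # empty + 27 cong-seng

def decompose(syllable: str):
    """Return the (first, second, third) Jamo of one precomposed Hangul syllable."""
    idx = ord(syllable) - 0xAC00              # syllable blocks occupy U+AC00..U+D7A3
    if not 0 <= idx <= 11171:
        raise ValueError("not a precomposed Hangul syllable")
    cho, rest = divmod(idx, 21 * 28)          # 21 vowels x 28 final-slot options
    jung, jong = divmod(rest, 28)
    return CHO[cho], JUNG[jung], JONG[jong]   # "" when the third slot is empty

print(decompose("간"))  # ('ㄱ', 'ㅏ', 'ㄴ')
print(decompose("하"))  # ('ㅎ', 'ㅏ', '') -- no auxiliary consonant
```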
To compare five different Jamo/character-level embedding methodologies that are possible in Korean, we first review the related studies and the previous approaches. Then, two datasets are introduced, namely binary sentiment classification and multi-class intention identification, to investigate the performance of each representation under recurrent neural network (RNN)-based analysis. After searching for an effective encoding scheme, we demonstrate how the result can be adopted in tackling other tasks and discuss whether a similar approach can be applied to other languages with a complex writing system.

2 Related Work

Inter-language comparison with word and character embedding was thoroughly investigated in Zhang and LeCun (2017) for Chinese, Japanese, Korean, and English. The paper investigates the languages via representations including character, byte, romanized character, morpheme, and romanized morpheme. The observed tendency for Korean suggests that adopting the raw characters outperforms utilizing the romanized character-level features, and moreover both performances are far beyond the morpheme-level features. However, to be specific on the experiment, decomposition of the morpho-syllabic blocks was not conducted, and the experiment did not make use of the dense embedding methodologies which can project distributional semantics onto the representation. We concluded that more attention is to be paid to different character embedding methodologies for Korean. Here, to reduce ambiguity, we denote a morpho-syllabic block which consists of consonant(s) and a vowel by character, and the individual components by Jamo. A Jamo sequence is spread in the order of the first to the third sound if a character is decomposed.

There has been little study done on an effective text encoding scheme for Korean, a language that has a distinguished character structure which can be decomposed into sub-characters. A comparative study on the hierarchical constituents of Korean morpho-syntax was first organized in Lee and Sohn (2016), by comparing the performance of Jamo, character, morpheme, and eojeol (word)-level embeddings for the task of text reconstruction. In the quantitative analysis using edit distance and accuracy, the Jamo-level feature showed a slightly better result than the character-level one. The (sub-)character-level representations presented an outcome far better than the morpheme or eojeol-level cases, as in the classification task of Zhang and LeCun (2017). The results show the task-independent competitiveness of the character-level features.

From a more comprehensive viewpoint, Stratos (2017) showed that Jamo-level features combined with word and character-level ones display better performance on the parsing task. With more elaborate character processing, especially involving Jamos, Shin et al. (2017) and Cho et al. (2018c) recently made progress in classification tasks. Song et al. (2018) successfully aggregated the sparse features into a multi-hot representation, enhancing the output within the task of error correction. In a slightly different manner, Cho et al. (2018a) applied dense vectors for the representation of the characters, obtained by skip-gram (Mikolov et al., 2013), improving the naturalness of word segmentation for noisy Korean text. To figure out the tendency, we implement the aforementioned Jamo/character-level features and discuss the results concerning the classification tasks. The details of each approach are described in the following section.
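As background for the dense character vectors used as representation (iv) below, such vectors can be obtained by running a skip-gram style model over sentences split into characters. The following is only a hedged sketch, not the authors' training setup (they report fastText-trained vectors obtained from a 2M drama script corpus); it assumes gensim >= 4.0 and uses a toy corpus.

```python
# Hedged sketch (not the authors' code): training dense character-level vectors
# with skip-gram by treating each sentence as a sequence of Hangul characters.
from gensim.models import Word2Vec

corpus = ["밥 먹었어?", "영화 너무 재밌었어"]                    # toy sentences only
char_seqs = [[ch for ch in sent if not ch.isspace()] for sent in corpus]

model = Word2Vec(sentences=char_seqs, vector_size=100, window=5,
                 sg=1, min_count=1, epochs=50)                  # sg=1 -> skip-gram
print(model.wv["밥"].shape)  # (100,) -- one dense vector per character
```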
3 Experiment

In this section, we demonstrate the featurization of five (sub-)character embedding methodologies, namely (i) Jamo (Shin et al., 2017; Stratos, 2017), (ii) modified Jamo (Cho et al., 2018c), (iii) sparse character vectors, (iv) dense character vectors (Cho et al., 2018a) trained based on fastText (Bojanowski et al., 2016), and (v) multi-hot character vectors (Song et al., 2018). We featurize only Jamo/characters; no other symbols such as numbers and special letters are taken into account.

Representation        | Property                                        | Dimension | Feature type
(i) Shin2017          | ㄱ · · · ㅎ / ㅏ · · · ㅢ / ㄱ · · · ㅄ (Jamo-level)      | 67        | one-hot
(ii) Cho2018c         | (i) + ㄱ · · · ㅏ · · · ㅄ (Jamo-level)                 | 118       | one-hot
(iii) Cho2018a-Sparse | · · · 간 · · · 밤 · · · 핫 · · · (character-level)       | 2,534     | one-hot
(iv) Cho2018a-Dense   | · · · 간 · · · 밤 · · · 핫 · · · (character-level)       | 100       | dense
(v) Song2018          | · · · 간 · · · 밤 · · · 핫 · · · + α (character-level)   | 67        | multi-hot

Table 1: A description of the Jamo/character-level features (i-v).

For (i), we used one-hot vectors of dimension 67 (= 19 + 21 + 27), which is smaller in width than the ones suggested in Shin et al. (2017) and Stratos (2017), due to the omission of special symbols. Similarly, for (ii), 118-dim one-hot vectors are constructed. The point where (ii) differs from (i) is that it considers the cases where a Jamo is used on its own as a single (or composite) consonant or vowel, as is frequently observed in social media text. These cases make up an additional array of dimension 51.

For (iii) and (iv), we adopted a vector set that is publicly distributed in Cho et al. (2018a), reported to be extracted from a drama script corpus of size 2M. Constructing the vectors of (iii) is intuitive; for N characters in the corpus, an N-dim one-hot vector is assigned to each character following the list of the characters, to construct a one-hot vector dictionary[4].

(v) is a hybrid of Jamo and character-level features; three vectors indicating the first to the third sound of a character, namely the ones with dimension 19, 21, and 27 each, are concatenated into a single multi-hot vector. This preserves the conciseness of the Jamo-level one-hot encodings and also maintains the characteristics of conjunct forms. In summary, (i) utilizes 67-dim one-hot vectors, (ii) 118-dim one-hot vectors, (iii) 2,534-dim one-hot vectors, (iv) 100-dim dense vectors, and (v) 67-dim multi-hot vectors (Table 1).

3.1 Task description

For evaluation, we employed two classification tasks that can be conducted with the character-level embeddings.
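To make the contrast between the featurizations above concrete, the sketch below (ours, not the authors' implementation; jamo_indices, jamo_onehot_sequence, and char_multihot are hypothetical helper names) builds feature (i), a sequence of 67-dim one-hot vectors with one vector per Jamo, and feature (v), a single 67-dim multi-hot vector per character in which the first, second, and third sound positions are marked jointly. It reuses the Unicode arithmetic from the earlier decomposition sketch.

```python
# Hedged sketch (not the authors' code) of features (i) and (v) for one character.
import numpy as np

def jamo_indices(syllable: str):
    """Slot indices (first, second, third) of a Hangul syllable; third is None if absent."""
    idx = ord(syllable) - 0xAC00                      # precomposed blocks: U+AC00..U+D7A3
    cho, rest = divmod(idx, 21 * 28)
    jung, jong = divmod(rest, 28)
    return cho, jung, (jong - 1) if jong else None    # 19 / 21 / 27 candidates per slot

def jamo_onehot_sequence(syllable: str):
    """Feature (i): one 67-dim one-hot vector per Jamo (2 or 3 vectors per character)."""
    cho, jung, jong = jamo_indices(syllable)
    positions = [cho, 19 + jung] + ([40 + jong] if jong is not None else [])
    vecs = []
    for p in positions:
        v = np.zeros(67)
        v[p] = 1.0
        vecs.append(v)
    return vecs

def char_multihot(syllable: str):
    """Feature (v): a single 67-dim multi-hot vector per character (slot one-hots merged)."""
    return np.sum(jamo_onehot_sequence(syllable), axis=0)

print([int(v.argmax()) for v in jamo_onehot_sequence("간")])  # [0, 19, 43]
print(char_multihot("간").nonzero()[0])                        # [ 0 19 43]
```

Under this sketch, scheme (ii) would extend the inventory with the 51 additional stand-alone Jamo entries, while schemes (iii) and (iv) instead index or embed the whole character (a 2,534-way one-hot or a 100-dim dense vector, as summarized in Table 1).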