
Jamo Pair Encoding: Subcharacter Representation-based Extreme Korean Vocabulary Compression for Efficient Subword Tokenization

Sangwhan Moon†‡, Naoaki Okazaki†
†Tokyo Institute of Technology, ‡Odd Concepts Inc.
[email protected], [email protected]

Abstract
In the context of multilingual language model pre-training, vocabulary size for languages with a broad set of potential characters is an unsolved problem. We propose two algorithms applicable in any unsupervised multilingual pre-training task, increasing the elasticity of budget required for building the vocabulary in Byte-Pair Encoding inspired tokenizers, significantly reducing the cost of supporting Korean in a multilingual model.

Keywords: tokenization, vocabulary compaction, sub-character representations, out-of-vocabulary mitigation

1. Background

With the introduction of large-scale language model pre-training in the domain of natural language processing, the domain has seen significant advances in the performance of downstream tasks using transfer learning on pre-trained models (Howard and Ruder, 2018; Devlin et al., 2018) when compared to conventional per-task models. As part of this trend, it has also become common to perform this form of pre-training against multiple languages when training a single model. For these multilingual pre-training cases, state-of-the-art methods have relied on subword-based tokenizers, such as Byte-Pair Encoding (BPE) (Sennrich et al., 2016) or SentencePiece (Kudo and Richardson, 2018), as a robust mechanism to mitigate the out-of-vocabulary (OOV) problem at the tokenizer level by falling back to a character-level vocabulary. Not only have these methods proven more robust against OOV than standard lexicon-based tokenization, they have also been beneficial from a computational cost perspective, as they reduce the size of the input and output layers.

While these methods have shown significant improvements for alphabetic languages such as English and other Western European languages that use a Latin alphabet, they have limitations when applied to languages with a large and diverse character-level vocabulary, such as the Chinese, Japanese, and Korean (CJK) languages.

In this paper, we describe the challenges of subword tokenization when applied to CJK. We discuss how Korean differs from the other CJK languages, how to take advantage of this difference when a subword tokenizer is used, and finally propose a subword tokenizer-agnostic method which allows the tokenizer to exploit Korean-specific properties.

2. Problem Definition

CJK languages, due to the strong linguistic dependency of borrowed words from Chinese as part of their vocabulary, require a much more extensive range of characters to express the language than alphabetic (e.g., Latin) languages. This reflects directly on the vocabulary budget required by an algorithm which builds a subword vocabulary on character pairs, such as BPE. Roughly, the minimum size of the subword vocabulary can be approximated as |V| ≈ 2|Vc|, where V is the minimal subword vocabulary and Vc is the character-level vocabulary.

Since languages such as Japanese require at least 2,000 characters to express everyday text, one must make a tradeoff in a multilingual training setup. One can reduce the average surface of each subword for these character-vocabulary-intensive languages, or increase the vocabulary size. The former trades off the performance and representational power of the model, and the latter has a computational cost. Similar problems also apply to Chinese, as it shares a significant portion of the character-level vocabulary. However, this also allows some level of sharing, which reduces the final budget needed in the vocabulary.

Korean is an outlier in the CJK family: it linguistically shares vocabulary in terms of roots, but uses an entirely different character representation. A straightforward approach would be to share the character-level vocabulary between the CJK languages, as is possible between Chinese and Japanese. Unfortunately, this is not a straightforward operation, as Hangul (the Korean writing system) is phonetic, unlike the other two. This means that while the lexicon may have the exact same roots, the phonetic transcription is challenging to invert algorithmically: doing so requires comprehension of the context to select the most likely candidate, which would be analogous to a quasi-masked language modeling task.

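To make this budget concrete for Korean, one can plug in the count of 11,426 Hangul Syllable and Jamo code points given later in footnote 7 of Section 5.2.1; this worked instance of the approximation is ours, but the resulting figure is the one the paper quotes in Section 5.2.1:

|V| \approx 2\,|V_c| = 2 \times 11{,}426 = 22{,}852
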
3. Related Work and Background

The fundamental idea of exploiting characters is not new; in the past, many character-level approaches have been proposed in the form of task-specific architectures. There are also sub-character level methods analogous to our method, all of which we discuss in the language-specific sections below.

3.1. Non-Korean Languages

A study on a limited subset of Brahmic languages (Ding et al., 2018) proposes a method which can be used to reduce the vocabulary budget needed for all languages by generalizing, simplifying, then aligning multiple language alphabets together. This is applicable when the writing systems have genealogical relations that allow this form of alignment.

Previous works (He et al., 2018; Shi et al., 2015; Sun et al., 2014; Yin et al., 2016) demonstrate the potential of sub-character based methods for the Chinese and Japanese members of the CJK family through radical-based decomposition.

[Figure 1 shows two examples of the transformation: 江 → phonetic 강 → Jamo ㄱ (Choseong), ㅏ (Jungseong), ㅇ (Jongseong); and 宮 → phonetic 궁 → Jamo ㄱ (Choseong), ㅜ (Jungseong), ㅇ (Jongseong).]
Figure 1: Transformation process and Hangul Jamo sub-character composition. In the real world, Hangul to Chinese almost always has a 1:n mapping.

3.2. Korean

Korean, as shown in figure 1, builds on a small phonetic alphabet, using a combination of consonants and vowels called Jamo as building blocks when composing each character. Following the notation in (Stratos, 2017), this Jamo alphabet J has |J| = 51, and table 1 lists the sub-characters along with each possible role when forming a complete character. The Jamo alphabet J is a union defined as J = Jh ∪ Jv ∪ Jt, where Jh is Choseong (head consonant), Jv is Jungseong (vowel), and Jt is Jongseong (tail consonant). Therefore, the first example illustrated in figure 1 can be explained as ㄱ ∈ Jh, ㅏ ∈ Jv, and ㅇ ∈ Jt. Note that Jt can be omitted, in which case it corresponds to the empty element of Jt.

Level | Jamo (Subcharacters)
Jh, Choseong | ㄱ ㄲ ㄴ ㄷ ㄸ ㄹ ㅁ ㅂ ㅃ ㅅ ㅆ ㅇ ㅈ ㅉ ㅊ ㅋ ㅌ ㅍ ㅎ
Jv, Jungseong | ㅏ ㅐ ㅑ ㅒ ㅓ ㅔ ㅕ ㅖ ㅗ ㅘ ㅙ ㅚ ㅛ ㅜ ㅝ ㅞ ㅟ ㅠ ㅡ ㅢ ㅣ
Jt, Jongseong | ㄱ ㄲ ㄳ ㄴ ㄵ ㄶ ㄷ ㄹ ㄺ ㄻ ㄼ ㄽ ㄾ ㄿ ㅀ ㅁ ㅂ ㅄ ㅅ ㅆ ㅇ ㅈ ㅊ ㅋ ㅌ ㅍ ㅎ

Table 1: Hangul Jamo sub-characters, along with their respective positional roles for composition.

Exploiting these characteristics is not a new idea; it has been explored in the context of an end-to-end architecture in (Stratos, 2017), as a method for learning subword embeddings in (Park et al., 2018), and for classification in (Cho et al., 2019).

One significant contribution of our method is the guarantee of round-trip consistency. The previous work discussed above also covers sub-character (Jamo) based methods, but the evaluation was limited to tasks that do not require generation, and reconstruction was not discussed. We attempt to address this limitation in our work. This is analogous to how subword tokenization methods have brought to the field guarantees of lossless encoding and decoding, which was not possible with conventional lossy encoding methods such as lemmatization, stemming, and other normalization methods.

[Figure 2 shows the standard form 하다 → ㅎㅏㄷㅏ and three conjugations: 한다 → ㅎㅏㄴㄷㅏ (Declarative Present Formal Low), 할거야 → ㅎㅏㄹㄱㅓㅇㅑ (Declarative Future Informal Low), and 합니다 → ㅎㅏㅂㄴㅣㄷㅏ (Declarative Present Formal High).]
Figure 2: In this example, the standard form 하다 (to do) can be conjugated into many forms. The first two Jamo correspond to a common morpheme, which through agglutination becomes different conjugations.

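Table 1 can be transcribed directly; the following Python sketch, written with the compatibility Jamo glyphs shown in the table (our choice of representation), is a quick sanity check that the union of the three positional sets has the 51 elements stated above:

# The three positional Jamo sets from Table 1, written with compatibility Jamo glyphs.
J_H = tuple("ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ")               # Choseong (19)
J_V = tuple("ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ")            # Jungseong (21)
J_T = tuple("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")   # Jongseong (27)

assert (len(J_H), len(J_V), len(J_T)) == (19, 21, 27)
# Counting each glyph once regardless of position recovers |J| = 51 from Section 3.2.
assert len(set(J_H) | set(J_V) | set(J_T)) == 51
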
4. Method

The core motivation of our method is to remove the Unicode bottleneck we have described in the previous section, and to expose the alphabet composition characteristics and the agglutinative nature of Korean to the subword tokenizers. We also introduce a hard round-trip requirement, which guarantees lossless reconstruction of the transformed text back to the original input while not introducing any unexpected side-effects when training with other languages.

Our method uses special unused characters in the Hangul Jamo page as hints for the processors to operate on. Tokenizers will treat these invisible characters the same as a standard character in the same Unicode page. This is crucial for tokenizer implementations that treat pairs from different Unicode pages as non-mergeable.

The method itself is implemented and provided as two modules, with two different algorithms. The two modules are the pre-processor, which performs the Jamo decomposition, and the post-processor, which reconstructs it back to a more human-readable form. The post-processor is only needed for generative tasks where the output will be presented to a human. The pre-processor needs to be run at least once for each input.

We propose two different algorithms: a simple method which aligns the output to a character grid of 3, and a complex method which does not have alignment and instead relies on an automaton to reconstruct the transformed text. Both methods prefix orphaned Jamo (e.g., ㅋㅋ, which roughly means "laugh out loud"), which is extremely common in internet corpora, with a post-processor hint. These methods will be referenced as aligned and automaton in later parts of our paper.

The two methods have different characteristics. Aligned can be reconstructed with an extremely simple post-processor, and has much higher guarantees for reconstruction. The automaton requires a significantly more complex state-machine based post-processor for reconstruction, which operates in the same way an Input Method Processor (IME) does.

4.1. General Decomposition

Decomposition exploits Unicode-level properties of Korean text, and is similar to, but slightly different from, NFKD normalization [1]. The difference is mainly for reconstruction simplicity and reliability when dealing with non-deterministic output, such as what would come out of a model that has not fully converged. Decomposition involves arithmetic operations against the integer Unicode codepoint for each character. Given the integer Unicode codepoint ci for character c, and the constants k1 = 44032, k2 = 588, and k3 = 28, the following formulas explain the decomposition:

c'i = ci − k1
ih = ⌊ c'i / k2 ⌋
iv = ⌊ (c'i − k2·ih) / k3 ⌋
it = (c'i − k2·ih) − k3·iv

The constants correspond to offset information for each part in the global Unicode table and page for Korean. The computed ih, iv, and it correspond to the index of the Jamo in table 1 for Jh, Jv, and Jt respectively.

Orphaned Jamo is prepended with a special character, U+115F (Hangul Choseong Filler). During reconstruction, if the post-processor sees this character, it treats it as a look-ahead hint, ignores it, and does not attempt to reconstruct a full character in the next iteration. Each orphan Jamo character is prefixed with this hint.

This operation is performed for each c ∈ C, where c is an individual character and C is the corpus, whenever c is a codepoint in a Korean codepage. Non-Korean characters are left intact.

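As a concrete illustration of the arithmetic above, the following Python sketch maps a single precomposed syllable to its (ih, iv, it) indices; the example syllables are ours:

# Sketch of the Section 4.1 index arithmetic for a single precomposed syllable.
K1, K2, K3 = 44032, 588, 28

def jamo_indices(syllable: str):
    """Return (i_h, i_v, i_t); i_t = 0 means the syllable has no Jongseong."""
    r = ord(syllable) - K1           # c'_i = c_i - k1
    i_h = r // K2                    # index into J_h (Choseong)
    i_v = (r - K2 * i_h) // K3       # index into J_v (Jungseong)
    i_t = (r - K2 * i_h) - K3 * i_v  # index into J_t (Jongseong)
    return i_h, i_v, i_t

print(jamo_indices("강"))  # (0, 0, 21) -> ㄱ, ㅏ, ㅇ
print(jamo_indices("하"))  # (18, 0, 0) -> ㅎ, ㅏ, no Jongseong
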

4.2. Aligned Processing

4.2.1. Aligned Decomposition
In the pre-processor of the aligned method, we replace cases of an empty Jongseong (the omitted element of Jt) with a special filler character, U+11FF (Hangul Jongseong Ssangnieun), in the Unicode Jamo page, to make the output friendlier to algorithms which avoid merging character pairs from different code pages. This particular placeholder was chosen as it is impossible for a user to input it through a standard IME. This ensures that when the post-processor sees a Choseong (head consonant), it can perform a read-ahead for two more characters and perform a reconstruction.

Input | 강 | 강가 | ㅋㅋ
Output | ㄱㅏㅇ | ㄱㅏㅇㄱㅏ⟨F⟩ | ⟨H⟩ㅋ⟨H⟩ㅋ

Table 2: An example of the decomposition in aligned mode. ⟨H⟩ denotes orphan hints (U+115F), and ⟨F⟩ denotes the filler (U+11FF); both are invisible characters.

4.2.2. Aligned Reconstruction
In this algorithm, the post-processor is implemented as an inverse transform of the previous decomposition process to derive the original ci. The post-processor reads each c ∈ C from the text C, and in the event c is in the Choseong set Jh, it reads ahead for two more characters, which are expected to correspond to Jamo within Jv and Jt respectively. Given these characters ch, cv, and ct, we look up the indices ih, iv, and it such that ch = Jh[ih], cv = Jv[iv], and ct = Jt[it], with the exception that it = 0 if ct = U+11FF, the filler character. Given ih, iv, and it, the inverse transform can be done with the following formula:

ci = k1 + (ih·k2 + iv·k3 + it)

The computed ci is the reconstructed codepoint of the original character before decomposition. While simplicity is the strength of this method, it does not fully expose the agglutinative nature of the underlying language. This is mainly caused by the filler character acting as a merge bottleneck, so the vocabulary training results in fitting towards a complete character boundary for cases where the bottleneck happens. This results in unmerged shared morphemes; an example is illustrated in figure 3.

[Figure 3 shows the standard form 부르다 → ㅂㅜㄹㅡㄷㅏ, the conjugation 불러 → ㅂㅜㄹㄹㅓ (Declarative Present Informal Low), and the correct versus actual shared surface over 불러 → ㅂㅜㄹㄹㅓ.]
Figure 3: An example of the filler character interfering with a shared morpheme surface merge. The filler character in this example is U+11FF, which results in an incorrect surface, as seen in the fourth example.

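The round trip of Sections 4.2.1 and 4.2.2 can be sketched as below. The choice of the conjoining Hangul Jamo block (U+1100 onwards) for the decomposed output and the compatibility-Jamo test for detecting orphaned Jamo are our assumptions; the paper only fixes the hint (U+115F) and filler (U+11FF) code points and the index arithmetic. Only the happy path is handled.

# Minimal sketch of aligned decomposition (4.2.1) and reconstruction (4.2.2).
K1, K2, K3 = 44032, 588, 28
HINT, FILLER = "\u115F", "\u11FF"   # orphan hint / Jongseong filler from the paper

def aligned_decompose(text: str) -> str:
    out = []
    for c in text:
        if K1 <= ord(c) < K1 + 11172:                   # precomposed Hangul syllable
            r = ord(c) - K1
            i_h, i_v, i_t = r // K2, (r % K2) // K3, r % K3
            out += [chr(0x1100 + i_h), chr(0x1161 + i_v),
                    chr(0x11A7 + i_t) if i_t else FILLER]
        elif 0x3131 <= ord(c) <= 0x318E:                # orphaned Jamo, e.g. ㅋㅋ (assumption)
            out += [HINT, c]
        else:                                           # non-Korean text is left intact
            out.append(c)
    return "".join(out)

def aligned_recompose(text: str) -> str:
    out, i = [], 0
    while i < len(text):
        c = text[i]
        if c == HINT:                                   # look-ahead hint: keep next char as-is
            out.append(text[i + 1]); i += 2
        elif 0x1100 <= ord(c) <= 0x1112:                # Choseong: read two more Jamo
            i_h = ord(c) - 0x1100
            i_v = ord(text[i + 1]) - 0x1161
            i_t = 0 if text[i + 2] == FILLER else ord(text[i + 2]) - 0x11A7
            out.append(chr(K1 + i_h * K2 + i_v * K3 + i_t))  # ci = k1 + (ih*k2 + iv*k3 + it)
            i += 3
        else:
            out.append(c); i += 1
    return "".join(out)

s = "강가 ㅋㅋ"
assert aligned_recompose(aligned_decompose(s)) == s
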

4.3. Automaton (Unaligned) Processing

4.3.1. Automaton Decomposition
The decomposition process in the automaton based algorithm is the same as in aligned mode, as explained in 4.1. One minor difference is that the filler character can be omitted, as the state machine in the post-processing step can automatically transition the state to a finished character even when there is no filler. An example illustrating the difference is in table 3.

[1] Normalization Form Compatibility Decomposition.

Input | 강 | 강가 | ㅋㅋ
Output | ㄱㅏㅇ | ㄱㅏㅇㄱㅏ | ⟨H⟩ㅋ⟨H⟩ㅋ

Table 3: An example of the decomposition in automaton mode. ⟨H⟩ denotes orphan hints (U+115F). Automaton mode does not need the filler characters.

4.3.2. Automaton Reconstruction
In this case, as the Jamo output is not aligned to the character boundary, reconstruction must be done with a stateful parser algorithm. We use a simplified version of what is commonly used in an IME for Korean input; the main difference is that ours assumes vowel and consonant combinations are already pre-combined. This assumption holds as long as the decomposed text was processed through the pre-processing as we defined it, as the iv and it computed earlier reference pre-combined characters in Jv and Jt. We will not discuss every corner case in this paper, but at a high level, the state machine uses six registers and has three possible states as illustrated in figure 4: initial, interim, and end.

The initial state is triggered when the current character c satisfies the condition c ∈ Jh. Interim corresponds to any intermediate state where parts of the registers have been filled in but are not ready to emit a complete character. End generally happens after all registers have been filled, and a character is emitted. In this state machine, the registers can contain values such that R1 ∈ Jh, R2 ∈ Jv, R3 ∈ Jt, and R4 ∈ Jh. The algorithm the state machine runs is illustrated in the appendix as algorithm 1.

Generally, this state machine will attempt to compose complete characters when all registers R1 to R3 are filled, or if R1 and R2 are filled and a termination condition occurs. This can happen if R2 is filled and a vowel comes in, since, unlike true IME input, our method has the vowels pre-combined. As the vowel slot has already been saturated, the values of the current registers will be emitted to a character, and the registers will be reset. (This condition is illustrated in the branch of our pseudocode in algorithm 1 where the internal state is S = INTERIM, c ∈ Jv, and R2 ≠ 0.) One other special case is the carryover operation, which is the only time R4 is used. This is used when R1 to R3 have been filled, but another consonant comes in. In this case, c is assigned to R4, and the state is set to the initial state.

In this algorithm, emit() is a bridge to the ci reconstruction algorithm we defined earlier. Given that the function emit() is called with the arguments ch, cv, and ct respectively, we use the exact same method as earlier and look up the indices ih, iv, and it such that ch = Jh[ih], cv = Jv[iv], and ct = Jt[it], with an exception where ch is treated as an orphan if cv = ct = 0. The function combine() combines two separate consonants (e.g., ㄹ and ㄱ) in Jt into a combined consonant (e.g., ㄺ) in Jt. This is done through a lookup table of valid combined consonant characters in Jt.

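The combine() lookup admits a direct table. The paper does not enumerate its table, so the eleven standard compound Jongseong below (written with compatibility Jamo glyphs, our choice of representation) are our reconstruction:

# Sketch of the combine() lookup table used by the automaton post-processor.
COMPOUND_JONGSEONG = {
    ("ㄱ", "ㅅ"): "ㄳ", ("ㄴ", "ㅈ"): "ㄵ", ("ㄴ", "ㅎ"): "ㄶ",
    ("ㄹ", "ㄱ"): "ㄺ", ("ㄹ", "ㅁ"): "ㄻ", ("ㄹ", "ㅂ"): "ㄼ",
    ("ㄹ", "ㅅ"): "ㄽ", ("ㄹ", "ㅌ"): "ㄾ", ("ㄹ", "ㅍ"): "ㄿ",
    ("ㄹ", "ㅎ"): "ㅀ", ("ㅂ", "ㅅ"): "ㅄ",
}

def combine(first: str, second: str) -> str:
    """Return the combined Jongseong for a valid pair; otherwise return the first
    consonant unchanged, which the state machine uses to detect that no combination
    applies (the `if R3 = l` test in algorithm 1)."""
    return COMPOUND_JONGSEONG.get((first, second), first)

print(combine("ㄹ", "ㄱ"))  # ㄺ
print(combine("ㄹ", "ㄷ"))  # ㄹ (invalid pair, no combination)
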
4.4. Expectations
As a result of introducing either of these methods, we expect to see improvements in terms of robustness against out-of-vocabulary, flexibility in target vocabulary size, and subword tokens which better represent morphemes or common patterns one can observe in an agglutinative language once the Unicode character boundary bottleneck is removed. We confirm these hypotheses in the experiment section.

5. Experiments
This work builds on the foundation of unsupervised, subword-tokenized large-corpus pre-training in a multilingual context; without these methods, the problem we are attempting to address would never have surfaced in the first place. The most significant contribution to this particular approach is BERT (Devlin et al., 2018), which also has a multilingual pre-trained model (bert-base-multilingual-cased) readily available. Along with that, we evaluate against XLM's (Lample and Conneau, 2019) 100 language model (xlm-mlm-100-1280), and two publicly available pre-trained BERT models, KoBERT and BERT Korean (BERTKr) [2]. We use only the tokenizer portions from these models as comparisons.

Model | Langs | All | Kor | Average
BERT | 104 | 119547 | 1568 | 5.39 / 1.35
XLM | 100 | 200000 | 2581 | 6.61 / 1.37
KoBERT | 1 | 8002 | 7378 | 2.70 / 1.67
BERTKr | 1 | 32006 | 11607 | 3.04 / 2.58

Table 4: Langs is the count of languages the model supports. All and Kor are the total vocabulary size followed by the size of the Korean side of the vocabulary. Average is the average subword length for the whole vocabulary, followed by the Korean portion.

We compare with and without our methods on the SentencePiece (Kudo and Richardson, 2018) tokenizer, trained against 500,000 randomly sampled sentences from the kosentences Wikipedia dataset, and 50,000 sentences from the Naver Sentiment Movie Dataset's test set to represent user written content. This corpus was then used to train SentencePiece models with vocabulary sizes of 350, 1K, 2.5K, 5K, and 10K. Sentences longer than the maximum designated sentence length were discarded during training. The hyperparameters for training the tokenizer are noted in the appendix.

5.1. Datasets
For our experiments, we use the following datasets to evaluate the effects of our method. As our primary interest is the tokenization aspect, we only use the corpora to validate the method against domain shift. The rationale behind this is that each underlying model has different performance characteristics, and the pre-training corpora are not part of the releases. For these reasons, we focus mainly on evaluating differences at the tokenization level. For vocabulary training, we use multiple subsets of different sizes from the Wikipedia part of the kosentences [3] corpus.

[2] https://github.com/SKTBrain/KoBERT and https://github.com/yeontaek/BERT-Korean-Model
[3] https://github.com/cynthia/kosentences

[Figure 4 shows the INIT, INTERIM, and END states, the transitions between them (e.g., "R1 set, c is consonant", "R1 & R2 set, c is vowel", "R1 & R2 & R3 set, standard completion or carryover through R4"), and the registers R1 to R4 along with the work register C and state register S, over the input text ㄱㅏㄹㅁㅜㄹㅣ.]
Figure 4: An illustration of the state machine needed for reconstruction in the automaton method post-processor. This illustration demonstrates the carryover state transition.

As our primary focus is on improving transfer learning, we follow the same protocol of learning the vocabulary at the pre-training stage and using the vocabulary as-is for downstream tasks. For these reasons, the tokenization performance is evaluated against the whole corpora in the datasets below.

5.1.1. Naver Sentiment Movie Corpus
The Naver Sentiment Movie Corpus [4] is a sentiment analysis corpus containing 200,000 user review comments and a corresponding label which denotes the sentiment. We only used the review comment text from this dataset's training split, as we used the test set during vocabulary pre-training.

5.1.2. KorQuAD 1.0
KorQuAD [5] is a Korean adaptation of the reading comprehension task defined by the popular SQuAD (Rajpurkar et al., 2016) dataset. The task involves answering a question given a passage of text; the dataset consists of 10,645 passages and 66,181 questions. We performed the evaluation on all of the passages, questions, and answers.

5.1.3. Text Mining Dataset
The Text Mining Dataset [6] contains Korean text data from two sources: one from Naver news, and the other from Naver movie reviews. The news corpus contains 1,661,786 sentences from article text, and 2,655,847 comments. The movie review corpus contains 3,280,685 review comments.

5.1.4. Namuwiki
The kosentences dataset we used for training the vocabulary above contains text from two sources, one of which is Wikipedia, and the other Namuwiki. The latter is closer to casual writing and contains significant amounts of paraphrased non-Korean text. We intentionally pre-train only on the Wikipedia part of the dataset so that this part can serve as an evaluation dataset. As this dataset contains over 27 million sentences, we sampled 1 million random sentences from it for evaluation.

5.2. Results
5.2.1. Out-of-Vocabulary Robustness
The most substantial contribution of this work is its OOV mitigation properties, as it reduces the minimum character level budget required to represent every possible standard case (excluding the three extension blocks) from an excessively large 22,852 character level subwords [7] to a theoretical lower boundary of 102 character level subwords.

As observed in the OOV analysis in table 5 and the vocabulary analysis in table 6, our method shows significant improvements in mitigating OOV at equal or even smaller vocabulary sizes compared to the baselines, even with a smaller total vocabulary size allocated for Korean subwords. As we were unable to find good metrics for comparing OOV robustness, we were required to define them ourselves. The four evaluations are computed as follows:

• OC: Count of all OOV tokens.
• OR: Count of all sentences with OOV tokens, divided by the number of sentences. This is denoted on the percentile scale (×100).
• LC: The sum of all OOV token surfaces, divided by the amount of original surface.
• ML: Mean length of all OOV token surfaces.

At a vocabulary size of 10K (of which Korean uses less than 25% of the budget), we can consistently observe that the OOV rate is at an utterly negligible amount of less than 0.2%. Our method proves effective across multiple domains, unlike the behavior we see in the baseline models.

Comparatively, our vocabulary was trained with a smaller dataset than the baselines. Even with this constraint, due to the decomposition reducing the character level vocabulary, we see improvements even when considering rare combinations of Vv and Vt, which generally require large corpora when training with composite characters.

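The four metrics above can be computed with a few lines; the representation of a tokenized sentence as (surface, is_oov) pairs is our assumption (with SentencePiece, a piece would be flagged OOV when it maps to the unknown id):

# Sketch of the OC, OR, LC, and ML metrics from Section 5.2.1.
from statistics import mean

def oov_metrics(sentences):
    """sentences: list of sentences; each sentence is a list of (surface, is_oov) pairs."""
    oov_surfaces = [s for sent in sentences for s, is_oov in sent if is_oov]
    total_chars = sum(len(s) for sent in sentences for s, _ in sent)
    oc = len(oov_surfaces)                                          # OC: OOV token count
    or_ = 100.0 * sum(any(o for _, o in sent) for sent in sentences) / len(sentences)  # OR
    lc = 100.0 * sum(len(s) for s in oov_surfaces) / total_chars    # LC: lost character %
    ml = mean(len(s) for s in oov_surfaces) if oov_surfaces else 0.0  # ML: mean OOV surface length
    return oc, or_, lc, ml
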
5.2.2. Elasticity of Vocabulary Budget
As noted above, we have reduced the minimum character level budget required to represent every possible case down to 102 character level subwords.

[4] https://github.com/e9t/nsmc
[5] https://korquad.github.io/
[6] https://github.com/lovit/textmining_dataset
[7] Unicode Hangul Syllables and Jamo combined is 11,426 characters, which means the budget requirement is usually twice that in the vocabulary for most subword tokenizers.

NSMC | TM News-Body | TM News-Comments
Model | Size | OC OR LC ML | OC OR LC ML | OC OR LC ML
BERT | 120K | 81603 30.08 6.10 5.80 | 151175 8.26 0.50 4.17 | 1351933 35.83 5.60 6.58
XLM | 200K | 259266 62.78 4.53 2.06 | 2627434 61.86 2.62 2.92 | 4233920 67.39 3.92 2.45
KoBERT | 8K | 38480 14.98 0.96 1.84 | 30724 1.77 0.03 1.23 | 574058 17.00 0.81 2.00
BERTKr | 32K | 63729 24.93 5.68 6.52 | 119017 6.46 0.34 3.58 | 1159335 31.36 4.55 6.11
SP | 1K | 101738 35.12 1.98 1.61 | 759562 30.70 0.78 1.76 | 2414089 51.92 2.45 1.99
SP | 2.5K | 18412 8.10 0.35 1.23 | 79231 4.60 0.08 1.23 | 408751 13.07 0.41 1.32
SP | 5K | 10096 4.50 0.19 1.19 | 30950 1.89 0.03 1.16 | 263799 8.77 0.26 1.25
SP | 10K | 662 0.31 0.01 1.15 | 659 0.04 0.00 1.08 | 24005 0.86 0.02 1.13
SP+Al | 350 | 48282 18.40 1.03 1.91 | 164518 9.89 0.00 1.00 | 359789 10.98 0.10 1.79
SP+Al | 1K | 23912 9.51 0.52 2.20 | 60189 3.56 0.00 1.00 | 221439 7.16 0.08 1.70
SP+Al | 2.5K | 3868 1.45 0.10 2.19 | 1708 0.11 0.00 1.00 | 4546 0.17 0.00 1.21
SP+Al | 5K | 1376 0.50 0.04 2.22 | 422 0.03 0.00 1.00 | 1524 0.06 0.00 1.14
SP+Al | 10K | 23 0.01 0.00 1.55 | 1 0.00 0.00 1.00 | 18 0.00 0.00 1.00
SP+Au | 350 | 23955 9.53 0.53 2.20 | 60189 3.56 0.00 1.02 | 221438 7.16 0.08 1.68
SP+Au | 1K | 19016 7.78 0.48 2.22 | 30468 1.84 0.00 1.02 | 153298 5.23 0.04 1.43
SP+Au | 2.5K | 2163 0.78 0.05 2.19 | 1708 0.11 0.00 1.00 | 4546 0.17 0.00 1.21
SP+Au | 5K | 832 0.29 0.02 2.50 | 422 0.03 0.00 1.00 | 1524 0.06 0.00 1.14
SP+Au | 10K | 23 0.01 0.00 1.55 | 1 0.00 0.00 1.00 | 18 0.00 0.00 1.00

TM Movies | KorQuAD | Namuwiki
Model | Size | OC OR LC ML | OC OR LC ML | OC OR LC ML
BERT | 120K | 1664275 37.37 9.18 6.51 | 14775 4.34 0.47 4.75 | 123873 10.17 0.82 3.70
XLM | 200K | 4770386 66.32 5.49 2.19 | 142926 25.05 2.22 3.98 | 1220239 62.69 2.63 1.95
KoBERT | 8K | 728933 17.38 1.23 1.87 | 21278 4.02 0.49 5.46 | 67869 4.94 0.26 2.58
BERTKr | 32K | 1157216 27.26 5.79 5.63 | 15414 6.48 0.92 6.28 | 202436 15.58 1.44 4.24
SP | 1K | 1587614 34.29 2.02 1.56 | 46931 13.92 0.93 2.99 | 532261 35.84 1.34 1.73
SP | 2.5K | 519287 13.48 0.65 1.28 | 9793 3.30 0.19 2.53 | 94949 6.70 0.26 1.76
SP | 5K | 114981 3.20 0.14 1.16 | 5052 1.78 0.09 2.34 | 51700 3.57 0.13 1.66
SP | 10K | 15960 0.46 0.02 1.12 | 337 0.18 0.01 1.98 | 4651 0.30 0.01 1.74
SP+Al | 350 | 120149 3.21 0.04 1.47 | 108167 52.73 1.84 1.60 | 261738 16.68 0.63 2.39
SP+Al | 1K | 91907 2.52 0.03 1.49 | 30295 6.84 0.66 5.30 | 175069 11.62 0.47 2.59
SP+Al | 2.5K | 91907 2.52 0.03 1.49 | 9974 2.04 0.33 7.39 | 45762 2.93 0.25 4.04
SP+Al | 5K | 540 0.02 0.00 1.00 | 7432 1.53 0.17 5.18 | 34634 1.58 0.17 4.92
SP+Al | 10K | 0 0.00 0.00 0.00 | 291 0.15 0.01 2.17 | 3660 0.21 0.01 1.99
SP+Au | 350 | 91907 2.52 0.03 1.47 | 32194 7.01 0.69 5.34 | 177032 11.74 0.48 2.57
SP+Au | 1K | 60536 1.70 0.02 1.40 | 17094 4.49 0.48 5.52 | 122446 8.29 0.39 2.79
SP+Au | 2.5K | 60536 1.70 0.02 1.40 | 9491 1.89 0.28 6.85 | 40823 2.62 0.23 4.19
SP+Au | 5K | 540 0.02 0.00 1.00 | 6745 1.46 0.15 4.67 | 36463 1.54 0.14 4.42
SP+Au | 10K | 0 0.00 0.00 0.00 | 291 0.15 0.01 2.17 | 3660 0.21 0.01 1.99

Table 5: OC is OOV token count, OR is OOV sentence ratio, LC is lost character percentage (actual percentile, scaled by 100), and ML is mean lost character count across all OOV tokens.

Following this logic, we can also assume that anything above 102 can be claimed for actual subwords. This means that, even for a pure Korean corpus, our method frees up 22,750 slots for actual subwords, instead of having to allocate them to completed characters which are unlikely to be seen.

To confirm that our method increases the elasticity of the vocabulary budget, we trained multiple sizes during the OOV robustness experiments to verify how small the vocabulary can be compacted. We compare the performance differences and characteristics with respect to OOV in the exhaustive experiment in table 5. We can see that the ratio of sentences with OOV, even on small vocabularies such as 1K, has comparable robustness to tokenizers with much larger vocabularies. The lower boundary of aligned is 109 subwords, and that of automaton (unaligned) is 116 subwords. Both boundaries were found by setting the coverage rate to 0.994. As this is effectively an alphabet sized vocabulary, the number is only useful as a theoretical lower boundary limit, as such a vocabulary would not be practically useful.

5.2.3. Subword Surface Length
Longer surface lengths result in shorter sequences, which prevents truncation or omission in training due to architecture limitations; longer sequences also increase computation time in sequence models. We also see moderate gains in the subword surface length, as shown in table 6. We see a small surface length jump when introducing aligned (Al) on top of the SentencePiece baseline, but the gain is not as significant as when using the automaton (Au), which nearly doubles the surface length compared to other cases with the same vocabulary budget.

Our contribution here is the increase in surface length relative to the allocated vocabulary size, which improves on the baselines.

Model | Count | Min | Max | Mean | Std
BERT | 1568 | 1 | 5 | 1.32 | 0.62
XLM | 2581 | 1 | 8 | 1.55 | 0.72
KoBERT | 7378 | 1 | 9 | 2.12 | 0.93
BERTKr | 11607 | 1 | 14 | 2.19 | 1.00
SP@1K | 861 | 1 | 3 | 1.02 | 0.13
[email protected] | 1902 | 1 | 5 | 1.31 | 0.57
SP@5K | 2903 | 1 | 5 | 1.39 | 0.64
SP@10K | 2748 | 1 | 3 | 1.02 | 0.15
SP+Al@350 | 150 | 0.4 | 1.2 | 0.96 | 0.91
SP+Al@1K | 460 | 0.4 | 2 | 1.18 | 0.99
[email protected] | 1001 | 0.4 | 3.6 | 1.46 | 1.42
SP+Al@5K | 1463 | 0.4 | 4.8 | 1.59 | 1.67
SP+Al@10K | 1049 | 0.4 | 3.6 | 1.44 | 1.50
SP+Au@350 | 272 | 0.4 | 2.8 | 0.92 | 0.95
SP+Au@1K | 845 | 0.4 | 5.2 | 1.32 | 1.57
[email protected] | 2125 | 0.4 | 6 | 1.71 | 1.93
SP+Au@5K | 3514 | 0.4 | 6.4 | 1.90 | 2.05
SP+Au@10K | 2466 | 0.4 | 6 | 1.72 | 1.88

Table 6: Count of Korean subwords, and the minimum, maximum, average, and standard deviation of the token length for Korean subwords in each model's vocabulary. For our methods, we have scaled it down by 2.5 for a fair comparison. SP is SentencePiece, and what comes after the @ is the size of the vocabulary. +Al denotes aligned, and +Au denotes automaton (unaligned).

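Statistics of the kind shown in Table 6 can be recomputed from a SentencePiece vocabulary file with a short script. The file format (one tab-separated piece per line, ▁ word-boundary marker stripped), the Hangul-detection heuristic, and the file name are our assumptions; the 2.5 scaling follows the table caption.

# Sketch of Table 6-style length statistics for the Korean subwords in a vocabulary.
from statistics import mean, pstdev

def is_korean(piece: str) -> bool:
    # Heuristic (assumption): any Hangul Jamo, compatibility Jamo, or syllable code point.
    return any(0x1100 <= ord(ch) <= 0x11FF or 0x3131 <= ord(ch) <= 0x318E
               or 0xAC00 <= ord(ch) <= 0xD7A3 for ch in piece)

def korean_length_stats(vocab_path: str, scale: float = 1.0):
    with open(vocab_path, encoding="utf-8") as f:
        pieces = [line.rstrip("\n").split("\t")[0].lstrip("\u2581") for line in f]
    lengths = [len(p) / scale for p in pieces if is_korean(p)]
    return len(lengths), min(lengths), max(lengths), mean(lengths), pstdev(lengths)

# For the decomposed (aligned/automaton) vocabularies, divide lengths by 2.5 as in the caption:
# count, lo, hi, avg, std = korean_length_stats("spm_jamo_5k.vocab", scale=2.5)  # hypothetical path
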
6. Limitations and Future Work

Both methods suffer from one severe limitation: reconstruction is only possible given that the model output is sane. This is analogous to the Unicode pair reconstruction problem which can be observed in the tokenizer used by GPT-2 (Radford et al., 2019) or other byte-level approaches (Gillick et al., 2016): unless the output is a sane byte sequence which can be reconstructed into a Unicode character, the output is not only unusable, it can also break parsing of the remainder of the sequence.

While a mitigation could hypothetically be put in place by encoding checksums for cases where the post-processor cannot reconstruct the subword tokens, which in turn could be used as hints to discard erroneous output from the model in a generative task, we have not explored this direction within the scope of this work and leave it as a point for future exploration.

7. Conclusion

To summarize the contributions of our work, we start by defining the commonly overlooked problem in multilingual pre-training when including CJK languages, along with the computational budget, complexities, and inefficiencies associated with it.

We then propose a novel, tokenizer-agnostic method which supplements subword tokenizers in the context of Korean, and which can be beneficial when combined with other methods involving pre-training of large models. As this pre-processing only needs to be done once per unseen text, and it has no side-effects when used in polyglot corpora, the cost of applying it to any large scale multilingual pre-training is not only negligible: it is also expected to yield better subwords and completely mitigate out-of-vocabulary issues in all downstream tasks for Korean, while reducing the character level budget needed to represent Korean.

Additionally, we propose a fully reversible method, which enables generative downstream tasks such as (but not limited to) machine translation, text summarization, and language modeling.

Finally, we demonstrate that our method improves the flexibility and performance of state-of-the-art methods while mitigating common OOV problems when training multilingual models. We hope our contributions will enable a cost-effective and straightforward path for including Korean in future multilingual research.

8. Bibliographical References

Chaudhary, A., Zhou, C., Levin, L., Neubig, G., Mortensen, D. R., and Carbonell, J. (2019). Adapting Word Embeddings to New Languages with Morphological and Phonological Subword Representations.
Cho, W. I., Kim, S. M., and Kim, N. S. (2019). Investigating an effective character-level embedding in Korean sentence classification. arXiv preprint arXiv:1905.13656.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2018). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.
Ding, C., Utiyama, M., and Sumita, E. (2018). Simplified abugidas. In ACL 2018 - 56th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference (Long Papers).
Gillick, D., Brunk, C., Vinyals, O., and Subramanya, A. (2016). Multilingual language processing from bytes. In 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL HLT 2016 - Proceedings of the Conference.
He, H., Wu, L., Yang, X., Yan, H., Gao, Z., Feng, Y., and Townsend, G. (2018). Dual Long Short-Term Memory Networks for Sub-Character Representation Learning. In Advances in Intelligent Systems and Computing.
Howard, J. and Ruder, S. (2018). Universal Language Model Fine-tuning for Text Classification.
Kudo, T. and Richardson, J. (2018). SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In EMNLP 2018 - Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Proceedings.
Lample, G. and Conneau, A. (2019). Cross-lingual Language Model Pretraining.
Park, S., Byun, J., Baek, S., Cho, Y., and Oh, A. (2018). Subword-level word vector representations for Korean. In ACL 2018 - 56th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference (Long Papers).

Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., and Sutskever, I. (2019). Language Models are Unsupervised Multitask Learners. Technical report.
Rajpurkar, P., Zhang, J., Lopyrev, K., and Liang, P. (2016). SQuAD: 100,000+ questions for machine comprehension of text. In EMNLP 2016 - Conference on Empirical Methods in Natural Language Processing, Proceedings.
Rajpurkar, P., Jia, R., and Liang, P. (2018). Know what you don't know: Unanswerable questions for SQuAD. In ACL 2018 - 56th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference (Long Papers).
Sennrich, R., Haddow, B., and Birch, A. (2016). Neural machine translation of rare words with subword units. In 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016 - Long Papers.
Shi, X., Zhai, J., Yang, X., Xie, Z., and Liu, C. (2015). Radical embedding: Delving deeper to Chinese radicals. In ACL-IJCNLP 2015 - 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, Proceedings of the Conference.
Stratos, K. (2017). A sub-character architecture for Korean language processing. In EMNLP 2017 - Conference on Empirical Methods in Natural Language Processing, Proceedings.
Sun, Y., Lin, L., Yang, N., Ji, Z., and Wang, X. (2014). Radical-enhanced Chinese character embedding. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics).
Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., and Brew, J. (2019). HuggingFace's Transformers: State-of-the-art natural language processing. ArXiv, abs/1910.03771.
Yin, R., Wang, Q., Li, R., Li, P., and Wang, B. (2016). Multi-granularity Chinese word embedding. In EMNLP 2016 - Conference on Empirical Methods in Natural Language Processing, Proceedings.

Appendix

The content below was not of adequate length for the main paper.

8.1. Hyperparameters

The following hyperparameters were used to train the SentencePiece model. We used identity for the normalization rule, to prevent Unicode normalization from composing the Jamo back together. Additionally, the maximum sentence length was set to 8000, and the model type was set to unigram. Character coverage was set proportionally to vocabulary size: 0.994 for |V| = 350, 0.995 for |V| = 1000, 0.997 for |V| = 2500, 0.9985 for |V| = 5000, and 1.0 for |V| = 10000.

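These hyperparameters map directly onto the SentencePiece trainer; a minimal sketch follows, where the corpus path and model prefix are placeholders and the coverage value shown corresponds to |V| = 5000:

# Sketch of training a SentencePiece model with the hyperparameters listed in 8.1.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="kosentences_decomposed.txt",   # pre-processed (Jamo-decomposed) corpus; placeholder path
    model_prefix="spm_jamo_5k",           # placeholder prefix
    vocab_size=5000,
    model_type="unigram",
    normalization_rule_name="identity",   # prevent Unicode normalization from recomposing Jamo
    max_sentence_length=8000,
    character_coverage=0.9985,            # coverage is set proportionally to vocabulary size
)
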
8.2. Automaton Reconstruction Pseudocode

This section describes the algorithm used for reconstruction in our automaton based method.

Reset(): R1, R2, R3, R4 ← {0, 0, 0, 0}
Initialize: S ← INIT; Reset()
foreach c ∈ C, given that c ∈ J do
    if S = INIT then
        if c ∈ Jv then
            if R4 then
                emit(R1, R2, R3)
                Reset(); R1, R2 ← {R4, c}
        else
            if R4 then
                R3 ← combine(R3, R4)
                emit(R1, R2, R3); Reset()
            R1 ← c; S ← INTERIM
        end
    else if S = INTERIM then
        if c ∈ Jv then
            if R2 then
                emit(R1, R2, R3)
                Reset(); S ← INIT
            else
                R2 ← c
        else
            if R2 then
                R3 ← c; S ← END
            else
                emit(R1, 0, 0)
                Reset(); R1 ← c; S ← END
            end
    else if S = END then
        if c ∈ Jv then
            emit(R1, R2, 0)
            if R3 then
                Reset(); R1, R2 ← {R3, c}; S ← INTERIM
            else
                Reset(); S ← INIT
        else
            l ← combine(R3, c)
            if R3 = l then
                emit(R1, R2, R3)
                Reset(); R1 ← c; S ← INTERIM
            else
                R4 ← c; S ← INIT
            end
    end
end

Algorithm 1: Simplified version of the automaton based reconstruction algorithm. ← is equivalent to a MOV instruction. l is a temporary scratch, which is actually R4, but named differently for clarity.
