Byte Pair Encoding is Suboptimal for Pretraining

Kaj Bostrom and Greg Durrett Department of Computer Science The University of Texas at Austin {kaj,gdurrett}@cs.utexas.edu

arXiv:2004.03720v2 [cs.CL] 5 Oct 2020

Abstract

The success of pretrained transformer language models (LMs) in natural language processing has led to a wide range of pretraining setups. In particular, these models employ a variety of subword tokenization methods, most notably byte-pair encoding (BPE) (Sennrich et al., 2016; Gage, 1994), the WordPiece method (Schuster and Nakajima, 2012), and unigram language modeling (Kudo, 2018), to segment text. However, to the best of our knowledge, the literature does not contain a direct evaluation of the impact of tokenization on language model pretraining. We analyze differences between BPE and unigram LM tokenization, finding that the latter method recovers subword units that align more closely with morphology and avoids problems stemming from BPE's greedy construction procedure. We then compare the fine-tuned task performance of identical transformer masked language models pretrained with these tokenizations. Across downstream tasks and two languages (English and Japanese), we find that the unigram LM tokenization method matches or outperforms BPE. We hope that developers of future pretrained LMs will consider adopting the unigram LM method over the more prevalent BPE.

1 Introduction

Large transformers (Vaswani et al., 2017) pretrained with variants of a language modeling objective, such as BERT (Devlin et al., 2019), have proven their effectiveness at flexibly transferring to a variety of domains and tasks. One design decision that makes them particularly adaptable is their graceful handling of the open vocabulary problem through subword tokenization. Subword tokenization, popularized in the neural machine translation literature (Sennrich et al., 2016; Vaswani et al., 2017; Wu et al., 2016), produces tokens at multiple levels of granularity, from individual characters to full words. As a result, rare words are broken down into a collection of subword units, bottoming out in characters in the worst case.

Critically, a pretrained language model's subword vocabulary cannot be altered: any downstream application of these models must tokenize input or generate output using the original subword vocabulary, making the choice of tokenization a particularly significant decision.

A variety of subword tokenization methods have seen use in pretrained language models. BERT uses the WordPiece method (Schuster and Nakajima, 2012), a language-modeling based variant of BPE; T5 (Raffel et al., 2019) uses character-level BPE; GPT2 (Radford et al., 2019) and ROBERTA (Liu et al., 2019) use BPE over raw bytes instead of unicode characters; XLNET (Yang et al., 2019) and ALBERT (Lan et al., 2019) use the SentencePiece library (Kudo and Richardson, 2018), which implements both BPE and unigram language model tokenization, but in both cases fail to clarify which of these methods they chose. The effects of tokenization are not examined in a reported experiment in any of the above works except Liu et al. (2019), who note that WordPiece gave a small advantage over BPE in their preliminary investigation. In the literature, Kudo (2018) introduced the unigram language model tokenization method in the context of machine translation and found it comparable in performance to BPE. Domingo et al. (2018) performed further experiments to investigate the effects of tokenization on neural machine translation, but used a shared BPE vocabulary across all experiments. Gallé (2019) examined algorithms in the BPE family, but did not compare to unigram language modeling.

In this work, we characterize the space of proposed subword tokenization algorithms and analyze the differences between the two methods with publicly available implementations: BPE (merging tokens based on bigram frequency) and unigram language modeling (pruning tokens based on unigram LM perplexity). While the vocabularies resulting from these schemes are heavily overlapping, we compare each method to reference morphological segmentations and find that the unigram LM method produces tokens better aligned with morphology. To understand whether this more natural tokenization leads to improved performance, we pretrain separate language models using the ROBERTA objective (Liu et al., 2019) with each tokenization for both English and Japanese, two typologically distant languages. On downstream tasks, we find a performance gap across tasks and languages, with the unigram LM method providing an improvement over BPE of up to 10% in our Japanese QA experiments, indicating the benefits of adopting this technique in the context of language model pretraining.

2 Algorithms

Subword tokenization algorithms consist of two components: a vocabulary construction procedure, which takes a corpus of text and returns a vocabulary with the desired size, and a tokenization procedure, which takes the built vocabulary and applies it to new text, returning a sequence of tokens. In theory, these two steps can be independent, although for the algorithms we examine the tokenization procedure is tightly coupled to the vocabulary construction procedure.

A BPE vocabulary is constructed as follows:

Algorithm 1 Byte-pair encoding (Sennrich et al., 2016; Gage, 1994)
  Input: set of strings D, target vocab size k
  procedure BPE(D, k)
    V ← all unique characters in D (about 4,000 in English Wikipedia)
    while |V| < k do                              ▷ Merge tokens
      t_L, t_R ← most frequent bigram in D
      t_NEW ← t_L + t_R                           ▷ Make new token
      V ← V + [t_NEW]
      Replace each occurrence of t_L, t_R in D with t_NEW
    end while
    return V
  end procedure

BPE tokenization takes the vocabulary V containing ordered merges and applies them to new text in the same order as they occurred during vocabulary construction.

The WordPiece algorithm (Schuster and Nakajima, 2012), used to construct BERT's vocabulary, closely resembles BPE. However, instead of merging the most frequent token bigram, each potential merge is scored based on the likelihood of an n-gram language model trained on a version of the corpus incorporating that merge. Schuster and Nakajima (2012) note that the process of estimating language model parameters for every potential merge is prohibitive, so they employ aggressive heuristics to reduce the number of potential merges considered. As their implementation is not public,¹ we are unable to make a comparison to this method.

¹ Although its name and association with Google might suggest otherwise, the SentencePiece library (Kudo and Richardson, 2018) does not, in fact, implement the WordPiece algorithm; it provides implementations of BPE and unigram LM based tokenization.

The unigram LM method (Kudo, 2018), in contrast to the bottom-up construction process of BPE and WordPiece, begins with a superset of the final vocabulary, pruning it to the desired size:

Algorithm 2 Unigram LM (Kudo, 2018)
  Input: set of strings D, target vocab size k
  procedure UNIGRAMLM(D, k)
    V ← all substrings occurring more than once in D (not crossing words)
    while |V| > k do                              ▷ Prune tokens
      Fit unigram LM θ to D
      for t ∈ V do                                ▷ Estimate token 'loss'
        L_t ← p_θ(D) − p_θ′(D), where θ′ is the LM without token t
      end for
      Remove min(|V| − k, ⌊α|V|⌋) of the tokens t with highest L_t from V,
        where α ∈ [0, 1] is a hyperparameter
    end while
    Fit final unigram LM θ to D
    return V, θ
  end procedure

Unigram LM tokenization takes the vocabulary V and unigram LM parameters θ and performs Viterbi inference to decode the segmentation with maximum likelihood under θ. This method is similar to Morfessor's unsupervised segmentation (Creutz and Lagus, 2005) without its informed prior over token length.

  Original:     furiously
  BPE:          ▁fur iously
  Uni. LM:      ▁fur ious ly

  Original:     tricycles
  BPE:          ▁t ric y cles
  Uni. LM:      ▁tri cycle s

  Original:     nanotechnology
  BPE:          ▁n an ote chn ology
  Uni. LM:      ▁nano technology

  Original:     Completely preposterous suggestions
  BPE:          ▁Comple t ely ▁prep ost erous ▁suggest ions
  Unigram LM:   ▁Complete ly ▁pre post er ous ▁suggestion s

  Original:     corrupted
  BPE:          ▁cor rupted
  Unigram LM:   ▁corrupt ed

  Original:     1848 and 1852,
  BPE:          ▁184 8 ▁and ▁185 2,
  Unigram LM:   ▁1848 ▁and ▁1852 ,

  Original      磁性は様々に分類がなされている。
  BPE           ▁磁 性は 様々 に分類 がなされている 。
  Unigram LM    ▁磁 性 は 様々 に 分類 がなされている 。
  Gloss         magnetism (top.) various ways in classification is done .
  Translation   Magnetism is classified in various ways.

Figure 1: Example tokenizations. The character ‘▁’ is a word boundary marker. BPE merges common tokens, such as English inflectional suffixes and Japanese particles, into their neighbors even when the resulting unit is not semantically meaningful.
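As a rough illustration of Algorithm 1 and of the merge-replay tokenization described in Section 2, the following Python sketch builds a BPE vocabulary from a toy word list. The function names (train_bpe, merge_pair, bpe_tokenize), the tie-breaking rule, and the omission of word boundary markers are assumptions made for this illustration; it is not the SentencePiece implementation used in the experiments.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace each adjacent occurrence of `pair` in `symbols` with the merged token."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(words, vocab_size):
    """Toy version of Algorithm 1: returns (vocabulary, ordered list of merges)."""
    # Store each distinct symbol sequence with its corpus frequency.
    corpus = Counter(tuple(w) for w in words)
    vocab = sorted({ch for symbols in corpus for ch in symbols})
    merges = []
    while len(vocab) < vocab_size:
        # Count adjacent token pairs; pairs never cross word boundaries.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # every word is already a single token
        # Most frequent bigram; ties are broken by tuple order (an assumption).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_token = best[0] + best[1]
        if new_token not in vocab:
            vocab.append(new_token)
        # Apply the merge everywhere in the corpus.
        merged = Counter()
        for symbols, freq in corpus.items():
            merged[merge_pair(symbols, best)] += freq
        corpus = merged
    return vocab, merges


def bpe_tokenize(word, merges):
    """Tokenize new text by replaying merges in the order they were learned."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)
```

With the toy corpus of 5 × "low", 2 × "lower", 6 × "newest", and 3 × "widest" and a target size of 14, this sketch learns the merges s+t, e+st, o+w, and l+ow, so bpe_tokenize("lowest", merges) returns ['low', 'est']. The intermediate tokens "st" and "ow" remain in the vocabulary even though they no longer occur on their own in the training words.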

[Figure 2 plots, each comparing BPE and Unigram LM. (a) Token length distributions within each vocabulary: number of tokens against token length. (b) Token frequency profiles over the corpus: token frequency (log scale) against token frequency rank.]

Figure 2: English subword vocabulary and corpus profiles. The unigram LM method produces longer tokens on average (a) and uses its vocabulary space more effectively (b), with more tokens of moderate frequency.
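The Viterbi inference step that Section 2 describes for unigram LM tokenization can be sketched as a short dynamic program over character positions. The function name viterbi_segment and the toy probabilities in the usage note are hypothetical; the sketch assumes the token log-probabilities have already been fitted and does not cover vocabulary pruning.

```python
import math


def viterbi_segment(word, logp):
    """Return (tokens, log_prob) for the most likely segmentation of `word`.

    `logp` maps each vocabulary token to its unigram log-probability; the
    score of a segmentation is the sum of its tokens' log-probabilities.
    """
    n = len(word)
    best = [-math.inf] * (n + 1)  # best[i]: best score for the prefix word[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last token in that prefix
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in logp and best[start] + logp[piece] > best[end]:
                best[end] = best[start] + logp[piece]
                back[end] = start
    if best[n] == -math.inf:
        raise ValueError("word cannot be segmented with this vocabulary")
    # Follow the back-pointers from the end of the word to recover the tokens.
    tokens, end = [], n
    while end > 0:
        tokens.append(word[back[end]:end])
        end = back[end]
    return tokens[::-1], best[n]
```

For example, with made-up probabilities of 0.05 for "fur", 0.04 for "ious", 0.1 for "ly", 0.001 for "iously", and 0.01 for every single character, the segmentation fur / ious / ly has probability 2e-4 and is preferred over fur / iously at 5e-5.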

In the course of our experiments we did not observe a major difference in speed between the two algorithms. Both require similar amounts of time to construct a vocabulary, and both have a negligible impact on overall model inference latency.

3 Comparison of Segmentations

3.1 Morphology

In Figure 1 we illustrate the differences in tokenization output between BPE and the unigram LM method. We observe that the unigram LM method produces subword units that qualitatively align with morphology much better than those produced by BPE. In particular, we note that the unigram LM method recovers common affixes such as -ly, -s, pre-, and tri- while BPE does not, instead absorbing them into adjacent units (-cles) while also producing meaningless single-character units.

This trend is supported by Table 1, in which we observe that recognizable affixes appear much more frequently in the unigram LM tokenization of our pretraining corpus than in the BPE tokenization.

  More frequent in
  BPE            Unigram LM
  H L M T B      s . , ed d
  P C K D R      ing e ly t a

Table 1: Tokens with the highest difference in frequency between tokenizations. The unigram LM method tends to produce more parsimonious prefixes and suffixes.

                          Tokenization
                          BPE      Unigram LM
  Tokens per word type    4.721    4.633
  Tokens per word         1.343    1.318

Table 2: Mean subword units per word for each method across all of English Wikipedia.

             English (w.r.t. CELEX2)         Japanese (w.r.t. MeCab)
  Method     Precision   Recall   F1         Precision   Recall   F1
  BPE        38.6%       12.9%    19.3%      78.6%       69.5%    73.8%
  Uni. LM    62.2%       20.1%    30.3%      82.2%       72.8%    77.2%

Table 3: Correspondence of subword boundaries between unsupervised tokenization methods and morphological reference segmentations.
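One way to compute boundary precision, recall, and F1 of the kind reported in Table 3 is sketched below. The helper names (boundaries, boundary_prf), the pooling of frequency-weighted boundary counts over all words before computing the ratios, and the example gold segmentations are assumptions for illustration; the exact aggregation used to produce Table 3 is not spelled out in the text.

```python
def boundaries(segments):
    """Character offsets of the internal boundaries of a segmented word."""
    cuts, pos = set(), 0
    for seg in segments[:-1]:
        pos += len(seg)
        cuts.add(pos)
    return cuts


def boundary_prf(examples):
    """Frequency-weighted boundary precision, recall, and F1.

    `examples` is an iterable of (predicted_segments, gold_segments, weight),
    where weight is the word's corpus frequency. Counts are pooled over all
    words before the ratios are taken (an assumed aggregation).
    """
    matched = predicted = gold = 0.0
    for pred_segs, gold_segs, weight in examples:
        if "".join(pred_segs) != "".join(gold_segs):
            raise ValueError("both segmentations must spell the same word")
        p, g = boundaries(pred_segs), boundaries(gold_segs)
        matched += weight * len(p & g)
        predicted += weight * len(p)
        gold += weight * len(g)
    precision = matched / predicted if predicted else 0.0
    recall = matched / gold if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

The F1 column of Table 3 is consistent with the usual harmonic mean: for the English BPE row, 2 × 0.386 × 0.129 / (0.386 + 0.129) ≈ 0.193.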

As the BPE tokenization is constructed greedily according to frequency, common affixes (and punctuation) are frequently absorbed into other tokens.²

² Note that the BPE vocabulary still includes these affixes, but when they are encountered during tokenization, they are almost always merged into larger units as in Figure 1.

We see in Figure 2a that the unigram LM tokenization tends to have longer subword units than BPE. This is closer to the length distribution of gold-standard English morphs, which have a mean length of approximately 6 characters (Creutz and Linden, 2004).

Comparison with morphological segmenters

In Table 3, we further corroborate these observations by performing a quantitative evaluation of the degree to which each unsupervised segmentation algorithm aligns with morphological baselines for each language. For English, we produce gold surface allomorph boundaries from the CELEX2 lexical database (Baayen et al., 1995) in the manner of Creutz and Lindén (2004). We then compare each algorithm's subword unit boundaries with gold morpheme boundaries for words with 2 or more morphemes, weighted by their frequency in English Wikipedia. For Japanese, we compare subword tokenizations of Japanese Wikipedia sentences to morphological reference tokenizations produced using the MeCab morphological analysis and tokenization tool (Kudo, 2006) using version 2.3.0 of the UniDic dictionary (Den et al., 2007).

We find that for both languages, the segmentations produced by the unigram LM method correspond more closely to the morphological references, confirming our qualitative analysis. On English data, both unsupervised methods exhibit low boundary recall; we attribute this to the fact that they represent many common words with underlying derivational morphology as single tokens, although for BPE this is compounded by effects we discuss in Section 3.2.

The ability of the unigram LM method to recover the morphological structure of the text without explicit supervision aligns with the main findings of Creutz and Lagus (2005), who successfully use maximum-a-posteriori unigram language models to perform unsupervised morphological segmentation of English and Finnish.

3.2 Vocabulary Allocation

By surfacing subword units that align with morphology, the unigram LM tokenization provides the opportunity for the model to learn composable subword embeddings. If an affix reliably signals a linguistic feature, rather than needing to store that information redundantly across the embeddings of many tokens containing the affix, the model can store it in just the embedding of the affix.

These results suggest that the unigram LM method may allocate its vocabulary more economically. We note in Figure 2b that both vocabularies contain a "dead zone" of tokens whose frequency is much lower than the rest of the vocabulary. This is largely the result of the presence of a number of very uncommon characters, including Chinese and Japanese kanji, in the training corpus. In the BPE tokenization, however, this effect is exacerbated, with the dead zone containing about 1500 more entries as a result of the tendency of its vocabulary construction process to produce intermediate "junk" tokens. For example, in the case where three tokens almost always occur as a group, in order to merge them into a single token, BPE must first merge one pair before incorporating the third token; this leaves an intermediate token in the vocabulary that will only occur rarely on its own. Additionally, tokens that appear in many contexts, such as inflectional affixes (-s, -ed), will tend to merge with many adjacent units due to their frequency. However, these merges lead to embedding redundancy, as these affixes usually have the same linguistic function in every context. Since the unigram LM method selects tokens during vocabulary construction using a global optimization procedure, it does not produce junk tokens; this property also allows it to avoid merging frequent tokens with their neighbors too aggressively.

Japanese vocabulary comparisons are included in Appendix B.

                   English                                                                       Japanese
                   SQuAD 1.1 (dev.)         MNLI (dev.)              CoNLL NER                TyDi QA (dev.)
  Model            EM          F1           Acc. (m)    Acc. (mm)    Dev. F1     Test F1      EM            F1
  Ours, BPE        80.6 ± .2   88.2 ± .1    81.4 ± .3   82.4 ± .3    94.0 ± .1   90.2 ± .0    41.4 ± 0.6    42.1 ± 0.6
  Ours, Uni. LM    81.8 ± .2   89.3 ± .1    82.8 ± .2   82.9 ± .2    94.3 ± .1   90.4 ± .1    53.7 ± 1.3    54.4 ± 1.2
  BERTBASE         80.5        88.5         84.6        83.4         96.4        92.4         –             –

Table 4: Fine-tuning results. Metrics are averaged across 5 fine-tuning seeds with standard deviations indicated by ±; due to computational constraints we did not pretrain more than once per tokenization. We include fine-tuning results for a transformer with a comparable architecture, BERTBASE, for reference, although we note that a direct comparison cannot be made due to BERTBASE using both a larger pretraining corpus and a larger subword vocabulary.

4 Downstream Task Experiments

In order to make a fair experimental comparison between these two methods on downstream tasks, we do not use an existing pretrained language model like BERT, but instead train our own language models from scratch, controlling for the data, training objective, and optimization procedure. We pretrain four transformer masked language models using the architecture and training objective of ROBERTA-BASE (Liu et al., 2019) using the reference fairseq implementation (Ott et al., 2019). Two are pretrained on the text of English Wikipedia, comprising ∼3B tokens under either tokenization. The other two are pretrained on the text of Japanese Wikipedia, comprising ∼0.6B tokens. In each pair, one model is pretrained on the BPE tokenization of the corpus, and the other on the unigram LM tokenization, each with a vocabulary of 20,000 tokens. Hyperparameters are listed in Appendix A.

We subsequently fine-tune each of the pretrained English models on the SQuAD question-answering task (Rajpurkar et al., 2016), the MNLI task (Williams et al., 2018), and the English portion of the CoNLL 2003 named-entity recognition shared task (Tjong Kim Sang and De Meulder, 2003). We fine-tune the Japanese models on the Japanese minimal-answer subset of the TyDi question-answering task (Clark et al., 2020). We base our fine-tuning implementations on those of the transformers toolkit (Wolf et al., 2019).

The results of our fine-tuning experiments are presented in Table 4. We show that fine-tuning models pretrained with unigram LM tokenization produces better performance than fine-tuning models pretrained with BPE tokenization for all tasks. These results suggest that the higher morphological plausibility of the unigram LM tokenization may translate into better downstream task performance as well. Larger performance gaps are evident on SQuAD and MNLI, but the largest gap appears on Japanese TyDi. Differences in pretraining may be more evident in this setting due to the fact that the Japanese portion of the TyDi training split only contains ∼5k examples, compared to the ∼88k examples available for fine-tuning on SQuAD. Additionally, written Japanese does not feature whitespace between words, so it is possible for tokenizations to differ in word boundary placement as well as subword segmentation.

5 Conclusion

In this work we show that the choice of input encoding makes a difference in how well pretrained language models are able to perform end tasks. This indicates that tokenization encodes a surprising amount of inductive bias, and we suggest that unigram LM tokenization may be the better choice for development of future pretrained models.

Acknowledgments

This work was partially supported by NSF Grant IIS-1814522 and a gift from Arm. This material is also based on research that is supported by the Air Force Research Laboratory (AFRL), DARPA, for the KAIROS program under agreement number FA8750-19-2-1003. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the Air Force Research Laboratory (AFRL), DARPA, or the U.S. Government.

References

R. Harald Baayen, Richard Piepenbrock, and Leon Gulikers. 1995. The CELEX lexical database (release 2).

Jonathan Clark, Jennimaria Palomaki, Vitaly Nikolaev, Eunsol Choi, Dan Garrette, Michael Collins, and Tom Kwiatkowski. 2020. TyDi QA: A benchmark for information-seeking question answering in typologically diverse languages. Transactions of the Association for Computational Linguistics, 8:454–470.

Mathias Creutz and Krista Lagus. 2005. Unsupervised morpheme segmentation and morphology induction from text corpora using Morfessor 1.0. Helsinki University of Technology, Helsinki.

Mathias Creutz and Krister Lindén. 2004. Morpheme segmentation gold standards for Finnish and English. Report, Helsinki University of Technology.

Mathias Johan Philip Creutz and Bo Krister Johan Linden. 2004. Morpheme segmentation gold standards for Finnish and English. Publications in Computer and Information Science Report A77.

Yasuharu Den, Toshinobu Ogiso, Hideki Ogura, Atsushi Yamada, Nobuaki Minematsu, Kiyotaka Uchimoto, and Hanae Koiso. 2007. The development of an electronic dictionary for morphological analysis and its application to Japanese corpus linguistics. Japanese Linguistics, 22:101–123.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Miguel Domingo, Mercedes García-Martínez, Alexandre Helle, and Francisco Casacuberta. 2018. How much does tokenization affect neural machine translation? arXiv preprint arXiv:1812.08621.

Philip Gage. 1994. A new algorithm for data compression. C Users Journal, 12(2):23–38.

Matthias Gallé. 2019. Investigating the effectiveness of BPE: The power of shorter sequences. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1375–1381, Hong Kong, China. Association for Computational Linguistics.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.

Taku Kudo. 2006. MeCab: Yet another part-of-speech and morphological analyzer.

Taku Kudo. 2018. Subword regularization: Improving neural network translation models with multiple subword candidates. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 66–75, Melbourne, Australia. Association for Computational Linguistics.

Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 66–71, Brussels, Belgium. Association for Computational Linguistics.

Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2019. ALBERT: A lite BERT for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pages 48–53, Minneapolis, Minnesota. Association for Computational Linguistics.

Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392, Austin, Texas. Association for Computational Linguistics.

Mike Schuster and Kaisuke Nakajima. 2012. Japanese and Korean voice search. 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5149–5152.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany. Association for Computational Linguistics.

Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, pages 142–147.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1112–1122, New Orleans, Louisiana. Association for Computational Linguistics.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, et al. 2019. Transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237.

A Hyperparameters

  Pretraining
  Model architecture                  ROBERTA-BASE (Liu et al., 2019)
  Implementation                      fairseq (Ott et al., 2019)
  Optimizer                           ADAM, ε = 1e-6, β = (0.9, 0.98) (Kingma and Ba, 2015)
  Learning rate decay                 Polynomial
  Peak learning rate                  0.0005
  Warmup steps                        10000
  Weight decay                        0.01
  Batch size                          2048
  Sequence length                     512
  Total updates                       125000
  MLP dropout                         0.1
  Attention dropout                   0.1
  Precision                           16-bit

  Fine-tuning
  Implementations                     transformers (Wolf et al., 2019)
  Optimizer                           ADAM, ε = 1e-8, β = (0.9, 0.999)
  Learning rate decay                 Linear
  Peak learning rate                  5e-5
  Warmup steps                        0
  Weight decay                        0
  Batch size                          32
  Sequence length (SQuAD, TyDi QA)    512
  Passage stride (SQuAD, TyDi QA)     192
  Sequence length (MNLI, NER)         128
  Epochs                              3
  Precision                           16-bit

  Tokenization
  Implementations                     SentencePiece (Kudo and Richardson, 2018)
  Vocabulary size                     20000
  Unigram LM α                        0.25
B Japanese vocabulary comparison

More frequent in BPE Unigram LM )、 )。 ) ンの スの li lo ていく vi てしまう は のは の 、2 ンは hi 0% to no ta

Table 5: Tokens with the highest difference in frequency between tokenizations. The BPE method merges common tokens, such as particles and punctuation, even when they do not form meaningful units. The unigram LM method recovers the units ていく and てしまう, which are productive components of the Japanese verb conjugation system.

[Figure 3 plots, each comparing BPE and Unigram LM. (a) Token length distributions within each vocabulary: number of tokens against token length. (b) Token frequency profiles over the corpus: token frequency (log scale) against token frequency rank.]

Figure 3: Japanese subword vocabulary and corpus profiles. (a) The unigram LM method produces longer tokens, as it does in English. (b) Token frequency profiles resemble those of English, though the effect of the “dead zone” is less pronounced.