arXiv:2004.03720v2 [cs.CL] 5 Oct 2020

Byte Pair Encoding is Suboptimal for Language Model Pretraining

Kaj Bostrom and Greg Durrett
Department of Computer Science
The University of Texas at Austin
fkaj,[email protected]

Abstract

The success of pretrained transformer language models (LMs) in natural language processing has led to a wide range of pretraining setups. In particular, these models employ a variety of subword tokenization methods, most notably byte-pair encoding (BPE) (Sennrich et al., 2016; Gage, 1994), the WordPiece method (Schuster and Nakajima, 2012), and unigram language modeling (Kudo, 2018), to segment text. However, to the best of our knowledge, the literature does not contain a direct evaluation of the impact of tokenization on language model pretraining. We analyze differences between BPE and unigram LM tokenization, finding that the latter method recovers subword units that align more closely with morphology and avoids problems stemming from BPE's greedy construction procedure. We then compare the fine-tuned task performance of identical transformer masked language models pretrained with these tokenizations. Across downstream tasks and two languages (English and Japanese), we find that the unigram LM tokenization method matches or outperforms BPE. We hope that developers of future pretrained LMs will consider adopting the unigram LM method over the more prevalent BPE.

1 Introduction

Large transformers (Vaswani et al., 2017) pretrained with variants of a language modeling objective, such as BERT (Devlin et al., 2019), have proven their effectiveness at flexibly transferring to a variety of domains and tasks. One design decision that makes them particularly adaptable is their graceful handling of the open vocabulary problem through subword tokenization. Subword tokenization, popularized in the neural machine translation literature (Sennrich et al., 2016; Vaswani et al., 2017; Wu et al., 2016), produces tokens at multiple levels of granularity, from individual characters to full words. As a result, rare words are broken down into a collection of subword units, bottoming out in characters in the worst case.

Critically, a pretrained language model's subword vocabulary cannot be altered: any downstream application of these models must tokenize input or generate output using the original subword vocabulary, making the choice of tokenization a particularly significant decision.

A variety of subword tokenization methods have seen use in pretrained language models. BERT uses the WordPiece method (Schuster and Nakajima, 2012), a language-modeling based variant of BPE; T5 (Raffel et al., 2019) uses character-level BPE; GPT2 (Radford et al., 2019) and RoBERTa (Liu et al., 2019) use BPE over raw bytes instead of unicode characters; XLNet (Yang et al., 2019) and ALBERT (Lan et al., 2019) use the SentencePiece library (Kudo and Richardson, 2018) which implements both BPE and unigram language model tokenization, but in both cases fail to clarify which of these methods they chose. The effects of tokenization are not examined in a reported experiment in any of the above works except Liu et al. (2019), who note that WordPiece gave a small advantage over BPE in their preliminary investigation. In the machine translation literature, Kudo (2018) introduced the unigram language model tokenization method in the context of machine translation and found it comparable in performance to BPE. Domingo et al. (2018) performed further experiments to investigate the effects of tokenization on neural machine translation, but used a shared BPE vocabulary across all experiments. Gallé (2019) examined algorithms in the BPE family, but did not compare to unigram language modeling.

In this work, we characterize the space of proposed subword tokenization algorithms and analyze the differences between the two methods with publicly available implementations: BPE (merging tokens based on bigram frequency) and unigram language modeling (pruning tokens based on unigram LM perplexity). While the vocabularies resulting from these schemes are heavily overlapping, we compare each method to reference morphological segmentations and find that the unigram LM method produces tokens better aligned with morphology. To understand whether this more natural tokenization leads to improved performance, we pretrain separate language models using the RoBERTa objective (Liu et al., 2019) with each tokenization for both English and Japanese, two typologically distant languages. On downstream tasks, we find a performance gap across tasks and languages, with the unigram LM method providing an improvement over BPE of up to 10% in our Japanese QA experiments, indicating the benefits of adopting this technique in the context of language model pretraining.

Original          BPE                    Unigram LM
furiously         ▁fur iously            ▁fur ious ly
tricycles         ▁t ric y cles          ▁tri cycle s
nanotechnology    ▁n an ote chn ology    ▁nano technology
corrupted         ▁cor rupted            ▁corrupt ed
1848 and 1852,    ▁184 8 ▁and ▁185 2,    ▁1848 ▁and ▁1852 ,

Original:     Completely preposterous suggestions
BPE:          ▁Comple t ely ▁prep ost erous ▁suggest ions
Unigram LM:   ▁Complete ly ▁pre post er ous ▁suggestion s

Original:     磁性は様々に分類がなされている。
BPE:          磁 性は 様々 に分類 がなされている。
Unigram LM:   磁 性 は 様々 に 分類 がなされている。
Gloss:        magnetism (top.) various ways in classification is done .
Translation:  Magnetism is classified in various ways.

Figure 1: Example tokenizations. The character '▁' is a word boundary marker. BPE merges common tokens, such as English inflectional suffixes and Japanese particles, into their neighbors even when the resulting unit is not semantically meaningful.

[Figure 2 plots, series BPE and Unigram LM: (a) Token length distributions within each vocabulary (number of tokens vs. token length); (b) Token frequency profiles over the corpus (token frequency vs. token frequency rank)]

Figure 2: English subword vocabulary and corpus profiles. The unigram LM method produces longer tokens on average (a) and uses its vocabulary space more effectively (b), with more tokens of moderate frequency.

2 Algorithms

Subword tokenization algorithms consist of two components: a vocabulary construction procedure, which takes a corpus of text and returns a vocabulary with the desired size, and a tokenization procedure, which takes the built vocabulary and applies it to new text, returning a sequence of tokens. In theory, these two steps can be independent, although for the algorithms we examine the tokenization procedure is tightly coupled to the vocabulary construction procedure.

A BPE vocabulary is constructed as follows:

Algorithm 1 Byte-pair encoding (Sennrich et al., 2016; Gage, 1994)
Input: set of strings D, target vocab size k
procedure BPE(D, k)
    V ← all unique characters in D (about 4,000 in English Wikipedia)
    while |V| < k do                          ▷ Merge tokens
        tL, tR ← Most frequent bigram in D
        tNEW ← tL + tR                        ▷ Make new token
        V ← V + [tNEW]
        Replace each occurrence of tL, tR in D with tNEW
    end while
    return V
end procedure

BPE tokenization takes the vocabulary V containing ordered merges and applies them to new text in the same order as they occurred during vocabulary construction.

The WordPiece algorithm (Schuster and Nakajima, 2012), used to construct BERT's vocabulary, closely resembles BPE. However, instead of merging the most frequent token bigram, each potential merge is scored based on the likelihood of an n-gram language model trained on a version of the corpus incorporating that merge. Schuster and Nakajima (2012) note that the process of estimating language model parameters for every potential merge is prohibitive, so they employ aggressive heuristics to reduce the number of potential merges considered. As their implementation is not public,[1] we are unable to make a comparison to this method.

[1] Although its name and association with Google might suggest otherwise, the SentencePiece library (Kudo and Richardson, 2018) does not, in fact, implement the WordPiece algorithm; it provides implementations of BPE and unigram LM based tokenization.

The unigram LM method (Kudo, 2018), in contrast to the bottom-up construction process of BPE and WordPiece, begins with a superset of the final vocabulary, pruning it to the desired size:

Algorithm 2 Unigram LM (Kudo, 2018)
Input: set of strings D, target vocab size k
procedure UNIGRAMLM(D, k)
    V ← all substrings occurring more than once in D (not crossing words)
    while |V| > k do                          ▷ Prune tokens
        Fit unigram LM θ to D
        for t ∈ V do                          ▷ Estimate token 'loss'
            L_t ← p_θ(D) − p_θ'(D)
                where θ' is the LM without token t
        end for
        Remove min(|V| − k, ⌊α|V|⌋) of the tokens t with highest L_t from V,
            where α ∈ [0, 1] is a hyperparameter
    end while
    Fit final unigram LM θ to D
    return V, θ
end procedure

Unigram LM tokenization takes the vocabulary V and unigram LM parameters θ and performs Viterbi inference to decode the segmentation with maximum likelihood under θ. This method is similar to Morfessor's unsupervised segmentation (Creutz and Lagus, 2005) without its informed prior over token length.
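The Viterbi decoding step used by unigram LM tokenization can be illustrated with a minimal sketch. This is not the SentencePiece implementation: the function name `viterbi_segment`, the toy vocabulary, and its scores are all hypothetical, and the scores are unnormalized illustration values rather than a fitted unigram LM.

```python
import math

def viterbi_segment(word, log_probs):
    """Return (tokens, score): the highest-scoring segmentation of `word`
    into vocabulary tokens, where the score is the sum of token log-probs."""
    n = len(word)
    max_len = max(len(token) for token in log_probs)
    best = [-math.inf] * (n + 1)  # best[i]: best score for the prefix word[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last token in that path
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = word[start:end]
            if piece in log_probs and best[start] + log_probs[piece] > best[end]:
                best[end] = best[start] + log_probs[piece]
                back[end] = start
    if best[n] == -math.inf:
        raise ValueError("word cannot be segmented with this vocabulary")
    # Follow the back-pointers from the end of the word to recover the tokens.
    tokens = []
    i = n
    while i > 0:
        tokens.append(word[back[i]:i])
        i = back[i]
    return tokens[::-1], best[n]

# Hypothetical toy vocabulary; scores are made up for illustration only.
toy_scores = {"fur": 0.10, "ious": 0.05, "ly": 0.10, "iously": 0.001}
toy_scores.update({ch: 0.01 for ch in "furiosly"})
toy_log_probs = {token: math.log(p) for token, p in toy_scores.items()}

tokens, score = viterbi_segment("furiously", toy_log_probs)
print(tokens)
```

With these toy scores the best path splits the word into fur / ious / ly, because the product of the three token scores exceeds that of fur / iously. Including every single character in the vocabulary guarantees that some segmentation always exists.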
In the course of our experiments we did not observe a major difference in speed between the two algorithms. Both require similar amounts of time to construct a vocabulary, and both have a negligible impact on overall model inference latency.

            More frequent in
BPE              Unigram LM
H L M T B        s . , ed d
P C K D R        ing e ly t a

Table 1: Tokens with the highest difference in frequency between tokenizations.
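The BPE procedure of Algorithm 1, together with the tokenization step that replays merges in order, can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the corpus is given as a hypothetical word-count dictionary, no word boundary marker is used, merges never cross words, and ties between equally frequent bigrams are broken by first occurrence. The names `merge_pair`, `train_bpe`, `apply_bpe`, and `toy_counts` are illustrative.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace each adjacent occurrence of `pair` in `symbols` with the merged token."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def train_bpe(word_counts, k):
    """Grow a vocabulary to size k by repeatedly merging the most frequent bigram."""
    corpus = {tuple(word): count for word, count in word_counts.items()}
    vocab = sorted({ch for word in word_counts for ch in word})  # unique characters
    merges = []
    while len(vocab) < k:
        pair_counts = Counter()
        for symbols, count in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += count
        if not pair_counts:  # every word is already a single token
            break
        best = max(pair_counts, key=pair_counts.get)  # ties: first seen wins
        merges.append(best)
        vocab.append(best[0] + best[1])
        corpus = {merge_pair(s, best): c for s, c in corpus.items()}
    return vocab, merges

def apply_bpe(word, merges):
    """Tokenize a new word by replaying the learned merges in training order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)

# Hypothetical toy corpus: 10 unique characters, so k=14 allows four merges.
toy_counts = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
vocab, merges = train_bpe(toy_counts, k=14)
print(merges)
print(apply_bpe("lowest", merges))
```

On this toy corpus the learned merges are e+s, es+t, l+o, and lo+w, so the unseen word "lowest" is tokenized as low / est. Because each merge is chosen greedily from current bigram counts, earlier merges constrain which tokens can form later.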
