The Importance of Generation Order in Language Modeling

Nicolas Ford∗   Daniel Duckworth   Mohammad Norouzi   George E. Dahl
Google Brain
{nicf,duckworthd,mnorouzi,gdahl}@google.com

∗Work done as a member of the Google AI Residency program (g.co/airesidency).

Abstract

Neural language models are a critical component of state-of-the-art systems for machine translation, summarization, audio transcription, and other tasks. These language models are almost universally autoregressive in nature, generating sentences one token at a time from left to right. This paper studies the influence of token generation order on model quality via a novel two-pass language model that produces partially-filled sentence "templates" and then fills in missing tokens. We compare various strategies for structuring these two passes and observe a surprisingly large variation in model quality. We find the most effective strategy generates function words in the first pass followed by content words in the second. We believe these experimental results justify a more extensive investigation of generation order for neural language models.

1 Introduction

Neural networks have been extremely successful statistical models of text in language modeling and machine translation. Despite differences in model architectures, state-of-the-art neural nets generate sequences from left to right (Vaswani et al., 2017; Jozefowicz et al., 2016; Wu et al., 2016). Although in some sense humans produce and consume language from left to right as well, there are many other intuitively appealing ways to generate text. For instance, language is slow enough on a neurological time scale for multiple passes of generation that incorporate feedback to occur. Linguistic intuition might suggest that we should first generate some abstract representation of what we want to say and then serialize it, a process that seems more universally appropriate given the existence of languages with freer word order such as Czech and Polish.

There has been interest in moving beyond the left-to-right generation order by developing alternative multi-stage strategies such as syntax-aware neural language models (Bowman et al., 2016) and latent variable models of text (Wood et al., 2011). Before embarking on a long-term research program to find better generation strategies that improve modern neural networks, one needs evidence that the generation strategy can make a large difference. This paper presents one way of isolating the generation strategy from the general neural network design problem. Our key technical contribution involves developing a flexible and tractable architecture that incorporates different generation orders, while enabling exact computation of the log-probabilities of a sentence. Our experiments demonstrate that even when using a few simple two-pass generation orders, the differences between good and bad orderings are substantial.

We consider ways of reordering the tokens within a sequence based on their identities. The best ordering we tried generates function words first and content words last, which cuts against the idea of committing to the general topic of a sentence first and only then deciding exactly how to phrase it. We offer some possible explanations in Section 3, and we conclude that our experimental results justify a more extensive investigation of the generation order for language and translation models.

2 Two-pass Language Models

We develop a family of two-pass language models that depend on a partitioning of the vocabulary into a set of first-pass and second-pass tokens to generate sentences. We perform a preprocessing step on each sequence y, creating two new sequences y(1) and y(2). The sequence y(1), which we call the template, has the same length as y, and consists of the first-pass tokens from y together with a special placeholder token wherever y had a second-pass token.

[Table 1 here. Columns: sentence, common first, rare first, function first, content first, odd first.]

Table 1: Some example sentences from the dataset and their corresponding templates. The placeholder token is indicated by " ".

The sequence y(2) has length equal to the number of these placeholders, and consists of the second-pass tokens from y in order.

We use a neural language model p1 to generate y(1), and then a conditional translation model p2 to generate y(2) given y(1). Note that, since the division of the vocabulary into first- and second-pass tokens is decided in advance, there is a one-to-one correspondence between sequences y and pairs (y(1), y(2)). The total probability of y is then

p(y) = p1(y(1)) p2(y(2) | y(1)).   (1)

Two-pass language models present a unique opportunity to study the importance of generation order because, since the template is a deterministic function of y, the probability of y can be computed exactly. This is in contrast to a language model using a latent generation order, which requires a prohibitive marginalization over permutations to compute the exact probabilities. Given the tractable nature of the model, exact learning based on log-likelihood is possible, and we can compare different vocabulary partitioning strategies both against each other and against a single-pass language model.

Our implementation consists of two copies of the Transformer model from Vaswani et al. (2017). The first copy just generates the template, so it has no encoder. The second copy is a sequence-to-sequence model that translates the template into the complete sentence. There are three places in this model where word embeddings appear (the first-phase decoder, the second-phase encoder, and the second-phase decoder), and all three sets of parameters are shared. The output layer also shares the embedding parameters.¹

¹This behavior is enabled in the publicly available implementation of Transformer using the hyperparameter called shared_embedding_and_softmax_weights.

For the second pass, we include the entire target sentence, not just the second-pass tokens, on the output side. In this way, when generating a token, the decoder is allowed to examine all tokens to the left of its position. However, only the second-pass tokens count toward the loss, since in the other positions the correct token is already known. Our loss function is then the sum of these per-token losses (from both copies) divided by the length of the original sentence, which is the log-perplexity that our model assigns to the sentence.
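To make the preprocessing and the factorization in Equation (1) concrete, the sketch below shows one way the template y(1), the second-pass sequence y(2), and the per-sentence loss could be computed. It is a minimal illustration under stated assumptions, not the paper's implementation: the placeholder symbol, the model objects p1 and p2, and their log_prob interfaces are hypothetical.

```python
from typing import List, Set, Tuple

PLACEHOLDER = "<placeholder>"  # assumed surface form of the special placeholder token


def make_template(y: List[str], first_pass_vocab: Set[str]) -> Tuple[List[str], List[str]]:
    """Preprocess a tokenized sentence y into (y1, y2).

    y1 (the template) has the same length as y: first-pass tokens are kept and
    every second-pass token is replaced by the placeholder. y2 collects the
    second-pass tokens in their original order.
    """
    y1: List[str] = []
    y2: List[str] = []
    for tok in y:
        if tok in first_pass_vocab:
            y1.append(tok)
        else:
            y1.append(PLACEHOLDER)
            y2.append(tok)
    return y1, y2


def sentence_log_prob(y: List[str], first_pass_vocab: Set[str], p1, p2) -> float:
    """Equation (1): log p(y) = log p1(y1) + log p2(y2 | y1).

    p1 is the template language model and p2 the conditional model that fills
    in the placeholders; both are assumed to expose a log_prob method.
    """
    y1, y2 = make_template(y, first_pass_vocab)
    return p1.log_prob(y1) + p2.log_prob(y2, template=y1)


def sentence_loss(y: List[str], first_pass_vocab: Set[str], p1, p2) -> float:
    """Per-sentence loss: the total negative log-probability from both passes
    divided by the length of the original sentence (its log-perplexity)."""
    return -sentence_log_prob(y, first_pass_vocab, p1, p2) / len(y)
```

A first-pass vocabulary produced by any of the splits described next can be plugged in directly as first_pass_vocab.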

We tried five different ways of splitting the vocabulary:

Common First and Rare First: The vocabulary was sorted by frequency and then a cutoff was chosen, splitting the vocabulary into "common" and "rare" tokens. The location of the cutoff² was chosen so that the number of common tokens and the number of rare tokens in the average sentence were approximately the same. In "common first" we place the common tokens in the first pass, and in "rare first" we start with the rare tokens.

²In our experiments on LM1B, this is at index 78.

Function First and Content First: We parsed about 1% of LM1B's training set using Parsey McParseface (Andor et al., 2016) and assigned each token in the vocabulary to the grammatical role it was assigned most frequently by the parser. We used this data to divide the vocabulary into "function" words and "content" words; punctuation, adpositions, conjunctions, determiners, pronouns, particles, modal verbs, "wh-adverbs" (Penn part-of-speech tag WRB), and conjugations of "be" were chosen to be function words. In "function first" we place the function words in the first pass and in "content first" we start with the content words.

Odd First: As a control, we also used a linguistically meaningless split in which tokens at an odd index in the frequency-sorted vocabulary list were assigned to the first pass and tokens at an even index were assigned to the second pass.

A few sentences from the dataset are shown in Table 1 together with their templates. Note that the common and function token sets are very similar; the main differences are the "unknown" token, conjugations of "have," and some prepositions.
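The following sketch illustrates how such vocabulary partitions could be derived. It assumes precomputed token frequencies and per-token part-of-speech counts (the paper obtains the latter from Parsey McParseface on roughly 1% of LM1B); the specific tag names and word lists below are illustrative assumptions, not the paper's exact choices.

```python
from typing import Dict, List, Set, Tuple

# Part-of-speech categories treated as "function" roles, following the list in
# the text (punctuation, adpositions, conjunctions, determiners, pronouns,
# particles, modal verbs, and wh-adverbs); the tag names here are assumptions.
FUNCTION_TAGS = {"PUNCT", "ADP", "CONJ", "DET", "PRON", "PRT", "MD", "WRB"}
BE_FORMS = {"be", "is", "am", "are", "was", "were", "been", "being"}  # conjugations of "be"


def common_rare_split(vocab_by_freq: List[str],
                      avg_count_per_sentence: Dict[str, float]) -> Tuple[Set[str], Set[str]]:
    """Pick a frequency cutoff so that the average sentence contains roughly as
    many 'common' tokens as 'rare' tokens (per the footnote, the paper's cutoff
    on LM1B lands at index 78)."""
    total = sum(avg_count_per_sentence.get(tok, 0.0) for tok in vocab_by_freq)
    running = 0.0
    cutoff = len(vocab_by_freq)
    for i, tok in enumerate(vocab_by_freq):  # sorted from most to least frequent
        running += avg_count_per_sentence.get(tok, 0.0)
        if running >= total / 2:
            cutoff = i + 1
            break
    return set(vocab_by_freq[:cutoff]), set(vocab_by_freq[cutoff:])


def function_content_split(most_frequent_tag: Dict[str, str]) -> Tuple[Set[str], Set[str]]:
    """Assign each token to the grammatical role the parser gave it most often."""
    function = {tok for tok, tag in most_frequent_tag.items()
                if tag in FUNCTION_TAGS or tok in BE_FORMS}
    return function, set(most_frequent_tag) - function


def odd_even_split(vocab_by_freq: List[str]) -> Tuple[Set[str], Set[str]]:
    """Control split: odd indices in the frequency-sorted list go to the first
    pass, even indices to the second."""
    first = {tok for i, tok in enumerate(vocab_by_freq) if i % 2 == 1}
    return first, set(vocab_by_freq) - first
```

For each pair of strategies, one of the two returned sets serves as the first-pass vocabulary ("common first" uses the common set, "rare first" the rare set, and so on), with all remaining tokens generated in the second pass.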
3 Experimental Results and Discussion

We ran experiments with several different ways of splitting the vocabulary into first-pass and second-pass tokens. We trained all of these models on the One Billion Word Language Modeling benchmark (LM1B) dataset (Chelba et al., 2013). One sixth of the training data was used as a validation set. We used a vocabulary of size 65,536 consisting of whole words (rather than word pieces) converted to lower-case.

We compared the two-pass generation strategies to a baseline version of Transformer without an encoder, which was trained to unconditionally predict the target sentences in the ordinary way. Because the two-pass models contain slightly more trainable parameters than this baseline, we also compare to an "enhanced baseline" in which the size of Transformer's hidden space was increased to make the number of parameters match the two-pass models.

Both the two-pass models and the baselines used the hyperparameters referred to as base in the publicly available implementation of Transformer,³ which has a hidden size of 512, a filter size of 2048, and 8 attention heads, except that the enhanced baseline used a hidden size of 704. We used a batch size of 4096. All models were trained using ADAM (Kingma and Ba, 2014), with β₁ = 0.85, β₂ = 0.997, and ε = 10⁻⁶. The learning rate was tuned by hand separately for each experiment and the experiments that produced the best results on the validation set are reported. Dropout was disabled after some initial experimentation found it to be detrimental to the final validation loss.

³github.com/tensorflow/tensor2tensor
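For convenience, the training settings reported above can be gathered in one place. This is only a restatement of the numbers in the text; the dictionary keys are descriptive labels, not actual tensor2tensor flag names.

```python
# Hyperparameters reported in the text (the "base" Transformer configuration,
# with the exceptions noted); the learning rate was tuned by hand per experiment.
TRAINING_CONFIG = {
    "hidden_size": 512,              # 704 for the enhanced baseline
    "filter_size": 2048,
    "num_attention_heads": 8,
    "batch_size": 4096,
    "optimizer": "Adam",
    "adam_beta1": 0.85,
    "adam_beta2": 0.997,
    "adam_epsilon": 1e-6,
    "dropout": 0.0,                  # disabled; it hurt the final validation loss
    "vocab_size": 65536,             # whole words, lower-cased
}
```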

Model               Train    Validation   Test
odd first           39.925   45.377       45.196
rare first          38.283   43.293       43.077
content first       38.321   42.564       42.394
common first        36.525   41.018       40.895
function first      36.126   40.246       40.085
baseline            38.668   41.888       41.721
enhanced baseline   35.945   39.845       39.726

Table 2: The perplexities achieved by the best version of each of our models.

Table 2 shows the results for all the two-pass generation strategies we tried as well as the baselines, sorted from worst to best on the validation set. Strikingly, the linguistically meaningless odd first generation strategy that splits words arbitrarily between the two phases is far worse than the baseline, showing that the two-pass setup on its own provides no inherent advantage over a single phase. The common first and closely related function first strategies perform the best of all the two-pass strategies, whereas the rare first and closely related content first strategies are much worse. Since the control, rare first, and content first orderings are all worse than the baseline, the gains seen by the other two orderings cannot be explained by the increase in the number of trainable parameters alone.

The enhanced version of the baseline achieved slightly better perplexity than the best of the two-pass models we trained. Given that state-of-the-art results with Transformer require models larger than the ones we trained, we should expect growing the embedding and hidden size to produce large benefits. However, the two-pass model we proposed in this work is primarily a tool to understand the importance of sequence generation order and was not designed to be parameter efficient. Thus, as these results indicate, increasing the embedding size in Transformer is a more effective use of trainable parameters than having extra copies of the other model parameters for the second pass (recall that the embeddings are shared across both passes).

One potential explanation for why the function first split performed the best is that, in order to generate a sentence, it is easier to first decide something about its syntactic structure. If this is the primary explanation for the observed results, then common first's success can be attributed to how many function words are also common. However, an alternative explanation might simply be that it is preferable to delay committing to a rare token for as long as possible, since all subsequent decisions will then be conditioning on a low-probability event. This is particularly problematic in language modeling, where datasets are too small to cover the space of all utterances. We lack sufficient evidence to decide between these hypotheses and believe further investigation is necessary.

Ultimately, our results show that content-dependent generation orders can have a surprisingly large effect on model quality. Moreover, the gaps between different generation strategies can be quite large.

4 Related Work

For tasks conditioning on sequences and sets, it is well known that order significantly affects model quality in applications such as machine translation (Sutskever et al., 2014), program synthesis (Vinyals et al., 2016), and text classification (Yogatama et al., 2016). Experimentally, Khandelwal et al. (2018) show that recurrent neural networks have a memory that degrades with time. Techniques such as attention (Bahdanau et al., 2014) can be seen as augmenting that memory.

Text generation via neural networks, as in language models and machine translation, proceeds almost universally left-to-right (Jozefowicz et al., 2016; Sutskever et al., 2014). This is in stark contrast to phrase-based machine translation systems (Charniak et al., 2003), which traditionally split token translation and "editing" (typically via reordering) into separate stages. This line of work is carried forward in Post-Editing Models (Junczys-Dowmunt and Grundkiewicz, 2016), Deliberation Networks (Xia et al., 2017), and Review Networks (Yang et al., 2016), which produce a "draft" decoding that is further edited. As any valid sequence may be used in a draft, calculating perplexity in these models is unfortunately intractable, and model quality can only be evaluated via external tasks.

In addition to surface-form intermediate representations, syntax-based representations have a rich history in text modeling. Chelba and Jelinek (1998), Yamada and Knight (2001), Graham and Genabith (2010), and Shen et al. (2018) integrate parse structures, explicitly designed or automatically learned, into the decoding process.

Similar to the second stage of our proposed model, Fedus et al. (2018) directly tackle the problem of filling in the blank. The Multi-Scale version of PixelRNN (Van Oord et al., 2016) was also an inspiration for the two-pass setup we used here.

5 Conclusion and Future Work

To investigate the question of generation order in language modeling, we proposed a model that generates a sentence in two passes, first generating tokens from left to right while skipping over some positions and then filling in the positions that it skipped. We found that the decision of which tokens to place in the first pass had a strong effect on model quality.

Given the success of our function-word-first generation procedure, we could imagine taking this idea beyond splitting the vocabulary. One could run a parser on each sentence and use the resulting tree to decide on the generation order. Such a scheme might shed light on which aspect of this split was most helpful. Finally, filling in a template with missing words is a task that might be interesting in its own right. One might want to provide partial information about the target sentence as part of scripting flexible responses for a dialogue agent, question answering system, or other system that mixes a hand-designed grammar with learned responses.

References

Daniel Andor, Chris Alberti, David Weiss, Aliaksei Severyn, Alessandro Presta, Kuzman Ganchev, Slav Petrov, and Michael Collins. 2016. Globally normalized transition-based neural networks. CoRR, abs/1603.06042.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

Samuel R. Bowman, Jon Gauthier, Abhinav Rastogi, Raghav Gupta, Christopher D. Manning, and Christopher Potts. 2016. A fast unified model for parsing and sentence understanding. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1466–1477.

Eugene Charniak, Kevin Knight, and Kenji Yamada. 2003. Syntax-based language models for statistical machine translation. In Proceedings of MT Summit IX, pages 40–46.

Ciprian Chelba and Frederick Jelinek. 1998. Exploiting syntactic structure for language modeling. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, Volume 1, pages 225–231. Association for Computational Linguistics.

Ciprian Chelba, Tomáš Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, and Phillipp Koehn. 2013. One billion word benchmark for measuring progress in statistical language modeling. CoRR, abs/1312.3005.

William Fedus, Ian Goodfellow, and Andrew M. Dai. 2018. MaskGAN: Better text generation via filling in the ______. In International Conference on Learning Representations (ICLR).

Yvette Graham and Josef Genabith. 2010. Deep syntax language models and statistical machine translation. In Proceedings of the 4th Workshop on Syntax and Structure in Statistical Translation, pages 118–126.

Rafal Jozefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, and Yonghui Wu. 2016. Exploring the limits of language modeling. arXiv preprint arXiv:1602.02410.

Marcin Junczys-Dowmunt and Roman Grundkiewicz. 2016. Log-linear combinations of monolingual and bilingual neural machine translation models for automatic post-editing. arXiv preprint arXiv:1605.04800.

Urvashi Khandelwal, He He, Peng Qi, and Dan Jurafsky. 2018. Sharp nearby, fuzzy far away: How neural language models use context. arXiv preprint arXiv:1805.04623.

Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. In International Conference on Learning Representations.

Yikang Shen, Zhouhan Lin, Chin-wei Huang, and Aaron Courville. 2018. Neural language modeling by jointly learning syntax and lexicon. In International Conference on Learning Representations.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems (NIPS), pages 3104–3112.

Aaron Van Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. 2016. Pixel recurrent neural networks. In International Conference on Machine Learning, pages 1747–1756.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 6000–6010.

Oriol Vinyals, Samy Bengio, and Manjunath Kudlur. 2016. Order matters: Sequence to sequence for sets. In International Conference on Learning Representations (ICLR).

Frank Wood, Jan Gasthaus, Cédric Archambeau, Lancelot James, and Yee Whye Teh. 2011. The sequence memoizer. Communications of the ACM, 54(2):91–98.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.

Yingce Xia, Fei Tian, Lijun Wu, Jianxin Lin, Tao Qin, Nenghai Yu, and Tie-Yan Liu. 2017. Deliberation networks: Sequence generation beyond one-pass decoding. In Advances in Neural Information Processing Systems, pages 1782–1792.

Kenji Yamada and Kevin Knight. 2001. A syntax-based statistical translation model. In Proceedings of the 39th Annual Meeting on Association for Computational Linguistics, pages 523–530. Association for Computational Linguistics.

Zhilin Yang, Ye Yuan, Yuexin Wu, William W. Cohen, and Ruslan R. Salakhutdinov. 2016. Review networks for caption generation. In Advances in Neural Information Processing Systems, pages 2361–2369.

Dani Yogatama, Phil Blunsom, Chris Dyer, Edward Grefenstette, and Wang Ling. 2016. Learning to compose words into sentences with reinforcement learning. arXiv preprint arXiv:1611.09100.
