The Importance of Generation Order in Language Modeling
Nicolas Ford*  Daniel Duckworth  Mohammad Norouzi  George E. Dahl
Google Brain
{nicf,duckworthd,mnorouzi,gdahl}@google.com

* Work done as a member of the Google AI Residency program (g.co/airesidency).

Abstract

Neural language models are a critical component of state-of-the-art systems for machine translation, summarization, audio transcription, and other tasks. These language models are almost universally autoregressive in nature, generating sentences one token at a time from left to right. This paper studies the influence of token generation order on model quality via a novel two-pass language model that produces partially-filled sentence “templates” and then fills in missing tokens. We compare various strategies for structuring these two passes and observe a surprisingly large variation in model quality. We find the most effective strategy generates function words in the first pass followed by content words in the second. We believe these experimental results justify a more extensive investigation of generation order for neural language models.

1 Introduction

Neural networks have been extremely successful statistical models of text in language modeling and machine translation. Despite differences in model architectures, state-of-the-art neural nets generate sequences from left to right (Vaswani et al., 2017; Jozefowicz et al., 2016; Wu et al., 2016). Although in some sense humans produce and consume language from left to right as well, there are many other intuitively appealing ways to generate text. For instance, language is slow enough on a neurological time scale for multiple passes of generation that incorporate feedback to occur. Linguistic intuition might suggest that we should first generate some abstract representation of what we want to say and then serialize it, a process that seems more universally appropriate given the existence of languages with freer word order such as Czech and Polish.

There has been interest in moving beyond the left-to-right generation order by developing alternative multi-stage strategies such as syntax-aware neural language models (Bowman et al., 2016) and latent variable models of text (Wood et al., 2011). Before embarking on a long-term research program to find better generation strategies that improve modern neural networks, one needs evidence that the generation strategy can make a large difference. This paper presents one way of isolating the generation strategy from the general neural network design problem. Our key technical contribution is a flexible and tractable architecture that incorporates different generation orders while enabling exact computation of the log-probabilities of a sentence. Our experiments demonstrate that even when using a few simple two-pass generation orders, the differences between good and bad orderings are substantial.

We consider ways of reordering the tokens within a sequence based on their identities. The best ordering we tried generates function words first and content words last, which cuts against the idea of committing to the general topic of a sentence first and only then deciding exactly how to phrase it. We offer some possible explanations in Section 3, and we conclude that our experimental results justify a more extensive investigation of the generation order for language and translation models.

2 Two-pass Language Models

We develop a family of two-pass language models that depend on a partitioning of the vocabulary into a set of first-pass and second-pass tokens to generate sentences. We perform a preprocessing step on each sequence y, creating two new sequences y(1) and y(2). The sequence y(1), which we call the template, has the same length as y, and consists of the first-pass tokens from y together with a special placeholder token wherever y had a second-pass token. The sequence y(2) has length equal to the number of these placeholders, and consists of the second-pass tokens from y in order.
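As a concrete illustration of this preprocessing step, the sketch below builds y(1) and y(2) from a tokenized sentence. It is our own illustration, not reference code from the paper; the function name, the placeholder string, and the toy vocabulary are hypothetical.

PLACEHOLDER = "<placeholder>"  # stand-in name for the special placeholder token

def split_sequence(y, first_pass_vocab):
    """Build the template y1 and the second-pass sequence y2 from sentence y.

    y1 keeps every first-pass token of y and replaces each second-pass
    token with the placeholder, so len(y1) == len(y); y2 collects the
    second-pass tokens in their original order.
    """
    y1, y2 = [], []
    for token in y:
        if token in first_pass_vocab:
            y1.append(token)
        else:
            y1.append(PLACEHOLDER)
            y2.append(token)
    return y1, y2

# Toy example with a "function words first"-style split.
sentence = "the cat sat on the mat".split()
first_pass_vocab = {"the", "on"}
template, second_pass = split_sequence(sentence, first_pass_vocab)
# template     == ["the", "<placeholder>", "<placeholder>", "on", "the", "<placeholder>"]
# second_pass  == ["cat", "sat", "mat"]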
[Table 1: Some example sentences from the dataset and their corresponding templates under the five vocabulary splits (common first, rare first, function first, content first, odd first). The placeholder token is indicated by " ".]

We use a neural language model p1 to generate y(1), and then a conditional translation model p2 to generate y(2) given y(1). Note that, since the division of the vocabulary into first- and second-pass tokens is decided in advance, there is a one-to-one correspondence between sequences y and pairs (y(1), y(2)). The total probability of y is then

    p(y) = p1(y(1)) p2(y(2) | y(1)).    (1)

Two-pass language models present a unique opportunity to study the importance of generation order because, since the template is a deterministic function of y, the probability of y can be computed exactly. This is in contrast to a language model using a latent generation order, which requires a prohibitive marginalization over permutations to compute the exact probabilities. Given the tractable nature of the model, exact learning based on log-likelihood is possible, and we can compare different vocabulary partitioning strategies both against each other and against a single-pass language model.

Our implementation consists of two copies of the Transformer model from Vaswani et al. (2017). The first copy just generates the template, so it has no encoder. The second copy is a sequence-to-sequence model that translates the template into the complete sentence. There are three places in this model where word embeddings appear (the first-phase decoder, the second-phase encoder, and the second-phase decoder), and all three sets of parameters are shared. The output layer also shares the embedding parameters.[1]

[1] This behavior is enabled in the publicly available implementation of Transformer using the hyperparameter called shared_embedding_and_softmax_weights.

For the second pass, we include the entire target sentence, not just the second-pass tokens, on the output side. In this way, when generating a token, the decoder is allowed to examine all tokens to the left of its position. However, only the second-pass tokens count toward the loss, since in the other positions the correct token is already known. Our loss function is then the sum of all of these numbers (from both copies) divided by the length of the original sentence, which is the log-perplexity that our model assigns to the sentence.
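The sketch below shows how this per-sentence loss can be assembled from the two passes. It is our own illustration of the description above (the variable names are hypothetical), assuming the per-token log-probabilities of the correct tokens have already been computed by the two model copies.

def sentence_log_perplexity(logp_pass1, logp_pass2, is_second_pass):
    """Combine the two passes into the per-sentence loss described above.

    logp_pass1[i]    : log-probability assigned by the first pass to the
                       i-th template token (the template has len(y) tokens).
    logp_pass2[i]    : log-probability assigned by the second pass to the
                       i-th token of the full sentence y.
    is_second_pass[i]: True where y[i] is a second-pass token; only these
                       positions contribute to the second-pass loss, since
                       elsewhere the correct token is already in the template.
    """
    total_log_prob = sum(logp_pass1)
    total_log_prob += sum(
        lp for lp, second in zip(logp_pass2, is_second_pass) if second
    )
    # Equation (1): log p(y) = log p1(y(1)) + log p2(y(2) | y(1));
    # normalizing by the original sentence length gives the log-perplexity.
    return -total_log_prob / len(logp_pass1)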
Note that, since the shares the embedding parameters.1 division of the vocabulary into first- and second- For the second pass, we include the entire target pass tokens is decided in advance, there is a one- sentence, not just the second-pass tokens, on the to-one correspondence between sequences y and output side. In this way, when generating a token, pairs (y(1); y(2)). The total probability of y is then the decoder is allowed to examine all tokens to the left of its position. However, only the second-pass (1) (2) (1) p(y) = p1(y ) p2(y j y ) : (1) tokens count toward the loss, since in the other po- sitions the correct token is already known. Our Two-pass language models present a unique op- loss function is then the sum of all of these num- portunity to study the importance of generation or- bers (from both copies) divided by the length of der because, since the template is a deterministic the original sentence, which is the log-perplexity function of y, the probability of y can be com- that our model assigns to the sentence. puted exactly. This is in contrast to a language We tried five different ways of splitting the vo- model using a latent generation order, which re- cabulary: quires a prohibitive marginalization over permu- Common First and Rare First: The vocabu- tations to compute the exact probabilities. Given lary was sorted by frequency and then a cutoff was the tractable nature of the model, exact learning chosen, splitting the vocabulary into “common” based on log-likelihood is possible, and we can and “rare” tokens. The location of the cutoff2 was compare different vocabulary partitioning strate- chosen so that the number of common tokens and gies both against each other and against a single- the number of rare tokens in the average sentence pass language model. were approximately the same. In “common first” Our implementation consists of two copies of 1 the Transformer model from Vaswani et al.(2017). This behavior is enabled in the publicly available im- plementation of Transformer using the hyperparameter called The first copy just generates the template, so it has shared embedding and softmax weights. no encoder. The second copy is a sequence-to- 2In our experiments on LM1B, this is at index 78. 2943 we place the common tokens in the first pass, and size of 2048, and 8 attention heads, except that the in “rare first” we start with the rare tokens. enhanced baseline used a hidden size of 704. We Function First and Content First: We parsed used a batch size of 4096. All models were trained about 1% of LM1B’s training set using Parsey Mc- using ADAM (Kingma and Ba, 2014), with β1 = −6 Parseface (Andor et al., 2016) and assigned each 0:85, β2 = 0:997, and = 10 .