Generalization in Generation: A closer look at Exposure Bias

Florian Schmidt
Department of Computer Science, ETH Zürich
[email protected]

Abstract

Exposure bias refers to the train-test discrepancy that seemingly arises when an autoregressive generative model uses only ground-truth contexts at training time but generated ones at test time. We separate the contributions of the model and the learning framework to clarify the debate on consequences and review proposed counter-measures. In this light, we argue that generalization is the underlying property to address and propose unconditional generation as its fundamental benchmark. Finally, we combine latent variable modeling with a recent formulation of exploration in reinforcement learning to obtain a rigorous handling of true and generated contexts. Results on language modeling and variational sentence auto-encoding confirm the model's generalization capability.

1 Introduction

Autoregressive models span from n-gram models to recurrent neural networks to transformers and have formed the backbone of state-of-the-art models over the last decade on virtually any generative task in Natural Language Processing. Applications include machine translation (Bahdanau et al., 2015; Vaswani et al., 2017), summarization (Rush et al., 2015; Khandelwal et al., 2019), dialogue (Serban et al., 2016) and sentence compression (Filippova et al., 2015).

The training methodology of such models is rooted in the language modeling task, which is to predict a single word given a context of previous words. It has often been criticized that this setting is not suited for multi-step generation, where – at test time – we are interested in generating words given a generated context that was potentially not seen during training. The consequences of this train-test discrepancy are summarized as exposure bias. Measures to mitigate the problem typically rely on replacing, masking or perturbing ground-truth contexts (Bengio et al., 2015; Bowman et al., 2016; Norouzi et al., 2016; Ranzato et al., 2016). Unfortunately, exposure bias has never been successfully separated from general test-time log-likelihood assessment, and minor improvements on the latter are used as the only signifier of reduced bias. Whenever explicit effects are investigated, no significant findings are made (He et al., 2019).

In this work we argue that the standard training procedure, despite all criticism, is an immediate consequence of combining autoregressive modeling and maximum-likelihood training. As such, the paramount consideration for improving test-time performance is simply regularization for better generalization. In fact, many proposed measures against exposure bias can be seen as exactly that, yet with respect to a usually implicit metric that is not maximum-likelihood.

With this in mind, we discuss regularization for conditional and unconditional generation. We note that in conditional tasks, such as translation, it is usually sufficient to regularize the mapping task – here translation – rather than the generative process itself. For unconditional generation, where tradeoffs between accuracy and coverage are key, generalization becomes much more tangible.

The debate on the right training procedure for autoregressive models has recently been amplified by the advent of latent generative models (Rezende et al., 2014; Kingma and Welling, 2013). Here, the practice of decoding with true contexts during training conflicts with the hope of obtaining a latent representation that encodes significant information about the sequence (Bowman et al., 2016). Interestingly, the ad hoc tricks to reduce the problem are similar to those proposed to address exposure bias in deterministic models.

Very recently, Tan et al. (2018) have presented a reinforcement learning formulation of exploration that allows following the intuition that an autoregressive model should not only be trained on ground-truth contexts. We combine their framework with latent variable modeling and a reward function that leverages modern word-embeddings. The result is a single learning regime for unconditional generation in a deterministic setting (language modeling) and in a latent variable setting (variational sentence autoencoding). Empirical results show that our formulation allows for better generalization than existing methods proposed to address exposure bias. Even more, we find the resulting regularization to also improve generalization under log-likelihood.

We conclude that it is worthwhile exploring reinforcement learning to elegantly extend maximum-likelihood learning where our desired notion of generalization cannot be expressed without violating the underlying principles. As a result, we hope to provide a more unified view on the training methodologies of autoregressive models and exposure bias in particular.

2 Autoregressive Modeling

Modern text generation methods are rooted in models trained on the language modeling task. In essence, a language model p is trained to predict a word given its left-side context

p(w_t | w_{1:t−1}) .   (1)

With a trained language model at hand, a simple recurrent procedure allows to generate text of arbitrary length. Starting from an initial special symbol ŵ_0, we iterate over t = 1, 2, ... and alternate between sampling ŵ_t ∼ p(w_t | ŵ_{1:t−1}) and appending ŵ_t to the context ŵ_{1:t−1}. Models of this form are called autoregressive as they condition new predictions on old predictions.
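As a concrete illustration, the following minimal sketch implements this recursive sampling procedure for a generic left-to-right model. The `step` interface and the toy bigram table are our own assumptions for the sake of a runnable example, not part of the paper.

```python
import numpy as np

def sample_sequence(step, bos_id, eos_id, max_len=50, rng=None):
    """Recursively sample w_hat_t ~ p(w_t | w_hat_{1:t-1}) and append it to the context.

    `step(context) -> probs` is any function returning a distribution over the
    vocabulary given the generated prefix (an n-gram table, an RNN, a transformer).
    """
    rng = rng or np.random.default_rng()
    context = [bos_id]
    for _ in range(max_len):
        probs = step(context)                  # p(w_t | w_hat_{1:t-1})
        w_t = rng.choice(len(probs), p=probs)  # sample, do not arg-max
        context.append(int(w_t))
        if w_t == eos_id:
            break
    return context[1:]                         # drop the start symbol

# Toy example: a hypothetical 4-word vocabulary {<s>, the, cat, </s>} with a
# bigram "model" -- only to make the function runnable.
bigram = np.array([[0.0, 0.8, 0.1, 0.1],   # after <s>
                   [0.0, 0.1, 0.7, 0.2],   # after "the"
                   [0.0, 0.2, 0.1, 0.7],   # after "cat"
                   [0.0, 0.0, 0.0, 1.0]])  # after </s>
print(sample_sequence(lambda ctx: bigram[ctx[-1]], bos_id=0, eos_id=3))
```

Replacing the sampling line with an arg-max gives the deterministic decoding strategy mentioned for conditional models in Section 2.1.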
Neural Sequence Models   Although a large corpus provides an abundance of word-context pairs to train on, the cardinality of the context space makes explicit estimates of (1) infeasible. Therefore, traditional n-gram language models rely on a truncated context and smoothing techniques to generalize well to unseen contexts. Neural language models lift the context restriction and instead use neural context representations. This can be a hidden state as found in recurrent neural networks (RNNs), i.e. an LSTM (Hochreiter and Schmidhuber, 1997) state, or a set of attention weights, as in a transformer architecture (Vaswani et al., 2017). While the considerations in this work apply to all autoregressive models, we focus on recurrent networks, which encode the context in a fixed-sized continuous representation h(w_{1:t−1}). In contrast to transformers, RNNs can be generalized easily to variational autoencoders with a single latent bottleneck (Bowman et al., 2016), a particularly interesting special case of generative models.

2.1 Evaluation and Generalization

Conditional vs. Unconditional   Conditional generation tasks, such as translation or summarization, are attractive from an application perspective. However, for the purpose of studying exposure bias, we argue that unconditional generation is the task of choice for the following reasons.

First, exposure bias addresses conditioning on past generated words, which becomes less essential when the words of a source sentence are available, in particular when attention is used.

Second, the difficulty of the underlying mapping task, say translation, is of no concern for the mechanics of generation. This casts sentence autoencoding as a less demanding, yet more economic task.

Finally, generalization of conditional models is only studied with respect to the underlying mapping and not with respect to the conditional distribution itself. A test set in translation usually does not contain a source sentence seen during training with a different target.¹ Instead, it contains unseen source-target pairs that evaluate the generalization of the mapping. Even more, at test time most conditional models resort to an arg-max decoding strategy. As a consequence, the entropy of the generative model is zero (given the source) and there is no generalization at all with respect to generation. For these reasons, we address unconditional generation and sentence auto-encoding for the rest of this work.

¹ Some datasets do provide several targets for a single source. However, those are typically only used for BLEU computation, which is the standard test metric reported.

[Figure 1: Generalization]

The big picture   Let us briefly characterize the output we should expect from a generative model with respect to generalization. Figure 1 shows an idealized two-dimensional data space of (fixed-length) sentences w ∈ V^T. We sketch the support of the unknown underlying generating distribution, the train set and the test set.² Let us look at some hypothetical examples ŵ1, ŵ2, ŵ3, ŵ4 generated from some well-trained model. Samples like ŵ1 certify that the model did not overfit to the training data, as can be certified by test log-likelihood. In contrast, the remaining samples are indistinguishable under test log-likelihood in the sense that they identically decrease the metric (assuming equal model probability), even though ŵ2, ŵ3 have non-zero probability under the true data distribution. Consequently, we cannot identify ŵ4 as a malformed example. Holtzman et al. (2019) show that neural generative models – despite their expressiveness – put significant probability on clearly unreasonable repetitive phrases, such as I dont know. I dont know. I dont know.³

Evaluation under a smoothed data distribution   The most common approach to evaluating an unconditional probabilistic generative model is training and test log-likelihood. For a latent variable model, the exact log-likelihood (2) is intractable and a lower bound must be used instead. However, at this point it should be noted that one can always estimate the log-likelihood from an empirical distribution across generated output. That is, one generates a large set of sequences S and sets p̂(w) to the normalized count of w in S. However, the variance of this estimate is impractical for all but the smallest datasets. Also, even a large test set cannot capture the flexibility and compositionality found in natural language.

With the aforementioned shortcomings of test log-likelihood in mind, it is worthwhile discussing a recently proposed evaluation technique. Fedus et al. (2018) propose to use n-gram statistics of the underlying data to assess generated output. For example, one can estimate an n-gram language model and report the perplexity of the generated data under the n-gram model. Just as BLEU and ROUGE break the sequence reward assignment problem into smaller sub-problems, n-gram language models effectively smooth the sequence likelihood assignment, which is usually done with respect to the empirical data distribution. Under this metric, some sequences such as ŵ2 which are close to sequences in the dataset at hand might receive positive probability.

This raises two questions. First, can we break sequence-level evaluation into local statistics by using modern word embeddings instead of n-grams (as BLEU does)? Second, can we incorporate these measures already during training to obtain better generative models? These considerations will be key when defining a reward function in Section 4.5.

² Here we do not discuss generalization error, the discrepancy between empirical test error and expected test error. It should also be noted that cross-validation provides another complementary technique for more robust model estimation, which we omit to keep the picture simple.
³ They report that this also holds for non-grammatical repetitive phrases, which is what we would expect for ŵ4.
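As a rough sketch of the n-gram evaluation idea above (our own illustration, not code from the paper), one can fit a smoothed bigram model on the training data and report the perplexity of model samples under it; the add-k smoothing and the helper names are assumptions.

```python
from collections import Counter, defaultdict
import math

def fit_bigram(corpus, vocab_size, k=0.1):
    """Estimate an add-k smoothed bigram model from a list of token-id sequences."""
    unigram, bigram = Counter(), defaultdict(Counter)
    for seq in corpus:
        for prev, cur in zip(seq[:-1], seq[1:]):
            unigram[prev] += 1
            bigram[prev][cur] += 1
    def logprob(prev, cur):
        return math.log((bigram[prev][cur] + k) / (unigram[prev] + k * vocab_size))
    return logprob

def perplexity(samples, logprob):
    """Perplexity of generated sequences under the smoothed n-gram model."""
    total_logp, total_tokens = 0.0, 0
    for seq in samples:
        for prev, cur in zip(seq[:-1], seq[1:]):
            total_logp += logprob(prev, cur)
            total_tokens += 1
    return math.exp(-total_logp / total_tokens)

# train_data and generated are lists of token-id lists, e.g. produced by sample_sequence above.
train_data = [[0, 1, 2, 3], [0, 1, 1, 2, 3]]
generated = [[0, 2, 1, 3]]
print(perplexity(generated, fit_bigram(train_data, vocab_size=4)))
```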
3 Teacher Forcing and Exposure Bias

A concern often expressed in the context of autoregressive models is that the recursive sampling procedure for generation presented in Section 2 is never used at training time; hence the model cannot learn to digest its own predictions. The resulting potential train-test discrepancy is referred to as exposure bias and is associated with compounding errors that arise when mistakes made early accumulate (Bengio et al., 2015; Ranzato et al., 2016; Goyal et al., 2016; Leblond et al., 2018). In this context, teacher forcing refers to the fact that – seen from the test-time perspective – ground-truth contexts are substituted for model predictions. Although formally teacher forcing and exposure bias should be seen as cause (if any) and symptom, they are often used interchangeably.

As is sometimes but rarely mentioned, the presence of the ground-truth context is simply a consequence of maximum-likelihood training and the chain rule applied to (1), as in p(w_{1:T}) = ∏_t p(w_t | w_{1:t−1}) (Goodfellow et al., 2016). As such, it is out of question whether generated contexts should be used as long as log-likelihood is the sole criterion we care about. In this work we will furthermore argue the following:

Proposition 1   Exposure bias describes a lack of generalization with respect to an – usually implicit and potentially task and domain dependent – measure other than maximum-likelihood.

The fact that we are dealing with generalization is obvious, as one can train a model – assuming sufficient capacity – under the criticized methodology to match the training distribution. Approaches that address exposure bias do not make the above notion of generalization explicit, but follow the intuition that training on other contexts than (only) ground-truth contexts should regularize the model and result in – subjectively – better results. Of course, these forms of regularization might still implement some form of log-likelihood regularization, hence improve log-likelihood generalization. Indeed, all of the following methods do report test log-likelihood improvements.
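To make the teacher-forcing setup above concrete, here is a minimal sketch (our own, in PyTorch, not the paper's implementation) of the chain-rule objective: the loss for one sequence is the sum of −log p(w_t | w_{1:t−1}) where every context is the ground truth. The model class and sizes are placeholders.

```python
import torch
import torch.nn as nn

class RNNLanguageModel(nn.Module):
    """A small GRU language model p(w_t | w_{1:t-1}); sizes are illustrative only."""
    def __init__(self, vocab_size, emb_dim=100, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens):
        hidden, _ = self.rnn(self.embed(tokens))
        return self.out(hidden)            # logits for the next word at every position

def teacher_forced_nll(model, batch):
    """Maximum-likelihood loss: contexts are always the ground-truth prefix w_{1:t-1}."""
    inputs, targets = batch[:, :-1], batch[:, 1:]
    logits = model(inputs)
    return nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)), targets.reshape(-1))

# Usage sketch with random token ids in place of real data.
model = RNNLanguageModel(vocab_size=1000)
batch = torch.randint(0, 1000, (8, 20))    # 8 sequences of length 20
loss = teacher_forced_nll(model, batch)
loss.backward()
```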
Proposed methods against exposure bias   Scheduled sampling (Bengio et al., 2015), proposed for conditional generation, randomly mixes in predictions from the model, which violates the underlying learning framework (Huszár, 2015). RAML (Norouzi et al., 2016) proposes to effectively perturb the ground-truth context according to the exponentiated payoff distribution implied by a reward function. Alternatively, adversarial approaches (Goyal et al., 2016) and learning-to-search (Leblond et al., 2018) have been proposed.

VAE Collapse   In Section 4.1 we will take a look at latent generative models. In that context, the standard maximum-likelihood approach to autoregressive models has been criticized from a second perspective that is worth mentioning. Bowman et al. (2016) show empirically that autoregressive decoders p(w|z) do not rely on the latent code z, but collapse to a language model as in (1). While some work argues that the problem is rooted in autoregressive decoders being "too powerful" (Shen et al., 2018), the proposed measures often address the autoregressive training regime rather than the models (Bowman et al., 2016) and, in fact, replace ground-truth contexts just as the above methods to mitigate exposure bias do.

In addition, a whole body of work has discussed the implications of optimizing only a bound to the log-likelihood (Alemi et al., 2017) and the implications of re-weighting the information-theoretic quantities inside the bound (Higgins et al., 2017; Rainforth et al., 2018).

4 Latent Generation with ERPO

We have discussed exposure bias and how it has been handled by either implicitly or explicitly leaving the maximum-likelihood framework. In this section, we present our reinforcement learning framework for unconditional sequence generation models. The generative story is the same as in a latent variable model:

1. Sample a latent code z ∈ R^d from a prior p_0(z).

2. Sample a sequence from a code-conditioned policy pθ(w|z).

However, we will rely on reinforcement learning to train the decoder p(w|z). Note that for a constant code z = 0 we obtain a language model as a special case. Let us now briefly review latent sequential models.

4.1 Latent sequential models

Formally, a latent model of sequences w = w_{1:T} is written as a marginal over latent codes

p(w) = ∫ p(w, z) dz = ∫ p(w|z) p_0(z) dz .   (2)

The precise form of p(w|z) and whether z refers to a single factor or a sequence of factors z_{1:T} depends on the model of choice.

The main motivation of enhancing p with a latent factor is usually the hope to obtain a meaningful structure in the space of latent codes. How such a structure should be organized has been discussed in the disentanglement literature in great detail, for example in Chen et al. (2018), Hu et al. (2017) or Tschannen et al. (2018).

In our context, latent generative models are interesting for two reasons. First, explicitly introducing uncertainty inside the model is often motivated as a regularizing technique in Bayesian machine learning (Murphy, 2012) and has been applied extensively to latent sequence models (Ziegler and Rush, 2019; Schmidt and Hofmann, 2018; Goyal et al., 2017; Bayer and Osendorfer, 2014). Second, as mentioned in Section 3 (VAE collapse), conditioning on ground-truth contexts has been identified as detrimental to obtaining meaningful latent codes (Bowman et al., 2016) – hence a methodology for training decoders that relaxes this requirement might be of value.

Training via Variational Inference   Variational inference (Zhang et al., 2018) allows to optimize a lower bound instead of the intractable marginal likelihood and has become the standard methodology for training latent variable models. Introducing an inference model q(z|w) and applying Jensen's inequality to (2), we obtain

log p(w) ≥ E_{q(z|w)}[ log (p_0(z) / q(z|w)) + log p(w|z) ]
         = −D_KL(q(z|w) ‖ p_0(z)) + E_{q(z|w)}[ log p(w|z) ] .   (3)

Neural inference networks (Rezende et al., 2014; Kingma and Welling, 2013) have proven to be effective amortized approximate inference models. Let us now discuss how reinforcement learning can help training our model.

4.2 Generation as Reinforcement Learning

Text generation can easily be formulated as a reinforcement learning (RL) problem if words are taken as actions (Bahdanau et al., 2016). Formally, pθ is a parameterized policy that factorizes autoregressively, pθ(w) = ∏_t pθ(w_t | h(w_{1:t−1})), where h is a deterministic mapping from past predictions to a continuous state, typically a recurrent neural network (RNN). The goal is then to find policy parameters θ that maximize the expected reward

J(θ) = E_{pθ(w)}[ R(w, w⋆) ]   (4)

where R(w, w⋆) is a task-specific, not necessarily differentiable metric.

Policy gradient optimization   The REINFORCE (Williams, 1992) training algorithm is a common strategy to optimize (4) using a gradient estimate via the log-derivative trick

∇_θ J(θ) = E_{pθ(w)}[ R(w, w⋆) ∇_θ log pθ(w) ] .   (5)

Since samples from the policy ŵ ∼ pθ often yield low or zero reward, the estimator (5) is known for its notorious variance, and much of the literature is focused on reducing this variance via baselines or control variates (Rennie et al., 2016).
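As a minimal illustration of this estimator (our own sketch, reusing the hypothetical RNNLanguageModel from the earlier listing; the reward function is left abstract), one REINFORCE step can be written as:

```python
import torch  # RNNLanguageModel as defined in the sketch above

def reinforce_step(model, reward_fn, target, bos_id, eos_id, num_samples=4, max_len=20):
    """One REINFORCE update: grad J(theta) ~ mean_k R(w_k, w*) * grad log p_theta(w_k)."""
    losses = []
    for _ in range(num_samples):
        context = torch.tensor([[bos_id]])
        log_prob = 0.0
        for _ in range(max_len):
            logits = model(context)[:, -1, :]          # p(w_t | w_hat_{1:t-1})
            dist = torch.distributions.Categorical(logits=logits)
            w_t = dist.sample()
            log_prob = log_prob + dist.log_prob(w_t)
            context = torch.cat([context, w_t.unsqueeze(0)], dim=1)
            if w_t.item() == eos_id:
                break
        reward = reward_fn(context.squeeze(0).tolist(), target)  # scalar, not differentiated
        losses.append(-reward * log_prob)              # negative because we minimize
    torch.stack(losses).mean().backward()
```

A baseline subtracted from the reward would reduce the variance discussed above; it is omitted here for brevity.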
4.3 Reinforcement Learning as Inference

Recently, a new family of policy gradient methods has been proposed that draws inspiration from inference problems in probabilistic models. The underlying idea is to pull the reward in (5) into a new implicit distribution p̃ that allows to draw samples ŵ with much lower variance, as it is informed about the reward.

We follow Tan et al. (2018), who optimize an entropy-regularized version of (4), a common strategy to foster exploration. They cast the reinforcement learning problem as

J(θ, p̃) = E_{p̃}[ R(w, w⋆) ] − α D_KL(p̃(w) ‖ pθ(w)) + β H(p̃)   (6)

where α, β are hyper-parameters and p̃ is the new non-parametric, variational⁴ distribution across sequences. They show that (6) can be optimized using the following EM updates

E-step:   p̃^{n+1}(w) ∝ exp( (α log pθ^n(w) + R(w, w⋆)) / (α + β) )   (7)
M-step:   θ^{n+1} = argmax_θ E_{p̃^{n+1}}[ log pθ(w) ]   (8)

As Tan et al. (2018) have shown, for α → 0, β = 1 and a specific reward, the framework recovers maximum-likelihood training.⁵ It is explicitly not our goal to claim text generation with end-to-end reinforcement learning, but to show that it is beneficial to operate in an RL regime relatively close to maximum-likelihood.

⁴ In (Tan et al., 2018) p̃ is written as q, which resembles variational distributions in approximate Bayesian inference. However, here p̃ is not defined over variables but datapoints.
⁵ Refer to their work for more special cases, including MIXER (Ranzato et al., 2016).
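To spell out this special case, here is a short sketch of the argument as we reconstruct it from Tan et al. (2018); the delta reward R_δ below is our assumption about which "specific reward" is meant.

```latex
% Sketch: assume the delta reward R_\delta(w, w^\star) = 0 if w = w^\star and -\infty otherwise.
% In the limit \alpha \to 0 with \beta = 1, the E-step (7) becomes
\tilde{p}^{\,n+1}(w)
  \;\propto\; \exp\!\left(\frac{\alpha \log p_\theta^{n}(w) + R_\delta(w, w^\star)}{\alpha+\beta}\right)
  \;\longrightarrow\; \exp\!\big(R_\delta(w, w^\star)\big) \;=\; \mathbb{1}\{w = w^\star\},
% i.e. all probability mass sits on the ground truth, and the M-step (8) reduces to
\theta^{n+1} \;=\; \arg\max_\theta \, \log p_\theta(w^\star),
% which is exactly teacher-forced maximum-likelihood training.
```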

4.4 Optimization with Variational Inference

In conditional generation, a policy is conditioned on a source sentence, which guides generation towards sequences that obtain significant reward. Often, several epochs of MLE pretraining (Rennie et al., 2016; Bahdanau et al., 2016) are necessary to make this guidance effective.

In our unconditional setting, where a source is not available, we employ the latent code z to provide guidance. We cast the policy pθ as a code-conditioned policy pθ(w|z), which is trained to maximize a marginal version of the reward (6):

J(θ) = E_{p_0(z)} E_{pθ(w|z)}[ R(w, w⋆) ] .   (9)

Similar formulations of expected reward have recently been proposed as goal-conditioned policies (Ghosh et al., 2018). However, here it is our explicit goal to also learn the representation of the goal, our latent code. We follow Equation (3) and optimize a lower bound instead of the intractable marginalization (9). Following (Bowman et al., 2015; Fraccaro et al., 2016) we use a deep RNN inference network for q to optimize the bound. The reparametrization trick (Kingma and Welling, 2013) allows us to compute gradients with respect to q. Algorithm 1 shows the outline of the training procedure.

Algorithm 1 Latent ERPO Training
  for w⋆ ∈ DATASET do
    Sample a latent code z ∼ q(z|w⋆)
    Sample a datapoint w̃ ∼ p̃(w|z)
    Perform a gradient step ∇_θ log pθ(w̃|z)
  end for

Note that exploration (sampling w̃) and the gradient step are both conditioned on the latent code, hence stochasticity due to sampling a single z is coupled in both. Also, no gradient needs to be propagated into p̃.

So far, we have not discussed how to efficiently sample from the implicit distribution p̃. In the remainder of this section we present our reward function and discuss the implications on the tractability of sampling.

4.5 Reward

Defining a meaningful reward function is central to the success of reinforcement learning. The usual RL formulations in NLP require a measure of sentence-sentence similarity as reward. Common choices include BLEU (Papineni et al., 2002), ROUGE (Lin, 2004), METEOR (Banerjee and Lavie, 2005) or SPICE (Anderson et al., 2016). These are essentially n-gram metrics, partly augmented with synonym resolution or re-weighting schemes.

Word mover's distance (WMD) (Kusner et al., 2015) provides an interesting alternative based on the optimal-transport problem. In essence, WMD computes the minimum accumulated distance that the word vectors of one sentence need to "travel" to coincide with the word vectors of the other sentence. In contrast to n-gram metrics, WMD can leverage powerful neural word representations. Unfortunately, the complexity of computing WMD is roughly O(T³ log T).

4.6 A Reward for Tractable Sampling

Tan et al. (2018) show that, thanks to the factorization of pθ, the globally-normalized inference distribution p̃ in (7) can be written as a locally-normalized distribution at the word level

p̃(w_t | w_{1:t−1}) ∝ exp( (α log pθ(w_t | w_{1:t−1}) + R_t(w, w⋆)) / (α + β) )   (10)

when the reward is written as an incremental reward R_t defined via R_t(w, w⋆) = R(w_{1:t}, w⋆) − R(w_{1:t−1}, w⋆). Sampling from (10) is still hard if R_t hides dynamic programming routines or other complex time-dependencies. With this in mind, we choose a particularly simple reward

R(w, w⋆) = Σ_{t=1}^{T} φ(w_t)ᵀ φ(w⋆_t)   (11)

where φ is a lookup into a length-normalized, pre-trained but fixed word embedding (Mikolov et al., 2013). This casts our reward as an efficient, yet drastic approximation to WMD, which assumes identical length and one-to-one word correspondences. Putting (10) and (11) together, we sample sequentially from

p̃(w_t | w_{1:t−1}) ∝ exp( (α log pθ(w_t | w_{1:t−1}) + φ(w_t)ᵀ φ(w⋆_t)) / (α + β) )   (12)

with the complexity O(dV) of a standard softmax. Compared to standard VAE training, Algorithm 1 only needs one additional forward pass (with identical complexity) to sample w̃ from p̃.

Equation (12) gives a simple interpretation of our proposed training methodology. We locally correct predictions made by the model proportionally to the distance to the ground truth in the embedding space. Hence, we consider both the ground truth and the model prediction for exploration.
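A minimal sketch of this local sampling step follows (our own PyTorch illustration; `policy_logits`, the embedding matrix and the hyper-parameter values are placeholders). At every position the policy's log-probabilities are mixed with the inner products to the current ground-truth word, as in (12). In the full procedure the logits at step t depend on the previously sampled prefix, so this computation runs inside the decoding loop; here the logits are treated as given.

```python
import torch

def sample_from_p_tilde(policy_logits, embeddings, target_ids, alpha, beta):
    """Sample w~ from eq. (12): p~(w_t | w_{1:t-1}) proportional to
    exp((alpha * log p_theta(w_t | .) + phi(w_t)^T phi(w*_t)) / (alpha + beta)).

    policy_logits: (T, V) logits of p_theta at every position, given the sampled prefix.
    embeddings:    (V, d) length-normalized, fixed word embeddings phi.
    target_ids:    (T,)   ground-truth words w*_1..T.
    """
    log_p = torch.log_softmax(policy_logits, dim=-1)           # (T, V) log p_theta
    sims = embeddings @ embeddings[target_ids].T               # (V, T) phi(w)^T phi(w*_t)
    scores = (alpha * log_p + sims.T) / (alpha + beta)         # (T, V) unnormalized log p~
    return torch.distributions.Categorical(logits=scores).sample()   # (T,) sampled w~_t

# Usage sketch with random placeholders (V = 1000 words, d = 100, T = 20).
V, d, T = 1000, 100, 20
phi = torch.nn.functional.normalize(torch.randn(V, d), dim=-1)   # length-normalized
w_tilde = sample_from_p_tilde(torch.randn(T, V), phi, torch.randint(0, V, (T,)),
                              alpha=0.006, beta=0.067)
```

The sampled w̃ then replaces the ground truth in the gradient step of Algorithm 1.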

[Figure 2: Generalization performance in terms of sequence NLL across latent and deterministic methods. Train and test NLL over training time for OURS, OURS-B, OURS-DET, SS-0.99, SS-0.98, SS-0.95, SS-0.90, SS-0.99-DET, RAML, RAML-DET, VAE, WDROP-0.99 and LM-DET.]

5 Related Work

Our discussion of exposure bias complements recent work that summarizes modern generative models, for example Caccia et al. (2018) and Lu et al. (2018). Shortcomings of maximum-likelihood training for sequence generation have often been discussed (Ding and Soricut, 2017; Leblond et al., 2018; Ranzato et al., 2016), but without pointing to generalization as the key aspect. An overview of recent deep reinforcement learning methods for conditional generation can be found in (Keneshloo et al., 2018).

Our proposed approach follows work by Ding et al. (2017) and Tan et al. (2018) by employing both policy and reward for exploration. In contrast to them, we do not use an n-gram based reward. Compared to RAML (Norouzi et al., 2016), we do not perturb the ground-truth context, but correct the policy predictions. Scheduled sampling (Bengio et al., 2015) and word-dropout (Bowman et al., 2016) also apply a correction, yet one that only affects the probability of the ground truth. Chen et al. (2017) propose Bridge modules that, similarly to Ding et al. (2017), can incorporate arbitrary ground-truth perturbations, yet in an objective motivated by an auxiliary KL-divergence.

Merity et al. (2017) have shown that generalization is crucial to language modeling, but their focus is regularizing parameters and activations. Word embeddings to measure deviations from the ground truth have also been used by Inan et al. (2016), yet under log-likelihood. Concurrently to our work, Li et al. (2019) employ embeddings to design reward functions in abstractive summarization.

6 Experiments

Parametrization   The policies of all our models and all baselines use the same RNN. We use a 256-dimensional GRU (Cho et al., 2014) and 100-dimensional pre-trained word2vec input embeddings. Optimization is performed by Adam (Kingma and Ba, 2014) with an initial learning rate of 0.001 for all models. For all methods, including scheduled sampling, we do not anneal hyper-parameters such as the keep-probability, for the following reasons. First, in an unconditional setting, using only the model's predictions is not a promising setting, so it is unclear what value to anneal to. Second, the continuous search space of schedules makes it sufficiently harder to compare different methods. For the same reason, we do not investigate annealing the KL term or the α, β-parametrization of the models. We use the inference network parametrization of (Bowman et al., 2016), which employs a diagonal Gaussian for q.

We found the training regime to be very sensitive to the α, β-parametrization. In particular, it is easy to pick a set of parameters that does not truly incorporate exploration, but reduces to maximum likelihood training with only ground-truth contexts (see also the discussion of Figure 3 in Section 6.2). After performing a grid search (as done also for RAML) we choose⁶ α = 0.006, β = 0.067 for OURS, the proposed method. In addition, we report results for an alternative model OURS-B with α = 0.01, β = 0.07.

⁶ The scale of α is relatively small as the log-probabilities in (12) have significantly larger magnitude than the inner products, which are in [0, 1] due to the normalization.
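The described setup corresponds roughly to the following sketch (our own PyTorch approximation of the stated sizes; the class name, latent dimension and all other details are assumptions, not the paper's code):

```python
import torch
import torch.nn as nn

class SentenceVAE(nn.Module):
    """Rough sketch of the described setup: a GRU encoder q(z|w) with a diagonal
    Gaussian, and a GRU decoder p_theta(w|z); the 256-d GRU and 100-d fixed
    word2vec embeddings follow the text, everything else is assumed."""
    def __init__(self, vocab_size, pretrained_emb, latent_dim=32, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding.from_pretrained(pretrained_emb, freeze=True)
        emb_dim = pretrained_emb.size(1)                  # 100-dimensional word2vec
        self.encoder = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.to_mu = nn.Linear(hidden_dim, latent_dim)
        self.to_logvar = nn.Linear(hidden_dim, latent_dim)
        self.decoder = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.z_to_h = nn.Linear(latent_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens):
        _, h = self.encoder(self.embed(tokens))           # h: (1, B, hidden)
        mu, logvar = self.to_mu(h[-1]), self.to_logvar(h[-1])
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparametrization trick
        dec_h, _ = self.decoder(self.embed(tokens), self.z_to_h(z).unsqueeze(0))
        return self.out(dec_h), mu, logvar

# Optimizer as described: Adam with initial learning rate 0.001.
model = SentenceVAE(vocab_size=40000, pretrained_emb=torch.randn(40000, 100))
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
logits, mu, logvar = model(torch.randint(0, 40000, (8, 20)))
```

In the proposed training regime the teacher-forced decoder inputs above would be replaced by the sequence w̃ sampled from p̃ (Algorithm 1); this sketch only mirrors the shared parametrization of the baselines.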

Data   For our experiments, we use a one-million-sentence subset of the BooksCorpus (Kiros et al., 2015; Zhu et al., 2015) with a 90-10 train-test split and a 40K-word vocabulary. The corpus size is chosen to challenge the above policy with both scenarios, overfitting and underfitting.

6.1 Baselines

As baselines we use a standard VAE and a VAE with RAML decoding that uses the identical reward as our method (see Tan et al. (2018) for details on RAML as a special case). Furthermore, we use two regularizations of the standard VAE, scheduled sampling SS-P and word-dropout WDROP-P as proposed by Bowman et al. (2016), both with a fixed probability p of using the ground truth.

In addition, we report as special cases with z = 0 results for our model (OURS-DET), RAML (RAML-DET), scheduled sampling (SS-P-DET), and the VAE (LM, a language model).

6.2 Results

Figure 2 shows training and test negative sequence log-likelihood evaluated during training and Table 1 shows the best performance obtained. All figures and tables are averaged across three runs.

Model          Train NLL   Test NLL
OURS           48.52       52.54
OURS-B         49.51       52.61
OURS-DET       48.06       52.87
SS-0.99        48.11       52.60
SS-0.98        48.21       52.62
SS-0.95        48.38       52.69
SS-0.90        49.02       52.89
SS-0.99-DET    48.08       52.90
RAML           48.26       52.56
RAML-DET       48.26       52.86
WDROP-0.99     48.19       52.86
LM             47.65       53.01
VAE            47.86       52.66
WDROP-0.9      50.86       54.65

Table 1: Training and test performance

We observe that all latent models outperform their deterministic counterparts (crossed curves) in terms of both generalization and overall test performance. This is not surprising, as regularization is one of the benefits of modeling uncertainty through latent variables. Scheduled sampling does improve generalization for p ≈ 1, with diminishing returns at p = 0.95, and in general performed better than word dropout. Our proposed models outperform all others in terms of generalization and test performance. Note that the performance difference over RAML, the second best method, is solely due to also incorporating model-predicted contexts during training.

Despite some slightly improved performance, all latent models except for OURS-B have a KL term relatively close to zero. OURS-B is α-β-parametrized to incorporate slightly more model predictions at a higher temperature and manages to achieve a KL term of about 1 to 1.5 bits. These findings are similar to what (Bowman et al., 2016) report with annealing, but still significantly behind work that addresses this specific problem (Yang et al., 2017; Shen et al., 2018). Appendix A illustrates how our models can obtain larger KL terms – yet at degraded performance – by controlling exploration. We conclude that improved autoregressive modeling inside the ERPO framework cannot alone overcome VAE collapse.

We have discussed many approaches that deviate from training exclusively on ground-truth contexts. Therefore, an interesting quantity to monitor across methods is the fraction of words that correspond to the ground truth. Figure 3 shows these fractions during training for the configurations that gave the best results. Interestingly, in the latent setting our method relies by far the least on ground-truth contexts, whereas in the deterministic setting the difference is small.

[Figure 3: Fraction of correct words during training. Numbers include forced and correctly predicted words. Curves over training time for OURS, OURS-B, OURS-DET, RAML, SS-0.98, SS-0.99, SS-0.95 and LM/VAE.]

7 Conclusion

We have argued that exposure bias does not point to a problem with the standard methodology of training autoregressive sequence models. Instead, it refers to a notion of generalization to unseen sequences that does not manifest in log-likelihood training and testing, yet might be desirable in order to capture the flexibility of natural language.

To rigorously incorporate the desired generalization behavior, we have proposed to follow the reinforcement learning formulation of Tan et al. (2018).

Combined with an embedding-based reward function, we have shown excellent generalization performance compared to the unregularized model and better generalization than existing techniques on language modeling and sentence autoencoding.

Future work   We have shown that the simple reward function proposed here leads to a form of regularization that fosters generalization when evaluated inside the maximum-likelihood framework. In the future, we hope to conduct a human evaluation to assess the generalization capabilities of models trained under maximum-likelihood and reinforcement learning more rigorously. Only such a framework-independent evaluation can reveal the true gains of carefully designing reward functions compared to simply performing maximum-likelihood training.

References

Alexander A. Alemi, Ben Poole, Ian Fischer, Joshua V. Dillon, Rif A. Saurous, and Kevin Murphy. 2017. An information-theoretic analysis of deep latent-variable models. CoRR, abs/1711.00464.

Peter Anderson, Basura Fernando, Mark Johnson, and Stephen Gould. 2016. SPICE: Semantic propositional image caption evaluation. CoRR, abs/1607.08822.

Dzmitry Bahdanau, Philemon Brakel, Kelvin Xu, Anirudh Goyal, Ryan Lowe, Joelle Pineau, Aaron C. Courville, and Yoshua Bengio. 2016. An actor-critic algorithm for sequence prediction. CoRR, abs/1607.07086.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In ICLR.

Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72, Ann Arbor, Michigan. Association for Computational Linguistics.

Justin Bayer and Christian Osendorfer. 2014. Learning stochastic recurrent networks. arXiv preprint arXiv:1411.7610.

Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. 2015. Scheduled sampling for sequence prediction with recurrent neural networks. In NIPS.

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In EMNLP.

Samuel R. Bowman, Luke Vilnis, Oriol Vinyals, Andrew M. Dai, Rafal Józefowicz, and Samy Bengio. 2016. Generating sentences from a continuous space. In ACL.

Massimo Caccia, Lucas Caccia, William Fedus, Hugo Larochelle, Joelle Pineau, and Laurent Charlin. 2018. Language GANs falling short. CoRR, abs/1811.02549.

Tian Qi Chen, Xuechen Li, Roger B. Grosse, and David Duvenaud. 2018. Isolating sources of disentanglement in variational autoencoders. CoRR, abs/1802.04942.

Wenhu Chen, Guanlin Li, Shujie Liu, Zhirui Zhang, Mu Li, and Ming Zhou. 2017. Neural sequence prediction by coaching. CoRR, abs/1706.09152.

Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In EMNLP, pages 1724–1734.

Nan Ding and Radu Soricut. 2017. Cold-start reinforcement learning with softmax policy gradients. CoRR, abs/1709.09346.

William Fedus, Ian J. Goodfellow, and Andrew M. Dai. 2018. MaskGAN: Better text generation via filling in the ______. In ICLR.

Katja Filippova, Enrique Alfonseca, Carlos A. Colmenares, Lukasz Kaiser, and Oriol Vinyals. 2015. Sentence compression by deletion with LSTMs. In EMNLP 2015, pages 360–368.

Marco Fraccaro, Søren Kaae Sønderby, Ulrich Paquet, and Ole Winther. 2016. Sequential neural models with stochastic layers. In NIPS, pages 2199–2207.

Dibya Ghosh, Abhishek Gupta, and Sergey Levine. 2018. Learning actionable representations with goal-conditioned policies. CoRR, abs/1811.07819.

Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press.

Anirudh Goyal, Alex Lamb, Ying Zhang, Saizheng Zhang, Aaron C. Courville, and Yoshua Bengio. 2016. Professor forcing: A new algorithm for training recurrent networks. In NIPS.

Anirudh Goyal, Alessandro Sordoni, Marc-Alexandre Côté, Nan Rosemary Ke, and Yoshua Bengio. 2017. Z-forcing: Training stochastic recurrent networks. In NIPS.

Tianxing He, Jingzhao Zhang, Zhiming Zhou, and James R. Glass. 2019. Quantifying exposure bias for neural language generation. CoRR, abs/1905.10617.

Irina Higgins, Loïc Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew M. Botvinick, Shakir Mohamed, and Alexander Lerchner. 2017. beta-VAE: Learning basic visual concepts with a constrained variational framework. In ICLR.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Ari Holtzman, Jan Buys, Maxwell Forbes, and Yejin Choi. 2019. The curious case of neural text degeneration. CoRR, abs/1904.09751.

Z. Hu, Z. Yang, X. Liang, R. Salakhutdinov, and E. R. Xing. 2017. Toward controlled generation of text. In International Conference on Machine Learning (ICML).

Ferenc Huszár. 2015. How (not) to train your generative model: Scheduled sampling, likelihood, adversary?

Hakan Inan, Khashayar Khosravi, and Richard Socher. 2016. Tying word vectors and word classifiers: A loss framework for language modeling. ArXiv, abs/1611.01462.

Yaser Keneshloo, Tian Shi, Naren Ramakrishnan, and Chandan K. Reddy. 2018. Deep reinforcement learning for sequence to sequence models. CoRR, abs/1805.09461.

Urvashi Khandelwal, Kevin Clark, Dan Jurafsky, and Lukasz Kaiser. 2019. Sample efficient text summarization using a single pre-trained transformer. CoRR, abs/1905.08836.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. In ICLR.

Diederik P. Kingma and Max Welling. 2013. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.

Ryan Kiros, Yukun Zhu, Ruslan Salakhutdinov, Richard S. Zemel, Antonio Torralba, Raquel Urtasun, and Sanja Fidler. 2015. Skip-thought vectors. arXiv preprint arXiv:1506.06726.

Matt J. Kusner, Yu Sun, Nicholas I. Kolkin, and Kilian Q. Weinberger. 2015. From word embeddings to document distances. In ICML, pages 957–966.

Rémi Leblond, Jean-Baptiste Alayrac, Anton Osokin, and Simon Lacoste-Julien. 2018. SEARNN: Training RNNs with global-local losses. In ICLR.

Siyao Li, Deren Lei, Pengda Qin, and William Wang. 2019. Deep reinforcement learning with distributional semantic rewards for abstractive summarization.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. page 10.

Sidi Lu, Yaoming Zhu, Weinan Zhang, Jun Wang, and Yong Yu. 2018. Neural text generation: Past, present and beyond. CoRR, abs/1803.07133.

Stephen Merity, Nitish Shirish Keskar, and Richard Socher. 2017. Regularizing and optimizing LSTM language models. CoRR, abs/1708.02182.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In NIPS.

Kevin P. Murphy. 2012. Machine Learning: A Probabilistic Perspective. The MIT Press.

Mohammad Norouzi, Samy Bengio, Zhifeng Chen, Navdeep Jaitly, Mike Schuster, Yonghui Wu, and Dale Schuurmans. 2016. Reward augmented maximum likelihood for neural structured prediction. CoRR, abs/1609.00150.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation.

Tom Rainforth, Adam R. Kosiorek, Tuan Anh Le, Chris J. Maddison, Maximilian Igl, Frank Wood, and Yee Whye Teh. 2018. Tighter variational bounds are not necessarily better. In ICML.

Marc'Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. 2016. Sequence level training with recurrent neural networks. In ICLR.

Steven J. Rennie, Etienne Marcheret, Youssef Mroueh, Jerret Ross, and Vaibhava Goel. 2016. Self-critical sequence training for image captioning. CoRR, abs/1612.00563.

Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. 2014. Stochastic back-propagation and variational inference in deep latent Gaussian models. In ICML.

Alexander M. Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. In EMNLP.

Florian Schmidt and Thomas Hofmann. 2018. Deep state space models for unconditional word generation. In NeurIPS.

Iulian Vlad Serban, Alessandro Sordoni, Yoshua Bengio, Aaron C. Courville, and Joelle Pineau. 2016. Building end-to-end dialogue systems using generative hierarchical neural network models. In AAAI.

Xiaoyu Shen, Hui Su, Shuzi Niu, and Vera Demberg. 2018. Improving variational encoder-decoders in dialogue generation. CoRR, abs/1802.02032.

Bowen Tan, Zhiting Hu, Zichao Yang, Ruslan Salakhutdinov, and Eric P. Xing. 2018. Connecting the dots between MLE and RL for sequence generation. CoRR, abs/1811.09740.

Michael Tschannen, Olivier Bachem, and Mario Lucic. 2018. Recent advances in autoencoder-based representation learning. CoRR, abs/1812.05069.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In NIPS.

Ronald J. Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256.

Zichao Yang, Zhiting Hu, Ruslan Salakhutdinov, and Taylor Berg-Kirkpatrick. 2017. Improved variational autoencoders for text modeling using dilated convolutions. CoRR, abs/1702.08139.

Cheng Zhang, Judith Butepage, Hedvig Kjellstrom, and Stephan Mandt. 2018. Advances in variational inference. IEEE Transactions on Pattern Analysis and Machine Intelligence.

Yukun Zhu, Ryan Kiros, Richard Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. arXiv preprint arXiv:1506.06724.

Zachary M. Ziegler and Alexander M. Rush. 2019. Latent normalizing flows for discrete sequences. arXiv preprint arXiv:1901.10548.
