Generalization in Generation: A closer look at Exposure Bias

Florian Schmidt
Department of Computer Science, ETH Zürich
[email protected]

Abstract

Exposure bias refers to the train-test discrepancy that seemingly arises when an autoregressive generative model uses only ground-truth contexts at training time but generated ones at test time. We separate the contributions of the model and the learning framework to clarify the debate on consequences and review proposed counter-measures. In this light, we argue that generalization is the underlying property to address and propose unconditional generation as its fundamental benchmark. Finally, we combine latent variable modeling with a recent formulation of exploration in reinforcement learning to obtain a rigorous handling of true and generated contexts. Results on language modeling and variational sentence auto-encoding confirm the model's generalization capability.

1 Introduction

Autoregressive models span from n-gram models to recurrent neural networks to transformers and have formed the backbone of state-of-the-art models over the last decade on virtually any generative task in Natural Language Processing. Applications include machine translation (Bahdanau et al., 2015; Vaswani et al., 2017), summarization (Rush et al., 2015; Khandelwal et al., 2019), dialogue (Serban et al., 2016) and sentence compression (Filippova et al., 2015).

The training methodology of such models is rooted in the language modeling task, which is to predict a single word given a context of previous words. It has often been criticized that this setting is not suited for multi-step generation, where – at test time – we are interested in generating words given a generated context that was potentially not seen during training. The consequences of this train-test discrepancy are summarized as exposure bias. Measures to mitigate the problem typically rely on replacing, masking or perturbing ground-truth contexts (Bengio et al., 2015; Bowman et al., 2016; Norouzi et al., 2016; Ranzato et al., 2016). Unfortunately, exposure bias has never been successfully separated from general test-time log-likelihood assessment, and minor improvements on the latter are used as the only signifier of reduced bias. Whenever explicit effects are investigated, no significant findings are made (He et al., 2019).

In this work we argue that the standard training procedure, despite all criticism, is an immediate consequence of combining autoregressive modeling and maximum-likelihood training. As such, the paramount consideration for improving test-time performance is simply regularization for better generalization. In fact, many proposed measures against exposure bias can be seen as exactly that, yet with respect to a usually implicit metric that is not maximum-likelihood.

With this in mind, we discuss regularization for conditional and unconditional generation. We note that in conditional tasks, such as translation, it is usually sufficient to regularize the mapping task – here translation – rather than the generative process itself. For unconditional generation, where tradeoffs between accuracy and coverage are key, generalization becomes much more tangible.

The debate on the right training procedure for autoregressive models has recently been amplified by the advent of latent generative models (Rezende et al., 2014; Kingma and Welling, 2013). Here, the practice of decoding with true contexts during training conflicts with the hope of obtaining a latent representation that encodes significant information about the sequence (Bowman et al., 2016). Interestingly, the ad hoc tricks to reduce the problem are similar to those proposed to address exposure bias in deterministic models.

Very recently, Tan et al. (2018) have presented a reinforcement learning formulation of exploration that allows following the intuition that an autoregressive model should not only be trained on ground-truth contexts. We combine their framework with latent variable modeling and a reward function that leverages modern word-embeddings. The result is a single learning regime for unconditional generation in a deterministic setting (language modeling) and in a latent variable setting (variational sentence autoencoding). Empirical results show that our formulation allows for better generalization than existing methods proposed to address exposure bias. Even more, we find the resulting regularization to also improve generalization under log-likelihood.

We conclude that it is worthwhile exploring reinforcement learning to elegantly extend maximum-likelihood learning where our desired notion of generalization cannot be expressed without violating the underlying principles. As a result, we hope to provide a more unified view on the training methodologies of autoregressive models and exposure bias in particular.

2 Autoregressive Modeling

Modern text generation methods are rooted in models trained on the language modeling task. In essence, a language model p is trained to predict a word given its left-side context

p(w_t | w_{1:t−1}) .   (1)

With a trained language model at hand, a simple recurrent procedure allows to generate text of arbitrary length. Starting from an initial special symbol ŵ_0, we iterate over t = 1, 2, ... and alternate between sampling ŵ_t ∼ p(w_t | ŵ_{1:t−1}) and appending ŵ_t to the context ŵ_{1:t−1}. Models of this form are called autoregressive as they condition new predictions on old predictions.
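As a concrete illustration, the following minimal sketch implements this recursive sampling procedure for a generic left-to-right model. The `step` interface and the toy bigram table are our own assumptions for the sake of a runnable example, not part of the paper.

```python
import numpy as np

def sample_sequence(step, bos_id, eos_id, max_len=50, rng=None):
    """Recursively sample w_hat_t ~ p(w_t | w_hat_{1:t-1}) and append it to the context.

    `step(context) -> probs` is any function returning a distribution over the
    vocabulary given the generated prefix (an n-gram table, an RNN, a transformer).
    """
    rng = rng or np.random.default_rng()
    context = [bos_id]
    for _ in range(max_len):
        probs = step(context)                  # p(w_t | w_hat_{1:t-1})
        w_t = rng.choice(len(probs), p=probs)  # sample, do not arg-max
        context.append(int(w_t))
        if w_t == eos_id:
            break
    return context[1:]                         # drop the start symbol

# Toy example: a hypothetical 4-word vocabulary {<s>, the, cat, </s>} with a
# bigram "model" -- only to make the function runnable.
bigram = np.array([[0.0, 0.8, 0.1, 0.1],   # after <s>
                   [0.0, 0.1, 0.7, 0.2],   # after "the"
                   [0.0, 0.2, 0.1, 0.7],   # after "cat"
                   [0.0, 0.0, 0.0, 1.0]])  # after </s>
print(sample_sequence(lambda ctx: bigram[ctx[-1]], bos_id=0, eos_id=3))
```

Replacing the sampling line with an arg-max gives the deterministic decoding strategy mentioned for conditional models in Section 2.1.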
Neural Sequence Models   Although a large corpus provides an abundance of word-context pairs to train on, the cardinality of the context space makes explicit estimates of (1) infeasible. Therefore, traditional n-gram language models rely on a truncated context and smoothing techniques to generalize well to unseen contexts. Neural language models lift the context restriction and instead use neural context representations. This can be a hidden state as found in recurrent neural networks (RNNs), i.e. an LSTM (Hochreiter and Schmidhuber, 1997) state, or a set of attention weights, as in a transformer architecture (Vaswani et al., 2017). While the considerations in this work apply to all autoregressive models, we focus on recurrent networks, which encode the context in a fixed-sized continuous representation h(w_{1:t−1}). In contrast to transformers, RNNs can be generalized easily to variational autoencoders with a single latent bottleneck (Bowman et al., 2016), a particularly interesting special case of generative models.

2.1 Evaluation and Generalization

Conditional vs. Unconditional   Conditional generation tasks, such as translation or summarization, are attractive from an application perspective. However, for the purpose of studying exposure bias, we argue that unconditional generation is the task of choice for the following reasons.

First, exposure bias addresses conditioning on past generated words, which becomes less essential when the words of a source sentence are available, in particular when attention is used.

Second, the difficulty of the underlying mapping task, say translation, is of no concern for the mechanics of generation. This casts sentence autoencoding as a less demanding, yet more economic task.

Finally, generalization of conditional models is only studied with respect to the underlying mapping and not with respect to the conditional distribution itself. A test set in translation usually does not contain a source sentence seen during training with a different target.¹ Instead, it contains unseen source-target pairs that evaluate the generalization of the mapping. Even more, at test time most conditional models resort to an arg-max decoding strategy. As a consequence, the entropy of the generative model is zero (given the source) and there is no generalization at all with respect to generation. For these reasons, we address unconditional generation and sentence auto-encoding for the rest of this work.

¹ Some datasets do provide several targets for a single source. However, those are typically only used for BLEU computation, which is the standard test metric reported.

[Figure 1: Generalization]

The big picture   Let us briefly characterize the output we should expect from a generative model with respect to generalization. Figure 1 shows an idealized two-dimensional data space of (fixed-length) sentences w ∈ V^T. We sketch the support of the unknown underlying generating distribution, the train set and the test set.² Let us look at some hypothetical examples ŵ1, ŵ2, ŵ3, ŵ4 generated from some well-trained model. Samples like ŵ1 certify that the model did not overfit to the training data, as can be certified by test log-likelihood. In contrast, the remaining samples are indistinguishable under test log-likelihood in the sense that they identically decrease the metric (assuming equal model probability), even though ŵ2, ŵ3 have non-zero probability under the true data distribution. Consequently, we cannot identify ŵ4 as a malformed example. Holtzman et al. (2019) show that neural generative models – despite their expressiveness – put significant probability on clearly unreasonable repetitive phrases, such as I dont know. I dont know. I dont know.³

Evaluation under a smoothed data distribution   The most common approach to evaluating an unconditional probabilistic generative model is training and test log-likelihood. For a latent variable model, the exact log-likelihood (2) is intractable and a lower bound must be used instead. However, at this point it should be noted that one can always estimate the log-likelihood from an empirical distribution across generated output. That is, one generates a large set of sequences S and sets p̂(w) to the normalized count of w in S. However, the variance of this estimate is impractical for all but the smallest datasets. Also, even a large test set cannot capture the flexibility and compositionality found in natural language.

With the aforementioned shortcomings of test log-likelihood in mind, it is worthwhile discussing a recently proposed evaluation technique. Fedus et al. (2018) propose to use n-gram statistics of the underlying data to assess generated output. For example, one can estimate an n-gram language model and report the perplexity of the generated data under the n-gram model. Just as BLEU and ROUGE break the sequence reward assignment problem into smaller sub-problems, n-gram language models effectively smooth the sequence likelihood assignment, which is usually done with respect to the empirical data distribution. Under this metric, some sequences such as ŵ2 which are close to sequences in the dataset at hand might receive positive probability.

This raises two questions. First, can we break sequence-level evaluation into local statistics by using modern word embeddings instead of n-grams (as BLEU does)? Second, can we incorporate these measures already during training to obtain better generative models? These considerations will be key when defining a reward function in Section 4.5.

² Here we do not discuss generalization error, the discrepancy between empirical test error and expected test error. It should also be noted that cross-validation provides another complementary technique for more robust model estimation, which we omit to keep the picture simple.
³ They report that this also holds for non-grammatical repetitive phrases, which is what we would expect for ŵ4.
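As a rough sketch of the n-gram evaluation idea above (our own illustration, not code from the paper), one can fit a smoothed bigram model on the training data and report the perplexity of model samples under it; the add-k smoothing and the helper names are assumptions.

```python
from collections import Counter, defaultdict
import math

def fit_bigram(corpus, vocab_size, k=0.1):
    """Estimate an add-k smoothed bigram model from a list of token-id sequences."""
    unigram, bigram = Counter(), defaultdict(Counter)
    for seq in corpus:
        for prev, cur in zip(seq[:-1], seq[1:]):
            unigram[prev] += 1
            bigram[prev][cur] += 1
    def logprob(prev, cur):
        return math.log((bigram[prev][cur] + k) / (unigram[prev] + k * vocab_size))
    return logprob

def perplexity(samples, logprob):
    """Perplexity of generated sequences under the smoothed n-gram model."""
    total_logp, total_tokens = 0.0, 0
    for seq in samples:
        for prev, cur in zip(seq[:-1], seq[1:]):
            total_logp += logprob(prev, cur)
            total_tokens += 1
    return math.exp(-total_logp / total_tokens)

# train_data and generated are lists of token-id lists, e.g. produced by sample_sequence above.
train_data = [[0, 1, 2, 3], [0, 1, 1, 2, 3]]
generated = [[0, 2, 1, 3]]
print(perplexity(generated, fit_bigram(train_data, vocab_size=4)))
```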
3 Teacher Forcing and Exposure Bias

A concern often expressed in the context of autoregressive models is that the recursive sampling procedure for generation presented in Section 2 is never used at training time; hence the model cannot learn to digest its own predictions. The resulting potential train-test discrepancy is referred to as exposure bias and is associated with compounding errors that arise when mistakes made early accumulate (Bengio et al., 2015; Ranzato et al., 2016; Goyal et al., 2016; Leblond et al., 2018). In this context, teacher forcing refers to the fact that – seen from the test-time perspective – ground-truth contexts are substituted for model predictions. Although formally teacher forcing and exposure bias should be seen as cause (if any) and symptom, they are often used interchangeably.

As is sometimes but rarely mentioned, the presence of the ground-truth context is simply a consequence of maximum-likelihood training and the chain rule applied to (1), as in p(w_{1:T}) = ∏_t p(w_t | w_{1:t−1}) (Goodfellow et al., 2016). As such, it is out of question whether generated contexts should be used as long as log-likelihood is the sole criterion we care about. In this work we will furthermore argue the following:

Proposition 1   Exposure bias describes a lack of generalization with respect to an – usually implicit and potentially task and domain dependent – measure other than maximum-likelihood.

The fact that we are dealing with generalization is obvious, as one can train a model – assuming sufficient capacity – under the criticized methodology to match the training distribution. Approaches that address exposure bias do not make the above notion of generalization explicit, but follow the intuition that training on other contexts than (only) ground-truth contexts should regularize the model and result in – subjectively – better results. Of course, these forms of regularization might still implement some form of log-likelihood regularization, hence improve log-likelihood generalization. Indeed, all of the following methods do report test log-likelihood improvements.
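To make the teacher-forcing setup above concrete, here is a minimal sketch (our own, in PyTorch, not the paper's implementation) of the chain-rule objective: the loss for one sequence is the sum of −log p(w_t | w_{1:t−1}) where every context is the ground truth. The model class and sizes are placeholders.

```python
import torch
import torch.nn as nn

class RNNLanguageModel(nn.Module):
    """A small GRU language model p(w_t | w_{1:t-1}); sizes are illustrative only."""
    def __init__(self, vocab_size, emb_dim=100, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens):
        hidden, _ = self.rnn(self.embed(tokens))
        return self.out(hidden)            # logits for the next word at every position

def teacher_forced_nll(model, batch):
    """Maximum-likelihood loss: contexts are always the ground-truth prefix w_{1:t-1}."""
    inputs, targets = batch[:, :-1], batch[:, 1:]
    logits = model(inputs)
    return nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)), targets.reshape(-1))

# Usage sketch with random token ids in place of real data.
model = RNNLanguageModel(vocab_size=1000)
batch = torch.randint(0, 1000, (8, 20))    # 8 sequences of length 20
loss = teacher_forced_nll(model, batch)
loss.backward()
```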
Proposed methods against exposure bias   Scheduled sampling (Bengio et al., 2015), proposed for conditional generation, randomly mixes in predictions from the model, which violates the underlying learning framework (Huszár, 2015). RAML (Norouzi et al., 2016) proposes to effectively perturb the ground-truth context according to the exponentiated payoff distribution implied by a reward function. Alternatively, adversarial approaches (Goyal et al., 2016) and learning-to-search (Leblond et al., 2018) have been proposed.

VAE Collapse   In Section 4.1 we will take a look at latent generative models. In that context, the standard maximum-likelihood approach to autoregressive models has been criticized from a second perspective that is worth mentioning. Bowman et al. (2016) show empirically that autoregressive decoders p(w|z) do not rely on the latent code z, but collapse to a language model as in (1). While some work argues that the problem is rooted in autoregressive decoders being "too powerful" (Shen et al., 2018), the proposed measures often address the autoregressive training regime rather than the models (Bowman et al., 2016) and, in fact, replace ground-truth contexts just as the above methods to mitigate exposure bias do.

In addition, a whole body of work has discussed the implications of optimizing only a bound to the log-likelihood (Alemi et al., 2017) and the implications of re-weighting the information-theoretic quantities inside the bound (Higgins et al., 2017; Rainforth et al., 2018).

4 Latent Generation with ERPO

We have discussed exposure bias and how it has been handled by either implicitly or explicitly leaving the maximum-likelihood framework. In this section, we present our reinforcement learning framework for unconditional sequence generation models. The generative story is the same as in a latent variable model:

1. Sample a latent code z ∈ R^d from a prior p_0(z).

2. Sample a sequence from a code-conditioned policy pθ(w|z).

However, we will rely on reinforcement learning to train the decoder p(w|z). Note that for a constant code z = 0 we obtain a language model as a special case. Let us now briefly review latent sequential models.

4.1 Latent sequential models

Formally, a latent model of sequences w = w_{1:T} is written as a marginal over latent codes

p(w) = ∫ p(w, z) dz = ∫ p(w|z) p_0(z) dz .   (2)

The precise form of p(w|z) and whether z refers to a single factor or a sequence of factors z_{1:T} depends on the model of choice.

The main motivation of enhancing p with a latent factor is usually the hope to obtain a meaningful structure in the space of latent codes. How such a structure should be organized has been discussed in the disentanglement literature in great detail, for example in Chen et al. (2018), Hu et al. (2017) or Tschannen et al. (2018).

In our context, latent generative models are interesting for two reasons. First, explicitly introducing uncertainty inside the model is often motivated as a regularizing technique in Bayesian machine learning (Murphy, 2012) and has been applied extensively to latent sequence models (Ziegler and Rush, 2019; Schmidt and Hofmann, 2018; Goyal et al., 2017; Bayer and Osendorfer, 2014). Second, as mentioned in Section 3 (VAE collapse), conditioning on ground-truth contexts has been identified as detrimental to obtaining meaningful latent codes (Bowman et al., 2016) – hence a methodology for training decoders that relaxes this requirement might be of value.

Training via Variational Inference   Variational inference (Zhang et al., 2018) allows to optimize a lower bound instead of the intractable marginal likelihood and has become the standard methodology for training latent variable models. Introducing an inference model q(z|w) and applying Jensen's inequality to (2), we obtain

log p(w) ≥ E_{q(z|w)}[ log (p_0(z) / q(z|w)) + log p(w|z) ]
         = −D_KL(q(z|w) ‖ p_0(z)) + E_{q(z|w)}[ log p(w|z) ] .   (3)

Neural inference networks (Rezende et al., 2014; Kingma and Welling, 2013) have proven to be effective amortized approximate inference models. Let us now discuss how reinforcement learning can help training our model.

4.2 Generation as Reinforcement Learning

Text generation can easily be formulated as a reinforcement learning (RL) problem if words are taken as actions (Bahdanau et al., 2016). Formally, pθ is a parameterized policy that factorizes autoregressively, pθ(w) = ∏_t pθ(w_t | h(w_{1:t−1})), where h is a deterministic mapping from past predictions to a continuous state, typically a recurrent neural network (RNN). The goal is then to find policy parameters θ that maximize the expected reward

J(θ) = E_{pθ(w)}[ R(w, w⋆) ]   (4)

where R(w, w⋆) is a task-specific, not necessarily differentiable metric.

Policy gradient optimization   The REINFORCE (Williams, 1992) training algorithm is a common strategy to optimize (4) using a gradient estimate via the log-derivative trick

∇_θ J(θ) = E_{pθ(w)}[ R(w, w⋆) ∇_θ log pθ(w) ] .   (5)

Since samples from the policy ŵ ∼ pθ often yield low or zero reward, the estimator (5) is known for its notorious variance, and much of the literature is focused on reducing this variance via baselines or control variates (Rennie et al., 2016).
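As a minimal illustration of this estimator (our own sketch, reusing the hypothetical RNNLanguageModel from the earlier listing; the reward function is left abstract), one REINFORCE step can be written as:

```python
import torch  # RNNLanguageModel as defined in the sketch above

def reinforce_step(model, reward_fn, target, bos_id, eos_id, num_samples=4, max_len=20):
    """One REINFORCE update: grad J(theta) ~ mean_k R(w_k, w*) * grad log p_theta(w_k)."""
    losses = []
    for _ in range(num_samples):
        context = torch.tensor([[bos_id]])
        log_prob = 0.0
        for _ in range(max_len):
            logits = model(context)[:, -1, :]          # p(w_t | w_hat_{1:t-1})
            dist = torch.distributions.Categorical(logits=logits)
            w_t = dist.sample()
            log_prob = log_prob + dist.log_prob(w_t)
            context = torch.cat([context, w_t.unsqueeze(0)], dim=1)
            if w_t.item() == eos_id:
                break
        reward = reward_fn(context.squeeze(0).tolist(), target)  # scalar, not differentiated
        losses.append(-reward * log_prob)              # negative because we minimize
    torch.stack(losses).mean().backward()
```

A baseline subtracted from the reward would reduce the variance discussed above; it is omitted here for brevity.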
4.3 Reinforcement Learning as Inference

Recently, a new family of policy gradient methods has been proposed that draws inspiration from inference problems in probabilistic models. The underlying idea is to pull the reward in (5) into a new implicit distribution p̃ that allows to draw samples ŵ with much lower variance, as it is informed about the reward.

We follow Tan et al. (2018), who optimize an entropy-regularized version of (4), a common strategy to foster exploration. They cast the reinforcement learning problem as

J(θ, p̃) = E_{p̃}[ R(w, w⋆) ] − α D_KL(p̃(w) ‖ pθ(w)) + β H(p̃)   (6)

where α, β are hyper-parameters and p̃ is the new non-parametric, variational⁴ distribution across sequences. They show that (6) can be optimized using the following EM updates

E-step:   p̃^{n+1}(w) ∝ exp( (α log pθ^n(w) + R(w, w⋆)) / (α + β) )   (7)
M-step:   θ^{n+1} = argmax_θ E_{p̃^{n+1}}[ log pθ(w) ]   (8)

As Tan et al. (2018) have shown, for α → 0, β = 1 and a specific reward, the framework recovers maximum-likelihood training.⁵ It is explicitly not our goal to claim text generation with end-to-end reinforcement learning, but to show that it is beneficial to operate in an RL regime relatively close to maximum-likelihood.

⁴ In (Tan et al., 2018) p̃ is written as q, which resembles variational distributions in approximate Bayesian inference. However, here p̃ is not defined over variables but datapoints.
⁵ Refer to their work for more special cases, including MIXER (Ranzato et al., 2016).
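To spell out this special case, here is a short sketch of the argument as we reconstruct it from Tan et al. (2018); the delta reward R_δ below is our assumption about which "specific reward" is meant.

```latex
% Sketch: assume the delta reward R_\delta(w, w^\star) = 0 if w = w^\star and -\infty otherwise.
% In the limit \alpha \to 0 with \beta = 1, the E-step (7) becomes
\tilde{p}^{\,n+1}(w)
  \;\propto\; \exp\!\left(\frac{\alpha \log p_\theta^{n}(w) + R_\delta(w, w^\star)}{\alpha+\beta}\right)
  \;\longrightarrow\; \exp\!\big(R_\delta(w, w^\star)\big) \;=\; \mathbb{1}\{w = w^\star\},
% i.e. all probability mass sits on the ground truth, and the M-step (8) reduces to
\theta^{n+1} \;=\; \arg\max_\theta \, \log p_\theta(w^\star),
% which is exactly teacher-forced maximum-likelihood training.
```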

4.4 Optimization with Variational Inference

In conditional generation, a policy is conditioned on a source sentence, which guides generation towards sequences that obtain significant reward. Often, several epochs of MLE pretraining (Rennie et al., 2016; Bahdanau et al., 2016) are necessary to make this guidance effective.

In our unconditional setting, where a source is not available, we employ the latent code z to provide guidance. We cast the policy pθ as a code-conditioned policy pθ(w|z), which is trained to maximize a marginal version of the reward (6):

J(θ) = E_{p_0(z)} E_{pθ(w|z)}[ R(w, w⋆) ] .   (9)

Similar formulations of expected reward have recently been proposed as goal-conditioned policies (Ghosh et al., 2018). However, here it is our explicit goal to also learn the representation of the goal, our latent code. We follow Equation (3) and optimize a lower bound instead of the intractable marginalization (9). Following (Bowman et al., 2015; Fraccaro et al., 2016) we use a deep RNN inference network for q to optimize the bound. The reparametrization trick (Kingma and Welling, 2013) allows us to compute gradients with respect to q. Algorithm 1 shows the outline of the training procedure.

Algorithm 1 Latent ERPO Training
  for w⋆ ∈ DATASET do
    Sample a latent code z ∼ q(z|w⋆)
    Sample a datapoint w̃ ∼ p̃(w|z)
    Perform a gradient step ∇_θ log pθ(w̃|z)
  end for

Note that exploration (sampling w̃) and the gradient step are both conditioned on the latent code, hence stochasticity due to sampling a single z is coupled in both. Also, no gradient needs to be propagated into p̃.

So far, we have not discussed how to efficiently sample from the implicit distribution p̃. In the remainder of this section we present our reward function and discuss the implications on the tractability of sampling.

4.5 Reward

Defining a meaningful reward function is central to the success of reinforcement learning. The usual RL formulations in NLP require a measure of sentence-sentence similarity as reward. Common choices include BLEU (Papineni et al., 2002), ROUGE (Lin, 2004), METEOR (Banerjee and Lavie, 2005) or SPICE (Anderson et al., 2016). These are essentially n-gram metrics, partly augmented with synonym resolution or re-weighting schemes.

Word mover's distance (WMD) (Kusner et al., 2015) provides an interesting alternative based on the optimal-transport problem. In essence, WMD computes the minimum accumulated distance that the word vectors of one sentence need to "travel" to coincide with the word vectors of the other sentence. In contrast to n-gram metrics, WMD can leverage powerful neural word representations. Unfortunately, the complexity of computing WMD is roughly O(T³ log T).

4.6 A Reward for Tractable Sampling

Tan et al. (2018) show that, thanks to the factorization of pθ, the globally-normalized inference distribution p̃ in (7) can be written as a locally-normalized distribution at the word level

p̃(w_t | w_{1:t−1}) ∝ exp( (α log pθ(w_t | w_{1:t−1}) + R_t(w, w⋆)) / (α + β) )   (10)

when the reward is written as an incremental reward R_t defined via R_t(w, w⋆) = R(w_{1:t}, w⋆) − R(w_{1:t−1}, w⋆). Sampling from (10) is still hard if R_t hides dynamic programming routines or other complex time-dependencies. With this in mind, we choose a particularly simple reward

R(w, w⋆) = Σ_{t=1}^{T} φ(w_t)ᵀ φ(w⋆_t)   (11)

where φ is a lookup into a length-normalized, pre-trained but fixed word embedding (Mikolov et al., 2013). This casts our reward as an efficient, yet drastic approximation to WMD, which assumes identical length and one-to-one word correspondences. Putting (10) and (11) together, we sample sequentially from

p̃(w_t | w_{1:t−1}) ∝ exp( (α log pθ(w_t | w_{1:t−1}) + φ(w_t)ᵀ φ(w⋆_t)) / (α + β) )   (12)

with the complexity O(dV) of a standard softmax. Compared to standard VAE training, Algorithm 1 only needs one additional forward pass (with identical complexity) to sample w̃ from p̃.

Equation (12) gives a simple interpretation of our proposed training methodology. We locally correct predictions made by the model proportionally to the distance to the ground truth in the embedding space. Hence, we consider both the ground truth and the model prediction for exploration.
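A minimal sketch of this local sampling step follows (our own PyTorch illustration; `policy_logits`, the embedding matrix and the hyper-parameter values are placeholders). At every position the policy's log-probabilities are mixed with the inner products to the current ground-truth word, as in (12). In the full procedure the logits at step t depend on the previously sampled prefix, so this computation runs inside the decoding loop; here the logits are treated as given.

```python
import torch

def sample_from_p_tilde(policy_logits, embeddings, target_ids, alpha, beta):
    """Sample w~ from eq. (12): p~(w_t | w_{1:t-1}) proportional to
    exp((alpha * log p_theta(w_t | .) + phi(w_t)^T phi(w*_t)) / (alpha + beta)).

    policy_logits: (T, V) logits of p_theta at every position, given the sampled prefix.
    embeddings:    (V, d) length-normalized, fixed word embeddings phi.
    target_ids:    (T,)   ground-truth words w*_1..T.
    """
    log_p = torch.log_softmax(policy_logits, dim=-1)           # (T, V) log p_theta
    sims = embeddings @ embeddings[target_ids].T               # (V, T) phi(w)^T phi(w*_t)
    scores = (alpha * log_p + sims.T) / (alpha + beta)         # (T, V) unnormalized log p~
    return torch.distributions.Categorical(logits=scores).sample()   # (T,) sampled w~_t

# Usage sketch with random placeholders (V = 1000 words, d = 100, T = 20).
V, d, T = 1000, 100, 20
phi = torch.nn.functional.normalize(torch.randn(V, d), dim=-1)   # length-normalized
w_tilde = sample_from_p_tilde(torch.randn(T, V), phi, torch.randint(0, V, (T,)),
                              alpha=0.006, beta=0.067)
```

The sampled w̃ then replaces the ground truth in the gradient step of Algorithm 1.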

[Figure 2: Generalization performance in terms of sequence NLL across latent and deterministic methods. Train and test NLL over training time for OURS, OURS-B, OURS-DET, SS-0.99, SS-0.98, SS-0.95, SS-0.90, SS-0.99-DET, RAML, RAML-DET, VAE, WDROP-0.99 and LM-DET.]

5 Related Work

Our discussion of exposure bias complements recent work that summarizes modern generative models, for example Caccia et al. (2018) and Lu et al. (2018). Shortcomings of maximum-likelihood training for sequence generation have often been discussed (Ding and Soricut, 2017; Leblond et al., 2018; Ranzato et al., 2016), but without pointing to generalization as the key aspect. An overview of recent deep reinforcement learning methods for conditional generation can be found in (Keneshloo et al., 2018).

Our proposed approach follows work by Ding et al. (2017) and Tan et al. (2018) by employing both policy and reward for exploration. In contrast to them, we do not use an n-gram based reward. Compared to RAML (Norouzi et al., 2016), we do not perturb the ground-truth context, but correct the policy predictions. Scheduled sampling (Bengio et al., 2015) and word-dropout (Bowman et al., 2016) also apply a correction, yet one that only affects the probability of the ground truth. Chen et al. (2017) propose Bridge modules that, similarly to Ding et al. (2017), can incorporate arbitrary ground-truth perturbations, yet in an objective motivated by an auxiliary KL-divergence.

Merity et al. (2017) have shown that generalization is crucial to language modeling, but their focus is regularizing parameters and activations. Word embeddings to measure deviations from the ground truth have also been used by Inan et al. (2016), yet under log-likelihood. Concurrently to our work, Li et al. (2019) employ embeddings to design reward functions in abstractive summarization.

6 Experiments

Parametrization   The policies of all our models and all baselines use the same RNN. We use a 256-dimensional GRU (Cho et al., 2014) and 100-dimensional pre-trained word2vec input embeddings. Optimization is performed by Adam (Kingma and Ba, 2014) with an initial learning rate of 0.001 for all models. For all methods, including scheduled sampling, we do not anneal hyper-parameters such as the keep-probability, for the following reasons. First, in an unconditional setting, using only the model's predictions is not a promising setting, so it is unclear what value to anneal to. Second, the continuous search space of schedules makes it sufficiently harder to compare different methods. For the same reason, we do not investigate annealing the KL term or the α, β-parametrization of the models. We use the inference network parametrization of (Bowman et al., 2016), which employs a diagonal Gaussian for q.

We found the training regime to be very sensitive to the α, β-parametrization. In particular, it is easy to pick a set of parameters that does not truly incorporate exploration, but reduces to maximum likelihood training with only ground-truth contexts (see also the discussion of Figure 3 in Section 6.2). After performing a grid search (as done also for RAML) we choose⁶ α = 0.006, β = 0.067 for OURS, the proposed method. In addition, we report results for an alternative model OURS-B with α = 0.01, β = 0.07.

⁶ The scale of α is relatively small as the log-probabilities in (12) have significantly larger magnitude than the inner products, which are in [0, 1] due to the normalization.
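The described setup corresponds roughly to the following sketch (our own PyTorch approximation of the stated sizes; the class name, latent dimension and all other details are assumptions, not the paper's code):

```python
import torch
import torch.nn as nn

class SentenceVAE(nn.Module):
    """Rough sketch of the described setup: a GRU encoder q(z|w) with a diagonal
    Gaussian, and a GRU decoder p_theta(w|z); the 256-d GRU and 100-d fixed
    word2vec embeddings follow the text, everything else is assumed."""
    def __init__(self, vocab_size, pretrained_emb, latent_dim=32, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding.from_pretrained(pretrained_emb, freeze=True)
        emb_dim = pretrained_emb.size(1)                  # 100-dimensional word2vec
        self.encoder = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.to_mu = nn.Linear(hidden_dim, latent_dim)
        self.to_logvar = nn.Linear(hidden_dim, latent_dim)
        self.decoder = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.z_to_h = nn.Linear(latent_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens):
        _, h = self.encoder(self.embed(tokens))           # h: (1, B, hidden)
        mu, logvar = self.to_mu(h[-1]), self.to_logvar(h[-1])
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparametrization trick
        dec_h, _ = self.decoder(self.embed(tokens), self.z_to_h(z).unsqueeze(0))
        return self.out(dec_h), mu, logvar

# Optimizer as described: Adam with initial learning rate 0.001.
model = SentenceVAE(vocab_size=40000, pretrained_emb=torch.randn(40000, 100))
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
logits, mu, logvar = model(torch.randint(0, 40000, (8, 20)))
```

In the proposed training regime the teacher-forced decoder inputs above would be replaced by the sequence w̃ sampled from p̃ (Algorithm 1); this sketch only mirrors the shared parametrization of the baselines.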

Data   For our experiments, we use a one-million-sentence subset of the BooksCorpus (Kiros et al., 2015; Zhu et al., 2015) with a 90-10 train-test split and a 40K-word vocabulary. The corpus size is chosen to challenge the above policy with both scenarios, overfitting and underfitting.

6.1 Baselines

As baselines we use a standard VAE and a VAE with RAML decoding that uses the identical reward as our method (see Tan et al. (2018) for details on RAML as a special case). Furthermore, we use two regularizations of the standard VAE, scheduled sampling SS-P and word-dropout WDROP-P as proposed by Bowman et al. (2016), both with a fixed probability p of using the ground truth.

In addition, we report as special cases with z = 0 results for our model (OURS-DET), RAML (RAML-DET), scheduled sampling (SS-P-DET), and the VAE (LM, a language model).

6.2 Results

Figure 2 shows training and test negative sequence log-likelihood evaluated during training and Table 1 shows the best performance obtained. All figures and tables are averaged across three runs.

Model          Train NLL   Test NLL
OURS           48.52       52.54
OURS-B         49.51       52.61
OURS-DET       48.06       52.87
SS-0.99        48.11       52.60
SS-0.98        48.21       52.62
SS-0.95        48.38       52.69
SS-0.90        49.02       52.89
SS-0.99-DET    48.08       52.90
RAML           48.26       52.56
RAML-DET       48.26       52.86
WDROP-0.99     48.19       52.86
LM             47.65       53.01
VAE            47.86       52.66
WDROP-0.9      50.86       54.65

Table 1: Training and test performance

We observe that all latent models outperform their deterministic counterparts (crossed curves) in terms of both generalization and overall test performance. This is not surprising, as regularization is one of the benefits of modeling uncertainty through latent variables. Scheduled sampling does improve generalization for p ≈ 1, with diminishing returns at p = 0.95, and in general performed better than word dropout. Our proposed models outperform all others in terms of generalization and test performance. Note that the performance difference over RAML, the second best method, is solely due to also incorporating model-predicted contexts during training.

Despite some slightly improved performance, all latent models except for OURS-B have a KL term relatively close to zero. OURS-B is α-β-parametrized to incorporate slightly more model predictions at a higher temperature and manages to achieve a KL term of about 1 to 1.5 bits. These findings are similar to what (Bowman et al., 2016) report with annealing, but still significantly behind work that addresses this specific problem (Yang et al., 2017; Shen et al., 2018). Appendix A illustrates how our models can obtain larger KL terms – yet at degraded performance – by controlling exploration. We conclude that improved autoregressive modeling inside the ERPO framework cannot alone overcome VAE collapse.

We have discussed many approaches that deviate from training exclusively on ground-truth contexts. Therefore, an interesting quantity to monitor across methods is the fraction of words that correspond to the ground truth. Figure 3 shows these fractions during training for the configurations that gave the best results. Interestingly, in the latent setting our method relies by far the least on ground-truth contexts, whereas in the deterministic setting the difference is small.

[Figure 3: Fraction of correct words during training. Numbers include forced and correctly predicted words. Curves over training time for OURS, OURS-B, OURS-DET, RAML, SS-0.98, SS-0.99, SS-0.95 and LM/VAE.]

7 Conclusion

We have argued that exposure bias does not point to a problem with the standard methodology of training autoregressive sequence models. Instead, it refers to a notion of generalization to unseen sequences that does not manifest in log-likelihood training and testing, yet might be desirable in order to capture the flexibility of natural language.

To rigorously incorporate the desired generalization behavior, we have proposed to follow the reinforcement learning formulation of Tan et al. (2018).

Combined with an embedding-based reward function, we have shown excellent generalization performance compared to the unregularized model and better generalization than existing techniques on language modeling and sentence autoencoding.

Future work   We have shown that the simple reward function proposed here leads to a form of regularization that fosters generalization when evaluated inside the maximum-likelihood framework. In the future, we hope to conduct a human evaluation to assess the generalization capabilities of models trained under maximum-likelihood and reinforcement learning more rigorously. Only such a framework-independent evaluation can reveal the true gains of carefully designing reward functions compared to simply performing maximum-likelihood training.

References

Alexander A. Alemi, Ben Poole, Ian Fischer, Joshua V. Dillon, Rif A. Saurous, and Kevin Murphy. 2017. An information-theoretic analysis of deep latent-variable models. CoRR, abs/1711.00464.

Peter Anderson, Basura Fernando, Mark Johnson, and Stephen Gould. 2016. SPICE: Semantic propositional image caption evaluation. CoRR, abs/1607.08822.

Dzmitry Bahdanau, Philemon Brakel, Kelvin Xu, Anirudh Goyal, Ryan Lowe, Joelle Pineau, Aaron C. Courville, and Yoshua Bengio. 2016. An actor-critic algorithm for sequence prediction. CoRR, abs/1607.07086.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In ICLR.

Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72, Ann Arbor, Michigan. Association for Computational Linguistics.

Justin Bayer and Christian Osendorfer. 2014. Learning stochastic recurrent networks. arXiv preprint arXiv:1411.7610.

Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. 2015. Scheduled sampling for sequence prediction with recurrent neural networks. In NIPS.

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In EMNLP.

Samuel R. Bowman, Luke Vilnis, Oriol Vinyals, Andrew M. Dai, Rafal Józefowicz, and Samy Bengio. 2016. Generating sentences from a continuous space. In ACL.

Massimo Caccia, Lucas Caccia, William Fedus, Hugo Larochelle, Joelle Pineau, and Laurent Charlin. 2018. Language GANs falling short. CoRR, abs/1811.02549.

Tian Qi Chen, Xuechen Li, Roger B. Grosse, and David Duvenaud. 2018. Isolating sources of disentanglement in variational autoencoders. CoRR, abs/1802.04942.

Wenhu Chen, Guanlin Li, Shujie Liu, Zhirui Zhang, Mu Li, and Ming Zhou. 2017. Neural sequence prediction by coaching. CoRR, abs/1706.09152.

Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In EMNLP, pages 1724–1734.

Nan Ding and Radu Soricut. 2017. Cold-start reinforcement learning with softmax policy gradients. CoRR, abs/1709.09346.

William Fedus, Ian J. Goodfellow, and Andrew M. Dai. 2018. MaskGAN: Better text generation via filling in the ______. In ICLR.

Katja Filippova, Enrique Alfonseca, Carlos A. Colmenares, Lukasz Kaiser, and Oriol Vinyals. 2015. Sentence compression by deletion with LSTMs. In EMNLP 2015, pages 360–368.

Marco Fraccaro, Søren Kaae Sønderby, Ulrich Paquet, and Ole Winther. 2016. Sequential neural models with stochastic layers. In NIPS, pages 2199–2207.

Dibya Ghosh, Abhishek Gupta, and Sergey Levine. 2018. Learning actionable representations with goal-conditioned policies. CoRR, abs/1811.07819.

Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press.

Anirudh Goyal, Alex Lamb, Ying Zhang, Saizheng Zhang, Aaron C. Courville, and Yoshua Bengio. 2016. Professor forcing: A new algorithm for training recurrent networks. In NIPS.

Anirudh Goyal, Alessandro Sordoni, Marc-Alexandre Côté, Nan Rosemary Ke, and Yoshua Bengio. 2017. Z-forcing: Training stochastic recurrent networks. In NIPS.

Tianxing He, Jingzhao Zhang, Zhiming Zhou, and James R. Glass. 2019. Quantifying exposure bias for neural language generation. CoRR, abs/1905.10617.

Irina Higgins, Loïc Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew M. Botvinick, Shakir Mohamed, and Alexander Lerchner. 2017. beta-VAE: Learning basic visual concepts with a constrained variational framework. In ICLR.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Ari Holtzman, Jan Buys, Maxwell Forbes, and Yejin Choi. 2019. The curious case of neural text degeneration. CoRR, abs/1904.09751.

Z. Hu, Z. Yang, X. Liang, R. Salakhutdinov, and E. R. Xing. 2017. Toward controlled generation of text. In International Conference on Machine Learning (ICML).

Ferenc Huszár. 2015. How (not) to train your generative model: Scheduled sampling, likelihood, adversary?

Hakan Inan, Khashayar Khosravi, and Richard Socher. 2016. Tying word vectors and word classifiers: A loss framework for language modeling. ArXiv, abs/1611.01462.

Yaser Keneshloo, Tian Shi, Naren Ramakrishnan, and Chandan K. Reddy. 2018. Deep reinforcement learning for sequence to sequence models. CoRR, abs/1805.09461.

Urvashi Khandelwal, Kevin Clark, Dan Jurafsky, and Lukasz Kaiser. 2019. Sample efficient text summarization using a single pre-trained transformer. CoRR, abs/1905.08836.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. In ICLR.

Diederik P. Kingma and Max Welling. 2013. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.

Ryan Kiros, Yukun Zhu, Ruslan Salakhutdinov, Richard S. Zemel, Antonio Torralba, Raquel Urtasun, and Sanja Fidler. 2015. Skip-thought vectors. arXiv preprint arXiv:1506.06726.

Matt J. Kusner, Yu Sun, Nicholas I. Kolkin, and Kilian Q. Weinberger. 2015. From word embeddings to document distances. In ICML, pages 957–966.

Rémi Leblond, Jean-Baptiste Alayrac, Anton Osokin, and Simon Lacoste-Julien. 2018. SEARNN: Training RNNs with global-local losses. In ICLR.

Siyao Li, Deren Lei, Pengda Qin, and William Wang. 2019. Deep reinforcement learning with distributional semantic rewards for abstractive summarization.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. page 10.

Sidi Lu, Yaoming Zhu, Weinan Zhang, Jun Wang, and Yong Yu. 2018. Neural text generation: Past, present and beyond. CoRR, abs/1803.07133.

Stephen Merity, Nitish Shirish Keskar, and Richard Socher. 2017. Regularizing and optimizing LSTM language models. CoRR, abs/1708.02182.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In NIPS.

Kevin P. Murphy. 2012. Machine Learning: A Probabilistic Perspective. The MIT Press.

Mohammad Norouzi, Samy Bengio, Zhifeng Chen, Navdeep Jaitly, Mike Schuster, Yonghui Wu, and Dale Schuurmans. 2016. Reward augmented maximum likelihood for neural structured prediction. CoRR, abs/1609.00150.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation.

Tom Rainforth, Adam R. Kosiorek, Tuan Anh Le, Chris J. Maddison, Maximilian Igl, Frank Wood, and Yee Whye Teh. 2018. Tighter variational bounds are not necessarily better. In ICML.

Marc'Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. 2016. Sequence level training with recurrent neural networks. In ICLR.

Steven J. Rennie, Etienne Marcheret, Youssef Mroueh, Jerret Ross, and Vaibhava Goel. 2016. Self-critical sequence training for image captioning. CoRR, abs/1612.00563.

Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. 2014. Stochastic back-propagation and variational inference in deep latent Gaussian models. In ICML.

Alexander M. Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. In EMNLP.

Florian Schmidt and Thomas Hofmann. 2018. Deep state space models for unconditional word generation. In NeurIPS.

Iulian Vlad Serban, Alessandro Sordoni, Yoshua Bengio, Aaron C. Courville, and Joelle Pineau. 2016. Building end-to-end dialogue systems using generative hierarchical neural network models. In AAAI.

Xiaoyu Shen, Hui Su, Shuzi Niu, and Vera Demberg. 2018. Improving variational encoder-decoders in dialogue generation. CoRR, abs/1802.02032.

Bowen Tan, Zhiting Hu, Zichao Yang, Ruslan Salakhutdinov, and Eric P. Xing. 2018. Connecting the dots between MLE and RL for sequence generation. CoRR, abs/1811.09740.

Michael Tschannen, Olivier Bachem, and Mario Lucic. 2018. Recent advances in autoencoder-based representation learning. CoRR, abs/1812.05069.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In NIPS.

Ronald J. Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256.

Zichao Yang, Zhiting Hu, Ruslan Salakhutdinov, and Taylor Berg-Kirkpatrick. 2017. Improved variational autoencoders for text modeling using dilated convolutions. CoRR, abs/1702.08139.

Cheng Zhang, Judith Butepage, Hedvig Kjellstrom, and Stephan Mandt. 2018. Advances in variational inference. IEEE Transactions on Pattern Analysis and Machine Intelligence.

Yukun Zhu, Ryan Kiros, Richard Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. arXiv preprint arXiv:1506.06724.

Zachary M. Ziegler and Alexander M. Rush. 2019. Latent normalizing flows for discrete sequences. arXiv preprint arXiv:1901.10548.
