Journal of Machine Learning Research 21 (2020) 1-41 Submitted 4/20; Revised 10/20; Published 11/20 Residual Energy-Based Models for Text Anton Bakhtin∗∗ Yuntian Deng∗F Sam Gross Myle Ott Marc’Aurelio Ranzato Arthur Szlam {yolo,sgross,myleott,ranzato,aszlam}@fb.com
[email protected] Facebook AI Research FHarvard University 770 Broadway, New York, NY 10003 33 Oxford St., Cambridge, MA 02138 Editor: Samy Bengio Abstract Current large-scale auto-regressive language models (Radford et al., 2019; Liu et al., 2018; Graves, 2013) display impressive fluency and can generate convincing text. In this work we start by asking the question: Can the generations of these models be reliably distinguished from real text by statistical discriminators? We find experimentally that the answer is affirmative when we have access to the training data for the model, and guardedly affirmative even if we do not. This suggests that the auto-regressive models can be improved by incorporating the (globally normalized) discriminators into the generative process. We give a formalism for this using the Energy-Based Model framework, and show that it indeed improves the results of the generative models, measured both in terms of perplexity and in terms of human evaluation. Keywords: energy-based models, text generation, negative sampling, importance sampling, generalization, real/fake discrimination 1. Introduction Energy-based models (EBMs) have a long history in machine learning (Hopfield, 1982; Hinton, 2002a; LeCun et al., 2006), especially in the image domain (Teh et al., 2003; Ranzato et al., 2013). Their appeal stems from the minimal assumptions they make about the generative process of the data: they are a strict generalization of probability models, as the energy function need not be normalized or even have convergent integral.