Label-Conditioned Next-Frame Video Generation with Neural Flows
David Donahue
University of Massachusetts Lowell, Lowell, USA
david [email protected]

Abstract—Recent state-of-the-art video generation systems employ Generative Adversarial Networks (GANs) or Variational Autoencoders (VAEs) to produce novel videos. However, VAE models typically produce blurry outputs when faced with suboptimal conditioning of the input, and GANs are known to be unstable for large output sizes. In addition, the output videos of these models are difficult to evaluate, partly because the GAN loss function is not an accurate measure of convergence. In this work, we propose using a state-of-the-art neural flow generator called Glow to generate videos conditioned on a textual label, one frame at a time. Neural flow models are more stable than standard GANs, as they optimize a single cross entropy loss function, which is monotonic and avoids the circular convergence issues of the GAN minimax objective. We also show how to condition Glow on external context while preserving the invertible nature of each "flow" layer. Finally, we evaluate the proposed Glow model by calculating cross entropy on a held-out validation set of videos, comparing multiple versions of the proposed model via an ablation study. We show generated videos and discuss future improvements.

Index Terms—neural flow, glow, video generation, evaluation

I. INTRODUCTION

Text-to-video generation is the process by which a model conditions on text and produces a video matching that text description. It is the inverse of video captioning, which aims to produce a caption describing a given video [6]. It may be argued that text-to-video is the harder task, as there are many more degrees of freedom in pixel space. In addition, video generation is more complex than text-to-image generation [8], since a video can be viewed as a collection of images, making video generation a superset of the image generation problem.

Video generation has many applications, including the automatic generation of animation in educational settings, the realistic generation of sprite movement in video games, and other auto-generated entertainment. Reliable video frame generation could play a significant role in the media of tomorrow.

Recently, Generative Adversarial Networks (GANs) have demonstrated promising performance in the domain of real-valued output generation [7]. Some works have begun to apply this generative architecture to conditional video generation, with promising results [1]. However, these GAN systems typically produce all frames of a video at once, in order to ensure coherence across frames and sufficient error propagation between them. This is expensive because all video frames, as well as all the intermediate hidden activations needed to produce them, must be kept in memory until the final backpropagation step.

In this work, we explore the use of neural flows as a method of video generation in order to address the following issues:

1) Existing Generative Adversarial Networks must generate the entire video at once. We propose to alleviate this problem by introducing a neural flow trained to minimize cross entropy. This model generates only the next frame, conditioned on all previous frames, and therefore requires only a context encoder over previous frames, with significantly less memory usage.

2) Existing models are difficult to evaluate, as GAN-generated instances must currently be judged by their perceived quality, a qualitative criterion. Neural flows instead allow for direct estimation of the probability density of samples, enabling a quantitative evaluation scheme: measuring the log-probability of examples in a held-out validation set.

3) Generative Adversarial Networks, as well as variational autoencoders, are unstable without heavy hyperparameter tuning. We demonstrate the stability of neural flow architectures over previous methods, and in turn show examples of instability and how to mitigate them.

The outline of this paper is as follows.¹ We introduce recent work in video generation, followed by a description of the system architecture to be implemented. We then include a quantitative cross entropy evaluation of the system, and conduct an ablation study to assess the effectiveness of each proposed feature. Finally, we support our results with visual frames of generated videos, as well as plots of internal model values that explain model behavior. We conclude with future improvements and necessary next steps. Our contributions are as follows:

• We utilize the existing Glow neural flow architecture for video generation
• We demonstrate how to condition Glow on label features and on a previous frame state
• We perform an ablation study of improvements to the architecture
• We propose potential future improvements to help convergence and sample quality

¹Computer vision class project

Fig. 1: Proposed Glow video generation model. The model converts dataset image samples into points in Gaussian space, then maximizes the probability of images in that space. During inference, a Gaussian point is sampled and converted back to a realistic image. In this work, we replace images with video frames and add context dependence on previous frames in the video.

II. RECENT WORK

Compared to other fields of computer vision, such as object classification and detection [19], image generation has seen slower progress. An older choice for video generation was the variational autoencoder (VAE), which utilizes an autoencoder with a KL loss component to learn and sample from a continuous latent space, trained by maximizing a lower bound on the likelihood of the data [15]. The KL loss compactifies the latent space by forcing all training points toward the origin. During inference, a point can be sampled from a Gaussian centered on the origin and decoded into a new sample. However, the compactification of the latent space introduces a loss of information that produces blurry reconstructions. As such, the VAE struggles with the generation of high-quality samples.

Recently, Generative Adversarial Networks have made great strides in a variety of areas [7]. They work by instantiating two models: a generator network and a discriminator network. The generator attempts to produce samples, which the discriminator then judges as real (from the dataset) or fake (from the generator). Error propagation from the discriminator is used to train the generator to produce more realistic samples. Most work on GANs focuses on image synthesis, but GANs can theoretically generate any continuous-valued output [10]. In the context of video generation, the generator would produce video frames, which the discriminator would then take as input and judge as real or fake. GAN networks can also be conditioned on input, as in the case of text-to-image generation [8].

A generative architecture that has received less attention than the Generative Adversarial Network is the neural flow [16]. Referred to in that paper as RealNVP, this neural flow allows for exact estimation of the probability density of image examples, and allows for direct generation of new images from a Gaussian latent space. The model is trained via a cross entropy loss function similar to that of recent language generation models [17], potentially providing more stable training than the minimax objective of Generative Adversarial Networks.

More recently, the Glow architecture was proposed as a modification of the RealNVP neural flow. The Glow model introduces several improvements, such as additive coupling layers and invertible 1x1 convolutions, to speed up training and increase model capacity for better generation [18]. The authors apply their model to the task of celebrity image generation. While this model offers features that the GAN does not possess, such as exact estimation of sample densities, it enforces the strict constraint that layers inside the model (flow layers) be invertible and have easily calculable log-determinants. Still, the Glow architecture shows promising results in sample quality.

III. THE NEURAL FLOW

Here we give a brief background on neural flows [16]. For a given dataset of images $x \in D$ assumed to come from some distribution $p(x)$, we initialize a neural flow to approximate that distribution as $q_\theta(x)$, then maximize the log-likelihood of the $N$ dataset examples under this model by minimizing the cross entropy objective function:

$$L = -\frac{1}{N} \sum_{i=1}^{N} \log q_\theta(x_i) \tag{1}$$

This minimization has been shown to be equivalent to minimizing the KL-divergence between $p(x)$ and $q_\theta(x)$. The objective is identical to that of existing language generation models, with the exception that the image domain is continuous.
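To make Eq. (1) concrete, the sketch below computes this objective for a batch of frames in PyTorch. The `flow` interface here is our own assumption for illustration: it is taken to map a batch $x$ to its latent representation $z$ together with the accumulated log-determinant of all flow layers.

```python
import math
import torch

def cross_entropy_loss(flow, x):
    """Eq. (1): average negative log-likelihood of a batch under q_theta.

    `flow` is a hypothetical module mapping x -> (z, log_det), where
    log_det sums the log |det| contributions of every flow layer.
    """
    z, log_det = flow(x)
    # Log-density of z under the standard Gaussian prior, summed over
    # the channel and spatial dimensions of each sample.
    log_pz = (-0.5 * z.pow(2) - 0.5 * math.log(2 * math.pi)).sum(dim=[1, 2, 3])
    # Change of variables: log q_theta(x) = log p(z) + log |det df/dx|.
    log_qx = log_pz + log_det
    return -log_qx.mean()
```

The same quantity, computed on a held-out validation set, is the quantitative evaluation metric described in the introduction.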
A neural flow can model a continuous distribution by transforming a simple Gaussian distribution into a complex continuous distribution of almost any form. The model approximates $p(x)$ as follows. It is composed of a series of invertible transformations:

$$x \overset{f_1}{\longleftrightarrow} h_1 \overset{f_2}{\longleftrightarrow} h_2 \ \cdots\ \overset{f_n}{\longleftrightarrow} z \tag{2}$$

We refer to the combined transformation as $z = f_\theta(x)$ and the inverse transformation as $x = f_\theta^{-1}(z)$. We assign a Gaussian distribution to all points in $z$-space and calculate the probability $p(x)$ of any image point $x$ as follows. We restrict each invertible flow layer such that its log-determinant is easily calculable; then, by the change of variables formula, we compute the corresponding $z = f_\theta(x)$ and the log-probability

$$\log q_\theta(x) = \log p_z(f_\theta(x)) + \sum_{i=1}^{n} \log \left| \det \frac{\partial h_i}{\partial h_{i-1}} \right|$$

where $h_0 = x$ and $h_n = z$.
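As an illustration of such a restricted layer, the following sketch implements one additive coupling transformation of the kind Glow builds on: half of the channels pass through unchanged and parameterize a shift of the other half. This is a minimal sketch of the general technique, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class AdditiveCoupling(nn.Module):
    """One invertible flow layer f_i from Eq. (2), sketched as an
    additive coupling: the first half of the channels is left unchanged
    and used to predict a shift for the second half."""

    def __init__(self, channels):
        super().__init__()
        # Small conv net t(.) predicting the shift; any network works here,
        # because t itself never needs to be inverted.
        self.t = nn.Sequential(
            nn.Conv2d(channels // 2, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, channels // 2, 3, padding=1),
        )

    def forward(self, x):
        x1, x2 = x.chunk(2, dim=1)
        y2 = x2 + self.t(x1)
        # The Jacobian is triangular with a unit diagonal, so the
        # log-determinant of this layer is exactly zero.
        log_det = x.new_zeros(x.shape[0])
        return torch.cat([x1, y2], dim=1), log_det

    def inverse(self, y):
        y1, y2 = y.chunk(2, dim=1)
        x2 = y2 - self.t(y1)
        return torch.cat([y1, x2], dim=1)
```

Stacking many such layers, interleaved with Glow's invertible 1x1 convolutions, yields the full transformation $f_\theta$ while keeping every log-determinant term in the sum above trivial to compute.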
In the proposed system, one flow generates the opening frame of a video, while a separate "tail" model generates each subsequent frame conditioned on the frames before it. This architecture choice is motivated by the fact that the first frame is the most difficult to generate, as it sets up the scene. Subsequent frames work to animate the first frame and continue the plot.

One problem with the proposed tail model is that it must condition on previous frame information in order to generate the next frame. This is a problem because the original Glow model does not specify a mechanism for conditioning on context information, but rather is applied to unconditional celebrity image generation [18].
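One way to add such conditioning while keeping every layer invertible, sketched below under our own assumptions rather than as the paper's verified mechanism, is to feed a context vector (e.g., an encoding of the label and previous frames) into the coupling network. The shift then depends on both the pass-through channels and the context, yet the layer remains invertible because the same context is available in both directions.

```python
import torch
import torch.nn as nn

class ConditionalCoupling(nn.Module):
    """Additive coupling conditioned on external context (a hypothetical
    variant; the label/previous-frame encoder is assumed, not specified
    by the original Glow paper)."""

    def __init__(self, channels, context_dim):
        super().__init__()
        self.t = nn.Sequential(
            nn.Conv2d(channels // 2 + context_dim, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, channels // 2, 3, padding=1),
        )

    def _shift(self, x1, context):
        # Broadcast the context vector over spatial positions and
        # concatenate it with x1 as extra input channels.
        b, _, h, w = x1.shape
        ctx = context[:, :, None, None].expand(b, -1, h, w)
        return self.t(torch.cat([x1, ctx], dim=1))

    def forward(self, x, context):
        x1, x2 = x.chunk(2, dim=1)
        y2 = x2 + self._shift(x1, context)
        return torch.cat([x1, y2], dim=1)

    def inverse(self, y, context):
        y1, y2 = y.chunk(2, dim=1)
        x2 = y2 - self._shift(y1, context)
        return torch.cat([y1, x2], dim=1)

# Hypothetical usage: generate (one layer of) the next frame by sampling
# from the Gaussian prior and inverting the conditional layer.
layer = ConditionalCoupling(channels=8, context_dim=16)
z = torch.randn(1, 8, 32, 32)        # sampled Gaussian point
ctx = torch.randn(1, 16)             # label / previous-frame encoding
next_frame = layer.inverse(z, ctx)   # one step of f_theta^{-1}
```

Because the context enters only through the shift network $t(\cdot)$, the log-determinant computation is unchanged, preserving the exact likelihood evaluation that motivates the use of a neural flow.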