Parallel Multiscale Autoregressive Density Estimation
Scott Reed, Aäron van den Oord, Nal Kalchbrenner, Sergio Gómez Colmenarejo, Ziyu Wang, Yutian Chen, Dan Belov, Nando de Freitas

DeepMind. Correspondence to: Scott Reed <reedscot@google.com>.

Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, PMLR 70, 2017. Copyright 2017 by the author(s).

Abstract

PixelCNN achieves state-of-the-art results in density estimation for natural images. Although training is fast, inference is costly, requiring one network evaluation per pixel; O(N) for N pixels. This can be sped up by caching activations, but still involves generating each pixel sequentially. In this work, we propose a parallelized PixelCNN that allows more efficient inference by modeling certain pixel groups as conditionally independent. Our new PixelCNN model achieves competitive density estimation and orders of magnitude speedup (O(log N) sampling instead of O(N)), enabling the practical generation of 512×512 images. We evaluate the model on class-conditional image generation, text-to-image synthesis, and action-conditional video generation, showing that our model achieves the best results among non-pixel-autoregressive density models that allow efficient sampling.

Figure 1. Samples from our model at resolutions from 4×4 to 256×256, conditioned on the text "A yellow bird with a black head, orange eyes and an orange bill" and bird part locations in the CUB data set. See Fig. 4 and the supplement for more examples.

1. Introduction

Many autoregressive image models factorize the joint distribution of images into per-pixel factors:

    p(x_{1:T}) = ∏_{t=1}^{T} p(x_t | x_{1:t−1})    (1)

For example, PixelCNN (van den Oord et al., 2016b) uses a deep convolutional network with carefully designed filter masking to preserve causal structure, so that all factors in equation 1 can be learned in parallel for a given image. However, a remaining difficulty is that due to the learned causal structure, inference proceeds sequentially pixel-by-pixel in raster order. In the naive case, this requires a full network evaluation per pixel. Caching hidden unit activations can be used to reduce the amount of computation per pixel, as in the 1D case for WaveNet (Oord et al., 2016; Ramachandran et al., 2017). However, even with this optimization, generation is still in serial order by pixel.

Ideally we would generate multiple pixels in parallel, which could greatly accelerate sampling. In the autoregressive framework this only works if the pixels are modeled as independent. Thus we need a way to judiciously break weak dependencies among pixels; for example, immediately neighboring pixels should not be modeled as independent, since they tend to be highly correlated.

Multiscale image generation provides one such way to break weak dependencies. In particular, we can model certain groups of pixels as conditionally independent given a lower-resolution image and various types of context information, such as preceding frames in a video. The basic idea is obvious, but nontrivial design problems stand between the idea and a workable implementation.

First, what is the right way to transmit global information from a low-resolution image to each generated pixel of the high-resolution image? Second, which pixels can we generate in parallel? And given that choice, how can we avoid border artifacts when merging sets of pixels that were generated in parallel, blind to one another?
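For concreteness, the naive O(N) per-pixel sampling cost discussed above can be sketched as follows. This is an illustrative stand-in, not the paper's implementation: `model_logits` is a hypothetical function representing one full network evaluation that maps the partially generated image to logits over the next pixel's value.

```python
import numpy as np

def sample_raster_order(model_logits, height, width, num_vals=256, seed=0):
    """Naive autoregressive sampling per equation 1: one full network
    evaluation per pixel, generated sequentially in raster order.
    `model_logits` is a hypothetical stand-in for the network."""
    rng = np.random.default_rng(seed)
    img = np.zeros((height, width), dtype=np.int64)
    for t in range(height * width):       # N strictly sequential steps
        r, c = divmod(t, width)
        logits = model_logits(img, r, c)  # full network evaluation per pixel
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        img[r, c] = rng.choice(num_vals, p=probs)
    return img

# Toy stand-in model: uniform logits over the 256 pixel values.
uniform = lambda img, r, c: np.zeros(256)
sample = sample_raster_order(uniform, 4, 4)
```

Even with cached activations reducing the cost of each `model_logits` call, the loop itself remains serial: pixel t cannot be drawn before pixels 1..t−1.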
In this work we show how a very substantial portion of the spatial dependencies in PixelCNN can be cut, with only modest degradation in performance. Our formulation allows sampling in O(log N) time for N pixels, instead of O(N) as in the original PixelCNN, resulting in orders of magnitude speedup in practice. In the case of video, in which we have access to high-resolution previous frames, we can even sample in O(1) time, with much better performance than comparably-fast baselines.

At a high level, the proposed approach can be viewed as a way to merge per-pixel factors in equation 1. If we merge the factors for, e.g., x_i and x_j, then that dependency is "cut", so the model becomes slightly less expressive. However, we get the benefit of now being able to sample x_i and x_j in parallel. If we divide the N pixels into G groups of T pixels each, the joint distribution can be written as a product of the corresponding G factors:

    p(x^{1:G}_{1:T}) = ∏_{g=1}^{G} p(x^{(g)}_{1:T} | x^{(1:g−1)}_{1:T})    (2)

Above we assumed that each of the G groups contains exactly T pixels, but in practice the number can vary. In this work, we form pixel groups from successively higher-resolution views of an image, arranged into a sub-sampling pyramid, such that G ∈ O(log N).

In section 3 we describe this group structure implemented as a deep convolutional network. In section 4 we show that the model excels in density estimation and can produce quality high-resolution samples at high speed.

2. Related work

Deep neural autoregressive models have been applied to image generation for many years, showing promise as a tractable yet expressive density model (Larochelle & Murray, 2011; Uria et al., 2013). Autoregressive LSTMs have been shown to produce state-of-the-art performance in density estimation on large-scale datasets such as ImageNet (Theis & Bethge, 2015; van den Oord et al., 2016a). Causally-structured convolutional networks such as PixelCNN (van den Oord et al., 2016b) and WaveNet (Oord et al., 2016) improved the speed and scalability of training. These led to improved autoregressive models for video generation (Kalchbrenner et al., 2016b) and machine translation (Kalchbrenner et al., 2016a).

Non-autoregressive convolutional generator networks have been successful and widely adopted for image generation as well. Instead of maximizing likelihood, Generative Adversarial Networks (GANs) train a generator network to fool a discriminator network adversary (Goodfellow et al., 2014). These networks have been used in a wide variety of conditional image generation schemes, such as text- and spatial-structure-to-image (Mansimov et al., 2015; Reed et al., 2016b;a; Wang & Gupta, 2016).

The addition of multiscale structure has also been shown to be useful in adversarial networks. Denton et al. (2015) used a Laplacian pyramid to generate images in a coarse-to-fine manner. Zhang et al. (2016) composed a low-resolution and a high-resolution text-conditional GAN, yielding higher-quality 256×256 bird and flower images.

Generator networks can be combined with a trained model, such as an image classifier or captioning network, to generate high-resolution images via optimization and sampling procedures (Nguyen et al., 2016). Wu et al. (2017) state that it is difficult to quantify GAN performance, and propose Monte Carlo methods to approximate the log-likelihood of GANs on MNIST images.

Both autoregressive and non-autoregressive deep networks have recently been applied successfully to image super-resolution. Shi et al. (2016) developed a sub-pixel convolutional network well-suited to this problem. Dahl et al. (2017) use a PixelCNN as a prior for image super-resolution with a convolutional neural network. Johnson et al. (2016) developed a perceptual loss function useful for both style transfer and super-resolution. GAN variants have also been successful in this domain (Ledig et al., 2016; Sønderby et al., 2017).

Several other deep, tractable density models have recently been developed. Real NVP (Dinh et al., 2016) learns a mapping from images to a simple noise distribution, which is by construction trivially invertible. It is built from smaller invertible blocks called coupling layers, whose Jacobian is lower-triangular, and also has a multiscale structure. Inverse Autoregressive Flows (Kingma & Salimans, 2016) use autoregressive structures in the latent space to learn more flexible posteriors for variational autoencoders. Autoregressive models have also been combined with VAEs as decoder models (Gulrajani et al., 2016).

The original PixelRNN paper (van den Oord et al., 2016a) actually included a multiscale autoregressive version, in which PixelRNNs or PixelCNNs were trained at multiple resolutions. The network producing a given resolution image was conditioned on the image at the next lower resolution. This work is similarly motivated by the usefulness of multiscale image structure (and the very long history of coarse-to-fine modeling).

Our novel contributions in this work are (1) asymptotically and empirically faster inference by modeling conditional independence structure, (2) scaling to much higher resolution, and (3) evaluating the model on a diverse set of challenging benchmarks, including class-, text- and structure-conditional image generation and video generation.

3. Model

The main design principle that we follow in building the model is a coarse-to-fine ordering of pixels. Successively higher-resolution frames are generated conditioned on the previous resolution (see for example Figure 1).

The division into non-overlapping patches may appear to risk border artifacts when merging. However, this does not occur, for several reasons. First, each predicted pixel is directly adjacent to several context pixels fed into the upscaling network. Second, the generated patches are not directly adjacent in the 2K×2K output image; there is always a row or column of pixels on the border of any pair.

Note that the only learnable portions of the upscaling module are (1) the ResNet encoder of context pixels, and (2) the
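The coarse-to-fine ordering can be made concrete with a small counting sketch. This is an illustration of why G ∈ O(log N), not the paper's network: it assumes (hypothetically) that the base-resolution image forms one group and that each doubling of resolution adds a constant number of parallel pixel groups, e.g. a few interleaved subsets of the newly introduced pixels.

```python
import math

def pyramid_groups(base=4, target=256, groups_per_doubling=3):
    """Count pixel groups in a coarse-to-fine sub-sampling pyramid.
    Illustrative assumption: one group for the base image, plus a
    constant number of groups per resolution doubling, so the total
    number of groups G grows as O(log N) in the target pixel count N."""
    groups = 1          # the base-resolution image itself
    res = base
    while res < target:
        groups += groups_per_doubling
        res *= 2        # each doubling is one coarse-to-fine step
    return groups

n_pixels = 256 * 256
g = pyramid_groups()    # 4x4 -> 256x256 takes six doublings
```

With G groups and all pixels within a group sampled in parallel, sampling cost scales with G rather than with N, which is the source of the O(log N) speedup claimed in the introduction.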