DRAW: a Recurrent Neural Network for Image Generation

DRAW: A Recurrent Neural Network For Image Generation Karol Gregor [email protected] Ivo Danihelka [email protected] Alex Graves [email protected] Danilo Jimenez Rezende [email protected] Daan Wierstra [email protected] Google DeepMind Abstract This paper introduces the Deep Recurrent Atten- tive Writer (DRAW) neural network architecture for image generation. DRAW networks combine a novel spatial attention mechanism that mimics the foveation of the human eye, with a sequential variational auto-encoding framework that allows for the iterative construction of complex images. The system substantially improves on the state of the art for generative models on MNIST, and, when trained on the Street View House Numbers dataset, it generates images that cannot be distin- guished from real data with the naked eye. 1. Introduction Time A person asked to draw, paint or otherwise recreate a visual scene will naturally do so in a sequential, iterative fashion, Figure 1. A trained DRAW network generating MNIST dig- reassessing their handiwork after each modification. Rough its. Each row shows successive stages in the generation of a sin- outlines are gradually replaced by precise forms, lines are gle digit. Note how the lines composing the digits appear to be sharpened, darkened or erased, shapes are altered, and the “drawn” by the network. The red rectangle delimits the area at- final picture emerges. Most approaches to automatic im- tended to by the network at each time-step, with the focal preci- sion indicated by the width of the rectangle border. age generation, however, aim to generate entire scenes at once. In the context of generative neural networks, this typ- ically means that all the pixels are conditioned on a single latent distribution (Dayan et al., 1995; Hinton & Salakhut- The core of the DRAW architecture is a pair of recurrent dinov, 2006; Larochelle & Murray, 2011). As well as pre- neural networks: an encoder network that compresses the cluding the possibility of iterative self-correction, the “one real images presented during training, and a decoder that shot” approach is fundamentally difficult to scale to large reconstitutes images after receiving codes. The combined images. The Deep Recurrent Attentive Writer (DRAW) ar- system is trained end-to-end with stochastic gradient de- chitecture represents a shift towards a more natural form of scent, where the loss function is a variational upper bound image construction, in which parts of a scene are created on the log-likelihood of the data. It therefore belongs to the independently from others, and approximate sketches are family of variational auto-encoders, a recently emerged successively refined. hybrid of deep learning and variational inference that has led to significant advances in generative modelling (Gre- Proceedings of the 32 nd International Conference on Machine gor et al., 2014; Kingma & Welling, 2014; Rezende et al., Learning, Lille, France, 2015. JMLR: W&CP volume 37. Copy- 2014; Mnih & Gregor, 2014; Salimans et al., 2014). Where right 2015 by the author(s). DRAW differs from its siblings is that, rather than generat- DRAW: A Recurrent Neural Network For Image Generation ing images in a single pass, it iteratively constructs scenes P (x z) ct 1 write ct write ... cT σ P (x z1:T ) | − | through an accumulation of modifications emitted by the decoder dec decoder decoder FNN ht 1 RNN RNN decoder, each of which is observed by the encoder. − z zt zt+1 An obvious correlate of generating images step by step is decoding sample sample sample (generative model) the ability to selectively attend to parts of the scene while encoding Q(z x) Q(zt x, z1:t 1) Q(zt+1 x, z1:t) (inference) ignoring others. A wealth of results in the past few years | | − | enc encoder encoder suggest that visual structure can be better captured by a se- ht 1 encoder − RNN RNN quence of partial glimpses, or foveations, than by a sin- FNN gle sweep through the entire image (Larochelle & Hinton, read read 2010; Denil et al., 2012; Tang et al., 2013; Ranzato, 2014; x x x Zheng et al., 2014; Mnih et al., 2014; Ba et al., 2014; Ser- manet et al., 2014). The main challenge faced by sequential Figure 2. Left: Conventional Variational Auto-Encoder. Dur- attention models is learning where to look, which can be ing generation, a sample z is drawn from a prior P (z) and passed addressed with reinforcement learning techniques such as through the feedforward decoder network to compute the proba- policy gradients (Mnih et al., 2014). The attention model in bility of the input P (xjz) given the sample. During inference the DRAW, however, is fully differentiable, making it possible input x is passed to the encoder network, producing an approx- to train with standard backpropagation. In this sense it re- imate posterior Q(zjx) over latent variables. During training, z is sampled from Q(zjx) and then used to compute the total de- sembles the selective read and write operations developed scription length KLQ(Zjx)jjP (Z) − log(P (xjz)), which is for the Neural Turing Machine (Graves et al., 2014). minimised with stochastic gradient descent. Right: DRAW Net- The following section defines the DRAW architecture, work. At each time-step a sample zt from the prior P (zt) is along with the loss function used for training and the pro- passed to the recurrent decoder network, which then modifies part cedure for image generation. Section3 presents the selec- of the canvas matrix. The final canvas matrix cT is used to compute P (xjz ). During inference the input is read at every time- tive attention model and shows how it is applied to read- 1:T step and the result is passed to the encoder RNN. The RNNs at ing and modifying images. Section4 provides experi- the previous time-step specify where to read. The output of the mental results on the MNIST, Street View House Num- encoder RNN is used to compute the approximate posterior over bers and CIFAR-10 datasets, with examples of generated the latent variables at that time-step. images; and concluding remarks are given in Section5. Lastly, we would like to direct the reader to the video accompanying this paper (https://www.youtube. as “what to write”. The architecture is sketched in Fig.2, com/watch?v=Zt-7MI9eKEo) which contains exam- alongside a feedforward variational auto-encoder. ples of DRAW networks reading and generating images. 2.1. Network Architecture 2. The DRAW Network Let RNN enc be the function enacted by the encoder net- enc The basic structure of a DRAW network is similar to that of work at a single time-step. The output of RNN at time enc other variational auto-encoders: an encoder network deter- t is the encoder hidden vector ht . Similarly the output of dec dec mines a distribution over latent codes that capture salient the decoder RNN at t is the hidden vector ht . In gen- information about the input data; a decoder network re- eral the encoder and decoder may be implemented by any ceives samples from the code distribuion and uses them to recurrent neural network. In our experiments we use the condition its own distribution over images. However there Long Short-Term Memory architecture (LSTM; Hochreiter are three key differences. Firstly, both the encoder and de- & Schmidhuber(1997)) for both, in the extended form with coder are recurrent networks in DRAW, so that a sequence forget gates (Gers et al., 2000). We favour LSTM due of code samples is exchanged between them; moreover the to its proven track record for handling long-range depen- encoder is privy to the decoder’s previous outputs, allow- dencies in real sequential data (Graves, 2013; Sutskever ing it to tailor the codes it sends according to the decoder’s et al., 2014). Throughout the paper, we use the notation behaviour so far. Secondly, the decoder’s outputs are suc- b = W (a) to denote a linear weight matrix with bias from cessively added to the distribution that will ultimately gen- the vector a to the vector b. erate the data, as opposed to emitting this distribution in At each time-step t, the encoder receives input from both a single step. And thirdly, a dynamically updated atten- the image x and from the previous decoder hidden vector tion mechanism is used to restrict both the input region dec ht 1. The precise form of the encoder input depends on a observed by the encoder, and the output region modified read− operation, which will be defined in the next section. by the decoder. In simple terms, the network decides at enc The output ht of the encoder is used to parameterise a each time-step “where to read” and “where to write” as well distribution Q(Z henc) over the latent vector z . In our tj t t DRAW: A Recurrent Neural Network For Image Generation experiments the latent distribution is a diagonal Gaussian negative log probability of x under D: (Zt µt; σt): x N j = log D(x cT ) (9) enc L − j µt = W (ht ) (1) The latent loss z for a sequence of latent distributions enc enc L σt = exp (W (ht )) (2) Q(Zt ht ) is defined as the summed Kullback-Leibler di- j enc vergence of some latent prior P (Zt) from Q(Zt ht ): Bernoulli distributions are more common than Gaussians j T for latent variables in auto-encoders (Dayan et al., 1995; X z = KLQ(Z henc) P (Z ) (10) Gregor et al., 2014); however a great advantage of Gaus- L tj t jj t sian latents is that the gradient of a function of the sam- t=1 ples with respect to the distribution parameters can be eas- Note that this loss depends upon the latent samples zt ily obtained using the so-called reparameterization trick drawn from Q(Z henc), which depend in turn on the input tj t (Kingma & Welling, 2014; Rezende et al., 2014).

Load more