
1 Deep Generative Modelling: A Comparative Review of VAEs, GANs, Normalizing Flows, Energy-Based and Autoregressive Models

Sam Bond-Taylor, Adam Leach, Yang Long, Chris G. Willcocks

Abstract—Deep generative modelling is a class of techniques that train deep neural networks to model the distribution of training samples. Research has fragmented into various interconnected approaches, each of which makes trade-offs including run-time, diversity, and architectural restrictions. In particular, this compendium covers energy-based models, variational autoencoders, generative adversarial networks, autoregressive models, and normalizing flows, in addition to numerous hybrid approaches. These techniques are drawn under a single cohesive framework, comparing and contrasting to explain the premises behind each, while reviewing current state-of-the-art advances and implementations.

Index Terms—Generative Models, Energy-Based Models, Variational Autoencoders, Generative Adversarial Networks, Autoregressive Models, Normalizing Flows

1 INTRODUCTION

GENERATIVE modelling using neural networks has its origins in the 1980s with aims to learn about data with no supervision, potentially providing benefits for standard classification tasks; collecting training data for unsupervised learning is naturally much lower effort and cheaper than collecting labelled data, but there is considerable information still available, making it clear that generative models can be beneficial for a wide variety of applications.

Beyond this, generative modelling has numerous direct applications including image synthesis: super-resolution, text-to-image and image-to-image conversion, inpainting, attribute manipulation, pose estimation; video: synthesis and retargeting; audio: speech and music synthesis; text: summarisation and translation; reinforcement learning; computer graphics: rendering, texture generation, character movement, liquid simulation; medical: drug synthesis, modality conversion; and out-of-distribution detection.

The central idea of generative modelling stems around training a generative model whose samples x̃ ∼ pθ(x̃) come from the same distribution as the training data distribution, x ∼ pd(x). Early neural generative models, energy-based models, achieved this by defining an energy function on data points proportional to likelihood; however, these struggled to scale to complex high dimensional data such as natural images, and require Markov Chain Monte Carlo (MCMC) sampling during both training and inference, a slow iterative process. In recent years there has been renewed interest in generative models driven by the advent of large freely available datasets as well as advances in both general deep learning architectures and generative models, breaking new ground in terms of visual fidelity and sampling speed. In many cases, this has been achieved using latent variables z which are easy to sample from and/or calculate the density of, instead learning p(x, z); this requires marginalisation over the unobserved latent variables, however in general, this is intractable. Generative models therefore typically make trade-offs in execution time, architecture, or optimise proxy functions. Choosing what to optimise for has implications for sample quality, with direct likelihood optimisation often leading to worse sample quality than alternatives.

There exists a variety of survey papers focusing on particular generative models such as normalizing flows [113], [164], generative adversarial networks [64], [230], and energy-based models [189], however, naturally these dive into the intricacies of their respective method rather than comparing with other methods; additionally, some focus on applications rather than theory. While there exists a recent survey on generative models as a whole [162], it is less broad, diving deeply into a few specific implementations.

This survey provides a comprehensive overview of generative modelling trends, introducing new readers to the field by bringing methods under a single cohesive statistical framework, comparing and contrasting so as to explain the modelling decisions behind each respective technique. Additionally, advances old and new are discussed in order to bring the reader up to date with current research. In particular, this survey covers energy-based models (Section 2), typically single unnormalised density models; variational autoencoders (Section 3), variational approximation of a latent-based model's posterior; generative adversarial networks (Section 4), two models set in a mini-max game; autoregressive models (Section 5), which model data decomposed as a product of conditional probabilities; and normalizing flows (Section 6), exact likelihood models using invertible transformations. This breakdown is defined to closely match the typical divisions within research, however, numerous hybrid approaches exist that blur these lines; these are

• The authors are with the Department of Computer Science, Durham University, Durham, DH1 3LE, United Kingdom. This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible.

TABLE 1: Comparison between deep generative models in terms of training and test speed, parameter efficiency, sample quality, sample diversity, and ability to scale to high resolution data. Quantitative evaluation is reported on the CIFAR-10 dataset [114] in terms of Fréchet Inception Distance (FID) and negative log-likelihood (NLL) in bits-per-dimension (BPD).

Method | FID | NLL (in BPD)

Generative Adversarial Networks
  DCGAN [169] | 17.70 | -
  ProGAN [102] | 15.52 | -
  BigGAN [17] | 14.73 | -
  StyleGAN2 + ADA [103] | 2.42 | -

Energy Based Models
  IGEBM [42] | 37.9 | -
  Denoising Diffusion [80] | 3.17 | ≤ 3.75
  DDPM++ Continuous [191] | 2.92 | 2.99
  Flow Contrastive [51] | 37.30 | ≈ 3.27
  VAEBM [226] | 12.19 | -

Variational Autoencoders
  Convolutional VAE [110] | 106.37 | ≤ 4.54
  Variational Lossy AE [27] | - | ≤ 2.95
  VQ-VAE [171], [215] | - | ≤ 4.67
  VD-VAE [29] | - | ≤ 2.87

Autoregressive Models
  PixelRNN [214] | - | 3.00
  Gated PixelCNN [213] | 65.93 | 3.03
  PixelIQN [161] | 49.46 | -
  Sparse Trans. + DistAug [30], [99] | 14.74 | 2.66

Normalizing Flows
  RealNVP [39] | - | 3.49
  Masked Autoregressive Flow [165] | - | 4.30
  GLOW [111] | 45.99 | 3.35
  FFJORD [56] | - | 3.40
  Residual Flow [24] | 46.37 | 3.28

(The table additionally rates each method with one-to-five star ratings for train speed, sample speed, parameter efficiency, sample quality, relative diversity, and resolution scaling.)

discussed in the most relevant section, or both where suitable. For a brief insight into the differences between different architectures, we provide Table 1 which contrasts a diverse array of techniques through easily comparable star ratings. Specifically, training speed is assessed based on reported total training times, thus taking into account a variety of factors including architecture, number of function evaluations per step, ease of optimisation, and stochasticity involved; sample speed is based on network speed and number of evaluations required; parameter efficiency is determined by the total number of parameters required respective to the dataset trained on (while more powerful models will often have more parameters, the correlation with quality is not strong across model types); sample quality is determined using the following rules - 1 star: some structure/texture is captured, 2 stars: a scene is recognisable but global structure/detail is lacking, 3 stars: significant structure is captured but scenes look 'weird', 4 stars: difference to real images is identifiable, and 5 stars: difference is completely imperceptible; diversity is based on coverage/likelihood estimates of similar models; and resolution scaling is determined by the maximum reported resolutions.

2 ENERGY-BASED MODELS

Energy-based models (EBMs) [119] are based on the observation that any probability density function p(x) for x ∈ R^D can be expressed as

p(x) = e^{−E(x)} / Σ_{x̃∈X} e^{−E(x̃)},  (1)

where E(x): R^D → R is an energy function which associates realistic points with low values and unrealistic points with high values. Modelling data in such a way offers a number of perks, namely the simplicity and stability associated with training a single model; utilising a shared set of features thereby minimising required parameters; and the lack of any prior assumptions eliminates related bottlenecks [42]. Despite these benefits, EBM adoption fell in favour of implicit generative approaches due in part to poor scaling to high dimensional data, however, interest is increasing thanks to a number of tricks aiding better scaling.

A key issue with EBMs is how to optimise them; since the denominator in Equation 1 is intractable for most models, a popular proxy objective is contrastive divergence, where energy values of data samples are 'pushed' down, while samples from the energy distribution are 'pushed' up. More formally, the gradient of the negative log-likelihood loss L(θ) = E_{x∼pd}[− ln pθ(x)] has been shown to approximately demonstrate the following property [21], [193],

∇θ L = E_{x⁺∼pd}[∇θ Eθ(x⁺)] − E_{x⁻∼pθ}[∇θ Eθ(x⁻)],  (2)

where x⁻ ∼ pθ is a sample from the EBM found through a Markov Chain Monte Carlo (MCMC) generating procedure.

2.1 Boltzmann Machines

A Boltzmann machine [76] is a fully connected undirected network of binary neurons that are turned on with probability determined by a weighted sum of their inputs, i.e. for some state s_i, p(s_i = 1) = σ(Σ_j w_{i,j} s_j).

Fig. 1: A Boltzmann machine with 2 visible units and 3 hidden units.

Fig. 2: A restricted Boltzmann machine has its architecture restricted so that there are no connections between units within each group.

The neurons can be divided into visible units v ∈ {0, 1}^D, those which are set by inputs to the model, and hidden units h ∈ {0, 1}^P, all other neurons. The energy of the state {v, h} is defined (without biases for succinctness, but these can be generalised trivially) as

Eθ(v, h) = −½ vᵀLv − ½ hᵀJh − ½ vᵀWh,  (3)

where W, L, and J are symmetrical learned weight matrices. Due to the binary nature of the neurons, the probability assigned to a visible vector v as defined in Equation 1 can be reduced to finite summations

pθ(v) = Σ_h e^{−βEθ(v,h)} / Σ_{ṽ} Σ_h e^{−βEθ(ṽ,h)}.  (4)

Boltzmann machines are typically trained via negative log-likelihood through contrastive divergence, the architecture simplifying Equation 2 to vector products, which for W is

Σ_{x∈X} ∂ ln p(x)/∂w_{i,j} = E_{pd}[vhᵀ] − E_{pθ}[vhᵀ].  (5)

Exact computation of the expectations takes an exponential amount of time in the number of hidden units, making exact maximum likelihood intractable. Instead, training iterates between a positive training phase where visible units are fixed to a data vector and Gibbs sampling is used to approximate h, and a negative training phase where no units are fixed and Gibbs sampling is used to approximate all units. The time to approach equilibrium increases significantly in Boltzmann machines with many hidden units, making Boltzmann machines only practical for low dimensional data. Additionally, for contrastive divergence to perform well, exact samples from p(h|v; θ) are required, further making Boltzmann machines impractical [77], [223].

2.2 Restricted Boltzmann Machines

Many of the issues associated with Boltzmann machines can be overcome by restricting their connectivity. One approach, known as the restricted Boltzmann machine (RBM) [77], is to set both J = 0 and L = 0. Because there are no direct connections between hidden units, it is possible to perform exact inference and thereby easily obtain an unbiased sample of E_{pd}[vhᵀ]; for a training sample v ∈ X or hidden vector h, the binary states h_j are activated with probability

p(h_j = 1|v) = σ(Σ_i v_i w_{ij}),  (6a)
p(v_i = 1|h) = σ(Σ_j h_j w_{ij}).  (6b)

Obtaining an unbiased sample of E_{pmodel}[vhᵀ], however, is more difficult. Gibbs sampling starting with an initial visible state v is simplified through the iterative updating of h and v according to Equations 6a and 6b,

h_j^{(n+1)} ∼ σ(Σ_i v_i^{(n)} w_{ij}),  (7a)
v_i^{(n+1)} ∼ σ(Σ_j h_j^{(n+1)} w_{ij}).  (7b)

Since both can be computed in parallel, this allows for considerable speedup. In practice, if v is initialised to be a sample from the dataset, a single step is sufficient [77]. Taking p_recon to be the distribution of visible units after a single step of Gibbs sampling, derivatives can be approximated as

Σ_{x∈pd} ∂ ln p(x)/∂w_{i,j} ≈ E_X[vhᵀ] − E_{precon}[vhᵀ].  (8)

2.3 Deep Belief Networks

An efficient way to learn powerful functions is to combine simpler functions; it is natural therefore to attempt to stack RBMs, using features from lower down as inputs for the next layer. Training an entire model of this form concurrently, however, is impossible since NLL calculations are intractable. Instead, deep belief networks train RBMs layer by layer in a greedy fashion, composing probability densities [78]. An RBM finds it easy to learn weights that convert the posterior distribution over the hidden units into the data distribution, p(v|h, W), however, modelling the posterior distribution over the hidden units p(h|W) is more difficult. Nevertheless, training a second RBM to model the posterior of the first is then an easier task because the distribution is closer to one that it can model perfectly. Thus by improving p(h), p(v) is also improved; each time another layer of features is added, the variational lower bound on the log probability is improved. Sampling data from a deep belief network is much slower than from an RBM; first, an equilibrium sample from the top level RBM is found by performing Gibbs sampling for a long time, then a top-down pass is performed to obtain states for all the other layers.
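To make the procedure concrete, the following is a minimal PyTorch sketch (not taken from the reviewed works) of a single CD-1 update for a binary RBM without biases, following Equations 6-8; the layer sizes, learning rate, and toy data are arbitrary choices.

    import torch

    def cd1_step(W, v_data, lr=1e-2):
        # Positive phase: sample hidden units from p(h|v) (Equation 6a).
        p_h_data = torch.sigmoid(v_data @ W)          # (batch, P)
        h_data = torch.bernoulli(p_h_data)

        # One step of Gibbs sampling (Equations 7a and 7b).
        p_v_recon = torch.sigmoid(h_data @ W.T)       # (batch, D)
        v_recon = torch.bernoulli(p_v_recon)
        p_h_recon = torch.sigmoid(v_recon @ W)

        # Approximate gradient of the log-likelihood (Equation 8):
        # E_data[v h^T] - E_recon[v h^T], averaged over the batch.
        positive = v_data.T @ p_h_data / v_data.shape[0]
        negative = v_recon.T @ p_h_recon / v_data.shape[0]
        W += lr * (positive - negative)               # gradient ascent on ln p(v)
        return W

    # Toy usage: D = 6 visible units, P = 3 hidden units, random binary data.
    W = 0.01 * torch.randn(6, 3)
    v = torch.bernoulli(0.5 * torch.ones(16, 6))
    W = cd1_step(W, v)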

2.4 Deep EBMs via Contrastive Divergence

To train more powerful architectures through contrastive divergence, one must be able to efficiently sample from pθ. Specifically, we would like to model an energy function with a deep neural network, taking advantage of recent advances in discriminative models [232] applied to high dimensional data. MCMC methods such as random walk and Gibbs sampling [78], when applied to high dimensional data, have long mixing times, making them impractical. A number of recent approaches [42], [58], [152], [153], [228] have advocated the use of stochastic gradient Langevin dynamics [174], [224], which permits sampling through the following iterative process,

x_0 ∼ p_0(x),    x_{i+1} = x_i − (α/2) ∂Eθ(x_i)/∂x_i + ε,  (9)

where ε ∼ N(0, αI), p_0(x) is typically a uniform distribution over the input domain, and α is the step size. As the number of updates N → ∞ and α → 0, the distribution of samples converges to pθ [224]; however, α and ε are often tweaked independently to speed up training.

While it is possible to train with noise initialisation as above using as few as 100 update steps, known as short-run MCMC, by training a non-convergent EBM [152], this number of steps is still prohibitively slow for large scale tasks [58]. Additionally, because the number of update steps is so small, samples are not truly from the correct probability density; there are a number of advantages however, such as allowing image interpolation and reconstruction (since short-run MCMC does not mix) [153]. One solution is to select samples from a large replay buffer which stores previously generated samples, and with some probability resets a sample to noise [42], [228]. This allows samples to be continually refined with a relatively small number of Langevin update steps while maintaining diversity. Faster training comes at the expense of slower inference times, however, as well as worse diversity [153]. Other approaches include initialising MCMC sampling chains with data points [228] and samples from an implicit generative model [227], as well as adversarially training an implicit generative model, mitigating mode collapse somewhat by maximising its entropy [59], [108], [117]. Improved/augmented MCMC samplers with neural networks can also improve the efficiency of MCMC sampling [82], [122], [186], [200].

One application of EBMs of this form comes by using standard classifier architectures, fθ: R^D → R^K, which map data points to logits used by a softmax function,

pθ(y|x) = exp(fθ(x)[y]) / Σ_{ỹ} exp(fθ(x)[ỹ]),  (10)

which assigns the probability of x belonging to class y. By marginalising out y, these logits can be used to define an unnormalised density model and thus an energy model [58],

pθ(x) = Σ_y pθ(x, y) = Σ_y exp(fθ(x)[y]) / Z(θ),  (11a)
Eθ(x) = − ln Σ_y exp(fθ(x)[y]).  (11b)

As such, it is possible to train a classifier as both a generative model and a competitive classification model.
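The following is a minimal, illustrative PyTorch sketch of this training scheme: Langevin updates (Equation 9) initialised from a replay buffer that is occasionally reset to noise, used to obtain negative samples for the contrastive divergence gradient of Equation 2. The MLP energy network, step size, number of steps, and reset probability are arbitrary assumptions; practical image EBMs use convolutional energy functions and further tricks such as gradient clipping.

    import torch
    import torch.nn as nn

    def langevin_sample(energy_net, x, n_steps=60, alpha=0.01):
        # Iterative refinement of Equation 9: x <- x - (alpha/2) dE/dx + noise.
        for _ in range(n_steps):
            x = x.detach().requires_grad_(True)
            grad = torch.autograd.grad(energy_net(x).sum(), x)[0]
            x = x - 0.5 * alpha * grad + torch.randn_like(x) * alpha ** 0.5
        return x.detach()

    def negative_samples(energy_net, buffer, batch, dim, reset_prob=0.05):
        # Replay buffer: reuse past chains, occasionally resetting them to noise,
        # so samples keep being refined while diversity is maintained.
        idx = torch.randint(0, buffer.shape[0], (batch,))
        x = buffer[idx].clone()
        reset = torch.rand(batch) < reset_prob
        x[reset] = torch.rand(int(reset.sum()), dim)   # uniform initialisation p_0(x)
        x = langevin_sample(energy_net, x)
        buffer[idx] = x                                # write refined samples back
        return x

    # Toy usage with an arbitrary MLP energy function on 2-D data.
    E = nn.Sequential(nn.Linear(2, 64), nn.SiLU(), nn.Linear(64, 1))
    buffer = torch.rand(1000, 2)
    x_pos = torch.randn(32, 2)                         # stand-in for data samples
    x_neg = negative_samples(E, buffer, 32, 2)
    loss = E(x_pos).mean() - E(x_neg).mean()           # gradient matches Equation 2
    loss.backward()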
2.5 Deep EBMs via Score Matching

Although Langevin sampling has allowed EBMs to scale to high dimensional data, training times are still slow due to the need to sample from the model distribution when using contrastive divergence. This is made worse by the finite nature of the sampling process, meaning that samples can be arbitrarily far away from the model's distribution [57]. An alternative approach is score matching [93], which is based on the idea of minimising the difference between the derivatives of the data and model's log-density functions; the score function is defined as s(x) = ∇_x ln p(x), which does not depend on the intractable denominator and can therefore be applied to build an energy model [121], [194] by minimising the Fisher divergence between pd and pθ,

L = ½ E_{pd(x)}[ ‖sθ(x) − sd(x)‖²₂ ],  (12)

however, the score function of data is usually not available. There exist various methods to estimate the score function including spectral approximation [181], sliced score matching [188], finite difference score matching [163], and notably denoising score matching [219], which allows the score to be approximated using corrupted data samples q(x̃|x). In particular, when q = N(x̃|x, σ²I), Equation 12 simplifies to

L = ½ E_{pd(x)} E_{x̃∼N(x,σ²I)}[ ‖sθ(x̃) + (x̃ − x)/σ²‖²₂ ].  (13)

That is, sθ learns to estimate the noise, thereby allowing it to be used as a generative model [178], [187]. Notice in Equation 9 that the Langevin update step uses ∇_x ln p(x), or sθ, thus it is possible to sample from a score matching model using Langevin dynamics [210]. This is only possible, however, when trained over a large variety of noise levels so that x̃ covers the whole space.

This process is very similar to diffusion models [1], [10], [80], [184], which undergo a similar procedure using a predetermined schedule over a fixed number of steps, allowing a bound on p(x) to be derived. Diffusion models applied to high dimensional data have demonstrated the ability to generate state-of-the-art samples [80], [148], [191]. Similar to implicit energy models, sampling from score matching models requires a large number of steps. By modelling the Langevin sampling procedure as a stochastic differential equation (SDE), pre-existing SDE solvers can be used, reducing the number of steps required [191], [209]. Another proposed approach is to model noisy data points as q(x_{t−1}|x_t, x_0), allowing the generative process to skip some steps using its approximation of end samples x_0 [185].
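As an illustration of Equation 13, the sketch below implements the denoising score matching loss for a single, arbitrarily chosen noise level σ with a small MLP score network; methods trained over many noise levels additionally condition the network on σ, which is omitted here for brevity.

    import torch
    import torch.nn as nn

    def dsm_loss(score_net, x, sigma=0.1):
        # Corrupt the data with Gaussian noise: x_tilde ~ N(x, sigma^2 I).
        noise = torch.randn_like(x) * sigma
        x_tilde = x + noise
        # Equation 13: the score network should predict -(x_tilde - x) / sigma^2,
        # i.e. (up to scale) the negative of the noise that was added.
        target = -noise / sigma ** 2
        return 0.5 * ((score_net(x_tilde) - target) ** 2).sum(dim=1).mean()

    # Toy usage: a small MLP score model for 2-D data (an arbitrary choice).
    score_net = nn.Sequential(nn.Linear(2, 128), nn.SiLU(), nn.Linear(128, 2))
    x = torch.randn(64, 2)
    loss = dsm_loss(score_net, x)
    loss.backward()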

2.6 Correcting Implicit Generative Models

While EBMs offer powerful representation ability due to unnormalized likelihoods, they can suffer from high variance training, long training and sampling times, and struggle to support the entire data space. In this section, a number of hybrid approaches are discussed which address these issues.

2.6.1 Exponential Tilting

To eliminate the need for an EBM to support the entire space, an EBM can instead be used to correct samples from an implicit generative network, simplifying the function to learn and allowing easier sampling. This procedure, referred to as exponentially tilting an implicit model, is defined as

pθ,φ(x) = (1/Z_{θ,φ}) qφ(x) e^{−Eθ(x)}.  (14)

By parameterising qφ(x) as a latent variable model such as a normalizing flow [2], [151] or VAE generator [226], MCMC sampling can be performed in the latent space rather than the data space. Since the latent space is much simpler, and often uni-modal, MCMC mixes much more effectively. This limits the freedom of the model, however, leading some to jointly sample in latent and data space [2], [226].

2.6.2 Noise Contrastive Estimation

Noise contrastive estimation [48], [68] transforms EBM training into a classification problem using a noise distribution qφ(x) by optimising the loss

E_{pd}[ ln( pθ(x) / (pθ(x) + qφ(x)) ) ] + E_{qφ}[ ln( qφ(x) / (pθ(x) + qφ(x)) ) ],  (15)

where pθ(x) = e^{Eθ(x)−c}. This approach can be used to train a correction via exponential tilting [151], but can also be used to directly train an EBM and normalizing flow [51]. Equation 15 is equivalent to GAN Equation 23, however, the training formulations differ, with noise contrastive estimation explicitly modelling likelihood ratios.

2.7 Alternative Training Objectives

As aforementioned, energy models trained with contrastive divergence approximately maximise the likelihood of the data; likelihood however does not correlate directly with sample quality [199]. Training EBMs with arbitrary f-divergences is possible, yielding improved FID scores [231]. Since score estimates have high variance, the Stein discrepancy has been proposed as an alternative objective, requiring no sampling and more closely correlating with likelihood [57]. A middle ground between denoising score matching and contrastive divergence is diffusion recovery likelihood [11], which can be optimised via a sequence of denoising EBMs conditioned on increasingly noisy samples of the data, the conditional distributions being much easier to MCMC sample from than typical EBMs [52].

3 VARIATIONAL AUTOENCODERS

One of the key problems associated with energy-based models is that sampling is not straightforward and mixing can require a significant amount of time. To circumvent this issue, it would be beneficial to explicitly sample from the data distribution with a single network pass.

To this end, suppose we have a latent based model pθ(x|z) with prior pθ(z) and posterior pθ(z|x); unfortunately, optimising this model through max likelihood is intractable due to the integral in pθ(x) = ∫_z pθ(x|z) pθ(z) dz. Instead, [110] propose learning a second function qφ(z|x) = arg min_q DKL(qφ(z|x)‖pθ(z|x)) which approximates the true intractable posterior pθ(z|x). From the definition of KL divergence we get

DKL(qφ(z|x)‖pθ(z|x)) = E_{qφ(z|x)}[ ln( qφ(z|x)/pθ(z|x) ) ]
  = E_{qφ(z|x)}[ln qφ(z|x)] − E_{qφ(z|x)}[ln pθ(z, x)] + ln pθ(x),  (16)

which can be rearranged to find an alternative definition for pθ(x) that does not require knowledge of pθ(z|x),

ln pθ(x) = DKL(qφ(z|x)‖pθ(z|x)) − E_{qφ(z|x)}[ln qφ(z|x)] + E_{qφ(z|x)}[ln pθ(z, x)]
  ≥ −E_{qφ(z|x)}[ln qφ(z|x)] + E_{qφ(z|x)}[ln pθ(z, x)]
  = −E_{qφ(z|x)}[ln qφ(z|x)] + E_{qφ(z|x)}[ln pθ(z)] + E_{qφ(z|x)}[ln pθ(x|z)]  (17)
  = −DKL(qφ(z|x)‖pθ(z)) + E_{qφ(z|x)}[ln pθ(x|z)]
  ≡ L(θ, φ; x),

where L is known as the evidence lower bound (ELBO) [98]. To optimise this bound with respect to θ and φ, gradients must be backpropagated through the stochastic process of generating samples from z̃ ∼ qφ(z|x). This is permitted by reparameterizing z̃ using a differentiable function gφ(ε, x) of a noise variable ε: z̃ = gφ(ε, x) with ε ∼ p(ε) [110].

Monte Carlo gradient estimators can be used to approximate the expectations, however this yields very high variance, making it impractical. However, if DKL(qφ(z|x)‖pθ(z)) can be integrated analytically then the variance is manageable. A prior with such a property needs to be simple enough to sample from but also sufficiently flexible to match the true posterior; a common choice is a normally distributed prior with diagonal covariance, z ∼ qφ(z|x) = N(z; µ, σ²I) with z̃ = µ + σ ⊙ ε and ε ∼ N(0, I). In this case, the loss simplifies to

L̃_VAE(θ, φ; x) ≃ ½ Σ_{j=1}^{J} ( 1 + ln((σ^(j))²) − (µ^(j))² − (σ^(j))² ) + (1/L) Σ_{l=1}^{L} ln pθ(x|z̃_l).  (18)

Despite success on small scale datasets, when applied to more complex datasets such as natural images, samples tend to be unrealistic and blurry [41]. This blurriness has been attributed to the max likelihood objective itself and the MSE reconstruction loss, however, there is evidence that limited approximation of the true posterior is the root cause [239], with MSE causing highly non-gaussian posteriors. As such, the gaussian posterior implies an overly simple model which, when unable to perfectly fit, maps multiple data points to the same encoding, leading to averaging.

There are a number of other issues associated with limited posterior approximation, namely under-estimation of the variance of the posterior, resulting in poor predictions, and biases in the MAP estimates of model parameters [207]. Quality can be improved using higher capacity functions, however these can be harder to balance, leading to posterior collapse [73], [239], as well as by combining with adversarial training [89], [118]. There have also been attempts to improve the variational bound itself [19], as well as use different regularisation techniques such as Wasserstein distance [201] and GANs [136].
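A minimal sketch of the resulting training objective is given below: a Gaussian encoder with the reparameterisation z̃ = µ + σ ⊙ ε and the analytic KL term of Equation 18, here paired with a Bernoulli decoder on flattened inputs (the decoder choice, layer sizes, and toy data are assumptions, not prescribed by the text).

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class VAE(nn.Module):
        def __init__(self, x_dim=784, z_dim=16, h_dim=256):
            super().__init__()
            self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
            self.mu = nn.Linear(h_dim, z_dim)
            self.log_var = nn.Linear(h_dim, z_dim)
            self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(),
                                     nn.Linear(h_dim, x_dim))

        def forward(self, x):
            h = self.enc(x)
            mu, log_var = self.mu(h), self.log_var(h)
            # Reparameterisation: z = mu + sigma * eps with eps ~ N(0, I).
            z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)
            return self.dec(z), mu, log_var

    def elbo_loss(x, logits, mu, log_var):
        # Negative of Equation 18: reconstruction term plus analytic KL to N(0, I).
        recon = F.binary_cross_entropy_with_logits(logits, x, reduction='none').sum(1)
        kl = -0.5 * (1 + log_var - mu ** 2 - log_var.exp()).sum(1)
        return (recon + kl).mean()

    # Toy usage on random binary "images".
    vae = VAE()
    x = torch.bernoulli(0.5 * torch.ones(32, 784))
    logits, mu, log_var = vae(x)
    loss = elbo_loss(x, logits, mu, log_var)
    loss.backward()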

3.1 Beyond Simple Priors

One approach to improve variational bounds and increase sample quality is to improve the priors used, for instance by careful selection for the task or by increasing its complexity [83]. Complex priors can be learned by warping simple distributions and inducing variational dependencies between the latent variables: variational Gaussian processes permit this by forming an infinite ensemble of mean-field distributions [203]; an EBM can be used to model a flexible prior [163]; normalizing flows (see Section 6) transform distributions through a series of invertible parameterised functions [12], [56], [88], [112], [173], [177].

Fig. 3: Variational autoencoder with a normally distributed prior. ε is sampled from N(0, I).

Fig. 4: A hierarchical VAE with bidirectional inference [112].

Alternatively, by rewriting the VAE training objective to have two regularisation terms [136],

L(θ, φ; x) = E_{x∼q(x)}[ E_{qφ(z|x)}[ln pθ(x|z)] ] + E_{x∼q(x)}[ H[qφ(z|x)] ] − E_{z∼q(z)}[pθ(z)].  (19)

Hierarchical VAEs form dependencies depthwise through the network, pθ(z) = pθ(z0) pθ(z1|z0) ··· pθ(zN |z<N).

Approaches to mitigate posterior collapse have been proposed: by restricting the autoregressive network's receptive field to a small window, it is forced to use latents to capture global structure [27]; a mutual information term can be added to the loss to encourage high correlation between x and z [240]; and by encouraging the posterior to be diverse by controlling its geometry to evenly cover the data space, redundancy is reduced and latents are encouraged to learn global structure [133].

3.3 Bridging Variational and MCMC Inference

While variational approaches offer substantial speedup over MCMC sampling, there is an inherent discrepancy between the true posterior and approximate posterior despite improvements in this field. To this end, a number of approaches have been proposed to find a middle ground, yielding improvements over variational methods with lower costs than MCMC. Semi-amortised VAEs [109] use an encoder network followed by stochastic gradient descent on latents to improve the ELBO, however, this still relies on an inference network. The inference network can be removed by assigning latent vectors to data points, then optimising them with Langevin dynamics or gradient descent during training; although this allows fast training, convergence for unseen samples is not guaranteed and there is still a large discrepancy between the true posterior and latent approximations due to lag in optimisation [14], [71]. Short-run MCMC has also been applied, however it has poor mixing properties [154]. Gradient Origin Networks [15] replace the encoder with an empirical Bayes approximation of the posterior that only requires a single gradient step.

VAEBMs offer a different perspective: rather than performing latent MCMC sampling based on the ELBO, they use an auxiliary energy-based model to correct blurry VAE samples, with MCMC sampling performed in both the data space and latent space. This setup is defined by hφ,θ(x, z) = (1/Z_{φ,θ}) pθ(z) pθ(x|z) e^{−Eφ(x)}, where pθ(z)pθ(x|z) is the VAE, and Eφ(x) is the energy model. This, however, requires two stages of training to avoid calculating the gradient of the normalising constant Z_{φ,θ}: training only the VAE, then fixing the VAE and training the EBM.

4 GENERATIVE ADVERSARIAL NETWORKS

Another approach to eliminating the Markov chains used in energy models is the generative adversarial network (GAN) [54]. GANs consist of two networks, a discriminator D: Rⁿ → [0, 1], which estimates the probability that a sample comes from the data distribution x ∼ pd(x), and a generator G: Rᵐ → Rⁿ, which given a latent variable z ∼ pz(z), captures pd by tricking the discriminator into thinking its samples are real. This is achieved through adversarial training of the networks: D is trained to correctly label training samples as real and samples from G as fake, while G is trained to minimise the probability that D classes its samples as fake. This can be interpreted as D and G playing a mini-max game, as with prior work [179], [180], optimising the value function V(G, D),

min_G max_D V(G, D) = E_{x∼pd(x)}[ln D(x)] + E_{z∼pz(z)}[ln(1 − D(G(z)))].  (22)

Fig. 5: Generative adversarial networks set two networks in a game: D detects real from fake samples while G tricks D.

For a fixed generator G, the training objective for D can be reformulated as

max_D V(G, D) = E_{x∼pd}[ln D(x)] + E_{x∼pg}[ln(1 − D(x))]
  = E_{x∼pd}[ ln( pd(x)/(pd(x) + pg(x)) ) ] + E_{x∼pg}[ ln( pg(x)/(pd(x) + pg(x)) ) ]  (23)
  = DKL(pd ‖ ½(pd + pg)) + DKL(pg ‖ ½(pd + pg)) + C.

Therefore the loss is equivalent to the Jensen-Shannon divergence between the generative distribution pg and the data distribution pd, and thus with sufficient capacity, the generator can recover the data distribution. The use of symmetric JS-divergence is well behaved when both distributions are small, unlike the asymmetric KL-divergence used in max-likelihood models. Additionally, it has been suggested that reverse KL-divergence, DKL(pg‖pd), is a better measure for training generative models than normal KL-divergence, DKL(pd‖pg), since it minimises E_{x∼pg}[ln pd(x)] [92]; while reverse KL-divergence is not a viable objective function, JS-divergence is, and behaves more like reverse KL-divergence than KL-divergence alone. With that said, JS-divergence is not perfect; if 0 mass is associated with a data sample in a max-likelihood model, KL-divergence is driven to infinity, whereas this can happen with no consequence in a GAN.

4.1 Stabilising Training

The adversarial nature of GANs makes them notoriously difficult to train [3]; Nash equilibrium is hard to achieve [175] since non-cooperation cannot guarantee convergence, thus training often results in oscillations of increasing amplitude. As the discriminator improves, gradients passed to the generator vanish, accelerating this problem; on the other hand, if the discriminator remains poor, the generator does not receive useful gradients. Another problem is mode collapse, where one network gets stuck in a bad local minima and only a small subset of the data distribution is learned. The discriminator can also jump between modes resulting in catastrophic forgetting, where previously learned knowledge is forgotten when learning something new [198]. This section explores proposed solutions to these problems.

4.1.1 Loss Functions

Since the cause of many of these issues can be linked with the use of JS-divergence, other loss functions have been proposed that minimise other statistical distances; in general, any f-divergence can be used to train GANs [156].
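For concreteness, the sketch below implements one alternating update with the non-saturating (NSGAN) loss listed in Fig. 6a; the MLP generator and discriminator, optimiser settings, and toy data are arbitrary stand-ins rather than any particular published configuration.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    z_dim, x_dim = 16, 2
    G = nn.Sequential(nn.Linear(z_dim, 64), nn.ReLU(), nn.Linear(64, x_dim))
    D = nn.Sequential(nn.Linear(x_dim, 64), nn.LeakyReLU(0.2), nn.Linear(64, 1))
    opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
    opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)

    def train_step(x_real):
        batch = x_real.shape[0]
        # Discriminator step: push D(x) towards "real" and D(G(z)) towards "fake".
        z = torch.randn(batch, z_dim)
        x_fake = G(z).detach()
        d_loss = F.binary_cross_entropy_with_logits(D(x_real), torch.ones(batch, 1)) + \
                 F.binary_cross_entropy_with_logits(D(x_fake), torch.zeros(batch, 1))
        opt_d.zero_grad()
        d_loss.backward()
        opt_d.step()

        # Generator step: non-saturating loss -E[ln sigma(D(G(z)))], which keeps
        # gradients alive even when the discriminator is confident.
        z = torch.randn(batch, z_dim)
        g_loss = F.binary_cross_entropy_with_logits(D(G(z)), torch.ones(batch, 1))
        opt_g.zero_grad()
        g_loss.backward()
        opt_g.step()
        return d_loss.item(), g_loss.item()

    train_step(torch.randn(32, x_dim))  # stand-in for a batch of real data

The other objectives in Fig. 6a (WGAN, LSGAN, hinge, EBGAN, RSGAN) can be substituted by changing only the two loss expressions above.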

NSGAN [54]
  Discriminator loss: −E[ln(σ(D(x)))] − E[ln(1 − σ(D(G(z))))]
  Generator loss: −E[ln(σ(D(G(z))))]

WGAN [4]
  Discriminator loss: E[D(x)] − E[D(G(z))]
  Generator loss: E[D(G(z))]

LSGAN [137]
  Discriminator loss: E[(D(x) − 1)²] + E[D(G(z))²]
  Generator loss: E[(D(G(z)) − 1)²]

Hinge [123]
  Discriminator loss: E[min(0, D(x) − 1)] − E[max(0, 1 + D(G(z)))]
  Generator loss: −E[D(G(z))]

EBGAN [237]
  Discriminator loss: D(x) + max(0, m − D(G(z)))
  Generator loss: D(G(z))

RSGAN [97]
  Discriminator loss: E[ln(σ(D(x) − D(G(z))))]
  Generator loss: E[ln(σ(D(G(z)) − D(x)))]

(a) GAN losses.  (b) Generator loss functions: plots of L_G against D(G(z)) for S-GAN, NS-GAN, WGAN, LS-GAN, and Hinge.

Fig. 6: A comparison of popular losses used to train GANs. (a) Respective losses for discriminator/generator. (b) Plots of generator losses with respect to discriminator output. Notably, NS-GAN's gradient disappears as the discriminator gets better.

One notable example of such is the Wasserstein distance, which intuitively indicates how much "mass" must be moved to transform one distribution into another. Wasserstein distance is defined formally in Equation 24a, which by the Kantorovich-Rubinstein duality can be shown to be equivalent to Equation 24b [218]:

W(pd, pg) = inf_{γ∈Π(pd,pg)} E_{(x,y)∼γ}[‖x − y‖],  (24a)
W(pd, pg) = sup_{‖D‖_L≤1} E_{x∼pd}[D(x)] − E_{x∼pg}[D(x)],  (24b)

where the supremum is taken over all 1-Lipschitz functions, that is, f such that for all x1 and x2, ‖f(x1) − f(x2)‖₂ ≤ ‖x1 − x2‖₂. Optimising Wasserstein distance, as described in Table 6a, offers linear gradients, thus eliminating the vanishing gradients problem (see Fig. 6b). Moreover, Wasserstein distance is also equivalent to minimising reverse KL-divergence [143], offers improved stability, and allows training to optimality. Numerous approaches to enforce 1-Lipschitz continuity have been proposed: weight clipping [4] invalidates gradients making optimisation difficult; applying a gradient penalty within the loss is heavily dependent on the support of the generative distribution, and computation with finite samples makes application to the entire space intractable [65]; spectral normalisation (discussed below) applies global regularisation by estimating the singular values of parameters. Other popular loss functions include least squares GAN, hinge loss, energy-based GAN, and relativistic GAN (detailed in Table 6a).

The catastrophic forgetting problem can be mitigated by conditioning the GAN on class information, encouraging more stable representations [17], [142], [234]. Nevertheless, labelled data, if available, only covers limited abstractions. Self-supervision achieves the same goal by training the discriminator on an auxiliary classification task based solely on the unsupervised data. Proposed approaches are based on randomly rotating inputs to the discriminator, which learns to identify the angle rotated separately to the standard real/fake classification [26]. Extensions include training the discriminator to jointly determine rotation and real/fake to provide better feedback [206], and training the generator to trick the discriminator at both the real/fake and classification tasks [206]. A more explicit approach is to model the generator with a normalizing flow, avoiding collapse by jointly optimising the GAN and likelihood objectives [63].

4.1.2 Spectral Normalisation

Spectral normalisation [143] is a technique to make a function globally 1-Lipschitz, utilising the observation that the Lipschitz constant of a linear function is its largest singular value (spectral norm). The spectral norm of a matrix A is

SN(A) := max_{h:h≠0} ‖Ah‖₂ / ‖h‖₂ = max_{‖h‖₂≤1} ‖Ah‖₂,  (25)

thus a weight matrix W is normalised to be 1-Lipschitz by replacing the weights with W_SN := W / SN(W). Rather than using singular value decomposition to compute the norm, the power iteration method is used; for randomly initialised vectors v ∈ Rⁿ and u ∈ Rᵐ, the procedure is

u_{t+1} = W v_t,   v_{t+1} = Wᵀ u_{t+1},   SN(W) ≈ uᵀ W v.  (26)

Since weights change only marginally with each optimisation step, a single power iteration step per global optimisation step is sufficient to keep v and u close to their targets. As aforementioned, enforcing the discriminator to be 1-Lipschitz is essential for WGANs, however, spectral normalisation has been found to dramatically improve sample quality and allow scaling to datasets with thousands of classes across a variety of loss functions [17], [143]. Spectral collapse has been linked to discriminator overfitting when spectral norms of layers explode [17], as well as mode collapse when spectral norms fall in value significantly [126]. Additionally, regularising the discriminator in this manner helps balance the two networks, reducing the number of discriminator update steps required [17], [234].
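The following sketch illustrates Equations 25-26 for a single weight matrix, normalising each power-iteration vector as in common implementations; in practice a built-in wrapper such as torch.nn.utils.spectral_norm is applied to every discriminator layer, so this stand-alone version is purely illustrative.

    import torch

    def spectral_normalise(W, u, n_iters=1):
        # Power iteration (Equation 26) to estimate the largest singular value
        # of W, then rescale so that the spectral norm of the result is ~1.
        for _ in range(n_iters):
            v = torch.nn.functional.normalize(W.T @ u, dim=0)
            u = torch.nn.functional.normalize(W @ v, dim=0)
        sigma = u @ W @ v
        return W / sigma, u   # u is reused (and refined) at the next step

    # Toy usage: an m x n weight matrix with a persistent estimate of u.
    W = torch.randn(64, 32)
    u = torch.nn.functional.normalize(torch.randn(64), dim=0)
    W_sn, u = spectral_normalise(W, u, n_iters=10)
    print(torch.linalg.matrix_norm(W_sn, ord=2))  # close to 1

During training, a single iteration per update suffices because u is carried over between steps and the weights change only slightly.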
4.1.3 Data Augmentation

Augmenting training data to increase the quantity of training data is common practice; when training GANs, the types of augmentations permitted are limited to simple augmentations such as cropping and flipping, to prevent the generator from creating undesired artefacts. Several approaches independently proposed applying augmentations to all discriminator inputs, allowing more substantial augmentations to be used [103], [205], [241], [242]; the training procedure for a WGAN with augmentations is

L_D = E_{x∼pd(x)}[D(T(x))] − E_{z∼p(z)}[D(T(G(z)))],  (27a)
L_G = E_{z∼p(z)}[D(T(G(z)))],  (27b)

where T is a random augmentation. These approaches have been shown to improve sample quality on equivalent architectures and stabilise training.
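As an illustration of applying a random, differentiable transformation T to every discriminator input (Equations 27a-27b), the sketch below uses a random integer translation; the specific augmentation, padding, and shift range are arbitrary choices rather than those of any particular work.

    import torch

    def random_translate(x, max_shift=4):
        # A simple random augmentation T: pad, then crop at a random offset.
        b, c, h, w = x.shape
        pad = torch.nn.functional.pad(x, (max_shift,) * 4)
        shifted = []
        for i in range(b):
            dy, dx = torch.randint(0, 2 * max_shift + 1, (2,)).tolist()
            shifted.append(pad[i, :, dy:dy + h, dx:dx + w])
        return torch.stack(shifted)

    def augmented_losses(D, G, x_real, z):
        # WGAN-style objectives of Equations 27a and 27b, with T applied to
        # every input the discriminator sees (real and generated).
        x_fake = G(z)
        d_loss = D(random_translate(x_real)).mean() - D(random_translate(x_fake.detach())).mean()
        g_loss = D(random_translate(x_fake)).mean()
        return d_loss, g_loss

    # The augmentation is differentiable, so generator gradients pass through it.
    x = torch.randn(8, 3, 32, 32, requires_grad=True)
    random_translate(x).sum().backward()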

Each work offers a different perspective on why augmentation is so effective: the increased quantity of training data in conjunction with the more difficult discrimination task prevents overfitting and in turn collapse [17], and notably this applies even on very small datasets (100 samples); the nature of GAN training leads to the generated and data distributions having non-overlapping supports, complicating training [195], and strong augmentations may cause these distributions to overlap further. If an augmentation is differentiable and represents an invertible transformation of the data space's distribution, then the JS-divergence is invariant, and the generator is guaranteed to not create augmented samples [103], [205].

4.1.4 Discriminator Driven Sampling

In order to improve sample quality and address overpowered discriminators, numerous works have taken inspiration from the connection between GANs and energy models [237]. Interpreting the discriminator of a Wasserstein GAN [4] as an energy-based model, samples from the generator can be used to initialise an MCMC sampling chain which converges to the density learned by the discriminator, correcting errors learned by the generator [146], [208]. This is similar to pure EBM approaches, however, training the two networks adversarially changes the dynamics. The slow convergence rates of high dimensional MCMC sampling have led others to instead sample in the latent space [22], [192].

4.1.5 GANs without Competition

Originally proposed as a proxy to measure GAN convergence [62], the duality gap is an upper bound on the JS-divergence that can be directly optimised [61], defined as

DG(D, G) = max_{D′} V(G, D′) − min_{G′} V(G′, D).  (28)

Cooperative training simplifies the optimisation procedure, avoiding oscillations. Each training step, however, requires optimising for D′ and G′, which slows down training and could suffer from vanishing gradients.

4.2 Architectures

Careful network design is a key component for stable GAN training. Scaling any deep neural network to high-resolution data is non-trivial due to vanishing gradients and high memory usage, but since the discriminator can classify high-resolution data more easily, GANs notably struggle [157]. Early approaches designed hierarchical architectures, dividing the learning procedure into more easily learnable chunks. LapGAN [37] builds a Laplacian pyramid such that at each layer, a GAN conditioned on the previous image resolution predicts a residual adding detail. Stacked GANs [90], [235] use two GANs trained successively: the first generates low-resolution samples, then the second upsamples and corrects the first, thus fewer GANs need to be trained. A related approach, progressive growing [102], [104], iteratively trains a single GAN at higher resolutions by adding layers to both the generator and discriminator, upscaling the previous output after the previous resolution converges. Training in this manner, however, not only takes a long time but leads to high frequency components being learned in the lower layers, resulting in shift artefacts [105].

Accordingly, a number of works have targeted a single GAN that can be trained end-to-end. DCGAN [169] introduced a fully convolutional architecture with batch normalisation [94] and ReLU/LeakyReLU activations. BigGAN [17] employs a number of tricks to scale to high resolutions including using very large mini-batches to reduce variation, spectral normalisation to discourage spectral collapse, and using large datasets to prevent overfitting. Despite this, training collapse still occurs, thus requiring early stopping. Another approach is to include skip connections between the generator and discriminator at each resolution, allowing gradients to flow through shorter paths to each layer, providing extra information to the generator [101], [105], [212]. By treating subsets of the generator's parameters as smaller generators, Anycost GANs extend this approach, allowing samples to be generated at multiple resolutions and speeds [124]. To learn long-range dependencies, GANs can be built with self-attention components [96], [216], [234], however, full quadratic attention does not scale well to high dimensional data. Alternatively, a powerful learned discrete prior can be used to model very high resolution data [49].

4.3 Training Speed

The mini-max nature of GAN training leads to slow convergence, if achieved at all. This problem has been exacerbated by numerous works as a byproduct of improving stability or sample quality. One such example is that by using very large mini-batches, reducing variance and covering more modes, sample quality can be improved significantly, however, this comes at the cost of slower training [17]. Small-GAN [182] combats this by replacing large batches with small batches that approximate the shape of the larger batch using core set sampling [182], significantly improving the mode coverage and sample quality of GANs trained with small batches.

While strong discriminator regularisation stabilises training, it allows the generator to make small changes and trick the discriminator, making convergence very slow. RobGAN [127] includes an adversarial attack step [135] that perturbs real images to trick the discriminator without altering the content inordinately, adapting the GAN objective into a min-max-min problem. This provides a weaker regularisation, enforcing small Lipschitz values locally rather than globally. This approach has been connected with the follow-the-ridge algorithm [221], [243], an optimisation approach for solving mini-max problems that reduces the optimisation path and converges to local mini-max points.

Another approach to improve training speed is to design more efficient architectures. Depthwise convolutions [31], which apply separate convolutions to each channel of a tensor, reducing the number of operations and hence also the run-time, have been found to have comparable quality to standard convolutions [147]. Lightweight GANs [125] achieve fast training using a number of tricks including small batch sizes, skip-layer excitation modules which provide efficient shortcut gradient flow, as well as using a self-supervised discriminator forcing good features to be learned.

5 AUTOREGRESSIVE LIKELIHOOD MODELS

Autoregressive generative models [8] are based on the chain rule of probability, where the probability of a variable that can be decomposed as x = x1, . . . , xn is expressed as

p(x) = p(x1, . . . , xn) = Π_{i=1}^{n} p(xi | x1, . . . , xi−1).  (29)
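A minimal sketch of training on this factorisation is shown below: a GRU models p(x_i | x_{<i}) over 256 discrete values and is trained by minimising the summed negative log-likelihood (Equation 30 in the following text); the architecture, start-token scheme, sequence length, and toy data are arbitrary stand-ins for the models discussed in Section 5.1.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ARModel(nn.Module):
        # Models p(x_i | x_1, ..., x_{i-1}) over 256 discrete values per step.
        def __init__(self, n_vals=256, hidden=128):
            super().__init__()
            self.embed = nn.Embedding(n_vals + 1, hidden)   # +1 for a start token
            self.rnn = nn.GRU(hidden, hidden, batch_first=True)
            self.out = nn.Linear(hidden, n_vals)

        def forward(self, x):
            start = torch.full((x.shape[0], 1), 256, dtype=torch.long)
            inp = torch.cat([start, x[:, :-1]], dim=1)      # shift right by one step
            h, _ = self.rnn(self.embed(inp))
            return self.out(h)                              # logits, (batch, n, 256)

    model = ARModel()
    x = torch.randint(0, 256, (8, 64))                      # toy 8-bit sequences
    logits = model(x)
    # Negative log-likelihood summed over positions, averaged over the batch.
    nll = F.cross_entropy(logits.reshape(-1, 256), x.reshape(-1), reduction='sum') / x.shape[0]
    nll.backward()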

As such, unlike GANs and energy models, it is possible to directly maximise the likelihood of the data by training a recurrent neural network to model p(xi|x1:i−1) by minimising the negative log-likelihood,

− ln p(x) = − Σ_{i=1}^{n} ln p(xi | x1, . . . , xi−1).  (30)

While autoregressive models are extremely powerful density estimators, sampling is inherently a sequential process and can be exceedingly slow on high dimensional data. Additionally, data must be decomposed into a fixed ordering; while the choice of ordering can be clear for some modalities (e.g. text and audio), it is not obvious for others such as images and can affect performance depending on the network architecture used.

Fig. 7: Autoregressive models decompose data points using the chain rule and learn conditional probabilities.

5.1 Architectures

The majority of research is focused on improving network architectures to increase their receptive fields and memory, ensuring the network has access to all parts of the input to encourage consistency, as well as increasing the network capacity, allowing more complex distributions to be modelled.

5.1.1 Recurrent Neural Networks

A natural architecture to apply is that of standard recurrent neural networks (RNNs) such as LSTMs [81], [214] and GRUs [33], [138], which model sequential data by tracking information in a hidden state. However, RNNs are known to forget information, limiting their receptive field and thus preventing modelling of long range relationships. This can be improved by stacking RNNs that run at different frequencies, allowing long data such as multiple seconds of audio to be modelled [33]. Nevertheless, their sequential nature means that training can be too slow for many applications.

5.1.2 Causal Convolutions

An alternative approach is that of causal convolutions, which apply masked or shifted convolutions over a sequence [28], [176], [213]. When stacked, this only provides a receptive field linear with depth, however, by dilating the convolutions to skip values with some step, the receptive field can be orders of magnitude higher.

5.1.3 Self-Attention

Neural attention is an approach which at each successive time step is able to select where it wishes to 'look' at previous time steps. This concept has been used to autoregressively 'draw' images onto a blank 'canvas' [60] in a manner similar to human drawing. More recently, self-attention (known as Transformers when used in an encoder-decoder setup) [216] has made significant strides improving not only autoregressive models, but also other generative models due to its parallel nature, infinite receptive field, and stable training. It achieves this using an attention scheme that can reference any previous input but uses an entirely independent process per time step so that there are no dependencies. Specifically, inputs are encoded as key-value pairs, where the values V represent the inputs, and the keys K act as an indexing method. At each time step a query q is made; taking the dot product of the queries and keys, a similarity vector is formed that describes which value vectors to access. This process can be expressed as

Attention(Q, K, V) = softmax( QKᵀ / √d_k ) V,  (31)

where d_k is the key/query dimension and is used to normalise gradient magnitudes. Since the self-attention process contains no recurrence, positional information must be passed into the function. A simple effective method to achieve this is to add sinusoidal positional encodings which combine sine and cosine functions of different frequencies to encode positional information [216]; alternatively, others use trainable positional embeddings [30].
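The sketch below implements Equation 31 with a causal (lower-triangular) mask so that each position only attends to itself and earlier positions; it uses a single head with no learned projections purely for illustration (libraries provide equivalents such as torch.nn.functional.scaled_dot_product_attention).

    import math
    import torch

    def causal_attention(Q, K, V):
        # Equation 31 with an autoregressive mask: position t may only attend
        # to positions <= t, keeping the factorisation of Equation 29 valid.
        n = Q.shape[-2]
        scores = Q @ K.transpose(-2, -1) / math.sqrt(Q.shape[-1])
        mask = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)
        scores = scores.masked_fill(mask, float('-inf'))
        return torch.softmax(scores, dim=-1) @ V

    # Toy usage: batch of 2 sequences, 10 time steps, key/query dimension 16.
    Q = torch.randn(2, 10, 16)
    K = torch.randn(2, 10, 16)
    V = torch.randn(2, 10, 16)
    out = causal_attention(Q, K, V)   # (2, 10, 16)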

The infinite receptive field of attention provides a powerful tool for representing data, however, the attention matrix QKᵀ grows quadratically with data dimension, making scaling difficult. Approaches include scaling across large quantities of GPUs [18], interleaving attention between causal convolutions [28], attending over local regions [166], and using sparse attention patterns that provide global attention when multiple layers are stacked [30]. More recently, a number of linear transformers have been proposed whose memory and time footprints grow linearly with data dimension [32], [106], [220]. By approximating the softmax operation with a kernel function with feature representation φ(x), the order of multiplications can be rearranged to

( φ(Q) φ(K)ᵀ ) V = φ(Q) ( φ(K)ᵀ V ),  (32)

allowing φ(K)ᵀ V to be cached and used for each query.

5.1.4 Multiscale Architectures

Even with linear complexity, scaling to high-resolution images grows quadratically with resolution. By using a multi-scale architecture that makes local independence assumptions, it is possible to reduce the complexity to O(ln N) for N pixels, allowing scaling to high resolutions [172]. By imposing a partitioning on the spatial dimensions, it is possible to build a multi-scale architecture without making independence assumptions; while this reduces the memory required, sampling times are still slow [141].

5.2 Data Modelling Decisions

An obvious method of modelling individual variables xi is to discretize and use a softmax layer on outputs. However, this can cause complications or be infeasible in some cases such as 16-bit audio modelling, in which the outputs would need 65,536 neurons.
setup) [216] have made significant strides improving not only autoregressive models, but also other generative mod- els due to their parallel nature, infinite receptive field, and 5.2 Data Modelling Decisions stable training. They achieve this using an attention scheme An obvious method of modelling individual variables xi is that can reference any previous input but using an entirely to discretize and use a softmax layer on outputs. However, independent process per time step so that there are no this can cause complications or be infeasible in some cases dependencies. Specifically, inputs are encoded as key-value such as 16-bit audio modelling in which the outputs would pairs, where the values V represent the inputs, and the keys need 65,536 neurons. A number of solutions have been K act as an indexing method. At each time step a query q proposed including: BOND-TAYLOR ET AL.: DEEP GENERATIVE MODELLING: A COMPARATIVE REVIEW 11

A number of solutions have been proposed including:

• Applying µ-law, a logarithmic companding algorithm which takes advantage of human perception of sound, then quantizing to 8-bit values [160].
• First predicting the first 8 bits, then predicting the second 8 bits conditioned on the first.
• Modelling output probabilities using a mixture of logistic distributions, which has the benefits of providing more useful gradients and allowing intensities never seen to still be sampled [176].

When modelling images, many works use "raster scan" ordering [176], [213], [214] where pixels are estimated row by row. Alternatives have been proposed such as "zig-zag" ordering [28] which allows pixels to depend on previously sampled pixels to the left and above, providing more relevant context. Another factor when modelling images is how to factorise sub-pixels. While it is possible to treat them as independent variables, this adds additional complexity. Alternatively, it is possible to instead condition on whole pixels, and output joint distributions in a single step [175].

5.3 Variants

As aforementioned, directly maximising the likelihood of data points may not be appropriate, with likelihood not always correlating with greater perceptual quality [199]. It requires the choice between spreading over modes or collapsing to a single mode, meaning approximate solutions are not incentivised; as such, it can be viewed as lacking an underlying metric, focusing entirely on probability.

5.3.1 Quantile Regression

One alternative approach is quantile regression, which is equivalent to minimising the Wasserstein distance between distributions. In this case, the inverse cumulative distribution function can be learned by minimising the Huber quantile loss [91],

ρτ(u) = |τ − I{u ≤ 0}| · u²/(2κ)   if |u| ≤ κ,
ρτ(u) = |τ − I{u ≤ 0}| · (|u| − κ/2)   otherwise.  (33)

Quantile regression has been applied to autoregressive models [161], requiring only minor architectural changes, and also having the benefit of not requiring quantization and improving perceptual quality scores.

5.3.2 Unnormalized Densities

Explicitly modelling the density of data points using a complementary distribution restricts the expressiveness of the network, for instance, finite mixtures of Gaussians struggle to model high frequency signals. A simple approach is to reduce the Lipschitz constant of the data distribution, for instance by adding Gaussian noise [139]. A more general approach is to learn an autoregressive energy model, removing this restriction at the expense of less efficient sampling. Nevertheless, MCMC sampling a low-dimensional space converges much faster than the high dimensional sampling involved for other energy models [42]. One approach to train an autoregressive energy model is to estimate the normalising constant for each time step using importance sampling, which due to the relatively low dimensionality is manageable [145]. Another proposed approach is to output the score at each time step by optimising what the authors term the composite score matching divergence, removing the need to calculate normalising constants [140].

Fig. 8: Normalizing flows build complex distributions by mapping a simple distribution through invertible functions.

6 NORMALIZING FLOWS

While training autoregressive models with max-likelihood offers plenty of benefits including stable training, density estimation, and a useful validation metric, the slow sampling speed and poor scaling properties handicap them significantly. Normalizing flows are a technique that also allows exact likelihood calculation while being efficiently parallelisable, as well as offering a useful latent space for downstream tasks.

Consider an invertible, smooth function f: R^d → R^d; by applying this transformation to a random variable z ∼ q(z), the distribution of the resulting random variable z′ = f(z) can be determined through the change of variables rule (and application of the chain rule),

q(z′) = q(z) |det ∂f^{−1}/∂z′| = q(z) |det ∂f/∂z|^{−1}.  (34)

Consequently, arbitrarily complex densities can be constructed by composing simple maps and applying Equation 34 [217]. This chain is known as a normalizing flow [173] (see Fig. 8). The density qK(zK) obtained by successively transforming a random variable z0 with distribution q0 through a chain of K transformations fk can be defined as

zK = fK ◦ · · · ◦ f2 ◦ f1(z0),  (35a)
ln qK(zK) = ln q0(z0) − Σ_{k=1}^{K} ln |det ∂fk/∂z_{k−1}|.  (35b)

Each transformation therefore must be sufficiently expressive while being easily invertible and have an efficient to compute Jacobian determinant. While restrictive, there have been a number of works which have introduced more powerful invertible functions (see Table 2). Nevertheless, normalizing flow models are typically less parameter efficient than other generative models.

One disadvantage of requiring transformations to be invertible is that the input dimension must be equal to the output dimension, which makes deep models inefficient and difficult to train. A popular solution to this is to use a multi-scale architecture [39], [111] (see Fig. 9) which divides the process into a number of stages, at the end of each factoring out half of the remaining units which are treated immediately as outputs. This allows latent variables to sequentially represent coarse to fine features and permits deeper architectures.
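Before turning to the individual layers in Table 2, the sketch below shows a single affine coupling layer together with its log-determinant contribution; here the forward pass maps data towards the base density, so the log-likelihood is the base log-density of z plus the summed log-determinant (the change of variables of Equations 34-35 applied to the inverse map). The conditioner network and the half-and-half split are arbitrary choices.

    import math
    import torch
    import torch.nn as nn

    class AffineCoupling(nn.Module):
        # z_a = x_a;  z_b = x_b * exp(s(x_a)) + t(x_a)
        def __init__(self, dim):
            super().__init__()
            self.d = dim // 2
            self.net = nn.Sequential(nn.Linear(self.d, 128), nn.ReLU(),
                                     nn.Linear(128, 2 * (dim - self.d)))

        def forward(self, x):
            xa, xb = x[:, :self.d], x[:, self.d:]
            s, t = self.net(xa).chunk(2, dim=1)
            z = torch.cat([xa, xb * torch.exp(s) + t], dim=1)
            log_det = s.sum(dim=1)        # triangular Jacobian: sum of scales
            return z, log_det

        def inverse(self, z):
            za, zb = z[:, :self.d], z[:, self.d:]
            s, t = self.net(za).chunk(2, dim=1)
            return torch.cat([za, (zb - t) * torch.exp(-s)], dim=1)

    # Toy usage: log-likelihood of x under a standard normal base density.
    flow = AffineCoupling(dim=4)
    x = torch.randn(32, 4)
    z, log_det = flow(x)
    log_px = (-0.5 * (z ** 2 + math.log(2 * math.pi))).sum(dim=1) + log_det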

TABLE 2: Normalizing Flow Layers: ⊙ represents elementwise multiplication, ⋆_l represents a cross-correlation layer.

Low Rank

Planar [173] (with w ∈ R^D, u ∈ R^D, b ∈ R)
  Function: y = x + u h(wᵀz + b)
  Inverse: no closed-form inverse
  Log-determinant: ln |1 + uᵀ h′(wᵀz + b) w|

Sylvester [12], [72]
  Function: y = x + U h(Wx + b)
  Inverse: no closed-form inverse
  Log-determinant: ln det(I_M + diag(h′(Wx + b)) W Uᵀ)

Coupling/Autoregressive

General Coupling
  Function: y^(1:d) = x^(1:d),  y^(d+1:D) = h(x^(d+1:D); fθ(x^(1:d)))
  Inverse: x^(1:d) = y^(1:d),  x^(d+1:D) = h^{−1}(y^(d+1:D); fθ(y^(1:d)))
  Log-determinant: ln |det ∇_{x^(d+1:D)} h|

MAF [165]
  Function: y^(t) = h(x^(t); fθ(x^(1:t−1)))
  Inverse: x^(t) = h^{−1}(y^(t); fθ(x^(1:t−1)))
  Log-determinant: − Σ_{t=1}^{D} ln |∂y^(t)/∂x^(t)|

IAF [112]
  Function: y^(t) = h(x^(t); fθ(y^(1:t−1)))
  Inverse: x^(t) = h^{−1}(y^(t); fθ(y^(1:t−1)))
  Log-determinant: Σ_{t=1}^{D} ln |∂y^(t)/∂x^(t)|

Affine Coupling [39]
  Function: h(x; θ) = x ⊙ exp(θ1) + θ2
  Inverse: h^{−1}(y; θ) = (y − θ2) ⊙ exp(−θ1)
  Log-determinant: Σ_{i=1}^{d} θ1^(i)

Flow++ [79] (where F is a monotone function)
  Function: h(x; θ) = exp(θ1) ⊙ F(x, θ3) + θ2
  Inverse: calculated through bisection search
  Log-determinant: Σ_{i=1}^{d} ( θ1^(i) + ln ∂F(x, θ3)_i/∂x_i )

Spline Flows [144], [46], [47] (where θ are the spline's knots)
  Function: h(x; θ) = Spline(x; θ)
  Inverse: h^{−1}(y; θ) = Spline^{−1}(y; θ)
  Log-determinant: computed in closed form as a product of quotient derivatives

B-NAF [36]
  Function: y = Wᵀx for blocked weights W = exp(W̃) ⊙ M_d + W̃ ⊙ M_o, where M_d selects diagonal blocks and M_o selects off-diagonal blocks
  Inverse: no closed-form inverse
  Log-determinant: ln Σ_{i=1}^{d} exp(W̃_ii)

Convolutions

1x1 Convolution [111] (for an h × w × c tensor x and c × c tensor W)
  Function: ∀ i, j: y_{i,j} = W x_{i,j}
  Inverse: ∀ i, j: x_{i,j} = W^{−1} y_{i,j}
  Log-determinant: h · w · ln |det W|

Emerging Convolutions [84]
  Function: k = w1 ⊙ m1, g = w2 ⊙ m2, y = k ⋆_l (g ⋆_l x)
  Inverse: z_t = (y_t − Σ_{i=t+1} G_{t,i} z_i)/G_{t,t},  x_t = (z_t − Σ_{i=1}^{t−1} K_{t,i} x_i)/K_{t,t}
  Log-determinant: Σ_c ln |k_{c,c,m_y,m_x} g_{c,c,m_y,m_x}|

Lipschitz Residual

i-ResNet [7] (where ‖f‖_L < 1)
  Function: y = x + f(x)
  Inverse: x_1 = y, x_{n+1} = y − f(x_n), converging at an exponential rate
  Log-determinant: tr(ln(I + ∇_x f)) = Σ_{k=1}^{∞} (−1)^{k+1} tr((∇_x f)^k)/k

6.1 Coupling and Autoregressive Layers
A simple way of building an expressive invertible function is the coupling flow [38], which divides inputs into two sections and applies a bijection on one half parameterised by the other,

y^{(1:d)} = x^{(1:d)}, \quad (36a)
y^{(d+1:D)} = h(x^{(d+1:D)}; f_\theta(x^{(1:d)})), \quad (36b)

where f_\theta can be arbitrarily complex, i.e. a neural network, and h tends to be selected as an elementwise function, making the Jacobian triangular and allowing efficient computation of the determinant, i.e. the product of elements on the diagonal.

6.1.1 Affine Coupling
A simple example of this is the affine coupling layer [39],

y^{(d+1:D)} = x^{(d+1:D)} \odot \exp(f_\sigma(x^{(1:d)})) + f_\mu(x^{(1:d)}), \quad (37)

which has a simple Jacobian determinant and can be trivially rearranged to obtain a definition of x^{(d+1:D)} in terms of y, provided that the scaling coefficients are not 0. This simplicity, however, comes at the cost of expressivity; stacking numerous such flows significantly increases their expressivity, allowing them to learn representations of complex high dimensional data such as images [111], but it is unknown whether multiple layers of affine flows are universal approximators or not [164].
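The sketch below illustrates Equations 36 and 37 with a minimal affine coupling layer in PyTorch, including its inverse and log-determinant; the conditioner architecture, bounded scales, and names are assumptions for illustration rather than the Real NVP implementation.

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """y[:d] = x[:d];  y[d:] = x[d:] * exp(s(x[:d])) + m(x[:d])   (Equations 36-37)."""
    def __init__(self, dim, hidden=64):
        super().__init__()
        self.d = dim // 2
        # Conditioner f_theta: predicts log-scale s and shift m from the untouched half.
        self.net = nn.Sequential(
            nn.Linear(self.d, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * (dim - self.d)),
        )

    def forward(self, x):
        x1, x2 = x[:, :self.d], x[:, self.d:]
        s, m = self.net(x1).chunk(2, dim=-1)
        s = torch.tanh(s)                      # keep scales bounded for stability
        y2 = x2 * s.exp() + m
        log_det = s.sum(-1)                    # triangular Jacobian: sum of log-scales
        return torch.cat([x1, y2], dim=-1), log_det

    def inverse(self, y):
        y1, y2 = y[:, :self.d], y[:, self.d:]
        s, m = self.net(y1).chunk(2, dim=-1)
        s = torch.tanh(s)
        x2 = (y2 - m) * torch.exp(-s)
        return torch.cat([y1, x2], dim=-1)

layer = AffineCoupling(dim=6)
x = torch.randn(8, 6)
y, log_det = layer(x)
assert torch.allclose(layer.inverse(y), x, atol=1e-4)
```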
6.1.2 Monotone Functions
Another method of creating invertible functions that can be applied element-wise is to enforce monotonicity. One possibility to achieve this is to define h as an integral over a positive but otherwise unconstrained function g [222],

h(x_i; \theta) = \int_0^{x_i} g_\phi(x; \theta_1) \, dx + \theta_2, \quad (38)

however, this integration requires numerical approximation. Alternatively, by choosing g to be a function with a known integral solution, h can be efficiently evaluated. This has been accomplished using positive polynomials [95] and the CDF of a mixture of logits [79]. Both cases, however, don't have analytical inverses and have to be approximated iteratively with bisection search. Another option is to represent g as a monotonic spline: a piecewise function where each piece is easy to invert. As such, the inverse is as fast to evaluate as the forward pass. Linear and quadratic splines [144], cubic splines [46], and rational-quadratic splines [47] have been applied so far.

Fig. 9: Factoring out variables at different scales allows normalizing flows to scale to high dimensional data.

6.1.3 Autoregressive Flows
For a single coupling layer, a significant proportion of inputs remain unchanged. A more flexible generalisation of coupling layers is the autoregressive flow, or MAF [165],

y^{(t)} = h(x^{(t)}; f_\theta(x^{(1:t-1)})). \quad (39)

Here f_\theta can be arbitrarily complex, allowing the use of advances in autoregressive modelling (Section 5), and h is a bijection as used for coupling layers. Some monotonic bijectors have been created specifically for autoregressive flows, namely Neural Autoregressive Flows (NAF) [87] and Block NAF [36]. Unlike coupling layers, a single autoregressive flow is a universal approximator.

Alternatively, an autoregressive flow can be conditioned on y^{(1:t-1)} rather than x^{(1:t-1)}; this is known as an Inverse Autoregressive Flow, or IAF [112]. While coupling layers can be evaluated efficiently in both directions, MAF permits parallel density estimation but sequential sampling, and IAF permits parallel sampling but sequential density estimation.
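To illustrate the trade-off around Equation 39, the sketch below implements an affine autoregressive (MAF-style) transform with a strictly causal masked linear conditioner: the forward pass over all dimensions runs in parallel, while the inverse must be evaluated sequentially. The single masked linear layer is a simplifying assumption standing in for a full MADE-style conditioner.

```python
import torch
import torch.nn as nn

class MaskedLinear(nn.Linear):
    """Linear layer where the outputs for dimension t only see inputs 1..t-1."""
    def __init__(self, dim, out_per_dim):
        super().__init__(dim, dim * out_per_dim)
        mask = torch.tril(torch.ones(dim, dim), diagonal=-1)  # strictly lower triangular
        self.register_buffer("mask", mask.repeat_interleave(out_per_dim, dim=0))

    def forward(self, x):
        return nn.functional.linear(x, self.weight * self.mask, self.bias)

class AffineMAF(nn.Module):
    """y_t = x_t * exp(s_t) + m_t with (s_t, m_t) depending only on x_{1:t-1}."""
    def __init__(self, dim):
        super().__init__()
        self.dim = dim
        self.cond = MaskedLinear(dim, out_per_dim=2)

    def forward(self, x):                                     # parallel: a single pass
        s, m = self.cond(x).view(x.shape[0], self.dim, 2).unbind(-1)
        s = torch.tanh(s)
        y = x * s.exp() + m
        return y, s.sum(-1)                                   # log|det| = sum_t s_t

    def inverse(self, y):                                     # sequential: one dim at a time
        with torch.no_grad():
            x = torch.zeros_like(y)
            for t in range(self.dim):
                s, m = self.cond(x).view(y.shape[0], self.dim, 2).unbind(-1)
                s = torch.tanh(s)
                x[:, t] = (y[:, t] - m[:, t]) * torch.exp(-s[:, t])
        return x

flow = AffineMAF(dim=5)
x = torch.randn(4, 5)
y, log_det = flow(x)
assert torch.allclose(flow.inverse(y), x, atol=1e-4)
```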
6.1.4 Probability Density Distillation
Inverse autoregressive flows [112] offer the ability to sample from an autoregressive model in parallel, however, training via maximum likelihood is inherently sequential, making this infeasible for high dimensional data. Probability density distillation [159] has been proposed as a solution to this, where a second pre-trained autoregressive network which can perform density estimation in parallel is used as a 'teacher' network, while an IAF network is used as a 'student' and mimics the teacher's distribution by minimising the KL divergence between the two distributions:

D_{KL}(p_S \| p_T) = H(p_S, p_T) - H(p_S), \quad (40)

where p_S and p_T are the student's and teacher's distributions respectively, H(p_S, p_T) is the cross-entropy between p_S and p_T, and H(p_S) is the entropy of p_S. Crucially, this never requires the student's inverse function to be used, allowing it to be computed entirely in parallel.

6.2 Convolutional
A considerable problem with coupling and autoregressive flows is the restricted triangular Jacobian, meaning that all inputs cannot interact with each other. Simple solutions involve fixed permutations on the output space such as reversing the order [38], [39]. A more general approach is to use a 1 × 1 convolution, which is equivalent to a linear transformation applied across channels [111].

A number of works have proposed other convolution-based flows that permit larger kernel sizes. A number of these apply variations on causal convolutions [160], including emerging convolutions [84] whose inverse is sequential, MaCow [132] which uses smaller conditional fields allowing more efficient sampling, and MintNet [190] which approximates the inverse using fixed-point iteration. Alternative approaches to causal masking involve imposing repeated (periodic) structure [100], however in general this is not a good assumption for image modelling, as well as representing convolutions as exponential matrix-vector products, exp(M)x, approximated implicitly with a power series, allowing otherwise unconstrained kernels [85].

6.3 Residual Flows
Residual networks [74] are a popular technique to build deep neural networks, avoiding vanishing gradients by stacking blocks of the form

y = x + f_\theta(x). \quad (41)

By restricting f_\theta it is possible to enforce invertibility.

6.3.1 Matrix Determinant Lemma
If a function has a certain residual form, then its Jacobian determinant can be computed with the matrix determinant lemma [173]. A simple example is planar flow [173], which is equivalent to a 3 layer MLP with a single neuron bottleneck:

y = x + u h(w^T x + b), \quad (42)

where u, w \in \mathbb{R}^d, b \in \mathbb{R}, and h is a differentiable non-linearity. Planar flows are invertible provided some simple conditions are satisfied, however the inverse is difficult to compute, making them only practical for density estimation tasks. A higher rank generalisation of the matrix determinant lemma has been applied to planar flows, known as Sylvester flows, removing the severe bottleneck and thus allowing greater representation ability [12], [72].

6.3.2 Lipschitz Constrained
By restricting the Lipschitz constant of f_\theta such that \|f_\theta\|_L < 1, this block is invertible [7]. The inverse, however, has no closed form definition but can be found through fixed-point iteration, which by the Banach fixed-point theorem converges to a unique fixed point at an exponential rate dependent on \|f_\theta\|_L. The authors originally proposed a biased approximation of the log determinant of the Jacobian as a power series where the Jacobian trace is approximated using Hutchinson's trace estimator (see Table 2), but an unbiased approximator known as a Russian roulette estimator has also been proposed [24]. Unlike coupling layers, residual flows have dense Jacobians, allowing all inputs to interact with one another. Enforcing Lipschitz constraints has been achieved with convolutional networks [55], [126], [143] as well as self-attention [107].

Making strong Lipschitz assumptions severely restricts the class of functions learned; an N layer residual flow network is at most 2^N-Lipschitz. Implicit flows [129] bypass this by solving implicit equations of the form

F(x, y) = f_\theta(x) - f_\phi(y) + x - y = 0, \quad (43)

where both f_\theta and f_\phi have Lipschitz constants less than 1. Both the forwards (solve for y given x) and backwards (solve for x given y) directions require solving a root finding problem similar to the inverse process of residual flows; indeed, an implicit flow is equivalent to the composition of a residual flow and the inverse of a residual flow. This allows them to model arbitrary Lipschitz transformations.
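The sketch below illustrates the mechanics of Section 6.3.2: a spectrally normalised residual block scaled to be contractive, inverted by fixed-point iteration, with the log-determinant approximated by a truncated power series using a single Hutchinson sample. The scaling constant, truncation length, and module names are illustrative assumptions rather than the i-ResNet reference implementation.

```python
import torch
import torch.nn as nn

class InvertibleResidualBlock(nn.Module):
    """y = x + f(x) with ||f||_L < 1, inverted by x_{n+1} = y - f(x_n)."""
    def __init__(self, dim, hidden=64, scale=0.9):
        super().__init__()
        self.scale = scale  # shrink the spectrally normalised network so ||f||_L < 1
        self.net = nn.Sequential(
            nn.utils.spectral_norm(nn.Linear(dim, hidden)), nn.ELU(),
            nn.utils.spectral_norm(nn.Linear(hidden, dim)),
        )

    def f(self, x):
        return self.scale * self.net(x)

    def forward(self, x):
        return x + self.f(x)

    def inverse(self, y, n_iters=50):
        # Banach fixed-point iteration; converges because f is contractive.
        with torch.no_grad():
            x = y.clone()
            for _ in range(n_iters):
                x = y - self.f(x)
        return x

    def log_det(self, x, n_terms=10):
        # tr(ln(I + J_f)) ~= sum_k (-1)^{k+1} tr(J_f^k) / k.  Each trace is estimated with
        # one Hutchinson sample via repeated vector-Jacobian products; note tr(J^k)
        # equals tr((J^T)^k), so accumulating J^T w is sufficient.  x needs requires_grad.
        fx = self.f(x)
        v = torch.randn_like(x)
        w = v
        log_det = torch.zeros(x.shape[0])
        for k in range(1, n_terms + 1):
            w = torch.autograd.grad(fx, x, grad_outputs=w, retain_graph=True)[0]
            log_det = log_det + (-1) ** (k + 1) * (w * v).sum(-1) / k
        return log_det

block = InvertibleResidualBlock(dim=4)
x = torch.randn(8, 4, requires_grad=True)
y = block(x)
x_rec = block.inverse(y)   # matches x up to the fixed-point tolerance
ld = block.log_det(x)      # per-sample log-determinant estimate
```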
6.4 Surjective and Stochastic Layers
Restricting the class of functions available to those that are invertible introduces a number of practical problems related to the topology-preserving property of diffeomorphisms. For example, mapping a uni-modal distribution to a multi-modal distribution is extremely challenging, requiring a highly varying Jacobian [40]. By composing bijections with surjective or stochastic layers these topological constraints can be bypassed [150]. While the log-likelihood of stochastic layers can only be bounded by their ELBO, functions surjective in the inference direction permit exact likelihood evaluation even with altered dimensionality. Surjective transformations have the following likelihood contributions:

\mathbb{E}_{q(y|x)} \left[ \ln \frac{p(x|y)}{q(y|x)} \right], \quad (44)

where p(x|y) is deterministic for generative surjections, and q(y|x) is deterministic for inference surjections.

One approach to build a surjective layer is to augment the input space with additional dimensions, allowing smoother transformations to be learned [23], [44], [86]; the inverse process, where some dimensions are factored out, is equivalent to a multi-scale architecture [39]. Another approach known as RAD [40] learns a partitioning of the data space into disjoint subsets \{Y_i\}_{i=1}^K, and applies piecewise bijections to each region g_i : X \to Y_i, \forall i \in \{1, \ldots, K\}. The generative direction learns a classifier on X, i \sim p(i|x), allowing the inverse to be calculated as y = g_i(x). Similar to both of these approaches are CIFs [34], which consider a continuous partitioning of the data space via augmentation, equivalent to an infinite mixture of normalizing flows. Other approaches include modelling finite mixtures of flows [43]. Some powerful stochastic layers have already been discussed in this survey, namely VAEs [110] and DDPMs [80]. Stochastic layers have been incorporated into normalizing flows by interleaving small energy models, sampled with MCMC, between bijectors [225].

6.5 Continuous Time Flows
It is possible to consider a normalizing flow with an infinite number of steps that is defined instead by an ordinary differential equation specified by a Lipschitz continuous neural network f with parameters \theta, which describes the transformation of a hidden state z(t) \in \mathbb{R}^D [25],

\frac{\partial z(t)}{\partial t} = f(z(t), t, \theta). \quad (45)

Starting from input noise z(0), an ODE solver can solve an initial value problem for some time T, at which data is defined, z(T). Modelling a transformation in this form has a number of advantages such as inherent invertibility by running the ODE solver backwards, parameter efficiency, and adaptive computation. However, it is not immediately clear how to train such a model through backpropagation. While it is possible to backpropagate directly through an ODE solver, this limits the choice of solvers to differentiable ones as well as requiring large amounts of memory. Instead, the authors apply the adjoint sensitivity method, which instead solves a second, augmented ODE backwards in time and allows the use of a black box ODE solver. That is, to optimise a loss dependent on an ODE solver,

L(z(t_1)) = L\left( z(t_0) + \int_{t_0}^{t_1} f(z(t), t, \theta) \, dt \right) = L(\mathrm{ODESolve}(z(t_0), f, t_0, t_1, \theta)), \quad (46)

the adjoint a(t) = \frac{\partial L}{\partial z(t)} can be used to calculate the derivative of the loss with respect to the parameters in the form of another initial value problem [167],

\frac{\partial L}{\partial \theta} = \int_{t_1}^{t_0} \left( \frac{\partial L}{\partial z(t)} \right)^T \frac{\partial f(z(t), t, \theta)}{\partial \theta} \, dt, \quad (47)

which can be efficiently evaluated by automatic differentiation at a time cost similar to evaluating f itself. Despite the complexity of this transformation, the continuous change of variables rule is remarkably simple:

\frac{\partial \ln p(z(t))}{\partial t} = -\mathrm{tr}\left( \frac{\partial f(z(t), t, \theta)}{\partial z(t)} \right), \quad (48)

and can be computed using an ODE solver as well. The resulting continuous-time flow is known as FFJORD [56]. Since the length of the flow tends to infinity (an infinitesimal flow), the true posterior distribution can be recovered [173].

As previously mentioned, invertible functions suffer from topological problems; this is especially true for Neural ODEs since their continuous nature prevents trajectories from crossing. Similar to augmented normalizing flows [86], this can be solved by providing additional dimensions for the flow to traverse [44]. Specifically, a p-dimensional Euclidean space can be approximated by a Neural ODE in a (2p + 1)-dimensional space [233].

6.5.1 Regularising Trajectories
ODE solvers can require large numbers of network evaluations, notably when the ODE is stiff or the dynamics change quickly in time. By introducing regularisation, a simpler ODE can be learned, reducing the number of evaluations required. Specifically, all works here are inspired by optimal transport theory to encourage straight trajectories. Monge-Ampère Flow [236] and Potential Flow Generators [229] parameterise a potential function satisfying the Monge-Ampère equation [20], [217] with a neural network. RNODE [50] applies transport costs to FFJORD as well as regularising the Frobenius norm of the Jacobian, encouraging straight trajectories. OT-Flow [158] combines these approaches, parameterising a potential as well as applying transport costs; additionally, utilising the optimal transport derivation, they derive an exact trace definition with similar complexity to using Hutchinson's trace estimator.
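As a minimal illustration of Equations 45 and 48 from Section 6.5, the sketch below jointly integrates the state and its log-density with a fixed-step Euler solver, computing the exact Jacobian trace (feasible only for small dimensions); the step count, dynamics network, and Euler scheme are illustrative assumptions rather than the adaptive solvers, trace estimators, and adjoint training used by FFJORD.

```python
import torch
import torch.nn as nn

class Dynamics(nn.Module):
    """The ODE right-hand side f(z(t), t, theta) of Equation 45."""
    def __init__(self, dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 1, hidden), nn.Tanh(),
            nn.Linear(hidden, dim),
        )

    def forward(self, z, t):
        t_col = t.expand(z.shape[0], 1)
        return self.net(torch.cat([z, t_col], dim=-1))

def f_and_trace(f, z, t):
    """Evaluate f(z, t) and the exact trace of df/dz, column by column (small dims only)."""
    z = z.detach().requires_grad_(True)   # fresh leaf; training would use the adjoint method
    out = f(z, t)
    trace = torch.zeros(z.shape[0])
    for i in range(z.shape[1]):
        grad_i = torch.autograd.grad(out[:, i].sum(), z, retain_graph=True)[0]
        trace = trace + grad_i[:, i]
    return out.detach(), trace.detach()

def integrate(f, z0, log_p0, t0=0.0, t1=1.0, steps=20):
    """Fixed-step Euler integration of dz/dt = f and d ln p/dt = -tr(df/dz) (Equation 48)."""
    dt = (t1 - t0) / steps
    z, log_p = z0, log_p0
    for k in range(steps):
        t = torch.tensor([[t0 + k * dt]])
        dz, trace = f_and_trace(f, z, t)
        z = z + dt * dz
        log_p = log_p - dt * trace
    return z, log_p

dim = 2
dynamics = Dynamics(dim)
base = torch.distributions.Normal(torch.zeros(dim), torch.ones(dim))
z0 = base.sample((8,))
log_p0 = base.log_prob(z0).sum(-1)
zT, log_pT = integrate(dynamics, z0, log_p0)   # transformed samples and their log-density
```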

7 CONNECTIONS BETWEEN GENERATIVE MODELS
While deep generative models all maximise the likelihood of the training data, they achieve this in various ways, making different trade-offs such as diversity for quality, sampling speed for training speed, and capacity for complexity. There are also a plethora of hybrid approaches, attempting to get the best of both worlds, or at least find saddle points, balancing trade-offs. The varied connections between these systems mean that advances in one field inevitably benefit others; for instance, improved variational bounds [19], [131] are beneficial for not only VAEs [110] but also diffusion models [79] and surjective flows [150].

An area of particular overlap is the augmentation of training data, acting as regularisation while increasing the quantity of available training data; augmentation can be applied almost freely to GANs (see Section 4.1.3), however, direct application to other architectures would lead to augmentations leaking into samples. By conditioning these models on the augmentation type, effective training is not only possible but has been found to substantially improve sample quality, even on small models [99].

Fig. 10: Implicit networks model data continuously, permitting arbitrarily high resolutions. Dashed lines represent gradients, F is an implicit network, and H is a hypernetwork. (a) Implicit GON [15]. (b) Implicit GAN [45].

7.1 Assessing Quality
A huge problem when developing generative models is how to effectively evaluate and compare them. Qualitative comparison of random samples plays a large role in the majority of state-of-the-art works, however, it is subjective and time-consuming to compare many works. Calculating the log-likelihood on a separate validation set is popular for tractable likelihood models, but comparison with implicit likelihood models is difficult, and while it is a good measure of diversity, it does not correlate well with quality [199].

One approach to quantify sample quality is Inception Score (IS) [175], which takes a trained classifier and determines whether a sample has low label entropy, indicating that a meaningful class is likely, and whether the distribution of classes over a large number of samples has high entropy, indicating that a diverse range of images can be sampled. A perfect IS can be scored by a model that creates only one image per class [130], leading to the creation of Fréchet Inception Distance (FID) [75], which models the activations of a particular layer of a classifier as multivariate Gaussians for real and generated data, measuring the Fréchet distance between the two. These approaches are trivially solved by memorising the dataset and are less applicable to non-natural image-related data. Kernel Inception Distance (KID) [13] mitigates this somewhat, instead calculating the squared maximum mean discrepancy in feature space, however, pretrained features may not be sufficient to detect overfitting. Another approach is to train a neural network to distinguish between real and generated samples, similar to the discriminator from a GAN; while this detects overfitting well, it increases the complexity and time required to evaluate a model and is biased towards adversarially trained models [67].
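As an illustration of the Fréchet distance underlying FID, the following sketch computes the distance between two Gaussians fitted to feature activations; the feature arrays are assumed to come from a pretrained Inception network, which is not included here, and the random arrays below only stand in for such features.

```python
import numpy as np
from scipy import linalg

def frechet_distance(feats_real: np.ndarray, feats_fake: np.ndarray) -> float:
    """FID-style distance between Gaussians fitted to two sets of feature activations.

    feats_*: arrays of shape (num_samples, feature_dim), e.g. Inception pool features.
    """
    mu1, mu2 = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    sigma1 = np.cov(feats_real, rowvar=False)
    sigma2 = np.cov(feats_fake, rowvar=False)

    diff = mu1 - mu2
    # Matrix square root of the product of covariances; discard tiny imaginary parts
    # introduced by numerical error.
    covmean = linalg.sqrtm(sigma1 @ sigma2)
    if np.iscomplexobj(covmean):
        covmean = covmean.real

    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))

# Example with random features standing in for real Inception activations.
rng = np.random.default_rng(0)
real = rng.normal(size=(500, 64))
fake = rng.normal(loc=0.1, size=(500, 64))
print(frechet_distance(real, fake))
```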
7.2 Applications
In general, the definition of a generative model means that any technique can be used on any modality/task, however, some models are more suited for certain tasks. Standard autoregressive networks are popular for text/audio generation [18], [30], [160]; VAEs have been applied but posterior collapse is difficult to mitigate [5], [16]; GANs are more parameter efficient but struggle to model discrete data [149] and suffer from mode collapse [115]; some normalizing flows offer parallel synthesis, providing substantial speedup [168], [204], [244]. Video synthesis is more challenging due to its exceptionally high dimensionality; typically, approaches combine a latent-based implicit generative model to generate individual frames with an autoregressive network used to predict future latents [5], [116], [120]. This approach is similar to the one used in reinforcement learning to construct world models [69], [70].

7.3 Implicit Representation
Typically, deep architectures discussed in this survey are built with data represented as discrete arrays, thus using discrete components such as convolutions and self-attention. Implicit representation on the other hand treats data as continuous signals, mapping coordinates to data values [183], [197]. Implicit Gradient Origin Networks (GONs; Fig. 10a) [15] form a latent variable model by concatenating latent vectors with coordinates which are passed through an implicit network; here latent vectors are calculated as the gradient of a reconstruction loss with respect to the origin. By sampling using a finer grid of coordinates, super-resolution beyond resolutions seen during training is possible. Another approach to learn an implicit generative model as a GAN is to map latents to the weights of an implicit function using a hyper-network, the output of which is classified by a discriminator [45] (Fig. 10b).

8 CONCLUSION
While GANs have led the way in terms of sample quality for some time now, the gap between other approaches is shrinking; the diminished mode collapse and simpler training objectives make these models more enticing than ever, however, the number of parameters required in addition to slow run-times pose a substantial handicap. Despite this, recent work notably in hybrid models offers a balance between extremes at the expense of extra model complexity that hinders broader adoption. A clear stand out is the application of innovative data augmentation strategies within generative models, offering benefits impressively without necessitating more powerful architectures. When it comes to scaling models to high-dimensional data, attention is a common theme, allowing long-range dependencies to be learned; recent advances in linear attention will aid scaling to even higher resolutions. Implicit networks are another promising direction, allowing efficient synthesis of arbitrarily high resolution and irregular data. Similar unified generative models capable of modelling continuous, irregular, and arbitrary length data, over different scales and domains
[22] Tong Che, Ruixiang Zhang, Jascha Sohl-Dickstein, Hugo Larochelle, Liam Paull, Yuan Cao, and Yoshua Bengio. Your GAN is Secretly an Energy-based Model and You Should Use Discriminator Driven Latent Sampling. NeurIPS, 33, 2020.
[23] Jianfei Chen, Cheng Lu, Biqi Chenli, Jun Zhu, and Tian Tian. VFlow: More Expressive Generative Flows with Variational Data Augmentation. In ICML, 2020.
will be key for the future of generalisation. [24] Ricky T. Q. Chen, Jens Behrmann, David K. Duvenaud, and Joern-Henrik Jacobsen. Residual Flows for Invertible Generative Modeling. NeurIPS, 2019. [25] Ricky T. Q. Chen, Yulia Rubanova, Jesse Bettencourt, and David K REFERENCES Duvenaud. Neural Ordinary Differential Equations. NeurIPS 31, 2018. [26] Ting Chen, Xiaohua Zhai, Marvin Ritter, Mario Lucic, and Neil [1] Guillaume Alain, Yoshua Bengio, Li Yao, Jason Yosinski, eric´ Houlsby. Self-Supervised GANs via Auxiliary Rotation Loss. In Thibodeau-Laufer, Saizheng Zhang, and Pascal Vincent. GSNs: IEEE CVPR, 2019. generative stochastic networks. Information and Inference, [27] Xi Chen, Diederik P. Kingma, Tim Salimans, Yan Duan, Prafulla 5(2):210–249, 2016. Dhariwal, John Schulman, Ilya Sutskever, and Pieter Abbeel. [2] Michael Arbel, Liang Zhou, and Arthur Gretton. Generalized Variational Lossy Autoencoder. In ICLR, 2017. Energy Based Models. In ICLR, 2021. [3] Martin Arjovsky and Leon Bottou. Towards Principled Methods [28] Xi Chen, Nikhil Mishra, Mostafa Rohaninejad, and Pieter Abbeel. for Training Generative Adversarial Networks. In ICLR, 2017. PixelSNAIL: An Improved Autoregressive Generative Model. ICML, 2, 2017. [4] Martin Arjovsky, Soumith Chintala, and Leon´ Bottou. Wasser- stein GAN. arXiv:1701.07875, 2017. [29] Rewon Child. Very Deep VAEs Generalize Autoregressive Mod- [5] Mohammad Babaeizadeh, Chelsea Finn, Dumitru Erhan, Roy H. els and Can Outperform Them on Images. In ICLR, 2021. Campbell, and Sergey Levine. Stochastic Variational Video Pre- [30] Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. diction. In ICLR, 2018. Generating Long Sequences with Sparse Transformers. [6] Matthias Bauer and Andriy Mnih. Resampled Priors for Varia- arXiv:1904.10509, 2019. tional Autoencoders. In AISTATS, pages 66–75, 2019. [31] Francois Chollet. Xception: Deep Learning With Depthwise [7] Jens Behrmann, Will Grathwohl, Ricky T. Q. Chen, David Duve- Separable Convolutions. In IEEE CVPR, 2017. naud, and Jorn-Henrik¨ Jacobsen. Invertible Residual Networks. [32] Krzysztof Marcin Choromanski, Valerii Likhosherstov, David In ICML, 2019. Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter [8] Yoshua Bengio, Rejean´ Ducharme, Pascal Vincent, and Christian Hawkins, Jared Quincy Davis, Afroz Mohiuddin, Lukasz Kaiser, Jauvin. A Neural Probabilistic Language Model. JMLR, 3:1137– David Benjamin Belanger, Lucy J. Colwell, and Adrian Weller. 1155, 2003. Rethinking Attention with Performers. In ICLR, 2021. [9] Yoshua Bengio, Nicholas Leonard,´ and Aaron Courville. Esti- [33] Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and mating or Propagating Gradients Through Stochastic Neurons Yoshua Bengio. Empirical Evaluation of Gated Recurrent Neural for Conditional Computation. arXiv:1308.3432, 2013. Networks on Sequence Modeling. arXiv:1412.3555, 2014. [10] Yoshua Bengio, Li Yao, Guillaume Alain, and Pascal Vincent. [34] Rob Cornish, Anthony Caterini, George Deligiannidis, and Ar- Generalized Denoising Auto-Encoders as Generative Models. naud Doucet. Relaxing Bijectivity Constraints with Continuously NeurIPS 26, 2013. Indexed Normalising Flows. In ICML, 2020. [11] Yoshua Bengio, Li Yao, Guillaume Alain, and Pascal Vincent. [35] Bin Dai and David Wipf. Diagnosing and Enhancing VAE Generalized Denoising Auto-Encoders as Generative Models. Models. arXiv:1903.05789, 2019. NeurIPS, 26, 2013. [36] Nicola De Cao, Ivan Titov, and Wilker Aziz. Block Neural [12] Rianne van den Berg, Leonard Hasenclever, Jakub M. 
Tomczak, Autoregressive Flow. In UAI, 2019. and Max Welling. Sylvester Normalizing Flows for Variational [37] Emily L. Denton, Soumith Chintala, Arthur Szlam, and Rob Fer- Inference. In UAI, 2018. gus. Deep Generative Image Models using a Laplacian Pyramid [13] Mikołaj Binkowski,´ Dougal J. Sutherland, Michael Arbel, and of Adversarial Networks. NeurIPS, 28, 2015. Arthur Gretton. Demystifying MMD GANs. In ICLR, 2018. [38] Laurent Dinh, David Krueger, and Yoshua Bengio. NICE: Non- [14] Piotr Bojanowski, Armand Joulin, David Lopez-Paz, and Arthur Linear Independent Components Estimation. In ICLR Workshop, Szlam. Optimizing the Latent Space of Generative Networks. 2015. arXiv:1707.05776, 2019. [39] Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density [15] Sam Bond-Taylor and Chris G. Willcocks. Gradient Origin Net- estimation using Real NVP. ICLR, 2017. works. In ICLR, 2021. [40] Laurent Dinh, Jascha Sohl-Dickstein, Razvan Pascanu, and Hugo [16] Samuel R. Bowman, Luke Vilnis, Oriol Vinyals, Andrew M. Dai, Larochelle. A RAD approach to deep mixture models. In ICLR Rafal Jozefowicz, and Samy Bengio. Generating Sentences from Workshop, 2019. a Continuous Space. arXiv:1511.06349, 2016. [41] Alexey Dosovitskiy and Thomas Brox. Generating Images with [17] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large Perceptual Similarity Metrics based on Deep Networks. NeurIPS, Scale GAN Training for High Fidelity Natural Image Synthesis. 2016. arXiv:1809.11096 , 2019. [42] Yilun Du and Igor Mordatch. Implicit Generation and General- [18] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, ization in Energy-Based Models. NeurIPS, 33, 2019. Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav [43] Leo L. Duan. Transport Monte Carlo. arXiv:1907.10448, 2020. Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, [44] Emilien Dupont, Arnaud Doucet, and Yee Whye Teh. Augmented Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Neural ODEs. NeurIPS 32, 2019. Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, [45] Emilien Dupont, Yee Whye Teh, and Arnaud Doucet. Generative Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam Models as Distributions of Functions. arXiv:2102.04776, 2021. McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. [46] Conor Durkan, Artur Bekasov, Iain Murray, and George Papa- Language Models are Few-Shot Learners. arXiv:2005.14165, 2020. makarios. Cubic-Spline Flows. In ICML Workshop, 2019. [19] Yuri Burda, Roger B. Grosse, and Ruslan Salakhutdinov. Impor- [47] Conor Durkan, Artur Bekasov, Iain Murray, and George Papa- tance Weighted Autoencoders. In ICLR, 2016. makarios. Neural Spline Flows. NeurIPS 32, 2019. [20] Luis A. Caffarelli and Mario Milman. Monge Ampere` Equation: [48] Conor Durkan, Iain Murray, and George Papamakarios. On Applications to Geometry, Optimization. American Mathemati- Contrastive Learning for Likelihood-free Inference. In ICML, cal Soc., 1999. 2020. [21] Miguel A Carreira-Perpinan and Geoffrey E Hinton. On Con- [49] Patrick Esser, Robin Rombach, and Bjorn¨ Ommer. Taming Trans- trastive Divergence Learning. In AISTATS, volume 10, pages 33– formers for High-Resolution Image Synthesis. arXiv:2012.09841, 40, 2005. 2021. BOND-TAYLOR ET AL.: DEEP GENERATIVE MODELLING: A COMPARATIVE REVIEW 17

[50] Chris Finlay, Joern-Henrik Jacobsen, Levon Nurbekyan, and [74] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Adam Oberman. How to Train Your Neural ODE: the World Residual Learning for Image Recognition. In IEEE CVPR, 2016. of Jacobian and Kinetic Regularization. In ICML, 2020. [75] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard [51] Ruiqi Gao, Erik Nijkamp, Diederik P. Kingma, Zhen Xu, An- Nessler, and Sepp Hochreiter. GANs Trained by a Two Time-Scale drew M. Dai, and Ying Nian Wu. Flow Contrastive Estimation of Update Rule Converge to a Local Nash Equilibrium. NeurIPS 30, Energy-Based Models. In IEEE CVPR, 2020. 2017. [52] Ruiqi Gao, Yang Song, Ben Poole, Ying Nian Wu, and Diederik P. [76] G E Hinton and T J Sejnowski. Optimal Perceptual Inference. In Kingma. Learning Energy-Based Models by Diffusion Recovery IEEE CVPR, 1983. Likelihood. In ICLR, 2021. [77] Geoffrey E. Hinton. Training Products of Experts by Minimizing [53] Partha Ghosh, Mehdi S. M. Sajjadi, Antonio Vergari, Michael Contrastive Divergence. Neural Computation, 14(8):1771–1800, Black, and Bernhard Scholkopf. From Variational to Deterministic 2002. Autoencoders. In ICLR, 2020. [78] Geoffrey E. Hinton, Simon Osindero, and Yee-Whye Teh. A Fast [54] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, Learning Algorithm for Deep Belief Nets. Neural Computation, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua 18(7):1527–1554, 2006. Bengio. Generative Adversarial Nets. NeurIPS 27, 2014. [79] Jonathan Ho, Xi Chen, Aravind Srinivas, Yan Duan, and Pieter [55] Henry Gouk, Eibe Frank, Bernhard Pfahringer, and Michael J. Abbeel. Flow++: Improving Flow-Based Generative Models with Cree. Regularisation of neural networks by enforcing Lipschitz Variational Dequantization and Architecture Design. In ICML, continuity. Machine Learning, 110(2):393–416, 2021. 2019. [56] Will Grathwohl, Ricky T. Q. Chen, Jesse Bettencourt, Ilya [80] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising Diffusion Sutskever, and David Duvenaud. FFJORD: Free-form Contin- Probabilistic Models. arXiv:2006.11239, 2020. uous Dynamics for Scalable Reversible Generative Models. In [81] Sepp Hochreiter and Jurgen¨ Schmidhuber. Long Short-Term ICLR, 2019. Memory. Neural Computation, 9(8):1735–1780, 1997. [57] Will Grathwohl, Kuan-Chieh Wang, Joern-Henrik Jacobsen, [82] Matthew Hoffman, Pavel Sountsov, Joshua V. Dillon, Ian Lang- David Duvenaud, and Richard Zemel. Learning the Stein more, Dustin Tran, and Srinivas Vasudevan. NeuTra-lizing Bad Discrepancy for Training and Evaluating Energy-Based Models Geometry in Hamiltonian Monte Carlo Using Neural Transport. without Sampling. In ICML, 2020. arXiv:1903.03704, 2019. [58] Will Grathwohl, Kuan-Chieh Wang, Jorn-Henrik¨ Jacobsen, David [83] Matthew D Hoffman and Matthew J Johnson. ELBO surgery: yet Duvenaud, Mohammad Norouzi, and Kevin Swersky. Your another way to carve up the variational evidence lower bound. Classifier is Secretly an and You Should In NeurIPS Workshop, 2016. Treat it Like One. In ICLR, 2020. [84] Emiel Hoogeboom, Rianne van den Berg, and Max Welling. [59] Will Sussman Grathwohl, Jacob Jin Kelly, Milad Hashemi, Mo- Emerging Convolutions for Generative Normalizing Flows. In hammad Norouzi, Kevin Swersky, and David Duvenaud. No ICML, 2019. MCMC for me: Amortized sampling for fast and stable training [85] Emiel Hoogeboom, Victor Garcia Satorras, Jakub Tomczak, and of energy-based models. In ICLR, 2021. Max Welling. 
The Convolution Exponential and Generalized Sylvester Flows. NeurIPS, 33, 2020. [60] Karol Gregor, Ivo Danihelka, Alex Graves, Danilo Jimenez Rezende, and Daan Wierstra. DRAW: A Recurrent Neural Net- [86] Chin-Wei Huang, Laurent Dinh, and Aaron Courville. Aug- work For Image Generation. In ICML, 2015. mented Normalizing Flows: Bridging the Gap Between Gener- ative Flows and Latent Variable Models. arXiv:2002.07101, 2020. [61] Paulina Grnarova, Yannic Kilcher, Kfir Y Levy, Aurelien Lucchi, [87] Chin-Wei Huang, David Krueger, Alexandre Lacoste, and Aaron and Thomas Hofmann. Generative Minimization Networks: Courville. Neural Autoregressive Flows. In ICML, 2018. Training GANs Without Competition. arXiv:2103.12685, 2021. [88] Chin-Wei Huang, Ahmed Touati, Laurent Dinh, Michal Drozdzal, [62] Paulina Grnarova, Kfir Y Levy, Aurelien Lucchi, Nathanael Per- Mohammad Havaei, Laurent Charlin, and Aaron Courville. raudin, Ian Goodfellow, Thomas Hofmann, and Andreas Krause. Learnable Explicit Density for Continuous Latent Space and A Domain Agnostic Measure for Monitoring and Evaluating Variational Inference. arXiv:1710.02248, 2017. GANs. NeurIPS, 32, 2019. [89] Huaibo Huang, zhihang li, Ran He, Zhenan Sun, and Tieniu [63] Aditya Grover, Manik Dhar, and Stefano Ermon. Flow-GAN: Tan. Introvae: Introspective Variational Autoencoders for Pho- Combining Maximum Likelihood and Adversarial Learning in tographic Image Synthesis. NeurIPS, 31, 2018. Generative Models. In AAAI, 2018. [90] Xun Huang, Yixuan Li, Omid Poursaeed, John Hopcroft, and [64] Jie Gui, Zhenan Sun, Yonggang Wen, Dacheng Tao, and Jieping Serge Belongie. Stacked Generative Adversarial Networks. In Ye. A Review on Generative Adversarial Networks: Algorithms, IEEE CVPR, 2017. arXiv:2001.06937 Theory, and Applications. , 2020. [91] Peter J. Huber. Robust Estimation of a . In [65] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Du- Breakthroughs in Statistics: Methodology and Distribution, Springer moulin, and Aaron C Courville. Improved Training of Wasser- Series in Statistics, pages 492–518. New York, NY, 1992. stein GANs. NeurIPS 30, 2017. [92] Ferenc Huszar.´ How (not) to Train your Generative Model: [66] Ishaan Gulrajani, Kundan Kumar, Faruk Ahmed, Adrien Ali Scheduled Sampling, Likelihood, Adversary? arXiv:1511.05101, Taiga, Francesco Visin, David Vazquez, and Aaron Courville. 2015. PixelVAE: A Latent Variable Model for Natural Images. In ICLR, [93] Aapo Hyvarinen.¨ Estimation of Non-Normalized Statistical Mod- 2017. els by Score Matching. JMLR, 6:695–709, 2005. [67] Ishaan Gulrajani, Colin Raffel, and Luke Metz. Towards GAN [94] Sergey Ioffe and Christian Szegedy. Batch Normalization: Accel- Benchmarks Which Require Generalization. In ICLR, 2019. erating Deep Network Training by Reducing Internal Covariate [68] Michael Gutmann and Aapo Hyvarinen.¨ Noise-Contrastive Esti- Shift. arXiv:1502.03167, 2015. mation: A New Estimation Principle for Unnormalized Statistical [95] Priyank Jaini, Kira A. Selby, and Yaoliang Yu. Sum-of-Squares Models. In AISTATS, pages 297–304, 2010. Polynomial Flow. In ICML, 2019. [69] David Ha and Jurgen¨ Schmidhuber. World Models. [96] Yifan Jiang, Shiyu Chang, and Zhangyang Wang. TransGAN: arXiv:1803.10122, 2018. Two Transformers Can Make One Strong GAN. arXiv:2102.07074, [70] Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Mohammad 2021. Norouzi. Dream to Control: Learning Behaviors by Latent Imag- [97] Alexia Jolicoeur-Martineau. The relativistic discriminator: a key ination. In ICLR, 2020. 
element missing from standard GAN. ICLR, 2018. [71] Tian Han, Yang Lu, Song-Chun Zhu, and Ying Nian Wu. Alter- [98] Michael I. Jordan, Zoubin Ghahramani, Tommi S. Jaakkola, and nating Back-Propagation for Generator Network. AAAI, 31(1), Lawrence K. Saul. Introduction to variational methods for graph- 2017. ical models. Machine Learning, 37(2):183–233, 1999. [72] Leonard Hasenclever, Jakub M Tomczak, and Max Welling. [99] Heewoo Jun, Rewon Child, Mark Chen, John Schulman, Aditya Variational Inference with Orthogonal Normalizing Flows. In Ramesh, Alec Radford, and Ilya Sutskever. Distribution Aug- Workshop on Bayesian Deep Learning, NIPS, 2017. mentation for Generative Modeling. In ICML, 2020. [73] Junxian He, Daniel Spokoyny, Graham Neubig, and Taylor Berg- [100] Mahdi Karami, Dale Schuurmans, Jascha Sohl-Dickstein, Laurent Kirkpatrick. Lagging Inference Networks and Posterior Collapse Dinh, and Daniel Duckworth. Invertible Convolutional Flow. in Variational Autoencoders. In ICLR, 2019. NeurIPS, 2019. 18

[101] Animesh Karnewar and Oliver Wang. MSG-GAN: Multi-Scale [127] Xuanqing Liu and Cho-Jui Hsieh. Rob-GAN: Generator, Discrim- Gradients for Generative Adversarial Networks. In IEEE CVPR, inator, and Adversarial Attacker. In IEEE CVPR, 2019. 2020. [128] Gabriel Loaiza-Ganem and John P. Cunningham. The continuous [102] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Bernoulli: fixing a pervasive error in variational autoencoders. Progressive Growing of GANs for Improved Quality, Stability, NeurIPS, 32, 2019. and Variation. In ICLR, 2018. [129] Cheng Lu, Jianfei Chen, Chongxuan Li, Qiuhao Wang, and Jun [103] Tero Karras, Miika Aittala, Janne Hellsten, Samuli Laine, Jaakko Zhu. Implicit Normalizing Flows. In ICLR, 2021. Lehtinen, and Timo Aila. Training Generative Adversarial Net- [130] Mario Lucic, Karol Kurach, Marcin Michalski, Sylvain Gelly, and works with Limited Data. NeurIPS, 33, 2020. Olivier Bousquet. Are GANs Created Equal? A Large-Scale [104] Tero Karras, Samuli Laine, and Timo Aila. A Style-Based Gener- Study. NeurIPS 31, 2018. ator Architecture for Generative Adversarial Networks. In IEEE [131] Yucen Luo, Alex Beatson, Mohammad Norouzi, Jun Zhu, David CVPR, 2019. Duvenaud, Ryan P. Adams, and Ricky T. Q. Chen. SUMO: Unbi- [105] Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko ased Estimation of Log Marginal Probability for Latent Variable Lehtinen, and Timo Aila. Analyzing and Improving the Image Models. In ICLR, 2020. Quality of StyleGAN. In IEEE CVPR, 2020. [132] Xuezhe Ma, Xiang Kong, Shanghang Zhang, and Eduard Hovy. [106] Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and MaCow: Masked Convolutional Generative Flow. NeurIPS, 2019. Franc¸ois Fleuret. Transformers are RNNs: Fast Autoregressive [133] Xuezhe Ma, Chunting Zhou, and Eduard Hovy. MAE: Mu- Transformers with Linear Attention. In ICML, 2020. tual Posterior-Divergence Regularization for Variational AutoEn- [107] Hyunjik Kim, George Papamakarios, and Andriy Mnih. The coders. In ICLR, 2019. Lipschitz Constant of Self-Attention. arXiv:2006.04710, 2020. [134] Lars Maaløe, Marco Fraccaro, Valentin Lievin,´ and Ole Winther. [108] Taesup Kim and Yoshua Bengio. Deep Directed Generative Mod- BIVA: A Very Deep Hierarchy of Latent Variables for Generative els with Energy-Based Probability Estimation. arXiv:1606.03439, Modeling. NeurIPS, 33, 2019. 2016. [135] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dim- [109] Yoon Kim, Sam Wiseman, Andrew Miller, David Sontag, and itris Tsipras, and Adrian Vladu. Towards Deep Learning Models Alexander Rush. Semi-Amortized Variational Autoencoders. In Resistant to Adversarial Attacks. arXiv:1706.06083, 2019. ICML, 2018. [136] Alireza Makhzani, Jonathon Shlens, Navdeep Jaitly, Ian [110] Diederik P. Kingma and Max Welling. Auto-Encoding Variational Goodfellow, and Brendan Frey. Adversarial Autoencoders. Bayes. ICLR, 2014. arXiv:1511.05644, 2016. [111] Durk P Kingma and Prafulla Dhariwal. Glow: Generative Flow [137] Xudong Mao, Qing Li, Haoran Xie, Raymond Y. K. Lau, Zhen with Invertible 1x1 Convolutions. NeurIPS 31, 2018. Wang, and Stephen Paul Smolley. Least Squares Generative Adversarial Networks. In IEEE CVPR, 2017. [112] Durk P Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya [138] Soroush Mehri, Kundan Kumar, Ishaan Gulrajani, Rithesh Ku- Sutskever, and Max Welling. Improved Variational Inference with mar, Shubham Jain, Jose Sotelo, Aaron Courville, and Yoshua Inverse Autoregressive Flow. NeurIPS 29, 2016. Bengio. 
SampleRNN: An Unconditional End-to-End Neural [113] I. Kobyzev, S. Prince, and M. Brubaker. Normalizing Flows: Audio Generation Model. In ICLR, 2017. An Introduction and Review of Current Methods. IEEE TPAMI, [139] Chenlin Meng, Jiaming Song, Yang Song, Shengjia Zhao, and pages 2008–2026, 2020. Stefano Ermon. Improved Autoregressive Modeling with Dis- [114] Alex Krizhevsky. Learning Multiple Layers of Features from Tiny tribution Smoothing. In ICLR, 2021. Images. 2009. [140] Chenlin Meng, Lantao Yu, Yang Song, Jiaming Song, and Stefano [115] Kundan Kumar, Rithesh Kumar, Thibault de Boissiere, Lucas Ermon. Autoregressive Score Matching. NeurIPS, 34, 2020. Gestin, Wei Zhen Teoh, Jose Sotelo, Alexandre de Brebisson,´ [141] Jacob Menick and Nal Kalchbrenner. Generating High Fidelity Yoshua Bengio, and Aaron C. Courville. MelGAN: Genera- Images with Subscale Pixel Networks and Multidimensional tive Adversarial Networks for Conditional Waveform Synthesis. Upscaling. In ICLR, 2019. NeurIPS, 32, 2019. [142] Mehdi Mirza and Simon Osindero. Conditional Generative [116] Manoj Kumar, Mohammad Babaeizadeh, Dumitru Erhan, Adversarial Nets. arXiv:1411.1784, 2014. Chelsea Finn, Sergey Levine, Laurent Dinh, and Durk Kingma. [143] Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi VideoFlow: A Flow-Based Generative Model for Video. In ICML Yoshida. Spectral Normalization for Generative Adversarial Workshop, 2019. Networks. ICLR, 2018. [117] Rithesh Kumar, Sherjil Ozair, Anirudh Goyal, Aaron Courville, [144] Thomas Muller,¨ Brian Mcwilliams, Fabrice Rousselle, Markus and Yoshua Bengio. Maximum Entropy Generators for Energy- Gross, and Jan Novak.´ Neural Importance Sampling. ACM TOG, Based Models. arXiv:1901.08508, 2019. 38(5):145:1–145:19, 2019. [118] Anders Boesen Lindbo Larsen, Søren Kaae Sønderby, Hugo [145] Charlie Nash and Conor Durkan. Autoregressive Energy Ma- Larochelle, and Ole Winther. Autoencoding beyond pixels using chines. In ICML, 2019. a learned similarity metric. In ICML, 2016. [146] Kirill Neklyudov, Evgenii Egorov, and Dmitry P. Vetrov. The [119] Yann LeCun, Sumit Chopra, Raia Hadsell, Marc’Aurelio Ranzato, Implicit Metropolis-Hastings Algorithm. NeurIPS, 32, 2019. and Fu Jie Huang. A Tutorial on Energy-Based Learning. A [147] M. Ngxande, J. Tapamo, and M. Burke. DepthwiseGANs: Fast Tutorial on Energy-Based Learning, 1(0), 2006. Training Generative Adversarial Networks for Realistic Image [120] Alex X. Lee, Richard Zhang, Frederik Ebert, Pieter Abbeel, Synthesis. In SAUPEC/RobMech/PRASA, pages 111–116, 2019. Chelsea Finn, and Sergey Levine. Stochastic Adversarial Video [148] Alex Nichol and Prafulla Dhariwal. Improved Denoising Diffu- Prediction. arXiv:1804.01523, 2018. sion Probabilistic Models. arXiv:2102.09672, 2021. [121] Zengyi Li, Yubei Chen, and Friedrich T. Sommer. Learning [149] Weili Nie, Nina Narodytska, and Ankit Patel. RelGAN: Relational Energy-Based Models in High-Dimensional Spaces with Multi- Generative Adversarial Networks for Text Generation. In ICLR, scale Denoising Score Matching. arXiv:1910.07762, 2019. 2019. [122] Zengyi Li, Yubei Chen, and Friedrich T. Sommer. A Neural [150] Didrik Nielsen, Priyank Jaini, Emiel Hoogeboom, Ole Winther, Network MCMC sampler that maximizes Proposal Entropy. and Max Welling. SurVAE Flows: Surjections to Bridge the Gap arXiv:2010.03587, 2020. between VAEs and Flows. NeurIPS, 33, 2020. [123] Jae Hyun Lim and Jong Chul Ye. Geometric GAN. [151] Erik Nijkamp, Ruiqi Gao, Pavel Sountsov, Srinivas Vasudevan, arXiv:1705.02894, 2017. 
Bo Pang, Song-Chun Zhu, and Ying Nian Wu. Learning Energy- [124] Ji Lin, Richard Zhang, Frieder Ganz, Song Han, and Jun-Yan based Model with Flow-based Backbone by Neural Transport Zhu. Anycost GANs for Interactive Image Synthesis and Editing. MCMC. arXiv:2006.06897, 2020. arXiv:2103.03243, 2021. [152] Erik Nijkamp, Mitch Hill, Tian Han, Song-Chun Zhu, and [125] Bingchen Liu, Yizhe Zhu, Kunpeng Song, and Ahmed Elgammal. Ying Nian Wu. On the Anatomy of MCMC-Based Max- Towards Faster and Stabilized GAN Training for High-fidelity imum Likelihood Learning of Energy-Based Models. In Few-shot Image Synthesis. In ICLR, 2021. arXiv:1903.12370, 2020. [126] Kanglin Liu, Guoping Qiu, Wenming Tang, and Fei Zhou. Spec- [153] Erik Nijkamp, Mitch Hill, Song-Chun Zhu, and Ying Nian Wu. tral Regularization for Combating Mode Collapse in GANs. In Learning Non-Convergent Non-Persistent Short-Run MCMC To- IEEE ICCV, 2019. ward Energy-Based Model. NeurIPS 32, 2019. BOND-TAYLOR ET AL.: DEEP GENERATIVE MODELLING: A COMPARATIVE REVIEW 19

[154] Erik Nijkamp, Bo Pang, Tian Han, Linqi Zhou, Song-Chun Zhu, [178] Saeed Saremi, Arash Mehrjou, Bernhard Scholkopf,¨ and Aapo and Ying Nian Wu. Learning Multi-layer Latent Variable Model Hyvarinen.¨ Deep Energy Estimator Networks. arXiv:1805.08306, via Variational Optimization of Short Run MCMC for Approx- 2018. imate Inference. In ECCV, Lecture Notes in Computer Science, [179]J urgen¨ Schmidhuber. Making the World Differentiable: On Using Cham, 2020. Self-Supervised Fully Recurrent Neural Networks for Dynamic [155] Sajad Norouzi, David J. Fleet, and Mohammad Norouzi. Ex- Reinforcement Learning and Planning in Non-Stationary Envi- emplar VAE: Linking Generative Models, Nearest Neighbor Re- ronments. 1990. trieval, and Data Augmentation. NeurIPS, 33, 2020. [180]J urgen¨ Schmidhuber. Generative Adversarial Networks are spe- [156] Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. f-GAN: cial cases of Artificial Curiosity (1990) and also closely related to Training Generative Neural Samplers using Variational Diver- Predictability Minimization (1991). Neural Networks, 127:58–66, gence Minimization. NeurIPS 29, 2016. 2020. [157] Augustus Odena, Christopher Olah, and Jonathon Shlens. Condi- [181] Jiaxin Shi, Shengyang Sun, and Jun Zhu. A Spectral Approach to tional Image Synthesis with Auxiliary Classifier GANs. In ICML, Gradient Estimation for Implicit Distributions. In ICML, 2018. 2017. [182] Samarth Sinha, Han Zhang, Anirudh Goyal, Yoshua Bengio, [158] Derek Onken, Samy Wu Fung, Xingjian Li, and Lars Ruthotto. Hugo Larochelle, and Augustus Odena. Small-GAN: Speeding OT-Flow: Fast and Accurate Continuous Normalizing Flows via up GAN Training using Core-Sets. In ICML, 2020. Optimal Transport. In AAAI, 2021. [183] Vincent Sitzmann, Julien N. P. Martel, Alexander W. Bergman, [159] Aaron Oord, Yazhe Li, Igor Babuschkin, Karen Simonyan, Oriol David B. Lindell, and Gordon Wetzstein. Implicit Neural Repre- Vinyals, Koray Kavukcuoglu, George Driessche, Edward Lock- sentations with Periodic Activation Functions. arXiv:2006.09661, hart, Luis Cobo, Florian Stimberg, Norman Casagrande, Dominik 2020. Grewe, Seb Noury, Sander Dieleman, Erich Elsen, Nal Kalch- [184] Jascha Sohl-Dickstein, Eric A. Weiss, Niru Maheswaranathan, brenner, Heiga Zen, Alex Graves, Helen King, Tom Walters, Dan and Surya Ganguli. Deep Unsupervised Learning using Belov, and Demis Hassabis. Parallel WaveNet: Fast High-Fidelity Nonequilibrium Thermodynamics. In ICML, 2015. Speech Synthesis. In ICML, 2018. [185] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising [160] Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Si- Diffusion Implicit Models. In ICLR, 2021. monyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew [186] Jiaming Song, Shengjia Zhao, and Stefano Ermon. A-NICE-MC: Senior, and Koray Kavukcuoglu. WaveNet: A Generative Model Adversarial Training for MCMC. NeurIPS, 30, 2017. for Raw Audio. arXiv:1609.03499, 2016. [187] Yang Song and Stefano Ermon. Generative Modeling by Estimat- [161] Georg Ostrovski, Will Dabney, and Remi´ Munos. Autoregressive ing Gradients of the Data Distribution. NeurIPS 32, 2019. Quantile Networks for Generative Modeling. ICML, 9, 2018. [188] Yang Song, Sahaj Garg, Jiaxin Shi, and Stefano Ermon. Sliced [162] A. Oussidi and A. Elhassouny. Deep Generative Models: Survey. Score Matching: A Scalable Approach to Density and Score In IEEE ISCV, pages 1–8, 2018. Estimation. In UAI, pages 574–584, 2020. 
[163] Bo Pang, Tian Han, Erik Nijkamp, Song-Chun Zhu, and [189] Yang Song and Diederik P. Kingma. How to Train Your Energy- Ying Nian Wu. Learning Latent Space Energy-Based Prior Model. Based Models. arXiv:2101.03288, 2021. NeurIPS, 34, 2020. [190] Yang Song, Chenlin Meng, and Stefano Ermon. MintNet: Build- [164] George Papamakarios, Eric Nalisnick, Danilo Jimenez Rezende, ing Invertible Neural Networks with Masked Convolutions. Shakir Mohamed, and Balaji Lakshminarayanan. Normalizing NeurIPS, 32, 2019. Flows for Probabilistic Modeling and Inference. JMLR, 22(57):1– [191] Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek 64, 2021. Kumar, Stefano Ermon, and Ben Poole. Score-Based Generative Modeling through Stochastic Differential Equations. In ICLR, [165] George Papamakarios, Theo Pavlakou, and Iain Murray. Masked 2021. Autoregressive Flow for Density Estimation. NeurIPS 30, 2017. [192] Yuxuan Song, Qiwei Ye, Minkai Xu, and Tie-Yan Liu. Discrimina- [166] Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Lukasz Kaiser, tor Contrastive Divergence: Semi-Amortized Generative Model- Noam Shazeer, Alexander Ku, and Dustin Tran. Image Trans- ing by Exploring Energy of the Discriminator. arXiv:2004.01704, former. In ICML, 2018. 2020. [167] L. S. Pontryagin. Mathematical Theory of Optimal Processes. Rout- [193] Ilya Sutskever and Tijmen Tieleman. On the Convergence Prop- ledge, 2018. erties of Contrastive Divergence. In AISTATS, pages 789–795, [168] R. Prenger, R. Valle, and B. Catanzaro. WaveGlow: A Flow-based 2010. Generative Network for Speech Synthesis. In IEEE ICASSP, pages [194] Kevin Swersky, Marc’Aurelio Ranzato, David Buchman, Ben- 3617–3621, 2019. jamin M Marlin, and Nando de Freitas. On Autoencoders and [169] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised Score Matching for Energy Based Models. In ICML, 2011. Representation Learning with Deep Convolutional Generative [195] Casper Kaae Sønderby, Jose Caballero, Lucas Theis, Wenzhe Shi, Adversarial Networks. In ICLR, 2016. and Ferenc Huszar.´ Amortised MAP Inference for Image Super- [170] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea resolution. In ICLR, 2017. Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-Shot [196] Casper Kaae Sønderby, Tapani Raiko, Lars Maaløe, Søren Kaae Text-to-Image Generation. arXiv:2102.12092, 2021. Sønderby, and Ole Winther. Ladder Variational Autoencoders. [171] Ali Razavi, Aaron van den Oord, and Oriol Vinyals. Generating NeurIPS, 29, 2016. Diverse High-Fidelity Images with VQ-VAE-2. NeurIPS 32, 2019. [197] Matthew Tancik, Pratul Srinivasan, Ben Mildenhall, Sara [172] Scott Reed, Aaron¨ Oord, Nal Kalchbrenner, Sergio Gomez´ Col- Fridovich-Keil, Nithin Raghavan, Utkarsh Singhal, Ravi Ra- menarejo, Ziyu Wang, Yutian Chen, Dan Belov, and Nando mamoorthi, Jonathan Barron, and Ren Ng. Fourier Features Let Freitas. Parallel Multiscale Autoregressive Density Estimation. Networks Learn High Frequency Functions in Low Dimensional In ICML, 2017. Domains. NeurIPS, 33, 2020. [173] Danilo Jimenez Rezende and Shakir Mohamed. Variational [198] H. Thanh-Tung and T. Tran. Catastrophic forgetting and mode Inference with Normalizing Flows. ICML, 2, 2015. collapse in GANs. In IJCNN, pages 1–10, 2020. [174] Gareth O. Roberts and Richard L. Tweedie. Exponential conver- [199] Lucas Theis, Aaron¨ van den Oord, and Matthias Bethge. A note gence of Langevin distributions and their discrete approxima- on the evaluation of generative models. arXiv:1511.01844, 2016. tions. 
Bernoulli, 2(4):341–363, 1996. [200] Michalis Titsias and Petros Dellaportas. Gradient-based Adaptive [175] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Markov Chain Monte Carlo. NeurIPS, 32, 2019. Alec Radford, and Xi Chen. Improved Techniques for Training [201] Ilya Tolstikhin, Olivier Bousquet, Sylvain Gelly, and Bernhard GANs. NeurIPS, 2016. Schoelkopf. Wasserstein Auto-Encoders. arXiv:1711.01558, 2019. [176] Tim Salimans, Andrej Karpathy, Xi Chen, and Diederik P. [202] Jakub Tomczak and Max Welling. VAE with a VampPrior. In Kingma. PixelCNN++: Improving the PixelCNN with Dis- AISTATS, pages 1214–1223, 2018. cretized Logistic Mixture Likelihood and Other Modifications. [203] Dustin Tran, Rajesh Ranganath, and David M. Blei. Variational ICLR, 2017. Gaussian Process. In ICLR, 2016. [177] Tim Salimans, Diederik Kingma, and Max Welling. Markov [204] Dustin Tran, Keyon Vafa, Kumar Agrawal, Laurent Dinh, and Chain Monte Carlo and Variational Inference: Bridging the Gap. Ben Poole. Discrete Flows: Invertible Generative Models of In ICML, 2015. Discrete Data. NeurIPS 32, 2019. 20

[205] N.-T. Tran, V.-H. Tran, N.-B. Nguyen, T.-K. Nguyen, and N.-M. [234] Han Zhang, Ian Goodfellow, Dimitris Metaxas, and Augus- Cheung. On Data Augmentation for GAN Training. IEEE TIP, tus Odena. Self-Attention Generative Adversarial Networks. 30:1882–1897, 2021. arXiv:1805.08318, 2019. [206] Ngoc-Trung Tran, Viet-Hung Tran, Bao-Ngoc Nguyen, Linxiao [235] Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Yang, and Ngai-Man (Man) Cheung. Self-supervised GAN: Anal- Wang, Xiaolei Huang, and Dimitris N. Metaxas. StackGAN: ysis and Improvement with Multi-class Minimax Game. NeurIPS, Text to Photo-Realistic Image Synthesis With Stacked Generative 32, 2019. Adversarial Networks. In IEEE CVPR, 2017. [207] Richard Eric Turner and Maneesh Sahani. Two problems with [236] Linfeng Zhang, Weinan E, and Lei Wang. Monge-Ampere` Flow variational expectation maximisation for models. In for Generative Modeling. arXiv:1809.10188, 2018. Bayesian Time Series Models, pages 104–124. Cambridge, 2011. [237] Junbo Zhao, Michael Mathieu, and Yann LeCun. Energy-based [208] Ryan Turner, Jane Hung, Eric Frank, Yunus Saatchi, and Jason Generative Adversarial Network. In ICLR, 2017. Yosinski. Metropolis-Hastings Generative Adversarial Networks. [238] Shengjia Zhao, Jiaming Song, and Stefano Ermon. Learning In ICML, 2019. Hierarchical Features from Deep Generative Models. In ICML, [209] Belinda Tzen and Maxim Raginsky. Neural Stochastic Differential 2017. Equations: Deep Latent Gaussian Models in the Diffusion Limit. [239] Shengjia Zhao, Jiaming Song, and Stefano Ermon. Towards arXiv:1905.09883, 2019. Deeper Understanding of Variational Autoencoding Models. [210] Belinda Tzen and Maxim Raginsky. Theoretical guarantees for arXiv:1702.08658, 2017. sampling and inference in generative models with latent diffu- [240] Shengjia Zhao, Jiaming Song, and Stefano Ermon. InfoVAE: sions. In COLT, pages 3084–3114, 2019. Balancing Learning and Inference in Variational Autoencoders. [211] Arash Vahdat and Jan Kautz. NVAE: A Deep Hierarchical AAAI, 33(01):5885–5892, 2019. Variational Autoencoder. NeurIPS, 33, 2020. [241] Shengyu Zhao, Zhijian Liu, Ji Lin, Jun-Yan Zhu, and Song Han. [212] Gabriele Valvano, Andrea Leo, and Sotirios A Tsaftaris. Learning Differentiable Augmentation for Data-Efficient GAN Training. to Segment from Scribbles using Multi-scale Adversarial Atten- NeurIPS, 33, 2020. tion Gates. IEEE T-MI, 2021. [242] Zhengli Zhao, Zizhao Zhang, Ting Chen, Sameer Singh, [213] Aaron van den Oord, Nal Kalchbrenner, Lasse Espeholt, koray and Han Zhang. Image Augmentations for GAN Training. kavukcuoglu, Oriol Vinyals, and Alex Graves. Conditional Image arXiv:2006.02595, 2020. Generation with PixelCNN Decoders. NeurIPS 29, 2016. [243] Jiachen Zhong, Xuanqing Liu, and Cho-Jui Hsieh. Improv- [214]A aron¨ Van Den Oord, Nal Kalchbrenner, and Koray ing the Speed and Quality of GAN by Adversarial Training. Kavukcuoglu. Pixel Recurrent Neural Networks. In ICML, 2016. arXiv:2008.03364, 2020. [215] Aaron van den Oord, Oriol Vinyals, and koray kavukcuoglu. [244] Zachary M. Ziegler and Alexander M. Rush. Latent Normalizing Neural Discrete Representation Learning. NeurIPS 30, 2017. Flows for Discrete Sequences. In ICML, 2019. [216] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is All you Need. NeurIPS 30, 2017. [217]C edric´ Villani. Topics in Optimal Transportation. Number 58. American Mathematical Soc., 2003. [218]C edric´ Villani. 
Optimal Transport: Old and New, volume 338. Springer Science & Business Media, 2008. [219] Pascal Vincent. A Connection Between Score Matching and Denoising Autoencoders. Neural Computation, 23(7):1661–1674, 2011. [220] Sinong Wang, Belinda Z. Li, Madian Khabsa, Han Fang, and Hao Ma. Linformer: Self-Attention with Linear Complexity. arXiv:2006.04768, 2020. [221] Yuanhao Wang, Guodong Zhang, and Jimmy Ba. On Solving Minimax Optimization Locally: A Follow-the-Ridge Approach. In ICLR, 2020. [222] Antoine Wehenkel and Gilles Louppe. Unconstrained Monotonic Neural Networks. NeurIPS, 32, 2019. [223] Max Welling and Geoffrey E. Hinton. A New Learning Algorithm for Mean Field Boltzmann Machines. In ICANN, volume 2415, pages 351–357. Berlin, Heidelberg, 2002. [224] Max Welling and Yee Whye Teh. Bayesian Learning via Stochastic Gradient Langevin Dynamics. In ICML, 2011. [225] Hao Wu, Jonas Kohler,¨ and Frank Noe.´ Stochastic Normalizing Flows. NeurIPS, 2020. [226] Zhisheng Xiao, Karsten Kreis, Jan Kautz, and Arash Vahdat. VAEBM: A Symbiosis between Variational Autoencoders and Energy-based Models. In ICLR, 2021. [227] J. Xie, Y. Lu, R. Gao, S. Zhu, and Y. N. Wu. Cooperative Training of Descriptor and Generator Networks. IEEE TPAMI, 42(1):27–45, 2020. [228] Jianwen Xie, Yang Lu, Song-Chun Zhu, and Ying Nian Wu. A Theory of Generative ConvNet. In ICML, 2016. [229] L. Yang and G. E. Karniadakis. Potential Flow Generator With L2 Optimal Transport Regularity for Generative Models. IEEE TNNLS, pages 1–11, 2020. [230] Xin Yi, Ekta Walia, and Paul Babyn. Generative Adversarial Network in Medical Imaging: A Review. MedIA, 58, 2019. [231] Lantao Yu, Yang Song, Jiaming Song, and Stefano Ermon. Train- ing Deep Energy-Based Models with f-Divergence Minimization. In ICML, 2020. [232] Sergey Zagoruyko and Nikos Komodakis. Wide Residual Net- works. arXiv:1605.07146, 2017. [233] Han Zhang, Xi Gao, Jacob Unterman, and Tom Arodz. Approx- imation Capabilities of Neural Ordinary Differential Equations. arXiv:1907.12998, 2019.