Online Optimization with Energy Based Models

by Yilun Du

B.S., Massachusetts Institute of Technology (2019)

Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the degree of Master of Science in Computer Science and Engineering at the Massachusetts Institute of Technology, May 2020.

© Massachusetts Institute of Technology 2020. All rights reserved.

Author: Department of Electrical Engineering and Computer Science, May 15, 2020

Certified by: Leslie P. Kaelbling, Professor, Department of Electrical Engineering and Computer Science, Thesis Supervisor

Certified by: Tomas Lozano-Perez, Professor, Department of Electrical Engineering and Computer Science, Thesis Supervisor

Certified by: Joshua B. Tenenbaum, Professor, Department of Brain and Cognitive Science, Thesis Supervisor

Accepted by: Leslie A. Kolodziejski, Professor of Electrical Engineering and Computer Science, Chair, Department Committee on Graduate Students

Online Optimization with Energy Based Models

by Yilun Du

Submitted to the Department of Electrical Engineering and Computer Science on May 15, 2020, in partial fulfillment of the requirements for the degree of Master of Science in Computer Science and Engineering

Abstract

This thesis examines the power of applying optimization on learned neural networks, referred to as Energy Based Models (EBMs). We first present methods that enable scalable training of EBMs, allowing an optimization procedure to generate high resolution images. We simultaneously show that the resultant models are robust, compositional, and easy to learn online. Next we showcase how this optimization procedure can also be used to formulate plans in interactive environments. We further showcase how a similar procedure can be used to learn neural energy functions for proteins, enabling structural recovery through optimization. Finally, we show that by defining generation as an optimization procedure, we can combine generative models from different domains together and apply optimization on the joint model. We show that this allows us to apply various logical operations on image generation, as well as learn to generate new concepts in a continual manner.

Thesis Supervisor: Leslie P. Kaelbling
Title: Professor, Department of Electrical Engineering and Computer Science

Thesis Supervisor: Tomas Lozano-Perez
Title: Professor, Department of Electrical Engineering and Computer Science

Thesis Supervisor: Joshua B. Tenenbaum
Title: Professor, Department of Brain and Cognitive Science

Acknowledgments

I would like to thank Leslie, Tomas, and Josh for guiding me through the master's degree process, giving insightful advice and feedback on my ideas, and helping me mature as a researcher. Much of the work in this thesis was done with Igor, whom I am also thankful to, both for invaluable advice and for guiding my development as a researcher. I am also greatly thankful for all the collaborators who helped me through this work. Finally, I would like to thank my lab-mates for providing great conversations and insights on each of the research topics I have worked on.

Contents

1 Introduction

2 Related Work

3 Learning Energy Models
  3.1 Energy Based Models
    3.1.1 Sample Replay Buffer
    3.1.2 Regularization
  3.2 Image Generation
    3.2.1 Mode Evaluation
  3.3 Generalization
    3.3.1 Adversarial Robustness
    3.3.2 Out-of-Distribution Generalization
  3.4 Online Learning

4 Energy Based Models for Planning
  4.1 Planning through Online Optimization
    4.1.1 Energy-Based Models and Terminology
    4.1.2 Planning with Energy-Based Models
    4.1.3 Online Learning with Energy-Based Models
    4.1.4 Related Work
  4.2 Experiments
    4.2.1 Setup
    4.2.2 Online Model Learning
    4.2.3 Maximum Entropy Inference
    4.2.4 Exploration

5 Energy Models for Protein Structure
  5.1 Background
  5.2 Method
    5.2.1 Parameterization of protein conformations
    5.2.2 Usage as an energy function
    5.2.3 Training and loss functions
    5.2.4 Recovery of Rotamers
  5.3 Evaluation
    5.3.1 Datasets
    5.3.2 Baselines
    5.3.3 Evaluation
    5.3.4 Rotamer recovery results
    5.3.5 Visualizing Energies

6 Compositionality with Energy Based Models
  6.1 Method
    6.1.1 Energy Based Models
    6.1.2 Logical Operators through Online Optimization
  6.2 Experiments
    6.2.1 Setup
    6.2.2 Compositional Generation
    6.2.3 Continual Learning

7 Conclusion
  7.1 Future Directions

List of Figures

1-1 A 2D example of combining EBMs through summation and the resulting sampling trajectories.

3-1 Table of Inception and FID scores for ImageNet 32x32 and CIFAR-10. Quantitative numbers for ImageNet 32x32 from [Ostrovski et al., 2018]. (*) We use Inception Score (from the original OpenAI repo) to compare with legacy models, but strongly encourage future work to compare solely with FID score, since Langevin dynamics converges to minima that artificially inflate Inception Score. (**) Conditional EBM models for 128x128 are smaller than those in SNGAN.

3-2 EBM image restoration on images in the test set via MCMC (online optimization). The right column shows failures (approx. 10% of objects change with ground-truth initialization and 30% of objects change under salt-and-pepper corruption or inpainting; the bottom two rows show worst-case changes).

3-3 Comparison of image generation techniques on the unconditional CIFAR-10 dataset.

3-4 Illustration of cross-class implicit sampling on a conditional EBM. The EBM is conditioned on a particular class but is initialized with an image from a separate class.

3-5 Illustration of image completions on the conditional ImageNet model. Our models exhibit diversity in inpainting.

3-6 $\epsilon$ plots under $L_\infty$ and $L_2$ attacks of conditional EBMs as compared to PGD-trained models in [Madry et al., 2017] and a baseline Wide ResNet18.

3-7 Histogram of relative likelihoods for various datasets for Glow, PixelCNN++ and EBM models.

4-1 Overview of the online training procedure with an EBM, where grey areas represent inadmissible regions. Plans from the current observation to the goal state are inferred from the EBM (left). A particular plan is chosen and executed until a planned state deviates significantly from an actual state (middle). The EBM is then trained on all real transitions $\tau_{\text{real}}$ and all planned transitions before the deviation (in green) $\tau_\theta$, while transitions afterwards (red) are ignored. A new plan is then generated from the new location of the agent (right).

4-2 Illustrations of the 4 evaluated environments. Both the particle and maze environments have 2 degrees of freedom for x, y movement. The Reacher environment has 2 degrees of freedom corresponding to torques on two motors. The Sawyer Arm environment has 7 degrees of freedom corresponding to torques.

4-3 Navigation path with a central obstacle the model was not trained with.

4-4 Performance on Particle, Maze and Reacher environments where dynamics models are either pretrained on random transitions or learned via online interaction with the environment. Action FF: Action Feed-Forward Network.

4-5 Qualitative image showing an EBM successfully navigating the finger end effector to the goal position.

4-6 Effects of varying the number of planning steps to reach a goal state. As the number of planning steps increases, there is a larger envelope of explored states.

4-7 Illustrations of two different planned trajectories from start to goal in the Reacher environment.

4-8 Illustration of energy values of states (computed by taking the energy of a transition centered at the location) and corresponding visitation maps. While an EBM learns a probabilistic model of transitions in areas already explored, energies of unexplored regions fluctuate throughout training, leading to a natural exploration incentive. Early on in training (left), the EBM puts low energy on the upper corner, incentivizing agent exploration towards the top corner. Later on in training (right), the EBM puts low energy on the lower right corner, incentivizing agent exploration towards the bottom corner.

4-9 Illustration of exploration in a maze under random actions (left) as opposed to following an EBM (middle). Areas in blue in the maze environment (right) are admissible, while areas in white are not.

4-10 Comparison of 3D spatial occupancy of the finger end-effector of the Sawyer Arm, using random exploration versus using an EBM without a goal, on a log scale across 4 different seeds. An EBM allows more directed exploration and explores more states. For the random policy to reach maximum occupancy, more than 200,000 transitions are required.

5-1 Overview of the model. The model takes as input a set of atoms, $A$, consisting of the rotamer to be predicted (shown in green) and surrounding atoms (shown in dark grey). The Cartesian coordinates and attributes of each atom are embedded. The set of embeddings is processed by Transformer blocks, and the final hidden representations are pooled over the atoms to produce a vector. The vector is passed through a two-layer MLP to output a scalar energy value, $f_\theta(A)$.

5-2 The energy function models distinct behavior between core and surface residues. Core residues are more sensitive to perturbations away from the native state in the $\chi_1$ torsion angle. On average, residues closer to the core have a steeper energy well.

5-3 There is a relation between residue size and the depth of the energy well, with larger amino acids (e.g. Trp, Phe, Thr, Lys) having steeper wells.

5-4 Note the periodicity for the amino acids Tyr, Asp, and Phe with terminal symmetry about $\chi_2$.

5-5 Left: 3-dimensional representation of CcmG reducing oxidoreductase [PDB ID 1KNG; Edeling et al., 2002], a protein from the test set. Atoms are colored dark blue (buried), orange (exposed), or neither (not colored). Right: t-SNE [Maaten and Hinton, 2008] projection of the EBM hidden representation when focused on the alpha carbon atom of each residue. In the embedding space, buried and surface residues are distinguished.

5-6 The model's saliency map applied to the test protein, serine proteinase inhibitor (PDB ID: 2CI2; McPhalen and James [1987]). The 64-atom context is centered on the carbonyl oxygen of residue 39 (isoleucine). Atoms in the context are labeled red with color saturation proportional to gradient magnitude (interaction strength). Hydrogen bonds with the carbonyl oxygen are shown by dotted lines.

6-1 Illustration of logical composition operators over energy functions $E_1$ and $E_2$ (drawn as level sets).

6-2 Combinations of different attributes on CelebA via concept conjunction. Each row adds an additional energy function. Images in the first row are conditioned only on young, while images in the last row are conditioned on young, female, smiling, and wavy hair.

6-3 Combinations of different attributes on MuJoCo via concept conjunction. Each row adds an additional energy function. Images in the first row are conditioned only on shape, while images in the last row are conditioned on shape, position, size, and color. The left part is the generation of a sphere shape and the right is a cylinder.

6-4 Examples of recursive compositions of disjunction, conjunction, and negation on the CelebA dataset.

6-5 Multi-object compositionality with EBMs. An EBM is trained to generate a green cube of specified size and shape in a scene alongside other objects. At test time, we sample from the conjunction of two EBMs conditioned on different positions and sizes (cube 1 and cube 2), generating cubes at both locations. Two cubes are merged into one if they are too close (last column).

6-6 Continual learning of concepts. A position EBM is first trained on one shape (cube) of one color (purple) at different positions. A shape EBM is then trained on different shapes of one fixed color (purple). Finally, a color EBM is trained on shapes of many colors. EBMs can continually learn to generate many shapes (cube, sphere) with different colors at different positions.

List of Tables

3.1 AUROC scores of out-of-distribution classification on different datasets. Only our model achieves better-than-chance classification.

3.2 Comparison of EBMs with various other continual learning benchmarks. Values averaged across 10 seeds, reported as mean (standard deviation).

4.1 Comparison of performance on the Sawyer Arm environment between Action FF and EBM. In the pretraining setting, we compare models trained using random transitions, directed transitions from an EBM, and directed transitions with correlated data. In the online setting, models are trained on 50,000 simulations of the environment. We find that EBMs perform well online.

5.1 Rotamer recovery of energy functions under the discrete rotamer sampling method detailed in Section 5.3.3. Parentheses denote values reported by Leaver-Fay et al. [2013].

5.2 Rotamer recovery of energy functions under continuous optimization schemes. Rosetta continuous optimization is performed with the rtmin protocol. Parentheses denote values reported by Leaver-Fay et al. [2013].

5.3 Comparison of rotamer recovery rates by amino acid between Rosetta and the ensembled energy-based model under discrete rotamer sampling. The model appears to perform well on the polar amino acids glutamine, serine, asparagine, and threonine, while Rosetta performs better on the larger amino acids phenylalanine, tyrosine, and tryptophan, and the common amino acid leucine. The numbers reported for Rosetta are from Leaver-Fay et al. [2013].

6.1 Quantitative evaluation of conjunction, disjunction and negation generations on the MuJoCo Scenes dataset using an EBM. Each individual attribute (Color or Position) generation is an individual EBM. (Acc: accuracy)

6.2 Quantitative evaluation of continual learning. A position EBM is first trained on "purple" "cubes" at different positions. A shape EBM is then trained on different "purple" shapes. Finally, a color EBM is trained on shapes of many colors; earlier EBMs are fixed and combined with new EBMs. We compare with a GAN model [Radford et al., 2015] which is also trained on the same position, shape and color dataset. EBMs are better at continually learning new concepts and remembering old concepts. (Acc: accuracy)

Chapter 1

Introduction

Deep learning (DL) has achieved astonishing levels of success in recent years. In applied domains such as computer vision and natural language processing, DL approaches have revolutionized tasks such as object detection [Ren et al., 2015], segmentation [He et al., 2017], and translation [Sutskever et al., 2014]. In interactive settings, DL approaches have had success in board games [Silver et al., 2016], video games [openai, 2018] and robotic control [OpenAI, 2018].

Many DL approaches are often limited by a dearth of prior information, requiring a disproportionate amount of computational power to train. Efforts to learn such generic prior information are difficult, as DL models are difficult to compose with each other. In contrast, humans have rich compositional ability, and are able to combine concepts such as red and smiling to generate red smiling faces, or to compose skills such as fetch, search, and cut into executing a task such as making a cup of tea, allowing people to learn new tasks in a rapid manner.

Figure 1-1: A 2D example of combining EBMs through summation (energy A, energy B, energy A + B) and the resulting sampling trajectories.

Furthermore, many DL approaches are typically defined through fully connected or convolutional layers, which rely on a fixed feed-forward computation that is applied

to inputs. However, humans are able to adapt their computation to the scenario at hand. When given a complicated math expression, humans are able to recursively think and apply a set of fixed operations to determine the right answer. When given a puzzle of high complexity, humans are able to think over and ponder a solution.

Here, we instead examine learning a deep network that outputs an energy for a given input, representing a residual error of the input. Execution in such a model is then defined in terms of iteratively finding an input that minimizes the corresponding residual error, which we refer to as online optimization. Defining models in such a way gives a limited sense of compositionality: for example, we can generate inputs satisfying multiple distinct models by minimizing the joint sum of the energies of several different models (see Figure 1-1). We further show how such manipulations can be utilized to realize conjunction, disjunction, and negation of separate models.

The optimization procedure itself allows the model to adapt its computation to the task at hand. Harder problems can be solved by running optimization on the energy of the model for extended periods of time. Similarly, higher quality solutions may be obtained.

In this work, we present how online optimization on energy functions can scale to challenging, diverse, high resolution image datasets, and show unique properties such models have towards both robustness and online learning. We showcase how online optimization can also be used in the robotics domain to solve more complex and novel tasks. Further, we show that online optimization can be used to recover protein structural configurations, outperforming classical protein energy functions, and we correspondingly show unique structural information learned by the energy functions. Finally, we showcase that optimization can be utilized to compose separately trained models through operations such as conjunction, disjunction and negation, and highlight the unique properties this enables.

Chapter 2

Related Work

Energy functions, also known as energy based models (EBMs), have a long history in machine learning and other domains as generative models. These models represent the likelihood of data using the Boltzmann distribution and include the Ising model [Brush, 1967], the Potts model [Montroll et al., 1963], the Helmholtz machine [Dayan et al., 1995], and the Boltzmann machine [Ackley et al., 1985]. Ackley et al. [1985], Hinton [2006], and Salakhutdinov and Hinton [2009] proposed latent-variable EBMs where energy is represented as a composition of latent and observable variables. In contrast, Mnih and Hinton [2004] and Hinton et al. [2006a] proposed EBMs where inputs are directly mapped to outputs, a structure we follow. We refer readers to [LeCun et al., 2006] for a tutorial on energy models.

In the image generation domain, the FRAME model [Zhu et al., 1998] learns an energy model to represent the texture within images. This work is extended to sparse codes in [Xie et al., 2015], and applied to convolutional neural networks in [Xie et al., 2016]. However, due to the computational difficulty of training, EBMs have become less popular recently. The primary difficulty in training EBMs comes from effectively estimating and sampling the partition function. One approach to training energy based models is to sample the partition function through amortized generation. Kim and Bengio [2016], Zhao et al. [2016], and Haarnoja et al. [2017] propose learning a separate network to generate samples, which makes these methods closely connected to GANs [Finn et al., 2016], but these methods do not take advantage of the optimization procedure noted in the introduction. Furthermore, amortized generation is prone to mode collapse, especially when training the sampling network without an entropy term, which is often approximated or ignored. An alternative approach is to use MCMC sampling to estimate the partition function. This has the advantage of provable mode exploration and allows the benefits of implicit generation listed in the introduction. Hinton [2006] proposed Contrastive Divergence, which uses gradient-free MCMC chains initialized from training data to estimate the partition function. Similarly, Salakhutdinov and Hinton [2009] apply contrastive divergence, while Tieleman [2008] proposes PCD, which propagates MCMC chains throughout training.

Task inference through energy minimization was also popular in the past. LeCun et al. [2006] provide an overview of using such minimization to solve tasks, while Bai et al. [2019] further apply a similar idea to language modeling. Through products of experts, Hinton [1999] has also previously shown that EBMs allow the composition of different concepts. In contrast to past work, we show that such composition enables compositionality in the natural image domain, through a series of logical operators [Du et al., 2020a].

EBMs have received an increasing amount of attention in recent years and have been applied to a variety of different domains. EBMs have been studied as a generative model over images [Du and Mordatch, 2019], and have been shown to exhibit adversarial robustness [Lee et al., 2018, Du and Mordatch, 2019, Grathwohl et al., 2019]. EBMs are applied towards protein modeling in [Ingraham et al., Du et al., 2020b]. EBMs are further applied as a memory model in [Bartunov et al., 2019], in NLP in [Bakhtin et al., 2019, Deng et al., 2020], and in the planning domain in [Du et al., 2019].

Chapter 3

Learning Energy Models

In this chapter, we first introduce energy based models and then discuss a stable training algorithm for such models. We show that this training algorithm enables us to use online optimization to generate high resolution images on ImageNet 128x128. We then showcase additional properties of online optimization and of energy based models.

3.1 Energy Based Models

Given a datapoint $\mathbf{x}$, let $E_\theta(\mathbf{x}) \in \mathbb{R}$ be the energy function. In our work, we parameterize this function as a deep neural network with weights $\theta$. The energy function defines a probability distribution via the Boltzmann distribution $p_\theta(\mathbf{x}) = \frac{\exp(-E_\theta(\mathbf{x}))}{Z(\theta)}$, where $Z(\theta) = \int \exp(-E_\theta(\mathbf{x}))\,d\mathbf{x}$ denotes the partition function. In our work, we wish to maximize this likelihood on our training data.

One difficulty with maximum likelihood training is that generating samples from this distribution is challenging, with previous work relying on MCMC methods such as random walk or Gibbs sampling [Hinton, 2006]. These methods have long mixing times, especially for high-dimensional complex data such as images. To improve the mixing time of the sampling procedure, we use Langevin dynamics, which makes use of the gradient of the energy function to undergo sampling:

$$\tilde{\mathbf{x}}^k = \tilde{\mathbf{x}}^{k-1} - \frac{\lambda}{2} \nabla_{\mathbf{x}} E_\theta(\tilde{\mathbf{x}}^{k-1}) + \omega^k, \qquad \omega^k \sim \mathcal{N}(0, \lambda) \tag{3.1}$$

We want the distribution defined by $E_\theta$ to model the data distribution $p_D$, which we do by minimizing the negative log likelihood of the data, $\mathcal{L}_{\mathrm{ML}}(\theta) = \mathbb{E}_{\mathbf{x} \sim p_D}[-\log p_\theta(\mathbf{x})]$, where $-\log p_\theta(\mathbf{x}) = E_\theta(\mathbf{x}) + \log Z(\theta)$. This objective is known to have the gradient (see [Turner, 2005] for a derivation)

$$\nabla_\theta \mathcal{L}_{\mathrm{ML}} = \mathbb{E}_{\mathbf{x}^+ \sim p_D}\left[\nabla_\theta E_\theta(\mathbf{x}^+)\right] - \mathbb{E}_{\mathbf{x}^- \sim p_\theta}\left[\nabla_\theta E_\theta(\mathbf{x}^-)\right].$$

Intuitively, this gradient decreases the energy of the positive data samples $\mathbf{x}^+$, while increasing the energy of the negative samples $\mathbf{x}^-$ from the model $p_\theta$. We rely on Langevin dynamics in (3.1) to generate $q_\theta$ as an approximation of $p_\theta$:

$$\nabla_\theta \mathcal{L}_{\mathrm{ML}} \approx \mathbb{E}_{\mathbf{x}^+ \sim p_D}\left[\nabla_\theta E_\theta(\mathbf{x}^+)\right] - \mathbb{E}_{\mathbf{x}^- \sim q_\theta}\left[\nabla_\theta E_\theta(\mathbf{x}^-)\right]. \tag{3.2}$$

Equation 3.1 further defines a way to generate samples from the model: initialize a sample from uniform noise and iteratively run Langevin dynamics for a large number of steps. We refer to this generation process as online optimization.
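To make the sampling procedure concrete, a minimal PyTorch sketch of the Langevin update in Equation 3.1 is shown below; the network interface, step size, and step count are illustrative placeholders rather than the exact settings used in this thesis.

```python
import torch

def langevin_sample(energy_fn, x_init, num_steps=60, step_size=10.0):
    """Run Langevin dynamics (Eq. 3.1) to draw approximate samples
    from p(x) proportional to exp(-E(x))."""
    x = x_init.clone().detach()
    for _ in range(num_steps):
        x.requires_grad_(True)
        energy = energy_fn(x).sum()
        # Gradient of the energy with respect to the input.
        grad, = torch.autograd.grad(energy, x)
        noise = torch.randn_like(x)
        # Gradient descent on the energy plus Gaussian noise.
        x = (x - 0.5 * step_size * grad
             + noise * step_size ** 0.5).detach()
        x = x.clamp(0.0, 1.0)  # keep images in a valid pixel range
    return x

# Usage: initialize from uniform noise, as described above.
# energy_fn = SomeEnergyNetwork(); x0 = torch.rand(16, 3, 32, 32)
# samples = langevin_sample(energy_fn, x0)
```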

3.1.1 Sample Replay Buffer

Langevin dynamics does not place restrictions on the sample initialization $\tilde{\mathbf{x}}^0$ given sufficient sampling steps. However, initialization plays a crucial role in mixing time. Persistent Contrastive Divergence (PCD) [Tieleman, 2008] maintains a single persistent chain to improve mixing and sample quality. We use a sample replay buffer $\mathcal{B}$ in which we store past generated samples $\tilde{\mathbf{x}}$ and use either these samples or uniform noise to initialize the Langevin dynamics procedure. This has the benefit of continuing to refine past samples, increasing the effective number of sampling steps $K$ as well as sample diversity. In all our experiments, we sample from $\mathcal{B}$ 95% of the time and from uniform noise otherwise.
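A minimal sketch of such a replay buffer is shown below; the 95% reuse probability follows the text, while the class name, buffer capacity, and method interfaces are illustrative.

```python
import torch

class SampleReplayBuffer:
    """Store past Langevin samples and reuse them as chain initializations."""
    def __init__(self, capacity=10000):
        self.capacity = capacity
        self.samples = []

    def add(self, batch):
        self.samples.extend(batch.detach().cpu().unbind(0))
        self.samples = self.samples[-self.capacity:]  # drop the oldest samples

    def init_batch(self, batch_shape, reuse_prob=0.95):
        """Initialize each chain from the buffer with probability 0.95,
        and from uniform noise otherwise."""
        batch = torch.rand(batch_shape)  # uniform-noise fallback
        if self.samples:
            for i in range(batch_shape[0]):
                if torch.rand(()) < reuse_prob:
                    idx = torch.randint(len(self.samples), (1,)).item()
                    batch[i] = self.samples[idx]
        return batch
```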

22 3.1.2 Regularization

Arbitrary energy models can have sharp changes in gradients that make sampling with Langevin dynamics unstable. We found that constraining the Lipschitz constant of the energy network ameliorates these issues. To constrain the Lipschitz constant, we follow the method of [Miyato et al., 2018] and add spectral normalization to all layers of the model. Additionally, we found it useful to weakly L2 regularize the energy magnitudes of both positive and negative samples during training; otherwise, while the difference between positive and negative energies was preserved, the actual values would drift to numerically unstable magnitudes. Both forms of regularization also serve to ensure that the partition function is integrable over the domain of the input, with spectral normalization ensuring smoothness and the L2 coefficient bounding the magnitude of the unnormalized distribution.
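For concreteness, the following sketch shows one way to implement both regularizers in PyTorch; the architecture is a stand-in (assuming 32x32 RGB inputs), while `spectral_norm` is PyTorch's packaged implementation of [Miyato et al., 2018].

```python
import torch
import torch.nn as nn
from torch.nn.utils import spectral_norm

# Spectral normalization on every layer constrains the Lipschitz constant.
energy_net = nn.Sequential(
    spectral_norm(nn.Conv2d(3, 64, 3, padding=1)), nn.SiLU(),
    spectral_norm(nn.Conv2d(64, 64, 3, padding=1)), nn.SiLU(),
    nn.Flatten(),
    spectral_norm(nn.Linear(64 * 32 * 32, 1)),  # scalar energy output
)

def training_loss(e_pos, e_neg, alpha=1.0):
    """Contrastive loss plus weak L2 regularization of energy magnitudes."""
    l_ml = e_pos.mean() - e_neg.mean()           # maximum-likelihood term
    l_reg = (e_pos ** 2 + e_neg ** 2).mean()     # keeps energies bounded
    return l_ml + alpha * l_reg
```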

We present the algorithm below, where $\Omega(\cdot)$ indicates the stop-gradient operator.

Algorithm 1 Energy training algorithm
Input: data distribution $p_D(\mathbf{x})$, step size $\lambda$, number of steps $K$
  $\mathcal{B} \leftarrow \emptyset$
  while not converged do
    $\mathbf{x}_i^+ \sim p_D$
    $\mathbf{x}_i^0 \sim \mathcal{B}$ with 95% probability and $\mathcal{U}$ otherwise
    ◁ Generate sample from $q_\theta$ via Langevin dynamics:
    for sample step $k = 1$ to $K$ do
      $\tilde{\mathbf{x}}^k \leftarrow \tilde{\mathbf{x}}^{k-1} - \frac{\lambda}{2}\nabla_{\mathbf{x}} E_\theta(\tilde{\mathbf{x}}^{k-1}) + \omega, \quad \omega \sim \mathcal{N}(0, \sigma)$
    end for
    $\mathbf{x}_i^- = \Omega(\tilde{\mathbf{x}}_i^K)$
    ◁ Optimize objective $\alpha \mathcal{L}_2 + \mathcal{L}_{\mathrm{ML}}$ wrt $\theta$:
    $\Delta\theta \leftarrow \nabla_\theta \frac{1}{N} \sum_i \alpha\big(E_\theta(\mathbf{x}_i^+)^2 + E_\theta(\mathbf{x}_i^-)^2\big) + E_\theta(\mathbf{x}_i^+) - E_\theta(\mathbf{x}_i^-)$
    Update $\theta$ based on $\Delta\theta$ using the Adam optimizer
    $\mathcal{B} \leftarrow \mathcal{B} \cup \tilde{\mathbf{x}}_i$
  end while
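A minimal PyTorch rendering of one step of Algorithm 1 might look as follows, reusing the `langevin_sample` and `SampleReplayBuffer` sketches above; the hyperparameters are placeholders.

```python
import torch

def train_step(energy_net, optimizer, buffer, x_pos, alpha=1.0, k_steps=60):
    """One step of Algorithm 1: sample negatives, then optimize alpha*L2 + LML."""
    # Initialize chains from the buffer (95%) or uniform noise (5%).
    x_init = buffer.init_batch(x_pos.shape)
    # Langevin sampling; detaching implements the stop-gradient operator.
    x_neg = langevin_sample(energy_net, x_init, num_steps=k_steps).detach()

    e_pos, e_neg = energy_net(x_pos), energy_net(x_neg)
    loss = (e_pos.mean() - e_neg.mean()
            + alpha * (e_pos ** 2 + e_neg ** 2).mean())

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()  # the thesis uses Adam

    buffer.add(x_neg)
    return loss.item()
```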

Figure 3-1: Table of Inception and FID scores for ImageNet 32x32 and CIFAR-10. Quantitative numbers for ImageNet 32x32 from [Ostrovski et al., 2018]. (*) We use Inception Score (from the original OpenAI repo) to compare with legacy models, but strongly encourage future work to compare solely with FID score, since Langevin dynamics converges to minima that artificially inflate Inception Score. (**) Conditional EBM models for 128x128 are smaller than those in SNGAN.

Model                               | Inception* | FID
CIFAR-10 Unconditional              |            |
PixelCNN [Van Oord et al., 2016]    | 4.60       | 65.93
PixelIQN [Ostrovski et al., 2018]   | 5.29       | 49.46
EBM (single)                        | 6.02       | 40.58
DCGAN [Radford et al., 2016]        | 6.40       | 37.11
WGAN + GP [Gulrajani et al., 2017]  | 6.50       | 36.4
EBM (10 historical ensemble)        | 6.78       | 38.2
SNGAN [Miyato et al., 2018]         | 8.22       | 21.7
CIFAR-10 Conditional                |            |
Improved GAN                        | 8.09       | -
EBM (single)                        | 8.30       | 37.9
Spectral Normalization GAN          | 8.59       | 25.5
ImageNet 32x32 Conditional          |            |
PixelCNN                            | 8.33       | 33.27
PixelIQN                            | 10.18      | 22.99
EBM (single)                        | 18.22      | 14.31
ImageNet 128x128 Conditional        |            |
ACGAN [Odena et al., 2017]          | 28.5       | -
EBM** (single)                      | 28.6       | 43.7
SNGAN                               | 36.8       | 27.62

Figure 3-2: EBM image restoration on images in the test set via MCMC (online optimization). Columns: salt and pepper (0.1), ground truth initialization, inpainting. The right column shows failures (approx. 10% of objects change with ground-truth initialization and 30% of objects change under salt-and-pepper corruption or inpainting; the bottom two rows show worst-case changes).

Figure 3-3: Comparison of image generation techniques on the unconditional CIFAR-10 dataset: (a) GLOW model, (b) EBM, (c) EBM (10 historical), (d) EBM sample buffer.

3.2 Image Generation

In this section, we show that online optimization with EBMs yields effective generative models for images. We show EBMs are able to generate high fidelity images and exhibit mode coverage on CIFAR-10 and ImageNet. We further show EBMs exhibit adversarial robustness and better out-of-distribution behavior than other likelihood models. Our model is based on the ResNet architecture (using conditional gains and biases per class [Dumoulin et al.] for conditional models).

We quantitatively evaluate the image quality of EBMs with Inception score [Salimans et al., 2016] and FID score [Heusel et al., 2017] in Figure 3-1. Overall we obtain significantly better scores than the likelihood models PixelCNN and PixelIQN, but worse than SNGAN [Miyato et al., 2018]. We found that in the unconditional case, mode exploration with Langevin dynamics took a very long time, so in EBM (10 historical ensemble) we also experimented with sampling jointly from the last 10 snapshots of the model. At training time, extensive exploration is ensured with the replay buffer (Figure 3-3d). Our models have a similar number of parameters to SNGAN, but we believe that significantly more parameters may be necessary to generate high fidelity images with mode coverage. On ImageNet 128x128, due to computational constraints, we train a smaller network than SNGAN and do not train to convergence.

Figure 3-4: Illustration of cross-class implicit sampling on a conditional EBM. The EBM is conditioned on a particular class but is initialized with an image from a separate class.

Figure 3-5: Illustration of image completions on the conditional ImageNet model (columns: original, corruption, completions; rows: train and test images). Our models exhibit diversity in inpainting.

3.2.1 Mode Evaluation

We evaluate over-fitting and mode coverage in EBMs. To test mode coverage, we investigate MCMC sampling on corrupted CIFAR-10 test images. Since Langevin dynamics is known to mix slowly [Neal, 2011] and reach local minima, we believe that good denoising after a limited number of sampling steps indicates probability modes at the respective test images. Similarly, lack of movement from a ground-truth test image initialization after the same number of steps likely indicates a probability mode at the test image. In Figure 3-2, we find that if we initialize sampling with images from the test set, images do not move significantly. However, under the same number of steps, Figure 3-2 shows that we are able to reliably decorrupt masked and salt-and-pepper corrupted images, indicating good mode coverage. We note that a large number of sampling steps leads to more saturated images, which are due to sampling low temperature modes; such saturation occurs across likelihood models (see appendix). In comparison, GANs have been shown to miss many modes of data and cannot reliably reconstruct many different test images [Yeh et al.]. We note that such decorruption behavior is a nice property of implicit generation, without need of explicit knowledge of the corrupted pixels.

Another common test for mode coverage and overfitting is masked inpainting [Van Oord et al., 2016]. In Figure 3-5, we mask out the bottom half of ImageNet images and test the ability to sample the masked pixels while fixing the values of the unmasked pixels. Running Langevin dynamics on the images, we find diversity of completions on train/test images, indicating low overfitting on the training set and the diversity characteristic of likelihood models. Furthermore, by initializing sampling of a class conditional EBM with images from another class, we can test for the presence of probability modes at images far away from those seen in training. We find in Figure 3-4 that sampling on such images with an EBM is able to generate images of the target class, indicating semantically meaningful modes of probability even far away from the training distribution.
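Masked inpainting requires only a small change to the Langevin sampler: gradient updates are applied everywhere, but the known pixels are reset after every step. A sketch under these assumptions (the mask convention is ours):

```python
import torch

def inpaint(energy_fn, x_corrupt, mask, num_steps=60, step_size=10.0):
    """Sample masked pixels with Langevin dynamics while clamping
    unmasked pixels to their observed values (mask == 1 means unknown)."""
    x = x_corrupt.clone().detach()
    for _ in range(num_steps):
        x.requires_grad_(True)
        grad, = torch.autograd.grad(energy_fn(x).sum(), x)
        noise = torch.randn_like(x) * step_size ** 0.5
        x = (x - 0.5 * step_size * grad + noise).detach()
        # Re-impose the known pixels after every update.
        x = mask * x + (1 - mask) * x_corrupt
    return x
```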

Figure 3-6: $\epsilon$ plots under $L_\infty$ and $L_2$ attacks of conditional EBMs as compared to PGD-trained models in [Madry et al., 2017] and a baseline Wide ResNet18. (a) $L_\infty$ robustness; (b) $L_2$ robustness.

3.3 Generalization

In this section, we show that EBMs trained with online optimization are robust in both the adversarial and out-of-distribution settings.

3.3.1 Adversarial Robustness

We show conditional EBMs exhibit adversarial robustness on CIFAR-10 classification, without explicit adversarial training. To compute logits for classification, we compute the negative energy of the image under each class. Our model, without fine-tuning, achieves an accuracy of 49.6%. Figure 3-6 shows adversarial robustness curves. We ran 20 steps of PGD as in [Madry et al., 2017] on the above logits. To undergo classification, we then ran 10 steps of sampling initialized from the starting image (with a bounded deviation of 0.03) from each conditional model, and then classified using the lowest energy conditional class. We found that running PGD incorporating sampling was less successful than without. Overall we find in Figure 3-6 that EBMs are very robust to adversarial perturbations and outperform the SOTA $L_\infty$ model in [Madry et al., 2017] on $L_\infty$ attacks with $\epsilon > 13$.
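The classification procedure described above can be sketched as follows; the conditional energy signature and the hyperparameters are illustrative assumptions rather than the exact protocol:

```python
import torch

def classify(energy_fn, x, num_classes=10, refine_steps=10,
             step_size=10.0, max_dev=0.03):
    """Classify x as argmin_y E(x, y), refining x per class with Langevin."""
    x = x.detach()
    energies = []
    for y in range(num_classes):
        x_y = x.clone()
        for _ in range(refine_steps):  # short per-class Langevin refinement
            x_y.requires_grad_(True)
            grad, = torch.autograd.grad(energy_fn(x_y, y).sum(), x_y)
            x_y = (x_y - 0.5 * step_size * grad
                   + torch.randn_like(x_y) * step_size ** 0.5).detach()
            # Stay within a bounded deviation of the original image.
            x_y = x + (x_y - x).clamp(-max_dev, max_dev)
        energies.append(energy_fn(x_y, y))
    # Logits are negative energies; predict the lowest-energy class.
    return torch.stack(energies, dim=-1).argmin(dim=-1)
```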

Table 3.1: AUROC scores of out-of-distribution classification on different datasets. Only our model achieves better-than-chance classification.

Dataset                | PixelCNN++ | Glow | EBM (ours)
SVHN                   | 0.32       | 0.24 | 0.63
Textures               | 0.33       | 0.27 | 0.48
Constant Uniform       | 0.0        | 0.0  | 0.30
Uniform                | 1.0        | 1.0  | 1.0
CIFAR-10 Interpolation | 0.71       | 0.59 | 0.70
Average                | 0.47       | 0.42 | 0.62

Figure 3-7: Histograms of relative likelihoods for various datasets for Glow, PixelCNN++ and EBM models. Panels: SVHN/CIFAR-10 test on Glow, on PixelCNN++, and on EBM; CIFAR-10 train/test on EBM.

3.3.2 Out-of-Distribution Generalization

We show EBMs exhibit better out-of-distribution (OOD) detection than other likelihood models. Such a task requires models to assign high likelihood on the data manifold and low likelihood at all other locations, and can be viewed as a proxy for the quality of the log likelihood. Surprisingly, Nalisnick et al. [2019] found that likelihood models such as VAEs, PixelCNN, and Glow are unable to distinguish OOD data, assigning higher likelihood to many OOD images than to in-distribution ones. We construct our OOD metric following [Hendrycks and Gimpel, 2016], using Area Under the ROC Curve (AUROC) scores computed by classifying CIFAR-10 test images against other OOD images using relative log likelihoods. We use SVHN, Textures [Cimpoi et al., 2014], monochrome images, uniform noise, and interpolations of separate CIFAR-10 images as OOD distributions. We found that this OOD metric correlates well with training progress in EBMs.
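Concretely, the AUROC metric can be computed from per-image likelihood scores (for an EBM, negative energies serve as unnormalized log-likelihoods); a minimal sketch using scikit-learn, with placeholder score arrays:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def ood_auroc(scores_in, scores_out):
    """AUROC for separating in-distribution (CIFAR-10 test) from OOD images,
    where a higher score means more in-distribution (e.g., -E(x) for an EBM)."""
    labels = np.concatenate([np.ones_like(scores_in),     # in-distribution
                             np.zeros_like(scores_out)])  # OOD
    scores = np.concatenate([scores_in, scores_out])
    return roc_auc_score(labels, scores)

# e.g. ood_auroc(-energies_cifar_test, -energies_svhn)
```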

In Table 3.1, unconditional EBMs perform significantly better out-of-distribution than other auto-regressive and flow generative models, with an average OOD score of 0.62, while the closest, PixelCNN++, has an OOD score of 0.47. We provide histograms of relative likelihoods for SVHN in Figure 3-7, which are also discussed in [Nalisnick et al., 2019, Hendrycks et al., 2018]. We believe that the reason for better generalization is two-fold. First, we believe that the negative sampling procedure in EBMs helps eliminate spurious minima. Second, we believe EBMs have a flexible structure that allows global context when estimating probability without imposing constraints on latent variable structure. In contrast, auto-regressive models estimate likelihood sequentially, which makes global coherence difficult. In a different vein, flow based models must apply continuous transformations onto a continuous connected probability distribution, which makes it very difficult to model disconnected modes and thus assigns spurious density to connections between modes.

Table 3.2: Comparison of EBMs with various other continual learning benchmarks. Values averaged across 10 seeds, reported as mean (standard deviation).

Method                         | Accuracy
EWC [Kirkpatrick et al., 2017] | 19.80 (0.05)
SI [Zenke et al., 2017]        | 19.67 (0.09)
NAS [Schwarz et al., 2018]     | 19.52 (0.29)
LwF [Li and Snavely, 2018]     | 24.17 (0.33)
VAE                            | 40.04 (1.31)
EBM (ours)                     | 64.99 (4.27)

3.4 Online Learning

In this section, we show that EBMs trained through online optimization are further able to learn well online. We evaluate incremental class learning on the Split MNIST task proposed in [Farquhar and Gal, 2018]. The task evaluates overall MNIST digit classification accuracy given 5 sequential training tasks of disjoint pairs of digits. We train a conditional EBM with 2 layers of 400 hidden units and compare with a generative conditional VAE baseline, with both encoder and decoder having 2 layers of 400 hidden units. Additional training details are covered in the appendix. We train the generative models to represent the joint distribution of images and labels, and classify based on the lowest energy label. Hsu et al. [2018] analyzed common continual learning algorithms such as EWC [Kirkpatrick et al., 2017], SI [Zenke et al., 2017] and NAS [Schwarz et al., 2018] and found they obtain performance around 20%. LwF [Li and Snavely, 2018] performed best, with a performance of 24.17 ± 0.33, where all architectures use 2 layers of 400 hidden units. However, since each new task introduces two new MNIST digits, a test accuracy of around 20% indicates complete forgetting of previous tasks. In contrast, we found continual EBM training obtains a significantly higher performance of 64.99 ± 4.27 (Table 3.2). All experiments were run with 10 seeds.

A crucial difference is that negative training in EBMs only locally "forgets" information corresponding to negative samples. Thus, when new classes are seen, negative samples are conditioned on the new class, and the EBM only forgets unlikely data from the new class. In contrast, the cross entropy objective used to train common continual learning algorithms down-weights the likelihood of all classes not seen. We can apply this insight to other generative models by maximizing the likelihood of a class conditional model at train time and then using the highest likelihood class as the classification result. We ran such a baseline using a VAE and obtained a performance of 40.04 ± 1.31, which is higher than other continual learning algorithms but less than that of an EBM.

Chapter 4

Energy Based Models for Planning

In this chapter, we describe a framework that uses online optimization on energy models to generate trajectory-level plans to reach goals in robotics. We showcase how this enables effective online learning, inference of diverse plans, and active exploration.

4.1 Planning through Online Optimization

4.1.1 Energy-Based Models and Terminology

Consider a standard Markov Decision Process (MDP) represented by the tuple $\langle S, A, T, R \rangle$, where $S$ is the set of all possible state configurations, $A$ is the set of actions available to the agent, $T$ is the transition distribution, and $R$ is the reward function. Under this setup, define a state transition pair $(s_t, s_{t+1})$ between the states at timesteps $t$ and $t+1$. Define $E_\theta(s_t, s_{t+1}) \in \mathbb{R}$ as the energy function, which we parameterize with a deep neural network. We interpret the energy function as an unnormalized probability distribution over state transitions by defining the distribution as $p_\theta(s_t, s_{t+1}) \propto e^{-E_\theta(s_t, s_{t+1})}$.

To sample from the defined probability distribution, we use the Model Predictive Path Integral (MPPI) algorithm [Williams et al., 2017a], which is shown to converge to the full posterior distribution in [Williams et al., 2017b]. The mathematical formulation is shown below, where $x := (s_t, s_{t+1})$:

$$\tilde{x}^k = \sum_i w_i x_i^k, \qquad x_i^k \sim \mathcal{N}(\tilde{x}^{k-1}, \sigma), \qquad w_i = \frac{e^{-E_\theta(x_i^k)}}{\sum_j e^{-E_\theta(x_j^k)}} \tag{4.1}$$

Other valid inference algorithms that sample from the posterior are also applicable to EBMs, such as Langevin dynamics (previous chapter) or Hamiltonian Monte Carlo (HMC). We note that this form of sampling makes it easy to add additional constraints to a probability distribution by simply adding the constraint as an energy. We train models by minimizing

$$\mathbb{E}_{x^+ \sim p_D}\left[E_\theta(x^+)\right] - \mathbb{E}_{x^- \sim p_\theta}\left[E_\theta(x^-)\right]. \tag{4.2}$$

Intuitively, doing so decreases the energies of observed transitions (i.e. more likely transitions), and increases the energies of transitions sampled from the model's distribution (i.e. less likely transitions).
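A minimal PyTorch sketch of the MPPI update in Equation 4.1 is shown below; the sample count, noise scale, and iteration count are illustrative rather than the thesis's exact settings.

```python
import torch

def mppi_sample(energy_fn, x_init, num_samples=100, sigma=0.1, num_iters=10):
    """Iteratively refine x toward low energy via the MPPI update (Eq. 4.1)."""
    x = x_init.clone()
    for _ in range(num_iters):
        # Perturb the current estimate with Gaussian noise.
        candidates = x + sigma * torch.randn(num_samples, *x.shape)
        energies = torch.stack([energy_fn(c) for c in candidates])
        # Softmax weights favor low-energy candidates.
        weights = torch.softmax(-energies, dim=0)
        # The weighted average of candidates becomes the new estimate.
        x = (weights.view(-1, *[1] * x.dim()) * candidates).sum(dim=0)
    return x
```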

4.1.2 Planning with Energy-Based Models

In the previous subsection, we described a way to learn state transition models $p_\theta(s_t, s_{t+1})$. We now discuss how to use such models to do inference over trajectories. Given a learned model $p_\theta(s_t, s_{t+1})$, we can model the likelihood of a trajectory $\tau$ under the model as a product of factors

$$p_\theta(\tau) = p_\theta(s_1, s_2, \ldots, s_T) = \prod_{t=1}^{T-1} p_\theta(s_t, s_{t+1}) \tag{4.3}$$
$$\propto \exp\Big(-\sum_{t=1}^{T-1} E_\theta(s_t, s_{t+1})\Big) \tag{4.4}$$

We can likewise do inference across this product using MPPI. Note that we directly sample states rather than actions in our formulation. We generate temporally smooth trajectory perturbations following the approach of [Kalakrishnan et al., 2011] (where the last two rows of the finite difference matrix $A$ are removed to allow the end states of trajectories to be unconstrained).

Given a particular fixed goal state $g$ and start state $s_1$, we can do inference over the intermediate states $s_2, \ldots, s_T$ by sampling from the conditional distribution over state transitions

$$p_\theta(s_2, \ldots, s_T \mid s_1, g) \propto \exp\Big(-\sum_{t=1}^{T-1} E_\theta(s_t, s_{t+1}) - E_\theta(s_T, g)\Big) \tag{4.5}$$

to get a plan. Alternatively, instead of using a fixed goal state $g$, we can represent the goal state with a Gaussian distribution around it, $P(g)$, and similarly perform inference over

$$p_\theta(s_2, \ldots, s_T \mid s_1, g) \propto \exp\Big(-\sum_{t=1}^{T-1} E_\theta(s_t, s_{t+1}) - (s_T - g)^2\Big) \tag{4.6}$$

to get a plan. We found that inference over both distributions worked well using MPPI. Throughout our experiments, we follow Equation 4.6 and specify a Gaussian distribution around all goal states, to accommodate additional constraints more flexibly, and refer to the resulting state distribution as $p_\theta(\tau \mid s_1, G)$. Actions are not inferred in this sampling process, but can be found either using a ground truth inverse dynamics model or an online approximation.
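The goal-conditioned distribution of Equation 4.6 composes directly with the MPPI sketch above: a trajectory's energy is the sum of its transition energies plus a squared goal term. An illustrative helper (the transition-energy signature is an assumption):

```python
import torch

def trajectory_energy(energy_fn, states, s1, goal):
    """Negative log-probability (up to a constant) of a plan under Eq. 4.6.
    `states` holds s_2 ... s_T; `energy_fn` scores a single transition."""
    full = torch.cat([s1.unsqueeze(0), states], dim=0)  # s_1 ... s_T
    transition_energy = sum(energy_fn(full[t], full[t + 1])
                            for t in range(full.shape[0] - 1))
    goal_energy = ((full[-1] - goal) ** 2).sum()  # Gaussian goal term
    return transition_energy + goal_energy

# e.g. plan = mppi_sample(lambda s: trajectory_energy(E, s, s1, g), init_plan)
```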

We can also frame planning as finding the trajectory that has the highest conditional probability of achieving optimal reward, where the event of achieving optimal reward at time $t$ is denoted $O_t$ and the probability $P(O_t \mid s_t)$ is defined as $e^{R(s_t)}$ for a reward function $R(s)$. Inference can then be done on

$$p_\theta(\tau \mid O_{1:T}) \propto \exp\Big(R(s_1) - \sum_{t=1}^{T-1} \big(E_\theta(s_t, s_{t+1}) - R(s_{t+1})\big)\Big),$$

which Levine [2018] interprets as maximum entropy reinforcement learning on the model. However, while the form proposed in [Levine, 2018] considers maximum entropy over actions given a state, we consider maximum entropy of the next state given the current state.

4.1.3 Online Learning with Energy-Based Models

The previous sections described inference with EBMs pre-trained on generated data. We now turn to the question of how to generate training data in an on-going manner, simultaneously learning the EBM as the robot operates in the environment. We discuss online training methods for EBMs, i.e. how to effectively obtain data on state transitions and learn an energy function given an MDP environment represented as a tuple $\langle S, A, T, R \rangle$ and a goal $G$.

In this setup, we first generate a $T$-step trajectory $\tau_\theta$ from the model $p_\theta(\tau \mid s_1, G)$ and use an inverse dynamics model to compute the corresponding action $a_t$ at each time step. We then execute each action $a_t$ in the real environment to generate another $T$-step trajectory $\tau_{\text{real}}$, stopping prematurely if the real observations deviate significantly from the model predictions. After that, we train the EBM to increase the energy of each attempted transition in $\tau_\theta$ (imagined transitions) while decreasing the energy of real transitions in $\tau_{\text{real}}$ (real transitions). We note that perfect planning will have no effect on the model training, since then $\tau_\theta = \tau_{\text{real}}$. For stability, we maintain a replay buffer of past experiences and simulated trajectories. Figure 4-1 provides an overview of the process.

Intuitively, our training procedure allows the model to learn a good likelihood distribution over states that it has observed. However, since we terminate the model's planning after significant deviation between real observations and the enacted plan, the model is not trained to minimize the probability of transitions among faraway states. These transitions are thus free to vary over time throughout training, which eventually provides incentive for the model to explore the whole state space. In our experiments section, we illustrate this effect and show that EBMs lead to good exploration. For completeness, we include pseudo-code for online training of EBMs, where $\Omega(\cdot)$ denotes a collation operator that converts a trajectory to pairs of state transitions.

Algorithm 2 Online training of a trajectory level EBM
Input: goal state $G$, step size $\lambda$, number of steps $K$, number of plan steps $T$, inverse dynamics model $ID$, replay buffers $\mathcal{B}_{pos}$, $\mathcal{B}_{neg}$
  $\mathcal{B}_{pos} \leftarrow \emptyset$; $\mathcal{B}_{neg} \leftarrow \emptyset$
  for environment timestep $i$ do
    ◁ Initialize trajectory as a smooth trajectory at the start state:
    $\tilde{\tau}^0 = s_0$
    ◁ Generate sample from $p_\theta(\tau \mid s_i, G)$ via MPPI:
    for sample step $k = 1$ to $K$ do
      $\tilde{\tau}^k = \sum_j w_j \tilde{\tau}_j^k, \quad \tilde{\tau}_j^k \sim \mathcal{N}(\tilde{\tau}^{k-1}, \Sigma), \quad w_j = \frac{\exp\!\big(-\big(\sum_{i=0}^{T-1} E_\theta(s_i^k, s_{i+1}^k)\big) - (s_T^k - G)^2\big)}{\sum_j \exp\!\big(-\big(\sum_{i=0}^{T-1} E_\theta(s_i^k, s_{i+1}^k)\big) - (s_T^k - G)^2\big)}$
    end for
    $a \leftarrow ID(\tilde{\tau}^K)$
    $\tau^+ \sim$ simulate environment with actions $a$
    $\mathbf{x}^+ = \Omega(\tau^+) \cup \text{sample}(\mathcal{B}_{pos})$
    $\mathbf{x}^- = \Omega(\tilde{\tau}^K) \cup \text{sample}(\mathcal{B}_{neg})$
    ◁ Optimize objective $\mathcal{L}_2 + \mathcal{L}_{\mathrm{ML}}$ wrt $\theta$:
    $\Delta\theta \leftarrow \nabla_\theta \frac{1}{N} \sum_i E_\theta(\mathbf{x}_i^+)^2 + E_\theta(\mathbf{x}_i^-)^2 + E_\theta(\mathbf{x}_i^+) - E_\theta(\mathbf{x}_i^-)$
    Update $\theta$ based on $\Delta\theta$ using the Adam optimizer
    $\mathcal{B}_{pos} \leftarrow \mathcal{B}_{pos} \cup \mathbf{x}^+$; $\mathcal{B}_{neg} \leftarrow \mathcal{B}_{neg} \cup \mathbf{x}^-$
  end for
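The "execute until deviation" step of Algorithm 2 can be made concrete with a short helper; the environment interface and the deviation threshold below are illustrative assumptions:

```python
import torch

def execute_plan(env, plan_states, actions, max_dev=0.5):
    """Execute a plan, stopping once reality deviates from the plan.
    Returns real transitions and the planned transitions kept for training."""
    real_states = [env.state()]
    for t, a in enumerate(actions):
        s_next = env.step(a)
        real_states.append(s_next)
        # Stop early when the planned state no longer matches reality.
        if torch.norm(s_next - plan_states[t + 1]) > max_dev:
            break
    kept = t + 1  # planned transitions before the deviation (green in Fig. 4-1)
    real_pairs = list(zip(real_states[:-1], real_states[1:]))
    plan_pairs = list(zip(plan_states[:kept], plan_states[1:kept + 1]))
    return real_pairs, plan_pairs
```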

Figure 4-1: Overview of the online training procedure with an EBM, where grey areas represent inadmissible regions. Plans from the current observation to the goal state are inferred from the EBM (left). A particular plan is chosen and executed until a planned state deviates significantly from an actual state (middle). The EBM is then trained on all real transitions $\tau_{\text{real}}$ and all planned transitions before the deviation (in green) $\tau_\theta$, while transitions afterwards (red) are ignored. A new plan is then generated from the new location of the agent (right).

4.1.4 Related Work

A number of model classes have been explored for model-based planning in the robotics and artificial intelligence literature, from feed-forward, recurrent [Nagabandi et al., 2018], temporal segment [Mishra et al., 2017] and Bayesian [Gal et al.] neural networks, to locally-linear models [Yip and Camarillo, 2014, Mordatch et al., 2016] and Gaussian processes [Deisenroth and Rasmussen, 2011]. By contrast, energy-based models have been under-explored for model-based planning, and it is our aim to showcase the favorable properties this model class exhibits. While the work of [Haarnoja et al., 2017] explored energy-based models of policies for model-free reinforcement learning, we instead use them to model environment dynamics in a model-based setting.

Energy-based models have seen success in other applications, such as natural image modeling [Du and Mordatch, 2019, Dai et al., 2019]. These works noted the favorable compositionality and online learning properties of EBMs, which we take advantage of in this work. A number of methods have been used for inference and sampling in EBMs (or equivalently, planning), from Gibbs sampling [Hinton et al., 2006b], to Langevin dynamics [Du and Mordatch, 2019], and learned samplers [Kim and Bengio, 2016]. We have not focused on the choice of sampler/planner in this work, and found the method of [Williams et al., 2017a] to work well. Other planning methods, such as those based on direct collocation [Mordatch et al., 2012, Erez and Todorov, 2012], can potentially be used instead.

A common approach to achieving exploration behavior in reinforcement learning has been to use explicit rewards, known as intrinsic motivation [Oudeyer and Kaplan, 2009, Schmidhuber, 2010]. Examples include rewarding empowerment, information gain about the model of the dynamics [Pathak et al., 2017], or state space coverage [Houthooft et al., 2016, Tang et al., 2017]. Maximum entropy models are another approach to induce exploratory behavior [Haarnoja et al., 2018], and are what we rely on in this work as well. We show that contrastive training of EBMs is particularly conducive to exploration.


Figure 4-2: Illustrations of the 4 evaluated environments (Particle, Maze, Reacher, and Sawyer Arm). Both the particle and maze environments have 2 degrees of freedom for x, y movement. The Reacher environment has 2 degrees of freedom corresponding to torques on two motors. The Sawyer Arm environment has 7 degrees of freedom corresponding to torques.

4.2 Experiments

4.2.1 Setup

We perform experiments on four different environments listed below, with corresponding visualizations in Figure 4-2:

1. Particle: An environment in which a particle is spawned at a start position and must navigate to a goal position; each position is represented by an (x, y) tuple. The observation is the current position of the particle, and there are two degrees of freedom that correspond to x-displacement and y-displacement. The reward at each timestep corresponds to the negative distance from the current position to the goal position. Agents are able to move within a 0.05 uniform ball around their current location, with the size of the map being 2 by 2.

2. Maze: Same setup as the particle environment, but certain areas contain walls that prevent movement of the particle.

3. Reacher: The Reacher environment in [Brockman et al., 2016]. The system has two degrees of freedom for the angles of the joints. The observation is the current rotations and angular velocities of the joints. The reward at each timestep corresponds to the negative distance from the current joint rotations to the target joint rotations.

4. Sawyer Arm: A simulation of the Sawyer Arm in MuJoCo [Todorov et al., 2012]. The system is second order and contains 7 degrees of freedom. The observation is the positions and velocities of each of the joints, as well as the current end-effector finger position. The reward at each timestep is the negative distance between the current and target end-effector finger positions. The target end-effector position is either fixed or randomized.

For each task, we compare our model's performance with a learned deterministic feedforward network (Action FF) that predicts the next state from the current state and action (with the same architecture as the EBM). We generate plans by sampling over states using MPPI, with scores calculated from the L2 distance between the final and goal states. On the Sawyer Arm task, we further compare our performance with a model-free baseline, PPO [Schulman et al., 2017], using the implementation provided in [Dhariwal et al., 2017]. We investigate differences in performance of models trained using two different methods, where the sources and availability of data are varied. In the case where data is available in advance, models are trained on 100,000 action-state transitions pre-generated from random sampling in each environment. In the case where only online data is available, models are trained on samples generated by interacting with the environment from the start state; the training algorithm is outlined in Algorithm 2, with replay buffers used for both models.

4.2.2 Online Model Learning

Figure 4-4 shows the performance of an EBM compared to Action FF on the Particle, Maze and Reacher tasks. First, we compare both methods given a large pre-generated dataset of random interactions; we find that the EBM performs slightly better than Action FF. However, when we compare both methods in the online setting, we find that the EBM performs significantly better than Action FF. For example, an EBM only experiences a drop of 15.24 in score when switched to the online setting, compared to the score drop of 844.56 experienced by an Action FF model in the online setting.

Figure 4-4: Performance on Particle, Maze and Reacher environments where dynamics models are either pretrained on random transitions or learned via online interaction with the environment. Action FF: Action Feed-Forward Network.

Data       | Model     | Particle | Maze    | Reacher
Pretrained | EBM       | -5.14    | -72.07  | -19.38
Pretrained | Action FF | -6.11    | -65.06  | -25.54
Online     | EBM       | -20.38   | -162.97 | -29.87
Online     | Action FF | -850.67  | -949.99 | -42.37

Figure 4-3: Navigation path with a central obstacle the model was not trained with.

Figure 4-5: Qualitative image showing an EBM successfully navigating the finger end effector to the goal position.

Table 4.1: Comparison of performance on the Sawyer Arm environment between Action FF and EBM. In the pretraining setting, we compare models trained using random transitions, directed transitions from an EBM, and directed transitions with correlated data. In the online setting, models are trained on 50,000 simulations of the environment. We find that EBMs perform well online.

Model     | Pretrained (random) | Pretrained (directed) | Pretrained (directed + sequential) | Online (Fixed) | Online (Variable)
EBM       | -9569               | -4438                 | -5114                              | -3782          | -3907
Action FF | -10326              | -5041                 | -12838                             | -9360          | -11942

Table 4.1 shows the performance of an EBM compared to Action FF in the Sawyer Arm scenario. We find that in this setting, using a large pre-generated dataset of random interactions led to insufficient state coverage. To mitigate this, we construct a directed dataset of 100,000 frames from an EBM trained on the Sawyer Arm task. With directed data, we find that a pretrained EBM performs slightly better than Action FF, obtaining scores of -4438 and -5041 respectively. In the online training scenario (with either fixed or varied goals), however, we find that the EBM performs significantly better (with a score of -3782) than Action FF (with a score of -9360). We show images of execution in Figure 4-5.

On this task, the model-free algorithm PPO obtains a performance of -9300 with the same amount of experience, and requires 250,000 to 500,000 frames (5 to 10 times more than used in online training of the models) to achieve performance comparable to online training of an EBM. With 50 to 100 times more experience, PPO is able to obtain a better score of -1000 (note that since PPO does not have the acceleration priors we do, it is allowed to reach the goal faster, thus producing a higher reward; our method moves slower, but both methods successfully reach the goal). Furthermore, PPO is not able to operate in an online manner and does not exhibit the zero-shot generalization of our model, both of which are important in real-world robot learning regimes.

Table 4.1 also considers another scenario in which goals are varied across the table. In this setting we find that EBMs still perform better, with a score of -4547, while Action FF obtains a score of -11942. Furthermore, we can apply a model trained on a fixed goal, generalize to variable goals, and still obtain a score of -4547.

We also ablate dependence on the ground truth inverse dynamics. Using recursive least squares [Mordatch et al., 2016] to infer inverse dynamics also leads to a good performance of -4694. We find that our learned state distribution is not significantly impacted by inverse dynamics inference, as long as action inference does not suffer from mode collapse (which can occur with neural network based approaches).

To ablate the effect of exploration and the ability of EBMs to learn models online, in Table 4.1 we train both Action FF and EBM models on the directed dataset, but with batches sampled sequentially from the dataset, with each datapoint repeated 100 times without shuffling to mimic the correlated experiences seen during online training. Under this setting, the performance of the EBM drops slightly, by 676, but the performance of Action FF drops catastrophically, by 7797, and the model fails to train. When training on the static dataset, we do not use a replay buffer of past transitions.

Our results indicate that EBMs are able to learn well online, which is an important and necessary characteristic for models to learn in the real world.

Figure 4-6: Effects of varying the number of planning steps to reach a goal state. As the number of planning steps increases, there is a larger envelope of explored states.

Figure 4-7: Illustrations of two different planned trajectories (Trajectory 1 and Trajectory 2) from start to goal in the Reacher environment.

4.2.3 Maximum Entropy Inference

While maximum entropy reinforcement learning has focused on maximizing the entropy of actions given a state, sampling from an EBM corresponds to directly maximizing entropy over the next state. In Figure 4-6, we find EBM sampling is capable of generating diverse plans that go from a given start state to a goal state. In Figure 4-6, we show that given a fixed start state and end goal, an increased number of planned steps leads to a larger envelope of possible trajectories. The same diagram also shows that our method is able to sample a wide range of trajectories that are different from each other. In Figure 4-7, we show that in the Reacher environment, we are able to make valid plans with both clockwise and counter-clockwise motions given a start and goal state. We illustrate the power of diverse plans by comparing the generalization performance between planning conditioned on only state space (EBM planning) and planning conditioned on both state and action space (action-conditional planning).

In the particle environment, at test time we add a large obstacle not seen during training as the particle attempts to navigate from start state to goal state, as shown in Figure 4-3; an EBM generalizes better, obtaining a reward of -61.94, while Action FF obtains a reward of -81.24.

4.2.4 Exploration


Figure 4-8: Illustration of energy values of states (computed by taking the energy of a transition centered at the location) and corresponding visitation maps. While an EBM learns a probabilistic model of transitions in areas already explored, energies of unexplored regions fluctuate throughout training, leading to a natural exploration incentive. Early on in training (left), the EBM puts low energy on the upper corner, incentivizing agent exploration towards the top corner. Later in training (right), the EBM puts low energy on the lower right corner, incentivizing agent exploration towards the bottom corner.

We show that EBMs naturally incentivize exploration. In Figure 4-9 we compare the exploration behavior of a goal-free EBM and a random action agent in the Maze environment. In the time it takes a random policy to explore a hallway of the maze, an EBM is able to explore the entirety of the maze. Similarly, we consider 3D occupancy of the finger end-effector in the Sawyer arm; we define 3D occupancy by partitioning space into 3D voxels and measuring the number of voxels that the finger ends up in. We empirically found the maximum system occupancy to be 116 voxels. We find in Figure 4-10 that EBMs reach maximum system occupancy significantly faster than random exploration across 4 different seeds. Without a goal, an EBM is able to navigate the arm freely, while a random policy struggles and takes over 100 times more environment transitions (i.e. over 200,000) to reach maximum occupancy.

Figure 4-9: Illustration of exploration in a maze under random actions (left) as opposed to following an EBM (middle). Areas in blue in the maze environment (right) are admissible, while areas in white are not.
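A minimal sketch of the 3D occupancy metric above, assuming an axis-aligned workspace bounding box and a uniform voxel grid (the resolution here is a placeholder; the partition used in the thesis yields a maximum occupancy of 116 voxels):

    import numpy as np

    def voxel_occupancy(positions, low, high, voxels_per_axis=10):
        # positions: (T, 3) array of finger end-effector positions.
        scaled = (positions - low) / (high - low)       # normalize into [0, 1)
        scaled = np.clip(scaled, 0.0, 1.0 - 1e-9)
        idx = np.floor(scaled * voxels_per_axis).astype(int)
        return len({tuple(i) for i in idx})             # distinct voxels visited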

We reason that the exploration behavior in EBMs comes from the fact that they learn local dynamics of the world only in the regions that have been explored. This allows the EBMs to assign arbitrary energies to transitions among unexplored states. Values of these energies vary over the course of training, and lead the EBMs to generate plans to reach different unseen states until more of the environment is explored. We illustrate this result in Figure 4-8, where we show that an EBM puts low energy on a swath of states that are unexplored but reachable, at two different stages of training, incentivizing exploration of those states while maintaining correct energies for states that have already been explored.

Figure 4-10: Comparison of 3D spatial occupancy of the finger end-effector of the Sawyer arm, using random exploration versus using an EBM without a goal, shown on a log scale across 4 different seeds. An EBM allows more directed exploration and explores more states. For the random policy to reach maximum occupancy, more than 200,000 transitions are required.

EBMs learn local dynamics models since they are trained on real data transitions and transitions from planning; a plan is followed until it deviates significantly from real transitions. Thus both sets of transitions consist of states that are locally close to states the EBM has already learned. In contrast, traditional likelihood models of trajectories lower the likelihood of all unseen trajectories, including at unseen

states; as a result, planning with such models is unable to explore as effectively.

Chapter 5

Energy Models for Protein Structure

In this chapter, we present a framework for learning energy models for protein structures. We show that optimization on the learned energy function can recover protein structure on the Rotamer Recovery benchmark to a degree comparable to the classical Rosetta energy function. We first provide background on proteins.

5.1 Background

Proteins are linear polymers composed of an alphabet of twenty canonical amino acids (residues), each of which shares a common backbone moiety responsible for formation of the linear polymeric backbone chain, and a differing side chain moiety with biochemical properties that vary from amino acid to amino acid. The energetic interplay of tight packing of side chains within the core of the protein and exposure of polar residues at the surface drives folding of proteins into stable molecular conformations [Richardson and Richardson, 1989, Dill, 1990]. The conformation of a protein can be described through two interchangeable coordinate systems. Each atom has a set of spatial coordinates, which up to an arbitrary rotation and translation of all coordinates describes a unique conformation. In the internal coordinate system, the conformation is described by a sequence of rigid- body motions from each atom to the next, structured as a kinematic tree. The major degrees of freedom in protein conformation are the dihedral rotations [Richardson and

Richardson, 1989]: the rotations about the backbone bonds, termed phi (휑) and psi (휓) angles, and the dihedral rotations about the side chain bonds, termed chi (휒) angles.

Figure 5-1: Overview of the model. The model takes as input a set of atoms, 퐴, consisting of the rotamer to be predicted and the surrounding context atoms. The Cartesian coordinates and categorical attributes of each atom are embedded. The set of embeddings is processed by Transformer blocks, and the final hidden representations are pooled over the atoms to produce a vector. The vector is passed through a two-layer MLP to output a scalar energy value, 푓휃(퐴).

Within folded proteins, the side chains of amino acids preferentially adopt configu- rations that are determined by their molecular structure. A relatively small number of configurations separated by high energetic barriers are accessible to each side chain [Janin et al., 1978]. These configurations are called rotamers. In Rosetta and other protein design methods, rotamers are commonly represented by libraries that estimate a probability distribution over side chain configurations, conditioned on the backbone 휑 and 휓 torsion angles. We use the Dunbrack library [Shapovalov and Dunbrack Jr, 2011] for rotamer configurations.

5.2 Method

Our energy model computes a scalar energy, 푓휃(퐴), for size-푘 subsets, 퐴, of atoms within a protein.

Selection of atom subsets In our experiments, we choose 퐴 to be nearest-neighbor sets around the residues of the protein and set 푘 = 64. For a given residue, we construct 퐴 to be the 푘 atoms that are nearest to the residue's beta carbon.

Atom input representations Each atom in 퐴 is described by its 3D Cartesian coordinates and categorical features: (i) the identity of the atom (N, C, O, S); (ii) an ordinal label of the atom in the side chain (i.e. which specific carbon, nitrogen, etc. atom it is in the side chain); and (iii) the amino acid type (which of the 20 types of amino acids the atom belongs to). The coordinates are normalized to have zero mean across the 푘 atoms. Each categorical feature is embedded into 28 dimensions, and the spatial coordinates are projected into 172 dimensions; these are then concatenated into a 256-dimensional atom representation. The parameters for the input embeddings and projections of spatial information are learned via training. During training, a random rotation is applied to the coordinates in order to encourage rotational invariance of the model. For visualizations, a fixed number (100) of random rotations is applied and the results are averaged.
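The following sketch illustrates this input representation in PyTorch; the module names and the placeholder vocabulary size for ordinal labels are assumptions. The dimensions follow the text: three 28-dimensional categorical embeddings plus a 172-dimensional coordinate projection concatenate to a 256-dimensional atom representation (a random rotation of the coordinates would be applied before this module during training):

    import torch
    import torch.nn as nn

    class AtomEmbedding(nn.Module):
        # Sketch of the atom input representation described above.
        def __init__(self, n_elements=4, n_labels=40, n_amino_acids=20):
            super().__init__()
            self.coord_proj = nn.Linear(3, 172)               # coords -> 172d
            self.element_emb = nn.Embedding(n_elements, 28)   # N, C, O, S
            self.label_emb = nn.Embedding(n_labels, 28)       # ordinal side chain label
            self.amino_emb = nn.Embedding(n_amino_acids, 28)  # 20 amino acid types

        def forward(self, coords, element, label, amino):
            # coords: (B, k, 3); categorical inputs: (B, k) integer tensors.
            coords = coords - coords.mean(dim=1, keepdim=True)  # zero mean over k atoms
            feats = [self.coord_proj(coords), self.element_emb(element),
                     self.label_emb(label), self.amino_emb(amino)]
            return torch.cat(feats, dim=-1)   # (B, k, 172 + 3*28) = (B, k, 256)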

Architecture In our proposed approach, 푓휃(퐴) takes the form of a Transformer model [Vaswani et al., 2017] that processes a set of atom representations. The self-attention layers allow each atom to attend to the representations of other atoms in the set, modeling the energy of the molecular configuration as a non-linear combination of single, pairwise, and higher-order interactions between the atoms. The final hidden representations of the Transformer are pooled across the atoms to produce a single vector, which is finally passed to a two-layer multilayer perceptron (MLP) that produces the scalar output of the model. Figure 5-1 illustrates the model.

For all experiments, we use a 6-layer Transformer with embedding dimension of 256 (split over 8 attention heads) and feed-forward dimension of 1024. The final MLP contains 256 hidden units. The models are trained without dropout. Layer normalization [Ba et al., 2016] is applied before the attention blocks.
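A sketch of this architecture using PyTorch's built-in Transformer encoder; the hyperparameters come from the text, while the pooling (max over atoms, as indicated in Figure 5-1) and other implementation details are assumptions:

    import torch
    import torch.nn as nn

    class AtomTransformerEnergy(nn.Module):
        # Sketch: 6 pre-layer-norm Transformer layers, 256d embeddings,
        # 8 heads, 1024 feed-forward units, no dropout.
        def __init__(self, dim=256, heads=8, ff=1024, layers=6):
            super().__init__()
            block = nn.TransformerEncoderLayer(
                d_model=dim, nhead=heads, dim_feedforward=ff,
                dropout=0.0, norm_first=True, batch_first=True)
            self.encoder = nn.TransformerEncoder(block, num_layers=layers)
            self.head = nn.Sequential(nn.Linear(dim, 256), nn.ReLU(),
                                      nn.Linear(256, 1))  # two-layer MLP -> scalar

        def forward(self, atom_feats):            # (B, k, 256) atom embeddings
            h = self.encoder(atom_feats)          # self-attention over the atom set
            pooled = h.max(dim=1).values          # pool hidden states over atoms
            return self.head(pooled).squeeze(-1)  # one energy value per atom set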

5.2.1 Parameterization of protein conformations

The structure of a protein can be represented by two parameterizations: (1) absolute Cartesian coordinates of the set of atoms, and (2) internal coordinates of the atoms encoded as a set of in-plane/out-of-plane rotations and displacements relative to each atom’s reference frame. Out-of-plane rotations are parameterized by 휒 angles which are the primary degrees of freedom in the rotamer configurations. The coordinate systems are interchangeable.

5.2.2 Usage as an energy function

We specify our energy function 퐸휃(푥, 푐) to take an input set composed of two parts: (1) the atoms belonging to a rotamer to be predicted, 푥, and (2) the atoms of the surrounding molecular context, 푐. The energy function is defined as follows:

$$E_\theta(x, c) = f_\theta(A(x, c)),$$

where 퐴(푥, 푐) is the set of embeddings of the 푘 atoms nearest to the rotamer's beta carbon.

5.2.3 Training and loss functions

In all experiments, the energy function is trained to learn the conditional distribution of the rotamer given its context by approximately maximizing the log likelihood of the data.

$$\mathcal{L}(\theta) = -E_\theta(x, c) - \log Z_\theta(c)$$

To estimate the partition function, we note that:

$$\log Z_\theta(c) = \log \int e^{-E_\theta(x,c)}\, dx = \log\left( \mathbb{E}_{q(x|c)}\!\left[ \frac{e^{-E_\theta(x,c)}}{q(x|c)} \right] \right)$$

for some importance sampler 푞(푥|푐). Furthermore, if we assume 푞(푥|푐) is uniformly distributed on supported configurations, we obtain a simplified maximum likelihood

objective given by

$$\mathcal{L}(\theta) = -E_\theta(x, c) - \log\left( \mathbb{E}_{q(x^i|c)}\!\left[ e^{-E_\theta(x^i, c)} \right] \right)$$

for some context dependent importance sampler 푞(푥|푐). We choose our sampler 푞(푥|푐) to be an empirically collected rotamer library [Shapovalov and Dunbrack Jr, 2011] conditioned on the amino acid identity and the backbone 휑 and 휓 angles. We write the importance sampler as a function of atomic coordinates, which are interchangeable with the angular coordinates in the rotamer library. The library consists of lists of means and standard deviations of possible 휒 angles for each 10 degree interval of both 휑 and 휓. Given continuous 휑 and 휓 values, we sample rotamers from this library by sampling from a weighted mixture of Gaussians of 휒 angles at each of the four surrounding bins, with weights given by distance to the bins via bilinear interpolation. Every candidate rotamer at each bin is assigned uniform probability. To ensure our context dependent importance sampler effectively samples high likelihood areas of the model, we further add the real rotamer as a sample from 푞(푥|푐).
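A sketch of the resulting objective, with the rotamer library sampling (the mixture of Gaussians with bilinearly interpolated weights) abstracted behind a sampler; the function and interface names here are illustrative assumptions:

    import torch

    def rotamer_nll_loss(energy_fn, rotamer, context, sampler, n_samples=64):
        # E(x, c) for the ground-truth rotamer in its context.
        e_real = energy_fn(rotamer, context)
        candidates = sampler(context, n_samples)      # draws from q(x | c)
        candidates.append(rotamer)                    # include the real rotamer in q
        e_cand = torch.stack([energy_fn(x, context) for x in candidates])
        # log E_q[exp(-E)] estimated with logsumexp for numerical stability.
        log_z = torch.logsumexp(-e_cand, dim=0) - torch.log(
            torch.tensor(float(len(candidates))))
        # Minimizing E(x, c) plus the log partition estimate maximizes the
        # approximate log likelihood -E(x, c) - log Z(c).
        return e_real + log_z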

5.2.4 Recovery of Rotamers

To recover rotamers given a candidate backbone of a protein, we sample from our context dependent importance sampler 푞(푥|푐) as part of the online optimization procedure on our energy model. We utilize several different online optimization procedures in order to compare fairly with the Rosetta energy function.

5.3 Evaluation

5.3.1 Datasets

We constructed a curated dataset of high-resolution PDB structures using the CullPDB database, with the following criteria: resolution finer than 1.8 Å; sequence identity less than 90%; and R value less than 0.25 as defined in Wang and R. L. Dunbrack

[2003]. To test the model on rotamer recovery, we use the test set of structures from Leaver-Fay et al. [2013]. To prevent training on structures that are similar to those in the test set, we ran BLAST on sequences derived from the PDB structures and removed all train structures with more than 25% sequence identity to sequences in the test dataset. Ultimately, our train dataset consisted of 12,473 structures and our test dataset consisted of 129 structures.

5.3.2 Baselines

We compare to three baseline neural network architectures: a fully-connected network, the set embedding architecture from the set2set paper [Vinyals et al., 2015], and a graph neural network [Veličković et al., 2017]. Results are also compared to Rosetta. We ran Rosetta using the score12 and ref2015 energy functions with the rotamer trials and rtmin protocols under default settings.

5.3.3 Evaluation

For the comparison of the model to Rosetta in Table 5.1, we reimplement the sampling scheme that Rosetta uses for rotamer trials evaluation. We take discrete samples from the rotamer library, with bilinear interpolation of the means and standard deviations using the four grid points surrounding the backbone 휑 and 휓 angles for the residue. We take discrete samples of the rotamers at 휇, except that for buried residues we

sample 휒1 and 휒2 at 휇 and 휇 ± 휎, as was done in Leaver-Fay et al. [2013]. We define

buried residues to have ≥24 퐶훽 neighbors within 10 Å of the residue's 퐶훽 (퐶훼 for glycine residues). For buried positions we accumulate rotamers up to 98% of the distribution, and for other positions the accumulation is to 95%. We score a rotamer as recovered correctly if all 휒 angles are within 20° of the ground-truth residue. We also use a continuous sampling scheme which approximates the empirical conditional distribution of the rotamers using a mixture of Gaussians with means and standard deviations computed by bilinear interpolation as above. Instead of sampling discretely, the component rotamers are sampled with the probabilities given by the library, and a sample is generated with the corresponding mean and standard deviation. This is the same sampling scheme used to train models, but with component rotamers now weighted by probability as opposed to uniform sampling.

Model                              Avg          Buried  Surface
Rosetta score12 (rotamer-trials)   72.2 (72.6)  -       -
Rosetta ref2015 (rotamer-trials)   73.6         -       -
Atom Transformer                   70.4         87.0    58.3
Atom Transformer (ensemble)        71.5         89.2    59.9

Table 5.1: Rotamer recovery of energy functions under the discrete rotamer sampling method detailed in Section 5.3.3. Parentheses denote the value reported by Leaver-Fay et al. [2013].

Model                        Avg          Buried  Surface
Fully-connected              39.1         54.4    30.0
Set2set                      43.2         60.3    31.7
GraphNet                     69.0         94.3    54.2
Atom Transformer             73.1         91.1    58.3
Atom Transformer (ensemble)  74.1         91.2    59.5
Rosetta score12 (rt-min)     75.4 (74.2)  -       -
Rosetta ref2015 (rt-min)     76.4         -       -

Table 5.2: Rotamer recovery of energy functions under continuous optimization schemes. Rosetta continuous optimization is performed with the rtmin protocol. Parentheses denote the value reported by Leaver-Fay et al. [2013].

5.3.4 Rotamer recovery results

Table 5.1 directly compares our EBM (which we refer to as the Atom Transformer) with two versions of the Rosetta energy function. We run Rosetta on the set of 152 proteins from the benchmark of Leaver-Fay et al. [2013]. We also include the published performance on the same test set from Leaver-Fay et al. [2013]. As discussed above, comparable sampling strategies are used to evaluate the models, enabling a fair comparison of the energy functions. We find that a single model evaluated on the benchmark performs slightly worse than both versions of the Rosetta energy function.

An ensemble of 10 models improves the results.

Table 5.2 evaluates the performance of the energy function under alternative sampling strategies with the goal of optimizing recovery rates. We report the performance of the Rosetta energy function using the rtmin protocol for continuous minimization. We evaluate the learned energy function with continuous sampling from a mixture of Gaussians conditioned on the 휑/휓 settings of the backbone angles, as detailed above. We find that with ensembling the model performance is close to that of the Rosetta energy functions. We also compare to three baselines for embedding sets with similar numbers of parameters to the Atom Transformer model and find that they have weaker performance.

Buried residues are more constrained in their configurations by tight packing of the side chains within the core of the protein. In comparison, surface residues are more free to vary. Therefore we also report performance separately on both categories. We find that the ensembled Atom Transformer has a 91.2% rotamer recovery rate for buried residues, compared to 59.5% for surface residues.

Table 5.3 reports recovery rates by residue comparing the Rosetta score12 results reported in Leaver-Fay et al.[2013] to the Atom Transformer model using the Rosetta discrete sampling method. The Atom Transformer model appears to perform well on smaller rotameric amino acids as well as polar amino acids such as glutamate/aspartate while Rosetta performs better on larger amino acids like phenylalanine and tryptophan and more common ones like leucine.

5.3.5 Visualizing Energies

In this section, we visualize and understand how the Atom Transformer models the energy of rotamers in their native contexts. We explore the response of the model to perturbations in the configuration of side chains away from their native state. We retrieve all protein structures in the test set and individually perturb rotameric 휒 angles across the unit circle, plotting results in Figures 5-2, 5-3, and 5-4.

Amino Acid         R     K     M     I     L     S     T     V
Atom Transformer   37.2  31.7  53.0  93.3  82.6  79.0  96.5  94.0
Rosetta score12    26.7  31.7  49.6  85.4  87.5  72.5  92.6  94.3

Amino Acid         N     D     Q     E     H     W     F     Y
Atom Transformer   67.4  76.0  40.8  49.8  65.5  83.5  80.3  77.6
Rosetta score12    56.8  60.4  30.7  33.6  55.0  85.0  85.4  82.9

Table 5.3: Comparison of rotamer recovery rates by amino acid between Rosetta and the ensembled energy-based model under discrete rotamer sampling. The model appears to perform well on the polar amino acids glutamine, serine, asparagine, and threonine, while Rosetta performs better on the larger amino acids phenylalanine, tyrosine, and tryptophan and the common amino acid leucine. The numbers reported for Rosetta are from Leaver-Fay et al. [2013].

Core/Surface Energies Figure 5-2 shows that steeper response to variations away from the native state is observed for residues in the core of the protein (having ≥24 contacting side chains) than for residues on the surface (≤16), consistent with the observation that buried side chains are tightly packed [Richardson and Richardson, 1989].

Rotameric Energies Figure 5-3 shows a relation between residue size and the depth of the energy well, with larger amino acids having steeper wells (more sensitive to perturbations). Furthermore, Figure 5-4 shows that the model learns the symmetries of amino acids. We find that responses to perturbations of the 휒2 angle for the residues Tyr, Asp, and Phe are symmetric about 휒2: a 180° periodicity is observed, in contrast to the non-symmetric residues.

Embeddings of Atom Sets Building on the observation of a relation between the depth of the residue and its response to perturbation from the native state, we ask whether core and surface residues are clustered within the representations of the model. To visualize the final hidden representation of the molecular contexts within a protein, we compute the final vector embedding for the 64 atom context around the carbon-훽 atom (or for glycine, the carbon-훼 atom) for each residue. We find that a projection of these representations by t-SNE [Maaten and Hinton, 2008] into 2 dimensions shows a clear clustering between representations of core residues and surface residues. A representative example is shown in Figure 5-5.

Figure 5-2: The energy function models distinct behavior between core and surface residues. Core residues are more sensitive to perturbations away from the native state in the 휒1 torsion angle. On average, residues closer to the core have a steeper energy well.

Figure 5-3: There is a relation between the residue size and the depth of the energy well, with larger amino acids (e.g. Trp, Phe, Thr, Lys) having steeper wells.

Saliency Map The 10-residue protease-binding loop in a chymotrypsin inhibitor from barley seeds is highly structured due to the presence of backbone-backbone and backbone-sidechain hydrogen bonds in the same residue [Das, 2011]. To visualize the dependence of the energy function on individual atoms, we compute the energy of the 64 atom context centered around the backbone carbonyl oxygen of residue 39 (isoleucine) in PDB: 2CI2 [McPhalen and James, 1987] and derive the gradients with respect to the input atoms. Figure 5-6 overlays the gradient magnitudes on the structure, indicating the model attends to both sidechain and backbone atoms, which participate in hydrogen bonds.

Figure 5-4: Note the periodicity for the amino acids Tyr, Asp, and Phe with terminal symmetry about 휒2.
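A sketch of this saliency computation (tensor layout and names are assumptions): the gradient magnitude of the scalar energy with respect to each atom's coordinates serves as a per-atom interaction strength:

    import torch

    def atom_saliency(energy_fn, coords, feats):
        # coords: (k, 3) atom coordinates; feats: fixed categorical features.
        coords = coords.clone().requires_grad_(True)
        energy = energy_fn(coords, feats)   # scalar energy of the atom context
        grad, = torch.autograd.grad(energy, coords)
        return grad.norm(dim=-1)            # (k,) gradient magnitude per atom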


Figure 5-5: Left: 3-dimensional representation of CcmG reducing oxidoreductase [PDB ID 1KNG; Edeling et al., 2002], a protein from the test set. Atoms are colored dark blue (buried), orange (exposed), or left uncolored (neither). Right: t-SNE [Maaten and Hinton, 2008] projection of the EBM hidden representation of the context centered on the alpha carbon atom of each residue. In the embedding space, buried and surface residues are distinguished.

Figure 5-6: The model's saliency map applied to the test protein, serine proteinase inhibitor (PDB ID: 2CI2; McPhalen and James [1987]). The 64 atom context is centered on the carbonyl oxygen of residue 39 (isoleucine). Atoms in the context are labeled red with color saturation proportional to gradient magnitude (interaction strength). Hydrogen bonds with the carbonyl oxygen are shown by dotted lines.

Chapter 6

Compositionality with Energy Based Models

In this chapter, we present a framework for combining independently trained energy models over different concepts to generate novel compositions of concepts through online optimization (Langevin dynamics). We first reintroduce energy models under this formulation, and show how operators for disjunction, conjunction, and negation can be realized. We further show how this enables novel applications to continual learning.

6.1 Method

6.1.1 Energy Based Models

EBMs represent data by learning an unnormalized probability distribution across the data. For each data point x, an energy function 퐸휃(x), parameterized by a neural network, outputs a scalar real energy such that

$$p_\theta(x) \propto e^{-E_\theta(x)}. \qquad (6.1)$$

Figure 6-1: Illustration of logical composition operators over energy functions 퐸1 and 퐸2 (drawn as level sets).

To train an EBM on a data distribution 푝퐷, we follow the methodology defined in [Du and Mordatch, 2019], where a Monte Carlo estimate (Equation 6.2) of maximum likelihood ℒ is minimized with the following gradient

$$\nabla_\theta \mathcal{L} = \mathbb{E}_{x^+ \sim p_D}\!\left[ E_\theta(x^+) \right] - \mathbb{E}_{x^- \sim p_\theta}\!\left[ E_\theta(x^-) \right]. \qquad (6.2)$$

To sample 푥− from 푝휃 for both training and generation, we use MCMC based on Langevin dynamics (online optimization). Samples are initialized from uniform random noise and are iteratively refined following Equation 6.3:

$$\tilde{\mathbf{x}}^k = \tilde{\mathbf{x}}^{k-1} - \frac{\lambda}{2} \nabla_{\mathbf{x}} E_\theta(\tilde{\mathbf{x}}^{k-1}) + \omega^k, \qquad \omega^k \sim \mathcal{N}(0, \lambda), \qquad (6.3)$$

where 푘 is the 푘th iteration step and 휆 is the step size. We refer to each iteration of Langevin dynamics as a negative sampling step. We note that this form of sampling allows us to use the gradient of the combined distribution to generate samples from

distributions composed of 푝휃 and the other distributions. We use this ability to generate from multiple different compositions of distributions.
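A minimal sketch of this negative sampling procedure; the step count and step size below are placeholders rather than the settings used in the thesis:

    import torch

    def langevin_sample(energy_fn, shape, steps=60, step_size=10.0):
        x = torch.rand(shape)                   # initialize from uniform noise
        for _ in range(steps):
            x.requires_grad_(True)
            grad = torch.autograd.grad(energy_fn(x).sum(), x)[0]
            noise = torch.randn_like(x) * (step_size ** 0.5)   # omega ~ N(0, lambda)
            # Equation 6.3: x_k = x_{k-1} - (lambda / 2) grad E(x_{k-1}) + omega
            x = (x - 0.5 * step_size * grad + noise).detach()
        return x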

6.1.2 Logical Operators through Online Optimization

We next present different ways that EBMs can compose through online optimization.

We consider a set of independently trained EBMs, 퐸(x|푐1), 퐸(x|푐2), . . . , 퐸(x|푐푛), which

are learned conditional distributions on underlying latent codes 푐푖. Latent codes we consider include position, size, color, gender, hair style, and age, which we also refer

to as concepts. We formulate a set of compositional operators on EBMs at test time using a series of symbolic operators illustrated in Figure 6-1.

Concept Conjunction In concept conjunction, given separate independent con- cepts (such as a particular gender, hair style, or facial expression), we wish to construct an output with the specified gender, hair style, and facial expression – the combination of each concept. Since the likelihood of an output given a set of specific concepts is equal to the product of the likelihood of each individual concept, we have Equation 6.4, which is also known as the product of experts [Hinton, 2002]:

$$p(x \mid c_1 \text{ and } c_2, \ldots, \text{and } c_i) = \prod_i p(x|c_i) \propto e^{-\sum_i E(x|c_i)}. \qquad (6.4)$$

We can thus apply Equation 6.3 to the distribution given by the sum of the energies of each concept, obtaining Equation 6.5, to sample from the joint concept space with 휔푘 ∼ 풩(0, 휆):

$$\tilde{\mathbf{x}}^k = \tilde{\mathbf{x}}^{k-1} - \frac{\lambda}{2} \nabla_{\mathbf{x}} \sum_i E_\theta(\tilde{\mathbf{x}}^{k-1} \mid c_i) + \omega^k. \qquad (6.5)$$

Concept Disjunction In concept disjunction, given separate concepts such as the colors red and blue, we wish to construct an output that is either red or blue. We wish to construct a new distribution that has probability mass when any chosen concept is true. A natural choice of such a distribution is the sum of the likelihood of each concept:

$$p(x \mid c_1 \text{ or } c_2, \ldots, \text{or } c_i) \propto \sum_i p(x|c_i)/Z(c_i), \qquad (6.6)$$

where 푍(푐푖) denotes the partition function for each concept. If we assume all partition

functions 푍(푐푖) to be equal, this simplifies to

$$\sum_i p(x|c_i) \propto \sum_i e^{-E(x|c_i)} = e^{\operatorname{logsumexp}_i(-E(x|c_i))}, \qquad (6.7)$$

where $\operatorname{logsumexp}(f_1, \ldots, f_N) = \log \sum_i \exp(f_i)$. We can thus apply Equation 6.3 to the distribution given by the negative smooth minimum of the energies of each concept, obtaining Equation 6.8, to sample from the disjunction concept space:

$$\tilde{\mathbf{x}}^k = \tilde{\mathbf{x}}^{k-1} - \frac{\lambda}{2} \nabla_{\mathbf{x}} \operatorname{logsumexp}_i(-E(x|c_i)) + \omega^k, \qquad (6.8)$$

where 휔푘 ∼ 풩(0, 휆). In our experiments, we empirically found the partition function 푍(푐푖) estimates to be similar across concepts (see Appendix), justifying the simplification of Equation 6.7.

Concept Negation In concept negation, we wish to generate an output that does not contain a given concept. Given the color red, we want an output of a different color, such as blue. Thus, we want to construct a distribution that places high likelihood on data outside the given concept. One choice is a distribution inversely proportional to the concept's likelihood. Importantly, negation must be defined with respect to another concept to be useful: the opposite of alive may be dead, but not inanimate. Negation without a data distribution is not integrable and leads to generation of chaotic textures which, while satisfying the absence of a concept, are not desirable. Thus, in our experiments with negation we combine it with another concept to ground the negation and obtain an integrable distribution:

$$p(x \mid \text{not}(c_1), c_2) \propto \frac{p(x|c_2)}{p(x|c_1)^\alpha} \propto e^{\alpha E(x|c_1) - E(x|c_2)}. \qquad (6.9)$$

We found the relative smoothing parameter 훼 to be a useful regularizer (setting 훼 = 0 in Equation 6.9 recovers the plain conditional 푝(푥|푐2)), and we use 훼 = 0.01 in our experiments. The above equation allows us to apply Langevin dynamics, obtaining Equation 6.10, to sample concept negations:

$$\tilde{\mathbf{x}}^k = \tilde{\mathbf{x}}^{k-1} - \frac{\lambda}{2} \nabla_{\mathbf{x}} \left( \alpha E(x|c_1) - E(x|c_2) \right) + \omega^k, \qquad (6.10)$$

where 휔푘 ∼ 풩(0, 휆).
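All three operators reduce to simple transformations of energy functions that can be handed to the same Langevin sampler. A sketch under the assumptions of the langevin_sample snippet above, with the conditioning latent folded into each energy function for brevity:

    import torch

    def conjunction(energies):
        # Equation 6.5: sum the energies (product of experts).
        return lambda x: sum(E(x) for E in energies)

    def disjunction(energies):
        # Equation 6.8: negative smooth minimum (logsumexp) of the energies.
        return lambda x: -torch.logsumexp(
            torch.stack([-E(x) for E in energies]), dim=0)

    def negation(e_not, e_ground, alpha=0.01):
        # Equation 6.10: negation grounded by a second concept.
        return lambda x: alpha * e_not(x) - e_ground(x)

    # e.g., sampling images that are young AND female (shape is illustrative):
    # x = langevin_sample(conjunction([E_young, E_female]), (16, 3, 128, 128))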


Figure 6-2: Combinations of different attributes on CelebA via concept conjunction. Each row adds an additional energy function. Images on the first row are only conditioned on young, while images on the last row are conditioned on young, female, smiling, and wavy hair.


Figure 6-3: Combinations of different attributes on MuJoCo via concept conjunction. Each row adds an additional energy function. Images on the first row are only conditioned on shape, while images on the last row are conditioned on shape, position, size, and color. The left half shows generations of a sphere and the right half a cylinder.

6.2 Experiments

We perform empirical studies to answer the following questions: (1) Can EBMs exhibit concept compositionality (such as concept negation, conjunction, and disjunction) in generating images? (2) Can we take advantage of concept combinations to learn new concepts in a continual manner? (3) Does explicit factor decomposition enable generalization to novel combinations of factors? (4) Can we perform concept inference across multiple inputs?

6.2.1 Setup

We perform experiments on 64x64 object scenes rendered in MuJoCo [Todorov et al., 2012] (MuJoCo Scenes) and the 128x128 CelebA dataset. For MuJoCo Scenes images, we generate a central object of shape either sphere, cylinder, or box of varying size and color at different positions, with some number of (specified) additional background objects. Images are generated with varying lighting and objects. We use the ImageNet32x32 and ImageNet128x128 architectures from [Du and Mordatch, 2019] with the Swish activation [Ramachandran et al., 2017] on the MuJoCo and CelebA datasets, respectively. Models are trained for up to 1 day on 1 GPU for MuJoCo and for 1 day on 8 GPUs for CelebA.

Model                                    Position Acc  Color Acc
Color                                    0.128         0.997
Position                                 0.984         0.201
Conjunction(Position, Color)             0.801         0.8125
Conjunction(Position, Negation(Color))   0.872         0.096
Conjunction(Negation(Position), Color)   0.033         0.971

Model                                    Position 1 Acc  Position 2 Acc
Position 1                               0.875           0.0
Position 2                               0.0             0.817
Disjunction(Position 1, Position 2)      0.432           0.413

Table 6.1: Quantitative evaluation of conjunction, disjunction, and negation generations on the MuJoCo Scenes dataset using an EBM. Each individual attribute (Color or Position) generation is an individual EBM. (Acc: accuracy)

6.2.2 Compositional Generation

Quantitative evaluation. We quantitatively evaluate the compositional operations of EBMs defined previously. To quantitatively evaluate generation, we use the MuJoCo Scenes dataset. We train a supervised classifier to predict position and color on the MuJoCo Scenes dataset, obtaining 99.3% accuracy for position and 99.9% for color on the test set. We also train separate conditional EBMs on the concepts of position and color. A positional generation is considered correct if the distance between the position predicted by the supervised classifier on the generated image and the position the generation was conditioned on is smaller than 0.4. A color generation is correct if the predicted color is the same as the color the generation was conditioned on.
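A sketch of this scoring rule, with the supervised classifier abstracted away (names illustrative):

    import numpy as np

    def position_correct(pred_xy, cond_xy, threshold=0.4):
        # Correct when the classifier's predicted position lies within 0.4
        # of the position the generation was conditioned on.
        return np.linalg.norm(np.asarray(pred_xy) - np.asarray(cond_xy)) < threshold

    def color_correct(pred_color, cond_color):
        # Correct when the predicted color label matches the conditioned color.
        return pred_color == cond_color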

Figure 6-4: Examples of recursive compositions of disjunction, conjunction, and negation on the CelebA dataset. Rows show generations conditioned on: smiling; male; smiling AND female; smiling AND NOT male; (smiling AND female) OR (NOT smiling AND male).

In Table 6.1, we quantitatively evaluate the quality of generated images given combinations of conjunction, disjunction, and negation on the color and position concepts. When using either the Color or the Position EBM, the respective accuracy is high. Conjunction(Position, Color) has high position and color accuracies, which demonstrates that an EBM can combine different concepts. Under Conjunction(Position, Negation(Color)), the color accuracy drops below that of the Color EBM, meaning that negating a concept reduces the likelihood of the concept. The same conclusion follows for Conjunction(Negation(Position), Color). To evaluate disjunction, we set Position 1 to be a random point in the bottom left corner of a grid and Position 2 to be a random point in the top right corner of the grid. Averages over 1000 generated images are reported in Table 6.1. The Position 1 and Position 2 EBMs each obtain high accuracy in predicting their own positions. The Disjunction(Position 1, Position 2) EBM generates images that are roughly evenly distributed between Position 1 and Position 2, indicating that disjunction combines concepts additively (generating images that are either concept A or concept B).

Qualitative evaluation. We further provide qualitative visualizations of the conjunction, disjunction, and negation operations on both the MuJoCo Scenes and CelebA datasets. Concept Conjunction: In Figure 6-2, we show that conjunctions of EBMs are able

to combine multiple independent concepts, such as age, gender, smile, and wavy hair, and produce more precise generations when combining more energy models of different concepts. Similarly, EBMs can combine the independent concepts of shape, position, size, and color to obtain more precise generations in Figure 6-3. We also show results of conjunction with other logical operators in Figure 6-4.

Concept Negation: In Figure 6-4, row 4 shows images that are opposite to the trained concept via the negation operation. Since the negation operation must be accompanied by another concept, we use "smiling" as the second concept. The images in row 4 show that the negation of male, combined with smiling, yields smiling females. This can further be combined with disjunction in row 5 to make either "non-smiling male" or "smiling female".

Concept Disjunction: The last row of Figure 6-4 shows EBMs can combine concepts additively (generate images that are concept A or concept B). By constructing sampling using logsumexp, EBMs can sample an image that is “not smiling male” or “smiling female”, where both “not smiling male” and “smiling female” are specified through the conjunction of energy models of the two different concepts.

Multiple object combination: We show that our composition operations can combine not only concepts or attributes of a single object, but can also compose at the object level. To verify this, we constructed a dataset with one green cube and a large number of background clutter objects (which are not green) in the scene. We train a conditional EBM (conditioned on position) on this dataset. In Figure 6-5, "cube 1" and "cube 2" show generated images conditioned on different positions. We perform the conjunction operation on the "cube 1" and "cube 2" EBMs and use the combined energy model to generate images (row 3). We find that adding the two conditional EBMs allows us to selectively generate two different cubes. Furthermore, such generation satisfies the constraints of the dataset: for example, when the two conditioned cubes are too close, the EBMs default to generating a single cube, as in the last image of row 3.

Figure 6-5: Multi-object compositionality with EBMs. An EBM is trained to generate a green cube of specified size and shape in a scene alongside other objects. At test time, we sample from the conjunction of two EBMs conditioned on different positions and sizes (cube 1 and cube 2) and generate cubes at both locations. Two cubes are merged into one if they are too close (last column).

Figure 6-6: Continual learning of concepts. A position EBM is first trained on one shape (cube) of one color (purple) at different positions. A shape EBM is then trained on different shapes of one fixed color (purple). Finally, a color EBM is trained on shapes of many colors. EBMs can continually learn to generate many shapes (cube, sphere) with different colors at different positions.

6.2.3 Continual Learning

We evaluate to what extent compositionality in EBMs enables continual learning of new concepts and their combination with previously learned concepts. If we create an EBM for a novel concept, can it be combined with previous EBMs that have never observed this concept in their training data? And can we continually repeat this process? To evaluate this, we use the following methodology on the MuJoCo dataset:

1. We first train a position EBM on a dataset of varying positions, but a fixed color and a fixed shape. In this experiment, we use the shape "cube" and the color "purple". The position EBM allows us to generate a purple cube at various positions (Figure 6-6, row 1).

2. Next, we train a shape EBM in combination with the position EBM to generate images of different shapes at different positions, without training the position EBM. As shown in Figure 6-6, row 2, after combining the position and shape EBMs, the spheres are placed in the same positions as the cubes in row 1, even though these sphere positions were never seen during training.

3. Finally, we train a color EBM in combination with both the position and shape EBMs to generate images of different shapes at different positions and colors. Again, we fix both the position and shape EBMs and only train the color model. In Figure 6-6, row 3, the objects with different colors have the same positions as row 1 and the same shapes as row 2, which shows that the EBM can continually learn different concepts and combine newly learned concepts with previously learned ones to generate new images.

Table 6.2: Quantitative evaluation of continual learning. A position EBM is first trained on "purple" "cubes" at different positions. A shape EBM is then trained on different "purple" shapes. Finally, a color EBM is trained on shapes of many colors. Earlier EBMs are fixed and combined with the new EBMs. We compare with a GAN model [Radford et al., 2015] trained on the same position, shape, and color datasets. EBMs are better at continually learning new concepts while remembering old ones. (Acc: accuracy)

Model                           Position Acc  Shape Acc  Color Acc
EBM (Position)                  0.901         -          -
EBM (Position + Shape)          0.813         0.743      -
EBM (Position + Shape + Color)  0.781         0.703      0.521
GAN (Position)                  0.941         -          -
GAN (Position + Shape)          0.111         0.977      -
GAN (Position + Shape + Color)  0.117         0.476      0.984

In Table 6.2, we quantitatively evaluate the continual learning ability of our EBM against a GAN [Radford et al., 2015]. Similar to the quantitative evaluation above, we train three classifiers for position, shape, and color respectively. For a fair comparison, the GAN model is also trained sequentially on the position, shape, and color datasets (with the corresponding position, shape, or color attribute available and other attributes set randomly, matching the training of the EBMs). The position accuracy of the EBM does not drop significantly when continually learning new concepts (shape and color), which shows that our EBM is able to extrapolate earlier learned concepts by combining them with newly learned concepts. In contrast, while the GAN model is able to learn the attributes of position, shape, and color given the corresponding datasets, we find that the accuracies of position and shape drop significantly after learning color. The poor performance on previously learned concepts upon learning a new concept shows that GANs cannot combine newly learned attributes with previous attributes.
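A sketch of this continual training recipe: earlier concept EBMs are frozen, and only the newly added EBM receives gradients, with negatives drawn from the joint (summed) energy. All names and the loop interface are assumptions:

    def train_new_concept(new_ebm, frozen_ebms, loader, sampler, optimizer):
        for ebm in frozen_ebms:                  # earlier concepts stay fixed
            for p in ebm.parameters():
                p.requires_grad_(False)
        joint = lambda x: new_ebm(x) + sum(E(x) for E in frozen_ebms)
        for x_real in loader:
            x_neg = sampler(joint, x_real.shape)     # e.g. langevin_sample above
            # Monte Carlo maximum likelihood gradient of Equation 6.2.
            loss = joint(x_real).mean() - joint(x_neg).mean()
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()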

Chapter 7

Conclusion

We present an approach that enables scalable training of EBMs so that online optimization on the learned models can be used to generate images, trajectories of plans, and protein structures. We show benefits of the online optimization procedure, enabling image in-painting, cross-mapping, and adaptive planning to new goals, as well as the ability to compose with other separately trained models.

7.1 Future Directions

This work presents an initial foray into applications of EBMs. We believe this is a fruitful area of research where additional directions of exploration include:

∙ Scaling EBMs to even higher resolution datasets through techniques such as improved MCMC sampling (e.g., Hamiltonian Monte Carlo).

∙ Planning with EBMs in more complex regimes, such as manipulation with a robotic hand or potentially tool use.

∙ Building more complex compositional operators, such as implication, from nesting of the defined operators for EBMs.

∙ Applying the compositionality of EBMs to tasks such as continual learning of visual concepts.

∙ Taking advantage of the iterative optimization procedure in EBMs to adaptively learn to reason on more complex datasets.

Bibliography

David H Ackley, Geoffrey E Hinton, and Terrence J Sejnowski. A learning algorithm for boltzmann machines. Cognit. Sci., 9(1):147–169, 1985. 19

Jimmy Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer Normalization. arXiv e-prints, art. arXiv:1607.06450, Jul 2016. 47

Shaojie Bai, J Zico Kolter, and Vladlen Koltun. Deep equilibrium models. In Advances in Neural Information Processing Systems, pages 688–699, 2019. 20

Anton Bakhtin, Sam Gross, Myle Ott, Yuntian Deng, Marc’Aurelio Ranzato, and Arthur Szlam. Real or fake? learning to discriminate machine from human generated text. arXiv preprint arXiv:1906.03351, 2019. 20

Sergey Bartunov, Jack W Rae, Simon Osindero, and Timothy P Lillicrap. Meta-learning deep energy-based memory models. arXiv preprint arXiv:1910.02720, 2019. 20

Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. Openai gym. arXiv:1606.01540, 2016. 37

Stephen G Brush. History of the lenz-ising model. Reviews of modern physics, 39(4):883, 1967. 19

M. Cimpoi, S. Maji, I. Kokkinos, S. Mohamed, and A. Vedaldi. Describing textures in the wild. In Proceedings of the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2014. 28

Bo Dai, Zhen Liu, Hanjun Dai, Niao He, Arthur Gretton, Le Song, and Dale Schuurmans. Exponential family estimation via adversarial dynamics embedding. arXiv preprint arXiv:1904.12083, 2019. 36

R. Das. Four small puzzles that Rosetta doesn't solve. PLoS ONE, 6(5):e20044, 2011. 54

Peter Dayan, Geoffrey E Hinton, Radford M Neal, and Richard S Zemel. The helmholtz machine. Neural Comput., 7(5):889–904, 1995. 19

Marc Deisenroth and Carl E Rasmussen. Pilco: A model-based and data-efficient approach to policy search. In Proceedings of the 28th International Conference on machine learning (ICML-11), pages 465–472, 2011. 36

Yuntian Deng, Anton Bakhtin, Myle Ott, Arthur Szlam, and Marc'Aurelio Ranzato. Residual energy-based models for text generation. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=B1l4SgHKDH. 20

Prafulla Dhariwal, Christopher Hesse, Oleg Klimov, Alex Nichol, Matthias Plappert, Alec Radford, John Schulman, Szymon Sidor, Yuhuai Wu, and Peter Zhokhov. Openai baselines. https://github.com/openai/baselines, 2017. 38

Ken A Dill. Dominant forces in protein folding. Biochemistry, 29(31):7133–7155, 1990. 45

Yilun Du and Igor Mordatch. Implicit generation and generalization in energy-based models. arXiv preprint arXiv:1903.08689, 2019. 20, 36, 58, 62

Yilun Du, Toru Lin, and Igor Mordatch. Model based planning with energy based models. CoRL, 2019. 20

Yilun Du, Shuang Li, and Igor Mordatch. Compositional visual generation and inference with energy based models, April 2020a. 20

Yilun Du, Joshua Meier, Jerry Ma, Rob Fergus, and Alexander Rives. Energy-based models for atomic-resolution protein conformations. In International Conference on Learning Representations, 2020b. URL https://openreview.net/forum?id=S1e_9xrFvS. 20

Vincent Dumoulin, Jonathon Shlens, and Manjunath Kudlur. A learned representation for artistic style. 25

Melissa A Edeling, Luke W Guddat, Renata A Fabianek, Linda Thöny-Meyer, and Jennifer L Martin. Structure of ccmg/dsbe at 1.14 å resolution: high-fidelity reducing activity in an indiscriminately oxidizing environment. Structure, 10(7):973–979, 2002. 12, 55

Tom Erez and Emanuel Todorov. Trajectory optimization for domains with contacts using inverse dynamics. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 4914–4919. IEEE, 2012. 36

Sebastian Farquhar and Yarin Gal. Towards robust evaluations of continual learning. arXiv preprint arXiv:1805.09733, 2018. 29

Chelsea Finn, Paul Christiano, Pieter Abbeel, and Sergey Levine. A connection between generative adversarial networks, inverse reinforcement learning, and energy-based models. In NIPS Workshop, 2016. 19

Yarin Gal, Rowan McAllister, and Carl Edward Rasmussen. Improving pilco with bayesian neural network dynamics models. 36

Will Grathwohl, Kuan-Chieh Wang, Jörn-Henrik Jacobsen, David Duvenaud, Mohammad Norouzi, and Kevin Swersky. Your classifier is secretly an energy based model and you should treat it like one. arXiv preprint arXiv:1912.03263, 2019. 20

Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron Courville. Improved training of wasserstein gans. In NIPS, 2017. 24

Tuomas Haarnoja, Haoran Tang, Pieter Abbeel, and Sergey Levine. Reinforcement learning with deep energy-based policies. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1352–1361. JMLR. org, 2017. 19, 36

Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off- policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290, 2018. 36

Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. In ICCV, 2017. 17

Dan Hendrycks and Kevin Gimpel. A baseline for detecting misclassified and out-of- distribution examples in neural networks. arXiv preprint arXiv:1610.02136, 2016. 28

Dan Hendrycks, Mantas Mazeika, and Thomas Dietterich. Deep anomaly detection with outlier exposure. arXiv preprint, 2018. 29

Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Advances in Neural Information Processing Systems, pages 6626–6637, 2017. 25

Geoffrey Hinton, Simon Osindero, Max Welling, and Yee-Whye Teh. Unsupervised discovery of nonlinear structure using contrastive backpropagation. Cognitive science, 30(4):725–731, 2006a. 19

Geoffrey E Hinton. Products of experts. International Conference on Artificial Neural Networks, 1999. 20

Geoffrey E Hinton. Training products of experts by minimizing contrastive divergence. Neural Comput., 14(8):1771–1800, 2002. 59

Geoffrey E Hinton. Training products of experts by minimizing contrastive divergence. Training, 14(8), 2006. 19, 20, 21

Geoffrey E Hinton, Simon Osindero, and Yee-Whye Teh. A fast learning algorithm for deep belief nets. Neural Comput., 18(7):1527–1554, 2006b. 36

Rein Houthooft, Xi Chen, Yan Duan, John Schulman, Filip De Turck, and Pieter Abbeel. Vime: Variational information maximizing exploration. In Advances in Neural Information Processing Systems, pages 1109–1117, 2016. 36

Yen-Chang Hsu, Yen-Cheng Liu, and Zsolt Kira. Re-evaluating continual learning scenarios: A categorization and case for strong baselines. arXiv preprint arXiv:1810.12488, 2018. 30

John Ingraham, Adam Riesselman, Chris Sander, and Debora Marks. Learning protein structure with a differentiable simulator. 20

Joel Janin, Shoshanna Wodak, Michael Levitt, and Bernard Maigret. Conformation of amino acid side-chains in proteins. Journal of molecular biology, 125(3):357–386, 1978. 46

Mrinal Kalakrishnan, Sachin Chitta, Evangelos Theodorou, Peter Pastor, and Stefan Schaal. Stomp: Stochastic trajectory optimization for motion planning. In 2011 IEEE international conference on robotics and automation, pages 4569–4574. IEEE, 2011. 32

Taesup Kim and Yoshua Bengio. Deep directed generative models with energy-based probability estimation. arXiv preprint arXiv:1606.03439, 2016. 19, 36

James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the national academy of sciences, 114(13):3521–3526, 2017. 29, 30

Andrew Leaver-Fay, Matthew J O’Meara, Mike Tyka, Ron Jacak, Yifan Song, Elizabeth H Kellogg, James Thompson, Ian W Davis, Roland A Pache, Sergey Lyskov, et al. Scientific benchmarks for guiding macromolecular energy function improvement. In Methods in enzymology, volume 523, pages 109–143. Elsevier, 2013. 15, 50, 51, 52, 53

Yann LeCun, Sumit Chopra, and Raia Hadsell. A tutorial on energy-based learning. 2006. 19, 20

Kwonjoon Lee, Weijian Xu, Fan Fan, and Zhuowen Tu. Wasserstein introspective neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3702–3711, 2018. 20

Sergey Levine. Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv:1805.00909, 2018. 33

Zhengqi Li and Noah Snavely. Learning intrinsic image decomposition from watching the world. In CVPR, 2018. 29, 30

Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. Journal of machine learning research, 9(Nov):2579–2605, 2008. 12, 53, 55

Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017. 9, 27

CA McPhalen and MNG James. Crystal and molecular structure of the serine proteinase inhibitor ci-2 from barley seeds. Biochemistry, 26(1):261–269, 1987. 12, 54, 55

Nikhil Mishra, Pieter Abbeel, and Igor Mordatch. Prediction and control with temporal segment models. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 2459–2468. JMLR.org, 2017. 36

Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957, 2018. 23, 24, 25

Andriy Mnih and Geoffrey Hinton. Learning nonlinear constraints with contrastive backpropagation. Citeseer, 2004. 19

Elliott W Montroll, Renfrey B Potts, and John C Ward. Correlations and spontaneous magnetization of the two-dimensional ising model. Journal of Mathematical Physics, 4(2): 308–322, 1963. 19

Igor Mordatch, Emanuel Todorov, and Zoran Popović. Discovery of complex behaviors through contact-invariant optimization. ACM Transactions on Graphics (TOG), 31(4):43, 2012. 36

Igor Mordatch, Nikhil Mishra, Clemens Eppner, and Pieter Abbeel. Combining model-based policy search with online model learning for control of physical humanoids. In 2016 IEEE International Conference on Robotics and Automation (ICRA), pages 242–248. IEEE, 2016. 36, 40

Anusha Nagabandi, Gregory Kahn, Ronald S Fearing, and Sergey Levine. Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning. In ICRA, 2018. 36

Eric Nalisnick, Akihiro Matsukawa, Yee Whye Teh, Dilan Gorur, and Balaji Lakshminarayanan. Do deep generative models know what they don't know? In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=H1xwNhCcYm. 28, 29

Radford M Neal. Mcmc using hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 2(11), 2011. 26

Augustus Odena, Christopher Olah, and Jonathon Shlens. Conditional image synthesis with auxiliary classifier gans. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 2642–2651. JMLR.org, 2017. 24

OpenAI. Openai five, 2018. 17

OpenAI. Learning dexterous in-hand manipulation. In arXiv preprint arXiv:1808.00177, 2018. 17

Georg Ostrovski, Will Dabney, and Rémi Munos. Autoregressive quantile networks for generative modeling. arXiv preprint arXiv:1806.05575, 2018. 9, 24

Pierre-Yves Oudeyer and Frederic Kaplan. What is intrinsic motivation? a typology of computational approaches. Frontiers in neurorobotics, 1:6, 2009. 36

Deepak Pathak, Pulkit Agrawal, Alexei A Efros, and Trevor Darrell. Curiosity-driven exploration by self-supervised prediction. In ICML, 2017. 36

Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015. 16, 66

Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In ICLR, 2016. 24

Prajit Ramachandran, Barret Zoph, and Quoc V Le. Searching for activation functions. arXiv preprint arXiv:1710.05941, 2017. 62

Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015. 17

Jane S Richardson and David C Richardson. Principles and patterns of protein conformation. In Prediction of protein structure and the principles of protein conformation, pages 1–98. Springer, 1989. 45, 53

Ruslan Salakhutdinov and Geoffrey E. Hinton. Deep boltzmann machines. In David A. Van Dyk and Max Welling, editors, AISTATS, volume 5 of JMLR Proceedings, pages 448–455. JMLR.org, 2009. URL http://www.jmlr.org/proceedings/papers/v5/salakhutdinov09a.html. 19, 20

Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. In NIPS, 2016. 25

Jürgen Schmidhuber. Formal theory of creativity, fun, and intrinsic motivation (1990–2010). IEEE Transactions on Autonomous Mental Development, 2(3):230–247, 2010. 36

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv:1707.06347, 2017. 38

Jonathan Schwarz, Jelena Luketina, Wojciech M Czarnecki, Agnieszka Grabska-Barwinska, Yee Whye Teh, Razvan Pascanu, and Raia Hadsell. Progress & compress: A scalable framework for continual learning. arXiv preprint arXiv:1805.06370, 2018. 29, 30

Maxim V Shapovalov and Roland L Dunbrack Jr. A smoothed backbone-dependent ro- tamer library for proteins derived from adaptive kernel density estimates and regressions. Structure, 19(6):844–858, 2011. 46, 49

David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis. Mastering the game of go with deep neural networks and tree search. Nat., 529(7587): 484–489, 2016. 17

Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In NIPS, 2014. 17

Haoran Tang, Rein Houthooft, Davis Foote, Adam Stooke, OpenAI Xi Chen, Yan Duan, John Schulman, Filip DeTurck, and Pieter Abbeel. # exploration: A study of count-based exploration for deep reinforcement learning. In Advances in neural information processing systems, pages 2753–2762, 2017. 36

Tijmen Tieleman. Training restricted boltzmann machines using approximations to the likelihood gradient. In Proceedings of the 25th international conference on Machine learning, pages 1064–1071. ACM, 2008. 20, 22

Emanuel Todorov, Tom Erez, and Yuval Tassa. Mujoco: A physics engine for model-based control. In IROS, pages 5026–5033. IEEE, 2012. 38, 61

Richard Turner. Cd notes. 2005. 22

Aaron Van Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural networks. In ICML, 2016. 24, 26

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NIPS, 2017. 47

Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. Graph attention networks. arXiv preprint arXiv:1710.10903, 2017. 50

Oriol Vinyals, Samy Bengio, and Manjunath Kudlur. Order matters: Sequence to sequence for sets. arXiv preprint arXiv:1511.06391, 2015. 50

G. Wang and Jr. R. L. Dunbrack. Pisces: a protein sequence culling server. Bioinformatics, 19:1589–1591, 2003. 49

Grady Williams, Andrew Aldrich, and Evangelos A Theodorou. Model predictive path integral control: From theory to parallel computation. Journal of Guidance, Control, and Dynamics, 40(2):344–357, 2017a. 31, 36

Grady Williams, Nolan Wagener, Brian Goldfain, Paul Drews, James M Rehg, Byron Boots, and Evangelos A Theodorou. Information theoretic mpc for model-based reinforcement learning. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pages 1714–1721. IEEE, 2017b. 31

Jianwen Xie, Wenze Hu, Song-Chun Zhu, and Ying Nian Wu. Learning sparse frame models for natural image patterns. International Journal of Computer Vision, 114(2-3):91–112, 2015. 19

Jianwen Xie, Yang Lu, Song-Chun Zhu, and Yingnian Wu. A theory of generative convnet. In International Conference on Machine Learning, pages 2635–2644, 2016. 19

Raymond A Yeh, Chen Chen, Teck-Yian Lim, Alexander G Schwing, Mark Hasegawa-Johnson, and Minh N Do. Semantic image inpainting with deep generative models. 26

Michael C Yip and David B Camarillo. Model-less feedback control of continuum manipulators in constrained environments. IEEE Transactions on Robotics, 30(4):880–889, 2014. 36

Friedemann Zenke, Ben Poole, and Surya Ganguli. Continual learning through synaptic intelligence. In Proceedings of the 34th International Conference on Machine Learning- Volume 70, pages 3987–3995. JMLR. org, 2017. 29, 30

Junbo Zhao, Michael Mathieu, and Yann LeCun. Energy-based generative adversarial network. arXiv preprint arXiv:1609.03126, 2016. 19

Song Chun Zhu, Ying Nian Wu, and David Mumford. Filters, random fields and maximum entropy (FRAME): towards a unified theory for texture modeling. IJCV, 27(2):107–126, 1998. doi: 10.1023/A:1007925832420. 19
