
LEARNING TO REPRESENT AND REASON UNDER LIMITED SUPERVISION

A DISSERTATION SUBMITTED TO THE DEPARTMENT OF COMPUTER SCIENCE AND THE COMMITTEE ON GRADUATE STUDIES OF STANFORD UNIVERSITY IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

Aditya Grover
August 2020

© 2020 by Aditya Grover. All Rights Reserved. Re-distributed by Stanford University under license with the author.

This work is licensed under a Creative Commons Attribution- Noncommercial 3.0 United States License. http://creativecommons.org/licenses/by-nc/3.0/us/

This dissertation is online at: http://purl.stanford.edu/jv138zg1058

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Stefano Ermon, Primary Adviser

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Moses Charikar

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Jure Leskovec

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Eric Horvitz

Approved for the Stanford University Committee on Graduate Studies. Stacey F. Bent, Vice Provost for Graduate Education

This signature page was generated electronically upon submission of this dissertation in electronic format. An original signed hard copy of the signature page is on file in University Archives.

Abstract

Natural agents, such as humans, excel at building representations of the world and using them to effectively draw inferences and make decisions. Critically, the development of such advanced reasoning capabilities can occur even with limited supervision. In stark contrast, the major successes of machine learning (ML)-based artificial agents are primarily in tasks that have access to large labelled datasets or simulators, such as object recognition and game playing. This dissertation focuses on probabilistic modeling frameworks that shrink this gap between natural and artificial agents and thus enable effective reasoning in supervision-constrained scenarios.

This dissertation comprises three parts. First, we formally lay the foundations for learning probabilistic generative models. The goal here is to simulate any available data, thus providing a natural learning objective even in settings with limited supervision. We discuss various trade-offs involved in learning and inference in high dimensions using these models, including the specific choice of learning objective, optimization procedure, and parametric model family. Building on these insights, we develop new algorithms to boost the performance of state-of-the-art models and mitigate their biases when trained on large uncurated and unlabelled datasets. Second, we extend these models to learn feature representations for relational data. Learning these representations is unsupervised, and we demonstrate their utility for classification and sequential decision making. Third, we present two real-world applications of these models for accelerating scientific discovery: (a) learning data-dependent priors for compressed sensing, and (b) design of experiments for optimizing charging in electric batteries. Together, these contributions enable ML systems to overcome the critical supervision bottleneck for high-dimensional inference and decision-making problems in the real world.

Acknowledgments

First and foremost, I would like to thank my PhD advisor, Stefano Ermon. As one of Stefano's first students, I have been fortunate to cherish a long list of experiences with him. My rollercoaster PhD, involving diverse research projects and teaching a new course to hundreds of students among many other endeavors, would not have been possible without Stefano's inspiring work ethic, patient advising, and trademark clarity of thought. In all these years, it amazes me how every research meeting I have with Stefano ends on a personal note of optimism and intellectual satisfaction. Thank you for sharing your magic with me; I will treasure it forever.

My experience at Stanford would be incomplete without the company I enjoyed in the Ermon group. Neal instantaneously became a lifelong trusted friend (and personal gym trainer); Jonathan and Tri exemplify humility (and solving convex optimization problems); Shengjia, Tony, and Yang never fail to bring a smile (and write great papers for every deadline); Kristy and Rui double up as my creative alter egos (and Mixer buddies). Some who left early on continue to be great friends: Jon, Michael, Russell, Tudor, Vlad; and the newer ones continue to make the group as vibrant as ever: Andy, Burak, Chris, Kuno, Lantao, Ria. Outside of the Ermon group, I will also dearly miss hanging out with the many friends at the StatsML Group, InfoLab, Stanford Data Science Institute, and Stanford AI Lab.

I have also had the privilege of standing on the shoulders of amazing mentors and collaborators over the years: Ankit Anand, Mausam, and Parag Singla tricked me into the joys of research when I was an undergraduate at IIT Delhi; Christopher Ré, Greg Valiant, Jure Leskovec, Moses Charikar, Noah Goodman, Percy Liang, and Stephen Boyd were generous with their insightful feedback and time during

rotations and oral exams at Stanford; Alekh Agarwal, Ashish Kapoor, Ben Poole, Dustin Tran, Eric Horvitz, Harri Edwards, Ken Tran, Kevin Murphy, Maruan Al-Shedivat, and Yura Burda broadened my research perspectives in fun summer internships at Google, Microsoft, and OpenAI. I have also enjoyed mentoring junior students who helped me pursue new directions in research: Aaron Zweig, Chris Chute, Eric Wang, Manik Dhar, and Todor Markov. A shoutout to Will Chueh and Peter Attia, my co-leads on the four-year-long battery project: Will's investment in my research and career success, coupled with Peter's positivity, perseverance, and hard work, makes for a dream collaboration. I am also deeply thankful to all my other collaborators, administrators, and funding agencies for keeping the show running.

It is rightly said that friends are the family you choose. Thanks to my friends from India who have stayed in touch even after I moved to the US. And I would be remiss not to acknowledge friends from Stanford who I have interacted with regularly in the last five years: Jayesh for his excellent taste in spices and Netflix shows and for being the most helpful roommate I could have hoped for; Hima & Pracheer for being my no-filter friends on any topic of conversation from academia to Bollywood; Daniel for all the laughs, vents, and immigrant walks at Bytes and Coupa; and Aditi, Anmol & Vivek for allowing me to be myself without judgement. Thanks to many other friends and extended family who have been an integral part of this journey; I cannot overstate their importance in lifting me during the lows, cheering for my highs, and being a part of memories that will last a lifetime. Finally, I would like to thank my brother, Abhinav, for being my best friend since childhood and my sister-in-law, Vaishali, for being a pillar of support to our family. Last and definitely the most, I will forever be indebted to my parents, Manila & Vimal, for their unconditional love and unfaltering care for my happiness and well-being. This one is for you.

To my parents, Manila and Vimal Grover.

Contents

Abstract

Acknowledgments

1 Introduction
  1.1 Approach
      1.1.1 Foundations of Probabilistic Generative Modeling
      1.1.2 Relational Representation and Reasoning
      1.1.3 Applications in Science and Society
  1.2 Related Research
  1.3 Dissertation Overview

I Foundations of Probabilistic Generative Modeling

2 Background
  2.1 Problem Setup
  2.2 Learning & Inference
  2.3 Deep Generative Models
      2.3.1 Energy-based Models
      2.3.2 Autoregressive Models
      2.3.3 Variational Autoencoders
      2.3.4 Normalizing Flows
      2.3.5 Generative Adversarial Networks

3 Trade-offs in Learning & Inference
  3.1 Introduction
  3.2 Flow Generative Adversarial Networks
  3.3 Empirical Evaluation
      3.3.1 Log-likelihoods vs. sample quality
      3.3.2 Comparison with Gaussian mixture models
      3.3.3 Hybrid learning of Flow-GANs
  3.4 Explaining log-likelihood trends
  3.5 Discussion & Related Work
  3.6 Conclusion

4 Model Bias in Generative Models
  4.1 Introduction
  4.2 Preliminaries
  4.3 Likelihood-Free Importance Weighting
  4.4 Boosted Energy-Based Model
  4.5 Empirical Evaluation
      4.5.1 Goodness-of-fit testing
      4.5.2 Data augmentation for multi-class classification
      4.5.3 Model-based off-policy policy evaluation
  4.6 Discussion & Related Work
  4.7 Conclusion

5 Dataset Bias in Generative Models
  5.1 Introduction
  5.2 Preliminaries
  5.3 Learning with Multiple Data Sources
  5.4 Empirical Evaluation
      5.4.1 Density ratio estimation
      5.4.2 Dataset bias mitigation
  5.5 Discussion & Related Work
  5.6 Conclusion

II Relational Representation & Reasoning

6 Representation & Reasoning in Graphs
  6.1 Introduction
  6.2 Preliminaries
      6.2.1 Weisfeiler-Lehman
      6.2.2 Graph neural networks
  6.3 Graphite: Generative Modeling for Graphs
  6.4 Empirical Evaluation
      6.4.1 Reconstruction & density estimation
      6.4.2 Link prediction
      6.4.3 Semi-supervised node classification
  6.5 Discussion & Related Work
  6.6 Conclusion

7 Representation & Reasoning in Multiagent Systems
  7.1 Introduction
  7.2 Preliminaries
  7.3 Learning Policy Representations
      7.3.1 Generative representation learning via imitation
      7.3.2 Contrastive representation learning via triplet loss
      7.3.3 Hybrid generative-contrastive representations
  7.4 Generalization in Multiagent Systems
      7.4.1 Generalization across agents & interactions
      7.4.2 Generalization across tasks
  7.5 Empirical Evaluation
      7.5.1 The RoboSumo environment
      7.5.2 The ParticleWorld environment
  7.6 Discussion & Related Work
  7.7 Conclusion

III Applications in Science & Society

8 Compressed Sensing via Generative Models
  8.1 Introduction
  8.2 Preliminaries
  8.3 Uncertainty Autoencoders
      8.3.1 Implicit generative modeling
      8.3.2 Comparison with Principal Component Analysis
  8.4 Empirical Evaluation
      8.4.1 Statistical compressed sensing
      8.4.2 Transfer compressed sensing
      8.4.3 Dimensionality reduction
  8.5 Discussion & Related Work
  8.6 Conclusion

9 Multi-Fidelity Optimal Experimental Design: A Case Study in Battery Fast Charging
  9.1 Introduction
  9.2 Preliminaries
  9.3 Closed-Loop Optimization With Early Prediction
  9.4 Empirical Evaluation
  9.5 Discussion & Related Work
  9.6 Conclusion

10 Conclusions
  10.1 Summary of Contributions
  10.2 Future Work

A Additional Proofs

B Additional Experimental Details & Results

C Code & Data

List of Tables

3.1 Best MODE scores and test negative log-likelihood estimates for Flow-GAN models on MNIST.
3.2 Best Inception scores and test negative log-likelihood estimates for Flow-GAN models on CIFAR-10.

4.1 Goodness-of-fit evaluation on the CIFAR-10 dataset for PixelCNN++ and SNGAN. Standard errors computed over 10 runs. Higher IS is better. Lower FID and KID scores are better.
4.2 Classification accuracy on the Omniglot dataset. Standard errors computed over 5 runs.
4.3 Off-policy policy evaluation on MuJoCo tasks. Standard error is over 10 Monte Carlo estimates where each estimate contains 100 randomly sampled trajectories.

5.1 Comparison between the cross-entropy loss of the Bayes classifier and the learned density ratio classifier.

6.1 Mean reconstruction errors and negative log-likelihood estimates (in nats) for autoencoders and variational autoencoders respectively on test instances from six different generative families. Lower is better.
6.2 Citation network statistics.
6.3 Area Under the ROC Curve (AUC) for link prediction (* denotes dataset with features). Higher is better.
6.4 Average Precision (AP) scores for link prediction (* denotes dataset with features). Higher is better.
6.5 Classification accuracies (* denotes dataset with features). Baseline numbers from [KW17].

7.1 Intra-inter clustering ratios (IICR) and accuracies for outcome prediction (Acc) for weak (W) and strong (S) generalization on RoboSumo.
7.2 Intra-inter clustering ratios (IICR) for weak (W) and strong (S) generalization on ParticleWorld. Lower is better.
7.3 Average train and test rewards for speaker policies on ParticleWorld.

8.1 PCA vs. UAE. Average test classification accuracy for the MNIST dataset.

B.1 Bias-variance analysis for PixelCNN++ and SNGAN when T = 10,000. Standard errors over the absolute values of bias and variance evaluations are computed over the 2048 activation statistics. Lower absolute values of bias, lower variance, and lower MSE are better.
B.2 Bias-variance analysis for PixelCNN++ and SNGAN when T = 5,000. Standard errors over the absolute values of bias and variance evaluations are computed over the 2048 activation statistics. Lower absolute values of bias, lower variance, and lower MSE are better.
B.3 Statistics for the environments.
B.4 Off-policy policy evaluation on MuJoCo tasks. Standard error is over 10 Monte Carlo estimates where each estimate contains 100 randomly sampled trajectories. Here, we perform stepwise LFIW over transition triplets.
B.5 ResNet-18 architecture adapted for the attribute classifier.
B.6 Architecture for the generator and discriminator. Notation: ch refers to the channel width multiplier, which is 64 for 64 × 64 CelebA images. ResBlock up refers to a Generator Residual Block in which the input is passed through a ReLU activation followed by two 3 × 3 convolutional layers with a ReLU activation in between. ResBlock down refers to a Discriminator Residual Block in which the input is passed through two 3 × 3 convolutional layers with a ReLU activation in between, and then downsampled. Upsampling is performed via nearest neighbor interpolation, whereas downsampling is performed via mean pooling. “ResBlock up/down n → m” indicates a ResBlock with n input channels and m output channels.
B.7 AUC scores for link prediction with Monte Carlo subsampling during training. Higher is better.
B.8 Frobenius norms of the encoding matrices for MNIST and Omniglot.

List of Figures

2.1 Learning a generative model. The true data distribution is a mixture of 2 Gaussians (blue). We only observe a training dataset of samples from this distribution (shown as ×). The model family M is unimodal Gaussian distributions and the parameters θ correspond to the mean and variance. We learn the best-fit parameters θ∗ (green) by minimizing the Kullback-Leibler (KL) divergence between p̂data and pθ. Post learning, we can use the generative model pθ∗ to generate more data via sampling.

3.1 Samples generated by Flow-GAN models with different objectives for MNIST (top) and CIFAR-10 (bottom).
3.2 Learning curves for negative log-likelihood (NLL) evaluation on MNIST (top, in nats) and CIFAR (bottom, in bits/dim). Lower NLLs are better.
3.3 Gaussian mixture models outperform adversarially learned models for any bandwidth in the green shaded region on both held-out log-likelihoods and sampling metrics.
3.4 Singular value analysis for the Jacobian of the generator functions. The Jacobian is evaluated at 64 random noise vectors sampled from the prior. The x-axis shows the singular value magnitudes (on a log scale) and, for each singular value s, the y-axis shows the corresponding cumulative distribution function (CDF) value, which signifies the fraction of singular values less than s.

4.1 Importance weight estimation using probabilistic classifiers. (a) A univariate Gaussian (blue) is fit to samples from a mixture of two Gaussians (red). (b-d) Estimated class probabilities (with 95% confidence intervals based on 1000 bootstraps) for varying numbers of points n, where n is the number of training data points used for training the generative model and multilayer perceptron.
4.2 Qualitative evaluation of importance weighting for data augmentation. (a-f) Within each figure, the top row shows held-out data samples from a specific class in Omniglot. The bottom row shows generated samples from the same class, ranked in decreasing order of importance weights.
4.3 Estimation error δ(v) = v(πe) − v̂H(πe) for different values of H (min 0, max 100). Shaded area denotes standard error over 10 random seeds.

5.1 Samples from a baseline BigGAN that reflect the gender bias underlying the true data distribution in CelebA. All faces above the orange line (67%) are classified as female, while the rest are labeled as male (33%).
5.2 Distribution of importance weights for different latent subgroups. On average, the underrepresented subgroups are upweighted while the overrepresented subgroups are downweighted.
5.3 Single-attribute dataset bias mitigation for bias=0.9. Standard error in (b) and (c) over 10 independent evaluation sets of 10,000 samples each drawn from the models. Lower fairness discrepancy and FID is better. We find that on average, imp-weight outperforms the equi-weight baseline by 49.3% and the conditional baseline by 25.0% across all reference dataset sizes for bias mitigation.
5.4 Single-attribute dataset bias mitigation for bias=0.8. Standard error in (b) and (c) over 10 independent evaluation sets of 10,000 samples each drawn from the models. Lower fairness discrepancy and FID is better. We find that on average, imp-weight outperforms the equi-weight baseline by 23.9% and the conditional baseline by 12.2% across all reference dataset sizes for bias mitigation.
5.5 Multi-attribute dataset bias mitigation. Standard error in (b) and (c) over 10 independent evaluation sets of 10,000 samples each drawn from the models. Lower fairness discrepancy and FID is better. We find that on average, imp-weight outperforms the equi-weight baseline by 32.5% and the conditional baseline by 4.4% across all reference dataset sizes for bias mitigation.

6.1 Latent variable model for Graphite. Observed evidence variables in gray.
6.2 t-SNE embeddings of the latent feature vectors for the Cora dataset. Colors denote labels.

7.1 Agent-interaction graph. An example of a graph used for evaluating generalization in a multiagent system with 5 train agents (A, B, E, F, G) and 2 test agents (C, D).
7.2 Illustrations for the environments used in our experiments: (a) competitive; and (b) cooperative.
7.3 Illustration of the proposed model for optimizing a policy πψ that conditions on an embedding of the opponent policy πA. At time t, the pre-trained representation function fφ computes the opponent embedding based on a past interaction e^{t−1}. We optimize πψ to maximize the expected rewards in its current interactions e^t with the opponent.
7.4 Embeddings learned using Emb-Hyb for 10 test interaction episodes of 5 agents, projected on the first three principal components for RoboSumo and ParticleWorld. Color denotes agent policy.
7.5 Average win rates of the newly trained agents against 5 training agents and 5 testing agents. (a) Top: Baseline comparison with policies that make use of Emb-Im, Emb-Id, and Emb-Hyb (all computed online). (b) Bottom: Comparison of different baseline embeddings used at evaluation time (all embedding-conditioned policies use Emb-Hyb). At each iteration, win rates were computed based on 50 1-on-1 games. Each agent was trained 3 times, each time from a different random initialization. Shaded regions correspond to 95% CI.
7.6 Win, loss, and draw rates plotted for the first agent in each pair. Each pair of agents was evaluated after each training iteration on 50 1-on-1 games; curves are based on 5 evaluation runs. Shaded regions correspond to 95% CI.
7.7 Win rates for agents specified in each row, computed at iteration 1000.

8.1 Dimensionality reduction using PCA vs. UAE. Projections of the data (black points) on the UAE direction (green line) maximize the likelihood of decoding, unlike the PCA projection axis (magenta line), which collapses many points in a narrow region.
8.2 (a-b) Test ℓ2 reconstruction error (per image) for compressed sensing. (c-d) Reconstructions for m = 25. First: Original. Second: LASSO. Third: CS-VAE. Last: UAE. 25 projections of the data are sufficient for UAE to reconstruct the original image with high accuracy.
8.3 (a) Test ℓ2 reconstruction error (per image) for compressed sensing on CelebA. (b) Reconstructions for m = 50 on the CelebA dataset. First: Target. Second: LASSO-DCT. Third: LASSO-Wavelet. Fourth: DCGAN. Last: UAE.
8.4 (a-b) Test ℓ2 reconstruction error (per image) for transfer compressed sensing. (c-d) Reconstructions for m = 25. Top: Target. Second: CS-VAE. Third: UAE-SD. Last: UAE-SE.

9.1 Schematic of our closed-loop optimization (CLO) system. First, batteries are tested. The cycling data from the first 100 cycles, specifically electrochemical measurements such as voltage and capacity, are used as input for early outcome prediction of cycle life. These cycle life predictions are subsequently sent to a Bayesian optimization (BO) algorithm, which recommends the next protocols to test by balancing the competing demands of exploration (testing protocols with high uncertainty in estimated cycle life) and exploitation (testing protocols with high estimated cycle life). This process iterates until the testing budget is exhausted.
9.2 Structure of our 10-minute, six-step fast-charging protocols. (a) Current vs. SOC for an example charging protocol, 7.0C–4.8C–5.2C–3.45C (bold lines). Each charging protocol is defined by five constant-current (“CC”) steps followed by one constant-voltage (“CV”) step. The last two steps (CC5 and CV1) are identical for all charging protocols. We optimize over the first four constant-current steps, denoted CC1, CC2, CC3, and CC4. Each of these steps comprises a 20% SOC window, such that CC1 ranges from 0–20% SOC, CC2 ranges from 20–40% SOC, etc. CC4 is constrained by specifying that all protocols charge in the same total time (10 minutes) from 0% to 80% SOC. Thus, our parameter space consists of unique combinations of the three free parameters CC1, CC2, and CC3. For each step, we specify a range of acceptable values; the upper limit is monotonically decreasing with increasing SOC to avoid the upper cutoff potential (3.6 V for all steps). (b) CC4 (color) as a function of CC1, CC2, and CC3 (x, y, and z axes, respectively). Each point represents a unique charging protocol.
9.3 Early cycle life predictions per round. The tested charging protocols and the resulting predictions are plotted for rounds 1–4. Each point represents a charging protocol, defined by CC1, CC2, and CC3 (x, y, and z axes, respectively). The color represents cycle life predictions from the early outcome prediction model. The charging protocols in the first round of testing are randomly selected. As BO shifts from exploration to exploitation, the charging protocols selected for testing by the closed loop in subsequent rounds are primarily in the high-performing region.
9.4 Evolution of the parameter space per round. The color represents cycle life as estimated by BO. The initial cycle life estimates are equivalent for all protocols; as more predictions are generated, BO updates its cycle life estimates.
9.5 (a) Distribution of the number of repetitions for each charging protocol. Only 46 of 224 protocols (21%) are tested multiple times. (b) Current vs. SOC for the top three fast-charging protocols as estimated by CLO. CC1–CC4 are displayed in the legend. All three protocols have relatively uniform charging (i.e., CC1 ≈ CC2 ≈ CC3 ≈ CC4).
9.6 Discharge capacity vs. cycle number for all batteries in the validation experiment. See Figure 9.7 for the color scheme legend.
9.7 (a) Comparison of early-predicted cycle lives from validation to closed-loop estimates, averaged on a protocol basis. Each 10-minute charging protocol is tested with five batteries. Error bars represent the 95% confidence intervals. (b) Observed vs. early-predicted cycle life for the validation experiment. While our early predictor appears biased, likely due to calendar aging effects, the trend is correctly captured (Pearson correlation coefficient r = 0.86). (c) Final cycle lives from validation, sorted by CLO ranking. The length of each bar and the annotations represent the mean final cycle life from validation per protocol. Error bars represent the 95% confidence intervals.

B.1 Calibration of classifiers for density ratio estimation.
B.2 Environments in OPE experiments.
B.3 Estimation error δ(v) = v(πe) − v̂H(πe) for different values of H (minimum 0, maximum 100). Shaded area denotes standard error over different random seeds; each seed uses 100 sampled trajectories. Here, we perform stepwise LFIW over transition triplets.
B.4 Results from the Shapes3D dataset. After restricting the possible floor colors to red or blue and using a biased dataset with bias=0.9, we find that the samples obtained after importance reweighting (b) are considerably more balanced than those without reweighting (a), as desired.
B.5 AUC score of VGAE and Graphite with subsampled edges on the Cora dataset.
B.6 (a) An example clique agent-interaction graph with 10 agents. (b) An example bipartite agent-interaction graph with 5 speakers and 5 listeners.

1 Introduction

Imagination is the living power and prime agent of all human perception. – Samuel Taylor Coleridge

The ability to effectively combine data and computation to make inferences and decisions is a long-standing goal of artificial intelligence (AI). In the last few decades, there has been an explosion in the scale of datasets and computational resources available within several disciplines, ranging from the natural to the social sciences [RD15]. Yet, there exists a fundamental gap between the capabilities of current AI systems and those required for complex reasoning in science and engineering. Most of the successful data-driven deployments of AI in the real world are restricted to applications of machine learning (ML) where we have access to large labelled datasets or environment simulators, such as object recognition [He+16] and game playing [Sil+17]. This is hardly a practical assumption for many scientific domains, such as the physical and biological sciences, where the collection of labels can be prohibitively expensive, incurring costs due to time, money, privacy, and safety [Zho18].

At the same time, there is compelling evidence from natural agents in the real world, such as humans, suggesting that complex reasoning tasks such as probabilistic inference and long-horizon planning do not require copious amounts of labelled supervision [Bar89; BK96; Gop10]. Real-world data modalities, including the ones that arise in scientific applications, are typically high-dimensional, e.g., images, videos, and graphs. And even with limited access to supervision, natural agents can learn to reason over such data by learning meaningful spatiotemporal representations. For example, instead of monitoring individual pixels in the frames of a video, we can learn to represent physical objects by clustering similarly colored pixels in close proximity and reasoning about their dynamics over successive frames collectively. How do we endow artificial agents with similar representation and reasoning capabilities in high dimensions given only limited access to labelled supervision?

1.1 Approach

This dissertation explores the above question in detail through the lens of probabilistic generative modeling. To motivate the generative modeling paradigm, consider the ImageNet dataset of natural images [Den+09]. What is the process by which any image in the dataset was created? Intuitively, one can think of multiple factors at play during image creation. Some of these factors are specific to imaging technology, such as lighting, camera resolution, and location, whereas other factors concern exogenous physical and biological phenomena, such as the shapes, colors, and sizes of plants, animals, and inanimate objects. While a complete characterization of the factors of variation during image creation seems an arduous task for large and diverse datasets, it can offer us useful insights about the structure of the world.

The goal of a probabilistic generative model is to automate the above task by learning to generate a given dataset. As we noted in the thought experiment above, one can hope that a model that succeeds at this task would learn to extract insightful structure from large datasets. As such, this goal is not tied to the availability (or unavailability) of labelled supervision, and hence its use does not preclude any of the traditional learning paradigms such as supervised or reinforcement learning. For example, Naive Bayes is a standard generative classifier that models the joint distribution between inputs and labels and makes predictions using Bayes' rule [NJ02]. Above and beyond the aforementioned learning paradigms, the focus of this dissertation will be on using generative models for learning representations and reasoning in various supervision-constrained settings, such as unsupervised and transfer learning. To this end, we organize the dissertation around three key themes, which we discuss next.

1.1.1 Foundations of Probabilistic Generative Modeling

A generative model induces a probability distribution, and generating data from the model is equivalent to sampling from this induced distribution. During learning, we are given a dataset, and our goal is to optimize for the model distribution that is as close as possible to the underlying (unknown) data distribution from which the dataset was obtained. In high dimensions, however, the curse of dimensionality kicks in, as the dataset covers only a small region of the true support of a potentially complex data distribution. Hence, a practical algorithm for learning a generative model for high-dimensional data involves numerous design choices and trade-offs. For example, how do we formalize a learning objective for closeness between the model and data distributions? How do we optimize and regularize such an objective to enable generalization beyond the training dataset?

The status quo in generative modeling is a large set of sufficiently distinct frameworks derived from different algorithmic design choices (see Chapter 2 for a brief review and [Goo16; KW19; Pap+19] for broad surveys). Yet, to date, there is no dominant framework even when we restrict ourselves to a minimal set of basic evaluation criteria. This dissertation contributes to this growing literature in two fundamental ways. First, we quantify the trade-offs in learning and inference for existing generative models. In particular, we show for the first time quantitatively that existing models which optimize for sample generation progressively assign poorer likelihoods to the observed data during training, even in the absence of overfitting [GDE18]. As a remedy, we provide alternative regularization schemes for learning models that excel at both sampling and likelihood estimation.

Second, we develop “plug-in” model-agnostic algorithms for increasing the expressiveness of generative models. We provide two such techniques: (a) an unsupervised boosting algorithm to sequentially increase model capacity [GE18; Gro+19a], and (b) a bias mitigation algorithm that fuses multiple unlabelled (potentially uncurated) data sources for sample-efficient learning and avoids the perils of dataset bias accompanying such a fusion [Cho+20]. Together, these contributions improve the performance of state-of-the-art models on data generation [Gro+19a; Cho+20] as well as a diverse set of downstream use cases such as domain adaptation [Gro+20], data augmentation [Gro+19a], and model-based off-policy policy evaluation [Gro+19a].

1.1.2 Relational Representation and Reasoning

Reasoning with relational data is key to understanding how objects interact with each other and give rise to complex phenomena in the everyday world. Well-known applications include knowledge base completion [Wan+17a], program synthesis [Bro+18], visual question answering [Ant+15], and social network analysis [WF+94]. Although access to relational datasets has increased significantly in the last few decades, canonical representations, such as adjacency matrices for graphs, are discrete and combinatorial. Moreover, obtaining labelled supervision for predictive tasks over nodes and edges can be expensive in large graphs. Hence, integrating these datasets directly into modern ML systems that rely on continuous gradient-based optimization techniques, make strong independent and identically distributed (i.i.d.) assumptions, or require excessive supervision is challenging.

A successful remedy for these challenges is to design unsupervised algorithms that learn continuous representations for the nodes in graphs. We improve upon classic approaches such as spectral decomposition by proposing new frameworks based on generative modeling and contrastive learning. For the latter, the goal is to learn representations that can distinguish similar nodes from dissimilar ones. We extend previously proposed objectives in vision and language, such as skipgram [Mik+13] and triplet losses [HA15], to establish notions of similarity for graph-structured data. Notably, our frameworks [GL16; GZE19] are carefully designed to scale to large graphs both during learning and inference, with downstream use cases in node classification and link prediction. Furthermore, we successfully extend these ideas to interactive graphs specified as a multiagent system [Woo09]. Here, we are interested in modeling the behavior of an agent (nodes) based on its interactions (edges) with other agents. We show that the behaviors encoded in our learned representations inform effective counter strategies and complementary strategies for competitive and cooperative reinforcement learning, respectively [Gro+18c; Gro+18d].

1.1.3 Applications in Science and Society

In 2015, the United Nations adopted 17 Sustainable Development Goals as a “blueprint to achieve a better and more sustainable future for all”. Science and engineering have a dominant role to play in succeeding at these goals, which range from eradicating poverty and hunger to providing quality healthcare and clean energy [Gom+19]. However, scientific discovery and technology development can be very expensive, as they crucially rely on experimental evidence for validating hypotheses, and running experiments incurs costs due to time, money, resources, and safety.

Probabilistic models can play a key role in alleviating experimental costs by rethinking conventional paradigms for data acquisition. Broadly, there are two distinct avenues for improvement: what data to acquire and how to acquire it. The former concerns optimal experimental design, where we are interested in finding an experimental configuration that maximizes an objective of interest. In a Nature case study, we ground this problem in a novel setup aimed at optimizing charging protocols that maximize the battery life of electric vehicles [Att+20]. The search space for charging protocols is massive, and every single experiment is time-consuming and carries safety risks. Traditional approaches, such as Bayesian optimization and multi-armed bandits, use different strategies for balancing exploration and exploitation in order to reduce the total number of experiments [Sha+15]. This is not always enough, as the running cost of even a single experiment can be prohibitive (on the order of months to years for battery charging). To cut such costs, we use low-fidelity signals that are noisy but easy to obtain to guide exploration and exploitation [HKV19]. Specifically, we train a probabilistic model of battery lifetimes (learned on offline data) to predict the outcome of an experiment (battery life) based on data collected early in the experiment run (e.g., voltage, capacity, etc.). If the outcome of an experiment is predicted to be suboptimal with high confidence, we terminate it and allocate resources to future experiments much faster. In our real-world deployment, we observe a ∼15x reduction in testing time over baselines using this approach. Further, we discover novel charging protocols that outperform prior art and suggest new modes of battery degradation [Att+20].

Besides expensive labels, the curse of dimensionality also bottlenecks data acquisition for many real-world applications. For instance, high-resolution MRI scans in hospitals are important for accurate diagnosis, but they can be very slow to acquire pixel-by-pixel. Compressed sensing partially alleviates this difficulty by acquiring a compressed representation of the data using a small number of random projections [Don06; CT05; CRT06]. If the data is sparse in some suitable basis (e.g., images in the wavelet basis), then we can provably recover the data exactly using approximately a logarithmic number of measurements. In the real world, however, we routinely have access to pre-existing datasets for many target domains. To exploit this statistical resource, we learn an acquisition (encoding) and recovery (decoding) strategy via a novel autoencoder-based generative model [GE19]. Our proposed generative approach improves recovery of high-dimensional images by ∼30% for the same number of measurements as baseline approaches.

1.2 Related Research

Broadly, this dissertation builds upon a rich body of prior work on probabilistic methods for modeling and learning from data. This prominently includes the wide literature on graphical models [KF09], where the focus is on compactly representing probability distributions over high-dimensional data using a (sparse) graph data structure. A generative model can also be thought of as a probabilistic graphical model, and both terminologies are routinely used for models that express joint distributions over high-dimensional data. However, the graph structure is of limited relevance here, as most modern generative models, such as autoregressive models [Uri+16], correspond to a dense graphical model that provides expressive power and yet is amenable to efficient learning via the use of functional parameterizations such as deep neural networks. Moreover, the inference queries of interest in downstream applications of generative models are typically more specialized (e.g., sampling, density estimation) than the broad scope considered for graphical models. Nevertheless, many of the algorithms we discuss in this dissertation trace their roots to learning and inference in probabilistic graphical models.

Another line of related research concerns dimensionality reduction techniques, which aim to find low-dimensional structure in high-dimensional data. A classic example is Principal Component Analysis (PCA), which computes linear projections of the data that maximize the variance in the data [Pea01]. Several other linear and non-linear dimensionality reduction techniques draw their inspiration from manifold learning, metric learning, and information theory, and are frequently used for feature extraction in machine learning. These prominently include autoencoders [BCV13] and their variants, such as denoising autoencoders [Vin+08] and variational autoencoders [KW18], which employ deep neural networks for dimensionality reduction and have close connections with generative modeling.

Finally, this dissertation benefits from and contributes to the prior literature on stochastic optimization. Stochastic optimization is the workhorse of modern ML on large datasets. Perhaps its most successful instantiation concerns the optimization of deep neural networks using gradient-based optimization methods [LBH15]. We will routinely use such parameterizations and optimization methods for developing probabilistic modeling frameworks that can scale to large datasets. Further, we will also extensively employ and extend stochastic optimization methods that account for randomness due to unobserved (latent) variables [Moh+19].

1.3 Dissertation Overview

This dissertation is organized into three thematic parts. Part I investigates the statistical and computational foundations of probabilistic generative modeling.

• In Chapter 2, we provide the necessary background to set up the problem and review some key prior works.

• In Chapter 3, we discuss trade-offs in two central learning paradigms of generative modeling: maximum likelihood estimation and adversarial learning. This chapter is based on [GDE18].

• In Chapter 4, we present a model-agnostic algorithm to boost the performance of any existing generative model. This chapter is based on [Gro+19a] and builds on our earlier work in [GE18].

• In Chapter 5, we present another model-agnostic algorithm to address the issue of latent dataset bias in fusing multiple unlabelled data sources for training generative models. This chapter is based on [Cho+20].

Part II extends the use of probabilistic generative models for representing and reasoning over relational domains, where the data points violate the independent and identically distributed (i.i.d.) assumption.

• In Chapter 6, we present a latent variable generative model for learning representations of nodes in graphs. This chapter is based on [GZE19] and also discusses our earlier work in contrastive graph representation learning [GL16].

• In Chapter 7, we present an algorithm combining generative and contrastive objectives for learning representations of agent policies in a multiagent system. This chapter is based on [Gro+18c] and [Gro+18d].

Part III discusses the use of probabilistic methods for real-world applications in scientific discovery and sustainable development.

• In Chapter 8, we present a generative modeling framework for learning acquisition and recovery procedures in statistical compressed sensing. This chapter is based on [GE19].

• In Chapter 9, we present an optimal experimental design approach suitable for domains with large design spaces and time-intensive experimentation. As a case study, we use it to optimize charging protocols for electric batteries. This chapter is based on [Att+20] and builds on our earlier work in [Gro+18b].

We summarize the major contributions in this dissertation and exciting open directions for future research in Chapter 10.

Part I

Foundations of Probabilistic Generative Modeling

2 Background

We begin by providing some background on probabilistic generative modeling. Generative models are a framework for learning and sampling from probability distributions. The focus in this thesis is on distributions defined over high-dimensional objects, which include images, videos, and graphs among other data modalities. We first formally set up learning in generative models as an optimization problem. Thereafter, we elucidate the challenges in learning due to the curse of dimensionality. Finally, we characterize and contrast the major families of generative models in use today that seek to overcome these challenges.

Notation. Unless explicitly stated otherwise, we assume that probability distributions admit absolutely continuous densities with respect to a suitable reference measure. We use uppercase notation X, Y, Z to denote random variables and lowercase notation x, y, z to denote specific values. We use boldface for multivariate random variables X, Y, Z and their vector values x, y, z.

2.1 Problem Setup

We start with the standard assumption in probabilistic machine learning that any observed data point x ∈ Rd is sampled from some fixed data distribution pdata. Here, d is the data dimensionality, typically on the order of hundreds to thousands for the data modalities of interest, e.g., the number of pixels in high-resolution images. We do not know pdata analytically; we only have access to a dataset Dtrain of training data points drawn independently from pdata.¹ We will use p̂data to denote the empirical data distribution, which assigns uniform probability mass to all points in Dtrain and zero mass outside.

Figure 2.1: Learning a generative model. The true data distribution is a mixture of 2 Gaussians (blue). We only observe a training dataset of samples from this distribution (shown as ×). The model family M is unimodal Gaussian distributions, and the parameters θ correspond to the mean and variance. We learn the best-fit parameters θ∗ (green) by minimizing the Kullback-Leibler (KL) divergence between p̂data and pθ. Post learning, we can use the generative model pθ∗ to generate more data via sampling.

The goal of generative modeling is to learn the data distribution pdata given access to Dtrain. In order to do so, we parameterize a family of models M (i.e., the hypothesis class). Each member of M is specified by a set of real-valued parameters θ that induce a probability distribution pθ. During learning, we search for the parameters θ∗ that minimize some notion of discrepancy (e.g., a probabilistic divergence) between the data distribution pdata and the model distribution pθ:

θ∗ = arg min_{θ∈M} dis(pdata, pθ)    (2.1)

Figure 2.1 shows an illustrative example of learning a toy 1D data distribution. In practice, since we only have sample access to pdata via Dtrain, we effectively evaluate and minimize the discrepancy between p̂data and pθ, following the principle of empirical risk minimization [Vap99]. To prevent overfitting the model to Dtrain, we can further regularize the optimization problem using any of the standard approaches in machine learning, such as minimizing the norm of the weights or early stopping based on validation error.

¹ Wherever necessary, we will also assume access to held-out datasets Dvalid for model selection during validation and Dtest for reporting testing results.

2.2 Learning & Inference

In this section, we dig deeper into some concrete learning objectives for generative modeling and some fundamental inference tasks of interest using these models. For a gentle introduction, we begin our discussion with maximum likelihood estimation, which is the foundation for many classical and contemporary learning frameworks for generative modeling. We will also discuss likelihood-free objectives in later sections, notably in the context of generative adversarial networks.

Learning. The Maximum Likelihood Estimation (MLE) principle optimizes for the model parameters that minimize the Kullback-Leibler (KL) divergence between the data distribution and the model distribution:

DKL(pdata, pθ) = Epdata [log pdata(x)] − Epdata [log pθ(x)] (2.2)

Note that the first term is the negative entropy of the data distribution, which is constant w.r.t. θ and hence can be ignored. Consequently, the MLE objective simplifies to minimizing the expected negative log-likelihood assigned by the model to the training dataset:

min_{θ∈M} −Epdata [log pθ(x)] := LMLE(θ).    (2.3)

In practice, since we do not have access to pdata, we can approximate pdata via the empirical data distribution p̂data.
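To ground these definitions, the following is a minimal NumPy sketch of the toy setup from Figure 2.1: the data comes from a mixture of two Gaussians, the model family M is unimodal Gaussians with θ = (μ, σ), and we minimize the empirical estimate of Eq. (2.3). The specific mixture parameters, sample size, and variable names are our own illustrative assumptions, not values used elsewhere in this dissertation.

    import numpy as np

    rng = np.random.default_rng(0)

    # Training data from the "true" distribution: a mixture of two Gaussians.
    n = 1000
    component = rng.integers(0, 2, size=n)
    x_train = np.where(component == 0,
                       rng.normal(-2.0, 0.5, size=n),
                       rng.normal(+2.0, 0.5, size=n))

    def nll(mu, sigma, x):
        # Empirical estimate of L_MLE(theta) = -E[log p_theta(x)] for a
        # unimodal Gaussian model with theta = (mu, sigma).
        return np.mean(0.5 * np.log(2 * np.pi * sigma ** 2)
                       + (x - mu) ** 2 / (2 * sigma ** 2))

    # For this model family, the minimizer of Eq. (2.3) is available in
    # closed form: the sample mean and standard deviation.
    mu_star, sigma_star = x_train.mean(), x_train.std()
    print(f"NLL at theta*: {nll(mu_star, sigma_star, x_train):.3f}")

    # Post learning, we can sample from and evaluate densities under p_theta*.
    new_samples = rng.normal(mu_star, sigma_star, size=10)
    density_at_zero = np.exp(-nll(mu_star, sigma_star, np.zeros(1)))

Richer model families admit no such closed form; the same objective is then minimized with the stochastic gradient methods used throughout this chapter.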

Inference. There are two fundamental inference tasks with a generative model.

1. Sampling: generating samples from the model distribution x ∼ pθ.

2. Density estimation: evaluating the probability assigned by a model pθ(x) to a given data point x.

Some models include latent variables Z in addition to the observed variables X. For these models, we will also consider the task of inferring the latent variables for a given data point x.

2.3 Deep Generative Models

The earliest generative models were restricted to simple model families, such as mixtures of finite Gaussians or Bernoullis to model continuous and discrete data respectively. These models suffer from the curse of dimensionality and have limited success when applied to high-dimensional data modalities such as images, videos, and graphs. As the data dimensionality increases, the target data distribution pdata can be very complex (e.g., have a large number of modes). Fitting such high-dimensional data distributions requires a very large number of mixture components, and consequently, optimization can be a challenge for such large mixture models. Moreover, the sample complexity for learning large models is prohibitively high, as pdata will generally have a support much larger than that of the empirical data distribution p̂data. It is critical to choose model families and optimization algorithms whose inductive biases can generalize beyond Dtrain.

In recent years, we have seen ample evidence in favor of deep neural networks as general-purpose non-linear function approximators across a variety of tasks and modalities involving high-dimensional data [LBH15]. They can be optimized effectively on large datasets with stochastic gradient-based optimization methods. In this section, we will see how deep networks can also enable probabilistic generative modeling of high-dimensional data. To do so, we instantiate and contrast five major frameworks of generative modeling in use today: energy-based models, autoregressive models, variational autoencoders, normalizing flows, and generative adversarial networks.

2.3.1 Energy-based Models

An energy-based model (EBM) is a parameterized instantiation of the Boltzmann distribution. That is, the probability density of any data point x ∈ Rd is given as:

pθ(x) ∝ exp (−Eθ(x)) , (2.4)

R where the partition function Zθ = exp(−Eθ(x))dx. The real-valued function Eθ(x) denotes the scalar energy of a data point x with parameters θ and is low for points with high probability under pθ. The origins of such a distributional form trace to statistical physics, where this distribution is used to model the preference of systems to be in states with lower energy [LeC+06]. EBMs can include latent variables Z and efficiently learned by restricting the connectivity between the latent and observed variables in a framework called Restricted Boltzmann Machines (RBM). For brevity, we focus on learning and inference in the default EBM formulation in Eq. (2.4) and refer the reader to [Hin12] for extending these methods to RBMs. The energy function for an EBM can be parameterized by a deep neural network and the learning objective for the model parameters is derived via MLE. For an EBM, we can substitute the Boltzmann distribution in Eq. (2.4) in Eq. (2.3):

LEBM(θ) = Epdata [Eθ(x)] + log Zθ.    (2.5)

The partition function Zθ is intractable to compute for high-dimensional spaces. The gradients of the loss function are given as:

∇θLEBM(θ) = Epdata [∇θ Eθ(x)] − Epθ [∇θ Eθ(x)] . (2.6)

Intuitively, the first term in the gradient decreases the energy at data points drawn from the training dataset (also called positive examples) and the second term increases the energy at samples drawn from the model pθ (negative examples). Estimating the expectation in the second term requires us to draw samples from the model. This can be done via Markov Chain Monte Carlo (MCMC) methods, which perform sampling by running a carefully designed Markov chain. In practice, these methods can however be slow to converge (“mix”) in the absence of good proposal distributions for transitioning across states in the chain. A number of methods have been developed to overcome the slow mixing time for learning and sampling from EBMs, notably contrastive divergence, which runs chains for only a few steps to give biased but fast estimates [Hin02], and Langevin dynamics, which exploits the gradient information in the chain transitions [DM19].
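To illustrate how the two terms of Eq. (2.6) are estimated in practice, here is a minimal PyTorch sketch that draws approximate negative samples with a short Langevin chain, in the spirit of the contrastive divergence and Langevin dynamics methods cited above. The network architecture, chain length, and step size are hypothetical choices for illustration, not a prescription.

    import torch
    import torch.nn as nn

    # Scalar energy E_theta(x) for 2-dimensional data (assumed architecture).
    energy_net = nn.Sequential(nn.Linear(2, 64), nn.SiLU(), nn.Linear(64, 1))

    def langevin_negatives(x_init, n_steps=20, step_size=0.01):
        # Approximately sample from p_theta(x) ∝ exp(-E_theta(x)) by following
        # noisy energy gradients; short chains give biased but fast estimates,
        # in the spirit of contrastive divergence.
        x = x_init.clone().requires_grad_(True)
        for _ in range(n_steps):
            grad = torch.autograd.grad(energy_net(x).sum(), x)[0]
            x = (x - 0.5 * step_size * grad
                 + (step_size ** 0.5) * torch.randn_like(x))
            x = x.detach().requires_grad_(True)
        return x.detach()

    x_pos = torch.randn(128, 2)          # stand-in for a batch from D_train
    x_neg = langevin_negatives(torch.randn(128, 2))

    # Eq. (2.6): decrease energy on data, increase it on model samples.
    loss = energy_net(x_pos).mean() - energy_net(x_neg).mean()
    loss.backward()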

2.3.2 Autoregressive Models

One of the key challenges with energy-based models is that they are unnormalized, and hence sampling and density estimation using these models is intractable. We now look at a class of self-normalized models called autoregressive models. In an autoregressive model (ARM), the joint distribution over d random variables x = (x1, x2, ..., xd) can be factorized using the chain rule as:

pθ(x) = ∏_{i=1}^{d} pθ(xi | x1, x2, ..., xi−1) = ∏_{i=1}^{d} pθ(xi | x<i)    (2.7)

If the conditionals pθ(xi | x<i) are tractable to evaluate, then so is the joint density pθ(x), and we can learn the model by directly maximizing the log-likelihood over the training dataset. Substituting Eq. (2.7) into Eq. (2.3), we get:

LARM(θ) = −∑_{i=1}^{d} Epdata [log pθ(xi | x<i)].    (2.8)

To sample from these models, we follow ancestral sampling, where every dimension is sampled one at a time in sequence:

x1 ∼ pθ(X1)
x2 ∼ pθ(X2 | X1 = x1)
...
xn ∼ pθ(Xn | X<n = x<n)

ARMs are among the state-of-the-art models in use today. A key avenue for ongoing research in this field concerns the design of expressive parameterizations for the conditionals. Past work has explored the use of MLPs [Uri+16; Ger+15], RNNs [SMH11; OKK16], CNNs [Van+16; Oor+16; Sal+17], and transformers [Vas+17; RMC15], and has demonstrated success on a wide variety of data modalities including images, audio, and text.
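For concreteness, the sketch below instantiates Eqs. (2.7)-(2.8) and ancestral sampling for a toy ARM over binary data, where each conditional pθ(xi | x<i) is a Bernoulli whose logit depends linearly on the preceding dimensions through a masked weight matrix. This parameterization is deliberately simplistic and is our own illustrative choice; the deep architectures cited above are the practical option.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    d = 10                                  # number of binary dimensions (assumed)
    W = nn.Parameter(torch.zeros(d, d))     # W[i, j] is used only for j < i
    b = nn.Parameter(torch.zeros(d))
    mask = torch.tril(torch.ones(d, d), diagonal=-1)  # enforces dependence on x_{<i}

    def logits(x):
        # The i-th output depends only on x_1, ..., x_{i-1}.
        return x @ (W * mask).T + b

    def nll(x):
        # L_ARM(theta) from Eq. (2.8): sum of per-dimension conditional NLLs.
        return F.binary_cross_entropy_with_logits(
            logits(x), x, reduction="none").sum(dim=1).mean()

    def sample(n):
        # Ancestral sampling: draw x_1, then x_2 | x_1, and so on, in sequence.
        with torch.no_grad():
            x = torch.zeros(n, d)
            for i in range(d):
                p_i = torch.sigmoid(logits(x)[:, i])
                x[:, i] = torch.bernoulli(p_i)
        return x

    x_batch = torch.bernoulli(torch.full((32, d), 0.5))  # stand-in training batch
    loss = nll(x_batch)                  # minimize with any stochastic optimizer
    samples = sample(5)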

2.3.3 Variational Autoencoders

One of the key goals of unsupervised learning is to learn useful representations of the data. An effective way to achieve this goal is via latent variable generative models. Besides the set of observed variables X ∈ Rd, a latent variable generative model additionally consists of a set of latent or unobserved random variables Z ∈ Rk. The use of latent variables provides more flexibility for modeling complex distributions and can uncover hidden structure in the data. From a probabilistic standpoint, a latent variable model expresses a joint distribution pθ over X and Z. We will discuss three key characterizations of this distribution that define major latent variable generative modeling frameworks: variational autoencoders, normalizing flows, and generative adversarial networks.

As before, we are interested in maximizing the log-likelihood of the observed data. Since we also have latent variables as part of our model, evaluating the log-likelihood involves marginalizing out these latent variables. Consider the marginal log-likelihood of a latent variable model for a data point x:

log pθ(x) = log ∫ pθ(x, z) dz.    (2.9)

Evaluating the integrand for a single (x, z) configuration is generally tractable via the chain rule, as long as we choose a tractable prior pθ(z) and conditional distribution pθ(x|z). However, computing the integral in the above expression is intractable and needs to be estimated approximately. The key idea in variational approximations is to consider a simple (i.e., easy to sample and evaluate) distribution to approximate the posterior distribution pθ(z|x). Denoting the variational approximation as a parameterized distribution qφ(z|x) with parameters φ, we get an evidence lower bound (ELBO) on the marginal log-likelihood as:

log pθ(x) ≥ E_{qφ(z|x)} [log (pθ(x, z) / qφ(z|x))] := ELBO(θ, φ).    (2.10)

The proof is based on Jensen’s inequality (see [KW18] for a derivation). The gap equals the KL divergence between qφ(z|x) and pθ(z|x) and the approximation is tight when these two distributions match. The learning objective for a variational autoencoder (VAE) is to maximize the ELBO w.r.t. both the model parameters θ and the variational parameters φ [KW18; RMW14]. In practice, we can implement this framework similar to an autoen- coder [BCV13]. The encoding phase of an autoencoder which maps the data points to a latent space—in a VAE, this corresponds the mapping that parameterizes the variational posterior distribution qφ(z|x). The decoding phase of an autoencoder tries to reconstruct the data point using the latents—in a VAE, this corresponds the mapping that parameterizes the generative model pθ(x, z). More intuitively, we can CHAPTER 2. BACKGROUND 18

factorize the ELBO in Eq. (2.10) as a sum of two distinct terms:

ELBO(θ, φ) = E_{qφ(z|x)}[log pθ(x|z)] − D_KL(qφ(z|x), pθ(z)).   (2.11)

The first term corresponds to the accuracy of a VAE in reconstructing the input to the encoder at the output of the decoder. The second term regularizes the approximate posterior distribution to be close to the prior distribution over the latents, as measured by KL divergence. Post-learning, we can sample from the generative model pθ(x, z) (decoder) via ancestral sampling:

z ∼ pθ(Z), (2.12)

x ∼ pθ(X|Z = z).

As intended, we can evaluate latent representations z for any data point x using the encoder qφ(z|x). Finally, we conclude this section with a note on the practical implementation of VAEs. In the initial VAE implementation proposed in [KW18], the prior distribution is fixed to be an isotropic Gaussian and both the encoding and decoding conditionals are specified as multivariate Gaussians with diagonal covariance. Furthermore, it proposed the use of reparameterization for computing low-variance Monte Carlo gradient estimators of the ELBO. In the last few years, many extensions have been proposed to the different components of a VAE (expressive parameterizations, alternate loss functions, improved stochastic optimization); it is beyond the scope of this chapter to cover the vast literature and we refer the interested reader to recent surveys for more details [KW19; Moh+19].
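To make these components concrete, below is a minimal sketch of a single ELBO evaluation with the reparameterization trick under the original setup of [KW18] (PyTorch; the hypothetical encoder returns the mean and log-variance of a diagonal Gaussian qφ(z|x), and the hypothetical decoder returns a torch.distributions object for pθ(x|z)):

import torch

def elbo(x, encoder, decoder):
    # Encoding: parameters of the diagonal Gaussian q_phi(z|x).
    mu, log_var = encoder(x)
    # Reparameterization: z = mu + sigma * eps with eps ~ N(0, I), so that
    # Monte Carlo gradients flow through mu and log_var with low variance.
    eps = torch.randn_like(mu)
    z = mu + torch.exp(0.5 * log_var) * eps
    # First term of Eq. (2.11): reconstruction log-likelihood log p_theta(x|z).
    recon = decoder(z).log_prob(x).sum(dim=-1)
    # Second term: KL(q_phi(z|x) || p(z)) against the isotropic Gaussian
    # prior, which is available in closed form for diagonal Gaussians.
    kl = 0.5 * (mu ** 2 + log_var.exp() - 1.0 - log_var).sum(dim=-1)
    return (recon - kl).mean()  # maximize w.r.t. both theta and phi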

2.3.4 Normalizing Flows

Unlike autoregressive models, variational autoencoders contain latent variables for flexible modeling and representation learning. However, the price we pay in such models is that the marginal log-likelihood of the observed data is intractable and needs to be approximated using variational methods. In this section, we will discuss normalizing flows, which is a class of latent variable models that can provide exact

likelihoods with additional modeling assumptions. As with VAEs, a normalizing flow (NF) consists of a set of latent variables Z ∈ R^k and a set of observed variables X ∈ R^d. The key distinguishing feature of an NF is that the mapping between any latent vector z and observed vector x is invertible [DKB14; ML16]. This requirement also constrains the dimensionality of the latent space to equal the data dimensionality (i.e., k = d) and the decoding distribution pθ(x|z) to be deterministic and one-to-one. Let Gθ : R^d → R^d be the parameterized invertible mapping from Z to X. Letting z = Gθ^{-1}(x), we can express the marginal likelihood of x using the change-of-variables formula as:

pθ(x) = p(z) · |det ( ∂Gθ^{-1} / ∂X )|_{X=x}   (2.13)

where p is a fixed prior density over Z (e.g., an isotropic Gaussian) and the second term on the RHS corresponds to the absolute value of the determinant of the Jacobian of Gθ^{-1} evaluated at x. Geometrically, it signifies the local change in volume when translating across the two sample spaces. For evaluating likelihoods via the change-of-variables formula, we require efficient and tractable evaluation of (1) the prior density p(z), (2) the inverse transformation Gθ^{-1}, and (3) the determinant of the Jacobian of Gθ^{-1}. Evaluating the Jacobian of a transformation defined over a high-dimensional space can be computationally expensive. To enable efficient evaluation, standard normalizing flows exploit the following observations from linear algebra:

1. Composition of invertible transformations is invertible. This allows the design of an expressive flow of transformations with modular invertible components.

2. The determinant of a product of matrices is the product of their determinants. So as long as the determinant of each individual transformation is easy to evaluate, the same holds for the overall flow.

3. If the Jacobian of any invertible transformation is an upper or lower triangular matrix, then the determinant can be easily evaluated as the product of its diagonal entries.

For example, consider a coupling transformation [DSB17]. Let i ∈ {1, 2, . . . , d − 1} be any fixed index. Then a coupling transformation from z to x is given as:

x≤i = z≤i,
x>i = µθ(x≤i) + σθ(x≤i) ⊙ z>i

where µθ : R^i → R^{d−i} and σθ : R^i → R_+^{d−i} are parameterized neural networks that perform a shift and scale operation respectively on the last (d − i) indices of z. Note that this transformation is indeed invertible. The Jacobian is a lower triangular matrix and the determinant equals the product of the scales output by σθ(·). To draw a sample from this model, we perform ancestral sampling, i.e., we first sample a latent vector z ∼ p(z) and obtain the data sample as x = Gθ(z). This requires the ability to efficiently: (1) sample from the prior density and (2) evaluate the forward transformation Gθ. Many transformations parameterized by deep neural networks that satisfy one or more of these criteria have been proposed in the recent literature on normalizing flows, e.g., NICE [DKB14] and Real-NVP [DSB17]. We refer the interested reader to [Pap+19] for an extensive survey on recent progress in the design of such invertible transformations.
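A minimal sketch of one coupling transformation and its log-determinant is given below (PyTorch; mu_net and sigma_net are hypothetical networks mapping R^i to R^{d−i}, with the scale taken as the exponential of the network output to ensure positivity):

import torch

def coupling_forward(z, i, mu_net, sigma_net):
    # Forward map x = G_theta(z) for a batch of latents z of shape (n, d).
    x_low = z[:, :i]                      # identity part: x_<=i = z_<=i
    shift = mu_net(x_low)                 # mu_theta(x_<=i)
    scale = torch.exp(sigma_net(x_low))   # sigma_theta(x_<=i), positive
    x_high = shift + scale * z[:, i:]     # x_>i = mu + sigma * z_>i
    # Lower triangular Jacobian: log|det| is the sum of the log-scales.
    log_det = torch.log(scale).sum(dim=1)
    return torch.cat([x_low, x_high], dim=1), log_det

def coupling_inverse(x, i, mu_net, sigma_net):
    # Inverse map z = G_theta^{-1}(x), used for likelihood evaluation.
    z_low = x[:, :i]
    scale = torch.exp(sigma_net(z_low))
    z_high = (x[:, i:] - mu_net(z_low)) / scale
    return torch.cat([z_low, z_high], dim=1), -torch.log(scale).sum(dim=1)

In practice, many such transformations are composed, with the pattern of transformed indices varied across layers so that every dimension is eventually updated.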

2.3.5 Generative Adversarial Networks

So far, we have considered generative models that are trained to maximize the likelihood of the training data exactly (autoregressive models, normalizing flows) or approximately (energy-based models, variational autoencoders). Crucially, as we saw in Section 2.2, MLE minimizes the KL divergence between the data and model distributions. This can be a limitation because the KL divergence has been shown to not converge for even relatively simple sequences of distributions, unlike other alternatives (see Example 1 in [ACB17]). A large family of probabilistic divergences and distances can be conveniently expressed as a difference in expectations w.r.t. pdata and pθ:

max_{φ∈F} E_{pθ}[hφ(x)] − E_{pdata}[h′φ(x)]   (2.14)

where F denotes a set of parameters, and hφ and h′φ are appropriate real-valued functions parameterized by φ. Specific choices of F, hφ, and h′φ recover a variety of f-divergences such as the Jensen-Shannon divergence, as well as integral probability metrics such as the Wasserstein distance [NCT16; MNG17b]. Importantly, a Monte Carlo estimate of Eq. (2.14) can be obtained using only samples from the model. Combining Eq. (2.1) with Eq. (2.14), we obtain the following minimax objective:

min_{θ∈M} max_{φ∈F} E_{pθ}[hφ(x)] − E_{pdata}[h′φ(x)].   (2.15)

As a result, any differentiable model that allows tractable sampling can be learned adversarially by optimizing minimax objectives of the form given in Eq. (2.15). A generative adversarial network (GAN) is one such latent variable modeling framework for adversarial learning. It consists of two differentiable mappings: (a) a generator Gθ : R^k → R^d that maps a latent vector z ∈ R^k to an observed data point x ∈ R^d, and (b) a critic function Cφ : R^d → R that maps X to a scalar score. To obtain samples from the generator, we first sample a latent vector from a fixed prior distribution z ∼ p(z) (e.g., an isotropic Gaussian) and then obtain the data sample via a forward pass through the generator as x = Gθ(z). Using these two mappings, we can instantiate Eq. (2.14) to optimize a suitable divergence or distance. For example, we get the original GAN objective proposed by [Goo+14] when the critic minimizes the binary cross-entropy loss:

min_{θ∈M} max_{φ∈F} E_{p(z)}[log(1 − Cφ(Gθ(z)))] + E_{pdata}[log Cφ(x)].   (2.16)
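The generator and the critic parameters are both learned via alternating gradient updates. A minimal sketch of one update cycle for the objective in Eq. (2.16) follows (PyTorch; G, C, and the two optimizers are hypothetical, C is assumed to output probabilities in [0, 1], and the generator step uses the common non-saturating variant of the loss):

import torch
import torch.nn.functional as F

def gan_update_step(G, C, opt_G, opt_C, x_real, k):
    n = x_real.size(0)
    ones, zeros = torch.ones(n, 1), torch.zeros(n, 1)
    z = torch.randn(n, k)  # z ~ p(z), an isotropic Gaussian prior
    # Critic step: minimize binary cross-entropy with label 1 for real
    # data and label 0 for generated data (the inner max of Eq. (2.16)).
    opt_C.zero_grad()
    loss_c = F.binary_cross_entropy(C(x_real), ones) \
           + F.binary_cross_entropy(C(G(z).detach()), zeros)
    loss_c.backward()
    opt_C.step()
    # Generator step: fool the critic. The non-saturating loss maximizes
    # log C(G(z)) instead of minimizing log(1 - C(G(z))).
    opt_G.zero_grad()
    loss_g = F.binary_cross_entropy(C(G(z)), ones)
    loss_g.backward()
    opt_G.step()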

We refer the interested reader to [WSW19] for a recent survey on algorithmic advancements in GANs and their applications in artificial intelligence.

3 Trade-offs in Learning & Inference

3.1 Introduction

Highly expressive parametric models have enjoyed great success in supervised learning, where learning objectives and evaluation metrics are typically well-specified and easy to compute. On the other hand, the learning objective for a generative model trained in an unsupervised setting is less clear. At a fundamental level, the idea is to learn a generative model that minimizes some notion of discrepancy with respect to the data distribution. As we showed in the last chapter, there are two key families of learning objectives depending on whether the target inference task is density estimation or sampling. Models that can estimate densities typically minimize the KL divergence between the data distribution and the model, or equivalently perform maximum likelihood estimation (MLE) on the observed data. In contrast, the objective of adversarial learning is to generate data indistinguishable from the observed data. Adversarially learned models such as generative adversarial networks (GAN; [Goo+14]) can sidestep specifying an explicit density for any data point and belong to the class of implicit generative models [DG84; ML16]. The lack of an explicit density in GANs hinders a quantitative comparison of their generalization performance. The typical evaluation criteria for such models are based on ad-hoc sample quality metrics [Sal+16; Che+17a], which do not address this issue, since it is possible to generate good samples by memorizing the training data, by missing important modes of the distribution, or both [TOB16].


Further, density estimates for GANs based on approximate inference techniques such as annealed importance sampling (AIS; [Nea01; Wu+17]) and non-parametric methods such as kernel density estimation (KDE; [Par62; Goo+14]) are slow to compute and potentially misleading in high dimensions [GDE18]. Moreover, several applications of deep generative models rely on density estimates; e.g., the design of exploration strategies in challenging reinforcement learning environments [Ost+17] and out-of-distribution detection [CJA18]. A handle on density estimates of GANs can enable their application to many such tasks. We propose Flow-GANs, the first family of generative adversarial networks with exact likelihoods. The key modeling assumption in a Flow-GAN is the use of an invertible latent variable generator. By using an invertible generator, we can tractably evaluate exact likelihoods in a Flow-GAN using the change-of-variables formula. Further, we can perform exact posterior inference over the latent variables, another desirable inference property that does not hold for a typical GAN. Using a Flow-GAN, we perform a principled quantitative comparison of MLE and adversarial learning on two benchmark high-dimensional image datasets: MNIST and CIFAR-10. While adversarial learning outperforms MLE on sample quality metrics, as expected based on prior empirical studies, the log-likelihood estimates of adversarial learning are orders of magnitude worse than those of MLE. The difference is so stark that a simple Gaussian mixture model baseline outperforms adversarially learned models on both sample quality and held-out likelihoods. Next, we attempt to resolve the dichotomy between sample quality and likelihoods by optimizing a hybrid objective that augments adversarial training with the MLE objective. While this objective achieves the intended effect of smoothly trading off the two goals in the case of CIFAR-10, it has a regularizing effect on MNIST, where it outperforms MLE and adversarial learning on both likelihoods and sample quality metrics. Finally, we quantitatively analyze the reasons for the poor likelihoods of adversarial learning. We attribute this phenomenon to an ill-conditioned Jacobian matrix for the generative model, which suggests mode collapse rather than memorization of the training dataset. The proposed hybrid objective successfully overcomes this issue.

3.2 Flow Generative Adversarial Networks

As discussed above, generative adversarial networks can tractably generate high-quality samples but have intractable or ill-defined likelihoods. Techniques such as

AIS and KDE get around this by assuming a Gaussian observation model pθ(x|z) for the generator.1 This assumption alone is not sufficient for quantitative evaluation, since the marginal likelihood of the observed data, pθ(x) = ∫ pθ(x, z) dz, in this case would be intractable, as it requires integrating over all the latent factors of variation. This would then require approximate inference (e.g., Monte Carlo or variational methods), which in itself is a computational challenge for high-dimensional distributions. To circumvent these issues, we propose flow generative adversarial networks (Flow-GAN). A Flow-GAN consists of a pair of generator-discriminator networks with the generator specified as a normalizing flow model. As discussed in Chapter 2, this implies that the generator transformation Gθ : R^d → R^d mapping the latent space Z to the observed space X is invertible, with the likelihood given in Eq. (2.13) [DKB14]. We restate the equation here for clarity. Letting z = Gθ^{-1}(x), we can express the likelihood of x as:

pθ(x) = p(z) · |det ( ∂Gθ^{-1} / ∂X )|_{X=x}   (3.1)

In a Flow-GAN, the likelihood is well-defined, computationally tractable, and can be evaluated exactly for standard invertible transformations [DKB14; DSB17]. Hence, a Flow-GAN can be trained via maximum likelihood estimation using Eq. (2.3) in which case the discriminator is redundant. Additionally, we can perform ancestral sampling just like a regular GAN whereby we sample a random vector z ∼ p(z) and transform it to a model generated sample via Gθ. This makes it possible to learn a Flow-GAN using an adversarial learning objective, such as the GAN objective in Eq. (2.16).

1 The true observation model for a GAN is a Dirac delta distribution, i.e., pθ(x|z) is infinite when x = Gθ(z) and zero otherwise.

Figure 3.1: Samples generated by Flow-GAN models with different objectives, (a) MLE, (b) ADV, and (c) Hybrid, for MNIST (top) and CIFAR-10 (bottom).

Since a Flow-GAN can be learned both using MLE and adversarial learning, a natural question to ask is whether we should even consider adversarial learning, given that MLE is statistically optimal (under some conditions). Besides difficulties that could arise due to optimization (in both MLE and adversarial learning), the optimality of MLE holds only when there is no model misspecification [Whi82], i.e., when the data distribution pdata is a member of the parametric model family of distributions. Next, we investigate these distinctions in depth via an empirical study.

3.3 Empirical Evaluation

We compare learning of Flow-GANs using MLE and adversarial learning (ADV) on the MNIST dataset of handwritten digits [LCB10] and the CIFAR-10 dataset of natural images [KH09]. The normalizing flow generator architectures are chosen to be NICE [DKB14] and Real-NVP [DSB17] for MNIST and CIFAR-10 respectively. We fix the choice of the divergence optimized by ADV to be the Wasserstein distance as proposed in WGAN [ACB17], with the Lipschitz constraint over the critic imposed by penalizing the norm of the gradient with respect to the input [Gul+17]. The discriminator corresponds to the DCGAN architecture [RMC15]. The above choices are among the current state-of-the-art in maximum likelihood estimation and adversarial learning and greatly stabilize GAN training [ACB17].

Table 3.1: Best MODE scores and test negative log-likelihood estimates for Flow-GAN models on MNIST.

Objective          MODE Score   Test NLL (in nats)
MLE                7.42         −3334.56
ADV                9.24         −1604.09
Hybrid (λ = 0.1)   9.37         −3342.95

Table 3.2: Best Inception scores and test negative log-likelihood estimates for Flow-GAN models on CIFAR-10.

Objective        Inception Score   Test NLL (in bits/dim)
MLE              2.92              3.54
ADV              5.76              8.53
Hybrid (λ = 1)   3.90              4.21

3.3.1 Log-likelihoods vs. sample quality

The learning curves for log-likelihoods attained by Flow-GANs learned using MLE and ADV are shown in Figure 3.2 for MNIST (top row) and CIFAR-10 (bottom row) respectively. Following standard convention, we report the negative log-likelihoods (NLL) in nats for MNIST and bits/dimension for CIFAR-10.

MLE. In Figure 3.2a, we see that normalizing flow models attain low validation NLLs (blue curves) after a few gradient updates, as expected, because they explicitly optimize the MLE objective in Eq. (2.3).

ADV. Surprisingly, ADV models show a consistent increase in validation NLLs as training progresses in Figure 3.2b (for CIFAR-10, the estimates are reported on a log scale). Based on the learning curves, we can disregard overfitting as an explanation, since the increase in NLLs is observed even on the training data.

Figure 3.2: Learning curves for negative log-likelihood (NLL) evaluation on MNIST (top, in nats) and CIFAR-10 (bottom, in bits/dim) for (a) MLE and (b) ADV. Lower NLLs are better.

The training and validation NLLs closely track each other, suggesting that ADV models are not simply memorizing the training data. Comparing the right vs. left panels in Figure 3.2, we see that the log-likelihoods attained by ADV are 100 and 100,000 times worse than those attained by MLE after sufficient training. Finally, we note that the WGAN loss (green curves) does not correlate well with NLL estimates. While the WGAN loss stabilizes after a few iterations of training the generator, the NLLs continue to increase. This is in contrast to prior work, which has shown the loss to be strongly correlated with sample quality metrics [ACB17].

Sample quality metrics. Next, we use MODE scores [Che+17a] and Inception scores [Sal+16] for evaluating sample quality of the MNIST and CIFAR-10 models respectively. The samples corresponding to the best MODE and Inception scores are shown in Figure 3.1. ADV models significantly outperform MLE with respect to the final MODE/Inception scores achieved, in line with expectations, as listed in Table 3.1 and Table 3.2. Visual inspection of samples confirms the observations made on the basis of the sample quality metrics.

3.3.2 Comparison with Gaussian mixture models

How good are the log-likelihoods attained by ADV? The above experiments suggest that ADV can produce excellent samples but assigns low likelihoods to the observed data. However, a direct comparison of ADV with the log-likelihoods of MLE is unfair since the latter explicitly optimizes for the desired objective. To highlight that generating good samples at the expense of low likelihoods is not a challenging goal, we propose a simple baseline to compare with the adversarially learned Flow-GAN model that achieves the highest MODE/Inception score. Specifically, we use a Gaussian mixture model (GMM) consisting of m isotropic Gaussians with equal weights centered at each of the m training points. The bandwidth hyperparameter σ is the same for each of the mixture components and optimized for the lowest validation NLL via a line search in (0, 1].
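A minimal sketch of this baseline is given below (NumPy; X_train and X_val are hypothetical arrays of flattened training and validation images):

import numpy as np

def gmm_nll(X_eval, X_train, sigma):
    # Average NLL (in nats) under an equal-weight mixture of m isotropic
    # Gaussians, one centered at each training point, with bandwidth sigma.
    m, d = X_train.shape
    sq_dists = ((X_eval[:, None, :] - X_train[None, :, :]) ** 2).sum(-1)
    log_comp = -0.5 * sq_dists / sigma ** 2 \
               - 0.5 * d * np.log(2 * np.pi * sigma ** 2)
    # log p(x) = logsumexp_j(log_comp_j) - log m for equal mixture weights.
    log_p = np.logaddexp.reduce(log_comp, axis=1) - np.log(m)
    return -log_p.mean()

# Line search for the bandwidth over (0, 1] on the validation split:
# best_sigma = min(np.linspace(0.01, 1.0, 100),
#                  key=lambda s: gmm_nll(X_val, X_train, s))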


Figure 3.3: Gaussian mixture models outperform adversarially learned models on both held-out log-likelihoods and sampling metrics for any bandwidth in the green shaded region, for (a) MNIST and (b) CIFAR-10.

The results are shown in Figure 3.3. We overload the y-axis to report both NLLs and sample quality metrics. The horizontal maroon and cyan dashed lines denote the best attainable MODE/Inception scores and the corresponding validation NLLs respectively attained by the adversarially learned Flow-GAN model. The GMM can clearly attain better sample quality metrics, since it is explicitly overfitting to the training data for low values of the bandwidth parameter (any σ for which the red curve is above the brown dashed line). Surprisingly, the simple GMM also outperforms the adversarially learned model with respect to the NLLs attained for several values of the bandwidth parameter (any σ for which the blue curve is below the cyan dashed line). Bandwidth parameters for which GMM models outperform the adversarially learned model on both log-likelihoods and sample quality metrics are highlighted using the green shaded area. Hence, a trivial baseline that is memorizing the training data can generate high-quality samples and attain better held-out log-likelihoods, suggesting that the log-likelihoods attained by adversarial training are very poor.

3.3.3 Hybrid learning of Flow-GANs

So far, we observed that adversarially learned Flow-GAN models attain poor held-out log-likelihoods. This makes it challenging to use such models for density estimation. On the other hand, Flow-GANs learned using MLE are ‘‘mode covering” but do not generate high quality samples. With a Flow-GAN, it is possible to trade off the two goals by combining the learning objectives corresponding to both these inductive principles. Without loss of generality, let V(Gθ, Dφ) denote the minimax objective of any GAN model (such as WGAN). The hybrid objective of a Flow-GAN can be expressed as:

min_θ max_φ V(Gθ, Dφ) − λ E_{x∼pdata}[log pθ(x)]   (3.2)

where λ ≥ 0 is a hyperparameter for the algorithm. By varying λ, we can interpolate between plain adversarial training (λ = 0) and MLE (very high λ).

We summarize the results from MLE, ADV, and Hybrid for log-likelihood evaluation and sample quality in Table 3.1 and Table 3.2 for MNIST and CIFAR-10 respectively. The tables report the test log-likelihoods corresponding to the best validated MLE and ADV models, as well as the highest MODE/Inception scores observed during training. The samples obtained by models corresponding to the best MODE/Inception scores for each learning objective are shown in Figure 3.1 for MNIST and CIFAR-10. While the results on CIFAR-10 are along expected lines, the hybrid objective interestingly outperforms MLE and ADV on both test log-likelihoods and sample quality metrics in the case of MNIST. One potential explanation is that the ADV objective can regularize MLE to generalize to the test set and, in turn, the MLE objective can stabilize the optimization of the ADV objective. Hence, the hybrid objective in Eq. (3.2) can smoothly balance the two objectives using the tunable hyperparameter λ, and in some cases such as MNIST, the performance on both tasks can improve as a result.
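For concreteness, a minimal sketch of the generator-side hybrid loss in Eq. (3.2) follows (PyTorch; the flow object, with hypothetical forward and inverse methods where inverse returns both Gθ^{-1}(x) and the log-determinant term of Eq. (3.1), and the WGAN-style critic are assumptions):

import math
import torch

def hybrid_generator_loss(flow, critic, x_real, z, lam):
    # Adversarial term V(G_theta, D_phi): the critic scores real data
    # high and generated data low; the generator minimizes this gap.
    x_fake = flow.forward(z)
    adv = critic(x_real).mean() - critic(x_fake).mean()
    # Exact MLE term via the change-of-variables formula (Eq. (3.1))
    # with a standard Gaussian prior over the latents.
    z_inv, log_det = flow.inverse(x_real)
    d = z_inv.size(1)
    log_prior = -0.5 * (z_inv ** 2).sum(dim=1) - 0.5 * d * math.log(2 * math.pi)
    log_likelihood = (log_prior + log_det).mean()
    return adv - lam * log_likelihood  # Eq. (3.2), minimized over theta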


Figure 3.4: Singular value analysis for the Jacobian of the generator functions, for (a) MNIST and (b) CIFAR-10. The Jacobian is evaluated at 64 random noise vectors sampled from the prior. The x-axis shows the singular value magnitudes (on a log scale) and, for each singular value s, the y-axis shows the corresponding cumulative distribution function (CDF) value, which signifies the fraction of singular values less than s.

3.4 Explaining log-likelihood trends

We perform a singular value analysis to explain the variation in log-likelihoods attained by the various Flow-GAN learning objectives. Specifically, we investigate the distribution of the magnitudes of singular values of the Jacobian matrix for several generator functions Gθ on MNIST in Figure 3.4, evaluated at 64 noise vectors z randomly sampled from the prior density p(z). Note that the Jacobian here is computed for the forward mapping, while the one used for likelihood evaluation in Eq. (2.13) corresponds to the inverse mapping. The Jacobian is a good first-order approximation of the generator function locally. In Figure 3.4, we observe that the singular value distribution for the Jacobian of an invertible generator learned using MLE (orange curves) is concentrated in a narrow range; hence the Jacobian matrix is well-conditioned and easy to invert. In the case of invertible generators learned using ADV with the Wasserstein distance (green curves), however, the spread of singular values is very wide and hence the Jacobian matrix is ill-conditioned.
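A minimal sketch of this analysis follows (PyTorch; G is a hypothetical generator mapping R^k to R^d, applied to a single latent vector):

import torch

def jacobian_singular_values(G, z):
    # Singular values of the forward Jacobian dG/dz at a single point z,
    # computed via automatic differentiation.
    J = torch.autograd.functional.jacobian(G, z)  # shape (d, k)
    return torch.linalg.svdvals(J)

# Aggregate over 64 prior samples, as in Figure 3.4:
# svals = torch.cat([jacobian_singular_values(G, torch.randn(k))
#                    for _ in range(64)])
# conditioning = svals.max() / svals.min()  # large => ill-conditioned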

As further numerical evidence, the average log determinants of the Jacobian matrices for the MLE, ADV, and Hybrid models trained on MNIST are −4170.34, −15588.34, and −5184.40 respectively, which translates to the trend ADV < Hybrid < MLE. The ill-conditioned Jacobians of adversarially learned generators suggest that these models concentrate their mass on a low-dimensional manifold of the data space, consistent with mode collapse rather than memorization.

3.5 Discussion & Related Work

Any model which allows for log-likelihood evaluation and efficient sampling can be trained using MLE as well as adversarial training. This line of reasoning has been explored to some extent in prior work that combines the objectives of prescribed latent variable models such as VAEs (maximizing an evidence lower bound on the data) with adversarial learning [Lar+15; MNG17a]. However, the benefits of such procedures do not come for ‘‘free’’, since we still need to resort to some form of approximate inference to get a handle on the log-likelihoods. This could be expensive; for instance, combining a VAE with a GAN introduces an additional inference network, increasing the overall model complexity.

Our approach sidesteps the additional complexity due to approximate inference by considering a normalizing flow model. The trade-off made by a normalizing flow model is that the generator function needs to be invertible, while other generative models such as VAEs have no such requirement. On the positive side, we can tractably evaluate the exact log-likelihood assigned by the model to any data point. Parallel and independent of this work, [Dan+17] also considers adversarial learning of Real-NVP. The low-dimensional support of the distributions learned by adversarial learning often manifests as a lack of sample diversity and is referred to as mode collapse. In prior work, mode collapse is detected based on visual inspection or heuristic techniques [Goo16; AZ17]. Techniques for avoiding mode collapse explicitly focus on stabilizing GAN training, such as [Met+16; ACB17; Che+17a; MNG17b; BSM17; Bel+17], rather than on quantitative methods based on likelihoods.

3.6 Conclusion

As an attempt to more quantitatively evaluate generative models, we introduced Flow-GAN, a generative adversarial network which allows for tractable likelihood evaluation, exactly like in a flow model. Since it can be trained both adversarially (like a GAN) and via MLE (like a flow model), we can quantitatively evaluate the trade-offs involved. We observe that adversarial learning assigns very low likelihoods to both training and validation data while generating superior quality samples. To put this observation in perspective, we demonstrate how a naive Gaussian mixture model can outperform adversarially learned models on both log-likelihood estimates and sample quality metrics. Inspecting the distribution of singular value magnitudes for the Jacobian of the generator function provides insights into the contrast between maximum likelihood estimation and adversarial learning. The latter has a tendency to learn distributions of low support, which can lead to low likelihoods. To correct for this behavior, we consider a hybrid objective function which involves loss terms corresponding to both MLE and adversarial learning. On MNIST, the hybrid Flow-GAN objective

outperforms both vanilla MLE and adversarial learning, both in terms of sample quality and validation log-likelihoods. On CIFAR-10, the hybrid objective outperforms MLE in terms of visual sample quality, and is only slightly worse in terms of validation log-likelihood.

Recent Progress. Follow-up work by [Ode+18] builds on the singular value analysis in this chapter to show that controlling the conditioning of the generator's Jacobian during training can both stabilize training and improve sample quality. In [Gro+20], we further extend the use of Flow-GANs to domain translation. Here, the goal is to translate data across domains, e.g., converting satellite imagery to corresponding ground maps using a forward model and vice versa using a reverse model. Beyond the virtues of hybrid MLE and adversarial training, Flow-GANs have an additional advantage here in ensuring that the mapping across domains is exactly cycle-consistent [Zhu+17], i.e., a data point translated from one domain to another using the forward translation model gives back the original data point when translated back using the reverse translation model.

4 Model Bias in Generative Models

4.1 Introduction

In the last chapter, we explored trade-offs in learning objectives for generative modeling. Once learned, these models are used to draw inferences and to plan future actions. For example, in data augmentation, samples from a learned model are used to enrich a dataset for supervised learning [ASE17]. In model-based off-policy policy evaluation (henceforth MBOPE), a learned dynamics model is used to simulate and evaluate a target policy without real-world deployment [Man+07], which is especially valuable for risk-sensitive applications [Tho15]. In spite of the recent successes of deep generative models, there are glaring imperfections even in state-of-the-art models: e.g., image models have noticeable artifacts in background textures [McD18] and language models can fail to generate diverse and coherent dialogue [Li+15]. Moreover, existing theoretical results show that learning distributions in an unbiased manner is either impossible or has prohibitive sample complexity [Ros56; ARZ18]. Consequently, the models used in practice are inherently biased,1 and can lead to misleading downstream inferences. This chapter will focus on the characterization and mitigation of such bias in generative models. In order to address this issue, we start from the observation that many typical uses of generative models involve computing expectations under the model. For instance, in MBOPE, we seek to find the expected return of a policy under a trajectory

1We call a generative model biased if it produces biased statistics relative to the data distribution.


distribution defined by this policy and a learned dynamics model. A classical recipe for correcting the bias in expectations, when samples are available from a distribution different from the ground truth, is to importance weight the samples according to the likelihood ratio [HT52]. If the importance weights were exact, the resulting estimates would be unbiased. But in practice, the likelihood ratio is unknown and needs to be estimated, since the true data distribution is unknown and even the model likelihood is typically intractable or ill-defined for many deep generative models, e.g., variational autoencoders [KW18] and generative adversarial networks [Goo+14]. Our proposed solution for estimating the importance weights is to train a calibrated, probabilistic classifier to distinguish samples from the data distribution and the generative model. As shown in prior work, the output of such classifiers can be used to extract density ratios [SSK12]. Appealingly, this estimation procedure is likelihood-free since it only requires samples from the two distributions. Together, the generative model and the importance weighting function (specified via a binary classifier) induce a new energy function. While exact density estimation and sampling from this induced energy-based model are intractable, we can derive a particle-based approximation which permits efficient sampling via resampling-based methods. We derive conditions on the quality of the weighting function such that the induced model provably improves the fit to the data distribution. Empirically, we evaluate our bias reduction framework on three main sets of experiments. First, we consider goodness-of-fit metrics for evaluating the sample quality of a likelihood-based and a likelihood-free state-of-the-art model on the CIFAR-10 dataset. All these metrics are defined as Monte Carlo estimates from the generated samples. By importance weighting samples, we observe a bias reduction of 23.35% and 13.48% averaged across commonly used sample quality metrics on PixelCNN++ [Sal+17] and SNGAN [Miy+18] models respectively. Second, we demonstrate the utility of our approach on the task of data augmentation for multi-class classification on the Omniglot dataset [LST15]. We show that, while naively extending the dataset with samples from a data augmentation generative adversarial network [ASE17] is not very effective for multi-class classification, we can improve classification accuracy from 66.03% to 68.18% by importance weighting the contributions of each augmented data point.

Finally, we perform MBOPE [PSS00] for standard MuJoCo environments [TET12] with continuous high-dimensional state and action spaces. A typical MBOPE approach is to first estimate a generative model of the dynamics using off-policy data and then evaluate the policy via Monte Carlo [Man+07; TB16]. Here, we observe that correcting the bias of the estimated dynamics model via our proposed approach reduces the root mean squared error (RMSE) of MBOPE by 50.25% on average.

4.2 Preliminaries

We are interested in use cases where the goal is to evaluate or optimize expectations of functions under some distribution p (either equal or close to the data distribution pdata). Assuming access to samples from p as well as some generative model pθ, one extreme is to evaluate the sample average using the samples from p alone. However, this ignores the availability of pθ, through which we have virtually unlimited access to generated samples (ignoring computational constraints) and hence could improve the accuracy of our estimates when pθ is close to p. We begin by presenting a direct motivating use case of data augmentation using generative models for training classifiers which generalize better.

Example use case. Sufficient labeled training data for learning classification and regression systems is often expensive to obtain or susceptible to noise. Data augmentation seeks to overcome this shortcoming by artificially injecting new data points into the training set. These new data points are derived from an existing labeled dataset, either by manual transformations (e.g., rotations and flips for images), or alternatively, learned via a generative model [Rat+17; ASE17].

Consider a supervised learning task over a labeled dataset Dcl. In addition to the feature space x ∈ R^d, this includes a label space y ∈ Y. The dataset Dcl consists of feature and label pairs (x, y), each of which is assumed to be sampled independently from a data distribution pdata(x, y). Further, let Y ⊆ R^k. In order to learn a classifier fψ : X → R^k with parameters ψ, we minimize the expectation of a loss ℓ : Y × R^k → R over the dataset Dcl:

E_{pdata(x,y)}[ℓ(y, fψ(x))] ≈ (1/|Dcl|) ∑_{(x,y)∼Dcl} ℓ(y, fψ(x)).   (4.1)

For example, ℓ could be the cross-entropy loss. A generative model for the task of data augmentation learns a joint distribution pθ(x, y). Several algorithmic variants exist for learning the model's joint distribution, and we defer the specifics to the experiments section. Once the generative model is learned, it can be used to optimize the expected classification loss in Eq. (4.1) under a mixture of the empirical data distribution and the generative model distribution, given as:

pmix(x, y) = m pdata(x, y) + (1 − m) pθ(x, y)   (4.2)

for a suitable choice of the mixture weight m ∈ [0, 1]. Notice that, while the eventual task here is optimization, reliably evaluating the expected loss of a candidate parameter ψ is an important ingredient. We focus on this basic question first, in advance of leveraging the solution for data augmentation. Further, even if evaluating the expectation once is easy, optimization requires repeated evaluation (for different values of ψ), which is significantly more challenging. Also observe that the distribution p under which we seek expectations is the same as pdata here, and we rely on the generalization of pθ to generate transformations of an instance in the dataset which are not explicitly present, but plausibly observed in other, similar instances [Zha+18].

4.3 Likelihood-Free Importance Weighting

Whenever the distribution p, under which we seek expectations, differs from pθ, model-based estimates exhibit bias. In this section, we start out by formalizing bias for Monte Carlo expectations and subsequently propose a bias reduction strategy based on likelihood-free importance weighting (LFIW). We are interested in evaluating expectations of a class of functions of interest f ∈ F w.r.t. the distribution p. For any given f : X → R, we have E_p[f(x)] = ∫ p(x) f(x) dx.

Given access to samples from a generative model pθ, if we knew the densities for both p and pθ, then a classical scheme to evaluate expectations under p using samples from pθ is to use importance sampling [HT52]. We reweight each sample from pθ according to its likelihood ratio under p and pθ and compute a weighted average of the function f over these samples.

E_p[f(x)] = E_{pθ}[ (p(x)/pθ(x)) f(x) ] ≈ (1/T) ∑_{i=1}^T w(xi) f(xi)   (4.3)

where w(xi) := p(xi)/pθ(xi) is the importance weight for xi ∼ pθ. The validity of this procedure is subject to the use of a proposal pθ such that for all x ∈ R^d where pθ(x) = 0, we also have f(x)p(x) = 0.2

To apply this technique to reduce the bias of a generative sampler pθ w.r.t. p, we require knowledge of the importance weights w(x) for any x ∼ pθ. However, we typically only have sample access to p via finite datasets; for instance, in the data augmentation example above, p = pdata is the unknown distribution used to learn pθ. Hence, we need a scheme to learn the weights w(x) using samples from p and pθ, which is the problem we tackle next. In order to do this, we consider a binary classification problem over R^d × {0, 1}, where the joint distribution over data points and their binary labels is denoted as q(x, y). Let γ := q(y = 0)/q(y = 1) > 0 denote any fixed odds ratio. To specify the joint q(x, y), we additionally need the conditional q(x|y), which we define as follows:

q(x|y) = pθ(x) if y = 0, and p(x) otherwise.   (4.4)

2A stronger sufficient, but not necessary, condition that is independent of f states that the proposal pθ is valid if it has a support larger than p, i.e., for all x ∈ R^d, pθ(x) = 0 implies p(x) = 0.

Since we only assume sample access to p and pθ(x), our strategy is to estimate the conditional above by learning a probabilistic binary classifier. To train the classifier, we only require datasets of samples from pθ(x) and p(x), and we estimate γ as the ratio of the sizes of the two datasets. Let cφ : R^d → [0, 1] denote the probability assigned by the classifier with parameters φ to a sample x belonging to the positive class y = 1. As shown in prior work [SSK12; GE18], if cφ is Bayes optimal, then the importance weights can be obtained via this classifier as:

wφ(x) = p(x)/pθ(x) = γ · cφ(x)/(1 − cφ(x)).   (4.5)

In practice, we do not have access to a Bayes optimal classifier and hence the estimated importance weights will not be exact. Consequently, we can hope to reduce the bias, as opposed to eliminating it entirely. Our default LFIW estimator is given as:

E_p[f(x)] ≈ (1/T) ∑_{i=1}^T ŵφ(xi) f(xi)   (4.6)

where ŵφ(xi) = γ · cφ(xi)/(1 − cφ(xi)) is the importance weight for xi ∼ pθ estimated via cφ.
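A minimal sketch of extracting the weights and forming the default estimator of Eqs. (4.5)-(4.6) is given below (NumPy; c_probs is a hypothetical array of the classifier's predicted probabilities on generated samples, and γ is estimated from the two dataset sizes as described above):

import numpy as np

def lfiw_weights(c_probs, n_real, n_gen):
    # Importance weights from classifier probabilities (Eq. (4.5)).
    # c_probs[i] is the probability that generated sample x_i is real.
    gamma = n_gen / n_real  # estimate of the odds ratio q(y=0)/q(y=1)
    return gamma * c_probs / (1.0 - c_probs)

def lfiw_estimate(f_vals, weights):
    # Default LFIW Monte Carlo estimate of E_p[f(x)] (Eq. (4.6)).
    return np.mean(weights * f_vals)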

Practical considerations. Besides imperfections in the classifier, the quality of the generative model also dictates the efficacy of importance weighting. For example, images generated by deep generative models often possess distinct artifacts which can be exploited by the classifier to give highly confident predictions [ODO16; Ode19]. This could lead to very small importance weights for some generated images and, consequently, greater relative variance in the importance weights across the Monte Carlo batch. Below, we present some practical variants of the LFIW estimator to offset this challenge.

1. Self-normalization: The self-normalized LFIW estimator for Monte Carlo evaluation normalizes the importance weights across a sampled batch:

E_p[f(x)] ≈ ∑_{i=1}^T ( ŵφ(xi) / ∑_{j=1}^T ŵφ(xj) ) f(xi), where xi ∼ pθ.   (4.7)

2. Flattening: The flattened LFIW estimator interpolates between uniform importance weights and the default LFIW weights via a power scaling parameter α ≥ 0:

E_p[f(x)] ≈ (1/T) ∑_{i=1}^T ŵφ(xi)^α f(xi), where xi ∼ pθ.   (4.8)

For α = 0, there is no bias correction, and α = 1 returns the default estimator in Eq. (4.6). For intermediate values of α, we can trade off bias reduction against any undesirable variance introduced.

3. Clipping: The clipped LFIW estimator specifies a lower bound β ≥ 0 on the importance weights:

E_p[f(x)] ≈ (1/T) ∑_{i=1}^T max(ŵφ(xi), β) f(xi), where xi ∼ pθ.   (4.9)

When β = 0, we recover the default LFIW estimator in Eq. (4.6). Finally, we note that these estimators are not exclusive and can be combined, e.g., flattened or clipped weights can be normalized.
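Since the variants compose, a minimal sketch implementing all three in one estimator is (NumPy; w and f_vals are hypothetical arrays of estimated weights and function evaluations on generated samples):

import numpy as np

def lfiw_variants(w, f_vals, alpha=1.0, beta=0.0, normalize=False):
    # Flattening (Eq. (4.8)), then clipping from below (Eq. (4.9)).
    w = np.maximum(w ** alpha, beta)
    if normalize:
        # Self-normalization (Eq. (4.7)) of the (possibly modified) weights.
        return np.sum((w / np.sum(w)) * f_vals)
    return np.mean(w * f_vals)  # default estimator of Eq. (4.6)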

Confidence intervals. The bootstrap is a widely used tool in statistics for deriving confidence intervals by fitting ensembles of models on resampled data points. If the dataset is finite, e.g., Dtrain, then the bootstrapped dataset is obtained via random sampling with replacement, and confidence intervals are estimated via the empirical bootstrap. For a parametric model generating the dataset, e.g., pθ, a fresh bootstrapped dataset is resampled from the model, and confidence intervals are estimated via the parametric bootstrap. See [ET94] for a detailed review. Since we have real and generated data coming from a finite dataset and a parametric model respectively, we propose a combination of empirical and parametric bootstraps to derive confidence intervals around the estimated importance weights. In particular, we can estimate the confidence intervals by retraining the classifier on a fresh sample of points from pθ and a resampling of the training dataset Dtrain (with replacement). Repeating this process over multiple runs and then taking a suitable quantile gives us the corresponding confidence intervals.


Figure 4.1: Importance Weight Estimation using Probabilistic Classifiers. (a) Setup: a univariate Gaussian (blue) is fit to samples from a mixture of two Gaussians (red). (b-d) Estimated class probabilities (with 95% confidence intervals based on 1000 bootstraps) for n = 50, 100, and 1000, where n is the number of training data points used for training the generative model and multilayer perceptron.

Synthetic experiment. We visually illustrate our importance weighting approach in a toy experiment (Figure 4.1a). We are given a finite set of samples drawn from a mixture of two Gaussians (red). The model family is a unimodal Gaussian, illustrating mismatch due to a parametric model. The mean and variance of the model are estimated by the empirical mean and variance of the observed data. Using the estimated model parameters, we then draw samples from the model (blue).

In Figure 4.1b, we show the probability assigned by a binary classifier to a point being from the true data distribution. Here, the classifier is a single-hidden-layer multilayer perceptron. The classifier is not Bayes optimal, which can be seen from the gaps between the optimal probability curve (black) and the estimated class probability curve (green). However, as we increase the number of real and generated examples n in Figures 4.1c and 4.1d, the classifier approaches optimality. Furthermore, its uncertainty shrinks with increasing data, as expected. In summary, this experiment demonstrates how a binary classifier can mitigate the bias due to a mismatched generative model.

4.4 Boosted Energy-Based Model

In the previous section, we described a procedure to augment any base generative model pθ with an importance weighting estimator wˆ φ for debiased Monte Carlo evaluation. Here, we will use this augmentation to induce an energy-based model with density pθ,φ given as:

pθ,φ(x) ∝ pθ(x)wˆ φ(x) (4.10)

where the partition function is expressed as Zθ,φ = ∫ pθ(x) ŵφ(x) dx = E_{pθ}[ŵφ(x)].

Using the default measure for pθ(x), the energy function is Eθ,φ(x) = −log pθ(x) − log ŵφ(x). Alternatively, we can change the measure to be pθ(x) itself, in which case the energy function is Eφ(x) = −log ŵφ(x). Another useful interpretation is to think of this procedure as a variant of the product of experts or multiplicative boosting formulation [Hin99; GE18], where the ‘‘first expert” corresponds to the base density pθ(x) and boosting is done via the ‘‘second expert” corresponding to the importance weighting estimator ŵφ.

Density Estimation. Exact density estimation requires a handle on the density of the base model pθ (typically intractable for models such as VAEs and GANs) and estimates of the partition function. Exactly computing the partition function is intractable.

Algorithm 4.1 SIR for the boosted energy-based model pθ,φ
Input: generative model pθ, importance weight estimator ŵφ, budget T
1: Sample x1, x2, . . . , xT independently from pθ
2: Estimate importance weights ŵφ(x1), ŵφ(x2), . . . , ŵφ(xT)
3: Compute Ẑθ,φ ← ∑_{t=1}^T ŵφ(xt)
4: Sample j ∼ Categorical( ŵφ(x1)/Ẑθ,φ, ŵφ(x2)/Ẑθ,φ, . . . , ŵφ(xT)/Ẑθ,φ )
5: return xj

If pθ permits fast sampling and the importance weights are estimated via LFIW (requiring only a forward pass through the classifier network), we can obtain unbiased estimates via a Monte Carlo average, i.e., Zθ,φ ≈ (1/T) ∑_{i=1}^T ŵφ(xi), where xi ∼ pθ. To reduce the variance, a potentially large number of samples is required. Since the samples are obtained independently, the terms in the Monte Carlo average can be evaluated in parallel.

Sampling. While exact sampling from pθ,φ is intractable, we can instead sample from a particle-based approximation to pθ,φ via sampling-importance-resampling (SIR) [LC98; DGA00]. We define the SIR approximation to pθ,φ via the following density:

" # wˆ (x) SIR( ) = E φ ( ) pθ,φ x; T : x2,x3,...,xT∼pθ T pθ x (4.11) wˆ φ(x) + ∑i=2 wˆ φ(xi) where T > 0 denotes the number of independent samples (or ‘‘particles”). For any SIR finite T, sampling from pθ,φ is tractable, as summarized in Algorithm 4.1. Moreover, any expectation w.r.t. the SIR approximation to the induced distribution can be evaluated in closed-form using the self-normalized LFIW estimator (Eq. (4.7)). In the limit of T → ∞, we recover the induced distribution pθ,φ:

lim_{T→∞} p^SIR_{θ,φ}(x; T) = pθ,φ(x)  ∀x.   (4.12)
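A minimal sketch of Algorithm 4.1 follows (NumPy; sample_p_theta and w_hat are hypothetical callables wrapping the base generative model and the LFIW estimator respectively):

import numpy as np

def sir_sample(sample_p_theta, w_hat, T, rng):
    # Draw one sample from the SIR approximation to p_{theta,phi}.
    xs = np.stack([sample_p_theta(rng) for _ in range(T)])  # T particles
    w = np.array([w_hat(x) for x in xs])                    # LFIW weights
    probs = w / w.sum()               # normalize by the estimate Z_hat
    j = rng.choice(T, p=probs)        # resample one particle
    return xs[j]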

Next, we analyze conditions under which the resampled density pθ,φ provably improves the model fit to pdata. In order to do so, we further assume that pdata is absolutely continuous w.r.t. pθ and pθ,φ. We define the change in KL via the importance resampled density as:

∆(pdata, pθ, pθ,φ) := DKL(pdata, pθ) − DKL(pdata, pθ,φ),   (4.13)

so that ∆ ≥ 0 if and only if the resampled density improves the fit.

Substituting Eq. (4.10) in Eq. (4.13), we can simplify the above quantity as:

∆(pdata, pθ, pθ,φ) = E_{pdata}[log(pθ(x)ŵφ(x)) − log Zθ,φ − log pθ(x)]   (4.14)

= E_{pdata}[log ŵφ(x)] − log E_{pθ}[ŵφ(x)].   (4.15)

The above expression provides a necessary and sufficient condition for any positive real-valued function (such as the LFIW classifier in Section 4.3) to improve the KL divergence fit to the underlying data distribution. In practice, an unbiased estimate of the LHS can be obtained via Monte Carlo averaging of log-importance weights based on Dtrain. The empirical estimate for the RHS is, however, biased.3 To remedy this shortcoming, we consider the following necessary but insufficient condition.

Proposition 4.1. If ∆(pdata, pθ, pθ,φ) ≥ 0, then the following conditions hold:

E_{pdata}[ŵφ(x)] ≥ E_{pθ}[ŵφ(x)],   (4.16)

E_{pdata}[log ŵφ(x)] ≥ E_{pθ}[log ŵφ(x)].   (4.17)

The conditions in Eq. (4.16) and Eq. (4.17) follow directly via Jensen’s inequality applied to the LHS and RHS of Eq. (4.15) respectively. Here, we note that estimates for the expectations in Eqs. 4.16-4.17 based on Monte Carlo averaging of (log-) importance weights are unbiased.

3If Ẑ is an unbiased estimate for Z, then log Ẑ is a biased estimate for log Z via Jensen's inequality.

Table 4.1: Goodness-of-fit evaluation on CIFAR-10 dataset for PixelCNN++ and SNGAN. Standard errors computed over 10 runs. Higher IS is better. Lower FID and KID scores are better.

Model        Evaluation              IS (↑)           FID (↓)          KID (↓)
-            Reference               11.09 ± 0.1263   5.20 ± 0.0533    0.008 ± 0.0004
PixelCNN++   Default (no debiasing)  5.16 ± 0.0117    58.70 ± 0.0506   0.196 ± 0.0001
             LFIW                    6.68 ± 0.0773    55.83 ± 0.9695   0.126 ± 0.0009
SNGAN        Default (no debiasing)  8.33 ± 0.0280    20.40 ± 0.0747   0.094 ± 0.0002
             LFIW                    8.57 ± 0.0325    17.29 ± 0.0698   0.073 ± 0.0004

4.5 Empirical Evaluation

In all our experiments, the binary classifier for estimating the importance weights was a calibrated deep neural network trained to minimize the cross-entropy loss. The self-normalized LFIW in Eq. (4.7) worked best.

4.5.1 Goodness-of-fit testing

In the first set of experiments, we highlight the benefits of importance weighting for a debiased evaluation of three popularly used sample quality metrics, viz. Inception Score (IS) [Sal+16], Frechet Inception Distance (FID) [Heu+17], and Kernel Inception Distance (KID) [Biń+18]. All these scores can be formally expressed as empirical expectations with respect to the model. For all these metrics, we can simulate the population-level unbiased case as a ‘‘reference score”, wherein we artificially set both the real and generated sets of samples used for evaluation as finite, disjoint sets derived from pdata. We evaluate the three metrics for two state-of-the-art models trained on the CIFAR-10 dataset: (a) an autoregressive model, PixelCNN++, learned via maximum likelihood estimation [Sal+17], and (b) a latent variable model, SNGAN, learned via adversarial training [Miy+18]. For evaluating each metric, we draw 10,000 samples from the model. In Table 4.1, we report the metrics with and without the LFIW bias correction. The consistently debiased evaluation of these metrics via self-normalized LFIW suggests that the SIR approximation to the importance resampled distribution (Eq. (4.11)) is a better fit to pdata.

Table 4.2: Classification accuracy on the Omniglot dataset. Standard errors com- puted over 5 runs.

Dataset            Accuracy
Dcl                0.6603 ± 0.0012
Dg                 0.4431 ± 0.0054
Dg w/ LFIW         0.4481 ± 0.0056
Dcl + Dg           0.6600 ± 0.0040
Dcl + Dg w/ LFIW   0.6818 ± 0.0022


4.5.2 Data augmentation for multi-class classification

We consider data augmentation using a generative model for multi-class classification. The dataset we consider is the Omniglot dataset of handwritten characters [LST15], for which we use a pretrained Data Augmentation Generative Adversarial Network (DAGAN) [ASE17]. The dataset is particularly relevant because it contains 1600+ classes but only 20 examples from each class and hence could potentially benefit from augmented data. Once the model has been trained, it can be used for data augmentation in many ways. In particular, we consider ablation baselines that use various combinations of the real training data Dcl and generated data Dg for training a downstream classifier. When the generated data Dg is used, we can either use the data directly, with uniform weighting for all training points, or choose to importance weight (LFIW) the contributions of the individual training points to the overall loss. The results are shown in Table 4.2. While generated data (Dg) alone cannot be used to obtain competitive performance relative to the real data (Dcl) on this task, as expected, the bias it introduces for evaluation and subsequent optimization overshadows even the naive data augmentation (Dcl + Dg). In contrast, we can obtain significant improvements by importance weighting the generated points (Dcl + Dg w/ LFIW). Qualitatively, we can observe the effect of importance weighting in Figure 4.2.


Figure 4.2: Qualitative evaluation of importance weighting for data augmentation. (a-f) Within each figure, top row shows held-out data samples from a specific class in Omniglot. Bottom row shows generated samples from the same class ranked in decreasing order of importance weights.

Here, we show true and generated samples for 6 randomly chosen classes (a-f) in the Omniglot dataset. The generated samples are ranked in decreasing order of the importance weights. There is no way to formally test the validity of such rankings, and this criterion can also prefer points which have high density under pdata but are unlikely under pθ, since we are looking at ratios. Visual inspection suggests that the classifier is able to appropriately downweight poorer samples, as shown in Figure 4.2 (a, b, c, d - bottom right). There are also failure modes, such as the lowest ranked generated images in Figure 4.2 (e, f - bottom right), where the classifier weights reasonable generated samples poorly relative to others. This could be due to particular artifacts, such as a tiny disconnected blurry speck in Figure 4.2 (e - bottom right), which could be more revealing to a classifier distinguishing real and generated data.

4.5.3 Model-based off-policy policy evaluation

So far, we have seen use cases where the generative model was trained on data from the same distribution we wish to use for Monte Carlo evaluation. We can extend our debiasing framework to more involved settings where the generative model is a building block for specifying the full data generation process, e.g., trajectory data generated via a dynamics model along with an agent policy. In particular, we consider the setting of off-policy policy evaluation (OPE), where the goal is to evaluate policies using experiences collected from a different policy. Formally, let (S, A, r, P, η, T) denote an (undiscounted) Markov decision process with state space S, action space A, reward function r, transition P, initial state distribution η, and horizon T. Assume πe : S × A → [0, 1] is a known policy that we wish to evaluate. The probability of generating a certain trajectory τ = {s0, a0, s1, a1, . . . , sT, aT} of length T with policy πe and transition P is given as:

p*(τ) = η(s0) ∏_{t=0}^{T−1} πe(at|st) P(st+1|st, at).   (4.18)

The return of a trajectory R(τ) is the sum of the rewards across the state-action pairs in τ: R(τ) = ∑_{t=1}^T r(st, at), where we assume a known reward function r. We are interested in the value of a policy, defined as v(πe) = E_{p*}[R(τ)]. Evaluating πe requires the (unknown) transition dynamics P. The dynamics model is a conditional generative model of the next state st+1 conditioned on the previous state-action pair (st, at). If we have access to historical logged data Dτ of trajectories τ = {s0, a0, s1, a1, . . .} from some behavioral policy πb : S × A → [0, 1], then we can use this off-policy data to train a dynamics model pθ(st+1|st, at). The policy πe can then be evaluated under this learned dynamics model as ṽ(πe) = E_{p̃}[R(τ)], where p̃ uses pθ instead of the true dynamics in Eq. (4.18).

However, the trajectories sampled from pθ could significantly deviate from samples from P due to compounding errors [RB10]. In order to correct for this bias, we can use likelihood-free importance weighting on entire trajectories of data. The binary classifier c(st, at, st+1) for estimating the importance weights in this case distinguishes between triples of true and generated transitions. For any true triple (st, at, st+1) extracted from the off-policy data, the corresponding generated triple (st, at, ŝt+1) only differs in the final transition state, i.e., ŝt+1 ∼ pθ(ŝt+1|st, at). Such a classifier allows us to obtain the importance weights ŵ(st, at, ŝt+1) for every predicted state transition (st, at, ŝt+1). The importance weights for the trajectory τ can be derived from the importance weights of these individual transitions as:

$$\frac{p^*(\tau)}{\tilde{p}(\tau)} = \frac{\prod_{t=0}^{T-1} P(s_{t+1} \mid s_t, a_t)}{\prod_{t=0}^{T-1} p_\theta(s_{t+1} \mid s_t, a_t)} = \prod_{t=0}^{T-1} \frac{P(s_{t+1} \mid s_t, a_t)}{p_\theta(s_{t+1} \mid s_t, a_t)} \approx \prod_{t=0}^{T-1} \hat{w}(s_t, a_t, \hat{s}_{t+1}). \tag{4.19}$$

Our final LFIW estimator is given as:

"T−1 # vˆ(πe) = Ep˜ ∏ wˆ (st, at, sˆt+1) · R(τ) . (4.20) t=0

In Appendix B.2.5, we also consider a stepwise LFIW estimator for model-based OPE (MBOPE), which applies importance weighting at the level of every decision as opposed to entire trajectories.
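As an illustrative sketch (with hypothetical helper names), the trajectory-level estimator in Eq. (4.20), including the partial-horizon variant described in the setup below, can be computed from model rollouts as follows; `classifier_prob` is assumed to return c(st, at, ŝt+1), the probability that a transition is real:

```python
import numpy as np

def lfiw_ope(trajectories, classifier_prob, H=None):
    """Self-normalized LFIW estimate of v(pi_e) from model-based rollouts.

    Each trajectory is a list of (s, a, s_next, r) tuples sampled from the
    learned dynamics p_theta; weights beyond step H (if given) are uniform.
    """
    weights, returns = [], []
    for tau in trajectories:
        w = 1.0
        for t, (s, a, s_next, r) in enumerate(tau):
            if H is None or t < H:
                c = classifier_prob(s, a, s_next)  # P(real | s, a, s_next)
                w *= c / (1.0 - c)                 # per-transition ratio, Eq. (4.19)
        weights.append(w)
        returns.append(sum(r for (_, _, _, r) in tau))
    weights = np.asarray(weights)
    return float(np.sum(weights * np.asarray(returns)) / np.sum(weights))
```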

Setup. We consider three continuous control tasks in the MuJoCo simulator [TET12] from OpenAI gym [Bro+16] (in increasing number of state dimensions): Swimmer, HalfCheetah, and HumanoidStandup. High-dimensional state spaces make it challenging to learn a reliable dynamics model in these environments. We train behavioral and evaluation policies using Proximal Policy Optimization [Sch+17a] with different hyperparameters for the two policies. The dataset collected via trajectories from the behavior policy is used to train an ensemble neural network dynamics model. We then use the trained dynamics model to evaluate ṽ(πe) and its LFIW version v̂(πe), and compare them with the ground truth returns v(πe). Each estimate is averaged over a set of 100 trajectories with horizon T = 100.

Table 4.3: Off-policy policy evaluation on MuJoCo tasks. Standard error is over 10 Monte Carlo estimates where each estimate contains 100 randomly sampled trajectories.

Environment    v(πe) (Ground Truth)   ṽ(πe)          v̂(πe) (LFIW)   v̂80(πe) (LFIW)
Swimmer        36.7 ± 0.1             100.4 ± 3.2    25.7 ± 3.1     47.6 ± 4.8
HalfCheetah    241.7 ± 3.56           204.0 ± 0.8    217.8 ± 4.0    219.1 ± 1.6
Humanoid       14170 ± 53             8417 ± 28      9372 ± 375     9221 ± 381

Specifically, for v̂(πe), we also average the estimate over 10 classifier instances trained with different random seeds on different trajectories. We further consider performing LFIW over only the first H steps and using uniform weights for the remainder, which we denote as v̂H(πe). This allows us to interpolate between ṽ(πe) ≡ v̂0(πe) and v̂(πe) ≡ v̂T(πe). Finally, as in the other experiments, we use the self-normalized variant (Eq. (4.7)) of the importance weighted estimator in Eq. (4.20).

Results. We compare the policy evaluations under the different environments in Table 4.3. These results show that the rewards estimated with the trained dynamics model differ from the ground truth by a large margin. By importance weighting the trajectories, we obtain much more accurate policy evaluations. As expected, we also see that while LFIW leads to more accurate returns on average, the imbalance in trajectory importance weights due to the multiplicative weights of the state-action pairs can lead to higher variance in the importance weighted returns. In Figure 4.3, we demonstrate that policy evaluation becomes more accurate as more timesteps are used for LFIW evaluation, until around 80-100 timesteps, which empirically validates the benefits of importance weighting using a classifier. Given that our estimates have a large variance, it would be worthwhile to compose our approach with other variance reduction techniques such as (weighted) doubly robust estimation in future work [FCG18], as well as to incorporate these estimates within a framework such as MAGIC to further blend with model-free OPE [TB16].

(Figure 4.3 panels: Swimmer, HalfCheetah, and HumanoidStandup; y-axis: estimation error |δ(v)|, x-axis: H.)

Figure 4.3: Estimation error δ(v) = v(πe) − vˆH(πe) for different values of H (min 0, max 100). Shaded area denotes standard error over 10 random seeds.

Overall. Across all our experiments, we observe that importance weighting the generated samples leads to uniformly better results, whether in terms of evaluating the quality of samples or their utility in downstream tasks. Since the technique is a black-box wrapper around any generative model, we expect it to benefit a diverse set of tasks in follow-up works.

However, some caution must also be exercised with these techniques, as evident from the results in Table 4.1. Note that in this table, the confidence intervals (computed using the reported standard errors) around the model scores after importance weighting still do not contain the reference scores obtained from the true model. This would not have been the case if our debiased estimator were completely unbiased, and this observation reiterates our earlier claim that LFIW reduces bias, as opposed to completely eliminating it. Indeed, when such a mismatch is observed, it is a good diagnostic to either learn more powerful classifiers to better approximate the Bayes optimum, or find additional data from pdata in case the generative model fails the full support assumption.

4.6 Discussion & Related Work

Density ratios enjoy widespread use across machine learning, e.g., for handling covariate shift and class imbalance [SSK12; BL18]. In generative modeling, estimating these ratios via binary classifiers is frequently used for defining learning objectives and two-sample tests [ML16; Gre+07; Bow+15; LO16; Dan+17; Ros+17; Im+18; GRM19]. In particular, such classifiers have been used to define learning frameworks such as generative adversarial networks [Goo+14; NCT16], likelihood-free Approximate Bayesian Computation (ABC) [GH12], and earlier work in unsupervised-as-supervised learning [FHT01] and noise contrastive estimation [GH12], among others. These works are explicitly interested in learning the parameters of a generative model. In contrast, we use the binary classifier for estimating importance weights to correct for the model bias of any fixed generative model, and in doing so, we include the classifier as part of the induced generative model. In another line of work, [Die+18] and [Cho+20] used importance weighting to reweight data points based on differences in training and test data distributions, i.e., dataset bias. We will discuss this setting in the following chapter.

Recent concurrent works use MCMC and rejection sampling to explicitly transform or reject the generated samples [Tur+18; Aza+18; Tao+18]. These methods require extra computation beyond training a classifier, in rejecting the samples or running Markov chains to convergence, unlike the proposed importance weighting strategy. For many model-based Monte Carlo evaluation use cases (e.g., data augmentation, MBOPE), this extra computation is unnecessary. If samples or density estimates are explicitly needed from the induced resampled distribution, we presented a particle-based approximation to the induced density, where the number of particles is a tunable knob allowing for trading statistical accuracy against computational efficiency. Finally, we note that resampling-based techniques have been extensively studied in the context of improving variational approximations for latent variable generative models [BGS15; SKW15; Nae+17; Gro+18a].

4.7 Conclusion

We identified bias with respect to a target data distribution as a fundamental challenge restricting the use of generative models as proposal distributions for Monte Carlo evaluation. We proposed a bias correction framework based on importance weighting, where the importance weights are estimated in a likelihood-free fashion via a binary classifier. Empirically, we find the bias correction to be useful across a variety of tasks, including goodness-of-fit sample quality tests, data augmentation, and off-policy policy evaluation.

We also characterized the induced energy-based generative model, along with an approximate sampling procedure based on sequential importance resampling and the theoretical conditions under which it improves the base generative model. In a recent follow-up work, [AZG20] extend this framework to update both the base generative model and the energy function during learning, and observe improved performance on density estimation and sampling using the induced energy-based model. In summary, these works show that such approaches sequentially increase model capacity to fix the model bias of a base generative model. The source of this bias was the mismatch between the base generative model family and the true data distribution, or simply challenges in optimization. The next chapter will focus on another piece of the puzzle: bias that can occur due to unlabelled and uncurated datasets.

5 Dataset Bias in Generative Models

5.1 Introduction

Increasingly, we are able to train generative models on large unannotated datasets and deploy them in the real world. Examples of such production-level systems include Transformer-based models such as BERT and GPT-3 for natural language generation [Vas+17; Dev+18; Rad+19; Bro+20], WaveNet for text-to-speech synthesis [Oor+17], and a large number of creative applications, such as Coconet, used for designing the ‘‘first AI-powered” Google Doodle [Hua+17]. As these generative applications become more prevalent, it becomes increasingly important to consider questions regarding the potential discriminatory nature of such systems and ways to mitigate it [Pod+14]. For example, some natural language generation systems trained on internet-scale datasets have been shown to produce generations that are biased towards certain demographics [She+19].

A variety of socio-technical factors contribute to the discriminatory nature of ML systems [BHN18]. A major factor is the existence of biases in the training data itself [TE+11; Tom+17]. Since data is the fuel of machine learning, any existing bias in the dataset can be propagated to the learned model [BS16]. This is a particularly pressing concern for generative models, which can easily amplify the bias by generating more of the biased data at test time. Further, learning a generative model is fundamentally an unsupervised learning problem and hence, the bias factors of interest are typically latent. For example, while learning a generative model of human faces, we often do not have access to attributes such as gender, race, and age. Any existing bias in the dataset with respect to these attributes is easily picked up by deep generative models. See Figure 5.1 for an illustration.

Figure 5.1: Samples from a baseline BigGAN that reflect the gender bias underlying the true data distribution in CelebA. All faces above the orange line (67%) are classified as female, while the rest are labeled as male (33%).

We present a model-agnostic bias mitigation approach for learning fair generative models even in the presence of dataset bias. In our setup, we assume access to multiple unlabelled (potentially biased) datasets in addition to a particular target (reference) dataset. The unlabelled datasets are relatively cheap to obtain in many domains in the big data era, but our goal is to ultimately generate samples that are close in distribution to a particular target (reference) dataset. We note that while there may not be a concept of a reference dataset devoid of bias, carefully designed representative data collection practices may be more accurately reflected in some data sources [Geb+18], which can then be considered as reference datasets. As a concrete example of such a reference dataset, organizations such as the World Bank and certain biotech firms [23m16; Hon16] typically have quality checks to ensure representativeness in their collected datasets. We note two key features of our setup that are well aligned with real-world scenarios: (a) we do not require the biased or reference datasets to be labeled w.r.t. the latent bias attributes, and (b) the size of the curated reference dataset can be much smaller than the biased dataset.

Using a reference dataset to augment a biased dataset, our goal is to learn a generative model that best approximates the reference data distribution. Simply using the reference dataset alone for learning is an option, but this may not suffice since this dataset can be too small to learn an expressive model that accurately captures the underlying reference data distribution.

Our approach to learning a fair generative model that is robust to biases in the larger training set is based on importance reweighting. In particular, we learn a generative model that reweights the data points in the biased dataset based on the ratio of densities assigned by the reference data distribution as compared to the biased data distribution. Since we do not have access to explicit densities under either of the two distributions, we estimate the weights using a probabilistic classifier [SSK12; ML16].

We test our bias mitigation approach by learning generative adversarial networks on the CelebA dataset [Liu+15]. The dataset consists of attributes such as gender and hair color, which we use for designing biased and reference data splits and for subsequent evaluation. Empirically, we demonstrate the efficacy of our approach, which reduces bias w.r.t. the latent factors by an average of up to 34.6% over baselines for comparable image generation using generative adversarial networks on this dataset.

5.2 Preliminaries

Our bias mitigation framework is agnostic to the choice of generative modeling framework. For generality, we consider expectation-based learning objectives, where ℓ(·) is a per-example loss that depends on both examples x drawn from a dataset D and the model parameters θ:

$$\mathbb{E}_{p_{\rm data}}[\ell(x, \theta)] \approx \frac{1}{T} \sum_{i=1}^{T} \ell(x_i, \theta) := \mathcal{L}(\theta; D) \tag{5.1}$$

The above expression encompasses all the objectives discussed in Chapter 2. For example, if ℓ(·) denotes the negative log-likelihood assigned to the point x as per pθ, then we recover the MLE training objective.

Dataset Bias. The standard assumption for learning a generative model is that we have access to a sufficiently large dataset Dref of training examples, where each x ∈ Dref is assumed to be sampled independently from a reference distribution pdata = pref. In practice however, collecting large datasets that are i.i.d. w.r.t. pref is difficult due to a variety of socio-technical factors. The sample complexity for learning probability distributions can be doubly-exponential in the dimension in many cases [ARZ18], surpassing the size of the largest available datasets.

We can partially offset this difficulty by considering data from alternate sources related to the target distribution, e.g., images scraped from the Internet. However, these additional data points are not expected to be i.i.d. w.r.t. pref. We characterize this phenomenon as dataset bias, where we assume the availability of a dataset Dbias, such that the examples x ∈ Dbias are sampled independently from a biased (unknown) distribution pbias that is different from pref, but shares the same support.

5.3 Learning with Multiple Data Sources

We assume a learning setting where we are given access to a data source Dbias in addition to a dataset of training examples Dref. Our goal is to capitalize on both data sources Dbias and Dref for learning a model pθ that best approximates the target distribution pref. We begin by discussing two baseline approaches for learning with multiple data sources and the trade-offs involved.

Baseline 1. One could completely ignore Dbias and consider learning pθ based on Dref alone. Since we only consider proper losses w.r.t. pref, global optimization of the objective in Eq. (5.1) in a well-specified model family will recover the reference data distribution as |Dref| → ∞. However, since Dref is finite in practice, this approach is sample inefficient and the sample quality will be poor.

Baseline 2. On the other extreme, we can consider learning pθ based on the full dataset consisting of both Dref and Dbias. This procedure will be data efficient and could lead to high sample quality, but it comes at the cost of fairness since the learned distribution will be heavily biased w.r.t. pref.

Both baselines fall short in achieving the twin goals of data efficiency (a challenge for Baseline 1) without introducing dataset bias (a challenge for Baseline 2). Next, we consider two approaches that seek to satisfy these desiderata.

Approach 1: Dataset Conditional Modeling. Our first proposal is to learn a generative model conditioned on the identity of the dataset used during training. Formally, we learn a generative model pθ(x|y), where y ∈ {0, 1} is a binary random variable indicating whether the model distribution was learned to approximate the data distribution corresponding to Dref (i.e., pref) or Dbias (i.e., pbias). By sharing the model parameters θ across the two values of y, we hope to leverage both data sources. At test time, conditioning on y for Dref should result in fair generations. In practice, this simple approach however does not achieve the intended effect, as we shall demonstrate in our experiments. The likely explanation is that the conditioning information is too weak for the model to infer the bias factors and effectively distinguish between the two distributions. Next, we present an alternate two-phased approach based on density ratio estimation, which effectively overcomes the dataset bias in a data-efficient manner.

Approach 2: Importance Reweighting. Recall that Baseline 2 learns a generative model on simply the union of Dbias and Dref. This method is problematic because it assigns equal weight to the loss contributions from each individual data point in Eq. (5.1), regardless of whether the data point comes from Dbias or Dref. For example, in situations where the dataset bias causes a minority group to be underrepresented, this objective will encourage the model to focus on the majority group so that the overall value of the loss is minimized on average with respect to a biased empirical distribution, i.e., a weighted mixture of pbias and pref with weights proportional to |Dbias| and |Dref|.

Our key idea is to reweight the data points from Dbias during training such that the model learns to downweight over-represented data points from Dbias while simultaneously upweighting the under-represented points from Dref. The challenge in the unsupervised context is that we do not have direct supervision on which points are over- or under-represented and by how much. To resolve this issue, we consider importance sampling [HT52]. Whenever we are given data from two distributions, w.l.o.g. say p and q, and wish to evaluate a sample average w.r.t. p given samples from q, we can do so by reweighting the samples from q by the ratio of densities assigned to the sampled points by p and q. In our setting, the distributions of interest are p = pref and q = pbias respectively. Hence, an importance weighted objective for learning from Dbias is:

$$\mathbb{E}_{p_{\rm ref}}[\ell(x,\theta)] = \mathbb{E}_{p_{\rm bias}}\left[\frac{p_{\rm ref}(x)}{p_{\rm bias}(x)}\, \ell(x,\theta)\right] \tag{5.2}$$
$$\approx \frac{1}{T}\sum_{i=1}^{T} w(x_i)\, \ell(x_i,\theta) := \mathcal{L}(\theta; D_{\rm bias}) \tag{5.3}$$

where $w(x_i) := \frac{p_{\rm ref}(x_i)}{p_{\rm bias}(x_i)}$ is defined to be the importance weight for xi ∼ pbias.

To estimate the importance weights, we use a binary classifier as described below [SSK12]. Consider a binary classification problem with classes Y ∈ {0, 1} and training data generated as follows. First, we fix a prior probability for p(Y = 1). Then, we repeatedly sample y ∼ p(Y). If y = 1, we independently sample a data point x ∼ pref; else we sample x ∼ pbias. Then, as shown in [FHT01], the ratio of densities pref and pbias assigned to an arbitrary point x can be recovered via a Bayes optimal (probabilistic) classifier c∗ : X → [0, 1]:

$$w(x) = \frac{p_{\rm ref}(x)}{p_{\rm bias}(x)} = \gamma\, \frac{c^*(Y=1 \mid x)}{1 - c^*(Y=1 \mid x)} \tag{5.4}$$

where c(Y = 1|x) is the probability assigned by the classifier to the point x belonging to class Y = 1, and γ = p(Y = 0)/p(Y = 1) is the ratio of marginals of the labels for the two classes. In practice, we do not have direct access to either pbias or pref and hence, our training data consists of points sampled from the empirical data distributions defined uniformly over Dref and Dbias. Further, we may not be able to learn a Bayes optimal classifier, and we denote the importance weights estimated by the learned classifier c for a point x as ŵ(x).
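As a minimal sketch of Eq. (5.4), classifier probabilities can be converted into importance weights as follows, where γ is estimated from the relative dataset sizes (an illustrative assumption) and clipping is added for numerical stability:

```python
import numpy as np

def estimate_weights(c_prob, n_bias, n_ref, eps=1e-6):
    """Importance weights w_hat(x) for x in D_bias, following Eq. (5.4).

    c_prob: classifier probabilities c(Y=1 | x), i.e., x judged as from D_ref.
    gamma = p(Y=0) / p(Y=1), estimated here by the ratio of dataset sizes.
    """
    gamma = n_bias / n_ref
    c = np.clip(np.asarray(c_prob), eps, 1.0 - eps)  # avoid division by zero
    return gamma * c / (1.0 - c)
```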

Algorithm 5.1 Mitigating Dataset Bias in Generative Models

Inputs: Datasets Dbias and Dref; classifier and generative model architectures & hyperparameters
Output: Generative model parameters θ

1:  ▷ Phase 1: Estimate importance weights
2:  Learn binary classifier c for distinguishing (Dbias, Y = 0) vs. (Dref, Y = 1)
3:  Estimate importance weight ŵ(x) ← c(Y = 1|x)/c(Y = 0|x) for all x ∈ Dbias (using Eq. (5.4))
4:  Set importance weight ŵ(x) ← 1 for all x ∈ Dref
5:  ▷ Phase 2: Minibatch gradient descent on θ based on weighted loss
6:  Initialize model parameters θ at random
7:  Set full dataset D ← Dbias ∪ Dref
8:  while training do
9:    Sample a batch of points B from D at random
10:   Set loss L(θ; D) ← (1/|B|) ∑_{xi ∈ B} ŵ(xi) ℓ(xi, θ)
11:   Estimate gradients ∇θ L(θ; D) and update parameters θ based on the optimizer update rule
12: end while
13: return θ

Our overall procedure is summarized in Algorithm 5.1. We use deep neural networks for parameterizing the binary classifier and the generative model. Given a biased and a reference dataset along with the network architectures and other standard hyperparameters (e.g., learning rate, optimizer, etc.), we first learn a probabilistic binary classifier (Line 2). The learned classifier provides importance weights for the data points from Dbias via estimates of the density ratios (Line 3). For the data points from Dref, we do not need to perform any reweighting and set the importance weights to 1 (Line 4). Using the combined dataset Dbias ∪ Dref, we then learn the generative model pθ, where the minibatch loss for every gradient update weights the contributions from each data point (Lines 6-12).

For a practical implementation, it is best to account for some diagnostics and best practices while executing Algorithm 5.1. For density ratio estimation, we test that the classifier is calibrated on a held-out set. This is a necessary (but insufficient) check for the estimated density ratios to be meaningful. If the classifier is miscalibrated, we can apply standard recalibration techniques such as Platt scaling before estimating the importance weights. Furthermore, while optimizing the model using a weighted objective, there can be increased variance across the loss contributions from each example in a minibatch due to importance weighting. We did not observe this in our experiments, but techniques such as normalization of weights within a batch can potentially help control the unintended variance introduced within a batch [SSK12].

The performance of Algorithm 5.1 critically depends on the quality of the estimated density ratios, which in turn is dictated by the training of the binary classifier. We define the expected negative cross-entropy (NCE) objective for a classifier c as:

$$\mathrm{NCE}(c) := \frac{1}{\gamma+1}\, \mathbb{E}_{p_{\rm ref}(x)}[\log c(Y=1 \mid x)] + \frac{\gamma}{\gamma+1}\, \mathbb{E}_{p_{\rm bias}(x)}[\log c(Y=0 \mid x)]. \tag{5.5}$$

In the following result, we characterize the NCE loss for the Bayes optimal classifier.

Theorem 5.1. Let Z denote a set of unobserved bias variables. Suppose there exist two joint distributions pbias(x, z) and pref(x, z) over x ∈ X and z ∈ Z. Let pbias(x) and pbias(z) denote the marginals over x and z for the joint pbias(x, z), with similar notation for the joint pref(x, z). Assume that for all k:

$$p_{\rm bias}(x \mid z=k) = p_{\rm ref}(x \mid z=k) \quad \forall k \tag{5.6}$$

and that pbias(x|z = k) and pbias(x|z = k′) have disjoint supports for k ≠ k′. Then, the negative cross-entropy of the Bayes optimal classifier c∗ is given as:

$$\mathrm{NCE}(c^*) = \frac{1}{\gamma+1}\, \mathbb{E}_{p_{\rm ref}(z)}\left[\log \frac{1}{\gamma b(z) + 1}\right] + \frac{\gamma}{\gamma+1}\, \mathbb{E}_{p_{\rm bias}(z)}\left[\log \frac{\gamma b(z)}{\gamma b(z) + 1}\right] \tag{5.7}$$

where b(z) = pbias(z)/pref(z).

Proof. See Appendix A.1.

For example, as we shall see in our experiments in the following section, the inputs x can correspond to face images, whereas the unobserved z represents sensitive bias factors for a subgroup, such as gender or ethnicity. The proportion of examples x belonging to a subgroup can differ across the biased and reference datasets, with the relative proportions given by b(z). Note that the above result only requires knowing these relative proportions and not the true z for each x.

The practical implication is that, under the assumptions of Theorem 5.1, we can check the quality of the density ratios estimated by an arbitrary learned classifier c by comparing its empirical NCE with the theoretical NCE of the Bayes optimal classifier in Eq. (5.7) (see Section 5.4.1).
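For instance, the reference value in Eq. (5.7) can be computed directly from the subgroup proportions; a small sketch, with γ and the marginals over z supplied by the experiment design:

```python
import numpy as np

def bayes_optimal_nce(p_ref_z, p_bias_z, gamma):
    """Negative cross-entropy of the Bayes optimal classifier, Eq. (5.7).

    p_ref_z, p_bias_z: marginal probabilities of each subgroup z under
    p_ref and p_bias; gamma = p(Y=0)/p(Y=1). Returns NCE(c*); the
    corresponding cross-entropy loss is -NCE(c*).
    """
    p_ref_z, p_bias_z = np.asarray(p_ref_z), np.asarray(p_bias_z)
    b = p_bias_z / p_ref_z                                    # b(z)
    term_ref = np.sum(p_ref_z * np.log(1.0 / (gamma * b + 1.0)))
    term_bias = np.sum(p_bias_z * np.log(gamma * b / (gamma * b + 1.0)))
    return term_ref / (gamma + 1.0) + gamma * term_bias / (gamma + 1.0)

# Setting 1 (single, bias=0.9) with gamma = 1 gives a loss of about 0.591,
# matching the Bayes optimal value reported in Table 5.1:
print(-bayes_optimal_nce([0.5, 0.5], [0.9, 0.1], gamma=1.0))
```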

5.4 Empirical Evaluation

In this section, we are interested in investigating two broad questions empirically:

1. How well can we estimate density ratios between multiple unlabelled datasets?

2. How effective is the reweighting technique for mitigating the dataset bias in learning fair generative models?

Dataset. We consider the CelebA dataset [Liu+15], which is commonly used for benchmarking deep generative models and comprises images of faces with 40 labeled binary attributes. We use this attribute information to construct 3 different settings for partitioning the full dataset into Dbias and Dref.

• Setting 1 (single, bias=0.9): We set z to be a single bias variable corresponding to ‘‘gender” with values 0 (female) and 1 (male), and b(z = 0) = 0.9. Specifically, this means that Dref contains the same fraction of male and female images, whereas Dbias contains a 0.9 fraction of females and the rest males.

• Setting 2 (single, bias=0.8): We use the same bias variable (gender) as Setting 1 with b(z = 0) = 0.8.

• Setting 3 (multi): We set z as two bias variables corresponding to ‘‘gender” and ‘‘black hair”. In total, we have 4 subgroups: females without black hair (00), females with black hair (01), males without black hair (10), and males with black hair (11). We set b(z = 00) = 0.437, b(z = 01) = 0.063, b(z = 10) = 0.415, and b(z = 11) = 0.085.

We emphasize that the attribute information is used only for designing controlled biased and reference datasets and for faithful evaluation. Algorithm 5.1 does not explicitly require such labeled information.

Evaluation metrics. We are interested in evaluating metrics that relate to both sample quality and fairness w.r.t. the latent bias attributes.

1. Sample quality: Standard metrics such as Fréchet Inception Distance (FID) [Heu+17] can be used here. Note that we are interested in measuring these metrics w.r.t. the reference data distribution pref, even if the model has been trained using both Dref and Dbias.

2. Fairness: Alternatively, we can evaluate the bias of generative models specifically in the context of some sensitive latent variables, say u ∈ Rk. For example, u may correspond to the gender and hair color of an individual depicted in an image x. If we have access to a highly accurate predictor p(u|x) for the distribution of the sensitive attributes u conditioned on the observed x, we can evaluate the extent of bias mitigation via the discrepancies in the expected marginal likelihoods of u as per pref and pθ.

Formally, we define the fairness discrepancy f for a generative model pθ w.r.t. pref and sensitive attributes u:

$$f(p_{\rm ref}, p_\theta) = \left| \mathbb{E}_{p_{\rm ref}}[p(u \mid x)] - \mathbb{E}_{p_\theta}[p(u \mid x)] \right|_2. \tag{5.8}$$

In practice, the expectations in Eq. (5.8) can be computed via Monte Carlo averaging. The lower the discrepancy between the above two expectations, the better the mitigation of dataset bias w.r.t. the sensitive attributes u.
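A minimal Monte Carlo sketch of Eq. (5.8), assuming a hypothetical function `p_u` that returns attribute probabilities p(u|x) for a batch of images:

```python
import numpy as np

def fairness_discrepancy(p_u, x_ref, x_model):
    """Monte Carlo estimate of Eq. (5.8) with the L2 norm.

    p_u(x) returns an array of shape [batch, k] of attribute probabilities.
    """
    mean_ref = p_u(x_ref).mean(axis=0)      # E_{p_ref}[p(u|x)]
    mean_model = p_u(x_model).mean(axis=0)  # E_{p_theta}[p(u|x)]
    return float(np.linalg.norm(mean_ref - mean_model, ord=2))
```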

Models. We train two classifiers for our experiments: (1) the attribute (e.g., gender) classifier, which we use to assess the level of bias present in our final samples; and (2) the density ratio classifier. For both models, we use a variant of ResNet18 [He+16] trained on the standard train and validation splits of CelebA. For the generative model, we use a BigGAN model [BDS18] trained to minimize the hinge loss objective [LY17; TRB17].

(a) single, bias=0.9

(b) single, bias=0.8

(c) multi

Figure 5.2: Distribution of importance weights for different latent subgroups. On average, the underrepresented subgroups are upweighted while the overrepresented subgroups are downweighted.

Model              Bayes optimal   Empirical
single, bias=0.9   0.591           0.605
single, bias=0.8   0.642           0.650
multi              0.619           0.654

Table 5.1: Comparison between the cross-entropy loss of the Bayes classifier and learned density ratio classifier.

5.4.1 Density ratio estimation

For each of the three experimental settings, we can evaluate the quality of the estimated density ratios by comparing empirical estimates of the cross-entropy loss of the density ratio classifier with the cross-entropy loss of the Bayes optimal classifier derived in Eq. (5.7). We show the results in Table 5.1 for perc=1.0 (i.e., |Dref| = |Dbias|; see Section 5.4.2), where we find that the two losses are very close, suggesting that we obtain high-quality density ratio estimates that we can use for subsequently training fair generative models.

In Figure 5.2, we show the distribution of our importance weights for the various latent subgroups. We find that across all the considered settings, the underrepresented subgroups (e.g., males in Figures 5.2a and 5.2b, females with black hair in Figure 5.2c) are upweighted on average (mean density ratio estimate > 1), while the overrepresented subgroups are downweighted on average (mean density ratio estimate < 1). Also, as expected, the density ratio estimates are closer to 1 when the bias is low (compare Figure 5.2a vs. 5.2b).

5.4.2 Dataset bias mitigation

We compare our importance weighted approach against three baselines: (1) equi-weight: a BigGAN trained on the full dataset Dref ∪ Dbias that weighs every point equally; (2) reference-only: a BigGAN trained on the reference dataset Dref alone; and (3) conditional: a conditional BigGAN where the conditioning label indicates whether a data point x is from Dref (y = 1) or Dbias (y = 0). In all our experiments, the reference-only variant, which only uses the reference dataset Dref for learning, failed to give any recognizable samples.

(a) Samples generated via importance reweighting with subgroups separated by the orange line. For the 100 samples above, the classifier concludes 52 females and 48 males.

(b) Fairness Discrepancy

(c) FID

Figure 5.3: Single-Attribute Dataset Bias Mitigation for bias=0.9. Standard error in (b) and (c) over 10 independent evaluation sets of 10,000 samples each drawn from the models. Lower fairness discrepancy and FID are better. We find that on average, imp-weight outperforms the equi-weight baseline by 49.3% and the conditional baseline by 25.0% across all reference dataset sizes for bias mitigation.

(a) Samples generated via importance reweighting with subgroups separated by the orange line. For the 100 samples above, the classifier concludes 52 females and 48 males.

(b) Fairness Discrepancy

(Plot: FID vs. ratio of unbiased to biased dataset sizes, for imp-weight, equi-weight, and conditional.)

(c) FID

Figure 5.4: Single-Attribute Dataset Bias Mitigation for bias=0.8. Standard error in (b) and (c) over 10 independent evaluation sets of 10,000 samples each drawn from the models. Lower fairness discrepancy and FID are better. We find that on average, imp-weight outperforms the equi-weight baseline by 23.9% and the conditional baseline by 12.2% across all reference dataset sizes for bias mitigation.

For a clean presentation of the results for the other methods, we hence omit this baseline in the results below and refer the reader to the Appendix for further results.

We also vary the size of the balanced dataset Dref relative to the unbalanced dataset size |Dbias|: perc ∈ {0.1, 0.25, 0.5, 1.0}. Here, perc = 0.1 denotes |Dref| = 10% of |Dbias|, and perc = 1.0 denotes |Dref| = |Dbias|.

Single attribute bias. We train our attribute (gender) classifier for evaluation on the CelebA training set; the classifier achieves 98% accuracy on a held-out set. For each experimental setting, we evaluate bias mitigation using the fairness discrepancy metric (Eq. (5.8)) and report sample quality using FID [Heu+17]. For the bias = 0.9 split, we show the samples generated via imp-weight and the resulting fairness discrepancies and FID in Figure 5.3. Our framework generates samples that are of slightly lower quality than the equi-weight baseline samples shown in Figure 5.1, but it produces an almost identical proportion of samples across the two genders. Similar observations hold for bias = 0.8, as shown in Figure 5.4.

Multiple attributes bias. We conduct a similar experiment with a multi-attribute split based on gender and the presence of black hair. The attribute classifier used for evaluation is now trained on a 4-way classification task instead of 2, and achieves an accuracy of roughly 88% on the test set. Our model produces samples as shown in Figure 5.5a, with the discrepancy metrics shown in Figures 5.5b and 5.5c respectively. Even in this challenging setup involving two latent bias factors, we find that the importance weighted approach again outperforms the baselines in almost all cases in mitigating bias in the generated data, while admitting only a slight deterioration in image quality.

5.5 Discussion & Related Work

Our work presents an initial foray into the field of image generation under dataset bias, and we stress the need for caution in using our techniques and in interpreting the empirical findings. For scaling our evaluation, we proposed metrics that rely on a pretrained attribute classifier for inferring the bias in the generated data samples.

(a) Samples generated via importance reweighting. For the 100 samples above, the classifier concludes 37 females and 20 males without black hair, and 22 females and 21 males with black hair.

(b) Fairness Discrepancy

(Plot: FID vs. ratio of unbiased to biased dataset sizes, for imp-weight, equi-weight, and conditional.)

(c) FID

Figure 5.5: Multi-Attribute Dataset Bias Mitigation. Standard error in (b) and (c) over 10 independent evaluation sets of 10,000 samples each drawn from the models. Lower fairness discrepancy and FID are better. We find that on average, imp-weight outperforms the equi-weight baseline by 32.5% and the conditional baseline by 4.4% across all reference dataset sizes for bias mitigation.

The classifiers we considered are highly accurate on all subgroups, but can have blind spots when evaluated on generated data. For future work, we would like to investigate human evaluations to mitigate such issues during evaluation [Grg+18]. As another case in point, our work calls for rethinking sample quality metrics for generative models in the presence of dataset bias [Mit+19]. On one hand, our approach increases the diversity of generated samples in the sense that the different subgroups are more balanced; at the same time, however, variation across other image features decreases, because the newly generated underrepresented samples are learned from a smaller dataset of underrepresented subgroups. Moreover, standard metrics such as FID, even when evaluated with respect to a reference dataset, could exhibit a relative preference for models trained on larger datasets with little or no bias correction, to avoid even slight compromises on perceptual sample quality. Next, we discuss some key lines of related work for the current chapter.

Fairness & generative modeling. There is a rich body of work in fair ML that focuses on different notions of fairness (e.g., demographic parity, equality of odds and opportunity) and studies methods by which models can perform tasks such as classification in a non-discriminatory way [BHN18; Dwo+12; Hei+18; Pin+18]. Our focus is on fair generative modeling. The vast majority of related work in this area is centered around fair and/or privacy-preserving representation learning, which exploits tools from adversarial learning and information theory, among others [Zem+13; ES15; Lou+16; Beu+17; Son+19; Ade+19]. A unifying principle among these methods is that a discriminator is trained to perform poorly in predicting an outcome based on a protected attribute. [RAM17] considers transfer learning of race and gender identities as a form of weak supervision for predicting other attributes on datasets of faces. While the end goal of the above works is classification, our focus is on data generation in the presence of dataset bias, and we do not require explicit supervision for the protected attributes.

The most relevant prior works in data generation are FairGAN [Xu+18] and FairnessGAN [Sat+19]. The goal of both methods is to generate fair data points and their labels as a preprocessing step.

This allows for learning a useful downstream classifier while obscuring information about protected attributes. These works are not directly comparable to our setting, as we do not assume explicit supervision regarding the protected attributes during training, and our goal is fair generation given unlabelled biased datasets where the bias factors are latent. Another relevant work is DB-VAE [Ami+19], which uses a VAE to learn the latent structure of sensitive attributes and employs importance weighting based on this structure to mitigate bias in downstream classifiers. Contrary to our work, these importance weights are used to directly sample (rare) data points with higher frequencies for training a classifier, as opposed to fair data generation.

Importance reweighting. Reweighting data points is a common algorithmic technique for problems such as dataset bias and class imbalance [BL18]. It has often been used in the context of fair classification [CKP09]; for example, [KC12] details reweighting as a way to remove discrimination without relabeling instances. For reinforcement learning, [DTB17] used an importance sampling approach for selecting fair policies. There is also a body of work on fair clustering [Chi+17; Bac+19; BCN19; SSS18], which ensures that the clustering assignments are balanced with respect to some sensitive attribute.

Density ratio estimation using classifiers. The use of classifiers for estimating density ratios has a rich history across ML [SSK12]. For deep generative modeling, density ratios estimated by classifiers have been used for expanding the class of various learning objectives [NCT16; ML16; GE18], for evaluation metrics based on two-sample tests [Gre+07; Bow+15; LO16; Dan+17; Ros+17; Im+18; GRM19], and for improved Monte Carlo inference via these models [Gro+19a; Aza+18; Tur+18; Tao+18]. Closest related is the proposal of [Die+18] to use importance reweighting for learning generative models where training and test distributions differ, but explicit importance weights are provided for at least a subset of the training examples. In contrast, we consider a more realistic setting where we estimate the importance weights using a small reference dataset.

5.6 Conclusion

We considered the task of data generation given access to a (potentially small) reference dataset and large biased dataset(s). For data-efficient learning, we proposed an importance weighted objective that corrects bias by reweighting the biased data points. These weights are estimated by a binary classifier. Empirically, we showed that our technique outperforms baselines by up to 34.6% on average in reducing dataset bias on CelebA, without incurring any significant reduction in sample quality.

Finally, it is worth emphasizing the need to be mindful of the decisions made at each stage in the development and deployment of ML systems [Abe+20]. Factors such as the dataset used for training [Geb+18; She+19; JG20] or algorithmic decisions such as the loss function or evaluation metric [HPS16; BG18; Kim+18; Liu+18a; Has+18], among others, may have undesirable consequences. These effects can be further amplified by the very nature of the goal of generative models.

Part II

Relational Representation & Reasoning

6 Representation & Reasoning in Graphs

6.1 Introduction

Graphs are a fundamental abstraction for modeling relational data. Many important analyses over graphs can be cast as predictions over nodes and edges. In a typical node classification task, we are interested in predicting the most probable labels of nodes in a graph [BCM11]. For example, in a social network we might be interested in predicting the interests of users, or in a protein-protein interaction network we might be interested in predicting functional labels of proteins [Rad+13; Yan+11]. Similarly, in link prediction, we wish to predict whether a pair of nodes in a graph should have an edge connecting them [LK07]. Link prediction is useful in a wide variety of domains; for instance, in genomics it helps us discover novel interactions between genes, and in social networks it can identify real-world friends [BL11; Vaz+03].

For the above prediction problems, we require feature representations of the nodes and edges that can be fed as input to a supervised machine learning algorithm. Hand-engineering features based on expert knowledge (e.g., structural graph features such as node degrees, centrality measures, etc.) is tedious and suboptimal. The goal in this chapter is to discuss algorithms that automatically learn feature representations for nodes in a graph using unsupervised methods [BCV13]. Classic approaches include linear and non-linear dimensionality reduction techniques such as Principal Component Analysis, IsoMap, and other approaches based on matrix factorization [BN02; RS00; TDL00; Yan+06]. Besides the computational limitations of matrix factorization approaches, the learning objective of these techniques is often not aligned with the downstream predictive tasks of interest, and hence these approaches are suboptimal.

Here, we present Graphite, a latent variable generative model for graph representation learning based on variational autoencoding [KW18]. The encoding phase maps each node in an input graph to a latent representation, and the decoding phase maps the latent representations back to a graph. Our model uses graph neural networks for inference (encoding) and generation (decoding). These networks can be seen as a parameterized extension of the Weisfeiler-Lehman test for graph isomorphism and use message passing techniques to efficiently map nodes to feature representations. While the encoding is straightforward and can use any existing graph neural network, the decoding of these latent features to reconstruct the original graph is done using a multi-layer iterative procedure. The procedure starts with an initial reconstruction based on the inferred latent features, and iteratively refines the reconstructed graph via a message passing operation. We benchmark the benefits of Graphite over previous approaches on the tasks of node classification and link prediction on real-world graph datasets.

6.2 Preliminaries

Consider a weighted undirected graph G = (V, E), where V and E denote index sets of nodes and edges respectively. Additionally, we denote the (optional) feature matrix associated with the graph as X ∈ R^{n×m} for an m-dimensional signal associated with each node; e.g., these could be the user attributes in a social network. We represent the graph structure using a symmetric adjacency matrix A ∈ R^{n×n}, where n = |V| and the entries Aij denote the weight of the edge between nodes i and j.

6.2.1 Weisfeiler-Lehman algorithm

The Weisfeiler-Lehman (WL) algorithm [WL68; Dou11] is a heuristic test of graph isomorphism between any two graphs G and G′. The algorithm proceeds in iterations. Before the first iteration, we label every node in G and G′ with a scalar isomorphism-invariant initialization (e.g., node degrees). That is, if G and G′ are assumed to be isomorphic, then an isomorphism-invariant initialization is one where the matching nodes establishing the isomorphism in G and G′ have the same labels (a.k.a. messages). Let $H^{(l)} = [h_1^{(l)}, h_2^{(l)}, \cdots, h_n^{(l)}]^T$ denote the vector of messages for the nodes in the graph at iteration l ∈ N ∪ {0}, with H^{(0)} given by the initialization. At every iteration l > 0, we perform a relabelling of nodes in G and G′ based on a message passing update rule:

$$H^{(l)} \leftarrow \mathrm{hash}\left(A H^{(l-1)}\right) \tag{6.1}$$

where A denotes the adjacency matrix of the corresponding graph and hash : R^n → R^n is any suitable hash function, e.g., a non-linear activation. Hence, the message for every node is computed as a hashed sum of the messages from the neighboring nodes (since Aij ≠ 0 only if i and j are neighbors). We repeat the process for a specified number of iterations, or until convergence. If the label sets for the nodes in G and G′ are equal (which can be checked using sorting in O(n log n) time), then the algorithm declares the two graphs G and G′ to be isomorphic.

The ‘‘k-dim” WL algorithm extends the 1-dim algorithm above by simultaneously passing messages of length k (each initialized with some isomorphism-invariant scheme). A positive test for isomorphism requires equality in all k dimensions for nodes in G and G′ after the termination of message passing. This algorithmic test is a heuristic that guarantees no false negatives but can give false positives, wherein two non-isomorphic graphs can be falsely declared isomorphic. Empirically, the test has been shown to fail on some regular graphs, but gives excellent performance on real-world graphs [She+11].
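As a minimal sketch of the 1-dim test (using a non-linear activation as the hash function, per the discussion above):

```python
import numpy as np

def wl_messages(A, num_iters=5):
    """1-dim WL relabelling, Eq. (6.1): hashed sums of neighbour messages."""
    h = A.sum(axis=1)            # isomorphism-invariant init: node degrees
    for _ in range(num_iters):
        h = np.tanh(A @ h)       # tanh stands in for the hash function
    return h

def wl_test(A1, A2, num_iters=5):
    """Declare isomorphic if the sorted label multisets match (heuristic:
    no false negatives, but false positives are possible)."""
    return np.allclose(np.sort(wl_messages(A1, num_iters)),
                       np.sort(wl_messages(A2, num_iters)))
```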

6.2.2 Graph neural networks

Intuitively, the WL algorithm encodes the structure of the graph in the form of messages at every node. Graph neural networks (GNN) build on this observation and parameterize an unfolding of the iterative message passing procedure, which we describe next.

A GNN consists of many layers, indexed by l ∈ N, with each layer associated with an activation ηl and a dimensionality dl. In addition to the input graph A, every layer l ∈ N of the GNN takes as input the activations from the previous layer $H^{(l-1)} \in \mathbb{R}^{n \times d_{l-1}}$, a family of linear transformations $\mathcal{F}_l : \mathbb{R}^{n \times n} \to \mathbb{R}^{n \times n}$, a matrix of learnable weight parameters $W_l \in \mathbb{R}^{d_{l-1} \times d_l}$, and optional bias parameters $B_l \in \mathbb{R}^{n \times d_l}$. Recursively, the layer-wise propagation rule in a GNN is given by:

$$H^{(l)} \leftarrow \eta_l\left(B_l + \sum_{f \in \mathcal{F}_l} f(A)\, H^{(l-1)} W_l\right) \tag{6.2}$$

with the base cases H^{(0)} = X and d_0 = m. Here, m is the feature dimensionality. If there are no explicit node features, we set H^{(0)} = I_n (identity) and d_0 = n. Several variants of graph neural networks have been proposed in prior work. For instance, graph convolutional networks (GCN) [KW17] instantiate graph neural networks with a permutation-equivariant propagation rule:

$$H^{(l)} \leftarrow \eta_l\left(B_l + \tilde{A} H^{(l-1)} W_l\right) \tag{6.3}$$

where $\tilde{A} = D^{-1/2} A D^{-1/2}$ is the symmetric normalization of A given the diagonal degree matrix D (i.e., $D_{ii} = \sum_{(i,j) \in E} A_{ij}$), with the same base cases as before. Comparing the above with the WL update rule in Eq. (6.1), we can see that the activations for every layer in a GCN are computed via parameterized, scaled activations (messages) of the previous layer being propagated over the graph, with the hash function implicitly specified via the activation function ηl.
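For concreteness, a single GCN layer following Eq. (6.3) can be sketched as below (assuming no isolated nodes, so the degree matrix is invertible):

```python
import numpy as np

def gcn_layer(A, H, W, B, act=np.tanh):
    """One GCN propagation step, Eq. (6.3): H' = eta(B + A_tilde H W)."""
    d = A.sum(axis=1)                         # node degrees
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    A_tilde = D_inv_sqrt @ A @ D_inv_sqrt     # symmetric normalization of A
    return act(B + A_tilde @ H @ W)
```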

Figure 6.1: Latent variable model for Graphite. Observed evidence variables in gray.

Our framework is agnostic to the instantiation of the message passing rule of a graph neural network in Eq. (6.2), and we use graph convolutional networks for experimental validation due to their permutation equivariance property. For brevity, we denote the output H of the final layer of a multi-layer graph neural network with input adjacency matrix A, node feature matrix X, and parameters ⟨W, B⟩ as

H = GNN_{⟨W,B⟩}(A, X), with appropriate activation functions and linear transformations applied at each hidden layer of the network.

6.3 Graphite: Generative Modeling for Graphs

For generative modeling of graphs, we are interested in learning a parameterized distribution over adjacency matrices A. In this work, we restrict ourselves to modeling the graph structure only; any additional information in the form of node features X is incorporated as conditioning evidence.

In Graphite, we adopt a latent variable approach for modeling the generative process. That is, we introduce latent vectors $Z_i \in \mathbb{R}^k$ and evidence vectors $X_i \in \mathbb{R}^m$ for each node i ∈ {1, 2, ..., n}, along with an observed variable $A_{ij} \in \mathbb{R}$ for each pair of nodes. Unless necessary, we use the succinct representation $Z \in \mathbb{R}^{n \times k}$, $X \in \mathbb{R}^{n \times m}$, and $A \in \mathbb{R}^{n \times n}$ for these variables henceforth. The conditional independencies between the variables can be summarized in the directed graphical model (using plate notation) in Figure 6.1. We can learn the model parameters θ by maximizing the

marginal likelihood of the observed adjacency matrix conditioned on X:

$$\max_\theta \; \log p_\theta(A \mid X) = \log \int_Z p_\theta(A, Z \mid X)\, \mathrm{d}Z \tag{6.4}$$

Here, p(Z|X) is a fixed prior distribution over the latent features of every node, e.g., an isotropic Gaussian. If we have multiple graphs in our dataset, we maximize the expected log-likelihood over all the corresponding adjacency matrices. We can obtain a tractable, stochastic evidence lower bound (ELBO) to the above objective by introducing a variational posterior qφ(Z|A, X) with parameters φ:

$$\log p_\theta(A \mid X) \ge \mathbb{E}_{q_\phi(Z \mid A, X)}\left[\log \frac{p_\theta(A, Z \mid X)}{q_\phi(Z \mid A, X)}\right] \tag{6.5}$$

The lower bound is tight when the variational posterior qφ(Z|A, X) matches the true posterior pθ(Z|A, X); hence, maximizing the above objective optimizes for the parameters that define the best approximation to the true posterior within the variational family [BKM17]. We now discuss parameterizations for specifying qφ(Z|A, X) (i.e., the encoder) and pθ(A|Z, X) (i.e., the decoder).

Encoding using forward message passing. Typically, we use the mean field approximation for defining the variational family and hence:

$$q_\phi(Z \mid A, X) \approx \prod_{i=1}^{n} q_{\phi_i}(Z_i \mid A, X) \tag{6.6}$$

Additionally, we would like to make distributional assumptions on each variational marginal density $q_{\phi_i}(Z_i \mid A, X)$ such that it is reparameterizable and easy to sample, so that the gradients w.r.t. φi have low variance [KW18]. In Graphite, we assume isotropic Gaussian variational marginals with diagonal covariance. The parameters of the variational marginals $q_{\phi_i}(Z_i \mid A, X)$ are specified using a graph neural network:

$$\mu, \sigma = \mathrm{GNN}_\phi(A, X) \tag{6.7}$$

where µ and σ denote the vectors of means and standard deviations for the variational marginals $\{q_{\phi_i}(Z_i \mid A, X)\}_{i=1}^{n}$, and $\phi = \{\phi_i\}_{i=1}^{n}$ is the full set of variational parameters.
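A reparameterized draw from these marginals is then straightforward; a small sketch:

```python
import numpy as np

def sample_latents(mu, sigma, rng=None):
    """Reparameterized sample Z = mu + sigma * eps with eps ~ N(0, I), which
    keeps gradient estimates w.r.t. the variational parameters low-variance."""
    rng = rng or np.random.default_rng()
    return mu + sigma * rng.standard_normal(mu.shape)
```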

Decoding using reverse message passing. For specifying the observation model pθ(A|Z, X), we cannot directly use a graph neural network since we do not have an input graph for message passing. To sidestep this issue, we propose an iterative two-step approach that alternates between defining an intermediate graph and then gradually refining this graph through message passing. Formally, given a latent matrix Z and an input feature matrix X, we iterate over the following sequence of operations:

$$\hat{A} = \frac{Z Z^T}{\|Z\|^2} + \mathbf{1}\mathbf{1}^T, \tag{6.8}$$
$$Z^* = \mathrm{GNN}_\theta(\hat{A}, [Z \mid X]) \tag{6.9}$$

where the second argument to the GNN is a concatenation of Z and X. The first step constructs an intermediate weighted graph $\hat{A} \in \mathbb{R}^{n \times n}$ by applying an inner product of Z with itself and adding an additional constant of 1 to ensure the entries are non-negative. The second step performs a pass through a parameterized graph neural network. We can repeat the above sequence to gradually refine the feature matrix Z∗. The final distribution over graph parameters is obtained using an inner product step on $Z^* \in \mathbb{R}^{n \times k^*}$ akin to Eq. (6.8), where k∗ ∈ N is determined by the network architecture. For efficient sampling, we assume the observation model factorizes:

$$p_\theta(A \mid Z, X) = \prod_{i=1}^{n} \prod_{j=1}^{n} p_\theta^{(i,j)}(A_{ij} \mid Z^*). \tag{6.10}$$

The distribution over the individual edges can be expressed as a Bernoulli or Gaussian distribution for unweighted and real-valued edges respectively. For example, the edge probabilities for an unweighted graph are given as $\mathrm{sigmoid}(Z^* {Z^*}^T)$.
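Putting Eqs. (6.8)-(6.10) together, the decoder can be sketched as below, where `gnn_theta` is a placeholder for the parameterized graph neural network and we make the simplifying assumption that each refinement round reuses the refined features:

```python
import numpy as np

def graphite_decode(Z, X, gnn_theta, num_refinements=2):
    """Reverse message passing: construct A_hat from the latents (Eq. 6.8),
    refine the features by a GNN pass (Eq. 6.9), and repeat."""
    Z_star = Z
    for _ in range(num_refinements):
        A_hat = Z_star @ Z_star.T / np.linalg.norm(Z_star) ** 2 + 1.0  # + 11^T
        Z_star = gnn_theta(A_hat, np.concatenate([Z_star, X], axis=1))
    # Edge probabilities for an unweighted graph (Eq. 6.10, Bernoulli case)
    return 1.0 / (1.0 + np.exp(-(Z_star @ Z_star.T)))
```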

Scalable learning & inference in Graphite. For representation learning of large graphs, we require the encoding and decoding steps in Graphite to be computationally efficient. This can be challenging, as the decoding step involves inner products of potentially dense matrices Z, which is by default an O(n²k) operation. Here, k is the dimension of the per-node latent vectors Zi used to define Â. For any intermediate decoding step as in Eq. (6.8), we propose to offset this expensive computation by using the associativity property of matrix multiplication for the message passing step in Eq. (6.9). For notational brevity, consider the simplified graph propagation rule for a GNN:

$$H^{(l)} \leftarrow \eta_l\left(\hat{A} H^{(l-1)}\right)$$

where Â is defined in Eq. (6.8). Instead of directly taking an inner product of Z with itself, we note that the subsequent operation involves another matrix multiplication and hence, we can perform right multiplication instead. If dl and dl−1 denote the sizes of the layers H^{(l)} and H^{(l−1)} respectively, then the time complexity of propagation based on right multiplication is given by O(nkd_{l−1} + nd_{l−1}d_l).

The above trick sidesteps the quadratic O(n²) complexity for decoding in the intermediate layers without any loss in statistical accuracy. The final layer, however, still involves an inner product with respect to Z∗, corresponding to potentially dense matrices. However, since the edges are generated independently, we can approximate the loss objective by performing a Monte Carlo evaluation of the reconstructed adjacency matrix parameters in Eq. (6.10). By adaptively choosing the number of entries for the Monte Carlo approximation, we can trade off statistical accuracy for computational budget.

6.4 Empirical Evaluation

We evaluate Graphite on tasks involving entire graphs, nodes, and edges. We consider two variants of our proposed framework: Graphite-VAE, which corresponds to a directed latent variable model as described in Section 6.3, and Graphite-AE, which corresponds to an autoencoder trained to minimize the error in reconstructing an input adjacency matrix. For unweighted graphs (i.e., A ∈ {0, 1}^{n×n}), the reconstruction terms in the objectives for both Graphite-VAE and Graphite-AE minimize the negative cross-entropy between the input and reconstructed adjacency matrices. For weighted graphs, we use the mean squared error.

Table 6.1: Mean reconstruction errors and negative log-likelihood estimates (in nats) for autoencoders and variational autoencoders respectively on test instances from six different generative families. Lower is better.

              Erdos-Renyi      Ego              Regular          Geometric        Power Law        Barabasi-Albert
GAE           221.79 ± 7.58    197.3 ± 1.99     198.5 ± 4.78     514.26 ± 41.58   519.44 ± 36.30   236.29 ± 15.13
Graphite-AE   195.56 ± 1.49    182.79 ± 1.45    191.41 ± 1.99    181.14 ± 4.48    201.22 ± 2.42    192.38 ± 1.61
VGAE          273.82 ± 0.07    273.76 ± 0.06    275.29 ± 0.08    274.09 ± 0.06    278.86 ± 0.12    274.4 ± 0.08
Graphite-VAE  270.22 ± 0.15    270.70 ± 0.32    266.54 ± 0.12    269.71 ± 0.08    263.92 ± 0.14    268.73 ± 0.09

6.4.1 Reconstruction & density estimation

In the first set of tasks, we evaluate learning in Graphite based on held-out reconstruction losses and log-likelihoods estimated by the learned Graphite-AE and Graphite-VAE models respectively on a collection of graphs of varying sizes. In direct contrast to modalities such as images, graphs cannot be straightforwardly reduced to a fixed number of vertices for input to a graph convolutional network. One simplifying modification, taken by [Boj+18], is to consider only the largest connected component for evaluating and optimizing the objective, which we adopt as well. Thus, by setting the dimensions of Z* to a maximum number of vertices, Graphite can be used for inference tasks over entire graphs with potentially smaller sizes by considering only the largest connected component. We create datasets from six graph families with fixed, known generative processes: Erdos-Renyi, ego-nets, random regular graphs, random geometric graphs, random power-law trees, and Barabasi-Albert. For each family, 300 graph instances were sampled, each with 10–20 nodes, and split evenly into train/validation/test instances. As a benchmark, we compare against the Graph Autoencoder / Variational Graph Autoencoder (GAE/VGAE) [KW16].

Table 6.2: Citation network statistics

            Nodes    Edges    Node Features    Labels
Cora         2708     5429             1433         7
Citeseer     3327     4732             3703         6
Pubmed      19717    44338              500         3

The GAE/VGAE models consist of an encoding procedure similar to Graphite. However, the decoder has no learnable parameters, and reconstruction is done solely through an inner product operation (such as the one in Eq. (6.8)). The reconstruction errors and negative log-likelihoods on the test set are shown in Table 6.1. Both Graphite-AE and Graphite-VAE significantly outperform GAE and VGAE, indicating the usefulness of learned decoders in Graphite.

6.4.2 Link prediction

The task of link prediction is to predict whether an edge exists between a pair of nodes [Loe98]. Even though Graphite learns a distribution over graphs, it can be used for predictive tasks within a single graph. In order to do so, we learn a model for a random, connected training subgraph of the true graph. For validation and testing, we add a balanced set of positive and negative (false) edges to the original graph and evaluate the model performance based on the reconstruction probabilities assigned to the validation and test edges (similar to denoising of the input graph). In our experiments, we hold out 5% of the edges for validation and 10% for testing, and train all models on the remaining subgraph. Additionally, the validation and testing sets each contain an equal number of non-edges.
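As a sketch of this evaluation protocol, the snippet below scores candidate edges by their reconstruction probabilities under the inner product step and computes AUC over a balanced held-out set; the function names and the use of scikit-learn are illustrative assumptions, not the exact experimental code:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def score_edges(Z, pairs):
    """Reconstruction probabilities sigmoid(z_u . z_v) for (u, v) node pairs,
    given per-node latents Z of shape (n, k) from the trained model."""
    logits = np.array([Z[u] @ Z[v] for u, v in pairs])
    return 1.0 / (1.0 + np.exp(-logits))

def link_prediction_auc(Z, pos_edges, neg_edges):
    """AUC over a balanced set of held-out edges and sampled non-edges."""
    scores = np.concatenate([score_edges(Z, pos_edges), score_edges(Z, neg_edges)])
    labels = np.concatenate([np.ones(len(pos_edges)), np.zeros(len(neg_edges))])
    return roc_auc_score(labels, scores)
```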

Datasets. We compared across standard benchmark citation network datasets: Cora, Citeseer, and Pubmed with papers as nodes and citations as edges [Sen+08]. The node-level features correspond to the text attributes in the papers. The dataset statistics are summarized in Table 6.2.

Table 6.3: Area Under the ROC Curve (AUC) for link prediction (* denotes dataset with features). Higher is better.

                 Cora           Citeseer       Pubmed         Cora*          Citeseer*      Pubmed*
SC               89.9 ± 0.20    91.5 ± 0.17    94.9 ± 0.04    -              -              -
DeepWalk         85.0 ± 0.17    88.6 ± 0.15    91.5 ± 0.04    -              -              -
node2vec         85.6 ± 0.15    89.4 ± 0.14    91.9 ± 0.04    -              -              -
GAE              90.2 ± 0.16    92.0 ± 0.14    92.5 ± 0.06    93.9 ± 0.11    94.9 ± 0.13    96.8 ± 0.04
VGAE             90.1 ± 0.15    92.0 ± 0.17    92.3 ± 0.06    94.1 ± 0.11    96.7 ± 0.08    95.5 ± 0.13
Graphite-AE      91.0 ± 0.15    92.6 ± 0.16    94.5 ± 0.05    94.2 ± 0.13    96.2 ± 0.10    97.8 ± 0.03
Graphite-VAE     91.5 ± 0.15    93.5 ± 0.13    94.6 ± 0.04    94.7 ± 0.11    97.3 ± 0.06    97.4 ± 0.04

Table 6.4: Average Precision (AP) scores for link prediction (* denotes dataset with features). Higher is better.

                 Cora           Citeseer       Pubmed         Cora*          Citeseer*      Pubmed*
SC               92.8 ± 0.12    94.4 ± 0.11    96.0 ± 0.03    -              -              -
DeepWalk         86.6 ± 0.17    90.3 ± 0.12    91.9 ± 0.05    -              -              -
node2vec         87.5 ± 0.14    91.3 ± 0.13    92.3 ± 0.05    -              -              -
GAE              92.4 ± 0.12    94.0 ± 0.12    94.3 ± 0.5     94.3 ± 0.12    94.8 ± 0.15    96.8 ± 0.04
VGAE             92.3 ± 0.12    94.2 ± 0.12    94.2 ± 0.04    94.6 ± 0.11    97.0 ± 0.08    95.5 ± 0.12
Graphite-AE      92.8 ± 0.13    94.1 ± 0.14    95.7 ± 0.06    94.5 ± 0.14    96.1 ± 0.12    97.7 ± 0.03
Graphite-VAE     93.2 ± 0.13    95.0 ± 0.10    96.0 ± 0.03    94.9 ± 0.13    97.4 ± 0.06    97.4 ± 0.04

Baselines and evaluation metrics. We evaluate performance based on the Area Under the ROC Curve (AUC) and Average Precision (AP) metrics. We consider the following baselines: Spectral Clustering (SC) [TL11], DeepWalk [PAS14], node2vec [GL16], and GAE/VGAE [KW16]. SC, DeepWalk, and node2vec do not provide the ability to incorporate node features while learning embeddings, and hence we evaluate them only on the featureless datasets.

Results. The AUC and AP results (along with standard errors) are shown in Table 6.3 and Table 6.4 respectively, averaged over 50 random train/validation/test splits. On both metrics, Graphite-VAE gives the best overall performance. Graphite-AE also gives good results, generally outperforming its closest competitor, GAE.

(a) Graphite-AE (b) Graphite-VAE

Figure 6.2: t-SNE embeddings of the latent feature vectors for the Cora dataset. Colors denote labels.

Table 6.5: Classification accuracies (* denotes dataset with features). Baseline numbers from [KW17].

             Cora*          Citeseer*      Pubmed*
SemiEmb      59.0           59.6           71.1
DeepWalk     67.2           43.2           65.3
ICA          75.1           69.1           73.9
Planetoid    75.7           64.7           77.2
GCN          81.5           70.3           79.0
Graphite     82.1 ± 0.06    71.0 ± 0.07    79.3 ± 0.03

Qualitative evaluation. We visualize the embeddings learned by Graphite, given by a 2D t-SNE projection [MH08] of the latent feature vectors (given as rows of Z with λ = 0.5), on the Cora dataset in Figure 6.2. Even without any access to label information for the nodes during training, the models are able to cluster the nodes (papers) according to their labels (paper categories).

6.4.3 Semi-supervised node classification

Given labels for a subset of nodes in an underlying graph, the goal of this task is to predict the labels for the remaining nodes. We consider a transductive setting, where we have access to the test nodes (without their labels) during training. The closest approach to Graphite for this task is a supervised graph convolutional network (GCN) trained end-to-end. We consider an extension of this baseline, wherein we augment the GCN objective with the Graphite objective and a hyperparameter to control the relative importance of the two terms in the combined objective. The parameters φ for the encoder are shared across these two objectives, with an additional GCN layer mapping the encoder output to softmax probabilities over the requisite number of classes. All parameters are learned jointly.
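A minimal sketch of this combined objective, assuming hypothetical tensors `gcn_logits` (per-node class logits from the shared encoder plus the extra GCN layer) and `graphite_elbo` (the scalar Graphite lower bound); `gamma` stands in for the balancing hyperparameter:

```python
import torch
import torch.nn.functional as F

def hybrid_loss(gcn_logits, labels, labeled_mask, graphite_elbo, gamma=0.5):
    """Supervised cross-entropy on the labeled nodes plus the negated Graphite
    ELBO as an unsupervised regularizer; gamma trades off the two terms."""
    supervised = F.cross_entropy(gcn_logits[labeled_mask], labels[labeled_mask])
    return supervised - gamma * graphite_elbo  # maximizing the ELBO = subtracting it
```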

Results. The classification accuracy of the semi-supervised models is given in Table 6.5. We find that the Graphite-hybrid model outperforms the competing models on all datasets, in particular the GCN approach, which is the closest baseline.

6.5 Discussion & Related Work

Probabilistic modeling of graphs. The earliest probabilistic models of graphs generate graphs by creating an edge between any pair of nodes with a constant probability [ER59]. Several alternatives have been proposed since: the small-world model generates graphs that exhibit local clustering [WS98], the Barabasi-Albert model is based on preferential attachment, wherein high-degree nodes are likely to form edges with newly added nodes [BA99], and the stochastic block model is based on inter- and intra-community linkages [HLL83]. We direct the interested reader to prominent surveys on this topic [New03; Mit04; CF06].

Representation learning on graphs. For representation learning on graphs, there are broadly three kinds of approaches: matrix factorization, random walk based approaches, and graph neural networks. We refer the reader to [HYL17b] for a

recent survey on these three kinds of approaches and focus on the latter below. Graph neural networks, a collective term for networks that operate over graphs using message passing, have shown success on several downstream applications; see, e.g., [Duv+15; Li+16; Kea+16; KW17; HYL17a] and the references therein. [Gil+17] provides a comprehensive characterization of these networks in the message passing setup. We used graph convolutional networks partly to provide a direct comparison with GAE/VGAE, and leave the exploration of other GNN variants to future work.

Latent variable models for graphs. Several deep generative models for graphs have been previously proposed. Amongst adversarial generation approaches, [Wan+18] and [Boj+18] model local graph neighborhoods and random walks on graphs respectively. [Li+18] and [You+18] model graphs as sequences and generate graphs via autoregressive procedures. Adversarial and autoregressive approaches are successful at generating graphs, but do not directly allow for inferring latent variables via encoders. Latent variable generative models have also been proposed for generating small molecular graphs [JBJ18; Sam+18; SK18]; these methods involve an expensive decoding procedure that limits scaling to large graphs. Finally, closest to our framework is the GAE/VGAE approach [KW16] discussed in Section 6.4. [Pan+18] extends this approach with an adversarial regularization framework but retains the inner product decoder. Our work proposes a novel multi-step decoding mechanism based on graph refinement.

6.6 Conclusion

We proposed Graphite, a scalable deep generative model for graphs based on variational autoencoding. The encoders and decoders in Graphite are parameterized by graph neural networks that propagate information locally on a graph. Our proposed decoder performs multi-layer iterative decoding comprising alternating inner product operations and message passing on the intermediate graph. Empirically, we observed superior performance of Graphite over other competing approaches for density estimation, node classification, and link prediction.

7 Representation & Reasoning in Multiagent Systems

7.1 Introduction

In the last chapter, we discussed representation learning and reasoning for static graph networks. Relational data can appear in many other interactive contexts beyond static graph networks. An intelligent agent rarely acts in isolation in the real world and typically seeks to achieve its goals through interaction with other agents. Such interactions give rise to rich, complex behaviors formalized as per-agent policies in a multiagent system [Fer99; Woo09]. Depending on the underlying motivations of the agents, interactions could be directed towards achieving a shared goal in a collaborative setting, opposing another agent in a competitive setting, or be a mixture of these in a setting where agents collaborate in teams to compete against other teams. Learning useful representations of the policies of agents based on their interactions is an important step towards characterization of agent behavior and, more generally, reasoning in multiagent systems. In this chapter, we propose a representation learning framework for agent policies, given access to only a few episodes of interaction. Our framework builds on two major lines of advancements in representation learning: (a) generative methods, which we formalize as learning representations that can aid imitation in a multiagent setup; and (b) contrastive methods, which we formalize as learning representations


that can aid in distinguishing multiple agents using triplet losses. For the embeddings to be useful, the representation function should generalize to both unseen interactions and unseen agents for novel downstream tasks. Generalization is well-understood in the context of supervised learning, where a good model is expected to attain similar train and test performance. For multiagent systems, we consider a notion of generalization based on agent-interaction graphs. An agent-interaction graph provides an abstraction for distinguishing the agents (nodes) and interactions (edges) observed during training, validation, and testing. Our framework is agnostic to the nature of interactions in multiagent systems, and hence broadly applicable to competitive and cooperative environments. In particular, we consider two multiagent environments: (i) a competitive continuous control environment, RoboSumo [Al-+18], and (ii) a cooperative communication environment, ParticleWorld, where agents collaborate to achieve a common goal [MA18]. We show empirically that the representations learned by our framework are effective for various tasks, including policy optimization. In particular, we demonstrate the use of these representations as privileged information for better training of agent policies. In RoboSumo, we train agent policies that can condition on the opponent's representation and achieve superior win rates much more quickly as compared to an equally expressive baseline policy with the same number of parameters. In ParticleWorld, we train speakers that can communicate more effectively with a much wider range of listeners given knowledge of their representations.

7.2 Preliminaries

Markov games. We use the classical framework of Markov games [Lit94] to represent multiagent systems. A Markov game extends the general formulation of partially observable Markov decision processes (POMDP) to the multiagent setting. In a Markov game, we are given a set of n agents on a state space S with action spaces

A_1, A_2, · · · , A_n and observation spaces O_1, O_2, · · · , O_n respectively. At every time step t, an agent i receives an observation o_i^{(t)} ∈ O_i and executes an action a_i^{(t)} ∈ A_i based on a stochastic policy π^{(i)} : O_i × A_i → [0, 1]. Based on the executed action,

the agent receives a reward r_i^{(t)} : S × A_i → R and the next observation o_i^{(t+1)}. The state dynamics are determined by a transition function T : S × A_1 × · · · × A_n → S. The agent policies are trained to maximize their own expected reward r̄_i = ∑_{t=1}^{H} r_i^{(t)} over a time horizon H.

Extended Markov games. We are interested in interactions that involve not all but only a subset of agents. For this purpose, we generalize Markov games as follows. First, we augment the action space of each agent with a NO-OP (i.e., no action). Then, we introduce a problem parameter, 2 ≤ k ≤ n, with the following semantics. During every rollout of the Markov game, all but k agents deterministically execute the NO-OP operator, while the k agents execute actions as per the policies defined on the original observation and action spaces. Accordingly, we assume that each agent receives rewards only in the interaction episodes it participates in. Informally, the extension allows for multiagent systems where all agents do not necessarily have to participate simultaneously in an interaction. For instance, this allows us to consider one-vs-one multiagent game tournaments where only two players participate in any given match. Consider a multiagent system specified as an extended Markov game. We denote the set of agent policies as P = {π^{(i)}}_{i=1}^{n} and the interaction episodes as E = {E_{M_j}}_{j=1}^{m}, where M_j ⊆ {1, 2, · · · , n}, |M_j| = k is the set of k agents participating in episode

E_{M_j}. For the rest of the chapter, we assume k = 2 to simplify presentation and consequently denote the set of interaction episodes between agents i and j as E_ij. A single episode e_ij ∈ E_ij consists of a sequence of observations and actions over some finite time horizon.

Imitation learning. Our approach to learning policy representations relies on behavioral cloning [Pom91]—a type of imitation learning where we train a mapping from observations to actions using a standard supervised regression loss. Other algorithms for imitation learning exist but typically require additional supervision; e.g., generative adversarial imitation learning [HE16] requires knowledge of the transition dynamics. In either case, our framework is largely agnostic to the

choice of the imitation algorithm, and we restrict our presentation to behavioral cloning, leaving other imitation learning paradigms to future work.

7.3 Learning Policy Representations

Our goal is to learn a parameterized function that maps episode(s) from an agent policy π^{(i)} to a real-valued vector embedding. That is, we optimize for the parameters φ of a function fφ : E → R^d, where E denotes the space of episodes and d is the dimension of the representation embedding. Here, we have assumed the agent policies are black-boxes, i.e., we can only access them through interaction episodes with other agents in a Markov game. Hence, for every agent i, we wish to learn policies using all the available episode data E_i = ∪_j E_ij^{(i)}. Here, E_ij^{(i)} refers to the episode data for interactions between agents i and j, but consisting of only the observation and action pairs of agent i. For a multiagent system, we propose the following auxiliary tasks for learning a good representation of an agent's policy:

1. Generative representations. The representation should be useful for simulating the agent’s policy.

2. Contrastive representations. The representation should be able to distinguish the agent’s policy from the policies of other agents.

7.3.1 Generative representation learning via imitation

Imitation learning does not require direct access to the reward signal, making it an attractive task for unsupervised representation learning. Formally, we are interested in learning a policy πθ^{(i)} : S × A → [0, 1] for an agent i given access to observation and action pairs from interaction episode(s) involving the agent. For behavioral cloning, we maximize the following (negative) cross-entropy objective w.r.t. θ:

E_{e∼E_i} [ ∑_{⟨o,a⟩∼e} log πθ^{(i)}(a|o) ]

where the expectation is over interaction episodes of agent i. Learning individual policies for every agent can be computationally expensive and statistically prohibitive for large-scale multiagent systems with dozens of agents, especially when the number of interaction episodes per agent is small. Moreover, it precludes generalization across the behaviors of such agents. On the other hand, learning a single policy for all agents increases sample efficiency but comes at the cost of reduced modeling flexibility in simulating diverse agent behaviors. We offset this dichotomy by learning a single conditional policy network. To do so, we use the embedding output by the representation function fφ : E → R^d to condition the policy network. Formally, the policy network is denoted by πθ : S × A × R^d → [0, 1] and maps the agent observation and embedding to a distribution over the agent's actions. The parameters φ and θ for the embedding function and the conditional policy network respectively are learned jointly by maximizing the following objective:

(1/n) ∑_{i=1}^{n} E_{e_1∼E_i, e_2∼E_i\e_1} [ ∑_{⟨o,a⟩∼e_1} log πθ(a|o, fφ(e_2)) ]    (7.1)

For every agent, the objective function samples two distinct episodes e1 and e2. The observation and action pairs from e2 are used to learn an embedding fφ(e2) that conditions the policy network trained on observation and action pairs from e1. The conditional policy network shares statistical strength through a common set of parameters for the policy network and the representation function across all agents.
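A one-sample Monte Carlo sketch of Eq. (7.1); `embed` and `policy_logprob` are assumed callables wrapping fφ and log πθ:

```python
import numpy as np

def imitation_objective(agent_episodes, policy_logprob, embed):
    """One-sample Monte Carlo estimate of Eq. (7.1). agent_episodes[i] is a
    list of agent i's episodes, each a list of (obs, act) pairs."""
    total = 0.0
    for episodes in agent_episodes:
        # Two distinct episodes: e2 provides the embedding, e1 is imitated.
        i1, i2 = np.random.choice(len(episodes), size=2, replace=False)
        e1, e2 = episodes[i1], episodes[i2]
        emb = embed(e2)
        total += sum(policy_logprob(obs, act, emb) for obs, act in e1)
    return total / len(agent_episodes)
```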

7.3.2 Contrastive representation learning via triplet loss

An intuitive requirement for any representation function learned for a multiagent system is that the embeddings should reflect characteristics of an agent’s behavior that distinguish it from other agents. To do so in an unsupervised manner, we propose a contrastive objective for agent identification based on the triplet loss directly in the space of embeddings.

For every agent i, we use the representation function fφ to compute three sets

Algorithm 7.1 Learn Policy Embedding Function (fφ)
Input: {E_i}_{i=1}^{n} – interaction episodes, λ – hyperparameter
 1: Initialize φ and θ
 2: for i = 1, 2, . . . , n do
 3:   Sample a positive episode pe ← e_+ ∼ E_i
 4:   Sample a reference episode re ← e_∗ ∼ E_i\e_+
 5:   Compute Im_loss ← − ∑_{⟨o,a⟩∼e_+} log πθ(a|o, fφ(e_∗))
 6:   for j = 1, 2, . . . , n do
 7:     if j ≠ i then
 8:       Sample a negative episode ne ← e_− ∼ E_j
 9:       Compute Id_loss ← dφ(e_+, e_−, e_∗)
10:       Set Loss ← Im_loss + λ · Id_loss
11:       Update φ and θ to minimize Loss
12:     end if
13:   end for
14: end for
15: return φ

of embeddings: (i) a positive embedding for an episode e+ ∼ Ei involving agent i,

(ii) a negative embedding for an episode e− ∼ Ej involving a random agent j 6= i, and (iii) a reference embedding for an episode e∗ ∼ Ei again involving agent i, but different from e+. Given these embeddings, we define the triplet loss:

dφ(e_+, e_−, e_∗) = (1 + exp{‖re − ne‖₂ − ‖re − pe‖₂})^{−2},    (7.2)

where pe = fφ(e_+), ne = fφ(e_−), re = fφ(e_∗). Intuitively, the loss encourages the positive embedding to be closer to the reference embedding than the negative embedding, which makes the embeddings of the same agent tend to cluster together and lie further away from the embeddings of other agents. We note that various other notions of distance can also be used; the one presented above corresponds to a squared softmax objective [HA15].
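A direct transcription of Eq. (7.2) on precomputed embeddings (a sketch; the inputs are the positive, negative, and reference embeddings):

```python
import numpy as np

def triplet_distance(pe, ne, re):
    """Squared-softmax triplet term d_phi of Eq. (7.2). Minimizing it pushes
    the reference embedding closer to the positive than to the negative."""
    margin = np.linalg.norm(re - ne) - np.linalg.norm(re - pe)
    return (1.0 + np.exp(margin)) ** -2
```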

7.3.3 Hybrid generative-contrastive representations

Conditional imitation learning encourages fφ to learn representations that can simulate the entire policy of the agents, while agent identification using a triplet loss incentivizes representations that can distinguish between agent policies. The two objectives are complementary, and we combine Eq. (7.1) and Eq. (7.2) to get the final objective used for representation learning:

(1/n) ∑_{i=1}^{n} E_{e_+∼E_i, e_∗∼E_i\e_+} [ ∑_{⟨o,a⟩∼e_+} log πθ(a|o, fφ(e_∗)) − λ ∑_{j≠i} E_{e_−∼E_j}[dφ(e_+, e_−, e_∗)] ],    (7.3)

where the first term corresponds to imitation, the second to agent identification, and λ > 0 is a tunable hyperparameter that controls the relative weights of the contrastive and generative terms. The pseudocode for the proposed algorithm is given in Algorithm 7.1. In experiments, we parameterize the conditional policy

πθ and the embedding function fφ using deep neural networks and use stochastic gradient-based methods for optimization.
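As a concrete rendering of Algorithm 7.1, the following runnable sketch assumes that `embed`, `policy_logprob`, `triplet_distance`, and `gradient_step` are callables wrapping fφ, log πθ, dφ, and one optimizer update on (φ, θ); all four names are placeholders:

```python
import numpy as np

def learn_policy_embedding(episodes, embed, policy_logprob, triplet_distance,
                           gradient_step, lam=0.1):
    """episodes[i] is the list of interaction episodes E_i for agent i; each
    episode is a list of (obs, act) pairs."""
    n = len(episodes)
    for i in range(n):
        # Positive and reference episodes from the same agent i (lines 3-4).
        p, r = np.random.choice(len(episodes[i]), size=2, replace=False)
        e_pos, e_ref = episodes[i][p], episodes[i][r]
        # Imitation loss (line 5).
        im_loss = -sum(policy_logprob(o, a, embed(e_ref)) for o, a in e_pos)
        for j in range(n):
            if j == i:
                continue
            # Negative episode from a different agent j (line 8).
            e_neg = episodes[j][np.random.randint(len(episodes[j]))]
            id_loss = triplet_distance(embed(e_pos), embed(e_neg), embed(e_ref))
            gradient_step(im_loss + lam * id_loss)  # lines 10-11
```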

7.4 Generalization in Multiagent Systems

Generalization is well-understood for supervised learning—models that show similar train and test performance exhibit good generalization. To measure the quality of the learned representations for a multiagent system (MAS), we introduce a graphical formalism for reasoning about agents and their interactions.

7.4.1 Generalization across agents & interactions

In many scenarios, we are interested in generalization of the policy representation function fφ across novel agents and interactions in a multiagent system. For instance, we would like fφ to output useful embeddings for a downstream task, even when evaluated with respect to unseen agents and interactions. This notion of

generalization is best understood using agent-interaction graphs [Gro+18d]. The agent-interaction graph describes interactions between a set of agent policies P and a set of interaction episodes I through a graph G = (P, I).1 An example graph is shown in Figure 7.1. The graph represents a multiagent system consisting of interactions between pairs of agents, and we will especially focus on the interactions involving Alice, Bob, Charlie, and Davis. The interactions could be competitive (e.g., a match between two agents) or cooperative (e.g., two agents communicating for a navigation task).

Figure 7.1: Agent-Interaction Graph. An example of a graph used for evaluating generalization in a multiagent system with 5 train agents (A, B, E, F, G) and 2 test agents (C, D).

We learn the representation function fφ on a subset of the interactions, denoted by the solid black edges in Figure 7.1. At test time, fφ is evaluated on some downstream task of interest. The agents and interactions observed at test time can be different from those used for training. In particular, we consider the following cases:

Weak generalization. Here, we are interested in the generalization performance of the representation function on an unseen interaction between existing agents, all of which are observed during training. This corresponds to the red edge representing the interaction between Alice and Bob in Figure 7.1. From the context of an agent-interaction graph, the test graph adds only edges to the train graph.

Strong generalization. Generalization can also be evaluated with respect to unseen agents (and their interactions). This corresponds to the addition of agents

1 If we have more than two participating agents per interaction episode, we could represent the interactions using a hypergraph.

Charlie and Davis in Figure 7.1. Akin to a few-shot learning setting, we observe a few of their interactions with existing agents Alice and Bob (green edges), and generalization is evaluated on unseen interactions involving Charlie and Davis (blue edges). The test graph adds both nodes and edges to the train graph. For brevity, we skip discussion of weaker forms of generalization that involve evaluating test performance on unseen episodes of an existing training edge (black edge).

7.4.2 Generalization across tasks

Since the representation function is learned using an unsupervised auxiliary objective, we test its generalization performance by evaluating the usefulness of these embeddings for the various kinds of downstream tasks described below.

Unsupervised. These embeddings can be used for clustering, visualization, and interpretability of agent policies in a low-dimensional space. Such semantic associations between the learned embeddings can be defined for a single agent, wherein we expect representations for the same agent based on distinct episodes to be embedded close to each other, or across agents, wherein agents with similar policies will have similar embeddings on average.

Supervised. Deep neural network representations are especially effective for predictive modeling. In a multiagent setting, the embeddings serve as useful features for learning agent properties and interactions, including the assignment of role categories to agents with different skills in a collaborative setting, or the prediction of win or loss outcomes of interaction matches between agents in a competitive setting.

Reinforcement. Finally, we can use the learned representation functions to improve generalization of the policies learned from a reinforcement signal in competitive and cooperative settings. We design policy networks that, in addition to observations, take embedding vectors of the opposing agents as inputs. The embeddings are computed from the past interactions of the opposing agent, either

with the agent being trained or with other agents using the representation function (Figure 7.3). Such embeddings play the role of privileged information and allow us to train a policy network that uses this information to learn faster and generalize better to opponents or cooperators unseen at training time.

(a) The RoboSumo environment. (b) The ParticleWorld environment.
Figure 7.2: Illustrations of the environments used in our experiments: (a) competitive; and (b) cooperative.

7.5 Empirical Evaluation

We evaluate the proposed framework for both competitive and collaborative environments on various downstream machine learning tasks. In particular, we use the RoboSumo and ParticleWorld environments for the competitive and collaborative scenarios, respectively. We consider the embedding objectives in Eq. (7.1), Eq. (7.2), and Eq. (7.3) independently, and refer to them as Emb-Im, Emb-Id, and Emb-Hyb respectively. The hyperparameter λ for Emb-Hyb is chosen by grid search over λ ∈ {0.01, 0.05, 0.1, 0.5} on a held-out set of interactions.

Figure 7.3: Illustration of the proposed model for optimizing a policy πψ that conditions on an embedding of the opponent policy πA. At time t, the pre-trained representation function fφ computes the opponent embedding based on a past interaction e^{t−1}. We optimize πψ to maximize the expected rewards in its current interactions e^t with the opponent.

In all our experiments, the representation function fφ is specified through a multi-layer perceptron (MLP) that takes an episode as input and outputs an embedding of that episode. In particular, the MLP takes a single (observation, action) pair as input and outputs an intermediate embedding; we average the intermediate embeddings of all (observation, action) pairs in an episode to obtain the episode embedding. To condition a policy network on the embedding, we simply concatenate the observation fed as input to the network with the embedding.
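A minimal sketch of this episode embedding with mean pooling; `mlp` is an assumed callable on a concatenated (observation, action) vector:

```python
import numpy as np

def episode_embedding(episode, mlp):
    """Average per-step MLP embeddings of (obs, act) pairs into a single
    episode-level embedding, as used to implement f_phi in our experiments."""
    steps = [mlp(np.concatenate([obs, act])) for obs, act in episode]
    return np.mean(steps, axis=0)
```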

7.5.1 The RoboSumo environment

For the competitive environment, we use RoboSumo [Al-+18]—a 3D environment with simulated physics (based on MuJoCo [TET12]) that allows agents to control multi-legged 3D robots and compete against each other in continuous-time wrestling games (Figure 7.2a). For our analysis, we train a diverse collection of 25 agents, some of which are trained via self-play and others trained in pairs concurrently using the Proximal Policy Optimization (PPO) algorithm [Sch+17b]. We start with a fully connected agent-interaction graph (clique) of 25 agents. Every edge in this graph corresponds to 10 rollout episodes involving the corresponding agents. The maximum length (or horizon) of any episode is 500 time steps, after which the episode is declared a draw. To evaluate weak generalization, we sample a connected subgraph for training with approximately 60% of the edges

preserved for training, with the remaining edges split equally between validation and testing. For strong generalization, we preserve 15 agents and their interactions with each other for training, and similarly, 5 agents and their within-group interactions each for validation and testing.

(a) RoboSumo: Weak (b) RoboSumo: Strong (c) ParticleWorld: Weak (d) ParticleWorld: Strong
Figure 7.4: Embeddings learned using Emb-Hyb for 10 test interaction episodes of 5 agents, projected on the first three principal components for RoboSumo and ParticleWorld. Color denotes agent policy.

Embedding analysis. To evaluate the robustness of the embeddings, we compute multiple embeddings for each policy based on different episodes of interaction at test time. Our evaluation metric is based on the intra- and inter-cluster Euclidean distances between embeddings.

Table 7.1: Intra-inter clustering ratios (IICR) and accuracies for outcome prediction (Acc) for weak (W) and strong (S) generalization on RoboSumo.

           IICR (W)   IICR (S)   Acc (W)   Acc (S)
Emb-Im     0.24       0.23       0.71      0.60
Emb-Id     0.25       0.27       0.67      0.56
Emb-Hyb    0.22       0.21       0.73      0.56

Figure 7.5: Average win rates of the newly trained agents against 5 training agents and 5 testing agents. (a) Top: Baseline comparison with policies that make use of Emb-Im, Emb-Id, and Emb-Hyb (all computed online). (b) Bottom: Comparison of different baseline embeddings used at evaluation time (all embedding-conditioned policies use Emb-Hyb). At each iteration, win rates were computed based on 50 1-on-1 games. Each agent was trained 3 times, each time from a different random initialization. Shaded regions correspond to 95% CI.

Figure 7.6: Win, loss, and draw rates plotted for the first agent in each pair (Emb-Hyb vs. PPO, Emb-Hyb vs. Emb-Im, Emb-Hyb vs. Emb-Id, Emb-Im vs. PPO, Emb-Im vs. Emb-Id, Emb-Id vs. PPO). Each pair of agents was evaluated after each training iteration on 50 1-on-1 games; curves are based on 5 evaluation runs. Shaded regions correspond to 95% CI.

                      1      2      3      4
(1) PPO               0.51   0.44   0.36   0.32
(2) PPO + Emb-Im      0.55   0.49   0.55   0.36
(3) PPO + Emb-Id      0.62   0.44   0.50   0.42
(4) PPO + Emb-Hyb     0.67   0.61   0.57   0.48

Figure 7.7: Win rates for the agents specified in each row, computed at iteration 1000.

The intra-cluster distance for an agent is the average pairwise distance between its embeddings computed on the set of test interaction episodes involving the agent. Similarly, the inter-cluster distance is the average pairwise distance between the embeddings of an agent and those of other agents. Let T_i = {t_c^{(i)}}_{c=1}^{n_i} denote the set of test interactions involving agent i. We define the intra-inter cluster ratio (IICR) as:

IICR = [ (1/n) ∑_{i=1}^{n} (1/n_i²) ∑_{a=1}^{n_i} ∑_{b=1}^{n_i} ‖t_a^{(i)} − t_b^{(i)}‖₂ ] / [ (1/(n(n−1))) ∑_{i=1}^{n} ∑_{j≠i} (1/(n_i n_j)) ∑_{a=1}^{n_i} ∑_{b=1}^{n_j} ‖t_a^{(i)} − t_b^{(j)}‖₂ ].
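The IICR is straightforward to compute from the test embeddings; a minimal sketch:

```python
import numpy as np

def iicr(test_embeddings):
    """Intra-inter clustering ratio. test_embeddings[i] is an (n_i, d) array of
    embeddings of agent i's test episodes; ratios below 1 indicate that
    same-agent embeddings cluster together."""
    intra, inter = [], []
    for i, Ti in enumerate(test_embeddings):
        # (1/n_i^2)-normalized sum of within-agent pairwise distances
        # (self-pairs contribute zero, matching the IICR definition).
        intra.append(np.linalg.norm(Ti[:, None] - Ti[None, :], axis=-1).mean())
        for j, Tj in enumerate(test_embeddings):
            if j != i:
                inter.append(np.linalg.norm(Ti[:, None] - Tj[None, :], axis=-1).mean())
    return np.mean(intra) / np.mean(inter)
```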

The intra-inter clustering ratios are reported in Table 7.1. A ratio less than 1 suggests that there is a signal that identifies the agent, and the signal is stronger for lower ratios. Even though this task might seem especially suited to the agent identification objective, we interestingly find that Emb-Im attains lower clustering ratios than Emb-Id for both weak and strong generalization. Emb-Hyb outperforms both of these methods. We qualitatively visualize the embeddings learned using Emb-Hyb by projecting them on the leading principal components, as shown in Figures 7.4a and 7.4b for 10 test interaction episodes of 5 randomly selected agents in the weak and strong generalization settings respectively.

Outcome prediction. We can use these embeddings directly to train a classifier for predicting the outcome of an episode (win/loss/draw). For classification, we use an MLP with 3 hidden layers of 100 units each, and the learning objective minimizes the cross-entropy error. The inputs to the classifier are the embeddings of the two agents involved in the episode. The results are reported in Table 7.1. Again, imitation-based methods seem more suited to this task, with Emb-Hyb and Emb-Im outperforming other methods for weak and strong generalization respectively.

Policy optimization. Here, we ask whether the embeddings can be used to improve learned policies in a reinforcement learning setting, both in terms of end performance and generalization. To this end, we select 5 training, 5 validation, and 5 testing opponents from the pool of 25 pre-trained agents. Next, we train a new agent with reinforcement learning to compete against the selected 5 training opponents; the agent is trained concurrently against all 5 opponents using a distributed version of the PPO algorithm, as described in [Al-+18]. Throughout training, we evaluate new agents on the 5 testing opponents and record the average win and draw rates. Using this setup, we compare a baseline agent with an MLP-based policy to an agent whose policy takes 100-dimensional embeddings of the opponents as additional inputs at each time step and uses that information to condition its behavior on the opponent's representation. The embeddings for each opponent are either computed online, i.e., based on an interaction episode rolled out during training at a

previous time step (Figure 7.3), or offline, i.e., pre-computed before training the new agent using only interactions between the pre-trained opponents. Figure 7.5 shows the average win rates against the set of training and testing opponents for the baseline and our agents that use different types of embeddings. While every new agent is able to achieve almost a 100% win rate against the training opponents, we see that the agents that condition their policies on the opponent's embeddings generalize better on the held-out set of opponents, with the best performance achieved by Emb-Hyb. We also note that embeddings computed offline lead to better performance than those computed online.2 As an ablation test, we also evaluate our agents when they are provided an incorrect embedding (either all zeros, Emb-zero, or an embedding selected for a different random opponent, Emb-rand) and observe that such embeddings lead to a degradation in performance.3 Finally, to evaluate strong generalization in the RL setting, we pit the newly trained baseline and embedding-conditioned agents against each other. Since the embedding network has never seen the new agents, it must exhibit strong generalization to be useful in such a setting. The results are given in Figures 7.6 and 7.7. Even though the margin is not very large, the agents that use Emb-Hyb perform the best on average.

7.5.2 The ParticleWorld environment

For the collaborative setting, we evaluate the framework on the ParticleWorld environment for cooperative communication [MA18; Low+17]. The environment consists of a continuous 2D grid with 3 landmarks and two kinds of agents collaborating to navigate to a common landmark goal (Figure 7.2b). At the beginning of every episode, the speaker agent is shown the RGB color of a single target landmark on the grid. The speaker then communicates a fixed-length binary message to the listener agent. Based on the received messages, the listener agent then moves in a particular direction.

2 Perhaps this is due to differences in the interactions of the opponents between themselves and with the new agent that the embedding network was not able to capture entirely.
3 The performance decrease is most significant for Emb-zero, which is an out-of-distribution all-zeros vector.

Table 7.2: Intra-inter clustering ratios (IICR) for weak (W) and strong (S) generalization on ParticleWorld. Lower is better.

           IICR (W)   IICR (S)
Emb-Im     0.58       0.86
Emb-Id     0.50       0.82
Emb-Hyb    0.54       0.85

Table 7.3: Average train and test rewards for speaker policies on ParticleWorld.

                      Train      Test
MADDPG                −11.66     −18.99
MADDPG + Emb-Im       −11.68     −17.75
MADDPG + Emb-Id       −11.68     −17.68
MADDPG + Emb-Hyb      −11.77     −17.20

The final reward, shared across the speaker and listener agents, is the distance of the listener to the target landmark after a fixed time horizon. The agent-interaction graph for this environment is bipartite, with edges only between speaker and listener agents. Every interaction edge in this graph corresponds to 1000 rollout episodes, where the maximum length of any episode is 25 steps. We pretrain 28 MLP-parameterized speaker and listener agent policies. Every speaker learns through communication with only two different listeners and vice versa, giving an extremely sparse agent-interaction graph. We explicitly encoded diversity in these speaker and listener agents by masking bits in the communication channel. In particular, we masked 1 or 2 randomly selected bits for every speaker agent in the graph, giving a total of (7 choose 1) + (7 choose 2) = 28 distinct speaker agents. Depending on the neighboring speaker agents in the agent-interaction graph, the listener agents also show diversity in the learned policies. The policies are learned using multiagent deep deterministic policy gradients (MADDPG) [Low+17]. In this environment, the speakers and listeners are tightly coupled, so we vary the setup used previously in the competitive scenario: we wish to learn embeddings of listeners based on their interactions with speakers. Since the agent-interaction graph is bipartite, we use the embeddings of listener agents to condition a shared policy network for the respective speaker agents.

Embedding analysis. For the weak generalization setting, we remove an outgoing edge from every listener agent in the original graph to obtain the training graph. In the case of strong generalization, we set aside 7 listener agents (and their outgoing edges) each for validation and testing, while the representation function is learned on the remaining 14 listener agents and their interactions. The intra-inter clustering ratios are shown in Table 7.2, and the projections of the embeddings learned using Emb-Hyb are visualized in Figure 7.4c and Figure 7.4d for weak and strong generalization respectively. In spite of the high degree of sparsity in the training graph, the intra-inter clustering ratio for the test interaction embeddings is less than unity, suggesting an agent-specific signal. Emb-Id works particularly well in this environment, achieving the best results for both weak and strong generalization.

Policy optimization. Here, we are interested in learning speaker agents that can communicate more effectively with a much wider range of listeners given knowledge of their embeddings. Referring back to Figure 7.3, we learn a policy πψ for a speaker agent that conditions on the representation function fφ for the listener agents. For cooperative communication, we consider interactions with 14 pre-trained listener agents, split as 6 training, 4 validation, and 4 test agents.4 Similar to the competitive setting, we compare performance against a baseline speaker agent that does not have access to any privileged information about the listeners. We summarize the results for the best validated models during training, over 100 interaction episodes per test listener agent across 5 initializations, in Table 7.3. From the results, we observe that the online embedding-based methods can generalize better than the baseline method: the baseline MADDPG achieves the lowest training error but fails to generalize well enough, incurring a low average reward for the test listener agents.

4 None of the methods considered were able to learn a non-trivial speaker agent when trained simultaneously with all 28 listener agents. Hence, we simplified the problem by considering the 14 listener agents that attained the best rewards during pretraining.

7.6 Discussion & Related Work

Agent modeling is a well-studied topic within multiagent systems; see [AS17] for an excellent survey. The vast majority of the literature in agent modeling concerns learning models for a specific predictive task. Predictive tasks are typically defined over the actions, goals, and beliefs of other agents [SV00]. In competitive domains such as Poker and Go, such tasks are often integrated with domain-specific heuristics to model opponents and learn superior policies [RW11; Mni+15]. Similarly, intelligent tutoring systems take into account pedagogical features of students and teachers to accelerate learning of desired behaviors in a collaborative environment [McC+00]. In contrast, we sidestep any domain-specific assumptions and learn representations in an unsupervised manner, which extends their utility beyond game playing [Mni+15] and predictive modeling [Hos17]. Both the generative and contrastive components of our framework have been explored independently in other contexts in prior work. Imitation learning has been extensively studied in the single-agent setting [Osa+18]. Recent work by [LYC17] proposes an algorithm for imitation in a coordinated multiagent system. [Wan+17b] proposed an imitation learning algorithm for learning robust controllers with few expert demonstrations in a single-agent setting that conditions the policy network on an inference network, similar to the encoder in our framework. Agent identification, which represents the contrastive term in the learning objective, is based on triplet losses. Triplet losses are used within Siamese networks to learn representations of data points using distance comparisons in a feature space [HA15]. Several other methods also exist for defining positive and negative examples for contrastive representation learning. For instance, the widely used word2vec framework for language uses co-occurrence statistics to define similar or "positive" examples [Mik+13], and has been extended to relational graph domains via random walks [PAS14; GL16]. Recently, contrastive methods have also been successfully applied for representation learning of sequential data in latent variable models [OLV18].

7.7 Conclusion

In this work, we presented a framework for learning representations of agent policies in multiagent systems. The agent policies are accessed using a few interaction episodes with other agents. Our learning objective is based on a novel combination of a generative component based on imitation learning and a contrastive component for distinguishing the embeddings of different agent policies. Our overall framework is unsupervised, sample-efficient, and domain-agnostic, and hence can be readily extended to many environments and downstream tasks. Most importantly, we showed the role of these embeddings as privileged information for learning more adaptive agent policies in both collaborative and competitive settings.

Part III

Applications in Science & Society

8 Compressed Sensing via Generative Models

8.1 Introduction

Compressed sensing is a class of techniques used to efficiently acquire and recover high-dimensional data using compressed measurements much fewer than the data dimensionality. The celebrated results in compressed sensing posit that sparse, high-dimensional data points can be acquired using far fewer measurements (roughly logarithmic in the data dimensionality) [CT05; Don06; CRT06]. The acquisition is done using certain classes of random matrices, and the recovery procedure is based on LASSO [Tib96; BRT09]. The assumptions of sparsity are fairly general and can be applied "out-of-the-box" for many data modalities; e.g., images and audio are typically sparse in the wavelet and Fourier basis respectively. However, such sparsity assumptions ignore the statistical nature of many real-world problems, where we increasingly have access to training datasets for domains of interest. In this chapter, we design a framework for compressed sensing based on generative modeling that exploits such auxiliary training data to learn the acquisition and recovery procedures for compressed sensing, thereby sidestepping generic sparsity assumptions. In particular, we introduce a new autoencoding framework termed uncertainty autoencoders (UAE). A UAE parameterizes the acquisition and recovery procedures for compressed sensing as the encoding and decoding phases of an


autoencoder with stochastic latent variables. The stochastic latents correspond to the (noisy) compressed measurements for the data point input to the encoder. The learning objective for a UAE is based on the InfoMax principle, which seeks to learn encodings that maximize the mutual information between the observed data points and the compressed representations [Lin89]. Since the mutual information is typically intractable in high dimensions, we instead maximize tractable variational lower bounds using a parametric decoder [BA03; Ale+17]. Unlike LASSO-based recovery, a parametric decoder amortizes the recovery process, requiring only a forward pass through the decoder at test time, and thus enables scalability to large datasets [GG14; Shu+18]. We show theoretically, under suitable assumptions, that an uncertainty autoencoder is an implicit generative model of the underlying data distribution [ML16], i.e., a UAE permits sampling from the learned data distribution even though it does not specify an explicit likelihood function. Hence, it directly contrasts with variational autoencoders (VAE), which specify an explicit but intractable likelihood function [KW18]. Unlike a VAE, a UAE does not require specifying a prior over the latents, and hence offsets pathologically observed scenarios that cause the latent representations to be uninformative when used with expressive decoders [Che+17b]. We evaluate UAEs for statistical compressed sensing of high-dimensional datasets. On the MNIST, Omniglot, and CelebA datasets, we observe average improvements of 38%, 31%, and 28% in recovery over the closest benchmark. We also demonstrate that uncertainty autoencoders exhibit good generalization performance across domains, where the encoder/decoder trained on a source dataset are transferred over for compressed sensing of another target dataset.

8.2 Preliminaries

Compressed sensing (CS). Let the data point and measurements be denoted by multivariate random variables X ∈ R^n and Y ∈ R^m respectively. For the purpose of compressed sensing, we assume m < n and relate these variables through a measurement matrix W ∈ R^{m×l} and a parameterized acquisition function fψ : R^n → R^l (for any integer l > 0) such that:

y = W fψ(x) + e,    (8.1)

where e is the measurement noise. The goal in compressed sensing is to recover x given the measurements y. If we let fψ(·) be the identity function (i.e., fψ(x) = x for all x), then we recover a standard system of underdetermined linear equations in which the measurements are linear combinations of the data point corrupted by noise.

In all other cases, the acquisition function transforms x such that fψ(x) is potentially more amenable to compressed sensing. For instance, fψ(·) could specify a change of basis that encourages sparsity, e.g., a Fourier basis for audio. Note that we allow the codomain of the mapping fψ(·) to be defined on a higher- or lower-dimensional space (i.e., l ≠ n in general).

Sparse CS. To obtain nontrivial solutions to an underdetermined system, X is assumed to be sparse in some basis B. We are not given any additional information about X. The measurement matrix W is a random Gaussian matrix, and recovery is done via LASSO [CT05; Don06; CRT06]. LASSO solves a convex ℓ1-minimization problem in which the reconstruction x̂ for any data point x is given as:

x̂ = arg min_x ‖Bx‖₁ + λ‖y − Wx‖₂²

where λ > 0 is a tunable hyperparameter.
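For reference, a minimal proximal-gradient (ISTA) sketch of this LASSO recovery for the special case B = I (signal sparse in the canonical basis); the step size is set from the Lipschitz constant of the smooth term:

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_recover(W, y, lam=10.0, n_iters=500):
    """ISTA for x_hat = argmin_x ||x||_1 + lam * ||y - W x||_2^2 (with B = I)."""
    x = np.zeros(W.shape[1])
    eta = 1.0 / (2 * lam * np.linalg.norm(W, 2) ** 2)  # 1 / Lipschitz constant
    for _ in range(n_iters):
        grad = -2 * lam * W.T @ (y - W @ x)      # gradient of the smooth term
        x = soft_threshold(x - eta * grad, eta)  # proximal step on the l1 term
    return x
```

Note that this optimization has to be re-run for every new measurement vector y, which is exactly the per-test-point cost that the amortized recovery function discussed next avoids.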

Statistical CS. In statistical compressed sensing, we are additionally given access to a set of signals D, such that each x ∈ D is assumed to be sampled i.i.d. from a data distribution qdata [YS11]. Using this dataset, we learn the measurement matrix W and the acquisition function fψ(·) in Eq. (8.1).

At test time, we directly observe the measurements ytest that are assumed to satisfy Eq. (8.1) for a target data point xtest ∼ qdata(X), and the task is to provide an accurate reconstruction x̂test. Evaluation is based on the reconstruction error between xtest and x̂test. Particularly relevant to this work, we can optionally learn a recovery function gθ : R^m → R^n to reconstruct x given the measurements y. This amortized approach is in contrast to standard LASSO-based decoding, which solves an optimization problem for every new data point at test time. If we learned

the recovery function gθ(·) during training, then x̂test = gθ(ytest) and the ℓ2 error is given by ‖xtest − gθ(ytest)‖₂. Such a recovery process requires only a function evaluation at test time and permits scaling to large datasets [GG14; Shu+18].

Autoencoders. An autoencoder is a pair of parameterized functions (e, d) designed to encode and decode data points. For a standard autoencoder, let e : R^n → R^m and d : R^m → R^n denote the encoding and decoding functions respectively for an n-dimensional data point and an m-dimensional latent space. The learning objective minimizes the ℓ2 reconstruction error over a dataset D:

min_{e,d} ∑_{x∈D} ‖x − d(e(x))‖₂²    (8.2)

where e and d are typically parameterized using deep neural networks.

8.3 Uncertainty Autoencoders

Consider a joint distribution between the signals X and the measurements Y, which factorizes via the chain rule as qφ(X, Y) = qdata(X) qφ(Y|X). Here, qdata(X) is a fixed data distribution and qφ(Y|X) is a parameterized observation model that depends on the measurement noise e, as given by Eq. (8.1). For instance, for isotropic Gaussian noise e with a fixed variance σ², we have qφ(Y|X) = N(W fψ(X), σ² I_m). In an uncertainty autoencoder, we wish to learn the parameters φ that permit efficient and accurate recovery of a signal X using the measurements Y. In order to do so, we propose to maximize the mutual information between X and Y:

max_φ Iφ(X, Y) = ∫ qφ(x, y) log [ qφ(x, y) / (qdata(x) qφ(y)) ] dx dy
              = H(X) − Hφ(X|Y)    (8.3)

where H denotes differential entropy. The intuition is simple: if the measurements preserve maximum information about the signal, we can hope that recovery will have low reconstruction error. We formalize this intuition by noting that this objective

is equivalent to maximizing the average log-posterior probability of X given Y. In fact, in Eq. (8.3), we can omit the term corresponding to the data entropy (since it is independent of φ) to get the following equivalent objective:

max_φ −Hφ(X|Y) = E_{qφ(X,Y)}[log qφ(x|y)].    (8.4)

Even though the mutual information is maximized and equals the data entropy when Y = X, the dimensionality constraint m ≪ n, the parametric assumptions on fψ(·), and the noise model prohibit learning an identity mapping. Note that properties of the noise e, such as its distributional family and sufficient statistics, are externally specified. For example, these could be specified based on properties of the measurement device for compressed sensing. More generally, for unsupervised representation learning, we treat these properties as hyperparameters tuned based on the reconstruction loss on a held-out set, or any other form of available supervision. We do not recommend optimizing these statistics during learning, as the UAE would tend to shrink the noise to zero to maximize mutual information, thereby ignoring measurement uncertainty in the context of compressed sensing and preventing generalization to out-of-distribution examples for representation learning. Theorem 8.1 analyzes the effect of noise more formally. Estimating mutual information between arbitrary high-dimensional random variables can be challenging. However, we can lower bound the mutual information by introducing a variational approximation to the model posterior qφ(X|Y) [BA03].

Denoting this approximation as pθ(X|Y), we get the following lower bound:

Iφ(X, Y) ≥ H(X) + E_{qφ(X,Y)}[log pθ(x|y)].    (8.5)

Comparing Eqs. (8.3, 8.4, 8.5), we can see that the second term in Eq. (8.5) approximates the intractable negative conditional entropy −Hφ(X|Y) with a variational lower bound. Optimizing this bound leads to a decoding distribution given by

pθ(X|Y) with variational parameters θ. The bound is tight when there is no distortion during recovery, or equivalently when the decoding distribution pθ(X|Y)

matches the true posterior qφ(X|Y) (i.e., the Bayes optimal decoder). Formally, the uncertainty autoencoder (UAE) objective is given by:

max_{θ,φ} E_{qφ(X,Y)}[log pθ(x|y)].    (8.6)

In practice, the data distribution qdata(X) is accessible via a finite dataset D. Hence, expectations with respect to qdata(X) and its gradients are estimated using Monte Carlo methods. This allows us to express the UAE objective as:

max_{θ,φ} ∑_{x∈D} E_{qφ(Y|x)}[log pθ(x|y)] =: L(φ, θ; D).    (8.7)

Tractable evaluation of the above objective is closely tied to the distributional assumptions on the noise model. This could be specified externally based on, e.g., properties of the sensing device in compressed sensing. For the typical case of an isotropic Gaussian noise model, we know that qφ(Y|X) = N(W fψ(X), σ² I_m), which is easy to sample from. Here, gradient estimation with respect to φ is particularly challenging, since these parameters specify the sampling distribution qφ(Y|X). One solution is to evaluate score function gradient estimates along with control variates [Fu06; Gly90; Wil92]. Alternatively, many continuous distributions (e.g., the isotropic Gaussian and Laplace distributions) can be reparameterized such that samples are obtained by applying a deterministic transformation to samples from a fixed distribution, which typically leads to low-variance gradient estimates [KW18; RMW14; Gla13; Sch+15a; Moh+19].
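For the Gaussian case, a reparameterized one-sample estimate of the negative objective in Eq. (8.7) is only a few lines; this is a sketch in which `f_psi` and `decoder` are hypothetical PyTorch modules, and the Gaussian decoder reduces −log pθ(x|y) to a squared error up to constants:

```python
import torch

def uae_loss(x, W, f_psi, decoder, sigma=0.1):
    """Reparameterized Monte Carlo estimate of the negative UAE objective.
    x: (batch, n) signals; W: (m, l) measurement matrix."""
    eps = torch.randn(x.shape[0], W.shape[0])    # noise from a fixed base distribution
    y = f_psi(x) @ W.T + sigma * eps             # pathwise-differentiable sample of Y|x
    x_hat = decoder(y)                           # mean of Gaussian p_theta(x|y)
    return ((x - x_hat) ** 2).sum(dim=1).mean()  # -log p, up to constants in sigma
```

Because the noise enters additively through `eps`, gradients flow through both W and the parameters of `f_psi` and `decoder` with standard backpropagation.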

8.3.1 Implicit generative modeling

Starting from an arbitrary point $x^{(0)} \in \mathbb{R}^n$, we can use a UAE to define a Markov chain over X, Y with the following transitions:

$$y^{(t)} \sim q_\phi(Y|x^{(t)}), \qquad (8.8)$$
$$x^{(t+1)} \sim p_\theta(X|y^{(t)}). \qquad (8.9)$$

Theorem 8.1. Let θ∗, φ∗ denote an optimal solution to the UAE objective in Eq. (8.6). If there exists a φ such that $q_\phi(x|y) = p_{\theta^*}(x|y)$ and the Markov chain defined in Eqs. (8.8, 8.9) is ergodic, then the stationary distribution of the chain for the parameters φ∗ and θ∗ is given by $q_{\phi^*}(X, Y)$.

Proof. See Appendix A.2.

The above theorem suggests an interesting insight into the behavior of UAEs. Under idealized conditions, the learned model specifies an implicit generative model for qφ∗ (X, Y). Further, ergodicity can be shown to hold for the isotropic Gaussian noise model.

Corollary 8.1. Let θ∗, φ∗ denote an optimal solution to the UAE objective in Eq. (8.6). If there exists a φ such that $q_\phi(x|y) = p_{\theta^*}(x|y)$ and the noise model is Gaussian, then the stationary distribution of the chain for the parameters φ∗ and θ∗ is given by $q_{\phi^*}(X, Y)$.

Proof. See Appendix A.3.

The marginal of the joint distribution qφ(X, Y) with respect to X corresponds to the data distribution. A UAE hence seeks to learn an implicit generative model of the data distribution [DG84; ML16], i.e., even though we do not have a tractable estimate for the likelihood of the model, we can generate samples using the Markov chain transitions defined in Eqs. (8.8, 8.9).
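Under the idealized conditions above, sampling from this implicit model amounts to simulating the transitions. The sketch below continues the UAE class from earlier; as an assumption made for illustration, it approximates sampling from pθ(X|y) by adding isotropic noise around the decoder mean:

import torch

@torch.no_grad()
def uae_sample(model, n=784, steps=200):
    """Simulate the chain of Eqs. (8.8, 8.9) with the UAE sketch above."""
    x = torch.randn(n)                              # arbitrary starting point x^(0)
    for _ in range(steps):
        y = model.W(x)
        y = y + model.sigma * torch.randn_like(y)   # Eq. (8.8): y^(t) ~ q_phi(Y | x^(t))
        x = model.decoder(y)                        # decoder mean
        x = x + model.sigma * torch.randn_like(x)   # Eq. (8.9): x^(t+1) ~ p_theta(X | y^(t))
    return x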

8.3.2 Comparison with Principal Component Analysis

A UAE can also be viewed as a dimensionality reduction technique. While in general the encoding performing this reduction can be nonlinear, in the case of a linear encoding the projection vectors are given by the rows of the measurement matrix W. In such a setting, a closely related objective for linear dimensionality reduction is Principal Component Analysis (PCA) [Pea01], whose objective is to find the directions that explain the most variance in the data. As noted in prior work [BA03; WCF07], however, the principal components may not be the most informative low-dimensional projections for recovering the original high-dimensional data back from its projections.


Figure 8.1: Dimensionality reduction using PCA vs. UAE. Projections of the data (black points) on the UAE direction (green line) maximize the likelihood of decoding, unlike the PCA projection axis (magenta line), which collapses many points into a narrow region.

A UAE, on the other hand, is explicitly designed to preserve as much information as possible (see Eq. (8.4)). We illustrate the differences in a synthetic experiment in Figure 8.1. The true data distribution is an equiweighted mixture of two Gaussians stretched along orthogonal directions. We sample 100 points (black) from this mixture and consider two dimensionality reductions. In the first case, we project the data on the first principal component (blue points on the magenta line). This axis captures a large fraction of the variance in the data but collapses data sampled from the bottom-right Gaussian into a narrow region. The projections of the data on the UAE axis (red points on the green line) are more spread out. This suggests that recovery is easier, even if doing so increases the total variance in the projected space compared to PCA.
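The following toy script reproduces the flavor of this comparison; the mixture means and covariances are illustrative choices, and a grid search over directions with a least-squares linear decoder stands in for gradient-based UAE training:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Equiweighted mixture of two Gaussians stretched along orthogonal directions;
# the means and covariances are illustrative, not from the dissertation.
comp = rng.integers(0, 2, size=100)
means = [np.array([-3.0, 0.0]), np.array([3.0, 0.0])]
covs = [np.diag([9.0, 0.1]), np.diag([0.1, 9.0])]
X = np.stack([rng.multivariate_normal(means[c], covs[c]) for c in comp])

w_pca = PCA(n_components=1).fit(X).components_[0]   # max-variance direction

def recon_error(w, sigma=0.1):
    """Error of the best linear decoder for noisy 1-D projections along w."""
    w = w / np.linalg.norm(w)
    y = X @ w + sigma * rng.normal(size=len(X))     # noisy measurements
    yc, Xc = y - y.mean(), X - X.mean(axis=0)
    a = Xc.T @ yc / (yc @ yc)                       # least-squares decoder slope
    X_hat = X.mean(axis=0) + np.outer(yc, a)
    return np.mean(np.sum((X - X_hat) ** 2, axis=1))

angles = np.linspace(0.0, np.pi, 180)
errors = [recon_error(np.array([np.cos(t), np.sin(t)])) for t in angles]
t_best = angles[int(np.argmin(errors))]
print("PCA axis:", w_pca, "UAE-like axis:", [np.cos(t_best), np.sin(t_best)])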

8.4 Empirical Evaluation

8.4.1 Statistical compressed sensing

We perform compressed sensing on three high-dimensional datasets with small numbers of measurements: m ∈ {2, 5, 10, 25, 50, 100} for MNIST [LCB10] and Omniglot [LST15] with 28 × 28 = 784 dimensions, and m ∈ {20, 50, 100, 200, 500} for the CelebA dataset [Liu+15] with 64 × 64 × 3 = 12288 dimensions. We first discuss MNIST and Omniglot, which share a similar setup. In all our experiments on these datasets, we assume a Gaussian noise model with σ = 0.1. We evaluated UAE against the following baselines:

• LASSO decoding with random Gaussian matrices. The MNIST and Omniglot datasets are reasonably sparse in the canonical pixel basis; hence, we did not observe any gains after applying the Discrete Cosine Transform or the Daubechies-1 Wavelet Transform.

• CS-VAE. This approach to compressed sensing was proposed by [Bor+17] and learns a latent variable generative model over the observed variables X and the latent variables Z. Such a model defines a mapping $G_\theta : \mathbb{R}^k \to \mathbb{R}^n$ from Z to X, given by either the mean function of the observation model for a VAE or the forward deterministic mapping used to generate samples for a GAN. We use VAEs in our experiments. Thereafter, using a random Gaussian or Bernoulli acquisition matrix W, the reconstruction $\hat{x}$ for any data point is given by $\hat{x} = G_\theta(\arg\min_{z \in \mathbb{R}^k} \|y - W G_\theta(z)\|_2)$; see the sketch after this list. Intuitively, this procedure seeks the latent vector z such that the corresponding point on the range of $G_\theta$ best approximates the measurements y under the mapping W. We used the default parameter settings and architectures proposed in [Bor+17].

• RP-UAE. This ablation baseline encodes the data using Gaussian random projections (RP) and trains the decoder based on the UAE objective, in order to independently evaluate the effect of variational decoding. Since LASSO and CS-VAE both use an RP encoding, any differences in performance arise only from the decoding procedures.
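As a rough illustration of the CS-VAE decoding step referenced above, the following sketch performs gradient descent in the latent space of a pretrained generator; G, W, y, and the latent size k are placeholders, and [Bor+17] additionally uses several random restarts:

import torch

def cs_vae_recover(G, W, y, k=20, steps=1000, lr=1e-2):
    """Search the latent space of a pretrained generator G for the z whose
    measurements best match y, in the style of [Bor+17]."""
    z = torch.randn(k, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = ((y - W @ G(z)) ** 2).sum()   # || y - W G(z) ||_2^2
        loss.backward()
        opt.step()
    return G(z).detach()                     # x_hat = G(z*)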

The UAE decoder and the CS-VAE encoder/decoder are multi-layer perceptrons with two hidden layers of 500 units each. For a fair comparison with random Gaussian matrices, the UAE encoder is linear (i.e., φ = W). Further, we perform $\ell_2$ regularization on the norm of W. This helps generalization to test signals outside the train set and is equivalent to solving the Lagrangian of a constrained UAE objective:

$$\max_{\theta,\phi} \; \mathbb{E}_{q_\phi(X,Y)}[\log p_\theta(x|y)] \quad \text{subject to } \|W\|_F \leq k.$$

The Lagrange multiplier is chosen by line search on the above objective. The constraint ensures that the UAE does not learn encodings W that trivially scale the measurement matrix to overcome noise. For each m, we choose k to be the expected norm of a random Gaussian matrix of dimensions n × m, for a fair comparison with the other baselines. In practice, the norm of the learned W for a UAE is much smaller than that of random Gaussian matrices, suggesting that the observed performance improvements are non-trivial.
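In the earlier sketch, the penalized form of this constrained objective amounts to adding a Frobenius-norm term to the loss; the coefficient below is a placeholder for the value found by line search:

# Penalized form of the constrained objective; lambda_ is a placeholder, and
# `model`, `x` follow the UAE sketch from Section 8.3.
lambda_ = 1e-3
penalized_loss = model.loss(x) + lambda_ * model.W.weight.norm(p="fro") ** 2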

Results. The $\ell_2$ reconstruction errors on the standard test sets are shown in Figure 8.2 (a-b). For both datasets, we observe that UAE drastically outperforms both LASSO and CS-VAE for all values of m considered. LASSO (blue curves) is unable to reconstruct with so few measurements. The CS-VAE (red) error decays much more slowly than UAE's as m grows. Even the RP-UAE baseline (yellow), which trains the decoder while keeping the encoding fixed to a random projection, outperforms CS-VAE. Jointly training the encoder and the decoder using the UAE objective (green) exhibits the best performance. These results are also reflected qualitatively in the reconstructed test signals shown in Figure 8.2 (c-d) for m = 25 measurements.


Figure 8.2: (a-b) Test $\ell_2$ reconstruction error (per image) for compressed sensing on (a) MNIST and (b) Omniglot. (c-d) Reconstructions for m = 25 on (c) MNIST and (d) Omniglot. First: Original. Second: LASSO. Third: CS-VAE. Last: UAE. 25 projections of the data are sufficient for UAE to reconstruct the original image with high accuracy.


Figure 8.3: (a) Test $\ell_2$ reconstruction error (per image) for compressed sensing on CelebA. (b) Reconstructions for m = 50 on the CelebA dataset. First: Target. Second: LASSO-DCT. Third: LASSO-Wavelet. Fourth: DCGAN. Last: UAE.

CelebA. For the CelebA dataset, the image dimensions are 64 × 64 × 3 and σ = 0.01. The naive pixel basis does not augur well for compressed sensing on such high-dimensional RGB datasets. Following [Bor+17], we experimented with the Discrete Cosine Transform (DCT) and Wavelet bases for the LASSO baseline. Further, we used the DCGAN architecture [RMC15] combined with [Bor+17] as our main baseline. For the UAE approach, we used additional convolutional layers in the encoder to learn a 256-dimensional feature space for the image before projecting it down to m dimensions. The results are shown in Figure 8.3 (a). While the performance of DCGAN is comparable to that of UAE for m = 500, UAE outperforms DCGAN significantly when m is low. The LASSO baselines do not perform well, consistent with the observations made for the MNIST and Omniglot experiments. Qualitative evaluations are shown in Figure 8.3 (b) for m = 50 measurements.

8.4.2 Transfer compressed sensing

To test the generalization of the learned models to similar, unseen datasets, we consider the transfer compressed sensing task introduced in [DGE18].

Experimental setup. We train the models on a source domain related to a target domain. Since the dimensions of MNIST and Omniglot images match, transferring from one domain to the other requires no additional processing. For UAE, we consider two variants. In UAE-SE, we use the encoder from the source domain and retrain the decoder on the target domain. In UAE-SD, we use the source decoder and retrain the encoder on the target domain.

Results. The $\ell_2$ reconstruction errors are shown in Figure 8.4 (a-b). LASSO (blue curves) does not involve any learning, and hence its performance is the same as in Figure 8.2 (a-b). The CS-VAE (red) performance degrades significantly in comparison, even performing worse than LASSO in some cases. The UAE-based methods outperform these approaches, and UAE-SE (green) fares better than UAE-SD (yellow). Qualitative differences are highlighted in Figure 8.4 (c-d) for m = 25 measurements.

8.4.3 Dimensionality reduction

Dimensionality reduction is a common preprocessing technique for specifying features for classification. We compare PCA and UAE on this task.


Figure 8.4: (a-b) Test $\ell_2$ reconstruction error (per image) for transfer compressed sensing with (a) source MNIST, target Omniglot, and (b) source Omniglot, target MNIST. (c-d) Reconstructions for m = 25 with (c) source MNIST, target Omniglot, and (d) source Omniglot, target MNIST. Top: Target. Second: CS-VAE. Third: UAE-SD. Last: UAE-SE.

For UAE decodings, we set the noise as a hyperparameter based on a validation set to enable out-of-sample generalization.

Setup. We learn the principal components and UAE projections on the MNIST training set for a varying number of dimensions. We then learn classifiers based on these projections. Again, we use a linear encoder for the UAE for a fair evaluation. Since inductive biases vary across classifiers, we considered 8 commonly used classifiers: k-Nearest Neighbors (kNN), Decision Trees (DT), Random Forests (RF), Multilayer Perceptron (MLP), AdaBoost (AdaB), Gaussian Naive Bayes (NB), Quadratic Discriminant Analysis (QDA), and Support Vector Machines (SVM) with a linear kernel.

Table 8.1: PCA vs. UAE. Average test classification accuracy for the MNIST dataset.

Dimensions  Method  kNN     DT      RF      MLP     AdaB    NB      QDA     SVM
2           PCA     0.4078  0.4283  0.4484  0.4695  0.4002  0.4455  0.4576  0.4503
2           UAE     0.4644  0.5085  0.5341  0.5437  0.4248  0.5226  0.5316  0.5256
5           PCA     0.7291  0.5640  0.6257  0.7475  0.5570  0.6587  0.7321  0.7102
5           UAE     0.8115  0.6331  0.7094  0.8262  0.6164  0.7286  0.7961  0.7873
10          PCA     0.9257  0.6354  0.6956  0.9006  0.7025  0.7789  0.8918  0.8440
10          UAE     0.9323  0.5583  0.7362  0.9258  0.7165  0.7895  0.9098  0.8753
25          PCA     0.9734  0.6382  0.6889  0.9521  0.7234  0.8635  0.9572  0.9194
25          UAE     0.9730  0.5407  0.7022  0.9614  0.7398  0.8306  0.9580  0.9218
50          PCA     0.9751  0.6381  0.6059  0.9580  0.7390  0.8786  0.9632  0.9376
50          UAE     0.9754  0.5424  0.6765  0.9597  0.7330  0.8579  0.9638  0.9384
100         PCA     0.9734  0.6380  0.4040  0.9584  0.7136  0.8763  0.9570  0.9428
100         UAE     0.9731  0.6446  0.6241  0.9597  0.7170  0.8809  0.9595  0.9431

Results. The performance of the PCA and UAE feature representations for different numbers of dimensions is shown in Table 8.1. We find that UAE outperforms PCA in a majority of the cases, and this trend is largely consistent across classifiers. The improvements are especially large when the number of dimensions is low, suggesting the benefits of UAE as a dimensionality reduction technique for classification.
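The evaluation protocol itself is straightforward to reproduce; the sketch below uses sklearn's small digits dataset as a stand-in for MNIST and PCA as the reducer, with a trained UAE encoder slotting into the same place:

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC

# Fit the reduction on training data only, then score downstream classifiers.
X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

reducer = PCA(n_components=10).fit(X_tr)   # a trained UAE encoder slots in here
Z_tr, Z_te = reducer.transform(X_tr), reducer.transform(X_te)

for clf in (KNeighborsClassifier(), LinearSVC(max_iter=5000)):
    clf.fit(Z_tr, y_tr)
    print(type(clf).__name__, round(clf.score(Z_te, y_te), 4))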

8.5 Discussion & Related Work

In this section, we contrast uncertainty autoencoders with related work in autoencoding, compressed sensing, and mutual information maximization.

Autoencoders. To contrast uncertainty autoencoders with other commonly used autoencoding schemes, consider a UAE with a Gaussian observation model and fixed isotropic covariance for the decoder. The UAE objective can be simplified as:

$$\min_{\theta,\phi} \; \mathbb{E}_{x,y \sim q_\phi(X,Y)}\left[\|x - g_\theta(y)\|_2^2\right]$$

If we assume no measurement noise (i.e., e = 0) and assume the observation model pθ(X|Y) to be a Gaussian with mean gθ(Y) and a fixed isotropic Σ, then the UAE objective reduces to minimizing the mean squared error between the true and recovered data point:

$$\min_{\theta,W,\psi} \; \mathbb{E}_{x \sim q_{\text{data}}(X)}\left[\|x - g_\theta(W f_\psi(x))\|_2^2\right]$$

This special case of a UAE corresponds to a standard autoencoder [Ben+09] where the measurements Y signify a hidden representation for X. However, this case lacks the interpretation of an implicit generative model since the assumptions of Theorem 8.1 do not hold.

Denoising autoencoders. A DAE [Vin+08] adds noise at the level of the input data point X to learn robust representations. For a UAE, the noise model is defined at the level of the compressed measurements. Again, with the assumption of a Gaussian decoder, the DAE objective can be expressed as:

$$\min_{\theta,W,\psi} \; \mathbb{E}_{x \sim q_{\text{data}}(X),\, \tilde{x} \sim C(\tilde{X}|x)}\left[\|x - g_\theta(W f_\psi(\tilde{x}))\|_2^2\right]$$

where C(·|X) is some predefined noise corruption model. Similar to Theorem 8.1, a DAE also learns an implicit model of the data distribution [Ben+13; AB14].

Variational autoencoders. A VAE [KW18; RMW14] explicitly learns a latent variable model pθ(X, Y) for the dataset. The learning objective is a variational lower bound on the marginal log-likelihood assigned by the model to the data X, which notationally corresponds to $\mathbb{E}_{q_{\text{data}}(X)}[\log p_\theta(x)]$. The variational objective that maximizes this quantity can be simplified as:

$$\min_{\theta,\phi} \; \mathbb{E}_{x,y \sim q_\phi(X,Y)}\left[\|x - g_\theta(y)\|_2^2\right] + \mathbb{E}_{x \sim q_{\text{data}}(X)}\left[D_{\mathrm{KL}}(q_\phi(y|x),\, p_\theta(y))\right]$$

The learning objective includes a reconstruction error term, akin to the UAE objective. Crucially, it also includes a regularization term that minimizes the KL divergence between the variational posterior and the prior over the latents. A UAE, on the other hand, does not explicitly need to model the prior distribution over the latents. On the downside, a VAE can perform efficient ancestral sampling, while a UAE requires running relatively expensive Markov chains to obtain samples. Recent works have attempted to unify the variants of variational autoencoders through the lens of mutual information [Ale+18; ZSE18; Che+17b]. These works also highlight scenarios where the VAE can learn to ignore the latent code in the presence of a strong decoder, thereby hurting the reconstructions to attain a lower KL loss. One particular variant, the β-VAE, weighs the additional KL regularization term by a positive factor β and can effectively learn disentangled representations [Hig+16; ZSE19]. Although [Hig+16] does not consider this case, the UAE can be seen as a β-VAE with β = 0. To summarize, our UAE formulation provides a combination of desirable properties for representation learning that are absent in prior autoencoders. As discussed, a UAE defines an implicit generative model without specifying a prior (Theorem 8.1) even under realistic conditions (Corollary 8.1), unlike DAEs.

Generative modeling and compressed sensing. [Bor+17] first proposed to use generative models for compressed sensing. As described in Section 8.4, the goal there is to recover signals that lie on the range of a pretrained generative model. [DGE18] further extend this approach to allow for recovery of signals that lie on or close to the range of the generative model, thereby guaranteeing exact recovery in the limit of the number of measurements matching the data dimensionality. Similar to [Bor+17], a UAE learns a data distribution. However, in doing so, it additionally learns an acquisition/encoding function and a recovery/decoding function, unlike [Bor+17; DGE18], which rely on generic random matrices and $\ell_2$ decoding. The cost of implicit learning in a UAE is that some of its inference capabilities, such as likelihood evaluation and sampling, are intractable or require running Markov chains. However, these inference queries are orthogonal to compressed sensing. Finally, our decoding is amortized and scales to large datasets, unlike [Bor+17; DGE18], which solve an independent optimization problem for each test data point.

Mutual information maximization. The principle of mutual information maximization, often referred to as InfoMax in prior work, was first proposed for learning encodings for communication over a noisy channel [Lin89]. The InfoMax objective has also been applied to statistical compressed sensing for learning both linear and nonlinear encodings [WCF07; Car+12; Wan+14]. Our work differs from these existing frameworks in two fundamental ways. First, we optimize a tractable variational lower bound on the MI, which allows our method to scale to high-dimensional data. Second, we learn an amortized [GG14; Shu+18] decoder in addition to the encoder, which sidesteps expensive, per-example optimization for the test data points. Further, we improve upon the IM algorithm proposed originally for variational information maximization [BA03]. While the IM algorithm optimizes the lower bound on the mutual information in alternating ``wake-sleep'' phases for the encoder (``wake'') and decoder (``sleep''), analogous to the expectation-maximization procedure used in [WCF07], we optimize the encoder and decoder jointly using a single consistent objective, leveraging recent advances in gradient-based stochastic variational optimization.

8.6 Conclusion

We presented uncertainty autoencoders (UAE), a framework for unsupervised representation learning via variational maximization of the mutual information between an input signal and its latent representation. We showed formally that a UAE learns an implicit generative model of the underlying data distribution. Empirically, we showed that UAEs are a natural candidate for statistical and transfer compressed sensing and outperform competing approaches for sensing high-dimensional images in practice. In recent follow-up work, we extend the UAE framework to learning discrete latent representations [Cho+19]. In the future, it would be interesting to incorporate advances in compressed sensing based on complex neural network architectures [MPB15; Kul+16; Ric+17; Lu+18; Van+18] within the UAE framework. Unlike the rich theory surrounding compressed sensing of sparse signals, a similar theory for generative model-based priors on the signal distribution is lacking. Recent works have made promising progress in developing a theory of SGD-based recovery methods for nonconvex inverse problems [Bor+17; HV18; DGE18; Liu+18b], which continues to be an exciting direction for future work.

9 Multi-Fidelity Optimal Experimental Design: A Case Study in Battery Fast Charging

9.1 Introduction

Simultaneously optimizing many design parameters for time-consuming experiments bottlenecks a broad range of scientific and engineering disciplines [Tab+18; But+18]. One such example is process and control optimization for lithium-ion batteries during materials selection, cell manufacturing, and operation. A common objective is to maximize lifetime, but conducting even a single experiment to evaluate the lifetime of a battery can take months to years [Bau+14; KJ16; Sev+19] and requires many replicate samples due to the large parameter spaces and high sampling variability [Bau+14; Sch+15b; HHL17]. Hence, the key challenge is to reduce both the number and the duration of the required experiments. Optimal experimental design (OED) approaches are widely used to reduce the cost of experimental optimization. These approaches often involve a closed-loop pipeline where feedback from completed experiments informs subsequent experimental decisions, balancing the competing demands of exploration, i.e., testing regions of the experimental parameter space with high uncertainty, and exploitation, i.e., testing promising regions based on the results of the completed experiments.


OED algorithms have been successfully applied to physical science domains such as materials science [Tab+18; But+18; Nik+16; Lin+17], chemistry [Béd+18; Gra+18], biology [Kin+09], and drug discovery [Sch18], as well as to computer science domains such as hyperparameter optimization for machine learning [DSH15; Kle+17]. However, while a closed-loop approach is designed to minimize the number of experiments required for optimizing over a multidimensional parameter space, the time (and/or cost) per experiment may remain high, as is the case for lithium-ion batteries (LIBs). Therefore, an OED approach should account for both the number of experiments and the cost per experiment. Multi-fidelity OED approaches have been developed to learn from both inexpensive, noisy signals and expensive, accurate signals in certain domains. For example, in hyperparameter optimization for machine learning algorithms, several domain-specific low-fidelity signals for predicting the final performance of an algorithmic configuration are used in tandem with more complete configuration evaluations [HHL11], e.g., extrapolated learning curves [DSH15; Kle+17] and rapid testing on a subset of the full training dataset [Pet00; Li+17]. For LIBs, classical methods such as factorial design that use predetermined heuristics to select experiments have been applied [LLW09; LHL11; Sch+18], but the design and use of low-fidelity signals remains challenging and unexplored. In this chapter, we develop a multi-fidelity closed-loop optimization (CLO) methodology for efficient optimization over large parameter spaces with expensive experiments and high sampling variability. We experimentally employ this system to optimize fast-charging protocols for LIBs; reducing charging time to approach gasoline refueling time is critical to reduce range anxiety for electric vehicles [Ahm+17; LZC19] but often comes at the expense of battery lifetime. Specifically, we optimize over a parameter space consisting of 224 unique six-step, 10-minute fast-charging protocols (i.e., how current and voltage are controlled during charging) to find charging protocols with high cycle life, defined as the number of cycles until the battery capacity falls to 80% of its nominal value. Our system uses two key elements to reduce the optimization cost. First, we reduce the time per experiment by using a probabilistic model to predict the outcome of the experiment based on data from early cycles, well before the batteries reach the end of life [Sev+19]. Second, we reduce the number of experiments by using a Bayesian optimization (BO) algorithm to balance the exploration-exploitation trade-off in choosing the next round of experiments [HSF14; Gro+18b]. Testing a single battery to failure under our fast-charging conditions requires approximately 40 days. For example, when 48 experiments are performed in parallel, assessing all 224 charging protocols with triplicate measurements takes approximately 560 days. Here, using CLO with early outcome prediction, only 16 days were required to confidently identify protocols with high cycle lives (48 parallel experiments). In a subsequent validation study, we find that CLO ranks these protocols by lifetime accurately (Kendall rank correlation coefficient = 0.83) and efficiently (15x less time than a baseline brute-force approach). Furthermore, we find that the charging protocols identified as optimal by CLO with early prediction outperform existing fast-charging protocols designed to avoid lithium plating, the approach suggested by conventional battery wisdom [KJ16; Ahm+17; LZC19; Sch+18].

9.2 Preliminaries

Let $\mathcal{X}$ denote the space of fast-charging protocols and $f : \mathcal{X} \to \mathbb{R}$ the (unknown) function that relates a charging protocol $x \in \mathcal{X}$ to the observed lifetime y of a battery as:

$$y = f(x) + e \qquad (9.1)$$

where e is some measurement noise, e.g., due to manufacturing variability. We do not know f, but we can query its noisy lifetime value y by conducting battery charging experiments. Our goal is to find the protocol $x^* \in \mathcal{X}$ with the highest expected lifetime:

$$x^* = \arg\max_{x \in \mathcal{X}} f(x) \qquad (9.2)$$

There are two primary challenges in inferring x∗ via ``black-box'' optimization that only allows query access to f: (a) the protocol space $\mathcal{X}$ may be too large to permit systematic exploration, and (b) the measurement noise can be very high, which requires repeated querying (exploitation) to estimate the expected lifetime of any candidate protocol. In addition, a third challenge for battery optimization is that evaluating the function f for any protocol x involves time-consuming real-world experimentation. The first two challenges involve a trade-off between exploration and exploitation. A popular class of techniques for navigating this trade-off is based on Bayesian optimization [Sha+15]. Bayesian optimization can be seen as a wrapper around any base probabilistic regression model, e.g., Gaussian processes, Bayesian neural networks, etc. The key idea in Bayesian optimization is to query the black-box function at points that best optimize an acquisition function derived from the predictions of the base regression model. In this chapter, we use the upper confidence bound (UCB) as the acquisition function, given by:

$$\mathrm{UCB}(x) = \mu(x) + \sigma(x) \qquad (9.3)$$

where μ(x) and σ(x) are the mean and uncertainty (standard deviation) predictions for x given by the base regression model. Other acquisition functions are also possible, and our key contributions in this chapter are agnostic to this choice; we refer the interested reader to [Fra18] for a recent survey. Intuitively, the UCB score is high for points with high uncertainty (i.e., high σ) and high mean (i.e., high μ). Initially, the base model is random (or trained on offline data). Thereafter, we select the point x̃ with the highest UCB score for experimentation. The result of that experiment gives another training point (x̃, ỹ), which is then used to retrain the base regression model. The whole process repeats until the experimentation budget is exhausted.
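A generic version of this loop over a finite candidate set can be sketched as follows; the function and variable names are illustrative, and the real system described below selects batches of 48 protocols per round rather than one point at a time:

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def bo_ucb(f, candidates, n_rounds=20, seed=0):
    """Sequential UCB loop over a finite candidate set (rows of `candidates`);
    `f` is the noisy black box being optimized."""
    rng = np.random.default_rng(seed)
    gp = GaussianProcessRegressor(kernel=RBF(), alpha=1e-2, normalize_y=True)
    X_seen = [candidates[rng.integers(len(candidates))]]   # random first query
    y_seen = [f(X_seen[0])]
    for _ in range(n_rounds - 1):
        gp.fit(np.array(X_seen), np.array(y_seen))
        mu, sigma = gp.predict(candidates, return_std=True)
        x_next = candidates[int(np.argmax(mu + sigma))]    # Eq. (9.3)
        X_seen.append(x_next)
        y_seen.append(f(x_next))
    return X_seen[int(np.argmax(y_seen))]                  # best point found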


Figure 9.1: Schematic of our closed-loop optimization (CLO) system. First, batteries are tested. The cycling data from the first 100 cycles, specifically electrochemical measurements such as voltage and capacity, are used as input for early outcome prediction of cycle life. These cycle life predictions are subsequently sent to a Bayesian optimization (BO) algorithm, which recommends the next protocols to test by balancing the competing demands of exploration (testing protocols with high uncertainty in estimated cycle life) and exploitation (testing protocols with high estimated cycle life). This process iterates until the testing budget is exhausted.

9.3 Closed-Loop Optimization With Early Prediction

CLO with early outcome prediction is depicted schematically in Figure 9.1. The system consists of three components: parallel battery cycling, an early predictor of cycle life, and a Bayesian optimization (BO) algorithm. At each sequential round, we iterate over these three components. The first component is a multi-channel battery cycler; the cycler used in this work tests 48 batteries simultaneously. Before starting CLO, the charging protocols for the first round of 48 batteries are chosen at random (without replacement) from the complete set of 224 unique multistep protocols. Each battery undergoes repeated charging and discharging for 100 cycles (∼4 days; average predicted cycle life = 905 cycles), beyond which the experiments are terminated.

This cycling data is then fed as input to the early outcome predictor, which estimates the final cycle lives of the batteries given data from the first 100 cycles. The early predictor is a linear model trained via elastic net regression on features extracted from the charging data of the first 100 cycles, similar to that presented in [Sev+19]. Predictive features include transformations of both differences between voltage curves and discharge capacity fade trends. To train the early predictor, we require a training dataset of batteries cycled to failure. Here, we used a pre-existing dataset of 41 batteries cycled to failure (cross-validation root mean squared error = 80.4 cycles). While obtaining this dataset itself requires running full cycling experiments for a small training set of batteries (the very cost we are trying to offset), this one-time cost can be avoided if pretrained predictors or previously collected datasets are available. If unavailable, we pay an upfront cost in collecting this dataset, which can also be used for warm-starting the BO algorithm. The size of the collected dataset should trade off the upfront cost of acquiring enough data to train an accurate model against the anticipated reduction in experimentation required for CLO. Finally, the predicted cycle lives from early-cycle data are fed into the BO algorithm, which recommends the next round of 48 charging protocols that best balance the exploration-exploitation trade-off. This algorithm is presented in detail in the full version of the companion paper [Att+20] and builds on the prior work of [HA15] and [Gro+18b]. The algorithm maintains an estimate of both the average cycle life and the uncertainty bounds for each protocol; these estimates are initially equal for all protocols and are refined as additional data are collected. Crucially, to reduce the total optimization cost, our algorithm performs these updates using estimates from the early outcome predictor instead of the actual cycle lives. The mean and uncertainty estimates for the cycle lives are obtained via a Gaussian process, which has a smoothing effect and allows updating the cycle life estimates of untested protocols with the predictions from related protocols. The closed-loop process repeats until the optimization budget, in our case 192 batteries tested (100 cycles each), is exhausted.
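In the spirit of [Sev+19], such a predictor can be sketched as a cross-validated elastic net; the feature matrix and labels below are synthetic placeholders, and regressing log cycle life is an assumed variance-stabilizing choice rather than a detail taken from the text:

import numpy as np
from sklearn.linear_model import ElasticNetCV

rng = np.random.default_rng(0)
# Placeholder feature matrix for 41 training batteries (e.g., summaries of
# voltage-curve differences and capacity-fade trends) and their cycle lives.
features = rng.normal(size=(41, 12))
cycle_life = rng.uniform(300, 1200, size=41)

# Cross-validated elastic net fit on log cycle life (assumed transformation).
model = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9], cv=5).fit(features, np.log(cycle_life))
predicted_cycle_life = np.exp(model.predict(features))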


Figure 9.2: Structure of our 10-minute, six-step fast-charging protocols. (a) Current vs. SOC for an example charging protocol, 7.0C–4.8C–5.2C–3.45C (bold lines). Each charging protocol is defined by five constant current ("CC") steps followed by one constant voltage ("CV") step. The last two steps (CC5 and CV1) are identical for all charging protocols. We optimize over the first four constant-current steps, denoted CC1, CC2, CC3, and CC4. Each of these steps spans a 20% SOC window, such that CC1 ranges from 0–20% SOC, CC2 from 20–40% SOC, etc. CC4 is constrained by requiring that all protocols charge in the same total time (10 minutes) from 0% to 80% SOC. Thus, our parameter space consists of unique combinations of the three free parameters CC1, CC2, and CC3. For each step, we specify a range of acceptable values; the upper limit is monotonically decreasing with increasing SOC to avoid the upper cutoff potential (3.6 V for all steps). (b) CC4 (color) as a function of CC1, CC2, and CC3 (x, y, and z axes, respectively). Each point represents a unique charging protocol.

Our objective is to find the charging protocol that maximizes the expected battery cycle life for a fixed charging time (10 minutes) and state-of-charge (SOC) range (0 to 80%). The design space of our 224 six-step extreme fast-charging protocols is presented in Figure 9.2a. Multistep charging protocols, in which a series of different constant-current steps are applied within a single charge, are considered advantageous over single-step charging for maximizing cycle life during fast charging [KJ16; Ahm+17], though the optimal combination remains unclear. As shown in Figure 9.2b, each protocol is specified by three independent parameters (CC1, CC2, and CC3); each parameter is a current applied over a fixed SOC range (0–20%, 20–40%, and 40–60%, respectively). A fourth parameter, CC4, is determined by CC1, CC2, CC3, and the charging time. Given constraints on the current values, a total of 224 charging protocols are permitted. We test commercial lithium iron phosphate (LFP)/graphite cylindrical batteries (A123 Systems) in a convective environmental chamber (30°C ambient temperature). A maximum voltage of 3.6 V is imposed. These batteries are designed to fast-charge in 17 minutes; cycle life decreases dramatically with faster charging. Since the LFP positive electrode is generally considered stable under fast charging [KJ16; Sev+19], we select this battery chemistry to isolate the effects of extreme fast charging on graphite, which is universally employed in LIBs.

9.4 Empirical Evaluation

In all, we ran four CLO rounds sequentially, consisting of 185 batteries in total (excluding seven batteries; see Methods in [Att+20]). Using early prediction, each CLO round requires four days to complete 100 cycles, resulting in a total testing time of sixteen days, a significant reduction from the 560 days required to test each charging protocol to failure three times. Figure 9.3 presents the predictions and selected protocols per round, and Figure 9.4 shows the evolution of cycle life estimates over the parameter space as the optimization progresses. Initially, the estimated cycle lives for all protocols are equal. After two rounds, the overall structure of the parameter space (i.e., the dependence of cycle life on the charging protocol parameters CC1, CC2, and CC3) emerges, and a prominent region of protocols with high cycle life is identified.

Figure 9.3: Early cycle life predictions per round. The tested charging protocols and the resulting predictions are plotted for rounds 1–4. Each point represents a charging protocol, defined by CC1, CC2, and CC3 (x, y, and z axes, respectively). The color represents cycle life predictions from the early outcome prediction model. The charging protocols in the first round of testing are randomly selected. As BO shifts from exploration to exploitation, the charging protocols selected for testing by the closed loop in subsequent rounds are primarily in the high-performing region.

Figure 9.4: Evolution of the parameter space per round. The color represents cycle life as estimated by BO. The initial cycle life estimates are equivalent for all protocols; as more predictions are generated, BO updates its cycle life estimates. CHAPTER 9. MULTI-FIDELITY OPTIMAL EXPERIMENTAL DESIGN 138

CLO's confidence in this high-performing region improves further from round 2 to round 4, but overall the cycle life estimates do not change significantly.


Figure 9.5: (a) Distribution of the number of repetitions for each charging protocol. Only 46 of the 224 protocols (21%) are tested multiple times. (b) Current vs. SOC for the top three fast-charging protocols as estimated by CLO: 4.8C-5.2C-5.2C-4.160C, 5.2C-5.2C-4.8C-4.160C, and 4.4C-5.6C-5.2C-4.252C (CC1–CC4). All three protocols have relatively uniform charging (i.e., CC1 ≈ CC2 ≈ CC3 ≈ CC4).

By learning and exploiting the structure of the parameter space, we avoid evaluating charging protocols with low estimated cycle life and concentrate more resources on the high-performing region. Specifically, 117 of the 224 protocols are never tested (Figure 9.5a); we spend 67% of the batteries testing 21% of the protocols (0.83 batteries per protocol on average). CLO repeatedly tests several protocols with high estimated cycle life to decrease uncertainties due to manufacturing variability and the error introduced by early outcome prediction. The uncertainty is expressed via the prediction intervals of the posterior predictive distribution over cycle life. To the best of our knowledge, this work presents the largest known map of cycle life as a function of charging conditions. This dataset can be used to validate physics-based models of battery degradation. Most fast-charging protocols proposed in the battery literature suggest that current steps decreasing monotonically as a function of SOC are optimal to avoid lithium plating on graphite, a well-accepted degradation mode during fast charging [KJ16; Ahm+17; LZC19; Sch+18]. In contrast, the protocols identified as optimal by CLO (e.g., Figure 9.5b) are generally similar to single-step constant-current charging (i.e., CC1 ≈ CC2 ≈ CC3 ≈ CC4). Specifically, of the 75 protocols with the highest estimated cycle lives, only ten are monotonically decreasing (i.e., CC(i) ≥ CC(i+1) for all i) and two are strictly decreasing (i.e., CC(i) > CC(i+1) for all i). We speculate that minimizing parasitic reactions caused by heat generation, as opposed to minimizing the propensity for lithium plating, may be the operative optimization strategy for these cells. While the optimal protocol for a new scenario would depend on the selected charge time, SOC window, temperature control conditions, and battery chemistry, this unexpected result highlights the need for data-driven approaches to optimizing fast charging.


Figure 9.6: Discharge capacity vs. cycle number for all batteries in the validation experiment. See Figure 9.7 for the color scheme legend.

We validate the performance of CLO with early prediction on a subset of nine extreme fast-charging protocols. For each of these protocols, we cycle five batteries to failure and use the sample average of the final cycle lives as an estimate of the true lifetime. We use this validation study to (a) confirm that CLO is able to correctly rank protocols by cycle life, and (b) compare the cycle lives of protocols recommended by CLO with those of protocols inspired by the battery literature.

[Figure 9.7 appears here; panel (c) annotates each validation protocol with its mean final cycle life, sorted by CLO ranking: 4.8C-5.2C-5.2C-4.160C: 890 cycles; 5.2C-5.2C-4.8C-4.160C: 912; 4.4C-5.6C-5.2C-4.252C: 884; 3.6C-6.0C-5.6C-4.755C: 755; 6.0C-5.6C-4.4C-3.834C: 880; 7.0C-4.8C-4.8C-3.652C: 870; 8.0C-4.4C-4.4C-3.940C: 702; 8.0C-6.0C-4.8C-3.000C: 584; 8.0C-7.0C-5.2C-2.680C: 496.]

Figure 9.7: (a) Comparison of early-predicted cycle lives from validation to closed-loop estimates, averaged per protocol. Each 10-minute charging protocol is tested with five batteries. Error bars represent 95% confidence intervals. (b) Observed vs. early-predicted cycle life for the validation experiment. While our early predictor appears biased, likely due to calendar aging effects, the trend is correctly captured (Pearson correlation coefficient r = 0.86). (c) Final cycle lives from validation, sorted by CLO ranking. The length of each bar and its annotation represent the mean final cycle life from validation per protocol. Error bars represent 95% confidence intervals.

The charging protocols used in validation, some of which are inspired by the existing battery fast-charging literature, span the range of estimated cycle lives. We adjust the voltage limits and charging times of these literature protocols to match our protocols, while maintaining similar current ratios as a function of SOC. While the literature protocols used in these validation experiments are generally designed for batteries with high-voltage positive electrode chemistries, fast-charging optimization strategies generally focus on the graphitic negative electrode [KJ16; Ahm+17]. For these nine protocols, we validate the ``CLO-estimated'' cycle lives against the sample average of the five final cycle lives. The discharge capacity fade curves (Figure 9.6) exhibit the nonlinear decay typical of fast charging [Sev+19; HHL17]. Next, we discuss the results of the validation study. If we apply our early prediction model to the batteries in the validation experiment, the early predictions (averaged over each protocol) match the CLO-estimated mean cycle lives well (Pearson correlation coefficient r = 0.93; Figure 9.7a). This result validates the performance of the BO component of CLO in particular, since the CLO-estimated cycle lives were inferred from early predictions. However, our early prediction model exhibits some bias (Figure 9.7b), likely due to calendar aging effects from different battery storage times [Kei+16]. Despite this bias, we capture the ranking well (Kendall rank correlation coefficient = 0.83, Figure 9.7c). At the same time, we note that the final cycle lives of the top-ranked protocols are similar. Furthermore, the optimal protocols identified by CLO outperform protocols inspired by previously published fast-charging protocols (895 vs. 728 cycles on average). This result suggests that the efficiency of our approach does not come at the expense of accuracy. A procedure that does not use early outcome prediction and simply selects protocols to test at random begins to saturate to a competitive performance level only after ∼7,700 battery-hours of testing; CLO requires only 500 battery-hours to achieve a similar level of performance.

9.5 Discussion & Related Work

The underlying optimal experimental design problem we address is an instance of multi-fidelity black-box optimization. Such methods are especially useful when a single function evaluation is very expensive (in terms of time or resources) but noisy, low-fidelity signals are easy to obtain. The canonical use case considered in prior work is hyperparameter optimization for machine learning algorithms. Here, the goal is to find the optimal configuration of hyperparameters for learning a model on a target dataset. Prominent examples of low-fidelity signals include (a) early prediction of learning curves (e.g., [DSH15]) and (b) testing a hyperparameter configuration on a subset of the data (e.g., [Pet00]). Multi-fidelity approaches (e.g., [Van04; KPB15; SST15; Spa+15]) use one or more of these low-fidelity signals to dynamically select which hyperparameter configuration to test next, e.g., successive halving routines [KKS13] such as Hyperband [Li+17], which periodically remove the worst-performing half of the configurations. We provide details on methods that predict learning curves, since that form of fidelity signal is similar to the one we consider. We further refer the reader to Section 1.4 of [HKV19] for an excellent and comprehensive survey on multi-fidelity black-box optimization. A learning curve tracks the performance of a machine learning algorithm (measured via a suitable metric such as classification accuracy on a validation dataset) as a function of training time (measured via number of iterations, gradient updates, etc.). These curves are typically increasing functions, and the performance metric saturates over time. Since observing the asymptote of a learning curve for any single hyperparameter configuration can be expensive, [DSH15] proposed to predict a learning curve after training the model for only a small number of iterations. The goal is to terminate the training procedure early if the asymptotic performance on a validation set is unlikely to beat the validation performance of the best known hyperparameter configuration so far. The predictor is parameterized using a mixture of functional forms (e.g., Weibull, vapor pressure, etc.) that have similar characteristics to learning curves (i.e., increasing and saturating with respect to the performance metric). Using standard Bayesian inference over the unknown parameters, one can estimate the probability that the performance of the algorithm at a future iteration will exceed the best known performance so far. Such estimates are then used within a downstream black-box optimization routine such as SMAC [HHL11] or the Tree Parzen Estimator [Hut+14] to decide whether to terminate training of a given hyperparameter configuration and use the remaining budget for testing other candidate configurations. [Kle+17] extend this approach via Bayesian neural networks to utilize additional data from past runs of related hyperparameter configurations. This permits learning more accurate predictors without having to observe a substantial portion of the learning curve. To draw an analogy with the current case study, we were interested in predicting the cycle life at which the capacity under a charging protocol drops to a predefined value (80% of initial capacity, or 0.88 Ah); see, e.g., Figure 9.6. In contrast to typical learning curves, the capacity is a decreasing function of cycle number with a sharp transition: the capacity decreases slowly in the beginning until it hits a transition point, beyond which it quickly drops to 0.88 Ah. This transition point is unknown and occurs close to the final cycle life. As a result, predicting the final cycle life is challenging without having cycled the battery substantially to observe the transition point; doing so, however, would defeat the purpose of terminating experiments early for efficient closed-loop optimization. We overcome this challenge by training our early predictor to use both information from prior capacity curves as well as additional features at every time step (e.g., voltage and capacity). This method allows for highly accurate predictions after only the first 100 cycles. In the future, it would be interesting to deploy CLO pipelines where the time of termination is decided dynamically based on the uncertainty of the early predictor: if the uncertainty is relatively high, the experiment is run longer, whereas low-uncertainty experiments can potentially be terminated well before even 100 cycles of charging.
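For intuition, a bare-bones version of learning-curve extrapolation can be written with a single saturating functional form rather than the mixture used by [DSH15]; the curve fitted here is synthetic:

import numpy as np
from scipy.optimize import curve_fit

def pow_law(t, a, b, c):
    # One saturating functional form of the kind mixed together in [DSH15].
    return c - a * t ** (-b)

# Fit the observed prefix of a synthetic accuracy curve, then extrapolate.
rng = np.random.default_rng(0)
t_obs = np.arange(1, 31).astype(float)
acc_obs = 0.9 - 0.5 * t_obs ** (-0.7) + rng.normal(0.0, 0.005, size=30)
params, _ = curve_fit(pow_law, t_obs, acc_obs, p0=(0.5, 0.5, 0.9), maxfev=10000)
asymptote = pow_law(300.0, *params)   # predicted late-training performance
# Terminate this run early if `asymptote` is unlikely to beat the best so far.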

9.6 Conclusion

In summary, we have successfully accelerated the optimization of extreme fast charging for lithium-ion batteries using closed-loop optimization with early outcome prediction. This method could extend to other fast-charging design spaces, such as pulsed [Sch+18; Kei+16] and constant-power [Ahm+17] charging, as well as to other objectives, such as slower charging and calendar aging. Additionally, this work opens up new applications for battery optimization, such as formation [WLD15], adaptive cycling [ZQ01], and parameter estimation for battery management system models [Par12]. Furthermore, provided that a suitable early outcome predictor exists, this method could also be applied to optimize other aspects of battery development, such as electrode materials and electrolyte chemistries. For an extensive analysis of this case study, we refer the reader to the methods and other supplementary materials in the original work [Att+20]. Beyond batteries, our CLO approach combining black-box optimization with early outcome prediction can be extended to efficiently optimize other physical [Tab+18; But+18; Sch18] and computational [Li+17; Smi+18] multidimensional parameter spaces that involve time-intensive experimentation, illustrating the power of data-driven methods to accelerate the pace of scientific discovery.

10 Conclusions

10.1 Summary of Contributions

The motivating theme of this dissertation was to develop artificial intelligence (AI) agents that can effectively reason in complex, high-dimensional real-world environments. The AI enterprise is broad, and our focus was particularly on machine learning (ML)-based AI systems that combine data and computation for reasoning. We identified supervision as the key bottleneck in developing ML systems that can draw inferences and make decisions in high dimensions. To offset the high supervision requirements of current ML systems, we discussed the theory and practice of probabilistic generative modeling across the three parts of this dissertation. Next, we summarize the contributions of each of these parts. In Part 1, we set up the learning problem of generative modeling in Chapter 2 and discussed some prominent existing frameworks for deep generative modeling of high-dimensional data. We followed this setup with a discussion of the fundamental trade-offs between the two main learning paradigms of generative modeling in Chapter 3. In particular, we discussed the trade-offs between learning objectives based on maximum likelihood and adversarial learning under a uniform set of modeling assumptions (flow models) and evaluation metrics (likelihoods and sample quality). Next, in Chapters 4 and 5, we turned our attention to opportunities and challenges in training expressive generative models. Specifically, in Chapter 4, we discussed the model bias in current state-of-the-art


models. We introduced a model-agnostic boosting procedure that uses importance weighting to mitigate the bias of any existing generative model and, consequently, induces an energy-based model that surpasses the performance of the base model. In Chapter 5, we followed this up with the related problem of dataset bias in generative models. This bias is especially common when generative models are learned on large uncurated and unlabelled datasets, and it gets further amplified at test time when the learned model is used for data generation. We presented a model-agnostic, label-free algorithm to mitigate this bias by reweighting (unlabelled) data points from the biased datasets based on their relative likelihoods with respect to a small but curated (unlabelled) reference dataset. In Part 2, we considered generative models for relational data. Relational data, represented as graphs of entities connected via relations, are discrete and combinatorial objects. The key use case of generative models for relational data is to learn good representations of the entities. We presented corresponding representation learning algorithms for nodes in graph networks (Chapter 6) and for agent policies in interactive multiagent systems (Chapter 7). For learning representations, we used latent variable generative models as well as new objectives based on contrastive losses. We benchmarked the performance of these methods on several downstream tasks, including semi-supervised node classification and link prediction in graph networks, as well as game outcome prediction and reinforcement learning in multiagent systems. Finally, in Part 3 of this dissertation, we considered two real-world applications of probabilistic models for accelerating scientific discovery. First, in Chapter 8, we presented a latent variable generative model for statistical compressed sensing. The latent variables in the model represent compressed measurements and are inferred by maximizing a tractable lower bound on the mutual information between the input and the measurements. Second and last, in Chapter 9, we discussed a real-world case study in battery charging that uses probabilistic generative models for optimal experimental design. We showed that a model of battery lifetimes learned on offline data can be used to terminate suboptimal experiments well before their completion, thus saving significantly on time and resources.

10.2 Future Work

There is enormous potential for using AI to enhance science and society, e.g., automated screening of drugs and materials, real-time weather and crop forecasting, and predicting socioeconomic outcomes such as poverty, health, and education. For such applications, ML models can potentially aid the design of experiments, characterize the uncertainty in predictions, and guide decisions for planning over long horizons. In practice, however, we are limited by budget constraints for acquiring data as well as the high complexity of and noise in real-world systems. This dissertation presented advancements in probabilistic generative modeling and other forms of unsupervised learning to tackle some of these challenges. The contributions of this dissertation suggest new opportunities and challenges for future work, some of which we highlight below.

Causality & Generative Modeling. One of the key goals of any ML model is to generalize from past experience. For a generative model, this implies that the model should be able to generate data beyond the training set, ideally in a controllable manner via disentanglement. We believe that effective generalization requires a tighter integration of causal prior knowledge into generative modeling (e.g., via a causal graph [Koc+17]). This would better enable principled generalization of these models beyond the observed data and permit a richer range of inference queries involving the interventions and counterfactuals required for downstream applications such as off-policy policy evaluation (Chapter 4.5).

Unsupervised Reinforcement Learning. A key challenge with current reinforce- ment learning (RL) is the excessive reliance on explicit goal-specific reward signals or demonstrations of expert behavior. Natural agents, such as humans, often learn by simply interacting with the environment and developing their own intrinsic rewards and dynamic models of how the world works. Learning such task-agnostic, unsuper- vised models can vastly reduce the reliance of RL on expensive supervision [Jad+16; Gup+18; Nai+18]. The learning objective for current models in this space shuttles CHAPTER 10. CONCLUSIONS 148

between several contrastive and generative methods. A careful combination, as in Section 4.4, could be mutually beneficial; more broadly, a theoretical and empirical investigation of the trade-offs and synergies of these classes of approaches for unsupervised RL is a promising research direction.

Societal Impacts of Unsupervised Learning. As with any advanced emerging technology, there are both risks and opportunities with real-world deployments of unsupervised learning. For instance, in Chapter 5, we demonstrated the latent dataset bias that gets amplified by state-of-the-art generative models. There are other pressing concerns as well that merit future research, such as the rise of automatically generated fake content, or “deep-fakes” [CC18; KM18]. At the same time, there are also positive outcomes in using these models for fair, privacy-preserving representation learning, as we show in [Son+19] and hope to extend to relational settings (Chapters 6-7) in future research.

Generative Modeling for Automated Design in Science and Engineering. Design thinking is a critical component of science and engineering that requires creativity and knowledge. In scientific discovery, design thinking is involved in generating a hypothesis space (e.g., candidate drugs and materials with target properties) and selecting which experiments to perform for hypothesis testing (e.g., battery charging in Chapter 9). Designs are also omnipresent in engineering, ranging from structures and prototype drawings in civil and mechanical engineering to circuit and chip design in electrical and computer engineering. With the growing availability of various kinds of large design datasets [Gau+12; Sef+20], generative models offer a promising approach to automate design [Kad+17; Góm+18; SA18; Seg+18], and future research should focus on the discreteness, symmetries, combinatorial scaling, and hard constraints inherent in the target domains [Ana+16; Gro+19b].

In summary, we believe advancements in computation and supervision-constrained machine learning will have a profound impact in accelerating scientific discovery and addressing key challenges on the road to sustainable development.

A Additional Proofs

A.1 Proof of Theorem 5.1

Proof. Since $p_{\text{bias}}(x|z=k)$ and $p_{\text{bias}}(x|z=k')$ have disjoint supports for $k \neq k'$, we know that for all $x$, there exists a deterministic mapping $f: \mathcal{X} \to \mathcal{Z}$ such that $p_{\text{bias}}(x|z = f(x)) > 0$. Further, for all $\tilde{x} \notin f^{-1}(z)$:

$$p_{\text{bias}}(\tilde{x}|z = f(x)) = 0, \tag{A.1}$$
$$p_{\text{ref}}(\tilde{x}|z = f(x)) = 0. \tag{A.2}$$

Combining Eqs. (A.1, A.2) above with the assumption in Eq. (5.6), we can simplify the density ratios as:

$$\frac{p_{\text{bias}}(x)}{p_{\text{ref}}(x)} = \frac{\int_z p_{\text{bias}}(x|z)\, p_{\text{bias}}(z)\, \mathrm{d}z}{\int_z p_{\text{ref}}(x|z)\, p_{\text{ref}}(z)\, \mathrm{d}z} \tag{A.3}$$
$$= \frac{p_{\text{bias}}(x|f(x))\, p_{\text{bias}}(f(x))}{p_{\text{ref}}(x|f(x))\, p_{\text{ref}}(f(x))} \quad \text{(using Eqs. (A.1, A.2))} \tag{A.4}$$
$$= \frac{p_{\text{bias}}(f(x))}{p_{\text{ref}}(f(x))} \quad \text{(using Eq. (5.6))} \tag{A.5}$$
$$= b(f(x)). \tag{A.6}$$

From Eq. (5.4) and Eq. (A.3), the Bayes optimal classifier $c^*$ can hence be expressed as:

$$c^*(Y=1|x) = \frac{1}{\gamma\, b(f(x)) + 1}. \tag{A.7}$$

The optimal cross-entropy loss of a binary classifier $c$ for density ratio estimation can then be expressed as:

$$\mathrm{NCE}(c^*) = \frac{1}{\gamma+1}\, \mathbb{E}_{p_{\text{ref}}(x)}[\log c^*(Y=1|x)] + \frac{\gamma}{\gamma+1}\, \mathbb{E}_{p_{\text{bias}}(x)}[\log c^*(Y=0|x)] \tag{A.8}$$
$$= \frac{1}{\gamma+1}\, \mathbb{E}_{p_{\text{ref}}}\left[\log \frac{1}{\gamma\, b(f(x)) + 1}\right] + \frac{\gamma}{\gamma+1}\, \mathbb{E}_{p_{\text{bias}}}\left[\log \frac{\gamma\, b(f(x))}{\gamma\, b(f(x)) + 1}\right] \quad \text{(using Eq. (A.7))} \tag{A.9}$$
$$= \frac{1}{\gamma+1}\, \mathbb{E}_{p_{\text{ref}}(x,z)}\left[\log \frac{1}{\gamma\, b(f(x)) + 1}\right] + \frac{\gamma}{\gamma+1}\, \mathbb{E}_{p_{\text{bias}}(x,z)}\left[\log \frac{\gamma\, b(f(x))}{\gamma\, b(f(x)) + 1}\right] \tag{A.10}$$
$$= \frac{1}{\gamma+1}\, \mathbb{E}_{p_{\text{ref}}(z)}\left[\log \frac{1}{\gamma\, b(z) + 1}\right] + \frac{\gamma}{\gamma+1}\, \mathbb{E}_{p_{\text{bias}}(z)}\left[\log \frac{\gamma\, b(z)}{\gamma\, b(z) + 1}\right] \quad \text{(using Eqs. (A.1, A.2))} \tag{A.11}$$
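To make Eq. (A.7) concrete, here is a minimal numerical sketch (our illustration, not code from the dissertation) in which a logistic classifier is trained to distinguish samples of p_ref (label Y = 1) from samples of p_bias (label Y = 0); with γ = n_bias/n_ref, its predicted probabilities approach 1/(γ b(x) + 1). One-dimensional Gaussians stand in for the two distributions, so the log density ratio is linear in x and logistic regression is well-specified:

# Numerical sanity check of Eq. (A.7): a classifier trained on pooled
# samples (Y=1 for p_ref, Y=0 for p_bias) approaches
# c*(Y=1|x) = 1 / (gamma * b(x) + 1), with b(x) = p_bias(x) / p_ref(x).
import numpy as np
from scipy.stats import norm
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_ref, gamma = 20000, 2.0                  # gamma = n_bias / n_ref
n_bias = int(gamma * n_ref)

x_ref = rng.normal(0.0, 1.0, n_ref)        # p_ref = N(0, 1)
x_bias = rng.normal(1.0, 1.0, n_bias)      # p_bias = N(1, 1)
X = np.concatenate([x_ref, x_bias]).reshape(-1, 1)
Y = np.concatenate([np.ones(n_ref), np.zeros(n_bias)])

clf = LogisticRegression(C=1e6).fit(X, Y)  # weak regularization ~ MLE

x_test = np.linspace(-2.0, 3.0, 6).reshape(-1, 1)
b = norm.pdf(x_test, 1.0, 1.0) / norm.pdf(x_test, 0.0, 1.0)
print(np.c_[clf.predict_proba(x_test)[:, 1],   # learned c(Y=1|x)
            1.0 / (gamma * b + 1.0)])          # Eq. (A.7); columns agree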

A.2 Proof of Theorem 8.1

Proof. We can rewrite the UAE objective in Eq. (8.6) as:

$$\mathbb{E}_{q_\phi(x,y)}[\log p_\theta(x|y)] = \mathbb{E}_{q_\phi(Y)}\left[\int q_\phi(x|y) \log p_\theta(x|y)\, \mathrm{d}x\right] \tag{A.12}$$
$$= -H_\phi(X|Y) - \mathbb{E}_{q_\phi(Y)}\left[D_{\mathrm{KL}}(q_\phi(X|y)\, \|\, p_\theta(X|y))\right]. \tag{A.13}$$

The KL-divergence is non-negative and minimized when its argument distributions are identical. Hence, for a fixed optimal value of θ = θ*, if there exists a φ in the space of encoders being optimized that satisfies:

$$p_{\theta^*}(X|y) = q_\phi(X|y) \tag{A.14}$$

for all x, y with p_{θ*}(y) ≠ 0, then it corresponds to the optimal encoder, i.e.,

$$\phi = \phi^*. \tag{A.15}$$

For any value of φ, we know the following Gibbs chain converges to qφ(x, y) if the chain is ergodic:

$$y^{(t)} \sim q_\phi(Y|x^{(t)}), \tag{A.16}$$
$$x^{(t+1)} \sim q_\phi(X|y^{(t)}). \tag{A.17}$$

Substituting the results from Eqs. (A.14-A.17) in the Markov chain transitions in Eqs. (8.8, 8.9) finishes the proof.

A.3 Proof of Corollary 8.1

Proof. By using earlier results (Proposition 2 in [RR06]), we need to show that the Markov chain defined in Eqs. (8.8)-(8.9) is Φ-irreducible with a Gaussian noise model.¹ That is, there exists a measure such that there is a non-zero probability of transitioning from every set of non-zero measure to every other such set defined on the same measure using this Markov chain. Consider the Lebesgue measure. Formally, given any (x, y) and (x′, y′) such that the densities q(x, y) > 0 and q(x′, y′) > 0 under the Lebesgue measure, we need to show that the probability density of transitioning q(x′, y′|x, y) > 0. (1) Since q(y|x) > 0 for all x ∈ R^n, y ∈ R^m (by the Gaussian noise model assumption), we can use Eq. (8.8) to transition from (x, y) to (x, y′) with non-zero probability.

¹ Note that the symbol Φ here is different from the parameters denoted by the lowercase φ used in Chapter 8.

(2) Next, we claim that the transition probability q(x′|y) is positive for all x′, y. By Bayes rule, we have:

$$q(x'|y) = \frac{q(y|x')\, q(x')}{q(y)}. \tag{A.18}$$

Since q(x, y) > 0 and q(x′, y′) > 0, the marginals q(y) and q(x′) are positive. Again, q(y|x′) > 0 for all x′ ∈ R^n, y ∈ R^m by the Gaussian noise model assumption. Hence, q(x′|y) is positive. Finally, using the optimality assumption that the posteriors p(x′|y) match q(x′|y) for all x′, y, we can use Eq. (8.9) to transition from (x, y′) to (x′, y′) with non-zero probability. From (1) and (2), we see that there is a non-zero probability of transitioning from (x, y) to (x′, y′). Hence, under the assumptions of the corollary, the Markov chain in Eqs. (8.8, 8.9) is ergodic.

B Additional Experimental Details & Results

B.1 Chapter 3

Datasets. The MNIST dataset [LCB10] contains 50,000 train, 10,000 validation, and 10,000 test images of dimensions 28 × 28. The CIFAR-10 dataset [KH09] contains 50,000 train and 10,000 test images of dimensions 32 × 32 × 3 by default. We randomly hold out 5,000 training set images as a validation set. Since we are modeling densities for discrete datasets (pixels can take a finite set of values ranging from 1 to 255), the model can assign arbitrarily high log-likelihoods to these discrete points. Following [UML13], we avoid this scenario by dequantizing the data: uniform noise between 0 and 1 is added to every pixel. Finally, we scale the pixels to lie between 0 and 1.

Model architectures. We used the NICE [DKB14] and Real-NVP [DSB17] architectures as the normalizing flow generators for the MNIST and CIFAR-10 datasets respectively. For adversarial training, we consider the improved WGAN objective, which enforces the Lipschitz constraint on the critic by penalizing the norm of the gradient with respect to the input [Gul+17]. The critic architecture for the MNIST dataset is the same as the one used in DCGAN [RMC15]. The batch normalization layer is removed, as recommended [Gul+17]. For the CIFAR-10 experiments we again use the DCGAN architecture with an extra convolutional layer. Batch normalization [IS15] here was substituted by layer normalization [BKH16].

153 APPENDIX B. ADDITIONAL EXPERIMENTAL DETAILS & RESULTS 154

Hyperparameters. The hyperparameters specific to adversarial training with the Wasserstein distance were fixed as per [Gul+17]. For all models trained on MNIST, we used a logistic prior distribution for the NICE architecture. The weights were initialized using the truncated normal distribution. For the models trained using MLE, the batch size was set to 200, the learning rate was 0.001 with a decay of 0.9995, and the optimizer used was Adam with β1 = 0.9, β2 = 0.999. For the other models, the batch size was set to 100, and the optimizer used was Adam with β1 = 0.5, β2 = 0.999 and a learning rate of 0.0001. In the case of models trained on CIFAR-10, an isotropic Gaussian prior was used. We also used weight normalization [SK16] and Xavier initialization for all the models. The models trained with MLE used a batch size of 100, a learning rate of

0.0001 with a decay of 0.999995, and the Adam optimizer with β1 = 0.9, β2 = 0.999. The other models used a batch size of 64, a learning rate of 0.0001, and the Adam optimizer with β1 = 0.5, β2 = 0.999.

B.2 Chapter 4

B.2.1 Bias-Variance of Different LFIW estimators

As discussed in Section 4.3, bias reduction using LFIW can suffer from issues where the importance weights are too small due to highly confident predictions of the binary classifier. Across a batch of Monte Carlo samples, this can increase the corresponding variance. Inspired by the importance sampling literature, we proposed additional mechanisms in Eqs. (4.7-4.9) to mitigate this additional variance at the cost of reduced debiasing. We now examine the empirical bias-variance trade-off of these different estimators via a simple experiment. Our setup follows the goodness-of-fit testing experiments in Section 4.5. The statistics we choose to estimate are simply the 2048 activations of the prefinal layer of the Inception Network, averaged across the test set of 10,000 samples of CIFAR-10.

That is, the true statistics $s = \{s_1, s_2, \cdots, s_{2048}\}$ are given by:

$$s_j = \frac{1}{|\mathcal{D}_{\text{test}}|} \sum_{x \in \mathcal{D}_{\text{test}}} a_j(x) \tag{B.1}$$

where $a_j$ is the $j$-th prefinal layer activation of the Inception Network. Note that the set of statistics $s$ is fixed (computed once on the test set). To estimate these statistics, we will use different estimators. For example, the default estimator involving no reweighting is given as:

$$\hat{s}_j = \frac{1}{T} \sum_{i=1}^{T} a_j(x_i) \tag{B.2}$$

where $x_i \sim p_\theta$.

Note that $\hat{s}_j$ is a random variable since it depends on the $T$ samples drawn from $p_\theta$. Similar to Eq. (B.2), other variants of the LFIW estimators proposed in

Section 4.3 can be derived using Eqs. (4.7-4.9). For any LFIW estimate $\hat{s}_j$, we can use the standard decomposition of the expected mean-squared error into terms corresponding to the (squared) bias and variance, as shown below.

Table B.1: Bias-variance analysis for PixelCNN++ and SNGAN when T = 10,000. Standard errors over the absolute values of bias and variance evaluations are computed over the 2048 activation statistics. Lower absolute values of bias, lower variance, and lower MSE are better.

Model | Evaluation | |Bias| (↓) | Variance (↓) | MSE (↓)
PixelCNN++ | Self-norm | 0.0240 ± 0.0014 | 0.0002935 ± 7.22e-06 | 0.0046 ± 0.00031
PixelCNN++ | Flattening (α = 0) | 0.0330 ± 0.0023 | 9.1e-06 ± 2.6e-07 | 0.0116 ± 0.00093
PixelCNN++ | Flattening (α = 0.25) | 0.1042 ± 0.0018 | 5.1e-06 ± 1.5e-07 | 0.0175 ± 0.00138
PixelCNN++ | Flattening (α = 0.5) | 0.1545 ± 0.0022 | 8.4e-06 ± 3.7e-07 | 0.0335 ± 0.00246
PixelCNN++ | Flattening (α = 0.75) | 0.1626 ± 0.0022 | 3.19e-05 ± 2e-06 | 0.0364 ± 0.00259
PixelCNN++ | Flattening (α = 1.0) | 0.1359 ± 0.0018 | 0.0002344 ± 1.619e-05 | 0.0257 ± 0.00175
PixelCNN++ | Clipping (β = 0.001) | 0.1359 ± 0.0018 | 0.0002344 ± 1.619e-05 | 0.0257 ± 0.00175
PixelCNN++ | Clipping (β = 0.01) | 0.1357 ± 0.0018 | 0.0002343 ± 1.618e-05 | 0.0256 ± 0.00175
PixelCNN++ | Clipping (β = 0.1) | 0.1233 ± 0.0017 | 0.000234 ± 1.611e-05 | 0.0215 ± 0.00149
PixelCNN++ | Clipping (β = 1.0) | 0.1255 ± 0.0030 | 0.0002429 ± 1.606e-05 | 0.0340 ± 0.00230
SNGAN | Self-norm | 0.0178 ± 0.0008 | 1.98e-05 ± 5.9e-07 | 0.0016 ± 0.00023
SNGAN | Flattening (α = 0) | 0.0257 ± 0.0010 | 9.1e-06 ± 2.3e-07 | 0.0026 ± 0.00027
SNGAN | Flattening (α = 0.25) | 0.0096 ± 0.0007 | 8.4e-06 ± 3.1e-07 | 0.0011 ± 8e-05
SNGAN | Flattening (α = 0.5) | 0.0295 ± 0.0006 | 1.15e-05 ± 6.4e-07 | 0.0017 ± 0.00011
SNGAN | Flattening (α = 0.75) | 0.0361 ± 0.0006 | 1.93e-05 ± 1.39e-06 | 0.002 ± 0.00012
SNGAN | Flattening (α = 1.0) | 0.0297 ± 0.0005 | 3.76e-05 ± 3.08e-06 | 0.0015 ± 7e-05
SNGAN | Clipping (β = 0.001) | 0.0297 ± 0.0005 | 3.76e-05 ± 3.08e-06 | 0.0015 ± 7e-05
SNGAN | Clipping (β = 0.01) | 0.0297 ± 0.0005 | 3.76e-05 ± 3.08e-06 | 0.0015 ± 7e-05
SNGAN | Clipping (β = 0.1) | 0.0296 ± 0.0005 | 3.76e-05 ± 3.08e-06 | 0.0015 ± 7e-05
SNGAN | Clipping (β = 1.0) | 0.1002 ± 0.0018 | 3.03e-05 ± 2.18e-06 | 0.0170 ± 0.00171

$$\mathbb{E}[(s_j - \hat{s}_j)^2] = s_j^2 - 2 s_j \mathbb{E}[\hat{s}_j] + \mathbb{E}[\hat{s}_j^2] \tag{B.3}$$
$$= s_j^2 - 2 s_j \mathbb{E}[\hat{s}_j] + (\mathbb{E}[\hat{s}_j])^2 + \mathbb{E}[\hat{s}_j^2] - (\mathbb{E}[\hat{s}_j])^2 \tag{B.4}$$
$$= \underbrace{(s_j - \mathbb{E}[\hat{s}_j])^2}_{\text{Bias}^2} + \underbrace{\mathbb{E}[\hat{s}_j^2] - (\mathbb{E}[\hat{s}_j])^2}_{\text{Variance}}. \tag{B.5}$$
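As an illustration of this decomposition, the sketch below probes the bias and variance of a few importance-weighted estimators by Monte Carlo on a toy problem where the true model and data densities are known in closed form. The self-normalized, flattened (weights raised to a power α), and clipped (weights lower-bounded by β) variants shown are our reading of Eqs. (4.7-4.9), not the dissertation's exact code:

# Toy bias-variance probe for importance-weighted estimators.
# Target: s = E_{p_data}[a(x)] with a(x) = x^2, estimated from samples
# of a misspecified model p_theta, reweighted by w(x) = p_data(x)/p_theta(x).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
p_data, p_model = norm(0.0, 1.0), norm(0.5, 1.2)
a = lambda x: x ** 2
s_true = 1.0                          # E[x^2] under N(0, 1)

def estimate(T, variant, param=None):
    x = p_model.rvs(T, random_state=rng)
    w = p_data.pdf(x) / p_model.pdf(x)
    if variant == "self_norm":
        w = w / w.sum() * T           # weights renormalized to mean 1
    elif variant == "flatten":
        w = w ** param                # alpha in [0, 1]; our assumption
    elif variant == "clip":
        w = np.maximum(w, param)      # lower-bound beta; our assumption
    return np.mean(w * a(x))

for name, param in [("self_norm", None), ("flatten", 0.5), ("clip", 0.1)]:
    est = np.array([estimate(10000, name, param) for _ in range(100)])
    bias2, var = (est.mean() - s_true) ** 2, est.var()
    print(f"{name:10s} bias^2={bias2:.4f} var={var:.6f} mse={bias2 + var:.4f}")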

In Table B.1, we report the bias and variance terms of the estimators averaged over 10 draws of T = 10,000 samples, further averaging over all 2048 statistics corresponding to s. We observe that self-normalization performs consistently well and is the best or second best in terms of bias and MSE in all cases.

Table B.2: Bias-variance analysis for PixelCNN++ and SNGAN when T = 5,000. Standard errors over the absolute values of bias and variance evaluations are computed over the 2048 activation statistics. Lower absolute values of bias, lower variance, and lower MSE are better.

Model | Evaluation | |Bias| (↓) | Variance (↓) | MSE (↓)
PixelCNN++ | Self-norm | 0.023 ± 0.0014 | 0.0005086 ± 1.317e-05 | 0.0049 ± 0.00033
PixelCNN++ | Flattening (α = 0) | 0.0330 ± 0.0023 | 1.65e-05 ± 4.6e-07 | 0.0116 ± 0.00093
PixelCNN++ | Flattening (α = 0.25) | 0.1038 ± 0.0018 | 9.5e-06 ± 3e-07 | 0.0174 ± 0.00137
PixelCNN++ | Flattening (α = 0.5) | 0.1539 ± 0.0022 | 1.74e-05 ± 8e-07 | 0.0332 ± 0.00244
PixelCNN++ | Flattening (α = 0.75) | 0.1620 ± 0.0022 | 6.24e-05 ± 3.83e-06 | 0.0362 ± 0.00256
PixelCNN++ | Flattening (α = 1.0) | 0.1360 ± 0.0018 | 0.0003856 ± 2.615e-05 | 0.0258 ± 0.00174
PixelCNN++ | Clipping (β = 0.001) | 0.1360 ± 0.0018 | 0.0003856 ± 2.615e-05 | 0.0258 ± 0.00174
PixelCNN++ | Clipping (β = 0.01) | 0.1358 ± 0.0018 | 0.0003856 ± 2.615e-05 | 0.0257 ± 0.00173
PixelCNN++ | Clipping (β = 0.1) | 0.1234 ± 0.0017 | 0.0003851 ± 2.599e-05 | 0.0217 ± 0.00148
PixelCNN++ | Clipping (β = 1.0) | 0.1250 ± 0.0030 | 0.0003821 ± 2.376e-05 | 0.0341 ± 0.00232
SNGAN | Self-norm | 0.0176 ± 0.0008 | 3.88e-05 ± 9.6e-07 | 0.0016 ± 0.00022
SNGAN | Flattening (α = 0) | 0.0256 ± 0.0010 | 1.71e-05 ± 4.3e-07 | 0.0027 ± 0.00027
SNGAN | Flattening (α = 0.25) | 0.0099 ± 0.0007 | 1.44e-05 ± 3.7e-07 | 0.0011 ± 8e-05
SNGAN | Flattening (α = 0.5) | 0.0298 ± 0.0006 | 1.62e-05 ± 5.3e-07 | 0.0017 ± 0.00012
SNGAN | Flattening (α = 0.75) | 0.0366 ± 0.0006 | 2.38e-05 ± 1.11e-06 | 0.0021 ± 0.00012
SNGAN | Flattening (α = 1.0) | 0.0302 ± 0.0005 | 4.56e-05 ± 2.8e-06 | 0.0015 ± 7e-05
SNGAN | Clipping (β = 0.001) | 0.0302 ± 0.0005 | 4.56e-05 ± 2.8e-06 | 0.0015 ± 7e-05
SNGAN | Clipping (β = 0.01) | 0.0302 ± 0.0005 | 4.56e-05 ± 2.8e-06 | 0.0015 ± 7e-05
SNGAN | Clipping (β = 0.1) | 0.0302 ± 0.0005 | 4.56e-05 ± 2.81e-06 | 0.0015 ± 7e-05
SNGAN | Clipping (β = 1.0) | 0.1001 ± 0.0018 | 5.19e-05 ± 2.81e-06 | 0.0170 ± 0.0017

The flattened estimator with no debiasing (corresponding to α = 0) has lower bias and higher variance than the self-normalized estimator. Amongst the flattening estimators, lower values of α seem to provide the best bias-variance trade-off. The clipped estimators do not perform well in this setting, with lower values of β slightly preferable over larger values. We repeat the same experiment with T = 5,000 samples and report the results in Table B.2. While the variance increases as expected (by almost an order of magnitude), the estimator bias remains roughly the same.

B.2.2 Calibration

We found in all our cases that the binary classifiers used for training the model were highly calibrated by default and did not require any further recalibration. See, for instance, the calibration of the binary classifier used for the goodness-of-fit experiments in Figure B.1.

Figure B.1: Calibration of classifiers for density ratio estimation.

We performed the analysis on a held-out set of real and generated samples and used 10 bins for computing calibration statistics. We believe the default calibration behavior is largely due to the fact that our binary classifiers distinguishing real and fake data do not require the very complex neural network architectures and training tricks that lead to miscalibration in multi-class classification. As shown in [NC05], shallow networks are well-calibrated, and [Guo+17] further argue that a major reason for miscalibration is the use of a softmax loss typical of multi-class problems.
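For reference, the reliability analysis above can be reproduced along the following lines with sklearn's calibration_curve (10 bins, as stated); the labels and probabilities below are synthetic placeholders rather than outputs of our classifiers:

# Reliability curve for a real-vs-fake classifier with 10 bins.
import numpy as np
from sklearn.calibration import calibration_curve

rng = np.random.default_rng(0)
y_prob = rng.uniform(0.0, 1.0, 5000)     # placeholder predicted probabilities
y_true = rng.binomial(1, y_prob)         # well-calibrated by construction
frac_pos, mean_pred = calibration_curve(y_true, y_prob, n_bins=10)
print(np.c_[mean_pred, frac_pos])        # near the diagonal if calibrated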

B.2.3 Goodness-of-fit testing

We used the TensorFlow implementation of the Inception Network [Aba+16] to ensure the sample quality metrics are comparable with prior work. For a semantic evaluation of differences in sample quality, this test is performed in the feature space of a

pretrained classifier, such as the prefinal activations of the Inception Net [Sze+16].

For example, the Inception score for a generative model pθ given a classifier d(·) can be expressed as:

$$\mathrm{IS} = \exp\left(\mathbb{E}_{x \sim p_\theta}\left[\mathrm{KL}(d(y|x)\, \|\, d(y))\right]\right).$$
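As a toy illustration of this formula, the snippet below computes IS from an array probs[i, k] = d(y = k | x_i); the probabilities here are random placeholders rather than actual Inception Network outputs:

# Inception Score from classifier probabilities; toy inputs only.
import numpy as np

rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(10), size=5000)   # placeholder for d(y|x)

p_y = probs.mean(axis=0)                        # marginal d(y)
kl = np.sum(probs * (np.log(probs) - np.log(p_y)), axis=1)
inception_score = np.exp(kl.mean())
print(inception_score)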

The FID score is another metric which, unlike the Inception score, also takes into account real data from p_data. Mathematically, the FID between sets S and R, sampled from distributions p_θ and p_data respectively, is defined as:

$$\mathrm{FID}(S, R) = \|\mu_S - \mu_R\|_2^2 + \mathrm{Tr}\left(\Sigma_S + \Sigma_R - 2\sqrt{\Sigma_S \Sigma_R}\right)$$

where $(\mu_S, \Sigma_S)$ and $(\mu_R, \Sigma_R)$ are the empirical means and covariances computed based on S and R respectively. Here, S and R are sets of datapoints from p_θ and p_data.
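A minimal numerical sketch of this computation (our illustration; scipy's matrix square root stands in for the square root of Σ_S Σ_R, and the feature arrays are placeholders rather than Inception activations):

# FID between two sets of feature vectors, per the formula above.
import numpy as np
from scipy.linalg import sqrtm

def fid(feat_s, feat_r):
    mu_s, mu_r = feat_s.mean(axis=0), feat_r.mean(axis=0)
    sigma_s = np.cov(feat_s, rowvar=False)
    sigma_r = np.cov(feat_r, rowvar=False)
    covmean = sqrtm(sigma_s @ sigma_r)
    if np.iscomplexobj(covmean):       # numerical noise can leave tiny
        covmean = covmean.real         # imaginary parts; discard them
    diff = mu_s - mu_r
    return diff @ diff + np.trace(sigma_s + sigma_r - 2.0 * covmean)

rng = np.random.default_rng(0)
S = rng.normal(0.0, 1.0, (1000, 64))   # placeholder model features
R = rng.normal(0.1, 1.0, (1000, 64))   # placeholder data features
print(fid(S, R))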

In a similar vein, KID compares statistics between samples in a feature space defined via a combination of kernels and a pretrained classifier. The standard kernel used is a radial-basis function kernel with a fixed bandwidth of 1. As desired, the score is optimized when the data and model distributions match. We used the open-sourced model implementations of PixelCNN++ [Sal+16] and SNGAN [Miy+18]. Following the observation by [LO16], we found that training a binary classifier on top of the feature space of a pretrained image classifier was useful for removing the effect of low-level artifacts in the generated images when classifying an image as real or fake. We hence learned a multi-layer perceptron (with a single hidden layer of 1000 units) on top of the 2048-dimensional feature space of the Inception Network. Learning was done using the Adam optimizer with default hyperparameters, a learning rate of 0.001, and a batch size of 64. We observed relatively fast convergence when training the binary classifier (in less than 20 epochs) on both PixelCNN++- and SNGAN-generated data, and the best validation set accuracy across the first 20 epochs was used for final model selection.

Table B.3: Statistics for the environments.

Environment | State dimensionality | Action dimensionality
Swimmer | 8 | 2
HalfCheetah | 17 | 6
HumanoidStandup | 376 | 17

B.2.4 Data Augmentation

We used the open-source implementation of DAGAN¹ [ASE17] as the base generative model. Given a DAGAN model, we additionally require training a binary classifier for estimating importance weights and a multi-class classifier for subsequent classification. The architecture for both these use cases follows prior work in meta-learning on Omniglot [Vin+16]. We train the DAGAN on the 1200 classes reserved for training in prior works. For each class, we consider a 15/5/5 split of the 20 examples for training, validation, and testing. Except for the final output layer, the architecture consists of 4 blocks of 3x3 convolutions and 64 filters, followed by batch normalization [Sze+16], a ReLU non-linearity, and 2x2 max pooling. Learning was done for 100 epochs using the Adam optimizer with default parameters, a learning rate of 0.001, and a batch size of 32.

B.2.5 Model-based Off-policy Policy Evaluation

We used TensorFlow [Aba+16] and OpenAI baselines² [Dha+17]. We evaluate over three environments: Swimmer, HalfCheetah, and HumanoidStandup (Figure B.2). Both HalfCheetah and Swimmer reward the agent for attaining higher horizontal velocity; HumanoidStandup rewards the agent for gaining height by standing up. In all three environments, the initial state distributions are obtained by adding small random perturbations around a certain state. The dimensions of the state and action spaces are shown in Table B.3. Our policy network has two fully connected layers with 64 neurons and tanh activations for each layer, whereas our transition model / classifier has three hidden layers of 500 neurons with swish activations [RZL17]. We obtain our evaluation policy by training with PPO for 1M timesteps, and our behavior policy by training with PPO for 500k timesteps. Then we train the dynamics model Pθ for 100k iterations with a batch size of 128. Our classifier is trained for 10k iterations with a batch size of 250, where we concatenate (st, at, st+1) into a single vector.

¹ https://github.com/AntreasAntoniou/DAGAN.git
² https://github.com/openai/baselines.git


Figure B.2: Environments in OPE experiments: (a) Swimmer, (b) HalfCheetah, (c) HumanoidStandup.

Table B.4: Off-policy policy evaluation on MuJoCo tasks. Standard error is over 10 Monte Carlo estimates where each estimate contains 100 randomly sampled trajectories. Here, we perform stepwise LFIW over transition triplets.

Environment | v(πe) (Ground truth) | ṽ(πe) | v̂(πe) (w/ LFIW) | v̂80(πe) (w/ LFIW)
Swimmer | 36.7 ± 0.1 | 100.4 ± 3.2 | 19.4 ± 4.3 | 48.3 ± 4.0
HalfCheetah | 241.7 ± 3.6 | 204.0 ± 0.8 | 229.1 ± 4.9 | 214.9 ± 3.9
Humanoid | 14170 ± 5.3 | 8417 ± 28 | 10612 ± 794 | 9950 ± 640

Stepwise LFIW. Here, we consider performing LFIW over the transition triplets, where each transition triplet (st, at, st+1) is assigned its own importance weight. This is in contrast to assigning a single importance weight for the entire trajectory, obtained by multiplying the importance weights of all transitions in the trajectory. The importance weight for a transition triplet is defined as:

$$\frac{p^*(s_t, a_t, s_{t+1})}{\tilde{p}(s_t, a_t, s_{t+1})} \approx \hat{w}(s_t, a_t, s_{t+1}), \tag{B.6}$$


Figure B.3: Estimation error δ(v) = v(πe) − v̂H(πe) for different values of H (minimum 0, maximum 100) on Swimmer, HalfCheetah, and HumanoidStandup. Shaded area denotes standard error over different random seeds; each seed uses 100 sampled trajectories. Here, we perform stepwise LFIW over transition triplets.

The corresponding LFIW estimator is then given as:

$$\hat{v}(\pi_e) = \mathbb{E}_{\tilde{p}}\left[\sum_{t=0}^{T-1} \hat{w}(s_t, a_t, \hat{s}_{t+1}) \cdot r(s_t, a_t)\right]. \tag{B.7}$$

We describe this as the “stepwise” LFIW approach for off-policy policy evaluation. We perform self-normalization over the weights of each triplet. From the results in Table B.4 and Figure B.3, stepwise LFIW also reduces bias for OPE compared to no LFIW. Compared to the “trajectory-based” LFIW described in Eq. (4.20), the stepwise estimator has slightly higher variance and weaker performance for H = 20, 40, but outperforms the trajectory-level estimator when H = 100 on the HalfCheetah and HumanoidStandup environments.
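A schematic version of the stepwise estimator in Eq. (B.7) with batch self-normalization of the triplet weights; `weight` is a hypothetical handle to the learned density ratio classifier ŵ, and the data layout is an assumption of this sketch:

# Stepwise LFIW off-policy value estimate, per Eq. (B.7).
# `trajectories` is a list of trajectories, each a list of
# (s, a, r, s_next) transition tuples; `weight(s, a, s_next)` is a
# hypothetical callable returning the estimated density ratio w_hat.
import numpy as np

def stepwise_lfiw_value(trajectories, weight):
    w_all = np.array([weight(s, a, s_next)
                      for traj in trajectories
                      for (s, a, r, s_next) in traj])
    w_all /= w_all.mean()            # self-normalize weights over the batch
    r_all = np.array([r for traj in trajectories
                      for (s, a, r, s_next) in traj])
    # Sum of weighted rewards across all transitions, averaged per trajectory.
    return (w_all * r_all).sum() / len(trajectories)

For the H-step evaluation in Figure B.3, one would simply truncate each trajectory to its first H transitions before applying this function.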

B.3 Chapter 5

B.3.1 Dataset Construction Procedure

We construct such dataset splits from the full CelebA training set using the following procedure. We initially fix our dataset size to be roughly 135K out of the total 162K, based on the total number of females present in the data. Then, for each level of bias, we partition 1/4 of the males and 1/4 of the females into Dref to achieve the 50-50 ratio. The remaining examples are used for Dbias, where the numbers of males and females are adjusted to match the desired level of bias (e.g., 0.9). Finally, at each level of reference dataset size perc, we discard the appropriate fraction of datapoints from both the male and female categories in Dref. For example, for perc = 0.5, we discard half the females and half the males from Dref.
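A sketch of this construction for a generic binary attribute, under our reading of the procedure above (array names, the even shrinking of Dref, and the parameterization of bias as the majority-group fraction in Dbias are our assumptions):

# Construct D_ref (50-50 across the attribute) and D_bias (target bias)
# from index arrays of the two groups; a sketch of the split procedure.
import numpy as np

def make_splits(female_idx, male_idx, bias, perc, rng):
    """bias: female fraction in D_bias; perc: fraction of D_ref kept."""
    female_idx = rng.permutation(female_idx)
    male_idx = rng.permutation(male_idx)
    # 1/4 of each group goes to D_ref; this yields the 50-50 reference
    # ratio assuming the two groups were equalized when sizing the data.
    f_ref, f_rest = np.split(female_idx, [len(female_idx) // 4])
    m_ref, m_rest = np.split(male_idx, [len(male_idx) // 4])
    # Shrink D_ref to the requested size, evenly across both groups.
    f_ref = f_ref[: int(perc * len(f_ref))]
    m_ref = m_ref[: int(perc * len(m_ref))]
    # Adjust the male count in D_bias to hit the desired female fraction.
    n_m = int(len(f_rest) * (1.0 - bias) / bias)
    return (np.concatenate([f_ref, m_ref]),
            np.concatenate([f_rest, m_rest[:n_m]]))

rng = np.random.default_rng(0)
d_ref, d_bias = make_splits(np.arange(67500), np.arange(67500, 135000),
                            bias=0.9, perc=0.5, rng=rng)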

B.3.2 FID Calculation

The FID metric may exhibit a relative preference for models trained on larger datasets in order to maximize perceptual sample quality, at the expense of propagating or amplifying existing dataset bias. In order to obtain an estimate of sample quality that also incorporates a notion of fairness across sensitive attribute classes, we pre-computed the relevant FID statistics on a “balanced” construction of the CelebA dataset that matches our reference dataset pref. That is, we used all train/validation/test splits of the data such that: (1) for single-attribute, there were 50-50 proportions of males and females; and (2) for multi-attribute, there were even proportions of examples across all 4 classes (females with black hair, females without black hair, males with black hair, males without black hair). We report “balanced” FID numbers based on these pre-computed statistics throughout the chapter.

B.3.3 Attribute Classifier

We use the same architecture and hyperparameters for both the single- and multi-attribute classifiers. Both are variants of ResNet-18 where the output number of classes corresponds to the dataset split (e.g., 2 classes for the single-attribute experiment, 4 classes for the multi-attribute experiment).

Model architecture. We provide the architectural details in Table B.5 below:

Name | Component
conv1 | 7 × 7 conv, 64 filters, stride 2
Residual Block 1 | 3 × 3 max pool, stride 2
Residual Block 2 | [3 × 3 conv, 128 filters; 3 × 3 conv, 128 filters] × 2
Residual Block 3 | [3 × 3 conv, 256 filters; 3 × 3 conv, 256 filters] × 2
Residual Block 4 | [3 × 3 conv, 512 filters; 3 × 3 conv, 512 filters] × 2
Output Layer | 7 × 7 average pool stride 1, fully-connected, softmax

Table B.5: ResNet-18 architecture adapted for attribute classifier.

Hyperparameters. During training, we use a batch size of 64 and the Adam optimizer with a learning rate of 0.001. The classifiers learn relatively quickly in both the single- and multi-attribute scenarios, and we only needed to train for 10 epochs. We used a validation set for model selection.
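A sketch of how such a classifier could be instantiated and trained (our illustration; note that the stock torchvision ResNet-18 differs slightly from Table B.5 in its filter counts):

# Attribute classifier: ResNet-18 variant with a task-specific output
# layer, Adam at lr=1e-3, batch size 64, as described above. Sketch only.
import torch
import torch.nn as nn
from torchvision.models import resnet18

model = resnet18(num_classes=2)     # 2 for single-attr, 4 for multi-attr
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

def train_step(x, y):
    # One gradient step on a minibatch of images x with attribute labels y.
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
    return loss.item()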

B.3.4 Density Ratio Classifier

Model architecture. We provide the same architectural details as in Table B.5.

Hyperparameters. We also use a batch size of 64, the Adam optimizer with learn- ing rate = 0.0001, and a total of 15 epochs to train the classifier for density ratio estimation.

B.3.5 BigGAN

Model architecture. The architectural details for the BigGAN are provided in Table B.6.

Generator | Discriminator
1 × 1 × 2ch Noise | 64 × 64 × 3 Image
Linear 1 × 1 × 16ch → 1 × 1 × 16ch | ResBlock down 1ch → 2ch
ResBlock up 16ch → 16ch | Non-Local Block (64 × 64)
ResBlock up 16ch → 8ch | ResBlock down 2ch → 4ch
ResBlock up 8ch → 4ch | ResBlock down 4ch → 8ch
ResBlock up 4ch → 2ch | ResBlock down 8ch → 16ch
Non-Local Block (64 × 64) | ResBlock down 16ch → 16ch
ResBlock up 2ch → 1ch | ResBlock 16ch → 16ch
BatchNorm, ReLU, 3 × 3 Conv 1ch → 3 | ReLU, Global sum pooling
Tanh | Linear → 1

Table B.6: Architecture for the generator and discriminator. Notation: ch refers to the channel width multiplier, which is 64 for 64 × 64 CelebA images. ResBlock up refers to a Generator Residual Block in which the input is passed through a ReLU activation followed by two 3 × 3 convolutional layers with a ReLU activation in between. ResBlock down refers to a Discriminator Residual Block in which the input is passed through two 3 × 3 convolution layers with a ReLU activation in between, and then downsampled. Upsampling is performed via nearest neighbor interpolation, whereas downsampling is performed via mean pooling. “ResBlock up/down n → m” indicates a ResBlock with n input channels and m output channels.

Hyperparameters. We sweep over a batch size of {16, 32, 64, 128}. The optimizer used is Adam with learning rate = 0.0002, and β1 = 0, β2 = 0.99. We train the model by taking 4 discriminator gradient steps per generator step. Because the BigGAN was originally designed for scaling up class-conditional image generation, we fix all conditioning labels for the unconditional baselines (imp-weight, equi-weight) to the zero vector.

B.3.6 Shapes3D Dataset

For this experiment, we used the Shapes3D dataset [BK18], which comprises 480,000 images of shapes with six underlying attributes. We chose a random attribute (floor color), restricted it to two possible instantiations (red vs. blue), and then applied Algorithm 5.1 for the bias = 0.9 setting. Training on the large biased dataset (containing an excess of red floors) induces an average fairness discrepancy of 0.468, as shown in Figure B.4a. In contrast, applying the importance-weighting correction on the large biased dataset enabled us to train models that yielded an average fairness discrepancy of 0.002, as shown in Figure B.4b.

Figure B.4: Results from the Shapes3D dataset: (a) baseline samples; (b) samples after reweighting. After restricting the possible floor colors to red or blue and using a biased dataset of bias = 0.9, we find that the samples obtained after importance reweighting (b) are considerably more balanced than those without reweighting (a), as desired.

B.4 Chapter 6

B.4.1 Link prediction

We used the Spectral Clustering (SC) implementation from [Ped+11] and the public implementations of the other baselines made available by the authors. For SC, we used a dimension size of 128. For DeepWalk and node2vec, which use a skipgram-like objective on random walks from the graph, we used the same dimension size and the default settings used in [PAS14] and [GL16] respectively: 10 random walks of length 80 per node and a context size of 10. For node2vec, we searched over the random walk bias parameters using a grid search in {0.25, 0.5, 1, 2, 4}, as prescribed in the original work. For GAE and VGAE, we used the same architecture as VGAE and the Adam optimizer with a learning rate of 0.01. For Graphite-AE and Graphite-VAE, we used an architecture of 32-32 units for the encoder and 16-32-16 units for the decoder (two rounds of iterative decoding before a final inner product). The model is trained using the Adam optimizer [KW18] with a learning rate of 0.01. All activations were ReLUs. The dropout rate (for edges) and λ were tuned as hyperparameters on the validation set to optimize the AUC, whereas traditional dropout was set to 0 for all datasets. Additionally, we trained every model for 500 iterations and used the model checkpoint with the best validation loss for testing. Scores are reported as an average of 50 runs with different train/validation/test splits, with the requirement that the training graph necessarily be connected. For Graphite, we observed that using a form of skip connections to define a linear combination of the initial embedding Z and the final embedding Z* is particularly useful. The skip connection consists of a tunable hyperparameter λ controlling the relative weights of the embeddings. The final embedding of Graphite is a function of the initial embedding Z and the last induced embedding Z*. We consider two functions to aggregate them into a final embedding: (1 − λ)Z + λZ* and Z + λZ*/‖Z*‖, which correspond to a convex combination of the two embeddings and an incremental update to the initial embedding in a given direction, respectively. Note that in either case, GAE and VGAE reduce to a special case of Graphite, using only a single inner-product decoder (i.e., λ = 0). On Cora and Pubmed, final embeddings were derived through the convex combination; on Citeseer, through the incremental update.
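In a minimal numpy sketch (Z and Z_star below are arbitrary placeholders for the initial and final induced embeddings), the two aggregation functions read:

# Two ways Graphite combines the initial embedding Z with the final
# induced embedding Z*: convex combination, or incremental update.
import numpy as np

def convex_combination(Z, Z_star, lam):
    return (1.0 - lam) * Z + lam * Z_star

def incremental_update(Z, Z_star, lam):
    return Z + lam * Z_star / np.linalg.norm(Z_star)

rng = np.random.default_rng(0)
Z = rng.standard_normal((100, 16))        # placeholder initial embeddings
Z_star = rng.standard_normal((100, 16))   # placeholder induced embeddings
out = convex_combination(Z, Z_star, lam=0.5)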

Table B.7: AUC scores for link prediction with Monte Carlo subsampling during training. Higher is better.

Method | Cora | Citeseer | Pubmed
VGAE | 89.6 | 92.2 | 92.3
Graphite | 90.5 | 92.5 | 93.1

Figure B.5: AUC scores of VGAE and Graphite with subsampled edges on the Cora dataset.

Scalability. We experimented with learning VGAE and Graphite models by subsampling |E| random entries for Monte Carlo evaluation of the objective at each iteration. The corresponding AUC scores are shown in Table B.7. The results suggest that Graphite can effectively scale to large graphs without significant loss in accuracy. The AUC results trained with edge subsampling as we vary the subsampling coefficient c are shown in Figure B.5.

B.4.2 Semi-supervised node classification

We report the baseline results for SemiEmb [WRC08], DeepWalk [PAS14], ICA [LG03], and Planetoid [YCS16] as specified in [KW17]. GCN uses a 32-16 architecture with ReLU activations and early stopping after 10 epochs without an increase in validation accuracy. The Graphite model uses the same architecture as in link prediction (with no edge dropout). The parameters of the posterior distributions are concatenated with node features to predict the final output. The parameters are learned using the Adam optimizer [KW18] with a learning rate of 0.01. All accuracies are taken as an average of 100 runs.

B.4.3 Density estimation

We learn a model architecture specified for the maximum possible number of nodes (i.e., 20 in this case) to accommodate input graphs of different sizes. While feeding in smaller graphs, we add dummy nodes disconnected from the rest of the graph. The dummy nodes have no influence on the gradient updates for the parameters affecting the latent or observed variables involving nodes in the true graph. For density estimation, we pick a graph family, then train and validate on graphs sampled exclusively from that family. We consider graphs with 10 to 20 nodes belonging to the following graph families (a sketch for sampling them follows the list):

• Erdos-Renyi [ER59]: each edge independently sampled with probability p = 0.5

• Ego Network: a random Erdos-Renyi graph with all nodes neighbors of one randomly chosen node

• Random Regular: uniformly random regular graph with degree d = 4

• Random Geometric: graph induced by uniformly random points in the unit square, with edges between points at Euclidean distance less than r = 0.5

• Random Power Tree: tree generated by randomly swapping elements from a degree distribution to satisfy a power law distribution for γ = 3

• Barabasi-Albert [BA99]: preferential attachment graph generation with attachment edge count m = 4
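The following networkx sketch samples from the graph families listed above with the stated parameters; the ego-network construction is our own reading of the description, and the exact generators used in the experiments may differ:

# Sampling the graph families above with networkx (n between 10 and 20).
import random
import networkx as nx

def sample_graph(family, n):
    if family == "erdos_renyi":
        return nx.erdos_renyi_graph(n, p=0.5)
    if family == "ego":
        # ER graph with one randomly chosen hub adjacent to all nodes.
        g = nx.erdos_renyi_graph(n, p=0.5)
        hub = random.choice(list(g.nodes))
        g.add_edges_from((hub, v) for v in g.nodes if v != hub)
        return g
    if family == "regular":
        return nx.random_regular_graph(d=4, n=n)
    if family == "geometric":
        return nx.random_geometric_graph(n, radius=0.5)
    if family == "powerlaw_tree":
        return nx.random_powerlaw_tree(n, gamma=3, tries=10000)
    if family == "barabasi_albert":
        return nx.barabasi_albert_graph(n, m=4)
    raise ValueError(family)

g = sample_graph("barabasi_albert", n=15)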

We use convex combinations over three successively induced embeddings. Scores are reported over an average of 50 runs. Additionally, a two-layer neural net is applied to the initially sampled embedding Z before being fed to the inner product decoder for GAE and VGAE, or being fed to the iterations of Eqs. (6.8) and (6.9) for both Graphite-AE and Graphite-VAE. APPENDIX B. ADDITIONAL EXPERIMENTAL DETAILS & RESULTS 171

B.5 Chapter 7

B.5.1 RoboSumo Environment

To limit the scope of our study, we restrict agent morphologies to only 4-leg robots. During the game, observations of each agent were represented by a 120-dimensional vector comprised of positions and velocities of its own body and positions of the opponent's body; an agent's actions were 8-dimensional vectors that represented torques applied to the corresponding joints.

Model architectures. Agent policies are parameterized as multi-layer perceptrons (MLPs) with 2 hidden layers of 90 units each. For the embedding network, we used another MLP network with 2 hidden layers of 100 units each to give an embedding of size 100. For the conditioned policy network we also reduce the hidden layer size to 64 units each.

Policy optimization. For learning the population of agents, we use the distributed version of the PPO algorithm as described in [Al-+18], with a 2 × 10⁻³ learning rate, ε = 0.2, and 16,000 time steps per update, with 6 epochs and 4,000 time steps per batch.

Training agents. For our analysis, we train a diverse collection of 25 agents, some of which are trained via self-play and others are trained in pairs concurrently, forming a clique agent-interaction graph (Figure B.6a).

B.5.2 ParticleWorld Environment

The continuous observation space and discrete action space for the speaker agents have 3 and 7 dimensions, respectively. For the listener agents, the observation and action spaces have 15 and 5 dimensions, respectively.

Model architectures. Agent policies and the shared critic (i.e., a value function) are parameterized as multi-layer perceptrons (MLPs) with 2 hidden layers of 64 units each.


Figure B.6: (a) An example clique agent-interaction graph with 10 agents. (b) An example bipartite agent-interaction graph with 5 speakers and 5 listeners.

The observation space for the speaker is small (3 dimensions), and a small embedding of size 5 for the listener policy gives good performance. For the embedding network, we again used an MLP with 2 hidden layers of 100 units each.

Policy optimization. For learning the initial population of listener and speaker policies, we use multiagent deep deterministic policy gradients (MADDPG) as the base algorithm [Low+17]. The Adam optimizer [KB15] with a learning rate of 4 × 10⁻³ was used for optimization. The replay buffer size was set to 10⁶ timesteps.

Training agents. We first train 28 speaker-listener pairs using the MADDPG algo- rithm. From this collection of 28 speakers, we train another set of 28 listeners, each trained to work with a speaker pair, forming a bipartite agent-interaction graph (Figure B.6b). We choose the best 14 listeners for later experiments. APPENDIX B. ADDITIONAL EXPERIMENTAL DETAILS & RESULTS 173

B.6 Chapter 8

Datasets. For MNIST, we use a train/valid/test split of 50,000/10,000/10,000 images. For Omniglot, we use a train/valid/test split of 23,845/500/8,070 images. For CelebA, we used the splits provided by [Liu+15] on the dataset website. All images were scaled so that pixel values lie between 0 and 1. We used the Adam optimizer with a learning rate of 0.001 for all the learned models. For MNIST and Omniglot, we used a batch size of 100. For CelebA, we used a batch size of 64. Further, we implemented early stopping based on the best validation bounds after 200 epochs for MNIST, 500 epochs for Omniglot, and 200 epochs for CelebA.

Model architectures. For both MNIST and Omniglot, the UAE decoder used 2 hidden layers of 500 units each with ReLU activations. The encoder was a single linear layer with only weight parameters and no bias parameters. The encoder and decoder architectures for the VAE baseline are symmetrical, with 2 hidden layers of 500 units each and 20 latent units. We used the LASSO baseline implementation from sklearn and tuned the Lagrange parameter on the validation sets. For the baselines, we do 10 random restarts with 1,000 steps per restart and pick the reconstruction with the best measurement error, as prescribed in [Bor+17].

Table B.8: Frobenius norms of the encoding matrices for MNIST and Omniglot.

m | Random Gaussian Matrices | MNIST-UAE | Omniglot-UAE
2 | 39.57 | 6.42 | 2.17
5 | 63.15 | 5.98 | 2.66
10 | 88.98 | 7.24 | 3.50
25 | 139.56 | 8.53 | 4.71
50 | 198.28 | 9.44 | 5.45
100 | 280.25 | 10.62 | 6.02

Table B.8 shows the average norms for the random Gaussian matrices used in the baselines and the learned UAE encodings. The low norms for the UAE encodings suggest that UAE is not trivially overcoming noise by increasing the norm of W. APPENDIX B. ADDITIONAL EXPERIMENTAL DETAILS & RESULTS 174

For CelebA dataset, we used the following architectures for UAE.

Encoder

Signal
→ Conv[Kernel: 4x4, Stride: 2, Filters: 32, Padding: Same, Activation: Relu]
→ Conv[Kernel: 4x4, Stride: 2, Filters: 32, Padding: Same, Activation: Relu]
→ Conv[Kernel: 4x4, Stride: 2, Filters: 64, Padding: Same, Activation: Relu]
→ Conv[Kernel: 4x4, Stride: 2, Filters: 64, Padding: Same, Activation: Relu]
→ Conv[Kernel: 4x4, Stride: 1, Filters: 256, Padding: Valid, Activation: Relu]
→ Fully_Connected[Units: m, Activation: None]

Decoder

Measurements
→ Fully_Connected[Units: 256, Activation: Relu]
→ Conv_transpose[Kernel: 4x4, Stride: 1, Filters: 256, Padding: Valid, Activation: Relu]
→ Conv_transpose[Kernel: 4x4, Stride: 2, Filters: 64, Padding: Same, Activation: Relu]
→ Conv_transpose[Kernel: 4x4, Stride: 2, Filters: 64, Padding: Same, Activation: Relu]
→ Conv_transpose[Kernel: 4x4, Stride: 2, Filters: 32, Padding: Same, Activation: Relu]
→ Conv_transpose[Kernel: 4x4, Stride: 2, Filters: 3, Padding: Valid, Activation: Sigmoid]

C Code & Data

The following links provide open-source implementations and custom datasets (if applicable) for the various learning frameworks presented in this dissertation.

• Chapter 3

https://github.com/ermongroup/flow-gan

• Chapter 4

https://github.com/aditya-grover/bias-correction-generative

• Chapter 5

https://github.com/ermongroup/fairgen

• Chapter 6

https://github.com/ermongroup/graphite

• Chapter 8

https://github.com/aditya-grover/uae

• Chapter 9

https://github.com/chueh-ermon/battery-fast-charging-optimization

Bibliography

[23m16] 23&me. “The Real Issue: Diversity in Genetics Research.” In: Retrieved from https://blog.23andme.com/ancestry/the-real-issue-diversity-in-genetics-research/ (2016) (cit. on p. 56).
[AB14] Guillaume Alain and Yoshua Bengio. “What regularized auto-encoders learn from the data-generating distribution.” In: Journal of Machine Learning Research 15.1 (2014), pp. 3563–3593 (cit. on p. 125).
[Aba+16] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. “Tensorflow: a system for large-scale machine learning.” In: Operating Systems Design and Implementation. 2016 (cit. on pp. 158, 160).
[Abe+20] Rediet Abebe, Solon Barocas, Jon Kleinberg, Karen Levy, Manish Raghavan, and David G Robinson. “Roles for computing in social change.” In: Proceedings of the Conference on Fairness, Accountability, and Transparency. 2020 (cit. on p. 73).
[ACB17] Martin Arjovsky, Soumith Chintala, and Léon Bottou. “Wasserstein gan.” In: arXiv preprint arXiv:1701.07875 (2017) (cit. on pp. 20, 26, 28, 33).
[Ade+19] Tameem Adel, Isabel Valera, Zoubin Ghahramani, and Adrian Weller. “One-network adversarial fairness.” In: Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 33. 2019 (cit. on p. 71).


[Ahm+17] Shabbir Ahmed, Ira Bloom, Andrew N Jansen, Tanvir Tanim, Eric J Dufek, Ahmad Pesaran, Andrew Burnham, Richard B Carlson, Fernando Dias, Keith Hardy, et al. “Enabling fast charging–A battery technology gap assessment.” In: Journal of Power Sources 367 (2017), pp. 250–262 (cit. on pp. 130, 131, 136, 139, 141, 144).
[Al-+18] Maruan Al-Shedivat, Trapit Bansal, Yuri Burda, Ilya Sutskever, Igor Mordatch, and Pieter Abbeel. “Continuous Adaptation via Meta-Learning in Nonstationary and Competitive Environments.” In: International Conference on Learning Representations. 2018 (cit. on pp. 90, 99, 103, 171).
[Ale+17] Alexander A Alemi, Ian Fischer, Joshua V Dillon, and Kevin Murphy. “Deep variational information bottleneck.” In: International Conference on Learning Representations. 2017 (cit. on p. 111).
[Ale+18] Alexander A Alemi, Ben Poole, Ian Fischer, Joshua V Dillon, Rif A Saurous, and Kevin Murphy. “Fixing a Broken ELBO.” In: International Conference on Machine Learning. 2018 (cit. on p. 126).
[Ami+19] Alexander Amini, Ava P Soleimany, Wilko Schwarting, Sangeeta N Bhatia, and Daniela Rus. “Uncovering and mitigating algorithmic bias through learned latent structure.” In: Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society. 2019 (cit. on p. 72).
[Ana+16] Ankit Anand, Aditya Grover, Mausam, and Parag Singla. “Contextual Symmetries in Probabilistic Graphical Models.” In: International Joint Conference on Artificial Intelligence. 2016 (cit. on p. 148).
[Ant+15] Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. “Vqa: Visual question answering.” In: Proceedings of the IEEE International Conference on Computer Vision. 2015 (cit. on p. 4).

[ARZ18] Sanjeev Arora, Andrej Risteski, and Yi Zhang. “Do GANs learn the distribution? Some Theory and Empirics.” In: International Conference on Learning Representations. 2018 (cit. on pp. 35, 58).
[AS17] Stefano V Albrecht and Peter Stone. “Autonomous Agents Modeling Other Agents: A Comprehensive Survey and Open Problems.” In: arXiv preprint arXiv:1709.08071 (2017) (cit. on p. 107).
[ASE17] Antreas Antoniou, Amos Storkey, and Harrison Edwards. “Data augmentation generative adversarial networks.” In: arXiv preprint arXiv:1711.04340 (2017) (cit. on pp. 35–37, 47, 160).
[Att+20] Peter Attia, Aditya Grover, Norman Jin, Kristen Severson, Bryan Cheong, Jerry Liao, Michael H Chen, Nicholas Perkins, Zi Yang, Patrick Herring, Muratahan Aykol, Stephen Harris, Richard Braatz, Stefano Ermon, and William Chueh. “Closed-loop optimization of extreme fast charging for batteries using machine learning.” In: Nature 578.7795 (2020), pp. 397–402 (cit. on pp. 5, 6, 8, 134, 144).
[AZ17] Sanjeev Arora and Yi Zhang. “Do GANs actually learn the distribution? An empirical study.” In: arXiv preprint arXiv:1706.08224 (2017) (cit. on p. 33).
[Aza+18] Samaneh Azadi, Catherine Olsson, Trevor Darrell, Ian Goodfellow, and Augustus Odena. “Discriminator rejection sampling.” In: arXiv preprint arXiv:1810.06758 (2018) (cit. on pp. 53, 72).
[AZG20] Michael Arbel, Liang Zhou, and Arthur Gretton. “KALE: When Energy-Based Learning Meets Adversarial Training.” In: arXiv preprint arXiv:2003.05033 (2020) (cit. on p. 54).
[BA03] David Barber and Felix Agakov. “The IM algorithm: A variational approach to information maximization.” In: Advances in Neural Information Processing Systems. 2003 (cit. on pp. 111, 114, 116, 127).

[BA99] Albert-Laszlo Barabasi and Reka Albert. “Emergence of Scaling in Random Networks.” In: Science 286.5439 (1999), pp. 509–512 (cit. on pp. 87, 170).
[Bac+19] Arturs Backurs, Piotr Indyk, Krzysztof Onak, Baruch Schieber, Ali Vakilian, and Tal Wagner. “Scalable fair clustering.” In: arXiv preprint arXiv:1902.03519 (2019) (cit. on p. 72).
[Bar89] Horace B Barlow. “Unsupervised learning.” In: Neural Computation 1.3 (1989), pp. 295–311 (cit. on p. 1).
[Bau+14] Thorsten Baumhöfer, Manuel Brühl, Susanne Rothgang, and Dirk Uwe Sauer. “Production caused variation in capacity aging trend and correlation to initial cell performance.” In: Journal of Power Sources 247 (2014), pp. 332–338 (cit. on p. 129).
[BCM11] Smriti Bhagat, Graham Cormode, and S Muthukrishnan. “Node classification in social networks.” In: Social network data analytics. Springer, 2011, pp. 115–148 (cit. on p. 75).
[BCN19] Suman K Bera, Deeparnab Chakrabarty, and Maryam Negahbani. “Fair algorithms for clustering.” In: arXiv preprint arXiv:1901.02393 (2019) (cit. on p. 72).
[BCV13] Yoshua Bengio, Aaron Courville, and Pascal Vincent. “Representation learning: A review and new perspectives.” In: IEEE transactions on pattern analysis and machine intelligence 35.8 (2013), pp. 1798–1828 (cit. on pp. 7, 17, 75).
[BDS18] Andrew Brock, Jeff Donahue, and Karen Simonyan. “Large scale gan training for high fidelity natural image synthesis.” In: arXiv preprint arXiv:1809.11096 (2018) (cit. on p. 64).
[Béd+18] Anne-Catherine Bédard, Andrea Adamo, Kosi C Aroh, M Grace Russell, Aaron A Bedermann, Jeremy Torosian, Brian Yue, Klavs F Jensen,

and Timothy F Jamison. “Reconfigurable system for automated optimization of diverse chemical reactions.” In: Science 361.6408 (2018), pp. 1220–1225 (cit. on p. 130).
[Bel+17] Marc G Bellemare, Ivo Danihelka, Will Dabney, Shakir Mohamed, Balaji Lakshminarayanan, Stephan Hoyer, and Rémi Munos. “The Cramer Distance as a Solution to Biased Wasserstein Gradients.” In: arXiv preprint arXiv:1705.10743 (2017) (cit. on p. 33).
[Ben+09] Yoshua Bengio et al. “Learning deep architectures for AI.” In: Foundations and trends® in Machine Learning 2.1 (2009), pp. 1–127 (cit. on p. 125).
[Ben+13] Yoshua Bengio, Li Yao, Guillaume Alain, and Pascal Vincent. “Generalized denoising auto-encoders as generative models.” In: Advances in Neural Information Processing Systems. 2013 (cit. on p. 125).
[Beu+17] Alex Beutel, Jilin Chen, Zhe Zhao, and Ed H Chi. “Data decisions and theoretical implications when adversarially learning fair representations.” In: arXiv preprint arXiv:1707.00075 (2017) (cit. on p. 71).
[BG18] Joy Buolamwini and Timnit Gebru. “Gender shades: Intersectional accuracy disparities in commercial gender classification.” In: Conference on fairness, accountability and transparency. 2018 (cit. on p. 73).
[BGS15] Yuri Burda, Roger Grosse, and Ruslan Salakhutdinov. “Importance weighted autoencoders.” In: arXiv preprint arXiv:1509.00519 (2015) (cit. on p. 54).
[BHN18] Solon Barocas, Moritz Hardt, and Arvind Narayanan. Fairness and Machine Learning. http://www.fairmlbook.org. fairmlbook.org, 2018 (cit. on pp. 55, 71).
[Biń+18] Mikołaj Bińkowski, Dougal J Sutherland, Michael Arbel, and Arthur Gretton. “Demystifying MMD GANs.” In: arXiv preprint arXiv:1801.01401 (2018) (cit. on p. 46).

[BK18] Chris Burgess and Hyunjik Kim. 3D Shapes Dataset. https://github.com/deepmind/3dshapes-dataset/. 2018 (cit. on p. 165).
[BK96] Dorrit Billman and James Knutson. “Unsupervised concept learning and value systematicity: A complex whole aids learning the parts.” In: Journal of Experimental Psychology: Learning, Memory, and Cognition 22.2 (1996), p. 458 (cit. on p. 1).
[BKH16] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. “Layer normalization.” In: arXiv preprint arXiv:1607.06450 (2016) (cit. on p. 153).
[BKM17] David M Blei, Alp Kucukelbir, and Jon D McAuliffe. “Variational inference: A review for statisticians.” In: Journal of the American Statistical Association 112.518 (2017), pp. 859–877 (cit. on p. 80).
[BL11] Lars Backstrom and Jure Leskovec. “Supervised random walks: predicting and recommending links in social networks.” In: ACM International Conference on Web Search and Data Mining. 2011 (cit. on p. 75).
[BL18] Jonathon Byrd and Zachary C Lipton. “What is the effect of Importance Weighting in Deep Learning?” In: arXiv preprint arXiv:1812.03372 (2018) (cit. on pp. 53, 72).
[BN02] Mikhail Belkin and Partha Niyogi. “Laplacian eigenmaps and spectral techniques for embedding and clustering.” In: Advances in Neural Information Processing Systems. 2002 (cit. on p. 75).
[Boj+18] Aleksandar Bojchevski, Oleksandr Shchur, Daniel Zügner, and Stephan Günnemann. “NetGAN: Generating Graphs via Random Walks.” In: International Conference on Machine Learning. 2018 (cit. on pp. 83, 88).
[Bor+17] Ashish Bora, Ajil Jalal, Eric Price, and Alexandros G Dimakis. “Compressed sensing using generative models.” In: arXiv preprint arXiv:1703.03208 (2017) (cit. on pp. 118, 121, 126–128, 173).

[Bow+15] Samuel R Bowman, Luke Vilnis, Oriol Vinyals, Andrew M Dai, Rafal Jozefowicz, and Samy Bengio. “Generating sentences from a continuous space.” In: arXiv preprint arXiv:1511.06349 (2015) (cit. on pp. 53, 72).
[Bro+16] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. “Openai gym.” In: arXiv preprint arXiv:1606.01540 (2016) (cit. on p. 50).
[Bro+18] Marc Brockschmidt, Miltiadis Allamanis, Alexander L Gaunt, and Oleksandr Polozov. “Generative code modeling with graphs.” In: arXiv preprint arXiv:1805.08490 (2018) (cit. on p. 4).
[Bro+20] Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. “Language models are few-shot learners.” In: arXiv preprint arXiv:2005.14165 (2020) (cit. on p. 55).
[BRT09] Peter J Bickel, Ya'acov Ritov, and Alexandre B Tsybakov. “Simultaneous analysis of lasso and Dantzig selector.” In: The Annals of Statistics (2009), pp. 1705–1732 (cit. on p. 110).
[BS16] Solon Barocas and Andrew D Selbst. “Big data's disparate impact.” In: Calif. L. Rev. 104 (2016), p. 671 (cit. on p. 55).
[BSM17] David Berthelot, Tom Schumm, and Luke Metz. “Began: Boundary equilibrium generative adversarial networks.” In: arXiv preprint arXiv:1703.10717 (2017) (cit. on p. 33).
[But+18] Keith T Butler, Daniel W Davies, Hugh Cartwright, Olexandr Isayev, and Aron Walsh. “Machine learning for molecular and materials science.” In: Nature 559.7715 (2018), pp. 547–555 (cit. on pp. 129, 130, 144).
[Car+12] William R Carson, Minhua Chen, Miguel RD Rodrigues, Robert Calderbank, and Lawrence Carin. “Communications-inspired projection

design with application to compressive sensing.” In: SIAM Journal on Imaging Sciences 5.4 (2012), pp. 1185–1212 (cit. on p. 127).
[CC18] Danielle K Citron and Robert Chesney. “Deep Fakes: A Looming Crisis for National Security, Democracy and Privacy?” In: Lawfare (2018) (cit. on p. 148).
[CF06] Deepayan Chakrabarti and Christos Faloutsos. “Graph mining: Laws, generators, and algorithms.” In: ACM Computing Surveys 38.1 (2006), p. 2 (cit. on p. 87).
[Che+17a] Tong Che, Yanran Li, Athul Paul Jacob, Yoshua Bengio, and Wenjie Li. “Mode Regularized Generative Adversarial Networks.” In: International Conference on Learning Representations. 2017 (cit. on pp. 22, 28, 33).
[Che+17b] Xi Chen, Diederik P Kingma, Tim Salimans, Yan Duan, Prafulla Dhariwal, John Schulman, Ilya Sutskever, and Pieter Abbeel. “Variational lossy autoencoder.” In: International Conference on Learning Representations. 2017 (cit. on pp. 111, 126).
[Chi+17] Flavio Chierichetti, Ravi Kumar, Silvio Lattanzi, and Sergei Vassilvitskii. “Fair clustering through fairlets.” In: Advances in Neural Information Processing Systems. 2017 (cit. on p. 72).
[Cho+19] Kristy Choi, Kedar Tatwawadi, Aditya Grover, Tsachy Weissman, and Stefano Ermon. “Neural Joint Source-Channel Coding.” In: International Conference on Machine Learning. 2019 (cit. on p. 128).
[Cho+20] Kristy Choi, Aditya Grover, Trisha Singh, Rui Shu, and Stefano Ermon. “Fair Generative Modeling via Weak Supervision.” In: International Conference on Machine Learning. 2020 (cit. on pp. 3, 4, 8, 53).
[CJA18] Hyunsun Choi, Eric Jang, and Alexander A Alemi. “Waic, but why? generative ensembles for robust anomaly detection.” In: arXiv preprint arXiv:1810.01392 (2018) (cit. on p. 23).

[CKP09] Toon Calders, Faisal Kamiran, and Mykola Pechenizkiy. “Building classifiers with independency constraints.” In: 2009 IEEE International Conference on Data Mining Workshops. 2009 (cit. on p. 72).
[CRT06] Emmanuel J Candès, Justin Romberg, and Terence Tao. “Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information.” In: IEEE Transactions on information theory 52.2 (2006), pp. 489–509 (cit. on pp. 6, 110, 112).
[CT05] Emmanuel J Candès and Terence Tao. “Decoding by linear programming.” In: IEEE Transactions on Information Theory 51.12 (2005), pp. 4203–4215 (cit. on pp. 6, 110, 112).
[Dan+17] Ivo Danihelka, Balaji Lakshminarayanan, Benigno Uria, Daan Wierstra, and Peter Dayan. “Comparison of Maximum Likelihood and GAN-based training of Real NVPs.” In: arXiv preprint arXiv:1705.05263 (2017) (cit. on pp. 33, 53, 72).
[Den+09] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. “Imagenet: A large-scale hierarchical image database.” In: IEEE Conference on Computer Vision and Pattern Recognition. 2009 (cit. on p. 2).
[Dev+18] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. “Bert: Pre-training of deep bidirectional transformers for language understanding.” In: arXiv preprint arXiv:1810.04805 (2018) (cit. on p. 55).
[DG84] Peter J Diggle and Richard J Gratton. “Monte Carlo methods of inference for implicit statistical models.” In: Journal of the Royal Statistical Society. (1984), pp. 193–227 (cit. on pp. 22, 116).
[DGA00] Arnaud Doucet, Simon Godsill, and Christophe Andrieu. “On sequential Monte Carlo sampling methods for Bayesian filtering.” In: Statistics and Computing 10.3 (2000), pp. 197–208 (cit. on p. 44).

[DGE18] Manik Dhar, Aditya Grover, and Stefano Ermon. ‘‘Modeling Sparse Deviations for Compressed Sensing using Generative Models.’’ In: International Conference on Machine Learning. 2018 (cit. on pp. 122, 126– 128). [Dha+17] Prafulla Dhariwal, Christopher Hesse, Oleg Klimov, Alex Nichol, Matthias Plappert, Alec Radford, John Schulman, Szymon Sidor, and Yuhuai Wu. ‘‘Openai baselines.’’ In: GitHub, GitHub repository (2017) (cit. on p. 160). [Die+18] Maurice Diesendruck, Ethan R Elenberg, Rajat Sen, Guy W Cole, Sanjay Shakkottai, and Sinead A Williamson. ‘‘Importance weighted generative networks.’’ In: arXiv preprint arXiv:1806.02512 (2018) (cit. on pp. 53, 72). [DKB14] Laurent Dinh, David Krueger, and Yoshua Bengio. ‘‘NICE: Non-linear independent components estimation.’’ In: arXiv preprint arXiv:1410.8516 (2014) (cit. on pp. 19, 20, 24, 25, 153). [DM19] Yilun Du and Igor Mordatch. ‘‘Implicit generation and generalization in energy-based models.’’ In: arXiv preprint arXiv:1903.08689 (2019) (cit. on p. 15). [Don06] David L Donoho. ‘‘Compressed sensing.’’ In: IEEE Transactions on information theory 52.4 (2006), pp. 1289–1306 (cit. on pp. 6, 110, 112). [Dou11] Brendan L Douglas. ‘‘The Weisfeiler-Lehman method and graph iso- morphism testing.’’ In: arXiv preprint arXiv:1101.5211 (2011) (cit. on p. 77). [DSB17] Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. ‘‘Density es- timation using Real NVP.’’ In: International Conference on Learning Representations. 2017 (cit. on pp. 20, 24, 25, 153). BIBLIOGRAPHY 186

[DSH15] Tobias Domhan, Jost Tobias Springenberg, and Frank Hutter. ‘‘Speeding up automatic hyperparameter optimization of deep neural networks by extrapolation of learning curves.’’ In: International Joint Conference on Artificial Intelligence. 2015 (cit. on pp. 130, 142).
[DTB17] Shayan Doroudi, Philip S Thomas, and Emma Brunskill. ‘‘Importance Sampling for Fair Policy Selection.’’ In: Grantee Submission (2017) (cit. on p. 72).
[Duv+15] David Duvenaud, Dougal Maclaurin, Jorge Iparraguirre, Rafael Bombarell, Timothy Hirzel, Alan Aspuru-Guzik, and Ryan Adams. ‘‘Convolutional Networks on Graphs for Learning Molecular Fingerprints.’’ In: Advances in Neural Information Processing Systems. 2015 (cit. on p. 88).
[Dwo+12] Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel. ‘‘Fairness through awareness.’’ In: Proceedings of the 3rd Conference in Innovations in Theoretical Computer Science. 2012 (cit. on p. 71).
[ER59] Paul Erdös and Alfréd Rényi. ‘‘On random graphs.’’ In: Publicationes Mathematicae (Debrecen) 6 (1959), pp. 290–297 (cit. on pp. 87, 169).
[ES15] Harrison Edwards and Amos Storkey. ‘‘Censoring representations with an adversary.’’ In: arXiv preprint arXiv:1511.05897 (2015) (cit. on p. 71).
[ET94] Bradley Efron and Robert J Tibshirani. An introduction to the bootstrap. CRC Press, 1994 (cit. on p. 41).
[FCG18] Mehrdad Farajtabar, Yinlam Chow, and Mohammad Ghavamzadeh. ‘‘More robust doubly robust off-policy evaluation.’’ In: International Conference on Machine Learning. 2018 (cit. on p. 51).
[Fer99] Jacques Ferber. Multi-agent systems: An introduction to distributed artificial intelligence. Vol. 1. Addison-Wesley Reading, 1999 (cit. on p. 89).

[FHT01] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. The elements of statistical learning. Vol. 1. Springer Series in Statistics. New York, NY, USA, 2001 (cit. on pp. 53, 60).
[Fra18] Peter I Frazier. ‘‘A tutorial on bayesian optimization.’’ In: arXiv preprint arXiv:1807.02811 (2018) (cit. on p. 132).
[Fu06] Michael C Fu. ‘‘Gradient estimation.’’ In: Handbooks in Operations Research and Management Science 13 (2006), pp. 575–616 (cit. on p. 115).
[Gau+12] Anna Gaulton, Louisa J Bellis, A Patricia Bento, Jon Chambers, Mark Davies, Anne Hersey, Yvonne Light, Shaun McGlinchey, David Michalovich, Bissan Al-Lazikani, et al. ‘‘ChEMBL: a large-scale bioactivity database for drug discovery.’’ In: Nucleic Acids Research 40.D1 (2012), pp. D1100–D1107 (cit. on p. 148).
[GDE18] Aditya Grover, Manik Dhar, and Stefano Ermon. ‘‘Flow-GAN: Combining maximum likelihood and adversarial learning in generative models.’’ In: AAAI Conference on Artificial Intelligence. 2018 (cit. on pp. 3, 8, 23).
[GE18] Aditya Grover and Stefano Ermon. ‘‘Boosted generative models.’’ In: AAAI Conference on Artificial Intelligence. 2018 (cit. on pp. 3, 8, 40, 43, 72).
[GE19] Aditya Grover and Stefano Ermon. ‘‘Uncertainty Autoencoders: Learning Compressed Representations via Variational Information Maximization.’’ In: International Conference on Artificial Intelligence and Statistics. 2019 (cit. on pp. 6, 8).
[Geb+18] Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. ‘‘Datasheets for datasets.’’ In: arXiv preprint arXiv:1803.09010 (2018) (cit. on pp. 56, 73).

[Ger+15] Mathieu Germain, Karol Gregor, Iain Murray, and Hugo Larochelle. ‘‘Made: Masked autoencoder for distribution estimation.’’ In: International Conference on Machine Learning. 2015 (cit. on p. 16).
[GG14] Samuel Gershman and Noah Goodman. ‘‘Amortized inference in probabilistic reasoning.’’ In: Proceedings of the Annual Meeting of the Cognitive Science Society. Vol. 36. 36. 2014 (cit. on pp. 111, 113, 127).
[GH12] Michael U Gutmann and Aapo Hyvärinen. ‘‘Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics.’’ In: Journal of Machine Learning Research 13.Feb (2012), pp. 307–361 (cit. on p. 53).
[Gil+17] Justin Gilmer, Samuel Schoenholz, Patrick Riley, Oriol Vinyals, and George Dahl. ‘‘Neural message passing for quantum chemistry.’’ In: International Conference on Machine Learning. 2017 (cit. on p. 88).
[GL16] Aditya Grover and Jure Leskovec. ‘‘node2vec: Scalable Feature Learning for Networks.’’ In: International Conference on Knowledge Discovery and Data Mining. 2016 (cit. on pp. 4, 8, 85, 107, 167).
[Gla13] Paul Glasserman. Monte Carlo methods in financial engineering. Vol. 53. Springer Science & Business Media, 2013 (cit. on p. 115).
[Gly90] Peter W Glynn. ‘‘Likelihood ratio gradient estimation for stochastic systems.’’ In: Communications of the ACM 33.10 (1990), pp. 75–84 (cit. on p. 115).
[Góm+18] Rafael Gómez-Bombarelli, Jennifer N Wei, David Duvenaud, José Miguel Hernández-Lobato, Benjamín Sánchez-Lengeling, Dennis Sheberla, Jorge Aguilera-Iparraguirre, Timothy D Hirzel, Ryan P Adams, and Alán Aspuru-Guzik. ‘‘Automatic chemical design using a data-driven continuous representation of molecules.’’ In: ACS Central Science 4.2 (2018), pp. 268–276 (cit. on p. 148).

[Gom+19] Carla Gomes, Thomas Dietterich, Christopher Barrett, Jon Conrad, Bistra Dilkina, Stefano Ermon, Fei Fang, Andrew Farnsworth, Alan Fern, Xiaoli Fern, et al. ‘‘Computational sustainability: Computing for a better world and a sustainable future.’’ In: Communications of the ACM 62.9 (2019), pp. 56–65 (cit. on p. 5).
[Goo+14] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. ‘‘Generative adversarial nets.’’ In: Advances in Neural Information Processing Systems. 2014 (cit. on pp. 21–23, 36, 53).
[Goo16] Ian Goodfellow. ‘‘NIPS 2016 tutorial: Generative adversarial networks.’’ In: arXiv preprint arXiv:1701.00160 (2016) (cit. on pp. 3, 33).
[Gop10] Alison Gopnik. ‘‘How babies think.’’ In: Scientific American 303.1 (2010), pp. 76–81 (cit. on p. 1).
[Gra+18] Jarosław M Granda, Liva Donina, Vincenza Dragone, De-Liang Long, and Leroy Cronin. ‘‘Controlling an organic synthesis robot with machine learning to search for new reactivity.’’ In: Nature 559.7714 (2018), pp. 377–381 (cit. on p. 130).
[Gre+07] Arthur Gretton, Karsten M Borgwardt, Malte Rasch, Bernhard Schölkopf, and Alex J Smola. ‘‘A kernel method for the two-sample-problem.’’ In: Advances in Neural Information Processing Systems. 2007 (cit. on pp. 53, 72).
[Grg+18] Nina Grgic-Hlaca, Elissa M Redmiles, Krishna P Gummadi, and Adrian Weller. ‘‘Human perceptions of fairness in algorithmic decision making: A case study of criminal risk prediction.’’ In: International World Wide Web Conference. 2018 (cit. on p. 71).
[GRM19] Ishaan Gulrajani, Colin Raffel, and Luke Metz. ‘‘Towards GAN Benchmarks Which Require Generalization.’’ In: International Conference on Learning Representations. 2019 (cit. on pp. 53, 72).

[Gro+18a] Aditya Grover, Ramki Gummadi, Miguel Lazaro-Gredilla, Dale Schuurmans, and Stefano Ermon. ‘‘Variational Rejection Sampling.’’ In: International Conference on Artificial Intelligence and Statistics. 2018 (cit. on p. 54).
[Gro+18b] Aditya Grover, Todor Markov, Peter Attia, Norman Jin, Nicholas Perkins, Bryan Cheong, Michael Chen, Zi Yang, Stephen Harris, William Chueh, and Stefano Ermon. ‘‘Best arm identification in multi-armed bandits with delayed feedback.’’ In: International Conference on Artificial Intelligence and Statistics. 2018 (cit. on pp. 8, 131, 134).
[Gro+18c] Aditya Grover, Maruan Al-Shedivat, Jayesh K Gupta, Yura Burda, and Harrison Edwards. ‘‘Learning Policy Representations in Multiagent Systems.’’ In: International Conference on Machine Learning. 2018 (cit. on pp. 5, 8).
[Gro+18d] Aditya Grover, Maruan Al-Shedivat, Jayesh K Gupta, Yuri Burda, and Harrison Edwards. ‘‘Evaluating Generalization in Multiagent Systems using Agent-Interaction Graphs.’’ In: International Conference on Autonomous Agents and Multiagent Systems. 2018 (cit. on pp. 5, 8, 96).
[Gro+19a] Aditya Grover, Jiaming Song, Alekh Agarwal, Kenneth Tran, Ashish Kapoor, Eric Horvitz, and Stefano Ermon. ‘‘Bias Correction of Learned Generative Models using Likelihood-Free Importance Weighting.’’ In: Advances in Neural Information Processing Systems. 2019 (cit. on pp. 3, 4, 8, 72).
[Gro+19b] Aditya Grover, Eric Wang, Aaron Zweig, and Stefano Ermon. ‘‘Stochastic Optimization of Sorting Networks via Continuous Relaxations.’’ In: International Conference on Learning Representations. 2019 (cit. on p. 148).
[Gro+20] Aditya Grover, Christopher Chute, Rui Shu, Zhangjie Cao, and Stefano Ermon. ‘‘AlignFlow: Cycle Consistent Learning from Multiple Domains via Normalizing Flows.’’ In: AAAI Conference on Artificial Intelligence. 2020 (cit. on pp. 4, 34).
[Gul+17] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron Courville. ‘‘Improved Training of Wasserstein GANs.’’ In: arXiv preprint arXiv:1704.00028 (2017) (cit. on pp. 26, 153, 154).
[Guo+17] Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q Weinberger. ‘‘On calibration of modern neural networks.’’ In: International Conference on Machine Learning. 2017 (cit. on p. 158).
[Gup+18] Abhishek Gupta, Benjamin Eysenbach, Chelsea Finn, and Sergey Levine. ‘‘Unsupervised meta-learning for reinforcement learning.’’ In: arXiv preprint arXiv:1806.04640 (2018) (cit. on p. 147).
[GZE19] Aditya Grover, Aaron Zweig, and Stefano Ermon. ‘‘Graphite: Iterative generative modeling of graphs.’’ In: International Conference on Machine Learning. 2019 (cit. on pp. 4, 8).
[HA15] Elad Hoffer and Nir Ailon. ‘‘Deep metric learning using triplet network.’’ In: International Workshop on Similarity-Based Pattern Recognition. 2015 (cit. on pp. 4, 94, 107, 134).
[Has+18] Tatsunori B Hashimoto, Megha Srivastava, Hongseok Namkoong, and Percy Liang. ‘‘Fairness without demographics in repeated loss minimization.’’ In: arXiv preprint arXiv:1806.08010 (2018) (cit. on p. 73).
[He+16] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. ‘‘Deep residual learning for image recognition.’’ In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016 (cit. on pp. 1, 64).
[HE16] Jonathan Ho and Stefano Ermon. ‘‘Generative adversarial imitation learning.’’ In: Advances in Neural Information Processing Systems. 2016 (cit. on p. 91).

[Hei+18] Hoda Heidari, Claudio Ferrari, Krishna Gummadi, and Andreas Krause. ‘‘Fairness behind a veil of ignorance: A welfare analysis for automated decision making.’’ In: Advances in Neural Information Processing Systems. 2018 (cit. on p. 71).
[Heu+17] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. ‘‘Gans trained by a two time-scale update rule converge to a local nash equilibrium.’’ In: Advances in Neural Information Processing Systems. 2017 (cit. on pp. 46, 64, 69).
[HHL11] Frank Hutter, Holger H Hoos, and Kevin Leyton-Brown. ‘‘Sequential model-based optimization for general algorithm configuration.’’ In: International Conference on Learning and Intelligent Optimization. 2011 (cit. on pp. 130, 143).
[HHL17] Stephen J Harris, David J Harris, and Chen Li. ‘‘Failure statistics for commercial lithium ion batteries: A study of 24 pouch cells.’’ In: Journal of Power Sources 342 (2017), pp. 589–597 (cit. on pp. 129, 141).
[Hig+16] Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. ‘‘beta-VAE: Learning basic visual concepts with a constrained variational framework.’’ In: International Conference on Learning Representations. 2016 (cit. on p. 126).
[Hin02] Geoffrey E Hinton. ‘‘Training products of experts by minimizing contrastive divergence.’’ In: Neural Computation 14.8 (2002), pp. 1771–1800 (cit. on p. 15).
[Hin12] Geoffrey E Hinton. ‘‘A practical guide to training restricted Boltzmann machines.’’ In: Neural networks: Tricks of the trade. Springer, 2012, pp. 599–619 (cit. on p. 14).
[Hin99] Geoffrey E Hinton. ‘‘Products of experts.’’ In: International Conference on Artificial Neural Networks. 1999 (cit. on p. 43).

[HKV19] Frank Hutter, Lars Kotthoff, and Joaquin Vanschoren. Automated machine learning: methods, systems, challenges. Springer Nature, 2019 (cit. on pp. 5, 142).
[HLL83] Paul W Holland, Kathryn Blackmond Laskey, and Samuel Leinhardt. ‘‘Stochastic blockmodels: First steps.’’ In: Social Networks 5.2 (1983), pp. 109–137 (cit. on p. 87).
[Hon16] Euny Hong. ‘‘23andMe has a problem when it comes to ancestry reports for people of color.’’ In: Quartz. Retrieved from https://qz.com/765879/23andme-has-a-race-problem-when-it-comes-to-ancestry-reports-for-non-whites (2016) (cit. on p. 56).
[Hos17] Yedid Hoshen. ‘‘VAIN: Attentional Multi-agent Predictive Modeling.’’ In: Advances in Neural Information Processing Systems. 2017 (cit. on p. 107).
[HPS16] Moritz Hardt, Eric Price, and Nati Srebro. ‘‘Equality of opportunity in supervised learning.’’ In: Advances in Neural Information Processing Systems. 2016 (cit. on p. 73).
[HSF14] Matthew Hoffman, Bobak Shahriari, and Nando Freitas. ‘‘On correlation and budget constraints in model-based bandit optimization with application to automatic machine learning.’’ In: International Conference on Artificial Intelligence and Statistics. 2014 (cit. on p. 131).
[HT52] Daniel G Horvitz and Donovan J Thompson. ‘‘A generalization of sampling without replacement from a finite universe.’’ In: Journal of the American Statistical Association (1952) (cit. on pp. 36, 39, 59).
[Hua+17] Cheng-Zhi Anna Huang, Tim Cooijmans, Adam Roberts, Aaron Courville, and Douglas Eck. ‘‘Counterpoint by Convolution.’’ In: International Society for Music Information Retrieval Conference (2017) (cit. on p. 55).
[Hut+14] Frank Hutter, Lin Xu, Holger H Hoos, and Kevin Leyton-Brown. ‘‘Algorithm runtime prediction: Methods & evaluation.’’ In: Artificial Intelligence 206 (2014), pp. 79–111 (cit. on p. 143).

[HV18] Paul Hand and Vladislav Voroninski. ‘‘Global Guarantees for Enforcing Deep Generative Priors by Empirical Risk.’’ In: Conference on Learning Theory. 2018 (cit. on p. 128).
[HYL17a] Will Hamilton, Zhitao Ying, and Jure Leskovec. ‘‘Inductive representation learning on large graphs.’’ In: Advances in Neural Information Processing Systems. 2017 (cit. on p. 88).
[HYL17b] William L Hamilton, Rex Ying, and Jure Leskovec. ‘‘Representation Learning on Graphs: Methods and Applications.’’ In: arXiv preprint arXiv:1709.05584 (2017) (cit. on p. 87).
[Im+18] Daniel Jiwoong Im, He Ma, Graham Taylor, and Kristin Branson. ‘‘Quantitatively evaluating GANs with divergences proposed for training.’’ In: arXiv preprint arXiv:1803.01045 (2018) (cit. on pp. 53, 72).
[IS15] Sergey Ioffe and Christian Szegedy. ‘‘Batch normalization: Accelerating deep network training by reducing internal covariate shift.’’ In: arXiv preprint arXiv:1502.03167 (2015) (cit. on p. 153).
[Jad+16] Max Jaderberg, Volodymyr Mnih, Wojciech Marian Czarnecki, Tom Schaul, Joel Z Leibo, David Silver, and Koray Kavukcuoglu. ‘‘Reinforcement learning with unsupervised auxiliary tasks.’’ In: arXiv preprint arXiv:1611.05397 (2016) (cit. on p. 147).
[JBJ18] Wengong Jin, Regina Barzilay, and Tommi Jaakkola. ‘‘Junction Tree Variational Autoencoder for Molecular Graph Generation.’’ In: International Conference on Machine Learning. 2018 (cit. on p. 88).
[JG20] Eun Seo Jo and Timnit Gebru. ‘‘Lessons from archives: strategies for collecting sociocultural data in machine learning.’’ In: Proceedings of the Conference on Fairness, Accountability, and Transparency. 2020 (cit. on p. 73).
[Kad+17] Artur Kadurin, Sergey Nikolenko, Kuzma Khrabrov, Alex Aliper, and Alex Zhavoronkov. ‘‘druGAN: an advanced generative adversarial autoencoder model for de novo generation of new molecules with desired molecular properties in silico.’’ In: Molecular Pharmaceutics 14.9 (2017), pp. 3098–3104 (cit. on p. 148).
[KB15] Diederik P Kingma and Jimmy Ba. ‘‘Adam: A method for stochastic optimization.’’ In: International Conference on Learning Representations. 2015 (cit. on p. 172).
[KC12] Faisal Kamiran and Toon Calders. ‘‘Data preprocessing techniques for classification without discrimination.’’ In: Knowledge and Information Systems 33.1 (2012), pp. 1–33 (cit. on p. 72).
[Kea+16] Steven Kearnes, Kevin McCloskey, Marc Berndl, Vijay Pande, and Patrick Riley. ‘‘Molecular graph convolutions: moving beyond fingerprints.’’ In: Journal of Computer-aided Molecular Design 30.8 (2016), pp. 595–608 (cit. on p. 88).
[Kei+16] Peter Keil, Simon F Schuster, Jörn Wilhelm, Julian Travi, Andreas Hauser, Ralph C Karl, and Andreas Jossen. ‘‘Calendar aging of lithium-ion batteries.’’ In: Journal of The Electrochemical Society 163.9 (2016), A1872 (cit. on pp. 141, 144).
[KF09] Daphne Koller and Nir Friedman. Probabilistic graphical models: principles and techniques. MIT Press, 2009 (cit. on p. 6).
[KH09] Alex Krizhevsky and Geoffrey Hinton. ‘‘Learning multiple layers of features from tiny images.’’ In: https://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf (2009) (cit. on pp. 25, 153).
[Kim+18] Been Kim, Martin Wattenberg, Justin Gilmer, Carrie Cai, James Wexler, Fernanda Viegas, et al. ‘‘Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (tcav).’’ In: International Conference on Machine Learning. 2018 (cit. on p. 73).
[Kin+09] Ross D King, Jem Rowland, Stephen G Oliver, Michael Young, Wayne Aubrey, Emma Byrne, Maria Liakata, Magdalena Markham, Pinar Pir, Larisa N Soldatova, et al. ‘‘The automation of science.’’ In: Science 324.5923 (2009), pp. 85–89 (cit. on p. 130).

[KJ16] Peter Keil and Andreas Jossen. ‘‘Charging protocols for lithium-ion batteries and their impact on cycle life—An experimental study with different 18650 high-power cells.’’ In: Journal of Energy Storage 6 (2016), pp. 125–141 (cit. on pp. 129, 131, 136, 139, 141).
[KKS13] Zohar Karnin, Tomer Koren, and Oren Somekh. ‘‘Almost optimal exploration in multi-armed bandits.’’ In: International Conference on Machine Learning. 2013 (cit. on p. 142).
[Kle+17] Aaron Klein, Stefan Falkner, Jost Tobias Springenberg, and Frank Hutter. ‘‘Learning curve prediction with Bayesian neural networks.’’ In: International Conference on Learning Representations. 2017 (cit. on pp. 130, 143).
[KM18] Pavel Korshunov and Sébastien Marcel. ‘‘Deepfakes: a new threat to face recognition? assessment and detection.’’ In: arXiv preprint arXiv:1812.08685 (2018) (cit. on p. 148).
[Koc+17] Murat Kocaoglu, Christopher Snyder, Alexandros G Dimakis, and Sriram Vishwanath. ‘‘Causalgan: Learning causal implicit generative models with adversarial training.’’ In: arXiv preprint arXiv:1709.02023 (2017) (cit. on p. 147).
[KPB15] Tammo Krueger, Danny Panknin, and Mikio L Braun. ‘‘Fast cross-validation via sequential testing.’’ In: Journal of Machine Learning Research 16.1 (2015), pp. 1103–1155 (cit. on p. 142).
[Kul+16] Kuldeep Kulkarni, Suhas Lohit, Pavan Turaga, Ronan Kerviche, and Amit Ashok. ‘‘Reconnet: Non-iterative reconstruction of images from compressively sensed measurements.’’ In: IEEE Conference on Computer Vision and Pattern Recognition. 2016 (cit. on p. 128).
[KW16] Thomas N Kipf and Max Welling. ‘‘Variational Graph Auto-Encoders.’’ In: arXiv preprint arXiv:1611.07308 (2016) (cit. on pp. 84, 85, 88).

[KW17] Thomas N Kipf and Max Welling. ‘‘Semi-supervised classification with graph convolutional networks.’’ In: International Conference on Learning Representations. 2017 (cit. on pp. 78, 86, 88, 169).
[KW18] Diederik P Kingma and Max Welling. ‘‘Auto-encoding variational bayes.’’ In: arXiv preprint arXiv:1312.6114 (2018) (cit. on pp. 7, 17, 18, 36, 76, 80, 111, 115, 126, 167, 169).
[KW19] Diederik P Kingma and Max Welling. ‘‘An introduction to variational autoencoders.’’ In: arXiv preprint arXiv:1906.02691 (2019) (cit. on pp. 3, 18).
[Lar+15] Anders Boesen Lindbo Larsen, Søren Kaae Sønderby, Hugo Larochelle, and Ole Winther. ‘‘Autoencoding beyond pixels using a learned similarity metric.’’ In: arXiv preprint arXiv:1512.09300 (2015) (cit. on p. 32).
[LBH15] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. ‘‘Deep learning.’’ In: Nature 521.7553 (2015), pp. 436–444 (cit. on pp. 7, 13).
[LC98] Jun S Liu and Rong Chen. ‘‘Sequential Monte Carlo methods for dynamic systems.’’ In: Journal of the American Statistical Association 93.443 (1998), pp. 1032–1044 (cit. on p. 44).
[LCB10] Yann LeCun, Corinna Cortes, and Christopher JC Burges. ‘‘MNIST handwritten digit database.’’ In: http://yann.lecun.com/exdb/mnist (2010) (cit. on pp. 25, 118, 153).
[LeC+06] Yann LeCun, Sumit Chopra, Raia Hadsell, M Ranzato, and F Huang. ‘‘A tutorial on energy-based learning.’’ In: Predicting Structured Data 1.0 (2006) (cit. on p. 14).
[LG03] Qing Lu and Lise Getoor. ‘‘Link-based Classification.’’ In: International Conference on Machine Learning. 2003 (cit. on p. 169).
[LHL11] Yi-Hwa Liu, Ching-Hsing Hsieh, and Yi-Feng Luo. ‘‘Search for an optimal five-step charging pattern for Li-ion batteries using consecutive orthogonal arrays.’’ In: IEEE Transactions on Energy Conversion 26.2 (2011), pp. 654–661 (cit. on p. 130).

[Li+15] Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. ‘‘A diversity-promoting objective function for neural conversation models.’’ In: arXiv preprint arXiv:1510.03055 (2015) (cit. on p. 35).
[Li+16] Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard Zemel. ‘‘Gated graph sequence neural networks.’’ In: International Conference on Learning Representations. 2016 (cit. on p. 88).
[Li+17] Lisha Li, Kevin Jamieson, Giulia DeSalvo, Afshin Rostamizadeh, and Ameet Talwalkar. ‘‘Hyperband: A novel bandit-based approach to hyperparameter optimization.’’ In: Journal of Machine Learning Research 18.1 (2017), pp. 6765–6816 (cit. on pp. 130, 142, 144).
[Li+18] Yujia Li, Oriol Vinyals, Chris Dyer, Razvan Pascanu, and Peter Battaglia. ‘‘Learning deep generative models of graphs.’’ In: arXiv preprint arXiv:1803.03324 (2018) (cit. on p. 88).
[Lin+17] Julia Ling, Maxwell Hutchinson, Erin Antono, Sean Paradiso, and Bryce Meredig. ‘‘High-dimensional materials and process optimization using data-driven experimental design with well-calibrated uncertainty estimates.’’ In: Integrating Materials and Manufacturing Innovation 6.3 (2017), pp. 207–217 (cit. on p. 130).
[Lin89] Ralph Linsker. ‘‘How to generate ordered maps by maximizing the mutual information between input and output signals.’’ In: Neural Computation 1.3 (1989), pp. 402–411 (cit. on pp. 111, 127).
[Lit94] Michael L Littman. ‘‘Markov games as a framework for multi-agent reinforcement learning.’’ In: International Conference on Machine Learning. 1994 (cit. on p. 90).
[Liu+15] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. ‘‘Deep learning face attributes in the wild.’’ In: Proceedings of the IEEE International Conference on Computer Vision. 2015 (cit. on pp. 57, 63, 118, 173).

[Liu+18a] Lydia T Liu, Sarah Dean, Esther Rolf, Max Simchowitz, and Moritz Hardt. ‘‘Delayed impact of fair machine learning.’’ In: arXiv preprint arXiv:1803.04383 (2018) (cit. on p. 73).
[Liu+18b] Risheng Liu, Shichao Cheng, Yi He, Xin Fan, Zhouchen Lin, and Zhongxuan Luo. ‘‘On the Convergence of Learning-based Iterative Methods for Nonconvex Inverse Problems.’’ In: arXiv preprint arXiv:1808.05331 (2018) (cit. on p. 128).
[LK07] David Liben-Nowell and Jon Kleinberg. ‘‘The link-prediction problem for social networks.’’ In: Journal of the American Society for Information Science and Technology 58.7 (2007), pp. 1019–1031 (cit. on p. 75).
[LLW09] Yi-Feng Luo, Yi-Hwa Liu, and Shun-Chung Wang. ‘‘Search for an optimal multistage charging pattern for lithium-ion batteries using the Taguchi approach.’’ In: TENCON Region 10 Conference. 2009 (cit. on p. 130).
[LO16] David Lopez-Paz and Maxime Oquab. ‘‘Revisiting classifier two-sample tests.’’ In: arXiv preprint arXiv:1610.06545 (2016) (cit. on pp. 53, 72, 159).
[Loe98] John C Loehlin. Latent variable models: An introduction to factor, path, and structural analysis. Lawrence Erlbaum Associates Publishers, 1998 (cit. on p. 84).
[Lou+16] Christos Louizos, Kevin Swersky, Yujia Li, Max Welling, and Richard Zemel. ‘‘The variational fair autoencoder.’’ In: International Conference on Learning Representations. 2016 (cit. on p. 71).
[Low+17] Ryan Lowe, Yi Wu, Aviv Tamar, Jean Harb, Pieter Abbeel, and Igor Mordatch. ‘‘Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments.’’ In: Advances in Neural Information Processing Systems. 2017 (cit. on pp. 104, 105, 172).

[LST15] Brenden M Lake, Ruslan Salakhutdinov, and Joshua B Tenenbaum. ‘‘Human-level concept learning through probabilistic program induction.’’ In: Science 350.6266 (2015), pp. 1332–1338 (cit. on pp. 36, 47, 118).
[Lu+18] Xiaotong Lu, Weisheng Dong, Peiyao Wang, Guangming Shi, and Xuemei Xie. ‘‘ConvCSNet: A Convolutional Compressive Sensing Framework Based on Deep Learning.’’ In: arXiv preprint arXiv:1801.10342 (2018) (cit. on p. 128).
[LY17] Jae Hyun Lim and Jong Chul Ye. ‘‘Geometric gan.’’ In: arXiv preprint arXiv:1705.02894 (2017) (cit. on p. 64).
[LYC17] Hoang M Le, Yisong Yue, and Peter Carr. ‘‘Coordinated Multi-Agent Imitation Learning.’’ In: International Conference on Machine Learning. 2017 (cit. on p. 107).
[LZC19] Yayuan Liu, Yangying Zhu, and Yi Cui. ‘‘Challenges and opportunities towards fast-charging battery materials.’’ In: Nature Energy 4.7 (2019), pp. 540–550 (cit. on pp. 130, 131, 139).
[MA18] Igor Mordatch and Pieter Abbeel. ‘‘Emergence of Grounded Compositional Language in Multi-Agent Populations.’’ In: AAAI Conference on Artificial Intelligence. 2018 (cit. on pp. 90, 104).
[Man+07] Shie Mannor, Duncan Simester, Peng Sun, and John N Tsitsiklis. ‘‘Bias and variance approximation in value function estimates.’’ In: Management Science 53.2 (2007), pp. 308–322 (cit. on pp. 35, 37).
[McC+00] Gordon McCalla, Julita Vassileva, Jim Greer, and Susan Bull. ‘‘Active learner modelling.’’ In: Intelligent Tutoring Systems. 2000 (cit. on p. 107).
[McD18] Kyle McDonald. How to recognize fake AI-generated images. 2018. url: https://medium.com/@kcimc/how-to-recognize-fake-ai-generated-images-4d1f6f9a2842 (cit. on p. 35).

[Met+16] Luke Metz, Ben Poole, David Pfau, and Jascha Sohl-Dickstein. ‘‘Unrolled generative adversarial networks.’’ In: arXiv preprint arXiv:1611.02163 (2016) (cit. on p. 33).
[MH08] Laurens van der Maaten and Geoffrey Hinton. ‘‘Visualizing data using t-SNE.’’ In: Journal of Machine Learning Research 9.Nov (2008), pp. 2579–2605 (cit. on p. 86).
[Mik+13] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. ‘‘Distributed representations of words and phrases and their compositionality.’’ In: Advances in Neural Information Processing Systems. 2013 (cit. on pp. 4, 107).
[Mit+19] Margaret Mitchell, Simone Wu, Andrew Zaldivar, Parker Barnes, Lucy Vasserman, Ben Hutchinson, Elena Spitzer, Inioluwa Deborah Raji, and Timnit Gebru. ‘‘Model cards for model reporting.’’ In: Proceedings of the Conference on Fairness, Accountability, and Transparency. 2019 (cit. on p. 71).
[Mit04] Michael Mitzenmacher. ‘‘A brief history of generative models for power law and lognormal distributions.’’ In: Internet Mathematics 1.2 (2004), pp. 226–251 (cit. on p. 87).
[Miy+18] Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. ‘‘Spectral normalization for generative adversarial networks.’’ In: arXiv preprint arXiv:1802.05957 (2018) (cit. on pp. 36, 46, 159).
[ML16] Shakir Mohamed and Balaji Lakshminarayanan. ‘‘Learning in implicit generative models.’’ In: arXiv preprint arXiv:1610.03483 (2016) (cit. on pp. 19, 22, 53, 57, 72, 111, 116).
[MNG17a] Lars Mescheder, Sebastian Nowozin, and Andreas Geiger. ‘‘Adversarial Variational Bayes: Unifying Variational Autoencoders and Generative Adversarial Networks.’’ In: arXiv preprint arXiv:1701.04722 (2017) (cit. on p. 32).

[MNG17b] Lars Mescheder, Sebastian Nowozin, and Andreas Geiger. ‘‘The Numerics of GANs.’’ In: arXiv preprint arXiv:1705.10461 (2017) (cit. on pp. 21, 33).
[Mni+15] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. ‘‘Human-level control through deep reinforcement learning.’’ In: Nature 518.7540 (2015), pp. 529–533 (cit. on p. 107).
[Moh+19] Shakir Mohamed, Mihaela Rosca, Michael Figurnov, and Andriy Mnih. ‘‘Monte carlo gradient estimation in machine learning.’’ In: arXiv preprint arXiv:1906.10652 (2019) (cit. on pp. 7, 18, 115).
[MPB15] Ali Mousavi, Ankit B Patel, and Richard G Baraniuk. ‘‘A deep learning approach to structured signal recovery.’’ In: Annual Allerton Conference on Communication, Control, and Computing. 2015 (cit. on p. 128).
[Nae+17] Christian A Naesseth, Scott W Linderman, Rajesh Ranganath, and David M Blei. ‘‘Variational sequential monte carlo.’’ In: arXiv preprint arXiv:1705.11140 (2017) (cit. on p. 54).
[Nai+18] Ashvin V Nair, Vitchyr Pong, Murtaza Dalal, Shikhar Bahl, Steven Lin, and Sergey Levine. ‘‘Visual reinforcement learning with imagined goals.’’ In: Advances in Neural Information Processing Systems. 2018 (cit. on p. 147).
[NC05] Alexandru Niculescu-Mizil and Rich Caruana. ‘‘Predicting good probabilities with supervised learning.’’ In: International Conference on Machine Learning. 2005 (cit. on p. 158).
[NCT16] Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. ‘‘f-gan: Training generative neural samplers using variational divergence minimization.’’ In: Advances in Neural Information Processing Systems. 2016 (cit. on pp. 21, 53, 72).

[Nea01] Radford M Neal. ‘‘Annealed importance sampling.’’ In: Statistics and Computing 11.2 (2001), pp. 125–139 (cit. on p. 23).
[New03] Mark EJ Newman. ‘‘The structure and function of complex networks.’’ In: SIAM Review 45.2 (2003), pp. 167–256 (cit. on p. 87).
[Nik+16] Pavel Nikolaev, Daylond Hooper, Frederick Webber, Rahul Rao, Kevin Decker, Michael Krein, Jason Poleski, Rick Barto, and Benji Maruyama. ‘‘Autonomy in materials research: a case study in carbon nanotube growth.’’ In: NPJ Computational Materials 2.1 (2016), pp. 1–6 (cit. on p. 130).
[NJ02] Andrew Y Ng and Michael I Jordan. ‘‘On discriminative vs. generative classifiers: A comparison of logistic regression and naive bayes.’’ In: Advances in Neural Information Processing Systems. 2002 (cit. on p. 2).
[Ode+18] Augustus Odena, Jacob Buckman, Catherine Olsson, Tom B Brown, Christopher Olah, Colin Raffel, and Ian Goodfellow. ‘‘Is generator conditioning causally related to gan performance?’’ In: arXiv preprint arXiv:1802.08768 (2018) (cit. on p. 34).
[Ode19] Augustus Odena. ‘‘Open Questions about Generative Adversarial Networks.’’ In: Distill 4.4 (2019) (cit. on p. 40).
[ODO16] Augustus Odena, Vincent Dumoulin, and Chris Olah. ‘‘Deconvolution and Checkerboard Artifacts.’’ In: Distill (2016) (cit. on p. 40).
[OKK16] Aaron van den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. ‘‘Pixel recurrent neural networks.’’ In: International Conference on Machine Learning. 2016 (cit. on p. 16).
[OLV18] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. ‘‘Representation learning with contrastive predictive coding.’’ In: arXiv preprint arXiv:1807.03748 (2018) (cit. on p. 107).

[Oor+16] Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. ‘‘Wavenet: A generative model for raw audio.’’ In: arXiv preprint arXiv:1609.03499 (2016) (cit. on p. 16).
[Oor+17] Aaron van den Oord, Yazhe Li, Igor Babuschkin, Karen Simonyan, Oriol Vinyals, Koray Kavukcuoglu, George van den Driessche, Edward Lockhart, Luis C Cobo, Florian Stimberg, et al. ‘‘Parallel wavenet: Fast high-fidelity speech synthesis.’’ In: arXiv preprint arXiv:1711.10433 (2017) (cit. on p. 55).
[Osa+18] Takayuki Osa, Joni Pajarinen, Gerhard Neumann, J Andrew Bagnell, Pieter Abbeel, and Jan Peters. ‘‘An algorithmic perspective on imitation learning.’’ In: arXiv preprint arXiv:1811.06711 (2018) (cit. on p. 107).
[Ost+17] Georg Ostrovski, Marc G Bellemare, Aaron van den Oord, and Rémi Munos. ‘‘Count-based exploration with neural density models.’’ In: International Conference on Machine Learning. 2017 (cit. on p. 23).
[Pan+18] Shirui Pan, Ruiqi Hu, Guodong Long, Jing Jiang, Lina Yao, and Chengqi Zhang. ‘‘Adversarially regularized graph autoencoder for graph embedding.’’ In: International Joint Conference on Artificial Intelligence. 2018 (cit. on p. 88).
[Pap+19] George Papamakarios, Eric Nalisnick, Danilo Jimenez Rezende, Shakir Mohamed, and Balaji Lakshminarayanan. ‘‘Normalizing flows for probabilistic modeling and inference.’’ In: arXiv preprint arXiv:1912.02762 (2019) (cit. on pp. 3, 20).
[Par12] Alan Park. Battery Charging Method and Battery Pack Using the Same. US Patent App. 13/309,993. 2012 (cit. on p. 144).
[Par62] Emanuel Parzen. ‘‘On estimation of a probability density function and mode.’’ In: The Annals of Mathematical Statistics 33.3 (1962), pp. 1065–1076 (cit. on p. 23).

[PAS14] Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. ‘‘Deepwalk: Online learning of social representations.’’ In: International Conference on Knowledge Discovery and Data Mining. 2014 (cit. on pp. 85, 107, 167, 169).
[Pea01] Karl Pearson. ‘‘On lines and planes of closest fit to systems of points in space.’’ In: The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science 2.11 (1901), pp. 559–572 (cit. on pp. 7, 116).
[Ped+11] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. ‘‘Scikit-learn: Machine learning in Python.’’ In: Journal of Machine Learning Research 12.Oct (2011), pp. 2825–2830 (cit. on p. 167).
[Pet00] Johann Petrak. ‘‘Fast subsampling performance estimates for classification algorithm selection.’’ In: Proceedings of the ECML-00 Workshop on Meta-Learning: Building Automatic Advice Strategies for Model Selection and Method Combination. 2000 (cit. on pp. 130, 142).
[Pin+18] Flavio du Pin Calmon, Dennis Wei, Bhanukiran Vinzamuri, Karthikeyan Natesan Ramamurthy, and Kush R Varshney. ‘‘Data Pre-Processing for Discrimination Prevention: Information-Theoretic Optimization and Analysis.’’ In: IEEE Journal of Selected Topics in Signal Processing 12.5 (2018), pp. 1106–1119 (cit. on p. 71).
[Pod+14] J Podesta, P Pritzker, EJ Moniz, J Holdren, and J Zients. ‘‘Big data: seizing opportunities, preserving values.’’ In: Executive Office of the President, The White House (2014) (cit. on p. 55).
[Pom91] Dean A Pomerleau. ‘‘Efficient training of artificial neural networks for autonomous navigation.’’ In: Neural Computation 3.1 (1991), pp. 88–97 (cit. on p. 91).

[PSS00] Doina Precup, Richard S. Sutton, and Satinder P. Singh. ‘‘Eligibility Traces for Off-Policy Policy Evaluation.’’ In: International Conference on Machine Learning. 2000 (cit. on p. 37).
[Rad+13] Predrag Radivojac, Wyatt T Clark, Tal Ronnen Oron, Alexandra M Schnoes, Tobias Wittkop, Artem Sokolov, Kiley Graim, Christopher Funk, Karin Verspoor, Asa Ben-Hur, et al. ‘‘A large-scale evaluation of computational protein function prediction.’’ In: Nature Methods 10.3 (2013), pp. 221–227 (cit. on p. 75).
[Rad+19] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. ‘‘Language models are unsupervised multitask learners.’’ In: OpenAI Blog 1.8 (2019), p. 9 (cit. on p. 55).
[RAM17] Hee Jung Ryu, Hartwig Adam, and Margaret Mitchell. ‘‘Inclusivefacenet: Improving face attribute detection with race and gender diversity.’’ In: arXiv preprint arXiv:1712.00193 (2017) (cit. on p. 71).
[Rat+17] Alexander J Ratner, Henry Ehrenberg, Zeshan Hussain, Jared Dunnmon, and Christopher Ré. ‘‘Learning to compose domain-specific transformations for data augmentation.’’ In: Advances in Neural Information Processing Systems. 2017 (cit. on p. 37).
[RB10] Stéphane Ross and Drew Bagnell. ‘‘Efficient reductions for imitation learning.’’ In: International Conference on Artificial Intelligence and Statistics. 2010 (cit. on p. 50).
[RD15] Daniel A Reed and Jack Dongarra. ‘‘Exascale computing and big data.’’ In: Communications of the ACM 58.7 (2015), pp. 56–68 (cit. on p. 1).
[Ric+17] JH Rick Chang, Chun-Liang Li, Barnabas Poczos, BVK Vijaya Kumar, and Aswin C Sankaranarayanan. ‘‘One Network to Solve Them All–Solving Linear Inverse Problems Using Deep Projection Models.’’ In: Proceedings of the IEEE International Conference on Computer Vision. 2017 (cit. on p. 128).

[RMC15] Alec Radford, Luke Metz, and Soumith Chintala. ‘‘Unsupervised representation learning with deep convolutional generative adversarial networks.’’ In: arXiv preprint arXiv:1511.06434 (2015) (cit. on pp. 16, 26, 121, 153).
[RMW14] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. ‘‘Stochastic backpropagation and approximate inference in deep generative models.’’ In: arXiv preprint arXiv:1401.4082 (2014) (cit. on pp. 17, 115, 126).
[Ros+17] Mihaela Rosca, Balaji Lakshminarayanan, David Warde-Farley, and Shakir Mohamed. ‘‘Variational approaches for auto-encoding generative adversarial networks.’’ In: arXiv preprint arXiv:1706.04987 (2017) (cit. on pp. 53, 72).
[Ros56] Murray Rosenblatt. ‘‘Remarks on some nonparametric estimates of a density function.’’ In: The Annals of Mathematical Statistics (1956), pp. 832–837 (cit. on p. 35).
[RR06] Gareth O Roberts and Jeffrey S Rosenthal. ‘‘Harris recurrence of Metropolis-within-Gibbs and trans-dimensional Markov chains.’’ In: The Annals of Applied Probability (2006), pp. 2123–2139 (cit. on p. 151).
[RS00] Sam T Roweis and Lawrence K Saul. ‘‘Nonlinear dimensionality reduction by locally linear embedding.’’ In: Science 290.5500 (2000), pp. 2323–2326 (cit. on p. 75).
[RW11] Jonathan Rubin and Ian Watson. ‘‘Computer poker: A review.’’ In: Artificial Intelligence 175.5-6 (2011), pp. 958–987 (cit. on p. 107).
[RZL17] Prajit Ramachandran, Barret Zoph, and Quoc V Le. ‘‘Searching for activation functions.’’ In: arXiv preprint arXiv:1710.05941 (2017) (cit. on p. 161).
[SA18] Benjamin Sanchez-Lengeling and Alán Aspuru-Guzik. ‘‘Inverse molecular design using machine learning: Generative models for matter engineering.’’ In: Science 361.6400 (2018), pp. 360–365 (cit. on p. 148).

[Sal+16] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. ‘‘Improved techniques for training gans.’’ In: Advances in Neural Information Processing Systems. 2016 (cit. on pp. 22, 28, 46, 159).
[Sal+17] Tim Salimans, Andrej Karpathy, Xi Chen, and Diederik P Kingma. ‘‘Pixelcnn++: Improving the pixelcnn with discretized logistic mixture likelihood and other modifications.’’ In: arXiv preprint arXiv:1701.05517 (2017) (cit. on pp. 16, 36, 46).
[Sam+18] Bidisha Samanta, Abir De, Niloy Ganguly, and Manuel Gomez-Rodriguez. ‘‘Designing random graph models using variational autoencoders with applications to chemical design.’’ In: arXiv preprint arXiv:1802.05283 (2018) (cit. on p. 88).
[Sat+19] Prasanna Sattigeri, Samuel C Hoffman, Vijil Chenthamarakshan, and Kush R Varshney. ‘‘Fairness GAN: Generating datasets with fairness properties using a generative adversarial network.’’ In: Proc. International Conference on Learning Representations Workshop on Safe Machine Learning. Vol. 2. 2019 (cit. on p. 71).
[Sch+15a] John Schulman, Nicolas Heess, Theophane Weber, and Pieter Abbeel. ‘‘Gradient estimation using stochastic computation graphs.’’ In: Advances in Neural Information Processing Systems. 2015 (cit. on p. 115).
[Sch+15b] Simon F Schuster, Martin J Brand, Philipp Berg, Markus Gleissenberger, and Andreas Jossen. ‘‘Lithium-ion cell-to-cell variation during battery electric vehicle operation.’’ In: Journal of Power Sources 297 (2015), pp. 242–251 (cit. on p. 129).
[Sch+17a] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. ‘‘Proximal policy optimization algorithms.’’ In: arXiv preprint arXiv:1707.06347 (2017) (cit. on p. 50).

[Sch+17b] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. ‘‘Proximal policy optimization algorithms.’’ In: arXiv preprint arXiv:1707.06347 (2017) (cit. on p. 99).
[Sch+18] Stefan Schindler, Marius Bauer, Hansley Cheetamun, and Michael A Danzer. ‘‘Fast charging of lithium-ion cells: identification of aging-minimal current profiles using a design of experiment approach and a mechanistic degradation analysis.’’ In: Journal of Energy Storage 19 (2018), pp. 364–378 (cit. on pp. 130, 131, 139, 144).
[Sch18] Gisbert Schneider. ‘‘Automating drug discovery.’’ In: Nature Reviews Drug Discovery 17.2 (2018), p. 97 (cit. on pp. 130, 144).
[Sef+20] Ari Seff, Yaniv Ovadia, Wenda Zhou, and Ryan P Adams. ‘‘SketchGraphs: A Large-Scale Dataset for Modeling Relational Geometry in Computer-Aided Design.’’ In: arXiv preprint arXiv:2007.08506 (2020) (cit. on p. 148).
[Seg+18] Marwin HS Segler, Thierry Kogej, Christian Tyrchan, and Mark P Waller. ‘‘Generating focused molecule libraries for drug discovery with recurrent neural networks.’’ In: ACS Central Science 4.1 (2018), pp. 120–131 (cit. on p. 148).
[Sen+08] P. Sen, G. M. Namata, M. Bilgic, L. Getoor, B. Gallagher, and T. Eliassi-Rad. ‘‘Collective classification in network data.’’ In: AI Magazine 29.3 (2008), pp. 93–106 (cit. on p. 84).
[Sev+19] Kristen A Severson, Peter M Attia, Norman Jin, Nicholas Perkins, Benben Jiang, Zi Yang, Michael H Chen, Muratahan Aykol, Patrick K Herring, Dimitrios Fraggedakis, et al. ‘‘Data-driven prediction of battery cycle life before capacity degradation.’’ In: Nature Energy 4.5 (2019), pp. 383–391 (cit. on pp. 129, 131, 134, 136, 141).
[Sha+15] Bobak Shahriari, Kevin Swersky, Ziyu Wang, Ryan P Adams, and Nando De Freitas. ‘‘Taking the human out of the loop: A review of Bayesian optimization.’’ In: Proceedings of the IEEE 104.1 (2015), pp. 148–175 (cit. on pp. 5, 132).
[She+11] Nino Shervashidze, Pascal Schweitzer, Erik Jan van Leeuwen, Kurt Mehlhorn, and Karsten M Borgwardt. ‘‘Weisfeiler-Lehman graph kernels.’’ In: Journal of Machine Learning Research 12.Sep (2011), pp. 2539–2561 (cit. on p. 77).
[She+19] Emily Sheng, Kai-Wei Chang, Premkumar Natarajan, and Nanyun Peng. ‘‘The woman worked as a babysitter: On biases in language generation.’’ In: arXiv preprint arXiv:1909.01326 (2019) (cit. on pp. 55, 73).
[Shu+18] Rui Shu, Hung H Bui, Shengjia Zhao, Mykel J Kochenderfer, and Stefano Ermon. ‘‘Amortized Inference Regularization.’’ In: Advances in Neural Information Processing Systems. 2018 (cit. on pp. 111, 113, 127).
[Sil+17] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. ‘‘Mastering the game of go without human knowledge.’’ In: Nature 550.7676 (2017), pp. 354–359 (cit. on p. 1).
[SK16] Tim Salimans and Durk P Kingma. ‘‘Weight normalization: A simple reparameterization to accelerate training of deep neural networks.’’ In: Advances in Neural Information Processing Systems. 2016 (cit. on p. 154).
[SK18] Martin Simonovsky and Nikos Komodakis. ‘‘GraphVAE: Towards Generation of Small Graphs Using Variational Autoencoders.’’ In: arXiv preprint arXiv:1802.03480 (2018) (cit. on p. 88).
[SKW15] Tim Salimans, Diederik Kingma, and Max Welling. ‘‘Markov chain monte carlo and variational inference: Bridging the gap.’’ In: International Conference on Machine Learning. 2015 (cit. on p. 54).
[SMH11] Ilya Sutskever, James Martens, and Geoffrey E Hinton. ‘‘Generating text with recurrent neural networks.’’ In: International Conference on Machine Learning. 2011 (cit. on p. 16).

[Smi+18] Justin S Smith, Ben Nebgen, Nicholas Lubbers, Olexandr Isayev, and Adrian E Roitberg. ‘‘Less is more: Sampling chemical space with active learning.’’ In: Journal of Chemical Physics 148.24 (2018), p. 241733 (cit. on p. 144).
[Son+19] Jiaming Song, Pratyusha Kalluri, Aditya Grover, Shengjia Zhao, and Stefano Ermon. ‘‘Learning Controllable Fair Representations.’’ In: International Conference on Artificial Intelligence and Statistics. 2019 (cit. on pp. 71, 148).
[Spa+15] Evan R Sparks, Ameet Talwalkar, Daniel Haas, Michael J Franklin, Michael I Jordan, and Tim Kraska. ‘‘Automating model search for large scale machine learning.’’ In: Proceedings of the ACM Symposium on Cloud Computing. 2015 (cit. on p. 142).
[SSK12] Masashi Sugiyama, Taiji Suzuki, and Takafumi Kanamori. Density ratio estimation in machine learning. Cambridge University Press, 2012 (cit. on pp. 36, 40, 53, 57, 60, 62, 72).
[SSS18] Melanie Schmidt, Chris Schwiegelshohn, and Christian Sohler. ‘‘Fair coresets and streaming algorithms for fair k-means clustering.’’ In: arXiv preprint arXiv:1812.10854 (2018) (cit. on p. 72).
[SST15] Ashish Sabharwal, Horst Samulowitz, and Gerald Tesauro. ‘‘Selecting near-optimal learners via incremental data allocation.’’ In: arXiv preprint arXiv:1601.00024 (2015) (cit. on p. 142).
[SV00] Peter Stone and Manuela Veloso. ‘‘Multiagent systems: A survey from a machine learning perspective.’’ In: Autonomous Robots 8.3 (2000), pp. 345–383 (cit. on p. 107).
[Sze+16] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. ‘‘Rethinking the inception architecture for computer vision.’’ In: IEEE Conference on Computer Vision and Pattern Recognition. 2016 (cit. on pp. 159, 160).

[Tab+18] Daniel P Tabor, Loïc M Roch, Semion K Saikin, Christoph Kreisbeck, Dennis Sheberla, Joseph H Montoya, Shyam Dwaraknath, Muratahan Aykol, Carlos Ortiz, Hermann Tribukait, et al. ‘‘Accelerating the discovery of materials for clean energy in the era of smart automation.’’ In: Nature Reviews Materials 3.5 (2018), pp. 5–20 (cit. on pp. 129, 130, 144).
[Tao+18] Chenyang Tao, Liqun Chen, Ricardo Henao, Jianfeng Feng, and Lawrence Carin. ‘‘Chi-square generative adversarial network.’’ In: International Conference on Machine Learning. 2018 (cit. on pp. 53, 72).
[TB16] Philip Thomas and Emma Brunskill. ‘‘Data-efficient off-policy policy evaluation for reinforcement learning.’’ In: International Conference on Machine Learning. 2016 (cit. on pp. 37, 51).
[TDL00] Joshua B Tenenbaum, Vin De Silva, and John C Langford. ‘‘A global geometric framework for nonlinear dimensionality reduction.’’ In: Science 290.5500 (2000), pp. 2319–2323 (cit. on p. 75).
[TE+11] Antonio Torralba, Alexei A Efros, et al. ‘‘Unbiased look at dataset bias.’’ In: IEEE Conference on Computer Vision and Pattern Recognition. Vol. 1. 2011 (cit. on p. 55).
[TET12] Emanuel Todorov, Tom Erez, and Yuval Tassa. ‘‘MuJoCo: A physics engine for model-based control.’’ In: International Conference on Intelligent Robots and Systems. 2012 (cit. on pp. 37, 50, 99).
[Tho15] Philip S Thomas. ‘‘Safe reinforcement learning.’’ PhD thesis. University of Massachusetts Libraries, 2015 (cit. on p. 35).
[Tib96] Robert Tibshirani. ‘‘Regression shrinkage and selection via the lasso.’’ In: Journal of the Royal Statistical Society. Series B (Methodological) (1996), pp. 267–288 (cit. on p. 110).
[TL11] Lei Tang and Huan Liu. ‘‘Leveraging social media networks for classification.’’ In: Data Mining and Knowledge Discovery 23.3 (2011), pp. 447–478 (cit. on p. 85).

[TOB16] Lucas Theis, Aäron van den Oord, and Matthias Bethge. ‘‘A note on the evaluation of generative models.’’ In: International Conference on Learning Representations. 2016 (cit. on p. 22).
[Tom+17] Tatiana Tommasi, Novi Patricia, Barbara Caputo, and Tinne Tuytelaars. ‘‘A deeper look at dataset bias.’’ In: Domain Adaptation in Computer Vision Applications. Springer, 2017, pp. 37–55 (cit. on p. 55).
[TRB17] Dustin Tran, Rajesh Ranganath, and David Blei. ‘‘Hierarchical implicit models and likelihood-free variational inference.’’ In: Advances in Neural Information Processing Systems. 2017 (cit. on p. 64).
[Tur+18] Ryan Turner, Jane Hung, Yunus Saatci, and Jason Yosinski. ‘‘Metropolis-Hastings Generative Adversarial Networks.’’ In: arXiv preprint arXiv:1811.11357 (2018) (cit. on pp. 53, 72).
[UML13] Benigno Uria, Iain Murray, and Hugo Larochelle. ‘‘RNADE: The real-valued neural autoregressive density-estimator.’’ In: Advances in Neural Information Processing Systems. 2013 (cit. on p. 153).
[Uri+16] Benigno Uria, Marc-Alexandre Côté, Karol Gregor, Iain Murray, and Hugo Larochelle. ‘‘Neural autoregressive distribution estimation.’’ In: Journal of Machine Learning Research 17.1 (2016), pp. 7184–7220 (cit. on pp. 6, 16).
[Van+16] Aaron Van den Oord, Nal Kalchbrenner, Lasse Espeholt, Oriol Vinyals, Alex Graves, et al. ‘‘Conditional image generation with PixelCNN decoders.’’ In: Advances in Neural Information Processing Systems. 2016 (cit. on p. 16).
[Van+18] David Van Veen, Ajil Jalal, Eric Price, Sriram Vishwanath, and Alexandros G Dimakis. ‘‘Compressed Sensing with Deep Image Prior and Learned Regularization.’’ In: arXiv preprint arXiv:1806.06438 (2018) (cit. on p. 128).

[Van04] Antal Van den Bosch. ‘‘Wrapped progressive sampling search for optimizing learning algorithm parameters.’’ In: Proceedings of the Belgian-Dutch Conference on Artificial Intelligence. 2004 (cit. on p. 142).
[Vap99] Vladimir N Vapnik. ‘‘An overview of statistical learning theory.’’ In: IEEE Transactions on Neural Networks 10.5 (1999), pp. 988–999 (cit. on p. 12).
[Vas+17] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. ‘‘Attention is all you need.’’ In: Advances in Neural Information Processing Systems. 2017 (cit. on pp. 16, 55).
[Vaz+03] Alexei Vazquez, Alessandro Flammini, Amos Maritan, and Alessandro Vespignani. ‘‘Global protein function prediction from protein-protein interaction networks.’’ In: Nature Biotechnology 21.6 (2003), pp. 697–700 (cit. on p. 75).
[Vin+08] Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. ‘‘Extracting and composing robust features with denoising autoencoders.’’ In: Proceedings of the International Conference on Machine Learning. 2008 (cit. on pp. 7, 125).
[Vin+16] Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Daan Wierstra, et al. ‘‘Matching networks for one shot learning.’’ In: Advances in Neural Information Processing Systems. 2016 (cit. on p. 160).
[Wan+14] Liming Wang, Abolfazl Razi, Miguel Rodrigues, Robert Calderbank, and Lawrence Carin. ‘‘Nonlinear information-theoretic compressive measurement design.’’ In: International Conference on Machine Learning. 2014 (cit. on p. 127).
[Wan+17a] Quan Wang, Zhendong Mao, Bin Wang, and Li Guo. ‘‘Knowledge graph embedding: A survey of approaches and applications.’’ In: IEEE Transactions on Knowledge and Data Engineering 29.12 (2017), pp. 2724–2743 (cit. on p. 4).

[Wan+17b] Ziyu Wang, Josh Merel, Scott Reed, Greg Wayne, Nando de Freitas, and Nicolas Heess. ‘‘Robust Imitation of Diverse Behaviors.’’ In: Advances in Neural Information Processing Systems. 2017 (cit. on p. 107).
[Wan+18] Hongwei Wang, Jia Wang, Jialin Wang, Miao Zhao, Weinan Zhang, Fuzheng Zhang, Xing Xie, and Minyi Guo. ‘‘GraphGAN: Graph Representation Learning with Generative Adversarial Nets.’’ In: International World Wide Web Conference. 2018 (cit. on p. 88).
[WCF07] Yair Weiss, Hyun Sung Chang, and William T Freeman. ‘‘Learning compressed sensing.’’ In: Snowbird Learning Workshop. 2007 (cit. on pp. 116, 127).
[WF+94] Stanley Wasserman, Katherine Faust, et al. Social network analysis: Methods and applications. Vol. 8. Cambridge University Press, 1994 (cit. on p. 4).
[Whi82] Halbert White. ‘‘Maximum likelihood estimation of misspecified models.’’ In: Econometrica: Journal of the Econometric Society (1982), pp. 1–25 (cit. on p. 25).
[Wil92] Ronald J Williams. ‘‘Simple statistical gradient-following algorithms for connectionist reinforcement learning.’’ In: Machine Learning 8.3-4 (1992), pp. 229–256 (cit. on p. 115).
[WL68] Boris Weisfeiler and AA Lehman. ‘‘A reduction of a graph to a canonical form and an algebra arising during this reduction.’’ In: Nauchno-Technicheskaya Informatsia 2.9 (1968), pp. 12–16 (cit. on p. 77).
[WLD15] David L Wood III, Jianlin Li, and Claus Daniel. ‘‘Prospects for reducing the processing cost of lithium ion batteries.’’ In: Journal of Power Sources 275 (2015), pp. 234–242 (cit. on p. 144).
[Woo09] Michael Wooldridge. An introduction to multiagent systems. John Wiley & Sons, 2009 (cit. on pp. 4, 89).

[WRC08] Jason Weston, Frédéric Ratle, and Ronan Collobert. ‘‘Deep Learning via Semi-supervised Embedding.’’ In: International Conference on Machine Learning. 2008 (cit. on p. 169).
[WS98] Duncan J Watts and Steven H Strogatz. ‘‘Collective dynamics of ‘small-world’ networks.’’ In: Nature 393.6684 (1998), p. 440 (cit. on p. 87).
[WSW19] Zhengwei Wang, Qi She, and Tomas E Ward. ‘‘Generative adversarial networks in computer vision: A survey and taxonomy.’’ In: arXiv preprint arXiv:1906.01529 (2019) (cit. on p. 21).
[Wu+17] Yuhuai Wu, Yuri Burda, Ruslan Salakhutdinov, and Roger Grosse. ‘‘On the Quantitative Analysis of Decoder-Based Generative Models.’’ In: International Conference on Learning Representations. 2017 (cit. on p. 23).
[Xu+18] Depeng Xu, Shuhan Yuan, Lu Zhang, and Xintao Wu. ‘‘Fairgan: Fairness-aware generative adversarial networks.’’ In: IEEE International Conference on Big Data. 2018 (cit. on p. 71).
[Yan+06] Shuicheng Yan, Dong Xu, Benyu Zhang, Hong-Jiang Zhang, Qiang Yang, and Stephen Lin. ‘‘Graph embedding and extensions: A general framework for dimensionality reduction.’’ In: Transactions on Pattern Analysis and Machine Intelligence 29.1 (2006), pp. 40–51 (cit. on p. 75).
[Yan+11] Shuang-Hong Yang, Bo Long, Alex Smola, Narayanan Sadagopan, Zhaohui Zheng, and Hongyuan Zha. ‘‘Like like alike: joint friendship and interest propagation in social networks.’’ In: International Conference on World Wide Web. 2011 (cit. on p. 75).
[YCS16] Zhilin Yang, William W Cohen, and Ruslan Salakhutdinov. ‘‘Revisiting semi-supervised learning with graph embeddings.’’ In: arXiv preprint arXiv:1603.08861 (2016) (cit. on p. 169).
[You+18] Jiaxuan You, Rex Ying, Xiang Ren, William L Hamilton, and Jure Leskovec. ‘‘GraphRNN: A Deep Generative Model for Graphs.’’ In: International Conference on Machine Learning. 2018 (cit. on p. 88).

[YS11] Guoshen Yu and Guillermo Sapiro. ‘‘Statistical compressed sensing of Gaussian mixture models.’’ In: IEEE Transactions on Signal Processing 59.12 (2011), pp. 5842–5858 (cit. on p. 112).
[Zem+13] Rich Zemel, Yu Wu, Kevin Swersky, Toni Pitassi, and Cynthia Dwork. ‘‘Learning fair representations.’’ In: International Conference on Machine Learning. 2013 (cit. on p. 71).
[Zha+18] Shengjia Zhao, Hongyu Ren, Arianna Yuan, Jiaming Song, Noah Goodman, and Stefano Ermon. ‘‘Bias and Generalization in Deep Generative Models: An Empirical Study.’’ In: Advances in Neural Information Processing Systems. 2018 (cit. on p. 38).
[Zho18] Zhi-Hua Zhou. ‘‘A brief introduction to weakly supervised learning.’’ In: National Science Review 5.1 (2018), pp. 44–53 (cit. on p. 1).
[Zhu+17] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. ‘‘Unpaired image-to-image translation using cycle-consistent adversarial networks.’’ In: Proceedings of the IEEE International Conference on Computer Vision. 2017 (cit. on p. 34).
[ZQ01] Albert H Zimmerman and Michael V Quinzio. Adaptive charging method for lithium-ion battery cells. US Patent 6,204,634. 2001 (cit. on p. 144).
[ZSE18] Shengjia Zhao, Jiaming Song, and Stefano Ermon. ‘‘The Information Autoencoding Family: A Lagrangian Perspective on Latent Variable Generative Models.’’ In: Conference on Uncertainty in Artificial Intelligence. 2018 (cit. on p. 126).
[ZSE19] Shengjia Zhao, Jiaming Song, and Stefano Ermon. ‘‘Infovae: Information maximizing variational autoencoders.’’ In: AAAI Conference on Artificial Intelligence. 2019 (cit. on p. 126).