Copyright by Erik Michael Lindgren 2019

The Dissertation Committee for Erik Michael Lindgren certifies that this is the approved version of the following dissertation:

Combinatorial Optimization for Graphical Structures in Machine Learning

Committee:

Georgios-Alex Dimakis, Supervisor

Constantine Caramanis

Sujay Sanghavi

Adam Klivans

Qiang Liu

Combinatorial Optimization for Graphical Structures in Machine Learning

by

Erik Michael Lindgren

DISSERTATION

Presented to the Faculty of the Graduate School of The University of Texas at Austin in Partial Fulfillment of the Requirements for the Degree of

DOCTOR OF PHILOSOPHY

THE UNIVERSITY OF TEXAS AT AUSTIN

December 2019

To Leif and Ellen Lindgren

Acknowledgments

First, I have to thank my parents, Ellen and Leif, and my sister Lauren. I could not have done this without all your love and care, and without growing up with you as role models.

I am incredibly grateful to have Alex as an advisor. He spent a tremendous amount of time sharing his wisdom, not just on research but also on starting a career and on life in general.

I also have to thank the rest of my committee: Adam, Constantine, Qiang, and Sujay. I am incredibly grateful for all the knowledge and advice they shared with me.

I would like to thank my student collaborators Shanshan, Murat, Vatsal, Yanyao, Nishanth, and Jay. I would also like to thank my labmates Karthik, Megas, Ajil, Rashish, Dave, Ethan, Qi, Sriram, Matt, and Eirini. I would also like to thank my fellow WNCG and UT students, especially Ioannis, Ankit, Srinadh, Avro, Rajat, Soumya, Abishek, Mandar, Marius, Diego, Jess, Ashish, Derya, Avik, Isfar, Preeti, Ahmad, Surbhi, Shalmali, Sanmit, Chris, Justin, Rajiv, Yitao, Natasha, and Mónica. I learned a tremendous amount from you all and will treasure all the fun times we had. Thank you for all your support, both in and out of the lab.

I am also extremely grateful for the help I received from Apipol, Karen, Jaymie, Melanie, Melody, and Barry. They were always ready to help, especially when I needed it the most.

I would like to thank Bobak Nazer, Doug Densmore, Ernst Oberortner, and Swapnil Bhatia for helping me to start my research career at Boston University. Their advice and guidance set me up for success at UT.

I would also like to thank my intern hosts at Google, Jack and Ofer. They taught me so much about machine learning and working in industry.

Finally, I have to thank Sara. Thank you for your love and support and all the joy you brought me throughout this process.

Combinatorial Optimization for Graphical Structures in Machine Learning

Publication No.

Erik Michael Lindgren, Ph.D.

The University of Texas at Austin, 2019

Supervisor: Georgios-Alex Dimakis

Graphs are an essential topic in machine learning. In this dissertation, we explore problems in graphical models (where a probability distribution has conditional independencies specified by a graph), causality (where a directed graph specifies causal directions), and clustering (where a weighted graph is used to denote related items).

For our first contribution, we consider the facility location problem. In this problem our goal is to select a set of k “facilities” such that the average benefit from a “town” to its most beneficial facility is maximized. As input, we receive a bipartite graph where every edge has a weight denoting the benefit the town would receive from the facility if selected. The input graph is often dense, with O(n²) edges. We analyze sparsifying the graph with nearest neighbor methods. We give a tight characterization for how sparse each method can make the graph while approximately maintaining the value of the optimal

solution. We then demonstrate these approaches experimentally and see that they lead to large speedups and high quality solutions.

Next, we consider the MAP inference problem in discrete Markov Random Fields. MAP inference is the problem of finding the most likely configuration of a Markov Random Field. One common approach to finding the MAP solution is with integer programming. We analyze the complexity of integer programming under the assumption that the relaxation of the integer program has a polynomial number of fractional vertices. We show that under this assumption the optimal MAP assignment can be found in polynomial time. Our result generalizes to arbitrary integer programs whose solution set is contained in the unit hypercube: we show that any integer program in the unit hypercube whose linear programming relaxation has a polynomial number of fractional vertices can be solved in polynomial time.

We then consider the minimum cost intervention design problem: given the essential graph of a causal graph and a cost to intervene on each variable, identify the set of interventions with minimum total cost that can learn any causal graph with the given essential graph. We first show that this problem is NP-hard. We then prove that a greedy algorithm achieves a constant factor approximation to this problem. Our proof of this guarantee uses tools from submodular optimization and knapsack quantization.

Next we consider the problem of learning Ising models when an adversary can corrupt the samples we receive. For this problem we give nearly tight lower and upper bounds on the sample complexity.

Finally, we consider the problem of conditional sampling from invertible generative models. We first establish hardness results for generating conditional samples. We then develop a scheme using variational inference that allows us to approximately solve the problem. Our approach is able to utilize the given invertible model to improve sample quality.

Table of Contents

Acknowledgments
Abstract
List of Tables
List of Figures

Chapter 1. Introduction

Chapter 2. Leveraging Sparsity for Efficient Submodular Data Summarization
  2.1 Introduction
  2.2 Related Work
    2.2.1 Benefit Functions and Nearest Neighbor Methods
  2.3 Guarantees for t-Nearest Neighbor Sparsification
  2.4 Guarantees for Threshold-Based Sparsification
  2.5 Experiments
    2.5.1 Summarizing Movies and Music from Ratings Data
    2.5.2 Finding Influential Actors and Actresses
  2.6 Appendix: Additional Figures
  2.7 Appendix: Full Proofs
    2.7.1 Proof of Theorem 1
    2.7.2 Proof of Proposition 3
    2.7.3 Proof of Proposition 4
    2.7.4 Proof of Lemma 8
    2.7.5 Proof of Lemma 9
    2.7.6 Proof of Theorem 6
    2.7.7 Proof of Lemma 7

Chapter 3. Exact MAP Inference by Avoiding Fractional Vertices
  3.1 Introduction
  3.2 Background and Related Work
  3.3 Provable Integer Programming
    3.3.1 Proof of Theorem 12
    3.3.2 The M-Best LP Problem
    3.3.3 K-Best Integral Solutions
  3.4 Fractional Vertices of the Local Polytope
    3.4.1 Proof of Theorem 17
  3.5 Estimating the Number of Confounding Singleton Marginals
  3.6 Experiments
  3.7 Conclusion

Chapter 4. Experimental Design for Cost-Aware Learning of Causal Graphs
  4.1 Introduction
  4.2 Minimum Cost Intervention Design
    4.2.1 Relevant Graph Theory Concepts
    4.2.2 Causal Graphs and Interventional Learning
    4.2.3 Graph Separating Systems and Minimum Cost Intervention Design
  4.3 Related Work
  4.4 Hardness of Minimum Cost Intervention Design
  4.5 Approximation Guarantees for Minimum Cost Intervention Design
  4.6 Algorithms for k-Sparse Intervention Design Problems
  4.7 Experiments
  4.8 Example Graph Where Quantization Helps Greedy
  4.9 Proof of Approximation Guarantees of the Quantized Greedy Algorithm
    4.9.1 Submodularity Background
    4.9.2 Bound on the Quantized Greedy Algorithm Solution Size
    4.9.3 Submodular and Supermodular Chain Problem
    4.9.4 Proof of Quantized Greedy Algorithm Approximation Guarantees
    4.9.5 Proof of Technical Lemmas
  4.10 Proof of Results on k-Sparse Intervention Design Problems
  4.11 Proof of NP-Hardness

Chapter 5. On Robust Learning of Ising Models
  5.1 Introduction
    5.1.1 Related Work
  5.2 Problem Setup
  5.3 Inachievability Results
  5.4 Achievable Results
    5.4.1 Robustness of the Hedge Algorithm
  5.5 Proof of Theorem 53
  5.6 Proof of Achievability

Chapter 6. Uncertainty-Aware Compressive Sensing with Flow Composition
  6.1 Introduction
  6.2 Background and Related Work
    6.2.1 Invertible Generative Models
    6.2.2 Variational Inference for Conditional Sampling
    6.2.3 Compressive Sensing with Generative Priors
    6.2.4 Additional Related Work
  6.3 Hardness of Conditional Sampling
  6.4 Conditional Sampling with Composed Flow Models
    6.4.1 Generalizing to Measurement Matrices
  6.5 Experiments
  6.6 Proof of Hardness Results
    6.6.1 Design of the Additive Coupling Network
    6.6.2 Generating SAT Solutions from the Conditional Distribution
    6.6.3 Hardness of Approximate Sampling
  6.7 Proof of Proposition 61

Bibliography

Vita

List of Tables

2.1 A subset of the summarization output by our algorithm on the MovieLens dataset, plus the elements that are represented by each representative with the largest dot product. Each group has a natural interpretation: 90's slapstick comedies, 80's horror, cult classics, etc. Note that this was obtained using only a similarity matrix derived from ratings.

2.2 The top twenty-five actors and actresses generated by sparsified facility location optimization defined by the personalized PageRank of a 57,000-vertex movie personnel collaboration graph from [IMDb, 2016], and the twenty-five actors and actresses with the largest (non-personalized) PageRank. We see that the classical PageRank approach fails to capture the diversity of nationality in the dataset, while the facility location results include actors and actresses from many of the world's film industries.

List of Figures

2.1 Results for the MovieLens dataset [GroupLens, 2015]. Figure (a) shows the function value as the runtime increases, normalized by the value the greedy algorithm obtained. As can be seen, our algorithm is within 99.9% of greedy in less than 5 seconds. For this experiment, the greedy algorithm had a runtime of 512 seconds, so this is a 100x speedup for a small penalty in performance. We also compare to the stochastic greedy algorithm [Mirzasoleiman et al., 2015], which needs 125 seconds to reach equivalent performance, which is 25x slower. Figure (b) shows the fraction of the set returned by each method that was common with the set returned by greedy. We see that the approximate nearest neighbor method has 90% of its elements in common with the greedy set while being 50x faster than greedy, and using exact nearest neighbors can perfectly match the greedy set while being 4x faster than greedy.

2.2 (a) MovieLens dataset [GroupLens, 2015] and (b) IMDb dataset [IMDb, 2016], as explained in the Experiments section. We see that for sparsity t significantly smaller than the n/(αk) lower bound we can still find a small covering set in the t-nearest neighbor graph.

2.3 (a) MovieLens dataset [GroupLens, 2015] and (b) IMDb dataset [IMDb, 2016], as explained in the Experiments section. We see that even with several orders of magnitude fewer edges than the complete graph we can still find a small set that covers a large fraction of the dataset. For MovieLens this set was of size 40 and for IMDb this set was of size 50. The number of coverable elements was estimated by the greedy algorithm for the max-coverage problem.

2.4 The fraction of the greedy solution that was contained as the sparsity t was increased, for exact nearest neighbor and approximate LSH-based nearest neighbor on the MovieLens dataset. We see that the exact method captures slightly more of the greedy solution for a given value of t and that the LSH value does not converge to 1. However, LSH still captures a reasonable amount of the greedy set and is significantly faster at finding nearest neighbors.

2.5 Results for the IMDb dataset [IMDb, 2016]. Figure (a) shows the function value as the runtime increases, normalized by the value the greedy algorithm obtained. As can be seen, our algorithm is within 99% of greedy in less than 10 minutes. For this experiment, the greedy algorithm had a runtime of six hours, so this is a 36x acceleration for a small penalty in performance. We also compare to using a small sample of the set I as an estimate of the function, which does not perform nearly as well as our algorithm even with much longer runtime. Figure (b) shows the fraction of the set returned by each method that was common with the set returned by greedy. We see that the approximate nearest neighbor method has 90% of its elements in common with the greedy set while being 18x faster than greedy.

3.1 We compare how the number of fractional singleton marginals |S(VC)| changes with the connection strength w. We plot the sample CDF of the probability that |S(VC)| is some given value. We observe that |S(VC)| increases as the connection strength increases. Further, we see that while most instances have a small value of |S(VC)|, there are rare instances where |S(VC)| is quite large.

3.2 We compare how the number of cycle constraints from Equation (3.4) that need to be introduced to find the best integral solution changes with the number of confounding singleton marginals. We use the algorithm for finding the most frustrated cycle in [Sontag and Jaakkola, 2007] to introduce new constraints. We observe that each constraint seems to remove many confounding singleton marginals.

3.3 We also observe that the number of confounding singleton marginals introduced by the cycle constraints increases with the number of confounding singleton marginals.

3.4 We compare how the number of branches needed to find the optimal solution increases with the number of confounding singleton marginals. A similar trend arises as with the number of cycle inequalities introduced. To compare the methods, note that branch-and-bound uses twice as many LP calls as there are branches. For this family of graphical models, branch-and-bound tends to require fewer calls to an LP solver than the cut constraints.

4.1 We generate random chordal graphs such that the maximum degree is bounded by 20. The node weights are generated by the heavy-tailed Pareto distribution with scale parameter 2.0. The number of interventions m is fixed to 5. We compare the greedy algorithm to the optimal solution and the baseline algorithm mentioned in the experimental setup. We see that the greedy algorithm is close to optimal and outperforms the baseline. We also see that the greedy algorithm is able to find a solution with the available number of colors, even without quantization.

4.2 We sample graphs of size 10000 such that the maximum degree is bounded by 20 and the average degree is 3. We draw the weights from the heavy-tailed Pareto distribution with scale parameter 2.0. We restrict all interventions to be of size 10. We adjust the penalty parameter in Algorithm 7 to see how the size of the k-sparse graph separating system relates to the cost. Costs are normalized so that the largest cost is 1.0. We see that with 561 interventions we can achieve a cost of 0.78, compared to a cost of 1.0 with 510 interventions. Our lower bound implies that we need 506 interventions on average.

4.3 When there are very large weights, the greedy algorithm may require many colors to terminate, even on graphs with a small chromatic number. For this graph, the largest independent set is the top two vertices, followed by the next two, and so on. The greedy algorithm will color each of these pairs of vertices a different color, which is n/2 colors. However, after quantization the greedy algorithm will use only 4 colors.

5.1 Graphs for the inachievability result in Theorem 2

5.2 Graphs for Theorem 53

6.1 A flow chart of our conditional sampler. First the noise variable z0 is sampled from N(0, I). This is fed into an invertible generative model f̂ to output another noise variable z1. We then feed z1 into the original model f to generate x1 and x2.

6.2 A graphical model depicting the process we run the ELBO on. We imagine that x̂2 is drawn from the conditional distribution N(x2, σ²I) for some small parameter σ. We see that x̂2 is independent of x1 and z when conditioned on x2.

6.3 Conditional sampling using our approach. We condition on the top half of the image and sample the bottom half. Since the samples output both halves of the image, we replace the sampled top half with the true top half. We see that our approach is able to generate completions with diversity, as there are multiple mouth positions for most sets of images. We plot the pixelwise variance by calculating the sample variance over 64 samples and summing all color channels into one value.

6.4 Conditional sampling when the measurement is a blurred image. We blur the image using average pooling with a window size and stride of 4. We see that our approach is able to find several reasonable ways to complete the measurements.

6.5 Caption

Chapter 1

Introduction

Graphs are an essential topic in machine learning, where they play a prominent role in representing how items are structured and related. In this dissertation we explore five problems in machine learning that use combinatorial optimization over graphical structures.

Problem 1: Facility Location Problem.

In the facility location problem we are given a set V of size n, a set I of size m, and a benefit matrix of nonnegative numbers C ∈ R^{I×V}, where C_iv describes the benefit that element i receives from element v. Our goal is to select a small set A of k columns in this matrix. Once we have chosen A, element i will get a benefit equal to the best choice out of the available columns, max_{v∈A} C_iv. The total reward is the sum of the row rewards, so the optimal choice of columns is the solution of

\[ \arg\max_{\{A \subseteq V : |A| \le k\}} \; \sum_{i \in I} \max_{v \in A} C_{iv}. \]

A natural application of this problem is in finding a small set of representative images in a big dataset, where C_iv represents the similarity between images i and v. The problem is to select k images that provide a good coverage of the full dataset, since each one has a close representative in the chosen set.

The benefit matrix C can be interpreted as a weighted, undirected, bipartite graph G = (V, I, E). There is an edge between items i ∈ I and v ∈ V with weight C_iv. If there are many values of weight 0 in the matrix C, we can think of this as removing edges in the graph G. Algorithms for optimizing the facility location problem, such as the greedy algorithm, can exploit the sparsity of this graph to speed up running time.

Unfortunately, for many benefit functions of interest, the benefit matrix C is dense with O(mn) entries. For example, if the benefit matrix is a similarity function in a vector space, there will typically be few, if any, similarities of value exactly 0.

One approach is to sparsify this matrix by removing small entries. For example, we can keep only the t largest elements of each row C_i, or we can keep only elements with value above a threshold τ. These sparsification methods can be quickly implemented using nearest neighbor methods. In Chapter 2, we analyze these approaches. We prove that these approaches do lead to a significant reduction in the number of edges of the graph G while approximately maintaining the optimal solution to the initial facility location problem.
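As a concrete illustration, row-wise sparsification can be sketched as follows. This is a toy example with a made-up benefit matrix; a practical implementation would use (approximate) nearest neighbor data structures rather than sorting every row.

```python
# Sketch of t-nearest-neighbor sparsification of a dense benefit matrix.
# Illustrative only: real implementations use nearest neighbor search
# instead of sorting each full row.

def sparsify_rows(C, t):
    """Keep the t largest entries of each row of C; zero out the rest."""
    sparse = []
    for row in C:
        # indices of the t largest benefits in this row
        keep = set(sorted(range(len(row)), key=lambda v: row[v], reverse=True)[:t])
        sparse.append([row[v] if v in keep else 0.0 for v in range(len(row))])
    return sparse

C = [
    [0.9, 0.2, 0.1, 0.4],
    [0.3, 0.8, 0.7, 0.1],
]
S = sparsify_rows(C, t=2)
# Each row of S now has exactly 2 nonzero entries, the largest ones.
```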

Problem 2: Integer Programming for MAP in graphical models.

Markov Random Fields (MRFs) are graphical models that represent conditional independencies with an undirected graph. One important class of distributions is binary, pairwise MRFs, which are also called Ising models. For these distributions, we are given a graph G = (V, E) with node weights (θ_i)_{i∈V} and edge weights (W_ij)_{ij∈E}. The probability of a configuration x ∈ {0, 1}^V is calculated as

\[ \mathbb{P}(X = x) = \frac{1}{Z} \exp\left( \sum_{i \in V} \theta_i x_i + \sum_{ij \in E} W_{ij} x_i x_j \right), \]

where Z is used to ensure that the distribution normalizes to 1.
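For intuition, this distribution can be evaluated by brute-force enumeration on a tiny model. The weights below are made up, and computing Z by enumeration is exponential in the number of variables, so this is feasible only for toy graphs.

```python
import itertools
import math

# Brute-force evaluation of the Ising distribution on a tiny toy model.
theta = {0: 0.5, 1: -0.2, 2: 0.0}      # node weights
W = {(0, 1): 1.0, (1, 2): -0.5}        # edge weights

def unnormalized(x):
    # exp( sum_i theta_i x_i + sum_ij W_ij x_i x_j )
    s = sum(theta[i] * x[i] for i in theta)
    s += sum(w * x[i] * x[j] for (i, j), w in W.items())
    return math.exp(s)

configs = list(itertools.product([0, 1], repeat=3))
Z = sum(unnormalized(x) for x in configs)          # partition function
prob = {x: unnormalized(x) / Z for x in configs}   # normalized distribution
```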

An important problem for inference in graphical models is the MAP problem, which is the problem of finding the most likely configuration. For Ising models, we can write this problem as

\[ \max_{x \in \{0,1\}^V} \; \sum_{i \in V} \theta_i x_i + \sum_{ij \in E} W_{ij} x_i x_j. \]

We can write this problem as an integer linear program:

\[
\begin{aligned}
\max_{q \in \mathbb{R}^{V \cup E}} \quad & \sum_{i \in V} \theta_i q_i + \sum_{ij \in E} W_{ij} q_{ij} \\
\text{s.t.} \quad & q_i \in \{0, 1\} && \forall i \in V \\
& q_{ij} \ge \max\{0, q_i + q_j - 1\} && \forall ij \in E \\
& q_{ij} \le \min\{q_i, q_j\} && \forall ij \in E.
\end{aligned}
\]

The standard approach to solving the integer program is with LP relaxations. To create the LP relaxation, we replace the constraint q_i ∈ {0, 1} with 0 ≤ q_i ≤ 1. The problem is now a linear program and can be solved in polynomial time. If the optimal solution is integral, then we know that it is also the optimal solution to the integer program. If the optimal solution is fractional, then various integer programming techniques can be used to modify the LP in a way that removes this fractional solution from the LP relaxation while maintaining the optimal integral solution.
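To make the MAP objective concrete, here is a brute-force version on a toy model with made-up weights. Enumeration is exponential in the number of variables, which is exactly why the integer programming machinery is needed at scale.

```python
import itertools

# Brute-force MAP inference on a tiny Ising model: maximize
# sum_i theta_i x_i + sum_ij W_ij x_i x_j over x in {0,1}^n.
theta = [0.2, -0.1, 0.3]             # toy node weights
W = {(0, 1): 0.5, (1, 2): -1.0}      # toy edge weights

def objective(x):
    val = sum(t * xi for t, xi in zip(theta, x))
    val += sum(w * x[i] * x[j] for (i, j), w in W.items())
    return val

# Exhaustive search over all 2^n configurations.
x_map = max(itertools.product([0, 1], repeat=len(theta)), key=objective)
```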

LP relaxations are utilized for decoding LDPC codes [Feldman et al., 2005]. LDPC codes are a class of graphical models used for communication coding, and decoding is essentially finding the MAP solution. The LP relaxation succeeds when the integral MAP solution is also the optimal solution to the linear program. However, due to introduced fractional vertices, the LP relaxation approach can fail.

Dimakis et al. [2009] considered integer programming techniques from LDPC decoding that provably succeed under an assumption on the number of fractional vertices above the optimal integral solutions. They were able to show that they can remove a polynomial number of fractional solutions in polynomial time. However, their approach required the special structure of LDPC codes, and it was left open whether similar results hold for more general classes of graphical models.

In Chapter 3 we show that for general graphical models we can remove a polynomial number of fractional vertices in polynomial time. Our result actually shows that it is possible to do this on arbitrary polytopes contained in the unit cube. We further extend this result to recovering the M-best integral solutions under the assumption that there is a polynomial number of fractional vertices.

Problem 3: Cost-aware experiment design for learning causal graphs.

In machine learning, we often want to learn causal relationships between objects. However, if we only have observational data, then it is not always possible to learn causal relationships. In many situations, the only way to learn the direction of causality is to perform randomized experiments using interventions.

In Pearl’s causal model [Pearl, 2009], causality is represented by a causal DAG. Variables are said to be caused by their parents in the graph (and indirectly caused by all ancestors).

We are interested in learning all causal directions. Since causal DAGs imply a Bayesian network, the causal directions that can be identified from data are those edges that are essential to all Bayesian networks describing the conditional independencies of the joint distribution. From data, we can also identify where edges exist. We thus need to perform interventions to identify the causal directions of the undirected component of the essential graph.

Experiment design is a well-studied area, and it is known that for the remaining edges to be identifiable, the intervention sets I_1, ..., I_m must be such that every undirected edge is in the cut set of some I_i. The cut set of a set of nodes I_i is the set of edges with exactly one endpoint in I_i. Thus we can describe a valid intervention design in a purely combinatorial way.
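This combinatorial validity condition can be sketched as a simple check. The graph and interventions below are made-up toy examples.

```python
# Checks whether an intervention design I_1, ..., I_m cuts every
# undirected edge, i.e. for each edge some intervention set contains
# exactly one of its endpoints.

def is_valid_design(edges, interventions):
    return all(
        any((u in I) != (v in I) for I in interventions)  # XOR: edge is cut
        for u, v in edges
    )

edges = [(1, 2), (2, 3), (1, 3)]   # an undirected triangle
good = [{1}, {2}]                  # cuts all three edges
bad = [{1, 2, 3}]                  # contains both endpoints of every edge
```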

The minimum cost intervention design problem, first considered by Kocaoglu et al. [2017], is the intervention design problem in which there is a cost to intervene on each node, and the cost of an intervention is the sum of the individual node costs. The goal is to find a valid intervention design with minimum total cost.

Kocaoglu et al. [2017] developed optimal solutions for some special classes of graphs; however, it was still open (1) whether the problem is NP-hard for general causal graphs, and if so, (2) whether there are any efficient approximation algorithms for the problem. In Chapter 4, we indeed show that the problem is NP-hard and that there is an efficient (2 + ε)-approximation of the problem when the number of allowed interventions is slightly more than the minimum required.

Problem 4: Robust Learning of Ising Models

In the previous problem we saw that Ising models are a class of Markov random fields in which there are pairwise conditional dependencies and every variable takes a binary value. While we previously considered these values to be in {0, 1}, here it is convenient to have them take values in {−1, 1}.

An important question is learning Ising models given samples from the joint distribution. Specifically, we want to recover the graph structure after observing a small number of samples from the joint distribution.

It is known that we need to assume a bound α on the minimum edge weight: if an edge weight is arbitrarily small, then we would not be able to detect it. Additionally, sample complexity bounds by Wainwright et al. [2003] establish that we also need to have an upper bound on the total strength of the edge weights. We define the width λ of the model to be

\[ \lambda = \max_{i \in V} \left( |\theta_i| + \sum_{j \in V} |W_{ij}| \right). \]
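A minimal sketch of computing the width for a toy model (the weights below are made up):

```python
# Width of a toy Ising model: lambda = max_i ( |theta_i| + sum_j |W_ij| ).
theta = {0: 0.5, 1: -1.0, 2: 0.2}
W = {(0, 1): 0.3, (1, 2): -0.7}

def width(theta, W):
    lam = 0.0
    for i in theta:
        row = abs(theta[i])
        # add the absolute weights of all edges incident to node i
        row += sum(abs(w) for (u, v), w in W.items() if i in (u, v))
        lam = max(lam, row)
    return lam
```

Here node 1 attains the maximum: |−1.0| + |0.3| + |−0.7| = 2.0.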

For fixed α and λ, sample complexity bounds by Wainwright et al. [2003] establish that we need O(log n) samples to learn the Ising model.

Bresler [2015] was the first to establish an algorithm with optimal sample complexity and polynomial runtime for the problem of learning Ising models. However, the dependence on λ was quite large. This was improved by Vuffray et al. [2016] and later by Klivans and Meka [2017]. Specifically, the algorithm by Klivans and Meka [2017] was shown to have nearly optimal sample complexity and runtime.

In this work we consider the problem of robustly learning Ising models. This means that we need to recover the true Ising model when an adversary can corrupt a fraction of samples that we receive. We will call the fraction of corrupted samples η, and the problem is to find the largest value of η such that we can still efficiently recover the underlying dependency graph.

In Chapter 5, we establish a lower bound on η based on the parameters α and λ. In particular, we establish that if η = α exp(−O(λ)), then no algorithm is able to recover the underlying Ising model. We then show that the Sparsitron algorithm of Klivans and Meka [2017] can efficiently recover the underlying graph for η = α² exp(−O(λ)), showing that our lower bound is essentially tight.

Problem 5: Uncertainty Aware Compressive Sensing

Compressive sensing is the problem of recovering a signal x ∈ R^d given a small number of measurements y = Ax, for a known measurement matrix A ∈ R^{m×d}. If the number of measurements m < d, then arbitrary signals x cannot be recovered. However, if we have prior information on the structure of x then it may be possible.

There has been extensive study on signal recovery under the assumption that the signal x is sparse and we now have high quality algorithms for this setting [Tibshirani, 1996, Candes et al., 2006, Donoho et al., 2006, Bickel et al., 2009].
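As a toy illustration of why a sparsity prior helps, a 1-sparse signal can be recovered from m < d measurements by brute force over candidate supports. This is not one of the cited algorithms, just a made-up example for intuition.

```python
# Toy illustration: with m < d measurements y = A x, a 1-sparse signal
# can be recovered by checking each candidate support coordinate.
A = [
    [1.0, 2.0, 3.0, 4.0],   # 2 measurements of a 4-dimensional signal
    [4.0, 3.0, 2.0, 1.0],
]
x_true = [0.0, 0.0, 5.0, 0.0]
y = [sum(a * x for a, x in zip(row, x_true)) for row in A]

def recover_1_sparse(A, y):
    d = len(A[0])
    for j in range(d):            # try putting all the mass on coordinate j
        col = [row[j] for row in A]
        if col[0] == 0:
            continue
        c = y[0] / col[0]         # coefficient implied by the first measurement
        if all(abs(c * col[i] - y[i]) < 1e-9 for i in range(len(y))):
            x = [0.0] * d
            x[j] = c
            return x
    return None
```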

There has been recent interest in compressive sensing that utilizes powerful generative models. Starting with the work of Bora et al. [2017], there has been extensive study on the problem of signal recovery under the assumption that the signal is in the range of a deep generative model such as a GAN [Goodfellow et al., 2014] or a VAE [Kingma and Welling, 2013].

One issue with prior work is that, while it is able to recover signals that match the measurements, it is not able to identify which aspects of the signal are common to all signals that fit the measurements and which aspects can vary while still fitting the measurements.

Because of this, there has been recent attention on uncertainty aware compressive sensing [Tonolini et al., 2019, Zhang and Jin, 2019]. Here we want to recover the conditional distribution on x given the measurements y.

In this work we consider uncertainty aware compressive sensing when the prior is given to us as an invertible generative model, a special type of generative model that allows us to evaluate the density function as well as sample from it.

In Chapter 6 we show that the conditional sampling problem is hard in general. Because of this we consider approximations to the problem. We develop an approach that allows us to directly utilize the existing model to improve sample quality and compares favorably to existing approaches. Our approach utilizes tools from variational inference.

Chapter 2

Leveraging Sparsity for Efficient Submodular Data Summarization

2.1 Introduction

In this chapter we study the facility location problem: we are given sets V of size n, I of size m, and a benefit matrix of nonnegative numbers C ∈ R^{I×V}, where C_iv describes the benefit that element i receives from element v. Our goal is to select a small set A of k columns in this matrix. Once we have chosen A, element i will get a benefit equal to the best choice out of the available columns, max_{v∈A} C_iv. The total reward is the sum of the row rewards, so the optimal choice of columns is the solution of:

\[ \arg\max_{\{A \subseteq V : |A| \le k\}} \; \sum_{i \in I} \max_{v \in A} C_{iv}. \tag{2.1} \]

A natural application of this problem is in finding a small set of representative images in a big dataset, where C_iv represents the similarity between images i and v. The problem is to select k images that provide a good coverage of the full dataset, since each one has a close representative in the chosen set.

Throughout this chapter we follow the nomenclature common to the submodular optimization for machine learning literature. This problem is also known as the maximization version of the k-medians problem. A number of recent works have used this problem for selecting subsets of documents or images from a larger corpus [Lin and Bilmes, 2012, Tschiatschek et al., 2014], to identify locations to monitor in order to quickly identify important events in sensor or blog networks [Krause et al., 2008, Leskovec et al., 2007], as well as clustering applications [Krause and Gomes, 2010, Mirzasoleiman et al., 2013].

We can naturally interpret Problem (2.1) as the maximization of a set function F(A) that takes as input the selected set of columns and returns the total reward of that set. Formally, let F(∅) = 0 and for all other sets A ⊆ V define

\[ F(A) = \sum_{i \in I} \max_{v \in A} C_{iv}. \tag{2.2} \]

The set function F is submodular, since for all j ∈ V and sets A ⊆ B ⊆ V \ {j}, we have F(A ∪ {j}) − F(A) ≥ F(B ∪ {j}) − F(B); that is, the gain from an element diminishes as we add elements. Since the entries of C are nonnegative, F is monotone, since for all A ⊆ B ⊆ V we have F(A) ≤ F(B). We also have F normalized, since F(∅) = 0.
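The monotonicity and diminishing-returns properties can be checked numerically on a small benefit matrix. The matrix below is toy data, and the exhaustive check is feasible only for tiny instances.

```python
import itertools

# Numerically check monotonicity and submodularity of the facility
# location objective F(A) = sum_i max_{v in A} C_iv, with F(empty) = 0.
C = [
    [0.9, 0.1, 0.5],
    [0.2, 0.8, 0.4],
    [0.3, 0.3, 0.7],
]
V = range(3)

def F(A):
    return sum(max((row[v] for v in A), default=0.0) for row in C)

for size in range(3):
    for A in itertools.combinations(V, size):
        for B in itertools.combinations(V, size + 1):
            if not set(A) <= set(B):
                continue
            assert F(B) >= F(A) - 1e-12          # monotone
            for j in set(V) - set(B):
                gain_A = F(set(A) | {j}) - F(A)
                gain_B = F(set(B) | {j}) - F(B)
                assert gain_A >= gain_B - 1e-12  # diminishing returns
```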

The facility location problem is NP-Hard, so we consider approximation algorithms. As for all monotone and normalized submodular functions, the greedy algorithm guarantees a (1 − 1/e)-factor approximation to the optimal solution [Nemhauser et al., 1978]. The greedy algorithm starts with the empty set, then for k iterations adds the element with the largest marginal reward. This approximation is the best possible: the maximum coverage problem is an instance of the facility location problem, which was shown to be NP-Hard to optimize within a factor of 1 − 1/e + ε for all ε > 0 [Feige, 1998].

The problem is that the greedy algorithm has super-quadratic running time Θ(nmk) and in many datasets n and m can be in the millions. For this reason, several recent papers have focused on accelerating the greedy algorithm. In [Leskovec et al., 2007], the authors point out that if the benefit matrix is sparse, this can dramatically speed up the computation time. Unfortunately, in many problems of interest, data similarities or rewards are not sparse. Wei et al. [2014] proposed to first sparsify the benefit matrix and then run the greedy algorithm on this new sparse matrix. In particular, Wei et al. [2014] considers t-nearest neighbor sparsification, i.e., keeping for each row the t largest entries and zeroing out the rest. Using this technique they demonstrated an impressive 80-fold speedup over the greedy algorithm with little loss in solution quality. One limitation of their theoretical analysis was the limited setting under which provable approximation guarantees were established.

Our Contributions: Inspired by the work of Wei et al. [2014], we improve the theoretical analysis of the approximation error induced by sparsification. Specifically, the previous analysis assumes that the input comes from a probability distribution where the preferences of each element i ∈ I are independently chosen uniformly at random. For this distribution, when k = Ω(n), they establish that the sparsity can be taken to be O(log n) and that running the greedy algorithm on the sparsified problem guarantees a constant factor approximation with high probability. We improve the analysis in the following ways:

• We prove guarantees for all values of k and our guarantees do not require any assumptions on the input besides nonnegativity of the benefit matrix.

• In the case where k = Ω(n), we show that it is possible to take the sparsity of each row as low as O(1) while guaranteeing a constant factor approximation.

• Unlike previous work, our analysis does not require the use of any particular algorithm and can be integrated into many algorithms for solving facility location problems.

• We establish a lower bound which shows that our approximation guarantees are tight up to log factors, for all desired approximation factors.

In addition to the above results, we propose a novel algorithm that uses a threshold-based sparsification, where we keep the matrix elements that are above a fixed threshold value. This type of sparsification is easier to implement efficiently using nearest neighbor methods. For this method of sparsification, we obtain worst case guarantees and a lower bound that matches up to constant factors. We also obtain a data dependent guarantee which helps explain why our algorithm empirically performs better than the worst case.

Further, we propose the use of Locality Sensitive Hashing (LSH) and random walk methods to accelerate approximate nearest neighbor computations. Specifically, we use two types of similarity metrics: inner products and personalized PageRank (PPR). We propose the use of fast approximations for these metrics and empirically show that they dramatically improve running times. LSH functions are well-known but, to the best of our knowledge, this is the first time they have been used to accelerate facility location problems. Furthermore, we utilize personalized PageRank as the similarity between vertices on a graph. Random walks can quickly approximate this similarity, and we demonstrate that it yields highly interpretable results for real datasets.

2.2 Related Work

The use of a sparsified proxy function was shown by Wei et al. [2015] to also be useful for finding a subset for training nearest neighbor classifiers. Further, they show a connection between nearest neighbor classifiers and the facility location function. The facility location function was also used by Mirzasoleiman et al. [2016] as part of a summarization objective function, where they present a summarization algorithm that is able to handle a variety of constraints.

The stochastic greedy algorithm was shown to achieve a 1 − 1/e − ε approximation with runtime O(nm log(1/ε)), which has no dependence on k [Mirzasoleiman et al., 2015]. It works by choosing a sample set from V of size (n/k) log(1/ε) each iteration and adding to the current set the element of the sample set with the largest gain.

Also, there are several related algorithms for the streaming setting [Badanidiyuru et al., 2014] and distributed setting [Barbosa et al., 2015, Kumar et al., 2015, Mirrokni and Zadimoghaddam, 2015, Mirzasoleiman et al., 2013].

Since the objective function is defined over the entire dataset, optimizing the facility location function becomes more complicated in these memory limited settings. Often the function is estimated by considering a randomly chosen subset of the set I.

2.2.1 Benefit Functions and Nearest Neighbor Methods

For many problems, the elements V and I are vectors in some feature space where the benefit matrix is defined by some similarity function sim. For example, in R^d we may use the RBF kernel sim(x, y) = e^{−γ‖x−y‖²}, the dot product sim(x, y) = x^T y, or the cosine similarity sim(x, y) = x^T y/(‖x‖‖y‖).
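These three similarity functions can be sketched in a few lines; the toy vectors X and the choice γ = 1 below are our own illustration.

```python
import numpy as np

def rbf_sim(x, y, gamma=1.0):
    """RBF kernel similarity: exp(-gamma * ||x - y||^2)."""
    return float(np.exp(-gamma * np.sum((x - y) ** 2)))

def dot_sim(x, y):
    """Dot product similarity: x^T y."""
    return float(np.dot(x, y))

def cosine_sim(x, y):
    """Cosine similarity: x^T y / (||x|| ||y||)."""
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

# Benefit matrix C[i, v] = sim(x_i, x_v) for a toy set of vectors.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
C = np.array([[dot_sim(xi, xv) for xv in X] for xi in X])
```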

There have been decades of research on nearest neighbor search in geometric spaces. If the vectors are low dimensional, then classical techniques such as kd-trees [Bentley, 1975] work well and are exact. However, it has been observed that as the dimension grows, the runtime of all known exact methods does little better than a linear scan over the dataset.

As a compromise, researchers have turned to approximate nearest neighbor methods, one of the most successful approaches being locality sensitive hashing [Gionis et al., 1999, Indyk and Motwani, 1998]. LSH uses a hash function that hashes together items that are close. Locality sensitive hash functions exist for a variety of metrics and similarities, such as Euclidean distance [Datar et al., 2004], cosine similarity [Andoni et al., 2015, Charikar, 2002], and dot product [Neyshabur and Srebro, 2015, Shrivastava and Li, 2014]. Nearest neighbor methods other than LSH that have been shown to work for machine learning problems include [Beygelzimer et al., 2006, Chen et al., 2009]. Additionally, see [Garcia et al., 2010] for efficient and exact GPU methods.

An alternative to vector functions is to use similarities and benefits defined from graph structures. For instance, we can use the personalized PageRank of vertices in a graph to define the benefit matrix [Page et al., 1999]. The personalized PageRank is similar to the classic PageRank, except that the random jumps, rather than going anywhere in the graph, go back to the user's "home" vertex. This can be used as a value of "reputation" or "influence" between vertices in a graph [Gupta et al., 2013].

There are a variety of algorithms for finding the vertices with a large PageRank personalized to some vertex. One popular method is the random walk method. If π_i is the personalized PageRank vector of some vertex i, then π_i(v) is the probability that a random walk of geometric length starting from i ends on vertex v (where the parameter of the geometric distribution is defined by the probability of jumping back to i) [Avrachenkov et al., 2007]. Using this approach, we can quickly estimate all elements in the benefit matrix greater than some value τ.
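As a concrete illustration, the following is a minimal Monte Carlo sketch of this random walk estimator; the adjacency-list format and parameter names are our own, not from the text.

```python
import random
from collections import Counter

def ppr_monte_carlo(adj, home, alpha=0.15, num_walks=5000, rng=None):
    """Estimate the personalized PageRank vector pi_home by random walks.

    Each walk starts at `home` and, at every step, stops with probability
    alpha (the jump-back probability) or moves to a uniformly random
    neighbor, so walks have geometric length; pi_home(v) is estimated by
    the fraction of walks ending at v.
    """
    rng = rng or random.Random(0)
    ends = Counter()
    for _ in range(num_walks):
        v = home
        while rng.random() > alpha and adj[v]:
            v = rng.choice(adj[v])
        ends[v] += 1
    return {v: c / num_walks for v, c in ends.items()}

# Path graph 0 - 1 - 2: mass concentrates near the home vertex 0.
pi = ppr_monte_carlo({0: [1], 1: [0, 2], 2: [1]}, home=0)
```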

2.3 Guarantees for t-Nearest Neighbor Sparsification

We associate a bipartite support graph G = (V, I, E) by placing an edge between v ∈ V and i ∈ I whenever C_{iv} > 0. If the support graph is sparse, we can use it to calculate the gain of an element much more efficiently, since we only need to consider the neighbors of the element rather than the entire set I. If the average degree of a vertex i ∈ I is t (and we cache the current best value for each element i), then we can execute greedy in time O(mtk). See Algorithm 1 in the Appendix for pseudocode. If the sparsity t is much smaller than the size of V, the runtime is greatly improved.

However, the instance we wish to optimize may not be sparse. One idea is to sparsify the original matrix by only keeping the values in the benefit matrix C that are t-nearest neighbors, which was considered by Wei et al. [2014]. That is, for every element i in I, we only keep the top t entries of C_{i1}, C_{i2}, ..., C_{in} and set the rest equal to zero. This leads to a matrix with mt nonzero elements. We then want the solution from optimizing the sparse problem to be close to the value of the optimal solution in the original objective function F.
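This row-wise top-t sparsification can be sketched with NumPy's argpartition; a minimal illustration with a toy matrix of our own.

```python
import numpy as np

def sparsify_top_t(C, t):
    """Keep, for each row of C, its t largest entries; zero the rest."""
    C = np.asarray(C, dtype=float)
    S = np.zeros_like(C)
    # argpartition finds the indices of the t largest entries per row
    # in linear time, without a full sort.
    top = np.argpartition(C, -t, axis=1)[:, -t:]
    rows = np.arange(C.shape[0])[:, None]
    S[rows, top] = C[rows, top]
    return S
```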

Our main theorem is that we can set the sparsity parameter t to be O((n/(αk)) log(m/(αk))), which is a significant improvement for large enough k, while still having the solution to the sparsified problem be within a factor of 1/(1 + α) of the value of the optimal solution.

Theorem 1. Let O_t be the optimal solution to an instance of the facility location problem with a benefit matrix that was sparsified with t-nearest neighbors. For any t ≥ t*(α) = O((n/(αk)) log(n/(αk))), we have F(O_t) ≥ OPT/(1 + α).

Proof Sketch. For the value of t chosen, there exists a set Γ of size αk such that every element of I has a neighbor in the t-nearest neighbor graph; this is proven using the probabilistic method. By appending Γ to the optimal solution and using the monotonicity of F, we can move to the sparsified function, since no element of I would prefer an element that was zeroed out in the sparsified matrix, as one of their top t most beneficial elements is present in the set Γ. The optimal solution appended with Γ is a set of size (1 + α)k. We then bound the amount that the optimal value of a submodular function can increase when adding αk elements. See the appendix for the complete proof.

Note that Theorem 1 is agnostic to the algorithm used to optimize the sparsified function, so if we use a ρ-approximation algorithm, we are at most a factor of ρ/(1 + α) from the optimal solution. Later in this section we will use this to design a subquadratic algorithm for optimizing facility location problems, as long as we can quickly compute t-nearest neighbors and k is large enough.

If m = O(n) and k = Ω(n), we can achieve a constant factor approximation even when taking the sparsity parameter as low as t = O(1), which means that the benefit matrix C has only O(n) nonzero entries. Also note that the only assumption we need is that the benefits between elements are nonnegative. When k = Ω(n), previous work was only able to take t = O(log n) and required the benefit matrix to come from a probability distribution [Wei et al., 2014].

Our guarantee has two regimes depending on the value of α. If we want the optimal solution of the sparsified function to be within a 1 − ε factor of the optimal solution to the original function, then t*(ε) = O((n/(εk)) log(m/(εk))) suffices. Conversely, if we want to take the sparsity t much smaller than (n/k) log(m/k), this is equivalent to taking α very large, and we still have some guarantee of optimality.

In the proof of Theorem 1, the only time we utilize the value of t is to show that there exists a small set Γ that covers the entire set I in the t-nearest neighbor graph. Real datasets often contain a covering set of size αk for t much smaller than O((n/(αk)) log(m/(αk))). This observation yields the following corollary.

Corollary 2. If after sparsifying a problem instance there exists a covering set of size αk in the t-nearest neighbor graph, then the optimal solution O_t of the sparsified problem satisfies F(O_t) ≥ OPT/(1 + α).

In the datasets we consider in our experiments, of roughly 7000 items, we have covering sets with only 25 elements for t = 75, and a covering set of size 10 for t = 150. The size of the covering set was upper bounded using the greedy set cover algorithm. In Figure 2.2 in the appendix, we show how the size of the covering set changes with the number of neighbors t.
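The greedy set cover upper bound mentioned above can be sketched as follows; the `neighbors` map, from each candidate column to the set of elements of I that list it among their t nearest neighbors, is our own encoding.

```python
def greedy_cover(neighbors, universe):
    """Greedy set cover: upper-bounds the smallest set Gamma of columns
    whose t-nearest-neighbor lists together touch every element of I."""
    uncovered = set(universe)
    cover = []
    while uncovered:
        # Pick the candidate covering the most still-uncovered elements.
        best = max(neighbors, key=lambda v: len(neighbors[v] & uncovered))
        if not neighbors[best] & uncovered:
            break  # remaining elements are not coverable
        cover.append(best)
        uncovered -= neighbors[best]
    return cover
```

The classical analysis gives a ln|I| + 1 approximation to the smallest cover, which is why this serves as an upper bound on the covering set size.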

It would be desirable to take the sparsity parameter t lower than the value dictated by t*(α). As demonstrated by the following lower bound, it is not possible to take the sparsity significantly lower than (1/α)(n/k) and still have a 1/(1 + α) approximation in the worst case.

Proposition 3. Suppose we take

t = max{1/(2α), 1/(1 + α)} · (n − 1)/k.

There exists a family of inputs such that we have F(O_t) ≤ OPT/(1 + α − 2/k).

The example we create to show this has the property that, in the t-nearest neighbor graph, the set Γ needs αk elements to cover every element of I. We plant a much smaller covering set that is very close in value to Γ but is hidden after sparsification. We then embed a modular function within the facility location objective. With knowledge of the small covering set, an optimal solver can take advantage of this modular function, while the sparsified solution would prefer to first choose the set Γ before considering the modular function. See the appendix for full details.

Sparsification integrates well with the stochastic greedy algorithm [Mirzasoleiman et al., 2015]. By taking t ≥ t*(ε/2) and running stochastic greedy with sample sets of size (n/k) ln(2/ε), we get a 1 − 1/e − ε approximation in expectation that runs in expected time O((nm/(εk)) log(1/ε) log(m/(εk))). If we can quickly sparsify the problem and k is large enough, for example k = n^{1/3}, this is subquadratic. The following proposition gives a high probability guarantee on the runtime of this algorithm and is proven in the appendix.
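A minimal sketch of stochastic greedy [Mirzasoleiman et al., 2015] on a dense benefit matrix, with the cache β of current best benefits; the toy matrix and parameter defaults are our own.

```python
import random
import numpy as np

def stochastic_greedy(C, k, eps=0.1, rng=None):
    """Stochastic greedy (sketch): each of the k iterations samples
    (n/k) ln(1/eps) candidates and adds the one with the largest
    marginal gain, using the cache beta of current best benefits."""
    rng = rng or random.Random(0)
    m, n = C.shape
    sample_size = max(1, int(np.ceil(n / k * np.log(1 / eps))))
    beta = np.zeros(m)  # beta[i] = best benefit element i currently gets
    A = []
    for _ in range(k):
        pool = [v for v in range(n) if v not in A]
        candidates = rng.sample(pool, min(sample_size, len(pool)))
        gains = [np.maximum(C[:, v] - beta, 0).sum() for v in candidates]
        best = candidates[int(np.argmax(gains))]
        A.append(best)
        beta = np.maximum(beta, C[:, best])
    return A, float(beta.sum())

C = np.array([[1.0, 0.2, 0.0, 0.5],
              [0.1, 0.9, 0.3, 0.0],
              [0.0, 0.4, 0.8, 0.6]])
A, val = stochastic_greedy(C, k=2)
```

On this toy instance the sample covers all candidates, so the run is deterministic and matches full greedy.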

Proposition 4. When m = O(n), the stochastic greedy algorithm [Mirzasoleiman et al., 2015] with sample sets of size (n/k) log(2/ε), combined with sparsification with sparsity parameter t, will terminate in time O(n log(1/ε) max{t, log n}) with high probability. When t ≥ t*(ε/2) = O((n/(εk)) log(m/(εk))), this algorithm has a 1 − 1/e − ε approximation in expectation.

2.4 Guarantees for Threshold-Based Sparsification

Rather than t-nearest neighbor sparsification, we now consider τ-threshold sparsification, where we zero out all entries whose value is below a threshold τ. Recall the definition of a locality sensitive hash.

Definition 5. H is a (τ, Kτ, p, q)-locality sensitive hash family if for x, y satisfying sim(x, y) ≥ τ we have P_{h∈H}(h(x) = h(y)) ≥ p, and if x, y satisfy sim(x, y) ≤ Kτ we have P_{h∈H}(h(x) = h(y)) ≤ q.

We see that τ-threshold sparsification is a better model than t-nearest neighbors for LSH: for K = 1 it is exactly a noisy τ-sparsification, and for non-adversarial datasets it is a reasonable approximation of a τ-sparsification method. Note that due to the approximation constant K, we do not have an a priori guarantee on the runtime for arbitrary datasets. However, we would expect in practice to see only a few elements above the value τ. See [Andoni, 2012] for a discussion.
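For cosine similarity, the random-hyperplane family of Charikar [2002] is a simple example of such a hash; this is a sketch, and the packing of sign bits into integer bucket ids is our own implementation choice.

```python
import numpy as np

def simhash(X, n_bits=16, seed=0):
    """Random-hyperplane LSH for cosine similarity: two vectors collide
    in a given bit with probability 1 - angle(x, y)/pi, so vectors with
    high cosine similarity tend to share hash codes."""
    rng = np.random.default_rng(seed)
    H = rng.standard_normal((X.shape[1], n_bits))  # random hyperplanes
    bits = (X @ H > 0).astype(int)
    # Pack the sign pattern of each row into an integer bucket id.
    return bits @ (1 << np.arange(n_bits))

codes = simhash(np.array([[1.0, 0.0], [1.0, 0.0], [-1.0, 0.0]]))
```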

One issue is that we do not know how to choose the threshold τ. We can sample elements of the benefit matrix C to estimate how sparse the threshold graph will be for a given threshold τ. Assuming the values of C are in general position1, by using the Dvoretzky-Kiefer-Wolfowitz-Massart Inequality [Dvoretzky et al., 1956, Massart, 1990] we can bound the number of samples needed to choose a threshold that achieves a desired sparsification level.

1By this we mean that the values of C are all unique, or at least only a few elements take any particular value. We need this to hold since otherwise a threshold based sparsification may exclusively return an empty graph or the complete graph.

We establish the following data-dependent bound on the difference between the optimal solutions of the τ-threshold sparsified function and the original function. We denote the set of vertices adjacent to S in the τ-threshold graph by N(S).

Theorem 6. Let O_τ be the optimal solution to an instance of the facility location problem with a benefit matrix that was sparsified using a τ-threshold. Assume there exists a set S of size k such that in the τ-threshold graph the neighborhood of S satisfies |N(S)| ≥ µn. Then we have

F(O_τ) ≥ (1 + 1/µ)^{−1} OPT.

For the datasets we consider in our experiments, we can keep just a 0.01 to 0.001 fraction of the elements of C while still having a small set S with a neighborhood satisfying |N(S)| ≥ 0.3n. In Figure 2.3 in the appendix, we plot the relationship between the number of edges in the τ-threshold graph and the number of elements coverable by a set of small size, as estimated by the greedy algorithm for max-cover.

Additionally, we have a worst-case relationship between the number of edges in the τ-threshold graph and the approximation factor. The guarantee follows from applying Theorem 6 with the following lemma.

Lemma 7. For k ≥ c/((1 − 2c²)δ), any graph with (1/2)δ²n² edges has a set S of size k such that the neighborhood N(S) satisfies

|N(S)| ≥ cδn.

To get a matching lower bound, consider the case where the graph has two disjoint cliques, one of size δn and one of size (1 − δ)n. Details are in the appendix.

2.5 Experiments

2.5.1 Summarizing Movies and Music from Ratings Data

We consider the problem of summarizing a large collection of movies. We first need to create a feature vector for each movie. Movies can be categorized by the people who like them, and so we create our feature vectors from the MovieLens ratings data [GroupLens, 2015]. The MovieLens database has 20 million ratings for 27,000 movies from 138,000 users. We perform low-rank matrix completion and factorization on the ratings matrix [Jain et al., 2013, Koren et al., 2009] to get a matrix X = UV^T, where X is the completed ratings matrix, U is a matrix of feature vectors for each user, and V is a matrix of feature vectors for each movie. For movies i and j with vectors v_i and v_j, we set the benefit function C_{ij} = v_i^T v_j. We do not use the normalized dot product (cosine similarity) because we want our summary movies to be movies that were highly rated, and not normalizing makes highly rated movies increase the objective function more.

We complete the ratings matrix using the MLlib library in Apache Spark [Meng et al., 2016] after removing all but the top seven thousand most rated movies to remove noise from the data. We use locality sensitive hashing to perform sparsification; in particular, we use the LSH in the FALCONN library for cosine similarity [Andoni et al., 2015] and the reduction from a cosine similarity hash to a dot product hash [Neyshabur and Srebro, 2015]. As baselines we consider sparsification using a scan over the entire dataset, the stochastic greedy algorithm with lazy evaluations [Mirzasoleiman et al., 2015], and the greedy algorithm with lazy evaluations [Minoux, 1978]. The number of elements chosen was set to 40, and for the LSH method and stochastic greedy we average over five trials.

We then do a scan over the sparsity parameter t for the sparsification methods and a scan over the number of samples drawn each iteration for the stochastic greedy algorithm. The sparsified methods use the (non-stochastic) lazy greedy algorithm as the base optimization algorithm, which we found worked best for this particular problem2. In Figure 2.1(a) we see that the LSH method very quickly approaches the greedy solution; it is almost identical in value after just a few seconds, even though the value of t is much less than t*(ε). The stochastic greedy method requires much more time to reach the same function value. Lazy greedy is not plotted, since it took over 500 seconds to finish.

A performance metric that can be better than the objective value is the fraction of elements returned that are common with the greedy algorithm. We treat this as a proxy for the interpretability of the results. We believe this metric is reasonable since we found the subset returned by the greedy algorithm to be quite interpretable. We plot this metric against runtime in Figure 2.1(b). We see that the LSH method quickly reaches 90% of the elements in the greedy set, while stochastic greedy takes much longer to reach just 70% of the elements. The exact sparsification method is able to completely match the greedy solution at this point. One interesting feature is that the LSH method does not go much higher than 90%. This may be due to the increased inaccuracy when looking at elements with smaller dot products. We plot this metric against the number of exact and approximate nearest neighbors t in Figure 2.4 in the appendix.

2When experimenting on very large datasets, we found that runtime constraints can make it necessary to use stochastic greedy as the base optimization algorithm.

[Figure: (a) Fraction of Greedy Set Value vs. Runtime; (b) Fraction of Greedy Set Contained vs. Runtime. Legend: Exact top-t, LSH top-t, Stochastic Greedy.]

Figure 2.1: Results for the MovieLens dataset [GroupLens, 2015]. Figure (a) shows the function value as the runtime increases, normalized by the value the greedy algorithm obtained. As can be seen, our algorithm is within 99.9% of greedy in less than 5 seconds. For this experiment, the greedy algorithm had a runtime of 512 seconds, so this is a 100x speed up for a small penalty in performance. We also compare to the stochastic greedy algorithm [Mirzasoleiman et al., 2015], which needs 125 seconds to get equivalent performance, which is 25x slower. Figure (b) shows the fraction of the set returned by each method that was common with the set returned by greedy. We see that the approximate nearest neighbor method has 90% of its elements in common with the greedy set while being 50x faster than greedy, and using exact nearest neighbors can perfectly match the greedy set while being 4x faster than greedy.

Table 2.1: A subset of the summarization outputted by our algorithm on the MovieLens dataset, plus the elements represented by each representative with the largest dot product. Each group has a natural interpretation: 90's slapstick comedies, 80's horror, cult classics, etc. Note that this was obtained with only a similarity matrix obtained from ratings.

- Happy Gilmore: Tommy Boy, Billy Madison, Dumb & Dumber, Ace Ventura Pet Detective, Road Trip, American Pie 2, Black Sheep
- Nightmare on Elm Street: Friday the 13th, Halloween II, Nightmare on Elm Street 3, Child's Play, Return of the Living Dead II, Friday the 13th 2, Puppet Master
- Star Wars IV: Star Wars V, Raiders of the Lost Ark, Star Wars VI, Indiana Jones, Last Crusade, Terminator 2, The Terminator, Star Trek II
- Shawshank Redemption: Schindler's List, The Usual Suspects, Life Is Beautiful, Saving Private Ryan, American History X, The Dark Knight, Good Will Hunting
- Pulp Fiction: Reservoir Dogs, American Beauty, A Clockwork Orange, Trainspotting, Memento, Old Boy, No Country for Old Men
- The Notebook: P.S. I Love You, The Holiday, Remember Me, A Walk to Remember, The Proposal, The Vow, Life as We Know It
- Pride and Prejudice: Anne of Green Gables, Persuasion, Emma, Mostly Martha, Desk Set, The Young Victoria, Mansfield Park
- The Godfather: The Godfather II, One Flew Over the Cuckoo's Nest, Goodfellas, Apocalypse Now, Chinatown, 12 Angry Men, Taxi Driver

We include a subset of the summarization in Table 2.1, showing for each representative the elements it represents with the largest dot product, to demonstrate the interpretability of our results.

2.5.2 Finding Influential Actors and Actresses

For our second experiment, we consider how to find a diverse subset of actors and actresses in a collaboration network. We place an edge between two actors or actresses if they collaborated in a movie together, weighted by the number of collaborations. Data was obtained from [IMDb, 2016], and an actor or actress was only included if he or she was one of the top six in the cast billing. As a measure of influence, we use personalized PageRank [Page et al., 1999]. To quickly calculate the people with the largest influence relative to someone, we used the random walk method [Avrachenkov et al., 2007].

We first consider a small instance where we can see how well the sparsified approach works. We build a graph based on the cast of the top thousand most rated movies. This graph has roughly 6000 vertices and 60,000 edges. We then calculate the entire PPR matrix using the power method. Note that this is infeasible on larger graphs in terms of time and memory; even on this moderate-sized graph it took six hours and 2 GB of space. We then compare the value of the greedy algorithm using the entire PPR matrix with the sparsified algorithm using the matrix approximated by Monte Carlo sampling, using the two metrics mentioned in the previous section. We omit exact nearest neighbors and stochastic greedy because it is not clear how they would work without computing the entire PPR matrix. Instead we compare to an approach where we choose a sample from I and calculate the PPR only on these elements using the power method. As mentioned in Section 2.2, several algorithms utilize random sampling from I. We take k to be 50 for this instance. In Figure 2.5 in the appendix we see that sparsification performs drastically better in both function value and percent of the greedy set contained for a given runtime.

We now scale up to a larger graph by taking the actors and actresses billed in the top six for the twenty thousand most rated movies. This graph has 57,000 vertices and 400,000 edges. We would not be able to compute the entire PPR matrix for this graph in a reasonable amount of time or space. However, we can run the sparsified algorithm in three hours using only 2 GB of memory, which could be improved further by parallelizing the Monte Carlo approximation.

We run the greedy algorithm separately on the actors and actresses. For each we take the top twenty-five and compare to the actors and actresses with the largest (non-personalized) PageRank. In Table 2.2 of the appendix, we see that the PageRank output fails to capture the diversity in nationality of the dataset, while the facility location optimization returns actors and actresses from many of the world's film industries.

2.6 Appendix: Additional Figures

Algorithm 1 Greedy algorithm with sparsity graph
Input: benefit matrix C, sparsity graph G = (V, I, E)
define N(v): return the neighbors of v in G
for all i ∈ I:
    β_i ← 0    # cache of the current benefit given to i
A ← ∅
for k iterations:
    for all v ∈ V:
        g_v ← 0    # calculate the gain of element v
        for all i ∈ N(v):
            g_v ← g_v + max(C_{iv} − β_i, 0)    # add the gain of element v from i
    v* ← arg max_{v∈V} g_v
    A ← A ∪ {v*}
    for all i ∈ N(v*):
        β_i ← max(β_i, C_{iv*})    # update the cache of the current benefit for i
return A
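A direct Python rendering of Algorithm 1 might look as follows; this is a sketch in which `neighbors[v]` encodes N(v) as an adjacency list, and the toy matrix in the usage below is our own illustration.

```python
import numpy as np

def sparse_greedy(C, neighbors, k):
    """Greedy facility location on a sparsified instance (Algorithm 1).

    neighbors[v] lists the i in I with C[i, v] > 0, and beta caches the
    best benefit each i currently receives, so each iteration only
    touches nonzero entries of C.
    """
    m, n = C.shape
    beta = np.zeros(m)
    A = []
    for _ in range(k):
        best_v, best_gain = None, -1.0
        for v in range(n):
            if v in A:
                continue
            # Gain of v: improvement over the cached benefits beta.
            gain = sum(max(C[i, v] - beta[i], 0.0) for i in neighbors[v])
            if gain > best_gain:
                best_v, best_gain = v, gain
        A.append(best_v)
        for i in neighbors[best_v]:
            beta[i] = max(beta[i], C[i, best_v])
    return A

C = np.array([[1.0, 0.2, 0.0, 0.5],
              [0.1, 0.9, 0.3, 0.0],
              [0.0, 0.4, 0.8, 0.6]])
neighbors = {v: [i for i in range(3) if C[i, v] > 0] for v in range(4)}
A = sparse_greedy(C, neighbors, 2)
```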

[Figure: (a) MovieLens; (b) IMDb. Size of covering set |Γ| vs. sparsity t.]

Figure 2.2: (a) MovieLens Dataset GroupLens [2015] and (b) IMDb Dataset IMDb [2016] as explained in the Experiments Section. We see that for sparsity t significantly smaller than the n/(αk) lower bound we can still find a small covering set in the t-nearest neighbor graph.

[Figure: (a) MovieLens; (b) IMDb. Fraction of elements coverable vs. fraction of elements kept.]

Figure 2.3: (a) MovieLens Dataset GroupLens [2015] and (b) IMDb Dataset IMDb [2016], as explained in the Experiments Section. We see that even with several orders of magnitude fewer edges than the complete graph we can still find a small set that covers a large fraction of the dataset. For MovieLens this set was of size 40 and for IMDb this set was of size 50. The number of coverable elements was estimated by the greedy algorithm for the max-coverage problem.

[Figure: Fraction of Greedy Set Contained vs. Sparsity t, for exact top-t and LSH top-t.]

Figure 2.4: The fraction of the greedy solution that was contained as the value of the sparsity t was increased, for exact nearest neighbor and approximate LSH-based nearest neighbor on the MovieLens dataset. We see that the exact method captures slightly more of the greedy solution for a given value of t, and the LSH value does not converge to 1. However, LSH still captures a reasonable amount of the greedy set and is significantly faster at finding nearest neighbors.

[Figure: (a) Fraction of Greedy Set Value vs. Runtime; (b) Fraction of Greedy Set Contained vs. Runtime, for Monte Carlo PPR and Sample PPR.]

Figure 2.5: Results for the IMDb dataset IMDb [2016]. Figure (a) shows the function value as the runtime increases, normalized by the value the greedy algorithm obtained. As can be seen, our algorithm is within 99% of greedy in less than 10 minutes. For this experiment, the greedy algorithm had a runtime of six hours, so this is a 36x acceleration for a small penalty in performance. We also compare to using a small sample of the set I as an estimate of the function, which does not perform nearly as well as our algorithm even with much longer runtimes. Figure (b) shows the fraction of the set returned by each method that was common with the set returned by greedy. We see that the approximate nearest neighbor method has 90% of its elements in common with the greedy set while being 18x faster than greedy.

Table 2.2: The top twenty-five actors and actresses generated by sparsified facility location optimization defined by the personalized PageRank of a 57,000 vertex movie personnel collaboration graph from [IMDb, 2016], and the twenty-five actors and actresses with the largest (non-personalized) PageRank. We see that the classical PageRank approach fails to capture the diversity of nationality in the dataset, while the facility location results have actors and actresses from many of the world's film industries.

Actors Actresses Facility Location PageRank Facility Location PageRank Robert De Niro Jackie Chan Julianne Moore Jackie Chan Gérard Depardieu Susan Sarandon Susan Sarandon Gérard Depardieu Robert De Niro Kemal Sunal Michael Caine Isabelle Huppert Shah Rukh Khan Samuel L. Jackson Kareena Kapoor Catherine Deneuve Michael Caine Christopher Lee Juliette Binoche Kristin Scott Thomas John Wayne Donald Sutherland Meryl Streep Samuel L. Jackson Peter Cushing Adile Naşit Bette Davis Bud Spencer Nicolas Cage Catherine Deneuve Nicole Kidman Peter Cushing John Wayne Li Gong Charlotte Rampling Toshirô Mifune John Cusack Helena Bonham Carter Helena Bonham Carter Steven Seagal Christopher Walken Penélope Cruz Kathy Bates Moritz Bleibtreu Bruce Willis Naomi Watts Naomi Watts Jean-Claude Van Damme Kemal Sunal Masako Nozawa Cate Blanchett Mads Mikkelsen Harvey Keitel Drew Barrymore Drew Barrymore Michael Ironside Amitabh Bachchan Charlotte Rampling Amitabh Bachchan Shah Rukh Khan Golshifteh Farahani Michelle Pfeiffer Ricardo Darín Sean Bean Penélope Cruz Charles Chaplin Steven Seagal Toni Collette Sigourney Weaver Sean Bean Jean-Claude Van Damme Kati Outinen Toni Collette Louis de Funès Morgan Freeman Edna Purviance Catherine Keener Tadanobu Asano Christian Slater Monica Bellucci Heather Graham Bogdan Diklic Val Kilmer Kristin Scott Thomas Sandra Bullock Nassar Liam Neeson Catherine Keener Lance Henriksen Gene Hackman Kyôko Kagawa Miranda Richardson

2.7 Appendix: Full Proofs

2.7.1 Proof of Theorem 1

We will use the following two lemmas in the proof of Theorem 1, which are proven later in this section. The first lemma bounds the size of the smallest set of left vertices covering every right vertex in a t-regular bipartite graph.

Lemma 8. For any bipartite graph G = (V,I,E) such that |V | = n, |I| = m, every vertex i ∈ I has degree at least t, and n ≤ mt, there exists a set of vertices Γ ⊆ V such that every vertex in I has a neighbor in Γ and

$$|\Gamma| \le \frac{n}{t}\left(1 + \ln\frac{mt}{n}\right). \qquad (2.3)$$

The second lemma bounds the rate that the optimal solution grows as a function of k.

Lemma 9. Let $f$ be any normalized submodular function and let $O_2$ and $O_1$ be optimal solutions for their respective sizes, with $|O_2| \ge |O_1|$. We have

$$f(O_2) \le \frac{|O_2|}{|O_1|} f(O_1).$$

We now prove Theorem 1.

Proof. We will take $t^*(\alpha)$ to be the smallest value of $t$ such that the bound on $|\Gamma|$ in Equation (2.3) is at most $\alpha k$. It can be verified that $t^*(\alpha) \le \left\lceil 4 \frac{n}{\alpha k} \max\left\{1, \ln \frac{n}{\alpha k}\right\} \right\rceil$.

Let $\Gamma \subseteq V$ be a set such that every element of $I$ has a $t$-nearest neighbor in $\Gamma$. By Lemma 8, such a set of size at most $\alpha k$ is guaranteed to exist for $t \ge t^*(\alpha)$.

Let $O$ be the optimal set of size $k$ and let $F^{(t)}$ be the objective function of the sparsified function. Let $O_t^k$ and $O_t^{(1+\alpha)k}$ be the optimal solutions to $F^{(t)}$ of size $k$ and $(1+\alpha)k$. We have

$$F(O) \le F(O \cup \Gamma) = F^{(t)}(O \cup \Gamma) \le F^{(t)}\!\left(O_t^{(1+\alpha)k}\right). \qquad (2.4)$$

The first inequality is due to the monotonicity of $F$. The equality is because every element of $I$ would prefer to choose one of their $t$ nearest neighbors and, because of $\Gamma$, they can. The second inequality is because $|O \cup \Gamma| \le (1+\alpha)k$ and $O_t^{(1+\alpha)k}$ is the optimal solution for this size.

Now by Lemma 9, we can bound the loss from shrinking from $O_t^{(1+\alpha)k}$ to $O_t^k$. Applying Lemma 9 and continuing from Equation (2.4) implies

$$F(O) \le (1+\alpha) F^{(t)}\!\left(O_t^k\right).$$

Observe that $F^{(t)}(A) \le F(A)$ for any set $A$ to obtain the final bound.
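To make the objects in this proof concrete, here is a small sketch of the facility location objective and its sparsified version. It assumes, as in the construction above, that $F$ is defined by a benefit matrix $C$ via $F(A) = \sum_i \max_{v \in A} C_{iv}$ and that sparsification keeps only each row's $t$ largest entries; the helper names and the example matrix are ours.

```python
from itertools import chain, combinations

def facility_location(C, A):
    # F(A): each row i of the benefit matrix gets the best benefit offered
    # by an element of A (0 if A is empty).
    return sum(max((row[v] for v in A), default=0) for row in C)

def sparsify(C, t):
    # Keep only each row's t largest benefits (its "t nearest neighbors");
    # all other entries are zeroed out.
    Ct = []
    for row in C:
        keep = set(sorted(range(len(row)), key=lambda v: row[v],
                          reverse=True)[:t])
        Ct.append([row[v] if v in keep else 0 for v in range(len(row))])
    return Ct

C = [[3, 1, 0], [0, 2, 5], [4, 4, 1]]  # made-up benefit matrix
Ct = sparsify(C, t=1)
# F^(t)(A) <= F(A) for every A, as used in the last step of the proof.
for A in chain.from_iterable(combinations(range(3), r) for r in range(4)):
    assert facility_location(Ct, A) <= facility_location(C, A)
```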

2.7.2 Proof of Proposition 3

Define Πn(t) to be the n × (n + 1) matrix where for i = 1, . . . , n we have column i equal to 1 for positions i to i + t − 1, potentially cycling the position back to the beginning if necessary, and then 0 otherwise. For column

$n+1$, make all values $1 - \frac{1}{2n}$. For example,

$$\Pi_6(3) = \begin{pmatrix}
1 & 0 & 0 & 0 & 1 & 1 & 11/12 \\
1 & 1 & 0 & 0 & 0 & 1 & 11/12 \\
1 & 1 & 1 & 0 & 0 & 0 & 11/12 \\
0 & 1 & 1 & 1 & 0 & 0 & 11/12 \\
0 & 0 & 1 & 1 & 1 & 0 & 11/12 \\
0 & 0 & 0 & 1 & 1 & 1 & 11/12
\end{pmatrix}.$$

We will show the lower bound in two parts, when α < 1 and when α ≥ 1.

Proof of case $\alpha \ge 1$. Let $F$ be the facility location function defined on the benefit matrix $C = \Pi_n(\delta \frac{n}{k})$. For $t = \delta \frac{n}{k}$, the sparsified matrix $C^{(t)}$ has all of its elements except the $(n+1)$st column. With $k$ elements, the optimal solution to $F^{(t)}$ is to choose the $k$ elements that let us cover $\delta n$ of the elements of $I$, giving a value of $\delta n$. However if we chose the $(n+1)$st element, we would have gotten a value of $n - 1/2$, giving an approximation of $\frac{\delta}{1 - 1/(2n)}$. Setting $\delta = 1/(1+\alpha)$ and using $\alpha \le n/k$ implies

$$F(O_t) \le \frac{1}{1 + \alpha - 1/k}\,\mathrm{OPT}$$

when we take $t = \frac{1}{1+\alpha}\cdot\frac{|V|-1}{k}$ (note that for this problem $|V| = n+1$).

Proof of case $\alpha < 1$. Let $F$ be the facility location function defined on the benefit matrix

$$C = \begin{pmatrix} \Pi_n\!\left(\frac{1}{\alpha}\frac{n}{k}\right) & 0 \\ 0 & \left(\frac{1}{\alpha}\frac{n}{k} - \frac{1}{2n}\right) I_{n \times n} \end{pmatrix}.$$

For $t = \frac{1}{\alpha}\frac{n}{k}$, the optimal solution to $F^{(t)}$ is to use $\alpha k$ elements to cover all the elements of $\Pi_n$, then use the remaining $(1-\alpha)k$ elements in the identity section of the matrix. This has a value of less than $\frac{1}{\alpha} n$. For $F$, the optimal solution is to choose the $(n+1)$st element of $\Pi_n$, then use the remaining $k-1$ elements in the identity section of the matrix. This has a value of more than $n\left(1 + \frac{1}{\alpha} - \frac{1}{k\alpha} - \frac{1}{n}\right)$, and therefore an approximation of $\frac{1}{1+\alpha-1/k-1/n}$. Note that in this case $|V| = 2n+1$ and so we have

$$F(O_t) \le \frac{1}{1+\alpha-1/k-1/n}\,\mathrm{OPT}$$

when we take $t = \frac{|V|-1}{2\alpha k}$.

2.7.3 Proof of Proposition 4

Proof. The stochastic greedy algorithm works by choosing a set of elements $S_j$ at each iteration, of size $\frac{n}{k} \log \frac{1}{\varepsilon}$. We will assume $m = n$ and $\varepsilon = 1/e$ to simplify notation. We want to show that

$$\sum_{j=1}^{k} \sum_{v \in S_j} d_v = O(nt)$$

with high probability, where $d_v$ is the degree of element $v$ in the sparsity graph. We will show this using Bernstein's Inequality: given $n$ i.i.d. random variables $X_1, \ldots, X_n$ such that $\mathbb{E}(X_\ell) = 0$, $\operatorname{Var}(X_\ell) = \sigma^2$, and $|X_\ell| \le c$ with probability 1, we have

$$\Pr\left(\sum_{\ell=1}^{n} X_\ell \ge \lambda n\right) \le \exp\left(-\frac{n\lambda^2}{2\sigma^2 + \frac{2}{3}c\lambda}\right).$$

We will take $X_\ell$ to be the degree of the $\ell$th element of $V$ chosen uniformly at random, shifted by the mean of $t$. Although in the stochastic greedy algorithm the elements are not chosen i.i.d., but instead in $k$ iterations of sampling without replacement, treating them as i.i.d. random variables for the purposes of Bernstein's Inequality is justified by Theorem 4 of [Hoeffding, 1963].

We have $|X_\ell| \le n$ and $\operatorname{Var}(X_\ell) \le tn$, where the variance bound holds because the variance of a distribution with mean $t$ on support $[0, n]$ is maximized by putting mass $t/n$ on $n$ and $1 - t/n$ on $0$.

If $t \ge \ln n$, take $\lambda = \frac{8}{3}t$. If $t < \ln n$, take $\lambda = \frac{8}{3}\ln n$. This yields

$$\Pr\left(\sum_{j=1}^{k} \sum_{v \in S_j} d_v \ge nt + \frac{8}{3}\max\{nt, n\ln n\}\right) \le \frac{1}{n}.$$

2.7.4 Proof of Lemma 8

We now prove Lemma 8, which is a modification of Theorem 1.2.2 of [Alon and Spencer, 2008].

Proof. Choose a set $X$ by picking each element of $V$ independently with probability $p$, where $p$ is to be decided later. For every element of $I$ without a neighbor in $X$, add one of its neighbors arbitrarily; call the resulting set $Y$. We have $\mathbb{E}(|X \cup Y|) \le np + m(1-p)^t \le np + me^{-pt}$.

Optimizing for $p$ yields $p = \frac{1}{t}\ln\frac{mt}{n}$. This is a valid probability when $\frac{mt}{n} \ge 1$, which we assumed, and when $\frac{m}{n} \le \frac{e^t}{t}$ (we do not need to worry about the latter case because if it does not hold then it implies an inequality weaker than the trivial one $|\Gamma| \le n$).
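The probabilistic argument above translates directly into a randomized construction; below is a sketch (the input format `neighbors[i]`, listing the at-least-$t$ neighbors of each right vertex $i$, is our own convention):

```python
import math
import random

def random_cover(n, m, t, neighbors, seed=0):
    # The proof of Lemma 8 as an algorithm: include each left vertex with
    # probability p = ln(mt/n)/t, then patch every uncovered right vertex
    # by adding one of its neighbors arbitrarily.
    # In expectation |X u Y| <= np + m e^{-pt} = (n/t)(1 + ln(mt/n)),
    # matching the bound (2.3).
    rng = random.Random(seed)
    p = min(1.0, math.log(m * t / n) / t)
    cover = {v for v in range(n) if rng.random() < p}
    for nbrs in neighbors:
        if not any(v in cover for v in nbrs):
            cover.add(nbrs[0])
    return cover

# A random instance: each of m = 200 right vertices has t = 10 neighbors
# among n = 100 left vertices.
rng = random.Random(1)
n, m, t = 100, 200, 10
neighbors = [rng.sample(range(n), t) for _ in range(m)]
Gamma = random_cover(n, m, t, neighbors)
assert all(any(v in Gamma for v in nbrs) for nbrs in neighbors)
```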

38 2.7.5 Proof of Lemma 9

Before we prove Lemma 9, we need the following lemma.

Lemma 10. Let $f$ be any normalized submodular function and let $O$ be an optimal solution for its respective size. Let $A$ be any set. We have

$$f(O \cup A) \le \left(1 + \frac{|A|}{|O|}\right) f(O).$$

We now prove Lemma 9.

Proof. Let

$$A^* = \operatorname*{arg\,max}_{\{A \subseteq O_2 \,:\, |A| \le |O_1|\}} f(A)$$

and let $A' = O_2 \setminus A^*$. Since $A^*$ is optimal for the function when restricted to the ground set $O_2$, by Lemma 10 and the optimality of $O_1$ for sets of size $|O_1|$, we have

$$f(O_2) = f(A^* \cup A') \le \left(1 + \frac{|A'|}{|A^*|}\right) f(A^*) = \frac{|O_2|}{|O_1|} f(A^*) \le \frac{|O_2|}{|O_1|} f(O_1).$$

We now prove Lemma 10.

Proof. Define $f(v \mid A) = f(\{v\} \cup A) - f(A)$. Let $O = \{o_1, \ldots, o_k\}$, where the ordering is arbitrary except that

$$f(o_k \mid O \setminus \{o_k\}) = \min_{i=1,\ldots,k} f(o_i \mid O \setminus \{o_i\}).$$

Let $A = \{a_1, \ldots, a_\ell\}$, where the ordering is arbitrary except that

$$f(a_1 \mid O) = \max_{i=1,\ldots,\ell} f(a_i \mid O).$$

We will first show that

$$f(a_1 \mid O) \le f(o_k \mid O \setminus \{o_k\}). \qquad (2.5)$$

By submodularity, we have

$$f(a_1 \mid O) \le f(a_1 \mid O \setminus \{o_k\}).$$

If it were true that

$$f(a_1 \mid O \setminus \{o_k\}) > f(o_k \mid O \setminus \{o_k\}),$$

then we would have

$$f((O \setminus \{o_k\}) \cup \{a_1\}) = f(a_1 \mid O \setminus \{o_k\}) + \sum_{i=1}^{k-1} f(o_i \mid \{o_1, \ldots, o_{i-1}\}) > \sum_{i=1}^{k} f(o_i \mid \{o_1, \ldots, o_{i-1}\}) = f(O),$$

contradicting the optimality of $O$, thus showing that Inequality (2.5) holds.

Now since for all $i \in \{1, 2, \ldots, k\}$

$$f(a_1 \mid O) \le f(o_k \mid O \setminus \{o_k\}) \le f(o_i \mid O \setminus \{o_i\}) \le f(o_i \mid \{o_1, \ldots, o_{i-1}\}),$$

$f(a_1 \mid O)$ is at most the average of the $f(o_i \mid \{o_1, \ldots, o_{i-1}\})$, which is $\frac{1}{k} \sum_{i=1}^{k} f(o_i \mid \{o_1, \ldots, o_{i-1}\}) = \frac{1}{k} f(O)$ by telescoping, showing that

$$f(a_1 \mid O) \le \frac{1}{k} f(O). \qquad (2.6)$$

Finally, we have

$$f(O \cup A) = f(O) + \sum_{i=1}^{\ell} f(a_i \mid O \cup \{a_1, \ldots, a_{i-1}\}) \le f(O) + \sum_{i=1}^{\ell} f(a_i \mid O) \le f(O) + \ell f(a_1 \mid O) \le \left(1 + \frac{\ell}{k}\right) f(O),$$

which is what we wanted to show.

2.7.6 Proof of Theorem 6

Proof. Let $O$ be the optimal solution to the original problem. Let $F_\tau$ and $F^\tau$ be the functions defined by restricting to the matrix elements with benefit at least $\tau$ and to all remaining elements, respectively. If there exists a set $S$ of size $k$ such that $\mu n$ elements have a neighbor in $S$, then we have

$$F(O) \le F_\tau(O) + F^\tau(O) \le F_\tau(O) + n\tau \le F_\tau(O) + \frac{1}{\mu} F_\tau(S) \le \left(1 + \frac{1}{\mu}\right) F_\tau(O_\tau),$$

where the last inequality follows from $O_\tau$ being optimal for $F_\tau$.

2.7.7 Proof of Lemma 7

Proof. Consider the following algorithm:

B ← ∅
S ← ∅
while |B| ≤ cδn:
    v* ← arg max_v |N(v)|
    add v* to S
    add N(v*) to B
    remove N(v*) ∪ {v*} from G

We will show that this algorithm terminates after $T = \frac{c}{(1-2c^2)\delta}$ iterations. When it does, $S$ will satisfy $|N(S)| \ge c\delta n$ since every element of $B$ has a neighbor in $S$.

If there exists a vertex of degree $c\delta n$, then we will terminate after the first iteration. Otherwise all vertices have degree less than $c\delta n$. In that case, until we terminate, the number of edges incident to $B$ is at most $|B| c\delta n \le c^2\delta^2 n^2$. At each iteration, the number of edges in the graph is at least $\left(\frac{1}{2} - c^2\right)\delta^2 n^2$, thus in each iteration we can find a $v^*$ with degree at least $(1-2c^2)\delta^2 n$. Therefore, after $T$ iterations, we will have terminated with $|S|$ at most $T$ and $|N(S)| \ge c\delta n$.

We see that this is tight up to constant factors by the following proposition.

Proposition 11. There exists an example where, for $\Delta = \delta^2 n$, the optimal solution to the sparsified function is a factor of $O(\delta)$ from the optimal solution to the original function.

Proof. Consider the following benefit matrix.

$$C = \begin{pmatrix} \mathbf{1}_{\delta n \times \delta n} + \left(1 + \frac{1}{k-1}\right) I & 0 \\ 0 & \left(1 - \frac{1}{(1-\delta)n}\right) \mathbf{1}_{(1-\delta)n \times (1-\delta)n} \end{pmatrix}$$

The sparsified optimal would only choose elements in the top left clique and would get a value of roughly δn, while the true optimal solution would cover both cliques and get a value of roughly n.

Chapter 3

Exact MAP Inference by Avoiding Fractional Vertices

3.1 Introduction

Given a graphical model, one essential problem is MAP inference, that is, finding the most likely configuration of states according to the model.

Consider graphical models with binary random variables and pairwise interactions, also known as Ising models. For a graph $G = (V, E)$ with node weights $\theta \in \mathbb{R}^V$ and edge weights $W \in \mathbb{R}^E$, the probability of a variable configuration is given by

$$\Pr(X = x) = \frac{1}{Z} \exp\left(\sum_{i \in V} \theta_i x_i + \sum_{ij \in E} W_{ij} x_i x_j\right), \qquad (3.1)$$

where $Z$ is a normalization constant.
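For very small models, the exponent of Equation (3.1) can be maximized by exhaustive search, which is a useful reference point. The sketch below uses our own data conventions (`W` maps edge pairs $(i, j)$ to weights):

```python
from itertools import product

def map_bruteforce(theta, W):
    # Maximize the exponent of Eq. (3.1) over all 2^n configurations;
    # only feasible for small n.
    n = len(theta)
    def score(x):
        return (sum(theta[i] * x[i] for i in range(n))
                + sum(w * x[i] * x[j] for (i, j), w in W.items()))
    return max(product((0, 1), repeat=n), key=score)

# Two nodes with a strong attractive edge: the MAP configuration is (1, 1).
assert map_bruteforce([1.0, -1.0], {(0, 1): 2.0}) == (1, 1)
```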

The MAP problem is to find the configuration $x \in \{0,1\}^V$ that maximizes Equation (3.1). We can write this as an integer linear program (ILP) as follows:

$$\begin{aligned} \max_{q \in \mathbb{R}^{V \cup E}} \quad & \sum_{i \in V} \theta_i q_i + \sum_{ij \in E} W_{ij} q_{ij} \\ \text{s.t.} \quad & q_i \in \{0, 1\} && \forall i \in V \qquad (3.2) \\ & q_{ij} \ge \max\{0, q_i + q_j - 1\} && \forall ij \in E \\ & q_{ij} \le \min\{q_i, q_j\} && \forall ij \in E. \end{aligned}$$

The MAP problem on binary, pairwise graphical models contains, as a special case, the Max-Cut problem and is therefore NP-hard. For this reason, a significant amount of attention has focused on analyzing the LP relaxation of the ILP, which can be solved efficiently in practice.

$$\begin{aligned} \max_{q \in \mathbb{R}^{V \cup E}} \quad & \sum_{i \in V} \theta_i q_i + \sum_{ij \in E} W_{ij} q_{ij} \\ \text{s.t.} \quad & 0 \le q_i \le 1 && \forall i \in V \qquad (3.3) \\ & q_{ij} \ge \max\{0, q_i + q_j - 1\} && \forall ij \in E \\ & q_{ij} \le \min\{q_i, q_j\} && \forall ij \in E. \end{aligned}$$

This relaxation has been an area of intense research in machine learning and statistics. In [Meshi et al., 2016], the authors state that a major open question is to identify why real world instances of Problem (3.2) can be solved efficiently despite the theoretical worst case complexity.

We make progress on this open problem by analyzing the fractional vertices of the LP relaxation, that is, the extreme points of the polytope with fractional coordinates. Vertices of the relaxed polytope with fractional coordinates are called pseudomarginals for graphical models and pseudocodewords in coding theory. If a fractional vertex has a higher objective value (i.e. likelihood) compared to the best integral one, the LP relaxation fails. We call fractional vertices with an objective value at least as good as the objective value of the optimal integral vertex confounding vertices. Our main result is that it is possible to prune all confounding vertices efficiently when their number is polynomial.

Our contributions:

• Our first contribution is a general result on integer programs. We show that any 0-1 integer linear program (ILP) can be solved exactly in polynomial time if the number of confounding vertices is bounded by a polynomial. This applies to MAP inference for a graphical model over any alphabet size and any order of connection. The same result (exact solution if the number of confounding vertices is bounded by a polynomial) was established by Dimakis et al. [2009] for the special case of LP decoding of LDPC codes [Feldman et al., 2005]. The algorithm from Dimakis et al. [2009] relies on the special structure of the graphical models that correspond to LDPC codes. In this paper we generalize this result to any ILP in the unit hypercube. Our results extend to finding all integral vertices among the M-best vertices.

• Given our condition, one may be tempted to think that we can generate the top M vertices of a linear program (for M polynomial) and output the best integral one in this list. We actually show that such an approach would be computationally intractable. Specifically, we show that it is NP-hard to produce a list of the M-best vertices if $M = O(n^\varepsilon)$ for any fixed $\varepsilon > 0$. This result holds even if the list is allowed to be approximate. This strengthens the previously known hardness result [Angulo et al., 2014], which was for $M = O(n)$ and the exact M-best vertices. In terms of achievability, the best previously known result (from [Angulo et al., 2014]) can only solve the ILP if there is at most a constant number of confounding vertices.

• We obtain a complete characterization of the fractional vertices of the local polytope for binary, pairwise graphical models. We show that any variable in the fractional support must be connected to a frustrated cycle by other fractional variables in the graphical model. This is a complete structural characterization that was not previously known, to the best of our knowledge.

• We develop an approach to estimate the number of confounding vertices of a half-integral polytope. We use this method in an empirical evaluation of the number of confounding vertices of previously studied problems and analyze how well common integer programming techniques perform at pruning confounding vertices.

3.2 Background and Related Work

For some classes of graphical models, it is possible to solve the MAP problem exactly. For example see [Weller et al., 2016] for balanced and almost balanced models, [Jebara, 2009] for perfect graphs, and [Wainwright et al., 2008] for graphs with constant tree-width.

These conditions are often not true in practice, and a wide variety of general purpose algorithms are able to solve the MAP problem for large inputs. One class is belief propagation and its variants [Yedidia et al., 2000, Wainwright et al., 2003, Sontag et al., 2008]. Another class involves general ILP optimization methods (see e.g. [Nemhauser and Wolsey, 1999]). Techniques specialized to graphical models include cutting-plane methods based on the cycle inequalities [Sontag and Jaakkola, 2007, Komodakis and Paragios, 2008, Sontag et al., 2012]. See also [Kappes et al., 2013] for a comparative survey of techniques.

In [Weller et al., 2014], the authors investigate how pseudomarginals and relaxations relate to the success of the Bethe approximation of the partition function.

There has been substantial prior work on improving inference building on these LP relaxations, especially for LDPC codes in the information theory community. This work ranges from very fast solvers that exploit the special structure of the polytope [Burshtein, 2009], to connections to unequal error protection [Dimakis et al., 2007], and graphical model covers [Koetter et al., 2007]. LP decoding currently provides the best known finite-length error-correction bounds for LDPC codes, both for random [Daskalakis et al., 2008, Arora et al., 2009] and adversarial bit-flipping errors [Feldman et al., 2007].

For binary graphical models, there is a body of work which tries to exploit the persistency of the LP relaxation, that is, the property that integer components in the solution of the relaxation must take the same value in the optimal solution, under some regularity assumptions [Boros and Hammer, 2002, Rother et al., 2007, Fix et al., 2012].

Fast algorithms for solving large graphical models in practice include [Ihler et al., 2012, Dechter and Rish, 2003].

The work most closely related to this paper involves eliminating fractional vertices (so-called pseudocodewords in coding theory) by changing the polytope or the objective function [Zhang and Siegel, 2012, Chertkov and Stepanov, 2008, Liu et al., 2012].

3.3 Provable Integer Programming

A binary integer linear program is an optimization problem of the following form:

$$\begin{aligned} \max_x \quad & c^T x \\ \text{subject to} \quad & Ax \le b \\ & x \in \{0, 1\}^n, \end{aligned}$$

which is relaxed to a linear program by replacing the constraint $x \in \{0,1\}^n$ with $0 \le x \le 1$. For binary integer programs with the box constraints $0 \le x_i \le 1$ for all $i$, every integral vector $x$ is a vertex of the polytope described by the constraints of the LP relaxation. However, fractional vertices may also be in this polytope, and fractional solutions can potentially have an objective value larger than every integral vertex.

If the optimal solution to the linear program happens to be integral, then this is the optimal solution to the original integer linear program. If the optimal solution is fractional, then a variety of techniques are available to tighten the LP relaxation and eliminate the fractional solution.

We establish a success condition for integer programming based on the number of confounding vertices, which to the best of our knowledge was unknown. The algorithm used in proving Theorem 12 is a version of branch-and-bound, a classic technique in integer programming [Land and Doig, 1960] (see [Nemhauser and Wolsey, 1999] for a modern reference on integer programming). This algorithm works by starting with a root node, then branching on a fractional coordinate by making two new linear programs with all the constraints of the parent node, with the constraint $x_i = 0$ added to one new leaf and $x_i = 1$ added to the other. The decision on which leaf of the tree to branch on next is based on which leaf has the best objective value. When the best leaf is integral, we know that this is the best integral solution. This algorithm is formally written in Algorithm 2.
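Algorithm 2 is stated formally in the dissertation; the sketch below only illustrates the same best-first branch-and-bound idea on a toy 0-1 knapsack ILP, where the LP relaxation can be solved exactly by the greedy fractional rule (standing in for a general LP solver; the function names and the instance are ours):

```python
import heapq

def knapsack_lp(c, a, b, fixed):
    # LP relaxation: max c.x subject to a.x <= b and 0 <= x <= 1, with some
    # coordinates fixed to 0/1 by `fixed`. For positive c and a this LP is
    # solved exactly by the greedy fractional-knapsack rule.
    n = len(c)
    cap = b - sum(a[i] for i, v in fixed.items() if v == 1)
    if cap < 0:
        return None  # fixing these coordinates to 1 is already infeasible
    val = float(sum(c[i] for i, v in fixed.items() if v == 1))
    x = [float(fixed.get(i, 0)) for i in range(n)]
    free = sorted((i for i in range(n) if i not in fixed),
                  key=lambda i: c[i] / a[i], reverse=True)
    for i in free:
        take = min(1.0, cap / a[i])
        x[i] = take
        val += take * c[i]
        cap -= take * a[i]
        if cap <= 0:
            break
    return val, x

def branch_and_bound(c, a, b):
    # Best-first branch and bound: always expand the leaf with the best LP
    # bound; the first time the best leaf is integral, it is optimal.
    tick = 0
    heap = [(-knapsack_lp(c, a, b, {})[0], tick, {})]
    while heap:
        _, _, fixed = heapq.heappop(heap)
        val, x = knapsack_lp(c, a, b, fixed)
        frac = [i for i, xi in enumerate(x) if 1e-9 < xi < 1 - 1e-9]
        if not frac:
            return val, x  # best leaf is integral, hence optimal
        for v in (0, 1):  # branch on the first fractional coordinate
            child = dict(fixed)
            child[frac[0]] = v
            sol = knapsack_lp(c, a, b, child)
            if sol is not None:
                tick += 1
                heapq.heappush(heap, (-sol[0], tick, child))

val, x = branch_and_bound([10, 6, 4], [5, 4, 3], 7)
assert val == 10 and x == [1.0, 0.0, 0.0]
```

Each branch adds a tight box constraint $x_i = 0$ or $x_i = 1$, which, as noted below, never creates new fractional vertices.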

Theorem 12. Let $x^*$ be the optimal integral solution and let $\{v_1, v_2, \ldots, v_M\}$ be the set of confounding vertices in the LP relaxation. Algorithm 2 will find the optimal integral solution $x^*$ after $2M$ calls to an LP solver.

Since MAP inference is a binary integer program regardless of the alphabet size of the variables and order of the clique potentials, we have the following corollary:

Corollary 13. Given a graphical model such that the local polytope has $M$ confounding vertices, Algorithm 2 can find the optimal MAP configuration with $2M$ calls to an LP solver.

Cutting-plane methods, which remove a fractional vertex by introducing a new constraint in the polytope, may not have this property, since the cut may create new confounding vertices. This branch-and-bound algorithm has the desirable property that it never creates a new fractional vertex. We note that other branching algorithms, such as the algorithm presented by the authors in [Marinescu and Dechter, 2009], do not immediately allow us to prove our desired theorem.

Note that warm starting a linear program with slightly modified constraints allows subsequent calls to an LP solver to be much more efficient after the root LP has been solved.

3.3.1 Proof of Theorem 12

The proof follows from the following invariants:

• At every iteration we remove at least one fractional vertex.

• Every integral vertex is in exactly one branch.

• Every fractional vertex is in at most one branch.

• No fractional vertices are created by the new constraints.

To see the last invariant, note that every vertex of a polytope can be identified by the set of inequality constraints that are satisfied with equality (see [Bertsimas and Tsitsiklis, 1997]). By forcing an inequality constraint to be tight, we cannot possibly introduce new vertices.

3.3.2 The M-Best LP Problem

As mentioned in the introduction, the algorithm used to prove Theorem 12 does not enumerate all the fractional vertices until it finds an integral vertex. Enumerating the M-best vertices of a linear program is the M-best LP problem.

Definition 14. Given a linear program $\{\min c^T x : x \in P\}$ over a polytope $P$ and a positive integer $M$, the M-best LP problem is to optimize

$$\max_{\{v_1, \ldots, v_M\} \subseteq V(P)} \sum_{k=1}^{M} c^T v_k.$$

This was established by [Angulo et al., 2014] to be NP-hard when $M = O(n)$. We strengthen this result to hardness of approximation even when $M = n^\varepsilon$ for any $\varepsilon > 0$.

Theorem 15. It is NP-hard to approximate the M-best LP problem by a factor better than $O\!\left(\frac{n^\varepsilon}{M}\right)$ for any fixed $\varepsilon > 0$.

Proof. Consider the circulation polytope described in [Khachiyan et al., 2008], with the graph and weight vector $w$ described in [Boros et al., 2011]. By adding an $O(\log M)$ long series of $2 \times 2$ bipartite subgraphs, we can make it such that one long path in the original graph implies $M$ long paths in the new graph, and thus it is NP-hard to find any of these long paths in the new graph. By adding the constraint vector $w^T x \le 0$ and using the cost function $-w$, the vertices corresponding to the short paths have value $1/2$, the vertices corresponding to the long paths have value $O(1/n)$, and all other vertices have value 0. Thus the optimal set has value $O(n + \frac{M}{n})$. However it is NP-hard to find a set of value greater than $O(n)$ in polynomial time, which gives an $O(\frac{n}{M})$ approximation. Using a padding argument, we can replace $n$ with $n^\varepsilon$.

The best known algorithm for the M-best LP problem is a generalization of the facet guessing algorithm [Dimakis et al., 2009] developed in [Angulo et al., 2014], which would require $O(m^M)$ calls to an LP solver, where $m$ is the number of constraints of the LP. Since we only care about integral solutions, we can find the single best integral vertex with $O(M)$ calls to an LP solver, and if we want all of the K-best integral solutions among the top $M$ vertices of the polytope, we can find these with $O(nK + M)$ calls to an LP solver, as we will see in the next section.

3.3.3 K-Best Integral Solutions

Finding the K-best solutions to general optimization problems has been uses in several machine learning applications. Producing multiple high- value outputs can be naturally combined with post-processing algorithms that select the most desired solution using additional side-information. There is a significant volume of work in the general area, see [Fromer and Globerson, 2009, Batra et al., 2012] for MAP solutions in graphical models and [Eppstein, 2014] for a survey on M-best problems.

We further generalize our theorem to find the K-best integral solutions.

Theorem 16. Under the assumption that there are fewer than $M$ fractional vertices with objective value at least as good as the K-best integral solutions, we can find all of the K-best integral solutions with $O(nK + M)$ calls to an LP solver.

The algorithm used in this theorem is Algorithm 3. It combines Algorithm 2 with the space partitioning technique used in [Murty, 1968, Lawler, 1972]. If the current optimal solution in the solution tree is fractional, then we use the branching technique in Algorithm 2. If the current optimal solution in the solution tree $x^*$ is integral, then we branch by creating a new leaf for every $i$ not currently constrained by the parent, with the constraint $x_i = \neg x_i^*$.

3.4 Fractional Vertices of the Local Polytope

We now describe the fractional vertices of the local polytope for binary, pairwise graphical models, which is described in Equation (3.3). It was shown in [Padberg, 1989] that all the vertices of this polytope are half-integral, that is, all coordinates have a value from $\{0, \frac{1}{2}, 1\}$ (see [Weller et al., 2016] for a new proof of this).

Given a half-integral point $q \in \{0, \frac{1}{2}, 1\}^{V \cup E}$ in the local polytope, we say that a cycle $C = (V_C, E_C) \subseteq G$ is frustrated if there is an odd number of edges $ij \in E_C$ such that $q_{ij} = 0$. If a point $q$ has a frustrated cycle, then it is a pseudomarginal, as no probability distribution exists that has as its singleton and pairwise marginals the coordinates of $q$. Half-integral points $q$ with a frustrated cycle do not satisfy the cycle inequalities [Sontag and Jaakkola, 2007, Wainwright et al., 2008]: for all cycles $C = (V_C, E_C)$ and all $F = (V_F, E_F) \subseteq C$ with $|E_F|$ odd, we must have

$$\sum_{ij \in E_F} (q_i + q_j - 2q_{ij}) - \sum_{ij \in E_C \setminus E_F} (q_i + q_j - 2q_{ij}) \le |E_F| - 1. \qquad (3.4)$$

Frustrated cycles allow a solution to be zero on negative weights in a way that is not possible for any integral solution.

We have the following theorem describing all the vertices of the local polytope for binary, pairwise graphical models.

Theorem 17. Given a point $q$ in the local polytope, $q$ is a vertex of this polytope if and only if $q \in \{0, \frac{1}{2}, 1\}^{V \cup E}$ and the induced subgraph on the fractional nodes of $q$ is such that every connected component of this subgraph contains a frustrated cycle.
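The characterization in Theorem 17 can be checked mechanically. A cycle is frustrated exactly when it has an odd number of edges with $q_{ij} = 0$, so a component contains a frustrated cycle if and only if the parity labeling below (edges with $q_{ij} = 0$ flip the label, edges with $q_{ij} = \frac{1}{2}$ preserve it) cannot be assigned consistently. The sketch below uses this equivalence; the input conventions are our own.

```python
from collections import deque

def satisfies_theorem_17(frac_nodes, edges):
    # edges: dict mapping pairs (i, j) of fractional nodes to q_ij in {0, 0.5}.
    # Returns True iff every connected component of the fractional subgraph
    # contains a frustrated cycle.
    adj = {u: [] for u in frac_nodes}
    for (i, j), qij in edges.items():
        parity = 1 if qij == 0 else 0  # 0-edges flip the label
        adj[i].append((j, parity))
        adj[j].append((i, parity))
    label = {}
    for s in frac_nodes:
        if s in label:
            continue
        label[s] = 0
        frustrated = False
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v, p in adj[u]:
                if v not in label:
                    label[v] = label[u] ^ p
                    queue.append(v)
                elif label[v] != label[u] ^ p:
                    frustrated = True  # some cycle has odd 0-edge count
        if not frustrated:
            return False  # this component has no frustrated cycle
    return True

# Triangle with all q_ij = 0: the cycle has three 0-edges (odd), so the
# condition of Theorem 17 holds; with all q_ij = 1/2 it does not.
tri = [0, 1, 2]
assert satisfies_theorem_17(tri, {(0, 1): 0, (1, 2): 0, (0, 2): 0})
assert not satisfies_theorem_17(tri, {(0, 1): 0.5, (1, 2): 0.5, (0, 2): 0.5})
```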

3.4.1 Proof of Theorem 17

Every vertex $q$ of an $n$-dimensional polytope is such that there are $n$ linearly independent constraints that $q$ satisfies with equality, known as active constraints (see [Bertsimas and Tsitsiklis, 1997]). Every integral $q$ is thus a vertex of the local polytope. We now describe the fractional vertices of the local polytope.

Definition 18. Let $q \in \{0, \frac{1}{2}, 1\}^{n+m}$ be a point of the local polytope. Let $G_F = (V_F, E_F)$ be an induced subgraph on nodes such that $q_i = \frac{1}{2}$ for all $i \in V_F$. We say that $G_F$ is full rank if the following system of equations is full rank:

$$\begin{aligned} q_i + q_j - q_{ij} &= 1 && \forall ij \in E_F \text{ such that } q_{ij} = 0 \\ q_{ij} &= 0 && \forall ij \in E_F \text{ such that } q_{ij} = 0 \qquad (3.5) \\ q_i - q_{ij} &= 0 && \forall ij \in E_F \text{ such that } q_{ij} = \tfrac{1}{2} \\ q_j - q_{ij} &= 0 && \forall ij \in E_F \text{ such that } q_{ij} = \tfrac{1}{2} \end{aligned}$$

Theorem 17 follows from the following lemmas.

Lemma 19. Let $q \in \{0, \frac{1}{2}, 1\}^{n+m}$ be a point of the local polytope. Let $G_F = (V_F, E_F)$ be the subgraph induced by the nodes $i \in V$ such that $q_i = \frac{1}{2}$. The point $q$ is a vertex if and only if every connected component of $G_F$ is full rank.

Lemma 20. Let $q \in \{0, \frac{1}{2}, 1\}^{n+m}$ be a point of the local polytope. Let $G_F = (V_F, E_F)$ be a connected subgraph induced from nodes such that $q_i = \frac{1}{2}$ for all $i \in V_F$. $G_F$ is full rank if and only if $G_F$ contains a cycle that is full rank.

Lemma 21. Let $q \in \{0, \frac{1}{2}, 1\}^{n+m}$ be a point of the local polytope. Let $C = (V_C, E_C)$ be a cycle of $G$ such that $q_i$ is fractional for all $i \in V_C$. $C$ is full rank if and only if $C$ is a frustrated cycle.

Proof of Lemma 19. Suppose every connected component is full rank. Then every fractional node and every edge between fractional nodes is fully specified by their corresponding equations in Problem (3.3). It is easy to check that all integral nodes, edges between integral nodes, and edges between integral and fractional nodes are also fixed. Thus $q$ is a vertex.

Now suppose that there exists a connected component that is not full rank. The only other constraints involving this connected component are those between fractional nodes and integral nodes. However, note that these constraints are always rank 1 and also introduce a new edge variable. Thus the constraints that are tight at $q$ do not form a full rank system of equations.

Proof of Lemma 20. Suppose $G_F$ has a full rank cycle. We will build the graph starting with the full rank cycle, then adding one connected edge at a time. It is easy to see from Equation (3.5) that all new variables introduced to the system of equations have a fixed value, and thus the whole connected component is full rank.

Now suppose $G_F$ has no full rank cycle. We will again build the graph starting from a cycle, then adding one connected edge at a time. If we add an edge that connects to a new node, then we have added two variables and two equations, so we have not made the graph full rank. If we add an edge between two existing nodes, then we have a cycle involving this edge. We introduce two new equations; however, from one of these equations and the other cycle equations we can produce the second, so we can increase the rank by at most one while also introducing a new edge variable. Thus the whole graph cannot be full rank.

The proof of Lemma 21 follows from the following lemma.

Lemma 22. Consider a collection of $n$ vectors

$$\begin{aligned} v_1 &= (1, t_1, 0, \ldots, 0) \\ v_2 &= (0, 1, t_2, 0, \ldots, 0) \\ v_3 &= (0, 0, 1, t_3, 0, \ldots, 0) \\ &\;\;\vdots \\ v_{n-1} &= (0, \ldots, 0, 1, t_{n-1}) \\ v_n &= (t_n, 0, \ldots, 0, 1) \end{aligned}$$

for $t_i \in \{-1, 1\}$. We have $\operatorname{rank}(v_1, v_2, \ldots, v_n) = n$ if and only if there is an odd number of vectors such that $t_i = 1$.

Proof of Lemma 22. Let $k$ be the number of vectors such that $t_i = 1$. Let $S_1 = v_1$ and define

$$S_{i+1} = \begin{cases} S_i - v_{i+1} & \text{if } S_i(i+1) = 1 \\ S_i + v_{i+1} & \text{if } S_i(i+1) = -1 \end{cases}$$

for $i = 1, \ldots, n-2$.

Note that if $t_{i+1} = -1$ then $S_{i+1}(i+2) = S_i(i+1)$, and if $t_{i+1} = 1$ then $S_{i+1}(i+2) = -S_i(i+1)$. Thus the number of times the sign changes is exactly the number of $t_i = 1$ for $i \in \{2, \ldots, n-1\}$.

Using the value of $S_{n-1}$, we can now check for all values of $t_1$ and $t_n$ that the following is true.

• If k is odd then (1, 0,..., 0) ∈ span(v1, v2, . . . , vn), which allows us to create the entire standard basis, showing the vectors are full rank.

• If $k$ is even then $v_n \in \operatorname{span}(v_1, v_2, \ldots, v_{n-1})$ and thus the vectors are not full rank.
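Lemma 22 is easy to confirm exhaustively for small $n$; the sketch below builds the vectors above and computes ranks exactly over the rationals:

```python
from fractions import Fraction
from itertools import product

def cycle_vectors(ts):
    # The vectors v_1, ..., v_n of Lemma 22: v_i has a 1 in position i
    # and t_i in position i+1 (cyclically).
    n = len(ts)
    vs = []
    for i in range(n):
        v = [0] * n
        v[i] = 1
        v[(i + 1) % n] = ts[i]
        vs.append(v)
    return vs

def rank(rows):
    # Exact Gaussian elimination over the rationals.
    m = [[Fraction(x) for x in row] for row in rows]
    r = 0
    for col in range(len(m[0])):
        piv = next((i for i in range(r, len(m)) if m[i][col] != 0), None)
        if piv is None:
            continue
        m[r], m[piv] = m[piv], m[r]
        for i in range(len(m)):
            if i != r and m[i][col] != 0:
                f = m[i][col] / m[r][col]
                m[i] = [x - f * y for x, y in zip(m[i], m[r])]
        r += 1
    return r

# Full rank iff an odd number of t_i equal +1, for every sign pattern.
n = 5
for ts in product((-1, 1), repeat=n):
    odd = sum(1 for t in ts if t == 1) % 2 == 1
    assert (rank(cycle_vectors(ts)) == n) == odd
```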

3.5 Estimating the Number of Confounding Singleton Marginals

In this section we generalize Theorem 12. We see that after every iteration we potentially remove more than one confounding vertex: we remove all confounding vertices that agree with $x_{I_0} = 0$ and $x_{I_1} = 1$ and are fractional with any value at coordinate $i$. We also observe that we can handle a mixed integer program (MIP) with the same algorithm:

$$\begin{aligned} \max_{x, z} \quad & c^T x + d^T z \\ \text{subject to} \quad & Ax + Bz \le b \\ & x \in \{0, 1\}^n. \end{aligned}$$

We will call a vertex $(x, z)$ fractional if its $x$ component is fractional. For each fractional vertex $(x, z)$, we create a half-integral vector $S(x)$ such that

$$S(x)_i = \begin{cases} 0 & \text{if } x_i = 0 \\ \frac{1}{2} & \text{if } x_i \text{ is fractional} \\ 1 & \text{if } x_i = 1. \end{cases}$$

For a set of vertices $V$, we define $S(V)$ to be the set $\{S(x) : (x, z) \in V\}$, i.e. we remove all duplicate entries.

Theorem 23. Let $(x^*, z^*)$ be the optimal integral solution and let $V_C$ be the set of confounding vertices. Algorithm 2 will find the optimal integral solution $(x^*, z^*)$ after $2|S(V_C)|$ calls to an LP solver.

For MAP inference in graphical models, $S(V_C)$ refers to the fractional singleton marginals $q_V$ such that there exists a set of pairwise pseudomarginals $q_E$ such that $(q_V, q_E)$ is a confounding vertex. In this case we call $q_V$ a confounding singleton marginal. We develop Algorithm 4 to estimate the number of confounding singleton marginals for our experiments section. It is based on the k-best enumeration method developed in [Murty, 1968, Lawler, 1972].

Algorithm 4 works by a branching argument. The root node is the original LP. A leaf node is branched on by introducing a new leaf for every node $i \in V$ and every element $a \in \{0, \frac{1}{2}, 1\}$ such that $q_i \ne a$ in the parent node and the constraint $q_i = a$ is not among the constraints of the parent node. For each such $i \in V$ and $a \in \{0, \frac{1}{2}, 1\}$, we create the leaf such that it has all the constraints of its parent plus the constraint $q_i = a$.

Note that Algorithm 4 actually generates a superset of the elements of $S(V_C)$, since the introduction of constraints of the type $q_i = \frac{1}{2}$ introduces vertices into the new polytope that were not in the original polytope. This does not seem to be an issue for the experiments we consider; however, it does occur for other graphs. An interesting question is whether the vertices of the local polytope can be provably enumerated.

3.6 Experiments

We consider a synthetic experiment on randomly created graphical models, which were also used in [Sontag and Jaakkola, 2007, Weller, 2016, Weller et al., 2014]. The graph topology used is the complete graph on 12 nodes.

We first reparametrize the model to use the sufficient statistics $\mathbf{1}(x_i = x_j)$ and $\mathbf{1}(x_i = 1)$. The node weights are drawn $\theta_i \sim \mathrm{Uniform}(-1, 1)$ and the edge weights are drawn $W_{ij} \sim \mathrm{Uniform}(-w, w)$ for varying $w$. The quantity $w$ determines how strong the connections are between nodes. We do 100 draws for each choice of edge strength $w$.
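The random instances described above can be generated in a few lines (a sketch; the function name is ours):

```python
import random

def random_complete_model(n=12, w=0.3, seed=0):
    # Complete graph on n nodes: theta_i ~ Uniform(-1, 1) and
    # W_ij ~ Uniform(-w, w) for every pair i < j.
    rng = random.Random(seed)
    theta = [rng.uniform(-1, 1) for _ in range(n)]
    W = {(i, j): rng.uniform(-w, w)
         for i in range(n) for j in range(i + 1, n)}
    return theta, W

theta, W = random_complete_model()
assert len(theta) == 12 and len(W) == 12 * 11 // 2
```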

For the complete graph, we observe that Algorithm 4 does not yield any points that do not correspond to vertices; however, this does occur for other topologies.

We first compare how the number of fractional singleton marginals $|S(V_C)|$ changes with the connection strength $w$. In Figure 3.1, we plot the sample CDF, i.e. the probability that $|S(V_C)|$ is at most some given value. We observe that $|S(V_C)|$ increases as the connection strength increases. Further, we see that while most instances have a small number for $|S(V_C)|$, there are rare instances where $|S(V_C)|$ is quite large.

Next we examine how the number of cycle constraints from Equation (3.4) that must be introduced to find the best integral solution changes with the number of confounding singleton marginals; see Figure 3.2. We use the algorithm of [Sontag and Jaakkola, 2007] for finding the most frustrated cycle to introduce new constraints. We observe that each constraint seems to remove many confounding singleton marginals.

We also observe in Figure 3.3 that the number of confounding singleton marginals introduced by the cycle constraints increases with the number of confounding singleton marginals.


Figure 3.1: Sample CDF of the number of fractional singleton marginals |S(VC)| for connection strengths w ∈ {0.1, 0.2, 0.3}. |S(VC)| increases as the connection strength increases; while most instances have a small |S(VC)|, there are rare instances where |S(VC)| is quite large.


Figure 3.2: Number of cycle constraints from Equation (3.4) needed to find the best integral solution, versus the number of confounding singleton marginals. Constraints are introduced with the most frustrated cycle algorithm of [Sontag and Jaakkola, 2007]. Each constraint seems to remove many confounding singleton marginals.


Figure 3.3: Number of confounding singleton marginals introduced by the cycle constraints, versus the number of confounding singleton marginals |S(VC)|.

Finally, in Figure 3.4, we observe that the number of branches needed to find the optimal solution increases with the number of confounding singleton marginals. A similar trend arises as with the number of cycle inequalities introduced. To compare the methods, note that branch-and-bound uses twice as many LP calls as there are branches. For this family of graphical models, branch-and-bound tends to require fewer calls to an LP solver than the cut constraints.


Figure 3.4: Number of branch-and-bound branches needed to find the optimal solution, versus the number of confounding singleton marginals. The trend is similar to that of the number of cycle inequalities introduced. Branch-and-bound uses twice as many LP calls as there are branches; for this family of graphical models it tends to require fewer calls to an LP solver than the cut constraints.

3.7 Conclusion

Perhaps the most interesting follow-up question to our work is to determine when, in theory and in practice, the number of confounding pseudomarginals in the LP relaxation is small. Another interesting question is whether it is possible to prune the confounding pseudomarginals at a faster rate. The algorithm presented for our main theorem removes one pseudomarginal after two calls to an LP solver. Is it possible to do this at a faster rate? From our experiments, this seems to be the case in practice.

Algorithm 2 Branch and Bound
Input: an LP {max c^T x : Ax ≤ b, 0 ≤ x ≤ 1}

# branch (v, I0, I1) means v is the optimal LP solution
# with x_{I0} = 0 and x_{I1} = 1.
def LP(I0, I1):
    v* ← arg max c^T x subject to:
        Ax ≤ b
        x_{I0} = 0
        x_{I1} = 1
    return v* if feasible, else return null

v ← LP(∅, ∅)
B ← {(v, ∅, ∅)}
while optimal integral vertex not found:
    (v, I0, I1) ← arg max_{(v,I0,I1) ∈ B} c^T v
    if v is integral:
        return v
    else:
        find a fractional coordinate i
        v(0) ← LP(I0 ∪ {i}, I1)
        v(1) ← LP(I0, I1 ∪ {i})
        remove (v, I0, I1) from B
        add (v(0), I0 ∪ {i}, I1) to B if feasible
        add (v(1), I0, I1 ∪ {i}) to B if feasible
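The branch-and-bound scheme above can be sketched as executable Python on top of an off-the-shelf LP solver. This is a minimal best-first sketch assuming `scipy.optimize.linprog` as the LP oracle; the function names and interface are our own, not the dissertation's implementation:

```python
import numpy as np
from scipy.optimize import linprog

def branch_and_bound(c, A, b, tol=1e-7):
    """Best-first branch and bound for max c^T x s.t. Ax <= b, 0 <= x <= 1.

    A branch (I0, I1) fixes x_i = 0 for i in I0 and x_i = 1 for i in I1,
    as in Algorithm 2. Illustrative sketch only.
    """
    n = len(c)

    def solve(I0, I1):
        bounds = [(0.0, 1.0)] * n
        for i in I0:
            bounds[i] = (0.0, 0.0)
        for i in I1:
            bounds[i] = (1.0, 1.0)
        # linprog minimizes, so negate c to maximize
        res = linprog(-np.asarray(c), A_ub=A, b_ub=b, bounds=bounds)
        return res.x if res.success else None

    v = solve(frozenset(), frozenset())
    if v is None:
        return None  # the LP itself is infeasible
    B = [(v, frozenset(), frozenset())]
    while B:
        # pop the branch with the largest LP bound
        j = max(range(len(B)), key=lambda t: float(np.dot(c, B[t][0])))
        v, I0, I1 = B.pop(j)
        fractional = [i for i in range(n) if min(v[i], 1.0 - v[i]) > tol]
        if not fractional:
            return np.round(v)  # integral vertex with the best bound
        i = fractional[0]
        for child0, child1 in [(I0 | {i}, I1), (I0, I1 | {i})]:
            w = solve(child0, child1)
            if w is not None:
                B.append((w, child0, child1))
    return None
```

Because the popped branch always carries the best LP bound, the first integral vertex popped is an optimal integral solution.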

Algorithm 3 M-best Integral
Input: an LP {max c^T x : Ax ≤ b, 0 ≤ x ≤ 1}
Input: number of solutions K

def LP(I0, I1): same as Algorithm 2

def SplitIntegral(v, I0, I1):
    P ← { }
    for i ∈ [n] if i ∉ I0 ∪ I1:
        a ← ¬v_i
        I0', I1' ← copy(I0, I1)
        add i to I_a'
        v' ← LP(I0', I1')
        add (v', I0', I1') to P if feasible
    return P

v ← LP(∅, ∅)
B ← {(v, ∅, ∅)}
results ← { }
while K integral vertices not found:
    (v, I0, I1) ← arg max_{(v,I0,I1) ∈ B} c^T v
    if v is integral:
        add v to results
        add SplitIntegral(v, I0, I1) to B
        remove (v, I0, I1) from B
    else:
        find a fractional coordinate i
        v(0) ← LP(I0 ∪ {i}, I1)
        v(1) ← LP(I0, I1 ∪ {i})
        remove (v, I0, I1) from B
        add (v(0), I0 ∪ {i}, I1) to B if feasible
        add (v(1), I0, I1 ∪ {i}) to B if feasible
return results

Algorithm 4 Estimate S(VC) for Binary, Pairwise Graphical Models
Input: a binary, pairwise graphical model LP

# branch (q, I0, I_1/2, I1) means q is the optimal LP solution
# with x_{I0} = 0, x_{I_1/2} = 1/2, and x_{I1} = 1.
def LP(I0, I_1/2, I1):
    optimize the LP with the additional constraints:
        x_{I0} = 0
        x_{I_1/2} = 1/2
        x_{I1} = 1
    return q* if feasible, else return null

q ← LP(∅, ∅, ∅)
B ← {(q, ∅, ∅, ∅)}
solution ← { }
while optimal integral vertex not found:
    (q, I0, I_1/2, I1) ← arg max_{(q,I0,I_1/2,I1) ∈ B} objective value
    add q to solution
    remove (q, I0, I_1/2, I1) from B
    for i ∈ V if i ∉ I0 ∪ I_1/2 ∪ I1:
        for a ∈ {0, 1/2, 1} if q_i ≠ a:
            I0', I_1/2', I1' ← copy(I0, I_1/2, I1)
            I_a' ← I_a' ∪ {i}
            q' ← LP(I0', I_1/2', I1')
            add (q', I0', I_1/2', I1') to B if feasible
return solution

Chapter 4

Experimental Design for Cost-Aware Learning of Causal Graphs

4.1 Introduction

Causality is a fundamental concept in science and an essential tool for multiple disciplines such as engineering, medical research, and economics [Rotmensch et al., 2017, Ramsey et al., 2010, Rubin and Waterman, 2006]. Discovering causal relations has been studied extensively under different frame- works and under various assumptions [Pearl, 2009, Imbens and Rubin, 2015]. To learn the cause-effect relations between variables without any assumptions other than basic modeling assumptions, it is essential to perform experiments. Experimental data combined with observational data has been successfully used for recovering causal relationships in different domains [Sachs et al., 2005].

There is significant cost and time required to set up experiments. Often there are many ways to design experiments to discover cause-and-effect rela- tionships. Considering cost when designing experiments can critically change the total cost needed to learn the same causal system. King et al. [2004] created a robot scientist that would automatically perform experiments to learn how a yeast gene functions. Different experiments required different

materials, with large variations in cost. By considering material cost when designing interventions, their robot scientist was able to learn the same causal structure at significantly lower cost.

Since the work of King et al., there have been a number of papers on automated and cost-sensitive experiment design for causal learning in biological systems. Sverchkov and Craven [2017] discuss some aspects of how to design costs. Ness et al. [2017] develop an active learning strategy for cost-aware experiments in protein networks.

We study the problem of cost-aware causal learning in Pearl's framework of causality [Pearl, 2009] under the causal sufficiency assumption, i.e., when there are no latent confounders. In this framework, there is a directed acyclic graph (DAG) called the causal graph that describes the causal relationships between the variables in our system. Learning direct causal relations between the variables in the system is equivalent to learning the directed edges of this graph. From observational data, we can learn the existence of causal edges, as well as some of the edge directions; however, in general we cannot learn the direction of every edge. To learn the remaining causal edges, we need to perform experiments and collect additional data from these experiments [Eberhardt, 2007, Hauser and Bühlmann, 2012b, Hyttinen et al., 2013].

An intervention is an experiment where we force a variable to take a particular value. An intervention is called a stochastic intervention when the value of the intervened variable is assigned to another independent random variable. Interventions can be performed on a single variable, or a subset of

variables simultaneously. In the non-adaptive setting, which is what we consider here, all interventions are performed in parallel. In this setting, we can only guarantee that an edge direction is learned when there is an intervention such that exactly one of the endpoints is included [Kocaoglu et al., 2017].

In the minimum cost intervention design problem, as first formalized by Kocaoglu et al. [2017], there is a cost to intervene on each variable. We want to learn the causal direction of every edge in the graph with minimum total cost. This becomes a combinatorial optimization problem, and so two natural questions that have not yet been addressed are whether the problem is NP-hard and whether the greedy algorithm proposed by Kocaoglu et al. [2017] has any approximation guarantees.

Our contributions:

• We show that the minimum cost intervention design problem is NP-hard.

• We modify the greedy coloring algorithm proposed in [Kocaoglu et al., 2017]. We establish that our modified algorithm is a (2+ε)-approximation algorithm for the minimum cost intervention design problem. Our proof makes use of a connection to submodular optimization.

• We consider the sparse intervention setup where each experiment can include at most k variables. We show a lower bound on the minimum number of interventions and create an algorithm that is a (1 + o(1))-approximation to this problem for sparse graphs with sparse interventions.

• We introduce the minimum cost k-sparse intervention design problem and develop an algorithm that is essentially optimal for the unweighted variant of this problem on sparse graphs. We then discuss how to extend this algorithm to the weighted problem.

4.2 Minimum Cost Intervention Design

4.2.1 Relevant Graph Theory Concepts

We first discuss some graph theory concepts that we utilize in this work.

A proper coloring of a graph G = (V, E) is an assignment of colors c : V → {1, 2, . . . , t} to the vertices V such that for all edges uv ∈ E we have c(u) ≠ c(v). The chromatic number is the minimum number of colors needed for a proper coloring to exist and is denoted by χ.

An independent set of a graph G = (V, E) is a subset of the vertices S ⊆ V such that for all pairs of vertices u, v ∈ S we have uv ∉ E. The independence number is the size of the maximum independent set and is denoted by α. If there is a weight function on the vertices, a maximum weight independent set is an independent set with the largest total weight.

A vertex cover of a graph G = (V, E) is a subset of vertices S such that for every edge uv ∈ E, at least one of u or v is in S. Vertex covers are closely related to independent sets: if S is a vertex cover then V \ S is an independent set, and vice versa. Further, if S is a minimum weight vertex cover then V \ S is a maximum weight independent set. The size of the smallest vertex cover of G is denoted τ.

A chordal graph is a graph such that every cycle v1, v2, . . . , vt with t ≥ 4 has a chord, which is an edge between two vertices that are not adjacent in the cycle. There are linear-time algorithms for finding a minimum coloring, maximum weight independent set, and minimum weight vertex cover of a chordal graph. Any induced subgraph of a chordal graph is also chordal.
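For intuition, optimal coloring of a chordal graph can be sketched via maximum cardinality search (MCS): coloring greedily in MCS visit order uses exactly χ colors, because on a chordal graph the already-colored neighbors of each newly visited vertex form a clique. A minimal sketch (the function names and adjacency-dict representation are ours):

```python
def mcs_order(adj):
    """Maximum cardinality search: repeatedly visit the unvisited vertex
    with the most visited neighbors. adj maps each vertex to a set of
    neighbors."""
    visited, order = set(), []
    weight = {v: 0 for v in adj}
    while len(order) < len(adj):
        v = max((u for u in adj if u not in visited), key=lambda u: weight[u])
        order.append(v)
        visited.add(v)
        for u in adj[v]:
            if u not in visited:
                weight[u] += 1
    return order

def chordal_optimal_coloring(adj):
    """Greedy coloring in MCS order; on a chordal graph this uses exactly
    chi colors, since each vertex's earlier neighbors form a clique."""
    color = {}
    for v in mcs_order(adj):
        used = {color[u] for u in adj[v] if u in color}
        c = 0
        while c in used:
            c += 1
        color[v] = c
    return color
```

This is only an illustrative routine; the linear-time algorithms referenced above use more careful data structures.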

Given a graph G = (V,E) and a subset of vertices I ⊆ V , the cut δ(I) is the set of edges uv ∈ E such that u ∈ I and v ∈ V \ I.

4.2.2 Causal Graphs and Interventional Learning

Consider two variables X,Y of a system. If every time we change the value of X, the value of Y changes but not vice versa, then we suspect that variable X causes Y . If we have a set of variables, the same intuition carries through while defining causality. This asymmetry in the directional influence between variables is at the core of causality.

Pearl [2009] and Spirtes et al. [2001] formalized the notion of causality using directed acyclic graphs (DAGs). DAGs are suitable for encoding asymmetric relations. Consider a system of n random variables V = {V1, V2, . . . , Vn}. The structural causal model of Pearl models the causal relations between variables as follows: each variable Vi can be written as a deterministic function of a set of other variables Si and an unobserved variable Ei as Vi = fi(Si, Ei). We assume that Ei, called an exogenous random variable, is independent of everything, i.e., every variable in V and all other exogenous variables Ej. The graph that

captures these directional relations is called the causal graph between the variables in V. We restrict the graph to be acyclic, so that if we replace the value of a variable we potentially change the descendant variables but the ancestor variables will not change.

Given a causal graph, a variable is said to be caused by its set of parents.¹ This is precisely Si in the structural causal model. It is known that the joint distribution induced on V by a structural causal model factorizes with respect to the causal graph. Thus, the causal graph D is a valid Bayesian network for the observed joint distribution.

There are two main approaches for learning causal graphs from the observational distribution: i) score based [Geiger and Heckerman, 1994, Heckerman et al., 1995], and ii) constraint based [Pearl, 2009, Spirtes et al., 2001]. Score based approaches optimize a score (e.g., likelihood) over all Bayesian networks to recover the most likely graph. Constraint-based approaches, such as the IC and PC algorithms, use conditional independence tests to identify the causal edges that are invariant across every graph consistent with the observed data. The remaining mixed graph is called the essential graph. The undirected components of the essential graph are always chordal [Spirtes et al., 2001, Hauser and Bühlmann, 2012a].

Although PC runs in time exponential in the maximum degree of the

¹To be more precise, parent nodes are said to directly cause a variable, whereas ancestors cause it indirectly through parents. In this paper, we will not make this distinction, since we do not use indirect causal relations for graph discovery.

graph, various extensions make it feasible to run even on graphs with 30,000 nodes and maximum degree up to 12 [Ramsey et al., 2017]. To learn the rest of the causal edge directions without additional assumptions, we need to use interventions on the undirected, chordal components.² An intervention is an experiment where a random variable is forced to take a certain value. Due to the acyclicity assumption on the graph, if X → Y, then intervening on Y should not change the distribution of X; however, intervening on X will change the distribution of Y. Running the observational learning algorithms like PC/IC after an intervention on a set S of variables, we can learn the new skeleton after the intervention, which allows us to identify the immediate children and immediate parents of the intervened variables. Therefore, if we perform a randomized experiment on a set S of vertices in the causal graph, we can learn the direction of all the edges cut between S and V \ S. This approach has been heavily used in the literature [Hyttinen et al., 2013, Hauser and Bühlmann, 2012b, Shanmugam et al., 2015].

4.2.3 Graph Separating Systems and Minimum Cost Intervention Design

Given a causal DAG D = (V, E), we observe the essential graph E(D). Kocaoglu et al. [2017] established that if we want to guarantee learning the direction of the undirected edges with nonadaptive interventions, it is necessary

²It is known that the edges identified in a chordal component of the skeleton do not help identify edges in another component [Hauser and Bühlmann, 2012a]. Thus, each chordal component learning task can be treated as an individual problem.

and sufficient for our intervention design I = {I1, I2, . . . , Im} to be a graph separating system on the undirected component of the graph G.

Definition 24 (Graph Separating System). Given an undirected graph G = (V, E), a graph separating system of size m is a collection of m subsets of vertices I = {I1, I2, . . . , Im} such that every edge is cut at least once, that is, ∪_{I∈I} δ(I) = E.

Recall that the undirected component of the essential graph of a causal DAG is always a chordal graph. We can now define the minimum cost inter- vention design problem.

Definition 25 (Minimum Cost Intervention Design). Given a chordal graph G = (V, E), a set of weights w_v for all v ∈ V, and a size constraint m ≥ ⌈log χ⌉, the minimum cost intervention design problem is to find a graph separating system I of size at most m that minimizes the cost

cost(I) = ∑_{I∈I} ∑_{v∈I} w_v.

Graph separating systems are tightly related to graph colorings. Mao-Cheng [1984] proved that the smallest graph separating system has size m = ⌈log χ⌉, where χ is the chromatic number. To see this, for each vertex we create a binary vector c(v) where c(v)_i = 1 if v ∈ I_i and c(v)_i = 0 if v ∉ I_i. Since two neighboring vertices u and v must have, for some intervention I_i, exactly one of u ∈ I_i or v ∈ I_i, the assignment of vectors to vertices c : V → {0, 1}^m is a proper coloring. With a graph separating system of size m, we are able to create 2^m different colors, proving that the size of the smallest separating system is exactly m = ⌈log χ⌉.
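The binary-code construction above is easy to sketch: intervention I_i collects the vertices whose color has bit i set, so two adjacent vertices, having different colors, differ in some bit and are therefore cut by some intervention. A minimal sketch (function names are ours):

```python
from math import ceil, log2

def coloring_to_separating_system(color):
    """Convert a proper coloring {vertex: color in 0..chi-1} into a graph
    separating system of m = ceil(log2(chi)) interventions: I_i contains
    the vertices whose color has bit i set."""
    chi = max(color.values()) + 1
    m = max(1, ceil(log2(chi)))
    return [{v for v, c in color.items() if (c >> i) & 1} for i in range(m)]

def cuts_all_edges(interventions, edges):
    """Check the separating-system property: every edge is cut at least once."""
    return all(any((u in I) != (v in I) for I in interventions)
               for u, v in edges)
```

For a triangle with three colors, this yields ⌈log 3⌉ = 2 interventions that cut all three edges.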

The equivalence between graph separating systems and coloring allows us to define an equivalent coloring version of the minimum cost intervention design problem, which was first developed in [Kocaoglu et al., 2017].

Definition 26 (Minimum Cost Intervention Design, Coloring Version). Given a chordal graph G = (V, E), a set of weights w_v for all v ∈ V, and the colors C = {0, 1}^m such that |C| ≥ χ, the coloring version of the minimum cost intervention design problem is to find a proper coloring c : V → C that minimizes the total cost

cost(c) = ∑_{v∈V} ||c(v)||_1 w_v.

Given a minimum cost coloring from the coloring variant of the minimum cost intervention design, we can create a minimum cost intervention design. Further, the reduction is approximation preserving.

In practice, it can sometimes be difficult to intervene on a large number of variables. A variant of intervention design of interest is when every intervention can involve at most k variables. For this problem, we want our interventions to form a k-sparse graph separating system.

Definition 27 (k-Sparse Graph Separating System). Given an undirected graph G = (V, E), a k-sparse graph separating system of size m is a collection of m subsets of vertices I = {I1, I2, . . . , Im} such that every subset Ii satisfies |Ii| ≤ k and every edge is cut at least once, that is, ∪_{I∈I} δ(I) = E.

We consider two optimization problems related to k-sparse graph separating systems. In the first, we want to find a graph separating system of minimum size.

Definition 28 (Minimum Size k-Sparse Intervention Design). Given a chordal graph G = (V,E) and a sparsity constraint k, the minimum size k-sparse intervention design problem is to find a k-sparse graph separating system for G of minimum size, that is, we want to minimize the cost

cost(I) = |I|.

For the next problem, we want to find the k-sparse intervention design of minimum cost where there is a cost to intervene on every variable.

Definition 29 (Minimum Cost k-Sparse Intervention Design). Given a chordal graph G = (V,E), a set of weights wv for all v ∈ V , a sparsity constraint k, and a size constraint m, the minimum cost k-sparse intervention design problem is to find a k-sparse graph separating system I of size m that minimizes the cost

cost(I) = ∑_{I∈I} ∑_{v∈I} w_v.

4.3 Related Work

One problem of interest is to find the intervention design with the smallest number of interventions. Eberhardt et al. [2005] established that ⌈log n⌉ interventions are sufficient and necessary in the worst case. Eberhardt [2007] established

that graph separating systems are necessary across all graphs (the example he used is the complete graph). Hauser and Bühlmann [2012b] establish the connection between graph colorings and intervention designs by using the key observation of Mao-Cheng [1984] that graph colorings can be used to construct graph separating systems, and vice versa. This leads to the necessity and sufficiency of ⌈log χ⌉ experiments, where χ is the chromatic number of the graph.

Since graph coloring can be done efficiently for chordal graphs, we can efficiently create a minimum size intervention design when given a chordal skeleton as input. In contrast, if we are given an arbitrary graph as input, perhaps due to side information on some edge directions, it is NP-hard to find a minimum size intervention design [Hyttinen et al., 2013, Mao-Cheng, 1984].

Hu et al. [2014] proposed a randomized algorithm that requires only O(log log n) experiments and learns the causal graph with high probability.

Closer to our setup, Hyttinen et al. [2013] consider a special case of the minimum cost intervention design problem where every vertex has cost 1 and the input is the complete graph. They were able to solve this special case optimally. Kocaoglu et al. [2017] were the first to formalize the minimum cost intervention design problem on general chordal graphs and its relationship to the coloring variant. They used the coloring variant to develop a greedy algorithm that finds a maximum weighted independent set and colors this set with the available color of lowest weight. However, their work did not establish approximation guarantees for this algorithm, and it is not clear

how many iterations the greedy algorithm needs to fully color the graph; we address these issues in this paper. Further, it was unknown until our work that the minimum cost intervention design problem is NP-hard.

There has been a lot of prior work on the setting where every intervention is constrained to have size at most k. Eberhardt et al. [2005] were the first to consider the minimum size k-sparse intervention design problem and established sufficient conditions on the number of interventions needed for the complete graph. Hyttinen et al. [2013] showed how k-sparse separating system constructions can be used for intervention designs on the complete graph using the construction of Katona [1966]. They establish the necessary and sufficient number of k-sparse interventions needed to learn all causal directions in the complete graph. Shanmugam et al. [2015] illustrate that for the complete graph, separating systems are necessary even under the constraint that each intervention has size at most k. They also identify an information theoretic lower bound on the necessary number of experiments and propose a new optimal k-sparse separating system construction for the complete graph. To the best of our knowledge, there were no graph-dependent bounds on the size of a k-sparse graph separating system until our work.

Ghassami et al. [2018] considered the dual problem of maximizing the number of learned causal edges for a given number of interventions. They show that this problem is a submodular maximization problem when only interventions involving a single variable are allowed. We note that their connection to submodularity is different from the one we discover in our

work.

Graph coloring has been extensively studied in the literature, and there are various versions of the graph coloring problem. We identify a connection between the minimum cost intervention design problem and the general optimum cost chromatic partition problem (GOCCP). GOCCP is a graph coloring problem where there are t colors and a cost γ_{vi} to color vertex v with color i. It is a more general version of the minimum cost intervention design problem. Jansen [1997] established that for graphs with bounded treewidth r, GOCCP can be solved exactly in time O(t^r n). This implies that for graphs with maximum degree ∆ we can solve the minimum cost intervention design problem exactly in time O(2^{m∆} n). Note that m is at least log ∆ and can be as large as ∆, so this algorithm is not practical even for ∆ = 12.

4.4 Hardness of Minimum Cost Intervention Design

In this section, we show that the minimum cost intervention design problem is NP-hard.

We assume that the input graph is chordal, since it is obtained as an undirected component of a causal graph skeleton. We note that every chordal graph can be realized by this process.

Proposition 30. For any undirected chordal graph G, there is a causal graph D such that the essential graph E(D) = G.

Thus every chordal graph is the undirected subgraph of the essential

graph for some causal DAG. This validates the problem definition of the minimum cost intervention design, as any chordal graph can be given as input. We now state our hardness result.

Theorem 31. The minimum cost intervention design problem is NP-hard.

Please see Appendix 4.11 for the proof. Our proof is based on the reduction of Kroon et al. [1996] from numerical 3D matching to a graph coloring problem that is more general than the minimum cost intervention problem on interval graphs. Our hardness proof holds even if the vertex costs are all equal to 1 and the input graph is an interval graph, a subclass of chordal graphs that often admits efficient algorithms for problems that are hard on general graphs.

It is worth comparing with complexity results on the related minimum size intervention design problem. The minimum size intervention design problem on a graph can be solved by finding a minimum coloring of the same graph [Mao-Cheng, 1984, Hauser and Bühlmann, 2012b]. For chordal graphs, graph coloring can be solved efficiently, so the minimum size intervention design problem can also be solved efficiently. In contrast, the minimum cost intervention design problem is NP-hard even on chordal graphs. Both problems are hard on general graphs, which can arise due to side information.

4.5 Approximation Guarantees for Minimum Cost Intervention Design

Since the input graph is chordal, we can find maximum weight independent sets in polynomial time using Frank's algorithm [Frank, 1975]. Further, a chordal graph remains chordal after removing a subset of the vertices. The authors of [Kocaoglu et al., 2017] use these facts to construct a greedy algorithm for this weighted coloring problem. Let G_0 = G. On iteration t, find the maximum weight independent set in G_t and assign these vertices the available color with the smallest cost. Then let G_{t+1} be the graph after removing the colored vertices from G_t. Repeat until all vertices are colored, then convert the coloring to a graph separating system and return this design.

One issue with this algorithm is that it is not clear how many iterations the greedy algorithm requires until the graph is fully colored. This is important because we want to satisfy the size constraint on the graph separating system. We therefore introduce a quantization step that reduces the number of iterations the greedy algorithm needs to completely color the graph. In Figure 4.3 of Appendix 4.8, we show an example of a (non-chordal) graph where, without quantization, the greedy algorithm requires n/2 colors but with quantization it only requires 4 colors.

Specifically, we first find the maximum independent set of the input graph and remove it. We then find the maximum cost vertex of the new graph, with weight w_max. For all vertices v in the new graph, we replace the cost w_v with ⌊w_v n³ / w_max⌋. See Algorithm 5 for pseudocode describing our algorithm.

The reason we first remove the maximum independent set before quantizing is that the maximum independent set will be colored with a color of weight 0, and thus not contribute to the cost. We want the quantized costs to not be arbitrarily far from the original costs, except for the vertices that are not intervened on. For example, if there is a vertex with a weight of infinity, we will never intervene on it. However, if we were to quantize its weight, the optimal solution to the quantized problem could be arbitrarily far from the true optimal solution. Our method of quantization allows us to show that a good solution for the quantized weights is also a good solution for the true weights.

Algorithm 5 Greedy Coloring Algorithm with Quantization
Input: A chordal graph G = (V, E), positive integral weights w_i for all i ∈ V.

Quantize the vertex weights:
    S_0 ← maximum weighted independent set of G
    w_max ← max_{i ∈ V \ S_0} w_i
    w_i ← ⌊w_i n³ / w_max⌋ for all i ∈ V \ S_0

Greedy weighted coloring:
    assign S_0 the color 0
    G_1 ← G − S_0
    t ← 1
    while G_t is not empty:
        S_t ← maximum weight independent set of G_t
        color all vertices of S_t with the color t
        G_{t+1} ← G_t − S_t
        t ← t + 1
    convert the coloring of G to a graph separating system I
    return I
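Algorithm 5 can be sketched in Python. Frank's linear-time chordal MWIS routine is replaced here by brute force, so this sketch is only suitable for small illustrative graphs; all function names are ours:

```python
from itertools import combinations

def max_weight_independent_set(vertices, edges, w):
    """Brute-force maximum weight independent set (a stand-in for Frank's
    linear-time algorithm on chordal graphs; small examples only)."""
    vertices = list(vertices)
    edge_set = {frozenset(e) for e in edges}
    best, best_w = set(), -1
    for r in range(1, len(vertices) + 1):
        for S in combinations(vertices, r):
            if any(frozenset(p) in edge_set for p in combinations(S, 2)):
                continue  # S contains an edge, not independent
            tw = sum(w[v] for v in S)
            if tw > best_w:
                best, best_w = set(S), tw
    return best

def greedy_coloring_with_quantization(vertices, edges, w):
    """Sketch of Algorithm 5: remove a max-weight independent set (color 0),
    quantize the remaining weights to floor(w_v * n^3 / w_max), then
    repeatedly extract max-weight independent sets as color classes."""
    n = len(vertices)
    remaining = set(vertices)
    S0 = max_weight_independent_set(remaining, edges, w)
    coloring = {v: 0 for v in S0}
    remaining -= S0
    if remaining:
        wmax = max(w[v] for v in remaining)
        wq = {v: (w[v] * n ** 3) // wmax for v in remaining}
        t = 1
        while remaining:
            sub_edges = [e for e in edges if set(e) <= remaining]
            St = max_weight_independent_set(remaining, sub_edges, wq)
            for v in St:
                coloring[v] = t
            remaining -= St
            t += 1
    return coloring
```

On a path 0-1-2 with a heavy middle vertex, the heavy vertex receives the free color 0 and the endpoints share color 1.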

We now state our main theorem, which guarantees that the greedy algorithm with quantization will return a solution that is a (2+ε)-approximation

of the optimal solution while using only log χ + O(log log n) interventions. Our algorithm thus returns a good solution to the minimum cost intervention design problem whenever the allowed number of interventions satisfies m ≥ log χ + O(log log n). Note that m ≥ log χ is required for any graph separating system to exist.

Theorem 32. If the number of interventions m satisfies m ≥ log χ + log log n + 5, then the greedy coloring algorithm with quantization for the minimum cost intervention design problem creates a graph separating system I_greedy such that

cost(I_greedy) ≤ (2 + ε) OPT,

where ε = exp(−Ω(m)) + n^{-1}.

See Appendix 4.9 for the proof of the theorem. We present a brief sketch of our proof.

To show that the greedy algorithm uses a small number of colors, we first define a submodular, monotone, and non-negative function such that every vertex has been colored if and only if this function is maximized. This is an instance of the submodular cover problem. Wolsey [1982] established that the greedy algorithm for the submodular cover problem returns a set whose cardinality is close to that of the optimal solution when the values of the submodular function are bounded by a polynomial. This is why we need to quantize the weights.

To show that the greedy algorithm returns a solution with small value, we first define a new class of functions which we call supermodular chain

functions. We then show that the minimum cost intervention design problem is an instance of a supermodular chain function. Using results on submodular optimization from [Nemhauser et al., 1978, Krause and Golovin, 2014] and some nice properties of the minimum cost intervention design problem, we are able to show that the greedy algorithm returns an approximately optimal solution.

To relate the quantized weights back to the original weights, we use an analysis that is similar to the analysis used to show the approximation guarantees of the knapsack problem [Ibarra and Kim, 1975].

Finally, we remark how our algorithm will perform when there are vertices with infinite cost. These vertices can be interpreted as variables that cannot be intervened on. If these variables form an independent set, then they can be colored with the color of weight zero. We can maintain our theoretical guarantees in this case, since our quantization procedure first removes the maximum weight independent set. If the variables with infinite cost do not form an independent set, then no valid graph separating system has finite cost.

4.6 Algorithms for k-Sparse Intervention Design Problems

We first establish a lower bound on how large a k-sparse graph separating system for a graph G must be, based on the size τ of the smallest vertex cover of the graph.

Proposition 33. For any graph G, the size of the smallest k-sparse graph separating system m_k^* satisfies m_k^* ≥ τ/k, where τ is the size of the smallest vertex cover in the graph G.

See Appendix 4.10 for the proof.

Algorithm 6 Algorithm for Min Size and Unweighted Min Cost k-Sparse Intervention Design
Input: A chordal graph G, a sparsity constraint k.
S ← minimum size vertex cover of G.
G_S ← induced graph of S in G.
Find an optimal coloring of G_S.
Split the color classes of G_S into size-k intervention sets I_1, I_2, . . . , I_m.
Return I = {I_1, I_2, . . . , I_m}.

We use Algorithm 6 to find a small k-sparse graph separating system. It first finds the minimum cardinality vertex cover S. It then finds an optimal coloring of the subgraph induced by the vertices of S. It then partitions each color class into independent sets of size at most k and performs an intervention for each of these partitions. Since the set of vertices not in a vertex cover is an independent set, this is a valid k-sparse graph separating system.
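The splitting step of Algorithm 6 can be sketched as follows; the vertex cover and optimal chordal coloring subroutines are assumed to be computed elsewhere, and the helper name is hypothetical:

```python
# Sketch of the final step of Algorithm 6: split each color class of the
# vertex-cover subgraph into interventions of size at most k.  (The
# dissertation's algorithm first computes a minimum vertex cover and an
# optimal coloring of the induced chordal subgraph; those steps are
# assumed done here and their output passed in as `color_classes`.)
def split_into_interventions(color_classes, k):
    interventions = []
    for cls in color_classes:
        cls = list(cls)
        for i in range(0, len(cls), k):
            interventions.append(cls[i:i + k])
    return interventions

classes = [["a", "b", "c", "d", "e"], ["f", "g"]]
print(split_into_interventions(classes, 2))
# [['a', 'b'], ['c', 'd'], ['e'], ['f', 'g']]
```

Each output list is an independent set of size at most k, hence a valid sparse intervention.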

When the sparsity k and the maximum degree ∆ are small, Algorithm 6 is nearly optimal. Using Proposition 33, we can establish the following approximation guarantee on the size of the graph separating system created.

Theorem 34. Given a chordal graph G with maximum degree ∆, Algorithm 6 finds a k-sparse graph separating system of size m_k such that

m_k ≤ (1 + k(∆ + 1)∆/n) OPT,

where OPT is the size of the smallest k-sparse graph separating system.

See Appendix 4.10 for the proof. If the sparsity constraint k and the maximum degree ∆ of the graph both satisfy k, ∆ = o(n^{1/3}), then Theorem 34 implies that we have a 1 + o(1) approximation to the optimal solution.

One interesting aspect of Algorithm 6 is that every vertex is only intervened on once and the set of elements not intervened on is the maximum cardinality independent set. By a similar argument to Theorem 2 of [Kocaoglu et al., 2017], we have that this algorithm is optimal in the unweighted case.

Corollary 35. Given an instance of the minimum cost k-sparse intervention design problem with chordal graph G with maximum degree ∆ and vertex cover of size τ, sparsity constraint k, size constraint m ≥ (τ/k)(1 + k(∆ + 1)∆/n), and all vertex weights w_v = 1, Algorithm 6 returns a solution with optimal cost.

We show one way to extend Algorithm 6 to the weighted case. There is a trade-off between the size and the weight of the independent set of vertices that are never intervened on. We can trade these off by adding a penalty λ to every vertex weight, i.e., the new weight w_v^λ of a vertex v is w_v^λ = w_v + λ. Larger values of λ encourage independent sets of larger size. See Algorithm 7 for the pseudocode describing this algorithm. We can run Algorithm 7 for various values of λ to explore the trade-off between cost and size.
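A sketch of this λ sweep; `run_algorithm7` is a hypothetical placeholder for an implementation of Algorithm 7, not code from the dissertation:

```python
# Sweep the penalty parameter lambda (sketch).  Larger penalties make the
# never-intervened independent set larger, trading intervention cost
# against the number of interventions.  `run_algorithm7` is a placeholder
# that should return a list of interventions (lists of vertices).
def sweep_lambda(graph, weights, k, run_algorithm7, lambdas):
    results = []
    for lam in lambdas:
        penalized = {v: w + lam for v, w in weights.items()}
        interventions = run_algorithm7(graph, penalized, k)
        # Report the cost under the ORIGINAL weights, not the penalized ones.
        cost = sum(weights[v] for I in interventions for v in I)
        results.append((lam, len(interventions), cost))
    return results
```

Plotting the (size, cost) pairs over the sweep traces out the trade-off curve shown in the experiments.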

4.7 Experiments

We generate chordal graphs following the procedure of [Shanmugam et al., 2015], however we modify the sampling algorithm so that we can

Algorithm 7 Algorithm for Weighted Min Cost k-Sparse Intervention Design
Input: A chordal graph G, a sparsity constraint k, vertex weights w_v, a penalty parameter λ.
S ← minimum weight vertex cover of G using weights w_v^λ = w_v + λ.
G_S ← induced graph of S in G.
Find an optimal coloring of G_S.
Split the color classes of G_S into size-k intervention sets I_1, I_2, . . . , I_m.
Return I = {I_1, I_2, . . . , I_m}.

control the maximum degree. First we order the vertices {v_1, v_2, . . . , v_n}. For vertex v_i we choose a vertex from {v_{i−b}, v_{i−b+1}, . . . , v_{i−1}} uniformly at random and add it to the neighborhood of v_i. We then go through the vertices {v_{i−b}, v_{i−b+1}, . . . , v_{i−1}} and add each to the neighborhood of v_i with probability d/b. We then add edges so that the neighbors of v_i in {v_1, v_2, . . . , v_{i−1}} form a clique. This is guaranteed to be a connected chordal graph with maximum degree bounded by 2b.
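The sampling procedure above can be sketched as follows. This is a rough illustration of the described process, not the authors' code; chordality of the output is per the construction's guarantee in the text:

```python
import random

# Sketch of the chordal-graph sampler described above: each vertex v_i
# picks one uniform neighbor among its b predecessors, adds each other
# predecessor independently with probability d/b, and then its chosen
# earlier neighbors are completed into a clique.  Every edge joins
# vertices whose indices differ by at most b, so degrees stay <= 2b.
def sample_chordal(n, b, d, seed=0):
    rng = random.Random(seed)
    adj = {i: set() for i in range(n)}
    for i in range(1, n):
        window = list(range(max(0, i - b), i))
        chosen = {rng.choice(window)}                       # guarantees connectivity
        chosen |= {j for j in window if rng.random() < d / b}
        for j in chosen:                                    # connect v_i to predecessors
            adj[i].add(j); adj[j].add(i)
        chosen = sorted(chosen)
        for x in chosen:                                    # complete them into a clique
            for y in chosen:
                if x != y:
                    adj[x].add(y); adj[y].add(x)
    return adj

g = sample_chordal(20, b=3, d=1.0)
assert all(len(nbrs) <= 2 * 3 for nbrs in g.values())       # degree <= 2b
```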

In our first experiment we compare the greedy algorithm to two other algorithms. The first is a baseline: it assigns the maximum weight independent set the weight 0 color, finds a minimum coloring of the remaining vertices, sorts the resulting independent sets by weight, and assigns the cheapest colors to the independent sets of highest weight. The second finds the optimal solution with integer programming using the Gurobi solver [Gurobi Optimization, LLC, 2018]. The integer programming formulation is standard (see, e.g., [Delle Donne and Marenco, 2016]).

We compare the cost of the different algorithms when we (a) adjust the number of vertices while maintaining the average degree and (b) adjust the average degree while maintaining the number of vertices. We see that the greedy coloring algorithm performs almost optimally. We also see that it is able to find a proper coloring even with only m = 5 interventions and no quantization. See Figure 4.1 for the complete results.

In our second experiment we see how Algorithm 7 allows us to trade off the number of interventions and the cost of the interventions in the k-sparse minimum cost intervention design problem. See Figure 4.2 for the results.

Finally, we observe the empirical running time of the greedy algorithm. We generate graphs on 10, 000 vertices with maximum degree 20 and have 5 interventions. The greedy algorithm terminates in 5 seconds. In contrast, the integer programming solution takes 128 seconds using the Gurobi solver [Gurobi Optimization, LLC, 2018].

Appendix

4.8 Example Graph Where Quantization Helps Greedy

4.9 Proof of Approximation Guarantees of the Quantized Greedy Algorithm

In this section we prove our approximation guarantee for the quantized greedy algorithm for minimum cost intervention design.

Theorem 32. If the number of interventions m satisfies m ≥ log χ + log log n + 5, then the greedy algorithm with quantization for the minimum cost intervention design problem creates a graph separating system I_greedy such that

cost(I_greedy) ≤ (2 + ε)OPT, where ε = exp(−Ω(m)) + n^{−1}.

4.9.1 Submodularity Background

Our proof uses several results from submodularity. A set function F over a ground set V is a function that takes as input a subset of V and outputs a real number. We say that the function F is submodular if for all v ∈ V and A ⊆ B ⊆ V \{v} the function satisfies the diminishing returns property

F (A ∪ {v}) − F (A) ≥ F (B ∪ {v}) − F (B).

We say that the function F is monotone if for all A ⊆ B ⊆ V we have that F (A) ≤ F (B). We say that F is non-negative if for all A ⊆ V we have that F (A) ≥ 0.

One classic problem in submodular optimization is finding a set A with cardinality constraint |A| ≤ k that maximizes a submodular, monotone, and non-negative function F. The greedy algorithm starts with the empty set A_0 = ∅ and, at iteration t + 1, selects the item v_{t+1} = arg max_{v∈V} F(A_t ∪ {v}) − F(A_t), then updates A_{t+1} = A_t ∪ {v_{t+1}}.

The classic result of Nemhauser, Wolsey, and Fisher established that the greedy algorithm is a (1 − 1/e)-approximation algorithm [Nemhauser et al., 1978]. Krause and Golovin generalized this to show that if the greedy algorithm selects ⌈Ck⌉ elements for some positive value C, then it is a (1 − e^{−C})-approximation to the optimal solution of size k.
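For concreteness, here is a sketch of this greedy maximization routine; the coverage example and its set names are illustrative, not from the dissertation:

```python
# Greedy maximization of a monotone submodular function under a
# cardinality constraint (the classic (1 - 1/e) setting).  F maps a
# frozenset to a number; V is the ground set.
def greedy_max(F, V, k):
    A = frozenset()
    for _ in range(k):
        best = max(V - A, key=lambda v: F(A | {v}) - F(A), default=None)
        if best is None:
            break
        A = A | {best}
    return A

# Example: weighted coverage, a standard monotone submodular function.
cover = {"s1": {1, 2}, "s2": {2, 3}, "s3": {4}}
F = lambda A: len(set().union(*(cover[s] for s in A)) if A else set())
print(greedy_max(F, frozenset(cover), 2))
```

Running the loop for ⌈Ck⌉ iterations instead of k gives the (1 − e^{−C}) guarantee cited above.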

Theorem 36 ([Nemhauser et al., 1978, Krause and Golovin, 2014]). Given a submodular, monotone, and non-negative function F over a ground set V and a cardinality constraint k, let OPT be defined as

OPT = max_{A⊆V : |A|≤k} F(A).

If the greedy algorithm for this problem runs for ⌈Ck⌉ iterations, for some positive value C, it returns a set A_greedy^{Ck} such that

F(A_greedy^{Ck}) ≥ (1 − e^{−C}) OPT.

Another important problem in submodular function optimization is the submodular set cover problem, which is a generalization of the set cover problem. Given a submodular, monotone, and non-negative function F that maps subsets of a ground set V to integers, we want to find a set A of minimum cardinality such that F(A) = F(V). The greedy algorithm is a natural approach to solve this problem: we run greedy iterations until the set satisfies the submodular set cover constraint. Let w_max = max_{v∈V} F({v}). Wolsey established that the cardinality of the set returned by the greedy algorithm is a (1 + ln w_max)-approximation to the minimum cardinality solution [Wolsey, 1982].

Theorem 37 ([Wolsey, 1982]). Given a submodular, monotone, and non-negative function F that maps subsets of a ground set V to integers, let OPT be defined as

OPT = min_{A⊆V : F(A)=F(V)} |A|.

Let w_max = max_{v∈V} F({v}). The greedy algorithm for this problem returns a set A_greedy such that

|A_greedy| ≤ (1 + ln w_max) OPT.

4.9.2 Bound on the Quantized Greedy Algorithm solution size

In this section we show that after χ(2 + 5 ln n) + 1 rounds the greedy algorithm with quantization will have colored every vertex in the graph. Since the number of possible colors in a graph separating system of size m is 2^m, this implies that when m ≥ log χ + log log n + 4, there are enough colors for the greedy algorithm to fully color the graph.
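A quick numeric sanity check (not a proof) that 2^m colors exceed the χ(2 + 5 ln n) + 1 colors the argument requires, taking the logarithms in the bound to be base 2:

```python
import math

# Sanity check: with m = ceil(log2(chi) + log2(log2(n)) + 4) interventions,
# the 2**m available colors exceed chi*(2 + 5 ln n) + 1, the number of
# colors the quantized greedy algorithm may need.
def colors_suffice(chi, n):
    m = math.ceil(math.log2(chi) + math.log2(math.log2(n)) + 4)
    needed = chi * (2 + 5 * math.log(n)) + 1
    return 2 ** m >= needed

assert all(colors_suffice(chi, n)
           for chi in (2, 5, 50) for n in (100, 10_000, 10**6))
```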

Lemma 38. If the intervention size is m ≥ log χ + log log n + 4, then the greedy algorithm will terminate using at most 2^m colors.

Proof. The greedy algorithm first colors the maximum weight independent set, using one color. We will denote the remaining graph by G = (V, E).

The weights of the remaining vertices are quantized to integers such that the maximum weight is bounded by n^3. Let 𝒜 be the set of all independent sets in G. The maximum weight of an independent set in G is bounded by n^4. Let W be the function that takes a set of independent sets A ⊆ 𝒜 and outputs the value

W(A) = Σ_{v ∈ ∪_{a∈A} a} w_v,

that is, it takes a set of independent sets and returns the sum of the weights of the vertices in their union. It can be verified that this function is submodular, monotone, and non-negative.

We will assume for now that the weights are all positive. If we have a set of independent sets A such that W(A) = W(𝒜), then every vertex in the graph has been covered. Since the minimum cardinality of such a set is χ and the maximum weight of an independent set is n^4, by Theorem 37 the greedy algorithm will terminate after χ(1 + 4 ln n) iterations.

To handle vertices of weight 0, note that covering the remaining vertices is a set cover problem. Thus the greedy algorithm will need no more than χ(1 + ln n) colors to color the remaining vertices, using a total of χ(2 + 5 ln n) + 1 colors.

We have the following corollary by noting that adding an extra intervention doubles the number of allowed colors.

Corollary 39. If the intervention size is m ≥ log χ + log log n + 5, then the greedy algorithm will terminate using at most 2^m colors such that all color vectors c have weight ‖c‖_1 ≤ ⌈m/2⌉.

4.9.3 Submodular and Supermodular Chain Problem

In this section we define two new types of submodular optimization problems, which we call the submodular chain problem and the supermodular chain problem. We will use these in our proof of the approximation guarantees of the greedy algorithm with quantization.

Definition 40. Given integers k_1, k_2, . . . , k_m and a submodular, monotone, and non-negative function F over a ground set V, the submodular chain problem is to find sets A_1, A_2, . . . , A_m ⊆ V with |A_i| ≤ k_i that maximize

Σ_{i=1}^{m} F(A_1 ∪ A_2 ∪ · · · ∪ A_i).

Throughout this section we will assume that m is an even number.

The greedy algorithm for this problem will first choose the set A1 of cardinality k1 that maximizes F (A1). It will then choose the set A2 of cardinality k2 that maximizes F (A1 ∪ A2). It will continue this process until all Ai are chosen.
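This chain greedy can be sketched as follows; the function F and the sizes k_i in the example are arbitrary and illustrative only:

```python
# Greedy algorithm for the submodular chain problem (sketch): pick A_1 of
# size k_1 greedily, then extend the running union with A_2 of size k_2,
# and so on, always adding the element with the best marginal gain.
def chain_greedy(F, V, sizes):
    union, blocks = frozenset(), []
    for k in sizes:
        block = frozenset()
        for _ in range(k):
            best = max(V - union, key=lambda v: F(union | {v}) - F(union),
                       default=None)
            if best is None:
                break
            block, union = block | {best}, union | {best}
        blocks.append(block)
    return blocks
```

With a modular (additive) F this simply picks elements in decreasing weight order, block by block.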

Note that by using the greedy algorithm and Theorem 36 we can obtain a (1 − 1/e)-approximation to this problem. However, we will instead use the following guarantee.

Lemma 41. Let A*_1, A*_2, . . . , A*_m be the optimal solution to the submodular chain problem. Suppose that for all 1 ≤ p ≤ m/2 − 1 we have Σ_{i=1}^{2p} k_i ≥ C Σ_{i=1}^{p} k_i. Also assume that F(A_1 ∪ A_2 ∪ · · · ∪ A_m) = F(V). Then the greedy algorithm for the submodular chain problem returns sets A_1, A_2, . . . , A_m such that

Σ_{i=0}^{m} F(A_1 ∪ A_2 ∪ · · · ∪ A_i) ≥ F(V) + 2(1 − e^{−C}) Σ_{i=0}^{m/2−1} F(A*_1 ∪ A*_2 ∪ · · · ∪ A*_i).

Proof. Since Σ_{p=1}^{2i} k_p ≥ C Σ_{p=1}^{i} k_p, by Theorem 36 we have that

F(A_1 ∪ A_2 ∪ · · · ∪ A_{2i}) ≥ (1 − e^{−C}) F(A*_1 ∪ A*_2 ∪ · · · ∪ A*_i).

We thus have

Σ_{i=0}^{m/2−1} F(A_1 ∪ A_2 ∪ · · · ∪ A_{2i}) ≥ (1 − e^{−C}) Σ_{i=0}^{m/2−1} F(A*_1 ∪ A*_2 ∪ · · · ∪ A*_i).

To conclude the proof, use the monotonicity of the submodular function F to observe that

Σ_{i=0}^{m} F(A_1 ∪ A_2 ∪ · · · ∪ A_i)
= F(A_1 ∪ A_2 ∪ · · · ∪ A_m) + Σ_{i=0}^{m/2−1} [F(A_1 ∪ · · · ∪ A_{2i}) + F(A_1 ∪ · · · ∪ A_{2i+1})]
= F(V) + Σ_{i=0}^{m/2−1} [F(A_1 ∪ · · · ∪ A_{2i}) + F(A_1 ∪ · · · ∪ A_{2i+1})]
≥ F(V) + 2 Σ_{i=0}^{m/2−1} F(A_1 ∪ · · · ∪ A_{2i}).

We define the supermodular chain problem similarly.

Definition 42. Given integers k_1, k_2, . . . , k_m and a submodular, monotone, and non-negative function F over a ground set V, the supermodular chain problem is to find sets A_1, A_2, . . . , A_m ⊆ V with |A_i| ≤ k_i that minimize

Σ_{i=0}^{m} [F(V) − F(A_1 ∪ A_2 ∪ · · · ∪ A_i)].

97 We establish the following guarantee for the greedy algorithm on the supermodular chain problem.

Lemma 43. Let A*_1, A*_2, . . . , A*_m be the optimal solution to the supermodular chain problem. Suppose that for all 1 ≤ p ≤ m/2 − 1 we have Σ_{i=1}^{2p} k_i ≥ C Σ_{i=1}^{p} k_i. Also assume that F(A_1 ∪ A_2 ∪ · · · ∪ A_m) = F(V). Then the greedy algorithm for the supermodular chain problem returns sets A_1, A_2, . . . , A_m such that

Σ_{i=0}^{m} [F(V) − F(A_1 ∪ A_2 ∪ · · · ∪ A_i)] ≤ e^{−C} m F(V) + 2 Σ_{i=0}^{m} [F(V) − F(A*_1 ∪ A*_2 ∪ · · · ∪ A*_i)].

Proof. Starting from Lemma 41, we have

(m + 1) F(V) − Σ_{i=0}^{m} F(A_1 ∪ · · · ∪ A_i)
≤ m F(V) − 2(1 − e^{−C}) Σ_{i=0}^{m/2−1} F(A*_1 ∪ · · · ∪ A*_i)
≤ e^{−C} m F(V) + m F(V) − 2 Σ_{i=0}^{m/2−1} F(A*_1 ∪ · · · ∪ A*_i)
= e^{−C} m F(V) + 2 Σ_{i=0}^{m/2−1} [F(V) − F(A*_1 ∪ · · · ∪ A*_i)].

Using the monotonicity of the submodular function F, we can continue with

e^{−C} m F(V) + 2 Σ_{i=0}^{m/2−1} [F(V) − F(A*_1 ∪ · · · ∪ A*_i)] ≤ e^{−C} m F(V) + 2 Σ_{i=0}^{m} [F(V) − F(A*_1 ∪ · · · ∪ A*_i)]

and conclude that

Σ_{i=0}^{m} [F(V) − F(A_1 ∪ · · · ∪ A_i)] = (m + 1) F(V) − Σ_{i=0}^{m} F(A_1 ∪ · · · ∪ A_i) ≤ e^{−C} m F(V) + 2 Σ_{i=0}^{m} [F(V) − F(A*_1 ∪ · · · ∪ A*_i)].

4.9.4 Proof of quantized greedy algorithm approximation guarantees

For simplicity, we will assume that the number of interventions m is divisible by 4.

We will need the following lemma, which can be proved by standard binomial approximations.

Lemma 44. If m and t are integers such that t ≤ m/4, we have

Σ_{i=1}^{2t} (m choose i) ≥ Ω(m) · (1 + Σ_{i=1}^{t} (m choose i)).

We use the following technical lemma to prove our approximation guarantee. We defer the proof to Section 4.9.5.

Lemma 45. Let A∗ be the optimal solution to the coloring problem. Let A+ be the optimal solution to the coloring problem when we force it to color the maximum weighted independent set with the weight 0 color, but allow it an extra color of weight 1. That is, it can color m + 1 independent sets with a color of weight 1, rather than the usual m independent sets. We have

cost(A+) ≤ cost(A∗).

We can now show that the quantized greedy algorithm is a good ap- proximation to the optimal solution to the quantized problem.

Lemma 46. Suppose all the weights in the graph that are not in the maximum weight independent set are bounded by n^3. Then if the number of interventions m satisfies m ≥ log χ + log log n + 5, the greedy coloring algorithm returns a solution I of cost

cost(I) ≤ (2 + exp(−Ω(m)))OPT.

Proof. Let 𝒜 be the set of all independent sets in G. Let W be the function that takes a set of independent sets A ⊆ 𝒜 and outputs the value

W(A) = Σ_{v ∈ ∪_{a∈A} a} w_v,

that is, it takes a set of independent sets and returns the sum of the weights of the vertices in their union. It can be verified that this function is submodular, monotone, and non-negative.

A feasible solution to the coloring variant of the minimum cost intervention design problem is a coloring that maps vertices to color vectors in {0, 1}^m. The colors of weight i are the color vectors c such that ‖c‖_1 = i. We can describe a feasible solution to the coloring variant of the minimum cost intervention design problem by A_0, A_1, A_2, . . . , A_m, where A_i is the set of independent sets that are colored with a color vector of weight i.

One simplifying assumption is that the optimal solution A* and the greedy solution A both use the color of weight 0 to color the maximum weight independent set. By Lemma 45 this assumption is valid if we allow the optimal solution to use an additional color of weight 1. We just need to show the approximation guarantee on the sets A_1, A_2, . . . , A_m.

We can calculate the cost of a feasible solution of the minimum cost intervention design problem by

cost(A_1, . . . , A_m) = Σ_{i=0}^{m} [W(𝒜) − W(A_1 ∪ A_2 ∪ · · · ∪ A_i)],

where |A_i| ≤ (m choose i). This is an instance of the supermodular chain problem.

Using Corollary 39, the greedy algorithm will terminate using only colors of weight at most m/2, so we only need to show optimality of the sets A_1, A_2, . . . , A_{m/2}. By Lemma 44, the number of colors of weight at most 2t is a factor of Ω(m) more than the number of colors of weight at most t used by the optimal solution, even after including the extra color given to the optimal solution. By Lemma 43 and the monotonicity of W, we have

cost(I_greedy) = cost(A_1, A_2, . . . , A_{m/2})
= Σ_{i=0}^{m/2} [W(𝒜) − W(A_1 ∪ A_2 ∪ · · · ∪ A_i)]
≤ e^{−Ω(m)} (m/2) W(𝒜) + 2 Σ_{i=0}^{m/2} [W(𝒜) − W(A*_1 ∪ A*_2 ∪ · · · ∪ A*_i)]
≤ e^{−Ω(m)} (m/2) W(𝒜) + 2 Σ_{i=0}^{m} [W(𝒜) − W(A*_1 ∪ A*_2 ∪ · · · ∪ A*_i)]
= e^{−Ω(m)} (m/2) W(𝒜) + 2 OPT.

To conclude, observe that OPT ≥ W(𝒜), since every vertex not in the maximum weight independent set is colored with a color of weight at least 1.

Lemma 46 shows an approximation guarantee of the quantized greedy algorithm to the quantized optimal solution. To relate the quantized greedy algorithm to the true optimal solution, we use the following lemma, which we prove in Section 4.9.5.

Lemma 47. Suppose an intervention design I is an α-approximation to the optimal solution of the quantized problem. Then it is an (α + n^{−1})-approximation to the optimal solution of the original problem.

With Lemma 47, we can conclude the proof of Theorem 32.

4.9.5 Proof of Technical Lemmas

Lemma 45. Let A∗ be the optimal solution to the coloring problem. Let A+ be the optimal solution to the coloring problem when we force it to color the maximum weighted independent set with the weight 0 color, but allow it an extra color of weight 1. That is, it can color m + 1 independent sets with a color of weight 1, rather than the usual m independent sets. We have

cost(A+) ≤ cost(A∗).

Proof. Let a_0^* and a_0^+ be the sets of vertices covered with the color of weight 0 for A^* and A^+, respectively. From the optimality of a_0^+ as a maximum weight independent set, we have

Σ_{i ∈ a_0^+ \ a_0^*} w_i ≥ Σ_{i ∈ a_0^* \ a_0^+} w_i.

Consider a new coloring A′, also with an extra weight 1 color, that uses a_0^+ as the set of vertices colored with the weight 0 color, a_0^* \ a_0^+ as the set of vertices colored with the extra weight 1 color, and then does the same coloring as A^*, removing the vertices that are already colored.

The only vertices colored by A′ with a positive cost and a different color than under A^* are a_0^* \ a_0^+, which are all colored with a color of weight 1. The only vertices colored by A^* with a positive cost and a different color than under A′ are a_0^+ \ a_0^*. Let c_v^* be the weight of the color of vertex v under A^*. We can thus conclude

cost(A′) − cost(A^*) = Σ_{v ∈ a_0^* \ a_0^+} w_v − Σ_{v ∈ a_0^+ \ a_0^*} c_v^* w_v
≤ Σ_{v ∈ a_0^* \ a_0^+} w_v − Σ_{v ∈ a_0^+ \ a_0^*} w_v ≤ 0.

Lemma 47. Suppose an intervention design I is an α-approximation to the optimal solution of the quantized problem. Then it is an (α + n^{−1})-approximation to the optimal solution of the original problem.

Proof. This is a modification of the proof of the FPTAS for the knapsack problem [Ibarra and Kim, 1975] (see also [Williamson and Shmoys, 2011]).

Let c^* be the optimal coloring under the original weights, c′ the optimal coloring under the quantized weights, and c the approximate coloring.

Let w_v be the true weight of vertex v and w_v′ its quantized weight. Let µ = w_max/n^3. Since w_v′ = ⌊w_v/µ⌋, we have w_v ≤ µ(w_v′ + 1) and w_v′ ≤ w_v/µ. We also have cost(I^*) ≥ w_max.

We thus have

cost(I) = Σ_{v∈V} ‖c(v)‖_1 w_v
≤ µ Σ_{v∈V} ‖c(v)‖_1 (w_v′ + 1)
= µ Σ_{v∈V} ‖c(v)‖_1 w_v′ + µ Σ_{v∈V} ‖c(v)‖_1
≤ αµ Σ_{v∈V} ‖c′(v)‖_1 w_v′ + µ Σ_{v∈V} ‖c(v)‖_1.

Using the optimality of c′ in the quantized weights, we have

αµ Σ_{v∈V} ‖c′(v)‖_1 w_v′ + µ Σ_{v∈V} ‖c(v)‖_1
≤ αµ Σ_{v∈V} ‖c^*(v)‖_1 w_v′ + µ Σ_{v∈V} ‖c(v)‖_1
≤ α Σ_{v∈V} ‖c^*(v)‖_1 w_v + µ Σ_{v∈V} ‖c(v)‖_1
= α OPT + µ Σ_{v∈V} ‖c(v)‖_1
≤ α OPT + µmn
= α OPT + w_max mn/n^3
≤ α OPT + w_max/n
≤ (α + n^{−1}) OPT.

4.10 Proof of Results on k-Sparse Intervention Design Problems

Proposition 33. For any graph G, the size of the smallest k-sparse graph separating system m_k^* satisfies m_k^* ≥ τ/k, where τ is the size of the smallest vertex cover in the graph G.

Proof. Suppose that there exists a k-sparse graph separating system I of size m_k^* < τ/k. Note that the set S of all vertices appearing in some intervention of I forms a vertex cover. The number of vertices in S is |S| ≤ k·m_k^* < τ, contradicting the fact that the smallest vertex cover has τ vertices.

Theorem 34. Given a chordal graph G with maximum degree ∆, Algorithm 6 finds a k-sparse graph separating system of size m_k such that

m_k ≤ (1 + k(∆ + 1)∆/n) m_k^*,

where m_k^* is the size of the smallest k-sparse graph separating system.

Proof. Given the vertices S in the smallest vertex cover of the graph, we can color these vertices with ∆ + 1 colors. We can then partition the color classes into τ/k + ∆ + 1 independent sets of size at most k, as we have at most τ/k sets of size exactly k and at most ∆ + 1 sets that cannot be grouped into exactly k vertices due to rounding errors.

Note that the size of the smallest vertex cover τ satisfies τ ≥ n/∆. We have

∆ + 1 = (k(∆ + 1)∆/n) · (n/(k∆)) ≤ (k(∆ + 1)∆/n) · (τ/k) ≤ (k(∆ + 1)∆/n) · m_k^*.

Thus we use at most τ/k + ∆ + 1 ≤ (1 + k(∆ + 1)∆/n) m_k^* interventions.

4.11 Proof of NP-Hardness

We establish the following theorem in this section.

Theorem 48. The minimum cost intervention design problem is NP-hard, even if every vertex has weight 1 and the input graph is an interval graph.

Theorem 31 follows immediately from Theorem 48.

First, we need to introduce the numerical three dimensional matching problem:

Definition 49 (Numerical Three Dimensional Matching). Given a positive integer t and 3t rational numbers a_i, b_i, c_i satisfying Σ_{i=1}^{t} (a_i + b_i + c_i) = t and 0 < a_i, b_i, c_i < 1 for all i ∈ [t], do there exist permutations ρ and σ of [t] such that a_i + b_{ρ(i)} + c_{σ(i)} = 1 for all i ∈ [t]?

The numerical three dimensional matching problem is known to be strongly NP-complete [Garey and Johnson, 1979].
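For intuition, a brute-force checker for tiny instances of this problem (illustrative only; the problem is strongly NP-complete, so brute force cannot scale):

```python
from itertools import permutations

# Brute-force numerical three dimensional matching: search over all pairs
# of permutations (rho, sigma) for one making every triple sum to 1.
def has_matching(a, b, c):
    t = len(a)
    return any(
        all(abs(a[i] + b[rho[i]] + c[sigma[i]] - 1) < 1e-9 for i in range(t))
        for rho in permutations(range(t))
        for sigma in permutations(range(t)))

# 0.2 + 0.3 + 0.5 = 1 and 0.5 + 0.1 + 0.4 = 1, so a matching exists.
assert has_matching([0.2, 0.5], [0.3, 0.1], [0.4, 0.5])
```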

Kroon et al. [1996] reduce numerical three dimensional matching to the optimal cost chromatic partition problem on interval graphs. The input to an instance of the optimal cost chromatic partitioning problem is a graph and a set of weighted colors. The cost to color a vertex with a given color is the weight of that color. The cost of a coloring is the sum of the coloring cost of each vertex. A solution to the problem is a valid coloring of minimum cost.

106 Kroon et al. show that the optimal cost chromatic partition problem is NP-hard, even if the input graph is an interval graph and color weights take four values: 0, 1, 2, and Θ(n). They use the following construction for their reduction.

Suppose we are given an instance of the numerical three dimensional matching problem containing the numbers a_i, b_i, c_i for i ∈ [t]. For i, j ∈ [t], define rational numbers A_i, B_j, and X_ij such that 4 < A_i < 5 < B_j < 6 and 7 < X_ij < 9. The following are the intervals of the graph used in [Kroon et al., 1996] (see the original paper for an image):

Interval                 Occurrences               Clique ID
(0, 1)                   t times                   I
(0, 3)                   t^2 − t times             I
(0, A_i)                 ∀i ∈ [t], t − 1 times     I
(0, B_j)                 ∀j ∈ [t]                  I
(1, 2)                   t times                   II
(2, A_i)                 ∀i ∈ [t]                  III
(3, B_j)                 ∀j ∈ [t], t − 1 times     III
(A_i, X_ij)              ∀i, j ∈ [t]               IV
(B_j, X_ij)              ∀i, j ∈ [t]               IV
(X_ij, 10 + a_i + b_j)   ∀i, j ∈ [t]               V
(X_ij, 14)               ∀i, j ∈ [t]               V
(11 − c_k, 13)           ∀k ∈ [t]                  VI
(12, 14)                 t^2 − t times             VII
(13, 14)                 t times                   VII

They establish that it is NP-complete to decide if there is a coloring of cost at most 11t^2 − 5t when there are t colors of weight 0, t^2 − t colors of weight 1, t^2 colors of weight 2, and all other colors of weight 3. However, they omit the proof, so we include a proof here. We use the clique IDs we added in the definition of the interval graph.

Proof. If there is a solution to the numerical three dimensional matching problem, then there exists a coloring of cost at most 11t^2 − 5t; see the original paper for the proof of this [Kroon et al., 1996]. They also prove that if there is a coloring of cost at most 11t^2 − 5t that only uses the colors of weight 0, 1, and 2, then it can be used to construct a solution to the numerical three dimensional matching problem.

Now we show that if the coloring uses a color of weight 3, then it must have a cost strictly greater than 11t^2 − 5t. Note that all the vertices with the same clique ID indeed form a clique. Consider the subgraph containing all the vertices of the original graph, but only the edges between vertices with the same clique ID.

This subgraph is a disjoint union of cliques, and the optimal way to color a clique of size k is to use one instance each of the k cheapest colors. From this, we can see that the optimal coloring of the subgraph has cost 11t^2 − 5t.

We can also see that any coloring of this subgraph that uses a color of weight 3 has a cost strictly larger than 11t^2 − 5t. Since there is an available color of weight less than 3, if we swap the color of weight 3 for an available, cheaper color, the cost decreases by at least 1. Since the cost of the coloring after the swap cannot be lower than 11t^2 − 5t, the coloring before the swap must have cost strictly larger than 11t^2 − 5t.

Since a valid coloring for the original graph is a valid coloring for the subgraph, and the cost of a coloring of the original graph is the same as the cost of the coloring of the subgraph, we see that any coloring of the original graph that uses a color of weight 3 must have a cost strictly larger than 11t^2 − 5t.

We also see that the problem remains hard when there are t colors of weight 1, t^2 − t colors of weight 2, t^2 colors of weight 3, and all other colors of weight 4. This is because the cost of a coloring using these new colors is just an additive factor of n more than with the original colors. Thus a coloring that minimizes the cost using these new colors also minimizes the cost using the original colors, and it is NP-complete to decide if there exists a coloring of cost 19t^2 − 3t.

We will define another interval graph by adding the following intervals. Set ε and δ to be nonnegative rational numbers such that ε ≠ δ, min{6 − max_j B_j, 9 − max_{i,j} X_ij} > ε, and δ < 1. Add the following intervals to the original graph:

Interval              Occurrences                       Clique ID
(0, 1 + ε)            t + 1 times                       I
(1 + ε, 2 + ε)        t + 1 times                       II
(2 + ε, 6 − ε)        t + 1 times                       III
(6 − ε, 9 − ε)        t + 1 times                       IV
(9 − ε, 11 + ε)       t + 1 times                       V
(11 + ε, 13 − ε)      t + 1 times                       VI
(13 − ε, 14 − ε)      t + 1 times                       VII
(14 − ε, 14)          t + 1 times                       VIII
(0, 3 + δ)            (2t choose 2) − t^2 + t times     I
(3 + δ, 6 + δ)        (2t choose 2) − t^2 + t times     III
(6 + δ, 9 + δ)        (2t choose 2) − t^2 + t times     IV
(9 + δ, 14)           (2t choose 2) − t^2 + t times     V
(0, 14)               (2t choose 3) − t^2 times         I

We will consider the optimal cost chromatic partition problem with one color of weight 0, 2t colors of weight 1, (2t choose 2) colors of weight 2, (2t choose 3) colors of weight 3, and (2t choose 4) colors of weight 4. This is exactly the coloring version of the minimum cost intervention design problem.

We argue that it is NP-complete to decide if the coloring cost is at most 3(2t choose 3) + 2(2t choose 2) + 14t^2 + 7t. We reduce from numerical three dimensional matching. From the original reduction by [Kroon et al., 1996], we see that if there is a solution to the numerical three dimensional matching problem, then there exists a coloring of cost at most 3(2t choose 3) + 2(2t choose 2) + 14t^2 + 7t.

Call the intervals with an ε in their description the ε-class, and the intervals with a δ in their description the δ-class. We see that the ε-class intervals can be partitioned into t + 1 contiguous regions, and the δ-class can be partitioned into (2t choose 2) − t^2 + t contiguous regions. By the choice of ε and δ, we also see that if the coloring does not follow this structure, then it takes more than (2t choose 2) − t^2 + 2t + 1 colors to color all these intervals. Further, there is a "gap" that cannot be filled by one of the original intervals. From the original hardness proof by [Kroon et al., 1996], we see that if the coloring creates a gap in the original vertices that can be filled by a member of the ε-class or δ-class, then it takes more than 2t^2 colors to color the original intervals. We conclude that if the colorings of the ε-class and the δ-class do not partition these intervals into contiguous regions, then the coloring must use a color of weight 4.

Again using the clique argument from the proof that the original problem is NP-complete, we see that if a coloring uses a color of weight 4, then the cost of this coloring is strictly more than 3(2t choose 3) + 2(2t choose 2) + 14t^2 + 7t.

In the original reduction by [Kroon et al., 1996], they prove that if there is a solution to the numerical three dimensional matching problem, the optimal coloring must have t color classes of size 7, t^2 − t color classes of size 5, and t^2 color classes of size 3. Introducing these new intervals, we see that the color classes of size 8 should take the weight 0 color and t of the weight 1 colors, the t color classes of size 7 should take the rest of the weight 1 colors, the t^2 − t color classes of size 5 should take t^2 − t colors of weight 2, the (2t choose 2) − t^2 + t color classes must take the rest of the weight 2 colors, the t^2 color classes of size 3 should take t^2 weight 3 colors, and the (2t choose 3) − t^2 color classes of size 1 should take the rest of the weight 3 colors. By looking at the coloring of the original intervals, if the total cost is at most 3(2t choose 3) + 2(2t choose 2) + 14t^2 + 7t, we can create a solution to the numerical three dimensional matching problem.

We can thus conclude that the unweighted minimum cost intervention design problem is NP-hard on interval graphs.

[Figure 4.1: two plots, "Num Vertices vs Cost in Min Cost Intervention Design" and "Average Degree vs Cost in Min Cost Intervention Design", each comparing the Baseline, Greedy, and Optimal cost curves.]

(a) We adjust the number of vertices. The average degree stays close to 10 for all values of the number of vertices.

(b) The number of vertices is fixed at 500. We adjust the sparsity parameter in the graph generator to see how the algorithms perform for varying graph densities.

Figure 4.1: We generate random chordal graphs such that the maximum degree is bounded by 20. The node weights are generated by the heavy-tailed Pareto distribution with scale parameter 2.0. The number of interventions m is fixed to 5. We compare the greedy algorithm to the optimal solution and the baseline algorithm mentioned in the experimental setup. We see that the greedy algorithm is close to optimal and outperforms the baseline. We also see that the greedy algorithm is able to find a solution with the available number of colors, even without quantization.

[Figure 4.2: plot "Num Interventions vs. Cost in k-Sparse Min Cost Intervention Design".]


Figure 4.2: We sample graphs of size 10000 such that the maximum degree is bounded by 20 and the average degree is 3. We draw the weights from the heavy-tailed Pareto distribution with scale parameter 2.0. We restrict all interventions to be of size 10. We adjust the penalty parameter in Algorithm 7 to see how the size of the k-sparse graph separating system relates to the cost. Costs are normalized so that the largest cost is 1.0. We see that with 561 interventions we can achieve a cost of 0.78 compared to a cost of 1.0 with 510 interventions. Our lower bound implies that we need 506 interventions on average.

115 Figure 4.3: When there are very large weights, the greedy algorithm may require a lot of colors to terminate, even on graphs with a small chromatic number. For this graph, the largest independent set is the top two vertices, followed by the next two, and so on. The greedy algorithm will color all these n pairs of vertices a different color, which is 2 colors. However after quantization the the greedy algorithm will only use 4 colors.

116 Chapter 5

On Robust Learning of Ising Models

5.1 Introduction

Ising models are an important class of probability distributions that model simple dependencies between binary random variables. They have been used to model network behavior in various domains, such as social networks, biology, and game theory [Daskalakis et al., 2011, 2017, Ellison, 1993, Montanari and Saberi, 2010]. Recent work due to Klivans and Meka [2017] develops an algorithm with essentially optimal run-time and sample complexity for the problem of structure learning for Ising models. That is, given samples from the unknown Ising model, the algorithm recovers all of its edge weights with small error.

The main thrust of this paper is to understand whether structure learning for Ising models can be made robust; i.e., can we efficiently recover the Ising model if an adversary is corrupting some draws from the underlying distribution? In the strongest setting, the adversary is typically allowed to observe the entire dataset and replace a constant fraction η of samples with arbitrary values.

In this work, we establish new lower and upper bounds for robustly learning Ising models based on the sparsity λ of the model and the smallest absolute edge weight α. For the lower bounds, we construct two Ising models over different graphs. We show that if an adversary is allowed to corrupt even an η = α exp(−O(λ)) fraction of samples, then no algorithm can differentiate the two distributions.

We complement our lower bound by establishing a robustness guarantee for the Sparsitron algorithm of Klivans and Meka [2017]. We show that the Sparsitron algorithm is robust to an adversary who can arbitrarily corrupt a fraction η = α^2 exp(−O(λ)) of samples from the Ising model. The number of samples required is the same as in the uncorrupted case: exponential in the sparsity of the model and logarithmic in its dimension.

5.1.1 Related Work

Ising Models. Bresler [2015] was the first to establish tractable algorithms for learning sparse Ising models with sample complexity that, for a fixed sparsity, depends only on the logarithm of the number of variables. The dependence on the sparsity was improved from doubly exponential to exponential by Vuffray et al. [2016] and Lokhov et al. [2018] using convex programming. Klivans and Meka [2017] improved the running time to be essentially optimal using multiplicative weights. They were able to generalize this approach to learn non-binary and higher-order graphical models.

Robust Estimation. There are a number of classic techniques for robust estimation of low-dimensional distributions [Huber and Ronchetti, 2011, Hampel et al., 2011]. Diakonikolas et al. [2016] and Lai et al. [2016] were the first to propose tractable algorithms for robust high-dimensional estimation. Diakonikolas et al. [2018] considers robust learning of Bayesian networks with known graphical structure; there, the authors mention robust learning of Ising models as an interesting open problem. Additionally, Kapoor et al. [2018] considers robust bandit learning under the same adversarial model as what we consider for the experts problem.

5.2 Problem Setup

An Ising model is defined over a graph $G = (V, E)$ with $|V| = n$. Vertices $v_i \in V$ correspond to binary random variables $x_i \in \{-1, +1\}$ (also known as spins). Edges $(v_i, v_j) \in E$, also called couplings, are associated with non-zero real parameters $A_{ij}$. There is also an external field parameter $\theta_i$ for each vertex $v_i$ that biases the variable towards a particular value. An Ising model distribution D is the distribution such that the probability of a spin configuration $x = (x_1, x_2, \ldots, x_n)$ is given by
$$P(x) = \frac{1}{Z}\exp\Big(\sum_{(v_i, v_j)\in E} A_{ij} x_i x_j + \sum_{v_i\in V}\theta_i x_i\Big),$$
where Z is a normalization constant. The set of neighbors of node $v_i \in V$ is $\partial v_i = \{v_j : (v_i, v_j)\in E\}$. Accordingly, we define the width of the Ising model as $\lambda = \max_{v_i}\big(\sum_{v_j\in\partial v_i}|A_{ij}| + |\theta_i|\big)$. To construct an estimator of the edge set that is able to reconstruct the original structure with high probability, we will make a few assumptions on the coupling intensity of the model.

A1: The smallest absolute edge weight is at least α: $\min_{(v_i, v_j)\in E}|A_{ij}| \ge \alpha$.

A2: The width of the Ising model is bounded by λ: $\max_{v_i}\big(\sum_{v_j\in\partial v_i}|A_{ij}| + |\theta_i|\big) \le \lambda$.

A common variant of the width guarantee is that the graph G has maximum degree d and the maximum absolute edge weight satisfies $\max_{ij}|A_{ij}| \le \beta$. Note that in this case the width is bounded by βd.
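As a concrete illustration of these definitions (the graph and weights below are illustrative assumptions, not from the dissertation), one can compute the exact distribution and the width of a small Ising model by brute force:

```python
import itertools
import math

# toy Ising model on 3 vertices: couplings A_ij on edges, external fields theta_i
n = 3
A = {(0, 1): 0.5, (1, 2): 1.0}
theta = [0.1, 0.0, -0.2]

def unnormalized(x):
    # exp( sum_{edges} A_ij x_i x_j + sum_i theta_i x_i )
    e = sum(a * x[i] * x[j] for (i, j), a in A.items())
    e += sum(t * xi for t, xi in zip(theta, x))
    return math.exp(e)

configs = list(itertools.product([-1, 1], repeat=n))
Z = sum(unnormalized(x) for x in configs)       # partition function
P = {x: unnormalized(x) / Z for x in configs}   # exact distribution

# width: max_i ( sum over edges touching i of |A_ij|, plus |theta_i| )
width = max(
    sum(abs(a) for e, a in A.items() if i in e) + abs(theta[i])
    for i in range(n)
)
```

For this toy model the width is attained at the middle vertex, which touches both edges.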

Now that we have set up the notation for the Ising model, let us define the two primary contamination models we will consider.

Definition 50 (Huber’s η-contamination model). Let D be a distribution on {−1, 1}n. In Huber’s η-contamination model, we receive i.i.d. samples from the distribution (1 − η)D + ηE, where E is an arbitrary distribution.

Definition 51 (η-corrupted samples). Let D be a distribution on $\{-1, 1\}^n$. We say that a collection of samples U is η-corrupted if it was created by the following process: generate $m = |U|$ samples by drawing them i.i.d. from D; then an adversary chooses an η-fraction of the samples and replaces them with arbitrary values in $\{-1, 1\}^n$.

In the corrupted model, the adversary can introduce dependencies between the observed data. Ignoring a technical issue for simplicity, the corruption model is stronger than the contamination model. (Technical issue: the number of corrupted samples in the second model is not random, but with high probability, for large enough n, the second model is stronger [Diakonikolas et al., 2016].) All of our achievability results hold in the corruption model and all of our impossibility results hold in the contamination model.

5.3 Inachievability Results

Figure 5.1: Graphs for the inachievability result in Theorem 53

We will use the following well-known lemma in both of our subsequent proofs of inachievability (see, for example, Fact 2.3 of [Diakonikolas et al., 2016]). It essentially states that if two distributions are close in total variation distance, then they cannot be distinguished given contaminated samples. It can be proven using Farkas' lemma.

Lemma 52. Given two distributions $D_1$ and $D_2$ such that the total variation distance satisfies $d_{TV}(D_1, D_2) \le \eta$, there exist distributions $E_1$ and $E_2$ such that $(1-\eta)D_1 + \eta E_1 = (1-\eta)D_2 + \eta E_2$.

We first establish that no algorithm can robustly learn an Ising model with bounded width λ when the fraction of contaminated samples is even exponentially small in λ.

Theorem 53. For all λ and α > 0, there exist two Ising models $D_1$ and $D_2$, both with width λ and minimum absolute edge weight α, such that given η-contaminated samples with $\eta > \min\{\alpha, 1\}\exp(-2(\lambda - \alpha))$, no algorithm, with any number of samples, can distinguish the two distributions.

This theorem holds even in the weaker model where the adversary cannot corrupt real samples but can only inject an η-fraction of samples.

Proof. We will create two Ising models with different graph structure that can be completely confused when $\eta > \min\{\alpha, 1\}\exp(-2\beta)$. Both models are on three vertices a, b, c and have an edge bc with weight β. The first model (Model 1-1 in Fig. 5.1) has an additional edge ab of weight α, while the second model (Model 1-2 in Fig. 5.1) has an additional edge ac with weight α. Note that both models have width λ = α + β.

The total variation distance between these models can be calculated as
$$d_{TV}(\tilde{D}_1, \tilde{D}_2) = \frac{2(e^{-\beta+\alpha} - e^{-\beta-\alpha})}{2e^{\beta+\alpha} + 2e^{\beta-\alpha} + 2e^{-\beta-\alpha} + 2e^{-\beta+\alpha}} = \frac{1}{e^{2\beta}+1}\cdot\frac{e^{2\alpha}-1}{e^{2\alpha}+1} = \tanh(\alpha)\,\sigma(-2\beta) \le \min\{\alpha, 1\}\,e^{-2\beta}.$$

Thus if an adversary can inject min{α, 1} exp(−2β) fraction of samples, they can make samples from one model look like samples from the other model and vice-versa.
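This calculation can be sanity-checked numerically; the sketch below (an illustrative check, not part of the original proof) enumerates all eight spin configurations of the two three-vertex models and compares the brute-force total variation distance to the closed form tanh(α)σ(−2β):

```python
import itertools
import math

def ising_probs(edges, n=3):
    """Exact probabilities of an Ising model with zero external field.
    edges: dict {(i, j): weight}."""
    ws = []
    for x in itertools.product([-1, 1], repeat=n):
        ws.append(math.exp(sum(w * x[i] * x[j] for (i, j), w in edges.items())))
    Z = sum(ws)
    return [w / Z for w in ws]

alpha, beta = 0.3, 1.0
# Model 1-1: edges ab, bc; Model 1-2: edges ac, bc (a=0, b=1, c=2)
p1 = ising_probs({(0, 1): alpha, (1, 2): beta})
p2 = ising_probs({(0, 2): alpha, (1, 2): beta})
tv = 0.5 * sum(abs(u - v) for u, v in zip(p1, p2))
# closed form: tanh(alpha) * sigma(-2*beta) = tanh(alpha) / (1 + e^{2 beta})
closed_form = math.tanh(alpha) / (1.0 + math.exp(2.0 * beta))
```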

One special case of interest is when the Ising model has a degree bound d and a bound β on the maximum absolute edge weight. For this special case, we establish a similar lower bound.

Theorem 54. For all d, α > 0, and β > (ln 2)/3, there exist two Ising models $D_1$ and $D_2$ on different graphs, both d-sparse and with edge weights satisfying $\alpha \le A_{uv} \le \beta$, such that given η-contaminated samples with $\eta > \min\{\alpha, 1\}e^{-C\beta d}$ for a fixed constant C, no algorithm can distinguish the two distributions.

The detailed proof of Theorem 54 can be found in Appendix Section 5.5.

5.4 Achievable Results

We complement the prior lower bound with the following robustness guarantee for learning Ising models with corrupted samples.

Theorem 55. Given an Ising model distribution D of width λ such that the absolute values of all edge weights are greater than α, suppose we receive N η-corrupted samples from D. If $\eta < C_1\min\{\alpha^2, 1\}e^{-C_2\lambda}$ and $N = O\!\big(\frac{\exp(C_3\lambda)}{\alpha^4}\log\frac{n}{\delta\alpha}\big)$, then we can recover the Ising model structure with probability 1 − δ, for some fixed constants $C_1, C_2, C_3$.

We establish this by showing robustness of the Sparsitron algorithm introduced by Klivans and Meka [2017], where the authors established how to learn Ising models using sparse generalized linear models (GLMs). A generalized linear model is defined by a weight vector w and a link function $\sigma : \mathbb{R} \to [0, 1]$, which in this work we assume to be the logistic function.² The model predicts the response ŷ to a feature x as σ(w · x).

For an Ising model with weight matrix A and external field θ, we have that
$$P(x_i = +1 \mid x_{\setminus i}) = \sigma\Big(2\theta_i + 2\sum_j A_{ij}x_j\Big),$$
where σ is the logistic function. Thus, if we set the true label $y_i = \mathbb{1}(x_i = +1)$, then the expected label is a GLM. Klivans and Meka [2017] show how to learn an Ising model by solving
$$\min_w\ \mathbb{E}\big[(\sigma(w\cdot x) - y_i)^2\big].$$
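This conditional-probability identity can be verified by brute force on a toy model (the weights below are illustrative assumptions, not from the text):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# a small Ising model: couplings and external fields (illustrative values)
A = {(0, 1): 0.7, (1, 2): -0.4}
theta = [0.2, -0.1, 0.3]

def weight(x):
    # unnormalized probability exp( sum A_ij x_i x_j + sum theta_i x_i )
    e = sum(a * x[i] * x[j] for (i, j), a in A.items())
    e += sum(t * xi for t, xi in zip(theta, x))
    return math.exp(e)

# brute-force conditional P(x_0 = +1 | x_1, x_2): terms without x_0 cancel
x_rest = (1, -1)
num = weight((1,) + x_rest)
cond = num / (num + weight((-1,) + x_rest))

# GLM form: sigma(2*theta_0 + 2 * sum_j A_0j x_j); only edge (0,1) touches x_0
glm = sigmoid(2 * theta[0] + 2 * A[(0, 1)] * x_rest[0])
```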

Lemma 56 (Klivans and Meka [2017]). Given an Ising model distribution D of width λ, suppose that $w^*, \theta^*$ are such that $P(x_i = 1 \mid x_{\setminus i}) = \sigma(w^*\cdot x + \theta^*)$. If we have a weight vector w, θ such that, for some α < 1 and a fixed constant C,
$$\mathbb{E}_{x\sim D}\big[(\sigma(w\cdot x + \theta) - \sigma(w^*\cdot x + \theta^*))^2\big] \le \alpha^2 e^{-C\lambda},$$
then we have
$$\|w - w^*\|_\infty \le \alpha.$$

Their result shows that if we can learn the appropriate Ising model GLMs with small enough error, then we can learn the Ising structure. Their algorithm for learning the sparse GLM is Sparsitron, which is based on the Hedge algorithm due to Freund and Schapire [1997].

²For GLMs, we can use any 1-Lipschitz link function.

124 We are able to show that Sparsitron is robust to corrupted samples. See Section 5.6 for the proof. Since Sparsitron is based on multiplicative weights, the key result needed to establish Theorem 55 is a robustness guarantee of the Hedge algorithm.

5.4.1 Robustness of the Hedge Algorithm

The Hedge algorithm is a powerful tool with many applications in learning theory [Freund and Schapire, 1997, Arora et al., 2012].

In the experts problem, there are n experts. At every iteration t, we must produce a distribution $p^t$ over the experts. We then observe the loss $\ell_i^t$ of each expert. We want our distributions to minimize the total loss $L = \sum_{t=1}^T p^t\cdot\ell^t$.

The Hedge algorithm learns distributions in the following way:

1. Initialize a weight $w_i^1 = 1$ for each expert i.

2. At iteration t, use the distribution $p^t = w^t/\|w^t\|_1$.

3. After observing the loss $\ell_i^t$ for each expert i, set $w_i^{t+1} = w_i^t(1-\varepsilon)^{\ell_i^t}$.
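The three steps above can be sketched directly; the following minimal implementation (the helper name and the toy losses are illustrative, not from the dissertation) also checks the standard noiseless guarantee $L \le \ln n/\varepsilon + (1+\varepsilon)\min_i L_i$:

```python
import math

def hedge(losses, eps):
    """Run Hedge on a T x n list of loss vectors in [0, 1].
    Returns (total algorithm loss, final distribution over experts)."""
    n = len(losses[0])
    w = [1.0] * n                                   # step 1: initialize weights
    total = 0.0
    for l in losses:
        s = sum(w)
        p = [wi / s for wi in w]                    # step 2: normalized weights
        total += sum(pi * li for pi, li in zip(p, l))
        w = [wi * (1 - eps) ** li for wi, li in zip(w, l)]  # step 3: update
    s = sum(w)
    return total, [wi / s for wi in w]

# toy instance: expert 0 always incurs loss 1, expert 1 always incurs loss 0
T, eps = 100, 0.1
total, p = hedge([[1.0, 0.0]] * T, eps)
bound = math.log(2) / eps + (1 + eps) * 0.0  # best expert's total loss is 0
```

On this toy instance the algorithm quickly shifts nearly all probability mass to the zero-loss expert.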

We want to control the loss of the Hedge algorithm in the following adversarial noise model, where for an η-fraction of losses, the true loss `t is different from the observed loss `˜t.

Definition 57 (η-corrupted experts problem). In every iteration t, for every expert i, there is a loss $\ell_i^t$ such that $0 \le \ell_i^t \le 1$; however, we observe the loss $\tilde\ell_i^t$. In the η-corruption model, the losses $\ell_i^t$ are all created first; before we start observing data, the adversary sees all the losses and creates the observed losses $\tilde\ell_i^t$ such that at most an η-fraction of the observed losses differ from the true losses. Our goal is to generate a distribution $p^t$ over the experts at every iteration such that the total loss after T iterations, $L = \sum_{t=1}^T p^t\cdot\ell^t$, is minimized. However, we must do this without observing the true losses, only the observed losses $\tilde\ell^t$.

A simple modification of the proof of the noiseless guarantee of Hedge shows that if we run the Hedge algorithm on the observed losses, we can still bound the true loss.

Lemma 58. Let $L_i = \sum_{t=1}^T \ell_i^t$. In the η-corrupted loss model, the multiplicative weights algorithm run on the observed losses $\tilde\ell^t$ achieves total loss $L = \sum_{t=1}^T p^t\cdot\ell^t$ such that
$$L \le \frac{\ln n}{\varepsilon} + (1+\varepsilon)L_i + 3\eta T.$$

The proof is in Section 5.6 of the Appendix. Note that this is an additive factor of O(ηT ) from the guarantee in the noiseless case.

5.5 Proof of Theorem 54

Proof. We will create two Ising model distributions $D_1$ and $D_2$ that satisfy the conditions of Theorem 54 and show that they have total variation distance
$$d_{TV}(D_1, D_2) \le \alpha e^{-\beta(d-2)}.$$

Figure 5.2: Graphs for Theorem 54

Both models have a clique of size d on vertices $v_1, v_2, \ldots, v_d$ such that every edge weight in the clique is β. Both models also have a vertex $v_0$. Model 2-1 has an edge between $v_0$ and $v_1$ with edge weight α. Model 2-2 has an edge between $v_0$ and $v_2$ with edge weight α. In addition to what is described in the figures, we also assume $n - d - 1$ vertices with identical configurations which are unconnected to the rest of the graph and thus do not affect the energy and, consequently, the total variation distance calculations.

Define $E(k) = \binom{k}{2} + \binom{d-2-k}{2} - k(d-2-k)$. Note that E(k) is the energy from the edge weights between $\{x_3, x_4, \ldots, x_d\}$ when k of the variables take the value 1.

We will lower bound the partition function Z by
$$Z \ge (e^{\alpha}+e^{-\alpha})\,e^{\beta}\sum_{k=0}^{d-2}\binom{d-2}{k}e^{\beta E(k)}e^{|2(2k-d+2)|\beta}. \tag{5.1}$$

To see this, first consider only the configurations such that $x_1 = x_2$, and group the configurations by the number of variables with the value 1 in $\{x_3, x_4, \ldots, x_d\}$. For each of the $\binom{d-2}{k}$ configurations of $\{x_3, x_4, \ldots, x_d\}$ with k values equal to 1, we also need to decide the value of $x_1 = x_2$ and the value of $x_0$. We thus have
$$Z \ge (e^{\alpha}+e^{-\alpha})\sum_{k=0}^{d-2}\binom{d-2}{k}\Big(e^{2\beta(2k-d+2)+\beta E(k)+\beta} + e^{-2\beta(2k-d+2)+\beta E(k)+\beta}\Big) \ge (e^{\alpha}+e^{-\alpha})\,e^{\beta}\sum_{k=0}^{d-2}\binom{d-2}{k}e^{\beta E(k)}e^{|2(2k-d+2)|\beta},$$
where we used $\max\{2k-d+2, -(2k-d+2)\} = |2k-d+2|$.

Writing $E_1(x)$ and $E_2(x)$ for the unnormalized weights of the two models, we can now bound the total variation distance:
$$\begin{aligned}
d_{TV}(D_1, D_2) &= \frac{1}{2}\sum_{x\in\{-1,+1\}^n}\frac{1}{Z}\,|E_1(x)-E_2(x)|
= \frac{1}{2}\cdot\frac{4}{Z}\,(e^{\alpha}-e^{-\alpha})\,e^{-\beta}\sum_{k=0}^{d-2}\binom{d-2}{k}e^{\beta E(k)}\\
&\le 2\,\frac{e^{\alpha}-e^{-\alpha}}{e^{\alpha}+e^{-\alpha}}\,e^{-2\beta}\,\frac{\sum_{k=0}^{d-2}\binom{d-2}{k}e^{\beta E(k)}}{\sum_{k=0}^{d-2}\binom{d-2}{k}e^{\beta E(k)}e^{|2(2k-d+2)|\beta}}
\le 2\,\frac{e^{\alpha}-e^{-\alpha}}{e^{\alpha}+e^{-\alpha}}\,e^{-2\beta}\,\frac{2^{d-2}e^{\beta E(d-2)}}{e^{\beta E(d-2)}e^{4(d-2)\beta}}\\
&\le 2\alpha\,e^{-2\beta}e^{-(d-2)(4\beta-\ln 2)}.
\end{aligned}$$

5.6 Proof of Achievability

To prove Theorem 55, we just need to show that we can learn the sparse GLM even when the samples come from an η-corrupted distribution. The rest of the proof follows from Lemma 56.

We first prove Lemma 58, following the approach of Freund and Schapire [1997] (see also Arora et al. [2012]).

Proof of Lemma 58. We define $\Phi^t = \sum_{i=1}^n w_i^t$. We first upper bound the value of $\Phi^T$.

Suppose that at time t the observed loss $\tilde\ell^t$ is equal to the true loss $\ell^t$. We have that
$$\Phi^{t+1} = \sum_{i=1}^n w_i^{t+1} = \sum_{i=1}^n w_i^t(1-\gamma)^{\ell_i^t} \le \sum_{i=1}^n w_i^t(1-\ell_i^t\gamma) = \Phi^t - \gamma\Phi^t\sum_{i=1}^n p_i^t\ell_i^t \le \Phi^t\exp(-\gamma\,p^t\cdot\ell^t).$$

Now suppose that the observed loss $\tilde\ell^t$ is different from the true loss $\ell^t$. Note that $\tilde\ell_i^t - \ell_i^t \ge -1$. We have that
$$\Phi^{t+1} = \sum_{i=1}^n w_i^{t+1} = \sum_{i=1}^n w_i^t(1-\gamma)^{\tilde\ell_i^t} = \sum_{i=1}^n w_i^t(1-\gamma)^{\tilde\ell_i^t-\ell_i^t+\ell_i^t} \le (1-\gamma)^{-1}\sum_{i=1}^n w_i^t(1-\gamma)^{\ell_i^t} \le (1-\gamma)^{-1}\Phi^t\exp(-\gamma\,p^t\cdot\ell^t).$$

Since only ηT steps have corrupted losses, we have that
$$\Phi^T \le \Phi^1(1-\gamma)^{-\eta T}\exp(-\gamma L), \tag{5.2}$$
where $L = \sum_{t=1}^T p^t\cdot\ell^t$ is the total loss.

We now lower bound the value of $\Phi^T$. If the observed loss is equal to the true loss, we have that
$$\Phi^{t+1} \ge w_i^{t+1} = w_i^t(1-\gamma)^{\ell_i^t};$$
otherwise, since $\tilde\ell_i^t - \ell_i^t \le 1$, we have that
$$\Phi^{t+1} \ge w_i^{t+1} = w_i^t(1-\gamma)^{\tilde\ell_i^t-\ell_i^t+\ell_i^t} \ge (1-\gamma)\,w_i^t(1-\gamma)^{\ell_i^t}.$$

Since only ηT steps have corrupted losses, we have that
$$\Phi^T \ge w_i^1(1-\gamma)^{\eta T}(1-\gamma)^{L_i}, \tag{5.3}$$
where $L_i = \sum_{t=1}^T \ell_i^t$ is the loss of expert i.

Since $\Phi^1 = n$ and $w_i^1 = 1$, we can use Equations (5.2) and (5.3) to show that
$$(1-\gamma)^{\eta T}(1-\gamma)^{L_i} \le \Phi^T \le n(1-\gamma)^{-\eta T}\exp(-\gamma L),$$
which implies that
$$L \le \frac{\ln n}{\gamma} + \frac{-\ln(1-\gamma)}{\gamma}\,L_i + 2\eta T\,\frac{-\ln(1-\gamma)}{\gamma}.$$

Using the fact that $-\ln(1-\gamma) \le \gamma(1+\gamma)$ for $\gamma \le \frac{1}{2}$, we can conclude that
$$L \le \frac{\ln n}{\gamma} + (1+\gamma)L_i + 3\eta T.$$

We then need to prove that the Sparsitron algorithm is able to efficiently optimize sparse GLMs with an η-corrupted dataset. The Sparsitron algorithm is specified in Algorithm 8. Note that it is a particular instance of the multiplicative weights algorithm.

Algorithm 8 Sparsitron [Klivans and Meka, 2017]
  Input: training samples $(x^1, y^1), \ldots, (x^T, y^T)$
  Input: test samples $(a^1, b^1), \ldots, (a^T, b^T)$
  Input: sparsity parameter λ, weight parameter γ
  Initialize all weights: $w_i^0 = 1$.
  For all iterations t = 1, 2, ..., T:
    $p^t = w^{t-1}/\|w^{t-1}\|_1$
    $\ell^t = \frac{1}{2}\big(1 + (\sigma(\lambda p^t\cdot x^t) - y^t)x^t\big)$
    $w_i^t = w_i^{t-1}(1-\gamma)^{\ell_i^t}$
  For all iterations t = 1, 2, ..., T:
    $\hat\varepsilon(\lambda p^t) = \frac{1}{T}\sum_{j=1}^T(\sigma(\lambda p^t\cdot a^j) - b^j)^2$
  Return $\lambda p^{t^*}$ for $t^* = \arg\min_t \hat\varepsilon(\lambda p^t)$
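A rough Python sketch of Algorithm 8 follows, under the stated assumptions that features lie in {−1, +1} and the target weight vector is nonnegative with ℓ1 norm λ (the synthetic data and function names are illustrative, not from the dissertation; negative weights would be handled by a standard doubling of coordinates):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sparsitron(X, y, A, b, lam, gamma):
    """Sketch of Sparsitron. X, y: training samples; A, b: held-out samples.
    Returns the candidate lam * p^t with the smallest empirical held-out risk."""
    T, n = X.shape
    w = np.ones(n)
    candidates = []
    for t in range(T):
        p = w / w.sum()
        candidates.append(lam * p)
        # loss vector in [0, 1]
        ell = 0.5 * (1.0 + (sigmoid(lam * (p @ X[t])) - y[t]) * X[t])
        w = w * (1.0 - gamma) ** ell
    risks = [np.mean((sigmoid(A @ v) - b) ** 2) for v in candidates]
    return candidates[int(np.argmin(risks))]

# synthetic GLM data: true weights concentrated on the first coordinate
rng = np.random.default_rng(0)
n, T, lam = 5, 2000, 2.0
w_star = np.array([2.0, 0.0, 0.0, 0.0, 0.0])
X = rng.choice([-1.0, 1.0], size=(T, n))
y = (rng.random(T) < sigmoid(X @ w_star)).astype(float)
A = rng.choice([-1.0, 1.0], size=(T, n))
b = (rng.random(T) < sigmoid(A @ w_star)).astype(float)
v = sparsitron(X, y, A, b, lam, gamma=0.1)
```

By construction the returned vector is λ times a probability distribution, and its held-out risk is no worse than that of the uniform initial candidate.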

Using Sparsitron, we can establish the following theorem.

Theorem 59. Let D be a distribution on $\{-1,1\}^n\times\{0,1\}$ such that $\mathbb{E}_{x,y\sim D}[y\mid x] = \sigma(w^*\cdot x)$. Assume that $\|w^*\|_1 \le \lambda$ for a known λ. There exists an algorithm that, given $T = O\big(\frac{\lambda^2}{\varepsilon^2}\log\frac{n}{\delta\varepsilon}\big)$ η-corrupted samples from D, learns a weight vector w that, with probability 1 − δ, satisfies
$$\mathbb{E}_{x,y\sim D}\big[(\sigma(w\cdot x) - \sigma(w^*\cdot x))^2\big] \le \varepsilon + O(\lambda\eta).$$

Using Theorem 59, we can now prove Theorem 55 by setting $\varepsilon, \eta \le \min\{\alpha^2, 1\}e^{-C\max\{\lambda, 1\}}$ for some constant C and applying Lemma 56.

We now prove Theorem 59. The proof is essentially the same as the proof of Theorem 3.1 by Klivans and Meka [2017] in the non-adversarial setting, except that we use the bound of Lemma 58. For completeness, we duplicate the proof, including the error induced by the adversarial setting.

Proof of Theorem 59. We can assume that $w^* \ge 0$ and $\|w^*\|_1 = \lambda$. If not, we can map examples (x, y) to ((x, −x, 0), y).

Define the risk of a weight vector v to be $\varepsilon(v) = \mathbb{E}_{x,y\sim D}[(\sigma(v\cdot x) - y)^2]$. Since the training set and the holdout set are the same size, if the whole dataset is η-corrupted, then each portion of the dataset is at most 2η-corrupted.

Let x˜t, y˜t be the corrupted examples. To be fully rigorous, we need to define a probability space over the examples (xt, yt, x˜t, y˜t). We will assume the adversary takes the following form:

1. The adversary receives the dataset (xt, yt).

2. The adversary enumerates all valid η-corrupted datasets.

3. The adversary runs the Sparsitron algorithm for each η-corrupted dataset and calculates the risk of the feature vector learned from each dataset.

4. The adversary sends us the dataset with the highest risk.

This is a deterministic function of the true dataset. If we are robust to this adversary, then we are robust to any adversary.

Let $Q^t = p^t\cdot\ell^t - \frac{w^*}{\lambda}\cdot\ell^t$. Let
$$Z^t = Q^t - \mathbb{E}_{x^t,y^t,\tilde x^t,\tilde y^t}\big[Q^t \mid (x^1, y^1, \tilde x^1, \tilde y^1), \ldots, (x^{t-1}, y^{t-1}, \tilde x^{t-1}, \tilde y^{t-1})\big].$$

Note that $Z^1, \ldots, Z^T$ is a martingale difference sequence with respect to the sequence $(x^1, y^1, \tilde x^1, \tilde y^1), \ldots, (x^T, y^T, \tilde x^T, \tilde y^T)$, as $Z^t$ is a function of the values $(x^1, y^1, \tilde x^1, \tilde y^1), \ldots, (x^t, y^t, \tilde x^t, \tilde y^t)$. Further, $Z^t$ is bounded between −2 and 2. Thus, by the Azuma-Hoeffding inequality, we have that $\sum_{t=1}^T Z^t \le O(\sqrt{T\log(1/\delta)})$ with probability 1 − δ. We can conclude that, with probability at least 1 − δ,
$$\sum_{t=1}^T \mathbb{E}_{x^t,y^t,\tilde x^t,\tilde y^t}\big[Q^t \mid (x^1, y^1, \tilde x^1, \tilde y^1), \ldots, (x^{t-1}, y^{t-1}, \tilde x^{t-1}, \tilde y^{t-1})\big] \le \sum_{t=1}^T Q^t + O(\sqrt{T\log(1/\delta)}). \tag{5.4}$$

Now we analyze the term $\mathbb{E}_{x^t,y^t,\tilde x^t,\tilde y^t}[Q^t \mid (x^1, y^1, \tilde x^1, \tilde y^1), \ldots, (x^{t-1}, y^{t-1}, \tilde x^{t-1}, \tilde y^{t-1})]$ and relate it to the error of the weight vector $\lambda p^t$. Recall that
$$\mathbb{E}_{x^t,y^t,\tilde x^t,\tilde y^t}[Q^t \mid \cdots] = \mathbb{E}_{x^t,y^t,\tilde x^t,\tilde y^t}\big[(p^t - (1/\lambda)w^*)\cdot\ell^t \mid \cdots\big].$$
Note that $p^t$ is completely determined by $(\tilde x^1, \tilde y^1), \ldots, (\tilde x^{t-1}, \tilde y^{t-1})$ and $\ell^t$ is completely determined by $(x^t, y^t)$. We thus have
$$\mathbb{E}_{x^t,y^t,\tilde x^t,\tilde y^t}[Q^t \mid \cdots] = \mathbb{E}_{x^t,y^t}\big[(p^t - (1/\lambda)w^*)\cdot\ell^t\big].$$
From the proof of Theorem 3.1 of Klivans and Meka [2017], we can conclude that
$$\mathbb{E}_{x^t,y^t,\tilde x^t,\tilde y^t}[Q^t \mid \cdots] = \mathbb{E}_{x^t,y^t}\big[(p^t - (1/\lambda)w^*)\cdot\ell^t\big] \ge \frac{1}{2\lambda}\,\varepsilon(\lambda p^t).$$

Connecting the above with Inequality (5.4), we have, with probability at least 1 − δ,
$$\frac{1}{2\lambda}\sum_{t=1}^T\varepsilon(\lambda p^t) \le \sum_{t=1}^T Q^t + O(\sqrt{T\log(1/\delta)}) = \sum_{t=1}^T p^t\cdot\ell^t - \sum_{t=1}^T\frac{w^*}{\lambda}\cdot\ell^t + O(\sqrt{T\log(1/\delta)}). \tag{5.5}$$

Now, using Lemma 58 with $\gamma = \sqrt{\frac{\ln n}{T}}$, we have that the total loss $L = \sum_{t=1}^T p^t\cdot\ell^t$ satisfies
$$L \le \min_i L_i + O(\sqrt{T\log n} + \eta T),$$

where $L_i = \sum_{t=1}^T\ell_i^t$ is the loss of expert i. Connecting this with Inequality (5.5), we have, with probability 1 − δ,
$$\frac{1}{2\lambda}\sum_{t=1}^T\varepsilon(\lambda p^t) \le \min_i L_i - \sum_{t=1}^T\frac{w^*}{\lambda}\cdot\ell^t + O\big(\sqrt{T\log(1/\delta)} + \sqrt{T\log n} + \eta T\big).$$
Since $w^*/\lambda$ is a valid distribution, and $\min_i L_i$ is the minimum loss over all distributions, we have that $\min_i L_i - \sum_{t=1}^T\frac{w^*}{\lambda}\cdot\ell^t \le 0$, and thus, with probability 1 − δ,
$$\frac{1}{2\lambda}\sum_{t=1}^T\varepsilon(\lambda p^t) \le O\big(\sqrt{T\log(1/\delta)} + \sqrt{T\log n} + \eta T\big)$$
$$\implies\ \min_t\varepsilon(\lambda p^t) \le \frac{1}{T}\sum_{t=1}^T\varepsilon(\lambda p^t) \le \lambda\,O\!\left(\sqrt{\frac{\log(1/\delta)+\log n}{T}}\right) + O(\lambda\eta).$$

Setting $T = O\big(\frac{\lambda^2\log(n/\delta)}{\varepsilon^2}\big)$, we have, with probability 1 − δ, that
$$\min_t\varepsilon(\lambda p^t) \le O(\varepsilon + \lambda\eta).$$

Using our holdout set, we calculate the empirical error of each weight vector $\lambda p^t$ using
$$\hat\varepsilon(\lambda p^t) = \frac{1}{T}\sum_{j=1}^T\big(\sigma(\lambda p^t\cdot a^j) - b^j\big)^2.$$

From Fact 3.2 of Klivans and Meka [2017], since $T \ge O\big(\frac{\log(T/\delta)}{\varepsilon^2}\big)$, we know that, with probability 1 − δ, we have $|\varepsilon(\lambda p^t) - \hat\varepsilon(\lambda p^t)| \le \varepsilon$. However, our examples are η-corrupted. Since each value of $(\sigma(\lambda p^t\cdot a^j) - b^j)^2$ is bounded between 0 and 4, the mean over the corrupted examples differs from the mean over the true examples by an additive factor of at most O(η). Thus, by choosing the weight vector $\lambda p^t$ with the smallest empirical error, we can find a weight vector $\lambda p^t$ such that $\varepsilon(\lambda p^t) \le O(\varepsilon + \lambda\eta)$.

Chapter 6

Uncertainty-Aware Compressive Sensing with Flow Composition

6.1 Introduction

Invertible generative models are a class of generative models designed to have efficient density evaluation and inverse functions. Specifically, they work by defining a function f such that the output image is x = f(z), for a latent variable z drawn from i.i.d. Gaussian noise. The function f is designed such that the inverse $f^{-1}$ can be easily calculated, as well as the density function P(x). Starting with the work of Dinh et al. [2014], we are now able to train high-quality invertible generative models for a wide class of datasets.

In this work we consider the Bayesian compressive sensing problem [Ji et al., 2008] for invertible generative models. In this problem, we are given an invertible generative model that defines a distribution P (x), a measurement matrix A and a set of measurements y∗. Our goal is to generate samples from the conditional distribution P (x | Ax = y∗). Natural applications of this problem are image completion and superresolution with uncertainty quantification.

While methods such as Langevin dynamics and variational inference exist for this problem, we observe that these approaches have to learn to generate samples from scratch. Training generative models is a difficult and computationally expensive problem, and since we already have a powerful generative model, it may be more efficient to directly utilize this model in conditional sample generation.

Specifically, rather than generate samples of x directly, we consider an approach that generates samples of the latent variables z such that when fed into the invertible generative model, the output x is such that Ax ≈ y∗ and x is drawn from the conditional distribution. We observe that working in the latent space is much more efficient and returns higher quality samples.

Our approach is based on variational inference. However, due to the form of our composed variational family, we cannot directly apply existing approaches. We consider a "smoothed" alternative such that the ELBO can be easily calculated in the new smoothed problem.

Our contributions:

• We establish that the Bayesian compressive sensing problem is computationally intractable even for simple invertible generative model architectures. Because of this, we consider approximate alternatives.

• We develop an approach to conditionally sample from a given invertible generative model by composing another generative model with the given model. Specifically, we show how to train the added model to generate structured noise so that the output of the composed model matches the conditional distribution.

• We experimentally verify our approach and we see that it is able to create higher quality samples compared to alternative approaches.

6.2 Background and Related Work

6.2.1 Invertible Generative Models

Invertible generative models (also known as normalizing flow models) are a class of generative models that sample an output x by taking a noise input z (typically i.i.d. Gaussian noise) and returning x = f(z) for a bijective and differentiable function f.

Since f is bijective, for any x there is a unique z such that x = f(z). We can thus calculate the probability density of x to be
$$P(x) = P(z)\left|\det\frac{df^{-1}}{dx}(x)\right|,$$
where $\frac{df^{-1}}{dx}$ is the Jacobian of the inverse transformation and $z = f^{-1}(x)$.

For invertible generative models, the architecture is designed such that both f(z) and f −1(x) are easily calculated, as well as the determinant of the Jacobian of f −1. They are typically trained by maximizing the log likelihood, as the likelihood of a dataset can be easily evaluated.

Research into neural invertible generative models started with Dinh et al. [2014] and the NICE model. Since then, there has been extensive research into architectures for invertible generative models. Some examples include RealNVP [Dinh et al., 2016], i-RevNets [Jacobsen et al., 2018], Glow [Kingma and Dhariwal, 2018], Neural ODEs [Chen et al., 2018], FFJORD [Grathwohl et al., 2019], invertible ResNets [Behrmann et al., 2018], and neural spline flows [Durkan et al., 2019].

As an example, we describe the affine coupling layer, first described by Dinh et al. [2016]. In an affine coupling layer, we first partition the variables as $x = (x_1, x_2) \in \mathbb{R}^{n_1}\times\mathbb{R}^{n_2}$. The layer is parametrized by neural networks $N_s : \mathbb{R}^{n_1}\to\mathbb{R}^{n_2}$ and $N_a : \mathbb{R}^{n_1}\to\mathbb{R}^{n_2}$. For the output $x' = (x'_1, x'_2)$, we pass $x_1$ through unchanged and set $x'_2 = N_s(x_1)\odot x_2 + N_a(x_1)$, where $\odot$ denotes the elementwise product. The inverse and the determinant of the Jacobian of this transformation can easily be calculated.
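A minimal numerical sketch of an affine coupling layer (the toy stand-in "networks" are assumptions for illustration; following the common RealNVP convention, the scale output is exponentiated so the layer is invertible for any network output, a slight variant of the N_s(x_1) ⊙ x_2 + N_a(x_1) form above):

```python
import numpy as np

def affine_coupling_forward(x1, x2, scale_net, shift_net):
    """x1 passes through unchanged; x2 is scaled and shifted conditioned on x1.
    Returns (y1, y2, log |det Jacobian|) of the forward map."""
    s = scale_net(x1)
    y2 = np.exp(s) * x2 + shift_net(x1)
    log_det = float(np.sum(s))  # Jacobian is triangular with diagonal exp(s)
    return x1, y2, log_det

def affine_coupling_inverse(y1, y2, scale_net, shift_net):
    """Exact inverse: undo the shift, then the scale."""
    s = scale_net(y1)
    x2 = (y2 - shift_net(y1)) * np.exp(-s)
    return y1, x2

# toy stand-ins for the neural networks N_s and N_a
scale_net = lambda x: 0.5 * np.tanh(x)
shift_net = lambda x: x ** 2

x1, x2 = np.array([0.3, -1.2]), np.array([2.0, 0.1])
y1, y2, log_det = affine_coupling_forward(x1, x2, scale_net, shift_net)
r1, r2 = affine_coupling_inverse(y1, y2, scale_net, shift_net)
```

The inverse recovers the inputs exactly, and the log-determinant is just the sum of the scale outputs, which is what makes density evaluation cheap.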

Invertible generative models are attractive due to their tractable density function and explicit inverse function. They have been utilized in applications such as compressive sensing, image deblurring, and image completion [Asim et al., 2019, Shamshad et al., 2019], attribute manipulation [Kingma and Dhariwal, 2018], and variational inference [Rezende and Mohamed, 2015, Kingma et al., 2016].

6.2.2 Variational Inference for Conditional Sampling

Variational inference [Jordan et al., 1999, Wainwright et al., 2008, Blei et al., 2017] is a set of techniques that attempt to solve difficult inference problems by optimizing over a tractable family of distributions Q, called the variational family.

For the Bayesian compressive sensing problem, we attempt to minimize the KL divergence between the tractable approximation and the true conditional $P(x \mid Ax = y^*)$. For simplicity, we will assume that $x = (x_1, x_2)$ is a partition of x and that we want to fit the conditional distribution $p(x_1 \mid x_2 = x_2^*)$. The KL minimization problem is
$$\min_{q\in Q} D\big(q(x_1)\,\|\,p(x_1 \mid x_2 = x_2^*)\big),$$
where $D(q(x)\,\|\,p(x)) = \mathbb{E}_{x\sim q}\big[\log\frac{q(x)}{p(x)}\big]$ is the KL divergence. We choose the variational family Q such that, for all members q, we can sample from q and evaluate the density q(x).

While the conditional density $p(x_1 \mid x_2 = x_2^*) = \frac{p(x_1, x_2^*)}{P(x_2^*)}$ cannot be efficiently calculated, the joint density $P(x_1, x_2 = x_2^*)$ often can be, such as when the joint density is given by an invertible generative model. We can rewrite the minimization problem as
$$\min_{q\in Q} D\big(q(x_1)\,\|\,p(x_1\mid x_2 = x_2^*)\big) = \min_{q\in Q}\ \mathbb{E}_{x_1\sim q}\big[\log q(x_1) - \log p(x_1, x_2 = x_2^*)\big] + \log p(x_2 = x_2^*).$$
Since $P(x_2 = x_2^*)$ is a constant with respect to q, we can ignore it in the optimization and instead maximize $\mathbb{E}_{x_1\sim q}[\log P(x_1, x_2 = x_2^*) - \log q(x_1)]$, which is the ELBO. We can then use the q we found to sample values of $x_1$ conditioned on $x_2 = x_2^*$.
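As a worked toy example of this procedure (not from the dissertation), take the joint density to be a bivariate standard normal with correlation ρ and condition on x_2 = c. For a Gaussian variational family q = N(m, s^2), the ELBO reduces, up to constants independent of (m, s), to the expression coded below, and maximizing it recovers the exact conditional N(ρc, 1 − ρ^2):

```python
import math

# joint: bivariate normal, zero mean, unit variances, correlation rho
rho, c = 0.8, 1.5  # condition on x_2 = c

def elbo(m, s):
    # E_q[log p(x_1, c)] contributes -(m^2 + s^2 - 2*rho*m*c)/(2(1-rho^2)) + const;
    # the entropy of q contributes log s + const.
    return -(m * m + s * s - 2 * rho * m * c) / (2 * (1 - rho * rho)) + math.log(s)

# grid search over the variational parameters (m, s)
best = max(
    ((elbo(m / 100, s / 100), m / 100, s / 100)
     for m in range(50, 200) for s in range(10, 120)),
    key=lambda t: t[0],
)
_, m_star, s_star = best
# true conditional: x_1 | x_2 = c  ~  N(rho*c, 1 - rho^2)
```

Because the objective is separable and concave in m and s, the grid maximizer lands exactly on the true conditional parameters.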

6.2.3 Compressive Sensing with Generative Priors

In the compressive sensing problem, a vector $x\in\mathbb{R}^d$ generates a set of measurements $y^* = Ax$, where $y^*\in\mathbb{R}^m$ and $m \ll d$. Observing only the measurements $y^*$, our goal is to reconstruct the vector x. Since the number of measurements is much smaller than the dimension, this is not possible in general. However, if we know there is some simplifying structure to x, then recovery may be possible.

Classically, the simplifying structure was that x is sparse, and there has been extensive work in this setting [Tibshirani, 1996, Candes et al., 2006, Donoho et al., 2006, Bickel et al., 2009, Baraniuk, 2007].

Recent work has considered alternative simplifying structures, such as the vector x coming from a generative model. Starting with Bora et al. [2017], there has been extensive work on this setting [Grover and Ermon, 2018, Mardani et al., 2018, Heckel and Hand, 2018, Mixon and Villar, 2018, Pandit et al., 2019].

The main idea of these approaches is to optimize over the latent space of a generative model, typically using the objective function
$$\min_z\ \|Af(z) - y\|_2^2,$$
where f is the encoding function of a deep generative model.
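The objective above can be sketched on a toy problem where a random linear bijection stands in for the deep generative model f (an assumption for illustration; with a real flow one would backpropagate through the network rather than form the Jacobian explicitly):

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 20, 10

# a random linear bijection stands in for the invertible generator f
W = rng.standard_normal((d, d))
f = lambda z: W @ z

A = rng.standard_normal((m, d))      # measurement matrix
y = A @ f(rng.standard_normal(d))    # observed measurements y* = A x

# gradient descent on the latent-space objective ||A f(z) - y||^2
G = A @ W                            # Jacobian of z -> A f(z) for this toy f
L = np.linalg.norm(G, 2) ** 2        # smoothness constant (step size 1/L)
z = np.zeros(d)
for _ in range(20000):
    z -= (1.0 / L) * (G.T @ (G @ z - y))
x_hat = f(z)
```

The recovered x_hat matches the measurements, although with m < d it need not equal the true signal; that uncertainty is exactly what the Bayesian formulation addresses.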

The approaches mentioned above focus on recovering a single signal that is close to the true signal. However, there can be many signals that fit the measurements, and there is some uncertainty on any given image. Because of this, there has been recent work on recovering the distribution conditioned on the measurements [Tonolini et al., 2019, Zhang and Jin, 2019].

141 We also mention the work of Asim et al. [2019], Shamshad et al. [2019] which utilize invertible generative models for compressive sensing problems.

6.2.4 Additional Related Work

There has been extensive work on learning a conditional sampler. Conditional GANs [Mirza and Osindero, 2014] and conditional VAEs [Sohn et al., 2015] are two approaches to this problem. Recent work by Ivanov et al. [2018] and Belghazi et al. [2019] has developed deep generative models that can condition on many different sets. Our work is different from these approaches, as we consider the problem of sampling from the conditional distribution of a given invertible generative model, rather than learning a conditional distribution.

Additionally, there has been extensive work on inpainting [Pathak et al., 2016, Yeh et al., 2017], the problem of completing an image in a natural and interesting way. While conditional sampling is a type of inpainting procedure, work on inpainting is more concerned with producing natural and interesting completions than with faithfully recreating a conditional sampler.

6.3 Hardness of Conditional Sampling

We first show that if an algorithm is able to efficiently sample from the conditional distribution of an invertible generative model for common architectures, then this algorithm can be used to solve NP-complete problems efficiently. Our hardness result holds even if we allow the conditional sampler to approximately sample from the conditional distribution.

Theorem 60. Suppose there is an efficient algorithm that can draw samples from the conditional distribution of an invertible generative model implemented with additive coupling layers. Then RP = NP. Further, the problem remains hard even if we only require the algorithm to sample from a distribution $q$ such that $d_{\mathrm{TV}}(p(\cdot \mid y = y^*), q) \le 1 - 1/\mathrm{poly}(d)$.

We note that RP is the class of decision problems with efficient randomized algorithms that (1) output YES with probability at least 1/2 if the true answer is YES and (2) output NO with probability 1 if the true answer is NO. It is widely believed that RP is a strict subset of NP.

Architectures that use additive coupling layers include NICE [Dinh et al., 2014], RealNVP [Dinh et al., 2016], Glow [Kingma and Dhariwal, 2018], and neural spline flows [Durkan et al., 2019]; thus our hardness result applies to a large variety of invertible generative models.

6.4 Conditional Sampling with Composed Flow Models

We consider using variational inference for the conditional sampling task. Here, we optimize over a class of models to find the one that best fits the conditional distribution.

For simplicity, we first work out the problem when $x = (x_1, x_2)$ is a partition of the variables and we want the conditional distribution of $x_1$ given the realization $x_2 = x_2^*$.

Figure 6.1: A flow chart of our conditional sampler. First the noise variable $z_0$ is sampled from $N(0, I)$. This is fed into an invertible generative model $\hat{f}$ to output another noise variable $z_1$. We then feed $z_1$ into the original model $f$ to generate $x_1$ and $x_2$.

Rather than perform variational inference in the observed variables, we propose to perform variational inference in the latent variables. By this, we mean that we want to learn a new distribution $q(z)$ over the latent variables $z$ such that the distribution of $(x_1, x_2)$ obtained from $z \sim q$, $(x_1, x_2) = f(z)$ satisfies the condition $x_2 = x_2^*$, with $x_1$ sampled according to the conditional distribution $p(x_1 \mid x_2 = x_2^*)$.

Our variational family consists of the distributions given by the composed flow model $f \circ \hat{f} = f(\hat{f}(\cdot))$, where $f$ is the original, given flow model and $\hat{f}$ is a new latent flow model that we will learn. To sample from the composed model, we first sample $z_0 \sim N(0, I)$. Then we get $z_1 = \hat{f}(z_0)$. To get the final output we compute $(x_1, x_2) = f(z_1)$. We include a flow chart describing this process in Figure 6.1.
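The sampling procedure can be sketched as follows; the two toy invertible maps below are our placeholders for the actual RealNVP-style flows, chosen only so the example is self-contained:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4

# Toy invertible maps standing in for the flows (a real implementation would
# use coupling-layer networks for both f and f_hat).
W = np.eye(d) + np.tril(rng.normal(size=(d, d)), k=-1)  # unit-triangular, hence invertible

def f(z):                # the given flow model
    return W @ z

def f_hat(z):            # the latent flow we would learn
    return 2.0 * z + np.tanh(z)   # strictly increasing coordinatewise, hence invertible

# Sampling from the composed model f ∘ f_hat:
z0 = rng.standard_normal(d)        # step 1: standard Gaussian noise
z1 = f_hat(z0)                     # step 2: latent flow produces a new latent
x = f(z1)                          # step 3: original flow produces the observables
x1, x2 = x[: d // 2], x[d // 2 :]  # partition of the output variables

assert x1.shape == (2,) and x2.shape == (2,)
assert np.allclose(np.linalg.solve(W, x), z1)  # f is invertible: z1 is recoverable
```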

However, we cannot directly apply the tools of variational inference using this variational family, as we do not have access to the marginal distribution of x1.

Because of this, we consider a smoothed version of the problem. We imagine that there is a new variable $\hat{x}_2$ with the conditional distribution

Figure 6.2: A graphical model depicting the process we run the ELBO on. We imagine that $\hat{x}_2$ is drawn from the conditional distribution $N(x_2, \sigma^2 I)$ for some small parameter $\sigma$. We see that $\hat{x}_2$ is independent of $x_1$ and $z$ when conditioned on $x_2$.

$p(\hat{x}_2 \mid x_2) = N(x_2, \sigma^2 I)$, where $\sigma$ is a small smoothing parameter that we can adjust. Then, rather than fitting the distribution $p(x_1 \mid x_2 = x_2^*)$, we fit the distribution $p(x_1, x_2 \mid \hat{x}_2 = x_2^*)$. We see that as $\sigma \to 0$ we have $p(x_1 \mid \hat{x}_2 = x_2^*) \to p(x_1 \mid x_2 = x_2^*)$.

Proposition 61. Under reasonable regularity conditions on the joint distribution $p(x_1, x_2)$, the smoothed conditional distribution $p(x_1 \mid \hat{x}_2 = x_2^*)$ converges uniformly to the true conditional distribution $p(x_1 \mid x_2 = x_2^*)$ as the smoothing parameter $\sigma \to 0$.

We see that
$$p(x_1, x_2, \hat{x}_2) = p(x_1, x_2)\, p(\hat{x}_2 \mid x_1, x_2) = p(x_1, x_2)\, p(\hat{x}_2 \mid x_2)$$
due to the conditional independence of $x_1$ and $\hat{x}_2$ when conditioned on $x_2$. We include an image of the graphical model describing this process in Figure 6.2.

We define $p_f$ to be the density function of $(x_1, x_2)$ when standard Gaussian noise is fed into $f$. We define $p_{f \circ \hat{f}}$ to be the density function of $(x_1, x_2)$ when standard Gaussian noise is fed into $f \circ \hat{f}$. We define $p_{\hat{f}}$ to be the density function of $z_1$ when standard Gaussian noise is fed into $\hat{f}$.

We want to fit the composed flow model $f \circ \hat{f}$ to the smoothed conditional distribution $p(x_1, x_2 \mid \hat{x}_2 = x_2^*)$. By factoring out the evidence term $p_\sigma(\hat{x}_2 = x_2^*)$ and the conditional distribution $p_\sigma(\hat{x}_2 = x_2^* \mid x_2)$, we see that the KL divergence between $p_{f \circ \hat{f}}$ and $p_f(x_1, x_2 \mid \hat{x}_2 = x_2^*)$ satisfies
\begin{align*}
D(p_{f \circ \hat{f}} \,\|\, p_f(\cdot \mid \hat{x}_2 = x_2^*)) &= \mathbb{E}_{(x_1, x_2) \sim p_{f \circ \hat{f}}}\left[\log p_{f \circ \hat{f}}(x_1, x_2) - \log p_f(x_1, x_2 \mid \hat{x}_2 = x_2^*)\right] \\
&= \mathbb{E}_{(x_1, x_2) \sim p_{f \circ \hat{f}}}\left[\log p_{f \circ \hat{f}}(x_1, x_2) - \log p_f(x_1, x_2, \hat{x}_2 = x_2^*)\right] + \log p_f(\hat{x}_2 = x_2^*) \\
&= \mathbb{E}_{(x_1, x_2) \sim p_{f \circ \hat{f}}}\left[\log p_{f \circ \hat{f}}(x_1, x_2) - \log p_f(x_1, x_2) - \log p_\sigma(\hat{x}_2 = x_2^* \mid x_2)\right] + \log p_f(\hat{x}_2 = x_2^*).
\end{align*}

Removing constants that do not depend on the model $\hat{f}$ we are optimizing, we see that the objective function we want to optimize is
$$\operatorname*{arg\,min}_{\hat{f}} \; \mathbb{E}_{(x_1, x_2) \sim p_{f \circ \hat{f}}}\left[\log p_{f \circ \hat{f}}(x_1, x_2) - \log p_f(x_1, x_2) + \frac{1}{2\sigma^2}\|x_2 - x_2^*\|_2^2\right]. \quad (6.1)$$

Since we can efficiently sample from $p_{f \circ \hat{f}}$ and calculate the log likelihoods $\log p_{f \circ \hat{f}}(x_1, x_2)$ and $\log p_f(x_1, x_2)$, it is possible to run stochastic gradient descent on the objective function in Equation (6.1) to optimize for $\hat{f}$.

Due to the nature of composed flow models, we can simplify the objective function in Equation (6.1). It is known that KL divergence is preserved after applying a bijective transformation to the output of both distributions. Specifically, we see that
$$\log p_{f \circ \hat{f}}(x_1, x_2) = \log p_{N(0,I)}(z_0) + \log\left|\det \frac{d\hat{f}^{-1}}{dz_1}(z_1)\right| + \log\left|\det \frac{df^{-1}}{d(x_1, x_2)}(x_1, x_2)\right|$$
and
$$\log p_f(x_1, x_2) = \log p_{N(0,I)}(z_1) + \log\left|\det \frac{df^{-1}}{d(x_1, x_2)}(x_1, x_2)\right|.$$
We see that the terms involving the Jacobian of $f^{-1}$ cancel out.

Applying this and reparametrizing the expectation using $z_1$, we can write the objective function as
$$\operatorname*{arg\,min}_{\hat{f}} \; \mathbb{E}_{z_1 \sim p_{\hat{f}}}\left[\log p_{\hat{f}}(z_1) - \log p_{N(0,I)}(z_1) + \frac{1}{2\sigma^2}\|f_{x_2}(z_1) - x_2^*\|_2^2\right], \quad (6.2)$$
where by $f_{x_2}(z_1)$ we mean: compute $(x_1, x_2) = f(z_1)$ and return $x_2$.

The objective function in Equation (6.2) is interesting as it allows us to avoid computing the Jacobian of $f^{-1}$ completely. Additionally, it can be rewritten as
$$\operatorname*{arg\,min}_{\hat{f}} \; D(p_{\hat{f}} \,\|\, N(0, I)) + \frac{1}{2\sigma^2}\, \mathbb{E}_{z_1 \sim p_{\hat{f}}}\left[\|f_{x_2}(z_1) - x_2^*\|_2^2\right].$$
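The cancellation of the $f^{-1}$ Jacobian terms can be checked numerically. A minimal sketch with toy invertible maps (the linear $f$ and scalar-scaling $\hat{f}$ here are our stand-ins for the actual flow networks):

```python
import numpy as np

rng = np.random.default_rng(2)
d = 3

def log_gauss(z):  # log density of N(0, I) in d dimensions
    return -0.5 * z @ z - 0.5 * d * np.log(2 * np.pi)

W = rng.normal(size=(d, d)) + 3.0 * np.eye(d)  # toy invertible f(z) = W z
c = 1.7                                        # toy latent flow f_hat(z) = c z

x = rng.standard_normal(d)                     # an arbitrary observation
z1 = np.linalg.solve(W, x)                     # z1 = f^{-1}(x)
z0 = z1 / c                                    # z0 = f_hat^{-1}(z1)

logdet_f = np.log(abs(np.linalg.det(W)))       # log |det Jacobian| of f
logdet_fhat = d * np.log(c)                    # log |det Jacobian| of f_hat

# Change of variables for each model:
log_p_comp = log_gauss(z0) - logdet_fhat - logdet_f  # density of f ∘ f_hat at x
log_p_f = log_gauss(z1) - logdet_f                   # density of f at x
log_p_fhat = log_gauss(z0) - logdet_fhat             # density of f_hat at z1

# The f^{-1} Jacobian cancels in the difference, as in Equation (6.2):
assert np.isclose(log_p_comp - log_p_f, log_p_fhat - log_gauss(z1))
```

Because the Jacobian of $f$ never appears in the difference, training only requires log densities in the latent space.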

6.4.1 Generalizing to Measurement Matrices

We now consider the problem of sampling from the conditional distribu- tion given a measurement matrix A and a set of measurements y∗. That is, we want to sample from the distribution P (x | Ax = y∗).

We can apply the SVD to the matrix $A$ to get $A = U \begin{bmatrix} 0 & \Sigma \end{bmatrix} \begin{bmatrix} V_1^T \\ V_2^T \end{bmatrix}$, where $U \in \mathbb{R}^{m \times m}$ and $V = [V_1\; V_2] \in \mathbb{R}^{d \times d}$ are orthonormal transforms with $V_1 \in \mathbb{R}^{d \times (d - m)}$ and $V_2 \in \mathbb{R}^{d \times m}$, and $\Sigma \in \mathbb{R}^{m \times m}$ is a diagonal matrix with the singular values. Since $\Sigma = I$ is common in compressive sensing (or approximately true), we assume this and have $A = U V_2^T$. We now have a partition of $x$ into the null-space component $V_1^T x$ and the row-space component $V_2^T x$, and can apply the approach from the previous section using $V_2^T x = U^T y^*$.
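A quick numerical sketch of this decomposition (toy sizes; the variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(3)
m, d = 3, 8
A = rng.normal(size=(m, d))

# Full SVD: the first m rows of Vt span the row space of A, the rest its null space.
U, S, Vt = np.linalg.svd(A, full_matrices=True)
V2t, V1t = Vt[:m], Vt[m:]

x = rng.standard_normal(d)
y = A @ x

# The measurements determine exactly the row-space component of x:
assert np.allclose(V2t @ x, np.diag(1.0 / S) @ U.T @ y)
# (with Sigma = I this reduces to V2^T x = U^T y). The null-space component is free:
assert np.allclose(A @ V1t.T, 0.0, atol=1e-10)
```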

Following the approach in the previous section, we imagine that there is a variable $\hat{y} = Ax + N(0, \sigma^2 I)$ for a small value $\sigma$. We then calculate the ELBO for this situation to obtain the following objective function:
$$\operatorname*{arg\,min}_{\hat{f}} \; \mathbb{E}_{z_1 \sim p_{\hat{f}}}\left[\log p_{\hat{f}}(z_1) - \log p_{N(0,I)}(z_1) + \frac{1}{2\sigma^2}\|V_2^T f(z_1) - U^T y^*\|_2^2\right].$$

Since the $\ell_2$ norm is preserved under an orthonormal transform, the minimization problem is equivalent to
$$\operatorname*{arg\,min}_{\hat{f}} \; \mathbb{E}_{z_1 \sim p_{\hat{f}}}\left[\log p_{\hat{f}}(z_1) - \log p_{N(0,I)}(z_1) + \frac{1}{2\sigma^2}\|A f(z_1) - y^*\|_2^2\right]. \quad (6.3)$$

This objective function is similar to Equation (6.2), except that rather than consider the $\ell_2$ norm between $x_2$ and $x_2^*$, we consider the $\ell_2$ norm between $Ax$ and $y^*$. We note that the objective function in Equation (6.2) is a special case of the objective function in Equation (6.3) when the matrix $A$ is an identity matrix with rows removed.

6.5 Experiments

Our invertible generative model is a RealNVP model [Dinh et al., 2016]. We train the model to generate 64 × 64 CelebA-HQ images [Liu et al., 2015, Karras et al., 2017]. We also use a RealNVP model for the latent model fˆ.

In Figure 6.3 we use our approach for conditional sampling from this invertible generative model. We see that our method is able to generate reasonable conditional samples from this model. We plot the pixelwise variance by calculating the sample variance over 64 samples and summing all color channels into one value.

We now consider a super-resolution task. We design our measurement matrix to be a local blurring of nearby pixels. Specifically, we apply average pooling with a window size of 4 and a stride of 4, which is the same approach utilized in Bora et al. [2017]. This reduces the amount of data by a factor of 16.
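The blur measurement can be sketched as a reshape-and-mean; the random array below is a toy stand-in for an actual 64 × 64 image:

```python
import numpy as np

def avg_pool(img, k=4):
    """Average pooling with window size k and stride k (the blur measurement)."""
    h, w, c = img.shape
    return img.reshape(h // k, k, w // k, k, c).mean(axis=(1, 3))

img = np.random.default_rng(4).random((64, 64, 3))  # toy 64x64 RGB image
y = avg_pool(img)

assert y.shape == (16, 16, 3)            # 16x fewer measurements than pixels
assert np.isclose(y.mean(), img.mean())  # equal-size windows preserve the mean
```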

In Figure 6.4 we use our approach to generate conditional samples under this measurement model. We see that the conditional sampler is able to recover a reasonable distribution of samples for this measurement matrix.

Figure 6.3: Conditional sampling using our approach. We condition on the top half of the image and sample the bottom half. Since the samples output both halves of the image, we replace the sampled top half with the true top half. We see that our approach is able to generate completions with diversity, as there are multiple mouth positions for most sets of images. We plot the pixelwise variance by calculating the sample variance over 64 samples and summing all color channels into one value.

Figure 6.4: Conditional sampling when the measurement is a blurred image. We blur the image using average pooling with a window size and stride of 4. We see that our approach is able to find several reasonable ways to complete the measurements.

6.6 Proof of Hardness Results

A Boolean variable is a variable that takes a value in $\{-1, 1\}$. A literal is a Boolean variable $x_i$ or its negation $\neg x_i$. A clause is a set of literals combined with the OR operator, e.g., $x_1 \vee \neg x_2 \vee x_3$. A conjunctive normal form formula is a set of clauses joined by the AND operator, e.g., $(x_1 \vee \neg x_2 \vee x_3) \wedge (x_1 \vee \neg x_3 \vee x_4)$. A satisfying assignment is an assignment to the variables such that the Boolean formula is true.

The 3-SAT problem is the problem of deciding if a conjunctive normal form formula with three literals per clause has a satisfying assignment. We will show that conditional sampling from flow models allows us to solve the 3-SAT problem.

We ignore the issue of representing samples from the conditional distribution with a finite number of bits. However, the reduction is still valid if the samples are truncated to a constant number of bits.

6.6.1 Design of the Additive Coupling Network

Given a Boolean formula, we design a ReLU neural network with 3 hidden layers such that the output is 0 if the input is far from a satisfying assignment, and the output is approximately a large number $M$ if the input is close to a satisfying assignment.

We will define the following scalar function:
$$\delta_\varepsilon(x) = \mathrm{ReLU}\!\left(\frac{x - (1 - \varepsilon)}{\varepsilon}\right) - \mathrm{ReLU}\!\left(\frac{x - (1 - \varepsilon)}{\varepsilon} - 1\right) - \mathrm{ReLU}\!\left(\frac{x - 1}{\varepsilon}\right) + \mathrm{ReLU}\!\left(\frac{x - 1}{\varepsilon} - 1\right).$$

This function is 1 if the input is 1, 0 if the input $x$ has $|x - 1| \ge \varepsilon$, and a linear interpolation on $(1 - \varepsilon, 1 + \varepsilon)$. Note that it can be implemented by a hidden layer of a neural network and a linear transform, which can be absorbed into the following hidden layer. See Figure 6.5 for a plot of this function.

Figure 6.5: A plot of the function $\delta_\varepsilon$.

For each variable $x_i$, we create a transformed variable $\tilde{x}_i$ by applying $\tilde{x}_i = \delta_\varepsilon(x_i) - \delta_\varepsilon(-x_i)$. Note that this function is 0 on $(-\infty, -1 - \varepsilon] \cup [-1 + \varepsilon, 1 - \varepsilon] \cup [1 + \varepsilon, \infty)$, $-1$ at $x_i = -1$, $1$ at $x_i = 1$, and a piecewise-linear interpolation on the remaining values in the domain.
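A sketch of $\delta_\varepsilon$ and the transformed variables, checking the values claimed above ($\varepsilon = 0.1$ and the helper names are our choices for the example):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def delta(x, eps=0.1):
    """delta_eps: 1 at x = 1, 0 whenever |x - 1| >= eps, linear in between."""
    return (relu((x - (1 - eps)) / eps) - relu((x - (1 - eps)) / eps - 1)
            - relu((x - 1) / eps) + relu((x - 1) / eps - 1))

def transform(x, eps=0.1):
    """x_tilde: +1 near x = +1, -1 near x = -1, 0 away from both."""
    return delta(x, eps) - delta(-x, eps)

assert np.isclose(delta(1.0), 1.0)
assert np.isclose(delta(0.8), 0.0) and np.isclose(delta(1.2), 0.0)  # |x - 1| >= eps
assert np.isclose(delta(0.95), 0.5)                                 # linear ramp
assert np.isclose(transform(1.0), 1.0) and np.isclose(transform(-1.0), -1.0)
assert np.isclose(transform(0.0), 0.0)
```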

Every clause has at most 8 satisfying assignments. For each satisfying assignment we will create a neuron with the following process: (1) get the relevant transformed values $\tilde{x}_i, \tilde{x}_j, \tilde{x}_k$; (2) multiply each variable by $1/3$ if it is equal to 1 in the satisfying assignment and $-1/3$ if it is equal to $-1$ in the satisfying assignment; (3) sum the scaled variables; (4) apply the $\delta_\varepsilon$ function to the sum.

We will then sum all the neurons corresponding to satisfying assignments of clause $C_j$ to get the value $c_j$. We will then output the value $M \times \mathrm{ReLU}\!\left(\sum_j c_j - (m - 1)\right)$, where $M$ is a large scalar and $m$ is the number of clauses.

We say that an input $x$ to the neural network corresponds to a Boolean assignment $x' \in \{-1, 1\}^d$ if for every $i$ we have $|x_i - x'_i| < \varepsilon$. For $\varepsilon < 1/3$, if the input does not correspond to a satisfying assignment of the given formula, then at least one of the values $c_j$ is 0. The remaining values of $c_j$ are at most 1, so the sum in the output is at most $m - 1$; thus the argument of the ReLU is at most zero, so the final output is 0. However, if the input corresponds to a satisfying assignment, then every value $c_j = 1$, so the output is $M$.
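Putting the pieces together, a sketch of the whole construction (the clause encoding, constants, and helper names below are ours, chosen for the example):

```python
import numpy as np
from itertools import product

def relu(x):
    return np.maximum(x, 0.0)

def delta(x, eps=0.1):
    """Piecewise-linear bump: 1 at x = 1, 0 whenever |x - 1| >= eps."""
    return (relu((x - (1 - eps)) / eps) - relu((x - (1 - eps)) / eps - 1)
            - relu((x - 1) / eps) + relu((x - 1) / eps - 1))

def formula_net(x, clauses, M=100.0, eps=0.1):
    """Outputs about M when x corresponds to a satisfying assignment, 0 when it is
    far from one. Clauses use signed 1-indexed literals, e.g. [1, -2, 3]."""
    xt = delta(x, eps) - delta(-x, eps)          # transformed variables x_tilde
    cs = []
    for clause in clauses:
        idx = [abs(lit) - 1 for lit in clause]
        total = 0.0
        for bits in product([-1, 1], repeat=len(clause)):
            if any(b == np.sign(lit) for b, lit in zip(bits, clause)):  # satisfying bits
                s = sum(b / 3.0 * xt[i] for b, i in zip(bits, idx))
                total += delta(s, eps)           # one neuron per satisfying assignment
        cs.append(total)
    m = len(clauses)
    return M * relu(sum(cs) - (m - 1))

clauses = [[1, -2, 3], [-1, 2, 3]]   # (x1 ∨ ¬x2 ∨ x3) ∧ (¬x1 ∨ x2 ∨ x3)
assert np.isclose(formula_net(np.array([1.0, 1.0, 1.0]), clauses), 100.0)  # satisfying
assert np.isclose(formula_net(np.zeros(3), clauses), 0.0)                  # far from any
```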

6.6.2 Generating SAT Solutions from the Conditional Distribution

Our flow model will take in Gaussian noise $x_1, \ldots, x_d, z \sim N(0, 1)$. The values $x_1, \ldots, x_d$ will be passed through to the output. The output variable $y$ will be $z + f_M(x_1, \ldots, x_d)$, where $f_M$ is the neural network described in the previous section and $M$ is a parameter to be decided later.
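The model above is a single additive coupling layer; a minimal sketch of the layer (function names are ours):

```python
import numpy as np

def coupling_forward(x, z, net):
    """Additive coupling: pass x through unchanged, shift z by net(x)."""
    return x, z + net(x)

def coupling_inverse(x, y, net):
    return x, y - net(x)

rng = np.random.default_rng(5)
net = lambda x: np.sin(x).sum()     # any function of x works; it need not be invertible
x, z = rng.standard_normal(3), rng.standard_normal()

x_out, y = coupling_forward(x, z, net)
x_back, z_back = coupling_inverse(x_out, y, net)
assert np.isclose(z_back, z)        # the layer is invertible regardless of net
assert np.allclose(x_back, x)
```

This is why an arbitrary ReLU network $f_M$ can be hidden inside an invertible model: invertibility never requires inverting $f_M$ itself.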

Let $A$ be the set of all satisfying assignments of the given formula. For each assignment $a \in A$, we define $X_a$ to be the region $X_a = \{x \in \mathbb{R}^d : \|a - x\|_\infty \le \varepsilon\}$, where as above $\varepsilon$ is some constant less than $1/3$. Let $X_A = \bigcup_{a \in A} X_a$.

Given an element $x \in X_a$, we can recreate the corresponding satisfying assignment $a$. Thus if we have an element of $X_A$, we can certify that there is a satisfying assignment. We will show that the distribution conditioned on $y = M$ generates satisfying assignments with high probability.

We have that
$$p(X_A \mid y = M) = \frac{p(y = M, X_A)}{p(y = M, X_A) + p(y = M, \bar{X}_A)}.$$

If we can show that $p(y = M, \bar{X}_A) \ll p(y = M, X_A)$, then the generated samples are in $X_A$ with high probability.

We have that
$$p(y = M, \bar{X}_A) = p(y = M \mid \bar{X}_A)\, p(\bar{X}_A) \le p(y = M \mid \bar{X}_A).$$
Note that if $x \in \bar{X}_A$, then $f_M(x) = 0$. Thus $y \sim N(0, 1)$ and $p(y = M \mid \bar{X}_A) = \Theta(\exp(-M^2/2))$.

Now consider any satisfying assignment $a$. Let $X'_a$ be the region $X'_a = \{x \in \mathbb{R}^d : \|a - x\|_\infty \le \frac{1}{2m}\}$. Note that for every $x$ in this region we have $f_M(x) \ge M/2$. Additionally, we have that $p(X'_a) = \Theta(m)^{-d}$. Thus for any $x \in X'_a$, we have $p(y = M \mid x) \gtrsim \exp(-M^2/8)$. We can conclude that
$$p(y = M, X_A) \ge p(y = M, X'_a) = \int_{X'_a} p(y = M \mid x)\, p(x)\, dx \gtrsim \exp(-M^2/8 - \Theta(d \log m)).$$
For $M = \Theta(\sqrt{d \log m})$ with a suitably large constant, we have that $p(y = M, \bar{X}_A)$ is exponentially smaller than $p(y = M, X_A)$. This implies that sampling from the distribution conditioned on $y = M$ will return a satisfying assignment with high probability.

6.6.3 Hardness of Approximate Sampling

Total variation distance is a distance over probability distributions defined as $d_{\mathrm{TV}}(p, q) = \max_E |p(E) - q(E)|$, where the maximum is over events $E$.
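For discrete distributions, the maximum over events has a closed form, half the $\ell_1$ distance; a quick example (the distributions are made up for illustration):

```python
import numpy as np

def tv_distance(p, q):
    """d_TV(p, q) = max_E |p(E) - q(E)| = (1/2) sum_x |p(x) - q(x)| for discrete p, q."""
    return 0.5 * np.abs(np.asarray(p) - np.asarray(q)).sum()

p = [0.5, 0.5, 0.0]
q = [0.25, 0.25, 0.5]
assert np.isclose(tv_distance(p, q), 0.5)  # achieved by the event E = {third outcome}
```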

We show that the problem is still hard even if we only require the algorithm to sample from a distribution $q$ such that $d_{\mathrm{TV}}(p(\cdot \mid y = y^*), q) \le 1 - 1/\mathrm{poly}(d)$.

Consider the event $X_A$ from above. We saw that $p(X_A \mid y = M) \ge 1 - \exp(-\Omega(d))$. We have that $d_{\mathrm{TV}}(p(\cdot \mid y = M), q) \ge 1 - \exp(-\Omega(d)) - q(X_A)$.

Suppose that the distribution $q$ has $q(X_A) \ge 1/\mathrm{poly}(d)$. Then by sampling a polynomial number of times from $q$ we sample an element of $X_A$, which allows us to find a satisfying assignment. Thus if we could efficiently sample from such a distribution, we would be able to efficiently solve SAT and RP = NP. As we are assuming this is false, we must have $q(X_A) \le 1/\mathrm{poly}(d)$, which implies $d_{\mathrm{TV}}(p(\cdot \mid y = M), q) \ge 1 - 1/\mathrm{poly}(d)$.

6.7 Proof of Proposition 61

We assume the following conditions on the joint distribution $p(x_1, x_2)$:

1. The density function $p(x_1, x_2)$ is continuous.

2. There exists a value $M_1$ such that $p(x_1, x_2) < M_1$ for all $x_1, x_2$.

3. The marginal density function $p(x_2)$ is continuous.

4. There exists a value $M_2$ such that $p(x_2) < M_2$ for all $x_2$.

5. For any $\varepsilon > 0$, there exists an $R$ such that for all $x_1, x_2$ with $\|(x_1, x_2)\|_2 > R$ we have $p(x_1, x_2) < \varepsilon$.

We first prove the following lemma on the convergence of $p_\sigma(\hat{x}_2 = x_2^*)$ to $p(x_2 = x_2^*)$.

Lemma 62. For any $\varepsilon > 0$, there exists a $\sigma_1 > 0$ such that $|p_\sigma(\hat{x}_2 = x_2^*) - p(x_2 = x_2^*)| < \varepsilon$ for all $\sigma \le \sigma_1$.

Proof. We have, for a radius $r_\sigma$ to be determined later,
\begin{align*}
p_\sigma(\hat{x}_2 = x_2^*) &= \int p_\sigma(\hat{x}_2 = x_2^* \mid x_2)\, p(x_2)\, dx_2 \\
&= \int_{\|x_2 - x_2^*\|_2 \le r_\sigma} p_\sigma(\hat{x}_2 = x_2^* \mid x_2)\, p(x_2)\, dx_2 + \int_{\|x_2 - x_2^*\|_2 > r_\sigma} p_\sigma(\hat{x}_2 = x_2^* \mid x_2)\, p(x_2)\, dx_2 \\
&\le \int_{\|x_2 - x_2^*\|_2 \le r_\sigma} p_\sigma(\hat{x}_2 = x_2^* \mid x_2)\, p(x_2)\, dx_2 + M_2 \int_{\|x_2 - x_2^*\|_2 > r_\sigma} p_\sigma(\hat{x}_2 = x_2^* \mid x_2)\, dx_2,
\end{align*}
where $M_2$ is from Assumption 4.

Now since $p_\sigma(\hat{x}_2 = x_2^* \mid x_2)$ is a Gaussian density, we have that, as a function of $x_2$, $p_\sigma(\hat{x}_2 = x_2^* \mid x_2) = p(N(x_2^*, \sigma^2 I) = x_2)$. This can be seen by directly inspecting the density functions.

For $r_\sigma = O\!\left(\sigma\sqrt{n + n \log \frac{1}{\sigma}}\right)$, the probability that $\|N(x_2^*, \sigma^2 I) - x_2^*\|_2 \ge r_\sigma$ is bounded by $\sigma$, due to standard bounds on the concentration of $\chi^2$ random variables (see, for example, [Wainwright, 2019]). Continuing, we thus have that

\begin{align*}
p_\sigma(\hat{x}_2 = x_2^*) &\le \int_{\|x_2 - x_2^*\|_2 \le r_\sigma} p_\sigma(\hat{x}_2 = x_2^* \mid x_2)\, p(x_2)\, dx_2 + M_2 \sigma \\
&\le \left(\max_{\|x_2 - x_2^*\|_2 \le r_\sigma} p(x_2)\right) \int_{\|x_2 - x_2^*\|_2 \le r_\sigma} p_\sigma(\hat{x}_2 = x_2^* \mid x_2)\, dx_2 + M_2 \sigma \\
&\le \max_{\|x_2 - x_2^*\|_2 \le r_\sigma} p(x_2) + M_2 \sigma,
\end{align*}
where in the last line we use again that $p_\sigma(\hat{x}_2 = x_2^* \mid x_2) = p(N(x_2^*, \sigma^2 I) = x_2)$, so the integral is at most 1.

By Assumption 3, we have that $p(x_2)$ is continuous. By the definition of continuity, for all $\varepsilon$ there exists an $r' = r'(\varepsilon)$ such that $|p(x_2) - p(x_2^*)| < \varepsilon$ for all $x_2$ with $\|x_2 - x_2^*\|_2 \le r'$.

Thus for $\sigma_1$ small enough, we have that $p_\sigma(\hat{x}_2 = x_2^*) \le p(x_2 = x_2^*) + \varepsilon$.

To get a lower bound, start with
$$p_\sigma(\hat{x}_2 = x_2^*) \ge \int_{\|x_2 - x_2^*\|_2 \le r_\sigma} p_\sigma(\hat{x}_2 = x_2^* \mid x_2)\, p(x_2)\, dx_2$$
and continue similarly.

We recall the definition of uniform convergence of a family of functions:

Definition 63. We say that a family of functions $g_\sigma$ converges uniformly to $g$ if for all $\varepsilon > 0$, there exists a $\sigma_0 > 0$ such that for all $\sigma \le \sigma_0$ and all $x$ we have that $|g_\sigma(x) - g(x)| < \varepsilon$.

We now prove Proposition 61.

Proof. We have that
$$p_\sigma(x_1 \mid \hat{x}_2 = x_2^*) = \int p_\sigma(x_1, x_2 \mid \hat{x}_2 = x_2^*)\, dx_2 = \frac{\int p(x_1, x_2)\, p_\sigma(\hat{x}_2 = x_2^* \mid x_2)\, dx_2}{p_\sigma(\hat{x}_2 = x_2^*)}.$$

Following a similar argument to Lemma 62, we have that
$$p_\sigma(x_1 \mid \hat{x}_2 = x_2^*) \le \frac{\max_{\|x_2 - x_2^*\|_2 \le r_\sigma} p(x_1, x_2) + M_1 \sigma}{p_\sigma(\hat{x}_2 = x_2^*)},$$
where $r_\sigma = O\!\left(\sigma\sqrt{n + n \log \frac{1}{\sigma}}\right)$ and $M_1$ is from Assumption 2.

Due to Assumption 1 and Assumption 5, by the Heine–Cantor theorem we have that $p(x_1, x_2)$ is uniformly continuous, which implies that for all $\varepsilon > 0$ there exists an $r'$ that does not depend on $x_1, x_2$ such that if $\|x_2 - x_2^*\|_2 < r'$ then $|p(x_1, x_2) - p(x_1, x_2^*)| < \varepsilon$.

For $r_\sigma < r'$ we thus have
$$p_\sigma(x_1 \mid \hat{x}_2 = x_2^*) \le \frac{p(x_1, x_2^*) + \varepsilon + M_1 \sigma}{p_\sigma(\hat{x}_2 = x_2^*)}.$$

Applying Lemma 62, we have for $\sigma, \varepsilon$ small enough that
\begin{align*}
p_\sigma(x_1 \mid \hat{x}_2 = x_2^*) &\le \frac{p(x_1, x_2^*) + \varepsilon + M_1 \sigma}{p(x_2 = x_2^*)(1 - \varepsilon)} \\
&\le \frac{p(x_1, x_2^*)}{p(x_2 = x_2^*)} + \frac{\varepsilon + M_1 \sigma}{p(x_2 = x_2^*)} + 2\varepsilon\, \frac{p(x_1, x_2^*) + \varepsilon + M_1 \sigma}{p(x_2 = x_2^*)} \\
&\le \frac{p(x_1, x_2^*)}{p(x_2 = x_2^*)} + \frac{\varepsilon + M_1 \sigma}{p(x_2 = x_2^*)} + 2\varepsilon\, \frac{M_1 + \varepsilon + M_1 \sigma}{p(x_2 = x_2^*)} \\
&= p(x_1 \mid x_2 = x_2^*) + O(\varepsilon + \sigma).
\end{align*}

The lower bound follows similarly.

Bibliography

N. Alon and J. H. Spencer. The Probabilistic Method. Wiley, 3rd edition, 2008.

A. Andoni. Exact algorithms from approximation algorithms? (part 2). Windows on Theory, 2012. https://windowsontheory.org/2012/04/17/ exact-algorithms-from-approximation-algorithms-part-2/ (version: 2016-09-06).

A. Andoni, P. Indyk, T. Laarhoven, I. Razenshteyn, and L. Schmidt. Practical and optimal LSH for angular distance. In Neural Information Processing Systems, 2015.

Gustavo Angulo, Shabbir Ahmed, Santanu S Dey, and Volker Kaibel. Forbidden vertices. Mathematics of Operations Research, 40(2):350–360, 2014.

Sanjeev Arora, Constantinos Daskalakis, and David Steurer. Message passing algorithms and improved LP decoding. In Symposium on Theory of Computing, pages 3–12. ACM, 2009.

Sanjeev Arora, Elad Hazan, and Satyen Kale. The multiplicative weights update method: a meta-algorithm and applications. Theory of Computing, 8 (1):121–164, 2012.

Muhammad Asim, Ali Ahmed, and Paul Hand. Invertible generative models for inverse problems: mitigating representation error and dataset bias. arXiv preprint arXiv:1905.11672, 2019.

K. Avrachenkov, N. Litvak, D. Nemirovsky, and N. Osipova. Monte Carlo methods in PageRank computation: When one iteration is sufficient. SIAM Journal on Numerical Analysis, 2007.

A. Badanidiyuru, B. Mirzasoleiman, A. Karbasi, and A. Krause. Streaming submodular maximization: Massive data summarization on the fly. In KDD, 2014.

Richard G Baraniuk. Compressive sensing. IEEE signal processing magazine, 24(4), 2007.

R. Barbosa, A. Ene, H. L. Nguyen, and J. Ward. The power of randomization: Distributed submodular maximization on massive datasets. In International Conference on Machine Learning, 2015.

Dhruv Batra, Payman Yadollahpour, Abner Guzman-Rivera, and Gregory Shakhnarovich. Diverse m-best solutions in markov random fields. In European Conference on Computer Vision, pages 1–16. Springer, 2012.

Jens Behrmann, David Duvenaud, and Jörn-Henrik Jacobsen. Invertible residual networks. arXiv preprint arXiv:1811.00995, 2018.

Mohamed Ishmael Belghazi, Maxime Oquab, Yann LeCun, and David Lopez-Paz. Learning about an exponential amount of conditional distributions. arXiv preprint arXiv:1902.08401, 2019.

J. L. Bentley. Multidimensional binary search trees used for associative searching. Communications of the ACM, 1975.

Dimitris Bertsimas and John N Tsitsiklis. Introduction to linear optimization, volume 6. Athena Scientific Belmont, MA, 1997.

A. Beygelzimer, S. Kakade, and J. Langford. Cover trees for nearest neighbor. In International Conference on Machine Learning, 2006.

Peter J Bickel, Ya'acov Ritov, Alexandre B Tsybakov, et al. Simultaneous analysis of Lasso and Dantzig selector. The Annals of Statistics, 37(4):1705–1732, 2009.

David M Blei, Alp Kucukelbir, and Jon D McAuliffe. Variational inference: A review for statisticians. Journal of the American Statistical Association, 112 (518):859–877, 2017.

Ashish Bora, Ajil Jalal, Eric Price, and Alexandros G Dimakis. Compressed sensing using generative models. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 537–546. JMLR. org, 2017.

Endre Boros and Peter L Hammer. Pseudo-boolean optimization. Discrete applied mathematics, 123(1):155–225, 2002.

Endre Boros, Khaled Elbassioni, Vladimir Gurvich, and Hans Raj Tiwary. The negative cycles polyhedron and hardness of checking some polyhedral properties. Annals of Operations Research, 188(1):63–76, 2011.

Guy Bresler. Efficiently learning Ising models on arbitrary graphs. In Proceedings of the forty-seventh annual ACM symposium on Theory of computing, pages 771–782. ACM, 2015.

David Burshtein. Iterative approximate linear programming decoding of LDPC codes with linear complexity. IEEE Transactions on Information Theory, 55 (11):4835–4859, 2009.

Emmanuel J Candes, Justin K Romberg, and Terence Tao. Stable signal recovery from incomplete and inaccurate measurements. Communications on Pure and Applied Mathematics: A Journal Issued by the Courant Institute of Mathematical Sciences, 59(8):1207–1223, 2006.

M. S. Charikar. Similarity estimation techniques from rounding algorithms. In STOC, 2002.

J. Chen, H.-R Fang, and Y. Saad. Fast approximate k–NN graph construction for high dimensional data via recursive Lanczos bisection. JMLR, 2009.

Tian Qi Chen, Yulia Rubanova, Jesse Bettencourt, and David K Duvenaud. Neural ordinary differential equations. In Advances in neural information processing systems, pages 6571–6583, 2018.

Michael Chertkov and Mikhail G Stepanov. An efficient pseudocodeword search algorithm for linear programming decoding of LDPC codes. IEEE Transactions on Information Theory, 54(4):1514–1520, 2008.

Constantinos Daskalakis, Alexandros G Dimakis, Richard M Karp, and Martin J Wainwright. Probabilistic analysis of linear programming decoding. IEEE Transactions on Information Theory, 54(8):3565–3578, 2008.

Constantinos Daskalakis, Elchanan Mossel, and Sébastien Roch. Evolutionary trees and the ising model on the bethe lattice: a proof of steel’s conjecture. Probability Theory and Related Fields, 149(1-2):149–189, 2011.

Constantinos Daskalakis, Nishanth Dikkala, and Gautam Kamath. Concentration of multilinear functions of the Ising model with applications to network data. In Advances in Neural Information Processing Systems, pages 12–22, 2017.

M. Datar, N. Immorlica, P. Indyk, and V. S. Mirrokni. Locality-sensitive hashing scheme based on p-stable distributions. In Symposium on Computational Geometry, 2004.

Rina Dechter and Irina Rish. Mini-buckets: A general scheme for bounded inference. Journal of the ACM (JACM), 50(2):107–153, 2003.

Diego Delle Donne and Javier Marenco. Polyhedral studies of vertex coloring problems: The standard formulation. Discrete Optimization, 21:1–13, 2016.

Ilias Diakonikolas, Gautam Kamath, Daniel M Kane, Jerry Li, Ankur Moitra, and Alistair Stewart. Robust estimators in high dimensions without the computational intractability. In Foundations of Computer Science (FOCS), 2016 IEEE 57th Annual Symposium on, pages 655–664. IEEE, 2016.

Ilias Diakonikolas, Daniel Kane, and Alistair Stewart. Robust learning of fixed-structure Bayesian networks. In Neural Information Processing Systems, 2018.

Alexandros G Dimakis, Jiajun Wang, and Kannan Ramchandran. Unequal growth codes: Intermediate performance and unequal error protection for video streaming. In Multimedia Signal Processing, pages 107–110. IEEE, 2007.

Alexandros G Dimakis, Amin A Gohari, and Martin J Wainwright. Guessing facets: Polytope structure and improved LP decoder. IEEE Transactions on Information Theory, 55(8):3479–3487, 2009.

Laurent Dinh, David Krueger, and Yoshua Bengio. Nice: Non-linear indepen- dent components estimation. arXiv preprint arXiv:1410.8516, 2014.

Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using real nvp. arXiv preprint arXiv:1605.08803, 2016.

David L Donoho et al. Compressed sensing. IEEE Transactions on information theory, 52(4):1289–1306, 2006.

Conor Durkan, Artur Bekasov, Iain Murray, and George Papamakarios. Neural spline flows. arXiv preprint arXiv:1906.04032, 2019.

A. Dvoretzky, J. Kiefer, and J. Wolfowitz. Asymptotic minimax character of the sample distribution function and of the classical multinomial estimator. The Annals of Mathematical Statistics, 1956.

Frederich Eberhardt, Clark Glymour, and Richard Scheines. On the number of experiments sufficient and in the worst case necessary to identify all causal relations among n variables. In Uncertainty in Artificial Intelligence, 2005.

Frederick Eberhardt. Causation and intervention. (Ph.D. Thesis), 2007.

Glenn Ellison. Learning, local interaction, and coordination. Econometrica: Journal of the Econometric Society, pages 1047–1071, 1993.

D. Eppstein. k-best enumeration. Encyclopedia of Algorithms, 2014.

U. Feige. A threshold of ln n for approximating set cover. Journal of the ACM, 1998.

Jon Feldman, Martin J Wainwright, and David R Karger. Using linear programming to decode binary linear codes. IEEE Transactions on Information Theory, 51(3):954–972, 2005.

Jon Feldman, Tal Malkin, Rocco A Servedio, Cliff Stein, and Martin J Wainwright. LP decoding corrects a constant fraction of errors. IEEE Transactions on Information Theory, 53(1):82–89, 2007.

Alexander Fix, Joyce Chen, Endre Boros, and Ramin Zabih. Approximate MRF inference using bounded treewidth subgraphs. Computer Vision–ECCV 2012, pages 385–398, 2012.

András Frank. Some polynomial algorithms for certain graphs and hypergraphs. In British Combinatorial Conference, 1975.

Yoav Freund and Robert E Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997.

Menachem Fromer and Amir Globerson. An LP view of the M-best MAP problem. In Neural Information Processing Systems, volume 22, pages 567–575, 2009.

V. Garcia, E. Debreuve, F. Nielsen, and M. Barlaud. K-nearest neighbor search: Fast GPU-based implementations and application to high-dimensional feature matching. In ICIP, 2010.

Michael R. Garey and David S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman & Co., 1979.

Dan Geiger and David Heckerman. Learning Gaussian networks. In Uncertainty in Artificial Intelligence, 1994.

AmirEmad Ghassami, Saber Salehkaleybar, Negar Kiyavash, and Elias Bareinboim. Budgeted experiment design for causal structure learning. In International Conference on Machine Learning, 2018.

A. Gionis, P. Indyk, and R. Motwani. Similarity search in high dimensions via hashing. In VLDB, 1999.

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.

Will Grathwohl, RT Chen, Jesse Bettencourt, and David Duvenaud. Scalable reversible generative models with free-form continuous dynamics. In Proceedings of the International Conference on Learning Representations, New Orleans, LA, USA, pages 6–9, 2019.

GroupLens. MovieLens 20M dataset, 2015. http://grouplens.org/ datasets/movielens/20m/.

Aditya Grover and Stefano Ermon. Uncertainty autoencoders: Learning compressed representations via variational information maximization. arXiv preprint arXiv:1812.10539, 2018.

P. Gupta, A. Goel, J. Lin, A. Sharma, D. Wang, and R. Zadeh. WTF: The who to follow service at Twitter. In WWW, 2013.

Gurobi Optimization, LLC. Gurobi optimizer reference manual, 2018.

Frank R Hampel, Elvezio M Ronchetti, Peter J Rousseeuw, and Werner A Stahel. Robust statistics: the approach based on influence functions, volume 196. John Wiley & Sons, 2011.

Alain Hauser and Peter Bühlmann. Characterization and greedy learning of interventional Markov equivalence classes of directed acyclic graphs. Journal of Machine Learning Research, 13(1):2409–2464, 2012a.

Alain Hauser and Peter Bühlmann. Two optimal strategies for active learning of causal networks from interventional data. In European Workshop on Probabilistic Graphical Models, 2012b.

Reinhard Heckel and Paul Hand. Deep decoder: Concise image representations from untrained non-convolutional networks. arXiv preprint arXiv:1810.03982, 2018.

David Heckerman, Dan Geiger, and David Chickering. Learning Bayesian networks: The combination of knowledge and statistical data. In Machine Learning, 20(3):197–243, 1995.

W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 1963.

Huining Hu, Zhentao Li, and Adrian Vetta. Randomized experimental design for causal graph discovery. In Neural Information Processing Systems, 2014.

Peter J Huber and Elvezio M Ronchetti. Robust Statistics, volume 693. John Wiley & Sons, 2011.

Antti Hyttinen, Frederick Eberhardt, and Patrik Hoyer. Experiment selection for causal discovery. Journal of Machine Learning Research, 14:3041–3071, 2013.

Oscar H Ibarra and Chul E Kim. Fast approximation algorithms for the knapsack and sum of subset problems. Journal of the ACM, 22(4):463–468, 1975.

Alexander Ihler, Natalia Flerova, Rina Dechter, and Lars Otten. Join-graph based cost-shifting schemes. In Proceedings of the Twenty-Eighth Conference on Uncertainty in Artificial Intelligence, pages 397–406. AUAI Press, 2012.

Guido W. Imbens and Donald B. Rubin. Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction. Cambridge University Press, 2015.

IMDb. Alternative interfaces, 2016. http://www.imdb.com/interfaces.

P. Indyk and R. Motwani. Approximate nearest neighbors: towards removing the curse of dimensionality. In STOC, 1998.

Oleg Ivanov, Michael Figurnov, and Dmitry Vetrov. Variational autoencoder with arbitrary conditioning. arXiv preprint arXiv:1806.02382, 2018.

Jörn-Henrik Jacobsen, Arnold Smeulders, and Edouard Oyallon. i-revnet: Deep invertible networks. arXiv preprint arXiv:1802.07088, 2018.

P. Jain, P. Netrapalli, and S. Sanghavi. Low-rank matrix completion using alternating minimization. In STOC, 2013.

Klaus Jansen. The optimum cost chromatic partition problem. In Italian Conference on Algorithms and Complexity, 1997.

Tony Jebara. MAP estimation, message passing, and perfect graphs. In UAI, pages 258–267. AUAI Press, 2009.

Shihao Ji, Ya Xue, Lawrence Carin, et al. Bayesian compressive sensing. IEEE Transactions on Signal Processing, 56(6):2346, 2008.

Michael I Jordan, Zoubin Ghahramani, Tommi S Jaakkola, and Lawrence K Saul. An introduction to variational methods for graphical models. Machine Learning, 37(2):183–233, 1999.

Sayash Kapoor, Kumar Kshitij Patel, and Purushottam Kar. Corruption-tolerant bandit learning. Machine Learning, pages 1–29, 2018.

Joerg Kappes, Bjoern Andres, Fred Hamprecht, Christoph Schnorr, Sebastian Nowozin, Dhruv Batra, Sungwoong Kim, Bernhard Kausler, Jan Lellmann, Nikos Komodakis, et al. A comparative study of modern inference techniques for discrete energy minimization problems. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1328–1335, 2013.

Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196, 2017.

Gyula Katona. On separating systems of a finite set. Journal of Combinatorial Theory, 1(2):174–194, 1966.

Leonid Khachiyan, Endre Boros, Konrad Borys, Khaled Elbassioni, and Vladimir Gurvich. Generating all vertices of a polyhedron is hard. Discrete & Computational Geometry, 39(1-3):174–190, 2008.

Ross D King, Kenneth E Whelan, Ffion M Jones, Philip GK Reiser, Christopher H Bryant, Stephen H Muggleton, Douglas B Kell, and Stephen G Oliver. Functional genomic hypothesis generation and experimentation by a robot scientist. Nature, 427(6971):247, 2004.

Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

Durk P Kingma and Prafulla Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems, pages 10215–10224, 2018.

Durk P Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, and Max Welling. Improved variational inference with inverse autoregressive flow. In Advances in Neural Information Processing Systems, pages 4743–4751, 2016.

Adam Klivans and Raghu Meka. Learning graphical models using multiplicative weights. In Foundations of Computer Science (FOCS), 2017 IEEE 58th Annual Symposium on, pages 343–354. IEEE, 2017.

Murat Kocaoglu, Alexandros G. Dimakis, and Sriram Vishwanath. Cost-optimal learning of causal graphs. In International Conference on Machine Learning, 2017.

Ralf Koetter, Wen-Ching W Li, Pascal O Vontobel, and Judy L Walker. Characterizations of pseudo-codewords of (low-density) parity-check codes. Advances in Mathematics, 213(1):205–229, 2007.

Nikos Komodakis and Nikos Paragios. Beyond loose LP-relaxations: Optimizing MRFs by repairing cycles. In European Conference on Computer Vision, pages 806–820. Springer, 2008.

Y. Koren, R. Bell, and C. Volinsky. Matrix factorization techniques for recommender systems. Computer, 2009.

A. Krause and R. G. Gomes. Budgeted nonparametric learning from data streams. In International Conference on Machine Learning, 2010.

A. Krause, J. Leskovec, C. Guestrin, J. VanBriesen, and C. Faloutsos. Efficient sensor placement optimization for securing large water distribution networks. Journal of Water Resources Planning and Management, 2008.

Andreas Krause and Daniel Golovin. Submodular function maximization. In Tractability: Practical Approaches to Hard Problems. Cambridge University Press, 2014.

Leo G Kroon, Arunabha Sen, Haiyong Deng, and Asim Roy. The optimal cost chromatic partition problem for trees and interval graphs. In International Workshop on Graph-Theoretic Concepts in Computer Science, 1996.

R. Kumar, B. Moseley, S. Vassilvitskii, and A. Vattani. Fast greedy algorithms in MapReduce and streaming. ACM Transactions on Parallel Computing, 2015.

Kevin A Lai, Anup B Rao, and Santosh Vempala. Agnostic estimation of mean and covariance. In Foundations of Computer Science (FOCS), 2016 IEEE 57th Annual Symposium on, pages 665–674. IEEE, 2016.

Ailsa H Land and Alison G Doig. An automatic method of solving discrete programming problems. Econometrica: Journal of the Econometric Society, pages 497–520, 1960.

Eugene L Lawler. A procedure for computing the k best solutions to discrete optimization problems and its application to the shortest path problem. Management Science, 18(7):401–405, 1972.

J. Leskovec, A. Krause, C. Guestrin, C. Faloutsos, J. VanBriesen, and N. Glance. Cost-effective outbreak detection in networks. In KDD, 2007.

H. Lin and J. A. Bilmes. Learning mixtures of submodular shells with application to document summarization. In UAI, 2012.

Xishuo Liu, Stark C Draper, and Benjamin Recht. Suppressing pseudocodewords by penalizing the objective of LP decoding. In Information Theory Workshop, pages 367–371. IEEE, 2012.

Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of the IEEE International Conference on Computer Vision, pages 3730–3738, 2015.

Andrey Y Lokhov, Marc Vuffray, Sidhant Misra, and Michael Chertkov. Optimal structure and parameter learning of Ising models. Science Advances, 4(3):e1700791, 2018.

Cai Mao-Cheng. On separating systems of graphs. Discrete Mathematics, 49: 15–20, 1984.

Morteza Mardani, Qingyun Sun, David Donoho, Vardan Papyan, Hatef Monajemi, Shreyas Vasanawala, and John Pauly. Neural proximal gradient descent for compressive imaging. In Advances in Neural Information Processing Systems, pages 9573–9583, 2018.

Radu Marinescu and Rina Dechter. AND/OR branch-and-bound search for combinatorial optimization in graphical models. Artificial Intelligence, 173(16-17):1457–1491, 2009.

P. Massart. The tight constant in the Dvoretzky-Kiefer-Wolfowitz inequality. The Annals of Probability, 1990.

X. Meng, J. Bradley, B. Yavuz, E. Sparks, S. Venkataraman, D. Liu, J. Freeman, D. B. Tsai, M. Amde, S. Owen, D. Xin, R. Xin, M. J. Franklin, R. Zadeh, M. Zaharia, and A. Talwalkar. MLlib: Machine learning in Apache Spark. JMLR, 2016.

Ofer Meshi, Mehrdad Mahdavi, Adrian Weller, and David Sontag. Train and test tightness of LP relaxations in structured prediction. In International Conference on Machine Learning, 2016.

M. Minoux. Accelerated greedy algorithms for maximizing submodular set functions. In Optimization Techniques. Springer, 1978.

V. Mirrokni and M. Zadimoghaddam. Randomized composable core-sets for distributed submodular maximization. STOC, 2015.

Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.

B. Mirzasoleiman, A. Karbasi, R. Sarkar, and A. Krause. Distributed submodular maximization: Identifying representative elements in massive data. In Neural Information Processing Systems, 2013.

B. Mirzasoleiman, A. Badanidiyuru, A. Karbasi, J. Vondrak, and A. Krause. Lazier than lazy greedy. In AAAI, 2015.

B. Mirzasoleiman, A. Badanidiyuru, and A. Karbasi. Fast constrained submodular maximization: Personalized data summarization. In International Conference on Machine Learning, 2016.

Dustin G Mixon and Soledad Villar. Sunlayer: Stable denoising with generative networks. arXiv preprint arXiv:1803.09319, 2018.

Andrea Montanari and Amin Saberi. The spread of innovations in social networks. Proceedings of the National Academy of Sciences, 107(47):20196– 20201, 2010.

Katta G Murty. Letter to the editor—an algorithm for ranking all the assignments in order of increasing cost. Operations Research, 16(3):682–687, 1968.

G. L. Nemhauser, L. A. Wolsey, and M. L. Fisher. An analysis of approximations for maximizing submodular set functions–I. Mathematical Programming, 1978.

George L Nemhauser and Laurence A Wolsey. Integer and Combinatorial Optimization. John Wiley & Sons, 1999.

Robert Osazuwa Ness, Karen Sachs, Parag Mallick, and Olga Vitek. A Bayesian active learning experimental design for inferring signaling networks. In International Conference on Research in Computational Molecular Biology, 2017.

B. Neyshabur and N. Srebro. On symmetric and asymmetric LSHs for inner product search. In International Conference on Machine Learning, 2015.

Manfred Padberg. The boolean quadric polytope: some characteristics, facets and relatives. Mathematical Programming, 45(1):139–172, 1989.

L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the web. Stanford Digital Libraries, 1999.

Parthe Pandit, Mojtaba Sahraee, Sundeep Rangan, and Alyson K Fletcher. Asymptotics of MAP inference in deep networks. arXiv preprint arXiv:1903.01293, 2019.

Deepak Pathak, Philipp Krahenbuhl, Jeff Donahue, Trevor Darrell, and Alexei A Efros. Context encoders: Feature learning by inpainting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2536–2544, 2016.

Judea Pearl. Causality: Models, Reasoning, and Inference. Cambridge University Press, 2009.

Joseph Ramsey, Madelyn Glymour, Ruben Sanchez-Romero, and Clark Glymour. A million variables and more: The fast greedy equivalence search algorithm for learning high-dimensional graphical causal models, with an application to functional magnetic resonance images. International Journal of Data Science and Analytics, 3(2):121–129, 2017.

Joseph D Ramsey, Stephen José Hanson, Catherine Hanson, Yaroslav O Halchenko, Russell A Poldrack, and Clark Glymour. Six problems for causal inference from fMRI. Neuroimage, 49(2):1545–1558, 2010.

Danilo Rezende and Shakir Mohamed. Variational inference with normalizing flows. In International Conference on Machine Learning, pages 1530–1538, 2015.

Carsten Rother, Vladimir Kolmogorov, Victor Lempitsky, and Martin Szummer. Optimizing binary MRFs via extended roof duality. In Computer Vision and Pattern Recognition, 2007. CVPR’07. IEEE Conference on, pages 1–8. IEEE, 2007.

Maya Rotmensch, Yoni Halpern, Abdulhakim Tlimat, Steven Horng, and David Sontag. Learning a health knowledge graph from electronic medical records. Scientific Reports, 7(1):5994, 2017.

Donald B Rubin and Richard P Waterman. Estimating the causal effects of marketing interventions using propensity score methodology. Statistical Science, pages 206–222, 2006.

Karen Sachs, Omar Perez, Dana Pe’er, Douglas A Lauffenburger, and Garry P Nolan. Causal protein-signaling networks derived from multiparameter single- cell data. Science, 308(5721):523–529, 2005.

Fahad Shamshad, Asif Hanif, and Ali Ahmed. Subsampled Fourier ptychography via pretrained invertible and untrained network priors. 2019.

Karthikeyan Shanmugam, Murat Kocaoglu, Alexandros G. Dimakis, and Sriram Vishwanath. Learning causal graphs with small interventions. In Neural Information Processing Systems, 2015.

A. Shrivastava and P. Li. Asymmetric LSH (ALSH) for sublinear time maximum inner product search (MIPS). In Neural Information Processing Systems, 2014.

Kihyuk Sohn, Honglak Lee, and Xinchen Yan. Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems, pages 3483–3491, 2015.

David Sontag and Tommi S Jaakkola. New outer bounds on the marginal polytope. In Neural Information Processing Systems, volume 20, pages 1393–1400, 2007.

David Sontag, Talya Meltzer, Amir Globerson, Tommi Jaakkola, and Yair Weiss. Tightening LP relaxations for MAP using message passing. In UAI, pages 503–510. AUAI Press, 2008.

David Sontag, Yitao Li, et al. Efficiently searching for frustrated cycles in MAP inference. In UAI, 2012.

Peter Spirtes, Clark Glymour, and Richard Scheines. Causation, Prediction, and Search. A Bradford Book, 2001.

Yuriy Sverchkov and Mark Craven. A review of active learning approaches to experimental design for uncovering biological networks. PLoS Computational Biology, 13(6):e1005466, 2017.

Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288, 1996.

Francesco Tonolini, Ashley Lyons, Piergiorgio Caramazza, Daniele Faccio, and Roderick Murray-Smith. Variational inference for computational imaging inverse problems. arXiv preprint arXiv:1904.06264, 2019.

S. Tschiatschek, R. K. Iyer, H. Wei, and J. A. Bilmes. Learning mixtures of submodular functions for image collection summarization. In Neural Information Processing Systems, 2014.

Marc Vuffray, Sidhant Misra, Andrey Lokhov, and Michael Chertkov. Interaction screening: Efficient and sample-optimal learning of Ising models. In Advances in Neural Information Processing Systems, pages 2595–2603, 2016.

Martin J Wainwright. High-Dimensional Statistics: A Non-Asymptotic Viewpoint, volume 48. Cambridge University Press, 2019.

Martin J Wainwright, Tommi S Jaakkola, and Alan S Willsky. Tree-reweighted belief propagation algorithms and approximate ML estimation by pseudo-moment matching. In AISTATS, 2003.

Martin J Wainwright, Michael I Jordan, et al. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1–2):1–305, 2008.

K. Wei, R. Iyer, and J. Bilmes. Fast multi-stage submodular maximization. In International Conference on Machine Learning, 2014.

K. Wei, R. Iyer, and J. Bilmes. Submodularity in data subset selection and active learning. In International Conference on Machine Learning, 2015.

Adrian Weller. Uprooting and rerooting graphical models. In International Conference on Machine Learning, 2016.

Adrian Weller, Kui Tang, Tony Jebara, and David Sontag. Understanding the Bethe approximation: When and how can it go wrong? In UAI, pages 868–877, 2014.

Adrian Weller, Mark Rowland, and David Sontag. Tightness of LP relaxations for almost balanced models. In AISTATS, 2016.

David P Williamson and David B Shmoys. The Design of Approximation Algorithms. Cambridge University Press, 2011.

Laurence A Wolsey. An analysis of the greedy algorithm for the submodular set covering problem. Combinatorica, 2(4):385–393, 1982.

Jonathan S Yedidia, William T Freeman, Yair Weiss, et al. Generalized belief propagation. In Neural Information Processing Systems, volume 13, pages 689–695, 2000.

Raymond A Yeh, Chen Chen, Teck Yian Lim, Alexander G Schwing, Mark Hasegawa-Johnson, and Minh N Do. Semantic image inpainting with deep generative models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5485–5493, 2017.

Chen Zhang and Bangti Jin. Probabilistic residual learning for aleatoric uncertainty in image restoration. arXiv preprint arXiv:1908.01010, 2019.

Xiaojie Zhang and Paul H Siegel. Adaptive cut generation algorithm for improved linear programming decoding of binary linear codes. IEEE Transactions on Information Theory, 58(10):6581–6594, 2012.

Vita

Erik Michael Lindgren received his PhD from the University of Texas at Austin, where he focused on machine learning. He received his bachelor’s from Boston University and is a recipient of an NSF Graduate Research Fellowship. His research interests are in machine learning and combinatorial optimization.

Email address: [email protected]

This dissertation was typeset with LaTeX† by the author.

†LaTeX is a document preparation system developed by Leslie Lamport as a special version of Donald Knuth's TeX program.
