Copyright by Erik Michael Lindgren 2019

The Dissertation Committee for Erik Michael Lindgren certifies that this is the approved version of the following dissertation:
Combinatorial Optimization for Graphical Structures in Machine Learning
Committee:
Georgios-Alex Dimakis, Supervisor
Constantine Caramanis
Sujay Sanghavi
Adam Klivans
Qiang Liu

Combinatorial Optimization for Graphical Structures in Machine Learning
by
Erik Michael Lindgren
DISSERTATION Presented to the Faculty of the Graduate School of The University of Texas at Austin in Partial Fulfillment of the Requirements for the Degree of
DOCTOR OF PHILOSOPHY
THE UNIVERSITY OF TEXAS AT AUSTIN
December 2019

To Leif and Ellen Lindgren

Acknowledgments
First, I have to thank my parents, Ellen and Leif, and my sister Lauren. I could not have done this without all your love and care, and without growing up with you as role models.
I am incredibly grateful to have Alex as an advisor. He spent a tremendous amount of time sharing his wisdom, not just on research but also on starting a career and life in general.
I also have to thank the rest of my committee: Adam, Constantine, Qiang, and Sujay. I am incredibly grateful for all the knowledge and advice they shared with me.
I would like to thank my student collaborators Shanshan, Murat, Vatsal, Yanyao, Nishanth, and Jay. I would also like to thank my labmates Karthik, Megas, Ajil, Rashish, Dave, Ethan, Qi, Sriram, Matt, and Eirini. I would also like to thank my fellow WNCG and UT students, especially Ioannis, Ankit, Srinadh, Avro, Rajat, Soumya, Abishek, Mandar, Marius, Diego, Jess, Ashish, Derya, Avik, Isfar, Preeti, Ahmad, Surbhi, Shalmali, Sanmit, Chris, Justin, Rajiv, Yitao, Natasha, and Mónica. I learned a tremendous amount from you all, and I cherish all the fun times we had. Thank you for all your support, both in and out of the lab.
I am also extremely grateful for the help I received from Apipol, Karen, Jaymie, Melanie, Melody, and Barry. They were always ready to help, especially when I needed it the most.
I would like to thank Bobak Nazer, Doug Densmore, Ernst Oberortner, and Swapnil Bhatia for helping me to start my research career at Boston University. Their advice and guidance set me up for success at UT.
I would also like to thank my intern hosts at Google, Jack and Ofer. They taught me so much about machine learning and working in industry.
Finally, I have to thank Sara. Thank you for your love and support and all the joy you brought me throughout this process.
Combinatorial Optimization for Graphical Structures in Machine Learning
Publication No.
Erik Michael Lindgren, Ph.D.
The University of Texas at Austin, 2019
Supervisor: Georgios-Alex Dimakis
Graphs are an essential topic in machine learning. In this dissertation, we explore problems in graphical models (where a probability distribution has conditional independencies specified by a graph), causality (where a directed graph specifies causal directions), and clustering (where a weighted graph is used to denote related items).
For our first contribution, we consider the facility location problem. In this problem our goal is to select a set of k "facilities" such that the average benefit from a "town" to its most beneficial facility is maximized. As input, we receive a bipartite graph where every edge has a weight denoting the benefit the town would receive from the facility if selected. The input graph is often dense, with O(n²) edges. We analyze sparsifying the graph with nearest neighbor methods. We give a tight characterization of how sparse each method can make the graph while approximately maintaining the value of the optimal solution. We then demonstrate these approaches experimentally and see that they lead to large speedups and high-quality solutions.
Next, we consider the MAP inference problem in discrete Markov Random Fields. MAP inference is the problem of finding the most likely configuration of a Markov Random Field. One common approach to finding the MAP solution is with integer programming. We are interested in analyzing the complexity of integer programming under the assumption that there is a polynomial number of fractional vertices in the linear programming relaxation of the integer program. We show that under this assumption the optimal MAP assignment can be found in polynomial time. Our result generalizes to arbitrary integer programs whose solution set is contained in the unit hypercube: we show that any integer program in the unit hypercube with a polynomial number of fractional vertices in the linear programming relaxation can be solved in polynomial time.
We then consider the minimum cost intervention design problem: given the essential graph of a causal graph and a cost to intervene on a variable, identify the set of interventions with minimum total cost that can learn any causal graph with the given essential graph. We first show that this problem is NP-hard. We then prove that we can achieve a constant factor approximation to this problem with a greedy algorithm. Our approach to proving this guarantee uses tools from submodular optimization and knapsack quantization.
Next we consider the problem of learning Ising models when an adversary can corrupt the samples we receive. For this problem we give nearly tight lower and upper bounds on the sample complexity.
Finally, we consider the problem of conditional sampling from invertible generative models. We first establish hardness results for generating conditional samples. We then develop a scheme using variational inference that allows us to approximately solve the problem. Our approach is able to utilize the given invertible model to improve sample quality.
Table of Contents

Acknowledgments
Abstract
List of Tables
List of Figures

Chapter 1. Introduction

Chapter 2. Leveraging Sparsity for Efficient Submodular Data Summarization
    2.1 Introduction
    2.2 Related Work
        2.2.1 Benefit Functions and Nearest Neighbor Methods
    2.3 Guarantees for t-Nearest Neighbor Sparsification
    2.4 Guarantees for Threshold-Based Sparsification
    2.5 Experiments
        2.5.1 Summarizing Movies and Music from Ratings Data
        2.5.2 Finding Influential Actors and Actresses
    2.6 Appendix: Additional Figures
    2.7 Appendix: Full Proofs
        2.7.1 Proof of Theorem 1
        2.7.2 Proof of Proposition 3
        2.7.3 Proof of Proposition 4
        2.7.4 Proof of Lemma 8
        2.7.5 Proof of Lemma 9
        2.7.6 Proof of Theorem 6
        2.7.7 Proof of Lemma 7

Chapter 3. Exact MAP Inference by Avoiding Fractional Vertices
    3.1 Introduction
    3.2 Background and Related Work
    3.3 Provable Integer Programming
        3.3.1 Proof of Theorem 12
        3.3.2 The M-Best LP Problem
        3.3.3 K-Best Integral Solutions
    3.4 Fractional Vertices of the Local Polytope
        3.4.1 Proof of Theorem 17
    3.5 Estimating the Number of Confounding Singleton Marginals
    3.6 Experiments
    3.7 Conclusion

Chapter 4. Experimental Design for Cost-Aware Learning of Causal Graphs
    4.1 Introduction
    4.2 Minimum Cost Intervention Design
        4.2.1 Relevant Graph Theory Concepts
        4.2.2 Causal Graphs and Interventional Learning
        4.2.3 Graph Separating Systems and Minimum Cost Intervention Design
    4.3 Related Work
    4.4 Hardness of Minimum Cost Intervention Design
    4.5 Approximation Guarantees for Minimum Cost Intervention Design
    4.6 Algorithms for k-Sparse Intervention Design Problems
    4.7 Experiments
    4.8 Example Graph Where Quantization Helps Greedy
    4.9 Proof of Approximation Guarantees of the Quantized Greedy Algorithm
        4.9.1 Submodularity Background
        4.9.2 Bound on the Quantized Greedy Algorithm Solution Size
        4.9.3 Submodular and Supermodular Chain Problem
        4.9.4 Proof of Quantized Greedy Algorithm Approximation Guarantees
        4.9.5 Proof of Technical Lemmas
    4.10 Proof of Results on k-Sparse Intervention Design Problems
    4.11 Proof of NP-Hardness

Chapter 5. On Robust Learning of Ising Models
    5.1 Introduction
        5.1.1 Related Work
    5.2 Problem Setup
    5.3 Inachievability Results
    5.4 Achievable Results
        5.4.1 Robustness of the Hedge Algorithm
    5.5 Proof of Theorem 53
    5.6 Proof of Achievability

Chapter 6. Uncertainty-Aware Compressive Sensing with Flow Composition
    6.1 Introduction
    6.2 Background and Related Work
        6.2.1 Invertible Generative Models
        6.2.2 Variational Inference for Conditional Sampling
        6.2.3 Compressive Sensing with Generative Priors
        6.2.4 Additional Related Work
    6.3 Hardness of Conditional Sampling
    6.4 Conditional Sampling with Composed Flow Models
        6.4.1 Generalizing to Measurement Matrices
    6.5 Experiments
    6.6 Proof of Hardness Results
        6.6.1 Design of the Additive Coupling Network
        6.6.2 Generating SAT Solutions from the Conditional Distribution
        6.6.3 Hardness of Approximate Sampling
    6.7 Proof of Proposition 61

Bibliography

Vita
List of Tables

2.1 A subset of the summarization output by our algorithm on the MovieLens dataset, plus the elements represented by each representative with the largest dot product. Each group has a natural interpretation: 90's slapstick comedies, 80's horror, cult classics, etc. Note that this was obtained using only a similarity matrix derived from ratings.
2.2 The top twenty-five actors and actresses generated by sparsified facility location optimization defined by the personalized PageRank of a 57,000-vertex movie personnel collaboration graph from [IMDb, 2016], and the twenty-five actors and actresses with the largest (non-personalized) PageRank. We see that the classical PageRank approach fails to capture the diversity of nationality in the dataset, while the facility location results include actors and actresses from many of the world's film industries.
List of Figures

2.1 Results for the MovieLens dataset [GroupLens, 2015]. Figure (a) shows the function value as the runtime increases, normalized by the value the greedy algorithm obtained. Our algorithm is within 99.9% of greedy in less than 5 seconds. For this experiment, the greedy algorithm had a runtime of 512 seconds, so this is a 100x speedup for a small penalty in performance. We also compare to the stochastic greedy algorithm [Mirzasoleiman et al., 2015], which needs 125 seconds to get equivalent performance, 25x slower. Figure (b) shows the fraction of the set returned by each method that was common with the set returned by greedy. The approximate nearest neighbor method has 90% of its elements in common with the greedy set while being 50x faster than greedy, and exact nearest neighbors can perfectly match the greedy set while being 4x faster than greedy.
2.2 (a) MovieLens dataset [GroupLens, 2015] and (b) IMDb dataset [IMDb, 2016], as explained in the Experiments section. We see that for sparsity t significantly smaller than the n/(αk) lower bound we can still find a small covering set in the t-nearest neighbor graph.
2.3 (a) MovieLens dataset [GroupLens, 2015] and (b) IMDb dataset [IMDb, 2016], as explained in the Experiments section. Even with several orders of magnitude fewer edges than the complete graph, we can still find a small set that covers a large fraction of the dataset. For MovieLens this set was of size 40 and for IMDb this set was of size 50. The number of coverable elements was estimated with the greedy algorithm for the max-coverage problem.
2.4 The fraction of the greedy solution that was contained as the sparsity t was increased, for exact nearest neighbors and approximate LSH-based nearest neighbors on the MovieLens dataset. The exact method captures slightly more of the greedy solution for a given value of t, and the LSH value does not converge to 1. However, LSH still captures a reasonable amount of the greedy set and is significantly faster at finding nearest neighbors.
2.5 Results for the IMDb dataset [IMDb, 2016]. Figure (a) shows the function value as the runtime increases, normalized by the value the greedy algorithm obtained. Our algorithm is within 99% of greedy in less than 10 minutes. For this experiment, the greedy algorithm had a runtime of six hours, so this is a 36x speedup for a small penalty in performance. We also compare to using a small sample of the set I as an estimate of the function, which does not perform nearly as well as our algorithm even with much longer runtimes. Figure (b) shows the fraction of the set returned by each method that was common with the set returned by greedy. The approximate nearest neighbor method has 90% of its elements in common with the greedy set while being 18x faster than greedy.
3.1 We compare how the number of fractional singleton marginals |S(V_C)| changes with the connection strength w. We plot the sample CDF of the probability that |S(V_C)| is some given value. We observe that |S(V_C)| increases as the connection strength increases. Further, while most instances have a small value of |S(V_C)|, there are rare instances where |S(V_C)| is quite large.
3.2 We compare how the number of cycle constraints from Equation (3.4) that must be introduced to find the best integral solution changes with the number of confounding singleton marginals. We use the algorithm for finding the most frustrated cycle in [Sontag and Jaakkola, 2007] to introduce new constraints. We observe that each constraint seems to remove many confounding singleton marginals.
3.3 We also observe that the number of confounding singleton marginals introduced by the cycle constraints increases with the number of confounding singleton marginals.
3.4 Finally, we observe that the number of branches needed to find the optimal solution increases with the number of confounding singleton marginals. A similar trend arises as with the number of cycle inequalities introduced. To compare the methods, note that branch-and-bound uses twice as many LP calls as there are branches. For this family of graphical models, branch-and-bound tends to require fewer calls to an LP solver than the cut constraints.
4.1 We generate random chordal graphs such that the maximum degree is bounded by 20. The node weights are generated by the heavy-tailed Pareto distribution with scale parameter 2.0. The number of interventions m is fixed to 5. We compare the greedy algorithm to the optimal solution and the baseline algorithm mentioned in the experimental setup. We see that the greedy algorithm is close to optimal and outperforms the baseline. We also see that the greedy algorithm is able to find a solution with the available number of colors, even without quantization.
4.2 We sample graphs of size 10,000 such that the maximum degree is bounded by 20 and the average degree is 3. We draw the weights from the heavy-tailed Pareto distribution with scale parameter 2.0. We restrict all interventions to be of size 10. We adjust the penalty parameter in Algorithm 7 to see how the size of the k-sparse graph separating system relates to the cost. Costs are normalized so that the largest cost is 1.0. We see that with 561 interventions we can achieve a cost of 0.78, compared to a cost of 1.0 with 510 interventions. Our lower bound implies that we need 506 interventions on average.
4.3 When there are very large weights, the greedy algorithm may require many colors to terminate, even on graphs with a small chromatic number. For this graph, the largest independent set is the top two vertices, followed by the next two, and so on. The greedy algorithm will color each of these pairs of vertices a different color, using n/2 colors. However, after quantization the greedy algorithm will only use 4 colors.
5.1 Graphs for the inachievability result in Theorem 2
5.2 Graphs for Theorem 53
6.1 A flow chart of our conditional sampler. First the noise variable z_0 is sampled from N(0, I). This is fed into an invertible generative model f̂ to output another noise variable z_1. We then feed z_1 into the original model f to generate x_1 and x_2.
6.2 A graphical model depicting the process we run the ELBO on. We imagine that x̂_2 is drawn from the conditional distribution N(x_2, σ²I) for some small parameter σ. We see that x̂_2 is independent of x_1 and z when conditioned on x_2.
6.3 Conditional sampling using our approach. We condition on the top half of the image and sample the bottom half. Since the samples output both halves of the image, we replace the sampled top half with the true top half. We see that our approach is able to generate completions with diversity, as there are multiple mouth positions for most sets of images. We plot the pixelwise variance by calculating the sample variance over 64 samples and summing all color channels into one value.
6.4 Conditional sampling when the measurement is a blurred image. We blur the image using average pooling with a window size and stride of 4. We see that our approach is able to find several reasonable ways to complete the measurements.
6.5 Caption
Chapter 1
Introduction
Graphs are an essential topic in machine learning, where they play a prominent role in representing how items are structured and related. In this dissertation we explore five problems in machine learning that use combinatorial optimization over graphical structures.
Problem 1: Facility Location Problem.
In the facility location problem we are given a set V of size n, a set I of size m, and a benefit matrix of nonnegative numbers C ∈ R^{I×V}, where C_{iv} describes the benefit that element i receives from element v. Our goal is to select a small set A of k columns in this matrix. Once we have chosen A, element i will get a benefit equal to the best choice out of the available columns, max_{v∈A} C_{iv}. The total reward is the sum of the row rewards, so the optimal choice of columns is the solution of
\[
\arg\max_{\{A \subseteq V : |A| \le k\}} \; \sum_{i \in I} \max_{v \in A} C_{iv}.
\]
A natural application of this problem is in finding a small set of representative images in a big dataset, where C_{iv} represents the similarity between images i and v. The problem is to select k images that provide a good coverage of the full dataset, since each one has a close representative in the chosen set.

The benefit matrix C can be interpreted as a weighted, undirected, bipartite graph G = (V, I, E). There is an edge between items i ∈ I and v ∈ V with weight C_{iv}. If there are many values of weight 0 in the matrix C, we can think of this as removing edges in the graph G. Algorithms for optimizing the facility location problem, such as the greedy algorithm, can exploit the sparsity of this graph to speed up running time.
Unfortunately, for many benefit functions of interest, the benefit matrix C is dense with O(mn) entries. For example, if the benefit matrix is a similarity function in a vector space, there will typically be few, if any, similarities of value exactly 0.
One approach is to sparsify this matrix by removing small entries. For example, we can keep only the t largest elements of each row C_i, or we can keep only elements with value above a threshold τ. These sparsification methods can be implemented quickly using nearest neighbor methods. In Chapter 2, we analyze these approaches. We prove that they do lead to a significant reduction in the number of edges of the graph G while approximately maintaining the optimal solution to the initial facility location problem.
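The row-wise sparsification step described here can be sketched in a few lines; this is a minimal illustration rather than the dissertation's implementation, and the small benefit matrix below is made up.

```python
import numpy as np

def sparsify_rows(C, t):
    """Keep only the t largest entries in each row of the dense
    benefit matrix C; zero out the rest."""
    C = np.asarray(C, dtype=float)
    sparse = np.zeros_like(C)
    # argpartition finds the column indices of the t largest entries per row
    idx = np.argpartition(C, -t, axis=1)[:, -t:]
    rows = np.arange(C.shape[0])[:, None]
    sparse[rows, idx] = C[rows, idx]
    return sparse

C = np.array([[0.9, 0.1, 0.5],
              [0.2, 0.8, 0.3],
              [0.4, 0.6, 0.7]])
print(sparsify_rows(C, 2))
```

In practice the t nearest neighbors would be found with an (approximate) nearest neighbor index rather than a dense pass over C, but the resulting sparse matrix plays the same role.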
Problem 2: Integer Programming for MAP in graphical models.
Markov Random Fields (MRFs) are graphical models that represent conditional independencies with an undirected graph. One important class of distributions is binary, pairwise MRFs, which are also called Ising models. For these distributions, we are given a graph G = (V, E) with node weights (θ_i)_{i∈V} and edge weights (W_{ij})_{ij∈E}. The probability of a configuration x ∈ {0, 1}^V is calculated as
\[
P(X = x) = \frac{1}{Z} \exp\Bigg( \sum_{i \in V} \theta_i x_i + \sum_{ij \in E} W_{ij} x_i x_j \Bigg),
\]
where Z is used to ensure that the distribution normalizes to 1.
An important problem for inference in graphical models is the MAP problem, which is the problem of finding the most likely configuration. For Ising models, we can write this problem as
\[
\max_{x \in \{0,1\}^V} \; \sum_{i \in V} \theta_i x_i + \sum_{ij \in E} W_{ij} x_i x_j.
\]
We can write this problem as an integer linear program:
\[
\begin{aligned}
\max_{q \in \mathbb{R}^{V \cup E}} \quad & \sum_{i \in V} \theta_i q_i + \sum_{ij \in E} W_{ij} q_{ij} \\
\text{s.t.} \quad & q_i \in \{0, 1\} && \forall i \in V, \\
& q_{ij} \ge \max\{0,\, q_i + q_j - 1\} && \forall ij \in E, \\
& q_{ij} \le \min\{q_i,\, q_j\} && \forall ij \in E.
\end{aligned}
\]
The standard approach to solving the integer program is with LP relaxations. To create the LP relaxation, we replace the constraint q_i ∈ {0, 1} with 0 ≤ q_i ≤ 1. The problem is now a linear program and can be solved in polynomial time. If the optimal solution is integral, then it is also the optimal solution to the integer program. If the optimal solution is fractional, then various integer programming techniques can be used to modify the LP in a way that removes this fractional solution from the LP relaxation while maintaining the optimal integral solution.
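As a concrete illustration of this LP relaxation (not code from the dissertation), the sketch below solves the relaxed program for a hypothetical 3-node chain with attractive weights, a case where the relaxation returns an integral vertex; the weights are made up.

```python
import numpy as np
from scipy.optimize import linprog

# Variables: q0, q1, q2 (nodes), q01, q12 (edges of a 3-node chain).
theta = np.array([1.0, -2.0, 1.0])     # hypothetical node weights
W01, W12 = 1.5, 1.5                    # attractive (positive) edge weights

# linprog minimizes, so negate the MAP objective.
c = -np.array([theta[0], theta[1], theta[2], W01, W12])

# Local-polytope constraints: q_ij >= q_i + q_j - 1, q_ij <= q_i, q_ij <= q_j.
A_ub = [
    [ 1,  1,  0, -1,  0],   # q0 + q1 - q01 <= 1
    [-1,  0,  0,  1,  0],   # q01 <= q0
    [ 0, -1,  0,  1,  0],   # q01 <= q1
    [ 0,  1,  1,  0, -1],   # q1 + q2 - q12 <= 1
    [ 0, -1,  0,  0,  1],   # q12 <= q1
    [ 0,  0, -1,  0,  1],   # q12 <= q2
]
b_ub = [1, 0, 0, 1, 0, 0]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, 1)] * 5)
q = res.x
print("LP optimum:", -res.fun, "node assignment:", np.round(q[:3]))
```

For this instance the relaxation is tight and the solver returns the integral MAP assignment (1, 1, 1); with mixed-sign weights the optimal LP vertex can be fractional, which is exactly the situation the chapter studies.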
LP relaxations are utilized for decoding LDPC codes [Feldman et al., 2005]. LDPC codes are a class of graphical models used for communication coding, and decoding is essentially finding the MAP solution. The LP relaxation succeeds when the integral MAP solution is also the optimal solution to the linear program. However, due to introduced fractional vertices, the LP relaxation approach can fail.
Dimakis et al. [2009] considered integer programming techniques from LDPC decoding that provably succeed under an assumption on the number of fractional vertices above the optimal integral solutions. They were able to show that they can remove a polynomial number of fractional solutions in polynomial time. However, their approach required the special structure of LDPC codes, and it was left open whether similar results can hold for more general classes of graphical models.
In Chapter 3 we show that for general graphical models we can remove a polynomial number of fractional vertices in polynomial time. Our result actually shows that it is possible to do this on arbitrary polytopes contained in the unit cube. We further extend this result to recovering the M-best integral solutions under the assumption that there is a polynomial number of fractional vertices.
Problem 3: Cost-aware experiment design for learning causal graphs.
In machine learning, we often want to learn causal relationships between objects. However, if we only have observational data then it is not always possible to learn causal relationships. In many situations, the only way to learn the direction of causality is to perform randomized experiments using interventions.
In Pearl’s causal model [Pearl, 2009], causality is represented by a causal DAG. Variables are said to be caused by their parents in the graph (and indirectly caused by all ancestors).
We are interested in learning all causal directions. Since causal DAGs imply a Bayesian network, the causal directions that can be identified from data are those edges that are essential to all Bayesian networks describing the conditional independencies of the joint distribution. From data, we can also identify where every edge exists. We thus need to perform interventions to identify the causal directions of the undirected component of the essential graph.
Experiment design is a well-studied area, and it is known that for the remaining edges to be identifiable, the intervention sets I_1, ..., I_m must be such that every undirected edge is in the cut set of some I_i. The cut set of a set of nodes I_i is the set of edges with exactly one endpoint in I_i. Thus we can describe a valid intervention design in a purely combinatorial way.
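This combinatorial validity condition is straightforward to check. A minimal sketch (the triangle example below is hypothetical, not from the dissertation):

```python
def is_valid_design(edges, interventions):
    """Check that every undirected edge is in the cut set of some
    intervention set I, i.e., exactly one endpoint lies in I."""
    return all(
        any((u in I) != (v in I) for I in interventions)
        for (u, v) in edges
    )

# Hypothetical example: a triangle on vertices {0, 1, 2}.
edges = [(0, 1), (1, 2), (0, 2)]
print(is_valid_design(edges, [{0}]))        # leaves edge (1, 2) uncut
print(is_valid_design(edges, [{0}, {1}]))   # every edge is cut
```

Note that intervening on all nodes at once cuts nothing, since both endpoints of every edge land inside the intervention set; the difficulty of the design problem comes from choosing which endpoints to pay for.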
The minimum cost intervention design problem, first considered by Kocaoglu et al. [2017], is an intervention design problem where there is a cost to intervene on a given node, and the cost of an intervention is the sum of the individual node costs. The problem is to find a valid intervention design with minimum total cost.

Kocaoglu et al. [2017] developed optimal solutions for some special classes of graphs; however, it was still open (1) whether the problem is NP-hard for general causal graphs, and if so, (2) whether there are efficient approximation algorithms for it. In Chapter 4, we show that the problem is indeed NP-hard and that there is an efficient (2 + ε)-approximation when the number of allowed interventions is slightly more than the minimum required.
Problem 4: Robust Learning of Ising Models
In the previous problem we saw that Ising models are a class of Markov random fields with pairwise conditional dependencies in which every variable takes a binary value. While in the previous problem we considered these values to be in {0, 1}, for this problem it is convenient to have them take values in {−1, 1}.
An important question is learning Ising models given samples from the joint distribution. Specifically, we want to recover the graph structure after observing a small number of samples from the joint distribution.
It is known that we need to assume a bound α on the minimum edge weight: if an edge weight is arbitrarily small, then we would not be able to detect it. Additionally, sample complexity bounds by Wainwright et al. [2003] establish that we also need an upper bound on the total strength of the edge weights. We define the width λ of the model to be
\[
\lambda = \max_{i \in V} \Bigg( |\theta_i| + \sum_{j \in V} |W_{ij}| \Bigg).
\]
For fixed α and λ, sample complexity bounds by Wainwright et al. [2003] establish that we need O(log n) samples to learn the Ising model.
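The width is a simple function of the model parameters; as a small illustration (the numbers below are made up), it can be computed directly:

```python
import numpy as np

def width(theta, W):
    """Width lambda = max_i ( |theta_i| + sum_j |W_ij| ) of an Ising
    model with node weights theta and symmetric edge-weight matrix W."""
    return np.max(np.abs(theta) + np.abs(W).sum(axis=1))

theta = np.array([0.2, -1.0, 0.0])
W = np.array([[ 0.0, 2.0, -1.0],
              [ 2.0, 0.0,  0.5],
              [-1.0, 0.5,  0.0]])
print(width(theta, W))  # node 1 attains the maximum: 1.0 + 2.0 + 0.5 = 3.5
```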
Bresler [2015] was the first to establish an algorithm with optimal sample complexity and polynomial runtime for the problem of learning Ising models. However, the dependence on λ was quite large. This was improved by Vuffray et al. [2016] and later by Klivans and Meka [2017]. Specifically, the algorithm by Klivans and Meka [2017] was shown to have nearly optimal sample complexity and runtime.
In this work we consider the problem of robustly learning Ising models. This means that we need to recover the true Ising model when an adversary can corrupt a fraction of samples that we receive. We will call the fraction of corrupted samples η, and the problem is to find the largest value of η such that we can still efficiently recover the underlying dependency graph.
In Chapter 5, we establish a lower bound on η based on the parameters α and λ. In particular, we establish that if η = α · exp(−O(λ)), then no algorithm is able to recover the underlying Ising model. We then show that the Sparsitron algorithm of Klivans and Meka [2017] can efficiently recover the underlying graph for η = α² · exp(−O(λ)), showing that our lower bound is essentially tight.
Problem 5: Uncertainty Aware Compressive Sensing
Compressive sensing is the problem of recovering a signal x ∈ R^d given a small number of measurements y = Ax, for a known measurement matrix A ∈ R^{m×d}. If the number of measurements m < d, then arbitrary signals x cannot be recovered. However, if we have prior information on the structure of x, then recovery may be possible.
There has been extensive study on signal recovery under the assumption that the signal x is sparse and we now have high quality algorithms for this setting [Tibshirani, 1996, Candes et al., 2006, Donoho et al., 2006, Bickel et al., 2009].
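As a classical baseline illustration (not part of this dissertation's contribution), the Lasso objective from this line of work can be minimized with ISTA, a simple proximal gradient method; the sizes, support, and random seed below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, k = 100, 40, 3                  # hypothetical ambient dim, measurements, sparsity
A = rng.standard_normal((m, d)) / np.sqrt(m)
x_true = np.zeros(d)
support = [5, 37, 80]
x_true[support] = [1.0, -2.0, 1.5]
y = A @ x_true                        # noiseless measurements

# ISTA: proximal gradient descent on 0.5*||Ax - y||^2 + lam*||x||_1
lam = 0.01
L = np.linalg.norm(A, 2) ** 2         # Lipschitz constant of the gradient
x = np.zeros(d)
for _ in range(2000):
    g = A.T @ (A @ x - y)             # gradient of the quadratic term
    z = x - g / L
    x = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft threshold

print(np.flatnonzero(np.abs(x) > 0.1))
```

With m ≫ k log(d/k) random Gaussian measurements, this kind of ℓ1 minimization typically recovers the support of the sparse signal, which is the regime the cited guarantees address.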
There has been recent interest in compressive sensing that utilizes powerful generative models. Starting with the work of Bora et al. [2017], there has been extensive study on the problem of signal recovery under the assumption that the signal is in the range of a deep generative model such as a GAN [Goodfellow et al., 2014] or a VAE [Kingma and Welling, 2013].
One issue with prior work is that, while it can recover signals that match the measurements, it cannot identify which aspects of the signal are common to all signals that fit the measurements and which aspects can vary while still fitting the measurements.
Because of this, there has been recent attention on uncertainty aware compressive sensing [Tonolini et al., 2019, Zhang and Jin, 2019]. Here we want to recover the conditional distribution on x given the measurements y.
In this work we consider uncertainty-aware compressive sensing when the prior is given to us as an invertible generative model, a special type of generative model that allows us to evaluate the density function as well as sample from it.
In Chapter 6 we show that the conditional sampling problem is hard in general. Because of this we consider approximations to the problem. We develop an approach that allows us to directly utilize the existing model to improve sample quality and compares favorably to existing approaches. Our approach utilizes tools from variational inference.
Chapter 2
Leveraging Sparsity for Efficient Submodular Data Summarization
2.1 Introduction
In this chapter we study the facility location problem: we are given sets V of size n and I of size m, and a benefit matrix of nonnegative numbers C ∈ R^{I×V}, where C_{iv} describes the benefit that element i receives from element v. Our goal is to select a small set A of k columns in this matrix. Once we have chosen A, element i will get a benefit equal to the best choice out of the available columns, max_{v∈A} C_{iv}. The total reward is the sum of the row rewards, so the optimal choice of columns is the solution of:
\[
\arg\max_{\{A \subseteq V : |A| \le k\}} \; \sum_{i \in I} \max_{v \in A} C_{iv}. \tag{2.1}
\]
A natural application of this problem is in finding a small set of repre- sentative images in a big dataset, where Civ represents the similarity between images i and v. The problem is to select k images that provide a good coverage of the full dataset, since each one has a close representative in the chosen set.
Throughout this chapter we follow the nomenclature common to the submodular optimization for machine learning literature. This problem is also
10 known as the maximization version of the k-medians problem. A number of recent works have used this problem for selecting subsets of documents or images from a larger corpus [Lin and Bilmes, 2012, Tschiatschek et al., 2014], to identify locations to monitor in order to quickly identify important events in sensor or blog networks [Krause et al., 2008, Leskovec et al., 2007], as well as clustering applications [Krause and Gomes, 2010, Mirzasoleiman et al., 2013].
We can naturally interpret Problem (2.1) as maximization of a set function F(A) that takes as input the selected set of columns and returns the total reward of that set. Formally, let F(∅) = 0, and for all other sets A ⊆ V define
\[
F(A) = \sum_{i \in I} \max_{v \in A} C_{iv}. \tag{2.2}
\]
The set function F is submodular, since for all j ∈ V and sets A ⊆ B ⊆ V \ {j}, we have F(A ∪ {j}) − F(A) ≥ F(B ∪ {j}) − F(B); that is, the gain of an element diminishes as we add elements. Since the entries of C are nonnegative, F is monotone: for all A ⊆ B ⊆ V, we have F(A) ≤ F(B). F is also normalized, since F(∅) = 0.
The facility location problem is NP-hard, so we consider approximation algorithms. As with every monotone normalized submodular function, the greedy algorithm guarantees a (1 − 1/e)-factor approximation to the optimal solution [Nemhauser et al., 1978]. The greedy algorithm starts with the empty set, then for k iterations adds the element with the largest gain in reward. This approximation factor is the best possible: the maximum coverage problem is an instance of the facility location problem, and it is NP-hard to optimize within a factor of 1 − 1/e + ε for any ε > 0 [Feige, 1998].
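The greedy algorithm described above can be sketched in a few lines; this is the naive O(nmk) version, without the lazy evaluations or sparsification discussed later, and the toy matrix is hypothetical.

```python
import numpy as np

def greedy_facility_location(C, k):
    """Naive greedy: k passes over all n columns, O(nmk) time.

    best[i] caches the best benefit element i currently receives, so the
    gain of adding column v is sum_i max(C[i, v] - best[i], 0).
    """
    m, n = C.shape
    best = np.zeros(m)            # current benefit each i in I receives
    chosen = []
    for _ in range(k):
        gains = np.maximum(C - best[:, None], 0.0).sum(axis=0)
        v = int(np.argmax(gains))
        chosen.append(v)
        best = np.maximum(best, C[:, v])
    return chosen, best.sum()

# Toy instance: 3 elements of I, 4 candidate columns, select k = 2.
C = np.array([[3.0, 1.0, 0.0, 2.0],
              [0.0, 2.0, 1.0, 0.0],
              [1.0, 0.0, 4.0, 1.0]])
chosen, value = greedy_facility_location(C, k=2)
```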
The problem is that the greedy algorithm has super-quadratic running time Θ(nmk), and in many datasets n and m can be in the millions. For this reason, several recent papers have focused on accelerating the greedy algorithm. In [Leskovec et al., 2007], the authors point out that if the benefit matrix is sparse, the computation time can be dramatically reduced. Unfortunately, in many problems of interest, data similarities or rewards are not sparse. Wei et al. [2014] proposed to first sparsify the benefit matrix and then run the greedy algorithm on this new sparse matrix. In particular, they consider t-nearest-neighbor sparsification, i.e., keeping the t largest entries in each row and zeroing out the rest. Using this technique they demonstrated an impressive 80-fold speedup over the greedy algorithm with little loss in solution quality. One limitation of their theoretical analysis is the restricted setting under which provable approximation guarantees were established.
Our Contributions: Inspired by the work of Wei et al. [2014], we improve the theoretical analysis of the approximation error induced by sparsification. Specifically, the previous analysis assumes that the input comes from a probability distribution where the preferences of each element i ∈ I are independently chosen uniformly at random. For this distribution, when k = Ω(n), they establish that the sparsity can be taken to be O(log n) and that running the greedy algorithm on the sparsified problem guarantees a constant factor approximation with high probability. We improve the analysis in the following ways:
• We prove guarantees for all values of k and our guarantees do not require any assumptions on the input besides nonnegativity of the benefit matrix.
• In the case where k = Ω(n), we show that it is possible to take the sparsity of each row as low as O(1) while guaranteeing a constant factor approximation.
• Unlike previous work, our analysis does not require the use of any particular algorithm and can be integrated into many algorithms for solving facility location problems.
• We establish a lower bound which shows that our approximation guaran- tees are tight up to log factors, for all desired approximation factors.
In addition to the above results, we propose a novel algorithm that uses threshold-based sparsification, where we keep the matrix elements that are above a fixed threshold value. This type of sparsification is easier to implement efficiently using nearest neighbor methods. For this method of sparsification, we obtain worst-case guarantees and a lower bound that matches up to constant factors. We also obtain a data-dependent guarantee which helps explain why our algorithm empirically performs better than the worst case.
Further, we propose the use of Locality Sensitive Hashing (LSH) and random walk methods to accelerate approximate nearest neighbor computations. Specifically, we use two types of similarity metrics: inner products and personalized PageRank (PPR). We propose the use of fast approximations for these metrics and empirically show that they dramatically improve running times. LSH functions are well known, but to the best of our knowledge this is the first time they have been used to accelerate facility location problems. Furthermore, we utilize personalized PageRank as the similarity between vertices on a graph. Random walks can quickly approximate this similarity, and we demonstrate that it yields highly interpretable results on real datasets.
2.2 Related Work
The use of a sparsified proxy function was shown by Wei et al. [2015] to also be useful for finding a subset for training nearest neighbor classifiers. Further, they show a connection between nearest neighbor classifiers and the facility location function. The facility location function was also used by Mirzasoleiman et al. [2016] as part of a summarization objective, where they present a summarization algorithm able to handle a variety of constraints.
The stochastic greedy algorithm was shown to achieve a 1 − 1/e − ε approximation with runtime O(nm log(1/ε)), which has no dependence on k [Mirzasoleiman et al., 2015]. In each iteration it chooses a sample set from V of size (n/k) log(1/ε) and adds to the current set the element of the sample set with the largest gain.
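The sampling step described above can be sketched as follows; the sample size ⌈(n/k) log(1/ε)⌉ follows the description in the text, and the toy matrix is hypothetical.

```python
import math
import random
import numpy as np

def stochastic_greedy(C, k, eps=0.1, rng=None):
    """Stochastic greedy: each iteration scores only a random sample of
    the ground set of size about (n/k) log(1/eps)."""
    rng = rng or random.Random(0)
    m, n = C.shape
    s = min(n, math.ceil(n / k * math.log(1 / eps)))
    best = np.zeros(m)            # current benefit each i in I receives
    chosen = []
    for _ in range(k):
        sample = rng.sample(range(n), s)
        # Pick the sampled element with the largest marginal gain.
        gains = [(np.maximum(C[:, v] - best, 0.0).sum(), v) for v in sample]
        _, v = max(gains)
        chosen.append(v)
        best = np.maximum(best, C[:, v])
    return chosen, best.sum()

C = np.array([[3.0, 1.0, 0.0, 2.0],
              [0.0, 2.0, 1.0, 0.0],
              [1.0, 0.0, 4.0, 1.0]])
# With eps this small the sample covers the whole ground set,
# so the result coincides with plain greedy on this tiny instance.
chosen, value = stochastic_greedy(C, k=2, eps=0.01)
```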
There are also several related algorithms for the streaming setting [Badanidiyuru et al., 2014] and the distributed setting [Barbosa et al., 2015, Kumar et al., 2015, Mirrokni and Zadimoghaddam, 2015, Mirzasoleiman et al., 2013]. Since the objective function is defined over the entire dataset, optimizing the facility location function becomes more complicated in these memory-limited settings. Often the function is estimated by considering a randomly chosen subset of the set I.
2.2.1 Benefit Functions and Nearest Neighbor Methods
For many problems, the elements of V and I are vectors in some feature space, and the benefit matrix is defined by a similarity function sim. For example, in R^d we may use the RBF kernel sim(x, y) = e^{−γ‖x−y‖²}, the dot product sim(x, y) = xᵀy, or the cosine similarity sim(x, y) = xᵀy / (‖x‖‖y‖).
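These three similarity functions can be written down directly; a minimal sketch:

```python
import numpy as np

def rbf_sim(x, y, gamma=1.0):
    """RBF kernel: exp(-gamma * ||x - y||^2)."""
    return float(np.exp(-gamma * np.sum((x - y) ** 2)))

def dot_sim(x, y):
    """Dot product similarity: x^T y (unnormalized)."""
    return float(x @ y)

def cosine_sim(x, y):
    """Cosine similarity: x^T y / (||x|| ||y||)."""
    return float(x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))
```

Note that the dot product, unlike the other two, is sensitive to vector magnitudes; this distinction matters in the MovieLens experiment later in the chapter.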
There have been decades of research on nearest neighbor search in geometric spaces. If the vectors are low dimensional, classical techniques such as k-d trees [Bentley, 1975] work well and are exact. However, it has been observed that as the dimension grows, all known exact methods do little better than a linear scan over the dataset.
As a compromise, researchers have started to work on approximate nearest neighbor methods, one of the most successful approaches being locality sensitive hashing [Gionis et al., 1999, Indyk and Motwani, 1998]. LSH uses a hash function that hashes together items that are close. Locality sensitive hash functions exist for a variety of metrics and similarities, such as Euclidean distance [Datar et al., 2004], cosine similarity [Andoni et al., 2015, Charikar, 2002], and dot product [Neyshabur and Srebro, 2015, Shrivastava and Li, 2014]. Nearest neighbor methods other than LSH that have been shown to work for machine learning problems include [Beygelzimer et al., 2006, Chen et al., 2009]. Additionally, see [Garcia et al., 2010] for efficient exact GPU methods.
An alternative to vector functions is to use similarities and benefits defined from graph structures. For instance, we can use the personalized PageRank of vertices in a graph to define the benefit matrix [Page et al., 1999]. Personalized PageRank is similar to the classic PageRank, except that the random jumps, rather than going anywhere in the graph, return to the user's "home" vertex. This can be used as a measure of "reputation" or "influence" between vertices in a graph [Gupta et al., 2013].
There are a variety of algorithms for finding the vertices with a large PageRank personalized to some vertex. One popular approach is the random walk method. If π_i is the PageRank vector personalized to some vertex i, then π_i(v) equals the probability that a random walk of geometric length starting from i ends at vertex v (where the parameter of the geometric distribution is the probability of jumping back to i) [Avrachenkov et al., 2007]. Using this approach, we can quickly estimate all elements of the benefit matrix greater than some value τ.
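The random walk estimator above can be sketched as follows; the adjacency-list graph is a hypothetical toy example, and walk lengths are geometric with parameter equal to the jump-home probability, matching the description in the text.

```python
import random
from collections import Counter

def ppr_monte_carlo(adj, home, jump_prob=0.15, num_walks=4000, rng=None):
    """Estimate the PPR vector of `home` by simulating random walks.

    pi_home(v) is approximated by the fraction of geometric-length walks
    starting at `home` that end at v.
    """
    rng = rng or random.Random(0)
    ends = Counter()
    for _ in range(num_walks):
        v = home
        # Walk until the geometric "jump home" event fires, then record
        # the endpoint of the walk.
        while rng.random() >= jump_prob:
            neighbors = adj[v]
            if not neighbors:
                break
            v = rng.choice(neighbors)
        ends[v] += 1
    return {v: c / num_walks for v, c in ends.items()}

# Toy star graph: vertex 0 is connected to vertices 1 and 2.
adj = {0: [1, 2], 1: [0], 2: [0]}
ppr = ppr_monte_carlo(adj, home=0)
```

Only the largest entries of π_i need to be accurate for sparsification, which is why a modest number of walks per vertex suffices.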
2.3 Guarantees for t-Nearest Neighbor Sparsification
We associate a bipartite support graph G = (V, I, E) by placing an edge between v ∈ V and i ∈ I whenever C_{iv} > 0. If the support graph is sparse, we can use it to calculate the gain of an element much more efficiently, since we only need to consider the neighbors of the element rather than the entire set I. If the average degree of a vertex i ∈ I is t (and we cache the current best value for each element i), then we can execute greedy in time O(mtk). See Algorithm 1 in the Appendix for pseudocode. If the sparsity t is much smaller than the size of V, the runtime is greatly improved.
However, the instance we wish to optimize may not be sparse. One idea is to sparsify the original matrix by keeping only the values in the benefit matrix C corresponding to t-nearest neighbors, as considered by Wei et al. [2014]. That is, for every element i in I, we keep only the top t elements of C_{i1}, C_{i2}, ..., C_{in} and set the rest to zero. This yields a matrix with mt nonzero elements. We then want the solution obtained by optimizing the sparse problem to be close in value to the optimal solution of the original objective function F.
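The t-nearest-neighbor sparsification step can be sketched with a partial sort per row; this uses numpy's argpartition, and the toy matrix is hypothetical.

```python
import numpy as np

def sparsify_top_t(C, t):
    """Keep the t largest entries in each row of C and zero out the rest."""
    m, n = C.shape
    if t >= n:
        return C.copy()
    S = np.zeros_like(C)
    # argpartition places the indices of the t largest entries of each
    # row in the last t positions (in arbitrary order).
    top = np.argpartition(C, n - t, axis=1)[:, n - t:]
    rows = np.arange(m)[:, None]
    S[rows, top] = C[rows, top]
    return S

C = np.array([[3.0, 1.0, 0.0, 2.0],
              [0.0, 2.0, 1.0, 0.0]])
S = sparsify_top_t(C, t=2)   # each row retains its 2 largest benefits
```

An argpartition per row costs O(n) rather than the O(n log n) of a full sort, which matters when this is done for all m rows.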
Our main theorem is that we can set the sparsity parameter t to be O((n/αk) log(m/αk)), a significant improvement for large enough k, while still guaranteeing that the optimal solution of the sparsified problem has value at least a 1/(1+α) fraction of the value of the optimal solution.
Theorem 1. Let O_t be the optimal solution to an instance of the facility location problem whose benefit matrix was sparsified with t-nearest neighbors. For any t ≥ t*(α) = O((n/αk) log(m/αk)), we have F(O_t) ≥ OPT/(1+α).
Proof Sketch. For the chosen value of t, there exists a set Γ of size αk such that every element of I has a neighbor in Γ in the t-nearest-neighbor graph; this is proven using the probabilistic method. By appending Γ to the optimal solution and using the monotonicity of F, we can move to the sparsified function: no element of I would prefer an element that was zeroed out in the sparsified matrix, since one of its top t most beneficial elements is present in the set Γ. The optimal solution appended with Γ is a set of size (1+α)k. We then bound the amount by which the optimal value of a submodular function can increase when adding αk elements. See the appendix for the complete proof.
Note that Theorem 1 is agnostic to the algorithm used to optimize the sparsified function, so if we use a ρ-approximation algorithm, we are at most a factor of ρ/(1+α) from the optimal solution. Later in this section we utilize this to design a subquadratic algorithm for optimizing facility location problems, provided that we can quickly compute t-nearest neighbors and that k is large enough.
If m = O(n) and k = Ω(n), we can achieve a constant factor approxima- tion even when taking the sparsity parameter as low as t = O(1), which means that the benefit matrix C has only O(n) nonzero entries. Also note that the only assumption we need is that the benefits between elements are nonnegative. When k = Ω(n), previous work was only able to take t = O(log n) and required the benefit matrix to come from a probability distribution [Wei et al., 2014].
Our guarantee has two regimes depending on the value of α. If we want the optimal solution of the sparsified function to be within a 1 − ε factor of the optimal solution of the original function, then t*(ε) = O((n/εk) log(m/εk)) suffices. Conversely, if we want to take the sparsity t much smaller than (n/k) log(m/k), this corresponds to taking α very large, and we still retain some guarantee of optimality.
In the proof of Theorem 1, the only place we use the value of t is to show that there exists a small set Γ that covers the entire set I in the t-nearest-neighbor graph. Real datasets often contain a covering set of size αk for t much smaller than O((n/αk) log(m/αk)). This observation yields the following corollary.
Corollary 2. If after sparsifying a problem instance there exists a covering set of size αk in the t-nearest-neighbor graph, then the optimal solution O_t of the sparsified problem satisfies F(O_t) ≥ OPT/(1+α).
In the datasets we consider in our experiments, of roughly 7000 items, we have covering sets with only 25 elements for t = 75, and a covering set of size 10 for t = 150. The size of the covering set was upper bounded using the greedy set cover algorithm. In Figure 2.2 in the appendix, we see how the size of the covering set changes with the number of neighbors t.
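The greedy set-cover heuristic used to produce these upper bounds can be sketched as follows; the tiny graph in the demonstration is hypothetical.

```python
def greedy_set_cover(universe, neighbors):
    """Greedy cover: repeatedly pick the vertex covering the most
    still-uncovered elements.

    `neighbors[v]` is the set of elements of I adjacent to v in the
    t-nearest-neighbor graph; the returned list's length upper bounds
    the size of the smallest covering set.
    """
    uncovered = set(universe)
    cover = []
    while uncovered:
        v = max(neighbors, key=lambda u: len(neighbors[u] & uncovered))
        if not neighbors[v] & uncovered:
            break   # remaining elements cannot be covered at all
        cover.append(v)
        uncovered -= neighbors[v]
    return cover

# Toy instance: 'a' covers {1, 2, 3}, 'b' covers {3, 4}, 'c' covers {4}.
cover = greedy_set_cover({1, 2, 3, 4},
                         {'a': {1, 2, 3}, 'b': {3, 4}, 'c': {4}})
```

The classical ln(m) + 1 approximation guarantee of this heuristic is what makes it a trustworthy upper bound on the covering-set sizes reported above.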
It would be desirable to take the sparsity parameter t lower than the value dictated by t*(α). As demonstrated by the following lower bound, it is not possible to take the sparsity significantly lower than (1/α)(n/k) and still have a 1/(1+α) approximation in the worst case.
Proposition 3. Suppose we take

t = max{1/(2α), 1/(1+α)} · (n−1)/k.

There exists a family of inputs such that F(O_t) ≤ OPT/(1 + α − 2/k).
The example we create to show this has the property that in the t- nearest neighbor graph, the set Γ needs αk elements to cover every element of I. We plant a much smaller covering set that is very close in value to Γ but is hidden after sparsification. We then embed a modular function within the facility location objective. With knowledge of the small covering set, an optimal solver can take advantage of this modular function, while the sparsified solution would prefer to first choose the set Γ before considering the modular function. See the appendix for full details.
Sparsification integrates well with the stochastic greedy algorithm [Mirzasoleiman et al., 2015]. By taking t ≥ t*(ε/2) and running stochastic greedy with sample sets of size (n/k) ln(2/ε), we get a 1 − 1/e − ε approximation in expectation that runs in expected time O((nm/εk) log(1/ε) log(m/εk)). If we can quickly sparsify the problem and k is large enough, for example k = Ω(n^{1/3}), this is subquadratic. The following proposition gives a high probability guarantee on the runtime of this algorithm and is proven in the appendix.
Proposition 4. When m = O(n), the stochastic greedy algorithm [Mirzasoleiman et al., 2015] with sample sets of size (n/k) log(2/ε), combined with sparsification with sparsity parameter t, terminates in time O(n log(1/ε) max{t, log n}) with high probability. When t ≥ t*(ε/2) = O((n/εk) log(m/εk)), this algorithm has a 1 − 1/e − ε approximation guarantee in expectation.
2.4 Guarantees for Threshold-Based Sparsification
Rather than t-nearest-neighbor sparsification, we now consider τ-threshold sparsification, where we zero out all entries whose value is below a threshold τ. Recall the definition of a locality sensitive hash.
Definition 5. H is a (τ, Kτ, p, q)-locality sensitive hash family if for x, y satisfying sim(x, y) ≥ τ we have P_{h∈H}(h(x) = h(y)) ≥ p, and for x, y satisfying sim(x, y) ≤ Kτ we have P_{h∈H}(h(x) = h(y)) ≤ q.
We see that τ-threshold sparsification is a better model than t-nearest neighbors for LSH: for K = 1 it is exactly a noisy τ-sparsification, and for non-adversarial datasets it is a reasonable approximation of a τ-sparsification method. Note that, due to the approximation constant K, we do not have an a priori guarantee on the runtime for arbitrary datasets. However, we would expect in practice to see only a few elements with benefit near the threshold τ. See [Andoni, 2012] for a discussion.
One issue is that we do not know how to choose the threshold τ. We can sample elements of the benefit matrix C to estimate how sparse the threshold graph will be for a given threshold τ. Assuming the values of C are in general position¹, by using the Dvoretzky-Kiefer-Wolfowitz-Massart inequality [Dvoretzky et al., 1956, Massart, 1990] we can bound the number of samples needed to choose a threshold that achieves a desired sparsification level.

¹By this we mean that the values of C are all unique, or at least that only a few elements take any particular value. We need this to hold since otherwise a threshold-based sparsification may only be able to return the empty graph or the complete graph.
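The sampling idea can be sketched as follows: draw entries of C, sort them, and read off the empirical quantile corresponding to the desired edge density (the DKW inequality bounds how many samples make this quantile estimate accurate). The sampler interface here is a hypothetical stand-in for drawing random entries of C.

```python
import random

def choose_threshold(entry_sampler, target_fraction, num_samples=10000, rng=None):
    """Pick tau as the empirical (1 - target_fraction) quantile of sampled
    benefit values, so that roughly a target_fraction of all entries of C
    exceed tau after thresholding."""
    rng = rng or random.Random(0)
    samples = sorted(entry_sampler(rng) for _ in range(num_samples))
    idx = min(num_samples - 1, int((1 - target_fraction) * num_samples))
    return samples[idx]

# Demonstration with a synthetic sampler: entries uniform on [0, 1],
# so the threshold keeping 10% of entries should be near 0.9.
tau = choose_threshold(lambda r: r.random(), target_fraction=0.1)
```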
21 We establish the following data-dependent bound on the difference in the optimal solutions of the τ-threshold sparsified function and the original function. We denote the set of vertices adjacent to S in the τ-threshold graph with N(S).
Theorem 6. Let O_τ be the optimal solution to an instance of the facility location problem whose benefit matrix was sparsified using a τ-threshold. Assume there exists a set S of size k whose neighborhood in the τ-threshold graph satisfies |N(S)| ≥ µn. Then we have

F(O_τ) ≥ (1 + 1/µ)⁻¹ OPT.
For the datasets we consider in our experiments, we see that we can keep just a 0.001–0.01 fraction of the elements of C while still having a small set S with a neighborhood satisfying |N(S)| ≥ 0.3n. In Figure 2.3 in the appendix, we plot the relationship between the number of edges in the τ-threshold graph and the number of elements coverable by a set of small size, as estimated by the greedy algorithm for max-cover.
Additionally, we have a worst-case dependency between the number of edges in the τ-threshold graph and the approximation factor. The guarantees follow from applying Theorem 6 with the following lemma.
Lemma 7. For k ≥ (c/(1 − 2c²))(1/δ), any graph with (1/2)δ²n² edges has a set S of size k such that the neighborhood N(S) satisfies

|N(S)| ≥ cδn.
22 To get a matching lower bound, consider the case where the graph has two disjoint cliques, one of size δn and one of size (1 − δ)n. Details are in the appendix.
2.5 Experiments

2.5.1 Summarizing Movies and Music from Ratings Data
We consider the problem of summarizing a large collection of movies. We first need to create a feature vector for each movie. Movies can be categorized by the people who like them, so we create our feature vectors from the MovieLens ratings data [GroupLens, 2015]. The MovieLens database has 20 million ratings for 27,000 movies from 138,000 users. We perform low-rank matrix completion and factorization on the ratings matrix [Jain et al., 2013, Koren et al., 2009] to get a matrix X = UVᵀ, where X is the completed ratings matrix, U is a matrix of feature vectors for the users, and V is a matrix of feature vectors for the movies. For movies i and j with vectors v_i and v_j, we set the benefit function C_{ij} = v_iᵀv_j. We do not use the normalized dot product (cosine similarity) because we want our summary movies to be movies that were highly rated, and not normalizing makes highly rated movies increase the objective function more.
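With the factor matrix V in hand, the benefit matrix is just a Gram matrix. A sketch with a random stand-in for the movie factors (dot products can be negative, and the theory assumes nonnegative benefits, so the sketch clips at zero; the clipping is our assumption, not a step stated in the text):

```python
import numpy as np

rng = np.random.default_rng(0)
V = rng.normal(size=(5, 3))     # toy stand-in: 5 movies, rank-3 factors
C = np.maximum(V @ V.T, 0.0)    # C[i, j] = max(v_i . v_j, 0), nonnegative
```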
We complete the ratings matrix using the MLlib library in Apache Spark [Meng et al., 2016] after removing all but the seven thousand most rated movies to reduce noise in the data. We use locality sensitive hashing to perform sparsification; in particular, we use the LSH in the FALCONN library for cosine similarity [Andoni et al., 2015] and the reduction from a cosine similarity hash to a dot product hash [Neyshabur and Srebro, 2015]. As baselines we consider sparsification using an exact scan over the entire dataset, the stochastic greedy algorithm with lazy evaluations [Mirzasoleiman et al., 2015], and the greedy algorithm with lazy evaluations [Minoux, 1978]. The number of elements chosen was set to 40, and for the LSH method and stochastic greedy we average over five trials.
We then do a scan over the sparsity parameter t for the sparsification methods and a scan over the number of samples drawn each iteration for the stochastic greedy algorithm. The sparsified methods use the (non-stochastic) lazy greedy algorithm as the base optimization algorithm, which we found worked best for this particular problem². In Figure 2.1(a) we see that the LSH method very quickly approaches the greedy solution: it is almost identical in value after just a few seconds, even though the value of t is much less than t*(ε). The stochastic greedy method requires much more time to reach the same function value. Lazy greedy is not plotted, since it took over 500 seconds to finish.
A performance metric that can be better than the objective value is the fraction of elements returned that are common with the greedy algorithm. We treat this as a proxy for the interpretability of the results. We believe this metric is reasonable since we found the subset returned by the greedy
²When experimenting on very large datasets, we found that runtime constraints can make it necessary to use stochastic greedy as the base optimization algorithm.
[Figure 2.1: (a) fraction of greedy set value vs. runtime; (b) fraction of greedy set contained vs. runtime. Methods: exact top-t, LSH top-t, stochastic greedy.]
Figure 2.1: Results for the MovieLens dataset [GroupLens, 2015]. Figure (a) shows the function value as the runtime increases, normalized by the value obtained by the greedy algorithm. Our algorithm is within 99.9% of greedy in less than 5 seconds. For this experiment, the greedy algorithm had a runtime of 512 seconds, so this is a 100x speedup for a small penalty in performance. We also compare to the stochastic greedy algorithm [Mirzasoleiman et al., 2015], which needs 125 seconds to get equivalent performance and is thus 25x slower. Figure (b) shows the fraction of the set returned by each method that was common with the set returned by greedy. The approximate nearest neighbor method has 90% of its elements in common with the greedy set while being 50x faster than greedy, and exact nearest neighbors can perfectly match the greedy set while being 4x faster than greedy.

algorithm to be quite interpretable. We plot this metric against runtime in Figure 2.1(b). We see that the LSH method quickly gets to 90% of the elements in the greedy set, while stochastic greedy takes much longer to reach just 70% of the elements. The exact sparsification method is able to completely match the greedy solution at this point. One interesting feature is that the LSH method does not go much higher than 90%. This may be due to increased inaccuracy when looking at elements with smaller dot products. We plot this metric against the number of exact and approximate nearest neighbors t in
Table 2.1: A subset of the summarization output by our algorithm on the MovieLens dataset; below each representative are the elements it represents with the largest dot product. Each group has a natural interpretation: 90's slapstick comedies, 80's horror, cult classics, etc. Note that this was obtained using only a similarity matrix derived from ratings.

Happy Gilmore | Nightmare on Elm Street | Star Wars IV | Shawshank Redemption
Tommy Boy | Friday the 13th | Star Wars V | Schindler's List
Billy Madison | Halloween II | Raiders of the Lost Ark | The Usual Suspects
Dumb & Dumber | Nightmare on Elm Street 3 | Star Wars VI | Life Is Beautiful
Ace Ventura: Pet Detective | Child's Play | Indiana Jones, Last Crusade | Saving Private Ryan
Road Trip | Return of the Living Dead II | Terminator 2 | American History X
American Pie 2 | Friday the 13th 2 | The Terminator | The Dark Knight
Black Sheep | Puppet Master | Star Trek II | Good Will Hunting

Pulp Fiction | The Notebook | Pride and Prejudice | The Godfather
Reservoir Dogs | P.S. I Love You | Anne of Green Gables | The Godfather II
American Beauty | The Holiday | Persuasion | One Flew Over the Cuckoo's Nest
A Clockwork Orange | Remember Me | Emma | Goodfellas
Trainspotting | A Walk to Remember | Mostly Martha | Apocalypse Now
Memento | The Proposal | Desk Set | Chinatown
Old Boy | The Vow | The Young Victoria | 12 Angry Men
No Country for Old Men | Life as We Know It | Mansfield Park | Taxi Driver
Figure 2.4 in the appendix.
We include a subset of the summarization in Table 2.1 to show the interpretability of our results; for each representative we list a few of the elements it represents with the largest dot product.
2.5.2 Finding Influential Actors and Actresses
For our second experiment, we consider how to find a diverse subset of actors and actresses in a collaboration network. We place an edge between two actors or actresses if they collaborated in a movie together, weighted by the number of collaborations. Data was obtained from [IMDb, 2016], and an actor or actress was included only if he or she was among the top six in the cast billing. As a measure of influence, we use personalized PageRank [Page et al., 1999]. To quickly find the people with the largest influence relative to someone, we used the random walk method [Avrachenkov et al., 2007].
We first consider a small instance where we can evaluate how well the sparsified approach works. We build a graph from the casts of the top thousand most rated movies. This graph has roughly 6000 vertices and 60,000 edges. We then calculate the entire PPR matrix using the power method. Note that this is infeasible on larger graphs in terms of time and memory; even on this moderately sized graph it took six hours and 2 GB of space. We then compare the value of the greedy algorithm using the entire PPR matrix with the sparsified algorithm using the matrix approximated by Monte Carlo sampling, using the two metrics mentioned in the previous section. We omit exact nearest neighbors and stochastic greedy because it is not clear how they would work without computing the entire PPR matrix. Instead we compare to an approach where we choose a sample from I and calculate the PPR only on these elements using the power method. As mentioned in Section 2.2, several algorithms utilize random sampling from I. We take k to be 50 for this instance. In Figure 2.5 in the appendix we see that sparsification performs drastically better in both function value and percent of the greedy set contained for a given runtime.
We now scale up to a larger graph by taking the actors and actresses billed in the top six for the twenty thousand most rated movies. This graph has 57,000 vertices and 400,000 edges. We would not be able to compute the entire PPR matrix for this graph in a reasonable amount of time or space.
However, we can run the sparsified algorithm in three hours using only 2 GB of memory, which could be improved further by parallelizing the Monte Carlo approximation.
We run the greedy algorithm separately on the actors and actresses. For each, we take the top twenty-five and compare to the actors and actresses with the largest (non-personalized) PageRank. In Table 2.2 of the appendix, we see that the PageRank output fails to capture the diversity in nationality of the dataset, while the facility location optimization returns actors and actresses from many of the world's film industries.
2.6 Appendix: Additional Figures
Algorithm 1 Greedy algorithm with sparsity graph

Input: benefit matrix C, sparsity graph G = (V, I, E)
define N(v): return the neighbors of v in G
for all i ∈ I:
    βi ← 0    # cache of the current benefit given to i
A ← ∅
for k iterations:
    for all v ∈ V:
        # calculate the gain of element v
        gv ← 0
        for all i ∈ N(v):
            # add the gain of element v from i
            gv ← gv + max(Civ − βi, 0)
    v* ← arg max_{v∈V} gv
    A ← A ∪ {v*}
    for all i ∈ N(v*):
        # update the cache of the current benefit for i
        βi ← max(βi, Civ*)
return A
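A runnable Python sketch of Algorithm 1, assuming the sparsity graph is given as adjacency lists over V; the toy sparsified matrix in the demonstration is hypothetical.

```python
import numpy as np

def sparse_greedy(C, neighbors, k):
    """Greedy on a sparsified instance: computing the gain of column v
    touches only its neighbors in the sparsity graph, giving O(mtk) total
    time when the average degree is t.

    neighbors[v] lists the rows i with C[i, v] > 0 after sparsification.
    """
    m, n = C.shape
    beta = np.zeros(m)            # beta[i]: current benefit given to i
    A = []
    for _ in range(k):
        best_gain, best_v = -1.0, None
        for v in range(n):
            gain = sum(max(C[i, v] - beta[i], 0.0) for i in neighbors[v])
            if gain > best_gain:
                best_gain, best_v = gain, v
        A.append(best_v)
        for i in neighbors[best_v]:
            beta[i] = max(beta[i], C[i, best_v])
    return A, beta.sum()

# Demo: adjacency lists built from the nonzeros of a toy sparsified matrix.
C = np.array([[3.0, 0.0, 0.0, 2.0],
              [0.0, 2.0, 1.0, 0.0],
              [1.0, 0.0, 4.0, 0.0]])
neighbors = {v: list(np.nonzero(C[:, v])[0]) for v in range(C.shape[1])}
A, value = sparse_greedy(C, neighbors, k=2)
```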
[Figure 2.2: size of covering set vs. sparsity t; panels (a) MovieLens and (b) IMDb.]
Figure 2.2: (a) MovieLens dataset [GroupLens, 2015] and (b) IMDb dataset [IMDb, 2016], as explained in the Experiments section. We see that for sparsity t significantly smaller than the n/(αk) lower bound, we can still find a small covering set in the t-nearest-neighbor graph.
[Figure 2.3: fraction of elements coverable vs. fraction of elements kept; panels (a) MovieLens and (b) IMDb.]
Figure 2.3: (a) MovieLens dataset [GroupLens, 2015] and (b) IMDb dataset [IMDb, 2016], as explained in the Experiments section. We see that even with several orders of magnitude fewer edges than the complete graph, we can still find a small set that covers a large fraction of the dataset. For MovieLens this set was of size 40 and for IMDb this set was of size 50. The number of coverable elements was estimated by the greedy algorithm for the max-coverage problem.
[Figure 2.4: fraction of greedy set contained vs. sparsity t, for exact top-t and LSH top-t.]
Figure 2.4: The fraction of the greedy solution that was contained as the value of the sparsity t was increased, for exact nearest neighbors and approximate LSH-based nearest neighbors on the MovieLens dataset. We see that the exact method captures slightly more of the greedy solution for a given value of t, and the LSH curve does not converge to 1. However, LSH still captures a reasonable fraction of the greedy set and is significantly faster at finding nearest neighbors.
[Figure 2.5: (a) fraction of greedy set value vs. runtime; (b) fraction of greedy set contained vs. runtime, for Monte Carlo PPR and sample PPR.]
Figure 2.5: Results for the IMDb dataset [IMDb, 2016]. Figure (a) shows the function value as the runtime increases, normalized by the value obtained by the greedy algorithm. Our algorithm is within 99% of greedy in less than 10 minutes. For this experiment, the greedy algorithm had a runtime of six hours, so this is a 36x acceleration for a small penalty in performance. We also compare to using a small sample of the set I as an estimate of the function, which does not perform nearly as well as our algorithm even with much longer runtime. Figure (b) shows the fraction of the set returned by each method that was common with the set returned by greedy. The approximate nearest neighbor method has 90% of its elements in common with the greedy set while being 18x faster than greedy.
Table 2.2: The top twenty-five actors and actresses generated by sparsified facility location optimization defined by the personalized PageRank of a 57,000-vertex movie personnel collaboration graph from [IMDb, 2016], alongside the twenty-five actors and actresses with the largest (non-personalized) PageRank. We see that the classical PageRank approach fails to capture the diversity of nationality in the dataset, while the facility location results include actors and actresses from many of the world's film industries.

Actors (Facility Location) | Actors (PageRank) | Actresses (Facility Location) | Actresses (PageRank)
Robert De Niro | Jackie Chan | Julianne Moore | Julianne Moore
Jackie Chan | Gérard Depardieu | Susan Sarandon | Susan Sarandon
Gérard Depardieu | Robert De Niro | Bette Davis | Juliette Binoche
Kemal Sunal | Michael Caine | Isabelle Huppert | Isabelle Huppert
Shah Rukh Khan | Samuel L. Jackson | Kareena Kapoor | Catherine Deneuve
Michael Caine | Christopher Lee | Juliette Binoche | Kristin Scott Thomas
John Wayne | Donald Sutherland | Meryl Streep | Meryl Streep
Samuel L. Jackson | Peter Cushing | Adile Naşit | Bette Davis
Bud Spencer | Nicolas Cage | Catherine Deneuve | Nicole Kidman
Peter Cushing | John Wayne | Li Gong | Charlotte Rampling
Toshirô Mifune | John Cusack | Helena Bonham Carter | Helena Bonham Carter
Steven Seagal | Christopher Walken | Penélope Cruz | Kathy Bates
Moritz Bleibtreu | Bruce Willis | Naomi Watts | Naomi Watts
Jean-Claude Van Damme | Kemal Sunal | Masako Nozawa | Cate Blanchett
Mads Mikkelsen | Harvey Keitel | Drew Barrymore | Drew Barrymore
Michael Ironside | Amitabh Bachchan | Charlotte Rampling | Helen Mirren
Amitabh Bachchan | Shah Rukh Khan | Golshifteh Farahani | Michelle Pfeiffer
Ricardo Darín | Sean Bean | Hanna Schygulla | Penélope Cruz
Charles Chaplin | Steven Seagal | Toni Collette | Sigourney Weaver
Sean Bean | Jean-Claude Van Damme | Kati Outinen | Toni Collette
Louis de Funès | Morgan Freeman | Edna Purviance | Catherine Keener
Tadanobu Asano | Christian Slater | Monica Bellucci | Heather Graham
Bogdan Diklic | Val Kilmer | Kristin Scott Thomas | Sandra Bullock
Nassar | Liam Neeson | Catherine Keener | Kirsten Dunst
Lance Henriksen | Gene Hackman | Kyôko Kagawa | Miranda Richardson
2.7 Appendix: Full Proofs

2.7.1 Proof of Theorem 1
We will use the following two lemmas in the proof of Theorem 1, which are proven later in this section. The first lemma bounds the size of the smallest set of left vertices covering every right vertex in a t-regular bipartite graph.
Lemma 8. For any bipartite graph G = (V,I,E) such that |V | = n, |I| = m, every vertex i ∈ I has degree at least t, and n ≤ mt, there exists a set of vertices Γ ⊆ V such that every vertex in I has a neighbor in Γ and
|Γ| ≤ (n/t) (1 + ln(mt/n)). (2.3)
The second lemma bounds the rate that the optimal solution grows as a function of k.
Lemma 9. Let f be any normalized submodular function and let O2 and O1 be optimal solutions for their respective sizes, with |O2| ≥ |O1|. We have
f(O2) ≤ (|O2|/|O1|) f(O1).
We now prove Theorem 1.
Proof. We will take t*(α) to be the smallest value of t such that the bound on |Γ| in Equation (2.3) is at most αk. It can be verified that t*(α) ≤ ⌈4 (n/(αk)) max{1, ln(n/(αk))}⌉.
Let Γ ⊆ V be a set such that every element of I has a t-nearest neighbor in Γ. By Lemma 8, such a set of size at most αk is guaranteed to exist for t ≥ t*(α).
Let O be the optimal set of size k and let F^(t) be the objective function of the sparsified function. Let O_t^k and O_t^((1+α)k) be the optimal solutions to F^(t) of size k and (1 + α)k. We have
F(O) ≤ F(O ∪ Γ) = F^(t)(O ∪ Γ) ≤ F^(t)(O_t^((1+α)k)). (2.4)
The first inequality is due to the monotonicity of F. The equality holds because every element of I would prefer to choose one of its t nearest neighbors, and because of Γ it can. The final inequality is because |O ∪ Γ| ≤ (1 + α)k and O_t^((1+α)k) is the optimal solution for this size.
Now by Lemma 9, we can bound the loss from shrinking from O_t^((1+α)k) to O_t^k. Applying Lemma 9 and continuing from Equation (2.4) implies

F(O) ≤ (1 + α) F^(t)(O_t^k).
Observe that F^(t)(A) ≤ F(A) for any set A to obtain the final bound.
2.7.2 Proof of Proposition 3
Define Π_n(t) to be the n × (n + 1) matrix where, for i = 1, . . . , n, column i is equal to 1 in positions i to i + t − 1, cycling back to the beginning if necessary, and 0 otherwise. Column n + 1 has all values equal to 1 − 1/(2n). For example,

Π_6(3) =
[ 1 0 0 0 1 1 | 11/12 ]
[ 1 1 0 0 0 1 | 11/12 ]
[ 1 1 1 0 0 0 | 11/12 ]
[ 0 1 1 1 0 0 | 11/12 ]
[ 0 0 1 1 1 0 | 11/12 ]
[ 0 0 0 1 1 1 | 11/12 ]
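The construction above can be sketched in a few lines of numpy (the function name `pi_matrix` is ours); the assertions check it against the Π_6(3) example.

```python
import numpy as np

def pi_matrix(n, t):
    """Build the n x (n+1) matrix Pi_n(t): column i has ones in rows
    i, i+1, ..., i+t-1 (cyclically), and the last column is 1 - 1/(2n)."""
    M = np.zeros((n, n + 1))
    for col in range(n):
        for offset in range(t):
            M[(col + offset) % n, col] = 1.0
    M[:, n] = 1.0 - 1.0 / (2 * n)
    return M

# reproduce two rows of the Pi_6(3) example from the text
M = pi_matrix(6, 3)
assert np.allclose(M[0], [1, 0, 0, 0, 1, 1, 11 / 12])
assert np.allclose(M[3], [0, 1, 1, 1, 0, 0, 11 / 12])
```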
We will show the lower bound in two parts, when α < 1 and when α ≥ 1.
Proof of case α ≥ 1. Let F be the facility location function defined on the benefit matrix C = Π_n(δn/k). For t = δn/k, the sparsified matrix C^(t) contains all of the elements of C except those in column n + 1. With k elements, the optimal solution to F^(t) is to choose the k elements that let us cover δn of the elements of I, giving a value of δn. However, if we had chosen the (n + 1)st element, we would have gotten a value of n − 1/2, giving an approximation of δ/(1 − 1/(2n)). Setting δ = 1/(1 + α) and using α ≤ n/k implies

F(O_t) ≤ (1/(1 + α − 1/k)) OPT

when we take t = (1/(1 + α)) (|V| − 1)/k (note that for this problem |V| = n + 1).
Proof of case α < 1. Let F be the facility location function defined on the benefit matrix

C = [ Π_n((1/α)(n/k))                          0
      0              ((1/α)(n/k) − 1/(2n)) I_{n×n} ].

For t = (1/α)(n/k), the optimal solution to F^(t) is to use αk elements to cover all the elements of Π_n, then use the remaining (1 − α)k elements in the identity section of the matrix. This has a value of less than (1/α) n. For F, the optimal solution is to choose the (n + 1)st element of Π_n, then use the remaining k − 1 elements in the identity section of the matrix. This has a value of more than n(1 + 1/α − 1/(kα) − 1/n), and therefore an approximation of 1/(1 + α − 1/k − 1/n). Note that in this case |V| = 2n + 1, and so we have

F(O_t) ≤ (1/(1 + α − 1/k − 1/n)) OPT

when we take t = (|V| − 1)/(2αk).
2.7.3 Proof of Proposition 4
Proof. The stochastic greedy algorithm works by choosing a set of elements S_j at each iteration, of size (n/k) log(1/ε). We will assume m = n and ε = 1/e to simplify notation. We want to show that

∑_{j=1}^k ∑_{v ∈ S_j} d_v = O(nt)

with high probability, where d_v is the degree of element v in the sparsity graph. We will show this using Bernstein's inequality: given n i.i.d. random variables X_1, . . . , X_n such that E(X_ℓ) = 0, Var(X_ℓ) = σ², and |X_ℓ| ≤ c with probability 1, we have

P( ∑_{ℓ=1}^n X_ℓ ≥ λn ) ≤ exp( − nλ² / (2σ² + (2/3)cλ) ).
We will take X_ℓ to be the degree of the ℓth element of V chosen uniformly at random, shifted by the mean of t. Although in the stochastic greedy algorithm the elements are not chosen i.i.d., but rather in k iterations of sampling without replacement, treating them as i.i.d. random variables for the purposes of Bernstein's inequality is justified by Theorem 4 of [Hoeffding, 1963].
We have |X_ℓ| ≤ n and Var(X_ℓ) ≤ tn, where the variance bound holds because the variance of a distribution with mean t on support [0, m] is maximized by putting mass t/n on n and 1 − t/n on 0 (recall m = n).
If t ≥ ln n, take λ = (8/3) t; if t < ln n, take λ = (8/3) ln n. This yields

P( ∑_{j=1}^k ∑_{v ∈ S_j} d_v ≥ nt + (8/3) max{nt, n ln n} ) ≤ 1/n.
2.7.4 Proof of Lemma 8
We now prove Lemma 8, which is a modification of Theorem 1.2.2 of [Alon and Spencer, 2008].
Proof. Choose a set X by picking each element of V independently with probability p, where p is to be decided later. For every element of I without a neighbor in X, add one of its neighbors arbitrarily. Call this set Y. We have E(|X ∪ Y|) ≤ np + m(1 − p)^t ≤ np + me^(−pt).
Optimizing for p yields p = (1/t) ln(mt/n). This is a valid probability when mt/n ≥ 1, which we assumed, and when m/n ≤ e^t/t (we do not need to worry about the latter case, because if it does not hold then the lemma gives an inequality weaker than the trivial bound |Γ| ≤ n).
2.7.5 Proof of Lemma 9
Before we prove Lemma 9, we need the following lemma.
Lemma 10. Let f be any normalized submodular function and let O be an optimal solution for its respective size. Let A be any set. We have
f(O ∪ A) ≤ (1 + |A|/|O|) f(O).
We now prove Lemma 9.
Proof. Let

A* = arg max_{A ⊆ O2 : |A| ≤ |O1|} f(A)

and let A′ = O2 \ A*. Since A* is optimal for the function when restricted to the ground set O2, by Lemma 10 and the optimality of O1 for sets of size |O1|, we have

f(O2) = f(A* ∪ A′)
      ≤ (1 + |A′|/|A*|) f(A*)
      = (|O2|/|O1|) f(A*)
      ≤ (|O2|/|O1|) f(O1).
We now prove Lemma 10.
Proof. Define f(v | A) = f({v} ∪ A) − f(A). Let O = {o_1, . . . , o_k}, where the ordering is arbitrary except that

f(o_k | O \ {o_k}) = min_{i=1,...,k} f(o_i | O \ {o_i}).
Let A = {a_1, . . . , a_ℓ}, where the ordering is arbitrary except that

f(a_1 | O) = max_{i=1,...,ℓ} f(a_i | O).
We will first show that
f(a1 | O) ≤ f(ok | O \{ok}). (2.5)
By submodularity, we have
f(a1 | O) ≤ f(a1 | O \{ok}).
If it were true that

f(a_1 | O \ {o_k}) > f(o_k | O \ {o_k}),

then we would have

f((O \ {o_k}) ∪ {a_1}) = f(a_1 | O \ {o_k}) + ∑_{i=1}^{k−1} f(o_i | {o_1, . . . , o_{i−1}})
                       > ∑_{i=1}^k f(o_i | {o_1, . . . , o_{i−1}})
                       = f(O),

contradicting the optimality of O, thus showing that Inequality (2.5) holds.
Now since for all i ∈ {1, 2, . . . , k}

f(a_1 | O) ≤ f(o_k | O \ {o_k})
          ≤ f(o_i | O \ {o_i})
          ≤ f(o_i | {o_1, . . . , o_{i−1}}),

f(a_1 | O) is at most the average of the terms f(o_i | {o_1, . . . , o_{i−1}}), which is (1/k) ∑_{i=1}^k f(o_i | {o_1, . . . , o_{i−1}}) = (1/k) f(O), showing that

f(a_1 | O) ≤ (1/k) f(O). (2.6)
Finally, we have
f(O ∪ A) = f(O) + ∑_{i=1}^ℓ f(a_i | O ∪ {a_1, . . . , a_{i−1}})
         ≤ f(O) + ∑_{i=1}^ℓ f(a_i | O)
         ≤ f(O) + ℓ f(a_1 | O)
         ≤ (1 + ℓ/k) f(O),

which is what we wanted to show.
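Lemmas 9 and 10 can be sanity-checked by brute force on a small coverage function, which is normalized, monotone, and submodular (the instance and helper names below are ours):

```python
from itertools import combinations

def coverage(sets_by_elem, A):
    """f(A) = size of the union of the ground sets covered by A."""
    covered = set()
    for a in A:
        covered |= sets_by_elem[a]
    return len(covered)

def best_of_size(sets_by_elem, k):
    """Optimal value and set of size k by exhaustive search."""
    idx = range(len(sets_by_elem))
    return max(((coverage(sets_by_elem, A), A)
                for A in combinations(idx, k)), key=lambda p: p[0])

# a small hand-picked instance
sets_by_elem = [{0, 1}, {1, 2}, {2, 3, 4}, {4, 5}, {0, 5}, {1, 3}]
n = len(sets_by_elem)

# Lemma 10: f(O u A) <= (1 + |A|/|O|) f(O) for O optimal of its size
for k in range(1, 4):
    fO, O = best_of_size(sets_by_elem, k)
    for m in range(1, 3):
        for A in combinations(range(n), m):
            lhs = coverage(sets_by_elem, set(O) | set(A))
            assert lhs <= (1 + len(A) / len(O)) * fO + 1e-9

# Lemma 9: f(O2) <= (|O2|/|O1|) f(O1)
for k1 in range(1, 4):
    for k2 in range(k1, 5):
        f1, _ = best_of_size(sets_by_elem, k1)
        f2, _ = best_of_size(sets_by_elem, k2)
        assert f2 <= (k2 / k1) * f1 + 1e-9
```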
2.7.6 Proof of Theorem 6
Proof. Let O be the optimal solution to the original problem. Let Fτ and F̄τ be the functions defined by restricting to the matrix elements with benefit at least τ and to all remaining elements, respectively. If there exists a set S of size k such
that µn elements have a neighbor in S, then we have
F(O) ≤ Fτ(O) + F̄τ(O)
     ≤ Fτ(O) + nτ
     ≤ Fτ(O) + (1/µ) Fτ(S)
     ≤ (1 + 1/µ) Fτ(Oτ),

where the last inequality follows from Oτ being optimal for Fτ.
2.7.7 Proof of Lemma 7
Proof. Consider the following algorithm:
B ← ∅, S ← ∅
while |B| ≤ cδn:
    v* ← arg max_v |N(v)|
    add v* to S
    add N(v*) to B
    remove N(v*) ∪ {v*} from G
We will show that this algorithm terminates after T = c/((1 − 2c²)δ) iterations. When it does, S will satisfy |N(S)| ≥ cδn, since every element of B has a neighbor in S.
If there exists a vertex of degree cδn, then we terminate after the first iteration. Otherwise, all vertices have degree less than cδn. Assuming all vertices have degree less than cδn, until we terminate the number of edges incident to B is at most |B| cδn ≤ c²δ²n². At each iteration, the number of edges in the graph is at least (1/2 − c²)δ²n², thus in each iteration we can find a v* with degree at least (1 − 2c²)δ²n. Therefore, after T iterations we will have terminated, with the size of S at most T and |N(S)| ≥ cδn.
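The greedy procedure from the proof can be sketched in Python, representing the graph as a neighborhood map (the function name and data layout are ours; the loop covers at least `threshold` vertices, mirroring the |B| ≤ cδn condition):

```python
def greedy_cover(neighbors, threshold):
    """Repeatedly pick the vertex with the most uncovered neighbors,
    add its neighborhood to the covered set B, and delete the picked
    vertex and its neighborhood from the graph."""
    nbrs = {v: set(N) for v, N in neighbors.items()}
    S, B = set(), set()
    while len(B) < threshold and nbrs:
        v = max(nbrs, key=lambda u: len(nbrs[u]))
        if not nbrs[v]:
            break  # nothing left to cover
        S.add(v)
        B |= nbrs[v]
        removed = nbrs[v] | {v}
        nbrs = {u: N - removed for u, N in nbrs.items() if u not in removed}
    return S, B

# toy example: 'a' covers three vertices, 'b' two, 'c' one
S, B = greedy_cover({'a': {1, 2, 3}, 'b': {3, 4}, 'c': {5}}, threshold=5)
assert S == {'a', 'b', 'c'} and B == {1, 2, 3, 4, 5}
```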
We see that this is tight up to constant factors by the following proposition.
Proposition 11. There exists an example where for ∆ = δ2n, the optimal solution to the sparsified function is a factor of O(δ) from the optimal solution to the original function.
Proof. Consider the following benefit matrix.
C = [ 1_{δn×δn} + (1 + 1/(k−1)) I                         0
      0                (1 − 1/((1−δ)n)) 1_{(1−δ)n×(1−δ)n} ]
The sparsified optimal would only choose elements in the top left clique and would get a value of roughly δn, while the true optimal solution would cover both cliques and get a value of roughly n.
Chapter 3
Exact MAP Inference by Avoiding Fractional Vertices
3.1 Introduction
Given a graphical model, one essential problem is MAP inference, that is, finding the most likely configuration of states according to the model.
Consider graphical models with binary random variables and pairwise interactions, also known as Ising models. For a graph G = (V, E) with node weights θ ∈ R^V and edge weights W ∈ R^E, the probability of a variable configuration is given by

P(X = x) = (1/Z) exp( ∑_{i∈V} θ_i x_i + ∑_{ij∈E} W_{ij} x_i x_j ), (3.1)

where Z is a normalization constant.
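For intuition, the unnormalized score inside the exponential of Equation (3.1), and MAP inference by exhaustive search, can be written directly; this is only viable for tiny models, and the toy weights below are our own choices:

```python
from itertools import product

def score(x, theta, W):
    """Unnormalized log-probability of configuration x under the
    binary pairwise model of Equation (3.1)."""
    s = sum(theta[i] * x[i] for i in range(len(x)))
    s += sum(w * x[i] * x[j] for (i, j), w in W.items())
    return s

def map_by_enumeration(theta, W):
    """MAP configuration by brute force over all 2^n configurations."""
    n = len(theta)
    return max(product([0, 1], repeat=n), key=lambda x: score(x, theta, W))

# a 3-node chain: field on node 0, attractive edge (0,1), repulsive edge (1,2)
theta = [1.0, 0.2, 0.2]
W = {(0, 1): 2.0, (1, 2): -3.0}
print(map_by_enumeration(theta, W))  # -> (1, 1, 0)
```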
The MAP problem is to find the configuration x ∈ {0, 1}^V that maximizes Equation (3.1). We can write this as an integer linear program (ILP) as follows:

max_{q ∈ R^{V∪E}}  ∑_{i∈V} θ_i q_i + ∑_{ij∈E} W_{ij} q_{ij}
s.t.  q_i ∈ {0, 1}                     ∀i ∈ V (3.2)
      q_{ij} ≥ max{0, q_i + q_j − 1}   ∀ij ∈ E
      q_{ij} ≤ min{q_i, q_j}           ∀ij ∈ E.
The MAP problem on binary, pairwise graphical models contains, as a special case, the Max-cut problem and is therefore NP-hard. For this reason, a significant amount of attention has focused on analyzing the LP relaxation of the ILP, which can be solved efficiently in practice.
max_{q ∈ R^{V∪E}}  ∑_{i∈V} θ_i q_i + ∑_{ij∈E} W_{ij} q_{ij}
s.t.  0 ≤ q_i ≤ 1                      ∀i ∈ V (3.3)
      q_{ij} ≥ max{0, q_i + q_j − 1}   ∀ij ∈ E
      q_{ij} ≤ min{q_i, q_j}           ∀ij ∈ E

This relaxation has been an area of intense research in machine learning and statistics. In [Meshi et al., 2016], the authors state that a major open question is to identify why real-world instances of Problem (3.2) can be solved efficiently despite the theoretical worst-case complexity.
We make progress on this open problem by analyzing the fractional vertices of the LP relaxation, that is, the extreme points of the polytope with fractional coordinates. Vertices of the relaxed polytope with fractional coordinates are called pseudomarginals for graphical models and pseudocodewords in coding theory. If a fractional vertex has a higher objective value (i.e., likelihood) than the best integral one, the LP relaxation fails. We call fractional vertices with an objective value at least as good as the objective value of the optimal integral vertex confounding vertices. Our main result is that it is possible to prune all confounding vertices efficiently when their number is polynomial.
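As a concrete illustration (with weights of our own choosing), consider a triangle with unit fields and strongly repulsive edges. The half-integral point with q_i = 1/2 on every node and q_ij = 0 on every edge satisfies the relaxed constraints, its cycle is frustrated (all three edges have q_ij = 0), and its objective beats every integral configuration, so it is a confounding vertex:

```python
from itertools import product

theta = {0: 1.0, 1: 1.0, 2: 1.0}
W = {(0, 1): -10.0, (0, 2): -10.0, (1, 2): -10.0}

def objective(q_nodes, q_edges):
    return (sum(theta[i] * q_nodes[i] for i in theta)
            + sum(W[e] * q_edges[e] for e in W))

# best integral vertex: q_ij = q_i * q_j for binary q
best_int = max(
    objective({i: x[i] for i in range(3)},
              {(i, j): x[i] * x[j] for (i, j) in W})
    for x in product([0, 1], repeat=3))

# half-integral point: q_i = 1/2 everywhere, q_ij = 0 everywhere.
# It is feasible: q_ij >= q_i + q_j - 1 = 0 and q_ij <= min(q_i, q_j) = 1/2.
frac = objective({i: 0.5 for i in range(3)}, {e: 0.0 for e in W})

print(best_int, frac)  # -> 1.0 1.5
```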
Our contributions:
• Our first contribution is a general result on integer programs. We show that any 0-1 integer linear program (ILP) can be solved exactly in polynomial time if the number of confounding vertices is bounded by a polynomial. This applies to MAP inference for a graphical model over any alphabet size and any order of interactions. The same result (exact solution if the number of confounding vertices is bounded by a polynomial) was established by Dimakis et al. [2009] for the special case of LP decoding of LDPC codes [Feldman et al., 2005]. The algorithm from Dimakis et al. [2009] relies on the special structure of the graphical models that correspond to LDPC codes. In this paper we generalize this result to any ILP in the unit hypercube. Our results extend to finding all integral vertices among the M-best vertices.
• Given our condition, one may be tempted to think that we can generate the M-best vertices of a linear program (for M polynomial) and output the best integral one in this list. We actually show that such an approach would be computationally intractable. Specifically, we show that it is NP-hard to produce a list of the M-best vertices if M = O(n^ε) for any fixed ε > 0. This result holds even if the list is allowed to be approximate. This strengthens the previously known hardness result [Angulo et al., 2014], which was for M = O(n) and the exact M-best vertices. In terms of achievability, the best previously known result (from [Angulo et al., 2014]) can only solve the ILP if there is at most a constant number of confounding vertices.
• We obtain a complete characterization of the fractional vertices of the local polytope for binary, pairwise graphical models. We show that any variable in the fractional support must be connected to a frustrated cycle by other fractional variables in the graphical model. This is a complete structural characterization that was not previously known, to the best of our knowledge.
• We develop an approach to estimate the number of confounding vertices of a half-integral polytope. We use this method in an empirical evaluation of the number of confounding vertices of previously studied problems and analyze how well common integer programming techniques perform at pruning confounding vertices.
3.2 Background and Related Work
For some classes of graphical models, it is possible to solve the MAP problem exactly. For example see [Weller et al., 2016] for balanced and almost balanced models, [Jebara, 2009] for perfect graphs, and [Wainwright et al., 2008] for graphs with constant tree-width.
These conditions are often not true in practice and a wide variety of general purpose algorithms are able to solve the MAP problem for large inputs. One class is belief propagation and its variants [Yedidia et al., 2000,
47 Wainwright et al., 2003, Sontag et al., 2008]. Another class involves general ILP optimization methods (see e.g. [Nemhauser and Wolsey, 1999]). Techniques specialized to graphical models include cutting-plane methods based on the cycle inequalities [Sontag and Jaakkola, 2007, Komodakis and Paragios, 2008, Sontag et al., 2012]. See also [Kappes et al., 2013] for a comparative survey of techniques.
In [Weller et al., 2014], the authors investigate how pseudomarginals and relaxations relate to the success of the Bethe approximation of the partition function.
There has been substantial prior work on improving inference building on these LP relaxations, especially for LDPC codes in the information theory community. This work ranges from very fast solvers that exploit the special structure of the polytope [Burshtein, 2009], connections to unequal error protection [Dimakis et al., 2007], and graphical model covers [Koetter et al., 2007]. LP decoding currently provides the best known finite-length error- correction bounds for LDPC codes both for random [Daskalakis et al., 2008, Arora et al., 2009], and adversarial bit-flipping errors [Feldman et al., 2007].
For binary graphical models, there is a body of work which tries to exploit the persistency of the LP relaxation, that is, the property that integer components in the solution of the relaxation must take the same value in the optimal solution, under some regularity assumptions [Boros and Hammer, 2002, Rother et al., 2007, Fix et al., 2012].
48 Fast algorithms for solving large graphical models in practice include [Ihler et al., 2012, Dechter and Rish, 2003].
The work most closely related to this paper involves eliminating frac- tional vertices (so-called pseudocodewords in coding theory) by changing the polytope or the objective function [Zhang and Siegel, 2012, Chertkov and Stepanov, 2008, Liu et al., 2012].
3.3 Provable Integer Programming
A binary integer linear program is an optimization problem of the following form:

max_x  c^T x
s.t.   Ax ≤ b
       x ∈ {0, 1}^n,

which is relaxed to a linear program by replacing the constraint x ∈ {0, 1}^n with 0 ≤ x ≤ 1. For binary integer programs with the box constraints 0 ≤ x_i ≤ 1 for all i, every integral vector x is a vertex of the polytope described by the constraints of the LP relaxation. However, fractional vertices may also be in this polytope, and fractional solutions can potentially have an objective value larger than every integral vertex.
If the optimal solution to the linear program happens to be integral, then this is the optimal solution to the original integer linear program. If the optimal solution is fractional, then a variety of techniques are available to tighten the LP relaxation and eliminate the fractional solution.
We establish a success condition for integer programming based on the number of confounding vertices, which to the best of our knowledge was unknown. The algorithm used in proving Theorem 12 is a version of branch-and-bound, a classic technique in integer programming [Land and Doig, 1960] (see [Nemhauser and Wolsey, 1999] for a modern reference on integer programming). This algorithm works by starting with a root node, then branching on a fractional coordinate by making two new linear programs with all the constraints of the parent node, with the constraint x_i = 0 added to one new leaf and x_i = 1 added to the other. The decision on which leaf of the tree to branch on next is based on which leaf has the best objective value. When the best leaf is integral, we know that this is the best integral solution. This algorithm is formally written in Algorithm 2.
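A best-first branch-and-bound in the spirit just described can be sketched with scipy's LP solver, applied to the local polytope of a frustrated triangle (Algorithm 2 itself is not reproduced in this text; the variable layout, weights, and helper names below are ours):

```python
import heapq
import numpy as np
from scipy.optimize import linprog

# Variables q = (q0, q1, q2, q01, q02, q12); unit fields, repulsive edges.
theta = np.array([1.0, 1.0, 1.0])
edges = [(0, 1), (0, 2), (1, 2)]
w = np.array([-10.0, -10.0, -10.0])
c = -np.concatenate([theta, w])              # linprog minimizes

A_ub, b_ub = [], []
for k, (i, j) in enumerate(edges):
    e = 3 + k
    for coeffs, rhs in ((((e, 1), (i, -1)), 0.0),        # q_e <= q_i
                        (((e, 1), (j, -1)), 0.0),        # q_e <= q_j
                        (((i, 1), (j, 1), (e, -1)), 1.0)):  # q_i+q_j-q_e <= 1
        row = np.zeros(6)
        for idx, val in coeffs:
            row[idx] = val
        A_ub.append(row)
        b_ub.append(rhs)

def solve(bounds):
    res = linprog(c, A_ub=np.array(A_ub), b_ub=b_ub, bounds=bounds,
                  method="highs")
    return (res.x, -res.fun) if res.status == 0 else (None, None)

def branch_and_bound():
    """Expand the leaf with the best LP value; when the best leaf has
    integral node variables, it is optimal (edge variables are then
    forced to integral values by the local constraints)."""
    root = [(0.0, 1.0)] * 6
    x, val = solve(root)
    heap, counter = [(-val, 0, root, x)], 1
    while heap:
        negval, _, bounds, x = heapq.heappop(heap)
        frac = [i for i in range(3) if 1e-6 < x[i] < 1 - 1e-6]
        if not frac:
            return np.round(x[:3]), -negval
        i = frac[0]
        for v in (0.0, 1.0):                  # branch on q_i = 0 / q_i = 1
            child = list(bounds)
            child[i] = (v, v)
            cx, cval = solve(child)
            if cx is not None:
                heapq.heappush(heap, (-cval, counter, child, cx))
                counter += 1
    return None, None

sol, val = branch_and_bound()
print(sol, val)  # the MAP value is 1.0: a single node set to one
```

The root LP here is fractional (the confounding vertex q = (1/2, 1/2, 1/2, 0, 0, 0) has value 1.5), so one branching step is enough to expose the integral optimum.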
Theorem 12. Let x* be the optimal integral solution and let {v_1, v_2, . . . , v_M} be the set of confounding vertices in the LP relaxation. Algorithm 2 will find the optimal integral solution x* after 2M calls to an LP solver.
Since MAP inference is a binary integer program regardless of the alphabet size of the variables and order of the clique potentials, we have the following corollary:
Corollary 13. Given a graphical model such that the local polytope has M confounding vertices, Algorithm 2 can find the optimal MAP configuration with 2M calls to an LP solver.
Cutting-plane methods, which remove a fractional vertex by introducing a new constraint in the polytope, may not have this property, since this cut may create new confounding vertices. This branch-and-bound algorithm has the desirable property that it never creates a new fractional vertex. We note that other branching algorithms, such as the algorithm presented by the authors in [Marinescu and Dechter, 2009], do not immediately allow us to prove our desired theorem.
Note that warm starting a linear program with slightly modified con- straints allows subsequent calls to an LP solver to be much more efficient after the root LP has been solved.
3.3.1 Proof of Theorem 12
The proof follows from the following invariants:
• At every iteration we remove at least one fractional vertex.
• Every integral vertex is in exactly one branch.
• Every fractional vertex is in at most one branch.
• No fractional vertices are created by the new constraints.
To see the last invariant, note that every vertex of a polytope can be identified by the set of inequality constraints that are satisfied with equality (see [Bertsimas and Tsitsiklis, 1997]). By forcing an inequality constraint to be tight, we cannot possibly introduce new vertices.
3.3.2 The M-Best LP Problem
As mentioned in the introduction, the algorithm used to prove Theorem 12 does not enumerate all the fractional vertices until it finds an integral vertex. Enumerating the M-best vertices of a linear program is the M-best LP problem.
Definition 14. Given a linear program {min c^T x : x ∈ P} over a polytope P and a positive integer M, the M-best LP problem is to optimize

max_{{v_1,...,v_M} ⊆ V(P)}  ∑_{k=1}^M c^T v_k.
This was established by [Angulo et al., 2014] to be NP-hard when M = O(n). We strengthen this result to hardness of approximation even when M = n^ε for any ε > 0.
Theorem 15. It is NP-hard to approximate the M-best LP problem to a factor better than O(n^ε/M) for any fixed ε > 0.
Proof. Consider the circulation polytope described in [Khachiyan et al., 2008], with the graph and weight vector w described in [Boros et al., 2011]. By adding an O(log M) long series of 2 × 2 bipartite subgraphs, we can make it such that one long path in the original graph implies M long paths in the new graph, and thus it is NP-hard to find any of these long paths in the new graph. By adding the constraint w^T x ≤ 0 and using the cost function −w, the vertices corresponding to the short paths have value 1/2, the vertices corresponding to the long paths have value O(1/n), and all other vertices have value 0. Thus the optimal set has value O(n + M/n). However, it is NP-hard to find a set of value greater than O(n) in polynomial time, which gives an O(n/M) approximation. Using a padding argument, we can replace n with n^ε.
The best known algorithm for the M-best LP problem is a generalization of the facet guessing algorithm [Dimakis et al., 2009] developed in [Angulo et al., 2014], which would require O(m^M) calls to an LP solver, where m is the number of constraints of the LP. Since we only care about integral solutions, we can find the single best integral vertex with O(M) calls to an LP solver, and if we want all of the K-best integral solutions among the top M vertices of the polytope, we can find these with O(nK + M) calls to an LP solver, as we will see in the next section.
3.3.3 K-Best Integral Solutions
Finding the K-best solutions to general optimization problems has been used in several machine learning applications. Producing multiple high-value outputs can be naturally combined with post-processing algorithms that select the most desired solution using additional side information. There is a significant volume of work in the general area; see [Fromer and Globerson, 2009, Batra et al., 2012] for MAP solutions in graphical models and [Eppstein, 2014] for a survey on M-best problems.
We further generalize our theorem to find the K-best integral solutions.
Theorem 16. Under the assumption that there are fewer than M fractional vertices with objective value at least as good as the K-best integral solutions, we can find all of the K-best integral solutions with O(nK + M) calls to an LP solver.
The algorithm used in this theorem is Algorithm 3. It combines Algorithm 2 with the space partitioning technique used in [Murty, 1968, Lawler, 1972]. If the current optimal solution in the solution tree is fractional, then we use the branching technique in Algorithm 2. If the current optimal solution in the solution tree x* is integral, then we branch by creating a new leaf, for every i not currently constrained by the parent, with the constraint x_i = ¬x*_i.
3.4 Fractional Vertices of the Local Polytope
We now describe the fractional vertices of the local polytope for binary, pairwise graphical models, which is described in Equation (3.3). It was shown in [Padberg, 1989] that all the vertices of this polytope are half-integral, that is, all coordinates have a value from {0, 1/2, 1} (see [Weller et al., 2016] for a new proof of this).
Given a half-integral point q ∈ {0, 1/2, 1}^{V∪E} in the local polytope, we say that a cycle C = (V_C, E_C) ⊆ G is frustrated if there is an odd number of edges ij ∈ E_C such that q_{ij} = 0. If a point q has a frustrated cycle, then it is a pseudomarginal, as no probability distribution exists that has as its singleton and pairwise marginals the coordinates of q. Half-integral points q with a frustrated cycle do not satisfy the cycle inequalities [Sontag and Jaakkola, 2007, Wainwright et al., 2008]: for all cycles C = (V_C, E_C) and F = (V_F, E_F) ⊆ C with |E_F| odd, we must have

∑_{ij ∈ E_F} (q_i + q_j − 2q_{ij}) − ∑_{ij ∈ E_C \ E_F} (q_i + q_j − 2q_{ij}) ≤ |E_F| − 1. (3.4)
Frustrated cycles allow a solution to be zero on negative weights in a way that is not possible for any integral solution.
We have the following theorem describing all the vertices of the local polytope for binary, pairwise graphical models.
Theorem 17. Given a point q in the local polytope, q is a vertex of this polytope if and only if q ∈ {0, 1/2, 1}^{V∪E} and the induced subgraph on the fractional nodes of q is such that every connected component of this subgraph contains a frustrated cycle.
3.4.1 Proof of Theorem 17
Every vertex q of an n-dimensional polytope has n linearly independent constraints that it satisfies with equality, known as active constraints (see [Bertsimas and Tsitsiklis, 1997]). Every integral q is thus a vertex of the local polytope. We now describe the fractional vertices of the local polytope.
Definition 18. Let q ∈ {0, 1/2, 1}^{n+m} be a point of the local polytope. Let G_F = (V_F, E_F) be an induced subgraph on points such that q_i = 1/2 for all i ∈ V_F. We say that G_F is full rank if the following system of equations is full rank:

q_i + q_j − q_{ij} = 1   ∀ij ∈ E_F such that q_{ij} = 0
q_{ij} = 0               ∀ij ∈ E_F such that q_{ij} = 0
q_i − q_{ij} = 0         ∀ij ∈ E_F such that q_{ij} = 1/2 (3.5)
q_j − q_{ij} = 0         ∀ij ∈ E_F such that q_{ij} = 1/2
Theorem 17 follows from the following lemmas.
Lemma 19. Let q ∈ {0, 1/2, 1}^{n+m} be a point of the local polytope. Let G_F = (V_F, E_F) be the subgraph induced by the nodes i ∈ V such that q_i = 1/2. The point q is a vertex if and only if every connected component of G_F is full rank.
Lemma 20. Let q ∈ {0, 1/2, 1}^{n+m} be a point of the local polytope. Let G_F = (V_F, E_F) be a connected subgraph induced from nodes such that q_i = 1/2 for all i ∈ V_F. G_F is full rank if and only if G_F contains a cycle that is full rank.
Lemma 21. Let q ∈ {0, 1/2, 1}^{n+m} be a point of the local polytope. Let C = (V_C, E_C) be a cycle of G such that q_i is fractional for all i ∈ V_C. C is full rank if and only if C is a frustrated cycle.
Proof of Lemma 19. Suppose every connected component is full rank. Then every fractional node and every edge between fractional nodes is fully specified by the corresponding equations in Problem (3.3). It is easy to check that all integral nodes, edges between integral nodes, and edges between integral and fractional nodes are also fixed. Thus q is a vertex.
Now suppose that there exists a connected component that is not full rank. The only other constraints involving this connected component are those between fractional nodes and integral nodes. However, note that each of these constraints is rank 1 and also introduces a new edge variable. Thus the constraints where q is tight do not form a full rank system of equations.
Proof of Lemma 20. Suppose G_F has a full rank cycle. We will build the graph starting with the full rank cycle, then adding one connected edge at a time. It is easy to see from Equations (3.5) that all new variables introduced to the system of equations have a fixed value, and thus the whole connected component is full rank.

Now suppose G_F has no full rank cycle. We will again build the graph starting from a cycle, then adding one connected edge at a time. If we add an edge that connects to a new node, then we have added two variables and two equations, thus we did not make the graph full rank. If we add an edge between two existing nodes, then we have a cycle involving this edge. We introduce two new equations; however, with one of the equations and the other cycle equations we can produce the other equation, thus we can increase the rank by at most one while also introducing a new edge variable. Thus the whole graph cannot be full rank.
The proof of Lemma 21 follows from the following lemma.
Lemma 22. Consider a collection of n vectors

v_1 = (1, t_1, 0, . . . , 0)
v_2 = (0, 1, t_2, 0, . . . , 0)
v_3 = (0, 0, 1, t_3, 0, . . . , 0)
  ...
v_{n−1} = (0, . . . , 0, 1, t_{n−1})
v_n = (t_n, 0, . . . , 0, 1)

for t_i ∈ {−1, 1}. We have rank(v_1, v_2, . . . , v_n) = n if and only if there is an odd number of vectors such that t_i = 1.
Proof of Lemma 22. Let k be the number of vectors such that t_i = 1. Let S_1 = v_1 and define

S_{i+1} = S_i − v_{i+1}  if S_i(i+1) = 1,
S_{i+1} = S_i + v_{i+1}  if S_i(i+1) = −1,

for i = 1, . . . , n − 2. Note that if t_{i+1} = −1 then S_{i+1}(i+2) = S_i(i+1), and if t_{i+1} = 1 then S_{i+1}(i+2) = −S_i(i+1). Thus the number of times the sign changes is exactly the number of t_i = 1 for i ∈ {2, . . . , n − 1}.
Using the value of S_{n−1}, we can now check, for all values of t_1 and t_n, that the following is true.
• If k is odd then (1, 0,..., 0) ∈ span(v1, v2, . . . , vn), which allows us to create the entire standard basis, showing the vectors are full rank.
• If k is even then v_n ∈ span(v_1, v_2, . . . , v_{n−1}) and thus the vectors are not full rank.
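The rank criterion of Lemma 22 is easy to verify exhaustively for small n with numpy (a sketch; the function name `sign_matrix` is ours):

```python
import numpy as np
from itertools import product

def sign_matrix(ts):
    """Matrix whose rows are the vectors v_1, ..., v_n of Lemma 22:
    row i has a 1 in position i and t_i in position i+1 (cyclically)."""
    n = len(ts)
    M = np.eye(n)
    for i in range(n):
        M[i, (i + 1) % n] = ts[i]
    return M

# check the criterion for every sign pattern with n = 3, 4, 5
for n in (3, 4, 5):
    for ts in product([-1, 1], repeat=n):
        full = np.linalg.matrix_rank(sign_matrix(ts)) == n
        assert full == (ts.count(1) % 2 == 1)
```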
3.5 Estimating the number of Confounding Singleton Marginals
For this section, we generalize Theorem 12. We see that after every iteration we potentially remove more than one confounding vertex: we remove all confounding vertices that agree with x_{I_0} = 0 and x_{I_1} = 1 and are fractional with any value at coordinate i. We also observe that we can handle a mixed integer program (MIP) with the same algorithm:

max_{x,z}  c^T x + d^T z
s.t.  Ax + Bz ≤ b
      x ∈ {0, 1}^n

We will call a vertex (x, z) fractional if its x component is fractional. For each fractional vertex (x, z), we create a half-integral vector S(x) such that

S(x)_i = 0 if x_i = 0;  1/2 if x_i is fractional;  1 if x_i = 1.

For a set of vertices V, we define S(V) to be the set {S(x) : (x, z) ∈ V}, i.e., we remove all duplicate entries.
Theorem 23. Let (x*, z*) be the optimal integral solution and let V_C be the set of confounding vertices. Algorithm 2 will find the optimal integral solution (x*, z*) after 2|S(V_C)| calls to an LP solver.
For MAP inference in graphical models, S(V_C) refers to the fractional singleton marginals q_V such that there exists a set of pairwise pseudomarginals q_E such that (q_V, q_E) is a confounding vertex. In this case we call q_V a confounding singleton marginal. We develop Algorithm 4 to estimate the number of confounding singleton marginals for our experiments section. It is based on the k-best enumeration method developed in [Murty, 1968, Lawler, 1972].
Algorithm 4 works by a branching argument. The root node is the original LP. A leaf node is branched on by introducing a new leaf for every node i ∈ V and every element a ∈ {0, 1/2, 1} such that q_i ≠ a in the parent node and the constraint {q_i = a} is not among the constraints of the parent node. For i ∈ V and a ∈ {0, 1/2, 1}, we create the leaf such that it has all the constraints of its parent plus the constraint q_i = a.
Note that Algorithm 4 actually generates a superset of the elements of S(V_C), since the introduction of constraints of the type q_i = 1/2 introduces vertices into the new polytope that were not in the original polytope. This does not seem to be an issue for the experiments we consider; however, it does occur for other graphs. An interesting question is whether the vertices of the local polytope can be provably enumerated.
3.6 Experiments
We consider a synthetic experiment on randomly created graphical models, which were also used in [Sontag and Jaakkola, 2007, Weller, 2016, Weller et al., 2014]. The graph topology used is the complete graph on 12 nodes.
60 We first reparametrize the model to use the sufficient statistics 1(xi = xj) and 1(xi = 1). The node weights are drawn θi ∼ Uniform(−1, 1) and the edge weights are drawn Wij ∼ Uniform(−w, w) for varying w. The quantity w determines how strong the connections are between nodes. We do 100 draws for each choice of edge strength w.
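The instance generation and exact MAP baseline for this setup can be sketched as follows (function names are ours; brute force over all 2^12 configurations is feasible at this size):

```python
import numpy as np
from itertools import combinations, product

def random_instance(n=12, w=0.3, rng=None):
    """Random model matching the setup above: complete graph K_n,
    theta_i ~ Uniform(-1, 1), W_ij ~ Uniform(-w, w), with sufficient
    statistics 1(x_i = 1) and 1(x_i = x_j)."""
    if rng is None:
        rng = np.random.default_rng(0)
    theta = rng.uniform(-1, 1, size=n)
    W = {e: rng.uniform(-w, w) for e in combinations(range(n), 2)}
    return theta, W

def map_score(x, theta, W):
    return (sum(theta[i] for i in range(len(x)) if x[i] == 1)
            + sum(wij for (i, j), wij in W.items() if x[i] == x[j]))

# exact MAP by brute force over all 2^12 = 4096 configurations
theta, W = random_instance()
best = max(product([0, 1], repeat=len(theta)),
           key=lambda x: map_score(x, theta, W))
```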
For the complete graph, we observe that Algorithm 4 does not yield any points that do not correspond to vertices, however this does occur for other topologies.
We first compare how the number of fractional singleton marginals
|S(VC )| changes with the connection strength w. In Figure 3.1, we plot the sample CDF of the probability that |S(VC )| is some given value. We observe that |S(VC )| increases as the connection strength increases. Further we see that while most instances have a small number for |S(VC )|, there are rare instances where |S(VC )| is quite large.
Now we compare how the number of cycle constraints from Equation (3.4) that need to be introduced to find the best integral solution changes with the number of confounding singleton marginals in Figure 3.2. We use the algorithm for finding the most frustrated cycle in [Sontag and Jaakkola, 2007] to introduce new constraints. We observe that each constraint seems to remove many confounding singleton marginals.
We also observe in Figure 3.3 that the number of confounding singleton marginals introduced by the cycle constraints increases with the number of confounding singleton marginals.

(Figure 3.1 plot: sample CDF curves of P(|S(VC)| ≤ t) for w = 0.1, 0.2, and 0.3, with |S(VC)| on a log scale from 10^0 to 10^4.)

Figure 3.1: We compare how the number of fractional singleton marginals |S(VC)| changes with the connection strength w. We plot the sample CDF of |S(VC)|. We observe that |S(VC)| increases as the connection strength increases. Further, while most instances have a small |S(VC)|, there are rare instances where |S(VC)| is quite large.

(Figure 3.2 plot: number of cycle constraints added, 0 to 50, versus |S(VC)| on a log scale.)

Figure 3.2: We compare how the number of cycle constraints from Equation (3.4) that need to be introduced to find the best integral solution changes with the number of confounding singleton marginals. We use the algorithm for finding the most frustrated cycle in [Sontag and Jaakkola, 2007] to introduce new constraints. We observe that each constraint seems to remove many confounding singleton marginals.

(Figure 3.3 plot: number of introduced confounding singleton marginals, 0 to 30, versus |S(VC)| on a log scale.)

Figure 3.3: The number of confounding singleton marginals introduced by the cycle constraints increases with the number of confounding singleton marginals.
Finally, in Figure 3.4 we examine how the number of branches needed to find the optimal solution increases with the number of confounding singleton marginals. A similar trend arises as with the number of cycle inequalities introduced. To compare the methods, note that branch-and-bound uses twice as many LP calls as there are branches. For this family of graphical models, branch-and-bound tends to require fewer calls to an LP solver than the cut constraints.
(Figure 3.4 plot: number of branches, 0 to 10, versus |S(VC)| on a log scale from 10^0 to 10^4.)
Figure 3.4: The number of branches needed to find the optimal solution increases with the number of confounding singleton marginals. A similar trend arises as with the number of cycle inequalities introduced. Branch-and-bound uses twice as many LP calls as there are branches; for this family of graphical models, it tends to require fewer calls to an LP solver than the cut constraints.
3.7 Conclusion
Perhaps the most interesting follow-up question to our work is to determine when, in theory and in practice, the number of confounding pseudomarginals in the LP relaxation is small. Another interesting question is whether it is possible to prune the confounding pseudomarginals at a faster rate. The algorithm presented for our main theorem removes one pseudomarginal per two calls to an LP solver. Is it possible to do this at a faster rate? Our experiments suggest that this is the case in practice.
Algorithm 2 Branch and Bound
Input: an LP {max c^T x : Ax ≤ b, 0 ≤ x ≤ 1}

# a branch (v, I0, I1) means v is optimal for the LP
# with x_{I0} = 0 and x_{I1} = 1.
def LP(I0, I1):
    v* ← arg max c^T x subject to:
        Ax ≤ b
        x_{I0} = 0
        x_{I1} = 1
    return v* if feasible, else return null

v ← LP(∅, ∅)
B ← {(v, ∅, ∅)}
while optimal integral vertex not found:
    (v, I0, I1) ← arg max_{(v, I0, I1) ∈ B} c^T v
    if v is integral:
        return v
    else:
        find a fractional coordinate i
        v^(0) ← LP(I0 ∪ {i}, I1)
        v^(1) ← LP(I0, I1 ∪ {i})
        remove (v, I0, I1) from B
        add (v^(0), I0 ∪ {i}, I1) to B if feasible
        add (v^(1), I0, I1 ∪ {i}) to B if feasible
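To make Algorithm 2's best-first branching concrete, here is a minimal runnable sketch. It substitutes a greedy fractional-knapsack solver for the general LP oracle (so it only handles LPs with a single packing constraint); all function names are our own, and this illustrates the branching scheme rather than the dissertation's implementation.

```python
import heapq
import itertools

def relax(values, weights, cap, fix0, fix1):
    """LP relaxation of max v^T x s.t. w^T x <= cap, 0 <= x <= 1,
    with x_i = 0 on fix0 and x_i = 1 on fix1, solved greedily
    (fractional knapsack: at most one coordinate ends up fractional)."""
    rem = cap - sum(weights[i] for i in fix1)
    if rem < 0:
        return None  # infeasible branch
    x = [1.0 if i in fix1 else 0.0 for i in range(len(values))]
    free = sorted((i for i in range(len(values))
                   if i not in fix0 and i not in fix1),
                  key=lambda i: values[i] / weights[i], reverse=True)
    for i in free:
        if rem <= 0:
            break
        x[i] = min(1.0, rem / weights[i])
        rem -= x[i] * weights[i]
    return x, sum(v * xi for v, xi in zip(values, x))

def branch_and_bound(values, weights, cap):
    """Best-first search over branches (v, I0, I1), as in Algorithm 2."""
    tie = itertools.count()  # tie-breaker so the heap never compares sets
    root = relax(values, weights, cap, frozenset(), frozenset())
    heap = [(-root[1], next(tie), root[0], frozenset(), frozenset())]
    while heap:
        negval, _, x, I0, I1 = heapq.heappop(heap)
        frac = [i for i, xi in enumerate(x) if 0.0 < xi < 1.0]
        if not frac:
            return x, -negval  # best remaining branch is integral: optimal
        i = frac[0]  # branch on a fractional coordinate
        for new0, new1 in ((I0 | {i}, I1), (I0, I1 | {i})):
            res = relax(values, weights, cap, new0, new1)
            if res is not None:
                heapq.heappush(heap, (-res[1], next(tie), res[0], new0, new1))
```

On the classic knapsack instance with values (60, 100, 120), weights (10, 20, 30), and capacity 50, the LP relaxation has value 240 with one fractional coordinate, and branching recovers the integral optimum 220.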
Algorithm 3 M-best Integral
Input: an LP {max c^T x : Ax ≤ b, 0 ≤ x ≤ 1}
Input: number of solutions K

def LP(I0, I1): same as Algorithm 2

def SplitIntegral(v, I0, I1):
    P ← { }
    for i ∈ [n] if i ∉ I0 ∪ I1:
        a ← ¬vi
        I0′, I1′ ← copy(I0, I1)
        add i to Ia′
        v′ ← LP(I0′, I1′)
        add (v′, I0′, I1′) to P if feasible
    return P

v ← LP(∅, ∅)
B ← {(v, ∅, ∅)}
results ← { }
while K integral vertices not found:
    (v, I0, I1) ← arg max_{(v, I0, I1) ∈ B} c^T v
    if v is integral:
        add v to results
        add SplitIntegral(v, I0, I1) to B
        remove (v, I0, I1) from B
    else:
        find a fractional coordinate i
        v^(0) ← LP(I0 ∪ {i}, I1)
        v^(1) ← LP(I0, I1 ∪ {i})
        remove (v, I0, I1) from B
        add (v^(0), I0 ∪ {i}, I1) to B if feasible
        add (v^(1), I0, I1 ∪ {i}) to B if feasible
return results
Algorithm 4 Estimate S(VC) for Binary, Pairwise Graphical Models
Input: a binary, pairwise graphical model LP

# a branch (q, I0, I1/2, I1) means q is optimal for the LP
# with x_{I0} = 0, x_{I1/2} = 1/2, and x_{I1} = 1.
def LP(I0, I1/2, I1):
    optimize the LP with the additional constraints:
        x_{I0} = 0
        x_{I1/2} = 1/2
        x_{I1} = 1
    return q* if feasible, else return null

q ← LP(∅, ∅, ∅)
B ← {(q, ∅, ∅, ∅)}
solution ← { }
while optimal integral vertex not found:
    (q, I0, I1/2, I1) ← arg max_{(q, I0, I1/2, I1) ∈ B} objective value
    add q to solution
    remove (q, I0, I1/2, I1) from B
    for i ∈ V if i ∉ I0 ∪ I1/2 ∪ I1:
        for a ∈ {0, 1/2, 1} if qi ≠ a:
            I0′, I1/2′, I1′ ← copy(I0, I1/2, I1)
            Ia′ ← Ia′ ∪ {i}
            q′ ← LP(I0′, I1/2′, I1′)
            add (q′, I0′, I1/2′, I1′) to B if feasible
return solution
Chapter 4
Experimental Design for Cost-Aware Learning of Causal Graphs
4.1 Introduction
Causality is a fundamental concept in science and an essential tool for multiple disciplines such as engineering, medical research, and economics [Rotmensch et al., 2017, Ramsey et al., 2010, Rubin and Waterman, 2006]. Discovering causal relations has been studied extensively under different frameworks and under various assumptions [Pearl, 2009, Imbens and Rubin, 2015]. To learn the cause-effect relations between variables without any assumptions other than basic modeling assumptions, it is essential to perform experiments. Experimental data combined with observational data has been successfully used for recovering causal relationships in different domains [Sachs et al., 2005].
There is significant cost and time required to set up experiments. Often there are many ways to design experiments to discover cause-and-effect relationships. Considering cost when designing experiments can critically change the total cost needed to learn the same causal system. King et al. [2004] created a robot scientist that would automatically perform experiments to learn how a yeast gene functions. Different experiments required different materials with large variations in cost. By considering material cost when choosing interventions, their robot scientist was able to learn the same causal structure significantly more cheaply.
Since the work of King et al., there have been a number of papers on automated and cost-sensitive experiment design for causal learning in biological systems. Sverchkov and Craven [2017] discuss some aspects of how to design costs. Ness et al. [2017] develop an active learning strategy for cost-aware experiments in protein networks.
We study the problem of cost-aware causal learning in Pearl's framework of causality [Pearl, 2009] under the causal sufficiency assumption, i.e., when there are no latent confounders. In this framework, there is a directed acyclic graph (DAG) called the causal graph that describes the causal relationships between the variables in our system. Learning the direct causal relations between variables is equivalent to learning the directed edges of this graph. From observational data, we can learn the existence of causal edges, as well as some of the edge directions; however, in general we cannot learn the direction of every edge. To learn the remaining edge directions, we need to perform experiments and collect additional data from them [Eberhardt, 2007, Hauser and Bühlmann, 2012b, Hyttinen et al., 2013].
An intervention is an experiment where we force a variable to take a particular value. An intervention is called a stochastic intervention when the intervened variable is assigned the value of another independent random variable. Interventions can be performed on a single variable or on a subset of variables simultaneously. In the non-adaptive setting, which is what we consider here, all interventions are performed in parallel. In this setting, we can only guarantee that an edge direction is learned when there is an intervention that contains exactly one of the edge's endpoints [Kocaoglu et al., 2017].
In the minimum cost intervention design problem, first formalized by Kocaoglu et al. [2017], there is a cost to intervene on each variable, and we want to learn the causal direction of every edge in the graph with minimum total cost. This is a combinatorial optimization problem, and two natural questions that have not yet been addressed are whether the problem is NP-hard and whether the greedy algorithm proposed by Kocaoglu et al. [2017] has any approximation guarantees.
Our contributions:
• We show that the minimum cost intervention design problem is NP-hard.
• We modify the greedy coloring algorithm proposed in [Kocaoglu et al., 2017]. We establish that our modified algorithm is a (2+ε)-approximation algorithm for the minimum cost intervention design problem. Our proof makes use of a connection to submodular optimization.
• We consider the sparse intervention setup where each experiment can include at most k variables. We show a lower bound on the minimum number of interventions and create an algorithm that is a (1 + o(1))-approximation to this problem for sparse graphs with sparse interventions.
• We introduce the minimum cost k-sparse intervention design problem and develop an algorithm that is essentially optimal for the unweighted variant of this problem on sparse graphs. We then discuss how to extend this algorithm to the weighted problem.
4.2 Minimum Cost Intervention Design

4.2.1 Relevant Graph Theory Concepts
We first discuss some graph theory concepts that we utilize in this work.
A proper coloring of a graph G = (V, E) is an assignment of colors c : V → {1, 2, . . . , t} to the vertices V such that for all edges uv ∈ E we have c(u) ≠ c(v). The chromatic number is the minimum number of colors needed for a proper coloring to exist and is denoted by χ.
An independent set of a graph G = (V, E) is a subset of the vertices S ⊆ V such that for all pairs of vertices u, v ∈ S we have uv ∉ E. The independence number is the size of the maximum independent set and is denoted by α. If there is a weight function on the vertices, a maximum weight independent set is an independent set with the largest total weight.
A vertex cover of a graph G = (V, E) is a subset of vertices S such that for every edge uv ∈ E, at least one of u or v is in S. Vertex covers are closely related to independent sets: if S is a vertex cover then V \ S is an independent set, and vice versa. Further, if S is a minimum weight vertex cover then V \ S is a maximum weight independent set. The size of the smallest vertex cover of G is denoted τ.
A chordal graph is a graph such that any cycle v1, v2, . . . , vt with t ≥ 4 has a chord, which is an edge between two vertices that are not adjacent in the cycle. There are linear-time algorithms for finding a minimum coloring, a maximum weight independent set, and a minimum weight vertex cover of a chordal graph. Any induced subgraph of a chordal graph is also chordal.
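For example, a minimum coloring of a chordal graph can be computed by greedy coloring along a maximum cardinality search (MCS) ordering, since each vertex's already-colored neighbors then form a clique. A minimal sketch (our own code, with a hypothetical adjacency-dict representation):

```python
def mcs_order(adj):
    """Maximum cardinality search: repeatedly pick the vertex with the most
    already-visited neighbors. On a chordal graph, greedy coloring along
    this order uses exactly chi colors."""
    order, visited = [], set()
    weight = {v: 0 for v in adj}
    while len(order) < len(adj):
        v = max((u for u in adj if u not in visited), key=lambda u: weight[u])
        order.append(v)
        visited.add(v)
        for u in adj[v]:
            if u not in visited:
                weight[u] += 1
    return order

def greedy_color(adj, order):
    """Assign each vertex the smallest color unused by earlier neighbors."""
    color = {}
    for v in order:
        used = {color[u] for u in adj[v] if u in color}
        color[v] = next(c for c in range(len(adj)) if c not in used)
    return color

# 4-cycle with a chord: chordal, chromatic number 3.
adj = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1, 3}, 3: {0, 2}}
coloring = greedy_color(adj, mcs_order(adj))
```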
Given a graph G = (V,E) and a subset of vertices I ⊆ V , the cut δ(I) is the set of edges uv ∈ E such that u ∈ I and v ∈ V \ I.
4.2.2 Causal Graphs and Interventional Learning
Consider two variables X,Y of a system. If every time we change the value of X, the value of Y changes but not vice versa, then we suspect that variable X causes Y . If we have a set of variables, the same intuition carries through while defining causality. This asymmetry in the directional influence between variables is at the core of causality.
Pearl [2009] and Spirtes et al. [2001] formalized the notion of causality using directed acyclic graphs (DAGs). DAGs are suitable to encode asymmetric relations. Consider a system of n random variables V = {V1, V2, . . . , Vn}. The structural causal model of Pearl models the causal relations between variables as follows: each variable Vi can be written as a deterministic function of a set of other variables Si and an unobserved variable Ei as Vi = fi(Si, Ei). We assume that Ei, called an exogenous random variable, is independent of every variable in V and of all other exogenous variables Ej. The graph that captures these directional relations is called the causal graph between the variables in V. We restrict this graph to be acyclic, so that if we set the value of a variable we potentially change its descendant variables, but its ancestor variables do not change.
Given a causal graph, a variable is said to be caused by its set of parents.¹ This is precisely Si in the structural causal model. It is known that the joint distribution induced on V by a structural causal model factorizes with respect to the causal graph. Thus, the causal graph D is a valid Bayesian network for the observed joint distribution.
There are two main approaches for learning causal graphs from the observational distribution: i) score based [Geiger and Heckerman, 1994, Heckerman et al., 1995] and ii) constraint based [Pearl, 2009, Spirtes et al., 2001]. Score-based approaches optimize a score (e.g., likelihood) over all Bayesian networks to recover the most likely graph. Constraint-based approaches, such as the IC and PC algorithms, use conditional independence tests to identify the causal edges that are invariant across every graph consistent with the observed data. The remaining mixed graph is called the essential graph. The undirected components of the essential graph are always chordal [Spirtes et al., 2001, Hauser and Bühlmann, 2012a].
Although PC runs in time exponential in the maximum degree of the graph, various extensions make it feasible to run even on graphs with 30,000 nodes and maximum degree up to 12 [Ramsey et al., 2017]. To learn the rest of the causal edge directions without additional assumptions, we need to use interventions on the undirected, chordal components.² An intervention is an experiment where a random variable is forced to take a certain value. Due to the acyclicity assumption on the graph, if X → Y, then intervening on Y does not change the distribution of X, but intervening on X changes the distribution of Y. Running observational learning algorithms like PC/IC after an intervention on a set S of variables, we can learn the new skeleton, which allows us to identify the immediate children and immediate parents of the intervened variables. Therefore, if we perform a randomized experiment on a set S of vertices in the causal graph, we can learn the direction of all the edges cut between S and V \ S. This approach has been heavily used in the literature [Hyttinen et al., 2013, Hauser and Bühlmann, 2012b, Shanmugam et al., 2015].

¹To be more precise, parent nodes are said to directly cause a variable, whereas ancestors cause it indirectly through parents. In this paper we do not make this distinction, since we do not use indirect causal relations for graph discovery.
4.2.3 Graph Separating Systems and Minimum Cost Intervention Design
Given a causal DAG D = (V, E), we observe the essential graph E(D). Kocaoglu et al. [2017] established that if we want to guarantee learning the directions of the undirected edges with nonadaptive interventions, it is necessary and sufficient for our intervention design I = {I1, I2, . . . , Im} to be a graph separating system on the undirected component G of the graph.

²It is known that the edges identified in a chordal component of the skeleton do not help identify edges in another component [Hauser and Bühlmann, 2012a]. Thus, each chordal component can be treated as an individual learning problem.
Definition 24 (Graph Separating System). Given an undirected graph G = (V, E), a graph separating system of size m is a collection of m subsets of vertices I = {I1, I2, . . . , Im} such that every edge is cut at least once, that is, ⋃_{I∈I} δ(I) = E.
Recall that the undirected component of the essential graph of a causal DAG is always a chordal graph. We can now define the minimum cost intervention design problem.

Definition 25 (Minimum Cost Intervention Design). Given a chordal graph G = (V, E), a set of weights wv for all v ∈ V, and a size constraint m ≥ ⌈log χ⌉, the minimum cost intervention design problem is to find a graph separating system I of size at most m that minimizes the cost

cost(I) = Σ_{I∈I} Σ_{v∈I} wv.
Graph separating systems are tightly related to graph colorings. Mao-Cheng [1984] proved that the smallest graph separating system has size m = ⌈log χ⌉, where χ is the chromatic number. To see this, for each vertex v we create a binary vector c(v) where c(v)_i = 1 if v ∈ Ii and c(v)_i = 0 if v ∉ Ii. Since two neighboring vertices u and v must have, for some intervention Ii, exactly one of u ∈ Ii or v ∈ Ii, the assignment of vectors to vertices c : V → {0, 1}^m is a proper coloring. With a size-m graph separating system we are able to create 2^m different colors, proving that the size of the smallest separating system is exactly m = ⌈log χ⌉.
The equivalence between graph separating systems and coloring allows us to define an equivalent coloring version of the minimum cost intervention design problem, which was first developed in [Kocaoglu et al., 2017].
Definition 26 (Minimum Cost Intervention Design, Coloring Version). Given a chordal graph G = (V, E), a set of weights wv for all v ∈ V, and the colors C = {0, 1}^m with |C| ≥ χ, the coloring version of the minimum cost intervention design problem is to find a proper coloring c : V → C that minimizes the total cost

cost(c) = Σ_{v∈V} ‖c(v)‖₁ wv.

Given a minimum cost coloring from the coloring variant of the minimum cost intervention design, we can create a minimum cost intervention design. Further, the reduction is approximation preserving.
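The direction from colorings back to separating systems can be sketched directly: write each color in binary and let intervention Ii collect the vertices whose color has bit i set. The helper names below are ours, not the dissertation's:

```python
def coloring_to_separating_system(color, m):
    """Intervention I_i = vertices whose binary-encoded color has bit i set."""
    return [{v for v, c in color.items() if (c >> i) & 1} for i in range(m)]

def cut(I, edges):
    """delta(I): the edges with exactly one endpoint in I."""
    return {e for e in edges if len(set(e) & I) == 1}

# Triangle: chromatic number 3, so m = ceil(log2 3) = 2 interventions suffice.
edges = [(0, 1), (0, 2), (1, 2)]
color = {0: 0, 1: 1, 2: 2}  # a proper 3-coloring
design = coloring_to_separating_system(color, m=2)
covered = set().union(*(cut(I, edges) for I in design))
```

Here the triangle needs χ = 3 colors, so m = ⌈log 3⌉ = 2 interventions cut every edge.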
In practice, it can sometimes be difficult to intervene on a large number of variables. A variant of intervention design of interest is when every intervention can only involve k variables. For this problem, we want our interventions to be a k-sparse graph separating system.
Definition 27 (k-Sparse Graph Separating System). Given an undirected graph G = (V, E), a k-sparse graph separating system of size m is a collection of m subsets of vertices I = {I1, I2, . . . , Im} such that every subset satisfies |Ii| ≤ k and every edge is cut at least once, that is, ⋃_{I∈I} δ(I) = E.
We consider two optimization problems related to k-sparse graph separating systems. In the first, we want to find a graph separating system of minimum size.
Definition 28 (Minimum Size k-Sparse Intervention Design). Given a chordal graph G = (V,E) and a sparsity constraint k, the minimum size k-sparse intervention design problem is to find a k-sparse graph separating system for G of minimum size, that is, we want to minimize the cost
cost(I) = |I|.
For the next problem, we want to find the k-sparse intervention design of minimum cost where there is a cost to intervene on every variable.
Definition 29 (Minimum Cost k-Sparse Intervention Design). Given a chordal graph G = (V,E), a set of weights wv for all v ∈ V , a sparsity constraint k, and a size constraint m, the minimum cost k-sparse intervention design problem is to find a k-sparse graph separating system I of size m that minimizes the cost
cost(I) = Σ_{I∈I} Σ_{v∈I} wv.
4.3 Related Work
One problem of interest is to find the intervention design with the smallest number of interventions. Eberhardt et al. [2005] established that ⌈log n⌉ interventions are sufficient and necessary in the worst case. Eberhardt [2007] established that graph separating systems are necessary across all graphs (the example he used is the complete graph). Hauser and Bühlmann [2012b] establish the connection between graph colorings and intervention designs by using the key observation of Mao-Cheng [1984] that graph colorings can be used to construct graph separating systems, and vice versa. This leads to the necessity and sufficiency of ⌈log χ⌉ experiments, where χ is the chromatic number of the graph.
Since graph coloring can be done efficiently for chordal graphs, we can efficiently create a minimum size intervention design when given as input a chordal skeleton. Similarly, if we are given as input an arbitrary graph, perhaps due to side information on some edge directions, it is NP-hard to find a minimum size intervention design [Hyttinen et al., 2013, Mao-Cheng, 1984].
Hu et al. [2014] proposed a randomized algorithm that requires only O(log log n) experiments and learns the causal graph with high probability.
Closer to our setup, Hyttinen et al. [2013] consider a special case of the minimum cost intervention design problem in which every vertex has cost 1 and the input is the complete graph. They were able to solve this special case optimally. Kocaoglu et al. [2017] were the first to formalize the minimum cost intervention design problem on general chordal graphs and its relationship to the coloring variant. They used the coloring variant to develop a greedy algorithm that finds a maximum weighted independent set and colors this set with the available color of lowest weight. However, their work did not establish approximation guarantees for this algorithm, and it is not clear how many iterations the greedy algorithm needs to fully color the graph; we address these issues in this paper. Further, it was unknown until our work that the minimum cost intervention design problem is NP-hard.
There has been substantial prior work on the setting where every intervention is constrained to have size at most k. Eberhardt et al. [2005] were the first to consider the minimum size k-sparse intervention design problem and established sufficient conditions on the number of interventions needed for the complete graph. Hyttinen et al. [2013] showed how k-sparse separating system constructions can be used for intervention designs on the complete graph using the construction of Katona [1966]. They establish the necessary and sufficient number of k-sparse interventions needed to learn all causal directions in the complete graph. Shanmugam et al. [2015] illustrate that for the complete graph, separating systems are necessary even under the constraint that each intervention has size at most k. They also identify an information-theoretic lower bound on the necessary number of experiments and propose a new optimal k-sparse separating system construction for the complete graph. To the best of our knowledge, there were no graph-dependent bounds on the size of k-sparse graph separating systems prior to our work.
Ghassami et al. [2018] considered the dual problem of maximizing the number of learned causal edges for a given number of interventions. They show that this problem is a submodular maximization problem when only interventions involving a single variable are allowed. We note that their connection to submodularity is different from the one we discover in our work.
Graph coloring has been extensively studied in the literature, and there are many versions of the graph coloring problem. We identify a connection between the minimum cost intervention design problem and the general optimum cost chromatic partition problem (GOCCP). GOCCP is a graph coloring problem where there are t colors and a cost γ_{vi} to color vertex v with color i; it is a more general version of the minimum cost intervention design problem. Jansen [1997] established that for graphs with bounded treewidth r, GOCCP can be solved exactly in time O(t^r n). This implies that for graphs with maximum degree ∆, we can solve the minimum cost intervention design problem exactly in time O(2^{m∆} n). Note that m is at least log ∆ and can be as large as ∆, so this algorithm is not practical even for ∆ = 12.
4.4 Hardness of Minimum Cost Intervention Design
In this section, we show that the minimum cost intervention design problem is NP-hard.
We assume that the input graph is chordal, since it is obtained as an undirected component of a causal graph skeleton. We note that every chordal graph can be realized by this process.
Proposition 30. For any undirected chordal graph G, there is a causal graph D such that the essential graph E(D) = G.
Thus every chordal graph is the undirected subgraph of the essential graph for some causal DAG. This validates the problem definition of the minimum cost intervention design, as any chordal graph can be given as input. We now state our hardness result.
Theorem 31. The minimum cost intervention design problem is NP-hard.
Please see Appendix 4.11 for the proof. Our proof is based on a reduction from numerical three-dimensional matching, following Kroon et al. [1996], to a graph coloring problem on interval graphs that is more general than the minimum cost intervention design problem. Our hardness proof holds even if the vertex costs are all equal to 1 and the input graph is an interval graph, a subclass of chordal graphs that often admits efficient algorithms for problems that are hard on general graphs.
It is worth comparing this to complexity results on the related minimum size intervention design problem. The minimum size intervention design problem on a graph can be solved by finding a minimum coloring of the same graph [Mao-Cheng, 1984, Hauser and Bühlmann, 2012b]. For chordal graphs, graph coloring can be solved efficiently, so the minimum size intervention design problem can also be solved efficiently. In contrast, the minimum cost intervention design problem is NP-hard even on chordal graphs. Both problems are hard on general graphs, which can arise due to side information.
4.5 Approximation Guarantees for Minimum Cost Intervention Design
Since the input graph is chordal, we can find a maximum weighted independent set in polynomial time using Frank's algorithm [Frank, 1975]. Further, a chordal graph remains chordal after removing a subset of its vertices. The authors of [Kocaoglu et al., 2017] use these facts to construct a greedy algorithm for this weighted coloring problem. Let G0 = G. On iteration t, find the maximum weighted independent set in Gt and assign these vertices the available color with the smallest cost. Then let G_{t+1} be the graph after removing the colored vertices from Gt. Repeat until all vertices are colored, then convert the coloring to a graph separating system and return this design.
One issue with this algorithm is that it is not clear how many iterations the greedy algorithm requires before the graph is fully colored. This matters because we want to satisfy the size constraint on the graph separating system. We therefore introduce a quantization step to reduce the number of iterations the greedy algorithm needs to completely color the graph. In Figure 4.3 of Appendix 4.8, we give an example of a (non-chordal) graph where, without quantization, the greedy algorithm requires n/2 colors, but with quantization it requires only 4 colors.
Specifically, we first find the maximum independent set of the input graph and remove it. We then find the maximum cost wmax among the vertices of the remaining graph. For every vertex v in the remaining graph, we replace the cost wv with ⌊wv n³ / wmax⌋. See Algorithm 5 for pseudocode describing our algorithm.
The reason we remove the maximum independent set before quantizing is that the maximum independent set will be colored with a color of weight 0 and thus does not contribute to the cost. We want the quantized costs to not be arbitrarily far from the original costs, except for the vertices that are never intervened on. For example, if there is a vertex with infinite weight, we will never intervene on it; if we were to quantize its weight, the optimal solution to the quantized problem could be arbitrarily far from the true optimal solution. Our method of quantization allows us to show that a good solution for the quantized weights is also a good solution for the true weights.
Algorithm 5 Greedy Coloring Algorithm with Quantization
Input: a chordal graph G = (V, E), positive integral weights wi for all i ∈ V

Quantize the vertex weights:
    S0 ← maximum weighted independent set of G
    wmax ← max_{i ∈ V \ S0} wi
    wi ← ⌊wi n³ / wmax⌋ for all i ∈ V \ S0

Greedy weighted coloring:
    assign S0 color 0
    G1 ← G − S0
    t ← 1
    while Gt is not empty:
        St ← maximum weight independent set of Gt
        color all vertices of St with color t
        G_{t+1} ← Gt − St
        t ← t + 1
convert the coloring of G to a graph separating system I
return I
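The quantization step of Algorithm 5 can be sketched as follows. The brute-force independent-set routine stands in for Frank's algorithm purely for illustration, and all names are our own:

```python
import itertools

def max_weight_independent_set(adj, w):
    """Brute-force stand-in for Frank's linear-time chordal MWIS algorithm."""
    best, best_w = set(), 0
    for r in range(len(adj) + 1):
        for S in itertools.combinations(adj, r):
            S = set(S)
            if all(u not in adj[v] for u in S for v in S):
                if sum(w[v] for v in S) > best_w:
                    best, best_w = S, sum(w[v] for v in S)
    return best

def quantize(adj, w):
    """Remove a maximum weight independent set S0, then replace each
    remaining weight w_v by floor(w_v * n^3 / w_max), as in Algorithm 5."""
    n = len(adj)
    S0 = max_weight_independent_set(adj, w)
    rest = [v for v in adj if v not in S0]
    w_max = max(w[v] for v in rest)
    return S0, {v: (w[v] * n**3) // w_max for v in rest}

# Example: a path 0-1-2 (chordal) with one heavy middle vertex.
adj = {0: {1}, 1: {0, 2}, 2: {1}}
S0, quantized = quantize(adj, {0: 1, 1: 5, 2: 1})
```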
We now state our main theorem, which guarantees that the greedy algorithm with quantization returns a solution within a (2 + ε) factor of the optimal solution while using only log χ + O(log log n) interventions. Our algorithm thus returns a good solution to the minimum cost intervention design problem whenever the allowed number of interventions satisfies m ≥ log χ + O(log log n). Note that m ≥ log χ is required for any graph separating system to exist.
Theorem 32. If the number of interventions m satisfies m ≥ log χ + log log n + 5, then the greedy coloring algorithm with quantization for the minimum cost intervention design problem creates a graph separating system Igreedy such that

cost(Igreedy) ≤ (2 + ε) · OPT,

where ε = exp(−Ω(m)) + 1/n.
See Appendix 4.9 for the proof of the theorem. We present a brief sketch of our proof.
To show that the greedy algorithm uses a small number of colors, we first define a submodular, monotone, and non-negative function such that every vertex has been colored if and only if this particular submodular function is maximized. This is an instance of the submodular cover problem. Wolsey established that the greedy algorithm for the submodular cover problem returns a set with cardinality that is close to the optimal cardinality solution when the values of the submodular function are bounded by a polynomial [Wolsey, 1982]. This is why we need to quantize the weights.
To show that the greedy algorithm returns a solution with small value, we first define a new class of functions which we call supermodular chain functions. We then show that the minimum cost intervention design problem is an instance of optimizing a supermodular chain function. Using results on submodular optimization from [Nemhauser et al., 1978, Krause and Golovin, 2014] and some structural properties of the minimum cost intervention design problem, we are able to show that the greedy algorithm returns an approximately optimal solution.
To relate the quantized weights back to the original weights, we use an analysis that is similar to the analysis used to show the approximation guarantees of the knapsack problem [Ibarra and Kim, 1975].
Finally, we remark on how our algorithm performs when there are vertices with infinite cost. These vertices can be interpreted as variables that cannot be intervened on. If they form an independent set, then they can be colored with the color of weight zero. We maintain our theoretical guarantees in this case, since our quantization procedure first removes the maximum weight independent set. If the variables with infinite cost do not form an independent set, then no valid graph separating system has finite cost.
4.6 Algorithms for k-Sparse Intervention Design Problems
We first establish a lower bound on how large a k-sparse graph separating system must be for a graph G, based on the size τ of the smallest vertex cover of the graph.
Proposition 33. For any graph G, the size m*_k of the smallest k-sparse graph separating system satisfies m*_k ≥ τ/k, where τ is the size of the smallest vertex cover of G.
See Appendix 4.10 for the proof.
Algorithm 6 Algorithm for Min Size and Unweighted Min Cost k-Sparse Intervention Design
Input: A chordal graph G, a sparsity constraint k.
S ← minimum size vertex cover of G.
G_S ← induced graph of S in G.
Find an optimal coloring of G_S.
Split the color classes of G_S into size k intervention sets I_1, I_2, ..., I_m.
Return I = {I_1, I_2, ..., I_m}.
We use Algorithm 6 to find a small k-sparse graph separating system. It first finds a minimum cardinality vertex cover S. It then finds an optimal coloring of the graph induced by the vertices of S. It then partitions the color classes into independent sets of size k and performs an intervention for each of these partitions. Since the set of vertices not in a vertex cover is an independent set, this is a valid k-sparse graph separating system.
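As an illustration, here is a minimal Python sketch of the pipeline in Algorithm 6. The exact minimum vertex cover and optimal coloring subroutines that the algorithm assumes are replaced by simple greedy stand-ins (a 2-approximate cover and a greedy coloring), so this is a hypothetical implementation of the overall structure rather than of the algorithm with its exact subroutines.

```python
from itertools import count

def sparse_intervention_design(adj, k):
    """Sketch of Algorithm 6: cover, color, then split color classes
    into interventions of size at most k.
    adj: dict mapping each vertex to the set of its neighbors."""
    # Greedy stand-in for the minimum vertex cover step: take both
    # endpoints of any uncovered edge (a 2-approximation).
    cover = set()
    for u in adj:
        for v in adj[u]:
            if u not in cover and v not in cover:
                cover.update((u, v))
    # Greedy stand-in for the optimal coloring of the induced subgraph.
    color = {}
    for v in sorted(cover):
        used = {color[u] for u in adj[v] if u in color}
        color[v] = next(c for c in count() if c not in used)
    # Each color class is an independent set; split it into blocks of
    # size at most k, one intervention per block.
    classes = {}
    for v, c in color.items():
        classes.setdefault(c, []).append(v)
    interventions = []
    for cls in classes.values():
        for i in range(0, len(cls), k):
            interventions.append(cls[i:i + k])
    return interventions
```

On the path a–b–c with k = 1 this returns two singleton interventions on the cover vertices; every edge then has an intervention containing exactly one of its endpoints, which is the graph separating property.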
When the sparsity k and the maximum degree ∆ are small, Algorithm 6 is nearly optimal. Using Proposition 33, we can establish the following approximation guarantee on the size of the graph separating system created.
Theorem 34. Given a chordal graph G with maximum degree ∆, Algorithm 6 finds a k-sparse graph separating system of size m_k such that

m_k ≤ (1 + k(∆ + 1)∆/n) OPT,

where OPT is the size of the smallest k-sparse graph separating system.
See Appendix 4.10 for the proof. If the sparsity constraint k and the maximum degree ∆ of the graph both satisfy k, ∆ = o(n^{1/3}), then Theorem 34 implies that we have a 1 + o(1) approximation to the optimal solution.
One interesting aspect of Algorithm 6 is that every vertex is intervened on at most once and the set of vertices never intervened on is a maximum cardinality independent set. By an argument similar to Theorem 2 of [Kocaoglu et al., 2017], this algorithm is optimal in the unweighted case.
Corollary 35. Given an instance of the minimum cost k-sparse intervention design problem with a chordal graph G with maximum degree ∆ and vertex cover of size τ, sparsity constraint k, size constraint m ≥ (τ/k)(1 + k(∆ + 1)∆/n), and all vertex weights w_v = 1, Algorithm 6 returns a solution with optimal cost.
We show one way to extend Algorithm 6 to the weighted case. There is a trade-off between the size and the weight of the independent set of vertices that are never intervened on. We can trade these off by adding a penalty λ to every vertex weight, i.e., the new weight w_v^λ of a vertex v is w_v^λ = w_v + λ. Larger values of λ encourage independent sets of larger size. See Algorithm 7 for the pseudocode describing this algorithm. We can run Algorithm 7 for various values of λ to explore the trade-off between cost and size.
4.7 Experiments
We generate chordal graphs following the procedure of [Shanmugam et al., 2015]; however, we modify the sampling algorithm so that we can
Algorithm 7 Algorithm for Weighted Min Cost k-Sparse Intervention Design
Input: chordal graph G, sparsity constraint k, vertex weights w_v, penalty parameter λ.
S ← minimum weight vertex cover of G using weights w_v^λ = w_v + λ.
G_S ← induced graph of S in G.
Find an optimal coloring of G_S.
Split the color classes of G_S into size k intervention sets I_1, I_2, ..., I_m.
Return I = {I_1, I_2, ..., I_m}.
control the maximum degree. First we order the vertices {v_1, v_2, ..., v_n}. For vertex v_i we choose a vertex from {v_{i−b}, v_{i−b+1}, ..., v_{i−1}} uniformly at random and add it to the neighborhood of v_i. We then go through the vertices {v_{i−b}, v_{i−b+1}, ..., v_{i−1}} and add each of them to the neighborhood of v_i with probability d/b. We then add edges so that the neighbors of v_i in {v_1, v_2, ..., v_{i−1}} form a clique. This is guaranteed to be a connected chordal graph with maximum degree bounded by 2b.
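A minimal sketch of this sampler, with our own function name and interface:

```python
import random

def sample_chordal(n, b, d, seed=0):
    """Sketch of the sampler described above: vertex i links to one
    uniform earlier vertex in a window of size b, each other window
    vertex joins with probability d / b, and the earlier neighborhood
    is completed into a clique. Completing the earlier neighborhood
    keeps the vertex order a perfect elimination ordering, which is
    why the resulting graph is chordal."""
    rng = random.Random(seed)
    nbrs = {i: set() for i in range(n)}
    for i in range(1, n):
        window = list(range(max(0, i - b), i))
        earlier = {rng.choice(window)}
        for j in window:
            if rng.random() < d / b:
                earlier.add(j)
        # Complete the earlier neighborhood of i into a clique.
        for u in earlier:
            for v in earlier:
                if u != v:
                    nbrs[u].add(v)
                    nbrs[v].add(u)
        for j in earlier:
            nbrs[i].add(j)
            nbrs[j].add(i)
    return nbrs
```

Because every vertex after the first attaches to at least one earlier vertex, the sampled graph is always connected.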
In our first experiment we compare the greedy algorithm to two other algorithms. The first assigns the weight 0 color to the maximum weight independent set, then finds a minimum coloring of the remaining vertices, sorts the independent sets by weight, and assigns the cheapest colors to the independent sets of highest weight. The other algorithm finds the optimal solution with integer programming using the Gurobi solver [Gurobi Optimization, LLC, 2018]. The integer programming formulation is standard (see, e.g., [Delle Donne and Marenco, 2016]).
We compare the cost of the different algorithms when we (a) adjust the number of vertices while maintaining the average degree and (b) adjust the average degree while maintaining the number of vertices. We see that the greedy coloring algorithm performs almost optimally. We also see that it is able to find a proper coloring even with only m = 5 interventions and no quantization. See Figure 4.1 for the complete results.
In our second experiment we see how Algorithm 7 allows us to trade off the number of interventions and the cost of the interventions in the k-sparse minimum cost intervention design problem. See Figure 4.2 for the results.
Finally, we observe the empirical running time of the greedy algorithm. We generate graphs on 10,000 vertices with maximum degree 20 and use 5 interventions. The greedy algorithm terminates in 5 seconds. In contrast, the integer programming solution takes 128 seconds using the Gurobi solver [Gurobi Optimization, LLC, 2018].
Appendix
4.8 Example Graph Where Quantization Helps Greedy

4.9 Proof of Approximation Guarantees of the Quantized Greedy Algorithm
In this section we prove our approximation guarantee of using the quantized greedy algorithm for minimum cost intervention design.
Theorem 32. If the number of interventions m satisfies m ≥ log χ + log log n + 5, then the greedy algorithm with quantization for the minimum cost intervention design problem creates a graph separating system I_greedy such that

cost(I_greedy) ≤ (2 + ε)OPT,

where ε = exp(−Ω(m)) + n^{−1}.
4.9.1 Submodularity Background
Our proof uses several results from submodularity. A set function F over a ground set V is a function that takes as input a subset of V and outputs a real number. We say that the function F is submodular if for all v ∈ V and A ⊆ B ⊆ V \{v} the function satisfies the diminishing returns property
F (A ∪ {v}) − F (A) ≥ F (B ∪ {v}) − F (B).
We say that the function F is monotone if for all A ⊆ B ⊆ V we have that F (A) ≤ F (B). We say that F is non-negative if for all A ⊆ V we have that F (A) ≥ 0.
One classic problem in submodular optimization is finding a set A with cardinality constraint |A| ≤ k that maximizes a submodular, monotone, and non-negative function F. The greedy algorithm starts with the empty set A_0 = ∅, selects the item v_{k+1} = arg max_{v∈V} [F(A_k ∪ {v}) − F(A_k)], and then updates A_{k+1} = A_k ∪ {v_{k+1}}.
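For concreteness, the greedy iteration just described can be written as a short sketch, with the submodular function passed in as a callable on sets; the coverage function in the usage example is our own illustration of a submodular, monotone, non-negative function.

```python
def greedy_maximize(ground, F, k):
    """Greedy maximization of a monotone submodular set function F:
    repeatedly add the element with the largest marginal gain
    F(A | {v}) - F(A), for up to k iterations."""
    A = set()
    for _ in range(k):
        candidates = ground - A
        if not candidates:
            break
        best = max(candidates, key=lambda v: F(A | {v}) - F(A))
        A.add(best)
    return A

# Illustrative coverage function: F(A) = size of the union of the
# groups selected by A (submodular, monotone, non-negative).
groups = {'a': {1, 2}, 'b': {2, 3}, 'c': {3, 4, 5}}
F = lambda A: len(set().union(*(groups[x] for x in A))) if A else 0
```

Here the first greedy pick is 'c' (gain 3) and the second is 'a' (gain 2), covering all five elements.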
The classic result of Nemhauser and Wolsey established that the greedy algorithm is a (1 − 1/e)-approximation algorithm to the optimal [Nemhauser et al., 1978]. Krause and Golovin generalized this to show that if the greedy algorithm selects ⌈Ck⌉ elements for some positive value C, then it is a (1 − e^{−C})-approximation to the optimal solution of size k.
Theorem 36 ([Nemhauser et al., 1978, Krause and Golovin, 2014]). Given a submodular, monotone, and non-negative function F over a ground set V and a cardinality constraint k, let OPT be defined as

OPT = max_{A⊆V : |A|≤k} F(A).

If the greedy algorithm for this problem runs for ⌈Ck⌉ iterations, for some positive value C, it returns a set A_greedy^{Ck} such that

F(A_greedy^{Ck}) ≥ (1 − e^{−C})OPT.
Another important problem in submodular function optimization is the submodular set cover problem, which is a generalization of the set cover problem. Given a submodular, monotone, and non-negative function F that maps subsets of a ground set V to integers, we want to find a set A of minimum cardinality such that F(A) = F(V). The greedy algorithm is a natural approach to this problem: we run greedy iterations until the set satisfies the submodular set cover constraint. Let w_max = max_{v∈V} F({v}). Wolsey established that the cardinality of the set returned by the greedy algorithm is a 1 + ln w_max approximation to the minimum cardinality solution [Wolsey, 1982].
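The submodular cover greedy is the same iteration with a different stopping rule, sketched below; the coverage instance is again our own illustration.

```python
def greedy_cover(ground, F):
    """Greedy for the submodular cover problem: keep adding the
    element of largest marginal gain until F(A) reaches F(ground)."""
    A = set()
    target = F(ground)
    while F(A) < target:
        best = max(ground - A, key=lambda v: F(A | {v}) - F(A))
        A.add(best)
    return A

# Illustrative coverage instance: cover all of {1, 2, 3, 4}.
sets_ = {'a': {1, 2, 3}, 'b': {1, 2}, 'c': {3, 4}}
cov = lambda A: len(set().union(*(sets_[x] for x in A))) if A else 0
```

On this instance the greedy picks 'a' (gain 3) and then 'c' (gain 1), stopping once all four elements are covered.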
Theorem 37 ([Wolsey, 1982]). Given a submodular, monotone, and non-negative function F that maps subsets of a ground set V to integers, let OPT be defined as

OPT = min_{A⊆V : F(A)=F(V)} |A|.

Let w_max = max_{v∈V} F({v}). The greedy algorithm for this problem returns a set A_greedy such that

|A_greedy| ≤ (1 + ln w_max)OPT.
4.9.2 Bound on the Quantized Greedy Algorithm Solution Size
In this section we show that after χ(2 + 5 ln n) + 1 rounds the greedy algorithm with quantization will have colored every vertex in the graph. Since the number of possible colors in a graph separating system of size m is 2^m, this implies that when m ≥ log χ + log log n + 4, there are enough colors for the greedy algorithm to fully color the graph.
Lemma 38. If the intervention size is m ≥ log χ + log log n + 4, then the greedy algorithm will terminate using at most 2^m colors.
Proof. The greedy algorithm first colors the maximum weight independent set, using one color. We will denote the remaining graph by G = (V, E).
The weights of the remaining vertices are quantized to integers such that the maximum weight is bounded by n³. Let 𝒜 be the set of all independent sets in G. The maximum weight of an independent set in G is bounded by n⁴. Let W be the function that takes a set of independent sets A ⊆ 𝒜 and outputs the value

W(A) = Σ_{v ∈ ⋃_{a∈A} a} w_v,

that is, it takes a set of independent sets and returns the sum of the weights of the vertices in their union. It can be verified that this function is submodular, monotone, and non-negative.
We will assume for now that the weights are all positive. If we have a set of independent sets A such that W(A) = W(𝒜), then every vertex in the graph has been covered. Since the minimum cardinality of such a cover is at most χ and the maximum weight of an independent set is n⁴, by Theorem 37, the greedy algorithm will terminate after χ(1 + 4 ln n) iterations.
To handle vertices of weight 0, note that covering the remaining vertices is a set cover problem. Thus the greedy algorithm will need no more than χ(1 + ln n) colors to color the remaining vertices, giving a total of χ(2 + 5 ln n) + 1 colors.
We have the following corollary by noting that adding an extra intervention doubles the number of allowed colors.
Corollary 39. If the intervention size is m ≥ log χ + log log n + 5, then the greedy algorithm will terminate using at most 2^m colors such that all color vectors c have weight ‖c‖_1 ≤ ⌈m/2⌉.
4.9.3 Submodular and Supermodular Chain Problem
In this section we define two new types of submodular optimization problems, which we call the submodular chain problem and the supermodular chain problem. We will use these in our proof of the approximation guarantees of the greedy algorithm with quantization.
Definition 40. Given integers k_1, k_2, ..., k_m and a submodular, monotone, and non-negative function F over a ground set V, the submodular chain problem is to find sets A_1, A_2, ..., A_m ⊆ V with |A_i| ≤ k_i that maximize

Σ_{i=1}^{m} F(A_1 ∪ A_2 ∪ ··· ∪ A_i).
Throughout this section we will assume that m is an even number.
The greedy algorithm for this problem first chooses the set A_1 of cardinality k_1 that maximizes F(A_1). It then chooses the set A_2 of cardinality k_2 that maximizes F(A_1 ∪ A_2). It continues this process until all A_i are chosen.
Note that by using the greedy algorithm and Theorem 36 we can obtain a (1 − 1/e) approximation to this problem. However, we will instead use the following guarantee.
Lemma 41. Let A_1^*, A_2^*, ..., A_m^* be the optimal solution to the submodular chain problem. Suppose that for all 1 ≤ p ≤ m/2 − 1 we have Σ_{i=1}^{2p} k_i ≥ C Σ_{i=1}^{p} k_i. Also assume that F(A_1 ∪ A_2 ∪ ··· ∪ A_m) = F(V). Then the greedy algorithm for the submodular chain problem returns sets A_1, A_2, ..., A_m such that

Σ_{i=0}^{m} F(A_1 ∪ A_2 ∪ ··· ∪ A_i) ≥ F(V) + 2(1 − e^{−C}) Σ_{i=0}^{m/2−1} F(A_1^* ∪ A_2^* ∪ ··· ∪ A_i^*).
Proof. Since Σ_{p=1}^{2i} k_p ≥ C Σ_{p=1}^{i} k_p, by Theorem 36 we have that

F(A_1 ∪ A_2 ∪ ··· ∪ A_{2i}) ≥ (1 − e^{−C}) F(A_1^* ∪ A_2^* ∪ ··· ∪ A_i^*).
We thus have
Σ_{i=0}^{m/2−1} F(A_1 ∪ A_2 ∪ ··· ∪ A_{2i}) ≥ (1 − e^{−C}) Σ_{i=0}^{m/2−1} F(A_1^* ∪ A_2^* ∪ ··· ∪ A_i^*).
To conclude the proof, use the monotonicity of the submodular function F to observe that

Σ_{i=0}^{m} F(A_1 ∪ A_2 ∪ ··· ∪ A_i)
 = F(A_1 ∪ A_2 ∪ ··· ∪ A_m) + Σ_{i=0}^{m/2−1} [F(A_1 ∪ ··· ∪ A_{2i}) + F(A_1 ∪ ··· ∪ A_{2i+1})]
 = F(V) + Σ_{i=0}^{m/2−1} [F(A_1 ∪ ··· ∪ A_{2i}) + F(A_1 ∪ ··· ∪ A_{2i+1})]
 ≥ F(V) + 2 Σ_{i=0}^{m/2−1} F(A_1 ∪ ··· ∪ A_{2i}).
We define the supermodular chain problem similarly.
Definition 42. Given integers k_1, k_2, ..., k_m and a submodular, monotone, and non-negative function F over a ground set V, the supermodular chain problem is to find sets A_1, A_2, ..., A_m ⊆ V with |A_i| ≤ k_i that minimize

Σ_{i=0}^{m} [F(V) − F(A_1 ∪ A_2 ∪ ··· ∪ A_i)].
We establish the following guarantee for the greedy algorithm on the supermodular chain problem.
Lemma 43. Let A_1^*, A_2^*, ..., A_m^* be the optimal solution to the supermodular chain problem. Suppose that for all 1 ≤ p ≤ m/2 − 1 we have Σ_{i=1}^{2p} k_i ≥ C Σ_{i=1}^{p} k_i. Also assume that F(A_1 ∪ A_2 ∪ ··· ∪ A_m) = F(V). Then the greedy algorithm for the supermodular chain problem returns sets A_1, A_2, ..., A_m such that

Σ_{i=0}^{m} [F(V) − F(A_1 ∪ A_2 ∪ ··· ∪ A_i)] ≤ e^{−C} m F(V) + 2 Σ_{i=0}^{m} [F(V) − F(A_1^* ∪ A_2^* ∪ ··· ∪ A_i^*)].

Proof. Starting from Lemma 41, we have that

(m + 1)F(V) − Σ_{i=0}^{m} F(A_1 ∪ A_2 ∪ ··· ∪ A_i)
 ≤ m F(V) − 2(1 − e^{−C}) Σ_{i=0}^{m/2−1} F(A_1^* ∪ A_2^* ∪ ··· ∪ A_i^*)
 ≤ e^{−C} m F(V) + m F(V) − 2 Σ_{i=0}^{m/2−1} F(A_1^* ∪ A_2^* ∪ ··· ∪ A_i^*)
 = e^{−C} m F(V) + 2 Σ_{i=0}^{m/2−1} [F(V) − F(A_1^* ∪ A_2^* ∪ ··· ∪ A_i^*)].

Using the monotonicity of the submodular function F, we can continue with

e^{−C} m F(V) + 2 Σ_{i=0}^{m/2−1} [F(V) − F(A_1^* ∪ A_2^* ∪ ··· ∪ A_i^*)]
 ≤ e^{−C} m F(V) + 2 Σ_{i=0}^{m} [F(V) − F(A_1^* ∪ A_2^* ∪ ··· ∪ A_i^*)]

and conclude that

Σ_{i=0}^{m} [F(V) − F(A_1 ∪ A_2 ∪ ··· ∪ A_i)]
 = (m + 1)F(V) − Σ_{i=0}^{m} F(A_1 ∪ A_2 ∪ ··· ∪ A_i)
 ≤ e^{−C} m F(V) + 2 Σ_{i=0}^{m} [F(V) − F(A_1^* ∪ A_2^* ∪ ··· ∪ A_i^*)].
4.9.4 Proof of Quantized Greedy Algorithm Approximation Guarantees
For simplicity, we will assume that the number of interventions m is divisible by 4.
We will need the following lemma, which can be proved by standard binomial approximations.
Lemma 44. If m and t are integers such that t ≤ m/4, we have

Σ_{i=1}^{2t} (m choose i) ≥ Ω(m) (1 + Σ_{i=1}^{t} (m choose i)).
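Lemma 44 is a statement about binomial sums; a quick numerical check (not a proof, and with our own helper name) illustrates the gap it asserts.

```python
from math import comb

def weight_counts(m, t):
    """Return the two binomial sums compared in Lemma 44: the number
    of nonzero color vectors of weight at most 2t, and one plus the
    number of weight at most t."""
    lhs = sum(comb(m, i) for i in range(1, 2 * t + 1))
    rhs = 1 + sum(comb(m, i) for i in range(1, t + 1))
    return lhs, rhs
```

For m = 20 and t = 5 the left-hand sum already exceeds m times the right-hand sum.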
We use the following technical lemma to prove our approximation guarantee. We defer the proof to Section 4.9.5.
Lemma 45. Let A^* be the optimal solution to the coloring problem. Let A^+ be the optimal solution to the coloring problem when we force it to color the maximum weight independent set with the weight 0 color, but allow it an extra color of weight 1. That is, it can color m + 1 independent sets with a color of weight 1, rather than the usual m independent sets. We have

cost(A^+) ≤ cost(A^*).
We can now show that the quantized greedy algorithm is a good ap- proximation to the optimal solution to the quantized problem.
Lemma 46. Suppose all the weights in the graph that are not in the maximum weight independent set are bounded by n³. Then if the number of interventions m satisfies m ≥ log χ + log log n + 5, the greedy coloring algorithm returns a solution I of cost

cost(I) ≤ (2 + exp(−Ω(m)))OPT.
Proof. Let 𝒜 be the set of all independent sets in G. Let W be the function that takes a set of independent sets A ⊆ 𝒜 and outputs the value

W(A) = Σ_{v ∈ ⋃_{a∈A} a} w_v,

that is, it takes a set of independent sets and returns the sum of the weights of the vertices in their union. It can be verified that this function is submodular, monotone, and non-negative.
A feasible solution to the coloring variant of the minimum cost intervention design problem is a coloring that maps vertices to color vectors in {0, 1}^m. The colors of weight i are the color vectors c such that ‖c‖_1 = i. We can describe a feasible solution to the coloring variant of the minimum cost intervention design problem by A_0, A_1, A_2, ..., A_m, where A_i is the set of independent sets that are colored with a color vector of weight i.
One simplifying assumption is that the optimal solution A^* and the greedy solution A both use the color of weight 0 to color the maximum weight independent set. By Lemma 45 this assumption is valid if we allow the optimal solution to use an additional color of weight 1. We then only need to show the approximation guarantee on the sets A_1, A_2, ..., A_m.
We can calculate the cost of a feasible solution of the minimum cost intervention design problem by

cost(A_1, ..., A_m) = Σ_{i=0}^{m} [W(𝒜) − W(A_1 ∪ A_2 ∪ ··· ∪ A_i)],

where |A_i| ≤ (m choose i). This is an instance of the supermodular chain problem.
Using Corollary 39, the greedy algorithm will terminate using only colors of weight at most m/2, so we only need to show optimality of the sets A_1, A_2, ..., A_{m/2}. By Lemma 44, the number of colors of weight at most 2t is a factor of Ω(m) more than the number of colors the optimal solution uses of weight at most t, even after including the extra color given to the optimal solution. By Lemma 43 and using the monotonicity of W, we have that
cost(I_greedy) = cost(A_1, A_2, ..., A_{m/2})
 = Σ_{i=0}^{m/2} [W(𝒜) − W(A_1 ∪ A_2 ∪ ··· ∪ A_i)]
 ≤ e^{−Ω(m)} (m/2) W(𝒜) + 2 Σ_{i=0}^{m/2} [W(𝒜) − W(A_1^* ∪ A_2^* ∪ ··· ∪ A_i^*)]
 ≤ e^{−Ω(m)} (m/2) W(𝒜) + 2 Σ_{i=0}^{m} [W(𝒜) − W(A_1^* ∪ A_2^* ∪ ··· ∪ A_i^*)]
 = e^{−Ω(m)} (m/2) W(𝒜) + 2 OPT.
To conclude, observe that OPT ≥ W(𝒜), since every vertex not in the maximum weight independent set is colored with a color of weight at least 1.
Lemma 46 shows an approximation guarantee of the quantized greedy algorithm relative to the optimal solution of the quantized problem. To relate the quantized greedy algorithm to the true optimal solution, we use the following lemma, which we prove in Section 4.9.5.
Lemma 47. Suppose an intervention design I is an α-approximation solution to the optimal solution to the quantized problem. Then it is an (α + n−1)- approximation to the optimal solution to the original problem.
With Lemma 47, we can conclude the proof of Theorem 32.
4.9.5 Proof of Technical Lemmas
Lemma 45. Let A^* be the optimal solution to the coloring problem. Let A^+ be the optimal solution to the coloring problem when we force it to color the maximum weight independent set with the weight 0 color, but allow it an extra color of weight 1. That is, it can color m + 1 independent sets with a color of weight 1, rather than the usual m independent sets. We have

cost(A^+) ≤ cost(A^*).
Proof. Let a_0^* and a_0^+ be the sets of vertices covered with the color of weight 0 for A^* and A^+, respectively. From the optimality of a_0^+ as a maximum weight independent set, we have

Σ_{i ∈ a_0^+ \ a_0^*} w_i ≥ Σ_{i ∈ a_0^* \ a_0^+} w_i.

Consider a new coloring A′, also with an extra weight 1 color, that uses
a_0^+ as the set of vertices colored with the weight 0 color, uses a_0^* \ a_0^+ as the set of vertices colored with the extra weight 1 color, and then performs the same coloring as A^*, removing the vertices that are already colored.
The only vertices colored by A′ with a positive cost and a different color than in A^* are those in a_0^* \ a_0^+, which are all colored with a color of weight 1. The only vertices colored by A^* with a positive cost and a different color than in A′ are those in a_0^+ \ a_0^*. Let c_v^* be the cost to color vertex v using A^*. We can thus conclude
cost(A′) − cost(A^*) = Σ_{v ∈ a_0^* \ a_0^+} w_v − Σ_{v ∈ a_0^+ \ a_0^*} c_v^* w_v
 ≤ Σ_{v ∈ a_0^* \ a_0^+} w_v − Σ_{v ∈ a_0^+ \ a_0^*} w_v ≤ 0.
Lemma 47. Suppose an intervention design I is an α-approximation solution to the optimal solution to the quantized problem. Then it is an (α + n−1)- approximation to the optimal solution to the original problem.
Proof. This is a modification of the proof of the FPTAS for the knapsack problem [Ibarra and Kim, 1975] (see also [Williamson and Shmoys, 2011]).
Let c^* be the optimal coloring under the original weights, c′ the optimal coloring under the quantized weights, and c the approximate coloring.
Let w_v be the true weight and w_v′ the quantized weight. Let µ = w_max/n³. Since w_v′ = ⌊w_v/µ⌋, we have w_v ≤ µ(w_v′ + 1) and w_v′ ≤ w_v/µ. We also have cost(I^*) ≥ w_max.
We thus have
cost(I) = Σ_{v∈V} ‖c(v)‖_1 w_v
 ≤ µ Σ_{v∈V} ‖c(v)‖_1 (w_v′ + 1)
 = µ Σ_{v∈V} ‖c(v)‖_1 w_v′ + µ Σ_{v∈V} ‖c(v)‖_1
 ≤ αµ Σ_{v∈V} ‖c′(v)‖_1 w_v′ + µ Σ_{v∈V} ‖c(v)‖_1.
Using the optimality of c′ under the quantized weights, we have

αµ Σ_{v∈V} ‖c′(v)‖_1 w_v′ + µ Σ_{v∈V} ‖c(v)‖_1
 ≤ αµ Σ_{v∈V} ‖c^*(v)‖_1 w_v′ + µ Σ_{v∈V} ‖c(v)‖_1
 ≤ α Σ_{v∈V} ‖c^*(v)‖_1 w_v + µ Σ_{v∈V} ‖c(v)‖_1
 = α OPT + µ Σ_{v∈V} ‖c(v)‖_1
 ≤ α OPT + µmn
 = α OPT + w_max mn/n³
 ≤ α OPT + w_max/n
 ≤ (α + n^{−1}) OPT.
4.10 Proof of Results on k-Sparse Intervention Design Problems
Proposition 33. For any graph G, the size of the smallest k-sparse graph separating system m_k^* satisfies m_k^* ≥ τ/k, where τ is the size of the smallest vertex cover in the graph G.
Proof. Suppose that there exists a graph separating system I of size m_k^* < τ/k. Note that the vertices in S = ⋃_{I∈I} I form a vertex cover. The number of vertices in S is |S| ≤ k m_k^* < τ, contradicting the fact that the smallest vertex cover has τ vertices.
Theorem 34. Given a chordal graph G with maximum degree ∆, Algorithm 6 finds a k-sparse graph separating system of size m_k such that

m_k ≤ (1 + k(∆ + 1)∆/n) m_k^*,

where m_k^* is the size of the smallest k-sparse graph separating system.
Proof. Given the vertices S in the smallest vertex cover of the graph, we can color these vertices with ∆ + 1 colors. We can then partition the color classes into at most τ/k + ∆ + 1 independent sets of size at most k, as we have at most τ/k sets of size exactly k and at most ∆ + 1 sets that cannot be filled to exactly k vertices due to rounding.
Note that the size of the smallest vertex cover τ satisfies τ ≥ n/∆. We have

∆ + 1 = (k(∆ + 1)∆/n) · (n/(k∆)) ≤ (k(∆ + 1)∆/n) · (τ/k) ≤ (k(∆ + 1)∆/n) · m_k^*.
Thus we use at most τ/k + ∆ + 1 ≤ (1 + k(∆ + 1)∆/n) m_k^* interventions.
4.11 Proof of NP-Hardness
We establish the following theorem in this section.
Theorem 48. The minimum cost intervention design problem is NP-hard, even if every vertex has weight 1 and the input graph is an interval graph.
Theorem 31 follows immediately from Theorem 48.
First, we need to introduce the numerical three dimensional matching problem:
Definition 49 (Numerical Three Dimensional Matching). Given a positive integer t and 3t rational numbers a_i, b_i, c_i satisfying Σ_{i=1}^{t} (a_i + b_i + c_i) = t and 0 < a_i, b_i, c_i < 1 for all i ∈ [t], do there exist permutations ρ, σ of [t] such that a_i + b_{ρ(i)} + c_{σ(i)} = 1 for all i ∈ [t]?
The numerical three dimensional matching problem is known to be strongly NP-complete [Garey and Johnson, 1979].
Kroon et al. [1996] reduce the numerical three dimensional matching problem to the optimal cost chromatic partition problem on interval graphs. The input to an instance of the optimal cost chromatic partition problem is a graph and a set of weighted colors. The cost to color a vertex with a given color is the weight of that color. The cost of a coloring is the sum of the coloring costs of the vertices. A solution to the problem is a valid coloring of minimum cost.
Kroon et al. show that the optimal cost chromatic partition problem is NP-hard, even if the input graph is an interval graph and the color weights take four values: 0, 1, 2, and Θ(n). They use the following construction for their reduction.
Suppose we are given an instance of the numerical three dimensional matching problem containing the numbers a_i, b_i, c_i for i ∈ [t]. For i, j ∈ [t], define rational numbers A_i, B_j, and X_{ij} such that 4 < A_i < 5 < B_j < 6 and 7 < X_{ij} < 9. The following are the intervals of the graph used in [Kroon et al., 1996] (see the original paper for an image):
Interval                     Occurrences                Clique ID
(0, 1)                       t times                    I
(0, 3)                       t² − t times               I
(0, A_i)                     ∀i ∈ [t], t − 1 times      I
(0, B_j)                     ∀j ∈ [t]                   I
(1, 2)                       t times                    II
(2, A_i)                     ∀i ∈ [t]                   III
(3, B_j)                     ∀j ∈ [t], t − 1 times      III
(A_i, X_{i,j})               ∀i, j ∈ [t]                IV
(B_j, X_{i,j})               ∀i, j ∈ [t]                IV
(X_{i,j}, 10 + a_i + b_j)    ∀i, j ∈ [t]                V
(X_{i,j}, 14)                ∀i, j ∈ [t]                V
(11 − c_k, 13)               ∀k ∈ [t]                   VI
(12, 14)                     t² − t times               VII
(13, 14)                     t times                    VII
They establish that it is NP-complete to decide if there is a coloring of cost at most 11t² − 5t when there are t colors of weight 0, t² − t colors of weight 1, t² colors of weight 2, and all other colors have weight 3. However, they omit the proof, so we include a proof here. We use the clique IDs we added in the definition of the interval graph.
Proof. If there is a solution to the numerical three dimensional matching problem, then there exists a coloring of cost at most 11t² − 5t; see the original paper for the proof of this [Kroon et al., 1996]. They also prove that if there is a coloring of cost at most 11t² − 5t that only uses the colors of weight 0, 1, and 2, then it can be used to construct a solution to the numerical three dimensional matching problem.
Now we show that if the coloring uses a color of weight 3, then it must have a cost strictly greater than 11t2 − 5t. Note that all the vertices with the same clique ID indeed do form a clique. Consider the subgraph containing all the vertices of the original graph, but only the edges between vertices with the same clique ID.
The optimal way to color a clique of size k is to use one instance of each of the k cheapest colors. From this, we can see that the optimal cost of coloring the subgraph is 11t² − 5t.
We can also see that any coloring of this subgraph that uses a color of weight 3 has a cost strictly larger than 11t² − 5t. Since there is always an available color of weight less than 3, if we swap the color of weight 3 with an available, cheaper color, the cost decreases by at least 1. Since the cost of the coloring after the swap cannot be lower than 11t² − 5t, the coloring before the swap must have had cost strictly larger than 11t² − 5t.
Since a valid coloring of the original graph is a valid coloring of the subgraph, and the cost of a coloring of the original graph is the same as its cost on the subgraph, we see that the cost of any coloring of the original graph that uses a color of weight 3 must be strictly larger than 11t² − 5t.
We also see that the problem remains hard when there are t colors of weight 1, t² − t colors of weight 2, t² colors of weight 3, and all other colors have weight 4. This is because the cost of a coloring using these new colors is exactly an additive factor n more than with the original colors. Thus a coloring that minimizes the cost using the new colors also minimizes the cost using the original colors, and it is NP-complete to decide if there exists a coloring of cost at most 19t² − 3t.
We will define another interval graph by adding the following intervals. Set ε and δ to be nonnegative rational numbers such that ε ≠ δ, min{6 − max_j B_j, 9 − max_{i,j} X_{i,j}} > ε, and δ < 1. Add the following intervals to the original graph:
Interval                Occurrences      Clique ID
(0, 1 + ε)              t + 1 times      I
(1 + ε, 2 + ε)          t + 1 times      II
(2 + ε, 6 − ε)          t + 1 times      III
(6 − ε, 9 − ε)          t + 1 times      IV
(9 − ε, 11 + ε)         t + 1 times      V
(11 + ε, 13 − ε)        t + 1 times      VI
(13 − ε, 14 − ε)        t + 1 times      VII
(14 − ε, 14)            t + 1 times      VIII