Graphical Models, Exponential Families, and Variational Inference
Foundations and Trends® in sample, Vol. xx, No xx (xxxx) 1–294
© xxxx xxxxxxxxx
DOI: xxxxxx

Martin J. Wainwright (1) and Michael I. Jordan (2)

(1) Department of Statistics, and Department of Electrical Engineering and Computer Science, Berkeley 94720, USA, wainwrig@stat.berkeley.edu
(2) Department of Statistics, and Department of Electrical Engineering and Computer Science, Berkeley 94720, USA, jordan@stat.berkeley.edu

Abstract

The formalism of probabilistic graphical models provides a unifying framework for capturing complex dependencies among random variables, and building large-scale multivariate statistical models. Graphical models have become a focus of research in many statistical, computational and mathematical fields, including bioinformatics, communication theory, statistical physics, combinatorial optimization, signal and image processing, information retrieval and statistical machine learning. Many problems that arise in specific instances, including the key problems of computing marginals and modes of probability distributions, are best studied in the general setting. Working with exponential family representations, and exploiting the conjugate duality between the cumulant function and the entropy for exponential families, we develop general variational representations of the problems of computing likelihoods, marginal probabilities and most probable configurations. We describe how a wide variety of algorithms (among them sum-product, cluster variational methods, expectation-propagation, mean field methods, and max-product) can all be understood in terms of exact or approximate forms of these variational representations. The variational approach provides a complementary alternative to Markov chain Monte Carlo as a general source of approximation methods for inference in large-scale statistical models.
Contents

1 Introduction
2 Background
  2.1 Probability distributions on graphs
    2.1.1 Directed graphical models
    2.1.2 Undirected graphical models
    2.1.3 Factor graphs
  2.2 Conditional independence
  2.3 Statistical inference and exact algorithms
  2.4 Applications
    2.4.1 Hierarchical Bayesian models
    2.4.2 Contingency table analysis
    2.4.3 Constraint satisfaction and combinatorial optimization
    2.4.4 Bioinformatics
    2.4.5 Language and speech processing
    2.4.6 Image processing and spatial statistics
    2.4.7 Error-correcting coding
  2.5 Exact inference algorithms
    2.5.1 Message-passing on trees
    2.5.2 Junction tree representation
  2.6 Message-passing algorithms for approximate inference
3 Graphical models as exponential families
  3.1 Exponential representations via maximum entropy
  3.2 Basics of exponential families
  3.3 Examples of graphical models in exponential form
  3.4 Mean parameterization and inference problems
    3.4.1 Mean parameter spaces and marginal polytopes
    3.4.2 Role of mean parameters in inference problems
  3.5 Properties of A
    3.5.1 Derivatives and convexity
    3.5.2 Forward mapping to mean parameters
  3.6 Conjugate duality: Maximum likelihood and maximum entropy
    3.6.1 General form of conjugate dual
    3.6.2 Some simple examples
  3.7 Challenges in high-dimensional models
4 Sum-product, Bethe-Kikuchi, and expectation-propagation
  4.1 Sum-product and Bethe approximation
    4.1.1 A tree-based outer bound to M(G)
    4.1.2 Bethe entropy approximation
    4.1.3 Bethe variational problem and sum-product
    4.1.4 Inexactness of Bethe and sum-product
    4.1.5 Bethe optima and reparameterization
    4.1.6 Bethe and loop series expansions
  4.2 Kikuchi and hypertree-based methods
    4.2.1 Hypergraphs and hypertrees
    4.2.2 Kikuchi and related approximations
    4.2.3 Generalized belief propagation
  4.3 Expectation-propagation algorithms
    4.3.1 Entropy approximations based on term decoupling
    4.3.2 Optimality in terms of moment-matching
5 Mean field methods
  5.1 Tractable families
  5.2 Optimization and lower bounds
    5.2.1 Generic mean field procedure
    5.2.2 Mean field and Kullback-Leibler divergence
  5.3 Examples of mean field algorithms
    5.3.1 Naive mean field updates
    5.3.2 Structured mean field
  5.4 Non-convexity of mean field
6 Variational methods in parameter estimation
  6.1 Estimation in fully observed models
    6.1.1 Maximum likelihood for triangulated graphs
    6.1.2 Iterative methods for computing MLEs
  6.2 Partially observed models and expectation-maximization
    6.2.1 Exact EM algorithm in exponential families
    6.2.2 Variational EM
  6.3 Variational Bayes
7 Convex relaxations and upper bounds
  7.1 Generic convex combinations and convex surrogates
  7.2 Variational methods from convex relaxations
    7.2.1 Tree-reweighted sum-product and Bethe
    7.2.2 Reweighted Kikuchi approximations
    7.2.3 Convex forms of expectation-propagation
    7.2.4 Planar graph decomposition
  7.3 Other convex variational methods
    7.3.1 Semidefinite constraints and log-determinant bound
    7.3.2 Other entropy approximations and polytope bounds
  7.4 Algorithmic stability
  7.5 Convex surrogates in parameter estimation
    7.5.1 Surrogate likelihoods
    7.5.2 Optimizing surrogate likelihoods
    7.5.3 Penalized surrogate likelihoods
8 Max-product and LP relaxations
  8.1 Variational formulation of computing modes
  8.2 Max-product and linear programming for trees
  8.3 Max-product for Gaussians and other convex problems
  8.4 First-order LP relaxation and reweighted max-product
    8.4.1 Basic properties of first-order LP relaxation
    8.4.2 Connection to max-product message-passing
    8.4.3 Modified forms of message-passing
    8.4.4 Examples of the first-order LP relaxation
  8.5 Higher-order LP relaxations
9 Moment matrices and conic relaxations
  9.1 Moment matrices and their properties
    9.1.1 Multinomial and indicator bases
    9.1.2 Marginal polytopes for hypergraphs
  9.2 Semidefinite bounds on marginal polytopes
    9.2.1 Lasserre sequence
    9.2.2 Tightness of semidefinite outer bounds
  9.3 Link to LP relaxations and graphical structure
  9.4 Second-order cone relaxations
10 Discussion
A Background material
  A.1 Background on graphs and hypergraphs
  A.2 Basics of convex sets and functions
    A.2.1 Convex sets and cones
    A.2.2 Convex and affine hulls
    A.2.3 Affine hulls and relative interior
    A.2.4 Polyhedra and their representations
    A.2.5 Convex functions
    A.2.6 Conjugate duality
B Proofs for exponential families and duality
  B.1 Proof of Theorem 1
  B.2 Proof of Theorem 2
  B.3 General properties of M and A∗
    B.3.1 Properties of M
    B.3.2 Properties of A∗
  B.4 Proof of Theorem 3(b)
  B.5 Proof of Theorem 5
C Variational principles for multivariate Gaussians
  C.1 Gaussian with known covariance
  C.2 Gaussian with unknown covariance
  C.3 Gaussian mode computation
D Clustering and augmented hypergraphs
  D.1 Covering augmented hypergraphs
  D.2 Specification of compatibility functions
E Miscellaneous results
  E.1 Möbius inversion
  E.2 Pairwise Markov random fields
References

1 Introduction

Graphical models bring together graph theory and probability theory in a powerful formalism for multivariate statistical modeling. In various applied fields including bioinformatics, speech processing, image processing and control theory, statistical models have long been formulated in terms of graphs, and algorithms for computing basic statistical quantities such as likelihoods and score functions have often been expressed in terms of recursions operating on these graphs; examples include phylogenies, pedigrees, hidden Markov models, Markov random fields, and Kalman filters.
These ideas can be understood, unified and generalized within the formalism of graphical models. Indeed, graphical models provide a natural tool for formulating variations on these classical architectures, as well as for exploring entirely new families of statistical models. Accordingly, in fields that involve the study of large numbers of interacting variables, graphical models are increasingly in evidence.

Graph theory plays an important role in many computationally oriented fields, including combinatorial optimization, statistical physics and economics. Beyond its use as a language for formulating models, graph theory also plays a fundamental role in assessing computational complexity and feasibility. In particular, the running time of an algorithm or the magnitude of an error bound can often be characterized in terms of structural properties of a graph. This statement is also true in the context of graphical models. Indeed, as we discuss, the computational complexity of a fundamental method known as the junction tree algorithm, which generalizes many of the recursive algorithms on graphs cited above, can be characterized in terms of a natural graph-theoretic measure of interaction among variables. For suitably sparse graphs, the junction tree algorithm provides a systematic solution to the problem of computing likelihoods and other statistical quantities associated with a graphical model.

Unfortunately, many graphical models of practical interest are not “suitably sparse,” so that the junction tree algorithm no longer provides a viable computational framework. One popular source of methods for attempting to cope with such cases is the Markov chain Monte Carlo (MCMC) framework, and indeed there is a significant literature on the application of MCMC methods to graphical models [e.g., 26, 90].
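
To make the role of graph structure in computation concrete, the following minimal sketch compares two ways of computing the likelihood of an observation sequence in a small hidden Markov model, one of the classical chain-structured examples named above. It is an illustration only: the function names and all numerical parameters are hypothetical and do not come from the text. The recursive version passes a message along the chain at a cost of order T K^2 for T time steps and K hidden states, whereas direct summation enumerates all K^T hidden sequences.

```python
import itertools


def likelihood_forward(init, trans, emit, obs):
    """Likelihood p(obs) via the forward recursion along the chain."""
    K = len(init)
    # alpha[k] = p(x_1 = k, y_1)
    alpha = [init[k] * emit[k][obs[0]] for k in range(K)]
    for y in obs[1:]:
        # Propagate one step: sum out the previous state, then weight by the emission.
        alpha = [
            sum(alpha[i] * trans[i][j] for i in range(K)) * emit[j][y]
            for j in range(K)
        ]
    return sum(alpha)


def likelihood_brute_force(init, trans, emit, obs):
    """Same quantity by explicit summation over every hidden state sequence."""
    K, T = len(init), len(obs)
    total = 0.0
    for states in itertools.product(range(K), repeat=T):
        p = init[states[0]] * emit[states[0]][obs[0]]
        for t in range(1, T):
            p *= trans[states[t - 1]][states[t]] * emit[states[t]][obs[t]]
        total += p
    return total


# Hypothetical two-state model; the numbers are illustrative assumptions.
init = [0.6, 0.4]                    # init[k]     = p(x_1 = k)
trans = [[0.7, 0.3], [0.2, 0.8]]     # trans[i][j] = p(x_{t+1} = j | x_t = i)
emit = [[0.9, 0.1], [0.3, 0.7]]      # emit[k][y]  = p(y_t = y | x_t = k)
obs = [0, 1, 1, 0, 1]

fast = likelihood_forward(init, trans, emit, obs)
slow = likelihood_brute_force(init, trans, emit, obs)
```

Both functions return the same likelihood up to floating-point rounding; only the recursive one remains practical as the chain grows, which is the sense in which sparse graph structure makes exact computation feasible.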