Ifanuary, 2002
Total Page:16
File Type:pdf, Size:1020Kb
Stochastic processes on graphs with cycles: geometric and variational approaches by Martin J. Wainwright Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Electrical Engineering and Computer Science at the Massachusetts Institute of Technology ifanuary, 2002 © 2002 Massachusetts Institute of Technology All Rights Reserved. Signature of Author: Department of Electrical Engineering and Computer Science January 28, 2002 Certified by: 11 I Alan S. Willsky Professor of EECS Thesis Supervisor Certified by: Tommi S. Jaakkola Afisitant Pr sor of EECS esis,-pervisor Accepted by: Arthur C. Smith Professor of Electrical Engineering Chair, Committee for Graduate Students MA SSACHUSETTS INSTIUTE OFTECHNOLOGY A PR, 1 6 2002 LIBRARIES Stochastic processes on graphs with cycles: geometric and variational approaches by Martin J. Wainwright Submitted to the Department of Electrical Engineering and Computer Science on January 28, 2002 in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy in Electrical Engineering and Computer Science Abstract Stochastic processes defined on graphs arise in a tremendous variety of fields, including statistical physics, signal processing, computer vision, artificial intelligence, and infor- mation theory. The formalism of graphical models provides a useful language with which to formulate fundamental problems common to all of these fields, including esti- mation, model fitting, and sampling. For graphs without cycles, known as trees, all of these problems are relatively well-understood, and can be solved efficiently with algo- rithms whose complexity scales in a tractable manner with problem size. In contrast, these same problems present considerable challenges in general graphs with cycles. The focus of this thesis is the development and analysis of methods, both exact and approximate, for problems on graphs with cycles. Our contributions are in developing and analyzing techniques for estimation, as well as methods for computing upper and lower bounds on quantities of interest (e.g., marginal probabilities; partition functions). In order to do so, we make use of exponential representations of distributions, as well as insight from the associated information geometry and Legendre duality. Our results demonstrate the power of exponential representations for graphical models, as well as the utility of studying collections of modified problems defined on trees embedded within the original graph with cycles. The specific contributions of this thesis include the following. We develop a method for performing exact estimation of Gaussian processes on graphs with cycles by solv- ing a sequence of modified problems on embedded spanning trees. We introduce the tree-based reparameterization framework for approximate estimation of discrete pro- cesses. This perspective leads to a number of theoretical results on belief propagation and related algorithms, including characterizations of their fixed points and the associ- ated approximation error. Next we extend the notion of reparameterization to a much broader class of methods for approximate inference, including Kikuchi methods, and present results on their fixed points and accuracy. Finally, we develop and analyze a novel class of upper bounds on the log partition function based on convex combi- nations of distributions in the exponential domain. In the special case of combining tree-structured distributions, the associated dual function gives an interesting perspec- tive on the Bethe free energy. Thesis Supervisors: Alan S. Willsky and Tommi S. Jaakkola Title: Professors of Electrical Engineering and Computer Science Notational Conventions Symbol Definition General Notation absolute value | | L 2 norm V gradient operator V2 Hessian operator ai the ith component of the vector A Aij element in the ith row and jth column of matrix A ek indicator vector with 1 in the kth component and 0 every- where else R real numbers kN vector space of real-valued N-dimensional vectors [0, 1]N closed unit hypercube in JRN (0, I)N open unit hypercube in RN Ra(F) range of the mapping F F o G composition of mappings F and G I identity operator x random vector XN sample space of N-dimensional random vector x y observation vector p(x) probability distribution on x p(x I y) conditional probability distribution of x given y H(p) entropy of distribution p D(p 1[q) Kullback-Leibler divergence between p and q A(p, A) Gaussian distribution with mean M and covariance A U[a, b] uniform distribution on [a, b] L Lagrangian of a constrained optimization problem 5 6 NOTATIONAL CONVENTIONS 6 NOTATIONAL CONVENTIONS Symbol Definition Graphical models g undirected graph V vertex or node set of graph &E edge set of graph C graph clique C set of all cliques of g 9 triangulated version of 9 C set of all clique of 9 S separator set in a junction tree S set of all separator sets Vkc compatibility function on clique C Z partition function N number of nodes (i.e., IVI) m number of discrete states s, t indices for nodes (s, t) edge between nodes s and t j, k indices for discrete states N(s) neighbors of node s in 9 T embedded spanning tree of 9 S(T) edge set of T KN complete graph on N nodes NOTATIONAL CONVENTIONS 7 NOTATIONAL CONVENTIONS 7 Symbol Definition Exponential families and information geometry 0 exponential parameter vector d(0) number of components in 0 bc potential function <0 collection of potential functions p(x; 0) exponential distribution on x defined by 0 (D log partition function T' negative entropy function (dual to (D) mean parameters (dual variables) A Legendre mapping between 0 and 97 Me e-flat manifold Mm m-flat manifold D(0 1 6*) Kullback-Leibler divergence between p(x; 0) and p(x; 0*) E [f] expectation of f (x) under p(x; 0) covg{f, g} covariance of f (x) and g(x) under p(x; 0) cumo{fi,... , } kth-order cumulant of fi(x),. .. fk(x) under p(x; 0) 8 NOTATIONAL CONVENTIONS Symbol Definition Tree-based reparameterization TV embedded spanning tree i index for embedded spanning trees L total number of spanning trees used Mst;k belief propagation message from node s to t Ps;j, Pst;jk exact marginal probabilities Ts;j, Tst;jk approximate marginal probabilities K arbitrary normalization constant pt (x) tree-structured component of p(x) qZ(x) set of residual terms (i.e., p(x)/p 2 (x)) 8(xS = j) indicator function for x, to take value j A set of composite indices (s; j) and (st; Ik) AZ composite indices corresponding to T2 C constraint set for pseudomarginals C2 constraint set based on tree V 0 mapping from T to 0 ' reparameterization operator HiI projection operator onto tree 7 ill injection operator from 7' to full set Qi combined reparameterization-identity mapping based on 7r {Ion} sequence of TRP iterates A step-size at iteration n i(n) spanning tree index at iteration n G(T; 0) cost function (approximation to KL divergence) Es;j log error log Ts;j - log Ps;J NOTATIONAL CONVENTIONS NOTATIONAL CONVENTIONS 09 Symbol Definition Advanced inference techniques A core structure 9(A) graph induced by the core structure QA(x) approximating distribution defined by the core structure PA(x) components of original distribution over the core structure Cmax(A) set of maximal cliques in A Csep(A) set of separator sets associated with A R residual partition A particular residual element of R QA u A(x) auxiliary distribution on augmented structure A U A PA U A (x) components of original distribution on augmented structure AUA augmented residual partition A elements of R marginalization operator 9A; R approximation to KL divergence based on A and R collection of approximating distributions MT core structure valued messages exponential parameter for target distribution A(B) indices associated with elements of B 1B f Oa I cEGA(B)} OB* iB ZaEB aa HB projection operator of an exponential parameter onto B I injection operator into full set A OA exponential parameter for QA OA U A exponential parameter for QA u A 10 NOTATIONAL CONVENTIONS Symbol Definition Convex upper bounds set of all spanning trees of 9 A7 probability distribution over trees T p(T) probability of spanning tree T supp(1) support of the distribution fi 11e edge appearance probabilities Prg~e c T} O(T) exponential parameter vector structured according to tree T 0 collection of tree-structured exponential parameter vectors Eg[0] convex combination Ey f(T)(T) A(0*) set of feasible pairs (9; /) such that E4[0] = 0* Q(A; 0*) Lagrangian dual function A, r dual parameters IITprojection operator onto tree T L(9) set of tree-consistent mean parameters M(g) set of globally consistent mean parameters HS single-node entropy at x, ist mutual information between x. and Xt T(9) spanning tree polytope r(-) rank function v(A) number of vertices touched by edges in A C E c(A) number of connected components of 9(A) l(A;iMe; 0*) function for optimal upper bounds A(Me) optimal set of mean parameters (as a function of Me) pMe optimal set of edge appearance probabilities v(T) edge incidence vector corresponding to spanning tree T Acknowledgments During the course of my Ph.D., I have benefited from significant interactions with people both within MIT and throughout the greater academic community. I am grateful to be able to acknowledge these people and their contributions here. First of all, I have been fortunate enough to have enjoyed the support and encour- agement of not one but two thesis supervisors: Alan Willsky and Tommi Jaakkola. I doubt that anyone who interacts significantly with Alan Willsky can fail to be im- pressed (if not infected) by his tremendous intellectual energy and enthusiasm.