Learning Factor Graphs in Polynomial Time & Sample Complexity

Learning Factor Graphs in Polynomial Time & Sample Complexity Pieter Abbeel Daphne Koller Andrew Y. Ng Computer Science Dept. Computer Science Dept. Computer Science Dept. Stanford University Stanford University Stanford University Stanford, CA 94305 Stanford, CA 94305 Stanford, CA 94305 of (essentially) the same size. Abstract Based on the canonical parameterization used in the Hammersley-Clifford Theorem for Markov net- We study computational and sample complexity of pa- works (Hammersley & Clifford, 1971; Besag, 1974), we rameter and structure learning in graphical models. provide a parameterization of factor graph distributions Our main result shows that the class of factor graphs that is a product only of probabilities over local subsets of with bounded factor size and bounded connectivity can variables. By contrast, the original Hammersley-Clifford be learned in polynomial time and polynomial number canonical parameterization is a product of probabilities of samples, assuming that the data is generated by a over joint instantations of all the variables. The new pa- network in this class. This result covers both parame- rameterization naturally leads to an algorithm that solves ter estimation for a known network structure and struc- the parameter estimation problem in closed-form. For fac- ture learning. It implies as a corollary that we can learn tor graphs of bounded factor size and bounded connectivity, factor graphs for both Bayesian networks and Markov if the generating distribution falls into the target class, we networks of bounded degree, in polynomial time and show that our estimation procedure returns an accurate so- sample complexity. Unlike maximum likelihood esti- lution — one of low KL-divergence to the true distribution mation, our method does not require inference in the — given a polynomial number of samples. underlying network, and so applies to networks where Building on this result, we provide an algorithm for inference is intractable. We also show that the error of learning both the structure and parameters of such factor our learned model degrades gracefully when the gen- graphs. The algorithm uses empirical entropy estimates to erating distribution is not a member of the target class select an approximate Markov blanket for each variable, of networks. and then uses the parameter estimation algorithm to esti- mate parameters and identify which factors are likely to be irrelevant. Under the same assumptions as above, we prove 1 Introduction that this algorithm also has polynomial-time computational complexity and polynomial sample complexity.1 Graphical models are widely used to compactly represent structured probability distributions over (large) sets of ran- These algorithms provide the first polynomial-time and dom variables. The task of learning a graphical model rep- polynomial sample-complexity learning algorithm for fac- resentation for a distribution P from samples taken from tor graphs, and thereby for Markov networks. Note that our P is an important one for many applications. There are algorithms apply to any factor graph of bounded factor size many variants of this learning problem, which vary on sev- and bounded connectivity, including those (such as grids) eral axes, including whether the data is fully or partially where inference is intractable. We also show that our algo- observed, and on whether the structure of the network is rithms degrade gracefully, in that they return reasonable an- given or needs to be learned from data. swers even when the underlying distribution does not come exactly from the target class of networks. We note that the In this paper, we focus on the problem of learning both proposed algorithms are unlikely to be useful in practice network structure and parameters from fully observable in their current form, as they do an exhaustive enumera- data, restricting attention to discrete probability distribution on the possible Markov blankets of factors in the factor tions over finite sets. We focus on the problem of learning graph, a process which is generally infeasible even in small a factor graph representation of the distribution (Kschis- chang et al., 2001). Factor graphs subsume both Bayesian 1Due to space constraints, most the proofs are omitted from networks and Markov networks, in that every Bayesian net- this paper or given only as sketches. The complete proofs are work or Markov network can be written as a factor graph given in the full paper (Abbeel et al., 2005). networks; they also do not make good use of all the avail- in which a variable can participate, and a bound on how able data. Nevertheless, the techniques used in our analysis skewed each factor is (in terms of the ratio of its lowest opens new avenues towards efficient parameter and struc- and highest entries), we are guaranteed a bound on γ that ture learning in undirected, intractable models. is independent of the number n of variables in the network. By contrast, γ˜ = minx P (X = x) generally has an expo- 2 Preliminaries nential dependence on n. For example, if we have n IID 1 1 Bernoulli( 2 ) random variables, then γ = 2 (independent 2.1 Factor Graph Distributions 1 of n) but γ˜ = 2n . Definition 1 (Gibbs distribution). A factor f with scope2 2.2 The Canonical Parameterization D is a function from val(D) to R+. A Gibbs distribution P over a set of random variables X = {X1,...,Xn} is asso- A Gibbs distribution is generally over-parameterized rel- J J ciated with a set of factors {fj }j=1 with scopes {Cj }j=1, ative to the structure of the underlying factor graph, in such that that a continuum of possible parameterizations over the graph can all encode the same distribution. The canon- 1 J P (X1,...,Xn) = Z j=1 fj(Cj [X1,...,Xn]). (1) ical parameterization (Hammersley & Clifford, 1971; Be- Q sag, 1974) provides one specific choice of parameterization The normalizing constant Z is the partition function. for a Gibbs distribution, with some nice properties (see be- The factor graph associated with a Gibbs distribution is low). The canonical parameterization forms the basis for a bipartite graph whose nodes correspond to variables and the Hammersley-Clifford theorem, which asserts that any distribution that satisfies the independence assumptions en- factors, with an edge between a variable X and a factor fj coded by a Markov network can be represented as a Gibbs if the scope of fj contains X. There is one-to-one corre- spondence between factor graphs and the sets of scopes. distribution over the cliques in that network. Standardly, A Gibbs distribution also induces a Markov network — the canonical parameterization is defined for Gibbs distrib- an undirected graph whose nodes correspond to the ran- utions over Markov networks at the clique level. We utilize dom variables X and where there is an edge between two a more refined parameterization, defined at the factor level; variables if there is a factor in which they both participate. results at the clique level are trivial corollaries. The set of scopes uniquely determines the structure of the The canonical parameterization is defined relative to an Markov network, but several different sets of scopes can arbitrary (but fixed) assignment x¯ = (¯x1,..., x¯n). Let any result in the same Markov network. For example, a fully subset of variables D = hXi1 ,...,Xi|D| i, and any assign- connected Markov network can correspond both to a Gibbs ment d = hxi1 ,...,xi|D| i be given. Let any U ⊆ D be distribution with a factor which is a joint distribution over given. We define σ [·] such that for all i ∈ {1,...,n}: · X , and to a distribution with n factors over pairs of vari- 2 xi if Xi ∈ U, ables. We will use the more precise factor graph represen- (σU[d])i = x¯i if Xi ∈/ U. tation in this paper. Our results are easily translated into results for Markov networks. In words, σU[d] keeps the assignments to the variables in U as specified in d, and augments it to form a full assign- Definition 2 (Markov blanket). Let a set of scopes C = ment using the default values in x¯. Note that the assign- {C }J be given. The Markov blanket of a set of random j j=1 ments to variables outside U are always ignored, and re- variables D ⊆X is defined as placed with their default values. Thus, the scope of σU[·] is always U. MB(D) = ∪{Cj : Cj ∈C , Cj ∩ D 6= ∅} − D. Let P be a positive Gibbs distribution.The canonical For any Gibbs distribution, we have, for any D, that factor for D ⊆X is defined as follows: D U D ⊥ X − D − MB(D) | MB(D), (2) fD∗ (d) = exp U D(−1)| − | log P (σU[d]) . (3) P ⊆ The sum is over all subsets of D, including D itself and the or in words: given its Markov blanket, D is independent of empty set . all other variables. ∅ The following theorem extends the Hammersley- A standard assumption for a Gibbs distribution, which Clifford theorem (which applies to Markov networks) to is critical for identifying its structure (see Lauritzen, 1996, factor graphs. Ch. 3), is that the distribution be positive — all of its entries Theorem 1. Let P be a positive Gibbs distribution with be non-zero. Our results use a quantitative measure for how ∗ factor scopes {C }J . Let {C }J = ∪J 2Cj − ∅ positive P is. Let γ = minx,i P (Xi = xi|X i = x i), j j=1 j∗ j=1 j=1 where the −i subscript denotes all entries but entry− i. Note− (where 2X is the power set of X — the set of all of its that, if we have a fixed bound on the number of factors subsets). Then ∗ 2 J A function has scope X if its domain is val(X).

Load more