Learning Factor Graphs in Polynomial Time & Sample Complexity

Pieter Abbeel        Daphne Koller        Andrew Y. Ng
Computer Science Dept., Stanford University, Stanford, CA 94305

Abstract

We study computational and sample complexity of parameter and structure learning in graphical models. Our main result shows that the class of factor graphs with bounded factor size and bounded connectivity can be learned in polynomial time and polynomial number of samples, assuming that the data is generated by a network in this class. This result covers both parameter estimation for a known network structure and structure learning. It implies as a corollary that we can learn factor graphs for both Bayesian networks and Markov networks of bounded degree, in polynomial time and sample complexity. Unlike maximum likelihood estimation, our method does not require inference in the underlying network, and so applies to networks where inference is intractable. We also show that the error of our learned model degrades gracefully when the generating distribution is not a member of the target class of networks.

1 Introduction

Graphical models are widely used to compactly represent structured probability distributions over (large) sets of random variables. The task of learning a representation for a distribution P from samples taken from P is an important one for many applications. There are many variants of this learning problem, which vary on several axes, including whether the data is fully or partially observed, and whether the structure of the network is given or needs to be learned from data.
In this paper, we focus on the problem of learning both network structure and parameters from fully observable data, restricting attention to discrete probability distributions over finite sets. We focus on learning a factor graph representation of the distribution (Kschischang et al., 2001). Factor graphs subsume both Bayesian networks and Markov networks, in that every Bayesian network or Markov network can be written as a factor graph of (essentially) the same size.

Based on the canonical parameterization used in the Hammersley-Clifford Theorem for Markov networks (Hammersley & Clifford, 1971; Besag, 1974), we provide a parameterization of factor graph distributions that is a product only of probabilities over local subsets of variables. By contrast, the original Hammersley-Clifford canonical parameterization is a product of probabilities over joint instantiations of all the variables. The new parameterization naturally leads to an algorithm that solves the parameter estimation problem in closed form. For factor graphs of bounded factor size and bounded connectivity, if the generating distribution falls into the target class, we show that our estimation procedure returns an accurate solution — one of low KL-divergence to the true distribution — given a polynomial number of samples.

Building on this result, we provide an algorithm for learning both the structure and parameters of such factor graphs. The algorithm uses empirical entropy estimates to select an approximate Markov blanket for each variable, and then uses the parameter estimation algorithm to estimate parameters and identify which factors are likely to be irrelevant. Under the same assumptions as above, we prove that this algorithm also has polynomial-time computational complexity and polynomial sample complexity.¹

These algorithms provide the first polynomial-time and polynomial sample-complexity learning algorithms for factor graphs, and thereby for Markov networks. Note that our algorithms apply to any factor graph of bounded factor size and bounded connectivity, including those (such as grids) where inference is intractable. We also show that our algorithms degrade gracefully, in that they return reasonable answers even when the underlying distribution does not come exactly from the target class of networks.

We note that the proposed algorithms are unlikely to be useful in practice in their current form, as they perform an exhaustive enumeration of the possible Markov blankets of factors in the factor graph, a process which is generally infeasible even in small networks; they also do not make good use of all the available data. Nevertheless, the techniques used in our analysis open new avenues towards efficient parameter and structure learning in undirected, intractable models.

¹Due to space constraints, most of the proofs are omitted from this paper or given only as sketches. The complete proofs are given in the full paper (Abbeel et al., 2005).

2 Preliminaries

2.1 Factor Graph Distributions

Definition 1 (Gibbs distribution). A factor f with scope² D is a function from val(D) to R+. A Gibbs distribution P over a set of random variables X = {X_1, ..., X_n} is associated with a set of factors {f_j}_{j=1}^J with scopes {C_j}_{j=1}^J, such that

    $P(X_1, \dots, X_n) = \frac{1}{Z} \prod_{j=1}^{J} f_j(C_j[X_1, \dots, X_n])$.  (1)

The normalizing constant Z is the partition function.

²A function has scope X if its domain is val(X).

The factor graph associated with a Gibbs distribution is a bipartite graph whose nodes correspond to variables and factors, with an edge between a variable X and a factor f_j if the scope of f_j contains X. There is a one-to-one correspondence between factor graphs and sets of scopes. A Gibbs distribution also induces a Markov network — an undirected graph whose nodes correspond to the random variables X and where there is an edge between two variables if there is a factor in which they both participate.
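To ground Definition 1, the following sketch builds a tiny Gibbs distribution by brute force. This is an illustration added here, not code from the paper; the variable layout, factor values, and helper names are all invented, and the explicit sum for Z is feasible only for very small n.

```python
# A minimal sketch of Definition 1 / Eqn. (1): a Gibbs distribution as a
# normalized product of factors over binary variables. Illustrative only.
import itertools

import numpy as np

n = 3                                  # variables X0, X1, X2
scopes = [(0, 1), (1, 2)]              # C_1 = {X0, X1}, C_2 = {X1, X2}
factors = [np.array([[1.0, 2.0],       # f_1(X0, X1), entries in R+
                     [2.0, 1.0]]),
           np.array([[3.0, 1.0],       # f_2(X1, X2)
                     [1.0, 3.0]])]

def unnormalized(x):
    """Product of factor values at the full assignment x (Eqn. (1) without Z)."""
    p = 1.0
    for scope, f in zip(scopes, factors):
        p *= f[tuple(x[i] for i in scope)]
    return p

# The partition function Z sums over all 2^n assignments; this exhaustive
# computation is exactly what the paper's estimators manage to avoid.
Z = sum(unnormalized(x) for x in itertools.product((0, 1), repeat=n))

def P(x):
    return unnormalized(x) / Z

assert abs(sum(P(x) for x in itertools.product((0, 1), repeat=n)) - 1.0) < 1e-12
```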
The set of scopes uniquely determines the structure of the Markov network, but several different sets of scopes can result in the same Markov network. For example, a fully connected Markov network can correspond both to a Gibbs distribution with a single factor which is a joint distribution over X, and to a distribution with $\binom{n}{2}$ factors over pairs of variables. We will use the more precise factor graph representation in this paper. Our results are easily translated into results for Markov networks.

Definition 2 (Markov blanket). Let a set of scopes C = {C_j}_{j=1}^J be given. The Markov blanket of a set of random variables D ⊆ X is defined as

    MB(D) = ∪{C_j : C_j ∈ C, C_j ∩ D ≠ ∅} − D.

For any Gibbs distribution, we have, for any D, that

    D ⊥ X − D − MB(D) | MB(D),  (2)

or in words: given its Markov blanket, D is independent of all other variables.

A standard assumption for a Gibbs distribution, which is critical for identifying its structure (see Lauritzen, 1996, Ch. 3), is that the distribution be positive — all of its entries must be non-zero. Our results use a quantitative measure for how positive P is. Let γ = min_{x,i} P(X_i = x_i | X_{−i} = x_{−i}), where the −i subscript denotes all entries but entry i. Note that, if we have a fixed bound on the number of factors in which a variable can participate, and a bound on how skewed each factor is (in terms of the ratio of its lowest and highest entries), we are guaranteed a bound on γ that is independent of the number n of variables in the network. By contrast, γ̃ = min_x P(X = x) generally has an exponential dependence on n. For example, if we have n IID Bernoulli(1/2) random variables, then γ = 1/2 (independent of n) but γ̃ = 1/2ⁿ.

2.2 The Canonical Parameterization

A Gibbs distribution is generally over-parameterized relative to the structure of the underlying factor graph, in that a continuum of possible parameterizations over the graph can all encode the same distribution. The canonical parameterization (Hammersley & Clifford, 1971; Besag, 1974) provides one specific choice of parameterization for a Gibbs distribution, with some nice properties (see below). The canonical parameterization forms the basis for the Hammersley-Clifford theorem, which asserts that any distribution that satisfies the independence assumptions encoded by a Markov network can be represented as a Gibbs distribution over the cliques in that network. Standardly, the canonical parameterization is defined for Gibbs distributions over Markov networks at the clique level. We utilize a more refined parameterization, defined at the factor level; results at the clique level are trivial corollaries.

The canonical parameterization is defined relative to an arbitrary (but fixed) assignment x̄ = (x̄_1, ..., x̄_n). Let any subset of variables D = ⟨X_{i_1}, ..., X_{i_|D|}⟩ and any assignment d = ⟨x_{i_1}, ..., x_{i_|D|}⟩ be given. Let any U ⊆ D be given. We define σ_U[·] such that for all i ∈ {1, ..., n}:

    $(\sigma_U[d])_i = \begin{cases} x_i & \text{if } X_i \in U, \\ \bar{x}_i & \text{if } X_i \notin U. \end{cases}$

In words, σ_U[d] keeps the assignments to the variables in U as specified in d, and augments it to form a full assignment using the default values in x̄. Note that the assignments to variables outside U are always ignored and replaced with their default values. Thus, the scope of σ_U[·] is always U.

Let P be a positive Gibbs distribution. The canonical factor for D ⊆ X is defined as follows:

    $f^*_D(d) = \exp\Big( \sum_{U \subseteq D} (-1)^{|D-U|} \log P(\sigma_U[d]) \Big)$.  (3)

The sum is over all subsets of D, including D itself and the empty set ∅.

The following theorem extends the Hammersley-Clifford theorem (which applies to Markov networks) to factor graphs.

Theorem 1. Let P be a positive Gibbs distribution with factor scopes {C_j}_{j=1}^J. Let {C*_j}_{j=1}^{J*} = ∪_{j=1}^J 2^{C_j} − ∅ (where 2^X is the power set of X — the set of all of its subsets). Then

    $P(x) = P(\bar{x}) \prod_{j=1}^{J^*} f^*_{C^*_j}(c^*_j)$,

where c*_j is the instantiation of C*_j consistent with x.

The parameterization of P using the canonical factors {f*_{C*_j}}_{j=1}^{J*} is called the canonical parameterization of P. Although typically J* > J, the additional factors are all subfactors of the original factors. Note that first transforming a factor graph into a Markov network and then applying the Hammersley-Clifford theorem to the Markov network generally results in a significantly less sparse canonical parameterization than the canonical parameterization from Theorem 1.
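The inclusion-exclusion sum in Eqn. (3) and the product form of Theorem 1 can be checked numerically by brute force. The sketch below is illustrative only (not the paper's code); it treats the full variable set as a single scope, so the C*_j range over all nonempty subsets, and it is exponential in n by construction.

```python
# Canonical factors f*_D of Eqn. (3), computed from an explicit positive
# joint distribution, and a numerical check of Theorem 1's product form.
import itertools
import math

import numpy as np

rng = np.random.default_rng(0)
n = 3
joint = rng.uniform(0.5, 1.5, size=(2,) * n)
joint /= joint.sum()                   # a positive distribution P
x_bar = (0,) * n                       # the fixed default assignment

def sigma(U, d):
    """sigma_U[d]: keep d on U, fill in defaults from x_bar elsewhere."""
    return tuple(d[i] if i in U else x_bar[i] for i in range(n))

def canonical_factor(D, d):
    """f*_D(d) = exp( sum over U subset of D of (-1)^|D-U| log P(sigma_U[d]) )."""
    s = 0.0
    for r in range(len(D) + 1):
        for U in itertools.combinations(sorted(D), r):
            s += (-1) ** (len(D) - r) * math.log(joint[sigma(set(U), d)])
    return math.exp(s)

# Theorem 1: P(x) = P(x_bar) * product over nonempty subsets D of f*_D(x).
nonempty = [set(D) for r in range(1, n + 1)
            for D in itertools.combinations(range(n), r)]
for x in itertools.product((0, 1), repeat=n):
    prod = joint[x_bar]
    for D in nonempty:
        prod *= canonical_factor(D, x)
    assert abs(prod - joint[x]) < 1e-9
```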
Table 1: Notational conventions.

    X, Y, ...           random variables
    x, y, ...           instantiations of the random variables
    X, Y, ...           sets of random variables
    x, y, ...           instantiations of sets of random variables
    val(X)              set of values the variable X can take
    D[x]                instantiation of D consistent with x (abbreviated as d when no ambiguity is possible)
    f                   factor
    P                   positive Gibbs distribution over a set of random variables X = ⟨X_1, ..., X_n⟩
    {f_j}_{j=1}^J       factors of P
    {C_j}_{j=1}^J       scopes of factors of P
    P̂                   empirical (sample) distribution
    P̃                   distribution returned by the learning algorithm
    f*_·                canonical factor as defined in Eqn. (3)
    f*_{·|·}            canonical factor as defined in Eqn. (4)
    f̂*_{·|·}            canonical factor as defined in Eqn. (4), but using the empirical distribution P̂
    MB(D)               Markov blanket of D
    k                   max_j |C_j|
    γ                   min_{x,i} P(X_i = x_i | X_{−i} = x_{−i})
    v                   max_i |val(X_i)|
    b                   max_j |MB(C_j)|

3 Parameter Estimation

3.1 Markov Blanket Canonical Factors

Considering the definition of the canonical parameters, we note that all of the terms in Eqn. (3) can be estimated from empirical data using simple counts, without requiring inference over the network. Thus, it appears that we can use the canonical parameterization as the basis for our parameter estimation algorithm. However, as written, this estimation process is statistically infeasible, as the terms in Eqn. (3) are probabilities over full instantiations of all variables, which can never be estimated from a reasonable number of samples.

Our first observation is that we can obtain exactly the same answer by considering probabilities over much smaller instantiations — those corresponding to a factor and its Markov blanket. Let D = ⟨X_{i_1}, ..., X_{i_|D|}⟩ be any subset of variables, and d = ⟨x_{i_1}, ..., x_{i_|D|}⟩ be any assignment to D. For any U ⊆ D, we define σ_{U:D}[d] to be the restriction of the full instantiation σ_U[d] of all variables in X to the corresponding instantiation of the subset D. In other words, σ_{U:D}[d] keeps the assignments to the variables in U as specified in d, and changes the assignment to the variables in D − U to the default values in x̄. Let D ⊆ X and Y ⊆ X − D. Then the factor f*_{D|Y} over the variables in D is defined as follows:

    $f^*_{D|Y}(d) = \exp\Big( \sum_{U \subseteq D} (-1)^{|D-U|} \log P(\sigma_{U:D}[d] \mid Y = \bar{y}) \Big)$,  (4)

where the sum is over all subsets of D, including D itself and the empty set ∅.

The following proposition shows an equivalence between the factors computed using Eqn. (3) and Eqn. (4).

Proposition 2. Let P be a positive Gibbs distribution with factor scopes {C_j}_{j=1}^J, and {C*_j}_{j=1}^{J*} as above. Then for any D ⊆ X, we have:

    $f^*_D = f^*_{D | X-D} = f^*_{D | MB(D)}$,  (5)

and (as a direct consequence)

    $P(x) = P(\bar{x}) \prod_{j=1}^{J^*} f^*_{C^*_j | X - C^*_j}(c^*_j)$  (6)
          $= P(\bar{x}) \prod_{j=1}^{J^*} f^*_{C^*_j | MB(C^*_j)}(c^*_j)$,  (7)

where c*_j is the instantiation of C*_j consistent with x.

Proposition 2 shows that we can compute the canonical parameterization factors using probabilities over factor scopes and their Markov blankets only. From a sample complexity point of view, this is a significant improvement over the standard definition, which uses joint instantiations over all variables. By expanding the Markov blanket canonical factors in Proposition 2 using Eqn. (4), we see that any factor graph distribution can be parameterized as a product of local probabilities.

3.2 Parameter Estimation Algorithm

Based on the parameterization above, we propose the following Factor-Graph-Parameter-Learn algorithm. The algorithm takes as inputs: the scopes of the factors {C_j}_{j=1}^J, samples {x^{(i)}}_{i=1}^m, and a baseline instantiation x̄. Then, for {C*_j}_{j=1}^{J*} as above, Factor-Graph-Parameter-Learn does the following:

• Compute the estimates {f̂*_{C*_j | MB(C*_j)}}_{j=1}^{J*} of the canonical factors as in Eqn. (4), but using the empirical estimates based on the data samples.

• Return the distribution P̃(x) ∝ ∏_{j=1}^{J*} f̂*_{C*_j | MB(C*_j)}(c*_j).

Theorem 3 (Parameter learning: computational complexity). The running time of the Factor-Graph-Parameter-Learn algorithm is in O(m 2^k J (k + b) + 2^{2k} J v^k).³

³The upper bound is based on a very naive implementation's running time. It assumes operations (such as reading, writing, adding, etc.) on numbers take constant time.

Note that the representation of the factor graph distribution is Ω(J v^k); thus exponential dependence on k is unavoidable for any algorithm. There is no dependence on the running time of evaluating the partition function. On the other hand, evaluating the likelihood requires evaluating the partition function (which is different for different parameter values). We expect that ML-based learning algorithms would take at least as long as evaluating the partition function.
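Here is a sketch of what Factor-Graph-Parameter-Learn looks like when written out, under simplifying assumptions not made by the paper (binary variables, samples stored as tuples, and no smoothing or clipping of the raw counts). All helper names are invented for illustration.

```python
# A sketch of Factor-Graph-Parameter-Learn: estimate the Markov blanket
# canonical factors of Eqn. (4) from raw empirical counts. Illustrative only.
import itertools
import math

def markov_blanket(D, scopes):
    """MB(D) per Definition 2: union of the intersecting scopes, minus D."""
    mb = set()
    for C in scopes:
        if set(C) & D:
            mb |= set(C)
    return mb - D

def cond_prob(samples, D, d, Y, y):
    """Empirical P-hat(D = d | Y = y); D, Y are tuples of variable indices."""
    matches_y = [s for s in samples if all(s[i] == v for i, v in zip(Y, y))]
    num = sum(1 for s in matches_y if all(s[i] == v for i, v in zip(D, d)))
    return num / len(matches_y)   # positive w.h.p. since P is positive

def mb_canonical_factor(samples, D, MB, d, x_bar):
    """f-hat*_{D|MB(D)}(d) via Eqn. (4), conditioning on MB at its defaults."""
    D, MB = tuple(sorted(D)), tuple(sorted(MB))
    y_bar = tuple(x_bar[i] for i in MB)
    log_f = 0.0
    for r in range(len(D) + 1):
        for U in itertools.combinations(D, r):
            # sigma_{U:D}[d]: keep d on U, defaults from x_bar on D - U.
            dU = tuple(v if i in U else x_bar[i] for i, v in zip(D, d))
            log_f += (-1) ** (len(D) - r) * math.log(
                cond_prob(samples, D, dU, MB, y_bar))
    return math.exp(log_f)

def parameter_learn(samples, scopes, x_bar):
    """Return the estimated canonical factor tables; P-tilde(x) is
    proportional to the product of these factors evaluated at x."""
    subscopes = {C_star for C in scopes
                 for r in range(1, len(C) + 1)
                 for C_star in itertools.combinations(sorted(C), r)}
    tables = {}
    for C_star in subscopes:
        MB = markov_blanket(set(C_star), scopes)
        tables[C_star] = {
            d: mb_canonical_factor(samples, set(C_star), MB, d, x_bar)
            for d in itertools.product((0, 1), repeat=len(C_star))}
    return tables
```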
3.3 Sample Complexity

We now analyze the sample complexity of the Factor-Graph-Parameter-Learn algorithm, showing that it returns a distribution that is a good approximation of the true distribution when given only a "small" number of samples.

Theorem 4 (Parameter learning: sample complexity). Let any ε, δ > 0 be given. Let {x^{(i)}}_{i=1}^m be IID samples from P. Let P̃ be the probability distribution returned by Factor-Graph-Parameter-Learn. Then, for

    $D(P \| \tilde{P}) + D(\tilde{P} \| P) \le J\varepsilon$

to hold with probability at least 1 − δ, it suffices that

    $m \ge \Big(1 + \frac{\varepsilon}{2^{2k+2}}\Big)^2 \frac{2^{4k+3}}{\gamma^{k+b}\,\varepsilon^2} \log \frac{2^{k+2}\, J\, v^{k+b}}{\delta}$.  (8)

Proof sketch. Using the Hoeffding inequality and the fact that the probabilities are bounded away from 0, we first prove that, for any j ∈ {1, ..., J*}, a "small" number of samples is sufficient for

    $\big| \log P(C^*_j \mid MB(C^*_j)) - \log \hat{P}(C^*_j \mid MB(C^*_j)) \big| \le \varepsilon_0$  (9)

to hold with high probability. Now using the fact that the logs of the (Markov blanket) canonical factors are sums of at most 2^{|C*_j|} ≤ 2^k of these log probabilities, we get that the resulting estimates of the (Markov blanket) canonical factors are accurate:

    $\big| \log \hat{f}^*_{C^*_j | MB(C^*_j)}(c^*_j) - \log f^*_{C^*_j | MB(C^*_j)}(c^*_j) \big| \le 2^k \varepsilon_0$.  (10)

From Proposition 2 we have that the true distribution can be written as a product of its (Markov blanket) canonical factors; i.e., we have P ∝ ∏_{j=1}^{J*} f*_{C*_j | MB(C*_j)}. So Eqn. (10) shows we have an accurate estimate of all the factors of the distribution. We use a technical lemma to show that this implies that the two distributions are close in KL-divergence:

    $D(P \| \tilde{P}) + D(\tilde{P} \| P) \le 2 J^* 2^k \varepsilon_0 = J^* 2^{k+1} \varepsilon_0$.  (11)

Now using J* ≤ J 2^k, appropriately choosing ε₀, and careful bookkeeping on the number of samples required and the high-probability statements gives the theorem. □

The theorem shows that — assuming the true distribution P factors according to the given structure — Factor-Graph-Parameter-Learn returns a distribution that is Jε-close in KL-divergence. The sample complexity scales exponentially in the maximum number of variables per factor k, and polynomially in 1/ε and 1/γ.

The error in the KL-divergence grows linearly with J. This is a consequence of the fact that the number of terms in the distributions is J, and each can accrue an error. We can obtain a more refined analysis if we eliminate this dependence by considering the normalized KL-divergence (1/J) D(P‖P̃). We return to this issue in Section 3.4.

Theorem 4 considers the case when P actually factors according to the given structure. The following theorem shows that our error degrades gracefully even if the samples are generated by a distribution Q that does not factor according to the given structure.

Theorem 5 (Parameter learning: graceful degradation). Let any ε, δ > 0 be given. Let {x^{(i)}}_{i=1}^m be IID samples from a distribution Q. Let MB and M̂B be the Markov blankets according to the distribution Q and the given structure, respectively. Let {f*_{D*_j | MB(D*_j)}}_{j=1}^{J̄} be the Markov blanket canonical factors of Q. Let {C*_j}_{j=1}^{J*} be the scopes of the canonical factors for the given structure. Let P̃ be the probability distribution returned by Factor-Graph-Parameter-Learn. Then, for

    $D(Q \| \tilde{P}) + D(\tilde{P} \| Q) \le J\varepsilon
        + 2 \sum_{j :\, D^*_j \notin \{C^*_k\}_{k=1}^{J^*}} \max_{d^*_j} \big| \log f^*_{D^*_j | MB(D^*_j)}(d^*_j) \big|
        + 2 \sum_{j \in S} \max_{d^*_j} \Big| \log \frac{f^*_{D^*_j | MB(D^*_j)}(d^*_j)}{f^*_{D^*_j | \widehat{MB}(D^*_j)}(d^*_j)} \Big|$  (12)

to hold with probability at least 1 − δ, it suffices that m satisfies Eqn. (8) of Theorem 4. Here the elements of S = {j : D*_j ∈ {C*_k}_{k=1}^{J*}, MB(D*_j) ≠ M̂B(D*_j)} index over the canonical factors for which the Markov blanket is incorrect in the given structure.

This result is important, as it shows that each canonical factor that is incorrectly captured by our target structure adds at most a constant to our bound on the KL-divergence. A canonical factor could be incorrectly captured when the corresponding factor scope is not included in the given structure. Canonical factors are designed so that a factor over a set of variables captures only the residual interactions between the variables in its scope, once all interactions between its subsets have been accounted for in other factors. Thus, canonical factors over large scopes are often close to the trivial all-ones factor in practice. Therefore, if our structure approximation is such that it only ignores some of the larger-scope factors, the error in the approximation may be quite limited. A canonical factor could also be incorrectly captured when the given structure does not have the correct Markov blanket for that factor. The resulting error depends on how good an approximation of the Markov blanket we do have. See Section 4 for more details on this topic.
3.4 Reducing the Dependence on Network Size

Our previous analysis showed a linear dependence on the number of factors J in the network. In a sense, this dependence is inevitable. To understand why, consider a distribution P defined by a set of n independent Bernoulli random variables X_1, ..., X_n, each with parameter 0.5. Assume that Q is an approximation to P, where the X_i are still independent, but have parameter 0.4999. Intuitively, a Bernoulli(0.4999) distribution is a very good estimate of a Bernoulli(0.5); thus, for most applications, Q can safely be considered to be a very good estimate of P. However, the KL-divergence between them is D(P(X_{1:n}) ‖ Q(X_{1:n})) = Σ_{i=1}^n D(P(X_i) ‖ Q(X_i)) = Ω(n). Thus, if n is large, the KL-divergence between P and Q would be large, even though Q is a good estimate for P. To remove such unintuitive scaling effects when studying the dependence on the number of variables, we can consider instead the normalized KL-divergence criterion:

    $D_n(P(X_{1:n}) \| Q(X_{1:n})) = \frac{1}{n} D(P(X_{1:n}) \| Q(X_{1:n}))$.
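To make the Ω(n) behavior concrete, here is the per-variable arithmetic for this example (a worked computation added for illustration, using natural logarithms; it is not part of the original text):

    $D\big(\mathrm{Ber}(0.5)\,\|\,\mathrm{Ber}(0.4999)\big) = \tfrac{1}{2}\log\tfrac{0.5}{0.4999} + \tfrac{1}{2}\log\tfrac{0.5}{0.5001} \approx 2 \times 10^{-8}$,

so D(P(X_{1:n}) ‖ Q(X_{1:n})) ≈ 2×10⁻⁸ · n grows linearly in n, while the normalized divergence D_n(P(X_{1:n}) ‖ Q(X_{1:n})) ≈ 2×10⁻⁸ stays constant no matter how many variables are added.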

As we now show, with a slight modification to the algorithm, we can achieve a bound of ε for our normalized KL-divergence while eliminating the logarithmic dependence on J in our sample complexity bound. Specifically, we can modify our algorithm so that it clips probability estimates in [0, γ^{k+b}) up to γ^{k+b}. Note that this change can only improve the estimates, as the true probabilities which we are trying to estimate are never in the interval [0, γ^{k+b}).⁴

⁴This solution assumes that γ is known. If not, we can use a clipping threshold that is a function of the number of samples seen. This technique is used by Dasgupta (1997) to derive sample complexity bounds for learning fixed structure Bayesian networks.

For this slightly modified version of the algorithm, the following theorem shows the dependence on the size of the network is O(1), which is tighter than the logarithmic dependence shown in Theorem 4.

Theorem 6 (Parameter learning: size of the network). Let any ε, δ > 0 be given. Let {x^{(i)}}_{i=1}^m be IID samples from P. Let the domain size of each variable be fixed. Let the number of factors a variable can participate in be fixed. Let P̃ be the probability distribution returned by Factor-Graph-Parameter-Learn. Then, for

    $D_n(P \| \tilde{P}) + D_n(\tilde{P} \| P) \le \varepsilon$

to hold with probability at least 1 − δ, it suffices that we have a number of samples that does not depend on the number of variables in the network.

The following theorem shows a similar result for Bayesian networks, namely that for a fixed bound on the number of parents per node, the sample complexity dependence on the size of the network is O(1).⁵

⁵Complete proofs for Theorems 6 and 7 (and all other results in this paper) are given in the full paper (Abbeel et al., 2005). In the full paper we actually give a much stronger version of Theorem 7, including dependencies of m on ε, δ, k and a graceful degradation result.

Theorem 7. Let any ε > 0 and δ > 0 be given. Let any Bayesian network (BN) structure over n variables with at most k parents per variable be given. Let P be a probability distribution that factors over the BN. Let P̃ denote the probability distribution obtained by fitting the conditional probability table (CPT) entries via maximum likelihood and then clipping each CPT entry to the interval [ε/4, 1 − ε/4]. Then, for

    $D_n(P \| \tilde{P}) \le \varepsilon$  (13)

to hold with probability at least 1 − δ, it suffices that we have a number of samples that does not depend on the number of variables in the network.

4 Structure Learning

The algorithm described in the previous section uses the known network to establish a Markov blanket for each factor. This Markov blanket is then used to effectively estimate the canonical parameters from empirical data. In this section, we show how we can build on this algorithm to perform structure learning, by first identifying (from the data) an approximate Markov blanket for each candidate factor, and then using this approximate Markov blanket to compute the parameters of that factor from a "small" number of samples.

4.1 Identifying Markov Blankets

In the parameter learning results, the Markov blanket MB(C*_j) is used to efficiently estimate the conditional probability P(C*_j | X − C*_j), which is equal to P(C*_j | MB(C*_j)). This suggests measuring the quality of a candidate Markov blanket Y by how well P(C*_j | Y) approximates P(C*_j | X − C*_j). In this section we show how conditional entropy can be used to find a candidate Markov blanket that gives a good approximation for this conditional probability. One intuition for why conditional entropy has the desired property is that it corresponds to the log-loss of predicting C*_j given the candidate Markov blanket.

Definition 3 (Conditional entropy). Let P be a probability distribution over X, Y. Then the conditional entropy H(X|Y) of X given Y is defined as

    $H(X \mid Y) = - \sum_{x \in val(X),\, y \in val(Y)} P(X = x, Y = y) \log P(X = x \mid Y = y)$.

Proposition 8 (Cover & Thomas, 1991). Let P be a probability distribution over X, Y, Z. Then we have H(X|Y, Z) ≤ H(X|Y).

Proposition 8 shows that conditional entropy can be used to find the Markov blanket for a given set of variables. Let D, Y ⊆ X with D ∩ Y = ∅. Then we have

    $H(D \mid MB(D)) = H(D \mid X - D) \le H(D \mid Y)$,  (14)

where the equality follows from the Markov blanket property stated in Eqn. (2) and the inequality follows from Prop. 8. Thus, we can select as our candidate Markov blanket for D the set Y which minimizes H(D|Y).

Our first difficulty is that, when learning from data, we do not have the true distribution, and hence the exact conditional entropies are unknown. The following lemma shows that the conditional entropy can be efficiently estimated from samples.

Lemma 9. Let P be a probability distribution over X, Y such that for all instantiations x, y we have P(X = x, Y = y) ≥ λ. Let Ĥ be the conditional entropy computed based upon m IID samples from P. Then for

    $\big| H(X \mid Y) - \hat{H}(X \mid Y) \big| \le \varepsilon$

to hold with probability 1 − δ, it suffices that

    $m \ge \frac{8\, |val(X)|^2\, |val(Y)|^2}{\lambda^2\, \varepsilon^2} \log \frac{4\, |val(X)|\, |val(Y)|}{\delta}$.
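In code, the estimator of Lemma 9 and the blanket-selection rule suggested by Eqn. (14) might look as follows. This is an illustrative sketch (names and data layout invented, no tie-breaking or smoothing), not the paper's implementation:

```python
# Empirical conditional entropy H-hat(D | Y) from samples, and candidate
# Markov blanket selection by minimizing it. Illustrative sketch only.
import math
from collections import Counter

def empirical_cond_entropy(samples, D, Y):
    """H-hat(D | Y) = H-hat(D, Y) - H-hat(Y), computed from raw counts."""
    m = len(samples)
    joint = Counter((tuple(s[i] for i in D), tuple(s[i] for i in Y))
                    for s in samples)
    marg = Counter(tuple(s[i] for i in Y) for s in samples)
    h_joint = -sum((c / m) * math.log(c / m) for c in joint.values())
    h_marg = -sum((c / m) * math.log(c / m) for c in marg.values())
    return h_joint - h_marg

def best_blanket(samples, D, candidates):
    """The candidate Y disjoint from D minimizing H-hat(D | Y)."""
    return min((Y for Y in candidates if not set(Y) & set(D)),
               key=lambda Y: empirical_cond_entropy(samples, D, Y))
```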
(16) dence on the maximum scope size k and the maximum Markov blanket size b. The last term comes from com- Then we have that ∀x, y, v, w puting the Markov blanket canonical factors. The impor- tant fact about this result is that, unlike for (exact) ML ap- log P (x|v, w, y) − log P (x|w) ≤ √2 . (17) proaches, the running time does not depend on the tractabil- λ2√λ1 ity of inference in the underlying true distribution, nor the

In other words, if a set of variables W looks like a Markov blanket for X, as evaluated by the conditional entropy H(X|W), then the conditional distribution P(X|W) must be close to the conditional distribution P(X | X − X), which conditions on all remaining variables. Thus, it suffices to find such an approximate Markov blanket W as a substitute for knowing the true Markov blanket. This makes conditional entropy suitable for structure learning.

4.2 Structure Learning Algorithm

We propose the following Factor-Graph-Structure-Learn algorithm (a code sketch follows the description below). The algorithm receives as input: samples {x^{(i)}}_{i=1}^m, the maximum number of variables per factor k, the maximum number of variables per Markov blanket for a factor b, and a base instantiation x̄. Let C be the set of candidate factor scopes, and let Y be the set of candidate Markov blankets; i.e., we have

    C = {C* : C* ⊆ X, C* ≠ ∅, |C*| ≤ k},  (18)
    Y = {Y : Y ⊆ X, |Y| ≤ b}.  (19)

The algorithm does the following:

• ∀C* ∈ C, compute the best candidate Markov blanket: M̂B(C*) = arg min_{Y ∈ Y, Y ∩ C* = ∅} Ĥ(C*|Y).

• ∀C* ∈ C, compute the estimates f̂*_{C* | M̂B(C*)} of the canonical factors as defined in Eqn. (4), using the empirical distribution.

• Threshold to one the factor entries f̂*_{C* | M̂B(C*)}(c*) satisfying |log f̂*_{C* | M̂B(C*)}(c*)| ≤ ε/2^{k+2}, and discard the resulting trivial factors that have all entries equal to one.

• Return the probability distribution P̃(x) ∝ ∏_j f̂*_{C*_j | M̂B(C*_j)}(c*_j).

The thresholding step finds the factors that actually contribute to the distribution. The specific threshold is chosen to suit the proof of Theorem 12. If no thresholding were applied, the error in Eqn. (20) would be |C|ε instead of Jε, which is much larger in case the true distribution has a relatively small number of factors.
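Putting the pieces together, a direct (and deliberately naive) rendering of Factor-Graph-Structure-Learn might look as follows. It reuses the hypothetical helpers best_blanket and mb_canonical_factor from the sketches above, assumes binary variables, and omits the bookkeeping a real implementation would need:

```python
# A sketch of Factor-Graph-Structure-Learn: enumerate candidate scopes
# (Eqn. 18) and candidate blankets (Eqn. 19), score blankets by empirical
# conditional entropy, estimate the canonical factors, and threshold
# near-trivial entries to one. Illustrative only.
import itertools
import math

def structure_learn(samples, n, k, b, x_bar, eps):
    variables = range(n)
    cand_scopes = [C for r in range(1, k + 1)
                   for C in itertools.combinations(variables, r)]
    cand_blankets = [Y for r in range(b + 1)
                     for Y in itertools.combinations(variables, r)]
    factors = {}
    for C in cand_scopes:
        MB = best_blanket(samples, C, cand_blankets)
        table, nontrivial = {}, False
        for d in itertools.product((0, 1), repeat=len(C)):
            f = mb_canonical_factor(samples, set(C), set(MB), d, x_bar)
            if abs(math.log(f)) <= eps / 2 ** (k + 2):
                f = 1.0                # threshold near-trivial entries to one
            else:
                nontrivial = True
            table[d] = f
        if nontrivial:                 # discard all-ones (trivial) factors
            factors[C] = table
    return factors                     # P-tilde(x) is prop. to their product
```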

Theorem 11 (Structure learning: computational complexity). The running time of Factor-Graph-Structure-Learn is in O(m k n^k b n^b (k + b) + k n^k b n^b v^{k+b} + k n^k 2^k v^k).⁶

⁶The upper bound is based on a very naive implementation's running time.

The first two terms come from going through the data and computing the empirical conditional entropies. Since the algorithm considers all combinations of candidate factors and Markov blankets, we have an exponential dependence on the maximum scope size k and the maximum Markov blanket size b. The last term comes from computing the Markov blanket canonical factors. The important fact about this result is that, unlike for (exact) ML approaches, the running time does not depend on the tractability of inference in the underlying true distribution, nor in the recovered structure.

Theorem 12 (Structure learning: sample complexity). Let any ε, δ > 0 be given. Let P̃ be the distribution returned by Factor-Graph-Structure-Learn. Then for

    $D(P \| \tilde{P}) + D(\tilde{P} \| P) \le J\varepsilon$  (20)

to hold with probability 1 − δ, it suffices that

    $m \ge \Big(1 + \frac{\gamma^{k+b}\varepsilon}{2^{k+3}}\Big)^2 \frac{v^{2k+2b}\, 2^{8k+19}}{\gamma^{6k+6b}\, \varepsilon^4} \log \frac{8 k b\, n^{k+b}\, v^{k+b}}{\delta}$.  (21)

Proof (sketch). From Lemmas 9 and 10 we have that the conditioning set we pick gives a good approximation to the true canonical factor, assuming true probabilities are used with that conditioning set. At this point the structure is fixed, and we can use the sample complexity theorem for parameter learning to finish the proof. □

Theorem 12 shows that the sample complexity depends exponentially on the maximum factor size k and the maximum Markov blanket size b, and polynomially on 1/γ and 1/ε. If we modify the analysis to consider the normalized KL-divergence, as in Section 3.4, we obtain a logarithmic dependence on the number of variables in the network.

To understand the implications of this theorem, consider the class of Gibbs distributions where every variable can participate in at most d factors and every factor can have at most k variables in its scope. Then we have that the Markov blanket size satisfies b ≤ dk². Bayesian networks are also factor graphs. If the number of parents per variable is bounded by numP and the number of children is bounded by numC, then we have k ≤ numP + 1, and b ≤ (numC + 1)(numP + 1)². Thus our factor graph structure learning algorithm allows us to efficiently learn distributions that can be represented by Bayesian networks with a bounded number of children and parents per variable. Note that our algorithm recovers a distribution which is close to the true generating distribution, but the distribution it returns is encoded as a factor graph, which may not be representable as a compact Bayesian network.

Theorem 12 considers the case where the generating distribution P actually factors according to a structure with the size of factor scopes bounded by k and the size of factor Markov blankets bounded by b.
As we did in the case of parameter estimation, we can show that we have graceful degradation of performance for distributions that do not satisfy this assumption.

Theorem 13 (Structure learning: graceful degradation). Let any ε, δ > 0 be given. Let {x^{(i)}}_{i=1}^m be IID samples from a distribution Q. Let MB and M̂B be the Markov blankets according to the distribution Q and as found by Factor-Graph-Structure-Learn, respectively. Let {f*_{D*_j | MB(D*_j)}}_j be the Markov blanket canonical factors of Q. Let J be the number of factors in Q with scope size smaller than k. Let P̃ be the probability distribution returned by Factor-Graph-Structure-Learn. Then, for

    $D(Q \| \tilde{P}) + D(\tilde{P} \| Q) \le J\varepsilon
        + 2 \sum_{j :\, |D^*_j| > k} \max_{d^*_j} \big| \log f^*_{D^*_j | MB(D^*_j)}(d^*_j) \big|
        + 2 \sum_{j :\, |D^*_j| \le k,\, |MB(D^*_j)| > b} \max_{d^*_j} \Big| \log \frac{f^*_{D^*_j | MB(D^*_j)}(d^*_j)}{f^*_{D^*_j | \widehat{MB}(D^*_j)}(d^*_j)} \Big|$  (22)

to hold with probability at least 1 − δ, it suffices that m satisfies Eqn. (21) of Theorem 12.

Theorem 13 shows that (just like in the parameter learning setting) each canonical factor that is not captured by our learned structure contributes at most a constant to our bound on the KL-divergence. The reason a canonical factor is not captured could be two-fold. First, the scope of the factor could be too large. The paragraph after Theorem 5 discusses when the resulting error is expected to be small. Second, the Markov blanket of the factor could be too large. As shown in Eqn. (22), a good approximate Markov blanket is sufficient to get a good approximation. So we can expect these error contributions to be small if the true distribution is mostly determined by interactions between small sets of variables.

5 Related Work

5.1 Markov Networks

The most natural algorithm for parameter estimation in undirected graphical models is maximum likelihood (ML) estimation (possibly with some regularization). Unfortunately, evaluating the likelihood of such a model requires evaluating the partition function. As a consequence, all known ML algorithms for undirected graphical models are computationally tractable only for networks in which inference is computationally tractable. By contrast, our closed-form solution can be efficiently computed from the data, even for Markov networks where inference is intractable. Note that our estimator does not return the ML solution, so our result does not contradict the "hardness" of ML estimation. However, it does provide a low KL-divergence estimate of the probability distribution, with high probability, from a "small" number of samples, assuming the true distribution approximately factors according to the given structure.

Criteria different from ML have also been proposed for learning Markov networks. The most prominent is pseudo-likelihood (Besag, 1974), and its extension, generalized pseudo-likelihood (Huang & Ogata, 2002). The pseudo-likelihood criterion gives rise to a tractable convex optimization problem. However, the theoretical analyses (e.g., Geman & Graffigne, 1986; Comets, 1992; Guyon & Künsch, 1992) only apply when the generating model is in the true target class. Moreover, they show only asymptotic convergence rates, which are weaker than the finite sample size PAC-bounds we provide in our analysis. Pseudo-likelihood has been extended to obtain a consistent model selection procedure for a small set of models: the procedure is statistically consistent and an asymptotic convergence rate is given (Ji & Seymour, 1996). However, no algorithm is available to efficiently find the best pseudo-likelihood model over the exponentially large set of candidate models from which we want to select in the structure learning problem.
Structure learning for Markov networks is notoriously difficult, as it is generally based on using ML estimation of the parameters (with smoothing), often combined with a penalty term for structure complexity. As evaluating the likelihood is only possible for the class of Markov networks in which inference is tractable, there have been two main research tracks for ML structure learning. The first, starting with the work of Della Pietra et al. (1997), uses local-search heuristics to add factors into the network (see also McCallum, 2003). The second searches for a structure within a restricted class of models in which inference is tractable, more specifically, bounded treewidth Markov networks. Indeed, ML learning of the class of tree Markov networks — networks of tree-width 1 — can be performed very efficiently (Chow & Liu, 1968). Unfortunately, Srebro (2001) proves that for any tree-width k greater than 1, even finding the ML treewidth-k network is NP-hard. Srebro also provides an approximation algorithm, but the approximation factor is a very large multiplicative factor of the log-likelihood, and is therefore of limited practical use. Several heuristic algorithms to learn small-treewidth models have been proposed (Malvestuto, 1991; Bach & Jordan, 2002; Deshpande et al., 2001), but (not surprisingly, given the NP-hardness of the problem) they do not come with any performance guarantees.

Recently, Narasimhan and Bilmes (2004) provided a polynomial time algorithm with a polynomial sample complexity guarantee for the class of Markov networks of bounded tree width. They do not provide any graceful degradation guarantees when the generating distribution is not a member of the target class. Their algorithm computes approximate conditional (mutual) information quantities, followed by dynamic programming to recover the bounded tree width structure. The parameters for the recovered bounded tree width model are estimated by standard ML methods. Our algorithm applies to a different family of distributions: factor graphs of bounded connectivity (including graphs in which inference is intractable). Factor graphs with small connectivity can have large tree width (e.g., grids) and factor graphs with small tree width can have large connectivity (e.g., star graphs).

5.2 Bayesian Networks

ML parameter learning in Bayesian networks (possibly with smoothing) only requires computing the empirical conditional probabilities of each variable given its parent instantiations.
learning algorithm with similar performance guarantees for Dasgupta (1997), following earlier work by Friedman Bayesian networks with bounded fan-in and bounded fan- and Yakhini (1996), analyzes the sample complexity of out. However, we note that our algorithm does not con- learning Bayesian networks, showing that the sample com- struct a Bayesian network representation, but rather a fac- plexity is polynomial in the maximal number of different tor graph; this factor graph may not be compactly repre- instantiations per family. His sample complexity result sentable as a Bayesian network, but it is guaranteed to en- has logarithmic dependence on the number of variables in code a distribution which is close to the generating distrib- the network, when using the KL-divergence normalized by ution, with high probability. the number of variables in the network. In this paper, we 6 Discussion Chickering, D., & Meek, C. (2002). Finding optimal Bayesian networks. Proc. UAI. We have presented polynomial time algorithms for para- Chickering, D., Meek, C., & Heckerman, D. (2003). Large- meter estimation and structure learning in factor graphs of sample learning of Bayesian networks is hard. Proc. UAI. bounded factor size and bounded connectivity. When the Chow, C. K., & Liu, C. N. (1968). Approximating discrete prob- generating distribution is within this class of networks, our ability distributions with dependence trees. IEEE Transactions algorithms are guranteed to return a distribution close to it, on Information Theory. using a polynomial number of samples. When the generat- Comets, F. (1992). On consistency of a class of estimators for ing distribution is not in this class, our algorithm degrades exponential families of Markov random fields on the lattice. gracefully. Annals of Statistics. While of significant theoretical interest, our algorithms, Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. Wiley. as described, are probably impractical. From a statistical perspective, our algorithm is based on the canonical para- Dasgupta, S. (1997). The sample complexity of learning fixed structure Bayesian networks. Machine Learning. meterization, which is evaluated relative to a canonical as- Dasgupta, S. (1999). Learning polytrees. Proc. UAI. signment x¯. Many of the empirical estimates that we com- pute in the algorithm use only a subset of the samples that Della Pietra, S., Della Pietra, V. J., & Lafferty, J. D. (1997). Induc- ing features of random fields. IEEE Transactions on Pattern are (in some ways) consistent with x¯. As a consequence, Analysis and Machine Intelligence, 19, 380–393. we make very inefficient use of data, in that many samples Deshpande, A., Garofalakis, M. N., & Jordan, M. I. (2001). Ef- may never be used. In regimes where data is not abundant, ficient stepwise selection in decomposable models. Proc. UAI this limitation may be quite significant in practice. (pp. 128–135). Morgan Kaufmann Publishers Inc. From a computational perspective, our algorithm uses Friedman, N., & Yakhini, Z. (1996). On the sample complexity exhaustive enumeration over all possible factors up to some of learning Bayesian networks. Proc. UAI. size k, and over all possible Markov blankets up to size Geman, S., & Graffigne, C. (1986). Markov random field image b. When we fix k and b to be constant, the complexity is models and their applications to computer vision. Proc. of the polynomial. But in practice, the set of all subsets of size k International Congress of Mathematicians. 
However, even aside from its theoretical implications, the algorithm we propose might provide insight into the development of new learning algorithms that do work well in practice. In particular, we might be able to address the statistical limitation by putting together canonical factor estimates from multiple canonical assignments x̄. We might be able to address the computational limitation using a more clever (perhaps heuristic) algorithm for searching over subsets. Given the limitations of existing structure learning algorithms for undirected models, we believe that the techniques suggested by our theoretical analysis might be worth exploring.

Acknowledgements. This work was supported by DARPA contract number HR0011-04-1-0016.

References

Abbeel, P., Koller, D., & Ng, A. Y. (2005). On the sample complexity of graphical models. (Full paper.) http://www.cs.stanford.edu/~pabbeel/.

Bach, F., & Jordan, M. (2002). Thin junction trees. NIPS 14.

Besag, J. (1974). Spatial interaction and the statistical analysis of lattice systems. Journal of the Royal Statistical Society, Series B.

Cheng, J., Greiner, R., Kelly, J., Bell, D., & Liu, W. (2002). Learning Bayesian networks from data: An information-theory based approach. Artificial Intelligence Journal.

Chickering, D. (1996). Learning Bayesian networks is NP-Complete. In D. Fisher & H. Lenz (Eds.), Learning from data: Artificial intelligence and statistics V, 121–130. Springer-Verlag.

Chickering, D., & Meek, C. (2002). Finding optimal Bayesian networks. Proc. UAI.

Chickering, D., Meek, C., & Heckerman, D. (2003). Large-sample learning of Bayesian networks is hard. Proc. UAI.

Chow, C. K., & Liu, C. N. (1968). Approximating discrete probability distributions with dependence trees. IEEE Transactions on Information Theory.

Comets, F. (1992). On consistency of a class of estimators for exponential families of Markov random fields on the lattice. Annals of Statistics.

Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. Wiley.

Dasgupta, S. (1997). The sample complexity of learning fixed structure Bayesian networks. Machine Learning.

Dasgupta, S. (1999). Learning polytrees. Proc. UAI.

Della Pietra, S., Della Pietra, V. J., & Lafferty, J. D. (1997). Inducing features of random fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19, 380–393.

Deshpande, A., Garofalakis, M. N., & Jordan, M. I. (2001). Efficient stepwise selection in decomposable models. Proc. UAI (pp. 128–135). Morgan Kaufmann Publishers Inc.

Friedman, N., & Yakhini, Z. (1996). On the sample complexity of learning Bayesian networks. Proc. UAI.

Geman, S., & Graffigne, C. (1986). Markov random field image models and their applications to computer vision. Proc. of the International Congress of Mathematicians.

Guyon, X., & Künsch, H. (1992). Asymptotic comparison of estimators in the Ising model. In Stochastic models, statistical methods, and algorithms in image analysis, Lecture Notes in Statistics. Springer, Berlin.

Hammersley, J. M., & Clifford, P. (1971). Markov fields on finite graphs and lattices. Unpublished.

Hoffgen, K. L. (1993). Learning and robust learning of product distributions. Proc. COLT.

Huang, F., & Ogata, Y. (2002). Generalized pseudo-likelihood estimates for Markov random fields on lattice. Annals of the Institute of Statistical Mathematics.

Ji, C., & Seymour, L. (1996). A consistent model selection procedure for Markov random fields based on penalized pseudolikelihood. Annals of Applied Probability.

Kschischang, F. R., Frey, B. J., & Loeliger, H. A. (2001). Factor graphs and the sum-product algorithm. IEEE Transactions on Information Theory.

Lauritzen, S. L. (1996). Graphical models. Oxford University Press.

Malvestuto, F. M. (1991). Approximating discrete probability distributions with decomposable models. IEEE Transactions on Systems, Man and Cybernetics.

McCallum, A. (2003). Efficiently inducing features of conditional random fields. Proc. UAI.

Meek, C. (2001). Finding a path is harder than finding a tree. Journal of Artificial Intelligence Research, 15, 383–389.

Narasimhan, M., & Bilmes, J. (2004). PAC-learning bounded tree-width graphical models. Proc. UAI.

Spirtes, P., Glymour, C., & Scheines, R. (2000). Causation, prediction, and search (second edition). MIT Press.

Srebro, N. (2001). Maximum likelihood bounded tree-width Markov networks. Proc. UAI.