
Approximate Inference in Collective Graphical Models

Daniel Sheldon (1)  [email protected]
Tao Sun (1)  [email protected]
Akshat Kumar (2)  [email protected]
Thomas G. Dietterich (3)  [email protected]

(1) University of Massachusetts, Amherst, MA 01002, USA
(2) IBM Research, New Delhi 110070, India
(3) Oregon State University, Corvallis, OR 97331, USA

Abstract

We study the problem of approximate inference in collective graphical models (CGMs), which were recently introduced to model the problem of learning and inference with noisy aggregate observations. We first analyze the complexity of inference in CGMs: unlike inference in conventional graphical models, exact inference in CGMs is NP-hard even for tree-structured models. We then develop a tractable convex approximation to the NP-hard MAP inference problem in CGMs, and show how to use MAP inference for approximate marginal inference within the EM framework. We demonstrate empirically that these approximation techniques can reduce the computational cost of inference by two orders of magnitude and the cost of learning by at least an order of magnitude while providing solutions of equal or better quality.

1. Introduction

Sheldon & Dietterich (2011) introduced collective graphical models (CGMs) to model the problem of learning and inference with noisy aggregate data. CGMs are motivated by the growing number of applications where data about individuals are not available, but aggregate population-level data in the form of counts or contingency tables are available. For example, the US Census Bureau cannot release individual records for privacy reasons, so they commonly release low-dimensional contingency tables that classify each member of the population according to a few demographic variables. In ecology, surveys provide counts of animals in different locations, but they cannot identify individuals.

CGMs are generative models that serve as a link between individual behavior and aggregate data. As a concrete example, consider the model illustrated in Figure 1(a) for modeling bird migration from observational data collected by citizen scientists through the eBird project (Sheldon et al., 2008; Sheldon, 2009; Sullivan et al., 2009). Inside the plate, an independent Markov chain describes the migration of each bird among a discrete set of locations: X_t^m represents the location of the mth bird at time t. Outside the plate, aggregate observations are made about the spatial distribution of the population: the variable n_t is a vector whose ith entry counts the number of birds in location i at time t. By observing temporal changes in the vectors {n_t}, one can make inferences about migratory routes without tracking individual birds.

In general CGMs, any discrete graphical model can appear inside the plate to model individuals in a population, and observations are made in the form of (noisy) low-dimensional contingency tables (Sheldon & Dietterich, 2011). A key problem we would like to solve is learning the model parameters (of the individual model) from the aggregate data, for which inference is the key subroutine. Unfortunately, standard inference techniques applied to CGMs quickly become computationally intractable as the population size increases, due to the large number of hidden individual-level variables that are all connected by the aggregate counts.

A key to efficient inference in CGMs is the fact that, when only aggregate data are being modeled, the same data-generating mechanism can be described much more compactly by analytically marginalizing away the individual variables to obtain a direct probabilistic model for the sufficient statistics (Sundberg, 1975; Sheldon & Dietterich, 2011).

Figure 1. CGM example: (a) Individuals are explicitly modeled as chains X_1^m, ..., X_T^m (m = 1:M) with count vectors n_1, ..., n_T. (b) After marginalization, the hidden variables are the sufficient statistics n_{1,2}, ..., n_{T-1,T} of the individual model.

Figure 1(b) illustrates the resulting model for the bird migration example. The new hidden variables n_{t,t+1} are tables of sufficient statistics: the entry n_{t,t+1}(i, j) is the number of birds that fly from location i to location j at time t. For large populations, the resulting model is much more amenable to inference, because it has many fewer variables and it retains a graphical structure analogous to that of the individual model. However, the reduction in the number of variables comes at a cost: the new variables are tables of integer counts, which can take on many more values than the original discrete variables in the individual model, and this adversely affects the running time of inference algorithms.

The first contribution of this paper is to characterize the computational complexity of exact inference in CGMs. For tree-structured graphical models, the running time of exact inference (MAP or marginal) by message passing in the CGM is polynomial in either the population size or the cardinality of the variables in the individual model (when the other parameter is fixed). However, there is no algorithm that is polynomial in both parameters unless P=NP. This is a striking difference from inference in standard tree-structured graphical models, for which the running time of message passing is always polynomial in the variable cardinality. We also analyze the running time of message passing in a junction tree for general (non-tree-structured) CGMs to draw out another difference between CGMs and standard graphical models: the dependence on clique-width is doubly-exponential instead of singly-exponential for some parameter regimes.

Our second main contribution is an approximate algorithm for MAP inference in CGMs that is based on a continuous and convex approximation of the MAP optimization problem. Given observed evidence, the algorithm computes a fractional approximation of the MAP values of the sufficient statistics. Although the true MAP values are integers, when the population of individuals is large, the fractional approximation is very good and can be interpreted as describing the percentages of the population in the MAP configuration. In bird migration, for example, this describes the fraction of the population that flies between each pair of locations for each day of the year.

Our final contribution is to show that this fractional MAP approximation can be applied within an EM algorithm to accurately learn the parameters of the CGM. In EM, one must compute the posterior mean (the expected value of the sufficient statistics given the observations). While the MAP configuration is not the mean, we show experimentally that the (fractional approximate) MAP configuration provides an excellent approximation to the posterior mean. Indeed, for a fixed time budget, it is usually significantly more accurate than computing the expected sufficient statistics via Gibbs sampling. We show that our approach dramatically accelerates parameter learning while still achieving less than 1% error.

2. Related Work

Sheldon et al. (2008) solved a related MAP inference problem on a chain-structured CGM using linear programming and network flow techniques. Sheldon (2009) extended those algorithms to the case when observations are corrupted by log-concave noise models. The MAP problem in those papers is slightly different from ours: it seeks the most likely setting of all of the individual variables, while we seek the most likely setting of the sufficient statistics, which considers the probability of all possible settings of the individual variables that give the same counts. This gives rise to combinatorial terms in the probability model (see Equation (2) of Section 3) and leads to harder non-linear optimization problems.

Sheldon & Dietterich (2011) generalized the previous ideas from chain-structured models to arbitrary discrete graphical models and developed the first algorithms for marginal inference in CGMs, which were based on Gibbs sampling and Markov bases (Diaconis & Sturmfels, 1998; Dobra, 2003). They showed empirically that Gibbs sampling is much faster than exact inference: for some tasks, the running time to achieve a fixed error level is independent of the population size. However, no analogous approximate method was developed for MAP inference.

Inference in CGMs is related to lifted inference in relational models (Getoor & Taskar, 2007). A CGM can be viewed as a simple relational model containing a single logical variable to describe the repetition over individuals. A unique feature of CGMs is that all evidence occurs at the aggregate level. The most related ideas from lifted inference are counting elimination (de Salvo Braz et al., 2007) and counting conversion (Milch et al., 2008), which perform aggregation operations similar to those performed by a CGM. See also Apsel & Brafman (2011). Fierens & Kersting (2012) recently proposed to "lift" probabilistic models instead of inference algorithms, which is very similar in spirit to CGMs, but they did not present a general approach to do so. CGMs lift any model with a single logical variable when all evidence is at the population level. In general, while lifted inference algorithms use counting arguments similar to those that appear in CGMs, they cannot reproduce the CGM model, and none of our results are consequences of any of the results in those papers.

Multiple authors in econometrics and statistics have considered the problem of fitting the transition probabilities of a Markov chain from aggregate data by a conditional least squares approach (Lee et al., 1970; MacRae, 1977; Kalbfleisch et al., 1983). Van Der Plas (1983) showed that the conditional least squares estimator is consistent and asymptotically normal under weak assumptions about the Markov model. This problem can be viewed as learning in the CGM in Figure 1, where the graphical model is chain-structured and single-variable contingency tables are observed exactly for each node. In our work, the CGM may take on a more general graph structure, some nodes may be unobserved, and the observations may be noisy, so the conditional least squares estimator is not applicable.

3. Problem Statement

In this section, we describe the generative model for aggregate data, introduce collective graphical models, and state the problems of (collective) marginal and MAP inference.

Generative Model for Aggregate Data. The probability model starts with a graphical model over random variables X_1, ..., X_N. Let x = (x_1, ..., x_N) be a particular assignment to the variables (for simplicity, assume that each takes values in [L] = {1, ..., L}), and let G = (V, E) be the independence graph. The probability model is

    p(x) = (1/Z) ∏_{(i,j)∈E} φ_{ij}(x_i, x_j).    (1)

Here Z is the normalization constant and φ_{ij} : [L]^2 → R_+ are edge potentials. We refer to this as the individual model. For the remainder of the paper, we assume that G is a tree to develop the important ideas while keeping the exposition manageable. For a graph with cycles, the methods of this paper can be applied to perform inference on a junction tree derived from G.

To generate the aggregate data, first assume that M independent vectors x^(1), ..., x^(M) are drawn from the individual probability model to represent the individuals in a population. Aggregate observations are then made in the form of contingency tables on small sets of variables, which count the number of times that each possible combination of those variables appears in the population. Specifically, a contingency table n_A on variables (X_i)_{i∈A} is a table with entries

    n_A(x_A) = Σ_{m=1}^{M} 1{x_A^(m) = x_A},    x_A ∈ [L]^|A|,

where x_A^(m) is the subvector of x^(m) corresponding to the indices in A. In this section, we will focus first on observed tables n_i := n_{{i}} over single variables, which we refer to as node tables. Later, the tables n_{ij} := n_{{i,j}} for edges (i, j), which are the sufficient statistics of the individual model, will play a prominent role. We will refer to these as edge tables.

We will consider both an exact and a noisy model for observing contingency tables. Let U be an observed subset of nodes. In each model, for each i ∈ U we observe a table y_i of the same dimension as n_i, where the entry y_i(x_i) has one of the following distributions:

    Exact observations:  y_i(x_i) = n_i(x_i).
    Noisy observations:  y_i(x_i) | n_i(x_i) ∼ Pois(α · n_i(x_i)).

The Poisson model is motivated by the bird migration problem, and models birds being counted at a rate proportional to their true density. While it is helpful to focus on these two observation models, considerable variation is possible without significantly changing the results: (1) observations of the different types can be mixed, (2) higher-order contingency tables, such as the edge tables n_{ij}, may be observed, either exactly or noisily, (3) some table entries may be unobserved while others are observed, or, in the noisy model, they may have multiple independent observations, and (4) the Poisson model can be replaced by any other noise model p(y | n) that is log-concave in n. Sheldon & Dietterich (2011) discuss some of these extensions further. In this paper, the exact observation model is used to prove the hardness results, while the Poisson model is used in the algorithms and experiments. Since the exact model can be obtained as the limiting case of a log-concave noise model (e.g., y | n ∼ Normal(n, σ^2) as σ^2 → 0), we do not expect that noisy observations lead to a more tractable problem.
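To make the generative process above concrete, the following sketch (an illustration added for this presentation, not code from the paper; the function and variable names are assumptions of our own) samples a population from a chain-structured individual model specified by an initial distribution and transition matrices, accumulates the node and edge contingency tables, and draws Poisson observations of the node tables.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_aggregate_data(init, trans, M, alpha=1.0):
    """Sample M individuals from a chain-structured individual model and
    return the node tables n_t, the edge tables n_{t,t+1}, and Poisson
    observations y_t.  `init` is the distribution of X_1 and `trans[t]` is
    the row-stochastic transition matrix from X_t to X_{t+1}."""
    L, T = len(init), len(trans) + 1
    node_tables = np.zeros((T, L), dtype=int)          # n_t(i)
    edge_tables = np.zeros((T - 1, L, L), dtype=int)   # n_{t,t+1}(i, j)
    for _ in range(M):
        x = rng.choice(L, p=init)
        node_tables[0, x] += 1
        for t in range(T - 1):
            x_next = rng.choice(L, p=trans[t][x])
            edge_tables[t, x, x_next] += 1
            node_tables[t + 1, x_next] += 1
            x = x_next
    # Noisy observation model: y_t(i) | n_t(i) ~ Pois(alpha * n_t(i)).
    obs = rng.poisson(alpha * node_tables)
    return node_tables, edge_tables, obs

L, T, M = 4, 6, 1000
init = np.full(L, 1.0 / L)
trans = [np.full((L, L), 0.1) + 0.6 * np.eye(L) for _ in range(T - 1)]  # rows sum to 1
n_nodes, n_edges, y = sample_aggregate_data(init, trans, M)
# Consistency with Eq. (3): each node table is a margin of an adjacent edge table.
assert np.array_equal(n_edges[0].sum(axis=1), n_nodes[0])
```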
Inference Problems. Suppose we wish to learn the parameters of the individual model from these aggregate observations. To do this, we need to know the values of the sufficient statistics of the individual model, namely n_{ij} for all edges (i, j) ∈ E. Our observation models do not directly observe these. Fortunately, we can apply the EM algorithm, in which case we need to know the expected values of the sufficient statistics given the observations: E[n_{ij} | y].

This leads us to define two inference problems: marginal inference and MAP inference. The aggregate marginal inference problem is to compute the conditional distributions p(n_{ij} | y) for all (i, j) ∈ E. Although these are finite discrete distributions, the variable n_{ij} can take a very large number of values (there are C(M + L^2 − 1, L^2 − 1) = O(M^{L^2−1}) different L × L tables of non-negative integers that sum to M), so we typically don't want to represent its distribution explicitly in tabular form. An important special case of marginal inference that does not require a large tabular potential is to compute the L × L tables of expected values E[n_{ij} | y], which are the quantities needed for the E step of the EM algorithm. (Sheldon & Dietterich (2011) also showed how to generate samples from p(n_{ij} | y), which is an alternative way to query the posterior distribution without storing a huge tabular potential.)

The aggregate MAP inference problem is to find the tables n = (n_{ij})_{(i,j)∈E} that jointly maximize p(n | y). A primary focus of this paper is approximate algorithms for aggregate MAP inference. One reason for conducting MAP inference is the usual one: to reconstruct the most likely value of n given the evidence (e.g., for data exploration). However, a second and important motivation is the fact that the posterior mode of p(n | y) is an excellent approximation for the posterior mean E[n | y] in this model, so approximate MAP inference also gives an approximate algorithm for the important marginal inference problem needed for the EM algorithm.

Collective Graphical Models. Notice that in the setting we are considering, our observations and queries only concern aggregate quantities. The observations are (noisy) counts and the queries are MAP or marginal probabilities over sufficient statistics (which are also counts). In this setting, we don't care about the values of the individual variables {x^(1), ..., x^(M)}, so we can marginalize them away. This marginalization can be performed analytically to obtain a probability model with many fewer variables. It results in a model whose random variables are the vector n of sufficient statistics and the vector y of observations. For trees (and, more generally, for junction trees), the distribution of n can be written in closed form given the marginal probabilities µ_i(x_i) = Pr(X_i = x_i) and µ_{ij}(x_i, x_j) = Pr(X_i = x_i, X_j = x_j). (If marginal probabilities are not given, they can be computed by performing inference in the individual model.) The following expression is originally due to Sundberg (1975):

    p(n) = M! ∏_{i∈V} ∏_{x_i} [ n_i(x_i)! / µ_i(x_i)^{n_i(x_i)} ]^{ν(i)−1}
               × ∏_{(i,j)∈E} ∏_{x_i,x_j} µ_{ij}(x_i, x_j)^{n_{ij}(x_i,x_j)} / n_{ij}(x_i, x_j)! ,    (2)

    subject to:
        n_i(x_i) = Σ_{x_j} n_{ij}(x_i, x_j)    for all i, x_i, and j ∼ i,    (3)
        Σ_{x_i} n_i(x_i) = M                   for all i.                     (4)

Here, ν(i) is the degree of node i, and the notation j ∼ i means that j is a neighbor of i. By "subject to" we mean that the probability is zero if the constraints are not satisfied.

The distribution p(n) is the collective graphical model, which is defined over the random variables {n_i} and {n_{ij}}. The CGM distribution satisfies a hyper Markov property: it has conditional independence properties that follow the same essential structure as the original graphical model (Dawid & Lauritzen, 1993). (To see this, note that Eq. (2) factors into separate terms for each node and edge contingency table; when the hard constraints of Eq. (3) are also included as factors, this has the effect of connecting the tables for edges incident on the same node, as illustrated in Figure 1.)

Likelihood. We can combine the CGM with the likelihood term to derive an explicit expression for the (unnormalized) posterior p(n | y) ∝ p(n) p(y | n). Under the Poisson observation model, the likelihood has the form

    p(y | n) = ∏_{i∈U} ∏_{x_i} (α n_i(x_i))^{y_i(x_i)} e^{−α n_i(x_i)} / y_i(x_i)! .
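Continuing the illustrative sketch above (again, not the authors' implementation), Eq. (2) can be transcribed directly for the chain case, where interior nodes have degree ν(i) = 2 and the endpoints have degree 1; the marginals are assumed strictly positive.

```python
import numpy as np
from scipy.special import gammaln   # log n! = gammaln(n + 1)

def cgm_log_prob(node_tables, edge_tables, node_marg, edge_marg):
    """log p(n) from Eq. (2) for a chain-structured CGM, assuming the tables
    already satisfy the consistency constraints (3)-(4)."""
    M = node_tables[0].sum()
    T = node_tables.shape[0]
    logp = gammaln(M + 1)
    for i in range(T):
        nu = 2 if 0 < i < T - 1 else 1       # degree of node i in a chain
        n_i = node_tables[i]
        logp += (nu - 1) * np.sum(gammaln(n_i + 1) - n_i * np.log(node_marg[i]))
    for t in range(T - 1):
        n_ij = edge_tables[t]
        logp += np.sum(n_ij * np.log(edge_marg[t]) - gammaln(n_ij + 1))
    return logp

# Marginals of the individual model, computed by inference in the chain
# (using init, trans, T and the tables from the previous sketch).
node_marg = [init]
for t in range(T - 1):
    node_marg.append(node_marg[-1] @ trans[t])
edge_marg = [node_marg[t][:, None] * trans[t] for t in range(T - 1)]
print(cgm_log_prob(n_nodes, n_edges, np.array(node_marg), np.array(edge_marg)))
```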

Generalizations. For general graph structures, Sheldon & Dietterich (2011) give a probability model analogous to Eq. (2) defined over a junction tree for the graphical model. In that case, the vector n includes contingency tables n_C for each clique C of the junction tree. Different junction trees may be chosen for a particular model, which will lead to different definitions of the hidden variables n and thus slightly different inference problems, but always the same marginal distribution p(y) over observed variables. Higher-order contingency tables may be observed as long as each observed table n_A satisfies A ⊆ C for some clique C, so it can be expressed using marginalization constraints such as Eq. (3). The approximate inference algorithms in Section 5 extend to these more general models in a straightforward way by making the same two approximations presented in that section for the expression log p(n | y) to derive a convex optimization problem.

4. Computational Complexity

There are a number of natural parameters quantifying the difficulty of inference in CGMs: the population size M; the number of variables N; the variable cardinality L; and the clique-width K (largest clique size) of the junction tree used to perform inference, which is bounded below by the tree-width of G plus one. The inputs are: the vector y, the integer M, and the CGM probability model. The vector y has at most NL entries of magnitude O(M), so each can be represented in log M bits. The CGM is fully specified by the potential functions in Equation (1), which have size O(NL^2). Thus, the input size is poly(N, L, log M).

We first describe the best known running time for exact inference in trees (K = 2), which are the focus of this paper.

Theorem 1. When G is a tree, message passing in the CGM solves the aggregate MAP or marginal inference problems in time O(N · min(M^{L^2−1}, L^{2M})).

Proof sketch. Because of the hyper-Markov property, the CGM also has the form of a tree-structured graphical model. Message passing gives an exact solution to the MAP or marginal inference problem in two passes through the tree, which takes O(N) messages (Koller & Friedman, 2009). In a standard implementation of message passing, the time per message is bounded by the maximum over all factors of the product of the cardinalities of the variables in that factor. However, due to the nature of the hard constraints in the CGM, it is possible to bound the time per message by a smaller number, which is the number of values for the random variable n_{ij} (details omitted). The number of contingency tables with c entries that sum to M is C(M + c − 1, c − 1), which is the number of ways of placing M identical balls in c distinct bins. This number is bounded above by M^{c−1} and by (c − 1)^M. In a CGM, the table n_{ij} has L^2 entries, so the number of values for n_{ij} is O(min(M^{L^2−1}, L^{2M})), which gives the desired upper bound.

For general graphical models, message passing on junction trees can be implemented in a similar fashion. For a clique of size K, the contingency table will have L^K entries, so there are O(min(M^{L^K−1}, L^{KM})) possible values of the contingency table. This gives us the following result.

Theorem 2. Message passing on a junction tree with maximum clique size K and maximum variable cardinality L takes time O(N · min(M^{L^K−1}, L^{KM})).

Thus, if either L or M is fixed, message passing runs in time polynomial in the other parameter. When M is constant, then the running time is exponential in the clique-width, which captures the familiar case of discrete graphical models. When L is constant, however, the running time is not only exponential in L but doubly-exponential in the clique-width. Thus, despite being polynomial in one of the parameters, message passing is unlikely to give satisfactory performance on real problems. Finally, the next result tells us that we should not expect to find an algorithm that is polynomial in both parameters.

Theorem 3. Unless P=NP, there is no algorithm for MAP or marginal inference in a CGM that is polynomial in both M and L. This remains true when G is a tree and N = 4.

Figure 2. Reduction from 3-dimensional matching (X_0: hyperedge; X_1, X_2, X_3: elements).

Proof of Theorem 3. The proof is by reduction from exact 3-dimensional (3D) matching to a CGM where both M and L grow with the input size. An instance of exact 3D matching consists of finite sets A_1, A_2, and A_3, each of size M, and a set of hyperedges T ⊆ A_1 × A_2 × A_3. A hyperedge e = (a_1, a_2, a_3) is said to cover a_1, a_2, and a_3. The problem of determining whether there is a subset S ⊆ T of size M that covers each element is NP-complete (Karp, 1972).

To reduce exact 3D matching to inference in collective graphical models, define a graphical model with random variables X_0, X_1, X_2, X_3 such that X_0 ∈ {1, ..., |T|} is a hyperedge chosen uniformly at random, and X_1, X_2, and X_3 are the elements covered by X_0 (see Figure 2). Define the observed counts to be n_i(a) = 1 for all a ∈ A_i, i = 1, 2, 3, which specify that each element is covered exactly once by one of M randomly selected hyperedges. These counts have nonzero probability if and only if there is an exact 3D matching. Thus, MAP or marginal inference can be used to decide exact 3D matching. For MAP, there exist tables {n_{0i}} such that p({n_{0i}}, {n_i}) > 0 if and only if there is a 3D matching. For marginal inference, p({n_i}) > 0 if and only if there is a 3D matching. Because the model used in the reduction is a tree with only four variables, the hardness result clearly holds under that restricted case.
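To see how quickly the message domain grows, the short calculation below (illustrative only) evaluates the exact count C(M + L^2 − 1, L^2 − 1) of possible edge tables alongside the min(M^{L^2−1}, L^{2M}) factor appearing in Theorem 1.

```python
from math import comb

def num_edge_tables(M, L):
    """Number of L x L tables of non-negative integers summing to M:
    C(M + L^2 - 1, L^2 - 1), i.e., M identical balls in L^2 bins."""
    c = L * L
    return comb(M + c - 1, c - 1)

for M, L in [(10, 2), (100, 2), (10, 4), (100, 4)]:
    tables = num_edge_tables(M, L)
    factor = min(M ** (L * L - 1), L ** (2 * M))
    print(f"M={M:>3}, L={L}: {tables:.3e} edge tables; "
          f"Theorem 1 factor min(M^(L^2-1), L^(2M)) = {factor:.3e}")
```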

5. Approximate MAP Inference

In this section, we address the problem of MAP inference in CGMs under the noisy observation model from Section 3. That is, the node tables for an observed set U are corrupted independently by noise, which is conditionally Poisson:

    p(y | n) = ∏_{i∈U} ∏_{x_i} p(y_i(x_i) | n_i(x_i)),    (5)

    p(y_i(x_i) | n_i(x_i)) = (α n_i(x_i))^{y_i(x_i)} e^{−α n_i(x_i)} / y_i(x_i)! .    (6)

Our goal is to maximize the following objective:

    log p(n | y) = log p(n) + log p(y | n) + constant.    (7)

As highlighted earlier, it is computationally intractable to directly optimize log p(n | y). Therefore, we introduce two approximations. First, we relax the constraint that the entries of n be integers. For large sample size M, the effect of allowing fractional values is minimal. Second, as it is hard to incorporate factorial terms log n! directly into an optimization framework, we employ Stirling's approximation: log n! ≈ n log n − n.
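As a quick sanity check on the second approximation (an illustration added here), the relative error of Stirling's formula shrinks rapidly as the counts grow, which is the regime of interest when the population is large.

```python
from math import lgamma, log

# Relative error of Stirling's approximation log n! ~ n log n - n.
for n in [10, 100, 1000, 10000]:
    exact = lgamma(n + 1)            # log n!
    approx = n * log(n) - n
    print(f"n={n:>5}: log n! = {exact:10.2f}, "
          f"approx = {approx:10.2f}, rel. error = {abs(exact - approx) / exact:.3%}")
```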

Using these two approximations, we arrive at the following optimization problem:

    max_n   Σ_{i∼j} Σ_{x_i,x_j} (log µ_{ij}(x_i, x_j) + 1) n_{ij}(x_i, x_j)  −  Σ_{i∈U} Σ_{x_i} α n_i(x_i)
          + Σ_{i∈V} (1 − ν(i)) Σ_{x_i} (log µ_i(x_i) + 1) n_i(x_i)  +  Σ_{i∈U} Σ_{x_i} y_i(x_i) log n_i(x_i)
          − Σ_{i∼j} Σ_{x_i,x_j} n_{ij}(x_i, x_j) log n_{ij}(x_i, x_j)
          − Σ_{i∈V} (1 − ν(i)) Σ_{x_i} n_i(x_i) log n_i(x_i)  +  const,    (8)

subject to (3) and (4), for n_i(x_i), n_{ij}(x_i, x_j) ∈ R_+.

Theorem 4. The optimization problem (8) for approximate MAP inference in tree-structured CGMs is convex.

Proof. The constraints are linear and thus the feasible set is convex. Since this is a maximization problem, we must show that the objective is concave in n, which is clearly true for each term but the last one: the first three terms are linear, and the functions log n and −n log n are concave. The last term is convex. However, the sum of the last two terms,

    − Σ_{i∼j} Σ_{x_i,x_j} n_{ij}(x_i, x_j) log n_{ij}(x_i, x_j)  −  Σ_{i∈V} (1 − ν(i)) Σ_{x_i} n_i(x_i) log n_i(x_i),    (9)

is concave over the feasible set. Indeed, this is exactly the expression for the Bethe entropy of a graphical model, and the constraints (3) and (4) are identical to the constraints for pairwise and node marginals used in Bethe entropy. Bethe entropy is concave over the constraint set of a tree-structured graphical model (Heskes, 2006). The only difference between this and the conventional Bethe entropy is that the variables are normalized to sum to M instead of 1, but scaling in this way does not affect concavity. Therefore, the problem is convex.
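The relaxed problem can be handed to a generic solver. The sketch below (an illustration in Python/SciPy, not the authors' MATLAB interior-point implementation; all names are assumptions) maximizes the Stirling-relaxed objective of Eq. (8) for a chain CGM with every node observed, with the edge tables as free variables, the node tables obtained by marginalization, and constraints (3)-(4) imposed as equalities.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import xlogy

def approx_map_chain(y, mu_node, mu_edge, M, alpha=1.0):
    """Fractional approximate MAP for a chain CGM: maximize the relaxed
    log posterior over the edge tables n_{t,t+1} (shape (T-1, L, L)).
    y: (T, L) Poisson observations (assumes U = V); mu_node: (T, L);
    mu_edge: (T-1, L, L)."""
    T, L = y.shape
    shape = (T - 1, L, L)

    def node_tables(n_edge):
        n = np.empty((T, L))
        n[:-1] = n_edge.sum(axis=2)       # n_t(i) = sum_j n_{t,t+1}(i, j)
        n[-1] = n_edge[-1].sum(axis=0)    # n_T(j) = sum_i n_{T-1,T}(i, j)
        return n

    def neg_log_post(z):
        n_edge = z.reshape(shape)
        n_node = node_tables(n_edge)
        deg = np.full(T, 2.0); deg[[0, -1]] = 1.0
        ll = np.sum(xlogy(n_edge, mu_edge) - xlogy(n_edge, n_edge) + n_edge)
        ll += np.sum((deg - 1)[:, None] *
                     (xlogy(n_node, n_node) - n_node - xlogy(n_node, mu_node)))
        ll += np.sum(xlogy(y, n_node) - alpha * n_node)   # Poisson evidence
        return -ll

    def consistency(z):
        n_edge = z.reshape(shape)
        # Adjacent edge tables agree on their shared node (Eq. 3) ...
        c = [n_edge[t].sum(axis=0) - n_edge[t + 1].sum(axis=1) for t in range(T - 2)]
        # ... and each edge table accounts for the whole population (Eq. 4).
        c.append(n_edge.sum(axis=(1, 2)) - M)
        return np.concatenate(c)

    z0 = np.full(int(np.prod(shape)), M / L ** 2)          # feasible start
    res = minimize(neg_log_post, z0, method="SLSQP",
                   constraints=[{"type": "eq", "fun": consistency}],
                   bounds=[(1e-9, None)] * z0.size,
                   options={"maxiter": 500, "ftol": 1e-9})
    return res.x.reshape(shape)
```

With the quantities from the earlier simulation sketch, approx_map_chain(y, np.array(node_marg), np.array(edge_marg), M) returns fractional edge tables whose entries can stand in for the expected sufficient statistics E[n_{t,t+1} | y]; used inside an EM loop, this call plays the role of the E step, while the M step updates the parameters of the individual model. For problems of realistic size one would supply analytic gradients rather than rely on the solver's finite differences.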

MAP Inference for EM. We now describe how MAP inference for CGMs can be used to significantly accelerate the E-step of the EM algorithm for learning the parameters of the individual model. Let θ denote the unknown parameter to be optimized, let y be the observed variables, and let x = (x^(1), ..., x^(M)) be the hidden variables for all individuals in the population. The EM algorithm iteratively finds parameters θ* that maximize the following expected log-likelihood:

    Q(θ*, θ) = Σ_x p(x | y; θ) log p(x, y; θ*),

where θ* denotes the parameters to optimize and θ denotes the previous iteration's parameters. When the joint distribution p(·) is from an exponential family, as in our case, then the problem simplifies to maximizing log p(n̄, y; θ*), where n̄ = E_θ[n | y] is the expected value of the sufficient statistics n = n(x, y) used to define the model; these are exactly the hidden variables in the CGM. In general, this expectation is difficult to compute and requires the specialized sampling approach of Sheldon & Dietterich (2011).

Instead, we will show that the approximate MAP solution, which approximates the mode of the distribution p(n | y; θ), is also an excellent approximation for its mean E_θ[n | y]. While this may seem surprising, recall that the random variables in question take values that are relatively large non-negative integers. A good analogy is the Binomial distribution (a CGM with only one variable), for which the mode is very close to the mean, and the mode of the continuous extension of the Binomial pmf is even closer to the mean. These characteristics make our approach qualitatively very different from a typical "hard EM" algorithm. Our experiments show that by using the convex optimization approach of Section 5, the approximate mode can be computed extremely quickly and is an excellent substitute for the mean. It is typically a much better approximation of the mean than the one found by Gibbs sampling for reasonable time budgets, and this makes the overall EM procedure many times faster.

6. Evaluation

We evaluated our approximate MAP inference algorithm by measuring its accuracy against exact solutions for small problem instances and by comparing it with Gibbs sampling for marginal inference within the E step of the EM algorithm. For all experiments, we generated data from a chain-structured CGM to simulate wind-dependent migration of a population of M birds from the bottom-left to the top-right corner of an ℓ × ℓ grid, mimicking the seasonal movement of a migratory songbird from a known winter range to a known breeding range. Thus, the variables X_t of the individual model are the grid locations of the individual birds at times t = 1, ..., T, and have cardinality L = ℓ^2. The transition probabilities between grid cells were determined by a log-linear model with four parameters that controlled the effect of features such as direction, distance, and wind on the transition probability. The parameters θ_true were selected manually to generate realistic migration trajectories. After generating data for a population of M birds, we computed node contingency tables and generated observations from the Poisson model (α = 1) for every node. For the experiments that do not involve learning, we perform inference in the same model used to generate the data; that is, the marginal probabilities µ_{ij}(·, ·) and µ_i(·) in the CGM are those determined by θ_true.

We solved the approximate MAP convex optimization problem using MATLAB's interior point solver. For the comparisons with Gibbs sampling, we developed an optimized C implementation of the algorithm of Sheldon & Dietterich (2011) and developed an adaptive rejection sampler for discrete distributions to perform the log-concave sampling required by that algorithm (Gilks & Wild, 1992; Sheldon, 2013).

Accuracy of Approximate MAP Solutions. To evaluate the impact of the two approximations in our approximate MAP algorithm, we first compare its solution n*_approx to the exact solution n*_exact obtained by message passing for small models (L = 4, T = 6). We expect the fractional relaxation and Stirling's approximation to be more accurate as M increases. Because, by Theorem 1, the running time of message passing in this model is O(M^15), we are limited to tiny populations. Nevertheless, Figure 3 shows that the relative error ||n*_approx − n*_exact||_1 / ||n*_exact||_1 is already less than 1% for M = 7. For all M, approximate MAP takes less than 0.2 seconds, while the running time of exact MAP scales very poorly.

Figure 3. The effect of population size M on accuracy and running time of approximate MAP (L = 4, T = 6). Left: relative error vs. M. Right: running time vs. M.

Marginal Inference. We next evaluated the approximate MAP inference algorithm to show that it can solve the EM marginal inference problem more quickly and accurately than Gibbs sampling. For these experiments, we fixed T = 20 and varied L by increasing the grid size. The largest models (L = 49) result in a hidden vector with (T − 1)L^2 ≈ 46K entries. The goal is to approximate E_θ[n | y]. Since we cannot compute the exact answer for non-trivial problems, we ran ten very long runs of Gibbs sampling (10 million iterations), and then compared each Gibbs run, as well as approximate MAP, to the reference solution obtained by averaging the nine remaining Gibbs runs; this yields ten evaluations for each method.

Figure 4 shows that the solver quickly finds an optimal solution to the approximate MAP problem, and it takes Gibbs sampling nearly 100 times longer to reach a solution that is as close to the reference as the one found by approximate MAP. Table 1 shows that the same pattern holds as the problem size increases: for increasing values of L, Gibbs consistently takes 50 to 100 times longer to find a solution as close to the reference solution as the one found by MAP.

Table 1. Comparison of Gibbs vs. MAP: seconds to achieve the same relative error compared with the reference solution.

    L            9      16     25     36     49
    MAP time     0.9    1.9    3.4    9.7    17.2
    Gibbs time   161.8  251.6  354.0  768.1  1115.5

Figure 4. Relative difference from reference solution versus time (log scale) for MAP and Gibbs (M = 1000, L = 49). Gibbs and MAP: average over 10 trials; 95% confidence intervals are negligible.

We conjecture that the approximate MAP solution may be extremely close to the ground truth. In Figure 4, each Gibbs solution has a relative difference of about 0.09 from the reference solution computed using the other nine Gibbs runs, which suggests that there is still substantial error in the Gibbs runs after 10 million iterations. Furthermore, each time we increased the number of Gibbs iterations used to compute a reference solution, the MAP relative error decreased, meaning that the reference solution was getting closer to the MAP solution.

Learning. Finally, we evaluated approximate MAP as a substitute for Gibbs within the full EM learning procedure. We initialized the parameter vector θ randomly and then ran three variants of the EM algorithm: MAP-EM uses approximate MAP within the E step; Monte Carlo EM (MCEM) uses a fixed number of 100K Gibbs iterations per E step; and stochastic approximation EM (SAEM) uses a smaller number of 10K Gibbs iterations per E step, but it combines those with samples from previous iterations. SAEM usually has better convergence properties than MCEM (Delyon et al., 1999). We employed step sizes of γ_t = 1/t^0.75 within SAEM. In the M step, we applied a gradient-based solver to update the parameters θ of the log-linear model for transition probabilities. For each algorithm we measured the relative error of the learned parameter vectors from θ_true.

Figure 5. Relative error of learned parameters versus running time for different EM algorithms (MAP and MCEM: 100 EM iterations, SAEM: 600 EM iterations, M = 1000). Panels show L = 16, 25, 36, 49.

Figure 5 shows that MAP-EM dramatically outperforms the other two algorithms, especially as the problem size increases. SAEM has better long-term convergence behavior than MCEM, but MAP-EM finds parameters that are within 1% relative error of θ_true, while SAEM still has relative error greater than 40% after a much longer running time for the larger problems. We conclude that approximate MAP is an excellent substitute for marginal inference with the EM algorithm, both in terms of accuracy and running time.

7. Conclusion

We presented hardness results and approximate algorithms for the problems of MAP and marginal inference in collective graphical models. We showed that exact inference by message passing runs in time that is polynomial either in the population size or the cardinality of the variables, but there is no algorithm that is polynomial in both of these parameters unless P=NP. We then showed that the MAP problem can be formulated approximately as a non-linear convex optimization problem. We demonstrated empirically that this approximation is very accurate even for modest population sizes and that approximate MAP inference is an excellent substitute for marginal inference for computing the E step of the EM algorithm. Our approximate MAP inference algorithm leads to a learning procedure that is much more accurate and runs in a fraction of the time of the only known alternative algorithms.

Acknowledgments. We thank the anonymous reviewers for constructive comments that improved the manuscript. This material is based upon work supported by the National Science Foundation under Grant No. IIS-1117954.

References

Apsel, U. and Brafman, R. Extended lifted inference with joint formulas. In Proceedings of the 27th Annual Conference on Uncertainty in Artificial Intelligence (UAI), pp. 11-18, Corvallis, Oregon, 2011.

Dawid, A. P. and Lauritzen, S. L. Hyper Markov laws in the statistical analysis of decomposable graphical models. The Annals of Statistics, 21(3):1272-1317, 1993.

de Salvo Braz, R., Amir, E., and Roth, D. Lifted first-order probabilistic inference. In Getoor, L. and Taskar, B. (eds.), Introduction to Statistical Relational Learning, pp. 433-452. The MIT Press, 2007.

Delyon, B., Lavielle, M., and Moulines, E. Convergence of a stochastic approximation version of the EM algorithm. Annals of Statistics, 27(1):94-128, 1999.

Diaconis, P. and Sturmfels, B. Algebraic algorithms for sampling from conditional distributions. The Annals of Statistics, 26(1):363-397, 1998.

Dobra, A. Markov bases for decomposable graphical models. Bernoulli, 9(6):1093-1108, 2003.

Fierens, D. and Kersting, K. From lifted inference to lifted models. In Second International Workshop on Statistical Relational AI, 2012.

Getoor, L. and Taskar, B. Introduction to Statistical Relational Learning. The MIT Press, 2007.

Gilks, W. R. and Wild, P. Adaptive rejection sampling for Gibbs sampling. Journal of the Royal Statistical Society, Series C (Applied Statistics), 41(2):337-348, 1992.

Heskes, T. Convexity arguments for efficient minimization of the Bethe and Kikuchi free energies. Journal of Artificial Intelligence Research, 26(1):153-190, 2006.

Kalbfleisch, J. D., Lawless, J. F., and Vollmer, W. M. Estimation in Markov models from aggregate data. Biometrics, 1983.

Karp, R. M. Reducibility among combinatorial problems. Complexity of Computer Computations, 1972.

Koller, D. and Friedman, N. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.

Lee, T. C., Judge, G. G., and Zellner, A. Estimating the parameters of the Markov probability model from aggregate data. North-Holland Pub. Co., 1970.

MacRae, E. C. Estimation of time-varying Markov processes with aggregate data. Econometrica: Journal of the Econometric Society, 1977.

Milch, B., Zettlemoyer, L. S., Kersting, K., Haimes, M., and Kaelbling, L. P. Lifted probabilistic inference with counting formulas. In Proceedings of the 23rd Conference on Artificial Intelligence (AAAI), pp. 1062-1068, 2008.

Sheldon, D. and Dietterich, T. G. Collective graphical models. In Advances in Neural Information Processing Systems (NIPS), 2011.

Sheldon, D., Elmohamed, M. A. S., and Kozen, D. Collective inference on Markov models for modeling bird migration. In Advances in Neural Information Processing Systems (NIPS), 2008.

Sheldon, D. Manipulation of PageRank and Collective Hidden Markov Models. PhD thesis, Cornell University, 2009.

Sheldon, D. Discrete adaptive rejection sampling. Technical Report UM-CS-2013-012, School of Computer Science, University of Massachusetts, Amherst, Massachusetts, May 2013.

Sullivan, B. L., Wood, C. L., Iliff, M. J., Bonney, R. E., Fink, D., and Kelling, S. eBird: A citizen-based bird observation network in the biological sciences. Biological Conservation, 142(10):2282-2292, 2009.

Sundberg, R. Some results about decomposable (or Markov-type) models for multidimensional contingency tables: distribution of marginals and partitioning of tests. Scandinavian Journal of Statistics, 2(2):71-79, 1975.

Van Der Plas, A. P. On the estimation of the parameters of Markov probability models using macro data. The Annals of Statistics, 1983.