
Approximate Inference in Collective Graphical Models

Daniel Sheldon (1)  [email protected]
Tao Sun (1)  [email protected]
Akshat Kumar (2)  [email protected]
Thomas G. Dietterich (3)  [email protected]

(1) University of Massachusetts, Amherst, MA 01002, USA
(2) IBM Research, New Delhi 110070, India
(3) Oregon State University, Corvallis, OR 97331, USA

Abstract

We study the problem of approximate inference in collective graphical models (CGMs), which were recently introduced to model the problem of learning and inference with noisy aggregate observations. We first analyze the complexity of inference in CGMs: unlike inference in conventional graphical models, exact inference in CGMs is NP-hard even for tree-structured models. We then develop a tractable convex approximation to the NP-hard MAP inference problem in CGMs, and show how to use MAP inference for approximate marginal inference within the EM framework. We demonstrate empirically that these approximation techniques can reduce the computational cost of inference by two orders of magnitude and the cost of learning by at least an order of magnitude while providing solutions of equal or better quality.

1. Introduction

Sheldon & Dietterich (2011) introduced collective graphical models (CGMs) to model the problem of learning and inference with noisy aggregate data. CGMs are motivated by the growing number of applications where data about individuals are not available, but aggregate population-level data in the form of counts or contingency tables are available. For example, the US Census Bureau cannot release individual records for privacy reasons, so they commonly release low-dimensional contingency tables that classify each member of the population according to a few demographic variables. In ecology, surveys provide counts of animals in different locations, but they cannot identify individuals.

CGMs are generative models that serve as a link between individual behavior and aggregate data. As a concrete example, consider the model illustrated in Figure 1(a) for modeling bird migration from observational data collected by citizen scientists through the eBird project (Sheldon et al., 2008; Sheldon, 2009; Sullivan et al., 2009). Inside the plate, an independent Markov chain describes the migration of each bird among a discrete set of locations: X_t^m represents the location of the mth bird at time t. Outside the plate, aggregate observations are made about the spatial distribution of the population: the variable n_t is a vector whose ith entry counts the number of birds in location i at time t. By observing temporal changes in the vectors {n_t}, one can make inferences about migratory routes without tracking individual birds.

In general CGMs, any discrete graphical model can appear inside the plate to model individuals in a population, and observations are made in the form of (noisy) low-dimensional contingency tables (Sheldon & Dietterich, 2011). A key problem we would like to solve is learning the model parameters (of the individual model) from the aggregate data, for which inference is the key subroutine. Unfortunately, standard inference techniques applied to CGMs quickly become computationally intractable as the population size increases, due to the large number of hidden individual-level variables that are all connected by the aggregate counts.

A key to efficient inference in CGMs is the fact that, when only aggregate data are being modeled, the same data-generating mechanism can be described much more compactly by analytically marginalizing away the individual variables to obtain a direct probabilistic model for the sufficient statistics (Sundberg, 1975; Sheldon & Dietterich, 2011).

Figure 1. CGM example: (a) Individuals are explicitly modeled as chains X_1^m, ..., X_T^m (m = 1:M) with count vectors n_1, ..., n_T. (b) After marginalization, the hidden variables are the sufficient statistics n_{1,2}, ..., n_{T-1,T} of the individual model.

Figure 1(b) illustrates the resulting model for the bird migration example. The new hidden variables n_{t,t+1} are tables of sufficient statistics: the entry n_{t,t+1}(i, j) is the number of birds that fly from location i to location j at time t. For large populations, the resulting model is much more amenable to inference, because it has many fewer variables and it retains a graphical structure analogous to that of the individual model. However, the reduction in the number of variables comes at a cost: the new variables are tables of integer counts, which can take on many more values than the original discrete variables in the individual model, and this adversely affects the running time of inference algorithms.

The first contribution of this paper is to characterize the computational complexity of exact inference in CGMs. For tree-structured graphical models, the running time of exact inference (MAP or marginal) by message passing in the CGM is polynomial in either the population size or the cardinality of the variables in the individual model (when the other parameter is fixed). However, there is no algorithm that is polynomial in both parameters unless P=NP. This is a striking difference from inference in standard tree-structured graphical models, for which the running time of message passing is always polynomial in the variable cardinality. We also analyze the running time of message passing in a junction tree for general (non-tree-structured) CGMs to draw out another difference between CGMs and standard graphical models: the dependence on clique-width is doubly-exponential instead of singly-exponential for some parameter regimes.

Our second main contribution is an approximate algorithm for MAP inference in CGMs that is based on a continuous and convex approximation of the MAP optimization problem. Given observed evidence, the algorithm computes a fractional approximation of the MAP values of the sufficient statistics. Although the true MAP values are integers, when the population of individuals is large, the fractional approximation is very good and can be interpreted as describing the percentages of the population in the MAP configuration. In bird migration, for example, this describes the fraction of the population that flies between each pair of locations for each day of the year.

Our final contribution is to show that this fractional MAP approximation can be applied within an EM algorithm to accurately learn the parameters of the CGM. In EM, one must compute the posterior mean (the expected value of the sufficient statistics given the observations). While the MAP configuration is not the mean, we show experimentally that the (fractional approximate) MAP configuration provides an excellent approximation to the posterior mean. Indeed, for a fixed time budget, it is usually significantly more accurate than computing the expected sufficient statistics via Gibbs sampling. We show that our approach dramatically accelerates parameter learning while still achieving less than 1% error.

2. Related Work

Sheldon et al. (2008) solved a related MAP inference problem on a chain-structured CGM using linear programming and network flow techniques. Sheldon (2009) extended those algorithms to the case when observations are corrupted by log-concave noise models. The MAP problem in those papers is slightly different from ours: it seeks the most likely setting of all of the individual variables, while we seek the most likely setting of the sufficient statistics, which considers the probability of all possible settings of the individual variables that give the same counts. This gives rise to combinatorial terms in the probability model (see Equation (2) of Section 3) and leads to harder non-linear optimization problems.

Sheldon & Dietterich (2011) generalized the previous ideas from chain-structured models to arbitrary discrete graphical models and developed the first algorithms for marginal inference in CGMs, which were based on Gibbs sampling and Markov bases (Diaconis & Sturmfels, 1998; Dobra, 2003). They showed empirically that Gibbs sampling is much faster than exact inference: for some tasks, the running time to achieve a fixed error level is independent of the population size. However, no analogous approximate method was developed for MAP inference.

Inference in CGMs is related to lifted inference in relational models (Getoor & Taskar, 2007). A CGM can be viewed as a simple relational model containing a single logical variable to describe the repetition over individuals. A unique feature of CGMs is that all evidence occurs at the aggregate level. The most related ideas from lifted inference are counting elimination (de Salvo Braz et al., 2007) and counting conversion (Milch et al., 2008), which perform aggregation operations similar to those performed by a CGM. See also Apsel & Brafman (2011). Fierens & Kersting (2012) recently proposed to "lift" probabilistic models instead of inference algorithms, which is very similar in spirit to CGMs, but they did not present a general approach to do so. CGMs lift any model with a single logical variable when all evidence is at the population level. In general, while lifted inference algorithms use counting arguments similar to those that appear in CGMs, they cannot reproduce the CGM model, and none of our results are consequences of any of the results in those papers.

Multiple authors in econometrics and statistics have considered the problem of fitting the transition probabilities of a Markov chain from aggregate data by a conditional least squares approach (Lee et al., 1970; MacRae, 1977; Kalbfleisch et al., 1983). Van Der Plas (1983) showed that the conditional least squares estimator is consistent and asymptotically normal under weak assumptions about the Markov model. This problem can be viewed as learning in the CGM in Figure 1, where the graphical model is chain-structured and single-variable contingency tables are observed exactly for each node. In our work, the CGM may take on a more general graph structure, some nodes may be unobserved, and the observations may be noisy, so the conditional least squares estimator is not applicable.

3. Problem Statement

In this section, we describe the generative model for aggregate data, introduce collective graphical models, and state the problems of (collective) marginal and MAP inference.

Generative Model for Aggregate Data. The probability model starts with a graphical model over random variables X_1, ..., X_N. Let x = (x_1, ..., x_N) be a particular assignment to the variables (for simplicity, assume that each takes values in [L] = {1, ..., L}), and let G = (V, E) be the independence graph. The probability model is

    p(x) = (1/Z) ∏_{(i,j)∈E} φ_{ij}(x_i, x_j).    (1)

Here Z is the normalization constant and φ_{ij} : [L]^2 → R_+ are edge potentials. We refer to this as the individual model. For the remainder of the paper, we assume that G is a tree to develop the important ideas while keeping the exposition manageable. For a graph with cycles, the methods of this paper can be applied to perform inference on a junction tree derived from G.

To generate the aggregate data, first assume that M independent vectors x^(1), ..., x^(M) are drawn from the individual probability model to represent the individuals in a population. Aggregate observations are then made in the form of contingency tables on small sets of variables, which count the number of times that each possible combination of those variables appears in the population. Specifically, a contingency table n_A on variables (X_i)_{i∈A} is a table with entries

    n_A(x_A) = Σ_{m=1}^{M} 1{x_A^(m) = x_A},    x_A ∈ [L]^|A|,

where x_A^(m) is the subvector of x^(m) corresponding to the indices in A. In this section, we will focus first on observed tables n_i := n_{{i}} over single variables, which we refer to as node tables. Later, the tables n_{ij} := n_{{i,j}} for edges (i, j), which are the sufficient statistics of the individual model, will play a prominent role. We will refer to these as edge tables.

We will consider both an exact and a noisy model for observing contingency tables. Let U be an observed subset of nodes. In each model, for each i ∈ U we observe a table y_i of the same dimension as n_i, where the entry y_i(x_i) has one of the following distributions:

    Exact observations:  y_i(x_i) = n_i(x_i).
    Noisy observations:  y_i(x_i) | n_i(x_i) ∼ Pois(α · n_i(x_i)).

The Poisson model is motivated by the bird migration problem, and models birds being counted at a rate proportional to their true density. While it is helpful to focus on these two observation models, considerable variation is possible without significantly changing the results: (1) observations of the different types can be mixed, (2) higher-order contingency tables, such as the edge tables n_{ij}, may be observed, either exactly or noisily, (3) some table entries may be unobserved while others are observed, or, in the noisy model, they may have multiple independent observations, and (4) the Poisson model can be replaced by any other noise model p(y | n) that is log-concave in n. Sheldon & Dietterich (2011) discuss some of these extensions further. In this paper, the exact observation model is used to prove the hardness results, while the Poisson model is used in the algorithms and experiments. Since the exact model can be obtained as the limiting case of a log-concave noise model (e.g., y | n ∼ Normal(n, σ^2) as σ^2 → 0), we do not expect that noisy observations lead to a more tractable problem.
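To make the generative process above concrete, the following sketch (an illustration added for this presentation, not code from the paper; the function and variable names are assumptions of our own) samples a population from a chain-structured individual model specified by an initial distribution and transition matrices, accumulates the node and edge contingency tables, and draws Poisson observations of the node tables.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_aggregate_data(init, trans, M, alpha=1.0):
    """Sample M individuals from a chain-structured individual model and
    return the node tables n_t, the edge tables n_{t,t+1}, and Poisson
    observations y_t.  `init` is the distribution of X_1 and `trans[t]` is
    the row-stochastic transition matrix from X_t to X_{t+1}."""
    L, T = len(init), len(trans) + 1
    node_tables = np.zeros((T, L), dtype=int)          # n_t(i)
    edge_tables = np.zeros((T - 1, L, L), dtype=int)   # n_{t,t+1}(i, j)
    for _ in range(M):
        x = rng.choice(L, p=init)
        node_tables[0, x] += 1
        for t in range(T - 1):
            x_next = rng.choice(L, p=trans[t][x])
            edge_tables[t, x, x_next] += 1
            node_tables[t + 1, x_next] += 1
            x = x_next
    # Noisy observation model: y_t(i) | n_t(i) ~ Pois(alpha * n_t(i)).
    obs = rng.poisson(alpha * node_tables)
    return node_tables, edge_tables, obs

L, T, M = 4, 6, 1000
init = np.full(L, 1.0 / L)
trans = [np.full((L, L), 0.1) + 0.6 * np.eye(L) for _ in range(T - 1)]  # rows sum to 1
n_nodes, n_edges, y = sample_aggregate_data(init, trans, M)
# Consistency with Eq. (3): each node table is a margin of an adjacent edge table.
assert np.array_equal(n_edges[0].sum(axis=1), n_nodes[0])
```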
Inference Problems. Suppose we wish to learn the parameters of the individual model from these aggregate observations. To do this, we need to know the values of the sufficient statistics of the individual model, namely n_{ij} for all edges (i, j) ∈ E. Our observation models do not directly observe these. Fortunately, we can apply the EM algorithm, in which case we need to know the expected values of the sufficient statistics given the observations: E[n_{ij} | y].

This leads us to define two inference problems: marginal inference and MAP inference. The aggregate marginal inference problem is to compute the conditional distributions p(n_{ij} | y) for all (i, j) ∈ E. Although these are finite discrete distributions, the variable n_{ij} can take a very large number of values (there are C(M + L^2 − 1, L^2 − 1) = O(M^{L^2−1}) different L × L tables of non-negative integers that sum to M), so we typically don't want to represent its distribution explicitly in tabular form. An important special case of marginal inference that does not require a large tabular potential is to compute the L × L tables of expected values E[n_{ij} | y], which are the quantities needed for the E step of the EM algorithm. (Sheldon & Dietterich (2011) also showed how to generate samples from p(n_{ij} | y), which is an alternative way to query the posterior distribution without storing a huge tabular potential.)

The aggregate MAP inference problem is to find the tables n = (n_{ij})_{(i,j)∈E} that jointly maximize p(n | y). A primary focus of this paper is approximate algorithms for aggregate MAP inference. One reason for conducting MAP inference is the usual one: to reconstruct the most likely value of n given the evidence (e.g., for data exploration). However, a second and important motivation is the fact that the posterior mode of p(n | y) is an excellent approximation for the posterior mean E[n | y] in this model, so approximate MAP inference also gives an approximate algorithm for the important marginal inference problem needed for the EM algorithm.

Collective Graphical Models. Notice that in the setting we are considering, our observations and queries only concern aggregate quantities. The observations are (noisy) counts and the queries are MAP or marginal probabilities over sufficient statistics (which are also counts). In this setting, we don't care about the values of the individual variables {x^(1), ..., x^(M)}, so we can marginalize them away. This marginalization can be performed analytically to obtain a probability model with many fewer variables. It results in a model whose random variables are the vector n of sufficient statistics and the vector y of observations. For trees (and, more generally, for junction trees), the distribution of n can be written in closed form given the marginal probabilities µ_i(x_i) = Pr(X_i = x_i) and µ_{ij}(x_i, x_j) = Pr(X_i = x_i, X_j = x_j). (If marginal probabilities are not given, they can be computed by performing inference in the individual model.) The following expression is originally due to Sundberg (1975):

    p(n) = M! ∏_{i∈V} ∏_{x_i} [ n_i(x_i)! / µ_i(x_i)^{n_i(x_i)} ]^{ν(i)−1}
               × ∏_{(i,j)∈E} ∏_{x_i,x_j} µ_{ij}(x_i, x_j)^{n_{ij}(x_i,x_j)} / n_{ij}(x_i, x_j)! ,    (2)

    subject to:
        n_i(x_i) = Σ_{x_j} n_{ij}(x_i, x_j)    for all i, x_i, and j ∼ i,    (3)
        Σ_{x_i} n_i(x_i) = M                   for all i.                     (4)

Here, ν(i) is the degree of node i, and the notation j ∼ i means that j is a neighbor of i. By "subject to" we mean that the probability is zero if the constraints are not satisfied.

The distribution p(n) is the collective graphical model, which is defined over the random variables {n_i} and {n_{ij}}. The CGM distribution satisfies a hyper Markov property: it has conditional independence properties that follow the same essential structure as the original graphical model (Dawid & Lauritzen, 1993). (To see this, note that Eq. (2) factors into separate terms for each node and edge contingency table; when the hard constraints of Eq. (3) are also included as factors, this has the effect of connecting the tables for edges incident on the same node, as illustrated in Figure 1.)

Likelihood. We can combine the CGM with the likelihood term to derive an explicit expression for the (unnormalized) posterior p(n | y) ∝ p(n) p(y | n). Under the Poisson observation model, the likelihood has the form

    p(y | n) = ∏_{i∈U} ∏_{x_i} (α n_i(x_i))^{y_i(x_i)} e^{−α n_i(x_i)} / y_i(x_i)! .
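Continuing the illustrative sketch above (again, not the authors' implementation), Eq. (2) can be transcribed directly for the chain case, where interior nodes have degree ν(i) = 2 and the endpoints have degree 1; the marginals are assumed strictly positive.

```python
import numpy as np
from scipy.special import gammaln   # log n! = gammaln(n + 1)

def cgm_log_prob(node_tables, edge_tables, node_marg, edge_marg):
    """log p(n) from Eq. (2) for a chain-structured CGM, assuming the tables
    already satisfy the consistency constraints (3)-(4)."""
    M = node_tables[0].sum()
    T = node_tables.shape[0]
    logp = gammaln(M + 1)
    for i in range(T):
        nu = 2 if 0 < i < T - 1 else 1       # degree of node i in a chain
        n_i = node_tables[i]
        logp += (nu - 1) * np.sum(gammaln(n_i + 1) - n_i * np.log(node_marg[i]))
    for t in range(T - 1):
        n_ij = edge_tables[t]
        logp += np.sum(n_ij * np.log(edge_marg[t]) - gammaln(n_ij + 1))
    return logp

# Marginals of the individual model, computed by inference in the chain
# (using init, trans, T and the tables from the previous sketch).
node_marg = [init]
for t in range(T - 1):
    node_marg.append(node_marg[-1] @ trans[t])
edge_marg = [node_marg[t][:, None] * trans[t] for t in range(T - 1)]
print(cgm_log_prob(n_nodes, n_edges, np.array(node_marg), np.array(edge_marg)))
```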

Generalizations. For general graph structures, Sheldon & Dietterich (2011) give a probability model analogous to Eq. (2) defined over a junction tree for the graphical model. In that case, the vector n includes contingency tables n_C for each clique C of the junction tree. Different junction trees may be chosen for a particular model, which will lead to different definitions of the hidden variables n and thus slightly different inference problems, but always the same marginal distribution p(y) over observed variables. Higher-order contingency tables may be observed as long as each observed table n_A satisfies A ⊆ C for some clique C, so it can be expressed using marginalization constraints such as Eq. (3). The approximate inference algorithms in Section 5 extend to these more general models in a straightforward way by making the same two approximations presented in that section for the expression log p(n | y) to derive a convex optimization problem.

4. Computational Complexity

There are a number of natural parameters quantifying the difficulty of inference in CGMs: the population size M; the number of variables N; the variable cardinality L; and the clique-width K (largest clique size) of the junction tree used to perform inference, which is bounded below by the tree-width of G plus one. The inputs are: the vector y, the integer M, and the CGM probability model. The vector y has at most NL entries of magnitude O(M), so each can be represented in log M bits. The CGM is fully specified by the potential functions in Equation (1), which have size O(NL^2). Thus, the input size is poly(N, L, log M).

We first describe the best known running time for exact inference in trees (K = 2), which are the focus of this paper.

Theorem 1. When G is a tree, message passing in the CGM solves the aggregate MAP or marginal inference problems in time O(N · min(M^{L^2−1}, L^{2M})).

Proof sketch. Because of the hyper-Markov property, the CGM also has the form of a tree-structured graphical model. Message passing gives an exact solution to the MAP or marginal inference problem in two passes through the tree, which takes O(N) messages (Koller & Friedman, 2009). In a standard implementation of message passing, the time per message is bounded by the maximum over all factors of the product of the cardinalities of the variables in that factor. However, due to the nature of the hard constraints in the CGM, it is possible to bound the time per message by a smaller number, which is the number of values for the random variable n_{ij} (details omitted). The number of contingency tables with c entries that sum to M is C(M + c − 1, c − 1), which is the number of ways of placing M identical balls in c distinct bins. This number is bounded above by M^{c−1} and by (c − 1)^M. In a CGM, the table n_{ij} has L^2 entries, so the number of values for n_{ij} is O(min(M^{L^2−1}, L^{2M})), which gives the desired upper bound.

For general graphical models, message passing on junction trees can be implemented in a similar fashion. For a clique of size K, the contingency table will have L^K entries, so there are O(min(M^{L^K−1}, L^{KM})) possible values of the contingency table. This gives us the following result.

Theorem 2. Message passing on a junction tree with maximum clique size K and maximum variable cardinality L takes time O(N · min(M^{L^K−1}, L^{KM})).

Thus, if either L or M is fixed, message passing runs in time polynomial in the other parameter. When M is constant, then the running time is exponential in the clique-width, which captures the familiar case of discrete graphical models. When L is constant, however, the running time is not only exponential in L but doubly-exponential in the clique-width. Thus, despite being polynomial in one of the parameters, message passing is unlikely to give satisfactory performance on real problems. Finally, the next result tells us that we should not expect to find an algorithm that is polynomial in both parameters.

Theorem 3. Unless P=NP, there is no algorithm for MAP or marginal inference in a CGM that is polynomial in both M and L. This remains true when G is a tree and N = 4.

Figure 2. Reduction from 3-dimensional matching (X_0: hyperedge; X_1, X_2, X_3: elements).

Proof of Theorem 3. The proof is by reduction from exact 3-dimensional (3D) matching to a CGM where both M and L grow with the input size. An instance of exact 3D matching consists of finite sets A_1, A_2, and A_3, each of size M, and a set of hyperedges T ⊆ A_1 × A_2 × A_3. A hyperedge e = (a_1, a_2, a_3) is said to cover a_1, a_2, and a_3. The problem of determining whether there is a subset S ⊆ T of size M that covers each element is NP-complete (Karp, 1972).

To reduce exact 3D matching to inference in collective graphical models, define a graphical model with random variables X_0, X_1, X_2, X_3 such that X_0 ∈ {1, ..., |T|} is a hyperedge chosen uniformly at random, and X_1, X_2, and X_3 are the elements covered by X_0 (see Figure 2). Define the observed counts to be n_i(a) = 1 for all a ∈ A_i, i = 1, 2, 3, which specify that each element is covered exactly once by one of M randomly selected hyperedges. These counts have nonzero probability if and only if there is an exact 3D matching. Thus, MAP or marginal inference can be used to decide exact 3D matching. For MAP, there exist tables {n_{0i}} such that p({n_{0i}}, {n_i}) > 0 if and only if there is a 3D matching. For marginal inference, p({n_i}) > 0 if and only if there is a 3D matching. Because the model used in the reduction is a tree with only four variables, the hardness result clearly holds under that restricted case.
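To see how quickly the message domain grows, the short calculation below (illustrative only) evaluates the exact count C(M + L^2 − 1, L^2 − 1) of possible edge tables alongside the min(M^{L^2−1}, L^{2M}) factor appearing in Theorem 1.

```python
from math import comb

def num_edge_tables(M, L):
    """Number of L x L tables of non-negative integers summing to M:
    C(M + L^2 - 1, L^2 - 1), i.e., M identical balls in L^2 bins."""
    c = L * L
    return comb(M + c - 1, c - 1)

for M, L in [(10, 2), (100, 2), (10, 4), (100, 4)]:
    tables = num_edge_tables(M, L)
    factor = min(M ** (L * L - 1), L ** (2 * M))
    print(f"M={M:>3}, L={L}: {tables:.3e} edge tables; "
          f"Theorem 1 factor min(M^(L^2-1), L^(2M)) = {factor:.3e}")
```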

5. Approximate MAP Inference

In this section, we address the problem of MAP inference in CGMs under the noisy observation model from Section 3. That is, the node tables for an observed set U are corrupted independently by noise, which is conditionally Poisson:

    p(y | n) = ∏_{i∈U} ∏_{x_i} p(y_i(x_i) | n_i(x_i)),    (5)

    p(y_i(x_i) | n_i(x_i)) = (α n_i(x_i))^{y_i(x_i)} e^{−α n_i(x_i)} / y_i(x_i)! .    (6)

Our goal is to maximize the following objective:

    log p(n | y) = log p(n) + log p(y | n) + constant.    (7)

As highlighted earlier, it is computationally intractable to directly optimize log p(n | y). Therefore, we introduce two approximations. First, we relax the constraint that the entries of n be integers. For large sample size M, the effect of allowing fractional values is minimal. Second, as it is hard to incorporate factorial terms log n! directly into an optimization framework, we employ Stirling's approximation: log n! ≈ n log n − n.
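As a quick sanity check on the second approximation (an illustration added here), the relative error of Stirling's formula shrinks rapidly as the counts grow, which is the regime of interest when the population is large.

```python
from math import lgamma, log

# Relative error of Stirling's approximation log n! ~ n log n - n.
for n in [10, 100, 1000, 10000]:
    exact = lgamma(n + 1)            # log n!
    approx = n * log(n) - n
    print(f"n={n:>5}: log n! = {exact:10.2f}, "
          f"approx = {approx:10.2f}, rel. error = {abs(exact - approx) / exact:.3%}")
```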

Using these two approximations, we arrive at the following optimization problem:

    max_n   Σ_{i∼j} Σ_{x_i,x_j} (log µ_{ij}(x_i, x_j) + 1) n_{ij}(x_i, x_j)  −  Σ_{i∈U} Σ_{x_i} α n_i(x_i)
          + Σ_{i∈V} (1 − ν(i)) Σ_{x_i} (log µ_i(x_i) + 1) n_i(x_i)  +  Σ_{i∈U} Σ_{x_i} y_i(x_i) log n_i(x_i)
          − Σ_{i∼j} Σ_{x_i,x_j} n_{ij}(x_i, x_j) log n_{ij}(x_i, x_j)
          − Σ_{i∈V} (1 − ν(i)) Σ_{x_i} n_i(x_i) log n_i(x_i)  +  const,    (8)

subject to (3) and (4), for n_i(x_i), n_{ij}(x_i, x_j) ∈ R_+.

Theorem 4. The optimization problem (8) for approximate MAP inference in tree-structured CGMs is convex.

Proof. The constraints are linear and thus the feasible set is convex. Since this is a maximization problem, we must show that the objective is concave in n, which is clearly true for each term but the last one: the first three terms are linear, and the functions log n and −n log n are concave. The last term is convex. However, the sum of the last two terms,

    − Σ_{i∼j} Σ_{x_i,x_j} n_{ij}(x_i, x_j) log n_{ij}(x_i, x_j)  −  Σ_{i∈V} (1 − ν(i)) Σ_{x_i} n_i(x_i) log n_i(x_i),    (9)

is concave over the feasible set. Indeed, this is exactly the expression for the Bethe entropy of a graphical model, and the constraints (3) and (4) are identical to the constraints for pairwise and node marginals used in Bethe entropy. Bethe entropy is concave over the constraint set of a tree-structured graphical model (Heskes, 2006). The only difference between this and the conventional Bethe entropy is that the variables are normalized to sum to M instead of 1, but scaling in this way does not affect concavity. Therefore, the problem is convex.
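The relaxed problem can be handed to a generic solver. The sketch below (an illustration in Python/SciPy, not the authors' MATLAB interior-point implementation; all names are assumptions) maximizes the Stirling-relaxed objective of Eq. (8) for a chain CGM with every node observed, with the edge tables as free variables, the node tables obtained by marginalization, and constraints (3)-(4) imposed as equalities.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import xlogy

def approx_map_chain(y, mu_node, mu_edge, M, alpha=1.0):
    """Fractional approximate MAP for a chain CGM: maximize the relaxed
    log posterior over the edge tables n_{t,t+1} (shape (T-1, L, L)).
    y: (T, L) Poisson observations (assumes U = V); mu_node: (T, L);
    mu_edge: (T-1, L, L)."""
    T, L = y.shape
    shape = (T - 1, L, L)

    def node_tables(n_edge):
        n = np.empty((T, L))
        n[:-1] = n_edge.sum(axis=2)       # n_t(i) = sum_j n_{t,t+1}(i, j)
        n[-1] = n_edge[-1].sum(axis=0)    # n_T(j) = sum_i n_{T-1,T}(i, j)
        return n

    def neg_log_post(z):
        n_edge = z.reshape(shape)
        n_node = node_tables(n_edge)
        deg = np.full(T, 2.0); deg[[0, -1]] = 1.0
        ll = np.sum(xlogy(n_edge, mu_edge) - xlogy(n_edge, n_edge) + n_edge)
        ll += np.sum((deg - 1)[:, None] *
                     (xlogy(n_node, n_node) - n_node - xlogy(n_node, mu_node)))
        ll += np.sum(xlogy(y, n_node) - alpha * n_node)   # Poisson evidence
        return -ll

    def consistency(z):
        n_edge = z.reshape(shape)
        # Adjacent edge tables agree on their shared node (Eq. 3) ...
        c = [n_edge[t].sum(axis=0) - n_edge[t + 1].sum(axis=1) for t in range(T - 2)]
        # ... and each edge table accounts for the whole population (Eq. 4).
        c.append(n_edge.sum(axis=(1, 2)) - M)
        return np.concatenate(c)

    z0 = np.full(int(np.prod(shape)), M / L ** 2)          # feasible start
    res = minimize(neg_log_post, z0, method="SLSQP",
                   constraints=[{"type": "eq", "fun": consistency}],
                   bounds=[(1e-9, None)] * z0.size,
                   options={"maxiter": 500, "ftol": 1e-9})
    return res.x.reshape(shape)
```

With the quantities from the earlier simulation sketch, approx_map_chain(y, np.array(node_marg), np.array(edge_marg), M) returns fractional edge tables whose entries can stand in for the expected sufficient statistics E[n_{t,t+1} | y]; used inside an EM loop, this call plays the role of the E step, while the M step updates the parameters of the individual model. For problems of realistic size one would supply analytic gradients rather than rely on the solver's finite differences.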

MAP Inference for EM. We now describe how MAP inference for CGMs can be used to significantly accelerate the E-step of the EM algorithm for learning the parameters of the individual model. Let θ denote the unknown parameter to be optimized, let y be the observed variables, and let x = (x^(1), ..., x^(M)) be the hidden variables for all individuals in the population. The EM algorithm iteratively finds parameters θ* that maximize the following expected log-likelihood:

    Q(θ*, θ) = Σ_x p(x | y; θ) log p(x, y; θ*),

where θ* denotes the parameters to optimize and θ denotes the previous iteration's parameters. When the joint distribution p(·) is from an exponential family, as in our case, then the problem simplifies to maximizing log p(n̄, y; θ*), where n̄ = E_θ[n | y] is the expected value of the sufficient statistics n = n(x, y) used to define the model; these are exactly the hidden variables in the CGM. In general, this expectation is difficult to compute and requires the specialized sampling approach of Sheldon & Dietterich (2011).

Instead, we will show that the approximate MAP solution, which approximates the mode of the distribution p(n | y; θ), is also an excellent approximation for its mean E_θ[n | y]. While this may seem surprising, recall that the random variables in question take values that are relatively large non-negative integers. A good analogy is the Binomial distribution (a CGM with only one variable), for which the mode is very close to the mean, and the mode of the continuous extension of the Binomial pmf is even closer to the mean. These characteristics make our approach qualitatively very different from a typical "hard EM" algorithm. Our experiments show that by using the convex optimization approach of Section 5, the approximate mode can be computed extremely quickly and is an excellent substitute for the mean. It is typically a much better approximation of the mean than the one found by Gibbs sampling for reasonable time budgets, and this makes the overall EM procedure many times faster.

6. Evaluation

We evaluated our approximate MAP inference algorithm by measuring its accuracy against exact solutions for small problem instances and by comparing it with Gibbs sampling for marginal inference within the E step of the EM algorithm. For all experiments, we generated data from a chain-structured CGM to simulate wind-dependent migration of a population of M birds from the bottom-left to the top-right corner of an ℓ × ℓ grid, mimicking the seasonal movement of a migratory songbird from a known winter range to a known breeding range. Thus, the variables X_t of the individual model are the grid locations of the individual birds at times t = 1, ..., T, and have cardinality L = ℓ^2. The transition probabilities between grid cells were determined by a log-linear model with four parameters that controlled the effect of features such as direction, distance, and wind on the transition probability. The parameters θ_true were selected manually to generate realistic migration trajectories. After generating data for a population of M birds, we computed node contingency tables and generated observations from the Poisson model (α = 1) for every node. For the experiments that do not involve learning, we perform inference in the same model used to generate the data; that is, the marginal probabilities µ_{ij}(·, ·) and µ_i(·) in the CGM are those determined by θ_true.

We solved the approximate MAP convex optimization problem using MATLAB's interior point solver. For the comparisons with Gibbs sampling, we developed an optimized C implementation of the algorithm of Sheldon & Dietterich (2011) and developed an adaptive rejection sampler for discrete distributions to perform the log-concave sampling required by that algorithm (Gilks & Wild, 1992; Sheldon, 2013).

Accuracy of Approximate MAP Solutions. To evaluate the impact of the two approximations in our approximate MAP algorithm, we first compare its solution n*_approx to the exact solution n*_exact obtained by message passing for small models (L = 4, T = 6). We expect the fractional relaxation and Stirling's approximation to be more accurate as M increases. Because, by Theorem 1, the running time of message passing in this model is O(M^15), we are limited to tiny populations. Nevertheless, Figure 3 shows that the relative error ||n*_approx − n*_exact||_1 / ||n*_exact||_1 is already less than 1% for M = 7. For all M, approximate MAP takes less than 0.2 seconds, while the running time of exact MAP scales very poorly.

Figure 3. The effect of population size M on accuracy and running time of approximate MAP (L = 4, T = 6). Left: relative error vs. M. Right: running time vs. M.

Marginal Inference. We next evaluated the approximate MAP inference algorithm to show that it can solve the EM marginal inference problem more quickly and accurately than Gibbs sampling. For these experiments, we fixed T = 20 and varied L by increasing the grid size. The largest models (L = 49) result in a hidden vector with (T − 1)L^2 ≈ 46K entries. The goal is to approximate E_θ[n | y]. Since we cannot compute the exact answer for non-trivial problems, we ran ten very long runs of Gibbs sampling (10 million iterations), and then compared each Gibbs run, as well as approximate MAP, to the reference solution obtained by averaging the nine remaining Gibbs runs; this yields ten evaluations for each method.

Figure 4 shows that the solver quickly finds an optimal solution to the approximate MAP problem, and it takes Gibbs sampling nearly 100 times longer to reach a solution that is as close to the reference as the one found by approximate MAP. Table 1 shows that the same pattern holds as the problem size increases: for increasing values of L, Gibbs consistently takes 50 to 100 times longer to find a solution as close to the reference solution as the one found by MAP.

Table 1. Comparison of Gibbs vs. MAP: seconds to achieve the same relative error compared with the reference solution.

    L            9      16     25     36     49
    MAP time     0.9    1.9    3.4    9.7    17.2
    Gibbs time   161.8  251.6  354.0  768.1  1115.5

Figure 4. Relative difference from reference solution versus time (log scale) for MAP and Gibbs (M = 1000, L = 49). Gibbs and MAP: average over 10 trials; 95% confidence intervals are negligible.

We conjecture that the approximate MAP solution may be extremely close to the ground truth. In Figure 4, each Gibbs solution has a relative difference of about 0.09 from the reference solution computed using the other nine Gibbs runs, which suggests that there is still substantial error in the Gibbs runs after 10 million iterations. Furthermore, each time we increased the number of Gibbs iterations used to compute a reference solution, the MAP relative error decreased, meaning that the reference solution was getting closer to the MAP solution.

Learning. Finally, we evaluated approximate MAP as a substitute for Gibbs within the full EM learning procedure. We initialized the parameter vector θ randomly and then ran three variants of the EM algorithm: MAP-EM uses approximate MAP within the E step; Monte Carlo EM (MCEM) uses a fixed number of 100K Gibbs iterations per E step; and stochastic approximation EM (SAEM) uses a smaller number of 10K Gibbs iterations per E step, but it combines those with samples from previous iterations. SAEM usually has better convergence properties than MCEM (Delyon et al., 1999). We employed step sizes of γ_t = 1/t^0.75 within SAEM. In the M step, we applied a gradient-based solver to update the parameters θ of the log-linear model for transition probabilities. For each algorithm we measured the relative error of the learned parameter vectors from θ_true.

Figure 5. Relative error of learned parameters versus running time for different EM algorithms (MAP and MCEM: 100 EM iterations, SAEM: 600 EM iterations, M = 1000). Panels show L = 16, 25, 36, 49.

Figure 5 shows that MAP-EM dramatically outperforms the other two algorithms, especially as the problem size increases. SAEM has better long-term convergence behavior than MCEM, but MAP-EM finds parameters that are within 1% relative error of θ_true, while SAEM still has relative error greater than 40% after a much longer running time for the larger problems. We conclude that approximate MAP is an excellent substitute for marginal inference with the EM algorithm, both in terms of accuracy and running time.

7. Conclusion

We presented hardness results and approximate algorithms for the problems of MAP and marginal inference in collective graphical models. We showed that exact inference by message passing runs in time that is polynomial either in the population size or the cardinality of the variables, but there is no algorithm that is polynomial in both of these parameters unless P=NP. We then showed that the MAP problem can be formulated approximately as a non-linear convex optimization problem. We demonstrated empirically that this approximation is very accurate even for modest population sizes and that approximate MAP inference is an excellent substitute for marginal inference for computing the E step of the EM algorithm. Our approximate MAP inference algorithm leads to a learning procedure that is much more accurate and runs in a fraction of the time of the only known alternative algorithms.

Acknowledgments. We thank the anonymous reviewers for constructive comments that improved the manuscript. This material is based upon work supported by the National Science Foundation under Grant No. IIS-1117954.

References

Apsel, U. and Brafman, R. Extended lifted inference with joint formulas. In Proceedings of the 27th Annual Conference on Uncertainty in Artificial Intelligence (UAI), pp. 11-18, Corvallis, Oregon, 2011.

Dawid, A. P. and Lauritzen, S. L. Hyper Markov laws in the statistical analysis of decomposable graphical models. The Annals of Statistics, 21(3):1272-1317, 1993.

de Salvo Braz, R., Amir, E., and Roth, D. Lifted first-order probabilistic inference. In Getoor, L. and Taskar, B. (eds.), Introduction to Statistical Relational Learning, pp. 433-452. The MIT Press, 2007.

Delyon, B., Lavielle, M., and Moulines, E. Convergence of a stochastic approximation version of the EM algorithm. Annals of Statistics, 27(1):94-128, 1999.

Diaconis, P. and Sturmfels, B. Algebraic algorithms for sampling from conditional distributions. The Annals of Statistics, 26(1):363-397, 1998.

Dobra, A. Markov bases for decomposable graphical models. Bernoulli, 9(6):1093-1108, 2003.

Fierens, D. and Kersting, K. From lifted inference to lifted models. In Second International Workshop on Statistical Relational AI, 2012.

Getoor, L. and Taskar, B. Introduction to Statistical Relational Learning. The MIT Press, 2007.

Gilks, W. R. and Wild, P. Adaptive rejection sampling for Gibbs sampling. Journal of the Royal Statistical Society, Series C (Applied Statistics), 41(2):337-348, 1992.

Heskes, T. Convexity arguments for efficient minimization of the Bethe and Kikuchi free energies. Journal of Artificial Intelligence Research, 26(1):153-190, 2006.

Kalbfleisch, J. D., Lawless, J. F., and Vollmer, W. M. Estimation in Markov models from aggregate data. Biometrics, 1983.

Karp, R. M. Reducibility among combinatorial problems. Complexity of Computer Computations, 1972.

Koller, D. and Friedman, N. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.

Lee, T. C., Judge, G. G., and Zellner, A. Estimating the parameters of the Markov probability model from aggregate data. North-Holland Pub. Co., 1970.

MacRae, E. C. Estimation of time-varying Markov processes with aggregate data. Econometrica: Journal of the Econometric Society, 1977.

Milch, B., Zettlemoyer, L. S., Kersting, K., Haimes, M., and Kaelbling, L. P. Lifted probabilistic inference with counting formulas. In Proceedings of the 23rd Conference on Artificial Intelligence (AAAI), pp. 1062-1068, 2008.

Sheldon, D. and Dietterich, T. G. Collective graphical models. In Advances in Neural Information Processing Systems (NIPS), 2011.

Sheldon, D., Elmohamed, M. A. S., and Kozen, D. Collective inference on Markov models for modeling bird migration. In Advances in Neural Information Processing Systems (NIPS), 2008.

Sheldon, D. Manipulation of PageRank and Collective Hidden Markov Models. PhD thesis, Cornell University, 2009.

Sheldon, D. Discrete adaptive rejection sampling. Technical Report UM-CS-2013-012, School of Computer Science, University of Massachusetts, Amherst, Massachusetts, May 2013.

Sullivan, B. L., Wood, C. L., Iliff, M. J., Bonney, R. E., Fink, D., and Kelling, S. eBird: A citizen-based bird observation network in the biological sciences. Biological Conservation, 142(10):2282-2292, 2009.

Sundberg, R. Some results about decomposable (or Markov-type) models for multidimensional contingency tables: distribution of marginals and partitioning of tests. Scandinavian Journal of Statistics, 2(2):71-79, 1975.

Van Der Plas, A. P. On the estimation of the parameters of Markov probability models using macro data. The Annals of Statistics, 1983.