Learning Latent Groups with Hinge-Loss Markov Random Fields
Stephen H. Bach [email protected]
Bert Huang [email protected]
Lise Getoor [email protected]
University of Maryland, College Park, Maryland 20742, USA

Abstract

Probabilistic models with latent variables are powerful tools that can help explain related phenomena by mediating dependencies among them. Learning in the presence of latent variables can be difficult though, because of the difficulty of marginalizing them out, or, more commonly, maximizing a lower bound on the marginal likelihood. In this work, we show how to learn hinge-loss Markov random fields (HL-MRFs) that contain latent variables. HL-MRFs are an expressive class of undirected probabilistic graphical models for which inference of most probable explanations is a convex optimization. By incorporating latent variables into HL-MRFs, we can build models that express rich dependencies among those latent variables. We use a hard expectation-maximization algorithm to learn the parameters of such a model, leveraging fast inference for learning. In our experiments, this combination of inference and learning discovers useful groups of users and hashtags in a Twitter data set.

Presented at the International Conference on Machine Learning (ICML) workshop on Inferning: Interactions between Inference and Learning, Atlanta, Georgia, USA, 2013. Copyright 2013 by the author(s).

1. Introduction

Hinge-loss Markov random fields (HL-MRFs) are a powerful class of probabilistic graphical models, which combine support for rich dependencies with fast, convex inference of most-probable explanations (MPEs). They achieve this combination by expressing dependencies among variables with domain [0, 1] as hinge-loss potentials, which can generalize logical implication to continuous variables. While recent advances in inference and learning for HL-MRFs allow these models to produce state-of-the-art performance on various problems with fully observed training data, methods for parameter learning with latent variables are currently less understood. In particular, there is a need for latent-variable learning methods that leverage the fast, convex inference in HL-MRFs. In this work, we introduce a hard expectation-maximization (EM) strategy for learning HL-MRFs with latent variables. This strategy mixes inference and supervised learning (where all variables are observed), two well-understood tasks in HL-MRFs, allowing learning with latent variables while leveraging the rich expressivity of HL-MRFs.

HL-MRFs are the formalism behind the probabilistic soft logic (PSL) modeling language (Broecheler et al., 2010), and have been used for collective classification, ontology alignment (Broecheler et al., 2010), social trust prediction (Huang et al., 2013), voter opinion modeling (Bach et al., 2012; Broecheler & Getoor, 2010), and graph summarization (Memory et al., 2012). PSL is one of many tools for designing relational probabilistic models, but is perhaps most related to Markov logic networks (Richardson & Domingos, 2006), which use a similar syntax based on first-order logic to define models.

When learning parameters in models with hidden, or latent, variables, the standard approach is to maximize the likelihood of the observed labels, which involves marginalizing over the latent variable states. In many models, directly computing this likelihood is too expensive, so the variational method of expectation maximization (EM) provides an alternative (Dempster et al., 1977). The variational interpretation of EM iteratively updates a proposal distribution, minimizing the Kullback-Leibler (KL) divergence to the empirical distribution, interleaved with estimating the expectation of the latent variables. The variational view allows the possibility of EM with a limited but tractable family of proposal distributions and provides theoretical justification for what is commonly known as "hard EM". In hard EM, the proposal distribution comes from the family of point distributions: distributions where the probability is one at a single point estimate and zero otherwise. Since HL-MRFs admit fast and efficient MPE inference, they are well suited for hard EM.

We demonstrate our approach on the task of group detection in social media data, extending previous work that used fixed-parameter HL-MRFs for the same task (Huang et al., 2012). Group detection in social media is an important task since more and more real-world phenomena, such as political organizing and discourse, take place through social media. Group detection has the potential to help us understand language, political events, social interactions, and more.
2. Hinge-loss Markov Random Fields

In this section, we review hinge-loss Markov random fields (HL-MRFs) and probabilistic soft logic (PSL). HL-MRFs are parameterized by constrained hinge-loss energy functions. The energy function is factored into hinge-loss potentials, which are clamped linear functions of the continuous variables, or squares of these functions. These potentials are weighted by a set of parameter weights, which can be learned and can be templated, i.e., many potentials of the same form may share the same weight. Additionally, HL-MRFs can incorporate linear constraints on the variables, which can be useful for modeling, for example, mutual exclusion between variable states. For completeness, a full, formal definition of HL-MRFs is as follows.

Definition 1. Let Y = (Y_1, …, Y_n) be a vector of n variables and X = (X_1, …, X_n′) a vector of n′ variables with joint domain D = [0, 1]^(n+n′). Let φ = (φ_1, …, φ_m) be m continuous potentials of the form

    φ_j(Y, X) = [max{ℓ_j(Y, X), 0}]^(p_j)

where ℓ_j is a linear function of Y and X and p_j ∈ {1, 2}. Let C = (C_1, …, C_r) be linear constraint functions associated with index sets denoting equality constraints E and inequality constraints I, which define the feasible set

    D̃ = { Y, X ∈ D : C_k(Y, X) = 0, ∀k ∈ E; C_k(Y, X) ≥ 0, ∀k ∈ I }.

For Y, X ∈ D̃, given a vector of nonnegative free parameters, i.e., weights, λ = (λ_1, …, λ_m), a constrained hinge-loss energy function f_λ is defined as

    f_λ(Y, X) = Σ_{j=1}^{m} λ_j φ_j(Y, X).

Definition 2. A hinge-loss Markov random field P over random variables Y and conditioned on random variables X is a probability density defined as follows: if Y, X ∉ D̃, then P(Y | X) = 0; if Y, X ∈ D̃, then

    P(Y | X) = (1 / Z(λ)) exp[−f_λ(Y, X)],

where Z(λ) = ∫_Y exp[−f_λ(Y, X)].

Thus, MPE inference is equivalent to finding the minimizer of the convex energy f_λ.

Probabilistic soft logic (Broecheler et al., 2010; Kimmig et al., 2012) provides a natural interface to represent hinge-loss potential templates using logical conjunction and implication. In particular, a logical conjunction of Boolean variables X ∧ Y can be generalized to continuous variables using the hinge function max{X + Y − 1, 0}, which is known as the Łukasiewicz t-norm. Similarly, logical implication X ⇒ Y is relaxed via 1 − max{X − Y, 0}. PSL allows modelers to design rules that, given data, ground out possible substitutions for logical terms. The groundings of a template define hinge-loss potentials that share the same weight. PSL rules take the form of these soft logical implications, and the linear function of the HL-MRF potential is the ground rule's distance to satisfaction, max{X − Y, 0}. We defer to Broecheler et al. (2010) and Kimmig et al. (2012) for further details on PSL.
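To make the relaxation concrete, the following minimal sketch evaluates Łukasiewicz-style ground rules and the resulting energy f_λ. The rule template trusts(A, B) ∧ group(A) ⇒ group(B), the atom names, and the function names (lukasiewicz_and, distance_to_satisfaction, energy) are illustrative assumptions, not part of the paper or of any PSL release.

```python
# Illustrative sketch: Lukasiewicz relaxations of ground rules and the
# weighted hinge-loss energy f_lambda over [0, 1]-valued variables.

def lukasiewicz_and(x, y):
    """Relaxed conjunction x AND y: max{x + y - 1, 0}."""
    return max(x + y - 1.0, 0.0)

def distance_to_satisfaction(body, head):
    """Hinge distance to satisfaction of body => head: max{body - head, 0}."""
    return max(body - head, 0.0)

def energy(weights, potentials, assignment):
    """Constrained hinge-loss energy f_lambda = sum_j lambda_j * phi_j."""
    return sum(w * phi(assignment) for w, phi in zip(weights, potentials))

# Two groundings of one hypothetical template, trusts(A,B) AND group(A) => group(B),
# sharing the same templated weight; p_j = 1 (linear) here, p_j = 2 would square phi_j.
potentials = [
    lambda a: distance_to_satisfaction(
        lukasiewicz_and(a["trusts(alice,bob)"], a["group(alice)"]), a["group(bob)"]),
    lambda a: distance_to_satisfaction(
        lukasiewicz_and(a["trusts(bob,carol)"], a["group(bob)"]), a["group(carol)"]),
]
weights = [1.5, 1.5]

assignment = {"trusts(alice,bob)": 0.9, "group(alice)": 0.8, "group(bob)": 0.3,
              "trusts(bob,carol)": 0.7, "group(carol)": 0.2}
print(energy(weights, potentials, assignment))  # larger value = rules less satisfied
```

MPE inference would minimize this energy over the unobserved atoms subject to the linear constraints; the sketch only evaluates it at one assignment.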
Inference of the most probable explanation (MPE) in HL-MRFs is a convex optimization, since the hinge-loss potentials are each convex and the linear constraints preserve convexity. Currently, the fastest known method for HL-MRF inference uses the alternating direction method of multipliers (ADMM), which decomposes the full objective into subproblems, each with its own copy of the variables, and uses augmented Lagrangian relaxation to enforce consensus between the independently optimized subproblems (Bach et al., 2012). The factorized form of the HL-MRF energy function naturally corresponds to a subproblem partitioning, with each hinge-loss potential and each constraint forming its own subproblem.

A number of methods can be used to learn the weights of an HL-MRF. Currently, two main strategies have been studied: approximate maximum likelihood (Broecheler et al., 2010) and large-margin estimation (Bach et al., 2013). In this work, we focus on approximate maximum likelihood using voted perceptron gradient ascent. The gradient of the training-data likelihood contains the expectation of the log-linear features, which we approximate via the MPE solution. Thus far, these learning methods require all unknown variables to have labeled ground truth during training. In the next section, we describe one method to relax this restriction and learn when only part of the unknown variables have observed labels.

3. Learning with Latent Variables

Algorithm 1 Hard Expectation Maximization
    Input: model P(Y, Z | X, λ), initial parameters λ^0
    t ← 1
    while not converged do
        Z^t ← arg max_Z P(Z | Y, X, λ^{t−1})
        λ^t ← arg max_λ P(Y, Z^t | X, λ)
        t ← t + 1
    end while

Probabilistic models can have latent variables for a number of reasons. In some cases, large models have many unknowns, to the point that it is impractical to collect ground truth for them all. In other cases, the latent variables represent values that can never be measured, since they are inherently latent and may have no real-world analogue. In both of these scenarios, the [...]

[...] optima. This makes initializing the model with reasonable parameters important.
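As a rough illustration of Algorithm 1, the sketch below alternates a point-estimate E-step with a fully observed M-step. The callables mpe_inference and fit_weights are hypothetical stand-ins supplied by the caller for HL-MRF MPE inference (e.g., ADMM-based) and supervised weight learning (e.g., voted perceptron); they are assumptions, not actual APIs from the paper or PSL.

```python
# Minimal sketch of the hard EM loop in Algorithm 1, assuming the caller supplies
# `mpe_inference` and `fit_weights` (hypothetical stand-ins for MPE inference and
# supervised weight learning over the latent-variable HL-MRF).

def hard_em(mpe_inference, fit_weights, Y, X, init_weights, max_iters=100, tol=1e-4):
    """Alternate point-estimate E-steps with fully observed M-steps."""
    weights = list(init_weights)
    for _ in range(max_iters):
        # Hard E-step: Z^t = argmax_Z P(Z | Y, X, lambda^{t-1}), a single MPE call
        # because the proposal distribution is a point mass.
        Z = mpe_inference(Y=Y, X=X, weights=weights)

        # M-step: lambda^t = argmax_lambda P(Y, Z^t | X, lambda); ordinary
        # supervised learning, since every variable now has a value.
        new_weights = list(fit_weights(Y=Y, Z=Z, X=X, init=weights))

        # Stop once the weights stop moving appreciably.
        if max(abs(n - o) for n, o in zip(new_weights, weights)) < tol:
            return new_weights
        weights = new_weights
    return weights
```

Because each step only locally improves the objective, the result depends on init_weights, which matches the point above about initializing the model with reasonable parameters.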