Learning Latent Groups with Hinge-loss Markov Random Fields

Stephen H. Bach [email protected]
Bert Huang [email protected]
Lise Getoor [email protected]
University of Maryland, College Park, Maryland 20742, USA

Abstract

Probabilistic models with latent variables are powerful tools that can help explain related phenomena by mediating dependencies among them. Learning in the presence of latent variables can be difficult though, because of the difficulty of marginalizing them out, or, more commonly, maximizing a lower bound on the marginal likelihood. In this work, we show how to learn hinge-loss Markov random fields (HL-MRFs) that contain latent variables. HL-MRFs are an expressive class of undirected probabilistic graphical models for which inference of most probable explanations is a convex optimization. By incorporating latent variables into HL-MRFs, we can build models that express rich dependencies among those latent variables. We use a hard expectation-maximization algorithm to learn the parameters of such a model, leveraging fast inference for learning. In our experiments, this combination of inference and learning discovers useful groups of users and hashtags in a data set.

1. Introduction

Hinge-loss Markov random fields (HL-MRFs) are a powerful class of probabilistic graphical models, which combine support for rich dependencies with fast, convex inference of most-probable explanations (MPEs). They achieve this combination by expressing dependencies among variables with domain [0, 1] as hinge-loss potentials, which can generalize logical implication to continuous variables. While recent advances on inference and learning for HL-MRFs allow these models to produce state-of-the-art performance on various problems with fully observed training data, methods for parameter learning with latent variables are currently less understood. In particular, there is a need for latent-variable learning methods that leverage the fast, convex inference in HL-MRFs. In this work, we introduce a hard expectation-maximization (EM) strategy for learning HL-MRFs with latent variables. This strategy mixes inference and supervised learning (where all variables are observed), two well-understood tasks in HL-MRFs, allowing learning with latent variables while leveraging the rich expressivity of HL-MRFs.

HL-MRFs are the formalism behind the probabilistic soft logic (PSL) modeling language (Broecheler et al., 2010), and have been used for collective classification, ontology alignment (Broecheler et al., 2010), social trust prediction (Huang et al., 2013), voter opinion modeling (Bach et al., 2012; Broecheler & Getoor, 2010), and graph summarization (Memory et al., 2012). PSL is one of many tools for designing relational probabilistic models, but is perhaps most related to Markov logic networks (Richardson & Domingos, 2006), which use a similar syntax based on first-order logic to define models.

When learning parameters in models with hidden, or latent, variables, the standard approach is to maximize the likelihood of the observed labels, which involves marginalizing over the latent variable states. In many models, directly computing this likelihood is too expensive, so the variational method of expectation maximization (EM) provides an alternative (Dempster et al., 1977). The variational interpretation of EM iteratively updates a proposal distribution, minimizing the Kullback-Leibler (KL) divergence to the empirical distribution, interleaved with estimating the expectation of the latent variables.
Presented at the International Conference on Machine Learning (ICML) workshop on Inferning: Interactions between Inference and Learning, Atlanta, Georgia, USA, 2013. Copyright 2013 by the author(s).

The variational view allows the possibility of EM with a limited but tractable family of proposal distributions and provides theoretical justification for what is commonly known as "hard EM". In hard EM, the proposal distribution comes from the family of point distributions: distributions where the probability is one at a single point estimate and zero otherwise. Since HL-MRFs admit fast and efficient MPE inference, they are well-suited for hard EM.

We demonstrate our approach on the task of group detection in social media data, extending previous work that used fixed-parameter HL-MRFs for the same task (Huang et al., 2012). Group detection in social media is an important task since more and more real-world phenomena, such as political organizing and discourse, take place through social media. Group detection has the potential to help us understand language, political events, social interactions, and more.

2. Hinge-loss Markov Random Fields

In this section, we review hinge-loss Markov random fields (HL-MRFs) and probabilistic soft logic (PSL). HL-MRFs are parameterized by constrained hinge-loss energy functions. The energy function is factored into hinge-loss potentials, which are clamped linear functions of the continuous variables, or squares of these functions. These potentials are weighted by a set of parameter weights, which can be learned and can be templated, i.e., many potentials of the same form may share the same weight. Additionally, HL-MRFs can incorporate linear constraints on the variables, which can be useful for modeling, for example, mutual exclusion between variable states. For completeness, a full, formal definition of HL-MRFs is as follows.

Definition 1. Let Y = (Y_1, ..., Y_n) be a vector of n variables and X = (X_1, ..., X_{n'}) a vector of n' variables with joint domain D = [0, 1]^{n+n'}. Let φ = (φ_1, ..., φ_m) be m continuous potentials of the form

    φ_j(Y, X) = [max {ℓ_j(Y, X), 0}]^{p_j}

where ℓ_j is a linear function of Y and X and p_j ∈ {1, 2}. Let C = (C_1, ..., C_r) be linear constraint functions associated with index sets denoting equality constraints E and inequality constraints I, which define the feasible set

    D̃ = { Y, X ∈ D | C_k(Y, X) = 0, ∀k ∈ E;  C_k(Y, X) ≥ 0, ∀k ∈ I }.

For Y, X ∈ D̃, given a vector of nonnegative free parameters, i.e., weights, λ = (λ_1, ..., λ_m), a constrained hinge-loss energy function f_λ is defined as

    f_λ(Y, X) = Σ_{j=1}^{m} λ_j φ_j(Y, X).

Definition 2. A hinge-loss Markov random field P over random variables Y and conditioned on random variables X is a probability density defined as follows: if Y, X ∉ D̃, then P(Y|X) = 0; if Y, X ∈ D̃, then

    P(Y|X) = (1 / Z(λ)) exp [−f_λ(Y, X)],

where Z(λ) = ∫_Y exp [−f_λ(Y, X)].

Thus, MPE inference is equivalent to finding the minimizer of the convex energy f_λ.

Probabilistic soft logic (Broecheler et al., 2010; Kimmig et al., 2012) provides a natural interface to represent hinge-loss potential templates using logical conjunction and implication. In particular, a logical conjunction of Boolean variables X ∧ Y can be generalized to continuous variables using the hinge function max{X + Y − 1, 0}, which is known as the Lukasiewicz t-norm. Similarly, logical implication X ⇒ Y is relaxed via 1 − max{X − Y, 0}. PSL allows modelers to design rules that, given data, ground out possible substitutions for logical terms. The groundings of a template define hinge-loss potentials that share the same weight. PSL rules take the form of these soft logical implications, and the linear function of the HL-MRF potential is the ground rule's distance to satisfaction, max{X − Y, 0}. We defer to Broecheler et al. (2010) and Kimmig et al. (2012) for further details on PSL.
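To make the mapping from logic to hinge-loss potentials concrete, the following is a minimal Python sketch (illustrative only, not part of PSL or its implementation) of how a single ground rule X ⇒ Y yields a potential; all function names are hypothetical.

def lukasiewicz_and(x, y):
    # Lukasiewicz relaxation of the Boolean conjunction X AND Y to [0, 1].
    return max(x + y - 1.0, 0.0)

def distance_to_satisfaction(x, y):
    # Distance to satisfaction of the ground rule X => Y, i.e., one minus
    # its relaxed truth value 1 - max{X - Y, 0}.
    return max(x - y, 0.0)

def hinge_loss_potential(x, y, squared=False):
    # Hinge-loss potential defined by the ground rule; squared when p_j = 2.
    d = distance_to_satisfaction(x, y)
    return d * d if squared else d

# Example: with X = 0.9 and Y = 0.6, the rule X => Y is 0.3 from satisfaction.
print(lukasiewicz_and(0.9, 0.6))                     # 0.5
print(hinge_loss_potential(0.9, 0.6))                # 0.3 (up to rounding)
print(hinge_loss_potential(0.9, 0.6, squared=True))  # ~0.09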
Inference of the most probable explanation (MPE) in HL-MRFs is a convex optimization, since the hinge-loss potentials are each convex and the linear constraints preserve convexity. Currently, the fastest known method for HL-MRF inference uses the alternating direction method of multipliers (ADMM), which decomposes the full objective into subproblems, each with their own copy of the variables, and uses augmented Lagrangian relaxation to enforce consensus between the independently optimized subproblems (Bach et al., 2012). The factorized form of the HL-MRF energy function naturally corresponds to a subproblem partitioning, with each hinge-loss potential and each constraint forming its own subproblem.

A number of methods can be used to learn the weights of an HL-MRF. Currently, two main strategies have been studied: approximate maximum likelihood (Broecheler et al., 2010) and large-margin estimation (Bach et al., 2013). In this work, we focus on approximate maximum likelihood using voted-perceptron gradient ascent. The gradient for the likelihood of training data contains the expectation of the log-linear features, which we approximate via the MPE solution. Thus far, these learning methods require all unknown variables to have labeled ground truth during training. In the next section, we describe one method to relax this restriction and learn when only part of the unknown variables have observed labels.
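The voted-perceptron approximation can be sketched as follows. This is a schematic outline under our own assumptions, not the authors' implementation; potentials and mpe_inference are hypothetical stand-ins for the potential (feature) evaluator and the MPE solver, and the step size is illustrative.

import numpy as np

def approx_max_likelihood(lam, X, Y_train, potentials, mpe_inference,
                          num_steps=10, step_size=0.1):
    # Voted-perceptron gradient ascent for HL-MRF weights.
    # potentials(Y, X): vector of summed potential values, one entry per
    #   templated weight; mpe_inference(lam, X): MPE assignment to Y.
    iterates = []
    for _ in range(num_steps):
        Y_mpe = mpe_inference(lam, X)
        # Log-likelihood gradient, with the model expectation of the
        # potentials approximated by their values at the MPE state.
        grad = potentials(Y_mpe, X) - potentials(Y_train, X)
        # Gradient step, projected so the weights stay nonnegative.
        lam = np.maximum(lam + step_size * grad, 0.0)
        iterates.append(lam)
    # "Voting": return the average of the weight vectors visited.
    return np.mean(iterates, axis=0)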
3. Learning with Latent Variables

Probabilistic models can have latent variables for a number of reasons. In some cases, large models have many unknowns, to the point that it is impractical to collect ground truth for them all. In other cases, the latent variables represent values that can never be measured, since they are inherently latent and may have no real-world analogue. In both of these scenarios, the standard strategy is to maximize the likelihood of the available labeled data. Let X be the observed variables, Y be the target variables (i.e., labels), and Z be the latent variables, which are available for inference but for which we have no ground-truth labels. The standard learning objective for learning parameters λ is

    max_λ P(Y|X; λ).

Evaluating this objective function in many graphical models, including HL-MRFs, is intractable in general, so direct optimization is often not an available option.

3.1. Expectation Maximization

Expectation maximization (EM) is a general framework for learning in the presence of latent variables (Dempster et al., 1977). EM maximizes a lower bound on the marginal likelihood P(Y|X; λ). For models where evaluating P(Y|X; λ) is intractable, it is often possible to work with the original distribution P(Y, Z|X; λ). Thus, EM alternates between fitting a distribution q(Z) to the posterior distribution P(Z|Y, X; λ) and maximizing E_{q(Z)}[P(Y, Z|X; λ)], the expected complete-data likelihood with respect to q(Z).

For general, undirected graphical models, this maximization is intractable. One way to make it more tractable is to use a "hard" variant of EM, in which q(Z) is restricted to the family of delta distributions, which place all mass on a single assignment to the variables. With zero density assigned to all points except one, maximizing the expected complete-data likelihood is equivalent to supervised learning using the assignment to the latent variables with non-zero density. Hard EM is attractive, especially in models such as HL-MRFs, because it alternates between inference and supervised learning, two tasks we have efficient algorithms to solve. While hard EM also maximizes a lower bound on the marginal likelihood, the delta-distribution restriction makes it prone to finding local optima. This makes initializing the model with reasonable parameters important. For example, parameters can be initialized with expert seed knowledge. The hard EM algorithm is summarized in Algorithm 1.

Algorithm 1 Hard Expectation Maximization
    Input: model P(Y, Z|X; λ), initial parameters λ^0
    t ← 1
    while not converged do
        Z^t = argmax_Z P(Z|Y, X; λ^{t−1})
        λ^t = argmax_λ P(Y, Z^t|X; λ)
        t ← t + 1
    end while
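As a minimal illustration of Algorithm 1, the following Python sketch shows the alternation between latent-variable inference and supervised weight learning; mpe_inference_latent and supervised_learn are hypothetical stand-ins for the MPE solver (over Z, conditioned on X and Y) and the fully supervised learning procedure of Section 2.

def hard_em(lam, X, Y, mpe_inference_latent, supervised_learn,
            num_iterations=10):
    # Hard EM: alternate a hard E-step (an MPE assignment to the latent
    # variables Z) with an M-step (supervised weight learning with Z
    # treated as observed).
    for _ in range(num_iterations):
        Z = mpe_inference_latent(lam, X, Y)   # E-step: argmax_Z P(Z | Y, X; lam)
        lam = supervised_learn(lam, X, Y, Z)  # M-step: argmax_lam P(Y, Z | X; lam)
    return lam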

4. Evaluation

To evaluate our proposed approach, we build a model over a rich social-media data set collected from South American users in the 48 hours around the 2012 Venezuelan presidential election. The data set is collected from the social medium Twitter and is composed of short, public messages, or tweets, written by users. Some tweets express opinions, some mention other users, some are retweets (rebroadcast messages), and many contain hashtags (user-specified annotations). Such social-media data sets can contain a wealth of information about public opinions, social structures, and language. We extract some of this information by including in our model the notion of latent user groups, which mediate probabilistic dependencies between hashtag usage and social interactions such as retweets and mentions.

Our goal is to learn model parameters that explain a set of users' interactions with a smaller set of top users of interest, e.g., political figures, news organizations, and entertainment accounts, given the users' hashtag usage and their interactions with others outside the set of top users. Since our data set focuses on a presidential election, we assume that there are two latent groups, one associated with each major candidate, and we will interpret the learned parameters as the strengths of associations between each group and particular hashtags or top users.

4.1. Data Set

Our data set is roughly 4,275,000 tweets collected from about 1,350,000 Twitter users via a query that focused on South American users. The tweets were collected from October 6 to October 8, 2012, a 48-hour window around the Venezuelan presidential election on October 7. The two major candidates in the election were Hugo Chávez, the incumbent, and Henrique Capriles. Chávez won with 55% of the vote.

To learn a model relating hashtag usage and interactions with top users, we first identify 20 users as top users based on being the most retweeted or, in the case of the state-owned television network's account, being of particular interest. We then identify all other users that either retweeted or mentioned at least one of the top users and used at least one hashtag in a tweet that was not a mention or a retweet of a top user. Filtering by these criteria, the set contains 1,678 regular users (i.e., users that are not top users).

Whether each regular user tweeted a hashtag is represented with the PSL predicate UsedHashtag. Tweets that mention or retweet a top user are not counted. For example, if we observe that User 1 tweeted a tweet that contains the hashtag #hayuncamino, then UsedHashtag(1, #hayuncamino) has an observed truth value of 1.0. The PSL predicate RegularUserLink represents whether a regular user retweeted or mentioned any user in the full data set that is not a top user, regardless of whether that mentioned or retweeted user is a regular user. Whether a regular user retweeted or mentioned a top user is represented with the PSL predicate TopUserLink. Finally, the latent group membership of each regular user is represented with the PSL predicate InGroup.

4.2. Latent Group Model

We now describe our model for predicting the interactions of regular users with top users via latent group membership. We describe the HL-MRF in terms of PSL rules and constraints. (See Section 2.) We implement the hard EM algorithm (Algorithm 1) by treating atoms with the UsedHashtag or RegularUserLink predicate as the set of conditioning variables X, atoms with the TopUserLink predicate as the set of target variables Y, and atoms with the InGroup predicate as the set of latent variables Z.

When defining our model, we will make reference to the set H of hashtags used by at least 15 different regular users (|H| = 33), the set T of top users (|T| = 20), and the set of latent groups G = {g0, g1}.

We first include rules that relate hashtag usage to group membership. For each hashtag in H and each latent group, we include a rule of the form

    w_{h,g} : UsedHashtag(U, h) → InGroup(U, g)    ∀h ∈ H, ∀g ∈ G

so that there is a different rule weight governing how strongly each commonly used hashtag is associated with each latent group.

Next, we leverage one of the advantages of our approach to learning with latent variables: the ability to easily include interesting dependencies among latent variables. We use the rule

    w_social : RegularUserLink(U1, U3) ∧ RegularUserLink(U2, U3) ∧ U1 ≠ U2 ∧ InGroup(U1, G) → InGroup(U2, G)

to encode the intuition that regular users who interact with the same people on Twitter are more likely to belong to the same latent group.

Third, we include rules of the form

    w_{g,t} : InGroup(U, g) → TopUserLink(U, t)    ∀g ∈ G, ∀t ∈ T

for each latent group and each top user, so that there is a parameter governing how strongly each latent group tends to interact with each top user.

Finally, we include a simple rule that acts as a prior penalizing strong assignments of regular users to either group:

    w_prior : ¬InGroup(U, G)

Since the potentials defined by all our rules are squared, this prior indicates that, given little or weak evidence, users belong to all groups evenly. I.e., the model will only infer strong group assignments if there is strong evidence.

In addition to our rules, we include two sets of constraints. The first set constrains the InGroup atoms for each regular user to sum to 1.0, making InGroup a mixed-membership assignment. The second set constrains the TopUserLink atoms for each regular user to sum to the number of interactions with top users observed for that regular user. This makes the inference task one of predicting which interactions occurred, since the constraint fixes how many interactions occurred.

All that remains to use the hard EM algorithm is to specify initial parameters λ^0. We initialize w_prior to 3.0. We initialize w_{h,g}, w_social, and w_{g,t} to 2.0 for all hashtags, groups, and top users, except two hashtags and two top users which we initially assign as seeds. We initially associate the top user hayuncamino (Henrique Capriles's campaign account) and the hashtag for Capriles's campaign slogan, #hayuncamino, with Group 0 by initializing the parameters associating them with Group 0 to 10.0 and those associating them with Group 1 to 0.0. We initially associate the top user chavezcandanga (Hugo Chávez's account) and the hashtag for Chávez's campaign slogan, #elmundoconchávez, with Group 1 in the same way.
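As a concrete sketch of this seeding scheme (a hypothetical data structure of our own choosing, not the authors' code), the initial weights λ^0 can be assembled as a dictionary keyed by rule, with every templated weight defaulting to 2.0 except the seeded hashtag and top-user rules:

def initial_weights(hashtags, top_users, groups=("g0", "g1")):
    # Defaults: the prior rule at 3.0, every templated rule at 2.0.
    lam = {"w_prior": 3.0, "w_social": 2.0}
    for g in groups:
        for h in hashtags:
            lam[("hashtag", h, g)] = 2.0
        for t in top_users:
            lam[("topuser", g, t)] = 2.0
    # Seeds: Capriles's slogan and campaign account with g0, Chavez's
    # slogan and account with g1 (10.0 for the seeded group, 0.0 otherwise).
    for h, g in [("#hayuncamino", "g0"), ("#elmundoconchavez", "g1")]:
        other = "g1" if g == "g0" else "g0"
        lam[("hashtag", h, g)] = 10.0
        lam[("hashtag", h, other)] = 0.0
    for t, g in [("hayuncamino", "g0"), ("chavezcandanga", "g1")]:
        other = "g1" if g == "g0" else "g0"
        lam[("topuser", g, t)] = 10.0
        lam[("topuser", other, t)] = 0.0
    return lam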

Figure 1. Learned parameters associating latent groups with hashtags. Shown is the value w_{h,g0} − w_{h,g1} for each hashtag h ∈ H.

Figure 2. Learned parameters associating latent groups with interactions with top users. Shown is the value w_{g0,t} − w_{g1,t} for each top user t ∈ T.

4.3. Results

To learn associations between latent groups and hashtags or interactions with top users, we perform ten iterations of hard EM on the HL-MRF defined in the previous subsection over the entire data set. Each maximizer for P(Z|Y, X; λ) is easy to find exactly via convex optimization. To find each λ, we perform ten steps of voted-perceptron gradient ascent, approximating the expected values of the log-linear features by their values in the MPE state. This approximate maximization works well because, as long as it improves the log-likelihood of Y and Z, the marginal likelihood P(Y|X; λ) will improve.

Figure 1 shows the learned parameters associating hashtag usage with latent groups (excluding the two seeded hashtags). The hashtags are sorted by differences in parameter values from Capriles to Chávez. Our assignment of seeds associated pro-Capriles users with Group 0 and pro-Chávez users with Group 1. The results show a very clean ordering of hashtags based on ideology. Many of the hashtags most strongly associated with the latent Capriles group are explicitly pro-Capriles, e.g., #mivotoesxcapriles, #votemosdeprimeroxcapriles, and #hayuncamimo, an alternative spelling of Capriles's campaign slogan. Others are also clearly anti-Chávez: #venezueladeluto ("Venezuela in mourning" after Chávez's reelection) and #hugochávezfríastequeda1día (roughly "Hugo Chávez has one day left"). One surprising result is that #6añosmas ("six more years") is strongly associated with the latent Capriles group despite superficially appearing to support the incumbent. However, upon inspection of the tweets that use this hashtag, most in our data set use it ironically, predicting "six more years" of negative outcomes of Chávez's reelection. On the other hand, the hashtags strongly associated with the Chávez group are all explicitly pro-Chávez or just his name. Interestingly, the semantically neutral hashtags promoting voter turnout, such as #tuvoto, #vota, and #vo7a, are inferred to favor the Capriles group. We hypothesize that this may be because the social-media campaign for increasing voter turnout was stronger from the Capriles side.

Figure 2 shows the learned parameters associating interactions with top users with latent groups (again excluding the two seeded top users). According to the learned model, users in the latent Capriles group are most likely to interact with hcapriles (Capriles's personal account) and independent media outlets and journalists such as globovision, la patilla, nelsonbocaranda, luischataing, and eluniversal. On the other side of the spectrum, users in the latent Chávez group are most likely to interact with vtvcanal8, the Twitter account of the state-owned television network.

Our results include both obviously correct and surprising, but verifiable, elements. The learned parameters are useful for understanding the language used by, and the social associations among, people with different political opinions. They are also easy to compute. This experiment uses an HL-MRF with approximately 37,000 variables and 146,000 potentials and constraints, but ten iterations of hard EM take only a few minutes on a workstation.

5. Conclusion

By learning a model that orders top hashtags and top users according to associations with two latent political groups, we show the power of HL-MRFs for understanding rich, real-world data. We advance the state of the art by learning an HL-MRF that includes latent variables, using a hard EM procedure that benefits from the fast, convex inference available for HL-MRFs. These latent variables are made interpretable by their construction with PSL, demonstrating the utility of HL-MRFs for many fields, including computational social science. Directions for future work include developing richer proposal-distribution families compatible with HL-MRFs, modeling latent variables in other tasks, and quantifying the effects of modeling latent variables on prediction tasks.

Acknowledgments

This work was supported by NSF grant CCF0937094 and IARPA via DoI/NBC contract number D12PC00337. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation thereon.
Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DoI/NBC, or the U.S. Government.

References

Bach, S., Broecheler, M., Getoor, L., and O'Leary, D. Scaling MPE inference for constrained continuous Markov random fields with consensus optimization. In Neural Information Processing Systems, 2012.

Bach, S., Huang, B., London, B., and Getoor, L. Hinge-loss Markov random fields: Convex inference for structured prediction. In Uncertainty in Artificial Intelligence, 2013.

Broecheler, M. and Getoor, L. Computing marginal distributions over continuous Markov networks for statistical relational learning. In Neural Information Processing Systems, 2010.

Broecheler, M., Mihalkova, L., and Getoor, L. Probabilistic similarity logic. In Uncertainty in Artificial Intelligence, 2010.

Dempster, A., Laird, N., and Rubin, D. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1–38, 1977.

Huang, B., Bach, S., Norris, E., Pujara, J., and Getoor, L. Social group modeling with probabilistic soft logic. In NIPS 2012 Workshop on Social Network and Social Media Analysis: Methods, Models, and Applications, 2012.

Huang, B., Kimmig, A., Getoor, L., and Golbeck, J. A flexible framework for probabilistic models of social trust. In Conference on Social Computing, Behavioral-Cultural Modeling, & Prediction, 2013.

Kimmig, A., Bach, S., Broecheler, M., Huang, B., and Getoor, L. A short introduction to probabilistic soft logic. In NIPS Workshop on Probabilistic Programming: Foundations and Applications, 2012.

Memory, A., Kimmig, A., Bach, S., Raschid, L., and Getoor, L. Graph summarization in annotated data using probabilistic soft logic. In Proceedings of the International Workshop on Uncertainty Reasoning for the Semantic Web (URSW), 2012.

Richardson, M. and Domingos, P. Markov logic networks. Machine Learning, 62(1-2):107–136, 2006.