
arXiv:1109.3437v4 [cs.LG] 24 Mar 2012

Learning Topic Models by Belief Propagation

Jia Zeng, Member, IEEE, William K. Cheung, Member, IEEE, and Jiming Liu, Fellow, IEEE

Abstract—Latent Dirichlet allocation (LDA) is an important hierarchical Bayesian model for probabilistic topic modeling, which attracts worldwide interests and touches on many important applications in text mining, computer vision and computational biology. This paper represents LDA as a factor graph within the Markov random field (MRF) framework, which enables the classic loopy belief propagation (BP) algorithm for approximate inference and parameter estimation. Although two commonly-used approximate inference methods, such as variational Bayes (VB) and collapsed Gibbs sampling (GS), have gained great successes in learning LDA, the proposed BP is competitive in both speed and accuracy as validated by encouraging experimental results on four large-scale document data sets. Furthermore, the BP algorithm has the potential to become a generic learning scheme for variants of LDA-based topic models. To this end, we show how to learn two typical variants of LDA-based topic models, such as author-topic models (ATM) and relational topic models (RTM), using BP based on the factor graph representation.

Index Terms—Latent Dirichlet allocation, topic models, belief propagation, message passing, factor graph, Bayesian networks, Markov random fields, hierarchical Bayesian models, Gibbs sampling, variational Bayes.

J. Zeng is with the School of Computer Science and Technology, Soochow University, Suzhou 215006, China. He is also with the Shanghai Key Laboratory of Intelligent Information Processing, China. E-mail: [email protected]. To whom correspondence should be addressed.
W. K. Cheung and J. Liu are with the Department of Computer Science, Hong Kong Baptist University, Kowloon Tong, Hong Kong.

1 INTRODUCTION

Latent Dirichlet allocation (LDA) [1] is a three-layer hierarchical Bayesian model (HBM) that can infer probabilistic word clusters called topics from the document-word (document-term) matrix. LDA has no exact inference methods because of loops in its graphical representation. Variational Bayes (VB) [1] and collapsed Gibbs sampling (GS) [2] have been two commonly-used approximate inference methods for learning LDA and its extensions, including author-topic models (ATM) [3] and relational topic models (RTM) [4]. Other inference methods for probabilistic topic modeling include expectation-propagation (EP) [5] and collapsed VB inference (CVB) [6]. The connections and empirical comparisons among these approximate inference methods can be found in [7]. Recently, LDA and HBMs have found many important real-world applications in text mining and computer vision (e.g., tracking historical topics from time-stamped documents [8] and activity perception in crowded and complicated scenes [9]).

This paper represents LDA by the factor graph [10] within the Markov random field (MRF) framework [11]. From the MRF perspective, the topic modeling problem can be interpreted as a labeling problem, in which the objective is to assign a set of semantic topic labels to explain the observed nonzero elements in the document-word matrix. MRF solves the labeling problem existing widely in image analysis and computer vision by two important concepts: neighborhood systems and clique potentials [12] or factor functions [11]. It assigns the best topic labels according to the maximum a posteriori (MAP) estimation through maximizing the posterior probability, which is in nature a prohibited combinatorial optimization problem in the discrete topic space. However, we often reduce the complexity by employing the smoothness prior [13] over neighboring topic labels, encouraging or penalizing only a limited number of possible labeling configurations.

The factor graph is a graphical representation method for both directed models (e.g., hidden Markov models (HMMs) [11, Chapter 13.2.3]) and undirected models (e.g., Markov random fields (MRFs) [11, Chapter 8.4.3]) because factor functions can represent both joint and conditional probabilities. In this paper, the proposed factor graph for LDA describes the same joint probability as the three-layer HBM, and thus it is not a new topic model but interprets LDA from a novel MRF perspective. The basic idea is inspired by the collapsed GS algorithm for LDA [2], which integrates out multinomial parameters based on the Dirichlet-Multinomial conjugacy [14] and views Dirichlet hyperparameters as pseudo topic labels having the same layer with the latent topic labels. In the collapsed hidden variable space, the joint probability of LDA can be represented as the product of factor functions in the factor graph. By contrast, the undirected model "harmonium" [15] encodes a different joint probability from LDA and probabilistic latent semantic analysis (PLSA) [16], so that it is a viable alternative to the directed models.

The factor graph representation facilitates a new loopy belief propagation (BP) algorithm [10], [11], [17] for approximate inference and parameter estimation. By designing proper neighborhood systems and factor functions, we may encourage or penalize different local topic labeling configurations in the neighborhood system to realize the topic modeling goal. The BP algorithm operates well on the factor graph, and it has the potential to become a generic learning scheme for variants of LDA-based topic models. For example, we also extend the BP algorithm to learn ATM [3] and RTM [4] based on the factor graph representations. Although the convergence of BP is not guaranteed on general graphs [11], it often converges and works well in real-world applications.

Fig. 1. Three-layer graphical representation of LDA [1].

The factor graph of LDA also reveals some intrinsic relations between HBM and MRF.
HBM is a class of directed models within the Bayesian network framework [14], which represents the causal or conditional dependencies of observed and hidden variables in a hierarchical manner, so that it is difficult to factorize the joint probability of hidden variables. By contrast, MRF can factorize the joint distribution of hidden variables into the product of factor functions according to the Hammersley-Clifford theorem [18], which facilitates the efficient BP algorithm for approximate inference and parameter estimation. Although learning HBM often has difficulty in estimating parameters and inferring hidden variables due to the causal coupling effects, the alternative factor graph representation as well as the BP-based learning scheme may shed more light on faster and more accurate learning algorithms for HBM.

The remainder of this paper is organized as follows. In Section 2 we introduce the factor graph interpretation for LDA, and derive the loopy BP algorithm for approximate inference and parameter estimation. Moreover, we discuss the intrinsic relations between BP and other state-of-the-art approximate inference algorithms. Sections 3 and 4 present how to learn ATM and RTM using the BP algorithms. Section 5 validates the BP algorithm on four document data sets. Finally, Section 6 draws conclusions and envisions future work.

2 BELIEF PROPAGATION FOR LDA

The probabilistic topic modeling task is to assign a set of semantic topic labels, z = {z^k_{w,d}}, to explain the observed nonzero elements in the document-word matrix x = {x_{w,d}}. In the notation, 1 ≤ k ≤ K is the topic index, x_{w,d} is the number of word counts at the index {w, d}, and 1 ≤ w ≤ W and 1 ≤ d ≤ D are the word index in the vocabulary and the document index in the corpus. Table 1 summarizes some important notations.

TABLE 1
Notations
1 ≤ d ≤ D        Document index
1 ≤ w ≤ W        Word index in vocabulary
1 ≤ k ≤ K        Topic index
1 ≤ a ≤ A        Author index
1 ≤ c ≤ C        Link index
x = {x_{w,d}}    Document-word matrix
z = {z^k_{w,d}}  Topic labels for words
z_{-w,d}         Labels for d excluding w
z_{w,-d}         Labels for w excluding d
a_d              Coauthors of the document d
µ_{.,d}(k)       Σ_w x_{w,d} µ_{w,d}(k)
µ_{w,.}(k)       Σ_d x_{w,d} µ_{w,d}(k)
θ_d              Factor of the document d
φ_w              Factor of the word w
η_c              Factor of the link c
f(.)             Factor functions
α, β             Dirichlet hyperparameters

Fig. 1 shows the original three-layer graphical representation of LDA [1]. The document-specific topic proportion θ_d(k) generates a topic label z^k_{w,d,i} ∈ {0, 1}, Σ_{k=1}^{K} z^k_{w,d,i} = 1, which in turn generates each observed word token i at the index w in the document d based on the topic-specific multinomial distribution φ_k(w) over the vocabulary words. Both multinomial parameters θ_d(k) and φ_k(w) are generated by two Dirichlet distributions with hyperparameters α and β, respectively. For simplicity, we consider only the smoothed LDA [2] with fixed symmetric Dirichlet hyperparameters α and β. The plates in Fig. 1 indicate replications. For example, the document d repeats D times in the corpus, the word token w_n repeats N_d times in the document d, the vocabulary size is W, and there are K topics.

2.1 Factor Graph Representation

We begin by transforming the directed graph of Fig. 1 into a two-layer factor graph, of which a representative fragment is shown in Fig. 2. The notation z^k_{w,d} = Σ_{i=1}^{x_{w,d}} z^k_{w,d,i}/x_{w,d} denotes the average topic labeling configuration over all word tokens 1 ≤ i ≤ x_{w,d} at the index {w, d}. We define the neighborhood system of the topic label z_{w,d} as z_{-w,d} and z_{w,-d}, where z_{-w,d} denotes the set of topic labels associated with all word indices in the document d except w, and z_{w,-d} denotes the set of topic labels associated with the word index w in all documents except d. The factors θ_d and φ_w are denoted by squares, and their connected variables z_{w,d} are denoted by circles. The factor θ_d connects the neighboring topic labels {z_{w,d}, z_{-w,d}} at different word indices within the same document d, while the factor φ_w connects the neighboring topic labels {z_{w,d}, z_{w,-d}} at the same word index w but in different documents. We absorb the observed word w as the index of φ_w, which is similar to absorbing the observed document d as the index of θ_d. Because the factors can be parameterized functions [11], both θ_d and φ_w can represent the same multinomial parameters as Fig. 1.

Fig. 2. Factor graph of LDA and message passing.

Fig. 2 describes the same joint probability as Fig. 1 if we properly design the factor functions. The bipartite factor graph is inspired by the collapsed GS [2], [14] algorithm, which integrates out the parameter variables {θ, φ} in Fig. 1 and treats the hyperparameters {α, β} as pseudo topic counts having the same layer with the hidden variables z. Thus, the joint probability of the collapsed hidden variables can be factorized as the product of factor functions. This collapsed view has also been discussed within the mean-field framework [19], inspiring the zero-order approximation CVB (CVB0) algorithm [7] for LDA. So, we speculate that all three-layer LDA-based topic models can be collapsed into the two-layer factor graph, which facilitates the BP algorithm for efficient inference and parameter estimation. However, how to use the two-layer factor graph to represent more general multi-layer HBM still remains to be studied.

Based on the Dirichlet-Multinomial conjugacy, integrating out the multinomial parameters {θ, φ} yields the joint probability [14] of LDA in Fig. 1,

P(x, z|α, β) ∝ ∏_{d=1}^{D} ∏_{k=1}^{K} Γ(Σ_{w=1}^{W} x_{w,d} z^k_{w,d} + α) / Γ[Σ_{k=1}^{K} (Σ_{w=1}^{W} x_{w,d} z^k_{w,d} + α)] × ∏_{w=1}^{W} ∏_{k=1}^{K} Γ(Σ_{d=1}^{D} x_{w,d} z^k_{w,d} + β) / Γ[Σ_{w=1}^{W} (Σ_{d=1}^{D} x_{w,d} z^k_{w,d} + β)],   (1)

where x_{w,d} z^k_{w,d} = Σ_{i=1}^{x_{w,d}} z^k_{w,d,i} recovers the original topic configuration over the word tokens in Fig. 1. Here, we design the factor functions as

f_{θ_d}(x_{.,d}, z_{.,d}, α) = ∏_{k=1}^{K} Γ(Σ_{w=1}^{W} x_{w,d} z^k_{w,d} + α) / Γ[Σ_{k=1}^{K} (Σ_{w=1}^{W} x_{w,d} z^k_{w,d} + α)],   (2)

f_{φ_w}(x_{w,.}, z_{w,.}, β) = ∏_{k=1}^{K} Γ(Σ_{d=1}^{D} x_{w,d} z^k_{w,d} + β) / Γ[Σ_{w=1}^{W} (Σ_{d=1}^{D} x_{w,d} z^k_{w,d} + β)],   (3)

where z_{.,d} = {z_{w,d}, z_{-w,d}} and z_{w,.} = {z_{w,d}, z_{w,-d}} denote subsets of the variables in Fig. 2. Therefore, the joint probability (1) of LDA can be re-written as the product of factor functions [11, Eq. (8.59)] in Fig. 2,

P(x, z|α, β) ∝ ∏_{d=1}^{D} f_{θ_d}(x_{.,d}, z_{.,d}, α) ∏_{w=1}^{W} f_{φ_w}(x_{w,.}, z_{w,.}, β).   (4)

Therefore, the two-layer factor graph in Fig. 2 encodes exactly the same information as the three-layer graph for LDA in Fig. 1. In this way, we may interpret LDA within the MRF framework to treat probabilistic topic modeling as a labeling problem.

2.2 Belief Propagation (BP)

The BP [11] algorithms provide exact solutions for inference problems in tree-structured factor graphs but approximate solutions in factor graphs with loops. Rather than directly computing the conditional joint probability p(z|x), we compute the conditional marginal probability p(z^k_{w,d} = 1, x_{w,d}|z^k_{-w,-d}, x_{-w,-d}), referred to as the message µ_{w,d}(k), which can be normalized using a local computation, i.e., Σ_{k=1}^{K} µ_{w,d}(k) = 1, 0 ≤ µ_{w,d}(k) ≤ 1. According to the Markov property in Fig. 2, we obtain

p(z^k_{w,d}, x_{w,d}|z^k_{-w,-d}, x_{-w,-d}) ∝ p(z^k_{w,d}, x_{w,d}|z^k_{-w,d}, x_{-w,d}) p(z^k_{w,d}, x_{w,d}|z^k_{w,-d}, x_{w,-d}),   (5)

where -w and -d denote all word indices except w and all document indices except d, and the notations z_{-w,d} and z_{w,-d} represent all possible neighboring topic configurations. From the message passing view, p(z^k_{w,d}, x_{w,d}|z^k_{-w,d}, x_{-w,d}) is the neighboring message µ_{θ_d→z_{w,d}}(k) sent from the factor node θ_d, and p(z^k_{w,d}, x_{w,d}|z^k_{w,-d}, x_{w,-d}) is the other neighboring message µ_{φ_w→z_{w,d}}(k) sent from the factor node φ_w. Notice that (5) uses the smoothness prior in MRF, which encourages only K smooth topic configurations within the neighborhood system. Using the Bayes' rule and the joint probability (1), we can expand Eq. (5) as

µ_{w,d}(k) ∝ [p(z^k_{.,d}, x_{.,d}) / p(z^k_{-w,d}, x_{-w,d})] × [p(z^k_{w,.}, x_{w,.}) / p(z^k_{w,-d}, x_{w,-d})]
          ∝ [Σ_{-w} x_{-w,d} z^k_{-w,d} + α] / [Σ_{k=1}^{K} (Σ_{-w} x_{-w,d} z^k_{-w,d} + α)] × [Σ_{-d} x_{w,-d} z^k_{w,-d} + β] / [Σ_{w=1}^{W} (Σ_{-d} x_{w,-d} z^k_{w,-d} + β)],   (6)

where the property Γ(x + 1) = xΓ(x) is used to cancel the common terms in both nominator and denominator [14]. We find that Eq. (6) updates the message on the variable z_{w,d} if its neighboring topic configuration {z^k_{-w,d}, z^k_{w,-d}} is known. However, due to uncertainty, we know only the neighboring messages rather than the precise topic configuration. So, we replace the topic configurations by the corresponding messages in Eq. (6) and obtain the following message update equation,

µ_{w,d}(k) ∝ [µ_{-w,d}(k) + α] / Σ_k[µ_{-w,d}(k) + α] × [µ_{w,-d}(k) + β] / Σ_w[µ_{w,-d}(k) + β],   (7)

where

µ_{-w,d}(k) = Σ_{-w} x_{-w,d} µ_{-w,d}(k),   (8)

µ_{w,-d}(k) = Σ_{-d} x_{w,-d} µ_{w,-d}(k).   (9)

Messages are passed from variables to factors, and in turn from factors to variables, until convergence or the maximum number of iterations is reached. Notice that we need only pass messages for x_{w,d} ≠ 0. Because x is a very sparse matrix, the message update equation (7) is computationally fast by sweeping all nonzero elements in the sparse matrix x.

Based on messages, we can estimate the multinomial parameters θ and φ by the expectation-maximization (EM) algorithm [14]. The E-step infers the normalized message µ_{w,d}(k). Using the Dirichlet-Multinomial conjugacy and Bayes' rule, we express the marginal Dirichlet distributions on parameters as follows,

p(θ_d) = Dir(θ_d|µ_{.,d}(k) + α),   (10)

p(φ_w) = Dir(φ_w|µ_{w,.}(k) + β).   (11)

The M-step maximizes (10) and (11) with respect to θ_d and φ_w, resulting in the following point estimates of the multinomial parameters,

θ_d(k) = [µ_{.,d}(k) + α] / Σ_k[µ_{.,d}(k) + α],   (12)

φ_w(k) = [µ_{w,.}(k) + β] / Σ_w[µ_{w,.}(k) + β].   (13)

In this paper, we consider only fixed hyperparameters {α, β}. Interested readers can figure out how to estimate hyperparameters based on inferred messages in [14].

To implement the BP algorithm, we must choose either the synchronous or the asynchronous update schedule to pass messages [20]. Fig. 3 shows the synchronous message update schedule. At each iteration t, each variable uses the incoming messages from the previous iteration t - 1 to calculate the current message. Once every variable computes its message, the message is passed to the neighboring variables and used to compute messages in the next iteration t + 1. An alternative is the asynchronous message update schedule, which updates the message of each variable immediately. The updated message is immediately used to compute other neighboring messages at each iteration t. The asynchronous update schedule often passes messages faster across variables, which causes the BP algorithm to converge faster than the synchronous update schedule. Another termination condition for convergence is that the change of the multinomial parameters [1] is less than a predefined threshold λ, for example, λ = 0.00001 [21].

input: x, K, T, α, β.
output: θ_d, φ_w.
µ^1_{w,d}(k) ← initialization and normalization;
for t ← 1 to T do
    µ^{t+1}_{w,d}(k) ∝ [µ^t_{-w,d}(k) + α] / Σ_k[µ^t_{-w,d}(k) + α] × [µ^t_{w,-d}(k) + β] / Σ_w[µ^t_{w,-d}(k) + β];
end
θ_d(k) ← [µ_{.,d}(k) + α] / Σ_k[µ_{.,d}(k) + α];
φ_w(k) ← [µ_{w,.}(k) + β] / Σ_w[µ_{w,.}(k) + β];

Fig. 3. The synchronous BP for LDA.

2.3 An Alternative View of BP

We may also adopt one of the BP instantiations, the sum-product algorithm [11], to infer µ_{w,d}(k). For convenience, we will not include the observation x_{w,d} in the formulation. Fig. 2 shows the message passing from the two factors θ_d and φ_w to the variable z_{w,d}, where the arrows denote the message passing directions. Based on the smoothness prior, we encourage only K smooth topic configurations without considering all other possible configurations. The message µ_{w,d}(k) is proportional to the product of both incoming messages from the factors,

µ_{w,d}(k) ∝ µ_{θ_d→z_{w,d}}(k) × µ_{φ_w→z_{w,d}}(k).   (14)

Notice that Eq. (14) has the same meaning as (5). The messages from factors to variables are composed of all incoming messages from the neighboring variables,

µ_{θ_d→z_{w,d}}(k) = f_{θ_d} ∏_{-w} µ_{-w,d}(k) α,   (15)

µ_{φ_w→z_{w,d}}(k) = f_{φ_w} ∏_{-d} µ_{w,-d}(k) β,   (16)

where α and β can be viewed as the pseudo-messages, and f_{θ_d} and f_{φ_w} are the factor functions that encourage or penalize the incoming messages.

In practice, however, Eqs. (15) and (16) often cause the product of multiple incoming messages to be close to zero [12]. To avoid arithmetic underflow, we use the sum operation rather than the product operation of incoming messages, because when the product value increases the sum value also increases,

∏_{-w} µ_{-w,d}(k) α ∝ Σ_{-w} µ_{-w,d}(k) + α,   (17)

∏_{-d} µ_{w,-d}(k) β ∝ Σ_{-d} µ_{w,-d}(k) + β.   (18)

Such approximations as (17) and (18) transform the sum-product to the sum-sum algorithm, which resembles the relaxation labeling algorithm for learning MRF with good performance [12].

The normalized message µ_{w,d}(k) is multiplied by the number of word counts x_{w,d} during the propagation, i.e., x_{w,d}µ_{w,d}(k). In this sense, x_{w,d} can be viewed as the weight of µ_{w,d}(k) during the propagation, where a bigger x_{w,d} corresponds to a larger influence of its message on those of its neighbors. Thus, the topics may be distorted by those documents with greater word counts. To avoid this problem, we may choose another weight like term frequency (TF) or term frequency-inverse document frequency (TF-IDF) for weighted belief propagation. In this sense, BP can not only handle discrete data, but also process continuous data like TF-IDF. The MRF model in Fig. 2 can be extended to describe both discrete and continuous data in general, while LDA in Fig. 1 focuses only on generating discrete data.

In the MRF model, we can design the factor functions arbitrarily to encourage or penalize local topic labeling configurations based on our prior knowledge.

From Fig. 1, LDA solves the topic modeling problem according to three intrinsic assumptions:
1) Co-occurrence: Different word indices within the same document tend to be associated with the same topic labels.
2) Smoothness: The same word indices in different documents are likely to be associated with the same topic labels.
3) Clustering: All word indices do not tend to associate with the same topic labels.
The first assumption is determined by the document-specific topic proportion θ_d(k), where it is more likely to assign a topic label z^k_{w,d} = 1 to the word index w if the topic k is more frequently assigned to other word indices in the document d. Similarly, the second assumption is based on the topic-specific multinomial distribution φ_k(w), which implies a higher likelihood of associating the word index w with the topic label z^k_{w,d} = 1 if k is more frequently assigned to the same word index w in other documents except d. The third assumption avoids grouping all word indices into one topic through normalizing φ_k(w) in terms of all word indices. If most word indices are associated with the topic k, the multinomial parameter φ_k will become too small to allocate the topic k to these word indices.

function [phi, theta] = siBP(X, K, T, ALPHA, BETA)
% X is a W*D sparse matrix.
% W is the vocabulary size.
% D is the number of documents.
% The element of X is the word count 'xi'.
% 'wi' and 'di' are word and document indices.
% K is the number of topics.
% T is the number of iterations.
% mu is a matrix with K rows for topic messages.
% phi is a K*W matrix.
% theta is a K*D matrix.
% ALPHA and BETA are hyperparameters.
%
% normalize(A,dim) returns the normalized values
% (sum to one) of the elements along different
% dimensions of an array.
[wi,di,xi] = find(X);
% random initialization
mu = normalize(rand(K,nnz(X)),1);
% simplified belief propagation
for t = 1:T
    for k = 1:K
        md(k,:) = accumarray(di,xi'.*mu(k,:));
        mw(k,:) = accumarray(wi,xi'.*mu(k,:));
    end
    theta = normalize(md+ALPHA,1); % Eq.(12)
    phi = normalize(mw+BETA,2);    % Eq.(13)
    mu = normalize(theta(:,di).*phi(:,wi),1); % Eq.(21)
end
return

Fig. 4. The MATLAB code for siBP.
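For comparison with Fig. 4, the following sketch (ours, not the authors' released toolbox [37]) implements the synchronous BP update of Fig. 3 and Eq. (7): the only change relative to siBP is that the current message x_{w,d}µ_{w,d}(k) is subtracted from the aggregated messages before normalization, which corresponds to µ_{-w,d}(k) and µ_{w,-d}(k) in Eqs. (8) and (9). The inputs X, K, T, ALPHA, BETA are as in Fig. 4.

    % Synchronous BP for LDA, Eq. (7) and Fig. 3 (illustrative sketch).
    [wi, di, xi] = find(X);
    W = size(X, 1); D = size(X, 2); N = nnz(X);
    mu = rand(K, N);
    mu = bsxfun(@rdivide, mu, sum(mu, 1));                        % normalized K-tuple messages
    for t = 1:T
        xmu = bsxfun(@times, xi', mu);                            % x_{w,d} * mu_{w,d}(k)
        md = zeros(K, D); mw = zeros(K, W);
        for k = 1:K
            md(k, :) = accumarray(di, xmu(k, :)', [D 1])';        % mu_{.,d}(k), Table 1
            mw(k, :) = accumarray(wi, xmu(k, :)', [W 1])';        % mu_{w,.}(k), Table 1
        end
        numd = md(:, di) - xmu + ALPHA;                           % mu_{-w,d}(k) + alpha, Eq. (8)
        numw = mw(:, wi) - xmu + BETA;                            % mu_{w,-d}(k) + beta, Eq. (9)
        dend = sum(numd, 1);                                      % sum_k [mu_{-w,d}(k) + alpha]
        denw = bsxfun(@minus, sum(mw, 2), md(:, di)) + W * BETA;  % sum_w [mu_{w,-d}(k) + beta]
        mu = bsxfun(@rdivide, numd, dend) .* (numw ./ denw);      % Eq. (7)
        mu = bsxfun(@rdivide, mu, sum(mu, 1));                    % local normalization
    end
    theta = bsxfun(@rdivide, md + ALPHA, sum(md + ALPHA, 1));     % Eq. (12), aggregates from the last sweep
    phi = bsxfun(@rdivide, mw + BETA, sum(mw + BETA, 2));         % Eq. (13)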

According to the above assumptions, we design f_{θ_d} and f_{φ_w} over messages as

f_{θ_d}(µ_{.,d}, α) = 1 / Σ_k[µ_{-w,d}(k) + α],   (19)

f_{φ_w}(µ_{w,.}, β) = 1 / Σ_w[µ_{w,-d}(k) + β].   (20)

Eq. (19) normalizes the incoming messages by the total number of messages for all topics associated with the document d to make outgoing messages comparable across documents. Eq. (20) normalizes the incoming messages by the total number of messages for all words in the vocabulary to make outgoing messages comparable across vocabulary words. Notice that (15) and (16) realize the first two assumptions, and (20) encodes the third assumption of topic modeling. A similar normalization technique to avoid partitioning all data points into one cluster has been used in the classic normalized cuts algorithm for image segmentation [22]. Combining (14) to (20) will yield the same message update equation (7).

To estimate the parameters θ_d and φ_w, we use the joint marginal distributions (15) and (16) of the set of variables belonging to the factors θ_d and φ_w including the variable z_{w,d}, which produce the same point estimation equations (12) and (13).

2.4 Simplified BP (siBP)

We may simplify the message update equation (7). Substituting (12) and (13) into (7) yields the approximate message update equation,

µ_{w,d}(k) ∝ θ_d(k) × φ_w(k),   (21)

which includes the current message µ_{w,d}(k) in both numerator and denominator in (7). In many real-world topic modeling tasks, a document often contains many different word indices, and the same word index appears in many different documents. So, at each iteration, Eq. (21) deviates only slightly from (7) after adding the current message to both numerator and denominator. Such a slight difference may be enlarged after many iterations in Fig. 3 due to accumulation effects, leading to different estimated parameters. Intuitively, Eq. (21) implies that if the topic k has a higher proportion in the document d, and it has a higher likelihood to generate the word index w, it is more likely to allocate the topic k to the observed word x_{w,d}. This allocation scheme in principle follows the three intrinsic topic modeling assumptions in the subsection 2.3. Fig. 4 shows the MATLAB code for the simplified BP (siBP).

2.5 Relationship to Previous Algorithms

Here we discuss some intrinsic relations between BP and three state-of-the-art LDA learning algorithms, namely VB [1], GS [2], and the zero-order approximation CVB (CVB0) [7], [19], within the unified message passing framework. The message update scheme is an instantiation of the E-step of the EM algorithm [23], which has been widely used to infer the marginal probabilities of hidden variables in various graphical models according to the maximum-likelihood estimation [11] (e.g., the E-step inference for GMMs [24], the forward-backward algorithm for HMMs [25], and the probabilistic relaxation labeling algorithm for MRF [26]). After the E-step, we estimate the optimal parameters using the updated messages and observations at the M-step of the EM algorithm.

VB is a variational message passing method [27] that uses a set of factorized variational distributions q(z) to approximate the joint distribution (1) by minimizing the Kullback-Leibler (KL) divergence between them. Employing Jensen's inequality makes the approximate variational distribution an adjustable lower bound on the joint distribution, so that maximizing the joint probability is equivalent to maximizing the lower bound by tuning a set of variational parameters. The lower bound q(z) is also an MRF in nature that approximates the joint distribution (1). Because there is always a gap between the lower bound and the true joint distribution, VB introduces bias when learning LDA. The variational message update equation is

µ_{w,d}(k) ∝ exp[Ψ(µ_{.,d}(k) + α)] / exp[Ψ(Σ_k[µ_{.,d}(k) + α])] × [µ_{w,.}(k) + β] / Σ_w[µ_{w,.}(k) + β],   (22)

which resembles the synchronous BP (7) but with two major differences. First, VB uses complicated digamma functions Ψ(.), which not only introduce bias [7] but also slow down the message updating. Second, VB uses a different variational EM schedule. At the E-step, it simultaneously updates both the variational messages and the parameter of θ_d until convergence, holding the variational parameter of φ fixed. At the M-step, VB updates only the variational parameter of φ.

The message update equation of GS is

µ_{w,d,i}(k) ∝ [n^{-i}_{.,d}(k) + α] / Σ_k[n^{-i}_{.,d}(k) + α] × [n^{-i}_{w,.}(k) + β] / Σ_w[n^{-i}_{w,.}(k) + β],   (23)

where n^{-i}_{.,d}(k) is the total number of topic labels k in the document d except the topic label on the current word token i, and n^{-i}_{w,.}(k) is the total number of topic labels k of the word w except the topic label on the current word token i. Eq. (23) resembles the asynchronous BP implementation (7) but with two subtle differences. First, GS randomly samples the current topic label z^k_{w,d,i} = 1 from the message µ_{w,d,i}(k), which truncates all K-tuple message values to zeros except the sampled topic label k. Such information loss introduces bias when learning LDA. Second, GS must sample a topic label for each word token, which repeats x_{w,d} times for the word index {w, d}. The sweep of the entire word tokens rather than word indices restricts GS's scalability to large-scale document repositories containing billions of word tokens.

CVB0 is exactly equivalent to our asynchronous BP implementation but based on word tokens. Previous empirical comparisons [7] advocated the CVB0 algorithm for LDA within the approximate mean-field framework [19] closely connected with the proposed BP. Here we clearly explain that the superior performance of CVB0 has been largely attributed to its asynchronous BP implementation from the MRF perspective. Our experiments also support that message passing over word indices instead of tokens will produce comparable or even better topic modeling performance but with significantly smaller computational costs.

Eq. (21) also reveals that siBP is a probabilistic matrix factorization algorithm that factorizes the document-word matrix, x = [x_{w,d}]_{W×D}, into a matrix of document-specific topic proportions, θ = [θ_d(k)]_{K×D}, and a matrix of vocabulary word-specific topic proportions, φ = [φ_w(k)]_{K×W}, i.e., x ~ φ^T θ. We see that the larger number of word counts x_{w,d} corresponds to the higher likelihood Σ_k θ_d(k)φ_w(k). From this point of view, the multinomial principal component analysis (PCA) [28] describes some intrinsic relations among LDA, PLSA [16], and non-negative matrix factorization (NMF) [29]. Eq. (21) is the same as the E-step update for PLSA except that the parameters θ and φ are smoothed by the hyperparameters α and β to prevent overfitting.

VB, BP and siBP have the computational complexity O(KDW_dT), but GS and CVB0 require O(KDN_dT), where W_d is the average vocabulary size per document, N_d is the average number of word tokens per document, and T is the number of learning iterations.
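To make the comparison in this subsection concrete, the sketch below (ours) contrasts the siBP update (21) with the VB update (22): the two differ only in the digamma transform exp(Ψ(.)) applied to the document-side statistics, while the word-side term is shared. It assumes the aggregated message matrices md (K×D) and mw (K×W) and the index vectors di, wi from Fig. 4 are available; psi is MATLAB's digamma function.

    % Contrast between the siBP update (21) and the VB update (22).
    theta_bp = bsxfun(@rdivide, md + ALPHA, sum(md + ALPHA, 1));              % Eq. (12)
    theta_vb = exp(bsxfun(@minus, psi(md + ALPHA), psi(sum(md + ALPHA, 1)))); % digamma-weighted term of Eq. (22)
    phi_est  = bsxfun(@rdivide, mw + BETA, sum(mw + BETA, 2));                % Eq. (13), shared by both updates
    mu_sibp  = theta_bp(:, di) .* phi_est(:, wi);                             % Eq. (21)
    mu_vb    = theta_vb(:, di) .* phi_est(:, wi);                             % Eq. (22), up to normalization
    mu_sibp  = bsxfun(@rdivide, mu_sibp, sum(mu_sibp, 1));
    mu_vb    = bsxfun(@rdivide, mu_vb, sum(mu_vb, 1));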

3 BELIEF PROPAGATION FOR ATM

Author-topic models (ATM) [3] depict each author of the document as a mixture of probabilistic topics, and have found important applications in matching papers with reviewers [30]. Fig. 5A shows the generative graphical representation for ATM, which first uses a document-specific uniform distribution u_d to generate an author index a, 1 ≤ a ≤ A, and then uses the author-specific topic proportions θ_a to generate a topic label z^k_{w,d} = 1 for the word index w in the document d. The plate on θ indicates that there are A unique authors in the corpus. The document often has multiple coauthors. ATM randomly assigns one of the observed author indices to each word in the document based on the document-specific uniform distribution u_d. However, it is more reasonable that each word x_{w,d} is associated with an author index a ∈ a_d from a multinomial rather than a uniform distribution, where a_d is the set of author indices of the document d. As a result, each topic label takes two variables z^{a,k}_{w,d} ∈ {0, 1}, Σ_{a,k} z^{a,k}_{w,d} = 1, a ∈ a_d, 1 ≤ k ≤ K, where a is the author index and k is the topic index attached to the word.

We transform Fig. 5A to the factor graph representation of ATM in Fig. 5B. As with Fig. 2, we absorb the observed author index a ∈ a_d of the document d as the index of the factor θ_{a∈a_d}. The notation z^a_{-w,.} denotes all labels connected with the authors a ∈ a_d except those for the word index w. The only difference between ATM and LDA is that the author a ∈ a_d instead of the document d connects the labels z_{w,d} and z^a_{-w,.}. As a result, ATM encourages topic smoothness among the labels z^a_{w,d} attached to the same author a instead of the same document d.

Fig. 5. (A) The three-layer graphical representation [3] and (B) two-layer factor graph of ATM.

3.1 Inference and Parameter Estimation

Unlike passing the K-tuple message µ_{w,d}(k) in Fig. 3, the BP algorithm for learning ATM passes the |a_d| × K-tuple message vectors µ^a_{w,d}(k), a ∈ a_d, through the factor θ_a in Fig. 5B, where |a_d| is the number of authors in the document d. Nevertheless, we can still obtain the K-tuple word topic message µ_{w,d}(k) by marginalizing the message µ^a_{w,d}(k) in terms of the author variable a ∈ a_d as follows,

µ_{w,d}(k) = Σ_{a∈a_d} µ^a_{w,d}(k).   (24)

Since Figs. 2 and 5B have the same right half part, the message passing equation from the factor φ_w to the variable z_{w,d} and the parameter estimation equation for φ_w in Fig. 5B remain the same as (7) and (13) based on the marginalized word topic message in (24). Thus, we only need to derive the message passing equation from the factor θ_{a∈a_d} to the variable z_{w,d} in Fig. 5B. Because of the topic smoothness prior, we design the factor function as follows,

f_{θ_a} = 1 / Σ_k[µ^a_{-w,-d}(k) + α],   (25)

where µ^a_{-w,-d}(k) = Σ_{-w,-d} x_{w,d} µ^a_{w,d}(k) denotes the sum of all incoming messages attached to the author index a and the topic index k excluding x_{w,d}µ^a_{w,d}(k). Likewise, Eq. (25) normalizes the incoming messages attached to the author index a in terms of the topic index k to make outgoing messages comparable for different authors a ∈ a_d. Similar to (15), we derive the message passing µ_{f_{θ_a}→z^a_{w,d}} through adding all incoming messages evaluated by the factor function (25).

Multiplying the two messages from the factors θ_{a∈a_d} and φ_w yields the message update equation as follows,

µ^a_{w,d}(k) ∝ [µ^a_{-w,-d}(k) + α] / Σ_k[µ^a_{-w,-d}(k) + α] × [µ_{w,-d}(k) + β] / Σ_w[µ_{w,-d}(k) + β].   (26)

Notice that the |a_d| × K-tuple message µ^a_{w,d}(k), a ∈ a_d, is normalized in terms of all combinations of {a, k}, a ∈ a_d, 1 ≤ k ≤ K. Based on the normalized messages, the author-specific topic proportion θ_a(k) can be estimated from the sum of all incoming messages including µ^a_{w,d} evaluated by the factor function f_{θ_a} as follows,

θ_a(k) = [µ^a_{.,.}(k) + α] / Σ_k[µ^a_{.,.}(k) + α].   (27)

As a summary, Fig. 6 shows the synchronous BP algorithm for learning ATM. The difference between Fig. 3 and Fig. 6 is that Fig. 6 considers the author index a as the label for each word. At each iteration, the computational complexity is O(KDW_dA_dT), where A_d is the average number of authors per document.

input: x, a_d, K, T, α, β.
output: θ_a, φ_w.
µ^{a,1}_{w,d}(k), a ∈ a_d ← initialization and normalization;
for t ← 1 to T do
    µ^{a,t+1}_{w,d}(k) ∝ [µ^{a,t}_{-w,-d}(k) + α] / Σ_k[µ^{a,t}_{-w,-d}(k) + α] × [µ^t_{w,-d}(k) + β] / Σ_w[µ^t_{w,-d}(k) + β];
end
θ_a(k) ← [µ^a_{.,.}(k) + α] / Σ_k[µ^a_{.,.}(k) + α];
φ_w(k) ← [µ_{w,.}(k) + β] / Σ_w[µ_{w,.}(k) + β];

Fig. 6. The synchronous BP for ATM.
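The per-author update of Eqs. (24)-(26) can be sketched for a single nonzero entry {w, d} as follows. This is our illustration rather than the released implementation: Ma denotes the A×K word-count-weighted author aggregates µ^a_{.,.}(k), mu_a is the |a_d|×K message block for the current entry, and the word-side numerator and denominator of Eq. (26) are assumed to be precomputed.

    function mu_a = atm_update(mu_a, Ma, authors, xwd, mw_num, mw_den, ALPHA)
        % mu_a    : |a_d|*K current messages mu^a_{w,d}(k) for this entry {w,d}
        % Ma      : A*K word-count-weighted author aggregates mu^a_{.,.}(k)
        % authors : the coauthor indices a_d of document d
        % xwd     : the word count x_{w,d}
        % mw_num  : 1*K vector, mu_{w,-d}(k) + BETA
        % mw_den  : 1*K vector, sum_w [mu_{w,-d}(k) + BETA]
        Ma_excl = Ma(authors, :) - xwd * mu_a;                              % mu^a_{-w,-d}(k)
        left = bsxfun(@rdivide, Ma_excl + ALPHA, sum(Ma_excl + ALPHA, 2));  % first factor of Eq. (26)
        mu_a = bsxfun(@times, left, mw_num ./ mw_den);                      % second factor of Eq. (26)
        mu_a = mu_a / sum(mu_a(:));                                         % normalize over all {a,k} pairs
    end
    % The K-tuple word topic message of Eq. (24) is then sum(mu_a, 1).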

4 BELIEF PROPAGATION FOR RTM

Network data, such as citation and coauthor networks of documents [30], [31], tag networks of documents and images [32], hyperlinked networks of web pages, and social networks of friends, exist pervasively in data mining and machine learning. The probabilistic relational topic modeling of network data can provide both useful predictive models and descriptive statistics [4].

Fig. 7. (A) The three-layer graphical representation [4] and (B) two-layer factor graph of RTM.

In Fig. 7A, relational topic models (RTM) [4] represent the entire document topics by the mean value of the document topic proportions, and use the Hadamard product of mean values z_d ∘ z_{d'} from two linked documents {d, d'} as link features, which are learned by the generalized linear model (GLM) η to generate the observed binary citation link variable c = 1. Besides, all other parts in RTM remain the same as LDA.

We transform Fig. 7A to the factor graph Fig. 7B by absorbing the observed link index c ∈ c, 1 ≤ c ≤ C, as the index of the factor η_c. Each link index connects a document pair {d, d'}, and the factor η_c connects the word topic labels z_{w,d} and z_{.,d'} of the document pair. Besides encoding the topic smoothness, RTM explicitly describes the topic structural dependencies between the pair of linked documents {d, d'} using the factor function f_{η_c}(.).

4.1 Inference and Parameter Estimation

In Fig. 7, the messages from the factors θ_d and φ_w to the variable z_{w,d} are the same as LDA in (15) and (16). Thus, we only need to derive the message passing equation from the factor η_c to the variable z_{w,d}. We design the factor function f_{η_c}(.) for linked documents as follows,

f_{η_c}(k|k') = Σ_{{d,d'}} µ_{.,d}(k) µ_{.,d'}(k') / Σ_{{d,d'},k'} µ_{.,d}(k) µ_{.,d'}(k'),   (28)

which depicts the likelihood of topic label k assigned to the document d when its linked document d' is associated with the topic label k'. Notice that the designed factor function does not follow the GLM for link modeling in the original RTM [4] because the GLM makes inference slightly more complicated. However, similar to the GLM, Eq. (28) is also able to capture the topic interactions between two linked documents {d, d'} in document networks. Instead of the smoothness prior encoded by the factor functions (19) and (20), it describes arbitrary topic dependencies {k, k'} of linked documents {d, d'}.

Based on the factor function (28), we resort to the sum-product algorithm to calculate the message,

µ_{η_c→z_{w,d}}(k) = Σ_{d'} Σ_{k'} f_{η_c}(k|k') µ_{.,d'}(k'),   (29)

where we use the sum rather than the product of messages from all linked documents d' to avoid arithmetic underflow. The standard sum-product algorithm requires the product of all messages from factors to variables. However, in practice, the direct product operation cannot balance the messages from different sources. For example, the message µ_{θ_d→z_{w,d}} is from the neighboring words within the same document d, while the message µ_{η_c→z_{w,d}} is from all linked documents d'. If we pass the product of these two types of messages, we cannot distinguish which one influences the topic label z_{w,d} more. Hence, we use the weighted sum of the two types of messages,

µ(z_{w,d} = k) ∝ [(1 - ξ)µ_{θ_d→z_{w,d}}(k) + ξµ_{η_c→z_{w,d}}(k)] × µ_{φ_w→z_{w,d}}(k),   (30)

where ξ ∈ [0, 1] is the weight that balances the two messages µ_{θ_d→z_{w,d}} and µ_{η_c→z_{w,d}}. When there is no link information, ξ = 0, Eq. (30) reduces to (7) so that RTM reduces to LDA. Fig. 8 shows the synchronous BP algorithm for learning RTM. Given the inferred messages, the parameter estimation equations remain the same as (12) and (13). The computational complexity at each iteration is O(K^2 CDW_dT), where C is the total number of links in the document network.

input: w, c, ξ, K, T, α, β.
output: θ_d, φ_w.
µ^1_{w,d}(k) ← initialization and normalization;
for t ← 1 to T do
    µ^{t+1}_{w,d}(k) ∝ [(1 - ξ)µ^t_{θ_d→z_{w,d}}(k) + ξµ^t_{η_c→z_{w,d}}(k)] × µ^t_{φ_w→z_{w,d}}(k);
end
θ_d(k) ← [µ_{.,d}(k) + α] / Σ_k[µ_{.,d}(k) + α];
φ_w(k) ← [µ_{w,.}(k) + β] / Σ_w[µ_{w,.}(k) + β];

Fig. 8. The synchronous BP for RTM.
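A minimal sketch (ours) of the message combination in Eqs. (29)-(30) for one topic label z_{w,d} is given below. It assumes the K×1 factor-to-variable messages from θ_d and φ_w of (15)-(16), a K×L matrix whose columns are µ_{.,d'}(k) for the L documents linked to d, and the K×K factor function of Eq. (28); the weight is written xi_weight to avoid clashing with the word counts xi in Fig. 4.

    function mu = rtm_combine(mu_theta, mu_phi, Mlink, F, xi_weight)
        % mu_theta, mu_phi : K*1 messages from theta_d and phi_w, Eqs. (15)-(16)
        % Mlink            : K*L matrix of aggregated messages mu_{.,d'}(k) for linked documents d'
        % F                : K*K factor function f_{eta_c}(k|k') of Eq. (28)
        % xi_weight        : the balancing weight xi in [0,1]
        mu_eta = F * sum(Mlink, 2);                                          % Eq. (29)
        mu = ((1 - xi_weight) * mu_theta + xi_weight * mu_eta) .* mu_phi;    % Eq. (30)
        mu = mu / sum(mu);                                                   % local normalization
    end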

5 EXPERIMENTS

We use four large-scale document data sets:
1) CORA [33] contains abstracts from the CORA research paper search engine in the machine learning area, where the documents can be classified into 7 major categories.
2) MEDL [34] contains abstracts from the MEDLINE biomedical paper search engine, where the documents fall broadly into 4 categories.
3) NIPS [35] includes papers from the conference "Neural Information Processing Systems", where all papers are grouped into 13 categories. NIPS has no citation link information.
4) BLOG [36] contains a collection of political blogs on the subject of American politics in the year 2008, where all blogs can be broadly classified into 6 categories. BLOG has no author information.
Table 2 summarizes the statistics of the four data sets, where D is the total number of documents, A is the total number of authors, W is the vocabulary size, C is the total number of links between documents, N_d is the average number of words per document, and W_d is the average vocabulary size per document.

TABLE 2
Summarization of four document data sets
Data sets   D      A      W       C      N_d    W_d
CORA        2410   2480   2961    8651   57     43
MEDL        2317   8906   8918    1168   104    66
NIPS        1740   2037   13649   -      1323   536
BLOG        5177   -      33574   1549   217    149

5.1 BP for LDA

We compare BP with two commonly-used LDA learning algorithms, VB [1] (here we use Blei's implementation of digamma functions)1 and GS [2]2, under the same fixed hyperparameters α = β = 0.01. We use MATLAB C/C++ MEX-implementations for all these algorithms, and carry out experiments on a common PC with CPU 2.4GHz and RAM 4G. With the goal of repeatability, we have made our source codes and data sets publicly available [37].

1. http://www.cs.princeton.edu/~blei/lda-c/index.html
2. http://psiexp.ss.uci.edu/research/programs data/toolbox.htm

To examine the convergence property of BP, we use the entire data set as the training set, and calculate the training perplexity [1] at every 10 iterations in the total of 1000 training iterations from the same initialization. Fig. 9 shows that the training perplexity of BP generally decreases rapidly as the number of training iterations increases. In our experiments, BP on average converges with the number of training iterations T ≈ 170, when the difference of training perplexity between two successive iterations is less than one. Although this paper does not theoretically prove that BP will definitely converge to the fixed point, the resemblance among VB, GS and BP in the subsection 2.5 implies that there should be a similar underlying principle that ensures BP to converge on general sparse word vector space in real-world applications. Further analysis reveals that BP on average uses a larger number of training iterations until convergence than VB (T ≈ 100) but a much smaller number of training iterations than GS (T ≈ 300) on the four data sets. The fast convergence rate is a desirable property as far as online [21] and distributed [38] topic modeling for large-scale corpora are concerned.

The predictive perplexity for the unseen test set is computed as follows [1], [7]. To ensure that all algorithms achieve the local optimum, we use 1000 training iterations to estimate φ on the training set from the same initialization. In practice, this number of training iterations is large enough for convergence of all algorithms in Fig. 9. We randomly partition each document in the test set into 90% and 10% subsets. We use 1000 iterations of the learning algorithms to estimate θ from the same initialization while holding φ fixed on the 90% subset, and then calculate the predictive perplexity on the remaining 10% subset,

P = exp{ - Σ_{w,d} x^{10%}_{w,d} log[Σ_k θ_d(k)φ_w(k)] / Σ_{w,d} x^{10%}_{w,d} },   (31)

where x^{10%}_{w,d} denotes the word counts in the 10% subset. Notice that the perplexity (31) is based on the marginal probability of the word topic label µ_{w,d}(k) in (21).
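For reference, Eq. (31) can be evaluated directly from the point estimates of Eqs. (12)-(13). The short sketch below is ours; Xtest is assumed to be the W×D sparse count matrix of the held-out 10% subset.

    % Predictive perplexity of Eq. (31) from the estimated theta (K*D) and phi (K*W).
    [wi, di, xi] = find(Xtest);                            % nonzero counts of the 10% subset
    loglik = log(sum(theta(:, di) .* phi(:, wi), 1));      % log sum_k theta_d(k) * phi_w(k)
    P = exp(-sum(xi' .* loglik) / sum(xi));                % Eq. (31)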

Fig. 9. Training perplexity as a function of number of iterations when K = 50 on CORA, MEDL, NIPS and BLOG.

Fig. 10. Predictive perplexity as a function of number of topics on CORA, MEDL, NIPS and BLOG.

Fig. 11. Training time as a function of number of topics on CORA, MEDL, NIPS and BLOG. For VB, it shows 0.3 times of the real learning time denoted by 0.3x.

Fig. 10 shows the predictive perplexity (average ± standard deviation) from five-fold cross-validation for different numbers of topics, where the lower perplexity indicates the better generalization ability for the unseen test set. Consistently, BP has the lowest perplexity for different topics on the four data sets, which confirms its effectiveness for learning LDA. On average, BP lowers perplexity by around 11% compared with VB and 6% compared with GS.

Fig. 11 shows that BP uses less training time than both VB and GS. We show only 0.3 times of the real training time of VB because of the time-consuming digamma functions. In fact, VB runs as fast as BP if we remove the digamma functions. So, we believe that it is the digamma functions that slow down VB in learning LDA. BP is faster than GS because it computes messages for word indices. The speed difference is largest on the NIPS set due to its largest ratio N_d/W_d = 2.47 in Table 2. Although VB converges rapidly attributed to digamma functions, it often consumes triple more training time. Therefore, BP on average enjoys the highest efficiency for learning LDA with regard to the balance of convergence rate and training time.

We also compare six BP implementations such as siBP, BP and CVB0 [7] using both synchronous and asynchronous update schedules. We name the three synchronous implementations s-BP, s-siBP and s-CVB0, and the three asynchronous implementations a-BP, a-siBP and a-CVB0. Because these six belief propagation implementations produce comparable perplexity, we show the relative perplexity that subtracts the mean value of the six implementations in Fig. 12. Overall, the asynchronous schedule gives slightly lower perplexity than the synchronous schedule because it passes messages faster and more efficiently. Except on the CORA set, siBP generally provides the highest perplexity because it introduces subtle biases in computing messages at each iteration. The biased messages will be propagated and accumulated, leading to inaccurate parameter estimation. Although the proposed BP achieves lower perplexity than CVB0 on the NIPS set, both of them work comparably well on the other sets. But BP is much faster because it computes messages over word indices. The comparable results also confirm our assumption that topic modeling can be efficiently performed on word indices instead of tokens.

To measure the interpretability of a topic model, the word intrusion and topic intrusion tasks are proposed to involve subjective judgements [39]. The basic idea is to ask volunteer subjects to identify the number of word intruders in each topic as well as the topic intruders in each document, where intruders are defined as inconsistent words or topics based on prior knowledge of subjects. Fig. 13 shows the top ten words of K = 10 topics inferred by the VB, GS and BP algorithms on the NIPS set. We find no obvious difference with respect to word intrusions in each topic. Most topics share similar top ten words but with different ranking orders. Despite significant perplexity differences, the topics extracted by the three algorithms remain almost the same interpretability at least for the top ten words. This result coincides with [39] in that lower perplexity may not enhance the interpretability of inferred topics.

A similar phenomenon has also been observed in MRF-based image labeling problems [20]. Different MRF inference algorithms such as graph cuts and BP often yield comparable results. Although one inference method may find more optimal MRF solutions, it does not necessarily translate into better performance compared to the ground-truth. The underlying hypothesis is that the ground-truth labeling configuration is often less optimal than solutions produced by inference algorithms.

Fig. 12. Relative predictive perplexity as a function of number of topics on CORA, MEDL, NIPS and BLOG.

Topic 1:
  VB: recognition image system images network training speech figure model set
  GS: image images figure object feature features recognition space visual distance
  BP: recognition image images training system speech set feature features word
Topic 2:
  VB: network networks units input neural learning hidden output training unit
  GS: network networks neural input units output learning hidden layer weights
  BP: network units networks input output neural hidden learning unit layer
Topic 3:
  VB: analog chip circuit figure neural input output time system network
  GS: analog circuit chip output figure signal input neural time system
  BP: analog neural circuit chip figure output input time system signal
Topic 4:
  VB: time model neurons neuron spike synaptic activity input firing information
  GS: model neurons cells neuron cell visual activity response input stimulus
  BP: neurons time model neuron synaptic spike cell activity input firing
Topic 5:
  VB: model visual figure cells motion direction input spatial field orientation
  GS: time noise dynamics order results point model system values figure
  BP: model visual figure motion field direction spatial cells image orientation
Topic 6:
  VB: function functions algorithm linear neural matrix learning space networks data
  GS: function functions number set algorithm theorem tree bound learning class
  BP: function functions algorithm set theorem linear number vector case space
Topic 7:
  VB: learning network error neural training networks time function weight model
  GS: function learning error algorithm training linear vector data set space
  BP: learning error neural network function weight training networks time gradient
Topic 8:
  VB: data training set error algorithm learning function class examples classification
  GS: training data set performance classification recognition test class error speech
  BP: training data set error learning performance test neural number classification
Topic 9:
  VB: model data models distribution probability parameters gaussian algorithm likelihood mixture
  GS: model data models distribution gaussian probability parameters likelihood mixture algorithm
  BP: model data distribution models gaussian algorithm probability parameters likelihood mixture
Topic 10:
  VB: learning state time control function policy reinforcement action algorithm optimal
  GS: learning state control time model policy action reinforcement system states
  BP: learning state time control policy function action reinforcement algorithm model

Fig. 13. Top ten words of K = 10 topics of VB (first line), GS (second line), and BP (third line) on NIPS.

For example, if we manually label the topics for a corpus, the final perplexity is often higher than that of solutions returned by VB, GS and BP. For each document, LDA provides the equal number of topics K but the ground-truth often uses an unequal number of topics to explain the observed words, which may be another reason why the overall perplexity of learned LDA is often lower than that of the ground-truth. To test this hypothesis, we compare the perplexity of labeled LDA (L-LDA) [40] with LDA in Fig. 14. L-LDA is a supervised LDA that restricts the hidden topics to the observed class labels of each document. When a document has multiple class labels, L-LDA automatically assigns one of the class labels to each word index. In this way, L-LDA resembles the process of manual topic labeling by humans, and its solution can be viewed as close to the ground-truth. For a fair comparison, we set the number of topics K = 7, 4, 13, 6 of LDA for CORA, MEDL, NIPS and BLOG according to the number of document categories in each set. Both L-LDA and LDA are trained by BP using 500 iterations from the same initialization. Fig. 14 confirms that L-LDA produces higher perplexity than LDA, which partly supports that the ground-truth often yields higher perplexity than the optimal solutions of LDA inferred by BP. The underlying reason may be that the three topic modeling rules encoded by LDA are still too simple to capture human behaviors in finding topics.

Under this situation, improving the formulation of topic models such as LDA is better than improving inference algorithms to enhance the topic modeling performance significantly. Although proper settings of hyperparameters can make the predictive perplexity comparable for all state-of-the-art approximate inference algorithms [7], we still advocate BP because it is faster and more accurate than both VB and GS, even if they all can provide comparable perplexity and interpretability under proper settings of hyperparameters.

Fig. 14. Perplexity of L-LDA and LDA on four data sets.

5.2 BP for ATM

The GS algorithm for learning ATM is implemented in the MATLAB topic modeling toolbox.3 We compare BP and GS for learning ATM based on 500 iterations on training data. Fig. 15 shows the predictive perplexity (average ± standard deviation) from five-fold cross-validation. On average, BP lowers perplexity by 12% compared with GS, which is consistent with Fig. 10. Another possible reason for such improvements may be our assumption that all coauthors of the document account for the word topic label using multinomial instead of uniform probabilities.

3. http://psiexp.ss.uci.edu/research/programs data/toolbox.htm

5.3 BP for RTM

The GS algorithm for learning RTM is implemented in the R package.4 We compare BP with GS for learning RTM using the same 500 iterations on the training data set. Based on the training perplexity, we manually set the weight ξ = 0.15 in Fig. 8 to achieve the overall superior performance on the four data sets.

4. http://cran.r-project.org/web/packages/lda/

Fig. 16 shows the predictive perplexity (average ± standard deviation) on five-fold cross-validation. On average, BP lowers perplexity by 6% compared with GS. Because the original RTM learned by GS is inflexible in balancing information from different sources, it has slightly higher perplexity than LDA (Fig. 10). To circumvent this problem, we introduce the weight ξ in (30) to balance the two types of messages, so that the learned RTM gains lower perplexity than LDA. Future work will estimate the balancing weight ξ based on feature selection or MRF structure learning techniques.

We also examine the link prediction performance of RTM. We define link prediction as a binary classification problem. As with [4], we use the Hadamard product of a pair of document topic proportions as the link feature, and train an SVM [41] to decide if there is a link between them. Notice that the original RTM [4] learned by the GS algorithm uses the GLM to predict links. Fig. 17 compares the F-measure (average ± standard deviation) of link prediction on five-fold cross-validation. Encouragingly, BP provides a significantly (15%) higher F-measure over GS on average. These results confirm the effectiveness of BP for capturing accurate topic structural dependencies in document networks.

6 CONCLUSIONS

First, this paper has presented the novel factor graph representation of LDA within the MRF framework. Not only does MRF solve topic modeling as a labeling problem, but it also facilitates BP algorithms for approximate inference and parameter estimation in three steps:
1) First, we absorb {w, d} as indices of factors, which connect hidden variables such as topic labels in the neighborhood system.
2) Second, we design the proper factor functions to encourage or penalize different local topic labeling configurations in the neighborhood system.
3) Third, we develop the approximate inference and parameter estimation algorithms within the message passing framework.
The BP algorithm is easy to implement, computationally efficient, and faster and more accurate than the other two approximate inference methods, VB [1] and GS [2], in several topic modeling tasks of broad interest. Furthermore, the superior performance of the BP algorithm for learning ATM [3] and RTM [4] confirms its potential effectiveness in learning other LDA extensions.

Second, as the main contribution of this paper, the proper definition of neighborhood systems as well as the design of factor functions can interpret the three-layer LDA by the two-layer MRF in the sense that they encode the same joint probability. Since probabilistic topic modeling is essentially a word annotation paradigm, the opened MRF perspective may inspire us to use other MRF-based image segmentation [22] or data clustering algorithms [17] for LDA-based topic models.

Finally, the scalability of BP is an important issue in our future work. As with VB and GS, the BP algorithm has a linear complexity with the number of documents D and the number of topics K. We may extend the proposed BP algorithm for online [21] and distributed [38] learning of LDA, where the former incrementally learns parts of the D documents in data streams and the latter learns parts of the D documents on distributed computing units. Since the K-tuple message is often sparse [42], we may also pass only salient parts of the K-tuple messages, or only update the informative parts of messages at each learning iteration, to speed up the whole message passing process.

Fig. 15. Predictive perplexity as a function of number of topics for ATM on CORA, MEDL and NIPS.

Fig. 16. Predictive perplexity as a function of number of topics for RTM on CORA, MEDL and BLOG.

Fig. 17. F-measure of link prediction as a function of number of topics on CORA, MEDL and BLOG.

ACKNOWLEDGEMENTS

This work is supported by NSFC (Grant No. 61003154), the Shanghai Key Laboratory of Intelligent Information Processing, China (Grant No. IIPL-2010-009), and a grant from Baidu.

REFERENCES

[1] D. M. Blei, A. Y. Ng, and M. I. Jordan, "Latent Dirichlet allocation," J. Mach. Learn. Res., vol. 3, pp. 993-1022, 2003.
[2] T. L. Griffiths and M. Steyvers, "Finding scientific topics," Proc. Natl. Acad. Sci., vol. 101, pp. 5228-5235, 2004.
[3] M. Rosen-Zvi, T. Griffiths, M. Steyvers, and P. Smyth, "The author-topic model for authors and documents," in UAI, 2004, pp. 487-494.
[4] J. Chang and D. M. Blei, "Hierarchical relational models for document networks," Annals of Applied Statistics, vol. 4, no. 1, pp. 124-150, 2010.
[5] T. Minka and J. Lafferty, "Expectation-propagation for the generative aspect model," in UAI, 2002, pp. 352-359.
[6] Y. W. Teh, D. Newman, and M. Welling, "A collapsed variational Bayesian inference algorithm for latent Dirichlet allocation," in NIPS, 2007, pp. 1353-1360.
[7] A. Asuncion, M. Welling, P. Smyth, and Y. W. Teh, "On smoothing and inference for topic models," in UAI, 2009, pp. 27-34.
[8] I. Pruteanu-Malinici, L. Ren, J. Paisley, E. Wang, and L. Carin, "Hierarchical Bayesian modeling of topics in time-stamped documents," IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 6, pp. 996-1011, 2010.
[9] X. G. Wang, X. X. Ma, and W. E. L. Grimson, "Unsupervised activity perception in crowded and complicated scenes using hierarchical Bayesian models," IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 3, pp. 539-555, 2009.
[10] F. R. Kschischang, B. J. Frey, and H.-A. Loeliger, "Factor graphs and the sum-product algorithm," IEEE Transactions on Inform. Theory, vol. 47, no. 2, pp. 498-519, 2001.
[11] C. M. Bishop, Pattern Recognition and Machine Learning. Springer, 2006.

[12] J. Zeng and Z.-Q. Liu, "Markov random field-based statistical character structure modeling for handwritten Chinese character recognition," IEEE Trans. Pattern Anal. Mach. Intell., vol. 30, no. 5, pp. 767-780, 2008.
[13] S. Z. Li, Markov Random Field Modeling in Image Analysis. New York: Springer-Verlag, 2001.
[14] G. Heinrich, "Parameter estimation for text analysis," University of Leipzig, Tech. Rep., 2008.
[15] M. Welling, M. Rosen-Zvi, and G. Hinton, "Exponential family harmoniums with an application to information retrieval," in NIPS, 2004, pp. 1481-1488.
[16] T. Hofmann, "Unsupervised learning by probabilistic latent semantic analysis," Machine Learning, vol. 42, pp. 177-196, 2001.
[17] B. J. Frey and D. Dueck, "Clustering by passing messages between data points," Science, vol. 315, no. 5814, pp. 972-976, 2007.
[18] J. M. Hammersley and P. Clifford, "Markov field on finite graphs and lattices," unpublished, 1971.
[19] A. U. Asuncion, "Approximate mean field for Dirichlet-based models," in ICML Workshop on Topic Models, 2010.
[20] M. F. Tappen and W. T. Freeman, "Comparison of graph cuts with belief propagation for stereo, using identical MRF parameters," in ICCV, 2003, pp. 900-907.
[21] M. Hoffman, D. Blei, and F. Bach, "Online learning for latent Dirichlet allocation," in NIPS, 2010, pp. 856-864.
[22] J. Shi and J. Malik, "Normalized cuts and image segmentation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 22, no. 8, pp. 888-905, 2000.
[23] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," Journal of the Royal Statistical Society, Series B, vol. 39, pp. 1-38, 1977.
[24] J. Zeng, L. Xie, and Z.-Q. Liu, "Type-2 fuzzy Gaussian mixture models," Pattern Recognition, vol. 41, no. 12, pp. 3636-3643, 2008.
[25] J. Zeng and Z.-Q. Liu, "Type-2 fuzzy hidden Markov models and their application to speech recognition," IEEE Trans. Fuzzy Syst., vol. 14, no. 3, pp. 454-467, 2006.
[26] J. Zeng and Z.-Q. Liu, "Type-2 fuzzy Markov random fields and their application to handwritten Chinese character recognition," IEEE Trans. Fuzzy Syst., vol. 16, no. 3, pp. 747-760, 2008.
[27] J. Winn and C. M. Bishop, "Variational message passing," J. Mach. Learn. Res., vol. 6, pp. 661-694, 2005.
[28] W. L. Buntine, "Variational extensions to EM and multinomial PCA," in ECML, 2002, pp. 23-34.
[29] D. D. Lee and H. S. Seung, "Learning the parts of objects by non-negative matrix factorization," Nature, vol. 401, pp. 788-791, 1999.
[30] J. Zeng, W. K. Cheung, C.-H. Li, and J. Liu, "Coauthor network topic models with application to expert finding," in IEEE/WIC/ACM WI-IAT, 2010, pp. 366-373.
[31] J. Zeng, W. K.-W. Cheung, C.-H. Li, and J. Liu, "Multirelational topic models," in ICDM, 2009, pp. 1070-1075.
[32] J. Zeng, W. Feng, W. K. Cheung, and C.-H. Li, "Higher-order Markov tag-topic models for tagged documents and images," arXiv:1109.5370v1 [cs.CV], 2011.
[33] A. K. McCallum, K. Nigam, J. Rennie, and K. Seymore, "Automating the construction of internet portals with machine learning," Information Retrieval, vol. 3, no. 2, pp. 127-163, 2000.
[34] S. Zhu, J. Zeng, and H. Mamitsuka, "Enhancing MEDLINE document clustering by incorporating MeSH semantic similarity," Bioinformatics, vol. 25, no. 15, pp. 1944-1951, 2009.
[35] A. Globerson, G. Chechik, F. Pereira, and N. Tishby, "Euclidean embedding of co-occurrence data," J. Mach. Learn. Res., vol. 8, pp. 2265-2295, 2007.
[36] J. Eisenstein and E. Xing, "The CMU 2008 political blog corpus," Carnegie Mellon University, Tech. Rep., 2010.
[37] J. Zeng, "TMBP: A topic modeling toolbox using belief propagation," J. Mach. Learn. Res., p. arXiv:1201.0838v1 [cs.LG], 2012.
[38] K. Zhai, J. Boyd-Graber, and N. Asadi, "Using variational inference and MapReduce to scale topic modeling," arXiv:1107.3765v1 [cs.AI], 2011.
[39] J. Chang, J. Boyd-Graber, S. Gerris, C. Wang, and D. Blei, "Reading tea leaves: How humans interpret topic models," in NIPS, 2009, pp. 288-296.
[40] D. Ramage, D. Hall, R. Nallapati, and C. D. Manning, "Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora," in Empirical Methods in Natural Language Processing, 2009, pp. 248-256.
[41] C.-C. Chang and C.-J. Lin, "LIBSVM: A library for support vector machines," ACM Transactions on Intelligent Systems and Technology, vol. 2, pp. 27:1-27:27, 2011.
[42] I. Porteous, D. Newman, A. Ihler, A. Asuncion, P. Smyth, and M. Welling, "Fast collapsed Gibbs sampling for latent Dirichlet allocation," in KDD, 2009, pp. 569-577.