Nested Chinese Restaurant Franchise Processes: Applications to User Tracking and Document Modeling
Amr Ahmed, Research @ Google
Liangjie Hong, Yahoo! Labs
Alexander J. Smola, Carnegie Mellon University

Proceedings of the 30th International Conference on Machine Learning, Atlanta, Georgia, USA, 2013. JMLR: W&CP volume 28. Copyright 2013 by the author(s).

Abstract

Much natural data is hierarchical in nature. Moreover, this hierarchy is often shared between different instances. We introduce the nested Chinese Restaurant Franchise Process to obtain both hierarchical tree-structured representations for objects, akin to (but more general than) the nested Chinese Restaurant Process, while sharing their structure akin to the Hierarchical Dirichlet Process. Moreover, by decoupling the structure generating part of the process from the components responsible for the observations, we are able to apply the same statistical approach to a variety of user generated data. In particular, we model the joint distribution of microblogs and locations for Twitter users. This leads to a 40% reduction in location uncertainty relative to the best previously published results. Moreover, we model documents from the NIPS papers dataset, obtaining excellent perplexity relative to (hierarchical) Pachinko allocation and LDA.

1. Introduction

Micro-blogging services such as Twitter, Tumblr and Weibo have become important tools for online users to share breaking news, interesting stories, and rich media content. Many such services now provide location services. That is, the messages have both textual and spatiotemporal information, thus the need to design models that account for both modalities jointly. It is reasonable to assume that there exists some degree of correlation between content and location and, moreover, that this distribution is user-specific.

Likewise, longer documents contain a mix of topics, albeit not necessarily at the same level of differentiation between different topics. That is, some documents might address, amongst other aspects, the issue of computer science, albeit one of them in entire generality while another one might delve into very specific aspects of machine learning. As with microblogs, such data requires a hierarchical model that shares structure between documents and where the objects of interest themselves exhibit a rich structural representation (e.g. a distribution over a tree).

Such problems have attracted a fair amount of attention. For instance, Mei et al. (2006), Wang et al. (2007), Eisenstein et al. (2010), Cho et al. (2011) and Cheng et al. (2011) all address the issue of modeling location and content in microblogs. Moreover, it has recently come to our attention (by private communication) that Paisley et al. (2012) independently proposed a model similar to the nested Chinese Restaurant Franchise Process (nCRF) of this paper. The main differences are the inference algorithm (variational rather than collapsed sampling), the range of applications (documents rather than spatiotemporal data), and our parameter cascades over the tree.

Most related regarding microblogs is the work of Eisenstein et al. (2010; 2011) and Hong et al. (2012). They take regional language variations and global topics into account by bridging finite mixture Gaussian models and topic models. These models usually employ a flat clustering model of locations. This flat structure is unnatural in terms of the language model: while it is reasonable to assume that New Yorkers and San Franciscans might differ in terms of the content of their tweets, it is also reasonable to assume that American tweets, as a whole, are more similar to each other than to tweets from Egypt, China or Germany. As a side effect, location prediction is not always satisfactory.

Key Contributions: We introduce a model that combines the advantages of the Hierarchical Dirichlet Process (HDP) of Teh et al. (2006) and the nested Chinese Restaurant Process (nCRP) of Blei et al. (2010) into a joint statistical model that allows each object to be represented as a mixture of paths over a tree. This extends the hierarchical clustering approach of Adams et al. (2010) and also includes aspects of Pachinko allocation (Mimno et al., 2007) as special cases. The model decouples the task of modeling hierarchical structure from that of modeling observations.

Moreover, we demonstrate in two applications, microblogs and NIPS documents, that the model is able to scale well and that it can provide highly accurate estimates in both cases. That is, for microblogs we obtain a location inference algorithm with significantly improved accuracy relative to the best previously published results. Moreover, for documents, we observe significant gains in perplexity.

2. Background

Bayesian nonparametrics is rich in structured and hierarchical models. Nonetheless, we found that no previously proposed structure is a good fit for the following key problem when modeling tweets: we want to model each user's tweets in a hierarchical tree-like structure akin to the one described by the nested Chinese Restaurant Process or the hierarchical Pachinko Allocation Model. At the same time we want to ensure that we have sharing of statistical strength between different users' activities by sharing the hierarchical structure and the associated emissions model. For the purpose of clarity we briefly review the two main constituents of our model: HDP and nCRP.

Franchises and Hierarchies: A key ingredient for building hierarchical models is the Hierarchical Dirichlet Process (HDP). It is obtained by coupling draws from a Dirichlet process by having the reference measure itself arise from a Dirichlet process (Teh et al., 2006). In other words, rather than drawing the distribution $G$ from a Dirichlet process via $G \sim \mathrm{DP}(H, \gamma)$ (for the underlying measure $H$ and concentration parameter $\gamma$) we now have

$$G_i \sim \mathrm{DP}(G_0, \gamma') \quad \text{and} \quad G_0 \sim \mathrm{DP}(H, \gamma). \qquad (1)$$

Here $\gamma$ and $\gamma'$ are appropriate concentration parameters. This means that we first draw atoms from $H$ to obtain $G_0$. This is then, in turn, used as reference measure to obtain the measures $G_i$. They are discrete and share, by construction, atoms via $G_0$.

The Hierarchical Dirichlet Process is widely used in applications where different groups of data points would share the same settings of partitions, such as (Teh et al., 2006; Beal et al., 2002). In the context of document modeling the HDP is used to model each document as a DP while sharing the set of atoms (mixtures or topics) across all documents. This is precisely what we also want when assessing distributions over trees: we want to ensure that the (partial) trees attached to each user share attributes among all users.

Integrating out all random measures, we arrive at what is known as the Chinese Restaurant Franchise (CRF). In this metaphor each restaurant maintains its own set of tables but shares the same set of mixtures. A customer at restaurant $k$ can choose to sit at an existing table with probability proportional to the number of customers sitting at this table, or start a new table with probability proportional to $\alpha$ and choose its dish from a global distribution.

The Nested Chinese Restaurant Process: CRPs and CRFs allow objects, such as documents, to be generated from a single mixture (topic). However, they do not provide a relationship between topics. One option to address this issue is to introduce a tree-wise dependency. This was proposed in the nested Chinese Restaurant Process (nCRP) by Blei et al. (2010). It defines an infinite hierarchy, both in terms of width and depth. In the nCRP, a set of topics (mixtures) is arranged over a tree-like structure whose semantics are such that parent topics are more general than the topics represented by their children. A document in this process is defined as a path over the tree, and it is generated from the topics along that path using an LDA-like model. In particular, each node in the tree defines a Chinese Restaurant Process over its children. Thus a path is defined by the set of decisions taken at each node. While this provides more expressive modeling, it still only allows each document to have a single path over the tree, a limitation we overcome below.

3. Nested Chinese Restaurant Franchise

We now introduce the Nested Chinese Restaurant Franchise (nCRF). As its name suggests, it borrows both from the Chinese Restaurant Franchise, allowing us to share strength between groups, and the Nested Chinese Restaurant Process, providing us with a hierarchical distribution over observations.

For clarity of exposition and concepts we will distinguish between the structure generating nCRF and the process generating observations from a hierarchical generative model, once the structure variable has been determined. This is beneficial since the observation space can be rather vast and structured.

Figure 1. (Panels: global process, User 1 process, User 2 process.) The nested Chinese Restaurant Franchise involving a common tree over components (left) and two subtrees representing processes for two separate subgroups (e.g. users). Each user samples from his own distribution over topics, smoothed by the global process. Thus each user process represents a nested Chinese Restaurant Process. All of them are combined into a common franchise.

3.1. Basic Idea

The nested Chinese Restaurant Process (nCRP) (Blei et al., 2010) provides a convenient way to impose a distribution over tree-like structures. However, it lacks a mechanism for 'personalizing' them to different partitions and restricts each object to only select a single path over the tree.