UCLA Electronic Theses and Dissertations

Title Probabilistic Topic Models for Graph Mining

Permalink https://escholarship.org/uc/item/7ss082g1

Author Cha, Young Chul

Publication Date 2014

Peer reviewed|Thesis/dissertation

UNIVERSITY OF CALIFORNIA

Los Angeles

Probabilistic Topic Models for Graph Mining

A dissertation submitted in partial satisfaction

of the requirements for the degree

Doctor of Philosophy in Computer Science

by

Young Chul Cha

2014

© Copyright by

Young Chul Cha

2014

ABSTRACT OF THE DISSERTATION

Probabilistic Topic Models for Graph Mining

by

Young Chul Cha

Doctor of Philosophy in Computer Science

University of California, Los Angeles, 2014

Professor Junghoo Cho, Chair

In this research, we extend probabilistic topic models, originally developed for textual corpus analysis, to analyze more general graphs. In particular, we extend them to effectively handle: (1) a bias caused by a limited number of frequent nodes ("popularity bias"), and (2) complex graphs having more than two entity types.

For the popularity bias problem, we propose LDA extensions and new topic models that explicitly model the popularity of a node with a "popularity component". In extensive experiments with a real-world Twitter dataset, our approaches achieve significantly lower perplexity (i.e., better prediction power) and improved human-perceived clustering quality compared to LDA.

To analyze more complex graphs, we propose a novel universal topic framework that takes an "incremental" approach of breaking a complex graph into smaller units, learning the topic group of each entity from the smaller units, and then "propagating" the learned topics to others. In a DBLP prediction problem, our approach achieves the best performance among many state-of-the-art methods. We also demonstrate the significant potential of our approach with search logs from a commercial search engine.

The dissertation of Young Chul Cha is approved.

Carlo Zaniolo

D. Stott Parker

Gregory H. Leazer

Junghoo Cho, Committee Chair

University of California, Los Angeles

2014

To my family . . .

TABLE OF CONTENTS

1 Introduction
1.1 Challenges in Graph Mining Using Probabilistic Topic Models
1.2 Organization of Dissertation

2 Preliminaries
2.1 Probabilistic Topic Models
2.2 Heterogeneous Information Networks

3 Handling a Popularity Bias in Topic Models
3.1 Introduction
3.2 Applying Topic Models to Social Graphs
3.2.1 Follow-Edge Generative Model
3.2.2 Popularity Bias
3.3 Handling a Popularity Bias
3.3.1 Existing LDA Extensions
3.3.2 Procedural Variations of LDA
3.3.3 Popularity-Aware Topic Models
3.4 Experiments
3.4.1 Dataset and Experimental Settings
3.4.2 Prediction Performance Analysis
3.4.3 Example Topic Groups
3.4.4 Grouping Quality Analysis
3.5 Conclusion

4 Complex-Graph Analysis Using Topic Models
4.1 Introduction
4.2 Universal Topic Framework
4.2.1 Edge Generative Model
4.2.2 Incorporating Learned Topics
4.2.3 Issues in Topic Incorporation
4.3 Experiments and Analyses
4.3.1 DBLP Experiment
4.3.2 Online Search Experiments
4.4 Conclusion

5 Related Work

6 Conclusions and Future Work

References

LIST OF FIGURES

3.1 Bipartite graph representation of models
3.2 Topic group on cycling showing the popularity bias
3.3 Topic hierarchy and documents generated in the hierarchy
3.4 Two-step labeling approach
3.5 Example of the threshold noise filtering process
3.6 LDA and proposed topic models
3.7 Distributions of incoming and outgoing edges
3.8 Perplexity comparison
3.9 Sample topic groups I
3.10 Sample topic groups II
3.11 Quality comparison
4.1 Examples of HINs
4.2 Example of edge labeling
4.3 Proposed topic models
4.4 Structures of two types of search logs
4.5 Examples of topic incorporation orders
4.6 Performance Analysis
4.7 Topic Granularity Analysis
4.8 Incorporation of C-Log (the last bars show results using C-Log)

LIST OF TABLES

3.1 Symbols used throughout this chapter and their meanings
3.2 Statistics of our Twitter dataset
3.3 Experimental cases and descriptions
4.1 Statistics of the datasets
4.2 Prediction accuracy on the DBLP dataset (except ours, all other results are from [16, 12])
4.3 Venue clusters
4.4 Example topic cluster without UniZ (topic number 43)
4.5 Example topic cluster from UniZ

ACKNOWLEDGMENTS

First, I would like to thank my wonderful advisor, Junghoo "John" Cho. If it had not been for his help and advice, I would not have finished my Ph.D. study. When I came back to school in 2009 after working for 10 years in the telecom industry, I had quite a hard time catching up with all the coursework required for my new major, data mining. He was very patient with my slow progress and never rushed me. Whenever I was struggling with a hard problem, he provided me with insightful advice rather than a solution, and motivated me to learn while seeking the solution. I remember his I-know-it-all smile when I made a great fuss about my small finding after following his advice. Although I was very busy with my schoolwork during my first two years, I was happy because I could feel I was getting better and better as a researcher. I really appreciate his help and great advice. I only hope that I will be able to practice half as much as what I learned from him when I begin to work with mentees of my own.

Second, I thank my lovely wife, Gook Hee Lee, for her great support. I know that it was not an easy decision for her to quit her promising job in Korea and live as the wife of a graduate student in a foreign country. However, she went through all the hard years without many complaints. She was very supportive and willingly adjusted her schedule to mine. I especially thank her for her effort to make me healthier when I had a health issue. She searched for good ingredients and spent time learning how to make delicious food with them. I also thank her for her great culinary skills. Without them, I would have had a hard time eating the large amount of food she cooked for me.

Special thanks to Carlo Zaniolo, Stott Parker, and Gregory Leazer for serving on my committee. Their insightful comments made my thesis more solid, and I learned a lot from them. In particular, Greg gave me very detailed comments on various aspects of my thesis. I am also indebted to Carlo, Stott, and John for their great classes. I built solid data mining knowledge through their wonderful classes and projects.

I would also like to thank our group members, including Michael Welch, Uri Schonfeld, Chu-Cheng Hsieh, Bin Bi, Yuchen Liu, Dong Wang, Zijun Xue, Christopher Moghbel, and Giljoo Na. Michael and Uri gave me good advice when I first joined the group, and Chu-Cheng always magically solved my computer and network issues. Bin was so smart and helped me a lot with developing topic models, and I had many interesting academic talks with Yuchen. Dong and Zijun were always nice to me whenever I needed their help. Chris proofread most of my papers and provided sharp comments. Giljoo gave me good life advice, and his wife frequently invited me to wonderful dinners when I lived alone during my first year.

I was very lucky to have the chance to work with Keng-hao Chang, Hari Bommaganti, Ye Chen, Jian Yuan, and Tak Yan during my internships at Microsoft. They provided great ideas and a stimulating environment during the summers. I was also able to practice what I learned at school on large real-world datasets.

Finally, I would like to thank my parents and brothers for their unwavering love and support. They always firmly believe in me and respect my decisions. Whenever I am weary and in doubt, their love and belief cheer me up and give me strength. I am grateful for their selfless support and hope they are always healthy and happy. I deeply thank God, who takes care of my family and leads me along the right way.

VITA

1998 B.S. (Computer Science), Yonsei University, Korea.

2000 M.S. (Computer Science), Yonsei University, Korea.

2000–2014 Researcher & Manager, KT, Korea.

PUBLICATIONS

Y. Cha, B. Bi, and J. Cho, Handling a Popularity Bias in Probabilistic Topic Models, under review

Y. Cha, B. Bi, J. Cho, K. Chang, H. Bommaganti, Y. Chen, and T. Yan, A Universal Topic Framework (UniZ) and Its Application in Online Search, under review

Y. Cha, J. Cho, J. Yuan, and T. Yan, Exploration of the Effects of Category Match Score in Search Advertising, ICDE, Apr. 2014.

Y. Cha, B. Bi, C. Hsieh, and J. Cho, Incorporating Popularity in Topic Models for Social Network Analysis, SIGIR, best paper award nominee, Jul.-Aug. 2013.

Y. Cha, and J. Cho, Social Network Analysis Using Topic Models, SIGIR, Aug. 2012.

CHAPTER 1

Introduction

1.1 Challenges in Graph Mining Using Probabilistic Topic Models

Recently, topic models have been widely used for textual corpus analysis due to their high-quality analysis results. They enable machines to identify hidden topics (meanings or concepts) behind words and to group the words based on the identified hidden topics. In this way, they can effectively handle synonyms (e.g., car and automobile). They are also good at detecting multiple meanings of a word (e.g., bank related to money and bank related to water) by identifying the topic of each occurrence of the word based on its context. As synonyms and polysemous words are common in human languages, grouping them accurately based on their meanings is very important in natural language processing. In particular, probabilistic topic models are equipped with probabilistic frameworks; the results inferred by these models consist of probability distributions and are easily interpretable.

In this dissertation, we extend these probabilistic topic models so that they can be applied to analyze a more general graph beyond a textual corpus, which can be represented as a simple bipartite graph with documents on the left side, words on the right side, and directed edges connecting them (when a document contains a word). In particular, we address two major challenges in graph mining using probabilistic topic models: (1) handling a bias caused by a limited number of very frequent nodes, and (2) analyzing a complex graph beyond a simple bipartite graph. The former arises due to a "statistical difference" between a textual corpus and a general graph, and the latter is caused by a "structural limitation" of current topic models.

• Handling a bias caused by a limited number of very frequent nodes: topic models are built on the assumption that every word (and topic) is roughly equally likely in a corpus. Thus, we remove stop words (e.g., the and is) before a topic model analysis because they are far more frequent than other words (i.e., removing them prevents the word distribution from following a power-law distribution). Fortunately, these stop words do not carry much meaning, and this simple removal increases the quality of the analysis result. However, when we extend probabilistic topic models to analyze more general bipartite graphs, such as a purchase graph between users and products or a social graph between following and followed users, this simple removal strategy no longer works, because many users are interested in these frequent nodes (e.g., popular products and popular figures). If we simply do not remove them (i.e., let the node distribution remain a power-law distribution), these frequent nodes appear in every topic group in the analysis result, even when they are not relevant to it, and make the result severely "biased" toward them. Thus, we need effective grouping methods free from the bias caused by these frequent nodes.

• Analyzing a complex graph beyond a simple bipartite graph: topic model analysis is usually limited to a few entity types and the relationships among them. More precisely, a textual corpus consists of only two types of entities (documents and words) and one type of relationship (contains). However, a graph from a real-world application usually consists of multiple types of entities and relationships. For example, a simple web graph from an online search service consists of web users, queries they issue, web pages they visit, and words inside the web pages. Also, a web page has links to other pages. (Note that this web graph is still very simple compared to one generated from a real-world service.) Although some complex topic models [40, 12, 51, 32, 36, 44, 45, 26, 22, 35, 11, 17] have been proposed to analyze more complex graphs than a simple bipartite graph, they still cannot handle this simple example web graph. Even if there were a topic model that could handle it, its analysis would be intractable.

By effectively solving the challenges explained above, we can greatly expand the application area of topic model analysis and improve its quality. More specifically, we expect to achieve the following.

• Expansion of the application area: we can apply topic models to a general graph beyond a simple bipartite graph consisting of documents and words. Given that there are a tremendous number of user-generated graphs (e.g., visit logs, purchase logs, click logs, social graphs, and so on) and these graphs have very frequent (popular) nodes, our research can greatly expand the application area of topic model analysis.

• Better clustering/prediction quality: we can achieve better topic clusters by removing the bias caused by a small number of very frequent nodes. Since we can predict (recommend) relevant nodes (items) for a given user based on the unbiased clusters, the prediction quality can be improved as well. Also, because a complex graph contains more information than a simple graph (a segment of the complex graph), if we can effectively leverage the additional information, we can achieve better clustering and prediction quality.

1.2 Organization of Dissertation

In this dissertation, we propose effective solutions to the above challenges. To that end, the rest of this dissertation is organized as follows.

• Chapter 2. Preliminaries We briefly explain background concepts to help readers better understand our work. Since we propose extensions to probabilistic topic models, we first introduce the concept of probabilistic topic models with the two most popular ones: Probabilistic Latent Semantic Analysis (PLSA) [25] and Latent Dirichlet Allocation (LDA) [5]. Then, we introduce a clear definition of a Heterogeneous Information Network (HIN) [42], which we handle in Chapter 4.

• Chapter 3. Handling a Popularity Bias in Topic Models We address the "popularity bias" caused by a limited number of frequent (popular) nodes. We first show this problem with LDA and propose LDA extensions to effectively solve it. Then, we propose new topic models that explicitly model the popularity of a node with a "popularity component". We evaluate the effectiveness of our models on a real-world Twitter social graph. Our proposed approaches achieve significantly lower perplexity (i.e., better prediction power) and improved human-perceived clustering quality compared to LDA.

• Chapter 4: Complex-Graph Analysis Using Topic Models To analyze a complex graph, we propose a novel universal topic framework, which breaks the complex graph into much smaller units, learns the topic group of each entity from the smaller units, and then "propagates" the learned topics to others. In this way, it leverages all the available signals without introducing significant computational complexity and enables a richer representation of entities and highly accurate analysis results. It also represents heterogeneous entities in a "single universal topic space" so that all entities can be directly compared within the "same" topic space. In a widely-used DBLP dataset prediction problem, our approach achieves the best prediction performance among many state-of-the-art methods. We also demonstrate the significant potential of our approach with search logs from a commercial search engine.

• Chapter 5: Related Work We briefly review topic models related to our work in three categories: topic models for (1) authorship, (2) hypertexts, and (3) edges. The models in the first category were proposed to analyze documents (texts) together with their authors. As these models incorporate authors and their relationships, they can be viewed as early forms of social network topic models. The topic models in the second category are more closely related to social network analysis and analyze documents with their citations (i.e., hypertexts). The models in the last category use only linkage (edge) information. Since they focus only on the graph structure, they can easily be applied to a variety of datasets.

• Chapter 6: Conclusions and Future Work We conclude the dissertation by summarizing what we have found through this work. We also describe some future work items.

CHAPTER 2

Preliminaries

In this chapter, we briefly explain background concepts to help readers better understand our work. Since we propose extensions to probabilistic topic models, we first introduce the concept of probabilistic topic models with the two most popular ones: Probabilistic Latent Semantic Analysis (PLSA) [25] and Latent Dirichlet Allocation (LDA) [5]. Then, we provide a clear definition of a Heterogeneous Information Network (HIN) [42], which we handle in Chapter 4.

2.1 Probabilistic Topic Models

Topic models are built on the assumption that there is a latent variable behind each observation in a dataset. In the case of a document corpus, the usual assumption is that there is a hidden topic behind each word. PLSA [25] introduced a “probabilistic” generative model to Latent Semantic Indexing (LSI) [15], one of the most popular topic models. Equation (2.1) represents its document generative model:

$$p(d, w) = p(d)\,p(w|d) = p(d) \sum_{z \in Z} p(z|d)\,p(w|z), \qquad (2.1)$$

where p(d, w) denotes the probability of observing a word w in a document d and can be decomposed into two parts: p(d), the probability distribution of documents, and p(w|d), the probability distribution of words given a document. This equation describes a word selection process for a document, where an author first selects a document and then a word in that document. By repeating this selection process sufficiently, we can generate a full document and eventually a whole document corpus. Based on the assumption that there is a latent topic z for each word w, the above equation can be rewritten as a product of p(w|z), the probability distribution of words given a topic, and p(z|d), the probability distribution of topics given a document. By doing so, we have added an additional topic selection step to the original word selection process for a given document. Summation across all independent topics Z accounts for the multiple topics that can generate a word w.
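To make the generative story concrete, the following is a minimal sketch of the document generative process behind Equation (2.1), using toy distributions p(d), p(z|d), and p(w|z); all names and numbers here are illustrative, not from the dissertation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy corpus: 3 documents, 2 topics, 5 words.
p_d = np.array([0.5, 0.3, 0.2])                       # p(d)
p_z_given_d = np.array([[0.9, 0.1],                   # p(z|d), one row per document
                        [0.5, 0.5],
                        [0.1, 0.9]])
p_w_given_z = np.array([[0.4, 0.3, 0.2, 0.05, 0.05],  # p(w|z), one row per topic
                        [0.05, 0.05, 0.2, 0.3, 0.4]])

def generate(n_tokens):
    """Sample (document, word) pairs following the process in Equation (2.1)."""
    pairs = []
    for _ in range(n_tokens):
        d = rng.choice(len(p_d), p=p_d)                          # select a document
        z = rng.choice(p_z_given_d.shape[1], p=p_z_given_d[d])   # select a hidden topic
        w = rng.choice(p_w_given_z.shape[1], p=p_w_given_z[z])   # select a word
        pairs.append((d, w))  # the topic z stays latent in the observation
    return pairs

print(generate(10))
```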

The goal of the topic model analysis is to accurately infer p(w|z) and p(z|d). Given the probabilistic generative model explained above, we can effectively infer p(w|z) and p(z|d) by maximizing the log-likelihood function L of observing the entire corpus as in Equation (2.2):

$$\mathcal{L} = \log \prod_{d \in D} \prod_{w \in W} p(d, w)^{n(d,w)} = \sum_{d \in D} \sum_{w \in W} n(d, w) \log p(d, w), \qquad (2.2)$$

where n(d, w) denotes the frequency of the word w in the document d. The inferred p(w|z) and p(z|d) measure the strength of association between a word w and a topic z and between a topic z and a document d, respectively. For example, if $p(w_{vehicle}|z_{car}) > p(w_{technology}|z_{car}) > 0$, the word vehicle is more closely related to the topic car than the word technology, though both are related to the topic car. In this way, PLSA and other probabilistic topic models support multiple memberships and produce more reasonable clustering results than traditional clustering methods such as K-means [31], where each entity can belong to only one topic group.
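As a small illustration, the log-likelihood in Equation (2.2) can be computed directly from a document-word count matrix; a minimal sketch, reusing the toy arrays from the previous snippet:

```python
def log_likelihood(n_dw):
    """Compute L from Equation (2.2) given a |D| x |W| count matrix n(d, w)."""
    # p(d, w) = p(d) * sum_z p(z|d) p(w|z), per Equation (2.1)
    p_dw = p_d[:, None] * (p_z_given_d @ p_w_given_z)
    return np.sum(n_dw * np.log(p_dw))
```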

Although PLSA introduced a sound probabilistic generative model, it tends to overfit the result to the observed data. To address this overfitting problem, LDA [5] introduced Dirichlet priors α and β to PLSA, to constrain p(z|d) and p(w|z), respectively. The α is a vector of dimension |Z|, the number of topics, and each element in α is a prior for the corresponding element in p(z|d). Thus, a higher α implies more frequent prior observations of topic z in a corpus. Similarly, β is a vector of dimension |W|, the number of words, and each element in β is a prior for the corresponding element in p(w|z). The Dirichlet priors α and β on the multinomial distributions p(z|d) and p(w|z) smooth the topic models and keep them safe from PLSA's overfitting problem. As a conjugate prior for the multinomial distribution, the Dirichlet distribution also simplifies statistical inference and enables the use of collapsed Gibbs sampling [39]. It is also known that PLSA can be considered a special case of LDA without Dirichlet priors [19, 24].

2.2 Heterogeneous Information Networks

In most research on network analysis, networks are assumed to be homogeneous, i.e., to have only one type of entity (vertex) and one type of relationship (edge). For example, PageRank [6] was developed over the network consisting of web pages and the links connecting them. However, in many real-world applications, there are different types of entities and different kinds of relationships among them. Figure 4.1(b) depicts a simple Heterogeneous Information Network (HIN) of bibliographic information from DBLP consisting of four types of entities (papers, terms, authors, and venues) and three types of relationships (contains, writes, and publishes). Figure 4.1(c) depicts a more complex HIN of a web graph having five types of entities and eight types of relationships. Though it looks quite complicated, this HIN is still much simpler than a real-world case, where a user may have edges to content items she consumed, products she purchased, places she visited, and so on. Sun et al. [42] defined a HIN as an information network having more than one type of entity or relationship. According to this definition, PLSA and LDA are algorithms for a very simple bipartite HIN having two types of entities (documents and terms) and one type of relationship (contains), as in Figure 3.1.

CHAPTER 3

Handling a Popularity Bias in Topic Models

In this chapter, we address a bias caused by a limited number of very popular nodes, using a social graph from Twitter.1 We first justify our approach of applying probabilistic topic models to the social graph in Section 3.2.1 and introduce the problem of "popularity bias" with an example topic group in Section 3.2.2. To solve this problem, we propose our own procedural variations of LDA in Section 3.3.2 after investigating existing solutions in Section 3.3.1. In Section 3.3.3, we propose more advanced LDA extensions that explicitly model the popularity of a node. We perform extensive experiments using the Twitter social graph and evaluate our proposed methods based on the widely-used perplexity metric in Section 3.4.2. We also report human survey results on clustering quality in Section 3.4.4, after explaining some example topic groups from our approaches in Section 3.4.3.

3.1 Introduction

Microblogging services such as Twitter are popular these days because they empower users to broadcast and exchange information or thoughts in real time. Distinct from other social network services, relationships on Twitter are unidirectional and often interest-oriented. A user may indicate her interest in another user by "following" her, and previous studies [21, 27] show that users are more likely to follow people who share common interests, even though "following relationships" among users look unorganized and haphazard at first glance. Thus, if we can correctly identify the shared hidden interests behind users' following relationships, we can group users sharing common interests and recommend more relevant users in social network services.

1 http://www.twitter.com/

In this chapter, we apply topic models to correctly identify the hidden interests behind users' following relationships (instead of their tweets as in [37]). The topic model is a statistical model originally developed for discovering hidden topics in a collection of documents. It postulates that every document is a mixture of topics, and words in a document are attributable to these hidden topics. Here, we posit that following relationships are not random but are interest-attributable. Then, we can discover the hidden interest behind each following relationship by regarding a user's following list as a document, and each person in the user's following list as a word. Now, topic models can easily help us correctly identify the hidden interests and derive a low-dimensional representation of the observed following lists.

However, simply applying topic models to the following relationship analysis may be problematic. As LDA is built on the assumption that every word in a document should be of roughly equal popularity, stop words such as the and is must be removed in a preprocessing stage. If we do not remove these stop words before a topic analysis, they appear even in irrelevant topic groups and severely degrade the grouping (clustering) quality. Since the analysis result is dominated by these popular words (nodes), we call this problem "popularity bias" (the quality degradation caused by a limited number of popular nodes). However, we cannot simply remove popular nodes in a social graph analysis because they correspond to popular figures such as Barack Obama and Justin Bieber, and many users are interested in following them.

In this work, we address how to effectively solve this popularity problem. We first show this problem with LDA, one of the most popular topic models, and investigate existing LDA variations for this problem. Then, we propose some procedural variations of LDA to alleviate this problem. Finally, we propose new topic models specialized in handling the popularity bias problem. For this purpose, we introduce the notion of a "popularity component" and explore various ways to effectively incorporate it in our new topic models. We evaluate the effectiveness of our approaches with a real-world Twitter social graph dataset based on two metrics: (1) "perplexity", measuring recommendation (prediction) performance, and (2) "quality", measuring human-perceived clustering (grouping) performance.

Note that the popularity bias is not limited to a social graph. This bias often appears in datasets showing users' preferences over items (or nodes), such as webpage visit logs, click logs, purchase logs, etc. We believe our proposed approaches are very effective in providing better clustering and recommendations in web services generating such logs.

In summary, we make the following contributions in this chapter.

• We closely investigate a popularity bias in topic model analysis caused by a limited number of very popular (frequent) items. This bias is commonly observed in many web service logs representing users' preferences over items.

• We propose new LDA extensions to solve the popularity bias. After exploring existing solutions to it, we propose two procedural variations and three popularity-aware topic models.

• We conduct extensive experiments using a real-world Twitter dataset. Through these experiments, we demonstrate that our approaches are very effective in recommending more relevant users and providing higher-quality topic clusters.

3.2 Applying Topic Models to Social Graphs

3.2.1 Follow-Edge Generative Model

In this section, we explain how we can apply probabilistic topic models to a social graph, using Twitter's social graph as an example. Before delving into the details, we first briefly explain Twitter and a few interesting aspects of its data, which we also use in our performance evaluations later in this chapter. In contrast to the mutual friendship in other social networks, Twitter's relationships are unidirectional (i.e., a Twitter user does not need approval from a user with whom she wants to make friends). Thus, we use the term follow when a user adds another user as her friend. Formally, when a user r follows another user w, the follower r generates a follow edge, or simply an edge, er,w to the followed user w. To remove ambiguity, from now on, we also call a follower a "reader" and a followed user a "writer", because the reader follows the writer to read the writer's tweets. We also use er,∗ to denote the set of all outgoing edges from the reader r, and e∗,w to denote the set of all incoming edges to the writer w. To refer to the set of all readers, the set of all writers, and the set of all edges in the dataset, we use R, W, and E, respectively. Figure 3.1(a) depicts these notations using a graph that we refer to as a subscription graph. For example, we observe ealice,cnn = e1, ealice,∗ = {e1, e2}, and e∗,espn = {e2, e3} in Figure 3.1(a).
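As a concrete illustration of this notation (not part of the original text), a subscription graph can be stored as a map from each reader to the set of writers she follows; a minimal sketch mirroring Figure 3.1(a), with illustrative names:

```python
# Follow edges e_{r,w}, keyed by reader: e_{r,*} corresponds to follows[r].
follows = {
    "alice": {"cnn", "espn"},
    "bob":   {"espn", "mlb"},
}

def incoming_edges(w):
    """Return e_{*,w}: the set of readers following writer w."""
    return {r for r, writers in follows.items() if w in writers}

print(incoming_edges("espn"))  # {'alice', 'bob'}
```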

Given this subscription graph, our goal is to (1) probabilistically label each edge with a correct label (topic or interest) based on co-occurrences of edges, and (2) label (group) writers (and readers as well) having the same interest based on these labeled edges, as depicted in Figure 3.1(b). For example, when we want to extract two topics (topical groups), z1 and z2, from the subscription graph illustrated in Figure 3.1(a), we first probabilistically label each edge with either z1 or z2 based on co-occurrences of edges. By iterating this probabilistic edge-labeling process sufficiently many times, we can get a probable label for each edge, and based on these labeled edges, we can label (group) nodes (writers and readers) as well. After these edge-labeling and node-labeling processes, we can identify what each topic is about by looking at the nodes labeled with that topic. In the example illustrated in Figure 3.1(b), the edges e1 and e2 are labeled with z1, and the edges e3, e4, and e5 are labeled with z2. Since the writer espn has incoming edges e2 and e3, it is labeled with both z1 and z2. Also, since the topic z1 is associated with the writers cnn and espn, we can infer that topic z1 is related to broadcast.

Figure 3.1: Bipartite graph representation of models. (a) Follow-edge generative model (before labeling and grouping); (b) follow-edge generative model (after labeling and grouping); (c) document generative model.

In this way, we can frame our problem as a graph labeling problem of automatically associating each writer w in W with a set of accurate interests z in Z based on its labeled incoming edges e∗,w (we also label each reader r in R). We believe the interest plays a key role in the establishment of a follow edge, as the topic does in the document generation process explained in Section 2.1. In this study, we assume that there exists a "follow-edge generative model", where a reader first chooses an interest and, based on the chosen interest, chooses a writer to follow. In this model, a document in a corpus becomes a reader's following list, and a word becomes a writer in the list. Figure 3.1 shows the structural equivalence between the follow-edge generative model (Figures 3.1(a) and 3.1(b)) and the document generative model (Figure 3.1(c)). The only difference is whether multiple edges are allowed between a node on the left and a node on the right.

Formally, as in the document generation process, when a reader follows a writer, she first selects an interest (a topic) from a distribution p(z|r), and then selects a writer from a distribution p(w|z) as in Equation (2.1). We formulate the probability for a reader ra to follow a writer wb based on an interest z as follows:

$$p(z_{a,b}\,|\,\cdot) = p(z|r_a)\,p(w_{a,b}|z). \qquad (3.1)$$

Note that we use the notation wa,b to denote the writer wb followed by the reader ra,

and the notation $z_{a,b}$ to denote the topic z assigned to $w_{a,b}$. These notations with more detailed subscripts are needed to avoid ambiguity in later equations. Also, note that the writer $w_{a,b}$ is equivalent to the edge $e_{a,b}$.2 For example, in Figure 3.1(b), when $r_a$ is alice and $w_b$ is espn, $w_{a,b}$ ($w_{alice,espn}$) denotes the espn in alice's following list, which is equivalent to $e_2$ ($e_{alice,espn}$). Also, $z_{a,b}$ ($z_{alice,espn}$) denotes the topic $z_1$ assigned to $w_{alice,espn}$. By considering the Dirichlet priors α and β as in the LDA plate notation depicted in Figure 3.6(a), the same probability can be represented in LDA as follows:

$$p(z_{a,b}\,|\,\cdot) \propto \int p(z|\theta)\,p(\theta|\alpha)\,d\theta \times \int p(w_{a,b}|z,\phi)\,p(\phi|\beta)\,d\phi, \qquad (3.2)$$

where θ and φ denote p(z|r) and p(w|z), respectively. Solving this equation yields a collapsed Gibbs sampling equation (Equation (3.3))3 [39], by which the edge $w_{a,b}$ from the reader $r_a$ to the writer $w_b$ is labeled with the topic $z_{a,b}$:

$$p(z_{a,b}\,|\,\cdot) \propto \left(c^{-(a,b)}_{z_{a,b},a,*} + \alpha_{z_{a,b}}\right) \times \frac{c^{-(a,b)}_{z_{a,b},*,w_{a,b}} + \beta_{w_{a,b}}}{c^{-(a,b)}_{z_{a,b},*,*} + \beta_*}, \qquad (3.3)$$

where the symbol $*$ denotes a summation over all possible values of the corresponding subscript, and $c_{k,m,j}$ is the number of associations between the topic $z_k$ and the writer $w_j$ followed by the reader $r_m$ (i.e., the follow edge from the reader $r_m$ to the writer $w_j$). The count $c_{k,m,j}$ is defined as:

$$c_{k,m,j} = \sum_{n=1}^{N_m} I(z_{m,n} = k \;\&\; w_{m,n} = j), \qquad (3.4)$$

and $c^{-(a,b)}_{k,m,j}$ denotes the count when we exclude the edge from the reader $r_a$ to the writer $w_b$. Thus, $c_{z_{a,b},*,w_{a,b}}$ denotes the number of times $w_{a,b}$ is labeled with $z_{a,b}$, and $c_{z_{a,b},a,*}$ denotes the number of times $r_a$ is labeled with $z_{a,b}$.

Analyzing Twitter’s follow edges using LDA delivers two estimates: p(z|r) and p(w|z). The p(z|r) indicates a reader r’s interest distribution, and p(w|z) indicates a

2 We use wa,b instead of ea,b in most of our equations for consistency with the Gibbs sampling equations in other papers. 3We remove the denominator of the left part of the equation for brevity because it is independent of za,b.

16 writer w’s importance in an interest (topic) group z. Thus, the latter can be used for “grouping” Twitter writers having the same interest. Also, from Equation (2.1), we can easily estimate p(w|r), the likelihood of a reader r’s following a writer w. Thus, by sorting writers based on this value in decreasing order, we can “recommend” (predict) good writers to follow. In Section 3.4, we evaluate our approaches based on these two tasks: grouping and recommendation.
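For illustration, here is a minimal sketch of one collapsed Gibbs sampling sweep implementing Equation (3.3), assuming symmetric priors α and β and edges stored as (reader, writer) index pairs; all variable names are illustrative:

```python
import numpy as np

def gibbs_sweep(edges, z, n_rz, n_zw, n_z, alpha, beta, W, rng):
    """One collapsed Gibbs sweep over follow edges (Equation (3.3)).

    edges: list of (reader, writer) index pairs
    z:     current topic label of each edge
    n_rz:  |R| x K counts c_{z,r,*};  n_zw: K x |W| counts c_{z,*,w};  n_z: K counts c_{z,*,*}
    """
    for i, (r, w) in enumerate(edges):
        k = z[i]
        # Remove the current edge's assignment (the -(a,b) superscript).
        n_rz[r, k] -= 1; n_zw[k, w] -= 1; n_z[k] -= 1
        # Unnormalized p(z_{a,b} | .) for every topic; beta_* = W * beta for a symmetric prior.
        p = (n_rz[r] + alpha) * (n_zw[:, w] + beta) / (n_z + W * beta)
        k = rng.choice(len(p), p=p / p.sum())
        # Add the edge back under its newly sampled label.
        z[i] = k
        n_rz[r, k] += 1; n_zw[k, w] += 1; n_z[k] += 1
```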

3.2.2 Popularity Bias

Although we justified our approach of analyzing a social graph using topic models in the previous section, we notice the following differences when we apply topic models to a social graph.

1. In a document generative model, a word is sampled with replacement. However, in our follow-edge generative model, a reader cannot follow the same writer twice, as illustrated in Figure 3.1. Thus, a writer should be sampled without replacement.

2. When analyzing a textual dataset, common entities such as the and is (i.e., "stop words") are simply removed before the topic model analysis because they do not carry much meaning. However, in a social graph, these entities correspond to "celebrities" such as Barack Obama and Justin Bieber, who attract more followers than others. Thus, they cannot simply be ignored but should be carefully handled.

Because of the first difference, some probability distributions in our follow-edge generative model follow a multivariate hypergeometric distribution instead of a multinomial distribution, which is important because LDA benefits from Dirichlet priors, the conjugate priors of multinomial distributions. However, it is known that the multivariate hypergeometric distribution converges to the multinomial distribution as the population size grows large [1]. Since millions of users are included in our Twitter dataset, we can disregard the consequences of sampling without replacement. The second difference affects the quality of a topic model analysis. When celebrities are simply included without any special handling, they appear even in irrelevant topic groups and make the topic analysis severely biased toward them. Such a "popularity bias" can be seen everywhere, from website visit logs to product purchase logs.

Figure 3.2: Topic group on cycling showing the popularity bias

To give a better sense of the popularity bias problem, in Figure 3.2 we show a sample topic group produced by LDA when we directly apply it to a social graph without any special handling. It lists the top 10 writers in a topic group according to their importance (p(w|z)) in the group, together with their Twitter user name, number of followers (|e∗,w|), and bio information. By going over each writer's bio information, we easily see that many of them share the interest cycling. However, we also observe that among the three writers having the largest number of followers (barackobama, stephenfry, and lancearmstrong), only lancearmstrong is related to cycling. The other two writers, barackobama and stephenfry, are included in this group simply because they are famous and followed by many readers, including readers who are interested in cycling. In particular, we note that barackobama appears in 15 of the 100 topic groups produced by LDA. Among the 15 groups, only one is related to his specialty, politics, which clearly shows that LDA severely suffers from these popular writers when directly applied to a social graph dataset.

This popularity bias in the social graph may not cause any problem in a traditional information retrieval task, where we are interested in retrieving the most relevant users (documents) given a specific user (document or query), because the term frequency (TF) value of a frequent writer is still at most 1 and its very low inverse document frequency (IDF) value can effectively penalize its influence. (A term weighting scheme for LDA in a textual corpus was also proposed for cross-language retrieval [48].) However, in the clustering task we are interested in, this popularity bias severely degrades the quality of the topic model analysis: popular writers appear at top positions even in irrelevant topic groups, as depicted in Figure 3.2. In the next section, we propose effective solutions to this problem. Before moving to the next section, we summarize the symbols used in this chapter in Table 3.1.

3.3 Handling a Popularity Bias

We explore LDA extensions that effectively alleviate the popularity bias problem (i.e., label popular writers with correct labels). We first discuss two existing LDA approaches in Section 3.3.1, and propose two new procedural variations of LDA in Section 3.3.2. Finally, we propose three popularity-aware topic models that explicitly model the popularity of writers in Section 3.3.3.

Table 3.1: Symbols used throughout this chapter and their meanings

Symbol           Meaning
r, R             Reader (follower); set of all readers
w, W             Writer (followed user); set of all writers
e, E             Follow edge; set of all follow edges
z, Z             Topic (interest); set of all topics
wa,b (ea,b)      Writer b in reader a's following list (follow edge)
za,b             Topic assigned to the edge from reader a to writer b
t                Binary topic-path indicator
M                Number of unique readers
K                Number of unique topics
Nm               Number of writers reader m follows
fw               In-corpus frequency of writer w
α, β, γ, δ       Dirichlet (Beta) priors
θ                Topic distribution for a reader (p(z|r))
φ                Writer distribution for a topic (p(w|z))
π                In-corpus writer distribution (p(w))
τ                Topic-path distribution for a reader (p(t|r))
λ                Concentration scalar

3.3.1 Existing LDA Extensions

In this section, we discuss two existing LDA approaches: (1) asymmetric priors (LDA with different settings), and (2) Hierarchical LDA [4]. Among the variety of LDA extensions we review in Chapter 5, these are the most appropriate for dealing with the popularity bias.

3.3.1.1 Setting Asymmetric Priors

As mentioned in Section 2.1, LDA constrains the distributions of topics and words with Dirichlet priors α and β, respectively. Although each element of the vectors α and β may take different values in principle, in the standard LDA, each element of α and β is assumed to have the same value (often referred to as the symmetric prior assumption). Intuitively, this assumption implies that every topic and word in a document corpus is roughly equally likely. Although the former sounds agreeable, the latter sounds unrealistic, since it is very well known that the probability distribution of words follows a power-law distribution by Zipf's law. It is also the reason why stop words are removed before applying LDA to a document corpus, since stop words correspond to the head of the power-law distribution.

The most intuitive approach to address this issue would be to set a different prior for each writer. Between the two priors α and β, we are only interested in β, the prior over the word distribution given a topic, because a writer in a social graph corresponds to a word in a text dataset. As a higher prior value implies a higher likelihood of being observed in a corpus, we set each prior value proportional to the number of followers of each writer. This is expected to associate popular writers with more accurate labels, as they are given adequate prior values.

We set $\beta_w$, the prior for a writer w, as in Equation (3.5):

$$\beta_w = \frac{0.98\,|e_{*,w}| + 0.01\,\max(|e_{*,W}|) - 0.99\,\min(|e_{*,W}|)}{\max(|e_{*,W}|) - \min(|e_{*,W}|)}, \qquad (3.5)$$

where $\max(|e_{*,W}|)$ denotes the largest incoming-edge count and $\min(|e_{*,W}|)$ denotes the smallest incoming-edge count in a dataset. Note that we set the lowest value of $\beta_w$ to 0.01 and the highest value to 0.99 so that prior values range between 0 and 1.

Figure 3.3: Topic hierarchy and documents generated in the hierarchy
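As a sanity check on Equation (3.5), a short sketch that maps a writer's follower count to its prior (names are illustrative; the endpoints map to 0.01 and 0.99, as stated above):

```python
def beta_prior(n_followers, n_min, n_max):
    """Linearly map a follower count in [n_min, n_max] to a prior in [0.01, 0.99] (Equation (3.5))."""
    return (0.98 * n_followers + 0.01 * n_max - 0.99 * n_min) / (n_max - n_min)

# beta_prior(n_min, n_min, n_max) == 0.01 and beta_prior(n_max, n_min, n_max) == 0.99
print(beta_prior(100, 10, 7410))
```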

3.3.1.2 Hierarchical LDA

Hierarchical LDA (HLDA) is also a good candidate for our problem. In HLDA, topics are organized into a hierarchy, and a more frequent topic is located at a higher level. Figure 3.3 shows an example of a topic hierarchy (topic tree) and associated documents, where zk denotes a topic and di denotes a document. Every document is generated by following a path from the root node in the tree and is then associated with every node in the path. For example, the document d1 is generated through the topic path z1-z2-z4, so d1 is associated with z1, z2, and z4. In HLDA, when a document is generated according to Equation (2.1), words are chosen from topics in the document's path. Since the top-level topic is associated with all the documents, words common to every document (i.e., stop words) are expected to be labeled with the top-level topic. On the contrary, the bottom-level topics are expected to be more specific, as they are associated with a small number of documents. For example, if z2 is a topic about network, z4 and z5 would be topics about queueing and routing, respectively. As z1 is usually a topic consisting of stop words, a document d1 from the document tree path z1-z2-z4 consists of words from topics z1, z2, and z4, and becomes a document about network queueing. Similarly, in a social graph analysis, z1 is involved in every reader's follow-edge generation process and is expected to be associated with popular writers. This topic hierarchy is established because HLDA is based on the Nested Chinese Restaurant Process (NCRP) [4], a tree extension of the Chinese Restaurant Process, which probabilistically generates a partition of a set {1, 2, . . . , n} at time n. In NCRP, a document is considered a Chinese restaurant traveler who visits L restaurants along a restaurant tree path, where L refers to the level of the tree (i.e., the length of the path).

3.3.2 Procedural Variations of LDA

In the previous section, we explored existing LDA approaches to handle the popularity bias. In this section, we propose two new procedural variations of LDA: (1) two-step labeling, and (2) threshold noise filtering. These new variations are easy to apply and produce noticeable improvements. They can also be combined for better labeling quality.

3.3.2.1 Two-Step Labeling

We propose a new procedural variation of LDA, two-step labeling, where we decompose the labeling procedure into two sub-procedures: establishing topics and labeling with the established topics. In the first topic establishment step, we run LDA after removing edges to popular writers from the dataset, similar to how we remove stop words before applying LDA to a text dataset. This step generates a balanced set of topic groups safe from the bias caused by popular writers. In the second labeling step, we apply LDA only to the remaining edges (edges to popular writers) in the dataset. When we label each edge with a topic, the topic is probabilistically sampled according to Equation (3.3). That is, the topic selection is affected by how many times that topic has been associated with the reader and the writer (the left and the right part of the equation, respectively). Thus, if correct topic assignments have been made in the first step, the labeling in the second step is guided by them and becomes less biased.

Figure 3.4: Two-step labeling approach

This approach is illustrated with a reader-writer matrix in Figure 3.4, where the E1 part of the dataset is labeled in the first step and the E2 part is labeled in the second step. Note that the two-step labeling does not increase computational complexity because it samples a different part of the dataset at each step ($O(|Z||E|) = O(|Z|(|E_1| + |E_2|))$). In the literature, online LDAs [8] also use multi-step labeling. However, while they try to change topics as a corpus grows over time, the two-step labeling fixes a set of topics and labels edges with this fixed set of topics. In Figure 3.4, W1 and W2 are mutually exclusive in the two-step labeling, while R1 and R2 (documents) are mutually exclusive in online LDAs.
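A minimal sketch of the two-step procedure, reusing the gibbs_sweep routine sketched earlier and assuming the edges have already been split by writer popularity (the function and variable names are illustrative):

```python
def two_step_labeling(normal_edges, popular_edges, z1, z2,
                      n_rz, n_zw, n_z, alpha, beta, W, n_sweeps, rng):
    """Two-step labeling sketch: the count tables are shared, so the step-1
    assignments guide the labeling of popular-writer edges in step 2."""
    for _ in range(n_sweeps):  # step 1: establish topics without popular writers
        gibbs_sweep(normal_edges, z1, n_rz, n_zw, n_z, alpha, beta, W, rng)
    for _ in range(n_sweeps):  # step 2: label only the edges to popular writers
        gibbs_sweep(popular_edges, z2, n_rz, n_zw, n_z, alpha, beta, W, rng)
```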

Figure 3.5: Example of the threshold noise filtering process

3.3.2.2 Threshold Noise Filtering

As described in Section 2.1, the association level between a writer (a word) and a topic is represented by p(w|z) in a probabilistic topic model. Thus, we can list the writers in a topic group in descending order of p(w|z) and regard the top entries in the list (topic group) as more important than the bottom entries. Similarly, we can measure this association level from a writer's viewpoint using p(z|w). Although the two-step labeling may help label popular writers with the right topics, they can still take top positions even in less-relevant topic groups, because even the smallest number of associations of a popular writer with a certain topic may outnumber the largest number of associations of a non-popular writer with that specific topic.

To mitigate this problem, we propose a new post-labeling procedure, threshold noise filtering, which sets a threshold value to determine whether to label a writer with each topic. By ignoring assignments below the threshold value, we can expect a noise reduction effect as in an anti-aliasing filter. We set $c_{z_{a,b},*,w_{a,b}}$, the number of times a writer $w_{a,b}$ is assigned to a topic group $z_{a,b}$, to 0 if it is below the threshold value T:

$$c_{z_{a,b},*,w_{a,b}} = \begin{cases} 0 & \text{if } p(z_{a,b}\,|\,w_{a,b}) = \dfrac{c_{z_{a,b},*,w_{a,b}}}{c_{*,*,w_{a,b}}} < T, \\[6pt] c_{z_{a,b},*,w_{a,b}} & \text{otherwise.} \end{cases}$$

As the threshold noise filtering is a simple post-labeling process, it does not increase computational complexity. Figure 3.5 illustrates this process and shows the top three popular writers' distributions over topic groups. (Even non-popular writers show similar non-linear distributions.)
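A minimal sketch of this post-labeling filter over the topic-writer count table from the sampler sketched earlier (the array names and the default T are illustrative):

```python
import numpy as np

def threshold_filter(n_zw, T=0.05):
    """Zero out topic-writer counts whose share p(z|w) falls below the threshold T."""
    totals = np.maximum(n_zw.sum(axis=0, keepdims=True), 1)  # c_{*,*,w} per writer
    p_z_given_w = n_zw / totals
    return np.where(p_z_given_w < T, 0, n_zw)
```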

3.3.3 Popularity-Aware Topic Models

In the last section, we explored procedural variations of LDA that can alleviate the popularity bias. In this section, we propose three new popularity-aware topic models that explicitly model the popularity of a node (item) with a “popularity component”: (1) multiplication model, (2) polya-urn model, and (3) two-path model.

3.3.3.1 Multiplication Model

In developing our first popularity-aware topic model, we note that if a reader wants to follow a writer, she has to first "discover" him among many other writers. Only then can the reader evaluate whether he is related to her topic of interest and decide to follow him. Therefore, the event of the "discovery" of the writer should happen together with the event of the "relevance" of the writer to the reader's interest.

Clearly, popular writers are more likely to be discovered than unpopular writers, so we represent this discovery probability using a multinomial distribution π constrained by a Dirichlet prior γ, each element of which has the value $\gamma_w = f_w / f_*$, where $f_w$ denotes the in-corpus frequency of a writer w (i.e., the number of followers of the writer), and $f_*$ denotes the total frequency ($\sum_w f_w$). We incorporate this discovery probability into LDA's plate notation as in Figure 3.6(b) (the dotted box), and call this module the "popularity component".

Figure 3.6: LDA and the proposed topic models. (a) LDA; (b) multiplication model; (c) polya-urn model; (d) two-path model.

In the multiplication model, since we assume that the event of discovery should happen together with the event of relevance, we simply multiply the writer selection probability φ (the relevance of the writer to the topic of interest) by the discovery probability π (the global popularity of the writer). More precisely, we formulate this change (from Equation (3.2)) in the following equation:

$$
\begin{aligned}
p(z_{a,b}\,|\,\cdot) &\propto \int p(z|\theta)\,p(\theta|\alpha)\,d\theta \times \iint p(w_{a,b}|z,\phi,\pi)\,p(\phi|\beta)\,p(\pi|\gamma)\,d\phi\,d\pi \\
&\propto \left(c^{-(a,b)}_{z_{a,b},a,*} + \alpha_{z_{a,b}}\right) \times \frac{c^{-(a,b)}_{z_{a,b},*,w_{a,b}} + \beta_{w_{a,b}}}{c^{-(a,b)}_{z_{a,b},*,*} + \beta_*} \times \gamma_{w_{a,b}}. \qquad (3.6)
\end{aligned}
$$

Note that the difference of the above equation from Equation (3.3) is that the popularity factor $\gamma_{w_{a,b}}$ is multiplied at the end.
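For illustration, in a collapsed Gibbs sampler this change is a small modification of the topic-weight computation; a minimal sketch against the count tables used earlier, where gamma_w holds the per-writer popularity priors (all names are illustrative):

```python
import numpy as np

def topic_weights_multiplication(r, w, n_rz, n_zw, n_z, alpha, beta, W, gamma_w):
    """Unnormalized p(z | .) for the multiplication model (Equation (3.6))."""
    lda_part = (n_rz[r] + alpha) * (n_zw[:, w] + beta) / (n_z + W * beta)
    # Per Equation (3.6), the popularity factor gamma_w is multiplied at the end.
    return lda_part * gamma_w[w]
```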

3.3.3.2 Polya-Urn Model

In the multiplication model, the global popularity factor is multiplied together with the topic relevance term, because we assumed that the writer needs to be both discovered and relevant. Instead, in the polya-urn model, we assume that when a reader follows a writer, it is a mixture of the global popularity of the writer and the topic relevance. For example, when a reader follows Justin Bieber, it may be 40% because Justin Bieber is a globally popular writer and 60% because Justin Bieber is a writer on the topic of interest.

In general, this interpretation leads us to the polya-urn model depicted in Figure 3.6(c), which was first proposed in [2], where the authors tried to represent global topics as well as local topics. Note that in addition to the γ and π of the multiplication model, the popularity component in the polya-urn model has a concentration scalar λ, which decides the relative weight of the global popularity in generating a follow edge. Initially, the multinomial distribution π is generated from the Dirichlet prior γ ($\pi \sim \mathrm{Dirichlet}(\gamma)$). Then, π works as a Dirichlet prior for φ, together with the concentration scalar λ ($\phi \sim \mathrm{Dirichlet}(\lambda\pi)$). As λ works as a weight on the prior observation π, φ becomes similar to π when λ has a high value. On the other hand, φ deviates from π when λ has a low value. Since π works as a base distribution and φ deviates from π per topic, π can be considered a global (in-corpus) writer distribution, and φ can be considered a local (per-topic) writer distribution. As we select a writer from a mixture of the global and local writer distributions, LDA's topic assignment probability in Equation (3.3) should be extended to:

$$
\begin{aligned}
p(z_{a,b}\,|\,\cdot) &\propto \int p(z|\theta)\,p(\theta|\alpha)\,d\theta \times \iint p(w_{a,b}|z,\phi)\,p(\phi|\lambda\pi)\,p(\pi|\gamma)\,d\phi\,d\pi \\
&\propto \left(c^{-(a,b)}_{z_{a,b},a,*} + \alpha_{z_{a,b}}\right) \times \left( \frac{c^{-(a,b)}_{z_{a,b},*,w_{a,b}}}{c^{-(a,b)}_{z_{a,b},*,*}} + \lambda\, \frac{c_{*,*,w_{a,b}} + \gamma_{w_{a,b}}}{c_{*,*,*} + \gamma_*} \right). \qquad (3.7)
\end{aligned}
$$

Note that the global distribution dominates the mixture as the concentration parameter λ increases. On the other hand, as λ decreases, the local distribution dominates and the whole equation becomes similar to that of LDA.
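Under the reconstruction of Equation (3.7) above, the topic-weight computation can be sketched as follows (pi_w holds the smoothed global writer distribution; all names are illustrative):

```python
import numpy as np

def topic_weights_polya_urn(r, w, n_rz, n_zw, n_z, alpha, lam, pi_w):
    """Unnormalized p(z | .) for the polya-urn model: the per-topic (local) ratio
    plus lam times the global writer probability
    pi_w[w] = (c_{*,*,w} + gamma_w) / (c_{*,*,*} + gamma_*)."""
    local = n_zw[:, w] / np.maximum(n_z, 1)  # local writer ratio per topic
    return (n_rz[r] + alpha) * (local + lam * pi_w[w])
```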

3.3.3.3 Two-Path Model

In the polya-urn model, when a reader follows a writer, she first selects a topic, and then selects a writer from the mixture of a global and a local writer distribution for the selected topic. That is, every follow edge is a mixture of these two components. In the two-path model, we separate the non-topic-related follow edges from the topic-related ones by assuming that there are two mutually exclusive paths, a "popularity path" and a "topic path", from which a follow edge is generated. That is, when a reader follows a writer, it is because the writer is either globally popular or topically related, not a mixture of these two reasons. The separation is expected to help generate clearer topics. For this separation, we introduce a new binary latent variable t, which indicates the path the writer comes from: t = 1 means that the writer comes from the topic path, and t = 0 means that she comes from the popularity path. Now, we do "path-labeling" as well as "topic-labeling" for a follow edge, and our goal is to accurately infer t as well as z (when t = 1).

Figure 3.6(d) depicts this two-path model. The variable t follows a Bernoulli distribution τ, which is constrained by a symmetric Beta prior δ. Now, a reader selects a path as well as a topic with Equation (3.8):

$$
\begin{aligned}
p(z_{a,b}, t_{a,b}\,|\,\cdot) \;\propto\; & \int p(z|\theta)\,p(\theta|\alpha)\,d\theta \times \int p(w_{a,b}|z,\phi)^{t}\,p(\phi|\beta)\,d\phi \\
& \times \int p(w_{a,b}|t,\pi)^{1-t}\,p(\pi|\gamma)\,d\pi \times \int p(t|\tau)\,p(\tau|\delta)\,d\tau. \qquad (3.8)
\end{aligned}
$$

Note that $\iint p(w_{a,b}|z,t,\phi,\pi)\,p(\phi|\beta)\,p(\pi|\gamma)\,d\phi\,d\pi$ is decomposed into $\int p(w_{a,b}|z,\phi)^{t}\,p(\phi|\beta)\,d\phi \times \int p(w_{a,b}|\pi)^{1-t}\,p(\pi|\gamma)\,d\pi$ according to the value of the binary variable t. We extend Equation (3.4) with the new path indicator variable t:

$$c_{k,m,j,s} = \sum_{n=1}^{N_m} I(z_{m,n} = k \;\&\; w_{m,n} = j \;\&\; t_{m,n} = s). \qquad (3.9)$$

Then, the path/topic-labeling probability is derived as:

$$p(z_{a,b}, t_{a,b}=0\,|\,\cdot) \;\propto\; p(t_{a,b}=0\,|\,\cdot) \;\propto\; \left(c^{-(a,b)}_{*,a,*,0} + \delta_0\right) \times \frac{c^{-(a,b)}_{*,*,w_{a,b},0} + \gamma_{w_{a,b}}}{c^{-(a,b)}_{*,*,*,0} + \gamma_*}, \qquad (3.10)$$

$$p(z_{a,b}, t_{a,b}=1\,|\,\cdot) \;\propto\; \left(c^{-(a,b)}_{z_{a,b},a,*,*} + \alpha_{z_{a,b}}\right) \times \left(c^{-(a,b)}_{*,a,*,1} + \delta_1\right) \times \frac{c^{-(a,b)}_{z_{a,b},*,w_{a,b},1} + \gamma_{w_{a,b}}}{c^{-(a,b)}_{z_{a,b},*,*,1} + \gamma_*}. \qquad (3.11)$$

The two latent variables are inferred simultaneously in every Gibbs sampling iteration. The topic-labeling procedure is performed only when $t_{a,b} = 1$. Note that when $t_{a,b} = 1$ for all edges, the two-path model becomes equivalent to the standard LDA.
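A minimal sketch of the joint path/topic weights under the reconstruction of Equations (3.10) and (3.11) above (the count tables now carry a path index; all names are illustrative):

```python
import numpy as np

def path_topic_weights(r, w, n_rt, n_w0, n_0, n_rz, n_zw1, n_z1,
                       alpha, gamma_w, gamma_sum, delta):
    """Normalized weights over (t=0) and (t=1, z=k) for one follow edge.

    n_rt:  |R| x 2 per-reader path counts; n_w0, n_0: popularity-path writer/total counts
    n_rz:  reader-topic counts; n_zw1, n_z1: topic-path writer/topic counts
    """
    # Popularity path (Equation (3.10)): a single weight, no topic involved.
    w_pop = (n_rt[r, 0] + delta) * (n_w0[w] + gamma_w[w]) / (n_0 + gamma_sum)
    # Topic path (Equation (3.11)): one weight per topic k.
    w_top = (n_rz[r] + alpha) * (n_rt[r, 1] + delta) \
            * (n_zw1[:, w] + gamma_w[w]) / (n_z1 + gamma_sum)
    weights = np.concatenate(([w_pop], w_top))
    return weights / weights.sum()  # sample t (and z when t = 1) from this
```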

3.4 Experiments

In this section, we evaluate our approaches using a real-world Twitter dataset under two metrics: (1) perplexity, and (2) human-perceived quality (or simply quality). While perplexity is automatically calculated and indicates the prediction (recommendation) performance of a trained model, quality is based on human judgments and measures how good the topic groups produced by the model are. As our baseline, we use LDA.4 Note that LDA is well known for achieving low perplexity and producing high-quality topic groups. Before we provide our analyses of the experimental results, we explain the details of our Twitter dataset and experimental settings in the following section.

4 As perplexity is only available for probabilistic topic models, we limited our baseline to probabilistic topic models; LDA provides better perplexity than PLSA by solving the overfitting problem explained in Section 2.1.

3.4.1 Dataset and Experimental Settings

For our experiments, we use a Twitter dataset we collected between October 2009 and January 2010. The original downloaded dataset contained 273 million follow edges, but we sampled 10 million edges from this dataset to keep our experiments manageable. To ensure that all the follow edges from a reader are preserved in our sampled dataset, we first randomly sampled users (as readers) and included all follow edges (to writers) from the sampled users, until we obtained the 10 million follow edges. We excluded the readers who follow fewer than 10 writers from our dataset because their follow edges may be generated by pure curiosity or reciprocity. The same approach was used in [37].

Figure 3.7 shows the distributions of incoming and outgoing edges in the sampled dataset. The horizontal axis shows the number of edges of a node, and the vertical axis shows how many nodes have the given edge count. Both axes are shown on a logarithmic scale. From the graph, it is clear that the number of incoming edges follows a power-law distribution, which is often the case for this type of dataset. Interestingly, we observe that the number of outgoing edges is quite uniform between edge counts 1 and 100, which is different from the distribution reported in [28]. We do not believe this difference is due to our sampling, because the graph from our complete dataset shows the same flat curve between edge counts 1 and 100 [46]. It could be an interesting future work item to investigate where this difference comes from. In our sampled dataset, barackobama has the most followers (readers), 7,410, and zappos follows the most users (writers), 142,669. Table 3.2 shows some basic statistics of the sampled dataset.

Since we are mainly interested in investigating how different approaches handle popular writers, we categorized the edges to writers into two distinct subgroups according to writers' incoming-edge counts. One group consists of the edges to

Figure 3.7: Distributions of incoming and outgoing edges

Table 3.2: Statistics of our Twitter dataset

    Statistics         Value
    |E|                10,000,000
    |R ∪ W|            2,430,237
    |R|                14,015
    |W|                2,427,373
    max(|e_{r,*}|)     142,669 (r: zappos)
    max(|e_{*,w}|)     7,410 (w: barackobama)

"normal" writers (i.e., the normal-writers group), where $|e_{*,w}| \le V$, a boundary value, and the other group consists of the edges to "popular" writers (i.e., the popular-writers group), where $|e_{*,w}| > V$. We tested three boundary values: V = 50, 100, and 500. Since the results from all three boundary values show consistent patterns, we only report the results for V = 100, where the popular-writers group consists of only 0.3% of all writers but accounts for 20.2% of all follow edges.

We ran experiments for all our approaches introduced in Section 3.3. Table 3.3 summarizes the nine experimental cases we report in this section. When there are multiple sub-cases, we use bold characters to indicate the best sub-case, which is the one we report in this section. Although we calculated automated perplexity values for all the cases, we carefully selected six cases for which we report quality, due to limited resources for the human survey. The six cases are underlined in Table 3.3. The first two cases, base and non-popular, are the standard LDA experiments over the whole dataset and the normal-writers group, respectively. Thus, non-popular does not contain any popular writers or edges to them. The next two cases, beta and hlda, are existing LDA variations. For beta, we tried multiple asymmetric Dirichlet prior schemes for β: proportional, inversely proportional, and ladder-shape. For hlda, we tried two and three topic levels. The next two cases, 2step and filter, are our procedural variations of LDA: 2step is the two-step labeling approach and filter is the threshold noise filtering approach. For filter, we applied the threshold noise filtering to the results from base and 2step. The last three cases, multi, polya, and 2path, are from our popularity-aware topic models and denote the experimental cases for the multiplication model, the polya-urn model, and the two-path model, respectively. We ran multiple runs to find the following optimal parameter values: T = 0.05, α = 0.1, β = 0.01 (except beta), λ = 0.1, and δ = 1. In all of our experiments, we generated 100 topic groups, except for the hlda case with two levels, where 50 topic groups were generated.

Table 3.3: Experimental cases and descriptions

    Case          Experiment Description
    base          LDA over the whole dataset
    non-popular   LDA over the normal-writers group dataset
    beta          LDA with proportional/inverse/ladder β prior
    hlda          HLDA with two/three levels
    2step         Two-step labeling
    filter        Threshold noise filtering after base/2step
    multi         Multiplication model
    polya         Polya-urn model
    2path         Two-path model

3.4.2 Prediction Performance Analysis

We evaluate prediction (recommendation) performance of our proposed models using the widely-used perplexity metric [50, 23, 5, 24, 10] defined as:

$$\text{perplexity}(E_{test}) = \exp\left(-\frac{\sum_{e \in E_{test}} \log p(e)}{|E_{test}|}\right), \tag{3.12}$$

where $E_{test}$ denotes all the edges in a test dataset. The perplexity quantifies the prediction power of a trained model by measuring how well the model handles unobserved test data. Since the exponent in Equation (3.12) is the negative of the average log prediction probability over all the test edges, a lower perplexity value means stronger prediction power. We calculated the perplexity for two separate 10% randomly held-out datasets after training a model on the remaining 80% of the dataset. We averaged results from ten runs (five runs for each held-out dataset with different random seeds). As the standard LDA (base) is designed to minimize perplexity, it is not easy to achieve lower perplexity than base.
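To make Equation (3.12) concrete, here is a minimal sketch of the perplexity computation, assuming the trained model has been summarized into a p(z|r) matrix and a p(w|z) matrix (the variable names are ours):

    import numpy as np

    def perplexity(test_edges, p_z_given_r, p_w_given_z):
        """Perplexity of held-out follow edges, as in Eq. (3.12).

        test_edges  : list of (reader, writer) index pairs
        p_z_given_r : |R| x |Z| matrix of p(z|r) from the trained model
        p_w_given_z : |Z| x |W| matrix of p(w|z) from the trained model
        """
        log_sum = 0.0
        for r, w in test_edges:
            # p(e) = p(w|r) = sum over z of p(z|r) p(w|z)
            log_sum += np.log(p_z_given_r[r] @ p_w_given_z[:, w])
        return np.exp(-log_sum / len(test_edges))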

Figure 3.8: Perplexity comparison

We report the perplexity values for the experimental cases in Figure 3.8,⁵ where we observe the following.

1. Most of our proposed approaches are effective in achieving lower perplexity than base. In particular, our popularity-aware topic models are more effective than the other approaches.

2. The 2path achieves the lowest perplexity, which is 8.14% lower than that of base.

3. The filter in Figure 3.8 denotes the combination of the two-step labeling and the threshold noise filtering. Contrary to our expectation, the threshold noise filtering does not help in reducing perplexity. This is because the p(w|z) values of

⁵ We do not include the perplexity values of non-popular (42,222.0), hlda (123,966.9), and multi (90,466.8) in the graph because their values are too high. The non-popular case does not have edges to popular writers in its test dataset, which are usually generated with higher probability (i.e., are easier to predict). The perplexity of hlda is calculated with the empirical likelihood value provided by Mallet (http://mallet.cs.umass.edu). The multi case seems to favor popular writers too much.

the filtered-out edges become close to 0, which makes the value of $p(e)$ $\left(= p(w|r) = \sum_z p(z|r)\,p(w|z)\right)$ smaller.

3.4.3 Example Topic Groups

Ultimately, the effectiveness of each approach should be determined by how people perceive the quality of the topic groups identified by each approach. In Figures 3.9 and 3.10, we report some representative example topic groups from various approaches, each consisting of the top 10 writers according to their importance in the topic group (p(w|z)). Figure 3.9(a) shows an example topic group produced by non-popular, where we removed the edges to popular writers (those having more than 100 followers) before applying LDA to the dataset. From the second column, we observe that all writers have $|e_{*,w}|$ values (i.e., numbers of followers) smaller than 100. Therefore, none of the popular writers would belong to a topic group under this scheme. When we go over each user's bio information, we see that all writers in this group are somewhat related to tech. That is, the topic group after removing popular writers is cleaner (unbiased) than that from the standard LDA (as shown in Figure 3.2), at the expense of not being able to group any popular writers. However, because these writers are not prominent, some people may not consider them relevant to the group until they carefully read the writers' bio information.

Figure 3.9(b) shows an example topic group from 2step, where we applied the two-step labeling to the topic group in Figure 3.9(a). We observe that many writers in this group are very popular and are mainly about the same topic, tech media, indicating that 2step is generally able to group popular writers into the right topic group. Note that the top 10 writers in Figure 3.9(a) still belong to this topic group although they are no longer listed among the top 10 writers; their relative importance in this group decreased due to the popular writers newly added during the two-step labeling.

(a) Topic group on tech blog without any interesting figure to follow (non-popular)

(b) Topic group on tech media (2step on the topic group in Figure 3.9(a))

Figure 3.9: Sample topic groups I

(a) Topic group on tech media (filter on the topic group in Figure 3.9(b))

(b) Topic group on tech media having relevant popular writers (2path)

Figure 3.10: Sample topic groups II

However, we observe that 2step still suffers from the presence of a few popular, yet less relevant writers in the group (in this example, cnnbrk and breakingnews may be considered less relevant to the group than the others).

With the additional threshold noise filtering, we get a less noisy result. Figure 3.10(a) shows the resulting topic group from filter corresponding to the group in Figure 3.9(b) from 2step. We observe that cnnbrk and breakingnews, popular writers on general media, are now removed from Figure 3.9(b), and more technology-centric media such as firefox, youtube, and engadget are added to the top 10 writers. Note that firefox and youtube are Twitter accounts publishing news related to Firefox⁶ and YouTube⁷. As we pick only a few of the most probable topics for popular writers in the threshold noise filtering, they have less chance to appear in less-relevant topic groups. Thus, with this combination, we can group "relevant" and "popular" writers.

Figure 3.10(b) shows an example topic group related to tech media from 2path. We see that the topic group is quite similar to the one from 2step in Figure 3.10(a). However, the two topic groups are produced in two totally different ways: a combination of procedural variations vs. a popularity-aware generative model. In the following section, we quantitatively evaluate these approaches with a metric calculated from a survey on human-perceived grouping quality.

3.4.4 Grouping Quality Analysis

We conducted a survey with a total of 14 graduate students majoring in Computer Science and measured how relevant the writers in a topic group are. The participants in our survey were presented with a random group of 10 Twitter writers identified by one of the six representative approaches: base, non-popular, hlda, 2step, filter, and 2path.

⁶ http://www.firefox.com/
⁷ http://www.youtube.com/

The presented group looks like the ones in Figure 3.2, Figure 3.9, and Figure 3.10, but does not contain the number of followers, to avoid biased judgments. The group also contains a link to each Twitter writer's personal webpage if she has one. The participants were asked to indicate whether each writer in the topic group is relevant to the group, based on the bio information and the webpage content (if available). Overall, we collected 23 judged topic groups per experimental case. Note that Figure 3.2 is produced by base, and the figures in Figures 3.9 and 3.10 are generated by non-popular, 2step, filter, and 2path, respectively. A topic group produced by hlda shows one of three different patterns according to its level in the topic hierarchy: (1) a top-level celebrity group (topically mixed), (2) a middle-level somewhat popular group, and (3) a leaf-level extremely non-popular group.

Given the survey results, to measure how relevant and interesting the topic groups are, we computed the human-perceived grouping quality value of a surveyed topic group $Z_{survey}$ as:

$$\text{quality}(Z_{survey}) = \sum_{k=1}^{|Z_{survey}|} \sum_{i=1}^{|z_k|} \delta(w_i, z_k) \log(|e_{*,w_i}|),$$

where the $\delta$ function is defined as:

$$\delta(w_i, z_k) = \begin{cases} 1 & \text{if writer } w_i \text{ is related to topic group } z_k \\ -1 & \text{if writer } w_i \text{ is not related to topic group } z_k. \end{cases}$$

Note that in the above quality formula, the factor $\log(|e_{*,w_i}|)$ is multiplied in to assign higher weights to more popular writers, because they are more prominent and most people are interested in them. We add this weight for each true positive (related) and subtract it for each false positive (not related).
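For concreteness, the quality value is simple enough to compute in a few lines; the sketch below assumes the survey results are encoded as per-group lists of (writer, relevant?) judgments, which is our own encoding rather than the survey tool's actual format:

    import numpy as np

    def quality(surveyed_groups, follower_counts):
        """Human-perceived grouping quality over surveyed topic groups.

        surveyed_groups : list of groups, each a list of (writer_id, is_relevant)
        follower_counts : dict mapping writer_id -> follower count |e_{*,w}|
        """
        total = 0.0
        for group in surveyed_groups:
            for writer, is_relevant in group:
                weight = np.log(follower_counts[writer])  # popular writers weigh more
                total += weight if is_relevant else -weight
        return total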

Figure 3.11 reports the results of this survey, where each bar shows the quality value of one of the six approaches. From this graph, we observe the following.

1. Our approaches are very effective in improving the human-perceived quality of

Figure 3.11: Quality comparison

identified topic groups. For example, when compared to base, 2path achieves a 1.89 times higher quality value.

2. The 2path is more effective than any other approach. It also has the highest weighted true positive value (481.9) and the lowest weighted false positive value (51.4).

3. The non-popular achieves a higher quality value because its weighted false positive value is much lower than that of base (59.0 vs. 164.9).

4. The filter shows a better quality value than 2step although it shows a higher perplexity value. This is because filter increases precision by removing writers who are less relevant (though still relevant).

Due to limited resources, our survey could not cover all experimental cases. However, we found that the quality value is strongly negatively correlated with the perplexity value (a Pearson correlation coefficient of −0.806, excluding non-popular and hlda⁸). Thus, we believe that our other approaches showing lower perplexity values can also achieve higher quality values.

⁸ Their perplexity values are too high (outliers) and cannot be fairly compared.

3.5 Conclusion

In this chapter, we addressed the popularity bias problem that arises when we apply topic models to a social graph. Unlike in a textual dataset, where we can simply remove popular words before a topic-model analysis, a popular user in a social graph carries very important meaning and should be handled carefully. After carefully investigating this problem, we explored existing LDA variations, proposed two new procedural variations of LDA, and developed three new popularity-aware topic models.

In extensive experiments with a real-world Twitter social graph, most of our approaches achieved significant improvements in terms of lowering perplexity (i.e., better prediction power) and improving human-perceived clustering quality. In particular, our two-path model achieved an 8.14% lower perplexity value and a 1.89 times higher quality value than those of LDA. Since the popularity bias problem is not limited to social graphs but exists in various web service logs, such as visit logs, click logs, and purchase logs, where a limited number of popular items (nodes) are preferred by many users, our approaches can be effectively used to provide more relevant predictions and clusterings in such web services.

CHAPTER 4

Complex-Graph Analysis Using Topic Models

In this chapter, we propose a universal topic framework called "UniZ" to analyze a complex graph using topic models. We first justify our approach of applying probabilistic topic models to any type of edge in Section 4.2.1. Then, we explain how we can effectively incorporate previously learned topics in Section 4.2.2 and discuss major issues in the prior topic incorporation in Section 4.2.3. We evaluate our methods with two datasets: the widely-used Digital Bibliography & Library Project (DBLP)¹ dataset and search logs from the commercial search engine Bing.² We compare the prediction accuracy of our methods with that of state-of-the-art methods using the DBLP dataset in Section 4.3.1.2, and compare the recommendation performance of our methods with LDA using the search logs in Section 4.3.2.2. With the search logs, we also perform trend analyses on topic granularity and usage periods in Section 4.3.2.3, and attempt to propagate topics between very disparate search logs in Section 4.3.2.4.

4.1 Introduction

The problem with topic models is that they are usually limited to a few entity types and are not good at accommodating multiple entity types, which are quite common in real-world applications. For example, consider a topic model for a web graph depicted

¹ http://www.informatik.uni-trier.de/~ley/db/
² http://www.bing.com/

in Figure 4.1(c). This model tries to capture web users (U) who issue queries (Q) that contain multiple terms (T), visit relevant web documents (D), and click ads (A). In addition, web users follow other web users, and web documents link to other web documents. Unfortunately, a topic model of this complexity is often intractable for analysis. To address this complexity, we may simply decompose the model into multiple segments, where each segment contains a subset of entity types, and analyze each segment separately with standard topic models (such as Probabilistic Latent Semantic Analysis (PLSA) [25] and Latent Dirichlet Allocation (LDA) [5]). For example, in Figure 4.1(c), we may apply LDA to the follow edges between the two U nodes and obtain the topic groups of U. Similarly, we can apply LDA to other edges (such as the click edges between U and A) to obtain an estimate of each node's topic groups.

The problem with this approach is that the learned topics from each segment are not directly comparable. That is, the topic No. 1 obtained from the follow edges is totally different from the topic No. 1 obtained from the click edges. In fact, there is no guarantee that topic groups obtained from two LDA applications will be comparable. In principle, when there are N segments, there are N different topic spaces that are completely independent of each other.

In this chapter, we propose a novel universal topic framework which enables the representation of heterogeneous entities in a "single universal topic space". Our approach is based on the assumption that there is a hidden interest (topic) behind every relationship (edge) between two entities. Based on this assumption, we extend the follow-edge generative model [9] developed for social graph mining (explained in Section 3.2.1), and apply topic models to any type of edge between any types of entities.

To analyze arbitrarily complex topic models, we take an incremental approach, where we decompose a model into smaller segments and apply simple topic models

45 (a) Document-term graph

(b) DBLP graph

(c) Simple Web graph

Figure 4.1: Examples of HINs

to each segment. We then make it possible to incorporate and combine the results from distinct segments using an approach called "prior topic incorporation". More precisely, we propose two extensions to a simple topic model, the mixture model and the dual-prior model, which can effectively incorporate topics learned from the entities and relationships of other segments into the current segment. This prior topic incorporation keeps the topics coherent across all entities and relationships in a complex graph. By representing all the entities in the universal topic space, our framework provides the following benefits: (1) direct topical similarity comparison between any heterogeneous entities (e.g., using simple cosine similarity), (2) a richer representation of entities by leveraging all available signals (e.g., representing a user by both the users she follows and the movies she watches), and (3) prediction/recommendation performance improvements. Although the incremental approach was initially introduced in [16], where the authors first learn topics with PLSA and then propagate the learned topics with Expectation Maximization (EM), our approach incorporates previously learned topics more seamlessly within the LDA framework and produces improved performance.

We evaluate the effectiveness of our approach against many state-of-the-art methods using the widely-used Digital Bibliography & Library Project (DBLP) dataset. We also demonstrate a potential application of our approach with search logs collected from a commercial search engine, Bing. Since this work is our first step in this line of research, we also outline future work items at the end of this chapter.

In summary, we make the following contributions in this chapter.

• We propose a novel universal topic framework and two effective topic models which enable representation of different types of entities in a universal topic space.

• We evaluate the effectiveness of our approach with the state-of-the-art methods using the popular DBLP dataset. One of our models achieves the best prediction

47 performance.

• We demonstrate the huge potential of our approach in a real-world environment using search logs. In two recommendation tasks, our approach shows significant improvements.

4.2 Universal Topic Framework

We propose a novel universal topic framework called "UniZ". Rather than building a complex generative model for a complex HIN, which easily becomes intractable as the number of entity types increases, we take a "divide and conquer" strategy and decompose the HIN into multiple segments so that we can apply LDA to each segment. However, a simple divide-and-conquer analysis generates incomparable topic groups, where topic No. 1 in one segment (e.g., on music) is totally different from topic No. 1 in another segment (e.g., on politics). In this section, we first justify how we can apply topic models to any type of edge (not limited to textual edges or social follow edges) so that we can freely divide the HIN. Then, we propose two effective topic incorporation models. Finally, we discuss some major issues in the topic incorporation.

4.2.1 Edge Generative Model

To represent various types of entities in a universal topic space, we need a method to incorporate topics previously learned from other types of entities and their relationships into a topic inference process. For this "prior topic incorporation" (or simply "topic incorporation"), we first need to justify our approach of applying topic models to different kinds of edges. For this purpose, we extend the follow-edge generative model (explained in Section 3.2.1) to different edge types (including follow edges) between

different entities. The assumption behind our approach is that every edge is generated because of a hidden interest between two entities, regardless of their types.³ For example, let's say Alice is interested in K-pop (Korean pop). She recently heard about the song Gangnam Style from her friend and types the query gangnam style into a search engine to get more information about the song. From a search results page, she finds out that the singer is Psy, clicks Psy's Twitter page, and follows him on Twitter. Now, we can represent her actions in a single graph, as in Figure 4.2, having three different types of relationships (issues, visits, and follows) among four entities of three different types. The color bars next to each entity denote an interest distribution of that entity. In this example, she took these actions because she is interested in K-pop. If we know the color (interest or topic) distribution of each entity,⁴ we may probabilistically label each edge with an appropriate color (in this example red, which denotes K-pop). The color distribution of each entity is determined by the number of colored edges attached to the entity and is updated after coloring new edges. If we continue this process, we can label all the edges in a graph with appropriate colors from a single palette (i.e., a single topic space). In this way, heterogeneous edges and entities can be represented in a single universal topic space.

Now, we formalize our approach. For simplicity, we use the same notation used earlier to describe the follow-edge generative model in Section 3.2.1. Thus, for an edge, we denote the starting entity as a reader and the ending entity as a writer. When a reader generates an edge to a writer, she first picks an interest (topic) from a distribution $p(z|r)$ ($\theta$), and then picks a writer from a distribution $p(w|z)$ ($\phi$). We formulate the probability of a reader $r_a$ following a writer $w_b$ (or $w_{a,b}$) based on a certain interest $z$

³ This assumption holds only for an edge whose two connected entities are relevant to each other (e.g., a causal relationship, a containment relationship, a following relationship, etc.). If an edge is randomly generated or generated by a spammer (to every other entity), this assumption does not hold.
⁴ More precisely, its importance in an interest group as well.

49 Figure 4.2: Example of edge labeling

(or $z_{a,b}$) as follows:

$$p(z_{a,b}|\cdot) = p(z|r_a)\,p(w_{a,b}|z), \tag{4.1}$$

where $\cdot$ denotes the values of all the other random variables. We use $w_{a,b}$ to indicate an edge from a reader $r_a$ to a writer $w_b$. By considering Dirichlet priors $\alpha$ and $\beta$, constraining $\theta$ and $\phi$, the same equation can be represented as follows:

$$p(z_{a,b}|\cdot) \propto \int p(z|\theta)\,p(\theta|\alpha)\,d\theta \times \int p(w_{a,b}|z, \phi)\,p(\phi|\beta)\,d\phi. \tag{4.2}$$

This formula leads to a collapsed Gibbs sampling equation:

$$p(z_{a,b}|\cdot) \propto \frac{R^{-(a,b)}_{a,z_{a,b}} + \alpha_{z_{a,b}}}{R^{-(a,b)}_{a,*} + \alpha_*} \times \frac{W^{-(a,b)}_{w_{a,b},z_{a,b}} + \beta_{w_{a,b}}}{W^{-(a,b)}_{*,z_{a,b}} + \beta_*}, \tag{4.3}$$

where $R$ denotes an association count matrix between readers and topics, and $W$ denotes an association count matrix between writers and topics. $R$ and $W$ have readers and writers in their rows, respectively, and topics in their columns. Thus, $R_{a,z_{a,b}}$ denotes the element at row $a$ and column $z_{a,b}$, which is the number of associations between the reader $r_a$ and the topic $z_{a,b}$. The superscript $-(a,b)$ means that the count does not include the topic assigned to the edge from reader $r_a$ to writer $w_b$. To simplify the equation, we use the symbol $*$ to denote a summation over all possible subscript variables. For example, $\alpha_* = \sum_{z'} \alpha_{z'}$ and $W_{*,z_{a,b}} = \sum_{w'} W_{w',z_{a,b}}$.

50 (a) Mixture model (b) Dual-prior model

Figure 4.3: Proposed topic models

In a Gibbs-sampling-based inference, as the number of iterations increases, $R$ and $W$ provide increasingly better estimates of the joint probability distribution of a reader and a topic, $p(r, z)$, and the joint probability distribution of a writer and a topic, $p(w, z)$, respectively. After a sizable number of iterations, $R$ is used to estimate $p(z|r)$ after being normalized by row with the prior $\alpha$, and $W$ is used to estimate $p(w|z)$ after being normalized by column with the prior $\beta$, as noted in Equation (4.3). Thus, we may say that the matrices $R$ and $W$ contain the topics learned from the Gibbs sampling inference process. If we can effectively incorporate these learned topics when inferring topics for new relationships between different types of entities, we can represent these new relationships and entities in the same topic space.
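As a reference point for the extensions below, here is a minimal sketch of one collapsed Gibbs sweep implementing Equation (4.3) over arbitrary typed edges; the dense count matrices and the one-topic-per-edge layout are assumptions made for brevity, not the exact data structures of our implementation.

    import numpy as np

    def gibbs_sweep(edges, z, R, W, alpha, beta):
        """One sweep of collapsed Gibbs sampling over all edges (Eq. (4.3)).

        edges : list of (reader, writer) index pairs
        z     : current topic assignment per edge (consistent with R and W)
        R     : |readers| x |topics| reader-topic count matrix
        W     : |writers| x |topics| writer-topic count matrix
        """
        K = R.shape[1]
        for i, (r, w) in enumerate(edges):
            # Remove the current edge's topic (the -(a,b) superscript).
            R[r, z[i]] -= 1
            W[w, z[i]] -= 1
            # Reader factor x writer factor, evaluated for every topic at once.
            p = (R[r] + alpha) / (R[r].sum() + alpha * K) \
                * (W[w] + beta) / (W.sum(axis=0) + beta * W.shape[0])
            z[i] = np.random.choice(K, p=p / p.sum())
            R[r, z[i]] += 1
            W[w, z[i]] += 1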

4.2.2 Incorporating Learned Topics

As explained in the previous section, the learned topics from Gibbs-sampling-based LDA are stored in the reader-topic association count matrix R, and the writer-topic

association count matrix $W$. When we divide a complex HIN into multiple segments, we get a pair of $R$ and $W$ per segment. However, the pairs are totally different from each other and there is no easy way to combine them. Thus, we need an "incremental" approach of learning initial topics from one segment and incorporating these "old topics" when we learn "new topics" for another segment. We call this approach "prior topic incorporation" (or simply "topic incorporation") and propose two effective topic incorporation models: a mixture model and a dual-prior model.

4.2.2.1 Mixture Model

Since we want the new topics to be coherent with the old topics, one possible way of incorporating the old topics is to use a mixture representation of new and old topics. We modify the polya-urn model [2] to pick a topic or a writer from a linear combination of the old and new topic distributions. The original polya-urn model was developed to linearly combine a local and a global topic distribution. The Gibbs sampling equation for this approach becomes:

$$p(z_{a,b}|\cdot) \propto \left(\frac{R^{-(a,b)}_{a,z_{a,b}}}{R^{-(a,b)}_{a,*}} + \lambda_A \frac{A_{a,z_{a,b}} + \alpha_{z_{a,b}}}{A_{a,*} + \alpha_*}\right) \times \left(\frac{W^{-(a,b)}_{w_{a,b},z_{a,b}}}{W^{-(a,b)}_{*,z_{a,b}}} + \lambda_B \frac{B_{w_{a,b},z_{a,b}} + \beta_{w_{a,b}}}{B_{*,z_{a,b}} + \beta_*}\right), \tag{4.4}$$

where $A$ and $B$ are the old reader-topic and writer-topic association count matrices (previously learned), and $R$ and $W$ are the new reader-topic and writer-topic association count matrices (to be learned), respectively. There are two scalar weights, $\lambda_A$ and $\lambda_B$, which are used as concentration parameters. If $\lambda$ is high, the new topic distribution becomes similar to the old one. If $\lambda$ goes to zero, the model degenerates to LDA. Figure 4.3(a) depicts a plate notation of this model. We call this the mixture model.
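A sketch of the per-topic sampling weights of Equation (4.4); the old counts A and B are assumed to come from a previous run on another segment, and the zero-count guards are our own addition for the first iterations, when the new counts are still empty.

    import numpy as np

    def mixture_weights(r, w, R, W, A, B, alpha, beta, lam_A, lam_B):
        """Per-topic weights for the mixture model (Eq. (4.4)).

        R, W : new reader/writer-topic counts (current edge already removed)
        A, B : old reader/writer-topic counts learned from another segment
        """
        K = R.shape[1]
        new_r = R[r] / max(R[r].sum(), 1)                   # new reader distribution
        old_r = (A[r] + alpha) / (A[r].sum() + alpha * K)   # old (learned) distribution
        new_w = W[w] / np.maximum(W.sum(axis=0), 1)         # new writer dist. per topic
        old_w = (B[w] + beta) / (B.sum(axis=0) + beta * B.shape[0])
        p = (new_r + lam_A * old_r) * (new_w + lam_B * old_w)
        return p / p.sum()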

52 4.2.2.2 Dual-Prior Model

Another way of incorporating previously learned topics is to use them as "priors". Consider a topic count matrix $B$, which is learned from the relationships between papers and terms in Figure 4.1(b) and has papers in its rows and topics in its columns. If the value of $B_{p_1,z_1}$ is relatively higher than the other values in row $p_1$ of the matrix, it suggests that paper $p_1$ is very likely about topic $z_1$ (e.g., information retrieval). We can leverage this learned information when we infer topics for the relationships between authors and papers. If author $a_1$ wrote the paper $p_1$, we can infer that the author $a_1$ is interested in $z_1$ (information retrieval). Thus, topics learned from one type of relationship can be used as priors when we infer topics for a different type of relationship. One benefit of LDA is that we can seamlessly incorporate the priors into its equation. Since $\alpha$ and $\beta$ in Equation (4.3) are priors, we can extend the equation to:

$$p(z_{a,b}|\cdot) \propto \frac{R^{-(a,b)}_{a,z_{a,b}} + \lambda_A \cdot A_{a,z_{a,b}} + \alpha_{z_{a,b}}}{R^{-(a,b)}_{a,*} + \lambda_A \cdot A_{a,*} + \alpha_*} \times \frac{W^{-(a,b)}_{w_{a,b},z_{a,b}} + \lambda_B \cdot B_{w_{a,b},z_{a,b}} + \beta_{w_{a,b}}}{W^{-(a,b)}_{*,z_{a,b}} + \lambda_B \cdot B_{*,z_{a,b}} + \beta_*}, \tag{4.5}$$

where the scalar weights $\lambda_A$ and $\lambda_B$ are tunable parameters to normalize the magnitudes of different types of relationships, because the number of edges may differ greatly among relationships. As there are two types of priors, we call this the dual-prior model and depict its plate notation in Figure 4.3(b). The dual-prior model also degenerates to LDA when $\lambda$ goes to zero.
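The corresponding sketch for Equation (4.5) differs from the mixture model only in where the old counts enter: they are added inside the numerators and denominators as pseudo-counts, which is exactly what makes them behave as priors (array names and layout are again our own illustration).

    import numpy as np

    def dual_prior_weights(r, w, R, W, A, B, alpha, beta, lam_A, lam_B):
        """Per-topic weights for the dual-prior model (Eq. (4.5)).

        R, W : new reader/writer-topic counts (current edge already removed)
        A, B : old reader/writer-topic counts learned from another segment
        """
        K = R.shape[1]
        num_r = R[r] + lam_A * A[r] + alpha                 # old counts as pseudo-counts
        den_r = R[r].sum() + lam_A * A[r].sum() + alpha * K
        num_w = W[w] + lam_B * B[w] + beta
        den_w = W.sum(axis=0) + lam_B * B.sum(axis=0) + beta * W.shape[0]
        p = (num_r / den_r) * (num_w / den_w)
        return p / p.sum()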

Note that these two models originate from LDA but can work (as a framework) with any topic models delivering topic association count matrices. In particular, the mixture model can also work with topic models delivering only conditional distributions ($p(z|r)$ and $p(w|z)$) without joint distributions ($p(r, z)$ and $p(w, z)$, which can be produced from the raw topic count matrices). For example, when there are two types of edges, $E_A$ and $E_B$, and PLSA performs best for $E_A$ and LDA performs best

for $E_B$, it is possible to initially learn topics from $E_A$ using PLSA and incorporate the learned topics when we learn topics for $E_B$ using LDA.⁵ In this way, different types of edges can be treated differently.

4.2.3 Issues in Topic Incorporation

For proper topic incorporation, there are some issues to be considered. We discuss them in this section.

Incorporation depth: the inference process of Gibbs-sampling-based LDA consists of three stages: initialization, Gibbs sampling, and normalization. We tried three depths of topic incorporation: (1) only in the initialization stage, (2) up to the Gibbs sampling stage, and (3) up to the normalization stage (full incorporation). Although the first two produced slightly better results than the case without topic incorporation, the last produced the best results. Thus, we only report results from the full incorporation in this chapter.

Edge directionality: unlike textual corpora, which consist of asymmetric relationships from documents to terms, HINs can contain many types of symmetric relationships. For example, query-page relationships can be considered in both directions: query-page or page-query. Thus, we initially tried to derive Gibbs sampling equations for a symmetric model as well, for both the mixture model and the dual-prior model. However, the asymmetric and symmetric Gibbs sampling equations turn out to be equivalent, as mentioned in [25]:

$$p(z_{a,b}|\cdot) = p(z|r_a)\,p(w_{a,b}|z) \propto p(r_a)\,p(z|r_a)\,p(w_{a,b}|z) = p(z)\,p(r_a|z)\,p(w_{a,b}|z).$$

The difference between the asymmetric model and the symmetric model lies in the normalization stage after estimating the association count matrices $R$ and $W$. While the

⁵ The later topic model should be LDA in our current models.

asymmetric model obtains $p(z|r)$ and $p(w|z)$ from those matrices, the symmetric model obtains $p(z)$, $p(r|z)$, and $p(w|z)$. As we incorporate the matrices themselves, instead of the conditional distributions, in the topic incorporation, we do not need to care about symmetry for the topic incorporation.⁶

Incorporation order: there are many possible orders for the topic incorporation. For example, for the DBLP dataset shown in Figure 4.1(b), we can consider 6 possible incorporation orders: DT→UD→VD, DT→VD→UD, UD→VD→DT, UD→DT→VD, VD→UD→DT, and VD→DT→UD. In terms of a generative model, the order UD→DT→VD seems most reasonable and is also chronologically correct. However, for a complex graph such as the web graph in Figure 4.1(c), it is not easy to find an appropriate chronological order because each edge can be generated without any specific order. Thus, instead of the chronological rule, we came up with two general rules for deciding the incorporation order: (1) denser⁷ edges to sparser edges, and (2) textual edges to non-textual edges. Since later topic inferences are largely affected by early-set topics, it is very important to select appropriate initial edge types to start with. The denser edges obviously form better topics, and so do the textual edges, because they allow multiple edges between each document and term.⁸ We show experimental results on the incorporation order in Section 4.3.1.2.

Gibbs sampling traversal order: the Gibbs sampling stage usually has hundreds of

⁶ This does not mean that a matrix representation of observed edges is symmetric. For example, in an e-mail network, when user A sends 100 e-mails to user B and user B sends 10 e-mails to user A, there are 100 edges from user A to user B and 10 edges from user B to user A. In this case, the observation matrix is not symmetric.
⁷ We measured density by simply dividing the number of edges by the product of the number of unique readers and the number of unique writers. In the DBLP dataset, the venue entity type has only 20 unique venues. Thus, the edges from venues to papers are much denser than those from authors to papers, where there are 28,702 unique authors. For the same reason, if there were institutions and edges between the institutions and the authors, those edges could be effectively used to learn topics for authors more accurately, because the number of unique institutions is far smaller than that of unique authors.
⁸ If multiple edges are allowed, they can be used to measure the strength of the relationship. However, non-textual edges sometimes do not allow multiple edges between a reader and a writer. For example, a reader cannot follow the same writer more than once in a social graph.

iterations. When there are multiple types of edges, we can think of two traversal methods: (1) depth-first traversal, and (2) breadth-first traversal. For example, when the incorporation order for the DBLP dataset shown in Figure 4.1(b) is UD→DT→VD, the former first finishes all the iterations for UD, then moves to DT, and then moves to VD. On the other hand, the latter traverses all of UD, DT, and VD in one iteration and repeats this traversal. While the latter is expected to achieve better results, the former requires a smaller memory footprint and is more flexible when we combine very disparate datasets, as we will show in Section 4.3.2.4. We report results from the former in this chapter and leave the latter as future work. Note that if the breadth-first traversal is used, the issue of the topic incorporation order can be largely alleviated.

Learning λ: we can extend our models to make the parameters λ learnable from training data, instead of tuning them manually. This can be done by introducing binary latent variables x to indicate whether z comes from the old topics or the new ones. The random variables x follow the Bernoulli distribution with parameter λ. Then, the parameter λ can be estimated by inferring the values of the variables x. This extension, however, is computationally more expensive than our present models and requires a scalable inference technique. We also leave it as future work.

4.3 Experiments and Analyses

We evaluate our models with two types of datasets: a bibliographic dataset from DBLP and online search logs from a commercial search engine, Bing. The DBLP dataset is used to fairly evaluate our models against previous state-of-the-art models, and the online search logs are used to demonstrate the usefulness of our approach in a real environment.

56 4.3.1 DBLP Experiment

In this section, we report the prediction accuracy of our models on the widely-used DBLP dataset. Through this experiment, we show the following: (1) the dual-prior model performs better than the mixture model, (2) our dual-prior model achieves the best prediction accuracy, and (3) some topic incorporation orders are more effective than others.

4.3.1.1 Dataset

The DBLP dataset has been widely used in evaluating many prediction algorithms. As briefly described in Section 2.2, it consists of four types of entities and three types of relationships among them. We use the same DBLP dataset used in [42, 16, 12]. The numbers of entities and relationships are listed in the DBLP column of Table 4.1. Moreover, 4,057 authors, 100 papers, and all 20 venues are labeled with one of four categories: database (DB), data mining (DM), information retrieval (IR), and artificial intelligence (AI). We evaluate our models based on the prediction accuracy on these labeled entities.

4.3.1.2 Accuracy Analysis

To fairly evaluate our models, we follow the same approach as in [16, 12]. We compare the prediction accuracy of our models to that of the following state-of-the-art methods:

• Nonnegative Matrix Factorization (NMF) [29]

• Probabilistic Latent Semantic Analysis (PLSA) [25]

• Laplacian Probabilistic Latent Semantic Indexing (LapPLSI) [7]

• Latent Dirichlet Allocation (LDA) [5]

• Author-Topic Model (ATM) [40]

Table 4.1: Statistics of the datasets

          Meaning                DBLP        B-Log1     B-Log2    C-Log
    |V|   Venues                 20          -          -         -
    |U|   Users (Authors)        28,702      100,000    10,000    -
    |D|   Docs (Papers, Pages)   28,569      257,920    27,980    27,980
    |T|   Terms (Words)          11,771      117,116    24,124    24,124
    |Q|   Queries                -           281,332    30,242    30,242
    |DT|  DT edges               2,712,928   -          -         -
    |VD|  VD edges               28,569      -          -         -
    |UD|  UD edges               74,632      441,138    42,583    -
    |UQ|  UQ edges               -           483,839    46,498    -
    |QT|  QT edges               -           1,362,278  130,841   24,816,679
    |QD|  QD edges               -           -          -         16,058,804

• Ranking-based Clustering (NetClus) [43]

• Topic Model with Biased Propagation (TMBP) [16]

• Focused Topic Model (FTM) [47]

• Contextual Focused Topic Model (cFTM) [12]

Among these various methods, we briefly explain the top-3 performers (excluding ours) in Table 4.2 in terms of overall accuracy: ATM, TMBP, and cFTM. ATM adds additional author entities to LDA. When there are multiple authors for a paper, it attempts to find out who is the most probable author for each term in the paper. Thus, it has the effect of selecting each term from a more proper topic distribution, because each author has her own topic distribution. The cFTM extends FTM, which was developed to deal with a sparse (focused) set of topics, and incorporates additional contextual information (authors and venues) when selecting topics. While it automatically finds

a proper number of topics due to its non-parametric nature, it involves many parameters and requires a quite complicated inference process. Different from these two models, which focus on enriching θ (the topic distribution given other contexts), TMBP takes a different approach of topic propagation. After learning topics for papers from the relationships between papers and terms using PLSA, it propagates the learned topics to authors and venues using an Expectation Maximization (EM) algorithm. Compared to ATM and cFTM, our approach is simpler and more flexible. Also, while these models only enrich θ and are limited to a document-centered graph (e.g., DBLP), our approach can be applied to any type of graph because it can enrich φ as well as θ. Perhaps TMBP is closest to our approach in the sense that the new topics are inferred with the help of previously learned topics. However, our approach is based on LDA, which solves PLSA's overfitting problem, and incorporates the previously learned topics more seamlessly in LDA's inference process, compared to TMBP's two separate processes of PLSA-based topic inference and EM-based topic propagation.

As evaluation metrics, we use both simple accuracy (AC) and normalized mutual information (NMI) [49]. The AC is calculated with the following equation:

$$AC = \frac{\sum_{i=1}^{N} \delta(l'_i, map(l_i))}{N}, \tag{4.6}$$

where $l_i$ is the labeled category, $l'_i$ is the predicted category, and $N$ is the total number of labels. The $\delta(x, y)$ function produces 1 if $x = y$, and 0 otherwise. We need the $map(x)$ function because the predicted category numbering is usually different from the label category numbering. The NMI is defined as $MI(C,C')/MI(C,C)$, where $C$ is the labeled clustering and $C'$ is the predicted clustering. Mutual Information (MI) is calculated as:

$$MI(C,C') = \sum_{c_i \in C,\, c'_j \in C'} p(c_i, c'_j) \log_2 \frac{p(c_i, c'_j)}{p(c_i)\,p(c'_j)}, \tag{4.7}$$

where $p(c_i)$ and $p(c'_j)$ are the probabilities that a randomly selected document belongs to the clusters $c_i$ and $c'_j$, respectively, and $p(c_i, c'_j)$ denotes the joint probability that a document belongs to both clusters.
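Both metrics are straightforward to reproduce; the sketch below assumes integer label arrays, realizes the map(x) function of Equation (4.6) with the Hungarian algorithm (a common choice, though [49] does not mandate it), and follows this chapter's definition NMI = MI(C,C')/MI(C,C).

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def accuracy(true_labels, pred_labels):
        """AC of Eq. (4.6): best one-to-one mapping of predicted clusters
        to label categories, found with the Hungarian algorithm."""
        classes, clusters = np.unique(true_labels), np.unique(pred_labels)
        overlap = np.zeros((len(clusters), len(classes)))
        for i, c in enumerate(clusters):
            for j, l in enumerate(classes):
                overlap[i, j] = np.sum((pred_labels == c) & (true_labels == l))
        rows, cols = linear_sum_assignment(-overlap)   # maximize total overlap
        return overlap[rows, cols].sum() / len(true_labels)

    def nmi(true_labels, pred_labels):
        """NMI = MI(C, C') / MI(C, C), with MI as in Eq. (4.7)."""
        def mi(x, y):
            n, total = len(x), 0.0
            for cx in np.unique(x):
                for cy in np.unique(y):
                    p_xy = np.sum((x == cx) & (y == cy)) / n
                    if p_xy > 0:
                        p_x, p_y = np.sum(x == cx) / n, np.sum(y == cy) / n
                        total += p_xy * np.log2(p_xy / (p_x * p_y))
            return total
        return mi(true_labels, pred_labels) / mi(true_labels, true_labels)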

Table 4.2: Prediction accuracy on the DBLP dataset. Except for ours, all results are from [16, 12].

    Entity        Paper          Author         Venue          Average
    Metric (%)    AC     NMI     AC     NMI     AC     NMI     AC     NMI
    NMF           44.55  22.92   -      -       -      -       44.55  22.92
    PLSA          59.45  32.75   65.0   37.97   80.0   74.74   68.15  48.49
    LapPLSI       61.35  33.93   -      -       -      -       60.70  33.37
    LDA           47.00  20.48   -      -       -      -       47.00  20.48
    ATM           77.00  52.21   74.13  40.67   -      -       75.57  46.44
    NetClus       65.00  40.96   70.82  47.43   79.75  76.69   71.86  55.03
    TMBP-RW       73.10  53.13   82.59  67.76   81.75  77.53   79.15  66.14
    TMBP-Regu     79.15  59.16   89.81  74.25   82.75  76.56   83.90  69.99
    FTM           69.37  43.51   -      -       -      -       69.37  43.51
    cFTM          82.73  62.91   92.51  76.20   82.97  76.05   85.73  71.72
    UniZ-mix      79.20  53.54   79.40  49.05   90.00  86.18   82.87  62.93
    UniZ-dual     82.75  59.71   89.00  68.09   97.25  95.57   89.67  74.46

We report the average AC and NMI values from 20 runs in Table 4.2. We observe that our dual-prior model (UniZ-dual) outperforms all the other state-of-the-art methods in terms of overall average AC and NMI. Only cFTM clearly outperforms it, in the author prediction task. However, our models required only 100 iterations, which is orders of magnitude fewer than cFTM's 6,000.⁹ Also, our models are more general and can be applied to any HIN. Our dual-prior model is especially good at predicting venue information. Table 4.3 shows example venue clusters in the four categories.

⁹ One iteration of our models is also much cheaper.

Table 4.3: Venue clusters

    DB       DM      IR      AI
    VLDB     KDD     SIGIR   IJCAI
    ICDE     PAKDD   WWW     AAAI
    SIGMOD   ICDM    CIKM    ICML
    PODS     PKDD    ECIR    CVPR
    EDBT     SDM     AAAI    ECML

For our experiments, we set α = β = 1. We first learned topics from DT (the edges from D to T) and incorporated the learned topics to infer topics for VD with λ = 0.1 (i.e., 0.1 × B_DT). Then, we used a linear combination of the two learned topic matrices, 0.1 × B_DT + 100 × B_VD, as a prior to infer topics for UD. Note that we used λ to account for the magnitudes of different types of edges in the graph topology.¹⁰ Although the incorporation order DT→VD→UD usually produced the best prediction accuracy, we also observed that there are other effective orders (e.g., VD→DT→UD). Usually, a topic incorporation order from denser edges to sparser edges and from textual edges to non-textual edges produced better results,¹¹ ¹² as we expected in Section 4.2.3. This is because later inferences are largely affected by early-set topics, so it is important to select appropriate initial edge types to start with. Also, the dual-prior model almost always outperformed the mixture model in our experiments, probably because the former utilizes joint distributions containing more information, while the latter only utilizes conditional distributions, as explained in Section 4.2.2.

¹⁰ For λ, we tried 0.01, 0.1, 1, 10, and 100 and selected the one producing the best results.
¹¹ We measured density by simply dividing the number of edges by the matrix size. For example, in the DBLP dataset, density(DT) = 0.008 = 2712928/(28569 × 11771), density(VD) = 0.05, and density(UD) = 0.000091. Though DT has a much lower density than VD, it produced similar or better results, probably because DT is a textual dataset and allows multiple edges for each document and term.
¹² The overall accuracy achieved by the chronological order (UD→DT→VD) was 72.16.

61 (a) B-Log (user behavior log) (b) C-Log (click log)

Figure 4.4: Structures of two types of search logs

4.3.2 Online Search Experiments

In this section, we report the performance gains of our approach on tasks using real search logs from Bing. We evaluate our approach on two personalized recommendation tasks: (1) query recommendation, and (2) page recommendation. We also conduct a topic granularity analysis. Finally, we attempt to propagate topics across two very disparate datasets.

4.3.2.1 Datasets

We use two types of search logs for our experiments. The first search log, depicted in Figure 4.4(a), is a per-user search and browsing log (we call this B-Log from now on), which contains users, their search queries, and the pages they visited. Terms are simply the individual words in a query. We collected this log for 100K users (B-Log1). When we collected this log, we first sampled users and then included all the queries issued by those users and all the pages visited by them. We also collected another user behavior log with 10K users (B-Log2) for another experiment, which will be described shortly. The statistics of these logs are listed in Table 4.1. The density of each log is very low.

The other type of search log is an aggregated query-page click log (we call this C-Log from now on), depicted in Figure 4.4(b). While B-Log is a collection of queries and pages for individual users, C-Log consists of triplets (query, page, click count), where the click count is the number of clicks between the query and the page across all users (not limited to the users sampled in B-Log). Thus, C-Log has very different statistical characteristics compared to B-Log. Since C-Log is believed to be a very strong signal and is used in various fields of online search (e.g., query intent mining) [30], we attempt to incorporate topics learned from C-Log into B-Log2 in Section 4.3.2.4. The statistics of C-Log are also listed in Table 4.1.

4.3.2.2 Performance Analysis

Since we do not have any judged labels for the search logs, we use the widely-used perplexity metric [50, 23, 5, 24] to measure the prediction performance of our models. It is defined as:

$$\text{perplexity}(E_{test}) = \exp\left(-\frac{\sum_{e \in E_{test}} \log p(e)}{|E_{test}|}\right), \tag{4.8}$$

where $E_{test}$ denotes all the edges in a test dataset and $p(e)$ denotes an edge prediction probability (i.e., $p(q|u)$ for the query recommendation task and $p(d|u)$ for the page recommendation task). The perplexity quantifies the prediction power of a trained model by measuring how well the model handles unobserved test data. A lower perplexity means stronger prediction power of the model. We calculated the perplexity for a separate 10% randomly held-out dataset after training a model on the remaining 90% of the dataset.

Based on this perplexity metric, we evaluate our proposed models (the mixture model and the dual-prior model) on the two recommendation tasks: (1) query recommendation for a user using UQ (user-query edges), and (2) page recommendation for a user using UD (user-page edges). If the perplexity of a model is lower, it means

63 (a) UQ←UD (b) UD←UQ←QT

Figure 4.5: Examples of topic incorporation orders

that the model is better at recommending queries or pages to a user (i.e., predicting more probable queries or pages). We use LDA as a baseline for both tasks.¹³ We tried two topic incorporation orders for each task (four in total). Figure 4.5 illustrates two example incorporation orders. In Figure 4.5(a), topics learned from UD are incorporated when learning topics for UQ for the query recommendation task. We use the notation UQ←UD to denote this incorporation order (note that we use the reverse arrow (←) to put the recommendation task, UQ in this case, on the left). The bar under U indicates that the topics are incorporated onto users. Similarly, UD←UQ←QT in Figure 4.5(b) denotes a topic incorporation from QT to UQ to UD for the page recommendation task. When there is no ambiguity, we also use a simpler notation without the underbar and the arrow; with this simple notation, the former becomes UQ-UD and the latter becomes UD-UQ-QT. For our experiments, we set the parameters as

|Z| = 100, α = 0.01, and β = 0.1. We also set λ_A = λ_B = 1 because the numbers of edges do not differ from each other by orders of magnitude. The number of iterations is 100. We averaged perplexity values from five runs with different random seeds.

¹³ Although we acquired the source code of TMBP, we could not fine-tune it to produce good results on our machine. The cFTM is too expensive and cannot handle the topology of our search logs.

64 (a) Query recommendation

(b) Page recommendation

Figure 4.6: Performance Analysis

We report the perplexity values of LDA and our two models in Figure 4.6. We tested two topic incorporation orders for each task: (1) UQ-UD and UQ-QT for the query recommendation task, and (2) UD-UQ and UD-UQ-QT for the page recommendation task. For example, UQ-UD(mix) on the x-axis denotes the topic incorporation UQ←UD in the mixture model. The bars show perplexity values, and the leftmost bar shows the perplexity value of LDA, which does not benefit from topic incorporation. Our models are very effective in lowering perplexity because they leverage all the available signals. We also observe that the dual-prior model always performs better than the mixture model, as in Section 4.3.1.2. In particular, UQ-QT and UD-UQ-QT achieved the best results (−24.4% and −20.9%) for their respective tasks.

Table 4.5 illustrates the top entries in example topic clusters from the UD-UQ-QT case of the dual-prior model. All the clusters are related to Yahoo¹⁴ and have topic No. 43.¹⁵ The floating-point numbers in the left column denote the probability (importance) of items in each topic cluster. We observe that topics generated from our approach are coherent across all types of entities (terms,¹⁶ queries, and pages), meaning that various entities are represented in a single universal topic space. It also means that we can simply measure the topical similarity of two disparate entities by using a simple similarity measure like cosine similarity.
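For example, once a query and a page are both represented as topic vectors in the universal space (their rows of the normalized topic matrices), measuring their topical similarity is one line of arithmetic; the function below is a trivial sketch under that assumption.

    import numpy as np

    def topical_similarity(vec_a, vec_b):
        """Cosine similarity between two entities' topic vectors,
        regardless of entity type (user, query, page, ...)."""
        return vec_a @ vec_b / (np.linalg.norm(vec_a) * np.linalg.norm(vec_b))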

4.3.2.3 Topic Granularity Analysis

We perform trend analyses on two parameters, the number of topics and usage periods, to see how stable our approach is with respect to these two parameters. The

¹⁴ http://www.yahoo.com/
¹⁵ When we simply applied LDA to each segment without the topic incorporation, we got three different topics for topic No. 43: yahoo, stores, and sports. Some may argue that there should be a yahoo group with a different topic number. In our experimental results, the LDA-generated topic No. 57 for UD is about yahoo but contained many facebook-related entries. Thus, it is not just a topic-number difference: the LDA-generated topic spaces for different segments are totally different.
¹⁶ The terms are from services provided by Yahoo.

Table 4.4: Example topic cluster without UniZ (topic number 43)

    A term group from QT
    0.3130  yahoo
    0.1195  mail
    0.0330  login
    0.0255  sign
    0.0116  email
    0.0102  finance
    0.0097  inbox

    A query group from UQ
    0.0074  home depot
    0.0060  lowes
    0.0033  walmart
    0.0016  sears
    0.0011  target
    0.0009  ksl
    0.0008  menards

    A page group from UD
    0.0044  http://espn.go.com/
    0.0018  http://www.foxnews.com/
    0.0009  https://www.weather.com/
    0.0006  https://sports.yahoo.com/
    0.0005  http://www.nascar.com/
    0.0003  http://www.nfl.com/
    0.0003  http://www.wlox.com/

Table 4.5: Example topic cluster from UniZ

    A term group from QT
    0.3130  yahoo
    0.1195  mail
    0.0330  login
    0.0255  sign
    0.0116  email
    0.0102  finance
    0.0097  inbox

    A query group from UQ-QT
    0.1366  yahoo
    0.0624  yahoo mail
    0.0312  yahoo mail sign
    0.0278  yahoo com
    0.0230  yahoo mail login
    0.0097  yahoo finance
    0.0088  yahoo commail

    A page group from UD-UQ-QT
    0.0874  http://www.yahoo.com/
    0.0520  http://mail.yahoo.com/
    0.0319  https://mail.yahoo.com
    0.0129  https://login.yahoo.com/config/login?.src=my&
    0.0122  http://mlogin.yahoo.com/
    0.0069  http://finance.yahoo.com/
    0.0029  https://login.yahoo.com/

Figure 4.7: Topic Granularity Analysis

number of topics is related to how granular the learned topics are. If the topic granularity is high, it means we can get finer clusters and achieve more accurate targeting. Thus, more granular topics are usually preferred. In our experiments, the perplexity of LDA, which provides the seed topics in our framework, did not show a noticeable decrease (i.e., performance improvement) as the number of topics approached 100. Rather, it showed an increase (i.e., performance degradation) when the number of topics increased from 100 to 150.

However, when we incorporated topics from other types of edges, we observed perplexity drops even when the number of topics increased beyond 150. Figure 4.7, generated from B-Log1, shows the changes in perplexity drop rates (compared to LDA's perplexity) as the number of topics increases from 10 to 200. We observe that UQ←QT and UD←UQ←QT have consistently lower perplexity values as the number of topics increases. Note that our framework is about the topic incorporation (not LDA itself) and can work with more granular topic models than LDA.

69 4.3.2.4 Incorporating a Query-Page Click Log

In this section, we investigate the effect of topic incorporation from C-Log to B-Log. Because C-Log contains query-page click counts across all users, it is considered a very strong signal between queries and pages, and has been widely used in modeling query intent in online search [30]. However, the characteristics of C-Log are very different from those of B-Log (aggregated, denser, and extremely power-law). We incorporate topics learned from C-Log into B-Log to improve performance on our proposed recommendation tasks. Because the number of edges in C-Log for 100K users was too large to handle on one machine, we reduced the number of users to 10K and prepared B-Log2. The C-Log dataset was collected so that all the queries and pages in B-Log2 are included.

In Figure 4.8, the leftmost bar is the result from LDA, the next two are from only B-Log2, and the last one is from the combination of C-Log and B-Log2. We set λ = 0.01 when we incorporate C-Log into B-Log2 due to the huge difference in the numbers of edges. We observe that adding the C-Log signal produces the second-best result (−16.5%) in the query recommendation task, and the best result (−19.3%) in the page recommendation task. This shows that our framework is effective even between very disparate datasets.

4.4 Conclusion

In this chapter, we introduced a universal topic framework called "UniZ", which represents various types of entities and their edges in a single topic space. By incorporating previously learned topics, UniZ improves prediction and recommendation performance. We also proposed two novel and effective topic models in this framework: the mixture model and the dual-prior model. In a DBLP prediction task, one of our models performed better than all the other state-of-the-art methods. They also achieved

70 (a) Query recommendation

(b) Page recommendation

Figure 4.8: Incorporation of C-Log (the last bars show results using C-Log)

significant improvements in the query and page recommendation tasks performed with real search logs. We also demonstrated the huge potential of our approach in dealing with granular topics and disparate datasets.

CHAPTER 5

Related Work

In this chapter, we briefly review topic models that can analyze graphs beyond a simple bipartite graph. We group them into three categories: topic models for (1) authorship, (2) hypertexts, and (3) edges.

The topic models in the first category were proposed to analyze documents (texts) together with their authors. As these models incorporate authors and their relationships, they can be viewed as early forms of social-network topic models. They attempt to group documents and authors by assuming that a document is created by authors sharing common topics. The concept of authors (users) was initially introduced by Steyvers et al. [40] in the Author-Topic (AT) model. With the additional co-author information, they could successfully extract hidden research topics and trends from CiteSeer's abstract corpus. The AT model was extended to the Community User Topic (CUT) model by Zhou et al. [51] to capture semantic communities. McCallum et al. [32] also extended the AT model and proposed the Author-Recipient-Topic (ART) model and the Role-Author-Recipient-Topic (RART) model to analyze e-mail networks. Pathak et al. [36] modified the ART model and suggested the Community-Author-Recipient-Topic (CART) model, which is similar to the RART model. In addition to this AT model family, other LDA extensions and probabilistic topic models have been proposed to analyze chat data [44], voting data [45], annotation data [26], and tagging data [22].

The topic models in the second category are more closely related to social

network analysis and analyze documents together with their citations (i.e., hypertexts). Cohn et al. [13] initially introduced a topic model combining PLSA [25] and PHITS [14]. Later, the PLSA in this model was replaced with LDA [5] by Erosheva et al. [18]. Nallapati et al. [35] extended Erosheva's model and proposed the Link-PLSA-LDA model, which applies PLSA and LDA to cited and citing documents, respectively. Chang et al. [11] also proposed the Relational Topic Model (RTM), which models a citation as a binary random variable. Dietz et al. [17] proposed a topic model to analyze the topical influence of research publications. More sophisticated models were proposed by Gruber et al. [20] and Sun et al. [41]. Hybrid approaches have also been attempted: Mei et al. [33] introduced a regularized topic modeling framework incorporating a graph structure, and Nallapati et al. [34] combined network flow and topic modeling.

The topic models in the last category use only linkage (edge) information. Since they focus solely on the graph structure, they can be easily applied to a variety of datasets. However, there has been relatively less research in this category. Our work belongs to this category and focuses on solving the issue caused by popular nodes in the graph structure and on analyzing complex graphs. Airoldi et al. [3] proposed the Mixed Membership Stochastic Blockmodel (MMSB) to analyze pairwise measurements such as social networks and protein interaction networks. Zhang et al. [50] and Henderson et al. [23] dealt with the issues in applying LDA to academic social networks. The former focused on the issue of converting co-authorship information into a graph and proposed edge weighting schemes based on collaboration frequency. The latter addressed the issue of a large number of topic clusters generated due to “low popularity” nodes in the network. The “high popularity” issue was initially addressed by Steck [38], who defined a new metric called Popularity-Stratified Recall and suggested a matrix factorization method optimizing it. We have published two papers in this category. In our first paper [10], we investigated the issue in more detail and proposed effective solutions based on probabilistic topic models. Then, in our following work [9], we took a more principled approach to this issue and proposed new topic models that explicitly model a writer’s popularity.

CHAPTER 6

Conclusions and Future Work

In this dissertation, we effectively extended probabilistic topic models, which were originally developed for analyzing a textual corpus, to analyze a more general graph by addressing two challenges: (1) handling the popularity bias caused by a limited number of very popular nodes, and (2) analyzing a complex graph beyond a simple bipartite graph.

For the first challenge, we investigated the popularity bias problem that arises when we apply topic models to a social graph. We explored various LDA extensions and proposed new topic models appropriate for analyzing the social graph. Unlike in a textual dataset, a popular user (node) carries a very important meaning in the social graph and should be handled carefully. In extensive experiments with a real-world Twitter dataset, our approaches achieved significant improvements in terms of recommendation performance and human-perceived clustering quality. Since the popularity bias is universal in various datasets, including web page visit logs, advertisement click logs, and product purchase logs, our models can effectively provide more relevant recommendations and clusterings in many web services.

For the second challenge, we proposed a universal topic framework called “UniZ”, which represents various types of entities and their edges in a single topic space. By incorporating previously learned topics, UniZ improves prediction and recommendation performance. We also proposed two novel and effective topic models in this framework: the mixture model and the dual-prior model. In a DBLP prediction task, one of our models performed better than all the other state-of-the-art methods. They also achieved significant improvements in query and page recommendation tasks performed with real search logs. We also demonstrated the huge potential of our approach by showing how it deals with granular topics and disparate datasets.

While the improvements achieved by our approach are significant, there remain a few future work items, especially for the second challenge. Firstly, we covered only a limited set of entity types in web service logs, since this work is our first step in this research and our main focus was to show the potential of our approach in a real-world scenario. It would be interesting to include more diverse entity types, as depicted in Figure 4.1(c). Secondly, we plan to implement the breadth-first traversal explained in Section 4.2.3. It is expected to produce better results with extra Gibbs sampling iterations and to alleviate the issue of the topic incorporation order. Finally, we plan to extend our models to make the parameters λA and λB learnable from training data, instead of tuning them manually.
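For reference, the following is a minimal sketch of one plausible way the manual tuning could be carried out today: a grid search over candidate values of λA and λB that minimizes held-out perplexity. The callables train_model and perplexity are hypothetical stand-ins, not functions from our implementation.

    from itertools import product

    def tune_lambdas(train_model, perplexity, train_edges, heldout_edges,
                     grid=(0.001, 0.01, 0.1, 1.0)):
        # Exhaustive search over candidate (lambda_A, lambda_B) pairs; a
        # learnable formulation would replace this loop with updates
        # driven directly by the training data.
        best = None
        for lam_a, lam_b in product(grid, grid):
            model = train_model(train_edges, lam_a, lam_b)
            ppl = perplexity(model, heldout_edges)
            if best is None or ppl < best[0]:
                best = (ppl, lam_a, lam_b)
        return best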

REFERENCES

[1] Multinomial distribution. http://en.wikipedia.org/wiki/Multinomial_distribution.

[2] Amr Ahmed, Yucheng Low, Mohamed Aly, Vanja Josifovski, and Alexander J. Smola. Scalable distributed inference of dynamic user interests for behavioral targeting. In KDD, pages 114–122, 2011.

[3] Edoardo M. Airoldi, David M. Blei, Stephen E. Fienberg, and Eric P. Xing. Mixed membership stochastic blockmodels. J. Mach. Learn. Res., 2008.

[4] David M. Blei, Thomas L. Griffiths, Michael I. Jordan, and Joshua B. Tenenbaum. Hierarchical topic models and the nested Chinese restaurant process. In Neural Information Processing Systems, 2003.

[5] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.

[6] Sergey Brin and Larry Page. The anatomy of a large-scale hypertextual web search engine. Computer Networks, 1998.

[7] Deng Cai, Qiaozhu Mei, Jiawei Han, and Chengxiang Zhai. Modeling hidden topics on document manifold. In Proceedings of the 17th ACM conference on Information and knowledge management, CIKM ’08, pages 911–920, New York, NY, USA, 2008. ACM.

[8] Kevin R. Canini, Lei Shi, and Thomas L. Griffiths. Online inference of topics with latent Dirichlet allocation. In Artificial Intelligence and Statistics, 2009.

[9] Youngchul Cha, Bin Bi, Chu-Cheng Hsieh, and Junghoo Cho. Incorporating popularity in topic models for social network analysis. In Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval, SIGIR ’13, pages 223–232, New York, NY, USA, 2013. ACM.

[10] Youngchul Cha and Junghoo Cho. Social-network analysis using topic models. In Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval, SIGIR ’12, pages 565–574, New York, NY, USA, 2012. ACM.

[11] Jonathan Chang and David Blei. Relational topic models for document networks. In AIStats, 2009.

[12] Xu Chen, Mingyuan Zhou, and Lawrence Carin. The contextual focused topic model. In Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining, KDD ’12, pages 96–104, New York, NY, USA, 2012. ACM.

[13] D. Cohn and T. Hofmann. The missing link - a probabilistic model of document content and hypertext connectivity. In NIPS ’00: Advances in Neural Information Processing Systems. MIT Press, Cambridge, MA, 2000.

[14] David Cohn and Huan Chang. Learning to probabilistically identify authoritative documents. In Proceedings of the Seventeenth International Conference on Machine Learning, ICML ’00, pages 167–174, San Francisco, CA, USA, 2000. Morgan Kaufmann Publishers Inc.

[15] Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 1990.

[16] Hongbo Deng, Jiawei Han, Bo Zhao, Yintao Yu, and Cindy Xide Lin. Probabilistic topic models with biased propagation on heterogeneous information networks. In Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining, KDD ’11, pages 1271–1279, New York, NY, USA, 2011. ACM.

[17] Laura Dietz, Steffen Bickel, and Tobias Scheffer. Unsupervised prediction of citation influences. In Proceedings of the 24th International Conference on Machine Learning, pages 233–240, 2007.

[18] Elena Erosheva, Stephen Fienberg, and John Lafferty. Mixed membership models of scientific publications. Proceedings of the National Academy of Sciences, 2004.

[19] Mark Girolami and Ata Kaban. On an equivalence between PLSI and LDA. In SIGIR, 2003.

[20] Amit Gruber, Michal Rosen-Zvi, and Yair Weiss. Latent topic models for hypertext. In UAI 2008, Proceedings of the 24th Conference in Uncertainty in Artificial Intelligence, July 9-12, 2008, Helsinki, Finland, pages 230–239. AUAI Press, 2008.

[21] John Hannon, Mike Bennett, and Barry Smyth. Recommending Twitter users to follow using content and collaborative filtering approaches. In RecSys, pages 199–206. ACM, 2010.

[22] Morgan Harvey, Ian Ruthven, and Mark J. Carman. Improving social bookmark search using personalised latent variable language models. In WSDM, 2011.

[23] Keith Henderson and Tina Eliassi-Rad. Applying latent Dirichlet allocation to group discovery in large graphs. In Proceedings of the 2009 ACM symposium on Applied Computing, 2009.

[24] Matthew D. Hoffman, David M. Blei, and Francis Bach. Online learning for latent Dirichlet allocation. In NIPS, 2010.

[25] Thomas Hofmann. Probabilistic latent semantic analysis. In Proc. of Uncertainty in Artificial Intelligence, UAI ’99, pages 289–296, 1999.

[26] Tomoharu Iwata, Takeshi Yamada, and Naonori Ueda. Modeling social annotation data with content relevance using a topic model. In NIPS, 2009.

[27] Akshay Java, Xiaodan Song, Tim Finin, and Belle L. Tseng. Why we twitter: An analysis of a microblogging community. In WebKDD/SNA-KDD, volume 5439 of Lecture Notes in Computer Science, pages 118–138. Springer, 2007.

[28] Haewoon Kwak, Changhyun Lee, Hosung Park, and Sue Moon. What is Twitter, a social network or a news media? In Proceedings of the 19th international conference on World Wide Web, WWW ’10, pages 591–600, New York, NY, USA, 2010. ACM.

[29] Daniel D. Lee and H. Sebastian Seung. Algorithms for non-negative matrix factorization. In NIPS, pages 556–562. MIT Press, 2001.

[30] Xiao Li, Ye-Yi Wang, and Alex Acero. Learning query intent from regularized click graphs. In Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR ’08, pages 339–346, New York, NY, USA, 2008. ACM.

[31] S. P. Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, pages 129–137, 1982.

[32] Andrew McCallum, Xuerui Wang, and Andres Corrada-Emmanuel. Topic and role discovery in social networks with experiments on Enron and academic email. Journal of Artificial Intelligence Research, 30:249–272, 2007.

[33] Qiaozhu Mei, Deng Cai, Duo Zhang, and ChengXiang Zhai. Topic modeling with network regularization. In Proceedings of the 17th international conference on World Wide Web, WWW ’08, pages 101–110, New York, NY, USA, 2008. ACM.

[34] Ramesh Nallapati, Daniel A. McFarland, and Christopher D. Manning. TopicFlow model: Unsupervised learning of topic-specific influences of hyperlinked documents. Journal of Machine Learning Research - Proceedings Track, 15:543–551, 2011.

[35] Ramesh M. Nallapati, Amr Ahmed, Eric P. Xing, and William W. Cohen. Joint latent topic models for text and citations. In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, KDD ’08, pages 542–550, New York, NY, USA, 2008. ACM.

[36] Nishith Pathak, Colin DeLong, Arindam Banerjee, and Kendrick Erickson. Social topic models for community extraction. In The 2nd SNA-KDD Workshop, 2008.

[37] Marco Pennacchiotti and Siva Gurumurthy. Investigating topic models for social media user recommendation. In Proceedings of the 20th international conference companion on World Wide Web, WWW ’11, pages 101–102, New York, NY, USA, 2011. ACM.

[38] Harald Steck. Item popularity and recommendation accuracy. In Proceedings of the fifth ACM conference on Recommender systems, RecSys ’11, pages 125–132, New York, NY, USA, 2011. ACM.

[39] Mark Steyvers and Thomas L. Griffiths. Probabilistic topic models. Handbook of Latent Semantic Analysis, 2007.

[40] Mark Steyvers, Padhraic Smyth, Michal Rosen-Zvi, and Tom Griffiths. Probabilistic author-topic models for information discovery. In SIGKDD, 2004.

[41] Congkai Sun, Bin Gao, Zhenfu Cao, and Hang Li. HTM: a topic model for hypertexts. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP ’08, pages 514–522, Stroudsburg, PA, USA, 2008. Association for Computational Linguistics.

[42] Yizhou Sun and Jiawei Han. Mining heterogeneous information networks: a structural analysis approach. SIGKDD Explor. Newsl., 14(2):20–28, April 2013.

[43] Yizhou Sun, Yintao Yu, and Jiawei Han. Ranking-based clustering of heterogeneous information networks with star network schema. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, KDD ’09, pages 797–806, New York, NY, USA, 2009. ACM.

[44] Ville Tuulos and Henry Tirri. Combining topic models and social networks for chat data mining. In Proc. of the 2004 IEEE/WIC/ACM International Conference on Web Intelligence, 2004.

[45] Xuerui Wang, Natasha Mohanty, and Andrew McCallum. Group and topic discovery from relations and text. In Proc. 3rd international workshop on Link discovery, pages 28–35. ACM, 2005.

[46] Michael J. Welch, Uri Schonfeld, Dan He, and Junghoo Cho. Topical semantics of Twitter links. In WSDM, 2011.

[47] Sinead Williamson, Chong Wang, Katherine A. Heller, and David M. Blei. The IBP compound Dirichlet process and its application to focused topic modeling. In ICML, pages 1151–1158, 2010.

[48] Andrew T. Wilson and Peter A. Chew. Term weighting schemes for latent Dirichlet allocation. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, HLT ’10, pages 465–473, Stroudsburg, PA, USA, 2010. Association for Computational Linguistics.

[49] Wei Xu, Xin Liu, and Yihong Gong. Document clustering based on non-negative matrix factorization. In Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR ’03, pages 267–273, New York, NY, USA, 2003. ACM.

[50] Haizheng Zhang, Baojun Qiu, C. Lee Giles, Henry C. Foley, and John Yen. An LDA-based community structure discovery approach for large-scale social networks. In IEEE International Conference on Intelligence and Security Informatics, pages 200–207, 2007.

[51] Ding Zhou, Eren Manavoglu, Jia Li, C. Lee Giles, and Hongyuan Zha. Probabilistic models for discovering e-communities. In World Wide Web Conference, 2006.
