Collaborative Filtering for Orkut Communities: Discovery of User Latent Behavior
Wen-Yen Chen* (University of California, Santa Barbara, CA 93106) [email protected]
Jon-Chyuan Chu* (MIT, Cambridge, MA 02139) [email protected]
Junyi Luan* (Peking University, Beijing 100084, China) [email protected]
Hongjie Bai (Google Research, Beijing 100084, China) [email protected]
Yi Wang (Google Research, Beijing 100084, China) [email protected]
Edward Y. Chang (Google Research, Beijing 100084, China) [email protected]

* Work done while the first three authors were interns at Google Research, Beijing.

Copyright is held by the International World Wide Web Conference Committee (IW3C2). Distribution of these papers is limited to classroom use, and personal use by others.
WWW 2009, April 20–24, 2009, Madrid, Spain.
ACM 978-1-60558-487-4/09/04.

ABSTRACT
Users of social networking services can connect with each other by forming communities for online interaction. Yet as the number of communities hosted by such websites grows over time, users have an even greater need for effective community recommendations in order to meet more users. In this paper, we investigate two algorithms from very different domains and evaluate their effectiveness for personalized community recommendation. First is association rule mining (ARM), which discovers associations between sets of communities that are shared across many users. Second is latent Dirichlet allocation (LDA), which models user-community co-occurrences using latent aspects. In comparing LDA with ARM, we are interested in discovering whether modeling low-rank latent structure is more effective for recommendations than directly mining rules from the observed data. We experiment on an Orkut data set consisting of 492,104 users and 118,002 communities. Our empirical comparisons using the top-k recommendations metric show that LDA performs consistently better than ARM for the community recommendation task when recommending a list of 4 or more communities. However, for recommendation lists of up to 3 communities, ARM is still a bit better. We analyze examples of the latent information learned by LDA to explain this finding. To efficiently handle the large-scale data set, we parallelize LDA on distributed computers [1] and demonstrate our parallel implementation's scalability with varying numbers of machines.

Categories and Subject Descriptors
H.2.8 [Database Management]: Database Applications – Data Mining; H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval – Information Filtering; H.3.4 [Information Storage and Retrieval]: Systems and Software – Distributed Systems

General Terms
Algorithms, Experimentation

Keywords
Recommender systems, collaborative filtering, association rule mining, latent topic models, data mining

1. INTRODUCTION
Much information is readily accessible online, yet we as humans have no means of processing all of it. To help users overcome the information overload problem and sift through huge amounts of information efficiently, recommender systems have been developed to generate suggestions based on user preferences. In general, recommender systems can be classified into two categories: content-based filtering and collaborative filtering. Content-based filtering analyzes the association between user profiles and the descriptions of items. To recommend new items to a user, the content-based filtering approach matches the new items' descriptions to those items known to be of interest to the user. On the other hand, the collaborative filtering (CF) approach does not need content information to make recommendations. The operative assumption underlying CF is that users who had similar preferences in the past are likely to have similar preferences in the future. As such, CF uses information about similar users' behaviors to make recommendations. Besides avoiding the need for collecting extensive information about items and users, CF requires no domain knowledge and can be easily adopted across different recommender systems.

In this paper, we focus on applying CF to community recommendation. To investigate which notions of similarity are most useful for this task, we examine two approaches from different fields.

First, association rule mining (ARM) [2] is a data mining algorithm that finds association rules based on frequently co-occurring sets of communities and then makes recommendations based on the rules. For example, if users in the community "New York Yankees" usually join the community "MLB" (for Major League Baseball), then ARM will discover a rule connecting the communities "New York Yankees" and "MLB." If you join "New York Yankees," you can be recommended "MLB." Generally speaking, ARM can discover explicit relations between communities based on their co-occurrences across multiple users.

Second, latent Dirichlet allocation (LDA) [5] is a machine-learning algorithm that models user-community co-occurrences using latent aspects and makes recommendations based on the learned model parameters. Unlike ARM, LDA models the implicit relations between communities through the set of latent aspects. Following the previous example, suppose we have another community called "New York Mets" that shares many users in common with "MLB" but not with "New York Yankees." If you are a user joining "New York Mets," you would not receive recommendations from ARM for "New York Yankees," because there are not enough "New York Mets"-"New York Yankees" co-occurrences to support such a rule. This is the problem of sparseness in explicit co-occurrence. On the other hand, LDA can possibly recommend "New York Yankees," because there may exist an implicit relation (semantically, "baseball") among the three communities. However, implicit co-occurrence is not always inferred correctly. Suppose the community "Boston Red Sox" occurs frequently with "MLB" but not with "New York Yankees." With LDA, if you were to join "Boston Red Sox," you might receive a recommendation for "New York Yankees," which is a bad recommendation, as the two teams have a longstanding rivalry. This is the problem of noise in inferred implicit co-occurrence.

Explicit relations suffer from the problem of sparseness, but are usually quite accurate. Modeling implicit relations can overcome the sparseness problem; however, doing so may introduce noise. In what scenarios does each type of relation manifest its strengths and weaknesses? In particular, how do membership size (the number of communities joined by a given user) and community size (the number of users in a given community) affect the sparseness of explicit relations mined using ARM?

• Membership size: Would explicit relations be more effective at recommendations for active users, ones who have joined many communities?

• Community size: Would implicit relations be more effective at recommending new or niche communities with few members?

Similarly, how do membership spread (the diversity among the communities that a user joins) and community spread (the diversity among the users of a particular community) affect the noise in the implicit relations inferred by a latent topic model such as LDA?

• Membership spread: Would explicit relations be more effective at recommendations for a diverse user, one who is involved in a miscellaneous set of communities?

• Community spread: Would implicit relations be more effective at recommending umbrella communities, those composed of many smaller, tighter sub-communities?

This paper makes the following contributions:

• We apply both algorithms to an Orkut data set consisting of 492,104 users and 118,002 communities. Our empirical comparisons using the top-k recommendations metric show a surprisingly intuitive finding: that LDA performs consistently better than ARM for the community recommendation task when recommending a list of 4 or more communities. However, for recommending up to 3 communities, ARM is still a bit better.

• We further analyze the latent information learned by LDA to help explain this finding.

• We parallelize LDA to take advantage of the distributed computing infrastructure of modern data centers. Our scalability study on the same Orkut data set shows that our parallelization can reduce the training time from 8 hours to less than 46 minutes using up to 32 machines. (We have made parallel LDA open-source [1].)

The remainder of this paper is organized as follows: In Section 2, we briefly introduce ARM and LDA and work related to them. In Section 3, we show how ARM and LDA can be adapted for the community recommendation task. We present our parallelization framework for LDA in Section 4 and an empirical study on our Orkut data set in Section 5. Finally, we offer our concluding remarks in Section 6.

2. RELATED WORK
Association rule mining (ARM) is a well-researched algorithm for discovering correlations between sets of items. It was originally motivated by the analysis of supermarket transaction data, with the goal of discovering which itemsets are often purchased together. For example, the rule X ⇒ Y means that customers who purchased X would likely also purchase Y. To select interesting rules from the set of all possible rules, support and confidence are used as constraints. The support of an itemset X, supp(X), is defined as the number of transactions in the data set that contain the itemset. A frequent itemset is one whose support value exceeds a predefined threshold. An association rule is of the form X ⇒ Y, where X and Y are itemsets. The support of a rule is defined as supp(X ⇒ Y) = supp(X ∪ Y), and the confidence of a rule is defined as conf(X ⇒ Y) = supp(X ∪ Y) / supp(X). Association rule mining is the process of finding all rules whose support and confidence exceed specified thresholds.

Since ARM was first introduced in [2], various algorithms have been designed to mine frequent itemsets efficiently. In [3] and [18], a fast algorithm called Apriori was proposed to exploit the monotonicity property of the support of itemsets and the confidence of association rules. Subsequently, FP-growth [9] employed depth-first search for mining frequent itemsets and has been shown to outperform Apriori in most cases. ARM has been applied to several CF tasks.
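The support and confidence definitions above can be illustrated with a minimal sketch, treating each user's set of joined communities as one transaction. The transactions and thresholds here are hypothetical toy data, not values from the paper:

```python
# Each "transaction" is the set of communities one user has joined
# (hypothetical data for illustration only).
transactions = [
    {"New York Yankees", "MLB"},
    {"New York Yankees", "MLB"},
    {"New York Mets", "MLB"},
    {"Boston Red Sox", "MLB"},
    {"New York Yankees"},
]

def supp(itemset):
    """Support: number of transactions containing every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

def conf(x, y):
    """Confidence of the rule X => Y: supp(X U Y) / supp(X)."""
    return supp(x | y) / supp(x)

# The rule {"New York Yankees"} => {"MLB"} is supported by 2 of the 3
# transactions that contain "New York Yankees".
x, y = {"New York Yankees"}, {"MLB"}
print(supp(x | y))  # 2
print(conf(x, y))   # 0.666...
```

With a minimum support of 2 and a minimum confidence of 0.5, this rule would be kept, and a user joining "New York Yankees" would be recommended "MLB," matching the explicit-relation example in the introduction.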
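By contrast, the LDA-style recommendation sketched in the introduction scores a community for a user by summing over latent aspects. The two-aspect parameters below are hypothetical toy numbers (not learned values from the paper's model), chosen so that an implicit "baseball" aspect links "New York Mets" to "New York Yankees":

```python
# Hypothetical learned parameters of a 2-aspect model; aspect 0 acts
# like "baseball" and aspect 1 like "music". Toy numbers only.
communities = ["MLB", "New York Yankees", "New York Mets", "Coldplay"]

# p(community | aspect), one row per latent aspect.
p_c_given_z = [
    [0.40, 0.30, 0.25, 0.05],  # aspect 0: "baseball"
    [0.05, 0.05, 0.05, 0.85],  # aspect 1: "music"
]

# p(aspect | user) for a user who mostly joins baseball communities.
p_z_given_u = [0.9, 0.1]

# Score each community as p(c | u) = sum_z p(c | z) * p(z | u),
# then recommend the top-k communities the user has not yet joined.
joined = {"New York Mets"}
scores = {
    c: sum(p_z_given_u[z] * p_c_given_z[z][i] for z in range(2))
    for i, c in enumerate(communities)
}
top = sorted((c for c in communities if c not in joined),
             key=scores.get, reverse=True)
print(top[:2])  # ['MLB', 'New York Yankees']
```

Even though "New York Mets" and "New York Yankees" never co-occur in the toy transactions above, the shared latent aspect gives "New York Yankees" a high score, which is exactly the implicit-relation behavior (and its attendant noise risk) discussed in the introduction.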