WWW 2007 / Track: Industrial Practice and Experience May 8-12, 2007. Banff, Alberta, Canada
Google News Personalization: Scalable Online Collaborative Filtering
Abhinandan Das Mayur Datar Ashutosh Garg Google Inc. Google Inc. Google Inc. 1600 Amphitheatre Pkwy, 1600 Amphitheatre Pkwy, 1600 Amphitheatre Pkwy, Mountain View, CA 94043 Mountain View, CA 94043 Mountain View, CA 94043 [email protected] [email protected] [email protected] Shyam Rajaram University of Illinois at Urbana Champaign Urbana, IL 61801 [email protected]
ABSTRACT me something interesting. In such cases, we would like to Several approaches to collaborative filtering have been stud- present recommendations to a user based on her interests as ied but seldom have studies been reported for large (several demonstrated by her past activity on the relevant site. million users and items) and dynamic (the underlying item Collaborative filtering is a technology that aims to learn set is continually changing) settings. In this paper we de- user preferences and make recommendations based on user scribe our approach to collaborative filtering for generating and community data. It is a complementary technology personalized recommendations for users of Google News. We to content-based filtering (e.g. keyword-based searching). generate recommendations using three approaches: collabo- Probably the most well known use of collaborative filtering rative filtering using MinHash clustering, Probabilistic La- has been by Amazon.com where a user’s past shopping his- tent Semantic Indexing (PLSI), and covisitation counts. We tory is used to make recommendations for new products. combine recommendations from different algorithms using a Various approaches to collaborative filtering have been pro- linear model. Our approach is content agnostic and con- posed in the past in research community (See section 3 for sequently domain independent, making it easily adaptable details). Our aim was to build a scalable online recommen- for other applications and languages with minimal effort. dation engine that could be used for making personalized This paper will describe our algorithms and system setup in recommendations on a large web property like Google News. detail, and report results of running the recommendations Quality of recommendations notwithstanding, the following engine on Google News. requirements set us apart from most (if not all) of the known recommender systems: Categories and Subject Descriptors: H.4.m [Informa- Scalability: Google News (http://news.google.com), is vis- tion Systems]: Miscellaneous ited by several million unique visitors over a period of few General Terms: Algorithms, Design days. The number of items, news stories as identified by the Keywords: Scalable collaborative filtering, online recom- cluster of news articles, is also of the order of several million. mendation system, MinHash, PLSI, Mapreduce, Google News, Item Churn: Most systems assume that the underlying personalization item-set is either static or the amount of churn is minimal which in turn is handled by either approximately updating the models ([14]) or by rebuilding the models ever so often to 1. INTRODUCTION incorporate any new items. Rebuilding, typically being an The Internet has no dearth of content. The challenge is expensive task, is not done too frequently (every few hours). in finding the right content for yourself: something that will However, for a property like Google News, the underlying answer your current information needs or something that item-set undergoes churn (insertions and deletions) every you would love to read, listen or watch. Search engines help few minutes and at any given time the stories of interest are solve the former problem; particularly if you are looking the ones that appeared in last couple of hours. Therefore any for something specific that can be formulated as a keyword model older than a few hours may no longer be of interest query. However, in many cases, a user may not even know and partial updates will not work. what to look for. Often this is the case with things like news, For the above reasons, we found the existing recommender movies etc., and users instead end up browsing sites like systems unsuitable for our needs and embarked on a new news.google.com, www.netflix.com etc., looking around for approach with novel scalable algorithms. We believe that things that might “interest them” with the attitude: Show Amazon also does recommendations at a similar scale. How- Copyright is held by the International World Wide Web Conference Com- ever, it is the second point (item churn) that distinguishes mittee (IW3C2). Distribution of these papers is limited to classroom use, us significantly from their system. This paper describes our and personal use by others. approach and the underlying algorithms and system compo- WWW 2007, May 8–12, 2007, Banff, Alberta, Canada. ACM 978-1-59593-654-7/07/0005.
271 WWW 2007 / Track: Industrial Practice and Experience May 8-12, 2007. Banff, Alberta, Canada nents involved. The rest of this paper is organized as follows: 2.1 Scale of our operations Section 2 describes the problem setting. Section 3 presents The Google News website is one of the most popular news a brief summary of related work. Section 4 describes our al- websites in the world receiving millions of page views and gorithms; namely, user clustering using Minhash and PLSI, clicks from millions of users. There is a large variance in the and item-item covisitation based recommendations. Sec- click history size of the users, with numbers being anywhere tion 5 describes how such a system can be implemented. from zero to hundreds, even thousands, for certain users. Section 6 reports the results of comparative analysis with The number of news stories3 that we observe over a period other collaborative filtering algorithms and quality evalua- of one month is of the order of several million. Moreover, tions on live traffic. We finish with some conclusions and as mentioned earlier, the set of news stories undergoes a open problems in Section 7. constant churn with new stories added every minute and old ones getting dropped. 2. PROBLEM SETTING 2.2 The problem statement Google News is a computer-generated news site that ag- With the preceding overview, the problem for our rec- gregates news articles from more than 4,500 news sources ommender system can be stated as follows: Presented with worldwide, groups similar stories together and displays them the click history for N users (U = {u1,u2,...,uN } )overM according to each reader’s personalized interests. Numerous items (S = {s1,s2,...,sM }), and given a specific user u with Cu {si ,...,si } editions by country and language are available. The home click history set consisting of stories 1 |Cu| , page for Google News shows “Top stories” on the top left recommend K stories to the user that she might be inter- hand corner, followed by category sections such as World, ested in reading. Every time a signed-in user accesses the U.S. , Business,, etc. Each section contains the top three home-page, we solve this problem and populate the “Recom- headlines from that category. To the left of the “Top Sto- mended” stories section. Similarly, when the user clicks on ries” is a navigation bar that links each of these categories to the Recommended section link in the navigation bar to the a page full of stories from that category. This is the format left of the “Top Stories” on home-page, we present her with of the home page for non signed-in users1.Furthermore, a page full of recommended stories, solving the above stated if you sign-in using your Google account and opt-in to the problem for a different value of K. Additionally we require “Search History” feature that is provided by various Google that the system should be able to incorporate user feedback product websites, you enjoy two additional features: (clicks) instantly, thus providing instant gratification. (a) Google will record your search queries and clicks on news stories and make them accessible to you online. This allows 2.3 Strict timing requirements you to easily browse stories you have read in the past. The Google News website strives to maintain a strict re- (b) Just below the “Top Stories” section you will see a sponse time requirement for any page views. In particular, section labeled “Recommended for youremailaddress”along home-page view and full-page view for any of the category with three stories that are recommended to you based on sections are typically generated within a second. Taking into your past click history. account the time spent in the News Frontend webserver for The goal of our project is to present recommendations generating the news story clusters by accessing the various to signed-in users based on their click history and the click indexes that store the content, time spent in generating the history of the community. In our setting, a user’s click on HTML content that is returned in response to the HTTP an article is treated as a positive vote for the article. This request, it leaves a few hundred milliseconds for the recom- sets our problem further apart from settings like Netflix, mendation engine to generate recommendations. MovieLens etc., where users are asked to rate movies on a Having described the problem setting and the underlying 1-5scale.Thetwodifferencesare: challenges, we will give a brief summary of related work on 1. Treating clicks as a positive vote is more noisy than ac- recommender systems before describing our algorithms. cepting explicit 1-5 star ratings or treating a purchase as a positive vote, as can be done in a setting like amazon.com. 3. RELATED WORK While different mechanisms can be adopted to track the au- Recommender systems can be broadly categorized into thenticity of a user’s vote, given that the focus of this paper two types: Content based and Collaborative filtering. is on collaborative filtering and not on how user votes are In content based systems the similarity of items, defined collected, for the purposes of this paper we will assume that 2 in terms of their content, to other items that have been clicks indeed represent user interest. rated highly by the user is used to recommend new items. 2. While clicks can be used to capture positive user interest, However, in the case of domains such as news, a user’s in- they don’t say anything about a user’s negative interest. terest in an article cannot always be characterized by the This is in contrast to Netflix, eachmovie etc. where users terms/topics present in a document. In addition, our aim give a rating on a scale of 1-5. was to build a system that could be potentially applied to other domains (e.g. images, music, videos), where it is hard 1The format for the web-site is subject to change. 2 to analyse the underlying content, and hence we developed While in general a click on a news article by a user does not a content-agnostic system. For the particular application of necessarily mean that she likes the article, we believe that this is less likely in the case of Google News where there are 3As mentioned earlier, the website clusters news articles clean (non-spammy) snippets for each story that the user from different news sites (e.g. BBC, CNN, ABC news etc.) gets to see before clicking. Infact, our concern is that often that are about the same story and presents an aggregated the snippets are so good in quality that the user may not view to the users. For the purpose of our discussion, when click on the news story even if she is interested in it; she we refer to a news story it means a cluster of news articles gets to know all that she wants from the snippet about the same story as identified by Google News.
272 WWW 2007 / Track: Industrial Practice and Experience May 8-12, 2007. Banff, Alberta, Canada
Google News recommendations, arguably content based rec- of users by classifying them into multiple clusters or classes. ommendations may do equally well and we plan to explore Model-based approaches include: latent semantic indexing that in the future. Collaborative filtering systems use the (LSI) [20], Bayesian clustering [3], probabilistic latent se- item ratings by users to come up with recommendations, mantic indexing (PLSI) [14], multiple multiplicative Factor and are typically content agnostic. In the context of Google Model [17], Markov Decision process [22] and Latent Dirich- News, item ratings are binary; a click on a story corresponds let Allocation [2]. Most of the model-based algorithms are toa1rating,whileanon-clickcorrespondstoa0rating. computationally expensive and our focus has been on devel- Collaborative filtering systems can be further categorized oping a new, highly scalable, cluster model and redesigning into types: memory-based, and model-based. Below, we the PLSI algorithm [14] as a MapReduce [12] computation give a brief overview of the relevant work in both these types to make it highly scalable. while encouraging the reader to study the survey article [1] 3.1 Memory-based algorithms 4. ALGORITHMS We use a mix of memory based and model based algo- Memory-based algorithms make ratings predictions for rithms to generate recommendations. As part of model- users based on their past ratings. Typically, the prediction based approach, we make use of two clustering techniques - is calculated as a weighted average of the ratings given by PLSI and MinHash and as part of memory based methods, other users where the weight is proportional to the “similar- we make use of item covisitation. Each of these algorithms ity” between users. Common “similarity” measures include assigns a numeric score to a story (such that better rec- the Pearson correlation coefficient ([19]) and the cosine sim- ommendations get higher score). Given a set of candidate ilarity ([3]) between ratings vectors. The pairwise similarity stories, the score (rua,sk ) given by clustering approaches is matrix w(ui,uj ) between users is typically computed offline. proportional to
During runtime, recommendations are made for the given