Recommender Systems for Large-scale E-Commerce: Scalable Neighborhood Formation Using Clustering

Badrul M. Sarwar†∗, George Karypis‡, Joseph Konstan†, and John Riedl†
{sarwar, karypis, konstan, riedl}@cs.umn.edu
†GroupLens Research Group / ‡Army HPC Research Center
Department of Computer Science and Engineering
University of Minnesota, Minneapolis, MN 55455, USA

Abstract

Recommender systems apply knowledge discovery techniques to the problem of making personalized product recommendations during a live customer interaction. These systems, especially the k-nearest neighbor collaborative filtering based ones, are achieving widespread success in E-commerce nowadays. The tremendous growth of customers and products in recent years poses some key challenges for recommender systems: producing high quality recommendations, and performing many recommendations per second for millions of customers and products. New recommender system technologies are needed that can quickly produce high quality recommendations, even for very large-scale problems. We address the performance issues by scaling up the neighborhood formation process through the use of clustering techniques.

1 Introduction

The largest E-commerce sites offer millions of products for sale. Choosing among so many options is challenging for consumers. Recommender systems have emerged in response to this problem. A recommender system for an E-commerce site recommends products that are likely to fit a consumer's needs. Today, recommender systems are deployed on hundreds of different sites, serving millions of consumers. One of the earliest and most successful recommender technologies is collaborative filtering [5, 8, 9, 13]. Collaborative filtering (CF) works by building a database of preferences for products by consumers. A new consumer, Neo, is matched against the database to discover neighbors, which are other consumers who have historically had similar taste to Neo. Products that the neighbors like are then recommended to Neo, as he will probably also like them. Collaborative filtering has been very successful in both research and practice. However, there remain important research questions in overcoming two fundamental challenges for collaborative filtering recommender systems.

The first challenge is to improve the scalability of the collaborative filtering algorithms. These algorithms are able to search tens of thousands of potential neighbors in real-time, but the demands of modern E-commerce systems are to search tens of millions of potential neighbors. Further, existing algorithms have performance problems with individual consumers for whom the site has large amounts of information. For instance, if a site is using browsing patterns as indications of product preference, it may have thousands of data points for its most valuable customers. These "long customer rows" slow down the number of neighbors that can be searched per second, further reducing scalability. The second challenge is to improve the quality of the recommendations for the consumers. Consumers need recommendations they can trust to help them find products they will like. If a consumer trusts a recommender system, purchases a product, and then finds out he does not like the product, the consumer will be unlikely to use the recommender system again. In some ways these two challenges are in conflict: the less time an algorithm spends searching for neighbors, the more scalable it will be, and the worse its quality. For this reason, it is important to treat the two challenges simultaneously, so that the solutions discovered are both useful and practical.

The focus of this paper is two-fold. First, we introduce the basic concepts of a collaborative filtering based recommender system and discuss its various limitations. Second, we present a clustering-based algorithm that is suited to the large data sets common in E-commerce applications of recommender systems. This algorithm has characteristics that make it likely to be faster in online performance than many previously studied algorithms, and we investigate how the quality of its recommendations compares to other algorithms under different practical circumstances.

The rest of the paper is organized as follows. The next section provides a brief overview of collaborative filtering based recommender systems and discusses

∗Currently with the Computer Science Department, San Jose State University, San Jose, CA 95112, USA. Email: [email protected], Phone: +1 408-245-8202

[Figure 1 here: an input ratings table over customers C1 ... Cm, together with the active customer Ca, feeds the CF algorithm; the output interface delivers a prediction Ra,j (on product pj for the active customer) and a Top-N list of products {Tp1, Tp2, ..., TpN} for the active customer.]

Figure 1: The collaborative filtering process.

some of its limitations. Section 3 describes the details of applying a clustering-based approach to address these limitations. Section 4 describes our experimental framework, experimental results, and discussion. The final section provides some concluding remarks and directions for future research.

2 Collaborative Filtering-based Recommender Systems

Collaborative filtering (CF) [8, 9, 13] is the most successful recommender system technology to date, and is used in many of the most successful recommender systems on the Web. CF systems recommend products to a target customer based on the opinions of other customers. These systems employ statistical techniques to find a set of customers known as neighbors, who have a history of agreeing with the target user (i.e., they either rate different products similarly or they tend to buy similar sets of products). Once a neighborhood of users is formed, these systems use several algorithms to produce recommendations.

In a typical E-commerce scenario, there is a list of m customers C = {c1, c2, ..., cm} and a list of n products P = {p1, p2, ..., pn}. Each customer ci expresses his/her opinions about a list of products. This set of opinions is called the "ratings" of customer ci and is denoted by Pci. There exists a distinguished customer ca ∈ C, called the active customer, for whom the task of a collaborative filtering algorithm is to find a product suggestion.

Most collaborative filtering based recommender systems build a neighborhood of like-minded customers. The neighborhood formation scheme usually uses Pearson correlation or cosine similarity as a measure of proximity [13]. The neighborhood formation process is in fact the model-building or learning process for a recommender system algorithm. The main goal of neighborhood formation is to find, for each customer C, an ordered list of k customers N = {N1, N2, ..., Nk} such that C ∉ N, sim(C, N1) is maximum, sim(C, N2) is the next maximum, and so on, where sim(C, Ni) indicates the similarity between two customers. This similarity is most often computed by finding the Pearson-r correlation between the customers C and Ni.

Once these systems determine the proximity neighborhood, they produce recommendations that can be of two types:

• Prediction is a numerical value, Ra,j, expressing the predicted opinion score of product pj for the active customer ca. This predicted value is within the same scale (e.g., from 1 to 5) as the opinion values provided by ca.

• Recommendation is a list of N products, TPr = {Tp1, Tp2, ..., TpN}, that the active user will like the most. The recommended list usually consists of products not already purchased by the active customer. This output interface of CF algorithms is also known as Top-N recommendation.

Figure 1 shows the schematic diagram of the collaborative filtering process. CF algorithms represent the entire m × n customer-product data as a ratings matrix, A. Each entry Ai,j in A represents the preference score (rating) of the ith customer on the jth product. Each individual rating is within a numerical scale, and it can as well be 0, indicating that the customer has not yet rated that product.

These systems have been successful in several domains, but the algorithm is reported to have shown some limitations, such as:

• Sparsity. Nearest neighbor algorithms rely upon exact matches, which causes the algorithms to sacrifice recommender system coverage and accuracy [8, 11]. In particular, since the correlation coefficient is only defined between customers who have rated at least two products in common, many pairs of customers have no correlation at all [1]. Accordingly, Pearson nearest neighbor algorithms may be unable to make many product recommendations for a particular user. This problem is known as reduced coverage, and is due to sparse ratings of neighbors.

• Scalability. Nearest neighbor algorithms require computation that grows with both the number of

[Figure 2 here: the complete dataset is partitioned by the clustering algorithm based on user-user similarity; the cluster containing the active user is then used as the neighborhood.]

Figure 2: Neighborhood formation from clustered partitions.
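The Pearson-r neighborhood formation described in Section 2 can be sketched as follows. This is an illustrative Python sketch with made-up data, not the authors' implementation; note that the correlation is computed only over co-rated products, which is why sparse data leaves many customer pairs with no defined similarity.

```python
from math import sqrt

def pearson(ra, rb):
    """Pearson-r correlation between two customers' rating dicts
    (product -> rating), computed over co-rated products only."""
    common = set(ra) & set(rb)
    if len(common) < 2:          # correlation undefined: too few co-ratings
        return None
    ma = sum(ra[p] for p in common) / len(common)
    mb = sum(rb[p] for p in common) / len(common)
    num = sum((ra[p] - ma) * (rb[p] - mb) for p in common)
    den = sqrt(sum((ra[p] - ma) ** 2 for p in common)
               * sum((rb[p] - mb) ** 2 for p in common))
    return num / den if den else 0.0

# Neighborhood formation: rank the other customers by similarity to c.
c = {"p1": 4, "p2": 2, "p3": 5}
others = {
    "n1": {"p1": 5, "p2": 1, "p3": 4},   # agrees with c
    "n2": {"p1": 1, "p2": 5},            # disagrees with c
    "n3": {"p9": 3},                     # no co-rated products
}
sims = {u: pearson(c, r) for u, r in others.items()}
neighbors = sorted((u for u, s in sims.items() if s is not None),
                   key=lambda u: -sims[u])
print(neighbors)  # most similar first
```

Customers with fewer than two co-rated products (like `n3` here) get no correlation at all, which illustrates the reduced-coverage problem discussed in the Sparsity limitation above.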

customers and the number of products. With millions of customers and products, a typical web-based recommender system running existing algorithms will suffer serious scalability problems.

The weakness of the Pearson nearest neighbor approach for large, sparse databases led us to explore alternative recommender system algorithms. Our first approach attempted to bridge the sparsity by incorporating semi-intelligent filtering agents into the system [11]. We addressed the scalability challenge in an earlier work [12], where we showed that forming neighborhoods in the low-dimensional eigen-space provided better quality and performance. Here we present a different dimensionality reduction approach: first clustering the data set, and then forming neighborhoods from the partitions. The application of clustering techniques reduces the sparsity and improves the scalability of recommender systems, since clustering of users can effectively partition the ratings database. Earlier studies [2, 8, 15] indicate the benefits of applying clustering in recommender systems. We outline our research approach in the next section.

3 Scalable Neighborhood Using Clustering

Clustering techniques work by identifying groups of users who appear to have similar preferences. Once the clusters are created, predictions for an individual can be made by averaging the opinions of the other users in that cluster. Some clustering techniques represent each user with partial participation in several clusters; the prediction is then an average across the clusters, weighted by degree of participation. Clustering techniques usually produce less personal recommendations than other methods, and most often lead to worse accuracy than nearest neighbor algorithms [3]. Once the clustering is complete, however, performance can be very good, since the size of the group that must be analyzed is much smaller.

3.1 Scalable Neighborhood Algorithm

The idea is to partition the users of a collaborative filtering system using a clustering algorithm and to use the partitions as neighborhoods. Figure 2 explains this idea. A collaborative filtering algorithm using this idea first applies a clustering algorithm to the user-item ratings database A to divide it into p partitions. The clustering algorithm may generate fixed-size partitions, or, based on some similarity threshold, it may generate a requested number of partitions of varying size. In the next step, the neighborhood for the active customer ca is selected by looking into the partition to which he/she belongs. The entire partition Ai is then used as the neighborhood for that active customer ca. The prediction is generated using the basic collaborative filtering technique. We now present the algorithm formally.

Algorithm: Clustered neighborhood formation

1. Apply the clustering algorithm to produce p partitions of users from the training data set. Formally, the data set A is partitioned into A1, A2, ..., Ap, where Ai ∩ Aj = ∅ for 1 ≤ i, j ≤ p, i ≠ j, and A1 ∪ A2 ∪ ... ∪ Ap = A.

2. Determine the neighborhood for a given user u. If u ∈ Ai, then the entire partition Ai is used as the neighborhood.

3. Once the neighborhood is obtained, the classical collaborative filtering algorithm is used to generate a prediction from it. In particular, the prediction score Ra,j for a customer ca on product pj is computed by using the following formula [9]:

    R_{a,j} = P̄_{C_a} + ( Σ_{i rates j} (P_{C_i,j} − P̄_{C_i}) · r_{a,i} ) / ( Σ_i |r_{a,i}| )    (1)


Figure 3: Quality of prediction using clustered-neighborhood vs. classical CF-neighborhood approach.
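The three algorithm steps and prediction formula (1) can be sketched as follows. This is an illustrative Python sketch with hypothetical toy data, not the authors' C implementation: the hard-coded two-partition split stands in for a real clustering algorithm such as bisecting K-means, and the precomputed correlations in `sims` are assumed.

```python
def predict(active, neighbors, ratings, sims, product):
    """Formula (1): the active user's mean rating plus the mean-offset
    deviations of neighbors who rated `product`, weighted by r_{a,i}."""
    means = {u: sum(r.values()) / len(r) for u, r in ratings.items()}
    num = den = 0.0
    for u in neighbors:
        if u == active or product not in ratings[u]:
            continue
        r_ai = sims[(active, u)]
        num += (ratings[u][product] - means[u]) * r_ai
        den += abs(r_ai)
    return means[active] + (num / den if den else 0.0)

# Step 1: partition the users (a trivial stand-in for bisecting K-means).
partitions = [{"u1", "u2"}, {"u3", "u4"}]

# Step 2: the active user's whole partition is the neighborhood.
active = "u1"
neighborhood = next(p for p in partitions if active in p)

# Step 3: classical CF prediction within that neighborhood.
ratings = {
    "u1": {"p1": 4.0, "p2": 2.0},
    "u2": {"p1": 5.0, "p2": 1.0, "p3": 4.0},
}
sims = {("u1", "u2"): 1.0}   # hypothetical precomputed correlations
print(predict(active, neighborhood, ratings, sims, "p3"))
```

Because the neighborhood is a precomputed partition, step 2 is a constant-time lookup rather than a scan over all users, which is the source of the throughput gains reported in Section 4.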

where r_{a,i} denotes the correlation between the active user Ca and its neighbors Ci who have rated the product Pj, P̄_{C_a} denotes the average rating of customer Ca, and P_{C_i,j} denotes the rating given by customer Ci on product Pj.

This method has two benefits: first, it reduces the sparsity of the data set, and second, due to the dimensionality reduction and the use of a static, pre-computed neighborhood, the prediction generation is much faster.

4 Experimental Evaluation

In this section we present a brief discussion of our experimental data set, evaluation metric, and experimental platform, followed by the experimental results and discussion.

4.1 Data Sets

We used data from our recommender system MovieLens (www.movielens.umn.edu), which is a web-based research recommender system that debuted in Fall 1997. Each week hundreds of users visit MovieLens to rate and receive recommendations for movies. The site now has over 50,000 users who have expressed opinions on 3,000+ different movies. We randomly selected enough users to obtain 100,000 ratings from the database (we only considered users that had rated 20 or more movies). We divided the database into an 80% training set and a 20% test set. The data set was converted into a user-movie matrix R that had 943 rows (users) and 1682 columns (movies that were rated by at least one of the users).

4.2 Evaluation Metrics

Recommender systems researchers use a number of different measures for evaluating the success of the recommendation or prediction algorithms [13, 11]. For our experiments, we use a widely popular statistical accuracy metric named Mean Absolute Error (MAE), which is a measure of the deviation of recommendations from their true user-specified values. For each ratings-prediction pair ⟨p_i, q_i⟩, this metric treats the absolute error between them, i.e., |p_i − q_i|, equally. The MAE is computed by first summing these absolute errors over the N corresponding ratings-prediction pairs and then computing the average. Formally,

    MAE = ( Σ_{i=1}^{N} |p_i − q_i| ) / N.

The lower the MAE, the more accurately the recommendation engine predicts user ratings. Our choice of MAE as the primary accuracy metric is due to the fact that it matches the goals of our experiments most closely. MAE is the most commonly used metric and the easiest to interpret directly, and there is a vast research literature on performing statistical significance testing and computing confidence intervals for MAE. Furthermore, researchers [5] in the related field have also suggested the use of MAE as a prediction evaluation metric.

4.3 Experimental Procedure

Benchmark CF system. To compare the performance of clustering-based prediction, we also entered the training ratings set into a collaborative filtering recommendation engine that employs the Pearson nearest neighbor algorithm. We tuned the algorithm to match the best published Pearson nearest neighbor configuration and to deliver the highest quality prediction without concern for performance (i.e., it considered every possible neighbor to form optimal neighborhoods).

Experimental platform. All our experiments were implemented in C and compiled with the optimization flag -O6. We ran all our experiments on a Linux-based workstation with dual 600 MHz Intel Pentium III processors and 2 GB of RAM.

Experimental steps. To experimentally evaluate the effectiveness of clustering, we use a variant of the K-


Figure 4: Throughput of clustered-neighborhood vs. classical CF-neighborhood approach.

means [7] clustering algorithm, called the bisecting K-means clustering algorithm. This algorithm is fast and tends to produce clusters of relatively uniform size, which results in good cluster quality [14]. We divide the movie data set into an 80% training and 20% test portion. For the purpose of comparison, we perform the same experiments using our benchmark CF-based recommender system, with the same train/test ratio x and number of neighbors. In the case of the clustering approach, the number of neighbors is not always fixed; one cluster may have 30 users, another may have 55 users, and so on. To make our comparisons fair, we recorded the number of neighbors used for prediction computation for each user, and forced our basic CF algorithm to use the same number of neighbors for prediction generation. We evaluated the results using the MAE metric and also noted the elapsed run time in seconds. We conducted a 10-fold cross validation of our experiments by randomly choosing different training and test sets each time and taking the average of the MAE and run time values.

4.4 Results and Discussion

Figure 3 presents the prediction quality results of our experiments for the clustering as well as the basic CF technique. In this chart, prediction quality is plotted as a function of the number of clusters. We make two important observations from this chart. First, the prediction quality is worse in the case of the clustering algorithm, but the difference is small. For instance, using 10 clusters, the clustered approach yields an MAE of 0.7665 and the corresponding CF algorithm yields an MAE of 0.7455. It can also be observed from the chart that as we increase the number of clusters, the quality tends to be inferior (increased MAE). In the case of clustering this is expected, since with a fixed number of users, increasing the number of clusters means smaller cluster sizes, and hence fewer neighbors to have opinions about items. The same trend is observed in the case of basic CF, as we force it to use the same number of neighbors as the clustered approach.

Figure 4 presents the performance results of our experiments for both techniques. We plot the throughput as a function of the number of clusters. We define the throughput of a recommender system as the number of recommendations generated per second. From the plot we see that with the clustered approach the throughput is substantially higher than with the basic CF approach. The reason is that with the clustered approach the prediction algorithm has to use only a fraction of the neighbors. The throughput increases rapidly with the number of clusters (small cluster sizes). Since the basic CF method has to scan through all the neighbors, the number of clusters does not impact its throughput.

5 Conclusion and Future Work

Recommender systems are a powerful new technology for extracting additional value for a business from its customer databases. These systems help customers find products they want to buy from a business. Recommender systems benefit customers by enabling them to find products they like, and conversely, they help the business by generating more sales. Recommender systems are rapidly becoming a crucial tool in E-commerce on the Web. They are being stressed by the huge volume of customer data in existing corporate databases, and will be stressed even more by the increasing volume of customer data available on the Web. New technologies are needed that can dramatically improve the scalability of recommender systems.

In this paper, we presented and experimentally evaluated a new approach to improving the scalability of recommender systems by using clustering techniques. Our experiments suggest that clustering-based neighborhoods provide comparable prediction quality to the basic CF approach, while at the same time significantly improving the online performance. We demonstrated the effectiveness of one particular clustering algorithm (the bisecting K-means algorithm). In the future, better clustering algorithms as well as better prediction generation schemes can be used to improve the prediction quality. Clustering techniques can also be applied as a "first step" for shrinking the candidate set in a nearest neighbor algorithm, or for distributing nearest-neighbor computation across several recommender engines. While dividing the population into clusters may hurt the accuracy of recommendations to users near the fringes of their assigned cluster, pre-clustering may be a worthwhile trade-off between accuracy and throughput.

6 Acknowledgments

Funding for this research was provided in part by the National Science Foundation under grants IIS 9613960, IIS 9734442, and IIS 9978717, with additional funding by Net Perceptions Inc. This work was also supported by NSF CCR-9972519, by Army Research Office contract DA/DAAG55-98-1-0441, by the DOE ASCI program, and by Army High Performance Computing Research Center contract number DAAH04-95-C-0008. Access to computing facilities was provided by AHPCRC, Minnesota Supercomputer Institute. We also thank the anonymous reviewers for their valuable comments.

References

[1] Billsus, D., and Pazzani, M. J. (1998). Learning Collaborative Information Filters. In Proceedings of ICML '98, pp. 46-53.

[2] Borchers, A., Leppik, D., Konstan, J., and Riedl, J. (1998). Partitioning in Recommender Systems. Technical Report CS-TR-98-023, Computer Science Dept., University of Minnesota.

[3] Breese, J. S., Heckerman, D., and Kadie, C. (1998). Empirical Analysis of Predictive Algorithms for Collaborative Filtering. In Proceedings of the 14th Conference on Uncertainty in Artificial Intelligence, pp. 43-52.

[4] Goldberg, D., Nichols, D., Oki, B. M., and Terry, D. (1992). Using Collaborative Filtering to Weave an Information Tapestry. Communications of the ACM, December.

[5] Herlocker, J., Konstan, J., Borchers, A., and Riedl, J. (1999). An Algorithmic Framework for Performing Collaborative Filtering. In Proceedings of ACM SIGIR '99. ACM Press.

[6] Hill, W., Stead, L., Rosenstein, M., and Furnas, G. (1995). Recommending and Evaluating Choices in a Virtual Community of Use. In Proceedings of CHI '95.

[7] Jain, A. K., and Dubes, R. C. (1988). Algorithms for Clustering Data. Prentice Hall Publishers, Englewood Cliffs, NJ.

[8] Konstan, J., Miller, B., Maltz, D., Herlocker, J., Gordon, L., and Riedl, J. (1997). GroupLens: Applying Collaborative Filtering to Usenet News. Communications of the ACM, 40(3), pp. 77-87.

[9] Resnick, P., Iacovou, N., Suchak, M., Bergstrom, P., and Riedl, J. (1994). GroupLens: An Open Architecture for Collaborative Filtering of Netnews. In Proceedings of CSCW '94, Chapel Hill, NC.

[10] Resnick, P., and Varian, H. R. (1997). Recommender Systems. Special issue of Communications of the ACM, 40(3).

[11] Sarwar, B. M., Konstan, J. A., Borchers, A., Herlocker, J., Miller, B., and Riedl, J. (1998). Using Filtering Agents to Improve Prediction Quality in the GroupLens Research Collaborative Filtering System. In Proceedings of CSCW '98, Seattle, WA.

[12] Sarwar, B. M., Karypis, G., Konstan, J. A., and Riedl, J. (2000). Analysis of Recommendation Algorithms for E-Commerce. In Proceedings of the ACM EC '00 Conference, Minneapolis, MN, pp. 158-167.

[13] Shardanand, U., and Maes, P. (1995). Social Information Filtering: Algorithms for Automating 'Word of Mouth'. In Proceedings of CHI '95, Denver, CO.

[14] Steinbach, M., Karypis, G., and Kumar, V. (2000). A Comparison of Document Clustering Techniques. In Workshop (ACM KDD '00).

[15] Ungar, L. H., and Foster, D. P. (1998). Clustering Methods for Collaborative Filtering. In Workshop on Recommender Systems at the 15th National Conference on Artificial Intelligence.