From: AAAI Technical Report WS-98-08. Compilation copyright © 1998, AAAI (www.aaai.org). All rights reserved.

Recommender Systems: A GroupLens Perspective

Joseph A. Konstan*†, John Riedl*†, Al Borchers*, and Jonathan L. Herlocker*

*GroupLens Research Project
Dept. of Computer Science and Engineering
University of Minnesota
Minneapolis, MN 55455
http://www.cs.umn.edu/Research/GroupLens/

†Net Perceptions, Inc.
11200 West 78th Street, Suite 300
Minneapolis, MN 55344
http://www.netperceptions.com/

ABSTRACT

In this paper, we review the history and research findings of the GroupLens Research project[1] and present the four broad research directions that we feel are most critical for recommender systems.

[1] GroupLens™ is a trademark of Net Perceptions, Inc., which develops and markets the GroupLens Recommendation Engine. Net Perceptions allows the University of Minnesota to use the name "GroupLens Research" for continuity. The ideas and opinions expressed in this paper are those of the authors and do not represent opinions of Net Perceptions, Inc.

INTRODUCTION: A History of the GroupLens Project

The GroupLens Research project began at the Computer Supported Cooperative Work (CSCW) Conference in 1992. One of the keynote speakers at the conference lectured on his vision of an emerging information economy, in which most of the effort in the economy would revolve around the production, distribution, and consumption of information, rather than physical goods and services. Paul Resnick, then a student at MIT and now a professor at the University of Michigan, and one of us (Riedl) were moved by the talk to consider the technical challenges that would have to be overcome to enable the information economy. We realized that as the amount of information increased enormously, while people's ability to process information remained stable, one of the critical challenges would be technology that would automate matching people with the information they would find most valuable.

There were two main thrusts of research activity in this area that we knew of: (1) Artificial Intelligence (AI) research to develop tools that would serve as a "knowledge robot", or knowbot, continually seeking out information, reading and understanding it, and returning with the information that the knowbot determined would be most valuable to its user; and (2) Information Filtering (IF) research to develop even more efficient tools for selecting documents that contain keywords of interest to a user. These techniques were, and continue to be, fruitful, but we felt they each have one serious weakness. In the case of the knowbot, the weakness is that we are still a significant distance from technology that can understand articles in the way a human does. In the case of Information Filtering, the weakness is that identifying sets of articles by keyword does not scale to a situation in which there are thousands of articles that contain any imaginable set of keywords. Taken together, these two weaknesses represented an opportunity for a new type of filtering that would focus on finding which available articles match human notions of quality and taste. Such a system would be able to produce a list of articles that each user would like, independent of their content.

We decided to apply our ideas in the domain of Usenet news. Usenet screams for better information filtering, with hundreds of thousands of articles posted daily. Many of the articles in each Usenet newsgroup are on the same topic, so syntactic techniques that identify topic are much less valuable in Usenet. Further, different people value very different sets of articles, with some people participating in long discussion threads that other people couldn't imagine even reading.

We developed a system that falls into the class that is now called automatic collaborative filtering. It collects ratings from people on articles, combines the ratings statistically, and produces recommendations for other people of how much they are likely to like each article.

We invited people to participate in using GroupLens from all over the Internet, and studied the effect of the system on users. Users resisted our early attempts to establish multi-dimensional rating schemes, including characteristics such as quality of the writing and suitability of the topic for the newsgroup. Rating on multiple dimensions was too much work. We changed to single-dimension ratings, with the dimension being "What score would you have liked GroupLens to predict for you for this article?"

We found that users did change behavior in response to the recommendations, reading a much higher percentage of the articles that GroupLens predicted they would like than of either randomly selected articles or articles GroupLens predicted they would not like. However, there were many articles for which GroupLens was unable to provide ratings, because even with two to three hundred users, there were simply too many articles in the six newsgroups we were studying. A greater density of ratings by article would have improved the usability of the system for most users. The low ratings density was compounded by the first-rater problem, which is the problem that a pure collaborative filtering system cannot possibly make recommendations to the first person that reads each article.

One effect of these two problems is that some beginning users of the system saw little value from GroupLens initially, and hence never developed the habit of contributing ratings, though they continued to use GroupLens-enabled news readers.

Because most users did not like most articles, and because GroupLens was effective at identifying articles users would like, users requested the ability to scan a newsgroup for the articles that were predicted to be of high interest to them. This led to our exploring a different style of interface to a collaborative filtering system, the Top-N interface. Rather than predicting a score for each article, a Top-N interface greedily seeks articles that are likely to have high scores for an individual user, and recommends those articles to that user. Eventually, such an interface might be able to present each of us with a list of the 20-30 most interesting articles for us from all of Usenet each morning.
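As an illustration, the greedy selection step of a Top-N interface can be sketched in a few lines of Python, assuming predicted scores for each article have already been computed; the function and data names here are illustrative, not GroupLens code:

    import heapq

    def top_n(predicted_scores, n=20):
        # Greedily select the n articles with the highest predicted scores.
        # predicted_scores maps article_id -> predicted score for one user.
        return heapq.nlargest(n, predicted_scores.items(), key=lambda kv: kv[1])

    # Hypothetical predictions for five articles; recommend the top three.
    scores = {"a1": 4.6, "a2": 2.1, "a3": 3.9, "a4": 4.8, "a5": 1.5}
    print(top_n(scores, n=3))  # [('a4', 4.8), ('a1', 4.6), ('a3', 3.9)]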
Our key lesson learned was that a very high volume, low quality system like Usenet would require a very large number of users for collaborative filtering to be successful. For our research purposes, we needed a lower volume, higher density testbed. Our colleagues from Digital Equipment Corporation were closing down their research system on movie recommendations, and offered us the data to jump-start a similar system using GroupLens. We launched our MovieLens system in the summer of 1997, and have been running it since at www.movielens.umn.edu. MovieLens is entirely web-based, and has several thousand regular users. Users rate movies, and MovieLens recommends other movies to them.

Over the past six years of research, we have learned that people are hungry for effective tools for information filtering, and that collaborative filtering is an exciting complement to existing filtering systems. Users value both the taste-based recommendations and the sense of community they get by participating in a group filtering process. However, there are many open research problems still in collaborative filtering. Below we discuss our early results on some of these problems, and outline the remaining problems we feel to be most important to the evolution of the field of collaborative filtering.

CURRENT RESEARCH RESULTS

Our recent research has focused on improving the quality and efficiency of collaborative filtering systems. We have taken a broad approach, seeking solutions that improve the efficiency, accuracy, and coverage of the system. Specifically, we've examined partitioning users and items, incorporating filtering agents into the collaborative filtering framework, and using existing data sets to start up a collaborative filtering recommendation system.

Partitioning Users and Items

Both performance and accuracy concerns led us to explore the use of partitioning. If user tastes are more consistent within a partition of items, and therefore user agreement is more consistent, then partitioning the items in the system may yield more accurate recommendations. Even if the increased accuracy is offset by the smaller number of items available to establish user correlations, partitioning may be valuable because it can help scale the performance of the system; each partition can be run in parallel on a separate server.

To explore the potential of item partitioning, we considered three partitioning strategies for MovieLens: random partitions, partitions by movie genre, and partitions generated algorithmically by clustering based on ratings. Clustering-based partitions produced a slight loss in prediction accuracy as partitions grew smaller, but showed promise for a reasonable trade-off between performance and accuracy. Movie genre partitions yielded less accurate recommendations than cluster-based ones, though some genres were much more accurate, and others much less so. Random partitions were slightly worse still. The value of item partitions clearly depends on the domain of the recommendation system and the density of ratings within and across potential partitions (our earlier Usenet work found that mixing widely different newsgroups together reduced accuracy). One advantage of the clustering result is that it may be more broadly applicable in domains where items lack obvious attributes for partitioning.

We also looked at the value of user partitioning, starting with the extreme case of pre-computed symmetric neighborhoods based on our clustering algorithm; these were small partitions of about 200 users. If symmetric neighborhoods yield good results, time per recommendation can be reduced dramatically, since substantial per-neighborhood computation can be performed incrementally and amortized across the neighbors. We found that the accuracy of recommendations was almost as good as using the full data set, but that the coverage (i.e., the number of movies for which we could compute a recommendation) fell by 14%. To restore coverage we introduced a two-level hierarchy of users, in which users from each other neighborhood were collapsed into a single composite user. Each neighborhood then had all users represented: similar users were represented at full resolution, and more distant users were represented at the much lower resolution of one composite user per neighborhood. This restored full coverage, and the quality of predictions was only slightly degraded, by about 1%, from the unpartitioned case. We are continuing to explore these hierarchical approaches.
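A minimal sketch of the composite-user construction, under our own simplifying assumptions (ratings held in per-user dictionaries, and a plain per-item average used to form the composite), not the production algorithm:

    from collections import defaultdict

    def composite_user(neighborhood):
        # Collapse a neighborhood (a list of per-user rating dictionaries)
        # into one pseudo-user by averaging each item's ratings.
        sums, counts = defaultdict(float), defaultdict(int)
        for ratings in neighborhood:
            for item, rating in ratings.items():
                sums[item] += rating
                counts[item] += 1
        return {item: sums[item] / counts[item] for item in sums}

    def two_level_pool(own_neighborhood, other_neighborhoods):
        # A user's candidate neighbors: their own neighborhood at full
        # resolution, plus one composite user per foreign neighborhood.
        return list(own_neighborhood) + [composite_user(n) for n in other_neighborhoods]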

Filterbots

One problem with pure collaborative filtering is that users cannot receive recommendations for an item until enough other users have rated it. Content-based information filtering approaches, by contrast, avoid this problem by establishing profiles that can be used to evaluate items (e.g., keyword preferences). To combine the best of both approaches, we developed filterbots--rating agents that use content information to generate ratings systematically. These ratings are entered into the recommendation system by treating the filterbots as additional users. This approach has the benefit of allowing us to use simple-minded or controversial filterbots; if a user agrees with a particular filterbot, that filterbot becomes part of the user's neighborhood and gains influence in recommendations. If a user does not agree with the filterbot, it does not become part of that user's neighborhood and is therefore ignored.

To test this concept, we created several simple-minded filterbots for Usenet news. We found that a spell-checking filterbot not only increased coverage dramatically (as much as 514%), but also increased accuracy as much as 74%. In rec.humor, a notoriously high-noise group, all three of our simple filterbots (spell checker, percentage of included text, and message length) improved coverage and quality. In other newsgroups, some filterbots helped while others did not. Fortunately, the cost of filterbots is quite low, particularly since simple ones appear to have significant value. We plan to continue exploring filterbots, looking both at simple content filtering algorithms and at learning agents.
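As a concrete (and deliberately simple-minded) example, a spell-checking filterbot might rate every article by the fraction of its words found in a dictionary, with its ratings entering the system exactly as if a human user had supplied them. The scoring rule and names below are our own illustration, not the filterbots used in the study:

    def spelling_filterbot(article_text, dictionary, scale=(1, 5)):
        # Rate an article by the fraction of its words found in `dictionary`:
        # mostly well-spelled articles get high ratings, noisy ones low ones.
        words = [w.strip(".,;:!?\"'()").lower() for w in article_text.split()]
        words = [w for w in words if w]
        if not words:
            return scale[0]
        fraction_ok = sum(1 for w in words if w in dictionary) / len(words)
        low, high = scale
        return round(low + fraction_ok * (high - low))

    # The filterbot is just another "user" in the ratings database.
    dictionary = {"the", "quick", "brown", "fox", "jumps", "over", "lazy", "dog"}
    ratings = {}  # (user_id, article_id) -> rating
    ratings[("spellbot", "article-1")] = spelling_filterbot(
        "the quick brown fox jumps over teh lazy dog", dictionary)
    print(ratings)  # {('spellbot', 'article-1'): 5}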
Jump-Starting a Recommendation Engine

Collaborative filtering systems face a start-up problem: until a critical mass of ratings has been collected, there is not enough data to compute recommendations. Accordingly, early users receive little value for their contribution. In our MovieLens system we had the good fortune of starting our system seeded with a database of over 2.8 million ratings from the earlier EachMovie recommender system. For privacy reasons the database we received from EachMovie had only anonymous users; although we could not associate these users with our own, they could still serve as recommenders for our users. We call this "dead data."

We took advantage of this rare opportunity to evaluate the experience of new users in systems with and without the dead data. We retrospectively evaluated the recommendation accuracy, coverage, and user satisfaction for early users of EachMovie and MovieLens. For our accuracy and coverage experiments, we held the recommendation algorithm constant, and found that the jump-started case (MovieLens) had better coverage (nearly 100%, as compared with 89%) and higher accuracy (increases as high as 19%, depending on the metric used). To assess user satisfaction, we retrospectively compared user retention and participation in our current MovieLens system with that of the early EachMovie system. By looking at the session, rating, and overall length of active use of corresponding early EachMovie and MovieLens users (all of which could be reconstructed from logs), we were able to measure indicators of user satisfaction. We found that early MovieLens users were more active than early EachMovie users in all categories, with dramatic increases in the number of ratings and number of sessions. Accordingly, it appears that the start-up problem is a real one--user retention and participation improve when users receive value--and using historical or "dead" data may be a useful technique for improving start-up.

WHAT'S NEXT: A RESEARCH AGENDA

Based on our prior work, we've identified four broad problem areas that are particularly critical to the success of recommender systems. We discuss these in general, highlighting work we know to be underway, but also presenting open questions that are ripe for research.

Supporting Users and Decision-Making

Early work in recommender systems focused on the technology of making recommendations. Papers cited measures of system accuracy such as mean absolute error, measures of throughput and latency, and occasionally a general metric indicating that people used the system, or perhaps that they said they liked it. Now that the technological feasibility of recommender systems is well established, we must face the challenge of designing and evaluating systems to support users and their decision-making processes.

While there are many factors that affect the decision-making value of a recommendation system for users, three critical issues have arisen in each system we've studied: accuracy, confidence, and user interface.

Accuracy is the measure of how closely the recommendations generated by the system predict the actual preferences of the user. Measurement of accuracy is itself a challenging issue that we discuss below. However, for any sensible definition of accuracy, recommender systems still are far from perfect. Both Maes' work on Ringo and our own work suggest that today's pure collaborative filtering systems typically achieve at best an accuracy of plus-or-minus one on a seven-point scale. Further research is needed to determine the theoretical limit of accuracy, based on user rate/re-rate differences and empirically determined variances. Then, significant work is needed on a wide range of approaches to improve accuracy for user tasks. These approaches include those discussed below and special filtering models tuned for precision and for recall.
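One simple way to ground that theoretical limit empirically is to measure how consistently users rate the same items on two occasions: an algorithm scored against targets this noisy cannot be expected to do much better. A back-of-the-envelope sketch of such a diagnostic, with hypothetical rate/re-rate data (the function and numbers are ours, not a formal bound):

    def rate_rerate_floor(pairs):
        # Mean absolute difference between first and second ratings of the
        # same items by the same users -- a crude empirical floor on the
        # mean absolute error any predictor can be expected to achieve.
        return sum(abs(first - second) for first, second in pairs) / len(pairs)

    # Hypothetical (first, second) ratings for five re-rated items.
    print(rate_rerate_floor([(4, 5), (3, 3), (1, 2), (5, 4), (2, 2)]))  # 0.6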

Confidence is a measure of how certain the recommendation system is of its recommendation. While statistical confidence measures often are expressed as confidence intervals or expected distributions, current recommendation systems generally provide no more than a simple "high, medium, or low" confidence score. Part of the difficulty with expressing confidence as an interval, distribution, or variance is the complexity of the statistics underlying collaborative filtering. Unlike dense analytic techniques such as multiple regression, collaborative filtering lacks a well-understood measure of error or variance. Sources of error include the user's own variance in rating, the imperfect matching of neighbors, the degree to which past agreement really does predict future agreement, the number of items rated by the user, the number of items in common with each neighbor, rounding effects, and many others.

At the same time, measures of confidence are critical to users trying to determine whether to rely upon a recommendation. Without confidence measures, it is extremely difficult to provide recommendations in situations where users are risk averse. Perhaps even worse is the poor reputation that a recommendation engine will receive if it delivers low-confidence recommendations. Accordingly, a key research priority is the development of computable and usable confidence measures. When algorithms permit, analytic solutions are desirable, but we are also investigating empirical confidence measures that can be derived from and applied to an existing system.
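By way of illustration, one plausible empirical measure maps the evidence behind a prediction -- how many neighbors contributed, and how tightly they agree -- onto the kind of high/medium/low score described above. The thresholds and names are ours, chosen only for the sketch:

    from statistics import pstdev

    def empirical_confidence(neighbor_ratings, min_support=5, max_spread=1.0):
        # neighbor_ratings: ratings of the item by the neighbors who
        # contributed to this prediction.
        n = len(neighbor_ratings)
        if n == 0:
            return "none"
        spread = pstdev(neighbor_ratings) if n > 1 else float("inf")
        if n >= min_support and spread < max_spread:
            return "high"    # many neighbors in tight agreement
        if n >= min_support or spread < max_spread:
            return "medium"  # plenty of neighbors, or agreement, but not both
        return "low"         # few neighbors and wide disagreement

    print(empirical_confidence([4, 4, 5, 4, 4, 5]))  # high
    print(empirical_confidence([1, 5, 3]))           # low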
User interface issues in collaborative filtering span a range of questions, including:

- When and how to use multi-dimensional ratings
- When and how to use implicit ratings
- How a recommendation should be displayed

Multi-dimensional ratings (and therefore predictions) seem natural in certain applications. Restaurants are often evaluated separately on food, service, and value. Tasks are often rated for importance and urgency. Today's recommendation engines can accept these dimensions separately, but further research is needed on cross-dimension correlation and recommendation.

Implicit ratings are observational measures of a user's preference for an item. For example, in Usenet news we found that time spent reading an article is a good measure of preference for that article. Similarly, listening to music, viewing art, and purchasing consumer goods are all good indicators of preference. Today's systems lack automated means for evaluating and calibrating implicit ratings; further research would be valuable.

There are many ways to display recommendations, ranging from simply listing recommended items (without order), to marking recommended items in a list of items, to displaying a predicted rating for items. Many research questions need to be answered through real user studies. In one study we've conducted, we saw that the correctness of user decision making is directly affected by the type of display used. In other work, we are examining the question of whether the expected value of a predicted rating distribution is more or less valuable than the probability of the rating exceeding a "worthwhile" cutoff. For example, is a user more interested that a movie is likely to be three-and-a-half stars, or that it has a 40% chance of being four or five stars? Many similar questions remain unanswered.
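The two presentations in that example can be computed from the same predicted rating distribution; a small sketch, with invented probabilities chosen to mirror the movie above:

    def summarize(distribution, cutoff=4):
        # distribution maps each possible rating to its predicted probability.
        expected = sum(rating * p for rating, p in distribution.items())
        p_worthwhile = sum(p for rating, p in distribution.items() if rating >= cutoff)
        return expected, p_worthwhile

    # A movie predicted at three-and-a-half stars overall, yet with a 40%
    # chance of being four or five stars.
    distribution = {1: 0.05, 2: 0.05, 3: 0.50, 4: 0.15, 5: 0.25}
    expected, p = summarize(distribution)
    print(f"expected {expected:.1f} stars; {p:.0%} chance of 4+ stars")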

Beyond Collaborative Filtering

The second major research issue for recommender systems is integrating technologies other than collaborative filtering into recommendation systems. Content analysis, demographics, data mining, machine learning, and other approaches to learning from data each have advantages that can help offset some of the fundamental limitations of collaborative filtering (e.g., sparsity and the early rater problem). Some of the interesting open research questions include:

- How to integrate content analysis techniques from information retrieval, information filtering, and agents research into recommendation systems. Our filterbot work is a first step in this direction, as are MIT's collaborating agent work and Stanford's Fab system. More research is needed to discover which techniques work for which applications.
- How to take advantage of user demographics and rule-based knowledge in recommendation systems.
- How to integrate the power of data mining with the real-time capabilities of recommender systems. Particularly interesting questions include identifying temporal trends in preferences (e.g., people who prefer A at time t are more likely to prefer B at time t+1).
- How to take advantage of machine learning techniques in recommender systems.

In all of these cases, a key question will be whether one technology can be incorporated into the framework of another, or whether a new architecture is needed to merge the two types of knowledge.

Scale, Sparsity, and Algorithmics

The fundamental problem of producing accurate recommendations efficiently from a sparse set of ratings is inherent to collaborative filtering. Indeed, if the ratings set were dense, there would be little value in producing recommendations. Accordingly, there is still great need for continued research into fundamental issues in performance, scalability, and applicability.

One particularly interesting research area that we, along with others, are actively investigating is techniques for reducing the computational complexity of recommendation. As discussed above, the complexity of recommendations grows generally with the size of the database, which is the product of the number of users and the number of items. Neighborhoods, partitioning, and factor analysis all attempt to reduce this size by limiting the elements considered along the user dimension, the item dimension, or both. Neighborhood techniques use only a subset of users in the computation of recommendations. Many variants have been proposed and implemented; research is needed to assess the varying merits of symmetric vs. asymmetric neighborhoods, on-demand vs. longer-term neighborhoods, threshold vs. size limits, etc. Partitioning is discussed above; factor analysis is a different approach that tries to decompose users or items into a combination of shared "taste" vectors. Each of these techniques has promise for reducing storage and computation.

A second critical issue is to continue to address the challenge of ratings sparsity, particularly for applications where few users will ever rate each item. Clustering, factor analysis, and hierarchical techniques that combine individual items with clusters or factors can provide one set of solutions. Integration with other technologies will provide others.

Finally, there is still significant work needed on the central algorithmics of collaborative filtering to improve performance and accuracy. The use of Pearson correlations for neighbor selection and weighting is common in recommendation systems, yet many alternatives may be more suitable, depending on the distribution of ratings. Similarly, the results from Ringo, along with some of our own, suggest that the common Pearson algorithm overvalues neighbors with low correlations. Further work, particularly including empirical work, is needed to evaluate candidate algorithms.
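For reference, a minimal version of the common Pearson-based prediction: correlate users over co-rated items, then offset the target user's mean rating by a correlation-weighted average of the neighbors' deviations from their own means. This is a baseline sketch for experimentation, not the GroupLens production algorithm, and it deliberately keeps even weakly correlated neighbors so their effect can be studied:

    from math import sqrt

    def pearson(r1, r2):
        # Pearson correlation computed over the items both users have rated.
        common = set(r1) & set(r2)
        if len(common) < 2:
            return 0.0
        m1 = sum(r1[i] for i in common) / len(common)
        m2 = sum(r2[i] for i in common) / len(common)
        num = sum((r1[i] - m1) * (r2[i] - m2) for i in common)
        den = sqrt(sum((r1[i] - m1) ** 2 for i in common) *
                   sum((r2[i] - m2) ** 2 for i in common))
        return num / den if den else 0.0

    def predict(user, others, item):
        # Predict user's rating of item as the user's mean rating plus the
        # correlation-weighted deviations of the neighbors from their means.
        user_mean = sum(user.values()) / len(user)
        num = den = 0.0
        for other in others:
            if item not in other:
                continue
            w = pearson(user, other)
            if w == 0.0:
                continue
            other_mean = sum(other.values()) / len(other)
            num += w * (other[item] - other_mean)
            den += abs(w)
        return user_mean + num / den if den else user_mean

    me = {"m1": 4, "m2": 2, "m3": 5}
    neighbors = [{"m1": 5, "m2": 1, "m3": 5, "m4": 4},
                 {"m1": 2, "m2": 4, "m3": 1, "m4": 2}]
    print(round(predict(me, neighbors, "m4"), 2))  # 3.92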
Metrics and Benchmarks

Sadly, evaluation is one of the weakest parts of current research. Systems are not compared against each other directly, and published results use a variety of metrics, often incomparable ones. Both research and collaboration are needed to establish an accepted set of metrics and benchmarks for evaluating recommendation systems. Three areas are particularly important and promising.

Accuracy metrics. There are nearly a dozen different metrics that have been used to measure recommendation system accuracy. Statistical measures of error include the mean absolute error between predicted and actual ratings, the root mean squared error (to more heavily weigh large errors), and the correlation between predicted and actual ratings. Other metrics attempt to assess the prevalence of large errors; reversal measures tally the frequency with which "embarrassingly bad" recommendations are made. A different set of metrics attempts to assess the effectiveness of the recommendation engine in filtering items. These statistics, borrowed from information retrieval, together with receiver operating characteristic measurements from signal processing, discount errors that do not affect usage, and more heavily weigh errors near the decision point. For example, the difference between 1.0 and 2.5 on a five-point scale may be unimportant, since both are rejected, while the difference between 3.0 and 4.5 is extremely relevant. Finally, ordering metrics assess the effectiveness of top-N algorithms in identifying true "top" items. The different metrics in each category should be evaluated, with the most useful ones identified and expected for publication and comparison.

Coverage combined with accuracy. Accuracy metrics alone are not useful for many types of comparison. By setting the confidence threshold high, most systems can increase accuracy at the expense of coverage. Combined metrics are needed to allow meaningful evaluation of coverage/accuracy trade-offs. We are working on a combined decision-support metric, but others are needed as well.

Full-system benchmarks. The greatest need right now is for corpora and benchmarks that can be used for comparison. In the future, these should be integrated with economic models to evaluate, in monetary terms, the value added by a recommender system.

ACKNOWLEDGMENTS

We would like to acknowledge the financial support of the National Science Foundation and Net Perceptions, Inc. We also would like to thank the dozens of individuals, mostly students, who have contributed their effort to the GroupLens Research project. Finally, we would like to thank all of our users.

FOR ADDITIONAL INFORMATION

Several good sources of bibliographic information already exist in the area of collaborative information filtering and recommender systems. Rather than duplicate that work here, we refer the reader to:

1. The March 1997 issue of Communications of the ACM, edited by Hal Varian and Paul Resnick. In addition to containing articles on several relevant systems, the section introduction and articles contain extensive bibliographic information.

2. The Collaborative Filtering Resources web page, at http://www.sims.berkeley.edu/resources/collab; this page grew out of a March 1996 workshop on collaborative filtering held at Berkeley. It includes pointers to other reference pages.
