Evaluating Collaborative Filtering Recommender Systems

JONATHAN L. HERLOCKER
Oregon State University

JOSEPH A. KONSTAN, LOREN G. TERVEEN, and JOHN T. RIEDL
University of Minnesota

Recommender systems have been evaluated in many, often incomparable, ways. In this article, we review the key decisions in evaluating collaborative filtering recommender systems: the user tasks being evaluated, the types of analysis and datasets being used, the ways in which prediction quality is measured, the evaluation of prediction attributes other than quality, and the user-based evaluation of the system as a whole. In addition to reviewing the evaluation strategies used by prior researchers, we present empirical results from the analysis of various accuracy metrics on one content domain, where all the tested metrics collapsed roughly into three equivalence classes. Metrics within each equivalence class were strongly correlated, while metrics from different equivalence classes were uncorrelated.

Categories and Subject Descriptors: H.3.4 [Information Storage and Retrieval]: Systems and Software—Performance Evaluation (efficiency and effectiveness)

General Terms: Experimentation, Measurement, Performance

Additional Key Words and Phrases: Collaborative filtering, recommender systems, metrics, evaluation

This research was supported by the National Science Foundation (NSF) under grants DGE 95-54517, IIS 96-13960, IIS 97-34442, IIS 99-78717, IIS 01-02229, and IIS 01-33994, and by Net Perceptions, Inc.
Authors' addresses: J. L. Herlocker, School of Electrical Engineering and Computer Science, Oregon State University, 102 Dearborn Hall, Corvallis, OR 97331; email: [email protected]; J.

1. INTRODUCTION

Recommender systems use the opinions of a community of users to help individuals in that community more effectively identify content of interest from a potentially overwhelming set of choices [Resnick and Varian 1997]. One of