CSE 255 Assignment 1: Movie Rating Prediction Using the MovieLens Dataset

Yashodhan Karandikar

1. INTRODUCTION

The goal of this project is to predict the rating given a user and a movie, using 3 different methods: linear regression using user and movie features, collaborative filtering, and a latent factor model [22, 23], on the MovieLens 1M data set [6]. We describe results using each of the 3 models and compare them with those obtained using the Apache Mahout machine learning library [1]. We also briefly discuss some results on restaurant ratings data from the Google Local data set [4].

2. THE MOVIELENS DATA SET

In this section, we introduce the MovieLens 1M data set [6]. This is a data set of about 1 million ratings from 6040 users on 3900 movies, thus containing around 165 ratings per user and 256 ratings per movie on average. Ratings are integers on a 5-star scale. Each user and each movie is identified by a unique id. The data set includes information about the age (7 age groups), gender, occupation (21 types) and zip code for each user, as well as the title and genre (18 types) for each movie. Each user has at least 20 ratings. We normalize the ratings so that each rating is in the closed interval [0, 1].

We perform an exploratory analysis of the data set by plotting the average rating against user and item features. This analysis is shown in Figures 1, 2, 3 and 4.

Figure 1: Average rating for different age groups

Figure 1 shows the average rating for each of the age groups. We observe a slightly increasing trend in the average rating as the age group of the user increases, suggesting that the age group of the user might be a useful feature in predicting movie ratings.

Figure 2: Average rating for male and female users

Figure 2 shows the average rating for male users and female users.
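As a rough illustration of the exploratory analysis in this section, the sketch below averages normalized ratings per user group. The text does not state how ratings are scaled into [0, 1]; dividing the star rating by 5 is an assumption made here, and the function names and toy data are hypothetical.

```python
from collections import defaultdict


def normalize(stars, max_stars=5):
    """Map an integer star rating into [0, 1].

    Assumption: the scaling is stars / max_stars; the text only says
    that ratings are normalized into the closed interval [0, 1].
    """
    return stars / max_stars


def average_rating_by_group(ratings, group_of_user):
    """Return the average normalized rating for each user group.

    ratings: iterable of (user_id, movie_id, stars) tuples.
    group_of_user: dict mapping user_id to a group label
                   (for example an age group or an occupation).
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for user_id, _movie_id, stars in ratings:
        group = group_of_user[user_id]
        totals[group] += normalize(stars)
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}


# Toy data, invented purely for illustration.
toy_ratings = [(1, 10, 5), (1, 11, 3), (2, 10, 4), (3, 12, 2)]
toy_age_group = {1: "18-24", 2: "18-24", 3: "25-34"}
averages = average_rating_by_group(toy_ratings, toy_age_group)
```

The same helper could be reused for gender, occupation or genre by swapping the grouping dictionary, which is how plots such as Figures 1 to 4 might be produced.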
We observe a very small difference between the average ratings for male and female users, suggesting that the gender of the user is not likely to be a useful feature in the prediction of movie ratings.

Figure 3: Average rating for various occupations

Figure 3 shows the average rating for users in various occupations. We observe that there is a reasonable amount of variation in the average ratings for various occupations, with the minimum and maximum average ratings being 0.681257 and 0.757114 respectively. This indicates that the occupation of the user might be a useful feature in the prediction of ratings.

Figure 4: Average rating for various movie genres

Figure 4 shows the average rating for various movie genres. We observe that there is a reasonable amount of variation in the average ratings for various genres, with the minimum and maximum average ratings being 0.644848 and 0.815781 respectively. This indicates that the genre of the movie might be a useful feature in the prediction of ratings.

3. MOVIE RATING PREDICTION

In this section we discuss the task of predicting the rating given a user and a movie. Given a user $u$ and a movie $m$, we would like to predict the rating that $u$ will give to $m$. Although this problem is closely related to the problem of recommending the best possible items to a user based on the user's previous purchases or reviews, in this work we focus on accurately predicting the rating itself. Accordingly, we evaluate the performance of our models using the mean-squared-error (MSE) [17] and the coefficient of determination $R^2$ [15] metrics. These are calculated as follows:

\[ \mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} \left( f(u_i, m_i) - r(u_i, m_i) \right)^2 \]

\[ R^2 = 1 - \frac{\mathrm{MSE}}{\text{variance in true ratings}} \]

where $N$ is the number of examples, $u_i$ is the user in example $i$, $m_i$ is the movie in example $i$, $f(u_i, m_i)$ is the predicted rating, and $r(u_i, m_i)$ is the true rating.

The MSE is a useful metric for comparing 2 or more models on the same data. However, the MSE depends on the variance in the data and hence by itself is not a good measure of how well a model explains the data. The $R^2$ value, on the other hand, is a measure of the fraction of the variance that is explained by a model, and hence is a better measure of how well a model explains the data. We report both values for each model we consider.

As a baseline model, we consider a linear predictor based on user and item features, which we discuss in section 6.

Further, to confirm the validity of our results, we compare them with the results obtained using the Apache Mahout machine learning library [1] on the same data set. Apache Mahout contains an implementation of an alternating-least-squares recommender with weighted-λ-regularization (ALS-WR) [28, 8].

4. RELATED WORK

The MovieLens data set [6, 7] is a data set collected and made available by the GroupLens Research group [5]. MovieLens is a website for personalized movie recommendations [10]. The data set contains data from users who joined MovieLens in the year 2000.

Given a set of users, items and ratings given by some of the users for some of the items, the task of predicting the rating for a given user for a given movie has been studied in the literature in the form of recommender systems, which additionally perform the task of recommending items to users. The state-of-the-art in recommender systems is based on 2 main approaches: the neighborhood approach and latent factor models [21, 23]. Both these methods rely only on past user behavior, without using any features about the users or items. [28] describes an Alternating-Least-Squares with Weighted-λ-Regularization (ALS-WR) parallel algorithm designed for the Netflix Prize challenge [11], a well-known competition for the best algorithm to predict user ratings for films, given only the previous ratings, without any other information about the users or films. An alternative to these methods is content filtering [23], which uses user features (such as age, gender, etc.) and item features (such as movie genre, cast, etc.) instead of relying on past user behavior.

Neighborhood approaches and latent factor models are generally more accurate than content-based techniques [23], since they are domain-free and capture subtle information (such as a user's preferences) which is hard to extract using content filtering. On the other hand, neighborhood approaches and latent factor models suffer from the cold start problem, i.e. they are not useful on new users or new items [13].

In this work, we compare 3 models. The first one is a content-based linear predictor and is described in section 6. The second one is a neighborhood model and is described in section 7. The last one is a latent-factor model and is described in section 8.

5. FEATURES

Sections 6, 7 and 8 discuss the 3 models we consider, namely linear regression, collaborative filtering and latent factor models [23, 22]. While collaborative filtering and latent factor models do not use the additional information about users and items available in the data set, the model based on linear regression relies on this information. As discussed in section 2, the user's age and occupation and the movie genre seem to be useful in predicting the rating that a user will give to a movie.

The data set includes the user's age as one of 7 values (1, 18, 25, 35, 45, 50, 56), which respectively denote the age groups under 18, 18-24, 25-34, 35-44, 45-49, 50-55, and 56 and above. During the pre-processing stage, we replace each of these values by an integer in the interval [0, 6]. When we create the feature vector for a given user-movie pair, we create one boolean feature for each age group, where a value of 1 indicates that the user falls in the corresponding age group. Thus, we have 7 features to represent the user's age.

The data set includes the user's occupation as an integer in the interval [0, 20], where each integer denotes a particular occupation as given in [7]. When we create the feature vector for a given user-movie pair, we create one boolean feature for each occupation, where a value of 1 indicates that the user's occupation is the corresponding occupation. Thus, we have 21 features to represent the user's occupation.

The data set includes the genres for a movie as a string of genres separated by '|'. Each genre in this string is one of 18 possible genres given in [7]. We represent each genre as an integer in the interval [0, 17]. When we create the feature vector for a given user-movie pair, we create one boolean feature for each genre, where a value of 1 indicates that the corresponding genre is one of the genres listed for that movie. Thus, we have 18 features to represent the genres of the movie.

Similarly, we represent the user's gender by 2 boolean features. Although the user's gender does not seem to be useful, we still include it in the features used by the predictor and expect that the predictor will learn a weight close to 0 for these 2 features.

We note that though the features just described capture useful attributes of users and movies, none of these features capture a sense of compatibility or a match between a user and a movie.

\[ \hat{\theta} = \arg\min_{\theta} \; \frac{1}{N} \sum_{i=1}^{N} \left( r(u_i, m_i) - X_i \cdot \theta \right)^2 + \lambda \sum_{i=2}^{M} \theta_i^2 \]

We have the following expressions for the gradient of the
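The feature encoding of section 5, the metrics of section 3 and the regularized objective shown above can be tied together in a short sketch. This is an illustration under stated assumptions, not the implementation used for the reported results: the function names are hypothetical, the feature ordering is arbitrary, and the leading constant feature is an assumption suggested by the regularizer starting at index 2 (which would leave the first weight unpenalized).

```python
NUM_AGE_GROUPS = 7     # age group indices 0..6
NUM_OCCUPATIONS = 21   # occupation indices 0..20
NUM_GENRES = 18        # genre indices 0..17
NUM_GENDERS = 2        # gender indices 0..1


def one_hot(index, size):
    """Boolean indicator vector with a single 1 at position index."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec


def feature_vector(age_group, occupation, genres, gender):
    """Constant offset (assumed) followed by 7 + 21 + 18 + 2 boolean features."""
    genre_part = [0.0] * NUM_GENRES
    for genre in genres:          # a movie may list several genres
        genre_part[genre] = 1.0
    return ([1.0]
            + one_hot(age_group, NUM_AGE_GROUPS)
            + one_hot(occupation, NUM_OCCUPATIONS)
            + genre_part
            + one_hot(gender, NUM_GENDERS))


def predict(x, theta):
    """Linear prediction: dot product of features and weights."""
    return sum(xi * ti for xi, ti in zip(x, theta))


def mse(predictions, targets):
    """Mean squared error between predicted and true ratings."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


def r_squared(predictions, targets):
    """1 - MSE / (variance of the true ratings)."""
    mean = sum(targets) / len(targets)
    variance = sum((t - mean) ** 2 for t in targets) / len(targets)
    return 1.0 - mse(predictions, targets) / variance


def objective(X, targets, theta, lam):
    """MSE plus an L2 penalty on every weight except the first (the offset)."""
    predictions = [predict(x, theta) for x in X]
    penalty = lam * sum(t * t for t in theta[1:])
    return mse(predictions, targets) + penalty


# Toy usage with invented values: one user-movie pair seen with two ratings.
x = feature_vector(age_group=2, occupation=4, genres=[0, 7], gender=0)
theta = [0.7] + [0.0] * (len(x) - 1)   # predict 0.7 for every pair
X = [x, x]
targets = [0.6, 0.8]
predictions = [predict(row, theta) for row in X]
```

With only the offset weight set, the predictor outputs the same value for every pair, so its $R^2$ on data whose mean equals that value is 0, which makes a convenient sanity check before fitting the remaining weights.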