User Session Identification Based on Strong Regularities in Inter-Activity
Total Page:16
File Type:pdf, Size:1020Kb
User Session Identification Based on Strong Regularities in Inter-activity Time Aaron Halfaker Os Keyes Daniel Kluver Wikimedia Foundation Wikimedia Foundation GroupLens Research [email protected] [email protected] University of Minnesota [email protected] Jacob Thebault-Spieker Tien Nguyen Kenneth Shores GroupLens Research GroupLens Research GroupLens Research University of Minnesota University of Minnesota University of Minnesota [email protected] [email protected] [email protected] Anuradha Uduwage Morten Warncke-Wang GroupLens Research GroupLens Research University of Minnesota University of Minnesota [email protected] [email protected] ABSTRACT 1. INTRODUCTION Session identification is a common strategy used to develop In 2012, we had an idea for a measurement strategy that metrics for web analytics and behavioral analyses of user- would bring insight and understanding to the nature of par- facing systems. Past work has argued that session identifica- ticipation in an online community. While studying partic- tion strategies based on an inactivity threshold is inherently ipation in Wikipedia, the open, collaborative encyclopedia, arbitrary or advocated that thresholds be set at about 30 we found ourselves increasingly curious about the amount of minutes. In this work, we demonstrate a strong regularity time that volunteer contributors invested into the encyclope- in the temporal rhythms of user initiated events across sev- dia's construction. Past work measuring editor engagement eral different domains of online activity (incl. video gaming, relied on counting the number of contributions made by a search, page views and volunteer contributions). We de- user1, but we felt that the amount of time editors spent scribe a methodology for identifying clusters of user activity editing might serve as a more appropriate measure. and argue that regularity with which these activity clusters The measurement strategy we came up with was based appear implies a good rule-of-thumb inactivity threshold of on the clustering of Wikipedia editors' activities into \edit about 1 hour. We conclude with implications that these sessions" with the assumption that the duration of an edit temporal rhythms may have for system design based on our session would represent a lower bound of the amount of observations and theories of goal-directed human activity. time invested into Wikipedia contributions [9]. Through our ethnographic work in Wikipedia we had found the notion of Categories and Subject Descriptors work sessions to be intuitive, yet there did not appear to be a consensus in the literature on how to identify work ses- H.4 [Information Systems Applications]: Miscellaneous; sions from timestamped user activities. This led us to look H.1.1 [Coding and Information Theory]: Formal mod- to the data for insight about what might be a reasonable els of communication approach to delineating users' editing activity into sessions. The regularities we found in inter-activity time amazed us arXiv:1411.2878v2 [cs.HC] 4 Aug 2019 General Terms with their intuitiveness and the simplicity of session demar- cation they implied. It is that work that led us to look for Theory, Measurement, Human Factors such regularities in other systems and to write this paper to share our results. Keywords We are not the first to try our hands at identifying a User session, Activity, Human behavior, Regularities, Met- reasonable way to measure user session behavior in human- rics, Modeling computer interaction. User sessions have been used exten- sively to generate metrics for understanding the performance of information resources [10] { especially in the domain of search [7, 8] and content personalisation [13, 21]. Despite this interest in understanding the nature and manifestation of user sessions, no clear consensus about how to perform session identification has emerged. In fact, some work has 1for example, \Wikipedian is first to hit 1 million edits" http://www.dailydot.com/news/ This is a non-peer reviewed, pre-submission of an article. wikipedian-first-1-million-edits gone as far as to argue that sessions don't actually exist as a by Poisson-based initiation of tasks and a powerlaw of time useful divide for user activity [11] or that the common strat- inbetween task events[23]. egy of choosing a global inactivity threshold is arbitrary at In contrast, Nardi calls out this cognitive science work for best [15]. neglecting context in work patterns, motivation and com- In this paper, we propose and demonstrate a strategy for munity membership { thereby inappropriately reducing a identifying user sessions from log data and demonstrate how human to a processing unit in a vacuum [17] (p21). In- the results match both intuition and theory about goal- stead, Nardi draws from the framework of Activity Theory directed human activity. We also show how this strategy (AT) to advocate for an approach to understanding human- yields consistent results across many different types of sys- computer interaction as a conscious procession of activities. tems and user activities. First, we summarize previous work AT describes an activity as a goal-directed or purposeful in- which attempts to make sense of user session behavior from teraction of a subject with an object through the use of tools. log data. Then we discuss theoretical arguments about how AT further formalizes an activity as a collection of actions goal-directed user behavior ought to manifest in the data. directed towards completing the activity's goal. Similarly, Third, we discuss a generalized version of the inactivity actions are composed of operations, a fundamental, indivis- threshold identification strategy we developed in [9] and ible, and unconscious movement that humans make in the present strategies for identifying optimal inactivity thresh- service of performing an action. olds in new data. Then, we introduce 6 different systems For an example application of AT, let us examine Wikipedia from which we have extracted 10 different types of user editing. Our ethnographic work with Wikipedia editors sug- actions for analysis and comparison. Finally, we conclude gests that it is common to set aside time on a regular basis with discussions of the regularities and irregularities between to spend doing \wiki-work". AT would conceptualize this datasets and what that might imply for both our under- wiki-work overall as an activity and each unit of time spent standing of the measurement of human behavior and the engaging in the wiki-work as an activity phase { though we design of user-facing systems prefer the term \activity session". The actions within an activity session would manifest as 2. RELATED WORK individual edits to wiki pages representing contributions to encyclopedia articles, posts in discussions and messages sent 2.1 Human activity sessions to other Wikipedia editors. These edits involve a varied set of operations: typing of characters, copy-pasting the details The concept of an activity session is an intuitive one, but of reference materials, scrolling through a document, reading it's surprisingly difficult to tie down a single definition of an argument and eventually, clicking the \Save" button. what a session is, and how it can be demarcated. A \ses- In this work we draw from both the concepts of the operation- sion" may refer to \(1) a set of queries to satisfy a single action-activity heirarchy of Activity Theory and the empir- information need (2) a series of successive queries, and (3) ical modeling strategies of cognitive science as applied to a short period of contiguous time spent querying and exam- time between events. ining results." [11] (1) is referred to, particularly in search-related literature [8, 11], not as a session but as a task{a particular information 2.2 Session identification need the user is trying to fulfil. Multiple tasks may happen User sessions have been used as behavioral measures of in a contiguous browsing period, or a single task may be human-computer interaction for almost two decades, and spread out over multiple periods. (2) is unclear. It may re- for this reason, strategies for session identification from log fer to a series of contiguous but unrelated queries (in which data have been extensively studied [8]. case it is identical to the third definition), or a series of con- Cooley et al. [5] and Spiliopoulou et al. [21] contast two tiguous queries based on the previous query in the sequence primary strategies for identifying sessions from activity logs: (in which case it is best understood as a sequence of tasks). \navigation-oriented heuristics" and \time-oriented heuris- (3) is the most commonly-used definition in the literature tics". we have reviewed [10, 16, 21, 22]. This contrasts with the Time-oriented heuristics refer to the assignment of an in- notion of task and is the definition of \session" that we have activity threshold between logged activites to serve as a ses- chosen for this paper. It's also the definition used by the sion delimiter. The assumption implied is that if there is a W3C [12]. break between a user's actions that is sufficiently long, it's We found inspiration in thinking about how to model user likely that the user is no longer active, the session is assumed session behavior in both the empirical modeling work of cog- to have ended, and a new session is created when the next nitive science and the theoretical frameworks of human con- action is performed. This is the most commonly-used ap- sciousness as applied to \work activities". proach to identify sessions, with 30 minutes serving as the The lack of purely random distribution in the time be- most commonly used threshold [8, 21, 18]. Both thresh- tween logged human actions has been the topic of recent old and approach appear to originate in a 1995 paper by studies focusing on the cognitive capacity of humans as in- Catledge & Pitkow [4] that used client-side tracking to iden- formation processing units.