Cohort Modeling for Enhanced Personalized Search Jinyun Yan Wei Chu Ryen W
Total Page:16
File Type:pdf, Size:1020Kb
Cohort Modeling for Enhanced Personalized Search Jinyun Yan Wei Chu Ryen W. White Rutgers University Microsoft Bing Microsoft Research Piscataway, NJ 08854 USA Bellevue WA 98004 USA Redmond, WA 98052 USA [email protected] [email protected] [email protected] ABSTRACT that approximately 33% of query instances in their case study were Web search engines utilize behavioral signals to develop search ex- an exact repeat of a query submitted by the same user at a previous periences tailored to individual users. To be effective, such person- time. For such refinding queries, the results clicked by individual alization relies on access to sufficient information about each user’s users in their search history provide a strong signal to identify the interests and intentions. For new users or new queries, profile in- correct result for each user when that query is repeated [32]. The formation may be sparse or non-existent. To handle these cases, and remaining 67% queries are new, and it can be challenging to im- perhaps also improve personalization for those with profiles, search prove their search quality via personalization given limited history. engines can employ signals from users who are similar along one One way in which these issues can be addressed is by finding co- or more dimensions, i.e., those in the same cohort . In this paper we horts of searchers who share one of more attributes with the current describe a characterization and evaluation of the use of such cohort searcher. Attributes that could be used to form cohorts include lo- modeling to enhance search personalization. We experiment with cation, topical interest, and domain preferences, all easily accessi- three pre-defined cohorts—topic, location, and top-level domain ble to search engines via users’ long-term search histories. Given a preference—independently and in combination, and also evaluate user, we can leverage the search behavior of other members of their methods to learn cohorts dynamically. We show via extensive ex- cohort(s) to enhance personalization by providing signals if suffi- perimentation with large-scale logs from a commercial search en- cient information is unavailable or as an additional signal to build gine that leveraging cohort behavior can yield significant relevance richer personalization models if they already exist. Cohorts have gains when combined with a production search engine ranking al- been used effectively in applications such as collaborative filtering gorithm that uses similar classes of personalization signal but at the (CF) [12], where groups of similar users (based on factors such as individual searcher level. Additional experiments show that our liking the same item [19]) can yield relevant recommendations. Co- gains can be extended when we dynamically learn cohorts and tar- horts have also shown some limited utility in retrieval settings. get easily-identifiable classes of ambiguous or unseen queries. Groupization has shown promise in laboratory settings [30], task models to find those engaged in similar tasks have yielded strong Categories and Subject Descriptors results [37], and there have even been attempts to use CF more di- H.3.3 [ Information Storage and Retrieval ]: Information Search rectly in search result ranking [25]. However, there has been no de- and Retrieval – search process , selection process , clustering . tailed study of applying cohort models to enhance Web-scale search personalization. We address that shortcoming in this paper. Keywords Cohort modeling; Personalization; Web search. We propose the construction and application of user cohorts to en- hance Web search personalization. Our initial method creates pre- 1. INTRODUCTION defined cohorts on three types: topic, location, and top-level do- Personalization of search results has been investigated in detail in main preference (e.g., .gov, .edu). Rather than limiting ourselves to domains such as Web search and beyond [24][27][31]. The ability these pre-defined sets, we also propose clustering methods capable to tailor search results to a particular individual enables a wealth of of learning the cohorts and dynamically assigning users to one or opportunity to better satisfy their particular information needs. Per- more clusters. We demonstrate through extensive experimentation sonalization models are typically learned from observed short- and with search engine log data that our cohort modeling methods can long-term search behavior (such as queries and result clicks), which yield significant relevance improvements over a production ranker is either used directly [32] or is converted into a different represen- that already included personalization targeting the current searcher. tation (e.g., a set of topical categories) to build more general models We show that these gains are even larger when we target particular and improve personalization coverage [5][24]. Despite the value of queries (e.g., those with high ambiguity) and particular users (e.g., personalization, one drawback is that it requires sufficient user in- those with no query-relevant history). formation to perform effectively; users must be willing to share We make the following contributions with this paper: their search history and the search engine must attain sufficient in- formation on user interests to build accurate profiles. Even short- • Describe a method to generate pre-defined cohorts using infor- term personalization depends on the long-term behavior for the first mation readily available to search engines, specifically topic, query in the session when no other activity has been observed [5]. location, and top-level domain preference. It is known that users frequently submit the same query to find the • Demonstrate that modeling user interests within these cohorts information they have searched previously. Teevan et al. [29] found can enhance state-of-the-art search personalization methods, leading to significant gains in search relevance. Permission to make digital or hard copies of all or part of this work for personal or • Demonstrate that there are particular sets of easily-identifiable classroom use is granted without fee provided that copies are not made or distributed queries (e.g., ambiguous or new queries for a user), for which for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others cohort modeling can be particularly effective. than ACM must be honored. Abstracting with credit is permitted. To copy • Propose methods to dynamically learn cohorts rather than using otherwise, or republish, to post o n servers or to redistribute to lists, requires prior pre-defined sets, and demonstrate strong relevance gains. specific permission and/or a fee. Request permissions from [email protected]. SIGIR’14, July 6–11, 2014, Gold Coast, Queensland, Australia. The remainder of the paper is structured as follows. In Section 2 we Copyright © 2014 ACM 978-1-4503-2257-7/14/07…$15.00. present related work in areas such as personalization, collaborative http://dx.doi.org/10.1145/2600428.2609617 filtering, and cohort identification and use. Section 3 describes the based on those of users with similar profiles. Similar approaches pre-defined cohorts and the modeling process, as well as various have used click-through data to personalize result rankings and definitions of important attributes such as click-through rate and backed-off to the clicks of others [2][26]. Almeida and Almeida [2] smoothing, as well as how we can apply clustering methods to learn used Bayesian algorithms to cluster users of an online bookstore cohorts dynamically rather than using a pre-defined set. Section 4 into communities based on links clicked within the site and found describes the datasets. Section 5 describes our methods and exper- that the popularity of links within different communities could be imental results, and Section 6 reports the results of learned cohorts. used to customize result rankings. Lee [17] proposed a system that Section 7 discusses our findings, their implications, and concludes. uses data mining to uncover patterns in users’ queries and browsing to generate recommendations for users with similar queries. These 2. RELATED WORK techniques perform matching with other users based on individual There are three relevant areas of related work: (1) personalization queries or URLs, severely limiting coverage. Freyne and Smyth of search engines based on short- and long-term searcher interests, [10] addressed this concern by connecting different communities (2) collaborative filtering, and (3) mining the search behavior of based on the degree to which their queries and result clicks overlap. other users to complement and enhance search personalization. Alternative methods have been proposed that are query independ- Large-scale behavioral data from search engines has been mined ent. Smyth [23] suggested that click-through data from users in the extensively to improve result relevance in the aggregate across all same “search community” (e.g., a group of people who use a spe- users [1][15]. Search preferences are personal and research on per- cial-interest Web portal or work together) could enhance search. He sonalizing retrieval [22][31] has shown that implicitly collected in- provided evidence for the existence of search communities by formation such as browser history, query history, and desktop in- showing that a group of co-workers had a higher query similarity formation, can be used to improve the relevance of search results threshold than general Web users. Ieong et al. [14] showed that for a particular individual. Short-term behavior from within the cur- searchers exhibited domain preferences, where they favored partic- rent search session has been used for tasks such as result ranking ular sources when selecting results. White et al. [38] found that do- [39] or predicting future search interests [34][35]. Teevan et al. [31] main experts preferred different top-level domains than novices, showed that their personalization algorithm improved as more data with more focus on educational (.edu) and governmental (.gov) was available about the current user.