Constructing a Hierarchical User Interest Structure Based on User Profiles
Total Page:16
File Type:pdf, Size:1020Kb
Constructing a Hierarchical User Interest Structure based on User Profiles Chao Zhao Min Zhao Yi Guan School of Computer Baidu Inc. School of Computer Science and Technology Beijing, China Science and Technology Harbin Institute of Technology Email: [email protected] Harbin Institute of Technology Harbin, China Harbin, China Email: [email protected] Email: [email protected] Abstract—The interests of individual Internet users fall into capturing these concepts in the user profile, the web search a hierarchical structure which is useful in regards to building engine is more likely to ascribe the query to the Houston personalized searches and recommendations. Most studies on Rockets basketball team rather than spacecraft or other techni- this subject construct the interest hierarchy of a single person from the document perspective. In this study, we constructed the cally unrelated topics when the user simply searches the word user interest hierarchy via user profiles. We organized 433,397 “rockets”. user interests, referred to here as “attentions”, into a user attention network (UAN) from 200 million user profiles; we then Houston Rockets applied the Louvain algorithm to detect hierarchical clusters in these attentions. Finally, a 26-level hierarchy with 34,676 NBA San Antonio Spurs clusters was obtained. We found that these attention clusters were aggregated according to certain topics as opposed to the Basketball CBA hyponymy-relation based conceptual ontologies. The topics can ··· be entities or concepts, and the relations were not restrained by hyponymy. The concept relativity encapsulated in the user’s ··· interest can be captured by labeling the attention clusters with FIFA world cup corresponding concepts. Sports Copa America´ I. INTRODUCTION Football AFC Asian Cup Designing systems to retrieve or recommend relevant infor- Volleyball mation to meet users’ needs is an increasingly challenging ··· endeavor as the massive stores of online data continue to ··· grow. The most common approach to doing so is based on personalized information related to user interests. These Fig. 1: Example interest hierarchy. individual interests, which we refer to here as attentions1, are stored in the user profiles. They are a key component in the There are many existing methods for extracting hierarchical filtering and recommendation systems of today’s web services, user interests from the user’s behavior (e.g., clicks on or scrolls e.g., search engines and feeds[1], [2]. through various documents or sites). Previous studies have User attentions can be built into a hierarchy, where the mainly focused on mining the interests of only one individual arXiv:1709.06918v1 [cs.CL] 20 Sep 2017 general and specific interests of users are effectively organized at a time; research on the inner structure or relativity among in different levels. According to [3], higher-level interest these interests themselves is relatively scant. For example, is categories reflect longer-term user interests; lower-level cate- this hierarchy similar to that of a conceptual ontology? What gories reflect the user’s current interests. [4] attributes higher- broader subjects will capture a user’s interest, say, if he or level attentions to implicit, passive interests, while lower-level she clicks on articles about artificial intelligence (AI)? How attentions correspond to explicit, active interests. In any case, similar are this person’s interests to another person who is this kind of user profile can be effectually utilized for person- interested in machine learning? alization (e.g., query disambiguation, interest expansion, and In this study, we attempted to construct an attention hier- cold-start problems alleviation) by capturing the semantic or archy directly from a set of user profiles, rather than a single knowledge-level similarities of attentions [5], [6], [7], [8], [9]. profile, using a clustering technique. We explored the factors Figure 1 shows an example of “Sports-Basketball-NBA- responsible for aggregating the attentions together into the Rockets” as a certain path in the interest hierarchy. Upon hierarchy. Our primary contributions are two-fold: 1 We organized all the attentions as a user attention network For clarity, we use attentions to refer to words representing user interests, • e.g., “Harry Potter” or “Fantasy movie”. (UAN) based on their co-occurrences in user profiles, then divided the UAN into communities constituting an III. METHODS attention hierarchy structure. A. User profile We analyzed the factors responsible for clustering certain • attentions according to the attention hierarchy, and found We randomly selected 200 million user profiles from the that topics – not concepts – were paramount. Concepts Baidu search engine, which were obtained automatically from only occasionally play a role similar to topics. user behaviors (e.g., user search and click histories), to con- The remainder of this paper is structured as follows. In struct our dataset. Each user is associated with a series of Section II, we give a brief review of related works on the attentions given personalized weights. The weight of each hierarchical representation of user interests. In Section III, attention is an integer between 1 to 2,000 that indicates the we describe the UAN construction and corresponding com- importance of this attention to the user. Most of the attentions munity detection method in detail. We present the community are entities, such as the “Harry Potter”, “Stephen Curry”, or detection results in Section IV and discuss the aggregation “Fan Bingbing”; several were also be concepts, e.g., “fantasy mechanics of attentions in Section V. A brief conclusion novel”, “NBA”, or “film star”. To improve the quality of the and discussion on future research directions are presented in dataset, we only reserved the core attentions of each user with Section VI. weights of exactly 2,000. We also discarded any users with less than 10 core attentions, because we prefer to believe that II. RELATED WORKS these attentions co-occur randomly rather than due to non- The extant research on constructing user interest hierarchies trivial factors, e.g., higher-level interests of users. can be split into two main categories: Concept-based and We obtained a total of 433,379 unique attentions after filter- content-based. ing. We analyzed the distribution of the number of attentions Concept-based methods model the hierarchy of interests for an individual user as shown in Figure 2: One user can with the aid of concept taxonomies. These include the Open have no more than 250 core attentions. The distribution is Directory Project[10], [11], [6] or Wikipedia folksonomy[12], power-law like, and most of the users have less than 50 [13]. [10] used the first three levels of ODP categories to attentions. We also identified the distribution of the number of construct a general interest profile for mapping user queries followers for a unique attention, as shown in Figure 3. Both to a set of categories. [11] mapped each interest of one user axes are logarithmic-scaled and explicitly follow a power- as a hierarchical ODP path, then calculated the similarity law distribution. There are a few attentions that possess a between the search results and the user profiles and re-ranked very large amount of followers, while most attentions are the results accordingly for the sake of personalization. [12] associated with relatively few people. In effect, attentions are mined Twitter user’s interests from the entities of their Tweets very personalized attributes for individual users. The most by finding corresponding concepts in Wikipedia folksonomy. popular concept and entity attentions are “society” and “Fan By leveraging the Wikipedia category graph, [13] further Bingbing”, respectively. constructed interest hierarchies of Twitter users via spreading activation. Content-based methods involve constructing interest hier- archies without the aid of predefined taxonomy. [14], [15] mined concept-based user profiles from clickthrough data to infer user concept preferences, assuming that general terms with higher frequency would be placed at higher levels in the user profile hierarchy while specific terms with lower frequency were placed at lower levels. They defined two types of relationships, similarity and parent-child, for the concepts c1 and c2 in the search results and utilized the similarity measure Sim(c , c ) and conditional probability P (c c ) 1 2 1| 2 and two thresholds to justify the existence of relationships between concepts. [4] extracted the terms from web pages bookmarked by users as their interests, and utilized a divisive Fig. 2: Distribution of number of attentions in an individual clustering algorithm to group these interests into a hierarchy. user profile. x-axis represents the number of attentions in the The correlations between any two terms were measured by user profile; y-axis represents corresponding user numbers. the Augmented Expected Mutual Information (AEMI) and the MaxChildren threshold-finding method was used to find a reasonable threshold for correlations. [16] identified mismatch- B. User attention network ing between topic hierarchies used in analytics and learned Attentions in one user profile can be used to construct the from log data, which they resolved by first aggregating search interest hierarchy for the current user, but are less helpful as far queries and clicks into concepts in the hierarchy, then mapping as determining the more general associations among attentions the concepts to a topic taxonomy. themselves. To capture these associations, we assumed that TABLE I: The comparison five nearest neighbors of “Harry Potter” before and after weight re-calculation. before After neighbor weight neighbor weight society 20052 Daniel Radcliffe 141.774 Qiao Renliang 12606 Hermione Granger 140.087 Love O2O 11898 Emma Watson 138.945 entertainment 11612 Fantastic Beasts 135.415 doctor 10892 Harry Potter 5 134.320 Fig. 3: The distribution of the number of users for a unique attention. X-axis represents the number of followers of an C. Attention community detection attention, and y-axis represents the corresponding attention numbers. Both two axes are logarithmic scaled.