What Women Like: a Gendered Analysis of Twitter Users' Interests
Total Page:16
File Type:pdf, Size:1020Kb
Wikipedia, a Social Pedia: Research Challenges and Opportunities: Papers from the 2015 ICWSM Workshop What Women Like: A Gendered Analysis of Twitter Users’ Interests based on a Twixonomy Stefano Faralli, Giovanni Stilo and Paola Velardi Department of Computer Science Sapienza University of Rome, Italy [email protected] Abstract al., 2010). The majority of these methods infer interests from lexical information in tweets (bigrams, named entities or la- In this paper we analyze the distribution of interests in a large tent topic models), a technique that may fall short in terms population of Twitter users (the full set of 40 million users in of computational complexity when applied to large Twitter 2009 and a sample of about 100 thousand New York users in 2014), as a function of gender. To model interests, we asso- populations, as shown in Stilo and Velardi (2014). ciate ”topical” friends in users’ friendship lists (friends repre- Only few studies investigated the characteristics of Twit- senting an interest rather than a social relation between peers) ter users regardless of specific applications. In Kim et al. with Wikipedia categories. A word-sense disambiguation al- (2010) it is shown that words extracted from Twitter lists gorithm is used for selecting the appropriate wikipage for could represent latent characteristics of the users in the re- each topical friend. Starting from the set of wikipages repre- spective lists. In Kapanipathi et al. (2014) named entities are senting the population’s interests, we extract the sub-graph of extracted from tweets, then, Wikipedia categories, named Wikipedia categories connected to these pages, and we then primitive interests, are associated to each named entity. To prune cycles to induce a direct acyclic graph, that we call select a reduced number of higher-level categories, named Twixonomy. We use a novel method for reducing the compu- tational requirements of cycle detection on very large graphs. hierarchical interests, spreading of activation (Anderson, For any category at any generalization level in the Twixon- 1968) is used on the Wikipedia graph, where active nodes omy, it is then possible to estimate the gender distribution of are initially the set of primitive interests. Note that, despite Twitter users interested in that category. We analyze both the their name, hierarchical interests are not hierarchically or- population of ”celebrities”, i.e. male and female Twitter users dered. Furthermore, as discussed later in this paper (Section with an associated wikipage, and the population of ”peers”, 5), higher level categories in Wikipedia may be totally unre- i.e. male and female users who follow celebrities. lated with some of the connected wikipages. Similarly to us, Bhattacharya et al. (2014) try to infer 1. Introduction users interests at a large scale. Their system, named Who Likes What, is the first system that can infer users’ interests In this paper we present a method for extensively analyzing in Twitter at the scale of millions of users. First, the topi- the distribution of interests in Twitter according to gender. cal expertise of popular Twitter users is learned using a la- Our work is related with two areas in social media analytics: tent model on Twitter lists in which such users actively par- analysis of users’ interests and gender studies. Large-scale ticipate. Then, the interests of the users following through studies of Twitter users across the world mainly report sim- the lists such expert users are transitively inferred. By do- 1 ple demographic statistics like gender, age and geographic ing so, Who Likes What can infer the interests of around 30 distribution, followers and following counts, etc. A consid- millions users, covering 77% of the analyzed populations. erable number of works are aimed at modeling users’ inter- Evaluation is performed at a much smaller scale, by manu- ests for some specific purpose, like detecting trending top- ally comparing extracted interests with those declared in a ics, i.e. topics that emerge and become popular in a specific number of users’ bio, and by using human feedback from time slot. Trending topics are extracted to model users’ ex- 10 evaluators. The evaluators commented that the inferred pertise (Wagner et al., 2012), to produce a recommendation interests, even though useful, are sometimes too general: on (Garcia and Amatriain, 2010; Kywe, Lim, and Zhu, 2012; the other side, given the large and unstructured nature of the 2 Lu, Lam, and Zhang, 2009) , or to analyze general interests extracted interests (over 36 thousand distinct topics), gener- (e.g. events) that are predominant in a given time span (Li et ating labels at the right level of granularity is not straightfor- ward. Copyright c 2015, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved. Concerning gender studies, research mainly concentrated 1http://www.beevolve.com/twitter-statistics on gender profiling, i.e. automatically inferring a user’s gen- 2The literature on interest-based user recommendation is very der (Marquardt et al., 2014; Sap et al., 2014; Smith, 2014), vast and it would be impossible to survey it all. Refer to Kywe, and on the analysis of gendered language online (Bamman, Lim, and Zhu (2012) for a survey. Eisenstein, and Schnoebelen, 2014). An interesting work 34 (Szell and Thurner, 2013) has been recently published in sociological analysis, like the study of gender interests by which the authors analyze gender differences in the social category, which is the focus of this paper. In our gender behavior of about 300,000 players of an online game. To the analysis we consider both celebrities (i.e. male and female best of our knowledge, no studies on users’ interests in so- topical users4) and male and female common users. cial network by gender have been published so far, except The paper is organized as follows: In Section 2 we shortly for a sociological analysis on women’s soccer Twitter audi- describe the datasets and tools used in this study, Section 3 ence (Cochea, 2014). presents the algorithm to create the Twixonomy, Section 4 With respect to the analyzed bibliography, our main con- is dedicated to a comparison with Who Likes What (Bhat- tribution is the acquisition of a Twixonomy, a directed tacharya et al., 2014) and Section 5 performs a study of gen- acyclic graph (DAG) of Twitter users’ interests, inferred der distribution across categories. Finally, Section 5 is dedi- from users’ friendship information of the entire Twitter pop- cated to concluding remarks and future work. ulation, and from Wikipedia. We prefer friendship rather than textual features to model users’ interests, since, unless 2. Data and resources we are addressing a specific community (like e.g. the mem- For our study we use the following resources: bers of a political party, as in Colleoni, Rozza, and Arvids- son (2014)), the number of inferred topics may quickly • The Twitter 2009 network The authors in Kwak et al. grow, as in Bhattacharya et al. (2014) , and it is very hard (2010) have crawled and released the entire Twitter net- to make sense of them, or even to evaluate their quality. work as of July 2009. Since Twitter data are no longer Furthermore, textual features such as word clusters are tem- available to researches, this remains the largest available porally unstable as compared to friendship and categorial snapshot of Twitter, with 41 million user profiles and 1.47 interests, as already shown in Myers and Leskovec (2014) billion social relations. Even though things might have and Siehndel and Kawase (2012). In Barbieri, Manco, and changed in Twitter since 2009 - the number of users has Bonchi (2014) the authors argue that users’ interests can grown up to 500 millions - our purpose in this paper is be implicitly represented by the authoritative users (named to demonstrate the efficiency of our algorithms on a very hereafter topical users) they are linked to by means of friend- large sample of users. ship relations. This information is available in users’ pro- • The Twitter 2014 NewYork network On June 2014 we files, and does not require additional textual processing. Top- crawled a sample of New York Twitter users starting from ical friends are therefore both stable and readily accessible a seed of 3800 users who tweeted more than 20 times in indicators of a user’s interest. However, as a mean to system- New York5. With respect to the Twitter 2009 dataset, this atically analyze interests in large networks, this information network is much smaller but highly connected. is hardly interpretable and sparse, just like lexical features • and lists. Babelfy Babelfy (Moro, Raganato, and Navigli, 2014) is To obtain a hierarchical representation of interests, we a graph-based word-sense disambiguation (WSD) algo- first associate a Wikipedia page with topical users in users’ rithm based on a loose identification of candidate mean- friendship lists. We denote as topical users those for which ings coupled with a densest sub-graph heuristic which one such correspondence exists. This definition is slightly selects high-coherence semantic interpretations. Babelfy different from that adopted in Barbieri, Manco, and Bonchi disambiguates all nominal and named entity mentions oc- (2014)3, however it seems equally intuitive. curring within a text, using the BabelNet semantic net- In general many pages can be associated with a Twitter ac- work (Navigli and Ponzetto, 2012) a very large multilin- count name, therefore we use a word sense disambiguation gual knowledge base, obtained from the automatic inte- algorithm, as detailed later in this paper. Users can then be gration of Wikipedia and WordNet. Babelfy has shown directly (if they map to a wikipage) or indirectly (if they fol- to obtain state-of-the-art performances in standard WSD benchmarks and challenges. Both BabelNet and Babelfy low a mapped user) linked to one or more Wikipedia pages 6 representing his/her primitive interests, (we use the same are available online .