Detecting Social Roles in Twitter
Sunghwan Mac Kim, Stephen Wan and Cécile Paris
Data61, CSIRO, Sydney, Australia
{Mac.Kim, Stephen.Wan, Cecile.Paris}@csiro.au

Proceedings of The Fourth International Workshop on Natural Language Processing for Social Media, pages 34-40, Austin, TX, November 1, 2016. © 2016 Association for Computational Linguistics

Abstract

For social media analysts or social scientists interested in better understanding an audience or demographic cohort, being able to group social media content by demographic characteristics is a useful mechanism to organise data. Social roles are one particular demographic characteristic, which includes work, recreational, community and familial roles. In our work, we look at the task of detecting social roles from English Twitter profiles. We create a new annotated dataset for this task. The dataset includes approximately 1,000 Twitter profiles annotated with social roles. We also describe a machine learning approach for detecting social roles from Twitter profiles, which can act as a strong baseline for this dataset. Finally, we release a set of word clusters obtained in an unsupervised manner from Twitter profiles. These clusters may be useful for other natural language processing tasks in social media.

1 Introduction

Social media platforms such as Twitter have become an important communication medium in society. As such, social scientists and media analysts are increasingly turning to social media as a cheap and large-volume source of real-time data, supplementing "traditional" data sources such as interviews and questionnaires. For these fields, being able to examine demographic factors can be a key part of analyses. However, demographic characteristics are not always available on social media data. Consequently, there has been a growing body of work investigating methods to estimate a variety of demographic characteristics from social media data, such as gender and age on Twitter and Facebook (Mislove et al., 2011; Sap et al., 2014) and YouTube (Filippova, 2012). In this work we focus on estimating social roles, an under-explored area.

In the social psychology literature, Augoustinos et al. (2014) provide an overview of schemata for social roles, which includes achieved roles based on the choices of the individual (e.g., writer or artist) and ascribed roles based on the inherent traits of an individual (e.g., teenager or schoolchild). Social roles can represent a variety of categories including gender roles, family roles, occupations, and hobbyist roles. Beller et al. (2014) have explored a set of social roles (e.g., occupation-related and family-related social roles) extracted from tweets. They used a pragmatic definition for social roles: namely, the word following the simple self-identification pattern "I am a/an ___". In contrast, our manually annotated dataset covers a wide range of social roles without using this fixed pattern, since it is not necessarily mentioned before the social roles.

On Twitter, users often list their social roles in their profiles. Figure 1, for example, shows the Twitter profile of a well-known Australian chef, Manu Feildel (@manufeildel). His profile provides information about his social roles beyond simply listing occupations. We can see that he has both a profession, Chef, as well as a community role, Judge on My Kitchen Rules (MKR), which is an Australian cooking show.

[Figure 1: An example of a Twitter profile.]

The ability to break down social media insights based on social roles is potentially a powerful tool for social media analysts and social scientists alike. For social media analysts, it provides the opportunity to identify whether they reach their target audience and to understand how subsets of their target audience (segmented by social role) react to various issues. For example, a marketing analyst may want to know what online discussions are due to parents versus other social roles.

Our aim in this paper is to provide a rich collection of English Twitter profiles for the social role identification task. The dataset includes approximately 1,000 Twitter profiles, randomly selected, which we annotated with social roles. Additionally, we release unsupervised Twitter word clusters that will be useful for other natural language processing (NLP) tasks in social media.[1] Finally, we investigate social role tagging as a machine learning problem. A machine learning framework is described for detecting social roles in Twitter profiles.

Our contributions are threefold:
• We introduce a new annotated dataset for identifying social roles in Twitter.
• We release a set of Twitter word clusters with respect to social roles.
• We propose a machine learning model as a strong baseline for the task of identifying social roles from Twitter profiles.

2 Crowdsourcing Annotated Data

Twitter user profiles often list a range of interests that they associate with, and these can vary from occupations to hobbies (Beller et al., 2014; Sloan et al., 2015). The aim of our annotation task was to manually identify social role-related words in English Twitter profile descriptions. A social role is defined as a single word that could be extracted from the description. These can include terms such as engineer, mother, and fan. For instance, we obtain Musician and Youtuber as social roles from "Australian Musician and Youtuber who loves purple!".[2]

To study social roles in Twitter profiles, we compiled a dataset of approximately 1,000 randomly selected English Twitter profiles which were annotated with social roles. These samples were drawn from a large number of Twitter profiles crawled by a social network-based method (Dennett et al., 2016). Such a dataset provides a useful collection of profiles for researchers to study social media and to build machine learning models.

Annotations were acquired using the crowdsourcing platform Crowdflower,[3] which we now outline.

[1] Our dataset and word clusters are publicly available at https://data.csiro.au.
[2] This is a real example.
[3] crowdflower.com

2.1 Crowdflower Annotation Guidelines

We asked Crowdflower annotators to identify social roles in the Twitter profiles presented to them, using the following definition: "Social roles are words or phrases that could be pulled out from the profile and inserted into the sentence I am a/an ___." Note that the profile does not necessarily need to contain the phrase "I am a/an" before the social role, as described in Section 1.

The annotation interface is presented in Figure 2. The annotator is asked to select spans of text. Once a span of text is selected, the interface copies this text into a temporary list of candidate roles. The annotator can confirm that the span of text should be kept as a role (by clicking the 'add' link, which moves the text span to a second list representing the "final candidates"). It is also possible to remove a candidate role from the list of final candidates (by clicking 'remove'). Profiles were allowed to have more than one social role.

[Figure 2: The Crowdflower annotation interface.]

Annotators were asked to keep candidate roles as short as possible, as in the following instruction: if the Twitter profile contains "Bieber fan", just mark the word "fan".[4] Finally, we instructed annotators to only mark roles that refer to the owner of the Twitter profile. For example, annotators were asked not to mark wife as a role in: I love my wife. Our Crowdflower task was configured to present five annotation jobs in one web page. After each set of five jobs, the annotator could proceed to the next page.

Table 1: Descriptive statistics for the annotated data.

  Number of annotated profiles   983
  Average description length     13.02 words
  Longest description length     74 words
  Shortest description length    1 word
  Number of unique roles         488

Table 2: Top 10 ranked social roles in Twitter profiles.

  Social role    Frequency
  student        25
  fan            24
  girl           16
  writer         14
  teacher        13
  geek           12
  author         11
  artist         10
  directioner    9
  designer       8

2.2 Crowdflower Parameters

To acquire annotations as quickly as possible, we used the highest speed setting in Crowdflower and did not place additional constraints on the annotator selection, such as language, quality and geographic region. The task took approximately 1 week. We offered 15 cents AUD per page. To control annotation quality, we utilised the Crowdflower facility to include test cases called test validators, using 50 test cases to evaluate the annotators. We required a minimum accuracy of 70% on test validators.

Table 3: Frequencies of the number of roles used to annotate one Twitter profile in our dataset.

  Number of roles    Frequency (%)
  0                  552 (56.2)
  1                  213 (22.7)
  2                  101 (10.3)
  3                  45 (4.6)
  4                  31 (3.2)
  5                  23 (2.3)
  6                  8 (0.8)
  7                  2 (0.2)
  8                  6 (0.6)
  9                  2 (0.2)

Most descriptions contain no social role (56.2%) or only one role. The remaining descriptions (21.1%) contain more than one social role.

2.3 Summary of Annotation Process

At the completion of the annotation procedure, Crowdflower reported the following summary statistics that provide insights on the quality of the annotations. The majority of the judgements were sourced from annotators deemed to be trusted (i.e., reliable annotators) (4750/4936).

3 Word Clusters

We can easily access a large-scale unlabelled dataset using the Twitter API, supplementing our dataset, to apply unsupervised machine learning methods to help in social role tagging. Previous work showed that word clusters derived from an unlabelled dataset can improve the performance of many NLP applications (Koo et al., 2008; Turian et al., 2010; Spitkovsky et al., 2011; Kong et al., 2014). This finding motivates us to use a similar approach to im-
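The fixed self-identification pattern of Beller et al. (2014), discussed above, can be sketched as a small regular expression. The code below is an illustrative reconstruction under that pragmatic definition, not the authors' implementation; function and pattern names are made up for the sketch.

```python
import re

# Capture the word following "I am a" or "I am an" (case-insensitive).
# This is a hedged sketch of the pattern-based definition, which our
# annotated dataset deliberately goes beyond.
SELF_ID = re.compile(r"\bI am an? (\w+)\b", re.IGNORECASE)

def extract_pattern_roles(text):
    """Return candidate social roles matched by the fixed pattern."""
    return [m.group(1).lower() for m in SELF_ID.finditer(text)]

print(extract_pattern_roles("I am a teacher and I am an artist."))
# → ['teacher', 'artist']
```

Note that a profile such as "Australian Musician and Youtuber who loves purple!" yields nothing under this pattern, which is exactly the gap the manual annotation addresses.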
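As a minimal sketch of how released word clusters are typically consumed downstream, the snippet below maps profile tokens to cluster identifiers for use as features. The tab-separated "cluster_id, word" file format and the OOV back-off are assumptions for illustration; consult the released clusters at https://data.csiro.au for the actual format.

```python
# Hedged sketch: word-cluster lookup as a feature function.
def load_clusters(lines):
    """Build a word -> cluster-ID map from 'cluster_id<TAB>word' lines."""
    clusters = {}
    for line in lines:
        cluster_id, word = line.rstrip("\n").split("\t")
        clusters[word] = cluster_id
    return clusters

def cluster_features(tokens, clusters):
    """Replace each token with its cluster ID, backing off to 'OOV'."""
    return [clusters.get(t.lower(), "OOV") for t in tokens]

clusters = load_clusters(["0101\tchef", "0101\tcook", "1100\tjudge"])
print(cluster_features(["Chef", "and", "Judge"], clusters))
# → ['0101', 'OOV', '1100']
```

Grouping sparse surface forms (chef, cook) under one cluster ID is what lets an unlabelled corpus help a supervised tagger, as in the cited prior work.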