
Detecting Social Roles in Twitter

Sunghwan Mac Kim, Stephen Wan and Cécile Paris
Data61, CSIRO, Sydney, Australia
{Mac.Kim, Stephen.Wan, Cecile.Paris}@csiro.au

Abstract

For social media analysts or social scientists interested in better understanding an audience or demographic cohort, being able to group social media content by demographic characteristics is a useful mechanism to organise data. Social roles are one particular demographic characteristic, which includes work, recreational, community and familial roles. In our work, we look at the task of detecting social roles from English Twitter profiles. We create a new annotated dataset for this task. The dataset includes approximately 1,000 Twitter profiles annotated with social roles. We also describe a machine learning approach for detecting social roles from Twitter profiles, which can act as a strong baseline for this dataset. Finally, we release a set of word clusters obtained in an unsupervised manner from Twitter profiles. These clusters may be useful for other natural language processing tasks in social media.

1 Introduction

Social media platforms such as Twitter have become an important communication medium in society. As such, social scientists and media analysts are increasingly turning to social media as a cheap and large-volume source of real-time data, supplementing "traditional" data sources such as interviews and questionnaires. For these fields, being able to examine demographic factors can be a key part of analyses. However, demographic characteristics are not always available in social media data. Consequently, there has been a growing body of work investigating methods to estimate a variety of demographic characteristics from social media data, such as gender and age on Twitter (Mislove et al., 2011; Sap et al., 2014) and YouTube (Filippova, 2012). In this work we focus on estimating social roles, an under-explored area.

In the social psychology literature, Augoustinos et al. (2014) provide an overview of schemata for social roles, which include achieved roles based on the choices of the individual (e.g., writer or artist) and ascribed roles based on the inherent traits of an individual (e.g., teenager or schoolchild). Social roles can represent a variety of categories, including gender roles, family roles, occupations and hobbyist roles. Beller et al. (2014) explored a set of social roles (e.g., occupation-related and family-related social roles) extracted from tweets, using a pragmatic definition of a social role: namely, the word following the simple self-identification pattern "I am a/an ___". In contrast, our manually annotated dataset covers a wide range of social roles without relying on this fixed pattern, since the pattern does not necessarily appear before a social role.
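To make this pragmatic pattern concrete, the following is a minimal sketch of Beller et al.-style extraction; the regular expression and the example profile descriptions are illustrative assumptions and are not part of our dataset or method.

```python
import re

# Beller et al. (2014)-style pattern: the word following the
# self-identification phrase "I am a/an" (contraction included here;
# the exact expression is an illustrative assumption).
SELF_ID = re.compile(r"\bi(?:'m| am) an? (\w+)", re.IGNORECASE)

def pattern_roles(description):
    """Return candidate roles matched by the self-identification pattern."""
    return [w.lower() for w in SELF_ID.findall(description)]

# Hypothetical profile descriptions:
print(pattern_roles("I am a teacher and part-time writer"))  # ['teacher']
print(pattern_roles("Chef. Judge on MKR."))  # [] -- roles listed without
# the fixed pattern are missed, which motivates our annotated dataset.
```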


On Twitter, users often list their social roles in their profiles. Figure 1, for example, shows the Twitter profile of a well-known Australian chef, Manu Feildel (@manufeildel). His profile provides information about his social roles beyond simply listing occupations: we can see that he has both a profession, Chef, and a community role, Judge on My Kitchen Rules (MKR), an Australian cooking show.

Figure 1: An example of a Twitter profile.

The ability to break down social media insights based on social roles is potentially a powerful tool for social media analysts and social scientists alike. For social media analysts, it provides the opportunity to identify whether they reach their target audience and to understand how subsets of that audience (segmented by social role) react to various issues. For example, a marketing analyst may want to know which online discussions are due to parents versus other social roles.

Our aim in this paper is to provide a rich collection of English Twitter profiles for the social role identification task. The dataset includes approximately 1,000 randomly selected Twitter profiles, which we annotated with social roles. Additionally, we release unsupervised Twitter word clusters that will be useful for other natural language processing (NLP) tasks in social media.[1] Finally, we investigate social role tagging as a machine learning problem and describe a machine learning framework for detecting social roles in Twitter profiles.

Our contributions are threefold:

- We introduce a new annotated dataset for identifying social roles in Twitter.
- We release a set of Twitter word clusters with respect to social roles.
- We propose a machine learning model as a strong baseline for the task of identifying social roles from Twitter profiles.

2 Crowdsourcing Annotated Data

Twitter user profiles often list a range of interests that users associate with, and these can vary from occupations to hobbies (Beller et al., 2014; Sloan et al., 2015). The aim of our annotation task was to manually identify social role-related words in English Twitter profile descriptions. A social role is defined as a single word that could be extracted from the description. These can include terms such as engineer, mother and fan. For instance, we obtain Musician and Youtuber as social roles from "Australian Musician and Youtuber who loves purple!".[2]

To study social roles in Twitter profiles, we compiled a dataset of approximately 1,000 randomly selected English Twitter profiles, which were annotated with social roles. These samples were drawn from a large number of Twitter profiles crawled by a social network-based method (Dennett et al., 2016). Such a dataset provides a useful collection of profiles for researchers to study social media and to build machine learning models.

Annotations were acquired using the crowdsourcing platform Crowdflower,[3] as we now outline.

2.1 Crowdflower Annotation Guidelines

We asked Crowdflower annotators to identify social roles in the Twitter profiles presented to them, using the following definition: "Social roles are words or phrases that could be pulled out from the profile and inserted into the sentence I am a/an . . . ". Note that the profile does not need to contain the phrase "I am a/an" before the social role, as described in Section 1.

Figure 2: The Crowdflower annotation interface.

The annotation interface is presented in Figure 2. The annotator is asked to select spans of text. Once a span of text is selected, the interface copies this text into a temporary list of candidate roles. The annotator can confirm that the span of text should be kept as a role by clicking the 'add' link, which moves the text span to a second list representing the "final candidates".
It is also possible to remove a candidate role from the list of final candidates (by clicking 'remove'). Profiles were allowed to have more than one social role.

Annotators were asked to keep candidate roles as short as possible, as in the following instruction: if the Twitter profile contains "Bieber fan", just mark the word "fan".[4]

[1] Our dataset and word clusters are publicly available at https://data.csiro.au.
[2] This is a real example.
[3] crowdflower.com
[4] While this decision could give us a coarse granularity of social roles, it was an application-specific requirement, from a visualisation point of view, to minimise roles.

Finally, we instructed annotators to mark only roles that refer to the owner of the Twitter profile. For example, annotators were asked not to mark wife as a role in "I love my wife". Our Crowdflower task was configured to present five annotation jobs on one web page. After each set of five jobs, the annotator could proceed to the next page.

2.2 Crowdflower Parameters

To acquire annotations as quickly as possible, we used the highest speed setting in Crowdflower and did not place additional constraints on annotator selection, such as language, quality or geographic region. The task took approximately one week. We offered 15 cents AUD per page. To control annotation quality, we utilised the Crowdflower facility to include test cases, called test validators, using 50 test cases to evaluate the annotators. We required a minimum accuracy of 70% on the test validators.

2.3 Summary of Annotation Process

At the completion of the annotation procedure, Crowdflower reported summary statistics that provide insights into the quality of the annotations. The majority of the judgements (4,750 of 4,936) were sourced from annotators deemed to be trusted (i.e., reliable annotators). Crowdflower reported an inter-annotator agreement of 91.59%. Table 1 presents some descriptive statistics for our annotated dataset. We observe that our Twitter profile dataset contains 488 unique roles.

Table 1: Descriptive statistics for the annotated data.

  Number of annotated profiles   983
  Average description length     13.02 words
  Longest description length     74 words
  Shortest description length    1 word
  Number of unique roles         488

In Table 2, we present the top 10 ranked social roles. As can be seen, our extracted social roles include terms such as student and fan, highlighting that social roles in Twitter profiles cover a diverse range of personal attributes. In Table 3, we see that more than half (56.2%) of the descriptions do not contain any role, and approximately 22.7% contain one role. The remaining descriptions (21.1%) contain more than one social role.

Table 2: Top 10 ranked social roles in Twitter profiles.

  Social role   Frequency
  student       25
  fan           24
  girl          16
  writer        14
  teacher       13
  geek          12
  author        11
  artist        10
  directioner    9
  designer       8

Table 3: Frequencies of the number of roles used to annotate one Twitter profile in our dataset.

  Number of roles   Frequency (%)
  0                 552 (56.2)
  1                 213 (22.7)
  2                 101 (10.3)
  3                  45 (4.6)
  4                  31 (3.2)
  5                  23 (2.3)
  6                   8 (0.8)
  7                   2 (0.2)
  8                   6 (0.6)
  9                   2 (0.2)

3 Word Clusters

Using the Twitter API, we can easily access a large-scale unlabelled dataset that supplements our annotated data, allowing us to apply unsupervised machine learning methods to help with social role tagging. Previous work showed that word clusters derived from an unlabelled dataset can improve the performance of many NLP applications (Koo et al., 2008; Turian et al., 2010; Spitkovsky et al., 2011; Kong et al., 2014). This finding motivates us to use a similar approach to improve tagging performance for Twitter profiles.

Two clustering techniques are employed to generate the cluster features: Brown clustering (Brown et al., 1992) and K-means clustering (MacQueen, 1967). The Brown clustering algorithm induces a hierarchy of words from an unannotated corpus, and it allows us to directly map words to clusters. Word embeddings induced from a neural network are often useful representations of the meaning of words, encoded as distributional vectors. Unlike Brown clustering, word embeddings do not have any form of clusters by default; K-means clustering is thus used on the resulting word vectors. Each word is mapped to the unique cluster ID to which it was assigned, and these cluster identifiers were used as features.
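As an illustration of this embed-then-cluster pipeline, here is a minimal sketch using the gensim (4.x API assumed) and scikit-learn libraries rather than the original word2vec toolkit; the toy corpus and the reduced cluster count are assumptions scaled down from the 6 million profiles and k=1,000 reported below in this section.

```python
import numpy as np
from gensim.models import Word2Vec  # gensim 4.x API assumed
from sklearn.cluster import KMeans

# Toy stand-in for the ~6M tokenised profile descriptions.
profiles = [
    ["singer", "songwriter", "and", "coffee", "addict"],
    ["teacher", "of", "esl", "lifelong", "learner"],
    ["writer", "poet", "aspiring", "novelist"],
] * 200  # repeated so the toy vocabulary survives min_count=40

# Skip-gram embeddings with the settings reported below in this section:
# 300 dimensions, 10-word window, minimum occurrence of 40.
w2v = Word2Vec(profiles, vector_size=300, window=10, sg=1, min_count=40)

vocab = w2v.wv.index_to_key
vectors = np.stack([w2v.wv[w] for w in vocab])

# K-means over the word vectors; the paper uses k=1,000, which the toy
# vocabulary cannot support, so k=5 here.
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(vectors)

# Each word maps to the unique cluster ID it was assigned; these IDs
# later serve as CRF features.
word2cluster = dict(zip(vocab, kmeans.labels_))
print(word2cluster["teacher"])
```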

Table 4: Examples of Brown clusters with respect to social roles: writer, teacher and musician.

  Bit string        Words related to social role
  010110111100      writer, nwriter, scribbler, writter, glutton
  01011010111110    teacher, tutor, preacher, homeschooler, nbct, hod, dutchman, nqt, tchr
  0101101111110     musician, philologist, orchestrator, memoirist, dramatist, violist, crooner, flautist, filmaker, humourist, dramaturg, harpist, flutist, trumpeter, improvisor, trombonist, musicologist, organist, puppeteer, laureate, poetess, hypnotist, audiobook, comedienne, saxophonist, cellist, scriptwriter, narrator, muso, essayist, improviser, satirist, thespian, ghostwriter, arranger, humorist, violinist, magician, lyricist, playwright, pianist, screenwriter, novelist, performer, philosopher, composer, comedian, filmmaker, poet

Table 5: Examples of word2vec clusters with respect to social roles: writer, teacher and musician.

  Cluster   Words related to social role
  937       writer, freelance, interviewer, documentarian, erstwhile, dramaturg, biographer, reviewer, bookseller, essayist, unpublished, critic, author, aspiring, filmmaker, dramatist, playwright, laureate, humorist, screenwriter, storyteller, ghostwriter, copywriter, scriptwriter, proofreader, copyeditor, poet, memoirist, satirist, podcaster, novelist, screenplay, poetess
  642       teacher, learner, superintendent, pyp, lifelong, flipped, preparatory, cue, yearbook, preschool, intermediate, nwp, school, primary, grades, prek, distinguished, prep, dojo, isd, hpe, ib, esl, substitute, librarian, nbct, efl, headteacher, mfl, hod, elem, principal, sped, graders, nqt, eal, tchr, secondary, tdsb, kindergarten, edd, instructional, elementary, keystone, grade, exemplary, classroom, pdhpe
  384       musician, songwriter, singer, troubadour, arranger, composer, drummer, session, orchestrator, saxophonist, keyboardist, percussionist, guitarist, soloist, instrumentalist, jingle, trombonist, vocal, backing, virtuoso, bassist, vocalist, pianist, frontman
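The bit strings in Table 4 are paths in the Brown hierarchy, and truncating them yields coarser ancestor clusters, as described below. Purely as an illustration, here is a sketch that parses the paths output of the wcluster toolkit and truncates bit strings; the file layout, file name and prefix lengths are assumptions.

```python
def load_brown_paths(path_file):
    """Parse wcluster output, assumed to be '<bit string>\t<word>\t<count>'
    per line, into a word -> bit string mapping."""
    word2bits = {}
    with open(path_file, encoding="utf-8") as fh:
        for line in fh:
            bits, word, _count = line.rstrip("\n").split("\t")
            word2bits[word] = bits
    return word2bits

def ancestor_clusters(bits, prefix_lengths=(4, 8, 12)):
    """Truncate a bit string to refer to clusters higher up in the tree."""
    return {p: bits[:p] for p in prefix_lengths}

# word2bits = load_brown_paths("paths")  # with real wcluster output
# Toy stand-in using bit strings from Table 4:
word2bits = {"teacher": "01011010111110", "tchr": "01011010111110",
             "writer": "010110111100"}
for w, bits in word2bits.items():
    print(w, ancestor_clusters(bits))
# 'teacher' and 'tchr' share every ancestor cluster, which is what lets
# a tagger generalise from one to the other.
```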

We used 6 million Twitter profiles that were automatically collected by crawling a social network starting from a seed set of Twitter accounts (Dennett et al., 2016) to derive the Brown clusters and word embeddings for this domain. For both methods, the text of each profile description was normalised to lowercase and tokenised using whitespace and punctuation as delimiters.

To obtain the Brown clusters, we used a publicly available toolkit, wcluster,[5] to generate 1,000 clusters with a minimum occurrence of 40, yielding 47,167 word types. The clusters are hierarchically structured as a binary tree. Each word belongs to one cluster, and the path from the word to the root of the tree can be represented as a bit string. These bit strings can be truncated to refer to clusters higher up in the tree.

To obtain word embeddings, we used the skip-gram model as implemented in word2vec,[6] a neural network toolkit introduced by Mikolov et al. (2013), to generate 300-dimensional word vectors based on a 10-word context window. We then used K-means clustering on the resulting 47,167 word vectors (k=1,000). Each word was mapped to the unique cluster ID to which it was assigned.

[5] https://github.com/percyliang/brown-cluster
[6] https://code.google.com/p/word2vec/

Tables 4 and 5 show examples of Brown clusters and word2vec clusters, respectively, for three social roles: writer, teacher and musician. We note that similar types of social roles are grouped into the same clusters by both methods. For instance, orchestrator and saxophonist are in the same cluster as musician. Both methods are also able to capture similarities between abbreviations of importance to social roles, for example, tchr → teacher, nbct → National Board Certified Teachers, and hpe → Health and Physical Education.

4 Identifying Social Roles

4.1 Social Role Tagger

This section describes a tagger we developed for the task of identifying social roles in Twitter profiles. We treat social role tagging as a sequence labelling task and use the MALLET toolkit (McCallum, 2002) implementation of Conditional Random Fields (CRFs) (Lafferty et al., 2001) as our machine learning framework. More specifically, we employ a first-order linear-chain CRF, in which the preceding word (and its features) is incorporated as context in the labelling task. Each word is tagged with one of two labels: social roles are tagged with R (for "role"), whereas all other words are tagged with O (for "other").

The social role tagger uses two categories of features: (i) basic lexical features and (ii) word cluster features. The first category captures lexical cues that may be indicative of a social role; these features include morphological, syntactic, orthographic and regular expression-based features (McCallum and Li, 2003; Finkel et al., 2008). The second category captures semantic similarities, as illustrated in Tables 4 and 5 (Section 3). To use Brown clusters in the CRF, we use eight bit-string representations of different lengths to create features representing the ancestor clusters of a word. For word2vec clusters, the cluster identifiers are used directly as features. If a word is not associated with any cluster, its corresponding cluster features are set to null in the feature vector for that word.
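Our tagger is built with MALLET; purely as an illustration of the R/O labelling and the two feature categories, here is a sketch using the third-party sklearn-crfsuite package instead. The lexical feature subset, the eight prefix lengths and the tiny training example are assumptions (the exact prefix lengths are not reported above).

```python
import sklearn_crfsuite  # third-party stand-in for the MALLET CRF

# Assumed resources from Section 3: Brown bit strings (wcluster output)
# and word2vec K-means cluster IDs.
brown_paths = {"teacher": "01011010111110", "tchr": "01011010111110"}
word2cluster = {"teacher": 642, "tchr": 642}

# Eight bit-string prefix lengths (illustrative choice).
PREFIXES = (2, 4, 6, 8, 10, 12, 14, 16)

def token_features(tokens, i):
    w = tokens[i]
    feats = {
        "lower": w.lower(),         # a small assumed subset of the basic
        "suffix3": w.lower()[-3:],  # morphological/orthographic features
        "is_title": w.istitle(),
        "is_digit": w.isdigit(),
    }
    bits = brown_paths.get(w.lower())
    if bits:
        for p in PREFIXES:  # ancestor clusters via truncated bit strings
            feats["brown_%d" % p] = bits[:p]
    cid = word2cluster.get(w.lower())
    if cid is not None:
        feats["w2v_cluster"] = str(cid)
    # Unclustered words simply lack the cluster features (null).
    if i > 0:  # first-order linear chain: preceding word as context
        feats["prev_lower"] = tokens[i - 1].lower()
    return feats

train = [(["I", "am", "a", "school", "tchr"], ["O", "O", "O", "O", "R"])]
X = [[token_features(toks, i) for i in range(len(toks))] for toks, _ in train]
y = [labels for _, labels in train]

# L-BFGS training; the c2 penalty plays a role analogous to the
# Gaussian prior used in the paper's MALLET configuration.
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c2=1.0, max_iterations=500)
crf.fit(X, y)
print(crf.predict(X))  # [['O', 'O', 'O', 'O', 'R']]
```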

Table 6: 10-fold cross-validation macro-average results on the annotated dataset (Brown: Brown cluster features; W2V: word2vec cluster features).

  Model   Feature          Precision   Recall   F1
  KWS     -                0.659       0.759    0.690
  CRFs    Basic            0.830       0.648    0.725
          + Brown          0.859       0.708    0.774
          + W2V            0.837       0.660    0.736
          + (Brown+W2V)    0.863       0.712    0.779

4.2 Evaluation

We evaluate our tagger on the annotated Twitter dataset using precision, recall and F1-score. We use 10-fold cross-validation and report macro-averages. Significance tests are performed using the Wilcoxon signed-rank test (Wilcoxon, 1945). We compare the CRF-based tagger against a keyword spotting (KWS) baseline, which uses the social roles labelled in the training data as keywords to spot in the test profiles, without considering local context. On average over the 10 folds, 54% of the social roles in the test set are seen in the training set, which indicates that the KWS baseline has a potential out-of-vocabulary (OOV) problem for unseen social roles.

To reduce overfitting in the CRF, we employ a zero-mean Gaussian prior regulariser with one standard deviation. To find the optimal feature weights, we use the limited-memory BFGS (L-BFGS) algorithm (Liu and Nocedal, 1989), minimising the regularised negative log-likelihood. All CRFs are trained using 500 iterations of L-BFGS with a Gaussian prior variance of 1 and no frequency cutoff for features, inducing approximately 97,300 features. We follow standard approaches in using the forward-backward algorithm for exact inference in CRFs.

Table 6 shows the evaluation results of 10-fold cross-validation for the KWS method and the CRF tagger. With respect to the different feature sets, we find that the combination of the word cluster features obtained by the two methods outperforms the basic features in terms of F1 (77.9 vs. 72.5, respectively), in general providing a statistically significant improvement of approximately 5% (p<0.01).

The improvement obtained with word cluster features lends support to the intuition that capturing similarity in vocabulary within the feature space helps tagging accuracy. Word cluster models provide a means to compare words based on semantic similarity, helping with cases where lexical items in the test set are not found in the training set (e.g., linguist, evangelist, teamster). In addition, the cluster features allow the CRF to detect informal and abbreviated words as social roles: our tagger identifies both teacher and tchr as social roles in the two sentences "I am a school teacher" and "I am a school tchr". This is particularly useful in social media because of the variation in vocabulary that is typically found there.

In this experiment, we show that social role tagging is possible with a reasonable level of performance (F1 of 77.9), significantly outperforming the KWS baseline (F1 of 69.0). This result indicates the need for a method that captures the context surrounding word usage, allowing language patterns to be learned from data that disambiguate word sense and prevent spurious detection of social roles. This is evidenced by the lower precision and F1-score of the KWS baseline, which over-generates candidates for social roles.
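To make the baseline and the significance test concrete, here is a minimal sketch; case-insensitive token lookup for KWS and the per-fold F1 scores are assumptions (only the macro-averages in Table 6 are reported above).

```python
from scipy.stats import wilcoxon

def kws_predict(tokens, role_lexicon):
    """Keyword spotting: tag a token R if it occurred as a role in the
    training folds, ignoring local context."""
    return ["R" if t.lower() in role_lexicon else "O" for t in tokens]

role_lexicon = {"teacher", "fan", "wife", "writer"}  # roles seen in training
print(kws_predict("I love my wife , huge Bieber fan".split(), role_lexicon))
# ['O', 'O', 'O', 'R', 'O', 'O', 'O', 'R'] -- 'wife' is spuriously tagged:
# without context, KWS over-generates role candidates exactly as described.

# Hypothetical per-fold F1 scores (illustrative only); a paired Wilcoxon
# signed-rank test compares the two systems across the 10 folds.
kws_f1 = [0.66, 0.71, 0.69, 0.70, 0.68, 0.67, 0.72, 0.69, 0.70, 0.68]
crf_f1 = [0.76, 0.79, 0.78, 0.77, 0.80, 0.76, 0.79, 0.78, 0.79, 0.77]
print(wilcoxon(kws_f1, crf_f1))
```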
5 Conclusion and Future Work

In this work, we constructed a new manually annotated English Twitter profile dataset for the social role identification task. In addition, we induced Twitter word clusters with respect to social roles from a large unannotated corpus. We make these resources publicly available in the hope that they will be useful in research on social media. Finally, we developed a social role tagger using CRFs, which can serve as a strong baseline for this task. In future work, we will look into identifying multi-word social roles to obtain a finer-grained categorisation (e.g., "chemical engineer" vs. "software engineer").

Acknowledgments

The authors are grateful to the anonymous SocialNLP@EMNLP reviewers. We would also like to thank David Milne, who ran the data collection and annotation procedure, and Bo Yan, who collected the additional data for the clustering methods.

References

Martha Augoustinos, Iain Walker, and Ngaire Donaghue. 2014. Social Cognition: An Integrated Introduction. SAGE, London, third edition.

Charley Beller, Rebecca Knowles, Craig Harman, Shane Bergsma, Margaret Mitchell, and Benjamin Van Durme. 2014. I'm a belieber: Social roles via self-identification and conceptual attributes. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, pages 181-186, Baltimore, Maryland, June. Association for Computational Linguistics.

Peter F. Brown, Peter V. deSouza, Robert L. Mercer, Vincent J. Della Pietra, and Jenifer C. Lai. 1992. Class-based n-gram models of natural language. Computational Linguistics, 18(4):467-479, December.

Amanda Dennett, Surya Nepal, Cecile Paris, and Bella Robinson. 2016. TweetRipple: Understanding your Twitter audience and the impact of your tweets. In Proceedings of the 2nd IEEE International Conference on Collaboration and Internet Computing, Pittsburgh, PA, USA, November. IEEE.

Katja Filippova. 2012. User demographics and language in an implicit social network. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 1478-1488, Jeju Island, Korea, July. Association for Computational Linguistics.

Jenny Rose Finkel, Alex Kleeman, and Christopher D. Manning. 2008. Efficient, feature-based, conditional random field parsing. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 959-967, Columbus, Ohio, June. Association for Computational Linguistics.

Lingpeng Kong, Nathan Schneider, Swabha Swayamdipta, Archna Bhatia, Chris Dyer, and Noah A. Smith. 2014. A dependency parser for tweets. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pages 1001-1012, Doha, Qatar, October. Association for Computational Linguistics.

Terry Koo, Xavier Carreras, and Michael Collins. 2008. Simple semi-supervised dependency parsing. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 595-603, Columbus, Ohio, June. Association for Computational Linguistics.

John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning, pages 282-289, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.

Dong C. Liu and Jorge Nocedal. 1989. On the limited memory BFGS method for large scale optimization. Mathematical Programming, 45(3):503-528, December.

J. MacQueen. 1967. Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Statistics, pages 281-297, Berkeley, California. University of California Press.

Andrew McCallum and Wei Li. 2003. Early results for named entity recognition with conditional random fields, feature induction and web-enhanced lexicons. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 188-191, Edmonton, Canada. Association for Computational Linguistics.

Andrew Kachites McCallum. 2002. MALLET: A machine learning for language toolkit.

Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. 2013. Linguistic regularities in continuous space word representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 746-751, Atlanta, Georgia, June. Association for Computational Linguistics.

Alan Mislove, Sune Lehmann, Yong-Yeol Ahn, Jukka-Pekka Onnela, and J. Niels Rosenquist. 2011. Understanding the demographics of Twitter users. In Proceedings of the Fifth International AAAI Conference on Weblogs and Social Media, pages 554-557. The AAAI Press.

Maarten Sap, Gregory Park, Johannes Eichstaedt, Margaret Kern, David Stillwell, Michal Kosinski, Lyle Ungar, and Hansen Andrew Schwartz. 2014. Developing age and gender predictive lexica over social media. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pages 1146-1151, Doha, Qatar, October. Association for Computational Linguistics.

Luke Sloan, Jeffrey Morgan, Pete Burnap, and Matthew Williams. 2015. Who tweets? Deriving the demographic characteristics of age, occupation and social class from Twitter user meta-data. PLoS ONE, 10(3):e0115545, March.

Valentin I. Spitkovsky, Hiyan Alshawi, Angel X. Chang, and Daniel Jurafsky. 2011. Unsupervised dependency parsing without gold part-of-speech tags. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 1281-1290, Edinburgh, Scotland, UK, July. Association for Computational Linguistics.

Joseph Turian, Lev-Arie Ratinov, and Yoshua Bengio. 2010. Word representations: A simple and general method for semi-supervised learning. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 384-394, Uppsala, Sweden, July. Association for Computational Linguistics.

Frank Wilcoxon. 1945. Individual comparisons by ranking methods. Biometrics Bulletin, 1(6):80-83.
