Classifying Social Identities Based on Twitter Profile Descriptions
#WhoAmI in 160 Characters? Classifying Social Identities Based on Twitter Profile Descriptions

Anna Priante (Public Administration, University of Twente)
Djoerd Hiemstra (Database Group, University of Twente)
Tijs van den Broek (NIKOS, University of Twente)
Aaqib Saeed (Computer Science, University of Twente)
Michel Ehrenhard (NIKOS, University of Twente)
Ariana Need (Public Administration, University of Twente)

Proceedings of 2016 EMNLP Workshop on Natural Language Processing and Computational Social Science, pages 55–65, Austin, TX, November 5, 2016. © 2016 Association for Computational Linguistics

Abstract

We combine social theory and NLP methods to classify English-speaking Twitter users’ online social identity in profile descriptions. We conduct two text classification experiments. In Experiment 1 we use a 5-category online social identity classification based on identity and self-categorization theories. While we are able to automatically classify two identity categories (Relational and Occupational), automatic classification of the other three identities (Political, Ethnic/religious and Stigmatized) is challenging. In Experiment 2 we test a merger of such identities based on theoretical arguments. We find that by combining these identities we can improve the predictive performance of the classifiers in the experiment. Our study shows how social theory can be used to guide NLP methods, and how such methods provide input to revisit traditional social theory that is strongly consolidated in offline settings.

1 Introduction

Non-profit organizations increasingly use social media, such as Twitter, to mobilize people and organize cause-related collective action, such as health advocacy campaigns.

Studies in social psychology (Postmes and Brunsting, 2002; Van Zomeren et al., 2008; Park and Yang, 2012; Alberici and Milesi, 2013; Chan, 2014; Thomas et al., 2015) demonstrate that social identity motivates people to participate in collective action, which is the joint pursuit of a common goal or interest (Olson, 1971). Social identity is an individual’s self-concept derived from social roles or memberships of social groups (Stryker, 1980; Tajfel, 1981; Turner et al., 1987; Stryker et al., 2000). The use of language is strongly associated with an individual’s social identity (Bucholtz and Hall, 2005; Nguyen et al., 2014; Tamburrini et al., 2015). On Twitter, profile descriptions and tweets are online expressions of people’s identities. Therefore, social media provide an enormous amount of data for social scientists interested in studying how identities are expressed online via language.

We identify two main research opportunities on online identity. First, online identity research is often confined to relatively small datasets. Social scientists rarely exploit computational methods to measure identity over social media. Such methods may offer tools to enrich online identity research. For example, Natural Language Processing (NLP) and Machine Learning (ML) methods help to quickly classify and draw inferences from vast amounts of data. Various studies investigate how to predict individual characteristics from language use on Twitter, such as age and gender (Rao et al., 2010; Burger et al., 2011; Al Zamal et al., 2012; Van Durme, 2012; Ciot et al., 2013; Nguyen et al., 2013; Nguyen et al., 2014; Preotiuc-Pietro et al., 2015), personality and emotions (Preotiuc-Pietro et al., 2015; Volkova et al., 2015; Volkova and Bachrach, 2015), political orientation and ethnicity (Rao et al., 2010; Pennacchiotti and Popescu, 2011; Al Zamal et al., 2012; Cohen and Ruths, 2013; Volkova et al., 2014), profession and interests (Al Zamal et al., 2012; Li et al., 2014).

Second, only a few studies combine social theory and NLP methods to study online identity in relation to collective action. One recent example uses the Social Identity Model of Collective Action (Van Zomeren et al., 2008) to study health campaigns organized on Twitter (Nguyen et al., 2015). The authors automatically identify participants’ motivations to take action online by analyzing profile descriptions and tweets.

In this line, our study contributes to scaling up research on online identity. We explore automatic text classification of online identities based on a 5-category social identity classification built on theories of identity. We analyze 2633 English-speaking Twitter users’ 160-character profile descriptions to classify their social identities. We only focus on profile descriptions as they represent the most immediate, essential expression of an individual’s identity.

We conduct two classification experiments: Experiment 1 is based on the original 5-category social identity classification, whereas Experiment 2 tests a merger of three categories for which automatic classification does not work in Experiment 1. We show that by combining these identities we can improve the predictive performance of the classifiers in the experiment.

Our study makes two main contributions. First, we combine social theory on identity and NLP methods to classify English-speaking Twitter users’ online social identities. We show how social theory can be used to guide NLP methods, and how such methods provide input to revisit traditional social theory that is strongly consolidated in offline settings.

Second, we evaluate different classification algorithms in the task of automatically classifying online social identities. We show that computers can perform a reliable automatic classification for most social identity categories. In this way, we provide social scientists with new tools (i.e., social identity classifiers) for scaling up online identity research to massive datasets derived from social media.

The rest of the paper is structured as follows. First, we illustrate the theoretical framework and the online social identity classification which guides the text classification experiments (Section 2). Second, we explain the data collection (Section 3) and methods (Section 4). Third, we report the results of the two experiments (Sections 5 and 6). Finally, we discuss our findings and provide recommendations for future research (Section 7).

2 Theoretical Framework: a 5-category Online Social Identity Classification Grounded in Social Theory

We define social identity as an individual’s self-definition based on social roles played in society or memberships of social groups. This definition combines two main theories in social psychology: identity theory (Stryker, 1980; Stryker et al., 2000) and social identity, or self-categorization, theory (Tajfel, 1981; Turner et al., 1987), which respectively focus on social roles and memberships of social groups. We combine these two theories as together they provide a more complete definition of identity (Stets and Burke, 2000). The likelihood of participating in collective action increases when individuals both identify themselves with a social group and are committed to the role(s) they play in the group (Stets and Burke, 2000).

We create a 5-category online social identity classification that is based on previous studies of offline settings (Deaux et al., 1995; Ashforth et al., 2008; Ashforth et al., 2016). We apply this classification to Twitter users’ profile descriptions as they represent the most immediate, essential expression of an individual’s identity (Jensen and Bang, 2013). While tweets mostly feature statements and conversations, the profile description provides a dedicated, albeit limited (160 characters), space where users can write about the self-definitions they want to communicate on Twitter.

The five social identity categories of our classification are:

(1) Relational identity: self-definition based on (reciprocal or unreciprocal) relationships that an individual has with other people, and on social roles played by the individual in society. Examples on Twitter are “I am the father of an amazing baby girl!”, “Happily married to @John”, “Crazy Justin Bieber fan”, “Manchester United team is my family”.

(2) Occupational identity: self-definition based on occupation, profession and career, individual vocations, avocations, interests and hobbies. Examples on Twitter are “Manager Communication expert”, “I am a Gamer, YouTuber”, “Big fan of pizza!”, “Writing about my passions: love cooking traveling reading”.

(3) Political identity: self-definition based on political affiliations, parties and groups, as well as being a member of social movements or taking part in collective action. Examples on Twitter are “Feminist Activist”, “I am Democrat”, “I’m a council candidate in local elections for []”, “mobro in #movember”, “#BlackLivesMatter”.

(4) Ethnic/Religious identity: self-definition based on membership of ethnic or religious groups. Examples on Twitter are “God first”, “Will also tweet about #atheism”, “Native Washingtonian”, “Scottish no Australian no-both?”.

(5) Stigmatized identity: self-definition based on membership of a stigmatized group, which is considered different from what society defines as normal according to social and cultural norms (Goffman, 1959). Examples on Twitter are “People call me an affectionate idiot”, “I know people call me a [...]

[...]ness campaign¹, which aims at changing the image of men’s health (i.e., prostate and testicular cancer, mental health and physical inactivity); and (2) English random tweets posted in February and March 2015 obtained via the Twitter Streaming API. We select the tweets from the UK, US and Australia, which are the three largest countries with native English speakers. For this selection, we use a country classifier, which has been found to be fairly accurate in predicting tweets’ geolocation for these countries (Van der Veen et al., 2015). As only 2% of tweets on Twitter are geo-located, we decide to use this classifier to get the data for our text classification.

From these two data sources, we obtain two Twitter user populations: Movember participants and random generic users. We sample from these two groups to have a similar number of profiles in our dataset.
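
The sampling step described above can be illustrated with a minimal sketch. It assumes that each population is available as a list of user profiles and that "a similar number" is approximated by drawing the same number from each group; the text does not specify the exact sampling procedure, so the function name `balanced_sample`, its parameters, and the placeholder user identifiers are hypothetical and serve only as an illustration.

```python
import random


def balanced_sample(group_a, group_b, n_per_group=None, seed=0):
    """Draw the same number of profiles from each of two user populations.

    Assumption: equal-size random sampling without replacement. The paper
    only states that the two groups end up with a similar number of profiles.
    """
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    limit = min(len(group_a), len(group_b))
    k = limit if n_per_group is None else min(n_per_group, limit)
    sample_a = rng.sample(group_a, k)
    sample_b = rng.sample(group_b, k)
    return sample_a, sample_b


# Hypothetical placeholder populations (not the paper's data).
movember_users = [f"movember_user_{i}" for i in range(50)]
random_users = [f"random_user_{i}" for i in range(400)]

movember_sample, random_sample = balanced_sample(
    movember_users, random_users, n_per_group=30
)
print(len(movember_sample), len(random_sample))
```

Capping the draw at the size of the smaller population keeps the two groups balanced even when one source yields far fewer users than the other.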