Using Sentiment Induction to Understand Variation in Gendered Online Communities
Total Page:16
File Type:pdf, Size:1020Kb
Using Sentiment Induction to Understand Variation in Gendered Online Communities Li Lucy Julia Mendelsohn Symbolic Systems Program Department of Linguistics Department of Computer Science Department of Computer Science Stanford University Stanford University [email protected] [email protected] Abstract gator and discussion platforms, Reddit, contains thousands of unique communities, known as sub- We analyze gendered communities defined in reddits. These subreddits vary in topic, such as three different ways: text, users, and senti- r/sport and r/history, content type, such as r/pics ment. Differences across these representations reveal facets of communities’ distinctive iden- and r/videos, and format, such as the Q&A style of tities, such as social group, topic, and atti- r/IamA and the narratives on r/confession. Online tudes. Two communities may have high text communities such as subreddits are often char- similarity but not user similarity or vice versa, acterized by language use and user membership and word usage also does not vary accord- (Hamilton et al., 2016; Datta et al., 2017; Bamman ing to a clearcut, binary perspective of gen- et al., 2014; Martin, 2017). der. Community-specific sentiment lexicons demonstrate that sentiment can be a useful in- Sociolinguists have primarily analyzed phono- dicator of words’ social meaning and commu- logical and syntactic variables (Eckert, 2012), nity values, especially in the context of discus- though some have studied lexical variables (e.g. sion content and user demographics. Our re- Wong, 2005). Previous computational work also sults show that social platforms such as Reddit focuses on lexical variation (Bamman et al., 2014). are active settings for different constructions We approach variation from a new direction, of gender. where we examine the salient semantic dimension 1 Introduction of sentiment to understand how users use the same words to convey different meanings. We also cre- Social groups can be described by many factors, ate representations for subreddits that encode text such as the demographics of its participants or and user membership to situate insights gained its physical location. To detect sociolinguistically from sentiment-based representations and to un- significant groups, linguists have built upon the derstand the intersection of speaker identity (user), concept of communities of practice, which are content (text), and affect (sentiment). This paper characterized by their participants’ shared actions, focuses on explicitly gendered subreddits, which beliefs, values, and language styles (Eckert and cater towards masculine- or feminine-identifying McConnell-Ginet, 1992; Eckert, 2006). This con- groups. cept initially emerged to understand the complex interplay between language and gender, and has We provide two main contributions: been applied to study social identities in numer- 1) Salient aspects of social group identities, such ous communities (e.g. Eckert, 1989; Mendoza- as gender, can produce low user overlap in com- Denton, 1996; Hall, 2009). While previous varia- munities sharing similar topics. tionist work has primarily studied traditional phys- ical communities, we focus instead on online ones. 2) Sentiment-based representations of communi- Online communities have been shown to form ties can reveal a type of variation across social collective linguistic norms, which give rise to a groups that word-choice alone cannot. An in- rich amount of language variation across com- depth study of words’ sentiments in gendered munities, even on the same website (Danescu- subreddits reveals patterns of how linguistic re- Niculescu-Mizil et al., 2013; Yang and Eisen- sources construct a wide array of gendered iden- stein, 2015). One of the largest content aggre- tities in the online sphere. 156 Proceedings of the Society for Computation in Linguistics (SCiL) 2019, pages 156-166. New York City, New York, January 3-6, 2019 2 Previous Work between men and women, demonstrating diversity in gendered language styles. The study of online communities is highly in- We compare text, user, and sentiment represen- terdisciplinary, spanning machine learning, nat- tations of communities by their predictions of sim- ural language processing, social network analy- ilarity and identify cases where these predictions sis, communications, and sociolinguistics. Twit- agree or disagree. Pavalanathan et al. (2017) sug- ter, online news, and other websites have been a gested that subreddits with similar topics can have rich source of data for computational social scien- dissimilar user groups due to differences in pre- tists, and Reddit in particular has been of interest ferred interactional styles. Datta et al. (2017) in- to much previous work (Althoff et al., 2014; Ku- troduced a method for finding misalignments of mar et al., 2018; Newell et al., 2016; Jaech et al., inferred user-based and text-based networks on 2015; Hamilton et al., 2017). Research in this Reddit. They found that pairs with high text but area expands beyond quantitative measures, by low user similarity tend to be communities that comparing results with social science theories and conflict (such as political subreddits) as well as employing qualitative thinking to highlight trends communities with hierarchical relationships (such of individual words and communities (Bamman as a niche subreddit with a more generic one). et al., 2014; Danescu-Niculescu-Mizil et al., 2013; High user but low text similarity suggested a sin- Zhang et al., 2017; Althoff et al., 2014). gle overarching community scattered across mul- In particular, the characterization and identifi- tiple subreddits. We assessed Datta et al. (2017)’s cation of gender is a common use case for online z2-score method in section 5.1, but found that its data. Language is often used to infer demograph- results hid important misalignments. ics of users, especially in classification tasks that tend to find clear distinctions between men and We use domain-specific lexicon induction tech- women (e.g. Burger et al., 2011; Argamon et al., niques for creating sentiment-based subreddit rep- 2007; Schler, 2006; Rao et al., 2010). Previous resentations. Previous work has built or adapted work have found strong patterns of gender differ- word embeddings or scores to flexibly encode se- ences in language (Newman et al., 2008; Mulac mantic dimensions or specific domains, and some et al., 2001). For example, Volkova et al. (2013) have applied their techniques to social media com- showed that the sentiment of words, hashtags, and munities (Rothe et al., 2016; Yang and Eisenstein, emoticons vary between men and women on Twit- 2015; Hamilton et al., 2016). Rothe et al. (2016)’s ter and used gender-dependent features to improve DENSIFIER model involves dense word embed- sentiment classification of tweets. Our work aims dings created by mapping generic word embed- to look beyond a straightforward divide between dings into meaningful subspaces. These learned men and women, particularly because gender is embeddings may even contain a single dimension, not a fixed biological variable, but rather a dy- which can act as labels for an induced lexicon, and namic, social one that is actively created and rein- are best applied to corpora with several billion to- forced through repeated behaviors (Butler, 1988; kens, which is far greater than any subreddit that Nguyen et al., 2014; Herring and Paolillo, 2006). we study. Language is used to simultaneously co- While Rothe et al. (2016) learned lexicons for construct multiple identities, so the plethora of generic domains such as news and Twitter, Hamil- gendered identities that emerge from communi- ton et al. (2016)’s SENTPROP method solves ties of practice may substantially differ from main- a similar task across fine-grained domains, in- stream stereotypes of “femininity” and “masculin- cluding Reddit communities. They applied a ity” (Eckert, 1989; Mendoza-Denton, 1996; Hall, label propagation method to 250 Reddit com- 2009). On Twitter, Bamman et al. (2014) in- munities, and found a wide range of sentiment vestigated cross-community lexical variation and variation. For example, insane is negative in the varied ways of constructing gender identities. r/twoxchromosomes but positive in r/sports, while They clustered users based on bag-of-words repre- soft shows the opposite pattern. SENTPROP was sentations of their posts, and the resulting clusters the most suitable approach for our purposes due corresponded not only to topical interest but also to its ability to operate on smaller datasets. We gender. Some clusters had language patterns that extend this work by examining how vector rep- were orthogonal to expected language differences resentations created from subreddit-specific sen- 157 Subreddit Description r/actuallesbians “A place for cis and trans lesbians, bisexual girls, chicks who like chicks...” r/askgaybros “Where you can ask the manly men for their opinions on various topics.” r/mensrights “For those who wish to discuss men’s rights and the ways said rights are infringed upon.” r/askmen “A semi-serious place to ask men casual questions about life, career, and more.” r/askwomen “Dedicated to asking women questions about their thoughts, lives, and experiences.” r/xxfitness “For women and gender non-binary redditors who are fit, want to be fit...” r/femalefashionadvice “A subreddit dedicated to learning about and discussing women’s fashion.” r/malefashionadvice “Making clothing less intimidating and helping you develop your own style.”