Demographic Dialectal Variation in Social Media: A Case Study of African-American English
Su Lin Blodgett†, Lisa Green∗, Brendan O’Connor†
†College of Information and Computer Sciences
∗Department of Linguistics
University of Massachusetts Amherst

Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1119–1130, Austin, Texas, November 1-5, 2016. © 2016 Association for Computational Linguistics

Abstract

Though dialectal language is increasingly abundant on social media, few resources exist for developing NLP tools to handle such language. We conduct a case study of dialectal language in online conversational text by investigating African-American English (AAE) on Twitter. We propose a distantly supervised model to identify AAE-like language from demographics associated with geo-located messages, and we verify that this language follows well-known AAE linguistic phenomena. In addition, we analyze the quality of existing language identification and dependency parsing tools on AAE-like text, demonstrating that they perform poorly on such text compared to text associated with white speakers. We also provide an ensemble classifier for language identification which eliminates this disparity and release a new corpus of tweets containing AAE-like language.

Data and software resources are available at: http://slanglab.cs.umass.edu/TwitterAAE

1 Introduction

Owing to variation within a standard language, regional and social dialects exist within languages across the world. These varieties or dialects differ from the standard variety in syntax (sentence structure), phonology (sound structure), and the inventory of words and phrases (lexicon). Dialect communities often align with geographic and sociological factors, as language variation emerges within distinct social networks, or is affirmed as a marker of social identity.

As many of these dialects have traditionally existed primarily in oral contexts, they have historically been underrepresented in written sources. Consequently, NLP tools have been developed from text which aligns with mainstream languages. With the rise of social media, however, dialectal language is playing an increasingly prominent role in online conversational text, for which traditional NLP tools may be insufficient. This impacts many applications: for example, dialect speakers’ opinions may be mischaracterized under social media sentiment analysis or omitted altogether (Hovy and Spruit, 2016). Since this data is now available, we seek to analyze current NLP challenges and extract dialectal language from online data.

Specifically, we investigate dialectal language in publicly available Twitter data, focusing on African-American English (AAE), a dialect of Standard American English (SAE) spoken by millions of people across the United States. AAE is a linguistic variety with defined syntactic-semantic, phonological, and lexical features, which have been the subject of a rich body of sociolinguistic literature. In addition to the linguistic characterization, reference to its speakers and their geographical location or speech communities is important, especially in light of the historical development of the dialect. Not all African-Americans speak AAE, and not all speakers of AAE are African-American; nevertheless, speakers of this variety have close ties with specific communities of African-Americans (Green, 2002). Due to its widespread use, established history in the sociolinguistic literature, and demographic associations, AAE provides an ideal starting point for the development of a statistical model that uncovers dialectal language.
In fact, its presence in social media is attracting increasing interest for natural language processing (Jørgensen et al., 2016) and sociolinguistic (Stewart, 2014; Eisenstein, 2015; Jones, 2015) research.¹ In this work we:

• Develop a method to identify demographically-aligned text and language from geo-located messages (§2), based on distant supervision of geographic census demographics through a statistical model that assumes a soft correlation between demographics and language.
• Validate our approach by verifying that text aligned with African-American demographics follows well-known phonological and syntactic properties of AAE, and document the previously unattested ways in which such text diverges from SAE (§3).
• Demonstrate racial disparity in the efficacy of NLP tools for language identification and dependency parsing: they perform poorly on this text, compared to text associated with white speakers (§4, §5).
• Improve language identification for U.S. online conversational text with a simple ensemble classifier using our demographically-based distant supervision method, aiming to eliminate racial disparity in accuracy rates (§4.2).
• Provide a corpus of 830,000 tweets aligned with African-American demographics.

¹ Including a recent linguistics workshop: http://linguistlaura.blogspot.co.uk/2016/06/using-twitter-for-linguistic-research.html

2 Identifying AAE from Demographics

The presence of AAE in social media and the generation of resources of AAE-like text for NLP tasks have attracted recent interest in sociolinguistic and natural language processing research; Jones (2015) shows that nonstandard AAE orthography on Twitter aligns with historical patterns of African-American migration in the U.S., while Jørgensen et al. (2015) investigate to what extent it supports well-known sociolinguistics hypotheses about AAE. Both, however, find AAE-like language on Twitter through keyword searches, which may not yield broad corpora reflective of general AAE use. More recently, Jørgensen et al. (2016) generated a large unlabeled corpus of text from hip-hop lyrics, subtitles from The Wire and The Boondocks, and tweets from a region of the southeast U.S. While this corpus does indeed capture a wide variety of language, we aim to discover AAE-like language by utilizing finer-grained, neighborhood-level demographics from across the country.

Our approach to identifying AAE-like text is to first harvest a set of messages from Twitter, cross-referenced against U.S. Census demographics (§2.1), then to analyze words against demographics with two alternative methods, a seedlist approach (§2.2) and a mixed-membership probabilistic model (§2.3).

2.1 Twitter and Census data

In order to create a corpus of demographically-associated dialectal language, we turn to Twitter, whose public messages contain large amounts of casual conversation and dialectal speech (Eisenstein, 2015). It is well-established that Twitter can be used to study both geographic dialectal varieties² and minority languages.³

² For example, of American English (Huang et al., 2015; Doyle, 2014).
³ For example, Lynn et al. (2015) develop POS corpora and taggers for Irish tweets; see also related work in §4.1.

Some methods exist to associate messages with authors’ races; one possibility is to use birth record statistics to identify African-American-associated names, which has been used in (non-social media) social science studies (Sweeney, 2013; Bertrand and Mullainathan, 2003). However, metadata about authors is fairly limited on Twitter and most other social media services, and many supplied names are obviously not real.

Instead, we turn to geo-location and induce a distantly supervised mapping between authors and the demographics of the neighborhoods they live in (O’Connor et al., 2010; Eisenstein et al., 2011b; Stewart, 2014). We draw on a set of geo-located Twitter messages, most of which are sent on mobile phones, by authors in the U.S. in 2013. (These are selected from a general archive of the “Gardenhose/Decahose” sample stream of public Twitter messages (Morstatter et al., 2013).) Geo-located users are a particular sample of the userbase (Pavalanathan and Eisenstein, 2015), but we expect it is reasonable to compare users of different races within this group.

We look up the U.S. Census blockgroup geographic area that the message was sent in; blockgroups are one of the smallest geographic areas defined by the Census, typically containing a population of 600–3000 people. We use race and ethnicity information for each blockgroup from the Census’ 2013 American Community Survey, defining four covariates: percentages of the population that are non-Hispanic whites, non-Hispanic blacks, Hispanics (of any race), and Asian.⁴ Finally, for each user $u$, we average the demographic values of all their messages in our dataset into a length-four vector $\pi_u^{(\text{census})}$. Under strong assumptions, this could be interpreted as the probability of which race the user is; we prefer to

For a token in the corpus indexed by $t$ (across the whole corpus), let $u(t)$ be the author of the message containing that token, and $w_t$ be the word token. The average demographics of word type $w$ is:⁵

$$\pi_w^{(\text{softcount})} \equiv \frac{\sum_t 1\{w_t = w\}\, \pi_{u(t)}^{(\text{census})}}{\sum_t 1\{w_t = w\}}$$

We find that terms with the highest $\pi_{w,AA}$ values (denoting high average African-American demographics of their authors’ locations) are very non-standard, while Stewart (2014) and Eisenstein (2013) find large $\pi_{w,AA}$ associated with certain AAE linguistic features.

One way to use the $\pi_{w,k}$ values to construct a corpus is through a seedlist approach. In early experiments, we constructed a corpus of 41,774 users (2.3 million messages) by first selecting the $n = 100$ highest-$\pi_{w,AA}$ terms occurring at least $m = 3000$ times across the data set, then collecting all tweets
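As a rough illustration of the quantities described above, the following sketch computes per-user census averages, the per-word average demographics, and a frequency-filtered list of seed terms. It is a minimal sketch under stated assumptions, not the authors' implementation: the input layout (a list of `(user, tokens, blockgroup_vector)` triples), the covariate ordering in `GROUPS`, and all function names are hypothetical, and the toy data is invented.

```python
from collections import defaultdict

# Hypothetical covariate order; the text names four groups but fixes no ordering.
GROUPS = ("white", "black", "hispanic", "asian")
AA = GROUPS.index("black")


def user_census_vectors(messages):
    """Average the blockgroup demographics of each user's messages.

    `messages` is a list of (user, tokens, blockgroup_vector) triples,
    an assumed input layout chosen for this sketch.
    """
    sums = defaultdict(lambda: [0.0] * len(GROUPS))
    n_msgs = defaultdict(int)
    for user, _tokens, demo in messages:
        for k, value in enumerate(demo):
            sums[user][k] += value
        n_msgs[user] += 1
    return {u: [s / n_msgs[u] for s in sums[u]] for u in sums}


def word_softcounts(messages, pi_user):
    """Per word type: mean of the authors' census vectors over all its tokens."""
    sums = defaultdict(lambda: [0.0] * len(GROUPS))
    n_tokens = defaultdict(int)
    for user, tokens, _demo in messages:
        for w in tokens:
            for k, value in enumerate(pi_user[user]):
                sums[w][k] += value
            n_tokens[w] += 1
    pi_word = {w: [s / n_tokens[w] for s in sums[w]] for w in sums}
    return pi_word, dict(n_tokens)


def select_seed_terms(pi_word, n_tokens, group, n, m):
    """Top-n words by pi_word[w][group] among words with at least m tokens."""
    frequent = [w for w in pi_word if n_tokens[w] >= m]
    frequent.sort(key=lambda w: pi_word[w][group], reverse=True)
    return frequent[:n]


# Toy data, invented purely for illustration.
toy = [
    ("u1", ["a", "b"], (0.5, 0.5, 0.0, 0.0)),
    ("u1", ["a"], (0.0, 1.0, 0.0, 0.0)),
    ("u2", ["b", "c"], (1.0, 0.0, 0.0, 0.0)),
]
pi_user = user_census_vectors(toy)
pi_word, n_tokens = word_softcounts(toy, pi_user)
seeds = select_seed_terms(pi_word, n_tokens, AA, n=1, m=2)
```

The text reports n = 100 and m = 3000 for the seedlist; the toy call uses n = 1 and m = 2 only so that the frequency filter has a visible effect on three words. The sketch covers term selection only.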