
Confounds and Consequences in Geotagged Data

Umashanthi Pavalanathan and Jacob Eisenstein
School of Interactive Computing, Georgia Institute of Technology, Atlanta, GA 30308
{umashanthi + jacobe}@gatech.edu

arXiv:1506.02275v2 [cs.CL] 22 Aug 2015

Abstract

Twitter is often used in quantitative studies that identify geographically-preferred topics, writing styles, and entities. These studies rely on either GPS coordinates attached to individual messages, or on the user-supplied location field in each profile. In this paper, we compare these data acquisition techniques and quantify the biases that they introduce; we also measure their effects on linguistic analysis and text-based geolocation. GPS-tagging and self-reported locations yield measurably different corpora, and these linguistic differences are partially attributable to differences in dataset composition by age and gender. Using a latent variable model to induce age and gender, we show how these demographic variables interact with geography to affect language use. We also show that the accuracy of text-based geolocation varies with population demographics, giving the best results for men above the age of 40.

1 Introduction

Social media data such as Twitter is frequently used to identify the unique characteristics of geographical regions, including topics of interest (Hong et al., 2012), linguistic styles and dialects (Eisenstein et al., 2010; Gonçalves and Sánchez, 2014), political opinions (Caldarelli et al., 2014), and public health (Broniatowski et al., 2013). Social media permits the aggregation of datasets that are orders of magnitude larger than could be assembled via traditional survey techniques, enabling analysis that is simultaneously fine-grained and global in scale. Yet social media is not a representative sample of any "real world" population, aside from social media itself. Using it as a sample therefore risks introducing both geographic and demographic biases (Mislove et al., 2011; Hecht and Stephens, 2014; Longley et al., 2015; Malik et al., 2015).

This paper examines the effects of these biases on the geo-linguistic inferences that can be drawn from Twitter. We focus on the ten largest metropolitan areas in the United States, and consider three sampling techniques: drawing an equal number of GPS-tagged tweets from each area; drawing a county-balanced sample of GPS-tagged messages to correct Twitter's urban skew (Hecht and Stephens, 2014); and drawing a sample of location-annotated messages, using the location field in the user profile. Leveraging self-reported first names and census statistics, we show that the age and gender composition of these datasets differ significantly.

Next, we apply standard methods from the literature to identify geo-linguistic differences, and test how the outcomes of these methods depend on the sampling technique and on the underlying demographics. We also test the accuracy of text-based geolocation (Cheng et al., 2010; Eisenstein et al., 2010) in each dataset, to determine whether the accuracies reported in recent work will generalize to more balanced samples.

The paper reports several new findings about geotagged Twitter data:

• In comparison with tweets with self-reported locations, GPS-tagged tweets are written more often by young people and by women.
• There are corresponding linguistic differences between these datasets, with GPS-tagged tweets including more geographically-specific non-standard words.
• Young people use significantly more geographically-specific non-standard words. Men mention more geographically-specific entities than women, but these differences are significant only for individuals at the age of 30 or older.
• Users who GPS-tag their tweets tend to write more, making them easier to geolocate. Evaluating text-based geolocation on GPS-tagged tweets therefore probably overestimates its accuracy.
• Text-based geolocation is significantly more accurate for men and for older people.

These findings should inform future attempts to generalize from geotagged Twitter data, and may suggest investigations into the demographic properties of other social media sites.

We first describe the basic data collection principles that hold throughout the paper (§ 2). The following three sections tackle demographic biases (§ 3), their linguistic consequences (§ 4), and the impact on text-based geolocation (§ 5); each of these sections begins with a discussion of methods, and then presents results. We then summarize related work and conclude.

2 Dataset

This study is performed on a dataset of tweets gathered from Twitter's streaming API from February 2014 to January 2015. During an initial filtering step we removed retweets: repetitions of previously posted messages, identified by the "retweeted status" metadata or the "RT" token that is widely used among Twitter users to indicate a retweet. To eliminate spam and automated accounts (Yardi et al., 2009), we removed tweets containing URLs, user accounts with more than 1000 followers or followees, accounts which had tweeted more than 5000 messages at the time of data collection, and the top 10% of accounts by number of messages in our dataset. We also removed users who wrote more than 10% of their tweets in any language other than English, according to Twitter's lang metadata field. Exploration of code-switching (Solorio and Liu, 2008) and the role of second-language English speakers (Eleta and Golbeck, 2014) is left for future work.

We consider the ten largest Metropolitan Statistical Areas (MSAs) in the United States, listed in Table 1. MSAs are defined by the U.S. Census Bureau as geographical regions of high population density organized around a single urban core; they are not legal administrative divisions. MSAs include outlying areas that may be substantially less urban than the core itself. For example, the Atlanta MSA is centered on Fulton County (1750 people per square mile), but extends to Haralson County (100 people per square mile), on the border of Alabama. A per-county analysis of this data therefore enables us to assess the degree to which Twitter's skew towards urban areas biases geo-linguistic analysis.

3 Representativeness of geotagged Twitter data

We first assess potential biases in sampling techniques for obtaining geotagged Twitter data. In particular, we compare two possible techniques for obtaining data: the location field in the user profile (Poblete et al., 2011; Dredze et al., 2013), and the GPS coordinates attached to each message (Cheng et al., 2010; Eisenstein et al., 2010).

3.1 Methods

To build a dataset of GPS-tagged messages, we extracted the GPS latitude and longitude coordinates reported in each tweet, and used GIS-TOOLS[1] reverse geocoding to identify the corresponding counties. This set of geotagged messages will be denoted D_G. Only 1.24% of messages contain geo-coordinates, and it is possible that the individuals willing to share their GPS location comprise a skewed population. We therefore also considered the user-reported location field in the Twitter profile, focusing on the two most widely-used patterns: (1) city name, and (2) city name plus two-letter state abbreviation (e.g., Chicago and Chicago, IL). Messages that matched any of the ten largest MSAs were grouped into a second set, D_L.

While the inconsistencies of writing style in the Twitter location field are well known (Hecht et al., 2011), analysis of the intersection between D_G and D_L found that the two data sources agreed the overwhelming majority of the time, suggesting that most self-provided locations are accurate. Of course, there may be many false negatives: profiles that we fail to geolocate due to the use of non-standard toponyms like Pixburgh and ATL. If so, this would introduce a bias in the population sample in D_L. Such a bias might have linguistic consequences, with datasets based on the location field containing less non-standard language overall.
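The two profile-location patterns can be matched with a short regular-expression filter. The sketch below is only an illustration of the idea: the MSA table is a hypothetical subset, not the gazetteer actually used in this study.

```python
import re

# Illustrative subset of the ten largest MSAs (core city -> USPS state code).
# This toy table is an assumption for the example, not the paper's resource.
MSA_CORES = {"new york": "NY", "los angeles": "CA", "chicago": "IL",
             "dallas": "TX", "houston": "TX", "atlanta": "GA"}

def match_location_field(field):
    """Match the two most common location-field patterns:
    'Chicago' (city name) or 'Chicago, IL' (city name, state code).
    Returns the normalized city name, or None if no pattern matches."""
    text = field.strip().lower()
    # Pattern 2: "city, st"
    m = re.match(r"^([a-z .]+?),\s*([a-z]{2})$", text)
    if m:
        city, state = m.group(1).strip(), m.group(2).upper()
        # Accept only when the state code agrees with the city's MSA.
        return city if MSA_CORES.get(city) == state else None
    # Pattern 1: bare city name
    return text if text in MSA_CORES else None
```

Non-standard toponyms such as Pixburgh fall through both patterns, which is exactly the false-negative behavior discussed above.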
[1] https://github.com/DrSkippy/Data-Science-45min-Intros/blob/master/gis-tools-101/gis_tools.ipynb

[Figure 1: Proportion of census population, Twitter messages, and Twitter user accounts, by county: Atlanta on the left, New York on the right.]

3.1.1 Subsampling

The initial samples were then resampled to create the following balanced datasets:

GPS-MSA-BALANCED: From D_G, we randomly sampled 25,000 tweets per MSA as the message-balanced sample, and all the tweets from 2,500 users per MSA as the user-balanced sample. Balancing across the MSAs ensures that the largest MSAs do not dominate the linguistic analysis.

GPS-COUNTY-BALANCED: The D_G samples were resampled based on county-level population (obtained again from the U.S. Census Bureau) to create message-balanced and user-balanced samples. These samples are geographically more representative of the overall population distribution across each MSA.

LOC-MSA-BALANCED: From D_L, we randomly sampled 25,000 tweets per MSA as the message-balanced sample, and all the tweets from 2,500 users per MSA as the user-balanced sample. It is not possible to obtain county-balanced samples from D_L, as exact geographical coordinates are unavailable.

3.1.2 Age and gender identification

To estimate the distribution of ages and genders in each sample, we queried statistics from the Social Security Administration, which records the number of individuals born each year with each given name. Using this information, we obtained probability distributions over age values for each given name. We then matched these first names against the first token in the name field of each user's profile, enabling us to induce approximate distributions over ages and genders. Unlike Google+, Twitter does not have a "real name" policy, so users are free to give names that are fake, humorous, etc. We eliminate user accounts whose names are not sufficiently common in the social security database (i.e., first names that are at least 100 times more frequent in the social security database), thereby omitting 33% of Twitter user accounts and 34% of tweets.

For each sample D, the age distribution is induced from the name counts:

  p(a | name = n) = count(name = n, age = a) / Σ_{a′} count(name = n, age = a′)    (1)

  p(a) ∝ Σ_{i ∈ D} p(a | name = n_i)    (2)

We induce distributions over author gender in much the same way (Mislove et al., 2011). While some individuals will choose names typically associated with a different gender, we assume that this happens with roughly equal probability in both directions. This method does not incorporate prior information about the ages of Twitter users, and thus assigns too much probability to the extremely young and the extremely old, who are unlikely to use the service. While it would be easy to design such a prior (for example, assigning zero prior probability to users under the age of five or above the age of 95), we see no principled basis for determining these cutoffs. We therefore focus on the estimated differences between the samples.

MSA            Counties   L1 Dist., Population vs. Users   L1 Dist., Population vs. Tweets
New York          23              0.2891                          0.2825
Los Angeles        2              0.0203                          0.0223
Chicago           14              0.0482                          0.0535
Dallas            12              0.1437                          0.1176
Houston           10              0.0394                          0.0472
Philadelphia      11              0.1426                          0.1202
Washington DC     22              0.2089                          0.2750
Miami              3              0.0428                          0.0362
Atlanta           28              0.1448                          0.1730
Boston             7              0.1878                          0.2303

Table 1: L1 distance between county-level population and Twitter users and messages

[Figure 2: User counts by number of Twitter messages, for the GPS-MSA-Balanced, GPS-County-Balanced, and LOC-MSA-Balanced samples.]
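Equations (1) and (2) from § 3.1.2 can be sketched directly. The name-count table below is a toy stand-in for the Social Security data (the names, age bins, and counts are illustrative assumptions):

```python
from collections import defaultdict

# Toy stand-in for the Social Security counts: (name, age bin) -> count.
# The real data records, per given name, the number of births each year.
SSA_COUNTS = {
    ("emma", 10): 900, ("emma", 40): 100,
    ("dorothy", 10): 50, ("dorothy", 70): 950,
}

def p_age_given_name(name):
    """Equation (1): p(a | name=n) = count(n, a) / sum_a' count(n, a')."""
    total = sum(c for (n, a), c in SSA_COUNTS.items() if n == name)
    if total == 0:
        return {}
    return {a: c / total for (n, a), c in SSA_COUNTS.items() if n == name}

def p_age(sample_names):
    """Equation (2): p(a) is proportional to the sum, over users i in the
    sample, of p(a | name = n_i); normalize at the end."""
    scores = defaultdict(float)
    for n in sample_names:
        for a, p in p_age_given_name(n).items():
            scores[a] += p
    z = sum(scores.values())
    return {a: s / z for a, s in scores.items()}
```

A sample containing one "emma" and one "dorothy" then yields an age distribution that averages the two per-name distributions.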

3.2 Results

Geographical biases in the GPS sample: We first assess the differences between the true population distributions over counties and the per-tweet and per-user distributions. Because counties vary widely in their degree of urbanization and other demographic characteristics, this measure is a proxy for the representativeness of GPS-based Twitter samples (county information is not available for the LOC-MSA-BALANCED sample). Population distributions for New York and Atlanta are shown in Figure 1. In Atlanta, Fulton County is the most populous and most urban, and is overrepresented in both geotagged tweets and user accounts; most of the remaining counties are correspondingly underrepresented. This coheres with the urban bias noted earlier by Hecht and Stephens (2014). In New York, Kings County (Brooklyn) is the most populous, but is underrepresented in both the number of geotagged tweets and user accounts, at the expense of New York County (Manhattan). Manhattan is the commercial and entertainment center of the New York MSA, so residents of outlying counties may be tweeting from their jobs or social activities.

To quantify the representativeness of each sample, we use the L1 distance ||p − t||_1 = Σ_c |p_c − t_c|, where p_c is the proportion of the MSA population residing in county c and t_c is the proportion of tweets (Table 1). County boundaries are determined by states, and their density varies: for example, the Los Angeles MSA covers only two counties, while the smaller Atlanta MSA is spread over 28 counties. The table shows that while New York is the most extreme example, most MSAs feature an asymmetry between county population and Twitter adoption.

Usage: Next, we turn to differences between the GPS-based and profile-based techniques for obtaining ground truth data. As shown in Figure 2, the LOC-MSA-BALANCED sample contains more low-volume users than either the GPS-MSA-BALANCED or GPS-COUNTY-BALANCED samples. We can therefore conclude that the county-level geographical bias in the GPS-based data does not impact usage rate, but that the difference between GPS-based and profile-based sampling does; the linguistic consequences of this difference will be explored in the following sections.

Demographics: Table 2 shows the expected age and gender for each dataset, with bootstrap confidence intervals. Users in the LOC-MSA-BALANCED dataset are on average two years older than in the GPS-MSA-BALANCED and GPS-COUNTY-BALANCED datasets, which are statistically indistinguishable. Focusing on the difference between GPS-MSA-BALANCED and LOC-MSA-BALANCED, we plot the difference in age probabilities in Figure 3, showing that GPS-MSA-BALANCED includes many more teens and people in their early twenties, while LOC-MSA-BALANCED includes more people at middle age and older. Young people are especially likely to use social media on cellphones (Lenhart, 2015), where location tagging is more relevant than when Twitter is accessed via a personal computer. Social media users in the age brackets 18-29 and 30-49 are also more likely to tag their locations in social media posts than users in the age brackets 50-64 and 65+ (Zickuhr, 2013), with women and men tagging at roughly equal rates. Table 2 shows that the GPS-MSA-BALANCED and GPS-COUNTY-BALANCED samples contain significantly more women than LOC-MSA-BALANCED, though all three samples are close to 50%.

Sample                 Expected Age   95% CI            % Female   95% CI
GPS-MSA-BALANCED          36.17       [36.07 – 36.27]     51.5     [51.3 – 51.8]
GPS-COUNTY-BALANCED       36.25       [36.16 – 36.30]     51.3     [51.1 – 51.6]
LOC-MSA-BALANCED          38.35       [38.25 – 38.44]     49.3     [49.1 – 49.6]

Table 2: Demographic statistics for each dataset
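The L1 distance used in Table 1 is a one-line computation over county proportion vectors. A minimal sketch (the county names and proportions below are illustrative, not the census figures):

```python
def l1_distance(pop, tweets):
    """L1 distance between two county-level proportion vectors:
    ||p - t||_1 = sum_c |p_c - t_c|.  Keys are county names; a county
    missing from one vector contributes proportion 0."""
    counties = set(pop) | set(tweets)
    return sum(abs(pop.get(c, 0.0) - tweets.get(c, 0.0)) for c in counties)

# Hypothetical two-county MSA: tweets over-concentrated in the urban core.
pop = {"Fulton": 0.5, "Haralson": 0.5}
tweets = {"Fulton": 0.8, "Haralson": 0.2}
```

Here `l1_distance(pop, tweets)` is 0.6, reflecting the urban skew; identical distributions give 0.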

[Figure 3: Difference in age probability distributions between GPS-MSA-BALANCED and LOC-MSA-BALANCED.]

4 Impact on linguistic generalizations

Many papers use Twitter data to draw conclusions about the relationship between language and geography. What role do the demographic differences identified in the previous section have on the linguistic conclusions that emerge? We measure the differences between the linguistic corpora obtained by each data acquisition approach. Since the GPS-MSA-BALANCED and GPS-COUNTY-BALANCED methods have nearly identical patterns of usage and demographics, we focus on the difference between GPS-MSA-BALANCED and LOC-MSA-BALANCED. These datasets differ in age and gender, so we also directly measure the impact of these demographic factors on the use of geographically-specific linguistic variables.

4.1 Methods

Discovering geographical linguistic variables: We focus on lexical variation, which is relatively easy to identify in text corpora. Monroe et al. (2008) survey a range of alternative statistics for finding lexical variables, demonstrating that a regularized log-odds ratio strikes a good balance between distinctiveness and robustness. A similar approach is implemented in SAGE (Eisenstein et al., 2011a),[2] which we use here. For each sample (GPS-MSA-BALANCED and LOC-MSA-BALANCED), we apply SAGE to identify the twenty-five most salient lexical items for each metropolitan area.

Keyword annotation: Previous research has identified two main types of geographical lexical variables. The first are non-standard words and spellings, such as hella and yinz, which have been found to be very frequent in social media (Eisenstein, 2015). Other researchers have focused on the "long tail" of entity names (Roller et al., 2012). A key question is the relative importance of these two variable types, since this would decide whether geo-linguistic differences are primarily topic-based or stylistic. It is therefore important to know whether the frequency of these two variable types depends on properties of the sample. To test this, we take the lexical items identified by SAGE (25 per MSA, for both the GPS-MSA-BALANCED and LOC-MSA-BALANCED samples), and annotate them as NONSTANDARD-WORD, ENTITY-NAME, or OTHER. Annotation for ambiguous cases is based on the majority sense in randomly-selected examples. Overall, we identify 24 NONSTANDARD-WORDs and 185 ENTITY-NAMEs.

Inferring author demographics: As described in § 3.1.2, we can obtain an approximate distribution over author age and gender by linking self-reported first names with aggregate statistics from the United States Census. To sharpen these estimates, we now consider the text as well, building a simple latent variable model in which both the name and the word counts are drawn from distributions associated with the latent age and gender (Chang et al., 2010). The model is shown in Figure 4, and involves the following generative process:

For each user i ∈ {1...N},
(a) draw the age, a_i ∼ Categorical(π)
(b) draw the gender, g_i ∼ Categorical(0.5)
(c) draw the author's given name, n_i ∼ Categorical(φ_{a_i, g_i})
(d) draw the word counts, w_i ∼ Multinomial(θ_{a_i, g_i}),

where we elide the second parameter of the multinomial distribution, the total word count. We use expectation-maximization to perform inference in this model, binning the latent age variable into four groups: 0-17, 18-29, 30-39, above 40.[3] Because the distribution of names given demographics is available from the Social Security data, we clamp the value of φ throughout the EM procedure. Other work in the domain of demographic prediction often involves more complex methods (Nguyen et al., 2014; Volkova and Durme, 2015), but since it is not the focus of our research, we take a relatively simple approach here, assuming no labeled data for demographic attributes.

[Figure 4: Plate diagram for the latent variable model of age and gender. Variables: a_i, age (bin) for author i; g_i, gender of author i; w_i, word counts for author i; n_i, first name of author i; π, prior distribution over age bins; θ_{a,g}, word distribution for age a and gender g; φ_{a,g}, first name distribution for age a and gender g.]

4.2 Results

Linguistic differences by dataset: We first consider the impact of the data acquisition technique on the lexical features associated with each city. The keywords identified in the GPS-MSA-BALANCED dataset feature more geographically-specific non-standard words, which occur at a rate of 3.9 × 10^-4 in GPS-MSA-BALANCED, versus 2.6 × 10^-4 in LOC-MSA-BALANCED; this difference is statistically significant (p < .05, t = 3.2).[4] For entity names, the difference between datasets was not significant, with a rate of 4.0 × 10^-3 for GPS-MSA-BALANCED, and 3.7 × 10^-3 for LOC-MSA-BALANCED. Note that these rates include only the non-standard words and entity names detected by SAGE as among the top 25 most distinctive for one of the ten largest cities in the US; of course there are many other relevant terms that are below this threshold.

In a pilot study of the GPS-COUNTY-BALANCED data, we found few linguistic differences from GPS-MSA-BALANCED, in either the aggregate word-group frequencies or the SAGE word lists, despite the geographical imbalances shown in Table 1 and Figure 1. Informal examination of specific counties shows some expected differences: for example, Clayton County, which hosts Atlanta's Hartsfield-Jackson airport, includes terms related to air travel, and other counties include mentions of local cities and business districts. But the aggregate statistics for underrepresented counties are not substantially different from those of overrepresented counties, and are largely unaffected by county-based resampling.

[Figure 5: Aggregate statistics for geographically-specific (a) non-standard words and (b) entity names across imputed demographic groups, from the GPS-MSA-BALANCED sample.]

[2] https://github.com/jacobeisenstein/jos-gender-2014
[3] Binning is often employed in work on text-based age prediction (Garera and Yarowsky, 2009; Rao et al., 2010; Rosenthal and McKeown, 2011); it enables word and name counts to be shared over multiple ages, and avoids the complexity inherent in regressing high-dimensional textual predictors against a numerical variable.
[4] We employ a paired t-test, comparing the difference in frequency for each word across the two datasets. Since we cannot test the complete set of entity names or non-standard words, this quantifies whether the observed difference is robust across the subset of the vocabulary that we have selected.
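The EM procedure for the latent age/gender model can be sketched compactly. This is a minimal illustration under stated assumptions, not the paper's implementation: the toy `phi` table, vocabulary, and smoothing constant `alpha` are all hypothetical, and gender is drawn uniformly as in step (b) of the generative process.

```python
import math

AGES = ["0-17", "18-29", "30-39", "40+"]
GENDERS = ["F", "M"]

def em(users, phi, vocab, n_iter=20, alpha=0.1):
    """EM for the latent age/gender model.  Each user is (name, word_counts).
    phi[(age, g)][name] is clamped (from the name statistics); pi (prior over
    age bins) and theta[(age, g)][word] are re-estimated.  Returns
    (pi, theta, posts), where posts[i][(age, g)] is the E-step posterior."""
    states = [(a, g) for a in AGES for g in GENDERS]
    pi = {a: 1.0 / len(AGES) for a in AGES}
    theta = {s: {w: 1.0 / len(vocab) for w in vocab} for s in states}
    for _ in range(n_iter):
        # E-step: posterior over (age, gender) for each user, in log space.
        posts = []
        for name, words in users:
            logp = {}
            for (a, g) in states:
                lp = math.log(pi[a]) + math.log(0.5)      # age prior, gender
                lp += math.log(phi[(a, g)].get(name, 1e-12))
                for w, c in words.items():
                    lp += c * math.log(theta[(a, g)][w])
                logp[(a, g)] = lp
            m = max(logp.values())
            z = sum(math.exp(lp - m) for lp in logp.values())
            posts.append({s: math.exp(logp[s] - m) / z for s in states})
        # M-step: re-estimate pi and theta; phi stays clamped.
        pi = {a: sum(p[(a, g)] for p in posts for g in GENDERS) for a in AGES}
        zpi = sum(pi.values())
        pi = {a: v / zpi for a, v in pi.items()}
        for s in states:
            counts = {w: alpha for w in vocab}  # additive smoothing
            for (name, words), p in zip(users, posts):
                for w, c in words.items():
                    counts[w] += p[s] * c
            zt = sum(counts.values())
            theta[s] = {w: c / zt for w, c in counts.items()}
    return pi, theta, posts
```

With a name that appears under only one (age, gender) cell of `phi`, the posterior concentrates almost entirely on that cell, which is the clamping behavior described above.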
Age    Sex   New York                                          Dallas
0-17   F     niall, ilysm, hemmings, stalk, ily                fanuary, idol, lmbo, lowkey, jonas
0-17   M     ight, technique, kisses, lesbian, dicks           homies, daniels, oomf, teenager, brah
18-29  F     roses, , hmmmm, chem, sinking                     socially, coma, hubby, bra, swimming
18-29  M     drunken, manhattan, spoiler, guardians, gonna     harden, watt, astros, rockets, mavs
30-39  F     suite, nyc, colleagues, york, portugal            astros, sophia, recommendations, houston, prepping
30-39  M     mets, effectively, cruz, founder, knicks          texans, rockets, embarrassment, tcu, mississippi
40+    F     cultural, affected, encouraged, proverb, unhappy  determine, islam, rejoice, psalm, responsibility
40+    M     reuters, investors, shares, lawsuit, theaters     mph, wazers, houston, tx, harris

Table 3: Most characteristic words for demographic subsets of each city, as compared with the overall average word distribution
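SAGE fits sparse deviations from a background log-frequency distribution; a closely related statistic from the same family surveyed by Monroe et al. (2008), the log-odds ratio with an informative Dirichlet prior, can be sketched as follows. This is an illustration of the general technique, not the SAGE implementation used to produce the lists above, and the prior strength `alpha0` is an assumed value.

```python
import math
from collections import Counter

def log_odds_informative_prior(target, background, alpha0=100.0):
    """Z-scored log-odds ratio of each word in `target` vs `background`,
    with an informative Dirichlet prior derived from the combined counts
    (in the spirit of Monroe et al., 2008).  Inputs are Counters."""
    combined = target + background
    n_prior = sum(combined.values())
    n_t, n_b = sum(target.values()), sum(background.values())
    scores = {}
    for w in combined:
        prior_w = alpha0 * combined[w] / n_prior   # informative prior mass
        y_t = target[w] + prior_w
        y_b = background[w] + prior_w
        delta = (math.log(y_t / (n_t + alpha0 - y_t))
                 - math.log(y_b / (n_b + alpha0 - y_b)))
        var = 1.0 / y_t + 1.0 / y_b                # approximate variance
        scores[w] = delta / math.sqrt(var)
    return scores
```

Ranking words by this score for one MSA against the pooled corpus yields salient-word lists of the kind shown in Table 3.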

Demographics: Aggregate linguistic statistics for demographic groups are shown in Figure 5. Men use significantly more geographically-specific entity names than women (p ≪ .01, t = 8.0), but gender differences for geographically-specific non-standard words are not significant (p ≈ .25).[5] Younger people use significantly more geographically-specific non-standard words than older people (ages 0–29 versus 30+, p ≪ .01, t = 7.8), and older people mention significantly more geographically-specific entity names (p ≪ .01, t = 5.1). Of particular interest is the intersection of age and gender: the use of geographically-specific non-standard words decreases with age much more profoundly for men than for women; conversely, the frequency of mentioning geographically-specific entity names increases dramatically with age for men, but to a much lesser extent for women. The observation that high-level patterns of geographically-oriented language are more age-dependent for men than for women suggests an intriguing site for future research on the intersectional construction of linguistic identity.

For a more detailed view, we apply SAGE to identify the most salient lexical items for each MSA, subgrouped by age and gender. Table 3 shows word lists for New York (the largest MSA) and Dallas (the 5th-largest MSA), using the GPS-MSA-BALANCED sample. Non-standard words tend to be used by the youngest authors: ilysm ('I love you so much'), ight ('alright'), oomf ('one of my followers'). Older authors write more about local entities (manhattan, nyc, houston), with men focusing on sports-related entities (harden, watt, astros, mets, texans), and women above the age of 40 emphasizing religiously-oriented terms (proverb, islam, rejoice, psalm).

5 Impact on text-based geolocation

A major application of geotagged social media is to predict the geolocation of individuals based on their text (Eisenstein et al., 2010; Cheng et al., 2010; Wing and Baldridge, 2011; Hong et al., 2012; Han et al., 2014). Text-based geolocation has obvious commercial implications for location-based marketing and opinion analysis; it is also potentially useful for researchers who want to measure geographical phenomena in social media, and wish to access a larger set of individuals than those who provide their locations explicitly.

Previous research has obtained impressive accuracies for text-based geolocation: for example, Hong et al. (2012) report a median error of 120 km, which is roughly the distance from Los Angeles to San Diego, in a prediction space over the entire continental United States. These accuracies are computed on test sets that were acquired through the same procedures as the training data, so if the acquisition procedures have geographic and demographic biases, then the resulting accuracy estimates will be biased too. Consequently, they may be overly optimistic (or pessimistic!) for some types of authors. In this section, we explore where these text-based geolocation methods are most and least accurate.

5.1 Methods

Our data is drawn from the ten largest metropolitan areas in the United States, and we formulate text-based geolocation as a ten-way classification problem, similar to Han et al. (2014).[6] Using our user-balanced samples, we apply ten-fold cross validation, and tune the regularization parameter on a development fold, using the vocabulary of the sample as features.

[5] But see Bamman et al. (2014) for a much more detailed discussion of gender and standardness.
[6] Many previous papers have attempted to identify the precise latitude and longitude coordinates of individual authors, but obtaining high accuracy on this task involves much more complex methods, such as latent variable models (Eisenstein et al., 2010; Hong et al., 2012) or multilevel grid structures (Cheng et al., 2010; Roller et al., 2012). Tuning such models can be challenging, and the resulting accuracies might be affected by initial conditions or hyperparameters. We therefore focus on classification, employing the familiar and well-understood method of logistic regression.

5.2 Results Several researchers have studied how adoption of Many author-attribute prediction tasks become Internet technology varies with factors such as so- substantially easier as more data is avail- cioeconomic status, age, gender, and living condi- able (Burger et al., 2011), and text-based ge- tions (Zillien and Hargittai, 2009). Hargittai and olocation is no exception. Since GPS-MSA- Litt (2011) use a longitudinal survey methodology BALANCED and LOC-MSA-BALANCED have to compare the effects of gender, race, and topics very different usage rates (Figure 2), perceived dif- of interest on Twitter usage among young adults. ferences in accuracy may be purely attributable to Geographic variation in Twitter adoption has been the amount of data available per user, rather than considered both internationally (Kulshrestha et al., to users in one group being inherently harder to 2012) and within the United States, using both classify than another. For this reason, we bin users the Twitter location field (Mislove et al., 2011) by the number of messages in our sample of their and per-message GPS coordinates (Hecht and timeline, and report results separately for each bin. Stephens, 2014). Aggregate demographic statis- All errorbars represent 95% confidence intervals. tics of Twitter users’ geographic census blocks were computed by O’Connor et al. (2010) and GPS versus location As seen in Figure 6a, there Eisenstein et al. (2011b); Malik et al. (2015) use is little difference in accuracy across sampling census demographics in spatial error model. These techniques: the location-based sample is slightly papers draw similar conclusions, showing that the easier to geolocate at each usage bin, but the dif- the distribution of geotagged tweets over the US ference is not statistically significant. 
However, population is not random, and that higher usage due to the higher average usage rate in GPS- is correlated with urban areas, high income, more MSA-BALANCED(see Figure 2), the overall accu- ethnic minorities, and more young people. How- racy for a sample of users will appear to be higher ever, this prior work did not consider the biases on this data. introduced by relying on geotagged messages, nor Demographics Next, we measure classification the consequences for geo-linguistic analysis. accuracy by gender and age, using the posterior Twitter has often been used to study the ge- distribution from the expectation-maximization al- ographical distribution of linguistic information, gorithm to predict the gender of each user (broadly and of particular relevance are Twitter-based stud- similar results are obtained by using the prior dis- ies of regional dialect differences (Eisenstein et tribution). For this experiment, we focus on the al., 2010; Doyle, 2014; Gonc¸alves and Sanchez,´ GPS-MSA-BALANCED sample. As shown in 2014; Eisenstein, 2015) and text-based geoloca- Figure 6b, text-based geolocation is consistently tion (Cheng et al., 2010; Hong et al., 2012; Han et more accurate for male authors, across almost the al., 2014). This prior work rarely considers the im- entire spectrum of usage rates. As shown in Fig- pact of the demographic confounds, or of the geo- ure 6c, older users also tend to be easier to ge- graphical biases mentioned in § 3. Recent research olocate: at each usage level, the highest accuracy shows that accuracies of core language technol- goes to one of the two older groups, and the dif- ogy tasks such as part-of-speech tagging are cor- ference is significant in almost every case. As dis- related with author demographics such as author cussed in § 4, older male users tend to mention age (Hovy and Søgaard, 2015); our results on lo- many entities, particularly sports-related terms; cation prediction are in accord with these findings. 
these terms are apparently more predictive than Hovy (2015) show that including author demo- graphics can improve text classification, a similar complex methods, such as latent variable models (Eisenstein approach might improve text-based geolocation as et al., 2010; Hong et al., 2012), or multilevel grid struc- tures (Cheng et al., 2010; Roller et al., 2012). Tuning such well. models can be challenging, and the resulting accuracies might We address the question about the impact of be affected by initial conditions or hyperparameters. We therefore focus on classification, employing the familiar and geographical biases and demographic confounds well-understood method of logistic regression. by measuring differences between three sampling 0.28 0.24 Male 0.24 LOC-MSA-Balanced 0.26 Female GPS-MSA-Balanced 0.22 0.22 0.24 0.20 0.22 0.20 0.18 0.20 0.18 Accuracy Accuracy Accuracy 0.16 0.18 0.16 0-17 30-39 0.16 0.14 0.14 18-29 40+ 0.14 0.12 0-2 3-5 6-10 11-15 >15 0-2 3-5 6-10 11-15 >15 0-2 3-5 6-10 11-15 >15 Number of messages by a user Number of messages by a user Number of messages by a user (a) Classification accuracy by sampling (b) Classification accuracy by user gen- (c) Classification accuracy by imputed technique der age

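The classification setup mentioned above (logistic regression over text features, predicting a metropolitan area) can be made concrete with a self-contained toy sketch. The example tweets, metro labels, and all identifiers below are invented, and this is a minimal illustration of the general technique, not the paper's actual implementation, which a real system would replace with a mature library:

```python
import math
from collections import defaultdict

def featurize(doc):
    """Bag-of-words counts plus a bias term."""
    feats = defaultdict(float)
    feats["__bias__"] = 1.0
    for tok in doc.lower().split():
        feats[tok] += 1.0
    return feats

def train_logreg(docs, labels, lr=0.5, epochs=200):
    """Multinomial logistic regression trained by stochastic gradient descent."""
    classes = sorted(set(labels))
    weights = {c: defaultdict(float) for c in classes}
    for _ in range(epochs):
        for doc, gold in zip(docs, labels):
            feats = featurize(doc)
            scores = {c: sum(weights[c][f] * v for f, v in feats.items())
                      for c in classes}
            zmax = max(scores.values())                  # stabilize the softmax
            exps = {c: math.exp(s - zmax) for c, s in scores.items()}
            total = sum(exps.values())
            for c in classes:
                grad = (1.0 if c == gold else 0.0) - exps[c] / total
                for f, v in feats.items():
                    weights[c][f] += lr * grad * v
    return weights, classes

def predict(weights, classes, doc):
    feats = featurize(doc)
    return max(classes,
               key=lambda c: sum(weights[c][f] * v for f, v in feats.items()))

# Invented training data: tweets labeled with metropolitan areas.
docs = ["deep dish pizza tonight", "the subway was packed again",
        "more deep dish please", "stuck on a delayed subway"]
labels = ["chicago", "nyc", "chicago", "nyc"]
w, classes = train_logreg(docs, labels)
print(predict(w, classes, "deep dish after work"))
```

Because the objective is convex and the feature space is a flat bag of words, a classifier like this avoids the sensitivity to initialization and hyperparameters noted for the more complex latent-variable and grid-based alternatives.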
Recent unpublished work proposes reweighting Twitter data to correct biases in political analysis (Choy et al., 2012) and public health (Culotta, 2014). Our results suggest that the linguistic differences between user-supplied profile locations and per-message geotags are more significant, and that accounting for the geographical biases among geotagged messages is not sufficient to offer a representative sample of Twitter users.

[Figure 6: Classification accuracy as a function of the number of messages by a user: (a) by sampling technique (LOC-MSA-BALANCED vs. GPS-MSA-BALANCED); (b) by user gender; (c) by imputed age group (0-17, 18-29, 30-39, 40+).]

7 Discussion

Geotagged Twitter data offers an invaluable resource for studying the interaction of language and geography, and is helping to usher in a new generation of location-aware language technology. This makes critical investigation of the nature of this data source particularly important. This paper uncovers demographic confounds in the linguistic analysis of geo-located Twitter data, but is limited to demographics that can be readily induced from given names. A key task for future work is to quantify the representativeness of geotagged Twitter data with respect to factors such as race and socioeconomic status, while holding geography constant. However, these features may be more difficult to impute from names alone. Another crucial task is to expand this investigation beyond the United States, as the varying patterns of social media use across countries (Pew Research Center, 2012) imply that the findings here cannot be expected to generalize to every international context.

Acknowledgments

Thanks to the anonymous reviewers for their useful and constructive feedback on our submission. The following members of the Georgia Tech Computational Linguistics Laboratory offered feedback throughout the research process: Naman Goyal, Yangfeng Ji, Vinodh Krishan, Ana Smith, Yijie Wang, and Yi Yang. This research was supported by the National Science Foundation under awards IIS-1111142 and RI-1452443, by the National Institutes of Health under award number R01GM112697-01, and by the Air Force Office of Scientific Research. The content is solely the responsibility of the authors and does not necessarily represent the official views of these sponsors.

References

David Bamman, Jacob Eisenstein, and Tyler Schnoebelen. 2014. Gender identity and lexical variation in social media. Journal of Sociolinguistics, 18(2):135–160.

David A. Broniatowski, Michael J. Paul, and Mark Dredze. 2013. National and local influenza surveillance through Twitter: An analysis of the 2012–2013 influenza epidemic. PLoS ONE, 8(12):e83672.

John D. Burger, John Henderson, George Kim, and Guido Zarrella. 2011. Discriminating gender on Twitter. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

Guido Caldarelli, Alessandro Chessa, Fabio Pammolli, Gabriele Pompa, Michelangelo Puliga, Massimo Riccaboni, and Gianni Riotta. 2014. A multi-level geographical study of Italian political elections from Twitter data. PLoS ONE, 9(5):e95809.

Jonathan Chang, Itamar Rosenn, Lars Backstrom, and Cameron Marlow. 2010. ePluribus: Ethnicity on social networks. In Proceedings of the International Conference on Web and Social Media (ICWSM), pages 18–25, Menlo Park, California. AAAI Publications.

Zhiyuan Cheng, James Caverlee, and Kyumin Lee. 2010. You are where you tweet: A content-based approach to geo-locating Twitter users. In Proceedings of the International Conference on Information and Knowledge Management (CIKM), pages 759–768.

Murphy Choy, Michelle Cheong, Ma Nang Laik, and Koo Ping Shung. 2012. US presidential election 2012 prediction using census corrected Twitter model. arXiv preprint arXiv:1211.0938.

Aron Culotta. 2014. Reducing sampling bias in social media data for county health inference. In Joint Statistical Meetings Proceedings.

Gabriel Doyle. 2014. Mapping dialectal variation by querying social media. In Proceedings of the European Chapter of the Association for Computational Linguistics (EACL), pages 98–106, Stroudsburg, Pennsylvania. Association for Computational Linguistics.

Mark Dredze, Michael J. Paul, Shane Bergsma, and Hieu Tran. 2013. Carmen: A Twitter geolocation system with applications to public health. In AAAI Workshop on Expanding the Boundaries of Health Informatics Using Artificial Intelligence, pages 20–24.

Jacob Eisenstein, Brendan O'Connor, Noah A. Smith, and Eric P. Xing. 2010. A latent variable model for geographic lexical variation. In Proceedings of Empirical Methods for Natural Language Processing (EMNLP), pages 1277–1287, Stroudsburg, Pennsylvania. Association for Computational Linguistics.

Jacob Eisenstein, Amr Ahmed, and Eric P. Xing. 2011a. Sparse additive generative models of text. In Proceedings of the International Conference on Machine Learning (ICML), pages 1041–1048, Seattle, WA.

Jacob Eisenstein, Noah A. Smith, and Eric P. Xing. 2011b. Discovering sociolinguistic associations with structured sparsity. In Proceedings of the Association for Computational Linguistics (ACL), pages 1365–1374, Portland, OR.

Jacob Eisenstein. 2015. Written dialect variation in online social media. In Charles Boberg, John Nerbonne, and Dom Watt, editors, Handbook of Dialectology. Wiley.

Irene Eleta and Jennifer Golbeck. 2014. Multilingual use of Twitter: Social networks at the language frontier. Computers in Human Behavior, 41:424–432.

Nikesh Garera and David Yarowsky. 2009. Modeling latent biographic attributes in conversational genres. In Proceedings of the Association for Computational Linguistics (ACL), pages 710–718, Suntec, Singapore.

Bruno Gonçalves and David Sánchez. 2014. Crowdsourcing dialect characterization through Twitter. PLoS ONE, 9(11):e112074.

Bo Han, Paul Cook, and Timothy Baldwin. 2014. Text-based Twitter user geolocation prediction. Journal of Artificial Intelligence Research (JAIR), 49:451–500.

Eszter Hargittai and Eden Litt. 2011. The tweet smell of celebrity success: Explaining variation in Twitter adoption among a diverse group of young adults. New Media & Society, 13(5):824–842.

Brent Hecht and Monica Stephens. 2014. A tale of cities: Urban biases in volunteered geographic information. In Proceedings of the International Conference on Web and Social Media (ICWSM), pages 197–205, Menlo Park, California. AAAI Publications.

Brent Hecht, Lichan Hong, Bongwon Suh, and Ed H. Chi. 2011. Tweets from Justin Bieber's heart: The dynamics of the location field in user profiles. In Proceedings of Human Factors in Computing Systems (CHI), pages 237–246.

Liangjie Hong, Amr Ahmed, Siva Gurumurthy, Alexander J. Smola, and Kostas Tsioutsiouliklis. 2012. Discovering geographical topics in the Twitter stream. In Proceedings of the Conference on World-Wide Web (WWW), pages 769–778, Lyon, France.

Dirk Hovy and Anders Søgaard. 2015. Tagging performance correlates with author age. In Proceedings of the Association for Computational Linguistics (ACL), pages 483–488, Beijing, China.

Dirk Hovy. 2015. Demographic factors improve classification performance. In Proceedings of the Association for Computational Linguistics (ACL), pages 752–762, Beijing, China.

Juhi Kulshrestha, Farshad Kooti, Ashkan Nikravesh, and Krishna P. Gummadi. 2012. Geographic dissection of the Twitter network. In Proceedings of the International Conference on Web and Social Media (ICWSM), Menlo Park, California. AAAI Publications.

Amanda Lenhart. 2015. Mobile access shifts social media use and other online activities. Technical report, Pew Research Center, April.

P. A. Longley, M. Adnan, and G. Lansley. 2015. The geotemporal demographics of Twitter usage. Environment and Planning A, 47(2):465–484.

Momin Malik, Hemank Lamba, Constantine Nakos, and Jürgen Pfeffer. 2015. Population bias in geotagged tweets. In Papers from the 2015 ICWSM Workshop on Standards and Practices in Large-Scale Social Media Research, pages 18–27. The AAAI Press.

Alan Mislove, Sune Lehmann, Yong-Yeol Ahn, Jukka-Pekka Onnela, and J. Niels Rosenquist. 2011. Understanding the demographics of Twitter users. In Proceedings of the International Conference on Web and Social Media (ICWSM), pages 554–557, Menlo Park, California. AAAI Publications.

Burt L. Monroe, Michael P. Colaresi, and Kevin M. Quinn. 2008. Fightin' words: Lexical feature selection and evaluation for identifying the content of political conflict. Political Analysis, 16(4):372–403.

Dong Nguyen, Dolf Trieschnigg, A. Seza Doğruöz, Rilana Gravel, Mariët Theune, Theo Meder, and Franciska de Jong. 2014. Why gender and age prediction from tweets is hard: Lessons from a crowdsourcing experiment. In Proceedings of the International Conference on Computational Linguistics (COLING), pages 1950–1961.

Kathryn Zickuhr. 2013. Location-based services. Technical report, Pew Research Center, September.

Nicole Zillien and Eszter Hargittai. 2009. Digital distinction: Status-specific types of internet usage. Social Science Quarterly, 90(2):274–291.

Brendan O'Connor, Jacob Eisenstein, Eric P. Xing, and Noah A. Smith. 2010. A mixture model of demographic lexical variation. In Proceedings of NIPS Workshop on Machine Learning for Social Computing, Vancouver.

Pew Research Center. 2012. Social networking popular across globe. Technical report, December.

Barbara Poblete, Ruth Garcia, Marcelo Mendoza, and Alejandro Jaimes. 2011. Do all birds tweet the same? Characterizing Twitter around the world. In Proceedings of the International Conference on Information and Knowledge Management (CIKM), pages 1025–1030. ACM.

Delip Rao, David Yarowsky, Abhishek Shreevats, and Manaswi Gupta. 2010. Classifying latent user attributes in Twitter. In Proceedings of the Workshop on Search and Mining User-Generated Contents.

Stephen Roller, Michael Speriosu, Sarat Rallapalli, Benjamin Wing, and Jason Baldridge. 2012. Supervised text-based geolocation using language models on an adaptive grid. In Proceedings of Empirical Methods for Natural Language Processing (EMNLP), pages 1500–1510.

Sara Rosenthal and Kathleen McKeown. 2011. Age prediction in blogs: A study of style, content, and online behavior in pre- and post-social media generations. In Proceedings of the Association for Computational Linguistics (ACL), pages 763–772, Portland, OR.

Thamar Solorio and Yang Liu. 2008. Learning to predict code-switching points. In Proceedings of Empirical Methods for Natural Language Processing (EMNLP), pages 973–981, Honolulu, HI, October. Association for Computational Linguistics.

Svitlana Volkova and Benjamin Van Durme. 2015. Online Bayesian models for personal analytics in social media. In Proceedings of the National Conference on Artificial Intelligence (AAAI), pages 2325–2331.

Benjamin Wing and Jason Baldridge. 2011. Simple supervised document geolocation with geodesic grids. In Proceedings of the Association for Computational Linguistics (ACL), pages 955–964, Portland, OR.

Sarita Yardi, Daniel Romero, Grant Schoenebeck, et al. 2009. Detecting spam in a Twitter network. First Monday, 15(1).