Arxiv:1810.03067V1 [Cs.IR] 7 Oct 2018
Total Page:16
File Type:pdf, Size:1020Kb
Geocoding Without Geotags: A Text-based Approach for reddit Keith Harrigian Warner Media Applied Analytics Boston, MA [email protected] Abstract that will require more thorough informed consent processes (Kho et al., 2009; European Commis- In this paper, we introduce the first geolocation sion, 2018). inference approach for reddit, a social media While some social platforms have previously platform where user pseudonymity has thus far made supervised demographic inference dif- approached the challenge of balancing data access ficult to implement and validate. In particu- and privacy by offering users the ability to share lar, we design a text-based heuristic schema to and explicitly control public access to sensitive at- generate ground truth location labels for red- tributes, others have opted not to collect sensitive dit users in the absence of explicitly geotagged attribute data altogether. The social news website data. After evaluating the accuracy of our la- reddit is perhaps the largest platform in the lat- beling procedure, we train and test several ge- ter group; as of January 2018, it was the 5th most olocation inference models across our reddit visited website in the United States and 6th most data set and three benchmark Twitter geoloca- tion data sets. Ultimately, we show that ge- visited website globally (Alexa, 2018). Unlike olocation models trained and applied on the real-name social media platforms such as Face- same domain substantially outperform models book, reddit operates as a pseudonymous website, attempting to transfer training data across do- with the only requirement for participation being mains, even more so on reddit where platform- a screen name. specific interest-group metadata can be used to Fortunately, there has been significant progress improve inferences. made using statistical models to infer user demo- 1 Introduction graphics based on text and other features when self-attribution data is sparse (Han et al., 2014; The rise of social media over the past decade has Ajao et al., 2015). On reddit in particular, Harri- brought with it the capability to positively influ- gian et al. (2016) has used self-attributed “flair” ence and deeply understand demographic groups as labels for training a text-based gender infer- at a scale unachievable within a controlled lab ence model. However, to the best of our knowl- environment. For example, despite only hav- edge, user geolocation inference has not yet been ing access to sparse demographic metadata, social attempted on reddit, where a complete lack of media researchers have successfully engineered location-based features (e.g. geotags, profiles with arXiv:1810.03067v1 [cs.IR] 7 Oct 2018 systems to promote targeted responses to pub- a location field) has made it difficult to train and lic health issues (Yamaguchi et al., 2014; Huang validate a supervised model. et al., 2017) and to characterize complex human The ability to geolocate users on reddit has behaviors (Mellon and Prosser, 2017). substantial implications for conversation mining However, recent studies have demonstrated that and high-level property modeling, especially since social media data sets often contain strong pop- pseudonymity tends to encourage disinhibition ulation biases, especially those which are filtered (Gagnon, 2013). For instance, geolocation could down to users who have opted to share sensitive be used to segment users discussing a movie trailer attributes such as name, age, and location (Malik into US and international audiences to estimate a et al., 2015; Sloan and Morgan, 2015; Lippincott film’s global appeal or to inform advertising strat- and Carrell, 2018). These existing biases are likely egy within different global markets. Alternatively, to be compounded by new data privacy legislation user geolocation may be used in conjunction with sentiment analysis of political discussions to pre- 3 Geocoding reddit Users dict future voting outcomes. Previous research on geolocation inference for so- Moving toward these goals, we introduce a text- cial media has primarily used three Twitter data based heuristic schema to generate ground truth sets for model training and validation (Eisenstein location labels for reddit users in the absence of et al., 2010; Roller et al., 2012; Han et al., 2012). explicitly geotagged data. After evaluating the ac- Although Twitter’s topical diversity typically sup- curacy of our labeling procedure, we train and test ports generalization to other platforms (Mejova several geolocation inference models across our and Srinivasan, 2012), it has not been tested thor- reddit data set and three benchmark Twitter ge- oughly in the geolocation context or on reddit at olocation data sets. Ultimately, we show that ge- all. olocation models trained and applied on the same domain substantially outperform models attempt- There is substantial reason to believe that geolo- ing to transfer training data across domains, even cation models trained on Twitter data will not per- more so on reddit where platform-specific interest- form optimally when applied to reddit. One of the group metadata can be used to improve inferences. most glaring concerns is the variation in user de- mographics between the platforms. Namely, Twit- 2 reddit: the front page of the internet ter tends to skew female while reddit tends to skew male (Barthel et al., 2016; Smith and Anderson, Founded originally as a social news platform in 2018). Coates and Pichler (2011) have shown that 2005, reddit has since become one of the most language varies between genders, which suggests visited websites in the world, offering an expan- that language-model priors may vary based on so- sive suite of features designed to “[bridge] com- cial platform. munities and individuals with ideas, the latest dig- Additionally, the lack of geolocation ground ital trends, and breaking news” (reddit, 2018). Al- truth for reddit necessarily excludes several though the so-called front page of the internet still promising inference models from being applied boasts an impressive amount of link-sharing, red- to the platform. For example, network-based dit has gradually evolved into a self-referential geolocation approaches exploit the empirical re- community where original thoughts and new con- lationship between physical user distance and tent prevail (Singer et al., 2014). connectedness on a social graph to propagate reddit is structured much like a traditional on- known user locations to unlabeled users (Back- line forum, where over one-hundred thousand top- strom et al., 2010). Rahimi et al. (2015) argue ical categories known as subreddits separate user that network-based approaches are generally su- communities and conversation. Subreddits may perior to content-based methods, especially when cover topics as general as humor and movies (e.g. the social graph is well-connected. However, these r/funny, r/movies) or as specific as an unusual fit- models require within-domain grounding for the ness goal (e.g. r/100pushups). Within each sub- propagation algorithms to be useful. reddit, users can post thematically relevant sub- Finally, while the reddit platform lacks certain missions in the form of an image, text blurb, or features useful for geolocation on Twitter (e.g. external link. Users are then able to post com- timezones, profile location fields), it possesses ments on the submission, responding to either the its own unique assets which we hypothesize can content in the original submission or to comments prove useful for user geolocation. In this paper, made by other users. we quantify the predictive value of subreddit meta- reddit users tend to feel protected by the site’s data, leaving other features such as user flair and pseudonymity and consequently eschew their of- the platform’s hierarchical comment structure for fline persona in favor of a more genuine online future research. persona (Gagnon, 2013; Shelton et al., 2015). Thus, there is much potential in reddit as a data 3.1 Labeling Procedure source for conversation mining. Unfortunately, the Since reddit does not offer geotagging capabilities, same policies which promote this favorable be- nor do reddit user profiles include location fields, havior currently preclude the segmentation of con- we design a free-text geocoding procedure to as- versation based on demographic dimensions and sociate reddit users with a home location. While serve as a fundamental motivation to our research. we focus on reddit for this paper, we believe that [-] snappyNZ 1 point 2 years ago CORPUS OF CONTEMPORARY ENGLISH (Davies, Born in Wigan. Live in New Zealand. permalink embed save give gold 2009) are removed from the gazetteer. After ap- [-] aznasazin11 5 points 2 years ago Kansas City, Missouri, USA reporting in. plying this filter, our gazetteer is left with 23,018 permalink embed save give gold cities and their associated location hierarchies (i.e. [-] ziggy_zaggy 3 points 2 years ago Glad you specified that you live on the Missouri side :) city, state, country). permalink embed save give gold [-] yassjr 2 points 2 years ago To aid in the disambiguation of common lo- Am I the only one from Paris? permalink embed save give gold cation names, we create a small dictionary of [-] biosync 1 point 2 years ago I went to a really nice LFC bar in Paris; I just can’t remember what it was 57 frequently occurring abbreviations found dur- called. But judging by the crowd there, you certainly aren’t the only one! permalink embed save give gold ing a preliminary exploration of the data. This dictionary includes abbreviations for each state Figure 1: Comments from one of the seed submissions. Replies (gray background) are filtered out because they often and territory of the United States, in addition to contain location names not representative of user location. the following: USA (United States), UK (United the following labeling method should generalize Kingdom), BC (British Columbia), and OT (On- to other pseudonymous platforms that have a ques- tario). Abbreviations are only extracted if they di- n tion and answer-based submission structure.