A Python Geotagging Tool

pigeo: A Python Geotagging Tool Afshin Rahimi, Trevor Cohn, and Timothy Baldwin Department of Computing and Information Systems The University of Melbourne [email protected] {t.cohn,tbaldwin}@unimelb.edu.au Abstract between textual features and different regions based on large-scale collections of geotagged doc- We present pigeo, a Python geolocation uments/tweets (Wing and Baldridge, 2011; Han prediction tool that predicts a location for et al., 2012; Maier and Gomez-Rodrıguez,´ 2014). a given text input or Twitter user. We dis- Given an unseen piece of text or the text content cuss the design, implementation and appli- of a user’s timeline, the trained classifier can pre- cation of pigeo, and empirically evaluate dict the most likely location(s) associated with the it. pigeo is able to geolocate informal input. text and is a very useful tool for users who Although social media services such as Twitter require a free and easy-to-use, yet accurate remove the geographical barrier for users to com- geolocation service based on pre-trained municate, the majority of user interactions are still models. Additionally, users can train their local (Backstrom et al., 2010). This geograph- own models easily using pigeo’s API. ical bias can be utilised to geolocate a user by analysing their social interactions. Based on the 1 Introduction assumption that social interactions are more likely Geolocation is the task of identifying a location for to be local, a user should be geographically close a user or document, and has applications in local to their connections. The simplest approach to ge- search, recommender systems (Ho et al., 2012), olocation is to use the median location of a user’s targeted advertising (Lim and Datta, 2013), health friends. Recent studies have shown that using both monitoring (Paul et al., 2015), rapid disaster re- network and text information can improve the cov- sponse (Ashktorab et al., 2014), and research with erage and keep the predictions accurate simultane- a regional restriction (Gutierrez et al., 2015), not- ously (Rahimi et al., 2015b). ing the potential privacy concerns associated with Despite the widespread use of geolocation, any such application (De Cristofaro et al., 2012). most services are proprietary, overly-simplistic, While primary service providers such as Twitter or complicated to use. Supervised classification and Google are able to use metadata such as IP models often require huge amounts of geotagged addresses, WiFi traces and direct access to a GPS data and large amounts of computing power to be signal to geolocate their users, this data is gen- trained. The performance is also heavily depen- erally not available to third parties. This paper dent on hyperparameter tuning, making the train- introduces a resource that can be used to geolo- ing procedure more challenging for end-users. cate users given textual messages generated by In this paper we introduce pigeo, a Python them, and the interactions between users encoded geolocation tool that has the following charac- in those messages, focused particularly at Twitter teristics: (1) it comes with a pre-trained text- data. based model; (2) it is easy to use; (3) it has been Both language use and social ties are geograph- tuned, benchmarked and proven to be accurate; ically biased, and thus can be used to recover (4) it supports both informal and formal text in- the location of a user or a document. Previ- put; (5) it directly supports Twitter user geoloca- ous research has shown that the geographical bias tion; and (6) it has an easy-to-use RESTful API. in language use can be used in supervised text- pigeo is available at http://github.com/ based geolocation models, to learn associations afshinrahimi/pigeo. 127 Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics—System Demonstrations, pages 127–132, Berlin, Germany, August 7-12, 2016. c 2016 Association for Computational Linguistics 2 Background and Related Work curate than text-based models but can’t geolocate users who don’t interact with training users, which Prior work on geolocation falls broadly into two is the case for more than 30% of users in the case main categories: text-based and network-based of reciprocal Twitter @-mentions (Jurgens et al., methods. Both approaches use geotagged sam- 2015). Relaxing the requirement on reciprocity ples, and predict the location of an unseen docu- increases the coverage of users, at the expense of ment or user based on the trained model. Those lower accuracy (Rahimi et al., 2015a). approaches usually use GPS tags or user profile There are several other geolocation services location fields as the ground truth both for train- and libraries which focus on Twitter, includ- ing and evaluating the model. Geographical bias ing pigeoTextGrounder (Wing and Baldridge, in language use is most evident for countries with 2014) with a focus on targeted advertising, different languages (e.g. Germany versus China), pigeoCarmen (Dredze et al., 2013) with a fo- but also exists for countries which share the same cus on help monitoring, pigeoMapAffil (Torvik, languages (e.g. in the spelling of centre vs. center 2015) for affiliation mapping, and pigeoTweedr in British vs. American English). The linguistic (Ashktorab et al., 2014) for rapid disaster re- geographical bias is not limited to these obvious sponse. Many companies have their own pro- cases, however, and includes the use of toponyms, prietary geolocation service, which are either not names of people, sport teams, and dialectal terms. available for public use or not open source. In These differences in use of language can be cap- pigeo, we provide trained a text-based classifi- tured in text-based geolocation models. Previous cation model and network-based regression model work have used topic models (Eisenstein et al., for geolocation prediction, which has been bench- 2010) and supervised flat (Wing and Baldridge, marked against standard datasets. 2011; Han et al., 2012; Han et al., 2013; Han et al., 2014; Rahimi et al., 2015b) and hierarchical 3 Methodology (Wing and Baldridge, 2014) classification models. The main idea is to learn the geographical distri- pigeo uses two pre-trained models for geoloca- bution of a given word across different locations tion: (1) LR-WORLD and (2) LP-WORLD. Both from training data, and use it to predict a location are trained on TWITTER-WORLD-EX, an ex- for a new user. tended version of the TWITTER-WORLD dataset Social ties have also been used for social media (Han et al., 2012). user geolocation. Backstrom et al. (2010) showed that Facebook users tend to interact more with 3.1 Data nearby people (“location homophily”), and used We use TWITTER-WORLD-EX to train both the this property to geolocate users based on the loca- text-based classification and the network-based re- tion of their friends, hence popularising network- gression model. TWITTER-WORLD-EX is a Twit- based geolocation approaches. A graph is usually ter dataset with global coverage (Han et al., 2012), built based on Facebook friendship (Backstrom comprising 1.3M geotagged users (188M tweets), et al., 2010), Twitter follows (Rout et al., 2013), of which 10K are held out for each of develop- Twitter reciprocal @-mentions (Jurgens, 2013), or ment and testing. The dataset contains predomi- Twitter @-mentions (Rahimi et al., 2015b). The nantly English text, but also includes a rich variety problem can also be formulated as classification of other languages. In TWITTER-WORLD, the lo- (Rout et al., 2013) or regression over real-valued cation representation was cities, based on GEON- coordinates (Jurgens, 2013; Rahimi et al., 2015b). AMES. For our purposes, we modify this to 930 In classification models, the location label set can clusters based on a k-d tree, to derive a smaller be pre-existing regional boundaries (e.g. countries number of classes and remove class imbalance. or cities) or automatically generated through dis- Given that the dataset is about 5 years old, we ex- cretisation (e.g. a k-d tree). The label distribution pect the off-the-shelf performance to be degraded of friends is then averaged and used as the location on newer tweets (Dredze et al., 2016), particularly of a given user. In a regression model, the median in the case of the network-based model (Jurgens et coordinates of the friends of a user are often used al., 2015). for prediction. LR-WORLD is a text-based classification model Network-based models are generally more ac- trained over TWITTER-WORLD-EX. The train- 128 ing users of TWITTER-WORLD-EX are clustered into 930 regions with roughly the same number of users per region (about 2400), using a k-d tree. This results in many small regions/clusters in highly populated areas such as NYC, and a few large regions in sparsely-populated areas or areas with few Twitter users, such as the Sahara desert and China. The region IDs are then used as la- bels for all the users in that region. We use a bag- of-unigrams model of text with binary term frequency, inverse document frequency and l2 nor- malisation of samples to create user vectors. Log Figure 1: pigeo’s web interface. Given a piece loss is used with ElasticNet regularisation (90% of text or a single Twitter user, it geolocates it and l1) as the cost function to train the model using returns the description and coordinates of the pre- stochastic gradient descent. Given an unseen text dicted location and its most important textual fea- sample, one can vectorise the sample and use the tures in the model. classifier to predict a region/label or a probability distribution over regions. The predicted label(s) can be mapped to coordinates or locations.

Load more