Temporal Modelling of Geospatial Words in Twitter

Temporal Modelling of Geospatial Words in Twitter Bo Han1, Antonio Jose Jimeno Yepes2, Andrew MacKinlay2;3, Lianhua Chi2 Hugo AI1 IBM Research Australia2 University of Melbourne3 [email protected], fantonio.jimeno, admackin, [email protected] Abstract ther distinguished based on their temporal patterns of locality (i.e. how location-indicative a word Twitter text-based geotagging often uses is). Some time-invariant geospatial words (e.g. geospatial words to determine locations. CalTech) are consistently associated with a loca- While much work has been done in word tion, while other words like #ALTA216 are only geospatiality analysis, there has been little transiently associated with a location (e.g. during work on temporal variations in the geospa- the conference). tial spread of word usage. In this paper, In this paper, we investigate the word local- we investigate geospatial words relative to ity pattern over time. First, we bin streaming their temporal locality patterns by fitting tweet data into a series of sliding time windows periodical models over time. The model and calculate the locality estimators in each win- jointly captures inherent geospatial local- dow. The time-indexed locality estimators are then ity and periodical factor for a word. The fed into periodical models which jointly capture resultant factorisation enables better un- the inherent geospatial locality and periodical fac- derstanding of word temporal trends and tor. We show that by removing the periodical fac- improves geotagging accuracy by only us- tor, we can obtain improved geotagging accuracy. ing inherent geospatial local words. Furthermore, we demonstrate the utility of fitted 1 Introduction model parameters in explaining some intuitive ob- servations for reoccurring words. Automatically inferring geographical location of social media data has become increasingly pop- 2 Background ular, as geospatial information plays a vital role Text-based geotagging is often formulated as a in applications such as advertising, influenza de- classification task (Cheng et al., 2010) which in- tection and disaster management. Due to a volves predicting a location from a set of pre- lack of abundant reliable geospatial information defined geographical partitions. It exploits words (Cheng et al., 2010), various text-based geotag- in tweets on the grounds that some of them carry ging methods have been proposed (Cheng et al., geographical information. The accuracy is of- 2010; Eisenstein et al., 2010; Wing and Baldridge, ten measured by the agreement of the predicted 2014).1 The main idea is to leverage geospatial locations with true oracle locations (Wing and words such as dialects and location names em- Baldridge, 2011). bedded in Twitter text to infer geographic loca- Earlier work in geotagging exploits language tions. For instance, yinz is primarily used in Pitts- models identified from textual data from different burgh, #auspol is a popular hashtag in Australia, locations (Cheng et al., 2010; Wing and Baldridge, and @TransLink is frequently mentioned by Van- 2011; Kinsella et al., 2011). It selects the location couver users. with the most similar language model relative to Social media data often comes as a stream, and the input tweet text. However, these methods also its contents and topics change over time. This im- include irrelevant words such as stop words (e.g. plies geospatial words in social media can be fur- the) and common hashtags (e.g. #iphone), mean- 1Indeed there is a way that a user can turn on their location ing they capture imperfect geospatial signals. sharing options which offers accurate location information, however, the ratio of such users are pretty low (Cheng et al., Cheng et al. (2010) improved language model- 2010). based methods by augmenting local words for Bo Han, Antonio Jimeno Yepes, Andrew MacKinlay and Lianhua Chi. 2016. Temporal Modelling of Geospatial Words in Twitter. In Proceedings of Australasian Language Technology Association Workshop, pages 133−137. geotagging. Words with sharply-peaked frequency 3.2 Sinusoidal Modelling distributions with respect to location are cate- We assume that a word’s observed geospatial pat- gorised as local words, and only local words are tern is jointly influenced by its inherent locality used in geotagging. Furthermore, ranking geospa- and periodical factor. A general sinusoidal model tial words by their locality in the decreasing order is applied to capture both factors in Equation 1. has been proposed (Chang et al., 2012; Laere et al., 2013; Han et al., 2014), however the categori- sation of local and non-local words is still binary. f(t) = C + α sin(!t + φ) (1) Hierarchical models and regularisation have also |{z} | {z } Inherent locality Periodical factor shown to be effective in geotagging (Ahmed et al., 2013; Wing and Baldridge, 2014). f(t) denotes the geospatial locality variance With much progress in identifying and utilising over time and t is the time window index. C, geospatial words, the temporal variance of geospa- which is constant with respect to time, models tial words has not been under studied. In this pa- the inherent permanent locality, while the pe- per, we study the impact of this temporal aspect riodic factor is captured by the time-dependent for geospatial words. α sin(!t + φ) in Equation 1. A smaller C suggests the corresponding word is 3 Temporal Geospatial Word Modelling more inherently location-indicative since the periodic effects are factored out into the other term. To analyse the temporal pattern of geospatial The time component α sin(!t+φ) is dependent words, we first define a fixed-length sliding time on the time window index t and parameterised by window. The collected data within a time window the amplitude α, angular frequency ! and phase is then aggregated for computing locality vari- φ. α represents the maximum impact of periodic ances for each word. The same calculations are component on a word’s locality, with a larger value performed for each consecutive time window and suggesting the word is strongly time-dependent. the location scores are collectively incorporated in ! denotes the frequency of this periodic com- a periodical model, and the geospatial words are ponent. It is inversely proportional to the period, then ranked and categorised based on this model. which is important for categorising the geospatial Top ranked words are assumed to be consistently patterns of a word. Ideally, for transient geospa- location-indicative over time and should therefore tial words, lower locality variances will occur in be preferred when building geotagging models. a tight cluster of time windows giving a large period (and hence low !), while recurring geospatial 3.1 Measurement of Word Locality words will have a smaller period corresponding to lower locality variances appearing at more regular The locality variances of a word are computed on intervals. basis of time windows (e.g. one week). For each φ is the phase of the wave and reflects the point, word found in a time window, we obtain a list such as a day of the week, within a time window of locations (i.e. GPS coordinates) from tweets and it is crucial for curve fitting. containing the word. Then we draw random sam- ples of paired locations without replacement (non- 4 Experiments and Discussion exhaustive to improve tractability), and compute the distances between paired locations following 4.1 Datasets Cook et al. (2014), yielding a list of paired lo- We collected 10% 2014 Twitter stream data from cation distances. The mean and median of these Gnip.2 Tweets are lowercased and non-English distances serve as locality variances and are used data is removed according to Gnip-provided lan- in subsequent experiments. Permanent location- guage code. We use APR-DEC geotagged tweets indicative words should have consistently low lo- as the training data and JAN-MAR users (by ag- cality variances as they are likely to occur in ge- gregating all their tweets) for test.3 To ensure the ographically close regions in most time windows. 2 The metric of median distance reduces the influ- https://gnip.com/ 3Because we had a major data collection interruptions be- ence of outlier locations (e.g. caused by people tween March and April, we chose the larger of two datasets mentioning their home city while travelling). for training and the other for test. 134 quality of test data, we only include users who 7 have more than 10 English tweets that are within following settings to test the impact. 150 km of a city centre according to the geotag • WHOLE: This baseline uses the whole train- coordinates of the tweets, with at least 80% of ing data collection period to calculate the these tweets having the same closest city. The locality median score in training without closest city is stored as the user’s true location.4 specifying time windows, which is roughly The datasets after applying the above process are equivalent to conventional word-based non- shown in Table 1.5 temporal geotagging model. • SIN-MEAN: This setting uses a fourteen-day Datasets Data size sliding time window with one day as the slid- Train(APR-DEC) 45.4M tweets ing step.8 Random location pairs are gener- Test(JAN-MAR) 373K users ated three times with each sample size equal to 20. Mean numbers are used to fit the si- Table 1: Filtered Twitter dataset nusoidal model and the initial parameters are estimated as described in Section 4.2. • SIN-MEDIAN: Similar to SIN-MEAN, but we 4.2 Fitting Sinusoidal Models use medians as locality variances to reduce the negative impact of outliers. We estimate parameters for Equation 1 using Non- linear Least Square for each word. The initial val- We then rank words by the C value from (3.2) ues for important factors are set as follows:6 and evaluate the accuracy of the geotagging mod- • C is set to the mean of the “10%-trimmed” els produced by using the top n words.

Load more