Time to Check the Tweets: Harnessing Twitter As a Location-Indexed Source of Big Geodata
Total Page:16
File Type:pdf, Size:1020Kb
Time to Check the Tweets: Harnessing Twitter as a Location-indexed Source of Big Geodata Thesis Presented in Partial Fulfillment of the Requirements for the Degree Master of Arts in the Graduate School of The Ohio State University By Kiril Vatev, B.S. Graduate Program in Geography The Ohio State University 2013 Thesis Committee: Karl Ola Ahlqvist, Advisor Ningchuan Xiao Copyright by Kiril Vatev 2013 Abstract Twitter, a popular online social network based on the microblogging format, is an important and significant source of both locational and archival data. It lends to a wide range of topics, from predicting the results of political elections, analysis of natural disasters, and tracking of disease epidemics, to studying political movements and protests, notably in cases of protest organization in Moldova and Iran, and recent events in the Arab Spring. However, many obstacles and barriers to entry are posed by Twitter’s system. Twitter has two main ways of obtaining data with different API requirements, and both returning different formats of data. System-wide authentication further complicates the data collection process. This research is based in creating an open-source software package to overcome the difficulties and barriers to obtaining a Twitter dataset, for both programming aficionados and those only interested in the data. The goals of the software are to ease the collection, storage, and future access to those interesting and important datasets. The resulting software package provides ways to drill down into the data through advanced queries not available from Twitter, and retains the data for extended analysis and reference over time. The software provides a friendly user interface, as well as plug-in capabilities for programmers. ii Vita 2011 ........................................... Bachelor’s of Science in Geography, specialization in GIS, The Ohio State University February 2012 ............................ “Strategies in Implementation of ArcGIS Server as a Data Provider for Online Geogames.” Association of American Geographers Annual Meeting. New York, NY. April 2012 ................................... “Green Revolution Geogame Demo.” Innovate OSU, Steal My Idea Showcase. Columbus, OH. July 2012 .................................... “Responsive Map-app: From Design to Implementation.” Esri International User Conference. San Diego, CA. June 2012 to August 2012 ......... Esri: Applications Prototype Lab Intern May 2011 to present .................. Research Assistant and Software Developer – The GeoGame Project, The Ohio State University: Department of Geography Fields of Study Major Field: Geography iii Table of Contents Abstract .......................................................................................................................................... ii Vita ................................................................................................................................................. iii List of Tables .................................................................................................................................. v List of Figures ................................................................................................................................. vi Introduction ................................................................................................................................... 1 Background .................................................................................................................................... 3 Understanding Twitter ...................................................................................................... 4 Sample previous work ....................................................................................................... 7 Geocoding ....................................................................................................................... 11 Software Architecture .................................................................................................................. 13 Data format ..................................................................................................................... 13 The development environment ...................................................................................... 14 The database ................................................................................................................... 17 Geocoding service ........................................................................................................... 19 Other considerations ...................................................................................................... 20 Software Implementation ............................................................................................................ 22 Sample Dataset and Testing ......................................................................................................... 25 Discussion ..................................................................................................................................... 30 Conclusion and Distribution ......................................................................................................... 33 References ................................................................................................................................... 34 Appendix A: The Manual .............................................................................................................. 38 iv List of Tables Table 1: Supervised location categories ...................................................................................... 26 Table 2: Supervised location counts ............................................................................................ 27 Table 3: Unsupervised location categories .................................................................................. 28 Table 4: Unsupervised location counts ........................................................................................ 28 Table 5: Classification problem areas .......................................................................................... 29 v List of Figures Figure 1: Software architecture .................................................................................................... 21 Figure 2: The task creation form .................................................................................................. 43 Figure 3: Task data page .............................................................................................................. 45 vi Introduction Locational data is no longer confined to the ubiquitous shape file (.shp), and today can be gathered from countless sources and data services: RSS feeds, social media, and Open 311 feeds to name a few. With over one billion GPS-equipped smartphones in use, a number predicted to double in the next three years (Strategy Analytics 2012), and advancements in Internet infrastructure, accurately placed geographic data can be produced more easily than ever before. However, putting this data to full use in geographic research is still problematic for a number of reasons. The data can come in any format that the data service deems appropriate, with any number of barriers to entry, and are almost always inaccessible to conventional geographic information systems (GIS), such as ArcMap. To further complicate data collection, location is often not standardized within a single data service – let alone across the various services – and in itself presents a challenge to data collection, often resulting in the necessary discarding of large amounts of data – at times as high as 99%, as I have found in my preliminary work. Twitter, in particular, is a major source of interesting and relevant geographic information. Twitter is a rapidly growing social network site created in 2006. It is currently the fourth largest social network globally (Global Wed Index 2013), and the largest with an open access. Twitter adopts – and arguably created – the microblogging format. The network is based entirely on users sharing 140 character messages, called tweets, openly with the public (Gonzalez et al. 2011). These messages can spread quickly through the general public, and as such, Twitter is often regarded as “electronic word of mouth,” (Tumasjan et al. 2011), consisting of small bits of information that are important to and reflective of the opinions of the author. With the wide array of data Twitter can provide, and in its wide array of formats, Gao et al. (2011) argue that there is no standardized system to collect such data, and not even a comprehensible one. Past efforts to do so in academia have been ad-hoc, and often the most 1 problematic part of research. However, data can be used on its own, or as a complement to other sources, to answer many questions raised by geographers. The first part of my thesis will focus on defining and examining the uses and limitations of Twitter as they exist today, while the second will be to develop a software package suitable for research purposes. My research collects and assembles the latest techniques in geographic data collection and location extrapolation and geocoding, and combines them in a standardized tool for collection and distribution of geographic data. Importance will be placed not only on the functionality and optimization of the software, but also on the user experience and software friendliness. The resulting software package,