Geography of Twitter, Assessing the Various CURRENT ISSUE Sources of Geographic Information on the Service and Their Accuracy
Total Page:16
File Type:pdf, Size:1020Kb
Mapping the global Twitter heartbeat: The geogr... http://firstmonday.org/ojs/index.php/fm/article/vi... HOME ABOUT LOGIN REGISTER SEARCH CURRENT OPEN JOURNAL SYSTEMS ARCHIVES ANNOUNCEMENTS SUBMISSIONS Journal Help Home > Volume 18, Number 5 - 6 May 2013 > Leetaru USER Username Password Remember me Login JOURNAL CONTENT Search All Search Browse By Issue By Author In just under seven years, Twitter has grown to count nearly three percent By Title of the entire global population among its active users who have sent more Other Journals than 170 billion 140–character messages. Today the service plays such a significant role in American culture that the Library of Congress has assembled a permanent archive of the site back to its first tweet, updated FONT SIZE daily. With its open API, Twitter has become one of the most popular data sources for social research, yet the majority of the literature has focused on it as a text or network graph source, with only limited efforts to date focusing exclusively on the geography of Twitter, assessing the various CURRENT ISSUE sources of geographic information on the service and their accuracy. More than three percent of all tweets are found to have native location information available, while a naive geocoder based on a simple major cities gazetteer and relying on the user–provided Location and Profile fields is able to geolocate more than a third of all tweets with high accuracy when measured against the GPS–based baseline. Geographic proximity is ARTICLE TOOLS found to play a minimal role both in who users communicate with and what Abstract they communicate about, providing evidence that social media is shifting the communicative landscape. Print this Contents article Indexing Introduction The native geography of Twitter: Georeferenced tweets metadata The linguistic geography of Twitter How to cite From text to maps: The textual geography of Twitter Accuracy and language item The geography of communication on Twitter The geography of linking disccourse User profile links Supplementary files Twitter versus mainstream news media Email this Twitter’s geography of growth and impact article (Login Conclusions required) Email the author (Login 1 of 37 09/22/2015 04:52 PM Mapping the global Twitter heartbeat: The geogr... http://firstmonday.org/ojs/index.php/fm/article/vi... required) Introduction ABOUT THE Since its founding in 2006, Twitter has grown at an exponential rate, today AUTHORS counting among its active users more than 2.9 percent of all people living on Earth (Fiegerman, 2012) and 9.1 percent of the population of the Kalev Leetaru United States and “has become the pulse of a planet–wide news organism, http://www.kalevleetaru.com hosting the dialogue about everything from the Arab Spring to celebrity University of Illinois deaths” (Stone, 2012). In its advertising materials, Twitter calls itself “the at Urbana- global town square — the place where people around the globe go to find Champaign out what’s happening right now” and that it is “increasingly the pulse of United States the planet” (Twitter, 2013a). In just the past 12 months alone Twitter has doubled from 100 million to 200 million active users (Twitter, 2011; Kalev Leetaru is a Fiegerman, 2012), while over the last seven years, more than 170 billion University Fellow at tweets totaling 133 terabytes have been sent, 149 billion of them in just the University of the last 24 months (Library of Congress, 2013). Illinois Graduate School of Library Twitter offers “an unprecedented opportunity to study human and Information communication and social networks,” (Miller, 2011) while the rising role of Science. His work Twitter in the consumption of traditional media like television even led focuses on ?big Nielsen to create a new Twitter “social TV” rating system (Shih, 2012). data? analyses of During major disasters, governments are increasingly turning to Twitter to space, time, and provide realtime official information streams and directives (Griggs, 2012), network structure while emergency services are taking the first steps to monitor Twitter as a to understand parallel 911 system, especially in cases where traditional phone service is human society and unavailable (Khorram, 2012). Yet, perhaps the greatest indication of culture through its Twitter’s cultural significance is that the Library of Congress now maintains communicative a permanent historical archive going back to the site’s founding and footprints and is the updated daily (Library of Congress, 2013). author of Data Mining Methods for Unlike most social network sites, Twitter and its redistributors make nearly the Content Analyst all of its data available via APIs that enables realtime programmatic access from Routledge. to its massive seven–year archive. This availability and ease of use has made Twitter one of the most popular data sources for studying social communication, with Google Scholar listing more than 3.2 million papers Shaowen Wang mentioning the service. Yet, the majority of the literature that has studied University of Illinois Twitter has focused on the text of the tweets or the network graph at Urbana- connecting users (Miller, 2011). Few studies make use of the geographic Champaign information attached to tweets, while papers like Poblete, et al. (2011) United States have used it primarily as a filtering mechanism rather than focusing on the geography itself. Most have relied either on natively georeferenced tweets Shaowen Wang is or passing the user’s self–reported location to the Google Geocoder or an Associate Yahoo! Placemaker API. Others, like Takhteyev, et al. (2012), have Professor in the integrated geography more closely into their analyses, but have limited Department of themselves to just a few thousand tweets out of the 170 billion sent to Geography and date. Geographic Information More critically, Takhteyev, et al. (2012), like most studies that have Science, and attempted to geolocate tweets, have relied on the user Location field, founding Director of making the assumption that it yields the most accurate reflection of a the user’s geographic position. Social media monitoring companies, like CyberInfrastructure Semiocast (2012), each use their own proprietary technology to locate and Geospatial tweets geographically, but offer no detail on the data fields or algorithms Information they rely on or estimates of the accuracy of their approaches. In fact, no Laboratory at the major study to date has focused exclusively on the geography of Twitter, University of Illinois examining all available sources of geographic information in the Twitter at Urbana- stream and assessing their accuracy. As location is playing an increasing Champaign. role in everything from monitoring for natural disasters (Earle, et al., 2011) to “the first ever official United Nations crisis map entirely based on data collected from social media” (Meier, 2012), there is a critical need to Guofeng Cao better understand the geography of Twitter. University of Illinois at Urbana- In Fall 2012, supercomputing manufacturer Silicon Graphics International Champaign (SGI), the University of Illinois, and social media data vendor GNIP United States collaborated to create the “Global Twitter Heartbeat” project (http://www.sgi.com/go/twitter) in order to map global emotion expressed Guofeng Cao is a on Twitter in realtime. GNIP provided access to the Twitter Decahose, Postdoctoral which consists of 10 percent of all tweets sent globally each day, while SGI Research Associate provided access to one of its new UV2000 supercomputers with 256 in the Department processors and 4TB of RAM running the Linux operating system. The result of Geography and of this collaboration, which debuted at the annual Supercomputing 2012 Geographic conference held in Salt Lake City, Utah, processed the Twitter Decahose in Information Science 2 of 37 09/22/2015 04:52 PM Mapping the global Twitter heartbeat: The geogr... http://firstmonday.org/ojs/index.php/fm/article/vi... real–time, producing a sequence of heatmaps once per second visualizing and the global emotion and displayed on an 80” LCD monitor and a large CyberInfrastructure 12’–diameter inflatable sphere known as a PufferSphere. In order to and Geospatial construct these visualizations, the project needed to construct one of the Information most detailed geographic representations of Twitter ever created in order Laboratory at the to assign a geographic location to every tweet possible and to understand University of Illinois the geographic biases in the Twitter stream that could affect its results, the at Urbana- results of which form the basis of this paper. Champaign. His research interests From 12:01AM 23 October 2012 through 11:59PM 30 November 2012, the include geographic Twitter Decahose from GNIP streamed 1,535,929,521 tweets from information science, 71,273,997 unique users, averaging 38 million tweets from 13.7 million spatiotemporal users each day. The JSON file format in which the stream is encoded analysis and generated just over 2.8TB of data over this 39 day period, but the majority location-based of this consists of metadata, with the actual total tweet text weighing in at social media data 112.7GB, containing over 14.3 billion words. The average tweet is 74 analysis. characters long and consists of 9.4 words. In all, this dataset encompasses just over 0.9 percent of all tweets ever sent since the debut of Twitter and 35.6 percent of all active users as of December 2012. Anand Padmanabhan Figure 1 shows the total number of tweets received per day from the University of Illinois Decahose over this period, while Figure 2 shows the average number of at Urbana- tweets received per hour. Twitter exhibits strong temporal change, from a Champaign low of just over one million tweets per hour from midnight to 2AM PST to United States just over two million from 7–9AM PST. Twitter’s content stream is dominated by a small number of users.