Learning for Microblogs with Distant Supervision: Political Forecasting with Twitter

Total Page:16

File Type:pdf, Size:1020Kb

Learning for Microblogs with Distant Supervision: Political Forecasting with Twitter Learning for Microblogs with Distant Supervision: Political Forecasting with Twitter Micol Marchetti-Bowick Nathanael Chambers Microsoft Corporation Department of Computer Science 475 Brannan Street United States Naval Academy San Francisco, CA 94122 Annapolis, MD 21409 [email protected] [email protected] Abstract apply our approach to the problem of predicting Presidential Job Approval polls from Twitter data, Microblogging websites such as Twitter and we present results that improve on previous offer a wealth of insight into a popu- work in this area. We also present a novel base- lation’s current mood. Automated ap- proaches to identify general sentiment to- line that performs remarkably well without using ward a particular topic often perform two topic identification. steps: Topic Identification and Sentiment Topic identification is the task of identifying Analysis. Topic Identification first identi- text that discusses a topic of interest. Most pre- fies tweets that are relevant to a desired vious work on microblogs uses simple keyword topic (e.g., a politician or event), and Sen- searches to find topic-relevant tweets on the as- timent Analysis extracts each tweet’s atti- tude toward the topic. Many techniques for sumption that short tweets do not need more so- Topic Identification simply involve select- phisticated processing. For instance, searches for ing tweets using a keyword search. Here, the name “Obama” have been assumed to return we present an approach that instead uses a representative set of tweets about the U.S. Pres- distant supervision to train a classifier on ident (O’Connor et al., 2010). One of the main the tweets returned by the search. We show contributions of this paper is to show that keyword that distant supervision leads to improved search can lead to noisy results, and that the same performance in the Topic Identification task keywords can instead be used in a distantly super- as well in the downstream Sentiment Anal- ysis stage. We then use a system that incor- vised framework to yield improved performance. porates distant supervision into both stages Distant supervision uses noisy signals in text to analyze the sentiment toward President as positive labels to train classifiers. For in- Obama expressed in a dataset of tweets. stance, the token “Obama” can be used to iden- Our results better correlate with Gallup’s tify a series of tweets that discuss U.S. President Presidential Job Approval polls than pre- Barack Obama. Although searching for token vious work. Finally, we discover a sur- matches can return false positives, using the re- prising baseline that outperforms previous work without a Topic Identification stage. sulting tweets as positive training examples pro- vides supervision from a distance. This paper ex- periments with several diverse sets of keywords 1 Introduction to train distantly supervised classifiers for topic Social networks and blogs contain a wealth of identification. We evaluate each classifier on a data about how the general public views products, hand-labeled dataset of political and apolitical campaigns, events, and people. Automated algo- tweets, and demonstrate an improvement in F1 rithms can use this data to provide instant feed- score over simple keyword search (.39 to .90 in back on what people are saying about a topic. the best case). We also make available the first la- Two challenges in building such algorithms are beled dataset for topic identification in politics to (1) identifying topic-relevant posts, and (2) iden- encourage future work. tifying the attitude of each post toward the topic. Sentiment analysis encompasses a broad field This paper studies distant supervision (Mintz et of research, but most microblog work focuses al., 2009) as a solution to both challenges. We on two moods: positive and negative sentiment. Algorithms to identify these moods range from In contrast to lexicons, many approaches in- matching words in a sentiment lexicon to training stead focus on ways to train supervised classi- classifiers with a hand-labeled corpus. Since la- fiers. However, labeled data is expensive to cre- beling corpora is expensive, recent work on Twit- ate, and examples of Twitter classifiers trained on ter uses emoticons (i.e., ASCII smiley faces such hand-labeled data are few (Jiang et al., 2011). In- as :-( and :-)) as noisy labels in tweets for distant stead, distant supervision has grown in popular- supervision (Pak and Paroubek, 2010; Davidov et ity. These algorithms use emoticons to serve as al., 2010; Kouloumpis et al., 2011). This paper semantic indicators for sentiment. For instance, presents new analysis of the downstream effects a sad face (e.g., :-() serves as a noisy label for a of topic identification on sentiment classifiers and negative mood. Read (2005) was the first to sug- their application to political forecasting. gest emoticons for UseNet data, followed by Go Interest in measuring the political mood of et al. (Go et al., 2009) on Twitter, and many others a country has recently grown (O’Connor et al., since (Bifet and Frank, 2010; Pak and Paroubek, 2010; Tumasjan et al., 2010; Gonzalez-Bailon et 2010; Davidov et al., 2010; Kouloumpis et al., al., 2010; Carvalho et al., 2011; Tan et al., 2011). 2011). Hashtags (e.g., #cool and #happy) have Here we compare our sentiment results to Presi- also been used as noisy sentiment labels (Davi- dential Job Approval polls and show that the sen- dov et al., 2010; Kouloumpis et al., 2011). Fi- timent scores produced by our system are posi- nally, multiple models can be blended into a sin- tively correlated with both the Approval and Dis- gle classifier (Barbosa and Feng, 2010). Here, we approval job ratings. adopt the emoticon algorithm for sentiment analy- In this paper we present a method for cou- sis, and evaluate it on a specific domain (politics). pling two distantly supervised algorithms for Topic identification in Twitter has received topic identification and sentiment classification on much less attention than sentiment analysis. The Twitter. In Section 4, we describe our approach to majority of approaches simply select a single topic identification and present a new annotated keyword (e.g., “Obama”) to represent their topic corpus of political tweets for future study. In Sec- (e.g., “US President”) and retrieve all tweets that tion 5, we apply distant supervision to sentiment contain the word (O’Connor et al., 2010; Tumas- analysis. Finally, Section 6 discusses our sys- jan et al., 2010; Tan et al., 2011). The underlying tem’s performance on modeling Presidential Job assumption is that the keyword is precise, and due Approval ratings from Twitter data. to the vast number of tweets, the search will re- turn a large enough dataset to measure sentiment 2 Previous Work toward that topic. In this work, we instead use a distantly supervised system similar in spirit to The past several years have seen sentiment anal- those recently applied to sentiment analysis. ysis grow into a diverse research area. The idea Finally, we evaluate the approaches presented of sentiment applied to microblogging domains is in this paper on the domain of politics. Tumasjan relatively new, but there are numerous recent pub- et al. (2010) showed that the results of a recent lications on the subject. Since this paper focuses German election could be predicted through fre- on the microblog setting, we concentrate on these quency counts with remarkable accuracy. Most contributions here. similar to this paper is that of O’Connor et al. The most straightforward approach to senti- (2010), in which tweets relating to President ment analysis is using a sentiment lexicon to la- Obama are retrieved with a keyword search and bel tweets based on how many sentiment words a sentiment lexicon is used to measure overall appear. This approach tends to be used by appli- approval. This extracted approval ratio is then cations that measure the general mood of a popu- compared to Gallup’s Presidential Job Approval lation. O’Connor et al. (2010) use a ratio of posi- polling data. We directly compare their results tive and negative word counts on Twitter, Kramer with various distantly supervised approaches. (2010) counts lexicon words on Facebook, and 3 Datasets Thelwall (2011) uses the publicly available Sen- tiStrength algorithm to make weighted counts of The experiments in this paper use seven months of keywords based on predefined polarity strengths. tweets from Twitter (www.twitter.com) collected between June 1, 2009 and December 31, 2009. ID Type Keywords The corpus contains over 476 million tweets la- PC-1 Obama obama beled with usernames and timestamps, collected PC-2 General republican, democrat, senate, congress government through Twitter’s ‘spritzer’ API without keyword , PC-3 Topic health care, economy, tax cuts, filtering. Tweets are aligned with polling data in tea party, bailout, sotomayor Section 6 using their timestamps. PC-4 Politician obama, biden, mccain, reed, The full system is evaluated against the pub- pelosi, clinton, palin licly available daily Presidential Job Approval PC-5 Ideology liberal, conservative, progres- polling data from Gallup1. Every day, Gallup asks sive, socialist, capitalist 1,500 adults in the United States about whether Table 1: The keywords used to select positive training they approve or disapprove of “the job Presi- sets for each political classifier (a subset of all PC-3 dent Obama is doing as president.” The results and PC-5 keywords are shown to conserve space). are compiled into two trend lines for Approval and Disapproval ratings, as shown in Figure 1. positive: LOL, obama made a bears refer- We compare our positive and negative sentiment ence in green bay. uh oh. scores against these two trends. negative: New blog up! It regards the new 4 Topic Identification iPhone 3G S: <URL> This section addresses the task of Topic Identi- We then use these automatically extracted fication in the context of microblogs.
Recommended publications
  • Cyber-Synchronicity: the Concurrence of the Virtual
    Cyber-Synchronicity: The Concurrence of the Virtual and the Material via Text-Based Virtual Reality A dissertation presented to the faculty of the Scripps College of Communication of Ohio University In partial fulfillment of the requirements for the degree Doctor of Philosophy Jeffrey S. Smith March 2010 © 2010 Jeffrey S. Smith. All Rights Reserved. This dissertation titled Cyber-Synchronicity: The Concurrence of the Virtual and the Material Via Text-Based Virtual Reality by JEFFREY S. SMITH has been approved for the School of Media Arts and Studies and the Scripps College of Communication by Joseph W. Slade III Professor of Media Arts and Studies Gregory J. Shepherd Dean, Scripps College of Communication ii ABSTRACT SMITH, JEFFREY S., Ph.D., March 2010, Mass Communication Cyber-Synchronicity: The Concurrence of the Virtual and the Material Via Text-Based Virtual Reality (384 pp.) Director of Dissertation: Joseph W. Slade III This dissertation investigates the experiences of participants in a text-based virtual reality known as a Multi-User Domain, or MUD. Through in-depth electronic interviews, staff members and players of Aurealan Realms MUD were queried regarding the impact of their participation in the MUD on their perceived sense of self, community, and culture. Second, the interviews were subjected to a qualitative thematic analysis through which the nature of the participant’s phenomenological lived experience is explored with a specific eye toward any significant over or interconnection between each participant’s virtual and material experiences. An extended analysis of the experiences of respondents, combined with supporting material from other academic investigators, provides a map with which to chart the synchronous and synonymous relationship between a participant’s perceived sense of material identity, community, and culture, and her perceived sense of virtual identity, community, and culture.
    [Show full text]
  • Liminality and Communitas in Social Media: the Case of Twitter Jana Herwig, M.A., [email protected] Dept
    Liminality and Communitas in Social Media: The Case of Twitter Jana Herwig, M.A., [email protected] Dept. of Theatre, Film and Media Studies, University of Vienna 1. Introduction “What’s the most amazing thing you’ve ever found?” Mac (Peter Riegert) asks Ben, the beachcomber (Fulton Mackay), in the 1983 film Local Hero. ”Impossible to say,” Ben replies. “There’s something amazing every two or three weeks.” Substitute minutes for weeks, and you have Twitter. On a good day, something amazing washes up every two or three minutes. On a bad one, you irritably wonder why all these idiots are wasting your time with their stupid babble and wish they would go somewhere else. Then you remember: there’s a simple solution, and it’s to go offline. Never works. The second part of this quote has been altered: Instead of “Twitter”, the original text referred to “the Net“, and instead of suggesting to “go offline”, author Wendy M. Grossman in her 1997 (free and online) publication net.wars suggested to “unplug your modem”. Rereading net.wars, I noticed a striking similarity between Grossman’s description of the social experience of information sharing in the late Usenet era and the form of exchanges that can currently be observed on micro-blogging platform Twitter. Furthermore, many challenges such as Twitter Spam or ‘Gain more followers’ re-tweets (see section 5) that have emerged in the context of Twitter’s gradual ‘going mainstream’, commonly held to have begun in fall 2008, are reminiscent of the phenomenon of ‘Eternal September’ first witnessed on Usenet which Grossman reports.
    [Show full text]
  • Usenet News HOWTO
    Usenet News HOWTO Shuvam Misra (usenet at starcomsoftware dot com) Revision History Revision 2.1 2002−08−20 Revised by: sm New sections on Security and Software History, lots of other small additions and cleanup Revision 2.0 2002−07−30 Revised by: sm Rewritten by new authors at Starcom Software Revision 1.4 1995−11−29 Revised by: vs Original document; authored by Vince Skahan. Usenet News HOWTO Table of Contents 1. What is the Usenet?........................................................................................................................................1 1.1. Discussion groups.............................................................................................................................1 1.2. How it works, loosely speaking........................................................................................................1 1.3. About sizes, volumes, and so on.......................................................................................................2 2. Principles of Operation...................................................................................................................................4 2.1. Newsgroups and articles...................................................................................................................4 2.2. Of readers and servers.......................................................................................................................6 2.3. Newsfeeds.........................................................................................................................................6
    [Show full text]
  • Sending Newsgroup and Instant Messages
    7.1 LESSON 7 Sending Newsgroup and Instant Messages After completing this lesson, you will be able to: Send and receive newsgroup messages Create and send instant messages Microsoft Outlook offers some alternatives to communicating via e-mail. With instant messages, you can connect with a client who is also using instant messaging to ask a simple question and get the answer immediately. On the other hand, Newsgroups allow you to communicate with a larger audience. When considering the purchase of a new graphics program, for example, you can solicit advice and recommendations by posting a message to a newsgroup for people interested in graphics. You don’t need any practice files to work through the exercises in this lesson. Sending and Receiving Newsgroup Messages A newsgroup is a collection of messages related to a particular topic that are posted by individuals to a news server . Newsgroups are generally focused on discussion of a particular topic, like a sports team, a hobby, or a type of software. Anyone with access to the group can post to the group and read all posted messages. Some newsgroups have a moderator who monitors their use, but most newsgroups are not overseen in this way. A newsgroup can be private, such as an internal company newsgroup, but there are newsgroups open to the public for practically every topic. You gain access to newsgroups through a news server maintained either on your organization’s network or by your Internet service provider (ISP). You use a newsreader program to download, read, and reply to news messages.
    [Show full text]
  • Feed Your ‘Need to Read’ – an Explanation of RSS
    by Jen Sharp, jensharp.com Feed your ‘Need To Read’ – an explanation of RSS here exists a useful informational tool so powerful that provides feeds. For example, USA Today has separate that the government of China has completely feeds in its different sections. You know that a page has the banned its use since 2007. Yet many people have capabilities for RSS subscriptions by looking for the orange Tnot even heard about it, despite its juxtaposition in chicklet in the toolbar of your browser. Clicking on the logo front of our attention during our experiences on the web. will start you in the process of seeing what the feed for that Recently I was visiting with my dad, who is well-informed page looks like. Then you can choose how to subscribe. and somewhat technically savvy. I asked him if he had ever Choosing a reader can seem overwhelming, as there are heard of RSS feeds. He hadn’t, so when I pointed out the thousands of readers available, most of them for free. To ever-present orange “chicklet,” he remarked, “Oh! I thought start, decide how you what to view the information. RSS that was the logo for Home Depot!” feeds can be displayed in many ways, including: Very simply, RSS feeds are akin to picking up your ■ sent to your email inbox favorite newspaper and scanning the headlines for a story ■ as a personal browser start page like igoogle.com or you are interested in. Items are displayed with a synopsis and my.yahoo.com links to the full page of information.
    [Show full text]
  • RE-SONANCE---Inov-Em---Be-R-1-9
    GENERAL I ARTICLE Web Search Engines How to Get What You Want from the World Wide Web T B Rajashekar The world wide web is emerging as an all-in-one infonnation source. Tools for searching web-based information includes search engines, subject directories and meta search tools. We take a look at the key features of these tools and suggest-practical hints for effective web searching. Introduction T B Rajashekar is with the National Centre for We have seen phenomenal growth in the quantity of published Science Information information (books, journals, conferences, etc.) since the (NCSI), Indian Institute of Science, Bangalore. industrial revolution. Varieties of print and electronicindices His interests include text have been developed to retain control over this growth and to retrieval, networked facilitate quick identification of relevant information. These information systems and include library catalogues (manual and computerised), services, digital collection development abstracting and indexing journals, directories and bibliographic and managment and databases. Bibliographic databases emerged in the 1970's as an electronic publishing. outcome ofcomputerised production of abstracting and indexing journals like Chemical Abstracts, Biological Abstracts and Physics Abstracts. They grew very rapidly, both in numbers and coverage (e.g. news, commercial, business, financial and technology information) and have been available for online access from database host companies like Dialog and STN International, and also delivered on electronic media like tapes, diskettes and CD-ROMs. These databases range in volume from a few thousand to several million records. Generally, content of these databases is of high quality, authoritative and edited. By employing specially developed search and retrieval programs, these databases could be rapidly scanned and relevant records extracted.
    [Show full text]
  • Newsreaders.Com: Getting Started: Netiquette
    Usenet Netiquette NewsReaders.com: Getting Started: Netiquette I hope you will try Giganews or UsenetServer for your news subscription -- especially as a holiday present for your friends, family, or yourself! Getting Started (Netiquette) Consult various Netiquette links The Seven Don'ts of Usenet Google Groups Posting Style Guide Introduction to Usenet and Netiquette from NewsWatcher manual Quoting Style by Richard Kettlewell Other languages: o Voila! Francois. o Usenet News German o Netiquette auf Deutsch! Some notes from me: Read a newsgroup for a while before posting o By reading the group without posting (known as "lurking") for a while, or by reading the group archives on Google, you get a sense of the scope and tone of the group. Consult the FAQ before posting. o See Newsgroups page to find the FAQ for a particular group o FAQ stands for Frequently Asked Questions o a FAQ has not only the questions, but the answers! o Thus by reading the FAQ you don't post questions that have been asked (and answered) ad infinitum See the configuring software section regarding test posts (only post tests to groups with name ending in ".test") and not posting a duplicate post in the mistaken belief that your initial post did not go through. Don't do "drive through" posts. o Frequently someone who has an urgent question will post to a group he/she does not usually read, and then add something like "Please reply to me directly since I don't normally read this group." o Newsgroups are meant to be a shared experience.
    [Show full text]
  • Back to the Future: Tracing the Roots and Learning Affordances of Social Software
    1 Chapter 1 Back to the Future: Tracing the Roots and Learning Affordances of Social Software Nada Dabbagh George Mason University, USA Rick Reo George Mason University, USA ABSTRACT This chapter provides a developmental perspective on Web 2.0 and social software by tracing the historical, theoretical, and technological events of the last century that led to the emergence—or re- emergence, rather—of these powerful and transformative tools in a big way. The specific goals of the chapter are firstly, to describe the evolution ofsocial software and related pedagogical constructs from pre- and early Internet networked learning environments to current Web 2.0 applications, and secondly, to discuss the theoretical underpinnings of social learning environments and the pedagogical implica- tions and affordances of social software in e-learning contexts. The chapter ends with a social software use framework that can be used to facilitate the application of customized and personalized e-learning experiences in higher education. INTRODUCTION with their use to augment computational and com- munication capabilities and foster collaboration Is social software merely a continuation of a broad and social interaction, and progressing to their class of older computer-mediated communica- Web 2.0-enabled information aggregation capa- tion (CMC) and collaboration tools, or does it bilities. Throughout this chronological depiction, represent a significant transformation of social we emphasize the socio-pedagogic affordances interaction capabilities? In this chapter, we trace and implications of social software. We conclude the evolution of social software tools, beginning the chapter with a social software use continuum to guide the design of e-learning experiences in academic contexts.
    [Show full text]
  • A History of Social Media
    2 A HISTORY OF SOCIAL MEDIA 02_KOZINETS_3E_CH_02.indd 33 25/09/2019 4:32:13 PM CHAPTER OVERVIEW This chapter and the next will explore the history of social media and cultural approaches to its study. For our purposes, the history of social media can be split into three rough-hewn temporal divisions or ages: the Age of Electronic Communications (late 1960s to early 1990s), the Age of Virtual Community (early 1990s to early 2000s), and the Age of Social Media (early 2000s to date). This chapter examines the first two ages. Beginning with the Arpanet and the early years of (mostly corpo- rate) networked communications, our history takes us into the world of private American online services such as CompuServe, Prodigy, and GEnie that rose to prominence in the 1980s. We also explore the internationalization of online services that happened with the publicly- owned European organizations such as Minitel in France. From there, we can see how the growth and richness of participation on the Usenet system helped inspire the work of early ethnographers of the Internet. As well, the Bulletin Board System was another popular and similar form of connection that proved amenable to an ethnographic approach. The research intensified and developed through the 1990s as the next age, the Age of Virtual Community, advanced. Corporate and news sites like Amazon, Netflix, TripAdvisor, and Salon.com all became recogniz- able hosts for peer-to-peer contact and conversation. In business and academia, a growing emphasis on ‘community’ began to hold sway, with the conception of ‘virtual community’ crystallizing the tendency.
    [Show full text]
  • Blogging for Engines
    BLOGGING FOR ENGINES Blogs under the Influence of Software-Engine Relations Name: Anne Helmond Student number: 0449458 E-mail: [email protected] Blog: http://www.annehelmond.nl Date: January 28, 2008 Supervisor: Geert Lovink Secondary reader: Richard Rogers Institution: University of Amsterdam Department: Media Studies (New Media) Keywords Blog Software, Blog Engines, Blogosphere, Software Studies, WordPress Summary This thesis proposes to add the study of software-engine relations to the emerging field of software studies, which may open up a new avenue in the field by accounting for the increasing entanglement of the engines with software thus further shaping the field. The increasingly symbiotic relationship between the blogger, blog software and blog engines needs to be addressed in order to be able to describe a current account of blogging. The daily blogging routine shows that it is undesirable to exclude the engines from research on the phenomenon of blogs. The practice of blogging cannot be isolated from the software blogs are created with and the engines that index blogs and construct a searchable blogosphere. The software-engine relations should be studied together as they are co-constructed. In order to describe the software-engine relations the most prevailing standalone blog software, WordPress, has been used for a period of over seventeen months. By looking into the underlying standards and protocols of the canonical features of the blog it becomes clear how the blog software disperses and syndicates the blog and connects it to the engines. Blog standards have also enable the engines to construct a blogosphere in which the bloggers are subject to a software-engine regime.
    [Show full text]
  • Newznab Documentation Release 0.2.3-Dev
    Newznab Documentation Release 0.2.3-dev Newznab Jan 17, 2021 Contents 1 Overview 3 1.1 About...................................................3 1.2 How it Works...............................................3 1.3 Choosing Newsgroups..........................................3 1.4 Updating Index (populating binaries + parts)..............................4 1.5 Categorization..............................................4 1.6 Missing Parts...............................................4 1.7 Backfilling Groups............................................4 1.8 Regex Matching.............................................4 1.9 Regex Details...............................................5 1.10 Regex Updating.............................................5 1.11 NZB File Storage.............................................5 1.12 Spotnab..................................................5 1.13 SSL Usenet Connection.........................................5 1.14 Importing & Exporting NZBs......................................6 1.15 Google Ads/Analytics..........................................6 1.16 Admin..................................................6 1.17 TvRage/TVDB..............................................6 1.18 NFO...................................................6 1.19 Caching..................................................6 1.20 IMDb, TMDb and Rotten Tomatoes...................................7 1.21 3rd Party API Keys............................................7 1.22 Content/CMS...............................................7 1.23 Skinning
    [Show full text]
  • USENET Is a System of Special Interest Discussion Which Are Then
    DOCUMENT RESUME ED 385 294 IR 055 580 AUTHOR Anderson, Judith TITLE USENET Newsgroups. Consumer Guide. Number 12. INSTITUTION Office of Educational Research and Improvement (ED), Washington, DC. REPORT NO AR-95-7023 PUB DATE Jun 95 NOTE 4p. AVAILABLE FROMConsumer Guides, OERI, U.S. Department of Education, 555 New Jersey Avenue, N.W., Room 610, Washington, DC 20208; Internet at gopher.ed.gov. PUB TYPE Guides Non-Classroom Use (055) EDRS PRICE MFO1 /PCO1 Plus Postage. DESCRIPTORS Access to Information; Administrators; *Computer Mediated Communication; *Computer Networks; *Discussion Groups; Guidelines; *Information Seeking; Information Sources; Proofreading IDENTIFIERS Internet; *USENET ABSTRACT USENET is a system of special interest discussion groups called newsgroups, to which readers can send or post messages which are then distributed to other computers in the network. A common network is the Internet, but other networks may carry USENET as well. There are thousands of newsgroups, with a wide range of conversations on topics such as computer technology, gardening, rock and roll, and education. Each newsgroup is a discussion group focused around a specific topic. The network site administrator can provide site-specific information on accessing USENET. Most systems allow for searching the names of the newsgroups by keyword. Each newsgroup has its own set of rules. Some newsgroups are moderated; in others, there is no screening of messages. Some hints for joining a newsgroup include writing concisely, sticking to the topic, responding politely, providing references, avoiding humor or sarcasm when it could be misinterpreted, and proofreading. The following should be avoided: advertising, writing in all capitals, posting irrelevant material, and posting surveys.
    [Show full text]