Applications of Twitter Emotion Detection for Stock Market Prediction

by

Clare H. Liu

S.B., Massachusetts Institute of Technology (2016)

Submitted to the Department of Electrical Engineering and Computer Science
in partial fulfillment of the requirements for the degree of
Master of Engineering in Computer Science and Engineering
at the
Massachusetts Institute of Technology

June 2017

© Massachusetts Institute of Technology 2017. All rights reserved.

Author: Department of Electrical Engineering and Computer Science, May 18, 2017

Certified by: Andrew W. Lo, Charles E. and Susan T. Harris Professor, Thesis Supervisor

Accepted by: Christopher J. Terman, Chairman, Masters of Engineering Thesis Committee

Applications of Twitter Emotion Detection for Stock Market Prediction

by Clare H. Liu

Submitted to the Department of Electrical Engineering and Computer Science on May 18, 2017, in partial fulfillment of the requirements for the degree of Master of Engineering in Computer Science and Engineering

Abstract

Currently, most applications of sentiment analysis focus on detecting sentiment polarity, that is, whether a piece of text can be classified as positive or negative. However, it can sometimes be important to distinguish between distinct emotions rather than just the polarity. In this thesis, we use a supervised learning approach to develop an emotion classifier for the six Ekman emotions: joy, fear, sadness, disgust, surprise, and anger. We then apply our emotion classifier to tweets from the 2016 presidential election and to financial tweets labeled with Twitter cashtags, and evaluate the effectiveness of using finer-grained emotion categorization to predict future stock market performance.

Thesis Supervisor: Andrew W. Lo
Title: Charles E. and Susan T. Harris Professor

Acknowledgments

First of all, I would like to express my gratitude to my thesis supervisor, Professor Andrew Lo, for giving me the opportunity to explore a new field, and for his insightful ideas and feedback. I would also like to thank Allie, Jayna, and Crystal for providing me with important resources and for their scheduling help. I especially want to thank Shomesh Chaudhuri for giving me crash courses on finance and providing invaluable suggestions and guidance over the past two years. Finally, I wish to thank my parents for their unconditional support and encouragement.

Contents

1 Introduction
  1.1 Thesis Organization

2 Literature Review
  2.1 Emotion Classification
  2.2 Relationship Between Twitter Sentiment and Stock Market Performance
  2.3 Predicting Presidential Elections

3 Creating an Emotion Classifier
  3.1 Multiclass Classification Algorithms
    3.1.1 One-vs-rest
    3.1.2 One-vs-one
    3.1.3 Logistic Regression
    3.1.4 Random Forests
  3.2 Datasets
  3.3 Baselines
  3.4 Methodology
    3.4.1 Feature Selection
    3.4.2 Data Preparation
    3.4.3 Implementation Details
  3.5 Evaluation Metrics
  3.6 Results
  3.7 Discussion

4 Emotion Analysis of Presidential Election Tweets
  4.1 Datasets
    4.1.1 Data Preparation
  4.2 Emotion Distributions on Election Day
    4.2.1 Election Day Key Events
    4.2.2 Comparison with Polarity-Based Sentiment Analysis
    4.2.3 Using Volume to Identify Events
  4.3 Can Presidential Debates Predict Market Returns?
    4.3.1 Summary of Candidate Policies
    4.3.2 S&P 500 Returns after Election Day
    4.3.3 Who won the Presidential Debates?
    4.3.4 S&P 500 Reactions to Presidential Debates
  4.4 Discussion

5 Emotion Analysis of Financial Tweets
  5.1 Datasets
  5.2 Correlation Between Emotions and Stock Prices
  5.3 Using Volume to Identify Events
  5.4 Sentiment-Based Trading Strategy
    5.4.1 Preliminary Results
    5.4.2 Reevaluation of Emotion Classifier Performance
  5.5 Keyword-Based Trading Strategy
    5.5.1 Evaluation of Trading Strategy Performance
  5.6 Discussion

6 Conclusions and Future Work

List of Figures

4-1 Average Sentiment during the 2016 Presidential Election
4-2 2016 Election Day Emotion Distributions
4-3 First Presidential Debate
4-4 Emotion Distributions during the First Presidential Debate

5-1 Twitter Volume Plots for Microsoft and Facebook
5-2 Preliminary Trading Strategy Performance for Microsoft, Facebook, and Yahoo
5-4 Microsoft sentiment using keywords during earnings announcement on April 21
5-5 Keyword-based Trading Strategy

List of Tables

3.1 Examples of Labeled Tweets
3.2 Tweet Processing Example
3.3 Model Comparison
3.4 Logistic Regression Accuracy Metrics
3.5 Classification Examples
3.6 Examples of Classification Errors

4.1 S&P 500 Sectors before and after Election Day
4.2 Clinton: Change in joy tweets before and after debates
4.3 Trump: Change in joy tweets before and after debates
4.4 Morning Consult Poll Results
4.5 S&P 500 Industries Before and After First Presidential Debate
4.6 S&P 500 Industries Before and After Second Presidential Debate
4.7 S&P 500 Industries before and after Third Presidential Debate

5.1 Correlation between average emotion percentages and next-day stock returns
5.2 Correlation between average emotion percentages and same-day stock returns
5.3 Noise in $AAPL Tweets
5.4 Microsoft Earnings Announcement Classification Examples
5.5 Yahoo Earnings Announcement Classification Errors
5.6 Trading Strategy Comparison
5.7 Trading Strategy Statistics

Chapter 1

Introduction

Over the past decade, the rise of social media has enabled millions of people to share their opinions and react to current events in real time. As of June 2016, Twitter had over 300 million monthly active users, and over 500 million tweets were posted per day [53]. Ever since the official Twitter API was introduced in 2006, users and researchers have been applying sentiment analysis algorithms to this massive data source to gauge public opinion towards emerging events. Automatic sentiment analysis algorithms have been used in a variety of applications, including evaluating customer satisfaction, detecting fraud, and predicting future events, such as the results of a presidential election.

Currently, most publicly available sentiment analysis libraries focus on detecting sentiment polarity, which is whether a piece of text expresses a positive, negative, or neutral sentiment. However, due to the wide range of possible human emotions, this coarse-grained approach has limitations for some applications. For instance, the producers of a horror movie may wish to use sentiment analysis to understand their audience's opinion of the movie. Boredom and fear could both be classified as negative emotions, but the producers would be happy if their viewers expressed fear, while they would probably modify their approach for future movies if the viewers were bored.

In this thesis, we will evaluate the merits and limitations of using a finer-grained emotion classification scheme compared to the more common sentiment polarity approach. We will also evaluate the possibility of predicting future stock returns based on emotion distributions of tweets from two contrasting domains: presidential elections and financial tweets mentioning NASDAQ-100 companies. The election of a new president has wide implications for the future of the United States and international economies, which usually results in stock market volatility. Company stock prices have also been shown to be affected by market sentiment, especially following important events such as earnings announcements and acquisitions. Since presidential elections and volatility in the stock market often evoke strong emotions in people, using a finer-grained emotion analysis approach could reveal more interesting insights about the public's perception of candidates and publicly traded companies, potentially leading to more accurate and profitable stock market predictions.

1.1 Thesis Organization

The remainder of this thesis is organized as follows:

∙ Chapter 2 contains a literature review of past work in automatic emotion detection and in using Twitter to predict future stock market performance and the results of presidential elections.

∙ Chapter 3 details the construction and evaluates the performance of an emotion classifier for the six basic Ekman emotions.

∙ In chapter 4, we analyze tweets from the 2016 presidential election to determine whether emotion classification can be used to identify differences in public opinion towards the two presidential candidates. Then we investigate the correlation between the policies of presidential debate winners and the market performance of related industries on the following day.

∙ In chapter 5, we evaluate the correlation between emotion distributions of tweets tagged with cashtags of the NASDAQ-100 companies and future stock returns for these companies. We then look at trends in Twitter volume and sentiment for different tickers to identify significant events and predict their effects on future returns. Finally, we propose a simple trading strategy based on sentiment expressed in earnings announcement tweets.

∙ Finally, chapter 6 summarizes our major findings and suggests possible avenues for future research.

Chapter 2

Literature Review

This chapter discusses approaches to automatic emotion classification and related work in using social media for stock market prediction.

2.1 Emotion Classification

In 1992, psychologist Paul Ekman argued that there are six basic emotions: anger, fear, sadness, joy, disgust, and surprise. These emotions share nine characteristics with a biological basis, including distinctive universal signals, presence in other primates, and quick onset. He also argued that all other emotional states can be grouped into one of these basic emotions or classified as moods, emotional traits, or emotional attitudes instead [11]. Much of the recent research on finer-grained emotion detection has focused on these six basic Ekman emotions. In 2007, SemEval (Semantic Evaluation), an ongoing series of evaluations of computational semantic analysis systems, presented a task where the objective was to "annotate text for emotions (e.g. joy, fear, surprise) and/or for polarity orientation (positive/negative)" [51]. Participants were provided with a development corpus of 250 news headlines annotated with one of the six Ekman emotions and a test corpus of 1,000 news headlines. Many later studies on emotion detection used this corpus to develop classifiers and larger corpora annotated with emotions.

Roberts et al. developed EmpaTweet, a corpus of tweets annotated with the six Ekman emotions plus "love", using a semi-automated process [42]. They first used a supervised learning approach to automatically annotate unlabeled tweets with one or more emotion categories; human annotators were then asked to verify the predominant emotion for ambiguous tweets. Mohammad et al. also created the Twitter Emotion Corpus (TEC) by collecting tweets containing hashtags of the six Ekman emotions, such as #joy and #sadness [30].

Two major approaches to automatic emotion classification are supervised learning methods and affect lexicon-based approaches [31]. Supervised learning approaches generally analyze labeled training examples to generate a prediction function that can be applied to unseen data. Many supervised learning algorithms use n-gram features to learn which words or phrases in the training data are associated with each emotion.

An affect lexicon is a list of words and the emotions or sentiment that they are associated with. For example, the word "abandoned" is associated with fear and sadness, while "amuse" is associated with joy. Lexicon-based approaches usually look up the emotion associated with each word in a piece of text, if any, and label the text with the predominant emotion that is present. One example of an affect lexicon is Mohammad and Turney's NRC Word-Emotion Association Lexicon (EmoLex), which they generated through crowdsourcing on Amazon Mechanical Turk [29] [32].

Lexicon-based approaches usually perform worse than supervised learning approaches because they do not consider context or sentence structure, which can greatly affect the meaning of a piece of text. However, lexicon-based approaches are much faster and more memory efficient than supervised learning methods, which usually use tens of thousands of features to generate models. Supervised learning approaches also may not generalize as well to other domains that do not share many n-gram features with the training set.

Mohammad then investigated whether combining affect lexicons and n-gram features in a supervised learning algorithm could improve the accuracy of a classifier [31]. He found that using a combination of both types of features outperformed using n-grams alone or affect lexicon features alone, for test sets containing samples from the same domain (newspaper headlines) and from a different domain (blog posts). Thus, we decided to replicate Mohammad's approach of using both n-grams and word lexicon features in our classifier.

Next, we will discuss previous studies on the effectiveness of using both polarity-based sentiment analysis and emotion classification to predict future stock market movements.

2.2 Relationship Between Twitter Sentiment and Stock Market Performance

Several groups have studied the correlation between sentiment polarity and the performance of various stock market indicators. Many studies found that sentiment polarity was not useful for predicting future stock returns, but that other factors, such as volume, were. Ranco et al. measured the correlation between the Twitter volume and sentiment of Dow Jones constituents and the Dow Jones Industrial Average (DJIA). They found that sentiment polarity was not correlated with future stock returns, but that tweet volume was predictive of abnormal returns for about one third of the 30 Dow Jones companies [40]. Hentschel et al. then studied the properties of Twitter cashtags for NASDAQ and NYSE stocks. They also found that tweet volume and market performance are sometimes related, but not always [19]. The correlation between tweet volume and future returns suggests that increases in tweet volume can be indicators of important events that can impact the market. Azar and Lo focused specifically on tweets mentioning the Federal Open Market Committee (FOMC) from 2009 to 2014 and calculated the sentiment polarity for these tweets, weighting the polarity values by each Twitter user's number of followers. They found that the effect of sentiment polarity on returns was negligible except on the eight days that the FOMC meets, when increases in sentiment polarity are positively correlated with returns [3]. Furthermore, they were able to develop a sentiment-based trading strategy that significantly outperformed benchmarks, even when only using eight days of data. Therefore, sentiment polarity seems to have the most predictive value when applied to significant market events.

Other studies focused on identifying emotions or moods expressed in tweets and other forms of social media. Bollen et al. measured the mood of tweets in six dimensions (Calm, Alert, Sure, Vital, Kind, and Happy) in addition to their polarity (positive/negative). Like Ranco, they found that the polarity of tweets alone was not correlated with future stock returns, but that the calmness dimension could be used to predict movements in the Dow Jones Industrial Average [6]. Mittal and Goel also found that calmness and happiness had a strong positive correlation with the DJIA. They were also able to accurately predict future DJIA closing prices using a neural network algorithm and develop an improved portfolio management strategy that makes buy and sell decisions based on whether predicted future stock prices are above or below the mean values [28].

Gilbert and Karahalios used a supervised learning approach to create the "Anxiety Index", a metric of anxiety, fear, and worry expressed in blog posts published on LiveJournal. They found that increases in anxiety, worry, and fear across all of LiveJournal predicted downward pressure on the S&P 500 index, even when including blogs not related to finance [16]. Zhang et al. used a simpler approach to categorize tweets into the six Ekman emotions by counting words associated with each emotion. Interestingly, they found that outbursts of both positive and negative emotions on Twitter had a negative correlation with the Dow Jones, S&P 500, and NASDAQ indices [56].

These results support our hypothesis that categorizing tweets into finer-grained emotions can be more useful than classifying tweets as just positive or negative for stock market prediction.

2.3 Predicting Presidential Elections

Twitter sentiment analysis has also been used to predict the results of presidential elections. Jahanbakhsh and Moon applied a variety of analysis techniques, such as frequency distributions, sentiment analysis, and topic modeling, to identify topics discussed in tweets during the 2012 presidential election [22]. They were able to determine that Obama was leading during the election by analyzing only Twitter data, which demonstrates the potential predictive power of Twitter for elections.

Shi et al. investigated public opinion on Twitter during the 2012 Republican primary election. They tested the correlation between various Twitter factors, including the Twitter volume for each candidate, the geolocation of Twitter users, and whether the Twitter account is a promotional account, and official poll results from the RealClearPolitics website. Their algorithm was able to accurately predict public opinion trends for Mitt Romney and Newt Gingrich, two of the four candidates. Again, they found that their results when combining Twitter sentiment with volume were very similar to using volume alone [48].

In addition, presidential election results have also been shown to be tied to future stock returns. Prechter et al. found that social mood reflected by the stock market was more predictive of the success of an incumbent president’s reelection bid than traditional macroeconomic factors, such as the Gross Domestic Product, inflation rate, and unemployment rate [39].

Oehler et al. analyzed stock market returns following presidential elections from 1976 to 2008 and found that the election of almost all recent presidents caused abnormal returns in many sectors and industries, but that the stock returns eventually stabilized with time. They also discovered that these effects were more correlated with the specific policies of individual presidents than with the general ideology of the president's political party. They hypothesized that this effect is caused by initial uncertainty about the president-elect's new policies [35].

These results suggest that we can use a combination of Twitter volume and sentiment to gauge public opinion towards presidential candidates, which can in turn be used to predict stock market returns following elections.

Chapter 3

Creating an Emotion Classifier

Many corpora and libraries are publicly available for polarity-based sentiment analysis. However, finer-grained emotion categorization has not been studied as much, so in this chapter we develop our own emotion classifier to label unseen tweets with one of the six Ekman emotions. This chapter first summarizes several approaches to multiclass classification, and then describes the implementation of our emotion classifier and evaluates its performance.

3.1 Multiclass Classification Algorithms

Many machine learning classification algorithms are designed to classify input examples into two groups, such as positive and negative. These binary classification algorithms generally work by generating features for each training example and then calculating a decision boundary between the two classes. However, since we want to classify each tweet as one of the six basic Ekman emotions, we must use a multiclass classification approach. Multiclass classification solves the problem of assigning labels to a set of input examples, where there are more than two classes [1] [2]. Most multiclass classification approaches are based on binary classification methods. The one-vs-rest and one-vs-one strategies work by reducing the problem to multiple binary classification tasks. Other binary classification algorithms, such as logistic regression and random forests, can naturally be extended to multiclass problems. All of these approaches are summarized below.

3.1.1 One-vs-rest

The one-vs-rest approach trains a single binary classifier per class, where samples from that class are treated as positive samples and all other samples as negative samples. Each classifier produces a real-valued confidence score instead of just a class label. We can then apply every classifier to each unseen sample and choose the label corresponding to the classifier with the highest confidence score. The following equation describes how a label is chosen for each sample.

$$\hat{y} = \underset{k \in 1 \ldots K}{\arg\max}\, f_k(x) \qquad (3.1)$$

If we have $K$ classes, for each unseen sample $x$ we apply each of the $K$ classifiers to the sample. $f_k(x)$ represents the confidence score obtained by applying classifier $k$ to sample $x$. We then choose the label $\hat{y}$ to be the class $k$ for which $f_k$ produces the highest confidence score [1] [2].

3.1.2 One-vs-one

The one-vs-one method trains $K(K-1)/2$ binary classifiers, one for each pair of the $K$ classes. Each of these classifiers is applied to every unseen sample and a voting scheme is applied, where each binary classifier votes for the class that produced the higher confidence score. The class with the highest number of votes is ultimately predicted for each sample [1] [2].
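To make the two reduction strategies concrete, the sketch below sets them up with scikit-learn's meta-estimators. The base learner and the synthetic six-class dataset are illustrative stand-ins, not the exact configuration used in this thesis.

```python
# Sketch: one-vs-rest and one-vs-one reductions via scikit-learn
# meta-estimators, on a synthetic six-class problem standing in for
# the six Ekman emotions.
from sklearn.datasets import make_classification
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=600, n_features=20, n_informative=10,
                           n_classes=6, random_state=0)

# One-vs-rest: K binary classifiers; predict the class whose classifier
# reports the highest confidence score (equation 3.1).
ovr = OneVsRestClassifier(LinearSVC()).fit(X, y)

# One-vs-one: K(K-1)/2 = 15 pairwise classifiers plus a voting scheme.
ovo = OneVsOneClassifier(LinearSVC()).fit(X, y)

print(ovr.predict(X[:5]), ovo.predict(X[:5]))
```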

3.1.3 Logistic Regression

Linear regression is an algorithm that predicts real-valued outputs based on a linear function of the input features. The basic linear prediction function is given in equation 3.2, where $x$ is a vector containing the features of a sample, $y$ is the predicted output, and $\theta$ refers to the parameters of the model [34].

$$y = h_\theta(x) = \sum_i \theta_i x_i = \theta^T x \qquad (3.2)$$

However, the linear regression model does not work well for classifying examples into a few discrete classes. Thus, the logistic regression classifier uses the sigmoid function in equation 3.3 to map the output of the linear prediction function into the range [0, 1]. Here, $h_\theta(x)$ represents the probability that $x$ is a positive example, and $1 - h_\theta(x)$ represents the probability that $x$ is a negative example [34].

$$P(y = 1 \mid x) = h_\theta(x) = \frac{1}{1 + \exp(-\theta^T x)} \qquad (3.3)$$

For multiclass classification with $K$ classes, we can use multinomial logistic regression, which runs $K - 1$ independent binary logistic regression models. One class is chosen as a pivot and each of the other $K - 1$ classes is compared against it. Finally, the class with the highest probability is predicted, similarly to the one-vs-rest algorithm described above [27].
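As a worked illustration of equations 3.2 and 3.3, the short sketch below computes the sigmoid of the linear prediction for a single sample; the parameter and feature values are made up for the example.

```python
import numpy as np

def h_theta(theta, x):
    """Equations 3.2 and 3.3: P(y = 1 | x) = sigmoid(theta^T x)."""
    return 1.0 / (1.0 + np.exp(-np.dot(theta, x)))

theta = np.array([0.5, -1.2, 2.0])  # illustrative model parameters
x = np.array([1.0, 0.3, 0.8])       # illustrative feature vector

p_pos = h_theta(theta, x)           # probability x is a positive example
print(p_pos, 1.0 - p_pos)           # the two class probabilities sum to 1
```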

3.1.4 Random Forests

The random forest classification algorithm is an ensemble learning method based on decision trees. Decision trees are made up of decision nodes and leaves, where each leaf represents a possible class. At each decision node, we examine a single variable and choose the next node based on the result of a comparison function using the sample's features as inputs. The leaf we finally reach is output as the predicted label [43].

The random forest algorithm constructs many decision trees and outputs the class that was most frequently predicted by the individual decision trees. Combining the results of multiple decision trees helps to correct for a single decision tree's tendency to overfit its training set [20].
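A minimal usage sketch, on a standard toy dataset rather than our tweet features: scikit-learn's RandomForestClassifier builds the ensemble of trees and predicts by majority vote.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# 100 trees, each fit on a bootstrap sample with random feature subsets;
# the predicted class is the one most trees vote for.
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(forest.predict(X[:5]))
```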

3.2 Datasets

We use Mohammad’s Twitter Emotion Corpus (TEC) as training data for our clas- sifier. This corpus contains over 21,000 tweets annotated with one of the sixEkman emotions [30]. We also used Mohammad and Turney’s NRC Word-Emotion Associ- ation Lexicon (EmoLex) to identify words that are associated with each of the six Ekman emotions. EmoLex is an affect lexicon that contains over 14,000 English words and a list of the Ekman emotions each word is associated with. Table 3.1 shows examples of tweets in the TEC that are labeled with each of the six Ekman emotions.

Table 3.1: Examples of Labeled Tweets

Tweet | Emotion
"FANTASTIC. My amazing memory saves the day again! Now I can sleep in tomorrow" | joy
"I also hate the dentist and that's were I am heading to. I wish he was on strike lol #brokentooth" | fear
"I have a package at the post office. Can't think what could be in it. I don't remember internet shopping while drinking." | surprise
"Feeling left out... I guess I always have my boyfriend." | sadness
"People who say you broke their computer because you figured out what was wrong should die in a house fire." | anger
"The fact that @KimKardashian wedding makes headlines and provides that pathetic excuse of a celebrity with more money makes me sick" | disgust

3.3 Baselines

We implemented two simple baseline approaches to allow us to better evaluate the performance of our emotion classifier. The first baseline we tested was random guessing, where each tweet is assigned a random number between 1 and 6, and each number corresponds to one of the six Ekman emotions. This approach had an average 10-fold cross validation score of 0.1667 over 20 trials.

In addition, we implemented an affect lexicon approach by counting the words in each tweet corresponding to each of the six emotions and labeling the tweet with the emotion associated with the greatest number of words. This approach had a 10-fold cross validation score of 0.275, which slightly outperforms the random guessing approach. However, even though every tweet in the training set was labeled with one of the six Ekman emotions, 50.31 % of the tweets in the training set did not contain any emotion words. For example, the tweet "One more week and I'm officially done with my first semester of college." clearly expresses joy, but since none of the joy words are contained in this tweet, it would be classified as neutral.

The poor performance of our baseline approaches indicates that a supervised learning approach is necessary in order to develop a classifier with acceptable accuracy scores.
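A minimal sketch of this lexicon-counting baseline is shown below; the three-word lexicon is a hypothetical stand-in for the full EmoLex, and ties between emotions are broken arbitrarily.

```python
from collections import Counter

# Hypothetical stand-in for EmoLex: word -> associated Ekman emotions.
EMOLEX = {
    "abandoned": {"fear", "sadness"},
    "amuse": {"joy"},
    "sick": {"disgust"},
}

def lexicon_label(tweet: str) -> str:
    """Label a tweet with its most frequent emotion, or neutral if the
    tweet contains no emotion words at all."""
    counts = Counter()
    for word in tweet.lower().split():
        for emotion in EMOLEX.get(word, ()):
            counts[emotion] += 1
    return counts.most_common(1)[0][0] if counts else "neutral"

print(lexicon_label("that movie made me sick"))                # -> disgust
print(lexicon_label("one more week and i'm officially done"))  # -> neutral
```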

3.4 Methodology

This section describes the implementation of our classifier using a supervised learning approach, including feature selection and preprocessing of the training corpus.

3.4.1 Feature Selection

Since tweets are limited to 140 characters, the main idea of each tweet can usually be captured in just a few words. Therefore, we chose to use simple features, such as the presence or absence of unigrams and bigrams that appeared more than once in the training corpus. Bigrams were included to account for negation and basic sentence patterns that can affect the meaning of a tweet. For example, the phrase "not happy" conveys the opposite emotion from "happy", even though both phrases contain exactly one word that is associated with the joy emotion. We also chose to include features corresponding to the number of words associated with each of the Ekman emotions, as described in the second baseline above, since Mohammad found that including affect lexicon features improved classifier performance across different domains [31].

27 3.4.2 Data Preparation

All words in the NRC lexicon and all unigrams and bigrams in all tweets were converted to lowercase and stemmed with NLTK's Snowball stemmer. This ensures that two English words with the same base word but different tenses or forms are treated as the same word. Stemmers work by removing suffixes to extract the base word [37]. For example, the words "organized" and "organizing" would both be converted to "organize". Punctuation marks are also treated as separate words, because some punctuation marks can be used to emphasize an emotion. For instance, exclamation points are often used when expressing joy, and question marks are used when expressing surprise. All other special characters are removed from tweets. Table 3.2 shows an example of a tweet before and after it has been processed.
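A sketch of this preprocessing step follows; the exact tokenization rule is an assumption, since the thesis does not spell out how tokens are split.

```python
import re
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")

def preprocess(tweet: str) -> str:
    tweet = tweet.lower()
    # Keep word tokens and emotion-bearing punctuation ("!" for joy,
    # "?" for surprise) as separate tokens; drop other special characters.
    tokens = re.findall(r"[a-z']+|[!?.,]", tweet)
    return " ".join(stemmer.stem(tok) for tok in tokens)

# Different word forms reduce to a shared stem, and "!!" becomes "! !".
print(preprocess("Organized and organizing!!"))
```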

Original Tweet "I will NOT go to he’d until I have my eyebrows threaded and my Mani/ Pedi... As a matter of fact I will be sleeping on the chair!!" Processed Tweet "i will not go to he’d until i have my eyebrow thread and my mani pedi . . . as a matter of fact i will be sleep on the chair ! !"

Table 3.2: Tweet Processing Example

3.4.3 Implementation Details

Features are stored in an $m \times n$ matrix $X$, where each row represents a sample and each column represents a feature. $X[i, j]$ corresponds to the value of feature $j$ for sample $i$. The labels are stored in an $m \times 1$ matrix $y$, so $y[i]$ corresponds to the label for sample $i$. To populate the feature vectors, all unique unigrams and bigrams in the training corpus were assigned an index $j$ between 0 and $n - 1$. At the prediction stage, all tweets are stemmed and separated into unigrams and bigrams. If n-gram $j$ is present in tweet $i$, $X[i, j]$ is set to 1 to indicate the presence of that n-gram. Because the training set contains over 35,000 unique stemmed unigrams and bigrams, and the vast majority of them will not appear in a particular tweet, we use sparse matrices for space efficiency. Six additional features were added to represent the counts of words from each emotion category in EmoLex.

Since the training set did not contain any examples of neutral tweets that express no emotion, tweets expressing no emotion would be erroneously classified. Therefore, we also used Pattern to calculate the sentiment polarity of each tweet. Pattern is a web mining Python module that includes sentiment analysis and natural language processing tools. Pattern utilizes SentiWordNet, a corpus of English words annotated with positivity, negativity, and objectivity scores for each word, to calculate polarity scores. Pattern groups each tweet into varying sizes of n-grams and averages the positivity, negativity, and objectivity scores for each group of words to calculate a final polarity and subjectivity score. Adjectives and adverbs can also amplify or negate the polarity score of a tweet [10]. Pattern's sentiment module reports a sentiment polarity ranging between -1 and 1, and a subjectivity score ranging from 0 to 1 for each tweet [10]. A polarity score of -1 means that the tweet is totally negative, 0 represents a neutral tweet, and 1 represents a totally positive tweet. We reclassified any tweet with a sentiment polarity score of 0.0 as neutral.

We then tested various multiclass classification algorithms implemented in scikit-learn to determine the algorithm that would produce the best accuracy for our training set. The algorithms we tested included support vector machines using the one-vs-rest and one-vs-one strategies, logistic regression, and random forests [33].
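The sketch below assembles such a feature matrix, with scikit-learn's CountVectorizer standing in for the manual n-gram index described above; the three preprocessed tweets and the EmoLex count columns are placeholder values.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import CountVectorizer

tweets = ["i am so happy today !",       # already lowercased and stemmed
          "i am scared of tomorrow",
          "so happy , so happy !"]

# Binary unigram/bigram indicators, kept only if seen more than once
# (min_df=2); the whitespace token pattern keeps punctuation tokens.
vectorizer = CountVectorizer(ngram_range=(1, 2), binary=True, min_df=2,
                             token_pattern=r"\S+")
X_ngrams = vectorizer.fit_transform(tweets)   # sparse matrix

# Hypothetical per-tweet EmoLex word counts for the six Ekman emotions.
emolex_counts = np.array([[2, 0, 0, 0, 0, 0],
                          [0, 1, 0, 0, 0, 0],
                          [2, 0, 0, 0, 0, 0]])

X = hstack([X_ngrams, csr_matrix(emolex_counts)])  # final sparse features
print(X.shape)
```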

3.5 Evaluation Metrics

Since no test set was provided, we used scikit-learn's built-in cross_val_predict function to evaluate the performance of our classifiers. cross_val_predict works by splitting the training set into $n$ equal-sized groups. For each group $i$, the other $n - 1$ groups are used as training data and predictions are made for group $i$, treating group $i$ as the test set. This process is repeated for all $n$ groups until every sample has been included in the test set exactly once. The cross_val_predict function returns the predicted label for each element from the fold where that element was part of the test set [9]. We used the output from cross_val_predict to compute precision, recall, and F1 scores to evaluate each of the four models we tested.

For a binary classification problem, precision represents the percentage of samples predicted as positive that are actually positive. Recall represents the percentage of actual positive samples that were predicted as positive by the classifier. The F1 score is the harmonic mean of precision and recall, and is often the main metric used to evaluate a classifier's performance, since it is possible to design naive classifiers with artificially high precision or recall scores. For example, a classifier that predicts every sample as positive would have a 100 percent recall score. The equations for calculating precision, recall, and F1 scores are listed in equations 3.4 to 3.6. $tp$, $fp$, and $fn$ represent true positives (sample is positive and was predicted as positive), false positives (sample is not positive, but was predicted as positive), and false negatives (sample is positive, but was predicted as negative), respectively.

$$Precision = \frac{tp}{tp + fp} \qquad (3.4)$$

$$Recall = \frac{tp}{tp + fn} \qquad (3.5)$$

$$F1 = 2 \cdot \frac{precision \cdot recall}{precision + recall} \qquad (3.6)$$

We can extend these evaluation metrics to multiclass problems by calculating each metric individually for every class and then taking the weighted average. For the "joy" class, all samples labeled "joy" are counted as positive while all other samples are counted as negative, and likewise for each of the other classes. The binary classification formulas for precision, recall, and F1 scores can then be directly applied.
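The following sketch shows this evaluation loop end to end; the synthetic six-class data stands in for our tweet features and labels.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_fscore_support
from sklearn.model_selection import cross_val_predict

X, y = make_classification(n_samples=600, n_features=50, n_informative=20,
                           n_classes=6, random_state=0)

# Out-of-fold predictions: every sample is predicted exactly once,
# by a model that never saw it during training.
y_pred = cross_val_predict(LogisticRegression(max_iter=1000), X, y, cv=10)

# Weighted averages over the six classes (equations 3.4 to 3.6 per class).
p, r, f1, _ = precision_recall_fscore_support(y, y_pred, average="weighted")
print(f"P={p:.3f}  R={r:.3f}  F1={f1:.3f}")
```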

3.6 Results

Table 3.3 shows the precision, recall, and F1 results for each of the four models we tested.

Table 3.3: Model Comparison

Model               | Precision | Recall | F1
One-vs-rest (SVM)   | 0.590     | 0.597  | 0.592
One-vs-one (SVM)    | 0.577     | 0.585  | 0.579
Logistic Regression | 0.606     | 0.614  | 0.605
Random Forest       | 0.541     | 0.524  | 0.478

All four supervised machine learning models significantly outperformed our baselines of random guessing and using only an affect lexicon. The logistic regression model performed the best on all three evaluation metrics, so we use this model for all classification problems throughout this thesis. Table 3.4 shows the precision, recall, and F1 scores for each emotion class for our logistic regression model.

Table 3.4: Logistic Regression Accuracy Metrics

Emotion      | Number of Tweets | Precision | Recall | F1
joy          | 8,240            | 0.670     | 0.791  | 0.726
fear         | 2,816            | 0.664     | 0.560  | 0.608
anger        | 1,555            | 0.493     | 0.344  | 0.405
surprise     | 3,848            | 0.584     | 0.553  | 0.568
sadness      | 3,829            | 0.515     | 0.516  | 0.515
disgust      | 761              | 0.508     | 0.250  | 0.335
All Emotions | 21,049           | 0.606     | 0.614  | 0.605

The joy emotion had the highest F1 score and the disgust emotion had the lowest. This observation can be explained by the fact that joy is the only positive Ekman emotion, while it is more difficult to distinguish between the other Ekman emotions. In addition, joy was the most common emotion in the training set, while disgust was the least common. Therefore, obtaining more training examples could help improve the classifier's accuracy.

3.7 Discussion

We looked at a sample of tweets from the 2016 presidential debates to subjectively evaluate the classifier’s performance on unseen data. In general, the classifier seems to work well since Twitter’s character limit usually prevents users from expressing multiple conflicting emotions in a single tweet. Table 3.5 shows some example tweets where the classifier predicted the correct emotion. Many of these tweets contain words or phrases that are strongly associated with an emotion, such as "dangerous" for fear, and "shut up" for the anger emotion.

Table 3.5: Classification Examples

Tweet | Emotion | Polarity
"RT @NerdyWonka: Hilary is calm, measured, has the facts on her side. Trump is turning red and frothing at the mouth like a twitter troll." | disgust | 0.15
"RT @HillaryClinton: RT this if you're proud to be standing with Hillary tonight. #debatenight https://t.co/91tBmKxVMs" | joy | 0.8
"@realDonaldTrump shut up and let her speak you 3 year old brat" | anger | 0.1
"RT @HeidiL_RN: policy created ISIS. She is dangerous AF. Plus she's a huge LIAR #debatenight https://t.co/NdYJgBL8R4" | fear | -0.1
"RT @NubianAwakening: Hillary invited Marc Cuban to the debates as we all know; unfortunately not everyone could make it. RIP #SethRich #deb" | sadness | 0.25
"#Debates #Debates2016 https://t.co/ATSa1t5Pfj" | none | 0.0
"RT @KellyannePolls: #Polls showing @realDonaldTrump surging, @hillaryclinton #slipping, have HER camp on defense/lowering expectations, goi" | surprise | -0.1

However, our classifier does not perform as well on certain types of tweets. Table 3.6 shows some examples of tweets that have been misclassified. Relying on Pattern to identify neutral tweets introduces more errors because sentiment polarity algorithms are not completely accurate either. The first tweet clearly expresses joy and the second tweet expresses disgust, but our classifier predicted them as being neutral because the Pattern sentiment analysis algorithm assigned them polarities of 0.0.

Table 3.6: Examples of Classification Errors

Tweet | Emotion | Polarity
"@HillaryClinton HILLARY HAS GOT TRUMP SOOO OUTCLASSED!!!!" | none | 0.0
"RT @realDonaldTrump: Hillary is the most corrupt person to ever run for the presidency of the United States. #DrainTheSwamp https://t.co/xA" | none | 0.0
"Three key questions for Trump and Clinton ahead of the first debate #Debates2016 https://t.co/YiCs6lwTTq https://t.co/ZSVk1gAhNU" | joy | 0.125
"@HillaryClinton Honestly, you can't win any debate having lied so often to the world." | joy | 0.1

The third tweet is labeled with "joy", but it actually has a neutral sentiment. Since "joy" was the most common emotion in our training set, many tweets that do not contain any emotional words or any of the unigrams or bigrams in the training set are labeled with "joy" by default. This example demonstrates a case where Pattern fails to identify some tweets as neutral. In the future, creating an expanded corpus that also includes neutral tweets could mitigate these types of mistakes, since we would no longer have to rely on external libraries that are not 100 percent accurate themselves. The final tweet is labeled with "joy", even though it is expressing a negative opinion. This is probably because the tweet includes the word "win", which is associated with joy. Even though the word "can't" negates the meaning of "win", the bigram "can't win" probably was not present in our training set. Splitting contractions into their base words, such as converting "can't" to "can not", could help to resolve this issue. In addition, the word "lied" has a negative connotation, but it also does not appear next to "win", so the bigram features would also fail to capture the negative emotion. Therefore, using more advanced features that take sentence structure into account could also lead to more accurate results in future studies.

Chapter 4

Emotion Analysis of Presidential Election Tweets

The 2016 United States presidential election was the most tweeted election in history. Over 1 billion tweets were posted after the primary debates began in August 2015, and over 75 million tweets were posted on Election Day alone, more than double the number posted on the previous election day in 2012 [8] [18]. The presidential candidates themselves were also very active on social media, with Hillary Clinton's tweet telling Donald Trump to "Delete your account" becoming the most retweeted tweet of the entire election cycle. In this chapter, we will explore whether Twitter sentiment during the election cycle could have been leveraged to predict future returns for key S&P 500 industries.

4.1 Datasets

We obtained tweets from George Washington University's 2016 presidential election dataset published on Harvard's Dataverse repository [26]. This dataset contains approximately 280 million tweet ids from the 2016 presidential election cycle, spanning July 13, 2016 to November 10, 2016. The tweets are grouped into several collections, including the three presidential debates, the Democratic and Republican conventions, and election day itself. S&P 500 daily adjusted closing prices for all sectors and industries were obtained from Yahoo Finance.

4.1.1 Data Preparation

We used the Twarc Python library to hydrate the lists of tweet ids for the collections corresponding to election day and each of the three presidential debates. Twarc makes calls to the Twitter API to retrieve each tweet’s text and metadata, such as the time and date that it was posted, the user who posted it, and the number of times it was retweeted [52]. Deleted tweets or tweet ids associated with deleted accounts were dropped. We were able to successfully retrieve 91.02 % of the 14 million tweets contained in these four collections.
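A minimal hydration sketch with the twarc library (v1 API) is shown below; the credential strings and the id file are placeholders, and the field names follow the Twitter API payloads that twarc returns.

```python
# Sketch of the hydration step; credentials and the id file are
# placeholders, not the thesis's actual configuration.
from twarc import Twarc

t = Twarc(consumer_key="...", consumer_secret="...",
          access_token="...", access_token_secret="...")

with open("election_day_tweet_ids.txt") as ids:
    for tweet in t.hydrate(ids):  # deleted tweets are simply not returned
        text = tweet.get("full_text", tweet.get("text"))
        print(tweet["created_at"], text)
```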

We extracted the timestamp and tweet text from each of the hydrated tweets and applied our emotion classifier described in chapter 3 to label each tweet with an Ekman emotion. We again used the Pattern module to label tweets with a sentiment polarity score of 0.0 as neutral.

Since many Twitter users have opposing opinions towards Clinton and Trump, we also categorize each tweet as being about Clinton, Trump, or both candidates. This allows us to identify differences in emotion distribution trends between the two candidates across key events during the election. To identify tweets about Donald Trump, we selected tweets that contained at least one of the following keywords or hashtags: "@realdonaldtrump", "trump", "#trump", "donald". Similarly, tweets containing at least one of the following words or hashtags were categorized as being about Hillary Clinton: "clinton", "hillary", "#clinton", "#hillary", "@hillaryclinton".
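A sketch of this tagging rule follows; matching on whitespace-separated tokens is an assumption, since the thesis only states that a tweet must contain one of the keywords.

```python
TRUMP_TERMS = {"@realdonaldtrump", "trump", "#trump", "donald"}
CLINTON_TERMS = {"clinton", "hillary", "#clinton", "#hillary",
                 "@hillaryclinton"}

def candidates_mentioned(tweet: str) -> set:
    """Return the subset of {"trump", "clinton"} a tweet is about;
    a tweet mentioning both candidates is tagged with both."""
    tokens = set(tweet.lower().split())
    tags = set()
    if tokens & TRUMP_TERMS:
        tags.add("trump")
    if tokens & CLINTON_TERMS:
        tags.add("clinton")
    return tags

print(candidates_mentioned("Hillary just crushed Trump in this debate"))
# -> {'clinton', 'trump'}
```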

4.2 Emotion Distributions on Election Day

This section highlights some insights revealed by the emotion distributions of tweets from election day on November 8, 2016.

36 4.2.1 Election Day Key Events

Prior to the election, Hillary Clinton was predicted to win based on poll results and her stronger performance in the presidential debates. However, there were several turning points during election night. According to Leip's 2016 election night events timeline, all polls closed at midnight on November 9, 2016. This was a turning point in the election, as many key swing states (such as Florida and North Carolina) had been called for Trump in the previous hour, so it became evident at this point that Trump was very likely to win the election. At this point, Trump had 244 out of the 270 electoral votes needed, and many of the remaining states were traditionally red states [25]. Afterwards, at 2:43 AM on November 9, 2016, NBC reported that Hillary Clinton had called Donald Trump to officially concede [38].

4.2.2 Comparison with Polarity-Based Sentiment Analysis

As a baseline, we will first use Pattern’s sentiment analysis algorithm, which returns a sentiment polarity between -1 and 1 [10]. Figure 4-1 shows the average sentiment per minute during election day on November 8, 2016.

Figure 4-1: Average Sentiment during the 2016 Presidential Election

The first dotted line on this figure indicates the closing of the polls and the second dotted line indicates Hillary Clinton's concession. Clinton and Trump had similar sentiment trends over the course of election night. The average sentiment polarity for both candidates remained fairly stable at around 0.1 until the polls closed. The average sentiment then dropped for both candidates after the polls closed and started to stabilize after Clinton's concession.

Compared to tweets about Trump, the average sentiment for Clinton dropped more after the polls closed and remained more volatile after her concession. Even though we can identify differences in sentiment, it is still difficult to draw conclusions on how the public’s attitude towards Clinton and Trump evolved throughout the election, since a wide variety of emotions are associated with a negative sentiment.

In contrast, figure 4-2 shows how the emotion distributions shifted throughout the night in ten-minute intervals. After the polls closed and it became clear that Trump had accumulated most of the 270 electoral votes required, anger quickly became the predominant emotion in tweets about Clinton. After Clinton's concession to Trump, the predominant emotion for Clinton changed to sadness.

38 Figure 4-2: 2016 Election Day Emotion Distributions

(a) Tweets about Clinton

(b) Tweets about Trump

Interestingly, the emotion distributions after these key events did not appear to fluctuate as much for tweets about Trump, even though one would expect the percentage of "joy" tweets to increase for Trump after Clinton's concession. One possible explanation is that the demographics of Twitter users are not totally representative of the average US voter, since social media appeals more to young users, who have historically been more likely to support the Democratic party [15].

4.2.3 Using Volume to Identify Events

Next, we analyzed tweets from the first presidential debate. George Washington University's dataset includes tweets from a 24-hour period starting on the morning of each presidential debate and ending the next morning, after the debate had concluded. In figure 4-3, we plot the number of tweets aggregated over each ten-minute window throughout this 24-hour period. As expected, the number of tweets spikes dramatically during the debate, which occurred from 9:00 PM to 10:30 PM Eastern time (marked by the dotted lines). We also see that the relative frequencies of each Ekman emotion remain relatively stable before and after the debate, but fluctuate greatly during the debate. Thus, a combination of Twitter volume and changes in sentiment can potentially be used to identify unusual events that occur during a given time period. This topic will be explored further in chapter 5 in the context of financial tweets. Since major current events often lead to volatility in the stock market, we will now investigate the impact of presidential debates on future stock returns.

40 Figure 4-3: First Presidential Debate

(a) First Presidential Debate Tweet Volume

(b) First Presidential Debate Emotions

4.3 Can Presidential Debates Predict Market Returns?

Oehler et al. previously found that the stock returns for related sectors and industries following a presidential election were highly correlated with the new president's policies [35]. In this section, we aim to determine whether this observation also holds true after presidential debates. We will analyze the predicted impact of Clinton's and Trump's proposed policies on a subset of S&P 500 industries and compare the stock market reaction immediately following each debate.

4.3.1 Summary of Candidate Policies

Here we will briefly summarize Clinton and Trump’s contrasting policies relating to a subset of S&P 500 sectors and industries.

∙ Pharmaceuticals and Biotechnology: Clinton proposed tighter regulations on drugmakers and wanted to set monthly price limits on drugs, both of which would lead to a loss of profits for pharmaceutical companies. Trump also wanted to make drugs more affordable, but was not as detailed about his plans. Therefore, the pharmaceuticals industry was predicted to perform better under a Trump administration [14].

∙ Financials: Clinton proposed tighter regulations on banks, so the financials sector was also predicted to perform better under Trump [5].

∙ Energy: Trump planned to lift restrictions on oil and gas companies, and increase fossil fuel production to increase job growth opportunities. Clinton’s policies focused on renewable energy. Since the majority of stocks in the Energy sector are oil and gas companies, Trump’s election was predicted to benefit the Energy sector [4].

∙ Defense: The Defense industry would benefit from a Trump presidency due to his plans for increased defense spending [5].

42 ∙ Technology: The Technology sector would perform better under Clinton due to her support for highly skilled immigration and plans to increase spending on STEM education [47].

∙ Healthcare Facilities: Trump wanted to repeal and replace the Affordable Care Act, which would create a lot of uncertainty for hospitals. Therefore, healthcare facilities and hospitals would benefit from a Clinton presidency [23].

4.3.2 S&P 500 Returns after Election Day

Table 4.1 shows the closing prices and returns for each of these sectors on November 9, 2016, the day after the election. As predicted, pharmaceuticals, financials, defense, and energy made large gains after President Trump was elected. Healthcare facilities fell significantly while the technology sector fell slightly, confirming Oehler's observations about the impact of presidential elections on specific sectors.

Table 4.1: S&P 500 Sectors before and after Election Day

Sector/Industry November 8 November 9 Return

Pharmaceuticals and Biotech 1,141.25 1,205.13 5.597 %

Financials 331.75 345.25 4.069 %

Aerospace and Defense 796.68 827.57 3.877 %

Energy 510.49 518.23 1.516 %

Technology 799.59 797.67 -0.240 %

Healthcare Facilities 419.72 370.30 -11.775 %

S&P 500 2,139.56 2,163.26 1.107 %

To determine whether this pattern also holds true after presidential debates, we will use our emotion classifier to determine a winner for each of the presidential debates.

43 4.3.3 Who won the Presidential Debates?

We will now analyze the changes in emotion distributions to predict a winner for each of the three presidential debates. Figure 4-4 shows the emotion distributions before and after the first presidential debate (marked by the black dotted lines) for both presidential candidates.

Figure 4-4: Emotion Distributions during the First Presidential Debate

(a) Tweets about Clinton

(b) Tweets about Trump

We can see that the percentage of joy tweets for Clinton increased after the debate, while the percentage for Trump decreased. Thus, we will use the change in the percentage of joy tweets to estimate how each debate affected public opinion towards both candidates. Tables 4.2 and 4.3 display the percentage change in tweets expressing joy before and after each presidential debate for Clinton and Trump, respectively.

The percentage of joy tweets increased for Clinton after every debate and decreased for Trump after every debate. Therefore, based on our emotion distributions, we can conclude that Clinton's performances in all three presidential debates were better received than Trump's.

Table 4.2: Clinton: Change in joy tweets before and after debates

Before After Change

First Debate 21.467 % 23.411 % 1.944 %

Second Debate 13.504 % 18.202 % 4.698 %

Third Debate 15.304 % 21.986 % 6.682 %

Table 4.3: Trump: Change in joy tweets before and after debates

Before After Change

First Debate 17.136 % 15.937 % -1.199 %

Second Debate 17.013 % 15.839 % -1.174 %

Third Debate 16.211 % 14.845 % -1.366 %

These results are supported by the polls that Morning Consult conducted after the conclusion of each debate (Table 4.4). All three polls showed that a higher percentage of participants believed that Clinton was the winner of each debate [12] [36] [13].

45 Table 4.4: Morning Consult Poll Results

Clinton Won Trump Won

First Debate 49 % 26 %

Second Debate 42 % 28 %

Third Debate 43 % 26 %

4.3.4 S&P 500 Reactions to Presidential Debates

Now we will evaluate whether there is any correlation between Clinton's debate wins and stock returns for industries related to her major policies. Table 4.5 shows S&P 500 returns following the first presidential debate. Technology stocks gained 1.15 % and energy stocks fell in response to Clinton's win, as we predicted in the section above. The other four industries also made small gains.

Table 4.5: S&P 500 Industries Before and After First Presidential Debate

Sector/Industry September 26 September 27 Return

Technology 790.18 799.26 1.15 %

Financials 316.87 319.60 0.862 %

Pharmaceuticals and Biotech 1,224.38 1,232.88 0.694 %

Aerospace and Defense 783.44 786.82 0.431 %

Healthcare Facilities 402.76 404.26 0.372 %

Energy 495.00 492.72 -0.461 %

S&P 500 2,146.10 2,159.93 0.644 %

However, the industry-specific returns following the second debate do not seem to be correlated with Clinton's policies, as energy stocks rose significantly after the second debate (Table 4.6). Nevertheless, the overall S&P 500 index still rallied following the first and second presidential debates, which is another predicted result based on the similarity of Clinton's policies to those of the incumbent president, Barack Obama; Prechter had previously found a positive relationship between an incumbent's vote margin and the percentage gain in the stock market during the three years prior to the election [39].

Table 4.6: S&P 500 Industries Before and After Second Presidential Debate

Sector/Industry October 7 October 10 Return

Healthcare Facilities 399.31 407.94 2.161 %

Energy 520.28 528.11 1.505 %

Technology 800.80 806.31 0.688 %

Financials 325.70 327.33 0.500 %

Aerospace and Defense 774.73 777.31 0.333 %

Pharmaceuticals and Biotech 1,217.21 1,219.32 0.173 %

S&P 500 2,153.74 2,163.66 0.461 %

Likewise, after the third debate (Table 4.7), pharmaceuticals gained, technology stocks fell, and the S&P 500 index also fell, contradicting what Clinton's proposed policies would predict. However, the third presidential debate occurred around the same time as many earnings announcements, which could explain some of the unexpected returns [24].

47 Table 4.7: S&P 500 Industries before and after Third Presidential Debate

Sector/Industry October 19 October 20 Return

Pharmaceuticals and Biotech 1,170.89 1,175.90 0.428 %

Healthcare Facilities 430.72 432.08 0.316 %

Financials 326.25 326.19 -0.018 %

Energy 520.89 520.41 -0.092 %

Aerospace and Defense 775.66 774.32 -0.173 %

Technology 799.21 797.35 -0.233 %

S&P 500 2,144.29 2,141.34 -0.138 %

4.4 Discussion

Even though we were unable to identify a clear pattern between presidential debate winners and stock returns for related S&P 500 industries and sectors, we have still shown that categorizing tweets into emotions is more effective than a polarity-based approach at highlighting differences in public opinion towards presidential candidates. Oehler's study also concluded that abnormal returns after elections are probably caused by initial uncertainty towards the new president's policies [35]. Even though Clinton performed better in all three debates, her policies were still just theoretical at the time. Other economic factors, such as earnings announcements and the state of the global economy, may also overshadow the impact of presidential debates on the stock market.

Furthermore, participants who believed that Clinton won the debates may have still disagreed with some or all of her policies. The first poll conducted by Morning Consult showed that even 12 % of Trump supporters believed that Clinton won the debate [12]. Thus, in addition to categorizing tweets by the presidential candidates mentioned, it would also be interesting to analyze the sentiment of tweets about specific policies or key election issues in the future.

Chapter 5

Emotion Analysis of Financial Tweets

In 2012, Twitter introduced cashtags, which are stock ticker symbols prefixed with a $ symbol that behave similarly to hashtags. Cashtags can be used to search for financial news about publicly traded companies. In this chapter, we will explore the relationships between the sentiment and volume of tweets tagged with NASDAQ-100 cashtags and future returns for NASDAQ-100 companies.

5.1 Datasets

Tweets were obtained from Enrique Rivera's NASDAQ 100 Tweets dataset published on data.world. This dataset contains approximately 1 million tweets mentioning any NASDAQ-100 cashtag between March 10, 2016 and June 15, 2016 [41]. However, most ticker symbols were missing data at the beginning of this period, so we only used tweets starting from March 28, 2016. This dataset also contains additional metadata for each of the 100 cashtags, such as the most retweeted tweets and the top 100 Twitter users sorted by number of followers.

We also used Yahoo Finance to obtain daily adjusted closing prices during this three-month period. Millisecond trade data was obtained from the Wharton Research Data Services (WRDS) TAQ database. Earnings announcement dates and estimates were obtained from Zacks Investment Research.

5.2 Correlation Between Emotions and Stock Prices

Previous work by Zhang suggested that emotional outbursts of any type on Twitter had weak negative correlations with future Dow Jones, S&P 500, and NASDAQ index prices [56]. We want to investigate whether focusing only on financial tweets tagged with cashtags, instead of using a sample of all tweets as Zhang did, would produce a stronger correlation with future stock market performance. First, we calculated the distribution of Ekman emotions on each day over all cashtags in our dataset using the emotion classifier we described in Chapter 3. Then, we calculated the Pearson correlation coefficients between the percentages of each Ekman emotion and the NASDAQ-100 return on the next day. The Pearson correlation coefficient (Equation 5.1) is a measure of the strength of the linear relationship between two variables [34]. $r$ can range between -1 and 1, where 1 represents a perfect positive linear correlation, 0 represents no linear correlation at all, and -1 represents a perfect negative linear correlation. We used the percentage of each emotion on day $t$ as $x$ and the return corresponding to the price change from day $t$ to day $t+1$ as $y$.

$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}} \qquad (5.1)$$

Since anyone can make a Twitter account and post random tweets containing cashtags, we also wanted to determine whether tweets from more reliable sources were more predictive of future returns. Thus, we also collected tweets only from the top 100 Twitter users sorted by number of followers and calculated the correlation coefficients again for this subset of tweets for all NASDAQ-100 stocks. Table 5.1 displays the average correlation between the emotion percentages and each stock’s return on the following day, for both all tweets and only tweets written by the top 100 users. Since surprise can be either a positive or negative emotion, depending on the type of news, we also calculated separate correlation coefficients for "surprise" tweets with a positive polarity score and for "surprise" tweets with a negative polarity score. Bolded values are statistically significant at $p < 0.10$.
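As a concrete illustration, the coefficients in Table 5.1 can be computed per stock along the following lines and then averaged across stocks. This is a minimal sketch rather than the original pipeline: it assumes the daily emotion percentages and the daily return series have already been assembled as aligned pandas objects, and the names are illustrative.

```python
import pandas as pd
from scipy.stats import pearsonr

def emotion_return_correlations(emotions: pd.DataFrame,
                                returns: pd.Series) -> pd.DataFrame:
    """Pearson r between each emotion's daily percentage and the next-day return.

    emotions: one row per day, one column per emotion label, holding the
              percentage of that day's tweets expressing the emotion.
    returns:  daily returns on the same date index, where returns[t] is the
              price change from day t-1 to day t.
    """
    next_day = returns.shift(-1)  # y: the return from day t to day t+1
    rows = []
    for emotion in emotions.columns:
        pair = pd.concat([emotions[emotion], next_day], axis=1).dropna()
        r, p_value = pearsonr(pair.iloc[:, 0], pair.iloc[:, 1])
        rows.append({"emotion": emotion, "r": r, "p_value": p_value})
    return pd.DataFrame(rows)
```

The same-day coefficients in Table 5.2 follow from the same function with the `shift(-1)` removed.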

We found that none of the original Ekman emotions had statistically significant correlations with next-day returns for either of the two groups, with all correlation coefficients being under 20 percent. However, tweets expressing positive surprise and negative surprise from the top 100 users showed stronger positive and negative correlations, respectively. This could be because uncertainty usually leads to volatility in the stock market, as shown during the aftermath of the 2016 presidential election. Therefore, using a combination of sentiment polarity and finer-grained emotion classification can reveal more information about future stock returns than either of these approaches alone.

Table 5.1: Correlation between average emotion percentages and next-day stock returns

Emotion Top 100 Users All Users

Joy 0.0763 0.0963

Fear -0.0864 0.0109

Sadness -0.0094 0.0050

Disgust -0.1127 -0.0779

Anger -0.1389 -0.0971

Surprise 0.1999 0.0395

Surprise (positive) 0.2780 0.0914

Surprise (negative) -0.2383 -0.1978

No Emotion 0.1641 0.0620

We then calculated the correlations between the current day’s emotion percentages and the current day’s returns to determine whether Twitter users are actually reacting to changes in stock prices instead. Table 5.2 shows the average correlation between each stock’s emotions and the return from the same day. Interestingly, the top 100 users did not have significant differences in the correlations between same-day and next-day returns. In contrast, the general public had a much stronger positive correlation between returns and tweets expressing joy, and a much stronger negative correlation for tweets expressing anger. Both of these correlation coefficients were statistically significant at $p < 0.05$. These results suggest that the general public is more reactive to stock market prices, while the top users have more neutral attitudes. This could be explained by the fact that many of the top users by follower count are professional news sources, such as Reuters, the Wall Street Journal, and Business Insider. Thus, most tweets by these accounts would focus on reporting news about companies in an unbiased manner. In the future, it may be interesting to analyze sentiment in tweets posted by professional investors to determine whether it is possible to leverage expert opinions to predict changes in stock prices.

Table 5.2: Correlation between average emotion percentages and same-day stock returns

Emotion Top 100 Users All Users

Joy 0.1535 0.3039

Fear -0.0073 -0.1075

Sadness -0.0996 -0.0814

Disgust -0.1383 -0.0083

Anger 0.0778 -0.2573

Surprise 0.1923 0.0474

Surprise (positive) 0.2110 0.3008

Surprise (negative) -0.1619 -0.1938

No Emotion 0.2128 0.1148

Excess noise in the Twitter dataset is another factor that could explain the low correlation values for emotions other than surprise. Zhang’s study was conducted in 2009, when there were only 18 million Twitter users, compared to over 300 million today [53]. Table 5.3 shows several examples of noise in the Twitter data. Many tweets contain multiple cashtags, even when not all of the companies are actually discussed in the tweet.

Table 5.3: Noise in $AAPL Tweets

Tweet | Emotion | Polarity

Bad News For Twitter Longs https://t.co/yGVdirvJUD $AAPL #APPLE $DIS $GOOG $GOOGL $SQ $TWTR | sadness | -0.7

Fitbit Management Upbeat on Expected New Product, Says Raymond James - Tech Trader Daily - $FIT $GRMN $AAPL https://t.co/aNFotcme9b | joy | 0.109

Florida to face flooding, dangerous seas from Tropical Storm Colin #TRUMP $TWTR $AAPL #wlst https://t.co/vYHi5qLVya https://t.co/A5f0SckYxn | fear | -0.6

RT @CamilleHurn: Classic Marxist economics about how a servile population will submit to any old crap $AAPL https://t.co/Ur5kShoS9V | disgust | -0.178

Even though all of these tweets contain the $AAPL cashtag and are labeled with the correct emotion, none of the tweets are actually related to Apple. The first and second tweets express emotions towards Twitter and Fitbit, respectively, while the last two tweets do not mention any NASDAQ-100 company at all. The prevalence of these types of tweets can skew the emotion distributions and mask patterns and correlations that may be present. Nevertheless, many previous studies have shown that Twitter volume has a greater impact on future stock prices than sentiment does, so we will explore this relationship in the next section.

5.3 Using Volume to Identify Events

In the previous chapter, we saw that Twitter volume spiked while a presidential debate was ongoing. We use a similar approach here to determine whether there is a correlation between tweet volume and stock returns. Spikes in Twitter volume can indicate that a significant event has occurred, such as an earnings announcement, acquisition, or new product release. The stock market response to these events may be either positive or negative, depending on the nature of the event.

For instance, figure 5-1a shows the daily Twitter volume for the $MSFT cashtag and the daily returns for the Microsoft stock. There are two main spikes in volume during this three-month period. The first spike occurred on April 21, 2016, which was the date of Microsoft’s first quarter earnings announcement. Microsoft missed earnings estimates by 2 cents per share, causing shares to fall by up to 5 percent in after-hours trading [21]. The second spike occurred on June 13, 2016, when Microsoft announced its planned acquisition of LinkedIn that morning [44]. While LinkedIn’s share price increased by 47 percent, Microsoft’s stock price fell by 3.2 percent and remained relatively flat afterwards. Experts suggest that this negative response could have resulted from Microsoft’s poor track record with prior large acquisitions, including Skype and Nokia, which were not as successful as analysts had hoped [49].

On the other hand, figure 5-1b displays the daily Twitter volume and returns for Facebook. In contrast to Microsoft, the response to Facebook’s first quarter earnings announcement was overwhelmingly positive. Facebook crushed analysts’ expectations, beating EPS estimates by a whopping 15 cents per share. Consequently, shares rose by 9 percent in the hours following Facebook’s earnings announcement on April 27, 2016 [46]. These observations suggest that we can use Twitter sentiment to predict whether a particular event will have a positive or negative effect on a company’s stock price.

Figures 5-1c and 5-1d show the daily tweet volumes versus the percentage of tweets expressing a positive sentiment for each day. As we can see in figure 5-1c, the percentage of positive tweets dropped on the day of Microsoft’s earnings announcement, while the percentage of positive tweets increased on the day of Facebook’s earnings announcement. Thus, it may be possible to construct a trading strategy that takes into account both the number of tweets and the sentiment on a given day to make decisions about whether to buy or sell certain stocks.

Figure 5-1: Twitter Volume Plots for Microsoft and Facebook. (a) MSFT Tweet Volume vs Returns; (b) FB Tweet Volume vs Returns; (c) MSFT Tweet Volume vs Sentiment; (d) FB Tweet Volume vs Sentiment.

5.4 Sentiment-Based Trading Strategy

Now we propose a simple trading strategy based on Twitter volume and the percentage of tweets expressing joy. For simplicity, we will assume that the price of a stock does not change due to after-hours trading and that there are no additional fees associated with buying or shorting stocks.

We use a two-dimensional array to store daily returns for each of the NASDAQ-100 components in Rivera’s dataset. Let $R_{i,t}$ represent the return for stock $i$ at time $t$: $R_{i,t} = \frac{p_{i,t} - p_{i,t-1}}{p_{i,t-1}}$, where $p_{i,t}$ is the price for stock $i$ on day $t$. $T_{i,t}$ and $J_{i,t}$ represent the total number of tweets for stock $i$ at time $t$ and the percentage of tweets labeled with the "joy" emotion at time $t$. $C_{i,t}$ represents the amount of capital for stock $i$ at time $t$ that is either currently invested or in the bank.

For each stock $i$, we keep track of moving averages for the total number of tweets and the percentage of tweets labeled with the "joy" emotion, using a rolling window of five days. This is because the trading week is five days, and we only consider the Twitter volume and sentiment on days immediately preceding a trading day, so tweets on Fridays and Saturdays are not included. Figure 5-1 also shows that there are fewer tweets tagged with cashtags on weekends, since no stocks are traded and no company announcements are made.

We initially allocate $1 to invest in each NASDAQ-100 stock. To calculate the amount of capital on day $t$ ($C_{i,t}$), we need to consider the percentage of joy tweets and the Twitter volume for day $t-1$. For each day $t-1$, if the total number of tweets ($T_{i,t-1}$) for a stock $i$ is at least one standard deviation greater than the previous week’s average, this signifies that a noteworthy event may have occurred. Then we look at the percentage of joy tweets for that day. If the percentage of joy tweets ($J_{i,t-1}$) is at least half a standard deviation greater than the previous week’s average, the event will probably result in a profit, so we will buy the stock when the market opens on day $t$ and then sell it after the market closes on day $t$. Thus, we gain a profit equal to the previous day’s capital times the daily return for stock $i$ on day $t$. Likewise, if the percentage of joy tweets is at least half a standard deviation below the average, we will short the stock and repurchase it the next day. If neither of these conditions is satisfied, $C_{i,t}$ remains unchanged from the previous day. Equation 5.2 shows how our calculation of the amount of capital invested in stock $i$ varies based on our decision for day $t$:

$$C_{i,t} = \begin{cases} C_{i,t-1}\,(1 + R_{i,t}) & \text{if buying stock} \\ C_{i,t-1}\,(1 - R_{i,t}) & \text{if shorting stock} \\ C_{i,t-1} & \text{otherwise} \end{cases} \qquad (5.2)$$
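The following is a minimal sketch of how Equation 5.2 could be simulated for a single stock, assuming daily volume, joy-percentage, and return series aligned on one date index. The function and variable names are illustrative, and close-to-close returns stand in for the open/close bookkeeping described above.

```python
import pandas as pd

def simulate_joy_volume_strategy(volume: pd.Series, joy_pct: pd.Series,
                                 returns: pd.Series, window: int = 5) -> pd.Series:
    """Capital path C_t for one stock under Equation 5.2, starting from $1."""
    # Rolling statistics over the previous `window` days, excluding the
    # current day itself (the "previous week's average" in the text).
    vol_mean = volume.rolling(window).mean().shift(1)
    vol_std = volume.rolling(window).std().shift(1)
    joy_mean = joy_pct.rolling(window).mean().shift(1)
    joy_std = joy_pct.rolling(window).std().shift(1)

    # An event is flagged when volume is at least one standard deviation
    # above the previous week's average.
    event = volume >= vol_mean + vol_std

    # Direction: +1 (buy) or -1 (short) when the joy percentage deviates by
    # at least half a standard deviation from its average, else 0 (hold).
    direction = pd.Series(0.0, index=volume.index)
    direction[event & (joy_pct >= joy_mean + 0.5 * joy_std)] = 1.0
    direction[event & (joy_pct <= joy_mean - 0.5 * joy_std)] = -1.0

    # A signal observed on day t-1 is traded on day t.
    daily_growth = 1.0 + direction.shift(1).fillna(0.0) * returns
    return daily_growth.cumprod()
```

Applying this to each ticker and summing the resulting series would give the portfolio-level capital.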

5.4.1 Preliminary Results

Figure 5-2 shows the results of this strategy on Microsoft, Facebook, and Yahoo during this three-month period. The green lines represent the amount of capital using a baseline buy and hold strategy, while the blue lines show the results of our sentiment- and volume-based trading strategy. As shown in figures 5-2a and 5-2b, this strategy performs quite well for Microsoft and Facebook. Even though Microsoft’s shares fell after the earnings announcement, our strategy was able to recognize that it should short the stock, leading to an overall profit.

However, this strategy does not produce the intended results for Yahoo (figure 5-2c). Yahoo’s earnings announcement occurred on April 19, 2016, and the response was more mixed compared to Microsoft and Facebook. Even though Yahoo’s Q1 earnings were 11.3 percent lower than they were in the first quarter of 2015, Yahoo was still able to beat EPS expectations by $0.01, so its shares rose by 1 percent in after-hours trading following the announcement [45]. However, the percentage of tweets expressing joy was still below the average for the previous week, so our strategy would short Yahoo shares instead of buying them. One possible explanation for this inconsistency is that the public generally had negative opinions towards Yahoo as a company, and the earnings announcement drew more attention to Yahoo, prompting even occasional tweeters to express their negative opinions.

In addition, AT&T announced its bid for Yahoo on May 25, 2016, causing Yahoo shares to fall by 2.3 percent [17]. Even though the shares fell, the percentage of positive tweets actually increased. Many tweets on this day mentioned both AT&T and Yahoo, so expressions of joy towards AT&T may have skewed the results. In addition, since Verizon had also previously made a bid for Yahoo, the increase in competition could also be perceived as good news for Yahoo. Since so many factors can impact stock market movement, it becomes clear that a naive sentiment analysis algorithm alone cannot perform consistently well for more unstable companies. This is another example where focusing on the sentiment of tweets by professional investors, who have more knowledge of companies’ financial situations, could potentially result in greater profits.

Figure 5-2: Preliminary Trading Strategy Performance for Microsoft, Facebook, and Yahoo. (a) $MSFT; (b) $FB; (c) $YHOO.

5.4.2 Reevaluation of Emotion Classifier Performance

We then obtained TAQ millisecond trade data in the hours following the earnings announcements and calculated hourly emotion averages to examine whether the daily emotion percentages could have been skewed by tweets from earlier in the day. Figure 5-3a plots Yahoo’s price changes against the percentage of tweets expressing joy out of all non-neutral tweets in each hour during the day of the earnings announcement. This figure shows that despite the positive earnings announcement, the sentiment towards Yahoo still decreased slightly immediately after the announcement.
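The hourly percentages behind Figure 5-3a reduce to a grouped aggregation over tweet timestamps. A minimal sketch, assuming one row per classified tweet with illustrative column names:

```python
import pandas as pd

def hourly_joy_share(tweets: pd.DataFrame) -> pd.Series:
    """Share of 'joy' tweets among non-neutral tweets in each hour.

    tweets: one row per tweet, with a datetime 'timestamp' column and an
            'emotion' column holding the Ekman label ('none' = neutral).
    """
    emotional = tweets[tweets["emotion"] != "none"].set_index("timestamp")
    return emotional["emotion"].resample("1h").apply(lambda e: (e == "joy").mean())
```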

We then discovered that our emotion classifier is not as accurate in the context of earnings announcement tweets. Table 5.4 shows some examples of tweets immediately after the earnings announcement on April 21. The first four tweets all express disappointment over Microsoft’s failure to meet targets, but they are classified as different Ekman emotions with negative connotations. In this case, whether the tweet has a positive or negative sentiment seems to matter more than the specific emotion that was identified. Therefore, using a finer-grained emotion classifier may not have an advantage over a polarity-based categorization for earnings announcement tweets, because we are grouping all of the negative emotions together in our analysis. The remaining three tweets also express disappointment, but were again classified as neutral by Pattern, possibly due to the neutral tone and lack of obviously positive or negative words.

Table 5.4: Microsoft Earnings Announcement Classification Examples

Tweet | Emotion | Polarity

Just when the coast was clear. Earnings disaster. Haters taking over. Momentum hit. Yowsa. $msft $v $sbux $goog $spx | anger | 0.1

Microsoft had a lousy quarter, partly because of factors beyond its control https://t.co/S3Nrdu6gOb $MSFT https://t.co/ZkK0oDEGPD | fear | -0.5

Microsoft stock belly-flops on earnings miss and weak guidance -now off more than 5% $MSFT https://t.co/QV3H3jf8X6 https://t.co/eT2juLMkd3 | sadness | 0.0625

More than one third in cash now. The after-hours performance of $GOOG, $MSFT, $V, & $SBUX: indicative of a market ready to roll over? | fear | 0.2333

Microsoft profit misses estimates, shares fall https://t.co/VMREMmnN9q $MSFT https://t.co/Pdxq3P6V8d | none | 0.0

RT @StockTwits: MICROSOFT MISSES. It just cratered 4% after earnings: https://t.co/dTXClCMgCD $MSFT https://t.co/H2QfFEa70C | none | 0.0

$MSFT $GOOGL Not only did they miss expectations, they missed soft/manip ones by analysts. 3 consecutive Q’s of falling earnings. Ouch! | none | 0.0

Similarly, Table 5.5 shows several misclassified tweets about Yahoo in the hour after the earnings announcement on April 19, 2016; the classifier labeled all of them as neutral even though the first four tweets are positive, while the last two are negative. Many of these tweets just state facts and use abbreviations which are not recognized as words, so traditional sentiment analyzers would classify them as neutral. From looking at these tweets about Microsoft and Yahoo, we can see that many tweets expressing disappointment share common words, including forms of the words "miss" and "fall". The positive tweets about Yahoo also shared many common words, such as "up" and "beats".

Table 5.5: Yahoo Earnings Announcement Classification Errors

Tweet | Emotion | Polarity

$YHOO delivered $390M in Mavens GAAP revenue in Q1, up 7% YoY | none | 0.0

@bestattrade @zerohedge And why not, $YHOO looks like a heck of a buy. Non-GAAP of course. https://t.co/LyyEj9pMBJ | none | 0.0

Yahoo $YHOO Q1 2016 EPS $0.08 beats by $0.01, Rev of $1.09B -11.4% Y/Y https://t.co/weXBrIkDQR @marissamayer #investors | none | 0.0

#Yahoo $YHOO Posts a Loss as Revenue Falls https://t.co/TgrGTkHIAT via @WSJ - | none | 0.0

$YHOO 1Q loss of $99.2M, after reporting a profit in same period last year - The @marissamayer #CEO #Crisis #Tech https://t.co/AYm80uHUVf | none | 0.0

These examples show that earnings announcement tweets use a very specific language, and we can identify the sentiment of a tweet just by checking for the presence of several keywords. Companies that exceeded expectations usually include words such as "beat", "up", "buy", and "gain", while companies that missed expectations will include words such as "miss", "negative", "down", and "loss". We will now investigate a simple classification scheme that determines the polarity of these tweets by checking for the presence or absence of positive or negative terms. We first stemmed the text of each tweet and classified a tweet as positive if the processed text contained any positive words, negative if the text contained any negative words, and neutral otherwise. Figure 5-3b graphs the percentage of tweets with a positive sentiment in one-hour intervals when there are at least 10 tweets containing words specific to earnings announcements during the hour. Figure 5-4 shows the percentage of positive tweets for Microsoft. We can see that at least 80 percent of Yahoo’s tweets mentioning earnings announcement related terms were positive, while less than 50 percent were positive for Microsoft. These results suggest that even a simple keyword-based trading strategy may be effective in the context of earnings announcements.
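A minimal sketch of this stem-and-match scheme, using the Porter stemmer [37] via NLTK; the keyword sets are illustrative, seeded with the terms named above:

```python
import re
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Stemming both the keywords and the tweet tokens lets "beats" match "beat",
# "missed" match "miss", and so on.
POSITIVE = {stemmer.stem(w) for w in ["beat", "up", "buy", "gain"]}
NEGATIVE = {stemmer.stem(w) for w in ["miss", "negative", "down", "loss", "fall"]}

def keyword_polarity(tweet: str) -> str:
    """Classify a tweet as positive, negative, or neutral by keyword presence."""
    tokens = {stemmer.stem(t) for t in re.findall(r"[a-z]+", tweet.lower())}
    if tokens & POSITIVE:
        return "positive"
    if tokens & NEGATIVE:
        return "negative"
    return "neutral"
```

For instance, applied to the third Yahoo tweet in Table 5.5, the stemmed token "beat" (from "beats") places it in the positive class.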

Figure 5-3: (a) Yahoo sentiment during earnings announcement on April 19; (b) Yahoo sentiment using keywords during earnings announcement on April 19.

Figure 5-4: Microsoft sentiment using keywords during earnings announcement on April 21

5.5 Keyword-Based Trading Strategy

Based on these findings, we modify our trading strategy described in section 5.4 so that if there are at least 10 tweets containing earnings announcement terms and at least 60 percent of those tweets express a positive sentiment on a particular day, we will buy the stock on the next day. Conversely, if at least 60 percent of the tweets express a negative sentiment, we will short the stock. Figure 5-5 displays the results of this strategy compared to a baseline buy and hold strategy for all NASDAQ-100 stocks.
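As a sketch, the modified decision rule reduces to a small function over one day's keyword-classified tweets; `keyword_polarity` here refers to the illustrative classifier sketched in the previous section:

```python
def keyword_trade_signal(polarities: list[str], min_tweets: int = 10,
                         threshold: float = 0.60) -> int:
    """Decide the next day's trade from one day's earnings-related tweets.

    polarities: 'positive'/'negative'/'neutral' labels for the day's tweets
                that contain earnings announcement terms.
    Returns +1 (buy tomorrow), -1 (short tomorrow), or 0 (no trade).
    """
    if len(polarities) < min_tweets:
        return 0
    positive = polarities.count("positive") / len(polarities)
    negative = polarities.count("negative") / len(polarities)
    if positive >= threshold:
        return 1
    if negative >= threshold:
        return -1
    return 0
```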

Figure 5-5: Keyword-based Trading Strategy

As expected, the total capital remains relatively flat at the beginning of the trading period because few earnings announcements occur in late March and early April. The bulk of earnings announcements for larger companies occurs between mid-April and mid-May, when our trading strategy steadily gains profits until flattening out again in late May.

5.5.1 Evaluation of Trading Strategy Performance

According to earnings announcement data from Zacks, out of the 100 NASDAQ-100 companies, 57 exceeded earnings estimates, 25 missed estimates, and 18 companies either matched estimates exactly or did not announce earnings during our trading period. Table 5.6 compares the accuracies of our keyword-based strategy and our previous sentiment-based strategy in identifying and predicting responses to earnings announcements. We define a prediction as correct when a company exceeded estimates and our strategy decided to buy the stock, or when a company missed estimates and our trading strategy decided to short the stock. False negatives indicate days where an earnings announcement occurred, but we did not make a trade on that day. False positives indicate trades that occurred on non-earnings announcement days, which can happen when at least ten tweets happen to contain earnings announcement terms on days when an earnings announcement does not occur. False positives can be a problem, especially for larger companies with a high daily tweet volume. Since earnings announcement dates are announced in advance, we could incorporate this information to ensure that trades are only made on earnings announcement days.
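These outcome counts can be tallied by comparing trade decisions against announcement results. A minimal sketch, assuming both are keyed by (ticker, date) with +1 for beat/buy and -1 for miss/short (the data layout is illustrative):

```python
def score_strategy(trades: dict[tuple[str, str], int],
                   outcomes: dict[tuple[str, str], int]) -> dict[str, int]:
    """Tally correct predictions, false negatives, and false positives.

    trades:   (ticker, date) -> +1 if we bought, -1 if we shorted.
    outcomes: (ticker, date) -> +1 if the company beat estimates, -1 if it
              missed, for each earnings announcement in the trading period.
    """
    correct = sum(1 for key, result in outcomes.items() if trades.get(key) == result)
    false_negatives = sum(1 for key in outcomes if key not in trades)
    false_positives = sum(1 for key in trades if key not in outcomes)
    return {"correct": correct,
            "false_negatives": false_negatives,
            "false_positives": false_positives}
```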

Table 5.6: Trading Strategy Comparison

Sentiment Based Keyword Based

Correct 32 (39.02 %) 52 (63.41 %)

False Negatives 25 (30.49 %) 19 (23.17 %)

False Positives 1035 164

The keyword-based strategy was able to predict significantly more earnings announcements correctly than the sentiment-based strategy, with fewer false negatives and false positives, although some of the false negatives for the sentiment-based strategy are related to non-earnings announcement events. Next, Table 5.7 provides a comparison between the average daily returns and volatility (the standard deviation of daily returns) for our keyword-based strategy and a passive buy and hold strategy. As expected, the buy and hold baseline had an average return that was very close to zero. When excluding days where no trade was made, our strategy had an average return of approximately 1.42 %, while the increase in volatility was of a smaller magnitude.

Table 5.7: Trading Strategy Statistics

Trading Strategy Average Return Volatility

Buy and Hold 8.379 × 10^-5 0.0215

Sentiment Based 3.839 × 10^-4 0.0072

Sentiment Based (excluding zeros) 0.0142 0.0415
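For reference, the statistics in Table 5.7 are simple aggregates of the daily return series; a minimal sketch with illustrative names:

```python
import numpy as np

def return_stats(daily_returns: np.ndarray) -> tuple[float, float]:
    """Average daily return and volatility (std of daily returns), as in Table 5.7."""
    return float(daily_returns.mean()), float(daily_returns.std())

# "Excluding zeros" drops the days on which no trade was made:
# return_stats(daily_returns[daily_returns != 0.0])
```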

Our updated strategy ended with an overall profit of 2.175% on June 15, while the buy and hold strategy ended with a loss of 0.259%. Although only 235 trades were made in total over all NASDAQ-100 companies during a three-month period, we were still able to make a profit without any major losses throughout the trading period. Since earnings announcements only occur four times per year, identifying positive and negative language specific to other types of events, such as acquisitions and mergers, could potentially increase profits.

In addition, since earnings announcements can greatly impact the public perception of companies, the stock market reaction to an unexpected announcement often lasts for longer than one day. Thus, another possible way to increase profits could be to try holding positions for longer time periods.

5.6 Discussion

In this chapter, we have corroborated Ranco and Hentschel’s findings that the relationship between Twitter volume and future stock returns is stronger than that between sentiment and returns, and that Twitter volume can be used to identify events [19][40]. Furthermore, after we have identified an event, sentiment can sometimes be used to predict whether the event will result in a positive or negative return. However, we also encountered some of the issues that arise when applying training data from a different domain, since many financial tweets can convey a positive or negative opinion while still using a neutral tone. We found that identifying specific words or phrases commonly used in positive or negative tweets was more effective than naively applying our emotion classifier to earnings announcement tweets. This suggests that developing labeled training sets specific to financial tweets will be an important step to increase the accuracy of future studies on the relationships between Twitter sentiment and stock market performance.

Chapter 6

Conclusions and Future Work

In this thesis, we have explored applying sentiment and emotion classification techniques to two contrasting domains: presidential election tweets and financial tweets. Our analysis of presidential election tweets demonstrated that emotion classification approaches can reveal nuances about public opinion towards candidates that polarity-based classification schemes are unable to capture fully.

On the other hand, in contrast to several other studies, we were unable to identify a robust correlation between emotion distributions and future stock market performance. However, we were able to use a combination of Twitter volume and sentiment to predict the impact of an earnings announcement on future returns. We also discovered that our emotion classifier was not as useful for earnings announcement tweets due to the differences in features between these tweets and the tweets in our training set. Consequently, we then implemented a simple classification scheme based on the presence of positive or negative keywords specific to earnings announcements. Even this naive implementation could be effectively incorporated into an automated trading strategy that makes decisions about whether to buy or short stocks following earnings announcements.

An important next step would be to create a corpus containing sentiment labels specifically for financial tweets. Many useful financial tweets use abbreviations and are written in a neutral tone, so the currently available corpora for sentiment analysis and emotion classification are unable to accurately identify the emotional nuances expressed in these tweets.

In addition, there are also some issues associated with using keywords and hashtags to find tweets relating to certain candidates and companies. Many tweets will contain multiple keywords, but only actually express an emotion towards one of them. Using topic modeling algorithms and more advanced natural language processing techniques to identify the subject of a tweet could potentially help reduce noise in Twitter data.

Finally, we would like to experiment with classifying tweets into more complex emotions and moods beyond the six Ekman emotions. Recently, there has been some work on more fine-grained emotion detection, such as Yan and Turtle’s EmoTweet, which can classify tweets into 28 different emotions, including curiosity, pride, and boredom [54][55]. Being able to identify more subtle moods, such as hope, uncertainty, and pessimism, could potentially help us find stronger correlations between Twitter sentiment and stock market performance in the future.

Even though there is no single approach to sentiment analysis and emotion classification that will work well for all contexts, we have shown that Twitter is still a useful resource for measuring public opinion and can be used effectively to make trading decisions if the correct factors are considered.

Bibliography

[1] Ahuja, Y., & Yadav, S. K. (2012). Multiclass classification and support vector machine. Global Journal of Computer Science and Technology Interdisciplinary, 12(11).

[2] Aly, M. (2005). Survey on multiclass classification methods.

[3] Azar, P., & Lo, A. W. (2016). The wisdom of twitter crowds: Predicting stock market reactions to FOMC meetings via Twitter feeds. The Journal of Portfolio Management, 42(5), 123-134.

[4] Berke, J. (2016, September 25). Here’s where Hillary Clinton and Donald Trump stand on energy issues. Retrieved from http://www.businessinsider.com/trump-and-clinton-on-the-issues-energy-policy-2016-9

[5] Borak, D., & Williams, H. (2016, October 25). Where they stand on Wall Street. Retrieved from http://graphics.wsj.com/elections/2016/where-do-clinton-and-trump-stand-on-wall-street/

[6] Bollen, J., Mao, H., & Zeng, X. (2011). Twitter mood predicts the stock market. Journal of Computational Science, 2(1), 1-8.

[7] Chen, R., & Lazer, M. (2011). Sentiment analysis of Twitter feeds for the prediction of stock market movement.

[8] Coyne, B. (2016, November 7). How #Election2016 was Tweeted so far. Retrieved from https://blog.twitter.com/2016/how-election2016-was-tweeted-so-far

[9] Cross-validation: evaluating estimator performance. (n.d.). Retrieved from http://scikit-learn.org/stable/modules/cross_validation.html

[10] De Smedt, T., & Daelemans, W. (2012). Pattern for Python. Journal of Machine Learning Research, 13, 2063-2067.

[11] Ekman, P. (1992). An argument for basic emotions. Cognition and Emotion, 6(3/4), 169-200.

[12] Easley, C. (2016, September 28). Clinton bests Trump in debate, half of likely voters say. Retrieved from https://morningconsult.com/2016/09/28/clinton-bests-trump-debate-half-likely-voters-say/

[13] Easley, C. (2016, October 21). First post-debate poll: Clinton wins third debate, keeps Trump at a distance. Retrieved from https://morningconsult.com/2016/10/21/clinton-wins-third-debate-accept-results-chris-wallace/

[14] Garde, D. (2016, October 7). Whether it’s Trump or Clinton, biotech is in for ‘a moment of change’. Retrieved from https://www.statnews.com/2016/10/07/trump-clinton-biotech-change/

[15] Gayo-Avello, D., Metaxas, P. T., & Mustafaraj, E. (2010). Limits of Electoral Predictions Using Twitter. Proceedings of the Fifth International AAAI Conference on Weblogs and Social Media, 178-185.

[16] Gilbert, E., & Karahalios, K. (2010). Widespread worry and the stock market. In Proceedings of the International Conference on Weblogs and Social Media.

[17] Goliya, K. (2016, May 25). AT&T bids for Yahoo’s internet business: Bloomberg. Retrieved from http://www.reuters.com/article/us-yahoo-m-a-at-t-idUSKCN0YG290

[18] Guynn, J. (2016, November 9). Forget Trump: Election’s big winner was Twitter. Retrieved from https://www.usatoday.com/story/tech/news/2016/11/08/election-winner-twitter/93509896/

[19] Hentschel, M., & Alonso, O. (2014). Follow the money: A study of cashtags on Twitter. First Monday, 19(8). doi: http://dx.doi.org/10.5210/fm.v19i8.5385

[20] Ho, T. (1995). Random decision forests. Proceedings of the 3rd International Conference on Document Analysis and Recognition, 278-282.

[21] Imbert, F. (2016, April 19). Microsoft earnings: 62 cents per share, vs. expected EPS of 64 cents. Retrieved from http://www.cnbc.com/2016/04/21/microsoft-fiscal-q3-earnings.html

[22] Jahanbakhsh, K., & Moon, Y. (2014). The Predictive Power of Social Media: On the Predictability of U.S. Presidential Elections using Twitter.

[23] Johnson, C. Y. (2016, November 9). Winners and losers in the health-care industry under President Trump. Retrieved from https://www.washingtonpost.com/news/wonk/wp/2016/11/09/winners-and-losers-in-the-health-care-industry-under-president-trump/

[24] Katz, H. S., & Rosner, A. (2016, October 20). Stock market today: October 20, 2016. Retrieved from http://www.valueline.com/Markets/Daily_Updates/Stock_Market_Today__October_20,_2016.aspx#.WQiOMtrytnJ

[25] Leip, D. (2016). 2016 election night events timeline. Retrieved from http://uselectionatlas.org/INFORMATION/ARTICLES/ElectionNight2016/pe2016elecnighttime.php

[26] Littman, J., Wrubel, L., & Kerchner, D. (2016). 2016 United States presidential election tweet ids. Harvard Dataverse. http://dx.doi.org/10.7910/DVN/PDI7IN

[27] Menard, S. (2002). Applied Logistic Regression Analysis. SAGE.

[28] Mittal, A., & Goel, A. (2012). Stock prediction using Twitter sentiment analysis. Stanford University Working Paper.

[29] Mohammad, S. M., & Turney, P. D. (2010). Emotions evoked by common words and phrases: Using Mechanical Turk to create an emotion lexicon. Proceedings of the NAACL-HLT 2010 Workshop on Computational Approaches to Analysis and Generation of Emotion in Text.

[30] Mohammad, S. M. (2012). #Emotional tweets. In Proceedings of the First Joint Conference on Lexical and Computational Semantics.

[31] Mohammad, S. (2012). Portable features for classifying emotional text. Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.

[32] Mohammad, S. M., & Turney, P. D. (2013). Crowdsourcing a word-emotion association lexicon. Computational Intelligence, 29(3), 436-465.

[33] Multiclass and multilabel algorithms. (n.d.). Retrieved from http://scikit-learn.org/stable/modules/multiclass.html

[34] Murphy, K. P. (2012). Machine Learning - A Probabilistic Perspective. The MIT Press.

[35] Oehler, A., Walker, T. J., & Wendt, S. (2013). Effects of election results on stock price performance: Evidence from 1976 to 2008. Emerald Group Publishing, 39(8), 714-736.

[36] Palmer, A., & Sherman, J. (2016, October 11). Poll: Hillary Clinton won the second debate. Retrieved from http://www.politico.com/story/2016/10/clinton-trump-debate-poll-229581

[37] Porter, M. F. (1980). An algorithm for suffix stripping. Program, 14(3), 130-137.

[38] Pramuk, J. (2016, November 9). Clinton calls Trump to concede election, NBC reports. Retrieved from http://www.cnbc.com/2016/11/09/clinton-calls-trump-to-concede-election-nbc-reports.html

[39] Prechter, R. R., Jr., Goel, D., Parker, W. D., & Lampert, M. (2012). Social mood, stock market performance, and U.S. presidential elections: A socionomic perspective on voting results. SAGE Open, 1-13.

[40] Ranco, G., Aleksovski, D., Caldarelli, G., Grčar, M., & Mozetič, I. (2015). The effects of Twitter sentiment on stock price returns. PLoS ONE, 10(9).

[41] Rivera, E. (2016, September). NASDAQ100 Twitter dataset. Retrieved from https://data.world/kike/nasdaq-100-tweets

[42] Roberts, K., Roach, M. A., Johnson, J., Guthrie, J., & Harabagiu, S. M. (2012). EmpaTweet: Annotating and detecting emotions on Twitter. LREC 2012, 3806-3813.

[43] Rokach, L., & Maimon, O. (2010). Data mining with decision trees: theory and applications. World Scientific Pub Co Inc.

[44] Rooney, K. (2016, June 13). Microsoft to buy LinkedIn for $26.2 billion; LNKD shares jump 47%. Retrieved from http://www.cnbc.com/2016/06/13/microsoft-to-buy-linkedin.html

[45] Rosenfeld, E. (2016, April 19). Yahoo’s Mayer: We know what the ’top priority’ is. Retrieved from http://www.cnbc.com/2016/04/19/yahoo-reports-first-quarter-earnings.html

[46] Rosenfeld, E. (2016, April 27). Facebook shatters Wall Street estimates, proposes new share structure. Retrieved from http://www.cnbc.com/2016/04/27/facebook-reports-first-quarter-earnings.html

[47] Shapiro, G. (2016, September 1). Hillary Clinton is the only candidate with a technology policy. Retrieved from http://www.cnbc.com/2016/09/01/why-a-hillary-clinton-presidency-could-boost-the-tech-industry-ceo-commentary.html

[48] Shi, L., Agarwal, N., Agrawal, A., Garg, R., & Spoelstra, J. (2012). Predicting US primary elections with Twitter. Proceedings of the workshop social network and social media analysis: Methods, models and applications.

[49] Shields, A. (2016, June 16). Why did Microsoft’s stock fall after LinkedIn offer? Retrieved from http://marketrealist.com/2016/06/why-did-microsofts-stock-fall-after-linkedin-offer/

[50] Sprenger, T. O., Sandner, P. G., Tumasjan, A., & Welpe, I. M. (2014). News or noise? Using Twitter to identify and understand company-specific news flow. Journal of Business Finance & Accounting, 41(7-8).

[51] Strapparava, C., & Mihalcea, R. (2007). SemEval-2007 task 14: Affective text. Proceedings of SemEval-2007.

[52] Summers, E. (2013, January 14). Twarc. Retrieved March 16, 2017, from https://github.com/DocNow/twarc

[53] Twitter Inc. (2016, June 30). Twitter usage / company facts. Retrieved from https://about.twitter.com/company

[54] Yan, J., & Turtle, H. R. (2016). Exploring fine-grained emotion detection in tweets. Proceedings of NAACL-HLT, 73-80.

[55] Yan, J., Turtle, H. R., & Liddy, E. D. (2016). EmoTweet-28: A fine-grained emotion corpus for sentiment analysis. Language Resources and Evaluation Conference.

[56] Zhang, X., Fuehres, H., & Gloor, P. A. (2011). Predicting stock market indicators through Twitter "I hope it is not as bad as I fear". Procedia-Social and Behavioral Sciences, 26, 55-62.
