Multi-factor Sentiment Analysis for Gauging Investors Fear

by Ichihan Tai

B.S. in Engineering in Computer Science, May 2012, University of Michigan
M.S. in Engineering in Industrial and Operations Engineering, December 2012, University of Michigan

A Praxis submitted to

The Faculty of The School of Engineering and Applied Science of The George Washington University in partial fulfillment of the requirements of the degree of Doctor of Engineering

August 31, 2018

Praxis directed by

Amir H. Etemadi, Assistant Professor of Engineering and Applied Science

Ebrahim Malalla, Visiting Associate Professor of Engineering and Applied Science

The School of Engineering and Applied Science of The George Washington University certifies that Ichihan Tai has passed the Final Examination for the degree of Doctor of Engineering as of the date of praxis defense, July 19, 2018. This is the final and approved form of the praxis.

Multi-factor Sentiment Analysis for Gauging Investors Fear

Ichihan Tai

Praxis Research Committee:

Amir H. Etemadi, Assistant Professor of Engineering and Applied Science, Praxis Co-Director

Ebrahim Malalla, Visiting Associate Professor of Engineering and Applied Science, Praxis Co-Director

Takeo Mitsuhashi, Chief Executive Officer & Chief Investment Officer, Committee Member


© Copyright 2018 by Ichihan Tai
All rights reserved


Acknowledgements

I am grateful to my academic advisors Dr. Amir H. Etemadi and Dr. Ebrahim Malalla, and to my previous academic advisors Dr. Bill Olson and Dr. Paul Blessner, for all the support and guidance throughout my doctoral study. I appreciate my employer Takeo Mitsuhashi and Satoshi Tazaki for their financial support of and accommodations to my doctoral study. I am thankful to my previous superiors Sunny Park, Andrew Ray, and David Carlebach for all the encouragement and support that got me started on this doctoral journey. I am grateful to my mentor Dr. Peng Zang for his guidance and inspiration.

Finally, I thank my family members for continuous support throughout the program.


Abstract

Multi-factor Sentiment Analysis for Gauging Investors Fear

The Chicago Board Options Exchange Volatility Index (VIX) is widely used to gauge investor fear and measures the “insurance premiums” of the stock market. Traditionally, VIX prediction has been done using time-series analysis models (e.g., GARCH), and some attempts have been made to predict it using sentiment analysis approaches. However, traditional sentiment analysis focuses on authors’ sentiment (sentiment expressed in news authors’ word choices), and this Praxis will demonstrate that adding other factors (e.g., similarity, readability) to the traditional authors’ sentiment model improves VIX prediction results. This Praxis research leverages natural language processing (NLP) and machine learning (ML) techniques to build a VIX prediction model.


Table of Contents

Acknowledgements
Abstract
List of Figures
List of Tables
List of Acronyms
Chapter 1-Introduction
1.1 Background
1.2 Problem statement
1.3 Thesis statement
1.4 Research objectives
1.5 Research questions
1.6 Research hypotheses
1.7 Research limitations
1.8 Organization of praxis
Chapter 2-Literature Review
2.1 Problems
2.1.1 VIX Index Prediction
2.1.2 Modeling Financial Markets
2.1.3 Managing crises
2.1.4 Engineering Management
2.2 Data
2.2.1 Data Hierarchy
2.2.2 Textual Data: Metadata
2.2.3 Textual Data: Contents
2.3 Methodologies
2.3.1 Textual Analysis
2.3.2 Textual Data Preprocessing
2.3.3 Feature Engineering
2.3.4 Machine Learning
2.3.5 Evaluation
2.4 Gaps in the existing literature
Chapter 3-Methods
3.1 Data
3.2 Pre-processing
3.3 Feature Engineering
3.3.1 Sentiment
3.3.2 Similarity
3.3.3 Readability
3.3.4 Topic Frequency
3.4 Machine Learning
3.5 Evaluation
3.6 Benchmark
Chapter 4-Results
4.1 Pre-processing
4.2 Machine learning
Chapter 5-Discussion and Conclusions
5.1 Discussion (interpretation of results)
5.2 Conclusions
5.3 Contributions
5.4 Future research directions
Appendices
Appendix A: Factor Library
Appendix B: Optimum number of topics for Latent Dirichlet Allocation
Appendix C: Main Multi-Factor Sentiment Analysis Code
Appendix D: News Data (Sample)
Appendix E: VIX Index Data (Sample)


List of Figures

Figure 1: Illustration of human logic
Figure 2: Illustration of MSA logic
Figure 3: High-level construction of MSA
Figure 4: Determination of economic uncertainty
Figure 5: Cumulative distribution of VIX change
Figure 6: Sentiment example
Figure 7: High similarity example
Figure 8: Low similarity example
Figure 9: Readability example
Figure 10: Results from maximization algorithms for finding natural number of topics
Figure 11: Results from minimization algorithms for finding natural number of topics


List of Tables

Table 1: Factors generated in feature engineering
Table 2: Topics and associated keywords
Table 3: Performance of MSA and benchmarks
Table 4: Performance of MSA and benchmarks
Table 5: Performance of MSA framework using other machine learning algorithms
Table 6: Best performing standalone factor


List of Acronyms

MSA Multi-factor Sentiment Analysis

VIX Chicago Board Options Exchange Volatility Index

CAPM Capital Asset Pricing Model

GARCH Generalized Autoregressive Conditional Heteroskedasticity

LM Loughran and McDonald Sentiment Dictionary

GI Harvard General Inquirer Sentiment Dictionary

LDA Latent Dirichlet Allocation

NLP Natural Language Processing


Chapter 1-Introduction

1.1 Background

Volatility, as a key factor in equity and option markets, has always been a central focus of academic research and investment practice. Since its launch, the Chicago Board Options Exchange Volatility Index (VIX) has been widely used to gauge investor fear and to measure the expected volatility of the stock market (Brenner, 1989).

“Things happen” is a centuries-old problem that perhaps all investment management teams face at some point. The volatility of financial markets has made it inevitable that a mature investment management team must constantly model and predict detrimental systematic market events and develop timely hedging plans. As a result, the ability to predict market volatility, i.e., the VIX, has profound impacts on the long-term success of the investment management process.

The traditional INCOSE risk management systems have formally defined risk management, and more specifically crisis detection, as part of the risk and opportunity management framework (International Council on Systems Engineering, 2011). Such risk management systems specifically call for attention to “ambient risk,” which is risk caused and created by the surrounding environment (ambience) of the portfolio. This includes risks created by external environments that are often beyond the control of the portfolio management team. The INCOSE framework seeks to convert risks into desirable opportunities or to eliminate undesirable problems. From an empirical point of view, portfolio managers, who have access to machine learning tools from the knowledge discovery and decision support literatures (Coussement & Van den Poel, 2008; Lavrenko et al., 2000; Schumaker & Chen, 2009), are well equipped to capture and predict risk events and take proactive action to minimize their impact on the investment portfolio.

There are two types of problems in engineering management: anticipated problems and unplanned events. Of these two categories, continuously monitoring and catching unplanned events in a timely manner is the main challenge and purpose of any problem management system. From an investment management perspective, unplanned events in the traditional systems engineering sense are both dangerous and extremely valuable. Such unplanned events, which can generally be labelled as market surprises, are important sources of extraordinary returns for an investment portfolio. Market surprises, by definition, are occasions where the majority of market participants fail to position their portfolios in the right direction, which creates exceptional return opportunities for an investment manager who takes the opposite position against the majority of the market.

However, as one can imagine, this also creates risks of tremendous losses. As a result, the ability to forecast unplanned events is critical to the success of portfolio management, even more so than in the traditional systems engineering framework. On the other hand, the difficulty of this exception-catching process means that the process is resource-intensive by nature (B. A. Olson, Mazzuchi, Sarkani, & Forsberg, 2012).

How should a problem management system be structured so that it can promptly identify unplanned events and notify the portfolio management team, and at the same time achieve these goals in a cost-efficient manner? What data should be fed into a problem management system, and how should they be processed? What types of signals should the problem management system be designed to detect? These are all critical questions that need to be answered when developing an effective problem management system.

A large portion of exploitable information is now embedded within texts rather than quantitative data. With the development of the internet, the amount of textual data that is readily available to the public has been increasing over the past decade (Chan & Franklin, 2011). To identify or even predict unplanned events, textual data sources, such as internal emails, external news, and regulatory filings, can provide an informational edge over quantitative data, as textual data may contain forward-looking information that resists quantification.

At a microeconomic level, a portfolio management team may need to monitor any unplanned events related to credit risks. If an investment portfolio has large holdings of corporate bonds, then assessing the credit rating of the counterparty is critical to ensuring that any unplanned events are captured instantly. For example, if the company that issued the corporate bonds has ongoing lawsuits or regulatory investigations that may require cash settlements, its ability to make future payments or deliverables may be significantly affected. To capture such information, news data has advantages over other data sources. Although detailed information about a company’s operational and financial results and associated risks can be found in regulatory filings for publicly traded companies, such data is generally published infrequently (e.g., quarterly or annually), and therefore cannot satisfy the requirements of a problem management system.

At a macroeconomic level, information on geo-political risks and general economic risks should also be captured by a problem management system in a timely fashion. Such risks, also referred to as systematic risks, despite their low probability of occurrence, can have a destructive impact on any investment portfolio. For example, during the 2008 financial crisis, although the crisis started within the US financial sector, it soon spread across various industries and evolved into a worldwide economic crisis. As most companies rely on the banking industry to finance their day-to-day operations, when consecutive market downturns lead to a liquidity drain on the major banks, companies in other industries will likely run into the same issue (Calomiris & Nissim, 2014). As a result, a problem management system that captures overall market performance and risks can serve as an early warning system for investment management teams.

From an event management perspective, quantitative data lacks the timeliness required by the continuous risk management goals of portfolio management teams. It also lacks the information richness found in textual data, especially when analyzing macroeconomic trends. While quantitative market data reflect the public’s realized sentiment toward overall market performance, they cannot be relied upon as the only source for predicting drastic market movements in the near future, such as a financial crisis. Researchers have shown that textual data contain more relevant information in this domain (Chan & Franklin, 2011).

This praxis applies an engineering technique, namely text mining, to monitor and manage events that are relevant to portfolio management teams. The formal adoption of the text mining framework as part of the portfolio risk and opportunity management process can help investment management teams detect problems and uncertainties quickly and take proactive actions.


1.2 Problem statement

Traditional sentiment analysis for VIX prediction leverages only the sentiment expressed in news authors’ word choices, resulting in a lower F1 score (less accurate prediction).

1.3 Thesis statement

By combining similarity, readability, topic, and sentiment methods, we can increase the F1 score of VIX prediction over the traditional sentiment-only approach.

1.4 Research objectives

In seeking information specifically about the state of the economy and society, people traditionally rely on printed news, but this requires significant time and effort (Mullainathan & Shleifer, 2005). In the finance industry, numerous equity research analysts are hired by large financial institutions to manually read and analyze corporate filings and news related to a particular industry or company. Researchers have found that white-collar workers spend 30 to 40 percent of their worktime navigating through documents, and that the amount of data stored in an unstructured format accounts for more than 85 percent of all available data (Blumberg & Atre, 2003). Managers often say that they do not have the right toolset to analyze large volumes of text data (Blumberg & Atre, 2003).

In addition to the fact that having human readers filter through textual data may not be cost efficient, it is also difficult to standardize how human readers filter textual data and what information they derive from them. Therefore, an engineering solution is required so that portfolio management teams can monitor economic uncertainties in a uniform and cost-efficient manner.

In the academic literature, there are many studies that portfolio management teams can use to help them manage internal uncertainties, such as an automated system developed by Cheung and Wang for reading emails and extracting knowledge from them (Cheung, Lee, Wang, Wang, & Yeung, 2011; Wang, Cheung, Lee, & Kwok, 2008). However, there is no engineering management solution available to monitor external uncertainties, especially those related to economic uncertainty events.

This praxis uses market volatility to define economic crises, an approach also widely used by previous researchers (Ahn, Oh, Kim, & Kim, 2011; Kim, Oh, Sohn, & Hwang, 2004). Market volatility can be observed in the financial market in various forms, and perhaps the most popular measure used by practitioners is the Chicago Board Options Exchange Volatility Index (VIX Index). When market participants’ level of fear increases, they purchase more options, which act as insurance that protects the buyer from market crashes in exchange for a premium. The VIX index is derived from the option prices of the S&P 500 Index, which is a popular index for the broader US stock market. Therefore, the VIX index can capture the general market-wide sentiment, measured subjectively, in a timely manner.

Successful management of economic uncertainty events has been carried out using numerical data, and researchers have applied machine learning techniques to economic and pricing data to develop early warning systems (Ahn et al., 2011; Nassirtoussi, Aghabozorgi, Wah, & Ngo, 2014; Oh, Kim, & Kim, 2006). On the other hand, financial market predictions have also been successfully conducted with unstructured textual data using text mining techniques (Nassirtoussi, Aghabozorgi, Wah, & Ngo, 2015).

Since far more data is stored in unstructured formats such as text than is stored structurally in a numerical format (Exchange, 2009), the ability to leverage unstructured data is highly desirable. In the recent history of quantitative research on modeling market volatility, there has been an increased focus on leveraging textual data.

To understand the risk and return of the stock market, researchers first applied statistical techniques to the most accessible and typical data about stocks, the pricing data, leading to the famous Capital Asset Pricing Model (Sharpe, 1964). A few decades later, researchers started to use reference data, namely fundamental accounting data, to model the stock risk and return that could not be modelled by pricing data alone. This resulted in the famous Fama-French three-factor model (Fama & French, 1993), which has been widely used in the finance industry. Additionally, researchers have employed more reference data to develop more sophisticated models that include additional variables, such as the Fama-French five-factor model (Fama & French, 2015).

Over the past decade, researchers have started to pay greater attention to unstructured textual data and have successfully shown that news data can model additional stock market risk and return that could not be modelled using traditional data (i.e., structured pricing data and even accounting data) (Calomiris & Mamaysky, 2017).

This trend of incorporating textual data into quantitative analysis exists not only in the quantitative finance literature but also in the economics literature. Techniques used to measure economic conditions have shifted from applying traditional statistical approaches to structured numerical data (i.e., interest rates and asset prices) (Hatzius, Hooper, Mishkin, Schoenholtz, & Watson, 2010) to applying text mining techniques to monitor what is in unstructured textual data (Baker, Bloom, & Davis, 2016; Bholat, Hansen, Santos, & Schonhardt-Bailey, 2015; Calomiris & Mamaysky, 2017). Therefore, it seems a natural progression for the traditional problem management framework in engineering management to evolve into a form that embraces modern text mining techniques and leverages unstructured textual data.

This praxis will demonstrate that engineering management as a field is ready for text mining techniques. A problem management system focused on managing an economic uncertainty event is presented with text mining as a core capability.

The contribution of this praxis is a robust text mining system that combines knowledge from both the finance and AI literatures to automatically read news and predict spikes in the VIX Index. The system is tailored to the needs of portfolio management teams who need to detect the market-wide macroeconomic uncertainties that impact all companies. The system developed in this praxis allows users to understand the dynamic drivers behind recent spikes in the VIX Index as it makes predictions.

1.5 Research questions

• Does MSA provide higher VIX prediction power than traditional sentiment analysis?

• Is random forest the best performing machine learning algorithm under the MSA framework?

• Do change-in-sentiment, similarity, readability, change-in-readability, topic-frequency, and change-in-topic-frequency based predictors provide results more accurate than sentiment-based predictors?

1.6 Research hypotheses

H1: In predicting VIX, MSA provides a higher F1 score than traditional sentiment analysis.

H2: In predicting VIX, random forest provides a higher F1 score than Decision Tree, Naïve Bayes, SVM, Neural Network, Logistic Regression, Nearest Neighbors, and Voting in the MSA framework.

H3: In predicting VIX, change-in-sentiment, similarity, readability, change-in-readability, topic-frequency, and change-in-topic-frequency based predictors provide a higher F1 score than sentiment-based predictors.

1.7 Research limitations

This praxis does not aim to find factors that always work over time; that is usually the aim of the finance literature, which uses rigorous linear regression methods. Instead, this praxis aims to gain domain insights from the finance literature about which factors can potentially work, and to feed them as factors into prediction methods from the AI literature (i.e., machine learning). This praxis holds the fundamental belief that some factors will work at some times but not at others; therefore, machine learning should be utilized to examine the data and determine the factors that work at a given point in time. It is meant to be a dynamic and ever-evolving process.


1.8 Organization of praxis

The rest of this praxis is organized as follows.

Chapter 2 reviews the relevant previous studies. The chapter can be further divided into three sections that discuss previous literature on similar research problems, news data analysis and the related text mining methodologies.

The first section is focused on problems solved by previous researchers. This section includes related literature from the engineering management fields, as well as finance and economics. The second section is focused on textual data. This section focuses on what types of data are commonly researched and the important properties inherent to the data.

The third section is focused on methodologies. This praxis surveys related literature and studies what methodologies and models other researchers have used to solve similar problems.

Chapter 3 focuses on the actual construction of the system. This chapter is focused on the methodologies used. The methodologies are divided into data, pre-processing, machine learning, and evaluation methodologies.

Chapter 4 discusses the output from MSA in detail. First, the output from the topic model is interpreted qualitatively using text mining techniques, along with the effectiveness of filtering through topics. Second, the effectiveness of sentiment analysis is discussed. Finally, the result from the machine learning step is discussed together with the overall effectiveness of MSA.

Chapter 5 concludes the study and discusses potential future research directions.


Chapter 2-Literature Review

The literature review section is divided into three sections that discuss previous literature on similar research problems, news data analysis and the related text mining methodologies.

2.1 Problems

This section surveys existing research related to VIX prediction and text mining systems, including systems that model the financial market, systems that warn against crises, and systems that are applied to broader fields of engineering management.

2.1.1 VIX Index Prediction

In predicting the VIX Index, researchers have traditionally used time series models in the ARCH (AutoRegressive Conditional Heteroskedasticity) family, such as Generalized ARCH (GARCH), on historical numerical data (Ahoniemi, 2008; Liu, Guo, & Qiao, 2015; Majmudar & Banerjee, 2004). This family of econometric models attempts to capture the irregularity of error variation (i.e., heteroskedasticity), which is characteristic of stock market volatility. However, by definition, ARCH-family models build predictions from past data. The use of historical data alone for future prediction makes strong assumptions about autocorrelation and has been criticized as “driving with the rearview mirror.”

In recent years, a new concept called “quantamental” has emerged (Gray). Traditionally, financial predictions were divided into two styles: quantitative and fundamental. The quantamental approach aims to combine the quantitative approach with the fundamental approach. GARCH and related approaches are classic examples of the quantitative approach. In recent years, however, more quantamental approaches have been taken in the VIX prediction literature as well.

Researchers have used sentiment analysis on news to predict spikes in the VIX Index and found that negative sentiment is a good predictor of such spikes (Smales, 2014). However, that research focused on traditional sentiment analysis only and used linear regression as the prediction model. More sophisticated sentiment analysis and prediction models can be used to improve prediction results.

The “original” VIX Index is derived from the implied volatility in exchange-traded option prices. Researchers have developed an alternative index, called the NVIX Index, that derives investor fear from news data instead of from option prices (Manela & Moreira, 2017). Although this research is not about predicting the VIX index itself, it shows the growing interest in connecting unstructured data with the VIX index in the recent academic literature.

2.1.2 Modeling Financial Markets

Monitoring and capturing economic uncertainties is critical for finance and investment professionals. In the finance and economics literature, researchers have tried many approaches to model the risk and return of stock markets. Sharpe’s Capital Asset Pricing Model (CAPM) modeled the stock market as a whole and explained individual stock returns as having a certain degree of exposure to the return of the entire stock market plus residuals (Sharpe, 1964). Sharpe introduced the parameter β to represent the sensitivity of a stock to the market return and ε to represent the idiosyncratic movement of its price.
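
In its standard single-index regression form, the relationship described above can be written as

    R_i = α_i + β_i · R_m + ε_i,    where β_i = Cov(R_i, R_m) / Var(R_m),

with R_i the return of stock i, R_m the return of the market, and ε_i the idiosyncratic residual. This is the standard textbook statement of the model, not a formula specific to this praxis.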


In 1993, Fama and French went further and included in the market return model the book value relative to price and the market capitalization of a company. In 2015 they added investment aggressiveness and operating profitability to explain the risk and return left unexplained by the CAPM (Fama & French, 1993; Fama & French, 2015).

Fama and French derived these features from accounting data and incorporated them into pricing data, whereas the CAPM involves pricing data only. Their findings succeeded in explaining more of the risk and return of stocks than the single-factor CAPM, first with the introduction of the three-factor model in 1993, and still more with the introduction of the five-factor model in 2015. In other words, the original ε shrank further and further as different types of datasets were introduced into the process. This is a classic example of using structured reference data to enhance original structured data.

However, a drawback of this approach is that the meaning of the data is difficult for non-investment professionals (e.g., portfolio managers outside of the finance industry) to interpret. Without detailed knowledge of finance and accounting, it is generally difficult to navigate the massive amount of accounting data available and to use the specific variables that Fama and French used.

As a recent development of the Fama-French model, Calomiris and Mamaysky have used news data and its content to explain the risk and returns of the financial markets that cannot be explained by traditional variables (Calomiris & Mamaysky, 2017). They emphasized their use of a theory-neutral approach, whereas the target-phrase approach used by prior researchers such as Baker relied on prior domain knowledge (Baker et al., 2016; Calomiris & Mamaysky, 2017). They constructed this theory-neutral approach by applying topic modelling to news data. In this approach, news articles are broken down into different topic categories, and each topic is considered to have a certain level of impact on the ultimate market reaction to the news information. The model and results from such an approach, unlike the traditional Fama-French model, do not require any domain-specific knowledge to interpret.

Throughout this finance literature, modeling approaches, regardless of data type, can be separated into two categories: statistical and rule-based. Statistical models include the CAPM and the topic model used by Calomiris, which utilized unstructured news data. Rule-based models include Fama-French’s extended model using structured accounting data and the target-phrase model developed by Baker using unstructured news data.

As rule-based models require significantly more specialized knowledge for defining features and rules than statistical approaches do, it makes more sense to implement a statistical approach for MSA. Because the financial market is highly dynamic and evolves over time, it is virtually impossible for the intended users of MSA, portfolio managers, to have enough domain knowledge to know when an underlying change to the financial market or accounting rules requires an upgrade to the information filtering mechanism. Therefore, the approach of MSA needs to be not only statistical but also highly adaptive. This requirement naturally results in the architectural choice of incorporating machine learning to construct MSA.

Text mining can be considered a type of machine learning applied to a specific type of data, i.e., text. Text mining techniques have been used by various researchers to analyze news data and develop prediction systems for many different types of financial instruments, including equity stock prices (Hollum, Mosch, & Szlavik, 2013), exchange rates (Jin et al., 2013), and commodity (e.g., gold) price volatilities (Onsumran, Thammaboosadee, & Kiattisin, 2015). As with all other textual data, researchers using news data face the major challenge of resolving the high dimensionality problem. The approaches used by previous researchers generally fall into two categories: semantic analysis and sentiment analysis.

Semantic analysis focuses on the meanings of textual data. To model and extract semantics from a news article, a common technique is to use topic models. A typical topic model classifies a group of text documents (also known as a “corpus”) into a given number of topics by assuming a certain statistical distribution of the corpus over these topics. To model and predict financial markets, researchers have leveraged topic models to break down or classify a given news article into various topics and assess how individual topics relate to the subsequent market reaction after the news release (Hollum et al., 2013; Izumi, Goto, & Matsui, 2010; Izumi, Goto, & Matsui, 2011; Mahajan, Dey, & Haque, 2008; Nguyen & Shirai, 2015).

Sentiment analysis is similar to semantic analysis in the sense that the techniques are used to represent a textual document with only a small number of variables. However, instead of capturing the meanings of a textual document, sentiment analysis methods attempt to capture its tone. The underlying idea is simple. Given a piece of news that is expected to impact the financial markets, if it conveys a positive sentiment the market prices tend to rise, and vice versa.

Unlike the topic models used for semantic analysis, which assume a statistical distribution of words in a given textual dataset, common methods used for sentiment analysis generally focus on the tone indication of specific words or word lists (also commonly referred to as dictionaries). A sentiment word list or dictionary categorizes words into different sub-lists based on their sentiment indication (positive, negative, uncertain, etc.). The simplest version of sentiment analysis based on a word-list approach is to count the appearances of positive and negative words (Loughran & McDonald, 2011).

The more advanced models under both sentiment analysis and semantic analysis are reviewed and compared in more detail in the second half of the literature review section.

2.1.3 Managing crises

MSA is intended to monitor and capture spikes in the VIX index, which represent economic uncertainty events, and thereby to serve as a risk management tool. As a result, the focus of MSA is not to predict stock prices but rather to monitor stock market volatility.

Previous studies have commonly used market volatility as a proxy for market panic or highly pessimistic investor sentiment. This is what the Chicago Board Options Exchange (CBOE) attempted to measure when it developed the Volatility Index (VIX) (Exchange, 2009). Throughout the literature review process of this study, no existing text mining systems were found that make binary predictions related to the VIX.

In addition, prediction methods for economic uncertainties (market volatilities) are significantly different from those used for market prices. First, there are major differences between the probability distribution of market prices and that of market volatilities. When trying to predict market prices, systems are generally built to predict the up-or-down direction of the market, and each direction has almost equal probability. However, in an uncertainty (market volatility) prediction system, users expect the system to predict only rare market events, which by definition have low probability.

Second, the severity of the consequences of wrong recommendations differs between market price prediction systems and market volatility prediction systems. In the former scenario, a prediction system indicating the market price will increase when it actually decreases is just as harmful as the system indicating the market price will decrease when it actually increases. However, in the latter scenario, an uncertainty event monitoring system can do more damage by failing to capture a rare but highly impactful event than by generating false alarms. This is similar to the functionality of a fire alarm: it is more harmful if a fire alarm is not triggered when there is a fire than if it gives a false alarm.

Many previous studies have attempted to develop market volatility prediction systems (also referred to as early warning systems) by applying machine learning techniques to economic and price data (Ahn et al., 2011; Kim et al., 2004; Oh et al., 2006), and text mining techniques have been used to predict the financial markets with unstructured text data (Nassirtoussi et al., 2014). However, no prediction system for the broader equity market leveraging unstructured data has yet been developed. As a result, MSA will be an important complement to the numerical data-based systems.

The difference between a warning system and a market prediction system is how often the system is expected to provide actionable insights for users. A warning system is focused on predicting events with high impact but low occurrence (e.g. the bankruptcy of a company), while a prediction system deals with more frequent events with less impact during each occurrence, such as the price of a single stock going up or down.


In this praxis, the definition of the success of a warning system must be extended beyond prediction accuracy. To illustrate this point, an earthquake warning system will have extremely high accuracy if it keeps predicting “no earthquake,” but such a metric does not serve the purpose of an earthquake warning system. Instead, an earthquake detection system should be evaluated based on how well it predicted the rare earthquakes that did occur.

To evaluate a prediction system, the most important question to ask is:

1. How many correct predictions did the system make?

In comparison, to evaluate a warning system, all the following questions must be considered.

1. How many times did the system give warnings?

2. How many of the actual occurrences of the rare events did the system capture?

3. How far in advance did the system give warnings for the rare events?

This praxis is focused on building a warning system based on unstructured textual data.

In the financial and economics literature, the volatility index published by the CBOE has been the standard for measuring market fear and market stress. Following an initiative by the Bank of England to encourage the use of text mining in 2015 (Bholat et al., 2015), researchers have attempted to develop an alternative measure of market stress using textual data (Baker et al., 2016). Although many of the underlying techniques are shared between previous economics studies and the research conducted in this praxis, the existing research focuses on measuring economic uncertainties, whereas the research conducted in this praxis focuses on constructing a system to warn against economic uncertainties.

2.1.4 Engineering Management

MSA is built to help portfolio management teams measure and manage uncertainties in the overall economy using a text mining approach. In previous studies, text mining techniques have been applied in the domain of crisis management.

Notable research focuses in this field include the detection of natural disasters by analyzing unusual topics in real-time Twitter postings (Chae et al., 2014; Paul & Dredze, 2012; Sakaki, Okazaki, & Matsuo, 2010) and the detection of malware using API call data (Sundarkumar, Ravi, Nwogu, & Govindaraju, 2015). Researchers have also attempted to predict student grades or class fail rates by mining online class discussion forums (Ming & Ming, 2012).

In addition to applications in crisis management, textual analysis has been widely applied in the field of engineering management. Examples include automated knowledge acquisition for lessons-learned systems (Strait, Haynes, & Foltz, 2000), product design aid using social media textual data (Fan & Gordon, 2014), software lifecycle management (Hindle, Ernst, Godfrey, & Mylopoulos, 2011), helping requirements engineers read policy documents (Massey, Eisenstein, Anton, & Swire, 2013), and security risk assessment for software product evaluation (R. Das, Sarkani, & Mazzuchi, 2012).


2.2 Data

2.2.1 Data Hierarchy

The hierarchy chart below summarizes the features of textual data.

- Data
  - Structured data
  - Unstructured data
    - Picture
    - Text
      - Metadata
        - Source
          - Who
          - Credibility
        - Time
          - When
          - Novelty
        - Location
        - Accessibility
        - Attention
      - Content
        - Sentiment
        - Semantic
          - Entity
            - Relevance
            - Attribute
            - Relationship
          - Event
            - Relevance
            - Impact

At the highest level, data can be divided into structured and unstructured data (Katal, Wazid, & Goudar, 2013). One major form of structured data is quantitative data. Most existing scientific studies and widely used scientific methodologies are built around the analysis of structured data, and the field of engineering management is no exception.

The key objectives of this praxis are to expand the use of unstructured data in the field of engineering management and to build an engineering system to advance the risk management capability available to portfolio managers.


Among unstructured data, there are two main categories: pictures and text (Govers & Go, 2004). Picture data include static images and videos, which are pictures shown in rapid succession. Textual data include news data, regulatory filing data, internet postings (e.g., Twitter, blogs, etc.), and emails.

2.2.2 Textual Data: Metadata

All textual data have two main properties: metadata and content (Mihaila, Raschid, & Vidal, 2000). Metadata, also known as “the data of data,” constitute a group of attributes that describes the features of the data, while “content” is the textual data itself.

Metadata have five important constituents: source, time, location, accessibility, and attention.

• Source concerns where the textual data come from and how credible the origin of the information is.

• Time relates to when the textual data are published and how new the information is, as some news items are original articles published for the first time, while others simply report the same content at a later time.

• Location is about where the source of the textual data is located, which is an important part of Twitter data. As previously discussed, many crisis detection systems are based on Twitter’s reliable location data in addition to the contents of the Tweets. Studies have even found Twitter’s location data to be more informative than Twitter’s content data, because Twitter content is noisy and difficult to clean up.

• Accessibility concerns to whom the information is available. Not all textual data are available to the public, e.g., emails.

• Attention is a subjective measure of the impact of the textual data. Typical measures of attention include the number of views of a news item and the number of times a news article is retweeted on Twitter.

With regard to information sources, in the financial context common sources include widespread news media; the management of companies, who release earnings and associated company strategy information; and required regulatory filings, such as quarterly and annual filings for publicly traded companies in the U.S. (Kearney & Liu, 2014). In addition, equity research analysts at investment banks publish reports on major public companies, which are another major source of textual data (Kearney & Liu, 2014).

The credibility of news data can be broken down into four levels: news from established media; pre-news, which is the raw source material that reporters research; rumors published on less credible sites; and social media, where anyone can broadcast any information (Mitra & Mitra, 2010). Intuitively, the last two categories are the least credible (Mitra & Mitra, 2010).

2.2.3 Textual Data: Contents

Two types of information can be extracted from the content of textual data: sentiment and semantics (Nassirtoussi et al., 2014). For methodologies of information extraction, please refer to Section 2.3.2.


Sentiment relates to how positive or negative the content of the textual data appears to be, whereas semantic information can be further broken down into entities and events. An entity is a subject such as a person, a company, or a location, while an event is an action, such as an acquisition, an oil spill, or an accounting scandal. For each entity, there are different degrees to which the textual data is related to it, and there are separate pieces of information that describe what the entity is and how entities are related.

For example, in the sentence “Tokyo Marine buys HCC,” there are two entities, “Tokyo Marine” and “HCC.” “Buys” is the event, which refers to an acquisition of HCC by Tokyo Marine. For attributes, “Tokyo Marine” is a Japanese insurance company and “HCC” is a US insurance company. This information may or may not be presented directly in the same textual document but can be extracted from other sources. Additionally, the two companies are related through an acquirer-acquiree relationship.

Another example may be seen in the news headline “S&P upgrades the credit rating of General Motors.” Although there are two entities, the rating agency “S&P” and the auto maker “General Motors,” and an event, “credit rating upgrade,” the news is more about “General Motors” than “S&P.” Therefore, in such a case the news is more relevant to “General Motors” than to “S&P.”
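
As a brief illustration of how such entity extraction can be automated, the sketch below runs a pre-trained named entity recognizer over the second example. It is a minimal sketch assuming the spaCy library and its small English model (en_core_web_sm) are installed; it is not a component of MSA itself.

    # Minimal named-entity extraction sketch; assumes spaCy plus its small
    # English model (python -m spacy download en_core_web_sm).
    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("S&P upgrades the credit rating of General Motors.")

    # Print each detected entity and its predicted type (e.g., ORG).
    for ent in doc.ents:
        print(ent.text, ent.label_)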

2.3 Methodologies

2.3.1 Textual Analysis

Two mainstream textual analysis approaches are semantic analysis and sentiment analysis. Semantic analysis aims to capture the events that occur in a textual document. In the semantic space, each word can be identified as a feature and a distinct variable that must be addressed. This is the origin of the high dimensionality issue of textual data. As each word becomes a variable, a textual document naturally has a larger number of variables than traditional machine learning or statistical models can capture. As a result, researchers have used various techniques to reduce the dimensionality of a semantic structure at both the word level and the document level.

At the word level, a “bag of words” approach, which assumes that word order does not change the underlying meaning, is popular among researchers (Nassirtoussi et al., 2014). To organize words by their semantics, a common approach is to group words based on thesauruses, word roots, and sentiment dictionaries (Cheung et al., 2011).
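
To make the bag-of-words representation concrete, the sketch below converts two short documents into word-count vectors in which word order is discarded. It is a minimal sketch using scikit-learn, offered purely for illustration.

    # Minimal bag-of-words sketch using scikit-learn: each column is a
    # vocabulary word, each row a document, and word order is discarded.
    from sklearn.feature_extraction.text import CountVectorizer

    docs = ["markets fall as fear rises", "fear of falling markets rises"]
    vectorizer = CountVectorizer()
    counts = vectorizer.fit_transform(docs)

    print(vectorizer.get_feature_names_out())  # learned vocabulary
    print(counts.toarray())                    # word counts per document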

At the document level, topic models are used to extract the features of a given document. These models capture the word or topic distributions in a document. Common topic models include:

(1) latent semantic indexing (Deerwester, Dumais, Furnas, Landauer, & Harshman, 1990), which assumes that words that are close in meaning will occur in similar pieces of text, and calculates correlations of words as indicators to group words;

(2) probabilistic latent semantic indexing (Hofmann, 1999), which was developed on top of the latent semantic indexing model and assumes that documents are distributions over topics and topics are distributions over words; and

(3) latent Dirichlet allocation (Blei, Ng, & Jordan, 2003), which is conceptually similar to probabilistic latent semantic indexing and assumes that the document-topic distribution and the topic-word distribution both follow a Dirichlet process.

Among these different topic models, latent Dirichlet allocation has been shown to be suitable for large datasets. However, for probabilistic models such as latent Dirichlet allocation, the number of topics must be given as an input to the model. There are various methodologies for finding the optimal number of topics for a dataset (Arun, Suresh, Madhavan, & Murthy, 2010; Cao, Xia, Li, Zhang, & Tang, 2009; Deveaud, SanJuan, & Bellot, 2014; Griffiths & Steyvers, 2004).
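
The sketch below illustrates fitting a latent Dirichlet allocation model. It is a minimal sketch assuming the gensim library and a list of tokenized documents named tokenized_docs; note that the topic count must be chosen up front, exactly as discussed above.

    # Minimal LDA sketch; assumes the gensim library and tokenized_docs,
    # a list of token lists such as [["markets", "fall", ...], ...].
    from gensim.corpora import Dictionary
    from gensim.models import LdaModel

    def fit_lda(tokenized_docs, num_topics=10):
        dictionary = Dictionary(tokenized_docs)
        # Represent each document as sparse (word_id, count) pairs.
        corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]
        # num_topics must be supplied in advance, as noted above.
        lda = LdaModel(corpus=corpus, id2word=dictionary,
                       num_topics=num_topics, random_state=0)
        return lda.print_topics()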

Sentiment analysis is used to determine the severity of the event detected. In the sentiment space, each word or sequence of words can reflect sentiment, and this becomes an additional feature. Researchers have used various techniques to reduce the dimensionality of sentiment. These techniques first quantify the sentiment on a simple scale (e.g., positive vs. negative), so that simple arithmetic (e.g., the average value) can be applied to reduce the dimensions (e.g., to a single number). The Harvard GI and DICTION are the most popular general databases used to access lists of words determined to be positive or negative (Kearney & Liu, 2014). However, researchers have argued that the use of a general dictionary for sentiment analysis can sometimes be misleading, because words that are negative in a general context (e.g., “liability”) are not negative in a financial context (Loughran & McDonald, 2011). To address such concerns, researchers have developed a financial domain-specific dictionary (Loughran & McDonald, 2011).
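
The simplest word-list scoring described above can be sketched in a few lines. The two tiny word sets below are hypothetical stand-ins for a real dictionary such as Loughran and McDonald's, and the length normalization is likewise an illustrative choice.

    # Minimal dictionary-based sentiment sketch; the word lists are
    # hypothetical stand-ins for a real dictionary (e.g., Loughran-McDonald).
    POSITIVE = {"gain", "growth", "improve", "profit"}
    NEGATIVE = {"loss", "decline", "bankruptcy", "lawsuit"}

    def sentiment_score(tokens):
        pos = sum(1 for t in tokens if t in POSITIVE)
        neg = sum(1 for t in tokens if t in NEGATIVE)
        # Net tone on a single simple scale, normalized by length.
        return (pos - neg) / max(len(tokens), 1)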

2.3.2 Textual Data Preprocessing

Before further text mining approaches can be applied to news data, raw textual data generally need to be transformed into a structured format so that they can be fed into traditional machine learning algorithms (Oh et al., 2006). Common approaches for textual data pre-processing include tokenization, stop word removal, and stemming. These are explained in more detail below.

Tokenization is the process of breaking a document down into individual words. This is generally the first step in textual data preprocessing for studies using the “bag-of-words” approach. However, if each individual word is treated as a feature or variable in the subsequent analysis models, there will be too many variables for a common statistical or machine learning model to capture. As a result, the preprocessing steps after tokenization aim to reduce this high dimensionality problem.

After tokenization, the initial textual document can be represented by a sequence of individual words. At this stage, one way to represent the document is to count how many times each word occurs in the document, regardless of the order of appearance. In this process, stop words, which are common English words that appear in most documents and do not carry specific sentiment or semantic meanings, such as “the,” “and,” “this,” etc., are normally removed from the tokenized textual documents to reduce the number of potential variables.

Another common dimension reduction approach is to group words based on their word stems. For example, in some regular stemming algorithms, “go,” “goes,” “went,” and “gone” are considered to be the same word since they share the same word root. This approach has proven effective in reducing dimensions (Harrag, El-Qawasmah, & Al-Salman, 2011).
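
Combining the three steps above, a minimal preprocessing pipeline might look like the following sketch, which assumes the NLTK library with its "punkt" and "stopwords" resources downloaded.

    # Minimal preprocessing sketch; assumes NLTK with the "punkt" and
    # "stopwords" resources fetched via nltk.download().
    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer
    from nltk.tokenize import word_tokenize

    STOP_WORDS = set(stopwords.words("english"))
    STEMMER = PorterStemmer()

    def preprocess(text):
        tokens = word_tokenize(text.lower())              # tokenization
        tokens = [t for t in tokens
                  if t.isalpha() and t not in STOP_WORDS] # stop word removal
        return [STEMMER.stem(t) for t in tokens]          # stemming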

2.3.3 Feature Engineering


The finance literature has demonstrated various ways of analyzing unstructured data, such as regulatory filings and news, for market prediction.

Tetlock found that sentiment predicts stock returns (Tetlock, Saar‐Tsechansky, & Macskassy, 2008). He used the Harvard General Inquirer to model the sentiment embedded in news and modelled its relationship with companies’ accounting earnings and stock returns.

Chouliaras applied sentiment analysis to 10-K filings and found that the change in sentiment is a better predictor of stock returns than sentiment itself (Chouliaras, 2015). He used the Loughran and McDonald dictionary to determine sentiment scores.

Loughran and McDonald analyzed 10-K filings with various readability measures and found relationships between improved readability and increased trading volume (Loughran & McDonald, 2009). They used the Fog Index, the Flesch Reading Ease score, and a scoring framework they created based on SEC guidelines to measure readability.
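
For illustration, the Fog Index combines average sentence length with the share of complex (three-or-more-syllable) words. The sketch below uses a crude vowel-group heuristic as a hypothetical stand-in for a proper syllable counter.

    # Minimal Fog Index sketch; the syllable counter is a crude heuristic
    # (vowel groups), a hypothetical stand-in for a proper counter.
    import re

    def count_syllables(word):
        return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

    def fog_index(text):
        sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
        words = re.findall(r"[A-Za-z]+", text)
        complex_words = [w for w in words if count_syllables(w) >= 3]
        # Fog = 0.4 * (average sentence length + percent complex words)
        return 0.4 * (len(words) / len(sentences)
                      + 100.0 * len(complex_words) / len(words))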

Cohen et al. analyzed 10-K filings and found that changes in documents predict stock returns: firms that changed more of their text underperformed firms that changed less (Cohen, Malloy, & Nguyen, 2016). They used cosine distance, Jaccard distance, and a simple edit distance to measure document similarity.
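
For illustration, the cosine similarity between two versions of a document can be computed as in the sketch below, which uses scikit-learn TF-IDF vectors; this is a generic sketch, not the exact specification used by Cohen et al.

    # Minimal document-similarity sketch using TF-IDF vectors and cosine
    # similarity (generic; not the exact specification of Cohen et al.).
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def document_similarity(doc_a, doc_b):
        vectors = TfidfVectorizer().fit_transform([doc_a, doc_b])
        # 1.0 means identical term weights; 0.0 means no shared terms.
        return cosine_similarity(vectors[0], vectors[1])[0, 0]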

Antweiler and Frank analyzed online stock message boards and found that the number of message board postings in which a company is mentioned predicts its stock volatility (Antweiler & Frank, 2004).

Mamaysky and Glasserman analyzed news data and found that unusual news with negative sentiment predicts stock market volatility (Mamaysky & Glasserman, 2016). They used n-grams and conditional probability to determine the unusualness of a sentence and the Loughran and McDonald sentiment dictionary for sentiment analysis.

The finance literature provides a variety of features that predict the stock market. However, there are a couple of gaps in the research:

• The majority of methodologies are applied to regulatory filing data (e.g., 10-K filings); it is not clear whether such methodologies also work on news data.

• To demonstrate the robustness of their methodologies, researchers usually used linear regressions for modeling; more advanced modeling techniques (e.g., machine learning) were not used.

• Although these studies attempt to solve very similar problems (e.g., predicting the financial market), comparisons of results across studies, or the use of many of these features in combination, are missing from the body of knowledge.

Therefore, MSA will harness the wisdom of the finance literature as part of the feature engineering step and use more advanced modeling techniques (i.e., machine learning) to make predictions. Since MSA uses news data to predict VIX index spikes, features derived from the finance literature are strong candidates for the feature set.

2.3.4 Machine Learning

While textual analysis approaches such as sentiment analysis and semantic analysis model and quantify textual data, additional quantitative models may still need to be applied to arrive at conclusions for particular research goals. In this section, common machine learning models that have been used in studies of market prediction are reviewed and discussed.

Different machine learning algorithms have been used in market prediction by previous researchers. Popular algorithms include decision trees (Huang, Liao, Yang, Chang, & Luo, 2010; Peramunetilleke & Wong, 2002; Quinlan, 1993), naïve Bayes (Antweiler & Frank, 2004; Groth & Muntermann, 2011; John & Langley, 1995; Li, 2010), neural networks (Bollen, Mao, & Zeng, 2011; Evans, Pappas, & Xhafa, 2013; Pal & Mitra, 1992), logistic regression (Huang, Yang, & Chuang, 2008; Le Cessie & Van Houwelingen, 1992; D. L. Olson, Delen, & Meng, 2012), random forest (Breiman, 2001; Kumar & Thenmozhi, 2006; Patel, Shah, Thakkar, & Kotecha, 2015), and combination algorithms such as voting (S. R. Das & Chen, 2007; Kittler, Hatef, Duin, & Matas, 1998; Mahajan et al., 2008).

As previously discussed, there are different evaluation methods for estimating market return and market volatility. Two important indicators of prediction accuracy are false positives and false negatives. In the context of an economic uncertainty prediction system, a false positive occurs when the system signals a crisis when there is no such event. A false negative happens when the system fails to signal a crisis when one actually happens. Obviously, in a crisis warning system a false negative is more harmful to users than a false positive.

While the popular machine learning algorithms used by previous researchers maximize accuracy by default, they tend not to address the asymmetric nature of false positives and false negatives as described above. For any prediction system that aims to catch rare but severe events, this asymmetry must be addressed. To do so, instead of using an equal-weighted penalty setting, previous bankruptcy prediction systems have used the cost-sensitive settings of machine learning algorithms (Chen, Ribeiro, Vieira, Duarte, & Neves, 2011).
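
In scikit-learn, for example, such cost sensitivity can be approximated with class weights that penalize missing the rare class more heavily. The 10:1 weight below is an arbitrary illustrative choice, not a tuned value from this praxis.

    # Minimal cost-sensitive sketch using scikit-learn class weights; the
    # 10:1 penalty on the rare class is an arbitrary illustration.
    from sklearn.ensemble import RandomForestClassifier

    # Class 1 = rare uncertainty event (e.g., a VIX spike); class 0 = normal.
    # Weighting class 1 more heavily makes false negatives costlier.
    clf = RandomForestClassifier(
        n_estimators=100,
        class_weight={0: 1, 1: 10},
        random_state=0,
    )
    # Usage: clf.fit(X_train, y_train); clf.predict(X_test)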

2.3.5 Evaluation

Early warning systems were commonly evaluated qualitatively and manually, based on whether the system had captured significant, human-defined economic uncertainty events in actual financial markets (Ahn et al., 2011; Kim et al., 2004). Such a method lacks scalability and repeatability when monitoring the ongoing performance of a system. In comparison, MSA deals with those problems by utilizing a scalable, generalized, quantitative, and automated framework for system evaluation.

The availability of large amounts of market data allows the data to be split into a training dataset, which is used to train the prediction model, and a testing dataset, which is held out of the training sample, remains invisible to the model during the training phase, and is used to evaluate the model.

As market prediction systems predict the direction of the financial market, accuracy is the most widely used evaluation metric. Practitioners have found a system that is accurate more than 50% of the time to be useful (Nassirtoussi et al., 2015), so market prediction systems aim for that level of accuracy. However, for systems that deal with rare events such as economic crises, this cannot be the only evaluation standard, because such systems can easily achieve higher than 50% accuracy by always predicting that no rare event will occur. Therefore, additional performance measures are needed in addition to this metric.


Because MSA shares some characteristics with traditional knowledge elicitation systems, as both extract important concepts from large amounts of information (e.g., Cheung et al., 2011; Wang et al., 2008), it is important to study their evaluation methodologies in the design of MSA. In traditional knowledge elicitation systems, researchers have used the concepts of recall and precision for evaluation (Cheung et al., 2011). A previous concept elicitation system achieved a 68% recall rate with 9% precision, or a 26% recall rate with 17% precision, when extracting concepts from email data (Cheung et al., 2011); recall and precision had sharp trade-offs. When making comparisons based on both precision and recall, the F1 score is a useful measure that combines the two (Goutte & Gaussier, 2005).

The formal definitions of the F1 score, recall, and precision used by MSA are discussed in Chapter 3.
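
For reference, the standard definitions, stated in terms of true positives (TP), false positives (FP), and false negatives (FN), are

    precision = TP / (TP + FP)
    recall    = TP / (TP + FN)
    F1        = 2 · precision · recall / (precision + recall)

so the F1 score is the harmonic mean of precision and recall and is high only when both are high.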

2.4 Gaps in the existing literature

Dr. Katsuhiko Okada pointed out that differences in culture between AI researchers and finance researchers have limited the advance of the market prediction text mining literature (Okada, 2017; Okada, 2018). Among the differences he noted between AI researchers and finance researchers are the following:

- AI researchers:
  o focus on applying new prediction methodologies to market prediction problems
  o work with limited lengths of data
  o are not concerned with actual long-term performance in practice
- Finance researchers:
  o are obsessed with linear regression
  o apply 60 years of data for robustness

The observations made by Dr. Okada align with my own observations from conducting the literature review for this praxis. In this praxis's view, the right way to fit those two streams together is to pull the strengths of each and combine them. Features engineering is where domain knowledge becomes crucial; therefore, it makes sense to leverage methodologies found in the finance literature. On the other hand, the literature by AI researchers is heavily focused on prediction methodologies (i.e., machine learning); therefore, the prediction model parts will be drawn from the AI literature.


Chapter 3-Methods

This section of the praxis discusses the actual implementation of MSA. In an ideal world, there would be an unlimited number of human assistants always reading news on behalf of an intended user and warning him or her when there is a significant event in the news (data). A human would normally understand whether there is a significant event in the news by understanding either the underlying event itself (semantics) or how the event is described by the author of the news (sentiment). In order to derive such an understanding of the news content, there are several pre-processing steps that a human would conduct at a high level.

First, a human would ignore common words ("the," etc.) that are not highly meaningful for comprehension of the news article, since they appear so often in almost all articles. As we have seen, these words are called stop words in the natural language processing literature, and they are often removed as part of the pre-processing stage.

Second, a human would group together words with similar meanings to fit what he or she sees into a finite number of topics; for example, meanings of the words “bankruptcy” and “bankrupt” will be interpreted similarly when they appear in a news article. This process is called dimension reduction in the natural language processing literature.

Third, a human would reconcile each topic or event with his or her memory to determine whether each topic or event matters. For example, generally speaking, the bankruptcy of a company may matter more to the stock market than gossip from a company's employee. In similar studies, people have evaluated the validity of each topic or event by applying a manual filter or by running each topic or event against historical stock market movements. The former involves extensive manual effort and the latter involves handling abundant time series data.

Fourth, the author's sentiment is taken into account, as certain topics can have a positive or a negative interpretation. For example, if stock markets have gone up, the author may describe this as going dangerously high (negative) or as gaining momentum (positive); it is reasonable to assume that how the author describes the context (sentiment) will have a major impact on the readers. Authors often use different word choices to express their sentiment regarding the matter. In the natural language processing literature, a sentiment model based on a group of word choices is called the bag of words approach, and it is a highly popular sentiment model.

Lastly, a human would see contradicting information, good news and bad news, and rank the importance of different news items in his or her head based on experience; a highly positive news item about an executive saying something good about his or her company (good news) is negligible if there is news about the same company committing accounting fraud (very bad news). This complex tug of war between observed pieces of information, on which judgment is made through empirical evidence, can be modelled using machine learning.

Since market participants use a similar thinking process, making sense of what is going on in the world by reading news and taking action accordingly, the ability to monitor this news interpretation process can greatly improve the filtering power of MSA. A typical human's logic for interpreting a news item and taking a corresponding action is illustrated in Figure 1. Semantics (events) and sentiments are two important components of people's mental modeling of a news article.


Figure 1: Illustration of human logic

MSA aims to imitate this human thinking process through a combination of natural language processing techniques. The equivalent logic of MSA is shown in Figure 2 and is designed to mirror that in Figure 1. Just like a human would, MSA takes in the same news that the human reads and interprets it. Furthermore, MSA reconciles the historical news with historical prices, which are the result of collective actions taken by humans after reading news. Finally, MSA makes a prediction about the consequences of news based on its understanding of human market participants' collective actions in response to the news.


Figure 2: Illustration of MSA logic

This chapter provides a detailed explanation of how MSA works. The focus is on the methodologies used in MSA, in the sequence of data, pre-processing, features engineering, machine learning, and evaluation. The high-level construction of MSA is shown in Figure 3, and a more detailed explanation is given later in this chapter.


Figure 3: High-level construction of MSA

3.1 Data

Two types of data, unstructured textual data and structured numerical data, are used by MSA. The unstructured textual part of the data is the raw news data. In this case, The Wall Street Journal (WSJ) data, which can be downloaded from Factiva, is used as the main source of the data used to construct features representing the state of the world. The structured numerical data, the market price of the VIX index, comes from the exchange's (CBOE) official website and is directly used as a representation of the actual consequences of events.

Capturing significant events within news data is an information retrieval problem, and there are two common ways of evaluating how well the desired information is retrieved. One way is by expert opinion, where a group of experts select what they consider to be important events and use these as the "perfect answer"; important events retrieved by a system are then evaluated against the group's answer. The other way is a market response approach, which assumes the market is efficient as in the Efficient Market Hypothesis (EMH) and uses market participants' collective response to an event as the gauge of how important it is. MSA uses the latter approach, which is why it requires structured numerical financial market data as an input to create the gauge that represents the collective reaction of market participants to events.

There are two main advantages of using this approach. First, the market response approach is much more scalable, reproducible, and efficient than the expert opinion approach. Inviting a group of experts to participate in a study requires extensive time, and whom one invites can significantly change the results. Additionally, historical market data is a hard historical fact that can be reproduced by other researchers. Second, this praxis argues that the market response approach has no look-ahead bias, unlike the expert opinion approach. When training data and testing data are split based on time (for example, using data from 2010 to 2014 to train MSA and data from 2015 onward to test MSA), it is not possible to have experts make judgments based only on knowledge from before 2015. Because the experts themselves have already experienced 2015, it is physically impossible to eliminate the look-ahead bias introduced by their experience. Perceptions in the financial world change drastically; for example, researchers have shown that the term "leverage" was considered positive prior to the financial crisis but turned into a highly negative word after the financial crisis (Calomiris & Nissim, 2007; Calomiris & Nissim, 2014). To build on this example, experts may not consider the term "leverage" important based on their knowledge prior to the financial crisis, but their experience of the financial crisis can turn "leverage" into an important concept; this is the look-ahead bias that can be introduced by the expert opinion approach.


For the news data, The Wall Street Journal was chosen because previous researchers have suggested that it is not only highly reputable among investors but also has the largest circulation among daily financial publications in the United States (Tetlock, 2007). The Wall Street Journal has many sections, and not all sections are fed into MSA. Since previous researchers have suggested that using only financial news and only headlines can reduce noise (Huang et al., 2010), only the "What's News: Business & Finance" section of The Wall Street Journal is used for MSA. This section lists the important business and finance headlines for each day. Therefore, it is a convenient place to access multiple important headlines for the day simultaneously.

The actual news data comes from the Factiva database, a popular database for researchers to obtain historical news data (Moniz, 2016). However, previous researchers have reported that the academic edition of Factiva can pull only 100 news articles at a time (Moniz, 2016). Constrained by this practical scalability issue, MSA limits its scope to news from 2010 to Q1 of 2018. On the other hand, use of the "What's News: Business & Finance" section enables MSA to pull multiple news headlines from the same date at a time, and hence hundreds of news headlines across 100 days at once. The main body of each day's news data, which consists of a list of headlines, is split into individual headlines and passed on to the next steps of MSA with a date attribute. In other words, although Factiva provides a significant amount of metadata, such as entities and topics, only the date is used by MSA.

Since MSA aims to capture macro events related to the entire economy, the entity tags assigned by Factiva are ignored. Regarding topics, MSA self-generates topics based on statistical models using the methodologies described below, so the topic tags provided by Factiva are also ignored.

For the structured numerical portion, the financial market data are used to construct a flag that indicates the impact of the events described in the financial news data. The classic theory in modern finance, Eugene Fama's EMH, states that all available information is quickly reflected in the market price (Malkiel & Fama, 1970). Therefore, the use of market data in evaluating an event's impact is a common practice (Campbell, Lo, & MacKinlay, 1997; MacKinlay, 1997). Additionally, the CBOE VIX index is chosen because previous researchers have shown that the use of volatility as an uncertainty measure is a common practice (Oh et al., 2006), and the CBOE VIX index is a readily available and popular volatility measure. CBOE VIX index data covering the same 2010 to Q1 2018 period as the news data are used for MSA.

The CBOE VIX index, a continuous numerical dataset, is an implied volatility measure calculated from S&P 500 options (Exchange, 2009). For the purposes of MSA, the value of the CBOE VIX index is transformed into a binary flag reflecting whether there was market uncertainty, based on whether the change in the CBOE VIX index value exceeds a certain threshold. Although MSA allows users to define the threshold flexibly, this praxis uses an arbitrary threshold of a 10% weekly change for illustrative purposes. When the weekly change in the CBOE VIX index exceeds 10%, the uncertainty flag is recorded as True for the day; otherwise it is False.
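As a minimal sketch of this transformation (assuming the VIX history is available as a CSV file with hypothetical "Date" and "Close" columns):

```python
# Sketch: daily VIX closes to a binary weekly-change uncertainty flag.
# The CSV path and column names ("Date", "Close") are hypothetical.
import pandas as pd

vix = pd.read_csv("vix_history.csv", parse_dates=["Date"]).set_index("Date")

# Weekly change approximated as the percent change over 5 trading days.
vix["weekly_change"] = vix["Close"].pct_change(periods=5)

# Flag days on which the VIX rose by more than the user-defined threshold.
THRESHOLD = 0.10
vix["economic_uncertainty"] = vix["weekly_change"] > THRESHOLD
```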

The higher the threshold is set, the fewer events MSA captures, displaying warnings only for the more important ones. MSA gives users the flexibility to define the threshold. A 10% threshold is chosen here based on the assumption that portfolio managers, unlike professional traders, only care to know when there are events that could lead to major economic uncertainty in the market. To put the threshold in perspective, when Lehman Brothers announced bankruptcy on September 15, 2008, the CBOE VIX index was observed at 31.7 (Exchange, 2009). The bankruptcy event was reflected in the market price by the end of September 15, 2008, consistent with the EMH. Since the CBOE VIX index is observed based on S&P 500 options that are traded on the market, the change is calculated as the difference from the prior trading day's level on Friday, September 12, 2008, as September 15 falls on a Monday (Exchange, 2009). The CBOE VIX index was at 25.66 on September 12, 2008, and increased 24% on September 15, 2008, within one business day (Exchange, 2009). This type of event qualifies as an economic uncertainty under the illustrative MSA, as it exceeds the 10% threshold. Therefore, the sample event would appear in a database with three columns (Date, News, Economic Uncertainty) with values (Sep 15 2008, article X, True), where article X is the headline about the collapse of Lehman Brothers, as illustrated in Figure 4. The issue with this representation is that there are multiple headlines representing multiple events on the same date, but the VIX spike flag driven by one or more headlines is attached to all the headlines on that date. Therefore, instead of "does this particular event cause the VIX index to spike?," the relevant question is "does any of the date's events cause the VIX index to spike?" MSA uses machine learning approaches, as described below, to sort through multiple events from each date historically and determine the likelihood that the inclusion of certain events will cause the VIX index to spike.


Figure 4: Determination of economic uncertainty

The distribution of VIX index data (in the case of daily change) is shown in Figure 5; days below the 10% threshold account for 94% of all days, which effectively targets the warning to appear about once a month. By comparison, setting the threshold to 5% generates a warning about once a week, while setting the threshold to 20% generates a warning about once a quarter.

Figure 5: Cumulative distribution of VIX change


MSA is designed to capture the "most significant" events. Prior to the improved data processing method, the careful selection of the data source serves as a crucial step in the process. The Lehman Brothers bankruptcy would be included in the "What's News: Business & Finance" section of The Wall Street Journal, but the bankruptcy of a small startup might not. The choice of the "What's News: Business & Finance" section of The Wall Street Journal over, for example, a local newspaper assumes that the section functions as a natural initial filter, picking up significant events involving significant entities that are more likely to have an impact on the entire economy. The later processes then use sophisticated natural language processing and machine learning approaches to compare the events captured historically and produce an assessment of whether an event may cause economic uncertainty.

3.2 Pre-processing

The purpose of the pre-processing and features engineering (next section) steps is to transform unstructured data into a structured form that can be consumed by machine learning algorithms. In this praxis, the pre-processing step refers to the step that processes raw unstructured data into a cleaner and more normalized format; features extraction and dimension reduction are also done at this step. The actual feature representations, such as representing news in terms of topics and sentiment, are done at the later features engineering step.

Before features engineering techniques such as sentiment models can be applied to a headline, natural language processing techniques such as tokenization, stop word removal, and various normalizations must be applied. When a headline (usually a sentence) enters MSA, it is broken down into small pieces called tokens. MSA only uses unigrams (single words) to represent tokens. Alternatively, other systems may use bigrams (two words), trigrams (three words), or n-grams (n words) to represent tokens. Higher dimensional n-grams such as "corporate actions" (a bigram) have different meanings from the unigrams "corporate" and "actions"; therefore, incorporating higher dimensional n-grams can capture more semantic nuance. However, since the sentiment analysis portion of MSA is based on word dictionaries, which contain only unigrams, MSA also uses only unigrams. The implementation of MSA's sentiment analysis is explained in detail in the following section.

In addition to defining tokens as a word or a group of words, it is also possible to define tokens at a higher level, such as a sentence. However, the input data of this research is a headline, which is typically already a single sentence; therefore, it is not meaningful to define tokens at this level.

Stop word removal is a process whereby common words such as "a," "an," and "the" are identified and removed from analysis. Which words are considered stop words is language-dependent, and there are two general types of approaches: a statistical approach and a dictionary approach. A statistical approach, such as term frequency-inverse document frequency (TF-IDF), uses pools of general documents to benchmark how frequently each word usually appears in a document, and uses the relative frequency of words in the target document against the benchmark to identify common words. On the other hand, a dictionary approach simply provides a precompiled list of words that are common and should be removed from analysis. In this study, preference is given to a dictionary approach, as dictionary approaches generally provide better transparency than statistical approaches. This research uses the English stop word list provided by scikit-learn. Scikit-learn is used as a library in Python, as the topic modeling which leverages the LDA portion of MSA is also implemented in Python.
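A minimal sketch of these pre-processing steps, using scikit-learn's built-in English stop word list (the example headline is hypothetical):

```python
# Sketch: unigram tokenization, lower-casing, and dictionary-based stop word
# removal with scikit-learn's built-in English stop word list.
import re
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

def preprocess(headline):
    # Lower-case and keep alphabetic tokens only (drops numbers and punctuation).
    tokens = re.findall(r"[a-z]+", headline.lower())
    # Dictionary approach: drop tokens found in the precompiled stop word list.
    return [t for t in tokens if t not in ENGLISH_STOP_WORDS]

print(preprocess("The Dow fell 500 points as investors feared a crisis."))
# -> ['dow', 'fell', 'points', 'investors', 'feared', 'crisis']
```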

The normalization step further reduces the dimensionality of features, which in this research means tokens or words. MSA treats upper-case words and lower-case words the same by converting all words into lower case.

Another popular normalization approach is called stemming. Stemming is a process that turns different forms of a word, such as "ran," "runs," "running," and "run," all into "run." However, since MSA uses word dictionaries for sentiment analysis, and those dictionaries already contain the different forms of a word (for example, "abandon," "abandoned," "abandoning," "abandonment," "abandonments," and "abandons" are all in the dictionaries), MSA omits the stemming step.

After MSA cleans the sentences that enter the system, it applies further text mining techniques that classify the cleaned tokens based on their semantics to further reduce dimensionality. There are again two philosophically distinct approaches: statistical and dictionary. In this context, a statistical approach, such as the topic models approach used by MSA, leverages statistical techniques to cluster words into a number of clusters based on hidden meanings derived statistically from sentences.

There are two kinds of dictionary approaches. The first leverages a thesaurus to derive the semantic meaning behind words and clusters them based on the relationships between words defined in the thesaurus. The second uses sentence templates, where the list of events of concern is first identified, then various language rules are templated for each identified event.

MSA uses a topic model, which is a statistical approach. In this approach, words are clustered into a set number of groups based on their meaning. As discussed in the literature review section, there are various ways to conduct topic modeling. Topic models have evolved over time from latent semantic analysis to probabilistic latent semantic indexing to latent Dirichlet allocation (LDA); MSA uses LDA to model topics. Before LDA can be used, it requires the number of topics as an input. For determining the number of topics, there are four common approaches, developed by Griffiths, Deveaud, Cao, and Arun: Griffiths' and Deveaud's approaches are maximization algorithms, while those of Cao and Arun are minimization algorithms. MSA uses all four methods and compares their results to find the optimal number of topics.

After determining the optimal number of topics, the news data are split into two datasets: a training dataset and a testing dataset. The training dataset (in-sample data) consists of the data from 2010 to 2014, and the testing dataset (out-of-sample data) comprises the data from 2015 to Q1 2018. The training dataset is used to train the LDA model, and topics are constructed based on what is observed in the news data between 2010 and 2014. Although each date is assigned only one economic uncertainty flag, there are several news headers for each date. It is believed that each news header contains a distinct topic. Therefore, when training the LDA model, MSA models one topic per sentence by processing the group of tokens that appear in the same sentence as one record.
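A minimal sketch of this split and LDA training, assuming a hypothetical `headlines` DataFrame with `date` and `headline` columns (scikit-learn's LDA implementation is used here purely for illustration):

```python
# Sketch: time-based split and 40-topic LDA training, one headline per record.
# "headlines" is a hypothetical DataFrame with "date" and "headline" columns.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

train = headlines[(headlines["date"] >= "2010-01-01") & (headlines["date"] <= "2014-12-31")]
test = headlines[headlines["date"] >= "2015-01-01"]

# Lower-casing, unigram tokens, and English stop word removal mirror the
# pre-processing described above.
vectorizer = CountVectorizer(lowercase=True, stop_words="english")
X_train = vectorizer.fit_transform(train["headline"])

lda = LatentDirichletAllocation(n_components=40, random_state=0)
lda.fit(X_train)  # topics are constructed from in-sample (2010-2014) data only

# Each headline maps to a probability distribution over the 40 latent topics;
# for any single headline the topic probabilities sum to 1.
topic_dist_train = lda.transform(X_train)                                # in-sample
topic_dist_test = lda.transform(vectorizer.transform(test["headline"]))  # out-of-sample
```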


After the LDA model is trained, each header from both the training dataset and the testing dataset is entered into the model to generate a probability distribution over the latent topics generated by the LDA model. For each headline, the topic probabilities sum to 1. Since the LDA model is trained using the training dataset, the generated probability distribution is in-sample (data between 2010 and 2014) for the training dataset and out-of-sample (data between 2015 and Q1 2018) for the testing dataset. The distinction between in-sample and out-of-sample is important, as MSA is meaningful only if it can increase out-of-sample prediction power.

In addition to knowing which topic each header belongs to, it is also important to identify what each latent topic means. Latent means hidden, and the topics are referred to this way because they are statistically generated and require humans to investigate their content to assign an economic meaning. For each topic, the list of words that describes the topic can be extracted from the LDA model. Researchers evaluate the list of words and manually assign a name to the topic, and some researchers have used visual aids such as word clouds to assist in identifying topics.

The purpose of MSA is to identify macroeconomic uncertainties, and not all topics discussed in news headlines are about the entire economy. Some microeconomic topics that only impact individual companies, such as one company buying another or the naming of a CEO, should matter less in macroeconomic prediction. Some researchers (Jin et al., 2013) manually looked at each topic, determined whether the topic was relevant, and used the LDA model as a news filtering mechanism. However, MSA tries to eliminate such human judgment to avoid biases by deferring this analysis to be done automatically at the machine learning step.


3.3 Features Engineering

At the features engineering step, various factors are developed as inputs to the machine learning model. Table 1 shows a high-level summary of the methodologies used to create the factors. Methodology-wise, there are mainly dictionary-based methodologies and statistical methodologies; this praxis calls this highest-level grouping the approach. Each approach can be further split into categories. The main categories are sentiment, similarity, readability, and topic frequency. Changes in those categories usually constitute another category (e.g., change in sentiment from yesterday to today), except for similarity, as it already reflects change. The details of each category are discussed further in the following sub-sections.

Below category, there are method, metrics, and feature levels that further break down the methodologies. In total, there are 28 features. In the previous pre-processing section, the topic model was used to break news down into 40 topics. Adding the unbroken-down form, there are 41 content forms (1 aggregate form plus 40 topic forms). Each of the 41 forms can be combined with each of the 28 features to form 1,148 factors (e.g., Change in Sentiment - LM - Positive on Topic 12).


Approach | Category | Method | Metrics | Feature
Dictionary | Sentiment | LM | Positive | Sentiment-LM-Positive
Dictionary | Sentiment | LM | Negative | Sentiment-LM-Negative
Dictionary | Sentiment | LM | Uncertainty | Sentiment-LM-Uncertainty
Dictionary | Sentiment | LM | Litigious | Sentiment-LM-Litigious
Dictionary | Sentiment | LM | Constraining | Sentiment-LM-Constraining
Dictionary | Sentiment | LM | Superfluous | Sentiment-LM-Superfluous
Dictionary | Sentiment | LM | Net | Sentiment-LM-Net
Dictionary | Sentiment | GI | Positive | Sentiment-GI-Positive
Dictionary | Sentiment | GI | Negative | Sentiment-GI-Negative
Dictionary | Sentiment | GI | Net | Sentiment-GI-Net
Dictionary | Change in Sentiment | LM | Positive | Change in Sentiment-LM-Positive
Dictionary | Change in Sentiment | LM | Negative | Change in Sentiment-LM-Negative
Dictionary | Change in Sentiment | LM | Uncertainty | Change in Sentiment-LM-Uncertainty
Dictionary | Change in Sentiment | LM | Litigious | Change in Sentiment-LM-Litigious
Dictionary | Change in Sentiment | LM | Constraining | Change in Sentiment-LM-Constraining
Dictionary | Change in Sentiment | LM | Superfluous | Change in Sentiment-LM-Superfluous
Dictionary | Change in Sentiment | LM | Net | Change in Sentiment-LM-Net
Dictionary | Change in Sentiment | GI | Positive | Change in Sentiment-GI-Positive
Dictionary | Change in Sentiment | GI | Negative | Change in Sentiment-GI-Negative
Dictionary | Change in Sentiment | GI | Net | Change in Sentiment-GI-Net
Statistical | Similarity | Jaccard | - | Similarity-Jaccard
Statistical | Similarity | Cosine | - | Similarity-Cosine
Statistical | Readability | Fog | - | Readability-Fog
Statistical | Readability | Flesch | - | Readability-Flesch
Statistical | Change in Readability | Fog | - | Change in Readability-Fog
Statistical | Change in Readability | Flesch | - | Change in Readability-Flesch
Statistical | Topic Frequency | LDA | - | Topic Frequency-LDA
Statistical | Change in Topic Frequency | LDA | - | Change in Topic Frequency-LDA

Table 1: Factors generated in features engineering


3.3.1 Sentiment

Sentiment is perhaps one of the most common features in the natural language processing literature. MSA uses two different dictionaries: one is a financial services industry-specific dictionary and the other is a general dictionary. The industry-specific dictionary is the Loughran and McDonald dictionary (Loughran & McDonald, 2011) and the general dictionary is the Harvard General Inquirer (Stone & Hunt, 1963).

There are different types of sentiment. Six types are taken from the Loughran and McDonald dictionary (LM): "Positive", "Negative", "Uncertainty", "Litigious", "Constraining", and "Superfluous". A seventh sentiment, "Net", is derived from the difference between the "Positive" and "Negative" sentiments. In addition, two types of sentiment are taken from the Harvard General Inquirer (GI), "Positive" and "Negative", and a "Net" sentiment is calculated similarly to that of LM.

The scoring mechanism is generally the number of words in a specific dictionary category (e.g., LM positive) divided by the total number of words. Figure 6 illustrates an example of how sentiment scores are calculated in MSA. An unstructured textual news headline (a sentence) goes through the pre-processing step (e.g., removal of numbers, stop words, and punctuation) to form a list of words. That list of words is then scored against each type of dictionary.


Figure 6: Sentiment example

In addition to the different types of sentiment themselves, changes in sentiment are another form of feature, where, for example, the negative sentiment of today is compared against the negative sentiment of yesterday (i.e., the difference).
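A minimal sketch of this scoring, with a tiny hypothetical stand-in for a full dictionary category and assumed token lists from the pre-processing step:

```python
# Sketch: dictionary-based sentiment scoring. The word set below is a tiny
# hypothetical stand-in for a full dictionary category (e.g., LM negative);
# "tokens_today" and "tokens_yesterday" are assumed outputs of pre-processing.
LM_NEGATIVE = {"bankruptcy", "crisis", "loss", "fraud", "abandoned"}

def sentiment_score(tokens, dictionary):
    # Number of dictionary words divided by the total number of words.
    return sum(t in dictionary for t in tokens) / len(tokens) if tokens else 0.0

negative_today = sentiment_score(tokens_today, LM_NEGATIVE)
negative_yesterday = sentiment_score(tokens_yesterday, LM_NEGATIVE)
change_in_negative = negative_today - negative_yesterday  # "change" feature family
```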

3.3.2 Similarity

Similarity is a measure that describes how different a sentence is from a previous sentence. In the regulatory filing text mining literature, researchers argued that people are lazy and do not change texts until they have to due to big underlying events (Cohen et al., 2016). This praxis argues that the same principle may also apply to news data. News reporters have to write something about the financial market every day. This observation is especially true of topic 2, which is about stock indexes (see 4.1 for a detailed discussion of the topics generated). On the majority of days the stock indexes move around without a strong driver, and news reporters often just report the phenomenon without giving much explanation; this high similarity scenario is shown in Figure 7. On the other hand, when the market moves with a strong driver, news reporters report the stock index movements at much greater length, as they are now able to add greater context to why the market moved; this low similarity scenario is shown in Figure 8.

Figure 7: High similarity example


Figure 8: Low similarity example

The basic principle behind MSA is not to find factors that will always work and use those as features. Instead, MSA is focused on finding factors that may work according to the domain knowledge of the finance literature and adding them as features, and then letting the machine learning and the data decide which factors work at a given point in time. A big assumption made here is that the factors that work will change over time. Although similarity is in theory a promising factor in news, MSA defers the judgement of its usefulness to the data.

There are many ways of calculating similarity in the literature, but MSA uses the two similarity measures used by Cohen (Cohen et al., 2016): Jaccard distance and cosine distance.

Since similarity, unlike the other methods, is already a comparison between now and the past, it would not make sense to include a change in similarity.
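A minimal sketch of the two measures (a set-based Jaccard score and a count-based cosine score), written from their standard definitions rather than from Cohen et al.'s code:

```python
# Sketch: Jaccard and cosine similarity between two token lists (e.g., today's
# and the prior day's topic-2 text), from their standard definitions.
from collections import Counter
import math

def jaccard_similarity(tokens_a, tokens_b):
    a, b = set(tokens_a), set(tokens_b)
    return len(a & b) / len(a | b) if (a | b) else 0.0

def cosine_similarity(tokens_a, tokens_b):
    ca, cb = Counter(tokens_a), Counter(tokens_b)
    dot = sum(ca[t] * cb[t] for t in ca)
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```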


3.3.3 Readability

Readability is a measure that aims to capture the meaning behind how someone writes. There are multiple ways of measuring readability, and this praxis uses the same two measures (Flesch Index and Fog Index) as previous researchers (Loughran & McDonald, 2009). Both measures share two ideas: how wordy the sentences are and how complex the words used are. When complex and impactful events occur, it is fair to assume that it takes more words for news reporters to describe those events and that they use less common words. Examples of high readability and low readability are shown in Figure 9.

Figure 9: Readability example
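As a rough sketch of the two published formulas applied to a tokenized headline treated as a single sentence (the syllable counter is a crude vowel-group heuristic, so the scores are approximations):

```python
# Sketch: approximate Fog and Flesch scores for a tokenized headline treated
# as a single sentence. The syllable counter is a crude vowel-group heuristic.
import re

def count_syllables(word):
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def readability(words, sentences=1):
    n = len(words)
    syllables = sum(count_syllables(w) for w in words)
    complex_words = sum(count_syllables(w) >= 3 for w in words)  # 3+ syllables
    fog = 0.4 * (n / sentences + 100.0 * complex_words / n)
    flesch = 206.835 - 1.015 * (n / sentences) - 84.6 * (syllables / n)
    return fog, flesch
```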

3.3.4 Topic Frequency


Topic frequency refers to how many times each topic is mentioned in the headlines for the day. When certain macroeconomic topics (e.g., the eurozone crisis) are heavily discussed in the headlines, as opposed to when no such topics are discussed, there may be different implications for VIX movement prediction. Therefore, the frequencies of the topics generated in the previous pre-processing step are also included as factors in the model.
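A minimal sketch of this aggregation, reusing the hypothetical names from the LDA sketch above and assigning each headline to its most probable topic:

```python
# Sketch: daily topic frequency. Reuses the hypothetical "topic_dist_train"
# and "train" objects from the LDA sketch; each headline is assigned its
# most probable topic and counts are aggregated per day.
import numpy as np
import pandas as pd

assigned_topic = np.argmax(topic_dist_train, axis=1)  # dominant topic per headline
daily = pd.DataFrame({"date": train["date"].values, "topic": assigned_topic})

# Rows: dates; columns: topics 0..39; values: mentions of each topic that day.
topic_frequency = pd.crosstab(daily["date"], daily["topic"])
change_in_topic_frequency = topic_frequency.diff()  # the "change" variant
```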

3.4 Machine Learning

The unstructured textual news data is transformed into a list of structured features that are fed into the machine learning model as independent variables, while the VIX data is transformed into labels and serves as the dependent variable.

Machine learning is the step where cleaned and organized data enters the system and patterns in the data are recognized. Machine learning algorithms are divided into two categories: supervised learning and unsupervised learning. Supervised learning is a machine learning method with labels; during the training stage, data enters the system with labels, and when a new set of data is entered during the testing stage, the algorithm makes a judgment about which label the data should receive. Unsupervised learning is a machine learning method without labels; during the training stage, data is divided into a given number of groups, and when new data enters the system, the algorithm assigns the group the data belongs to.

During the pre-processing stage, MSA uses the LDA model to group news headers in the training dataset into a given number of topics. Then, when news headers in the testing dataset enter the system, the LDA model makes a judgment regarding which topic each news header belongs to. Therefore, the LDA model is an unsupervised learning approach and is also a machine learning algorithm.

However, for a typical market prediction system, the machine learning step usually refers to the step where a prediction about the market is made. Typically, a supervised learning algorithm is used and a certain reaction of the market, e.g. up or down, is used as a label, and each observation, which is a group of input variables, is associated with a label. In statistical terms, these groups of input variables are independent variables and the label is the dependent variable. MSA uses supervised learning during the machine learning stage, like most other market prediction systems.

A supervised learning algorithm can take two forms: regression and classification. A supervised learning algorithm using regression is trained on and predicts numerical data, such as a price or an expected return. A supervised learning algorithm conducting classification is trained on and predicts categorical data, such as movement up or down.

MSA predicts a Boolean result of True or False regarding whether there will be an economic uncertainty; therefore, MSA uses classification-based machine learning algorithms.

In addition to using regular classification-based supervised learning algorithms, MSA is an alert system that deals with prediction outcomes with high polarity: days on which the VIX index does not spike are far more frequent than days on which it does. This prediction problem is similar to those encountered in bankruptcy prediction, and the machine learning used in those problems is cost-sensitive (Chen et al., 2011). MSA also uses cost-sensitive machine learning. Cost-sensitive machine learning algorithms penalize one prediction outcome more than another. In MSA, a missed alert (false negative) is penalized more heavily than a false alert (false positive), as the consequences of not receiving an alert about actual economic uncertainty are more severe than the wasted effort of a false alert.
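One way to express this asymmetric penalty in scikit-learn is through class weights; the 5:1 ratio below is a hypothetical illustration, not the tuning used in this praxis:

```python
# Sketch: cost-sensitive classification via class weights. Penalizing the
# True (uncertainty) class more makes missed alerts costlier than false
# alerts; the 5:1 ratio is a hypothetical setting.
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(
    n_estimators=100,
    class_weight={False: 1, True: 5},  # missed spikes cost five times more
    random_state=0,
)
```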

Since cost-sensitivity and classification are just properties of certain types of supervised machine learning algorithms, the actual algorithms used are described in more detail below. Some of the most popular supervised learning algorithms used by researchers in market prediction problems are the decision tree (Quinlan, 1993), naïve Bayes (John & Langley, 1995), neural network (Pal & Mitra, 1992), logistic regression (Le Cessie & Van Houwelingen, 1992), random forest (Breiman, 2001), SVM, and nearest neighbors. Additionally, all seven machine learning algorithms are applied at the same time to the factors and combined using the Vote algorithm, which renders a judgment based on all seven algorithms and aggregates the results by majority vote.

The baseline model used by MSA is random forest, as it can score which factors were most useful when the model was built, providing transparency. This is an important metric that MSA reports to show users what is driving the market. Other machine learning models are also applied to the same data for comparison purposes.

A training dataset for a machine learning algorithm is prepared on an annual basis with five years' worth of past data on a rolling basis. For example, to predict data in 2015, the training dataset spans 2010 to 2014.
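A minimal sketch of this rolling re-training, assuming a hypothetical date-indexed `features` DataFrame of factors and a matching `labels` series, and reusing the cost-sensitive model from the sketch above; the random forest importances provide the transparency discussed earlier:

```python
# Sketch: annual rolling re-training on five years of history. "features" is
# a hypothetical date-indexed DataFrame of the 1,148 factors and "labels" the
# matching uncertainty flags; "model" is the cost-sensitive forest above.
for year in range(2015, 2019):
    train_mask = (features.index.year >= year - 5) & (features.index.year < year)
    test_mask = features.index.year == year

    model.fit(features[train_mask], labels[train_mask])
    predictions = model.predict(features[test_mask])

    # Transparency: rank the factors driving the market in this window.
    top_factors = sorted(
        zip(features.columns, model.feature_importances_),
        key=lambda pair: pair[1],
        reverse=True,
    )[:10]
```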


3.5 Evaluation

Market prediction systems are usually evaluated in one of three ways. The first approach is to build a trading strategy and see how well it performs; this is not a suitable approach for MSA, which is intended to make market predictions for warning, not trading. The second approach is to run a regression and evaluate in-sample metrics from the regression, such as R-squared. The third approach, the one used most widely in machine learning, is to make out-of-sample predictions and generate metrics based on those predictions. In evaluating MSA, the third approach is used.

In out-of-sample evaluation of regression machine learning algorithms, the mean squared error is widely used. For classification machine learning algorithms, alternative measures such as accuracy are used. Accuracy is defined in Equation (1).

\[ \text{Accuracy} = \frac{\text{True Positive} + \text{True Negative}}{\text{True Positive} + \text{False Positive} + \text{True Negative} + \text{False Negative}} \tag{1} \]

However, for warning systems such as MSA, accuracy is not sufficient. Since there is a limited number of Positives, a system that always predicts Negative will have close to perfect accuracy. Rewarding such behavior would defeat the purpose of a warning system.

Therefore, additional measures such as recall and precision need to be introduced (Goutte & Gaussier, 2005).

Recall is how many Positives were captured out of the total number of actual events, as described in Equation (2).

\[ \text{Recall} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Negative}} \tag{2} \]

Precision is how many Positives were captured out of the total number of prediction attempts made, as described in Equation (3).

\[ \text{Precision} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Positive}} \tag{3} \]

Since recall is about capturing more events and precision is about not making mistakes, they are somewhat in a trade-off relationship. Being more aggressive about prediction attempts will improve recall but hurt precision, and being more conservative will improve precision but hurt recall. Therefore, an integrated measure, the F1 score, is used to evaluate the prediction performance of the algorithms.

\[ \text{F1 Score} = \frac{2(\text{Precision} \times \text{Recall})}{\text{Precision} + \text{Recall}} \tag{4} \]
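These four metrics can be computed directly with scikit-learn, given arrays of actual flags and out-of-sample predictions (the names below are hypothetical):

```python
# Sketch: computing the metrics of Equations (1) through (4) with
# scikit-learn; "actual" and "predictions" are hypothetical boolean arrays.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

accuracy = accuracy_score(actual, predictions)
precision = precision_score(actual, predictions)
recall = recall_score(actual, predictions)
f1 = f1_score(actual, predictions)
```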

As a reference point, a previous concept elicitation system achieved a 68% recall rate while maintaining a 9% precision rate (f1 score of 16%), or a 26% recall rate with a 17% precision rate (f1 score of 21%) (Cheung et al., 2011). As another illustration of how difficult it is to maintain both high recall and high precision, a market prediction system much closer to MSA is also used as a reference. A currency forecast system, which aimed to forecast currency movements across multiple currency pairs, yielded f1 scores between 0.28 and 0.50 with limited data. However, the highest f1 score of 0.50 was achieved with only 1 Positive in 3 attempts (a 0.33 precision rate and a 1.00 recall rate).

3.6 Benchmark

Previous researchers have found that negative sentiment can best be used to estimate spikes in the VIX index, but they used commercially derived sentiment data from news (Smales, 2014). Two dictionary-based sentiment models are used here to simulate traditional sentiment analysis. The objective is to see whether MSA can outperform the benchmarks on f1 score. The proposed model is MSA based on the 1,148 features developed in this praxis and random forest. The first benchmark uses only the negative sentiment portion of Loughran and McDonald (Benchmark-LM) and the second benchmark uses only the negative sentiment portion of the Harvard General Inquirer (Benchmark-GI) as features; everything else is the same as MSA.


Chapter 4-Results

4.1 Pre-processing

Topic modeling is conducted during the pre-processing stage. For topic modeling, MSA uses latent Dirichlet allocation, which requires the number of topics as an input. Four algorithms are applied to the training dataset to make independent judgements about the optimal number of topics. The identification of this number is conducted in R using the "ldatuning" package (Murzintcev, 2015).

Figures 10 and 11 show the results from applying the four algorithms to the training dataset, which comprises 2010 to 2014. Griffiths and Deveaud present maximization algorithms, where higher values are better (Deveaud et al., 2014; Griffiths & Steyvers, 2004). Meanwhile, CaoJuan and Arun provide minimization algorithms, where lower values are better (Arun et al., 2010; Cao et al., 2009). The results show that the optimal number of topics for this dataset is around 40, which is also not far from previous studies (Izumi et al., 2010; Izumi et al., 2011; Jin et al., 2013; Mahajan et al., 2008). Therefore, MSA uses 40 topics for its LDA model in this research.

Figure 10: Results from maximization algorithms for finding natural number of topics


Figure 11: Results from minimization algorithms for finding natural number of topics

Table 2 shows the 40 topics and the top 5 keywords for each topic generated by the LDA model from the 2010-2014 training dataset. Since the topics created in this step are latent and only relevant keywords for each topic are shown, it is necessary to look at the keywords and manually interpret the topics. Since the results of such manual interpretation can be highly subjective, researchers have developed tools such as word clouds to aid interpretation of the LDA model. Fortunately, the keywords generated from the LDA model are relatively straightforward, and the topics are easily identifiable from the top 5 keywords for most topics.

It is worth noting that not all keywords have a direct meaning that can be used to identify the topic. Some keywords are very general; for example, "new" and "said" appear in many of the 40 topics and therefore provide limited insight. During the stop word removal process, such general keywords should have been removed based on the intention of the process. Since keywords such as "new" or "said" are not stop words in a general sense, they can only be captured and removed by an industry-specific dictionary. Just as sentiment dictionaries have general versions and industry-specific versions, and industry-specific versions are known to perform better (Loughran & McDonald, 2011), it may make sense to have an industry-specific stop word dictionary as well. However, there is no known widely used industry-specific stop word dictionary for finance or economics.

# | Topic Name | Keyword 1 | Keyword 2 | Keyword 3 | Keyword 4 | Keyword 5
1 | Automobile Industry | auto | maker | annual | chrysler | new
2 | Stock Indexes | dow | points | stocks | industrials | fell
3 | Corporate Reform | boeing | pressure | changes | industry | global
4 | Economic Growth | economy | growth | prices | economic | year
5 | Innovations | new | york | mart | wal | billion
6 | Strategic Partnerships | oil | toyota | wireless | crude | safety
7 | Budgets | billion | investors | bonds | companies | corporate
8 | Energy Industry | oil | gas | energy | fannie | natural
9 | New Product | google | apple | service | online | new
10 | Mortgage | credit | mortgage | settlement | backed | billion
11 | Financial Loss | financial | berkshire | crisis | galleon | buffett
12 | Legal Settlements | pay | million | agreed | settle | allegations
13 | Bankruptcy | bankruptcy | protection | said | filed | billion
14 | Corporate Executives | chief | ceo | executive | board | company
15 | European Banks | euro | banks | european | zone | spain
16 | Ownership Change | stake | firm | billion | sell | buy
17 | Fund | fund | sec | hedge | funds | firm
18 | Legal Opinions | news | department | corp | justice | said
19 | Eurozone Crisis | debt | euro | greece | zone | bailout
20 | Interest Rates | fed | rates | bond | low | buying
21 | Housing | home | market | housing | prices | sales
22 | Financial Transactions | morgan | year | public | offering | stock
23 | Merger & Acquisitions | track | year | bought | bank | billion
24 | Court Cases | court | merger | judge | foreign | ruled
25 | Wall Street | street | wall | financial | meeting | markets
26 | Technology Industry | new | microsoft | samsung | operating | nokia
27 | Business Contracts | food | said | network | airbus | boeing
28 | IPO | ipo | internet | fcc | new | gave
29 | Financial Regulations | bank | financial | regulators | new | probe
30 | Central Bank | bank | central | china | banks | japan
31 | Product | apple | oil | estate | real | iphone
32 | Earnings | profit | quarter | sales | posted | earnings
33 | Corporate Strategy | year | morgan | ceo | apple | firm
34 | Criminal Activities | trading | insider | drug | criminal | regulators
35 | Investor Sentiment | investors | bond | buying | market | fears
36 | Health Care | health | obama | workers | care | firms
37 | Buyout | billion | deal | buy | private | bid
38 | Stop Loss | said | noble | barnes | electric | car
39 | Labors | america | overhaul | labor | north | workers
40 | Financial Services Industry | goldman | financial | stepping | company | chairman

Table 2: Topics and associated keywords


4.2 Machine learning

MSA not only outperformed both benchmarks in terms of f1 score, but also did better on both precision and recall. To put this in perspective, previous concept elicitation systems designed for engineering management achieved a 68% recall rate with a 9% precision rate (f1 score of 16%) and a 26% recall rate with a 17% precision rate (f1 score of 21%) (Cheung et al., 2011); therefore, MSA performs highly competitively in comparison with its peers.

Model | F1 Score | Precision | Recall
Benchmark-LM | 13% | 18% | 10%
Benchmark-GI | 16% | 22% | 12%
MSA | 21% | 23% | 18%

Table 3: Performance of MSA and benchmarks

One of the benefits of using a random forest model is its ability to display the most important variables at a given point in time. When predicting data for 2015, the training dataset is prepared from data from 2010 to 2014, and the most important factors for that period can be extracted based on their explanatory power. The ten most important variables, based on average importance across the 2015 to 2018 predictions, are shown in Table 4. There are a couple of key observations about which kinds of factors have the greatest explanatory power over time:

 Sentiment-related factors showed better explanatory power.

 Instead of sentiment itself, the change in sentiment showed better explanatory power.

 Instead of sentiment on individual topics, sentiment on all headlines without applying the topic model showed better explanatory power.

 Instead of the general dictionary (GI), the industry-specific dictionary (LM) showed better explanatory power. This differs from the standalone benchmark models, where the GI-based benchmark performed better than the LM-based benchmark.

 The explanatory power of factors changed dramatically from year to year.

Factor | 2015 | 2016 | 2017 | 2018 | avg
All - Change in Sentiment - LM - negative | 0.71 | 1.49 | 0.88 | 0.87 | 0.99
All - Change in Sentiment - LM - positive | 0.87 | 0.66 | 0.70 | 0.57 | 0.70
All - Change in Sentiment - LM - litigious | 0.85 | 0.46 | 0.40 | 0.92 | 0.66
All - Sentiment - LM - negative | 0.35 | 0.67 | 0.67 | 0.48 | 0.54
All - Change in Sentiment - LM - net | 0.63 | 0.76 | 0.53 | 0.21 | 0.53
15 - Change in Sentiment - GI - negative | 0.46 | 0.64 | 0.38 | 0.61 | 0.52
8 - Change in Sentiment - LM - uncertainty | 0.90 | 0.55 | 0.23 | 0.34 | 0.51
24 - Similarity - Jaccard | 0.74 | 0.99 | 0.08 | 0.13 | 0.48
All - Sentiment - GI - negative | 0.18 | 0.26 | 0.41 | 1.03 | 0.47
11 - Change in Sentiment - GI - net | 0.60 | 0.53 | 0.27 | 0.47 | 0.47

Table 4: Ten most important factors by average random forest importance, 2015-2018


Chapter 5-Discussion and Conclusions

5.1 Discussion (interpretation of results)

In the previous chapter, it was shown that MSA provides higher VIX prediction power than traditional sentiment analysis, as measured by f1 score. The baseline MSA model leverages a variety of factors for features engineering and random forest for the prediction model. In this section, this result is broken down to investigate whether the choice of the particular machine learning model or the inclusion of a single factor drives the entire result.

The results of the MSA framework using other machine learning algorithms instead of random forest are shown in Table 5. Generally speaking, simpler models performed better than more complex models. The Voting algorithm, which is the combination of all the other algorithms, was among the worst performers. Additionally, even this praxis's baseline model, random forest, which is a collection of decision trees, performed only about as well as a single decision tree. The best performer of all the algorithms was a simple SVM.

Algorithm | F1 Score | Precision | Recall
Random Forest | 21% | 23% | 18%
Decision Tree | 23% | 18% | 30%
Naïve Bayes | 21% | 17% | 27%
SVM | 25% | 20% | 34%
Neural Network | 23% | 24% | 21%
Logistic Regression | 13% | 17% | 10%
Nearest Neighbors | 8% | 14% | 5%
Voting | 13% | 21% | 9%

Table 5: Performance of MSA framework using other machine learning algorithms


Table 4 showed the relative factor importance over time with all the factors present together. In contrast, Table 6 shows the performance when single factors are applied to the random forest model one by one; the benchmark models are subsets of this exercise.

Contrary to popular belief (Smales, 2014), negative sentiment is actually not the best predictor of spikes in the VIX index. Key observations are below.

 Change measures (i.e., change in sentiment) showed better prediction power.

 All headlines (before the topic model) showed better prediction power.

 The general GI dictionary showed better prediction power than the industry-specific LM dictionary when used standalone, which is the opposite of the results in Table 4, where the factors were used in combination.

Factor | F1 score
All - Change in Sentiment - GI - net | 23%
All - Change in Sentiment - GI - positive | 22%
All - Change in Sentiment - GI - negative | 21%
All - Sentiment - GI - net | 20%
All - Change in Sentiment - LM - positive | 19%
All - Change in Sentiment - LM - uncertainty | 19%
All - Similarity - Jaccard | 18%
All - Similarity - Cosine | 17%
All - Change in Sentiment - LM - litigious | 17%
All - Change in Sentiment - LM - net | 16%
All - Readability - Flesch | 16%
All - Sentiment - GI - negative | 16%
All - Change in Sentiment - LM - negative | 15%
All - Sentiment - LM - litigious | 15%
36 - Change in Readability - Flesch | 15%
All - Change in Readability - Flesch | 15%
All - Sentiment - GI - positive | 14%
All - Sentiment - LM - net | 14%
All - Change in Sentiment - LM - constraining | 14%
15 - Similarity - Cosine | 14%
29 - Change in Readability - Flesch | 14%
All - Sentiment - LM - negative | 13%
1 - Change in Readability - Flesch | 13%
3 - Change in Sentiment - LM - net | 12%
36 - Readability - Flesch | 12%
1 - Similarity - Jaccard | 12%
36 - Change in Sentiment - GI - positive | 11%
31 - Change in Readability - Flesch | 11%
36 - Similarity - Jaccard | 11%
3 - Change in Readability - Flesch | 11%

Table 6: Best performing standalone factors

5.2 Conclusions

It is important for portfolio managers to be continuously aware of micro and macro market risks. Micro market risks are individual companies' credit risks that impact any financial instruments they issue (e.g., corporate debt instruments, public equity, etc.). Macro market risks are economic or political events that have profound impacts across the entire financial markets.


MSA was created to help portfolio managers cope with economic uncertainties by enabling them not only to stay on top of the news but also to be informed of VIX variations in a timely manner. The time freed up from reading news can be given back to their main task of managing investment portfolios. MSA is designed in a flexible manner to account for the fact that different portfolio managers and different investment management companies have different levels of sensitivity to economic uncertainty and therefore will define economic uncertainty at different levels. The CBOE VIX is used as a quantitative measurement of economic uncertainty, and portfolio managers can set different thresholds on it to define their own levels of economic uncertainty.

MSA reads one of the most widely circulated newspapers, The Wall Street Journal, on behalf of portfolio managers. The "What's News: Business & Finance" section, which lists the most important news headlines of the day, is incorporated by MSA historically to investigate the relationship with economic uncertainties linked to increases in the VIX level. MSA is built on the idea of the Efficient Market Hypothesis, in that it assumes that the market price reflects the interpretation by market participants as a whole of the information published in the news. The benefits of this approach include the ability to train MSA in an unsupervised manner.

First, MSA takes in news headlines, cleans the raw textual data, and assigns each headline to one of 40 topics. In the second step, MSA relies on various natural language processing techniques found in the finance literature (e.g., sentiment, similarity, readability) to construct 1,148 factors on the news headlines as features for machine learning. Finally, machine learning algorithms are used to identify the subtle relationships between the 1,148 factors and spikes in the VIX index.


5.3 Contributions

In a sample study, the VIX spike threshold is set to 10%, which means that when a weekly increase in the VIX index is greater than 10%, the day is considered to be of interest. With an f1 score of 21%, MSA (i.e., with 1,148 factors) offered significantly better prediction results than benchmark models based on the popular belief that negative sentiment predicts spikes in the VIX index (13% and 16% f1 scores based on two different methodologies).

This praxis offers a multi-factor sentiment analysis method that is superior to traditional sentiment analysis methods. The methodology proposed in this praxis not only provides better prediction results, but also provides users with ongoing transparency into the underlying market drivers.

5.4 Future research directions

For future research, MSA can be applied to other areas and enhanced with different methodologies.

The ability to monitor uncertainties at the company level would be highly valuable to portfolio managers. Investment portfolios often include large amounts of financial instruments issued by individual companies (e.g., corporate debt instruments and public equities). Traditionally, large financial institutions employ an entire department of risk analytics staff to assess the credit risk of individual companies. This approach, although proven effective, is very expensive, and the risk assessment results generated by different people can be inconsistent. Therefore, developing an intelligent system is a more scalable, more cost-aware, and more effective solution to the problem of monitoring related companies. When monitoring companies, the dataset can be extended beyond news data, and it may be interesting to explore regulatory filing data (less timely but more formal) and Twitter data (more timely but less formal).

Additionally, MSA can be applied outside of financial prediction. Many companies hold a lot of internal unstructured textual data, such as emails and short message service (SMS) messages. Instead of surveying people about internal measures such as employee satisfaction, text mining emails and SMS messages using MSA may be able to report on those internal measures on an ongoing basis. Such a text mining based approach has the potential to provide more accurate measures, since it is based on all written communications, without incurring the extra work of repeatedly surveying people.

When it comes to methodology, other techniques can be considered for inclusion, both from the features engineering perspective and from the prediction model perspective. In the finance literature, there are more techniques yet to be explored under the MSA framework, such as Glasserman's unusualness measure. At the same time, there are advanced machine learning techniques, such as deep learning, that would be interesting to explore to enhance the MSA framework.

Although the current MSA and the underlying literature are for processing the English language, the statistical features engineering techniques (e.g., similarity) have the potential to extend to data in other languages. For other languages, such as Asian languages, new techniques are needed during the pre-processing step, such as tokenization and stemming. However, when it comes to features engineering and machine learning, many components of the existing MSA may work with limited modifications.


References

Ahn, J. J., Oh, K. J., Kim, T. Y., & Kim, D. H. (2011). Usefulness of support vector machine to develop an early warning system for financial crisis. Expert Systems with Applications, 38(4), 2966-2973.

Ahoniemi, K. (2008). Modeling and forecasting the VIX index. Retrieved June 22, 2018, from https://papers.ssrn.com/sol3/papers.cfm?abstract_id=1033812

Antweiler, W., & Frank, M. Z. (2004). Is all that talk just noise? The information content of internet stock message boards. The Journal of Finance, 59(3), 1259-1294.

Arun, R., Suresh, V., Madhavan, C. V., & Murthy, M. N. (2010). On finding the natural number of topics with latent Dirichlet allocation: Some observations. Paper presented at the Pacific-Asia Conference on Knowledge Discovery and Data Mining, 391-402.

Baker, S. R., Bloom, N., & Davis, S. J. (2016). Measuring economic policy uncertainty. The Quarterly Journal of Economics, 131(4), 1593-1636.

Bholat, D. M., Hansen, S., Santos, P. M., & Schonhardt-Bailey, C. (2015). Text mining for central banks. Retrieved May 15, 2018, from https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2624811

Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan), 993-1022.

Blumberg, R., & Atre, S. (2003). The problem with unstructured data. DM Review, 13(42-49), 62.

Bollen, J., Mao, H., & Zeng, X. (2011). Twitter mood predicts the stock market. Journal of Computational Science, 2(1), 1-8.

Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5-32.

Brenner, M., & Galai, D. (1989). New financial instruments for hedging changes in volatility. Financial Analysts Journal, 45(4), 61-65.

Brenner, M., & Galai, D. Hedging volatility in foreign currencies. 1(1), 53-58.

Calomiris, C. W., & Mamaysky, H. (2017). How news and its context drive risk and returns around the world. Retrieved August 7, 2017, from http://www.nber.org/papers/w24430

Calomiris, C. W., & Nissim, D. (2007). Activity-based valuation of bank holding companies. Retrieved December 12, 2017, from http://www.nber.org/papers/w12918

Calomiris, C. W., & Nissim, D. (2014). Crisis-related shifts in the market valuation of banking activities. Journal of Financial Intermediation, 23(3), 400-435.

Campbell, J. Y., Lo, A. W., & MacKinlay, A. C. (1997). The econometrics of financial markets. Princeton, NJ: Princeton University Press.

Cao, J., Xia, T., Li, J., Zhang, Y., & Tang, S. (2009). A density-based method for adaptive LDA model selection. Neurocomputing, 72(7), 1775-1781.

Chae, J., Thom, D., Jang, Y., Kim, S., Ertl, T., & Ebert, D. S. (2014). Public behavior response analysis in disaster events utilizing visual analytics of microblog data. Computers & Graphics, 38, 51-60.

Chan, S. W., & Franklin, J. (2011). A text-based decision support system for financial sequence prediction. Decision Support Systems, 52(1), 189-198.

Chen, N., Ribeiro, B., Vieira, A. S., Duarte, J., & Neves, J. C. (2011). A genetic algorithm-based approach to cost-sensitive bankruptcy prediction. Expert Systems with Applications, 38(10), 12939-12945.

Cheung, C. F., Lee, W. B., Wang, W. M., Wang, Y., & Yeung, W. M. (2011). A multi-faceted and automatic knowledge elicitation system (MAKES) for managing unstructured information. Expert Systems with Applications, 38(5), 5245-5258.

Chouliaras, A. (2015). The pessimism factor: SEC EDGAR Form 10-K textual analysis and stock returns. Retrieved February 25, 2018, from https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2627037

Cohen, L., Malloy, C. J., & Nguyen, Q. H. (2016). Lazy prices. Retrieved May 5, 2018, from https://papers.ssrn.com/sol3/papers.cfm?abstract_id=1658471

Coussement, K., & Van den Poel, D. (2008). Improving customer complaint management by automatic email classification using linguistic style features as predictors. Decision Support Systems, 44(4), 870-882.

Das, R., Sarkani, S., & Mazzuchi, T. A. (2012). Software selection based on quantitative security risk assessment. IJCA Special Issue on Computational Intelligence & Information Security CIIS, (1), 45-56.

Das, S. R., & Chen, M. Y. (2007). Yahoo! for Amazon: Sentiment extraction from small talk on the web. Management Science, 53(9), 1375-1388.

Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., & Harshman, R. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6), 391.

Deveaud, R., SanJuan, E., & Bellot, P. (2014). Accurate and effective latent concept modeling for ad hoc information retrieval. Document Numerique, 17(1), 61-84.

Evans, C., Pappas, K., & Xhafa, F. (2013). Utilizing artificial neural networks and genetic algorithms to build an algo-trading model for intra-day foreign exchange speculation. Mathematical and Computer Modelling, 58(5), 1249-1266.

Exchange, C. B. O. (2009). The CBOE volatility index - VIX. White Paper, 1-23. Retrieved June 5, 2017, from https://www.cboe.com/micro/vix/vixwhite.pdf

Fama, E. F., & French, K. R. (1993). Common risk factors in the returns on stocks and

bonds. Journal of Financial Economics, 33(1), 3-56.

Fama, E. F., & French, K. R. (2015). A five-factor asset pricing model. Journal of

Financial Economics, 116(1), 1-22.

75

Fan, W., & Gordon, M. D. (2014). The power of social media analytics. Communications

of the ACM, 57(6), 74-81.

Goutte, C., & Gaussier, E. (2005). (2005). A probabilistic interpretation of precision,

recall and F-score, with implication for evaluation. Paper presented at the European

Conference on Information Retrieval, 345-359.

Govers, R., & Go, F. M. (2004). Projected destination image online: Website content

analysis of pictures and text. Information Technology & Tourism, 7(2), 73-89.

Gray, W.Behavioral finance and investing: Are you trying too hard? Retrieved June 20,

2018, from https://alphaarchitect.com/2014/05/13/behavioral-finance-and-investing-

are-you-trying-too-hard/

Griffiths, T. L., & Steyvers, M. (2004). Finding scientific topics. Proceedings of the

National Academy of Sciences, 101(suppl 1), 5228-5235.

Groth, S. S., & Muntermann, J. (2011). An intraday market risk management approach

based on textual analysis. Decision Support Systems, 50(4), 680-691.

Harrag, F., El-Qawasmah, E., & Al-Salman, A. M. S. (2011). (2011). Stemming as a

feature reduction technique for arabic text categorization. Paper presented at the

Programming and Systems (ISPS), 2011 10th International Symposium On, 128-133.

Hatzius, J., Hooper, P., Mishkin, F. S., Schoenholtz, K. L., & Watson, M. W. (2010).

Financial conditions indexes: A fresh look after the financial crisis. Retrieved March

5, 2018, from http://www.nber.org/papers/w16150

76

Hindle, A., Ernst, N. A., Godfrey, M. W., & Mylopoulos, J. (2011). (2011). Automated

topic naming to support cross-project analysis of software maintenance activities.

Paper presented at the Proceedings of the 8th Working Conference on Mining

Software Repositories, 163-172.

Hofmann, T. (1999). (1999). Probabilistic latent semantic indexing. Paper presented at

the Proceedings of the 22nd Annual International ACM SIGIR Conference on

Research and Development in Information Retrieval, 50-57.

Hollum, A. T. G., Mosch, B. P., & Szlavik, Z. (2013). (2013). Economic sentiment: Text-

based prediction of stock price movements with machine learning and wordnet.

Paper presented at the International Conference on Industrial, Engineering and

Other Applications of Applied Intelligent Systems, 322-331.

Huang, C., Liao, J., Yang, D., Chang, T., & Luo, Y. (2010). Realization of a news

dissemination agent based on weighted association rules and text mining techniques.

Expert Systems with Applications, 37(9), 6409-6413.

Huang, C., Yang, D., & Chuang, Y. (2008). Application of wrapper approach and

composite classifier to the stock trend prediction. Expert Systems with Applications,

34(4), 2870-2878.

International Council on Systems Engineering. (2011). Systems engineering handbook: A

guide for system life cycle processes and activities International Council of Systems

Engineering.

77

Izumi, K., Goto, T., & Matsui, T. (2010). Analysis of financial markets' fluctuation by

textual information. Transactions of the Japanese Society for Artificial Intelligence,

25, 383-387.

Izumi, K., Goto, T., & Matsui, T. (2011). Implementation tests of financial market

analysis by text mining. Retrieved November 20, 2017, from

https://philpapers.org/rec/IZUITO

Jin, F., Self, N., Saraf, P., Butler, P., Wang, W., & Ramakrishnan, N. (2013). (2013).

Forex-foreteller: Currency trend modeling using news articles. Paper presented at the

Proceedings of the 19th ACM SIGKDD International Conference on Knowledge

Discovery and Data Mining, 1470-1473.

John, G. H., & Langley, P. (1995). (1995). Estimating continuous distributions in

bayesian classifiers. Paper presented at the Proceedings of the Eleventh Conference

on Uncertainty in Artificial Intelligence, 338-345.

Katal, A., Wazid, M., & Goudar, R. H. (2013). (2013). Big data: Issues, challenges, tools

and good practices. Paper presented at the Contemporary Computing (IC3), 2013

Sixth International Conference On, 404-409.

Kearney, C., & Liu, S. (2014). Textual sentiment in finance: A survey of methods and

models. International Review of Financial Analysis, 33, 171-185.

78

Kim, T. Y., Oh, K. J., Sohn, I., & Hwang, C. (2004). Usefulness of artificial neural

networks for early warning system of economic crisis. Expert Systems with

Applications, 26(4), 583-590.

Kittler, J., Hatef, M., Duin, R. P., & Matas, J. (1998). On combining classifiers. IEEE

Transactions on Pattern Analysis and Machine Intelligence, 20(3), 226-239.

Kumar, M., & Thenmozhi, M. (2006). Forecasting stock index movement: A comparison

of support vector machines and random forest. Retrieved April 3, 2018, from

https://papers.ssrn.com/sol3/papers.cfm?abstract_id=876544

Lavrenko, V., Schmill, M., Lawrie, D., Ogilvie, P., Jensen, D., & Allan, J. (2000).

(2000). Mining of concurrent text and time series. Paper presented at the KDD-2000

Workshop on Text Mining, , 2000 37-44.

Le Cessie, S., & Van Houwelingen, J. C. (1992). Ridge estimators in logistic regression.

Applied Statistics, , 191-201.

Li, F. (2010). The information content of forward‐looking statements in corporate

filings—A naïve bayesian machine learning approach. Journal of Accounting

Research, 48(5), 1049-1102.

Liu, Q., Guo, S., & Qiao, G. (2015). VIX forecasting and variance risk premium: A new

GARCH approach. The North American Journal of Economics and Finance, 34,

314-322.

79

Loughran, T., & McDonald, B. (2009). Plain english, readability, and 10-K filings.

Retrieved June 20, 2018, from

https://www.researchgate.net/profile/Bill_Mcdonald/publication/228458241_Plain_

English_Readability_and_10-K_Filings/links/5772a80d08aeeec3895410b0.pdf

Loughran, T., & McDonald, B. (2011). When is a liability not a liability? textual analysis,

dictionaries, and 10‐Ks. The Journal of Finance, 66(1), 35-65.

MacKinlay, A. C. (1997). Event studies in economics and finance. Journal of Economic

Literature, 35(1), 13-39.

Mahajan, A., Dey, L., & Haque, S. M. (2008). (2008). Mining financial news for major

events and their impacts on the market. Paper presented at the Proceedings of the

2008 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent

Agent Technology-Volume 01, 423-426.

Majmudar, U., & Banerjee, A. (2004). Vix forecasting. Retrieved October 19, 2017, from

https://papers.ssrn.com/sol3/papers.cfm?abstract_id=533583

Malkiel, B. G., & Fama, E. F. (1970). Efficient capital markets: A review of theory and

empirical work. The Journal of Finance, 25(2), 383-417.

Mamaysky, H., & Glasserman, P. (2016). Does unusual news forecast market stress? The

Office of Financial Research Working Paper, (16-04)

Manela, A., & Moreira, A. (2017). News implied volatility and disaster concerns. Journal

of Financial Economics, 123(1), 137-162.

80

Massey, A. K., Eisenstein, J., Anton, A. I., & Swire, P. P. (2013). (2013). Automated text

mining for requirements analysis of policy documents. Paper presented at the

Requirements Engineering Conference (RE), 2013 21st IEEE International, 4-13.

Mihaila, G. A., Raschid, L., & Vidal, M. (2000). (2000). Using quality of data metadata

for source selection and ranking. Paper presented at the WebDB (Informal

Proceedings), 93-98.

Ming, N., & Ming, V. (2012). (2012). Predicting student outcomes from unstructured

data. Paper presented at the UMAP Workshops,

Mitra, L., & Mitra, G. (2010). Applications of news analytics in finance: A review (tech.

rep.). optirisk-systems. com/papers. Opt0014.Pdf: OptiRisk Systems,

Moniz, A. (2016). Textual analysis of intangible information. Retrieved July 15, 2017,

from https://repub.eur.nl/pub/93001/

Mullainathan, S., & Shleifer, A. (2005). The market for news. The American Economic

Review, 95(4), 1031-1053.

Murzintcev, N. (2015). Ldatuning: Tuning of the latent dirichlet allocation (LDA) models

prameters. R Package,

Nassirtoussi, A. K., Aghabozorgi, S., Wah, T. Y., & Ngo, D. C. L. (2014). Text mining

for market prediction: A systematic review. Expert Systems with Applications,

41(16), 7653-7670.

81

Nassirtoussi, A. K., Aghabozorgi, S., Wah, T. Y., & Ngo, D. C. L. (2015). Text mining of

news-headlines for FOREX market prediction: A multi-layer dimension reduction

algorithm with semantics and sentiment. Expert Systems with Applications, 42(1),

306-324.

Nguyen, T. H., & Shirai, K. (2015). (2015). Topic modeling based sentiment analysis on

social media for stock market prediction. Paper presented at the Acl (1), 1354-1364.

Oh, K. J., Kim, T. Y., & Kim, C. (2006). An early warning system for detection of

financial crisis using financial market volatility. Expert Systems, 23(2), 83-98.

Okada, K. (2017). Application of AI techniques in finance - considerations. Journal of

Securities Analysts Journal, 55, 68-73.

Okada, K. (2018). Application of AI techniques in finance. Monthly Capital Market,

(393), 16-25.

Olson, B. A., Mazzuchi, T. A., Sarkani, S., & Forsberg, K. (2012). Problem management

process, filling the gap in the systems engineering processes between the risk and

opportunity processes. Systems Engineering, 15(3), 275-286.

Olson, D. L., Delen, D., & Meng, Y. (2012). Comparative analysis of data mining

methods for bankruptcy prediction. Decision Support Systems, 52(2), 464-473.

Onsumran, C., Thammaboosadee, S., & Kiattisin, S. (2015). Gold price volatility

prediction by text mining in economic indicators news. Journal of Advances in

Information Technology Vol, 6(4)

82

Pal, S. K., & Mitra, S. (1992). Multilayer perceptron, fuzzy sets, and classification. IEEE

Transactions on Neural Networks, 3(5), 683-697.

Patel, J., Shah, S., Thakkar, P., & Kotecha, K. (2015). Predicting stock market index

using fusion of machine learning techniques. Expert Systems with Applications,

42(4), 2162-2172.

Paul, M. J., & Dredze, M. (2012). A model for mining public health topics from twitter.

Health, 11, 16.

Peramunetilleke, D., & Wong, R. K. (2002). Currency exchange rate forecasting from

news headlines. Australian Computer Science Communications, 24(2), 131-139.

Quinlan, J. R. (1993). C4. 5: Programming for machine learning. Morgan Kauffmann, 38

Sakaki, T., Okazaki, M., & Matsuo, Y. (2010). (2010). Earthquake shakes twitter users:

Real-time event detection by social sensors. Paper presented at the Proceedings of

the 19th International Conference on World Wide Web, 851-860.

Schumaker, R. P., & Chen, H. (2009). Textual analysis of stock market prediction using

breaking financial news: The AZFin text system. ACM Transactions on Information

Systems (TOIS), 27(2), 12.

Sharpe, W. F. (1964). Capital asset prices: A theory of market equilibrium under

conditions of risk. The Journal of Finance, 19(3), 425-442.

83

Smales, L. A. (2014). News sentiment and the investor fear gauge. Finance Research

Letters, 11(2), 122-130.

Stone, P. J., & Hunt, E. B. (1963). (1963). A computer approach to content analysis:

Studies using the general inquirer system. Paper presented at the Proceedings of the

may 21-23, 1963, Spring Joint Computer Conference, 241-256.

Strait, M. J., Haynes, J. A., & Foltz, P. W. (2000). (2000). Applications of latent semantic

analysis to lessons learned systems. Paper presented at the Intelligent Lessons

Learned Systems: Papers from the AAAI Workshop, 51-53.

Sundarkumar, G. G., Ravi, V., Nwogu, I., & Govindaraju, V. (2015). (2015). Malware

detection via API calls, topic models and machine learning. Paper presented at the

Automation Science and Engineering (CASE), 2015 IEEE International Conference

On, 1212-1217.

Tetlock, P. C. (2007). Giving content to investor sentiment: The role of media in the

stock market. The Journal of Finance, 62(3), 1139-1168.

Tetlock, P. C., Saar‐Tsechansky, M., & Macskassy, S. (2008). More than words:

Quantifying language to measure firms' fundamentals. The Journal of Finance,

63(3), 1437-1467.

Wang, W. M., Cheung, C. F., Lee, W. B., & Kwok, S. K. (2008). Self-associated concept

mapping for representation, elicitation and inference of knowledge. Knowledge-

Based Systems, 21(1), 52-61.


Appendices

Appendix A: Factor Library

Various natural language processing functions used to build the features (multi-factors)

# Sentiment
import pandas as pd, s3fs, re
from sklearn.feature_extraction import stop_words

fs = s3fs.S3FileSystem()

# import and organize the Loughran and McDonald dictionary
lm = pd.read_csv('LoughranMcDonald_MasterDictionary_2016.csv')
sentimentLM = lm[['Word']].assign(
    Negative = lambda x: x.Word.where(lm.Negative > 0),
    Positive = lambda x: x.Word.where(lm.Positive > 0),
    Uncertainty = lambda x: x.Word.where(lm.Uncertainty > 0),
    Litigious = lambda x: x.Word.where(lm.Litigious > 0),
    Constraining = lambda x: x.Word.where(lm.Constraining > 0),
    Superfluous = lambda x: x.Word.where(lm.Superfluous > 0)
).drop('Word', axis=1).dropna(how='all')
del(lm)

# import and organize the Harvard General Inquirer dictionary
gi = pd.read_excel('inquirerbasic.xls')
sentimentGI = gi[['Entry']].assign(
    Negative = lambda x: x.Entry.where(~gi.Negativ.isna()),
    Positive = lambda x: x.Entry.where(~gi.Positiv.isna())
).drop('Entry', axis=1).dropna(how='all')
# strip the '#' word-sense markers from General Inquirer entries
sentimentGI.Positive = [i.split('#')[0] if type(i) is str else None for i in sentimentGI.Positive]
sentimentGI.Negative = [i.split('#')[0] if type(i) is str else None for i in sentimentGI.Negative]
sentimentGI = sentimentGI.drop_duplicates()
del(gi)

def getSentiment(txt, method='LM', metrics='net'):
    """
    Provide a sentiment score based on the Loughran and McDonald (LM)
    dictionary or the Harvard General Inquirer (GI) dictionary.

    Formula
    -------
    # of words in the specific dictionary (e.g., LM positive) / total # of words

    Parameters
    ----------
    txt : list of string
        input tokens (lowercased)
    method : string ('LM' or 'GI'), optional
        select a dictionary to use
    metrics : string ('positive', 'negative', 'net', 'uncertainty',
        'litigious', 'constraining', 'superfluous'), optional
        select a metric to generate (the last four are LM only)

    Returns
    -------
    score : double
        sentiment score based on the selected method
    """
    if method == 'LM':
        sentiment = sentimentLM.copy()
    elif method == 'GI':
        sentiment = sentimentGI.copy()
    else:
        raise ValueError('Wrong method')

    t = pd.Series(txt)
    positive = sum(t.isin(sentiment.Positive.str.lower())) / t.size
    negative = sum(t.isin(sentiment.Negative.str.lower())) / t.size
    net = positive - negative
    if method == 'LM':
        uncertainty = sum(t.isin(sentiment.Uncertainty.str.lower())) / t.size
        litigious = sum(t.isin(sentiment.Litigious.str.lower())) / t.size
        constraining = sum(t.isin(sentiment.Constraining.str.lower())) / t.size
        superfluous = sum(t.isin(sentiment.Superfluous.str.lower())) / t.size
    if metrics == 'positive': return positive
    elif metrics == 'negative': return negative
    elif metrics == 'net': return net
    elif method == 'LM' and metrics == 'uncertainty': return uncertainty
    elif method == 'LM' and metrics == 'litigious': return litigious
    elif method == 'LM' and metrics == 'constraining': return constraining
    elif method == 'LM' and metrics == 'superfluous': return superfluous
    else:
        raise ValueError('Wrong metrics')
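A brief usage sketch, assuming the dictionaries above have been loaded (the token list is illustrative, not from the praxis data):

# hypothetical example: lowercased tokens from a pre-processed headline
tokens = ['economy', 'slump', 'fears', 'weigh', 'markets']
print(getSentiment(tokens, method='LM', metrics='negative'))  # fraction of LM-negative words
print(getSentiment(tokens, method='GI', metrics='net'))       # GI positive share minus negative share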

# Similarity
def getSimilarity(x, y, method='cosine'):
    """
    Provide a similarity score between two token lists.

    Formula
    -------
    cosine : count for each word in x (dot product) count for each word in y
        / (number of words in x * number of words in y)
    jaccard : unique words in x (intersection) unique words in y
        / unique words in x (union) unique words in y

    Parameters
    ----------
    x : list of string
        input tokens
    y : list of string
        input tokens
    method : string ('cosine' or 'jaccard'), optional
        select a method to use

    Returns
    -------
    score : double
        similarity score based on the selected method
    """
    if method == 'cosine':
        s1 = pd.DataFrame(x, columns=['s1']).groupby('s1').s1.count()
        s2 = pd.DataFrame(y, columns=['s2']).groupby('s2').s2.count()
        ss = s1.to_frame().join(s2, how='outer').fillna(0).assign(ss=lambda x: x.s1 * x.s2).sum()
        return ss.ss / (ss.s1 * ss.s2)
    elif method == 'jaccard':
        return len(set(x).intersection(set(y))) / len(set(x).union(set(y)))
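A brief usage sketch with illustrative token lists:

a = ['rates', 'rise', 'markets', 'fall']
b = ['rates', 'rise', 'stocks', 'fall']
print(getSimilarity(a, b, method='jaccard'))  # 3 shared / 5 unique words = 0.6
print(getSimilarity(a, b, method='cosine'))   # dot product of counts / (4 * 4 words) = 3/16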

# Readability
from textstat.textstat import textstat

def getReadability(txt, method='fog'):
    """
    Provide a readability score based on the package textstat.

    Formula
    -------
    fog : 0.4 * (average # of words per sentence + percent of complex words)
    flesch : 206.835 - (1.015 * average # of words per sentence)
        - (84.6 * average # of syllables per word)

    Parameters
    ----------
    txt : list of string
        input tokens
    method : string ('fog' or 'flesch'), optional
        select a method to use

    Returns
    -------
    score : double
        readability score based on the selected method
    """
    # note: despite the 'fog' label and the formula documented above, this
    # implementation calls textstat.smog_index (the SMOG grade) rather than
    # textstat.gunning_fog; both are grade-level readability measures
    if method == 'fog': return textstat.smog_index(' '.join(txt))
    elif method == 'flesch': return textstat.flesch_reading_ease(' '.join(txt))

# registry of all implemented factors: level metrics, change metrics,
# similarity metrics, and readability metrics
nlp_metrics = {
    'Sentiment - LM - positive': {'function': getSentiment, 'method': 'LM', 'metrics': 'positive', 'isChange': False},
    'Sentiment - LM - negative': {'function': getSentiment, 'method': 'LM', 'metrics': 'negative', 'isChange': False},
    'Sentiment - LM - net': {'function': getSentiment, 'method': 'LM', 'metrics': 'net', 'isChange': False},
    'Sentiment - LM - uncertainty': {'function': getSentiment, 'method': 'LM', 'metrics': 'uncertainty', 'isChange': False},
    'Sentiment - LM - litigious': {'function': getSentiment, 'method': 'LM', 'metrics': 'litigious', 'isChange': False},
    'Sentiment - LM - constraining': {'function': getSentiment, 'method': 'LM', 'metrics': 'constraining', 'isChange': False},
    'Sentiment - LM - superfluous': {'function': getSentiment, 'method': 'LM', 'metrics': 'superfluous', 'isChange': False},
    'Sentiment - GI - positive': {'function': getSentiment, 'method': 'GI', 'metrics': 'positive', 'isChange': False},
    'Sentiment - GI - negative': {'function': getSentiment, 'method': 'GI', 'metrics': 'negative', 'isChange': False},
    'Sentiment - GI - net': {'function': getSentiment, 'method': 'GI', 'metrics': 'net', 'isChange': False},
    'Change in Sentiment - LM - positive': {'function': getSentiment, 'method': 'LM', 'metrics': 'positive', 'isChange': True},
    'Change in Sentiment - LM - negative': {'function': getSentiment, 'method': 'LM', 'metrics': 'negative', 'isChange': True},
    'Change in Sentiment - LM - net': {'function': getSentiment, 'method': 'LM', 'metrics': 'net', 'isChange': True},
    'Change in Sentiment - LM - uncertainty': {'function': getSentiment, 'method': 'LM', 'metrics': 'uncertainty', 'isChange': True},
    'Change in Sentiment - LM - litigious': {'function': getSentiment, 'method': 'LM', 'metrics': 'litigious', 'isChange': True},
    'Change in Sentiment - LM - constraining': {'function': getSentiment, 'method': 'LM', 'metrics': 'constraining', 'isChange': True},
    'Change in Sentiment - LM - superfluous': {'function': getSentiment, 'method': 'LM', 'metrics': 'superfluous', 'isChange': True},
    'Change in Sentiment - GI - positive': {'function': getSentiment, 'method': 'GI', 'metrics': 'positive', 'isChange': True},
    'Change in Sentiment - GI - negative': {'function': getSentiment, 'method': 'GI', 'metrics': 'negative', 'isChange': True},
    'Change in Sentiment - GI - net': {'function': getSentiment, 'method': 'GI', 'metrics': 'net', 'isChange': True},
    'Similarity - Jaccard': {'function': getSimilarity, 'method': 'jaccard'},
    'Similarity - Cosine': {'function': getSimilarity, 'method': 'cosine'},
    'Readability - Fog': {'function': getReadability, 'method': 'fog', 'isChange': False},
    'Readability - Flesch': {'function': getReadability, 'method': 'flesch', 'isChange': False},
    'Change in Readability - Fog': {'function': getReadability, 'method': 'fog', 'isChange': True},
    'Change in Readability - Flesch': {'function': getReadability, 'method': 'flesch', 'isChange': True}
}

def getNLPMetrics(newText, oldText=None):
    """
    Provide scores based on all the implemented NLP metrics.

    Formula
    -------
    see individual metrics

    Parameters
    ----------
    newText : string
        input text
    oldText : string, optional
        input text, needed if similarity and change metrics are desired

    Returns
    -------
    result : dict
        scores based on all the implemented NLP metrics
    """
    minNWords = 0
    if newText is None or type(newText) is not str:
        return None

    result = {}
    # lowercase, strip non-letter characters, collapse spaces, and drop English stop words
    newTextList = [i for i in re.sub('(?s) +', ' ', re.sub('(?s)[^a-z| +]', '', newText.lower())).split(' ')
                   if i not in stop_words.ENGLISH_STOP_WORDS]
    newTextList = [i for i in newTextList if i != '']
    if len(newTextList) <= minNWords:
        return None

    if oldText is not None:
        oldTextList = [i for i in re.sub('(?s) +', ' ', re.sub('(?s)[^a-z| +]', '', oldText.lower())).split(' ')
                       if i not in stop_words.ENGLISH_STOP_WORDS]
        oldTextList = [i for i in oldTextList if i != '']

    for k in nlp_metrics.keys():
        mt = nlp_metrics[k]
        if mt['function'] == getSentiment:
            if mt['isChange']:
                if oldText is not None and len(oldTextList) > minNWords:
                    result.update({k: mt['function'](newTextList, mt['method'], mt['metrics'])
                                   - mt['function'](oldTextList, mt['method'], mt['metrics'])})
            else:
                result.update({k: mt['function'](newTextList, mt['method'], mt['metrics'])})
        elif mt['function'] == getSimilarity and oldText is not None and len(oldTextList) > minNWords:
            result.update({k: mt['function'](newTextList, oldTextList, mt['method'])})
        elif mt['function'] == getReadability:
            if mt['isChange']:
                if oldText is not None and len(oldTextList) > minNWords:
                    result.update({k: mt['function'](newTextList, mt['method'])
                                   - mt['function'](oldTextList, mt['method'])})
            else:
                result.update({k: mt['function'](newTextList, mt['method'])})
    return result
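A usage sketch of the combined factor library (the sentences are illustrative):

today = 'Stocks tumbled as uncertainty over trade policy rattled investors.'
yesterday = 'Stocks rallied as trade talks progressed smoothly.'
metrics = getNLPMetrics(today, yesterday)  # dict of factor name -> score
for name, score in sorted(metrics.items()):
    print(name, round(score, 4))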


Appendix B: Optimum Number of Topics for Latent Dirichlet Allocation

Four algorithms to find the optimum number of topics for Latent Dirichlet Allocation

library("ldatuning")
library("RTextTools")

data <- read.csv('train_news.csv')
mxTrain <- create_matrix(data$News, language = "english",
                         removeNumbers = TRUE, stemWords = TRUE)
k <- FindTopicsNumber(mxTrain,
                      topics = seq(2, 100, 1),
                      metrics = c("Griffiths2004", "CaoJuan2009", "Arun2010", "Deveaud2014"),
                      mc.cores = 4)
write.csv(k, file = "k.csv", row.names = FALSE)
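The resulting k.csv can then be inspected to choose the number of topics: in the ldatuning package, the Griffiths2004 and Deveaud2014 metrics are to be maximized while the CaoJuan2009 and Arun2010 metrics are to be minimized, and FindTopicsNumber_plot(k) plots all four against the candidate topic counts.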


Appendix C: Main Multi-Factor Sentiment Analysis Code

Main script to prepare data, build features, and apply machine learning algorithms

import pandas as pd, numpy as np
from nlp_factors import *
from matplotlib import pyplot as plt
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn import linear_model
from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix, roc_curve, f1_score
import warnings
warnings.filterwarnings('ignore')

seed = 19900727

# load the news and VIX data
news = pd.read_csv('wsj.csv', encoding='latin-1', index_col=0, parse_dates=True)
vix = pd.read_csv('vixcurrent.csv', index_col=0, parse_dates=True, skiprows=1)
news.loc[:'2014-12-31'].to_csv('train_news.csv', index=False)

# fit LDA (40 topics) on the training period and assign each article its top topic
vct = CountVectorizer(stop_words='english', token_pattern=r'[a-zA-Z]{3,}')
dt = vct.fit_transform(news.News)
lda = LatentDirichletAllocation(40, learning_method='batch', random_state=seed)
lda.fit(dt[:news.loc[:'2014-12-31'].index.size])
master = news.assign(Topic=[lda.transform(dt.getrow(i))[0].argsort()[-1] for i in range(news.index.size)])

# concatenate each day's articles per topic, plus an 'All' column across topics
master_data = master.pivot_table(index='Date', columns='Topic', values='News', aggfunc='sum')\
    .join(master.groupby('Date').News.sum().rename('All'))

def applyMethods(data):
    # compute all NLP metrics for one day, comparing against the prior day's text
    y = pd.DataFrame()
    for section in list(np.arange(0, 40)) + ['All']:
        try:
            x = getNLPMetrics(data[1][section],
                              master_data.iloc[list(master_data.index).index(data[0]) - 1][section])
            y = y.append(pd.DataFrame(list(x.values()),
                                      index=[str(section) + ' - ' + i for i in x.keys()],
                                      columns=[data[0]]))
        except:
            pass
    return y.T

# daily topic frequencies plus all NLP metrics form the feature matrix
full_metrics = master.pivot_table(index='Date', columns='Topic', values='News', aggfunc=len)
full_metrics.columns = [str(i) + ' - Topic Frequency' for i in full_metrics.columns]
full_metrics = full_metrics.join(pd.concat([applyMethods(dt) for dt in master_data.iterrows()]))

from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import BernoulliNB
from sklearn.svm import LinearSVC
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import VotingClassifier

def runML(data_in, sft=5):
    # label: 1 when the VIX close sft trading days ahead exceeds 1.1x today's close
    predicted = []
    data = data_in.join(vix.assign(chg=lambda x: x['VIX Close'].shift(-sft) / x['VIX Close'])[['chg']] > 1.1,
                        how='inner')\
        .fillna(0).reset_index('Date').assign(year=lambda x: [i.year for i in x.Date]).set_index(['year', 'Date'])

    important = pd.DataFrame()

    # walk-forward validation: train on the trailing five years, test on the next year
    for yr in range(2015, 2019):
        train = data.loc[(yr - 5):(yr - 1)]
        test = data.loc[yr]

        train = train.fillna(0)
        cw = {0: train.query('not chg').size / train.size, 1: train.query('chg').size / train.size}
        cw2 = 'balanced'
        rf1 = RandomForestClassifier(random_state=seed, class_weight=cw)
        rf2 = DecisionTreeClassifier(random_state=seed, class_weight=cw)
        rf3 = BernoulliNB()
        rf4 = LinearSVC(random_state=seed)
        rf5 = MLPClassifier(random_state=seed)
        rf6 = LogisticRegression(random_state=seed)
        rf7 = KNeighborsClassifier()
        rf = VotingClassifier([('1', rf1), ('2', rf2), ('3', rf3), ('4', rf4), ('5', rf5), ('6', rf6), ('7', rf7)],
                              voting='hard')
        rf = rf1  # the random forest alone is used; the voting ensemble above is overridden
        rf.fit(X=train.drop('chg', axis=1), y=train.chg)
        predicted += rf.predict(test.drop('chg', axis=1)).tolist()
        important = important.append(pd.DataFrame(rf.feature_importances_, columns=[yr],
                                                  index=test.drop('chg', axis=1).columns).T)

    return f1_score(data.loc[2015:].chg, predicted),\
        precision_score(data.loc[2015:].chg, predicted),\
        recall_score(data.loc[2015:].chg, predicted),\
        accuracy_score(data.loc[2015:].chg, predicted),\
        important

# evaluate single-factor baselines, then the full multi-factor model
print(runML(full_metrics[['All - Sentiment - LM - negative']])[0])
print(runML(full_metrics[['All - Sentiment - GI - negative']])[0])
print(runML(full_metrics)[0])

# rank individual factors by F-score
pd.DataFrame([runML(full_metrics[[i]])[0] for i in full_metrics.columns],
             index=full_metrics.columns, columns=['f_score']).sort_values('f_score', ascending=False)

# top 10 features by average random forest importance (in percent)
runML(full_metrics)[-1].T.assign(avg=lambda x: x.mean(axis=1)).sort_values('avg', ascending=False).head(10) * 100
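In summary, runML labels each day as positive when the VIX close sft trading days ahead (five by default) exceeds 1.1 times the current close, trains on the trailing five years, tests on the following year in walk-forward fashion from 2015 through 2018, and reports the F-score, precision, recall, accuracy, and random forest feature importances.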


Appendix D: News Data (Sample)

Date News

1/1/2018 The unemployment rate in some metro areas stands near or even below 3%, and the tighter labor market is leading firms to raise pay to attract employees.

1/1/2018 Higher minimum wages will take effect in 18 states and almost two dozen municipalities this year.

1/1/2018 Retirement systems that manage money for public workers aren't pulling back on costly market bets.

1/1/2018 Proposed changes to offshore oil-drilling rules are raising questions over the role of safety regulators at the Interior Department.

1/1/2018 France's minister for economy and finance said Paris is looking to China and Russia to act as a counterweight to trade relations with the U.S. and Britain.

1/1/2018 The death of Playboy founder Hefner is ushering in a new era for the adult-entertainment enterprise, including possibly closing the U.S. print edition.

1/1/2018 The parent of Sears and Kmart hasn't run paid national television commercials since late November.

1/1/2018 A federal judge ruled in favor of financier Tilton in a racketeering lawsuit brought by managers of the Zohar investment funds.

1/1/2018 PwC was negligent in connection with one of the biggest bank failures of the financial crisis, a federal judge ruled.

1/1/2018 The price of Ripple, a digital currency, surged 50%, pushing its market valuation to $85 billion.

1/1/2018 PricewaterhouseCoopers was found negligent in connection with one of the biggest bank failures of the financial crisis, a federal judge has ruled, opening the auditor to the potential of millions of dollars in damages.

1/1/2018 Sears Holdings hasn't paid for any national TV spots for its struggling Sears and Kmart chains since late November, as its CEO shifts advertising to digital channels.

1/1/2018 The Trump administration's proposed changes to offshore oil drilling rules are raising fundamental questions over whether safety regulators at the Interior Department should also be concerned with promoting oil and gas production.


Appendix E: VIX Index Data (Sample)

Date VIX Open VIX High VIX Low VIX Close

1/2/2004 17.96 18.68 17.54 18.22

1/5/2004 18.45 18.49 17.44 17.49

1/6/2004 17.66 17.67 16.19 16.73

1/7/2004 16.72 16.75 15.5 15.5

1/8/2004 15.42 15.68 15.32 15.61

1/9/2004 16.15 16.88 15.57 16.75

1/12/2004 17.32 17.46 16.79 16.82

1/13/2004 16.6 18.33 16.53 18.04

1/14/2004 17.29 17.3 16.4 16.75

1/15/2004 17.07 17.31 15.49 15.56

1/16/2004 15.4 15.44 14.9 15

1/20/2004 15.77 16.13 15.09 15.21

1/21/2004 15.63 15.63 14.24 14.34

1/22/2004 14.2 14.87 14.01 14.71

1/23/2004 14.73 15.05 14.56 14.84

1/26/2004 15.78 15.78 14.52 14.55

1/27/2004 15.28 15.44 14.74 15.35

1/28/2004 15.37 17.06 15.29 16.78

1/29/2004 16.88 17.66 16.79 17.14

1/30/2004 16.55 17.35 16.55 16.63

2/2/2004 17.45 17.56 16.67 17.11
