Stock Market Prediction Through Technical and Public Sentiment Analysis Kien Wei Siah, Paul Myers


I. INTRODUCTION

Stock market price behavior has been studied extensively. It is influenced by a myriad of factors, including political and economic events, among others, and is a complex nonlinear time-series problem. Traditionally, stock price forecasting is performed based on technical analysis, which focuses on price action, that is, the process of finding patterns in price history. More recently, research has shown that public sentiment is correlated with stock market events [1], [2], [3]. This project proposes to study the potential of using both behavioral and technical features in stock price prediction models based on traditional classifiers and popular neural networks. We believe that behavioral data may offer insights into financial market dynamics in addition to those captured by technical analysis. An improved price forecasting model can yield enormous rewards in stock market trading.

A. Problem Statement

For this project, we focus on the Nikkei 225 (N225) stock index. N225 is the stock market index for the Tokyo Stock Exchange. It is a price-weighted index average of 225 top rated Japanese companies in the Tokyo Stock Exchange. With Japan being the third largest economy in the world currently, and Tokyo being one of the largest global financial centers, the N225 price index is certainly a critical financial indicator that is closely watched by traders and banks around the world.

We formulate the stock price prediction problem as a binary classification problem: whether the future daily returns of N225 will be positive (1) or negative (0), i.e. whether N225's closing price tomorrow will be higher (1) or lower (0) than today's closing price. Daily return is defined in Equation 1.

$$R_i = \frac{C_i - C_{i-1}}{C_{i-1}} \tag{1}$$

where $R_i$ is the daily return for the $i$-th day and $C_i$ is the N225 closing price for the $i$-th day. Daily return for day $i$ is essentially the percent change in closing price from day $(i-1)$ to day $i$. Future daily return for day $i$ is just $R_{i+1}$. Take note that to get the classification target, we must take the sign of the future daily return $R_{i+1}$ rather than its numerical value.

As described in the introduction, we will investigate the use of price histories and public sentiment indicators available up to day $i$ to predict $\text{sign}(R_{i+1})$. Subsequent sections cover the data collection process. Since this is framed as a classification task, we may use classification accuracy as a metric for evaluating the performances of various models.

II. DATA COLLECTION AND FEATURE GENERATION

A. Price History

We queried daily historical prices of N225, for all trading days spanning January 1, 2004 to December 31, 2014, from Yahoo! Finance. However, financial time series are well-known to be non-stationary, with means, variances and covariances that change over time. Such non-stationary data are difficult to model and will likely give poor classification accuracy when directly used as features. By viewing the daily prices as random walks, we attempted to stationarize the price history (through differencing and lagging) before using them as predictors. To this end, we used three main types of conventional price technical indicators as features [13]:

1) n-Day Returns

$$R_{i,n} = \frac{C_i - C_{i-n}}{C_{i-n}} \tag{2}$$

where $R_{i,n}$ is the $i$-th day return with respect to the $(i-n)$-th day, or the percentage difference between the $i$-th day closing price $C_i$ and the $(i-n)$-th day closing price $C_{i-n}$. Positive values imply that the N225 index has risen over the $n$ days. For $n = 1$, we get the simple daily returns equation (Equation 1).

2) n-Day Returns Moving Average

$$MA_{i,n} = \frac{R_{i,1} + R_{(i-1),1} + \cdots + R_{(i-n),1}}{n} \tag{3}$$

where $MA_{i,n}$ is the average returns over the previous $n$ days, and $n > 1$ because a one day average is the day's return itself.

3) n-Time Lagged 1-Day Returns

$$R_{i,1}, R_{(i-1),1}, \ldots, R_{(i-n),1} \tag{4}$$

where $R_{(i-n),1}$ is the $(i-n)$-th day's 1-day return.

By varying $n$, we obtain different numbers of features which contain varying degrees of information about price trends and past prices. This is one of the multiple parameters we will vary and decide upon using cross validation.

B. Public Sentiment Indicators

In addition to conventional technical indicators, we also looked at public sentiment indicators. The theory of behavioral economics postulates that emotions play a significant role in influencing the economic decisions of individuals. Research has shown that this applies to societies at large as well. In fact, Bollen et al. used Twitter messages as indicators of public mood states and demonstrated that they were correlated to, and predictive of, the Dow Jones Industrial Average over time [2]. In another study, Preis et al. found patterns in Google query volumes, for search terms related to finance, that constitute 'early warning signs' of stock market movements. They hypothesize that investors search for information online about the markets before eventually deciding whether to buy or sell stocks. This indicates that search query data from Google Trends may contain valuable predictive information about the information gathering process that precedes trading decisions in the stock market [3]. This project takes inspiration from these two widely cited studies and attempts to integrate some aspects of public sentiment analysis as part of our features, in the hope that combining behavioral data with technical price indicators will lead to improved performance.

To this end, we used behavioral data from two sources: Bloomberg Businessweek and Google Trends. We were unable to replicate Bollen et al.'s study using Twitter messages as Twitter has restricted public access to very limited amounts of data. Other Twitter data sources required paid subscriptions. Therefore, similar to [3], we used trends in Google query volume for finance-related search terms as a proxy for public sentiment. Further, we wrote a script to crawl a free online news archive, Bloomberg Businessweek, for articles published from 2004 to 2014: approximately 210,000 articles were gathered. It is hoped that the state of the economy and prevalent stock market conditions can be extracted through sentiment analysis from these articles.

For Google Trends, we focused on the daily search volumes of five finance-related search terms that showed the greatest predictive potential for stock market forecasting in [3], namely economics, debt, inflation, risk and stocks. Google Trends scores the daily query volumes on a scale of 0-100, normalized with respect to the peak within the date range (2004 to 2014 in our case).

Subsequently we performed a relatively simple sentiment analysis on the news articles crawled from Bloomberg Businessweek to obtain daily sentiment scores. First, we obtained lists of "positive" and "negative" words that are both financial-specific and general. For financial-specific words, we used the lists published by McDonald, originating from his research on sentiment analysis on financial texts [4]. This is particularly relevant in our case as words with positive meanings in the general context may actually be negative in the financial context. For the general case, we used the lists of positive and negative opinion words or sentiment words by Hu and Liu [5]. To compute the sentiment score for each article, we used the following equation:

$$\text{Score} = \frac{POS - NEG}{POS + NEG} \tag{5}$$

where $POS$ refers to the number of positive words (from the lists obtained earlier) counted in the article, and $NEG$ refers to the number of negative words (from the lists obtained earlier) counted in the article. The positive and negative words were counted as many times as they appear. A score of +1 implies an entirely positive article, 0 (when no words are counted) implies neutral, and -1 implies an entirely negative article. Daily scores were obtained by averaging over all the articles in that day. Just computing this score for the 210,000 articles crawled took a few days and had to be done in batches. It is likely that a more sophisticated sentiment analysis would have required more time and would have been infeasible within the time frame of this project.

C. Missing Data and Look-Ahead Bias

By crawling for our own data, we inevitably face the problem of missing data, e.g. price histories for some days are missing, and the Bloomberg Businessweek archive does not have articles for every trading day. In dealing with this issue, we have three options: mean imputation, interpolation based on previous and next data point, or sample and hold. We opted to go with the last option (using the last observed valid data point) as we felt that mean imputation and interpolation will introduce some extent of look-ahead bias (using information that would not have been available during that time). For instance, the interpolation of prices or returns implicitly uses the future price, i.e. the interpolated point will be higher if the next price is high. This will lead to inaccurate results. While there are certainly more sophisticated and effective techniques of dealing with missing data, we considered only the simpler methods in view of time constraints.

III. RECURRENT NEURAL NETWORK

A. Vanilla Recurrent Neural Network

Recurrent Neural Networks (RNNs) have shown great potential in many Natural Language Processing tasks (e.g. machine translation, language models, etc.) and are becoming increasingly popular. Unlike vanilla Neural Networks (NNs), an RNN's network topology allows it to make use of sequential information. This is a natural fit for stock market prediction, a time series problem: knowing previous days' prices may help us predict tomorrow's price.

Fig. 1. Recurrent Neural Network topology [6].

As illustrated in Figure 1, an RNN performs the same operations, with the same weights, for each element of the sequence. It takes into account the previous step's state ($s_{t-1}$) while computing the output for the current step. This recurrent property gives it a 'memory'. The relevant equations are as follows:

$$s_t = \tanh(U x_t + W s_{t-1}) \tag{6}$$

$$o_t = \text{sigmoid}(V s_t) \tag{7}$$

where $U$, $W$ and $V$ are the weight matrices used across all time steps, $x_t$ is the input at time step $t$, $s_t$ is the hidden state

at time step $t$ and $o_t$ is the output at time step $t$. We may think of $s_t$ as the 'memory' of the RNN which contains information about inputs and computations of all the previous time steps (subject to the vanishing gradient problem elaborated below). As described earlier, the hidden state, and through it the output, is computed from the previous hidden state $s_{t-1}$ and the current input $x_t$ (Equation 6). The first hidden state $s_0$ is typically initialized with zeros.

In our stock market prediction problem, we can think of $x_t$ as the feature vector of each day (composed of features from Section II). Figure 1 has outputs at all time steps, but in our case, we are really only concerned with the output at the final step, which is the prediction of whether the price will rise or fall. In other words, we input feature vectors from the previous $t$ days into the RNN sequentially, and $o_t$ (a sigmoid output, Equation 7) represents the probability of the price rising (as opposed to falling) on the $(t+1)$-th day. This allows it to capture more temporal information than classifiers (e.g. Support Vector Machines, NNs, Logistic Regression) that only take the input of one time step.

Training for RNNs is similar to that for vanilla NNs: backpropagation. However, for RNNs, we backpropagate through time to obtain $\frac{dL}{dU}$, $\frac{dL}{dW}$ and $\frac{dL}{dV}$. The idea is to 'unfold' the RNN across time (similar to that in Figure 1) and do backpropagation as if it were a normal NN. Since this is a classification problem, we can use the binary cross entropy loss as the error function $L$. Because we are only looking at the final output, we can mask all other outputs and only consider the loss from the final output. From here, we may use stochastic gradient descent to minimize the error.

There is one caveat: the vanishing gradient problem. As we know from NN backpropagation in class, the gradients $\frac{dL}{dU}$, $\frac{dL}{dW}$ and $\frac{dL}{dV}$ are derived from the chain rule, meaning they are products of multiple derivatives. These chain rule derivatives have upper bounds of 1 (apparent from the tanh and sigmoid activation functions used). This means that gradient values can shrink exponentially fast and 'vanish' after a few time steps, particularly when the neurons are saturated. Because gradients 'vanish' within a limited number of time steps, the vanilla RNN model typically has issues learning long range dependencies, i.e. the RNN will not learn much from inputs more than a certain number of time steps before the final output. From this, we know that the number of time steps in the input sequence for this RNN model cannot be too large. We may determine this hyper-parameter from cross validation. Note that this is a problem in deep NNs as well. Exploding gradients may also be a problem, but this can be circumvented effectively by clipping the gradients.

For this project, we implemented the above described RNN model from scratch in Python and tested its performance on the stock market prediction problem.

B. Gated Recurrent Unit

We also implemented from scratch in Python a more sophisticated RNN variant: the Gated Recurrent Unit (GRU). GRUs are identical to the vanilla RNN described above (they take sequential inputs) except in the way the hidden states $s_t$ are calculated. They were designed to alleviate the vanishing gradient problem through the use of gates (Figure 2). These are illustrated through the GRU equations 8, 9, 10, 11 and 12.

Fig. 2. Gated Recurrent Unit topology [8], [9].

$$z = \text{sigmoid}(U^z x_t + W^z s_{t-1}) \tag{8}$$

$$r = \text{sigmoid}(U^r x_t + W^r s_{t-1}) \tag{9}$$

$$h = \tanh(U^h x_t + W^h (s_{t-1} \cdot r)) \tag{10}$$

$$s_t = (1 - z) \cdot h + z \cdot s_{t-1} \tag{11}$$

$$o_t = \text{sigmoid}(V s_t) \tag{12}$$

where $\cdot$ denotes element-wise multiplication. The GRU has two gates, specifically a reset gate $r$ and an update gate $z$. The reset gate $r$ determines how to combine the new input $x_t$ with the previous hidden state $s_{t-1}$, while the update gate $z$ determines how much of the previous hidden state $s_{t-1}$ to retain in the current hidden state $s_t$. We obtain the vanilla RNN by setting $r$ to all 1's and $z$ to all 0's [8].

The GRU is a relatively new model published in recent years. GRUs have fewer parameters than Long Short Term Memory (another RNN variant), rendering them faster to train and requiring less data to generalize. We tested our implementation of the GRU on the stock market prediction problem as well.

IV. METHODOLOGY

A. Baseline and Other Models

Since we have framed stock market prediction as a binary classification problem, Logistic Regression (LR) is a natural choice as a baseline model. Beyond LR, we also tested several other more sophisticated models (some of which were not covered in lectures) to gain exposure to common machine learning algorithms. They are Support Vector Machines RBF (SVM RBF), K-Nearest Neighbors (KNN) and AdaBoost (implemented in Scikit-Learn).

B. Experiment Design

The data collected (price history and sentiment scores) span 11 years from January 1, 2004 to December 31, 2014. In this project, we would like to predict whether tomorrow's price will be higher (1) or lower (0) than today's price. Thus, each day may be viewed as an observation from which a training example or testing example may be constructed. We created feature vectors based on the features described in Section II: each vector is essentially a concatenation of price technical indicators and public sentiment scores. The target variable is binary and is simply the sign of tomorrow's 1-day return. Each observation therefore pairs a feature vector $x^{(i)}$ with a target variable $y^{(i)}$ for the $i$-th day.
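The construction of one observation from the closing-price series can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, the moving average is assumed to be the mean of the $n$ most recent 1-day returns, a flat day is assumed to map to label 0, and the Google Trends and sentiment features (which would simply be appended to the vector) are left out.

```python
from typing import List, Tuple


def n_day_return(close: List[float], i: int, n: int) -> float:
    """R_{i,n}: fractional change of the close on day i relative to day i-n."""
    return (close[i] - close[i - n]) / close[i - n]


def returns_moving_average(close: List[float], i: int, n: int) -> float:
    """MA_{i,n}, taken here as the mean of the n most recent 1-day returns (assumption)."""
    return sum(n_day_return(close, i - k, 1) for k in range(n)) / n


def build_example(close: List[float], i: int, n: int) -> Tuple[List[float], int]:
    """Return (x, y) for day i; x only uses closes up to and including day i."""
    if not (n + 1 <= i <= len(close) - 2):
        raise ValueError("day i needs n+1 earlier closes and one later close")
    x = [n_day_return(close, i, k) for k in range(1, n + 1)]             # R_{i,1..n}
    x += [returns_moving_average(close, i, k) for k in range(2, n + 1)]  # MA_{i,2..n}
    x += [n_day_return(close, i - k, 1) for k in range(1, n + 1)]        # R_{i-1,1}..R_{i-n,1}
    # Label: 1 if tomorrow closes higher than today, else 0 (flat days map to 0 here).
    y = 1 if close[i + 1] > close[i] else 0
    return x, y


closes = [100.0, 110.0, 121.0, 133.1, 146.41, 161.051]  # toy series, +10% per day
x, y = build_example(closes, i=3, n=2)
print(len(x), y)  # 5 1
```

The label is the only place where the close of day $i+1$ is read, so the feature vector itself carries no future information.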

For some arbitrary $i$-th day, the feature vector and target variable take the form:

$$x^{(i)} = \big[\, R_{i,1}, R_{i,2}, \ldots, R_{i,n},\; MA_{i,2}, \ldots, MA_{i,n},\; R_{(i-1),1}, R_{(i-2),1}, \ldots, R_{(i-n),1},\; GT_{i,\text{econ}}, GT_{i,\text{debt}}, GT_{i,\text{inflat}}, GT_{i,\text{risk}}, GT_{i,\text{stocks}},\; \text{Score}_i \,\big]$$

$$y^{(i)} = \big[\, \text{Sign}(R_{(i+1),1}) \,\big]$$

where the notation remains the same as introduced in Section II, and $GT_{i,\text{YYY}}$ refers to the Google Trends query volume for the word "YYY". It is important that the feature vector $x^{(i)}$ does not contain any future information and only uses information available up to that point.

$n$ determines the amount of information about past prices and price trends incorporated into the feature vector; the dimension of the feature vector changes with $n$. Note that because we are predicting tomorrow's price change, we lose one day: no prediction can be made for the last day in the data set, December 31, 2014, because we do not know the true price on January 1, 2015. Also, depending on the $n$ chosen, we have to drop the first $n$ days' observations: to calculate the n-day returns, n-day returns moving average and n-time lagged 1-day returns, we need the previous $n$ days' prices. These features cannot be calculated for the first $n$ days in the data set because we do not know prices prior to the first day, January 1, 2004. We select $n$ from cross validation.

This is a binary classification task, so we may use the binary cross entropy error function as the objective to minimize for LR and the RNN (Equation 13).

$$f(\theta) = -\sum_{i=1}^{N} \Big[\, y^{(i)} \log q(x^{(i)}) + (1 - y^{(i)}) \log\big(1 - q(x^{(i)})\big) \Big] \tag{13}$$

where $N$ is the number of training examples and $q(x^{(i)})$ is the model's predicted probability for example $i$.

TABLE I
TRAIN AND TEST SET SPLIT

Data Set | January 2004 to December 2014
Train Set | January 2004 to December 2012
Test Set | January 2013 to December 2014

Before we began training, we split the data set of observations into train and test sets, roughly 80% and 20% respectively (Table I). We train our models (RNN, GRU, LR, SVM, KNN and AdaBoost) on the train set, and subsequently evaluate their performance on the untouched test set.

C. RNN Training

For conventional classifiers like LR, the training method is straightforward: for each prediction, we use $x^{(i)}$ as input, $y^{(i)}$ as target and minimize the error function either stochastically (stochastic gradient descent) or collectively (batch gradient descent). This is not the case for RNNs. Recall that one of the properties of RNNs is that they can process sequential data. This means that we are not restricted to using one feature vector for each prediction; we may input feature vectors from some previous $t$ days into the RNN sequentially and take the final output prediction (minimizing the cross entropy error of the final step prediction). Using $t = 3$ as a concrete example:

$s^{(0)}$ → RNN → $s^{(1)}$ → RNN → $s^{(2)}$ → RNN → $y^{(i)}$, with $x^{(i-2)}$, $x^{(i-1)}$ and $x^{(i)}$ fed into the first, second and third RNN step respectively,

where $s^{(t)}$ are the hidden state vectors at time step $t$ from the RNN, and $s^{(0)}$ is initialized with all zeros. For training the RNN, we used inputs that are sequences of feature vectors $[x^{(i-t+1)}, \ldots, x^{(i-1)}, x^{(i)}]$. We feed them into the RNN sequentially, beginning from $x^{(i-t+1)}$ and ending with $x^{(i)}$. The final output gives us a probability for the target variable $y^{(i)}$ since we use the sigmoid function (Equation 7). Again, similar to that described in the previous section, depending on $t$ we have to drop the first few days of training examples.

This allows the RNN to capture some extent of temporal information that LR does not (e.g. finer grain resolution of how returns are changing day to day). The larger $t$ is, the more temporal information we are feeding into the RNN. However, as mentioned in Section III, $t$ is intrinsically limited by the vanishing gradient problem. $t$, together with the dimension of the hidden state vectors $s^{(t)}$, are the hyper-parameters we can tune using cross validation.

The above training method also applies to GRUs (a variant of RNN). However, we may expect better results for GRUs as they should theoretically suffer from the vanishing gradient problem to a lesser extent.

D. Cross Validation for Time Series

Cross validation is an important step in model selection and parameter tuning. It provides a measure of the generalization error of the trained classifier. To a certain extent, this technique allows us to avoid over-fitting on the training data (and perhaps under-fitting), and consequently do better on the test data.

For independent data, we can typically use K-Folds cross validation, where the training data is randomly split into K ideally equally sized folds. Each fold may then be used as a validation set while the remaining (K-1) folds become the new training set. We cycle through the K folds so that each fold is left out of training and used for validation once. By taking the average error over these K validation folds, we get an estimate of the generalization error (i.e. how well the classifier will likely perform on unseen test sets).

However, for this project, the data involved are financial time series and they are not independent. Correlation between adjacent observations is often prevalent in time series data; the data has some intrinsic order. The K-Folds cross validation method described earlier breaks down because (assuming we randomly split the training data into K folds) the validation and training samples are no longer independent. Furthermore, the train set should not contain any information that occurs after the validation set. By splitting the data randomly, we cannot be sure of that.

TABLE II
CROSS VALIDATION FOR TIME SERIES

Fold | Train Set | Validation Set
1 | 2004 | 2005
2 | 2004, 2005 | 2006
3 | 2004, 2005, 2006 | 2007
4 | 2004, 2005, 2006, 2007 | 2008

A more principled approach for time series cross validation is forward chaining [7]. Using 5 years of training time series data from 2004 to 2008 as an example, we may split it into 4 folds and perform cross validation as in Table II. This is a more accurate reflection of the situation during testing, where we train on past data and predict future price changes. We adopted this approach for cross validation in this project.

In Table III, we summarize the hyper-parameters for each model we tested, and the respective ranges over which we performed a grid search.

Fig. 3. Grid search heat map for Logistic Regression. The optimal parameters from cross validation are n = 8 and regularization C = 0.1, without sentiment scores.
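The forward-chaining folds of Table II can be generated with a few lines of code. This is a minimal sketch that assumes the data are grouped by calendar year; the function name is hypothetical.

```python
from typing import Iterator, List, Tuple


def forward_chaining_folds(years: List[int]) -> Iterator[Tuple[List[int], int]]:
    """Yield (train_years, validation_year): train on all earlier years, validate on the next."""
    for k in range(1, len(years)):
        yield years[:k], years[k]


for fold, (train_years, validation_year) in enumerate(
    forward_chaining_folds([2004, 2005, 2006, 2007, 2008]), start=1
):
    print(fold, train_years, validation_year)
# 1 [2004] 2005
# 2 [2004, 2005] 2006
# 3 [2004, 2005, 2006] 2007
# 4 [2004, 2005, 2006, 2007] 2008
```

Every training year precedes its validation year, so no fold trains on information from after its validation period.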

TABLE III
GRID SEARCH HYPER-PARAMETERS

Hyper-Parameters | Sweep Range
n (refer to Section II) | 3, 4, 5, 6, 7, 8, 9
GT, Score | with and without
LR |
Regularization C | 10e-2, 10e-1, 10e-0, 10e1, 10e2
SVM RBF |
Bandwidth γ | 10e-2, 10e-1, 10e-0, 10e1, 10e2
C | 10e-2, 10e-1, 10e-0, 10e1, 10e2
KNN |
No. of neighbors | 5, 10, 25, 50, 75, 100
AdaBoost |
No. of estimators | 5, 10, 25, 50, 75, 100
Learning rate | 0.01, 0.05, 0.1, 0.5, 1
RNN |
Time steps t | 2, 4, 6
Hidden state s(t) dimensions | 10, 30, 50
GRU |
Time steps t | 2, 4, 6
Hidden state s(t) dimensions | 10, 30, 50

Fig. 4. Grid search heat map for K-Nearest Neighbor. The optimal parameters from cross validation are n = 8 and no. of neighbors = 5, without sentiment scores.
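A grid search over the Logistic Regression rows of Table III, scored on forward-chaining folds, might look like the sketch below. The selection rule (highest mean validation accuracy across folds) is an assumption, and `placeholder_accuracy` is a hypothetical stand-in for training and scoring a model, so the values it returns are not results from this project.

```python
from itertools import product
from typing import Callable, Dict, List, Tuple

# Sweep ranges for LR copied from Table III (Python reads 10e-2 as 0.1).
LR_GRID: Dict[str, list] = {
    "n": [3, 4, 5, 6, 7, 8, 9],
    "use_sentiment": [False, True],
    "C": [10e-2, 10e-1, 10e-0, 10e1, 10e2],
}

Fold = Tuple[List[int], int]  # (train_years, validation_year)


def grid_search(
    grid: Dict[str, list],
    folds: List[Fold],
    fold_accuracy: Callable[[dict, List[int], int], float],
) -> Tuple[float, dict]:
    """Return (best mean validation accuracy, best params) over every grid point."""
    best_score, best_params = float("-inf"), {}
    for values in product(*grid.values()):
        params = dict(zip(grid.keys(), values))
        mean_acc = sum(fold_accuracy(params, tr, va) for tr, va in folds) / len(folds)
        if mean_acc > best_score:  # ties keep the first grid point encountered
            best_score, best_params = mean_acc, params
    return best_score, best_params


def placeholder_accuracy(params: dict, train_years: List[int], validation_year: int) -> float:
    """Hypothetical scorer; a real one would fit on train_years and score on validation_year."""
    return 0.5 - 0.01 * abs(params["n"] - 5) - 0.001 * abs(params["C"] - 1.0)


folds = [([2004], 2005), ([2004, 2005], 2006)]
best_score, best_params = grid_search(LR_GRID, folds, placeholder_accuracy)
print(best_params)
```

This grid has 7 x 2 x 5 = 70 points, each evaluated once per fold, which is why the sweep becomes expensive for slow-to-train models.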

V. RESULTS AND DISCUSSION

A. Grid Search Cross Validation Results

We performed extensive grid searches for each model to choose the best hyper-parameters based on the resulting cross validation accuracy. Selected results are presented as heat maps in Figures 3, 4, 5, 6, 7 and 8. Using the best hyper-parameter combination, we trained fresh models (LR, KNN, AdaBoost, SVM RBF, RNN and GRU) on the entire train set (from January 2004 to December 2012) and tested them on the unseen test set (from January 2013 to December 2014). The results are summarized in Table IV.

Fig. 5. Grid search heat map for AdaBoost. We swept n as mentioned in Table III. For easy visualization we only present the heat map of the best n here. The optimal parameters from cross validation are n = 3, no. of estimators = 5 and learning rate = 1, without sentiment scores.

Fig. 6. Grid search heat map for Support Vector Machine RBF. We swept n as mentioned in Table III. For easy visualization we only present the heat map of the best n here. The optimal parameters from cross validation are n = 8, bandwidth γ = 0.1 and C = 1000, without sentiment scores.

Fig. 7. Grid search heat map for Recurrent Neural Network. We swept n as mentioned in Table III. For easy visualization we only present the heat map of the best n here. The optimal parameters from cross validation are n = 5, hidden state s(t) dimensions = 30 and time steps t = 4, without sentiment scores.

Fig. 8. Grid search heat map for Gated Recurrent Unit. We swept n as mentioned in Table III. For easy visualization we only present the heat map of the best n here. The optimal parameters from cross validation are n = 5, hidden state s(t) dimensions = 50 and time steps t = 4, without sentiment scores.

TABLE IV
BEST CROSS VALIDATION ACCURACY AND TEST ACCURACY

Model | Best Cross Validation Accuracy | Test Accuracy
LR (baseline) | 0.509 | 0.510
KNN | 0.511 | 0.495
AdaBoost | 0.520 | 0.523
SVM RBF | 0.568 | 0.565
RNN | 0.534 | 0.531
GRU | 0.561 | 0.558

B. Discussion

From our grid search experiments, we realized that including Google query volumes and sentiment scores did not necessarily lead to improved performance. In fact, for some models (like KNN and LR), including these sentiment scores caused a significant drop in test accuracy. The reason becomes apparent when we overlay Google query volumes and sentiment scores with the N225 price index.

Fig. 9. Plot of Bloomberg Businessweek sentiment scores and the N225 price index over time from 2007 to 2009.

Fig. 10. Plot of Google Trends query volume for the word "debt" and the N225 price index over time from 2010 to 2014.

From Figures 9 and 10, we can see that both scores do not seem to be consistently correlated with the N225 price. They do not seem to be predictive of N225 price changes (while the figures are plotted at the monthly level, the same holds true when we zoom in to the daily level). This likely explains why the sentiment score features do not improve the classifiers' performance; they do not provide useful additional information.

It seems that our simple sentiment analysis (scoring by counting positive and negative words from pre-specified lists) is too coarse to extract useful information. Perhaps using more sophisticated sentiment analysis methods that go beyond the word level (such as OpinionFinder in [2], which looks at sentence-level subjectivity) will yield more informative scores. In addition, it may be useful to crawl articles from multiple news archives, rather than just Bloomberg Businessweek, to gain a more diverse corpus that may be more representative of the state of world affairs. Unlike that reported in [3], Google search volume trends did not improve our results.

This could be simply due to the fact that we are analyzing N225 in this project, and not the Dow Jones Industrial Index as in the original paper. In hindsight, perhaps using volume trends for search terms in the Japanese language would have been more appropriate since English is not Japan's first language (but then again, with globalization, N225 is tradable from almost anywhere in the world). Further, [3] could have used a greater set of search terms; we restricted ourselves to 5 finance-related terms to keep data collection and computation time reasonable.

Out of all the models tested, LR gave one of the poorest accuracies at 0.510. This is only slightly better than randomly guessing (0.5). However, such a result is consistent with our understanding that LR is ultimately a linear classification model (we did not kernelize LR for this project). It is natural that stock market prediction, a non-linear problem, cannot be well-modeled by a linear model. Nevertheless, this serves as a baseline benchmark to evaluate other more sophisticated algorithms.

Both the RNN and GRU performed better than LR. Because these are non-linear models, it is natural that they can give better accuracy than LR. One observation is that the GRU (0.558) performs slightly better than the vanilla RNN (0.531), suggesting that the GRU gating architecture may have indeed helped to alleviate the vanishing gradient problem, allowing it to learn better. We also note that both the RNN and GRU required significantly longer times to train compared to the other models. This posed an issue particularly for time series cross validation. As a result, we only managed to sweep 3 values for both the time steps t and the hidden state s(t) dimensions (on top of n); sweeping these parameters took over a day for each of the two models.

Finally, we see that the GRU's performance is comparable to, though slightly lower than, that of the SVM RBF (0.565). In general, our SVM RBF accuracy is consistent with that reported in literature and other implementations online ([10], [11], [12] and [13]). However, we feel that the GRU has the potential to outperform the SVM RBF classifier. Firstly, as mentioned earlier, we only swept 3 values for both GRU parameters. Given more time and resources, we could sweep the parameters at finer resolutions and over a larger range. This will likely give better performance. In addition, we used simple stochastic gradient descent in the GRU implementation. There are more sophisticated optimization methods available (such as RMSprop) that could potentially lead to improved accuracy. Lastly, we are currently looking at daily data, which gives us around 2000 training examples. This data set size may be insufficient to learn the reset and update gates' weights effectively. Perhaps if we looked at minute scale data (which would vastly increase the number of training examples), the GRU would perform much better than the SVM RBF.

Lastly, we did not have sufficient time to thoroughly analyze the results for KNN and AdaBoost. As mentioned in Section IV, we tested these models mostly to gain exposure to a wider range of common machine learning algorithms.

VI. CONCLUSION

In this project we collected price history from Yahoo! Finance, crawled articles from Bloomberg Businessweek and obtained Google query volumes from Google Trends for the period 2004 to 2014. Using the data, we generated price technical indicators and sentiment scores to be used as features for predicting the direction of future (tomorrow's) price change. We implemented a vanilla RNN and GRU from scratch in Python and tested them against LR as a baseline. Through grid searches and cross validation for time series, we chose the optimal (according to cross validation error) hyper-parameters for each model.

From our experiments, sentiment scores and Google query volumes did not improve the classifiers' performance. This is likely because our simple sentiment analysis does not extract useful information from the news articles. Consistent with our expectations, LR performed the poorest among LR, SVM RBF, RNN and GRU. It is logical that a linear model cannot adequately describe a complex non-linear problem such as stock prices. The GRU performed slightly better than the vanilla RNN, indicating that the gating mechanism was effective to some extent in relieving the vanishing gradient issue. Finally, we observed that the GRU has comparable performance to the SVM RBF. However, we feel that the GRU has the potential to outperform the SVM RBF given more time and resources.

Moving forward, we may perform more advanced sentiment analysis, using more sophisticated sentence-level methods (such as OpinionFinder) and also crawling for news articles from a wider range of websites (such as the Wall Street Journal) for a more diverse corpus. This should serve as a better proxy for public sentiment. We could also explore more specialized Google search terms that are predictive of the N225, perhaps in the Japanese language.

For the RNN and GRU, we can certainly improve their performances by sweeping a wider range of parameters at finer resolutions, and using more advanced optimization methods like RMSprop. Also, we feel that their accuracies should improve given more data (working at the hourly/minute scale instead of the daily scale). Currently, we train the RNN and GRU using a fixed train set and test them on the test set. An alternative is to have a 'moving' train set where we retrain the model every year based on the latest D years' prices, i.e. first train on 2004 and 2005 and test on 2006; then train a fresh model on 2005 and 2006 and test on 2007, and so on. This will allow us to capture short term trends more effectively. Finally, we used simple sample and hold to deal with missing data in this project. There are definitely more robust methods of dealing with such cases that we did not have the time to explore here.

REFERENCES

[1] Ruiz, Eduardo J. et al. 'Correlating Financial Time Series With Micro-Blogging Activity.' Proceedings of the fifth ACM international conference on Web search and data mining - WSDM '12 (2012): n. pag. Web. 11 Nov. 2015.
[2] Bollen, Johan, Huina Mao, and Xiaojun Zeng. 'Twitter Mood Predicts The Stock Market.' Journal of Computational Science 2.1 (2011): 1-8. Web.
[3] Preis, Tobias, Helen Susannah Moat, and H. Eugene Stanley. 'Quantifying Trading Behavior In Financial Markets Using Google Trends.' Sci. Rep. 3 (2013): n. pag. Web.
[4] McDonald, Bill. 'Bill Mcdonald's Word Lists Page'. Nd.edu. N.p., 2015. Web. 7 Dec. 2015.

[5] Hu, Minqing, and Bing Liu. 'Mining And Summarizing Customer Reviews'. Proceedings of the 2004 ACM SIGKDD international conference on Knowledge discovery and data mining - KDD '04 (2004): n. pag. Web. 7 Dec. 2015.
[6] LeCun, Yann, Yoshua Bengio, and Geoffrey Hinton. 'Deep Learning'. Nature 521.7553 (2015): 436-444. Web. 7 Dec. 2015.
[7] Arlot, Sylvain, and Alain Celisse. 'A Survey Of Cross-Validation Procedures For Model Selection'. Statistics Surveys 4.0 (2010): 40-79. Web. 8 Dec. 2015.
[8] Britz, Denny. 'Recurrent Neural Network'. WildML. N.p., 2015. Web. 8 Dec. 2015.
[9] Chung, Junyoung, Gulcehre, Caglar, Cho, KyungHyun, Bengio, Yoshua. 'Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling'. NIPS Deep Learning Workshop, 2014.
[10] Fu, Tong, Shou Chen, and Chuanqi Wei. Hong Kong Stock Index Forecasting. 2013. Web. 9 Dec. 2015.
[11] Dai, Yuqing, and Yuning Zhang. Machine Learning In Stock Price Trend Forecasting. 2013. Web. 9 Dec. 2015.
[12] Halls-Moore, Michael. 'Forecasting Financial Time Series'. Quantstart. N.p., 2015. Web. 9 Dec. 2015.
[13] Pochetti, Francesco. 'Stock Market Prediction Scikit Classification Algorithms'. N.p., 2014. Web. 9 Dec. 2015.