Linköping University | Department of Computer and Information Science Bachelor’s Thesis| Bachelor’s Programme in Programming Spring term 2020 | LIU-IDA/LITH-EX-G—20/055-SE

Forecasting Financial Time Series through Causal and Dilated Convolutional Neural Networks

Lukas Börjesson

Tutor: Rita Kovordanyi
Examiner: Jalal Maleki

Upphovsrätt Detta dokument hålls tillgängligt på Internet – eller dess framtida ersättare – under 25 år från publiceringsdatum under förutsättning att inga extraordinära omständigheter uppstår. Tillgång till dokumentet innebär tillstånd för var och en att läsa, ladda ner, skriva ut enstaka kopior för enskilt bruk och att använda det oförändrat för ickekommersiell forskning och för undervisning. Överföring av upphovsrätten vid en senare tidpunkt kan inte upphäva detta tillstånd. All annan användning av dokumentet kräver upphovsmannens medgivande. För att garantera äktheten, säkerheten och tillgängligheten finns lösningar av teknisk och administrativ art. Upphovsmannens ideella rätt innefattar rätt att bli nämnd som upphovsman i den omfattning som god sed kräver vid användning av dokumentet på ovan beskrivna sätt samt skydd mot att dokumentet ändras eller presenteras i sådan form eller i sådant sammanhang som är kränkande för upphovsmannens litterära eller konstnärliga anseende eller egenart. För ytterligare information om Linköping University Electronic Press se förlagets hemsida http://www.ep.liu.se/.

Copyright The publishers will keep this document online on the Internet – or its possible replacement – for a period of 25 years starting from the date of publication barring exceptional circumstances. The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/hers own use and to use it unchanged for non- commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility. According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement. For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.

© Lukas Börjesson


Abstract

In this paper, predictions of future price movements of a major American stock index were made by analysing past movements of the same and other correlated indices. A model that has shown very good results in speech recognition was modified to suit the analysis of financial data and was then compared to a base model, restricted by the assumptions made for an efficient market. The performance of any model that is trained by looking at past observations is heavily influenced by how the division of the data into train, validation and test sets is made. This is further amplified by the temporal structure of the financial data, which means that the causal relationship between the predictors and the response is dependent on time. The complexity of the financial system further increases the difficulty of making accurate predictions, but the model suggested here was still able to outperform the naive base model by more than 20 percent. The model was, however, too primitive to be used as a trading system, but suitable modifications, in order to turn it into one, are discussed at the end of the paper.


Acknowledgement

The study in this paper stretches over a number of different subject fields, ranging from computer science to finance. This means that there is a lot of ground to cover and a number of different concepts that need to be explained. However, the number of pages in the paper was limited and great effort has been placed on covering the most essential concepts. In this regard, my tutor, Rita Kovordanyi, has provided a great deal of thoughtful input and insight, for which I am very grateful. Thank you.

I also want to take this opportunity to thank Martin Singull, who has, apart from providing helpful input on this paper, been an excellent instructor in the area of mathematical statistics and who has further increased my interest in the same and similar subjects.

Linköping in June 2020

Lukas Börjesson


INTRODUCTION
Deep learning has brought a new paradigm into machine learning in the past decade and has shown remarkable results in areas such as image recognition, speech recognition and natural language processing. However, one of the areas where it is yet to become a mainstream tool is in forecasting financial time series. This is despite the fact that time series provide a suitable data representation for deep learning methods such as the convolutional neural network (CNN) [16]. Researchers and market participants¹ are still, for the most part, sticking to more historically well known and tested approaches, but there has been a slight shift of interest towards deep learning methods in the past years [13]. The reason behind the shift, apart from the structure of the time series, is that the financial market is an increasingly complex system. This means that there is a need for more advanced models, such as deep neural networks, that do a better job of finding the nonlinear relations in the data.

There are also those who state that complexity is not the issue, but instead advocate for the Efficient Market Hypothesis (EMH) [7], a theory that essentially suggests that no model, no matter how complex, can outperform the market, since the price is based on all available information. The theory rests upon three key assumptions², which are stated to be sufficient, but not necessary³. These assumptions, even with modifications, are very bold and there are many who have criticized the theory over the years. However, whether one agrees with the theory or not, one would probably agree that a model that satisfies the assumptions made in the EMH would indeed be suitable as a base model, which means that such a model can be used as a benchmark in order to assess the accuracy of other models.

Traders and researchers alike would furthermore agree that the price of any asset is, apart from its inner utility, based on the expectation of its future value. For example, the price of a stock is partially determined by the company's current financials, but also by the expectation of future revenues or future dividends. This expectation is, by neoclassical economics, seen as completely rational, giving rise to the area of rational expectations [8]. However, the emergence of behavioural economics has questioned this rationality and proposes that traders (or, more generally, decision-makers who act under risk and uncertainty) are sometimes irrational and many times affected by biases [15].

A trader that sets out to exploit this irrationality and these biases can only do so by looking into the past, and thereby also goes against the hypothesis of the efficient market. Upon reading this, it should be fairly clear that making predictions in the financial markets is no trivial task, and it should be approached with humility. However, one should not be discouraged, since the models reviewed in [13] do provide promising or, often times, positive results.

An important note about the expectation mentioned above is that the definition of a trader, provided by Paul and Baschnagel, does not limit it to be that of a human being; it might as well be an algorithm. This is important, since the majority of the transactions in the market are now made by algorithms. These algorithms are used in order to decrease the impact of biases and irrationality in the decision making. However, the algorithms are programmed by people and are still making predictions under uncertainty, based on historical data, which means that they are by no means free of biases. Algorithms are also more prone to getting stuck in a feedback loop, which has been exploited by traders in the past⁴.

Objective
The objective of this paper was to expand the research in forecasting financial time series, specifically with a deep learning approach. To achieve this, two models, which greatly differ in their approach towards the efficiency of the market, were compared. The first model was restricted by the assumptions made on the market by the EMH and was seen as the base model. The second model was a convolutional neural network, inspired by a model developed for speech recognition by researchers at Google. The models set out to predict the next day's closing price of Standard & Poor's 500 (S&P 500), which is a well known stock market index, comprised of 500 large companies in the US.

Problem Formulation
In order to reach the objective, this paper aimed at answering the following questions:

• Can a CNN model, using only the closing price as input, perform better forecasts than a model restricted by the EMH?

• Can conditional input series help to improve the performance of the CNN model?

THEORY
Time Series
A time series can be defined, as the name suggests, as a series of data points, ordered with respect to time and where the time interval between the data points is often chosen to be fixed. When using time series as a forecasting model, one makes the assumption that future events, such as the next day's closing price of a stock, can be determined by looking at past closing prices in the series. Most models, however, include a random error as a factor, meaning that there is some noise in the data which cannot be explained by past values in the series.

Furthermore, the models can be categorized as parametric or non-parametric, where the parametric models are the ones most regularly used.

¹ "Market participants is a general expression for individuals or groups who are active in the market, such as banks, investors, investment funds, traders (for their own account). Often, we use the term 'trader' as a synonym for 'market participant'" [11].
² (1) No transaction costs, (2) cost is not a barrier to obtain available information and (3) all market participants agree on the implications of the available information.
³ Sufficient, but not necessary means, in this context, that the assumptions do not need to be fulfilled at all times. For example, the second assumption might be loosened from including all traders to only a sufficient number of them.
⁴ An interesting example is the two Norwegians, Svend Egil Larsen and Peder Veiby, who in 2010 were accused of manipulating algorithmic trading systems. They were, however, acquitted in 2012, since the court found no wrongdoing.

In the parametric approach, each data point in the series is accompanied by a coefficient, which determines the impact the past value has in the forecast of future values in the series. Below is a mathematical representation of the linear autoregressive (AR) and autoregressive exogenous (ARX) models, as well as the nonlinear versions. Here, X_k is the kth element in the time series sequence and φ_k its accompanying coefficient, Z_k is the kth element in an exogenous time series and ψ_k its accompanying coefficient, f is a nonlinear function and ε_k is a white noise.

Autoregressive Model, AR(p):

    X_t = c + Σ_{i=1}^{p} φ_i X_{t−i} + ε_t    (1)

Autoregressive Exogenous Model, ARX(p, r):

    X_t = c + Σ_{i=1}^{p} φ_i X_{t−i} + Σ_{j=1}^{r} ψ_j Z_{t−j} + ε_t    (2)

Nonlinear Autoregressive Model, NAR(p):

    X_t = c + f(X_{t−1}, ..., X_{t−p}) + ε_t    (3)

Nonlinear Autoregressive Exogenous Model, NARX(p, r):

    X_t = c + f(X_{t−1}, ..., X_{t−p}, Z_{t−1}, ..., Z_{t−r}) + ε_t    (4)

The autoregressive model is one of the most well known time series models and it is a model where the variable is regressed against itself (auto meaning "oneself" when used as a prefix). It is often used as a building block in more advanced time series models, such as the autoregressive moving average (ARMA) or the generalized autoregressive conditional heteroskedasticity (GARCH) models. However, the AR process will not be considered as a building block in the models proposed in this paper. Instead, the CNN models proposed in this paper can be represented as the nonlinear version of the AR model, or NAR for short. In fact, a large number of models, when applied to time series, can be seen as AR or NAR models. This might seem obvious to some, but it is something that is seldom mentioned in the scientific literature. Furthermore, the models can be generalized to a NARX model if one were to include exogenous variables in the model, which will be explored when tackling the second research question, defined in the problem formulation.

When determining the coefficients in the autoregressive models, most methods need the underlying stochastic process {X_t : t ≥ 0} to be stationary, or at least weak-sense stationary. This means that we are assuming that the mean and the variance of X_t are constant over time. However, when looking at historical prices in the stock market, one can clearly see that this is not the case, for either the variance or the mean. All of the above models can be generalized to handle this non-stationarity by applying suitable transformations to the series. These transformations, or integrations, are a necessity when determining the values of the coefficients for most of the well known methods, although this need not be the case when using a neural network [4].

Neural Networks
The neural network, when applied to a supervised problem, sets out to minimize a certain predefined loss function. The loss function used in this paper was the mean absolute percentage error (MAPE),

    ε(w) = (100/n) Σ_{i=1}^{n} |(g(wᵀx_i) − t_i) / t_i|,

where w is the weights, x_i is the ith input observation, t_i is the ith target value and g(·) is the model's prediction. The reason for this choice is that the errors are made proportional with respect to the target value. This is important, since the mean and variance of financial series cannot be assumed to be stationary, and an unscaled error would otherwise skew the accuracy of the model, disproportionately, towards times characterised by a low mean. The loss function is differentiated with respect to the weights w and the loss is minimized by choosing the weights that solve the equation

    ∂ε(w)/∂w = 0.

However, this algebraic solution is seldom achievable and numerical solutions are more often used. These numerical methods set out to find points in close proximity to a local (hopefully global, but probably not) optimum.

Moreover, instead of calculating the derivative with respect to each weight individually, the network uses the backpropagation algorithm, where each derivative can be computed layerwise backwards. This leads to a decrease in complexity, which is very important, since it is not unusual that the number of weights is counted in thousands or in tens of thousands.

The neural network is, unless stated otherwise, considered to be a fully connected network, which means that all units in two adjacent layers are connected to each other. Although backpropagation does a remarkable job in decreasing the complexity, fully connected models are not good at scaling to many hidden layers. This problem can be solved by having a sparse network, which means that not all weights are connected. The CNN model, further explained in the next section, is an example of a sparse network, where not all units are connected and where some units also share weights.
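As a concrete illustration of the loss defined above, here is a minimal NumPy sketch of the MAPE calculation; the function name and the example price arrays are hypothetical and only serve to show the arithmetic.

```python
import numpy as np

def mape(targets, predictions):
    """Mean absolute percentage error, as in the loss definition above."""
    targets = np.asarray(targets, dtype=float)
    predictions = np.asarray(predictions, dtype=float)
    return 100.0 / len(targets) * np.sum(np.abs((predictions - targets) / targets))

# Hypothetical next-day closing prices and model predictions.
y_true = np.array([3230.8, 3257.9, 3234.9])
y_pred = np.array([3228.0, 3251.2, 3240.0])
print(mape(y_true, y_pred))   # error in percent
```

Keras also provides this objective as a built-in loss (loss="mape"), which is how a network trained on it would typically be compiled.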
Convolutional Neural Networks
The input to the CNN, when modelling time series, is a three-dimensional tensor⁵: (nr of observations) × (width of the input) × (nr of series). The number of series is here the main series, over which the predictions will be made, plus optional exogenous series.

Furthermore, in the CNN model there is an array of hyperparameters that defines the structure and complexity of the network. Below is a short explanation of the most important parameters to be acquainted with in order to understand the networks proposed in this paper.

⁵ Here a tensor is just a multidimensional array and should not be confused with the notion of a tensor in the mathematical literature.

Activation Function
In its simplest form, when it only takes on binary values, the activation function determines whether the artificial neuron fires or not. More complex activation functions are often used, and the sigmoid and tanh functions,

    g(x) = eˣ / (eˣ + 1),    g(x) = tanh(x),

are two examples which have been used to a large extent in the past. They are furthermore two good examples of activation functions that can cause the problem of vanishing gradients (studied by Sepp Hochreiter in 1991, but further analyzed in 1997 by Sepp Hochreiter and Jürgen Schmidhuber [3]), which of course is something that should be avoided. A function that does not exhibit this problem is the rectified linear unit (ReLU) function,

    g(x) = 0 if x ≤ 0,  x otherwise,

which has gained a lot of traction in recent years and is today the most popular one for deep neural networks. One can easily understand why ReLU avoids the vanishing gradient problem by looking at its derivative,

    g′(x) = 0 if x ≤ 0,  1 otherwise,

and from it conclude that the gradient is either equal to 0 or 1. However, the derivative also shows a different problem that comes with the ReLU function, which is that the gradient might equal zero and that the output from many of the nodes might in turn become zero. This problem is called the dead ReLU problem, and it might cause too many of the nodes to have zero impact on the output. This can be solved by imposing minor modifications on the function, and it therefore now comes in an array of different flavours. One such flavour is the exponential linear unit (ELU),

    g(x) = α(eˣ − 1) if x ≤ 0,  x otherwise,

where the value of α is often chosen to be between 0.1 and 0.3. The ELU solves the dead ReLU problem, but it comes with a greater computational cost. A variant of the ELU is the scaled exponential linear unit (SELU),

    g(x) = λα(eˣ − 1) if x ≤ 0,  λx otherwise,    (5)

which is a relatively new activation function, first proposed in 2017 [6]. The values of α and λ have been predefined by the authors, and the activation also needs the weights in the network to be initialized in a certain way, called lecun_normal. With lecun_normal initialization, the start value of each weight is drawn from a truncated normal distribution centred at zero, with a standard deviation that scales with the number of inputs to the unit.

Normalization can be used as a preprocessing of the data, due to its often positive effect on the model's accuracy, and some networks also implement batch normalization at some point or points inside the network. This is what is called external normalization. However, the beauty of SELU is that the output of each node is normalized, and this process is fittingly called internal normalization. Internal normalization proved to be more useful than external normalization for the models in this paper, which is why SELU was chosen as the activation function.
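To make the differences between the functions above concrete, here is a small NumPy sketch of ReLU, ELU and SELU. The SELU constants are the (approximate) values published in [6], and the ELU α is simply one value from the commonly used range; neither is taken from the thesis code.

```python
import numpy as np

def relu(x):
    return np.where(x > 0, x, 0.0)

def elu(x, alpha=0.1):                     # alpha typically chosen in [0.1, 0.3]
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def selu(x, alpha=1.6733, lam=1.0507):     # approximate constants from Klambauer et al. [6]
    return lam * np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

x = np.linspace(-3.0, 3.0, 7)
print(relu(x))
print(elu(x))
print(selu(x))
```

In Keras the same combination is obtained with activation="selu" together with kernel_initializer="lecun_normal".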
Learning Rate
The learning rate, often denoted by η, plays a large role during the training phase of the models. After each iteration, the weights are updated by a predefined update rule, such as gradient descent,

    w_{i+1} = w_i − η ∇ε(w_i),

where ∇ε(w_i) is the gradient of the loss function at the ith iteration. The learning rate η can here be seen as determining the rate of change in every iteration. Gradient descent is but one of many update rules, or optimizers (as they are more often called), and it is by far one of the simplest. More advanced optimizers are often used, such as the adaptive moment estimation (Adam) [5], which has, as one of its perks, individual learning rates for each weight. The discussion about optimizers will not continue further in this paper, but it should be clear that the value of the learning rate and the choice of optimizer have a great impact on the overall performance of the model.

Filters
The filter dimensions need to be determined before training the model, and the appropriate dimensions depend on the underlying data and the model of choice. When analysing time series the filter needs to be one-dimensional, since the time series is just an ordered sequence, so what is left for the developer to determine is just two things: the width of the filters (Figure 1 shows a filter with width equal to two) and how many filters to use for each convolutional layer. The type of features that the convolutional layer "searches" for is highly influenced by the filter dimensions, and having multiple filters means that the network can search for more features in each layer.

Dilation
A dilated convolutional filter is a filter that, not surprisingly, is widened, but still uses the same number of parameters. This can also be observed in Figure 1, where, for example, a dilation equal to two means that every other input to the layer is skipped.
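As a worked example of how dilation widens the view of the network in Figure 1, the receptive field of a stack of causal convolutions with filter width two can be computed as follows; the helper function and the chosen dilation rates are illustrative assumptions.

```python
def receptive_field(kernel_width, dilations):
    """Number of past time steps seen by a stack of dilated causal convolutions."""
    return 1 + sum((kernel_width - 1) * d for d in dilations)

print(receptive_field(2, [1, 2, 4, 8]))   # -> 16, matching the input length in Figure 1
print(receptive_field(2, [1, 2, 4]))      # -> 8, i.e. three layers cover an input of length 8
```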

[Figure 1. Dilated convolutional layers for an input series of length 16.]
[Figure 2. Overview of the residual layer, when only the main series is used as input.]

WaveNet
The CNN models proposed in this paper are inspired by the WaveNet structure, modeled by van den Oord et al. in 2016 [9]. The main part, or layer, of the network in a WaveNet can be visualized in Figure 2, which incorporates a dilated (and causal) convolution and a 1 × 1 convolution (i.e., the width of the filter set to equal one). The input from the left side is the result of a causal convolution, with filter size equal to two, which has been applied to the original input series as a sort of preprocessing. The output on the right side of the layer is the residual, which can be used as the input to a new layer with an identical set-up. The number of residual connections must be predetermined by the developer, but the dilated convolution also sets an upper limit on how many connections can be used. Figure 1 displays repeated dilations on an input series of size equal to 16, and we can see that the number of layers has an upper limit of four. Furthermore, the output from the bottom of each layer is the skip, which is the output that is passed on to the following layers in the network. If four layers are used, as in Figure 1, then the network ends up with four skip connections. These skip connections are then added (element-wise) together to form a single output series. This series is then passed through two 1 × 1 convolutions, and the result of this will be the output of the model.

The WaveNet has three important characteristics: it is dilated, causal and has residual connections. This means that the network is sparsely connected, that calculations can only include previous values in the input series (which can be observed in Figure 1) and that information is preserved across multiple layers. The sparsity is also further increased by having the width of the filters equal to only either one or two.

The WaveNet sets out to maximize the joint probability of the series x = (x_t, x_{t−1}, ..., x_1)ᵀ, which is factorized as a product of conditional probabilities,

    p(x) = Π_{i=1}^{t} p(x_i | x_1, ..., x_{i−1}),

where the conditional probability distributions are modelled by convolutions. Furthermore, the joint probability can be generalized to include exogenous series,

    p(x | h) = Π_{i=1}^{t} p(x_i | x_1, ..., x_{i−1}, h_1, ..., h_{i−1}),

where h = (h_t, h_{t−1}, ..., h_1)ᵀ is the exogenous series. However, the effectiveness of the softmax activation did not generalize well to the financial data used in this paper, and therefore no activation was used in the final output of the network.

The WaveNet, as proposed by the authors, uses a gated activation unit on the output from the dilated convolution layer in Figure 2,

    z = tanh(w_{t,k} ∗ x) ⊙ σ(w_{s,k} ∗ x),

where ∗ is a convolution operator, ⊙ is an element-wise multiplication operator, σ is a sigmoid function, w_{·,k} are the weights of the filters and k denotes the layer index. However, the model proposed in this paper will be restricted to only use a single activation function,

    z = SELU(w_k ∗ x),    (6)

and the reason behind this is again that the gated activation function did not generalize well to the data used in this paper.
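Below is a minimal sketch of this residual layer, written with the Keras functional API and the single SELU activation of equation (6). The filter count, the initial causal convolution, the output head and the compile step are illustrative assumptions (the optimizer and loss mirror the settings reported later in the Method section), so it should be read as an outline of the structure rather than the exact network used in the thesis.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def residual_layer(x, filters, dilation_rate):
    # Dilated causal convolution with filter width two and SELU activation, eq. (6).
    z = layers.Conv1D(filters, kernel_size=2, padding="causal",
                      dilation_rate=dilation_rate, activation="selu",
                      kernel_initializer="lecun_normal")(x)
    # 1x1 convolution applied to the activated output.
    z = layers.Conv1D(filters, kernel_size=1)(z)
    residual = layers.Add()([x, z])   # passed on as input to the next residual layer
    skip = z                          # collected and summed over all layers
    return residual, skip

p, filters = 16, 32                               # illustrative sizes
inputs = layers.Input(shape=(p, 1))               # (width of the input, nr of series)
x = layers.Conv1D(filters, 2, padding="causal")(inputs)   # initial causal convolution
skips = []
for d in [1, 2, 4, 8]:                            # dilation doubled for each layer
    x, s = residual_layer(x, filters, d)
    skips.append(s)
out = layers.Add()(skips)                         # element-wise sum of the skip connections
out = layers.Conv1D(filters, 1, activation="selu")(out)
out = layers.Conv1D(1, 1)(out)                    # the two final 1x1 convolutions
out = layers.Lambda(lambda t: t[:, -1, :])(out)   # keep the last time step as the forecast
model = models.Model(inputs, out)
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4), loss="mape")
```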

[Figure 3. Overview of the residual layer, when the main series is conditioned by exogenous series.]

When using exogenous series to help improve the predictions, the authors introduce two different ways to condition the main series on the exogenous series. The first way, termed global conditioning, uses a conditional latent vector l (not dependent on time), accompanied by a filter v_k, and can be seen as a type of bias that influences the calculations across all timesteps,

    z = SELU(w_k ∗ x + v_k ∗ l).

The other way, termed local conditioning, uses one or more conditional time series h = (h_t, h_{t−1}, ..., h_1)ᵀ, which again influence the calculations across all timesteps,

    z = SELU(w_k ∗ x + v_k ∗ h),    (7)

and this is the approach that has been taken in this paper. The approach can further be observed in Figure 3.

It is clear that the WaveNet provides a suitable structure for filtering time series, since there clearly exists a causal dependency in a time series and the dilation makes it possible to have a long input sequence.
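Continuing the sketch above, local conditioning as in equation (7) can be added by giving each residual layer a second filter for the exogenous series and summing the two convolutions before the SELU activation. Layer sizes and names are again illustrative assumptions, not the thesis implementation.

```python
from tensorflow.keras import layers

def conditioned_residual_layer(x, h, filters, dilation_rate):
    """z = SELU(w_k * x + v_k * h), eq. (7): the main and exogenous series are
    filtered separately and summed before the activation."""
    wx = layers.Conv1D(filters, kernel_size=2, padding="causal",
                       dilation_rate=dilation_rate)(x)
    vh = layers.Conv1D(filters, kernel_size=2, padding="causal",
                       dilation_rate=dilation_rate)(h)
    z = layers.Activation("selu")(layers.Add()([wx, vh]))
    z = layers.Conv1D(filters, kernel_size=1)(z)
    return layers.Add()([x, z]), z          # residual for the next layer, skip output

# Example wiring with one main series and three exogenous series of length 16.
x_in = layers.Input(shape=(16, 1))
h_in = layers.Input(shape=(16, 3))
x0 = layers.Conv1D(32, 2, padding="causal")(x_in)
res, skip = conditioned_residual_layer(x0, h_in, filters=32, dilation_rate=1)
```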

[Figure 4. Walk-forward validation with five folds.]
[Figure 5. Side view of the dilated convolutional layers in Figure 1: (a) is when only the main series is used, (b) and (c) are when the main series is conditioned by exogenous series, as in the models proposed in [1, 2] and in the original WaveNet [9].]

Walk-Forward Validation
Walk-forward validation, or walk-forward optimization, was suggested by Robert Pardo [10], and was brought forward since the ordinary cross-validation strategy is not well suited for time series data. The reason why cross-validation is not optimal for time series data is that there exist temporal correlations in the data, and it should then be considered as "cheating" if one were to use future data points to predict past data points. This, most likely, leads to a lower training error, but should result in a higher validation/test error, i.e. it leads to poorer generalization due to overfitting. In order to avoid overfitting, the model should then, when making predictions at (or past) time t, only be trained on data points that were recorded before time t.

Depending on the data and the suggested model, one may choose between using all past observations (until the time of prediction) or using a fixed number of the most recent observations as training data. The walk-forward scheme, using only a fixed number of observations, can be observed in Figure 4.

Efficient Market Hypothesis
Apart from the three sufficient assumptions, Eugene Fama (who can be seen as the father of the modern EMH) lays out in [7] three different types of tests for the theory: weak form, where only past price movements are considered; semi-strong form, where other publicly available information is included, such as quarterly or annual reports; and strong form, where some actors might have monopolistic access to relevant information. The tests done in this paper, outlined in the introduction, are clearly of the weak form.

Fama also brings to light three models that have historically been used to explain the movements of asset prices in an efficient market: the fair game model, the submartingale and the random walk. The fair game is by far the most general of the three, followed by the submartingale and then the random walk. However, this paper does not seek to explain the movements of the market, but merely to predict them, which means that any of the models can be used as the base model in the tests ahead.

Given the three assumptions on the market, the theory indicates that the best guess for any price in the future is the last known price (i.e., the best guess for tomorrow's price of an asset is the price of that asset today). This can be altered to include a drift term, which can be determined by calculating the mean increment over a certain number of past observations; the best guess then changes to the last known price added with that mean increment.
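The resulting base model is a one-line forecast: the last known close plus the mean of the k most recent increments. A small NumPy sketch follows; the value of k and the price array are made up for illustration.

```python
import numpy as np

def random_walk_forecast(closes, k=20):
    """Predict the next closing price as the last close plus the mean drift
    of the k most recent daily increments."""
    closes = np.asarray(closes, dtype=float)
    drift = np.mean(np.diff(closes)[-k:])
    return closes[-1] + drift

prices = np.array([3215.2, 3223.4, 3230.8, 3225.5, 3234.9])  # hypothetical closes
print(random_walk_forecast(prices, k=3))
```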
RELATED WORKS
As stated in the introduction, deep learning methods are not the mainstream tool when forecasting financial time series. Omer Berat Sezer et al. [13] give a very informative review of the published literature on the subject between 2005 and 2019, and state that there has been a trend towards more usage of deep learning methods in the past five years. The review covers a wide range of deep learning methods, applied to various time series such as stock market indices, commodities and forex. From the review, it is clear that CNNs are not the most used method and that developers have focused more on recurrent neural networks (RNN) and long short-term memory (LSTM) networks. The CNNs are, however, very good at building up high-level features from lower-level features that were originally found in the input data, which is not something an LSTM network is primarily designed to do. Furthermore, the WaveNet structure suggests that the model can catch long and short term dependencies (see Figure 1), which is what the LSTM is designed to do as well.

The CNNs and LSTM networks do not need to be used as two separate models, however, but could be used as two separate parts of the same network. An example is to use a CNN to preprocess the data, in order to extract suitable features, which could then be used as the input to the LSTM part of the network [12]. This is not something that will be further explored in this paper, but it does provide further research questions, such as whether the WaveNet can be used as a sort of preprocessing to an LSTM network. Another example would be to process the data through a CNN and an LSTM separately and then combine them, before the final output, in a suitable manner. This is something that is explored, with satisfactory results, in [14], and the CNN part of that network is in fact an implementation of the WaveNet here as well. However, only the LSTM part of the network handles the exogenous series, so for future work it would be interesting to see if the performance could be improved by making the WaveNet handle the exogenous series as well.

Papers where the WaveNet composes the whole network, instead of just being a component in one, exist as well. Two examples are [1, 2], and these models take exogenous series into consideration as well. However, they used ReLU as the activation function instead of SELU, but implemented the normalization of the network in a similar way as was done in this paper. Furthermore, when considering exogenous series, their approach regarding the output from each residual layer was different.

Instead of extracting the residual from each exogenous series, as was done in this paper, only the combined residual was used. This can be visualized by observing Figure 3 and then ignoring the residual for each exogenous series, which in turn means that each residual layer beyond the first has a structure similar to that in Figure 3. Another way to visualize this is by observing Figure 5, which is a side view of the dilated layers shown in Figure 1. The residual of each exogenous series is ignored in (b), and the developer here hopes that the dependencies between the main and exogenous series will be caught in the first layer, combined with multiple filters in each layer. The structure in (c) has a somewhat more "supervised" approach, in the sense that there is a clearer guide towards how the dependencies shall be caught.

The structure of the network explained above differs from the structure of the original WaveNet. Nevertheless, the model did perform better than an LSTM network (25 hidden neurons, followed by a dropout of 0.1 and then a fully connected layer) when using a similar error metric to the one used here.

MODELS
Base Model
The base model in this paper was chosen to be that of a random walk, and this model can in fact be modelled as an AR(1) process,

    X_t = c + φ_1 X_{t−1} + ε_t,

where φ_1 needs to equal one. ε_t is here again a white noise, which accounts for the random fluctuations of the asset price. The parameter c is the drift of the random walk, and it can be determined by taking the mean of k previous increments,

    c = (1/k) Σ_{j=1}^{k} (X_j − X_{j−1}).

The best guess of the next day's closing price is obtained by taking the expectation of the random walk model (φ_1 equal to one),

    E(X_t) = E(c + X_{t−1} + ε_t) = c + X_{t−1},    (8)

which is the prediction that the base model used.

CNN Model
As stated in the introduction, a CNN model, inspired by the WaveNet, was compared to the base model, and two different approaches were needed in order to answer the research questions. The first approach was to structure the CNN as a univariate model, which only needed to be able to handle a single series (the series to make predictions over). This model can be expressed as a NAR model, which can be observed by studying equation (3). Each element x_t in the sequence is determined by a nonlinear function f (the CNN in this case), which takes the past p elements in the series as input. The second approach was to structure the CNN as a multivariate model, which needed to be able to handle multiple series (the series to make predictions over, together with exogenous series). This model, on the other hand, can be expressed as a NARX model, which can be observed by studying equation (4). Again, each x_t is determined by a nonlinear function f, which here takes the past p and r elements in the main and exogenous series as inputs.

These two models were, for convenience, named the single- and multi-channel models. However, two different variants of the multi-channel model were tested, in order to compare the different structures found in (b) and (c) in Figure 5 (i.e., the structure suggested in the papers referred to in the related works section against the structure suggested in the original WaveNet paper). The structures in (b) and (c) were given the names multi-channel-sparsely-connected model (multi-channel-sc) and multi-channel-fully-connected model (multi-channel-fc), respectively.
METHOD
Data Sampling and Structuring
The financial data, for the single-channel model as well as the more complex multi-channel models, was collected from Yahoo! Finance. The time interval between the observations was chosen to equal a single day, since the objective was to predict, for any given time, the next day's closing price of a certain stock market index (i.e., the time series x = (x_t, ..., x_{t−p})ᵀ, at any time t, was used to predict x_{t+1}).

For the single-channel model, the series under consideration at any time t was x = (x_t, ..., x_{t−p})ᵀ, which is composed of ordered closing prices from the S&P 500. In the multi-channel models, different combinations of ordered OHLC (open, high, low and close) prices of the S&P 500, VIX (implied volatility index of the S&P 500), TY (ten-year treasury note) and TYVIX (implied volatility index of TY) were considered. The closing prices of the S&P 500 are again the series to forecast, while the other series Z = (z_1, ..., z_m) are the exogenous series, where, for every series i, z_i = (z_{i,t}, ..., z_{i,t−r})ᵀ. The values of p and r determine the order of the NAR and NARX, and different values were tested during the validation phase. However, only combinations where p and r are equal were tested, and p will therefore be used to denote the length of both the main and exogenous series in the continuation of this paper.

The time span of the observations was chosen to be between the first day in 2010 and the last day in 2019, which resulted in 2515 × m observations, where again m denotes the number of exogenous input series. Furthermore, since the models require p preceding observations, x = (x_t, ..., x_{t−p})ᵀ, to make a prediction, and then an additional observation, x_{t+1}, to evaluate this prediction, the number of time series that could be used for predictions was decreased to (2515 − p − 1) × m. These observations were then structured into time series, resulting in a tensor with dimension (2515 − p − 1) × (p) × (m).

The resulting tensor was then divided into folds of equal size, which were used in order to implement the walk-forward scheme. The complete horizontal bars in Figure 4 should here be seen as the whole tensor, while the subset consisting of the blue and red sections are the folds. The blue and red sections (the training and test set of that particular fold) should be seen as a sliding window that "sweeps" across the complete set of time series. Whenever the model is done evaluating a certain fold, the window sweeps a specific number of steps (determined by the size of the test set) in time, in order to evaluate the next fold.
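The structuring step described above can be sketched as a sliding window over the downloaded price matrix. The column layout, variable names and the random placeholder data below are assumptions made purely for illustration.

```python
import numpy as np

def build_tensor(series, p):
    """series: array of shape (T, m) with the S&P 500 close in column 0 and any
    exogenous series in the remaining columns. Returns inputs of shape
    (T - p - 1, p, m) and the corresponding next-day closing-price targets."""
    T, m = series.shape
    windows = np.stack([series[i:i + p] for i in range(T - p - 1)])   # (T-p-1, p, m)
    targets = series[p:T - 1, 0]                                      # next day's close
    return windows, targets

data = np.random.rand(2515, 2)      # placeholder for the downloaded price series
X, y = build_tensor(data, p=8)
print(X.shape, y.shape)             # (2506, 8, 2) (2506,)
```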

By further observing Figure 4, it should become clear that the number of folds is influenced by the size of the training and test sets. The size of each fold could (and most likely should) be seen as a hyperparameter. However, in the interest of time, the size of each fold was set to 250 time series, which means that each fold had a dimension of (250) × (p) × (m). Each fold was then further divided into a training set (the first 240 time series) and a test set (the last 10 time series), where the test set was used to evaluate the model for that particular fold.

The sizes chosen for the training and test sets gave as a result 226 folds. These folds were then split in half, where the first half was used in order to validate the models (i.e., determine the most appropriate hyperparameters) and the second half was used to test the generalization of the optimal model found during the validation phase.

Table 1. Validation error (MAPE) for the different models, where p is the length of the time series, l is the number of residual layers and f is the number of filters.

 p   l   f    single-channel   multi-channel-sc   multi-channel-fc
 4   2   32   0.5787           0.5706             0.5599
 4   2   64   0.5802           0.5527             0.5577
 4   2   96   0.5783           0.5508             0.5587
 6   2   32   0.5720           0.5491             0.5544
 6   2   64   0.5752           0.5450             0.5426
 6   2   96   0.5793           0.5462             0.5479
 8   2   32   0.5764           0.5413             0.5498
 8   2   64   0.5737           0.5445             0.5450
 8   2   96   0.5796           0.5405             0.5495
 8   3   32   0.5733           0.5518             0.5251
 8   3   64   0.5692           0.5259             0.5377
 8   3   96   0.5687           0.5468             0.5389
12   2   32   0.5714           0.5524             0.5526
12   2   64   0.5744           0.5478             0.5542
12   2   96   0.5744           0.5417             0.5422
12   3   32   0.5700           0.5298             0.5444
12   3   64   0.5672           0.5368             0.5290
12   3   96   0.5584           0.5312             0.5325

Validation and Backtesting
During the validation phase, different values were considered for the length of the input series (i.e., the value of p), the number of residual connections (i.e., the number of layers stacked upon each other, see Figures 1 and 2) and the number of filters (explained in the theory section for CNNs) in each convolutional layer. The values considered for p were 4, 6, 8 and 12, while the numbers of layers considered were 2 and 3 and the numbers of filters considered were 32, 64 and 96.

For the multi-channel models, all permutations of different combinations of the exogenous input series were considered. However, it was only for the multi-channel-fc model that the exogenous series were seen as a hyperparameter. The optimal combination of exogenous series for the multi-channel-fc model was then chosen for the multi-channel-sc model as well. One final note regarding the hyperparameters is that the dilation rate was set to a fixed value equal to two, which is the same rate as was proposed in the original WaveNet model; the resulting dilated structure can be observed in Figure 1.

As was stated in the previous section, the validation was made on the first 113 folds.
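The fold scheme above can be sketched as a generator that slides a 250-observation window forward in steps of the test size; this is an illustrative reconstruction, not the thesis code.

```python
def walk_forward_folds(n_series, fold_size=250, test_size=10):
    """Yield (train_indices, test_indices) for each fold in the walk-forward scheme."""
    start = 0
    while start + fold_size <= n_series:
        train = range(start, start + fold_size - test_size)               # first 240
        test = range(start + fold_size - test_size, start + fold_size)    # last 10
        yield train, test
        start += test_size                                                # slide by the test size

folds = list(walk_forward_folds(2506))
print(len(folds))   # 226 folds for 2506 usable time series (p = 8)
```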
The overall mean of the error over these folds, for each combination of the hyperparameters above, was used in order to compare the different models, and the model with the lowest error was then used during the backtesting.

Developing the Convolutional Neural Network
The networks were implemented using the Keras API, from the well known open-source library TensorFlow. Keras provides a range of different models to work with, where the most intuitive might be the Sequential model, where developers can add/stack layers and then compile them into a network. However, the Sequential model does not provide enough freedom to construct the complexity introduced in the residual and skip part of the WaveNet. The functional API⁶ might be less intuitive at first, but it does provide more freedom, since the order of the layers in the network is defined by having the output of every layer explicitly defined as an input parameter to the next layer in the network.

Furthermore, Keras comes with TensorFlow's library of optimizers, which are used in order to estimate the parameters in the model and are taken as an input parameter when compiling the model. The optimizer used here was the Adam optimizer, and the learning rate was set to equal 0.0001. The batch size was set to equal one for all models, while the number of epochs was set to 300 for the single-channel model and 500 for the multi-channel models. The difference in epochs is here due to the added complexity that the exogenous series bring. An important note regarding the epochs and the evaluation of the models is that the model state associated with the epoch with the lowest validation/test error was ultimately chosen. This means that if a model made predictions over 300 epochs, but the lowest validation/test error was found during epoch 200, the model state (i.e., the values of the model's weights) associated with epoch 200 was chosen as the best performing model for that particular fold.

RESULTS
Validation
Table 1 displays the validation error for both the single-channel and multi-channel models. Here p is again the length of the input time series, while l is the number of layers in the residual part of the network (see Figure 1) and f is the number of filters used in each convolutional layer. The lowest validation error was achieved with p, l and f equal to 12, 3 and 96 for the single-channel model, while 8, 3, 64 and 8, 3, 32 were the optimal parameters for the multi-channel-sc model and the multi-channel-fc model, respectively. The MAPE for the multi-channel models is displayed only for the best combination of exogenous series found for the multi-channel-fc model, which proved to be just the highest daily value of the VIX.

⁶ More information regarding the Keras functional API can be found in the official Keras documentation: https://keras.io/models/model/.

[Figure 6. Cumulative mean of the MAPE for all 113 test folds.]
[Figure 8. Predictions for the 10 test observations in test fold 25.]

[Figure 7. Cumulative mean of the MAPE for the last 50 test folds.]
[Figure 9. Predictions for the 10 test observations in test fold 34.]

Testing
Figure 6 shows the cumulative mean of all 113 test folds, while Figure 7 shows the cumulative mean for the last 50. These two figures paint two different pictures of the single-channel and multi-channel models. The means across all 113 folds are 0.5793, 0.4877, 0.4707 and 0.4621 for the base, single-channel, multi-channel-sc and multi-channel-fc models, respectively, while the means across the last 50 are 0.6572, 0.5468, 0.5250 and 0.5416. By looking at these numbers, one can see that the performance of the multi-channel-fc model relative to the base model is worse over the last 50 folds than over all 113 folds, while the reverse can be said about the single-channel and multi-channel-sc models.

The two research questions for this paper asked whether a CNN model could outperform a model restricted by the EMH and whether conditional input series could improve the performance. By looking at the numbers in the preceding paragraph, one can clearly see that the answer to both these questions is yes. Both the single-channel and multi-channel models outperformed the base model over the test folds, which account for almost five years of observations. Furthermore, the multi-channel models clearly performed better than the single-channel model when looking at the performance across all test folds. However, the positive effect of including the exogenous series seems to wear off in the last folds for the complex multi-channel-fc model, while it actually increased for the simpler multi-channel-sc model. This suggests that the problem of generalization for the multi-channel-fc model probably lies in that the relationship between the main series and the exogenous series has been altered, which, interestingly enough, only affects the more complex model.

Lastly, while the multi-channel-fc model outperformed all other models across all folds, it is also of interest, for further work, to see in which settings the multivariate model performed the best and the worst. Figures 8 and 9 give an example of these settings, showing the folds for which the multi-channel-fc model outperformed (fold 139) and underperformed (fold 148) the most against the base model.

DISCUSSION
Method
Almost every method comes with at least some drawbacks, and the method in this paper is no exception. Although different combinations of the hyperparameters (length, layers and filters) were tested, the training and validation/test sizes were kept fixed. There is no real scientific basis for having the training size equal to 240 and the validation/test size equal to 10, although it did perform better than having the sizes equal 950 and 50, respectively. It might seem odd to someone with little or no experience in analysing financial data that one would choose to put a limit on the training size, and why the models, evidently, perform better using fewer observations. Having a larger set of training observations is generally seen as a good thing, so why not use all observations, until the time of prediction, as was proposed as an option in the walk-forward validation section?

The answer lies in the structure of the financial time series, specifically the temporal structure. As was explained in the walk-forward validation section, the temporal structure is the reason why one should not include future observations in the training set. However, the impact of the temporal structure does not end there. The financial markets are ever-changing and the predictors (the past values in the time series) usually change with them. New paradigms find their way into the markets, while old paradigms may lose their impact over time. These paradigms can be imposed by certain events, such as an increase in monetary spending, the infamous Brexit or the current Covid-19 pandemic (especially the response, by governments and central banks, to the pandemic). Paradigms can also be recurrent, such as the ones that are imposed by where we are in the short- and long-term debt cycle. Because of these shifts, developers are restricted in how far back in time they can look, and therefore need to put restrictions on the training size. However, in the interest of time (or computing power, however you want to see it), the fold size was not considered as a hyperparameter in this paper.

Lastly, the decision to only use different combinations of exogenous series as a hyperparameter for the more complex multi-channel-fc model might make the comparison between the two multi-channel models somewhat unfair. In order to appropriately compare the two models, the combinations should be seen as a hyperparameter for both models, and the models should be tested on a range of different asset classes as well.

Result
The results in this paper bring forward two key concerns, which can provide appropriate research questions for further study. The first one regards the decrease in performance of the multi-channel-fc model against the other models in the last 50 folds, while the second regards the suitability of the single- and multi-channel models as potential trading systems.

The change in performance between all 113 and the last 50 test folds for the two multi-channel models was a surprising result. If both models had performed worse in the last 50 folds, then it would have been easy to again "blame" the temporal structure of the financial data and, more specifically, the temporal dependencies between the main and exogenous series. However, only the more complex multi-channel model's performance degraded, which suggests that the complexity (i.e., the intermingling between the series in all residual layers) is the primary issue. A solution might then be to have the complexity as a hyperparameter as well and not differentiate between the two structures as was done here. In other words, the two models might more appropriately be seen as two extreme cases of the same model, in a similar way as having the number of filters set to 32 and 96 (see Table 1; 32 and 96 are the extreme cases for the number of filters). By looking at Figure 5 (with p equal to 16 in this case), one can see that the hyperparameter for the complexity has two more values to choose from (having the exogenous series directly influence the second and third hidden layers).

It would also be appropriate to compare the models against different time frames and asset classes, to see if the less complex model indeed generalizes better over time, or if the result here was just a special case. However, to see the complexity as a hyperparameter could prove to be beneficial in both cases.

If the result discussed above is just a special case, then the problem might be the number of folds used for validation in relation to the number of folds used for testing. A solution to this predicament might be to shorten the analyzed time interval altogether, but this might lead to poorer generalizability, since the model could not be properly validated. It would also affect the statistical inference made on the results from the test data. A second solution might be to increase the number of folds used for validation relative to the number of folds used for test data. This approach would increase the generalizability, but does not solve the problem with testing for significance. This means that one should strive to reach a balance between keeping the number of test observations reasonably high, but low enough so that it does not influence the generalization too much.
The tests made in this paper were not primarily intended to judge the suitability of the models as trading systems, but rather to see whether a deep learning approach could perform better than a very naive base model. However, the multi-channel models outperform the base model by more than 20 percent; this difference is quite significant and begs the question of what changes could be made in order for the model to be used as a trading system. While most of the predictions in fold 25, Figure 8, are indeed very accurate, the predictions in fold 34, Figure 9, would be disheartening for any trader to see if the model were used as a trading system. This suggests that one should try to look for market conditions, or patterns, similar to the ones that were associated with low error in the training data. This could be done by clustering the time series, in an unsupervised manner, and then assigning a score to each class represented by the clusters, where the score can be seen as the probability of the model making good predictions during the conditions specific to that class. A condition classed in a cluster with a high score, such as the pattern in Figure 8, would probably prompt the trader to trust the system and to take a position (either long or short, depending on the prediction), while a condition classed in a cluster with a lower score would prompt the trader to stay out of the market or to follow another trading system that particular day.

The Work in a Wider Context
A well known problem in physics, called the three-body problem, or the more generalized n-body problem, is characterised by having no closed-form solution, meaning that it cannot be solved algebraically and that a solution can only be found through numerical methods. Another characteristic of the system is that past movements of the bodies bear no significance and that the only things that affect the system are the bodies' masses, velocities, etc. What does this have to do with finance?

The results in this paper suggest that there indeed exists useful information to be found in the past values of a financial time series, which is not in any way controversial except to the believers in a completely efficient market. However, as in the n-body problem (where the bodies have now been switched to financial markets and their traders, and the forces between the objects are market forces such as capital flows, information, expectation, etc.), traders' expectations and actions affect, however so slightly, the whole system.

What happens then if multiple actors use the same trading system? The answer is that the predictions of the model will, most likely, be priced into the expectation of future price movements and thereby affect the current price of that asset, which would in turn affect the predictions, and so on. This creates a feedback loop, and if a large enough number of actors were to contribute to the loop, the model would be rendered useless. This can be seen as if the market has turned efficient for the particular analysis that this specific trading system uses. Paradoxically, a model that sets out to exploit market inefficiencies, and does this rather successfully, might actually make the market more efficient and make it more similar to an n-body problem. However, this might actually be a good thing for the majority of market participants, since the market would become increasingly fair.

Traders will probably, to some extent, still be affected by irrationality and biases, and new trading systems will pop up, replacing the models that have become inefficient, but these might need to become more and more complex, which can possibly lead to more and more efficient markets in turn.

CONCLUSION
The deep learning approach, inspired by the WaveNet structure, proved successful in extracting information from past price movements of the financial data, and the result was a model that outperformed a naive base model by more than 20 percent. While these results are quite significant, the model was not primarily designed as a trading system, but was rather designed to be a proof of concept. However, a few extensions might make the model more suitable as a trading system, which would be the next logical step to take if one were to develop the model further.

The performance of the deep learning approach is most likely due to its exceptional ability to extract non-linear dependencies from the raw input data.
However, as the field of deep learning applied to financial markets progresses, the predictive patterns found in the data might become increasingly hard to find. This would suggest that the fluctuations in the market would come to mirror, more and more, a system much like the n-body problem, where the only predictive power lies in the estimation of the forces acting on the objects. Although, just as the forces in the n-body problem can be estimated, so can the forces in the financial markets, which are heavily influenced by the current sentiment in the market. A way to extract the sentiment, at any current moment, might be to analyse unstructured data extracted from, for example, multiple news sources or social media feeds. Further study in text mining, applied to financial news sources, might therefore be merited and might be an area that will become increasingly important to the financial sector in the future.

REFERENCES
[1] Anastasia Borovykh, Sander Bohte, and Cornelis W. Oosterlee. 2017. Conditional time series forecasting with convolutional neural networks. arXiv preprint arXiv:1703.04691 (2017).
[2] Anastasia Borovykh, Sander Bohte, and Cornelis W. Oosterlee. 2018. Dilated convolutional neural networks for time series forecasting. Journal of Computational Finance, Forthcoming (2018).
[3] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9, 8 (1997), 1735–1780.
[4] Tae Yoon Kim, Kyong Joo Oh, Chiho Kim, and Jong Doo Do. 2004. Artificial neural networks for non-stationary time series. Neurocomputing 61 (2004), 439–447.
[5] Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
[6] Günter Klambauer, Thomas Unterthiner, Andreas Mayr, and Sepp Hochreiter. 2017. Self-normalizing neural networks. In Advances in Neural Information Processing Systems. 971–980.
[7] Burton G. Malkiel and Eugene F. Fama. 1970. Efficient capital markets: A review of theory and empirical work. The Journal of Finance 25, 2 (1970), 383–417.
[8] John F. Muth. 1961. Rational expectations and the theory of price movements. Econometrica: Journal of the Econometric Society (1961), 315–335.
[9] Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. 2016. WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499 (2016).
[10] Robert Pardo. 1992. Design, Testing, and Optimization of Trading Systems. Vol. 2. John Wiley & Sons.
[11] Wolfgang Paul and Jörg Baschnagel. 2013. Stochastic Processes: From Physics to Finance. Vol. 2. Springer Science & Business Media.
[12] Tara N. Sainath, Oriol Vinyals, Andrew Senior, and Haşim Sak. 2015. Convolutional, long short-term memory, fully connected deep neural networks. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 4580–4584.
[13] Omer Berat Sezer, M. Ugur Gudelek, and Ahmet Murat Ozbayoglu. 2020. Financial time series forecasting with deep learning: A systematic literature review: 2005–2019. Applied Soft Computing (2020).
[14] Zhipeng Shen, Yuanming Zhang, Jiawei Lu, Jun Xu, and Gang Xiao. 2018. SeriesNet: A generative time series forecasting model. In 2018 International Joint Conference on Neural Networks (IJCNN). IEEE, 1–8.
[15] Richard Thaler. 1980. Toward a positive theory of consumer choice. Journal of Economic Behavior & Organization 1, 1 (1980), 39–60.
[16] Zhiguang Wang, Weizhong Yan, and Tim Oates. 2017. Time series classification from scratch with deep neural networks: A strong baseline. In 2017 International Joint Conference on Neural Networks (IJCNN). IEEE, 1578–1585.