
Predicting periodic and chaotic signals using Wavenets

Master of Science Thesis

For the degree of Master of Science in Applied Mathematics with the specialization Financial Engineering at Delft University of Technology

D.C.F. van den Assem (4336100)

August 18, 2017

Supervisor: Prof. dr. ir. C. W. Oosterlee, TU Delft
Thesis committee: Dr. S. M. Bohte, CWI Amsterdam
Dr. ir. R. J. Fokkink, TU Delft

Faculty of Electrical Engineering, Mathematics and Computer Science (EEMCS) · Delft University of Technology


Copyright © Delft Institute of Applied Mathematics (DIAM). All rights reserved.

Master of Science Thesis D.C.F. van den Assem (4336100) iv

Abstract

This thesis discusses forecasting periodic and chaotic signals using Wavenets, with an application to financial time series. Conventional neural networks used for forecasting, such as the LSTM and the fully convolutional network (FCN), are computationally expensive. The Wavenet uses dilated convolutions, which significantly reduce the computational cost compared to an FCN with the same number of inputs. Forecasts made on the sine wave show that the network can successfully make a full forecast of a sine wave. Forecasts made on the Mackey Glass time series show that the network can outperform the LSTM and other methods. Furthermore, forecasts made on the Lorenz system show that the network is able to outperform the LSTM. By conditioning the network on the other relevant coordinate, the prediction becomes more accurate and the network is able to make full forecasts. In a financial application, the network shows less predictive accuracy than multivariate dynamic kernel support vector machines.


Table of Contents

Acknowledgements

1 Introduction
  1.1 Outline
2 Machine Learning
  2.1 Terminology in Machine Learning
  2.2 Classification in the Iris data set
  2.3 The Single Layer Perceptron Model
    2.3.1 Implementation and Training
    2.3.2 Example 2: The other data set
  2.4 Logistic regression
  2.5 Introduction to Neural Networks
  2.6 Summary
3 Neural Networks
  3.1 Network Architectures
    3.1.1 Activation Functions
    3.1.2 Convolutional Neural Networks
    3.1.3 Recurrent Neural Networks
  3.2 of the Neural Network
    3.2.1
    3.2.2 Cost Function
    3.2.3 Stochastic Gradient Descent, Batch and Mini-Batch Gradient Descent
    3.2.4 Initializers
  3.3 Regularization methods
    3.3.1
    3.3.2 Dropout
  3.4 Wavenet
  3.5 Augmented Wavenet (AWN)
  3.6 Summary
4 Methodology
  4.1 Evaluation of the network
  4.2 Error Measures
    4.2.1 Statistical Testing
    4.2.2 Benchmarks
  4.3 Artificial Time Series
    4.3.1 The sine wave
    4.3.2 The Lorenz System
    4.3.3 Mackey Glass Equation
  4.4 Real world time series
    4.4.1 Data pre processing
5 Results
  5.1 Implementation comparison
  5.2 The Sine Wave
  5.3 The Mackey Glass Time Series
  5.4 The Lorenz System
  5.5 Results on financial time series
  5.6 Summary
6 Conclusion
  6.1 Summary and conclusion
  6.2 Future research
A Separation
B Glorot derivation
C Code of the model
Bibliography

List of Figures

2.1 Scatter plot of Iris data, Setosa (blue •), Versicolor (red ×), Virginica (green ♦). In the subplot on the first row and second column, the sepal width is plotted against the sepal length.
2.2 Scatter plot of Iris data, Setosa (blue •), Versicolor (red ×), Virginica (green ♦).
2.3 Schematic representation of the single layer perceptron.
2.4 Illustrations of the multi-class logit and the softmax implementation.
2.5 A multi layer network for solving the XOR-problem.
3.1 Graph of a layered network with E = E1 ∪ E2 ∪ E3, N = N1 ∪ N2.
3.2 The three different activation functions.
3.3 Illustration of the Receptive Field.
3.4 Illustration of the replications, shared weights and feature map.
3.5 Illustration of the Padding.
3.6 Illustration of the Strides.
3.7 Illustration of the causal convolutions.
3.8 Illustration of the causal convolutions with larger inputs and outputs.
3.9 Illustration of the dilated convolutions.
3.10 The RNN on the left and the unfolded RNN network on the right.
3.11 The LSTM block. The × and + are point-wise operators, σ and tanh are activation functions. Two joining arcs make a concatenate operation. Two splitting arcs make a copy operation.
3.12 Figures of paraboloids, with a = 1 and b = 2.
3.13 Behaviour of the training error and validation error during training.
3.14 Overview of the residual block and the entire architecture, retrieved from [1].
3.15 Illustration of the stacked dilated convolutions.
3.16 Overview of the architecture used in AugmentedWavenet, retrieved from [2].
5.1 The full forecast of the sine wave using different implementations.
5.2 Overview of the AWN I(4).
5.3 The full forecast of a noisy sine wave using AWN I(4).
5.4 The full forecast of the Mackey Glass time series using 8 layers on AWN I(4).
5.5 The full forecast of the Mackey Glass time series using 8 layers on AWN I(4) trained on one-ahead noisy data (σ = 0.1).
5.6 The one ahead forecast and the full forecast of the Lorenz system using 4 layers on AWN I(4).
5.7 Convergence behavior of the training of networks with different γ parameter, for 4 and 8 layers.
5.8 The full conditioned forecast of the Lorenz system using 4 layers on AWN I(4C).
5.9 Comparison of the one step ahead (using months) of the AWN I(4) without and with tuning of the parameters.
A.1 Petal Length vs Petal Width.

List of Tables

2.1 XOR-problem using SLP on the left and Multi-Layer XOR-problem on the right.
3.1 Number of weights for networks with a ‘visual field’ of 512.
4.1 Difference in response to errors between MAE and RMSE.
5.1 The standard parameters used in the Wavenet.
5.2 MAE and MSE based on 1000 samples of full forecast.
5.3 - means that the forecast diverged, therefore the number is not useful.
5.4 Results for I(2) with a variation, I(3) and I(4).
5.5 The√ one-ahead and full forecast performance with different values for SNR. (SNR = 2 σ2 )
5.6 The results for the Mackey Glass t steps ahead forecast using 4 layers. Values are RMSE $\times 10^{-3}$. The ± value is the standard deviation of the 10 runs.
5.7 Results for noisy Mackey Glass time series on two configurations. Configuration 1 uses 4 layers and configuration 2 uses 8 layers. Values are RMSE $\times 10^{-2}$.
5.8 Results of the modified network for different γ, RMSE of the benchmark is $4.78 \times 10^{-3}$.
5.9 Comparison of the noisy conditioned Lorenz system. RMSE $\times 10^{3}$.
5.10 Results from AWN I(4) and AWN I(4C) on the S&P500 data, conditioned with the CBOE data.
5.11 The standard parameters used in the AWN I(4) and AWN I(4C).

Acknowledgements

This thesis has been submitted for the degree Master of Science in Applied Mathematics with the specialization Financial Engineering at the Delft University of Technology. The academic supervisor of this thesis was Prof.dr.ir. C.W. Oosterlee, professor at the group of the Delft Institute of Applied Mathematics. The other daily supervisor was Dr. S.M. Bohte, scientific staff member of the Neural Computation Lab of the Machine Learning Group at the CWI Amsterdam.

I would like to thank my supervisors Prof.dr.ir. C.W. Oosterlee and Dr. S.M. Bohte for their assistance and support throughout writing this thesis. I would also like to thank Dr.ir. R.J. Fokkink for being part of the examination committee. Lastly, I would like to thank my family, friends and fellow students for their encouragement and support.

Delft, University of Technology D.C.F. van den Assem (4336100) August 18, 2017


Chapter 1

Introduction

According to Google Trends, the popularity of the search term ‘’ is four times as high as four years ago. One of the reasons for the increase in popularity is the success achieved in image classification [3]. The convolutional neural network (CNN) is one of the most frequently used deep learning methods. This thesis uses this type of neural network for forecasting to show that long-term (temporal) dependencies within time series can be learned. In addition, conditioning of the network is used to forecast multivariate time series.

The main focus of this thesis is forecasting periodic and chaotic signals, specifically the sine wave, the Lorenz system and the Mackey Glass time-delay differential equation. In addition, forecasts are made on financial time series, which are known to have high noise components.

The Wavenet, as proposed by [1], shows promising performance on predicting and generating raw audio waveforms. The authors of [2] showed that an augmented Wavenet is able to learn temporal dependencies in financial time series. Both networks use dilated convolutions as the main building block. The network in this thesis is an augmented Wavenet made for regression and designed in such a way that the dependency of the weights in the succeeding layers is minimized.

The experiments show that this model outperforms the ALSTM as proposed by [4] in forecasting both the Lorenz and the Mackey Glass chaotic time series using a fixed set of hyper-parameters. Moreover, fine-tuning these parameters improves the accuracy of the forecasts even more. Conditioning the Lorenz system improves the forecasting performance and enables the network to forecast the u variable at any time point, given v. In the financial application, the network is able to find temporal relations; however, it does not outperform other machine learning methods such as the multivariate dynamical kernel support vector machine [5].

1.1 Outline

Section 2 starts with an introduction to machine learning, including some terminology, the statistical classification problem, an extension to regression and a very basic introduction to neural networks. Readers who already have knowledge of machine learning are recommended to skip to Section 3. A deep learning expert can immediately skip to Subsection 3.4, which describes the specific neural network structure used in the rest of the report.

In Section 3, all the elements required to successfully train the Wavenet are described. This starts with a description of architectures and activation functions. Thereafter, the convolutional neural network and the LSTM (Long Short Term Memory) are described. These are two popular network architectures in deep learning: the convolutional network is mainly popular in image processing and the LSTM in one-dimensional sequences. The LSTM might be difficult to understand; since it is merely used as a benchmark, that part can be skipped. This is followed by a detailed explanation of different ways of training the network and of using the right initializers and regularization methods. Finally, the Wavenet and the Augmented Wavenet are discussed.

In Section 4 the methodology used in this thesis is described. It starts by introducing the relevant statistical measures in forecasting, which include the MSE and the MASE. This is followed by a series of artificial signals which are used to test specific properties of the network; e.g. sines are used for periodicity and the Lorenz system and Mackey Glass equation are used to generate chaotic signals. These tests should give an insight into how the network will perform on financial time series. Finally, some real world data tests are given. The performance of the network is compared to methods from other literature.

In Section 5 a Wavenet is built with the Augmented Wavenet as starting point. This is the most important chapter for the reader who wants to understand what the network can predict and why certain improvements are made to the network. The network is compared with benchmarks from literature, which shows promising performance in some cases, such as the Lorenz system. The conditioning of the network in the Lorenz system shows interesting results. Finally, the Wavenet is tested on the financial time series.

Finally, Section 6 summarizes all the findings in this thesis. A few recommendations are made, which might be used for future research.

Chapter 2

Machine Learning

In this chapter, a mathematical introduction to machine learning is given, with a focus on neural networks. It starts with the terminology used throughout this thesis, consisting of the problem definition in machine learning, supervised learning and splitting data sets. A classical example of a statistical classification problem is given in Section 2.2, which is used for introducing the first neural network in Section 2.3. This network is trained in two different ways: with the perceptron algorithm and with the gradient descent algorithm using the delta rule. As we will see in the first classification example, it is sometimes convenient to use regression. Instead of assigning a class to a specific input, a probability is assigned to each class using logistic regression, as introduced in Section 2.4. The logistic regression is extended in a natural way to neural networks in Section 2.5. The aim of Section 2.5 is to provide a reason why there is a need for multiple layers, in contrast to the perceptron model. Section 2.6 summarizes the findings of this chapter.

2.1 Terminology in Machine Learning

Throughout this thesis, some machine learning terminology is used. The basic terms are explained in this subsection. In machine learning, input variables are used to predict an output variable. The input variables are often named features and are denoted by $\mathbf{X} = (X_1, X_2, \ldots, X_k)$,¹ where each $X_i$, $i \in \{1, \ldots, k\}$, is a feature. The output variable is often called the response or dependent variable and is denoted by $Y$. The relationship between $Y$ and the corresponding $\mathbf{X}$ can be written in a general form:

$$Y = f(\mathbf{X}) + \epsilon \tag{2.1}$$

In Eq. (2.1), $f$ is a function of the features $(X_1, X_2, \ldots, X_k)$ and $\epsilon$ is the random error term. The error term is independent of $\mathbf{X}$ and has a mean value of zero.

In practice, the features $\mathbf{X}$ are available without having $Y$ or knowing the exact relation between $\mathbf{X}$ and $Y$. Since the error term has a mean value of zero, the goal is to estimate $f$.

¹ As a convention we use bold letters for collections.

$$\hat{Y} = \hat{f}(\mathbf{X}) \tag{2.2}$$

Here $\hat{f}$ is the estimate of $f$, which is often considered a black box, meaning that only the relation between the input and output of $\hat{f}$ is known, but the question why it works remains unanswered. As an example, consider a neural network using more than 10000 parameters to determine whether a cat is shown in an image. We are unable to understand what the contribution of each parameter is to the output decision.

The function $\hat{f}$ is found using learning. Supervised learning and unsupervised learning are two approaches used in machine learning for this task. In supervised learning, labeled data is used for training. By showing the inputs and the corresponding outputs (the labels), the function $\hat{f}$ is optimized such that it approximates the output. In unsupervised learning, the goal is to find a hidden structure in unlabeled data. The algorithm has no measure of accuracy on the input data, which distinguishes it from supervised learning. In this thesis, the focus is on supervised learning.

In supervised learning we wish to train $\hat{f}$ using a data set $Z \subset \Omega_t$ consisting of $n$ pairs $(X_j, Y_j)$, where $X_j$ contains the features and $Y_j$ is the dependent variable, for $j \in \{1, 2, \ldots, n\}$, and where $\Omega_t$ is the set of all pairs that could have been observed up to time $t$. The aim is to find an $\hat{f}$ such that the error measure, as defined in Definition 2.1, of the prediction $\hat{Y}$ is minimized. Moreover, we wish that $\hat{f}$ generalizes; i.e. for any $Z \subset \Omega_T$ we wish to find the same error. Note that in this case we can have $T > t$, which means that we wish $\hat{f}$ to have the same error in predictions from future observations.

The error of a single prediction $\hat{Y}_j = \hat{f}(X_j)$ can, for example, be calculated as $|Y_j - \hat{Y}_j|$; other measures of the error are also possible. In general, we are more interested in the error of $\hat{Y}_j$ over all the pairs. Therefore we often use averaging measures to define the mean error. Averaging the absolute error of this example gives the mean absolute error. Other error measures are discussed in detail in Section 4.2.

DEFINITION 2.1. The error measure of a prediction $\hat{Y}_j = \hat{f}(X_j)$ is defined by a function $J(\hat{Y}_j, Y_j): \mathbb{R}^2 \to \mathbb{R}$. A commonly used function to define the error is the absolute error, defined by $J(\hat{Y}_j, Y_j) = |\hat{Y}_j - Y_j|$. The average error of a set of predictions is defined by the function $\bar{J}(\hat{Y}_1, \ldots, \hat{Y}_n, Y_1, \ldots, Y_n): \mathbb{R}^{2n} \to \mathbb{R}$,

$$\bar{J} = \frac{1}{n} \sum_{j=1}^{n} J(\hat{Y}_j, Y_j).$$
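As a small illustration of Definition 2.1, the following sketch computes the absolute error of single predictions and averages it into the mean absolute error. The function names and the numbers are hypothetical and chosen only for this example.

```python
def absolute_error(y_hat, y):
    """Error measure J for a single prediction: |y_hat - y|."""
    return abs(y_hat - y)


def average_error(y_hats, ys, error=absolute_error):
    """Average error J-bar over n paired predictions and observations."""
    if len(y_hats) != len(ys) or not ys:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(error(y_hat, y) for y_hat, y in zip(y_hats, ys)) / len(ys)


# Hypothetical predictions and observations, for illustration only.
predictions = [2.5, 0.0, 2.0, 8.0]
observations = [3.0, -0.5, 2.0, 7.0]
mae = average_error(predictions, observations)
```

Passing a different `error` function, for instance a squared error, would give another averaging measure with the same structure.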

We have the freedom to choose the number of parameters in $\hat{f}$ and to fine-tune the parameters used during the training procedure, such as the number of iterations, the learning rate and so on. We can therefore find a set of parameters which results in the lowest error measure. However, this error measure does not give any information about what the error measure would be on new observations, because the parameters were fit to the training set. Choosing a network based on this information may lead to overfitting on the training set.

This gives reason to split the data set $Z$ in two sets: one for training, $Z_{\text{train}} \subset Z$, and one for testing, $Z_{\text{test}} \subset Z$, such that $Z_{\text{train}} \cap Z_{\text{test}} = \emptyset$ and $Z_{\text{train}} \cup Z_{\text{test}} = Z$. In this case, multiple parameter sets are trained, and the performance is evaluated on the test set. Choosing the best performing set of parameters increases the out-of-sample performance. However, the error measure remains an inaccurate estimate for future predictions due to overfitting to the test set: choosing the best performing network introduces a bias towards the test data.

As a solution to this problem, the data is split into three subsets: training data $Z_{\text{train}} \subset Z$, validation data $Z_{\text{validation}} \subset Z$ and test data $Z_{\text{test}} \subset Z$, where $Z_{\text{train}} \cup Z_{\text{validation}} \cup Z_{\text{test}} = Z$ and the three subsets are pairwise disjoint. First, multiple parameter sets are trained on the training set. Second, the best performing set of parameters is chosen using the validation set. Finally, the error measure is calculated on the test set. Once the test data is used, no further changes are made to the parameters. This method gives an unbiased estimate of the error measure on future data.
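The three-way split can be sketched as follows. This is a minimal illustration, not the procedure used in the thesis: the function name, the 60/20/20 fractions and the random shuffle are assumptions. For time series, a chronological split (train on the earliest observations, test on the latest) is usually more appropriate than shuffling.

```python
import random


def split_data(pairs, train_frac=0.6, validation_frac=0.2, seed=0):
    """Shuffle the pairs and cut them into three disjoint subsets."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    n_train = int(train_frac * len(shuffled))
    n_validation = int(validation_frac * len(shuffled))
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_validation]
    test = shuffled[n_train + n_validation:]  # everything that is left
    return train, validation, test


# Hypothetical data set Z of 10 (X_j, Y_j) pairs.
Z = [(j, 2 * j) for j in range(10)]
Z_train, Z_validation, Z_test = split_data(Z)
```

Because the three slices are taken from one shuffled list, their union is the full data set and no pair appears in more than one subset.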

In short, machine learning is finding a function $\hat{f}$ describing the relationship between an input variable $\mathbf{X}$ and a dependent variable $Y$ by using observations, in such a way that $\hat{f}$ is as accurate as possible for any observation. In the next subsection, a classic example of the classification problem is given, in which the basic terminology is applied.

2.2 Classification in the Iris data set

The Iris data set was introduced by [6]. This is a multivariate data set containing 50 entries for each of three species, with four features each: sepal length, sepal width, petal length and petal width. Figure 2.1 shows the four features in a scatter plot. Distinguishing categories by using their features is called classification. It is easy to see in Figure 2.2a that the Setosa species can be distinguished from the others by using, for example, the petal length. In mathematical terms, the Setosa species is linearly separable from the others, as defined in Definition 2.2.

DEFINITION 2.2. Two sets $X_1 \subset X$ and $X_2 \subset X$ are called linearly separable if their convex hulls are disjoint.²

Consider the set $X_1$, the petal lengths of the Versicolor and the Virginica species, and the set $X_2$, the petal lengths of the Setosa species. In this case $\operatorname{conv\,hull}(X_1) = \{x \mid x \in [\min(X_1), \max(X_1)]\} = [3.0, 6.9]$ and $\operatorname{conv\,hull}(X_2) = \{x \mid x \in [\min(X_2), \max(X_2)]\} = [1.0, 1.9]$. Clearly, $\operatorname{conv\,hull}(X_1) \cap \operatorname{conv\,hull}(X_2) = \emptyset$, thus $X_1$ and $X_2$ are linearly separable. Therefore it is possible to deterministically classify the Setosa species in the Iris data set. In the following section we will consider a data set which has the same features as the Iris data set, but in which the labels are changed by grouping the Versicolor and Virginica species, which makes the data linearly separable. In Section 2.3.1 we discuss a classifier and training algorithm for linearly separable data sets.
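For a single feature, the convex hull of a set is just the interval between its minimum and maximum, so the check above reduces to testing whether two intervals overlap. The sketch below illustrates this; the function names are hypothetical, and only the extreme petal lengths quoted in the text are real, the intermediate values being placeholders.

```python
def interval_hull(values):
    """Convex hull of a set of real numbers: the interval [min, max]."""
    return min(values), max(values)


def linearly_separable_1d(x1, x2):
    """True if the convex hulls (intervals) of x1 and x2 are disjoint."""
    lo1, hi1 = interval_hull(x1)
    lo2, hi2 = interval_hull(x2)
    return hi1 < lo2 or hi2 < lo1


# Only the minimum and maximum are taken from the text; the values in
# between are hypothetical placeholders.
versicolor_virginica = [3.0, 4.5, 5.1, 6.9]
setosa = [1.0, 1.4, 1.5, 1.9]
separable = linearly_separable_1d(versicolor_virginica, setosa)
```

In higher dimensions the same idea applies, but checking whether two convex hulls intersect requires more work than comparing interval end points.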

From the scatter plot in Figure 2.2b we observe that the Versicolor and Virginica species are not easily distinguishable for some of the data entries. By creating convex hulls of the individual features of these two species, we can show that the Versicolor and Virginica species are not linearly separable for the individual features. Therefore it is impossible to classify the Versicolor and Virginica species deterministically using linear classifiers. Intuitively, we could create separating (hyper-)planes such that a minimal number of data entries is incorrectly separated. This gives a statistically optimal result for the data set on which the classifier is trained. However, this approach does not necessarily give the best classifier for new measurements of the species. It might be the case that the classifier also incorporates the error or noise in the data and therefore overfits to the training data. This gives reason to use the method of splitting the data described in Section 2.1. Since the choice of the (hyper-)plane is not unique, the best (hyper-)plane can be chosen using the validation data and the generalization performance can be measured on the test data. In practice, we wish to use methods that by themselves are less sensitive to overfitting.

² This definition of linear separability is equivalent to another well-known definition found in the literature. In this thesis, the convex hull definition is preferred since it gives an easier geometric interpretation for multiclass classification and higher dimensional problems: if a balloon is wrapped around the cloud of points of each class, the balloons should not intersect for the data to be linearly separable.

Figure 2.1: Scatter plot of Iris data, Setosa (blue •), Versicolor (red ×), Virginica (green ♦). In the subplot on the first row and second column, the sepal width is plotted against the sepal length.

Figure 2.2: Scatter plot of Iris data, Setosa (blue •), Versicolor (red ×), Virginica (green ♦). (a) Zoom 1. (b) Zoom 2.

Overfitting can be avoided in different ways. For this specific problem, the Support Vector Machine (SVM) is a popular method for finding an optimal hyperplane which generalizes well. Other methods which are less prone to overfitting are discussed in later chapters, such as Section 3.1.2. The next subsection describes a simple neural network for finding a separating plane.

2.3 The Single Layer Perceptron Model

The Single Layer Perceptron Model (SLP) is the simplest model of a Neural Network. It consists of one input layer and one output layer, as shown in Figure 2.3. The inputs $\mathbf{x}$³ are passed through the weighted graph. The function $f$ takes the weighted sum of the inputs as its argument and compares it with a threshold $\theta$. The output of $f$ is a binary value, true or false. A formal definition of the perceptron is given in Definition 2.3.

DEFINITION 2.3. The perceptron is a binary output function $f(\mathbf{w}, \theta): \mathbb{R}^{n+1} \to \{-1, 1\}$ with a threshold $\theta$, an input vector $\mathbf{x} \in \mathbb{R}^n$ and an associated weight vector $\mathbf{w} \in \mathbb{R}^n$, which outputs $1$ if the inequality $\mathbf{w} \bullet \mathbf{x} \ge \theta$ holds and outputs $-1$ otherwise.⁴

Figure 2.3: Schematic representation of the single layer perceptron

2.3.1 Implementation and Training

To make training the perceptron more convenient, slight changes are made to the inputs and the function. In the input layer we add a new constant input 1 with weight $-\theta$, called the bias. The threshold of the perceptron is now the constant 0. Note that this is equivalent to Definition 2.3, since $\mathbf{w} \bullet \mathbf{x} \ge \theta \Leftrightarrow \mathbf{w} \bullet \mathbf{x} - \theta \ge 0$. This transformation of the problem allows us to treat the parameter $\theta$ in the same way as the weights, and therefore Algorithm 1 is able to update the parameter $\theta$.

³ From this point, lower case bold letters, e.g. $\mathbf{x}$, are used for vectors and upper case letters, e.g. $X$, are used for matrices.
⁴ $\bullet$ is the dot product between two vectors $\mathbf{w}$ and $\mathbf{x}$ of the same length $n$, defined by $\sum_{i=1}^{n} w_i x_i$. The inner product is also denoted by $\langle \mathbf{w}, \mathbf{x} \rangle$.

Algorithm 1 Perceptron Learning Algorithm for linearly separable data
    Let $X$ be an $m \times (n + 1)$ matrix with $m$ data entries, $n$ features and one column for the bias term (all values set to 1), such that the two classes in this data set are linearly separable. Let $\mathbf{y} \in \{-1, 1\}^m$ be a vector of length $m$ indicating the class of each data entry of $X$.
    Choose a random $\mathbf{w} \in \mathbb{R}^{n+1}$. Set conv = 1
    while conv do
        Choose a random index $r$ uniformly from $\{1, \ldots, m\}$
        Set $\mathbf{x} = X(r, \cdot)$
        Set $y = \mathbf{y}(r)$
        Compute $\hat{y} = \operatorname{sign}(\mathbf{w} \bullet \mathbf{x})$
        if $y = 1$ and $\hat{y} = -1$ then
            Update $\mathbf{w} := \mathbf{w} + \mathbf{x}$
        end if
        if $y = -1$ and $\hat{y} = 1$ then
            Update $\mathbf{w} := \mathbf{w} - \mathbf{x}$
        end if
        Convergence test: conv := 0 if $\operatorname{sign}(X \bullet \mathbf{w}) = \mathbf{y}$, and conv := 1 if $\operatorname{sign}(X \bullet \mathbf{w}) \neq \mathbf{y}$
    end while

For linearly separable data, Algorithm 1 trains the weight vector $\mathbf{w}$ of the perceptron such that the output of the perceptron corresponds to the correct class for all data entries in $X$. The algorithm starts by initializing the data, a random weight vector and a convergence test variable. The algorithm then picks a random sample from the data set and tests whether the sign of the inner product of the weight vector and the sample assigns the correct label to the sample. If an incorrect label is assigned, it updates the weight vector such that the argument of the sign function moves towards the correct label. To see why this works, consider the case of a positive label in $\mathbf{y}$ and a negative predicted label, in which the weight vector is updated by adding the sample $\mathbf{x}$:

$$\operatorname{sign}(\langle \mathbf{w}, \mathbf{x} \rangle) = -1 \tag{2.3}$$

Add x to w and use the linearity of the inner product to obtain:

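$$\operatorname{sign}(\langle \mathbf{w} + \mathbf{x}, \mathbf{x} \rangle) = \operatorname{sign}(\langle \mathbf{w}, \mathbf{x} \rangle + \langle \mathbf{x}, \mathbf{x} \rangle) \tag{2.4}$$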

By definition, the inner product is positive definite and therefore $\langle \mathbf{x}, \mathbf{x} \rangle \ge 0$. Since we know that $\mathbf{x} \neq 0$ from (2.3), we have $\langle \mathbf{x}, \mathbf{x} \rangle > 0$. Therefore the argument of the sign function in (2.4) moves towards the correct sign after the weight update. After each random sample, a convergence test is done by comparing all the signs of the samples in the data set with their corresponding labels. If the convergence test shows that not all data entries are classified correctly, the algorithm starts over by picking a new random sample; otherwise the algorithm terminates, having successfully found a set of weights which correctly classifies the data.
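The perceptron learning procedure described above can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the thesis implementation: the data are hypothetical one-feature values with a bias column of ones, the function names are made up for this example, and an iteration cap (`max_iterations`) is added so that the loop also ends on data that are not linearly separable.

```python
import random


def sign(value):
    """Return 1 for non-negative arguments and -1 otherwise."""
    return 1 if value >= 0 else -1


def dot(w, x):
    """Dot product of two equally long sequences."""
    return sum(wi * xi for wi, xi in zip(w, x))


def train_perceptron(X, y, seed=0, max_iterations=100_000):
    """Perceptron learning on rows of X (bias column included), labels in {-1, 1}."""
    rng = random.Random(seed)
    w = [rng.uniform(-1.0, 1.0) for _ in X[0]]  # random initial weights
    for _ in range(max_iterations):
        # Convergence test: stop once every entry is classified correctly.
        if all(sign(dot(w, x)) == label for x, label in zip(X, y)):
            return w
        r = rng.randrange(len(X))  # pick a random data entry
        x, label = X[r], y[r]
        if sign(dot(w, x)) != label:
            # Adds x when the label is 1 and subtracts x when it is -1.
            w = [wi + label * xi for wi, xi in zip(w, x)]
    raise RuntimeError("no separating weights found within max_iterations")


# Hypothetical data: one feature plus a constant 1 for the bias term.
X = [[1.0, 1.0], [1.5, 1.0], [1.9, 1.0], [3.0, 1.0], [4.5, 1.0], [6.9, 1.0]]
y = [-1, -1, -1, 1, 1, 1]
w = train_perceptron(X, y)
predictions = [sign(dot(w, x)) for x in X]
```

The two update branches of the algorithm are merged into a single line by multiplying the sample with its label, which is equivalent for labels in {−1, 1}.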

To prove convergence of Algorithm 1, Proposition 8 from [7] can be used. To speed up the training, one might consider using different algorithms. As mentioned in [7], the training of the perceptron can be translated to solving an interior point problem. Using Karmarkar's algorithm [8], the training of a problem with $n$ variables costs on the order of $n^{3.5}$ operations. The existence of an algorithm which solves this training problem in polynomial time shows that this problem is in P, and hence not NP-hard unless P = NP.

2.3.2 Example 2: The other data set

Algorithm 1 would not terminate if we used it on the data set shown in Figure 2.2b, since that data set does not satisfy the linear separability condition. We could train the algorithm on a linearly separable subset such that the classification error, as defined in Definition 2.4, is minimized. One easy way to do this is by removing the data points with label $-1$ that are in the convex hull of the data points with label $1$, or the other way around.

DEFINITION 2.4. The classification error is defined as the ratio of the number of incorrect predictions to the total number of predictions.
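A short sketch of the classification error follows, taking the error to be the fraction of incorrect predictions (one minus the accuracy). The function name and the labels are hypothetical.

```python
def classification_error(predicted, actual):
    """Fraction of predictions that differ from the true labels."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    wrong = sum(1 for p, a in zip(predicted, actual) if p != a)
    return wrong / len(actual)


# Hypothetical labels: one of the four predictions is wrong.
error = classification_error([1, -1, 1, 1], [1, -1, -1, 1])
```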

2.4 Logistic regression

In the single layer perceptron the output is restricted to a discrete value. This characterizes the classification problem. Regression is characterized by having a continuous output. The author of [9] introduced logistic regression, a regression method for binary classification problems. This section starts with a formal definition of the logistic regression method, followed by a training algorithm and finally an extension to apply the logistic regression on multiple classes.

In logistic regression (logit) we wish to find the probability that a sample $\mathbf{x}_i$ belongs to a class, denoted by $P(\mathbf{x}_i = 1)$. This probability is defined by the logistic function as shown in (2.5). Note that the data now has to be labelled in $\{0, 1\}$.

$$P(\mathbf{x}_i = 1) = \frac{1}{1 + e^{-(\mathbf{w} \cdot \mathbf{x}_i + \theta)}}, \tag{2.5}$$

where $\mathbf{w}$ is the weight vector and $\theta$ the bias term. We wish to find a set of weights and a bias term such that the probability function outputs 1 if $\mathbf{x}_i$ belongs to the class and 0 if $\mathbf{x}_i$ does not belong to the class. Such a set of weights and bias terms is usually found using maximum likelihood estimation. Here, however, the logistic classifier is first formulated as a single layer neural network and the weights are trained using the delta rule.

Consider again the perceptron shown in Figure 2.3. By setting the function $f$ equal to (2.5), this network behaves exactly as the logistic regression method. To find the weights, we are unable to use Algorithm 1, since the output of the network is not strictly in $\{0, 1\}$. The weights of the network are found by using the mean squared error, as defined in Definition 2.5, as the error function and the delta rule, as specified in Algorithm 2, as the optimization algorithm.

DEFINITION 2.5. The mean squared error is defined as

$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2,$$

where $y_i$ is the actual value and $\hat{y}_i$ is the predicted value for observation $i$.

Algorithm 2 Delta Rule Algorithm
    Given a training set $Y$ consisting of $n$ tuples of the form $(\mathbf{x}_i, y_i)$, with $\mathbf{x}_i \in \mathbb{R}^m$ the features and $y_i$ the desired class. Let $\mathbf{w}$ be the vector of weights of the network. Let $f(x)$ be the network function.
    Initialize learning rate $\alpha$
    Initialize weight vector $\mathbf{w}$
    Initialize convergence criterion $\epsilon$
    conv := 0
    while not conv do
        for $i = 1, \ldots, n$ do
            Compute $\Delta\mathbf{w} := \alpha f'(\mathbf{w} \cdot \mathbf{x}_i)\,(y_i - f(\mathbf{w} \cdot \mathbf{x}_i))\,\mathbf{x}_i$
            Update the weights: $\mathbf{w} := \mathbf{w} + \Delta\mathbf{w}$
        end for
        Update convergence check:
        if $\frac{1}{n} \sum_{i=1}^{n} \frac{1}{2} (y_i - f(\mathbf{w} \cdot \mathbf{x}_i))^2 < \epsilon$ then
            conv := 1
        end if
    end while
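The delta rule can be sketched for a single logistic unit as below, reading the update as $\Delta\mathbf{w} = \alpha f'(\mathbf{w} \cdot \mathbf{x}_i)(y_i - f(\mathbf{w} \cdot \mathbf{x}_i))\mathbf{x}_i$ and using the identity $f'(z) = f(z)(1 - f(z))$ for the logistic function. The training data, the learning rate, the tolerance and the epoch cap are hypothetical choices made for this illustration only.

```python
import math


def logistic(z):
    """Logistic function 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def output(w, x):
    """Network output f(w . x) for one sample."""
    return logistic(sum(wi * xi for wi, xi in zip(w, x)))


def train_delta_rule(samples, alpha=0.5, eps=0.01, max_epochs=10_000):
    """Delta rule for one logistic unit; each x ends with a constant 1 (bias)."""
    w = [0.0] * len(samples[0][0])
    cost = float("inf")
    for _ in range(max_epochs):
        for x, y in samples:
            out = output(w, x)
            # f'(z) = f(z) * (1 - f(z)) for the logistic function.
            step = alpha * out * (1.0 - out) * (y - out)
            w = [wi + step * xi for wi, xi in zip(w, x)]
        # Convergence check on the halved mean squared error.
        cost = sum(0.5 * (y - output(w, x)) ** 2 for x, y in samples) / len(samples)
        if cost < eps:
            break
    return w, cost


# Hypothetical one-feature samples with a bias input of 1 and labels in {0, 1}.
samples = [([-2.0, 1.0], 0), ([-1.0, 1.0], 0), ([1.0, 1.0], 1), ([2.0, 1.0], 1)]
w, cost = train_delta_rule(samples)
predicted = [round(output(w, x)) for x, _ in samples]
```

The epoch cap is a safeguard that the pseudocode does not have; without it the loop would run indefinitely whenever the tolerance cannot be reached.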

The logit restricts us to one class. We could extend it to multiple classes by running three networks in parallel. This is easily implemented by using vectors containing binary elements with precisely one element equal to 1. For the Iris classification problem, the vector consists of three elements, where the vector (1, 0, 0) is assigned to the class Setosa, (0, 1, 0) is assigned to the class Versicolor and (0, 0, 1) to Virginica. Constructing a network consisting of three outputs as shown in Figure 2.4a requires a small change in the problem definition as follows,

Figure 2.4: Illustrations of the multi-class logit and the softmax implementation. (a) The network for the multi-class logit. (b) The network using the softmax function.

\[ P(x_i \text{ belongs to class } k) = \frac{1}{1 + e^{-(W \cdot x_i + \theta)}} \bullet \mathbf{1}_k, \tag{2.6} \]

where $W \in \mathbb{R}^{3 \times 4}$ is the weight matrix, $\theta \in \mathbb{R}^3$ the bias vector and $\mathbf{1}_k$ the indicator vector assigning the value 1 to the $k$-th element of the vector. A problem arises in this approach since we find a separate probability for each class. It is easy to create a scenario in which two networks assign probability 1 to the corresponding output, which can be interpreted as a sample belonging to both classes. To avoid this problem, we introduce the softmax function as shown in (2.7). The softmax function maps a vector $x \in \mathbb{R}^n$ to a new vector $y \in [0, 1]^n$, where the sum of the elements in the vector is equal to 1. This allows us to get a correct probability distribution among the classes. The network is then simplified as shown in Figure 2.4b. The output probability of a class can be calculated using (2.8).

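\[ \sigma(x)_i = \frac{e^{x_i}}{\sum_{j=1}^{n} e^{x_j}} \quad \text{for } i = 1, \dots, n. \tag{2.7} \]

\[ P(x_i \text{ belongs to class } k) = \frac{e^{x_i^T w_k}}{\sum_{j=1}^{n} e^{x_i^T w_j}}, \tag{2.8} \]

where $w_k$ corresponds to the $k$-th row of the matrix $W$ and $n$ is the number of classes.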
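The softmax of (2.7) and the class probabilities of (2.8) can be sketched in a few lines of NumPy. The weight matrix and the sample below are made-up values chosen only to match the shape of the Iris example (three classes, four features); subtracting the maximum before exponentiating is a standard numerical safeguard that leaves the result unchanged.

```python
import numpy as np

def softmax(x):
    """Softmax of (2.7); shifting by max(x) avoids overflow in exp."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def class_probabilities(W, x):
    """Class probabilities of (2.8): softmax over the scores x^T w_k."""
    return softmax(W @ x)

# Hypothetical weights for 3 classes and 4 features (not trained values).
W = np.array([[ 0.2, -0.1,  0.4,  0.0],
              [-0.3,  0.5,  0.1,  0.2],
              [ 0.1,  0.1, -0.2,  0.3]])
x = np.array([5.1, 3.5, 1.4, 0.2])   # hypothetical sample with four features
p = class_probabilities(W, x)        # one probability per class, summing to 1
```

In contrast to the three parallel logits of (2.6), the entries of `p` cannot all be close to 1 at the same time.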

2.5 Introduction to Neural Networks

In Section 2.3 we considered the simplest form of a Neural Network (NN), containing only one computational unit (or neuron). This network could only produce Boolean outputs. An extension of the single layer perceptron (SLP) is the logistic regression of Section 2.4, which is a single layer neural network using a logistic function as the activation function, producing class probabilities as outputs. In recent years, the term deep learning has gained popularity. Deep learning refers to the use of multiple layers in neural networks. The next example shows why a single layer perceptron can be insufficient even for a simple classification task, which motivates the use of deeper networks.

Figure 2.5: A multi layer network for solving the XOR-problem

Consider the XOR-problem in machine learning. In this problem the data is not linearly separable. Suppose we have an input $x \in \mathbb{R}^2$ and a class $y \in \{-1, 1\}$ as shown in Table 2.1. It is impossible to classify this problem using the single-layer perceptron. Adding one layer to the network can solve this problem, as shown in Table 2.1 and Figure 2.5, where $x_1$ and $x_2$ are connected to both $y_1$ and $y_2$ and the outputs of $y_1$ and $y_2$ are connected to $y$. The inputs $x_1$ and $x_2$ are first mapped to the intermediate binary values $y_1 = f_1(w_1 x_1 + w_3 x_2)$ and $y_2 = f_2(w_2 x_1 + w_4 x_2)$, where $f_{1,2}(\cdot)$ is the perceptron with threshold $\theta = 0$. Using the weights $w_1 = -1$, $w_2 = 1$, $w_3 = 1$, $w_4 = -1$, we obtain the outputs as shown in the columns $y_1$ and $y_2$ in Table 2.1. From this point it remains to compute $f(w_5 y_1 + w_6 y_2)$. By setting the threshold $\theta = 1$ and $w_5 = 1$, $w_6 = 1$, the desired output is obtained in column $y$ and the XOR-problem is solved.

Table 2.1: XOR-problem using SLP on the left and Multi-Layer XOR-problem on the right

  x1   x2    y   |   x1   x2   y1   y2    y
  -1   -1    1   |   -1   -1    1    1    1
  -1    1   -1   |   -1    1    1   -1   -1
   1   -1   -1   |    1   -1   -1    1   -1
   1    1    1   |    1    1    1    1    1

Besides classification, it is also possible to construct a multi layer neural network for regression. A simple way to implement this is connecting multiple logistic layers after each other. Let⁵ $s_i = W_i z_{i-1}$ be the input of layer $i$, with $W_i$ the weight matrix from layer $i-1$ to layer $i$ and $z_{i-1}$ the outputs of layer $i-1$. The output of layer $i$ is defined by $z_i = f_i(s_i)$, with $z_0 = x$ the input of the network. The output of a multi layer neural network can now be seen as a composite function of matrix vector products of all the layers:

\[
\begin{aligned}
y &= f_n(s_n) \\
  &= f_n(W_n z_{n-1}) \\
  &= f_n(W_n f_{n-1}(s_{n-1}))
\end{aligned}
\tag{2.9}
\]

Recursively we can find:

\[ y = f_n(W_n f_{n-1}(W_{n-1} f_{n-2}(\dots W_2 f_1(W_1 x) \dots))) \tag{2.10} \]

The size of each weight matrix is $W_i \in \mathbb{R}^{p \times q}$, where $p$ corresponds to the number of nodes in layer $i$ and $q$ corresponds to the number of nodes in layer $i-1$. Therefore, if $W_1 \in \mathbb{R}^{p_1 \times m_1}$, then $W_2 \in \mathbb{R}^{p_2 \times p_1}$ and $W_3 \in \mathbb{R}^{p_3 \times p_2}$, up to $W_n \in \mathbb{R}^{m_2 \times p_{n-1}}$, where $m_1$ is the size of the input vector and $m_2$ the size of the output vector. The parameters $p_1, p_2, \dots, p_{n-1}$ can be chosen arbitrarily. The multi layer perceptron (MLP) allows activation functions $f_i$ different from the logistic function. Given the MLP, it remains to find a set of weights. Algorithm 2 is not able to solve this problem for a network with multiple layers. However, a more general method, the backpropagation algorithm, is shown in Section 3.2.1. The next chapter covers training these kinds of networks in detail.
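The composite function (2.10) amounts to a short loop over the layers. The sketch below evaluates it for the two-layer network of Figure 2.5 with the weights given in the text. The helper names `forward` and `threshold` are illustrative, and the convention that a weighted sum exactly equal to the threshold maps to 1 is an assumption made here.

```python
import numpy as np

def forward(x, weights, activations):
    """Evaluate (2.10): z_i = f_i(W_i z_{i-1}) with z_0 = x."""
    z = x
    for W, f in zip(weights, activations):
        z = f(W @ z)
    return z

def threshold(theta):
    """Perceptron activation with outputs in {-1, 1}; s == theta maps to 1 (assumption)."""
    return lambda s: np.where(s >= theta, 1.0, -1.0)

# Weights from the text: s_1 = (w1*x1 + w3*x2, w2*x1 + w4*x2), s_2 = w5*y1 + w6*y2.
W1 = np.array([[-1.0,  1.0],    # (w1, w3)
               [ 1.0, -1.0]])   # (w2, w4)
W2 = np.array([[1.0, 1.0]])     # (w5, w6)
net = ([W1, W2], [threshold(0.0), threshold(1.0)])

inputs = [(-1, -1), (-1, 1), (1, -1), (1, 1)]
outputs = [float(forward(np.array(x, dtype=float), *net)[0]) for x in inputs]
```

Under these assumptions, `outputs` reproduces the column $y$ of Table 2.1 for the four input pairs.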

2.6 Summary

In this chapter, we discussed the classification problem in machine learning. One of the important concepts in classification is linear separability. Using Algorithm 1 we are able to train a single layer perceptron for linearly separable classification problems. For data that is not linearly separable it can be useful to assign probabilities using logistic regression instead of classifying it deterministically. The delta rule is used for training networks with continuous activation functions. A network can be extended to multiple classes using binary vectors as output. The use of multiple layers in a neural network can solve problems that are not linearly separable, such as the XOR-problem, by mapping the inputs to separate spaces and combining them again.

⁵ The variables $s_i$ and $z_i$ are used as auxiliary variables. In Figure 2.5, $s_1 = (w_1 x_1 + w_3 x_2, w_2 x_1 + w_4 x_2)$ and $z_1 = (y_1, y_2)$.

Chapter 3

Neural Networks

In this chapter, we consider structured networks of computation units. In Section 3.1 a mathematical formalization of network architectures is given. This is followed by Section 3.1.1, where the different types of activation functions are described. In Section 3.1.2 the Convolutional Neural Network (CNN), a specific type of feed-forward neural network, is explained. This is the basis for the Wavenet, which will be discussed in Section 3.4. Recurrent neural networks are also discussed in this chapter because they show promising results in the literature [4] and can be used as a benchmark. After the structured networks, supervised learning of neural networks will be discussed in Section 3.2 and Section 3.3. This covers some of the most important findings from recent years on deep neural networks that are going to be used for the Wavenet.

3.1 Network Architectures

It is convenient to use a more precise definition of network architectures in the following sections. Therefore the definition from [7] is used in Definition 3.1. Each computing unit has $n$ inputs and an integration function $\psi : \mathbb{R}^n \to \mathbb{R}$. This is equivalent to the weighted sum with the weight vector as we have seen in the perceptron. The output of the integration function is evaluated using an activation function $\varphi : \mathbb{R} \to \mathbb{R}$, which in the perceptron compares the weighted summation with a threshold. In Section 3.1.1 we generalize the activation function, so that its output can differ from the $\{-1, 1\}$ output we had in the perceptron.

DEFINITION 3.1. A network architecture is a tuple $(I, N, O, E)$ consisting of a set $I$ of inputs, a set $N$ of computing units, a set $O$ of outputs and a set $E$ of weighted directed edges. A directed edge is a tuple $(u, v, w)$ whereby $u \in I \cup N$, $v \in N \cup O$ and $w \in \mathbb{R}$.

Figure 3.1 shows an example of a network architecture. The inputs correspond to the features $x_i$ from the perceptron. They strictly feed data into the network; no computations are done in the inputs. The inputs are connected with a weight to the elements in the set $N$, which contains all the computing units. The inputs could also be directly connected with a weight to the output $O$. The outputs of the computing units of $N$ can also be connected to other computing units in $N$, which typically happens in layered networks.

If the computing units can be subdivided into $l$ subsets $N_1, N_2, \dots, N_l$ in such a way that there are only connections from $N_i$ to $N_{i+1}$, $1 \le i < l$, where $N_l$ are the outputs of the network, then we call this a layered architecture. In this architecture, the set of inputs is called the input layer and the set of outputs is called the output layer. All the remaining layers, without a connection to the input or output, are called hidden layers. A graphical representation of a layered network architecture is given in Figure 3.1. Note that this gives a feed-forward network, a network that does not contain cycles.

A fully connected layer is a layer in which the computation units have connections to all the computation units of the previous layer. If a layer has $m$ computation units and the previous layer gives $n$ outputs, the layer has a total of $mn$ weights. To reduce the number of connections, convolutions can be used, as explained in Section 3.1.2.

Figure 3.1: Graph of a layered network with E = E1 ∪ E2 ∪ E3,N = N1 ∪ N2

3.1.1 Activation Functions

As mentioned in Section 3.1, the activation function is any function $\varphi : \mathbb{R} \to \mathbb{R}$. In order to be able to train the networks using Algorithm 3, we require the activation function to be at least continuous and once differentiable. Note that the derivative is not necessarily continuous. The previously used binary step function is not differentiable. A well-known approach to approximate the behavior of the binary step function is the sigmoid function, which was already introduced in Section 2.4. Sigmoid functions are monotonically increasing functions with horizontal asymptotes for $x \to \pm\infty$. In contrast to the step function of the single layer perceptron, sigmoid functions are continuous and map values to a subset of $\mathbb{R}$. A network trained on binary output values thereby becomes a regression model, as we have seen in Section 2.4.

The logistic function (Definition 3.3) and the hyperbolic tangent (Definition 3.4) are two commonly used sigmoid functions in neural networks. The authors of [10] recommend using the hyperbolic tangent since the outputs typically have a mean of zero, in contrast to the logistic function, which always has a positive mean. Having a mean value of zero results in normalized flows throughout the networks, which results in higher convergence rates. A detailed explanation can be found in [10].

The rectified linear unit, also called ReLU (Definition 3.2), gained popularity in 2012 when AlexNet won the ImageNet competition, a challenge where teams compete to classify a large data set of images. Compared to the sigmoid functions, the ReLU does not exhibit asymptotic behavior for positive inputs. Deep neural networks benefit from this property since the gradients of such functions do not vanish. Moreover, the computation of a ReLU is relatively cheap compared to typical sigmoid functions. Therefore it can significantly reduce the training time.

The choice of the best activation function depends on the type of network according to the authors of [10] and [11]. For the application in this thesis, different activation functions will be tested. Figure 3.2 gives an overview of the commonly used activation functions.

DEFINITION 3.2. The rectified linear unit (ReLU) is a function $f : \mathbb{R} \to \mathbb{R}$ defined by $f(x) = \max(0, x)$.

DEFINITION 3.3. The logistic unit (logistic) is a function $f : \mathbb{R} \to (0, 1)$ defined by $f(x) = \frac{1}{1 + e^{-k(x - x_0)}}$, where $x_0$ is the midpoint of the sigmoid curve and $k$ is the steepness of the curve.

DEFINITION 3.4. The hyperbolic tangent function (tanh) is a function $f : \mathbb{R} \to \mathbb{R}$ defined by $f(x) = \frac{\sinh(x)}{\cosh(x)} = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$.

Figure 3.2: The three different activation functions (panels: Hyperbolic Tangent, Logistic, ReLU; each plots y against x for x between -4 and 4)

Other common activation functions are variants of the ReLU function. The Leaky ReLU (LReLU) replaces the zero output of the ReLU for $x < 0$ by a line with a small constant slope, $\alpha x$ [12]. The exponential linear unit (ELU) instead uses the exponentially saturating term $\alpha(e^{x} - 1)$ for $x < 0$ [13].
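For reference, the activation functions of Definitions 3.2 to 3.4 and the two ReLU variants can be written down directly in NumPy. This is a minimal sketch; the default values of `alpha`, `k` and `x0` are illustrative assumptions.

```python
import numpy as np

def relu(x):
    """Definition 3.2: max(0, x)."""
    return np.maximum(0.0, x)

def logistic(x, k=1.0, x0=0.0):
    """Definition 3.3 with steepness k and midpoint x0."""
    return 1.0 / (1.0 + np.exp(-k * (x - x0)))

def tanh(x):
    """Definition 3.4."""
    return np.tanh(x)

def leaky_relu(x, alpha=0.01):
    """Slope alpha for x < 0, identity otherwise."""
    return np.where(x < 0, alpha * x, x)

def elu(x, alpha=1.0):
    """alpha * (exp(x) - 1) for x < 0, identity otherwise."""
    # expm1 is only evaluated on the non-positive part to avoid overflow.
    return np.where(x < 0, alpha * np.expm1(np.minimum(x, 0.0)), x)

x = np.linspace(-4.0, 4.0, 9)   # the range shown in Figure 3.2
values = {f.__name__: f(x) for f in (relu, logistic, tanh, leaky_relu, elu)}
```

The two sigmoid functions are related by $\tanh(x) = 2\,\mathrm{logistic}(2x) - 1$, which shows that the hyperbolic tangent is a rescaled logistic function centered at zero.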

3.1.2 Convolutional Neural Networks

A convolutional neural network (CNN) is a neural network with a layered architecture in which the successive layers are not fully connected to each other. In contrast to fully connected multilayer feedforward networks, CNNs suffer less from the curse of dimensionality as well as from overfitting [3]. Moreover, CNNs make use of the spatial structure of the data. The connection pattern in a CNN is inspired by the visual cortex of an animal. A CNN has a layered network architecture $(I, N, O, E)$ with $N_1, N_2, \dots, N_l$. The most important building block of a CNN is the convolutional layer, which consists of a set of learnable filters. All the filters corresponding to a specific layer $l$ have the same receptive field. The receptive field is the part of the input which can be seen by the filter, as shown in Figure 3.3. This filter is replicated over the entire output space of the previous layer $l-1$, as shown in Figure 3.4. Each of these replications shares the same weights and bias. The outputs of all the units sharing the same filter are called the feature map. This enables a network to detect a small feature anywhere in the domain. For convenience, the $k$-th feature map at a given layer $l$ is denoted by $h^{l,k}$. Using the shared weights $W^{l,k}$, the bias $b_{l,k}$ and some activation function $f(\cdot)$, every computing unit of the feature map for a 1D-input can be obtained by:

\[ h^{l,k}_i = f\big((W^{l,k} * x)_i + b_{l,k}\big), \]

where $x$ denotes the input and $i$ denotes the index within the feature map.

Figure 3.3: Illustration of the Receptive Field

Figure 3.4: Illustration of the replications, shared weights and feature map

The number of computing units in a feature map corresponding to layer $l$ depends on the output dimension $n$ of layer $l-1$, the receptive field $m$ of a filter, the stride and the padding. The stride determines the distance between the replicated filters, see Figure 3.6. With a higher value of the stride, the receptive fields have a smaller overlap. If padding is used around the borders, zero values are added to the input to keep the output dimension equal to the input dimension, see Figure 3.5.

Figure 3.5: Illustration of the Padding

As shown in all the preceding figures about CNNs, the connection pattern is sparse. This reduces the number of computations significantly compared to fully connected layers. An example of such a calculation using a linear activation function $f(x) = x$ is given in Eq. (3.1). Note that for a fully connected layer, all the zero entries would have a weight. Another benefit is the faster convergence of a convolutional network. Since the weights are shared, the update of a single weight is based on the updates of all the replications of the filter across the whole domain.

Figure 3.6: Illustration of the Strides

\[
\begin{pmatrix}
w_1 & w_2 & w_3 & 0 & 0 & 0 \\
0 & w_1 & w_2 & w_3 & 0 & 0 \\
0 & 0 & w_1 & w_2 & w_3 & 0 \\
0 & 0 & 0 & w_1 & w_2 & w_3
\end{pmatrix}
\begin{pmatrix}
x^{l-1}_1 \\ x^{l-1}_2 \\ x^{l-1}_3 \\ x^{l-1}_4 \\ x^{l-1}_5 \\ x^{l-1}_6
\end{pmatrix}
=
\begin{pmatrix}
x^{l,k}_1 \\ x^{l,k}_2 \\ x^{l,k}_3 \\ x^{l,k}_4
\end{pmatrix}
\tag{3.1}
\]

By the model assumption in forecasting, the future time step is only dependent on preceding time steps. This assumption should also be implemented in the convolutional neural network. This is done by using causal convolutions. In causal convolutions, the ordering of the data is preserved. This means that if the network outputs the value $x_{t+1}$, all the connections to the output should have their roots in $x_i$ with $i < t+1$. Moreover, if a feature is extracted from some point in the past, the network should not be able to add information from the future. This is illustrated in Figure 3.7.

Figure 3.7: Illustration of the causal convolutions

Using this network structure, it is possible to take even more advantage of the shared weights by increasing the input and output size during training time. This is illustrated in Figure 3.8. In each iteration of training, the weights are updated based on multiple outputs instead of one. The same result can also be achieved with batch training; however, this is computationally more expensive. To see this, denote $\mathbf{x}_t = (x_{t-n}, x_{t-n+1}, \dots, x_t)$ and consider the case $n = 2$. In batch training we would take the inputs $\mathbf{x}_{t-1}$ and $\mathbf{x}_t$ and evaluate the network as shown in Figure 3.7 on all the inputs in parallel to produce the outputs $x_t$ and $x_{t+1}$. Training on multiple outputs decreases the number of redundant computations. The grey marked units shown in Figure 3.8 have to be computed twice in the case of batch training, whereas the multiple output implementation computes these units only once. During forecasting, the weights from the trained network can be inserted in the model corresponding to Figure 3.7, which makes the forecasting calculations less intensive. For convenience, networks with the architecture shown in Figure 3.7 and Figure 3.8 will be abbreviated as FCN (Fully Convolutional Network). The FCN will be used as the benchmark network for the Wavenet.

Figure 3.8: Illustration of the causal convolutions with larger inputs and outputs

An extension of the regular convolutional layers are the dilated convolutions [14], sometimes called à trous convolutions, which literally means with holes. The French name has its origins in the algorithme à trous, which computes the fast dyadic wavelet transform. In this type of convolutional layer, the inputs corresponding to the receptive field of the filters are not neighboring points. This is illustrated in Figure 3.9. The distance between the inputs depends on the dilation factor.
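To make the difference between a regular and a dilated causal convolution concrete, the following sketch implements both with one function. The names `causal_conv` and `receptive_field`, the zero padding before the start of the series and the toy values are assumptions made for illustration, not the implementation used in this thesis.

```python
import numpy as np

def causal_conv(x, w, dilation=1):
    """Causal 1D convolution: y[t] = sum_k w[k] * x[t - k * dilation].

    Values before the start of x are taken to be zero (an assumption)."""
    pad = (len(w) - 1) * dilation
    xp = np.concatenate([np.zeros(pad), x])
    return np.array([
        sum(w[k] * xp[pad + t - k * dilation] for k in range(len(w)))
        for t in range(len(x))
    ])

def receptive_field(filter_size, dilations):
    """Number of inputs seen by one output of stacked causal layers."""
    return 1 + sum((filter_size - 1) * d for d in dilations)

x = np.arange(6.0)                           # hypothetical series 0, 1, ..., 5
w = np.array([1.0, 1.0])                     # hypothetical filter of size 2
y_regular = causal_conv(x, w, dilation=1)    # y[t] = x[t] + x[t-1]
y_dilated = causal_conv(x, w, dilation=2)    # y[t] = x[t] + x[t-2]
```

Since `y[t]` only uses values up to time `t`, it can serve as the prediction of $x_{t+1}$ without leaking future information. Stacking three layers of filter size 2 with dilation factors 1, 2 and 4 gives `receptive_field(2, [1, 2, 4]) == 8`, whereas three undilated layers of the same filter size only see 4 inputs.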

3.1.3 Recurrent Neural Networks

Recurrent Neural Networks (RNNs) differ in architecture from the feed-forward neural networks. In all the previous examples the computational units were fed in one direction, and all the information had its origin in the input layer. In RNNs this is not the case, as the connections between the computational units contain cycles. Therefore a single computational unit can depend on its previous states. The idea of using RNNs is to obtain a natural form of persistent memory, which the cycles provide. A schematic representation of an RNN is given in Figure 3.10, where $g$ is a part of a neural network and should not be confused with the activation function. The unfolded network clarifies how an RNN works. It can be seen as a neural network composed of smaller neural networks in which information is passed in an ordered way. The unfolded network shows immediately that if $t$ represents the time, causality relations hold in this type of network. This makes the RNN interesting for studying time series.

Figure 3.9: Illustration of the dilated convolutions

Figure 3.10: The RNN on the left and the unfolded RNN network on the right

In the study of time series on a daily basis, we would like to be able to detect seasonal features. The period of a season can become relatively large. Theoretically, it should be possible to learn any relation between the past and the current time, since the information is passed through each block. However, the authors of [15] showed that learning long term dependencies with RNNs using gradient descent algorithms is difficult. The Long Short Term Memory neural network (LSTM), introduced by the authors of [16], is a successful approach to this problem. Recently LSTMs have become more popular, mainly because of the increase in computational power. In the following, the LSTM is discussed in detail. The block of an LSTM is shown in Figure 3.11. This block is repeated in the same way as in the RNN. The LSTM block consists of two lines passing horizontally: the $y$-value, which corresponds to the output of a block, and the $C$-value, which corresponds to the cell state. The horizontal lines have inputs from the preceding blocks and outputs to the succeeding blocks. Vertically, for each block, there is an input $x$ and an output $y$.

Figure 3.11: The LSTM block. The × and + are point-wise operators; σ and tanh are activation functions. Two joining arcs make a concatenate operation. Two splitting arcs make a copy operation.

Starting at the input $x_t$, the signal is concatenated with $y_{t-1}$ and we obtain $[y_{t-1}, x_t]$. Following the first arc pointing downwards, the values are passed through a sigmoid function. The output $f_t$ of this function is defined in (3.2). This function is called the forget gate, since its output, a value in $(0, 1)$, decides whether the preceding cell state is remembered or forgotten using the point-wise product operator.

\[ f_t = \sigma(W_f \cdot [y_{t-1}, x_t] + b_f) \tag{3.2} \]

Following the second arc pointing downwards, the signal $[y_{t-1}, x_t]$ arrives at another sigmoid function, which is called the input gate. The output $i_t$ of this function is defined in (3.3). The input gate decides which values are used for the update.

\[ i_t = \sigma(W_i \cdot [y_{t-1}, x_t] + b_i) \tag{3.3} \]

The third arc pointing downwards generates new candidate values $\tilde{C}_t$ for the cell state by using the tanh function, as defined in (3.4). By taking the point-wise product with the input gate, the update for the cell state can be determined using (3.5), where $*$ denotes the point-wise product. The new cell state is a combination of the old cell state and the new candidate, in which the forget gate and the input gate gradually decide how much of the old cell state and of the new input to use, respectively.

\[ \tilde{C}_t = \tanh(W_C \cdot [y_{t-1}, x_t] + b_C) \tag{3.4} \]

\[ C_t = f_t * C_{t-1} + i_t * \tilde{C}_t \tag{3.5} \]

The output gate transforms the signal $[y_{t-1}, x_t]$ as defined in (3.6). By taking the point-wise product of the tanh of the updated cell state $C_t$ and the output gate $o_t$, the new output $y_t$ as defined in (3.7) is generated.

\[ o_t = \sigma(W_o \cdot [y_{t-1}, x_t] + b_o) \tag{3.6} \]

\[ y_t = o_t * \tanh(C_t) \tag{3.7} \]

Summarizing, the main components of the LSTM are the cell state and the output. The new cell state is defined by the forget gate and the input gate. The new output is defined by the output gate and the new cell state. By adding $n$ of these blocks, the size of the vectors passing through the blocks grows linearly. For applications in time series with a very long memory, or in CNN-terms a very large receptive field, this might become computationally expensive.
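Equations (3.2) to (3.7) translate almost line by line into code. The sketch below performs one pass of a single LSTM block over a short sine wave; the dictionary layout of `params`, the layer sizes and the random initialization are illustrative assumptions, and no training is performed.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def lstm_step(x_t, y_prev, C_prev, params):
    """One LSTM block, following (3.2)-(3.7)."""
    v = np.concatenate([y_prev, x_t])                     # [y_{t-1}, x_t]
    f_t = sigmoid(params["Wf"] @ v + params["bf"])        # forget gate (3.2)
    i_t = sigmoid(params["Wi"] @ v + params["bi"])        # input gate (3.3)
    C_tilde = np.tanh(params["WC"] @ v + params["bC"])    # candidate values (3.4)
    C_t = f_t * C_prev + i_t * C_tilde                    # new cell state (3.5)
    o_t = sigmoid(params["Wo"] @ v + params["bo"])        # output gate (3.6)
    y_t = o_t * np.tanh(C_t)                              # new output (3.7)
    return y_t, C_t

rng = np.random.default_rng(0)
n_in, n_hidden = 1, 4              # hypothetical sizes
params = {}
for gate in "fiCo":
    params["W" + gate] = rng.normal(scale=0.1, size=(n_hidden, n_hidden + n_in))
    params["b" + gate] = np.zeros(n_hidden)

y, C = np.zeros(n_hidden), np.zeros(n_hidden)
for x in np.sin(np.linspace(0.0, 2.0 * np.pi, 20)):   # unfold the block over time
    y, C = lstm_step(np.array([x]), y, C, params)
```

All products between gates and states are point-wise, so `f_t`, `i_t`, `o_t`, `C_t` and `y_t` all have the same length `n_hidden`.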

3.2 Supervised Learning of the Neural Network

This section focuses on training neural networks using supervised learning. It starts with a generalization of Algorithm 2 in Section 3.2.1: the backpropagation algorithm. The cost function is an important aspect of the algorithm, which is discussed in Section 3.2.2. The number of training samples used for updating the weights can be varied to optimize the convergence rate, which is explained in Section 3.2.3. The initialization will be discussed in depth in Section 3.2.4. In Section 3.2.4 alternatives to the update rule to improve the convergence rate are discussed. These include the addition of momentum, Nesterov's accelerated gradient, and other novel methods.

3.2.1 Backpropagation Algorithm

The backpropagation algorithm is by far the most commonly used method for finding the optimal set of weights of a neural network w.r.t. a cost function. It is an extension of the delta rule that also works for multi-layer networks. In the delta rule, the gradient of the cost function w.r.t. the weights can be computed directly. In the backpropagation algorithm, the chain rule is needed, which requires the activation functions to be continuous and differentiable. Optimization methods that follow the gradient to find an optimum are called gradient based optimization methods.

The algorithm consists of two phases, the propagation phase and the update phase. In the propagation phase, the cost function of one or more training samples is computed. In the update phase, the gradient of the cost function w.r.t. the weights is computed, and the weights are updated by subtracting this gradient multiplied by a learning rate. The backpropagation algorithm is thus a generalization of the delta rule previously used in the logistic regression. Since gradients are now computed through nonlinear activation functions, we require the activation functions to be differentiable. In Algorithm 3 each phase of the algorithm is described in pseudocode. In short, we can write this as the gradient descent algorithm as used in mathematics:

\[ w^{n+1} = w^{n} - \alpha \nabla_w J(w), \tag{3.8} \]

where $w$ are the weights, $\alpha$ is the learning rate, $J$ is the cost function and $\nabla$ is the gradient operator.

Algorithm 3 Backpropagation Algorithm

Given a training set $Y$ consisting of tuples of the form $(x_i, y_i)$, with $x_i$ the features and $y_i$ the desired outputs. Let $w$ be the vector of weights of the network. Let $\bar{F}(x)$ be the neural network function.
Initialize weight vector $w$
Initialize learning rate $\alpha$
while not conv do
    Propagation
        Compute the outputs of all training samples in $Y$ (forward pass):
            $\tilde{y} = \bar{F}(x)$
        Compute the cost function:
            $J(\tilde{y}, y)$
        Compute the gradients of the cost function w.r.t. the weights (and biases):
            $\Delta w = \nabla_w J$
    Weight update
        Update the weights using the gradients:
            $w^{n+1} = w^{n} - \alpha \Delta w$
end while
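The two phases of Algorithm 3 can be sketched for a small network with one tanh hidden layer and a linear output. Everything specific in this example is an assumption made for illustration: the network size, the toy regression target, the learning rate, the fixed number of iterations in place of a convergence check, and the half mean squared error used as cost.

```python
import numpy as np

def forward(W1, W2, X):
    H = np.tanh(X @ W1.T)          # hidden layer outputs
    return H, H @ W2.T             # linear output layer

def cost(W1, W2, X, Y):
    return 0.5 * np.mean(np.sum((forward(W1, W2, X)[1] - Y) ** 2, axis=1))

def gradients(W1, W2, X, Y):
    """Gradients of the cost w.r.t. W1 and W2, obtained with the chain rule."""
    H, out = forward(W1, W2, X)
    d_out = (out - Y) / X.shape[0]             # dJ/d(output)
    gW2 = d_out.T @ H
    d_hid = (d_out @ W2) * (1.0 - H ** 2)      # tanh'(s) = 1 - tanh(s)^2
    gW1 = d_hid.T @ X
    return gW1, gW2

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(50, 1))       # hypothetical inputs
Y = np.sin(3.0 * X)                            # hypothetical targets
W1 = rng.normal(size=(8, 1))
W2 = rng.normal(scale=0.1, size=(1, 8))

alpha = 0.05
costs = [cost(W1, W2, X, Y)]
for _ in range(500):
    gW1, gW2 = gradients(W1, W2, X, Y)                # propagation phase
    W1, W2 = W1 - alpha * gW1, W2 - alpha * gW2       # weight update (3.8)
    costs.append(cost(W1, W2, X, Y))
```

A useful sanity check for hand-written gradients is to compare one entry with a central finite difference of `cost`; the two should agree up to numerical precision.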

3.2.2 Cost Function

In the previously used algorithms, we only considered the mean squared error (MSE) as the cost function (also called the error function). For networks using only linear activation units, the graph of the cost function over the weights in $n$-space is a paraboloid. To see this, consider a layered network. Suppose that we have dense connections, then we can rewrite the outputs of each layer as:

\[
\begin{aligned}
y_1 &= f_1(W_1 x) = W_1 x \\
y_2 &= f_2(W_2 y_1) = W_2 y_1 = W_2 W_1 x \\
&\;\;\vdots \\
y_n &= \prod_{i=1}^{n} W_i \, x
\end{aligned}
\tag{3.9}
\]

where the factors of the product are ordered from $W_n$ on the left to $W_1$ on the right. This simplifies the network to a linear system of equations. Since the MSE is a quadratic function of the output, we obtain a paraboloid for the cost function in terms of the entries of the combined matrix. For a paraboloid, we know that the global minimum is the only minimum. Therefore the gradient descent algorithm will move towards the global minimum. The problem changes for multilayer perceptrons and other deeper networks, since in this case nonlinear activation functions and hidden units are used. The surface of the cost function is then not a paraboloid, and the previously used method does not necessarily converge to the global minimum. Even if the global minimum is found, it is debatable whether the minimum of the chosen cost function represents our goal. For classification problems, the MSE does not give us the maximum margin separation hyperplane as defined in A (which is intuitively better) and therefore might be prone to giving wrong results on the test data. Choosing the right cost function requires a proper understanding of the problem to solve. For financial forecasting, one might consider constructing a cost function that reflects a buy/sell strategy. This can be done by transforming the data into a three-class vector. The first class corresponds to positive returns above the transaction cost level, the second class to returns inside the bounds of the transaction costs and the third class to returns below these bounds. Examples of other commonly used cost functions will be discussed in Section 4.2, since there is an overlap with performance measures.
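The collapse of a purely linear network in (3.9) is easy to check numerically. In the sketch below the layer sizes, the random weights and the target are arbitrary assumptions; the second part verifies that the MSE is quadratic along a straight line in the space of the combined matrix, by comparing second differences at equally spaced points.

```python
import numpy as np

rng = np.random.default_rng(1)
W1 = rng.normal(size=(5, 3))       # hypothetical layer sizes 3 -> 5 -> 4 -> 2
W2 = rng.normal(size=(4, 5))
W3 = rng.normal(size=(2, 4))
x = rng.normal(size=3)
target = rng.normal(size=2)

layered = W3 @ (W2 @ (W1 @ x))     # evaluate the network layer by layer
W_eff = W3 @ W2 @ W1               # combined matrix, factors ordered W_3 W_2 W_1
collapsed = W_eff @ x              # single matrix-vector product

def mse(W):
    return np.mean((target - W @ x) ** 2)

# A quadratic function has constant second differences along any line.
D = rng.normal(size=W_eff.shape)   # arbitrary direction in weight space
J = [mse(W_eff + t * D) for t in (-1.0, 0.0, 1.0, 2.0)]
second_diffs = (J[0] - 2 * J[1] + J[2], J[1] - 2 * J[2] + J[3])
```

Note that this check concerns the entries of the combined matrix `W_eff`; with a nonlinear activation between the layers neither the collapse nor the quadratic shape holds.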

3.2.3 Stochastic Gradient Descent, Batch Gradient Descent and Mini-Batch Gradient Descent

In Section 3.2.1 the number of training samples used for updating the weights was not properly defined. Depending on the size of the network and the size and nature of the data, a compromise can be found for the number of training samples to use for each update. In this section we consider a set of tuples $(x, y) = ((x_1, y_1), (x_2, y_2), \dots, (x_m, y_m))$ as the training set, where $x$ denotes the inputs to the network, $y$ denotes the outputs and $m \in \mathbb{N}$.

Batch Gradient Descent

Section 3.1.2 already gave a short example of batch training. Batch Gradient Descent (BGD) is the backpropagation algorithm which uses all training samples to perform one update, as shown in (3.10). The training data set size depends on the split ratio between the test and training sets and on the actual size of the data set. BGD can therefore be very slow for larger data sets, since generally we use more data for training than for testing. Moreover, data sets that do not fit into memory make it inefficient regarding speed, since we first need to split the data into chunks to calculate the gradients, without doing the actual updates. Another drawback of BGD is its inability to do online training, that is, adding new training samples to the data on the fly. For a suitably chosen learning rate, the method is guaranteed to converge to the global minimum for convex problems and to a local minimum for non-convex problems.

\[ w^{n+1} = w^{n} - \alpha \nabla_w J(w, x, y), \tag{3.10} \]

where $x$ are all the inputs in the training data and $y$ are all the labels in the training data.

Stochastic Gradient Descent

A natural way to avoid the aforementioned memory problems of Batch Gradient Descent is picking single samples randomly from the training set to perform the updates, as shown in (3.11). This method is called Stochastic Gradient Descent (SGD), where stochastic refers to the random sampling of the training data. In contrast to BGD, SGD avoids processing many similar samples for each update and therefore reduces the number of redundant calculations, which typically makes it converge faster. SGD shows fluctuations in its convergence behavior, caused by the high variance in the updates. This can be beneficial since it allows the algorithm to jump out of a local minimum and possibly reach a better one. In order to avoid large fluctuations during training, one might consider adaptively decreasing the learning rate. A proof that SGD converges to a local minimum almost surely is given in [17]. Another benefit of the algorithm is that it allows online training. Pseudocode for this algorithm is shown in Algorithm 4.

\[ w^{n+1} = w^{n} - \alpha \nabla_w J(w, x_i, y_i) \tag{3.11} \]

Mini-Batch Gradient Descent Combining the previous two methods gives us the Mini-Batch Gradient Descent (mBGD). In

Master of Science Thesis D.C.F. van den Assem (4336100) 24 Neural Networks Algorithm 4 Backpropagation Algorithm with Stochastic Gradient Descent

1: Given a training set Y consisting of tuples of the form (xi, yi), with xi the features and yi the desired outputs. Let w be the vector of weights of the network. Let F¯(x) be the neural network function. 2: Initialize weight vector w. 3: Initialize learning rate α 4: while not conv do 5: Propagation 6: Pick a random training sample y ∈ Y . 7: Compute the output of the training sample. (Forward Pass) ¯ 8: ˜yj = F (xj) 9: 10: Compute the cost function: 11: J(˜yj, yj) 12: 13: Compute the gradients of the weights (and biases) w.r.t. the cost function 14: ∆w = ∇wJ 15: 16: Weight update 17: Update weights using the gradients 18: wn+1 = wn − α∆w 19: end while this method, an update is performed for every mini-batch of a size of k samples as shown in (3.12). A mini-batch can be made by choosing k samples randomly from the training set. Using the mBGD the number of redundant computations is still reduced compared to the BGD and the variance on each update is reduced compared to SGD. Contrary to SGD and BGD, mBGD gives the user the ability to choose a size which fits the memory of the computer, which can reduce the computation time significantly. This is especially suitable for parallel computing using GPU’s 1.A pseudocode is shown in Algorithm5.

\[ w^{n+1} = w^{n} - \alpha \nabla_w J(w, x_{i,i+1,\dots,i+k}, y_{i,i+1,\dots,i+k}) \tag{3.12} \]
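The three update rules (3.10) to (3.12) differ only in how many samples enter each gradient evaluation. The following minimal sketch illustrates this on a one-parameter model y = w·x with a mean squared error cost. The toy data, the function names `gradient` and `train`, and all hyper-parameter values are assumptions made for illustration and are not taken from the thesis.

```python
import random

def gradient(w, batch):
    """Gradient of the mean squared error of the model y = w * x over a batch."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)

def train(data, batch_size, alpha=0.05, steps=2000, seed=0):
    """Gradient descent where each update uses `batch_size` random samples.

    batch_size = len(data) corresponds to BGD, batch_size = 1 to SGD,
    and anything in between to mBGD.
    """
    rng = random.Random(seed)
    w = 0.0
    for _ in range(steps):
        batch = rng.sample(data, batch_size)
        w -= alpha * gradient(w, batch)
    return w

# Toy data generated from y = 3x without noise (an assumption for illustration).
xs = [i / 10.0 for i in range(-10, 11)]
data = [(x, 3.0 * x) for x in xs]

w_bgd = train(data, batch_size=len(data))
w_sgd = train(data, batch_size=1)
w_mbgd = train(data, batch_size=4)
```

On this noise-free problem all three variants recover the same weight; on noisy data the SGD iterates would fluctuate around the minimum, which is the variance effect described above.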

Algorithm 5 Backpropagation Algorithm with Mini-Batch Gradient Descent

Given a training set Y consisting of tuples of the form (x_i, y_i), with x_i the features and y_i the desired outputs. Let w be the vector of weights of the network. Let F̄(x) be the neural network function.
Initialize weight vector w.
Initialize learning rate α.
while not converged do
    Propagation
        Pick a batch of training samples Ȳ ⊂ Y.
        Compute the output of the training samples in Ȳ (forward pass): ỹ_j = F̄(x_j)
        Compute the cost function: J(ỹ_j, y_j)
        Compute the gradients of the cost function w.r.t. the weights (and biases): Δw = ∇_w J
    Weight update
        Update the weights using the gradients: w^{n+1} = w^{n} − αΔw
end while

¹ A GPU is a graphics processing unit.

3.2.4 Initializers

Algorithm 3 shows that the weights should be initialized. Depending on the shape of the optimization problem, the choice of the initialization might lead to early convergence but also to divergence. A simple yet effective solution is to train multiple times on random initializations and choose the best-trained network. However, this can be very inefficient for larger problems in the sense of network size or amount of data. Consider a network with multiple layers initialized with weights close to zero. The variance of the input signal diminishes as it passes through the layers of the network. This becomes a problem since the typical activation functions are almost linear around zero. In that case, the layers could be reduced in a similar way as shown in (3.9). On the other hand, we can consider a network with weights that are initially too large. In this case, the sigmoid and tanh activation functions will be saturated, i.e. the function arguments cause output values close to the asymptotes, and the gradients vanish; these vanishing gradients are then passed on to the other layers during backpropagation. Various studies show that the initialization plays an important role in training efficiently with the backpropagation algorithm [10][18][19]. A summary is given of the initializers available in the Keras library.²

Lecun Initialization The authors in [10] suggest that, for sigmoid activation functions, the initial weights should be chosen in such a way that the weighted inputs of the function lie in the linear range of the sigmoid function. This has the advantage that the gradients are large enough to learn, and the network will learn the linear part of the mapping before the nonlinear part. With the activation function chosen as $f(x) = 1.7159\tanh\left(\frac{2}{3}x\right)$ and a normalized training set,³ the initial weights can be drawn from a uniform distribution with mean zero and standard deviation $\sigma = m^{-1/2}$, that is $U\left(0, \sqrt{1/m}\right)$, where m is the number of inputs.

² Keras is a high-level neural networks application programming interface written by [20].
³ This is a training set in which the average value is zero and all the training samples have about the same covariance.

Glorot Initialization The authors in [18] studied dense networks with symmetric activation functions f with unit derivative at 0, i.e. f′(0) = 1. In the following part, the findings from this paper will be examined in detail since the results will be used. Let $z^i$ be the activation vector of layer i and $s^i$ the argument vector of the activation function of layer i. Then by definition, we have:

\[ s^{i} = z^{i} W^{i} + b^{i}, \qquad z^{i+1} = f(s^{i}). \tag{3.13} \]

Using the cost function J we obtain:

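\[ \frac{\partial J}{\partial s^{i}_{k}} = \frac{\partial J}{\partial s^{i+1}} \frac{\partial s^{i+1}}{\partial s^{i}_{k}} = \frac{\partial J}{\partial s^{i+1}} f'(s^{i}_{k}) W^{i}_{k,\cdot}, \tag{3.14} \]

where the subscript k,· corresponds to the k-th row of the matrix. We have

\[ \frac{\partial J}{\partial w^{i}_{l,k}} = \frac{\partial J}{\partial s^{i}_{k}} \frac{\partial s^{i}_{k}}{\partial w^{i}_{l,k}} = \frac{\partial J}{\partial s^{i}_{k}} z^{i}_{l}, \tag{3.15} \]

where l denotes the column index.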

Now we can express the variances in terms of the input x, the output, and the initialized weights. Our goal is to be at least in the linear regime for all $s^i$, as in [10]. Assume that this is the case and that the variance of the outputs of each layer is the same as that of the input features, that is Var(x).

Since we assume to be in a linear regime we can say that f takes the form of f(x) = cx + d. By the symmetry assumption we have f(x) = −f(−x), and therefore d = 0. By using the assumption of the unit derivative, we obtain c = 1 and therefore f(x) ≈ x. Then it is easy to find:

\[ f'(s^{i}_{k}) \approx 1. \tag{3.16} \]

As mentioned in [10], the bias is initialized to 0. Using the assumptions, we can write

\[ z^{i} = f(s^{i-1}) \approx s^{i-1} = z^{i-1} W^{i-1}. \tag{3.17} \]

Taking the variance of (3.14) and using the results from (3.16) and (3.17), we obtain (for a detailed derivation see Appendix B):

\begin{align}
\operatorname{Var}\left[\frac{\partial J}{\partial s^{i}}\right] &= \operatorname{Var}\left[\frac{\partial J}{\partial s^{d}} \prod_{i'=i+1}^{d-1} W^{i'+1}_{k,\cdot}\, W^{i}_{k}\right] \tag{3.18} \\
&= \operatorname{Var}\left[\frac{\partial J}{\partial s^{d}}\right] \prod_{i'=i}^{d-1} n_{i'+1} \operatorname{Var}\left[W^{i'}\right]. \tag{3.19}
\end{align}

The variance of (3.15) can easily be found by combining the result of (3.18) and (3.14), and therefore:

\[ \operatorname{Var}\left[\frac{\partial J}{\partial w^{i}}\right] = \operatorname{Var}[x] \operatorname{Var}\left[\frac{\partial J}{\partial s^{d}}\right] \prod_{i'=0}^{i-1} n_{i'} \operatorname{Var}\left[W^{i'}\right] \prod_{i'=i}^{d-1} n_{i'+1} \operatorname{Var}\left[W^{i'}\right]. \tag{3.20} \]

Using the analytic expressions derived for the variance of the gradient w.r.t. the weights and the variance of the gradient w.r.t. the activation arguments, it is possible to study the behavior of the first training step of the network. The goal is to choose the weights in such a way that the signals and gradients neither explode nor diminish in any of the layers, in both the forward and the backward propagation. In forward propagation, an exploding signal saturates the activation function, so that the outputs of the activation can no longer be distinguished; with a diminishing signal, the output of the activation becomes zero. In both cases, information is lost. In backward propagation, a diminishing gradient means that the weights are not updated and therefore nothing is learned, while an exploding gradient causes divergence of the algorithm. In Section 3.4 we will explain that the use of skip connections is beneficial to get reasonable outputs of a network which has diminishing gradients for learning in the deeper layers.

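In order to reach this goal we require for forward propagation:

\[ \forall (i,j): \quad \operatorname{Var}\left[z^{i}\right] = \operatorname{Var}\left[z^{j}\right], \tag{3.21} \]

and for back propagation:

\[ \forall (i,j): \quad \operatorname{Var}\left[\frac{\partial J}{\partial s^{i}}\right] = \operatorname{Var}\left[\frac{\partial J}{\partial s^{j}}\right]. \tag{3.22} \]

By substituting the results from (B.6) in (3.21) and (3.18) in (3.22) we obtain:

\begin{align}
\forall i: \quad n_{i} \operatorname{Var}\left[W^{i}\right] &= 1, \tag{3.23} \\
\forall i: \quad n_{i+1} \operatorname{Var}\left[W^{i}\right] &= 1. \tag{3.24}
\end{align}

Addition of (3.23) and (3.24) results in:

\[ \forall i: \quad \operatorname{Var}\left[W^{i}\right] = \frac{2}{n_{i} + n_{i+1}}. \tag{3.25} \]

Using a uniform distribution around 0 results in:

\[ W \sim U\left[-\frac{\sqrt{6}}{\sqrt{n_{i} + n_{i+1}}}, \frac{\sqrt{6}}{\sqrt{n_{i} + n_{i+1}}}\right], \tag{3.26} \]

which will be called the normalized initialization.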
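As a quick numerical check of the step from (3.25) to (3.26), the sketch below draws a weight matrix from the uniform distribution in (3.26) and compares its empirical variance with the target 2/(n_i + n_{i+1}). The helper names `glorot_uniform` and `variance` and the layer sizes are hypothetical and chosen only for illustration.

```python
import math
import random

def glorot_uniform(n_in, n_out, rng):
    """Draw an n_in x n_out weight matrix from the uniform distribution of (3.26)."""
    limit = math.sqrt(6.0) / math.sqrt(n_in + n_out)
    return [[rng.uniform(-limit, limit) for _ in range(n_out)] for _ in range(n_in)]

def variance(values):
    """Population variance of a list of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

rng = random.Random(0)
n_in, n_out = 200, 100           # assumed layer sizes n_i and n_{i+1}
W = glorot_uniform(n_in, n_out, rng)
flat = [w for row in W for w in row]

empirical_var = variance(flat)
target_var = 2.0 / (n_in + n_out)   # right-hand side of (3.25)
```

A uniform distribution on [−a, a] has variance a²/3, so the bound a = √6/√(n_i + n_{i+1}) gives exactly the variance required by (3.25).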

He Initialization The authors of [19] studied the initialization of weights in a similar way as [18] for Convolutional Neural Networks with rectified linear activation functions. The bias is initialized to 0 and the weights of layer i are initialized with:

\[ W \sim N\left(0, \sqrt{\frac{2}{n_{i}}}\right). \tag{3.27} \]

Optimizers

Gradient descent algorithms are the most popular methods for optimizing neural networks. In recent years, various improvements have been introduced to speed up the convergence and therefore reduce the training time. This subsection follows a similar approach to [21] to get a better intuition for the novel optimization methods.

In the standard gradient descent algorithm, at each iteration, the weights are updated in the direction of steepest descent. The distance we move in that direction depends on the learning rate. Closer to the minimum, the gradients tend to decrease, and therefore smaller steps can lead to a better approximation of the minimum. The gradient descent algorithm can be naturally extended by using an annealed learning rate. This is a learning rate which depends on the iteration number and is typically decreasing. The improvement can give a higher convergence rate in the early iterations and the ability to find a lower minimum, since it suffers less from overshoot around the minimum. Implementing this requires knowledge of the optimization surface or of the convergence behavior of plain vanilla gradient descent. Since finding the right annealed learning rate can be time-consuming, an exponential decay of the learning rate is often used.

Momentum In practice, slow convergence rates can still be found using an annealed learning rate, due to the shape of the optimization surface. Consider the optimization surface of the elliptic paraboloid

\[ z = \frac{x^{2}}{a^{2}} + \frac{y^{2}}{b^{2}}. \tag{3.28} \]

Figure 3.12 shows the contour plots of two paraboloids, with the symmetric case on the left and the 'valley' case on the right. In the symmetric case, the gradient behaves in the same way in the x direction as in the y direction for any point (x, y) on the surface. Therefore every update adapts both x and y at the same rate. In the asymmetric paraboloid, this is not the case. The optimization is faster along the steep direction than along the flat direction of the valley, and therefore oscillations might occur across the valley while convergence along the valley is slow.

Figure 3.12: Figures of paraboloids, with a = 1 and b = 2 (contour plots over the x and y axes)

An intuitive solution to this problem is to consider a ball with a mass rolling down the surface and gaining momentum [22] as it goes further down. Equation (3.29) shows the implementation of this technique. Here $v_t$ is the speed of the ball at time t, which depends on the previous speed and the current gradient. It therefore accumulates all the previous gradients and increases in speed until it reaches a maximum velocity due to the parameter γ ∈ (0, 1], assuming that $\eta\nabla_\theta J(\theta)$ is decreasing. Note that the speed increases for gradients which point in the same direction and decreases for gradients that point in the opposite direction.

\[ v_{t} = \gamma v_{t-1} + \eta \nabla_{\theta} J(\theta), \tag{3.29} \]

\[ \theta = \theta - v_{t}. \tag{3.30} \]

For a relatively high momentum, we can approximate the weights of our next step. In the case that the ball runs into an upward slope in the next step, it would be beneficial to no longer have momentum in that direction. The paper [23] exploited this idea, and the algorithm is now known as Nesterov momentum. As shown in (3.31), the gradient is now calculated at the approximate next position instead of the current position.

\[ v_{t} = \gamma v_{t-1} + \eta \nabla_{\theta} J(\theta - \gamma v_{t-1}), \tag{3.31} \]

\[ \theta = \theta - v_{t}. \tag{3.32} \]
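The two update rules can be compared directly on the paraboloid (3.28). The sketch below implements (3.29)-(3.30) and (3.31)-(3.32) for a = 1 and b = 2; the function names, the starting point and the values of η and γ are assumptions chosen for illustration.

```python
def grad(theta, a=1.0, b=2.0):
    """Gradient of z = x^2 / a^2 + y^2 / b^2 from (3.28)."""
    x, y = theta
    return (2.0 * x / a ** 2, 2.0 * y / b ** 2)

def momentum_step(theta, v, eta, gamma, nesterov=False):
    """One update of (3.29)-(3.30), or of (3.31)-(3.32) when nesterov is True."""
    if nesterov:
        # Evaluate the gradient at the approximate next position.
        lookahead = tuple(t - gamma * vi for t, vi in zip(theta, v))
        g = grad(lookahead)
    else:
        g = grad(theta)
    v = tuple(gamma * vi + eta * gi for vi, gi in zip(v, g))
    theta = tuple(t - vi for t, vi in zip(theta, v))
    return theta, v

def run(nesterov, steps=200, eta=0.1, gamma=0.9):
    """Start at (1, 1) with zero velocity and apply `steps` updates."""
    theta, v = (1.0, 1.0), (0.0, 0.0)
    for _ in range(steps):
        theta, v = momentum_step(theta, v, eta, gamma, nesterov)
    return theta

theta_momentum = run(nesterov=False)
theta_nesterov = run(nesterov=True)
```

Both variants end close to the minimum at the origin; the only difference between them is the point at which the gradient is evaluated.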

Adagrad and Adadelta Another frequently occurring problem in the optimization of neural networks is learning from sparse data. If data is sparse, which naturally happens in language processing, but which we can also think of as events in time series that do not occur often but have a significant impact on the future movement, the event might barely be seen relative to other events during training. Therefore it would be beneficial to perform larger updates for the infrequent samples and smaller updates for the frequent samples. Adagrad solves this by adapting the learning rate for each weight parameter.

Let $g_{t,i}$ be the gradient of the cost function w.r.t. the parameter $\theta_i$ at time t:

\[ g_{t,i} = \nabla_{\theta} J(\theta_{i}). \tag{3.33} \]

In the classical SGD algorithm we would update the weights individually:

\[ \theta_{t+1,i} = \theta_{t,i} - \eta \cdot g_{t,i}. \tag{3.34} \]

The Adagrad method as proposed by [24] has a different implementation of the update rule. It divides the general learning rate η by the square root of the sum of the squares of the previous gradients, as shown in (3.35):

\[ \theta_{t+1,i} = \theta_{t,i} - \frac{\eta}{\sqrt{G_{t,ii} + \epsilon}} \cdot g_{t,i}, \qquad G_{t+1,ii} = G_{t,ii} + g_{t,i}^{2}, \tag{3.35} \]

where $G_t \in \mathbb{R}^{d \times d}$, with d the number of parameters, is a diagonal matrix consisting of the sum of the squares of the gradients w.r.t. $\theta_i$ up to time t, and ε is a small value preventing division by zero. This avoids the need for manually finding the right learning rate parameters, since it adapts them to the past gradients for each parameter individually. However, this might cause a problem after a large number of time steps. The learning rate might vanish due to the accumulation of previous gradients, and therefore the algorithm stops learning.

The Adadelta method as proposed by [25] solves this problem by taking into account the gradients of a past time window instead of the entire past. Therefore the learning rate is not necessarily monotonically decreasing.
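A minimal sketch of the Adagrad update (3.35) is given below for a two-parameter 'valley' problem. As an assumption borrowed from common implementations, the accumulator is updated before the step is taken, so that the first update does not divide by ε alone; the names `adagrad` and `valley_grad` and all numerical values are hypothetical.

```python
import math

def adagrad(grad_fn, theta, eta=0.5, eps=1e-8, steps=500):
    """Adagrad in the spirit of (3.35): one accumulator G[i] per parameter."""
    theta = list(theta)
    G = [0.0] * len(theta)
    for _ in range(steps):
        g = grad_fn(theta)
        for i in range(len(theta)):
            G[i] += g[i] ** 2                       # accumulate squared gradients
            theta[i] -= eta / math.sqrt(G[i] + eps) * g[i]
    return theta, G

def valley_grad(theta):
    """Gradient of z = x^2 + y^2 / 100, which is steep in x and flat in y."""
    return [2.0 * theta[0], 2.0 * theta[1] / 100.0]

theta, G = adagrad(valley_grad, [1.0, 1.0])
```

Because each parameter is scaled by its own accumulated gradients, the flat direction is not slowed down by the steep one, at the price of an effective learning rate that can only decrease over time.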

Adam and AdaMax In the previous methods, two significant improvements were shown: first, the use of momentum to speed up learning, and second, the use of individual updates per weight. Adam, as proposed by [26], combines the two. Since this method is mainly used for training in this thesis, it will be discussed in detail. For convenience, the algorithm in [26] is cited in Algorithm 6. The idea of the algorithm is to update exponential moving averages⁴ of the gradient ($m_t$) and the squared gradient ($v_t$), where the parameters β1 and β2 are used as the rates of exponential decay. The quantities $m_t$ and $v_t$ estimate the mean and the uncentered variance of the gradient. Since both moving averages are initialized at zero, these terms have a bias towards zero; $\hat{m}_t$ and $\hat{v}_t$ give bias-corrected moving averages.

Algorithm 6 The Adam algorithm

Require: α: Stepsize
Require: β1, β2 ∈ [0, 1): Exponential decay rates for moment estimates
Require: f(θ): Stochastic cost function with parameters θ
Require: θ_0: Initial parameter vector
m_0 ← 0 (Initialize first moment vector)
v_0 ← 0 (Initialize second moment vector)
t ← 0 (Initialize time step)
while not converged do
    t ← t + 1
    g_t ← ∇_θ f_t(θ_{t−1})
    m_t ← β1 · m_{t−1} + (1 − β1) · g_t (Update biased first moment estimate)
    v_t ← β2 · v_{t−1} + (1 − β2) · g_t² (Update biased second raw moment estimate)
    m̂_t ← m_t / (1 − β1^t) (Compute bias-corrected first moment estimate)
    v̂_t ← v_t / (1 − β2^t) (Compute bias-corrected second raw moment estimate)
    θ_t ← θ_{t−1} − α · m̂_t / (√v̂_t + ε) (Update parameters)
end while
return θ_t (Resulting parameters)

The next component to consider is the update rule. A parameter ε is added to avoid division by zero. Since we assume that $\hat{v}_t > 0$ during our updates, we can set ε = 0. Consider the step size $|\Delta_t| = |\theta_t - \theta_{t-1}|$. In the first case, we suppose that the gradients of all time steps are equal to 0 except at the current time step. Then we obtain:

\[ |\Delta_{t}| = \alpha \cdot \frac{\sqrt{1 - \beta_{2}^{t}}}{1 - \beta_{1}^{t}} \cdot \frac{1 - \beta_{1}}{\sqrt{1 - \beta_{2}}}. \tag{3.36} \]

At the first time step t = 1 and in the limit t → ∞ we obtain:

\[ |\Delta_{1}| = \alpha, \tag{3.37} \]

\[ \lim_{t \to \infty} |\Delta_{t}| = \alpha \cdot \frac{1 - \beta_{1}}{\sqrt{1 - \beta_{2}}}, \tag{3.38} \]

which, for $(1 - \beta_1) > \sqrt{1 - \beta_2}$, gives the upper bound for this case: $|\Delta_t| \leq \alpha \cdot \frac{1 - \beta_1}{\sqrt{1 - \beta_2}}$. In the case that gradients are nonzero, which happens more often in practice, we obtain $|\Delta_t| \lesssim \alpha$ (for more information on this bound see [26]). Therefore the α parameter provides a bound on the maximum change in the parameters. Intuitively, this can be seen as the region around the current parameter values in which we wish to find our new set of parameters. In this region, we expect that the gradient update provides sufficient information. Using this intuitive idea on α we can study the ratio $\hat{m}_t / \sqrt{\hat{v}_t}$. When the ratio becomes smaller, the step size will also decrease. Note that this ratio typically becomes smaller when coming closer to a minimum. Therefore the earlier mentioned learning rate annealing is adopted in Adam.

⁴ The gradient average $m_t$ is updated by using a weighted update from the previous value $m_{t-1}$. Applying this relation recursively results in weights on the updates that have an exponential decay.

The authors of [26] also show that Adam is a regret-minimizing algorithm under the assumption that the cost function has bounded gradients in the 2-norm and the ∞-norm and that α is chosen in such a way that the updates are bounded. The result is that the average regret converges:

\[ \frac{R(T)}{T} = O\left(\frac{1}{\sqrt{T}}\right). \]
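The step-size expression (3.36) can be checked numerically against the update rule of Algorithm 6. The sketch below applies the loop body of the algorithm to a scalar parameter, with all gradients equal to zero except at the last step; the function names, the chosen time step and the default values of α, β1, β2 and ε are assumptions made for illustration.

```python
import math

def adam_step(theta, m, v, g, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One iteration of the loop body of Algorithm 6 for a scalar parameter."""
    m = beta1 * m + (1.0 - beta1) * g
    v = beta2 * v + (1.0 - beta2) * g * g
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    theta = theta - alpha * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

def sparse_step_size(t, alpha=0.001, beta1=0.9, beta2=0.999):
    """Right-hand side of (3.36)."""
    return (alpha * math.sqrt(1.0 - beta2 ** t) / (1.0 - beta1 ** t)
            * (1.0 - beta1) / math.sqrt(1.0 - beta2))

# All gradients are zero except at the current (final) time step.
t_final = 50
theta, m, v = 0.0, 0.0, 0.0
for t in range(1, t_final + 1):
    g = 1.0 if t == t_final else 0.0
    previous = theta
    theta, m, v = adam_step(theta, m, v, g, t)

observed = abs(theta - previous)
predicted = sparse_step_size(t_final)
```

The small ε is kept here only so that the steps with zero gradient do not divide by zero; its effect on the final step is negligible.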

An alternative to Adam is also introduced by [26], which is called AdaMax. In the Adam algorithm, the updates of the parameters are scaled inversely by the L2-norm of their current and past gradients. This can be generalized to an Lp scaling. In the case of AdaMax, the ∞-norm is used, which results in a simple and stable algorithm with a fixed bound for the updates, $|\Delta_t| \leq \alpha$.

Higher order methods Higher order optimization methods such as Newton's method are generally not used in stochastic optimization on high-dimensional data sets since they suffer from the curse of dimensionality. Moreover, the use of second order derivatives sets new requirements for the cost and activation functions, which restricts us to a smaller set of the popular choices. On the other hand, a second order method naturally solves the 'valley' problem. Higher order methods are not used in this thesis.

3.3 Regularization methods

As stated in Section 2.1, the prediction accuracy of a neural network should be independent of the training data provided. In other words, we wish to obtain the highest prediction accuracy on the test data. A regularization method is a modification of the algorithm to increase the test accuracy. By this definition, we allow the method to possibly decrease the accuracy on the training set, which is often caused by removing the overfit. This section covers some popular regularization methods in deep learning, such as early stopping, L2-regularization and dropout.

Early stopping

The first regularization method is straightforward. The idea is to train the network just until the optimum of the validation error is reached. By training further, the network would decrease the training error at the cost of the validation error, i.e. the network overfits. Early stopping has two main advantages: it optimizes the validation error, and it reduces the training time. In Figure 3.13, the typical training and validation errors are plotted against the number of training iterations. Choosing to stop after 600-700 iterations would give the desired result. The easiest way to implement early stopping is storing the weight parameters for every training iteration and evaluating the validation error after each iteration. By analyzing the previous validation errors, one could set a stopping criterion and choose the best set of parameters after stopping. In Figure 3.13, it would be sufficient to stop training as soon as the validation error increases. Generally, this smooth behavior is not seen during optimization, and therefore the past n steps should be compared to choose the right stopping criterion. For large data sets and networks it might be beneficial to consider setting up two hyper-parameters: the patience p and the number of steps between evaluations n. After making n training iterations, we evaluate the validation error. If the validation error is smaller than the previous validation error, we store the weights and make n iterations again. If the validation error is higher than the previous validation error, we start counting how many times this happens in succession. When p times are counted, we stop the algorithm and choose the best saved set of weights.

Figure 3.13: Behaviour of the training error and validation error during training (error plotted against the number of iterations)
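The patience-based stopping rule can be sketched as follows. The function takes the sequence of validation errors, one per evaluation (that is, one per n training iterations), and returns the evaluation at which the best weights were stored and the evaluation at which training stops. As an assumption, each error is compared with the best error seen so far rather than with the immediately preceding one; the function name and the error values are hypothetical.

```python
def early_stopping(validation_errors, patience):
    """Return (best_index, stop_index) for a sequence of validation errors.

    Each entry is the validation error measured after another n training
    iterations. Training stops once the error has failed to improve on the
    best value `patience` times in a row.
    """
    best_error = float("inf")
    best_index = -1
    bad_evaluations = 0
    for index, error in enumerate(validation_errors):
        if error < best_error:
            best_error, best_index = error, index   # the weights would be stored here
            bad_evaluations = 0
        else:
            bad_evaluations += 1
            if bad_evaluations == patience:
                return best_index, index
    return best_index, len(validation_errors) - 1

# Hypothetical validation errors, one per evaluation.
errors = [0.90, 0.60, 0.45, 0.40, 0.42, 0.39, 0.41, 0.43, 0.44, 0.47]
best, stopped = early_stopping(errors, patience=3)
```

With a patience of three, the temporary increase at the fifth evaluation does not stop training, so the lower error at the sixth evaluation is still found.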

L2-Regularization

A frequently used regularization method is the L2-regularization. The explanation of [27] will be used to explain this regularization method. The squared L2-norm of the weights is added to the cost function as shown in (3.39), where β is the regularization parameter. During training, this results in moving the weights closer to the origin.

\[ \bar{J}(W) = \frac{\beta}{2} \|W\|_{2}^{2} + J(W) \tag{3.39} \]

To understand how the L2-regularization improves the performance on the validation set, we study the behavior during weight updates. Consider the case that all activation functions are linear and the mean squared error is used for the cost function J. This reduces the optimization problem to (3.40), a ridge regression problem.

\[ \min_{W} \bar{J}(W) = \min_{W} \frac{1}{n} \|W \cdot x - y\|_{2}^{2} + \frac{\beta}{2} \|W\|_{2}^{2} \tag{3.40} \]

Taking the gradient w.r.t. W gives:

\[ \nabla_{W} \bar{J}(W) = \frac{2}{n} x^{T} (W x - y) + \beta W. \tag{3.41} \]

Setting the gradient to zero gives the solution of the optimization problem as shown in (3.42):

\[ W = \left(x^{T} x + \frac{\beta}{2} n I\right)^{-1} (x^{T} y), \tag{3.42} \]

where I is the identity matrix. Since the x are the input features of the network, $x^T x \propto \operatorname{cov}(x, x)$. From the covariance matrix, we know that the diagonal terms $d_i$ correspond to the variance of the inputs $x_i$. The L2-regularization thus increases the variance of the inputs as seen by the optimization problem. As a consequence, it will shrink the weights acting on the features whose covariance with the target y is low compared to the added variance on the diagonal. This is an interesting property which can be useful for conditioning neural networks. With such a parameter we may control the level of covariance required for passing information from one signal to another.
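For a single input feature, the closed-form solution (3.42) reduces to a scalar expression, which makes the shrinking effect easy to verify. The sketch below evaluates this scalar version and checks it against the gradient (3.41); the data values and the function names are assumptions chosen for illustration.

```python
def ridge_weight(xs, ys, beta):
    """Minimizer of (1/n) * sum (w*x - y)^2 + (beta/2) * w^2, the scalar case of (3.42)."""
    n = len(xs)
    xtx = sum(x * x for x in xs)
    xty = sum(x * y for x, y in zip(xs, ys))
    return xty / (xtx + 0.5 * beta * n)

def ridge_gradient(w, xs, ys, beta):
    """Gradient (3.41) for a single scalar weight."""
    n = len(xs)
    return 2.0 / n * sum(x * (w * x - y) for x, y in zip(xs, ys)) + beta * w

# Hypothetical data that roughly follows y = 2x.
xs = [0.5, 1.0, 1.5, 2.0]
ys = [1.1, 1.9, 3.2, 3.9]

w_plain = ridge_weight(xs, ys, beta=0.0)
w_ridge = ridge_weight(xs, ys, beta=1.0)
```

The regularized weight is smaller in magnitude than the unregularized one, and the gradient (3.41) vanishes at the regularized solution.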

3.3.1 Bootstrap aggregating

[28] introduced bootstrap aggregating, also called bagging, a regularization method that combines several models. The idea is to train multiple models independently on bootstrap samples drawn from the same training data and to combine the outputs of all the models by averaging in the case of regression or voting in the case of classification. This method is useful when the errors of the trained models are different. To see this, consider n trained regression models with a corresponding error $\epsilon_i$ for each of the models. The error of the ensemble predictor is given by (3.43):

\[ \bar{\epsilon} = \frac{1}{n} \sum_{i=1}^{n} \epsilon_{i}. \tag{3.43} \]

By assuming that the individual errors $\epsilon_i$ are multivariate normally distributed with $E[\epsilon_i] = 0$, $E[\epsilon_i^2] = \operatorname{Var}(\epsilon)$ and $E[\epsilon_i \epsilon_j] = c$ for i ≠ j, we obtain:

\begin{align}
E\left[\bar{\epsilon}^{2}\right] &= E\left[\left(\frac{1}{n} \sum_{i=1}^{n} \epsilon_{i}\right)^{2}\right] \tag{3.44} \\
&= \frac{1}{n^{2}} E\left[\sum_{i=1}^{n} \epsilon_{i}^{2} + \sum_{i \neq j} \epsilon_{i} \epsilon_{j}\right] \tag{3.45} \\
&= \frac{\operatorname{Var}(\epsilon)}{n} + \frac{(n-1)\,c}{n}, \tag{3.46}
\end{align}

where $\bar{\epsilon}$ denotes the ensemble error. We can see in (3.46) that for uncorrelated errors, i.e. c = 0, the expected squared error of the ensemble scales inversely proportionally with n. The major drawback of bagging in deep learning is the increase in computational cost. Note that it does not necessarily increase the computation time, since the networks can run in parallel.
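The result (3.46) can be illustrated with a small Monte Carlo experiment. In the sketch below, correlated model errors are constructed from a shared part and an individual part, which is one possible way (an assumption of this example) to obtain the required variance and covariance for 0 ≤ c ≤ Var(ε); the function names and parameter values are hypothetical.

```python
import math
import random

def ensemble_mse(n_models, var, cov, trials=20000, seed=0):
    """Monte Carlo estimate of the expected squared ensemble error.

    Each error is built as a shared part plus an individual part so that
    Var(eps_i) = var and E[eps_i * eps_j] = cov for i != j.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        shared = math.sqrt(cov) * rng.gauss(0.0, 1.0)
        errors = [shared + math.sqrt(var - cov) * rng.gauss(0.0, 1.0)
                  for _ in range(n_models)]
        mean_error = sum(errors) / n_models
        total += mean_error ** 2
    return total / trials

def predicted_mse(n_models, var, cov):
    """Right-hand side of (3.46)."""
    return var / n_models + (n_models - 1) * cov / n_models

simulated = ensemble_mse(n_models=10, var=1.0, cov=0.2)
predicted = predicted_mse(n_models=10, var=1.0, cov=0.2)
```

For correlated errors the second term in (3.46) does not vanish as n grows, so the benefit of adding more models saturates at the covariance c.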

3.3.2 Dropout

The increase in computational cost in bagging makes it impractical for large networks. The authors of [29] introduced dropout, which provides an easy way to prevent neural networks from overfitting. Dropout removes units, together with their connections, from the network at random for each training iteration. Applying dropout results in the training of an ensemble consisting of all the sub-networks that can be made from the original network. Compared to bagging, this method is inexpensive, since only the forward pass is affected regarding the number of computations per training iteration. However, the sub-networks are not trained as independent networks, since they still share their parameters. Paper [29] also shows that dropout can be combined with other regularization methods such as the L2-regularization with promising results.

A variant of dropout is DropConnect as introduced by the authors of [30]. In DropConnect each connection instead of each unit is considered to drop out. DropConnect often outperforms dropout at the cost of speed.
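The random removal of units can be sketched with a simple mask on the activations of one layer. The rescaling of the surviving units by 1/(1 − p) during training, often called inverted dropout, is an assumption taken from common implementations and is not described in the text; the function name and the numerical values are hypothetical as well.

```python
import random

def dropout(activations, drop_prob, rng, training=True):
    """Apply (inverted) dropout to a list of unit activations.

    During training each unit is removed with probability drop_prob and the
    surviving units are rescaled by 1 / (1 - drop_prob); at test time the
    activations are passed through unchanged.
    """
    if not training or drop_prob == 0.0:
        return list(activations)
    keep_prob = 1.0 - drop_prob
    return [a / keep_prob if rng.random() < keep_prob else 0.0 for a in activations]

rng = random.Random(0)
layer_output = [1.0] * 10000
dropped = dropout(layer_output, drop_prob=0.5, rng=rng)

fraction_removed = dropped.count(0.0) / len(dropped)
mean_activation = sum(dropped) / len(dropped)
```

A new mask is drawn on every call, so each training iteration effectively trains a different sub-network that shares its weights with all the others.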

3.4 Wavenet

The Wavenet is a deep neural network for generating raw audio waveforms [1]. This particular network will be discussed in detail since it is used as the starting point for the network used in this thesis. Moreover, the authors in [2] already implemented the Wavenet in a financial forecasting setting with promising results. This section will first explain the original Wavenet implementation from [1]. Afterwards, the adapted network for financial forecasting will be discussed. The Wavenet distinguishes itself from other convolutional networks since it is able to take relatively large 'visual fields'⁵ at low cost. Moreover, it is able to add conditioning of the signals locally and globally, which allows the Wavenet to be used as a text-to-speech (TTS) engine with multiple voices, where the text input gives the local conditioning and the particular voice the global conditioning. In the sense of financial forecasting, conditioning can be used for temporal relations like correlations and cross-correlations between two series of asset prices.

Figure 3.14: Overview of the residual block and the entire architecture, retrieved from [1]

Figure 3.14 shows a graphical representation of the architecture of the Wavenet as used in [1]. The architecture of the network is explained in the same way as by the authors. For each component, the application to a time series will be discussed. First, the conversion of the signal into categories will be discussed. This is followed by the stacked dilated convolutions. Thereafter, the gated activation functions as used in the residual blocks, and the residual blocks and skip connections themselves, are explained. We finish with the local conditioning and global conditioning.

⁵ In the context of images the term visual field is used to express the receptive field of an output w.r.t. the input. We will continue using this term; however, in the context of audio the term 'audible field' might be more appropriate.

Raw audio is typically generated with a sampling frequency between 44.1 kHz⁶ and 192 kHz and a signed integer value of 16 to 24 bits to store the waveform. These high sampling frequencies are used because the human hearing range is between about 20 Hz and 20 kHz. The Nyquist-Shannon sampling theorem tells us that we need a sampling frequency at least twice as high as the highest frequency that has to be determined in order to prevent aliasing. Aliasing is the effect that two signals become indistinguishable after sampling.

To produce raw audio with a neural network, two options can be considered. First, create a regression model that is able to predict in the full input range. Second, create a classification model which has outputs for every bit in the signal. The authors of [1] made a classification model and used Pulse Code Modulation (PCM) to encode the signal on 8 bits. The transformation of the signal is made using the µ-law as shown in (3.47):

\[ f(x_{t}) = \operatorname{sign}(x_{t}) \frac{\ln(1 + \mu |x_{t}|)}{\ln(1 + \mu)}, \tag{3.47} \]

where $x_t \in (-1, 1)$ and µ = 255. The nonlinear quantization of a signal provides an efficient way to store raw audio data for speech compared to the linear quantization. To get an idea of the data set size, one second of audio data is stored in approximately 1.35 megabytes. Training on ten hours of data would require 47 gigabytes of data.
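The companding in (3.47) can be sketched together with a quantization to 256 classes. The mapping of the companded value from (−1, 1) to the integers 0 to 255 and the inverse transformation are assumptions made for illustration; only the companding formula itself is taken from (3.47), and the function names are hypothetical.

```python
import math

MU = 255

def mu_law_encode(x, mu=MU):
    """Compand x in (-1, 1) with (3.47) and quantize it to an integer in [0, mu]."""
    companded = math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)
    return int((companded + 1.0) / 2.0 * mu + 0.5)

def mu_law_decode(code, mu=MU):
    """Approximate inverse of mu_law_encode."""
    companded = 2.0 * code / mu - 1.0
    return math.copysign(math.expm1(abs(companded) * math.log1p(mu)) / mu, companded)

samples = [-0.9, -0.1, -0.001, 0.0, 0.001, 0.1, 0.9]
codes = [mu_law_encode(x) for x in samples]
decoded = [mu_law_decode(c) for c in codes]
```

Small amplitudes are spread over many classes while large amplitudes share few, which is the nonlinear quantization referred to above.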

Figure 3.15: Illustration of the stacked dilated convolutions

Stacked Dilated Convolutions The main building blocks of the Wavenet are the causal dilated convolutions. As an extension of the causal dilated convolutions, the Wavenet also allows stacks of these convolutions, as shown in Figure 3.15. To obtain the same receptive field with dilated convolutions in this figure, another dilation layer is required. The stacks are a repetition of the dilated convolutions, connecting the outputs of each dilated layer to a single output. This enables the Wavenet to get a large 'visual' field of one output node at a relatively low computational cost. For comparison, to get a visual field of 512 inputs, an FCN would require 511 layers. In the case of a dilated convolutional network, we would need eight layers. The stacked dilated convolutions only need seven layers with two stacks or six layers with four stacks. To get an idea of the differences in computational power required for covering the same visual field, Table 3.1 shows the number of weights required in the network with the assumption of one filter per layer and a filter width of two. Furthermore, it is assumed that the network is using binary encoding of the 8 bits.

⁶ The squared product of the first four prime numbers gives exactly 44100.

Table 3.1: Number of weights for networks with a ‘visual field’ of 512

Network type   No. stacks   No. weights per channel   Total No. of weights
FCN            1            2.6 · 10^5                2.6 · 10^6
WN             1            1022                      8176
WN             2            1022                      8176
WN             4            508                       4064
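To make the mechanism of the causal dilated convolution concrete, the sketch below implements a single-channel version with a filter width of two and pushes an impulse through four layers with doubling dilation. Zero padding at the start of the signal, unit weights, and the function names are assumptions of this example; the way layers are counted here need not coincide with the counting used in the text and in Table 3.1.

```python
def causal_dilated_conv(signal, weights, dilation):
    """Causal convolution with filter width 2.

    output[t] = weights[0] * signal[t - dilation] + weights[1] * signal[t],
    where samples before the start of the signal are treated as zero.
    """
    output = []
    for t in range(len(signal)):
        past = signal[t - dilation] if t - dilation >= 0 else 0.0
        output.append(weights[0] * past + weights[1] * signal[t])
    return output

def receptive_field(dilations, filter_width=2):
    """Number of input samples that can influence one output sample."""
    return 1 + sum((filter_width - 1) * d for d in dilations)

# An impulse at t = 0 pushed through layers with doubling dilation and unit
# weights spreads over as many samples as the receptive field of one output.
dilations = [1, 2, 4, 8]
signal = [1.0] + [0.0] * 31
for d in dilations:
    signal = causal_dilated_conv(signal, [1.0, 1.0], d)

reached = [t for t, value in enumerate(signal) if value != 0.0]
```

Each additional layer doubles the number of inputs that are reached while adding only two weights, which is the source of the low cost of the dilated architecture.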

Gated Activation Functions Inside each layer, the outputs of the dilated convolutions are passed to two separate activation functions: the hyperbolic tangent and the logistic function. The outputs of the activation functions are multiplied element-wise as shown in (3.48). The reason to use these gated activation functions is to replicate the gating behavior of the LSTM. The logistic function gradually decides whether the input should pass through or not. The tanh function provides the new candidate, since the tanh allows outputs in the range (−1, 1), which is required if the output should be able to go up or down.

\[ z = \tanh(W_{f,k} * x) \odot \sigma(W_{g,k} * x), \tag{3.48} \]

where ∗ denotes the convolution operator and ⊙ is the Hadamard product.

Residual Block and Skip connection The authors of [31] hypothesize that by training on the residual mapping instead of on the mapping without a reference, it becomes easier to optimize the weights. The output of the residual block is given by (3.49). Instead of passing only the output itself to the next layer, the input of the layer is added to it.

\[ W_{1 \times 1}(z) + x = W_{1 \times 1}\left(\tanh(W_{f,k} * x) \odot \sigma(W_{g,k} * x)\right) + x. \tag{3.49} \]

Contrary to the residual connections used in [31], the Wavenet adds a skip connection before the residual is made, which bypasses all the following residual blocks. Each of these skip connections is summed before passing them through a series of activation functions and convolutions. Intuitively this is the sum of the information extracted in each layer. In an ideal case, we would like to have independence between each of these skip connections. This would result in new feature extraction in each succeeding layer.
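The interplay of the gated activation (3.48), the residual connection (3.49) and the skip connections can be sketched for a single channel. In that case the 1 × 1 convolutions reduce to scalar multiplications; the scalar `w_skip` applied on the skip path, the weight values and the input signal are hypothetical choices for illustration and do not come from [1].

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dilated_conv(x, w, dilation):
    """Causal convolution of width 2: w[0] * x[t - dilation] + w[1] * x[t]."""
    return [w[0] * (x[t - dilation] if t >= dilation else 0.0) + w[1] * x[t]
            for t in range(len(x))]

def residual_block(x, w_f, w_g, w_res, w_skip, dilation):
    """Gated activation (3.48) followed by the residual connection (3.49).

    For a single channel the 1 x 1 convolutions reduce to the scalars
    w_res and w_skip.
    """
    filt = dilated_conv(x, w_f, dilation)
    gate = dilated_conv(x, w_g, dilation)
    z = [math.tanh(f) * sigmoid(g) for f, g in zip(filt, gate)]
    skip = [w_skip * zt for zt in z]                      # sent to the output stage
    residual = [w_res * zt + xt for zt, xt in zip(z, x)]  # sent to the next block
    return residual, skip

x = [0.0, 0.5, 1.0, 0.5, 0.0, -0.5, -1.0, -0.5]
skips = []
for d in [1, 2, 4]:
    x, skip = residual_block(x, w_f=[0.5, 0.5], w_g=[1.0, -1.0],
                             w_res=0.1, w_skip=1.0, dilation=d)
    skips.append(skip)

# The skip connections of all blocks are summed before the output layers.
summed_skips = [sum(values) for values in zip(*skips)]
```

The residual path carries the input forward unchanged when the gated branch contributes nothing, while the skip path collects the gated output of every block.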

Local and Global Conditioning Let h be a conditioning signal on a specific time scale. Both local and global conditioning are applied in the gated activation function. In the global conditioning (3.50), $V_{\cdot,k}$ is a learnable linear projection, and the product $V^{T}_{\cdot,k} h$ is repeated over the whole time dimension. Therefore the global conditioning can be seen as a feature which can occur at any place in time and which influences the output. The local conditioning (3.51) is based on another time series $h_t$ which is upsampled to a signal y = f(h) with the same frequency as the input signal using a transposed convolutional network. Since y already has the correct length, $V_{f,k} * y$ becomes a 1 × 1 convolution. By fixing the conditioning length to the signal length, the position of the feature in time is specified.

\[ z = \tanh\left(W_{f,k} * x + V^{T}_{f,k} h\right) \odot \sigma\left(W_{g,k} * x + V^{T}_{g,k} h\right) \tag{3.50} \]

\[ z = \tanh\left(W_{f,k} * x + V_{f,k} * y\right) \odot \sigma\left(W_{g,k} * x + V_{g,k} * y\right) \tag{3.51} \]

3.5 Augmented Wavenet (AWN)

Figure 3.16: Overview of the architecture used in AugmentedWavenet, retrieved from [2]

The augmented Wavenet is a regression model trained directly on output values instead of using a discretization of the output range. In contrast to the original Wavenet, the AWN starts by splitting both the input and the conditioning signal into two signals: a parametrized skip connection and a signal going towards a batch normalization (BN) layer. The original aim of the normalization layer is to reduce the internal covariate shift to avoid the saturation of nonlinear activation functions, see [32]. In the case of a ReLU activation function, saturation cannot occur. Adding a normalization layer in front of the inputs of the ReLU would therefore be particularly useful for inputs that are negative; this typically occurs when forecasting on returns. A problem remains that the values should also be scaled back to the original range.

After the normalization is applied, both normalized signals are passed through a dilated convolution. The outputs of these dilated convolutions are added, which conditions the original signal on the conditioning signal. The new signal is then passed through the ReLU activation function, which differs from the original Wavenet. Afterwards, the output is added to the previously mentioned parametrized skip connection, which acts as a gate deciding whether to use the skip from the input or the condition. The parametrized skip connection is a 1 × 1 convolution of the original signals. This is particularly important when multiple conditions are used. The output of the first layer can then be denoted by (3.52). Note that the output of the first layer is summed differently from [1]. The AWN sends the residual to the output, whereas the WN sends the value before the residual to the output. After passing the first layer, the process is repeated on the output of that layer, but without conditioning.

$$z_2(x, y) = \left(W_1 x + W_2 y + \mathrm{ReLU}\left(w_1^h * g_{1,x}(x) + v_1^h * g_{1,y}(y)\right)\right), \tag{3.52}$$

where $g_{i,\cdot}$ is the normalization layer and $*$ is the dilated convolution operator.

The use of a different type of residual connection has a great impact on the network. Consider the Wavenet output in the AWN for a layer i > 1 as shown in (3.53).⁷ This output is used both for passing to the next layer and for skipping to the total output of the network. Therefore the output of the network consists of a sum of linear inputs and nonlinear transformations of the inputs. In the case of the WN, the total output is a sum of the nonlinear transformations made in the layers, as shown in (3.55). This filters the learned information from each layer and passes it to the output. Intuitively this creates more independence between the succeeding layers.

$$z_{i+1}(z_i) = F(z_i) - W_{i+1} z_i \tag{3.53}$$

$$o_{i+1}(z_i) = F(z_i) - W_{i+1} z_i \tag{3.54}$$

$$z_{i+1}(z_i) = F(z_i) - W_{i+1} z_i \tag{3.55}$$

$$o_{i+1}(z_i) = F(z_i) \tag{3.56}$$

3.6 Summary

In this chapter we discussed the aspects of neural networks required for building the Wavenet. The use of differentiable activation functions allowed us to create deeper networks. Convolutional layers were introduced to tackle the curse of dimensionality in the fully connected layers and to obtain better generalization results. As an alternative, the LSTM was introduced, which can be used as a benchmark. To train the deeper networks, the delta rule algorithm was extended to the backpropagation algorithm. The cost function required for this algorithm can be adjusted to the objective of the problem. The convergence rate of the training is improved by the choice of initialization and optimizer. The Glorot initialization and Adam optimization will be used in the following chapters. To obtain better generalization, a variety of regularization methods can be used. Finally, both implementations of the Wavenet were discussed, which showed significant differences in the use of activation functions and the implementation of skip connections.

⁷ A minus sign is used here since we want the network to learn the residuals; this is more intuitive for the reader. In the actual implementation, a plus sign is used, and the network can learn to use either the plus or minus sign.

Chapter 4

Methodology

This chapter starts in Section 4.1 with an explanation of how the data set splitting changes for forecasting problems. This is followed in Section 4.2 by an overview of the error measures used and the tools to validate statistical significance. In Section 4.3, a variety of artificial time series is presented. Using artificial time series enables us to see how the Wavenet performs under specific conditions. This should give us further insight for optimizing the Wavenet for real world data. Tests on real world data are described in Section 4.4. This includes data preprocessing, which is often required to obtain better results in the forecast of the Wavenet.

4.1 Evaluation of the network

We wish to assess the performance of a prediction $\hat{Y}$ of a time series. This performance is measured by quantifying how well the prediction matches the observed data. Before the error measures are discussed, the methodology of training, validating and testing is described. As mentioned in Section 2.1, the observed data is split into three sets: the training data, the validation data and the test data. In time series forecasting we distinguish between in sample forecasting and out of sample forecasting. An in sample forecast is a forecast for which the model uses all the data up to time T. An out of sample forecast is a forecast made on data after time T. The out of sample data is not used in any way to improve the model performance. For this reason an in sample forecast is expected to perform better than an out of sample forecast.

A clear distinction between the data set splitting in machine learning practice and the in sample and out of sample forecasts is not present. After validation of a trained model using out of sample forecasts, the model can be updated and can then implicitly contain information from the out of sample data. Training the updated model up to time T again and making an out of sample forecast might give an increased performance compared to the previously trained model. Therefore a distinction is made in this thesis between a validation out of sample forecast and a test out of sample forecast. The validation out of sample forecast is used to improve the performance of a model. The test out of sample forecast is used to compare the performance of different models with each other. The results of the latter forecast will never be used to improve the models.


Consider some in sample data $Y_{t \le T}$ and out of sample data $Y_{t > T}$. The neural network can make a forecast $\hat{Y}_{T+1} = \hat{f}(Y_T, Y_{T-1}, \ldots, Y_{T-n})$ based on the previous $n$ time steps. For the next time step, two kinds of forecast are possible. The first is the one day ahead forecast, denoted by $\hat{Y}$, which uses the true out of sample data such that $\hat{Y}_{T+2} = \hat{f}(Y_{T+1}, Y_T, \ldots, Y_{T-n+1})$. The second is the full forecast, denoted by $\tilde{Y}$, which uses the predicted data in the forecast such that $\tilde{Y}_{T+2} = \hat{f}(\tilde{Y}_{T+1}, Y_T, \ldots, Y_{T-n+1})$. The full forecast is expected to perform worse than the one step ahead forecast, since it accumulates the error as defined in Definition 4.2.

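$$\begin{aligned} \tilde{Y}_{T+2} &= \hat{f}(\tilde{Y}_{T+1}, Y_T, \ldots, Y_{T-n+1}) \\ &= \hat{f}(Y_{T+1} + (\tilde{Y}_{T+1} - Y_{T+1}), Y_T, \ldots, Y_{T-n+1}) \\ &= \hat{f}(Y_{T+1} + \tilde{e}_{T+1}, Y_T, \ldots, Y_{T-n+1}) \end{aligned} \tag{4.1}$$

$$\begin{aligned} \tilde{Y}_{T+3} &= \hat{f}(\tilde{Y}_{T+2}, \tilde{Y}_{T+1}, \ldots, Y_{T-n+2}) \\ &= \hat{f}(Y_{T+2} + \tilde{e}_{T+2}, Y_{T+1} + \tilde{e}_{T+1}, \ldots, Y_{T-n+2}) \end{aligned} \tag{4.2}$$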

Note that in (4.2) the $\tilde{e}_{T+2}$ term also contains the $\tilde{e}_{T+1}$ term from (4.1). The negative effects of accumulating errors on forecasting performance are studied in Section 4.3.

The reader might question why a full forecast up to time T + n is performed instead of building a network which directly forecasts n steps ahead, since an n-step-ahead forecast would not suffer from the accumulation of errors. This is also possible; however, by making full forecasts we can implicitly measure the sensitivity to noise. Both methods will be used in this thesis.
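The difference between the one step ahead forecast and the full forecast can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in this thesis: `toy_model` is a hypothetical stand-in for the trained network $\hat{f}$ (a linear extrapolation with a small constant bias), and the window length `n` and the data are chosen only for the example.

```python
def one_step_ahead(model, history, future, n):
    """Forecast each future value from the last n observed values."""
    series = list(history)
    forecasts = []
    for y_true in future:
        forecasts.append(model(series[-n:]))
        series.append(y_true)  # the true out of sample value is fed back
    return forecasts


def full_forecast(model, history, steps, n):
    """Forecast recursively: each prediction becomes an input of the next one."""
    series = list(history)
    forecasts = []
    for _ in range(steps):
        y_pred = model(series[-n:])
        forecasts.append(y_pred)
        series.append(y_pred)  # the predicted value is fed back
    return forecasts


def toy_model(window):
    """Hypothetical stand-in for the network: linear extrapolation plus a bias."""
    return 2.0 * window[-1] - window[-2] + 0.01


history = [0.1 * t for t in range(10)]      # in sample data (illustrative)
future = [0.1 * t for t in range(10, 15)]   # out of sample data (illustrative)

osa = one_step_ahead(toy_model, history, future, n=2)
full = full_forecast(toy_model, history, len(future), n=2)

osa_errors = [abs(p - y) for p, y in zip(osa, future)]
full_errors = [abs(p - y) for p, y in zip(full, future)]
```

In this toy setting the one step ahead error stays at the size of the bias, while the error of the full forecast grows with every step because earlier errors re-enter the input window.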

4.2 Error Measures

In this section the error measures are described. Before we go into specific measures, we have to know what we are measuring. If we are interested in the direction of the stock price, the forecasting problem reduces to a binary classification problem. A corresponding measure is the hit rate as defined in Definition 4.1. Note that the hit rate can also be used as a cost function, which changes the task of the network from a regression problem into a classification problem. If we are interested in the magnitude of the error, measures such as the mean absolute error (MAE) and the mean squared error (MSE) are relevant. The error measures are used here to evaluate the network's performance and not to choose the right loss function for the optimization problem. Choosing the right loss function is already discussed in Section 3.2.2. However, where a measure is commonly used as a cost function, a remark is made.

DEFINITION 4.1. A forecast $\hat{Y}_{T+1}$ of a time series is called a hit if

$$(\hat{Y}_{T+1} - Y_T)(Y_{T+1} - Y_T) > 0.$$

The hit rate of $n$ forecasts is defined by:

$$\mathrm{hits}(n) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}_A\left((\hat{Y}_{T+i} - Y_{T+i-1})(Y_{T+i} - Y_{T+i-1})\right),$$

where $A = \{x \in \mathbb{R} : x > 0\}$ and $\mathbb{1}_A$ is the indicator function.¹

¹ $\mathbb{1}_A(x)$ is the indicator function defined by $\mathbb{1}_A(x) = 1$ if $x \in A$ and $\mathbb{1}_A(x) = 0$ if $x \notin A$.
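As a small worked example of Definition 4.1, the hit rate can be computed as follows. The function name `hit_rate` and the price values are illustrative assumptions.

```python
def hit_rate(forecasts, actuals, previous):
    """Fraction of forecasts whose predicted direction matches the realised one.

    forecasts[i] is the forecast of actuals[i]; previous[i] is the last
    observed value before actuals[i].
    """
    hits = sum(
        1
        for f, y, y_prev in zip(forecasts, actuals, previous)
        if (f - y_prev) * (y - y_prev) > 0
    )
    return hits / len(forecasts)


prices = [100.0, 101.0, 100.5, 102.0, 101.0]  # illustrative observed prices
forecasts = [100.5, 101.5, 101.0, 101.5]      # one step ahead forecasts of prices[1:]

rate = hit_rate(forecasts, prices[1:], prices[:-1])  # 3 of the 4 directions are correct
```

Note that a forecast equal to the previous value, or an unchanged price, never counts as a hit because the product is then zero.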


The hit rate indicates whether a forecasting method predicts the direction of market movements better than the currently used financial models and naive methods. In financial forecasting, the hit rate as defined in Definition 4.1 assumes that $\hat{Y}$ is a forecast of the stock price itself. When returns are forecast, a predicted return $\hat{r}$ is called a hit if $\hat{r} \cdot r > 0$. The naive methods for forecasting will be discussed in Section 4.2.2. In most financial models, such as the Black-Scholes-Merton model (BSM), the assumption is made that market movements consist of a deterministic and a stochastic process. Assuming that we can correctly obtain the deterministic part, any hit rate above 50 percent for the stochastic part would outperform the BSM model. It should be noted that such models are not made for short term forecasting. If the direction is predicted correctly, it remains to predict the magnitude of the movement.

The error gives a measure of the distance and direction between the prediction and the actual value at one time step. This measure is very limited, since it is based on one sample. To compare single forecasts made with the same forecasting method on different data sets, the relative error as defined in Definition 4.3 can be useful. If the method is expected to work correctly on both data sets, the same relative error is expected. When making multiple one day ahead forecasts, we expect the relative error to stay the same, under the assumption that the dynamics of the time series do not change within the forecasting time span. For stationary time series, the absolute error is expected to stay the same under the same conditions. In financial time series, however, the volatility may change within a forecasting time span. This can influence the errors, since the low-volatility behavior might be predicted very well while the high-volatility behavior is not, or the other way around. For this reason it can be interesting to look at the evolution of the error over a specific time span instead of taking averages as done in Definition 4.5, Definition 4.6² and Definition 4.7. In the case of the full forecasting method, it is in our interest to see how the error evolves over time. Moreover, this can give insight into the performance of conditional forecasting. If the growth in error decreases or even disappears, the conditioning can in this case eliminate the error term, assuming that the conditioning signal is error-free.

DEFINITION 4.2. The error and absolute error are defined by

$$e = \hat{Y} - Y,$$

$$|e| = |\hat{Y} - Y|.$$

DEFINITION 4.3. The Relative Error is defined by

$$e_{rel} = \frac{|\hat{Y} - Y|}{Y}.$$

The previously mentioned error measures do not give a single number with which two forecasting methods can be compared directly. Therefore mean errors are introduced. By the assumptions of our forecasting model, the mean of the error should go to zero for an increasing forecast range with one day ahead forecasting. Otherwise the use of a bias on the output could improve the forecasts.

DEFINITION 4.4. The Mean Error is defined by

$$ME = \frac{1}{n} \sum_{i=1}^{n} \left(\hat{Y}_i - Y_i\right).$$

² The MSE is also defined in a previous chapter, but is added for convenience.

Table 4.1: Difference in response to errors between MAE and RMSE

e      |e|     e²
0.1    0.1     0.01
2      2       4
9      9       81
       MAE     RMSE
       3.7     9.27

Since the mean of the errors does not give the scale of the actual errors, the Mean Absolute Error (MAE), Definition 4.5, and the Mean Squared Error (MSE), Definition 4.6, are commonly used error measures for comparing the performance of two forecasting methods. They differ in their response to outliers and in their scale. Assuming that the expected value of the error is zero, the MSE can be used as an unbiased estimator of the variance if the sum is scaled by $\frac{1}{n-1}$. The standard deviation is easier to interpret and therefore the RMSE, as defined in Definition 4.7, can be used, where the unbiased estimator of the standard deviation is defined in (4.3).

DEFINITION 4.5. The Mean Absolute Error is defined by

$$MAE = \frac{1}{n} \sum_{i=1}^{n} |\hat{Y}_i - Y_i| = \frac{1}{n} \sum_{i=1}^{n} |e_i|$$

DEFINITION 4.6. The Mean Square Error is defined by

$$MSE = \frac{1}{n} \sum_{i=1}^{n} (\hat{Y}_i - Y_i)^2 = \frac{1}{n} \sum_{i=1}^{n} e_i^2$$

DEFINITION 4.7. The Root Mean Square Error is defined by

$$RMSE = \sqrt{MSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (\hat{Y}_i - Y_i)^2} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} e_i^2}$$

To understand which metric, MAE or RMSE, is better suited to a particular problem, we need insight into how these two metrics differ from each other. There are a few cases that can be considered: the response to relatively small and large errors, the behavior with increasing sample size and the distribution of the error. First, the RMSE penalizes larger errors relatively more than the MAE does. An example is shown in Table 4.1. Second, the RMSE tends to grow with increasing sample size. This is easily shown by putting bounds on the RMSE using the MAE as shown below,

$$MAE \le RMSE \le \sqrt{n} \, MAE.$$

This means that comparing the performance of different models using the RMSE requires the same number of samples for each model. Moreover, using the RMSE on a small sample can be problematic because of its response to outliers: the number of outliers in a small sample might bias the RMSE. If we now consider the distribution of the error itself, it would make sense to apply the MAE to a uniform distribution, since it assigns equal weights to all the errors. For normal distributions, the RMSE would fit better.
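The two extreme cases of the bound $MAE \le RMSE \le \sqrt{n} \, MAE$ can be checked numerically. The sketch below uses two illustrative error vectors with the same MAE: one in which all errors have the same magnitude, and one in which a single error carries all the mass.

```python
import math


def mae(errors):
    """Mean absolute error of a list of errors."""
    return sum(abs(e) for e in errors) / len(errors)


def rmse(errors):
    """Root mean square error of a list of errors."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))


equal_errors = [2.0, -2.0, 2.0, -2.0]    # all errors have the same magnitude
single_outlier = [0.0, 0.0, 0.0, 8.0]    # same MAE, but one large error

for errors in (equal_errors, single_outlier):
    n = len(errors)
    assert mae(errors) <= rmse(errors) <= math.sqrt(n) * mae(errors)
```

For `equal_errors` the RMSE coincides with the MAE (lower bound), while for `single_outlier` it reaches $\sqrt{n}$ times the MAE (upper bound). This shows how strongly a single outlier can move the RMSE while leaving the MAE unchanged.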

Since the previously mentioned error measures are meant to compare different methods on one time series (and the following chapters introduce different kinds of time series), it is in our interest to find a metric that can compare the performance of forecasting methods across different time series. A scaling of the error is therefore required. The authors of [33] suggest that the Mean Absolute Scaled Error (MASE) is preferred as a measure, since it is less sensitive to outliers and more easily interpreted than the Root Mean Square Scaled Error (RMSSE).

DEFINITION 4.8. The Scaled Absolute Error is defined by

$$q_i = \frac{e_i}{\frac{1}{n-1} \sum_{j=2}^{n} |Y_j - Y_{j-1}|},$$

where the denominator is the in-sample MAE of the naive forecast method. The naive forecast is defined on the in sample data by $\hat{Y}_{T+1} = Y_T$.

DEFINITION 4.9. The Mean Absolute Scaled Error is defined by

$$\mathrm{mean}(|q_i|),$$

where $q_i$ is the scaled absolute error as defined in Definition 4.8.
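A minimal sketch of the MASE computation is given below. It assumes the scaling used in [33], in which the out of sample errors are divided by the in-sample MAE of the naive forecast; the function name `mase` and the data are illustrative.

```python
def mase(forecasts, actuals, in_sample):
    """Mean absolute scaled error, scaled by the in-sample naive forecast MAE."""
    n = len(in_sample)
    naive_mae = sum(
        abs(in_sample[j] - in_sample[j - 1]) for j in range(1, n)
    ) / (n - 1)
    scaled = [abs(f - y) / naive_mae for f, y in zip(forecasts, actuals)]
    return sum(scaled) / len(scaled)


in_sample = [1.0, 2.0, 4.0, 3.0, 5.0]  # illustrative training data
actuals = [6.0, 5.0]                   # illustrative out of sample data
forecasts = [5.5, 5.5]

score = mase(forecasts, actuals, in_sample)
```

A score below one indicates that the forecast errors are on average smaller than the in-sample errors of the naive forecast. Because the scale comes from the series itself, scores can be compared across time series with different units.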

4.2.1 Statistical Testing

After training a network, predictions can be made. Using the error measures defined in Section 4.2 and analyzing the costs during optimization of the network, we can get insight into the performance of this particular network. Since the weights in the network are found using stochastic optimization techniques, we cannot rely on the results of a single network. As we have seen in Section 3.3, regularization methods can be used to improve the predictive performance. Moreover, the ensemble methods show how we are able to decrease the variance of the error of the predictions. This section describes the statistical methods used to assess the statistical significance of the results. Most of these methods are basic, but they give clear insights. The statistical testing is split into two methods: the univariate analysis, which focuses on the distribution of the out of sample error, and the bivariate analysis, which describes the relation between the in sample error and the out of sample error. The aim of this section is to understand and give examples of how these basic techniques can be used for interpreting the results and improving the network.

The most important statistical analysis performed on the error measures is the univariate analysis. This analysis gives a basic description of the shape of the distribution by describing the central tendency and dispersion. The central tendency is described by the mean, median and mode of the error measure. The dispersion is described by the minimum, the maximum and the standard deviation of the error measure as defined in (4.3),

$$\hat{\sigma} = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (\epsilon_i - \hat{\epsilon})^2}, \tag{4.3}$$

where $\epsilon_i$ are the sampled errors and $\hat{\epsilon}$ is the sampled error mean.

In forecasting applications we are not able to quantify the future performance. However, if a relation between the in sample error and the out of sample error can be found in the historical data, we could use this bivariate analysis to estimate the performance in the future. In practice this boils down to improving the bagging method by first removing the networks that perform badly on the in sample test, given that a bad performance on the in sample test is highly likely to give a bad performance on the out of sample test too.

4.2.2 Benchmarks

This section describes a variety of benchmarks that are used for comparison with the performance of alternative forecasting methods. We distinguish three categories of forecasting methods. The first category consists of the naive methods. The first naive method is already defined in Definition 4.8, but is now used on out of sample forecasts. The second naive method is $\bar{Y}_{T+1} = \pm|Y_T|$, where the sign is determined by the most common sign in the training set. The second method is particularly useful for hit rates. The second category consists of forecasting methods often used in financial time series analysis. The third category consists of the forecasting methods based on neural networks. The networks of interest are the LSTM and other networks that can be found in our benchmark problems.
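The two naive methods can be sketched as follows. The function names are hypothetical, the return values are illustrative, and the tie-break towards the positive sign is an assumption that the text does not specify.

```python
def naive_last_value(series):
    """First naive method: the forecast equals the last observed value."""
    return series[-1]


def naive_common_sign(series):
    """Second naive method: magnitude of the last value, most common training sign."""
    positives = sum(1 for y in series if y > 0)
    negatives = sum(1 for y in series if y < 0)
    sign = 1.0 if positives >= negatives else -1.0  # tie-break is an assumption
    return sign * abs(series[-1])


returns = [0.01, -0.02, 0.015, 0.005, -0.01]  # illustrative training returns

first = naive_last_value(returns)    # repeats the last return
second = naive_common_sign(returns)  # flips it to the most common sign
```

In this example most training returns are positive, so the second method predicts a positive return even though the last observation was negative.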

4.3 Artificial Time Series

In this thesis we define an artificial time series as a time series that is sampled from a deterministic function with or without the addition of random noise. Artificial time series allow us to isolate particular behaviour of signals. By training a network on these time series, we wish to assess the performance in forecasting such behaviour. For financial forecasting it is in our interest to see if a network can learn trends, periodicity, correlation, and nonlinear and chaotic behaviour. Furthermore, the sensitivity to noise can be measured using artificial time series.

DEFINITION 4.10. The correlation between two random variables $X$ and $Y$ with $E[X] = \mu_X$, $E[Y] = \mu_Y$, $\mathrm{Var}[X] = \sigma_X^2$ and $\mathrm{Var}[Y] = \sigma_Y^2$ is given by (4.4).

$$\rho_{X,Y} = \frac{\mathrm{cov}(X, Y)}{\sigma_X \sigma_Y}, \tag{4.4}$$

where $\mathrm{cov}(X, Y)$³ is the covariance between $X$ and $Y$.

In time series analysis we are often interested in the autocorrelation of a time series and the cross correlation between two time series. The autocorrelation is the correlation between values of a time series at different times, calculated by $\rho_{Y_T, Y_{T-N}}$. The cross correlation is the correlation between two different time series at different times, calculated by $\rho_{Y_T, X_{T-N}}$.
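The lagged correlations can be estimated from samples as sketched below. The sample estimator replaces the expectations in Definition 4.10 by averages, which is an assumption of this example, and the sine wave with 64 samples per period is chosen only for illustration.

```python
import math


def correlation(xs, ys):
    """Sample estimate of the correlation between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    return cov / (std_x * std_y)


def cross_correlation(ys, xs, lag):
    """Correlation between y_t and x_{t - lag} for a non-negative lag."""
    if lag == 0:
        return correlation(ys, xs)
    return correlation(ys[lag:], xs[:-lag])


def autocorrelation(ys, lag):
    """Correlation between y_t and y_{t - lag}."""
    return cross_correlation(ys, ys, lag)


period = 64
wave = [math.sin(2 * math.pi * t / period) for t in range(4 * period)]

full_period = autocorrelation(wave, period)       # close to +1
half_period = autocorrelation(wave, period // 2)  # close to -1
```

A periodic signal is perfectly correlated with itself at a lag of one period and perfectly anti-correlated at a lag of half a period.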

4.3.1 The sine wave

Time series often show seasonal behaviour, which means that some fluctuation occurs periodically. In time series analysis, filtering the seasonality is sometimes used to improve the performance of a forecasting method. Seasonal ARIMA is a method which incorporates the use of differences on seasonal effects. This might result in a clearer relation between two samples. However, we wish to build a model which automatically extracts these kinds of features without the need for manual analysis. Therefore sine waves are used to generate time series with seasonality. Besides the seasonal property, sine waves are also nonlinear. The sine wave will be sampled at fixed time intervals. The samples will meet the Nyquist criterion, which requires the sampling rate to be at least twice as high as the frequency of the wave itself. This is required to be able to distinguish waves of different frequencies.

³ $\mathrm{cov}(X, Y) = E[(X - \mu_X)(Y - \mu_Y)]$

We expect that networks with a ‘visual field’ and number of inputs equal to a multiple of the number of samples in a full sampled sine wave perform better compared to networks that do not have this property. First, if the ‘visual’ field is too small, the network might be unable to determine whether it should go up or down. Intuitively, the zero crossings are required to successfully reproduce a sine wave. Secondly, this ensures a balanced training on the data, which means that every part of the wave is ‘seen’ during training equally often.

The addition of multiple sine waves at different frequencies allows us to test the Wavenet on more complex periodic signals. Adding white noise to the signal is also in our interest. By training the network on noisy sine waves, we can study the generalization behavior and the effects of regularization parameters. The amount of noise will be measured with the signal to noise ratio (SNR) as defined in (4.5),

$$\mathrm{SNR} = \frac{P_{signal}}{P_{noise}}, \tag{4.5}$$

where $P$ is the average power of the signal. If we take uniform random samples from a sine wave, we know the variance of our distribution. If both the signal and the noise have zero mean, we can use the variances to calculate the SNR, as shown in (4.6).

$$\mathrm{SNR} = \frac{\sigma_{signal}^2}{\sigma_{noise}^2} \tag{4.6}$$
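A noisy sine wave with a prescribed SNR can be generated as sketched below. The example assumes a unit-amplitude sine wave sampled over full periods (variance 1/2) and Gaussian white noise; the function names, the seed and the target SNR of 10 are illustrative choices.

```python
import math
import random


def noisy_sine(n_samples, samples_per_period, snr, seed=0):
    """Return a unit-amplitude sine wave and Gaussian noise scaled to a target SNR."""
    rng = random.Random(seed)
    signal = [
        math.sin(2 * math.pi * t / samples_per_period) for t in range(n_samples)
    ]
    signal_var = 0.5  # variance of a unit-amplitude sine over full periods
    noise_std = math.sqrt(signal_var / snr)
    noise = [rng.gauss(0.0, noise_std) for _ in range(n_samples)]
    return signal, noise


def variance(xs):
    """Population variance of a sequence."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / len(xs)


signal, noise = noisy_sine(n_samples=64 * 200, samples_per_period=64, snr=10.0)
noisy = [s + e for s, e in zip(signal, noise)]

empirical_snr = variance(signal) / variance(noise)  # close to the target SNR
```

The empirical SNR fluctuates around the target because the noise variance is itself estimated from a finite sample.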

In the sine wave tests we report the absolute error, MAE and RMSE. Forecasts are made for a fixed number of repetitions of the full sine wave. For the one ahead forecasts, the error should be periodic over the forecast length. The MAE and RMSE will be calculated on the full wave, to make the RMSE a useful measure. In the full forecast, we expect that the accumulation of errors might influence the forecasting performance.

4.3.2 The Lorenz System

It is in our interest to see whether the Wavenet is able to predict chaotic time series. Chaotic behaviour can potentially explain the movements in the financial markets that appear to be random. Using the Lorenz System we can produce this kind of time series. The Lorenz System outputs deterministic values that are nonlinear, non-periodic and three-dimensional. Since the dimensions are directly related, it is interesting to see how the Wavenet behaves on single dimensional forecasting and when conditioning on all the dimensions at the same time. If the time steps are sufficiently small and constant, the Wavenet should be able to learn the relations for u as described in Definition 4.11 using a filter width of two, since the conditioning is done using additions. The relations for v and w depend on products, which are not explicitly formed between the conditioning signals. To find the relation for v, we first need to add a bias term to the second conditioning signal w, multiply it with the first conditioning signal u and afterwards subtract the signal itself. Therefore it is expected that the Wavenet will be unable to make a full forecast on these dimensions. The data set is generated in Matlab using a fixed time step ODE solver. The use of a variable time step would make the prediction of the behaviour more difficult, since we expect the Wavenet to learn the underlying dynamics between coordinates and this would add the time coordinate.

DEFINITION 4.11. The Lorenz System is a system of ordinary differential equations defined by:

$$\begin{aligned} \dot{u} &= \sigma(v - u) \\ \dot{v} &= u(\rho - w) - v \\ \dot{w} &= uv - \beta w \end{aligned}$$
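A fixed time step integration of the Lorenz System can be sketched as follows. This is not the Matlab code used for the data set: the classical fourth-order Runge-Kutta scheme, the step size `dt` and the initial state are assumptions made for this example, while σ = 10, ρ = 28 and β = 8/3 are the parameter values used in the tests.

```python
def lorenz_rhs(state, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """Right-hand side of the Lorenz System for state (u, v, w)."""
    u, v, w = state
    return (sigma * (v - u), u * (rho - w) - v, u * v - beta * w)


def rk4_step(state, dt):
    """One classical Runge-Kutta step with a fixed time step dt."""

    def shifted(s, k, h):
        return tuple(si + h * ki for si, ki in zip(s, k))

    k1 = lorenz_rhs(state)
    k2 = lorenz_rhs(shifted(state, k1, dt / 2))
    k3 = lorenz_rhs(shifted(state, k2, dt / 2))
    k4 = lorenz_rhs(shifted(state, k3, dt))
    return tuple(
        s + dt / 6 * (a + 2 * b + 2 * c + d)
        for s, a, b, c, d in zip(state, k1, k2, k3, k4)
    )


def simulate_lorenz(n_steps, dt=0.01, initial=(1.0, 1.0, 1.0)):
    """Integrate the system; dt and the initial state are assumptions."""
    trajectory = [initial]
    for _ in range(n_steps):
        trajectory.append(rk4_step(trajectory[-1], dt))
    return trajectory


# 1000 training samples followed by 500 forecast samples
trajectory = simulate_lorenz(1500)
u_series = [state[0] for state in trajectory]
```

Each of the three coordinates can be extracted from `trajectory` in the same way, either as a forecasting target or as a conditioning signal.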

In the Lorenz system tests we use the ALSTM (Augmented LSTM) as proposed in [4] as a benchmark. This paper shows exceptionally good results compared to other literature such as [34] and [35]. In all the literature the RMSE is reported as the measure; therefore it is important to use the same sample size to make the results comparable without having bias terms. Since [4] does not clearly state which data points are used for training and forecasting, the description of [35] will be used as a starting point. A Lorenz system with the parameters σ = 10, ρ = 28 and β = 8/3 is used to generate the data. The training set will consist of 1000 samples and the forecast length is set to 500 samples. The data will be scaled to a range in which the difference between the minimum and maximum is equal to 1.

The multi-dimensionality of the Lorenz system allows us to implement conditioning. By using the information of the v coordinate to forecast the u coordinate, we expect improvements in the results. These results can obviously not be used as a comparison with the previously mentioned benchmarks, however it can show that extracting information from other sources may improve the forecasting performance. In addition, adding noise to this conditioning signal will be tested. This could give us insight on how conditioning can be used in forecasting financial time series.

4.3.3 Mackey Glass Equation

The Mackey Glass Equation also has dynamics that are in the scope of the interest of this thesis, since the dynamics are nonlinear, chaotic and most importantly, it has a time delay.

DEFINITION 4.12. The Mackey-Glass chaotic time series is a first-order differential-delay equation defined by:

$$\dot{x} = \beta \frac{x_\tau}{1 + x_\tau^n} - \gamma x,$$

where γ, β, n > 0 and $x_\tau$ represents the value of x at time (t − τ).

Given the time delay, we expect an autocorrelation in the time series. Since a division is made between the two variables $x_\tau$ and $1 + x_\tau^n$, poor performance is expected for the same reasons as for the conditioned Lorenz System: the AWN is unable to generate products between inputs.
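The delay in Definition 4.12 can be handled numerically by keeping a buffer of past values. The sketch below uses the forward Euler method, which is an assumption of this example and not the solver behind the data used in the tests; the step size, the constant initial history `x0` and the sampling of one value per time unit are likewise illustrative choices. The defaults γ = 0.1, β = 0.2, n = 10 and τ = 17 are the parameter values of the standard series.

```python
from collections import deque


def mackey_glass(n_samples, beta=0.2, gamma=0.1, n=10, tau=17, dt=0.1, x0=1.2):
    """Euler integration of the Mackey-Glass equation, one sample per time unit.

    The history is assumed constant and equal to x0 for t <= 0.
    """
    delay_steps = int(round(tau / dt))
    steps_per_sample = int(round(1.0 / dt))
    # history[-1] holds x(t) and history[0] holds x(t - tau)
    history = deque([x0] * (delay_steps + 1), maxlen=delay_steps + 1)
    samples = []
    for step in range(n_samples * steps_per_sample):
        if step % steps_per_sample == 0:
            samples.append(history[-1])
        x, x_tau = history[-1], history[0]
        dx = beta * x_tau / (1.0 + x_tau ** n) - gamma * x
        history.append(x + dt * dx)  # the oldest value drops out automatically
    return samples


series = mackey_glass(500)
```

The buffer makes explicit why the time delay induces autocorrelation: every update depends on the value that was stored τ time units earlier.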

In the Mackey Glass tests, the ALSTM as proposed in [4] will be used as the benchmark. This paper uses the standard Mackey Glass time series as available in MATLAB, which has the parameters γ = 0.1, β = 0.2, n = 10 and τ = 17. Training is done using more samples compared to the paper to speed up training time; one ahead forecasts are made on 500 samples. In [4] the one ahead forecast is made using an input size of 5 with stride 6. Therefore the training data covers a range of 600 s. The RMSE is used as the measure, as in much of the previous literature. In addition to the ALSTM benchmark, [36] provides results on 6 and 84 time steps ahead and noisy benchmarks using a particle swarm optimized neural network (PSO-ANN). This will be used as the second benchmark. It should be noted that, in contrast to [4], the authors of [36] used a data set with 1 s time intervals. Therefore a separate time series will be used for comparison.

4.4 Real world time series

This section describes the data sets used in this thesis. Since real world time series often contain corrupted data, data preprocessing such as cleaning, imputation and normalization is required. Section 4.4.1 gives an overview of all the preprocessing steps performed on the real world time series data. Most of the data sets used are the same as in [2]. We distinguish two categories of financial time series: the stock market data and the commodities data. For the stock market data, the S&P500, AEX and Nikkei are used. As conditioning data for the stock market index data, the CBOE 10-year interest rate and the volatility index (VIX) are used. The stock market data is retrieved from Yahoo Finance. For commodities, the UK Gas month future and monthly Brent oil are used, which are retrieved from Trayport and Reuters.

4.4.1 Data pre processing

The data sets retrieved from Yahoo Finance typically consist of a list of dates, open prices, daily highs, daily lows, close prices, volumes and adjusted close prices. The trading days in a year are not periodic due to leap years, holidays and occasionally other events; e.g. after the terrorist attack on 11 September 2001, the US stock market closed for three days. Training a network requires a target vector. When implementing a trading strategy, the target vector is different from that of the normal forecast, in which the value or return of a stock in the next time step is predicted.

First, a list of all weekdays is created. This amounts to the assumption that holidays and events that close the market do not exist. It creates an ordered set, which is useful for finding periodic behavior. The US financial markets typically have nine holidays. Therefore the new data sets will consist of 3 to 5 percent more days per year compared to the actual trading days. For the missing data on these days, the last observation carried forward (LOCF) method is used. In this method missing data is filled by moving chronologically through the data and replacing each missing data point by the last observation. The main reason to use LOCF is to conserve causality of the data. The Wavenet might be able to learn this periodicity of holidays, which can increase forecasting accuracy. However, by using LOCF it is impossible to generate profits. Different ways of imputation, such as averaging, are also considered. Using future values is not useful for financial forecasting, since causality must be conserved. Averaging over more past values changes the direction of the market in a fixed way, which would increase the predictability of such a market. Only LOCF will be used for imputation of data since it does not suffer from the previously mentioned flaws.
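The weekday grid with LOCF imputation can be sketched as follows. The function name `locf_on_weekdays`, the dates and the prices are illustrative assumptions; the sketch only shows the mechanism, not the preprocessing code used in this thesis.

```python
from datetime import date, timedelta


def locf_on_weekdays(observations, start, end):
    """Build a Monday-to-Friday grid and carry the last observation forward."""
    filled = {}
    last_value = None
    day = start
    while day <= end:
        if day.weekday() < 5:  # Monday (0) to Friday (4)
            last_value = observations.get(day, last_value)
            filled[day] = last_value
        day += timedelta(days=1)
    return filled


# Illustrative close prices; Tuesday 4 July 2017 is a market holiday
closes = {
    date(2017, 7, 3): 100.0,
    date(2017, 7, 5): 101.0,
    date(2017, 7, 6): 100.5,
    date(2017, 7, 7): 102.0,
}

filled = locf_on_weekdays(closes, date(2017, 7, 3), date(2017, 7, 7))
```

The holiday receives the close of the previous trading day, so the return on that day is zero and no future information enters the series.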

Secondly, the data is normalized to prevent slow training. The normalization is done by calculating the log returns of the prices. Afterwards, the target vectors are created. For simple one ahead forecasting the target for $x_t$ is the value $x_{t+1}$. However, for comparison with other literature such as [5], the same way of creating target vectors as in [5] is used. The log return of one month ahead is calculated by $\ln(x_{t+20}) - \ln(x_t)$, where 20 days are used to represent a month.
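The log returns and the one month ahead targets can be computed as sketched below. The function name `log_returns` and the synthetic price series are illustrative assumptions; the horizon of 20 days follows the text.

```python
import math


def log_returns(prices, horizon=1):
    """Log return over the given horizon: ln(x_{t+horizon}) - ln(x_t)."""
    return [
        math.log(prices[t + horizon]) - math.log(prices[t])
        for t in range(len(prices) - horizon)
    ]


# Illustrative prices growing by a constant log return of 0.001 per day
prices = [100.0 * math.exp(0.001 * t) for t in range(30)]

daily_returns = log_returns(prices)                 # normalized inputs
monthly_targets = log_returns(prices, horizon=20)   # one month ahead targets
```

Note that the last 20 prices have no one month ahead target, so the target vector is shorter than the price series.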

Chapter 5

Results

This chapter provides the results of the tests from Section 4.3. It is based on the measures from Section 4.2 and compares the results with the benchmarks described in Section 4.2.2. Since the AWN is implemented in Keras for this thesis, the sine test will be used to compare the Keras implementation with the Theano implementation. Within the experiments, improvements will be made to the Keras implementation. If the network parameters are not specified for a specific test, the standard parameters shown in Table 5.1 are used.

Table 5.1: The standard parameters used in the Wavenet

parameter             symbol   value
iterations            -        10,000
regularization        γ        0.1
filter width          -        2
No. filters           -        1
No. channels          -        1
Adam learning rate    α        0.001

5.1 Implementation comparison

The AWN implementation in Keras is easily made, since causal convolutions are already implemented in this library. The sine wave will be used as the first artificial time series for a comparison between the two implementations. As mentioned in Section 4.3.1, the visual field should be equal to a multiple of the number of samples in a fully sampled sine wave to avoid errors induced by the discretization. Moreover, by placing the zero crossings exactly on the samples, we expect the network to be able to learn the movement of a sine wave. In this test a sine wave is generated with a frequency of 16 Hz and a sampling frequency of 1024 Hz. This results in waves that fit exactly in 64 data points, with a zero crossing on the first and thirty-third element of a single wave. The networks are trained with an input size of 128, which corresponds to 2 full sine waves. Figure 5.1 shows the result of a full forecast of two trained networks per implementation on 1000 time steps ahead. In the Theano implementation the first seed¹ completely diverges, whereas the second seed shows ripples in the first waves and afterwards produces a sine wave which is out of phase and has a different amplitude. The first seed in the Keras implementation shows good performance on the first wave; in the following waves, however, the amplitude decreases and the phase shifts. The accumulation of errors is probably causing the amplitude decay. The second seed shows an almost constant signal. It is clear that both implementations are not useful for predicting sine waves. Table 5.2 shows the descriptive statistics using 10 runs.

Figure 5.1: The full forecast of the sine wave using different implementations. (a) Theano implementation, (b) Keras implementation. Each panel plots amplitude against time step for the signal and the out of sample forecasts of seed 1 and seed 2.

Table 5.2: MAE and RMSE based on 1000 samples of full forecast

            Theano          Keras           Keras I(1)
Number      MAE    RMSE     MAE    RMSE     MAE    RMSE
1           0.48   0.69     0.30   0.38     0.81   0.97
2           0.80   0.89     0.42   0.50     0.78   0.98
3           1.04   1.02     0.52   0.66     0.61   0.70
4           -      -        0.48   0.59     0.79   0.93
5           -      -        0.46   0.56     0.43   0.51
6           -      -        0.64   0.72     0.09   0.12
7           -      -        0.62   0.69     0.18   0.22
8           -      -        0.40   0.51     -      -
9           -      -        0.82   1.13     0.21   0.27
10          -      -        0.64   0.71     0.65   0.74
Mean        -      -        0.52   0.64     0.51   0.61
Std. Dev.   -      -        0.16   0.21     0.29   0.34

Table 5.3: - means that the forecast diverged, therefore the number is not useful

The results suggest that the Keras implementation produces more reasonable results. This difference might be caused by two factors. First, in the Theano implementation, the batch normalization parameters $\beta_{\text{norm}}$ and $\gamma_{\text{norm}}$ are regularized by the γ parameter, whereas the Keras implementation has no regularization on these parameters. Second, the Keras implementation uses zero padding, which may have a regularizing effect on the results. However, since this cannot be controlled as a parameter and might increase training time, we wish to remove it.

¹ A seed refers to a different random initialization of the weights in the network.

Improvement 1 - Removal of the zero padding

Two ways of removing the zero padding are discussed. The first is to write a custom layer in Keras which creates a valid padding. One way to do this is to use the current implementation, but to drop a number of inputs equal to the dilation number from the previous layer. However, it is cumbersome to subset layers in Keras. The second way is to force the outputs that are polluted by the zero padding to zero and to add the same zeros to the training target vector. The cost function and the mean absolute error should be adapted for this, since averaging over many zeros would artificially lower the reported error. Multiplying the cost functions by $c = \frac{N}{N - \text{VisualField}}$, where $N$ is the number of inputs and VisualField is the number of inputs affecting one output, gives the corrected MAE and MSE. Table 5.2 shows the results of the improvement. As expected, the lack of zeros causes less predictable results due to the decrease in regularization. Moreover, due to the zero inputs, the visual field of the network is decreased in this case by a factor 2. For a fair comparison, the same information should be fed to the network. This results in an MAE of 0.48 and an RMSE of 0.58 on average. From this point, we consider the Keras implementation to be similar to the AWN for the unconditioned case.
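The correction factor described above can be sketched as follows. This is a minimal illustration, not the thesis implementation: the function name `corrected_errors` is hypothetical, and it assumes that the polluted outputs are the first `visual_field` entries of the output vector.

```python
import numpy as np

def corrected_errors(y_true, y_pred, visual_field):
    """MAE and MSE with the first `visual_field` outputs forced to zero,
    rescaled by c = N / (N - visual_field)."""
    y_true = np.asarray(y_true, dtype=float).copy()
    y_pred = np.asarray(y_pred, dtype=float).copy()
    n = y_true.size
    # Assumption: the outputs polluted by zero padding are the earliest ones.
    y_true[:visual_field] = 0.0
    y_pred[:visual_field] = 0.0
    c = n / (n - visual_field)
    err = y_pred - y_true
    mae = c * np.mean(np.abs(err))
    mse = c * np.mean(err ** 2)
    return mae, mse

# Toy check: a constant offset of 0.1 on every valid output.
target = np.sin(2 * np.pi * np.arange(144) / 64)
mae, mse = corrected_errors(target, target + 0.1, visual_field=16)
```

Without the factor c, the 16 masked zeros would pull the averages below the true error of the 128 valid outputs; with it, the averages equal the mean over the valid outputs only.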

5.2 The Sine Wave

In the previous section we showed that the Keras implementation produces results similar to those of the Theano implementation. This section extensively uses the sine wave to improve the performance in predicting periodic functions.

Improvement 2 - Change of skip connection to the WN implementation

As mentioned in Section 3.5, the skip connections of the WN and the AWN are different. By splitting up the skip connection in the AWN in the same way as in the WN, the network can be forced to learn more nonlinear relations. This can be beneficial for following the curvature of the sine waves. The use of this skip connection gives reason to drop the normalization layer for testing the sine waves. Since all the values are already in the (−1, 1) range and each layer is able to scale the wave (the parametrized skip) and subtract a part (the residual), this modification is justified. The implementations with and without batch normalization are both tested. Table 5.4 shows the results of the second improvement on the network. We see that the error measures decreased significantly. Contrary to expectations, the network with batch normalization outperforms the network without. However, batch normalization also increases the complexity of the network. In the following improvements it is removed to see whether it is necessary to improve the results. In the case without batch normalization, the network makes a forecast which diverges within the first 1000 time steps in one of the runs. This might be caused by the ReLU function, since this function has no dampening effect on the outputs.

Improvement 3 - Gated activation functions with individual convolutions

The use of gated activation functions, as done in RNNs and the WN, might enable us to predict nonlinear functions even better. Moreover, the unrestricted output from the ReLU function can be damped, which might be a solution to the divergence in the full forecast. To understand the difference between Keras I(3) and the WN, it helps to look back at Figure 3.14. By removing the initial causal convolution on the input, adding two individual convolutions for each of the activation functions, and removing the whole tail from the first ReLU activation up to the output, except one 1 × 1 convolution, we find the Keras I(3). The results are shown in Table 5.4. The accuracy has increased by a factor 3; moreover, the standard deviation is smaller and in the 10 runs there is no divergence in the first 1000 time steps of the full out-of-sample forecasts.

Table 5.4: Results for I(2) with a variation, I(3) and I(4)

            I(2) no BN      I(2) BN         I(3)            I(4)
Number      MAE    RMSE     MAE    RMSE     MAE    RMSE     MAE    RMSE
1           0.74   1.11     0.45   0.72     0.06   0.08     0.02   0.03
2           0.88   1.04     0.03   0.04     0.04   0.05     0.02   0.02
3           0.51   0.64     0.72   0.89     0.08   0.10     0.01   0.02
4           0.42   0.52     0.39   0.47     0.19   0.23     0.01   0.01
5           0.40   0.48     0.56   0.70     0.01   0.01     0.07   0.09
6           0.75   1.04     0.36   0.43     0.06   0.08     0.00   0.00
7           -      -        0.51   0.60     0.13   0.16     0.02   0.03
8           0.29   0.37     0.41   0.49     0.10   0.13     0.02   0.03
9           0.18   0.23     0.05   0.06     0.07   0.09     0.01   0.01
10          0.06   0.08     0.63   0.71     0.04   0.05     0.06   0.07
Mean        0.47   0.61     0.41   0.51     0.08   0.10     0.03   0.03
Std. Dev.   0.28   0.38     0.23   0.28     0.05   0.06     0.02   0.03

Improvement 4 - Change of the skip connection to minimize dependence between layers

In the RNN the gated activation function is the Hadamard product between the sigmoid and tanh functions. The sigmoid function decides whether information is important and therefore will be remembered in the cell state, and the tanh function generates new candidates. In the WN this new candidate is passed through a 1 × 1 filter, which potentially modifies the shape of the signal. Since the weights corresponding to the 1 × 1 filter directly influence the results in the next layer, updating these weights has dependencies in the next layer. To see this, consider the output $z_i$ of layer $i$,

$$ z_i = C_i o_i + z_{i-1}, \tag{5.1} $$

$$ o_i = \sigma(A_i z_{i-1}) \odot \tanh(B_i z_{i-1}), \tag{5.2} $$

$$ y_i = C_i o_i, \tag{5.3} $$

$$ y = C_{\text{final}} \sum_i y_i, \tag{5.4} $$

where $C$ denotes the 1 × 1 convolution, $A$ and $B$ are the dilated convolutions, $o$ is the output of the gated activation, $y_i$ is the skip connection of layer $i$ and $y$ is the output of the neural network. The weight updates of $C_i$ are dependent on the outputs $y_i, y_{i+1}, \ldots, y_n$, where $n$ is the number of layers. The dependency on $y_i$ can be removed by creating a skip connection before the 1 × 1 convolution. Therefore, $y_i = o_i$ and $C_i$ is removed in (5.3). This is possible since we made a regression model without coding the signal. In each layer the outputs of the gated activation functions are now directly sent to the output. Intuitively, the 1 × 1 convolution in the layers now acts as a scaling (including a possible sign change) and therefore filters the extracted information $y_i$ from $z_{i-1}$. Therefore this implementation sends the more independent information from a dilation towards the output. Figure 5.2 shows an overview of this implementation. The results of this implementation are shown in Table 5.4. We see a slight improvement in accuracy compared to I(3). The most important improvement is the standard deviation, which is reduced in this implementation.
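Equations (5.1) to (5.4) and the modified skip connection can be sketched in plain NumPy for a single channel. This is a simplified illustration under stated assumptions, not the Keras code of the thesis: the function names are hypothetical, the filter width is 2, the dilation doubles per layer, each 1 × 1 convolution reduces to a scalar weight, and the residual is aligned by dropping the earliest `dilation` samples.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dilated_causal_conv(z, w, dilation):
    """Filter width 2, valid padding: out[k] = w[0]*z[k] + w[1]*z[k + dilation],
    so each output combines a sample with the one `dilation` steps earlier."""
    return w[0] * z[:-dilation] + w[1] * z[dilation:]

def gated_layer(z_prev, a, b, c, dilation, skip_before_1x1):
    # (5.2): Hadamard product of the sigmoid gate and the tanh candidate.
    o = sigmoid(dilated_causal_conv(z_prev, a, dilation)) * \
        np.tanh(dilated_causal_conv(z_prev, b, dilation))
    # (5.1): residual connection; c is the scalar 1x1 weight of this layer.
    z = c * o + z_prev[dilation:]
    # Skip connection: y_i = o_i when taken before the 1x1 convolution,
    # y_i = C_i o_i as in (5.3) otherwise.
    y = o if skip_before_1x1 else c * o
    return z, y

def forward(x, params, c_final, skip_before_1x1=True):
    z, skips = np.asarray(x, dtype=float), []
    for i, (a, b, c) in enumerate(params):
        z, y = gated_layer(z, a, b, c, 2 ** i, skip_before_1x1)
        skips.append(y)
    n_out = len(skips[-1])  # keep only time steps present in every layer
    return c_final * sum(y[-n_out:] for y in skips)  # (5.4)

rng = np.random.default_rng(0)
# Setting every C_i to zero isolates the effect of the skip placement.
params = [(rng.normal(size=2), rng.normal(size=2), 0.0) for _ in range(4)]
x = rng.normal(size=16)
out_before = forward(x, params, 1.0, skip_before_1x1=True)
out_after = forward(x, params, 1.0, skip_before_1x1=False)
```

With all $C_i$ set to zero, the skip taken after the 1 × 1 convolution gives a zero output, while the skip taken before it still passes the gated activations to the output. This illustrates that the output no longer depends on $C_i$ through $y_i$.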

Using all the improvements up to Keras I(4), we know that we can fully forecast a sine wave, given that the zero crossings are present in the data and the data is not noisy. If the phase of the sine wave is given a small perturbation, the zero crossings are removed from the data. By using a bias term, the network should still be able to successfully forecast the wave, given that the discretization is the same. This is tested by adding a phase of $\frac{1}{7}$ s to the original sine wave. In addition, the discretization can also be changed such that we lose all the previously mentioned properties that make a sine wave easy to reconstruct. This sine wave has a frequency of 17 Hz and a sampling frequency of 1009 Hz. Both signals are forecasted for 20000 time steps and the MAE and RMSE of the last 1000 time steps are reported. For the phase problem we find an MAE and RMSE of 0.03 ± 0.02 and 0.04 ± 0.03 respectively. For the poorly discretized problem we find an MAE and RMSE of 0.04 ± 0.03 and 0.05 ± 0.03 respectively, which is satisfactory.
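The three test signals described in this section can be generated as in the sketch below. The helper `sampled_sine` and the variable names are assumptions for illustration; only the frequencies, sampling rates and the phase shift are taken from the text.

```python
import numpy as np

def sampled_sine(freq_hz, fs_hz, n_samples, phase_s=0.0):
    """Sample sin(2*pi*f*(t + phase_s)) at the sampling frequency fs_hz."""
    t = np.arange(n_samples) / fs_hz
    return np.sin(2.0 * np.pi * freq_hz * (t + phase_s))

# 16 Hz at 1024 Hz: 64 samples per wave, zero crossings on samples 1 and 33.
aligned = sampled_sine(16, 1024, 128)
# Same discretization, shifted by 1/7 s: no sample falls on a zero crossing.
shifted = sampled_sine(16, 1024, 128, phase_s=1 / 7)
# 17 Hz at 1009 Hz: the wavelength is not an integer number of samples.
poorly_discretized = sampled_sine(17, 1009, 128)
```

An input size of 128 covers exactly two waves of the aligned signal, whereas the third signal never repeats on the sample grid within one wave.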

In order to be able to extract seasonality from a financial time series, we require the network to predict noisy sine waves. To assess the performance of the network, different values of the signal-to-noise ratio (SNR) are used. In this test a sine wave is created and white noise is generated with a specific variance. The network is trained on the signal with the noise added. Afterwards a forecast is made and this is compared with the noiseless signal. The results are shown in Table 5.5. As shown in Figure 5.3b, the full forecast for SNR 8 is dampening out, which shows that the previously mentioned dampening effect of the activation functions is working. Figure 5.3a shows that the network learned to continue a wave. The one-step-ahead error scales approximately inversely with the SNR: halving the SNR roughly doubles the error.

Table 5.5: The one-ahead and full forecast performance with different values for SNR ($\text{SNR} = \frac{\sqrt{2}}{\sigma^2}$).

SNR    MAE (one)      RMSE (one)     MAE (full)     RMSE (full)
8      0.16 ± 0.00    0.20 ± 0.00    0.46 ± 0.04    0.56 ± 0.04
4      0.32 ± 0.00    0.40 ± 0.00    0.47 ± 0.13    0.59 ± 0.16
2      0.65 ± 0.01    0.81 ± 0.02    1.12 ± 0.13    1.37 ± 0.15
1      1.23 ± 0.01    1.55 ± 0.01    1.28 ± 0.00    1.62 ± 0.00
0.5    2.52 ± 0.08    3.19 ± 0.10    2.36 ± 0.01    2.99 ± 0.01

Improvement Summary

In AWN I(1) we have removed the zero padding from the dilated convolutions by manually setting the polluted outputs to zero and training on a target vector with zeros on the corresponding outputs. This is followed by adjusting the skip connections and removing the batch normalization in AWN I(2). Afterwards, the ReLU function is changed in each layer to a gated activation function in AWN I(3). Finally, in AWN I(4) the skip connection is adjusted to obtain less dependency in updating the weights of the 1 × 1 skip connection in each layer.

From the results we can conclude that we have successfully developed a network which is capable of reconstructing and forecasting sine waves. This suggests that the network is useful for finding seasonality in time series. However, the network is sensitive to noise, since the error measures scale inversely proportionally with the SNR. Moreover, in a full forecast of the noisy sine wave it is unable to dampen out only the noise. In financial forecasting it might be useful to add an extra component to the network which filters the standard seasonality or the noise.

Figure 5.2: Overview of the AWN I(4)

5.3 The Mackey Glass Time Series

For the Mackey Glass time series, Figure 5.4 shows that the network in its current setup is unable to make a full forecast and does not learn the behavior of the Mackey Glass time series when it is trained on the one step ahead forecast. Even when zooming in on the first 50 time steps, the results are unsatisfactory. This suggests that the network is unable to learn the long term dynamics.

Figure 5.3: The full forecast of a noisy sine wave using AWN I(4). (a) SNR 2; (b) SNR 8. Each panel shows the signal and the out-of-sample forecast (amplitude against time step).

Figure 5.4: The full forecast of the Mackey Glass time series using 8 layers on AWN I(4). (a) Global behaviour of the full forecast; (b) Local behaviour of the full forecast. Each panel shows the signal and the out-of-sample forecast (amplitude against time step).

As stated in Section 4.1, it is also possible to train on $x(t + N)$. This only requires slight changes in the training and test data. The training set input contains the data points $x(t - n), \ldots, x(t)$ and the training set output contains $x(t - n + N), \ldots, x(t + N)$. For the test set with forecast length $L$, the data $x(t - n + N + 1), \ldots, x(t + N + L)$ is used. In order to get unbiased results, we test the network on a random subset of the data set by choosing the parameter $t > n$ randomly for each trained network. Since the time delay τ is set to 17, we are interested in training two networks. The first network has a visual field of 16 per output node, which exactly fits the time delay. The second network has a visual field of 512 per output node. We add 1700 extra inputs on both networks to avoid training on batches. The RMSE is reported as a measure, since this is used in other papers. The forecast length is set to 500, which is the same as in [4], and therefore the results can be compared. The results in Table 5.6 show that a network trained on the correct time steps ahead is able to predict the Mackey-Glass time series more accurately than the results presented in [4]. Moreover, predicting multiple steps ahead shows promising results. The standard deviation of the RMSE seems a constant fraction of the RMSE bounded by 25 percent. Therefore ensemble methods could improve the results.
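The index ranges for training directly on $x(t + N)$ can be made concrete with a small sketch. The function name `n_step_sets` and the toy series are assumptions for illustration; the slices follow the ranges stated in the text.

```python
import numpy as np

def n_step_sets(x, t, n, N, L):
    """Slice a series into the training input, training output and test data
    for an N-steps-ahead forecast with forecast length L."""
    x = np.asarray(x)
    train_in = x[t - n : t + 1]                # x(t-n), ..., x(t)
    train_out = x[t - n + N : t + N + 1]       # x(t-n+N), ..., x(t+N)
    test = x[t - n + N + 1 : t + N + L + 1]    # x(t-n+N+1), ..., x(t+N+L)
    return train_in, train_out, test

# Toy series whose values equal their time index (illustrative only).
x = np.arange(100)
train_in, train_out, test = n_step_sets(x, t=30, n=10, N=3, L=5)
```

Every training output lies exactly N samples after its input, and the test data contains n + L points.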

Table 5.6: The results for the Mackey glass t steps ahead forecast using 4 layers. Values are RMSE $\times 10^{-3}$. The ± value is the standard deviation of the 10 runs.

t     1              benchmark
1     0.72 ± 0.20    4.78 [4]
2     1.23 ± 0.39
3     1.22 ± 0.44
4     1.47 ± 0.45
5     1.67 ± 0.67
6     1.95 ± 0.76
7     2.43 ± 0.89
8     2.71 ± 1.33
9     2.74 ± 1.13
10    3.65 ± 1.96
12    3.61 ± 2.08
18    5.85 ± 2.39

The authors of [36] used a different data set. In case of the 6 and the 84 step ahead forecast, a time step of 1 s instead of 0.1 s should be used to make a fair comparison. Using the standard MATLAB data set and adjusted network parameters, α = 0.0005 and 16 layers with 2-2-2-2-4-4-4-4-8-8-8-8-16-16-16-16 convolutions, the network produces for $x(t + 6)$ an RMSE of $3.08 \times 10^{-3}$ and for $x(t + 84)$ an RMSE of $91.36 \times 10^{-3}$. On the short term, the network is able to outperform [36]. It should be mentioned that the methods used in these papers have a smaller 'visual' field compared to the AWN I(4) and the alternative configuration.

In the same way as done for the sine waves, noise tests will be performed on the Mackey-Glass time series. The authors of [36] used a method to adjust the cost function as shown in (5.5),

$$ \text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \frac{(\hat{y}_i - y_i)^2}{\sigma_{N,i}}}, \tag{5.5} $$

where $\sigma_{N,i}$ is the noise level at data point $i$. Using this, the network implicitly gets information about the noise level at each point. In real world situations we do not have this information, except that we can estimate the average noise level. Therefore the noise 'correction' term is not used in our approach. The errors will be calculated on the original data (without noise).
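For reference, the noise-weighted measure (5.5) can be written out as in the sketch below. The function name `noise_weighted_rmse` is hypothetical, and the sketch only mirrors the formula as stated; this measure is not the one used in this thesis.

```python
import numpy as np

def noise_weighted_rmse(y_true, y_pred, sigma_n):
    """Equation (5.5): squared errors divided by the per-point noise level
    sigma_n before averaging and taking the square root."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    sigma_n = np.asarray(sigma_n, dtype=float)
    return np.sqrt(np.mean((y_pred - y_true) ** 2 / sigma_n))

# Toy check (illustrative values): a constant error of 0.2 on every point.
y = np.linspace(0.0, 1.0, 50)
plain = noise_weighted_rmse(y, y + 0.2, np.ones(50))
weighted = noise_weighted_rmse(y, y + 0.2, np.full(50, 4.0))
```

With a noise level of one at every point the measure reduces to the ordinary RMSE, which is the measure reported in this section.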

Table 5.7: Results for noisy Mackey Glass time series on two configurations. Configuration 1 uses 4 layers and configuration 2 uses 8 layers. Values are RMSE $\times 10^{-2}$

               Configuration
Noise level    1              2
0.01           1.22 ± 0.11    3.92 ± 0.52
0.05           3.45 ± 0.36    2.22 ± 0.45
0.1            4.89 ± 0.51    3.51 ± 0.61

The results in Table 5.7 show that the standard deviation on the second configuration is significantly larger. This can be explained by the number of inputs affecting the output, which is larger for the second configuration, resulting in more noise at the output. Figure 5.5 shows the result of a full forecast trained on noisy data. It is interesting to see that it improved the performance for the first 100 time steps. This suggests that adding noise to the input data tends to regularize the network. Further research should be done to investigate these effects.

Figure 5.5: The full forecast of the Mackey Glass time series using 8 layers on AWN I(4) trained on one-ahead noisy data (σ = 0.1). (a) Global behaviour of the full forecast; (b) Local behaviour of the full forecast. Each panel shows the signal and the out-of-sample forecast (amplitude against time step).

We can conclude that, on the short term, the network is able to outperform the Mackey-Glass forecasts reported in the literature. This comes at the cost of losing the flexibility of making a t + n steps forecast using recursive one-step ahead forecasts. Moreover, it requires significant changes to the structure of the network to outperform on the data set which is sampled with 1 s intervals. Further optimization of the network parameters can improve the results. Moreover, one could argue that training on data with a higher sample rate is possible, but then the forecasting time span should be increased in such a way that the forecasts refer to the same data points. In this case an $x(t + 60)$ forecast with a sample interval of 0.1 s would be comparable with the $x(t + 6)$ forecast with a 1 s sample interval.

Since the memory costs of the current network implementation are relatively low, ensembles of networks can be made which calculate the t + 1 up to t + n forecasts in parallel. This can be efficiently implemented in Keras by using multiple channels, where each channel corresponds to a specific time step. By keeping the channels independent, causality can be retained. We could exploit the causality by applying another causal convolution on the output of the channels. As an alternative, further research can be done on using the noise as regularization to improve the full out of sample forecast.
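The idea of one channel per forecast horizon can be sketched on the data side as follows. This only illustrates how the targets could be arranged; the function name `multi_horizon_targets` is hypothetical and no network code from the thesis is implied.

```python
import numpy as np

def multi_horizon_targets(x, horizons):
    """Build inputs and a target array with one channel per forecast horizon,
    so that targets[k, j] = x[k + horizons[j]]."""
    x = np.asarray(x, dtype=float)
    n_rows = len(x) - max(horizons)
    inputs = x[:n_rows]
    targets = np.stack([x[h : h + n_rows] for h in horizons], axis=1)
    return inputs, targets

# Toy series whose values equal their time index (illustrative only).
inputs, targets = multi_horizon_targets(np.arange(20), horizons=[1, 2, 3])
```

Each column of the target array is the input series shifted by one horizon, so every channel can be trained on its own t + h forecast.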

5.4 The Lorenz System

The Lorenz data is scaled to the range (−0.5, 0.5), similar to the scaling used by the authors of [4], which is used as a benchmark. The variance of the data set is therefore the same, but the mean is set to zero. This is a better choice for the Wavenet. Lengths of 1000 and 500 are used for training and forecast respectively. Figure 5.6 shows the first attempt at forecasting the Lorenz system using I(4). The full forecast diverges, therefore we can conclude that the Lorenz system cannot be predicted using a full forecast with the current configuration. Since we have already shown promising performance in directly forecasting n steps ahead in Section 5.3, training on the $x(t + i)$ data is a solution for making full forecasts. To obtain more accurate results, the Adam learning rate is set to 0.0005, as done by [4]. Since the learning rate is reduced by a factor two, we increase the number of iterations by a factor two, which results in 20000 iterations. 1500 consecutive samples are drawn randomly from the data set to generate the training and test sets.

Figure 5.6: The one ahead forecast and the full forecast of the Lorenz system using 4 layers on AWN I(4). (a) The one ahead forecast; (b) The full forecast. Each panel shows the signal and the out-of-sample forecast (amplitude against time step).

Table 5.8: Results of the modified network for different γ, RMSE of the benchmark is $4.78 \times 10^{-3}$

γ        RMSE ($\times 10^{3}$)    MASE
0.1      3.89 ± 1.07               0.30 ± 0.08
0.01     1.37 ± 0.41               0.10 ± 0.02
0.001    0.96 ± 0.45               0.08 ± 0.03

As shown in Table 5.8, the results outperform the ALSTM. The training showed fast convergence behavior until it reached a plateau, as shown in Figure 5.7. The plateau is caused by the high regularization parameter γ. Therefore we gradually decrease the value of the regularization parameter. The resulting convergence behavior is also shown in Figure 5.7 and the forecast results are shown in Table 5.8. As a result, the RMSE has decreased by a factor 2.5 for γ = 0.001. In the previous sections we have seen that increasing the number of layers can increase the accuracy. For a fair comparison, we are restricted to the use of 1000 training samples. Therefore the maximum number of layers is 9. The convergence behavior of a network with 8 layers shows that further reduction of γ is required to decrease the training error. For 8 layers with $\gamma = 10^{-10}$ we obtain an RMSE of $(1.78 \pm 0.95) \times 10^{-3}$, which is higher than for a network with 4 layers. This suggests there is an optimum number of layers for this specific configuration and that adding more layers results in noise at the outputs. Further decreasing the learning rate could solve this problem.
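The limit of 9 layers for 1000 training samples can be reproduced with a short calculation. The sketch below assumes a filter width of 2 and a dilation that doubles in every layer (1, 2, 4, ...), which matches the visual fields of 16 and 512 quoted in Section 5.3; the function names are hypothetical.

```python
def visual_field(n_layers, filter_width=2):
    """Number of inputs affecting one output for stacked dilated convolutions
    with dilations 1, 2, 4, ..., 2**(n_layers - 1)."""
    dilations = [2 ** i for i in range(n_layers)]
    return 1 + sum(d * (filter_width - 1) for d in dilations)

def max_layers(n_samples, filter_width=2):
    """Largest number of layers whose visual field still fits in n_samples."""
    n = 0
    while visual_field(n + 1, filter_width) <= n_samples:
        n += 1
    return n

vf_4 = visual_field(4)
vf_9 = visual_field(9)
limit = max_layers(1000)
```

Under these assumptions a tenth layer would need a visual field of 1024 inputs, which exceeds the 1000 available training samples.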

The Lorenz system is a three-dimensional system. The dimension v in the system can be used for a better approximation of u, since the u coordinate is linearly dependent on v. Since the full forecasts did not show satisfactory results for the unconditioned network, the conditioned forecasts are particularly interesting for the full forecast. By a full conditioned forecast we assume that the information of v is available at lag 1. In the current network AWN I(4), conditioning is not yet implemented. We will use the implementation of [1] restricted to global conditioning. For simplicity, we assume that we have two signals sampled at the same rate, to avoid up or down sampling. The important difference here is the skip connection, which is passed before the 1 × 1 filter to the output. The network will be called AWN I(4C).

Figure 5.7: Convergence behavior of the training of networks with different γ parameter, for 4 and 8 layers. The curves show the training RMSE against the number of iterations for $\gamma = 10^{-1}$ with 4 layers, $\gamma = 10^{-3}$ with 4 layers, $\gamma = 10^{-3}$ with 8 layers and $\gamma = 10^{-10}$ with 8 layers.

Figure 5.8: The full conditioned forecast of the Lorenz system using 4 layers on AWN I(4C). (a) Full forecast conditioned on y, first 500 samples; (b) Full forecast conditioned on y, sample 4500-5000. Each panel shows the signal and the out-of-sample forecast (amplitude against time step).

Figure 5.8 shows that conditioning on v helps the network to keep the forecast on track for a forecast length of 5000. Moreover, Table 5.9 shows that the one ahead accuracy is significantly increased. In financial time series, lags between two time series might be present; however, this signal is typically noisy. Therefore we investigate how noise on the conditioning signal affects the forecast performance for AWN I(4C). Table 5.9 clearly shows that the network is very sensitive to noise. The network is unable to perform a full forecast for noise levels starting at 0.01. For the one ahead forecast it is remarkable that the standard deviation with noise level 0.001 is lower than that of the network trained without noise.

Table 5.9: Comparison of the noisy conditioned Lorenz system. RMSE $\times 10^{3}$

Noise level    RMSE (one)     RMSE (full)
0              0.38 ± 0.26    1.23 ± 0.92
0.001          0.50 ± 0.25    1.65 ± 0.85
0.01           1.06 ± 0.34    19.5 ± 34.5
0.05           0.85 ± 0.76    298 ± 65
0.1            0.98 ± 0.61    334 ± 51

We can conclude that the x coordinate of the Lorenz System can be successfully predicted by AWN I(4C). The L2 regularization parameter γ scales with the number of layers and has a large impact on the out of sample error. For further decrease of the error, a decrease in the Adam learning rate can be investigated. A full forecast can be made up to 5000 time steps which corresponds to 50 seconds. The conditioned network is sensitive to noise in the conditioning signal. This suggests that this way of conditioning might not be useful in financial applications, where noise is present. Further research is required to understand why the network is training a lag. Both problems might have a great impact in financial applications.

5.5 Results on financial time series

This section covers the results of forecasting on the financial time series using AWN I(4). One month ahead forecasts ($x(t + 20)$) are used and compared with [5] and the naive methods as described in Section 4.

In a first attempt to forecast on the S&P 500 as done in [5], the AWN I(4) is used in its standard configuration. The network is trained on the first three years. Thereafter the data is shifted by 21 days to avoid implicit information being present in the first 20 forecasts. Then a one month ahead forecast is made for the following 260 days. Figure 5.9a shows that the network is mostly suppressing the movements in the market and follows the trend. This is probably caused by the regularization parameter γ = 0.1, the relatively small input of the network with respect to the data, and the bias term which is not used. After completely removing the regularization parameter, increasing the number of layers to 9 and adding a bias term to the last 1 × 1 convolution, Figure 5.9b shows that the adjustments allow the network to make qualitatively good predictions. From the results in Table 5.10 we can see that the network performs similarly to the Naive 2 method with respect to the hit rates. The MASE, however, is better in all scenarios compared to all other methods. Conditioning the data on the CBOE as done in [5] shows no improvement of the predictability of the market in the first two time periods. In the last period, the hit rate is higher; however, the standard deviation is also increased, suggesting that ensemble methods can be used to guarantee this performance. The MASE is increased compared to the unconditioned network.

Table 5.10: Results from AWN I(4) and AWN I(4C) on the S&P500 data, conditioned with the CBOE data.

              AWN I(4)                      AWN I(4C)
Time range    MASE           HITS           MASE           HITS
2006-2008     0.59 ± 0.02    0.54 ± 0.02    0.59 ± 0.02    0.54 ± 0.01
2009-2011     0.65 ± 0.03    0.65 ± 0.02    0.68 ± 0.02    0.62 ± 0.03
2012-2014     0.65 ± 0.04    0.57 ± 0.04    0.66 ± 0.04    0.59 ± 0.02

              Naive1          Naive2          Benchmark[5]
Time range    MASE   HITS     MASE   HITS     MASE   HITS
2006-2008
2009-2011     1      0.32     0.76   0.54     0.78   0.78
2012-2014     1      0.49     0.84   0.65     0.90   0.58

Figure 5.9: Comparison of the one step ahead forecast (using months) of the AWN I(4) without and with tuning of the parameters. (a) AWN I(4), without tuned parameters; (b) AWN I(4), with tuned parameters. Each panel shows the signal and the out-of-sample forecast (log return against day).

5.6 Summary

In Section 5.1 we reconstructed the AWN in Keras, which showed similar results for the two implementations. Using this network as a starting point, improvements are made using the full forecast on the sine wave as benchmark, see Section 5.2. After a few improvements which made the AWN I(4) more similar to the original WN, the best results were obtained. The most important difference in the AWN I(4) is the skip connection, which is taken before the 1 × 1 convolution to avoid interference in the weight updates during training. Section 5.3 showed that the AWN I(4) is able to outperform novel methods, such as the ALSTM. However, more data points were used in training and it should be investigated whether the network can perform better using exactly the same data points. Moreover, on a data set with a sample interval of 1 s, the network does not outperform the benchmark found in literature. On noisy Mackey Glass time series the network also shows satisfactory performance. The AWN I(4) is also able to outperform the benchmark on the Lorenz system. A full forecast showed unsatisfactory results in case of the unconditioned network. The conditioned network AWN I(4C) is introduced and showed improved one ahead forecasting performance by using information from both the u and the v coordinate. Moreover, in the full forecast the network could keep track of the global movements of the u direction. In such a full forecast, we assumed that the data for $v(t)$ is known for forecasting $u(t + 1)$. Adding noise to this v signal showed that the network is highly sensitive to perturbations. Therefore we do not expect an increase of performance in conditioned financial forecasting without making modifications to the network. In future research it is strongly recommended to use exactly the same data sets as the authors of the benchmarks, since small changes in the data due to, for example, normalization or taking a different sample rate can significantly change the results, which makes the comparison unfair. For the financial forecasts the AWN I(4) and I(4C) are unable to outperform the benchmark in terms of hit rate. However, in terms of MASE the network is outperforming the benchmark.

In general we can conclude that AWN I(4) is able to make fixed n step ahead forecasts for smooth, periodic, deterministic signals. It is recommended to use Table 5.11 as standard parameters for the AWN I(4) and AWN I(4C). This parameter set guarantees that the network converges to a set of weights which guarantees a one step ahead accuracy measured in the MAE and MSE on the formerly mentioned tests. If more accuracy is required, it is recommended to first adjust the regularization parameter γ and study the convergence behavior. Afterwards, the learning rate α can be adjusted and, as a rule of thumb, the number of iterations can be scaled inversely proportionally. Adding $2^{N_L}$ to the number of inputs $\hat{N}_{in}$ is recommended, otherwise the modification for the zero padding would set all values to zero.

Table 5.11: The standard parameters used in the AWN I(4) and AWN I(4C).

parameter            symbol      value
iterations           -           10000
regularization       γ           0.1
filter width         -           2
No. filters          -           1
No. channels         -           1
No. layers           $N_L$       4
No. inputs           $N_{in}$    $\hat{N}_{in} + 2^{N_L}$
Adam learning rate   α           0.001
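The two rules of thumb for the standard parameters can be expressed as a short sketch. The function names and default arguments are assumptions for illustration; the defaults follow the table of standard parameters above.

```python
def padded_inputs(n_hat_in, n_layers=4):
    """Number of inputs N_in after adding 2**N_L extra inputs, so that masking
    the outputs polluted by zero padding does not remove every value."""
    return n_hat_in + 2 ** n_layers

def scaled_iterations(alpha, base_alpha=0.001, base_iterations=10000):
    """Scale the number of iterations inversely proportionally to the
    Adam learning rate alpha."""
    return int(round(base_iterations * base_alpha / alpha))

n_in = padded_inputs(128)            # two full sine waves plus 2**4 inputs
iters = scaled_iterations(0.0005)    # halved learning rate
```

Halving the learning rate to 0.0005 doubles the number of iterations to 20000, as was done for the Lorenz system in Section 5.4.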

Chapter 6

Conclusion

In the previous chapter we extensively tested an augmented Wavenet in order to understand to what extent it can be used in predicting deterministic dynamics and financial time series. Promising results are found for the prediction of deterministic dynamics. In particular, the use of conditioning enables the Wavenet to improve the prediction accuracy significantly. In the financial forecasts, however, the network showed less promising results. This chapter discusses all the findings in this thesis and gives recommendations for future research.

6.1 Summary and conclusion

As shown in Section 5.2, the AWN I(4) is able to fully forecast a sine wave. Therefore it is able to learn periodic behavior of signals. The addition of noise to the sine wave showed that the network is unable to make full forecasts when noise is present. It can be concluded that the unconditioned network is sensitive to noise.
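The difference between a one step ahead forecast and a full forecast can be sketched with a stand-in predictor. The code below is not the AWN: it assumes a hypothetical `one_step` function that uses the exact recurrence of a noise-free sine, x(t + 1) = 2 cos(ω) x(t) − x(t − 1), and feeds every prediction back into the input window, which is what a full forecast does. The window length of 16 and ω = 0.1 are arbitrary choices for the example.

```python
import math


def make_sine_predictor(omega):
    """Stand-in one-step predictor (hypothetical; replaces the trained network)."""
    def one_step(window):
        # exact recurrence of a sampled sine with angular step omega
        return 2.0 * math.cos(omega) * window[-1] - window[-2]
    return one_step


def full_forecast(seed_window, steps, predictor):
    """Iterate a one-step predictor, feeding each forecast back as input."""
    window = list(seed_window)
    forecasts = []
    for _ in range(steps):
        next_value = predictor(window)
        forecasts.append(next_value)
        # slide the window: drop the oldest value, append the forecast
        window = window[1:] + [next_value]
    return forecasts


if __name__ == "__main__":
    omega = 0.1
    seed = [math.sin(omega * t) for t in range(16)]
    forecast = full_forecast(seed, 200, make_sine_predictor(omega))
    truth = [math.sin(omega * t) for t in range(16, 216)]
    print(max(abs(f - y) for f, y in zip(forecast, truth)))
```

In a full forecast no new observations enter the window after the seed, so a perturbation in the seed or in an early prediction is never corrected by data.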

Section 5.3 showed that the network is unable to predict the dynamics of the Mackey Glass time delay differential equation using a full forecast. Using n step ahead forecasts, however, it is able to outperform most of the benchmarks mentioned in Section 4.3.3. Since the network used a larger receptive field than the benchmarks, it might be questioned whether the comparison is fair. However, since all the information up to time t is available, the choice to use more information is justified. Moreover, the main reason to choose less information is to limit the computational power required to make predictions. A full forecast on a noisy Mackey Glass time series showed increased performance compared to the noise-free result, suggesting that adding noise can be used for regularization. This contradicts the findings in Section 5.2, where the network performed worse in the full forecast on a noisy signal.

The results in Section 5.4 showed that the network can be used for one step ahead predictions on the Lorenz system. Tuning the regularization parameter increased the accuracy of the network. Adding more layers decreased the accuracy due to noise in the outputs. Conditioning the network on the v coordinate significantly improved the one step ahead forecasting performance for the u coordinate. Moreover, we successfully made a full forecast over 5000 time steps without losing track of the u coordinate.
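To make the conditioned one step ahead setup concrete, the sketch below generates a Lorenz trajectory and builds training pairs in which windows of u and v are the inputs and u(t + 1) is the target. The equations are written in the common form with σ = 10, ρ = 28 and β = 8/3 and are integrated with a fourth-order Runge-Kutta step. These parameter values, the step size, the window length and all helper names are assumptions for illustration and need not match the settings of Section 5.4.

```python
def lorenz_rhs(state, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """Right-hand side of the Lorenz system in the coordinates (u, v, w)."""
    u, v, w = state
    return (sigma * (v - u), u * (rho - w) - v, u * v - beta * w)


def rk4_step(state, dt):
    """One classical fourth-order Runge-Kutta step."""
    def shifted(base, slope, scale):
        return tuple(b + scale * s for b, s in zip(base, slope))

    k1 = lorenz_rhs(state)
    k2 = lorenz_rhs(shifted(state, k1, dt / 2.0))
    k3 = lorenz_rhs(shifted(state, k2, dt / 2.0))
    k4 = lorenz_rhs(shifted(state, k3, dt))
    return tuple(
        s + dt / 6.0 * (a + 2.0 * b + 2.0 * c + d)
        for s, a, b, c, d in zip(state, k1, k2, k3, k4)
    )


def simulate(initial, steps, dt=0.01):
    """Return the trajectory including the initial state."""
    trajectory = [tuple(initial)]
    for _ in range(steps):
        trajectory.append(rk4_step(trajectory[-1], dt))
    return trajectory


def conditioned_pairs(trajectory, window):
    """Inputs: the last `window` values of u and of v; target: the next u."""
    u = [s[0] for s in trajectory]
    v = [s[1] for s in trajectory]
    pairs = []
    for t in range(window, len(trajectory)):
        pairs.append(((u[t - window:t], v[t - window:t]), u[t]))
    return pairs


if __name__ == "__main__":
    trajectory = simulate((1.0, 1.0, 1.0), 1000)
    pairs = conditioned_pairs(trajectory, 16)
    print(len(trajectory), len(pairs))
```

In a conditioned full forecast only the u window would be filled with predictions, while the v window keeps receiving observed values.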

The results in Section 5.5 showed that, after slight tuning, the network can qualitatively follow the movements of the market when forecasting financial time series. Moreover, in terms of MASE the network outperforms the benchmark. The hit rate, however, is lower than that of the benchmark method.
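The two error measures used in this comparison can be sketched as follows. The code assumes the usual definition of MASE (mean absolute error of the forecast divided by the in-sample mean absolute error of the naive one step forecast) and takes the hit rate to be the fraction of forecasts that predict the correct direction of change. The exact definitions used in the experiments may differ in detail, and the function and argument names are chosen for this example only.

```python
def mase(actual, forecast, training):
    """Mean absolute scaled error, scaled by the naive in-sample forecast."""
    forecast_mae = sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)
    naive_mae = sum(
        abs(training[t] - training[t - 1]) for t in range(1, len(training))
    ) / (len(training) - 1)
    return forecast_mae / naive_mae


def hit_rate(actual, forecast, previous):
    """Fraction of forecasts whose direction of change matches the realized one."""
    hits = sum(
        1
        for a, f, p in zip(actual, forecast, previous)
        if (a - p) * (f - p) > 0
    )
    return hits / len(actual)


if __name__ == "__main__":
    training = [1.0, 2.0, 3.0, 4.0]
    previous = [4.0, 5.0]   # last observed value before each forecast
    actual = [5.0, 6.0]
    print(mase(actual, [5.5, 5.5], training))
    print(hit_rate(actual, [5.5, 4.5], previous))
```

A MASE below 1 means that the forecast beats the naive forecast on average, while the hit rate ignores the size of the error. This is why a model can win on one measure and lose on the other.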

6.2 Future research

Throughout this thesis a number of recommendations are given for future research. This section provides an overview of these recommendations. In Section 5.3 it is stated that the network is capable of forecasting n steps ahead by training on this particular time step. With such a network, the outputs from time t + 1 up to time t + n are also generated; however, these use the same connections as the t + n forecast. Creating a network which trains on all the time steps up to time t + n individually eliminates the need for a full forecast up to that time.
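One possible reading of this recommendation is to give every input window a target for each horizon from t + 1 up to t + n. The sketch below only shows how such training pairs could be assembled; it says nothing about the network that would consume them, and the helper name and arguments are hypothetical.

```python
def make_multi_horizon_pairs(series, window, horizon):
    """Pair each input window with the next `horizon` values as targets."""
    pairs = []
    for start in range(len(series) - window - horizon + 1):
        inputs = series[start:start + window]
        targets = series[start + window:start + window + horizon]
        pairs.append((inputs, targets))
    return pairs


if __name__ == "__main__":
    series = list(range(10))
    for inputs, targets in make_multi_horizon_pairs(series, window=4, horizon=3):
        print(inputs, targets)
```

Each target vector holds the values at t + 1, ..., t + n, so a network trained on these pairs would produce all horizons in one pass instead of iterating a one step forecast.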

In Section 5.4 the conditional network is able to keep track of the dynamics. By creating a network capable of forecasting the u, v and w coordinates at the same time, the network might be able to make a full forecast multiple steps ahead. This is particularly interesting to study, since we wish to be able to forecast the Hopf bifurcation.

Appendix A

Separating hyperplanes

The convergence results from Algorithm 1 only apply to predictions based on the training set. Moreover, depending on the convergence test, the weight vector w might differ for every training instance since r is chosen randomly. For this weight vector w, consider the separating hyperplane w • x − θ = 0 as shown in Figure A.1. All the hyperplanes shown could be built with Algorithm 1. Intuitively we would say that hyperplane 2 in Figure A.1 is the optimal hyperplane for predictions. The margin, as defined in Definition A.1, seems to be equal for both species. The Support Vector Machine finds such a hyperplane by maximizing the margin of the data set. The margin of a set is defined in Definition A.2.

DEFINITION A.1. Let $w, x_i$ be vectors in $\mathbb{R}^n$ and $\theta \in \mathbb{R}$. Consider the hyperplane defined by $w \cdot x - \theta = 0$. The geometric margin $\gamma_i$ of a point $x_i$ and the hyperplane is defined by

\[
\gamma_i = \frac{w \cdot x_i - \theta}{\lVert w \rVert}. \tag{A.1}
\]

DEFINITION A.2. The margin of a set $\{x_1, \ldots, x_N\}$ is defined as the minimum of the absolute margins of the individual elements of the set:

\[
\gamma = \min_i |\gamma_i| = \min_i \frac{|w \cdot x_i - \theta|}{\lVert w \rVert}. \tag{A.2}
\]
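Definitions A.1 and A.2 translate directly into a few lines of code. The sketch below evaluates the geometric margin of single points and the margin of a set for a given hyperplane. The example vectors are made up for illustration and are not taken from the Iris data.

```python
import math


def geometric_margin(w, x, theta):
    """Signed distance (A.1) of the point x to the hyperplane w . x - theta = 0."""
    dot = sum(wi * xi for wi, xi in zip(w, x))
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (dot - theta) / norm


def set_margin(w, points, theta):
    """Margin of a set (A.2): the smallest absolute geometric margin."""
    return min(abs(geometric_margin(w, x, theta)) for x in points)


if __name__ == "__main__":
    w, theta = [3.0, 4.0], 5.0
    points = [[3.0, 4.0], [1.0, 1.0], [0.0, 0.0]]
    for x in points:
        print(x, geometric_margin(w, x, theta))
    print(set_margin(w, points, theta))
```

The sign of the geometric margin tells on which side of the hyperplane a point lies, and the margin of the set is determined by the point closest to the hyperplane.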

Figure A.1: Petal Length vs Petal Width

Appendix B

Glorot derivation

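By taking expectations we obtain for any element $z^i_l \in z^i$:

\[
\begin{aligned}
E\left[z^i_l\right] &= E\left[z^{i-1} W^{i-1}_{\cdot,l}\right] \\
&= E\left[\sum_{j=1}^{n_{i-1}} z^{i-1}_j w^{i-1}_{j,l}\right] \\
&= \sum_{j=1}^{n_{i-1}} E\left[z^{i-1}_j w^{i-1}_{j,l}\right] \\
&= \sum_{j=1}^{n_{i-1}} E\left[z^{i-1}_j\right] E\left[w^{i-1}_{j,l}\right] \\
&= 0,
\end{aligned}
\tag{B.1}
\]

since $E\left[w^{i-1}\right] = 0$ for all $i$. By taking variances we obtain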

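\[
\begin{aligned}
\operatorname{Var}\left[z^i_l\right] &= \operatorname{Var}\left[z^{i-1} W^{i-1}_{\cdot,l}\right] \\
&= \operatorname{Var}\left[\sum_{j=1}^{n_{i-1}} z^{i-1}_j w^{i-1}_{j,l}\right] \\
&= E\left[\left(\sum_{j=1}^{n_{i-1}} z^{i-1}_j w^{i-1}_{j,l}\right)^2\right] - E\left[\sum_{j=1}^{n_{i-1}} z^{i-1}_j w^{i-1}_{j,l}\right]^2,
\end{aligned}
\tag{B.2}
\]

By independence of $w^{i-1}_{j,l}$ and $z^{i-1}_j$ we obtain:

\[
\begin{aligned}
\operatorname{Var}\left[z^i_l\right] &= E\left[\left(\sum_{j=1}^{n_{i-1}} z^{i-1}_j w^{i-1}_{j,l}\right)^2\right] \\
&= \sum_{j=1}^{n_{i-1}} E\left[\left(w^{i-1}\right)^2\right] E\left[\left(z^{i-1}_j\right)^2\right] \\
&= n_{i-1} \operatorname{Var}\left[W^{i-1}\right] \operatorname{Var}\left[z^{i-1}\right].
\end{aligned}
\tag{B.3}
\]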

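By recursively using the relation above we obtain:

\[
\operatorname{Var}\left[z^i_l\right] = \operatorname{Var}\left[z^1\right] \prod_{i'=1}^{i-1} n_{i'} \operatorname{Var}\left[W^{i'}\right], \tag{B.4}
\]

where $\operatorname{Var}\left[W^i\right]$ denotes the shared scalar variance of all weights at layer $i$. For the first layer we have:

\[
\begin{aligned}
\operatorname{Var}\left[z^1_l\right] &= \operatorname{Var}\left[x W^0_{\cdot,l}\right] \\
&= \operatorname{Var}\left[\sum_{i=1}^{n_0} x_i w^0_{i,l}\right] \\
&= E\left[\left(\sum_{i=1}^{n_0} x_i w^0_{i,l}\right)^2\right] - E\left[\sum_{i=1}^{n_0} x_i w^0_{i,l}\right]^2 \\
&= n_0 \operatorname{Var}\left[x\right] \operatorname{Var}\left[W^0\right].
\end{aligned}
\tag{B.5}
\]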

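Using (B.4) and (B.5) we obtain

\[
\operatorname{Var}\left[z^i_l\right] = \operatorname{Var}\left[x\right] \prod_{i'=0}^{i-1} n_{i'} \operatorname{Var}\left[W^{i'}\right]. \tag{B.6}
\]

By taking variances on (3.14) on a network with d layers and using (3.16), we obtain

\begin{align}
\operatorname{Var}\left[\frac{\partial J}{\partial s^i_k}\right] &= \operatorname{Var}\left[\frac{\partial J}{\partial s^{i+1}} \frac{\partial s^{i+1}}{\partial s^i_k}\right] \tag{B.7} \\
&= \operatorname{Var}\left[\frac{\partial J}{\partial s^{i+2}} \frac{\partial s^{i+2}}{\partial s^{i+1}} \frac{\partial s^{i+1}}{\partial s^i_k}\right] \tag{B.8}
\end{align}

By recursively applying the step (B.7) to (B.8) we obtain:

\[
\begin{aligned}
\operatorname{Var}\left[\frac{\partial J}{\partial s^i_k}\right] &= \operatorname{Var}\left[\frac{\partial J}{\partial s^d} \left(\prod_{i'=i+1}^{d-1} \frac{\partial s^{i'+1}}{\partial s^{i'}}\right) \frac{\partial s^{i+1}}{\partial s^i_k}\right] \\
&= \operatorname{Var}\left[\frac{\partial J}{\partial s^d} \left(\prod_{i'=i+1}^{d-1} W^{i'+1}\right) W^i_{k,\cdot}\right]
\end{aligned}
\tag{B.9}
\]

By showing $E\left[\frac{\partial J}{\partial s^d}\right] = 0$, we can again use (!). Note that:

\[
J = \left\lVert x \prod_{i=0}^{n} W^i - y \right\rVert_2^2 \tag{B.10}
\]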

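By looking at one particular output and taking the derivative with respect to $s^d$ we obtain:

\[
\begin{aligned}
\frac{\partial \lVert \hat{y} - y \rVert^2}{\partial s^d}
&= \frac{\partial \left[\left(x \prod_{i=0}^{n} W^i\right)^2 + y^2 - 2y\, x \prod_{i=0}^{n} W^i\right]}{\partial \left(x \prod_{i=1}^{d} W^i\right)} \\
&= \frac{\partial \left[\left(x \prod_{i=1}^{d} W^i \prod_{j=d+1}^{n} W^j\right)^2 + y^2 - 2y\, x \prod_{i=1}^{d} W^i \prod_{j=d+1}^{n} W^j\right]}{\partial \left(x \prod_{i=1}^{d} W^i\right)} \\
&= 2 x \prod_{i=1}^{d} W^i \left(\prod_{j=d+1}^{n} W^j\right)^2 - 2y \prod_{j=d+1}^{n} W^j
\end{aligned}
\tag{B.11}
\]

Taking the expectation gives

\[
\begin{aligned}
E\left[\frac{\partial \lVert \hat{y} - y \rVert^2}{\partial s^d}\right]
&= E\left[2 x \prod_{i=1}^{d} W^i \left(\prod_{j=d+1}^{n} W^j\right)^2 - 2y \prod_{j=d+1}^{n} W^j\right] \\
&= 0
\end{aligned}
\tag{B.12}
\]

by independence of the elements $w^i_{k,l} \in W^i$ for all $i, k, l$ and $E\left[w^i_{k,l}\right] = 0$ for all $i, k, l$.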
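The forward relation (B.6) can be checked numerically for a small linear network. The sketch below is a Monte Carlo experiment under assumptions that are not part of the derivation: inputs and weights are drawn from zero-mean Gaussians, the layer widths and weight variances are arbitrary, and all function names are made up for the example.

```python
import random


def forward(x, weights):
    """Linear layers z^{i+1} = z^i W^i, with each W^i stored as a list of rows."""
    z = x
    for w in weights:
        z = [sum(z[j] * w[j][l] for j in range(len(z))) for l in range(len(w[0]))]
    return z


def sample_output(widths, weight_vars, rng):
    """Draw fresh inputs and weights; return the first element of the last layer."""
    x = [rng.gauss(0.0, 1.0) for _ in range(widths[0])]
    weights = []
    for i, var in enumerate(weight_vars):
        std = var ** 0.5
        weights.append(
            [[rng.gauss(0.0, std) for _ in range(widths[i + 1])]
             for _ in range(widths[i])]
        )
    return forward(x, weights)[0]


def predicted_variance(widths, weight_vars, input_var=1.0):
    """Right-hand side of (B.6): Var[x] times the product of n_i * Var[W^i]."""
    result = input_var
    for n, var in zip(widths, weight_vars):
        result *= n * var
    return result


def empirical_variance(widths, weight_vars, trials, seed=0):
    """Sample variance of the output over independent draws of x and W."""
    rng = random.Random(seed)
    samples = [sample_output(widths, weight_vars, rng) for _ in range(trials)]
    mean = sum(samples) / trials
    return sum((s - mean) ** 2 for s in samples) / trials


if __name__ == "__main__":
    widths = [4, 3, 2]        # n_0 inputs, n_1 hidden units, 2 outputs
    weight_vars = [0.5, 0.2]  # Var[W^0], Var[W^1]
    print(predicted_variance(widths, weight_vars))
    print(empirical_variance(widths, weight_vars, trials=20000))
```

Choosing $\operatorname{Var}[W^i] = 1/n_i$ makes every factor in the product equal to one, so the variance of the signal is preserved from layer to layer.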

Appendix C

Code of the model

import numpy as np
from keras import regularizers
from keras.initializers import glorot_uniform
from keras.layers import Activation, Conv1D, Input, add, multiply
from keras.models import Model


def BuildModel(NInputs, NLayers, NFilters, FilterWidth, UseBias, L2Kern,
               ConditionWidth, ConditionFilters):
    # define inputs; ConditionInput is declared but not connected in this listing
    SignalInput = Input(shape=(NInputs, 1), name='input_layer')
    ConditionInput = Input(shape=(NInputs, 1), name='condition_layer')
    ZeroInput = Input(shape=(NInputs, 1), name='zero_layer')
    # the residual stream starts from the raw input signal
    SignalOut = SignalInput
    # convolutions; filters and kernel size are fixed at 1 and 2 (Table 5.11)
    for i in range(NLayers):
        # draw a new seed for the initializers of this layer
        Mseed = np.random.randint(5000)
        Mseed = Mseed + 1
        np.random.seed(Mseed)
        # dilated causal convolutions on the output of the previous layer
        ConvSignalOut = Conv1D(filters=1, kernel_size=2, strides=1,
                               dilation_rate=2**i, padding='causal',
                               activation='linear', use_bias=UseBias,
                               kernel_regularizer=regularizers.l2(L2Kern),
                               kernel_initializer=glorot_uniform(seed=Mseed))(SignalOut)
        ConvSignalOut2 = Conv1D(filters=1, kernel_size=2, strides=1,
                                dilation_rate=2**i, padding='causal',
                                activation='linear', use_bias=UseBias,
                                kernel_regularizer=regularizers.l2(L2Kern),
                                kernel_initializer=glorot_uniform(seed=Mseed))(SignalOut)

        # gated activation: sigmoid gate and tanh filter
        ConvSignalOut = Activation('sigmoid')(ConvSignalOut)
        ConvSignalOut2 = Activation('tanh')(ConvSignalOut2)

        # product
        ConvSignalOut = multiply([ConvSignalOut, ConvSignalOut2])
        # 1x1 convolution on the gated output
        Signal1x1 = Conv1D(filters=1, kernel_size=1, padding='same',
                           dilation_rate=1, activation='linear',
                           kernel_initializer=glorot_uniform(seed=Mseed))(ConvSignalOut)

        # summation (residual) of the previous output and the 1x1 output
        SignalOut = add([Signal1x1, SignalOut])
        # accumulate the skip connections
        if i == 0:
            SkipSignal = ConvSignalOut
        else:
            SkipSignal = add([SkipSignal, ConvSignalOut])

    # 1x1 convolution
    FinalOutput = Conv1D(filters=1, kernel_size=1, padding='same',
                         dilation_rate=1, activation='linear',
                         kernel_initializer=glorot_uniform(seed=Mseed))(SkipSignal)

    # zero the polluted outputs
    FinalOutput = multiply([FinalOutput, ZeroInput])
    model = Model(inputs=[SignalInput, ConditionInput, ZeroInput],
                  outputs=[FinalOutput])
    return model
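The role of the `zero_layer` input can be illustrated without Keras. With causal padding, a convolution of filter width 2 and dilation d computes y[t] = k[0] x[t − d] + k[1] x[t] and reads zeros for t − d < 0. The sketch below tracks which output positions are affected by this padding after a stack of such layers and builds a 0/1 mask from them. The helper names are hypothetical, and the mask actually fed to the model may zero a slightly different number of leading outputs, for example $2^{N_L}$ instead of $2^{N_L} - 1$.

```python
def causal_dilated_conv(x, kernel, dilation):
    """y[t] = kernel[0] * x[t - dilation] + kernel[1] * x[t], zero-padded on the left."""
    out = []
    for t in range(len(x)):
        past = x[t - dilation] if t - dilation >= 0 else 0.0
        out.append(kernel[0] * past + kernel[1] * x[t])
    return out


def polluted_positions(num_inputs, num_layers):
    """Mark the outputs whose receptive field reaches into the zero padding."""
    polluted = [False] * num_inputs
    for layer in range(num_layers):
        dilation = 2 ** layer
        # an output is polluted if it reads padding or an already polluted value
        polluted = [
            polluted[t] or t - dilation < 0 or polluted[t - dilation]
            for t in range(num_inputs)
        ]
    return polluted


def zero_mask(num_inputs, num_layers):
    """1.0 where the output is clean, 0.0 where it touched the padding."""
    return [0.0 if p else 1.0 for p in polluted_positions(num_inputs, num_layers)]


if __name__ == "__main__":
    print(causal_dilated_conv([1.0, 2.0, 3.0, 4.0], [1.0, 1.0], 2))
    mask = zero_mask(32, 4)
    print(mask.index(1.0), sum(mask))
```

If the number of inputs does not exceed the receptive field, every position is polluted and the mask is zero everywhere, which is why the inputs are extended by $2^{N_L}$.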

D.C.F. van den Assem (4336100) Master of Science Thesis Bibliography

[1] A. v. d. Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, “Wavenet: A for raw audio,” arXiv preprint arXiv:1609.03499, 2016. [2] A. Borovykh, S. Bohte, and C. W. Oosterlee, “Conditional time series forecasting with convolutional neural networks,” arXiv preprint arXiv:1703.04691, 2017. [3] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolu- tional neural networks,” in Advances in neural information processing systems, pp. 1097– 1105, 2012. [4] D. Hsu, “Time series forecasting based on augmented long short-term memory,” [5] M. Peña, A. Arratia, and L. A. Belanche, “Multivariate dynamic kernels for financial time series forecasting,” in International Conference on Artificial Neural Networks, pp. 336–344, Springer, 2016. [6] R. A. Fisher, “The use of multiple measurements in taxonomic problems,” Annals of eu- genics, vol. 7, no. 2, pp. 179–188, 1936. [7] R. Rojas, Neural Networks: A Systematic Introduction. Springer Science & Business Media, 1996. [8] N. Karmarkar, “A new polynomial-time algorithm for linear programming,” in Proceedings of the sixteenth annual ACM symposium on Theory of computing, pp. 302–311, ACM, 1984. [9] D. R. Cox, “The of binary sequences,” Journal of the Royal Statistical Society. Series B (Methodological), pp. 215–242, 1958. [10] Y. A. LeCun, L. Bottou, G. B. Orr, and K.-R. Müller, “Efficient backprop,” in Neural networks: Tricks of the trade, pp. 9–48, Springer, 2012. [11] Y. A. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, 2015. [12] A. L. Maas, A. Y. Hannun, and A. Y. Ng, “Rectifier nonlinearities improve neural network acoustic models,” in Proc. ICML, vol. 30, 2013.

Master of Science Thesis D.C.F. van den Assem (4336100) 74 Bibliography [13] D.-A. Clevert, T. Unterthiner, and S. Hochreiter, “Fast and accurate deep network learning by exponential linear units (elus),” arXiv preprint arXiv:1511.07289, 2015.

[14] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs,” arXiv preprint arXiv:1606.00915, 2016.

[15] Y. Bengio, P. Simard, and P. Frasconi, “Learning long-term dependencies with gradient descent is difficult,” IEEE transactions on neural networks, vol. 5, no. 2, pp. 157–166, 1994.

[16] S. Hochreiter and J. Schmidhuber, “Lstm can solve hard long time lag problems,” in Advances in neural information processing systems, pp. 473–479, 1997.

[17] H. J. Kushner and D. S. Clark, Stochastic approximation methods for constrained and unconstrained systems, vol. 26. Springer Science & Business Media, 2012.

[18] X. Glorot and Y. Bengio, “Understanding the difficulty of training deep feedforward neural networks.,” in Aistats, vol. 9, pp. 249–256, 2010.

[19] K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing human- level performance on classification,” in Proceedings of the IEEE international conference on , pp. 1026–1034, 2015.

[20] F. Chollet et al., “Keras.” https://github.com/fchollet/keras, 2015. [21] S. Ruder, “An overview of gradient descent optimization algorithms,” arXiv preprint arXiv:1609.04747, 2016.

[22] N. Qian, “On the momentum term in gradient descent learning algorithms,” Neural net- works, vol. 12, no. 1, pp. 145–151, 1999.

[23] Y. Nesterov, A method for unconstrained convex minimization problem with the rate of convergence O (1/k2), vol. 269, pp. 543–547. 1983.

[24] J. Duchi, E. Hazan, and Y. Singer, “Adaptive subgradient methods for online learning and stochastic optimization,” Journal of Machine Learning Research, vol. 12, no. Jul, pp. 2121– 2159, 2011.

[25] M. D. Zeiler, “Adadelta: an adaptive learning rate method,” arXiv preprint arXiv:1212.5701, 2012.

[26] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.

[27] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016. http: //www.deeplearningbook.org. [28] L. Breiman, “Bagging predictors,” Machine learning, vol. 24, no. 2, pp. 123–140, 1996.

[29] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting.,” Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.

[30] L. Wan, M. Zeiler, S. Zhang, Y. L. Cun, and R. Fergus, “Regularization of neural networks using dropconnect,” in Proceedings of the 30th international conference on machine learning (ICML-13), pp. 1058–1066, 2013.

D.C.F. van den Assem (4336100) Master of Science Thesis 75 [31] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and , pp. 770– 778, 2016. [32] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reduc- ing internal covariate shift,” in International Conference on Machine Learning, pp. 448–456, 2015. [33] R. J. Hyndman and A. B. Koehler, “Another look at measures of forecast accuracy,” International journal of forecasting, vol. 22, no. 4, pp. 679–688, 2006. [34] K. A. De Oliveira, A. Vannucci, and E. C. da Silva, “Using artificial neural networks to forecast chaotic time series,” Physica A: and its Applications, vol. 284, no. 1, pp. 393–404, 2000. [35] T. Kuremoto, S. Kimura, K. Kobayashi, and M. Obayashi, “Time series forecasting using a with restricted boltzmann machines,” Neurocomputing, vol. 137, pp. 47–56, 2014. [36] C. López-Caraballo, I. Salfate, J. Lazzús, P. Rojas, M. Rivera, and L. Palma-Chilla, “Mackey-glass noisy chaotic time series prediction by a swarm-optimized neural network,” in Journal of Physics: Conference Series, vol. 720, pp. 12002–12011, IOP Publishing, 2016.

Master of Science Thesis D.C.F. van den Assem (4336100) 76 Bibliography

D.C.F. van den Assem (4336100) Master of Science Thesis