
Linköping University | Department of Computer and Information Science
Master thesis, 30 ECTS | Datateknik
2019 | LIU-IDA/LITH-EX-A--19/046--SE

TDNet - A Generative Model for Taxi Demand Prediction

TDNet - En Generativ Modell för att Prediktera Taxiefterfrågan

Gustav Svensk

Supervisor: Suejb Memeti
Examiner: Kristian Sandahl

External supervisor: Eero Piitulainen

Linköpings universitet, SE-581 83 Linköping, +46 13 28 10 00, www.liu.se


Copyright

The publishers will keep this document online on the Internet - or its possible replacement - for a period of 25 years starting from the date of publication barring exceptional circumstances. The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/her own use and to use it unchanged for non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility. According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement. For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.

© Gustav Svensk

Abstract

Supplying the right amount of taxis in the right place at the right time is very important for taxi companies. In this paper, the machine learning model Taxi Demand Net (TDNet) is presented, which predicts short-term taxi demand in different zones of a city. It is based on WaveNet, which is a causal dilated convolutional neural net for time-series generation. TDNet uses historical demand from recent years and transforms features such as time of day, day of week and day of month into 26-hour taxi demand forecasts for all zones in a city. It has been applied to one city in northern Europe and one in South America. In northern Europe, an error of one taxi or less per hour per zone was achieved in 64% of the cases; in South America the number was 40%. In both cities, it beat the SARIMA and stacked ensemble benchmarks. This performance has been achieved by tuning the hyperparameters with a Bayesian optimization algorithm. Additionally, weather and holiday features were added as input features in the northern European city, and they did not improve the accuracy of TDNet.

Sammanfattning

Att ha rätt antal taxis på rätt plats vid rätt tid är väldigt viktigt för taxiföretag. I denna rapport presenteras maskininlärningsmodellen Taxi Demand Net (TDNet) som förutspår den kortfristiga efterfrågan på taxi i olika stadszoner med precision. Den är baserad på WaveNet, ett faltande neuralt nätverk med kausala och utvidgade lager för tidsserieprediktion. TDNet använder historisk efterfrågan från de senaste åren och transformerar information så som tid på dygnet, dag i veckan och dag i månaden till prognoser för efterfrågan på taxi som sträcker sig 26 timmar framåt för alla zoner i en stad. Modellen har tillämpats på en stad i norra Europa och en i Sydamerika, och den har åstadkommit ett fel på en taxi eller mindre i 64% respektive 40% av fallen. I båda städerna har den slagit referensmodellerna SARIMA samt en viktad ensemble. Denna precision har nåtts genom att hitta hyperparametrar med en bayesiansk optimeringsmetod. Dessutom har det visats att varken information om väder eller helgdagar förbättrar modellens prestanda.

Acknowledgments

I would like to thank my supervisors Suejb Memeti at the university and Eero Piitulainen at TaxiCaller for providing valuable input and supporting me during this semester. I would also like to thank my examiner Kristian Sandahl for answering questions and taking on my thesis. Lastly, I would like to thank everyone at TaxiCaller for making me feel welcome.

Contents

Abstract

Sammanfattning

Acknowledgments

Contents

List of Figures

List of Tables

1 Introduction
  1.1 Motivation
  1.2 Aim
  1.3 Research questions
  1.4 Delimitations

2 Theory
  2.1 Taxi Demand
  2.2 Contextualizing Machine Learning
  2.3 Basics of Supervised Learning
  2.4 Artificial Neural Networks
  2.5 Training a Network
  2.6 Convolutional Neural Networks
  2.7 RNN and Sequence to Sequence Models
  2.8 WaveNet
  2.9 Evaluation Metric
  2.10 Hyperparameter Tuning
  2.11 Sequential Model-based Optimization
  2.12 Tree Parzen Estimator
  2.13 SARIMA
  2.14 Stacked ensembles
  2.15 Mixed Precision
  2.16 Method

3 Literature Review
  3.1 WaveNet Architectures
  3.2 Alternative approaches
  3.3 Taxi Demand

4 Method
  4.1 Data Description
  4.2 Data Cleaning
  4.3 Data Splitting
  4.4 Data Preprocessing
  4.5 Data Exploration
  4.6 Model Implementation
  4.7 Hyperparameter Tuning
  4.8 Evaluation
  4.9 Benchmarks
  4.10 Feature Importance
  4.11 Rounding
  4.12 Models trained

5 Empirical Evaluation
  5.1 Experimental Setup
  5.2 Results NE
  5.3 Results SA
  5.4 Hyperparameters and Architecture

6 Discussion
  6.1 Results NE
  6.2 Results SA
  6.3 Comparing the Cities
  6.4 Method Criticism
  6.5 Comparing the Models
  6.6 Comparing TDNet to the Literature
  6.7 Improving TDNet
  6.8 The work in a wider context

7 Conclusion
  7.1 Connection to Research Questions
  7.2 Future Research

Bibliography

Glossary

List of Figures

2.1 Taxi Demand Example
2.2 Example problem, supervised learning
2.3 Polynomial Curves
2.4 Error plot polynomial curves
2.5 Activation Functions
2.6 Neural Network
2.7 Convolving an image
2.8 A Stack of dilated causal convolutional layers
2.9 MNIST images

4.1 True Demand per Hour in SA
4.2 True Demand per Hour in NE
4.3 Zone Demand Distribution in SA
4.4 Zone Demand Distribution in NE
4.5 Building block of TDNet architecture

5.1 Result NE
5.2 Error Distribution of RMSLE in NE
5.3 Error Distribution of RMSLE in NE of Non-zero Demand
5.4 Error Distribution of RMSE in NE
5.5 Zone Error Distribution NE
5.6 Total Demand in NE over Test Period
5.7 Total Demand Benchmarks NE
5.8 Prediction Error per Hour NE
5.9 Train loss
5.10 Results SA
5.11 Zone Error Distribution SA
5.12 Error Distribution of Root Mean Square Error (RMSE) in SA
5.13 Total Demand RMSE Model SA
5.14 Total Demand Benchmarks SA
5.15 Prediction Error per Hour SA

6.1 Comparison of RMSE and Root Mean Square Logarithmic Error (RMSLE)

List of Tables

4.1 Unprocessed Dataframe
4.2 Processed Dataframe
4.3 Hyperparameter domain

5.1 Comparing loss functions
5.2 Found Hyperparameter Values for NE and SA
5.3 Hyperparameters SA
5.4 Hyperparameters NE

1 Introduction

Taxis are a part of the transportation system of most cities and provide a service to take individuals from point A to point B. Due to high regulation of and restricted entry to the taxi market, ride-sharing companies such as Uber and Lyft have appeared and successfully competed with the traditional companies [10]. Traditional taxi companies must improve their competitiveness in order to fight for their share of the global taxi market, which amounts to an estimated $108 billion [6]. Simultaneously, large parts of the taxi fleet are expected to be replaced by autonomous vehicles, which will lead to unprecedented changes in the industry when it comes to e.g. cost and business models, service availability and optimal fleet size [6, 31]. These technological developments paint a picture of a competitive market where the current actors will have to adapt to the new circumstances or perish. However, for those who are able to leverage their technology, an opportunity for growth is presented. Solutions which have an impact on the current industry and won't be made obsolete in the near future, such as being able to accurately predict taxi demand, offer a great advantage.

1.1 Motivation

Predicting taxi demand accurately would lead to numerous benefits on several levels. Customers would experience a lower expected wait time, and taxi companies would use their resources more efficiently by regulating the number of taxis. Lastly, the drivers would receive recommendations on where to look for customers as well as a reduction in time spent roaming and queuing for customers. The task of predicting taxi demand can be divided into two subproblems: short-term or real-time predictions and long-term predictions. Short-term predictions impact the customers and drivers on a day-to-day basis, while long-term predictions are made on a weekly or monthly basis to help resource management and planning. The taxi industry is, as many other industries, in an era of digital transformation. With increasing access to cheaper smartphones and better wireless mobile telecommunication, the ability to collect and store large amounts of data generated by each taxi has emerged. Examples of data collected by taxi companies are GPS locations of where the taxi has been at different times, whether the car was occupied by a customer and, if so, when and where the customer was picked up and dropped off.


Multiple studies have used this kind of data and shown that it is possible to forecast taxi demand using different types of machine learning algorithms [34, 59, 12, 55]. A recent example is the study done in August 2018 by Jun Xu et al., which shows that it is possible to accurately predict taxi demand using a sequential learning model based on a recurrent neural network (RNN) and a mixture density network. The recurrent neural network architecture used by Jun Xu et al. in their research [55] is called a long short-term memory network (LSTM). It was first described in a paper by S. Hochreiter et al. in 1997 [20] and has since gained popularity and been shown to deliver state-of-the-art results in a variety of fields and problem domains that use sequential data [47, 60, 33]. Networks of this kind are able to accurately model long-term patterns in data. In a large-scale analysis of different LSTM architectures, the forget gate component and output activation function are noted to be essential for the performance of the algorithm. The forget gate enables the LSTM to reset its own state and the output activation function is needed to stabilize learning [17]. Due to the structure of the recurrent neural network, the number of trainable parameters is high, which makes it computationally expensive in comparison to feed-forward networks such as the Convolutional Neural Network (CNN). In 2016, researchers at Google DeepMind proposed a novel, fully probabilistic and autoregressive generative model called WaveNet [49]. It is based on a stack of convolutional layers, thus making it computationally cheaper than recurrent neural networks and specifically LSTMs. Since the initial publication, multiple papers have been released using WaveNet as a base architecture for sequential forecasting [4, 25]. The state-of-the-art results of WaveNet in multiple domains which require time-series forecasting justify applying it to predicting taxi demand.

1.2 Aim

The purpose of this project is to evaluate the performance of an architecture based on WaveNet, Taxi Demand Net (TDNet), when applied to predicting taxi demand. The results will be compared to other forecasting algorithms. This has been achieved through the following steps:

1. Developing a prediction model based on WaveNet

2. Tuning the hyperparameters of the model

3. Evaluating the final performance

4. Comparing performance with alternative models

1.3 Research questions

Based on the motivation and aim, these are the research questions that will be investigated and answered in this report:

1. How accurately is it possible to predict the short-term taxi demand in predefined zones of a city using TDNet?

a) How should accuracy be measured in this domain?

2. How does the spatial distribution of the demand in the cities and features other than demand affect the performance of TDNet when predicting the short-term taxi demand in predefined zones of a city?

3. How well does TDNet perform, compared to other time-series forecasting models, when predicting the short-term taxi demand in predefined zones of a city?


1.4 Delimitations

This study is focused on applying TDNet to the specific application of predicting taxi demand. The data sets used are supplied by the taxi dispatch company TaxiCaller AB and consist of real taxi trips that have been collected over more than two years. The data comes from two of their largest customers, one of which has its business in northern Europe and the other in South America. They will be referred to as NE (Northern Europe) and SA (South America). The distributions of demand in these cities are the two spatial distributions considered in research question 2. Outside of historical demand, holiday and weather features will be investigated for city NE. Furthermore, the data sets do not represent all of the taxi demand in a given city. This means that the total taxi demand in a city is unknown, as is the market share of the customer. Figuring out the size of the black market, the competitors and the market share of ride-sharing companies wouldn't result in more than imprecise guesswork.

2 Theory

This section provides an explanation of concepts important to the rest of the thesis. The first section describes the problem domain and is followed by a few sections which serve as building blocks to understand TDNet and the underlying architecture of WaveNet. Thereafter follow sections on hyperparameter tuning and mixed precision, which are techniques that have been used to improve the performance of the model. Finally, in the method section, studies are presented which motivate the methodology with which this project has been conducted.

2.1 Taxi Demand

The task of TDNet is to predict short-term taxi demand; what is actually meant by taxi demand, as well as what parameters influence it, is described in this section. Taxi demand can come from either street hailing or bookings, which are placed through a phone call or in a mobile application. There are three important differences. The first one is that hailing happens spontaneously while bookings are planned in advance. The second one is that the location of taxi cars influences whether a hailing occurs or not. The last one is that hailing is influenced by the structure of a city's road network. Consider figure 2.1, where the y-axis represents the combined number of hailings and bookings and the x-axis is the hour of the day. This is a simple example of what the demand could look like for a zone in a city. Domain experts and scientific literature have both been sources in investigating the parameters which significantly impact taxi demand for an area [57, 59, 15]. The domain expert, who was interviewed specifically for this thesis, was a taxi driver in one of the cities investigated who had also had a manager role. Primary parameters are historical taxi demand and temporal factors, i.e. hour of day, day of week and day of month. Secondary factors include holidays, promotions, sporting events, special occasions, the schedules of nearby public transport, taxi drop-offs, weather and closing times of pubs and night clubs. What some of these secondary factors have in common is that they are indicators of how many people there are in an area. Working under the assumption that the number of mobile network connections is a good approximation of the number of people in an area, Google researchers managed to produce a model which very accurately predicted taxi demand in Tokyo. Unfortunately, this valuable data isn't widely available. [24]


Figure 2.1: Taxi demand over a 24-hour period.

2.2 Contextualizing Machine Learning

Artificial intelligence (AI) is a field which concerns itself with building intelligent entities. There are many different approaches to accomplishing this task. One example is the symbolic approach, where sets of hard-coded rules, logic and search algorithms are combined to solve problems. Another example is the Bayesian approach, where probability distributions are used to reason with uncertainty. Using optimization algorithms and domain data, the most likely conditional dependencies of the variables in e.g. a probabilistic graph can then be calculated to generate a model which is able to infer the probability functions of unobserved variables. [46] The approach which has made the greatest progress during the last decade is machine learning. It tries to solve problems without being explicitly programmed to solve them [27]. This is achieved by a model through first learning patterns in the data in what's known as the training phase. Afterwards, the model is able to determine how to handle new, unseen data by looking for the learned patterns in it.

To successfully make use of machine learning, there are four main factors to take into account, namely data quantity and quality, computational power and the model [16]. Since machine learning requires learning patterns from data, how well a model is able to do that largely depends on the quantity but also the quality of the data. The data is an approximation of reality and it is all that is available to the model; if it doesn't represent reality precisely enough or doesn't cover all possible scenarios, the model won't either. If the model is too complex for the amount of data available, the parameters of the model won't converge, i.e. learn the patterns. If there is enough data available but the quality is too low, i.e. there are outliers, null entries, incorrect or invalid values, the model might start displaying unwanted behaviour or fail to identify the patterns that actually exist. For advanced tasks, advanced models are required and these are often computationally expensive to train. They might need multiple GPUs, or even processing units made specifically for deep learning, tensor processing units (TPUs), to run for weeks [48, 23]. Finally, a suitable model has to be chosen for the problem at hand. There are multiple domains which have come to be dominated by machine learning, a few examples being computer vision, machine translation, the game Go, targeted advertising and text-to-speech [16]. For each of these domains there are specific types of model architectures which have proven to deliver results.

How the four main factors of machine learning have evolved over time also explains why machine learning has made such substantial progress lately. The amount of data produced by our society increases rapidly. Moore's law illustrates how computational power has increased exponentially over the last decades, and even though it is about to come to an end, the already mentioned TPUs as well as data centers stacked with GPUs are ready to step in [52]. Lastly, machine learning as a field has gone through a research boom and has wildly increased in popularity over the last decade as more and more state-of-the-art results have been presented in several domains [16].

A subfield of machine learning known as supervised learning is defined by how it requires labeled data, i.e. a mapping between the input and output data, to learn patterns. Formally, given an input-output pair (x, y), a supervised learning model tries to learn f(x) = y. Classification is one of two types of supervised learning; the task when classifying is to assign a class to the output based on the input. The other type of supervised learning is regression, which is when the output is a numerical value that isn't a class or category. The domain of taxi demand is an example of regression in that the sought output is an integer which represents a quantity.

2.3 Basics of Supervised Learning

To provide an example of a supervised learning problem and to introduce some key concepts, consider the following problem: construct a polynomial function which approximates the function y, where x is a uniformly distributed vector and e is normally distributed noise, e ~ N(0, 1).

y(x) = cos(πx) + e (2.1)

Figure 2.2: Datapoints sampled from y(x) in equation 2.1. The noise has resulted in one outlier at x = 1.8.

The vector x ranges from 0 to π with N = 11 observations and has the values shown in figure 2.2. A polynomial function of degree M is defined as the following:

y(x, w) = w_0 + w_1 x + w_2 x^2 + ... + w_M x^M = \sum_{i=0}^{M} w_i x^i    (2.2)

For the purpose of quantitatively evaluating how good the approximation is, an error function has to be defined. In this case, an appropriate choice is the RMSE, defined in equation 2.3. It is appropriate since it is simple, frequently used and tries to make each residual as small as possible, since it is sensitive to outliers. An example is that two errors of r/2 amount to a smaller RMSE than if one error were zero and the other had a difference of r. That means that errors concentrated to a few points are punished harder than the same error spread out over more points. Calculating the mean error allows comparing the performance of models on value series of different lengths, and calculating the root of the mean squared error converts the error to the same unit as the predicted values. If the model's approximation of y is ŷ, then the error is defined as:

RMSE(w) = \sqrt{\frac{1}{N} \sum_{n=1}^{N} (\hat{y}(x_n, w) - y_n)^2}    (2.3)

A lower error is better, and in the case that RMSE(w) is equal to zero, the approximation ŷ perfectly fits the training points. Thus, the objective now is to find which values of w minimize the error. By deriving the quadratic error function, setting it to zero and solving for w, it is possible to obtain an optimal solution. However, the degree of the polynomial function must first be decided, and this impacts the complexity of the model. Changing the value of M has a large impact on how well the approximation fits the data points and how well the approximation is able to predict unseen values. M isn't learned during the training phase and is more akin to a model setting decided by the creator before the training begins. In the domain of machine learning, M is known as a hyperparameter. To ensure that the approximation is valid, data points from the same distribution as x are generated. These data points are previously unseen and constitute the test set. A model that is able to estimate y with a low error for input that it hasn't seen before, i.e. the test set, is said to generalize well.
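To make the example concrete, here is a minimal numpy sketch of the same experiment; the random seed, sample data and use of np.polyfit are illustrative choices, not taken from the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)

# N = 11 noisy training points from y(x) = cos(pi * x) + e, with e ~ N(0, 1)
N = 11
x_train = np.linspace(0, np.pi, N)
y_train = np.cos(np.pi * x_train) + rng.normal(0, 1, N)

def rmse(y_pred, y_true):
    # Equation 2.3
    return np.sqrt(np.mean((y_pred - y_true) ** 2))

for M in (0, 5, 10):
    # polyfit solves the least-squares problem in closed form, i.e. it derives
    # the quadratic error, sets it to zero and solves for the weights w
    w = np.polyfit(x_train, y_train, deg=M)
    y_hat = np.polyval(w, x_train)
    print(f"M = {M:2d}, train RMSE = {rmse(y_hat, y_train):.3f}")
```

Running this reproduces the qualitative pattern discussed below: the training error shrinks as M grows, even though the high-degree fit generalizes poorly.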

Figure 2.3: A line plot showing how well different polynomial curves fit the training points, as well as the true function which they are trying to approximate, f(x) = cos(πx).

In figure 2.3, ŷ(x_test, w) is displayed for the hyperparameter values M ∈ {0, 5, 10}. For M = 0, the degree of the polynomial is too low and the approximation can't fit the training points; this is known as underfitting. The approximation when M = 5 doesn't fit the data points perfectly but stays close to the true value for all values of x, i.e. it generalizes. For M = 10, the curve perfectly fits the training points but there are regions, such as 0 < x < 0.3, where the approximation takes on values far from the true y. This is known as overfitting and negatively impacts generalization. There are several causes for overfitting but at its core, it's because the model is too complex for the training data. How well a model fits the data can be visualized by plotting the error on the train and test set for the different values of M. To calculate the test error, each of the models has predicted ŷ(x, w) for the last 10 values of x - 0.1.

Figure 2.4: A scatterplot showing the train and test error for the polynomial fit for different degrees of the polynomial.

As can be seen in figure 2.4, the training error decreases as M is increased. However, this is not the case for the test error, which massively increases for M = 10. This further confirms that M = 5 is the appropriate model to select for this task. To avoid problems such as overfitting, it's possible to add a validation step to the training. To do this, a portion of the training data set is held out in a similar way to the test set. However, the validation set is used repeatedly to evaluate how different sets of hyperparameters affect the performance of the model. This enables the developer to spot how the hyperparameters affect the performance before doing the final evaluation. The size of the validation set as well as the test set should only be big enough to allow for a fair evaluation of the model, since more training data helps improve the performance of the model [38]. Normally, when working with a real-life problem, the test set can't be generated but must be held out from the original data set. There are several rules of thumb for splitting the data into the different sets; a couple of years ago, the most common choice was 60% train, 20% validation and 20% test. The important part is that there is enough data in the test and cross-validation sets to enable proper evaluation and not overfit the model; the rest should be used for training to improve the model [7]. However, as mentioned in section 2.2, the amount of available data has increased enormously, and for deep learning tasks, splits such as 90%, 5%, 5% or even 99.5%, 0.25%, 0.25% are not unheard of according to an expert in the field, A. Ng. [38]

2.4 Artificial Neural Networks

An artificial neural network (ANN) is a framework to approximate a function, given a set of inputs and outputs. The building blocks of these networks are called nodes.

Nodes

Nodes generate a signal, a, the value of which is determined by the internal state of the node as well as the input to the node. The internal state is composed of a vector of weights, w.


These weights are multiplied with the input to the node, x, and a bias, b, is added. This value is then passed through a non-linear activation function, g, which transforms the value and constrains the output to be within a certain range.

a = g(w^T x + b)    (2.4)

An example of an activation function is the sigmoid function. It returns output between 0 and 1 and is defined in equation 2.5 below. Similar to the sigmoid function is tanh, defined in equation 2.6, which returns output between -1 and 1.

σ(x) = \frac{1}{1 + e^{-x}}    (2.5)

tanh(x) = 2σ(2x) - 1    (2.6)

Both functions are non-linear, which gives a network the powerful property of being able to output functions which aren't just linear combinations of the input. See the left graph in figure 2.5 for a visual representation of the sigmoid activation function.

Figure 2.5: Sigmoid activation function to the left and ReLU to the right.
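As a minimal sketch of equations 2.4 to 2.6, the snippet below computes the activation of a single node in numpy; the weight, input and bias values are arbitrary illustrations.

```python
import numpy as np

def sigmoid(x):
    # Equation 2.5: squashes any input into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # Equation 2.6: tanh expressed through the sigmoid, range (-1, 1)
    return 2.0 * sigmoid(2.0 * x) - 1.0

def node_activation(w, x, b, g=sigmoid):
    # Equation 2.4: a = g(w^T x + b)
    return g(np.dot(w, x) + b)

a = node_activation(w=np.array([0.5, -1.2]), x=np.array([1.0, 2.0]), b=0.1)
print(a)  # a single scalar signal in (0, 1)
```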

Feedforward Neural Networks

The network is built by a set of nodes, structured in different layers. The nodes in each layer are interconnected via edges to the nodes in the next layer. Depending on the direction of these connections, the network falls into one of several categories. In the case that these nodes and edges can be represented as a directed acyclic graph, the network is called a feedforward neural network and the layers are called dense layers. In a feedforward neural network, the information can only flow from a previous layer to a later one, not in reverse and also not from one node in a layer to another node in the same layer. The first layer, which receives the input data, is called the input layer. The last layer, which produces the output of the network, is called the output layer. All layers in between are referred to as hidden layers; in figure 2.6, the number of hidden layers is two. Together, these stacked layers achieve what a single node is able to do, but they also allow for modelling patterns of much higher complexity than a single node. A feedforward neural network can be modified to work on input data which varies with time. In that case, the input data for each time step is fed to a dense layer which calculates an output. The weights of the layer aren't updated until the input for all time steps has passed through it. A layer of this kind is called a time distributed dense layer.


Figure 2.6: A feedforward neural network with four dense layers.

2.5 Training a Network

In order to measure the performance of a neural network, a performance metric, J(w), has to be defined. Commonly, a loss function is used which indicates how far the approximation of the network deviates from the training set, i.e. the ground truth of the task. After a prediction has been generated by the network for each of the training samples and the loss has been calculated, an algorithm is needed to identify which nodes of the network contributed to the loss. This is exactly what the back-propagation algorithm does. It propagates through the network using the chain rule for derivation. In combination with gradient descent, the values of the weights w can be calculated which lead to the greatest decrease of the loss. [16] Gradient descent is an optimization algorithm which attempts to find the minimum value of a function. It does so by calculating the gradient, i.e. the direction in which the value of the function decreases the fastest. The magnitude of the gradient is equal to the steepness of the slope. Thereafter, the values of the parameters of the function are adjusted according to the direction and size of the gradient. This can be likened to taking a step in the steepest direction of the function surface. The step length is not only regulated by the size of the gradient but also by the so-called learning rate. In addition to vanilla gradient descent, there are extensions such as stochastic gradient descent, RMSProp and the Adam optimization algorithm which attempt to combat common issues encountered when using gradient descent. A technique which all of these have in common is that they divide the training set into batches. A batch is a subset of the training set which serves as a sample of the whole set. Using batches decreases the space required to load the input data into memory and speeds up the computation of the output. Stochastic gradient descent is an extreme example where the batch size is reduced to one. In practice, mini-batches are more commonly used and often contain between 32 and 512 training samples. [16]
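The following sketch shows the mini-batch variant of gradient descent described above; grad_fn stands in for back-propagation, and all names and default values are illustrative assumptions, not code from the thesis.

```python
import numpy as np

def minibatch_gradient_descent(w, X, y, grad_fn, lr=0.01, batch_size=32, epochs=10):
    # grad_fn(w, X_batch, y_batch) should return the gradient of the loss
    # with respect to w, e.g. as computed by back-propagation.
    n = len(X)
    for _ in range(epochs):
        idx = np.random.permutation(n)           # shuffle once per epoch
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            g = grad_fn(w, X[batch], y[batch])   # gradient on one mini-batch
            w = w - lr * g                       # step scaled by the learning rate
    return w
```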

The Adam Optimizer

The Adam optimizer was first introduced in a paper by D. P. Kingma and J. Lei Ba [26]. The name Adam comes from the term adaptive moment estimation. Starting with the term adaptive, this means that each w_i of w has its own learning rate, i.e. each weight has its own step size when performing gradient descent, and the step size is changed throughout the training. This makes sense since it would be very unlikely that all the weights in a network would improve the fastest by being multiplied by the same scalar. The term moment estimation describes the way Adam uses exponential averaging of the previous gradients to update w. The moment estimation can be divided into two parts: estimation of the first moment, i.e. the mean of the gradient, and estimation of the second moment, i.e. the uncentered variance of the gradient.

S_t = β S_{t-1} + (1 - β) Y_t,    t > 1    (2.7)


In the above equation 2.7, which is a general definition of the exponential moving average, t denotes the time step, S_t is the exponential moving average, β is the decay rate and Y_t is the value of the series for which the exponential moving average is being calculated. S_1 is normally set to Y_1. The value of β decides the importance of the most recent value of the time series in comparison to older ones. In the context of Adam, the exponential moving average can be applied in the following way. Let g_t denote the gradient at time step t; then the estimation of the first moment is

m_t = β_1 m_{t-1} + (1 - β_1) g_t,    t > 1    (2.8)

and the estimation of the second moment is similarly:

v_t = β_2 v_{t-1} + (1 - β_2) g_t^2,    t > 1    (2.9)

What separates m_t and v_t from the definition seen in equation 2.7 is that they are initialized to 0, i.e. m_0 = v_0 = 0. This initialization introduces a bias which luckily can be remedied by a bias correction.

\hat{m}_t = \frac{m_t}{1 - β_1^t}    (2.10)

\hat{v}_t = \frac{v_t}{1 - β_2^t}    (2.11)

This leads up to the definition of the rule for weight updates:

w_t = w_{t-1} - η \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + ε}    (2.12)

where ε is added to avoid dividing by zero and the whole rightmost expression is multiplied by the hyperparameter η, which controls the step size. The actual step size when updating a weight has two upper bounds defined by the value of η, which makes it more intuitive than the standard learning rate seen in gradient descent. Furthermore, the moment estimation causes the weights to move in the direction of the previous gradients even if the current gradient is close to zero; this helps the optimizer avoid getting stuck in local optima or around saddle points.
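A minimal numpy sketch tying equations 2.8 to 2.12 together into one Adam update; the default values for η, β_1, β_2 and ε are the common choices from the original paper.

```python
import numpy as np

def adam_step(w, g, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # One Adam update at time step t (t >= 1), with m = v = 0 before t = 1.
    m = beta1 * m + (1 - beta1) * g              # first moment, eq. 2.8
    v = beta2 * v + (1 - beta2) * g ** 2         # second moment, eq. 2.9
    m_hat = m / (1 - beta1 ** t)                 # bias correction, eq. 2.10
    v_hat = v / (1 - beta2 ** t)                 # bias correction, eq. 2.11
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)  # update rule, eq. 2.12
    return w, m, v
```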

2.6 Convolutional Neural Networks

Since the structure of the feedforward neural network assigns one weight to each input variable, the number of weights required for handling large inputs is high. An example domain which illustrates this is image classification. The input variables are the color values of each pixel, and while this might be feasible for a small gray-scale image of, for example, the dimensions 28x28x1, the approach quickly becomes unfeasible when e.g. colored images from a modern camera are to be processed, with a dimension of 3872x2592x3, which amounts to almost 10 million pixels with 3 channels each. Some data is ordered in a way that conveys additional information, and a feedforward neural network fails to capture this relationship. The previous example of images displays this property in that groups of adjacent pixels form objects. Another example is sound waves, where e.g. the most recent frequencies convey information about what the current frequency might be. A CNN attempts to solve these issues by convolving the input, i.e. passing a filter over it. This causes adjacent input to be interpreted together and reduces the number of parameters in the network. [14]


Convolving an Image

To show an example of convolution, a vanilla convolution will be performed on a 2D image with one color channel. The image has the dimensions 4x4 and the filter has the dimensions 2x2. For simplicity's sake, each of the pixels can either have the value 1, which represents the color black, or 0, which represents white. Filters can be used to look for different kinds of features in an image; a simple example of a low-level feature is an edge.

Figure 2.7: Convolving a 4x4 image with a 2x2 filter produces a 3x3 output. The convolved image has a black top half and a white bottom half and the weights of the filter are chosen to detect edges.

The first step of the convolution depicted in figure 2.7 is to element-wise multiply the filter with the top-left corner of the matrix, which is marked with a dark blue border. The product of this operation is shown in the top-left cell of the matrix to the right, which also has a dark blue border. The product is calculated as 1·1 + 1·1 + 1·(-1) + 1·(-1) = 0. Afterwards, the filter is moved one step to the right and is element-wise multiplied with the four values marked by a red background. Next, the filter slides one more step to the right to produce the 0 in the top-right corner. After that, the filter is multiplied with the two middle rows of the first and second column, one row below the first convolution marked by the dark blue borders. In this position, the element-wise multiplication results in 1·1 + 1·1 + 0·(-1) + 0·(-1) = 2. The 2 can be spotted in the middle row to the left in the result matrix. [14] The result matrix can be interpreted as there being a horizontal edge in the middle of the image, which is true. Another property of the result is that the dimensionality has been reduced in comparison to the original image. In a CNN, the values of the filter aren't static; instead, the values are trainable parameters which are continuously updated when training the network. The filters are thereby able to be improved to capture the features which matter the most when determining the output. As mentioned at the start of this section, feedforward neural networks don't scale well to large inputs. However, the filters contain far fewer parameters and are therefore scalable. In the simple example shown in figure 2.7, the number of parameters required for a feedforward neural network would be 4·4 = 16, one for each pixel. The filter only contains four parameters. In more advanced cases, a larger number of features would have to be identified in an image. The solution is then to increase the number of filters as well as adding more layers to the CNN. More filters enable detecting more features and more layers allow detecting more abstract features. [14, 16] In the given example, the result matrix describes "an edge in the middle of the image" in fewer dimensions than the original image. An example can be imagined where a CNN is trained to perform face detection; in that case, the output of a layer deep into the network might be interpreted as "there is probably a pair of eyes in the top-left region of the image", which is more abstract than edge detection and certainly more abstract than the original pixel input.
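The walkthrough can be checked with a few lines of numpy; the loop below performs the valid convolution from figure 2.7 and prints a 3x3 result whose middle row of 2s marks the horizontal edge.

```python
import numpy as np

# 4x4 image: black (1) top half, white (0) bottom half
image = np.array([[1, 1, 1, 1],
                  [1, 1, 1, 1],
                  [0, 0, 0, 0],
                  [0, 0, 0, 0]])

# 2x2 horizontal edge filter from figure 2.7
kernel = np.array([[ 1,  1],
                   [-1, -1]])

# Sliding the filter over the image with stride 1 gives a 3x3 output
out = np.zeros((3, 3))
for i in range(3):
    for j in range(3):
        out[i, j] = np.sum(image[i:i + 2, j:j + 2] * kernel)

print(out)  # [[0 0 0], [2 2 2], [0 0 0]]
```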


2.7 RNN and Sequence to Sequence Models

Sequence to sequence models take an input sequence, x, and turn it into an output sequence, y. Examples of sequence data tasks are speech recognition and sentiment classification. To do speech recognition, a model has to be developed which turns audio waves into written language, e.g. a sentence. An example of sentiment classification is turning a movie review such as "This movie shouldn't have been made" into a rating such as one out of five. If an artificial neural network contains feedback connections, i.e. nodes which connect to themselves, it is referred to as a recurrent neural network (RNN). This feedback connection enables the network to pass on current information to future states, thus making it useful when modeling sequences like time series data. An RNN is a good choice for processing sequential data, but inherently it does not support turning an input sequence of one length into an output sequence of a different length. To remedy this, two RNNs with different parameters can be combined: the first one to encode the input to a certain size and the second one to take the encoded input and decode it to an output of a different size. Even in this improved state, there are a number of issues with the nature of its architecture, such as a large number of parameters, learning long-term patterns and vanishing gradients. [16]

f (x) = max(0, x) (2.13) The right graph in figure 2.5 displays the above equation 2.13 which defines ReLU, it produces a gradient value of 1 for x ą 0 or 0 for x ď 0. This is an improvement over the sigmoid activation function and suits the discrete nature of computers far better. ReLU is now the most used activation function for deep neural networks. [28] Apart from the attempts made to combat issues of RNNs, there have also been attempts at using other neural network architectures on sequential data [30, 1]. One example of this is WaveNet, a one-dimensional CNN modified to forecast time-series.

2.8 WaveNet

In the paper where the WaveNet architecture was first described, the authors A. Van Den Oord et al. showed that it was possible to generate raw audio waves from both text and music using a CNN. The technique to generate audio based on text is known as Text-To-Speech (TTS). Compared to other state-of-the-art techniques, e.g. long short-term memory networks (LSTM), the performance of WaveNet on this task was a major improvement. [49] The WaveNet model generates raw audio waveforms. In mathematical terms, the joint probability of the generated wave x = x_1, x_2, ..., x_T can be factorized as a product of conditional probabilities, where for each time step, x_t is conditioned on the previous time steps, i.e. x_1, x_2, ..., x_{t-1}.

p(x) = \prod_{t=1}^{T} p(x_t | x_1, ..., x_{t-1})    (2.14)


To model the distribution p(x_t | x_1, x_2, ..., x_{t-1}) in equation 2.14, multiple so-called causal dilated convolutional layers are used. Dilation is a technique where a subset of the previous sequential data is used to model the probability of time step t. For example, x_1, x_2, x_4 and x_8 are used to determine p(x_9 | x_1, x_2, x_4, x_8) instead of all eight previous values [56]. The technique reduces the number of computations and increases the speed with which the CNN converges. A causal convolution means that no future samples, e.g. x_{t+1} or x_T, are used to model p(x_t). This is critical since a model which relies on knowing the future to predict the present would be useless. Figure 2.8 displays an example of what calculations are required to generate an output. In the figure, the filter size is two and the dilations are set to 2^layer starting from the first hidden layer. No node contributes to the input of a node in a deeper layer more than once, due to the dilation values. The dilations also result in a larger receptive field, i.e. inputs from further back in time are used.

Figure 2.8: A figure showing a stack of dilated causal convolutional layers. [49]

The authors also describe how the joint distribution p(x) can be conditioned on another set of input variables, h, which represent local or domain-specific information. Examples of domain-specific information in the area of Text-To-Speech are information regarding different speakers or general information about the text. These conditional input variables can, as well as the hyperparameters of the WaveNet architecture, be adapted to fit other problem domains. To achieve a better performance, WaveNet makes use of residual and parameterized skip connections. Skip connections allow gradients to flow from a shallow layer to a deeper one without going through an activation function. Parameterizing them means multiplying the gradient of the connection by a learnable parameter. Residual connections reframe the learning task of a layer, from trying to learn what the true output y(x) should be to trying to learn the residual, or difference, r(x) between the true output and the input of the layer, x.

y(x) = x + r(x)    (2.15)

A traditional network layer tries to learn y(x) in equation 2.15, while a residual layer tries to learn r(x). This has been shown empirically to increase the performance of deep CNNs, and it is believed that this is due to the fact that the layer can easily reproduce its input x just by keeping r(x) at 0. In a traditional layer, learning the identity function, i.e. passing the input as output, is difficult. [19]
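As a sketch of how these pieces fit together, the snippet below stacks dilated causal convolutions with residual and skip connections in the spirit of figure 2.8 and equation 2.15, using tf.keras. It is a simplified illustration under assumed filter counts and dilations, not the WaveNet or TDNet implementation; for instance, the gated activations of the original paper are omitted.

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters, kernel_size, dilation_rate):
    # Dilated causal convolution: no future time steps leak into the output
    conv = layers.Conv1D(filters, kernel_size, padding="causal",
                         dilation_rate=dilation_rate, activation="relu")(x)
    skip = layers.Conv1D(filters, 1)(conv)   # skip contribution, r(x)
    residual = layers.Add()([x, skip])       # y(x) = x + r(x), equation 2.15
    return residual, skip

inputs = tf.keras.Input(shape=(None, 32))    # (time steps, channels)
x, skips = inputs, []
for d in (1, 2, 4, 8):                       # dilations set to 2^layer
    x, s = residual_block(x, filters=32, kernel_size=2, dilation_rate=d)
    skips.append(s)
outputs = layers.Activation("relu")(layers.Add()(skips))
model = tf.keras.Model(inputs, outputs)
```

Doubling the dilation per layer gives the receptive field of figure 2.8: four layers with kernel size two already see 16 time steps back.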

2.9 Evaluation Metric

There are numerous ways of evaluating the accuracy of a prediction. One chosen for a task similar to predicting taxi demand, where the objective was to predict the demand of groceries, was the normalized weighted root mean squared logarithmic error (NWRMSLE) [9]. It is defined as the following:

NWRMSLE = \sqrt{\frac{\sum_{i=1}^{N} w_i (\log(\hat{y}_i + 1) - \log(y_i + 1))^2}{\sum_{i=1}^{N} w_i}}    (2.16)

where N is the number of data points in the data set, i is the index of each individual data point, ŷ_i is the predicted value and y_i is the true value. In this case, expiring groceries motivate a weighting factor. The weights, w_i, were added to penalize expiring groceries (1.25 for perishable groceries and 1.00 for all other items). The taxi domain doesn't motivate having different weights for different kinds of demand and therefore the NWRMSLE can be simplified to:

RMSLE = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (\log(\hat{y}_i + 1) - \log(y_i + 1))^2}    (2.17)

which is also known as the root mean squared logarithmic error (RMSLE). What separates this from the RMSE, as defined in section 2.3, is that given the same distance between the predicted and the true value, a larger error is given when both values are small, compared to when they are large. The metric is therefore applicable when predicting across a large range and magnitude of values.
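A direct numpy translation of equation 2.17; the two example calls illustrate the property just described, i.e. the same absolute distance gives a larger RMSLE for small values than for large ones.

```python
import numpy as np

def rmsle(y_pred, y_true):
    # Equation 2.17
    return np.sqrt(np.mean((np.log(y_pred + 1) - np.log(y_true + 1)) ** 2))

print(rmsle(np.array([2.0]), np.array([4.0])))      # ~0.51
print(rmsle(np.array([102.0]), np.array([104.0])))  # ~0.02
```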

2.10 Hyperparameter Tuning

Choosing the appropriate hyperparameters, or "model settings", often has a significant impact on the performance of a model. Training a machine learning model may require a lot of computational resources and time but only reveals how well the model does for one set of hyperparameters. A model has to be retrained to find out how well it performs for a different set of hyperparameters. Thus, finding the optimal or even satisfying values for these hyperparameters isn't a trivial task, and several strategies have been developed to find sufficient hyperparameter values. The first step for each of the commonly used strategies is for the developer to define possible values of each of the hyperparameters that should be tuned; this is known as the domain. Thereafter, either values are chosen manually by the developer or a search algorithm is applied. Manually choosing hyperparameters is easy to do but comes at the cost of efficiency. For supervised learning tasks, the performance of the chosen hyperparameters can easily be evaluated based on the size of the error metric. [16] In a study by Bergstra, J. and Bengio, Y., a simple neural network with seven hyperparameters is trained on seven similar data sets [2]. One of the data sets was MNIST, a well-known data set consisting of 70 000 handwritten digits presented in gray-scale; it was first established in a classic paper by LeCun et al. and is now available online1 [29]. Three examples of the 28x28x1 images can be seen in figure 2.9. Three of the other data sets were variations of MNIST and the last three were also image data sets of lower or similar complexity. For each of the data sets, only the learning rate, or the learning rate and a second hyperparameter, had significant relevance for the performance of the neural network. However, the second significant hyperparameter changed from one data set to the next. With this in mind, the authors suggest that in general, only a subset of the hyperparameters carry significance but that determining beforehand which ones these are is a very difficult task.

Figure 2.9: Example images in the MNIST data set; to the right of every image there are three classification probabilities generated by a CNN.

1 http://yann.lecun.com/exdb/mnist

Gridsearch

Gridsearch is an exhaustive search algorithm where all possible combinations of the hyperparameter domain are tested. The advantage of this is that the optimal set of parameters is found in the domain defined by the developer. This does however assume that there is enough time and computational resources to retrain the model for all the possible combinations of hyperparameters. The number of possible combinations grows exponentially, which is problematic when training complex models which may take a long time to train and can have more than a dozen hyperparameters. In practice, a combination of grid search and manual search is often used, where certain parameter combinations are skipped in favor of more promising ones a few iterations in. [2]
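A sketch of gridsearch over a small hypothetical domain; the hyperparameter names and the dummy objective are placeholders for training and validating an actual model.

```python
from itertools import product

# Hypothetical domain: every combination will be evaluated
domain = {
    "learning_rate": [0.1, 0.01, 0.001],
    "batch_size": [32, 128],
    "layers": [2, 4],
}

def objective(params):
    # Placeholder for training a model with `params` and returning its
    # validation error; a dummy value is used so the sketch runs end to end.
    return abs(params["learning_rate"] - 0.01) + abs(params["layers"] - 4)

best = None
for values in product(*domain.values()):  # 3 * 2 * 2 = 12 combinations
    params = dict(zip(domain.keys(), values))
    error = objective(params)
    if best is None or error < best[0]:
        best = (error, params)
print(best)
```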

2.11 Sequential Model-based Optimization

A step up from gridsearch, or gridsearch combined with manual search, is sequential model-based optimization (SMBO). In comparison to gridsearch, an SMBO algorithm actually uses the information from previous trials to estimate which hyperparameter values would yield better results. In comparison to manually picking values, a strategy which also takes history into account, SMBO does so in a statistically sound manner and doesn't require changing the code before each new run. To perform SMBO, five different parts are required. The first part is defining the hyperparameter domain, and the second part is defining an objective function which takes a set of hyperparameters as input and returns an error metric as output. The third part is a probability function which models the belief regarding what results the different possible hyperparameter sets will achieve. The fourth part consists of a criterion under which the next model to train is chosen. Lastly, a log is needed to store the results of the previous trials; this is used to update the probability function. [3] Concretely, the first part is a set of hyperparameters where each of them gets initialized as a probability distribution; this requires some qualified guesses from the developer. The distributions should preferably be selected according to research, experts or previous experience. The second part could be a machine learning model which gets trained with hyperparameter values that have been sampled from their respective distributions; the error metric would be the model's performance on the cross-validation set. The criterion used by several algorithms to choose which hyperparameter values are tried next is Expected Improvement (EI). In this case, the algorithm calculates how much the error on the cross-validation set is expected to decrease based on the distributions of the hyperparameters. In this example, the maximum decrease is what's sought since the objective function should be minimized. This algorithm, with some slight modifications, would work just as well if the objective function were to be maximized.

EI_{y*}(x) = \int_{-\infty}^{\infty} \max(y* - y, 0) \, p_M(y|x) \, dy    (2.18)

In equation 2.18, x is a set of hyperparameters, y* is a threshold of the objective function, y is the value of the objective function and p_M(y|x) is the probability distribution of y given x for the model M. In the case that p_M(y|x) is zero for all y that are lower than the threshold, no improvement is expected to be gained from this set of hyperparameters. p_M(y|x) is updated for each iteration; the historical results improve the knowledge of the function, which enables picking candidates for x that improve y.


2.12 Tree Parzen Estimator

An example of a SMBO-algorithm is the Tree Parzen Estimator (TPE). It has been used to construct models which were able to produce state-of-the-art results by efficiently identifying good sets of hyperparameters. [3]

p(y|x) = \frac{p(x|y) p(y)}{p(x)}    (2.19)

To approximate p_M(y|x) in equation 2.18, the TPE uses Bayes' rule, defined in equation 2.19. Furthermore, p(x|y) = l(x) if y < y* and p(x|y) = g(x) if y ≥ y*. l(x) can thus be interpreted as a function which defines a probability distribution of promising values of x, and g(x) as the opposite. Specifically, l(x) is sampled to produce a set of candidates and then these are evaluated under the criterion min(g(x)/l(x)), i.e. x should be chosen so that the probability of a low error is high and the probability of a high error is low. This means that y* must be chosen so that there exists at least one point where y(x) < y*, in order for l(x), and by extension the criterion, to be defined.
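The thesis does not prescribe a particular implementation at this point, but hyperopt is one widely used library that implements TPE; the sketch below wires the five SMBO parts together with a hypothetical search space and a dummy objective.

```python
import numpy as np
from hyperopt import fmin, tpe, hp, Trials

# Part one: the domain, with a prior distribution per hyperparameter
space = {
    "learning_rate": hp.loguniform("learning_rate", np.log(1e-4), np.log(1e-1)),
    "batch_size": hp.choice("batch_size", [32, 64, 128, 256]),
}

def objective(params):
    # Part two: train with `params` and return the validation error
    # (a dummy value here so the sketch runs).
    return (params["learning_rate"] - 0.01) ** 2

trials = Trials()  # part five: the log of previous results
best = fmin(fn=objective, space=space,
            algo=tpe.suggest,  # parts three and four: TPE's model and criterion
            max_evals=50, trials=trials)
print(best)
```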

2.13 SARIMA

The statistical model ARIMA, or autoregressive integrated moving average, was popularized by Box, G.E.P. and Jenkins, G.M. in their book "Time series analysis: forecasting and control", first released in 1970 [5]. It is a linear model for time-series analysis and forecasting. The model linearly combines previous values of the response variable and its errors to be able to predict what value it will take in the future.

y_t = θ_0 + β_1 y_{t-1} + β_2 y_{t-2} + ... + β_p y_{t-p} + e_t - θ_1 e_{t-1} - θ_2 e_{t-2} - ... - θ_q e_{t-q}    (2.20)

In the above equation, y_t is the value of the response variable for time step t and e_t is the error. The parameters of the model are β_i, with i ranging from 1 to p, and θ_j, with j ranging from 1 to q. The values of these parameters impact the model's performance. Finding parameter values is achieved by going through the so-called Box-Jenkins methodology. The first step, according to the authors, is performing model identification, i.e. a model should be proposed which auto-correlates similarly to the data. Autocorrelation is a measure of how well a time-series correlates with a delayed copy of itself, depending on the delay. A prerequisite for identifying the model is that the time-series is stationary, i.e. that the mean and autocorrelation don't change over time. This is either the original state of the time series or can be achieved by selecting an appropriate value for the parameter d, which controls the degree of differencing. To check whether the time-series is stationary or not, the augmented Dickey-Fuller test can be performed [13]. The second step of the Box-Jenkins methodology is identifying the parameters p and q which minimize the error. p and q control the time lags and the order of the moving average respectively. These values can be found by plotting the partial autocorrelation and autocorrelation functions and observing for which time step they approach zero. The last step of the method is to evaluate the proposed model on its accuracy to make sure that the previously made assumptions hold. An extension of the vanilla ARIMA model is the seasonal ARIMA model, or SARIMA. This model allows for selecting a seasonal factor m, which for example could be three months if the data contains a quarterly pattern. Three parameters are added to the model which correspond to (p, d, q) but on a seasonal basis. [5]
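As an illustration of fitting such a model in practice, the sketch below uses the SARIMAX implementation from statsmodels; the demand series is synthetic and the (p, d, q) and (P, D, Q, m) orders are placeholders that would normally come out of the Box-Jenkins steps above.

```python
import numpy as np
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Synthetic hourly demand series with an assumed daily (m = 24) seasonality
rng = np.random.default_rng(0)
demand = rng.poisson(lam=3.0, size=24 * 60).astype(float)

model = SARIMAX(demand,
                order=(1, 0, 1),               # p, d, q
                seasonal_order=(1, 1, 1, 24))  # P, D, Q, m
fitted = model.fit(disp=False)
forecast = fitted.forecast(steps=26)           # a 26-hour forecast
```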


2.14 Stacked ensembles

A possibility when faced with a supervised machine learning problem is developing several models which can perform the task and then combining and weighing their outputs to generate a prediction stronger than each of their individual outputs. Different algorithms have different strengths and weaknesses, and while some might have issues modelling a certain part of the problem domain, others might successfully model the same part but fail somewhere else. By combining, for example, an ARIMA model, two deep neural networks and a gradient boosting machine, which is another machine learning algorithm, a stacked ensemble can be created which is able to compete with a more complex model. [42]
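A minimal sketch of the combination step: the ensemble forecast is a weighted average of the base models' forecasts. The base-model outputs and weights below are made-up illustrations; in practice the weights would be tuned on a validation set.

```python
import numpy as np

def weighted_ensemble(predictions, weights):
    # Normalize the weights and combine the base-model forecasts
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    return sum(w * p for w, p in zip(weights, predictions))

p_sarima = np.array([3.0, 4.0, 2.0, 5.0, 1.0])   # hypothetical forecasts
p_nn     = np.array([2.5, 4.5, 2.5, 4.0, 1.5])
p_gbm    = np.array([3.5, 3.5, 2.0, 4.5, 1.0])

combined = weighted_ensemble([p_sarima, p_nn, p_gbm], weights=[0.2, 0.5, 0.3])
```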

2.15 Mixed Precision

Matrix multiplications are computed very often in a neural network, e.g. when calculating the output of a neuron, see equation 2.4. How the matrices are represented in computer memory affects how expensive the calculation is. Typically, 32-bit floating point precision is used to represent numbers; they can then range from −3.4 · 10^38 to 3.4 · 10^38. Reducing the precision has the positive effects of reducing the space needed to store the number and the energy required by the processing unit to do computations, and of improving the performance of operations done on the number. The downside is that the numbers can't be as accurately represented, i.e. a loss in precision. Specifically, when switching from 32-bit to 16-bit floating point precision, the largest representable magnitude shrinks to about 6 · 10^4 and the smallest normal positive number is about 6 · 10^−5. In addition, the precision with which a decimal can be represented is reduced. It has been shown that this range and precision loss has a limited negative impact on the performance of deep neural networks while still producing the positive effects. [11, 23]
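The ranges above can be checked directly with numpy; a minimal sketch:

import numpy as np

f32, f16 = np.finfo(np.float32), np.finfo(np.float16)
print(f32.max)   # ~3.4e38: float32 spans roughly -3.4e38 to 3.4e38
print(f16.max)   # 65504.0, i.e. roughly 6e4
print(f16.tiny)  # ~6.1e-5, the smallest normal positive float16
print(np.float16(1.0) + np.float16(0.0001))  # 1.0: the small increment is lost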

2.16 Method

To ensure that the research is conducted in compliance with scientific standards, a couple of relevant studies which provide guidelines and best practices have been reviewed. The CRoss-Industry Standard Process for Data Mining (CRISP-DM) is a process used during data mining and machine learning projects. It was first described by Rüdiger Wirth et al. in 2000 and consists of six phases: business understanding, data understanding, data preparation, modeling, evaluation and deployment. Although primarily created for teams, it provides a framework of common terminology, thus making it easier to communicate the development process to outside stakeholders. It also provides a logical order in which to complete the phases of a data mining or machine learning project. [53] A concrete example of how to empirically evaluate the performance of supervised learning algorithms is provided by R. Caruana and A. Niculescu-Mizil. They provide guidelines pertaining both to process and to how to properly evaluate algorithms within machine learning. The empirical evaluation guidelines cover for example hyperparameter tuning, proper evaluation of error metrics and bootstrap analysis. They also provide a clear overview of their processes, thoroughly describe their technical choices and point out the key principles behind splitting the data into training, testing and validation sets. [7]

3 Literature Review

The purpose of this section is to contextualize this thesis and TDNet in the current field of research. Simply put, the idea is to apply a WaveNet architecture, which has provided state- of-the-art results in other domains, to the domain of taxi demand prediction.

3.1 WaveNet Architectures

The WaveNet architecture has previously been used for problems outside of the domains of the original paper. A number of examples can be found on the data science platform Kaggle1, where its members, who range from beginners to industry experts and researchers, can compete. Concretely, Glib Kechyn et al. came in second place in the Corporacion Favorita Grocery Sales Forecasting competition, where they proposed a WaveNet architecture to predict sales for a large grocery chain [9, 25]. As the defined prediction horizon was 16 days, modifications were made to the original WaveNet model to output sequential predictions for the entire period. A problem that occurred due to predictions being conditioned on previous predictions was accumulating errors. To handle this they implemented a sequence-to-sequence learning method using an encoder-decoder. The first 1D CNN encoded a sequence of grocery sales data into a fixed-length vector which represented the original sequence. The second 1D CNN then decoded the output of the first network back to a sequence with a length of 16 days. Glib Kechyn et al. did not share the parameters between the encoding and the decoding network. The architecture of the network which was used by the authors is similar in its structure to what has been used in this paper. There are many details of the implementation which the authors have chosen not to present in their paper and which surely differ. Furthermore, there are two major differences relating to the task, the obvious one being that the domain is different. The second one is the difference in available data: the data set used by Glib Kechyn et al. contained more than 125 million observations in comparison to approximately 1.6 million for each of the cities in this paper. Additionally, the model described in this paper should perform well for multiple customers in multiple cities. Thus it is important for it to generalize well.

1https://kaggle.com

When predicting the demand of groceries, the authors explicitly state that they used all possibilities to increase the accuracy of their competition predictions; this most likely caused them to overfit to that particular dataset, thus reducing generalizability. Another domain where conditional time series forecasting is of great interest is finance. A. Borovykh et al. have successfully used a simplified WaveNet to perform multivariate forecasting on multiple exchange rates and the second largest stock market index in the U.S. [4]. They predicted daily prices, which reduced the size of the training set by a factor of 24 in comparison to measuring hourly prices. In order to reduce the training time of their model, ReLUs are used instead of gated activation units. The authors of the original WaveNet paper considered using ReLUs but discarded that idea based on their poor performance when modelling sound [35]. A. Borovykh et al. conclude that their solution offers a viable alternative to RNNs and traditional economic models when it comes to implementation difficulty, training required and performance. They exploit the correlation between the exchange rates by conditioning their WaveNet on multiple exchange rates. This is similar to what is done by the model presented in this paper, but with different city zones instead. A difference in the approaches is that TDNet is fed with taxi demand lagged by a day up to a year; this is done to further increase the receptive field, which is necessary since there are many more data points when measuring by the hour instead of by day.

3.2 Alternative approaches

A paper written by Lv et al. proposed using stacked autoencoders for the purpose of predicting traffic flow in the upcoming hour [32]. An autoencoder is a neural net which ideally outputs an exact reconstruction of its input. It achieves this by first encoding the input to a message with reduced dimensions; it then decodes/decompresses the message to an output which matches the input as closely as possible. The consequence of this procedure is that the autoencoder learns to filter out the noise in the input and to succinctly represent it. When stacking several of these neural nets on top of each other and finishing with a prediction layer, a deep neural network is created which can do e.g. time-series predictions [51]. Predicting traffic flow is similar to predicting taxi demand and this was one of the first attempts made at utilizing machine learning for such a task. Their model displayed great performance in comparison to the statistical models they used as benchmarks. Apart from applying deep neural networks, traditional statistical models can be used to predict taxi demand. Examples of these are time-weighted time-varying Poisson models and ARIMA [34], Markov predictors [59] and multi-level clustering [12]. The advantages of using these approaches are that they are well understood and less computationally heavy. The disadvantage is that they have difficulties modelling deep underlying trends, something that exists in the taxi demand and traffic flow domains [32]. To properly evaluate the balance between traditional statistical models and the increasingly popular deep neural networks, a SARIMA model has been chosen to serve as a benchmark in this paper.

3.3 Taxi Demand

K. Zhao et al. investigate the limit of predictability when it comes to taxi demand in NYC. They divide the city into zones based on large buildings and calculate three different kinds of entropy to approximate how well taxi demand can be predicted for each of the zones. They split up the causes for taxi demand into temporal and random correlation. A low random correlation indicates that a pure time-series model which only takes historical demand into account can predict the future demand well, and a high random correlation indicates that further information is needed. They find that the hourly limit of predictability for their small building zones is 83% on average. To predict the taxi demand, they use a hidden Markov model (HMM) and a shallow neural network. They conclude that the HMM, which is a pure time-series model like SARIMA, is faster and performs better than their NN in zones where the predictability is high.

The NN is slower but performs better in zones with low predictability, i.e. irregular demand.

4 Method

In this section, the data, the feature engineering, TDNet, the evaluation process, benchmarks and the hyperparameter tuning will be described.

4.1 Data Description

At the center of each machine learning project is the data. In this section, the data as well as the steps taken to clean it will be described. The data can be divided into company-provided data and data provided by external sources. In the interest of the confidentiality of TaxiCaller and their customers, certain details of the data will be omitted.

Data Provided by the Company

For each taxi trip, the coordinates of the pick-up point are recorded as well as when it occurred. The time is represented as a time stamp which contains the year, month, day, hour and minute of the pick-up. For business purposes, each city is divided into different zones of varying sizes and, based on the coordinates of the pick-up point, a zone is assigned to the trip. Furthermore, all the trips which occur during the same hour in the same zone are totaled and referred to as the zone demand. The data ranged from the 1st of January 2017 to February 2019. This range consists of approximately 18 500 hours.

External Data

As described in section 2.1 on taxi demand, factors such as the weather, national holidays, connecting traffic and special events affect the taxi demand. To model this, the Dark Sky1 API was used to gather information about the temperature in degrees Celsius, wind speed in meters per second and precipitation. The precipitation was further divided into type, i.e. rain or snow, intensity as measured in millimeters per hour and accumulation as measured in centimeters. If hourly data was available, it was used; otherwise, daily data was used. Information about national holidays was also added to the data set of the customer in city NE.

1https://darksky.net/dev

time         zone id  demand  holiday  precipitation type  wind speed  temperature
20170101T00  A        10      1        snow                2.1         -1.1

Table 4.1: Dataframe containing an example of data provided by Taxicaller merged with external data.

4.2 Data Cleaning

As data quality is a determining factor for machine learning results, the data had to be cleaned. To ensure that there were no outliers, negative zone demand, invalid zone ids or times outside the preset range, a script validated the company data before it was fed to the preprocessing step. This didn't eliminate any data points for either of the two cities, which is an indication of high quality data. The column precipitation accumulation only contained zero entries, which led to its removal. The column precipitation intensity contained a suspicious amount of zero entries; by manually checking another weather service it was concluded that zeros were added as the default value in the Dark Sky API and that very few of the data points were valid. Additionally, for the city NE, precipitation data was completely missing for almost a full year, which was remedied by adding data from a city close by.
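A minimal sketch of such a validation step, assuming a pandas dataframe with the columns from table 4.1; the column names and bounds are illustrative, not the actual script:

import pandas as pd

def validate(df: pd.DataFrame, valid_zones: set) -> pd.DataFrame:
    # Reject the kinds of outliers listed above before preprocessing.
    assert (df["demand"] >= 0).all(), "negative zone demand"
    assert df["zone id"].isin(valid_zones).all(), "invalid zone id"
    in_range = df["time"].between(pd.Timestamp("2017-01-01"),
                                  pd.Timestamp("2019-02-28"))
    assert in_range.all(), "time stamp outside the preset range"
    return df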

4.3 Data Splitting

Initially, the data set contained about 24 months of data for each city; 18 were used for training and three for cross-validation. Three months were used for the final evaluation and constituted the test set. The data was split immediately after the data cleaning was completed to prevent introducing a bias while exploring and analyzing the data with the test set in it. Otherwise, information about the test set would leak and most likely affect future decisions.

4.4 Data Preprocessing

In some zones there were only a couple of pick-ups for the investigated time period of two years. In other zones the number of pick-ups was negligible in comparison to the most active zones. Therefore a decision was made to remove the zones which accounted for less than 1% of the total taxi demand of the city, given that the city didn't contain more than 50 zones. At the end of this operation, about 200 000 data points remained for each of the cities; the zones which made the cut will be referred to as the significant zones. At this point in the preprocessing, a row in the data set would contain nominal and categorical numerical features, one time stamp and one categorical string feature. In table 4.1, an example dataframe with one row is displayed. To be able to feed the input to the model, the data was transformed to a 3D matrix, also known as a tensor, where the rows were different zones, the columns were hours and the third dimension varied with hour and zone. Features which didn't vary in all three dimensions were also transformed to this shape but received a dimension of one in the insignificant dimensions.

Feature Engineering

Numerical features could be fed directly to the model; they have however been transformed to formats which best represent the underlying information. For example, the zone id feature uniquely identifies a zone, is an integer and only has as many values as there are zones in a city. A higher id doesn't convey any more information than a lower id, nor any relation to any of the other zones; it just uniquely identifies the zone. Therefore it is a categorical feature and was transformed into a binary tensor. The same went for the precipitation type.


time         Zone A  Zone B  Zone C  snow  rain  NaN  demand
20170101T00  1       0       0       1     0     0    10

Table 4.2: Dataframe containing zones and precipitation type as binary features, where the zone is A and the precipitation type is snow.

Table 4.2 shows what the row in table 4.1 would look like after going through the binary transformation process and removing the holiday, wind speed and temperature features. It also reduces the number of zones to three; all of this is done to improve table readability. Whether a certain date is a holiday or not was represented by a binary variable which was 1 for holidays and 0 for normal days. The wind speed and temperature were normalized and standardized so that their mean was 0 and their standard deviation 1. The zone demand went through a logarithmic transformation, normalization and standardization. Scaling the features in this manner improves the efficiency of gradient descent and leads to faster training. [29] This leaves the time stamp, which contains plenty of important information. Initially, the time stamp column was divided into a year, a month, a day of month and an hour column. The day of the week was calculated based on the date. The year column was transformed from 2019, 2018, 2017 to 2, 1, 0, which is similar to removing the mean and moves the feature to approximately the same range as the other features. The categorical day of week column was one-hot encoded. Finally, the day of month and hour of day columns were turned into cyclical features by calculating their sine and cosine values. Under the assumption that there were cyclical patterns in the data, which the data exploration indicated, lag features were created. Concretely, for each zone the model was fed the demand of that zone one hour, 24 hours, one week, one month and one year before. Since the demand during the last hour might not be available in a production environment, it is only fed to the encoder during training, not to the decoder when making predictions. However, the other lag features are assumed to be available. The zone id constitutes what is called a conditional vector in the original WaveNet paper [49]. It isn't explicitly represented by a zone id, but each row always corresponds to the same zone and the distribution of demand in each of the zones is conditioned on the rest of them. This is similar to the original WaveNet being fed with the values of different speakers simultaneously. The model learns during the training phase how the zones or speakers relate to each other.
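The transformations above can be sketched as follows, assuming a pandas dataframe indexed by hourly time stamps; all column names are illustrative, not those of the actual pipeline:

import numpy as np
import pandas as pd

def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    df = pd.get_dummies(df, columns=["precipitation type"])  # one-hot categories
    df["hour_sin"] = np.sin(2 * np.pi * df.index.hour / 24)  # cyclical hour of day
    df["hour_cos"] = np.cos(2 * np.pi * df.index.hour / 24)
    log_demand = np.log1p(df["demand"])                      # log-transform, then
    df["demand_scaled"] = (log_demand - log_demand.mean()) / log_demand.std()
    for lag in (1, 24, 24 * 7, 24 * 30, 24 * 365):           # hour/day/week/month/year
        df[f"demand_lag_{lag}"] = df["demand"].shift(lag)
    return df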

4.5 Data Exploration

Numerous graphs were created to gain a better understanding of the distribution of the taxi demand based on the features. The historical demand was investigated by looking at its rolling means and standard deviations. In this phase, suspicions, such as that the taxi demand on weekdays and weekends should differ, were confirmed. Additionally, findings were made which indicated that holidays would be an interesting feature to add to the data set due to them causing spikes in the data. A check for linear correlation between the features and the response variable was performed without finding any relationships. The decision to remove the insignificant zones was made when examining the distribution of demand between zones. The demand throughout the day for the two cities can be seen in figures 4.2 and 4.1. The demand has been summed up for all zones for the whole training period and this total demand is displayed per hour as a fraction of the hour with the highest demand. For both cities, the demand is lowest in the middle of the night but still doesn't go below one quarter of the peak hourly demand. For city SA, the demand starts increasing rapidly at 9:00 up until its peak at 14:00.

Then it decreases until the morning hours. The demand for NE doesn't follow the same smooth curve; instead the demand increases from 4:00 until a peak at 8:00, after which the demand decreases slightly and stays level until 16:00, when it goes up and hits its peak at 18:00. Thereafter it decreases until 4:00. Most zones in both cities didn't contribute significantly to the demand; therefore the zones which didn't contain more than 1% of the demand were removed. This coincidentally left 14 zones in each of the cities. The distributions of demand between these zones can be seen in figures 4.3 and 4.4 and differ substantially. In SA, one zone stands for almost 35% of the demand and a few zones hover just over the red line which represents the 1% cut-off. In the other city, NE, the distribution is more even, with three zones over the 10% mark, seven between 4% and 10% and four between 1% and 4%.

Figure 4.1: True Demand per Hour in SA as a fraction of the max hourly demand.

Figure 4.2: True Demand per Hour in NE as a fraction of the max hourly demand.


Figure 4.3: Zone Demand Distribution in SA as a percentage of the total demand.

Figure 4.4: Zone Demand Distribution in NE as a percentage of the total demand.


4.6 Model Implementation

The model implementation is based on two open-source implementations which have both been used for Kaggle competitions. They were both created by the same author and claimed 4th and 6th place out of more than 1000 participants in their respective competitions [50]. This is notable since Kaggle is known for having competitions where the winners have used stacked ensembles on top of each other and overfit them to the competition problem to score as high as possible. This is what WaveNet was up against in these competitions as well, and it still outperformed most of these so-called stacked ensembles, see section 2.14. On top of that, the stacked ensembles in these competitions are usually composed of ready-made models which the developers don't have to implement themselves, letting them spend most of their time on feature engineering or hyperparameter tuning. The changes that make TDNet different from the open-source implementations have mainly been made to the feature engineering, architecture, hyperparameters, dependencies, batch generation and precision used. The layers and the algorithms for training and predicting have only been changed for the sake of updating the dependencies, not to fundamentally change the logic.

TDNet Architecture

In figure 4.5, a building block of the network can be seen. TDD stands for time distributed dense layer, Dilated Conv is a dilated convolution and σ represents the sigmoid activation function. In total, two of these blocks comprise TDNet; the first one takes the input features and produces a tensor which serves as part of the input for the second block. The input for each of the k layers is also saved and used as input for the second block. The output of the second block is future predictions of taxi demand. k has been tuned as a hyperparameter and is the same as the number of dilations.
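A hedged sketch of one such gated, dilated layer, written with the modern tf.keras API rather than the TF 1.x code actually used; the function name is illustrative and the residual addition assumes the input already has `filters` channels:

import tensorflow as tf

def gated_residual_layer(x, filters=8, width=2, dilation=1):
    # x: (batch, time, channels); causal padding keeps the network autoregressive.
    conv = tf.keras.layers.Conv1D(2 * filters, width, padding="causal",
                                  dilation_rate=dilation)(x)
    filt, gate = tf.split(conv, 2, axis=-1)
    out = tf.tanh(filt) * tf.sigmoid(gate)              # gated activation unit
    skip = tf.keras.layers.Dense(filters)(out)          # fed to the skip-connections
    residual = tf.keras.layers.Dense(filters)(out) + x  # residual path to next layer
    return residual, skip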

Precision

The frameworks described in 5.1 are all updated frequently and the new versions often improve performance, reduce bugs and add new features. A relevant example of this is the addition of full support for 16-bit floating point precision in CUDA 10, cuDNN > 7.4.1 and tensorflow-gpu >= 1.13, given that the hardware supports it. The default precision used in Tensorflow is 32-bit but, as discussed in 2.15, reducing the floating point precision can yield significant benefits. Therefore the dependencies of the original model were updated and the optimizer was switched to a mixed precision optimizer. Unfortunately, only partial support for mixed precision was achieved due to compatibility issues with a dependency management system.

Batches

Training and validation batches shared dimensions and were generated using the same method, except that they were drawn from different subsets of the data. As an example, if the date on which to predict the zone demand in all the zones was randomly selected to be the 1st of February 2018, then a batch contained the zone demand for all the zones from 30 days back up until the 31st of January. It also contained 30 days of lagged data from the day, week, month and year before. If information about the weather on the day, or whether it was a holiday, was to be fed to the model, it was added to the batch.
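A minimal sketch of this batch generation, assuming the demand is stored as an array of shape (zones, hours); the names and details are illustrative:

import numpy as np

def make_batch(demand, t0, history=30 * 24, horizon=26):
    # demand: array of shape (zones, hours); t0 indexes the first forecast hour.
    assert t0 - history - 24 * 365 >= 0, "t0 too early for the one-year lag"
    encoder_in = demand[:, t0 - history:t0]              # 30 days of context
    lags = np.stack([demand[:, t0 - history - l:t0 - l]  # the same window lagged
                     for l in (24, 24 * 7, 24 * 30, 24 * 365)], axis=-1)
    target = demand[:, t0:t0 + horizon]                  # the 26 hours to predict
    return encoder_in, lags, target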

4.7 Hyperparameter Tuning

The hyperparameter tuning was performed using the hyperopt implementation of TPE, which is described in section 2.12. Knowing which hyperparameters to tune beforehand is a hard task.


Figure 4.5: Building block of TDNet architecture; connecting two of these together formed TDNet. (The block runs Input → TDD → Dilated Conv → gated tanh/σ activation, with skip-connections and a residual + TDD path, repeated for k layers, followed by TDD + ReLU layers producing the Output.)

Based on the recommendations of an industry expert and on previous implementations of WaveNet, the hyperparameters listed in table 4.3 were tuned [39]. Due to the taxi data sets being smaller than those for which the models were originally built, both the number of layers as well as their width had to be reduced. Step size is the equivalent of the learning rate for the ADAM optimization algorithm; it determines how big the steps taken on the loss function surface are. Training steps determined for how many iterations TDNet was to be trained; in reality the early stopping conditions interrupted the training for most attempts, not the limit on training steps. Channels is the number of skip channels and residual channels. The widths of the two time distributed dense layers which made up the input layers, as well as of the layer just before the predictor, were also tuned. The number of filters, their widths and how much they were dilated were all decided by tuning the hyperparameters dilations and filter widths. If the filter widths (2, 2, 2, 2) were tried, only the first four dilation values were used. The different types of distributions are Log-Uniform, Discrete-Uniform and Choice. Log-Uniform allows for choosing values with probabilities ordered by magnitudes; in the case of the step size, the difference between 0.0001 and 0.001 is of greater interest than the difference between 0.8 and 0.9. Discrete-Uniform defines a uniform distribution between the Min and Max values, separated by the step size. Choice simply leads to a decision between the hard-coded values defined. The distribution choices have mathematical reasons, but the limits and the "Choices" have been selected due to similar values being found in other open-source implementations, albeit with scaled-up values.


Name            Distribution      Min      Max      Step
Step Size       Log-Uniform       0.0001   0.1
Training Steps  Discrete-Uniform  50 000   200 000  1000
Channels        Discrete-Uniform  2        8        2
Encode/Decode   Discrete-Uniform  8        40       16
Dilations       Choice*
Filter Widths   Choice**

Table 4.3: Table describing the probability distributions of the hyperparameters tuned. Choice* was defined as four different values, namely {(1, 4, 16, 64, 1, 4, 16, 64), (1, 2, 4, 8, 16, 32, 64, 128), (1, 2, 4, 8, 1, 2, 4, 8), (1, 8, 1, 8, 1, 8, 1, 8)}. Choice** was defined as {(2, 2, 2, 2), (2, 2, 2, 2)x2, (3, 3, 3, 3)x2}, where x2 means repeating the filter widths.
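A sketch of how this search space could be expressed in hyperopt, the library used for the tuning; the objective below is a stand-in for training TDNet and returning its cross-validation error:

import numpy as np
from hyperopt import fmin, hp, tpe

space = {
    "step_size": hp.loguniform("step_size", np.log(0.0001), np.log(0.1)),
    "training_steps": hp.quniform("training_steps", 50000, 200000, 1000),
    "channels": hp.quniform("channels", 2, 8, 2),
    "encode_decode": hp.quniform("encode_decode", 8, 40, 16),
    "dilations": hp.choice("dilations", [
        (1, 4, 16, 64, 1, 4, 16, 64), (1, 2, 4, 8, 16, 32, 64, 128),
        (1, 2, 4, 8, 1, 2, 4, 8), (1, 8, 1, 8, 1, 8, 1, 8)]),
    "filter_widths": hp.choice("filter_widths", [(2, 2, 2, 2), (2,) * 8, (3,) * 8]),
}

def objective(params):  # stand-in for training TDNet and returning
    return 0.0          # its cross-validation error

best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=100)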

4.8 Evaluation

The performance of a model was evaluated using either the RMSLE metric as defined in equation 2.17 or the RMSE from equation 2.3. If a model didn't achieve a better cross-validation error for 2000 iterations and was out of restarts it was stopped; otherwise it restarted from the point where it had achieved its lowest error, with a decreased step size. This is the same approach taken by J. Xu et al. in their paper to speed up training [55]. Furthermore, they also used RMSE to measure the performance of their LSTM for the same task. The task was to predict 26 hours ahead; if TDNet were to be used in a production environment, an overlap of two hours for each day would make sure that predictions were always available. A generated prediction of the hourly demand of a zone can be categorized into one of three different buckets: an underestimation, where the true value is two or more above the prediction; an overestimation, where the true value is two or more below the prediction; and an accurate prediction, where the true value is less than two away from the prediction. These values have been decided based on the demand distribution in the two cities investigated. For other cities the same principle applies but not the same exact values.
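A minimal sketch of this bucketing; the function name is illustrative:

import numpy as np

def bucket_shares(y_true, y_pred):
    diff = np.asarray(y_pred) - np.asarray(y_true)
    accurate = np.abs(diff) < 2  # within +1/-1 of the truth
    over = diff >= 2             # prediction two or more above the truth
    under = diff <= -2           # prediction two or more below the truth
    return accurate.mean(), over.mean(), under.mean()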

Cross-Validation

Validation batches were generated from the cross-validation subset of the data; this subset was made up of the months closest to the test set. A validation batch had the same dimensions as a training batch, i.e. it saw 30 days of training data for all significant zones, starting at a random point in the validation set range, and then predicted the demand of the next 26 hours. The error was calculated for this randomly chosen 26-hour period in all zones. An issue with this is that some dates will contain less noise than others and be easier to predict. Since the model selection is based on the cross-validation error, this could lead to an inferior model being selected just because the validation happened to occur on a day that was easy to predict. To combat this, a loss averaging window was applied which calculates the average training and validation error over the last 100 training steps. This averaged error serves as the metric used to pick the best model throughout the training process. Calculating a rolling mean of the errors adds robustness and ensures that a model which generalizes well is selected.
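A minimal sketch of the loss averaging window, assuming the per-step validation errors are logged in an array; the helper name is illustrative:

import numpy as np

def best_checkpoint(val_losses, window=100):
    # Rolling mean of the validation error over the last `window` steps; the
    # checkpoint with the lowest average is the one kept.
    rolling = np.convolve(val_losses, np.ones(window) / window, mode="valid")
    return int(np.argmin(rolling)) + window - 1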

Final Evaluation

For the final evaluation, the chosen model has predicted the demand in all zones between the 1st of October 2018 and the 10th of January 2019. The month of September 2018 is also included in the test set as unseen data but is only used to provide context for predicting the demand in October. The error for this whole period for all zones was calculated.

That error constitutes the final result for a certain city. This was compared to the benchmark algorithms. SARIMA went through the same procedure, where 26-hour forecasts were generated, but the stacked ensemble, which isn't dependent on historical demand and therefore can predict the demand for arbitrary time periods at any time, only predicted 24 hours for each day. This means that the exact same task wasn't performed, but the evaluation is fairer in the sense that the models are used in the same way that they would be used in a production environment.

4.9 Benchmarks

In order to compare the results, two benchmarks have been implemented: a SARIMA model and a stacked ensemble of supervised learning models. Both benchmarks have been evaluated using the same error metric over the same dates. Since the purpose of the benchmarks is to put the performance of TDNet in context, their details are not described as thoroughly, nor are all terms defined. The machine learning models are different supervised regression models. This means that they are fed the same data as TDNet and the time stamps are converted to features which convey information more clearly than the time stamps alone; these features have also been fed to TDNet. However, the machine learning models don't inherently treat the time stamps as having a temporal order. The machine learning models which have been considered are the following: a random forest, an extremely-randomized forest, XGBoost, a random grid of Gradient Boosting Machines (GBMs) and a random grid of deep neural networks. A stacked ensemble model was then trained using the models which performed the best and was used to generate predictions. This stacked ensemble was created using a framework which has implemented all the different models and provides a fairly simple programming interface which handles details under the hood. The traditional statistical time-series forecasting model was a SARIMA model. The first step in using it was investigating whether the time-series was stationary or not, which was done with the augmented Dickey-Fuller test. Secondly, the parameters p and q were identified by plotting the partial autocorrelation and autocorrelation functions. Different SARIMA models were created for all the different zones in both of the cities. Thirdly, the seasonal parameters were chosen so as not to induce non-stationarity, which was necessary for the implementation of this model to work. Given that the seasonal parameters met these conditions, they were set according to rules laid out by R. Nau [36]. The seasonality chosen was 24 hours, meaning that the model used data from 24 hours back to predict the current demand. SARIMA models have previously been used to predict taxi demand [34].
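A sketch of one per-zone SARIMA benchmark with statsmodels, the module used for this benchmark; the series is a dummy stand-in and the orders are illustrative, not the values actually identified:

import numpy as np
from statsmodels.tsa.statespace.sarimax import SARIMAX

series = np.random.default_rng(2).poisson(3.0, 24 * 90).astype(float)  # dummy zone
model = SARIMAX(series, order=(1, 0, 1), seasonal_order=(1, 1, 1, 24))
fitted = model.fit(disp=False)
print(fitted.forecast(steps=26))  # the same 26-hour horizon used for TDNet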

4.10 Feature Importance

To evaluate the impact of the features added from the holiday and weather data, i.e. the external data sources, the model was retrained without them for the city NE. Its performance was then measured and compared to that of the same model, with the same hyperparameters, trained with access to the additional data. Specifically, the input dimensions were changed from (zones, hours, 26) to (zones, hours, 22) due to the holiday, precipitation-type, temperature and wind speed columns being removed from the third dimension. This demand-only model thus had fewer parameters.

4.11 Rounding

If TDNet were to be put in a production environment, the use-case would most likely demand delivering predictions in the form of integers. In order to measure the impact this would

have on the performance, the final evaluation errors were also calculated for floating point predictions rounded to integers.

4.12 Models trained

Several TDNet models have been trained to enable accurately answering the research questions. For the city NE, four different models were trained and hyperparameter tuning was performed for two of them. For the city SA, two different models were trained and hyperparameter tuning was performed for one of them. The TDNets to predict SA demand were trained using either the RMSE or the RMSLE loss function. The reason that two more models were trained to predict the demand in NE is that the impact of weather and holiday data was evaluated on that city. Hyperparameter tuning was performed using the RMSLE loss function for NE and RMSE for SA. For NE the hyperparameter tuning ran for 100 iterations, for SA only 30 due to time constraints.

5 Empirical Evaluation

The setup for the performed experiments as well as the results are presented in this section. For two of the company's customers, the hyperparameters of the architecture have been tuned. The results for these two cities, NE located in northern Europe and SA located in South America, are presented. The predictions are made on the time period 2018-10-01 to 2019-01-10 in 26-hour intervals for all zones in a city. The forecasts are 26 hours long but are made every 24th hour; thus the overlap is removed for graphs and tasks where duplicates should be avoided. From the predictions and true values, the error metrics RMSE, equation 2.3, and RMSLE, equation 2.17, are calculated for all hours and all zones, which results in a numeric error for the whole city. As the real demand is measured in integers, the error metrics are also calculated for rounded values of the predictions. In addition to the performance of the TDNets, the performance of the SARIMA and the stacked ensemble benchmarks is presented.

5.1 Experimental Setup

Hardware specifications for running the model will be listed as well as the core frameworks and libraries used.

Frameworks and Libraries

The language used throughout this project has been Python 3.6. The model has been written in Tensorflow, an open-source platform for machine learning1. It is written in C++ for performance but the API is made primarily for Python development. To explore and preprocess the data, the libraries pandas2 and numpy3 have been used. Pandas provides data structures for easily handling large amounts of data and numpy provides efficient array implementations and functions. To enable the use of GPU-accelerated computation, the library cuDNN by NVIDIA was used as well as the CUDA platform [40, 41]. The hyperparameter tuning was done using a Python library called hyperopt4, which provides the possibility to use sequential model-based optimization to find a good set of hyperparameters [3].

1https://www.tensorflow.org/
2https://pandas.pydata.org/
3http://www.numpy.org/
4http://hyperopt.github.io/hyperopt/

The supervised learning benchmark was implemented in R using h2o, a machine learning platform which specializes in making AI accessible. It allows for easily making quite advanced models at the cost of customizability5. The SARIMA benchmark was implemented in Python using the open-source statsmodels module. It provides tools and algorithms for, among other things, performing time-series forecasting6.

Hardware Specifications

In order to make use of mixed precision, it's essential to have a GPU which supports it. The CPU and RAM don't have to be as powerful as the GPU to be able to train deep neural nets efficiently, but they mustn't be bottlenecks. A desktop computer with the following relevant specifications was used:

• GPU: NVIDIA GeForce RTX 2070, 8GB

• CPU: AMD Ryzen 2600 3.4 GHz

• RAM: 2x8GB

• Operating System: Ubuntu 18.04 64-bit

5.2 Results NE

For this city, the performance of the model including external data sources in the form of weather and holiday data has been evaluated. This has been done in order to measure the impact of additional data and to estimate the value of spending time adding this data to the model. The errors of TDNet trained on demand only, below referred to as TDNet-demand, of TDNet trained on demand and external data, and of the two benchmarks, as measured by RMSE, are displayed in figure 5.1. The benchmarks are a stacked ensemble model and a SARIMA model. TDNet-demand performed the best, followed closely by TDNet. In third place came the stacked ensemble benchmark and in fourth the SARIMA model; TDNet is close to being beaten by the stacked ensemble model. In the top right graph, the error of each of the models as measured by RMSLE is depicted. In this case, TDNet beats TDNet-demand, which beats the stacked ensemble, which is followed by the SARIMA model. In the bottom left corner of figure 5.1, the RMSLE of TDNet and TDNet-demand is displayed; next to each of them are bars showing their performance when the predictions are rounded to integers. Rounding increased the RMSLE by 3.5% for the model trained on all data sources and by 2.6% for the model trained on demand only. Notably, the accuracy of the rounded TDNet predictions is lower than that of the unrounded TDNet-demand predictions. For the models trained using RMSE, which aren't displayed in the graph, the difference is the same for both of them, 0.7%. To get a concrete sense of how far the predictions are from the true values, the differences between all the true values and their corresponding predictions are plotted in figure 5.2. There are more extreme differences, but these are very few in number and have been omitted from this particular graph to increase readability. The prediction buckets, which are defined in section 4.8, have the following sizes: 64% of predictions are accurate, 9% are overestimations and 27% are underestimations. This can be compared to figure 5.4, which shows the difference distribution of the best RMSE model. For that model, 63% of predictions are accurate, 15% are overestimations and 22% are underestimations. Figure 5.3 shows how the error distribution differs when the true demand has to be higher than zero.

5https://www.h2o.ai
6https://www.statsmodels.org/stable/index.html


Figure 5.1: Results of TDNet and benchmarks as measured by RMSE and RMSLE. The bottom left shows the results from rounding the predictions.

Figure 5.2: Difference between prediction and truth in real numbers.


Figure 5.3: Difference between prediction and truth in real numbers where the true demand is greater than zero.

This slightly changes the performance of the model: 59% of predictions are accurate, 6% are overestimations and 35% are underestimations. The city has been divided into different zones and demand predictions have been made for all the zones which contribute 1% or more of the total demand in the city. An insight into how well TDNet predicts the demand for each of the zones can be gained from figure 5.5. The error is measured by RMSE and the errors of the zones range from 1 in zone M to just below 3 in zone B. This can be compared to the total error of the model, 2.37, above which six zones lie and below which eight lie.

Figure 5.4: Difference between prediction and truth in real numbers for RMSE model.


Figure 5.5: Distribution of RMSE across all zones for city NE.

The performance of the models summed over 24-hour periods and over all zones is shown in figure 5.6. This is done to give a simple overview of the performance of the models and to highlight differences between the two error metrics. The true demand in all zones for all hours of a day has been summed up and is represented by the green line in the figure; the demand for each day has been divided by the demand of the max day to avoid showing real numbers. The other two lines show the predicted demand of the two best models trained with the different loss functions. The graph shows how the true demand varies substantially over time. Especially the month of November first contains a few spikes and then the lowest trough, followed by the peak of the whole period, which is reached on the 25th. Both TDNets seem to follow the general trend well but fail to capture the spikes, and the model trained using RMSLE consistently underestimates the total demand. Similar to figure 5.6 is figure 5.7, where the total demand of the city is displayed in relation to the total demand as predicted by the SARIMA and stacked ensemble benchmarks. The aggregated predictions of the SARIMA model are represented by a blue line and they don't vary much. The orange line represents the predictions of the stacked ensemble; these vary more and in a seemingly regular way. This regularity doesn't mirror all the changes in the true demand but appears to capture a weekly pattern. To measure the impact of which loss function was used during training, the cross-comparison in table 5.1 has been created: the predictions of each model, trained with either the RMSLE or the RMSE loss function, were evaluated using both metrics. As expected, a model trained and evaluated using the same loss function outperformed one where the loss function was switched between training and evaluation. In percent, the decrease in accuracy when evaluating using RMSLE was only 0.6% and the decrease when evaluating using RMSE was 6.1%. In figure 5.8, the RMSE of all predictions for each hour is shown for TDNet and the benchmarks. The SARIMA model constantly produces the highest error; the stacked ensemble and TDNet intermittently produce the lowest error up until 10:00, from which point TDNet performs worse until it reaches its peak error at 18:00.


Figure 5.6: Total Demand by day for all zones over the test period.

Figure 5.7: Total Demand by day for all zones over the test period and predictions by bench- marks.

             RMSLE Eval  RMSE Eval
RMSLE Train  0.6639232   2.4465284
RMSE Train   0.6681081   2.3061376

Table 5.1: Results of comparing the impact of the two different error metrics.


Figure 5.8: Prediction error per hour in NE as measured by RMSE for TDNet and benchmarks.

After the 18:00 peak, TDNet's error drops to its minimum and it outperforms the benchmarks by far. To capture the process of training a model, the loss for every twentieth iteration was logged. As depicted in figure 5.9, the loss doesn't strictly decrease, due to the random element of the batch generation, but there is a decreasing trend up until the last couple of thousand iterations. During the first 100 steps, which aren't included in the graph, the loss decreased rapidly; these steps have been removed to increase graph readability. A closer look at the figure shows two straight lines close to iterations 10000 and 25000, which are due to the model restarting from a previous iteration with a halved learning rate.

Figure 5.9: Train loss of the best RMSLE model for NE.


5.3 Results SA

Figure 5.10: RMSLE and RMSE of TDNet and benchmarks.

The results for this city measured in RMSE and RMSLE can be found in figure 5.10. In both cases, TDNet trained on demand only performed the best, followed by the benchmarks. When measuring with RMSLE the stacked ensemble beats the SARIMA model, but when using RMSE the SARIMA error is just below that of the stacked ensemble. In comparison to the results for the other city, TDNet with extra features is missing since weather and holiday data was only collected for NE.

Figure 5.11: Distribution of RMSE across all zones for city SA.


The distribution of errors for all zones which account for more than 1% of the total demand in the city can be seen in figure 5.11. Zone A accounts for more than twice the error of zone E, which has the second highest error. All the other zones have an RMSE of similar size. To measure the actual differences between the predictions and the true values of the demand, figure 5.12 was created for the model trained with the RMSE loss. The histogram shows the frequency of predictions which differ by the values found on the x-axis. A zero means that the prediction was exactly right, which was the case in approximately 11% of the cases. 40% of the predictions were less than two away from the truth. The most common value in the histogram, a difference between the prediction and truth of negative two, is classified as an overestimation and occurred in about 33% of the cases. Lastly, 27% fell in the category underestimations.

Figure 5.12: Difference between prediction and truth in real numbers for RMSE model.

The performance of TDNet summed over 24-hour periods and over all zones in the city is shown in figure 5.13. The graph shows how well TDNet has been able to capture the aggregated demand in the city. Most predicted highs and lows aren't as extreme as they turned out to be in reality and the model is thrown off at the beginning of 2019, but in general the demand has been predicted fairly accurately. The performance of the benchmarks summed over 24-hour periods and over all zones in the city is shown in figure 5.14. This is done to give a simple overview. The true demand in all zones for all hours of a day has been summed up and is represented by the green line in the figure; the demand for each day has been divided by the demand of the max day to avoid showing real numbers. The other two lines show the predicted demand of the SARIMA and stacked ensemble benchmarks. Predicting this city-wide aggregated demand isn't the task of these models, but it shows how the true demand varies substantially over time. Figure 5.15 shows the RMSE of the different models over all hours of the day. TDNet is the best performing model, with the lowest error for all but a couple of hours. Its errors are stable and range from 3 to 4; in contrast, the stacked ensemble achieves errors as low as 3 but also an error close to 6 at 14:00. The SARIMA model performs poorly at night but comparatively well in the afternoon; it is still worse than the other two alternatives. None of these models have been trained to perform this task where all the zones are aggregated. Thus, this graph might be misleading in comparison to the graph which shows the actual loss achieved on the main task, shown in figure 5.10.


Figure 5.13: Total Demand by day for all zones over the test period and predictions by RMSE model.

Figure 5.14: Total Demand by day for all zones over the test period and predictions by bench- marks.


Figure 5.15: Prediction error per hour in SA as measured by RMSE for TDNet and benchmarks.

5.4 Hyperparameters and Architecture

Table 5.2: Best hyperparameters found from hyperparameter tuning and meta information about the training process.

Table 5.3: Hyperparameters SA

Name            Value
Step Size       0.0014
Training Steps  58000
Channels        8
Encode/Decode   32
Dilations       1, 2, 4, 8, 16, 32, 64, 128
Filter Widths   2, 2, 2, 2, 2, 2, 2, 2
Duration        84 min
Iteration       17 out of 30

Table 5.4: Hyperparameters NE

Name            Value
Step Size       0.003
Training Steps  175000
Channels        8
Encode/Decode   16
Dilations       1, 4, 16, 64, 1, 4, 16, 64
Filter Widths   3, 3, 3, 3, 3, 3, 3, 3
Duration        124 min
Iteration       66 out of 100

The best sets of hyperparameters found during the tuning for NE and SA are shown in table 5.2. For NE, the cross-validation error of the best tuned model was about 13% better than that of the first model, which was created during development with arbitrarily chosen hyperparameters. No initial model was trained before the hyperparameter tuning for SA; the first trained model for this city had hyperparameters which had already been tuned.

6 Discussion

The sections presented in this chapter will be an analysis and discussion of the results obtained from the empirical evaluation. A comparison between the cities and the models will be included. TDNet will be compared to the current state-of-the-art and suggestions for improving its performance will be provided. Additionally, the sources as well as the method used to generate the results will be critically examined. Lastly, the subject of this thesis will be contextualized in a wider perspective.

6.1 Results NE

As implemented, TDNet trained with all features as well as TDNet trained with demand only (TDNet-demand) both beat the benchmarks as measured by RMSE and RMSLE. Interestingly, their internal order differs depending on which error metric is used. When using RMSLE, TDNet barely beats TDNet-demand, but when measuring using RMSE the relationship is reversed. Rounding the predictions, as is done in figure 5.1, and measuring the RMSLE leads to the unrounded TDNet-demand beating the rounded TDNet, which further emphasizes how minor the difference between the two approaches is. It can be concluded that adding weather and holiday information doesn't improve the accuracy of TDNet for this city. Regarding the holiday information, some data analysis has been performed which found that the day with the highest total demand over the two years of data was a national holiday. The day with the lowest total demand was also a national holiday, namely Christmas Day. This suggests that splitting the holiday column depending on whether the holiday positively or negatively impacts the taxi demand could be worthwhile, especially since finding information on holidays and adding it to the data set is much easier than adding weather information. Regarding the weather information, the conclusion that it doesn't provide predictive power could very likely be extrapolated to include at least all of northern Europe, the region in which NE is situated. But since the customers of Taxicaller AB are located all over the globe and weather varies a lot, it can't be ruled out that weather data would increase accuracy in other areas. Rounding the results led to about a 3% difference when using RMSLE and 0.7% when using RMSE. The size of these rounding errors is quite small and makes it reasonable to simply round the predictions if TDNet were to be used in a production environment, instead of making the model output discrete predictions.

To provide a connection between the abstract loss functions and the reality of having the right amount of taxis in the right place at the right time, an accurate prediction has been defined as being within +1/-1 of the true value. This is the case in about 64% of the cases, independent of the loss function used. Removing the hours in the zones where the true demand is zero, as is done in the histogram in figure 5.3, decreases the accuracy to 59%, increases the percentage of underestimations from 27 to 35 and decreases the overestimations from 9 to 6. This shows that non-zero demand is slightly harder to predict, and the change in estimation distribution is logical since a true demand of zero can only be predicted accurately or overestimated. Looking at the demand per hour in figure 4.2 and the error of the predictions per hour in figure 5.8, it can be seen that the RMSE is higher than average after midnight, when the true demand is the lowest. At first glance this is surprising, but it can probably be explained by the fact that this demand comes from people taking a taxi home from a night out. The error measurements indicate that this demand is more irregular, and there are a few reasons why that might be. Street hailing is more common, which means that the location of the taxis is more important. Under the assumption that how many people are out, and where they go, differs between weekends, it can be hard for the drivers to know where to be and whether there are plenty of people out on a certain night or not. A safer way to get customers is e.g. driving between a business district and the main station before and after normal working hours.

Loss Functions

The demand summed over all zones over 24 hours for the whole prediction period gives the total demand of a city. As can be seen in figure 5.6, the total predicted demand of the best RMSLE model is lower than that of the best RMSE model by quite a large margin. Furthermore, a comparison between the error distributions in figure 5.2 and figure 5.4 shows that TDNet trained with RMSE is much more prone to overestimations than when trained with RMSLE. However, when cross-comparing the performance of the models on the main task in table 5.1, the difference doesn't appear to be as significant. To explain this discrepancy, the behaviour of the two loss functions in the lower number range has been examined. As can be seen to the right in figure 6.1, the RMSE for a prediction of zero when the true value is two is an error of two. The same error can be achieved by predicting four; an error in either direction is treated equally. Consider instead the curvature of the RMSLE loss function to the left in the same figure. The RMSLE for a prediction of zero when the true value is two is an error of about 1.1. To achieve the same error by predicting too much, the prediction would have to be eight. This is due to the logarithmic nature of the error metric: log(2 + 1) − log(0 + 1) = log(8 + 1) − log(2 + 1). The advantages of using an error metric which over-penalizes underestimates have been discussed in the theory chapter. Concretely, "[RMSLE] ... is applicable when predicting across a large range and magnitude of values". As it turns out, on an hourly basis for each zone, the demand in the cities investigated doesn't span a large range or magnitude of values. On the contrary, the mean hourly demand in each zone for the city NE is about 2 with a standard deviation of 3. Consequently, the predictions made by the model trained using RMSLE are conservative and consistently favor staying between zero and two. The objective of this thesis isn't to answer to what degree overestimates of the demand are more expensive than underestimates, but in the interest of evaluating the performance from a wider perspective, both the RMSE and the RMSLE of the models have been taken into consideration. From a theoretical standpoint, customers with a higher hourly demand per zone would probably prefer using the RMSLE and customers with a lower hourly demand per zone the RMSE. Increasing the hourly demand per zone could be achieved by having a higher

demand or larger zones. Another approach would be increasing the time-span to include several hours.
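A quick numerical check of the asymmetry discussed above; the helper names are illustrative:

import numpy as np

def rmse(y, p):
    return np.sqrt(np.mean((np.asarray(p, float) - y) ** 2))

def rmsle(y, p):
    return np.sqrt(np.mean((np.log1p(np.asarray(p, float)) - np.log1p(y)) ** 2))

print(rmse(2, 0), rmse(2, 4))    # 2.0 2.0: symmetric in absolute terms
print(rmsle(2, 0), rmsle(2, 8))  # ~1.10 ~1.10: symmetric in ratio terms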

Figure 6.1: RMSE and RMSLE for different predictions x when the true value of x is 2.

6.2 Results SA

TDNet is able to produce better results than both benchmarks for both error metrics. With the definition that a prediction within +1/-1 of the true value is an accurate prediction, an accuracy of 40% was reached. 33% of predictions were overestimations and 27% were underestimations. These numbers are low, but as can be seen in figure 5.13, where the city demand over the whole test period is shown, TDNet has been able to capture the pattern in the demand fairly well. This cannot be said for the benchmarks, whose city-wide predictions are shown in figure 5.14. To produce better results, further tuning of the model would be helpful. Furthermore, relating the definition of an accurate prediction to the hourly average number of pick-ups in a zone would be an improvement. As can be seen in figure 4.3 in the method section, the zone demand is unevenly distributed and the most active zone has an approximately three times larger demand than the second most active. Also, the city mean hourly demand is about 4.6 with a standard deviation of 7.5, which makes the definition of an accurate prediction quite strict. Widening an accurate prediction to be within +2/-2 of the true value would increase the accuracy by about 30%, mostly due to a prediction of 2 above the true value being the most common prediction. The error of the hourly demand, which can be found in figure 5.15, is relatively even for TDNet even though the difference in demand is substantial between the early morning and late afternoon. This relationship is similar to the one discussed in the previous section about the hourly demand in NE.

6.3 Comparing the Cities

The demand distribution will be analyzed to get a better view of what impact it has on a city. First of all it should be mentioned that the fact that both cities have been divided into 14 significant zones is a coincidence. Originally, they were divided into different numbers of zones, but the requirement that a zone must contribute more than 1% of the total demand of the city to be worthwhile predicting led to this.


The zone error distributions depicted in figure 5.5 and figure 5.11 differ substantially. In NE, the errors are of similar size and hover around the RMSE of the whole city. In SA, the error in zone A overshadows that of all the other zones. This is reasonable since zone A stands for more than a third of the city demand and causes an error approximately three times the size of that of the second worst zone. In other words, the errors are proportional to the hourly demand: the relative demand of a zone in comparison to the city demand is a good estimate of how large its prediction error will be in relation to those of the other zones in the city.

The zone distribution also affects the mindset of the drivers in the city. In SA, a risk-averse driver could simply always wait for customers in zone A and probably do reasonably well. In NE there exists no obvious hot spot and drivers are encouraged to roam between zones. With access to the locations of all taxis and the predicted demand, a machine learning model could be developed to predict the expected average wait time in a zone. This would benefit both the risk-averse drivers in SA and the roaming drivers in NE.

The actual value of the hourly demand in a zone impacts the size of the error metrics; therefore SA, with a higher mean and standard deviation, is expected to have a numerically higher error than NE. With the current definition of an accurate prediction, it is also expected that the accuracy in NE (64%) is better than that in SA (40%). When comparing the total city demand over the whole period, the pattern in NE is not as evident as in SA, and the November peak in NE has no counterpart in SA. This regular temporal pattern in SA, which TDNet has captured, is also the reason why TDNet outperformed the benchmarks more clearly in SA than in NE.

Overfitting

A constant threat when working with supervised machine learning models is that of overfitting. By rigorously cross-validating the performance of the model during training, the threat can be kept at bay. Although graphs such as the one in figure 5.9 weren't drawn repeatedly during training, a check was in place: if the cross-validation error, averaged over 100 iterations, didn't decrease for 2000 iterations, training was interrupted. If overfitting had been a bigger issue for either of the two cities, which is usually indicated by the training loss being much lower than the cross-validation error, techniques to combat it would have been used.
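A minimal sketch of such a stopping rule, under the assumption that the check works on a running list of per-iteration cross-validation errors (the exact implementation in TDNet may differ):

```python
import numpy as np

def should_stop(cv_errors, window=100, patience=2000):
    """True if the windowed average of the cross-validation error
    hasn't reached a new minimum in the last `patience` iterations."""
    if len(cv_errors) < window + patience:
        return False
    avgs = np.convolve(cv_errors, np.ones(window) / window, mode="valid")
    return int(np.argmin(avgs)) < len(avgs) - patience
```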

6.4 Method Criticism

The field of machine learning is developing very quickly and there are general issues of reliability, explainability and replicability which are yet to be tackled. A selection of these will be brought up in the context of this thesis, along with issues and criticisms unique to it.

Batch Generation

The batch generation was done with replacement, meaning that the model wasn't guaranteed to see all of the training data during one training epoch. The consequence of this is that the model converges more slowly; fortunately, it doesn't prevent convergence [44]. Even though the loss averaging window technique is applied to increase robustness, it is possible that the same day is used for cross-validation a very high number of times over 100 batches. Then a model which happens to accurately predict the demand of this specific day is chosen over a model which might be better overall. However, as the total size of the validation set is just below 100 days, it is expected that some days are selected a few times and some not at all.
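The difference can be sketched as follows (illustrative NumPy, not the thesis code); drawing without replacement simply amounts to shuffling once per epoch:

```python
import numpy as np

rng = np.random.default_rng(0)
n_days, batch_size = 1000, 32

# With replacement, as done in this thesis: over one nominal epoch,
# some training days are drawn several times and others not at all.
batches_repl = [rng.integers(0, n_days, size=batch_size)
                for _ in range(n_days // batch_size)]

# Without replacement: shuffle once, then slice, so every day is
# seen exactly once before any day is repeated.
perm = rng.permutation(n_days)
batches_shuf = [perm[i:i + batch_size]
                for i in range(0, n_days, batch_size)]
```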


Hyperparameter Tuning

The first step of hyperparameter tuning is defining the domain, i.e. the possible values of the hyperparameters. Although the TPE algorithm is able to find optimal values outside of the initial domain, given that the hyperparameters are defined as continuous distributions, it might not be able to do so in a limited number of iterations [22]. For the hyperparameters defined as discrete distributions or categorical choices, the TPE algorithm is confined to the values defined by the developer, and with a limited range and computational resources it is unlikely that the optimal values are among these. It should also be noted that the sampling of the hyperparameters is determined by the seed, or random state, of the TPE algorithm, which affects the convergence speed.

In a complex model, such as the one described in this paper, the number of hyperparameters to tune is high, and due to time and computational restraints only a subset of them was selected. At the time of writing, no algorithms or golden rules exist for knowing in advance which have the most significant impact on the performance of a machine learning model. As discussed in section 2.10, even slight variations to the same dataset can have a major impact on the importance of the hyperparameters [2]. With this in mind, it's impossible to determine whether the result achieved is close to or far away from the theoretical limit of predictability. Luckily, the paper cited in that section points out that only a few of all hyperparameters have a fundamental impact on the performance of a model.

Examples of hyperparameters which have been used with their default values and haven't been tuned are the choice of optimization algorithm, the number of restarts, the regularization parameter, the dropout rate, whether parameter averaging should be used on the weights, the number of validation batches, the average loss window size, the activation functions used in the different layers and so on. From a validity standpoint this is dissatisfying, as it leaves much up to the intuition and estimations of the developer. Unfortunately, there are few papers in the machine learning field which release complete information regarding the values of their hyperparameters or trained parameters. Even fewer release the full source code of their models. The authors of the original WaveNet have chosen to go down this path of obfuscation as well, and thus the open-source implementations haven't been able to fully reproduce the results of the original paper.
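Since the TPE algorithm is available through the Hyperopt library [3], the structure of such a search can be sketched as below; the search space and the stand-in objective are illustrative only and do not reproduce the actual TDNet search:

```python
from hyperopt import Trials, fmin, hp, tpe

# Illustrative domain: one continuous, one discrete and one categorical choice.
space = {
    "learning_rate": hp.loguniform("learning_rate", -9, -3),
    "n_filters":     hp.quniform("n_filters", 16, 128, 16),
    "loss":          hp.choice("loss", ["rmse", "rmsle"]),
}

def objective(params):
    # Stand-in for "train TDNet with `params` and return its
    # cross-validation error"; a synthetic score keeps the sketch runnable.
    return (params["learning_rate"] - 1e-3) ** 2 + params["n_filters"] / 1e4

trials = Trials()
best = fmin(fn=objective, space=space, algo=tpe.suggest,
            max_evals=100, trials=trials)  # the rstate argument sets the seed
print(best)
```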

Computational Limits

As mentioned in section 2.2, the cost of training a deep neural network such as TDNet can be considerable. Although the computer used to train the models had a strong consumer-grade GPU and additional hardware to support it, training TDNet with one set of hyperparameters took between 30 and 120 minutes, mostly depending on the number of iterations and the number of restarts. The most exhaustive tuning of TDNet searched for 100 iterations and was done for the city NE with demand, weather and holiday features using the RMSLE loss function. The process took approximately 100 hours. Hyperparameters were also searched for with only the demand as input, as well as for the other city, SA. The RMSE hyperparameter search for NE only ran for 20 iterations; the cross-validation error of that model didn't beat the RMSE cross-validation error of the best RMSLE model.

For the stacked ensemble benchmark, training and hyperparameter tuning was done for a maximum of 20 minutes. If a longer time was set, the h2o framework threw a memory error before training was completed. The time required to generate predictions was negligible. The SARIMA model, on the other hand, required about a minute to find parameters and investigate whether the time-series was non-stationary, and then an additional 90 minutes per zone to train and generate predictions, leading to a total of approximately 20 hours.


Data Handling and Feature Engineering

There were a few indicators that the quality of the data retrieved from the Dark Sky API was dubious. As has been mentioned in section 4.2, data was missing for almost a full year, hourly data had to be replaced with daily data and two columns were removed because they were full of zero entries. The data that was actually used has been analyzed and deemed adequate, but it might contain minor inaccuracies which could add up. The steps taken to alleviate these concerns are described in the previously mentioned section.

A mistake made was not creating yearly lags for the test set, i.e. it was full of zeros. However, the yearly lags for the first year of the training set were also zeros, as was the monthly lag for the first month of training, though this was unavoidable.
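As a sketch of how such lag features can be constructed so that the test set inherits the training history (hypothetical pandas code; the column names and lag choices are illustrative):

```python
import pandas as pd

def add_lag_features(demand: pd.Series) -> pd.DataFrame:
    """`demand` is an hourly series for one zone covering the training
    and test periods together; computing lags on the joint series is
    what avoids the zero-filled yearly lags described above."""
    df = pd.DataFrame({"demand": demand})
    df["lag_1d"] = demand.shift(24)
    df["lag_1m"] = demand.shift(24 * 30)
    df["lag_1y"] = demand.shift(24 * 365)
    # Lags reaching before the start of the data are unavoidable zeros.
    return df.fillna(0.0)
```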

Benchmarks

Statistical models such as SARIMA usually enjoy increased performance when the parameters of the model are finely tuned [5]. In this case, squeezing performance out of the model wasn't prioritized: the parameters found most appropriate when investigating the autocorrelation and partial autocorrelation plots were not updated once originally set for a zone in a city. Furthermore, the seasonal parameters were chosen according to guidelines by a field expert and not quantitatively evaluated and tuned, which would have been the preferred method [36].
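For reference, a per-zone SARIMA fit along these lines can be sketched with statsmodels; the orders below are placeholders, not the values read off the autocorrelation plots:

```python
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

def fit_zone_sarima(series: pd.Series, horizon: int = 26) -> pd.Series:
    """Fit one zone's hourly demand and forecast `horizon` hours ahead."""
    model = SARIMAX(series,
                    order=(1, 0, 1),               # p, d, q (placeholders)
                    seasonal_order=(1, 1, 1, 24))  # P, D, Q, s: daily season
    result = model.fit(disp=False)
    return result.forecast(steps=horizon)
```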

Source Criticism

The sources which have been used to provide a theoretical background for the machine learning techniques used in this thesis are mostly well-cited and written by experts such as A. Ng, Y. Bengio, J. Bergstra, I. Goodfellow and Y. LeCun. Some sources are classic books such as "Artificial Intelligence: A Modern Approach" by S. Russell and P. Norvig or "Time Series Analysis: Forecasting and Control" by G. E. P. Box et al., which first came out in 1970. Occasionally, there are sections in the theory where concrete examples are brought up, and in those cases the sources might be rare, but then the reasoning is backed by math. Due to the fast growth of the machine learning field, there are techniques which aren't as well theoretically understood as one might hope. As an example, the Adam optimizer, introduced in a paper with more than 20,000 citations at the time of this writing, has been shown not to converge to the optimal solution for specific, but quite simple, tasks [45].

The papers which specifically focus on predicting taxi demand using statistical and machine learning methods aren't very well cited due to the size of the field. However, the ones cited are among the most established. Most sources for taxi demand prediction have been found through Google Scholar or through related work sections in other papers.

The open-source implementation upon which this master thesis is built has been shown to work well empirically in at least three previous cases, one of which led to a peer-reviewed paper being released [25]. It probably isn't bug-free, but it has been shown to deliver decent predictions in the context of taxi demand as well.

6.5 Comparing the Models

In the experiments conducted in this paper, TDNet outperforms the benchmarks in terms of accuracy as measured by both RMSE and RMSLE. From a theoretical standpoint, this is due to its ability to model non-linear relationships between the demand now and the demand during the previous hours, days, weeks and months [14]. The stacked ensemble, which is the best contender, doesn't inherently model these relationships but is able to non-linearly combine the features fed to it, and based on its performance, the date and time, day of week and day of month seem to contain useful information. It considers the historical demand but doesn't use e.g. the demand 24 hours ago as an input feature. SARIMA, on the other hand, is able to model and make use of the historical demand by e.g. basing its predictions on a rolling average, but it is only able to produce a linear combination of the inputs. Furthermore, it can't be fed the day of week, date and time or any other feature outside the pure demand time-series.

To explain the results from a practical standpoint, the time that has been invested in finding the best set of hyperparameters and implementing TDNet is far greater than that spent on the benchmarks. It can't be ruled out that the stacked ensemble would beat TDNet if it was fed features such as a rolling mean or lagged demand, or was able to train for a longer period of time. The same might be true for SARIMA regarding parameter tuning, but it is less likely.

The implementation costs, and where they occur, differ substantially between the models. SARIMA would ideally have a specific set of parameters for each zone in each city, and finding good values for these takes a long time. On one hand, that this hasn't been done is unfortunate and weakens statements made about the superiority of the other algorithms. On the other hand, it speaks to the need for an iterative process when working with this type of model. For SARIMA, the training process as implemented in the library makes use of neither the GPU nor all the CPU cores, which leads to very slow training. Since it makes use of historical data in the same way as TDNet, it would require access to the taxi demand of the last day in the same way, which places sharper constraints on the system.

TDNet only requires one set of hyperparameters for all zones in a city, and given that the demand of one city is of approximately the same distribution as that of another, the hyperparameters probably deliver decent results and don't have to be recalculated. However, the number of hyperparameters is vastly higher than that of the benchmarks, and it isn't abstracted away by the framework as is the case with h2o and the stacked ensemble. H2o relieves the developer of the duty of writing code to tune the hyperparameters of the models; it comes pre-implemented. An attempt has been made in this thesis to find a sufficiently good set of hyperparameters for TDNet, but not all have been tuned and it can't be guaranteed that the search was conducted in the right value range. The hyperparameter tuning is the most expensive step of creating TDNet for a city, but it fully uses the GPU and the implementation framework TensorFlow is state-of-the-art. The implementation is also very explicit and the developer has almost full control of everything, which is positive, but it also makes it easier for bugs to sneak in.

When engineering a machine learning model in general, factors beyond which model produces the lowest error come into play. The cost of implementing a tuning process might easily outweigh the gain of a decreased prediction error. In that sense, the stacked ensemble as implemented by the automl function of the h2o platform has a solid advantage. This advantage comes at the cost of code control: the memory error encountered during the training process, which limited the training time to 20 minutes, lay deep within the framework and an easy fix was not possible. Consequently, the training and tuning time was vastly reduced, which led to the stacked ensemble not living up to its full potential.
The advantages of the SARIMA model might not be as obvious as those of the other two, but from an implementation perspective, its complexity is low in comparison to TDNet. Much is outsourced to the library used, and there are several examples online of tasks similar to taxi demand prediction following the same steps. When it comes to explainability, exact statements can be made about how many hours back in time are being used, to what extent they contribute to the prediction and what seasonal assumption was made.

6.6 Comparing TDNet to the Literature

A previously brought up alternative to a 1D-CNN such as TDNet is an LSTM. In a paper from 2018, J. Xu et al. predict the taxi demand in different areas of New York using an LSTM [55].


When using the same time step as TDNet of 60 minutes, they achieve an RMSE of about 2.4, which is very close to the RMSE of TDNet in city NE. These two numbers aren't directly comparable since they originate from two different data sets, and the actual number of pick-ups affects the RMSE. Concretely, the maximum demand in a zone in one hour was about 12 times larger in NYC than in NE and the standard deviation about 4 times larger, despite the fact that NYC was split into 6500 zones in comparison to 14. Their low error can partly be attributed to their forecast horizon of one hour, in comparison to 26 hours.

The authors of the paper go on to investigate what impact historical demand, weather, day of week, date and time and drop-offs have on the accuracy of their LSTM. Historical demand was found to have the highest impact, followed by day of week, drop-offs, date and time and lastly weather. Using all features in comparison to just historical demand didn't significantly improve the performance, which showcases the importance of that feature. Each trip in the data sets used in this paper was connected to a timestamp, which made it trivial to add the date and time and day of week features. These features, in addition to historical demand, constituted the baseline TDNet. Adding holiday and weather information required more manual work, and a goal of this thesis was to determine whether the accuracy gained by including these features makes it worth it. The outcome was almost the same as in the paper by J. Xu et al.: in their case, the addition of features beyond pick-ups and temporal features, i.e. day of week and date and time, resulted in a marginally improved accuracy. In the case of TDNet, the accuracy also marginally increased when measured by RMSLE but decreased when measured by RMSE.

To benchmark their LSTM, the authors made use of a feedforward neural network and a rolling mean. Similar to what has been found in this paper, a standard supervised learning algorithm can perform at an accuracy close to that of a model made for time-series forecasting. Additionally, simpler models such as a rolling mean or a coarse SARIMA model struggle to perform on the same level, especially when the variance of the demand increases.

In the literature review in section 3.3, a paper has been summarized which provides instructions for calculating the limit of taxi demand predictability. This could be done for all cities where it's of interest to predict taxi demand, in order to have a goal to aim for and a way to measure the relative success of a model. Doing this could also provide insight into what kind of model might be most suitable for the specific city, i.e. should a machine learning model with additional features be used or is a pure time-series model sufficient. The average limit of hourly predictability calculated for NYC is 83%, which is more than what is achieved by TDNet using a 26-hour forecast horizon [59]. From a validity standpoint, performing tests such as this one to statistically analyze the data, form a hypothesis and try to disprove it follows the scientific process closely, especially in comparison to throwing big data at a black-box algorithm, fiddling with its settings and hoping it produces a lower error than before.

6.7 Improving TDNet

The simplest practical improvement would be to make full use of the instructions for mixed-precision GPU training, which will become available as soon as the dependency management system Anaconda receives an update for the cuDNN package. As it stands, some benefits are gained but the feature isn't completely supported. The second simplest improvement would be adjusting the batch generation process so that each batch is drawn without replacement, as discussed in section 6.4 on method criticism.

A theoretical improvement would be implementing the caching algorithm proposed by Ramachandran, P. et al., which has made the sequence generation of the original WaveNet 21 times faster [43]. The idea is to cache calculations made in the nodes of the hidden layers that are repeated several times as new output is generated. If the output of a node in the second layer for time step t depends on that of two nodes in the first layer, i.e. p(a_t^2 | a_t^1, a_{t-2}^1), then the output of that node for time step t + 2 would be p(a_{t+2}^2 | a_{t+2}^1, a_t^1). These two expressions share the term a_t^1, and the naive implementation used in this paper calculates it twice. The caching algorithm, on the other hand, calculates it once, stores the result and retrieves it for time step t + 2. Doing this for all layers results in a considerable speed-up.

There are potential improvements which are general to CNNs; all of the following are applicable to TDNet, but many of them have been developed and evaluated on image data, which is higher-dimensional [18]. The first and last layers of TDNet are time-distributed dense layers which use the rectified linear activation function from equation 2.13. Although ReLU led to significant performance improvements when initially introduced, issues such as a node constantly producing zero gradients when x < 0 may slow down training. There are several suggested solutions to this issue; most rely on defining a gentle gradient for x < 0 and they have been shown to increase performance [54]. Thus, swapping the activation functions could improve TDNet. A technique to speed up training and improve accuracy known as batch normalization uses the same principles as the scaling of the input features described in section 4.4. The idea is to normalize the input to all layers after a mini-batch, which scales the parameters to a similar range. This makes it possible to use a higher learning rate without the gradients exploding, and it adds a slight regularizing effect which prevents overfitting [21].

The problem statement of short-term demand could be reinterpreted so as not to entail a 26-hour forecast horizon. Forecasting one or two hours ahead with access to the current demand would most likely greatly increase the accuracy of TDNet as well as of the benchmarks. In a paper by L. Moreira-Matias et al., an aggregated error measurement of 26% is achieved for a task very similar to the one undertaken in this paper using a stacked ensemble. The ensemble contains, among three other models, an ARIMA, and the forecast horizon is 30 minutes [34]. This shorter horizon allows special irregular events, which increase the demand over a day, to be taken into account. Furthermore, the systems of TaxiCaller register future bookings, and feeding the bookings that are placed far ahead of time to the model would help it predict short-term bookings and street-hailing demand.
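To illustrate the caching idea (a toy NumPy sketch of the principle, not the algorithm of [43] or the TDNet code), consider a stack of dilated "nodes" where the naive generator recomputes the full history at every step while the cached generator stores each activation once:

```python
import numpy as np

rng = np.random.default_rng(0)
L, T = 4, 64                         # layers and sequence length
dilations = [2 ** l for l in range(L)]
W = rng.normal(size=(L, 2)) * 0.5    # fixed toy weights per layer

calls = 0
def node(w, cur, lag):
    """One dilated 'node': combines the current and the lagged input."""
    global calls
    calls += 1
    return np.tanh(w[0] * cur + w[1] * lag)

def naive_last(x):
    """Recompute every hidden activation over the whole history."""
    h = np.asarray(x, dtype=float)
    for l, d in enumerate(dilations):
        h = np.array([node(W[l], h[t], h[t - d] if t >= d else 0.0)
                      for t in range(len(h))])
    return h[-1]

def cached_step(x_t, cache):
    """Compute only the newest column, reusing stored activations.
    cache[l] holds the input sequence of layer l (the raw input for l == 0)."""
    cur = x_t
    for l, d in enumerate(dilations):
        cache[l].append(cur)                    # input to layer l at time t
        lag = cache[l][-d - 1] if len(cache[l]) > d else 0.0
        cur = node(W[l], cur, lag)
    return cur

x = [float(np.sin(0.3 * t)) for t in range(T)]
naive_out = [naive_last(x[:t + 1]) for t in range(T)]
naive_calls, calls = calls, 0

cache = [[] for _ in range(L)]
cached_out = [cached_step(x_t, cache) for x_t in x]

assert np.allclose(naive_out, cached_out)
print(naive_calls, calls)  # 8320 vs 256 node evaluations for T = 64
```

Per generated step, the cached version performs O(L) node evaluations instead of O(L · T), which is where the speed-up reported in [43] comes from.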

6.8 The Work in a Wider Context

As with all machine learning tasks, the "ground truth" or data set is only an approximation of reality and might not contain the actual truth. Real-world biases are most likely represented in the data, and this might lead to unwanted consequences. As an example, it could be that taxi drivers refuse to pick up certain customers or avoid certain neighbourhoods due to racial discrimination [37]. This makes the demand data misleading and might force potential customers to walk to another area to get picked up or to use other modes of transportation. A machine learning model could pick up this bias against certain zones as well, and if drivers rely on it in the future, they might continue to avoid these neighbourhoods because the model predicts that there's no demand, even though they are willing to go there. Bookings and unprejudiced taxi drivers luckily work as a natural countermeasure against these issues.

Currently, a more experienced driver is expected to have a better understanding of where the customers are and therefore to be able to predict the demand for taxis in a city [15]. A consequence of deploying a machine learning model which helps drivers is that the gap between new and experienced drivers would decrease. The senior drivers might be annoyed, since they've worked longer and might feel entitled to an advantage. An experienced driver would still have an advantage over inexperienced ones in interpreting factors other than historical demand, because when it comes to knowledge of the demand data, TDNet trumps any single driver, whoever it may be. Examples of factors unavailable to TDNet are information about events and special occasions, schedules for public transport, city-specific knowledge, competitors and visual input.


In the long term, an autonomous taxi fleet could benefit greatly from an algorithm which accurately predicts taxi demand or, by extension, transportation demand. This would help position the cars efficiently and make sure that the supply matches the demand. T. Litman brings up several possible consequences of society widely adopting autonomous vehicles. Benefits include increased mobility and safety and reduced traffic, energy use and reliance on public transport, but also the possibility of increased emissions and less infrastructure to support travelling by bike or on foot. Of course, an autonomous fleet, or at least one with self-driving vehicles, would significantly decrease human involvement, meaning that the algorithm would contribute to people losing their jobs [31]. Whether the positives will outweigh the negatives is yet to be seen, but society will definitely change.

7 Conclusion

The aim of this study was to apply TDNet, a machine learning model with a WaveNet architecture, to the task of predicting short-term taxi demand in cities and to evaluate its performance. This has been done by exploring, cleaning and creating features from two taxi demand data sets and modifying an open-source implementation of WaveNet. To improve the performance of TDNet, a Bayesian optimization algorithm for hyperparameter tuning has been used. With the best set of hyperparameters found, the taxi demand for the next two months in different zones of a city was forecast. The predictions have been analyzed and discussed, and in this chapter the final conclusions follow.

7.1 Connection to Research Questions

The first question, regarding how well short-term taxi demand can be predicted, has been answered by training TDNet on historical demand data as well as additional data which potentially impacts taxi demand. Furthermore, a hyperparameter tuning algorithm known as a Tree Parzen Estimator has been run for 100 iterations to improve the performance of TDNet. With these hyperparameters, TDNet was able to predict the taxi demand within +1/-1 of the true value in 64% of the cases in the city NE and 40% in the city SA. In addition to this, two different error metrics have been used to provide a wider perspective on how accuracy should be interpreted in this domain. The RMSE is a safe choice which punishes over- and underestimation equally, while the RMSLE depends on what the true demand is and can lead to very conservative predictions if the hourly demand for a zone is low. With a high average demand, the RMSLE is expected to better predict low demand, but as has been the case for these two cities, the average demand is only two and four with standard deviations of three and seven, and therefore RMSE-trained models, which are better at predicting peaks in demand, are preferred.

The second question, about how the distribution of taxi demand between zones in cities affects performance as well as how features other than demand impact performance, has been answered by conducting two experiments. The first consisted of evaluating TDNet in two different cities with different demand distributions between zones; the outcome was that the average hourly demand and its standard deviation impact the prediction accuracy more than the distribution between zones. The second consisted of measuring the accuracy of TDNet when trained with access to demand, holiday and weather data and

comparing it with the accuracy of a TDNet trained on demand only. For the city NE, weather and holiday features didn't improve the accuracy of TDNet.

The third question, which puts the performance of TDNet in relation to existing time-series forecasting models, has been answered by using a SARIMA model and a stacked ensemble of supervised machine learning models to predict taxi demand. TDNet beat both benchmarks in both cities, but the margin was small in NE, and the computational resources spent on the best benchmark, the stacked ensemble, were limited in comparison to those spent on TDNet. From a theoretical standpoint, the superior performance of TDNet can be explained by it using both temporal features, such as time of day, day of week, day of month and year, and historical demand. Furthermore, it conditions the demand in one zone on the demand in other zones. This separates it from the stacked ensemble, which relies solely on temporal features, and SARIMA, which only uses historical demand.

The implementation complexity of TDNet is much higher than that of the benchmarks, but it has performed slightly better in NE and significantly better in SA. If the gain in accuracy outweighs the cost of maintaining a complex model, then TDNet should be used. Otherwise the stacked ensemble as implemented by h2o should be the preferred choice, due to its low complexity and the fact that it doesn't rely on lagged demand input features. That means that last week's demand doesn't need to be available for it to predict the demand of the upcoming hour, which separates it from TDNet and SARIMA. This relaxed constraint makes it even easier to use.

This thesis presents another example of a domain where a WaveNet architecture has been applied successfully to generate time-series which can be used for prediction. It provides a discussion of the impact of different loss functions tied to a concrete example, and supplies evidence that neither weather nor holiday features improve taxi demand prediction given that historical demand and temporal features are available.

7.2 Future Research

As mentioned in section 2.1 on taxi demand, the total taxi demand of any given city is unknown. Furthermore, most taxi demand data sets aren't publicly available. There are a few exceptions, such as the one week of data for Beijing in 2008 which has been used in multiple studies [58, 57]. A taxi demand data set of larger size which is updated continuously is that of New York City. The NYC Taxi and Limousine Commission requires all taxi companies to submit their bookings and then releases them in publicly available data sets, split based on the kind of taxi. The famous yellow cabs are, for example, only allowed to pick up hailing customers in the central parts of the city, while a category called for-hire vehicles only accepts bookings [8]. Unfortunately, the data sets don't contain complete records of the bookings of competing ride-sharing companies such as Uber. Nonetheless, these data sets would ideally serve as standard benchmarks for the task of predicting taxi demand, even for algorithms built specifically for predicting demand in e.g. Porto, Tokyo or Bengaluru, India [34, 24, 12]. Papers such as the one written by J. Xu et al. or the one by K. Zhao et al. use the NYC data set and are therefore much easier to compare [55, 59]. To determine the viability of TDNet and concretely put it in relation to state-of-the-art alternatives, it should be benchmarked on this larger data set.

Predicting the zone wait time instead of just the demand cuts closer to the heart of the supply and demand problem, and running such a system in real time would be challenging. Moreira-Matias et al. ran their taxi demand prediction system in real time for a few months, but most models never get deployed and used in reality [34]. Including the real-time location of drivers and predicting the wait time would be possible and could yield interesting results.

Bibliography

[1] Yoshua Bengio, Patrice Simard, and Paolo Frasconi. "Learning long-term dependencies with gradient descent is difficult". In: IEEE Transactions on Neural Networks 5.2 (1994), pp. 157–166.
[2] James Bergstra and Yoshua Bengio. "Random search for hyper-parameter optimization". In: Journal of Machine Learning Research 13.Feb (2012), pp. 281–305.
[3] James Bergstra, Dan Yamins, and David D. Cox. "Hyperopt: A Python library for optimizing the hyperparameters of machine learning algorithms". In: Proceedings of the 12th Python in Science Conference. Citeseer. 2013, pp. 13–20.
[4] Anastasia Borovykh, Sander Bohte, and Cornelis W. Oosterlee. "Conditional time series forecasting with convolutional neural networks". In: arXiv preprint arXiv:1703.04691 (2017).
[5] George E. P. Box, Gwilym M. Jenkins, Gregory C. Reinsel, and Greta M. Ljung. Time Series Analysis: Forecasting and Control. John Wiley & Sons, 2015.
[6] Stefan Burgstaller, Demian Flowers, David Tamberrino, Heath P. Terry, and Yipeng Yang. "Rethinking Mobility: The 'pay as you go' car: Ride hailing just the start". In: Venture Capital Horizons (2017).
[7] Rich Caruana and Alexandru Niculescu-Mizil. "An empirical comparison of supervised learning algorithms". In: Proceedings of the 23rd International Conference on Machine Learning. ACM. 2006, pp. 161–168.
[8] NYC Taxi & Limousine Commission. Vehicle Licenses. URL: https://www1.nyc.gov/site/tlc/vehicles/get-a-vehicle-license.page (visited on 03/27/2019).
[9] Corporación Favorita Grocery Sales Forecasting. https://www.kaggle.com/c/favorita-grocery-sales-forecasting. Accessed: 2018-11-29.
[10] Judd Cramer and Alan B. Krueger. "Disruptive change in the taxi business: The case of Uber". In: American Economic Review 106.5 (2016), pp. 177–182.
[11] W. Dally. High Performance Hardware for Machine Learning. Dec. 2015. URL: https://media.nips.cc/Conferences/2015/tutorialslides/Dally-NIPS-Tutorial-2015.pdf.
[12] Neema Davis, Gaurav Raina, and Krishna Jagannathan. "A multi-level clustering approach for forecasting taxi travel demand". In: Intelligent Transportation Systems (ITSC), 2016 IEEE 19th International Conference on. IEEE. 2016, pp. 223–228.


[13] David A. Dickey and Wayne A. Fuller. "Distribution of the estimators for autoregressive time series with a unit root". In: Journal of the American Statistical Association 74.366a (1979), pp. 427–431.
[14] Vincent Dumoulin and Francesco Visin. "A guide to convolution arithmetic for deep learning". In: arXiv preprint arXiv:1603.07285 (2016).
[15] Henry S. Farber. "Why you can't find a taxi in the rain and other labor supply lessons from cab drivers". In: The Quarterly Journal of Economics 130.4 (2015), pp. 1975–2026.
[16] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. http://www.deeplearningbook.org. MIT Press, 2016.
[17] Klaus Greff, Rupesh K. Srivastava, Jan Koutník, Bas R. Steunebrink, and Jürgen Schmidhuber. "LSTM: A search space odyssey". In: IEEE Transactions on Neural Networks and Learning Systems 28.10 (2017), pp. 2222–2232.
[18] Jiuxiang Gu, Zhenhua Wang, Jason Kuen, Lianyang Ma, Amir Shahroudy, Bing Shuai, Ting Liu, Xingxing Wang, Gang Wang, Jianfei Cai, et al. "Recent advances in convolutional neural networks". In: Pattern Recognition 77 (2018), pp. 354–377.
[19] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. "Deep residual learning for image recognition". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016, pp. 770–778.
[20] Sepp Hochreiter and Jürgen Schmidhuber. "Long short-term memory". In: Neural Computation 9.8 (1997), pp. 1735–1780.
[21] Sergey Ioffe and Christian Szegedy. "Batch normalization: Accelerating deep network training by reducing internal covariate shift". In: arXiv preprint arXiv:1502.03167 (2015).
[22] Donald R. Jones. "A taxonomy of global optimization methods based on response surfaces". In: Journal of Global Optimization 21.4 (2001), pp. 345–383.
[23] Norman P. Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, et al. "In-datacenter performance analysis of a tensor processing unit". In: 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA). IEEE. 2017, pp. 1–12.
[24] Kaz Sato and Yuki Oyabu. Now live in Tokyo: using TensorFlow to predict taxi demand. URL: https://cloud.google.com/blog/products/gcp/now-live-in-tokyo-using-tensorflow-to-predict-taxi-demand (visited on 03/29/2019).
[25] Glib Kechyn, Lucius Yu, Yangguang Zang, and Svyatoslav Kechyn. "Sales forecasting using WaveNet within the framework of the Kaggle competition". In: arXiv preprint arXiv:1803.04037 (2018).
[26] Diederik P. Kingma and Jimmy Ba. "Adam: A method for stochastic optimization". In: arXiv preprint arXiv:1412.6980 (2014).
[27] John R. Koza, Forrest H. Bennett, David Andre, and Martin A. Keane. Automated Design of Both the Topology and Sizing of Analog Electrical Circuits Using Genetic Programming. Artificial Intelligence in Design. Springer, Dordrecht, 1996.
[28] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. "Deep learning". In: Nature 521.7553 (2015), pp. 436–444.
[29] Yann LeCun, Léon Bottou, Yoshua Bengio, Patrick Haffner, et al. "Gradient-based learning applied to document recognition". In: Proceedings of the IEEE 86.11 (1998), pp. 2278–2324.
[30] Shuai Li, Wanqing Li, Chris Cook, Ce Zhu, and Yanbo Gao. "Independently recurrent neural network (IndRNN): Building a longer and deeper RNN". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018, pp. 5457–5466.


[31] Todd Litman. Autonomous Vehicle Implementation Predictions. Victoria Transport Policy Institute, Victoria, Canada, 2017.
[32] Yisheng Lv, Yanjie Duan, Wenwen Kang, Zhengxi Li, and Fei-Yue Wang. "Traffic flow prediction with big data: a deep learning approach". In: IEEE Transactions on Intelligent Transportation Systems 16.2 (2015), pp. 865–873.
[33] Xiaolei Ma, Zhimin Tao, Yinhai Wang, Haiyang Yu, and Yunpeng Wang. "Long short-term memory neural network for traffic speed prediction using remote microwave sensor data". In: Transportation Research Part C: Emerging Technologies 54 (2015), pp. 187–197.
[34] Luis Moreira-Matias, Joao Gama, Michel Ferreira, Joao Mendes-Moreira, and Luis Damas. "Predicting taxi–passenger demand using streaming data". In: IEEE Transactions on Intelligent Transportation Systems 14.3 (2013), pp. 1393–1402.
[35] Vinod Nair and Geoffrey E. Hinton. "Rectified linear units improve restricted Boltzmann machines". In: Proceedings of the 27th International Conference on Machine Learning (ICML-10). 2010, pp. 807–814.
[36] Robert Nau. General seasonal ARIMA models. URL: https://people.duke.edu/~rnau/seasarim.htm (visited on 05/02/2019).
[37] William Neuman. New York Office to Address Discrimination by Taxis and For-Hire Vehicles. URL: https://www.nytimes.com/2018/07/31/nyregion/uber-taxis-minorities-bias-refusal-nyc.html (visited on 05/25/2019).
[38] Andrew Ng. Train Test Data Split - Improving Deep Neural Networks, Hyperparameter tuning, Regularization and Optimization. URL: https://www.coursera.org/lecture/deep-neural-network/train-dev-test-sets-cxG1s (visited on 04/07/2019).
[39] Andrew Ng. Tuning Process - Improving Deep Neural Networks, Hyperparameter tuning, Regularization and Optimization. URL: https://www.coursera.org/lecture/deep-neural-network/tuning-process-dknSn (visited on 05/15/2019).
[40] NVIDIA. CUDA. URL: https://developer.nvidia.com/cuda-zone (visited on 03/25/2019).
[41] NVIDIA. cuDNN. URL: https://developer.nvidia.com/cudnn (visited on 03/25/2019).
[42] Robi Polikar. "Ensemble based systems in decision making". In: IEEE Circuits and Systems Magazine 6.3 (2006), pp. 21–45.
[43] Prajit Ramachandran, Tom Le Paine, Pooya Khorrami, Mohammad Babaeizadeh, Shiyu Chang, Yang Zhang, Mark A. Hasegawa-Johnson, Roy H. Campbell, and Thomas S. Huang. "Fast generation for convolutional autoregressive models". In: arXiv preprint arXiv:1704.06001 (2017).
[44] Benjamin Recht and Christopher Ré. "Beneath the valley of the noncommutative arithmetic-geometric mean inequality: conjectures, case-studies, and consequences". In: 2012.
[45] Sashank J. Reddi, Satyen Kale, and Sanjiv Kumar. "On the convergence of Adam and beyond". In: arXiv preprint arXiv:1904.09237 (2019).
[46] Stuart Russell and Peter Norvig. Artificial Intelligence: A Modern Approach. 3rd. Upper Saddle River, NJ, USA: Prentice Hall Press, 2009. Chap. 1. ISBN: 0136042597.
[47] Haşim Sak, Andrew Senior, and Françoise Beaufays. "Long short-term memory based recurrent neural network architectures for large vocabulary speech recognition". In: arXiv preprint arXiv:1402.1128 (2014).
[48] Jürgen Schmidhuber. "Deep learning in neural networks: An overview". In: Neural Networks 61 (2015), pp. 85–117.


[49] Aäron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew W. Senior, and Koray Kavukcuoglu. "WaveNet: A generative model for raw audio". In: SSW. 2016, p. 125.
[50] Sean Vasquez. Web traffic forecasting. URL: https://github.com/sjvasquez/web-traffic-forecasting (visited on 03/25/2019).
[51] Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. "Extracting and composing robust features with denoising autoencoders". In: Proceedings of the 25th International Conference on Machine Learning. ACM. 2008, pp. 1096–1103.
[52] M. Mitchell Waldrop. "The chips are down for Moore's law". In: Nature News 530.7589 (2016), p. 144.
[53] Rüdiger Wirth and Jochen Hipp. "CRISP-DM: Towards a standard process model for data mining". In: Proceedings of the 4th International Conference on the Practical Applications of Knowledge Discovery and Data Mining. Citeseer. 2000, pp. 29–39.
[54] Bing Xu, Naiyan Wang, Tianqi Chen, and Mu Li. "Empirical evaluation of rectified activations in convolutional network". In: arXiv preprint arXiv:1505.00853 (2015).
[55] Jun Xu, Rouhollah Rahmatizadeh, Ladislau Bölöni, and Damla Turgut. "Real-time prediction of taxi demand using recurrent neural networks". In: IEEE Transactions on Intelligent Transportation Systems 19.8 (2018), pp. 2572–2581.
[56] Fisher Yu and Vladlen Koltun. "Multi-scale context aggregation by dilated convolutions". In: arXiv preprint arXiv:1511.07122 (2015).
[57] Jing Yuan, Yu Zheng, Xing Xie, and Guangzhong Sun. "Driving with knowledge from the physical world". In: Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM. 2011, pp. 316–324.
[58] Jing Yuan, Yu Zheng, Chengyang Zhang, Wenlei Xie, Xing Xie, Guangzhong Sun, and Yan Huang. "T-Drive: driving directions based on taxi trajectories". In: Proceedings of the 18th SIGSPATIAL International Conference on Advances in Geographic Information Systems. ACM. 2010, pp. 99–108.
[59] Kai Zhao, Denis Khryashchev, Juliana Freire, Cláudio Silva, and Huy Vo. "Predicting taxi demand at high spatial resolution: approaching the limit of predictability". In: Big Data (Big Data), 2016 IEEE International Conference on. IEEE. 2016, pp. 833–842.
[60] Zheng Zhao, Weihai Chen, Xingming Wu, Peter C. Y. Chen, and Jingmeng Liu. "LSTM network: a deep learning approach for short-term traffic forecast". In: IET Intelligent Transport Systems 11.2 (2017), pp. 68–75.

Glossary

CNN Convolutional Neural Network.

RMSE Root Mean Square Error.

RMSLE Root Mean Square Logarithmic Error.

WaveNet The neural network on which TDNet is based; see section 2.8 for a detailed description.
