
Master of Science in Computer Science, February 2018

Demand Forecasting of Outbound Logistics Using Machine Learning

Ashik Talupula

Faculty of Computing, Blekinge Institute of Technology, 371 79 Karlskrona, Sweden. This thesis is submitted to the Faculty of Computing at Blekinge Institute of Technology in partial fulfilment of the requirements for the degree of Master of Science in Computer Science. The thesis is equivalent to 20 weeks of full-time studies.

The authors declare that they are the sole authors of this thesis and that they have not used any sources other than those listed in the bibliography and identified as references. They further declare that they have not submitted this thesis at any other institution to obtain a degree.

Contact Information: Author(s): Ashik Talupula E-mail:[email protected]

University advisor: Dr. Hüseyin Kusetogullari Department of Computer Science

Faculty of Computing
Blekinge Institute of Technology
SE-371 79 Karlskrona, Sweden
Internet: www.bth.se
Phone: +46 455 38 50 00
Fax: +46 455 38 50 57

Abstract

Background. Long-term volume forecasting is important for logistics providers for their capacity planning and for taking strategic decisions. At present, demand is estimated using traditional averaging techniques or the planners' own experience, which often contains some error. This study is focused on filling these gaps by using machine learning approaches. The sample set is provided by the organization, a leading manufacturer of trucks, buses and construction equipment; the organization has customers in more than 190 markets and production facilities in 18 countries.

Objectives. This study investigates suitable machine learning algorithms that can be used for forecasting the demand of outbound distributed products, and then evaluates the performance of the selected algorithms by conducting an experiment, to articulate the possibility of using long-term forecasting in transportation.

Methods. Primarily, a literature review was initiated to find suitable machine learning algorithms; then, based on the results of the literature review, an experiment was performed to evaluate the performance of the selected algorithms.

Results. The selected CNN, ANN and LSTM models all perform quite well, but depending on the type and amount of historical data the models were given to learn from, they show slight differences in forecasting performance. Comparisons are made using performance measures selected through the literature review.

Conclusions. This study examines the efficacy of using Convolutional Neural Networks (CNN) for demand forecasting of outbound distributed products at country level. The methodology applies convolutions to historical loads; the output of the convolutional operation is supplied to fully connected layers together with other relevant data. The presented methodology was implemented on an organization data set of outbound distributed products per month. Results obtained from the CNN were compared to results obtained by Long Short-Term Memory sequence-to-sequence (LSTM S2S) networks and Artificial Neural Networks (ANN) on the same data set. Experimental results showed that the CNN outperformed the LSTM while producing results comparable to the ANN. Further testing is needed to compare the performance of different deep learning architectures in outbound forecasting.

Keywords: Demand forecasting, outbound logistics, machine learning.

Acknowledgments

First of all, I would like to thank my university supervisor, Dr. Hüseyin Kusetogullari. He was always available when I ran into a trouble spot or had a question about my research or writing. He always allowed this paper to be my own work, but steered me in the right direction whenever he thought I needed it. I would also like to thank my supervisor at Volvo, Teja Yerneni, for supporting me not only with the thesis itself but also in motivating me and collaborating with the team at Volvo. Finally, I must express my deep appreciation to my parents and to my friends for providing me with unfailing support and continuous encouragement throughout my years of study and through the process of researching and writing this thesis. Without them, this achievement would not have been possible. Thank you.

Contents

Abstract i

Acknowledgments ii

1 Introduction 1 1.1 Problem Statement ...... 2 1.1.1 Aim ...... 3 1.1.2 Objectives ...... 3 1.1.3 Research Questions ...... 3

2 Related Work 4 2.1 Time series forecasting ...... 6

3 Preliminaries 7 3.1 Forecasting ...... 7 3.2 Time series ...... 7 3.2.1 Univariate ...... 7 3.2.2 Multivariate ...... 7 3.2.3 Components of time series ...... 8 3.3 Time series forecasting as a supervised problem ...... 9 3.3.1 Supervised learning ...... 9 3.3.2 Sliding window approach for time series data ...... 9 3.4 Artificial Neural Networks ...... 9 3.4.1 Activation Functions ...... 10 3.4.2 Recurrent Neural Networks ...... 12 3.4.3 LSTM ...... 13 3.4.4 CNN ...... 14 3.5 ARIMA ...... 14 3.6 SVR ...... 15 3.7 Multiple parallel input and Multi step output...... 16

4 Method 18 4.1 Data gathering ...... 19 4.2 Data pre-processing ...... 19 4.3 Data set ...... 20 4.4 Experiment setup ...... 20 4.5 Performance metrics ...... 21 4.6 Walk-forward validation ...... 21

5 Results 23 5.1 Learning curve ...... 23 5.2 Forecasts ...... 24 5.3 Forecasting performance ...... 24 5.4 Validity threats ...... 25

6 Analysis and Discussion 26 6.1 Implementation ...... 26 6.2 Discussion ...... 27

7 Conclusions and Future Work 28

References 29

A Supplemental Information 32

List of Figures

1.1 Outbound process ...... 2

2.1 Time series ...... 6

3.1 Univariate time series ...... 7 3.2 multivariate time series ...... 8 3.3 time series decomposition ...... 8 3.4 Time series data ...... 9 3.5 supervised problem ...... 9 3.6 single layer perceptron ...... 10 3.7 Multi layer perceptron ...... 10 3.8 Sigmoid ...... 11 3.9 Tan-h ...... 11 3.10 Relu ...... 12 3.11 Recurrent and feed forward networks structure ...... 12 3.12 LSTM Architecture ...... 13 3.13 support vector regressor ...... 16 3.14 multivariate time series ...... 16 3.15 Transformation of input and output from the above series ...... 17

4.1 Data set ...... 20 4.2 Walk forward validation ...... 22

5.1 LSTM training graph ...... 23 5.2 CNN training graph ...... 23 5.3 Actual vs forecast using CNN ...... 24 5.4 Actual vs forecast using CNN ...... 24 5.5 Models performances ...... 25

A.1 of residuals ...... 32 A.2 Actual vs forecast using LSTM ...... 32 A.3 Decomposition of Time series ...... 33 A.4 forecsat using LSTM ...... 33 A.5 forecast using LSTM ...... 34

Chapter 1 Introduction

A supply chain consists of all activities concerned with moving goods from raw materials to the consumer [35]. Sales and Operations Planning (SOP) is responsible, first, for planning and agreeing volumes from all business units for the upcoming months; it then communicates those volumes to operation plants and production logistics so that supply chain activities can be planned [33]. Logistics is the process of distributing goods from the point of origin to the point of consumption to meet consumer requirements. Inbound logistics refers to the transport, storage and handling of goods coming into a business, and outbound logistics refers to the same for goods going out of a business [34].

The process starts when a customer places an order with the sales department; the order is processed by the sales department and assigned to a production plant. The sales office provides the customer with a customer delivery date (CDD). A CDD is provided only if the goods are transported directly to the customer location. If the goods pass through a terminal, the dates are instead specified as an Available at Terminal Date (ATD) and an Indicated Customer Delivery Date (I-CDD), and the movement is noted as a transfer. An ATD shows when the order ought to be at the terminal, prepared to load onto the following transport unit.

The business volume of logistics shows sustainable growth with the advancement of the economy and improved offline and online technology; thus, efficient logistics demand forecasting is needed to manage these processes in an organized manner [18]. Forecasting is the process of predicting the future based on past or current data. It plays an important role in sales and operations planning for taking strategic and planning decisions. Forecasted values are just projections; we do not obtain exact values, we only try to reduce the error with the help of forecasting tools and more sophisticated models. One can forecast sales using different forecasting techniques such as ARIMA [22], SVM [20], ANN [23][37], LSTM [13] and CNN [15][30], given records of previous sales and accurate demand details.


Figure 1.1: Outbound process

Forecasting outbound distributed products lowers the cost of warehousing and transportation by optimizing the logistics process through consolidation, capacity planning and the use of a third-party logistics provider. The purpose of this thesis is to forecast the outbound distributed products of a company that uses third-party logistics (3PL) services for distributing its products via air, water and road transportation. Third-party services include handling logistics such as warehousing, packaging, fulfillment and distribution.

1.1 Problem Statement

Most logistics service providers face several challenges in managing the distribution of products, such as capacity planning and freight volume. There is therefore a need to study the outbound processes of a manufacturing company in order to develop a proper plan to overcome these challenges. Transportation is the major part of logistics, and securing carrier capacity is the most pressing issue for logistics services, especially international logistics. The risk of capacity shortage with carrier providers can be minimized by an early request for space, which can be achieved through reliable long-term volume forecasting (LTVF). This also helps in planning for and handling transportation demand that exceeds what the carrier providers can supply: carrier providers could increase their service capacity upon request if given an early demand signal.

1.1.1 Aim

The main aim of the thesis is to investigate a suitable machine learning model that can translate SOP (sales and operations planning) information into forecasting information for the outbound distributed product processes.

1.1.2 Objectives • Identifying an appropriate machine learning model for forecasting outbound logistics.

• Evaluating the efficiency of the selected machine learning algorithms.

1.1.3 Research Questions

• RQ 1: What are the available state-of-the-art methods used in forecasting?
The motivation for this research question is to find a suitable forecasting method that can identify underlying causes over a period of time.

• RQ 2: Which machine learning model would perform better forecasting on time series data?
Motivation: The motivation for this research question is to evaluate different time series forecasting models on outbound logistics data and to select the appropriate one based on performance.

Chapter 2 Related Work

Related work for this research covers demand forecasting for supply chains and logistics in general. Investigations into demand forecasting and its connection to supply chain networks began much earlier: in 1960, Winters presented the exponential smoothing framework for forecasting sales for the purpose of optimizing production planning. In the most recent decade, several papers have been proposed to deal with this issue. Gilbert (2005) [14] presented a multistage supply chain model built on ARIMA; he also discussed the causes of the bullwhip effect and demand variations in inventory and orders. Liang (2006) [31] proposed a solution for estimating the ordering capacity for period t+1 of a multi-echelon supply chain, where every entity was permitted to use a different inventory structure. Aburto and Weber (2007) [1] presented a hybrid intelligent system combining neural networks and ARIMA for forecasting demand. In 2008, Carbonneau [5] described the use of advanced non-linear machine learning algorithms in the context of the extended supply chain. Garcia et al. (2012) [11] used support vector machines to address the issues faced in distribution and the discovery of new models. Kandananond (2012) [24] stated that, in forecasting consumer product demand, support vector machines outperformed artificial neural networks (ANNs); in the following year, the same author [24] reported that SVM also surpasses the ARIMA method of forecasting.

Manas Gaur, Shruthi Goel and Eshaan Jain (2015) [12] used K-Nearest Neighbors and Bayesian networks for forecasting demand in the supply chain. The aim of their study was to find the more suitable of the two algorithms by comparison; adaptive boosting was also used in conjunction with the algorithms to improve model performance. Results of the experiment show that Bayesian networks, with or without adaptive boosting, surpass K-Nearest Neighbors (KNN); KNN with two nearest neighbors also gave promising results.

Wen-Jing Yuan and Ze-Yi Jin [38] proposed a combination of a grey model and a stacked autoencoder (SAE) for forecasting logistics demand, exploiting the particular characteristics of the logistics demand forecasting problem. The original data is processed through multiple grey models, and the output of the grey model is given as input to the SAE model; to obtain the final predicted value, an extreme learning machine (ELM) is applied for exact prediction at the top and the SAE for feature extraction at the bottom. The proposed model shows more accurate results than an ordinary grey network model when applied in empirical research on the logistics demand of a Brazilian

company.

Yan Zhao and Shengchang Wang [40] proposed two forecasting models, the support vector machine (SVM) and the least squares support vector machine (LS-SVM). To identify the better forecasting model, they evaluated the efficiency of both models considering the complexity and non-linearity of highway freight volume. Based on their calculations, the LS-SVM-based model is the more efficient for forecasting freight volume.

Pei-you Chen and Lu Liu [7] proposed the PSO-SVR algorithm, a combination of support vector regression (SVR) and particle swarm optimization (PSO), to forecast the demand for coal transportation. They selected railway freight turnover volume, the amount of coal consumption and some other factors, chose railway freight volumes from 1995-2011 as learning samples, and used a radial basis function (RBF) as the kernel of the prediction model to establish the influence factors by combining both models. Results show that the selected algorithm is superior to back-propagation (BP) neural networks in forecast accuracy and error.

Real Carbonneau [5] studied the application of advanced machine learning algorithms such as neural networks, recurrent neural networks and SVR to forecast the distorted demand signal of a supply chain, and compared them with traditional methods; results favored the RNN and SVR. Two different data sets were used in the experiment: one collected from a simulated supply chain and the other from actual Canadian foundry orders.

Jingyi Du [9] stated that LSTM neural networks have gained great attention in deep learning, especially for time series. An LSTM network was used to predict the Apple stock price, using both multiple-feature and single-feature inputs to verify the prediction on the stock time series. Results were positive when multi-feature input was used.

The most widely used machine learning approach is the artificial neural network. However, Hu and Zhang (2008) [17] explained the drawbacks of using ANNs, such as cost-function optimization and uncontrolled convergence, and used LSTM, support vector regression and random forest regression to produce accurate demand forecasts that overcome the drawbacks of traditional methods and ANNs.

Kasun Amarasinghe and Daniel L. Marino [2] discussed forecasting energy load demand using deep neural networks. Their paper investigates the effectiveness of using Convolutional Neural Networks (CNN) for energy load forecasting at the individual building level. The presented methodology uses convolutions on historical loads; the output from the convolutional operation is fed to fully connected layers together with other pertinent information. The methodology was implemented on a benchmark data set of electricity consumption for a single residential customer. Results obtained from the CNN were compared against results obtained by Long Short-Term Memory sequence-to-sequence (LSTM S2S) networks, Factored Conditional Restricted Boltzmann Machines (FCRBM), "shallow" Artificial Neural Networks (ANN) and Support Vector Machines (SVM) on the same data set. Experimental results showed that the CNN outperformed SVM while producing results comparable to the ANN and the deep learning methodologies.

2.1 Time series forecasting

A time series is a set of data points taken at specified, equal intervals of time [6]. Time series analysis considers a single variable and predicts its future values with respect to time. Time series data are often encountered when predicting stock prices, retail sales, electricity demand, airline passengers and weather. Consider an observed time series t1, t2, t3, ..., tn, and suppose we want to forecast the future value tn+h, where h is the forecast horizon. The forecast of tn+h made h steps ahead at time tn is represented as t^n(h); the hat symbol distinguishes forecasted from observed values. A forecasting method is a technique for computing forecasts from present and past observations, while a forecasting model is selected based on the given series of data. Forecasting methods and models are not the same and should not be used as equivalent terms. Judgmental forecasts, univariate methods and multivariate methods are the three types of forecasting methods [6].

Figure 2.1: Time series

The data set provided for the study follows a non-linear pattern. According to the literature study conducted, many papers stated that ANN, LSTM and CNN are best suited to capturing the patterns in non-linear data, so the deep learning techniques LSTM, CNN and ANN (artificial neural networks) are adopted for this study.

Chapter 3 Preliminaries

3.1 Forecasting

Forecasting is determining what is going to happen in the future by analyzing past and current patterns in the data; it helps business people plan for what might and might not occur. Forecasting approaches are classified into two types:

1. Quantitative: forecasts produced from historical data, time series or correlation information, projected out into the future.

2. Qualitative: opinions taken from experts, decision makers and customers.

3.2 Time series

As discussed in Section 2.1, a time series is a sequence of observations, usually ordered in time. Time series are further classified into univariate and multivariate, depending on the number of dependent variables recorded with respect to time.

3.2.1 Univariate

Univariate time series data has only a single variable recorded sequentially over equal intervals of time. The table below shows univariate time series data stating the monthly sales of a product.

Figure 3.1: Univariate time series

3.2.2 Multivariate

Multivariate time series data has more than one time-dependent variable (multiple time series dealing with dependent data simultaneously). These types of time series are important, and challenging, in the context of machine learning.


Figure 3.2: multivariate time series

3.2.3 Components of time series The several reasons which affect the values of an observation in a time series are said to be components of time series. These are decomposed into four categories:

• Trend: in time series analysis, a trend is a movement to relatively higher or lower values over a long period of time. When the data exhibits a general upward direction (higher highs and higher lows) it is called an upward trend; when it exhibits a general downward direction (lower highs and lower lows) it is called a downward trend; when there is no trend it is called a horizontal trend.

• Seasonality: time series data that exhibits a repeating pattern at a fixed interval of time within a one-year period is said to show seasonality. It is a common pattern seen across many time series.

• Cyclic pattern: It exists, when the data exhibits rises and falls that are not of a fixed period.

• Irregular fluctuations: the leftover series of residuals after the trend and cyclic variations have been removed from a data set, which may or may not be random. These fluctuations are unpredictable and erratic in nature.

Figure 3.3: time series decomposition
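To make the components concrete, here is a naive additive decomposition in numpy. This is an illustration only, not the thesis implementation: a simple centered moving average stands in for the trend, and `period` is the assumed seasonal length.

```python
import numpy as np

def decompose_additive(series, period):
    """Naive additive decomposition: trend via a centered moving average,
    seasonality via per-season means of the detrended series."""
    series = np.asarray(series, dtype=float)
    kernel = np.ones(period) / period
    # Moving average as the trend estimate; edges stay NaN.
    trend = np.full(series.shape, np.nan)
    ma = np.convolve(series, kernel, mode="valid")
    half = period // 2
    trend[half:half + ma.size] = ma
    detrended = series - trend
    # Average each seasonal position across all periods (ignoring NaN edges).
    seasonal = np.array([np.nanmean(detrended[i::period]) for i in range(period)])
    seasonal_full = np.resize(seasonal, series.size)  # tile over the series
    residual = series - trend - seasonal_full
    return trend, seasonal_full, residual
```

For a series that is exactly "level plus seasonal pattern", the residual component comes out (near) zero, matching the decomposition sketched in Figure 3.3.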

3.3 Time series forecasting as a supervised problem

Most time series forecasting problems can be framed as supervised learning problems. Standard linear and non-linear machine learning algorithms can then be applied by transforming the time series data into a supervised learning form.

3.3.1 Supervised learning

In supervised learning the machine learns under guidance: given input variables (X) and an output variable (Y), algorithms learn the mapping between them.

3.3.2 Sliding window approach for time series data

Using prior time steps to predict the next time step is called the sliding window method; the prior steps are also referred to as lags in time series. A time series can be shaped into a supervised problem by restructuring the data set so that previous time steps form the input variables (X) and the next time step forms the output variable (y). Suppose we have the time series in Table 1; we can transform it into a supervised learning problem by using the values of previous time steps to predict the value of the next time step.

Figure 3.4: Time series data Figure 3.5: supervised problem
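The restructuring shown above can be sketched in a few lines of Python (an illustrative helper, not the thesis code; `n_lags` is the window size):

```python
import numpy as np

def sliding_window(series, n_lags):
    """Turn a univariate series into (X, y) pairs: the previous
    n_lags observations are the inputs, the next value is the target."""
    series = np.asarray(series)
    X, y = [], []
    for i in range(len(series) - n_lags):
        X.append(series[i:i + n_lags])
        y.append(series[i + n_lags])
    return np.array(X), np.array(y)

X, y = sliding_window([10, 20, 30, 40, 50], n_lags=2)
# X is [[10, 20], [20, 30], [30, 40]] and y is [30, 40, 50]
```

Each row of X is one window of lags, and the matching entry of y is the value the model should predict, exactly as in the supervised-problem table.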

3.4 Artificial Neural Networks

Artificial neural networks are computing systems inspired by biological neural networks. Also called simply neural networks, they perform tasks by learning from examples without being programmed with a set of instructions. ANNs consist of a set of neurons connected and organized in layers; these neurons send signals to each other through weighted connections. The architecture is composed of input, hidden and output layers. The input layer holds the initial input vector of the data, which is processed further by the subsequent layers of the network. Hidden layers are the layers between the input and output layers, where neurons take in a set of weighted inputs and produce an output through an activation function. The output layer gives the required outputs [39].

• Perceptron: the basic building block of a neural network, it is a linear classifier used for binary prediction. This type of network works only for linearly separable data.

Figure 3.6: single layer perceptron

• Multi-layer neural network: has a more advanced architecture than the perceptron. Such networks are used to solve complex regression and classification tasks; recurrent neural networks and convolutional neural networks are examples of multi-layer architectures [21].

Figure 3.7: Multi layer perceptron

3.4.1 Activation Functions

Calculations performed in a neuron are of two types: aggregations and activations. Aggregations are just the weighted sum, whereas activation functions define the output of the neuron for a given set of input data. The activation functions used differ between architecture types. ReLU, sigmoid and tanh are the most widely used non-linear activation functions.

• Sigmoid activation function: mostly used in binary classification problems, it is a kind of logistic function which maps the given inputs to probability-like outputs between 0 and 1 [25].

Figure 3.8: Sigmoid

Sigmoid(x) = e^x / (1 + e^x)   (3.1)

• Tanh activation function: the tanh activation function is an alternative to the logistic sigmoid. It also follows a sigmoidal shape, but the output values are bounded in the range -1 to 1; highly negative inputs to the tanh function map to negative outputs [25].

Figure 3.9: Tanh

Tanh(x) = 2 / (1 + e^(-2x)) - 1   (3.2)

• ReLU activation function: the rectified linear unit is widely used in convolutional networks (CNN). It is popular because all negative input values are mapped to 0, while positive values are passed through unchanged [32].

Figure 3.10: ReLU

ReLU(x) = max(0, x)   (3.3)
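Equations (3.1)-(3.3) translate directly into numpy (a minimal sketch for illustration; note that e^x / (1 + e^x) is algebraically the same as the 1 / (1 + e^(-x)) form used below):

```python
import numpy as np

def sigmoid(x):
    # Logistic function: squashes any real input into (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # Hyperbolic tangent: output bounded in (-1, 1).
    return np.tanh(x)

def relu(x):
    # Rectified linear unit: negative inputs map to 0.
    return np.maximum(0.0, x)
```

For example, sigmoid(0) gives 0.5, tanh(0) gives 0, and relu maps -3 to 0 while leaving 2 unchanged.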

3.4.2 Recurrent Neural Networks

Recurrent neural networks (RNN) work on the principle of saving the output of a layer and feeding it back to the input in order to predict the output of the layer. RNNs are mostly used for sequential data. The formulation of an RNN is obtained by abstracting the general concepts and common properties of feed-forward neural networks. These networks are widely used in speech recognition, sentiment classification and time series prediction [16].

Figure 3.11: Recurrent and feed-forward network structures

Consider an input sequence x = (x1, ..., xT). A standard recurrent neural network (RNN) computes the hidden vector sequence h = (h1, ..., hT) and the output vector sequence y = (y1, ..., yT) by iterating the following equations from t = 1 to T:

h_t = H(W_xh x_t + W_hh h_{t-1} + b_h)   (3.4)

y_t = W_hy h_t + b_y   (3.5)

where W denotes the weight matrices, H is the hidden layer function, W_xh denotes the input-hidden weight matrix and b_h denotes the hidden bias vector.
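The recurrence in equations (3.4)-(3.5) can be sketched in numpy as follows. This is an illustration only: it assumes the hidden-layer function H is tanh and that the initial hidden state h_0 is the zero vector.

```python
import numpy as np

def rnn_forward(x_seq, W_xh, W_hh, W_hy, b_h, b_y):
    """Iterate a vanilla RNN over an input sequence, returning all
    outputs y_1..y_T and the final hidden state h_T."""
    h = np.zeros(W_hh.shape[0])  # assumed initial state h_0 = 0
    outputs = []
    for x_t in x_seq:
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)   # equation (3.4), H = tanh
        outputs.append(W_hy @ h + b_y)             # equation (3.5)
    return np.array(outputs), h
```

Because the same weight matrices are reused at every time step, the network's parameter count is independent of the sequence length.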

3.4.3 LSTM

LSTMs (long short-term memory networks) are an evolved version of the recurrent neural network. During back-propagation, recurrent neural networks suffer from the vanishing gradient problem. Gradients are the values used to update the weights of a neural network; the vanishing gradient problem occurs when a gradient shrinks as it propagates back through time. If a gradient value becomes extremely small, it does not contribute much to learning, so the layers that receive a small gradient update, mainly the early layers, stop learning. Because these layers are not learning, the network can forget what it has seen in longer sequences, leaving it with only short-term memory. LSTMs were created as the solution to this: they have internal mechanisms called gates that regulate the flow of information. These gates learn which data in a sequence is important to keep or throw away, and by doing so the network learns to use the relevant information when making predictions. LSTMs are mostly used in speech recognition, text generation and time series [19].

Figure 3.12: LSTM Architecture

A common LSTM architecture is composed of a cell state and three regulators, usually called gates. The cell state acts as a highway that transfers information down the sequence chain; think of it as the memory of the network, because the cell state can carry information from earlier time steps all the way to the last time step, thus reducing the effects of short-term memory. The gates are small neural networks that decide which information is allowed onto the cell state; they learn what is relevant to keep or forget during training. The gates use sigmoid activations, which squash values between 0 and 1 (rather than between -1 and 1): the forget gate decides what information should be kept or thrown away by passing the previous hidden state and the current input through the sigmoid function, where values close to 0 mean forget and values close to 1 mean keep. The input gate decides how the cell state is updated, and the output gate determines the next hidden state.

f_t = σ(W_f x_t + U_f h_{t-1} + b_f)   (3.6)

i_t = σ(W_i x_t + U_i h_{t-1} + b_i)   (3.7)

o_t = σ(W_o x_t + U_o h_{t-1} + b_o)   (3.8)

c_t = f_t ⊙ c_{t-1} + i_t ⊙ tanh(W_c x_t + U_c h_{t-1} + b_c)   (3.9)

h_t = o_t ⊙ tanh(c_t)   (3.10)

where σ is the logistic sigmoid function, f is the forget gate, i is the input gate, o is the output gate, c is the cell activation vector, ⊙ denotes element-wise multiplication, and W, U and b are the input weights, recurrent weights and biases of each gate.
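The gate mechanics described above can be sketched as a single LSTM step in numpy. This is an illustrative implementation of the standard gate equations, not the thesis code; W, U and b are assumed parameter dictionaries keyed by gate name.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step: W/U/b hold input weights, recurrent weights
    and biases for the f, i, o gates and the candidate cell ('c')."""
    f = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])  # forget gate
    i = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])  # input gate
    o = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])  # output gate
    c_tilde = np.tanh(W["c"] @ x_t + U["c"] @ h_prev + b["c"])
    c = f * c_prev + i * c_tilde   # keep part of the old state, add new info
    h = o * np.tanh(c)             # the output gate filters the cell state
    return h, c
```

The cell state c is only ever scaled and added to, which is why gradients flowing along it degrade far less than in a vanilla RNN.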

3.4.4 CNN

CNNs are a special type of neural network used primarily for processing data with a grid topology [30]. For example, images can be viewed as 2D grids, and time series data, such as energy consumption data, can be viewed as 1D grids. CNNs have been used effectively for computer vision tasks such as image classification [29],[15]. In at least one of its layers, a CNN uses a specific linear operation called convolution. Convolution is an operation on two real-valued functions [15], and is denoted with an asterisk:

s(t) = (x ∗ w)(t) = Σ_a x(a) w(t − a)   (3.11)

where w denotes the weighting function and x denotes the input function. Within CNNs the weighting function is called a "kernel", and the output of the convolution operation is often referred to as the "feature map" (denoted s).

Usually, the operation of convolution is applied to inputs in multidimensional arrays. In addition, the kernel is also a multidimensional weight array that changes as the algorithm learns through the iterations. Therefore, with multidimensional inputs and kernels, the convolution procedure is applied over more than one dimension. The two-dimensional convolution operation can be expressed as:

S(i, j) = (I ∗ K)(i, j) = Σ_m Σ_n I(m, n) K(i − m, j − n)   (3.12)

where K represents a two-dimensional kernel and I represents a two-dimensional input. S is the resulting feature map after the convolution.
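For the 1D case relevant to time series, the convolution can be sketched in a few lines of numpy (an illustration; "valid" mode means only windows that fit entirely inside the input produce an output):

```python
import numpy as np

def conv1d(x, kernel):
    """Discrete 1D convolution in 'valid' mode: each feature-map entry is a
    weighted sum of one kernel-sized window of the input."""
    x = np.asarray(x, dtype=float)
    flipped = np.asarray(kernel, dtype=float)[::-1]  # convolution flips the kernel
    n_out = x.size - flipped.size + 1
    return np.array([x[i:i + flipped.size] @ flipped for i in range(n_out)])
```

In a CNN layer the kernel weights are learned; sliding the same kernel across the whole series is what lets the network detect a local pattern wherever it occurs.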

3.5 ARIMA

ARIMA is a widely used model for forecasting time series, proposed in the 1970s by Box and Jenkins [4] on the basis of the autoregressive (AR), moving average (MA) and autoregressive moving average (ARMA) models. ARIMA stands for autoregressive integrated moving average: it is the combination of AR and MA bound together with integration (differencing). AR captures the correlation between previous time periods and the current one, while MA refers to the residual errors, a linear combination of error terms whose values occurred at various timestamps in the past. ARIMA is commonly used for forecasting inflation or unemployment rates, product demand, mortgage interest rates, and silver or gold prices, among many other use cases globally. The ARIMA model has three parameters (p, d, q): p is the number of autoregressive lags, q is the order of the moving average, and d is the order of differencing needed to make the data stationary. To choose p we use the PACF (partial autocorrelation) plot, and to choose q we use the ACF (autocorrelation) plot [28]. When the series becomes stationary after differencing, the ARIMA formula is stated as follows:

Y_t = φ1 Y_{t−1} + φ2 Y_{t−2} + ... + φp Y_{t−p} + e_t − θ1 e_{t−1} − θ2 e_{t−2} − ... − θq e_{t−q}   (3.13)

where p is the order of the autoregressive model, q is the order of the moving average model, e is the white noise sequence, φ and θ are the model parameters, and Y_t is the value of an observation at time t [8].
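To illustrate just the AR part of equation (3.13), the coefficients φ1..φp can be estimated by ordinary least squares: regress Y_t on its p previous values. This sketch ignores the MA terms, differencing and intercept (a full ARIMA fit would typically use a library such as statsmodels):

```python
import numpy as np

def fit_ar(series, p):
    """Estimate AR coefficients phi_1..phi_p by least squares:
    regress Y_t on Y_{t-1}, ..., Y_{t-p}."""
    series = np.asarray(series, dtype=float)
    y = series[p:]  # targets Y_t for t = p..n-1
    # Column j-1 holds the lag-j values Y_{t-j} for each target row.
    X = np.column_stack([series[p - j:len(series) - j] for j in range(1, p + 1)])
    phi, *_ = np.linalg.lstsq(X, y, rcond=None)
    return phi
```

On a noise-free AR(1) series the estimate recovers the true coefficient exactly, which makes the regression view of the AR component easy to verify.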

3.6 SVR

SVR (support vector regression) is mainly used for continuous data and supports both linear and non-linear regression. It solves a quadratic programming problem by mapping the features to a high-dimensional feature space and building a hyperplane as the decision function of the original space within that high-dimensional space. Support vector regression is the application of SVM to regression.

Consider a training data set D = (x1, y1), (x2, y2), ..., (xn, yn), yi ∈ R, mapped to the high-dimensional feature space by the non-linear mapping β(x), and establish a regression function:

f(x, ω) = ω · β(x) + b   (3.14)

SVR does not aim to separate samples of different classes as far as possible; rather, it keeps the samples within a reasonable deviation range, i.e., loss is measured only when the deviation is larger than ε.
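The ε-insensitive loss just described can be sketched directly (a numpy illustration; `epsilon` is the half-width of the tolerance tube):

```python
import numpy as np

def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1):
    """SVR loss: deviations within the epsilon tube cost nothing;
    beyond it, cost grows linearly with the excess deviation."""
    return np.maximum(np.abs(y_true - y_pred) - epsilon, 0.0)
```

A prediction that is off by 0.05 with epsilon = 0.1 incurs zero loss, while one off by 0.4 incurs a loss of 0.3; this flat-bottomed loss is what makes SVR robust to small deviations.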

Figure 3.13: support vector regressor

In the figure above, no loss is counted when points (the small black dots) fall within the allowable deviation range around the regression function; when a point (a circle) exceeds the permissible deviation range, its distance beyond that range is counted towards the total loss. SVR can be formulated as:

min over (ω, b, ξ, ξ*):  (1/2)‖ω‖² + C Σ_{i=1}^{n} (ξ_i + ξ_i*)
subject to  y_i − ω·β(x_i) − b ≤ ε + ξ_i,   ω·β(x_i) + b − y_i ≤ ε + ξ_i*,   ξ_i, ξ_i* ≥ 0   (3.15)

3.7 Multiple parallel input and multi-step output

Predicting multiple time steps into the future is called multi-step forecasting. Parallel time series require the prediction of multiple time steps of each time series. This is clearly stated with an example below [36][3].

Figure 3.14: multivariate time series

From the multivariate time series picture above, we use the last two time steps from each of the five time series as input to the model and predict the next time steps of each of the five time series as output. The first sample of the data set would be:
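The transformation just described (the last n_in time steps of every parallel series as input, the following n_out steps as output) can be sketched with a small helper; the function name and array sizes are illustrative only, not the exact code used in the experiment:

```python
import numpy as np

def split_sequences(data, n_in, n_out):
    """Slice a (time, series) array into supervised (input, output) samples."""
    X, y = [], []
    for i in range(len(data) - n_in - n_out + 1):
        X.append(data[i : i + n_in])                 # shape (n_in, n_series)
        y.append(data[i + n_in : i + n_in + n_out])  # shape (n_out, n_series)
    return np.array(X), np.array(y)

# Five parallel series, ten time steps each
series = np.arange(50).reshape(10, 5)
X, y = split_sequences(series, n_in=2, n_out=1)
print(X.shape, y.shape)  # (8, 2, 5) (8, 1, 5)
```

The first sample pairs the first two rows of the series with the third, exactly as in the transformation shown in figure 3.15.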

Figure 3.15: Transformation of input and output from the above series

Chapter 4

Method

This study is focused on two research methods addressing two research questions.

A literature search was initiated to find the answer to RQ1. The reason for selecting a literature review as a research method is to study different forecasting methods and identify a suitable forecasting method that performs well on time series data. Since my research benefits from the availability of past data, we can assume that patterns that appeared in the past will continue in the future; with this in mind, the scope of the literature search was narrowed to time series forecasting techniques. A related study was conducted to better understand how these forecasting techniques work and their importance in forecasting demand.

The articles were searched using the search strings "machine learning", "forecasting", "time series forecasting" and "demand forecasting in logistics", and the selected articles were then refined using:

Inclusion criteria

• Articles written in English and published between the years 2005-2019

• Articles using forecasting and machine learning approaches in the supply chain, especially logistics.

• Articles which are published in journals, books and magazines.

Exclusion criteria

• Articles which are not in the fields of computer science or supply chain, or are not written in English, were excluded.

An experiment was performed to answer RQ2 and to evaluate the performance of the identified machine learning algorithms for demand forecasting. The models were trained using a sliding window and a multi-input multi-output strategy. The organization is interested in identifying the demand for the coming twelve months (long term) in order to plan their transportation capacity, so forecasting for a rolling year was performed with the eight and a half years of past data available. Six and a half years of data were used for training and two years for testing and validation.

Dependent Variable: Root mean squared error (RMSE)


Independent Variables: ANN, LSTM and CNN

Based on the literature study, RMSE is a widely used performance metric compared to other regression metrics. Because the errors are squared before being averaged, significant errors are assigned a relatively high weight by RMSE, which means RMSE is more useful when large errors are especially undesirable. RMSE does not necessarily increase with the variance of the errors; it increases with the variance of the frequency distribution of error magnitudes [10]. So, in this study RMSE is adopted to calculate the performance of the models.
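The weighting behaviour described above can be seen by computing the metrics on a toy example (the values are invented; the definitions are the standard ones, also given in section 4.5):

```python
import numpy as np

def mae(y, yhat):
    """Mean absolute error: all errors weighted equally."""
    return np.mean(np.abs(y - yhat))

def mse(y, yhat):
    """Mean squared error: squaring emphasises large errors."""
    return np.mean((y - yhat) ** 2)

def rmse(y, yhat):
    """Root mean squared error, back in the units of the data."""
    return np.sqrt(mse(y, yhat))

y = np.array([100.0, 120.0, 90.0])     # actual demand
yhat = np.array([110.0, 100.0, 90.0])  # forecast: one error of 20 dominates
print(mae(y, yhat), rmse(y, yhat))     # 10.0  vs  about 12.9
```

Because one error (20) is twice the other nonzero error (10), RMSE comes out above MAE, illustrating why RMSE penalises large errors more heavily.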

The proposed model forecasts the demand of outbound distributed products. The selected models are among the most recent and well-known forecasting techniques. The research methodology is broken down into the following phases:

4.1 Data gathering

Collecting data from an organization is not easy: it consists of several business units with many stakeholders involved. Historical data of distributed products is stored in one database, and the most recent year of data is stored in an operational data store. I managed to pull the past fifteen years of historical data of distributed trucks. The initial data collected directly from the databases contains more than 300 columns, each representing some information related to the distributed products. As not all of these columns are related to the problem, the data was refined by keeping only the required features. The final data set contains the distributed products' start date, end date, end location, monthly sold products to the end location, total demand for that particular month, and body builders. Sometimes products are not delivered directly to the end location but are routed through other countries for body building; these are called body builders.

4.2 Data pre-processing

The raw data set contains many duplicate records, and the delivery dates are unorganized. I removed all the duplicates at the start of the data analysis, then grouped the data by month, as the organization wanted a monthly forecast, and also by ordering and destination countries. Ordering countries are where the actual orders came from, and destination countries are where the final products are delivered. The data set contains many missing and null values; missing values were filled with the average value, and null values were dropped. Of the fifteen years of data collected, only eight years show some correlation; the rest of the past data has many missing values and change-overs and is not continuous. Therefore, only eight years of past data plus two months from 2019 were considered for the research.

4.3 Data set

As discussed in earlier sections, the data set is transformed into a time series supervised learning problem, where each month is set as an index and the remaining variables of distributed products are columns, with the total monthly demand as a feature. The final data set has 98 months of data with 14 variables and 1 feature, where these variables are not dependent on each other except for time and total demand. 75 percent of the data is used for training, 10 percent for validation, and 15 percent for testing. As discussed in the preliminaries, a multiple parallel input and multi-step output strategy was used for training.

Figure 4.1: Data set

The values in the data set are not the real values, as per organization rules the data set must not be shared externally. In fig 4.1 the data is recorded month-wise; in the second column, the prefix (Sweden) is the ordering country, the suffix (Germany) is the destination country, and the values are the monthly product deliveries.
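The chronological 75/10/15 split described in section 4.3 can be sketched as follows; this is a simple illustration, not the exact code used in the experiment, and for time series the order must be preserved, so no shuffling is done:

```python
import numpy as np

def chronological_split(data, train=0.75, val=0.10):
    """Split a time-ordered array into train/validation/test segments."""
    n = len(data)
    n_train = int(n * train)
    n_val = int(n * val)
    return (data[:n_train],
            data[n_train:n_train + n_val],
            data[n_train + n_val:])

months = np.arange(98)  # 98 monthly observations, as in the data set
train, val, test = chronological_split(months)
print(len(train), len(val), len(test))  # 73 9 16
```

Note that the validation segment sits between training and test in time, so the model is always evaluated on data that comes after what it was trained on.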

4.4 Experiment setup

The experiment is focused on evaluating the selected machine learning models by comparing their performance using the selected metrics. Models are trained using the multiple parallel input and output strategy. Controlled guarding is done during the training process: the batch size is the same for all models, and overfitting and underfitting are handled with callbacks, namely the early stopping and checkpoint functions. Model performance is precisely monitored during training.

The experiment was carried out on a DSVM (Data Science Virtual Machine), a Microsoft Azure virtual machine image that comes preinstalled, configured and tested with the various tools required for data science and data analysis. The experiment used Keras, an open-source library written in Python. It runs on top of TensorFlow, which is used for numerical operations and contains a bulk of machine learning algorithms. MinMaxScaler from scikit-learn is used for normalizing the values between 0 and 1. Pandas is used for data analysis and NumPy for mathematical operations. Matplotlib and seaborn are used for plotting graphs; the Python Anaconda environment and Jupyter notebooks were used. After scaling, the data is given as input as samples, time steps and features, as explained in the previous sections. Forecasted values are rescaled using the inverse scaling function from scikit-learn.
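As an illustration of the normalization step, the following is a minimal stand-in for scikit-learn's MinMaxScaler fit/transform/inverse_transform behaviour, written in plain NumPy so the arithmetic is visible; the data values are invented:

```python
import numpy as np

def fit_minmax(train):
    """Learn per-column min and max from the training data only."""
    return train.min(axis=0), train.max(axis=0)

def transform(x, lo, hi):
    """Scale each column to the [0, 1] range."""
    return (x - lo) / (hi - lo)

def inverse_transform(x, lo, hi):
    """Map scaled values (e.g. model forecasts) back to original units."""
    return x * (hi - lo) + lo

data = np.array([[10.0, 200.0],
                 [20.0, 400.0],
                 [15.0, 300.0]])
lo, hi = fit_minmax(data)
scaled = transform(data, lo, hi)
restored = inverse_transform(scaled, lo, hi)
print(scaled.min(), scaled.max())  # 0.0 1.0
```

The same inverse step is what rescales the network's forecasts back to product volumes in the experiment.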

4.5 Performance metrics

There are many evaluation metrics, but I considered mean absolute error, mean squared error and root mean squared error.

Mean absolute error: the sum of absolute differences between the actual and predicted values. It does not take the direction (positive or negative) into account; all errors are made positive.

MAE = (1/n) Σ_{i=1}^{n} |y_i − ŷ_i|   (4.1)

Root mean squared error: this measures the deviation from the actual values. It is the root of the mean of the squared errors, i.e., the differences between individual predictions and actual values are squared, summed, divided by the number of samples, and finally square rooted. The lower the RMSE value, the better the prediction.

RMSE = √( (1/n) Σ_{i=1}^{n} (y_i − ŷ_i)² )   (4.2)

Mean squared error: a procedure used for estimating an unobserved quantity.

MSE = (1/n) Σ_{i=1}^{n} (y_i − ŷ_i)²   (4.3)

4.6 Walk forward validation

Walk forward validation is used for determining the best parameters. It optimizes within the in-sample data for a time window in a data series, while the remainder of the data is reserved for out-of-sample testing: a small portion of the reserved data following the sample data is tested and the results recorded. The in-sample window is then shifted forward by the period covered by the out-of-sample test and the process is repeated; at the end, all the recorded results are used to assess the strategy and obtain suitable model parameters, and these finalized parameters are run on another segment of the data. This study adopted this validation technique: the first month is trained and the next month validated, then the first and second months are trained and tested with the third month, and so on until the final phase. The model is fitted only on the training period, its performance is assessed on the validation period, and the model is then rerun on the entire series.
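The expanding-window scheme just described can be sketched as a function that yields training and validation indices; the function name and sizes are illustrative only:

```python
def walk_forward_splits(n_obs, min_train=1):
    """Expanding-window walk-forward splits: train on months 0..k-1,
    validate on month k, then grow the window and repeat."""
    splits = []
    for k in range(min_train, n_obs):
        splits.append((list(range(k)), k))  # (training indices, validation index)
    return splits

for train_idx, val_idx in walk_forward_splits(5):
    print(train_idx, "->", val_idx)
# [0] -> 1
# [0, 1] -> 2
# [0, 1, 2] -> 3
# [0, 1, 2, 3] -> 4
```

Each validation month always lies strictly after its training window, which is what makes the scheme suitable for time series.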

Figure 4.2: Walk forward validation

Chapter 5

Results

The experiment was supervised, and the performances of both LSTM and CNN were calculated using the MSE, RMSE and MAE errors. The learning graphs, the main aim of the experiment, the forecasted values and the training process are recorded and explained in the following phases of the experiment.

5.1 Learning curve

The main objective during training is to minimize the loss between the actual output and the predicted output on the given training data. Training starts with arbitrarily set weights, and the weights are then updated incrementally as we move closer and closer to the minimum loss. The size of the steps towards the minimum depends on the learning rate. After testing and tuning the parameters, a learning rate of 0.001 was set to reach the optimum loss. The Adam optimizer, which is a variant of stochastic gradient descent (SGD), is used. Models are trained on the history as mentioned in the sections above, and learning curves are used to check whether the selected algorithms are working correctly, whether they face a bias or variance problem, and how well they perform. The early stopping and model checkpoint functions are imported from Keras to monitor learning, making sure training goes smoothly without overfitting or underfitting.
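The early-stopping behaviour provided by the Keras callbacks can be illustrated with a simplified stand-alone version of the logic; the patience value and loss sequence below are invented for the example:

```python
def early_stop_epoch(val_losses, patience=3):
    """Return (epoch training stops at, epoch of the best checkpoint).
    Stops when validation loss has not improved for `patience` epochs."""
    best, best_epoch, wait = float("inf"), 0, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch, wait = loss, epoch, 0  # new checkpoint
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch  # stop; restore best weights
    return len(val_losses) - 1, best_epoch

losses = [1.0, 0.8, 0.7, 0.71, 0.72, 0.73, 0.5]
print(early_stop_epoch(losses, patience=3))  # (5, 2)
```

Note the run stops at epoch 5, never seeing the later improvement at epoch 6; this is the trade-off the patience parameter controls.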

Figure 5.1: LSTM training graph Figure 5.2: CNN training graph


5.2 Forecasts

Figure 5.3: Actual vs forecast using CNN

Figure 5.4: Actual vs forecast using CNN

The graphs plot forecasts vs actual values, with time on the x-axis and demand volumes on the y-axis, covering the 82 months (about 7 years) of history and the forecasted demand for the 12 months (one year). As described earlier, the organization has a total of 121 distribution locations, and picturing all of them is not necessary to understand the behaviour of the models, so I selected two different destination locations where the volume is highly distributed.

5.3 Forecasting performance

In order to provide a benchmark, the volume forecasting process was also carried out using a standard feed-forward "shallow" ANN and an LSTM. Historical volume data for the 12 previous time steps, together with the same data sent to the CNN, were used as inputs to the ANN and LSTM. All three algorithms were implemented with cross validation. The forecasting performance of the three algorithms was measured using RMSE; the workings of the performance metrics are discussed in the preliminaries section.

Figure 5.5: Models performances

From the figure above we can see that the CNN outperforms both the LSTM and the ANN; comparatively, the ANN produces better results than the LSTM.

5.4 Validity threats

Internal validity refers to how well a research study has been performed [27]. Most of the recognized threats to validity during the study were either eliminated or restricted by thorough consideration of model selection, long-term forecasting strategy selection and performance measures, but it is not certain that all threats to validity were eliminated. All the models faced the same internal validity threats during training. Missing variables in a data set which have a strong dependency, ultimately affecting a model's performance, are an external validity threat. Identifying these variables can be one way to mitigate this threat, but unfortunately not all variables can be identified, which makes it a valid external threat. The size and magnitude of the data set can also be considered an external validity threat, because weekly or daily data has more samples than data grouped by month.

Chapter 6

Analysis and Discussion

Based on the literature review, LSTM (long short-term memory), convolutional neural networks, neural networks and ARIMA are the most widely used models for forecasting time series. ARIMA is better for forecasting short-term periods and suitable for forecasting univariate time series. My problem is related to forecasting multivariate data over a long horizon. At the initial stages of the experiment I tried ARIMA, vector auto regression with exogenous variables and some other machine learning techniques, but they were not able to produce optimum forecasts; instead, the ANN, LSTM and CNN techniques produced promising results.

6.1 Implementation

As mentioned, the CNN-based demand forecasting algorithm was implemented on 98 months of distributed products data. The presented methodology was implemented to perform a forecast for the rolling year, using a multi-input and multi-output strategy. Therefore, 12 time steps were fed as input into the convolution layers of the CNN. As the inputs were time series data, they needed to be in the form of a 1D grid, i.e. restructured as a 1-dimensional input; in addition, the kernels used in the convolution layers were defined as 1-dimensional kernels. In the implemented CNN, each of the convolution layers was designed to have the three phases mentioned above: 1) a convolution phase, 2) a non-linear transformation and 3) a max pooling phase. As mentioned, the convolution phase for the three layers was performed with 1D kernels. The rectified linear unit (ReLU) function was used to perform the non-linear transformation for all the convolution layers, and max pooling was performed as the pooling phase for all of them. Once the output is produced from the convolution layers, it is forwarded to the fully connected (hidden) layers. In this experiment, one hidden layer with 40 neurons was used, with ReLU as its activation function. Since there were 12 time series outputs, the output layer contained 12 neurons for the 12 output time steps, as explained in the multi-input and multi-output strategy. The output layer used a linear activation function to produce the outputs. Various CNN architectures were tested with distinct convolution layers, pooling filter sizes and kernel sizes. Training was done using the ADAM [26] algorithm as the gradient-based optimizer. The same training and testing data were used while tuning the network with different hidden layers, pool sizes and neurons.
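The three phases named above can be illustrated with toy NumPy stand-ins (these are not the Keras layers actually used in the experiment, and the input vector and kernel values are invented):

```python
import numpy as np

def conv1d(x, kernel):
    """Phase 1: 1-D convolution (cross-correlation) with a 1-D kernel."""
    k = len(kernel)
    return np.array([np.dot(x[i:i + k], kernel) for i in range(len(x) - k + 1)])

def relu(x):
    """Phase 2: rectified linear unit non-linearity."""
    return np.maximum(x, 0.0)

def max_pool(x, size=2):
    """Phase 3: non-overlapping max pooling."""
    return np.array([x[i:i + size].max() for i in range(0, len(x) - size + 1, size)])

x = np.array([1.0, 2.0, -1.0, 3.0, 0.0, 2.0])      # toy input series
out = max_pool(relu(conv1d(x, np.array([1.0, -1.0]))))
print(out)  # [3. 3.]
```

In the actual implementation these phases are stacked three times (Conv1D, ReLU, MaxPooling1D in Keras) before the flattened output reaches the fully connected layer.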
The outbound demand forecasting process was also performed using a standard feed-forward ANN and an LSTM to benchmark the results obtained from the convolutional neural network.


6.2 Discussion

• RQ 1: What are the available state-of-the-art methods used in forecasting?

Answer: Based on the results obtained from the literature study, three machine learning models, long short-term memory (LSTM), artificial neural network (ANN) and convolutional neural network (CNN), have been chosen for forecasting the demand of outbound distributed trucks.

• RQ 2: Which machine learning model would perform better forecasting on time series data?

Answer: The convolutional neural network is the best suited machine learning algorithm for forecasting the demand of outbound products. In this experiment the CNN achieved a score of 0.694, compared with 0.654 for the LSTM and 0.678 for the ANN. After performing the out-of-sample test its score increased to 0.743, which is quite promising. The models' performance is discussed in section 5.3.

Chapter 7

Conclusions and Future Work

This paper focuses on long-term demand forecasting of outbound distributed products. No relevant influencing factors other than total demand were identified, so a multiple regression approach was not considered. In the literature, most papers related to time series forecasting argue that neural networks and deep neural networks are better suited for long-term forecasting than traditional moving average methods. Models were selected based on feasibility and applicability, and the forecasting performance was compared with two selected nonlinear time series benchmarking methods, LSTM and artificial neural networks. It was found that the CNN, together with the data pre-processing measures, exhibits the better performance in out-of-sample testing. As forecasting precision is very influential for planning capacity and reducing costs for logistics companies, the CNN is thought to be the best candidate prediction approach for this case.

The experiment conducted in this research uses monthly data. Weekly demand data might produce better results, and this research did not consider any external factors. Future research can use weekly data and consider factors influencing demand to obtain more accurate results.

References

[1] Luis Aburto and Richard Weber. Improved supply chain management based on hybrid demand forecasts. Applied Soft Computing, 7(1):136–144, 2007.

[2] Kasun Amarasinghe, Daniel L Marino, and Milos Manic. Deep neural networks for energy load forecasting. In 2017 IEEE 26th International Symposium on Industrial Electronics (ISIE), pages 1483–1488. IEEE, 2017.

[3] Gianluca Bontempi. Long term time series prediction with multi-input multi-output local learning. Proc. 2nd ESTSP, pages 145–154, 2008.

[4] George EP Box, Gwilym M Jenkins, Gregory C Reinsel, and Greta M Ljung. Time series analysis: forecasting and control. John Wiley & Sons, 2015.

[5] Real Carbonneau, Kevin Laframboise, and Rustam Vahidov. Application of machine learning techniques for supply chain demand forecasting. European Journal of Operational Research, 184(3):1140–1154, 2008.

[6] Chris Chatfield. Time-series forecasting. Chapman and Hall/CRC, 2000.

[7] Pei-you Chen and Lu Liu. Study on coal logistics demand forecast based on pso-svr. In 2013 10th International Conference on Service Systems and Service Management, pages 130–133. IEEE, 2013.

[8] Paulo Cortez, Miguel Rocha, and José Neves. Evolving time series forecasting arma models. Journal of Heuristics, 10(4):415–429, 2004.

[9] Jingyi Du, Qingli Liu, Kang Chen, and Jiacheng Wang. Forecasting stock prices in two ways based on lstm neural network. In 2019 IEEE 3rd Information Technology, Networking, Electronic and Control Conference (ITNEC), pages 1083–1086. IEEE, 2019.

[10] Gabriel Fernandez. Deep Learning Approaches for Network Intrusion Detection. PhD thesis, The University of Texas at San Antonio, 2019.

[11] Fernando Turrado García, Luis Javier García Villalba, and Javier Portela. Intelligent system for time series classification using support vector machines applied to supply-chain. Expert Systems with Applications, 39(12):10590–10599, 2012.

[12] Manas Gaur, Shruti Goel, and Eshaan Jain. Comparison between nearest neighbours and bayesian network for demand forecasting in supply chain management. In 2015 2nd International Conference on Computing for Sustainable Global Development (INDIACom), pages 1433–1436. IEEE, 2015.


[13] Felix A Gers, Douglas Eck, and Jürgen Schmidhuber. Applying lstm to time series predictable through time-window approaches. In Neural Nets WIRN Vietri-01, pages 193–200. Springer, 2002.

[14] Kenneth Gilbert. An arima supply chain model. Management Science, 51(2):305–310, 2005.

[15] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning. MIT Press, 2016.

[16] Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. Speech recognition with deep recurrent neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 6645–6649. IEEE, 2013.

[17] Hu Guosheng and Zhang Guohong. Comparison on neural networks and support vector machines in suppliers' selection. Journal of Systems Engineering and Electronics, 19(2):316–320, 2008.

[18] John E Hanke, Arthur G Reitsch, and Dean W Wichern. Business forecasting, volume 9. Prentice Hall, Upper Saddle River, NJ, 2001.

[19] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

[20] Wei-Chiang Hong, Yucheng Dong, Li-Yueh Chen, and Shih-Yung Wei. Svr with hybrid chaotic genetic algorithms for tourism demand forecasting. Applied Soft Computing, 11(2):1881–1890, 2011.

[21] Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359–366, 1989.

[22] Rob J Hyndman and George Athanasopoulos. Seasonal arima models. Forecasting: principles and practice, 2015.

[23] Joarder Kamruzzaman and Ruhul A Sarker. Forecasting of currency exchange rates using ann: A case study. In International Conference on Neural Networks and Signal Processing, 2003. Proceedings of the 2003, volume 1, pages 793–797. IEEE, 2003.

[24] Karin Kandananond. Consumer product demand forecasting based on artificial neural network and support vector machine. World Academy of Science, Engineering and Technology, 63:372–375, 2012.

[25] Bekir Karlik and A Vehbi Olgac. Performance analysis of various activation functions in generalized mlp architectures of neural networks. International Journal of Artificial Intelligence and Expert Systems, 1(4):111–122, 2011.

[26] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[27] Barbara Ann Kitchenham, David Budgen, and Pearl Brereton. Evidence-based software engineering and systematic reviews, volume 4. CRC Press, 2015.

[28] Yann-Aël Le Borgne, Silvia Santini, and Gianluca Bontempi. Adaptive model selection for time series prediction in wireless sensor networks. Signal Processing, 87(12):3010–3020, 2007.

[29] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015.

[30] Yann LeCun, Yoshua Bengio, et al. Convolutional networks for images, speech, and time series. The handbook of brain theory and neural networks, 3361(10):1995, 1995.

[31] Wen-Yau Liang and Chun-Che Huang. Agent-based demand forecast in multi-echelon supply chain. Decision Support Systems, 42(1):390–407, 2006.

[32] Andrew L Maas, Awni Y Hannun, and Andrew Y Ng. Rectifier nonlinearities improve neural network acoustic models. In Proc. ICML, volume 30, page 3, 2013.

[33] Jan Olhager, Martin Rudberg, and Joakim Wikner. Long-term capacity management: Linking the perspectives from manufacturing strategy and sales and operations planning. International Journal of Production Economics, 69(2):215–225, 2001.

[34] Kwame Owusu Kwateng, John Frimpong Manso, and Richard Osei-Mensah. Outbound logistics management in manufacturing companies in ghana. Review of Business & Studies, 5(1):83–92, 2014.

[35] Gordon Stewart. Supply chain performance benchmarking study reveals keys to supply chain excellence. Logistics Information Management, 8(2):38–44, 1995.

[36] Souhaib Ben Taieb, Antti Sorjamaa, and Gianluca Bontempi. Multiple-output modeling for multi-step-ahead time series forecasting. Neurocomputing, 73(10-12):1950–1957, 2010.

[37] Sangeeta Vhatkar and Jessica Dias. Oral-care goods sales forecasting using artificial neural network model. Procedia Computer Science, 79:238–243, 2016.

[38] Wen-Jing Yuan, Jian-Hua Chen, Jing-Jing Cao, and Ze-Yi Jin. Forecast of logistics demand based on grey deep neural network model. In 2018 International Conference on Machine Learning and Cybernetics (ICMLC), volume 1, pages 251–256. IEEE, 2018.

[39] G Peter Zhang and Min Qi. Neural network forecasting for seasonal and trend time series. European Journal of Operational Research, 160(2):501–514, 2005.

[40] Xinfeng Zhang, Shengchang Wang, and Yan Zhao. Application of support vector machine and least squares vector machine to freight volume forecast. In 2011 International Conference on Remote Sensing, Environment and Transportation Engineering, pages 104–107. IEEE, 2011.

Appendix A

Supplemental Information

Figure A.1: Distribution of residuals

Figure A.2: Actual vs forecast using LSTM


Figure A.3: Decomposition of Time series

Figure A.4: forecast using LSTM

Figure A.5: forecast using LSTM
