
Master of Science in Computer Science May 2019

Packaging Demand Forecasting using Deep Neural Networks

Yashwanth Bachu

Faculty of Computing, Blekinge Institute of Technology, 371 79 Karlskrona, Sweden. This thesis is submitted to the Faculty of Computing at Blekinge Institute of Technology in partial fulfilment of the requirements for the degree of Master of Science in Computer Science. The thesis is equivalent to 20 weeks of full-time studies.

The authors declare that they are the sole authors of this thesis and that they have not used any sources other than those listed in the bibliography and identified as references. They further declare that they have not submitted this thesis at any other institution to obtain a degree.

Contact Information:
Author: Yashwanth Bachu
E-mail: [email protected]

University advisor: Dr. Hüseyin Kusetogullari Department of Computer Science

Faculty of Computing
Blekinge Institute of Technology
SE–371 79 Karlskrona, Sweden
Internet: www.bth.se
Phone: +46 455 38 50 00
Fax: +46 455 38 50 57

Abstract

Background: Logistics play a vital role in supply chain management, and logistics operations depend on the availability of packaging material for packing the goods and material to be shipped. Forecasting packaging material demand over a long period will help the organization meet that demand. Long-term forecasting on time-series data using Deep Neural Networks (DNN) is proposed for this research.

Objectives: This study identifies the DNNs used in forecasting packaging demand and in similar problems, in terms of data resembling that available at the organization (Volvo). It identifies the best-practiced approach for long-term forecasting and then combines that approach with the identified and selected DNNs. The end objective of the thesis is to suggest the best DNN model for packaging demand forecasting.

Methods: An experiment is conducted to evaluate the DNN models selected for demand forecasting. Three models are selected through a preliminary systematic literature review. Another systematic literature review is performed in parallel to identify the metrics used to evaluate model performance. Results from the preliminary literature review were instrumental in designing the experiment.

Results: The three models observed in this study perform well, producing considerable forecasting values. Given the type and amount of historical data the models were trained on, the three models show only a very slight difference in forecasting performance. Comparisons are made with the different measures selected through the literature review. For a better understanding of the impact of batch size on model performance, the three models were trained with two different batch sizes.

Conclusions: The proposed models produce considerable forecasts of packaging demand for planning the next 52 weeks (∼1 year). The results show that by adopting DNNs in forecasting, reliable packaging demand can be forecasted on packaging material data. The CNN-LSTM combination performs better than the respective individual models by a small margin. Extending the forecasting to the granular level of the supply chain (individual suppliers and plants) will benefit the organization by keeping inventory under control and avoiding excess stock.

Keywords: Deep Learning, Forecasting, Logistics.


Acknowledgments

Thanks to my parents, who are the support of my life. I would like to thank my university supervisor, Dr. Hüseyin Kusetogullari who has supported me in a good spirit and given me good advice and input to the process. Also thanks to my supervisor and co-supervisor at Volvo, Teja Yerneni and Laufer Marc for guiding me and helping in finding resources and data. Finally, I would like to thank my friend Uttejh.


Contents

Abstract

Acknowledgments

1 Introduction
  1.1 Aims and Objectives
  1.2 Research Questions

2 Related Work
  2.1 Time series forecasting
  2.2 Long-term forecasting
  2.3 Deep Neural Network in forecasting
  2.4 Metrics to evaluate forecasting Performance
  2.5 Problem Domain

3 Preliminaries
  3.1 Time-series
    3.1.1 Univariate Time-series
    3.1.2 Bivariate Time-series
    3.1.3 Multivariate Time-series
  3.2 Multi-step-ahead
    3.2.1 Recursive forecasting
    3.2.2 Direct forecasting
    3.2.3 DiRec forecasting
    3.2.4 Multiple Input Multiple Output forecasting
    3.2.5 DIRMO forecasting
  3.3 Artificial Neural Networks
    3.3.1 Recurrent Neural Networks
    3.3.2 LSTM
    3.3.3 Convolutional Neural Network
    3.3.4 CNN-LSTM

4 Adapted approaches

5 Method
  5.1 Data gathering
  5.2 Data preprocessing
  5.3 Data set description

  5.4 Experiment setup
  5.5 Performance Measures

6 Results
  6.1 Learning curve
  6.2 Performance
  6.3 Execution time
  6.4 Forecasting Results
  6.5 Deviations

7 Analysis and Discussion
  7.1 Forecasting Performance
  7.2 Execution Time
  7.3 Validity threats
  7.4 Limitations of the research

8 Conclusions and Future Work

References

List of Figures

3.1 Pictographic representation of the relationship between strategies [1]
3.2 Structure of Recurrent and Feed-Forward Neural networks [2]
3.3 LSTM cell structure [3]
3.4 convolution in CNN [4]
3.5 convolution in CNN [5]

5.1 sliding window model with 1 step increment

6.1 Learning curves
6.2 Learning curves
6.3 Forecast by LSTM models
6.4 Forecast by CNN models
6.5 Forecast by CNN-LSTM models
6.6 Deviation in CNN-LSTM forecast
6.7 Deviation in CNN and LSTM forecast


List of Tables

2.1 Metrics adapted in time-series and other forecasting

3.1 Univariate data example
3.2 Example of bivariate data
3.3 Example of multivariate data

6.1 Measures of DNN models while batch and test sample sizes are 2 and 5
6.2 Measures of DNN models while batch and test sample sizes are 2 and 1
6.3 Measures of DNN models while batch and test sample sizes are 32 and 5
6.4 Measures of DNN models while batch and test sample sizes are 32 and 1
6.5 Execution time of each DNN model


Chapter 1 Introduction

Logistics are part of supply chain management. Logistics manages the flow of goods, beginning with the supply of materials from suppliers to manufacturing and ending with the delivery of the product to the end user [6]. Packaging Management is the team responsible for providing the packaging material with which suppliers pack the parts and material for manufacturing. It is also responsible for providing the packaging material with which the manufacturing unit packs the finished products.

Volvo Group comprises several truck manufacturing companies like Volvo, Renault, UD Trucks, and MACK Trucks [7]. The Volvo Group is also home to Volvo Penta, Construction Equipment, and several bus manufacturers [7]. Volvo has standard designs and types of packaging material for the entire supply chain of all Volvo Group associates [8]. Packaging materials are used from the supplier of materials until the delivery to customers. Volvo Packaging, or V-EMB, refers to the branch of the Volvo organization responsible for packaging materials [8]. In this study, Organization or Volvo hereafter refers to Volvo Packaging.

Attaining availability of the packaging material is of the utmost importance and a Key Performance Indicator for Volvo Packaging Management, as it would be for any supply chain management that cares about the safety of the material transported. At the same time, having more than the required quantity of packaging material is a serious issue of investment loss. Balancing the two can be achieved by means of reliable forecasting and by planning well ahead for excessive and low demands. Though supply chain management as a whole needs to meet customer needs and demand, customer demand is not the only factor influencing internal demands like packaging. This is well understood from the Essentials of Supply Chain Management by Michael Hugos [6]. That is why a regular demand forecast can't simply be reused for Packaging Demand (PD), and there is a need for forecasting at the packaging level.

The packaging demand depends on several factors, like space optimization techniques, the manufacturing policies of each manufacturing plant, the packaging norms of the company, which depend on the nation being shipped to, and the fragility of the material. Considering the influence of these factors, it is not easy to obtain accurate forecasts of the packaging demand with statistical forecasting methods. Statistical forecasting models [9] like Moving Average (MA) [10], Autoregressive (AR) [10], Autoregressive Moving Average (ARMA) [9], and Autoregressive Integrated Moving Average (ARIMA) [11] have been widely used in various fields of forecasting, but it has been shown that Neural Networks outperform other forecasting methods, especially with time series data in the supply chain [12]. Therefore this thesis work looks for suitable Deep Neural Networks (DNN) for forecasting packaging demand,

considering the above factors, and also for a suitable method for long-term forecasting. As the demand for packaging is recorded per instance of time, the aim is to predict the packaging demand at time instances in the future. The research is motivated to forecast for the coming year, i.e., to forecast 52 weeks into the future. As this time span is long, it is considered long-term forecasting.

Any operation associated with a business is either directly or indirectly influenced by demand, because the purpose of a business is to address market needs. Market demand has many influencing factors, like the economy, political stability, natural disasters, etc. In this problem, the volumes of vehicle production are influenced by those factors. Volvo has an internal organization with hundreds of data scientists who consider those factors at the dealer level; they plan the production volume for the coming year from their learnings. Considering those planned vehicle production volumes will help the model learn their impact on future packaging demand.

1.1 Aims and Objectives

The main aim of this thesis is to identify a suitable Deep Learning technique to forecast packaging demand using historical time series data. The forecasting must also consider the planned production volume (these volumes are planned from the sales forecast) at the respective time intervals. To achieve the targeted aim, three major objectives are formulated, as follows:

• Understand the existing deep learning models in forecasting.

• Identify and modify the best DNN techniques for long-term packaging demand forecasting.

• Analyze the results and suggest the best-suited model for the problem.

1.2 Research Questions

The study is planned to answer the following questions:

RQ1: How can Deep Neural Networks be modeled for time series data where one of the input variables already has future values (planned volumes from the sales forecast, in this case)?
Motivation: The future planned production volumes are available, and they are major influencing factors of packaging demand, so forecasting that takes them into consideration is important for addressing such problems.

RQ2: Which performance measures should be considered to compare forecasting models?
Motivation: To compare the performance of different models in forecasting, there is a need to identify appropriate performance measures.

RQ3: Which Deep Neural Network is best suited for this problem?
Motivation: As the performance of DNNs is not the same for all problems, there is a need to identify a suitable model for the specific problem type.

The first research question investigates the existing forecasting models and the designs of those models that best suit the problem, identifying different models of time series forecasting practiced on univariate [13], bivariate [13], and multivariate [13] time-series data. There are many metrics that can be adopted for evaluating forecasting models, but the models' performance on a specific problem can't be evaluated by selecting a random metric or performance measure. The second research question is to identify metrics suitable for evaluating the models on this problem. The third question is to identify the model to suggest, by experimenting with and evaluating the shortlisted models on the dataset from Volvo. All three questions contribute to the investigation of DNN models forecasting packaging demand one year into the future. Comparing the models selected for the experiment and suggesting the best-suited, well-performing model for use in the organization's (Volvo's) tools is the final objective.

Chapter 2 Related Work

Due to the crucial role that forecasting plays in practical applications with real data, many prominent methods have been developed. These methods are generally termed statistical methods, machine learning methods, hybrid methods, and Deep Learning methods. Autoregressive (AR) [10], Autoregressive Moving Average (ARMA) [9], and Autoregressive Integrated Moving Average (ARIMA) [11] methods are widely practised in time series analysis [10, 9, 11]. Hyndman [14], in his book, has given a review of these models and of techniques for them. ARIMA is a state-of-the-art statistical method in demand forecasting; for instance, ARIMA was used to generate electricity demand forecasts in [15].

The article by Zhang et al. [16] is a complete review of the Artificial Neural Network (ANN) models in forecasting at that time. It can be used as a reference for understanding ANNs in the early stages of the field of forecasting, and it helps to identify the improvements found in the latest models. Carbonneau et al. [12] explained different forecasting techniques like Naive, Average, Moving Average (MA), Trend, and Multiple Linear Regression (MLR), along with Neural Networks, Recurrent Neural Networks (RNN), and Support Vector Machines (SVM). In their experiment, taking Mean Absolute Error (MAE) and Standard Deviation (Std.dev) as performance indicators, the authors demonstrated that RNN gives improved performance on foundry data in the supply chain [12]. An evaluation conducted by Parmezan et al. [17] provided an objective comparison between parametric and non-parametric models for time series forecasting. The study was conducted on 95 datasets with an analysis of 2090 results.

Flunkert et al. have proposed a long short-term memory (LSTM) method called DeepAR, a forecasting method based on autoregressive recurrent networks [18]. DeepAR effectively outperformed existing methods on a wide variety of datasets. In an experiment conducted by Selvin et al. [19] on stock price forecasting, the Convolutional Neural Network (CNN) performed better than RNN and LSTM. The authors stated that as CNN relies on the current information, it can perform well on stocks in the short term. As a hybrid model, an ANN combined with Genetic Algorithms (GA) was successful in performing better than a regular ANN on the Thai Stock Price Index Trend data set [20]. It has been observed that clustering and ensembling also improve the performance of a forecasting model [21]. Some papers proposed hybrid models for forecasting, combining learning models with statistical methods. Article [22] uses one such hybrid model of GA and


Holt-Winters' exponential smoothing. John Gamboa analyzed several papers and stated that several independent neural network layers working together produce better results than a regular model [23]. The author also discussed Fully Convolutional Networks (FCN), CNN, RNN, and LSTM. This suggests that DNNs perform better than regular neural networks and strongly supports further research on DNNs in time series analysis. The RNN is also a promising neural network for time-series forecasting [24, 25, 26, 27, 28]. RNN models are used for forecasting different kinds of time series data, like chaotic time series [24, 28], non-linear time series [25], and also time series with missing values [27].

Apart from developing an effective model for forecasting, there are certain things to be considered before implementing it in practice. These may include the company's maturity and its understanding of the requirements; this was discussed with respect to spare parts management in [29]. As this study is performed and planned for implementation in a well-established organization that is mature enough to use forecasts, one of the important practical requirements is met. Other practical implementation requirements are also considered and can be tested through pilot runs.

2.1 Time series forecasting

A time-series is a collection of observations made sequentially through time [30]. The most commonly recognized time-series data are stock prices. Some other examples of time series are the sales of a particular product in a supermarket, the temperature of a particular city at noon every day, and the heat demand of a particular county every month. Based on the frequency at which the data is recorded or observed, time-series are of two types: continuous time-series and discrete time-series [31]. If we have observations at every moment of time, it is considered a continuous time-series and represented as $x(t)$ [31]. If the data is recorded at fixed time intervals, like daily, hourly, or monthly, it is called a discrete time-series and represented as $x_t$ [31].

Let there be time series data $X_1, X_2, X_3, \dots, X_n$, and suppose we are interested in forecasting a future value of that time series, say $X_{n+h}$. Here $h$ is an integer called the lead time; it is also referred to as the forecasting horizon ($h$ stands for horizon). The forecast of $X_{n+h}$ made $h$ steps ahead at time $n$ is represented as $\hat{X}_n(h)$; the hat symbol distinguishes forecasted values from real ones. In a time-series forecast, it is essential to mention both the lead time (the forecast time gap) and the time at which the forecast is made.

A forecasting method is a procedure for computing forecasts from present and past values, according to Chris Chatfield [30]. Forecasting can be performed by a simple algorithmic rule. A forecasting model, on the other hand, is selected based on the given data. The terms 'forecasting method' and 'forecasting model' do not stand for the same thing and must not be used as synonyms. Time-series forecasting methods are generalized into three different types based on the type of knowledge and the available related variables [30]: judgemental forecasts, univariate methods, and multivariate methods.

2.2 Long-term forecasting

Forecasting can be of three types: short-term, medium-term, and long-term. The names indicate the differences between the types, but there is no standard that states exactly the time ranges that short-, medium-, and long-term forecasting models have to predict. Usually, forecasting for minutes, hours, and weeks is considered short-term forecasting, monthly and quarterly forecasting is considered medium-term, and forecasting beyond a year is considered long-term forecasting [32, 33]. Scott Armstrong stated in his article that "long-range" is the length of time required for all associates of the organization (system) to react to given stimuli [34].

In one study [35], a group of four researchers performed early research assessing the recurrence interval of droughts based on two different models. This research did not generate the magnitude but the recurrence time interval of occurrence. They used Autoregression 1 (AR(1)) and Functional Noise (FN) as the two models. In their research they were forecasting values in the range of 100 to 500 years, and the models can quite comfortably be called long-range models.

Forecasting website traffic is essential for many websites to make sure website links are ready to meet future traffic [36]. One such attempt to forecast website traffic was carried out by Papagiannaki and others. They forecasted website traffic demand for around 6 months at a frequency of half a day. As the predictions were at a much more granular level, 6 months is considered to be long-term. In this study the authors used the ARIMA model for the forecast. They also implemented the same model at cycles of a day (24 hours) and a week, along with 12 hours. They achieved a forecasting error of less than 15%.

In a journal article [37], Hyndman and Shu Fan researched forecasting long-term electricity demand. The forecast was presented for the South Australian region of the National Electricity Market (NEM). They proposed a methodology to forecast the density of annual and weekly peak electricity demand for the coming 10 years. They developed a model with three major features and one error feature; the three major features are calendar effects, temperature effects, and demographic/economic effects. The error feature was to account for external effects apart from those three features. The proposed model involved a split into annual effects and half-hourly effects, estimated separately. For simulating temperature, a seasonal bootstrapping method with variable blocks was performed. The results confirm that the model performs well on the historical data.

2.3 Deep Neural Network in forecasting

Many real-world problems need more accurate forecasting than statistical models provide, and the need for more accurate forecasts is increasing day by day. Forecasting using machine learning methods started around the mid-1990s; the oldest article used in this study is from 1994 [26]. A Deep Neural Network (DNN) is a sophisticated Artificial Neural Network; ANN and DNN are discussed in the next chapter. DNN is a sub-part of machine learning methods. DNNs are contributing to many fields, and especially in forecasting DNNs are scoring the best performances, as observed from the following learnings.

LSTM was implemented by Wei Bao and others in their research [38]; they implemented stacked autoencoders (SAE) to extract deep daily features. They forecasted the closing stock price of six popular stock indices one step ahead in their study. They tested LSTM performance against three state-of-the-art methods on predictive accuracy and profitability, and the proposed model of the study was proven to outperform the three others in both [38].

In other research, conducted last year by Anastasia and two others [39], a method for time-series forecasting using the deep convolutional WaveNet architecture was proposed. The study was performed on exchange rate data of five different currencies. The data was multivariate time-series data; multivariate data is discussed in the next chapter. The study compared convolutional WaveNet (cWN), Vector Autoregressive (VAR), and LSTM models. The study observed that VAR and cWN performed better than LSTM, with cWN the best of the three. They measured performance based on the mean standard deviation of the forecast.

There are some hybrid DNN models used in forecasting. Some researchers find that an ensemble of the best models results in more accurate forecasts than one best model, but this isn't applicable to all problems. In one study [40], performed by a group of four in Singapore, an ensemble model using a Deep Belief Network (DBN) and Support Vector Regression (SVR) was proposed. They compared it with DBN, SVR, a Feed-Forward Neural Network (FNN), and an Ensemble Feed-Forward Neural Network (ENN). Though the margin is very low, the proposed model performed better than the other models [40]. In search of better-performing hybrid models, John Gamboa conducted a study. In his article "Deep Learning for Time-Series Analysis" [23], he reviewed several models and concluded from his literature review that stacking several independent layers yields better results.

2.4 Metrics to evaluate forecasting Performance

Prediction performance can be evaluated through different measures: Mean Absolute Error (MAE) [12, 15, 21, 22], Standard Deviation (Std.dev) [12], Mean Absolute Percentage Error (MAPE) [21], Mean Square Error (MSE) [24, 26], Normalized Mean Square Error (NMSE) [21], Root Mean Square Error (RMSE) [15, 21, 22], Root Mean Square Percentage Error (RMSPE) [21], and Median Square Error (Median.SE) [26].

Lu and Kao [21] used five performance measures in evaluating forecast prediction, namely RMSE, MAE, MAPE, RMSPE, and NMSE. The opinion drawn from this initial literature is that using multiple performance measures helps to evaluate clearly. In some articles the performance measures are not so regular; they are modified to match the specific forecasting application. Prediction Accuracy (PA) was slightly modified and used as the second performance measure by Han et al. [28]. Some papers evaluate performance not with quantitative or qualitative measures but by visualization; [19] used a plot of predicted versus actual values to measure the performance.

Table 2.1 shows the usage of different performance measures in 18 different studies. '1' represents the usage of a measure in that particular study, while '-' represents the absence of the measure in that study. The last column in the table gives the measure used in a study that is not among the indexed columns.

Article   RMSE   MAE   MAPE   RMSPE   NMSE   Customized (others specified)
[21]      -      -     1      1       -      -
[12]      -      1     -      -       -      Std.dev
[20]      -      -     -      -       -      Customized accuracy
[18]      -      -     -      -       1      Normalized deviation
[15]      1      1     -      -       -      -
[17]      -      -     -      -       -      Visual analysis
[22]      1      1     -      -       -      -
[24]      1      -     -      -       -      -
[28]      1      -     -      -       -      Prediction Accuracy
[26]      -      -     -      -       -      MSE, Median.SE
[19]      -      -     -      -       -      Error percentage
[32]      1      1     1      -       -      -
[33]      -      -     1      -       -      -
[37]      -      1     -      -       -      -
[36]      -      -     -      -       -      Multiresolution Analysis (MR)
[38]      -      -     1      -       -      -
[1]       -      -     -      -       -      SMAPE (Sigmoid MAPE)
[40]      1      -     1      -       -      -

Table 2.1: Metrics adapted in time-series and other forecasting

From the above analysis, it has been observed that Root Mean Square Error, Mean Absolute Error, and Mean Absolute Percentage Error are used more than the other measures. Customized measures for particular problems were also explored by many researchers, but in this study the three general and most-used metrics are selected, to enable the different stakeholders of the organization to understand the metrics. The selected metrics are RMSE, MAE, and MAPE; MAPE has been adjusted to fit the problem and the nature of the data. Further discussion of these three metrics and how they work is given in Chapter 5.

2.5 Problem Domain

Packaging is essential in logistics services; the packaging of parts protects the parts from damage. Packaging with standardized designs and sizes makes the handling and shipping of parts and goods easy. To meet the demands of packaging there is a need for sophisticated forecasting.

The packaging demand is influenced by several factors like production demand, packaging inventory at production, lead time, policy changes in packaging, innovation in packaging designs, and new packaging types for special packaging. Volvo Logistics AB has a requirement for 52 weeks (1 week = 1 data point) of packaging demand forecasting, to support its decision-making on packaging for the upcoming year. The packaging demand forecast would help in the re-positioning of packaging material to meet demand at the different terminals. The annual purchase of packaging materials is high-priced, and the organization would like to optimize its expenses through accurate forecasting.

Chapter 3 Preliminaries

3.1 Time-series

As discussed in the previous section 2.1, a time-series is data recorded with a time frequency. Such time-series data is further classified into three types based on the number of variables recorded in the data: univariate [13], bivariate [13], and multivariate [13][9]. The names of the types state what they stand for.

3.1.1 Univariate Time-series

This type of time-series data has only one variable. Analyzing univariate data is comparatively simpler than analyzing the two other data types. The reason for its simplicity is that the data deals with only one changing quantity; there are no dependencies and relations, which usually cause complexity in analysis. The purpose of analyzing such data is finding the pattern of that single variable. There are many examples of such data; one example is the length of a snake. Suppose there is a pet snake and we would like to analyze the snake's growth in terms of length, without dealing with any relationship. The data would be similar to the table below.

Length (in cm)   5   9   15   22   30   45   69

Table 3.1: Univariate data example

3.1.2 Bivariate Time-series

Bivariate data has two different variables. This type of data has variables which are either directly dependent or have a certain relation in which one influences the other. Analysis of such data aims at finding that relation between the variables. This type of data is mostly gathered, and its analysis performed, when there is a remarkably well-known influence of one variable on a targeted variable. One good example of such data is attendance strength and sales in a canteen.


Attendance strength (no. of people visited)   Sales (in SEK)
12                                            150
23                                            300
25                                            310
30                                            450

Table 3.2: Example of bivariate data

3.1.3 Multivariate Time-series

Multivariate is the next data type after bivariate; here the data has more than two variables that are changing. It is much like bivariate data, but the data has two or more dependent variables. The analysis of such data depends more on the problem statement and the goals one expects from the analysis. Some interesting goals of a multivariate analysis would be identifying the most influencing (dependent) variable, the correlation between individual variables, etc. An example is the sales of computers based on the customer visits of different age groups.

Child (0-14 years)   Young (15-23)   Adult (above 23 years)   Sales (no. of computers)
2                    10              12                       5
4                    14              15                       6
6                    11              23                       5
6                    7               10                       4
5                    15              25                       8

Table 3.3: Example of multivariate data

3.2 Multi-step-ahead

Time-series forecasting predicts the next value(s). Forecasting only the next immediate value is often not very interesting to the stakeholders of the analysis. Many analyses are made in order to plan, and implementing a plan requires a certain time to act; this is where forecasting more than one next value becomes crucial in time-series forecasting. Forecasting more than one next value is termed multi-step-ahead time-series forecasting. It performs forecasting of the next h values of the time-series data. With the time-series notation of section 2.1, the forecast of h values is given by eq. (3.1) for the data $X_1, X_2, X_3, \dots, X_n$.

$$\hat{X}_n(h) = [\hat{X}_{n+1}, \hat{X}_{n+2}, \hat{X}_{n+3}, \hat{X}_{n+4}, \dots, \hat{X}_{n+h}] \qquad (3.1)$$

There are many strategies that have been adopted since early time-series analysis; here some commonly practiced strategies are discussed. Notation used in the next sections: $d$ represents the number of past values used by the model to predict the future, $w$ stands for the error term, and $f$ and $F$ represent the dependency between past and future for the forecasting model.

3.2.1 Recursive forecasting

Recursive forecasting [1] is also referred to as the 'iterated strategy' or the 'multi-stage strategy'. It is the oldest of the implemented forecasting strategies [1]. In this strategy the forecasting is performed in recursive mode, hence the common name recursive forecasting. The model is initially trained to forecast just one step ahead, i.e., the next value in the time series. Let $f$ be the model trained to perform next-value forecasting, i.e.,

$$X_{t+1} = f(X_t, X_{t-1}, X_{t-2}, \dots, X_{t-d+1}) + w \qquad (3.2)$$

where $t \in \{d, \dots, N-1\}$.

The above equation gives the forecast of a single value. To forecast h steps ahead, we forecast the first value using the above model; the output of that model is then immediately used as one of the input variables to forecast the next value with the same model. This process of feeding output back as input is repeated until the complete horizon is forecasted.

The complete forecast of the horizon is represented as:

$$\hat{X}_{n+H} = \begin{cases} \hat{f}[X_n, X_{n-1}, X_{n-2}, X_{n-3}, \dots, X_{n-d+1}], & \text{if } H = 1 \\ \hat{f}[\hat{X}_{n+H-1}, \dots, \hat{X}_{n+1}, X_n, \dots, X_{n-d+H}], & \text{if } H \in \{2, \dots, d\} \\ \hat{f}[\hat{X}_{n+H-1}, \dots, \hat{X}_{n+H-d}], & \text{if } H \in \{d+1, \dots, h\} \end{cases} \qquad (3.3)$$

Though it seems to be a good strategy, several factors can make it deviate from a proper forecast. The noise in the data and the length of the forecasting horizon are two major concerns for the recursive forecasting strategy [1]. A further issue is the accumulation of error: by recursively taking forecasts as input, the error value by the end of the horizon is much higher than at the initial forecast. This error propagation is responsible for long-horizon forecast errors [1]. At the stage where all inputs are forecasted values, the error is much higher and the forecasts are less accurate. Despite the factors discussed above for the ill performance of the recursive strategy, some research shows that this strategy is successful in certain cases [41].
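To make the recursive mechanism of eq. (3.3) concrete, the following minimal Python sketch feeds each one-step forecast back into the input window. The model object and its predict interface are illustrative assumptions, not the thesis's implementation.

```python
import numpy as np

def recursive_forecast(model, history, h, d):
    """Forecast h steps ahead with a single one-step-ahead model by
    feeding each forecast back into the input window (eq. 3.3)."""
    window = list(history[-d:])                 # last d observed values
    forecasts = []
    for _ in range(h):
        x = np.asarray(window[-d:], dtype=float).reshape(1, d)
        y_hat = float(model.predict(x).ravel()[0])  # one-step forecast
        forecasts.append(y_hat)
        window.append(y_hat)                    # the forecast becomes an input,
                                                # so errors can accumulate
    return np.asarray(forecasts)
```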

3.2.2 Direct forecasting

The independent strategy [1], or direct strategy [1], are two names for the same procedure for multi-step-ahead forecasting. In this strategy several models are trained, one model for each time step in the horizon. Each model forecasts one value for one time step of the horizon; for a horizon h, h models are trained.

$$X_{t+H} = f_H(X_t, X_{t-1}, X_{t-2}, \dots, X_{t-d+1}) + w \qquad (3.4)$$

where $t \in \{d, \dots, N-1\}$ and $H \in \{1, \dots, h\}$. The complete forecast of the horizon, combining all the models, is:

$$\hat{X}_{n+H} = \hat{f}_H[X_n, X_{n-1}, X_{n-2}, X_{n-3}, \dots, X_{n-d+1}] \qquad (3.5)$$

Unlike the previous strategy, the direct strategy does not feed on any predicted value, thus avoiding the accumulation of error seen in the recursive strategy. One drawback of using this strategy is the higher computational cost, as the number of models equals the length of the horizon [1]. Another issue is that it misses dependencies along the forecasted horizon, as it does not let the models learn the complex relations between the variables; the relation between the forecasted values is not considered.
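The direct strategy can be sketched as follows: one independent model is fitted per horizon step, and none of them ever consumes a forecast. The make_model factory and its fit/predict interface are hypothetical placeholders for any regressor.

```python
import numpy as np

def fit_direct_models(make_model, X, Y):
    """Direct strategy: train one independent model per horizon step.
    X has shape (samples, d); Y has shape (samples, h)."""
    models = []
    for H in range(Y.shape[1]):        # one model per step n+H+1
        m = make_model()
        m.fit(X, Y[:, H])              # each model only sees its own step
        models.append(m)
    return models

def direct_forecast(models, last_window):
    """All models forecast from the same observed window; no predicted
    value is ever fed back, so errors do not accumulate."""
    x = np.asarray(last_window, dtype=float).reshape(1, -1)
    return np.asarray([float(m.predict(x).ravel()[0]) for m in models])
```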

3.2.3 DiRec forecasting

DiRec forecasting [42], as the name indicates, stands for both the direct and recursive strategies; it is a combination of the two. In the DiRec strategy, as in the direct strategy, every value in the horizon is forecast with a different model at each time step. But at each time step, as in the recursive strategy, the model includes the variables of the previous step among its inputs.

$$X_{t+H} = f_H(X_t, X_{t-1}, X_{t-2}, \dots, X_{t-d+1}) + w \qquad (3.6)$$

where $t \in \{d, \dots, N-h\}$ and $H \in \{1, \dots, h\}$. The forecast of the complete horizon is obtained by:

$$\hat{X}_{n+H} = \begin{cases} \hat{f}[X_n, X_{n-1}, X_{n-2}, X_{n-3}, \dots, X_{n-d+1}], & \text{if } H = 1 \\ \hat{f}[\hat{X}_{n+H-1}, \dots, \hat{X}_{n+1}, X_n, \dots, X_{n-d+H}], & \text{if } H \in \{2, \dots, h\} \end{cases} \qquad (3.7)$$

It has outperformed both of the previous strategies [43].

3.2.4 Multiple Input Multiple Output forecasting

The last three strategies discussed are considered multiple-input single-output, since the models are designed to take several input variables and forecast one value. Though the strategies achieve multi-step output, the model still produces only one output at a time. Among the early researchers in this field, Bontempi [44] identified the need to avoid mapping multiple inputs to a single output, because doing so makes the forecast neglect the stochastic dependencies between the future (forecasted) values [44]. He introduced the Multiple-Input Multiple-Output (MIMO) [1] strategy to forecast multiple values at a time with a single model.

Using MIMO on a time series $X_1, X_2, X_3, \dots, X_n$:

$$[X_{t+h}, \dots, X_{t+1}] = F(X_t, \dots, X_{t-d+1}) + w \qquad (3.8)$$

where $t \in \{d, \dots, N-h\}$, $w \in \mathbb{R}^h$, and $F : \mathbb{R}^d \Rightarrow \mathbb{R}^h$ is a vector-valued function. The forecasts by MIMO, produced in a single step, are:

$$[\hat{X}_{n+h}, \dots, \hat{X}_{n+1}] = \hat{F}(X_n, \dots, X_{n-d+1}) \qquad (3.9)$$

The error accumulation of the recursive strategy is avoided in MIMO, as the output is produced in a single step and the model is not fed with forecasted values. It also considers the dependencies between the forecasted values, unlike the direct strategy. This strategy has been widely applied and successful results have been registered [1]. But it reduces the flexibility of the forecasting approach because of the single model.
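A minimal Keras sketch of a MIMO forecaster follows: a single network emits the whole horizon in one step. The layer sizes are illustrative assumptions, not the configuration used in the experiment.

```python
from tensorflow import keras

def build_mimo_model(d, h):
    """MIMO strategy: one network maps a window of d past values to
    all h future values at once (eq. 3.9)."""
    model = keras.Sequential([
        keras.layers.Input(shape=(d,)),
        keras.layers.Dense(64, activation="relu"),
        keras.layers.Dense(h),          # one output unit per horizon step
    ])
    model.compile(optimizer="adam", loss="mse")
    return model

# Usage: model = build_mimo_model(d=52, h=52)
# model.fit(X, Y)   # X: (samples, 52), Y: (samples, 52)
```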

3.2.5 DIRMO forecasting

MISMO [1] is the other name of the DIRMO [1] strategy. DIRMO is a hybrid of the MIMO and direct strategies. It forecasts the horizon h in blocks, each block an output of size s, resulting in n forecasting tasks (n = h/s). In total, n models contribute their forecasts to the horizon. If s equals 1, then n equals h and the forecasting is the direct strategy. If s equals h, then n is 1 and it is the MIMO forecasting strategy. The value of s is usually configured between 1 and h.

Balancing and selecting an optimal s value preserves the qualities of both the MIMO and direct strategies in MISMO (DIRMO). The flexibility of not having a single model and the stochastic dependency of future values can both be attained by this strategy.

Let there be n models $F_p$, from the time series $X_1, X_2, X_3, \dots, X_n$:

$$[X_{t+p \cdot s}, \dots, X_{t+(p-1) \cdot s+1}] = F_p(X_t, \dots, X_{t-d+1}) + w \qquad (3.10)$$

where, if $s > 1$, then $t \in \{d, \dots, N-h\}$, $w \in \mathbb{R}^s$, and $F_p : \mathbb{R}^d \Rightarrow \mathbb{R}^s$ is a vector-valued function. The forecast of the horizon h by the n learned models is:

$$[\hat{X}_{n+p \cdot s}, \dots, \hat{X}_{n+(p-1) \cdot s+1}] = \hat{F}_p(X_n, \dots, X_{n-d+1}) \qquad (3.11)$$

Further studies of MISMO were carried out by Ben Taieb and other researchers [45]. Figure 3.1 illustrates the relation between the discussed strategies. It depicts clearly that DiRec is a combination of Recursive and Direct, while DIRMO is a combination of MIMO and Direct.

3.3 Artificial Neural Networks

Artificial Neural Networks (ANN) are parallel computational models comprised of densely interconnected adaptive processing units. ANNs are well known for their adaptive nature; they have replaced programming for solving many problems by learning from history or examples. Parallel architecture is another important feature of ANNs, which helps them perform faster computations. ANNs are used in a wide variety of areas like forecasting, gaming, image recognition, voice recognition, classification of complex datasets, and many more.

Figure 3.1: Pictographic representation of the relationship between strategies [1]

An ANN is built up of several basic artificial neurons. These neurons were presented as models of biological neurons. An artificial neural network is a group of simple processing units; these processing units are referred to as neurons or cells [46]. These neurons communicate by sending signals to each other through weighted connections, which are huge in number compared to the count of neurons. Each connection has a defined weight $w_{jk}$, which determines the strength of the signal that neuron j sends to neuron k.

Each neuron performs some basic computations and sends the result to all of its output connections, which are inputs to other neurons. The computations performed at a neuron are mainly of two types: an aggregation function and an activation function. Aggregation is the basic sum of all inputs of the particular neuron, and the activation function then takes that aggregation. Activation functions differ between networks. Some popular activation functions are the logistic sigmoid, the hyperbolic tangent, and the most popular, rectified linear units (ReLU). ReLU has two additional benefits compared to the sigmoid and the hyperbolic tangent: first, it reduces the likelihood of the gradient vanishing, tackling the gradient descent problem, and second, sparsity. In this study, we have considered ReLU as the activation function for all types of Neural Network (NN) models used in the experiment.

$$\sigma(x) = \frac{1}{1 + e^{-x}} \qquad (3.12)$$
$$\tanh(x) = \frac{2}{1 + e^{-2x}} - 1 \qquad (3.13)$$
$$R(x) = \max(0, x) \qquad (3.14)$$

A complete artificial neural network has three basic types of layers: an input layer, hidden layers, and an output layer. The input layer is the first layer of the neural network, where the network is given inputs to process. The output layer is the layer where the neural network gives output corresponding to the given inputs and the weights associated with the network. Some well-known neural networks are Feed-Forward Neural Networks, Radial Basis Function Neural Networks, Restricted Boltzmann Machines (RBM), Hopfield Networks, Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN), and many hybrid and deep neural networks. As discussed in the previous chapter, a DNN is a Deep Neural Network, where the neural network has more than one hidden layer along with the input and output layers. From the early study, it was shown that RNN-LSTM and CNN are among the top-performing neural networks for forecasting problems, especially with time-series. A brief introduction to these networks follows in the next sections.
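Returning to the activation functions of eqs. (3.12)-(3.14), together with the aggregation step of a single neuron, they can be written directly in NumPy. This is a small illustrative sketch with made-up numbers, not part of the experimental code.

```python
import numpy as np

def sigmoid(x):                       # eq. (3.12)
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):                          # eq. (3.13), tanh via the sigmoid-like form
    return 2.0 / (1.0 + np.exp(-2.0 * x)) - 1.0

def relu(x):                          # eq. (3.14)
    return np.maximum(0.0, x)

# A single neuron: aggregate (weighted sum of inputs), then activate.
inputs = np.array([0.5, -1.2, 3.0])
weights = np.array([0.4, 0.3, -0.1])
print(relu(np.dot(weights, inputs)))  # prints 0.0 here, since the sum is negative
```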

3.3.1 Recurrent Neural Networks

Recurrent Neural Networks (RNN) are the neural networks most adapted to sequential data. They are widely used in the areas of speech recognition and time-series data [12, 18]. An RNN cell has an additional connection from the cell to itself, unlike regular cells, which is why it is called recurrent. The RNN is a natural generalization of feed-forward neural networks. RNNs are inherently deep in time, as an RNN's hidden state is a function of all previous hidden states [3].

Figure 3.2: Structure of Recurrent and Feed-Forward Neural networks [2]

For a sequence $X = (x_1, \dots, x_T)$, a regular RNN creates the hidden vector sequence $H = (h_1, \dots, h_T)$ and the output vector sequence $Y = (y_1, \dots, y_T)$ by iterating the following equations from t = 1 to T:

$$h_t = \mathcal{H}(W_{xh} x_t + W_{hh} h_{t-1} + b_h) \qquad (3.15)$$
$$y_t = W_{hy} h_t + b_y \qquad (3.16)$$

where $W$ represents the weights, $b$ denotes the biases, and $\mathcal{H}$ is the hidden layer function.

3.3.2 LSTM

LSTM is a type of Recurrent Neural Network; the name stands for Long Short-Term Memory. LSTM is considered to be more effective on sequential data than other NN models. As the name indicates, LSTMs capture long-term dependencies. Unlike other recurrent networks, LSTM is largely free from the optimization problems of training (such as vanishing gradients). LSTM is used in advanced learning problems like speech recognition, handwriting recognition, time-series analysis, and translation.

Figure 3.3: LSTM cell structure [3]

LSTM adds an extra feature to the RNN: the LSTM cell has memory. This memory cell stores the previous values of the cell. In order to regulate the flow of the memory, there are gating units in the LSTM cell. An LSTM cell has three gates: the input, forget, and output gates. The input gate lets the inputs from other neurons into the LSTM cell, and similarly the output gate sends the aggregated value to the activation function. These gates are well understood from the LSTM cell architecture.

$$i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} c_{t-1} + b_i) \qquad (3.17)$$
$$f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} c_{t-1} + b_f) \qquad (3.18)$$
$$c_t = f_t c_{t-1} + i_t R(W_{xc} x_t + W_{hc} h_{t-1} + b_c) \qquad (3.19)$$
$$o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} c_t + b_o) \qquad (3.20)$$
$$h_t = o_t R(c_t) \qquad (3.21)$$

where $\sigma$ is the logistic sigmoid function, $i$ is the input gate, $f$ is the forget gate, $o$ is the output gate, $c$ is the cell activation vector, and $R$ is the ReLU activation function.
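As an illustration of how such an LSTM is assembled in Keras (the framework used in Chapter 5), a minimal MIMO LSTM forecaster could look like the sketch below. The unit count of 64 is an assumption, as the thesis does not state its exact layer sizes here.

```python
from tensorflow import keras

def build_lstm_model(window, n_features, horizon):
    """Minimal LSTM regressor for MIMO multi-step forecasting."""
    model = keras.Sequential([
        keras.layers.Input(shape=(window, n_features)),
        keras.layers.LSTM(64, activation="relu"),  # ReLU, as adopted in this study
        keras.layers.Dense(horizon),               # e.g. 52 outputs for 52 weeks
    ])
    model.compile(optimizer="adam", loss="mse")
    return model
```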

3.3.3 Convolutional Neural Network

The Convolutional Neural Network (CNN) is most common in image processing, but it has also shown good forecasts when compared to traditional forecasting methods. This impression is drawn from the early analysis discussed in Chapter 2 of this study. A hidden layer in a CNN is a convolutional layer. In image processing, a CNN uses the spatial information between the pixels of an image [23]; thus it can be said that CNNs are based on discrete convolution. Pooling is applied for translation invariance of the learned features. Two types of pooling are performed: average pooling and max pooling. In average pooling the average of the window is calculated and passed on, while in max pooling the highest value in the window is passed on. A CNN performs activation and aggregation calculations similar to feed-forward neural networks. Fully Convolutional Networks (FCN) allow the input and output layers to have the same dimensions by adding a decoder stage of upsampling, convolution, and rectified linear unit layers to the CNN architecture.

Figure 3.4: convolution in CNN [4]
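A hedged Keras sketch of the convolution-and-pooling pipeline described in this section, applied to time-series input with a MIMO output; the filter count and kernel size are illustrative assumptions.

```python
from tensorflow import keras

def build_cnn_model(window, n_features, horizon):
    """Minimal 1-D CNN forecaster: convolution extracts local features,
    max pooling gives translation invariance, and a dense layer emits
    the whole horizon at once."""
    model = keras.Sequential([
        keras.layers.Input(shape=(window, n_features)),
        keras.layers.Conv1D(filters=32, kernel_size=3, activation="relu"),
        keras.layers.MaxPooling1D(pool_size=2),  # keep the strongest response per window
        keras.layers.Flatten(),
        keras.layers.Dense(horizon),
    ])
    model.compile(optimizer="adam", loss="mse")
    return model
```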

3.3.4 CNN-LSTM

In research [47], a convolutional LSTM model was used to handle spatiotemporal data, owing to its usage of full connections in the input-to-state and state-to-state transitions. From the early study, we also observed in some research that a CNN forecasting time-series outperforms LSTM due to the CNN's ability to extract and learn features; the CNN is good at this with one-dimensional data. The LSTM model, on the other hand, is very good at extracting and learning long-term dependencies.

Figure 3.5: convolution in CNN [5]

To ensure both qualities in long-term forecasting, this study proposes a hybrid model of both the CNN and the LSTM, called CNN-LSTM. This type of model has been used in fields like emotion prediction from speech. Figure 3.5 gives a brief description of how a CNN layer is connected to the LSTM in a CNN-LSTM. In this model, the CNN is used to interpret sub-sequences of the input data, and those sequences are provided to the LSTM. The input sequence is divided into sub-sequences and fed into the CNN. The CNN interprets those sub-sequences and sends them to the next layer, pooling. The pooling layer then concatenates the sequences and pushes them to the next layer, the LSTM. The LSTM layer feeds from the pooling layer, processes the input, and pushes it to a dense layer and from there to the final output layer.

Chapter 4 Adapted approaches

From the early study and the preliminary literature review conducted in the very first days of the research, some DNN models were selected for the study. The selection was guided by the supervisors of the study, considering the data structure and the correlation in the data. The aim of the study is to forecast long-term packaging demand, which is a multi-step-ahead forecasting problem. The selection of a suitable strategy was given importance in the study, and it was made from the learnings mentioned in the previous chapter.

MIMO

After analyzing the five strategies that are widely used for multi-step-ahead forecasting, Multiple Input Multiple Output (MIMO) is considered the best suited for the problem. Though DIRMO (MISMO) is a more advanced strategy, balancing stochasticity and variable dependencies, the time-series data associated with this problem was found to have strong correlations among its variables. Thus the Multiple Input Multiple Output strategy is adopted for the long-term forecasting of packaging demand.

DNN Models

With machine learning there are many models and methods for forecasting. From a detailed analysis of the related work and previous research in forecasting using Deep Neural Network models, two DNNs were found to perform with the best results: LSTM and CNN are the two DNN models performing well in long-term time-series forecasting. As one of the aims of this study is to identify the best model for the problem, a hybrid model is also selected as the third model for the experiment. The selected hybrid model is CNN-LSTM: the data is divided into sub-sequences, and the input is interpreted by a CNN layer before pooling; after pooling it is sent to an LSTM to forecast. In total, three models are selected for the research, and all three are implemented with the MIMO strategy, as sketched below.
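A minimal Keras sketch of such a CNN-LSTM, assembled with TimeDistributed wrappers so the same CNN interprets each sub-sequence before the LSTM; all layer sizes are illustrative assumptions rather than the exact experimental configuration.

```python
from tensorflow import keras

def build_cnn_lstm_model(n_subseq, subseq_len, n_features, horizon):
    """CNN-LSTM hybrid: the input window is split into n_subseq
    sub-sequences, a TimeDistributed CNN interprets each one, pooling
    condenses it, and an LSTM learns dependencies across the
    sub-sequences before the dense MIMO output."""
    model = keras.Sequential([
        keras.layers.Input(shape=(n_subseq, subseq_len, n_features)),
        keras.layers.TimeDistributed(
            keras.layers.Conv1D(filters=32, kernel_size=2, activation="relu")),
        keras.layers.TimeDistributed(keras.layers.MaxPooling1D(pool_size=2)),
        keras.layers.TimeDistributed(keras.layers.Flatten()),
        keras.layers.LSTM(64, activation="relu"),
        keras.layers.Dense(horizon),
    ])
    model.compile(optimizer="adam", loss="mse")
    return model

# Usage: a 52-week window split into 4 sub-sequences of 13 weeks each.
# model = build_cnn_lstm_model(n_subseq=4, subseq_len=13,
#                              n_features=176, horizon=52)
```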


Chapter 5 Method

This study uses two research methods to address its three research questions: an initial literature review and an experiment.

A literature review of related work was conducted in the early stage of the research to identify the different approaches used to address different forecasting problems. The merits of each model, and the causes of its better performance on its particular problem, were identified. Future production volumes are also given to the models in the form of an input feature. Three models suitable for forecasting packaging demand using historical time series data and planned production demand were selected. This part of the literature review addressed the first research question. The initial literature review had another objective: identifying performance measures for forecasting models. The selection of performance measures was carried out with the intention of evaluating time-series forecasting models, so only articles and journals on time-series were considered when selecting them. After a careful study of the articles and papers, three measures were selected from the table derived from the study in Chapter 2. Detailed descriptions of those measures can be found in the later sections of this chapter. This part of the study addresses the second research question.

An experiment was performed to evaluate the selected DNN models for forecasting. The experiment was performed in a controlled environment: three DNN models were implemented with the MIMO strategy and evaluated with the three selected performance measures. The models were trained with a sliding window procedure. The division into training samples and testing samples was made based on data availability and on the use case of the organization. As the organization wants to identify the demand of the coming 52 weeks, 52-steps-ahead forecasting is performed, and validation sets of sizes 5 and 1 were considered. The validation set of size 5 has samples with a horizon of 52 each, but the five samples start with a lag of one time-step, the immediate time-step, one time-step ahead, two time-steps ahead, and three time-steps ahead, respectively. The first validation set shows how robust the model is at predicting with time gaps; the second validation set helps in identifying the model suited for the problem.

The experiment phase starts with the collection of data, continues with preprocessing the data, structuring/reshaping the data, setting up the experiment, and developing the models, and concludes with the collection of observations and performance measurements. The actions and steps performed during those phases are described in the next sections.


5.1 Data gathering

Data collection was one of the toughest phases of this study. The organisation (Volvo) has many stakeholders for the data, which is recorded in an SAP application, and only the last 6 months of data are available in the system. After reaching out to every high-level stakeholder of the system, we finally got hands on the real data. The data was in an archive but well established in a local database by an analyst.

The data captured by the system consists of transactions of packaging materials. Those transactions are recorded between every pair of nodes of the entire supply chain system. With the intention of pulling only value-adding data, the features were studied and the key nodes of the supply chain were identified and selected from the 6 months of data. Then, with certain parameters, the transactions of the selected leg of the supply chain [6] were pulled out for the research. A leg generally refers to a small portion of a supply chain in the logistics industry; a leg has an origin, a carrier, and a destination.

It was observed that the same packaging material was recorded in several transactions during one cycle through the entire supply chain. That is because packaging material moves from the pool to a supplier, sometimes from the supplier to a second-tier supplier, and there are cases where more than two suppliers transit the material before it reaches the factory. From the factory, some portion is sent back to the pool for reuse while the rest is sent to customers, dealers, or bodybuilders. That means one piece of packaging is recorded on several legs of the supply chain. From the analysis of six months of data, it was observed that the total packaging transactions are best captured at the inflow of the factory, i.e., the transactions with materials received by the factory form the leg with the fewest duplicates, where the complete track can be identified irrespective of the sender. The transactions were pulled out for the research from that particular leg.

The other data associated with the research was the vehicle production volumes at the plant (factory); production is the reason behind the entire supply chain and the demand for packaging materials. The production data was clean and ready to use. It contains production volumes at the factory in the past and also in the future; as mentioned in the introduction, the production volumes of the future are planned vehicle production volumes.

5.2 Data preprocessing

Once the packaging data had been pulled with the filter, there were very few duplicate transactions, and they were dropped as the first step of cleaning the data. There were also some empty transactions with null values; learning from the application users, we realized that those transactions are junk generated when a transaction ends abruptly while being entered into the system. So those empty transactions were also dropped. The data then remained clean, without null values, duplicates, or missing values. But this transactional data was not in the exact shape to feed into the model, and it needed reshaping. The data had to be combined with the respective production data of the factory before reshaping. To combine the two data sets, the common dimension was the date of the transaction and the production volumes of the factory on that date. After combining on the date, it was observed that the data is a discrete time-series with the day as the smallest granular time value.

As materials are clearly not used on the same day they are shipped for production, there likewise wasn't any correlation between deviations in production and materials in the daily data. To overcome this problem, as suggested by many researchers and followed in papers [11], the data was scaled up to weekly data. After scaling, it was observed that the data is continuous and has a linear trend and seasonalities, like most continuous time-series data.

Figure 5.1: sliding window model with 1 step increment

The final data, ready for research, is in shape: it has the week numbers set as the index and the other variables, from the packaging data and the production volumes, in the other columns. As per the adopted strategy, MIMO, the data was in structure, but the future production volumes had to be inserted as input features and the data also needed to be sliced into frames. The future volumes of production were added to the data by duplicating the production volume columns and shifting them up by the input frame size of the model, 52. Then, starting with the first time step, the first 52 steps were sliced into the first input sample and the very next 52 steps into the first output sample. The slicing continues, moving one step forward, until the end of the data is met, as sketched in the code below.
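A small Python sketch of this slicing, assuming the weekly data is a (weeks, features) NumPy array; the choice of column 0 as the target is a hypothetical placeholder, since the real data has 172 material columns.

```python
import numpy as np

def make_sliding_windows(data, window=52, horizon=52, step=1):
    """Slice a (weeks, features) array into overlapping input/output
    pairs with a 1-step increment, as in Figure 5.1."""
    X, Y = [], []
    for start in range(0, len(data) - window - horizon + 1, step):
        X.append(data[start:start + window])                        # 52 input steps, all features
        Y.append(data[start + window:start + window + horizon, 0])  # next 52 values of one target
    return np.asarray(X), np.asarray(Y)

# Usage with weekly data of shape (217, 176):
# X, Y = make_sliding_windows(weekly_data)   # X: (n, 52, 176), Y: (n, 52)
```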

5.3 Data set description

The data set is time-series data, as discussed in the earlier sections. The data has values recorded from 2015 up to the day of data gathering, i.e., the end of February. The final processed data set has a total of 217 weeks of data. The data set has 176 variables; 2 of the 176 variables are features representing the production volumes of medium and heavy truck production. Similarly, 2 more features represent future volumes (recorded or planned) with a time gap equal to the size of the input frame, i.e., the 1st week has the 53rd week's production volume as its future production volume. The remaining 172 variables represent the different types of materials being used. The data is divided into training and testing data: 160 weeks of data are used for training the different models, while 57 weeks of data were untouched and kept aside for testing.

5.4 Experiment setup

As the experiment evaluates the selected DNN models and compares their performance, proper care was needed to train the models with equal priority. All the models are implemented with the same strategy, and the training batch size was fixed to the same value for all models. To make sure that stopping the training by a defined parameter of either epochs or learning rate did not impact the performance, all models were defined with early stopping with a wait time of 400 epochs. To make sure that each model was provided with equal resources during the entire experiment, both training and evaluation were performed in an online virtual machine. The online virtual machine used for this experiment was Microsoft's Data Science Virtual Machine (DSVM).

The experiment used Keras models for LSTM and CNN, with TensorFlow in the background for the NN operations. A Scikit-learn scaler is used for normalizing and de-normalizing the data. The data is normalized before being prepared for the neural network; after normalization, the data slicing is performed as discussed in the previous section. The sliced data is fed into the input layer of the model under experiment, and then training or forecasting takes place. In forecasting, the forecasted data is de-normalized for practical use, but due to data privacy it is not disclosed in this study. Visualization of the learning and the forecasts was done with the matplotlib library; Jupyter Notebook and Anaconda were the other tools used. A sketch of this setup is given below.
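A hedged sketch of the normalization and early-stopping configuration described above; MinMaxScaler and the monitored quantity (validation loss) are assumptions, as the text only names a Scikit-learn scaler and a patience of 400 epochs, and the placeholder data stands in for the confidential data set.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from tensorflow.keras.callbacks import EarlyStopping

# Placeholder standing in for the confidential weekly data set.
data = np.random.rand(217, 176)              # (weeks, features)

# Scale every feature to [0, 1] before windowing.
scaler = MinMaxScaler()
data_scaled = scaler.fit_transform(data)

# Early stopping with a patience of 400 epochs, as configured in the study.
early_stop = EarlyStopping(monitor="val_loss", patience=400,
                           restore_best_weights=True)

# model.fit(X_train, Y_train, epochs=10000, batch_size=2,
#           validation_data=(X_val, Y_val), callbacks=[early_stop])
# After prediction, invert the scaling on the target columns with the
# fitted scaler to recover forecasts in real units.
```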

5.5 Performance Measures

The three performance measures selected from the study for evaluating the DNNs are:

Root Mean Square Error (RMSE): This measures the deviation from the actual values. As the name states, it is the root of the mean of the squared errors: the differences between the individual predictions and the actual values are squared, summed, divided by the number of predictions, and finally square-rooted. The lower the RMSE value, the better the prediction.

RMSE = \sqrt{ \frac{ \sum_{t=1}^{n_f} \left( y_{predicted}(t) - y_{actual}(t) \right)^2 }{ n_f } }    (5.1)

Here $y_{predicted}(t)$ is the predicted value at time $t$, $y_{actual}(t)$ is the actual value at time $t$, and $n_f$ is the number of future values predicted.

Mean Absolute Error (MAE): This measure also calculates the deviation from the actual values, expressed in the same units as the forecast and real values. To prevent the negative and positive deviations from cancelling each other out, the absolute deviation is used.

MAE = \frac{1}{n_f} \sum_{t=1}^{n_f} \left| y_{predicted}(t) - y_{actual}(t) \right|    (5.2)

Here the notation is the same as in Equation (5.1).

Mean Absolute Percentage Error (MAPE): MAPE is a transformed version of MAE: the measure is transformed from the value units into a percentage of deviation. MAPE is easier to compare than MAE because it is expressed in percentages.

MAPE = \frac{1}{n_f} \sum_{t=1}^{n_f} \frac{ \left| y_{predicted}(t) - y_{actual}(t) \right| }{ y_{actual}(t) } \times 100    (5.3)

The notation is again the same as in Equation (5.1). MAPE contains a division by the actual value, which can be zero in many cases in the given data; measuring directly with the formula above would therefore not yield a computable result. For this data, MAPE was customized into the Percentage of Errors (PE), an equally informative performance measure.

PE = \frac{1}{n_f} \sum_{t=1}^{n_f} \frac{ \left| y_{predicted}(t) - y_{actual}(t) \right| }{ 1 } \times 100    (5.4)

Here the notation is the same as in Equation (5.1); the denominator is fixed to 1, so PE expresses the mean absolute error of the normalized values directly as a percentage.
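The three measures are straightforward to compute; a sketch with NumPy, following Equations (5.1), (5.2), and (5.4) (note that, with the denominator fixed to 1, PE is simply MAE read as a percentage of the normalized scale):

    import numpy as np

    def rmse(predicted: np.ndarray, actual: np.ndarray) -> float:
        """Equation (5.1): root of the mean squared deviation."""
        return float(np.sqrt(np.mean((predicted - actual) ** 2)))

    def mae(predicted: np.ndarray, actual: np.ndarray) -> float:
        """Equation (5.2): mean of the absolute deviations."""
        return float(np.mean(np.abs(predicted - actual)))

    def pe(predicted: np.ndarray, actual: np.ndarray) -> float:
        """Equation (5.4): MAPE with the denominator fixed to 1."""
        return float(np.mean(np.abs(predicted - actual)) * 100)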

Chapter 6 Results

The experiment was conducted and the performance measures were recorded. The DNN evaluation metrics were recorded, and the results are presented in terms of three different measures, RMSE, MAE, and PE, for each DNN model. The other observations recorded during the experiment are the learning curves, the execution times and, the main target of the experiment, the forecast values.

6.1 Learning curve

Every machine learning model learns from historical data, and the DNN models used here are trained on history as described in the previous chapter. The models have a factor called the learning rate, which determines to what extent newly acquired information overrides old information. Training of a model is stopped when its loss is stable; in this experiment, the early stopping method has been implemented to avoid overfitting. Each model's training and validation loss changes from epoch to epoch during the training phase, and the line traced by these loss values is called the learning curve. The learning curves of the three models are plotted in the graphs below; the blue line is the learning curve on the training set and the orange line is the learning curve on the validation set. Figure 6.1 (a) is the plot of the CNN model's learning curves and Figure 6.1 (b) of the CNN-LSTM model's. Figure 6.2 (a) is the plot of the LSTM model's learning curves up to the point where the model was saved by the early stop, and Figure 6.2 (b) continues the LSTM learning curves, illustrating the overfitting that caused the early stop.
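Curves like those in Figures 6.1 and 6.2 can be reproduced from the Keras training history; a sketch, assuming `history` is the object returned by model.fit in the setup of Section 5.4:

    import matplotlib.pyplot as plt

    plt.plot(history.history["loss"], label="training loss")        # blue by default
    plt.plot(history.history["val_loss"], label="validation loss")  # orange by default
    plt.xlabel("epoch")
    plt.ylabel("loss (MSE)")
    plt.legend()
    plt.show()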

Figure 6.1: Learning curves. (a) CNN; (b) CNN-LSTM


Figure 6.2: Learning curves. (a) LSTM; (b) LSTM, continued

6.2 Performance

To measure the performance of the models quantitatively, their forecasts were analyzed with different performance measures: the two standard measures RMSE and MAE, and the customized measure PE. The performance measures discussed in the previous chapter are applied to the forecasts of all three models, with two variants of each, on two different test sample sizes, resulting in twelve instances to measure. The calculated measures are tabulated into four tables. Table 6.1 shows the models' performance measures with batch size 2 and test sample size 5; Table 6.2 with batch size 2 and test sample size 1; Table 6.3 with batch size 32 and test sample size 5; and Table 6.4 with batch size 32 and test sample size 1. The results are grouped with respect to batch size and test sample size for better analysis: grouping by sample size helps in evaluating the models' performance on an immediate forecast versus a forecast with a time gap, while grouping by batch size makes it possible to understand the impact of batch size on the models' forecasts. The results of the measures are discussed in the next chapter.

           RMSE           MAE                    PE
LSTM       0.2286808444   0.1407446712255478     14%
CNN        0.1974456079   0.12384917587041855    12%
CNN-LSTM   0.1933836199   0.11955249309539795    12%

Table 6.1: Measures of the DNN models with batch size 2 and test sample size 5

           RMSE           MAE                    PE
LSTM       0.225427572    0.14006835222244263    14%
CNN        0.1822909741   0.11193402111530304    11%
CNN-LSTM   0.185011211    0.11410592496395111    11%

Table 6.2: Measures of the DNN models with batch size 2 and test sample size 1

           RMSE           MAE                    PE
LSTM       0.2106728041   0.12695592641830444    13%
CNN        0.1870193082   0.11516165733337402    12%
CNN-LSTM   0.1960096791   0.12636706233024597    13%

Table 6.3: Measures of the DNN models with batch size 32 and test sample size 5

           RMSE           MAE                    PE
LSTM       0.1870833078   0.10810592770576477    11%
CNN        0.1835637372   0.112208202481269845   11%
CNN-LSTM   0.1902735045   0.12128637731075287    12%

Table 6.4: Measures of the DNN models with batch size 32 and test sample size 1

The RMSE and MAE values express the bias between the forecast and actual values, which are normalized between 0 and 1; PE gives the error measure as a percentage. The lower the RMSE, MAE, and PE values, the better the performance.

6.3 Execution time

Long-term forecasting is performed for annual and strategic planning, and such planning is not affected by the execution time of the models, as execution time adds no value to long-term time-series forecasting. But as the study compares three DNN models and their performance, comparing their execution times gives insight into the models' performance in terms of time and helps in identifying which models have merit with respect to the time metric. This comparison would help in selecting a model for real-time time-series forecasting, such as real-time trading markets, satellite launches, and similar problem areas that need real-time forecasts. The execution time of each DNN model with respect to batch size (Bs) and the number of test samples (Ss) evaluated is recorded in Table 6.5, with the model types as columns and the batch size and test sample size as the index for easy comparison. For easy reading, microseconds are rounded to milliseconds in this comparison.

                   CNN-LSTM     CNN          LSTM
Bs = 2,  Ss = 5    39 ms/step   11 ms/step   66 ms/step
Bs = 32, Ss = 5    39 ms/step   10 ms/step   62 ms/step
Bs = 2,  Ss = 1     4 ms/step    4 ms/step   16 ms/step
Bs = 32, Ss = 1     3 ms/step    4 ms/step   15 ms/step

Table 6.5: Execution time of each DNN model
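For reference, per-step figures like these can be approximated by timing the prediction calls; a sketch of one hypothetical way to do it (averaging over five runs mirrors how execution time was handled, see Section 7.3):

    import time

    def ms_per_step(model, X_test, repeats: int = 5) -> float:
        """Average prediction time per test sample, in milliseconds."""
        start = time.perf_counter()
        for _ in range(repeats):
            model.predict(X_test, verbose=0)
        elapsed = time.perf_counter() - start
        return elapsed * 1000 / (repeats * len(X_test))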

6.4 Forecasting Results

The performance measures evaluate the models in terms of their forecasting biases, but to understand the nature of the forecasts, visualization is essential: by visualizing, we can assess how well the models mimic the historical trends and seasonality. The models generate the forecast values, and those values are plotted onto graphs for better understanding. The graphs plot the forecast versus the actual values, with time on the x-axis and the normalized demand volumes on the y-axis, covering the 165 weeks (∼ 3 years) of history and the forecast demand for the 52 weeks of the next year. Each model has two variants with two different batch sizes. As described earlier, the organization has a total of 172 materials, and picturing all of them is not necessary to understand the behaviour of the models. We selected one material type that is most commonly used and needs the most active procurement planning; that material has been plotted. A total of six graphs were plotted, two variants for each model; plotting the graphs helps in identifying the merits and demerits of each model by comparing them with each other. Figure 6.3 has two plots of the LSTM model's forecasts with the two different batch sizes; similarly, Figures 6.4 and 6.5 are from the CNN and CNN-LSTM models. The graphs use two colour codes to indicate their source: orange represents the volumes recorded in history, while blue represents the predictions. 165 weeks of data were used for training, the last 52 weeks are completely new (test) data for the model, and the blue line is the predicted values for those 52 weeks.
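A sketch of how such a graph is drawn, assuming `history_vals` holds the recorded weeks and `forecast_vals` the 52 forecast weeks for the selected material (both normalized; the variable names are illustrative):

    import numpy as np
    import matplotlib.pyplot as plt

    weeks_hist = np.arange(len(history_vals))
    weeks_fcst = np.arange(len(history_vals), len(history_vals) + len(forecast_vals))

    plt.plot(weeks_hist, history_vals, color="orange", label="recorded history")
    plt.plot(weeks_fcst, forecast_vals, color="blue", label="forecast")
    plt.xlabel("week")
    plt.ylabel("normalized demand")
    plt.legend()
    plt.show()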

Figure 6.3: Forecast by the LSTM models. (a) Batch size 32; (b) Batch size 2

Figure 6.4: Forecast by the CNN models. (a) Batch size 32; (b) Batch size 2

Figure 6.5: Forecast by the CNN-LSTM models. (a) Batch size 32; (b) Batch size 2

6.5 Deviations

Of the 172 materials, not all achieved acceptable forecasts, but the reasons for those deviations are clear and valid. In general, a bad forecast can have many causes, such as improper feature selection, corrupted training data, wrong models, and external (unrelated) factors. The reason for the deviating forecast of one particular material in this research is discussed in the analysis. Deviating materials were identified from the individual forecast measures of each material, and the deviating forecast of one particular material is plotted for all three models. Figure 6.6 shows the deviation in the CNN-LSTM model's forecast, and Figure 6.7 shows the deviation on the same packaging material for the CNN and LSTM models. These graphs follow the same colour code used in Figures 6.3, 6.4, and 6.5.

Figure 6.6: Deviation in the CNN-LSTM forecast

Figure 6.7: Deviation in the CNN and LSTM forecasts. (a) CNN; (b) LSTM

Chapter 7 Analysis and Discussion

This chapter analyzes the observations recorded and calculated in the previous chapter. The analysis and discussion aim at understanding those observations and drawing inferences from them and from the results.

7.1 Forecasting Performance

It is observed from the results that all the DNN models are able to successfully forecast long-term on the time-series data. Although all three models, with a total of six variants, forecast with similar performance, there are some differences between the models and variants when comparing their performance measures.

Considering batch size as the filtering criterion, the hybrid CNN-LSTM model shows very good forecast performance: from Tables 6.1 and 6.2, CNN-LSTM clearly performs better, with lower root mean square error, mean absolute error, and percentage error for the lower batch size variants. In the case of the larger batch size, the results are less clear. From Tables 6.3 and 6.4, the CNN and CNN-LSTM models show better RMSE values than LSTM, but LSTM has better MAE and PE; this holds only for test sample size 1, while CNN leads in all measures for batch size 32 and test sample size 5. The reason LSTM has the better MAE despite a higher RMSE is likely a large difference in a few forecasts: those differences, once squared and summed, lead to a higher RMSE but not to a higher mean absolute error. This is better understood from the analysis of the graphs later in this section.

Taking test sample size as the distinguishing factor, the CNN model shows better performance for the smaller test sample size, judging from Tables 6.2 and 6.4, while from Tables 6.1 and 6.3 CNN and CNN-LSTM are better than LSTM for the bigger test sample. On the whole, these performance metrics show that CNN and CNN-LSTM perform better than LSTM for long-term forecasting on time-series data. Looking at the forecast plots, however, it is clear that LSTM is good at identifying the low and high (peak) values even for short time periods. The graphs also show the reason for its weaker results with the low batch size: a lag in the forecasts, possibly caused by the influence of its previous states. This explanation is supported by LSTM's outperformance with higher batch size training and the small test sample size forecast. The small test sample size is the forecast for the immediate next 52 weeks from the input of the preceding 52 weeks, where seasonality plays a major role; with a test sample size of 5 this is not the case, which is why LSTM could not perform as well there.

A few variables of the time-series, representing specific packaging material types, have been put on hold due to policy changes. One such case is shown in Figures 6.6 and 6.7, where deviation is observed in all three models. As policy changes and environmental reforms are external factors impacting demand, the models cannot be expected to learn them, and a model will not be quick to understand such changes and reflect them in its forecast. Since the long-term forecast is for planning far ahead, the organization (Volvo) keeps track of such material types; knowledge of such impacts is expected from any group interested in long-term forecasting and is necessary for the practical use of the forecast. The negative values in Figures 6.6 and 6.7 are due to the scale: '0' in the graphs is not ground zero.

7.2 Execution Time

As mentioned before, given the problem domain, this study draws no inference or recommendation for selecting a DNN model based on its execution time. The numbers could, however, be a reference for other researchers working on real-time time-series analysis. It is visible that LSTM takes more time than CNN-LSTM, and CNN-LSTM takes more time than CNN. The time consumed for the test sample size of 5 is greater than for a sample size of 1 because the procedure is repeated five times, but it is not five times the time of test sample size 1. The LSTM is an RNN and has several memory operations and recurrent connections in its neural network, making it more time-consuming than the other networks; the CNN is a feedforward network, which makes it less complex and results in a lower execution time.

7.3 Validity threats

Analyzing possible validity threats and limitations is an initial self-assessment of the study; the limitations are discussed in the next section. Most of the identified threats to the study's validity were either eliminated or limited by the careful selection of the models, the long-term forecasting strategy, and the performance measures, although it cannot be certain that all validity threats were eliminated.

One limited internal validity threat is execution time, as the experiment was conducted on a shared online virtual machine: other users' usage, as well as other and scheduled applications, could influence the execution time. This influence was limited by performing each test five times and recording the average execution time.

Another internal validity threat that was limited is equal resource allocation for all models during training: the models were trained in non-working hours, with no other applications on the system allowed to use resources. Still, the resource utilization of some internal applications cannot be controlled completely.

For external validity, the dataset variables are one potential threat. Some variables with strong dependencies may be missing from the dataset, which might impact the performance measures. The major variables were identified and included in the dataset, but not all variables, because some of them may never have been recorded. Another external threat is the data size: 217 weeks (4 years and 2 months) of data is only just adequate for long-term forecasting, and more data would have improved performance. A further external validity threat is new policies and government regulations: forecasting performance would be strongly affected by organizational policy changes and environmental norms.

7.4 Limitations of the research

Several elements possibly limit the forecasting performance presented in this study. The data for this study was limited to 4 years and two months, and forecasting 1 year ahead from it is the major factor influencing the performance. Suppliers' capacity and the available storage capacity at the factories (plants) are also potential factors that influence packaging material demand, and they are not recorded. The models forecast based on history learned within a time frame of 52 weeks; they will not be able to reflect very recent changes, such as the complete shutdown of a particular material type, from learning only a few weeks beforehand.

Chapter 8 Conclusions and Future Work

The selected models perform considerable forecasting of packaging demand for planning the next 52 weeks (∼ 1 year). From the study, it can be stated that reliable packaging demand can be forecasted by using DNNs on time-series data. The hybrid DNN model, CNN-LSTM, is observed to perform well, on par with CNN, and CNN and CNN-LSTM are also the better models in execution time. Considering the impact of data size, which will have increased by the time the model is integrated into a tool, the CNN-LSTM model is suggested: as CNN-LSTM learns the patterns, it is able to forecast peak and low demands better than CNN.

Extending the forecasting to the granule level of the supply chain (individual suppliers and plants) will benefit the organization: granule forecasting will help in controlling inventory and avoiding excess inventory. Further study of CNN-LSTM with sub-sequence division of multiple sizes, and an ensemble of models of different sizes, will be the extension of this thesis as future work. Forecasting at the supplier versus plant (factory) level using sub-sequencing CNN-LSTM would add more value to the organization and would be real value-adding future work. The multi-output strategy DIRMO (MISMO) is another choice of long-term forecasting strategy that can be experimented with in the new models in future research, and by the time of the next study, more data will be ready for the experiment. A study comparing MIMO and DIRMO with an ensemble of different-size sub-sequencing CNN-LSTM models would be another potential future work.


References

[1] S. Ben Taieb, G. Bontempi, A. F. Atiya, and A. Sorjamaa, “A review and comparison of strategies for multi-step ahead time series forecasting based on the NN5 forecasting competition,” Expert Systems with Applications, vol. 39, no. 8, pp. 7067–7083, Jun. 2012. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0957417412000528

[2] N. Donges, “Recurrent Neural Networks and LSTM,” Feb. 2018. [Online]. Available: https://towardsdatascience.com/recurrent-neural-networks-and-lstm-4b601dd822a5

[3] A. Graves, A. Mohamed, and G. Hinton, “Speech recognition with deep recurrent neural networks,” in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, May 2013, pp. 6645–6649.

[4] Y. Tyshchenko, “Depression and anxiety detection from blog posts data,” 2018.

[5] T. N. Nguyen, C. Li, and C. Niederée, “On Early-Stage Debunking Rumors on Twitter: Leveraging the Wisdom of Weak Learners,” in Social Informatics. Springer, Cham, Sep. 2017, pp. 141–158. [Online]. Available: https://link.springer.com/chapter/10.1007/978-3-319-67256-4_13

[6] M. H. Hugos, Essentials of Supply Chain Management. John Wiley & Sons, Mar. 2018.

[7] “Volvo - Wikipedia.” [Online]. Available: https://en.wikipedia.org/wiki/Volvo

[8] “Volvo Group Packaging System | Volvo Group.” [Online]. Available: https://www.volvogroup.com/en-en/suppliers/useful-links-and-documents/logistics-solutions/volvo-group-packaging-system.html

[9] G. E. Box, G. M. Jenkins, G. C. Reinsel, and G. M. Ljung, Time Series Analysis: Forecasting and Control. John Wiley & Sons, 2015.

[10] P. R. Winters, “Forecasting sales by exponentially weighted moving averages,” Management Science, vol. 6, no. 3, pp. 324–342, 1960.

[11] K. Gilbert, “An ARIMA Supply Chain Model,” Management Science, vol. 51, no. 2, pp. 305–310, Feb. 2005. [Online]. Available: https://pubsonline.informs.org/doi/abs/10.1287/mnsc.1040.0308


[12] R. Carbonneau, K. Laframboise, and R. Vahidov, “Application of machine learning techniques for supply chain demand forecasting,” European Journal of Operational Research, vol. 184, no. 3, pp. 1140–1154, Feb. 2008. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0377221706012057

[13] “Univariate, Bivariate and Multivariate data and its analysis,” Aug. 2018. [Online]. Available: https://www.geeksforgeeks.org/univariate-bivariate-and-multivariate-data-and-its-analysis/

[14] R. Hyndman, A. B. Koehler, J. K. Ord, and R. D. Snyder, Forecasting with Exponential Smoothing: The State Space Approach. Springer Science & Business Media, Jun. 2008.

[15] S. Gvaladze, “Evaluating methods for time-series forecasting; Applied to energy consumption predictions for Hvaler (kommune),” Master's thesis, 2015.

[16] G. Zhang, B. Eddy Patuwo, and M. Y. Hu, “Forecasting with artificial neural networks: The state of the art,” International Journal of Forecasting, vol. 14, no. 1, pp. 35–62, Mar. 1998. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0169207097000447

[17] A. R. S. Parmezan, V. M. A. Souza, and G. E. A. P. A. Batista, “Evaluation of statistical and machine learning models for time series prediction: Identifying the state-of-the-art and the best conditions for the use of each model,” Information Sciences, vol. 484, pp. 302–337, May 2019. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0020025519300945

[18] V. Flunkert, D. Salinas, and J. Gasthaus, “DeepAR: Probabilistic Forecasting with Autoregressive Recurrent Networks,” Apr. 2017. [Online]. Available: https://arxiv.org/abs/1704.04110v2

[19] S. Selvin, R. Vinayakumar, E. A. Gopalakrishnan, V. K. Menon, and K. P. Soman, “Stock price prediction using LSTM, RNN and CNN-sliding window model,” in 2017 International Conference on Advances in Computing, Communications and Informatics (ICACCI), Sep. 2017, pp. 1643–1647.

[20] M. Inthachot, V. Boonjing, and S. Intakosum, “Artificial Neural Network and Genetic Algorithm Hybrid Intelligence for Predicting Thai Stock Price Index Trend,” 2016. [Online]. Available: https://www.hindawi.com/journals/cin/2016/3045254/

[21] C.-J. Lu and L.-J. Kao, “A clustering-based sales forecasting scheme by using extreme learning machine and ensembling linkage methods with applications to computer server,” Engineering Applications of Artificial Intelligence, vol. 55, pp. 231–238, Oct. 2016. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0952197616301257

[22] R. S. Soni and D. Srikanth, “Inventory forecasting model using genetic programming and Holt-Winter's exponential smoothing method,” in 2017 2nd IEEE International Conference on Recent Trends in Electronics, Information Communication Technology (RTEICT), May 2017, pp. 2086–2091.

[23] J. C. B. Gamboa, “Deep Learning for Time-Series Analysis,” arXiv:1701.01887 [cs], Jan. 2017. [Online]. Available: http://arxiv.org/abs/1701.01887

[24] Y. Gao and M. J. Er, “NARMAX time series model prediction: feedforward and recurrent fuzzy neural network approaches,” Fuzzy Sets and Systems, vol. 150, no. 2, pp. 331–350, Mar. 2005. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0165011404004099

[25] E. Egrioglu, U. Yolcu, C. H. Aladag, and E. Bas, “Recurrent Multiplicative Neuron Model Artificial Neural Network for Non-linear Time Series Forecasting,” Neural Processing Letters, vol. 41, no. 2, pp. 249–258, Apr. 2015. [Online]. Available: https://link.springer.com/article/10.1007/s11063-014-9342-0

[26] J. T. Connor, R. D. Martin, and L. E. Atlas, “Recurrent neural networks and robust time series prediction,” IEEE Transactions on Neural Networks, vol. 5, no. 2, pp. 240–254, Mar. 1994.

[27] Z. Che, S. Purushotham, K. Cho, D. Sontag, and Y. Liu, “Recurrent Neural Networks for Multivariate Time Series with Missing Values,” Scientific Reports, vol. 8, no. 1, p. 6085, Apr. 2018. [Online]. Available: https://www.nature.com/articles/s41598-018-24271-9

[28] M. Han, J. Xi, S. Xu, and F.-L. Yin, “Prediction of chaotic time series based on the recurrent predictor neural network,” IEEE Transactions on Signal Processing, vol. 52, no. 12, pp. 3409–3416, Dec. 2004.

[29] A. Bacchetti and N. Saccani, “Spare parts classification and demand forecasting for stock control: Investigating the gap between research and practice,” Omega, vol. 40, no. 6, pp. 722–737, Dec. 2012. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0305048311001484

[30] C. Chatfield, Time-Series Forecasting. Chapman and Hall/CRC, 2000.

[31] C. W. J. Granger and P. Newbold, Forecasting Economic Time Series. Academic Press, May 2014.

[32] X. Yang, F. Yu, and W. Pedrycz, “Long-term forecasting of time series based on linear fuzzy information granules and fuzzy inference system,” International Journal of Approximate Reasoning, vol. 81, pp. 1–27, Feb. 2017. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0888613X16302080

[33] G. Aneiros, J. Vilar, and P. Raña, “Short-term forecast of daily curves of electricity demand and price,” International Journal of Electrical Power & Energy Systems, vol. 80, pp. 96–108, Sep. 2016. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0142061516000466

[34] J. S. Armstrong, Long-Range Forecasting: From Crystal Ball to Computer, 2nd ed. New York: Wiley, 1985.

[35] J. D. Pelletier and D. L. Turcotte, “Long-range persistence in climatological and hydrological time series: analysis, modeling and application to drought hazard assessment,” Journal of Hydrology, vol. 203, no. 1, pp. 198–208, Dec. 1997. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0022169497001029

[36] K. Papagiannaki, N. Taft, Z. Zhang, and C. Diot, “Long-term forecasting of Internet backbone traffic: observations and initial models,” in IEEE INFOCOM 2003. Twenty-second Annual Joint Conference of the IEEE Computer and Communications Societies (IEEE Cat. No.03CH37428), vol. 2, Mar. 2003, pp. 1178–1188.

[37] R. J. Hyndman and S. Fan, “Density Forecasting for Long-Term Peak Electricity Demand,” IEEE Transactions on Power Systems, vol. 25, no. 2, pp. 1142–1153, May 2010.

[38] W. Bao, J. Yue, and Y. Rao, “A deep learning framework for financial time series using stacked autoencoders and long-short term memory,” PLOS ONE, vol. 12, no. 7, p. e0180944, Jul. 2017. [Online]. Available: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0180944

[39] A. Borovykh, S. Bohte, and C. W. Oosterlee, “Conditional Time Series Forecasting with Convolutional Neural Networks,” arXiv:1703.04691 [stat], Mar. 2017. [Online]. Available: http://arxiv.org/abs/1703.04691

[40] X. Qiu, L. Zhang, Y. Ren, P. N. Suganthan, and G. Amaratunga, “Ensemble deep learning for regression and time series forecasting,” in 2014 IEEE Symposium on Computational Intelligence in Ensemble Learning (CIEL), Dec. 2014, pp. 1–6.

[41] E. W. Saad, D. V. Prokhorov, and D. C. Wunsch, “Comparative study of stock trend prediction using time delay, recurrent and probabilistic neural networks,” IEEE Transactions on Neural Networks, vol. 9, no. 6, pp. 1456–1470, Nov. 1998.

[42] A. Sorjamaa, J. Hao, N. Reyhani, Y. Ji, and A. Lendasse, “Methodology for long-term prediction of time series,” Neurocomputing, vol. 70, no. 16, pp. 2861–2869, Oct. 2007. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0925231207001610

[43] A. Sorjamaa and A. Lendasse, “Time Series Prediction using DirRec Strategy,” p. 6, 2006.

[44] G. Bontempi and S. Ben Taieb, “Conditionally dependent strategies for multiple-step-ahead prediction in local learning,” International Journal of Forecasting, vol. 27, no. 3, pp. 689–699, Jul. 2011. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0169207010001433

[45] S. Ben Taieb, A. Sorjamaa, and G. Bontempi, “Multiple-output modeling for multi-step-ahead time series forecasting,” Neurocomputing, vol. 73, no. 10, pp. 1950–1957, Jun. 2010. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0925231210001013

[46] B. Kröse and P. van der Smagt, “An introduction to neural networks,” 1993.

[47] X. Shi, Z. Chen, H. Wang, D.-Y. Yeung, W.-K. Wong, and W.-C. Woo, “Convolutional LSTM Network: A Machine Learning Approach for Precipitation Nowcasting,” in Advances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, Eds. Curran Associates, Inc., 2015, pp. 802–810.
