
Article

Optimal Deep Learning LSTM Model for Electric Load Forecasting using Feature Selection and Genetic Algorithm: Comparison with Machine Learning Approaches †

Salah Bouktif 1,*, Ali Fiaz 1, Ali Ouni 2 and Mohamed Adel Serhani 1

1 Department of Computer Science and Software Engineering, College of Information Technology, UAE University, Al Ain 15551, UAE; alifi[email protected] (A.F.); [email protected] (M.A.S.)
2 Department of Software Engineering and IT, Ecole de Technologie Superieure, Montréal, QC H3C 1K3, Canada; [email protected]
* Correspondence: [email protected]
† This work was supported by a UPAR grant from the United Arab Emirates University, under grant G00001930.

 Received: 28 April 2018; Accepted: 21 May 2018; Published: 22 June 2018 

Abstract: Background: With the development of smart grids, accurate electric load forecasting has become increasingly important as it can help power companies in better load scheduling and reduce excessive electricity production. However, developing and selecting accurate time series models is a challenging task, as it requires training several different models to select the best amongst them, along with substantial effort to derive informative features and to find optimal time lags, the most commonly used input features for time series models. Methods: Our approach uses machine learning and a long short-term memory (LSTM)-based neural network with various configurations to construct forecasting models for short- to medium-term aggregate load forecasting. The research addresses the above problems by training several linear and non-linear machine learning algorithms and picking the best as a baseline, choosing the best features using wrapper and embedded feature selection methods, and finally using a genetic algorithm (GA) to find the optimal time lags and number of layers for optimizing the predictive performance of the LSTM model. Results: Using France metropolitan electricity consumption data as a case study, the obtained results show that the LSTM-based model achieves higher accuracy than a machine learning model optimized with hyperparameter tuning. Using the best features, optimal lags and layers, and training various LSTM configurations further improved forecasting accuracy. Conclusions: An LSTM model using only optimally selected time-lagged features captured all the characteristics of the complex time series and showed decreased Mean Absolute Error (MAE) and Root Mean Square Error (RMSE) for medium- to long-range forecasting for a wider metropolitan area.

Keywords: deep neural networks; long short-term memory networks; short- and medium-term load forecasting; machine learning; feature selection; genetic algorithm

1. Introduction

Load forecasting enables utility providers to model and forecast power loads to preserve a balance between production and demand, to reduce production costs, to estimate realistic energy prices and to manage scheduling and future capacity planning. The primary criterion used for the classification of forecasting models is the forecasting horizon. Mocanu et al. [1] grouped electricity demand forecasting into three categories: short-term forecasts ranging between one hour and one week, medium-term forecasts ranging between one week and one year, and long-term forecasts spanning more than one year.

The literature reveals that short-term demand forecasting has attracted substantial attention. Such forecasting is important for power system control, unit commitment, economic dispatch, and electricity markets. Conversely, medium- and long-term forecasting have not been sufficiently studied despite providing crucial inputs for power system planning and budget allocation [2]. In this paper, we focus on both short-term and medium-term monthly forecasting horizons. The rationale behind targeting these two forecasting horizons simultaneously is that deterministic models can be used successfully for both of them. However, in the case of long-term forecasting, stochastic models are needed to deal with uncertainties of forecasting parameters that always have a probability of occurrence [3]. Two challenges are associated with the targeted forecasting horizons. In the short-term case, accuracy is crucial for optimal day-to-day operational efficiency of electrical power delivery; in the medium-term case, prediction stability is needed for the precise scheduling of fuel supplies and timely maintenance operations. For prediction stability, a low forecast error should be preserved over the medium-term span. Thus, the forecasting model should keep performing accurately, or at least should not be excessively sensitive to the elapsed time within the medium-term frame. Although several electric load forecasting approaches using traditional statistical techniques, time series analysis and recent machine learning have been proposed, the need for more accurate and stable load forecasting models remains crucial [4]. Recently, particular attention is being paid to deep learning based approaches. These approaches, based on artificial neural networks (ANNs) with deep architectures, have gained the attention of many research communities [5–7] due to their ability to capture data behavior when considering complex non-linear patterns and large amounts of data. As opposed to shallow learning, deep learning usually refers to having a larger number of hidden layers, which make the model able to learn complex input-output relations accurately. Long short-term memory (LSTM), a variation of deep Recurrent Neural Networks (RNN), was originally developed by Hochreiter et al. [8] to allow the preservation of the weights that are forward- and back-propagated through layers. LSTM-based RNNs are an attractive choice for modeling sequential data like time series as they incorporate contextual information from past inputs. In particular, the LSTM technique for time series forecasting has gained popularity due to its end-to-end modeling, learning of complex non-linear patterns and automatic feature extraction abilities. Time series models commonly use lags to make their predictions based on past observations. Selection of appropriate time lags for time series forecasting is an important step to eliminate redundant features [9]. This helps to improve prediction model accuracy and gives a better understanding of the underlying process that generates the data. A genetic algorithm (GA), a heuristic search and optimization technique that mimics the process of evolution with the objective of minimizing or maximizing some objective function [10], can be used to find the appropriate number of lags for a time series model.
GA works by creating a population of potential answers to the problem to be solved, and then submitting it to a process of evolution. In this paper, we propose an LSTM-RNN-based model for aggregated demand-side load forecasting over short- and medium-term monthly horizons. Commonly used machine learning approaches are implemented to be compared with our proposed model. Important predictor variables as well as the optimal lag and number of layers are selected using feature selection and GA approaches. The performances of the forecasting techniques are measured using several evaluation metrics, namely the coefficient of variation of the RMSE (CV(RMSE)), the mean absolute error (MAE) and the root mean square error (RMSE). Using France metropolitan electricity energy consumption data as a case study, our validation shows that the LSTM-RNN based forecasting model outperforms the best of the studied alternative approaches with high confidence. The remainder of this paper is organized as follows: Section 2 provides an overview of the literature on load forecasting. Section 3 provides brief background on LSTM-RNN, the benchmark machine learning models and the forecasting evaluation metrics. Section 4 describes the methodology of building our proposed forecast technique, its enhancement and validation. Section 5 describes experimental results and an alternative time series validation approach. Section 6 provides a discussion on threats to validity and Section 7 draws the conclusions.

2. Literature Review

Different models for load forecasting can be broadly classified into engineering methods and data-driven methods. Engineering methods, also known as physics-based models, use thermodynamic rules to estimate and analyze energy consumption. Commonly, these models rely on context characteristics such as building structure, weather information, and heating, ventilation, and air conditioning (HVAC) system information to calculate energy consumption. Physics-based models are used in building energy simulation software such as EnergyPlus and eQuest. The limitation of these models resides in their dependence on the availability and the accuracy of input information [11]. Data-driven methods, also known as artificial intelligence (AI) methods, rely on historical data collected during previous energy consumption periods. These methods are very attractive to many research groups; however, little is known about their forecasting generalization. In other terms, their accuracy drops when they are applied to new energy data. The generalization ability of data-driven forecasting models remains an open problem. As reported for example in [12], the accuracy of forecasting results varies considerably for micro-grids with different capacities and load characteristics. Similar to load forecasting, data-driven models for electricity price prediction were also proposed to help decision making for energy providing companies. Commonly used price forecasting techniques include multi-agent, reduced-form, statistical and computational intelligence models, as reviewed in [13]. Nowadays, the availability of relatively large amounts of energy data makes it increasingly attractive to use data-driven methods as an alternative to physics-based methods in load forecasting for different time horizons, namely short, medium and long term [11]. Short-term load forecasting (STLF) has attracted more attention in the smart grid, microgrids and buildings because of its usefulness for demand side management, energy consumption, energy storage operation, peak load anticipation, energy shortage risk reduction, etc. [14]. Many works were carried out for STLF, ranging from classical time series analysis to recent machine learning approaches [15–20]. In particular, autoregressive and exponential smoothing models have remained the baseline models for time series prediction tasks for several years; however, using these models necessitated careful selection of the lagged input parameters to identify the correct model configuration [21]. An adaptive autoregressive moving-average (ARMA) model developed by Chen et al. [22] to conduct day- and week-ahead load forecasts reported superior performance compared to the Box-Jenkins model. While these univariate time-series approaches directly model the temporal domain, they suffer from the curse of dimensionality, accuracy drops and the need for frequent retraining. Various other approaches covering simple and multivariate linear regression, non-linear regression, ANN and support vector machines (SVMs) were applied to electricity demand forecasting, primarily for the short-medium term [4,23,24] and to a lesser extent for the long term [4]. Similarly, for price forecasting, Cincotti et al. [25] proposed three univariate models, namely ARMA-GARCH, multi-layer perceptron and support vector machines, to predict day-ahead electricity spot prices.
Results showed that SVMs performed much better than the benchmark model based on the random walk hypothesis, but close to the ARMA-GARCH econometric technique. For a similar forecasting aim, a combination of wavelet transform neural network (WTNN) and evolutionary algorithms (EA) was implemented for day-ahead price forecasting. This combination resulted in a more accurate and robust forecasting model [26]. Noticeably, ANNs were widely used for various energy-forecasting tasks because of their ability to model nonlinearity [4]. Hippert et al., in their review work [20], summarized most of the applications of ANN to STLF as well as its evolution. Furthermore, various structures of ANN have found their way into load forecasting in order to improve model accuracy. Namely, the most common configuration of ANN, the multi-layer perceptron (MLP), was used to forecast a load profile using previous load data [27]; the fuzzy neural network [28], wavelet neural networks [29], the fuzzy wavelet neural network [30] and the self-organizing map (SOM) neural network [31] were used mainly for STLF. Unlike conventional ANN structures, a deep neural network (DNN) is an ANN with more than one hidden layer. The multiple computation layers increase the feature abstraction capability of the network, which makes DNNs more efficient in learning complex non-linear patterns [32]. To the best of our knowledge and according to Ryu et al. [14] in 2017, only a few DNN-based load forecasting models have been proposed. Recently, in 2016, Mocanu et al. [1] employed a deep learning approach based on a restricted Boltzmann machine for single-meter residential load forecasting. Improved performance was reported compared to shallow ANN and support vector machines. The same year (2016), Marino et al. [33] presented a novel energy load forecasting methodology using two different deep architectures, namely a standard LSTM and an LSTM with sequence-to-sequence architecture, that produced promising results. In 2017, Ryu et al. [14] proposed a DNN-based framework for day-ahead load forecasting by training DNNs in two different ways: using a pre-training restricted Boltzmann machine and using the rectified linear unit without pre-training. This model exhibited accurate and robust predictions compared to a shallow neural network and other alternatives (i.e., the double seasonal Holt-Winters model and the autoregressive integrated moving average model). In 2018, Rahman et al. [34] proposed two RNNs for medium- to long-term electric load forecasting for residential and commercial buildings. The first model feeds a linear combination of vectors to the hidden MLP layer while the second model applies a shared MLP layer across each time step. The results were more accurate than using a 3-layer multi-layered perceptron model. Seemingly closer to our current work, but technically distinct from it, Zheng et al. [35] developed a hybrid algorithm for STLF that combined similar days selection, empirical mode decomposition and LSTM neural networks. In this work, Xgboost was used to determine feature importance and k-means was employed to merge similar days into one cluster. The approach substantially improved LSTM predictive accuracy.
Despite the success of machine learning models, in particular the recent deep learning models, in performing better than traditional time series analysis and regression approaches, there is still a crucial need for improvement to better model non-linear energy consumption patterns while targeting high accuracy and prediction stability for medium-term monthly forecasting. We share the opinion that an optimally trained deep learning model derives more accurate forecasting outputs than those obtained with a shallow structure [36]. Furthermore, we believe that the adoption of deep learning for solving the load-forecasting problem needs more maturity, through increasing the number of works with different forecasting configurations. These configurations include the aim and horizon of forecasting, the type or nature of inputs, the methods of determining their relevance, the selected variation of deep learning and the way of validating the derived forecasting model. For instance, using lagged features as inputs to enable deep models to see enough past values relevant for future prediction can cause overfitting if too many lags are selected, because deep models with a lot of parameters can overfit due to the increase in dimensionality. In this paper, we aim at proving that an optimal LSTM-RNN will behave similarly in the context of electric load forecasting for both the short and medium horizons. Accordingly, our approach differs from the previous deep learning models in that:

(i) Implementing feature importance using wrapper and hybrid methods, together with optimal lag and number-of-layers selection for the LSTM model using GA, enabled us to prevent overfitting and resulted in more accurate and stable forecasting.
(ii) We train a robust LSTM-RNN model to forecast aggregate electric load for the short- and medium-term horizon using a large dataset for a complete metropolitan region covering a period of nine years at a 30 min resolution.
(iii) We compare the LSTM-RNN model with the machine learning benchmark that performs the best among several linear and non-linear models optimized with hyperparameter tuning.

3. Background

This section provides brief backgrounds on LSTM-RNN, on the benchmark machine learning models and on the forecasting evaluation metrics.

3.1. From RNN to LSTM-RNNs

A RNN is a special type of ANN that makes use of sequential information due to directed connections between units of an individual layer. They are called recurrent because they perform the same task for every element in the sequence. RNNs are able to store memory since their current output is dependent on the previous computations. However, RNNs are known to go back only a few time steps due to the vanishing gradient problem. The rolled and unrolled RNN configurations over the input sequence are shown in Figure 1 (adopted from [37]).

Figure 1. Recurrent neural networks (RNN) and the unfolding in time of the computation.

Since standard RNNs suffer from vanishing and exploding gradient problems, LSTMs were specially designed to overcome these problems by introducing new gates which allow a better control over the gradient flow and enable better preservation of long-range dependencies. The critical components of the LSTM are the memory cell and the gates shown in Figure 2 (adopted from [37]).

Figure 2. Information flow in a long short-term memory (LSTM) block of the RNN.

These gates in the LSTM cell enable it to preserve a more constant error that can be back-propagated through time and layers, allowing recurrent nets to continue to learn over many time steps [38]. These gates work in tandem to learn and store long- and short-term sequence related information. The RNN models its input sequence $\{x_1, x_2, \ldots, x_n\}$ using the recurrence:

$$h_t = f(h_{t-1}, x_t) \quad (1)$$

where $x_t$ is the input at time $t$, and $h_t$ is the hidden state. Gates are introduced into the recurrence function $f$ in order to solve the gradient vanishing or explosion problem. States of LSTM cells are computed as follows:

$$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i) \quad (2)$$

$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f) \quad (3)$$


$$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o) \quad (4)$$

$$\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C) \quad (5)$$

$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t \quad (6)$$

$$h_t = o_t \odot \tanh(C_t) \quad (7)$$

In the equations above, $i_t$, $f_t$ and $o_t$ are the input, forget and output gates, respectively. The $W$'s and $b$'s are parameters of the LSTM unit, $C_t$ is the current cell state and $\tilde{C}_t$ holds the new candidate values for the cell state. There are three sigmoid functions for the $i_t$, $f_t$ and $o_t$ gates that modulate the output between 0 and 1, as given in Equations (2)–(4). The decisions for these three gates depend on the current input $x_t$ and the previous output $h_{t-1}$. If a gate is 0, then the signal is blocked by the gate. The forget gate $f_t$ defines how much of the previous state is allowed to pass. The input gate $i_t$ decides which new information from the input to update or add to the cell state. The output gate $o_t$ decides which information to output based on the cell state. These gates work in tandem to learn and store long- and short-term sequence related information. The memory cell $C$ acts as an accumulator of the state information. The update of the old cell state $C_{t-1}$ into the new cell state $C_t$ is performed using Equation (6). The calculation of the new candidate values $\tilde{C}_t$ of the memory cell and of the output $h_t$ of the current LSTM block uses the hyperbolic tangent function, as in Equations (5) and (7). The two states, the cell state and the hidden state, are transferred to the next cell at every time step, and this process repeats. Weights and biases are learnt by the model by minimizing the differences between the LSTM outputs and the actual training samples.
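To make the gate equations concrete, the following is a minimal sketch of a single LSTM forward step in NumPy. It mirrors Equations (2)–(7); the parameter names and shapes are illustrative assumptions, not the paper's implementation, which relies on a deep learning framework:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W, b):
    """One LSTM time step following Equations (2)-(7)."""
    z = np.concatenate([h_prev, x_t])        # [h_{t-1}, x_t]
    i_t = sigmoid(W["i"] @ z + b["i"])       # input gate,  Eq. (2)
    f_t = sigmoid(W["f"] @ z + b["f"])       # forget gate, Eq. (3)
    o_t = sigmoid(W["o"] @ z + b["o"])       # output gate, Eq. (4)
    C_tilde = np.tanh(W["C"] @ z + b["C"])   # candidate cell state, Eq. (5)
    C_t = f_t * C_prev + i_t * C_tilde       # new cell state,   Eq. (6)
    h_t = o_t * np.tanh(C_t)                 # new hidden state, Eq. (7)
    return h_t, C_t

# Example with input size 3 and hidden size 4, random parameters.
rng = np.random.default_rng(0)
n_x, n_h = 3, 4
W = {k: rng.normal(size=(n_h, n_h + n_x)) for k in "ifoC"}
b = {k: np.zeros(n_h) for k in "ifoC"}
h_t, C_t = lstm_step(rng.normal(size=n_x), np.zeros(n_h), np.zeros(n_h), W, b)
```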

3.2. Alternative Modeling Approaches

Data-driven approaches have addressed a variety of energy prediction and load forecasting tasks and have thus attracted significant research attention [39]. Common data-driven approaches for time series data include linear regression, which fits the best straight line through the training data and uses the ordinary least squares method to estimate the parameters by minimizing the sum of the squared vertical distances. Ridge regression is another linear model that addresses the issue of multi-collinearity by penalizing the extreme values of the weight vector. These linear regression models are relatively easy to develop and interpret. K-nearest neighbors is a non-parametric model where the prediction is the average value of the k nearest neighbors. This algorithm is intuitive, easy to implement and can give reliable results for electricity load forecasting when its parameters are properly tuned. Ensemble models like random forests and extra trees use an ensemble of decision trees to predict the output and thus avoid the problem of overfitting encountered by a single decision tree. These models have low sensitivity to parameter values and can produce accurate forecasts. Gradient boosting is another ensemble model that uses an ensemble of weak learners in order to increase the accuracy of trees by giving more weight to wrong predictions.

3.3. Performance Metrics for Evaluation

Commonly used metrics to evaluate forecast accuracy are the coefficient of variation of the RMSE (CV(RMSE)), the root mean squared error (RMSE) and the MAE [40]. CV(RMSE) is the RMSE normalized by the mean of the measured values and quantifies the typical size of the error relative to the mean of the observations. A high CV score indicates that a model has a high error range. MAE, a commonly used metric, is the mean value of the sum of absolute differences between actual and forecasted values. RMSE is another commonly used metric. It penalizes the larger error terms and tends to become increasingly larger than MAE for outliers. The error measures are defined as follows:

$$\mathrm{CV(RMSE)}\,\% = \frac{\sqrt{\frac{1}{N}\sum_{i=1}^{N}(y_i - \hat{y}_i)^2}}{\bar{y}} \quad (8)$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(y_i - \hat{y}_i)^2} \quad (9)$$

$$\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}|y_i - \hat{y}_i| \quad (10)$$

where $\hat{y}_i$ is the predicted, $y_i$ is the actual and $\bar{y}$ is the average energy consumption.
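As a small self-contained illustration, the three error measures can be computed directly from Equations (8)–(10); the function name below is ours:

```python
import numpy as np

def forecast_errors(y_true, y_pred):
    """Return MAE, RMSE and CV(RMSE) in percent, per Equations (8)-(10)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mae = np.mean(np.abs(y_true - y_pred))            # Eq. (10)
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))   # Eq. (9)
    cv_rmse = 100.0 * rmse / np.mean(y_true)          # Eq. (8), in percent
    return mae, rmse, cv_rmse
```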

4. Methodology

In this section, we describe the proposed methodology for short- and medium-term load forecasting with LSTM-RNNs, as depicted in Figure 3. The proposed methodology can be seen as a framework of four processing components, namely the data preparation and preprocessing component, the machine learning benchmark model construction component, the LSTM-RNN training component and the LSTM-RNN validation component. In the following, we present an overview of the methodology, and then we describe the detailed mission of each component, followed by an illustration on the Réseau de Transport d'Électricité (RTE) power consumption data set, our case study.

Figure 3. Proposed forecasting methodology.

4.1. Methodology Process Overview

The first processing step starts by merging the electric energy consumption data with weather data and time lags. The merging is performed because weather data and time lags are known to influence power demand. Then a preprocessing of the data is carried out in order to check for null values and outliers, scale the data to a given range and split the time series data into train and test subsets while retaining the temporal order. This step aims at preparing and cleaning the data to be ready for further analysis.

In the second processing step of our framework, the benchmark model will be selected by fitting seven different linear and non-linear machine-learning algorithms to the data and choosing the model that performs the best. Further improvement in the accuracy of the selected model will be achieved by performing feature engineering and hyperparameter tuning. This benchmark would then be used for forecasting and its obtained performance would be compared with that of the LSTM-RNN model.

In the third processing step, a number of LSTM models with different configurations, such as the number of layers, the number of neurons in each layer, the training epochs and the optimizer, will be tested. The optimal number of time lags and LSTM layers would be selected using GA. The best performing configuration will be determined empirically after several trials with different settings, and then it would be used for the final LSTM-RNN model.

4.2. Data Preparation and Pre-Processing
Time series forecasting is a special case of sequence modeling. The observed values correlate with their own past values, thus individual observations cannot be considered independent of each other. Exploratory analysis of the electric load time series can be useful to identify trends, patterns and anomalies. We will plot electric load consumption box plots for various years, quarters and a weekday indicator, which gives us a graphical display of data dispersion. Furthermore, the correlation of electric consumption with time lags and weather variables will also be investigated to check for the strength of association between these variables.

4.2.1. Preliminary Data Analysis

Our research makes use of the RTE power consumption data set [41], which gives us a unique opportunity to predict the next half-hourly electrical consumption in MW using nine years' data for metropolitan France. The power consumption dataset ranges from January 2008 until December 2016. The load profile from January to February 2011, as depicted in Figure 4, follows cyclic and seasonal patterns, which can be related to human, industrial and commercial activities.
The loadThe profile load profile from from January January to Februaryto February 2011 2011 as as depicted in in the the Figure Figure 4 follows4 follows cyclic cyclic and seasonal and seasonal patterns,patterns, which which can be can related be related to human,to human, industrial industrial and commercial commercial activities. activities.

Figure 4. Electricity load versus time (January–February 2011).

Box plots of load consumption reveal that the average load is almost constant across years (Figure 5a), while the quarterly plot shows low consumption in the second and third quarters as compared to the others (Figure 5b). Power demand in France increases in the first and fourth quarters due to heating load in the winter months.

Figure 5. Box Plot of Electric load (a) Yearly (b) Quarterly.


Holidays and weekends can affect the usage of electricity, as the usage patterns are generally quite different from usual. This can be used as a potential feature for our model. Figure 6 shows the box plot of consumption for weekdays and weekends. The weekend consumption is high compared to weekdays across all years.

Figure 6. Box Plot of Electric Load Consumption Weekend vs. Weekday.

The correlation matrix of electric consumption showed a high correlation with the previous several time lags (0.90–0.98). Amongst the weather variables, temperature had a high negative correlation with consumption since we are analyzing a heating load. Humidity and wind speed have low correlation. Figure 7 shows scatter plots of the weather variables against consumption.

Figure 7. Scatter Plot of Electric load vs. (a) Temperature (b) Humidity (c) Wind Speed.
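A short sketch of this correlation check with pandas follows; the column names ("load", "temperature", "humidity", "wind_speed") are hypothetical placeholders for the merged RTE/weather frame:

```python
import pandas as pd

def load_correlations(df: pd.DataFrame, max_lag: int = 10) -> pd.Series:
    """Correlate the load with its own lags and with the weather variables."""
    corr = {f"lag_{k}": df["load"].corr(df["load"].shift(k))
            for k in range(1, max_lag + 1)}
    for col in ("temperature", "humidity", "wind_speed"):
        corr[col] = df["load"].corr(df[col])
    # Sort by absolute strength: lagged loads dominate, temperature is negative.
    return pd.Series(corr).sort_values(key=abs, ascending=False)
```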

4.2.2. Data Pre-Processing

Data preprocessing is a vital step to obtain better performance and accuracy from machine learning as well as deep learning-based models. It is about dealing with inconsistent, missing and noisy data. Our dataset comes from RTE, which is the electricity transmission system operator of France. It has measurements of electric power consumption in metropolitan France at a thirty-minute sampling rate. There are two categories of power in the dataset, namely definitive and intermediate. Here we will only use definitive data. Weather data comprising outdoor temperature, humidity and wind speed are merged as exogenous inputs with the consumption data. Further data preprocessing comprised data cleansing, normalization, and structure change. As machine learning and LSTM-RNN models are sensitive to the scale of the inputs, the data are normalized in the range [0, 1] by using feature scaling. The data is split into train and test sets while maintaining the temporal order of observations. Test data is used for evaluating the accuracy of the proposed forecasting model and is not used in the training step.

Standard practice is to split the data using 80/20 or 70/30 ratios for machine learning models [42]. The last 30 percent of the dataset is withheld for validation while the model is trained on the remaining 70 percent of the data. Since our data is large enough, both training and test sets are highly representative of the original load forecasting problem. Stationarity of a time series is a desired characteristic for using Ordinary Least Squares regression models to estimate the regression relationship, as regressing non-stationary time series can lead to spurious regressions [43]. A high R2 and high residual autocorrelation can be signs of spurious regression. A Dickey-Fuller test is conducted to check stationarity. The resulting p-value is less than 0.05, thus we reject the null hypothesis and conclude that the time series is stationary.
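The scaling, chronological split and stationarity check described above could be sketched as follows. This is a minimal illustration; fitting the scaler on the training portion only (to avoid leakage) is our assumption rather than a detail stated in the paper:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from statsmodels.tsa.stattools import adfuller

def prepare_series(series: pd.Series, train_ratio: float = 0.7):
    """Scale to [0, 1], split chronologically and run a Dickey-Fuller test."""
    split = int(len(series) * train_ratio)          # keep temporal order
    scaler = MinMaxScaler(feature_range=(0, 1))
    train = scaler.fit_transform(series.iloc[:split].to_frame()).ravel()
    test = scaler.transform(series.iloc[split:].to_frame()).ravel()

    adf_stat, p_value = adfuller(series.dropna())[:2]
    print(f"ADF statistic = {adf_stat:.3f}, p-value = {p_value:.4f}")
    # p-value < 0.05 -> reject the unit-root null: the series is stationary.
    return train, test, scaler
```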

4.3. Selecting the Machine Learning Benchmark Model

Benchmarking is an approach that demonstrates a new methodology's ability to perform as expected by comparing its results to existing methods [44]. We use linear regression, ridge regression, k-nearest neighbors, random forest, gradient boosting, ANN and the Extra Trees Regressor as our selected machine learning models. The initial input to these models is the complete set of features comprising time lags, the weather variables (temperature, humidity, wind speed) and the schedule-related variables: month number (between 1 and 12), quarter and weekend or weekday. Using lagged versions of the variables in the regression model allows varying amounts of recent history to be brought into the forecasting model. The selected main parameters for the various machine-learning techniques are shown in Table 1.

Table 1. Parameters for Machine Learning Techniques.

No. | Model | Parameters
1 | Ridge Regression | Regularization parameter α = 0.8
2 | k-Nearest Neighbor | No. of neighbors n = 5, weight function = uniform, distance metric = Euclidean
3 | Random Forest | No. of trees = 125, max depth of the tree = 100, min samples split = 4, min samples leaf = 4
4 | Gradient Boosting | No. of estimators = 125, maximum depth = 75, min samples split = 4, min samples leaf = 4
5 | Neural network | Activation = relu, weight optimization = adam, batch size = 150, number of epochs = 300, learning rate = 0.005
6 | Extra Trees | No. of trees = 125, max depth of the tree = 100, min samples split = 4, min samples leaf = 4
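Building the lagged design matrix that feeds these models could look like the following sketch (column names are hypothetical; the lag count follows the 30-lag result reported later):

```python
import pandas as pd

def make_supervised(df: pd.DataFrame, n_lags: int = 30):
    """Turn the merged load/weather frame into a supervised learning matrix."""
    out = df.copy()
    for k in range(1, n_lags + 1):
        out[f"load_lag_{k}"] = out["load"].shift(k)   # past observations
    out = out.dropna()               # first n_lags rows lack full history
    return out.drop(columns=["load"]), out["load"]    # features X, target y
```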

All the machine-learning models implemented in this study used the mean squared error as the loss function to be minimized. All trained models are evaluated on the same test set and performance is measured in terms of our evaluation metrics. For our baseline model comparison, we leave parameters at their default values. Optimal parameters for the best model will be chosen later. The results are shown in Table 2.

Table 2. Performance Metrics of Machine Learning Models.

Model | RMSE | CV (RMSE) | MAE
Linear Regression | 847.62 | 1.55 | 630.76
Ridge | 877.35 | 1.60 | 655.70
k-Nearest Neighbor | 1655.70 | 3.02 | 1239.35
Random Forest | 539.08 | 0.98 | 370.09
Gradient Boosting | 1021.55 | 1.86 | 746.24
Neural network | 2741.91 | 5.01 | 2180.89
Extra Trees | 466.88 | 0.85 | 322.04
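A sketch of the fit-and-score loop behind Table 2, with the parameter values of Table 1 mapped onto scikit-learn estimators; `X_train`, `y_train`, `X_test`, `y_test` and `forecast_errors` come from the earlier sketches, and the exact settings used in the paper may differ:

```python
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import (ExtraTreesRegressor, GradientBoostingRegressor,
                              RandomForestRegressor)
from sklearn.neural_network import MLPRegressor

models = {
    "Linear Regression": LinearRegression(),
    "Ridge": Ridge(alpha=0.8),
    "k-Nearest Neighbor": KNeighborsRegressor(n_neighbors=5, weights="uniform"),
    "Random Forest": RandomForestRegressor(n_estimators=125, max_depth=100,
                                           min_samples_split=4,
                                           min_samples_leaf=4),
    "Gradient Boosting": GradientBoostingRegressor(n_estimators=125,
                                                   max_depth=75,
                                                   min_samples_split=4,
                                                   min_samples_leaf=4),
    "Neural network": MLPRegressor(activation="relu", solver="adam",
                                   batch_size=150, max_iter=300,
                                   learning_rate_init=0.005),
    "Extra Trees": ExtraTreesRegressor(n_estimators=125, max_depth=100,
                                       min_samples_split=4, min_samples_leaf=4),
}

for name, model in models.items():
    model.fit(X_train, y_train)                     # same split for all models
    mae, rmse, cv = forecast_errors(y_test, model.predict(X_test))
    print(f"{name}: RMSE={rmse:.2f} CV(RMSE)={cv:.2f}% MAE={mae:.2f}")
```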

ANN models are known to overfit the data and show poor generalization when the dataset is large and training times are longer [45]. From the above results, it is seen that ANN does not perform well. This is because, in order to achieve good accuracy, the ANN model needs to have an optimal network structure and updated weights, which may require several epochs and network configurations. During training of our models, it was also noticed that ANN took longer time compared to the other approaches, which is an indication that the model overfits and as a result shows poor test set performance.

Since the ensemble approach of the Extra Trees Regressor model gives the best results, we will use it as our benchmark model. The main difference of the Extra Trees Regressor from other tree-based ensemble methods is the randomization of both the attribute and the cut-point choice while splitting a tree node.

4.3.1. Improving Benchmark Performance with Feature Selection & Hyper Parameter Tuning

Wrapper and embedded methods are model-based feature selection methods that can help remove redundant features and thus obtain a better performing and less overfitting model. In the wrapper method, feature importance is assessed using a learning algorithm, while in embedded methods the learning algorithm performs feature selection as well, in which the feature selection and parameter selection spaces are searched simultaneously. We use regression and ensemble-based methods to identify both linear and non-linear relationships among features and thus ascertain both relevant and redundant features. In the regression case, recursive feature elimination works by creating predictive models, weighting features, and pruning those with the smallest weights.
In the ensemble case, the Extra Trees Regressor decides feature importance based on the decrease in average impurity computed from all decision trees in the forest, without making any assumption on whether our data is linearly separable or not. Both the regression and ensemble-based models emphasized time lags as the most important features in comparison to the weather and schedule-related variables. With respect to our data set, among the weather and schedule-related features, temperature and quarter number were found to be the most important after the time lags. Figure 8 shows the relative importance of each feature. Subsequently, the input vector to machine learning models should include time lags as features.

Figure 8. Feature importance plot.
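A sketch of both selection routes with scikit-learn; the estimator choices and the `n_features_to_select` value are illustrative assumptions:

```python
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import ExtraTreesRegressor

# Wrapper route: recursive feature elimination around a linear model.
rfe = RFE(estimator=LinearRegression(), n_features_to_select=30)
rfe.fit(X_train, y_train)
wrapper_features = X_train.columns[rfe.support_]

# Embedded route: impurity-based importances from Extra Trees.
et = ExtraTreesRegressor(n_estimators=125).fit(X_train, y_train)
importances = pd.Series(et.feature_importances_,
                        index=X_train.columns).sort_values(ascending=False)
print(importances.head(10))   # time lags dominate, then temperature, quarter
```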

Various combinations of 5, 10, 20, 30 and 40 time lags were evaluated for the Extra Trees Regressor model by training several models. Results showed that using only 30 lagged variables as features, or a combination of lagged features with temperature and quarter, produced similar performance for this model. This is an indication that the temporal load brings information that is originally contained within the weather parameters.


Grid search cross validation [46] is a tuning method that uses cross validation to perform an exhaustive search over specified parameter values for an estimator. This validation method is used for hyperparameter tuning of our best Extra Trees Regressor model, using a fit-and-score methodology to find the best parameters. After tuning, the best Extra Trees Regressor model is found to have 150 estimators with a max depth of 200.
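The grid search could be sketched as follows. The candidate grid and the time-series-aware CV splitter are our assumptions; the paper reports only the winning values of 150 estimators and a max depth of 200:

```python
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

param_grid = {
    "n_estimators": [100, 125, 150],
    "max_depth": [100, 150, 200],
}
search = GridSearchCV(
    ExtraTreesRegressor(min_samples_split=4, min_samples_leaf=4),
    param_grid,
    scoring="neg_root_mean_squared_error",
    cv=TimeSeriesSplit(n_splits=5),   # respect temporal ordering of folds
)
search.fit(X_train, y_train)
print(search.best_params_)            # e.g., {'max_depth': 200, 'n_estimators': 150}
```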

Table 3 shows the performance results of the best Extra Trees Regressor obtained after tuning.

Table 3. Performance metrics of best model after feature selection & hyperparameter tuning.

Metrics | Values After | Values Before
RMSE | 428.01 | 466.88
CV(RMSE) % | 0.78 | 0.85
MAE | 292.49 | 322.04

4.3.2. Checking Overfitting for the Machine Learning Model

With too many features, the model may fit the training set very well but fail to generalize to new examples. It is necessary to plot a learning curve that shows the model performance on training and testing data over a varying number of training instances. The mean squared errors for both the training and testing sets for our optimized Extra Trees model (i.e., after hyperparameter optimization) decrease with bigger sizes of training data and converge to similar values, which shows that our model is not overfitting, as reported in Figure 9.

Figure 9. Learning curve for extra trees regressor.
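A learning curve like the one in Figure 9 can be produced with scikit-learn; this is a minimal sketch, and the fold strategy and train-size grid are our assumptions:

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import learning_curve

sizes, train_scores, test_scores = learning_curve(
    ExtraTreesRegressor(n_estimators=150, max_depth=200),
    X, y,                                # full lagged feature matrix and target
    train_sizes=np.linspace(0.1, 1.0, 8),
    scoring="neg_mean_squared_error",
    cv=5,
)
train_mse = -train_scores.mean(axis=1)   # average MSE over folds
test_mse = -test_scores.mean(axis=1)
# Converging train/test curves with growing data indicate no overfitting.
```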

4.4. GA-Enhanced LSTM-RNN Model

As seen from the machine learning modeling results in the previous section, there was almost no change in model performance when the weather and schedule-related variables were removed and only the time lags were used as inputs. Therefore, based on the results of feature importance and the cyclical patterns identified in the consumption data, we will only use time lags for the LSTM model to predict the future. These time lags are able to capture conditional dependencies between successive time periods in the model. A number of different time lags were used as features for the machine learning model, thereby training the model several times and selecting the best lag length window. However, the same experimentation with an LSTM deep network, which has a large number of parameters to learn, would be computationally very expensive and time consuming.

Finding the optimal number of lags and number of hidden layers for an LSTM model is a non-deterministic polynomial (NP) hard problem. Meta-heuristic algorithms like GA do not guarantee finding the globally optimal solution; however, they do tend to produce good suboptimal solutions that are sometimes near the global optimum. Selecting the optimum among a large number of potential combinations is thus a search and optimization problem. Previous research has shown that GA algorithms can be effectively used to find a near-optimal set of time lags [47,48]. GA provides a solution to this problem by an evolutionary process inspired by the mechanisms of natural selection and genetic science.

The pseudo-code for the GA algorithm and the crossover and mutation operations are shown in Figure 10.


Figure 10. (a) Pseudo code for genetic algorithm (GA); (b) crossover and mutation operations.

4.4.1. GA Customization for Optimal Lag Selection

Choosing an optimal number of time lags and LSTM layers to include is domain specific, and it could be different for each input feature and data characteristics. Using many lagged values increases the dimensionality of the problem, which can cause the deep model to overfit as well as become difficult to train. GA, inspired by the process of natural selection, is used here for finding the optimal number of time lags and layers for the LSTM-based deep network. The number of time lags will be tested from 1 to 99 and the number of hidden layers from 3 to 10. A binary array randomly initialized using a Bernoulli distribution is defined as the genetic representation of our solution. Thus, every chromosome represents an array of time lags. The population size and number of generations are set to 40. The various operators of GA are set as follows:

(i) Selection: roulette wheel selection was used to select parents according to their fitness. Better chromosomes have more chances to be selected.
(ii) Crossover: the crossover operator exchanges variables between two parents.
We have used two-point crossover on the input sequence individuals with a crossover probability of 0.7.
(iii) Mutation: this operation introduces diversity into the solution pool by means of randomly swapping bits. Mutation is a binary flip with a probability of 0.1.
(iv) Fitness function: RMSE on the validation set acts as the fitness function.

A GA solution is decoded to obtain the integer time lag size and number of layers, which are then used to train an LSTM model and calculate the RMSE on the validation set. The solution with the highest fitness score is selected as the best solution. We reduced the training and test set sizes in order to speed up the GA search process on a single machine. The LSTM-RNN performed poorly on smaller numbers of time lags. Its most effective length found was 34 time lags with six hidden layers, as shown in Figure 11.

Figure 11. Root mean square error (RMSE) vs. time lags for (a) six and (b) four hidden layers.
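As an illustration of this customization, the following sketch uses the DEAP library with the operator settings listed above. The 10-bit encoding of the (lag length, layer count) pair and the evaluate_lstm() helper, which would train an LSTM and return its validation RMSE, are our own illustrative assumptions rather than the exact implementation; fitness is taken as 1/RMSE so that roulette wheel selection favours lower error.

```python
# Sketch of the GA set-up with DEAP: Bernoulli-initialized binary chromosomes,
# roulette wheel selection, two-point crossover (p = 0.7), bit-flip mutation
# (p = 0.1), and a population and generation count of 40.
import random
from deap import base, creator, tools, algorithms

def decode(bits):
    lags = 1 + int("".join(map(str, bits[:7])), 2) % 99  # lag length in 1..99
    layers = 3 + int("".join(map(str, bits[7:])), 2)     # hidden layers in 3..10
    return lags, layers

def fitness(individual):
    lags, layers = decode(individual)
    rmse = evaluate_lstm(lags, layers)  # hypothetical: train LSTM, return validation RMSE
    return (1.0 / rmse,)                # higher fitness = lower error (for roulette)

creator.create("FitnessMax", base.Fitness, weights=(1.0,))
creator.create("Individual", list, fitness=creator.FitnessMax)

toolbox = base.Toolbox()
toolbox.register("bit", random.randint, 0, 1)            # Bernoulli initialization
toolbox.register("individual", tools.initRepeat, creator.Individual, toolbox.bit, 10)
toolbox.register("population", tools.initRepeat, list, toolbox.individual)
toolbox.register("evaluate", fitness)
toolbox.register("mate", tools.cxTwoPoint)               # two-point crossover
toolbox.register("mutate", tools.mutFlipBit, indpb=0.1)  # binary flip mutation
toolbox.register("select", tools.selRoulette)            # roulette wheel selection

pop = toolbox.population(n=40)
algorithms.eaSimple(pop, toolbox, cxpb=0.7, mutpb=0.1, ngen=40, verbose=False)
print("best (lags, layers):", decode(tools.selBest(pop, k=1)[0]))
```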


4.4.2. GA-LSTM Training

To model our current forecasting problem as a regression one, we consider a univariate electric load time series with $N$ observations $\{X_{t_1}, X_{t_2}, \ldots, X_{t_N}\}$. The task of forecasting is to utilize these $N$ data points to predict the next $H$ data points $X_{t_{N+1}}, X_{t_{N+2}}, \ldots, X_{t_{N+H}}$ in the future of the existing time series. LSTMs are specifically designed for sequential data that exhibit patterns over many time steps. Generally, the larger the dataset, the more hidden layers and neurons can be used for modeling without overfitting. Since our dataset is large enough, we can achieve higher accuracy. If there are too few neurons per layer, our model will underfit during training. In order to achieve good performance for our model, we aim to test various model parameters including the number of hidden layers, the number of neurons in each layer, the number of epochs, the batch size, and the activation and optimization functions. The number of neurons in the input layer of the LSTM model matches the number of time lags in the input vector, the hidden layers are dense fully connected, and the output layer has a single neuron for prediction with a linear activation function. Mean squared error is used as the loss function for the output layer.
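A minimal sketch of this set-up, using the Keras API, is shown below: the series is converted into (lag window to next value) training pairs and fed to a network ending in a single linear output neuron trained with mean squared error. The layer count and widths are placeholders, since the actual values are selected empirically in Section 5.

```python
# Sketch: framing the univariate load series as a supervised problem and
# building an LSTM regressor. Layer sizes are illustrative placeholders.
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

def make_supervised(series, n_lags):
    # Turn {X_t1, ..., X_tN} into pairs (n_lags past values -> next value).
    X, y = [], []
    for i in range(len(series) - n_lags):
        X.append(series[i:i + n_lags])
        y.append(series[i + n_lags])
    return np.array(X)[..., np.newaxis], np.array(y)  # shape (samples, lags, 1)

def build_model(n_lags, n_hidden=50):
    model = Sequential([
        LSTM(n_hidden, input_shape=(n_lags, 1)),  # input width matches the lag count
        Dense(n_hidden, activation="relu"),       # dense fully connected hidden layer
        Dense(1, activation="linear"),            # single output neuron
    ])
    model.compile(loss="mse", optimizer="adam")   # mean squared error loss
    return model
```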

5. Experimental Results: GA-LSTM Settings

This section is dedicated to empirically setting the parameters of the LSTM, validating the obtained model and discussing the summary results. After finding the optimal lag length and number of layers, various other meta-parameters for the LSTM-RNN model were tested as follows:

1. The number of neurons in the hidden layers was tested from 20 to 100.
2. Batch size was varied from 10 to 200 training examples and the number of training epochs from 50 to 300.
3. Sigmoid, hyperbolic tangent (tanh) and rectified linear unit (ReLU) were tested as the activation functions in the hidden layers.
4. Stochastic gradient descent (SGD), Root Mean Square Propagation (RMSProp) and adaptive moment estimation (ADAM) were tested as optimizers.
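A sketch of how this search could be organized is given below; the candidate values are samples from the ranges in items 1–4, and train_and_score() is a hypothetical helper that would build, fit and evaluate one configuration.

```python
# Illustrative sweep over the meta-parameters above; train_and_score() is a
# hypothetical helper returning validation RMSE for one configuration.
from itertools import product

candidates = product(
    [20, 50, 100],                # neurons per hidden layer
    [10, 125, 200],               # batch size
    [50, 150, 300],               # training epochs
    ["sigmoid", "tanh", "relu"],  # hidden-layer activation
    ["sgd", "rmsprop", "adam"],   # optimizer
)

best_rmse, best_cfg = float("inf"), None
for cfg in candidates:
    rmse = train_and_score(*cfg)  # hypothetical: build, fit and evaluate
    if rmse < best_rmse:
        best_rmse, best_cfg = rmse, cfg
print("best configuration:", best_cfg)
```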

We obtained significantly better results with six hidden layers having 100, 60 and 50 neurons. The number of epochs used was 150 with a batch size of 125 training examples. For our problem, the nonlinear activation function ReLU performed the best and is thus used as the activation function for each of the hidden layers. ReLU does not encounter the vanishing gradient problem as the tanh and sigmoid activations do. Amongst the optimizers, ADAM performed the best and showed faster convergence than conventional SGD. Furthermore, by using this optimizer, we do not need to specify and tune a learning rate as with SGD. The final selected parameters were the best result available from within the tested solutions. Due to space limitations, we cannot provide all the test results for the different network parameters. Table 4 shows the summary results of comparing the LSTM-RNN with the Extra Trees Regressor model.

Table 4. Performance metrics of the long short-term memory (LSTM) Model on the test set.

Metrics | LSTM (30 Lags) | LSTM (Optimal Time Lags) | Extra Tree Model | Error Reduction (%)
RMSE | 353.38 | 341.40 | 428.01 | 20.3
CV(RMSE) % | 0.643 | 0.622 | 0.78 | 20.3
MAE | 263.14 | 249.53 | 292.49 | 14.9

Figure 12 plots the actual data against the values predicted by the LSTM model for the next two weeks, as a medium term prediction. It shows a good fit and a stable prediction for the medium term horizon.

Figure 12. Actual vs. predicted forecast by the LSTM model.

5.1. Cross Validation of the LSTM Model

Time series split is a validation method inspired by k-fold cross validation and suitable for the time series case [49]. The method involves repeating the process of splitting the time series into train and test sets 'k' times. The test size remains fixed while the training set size increases for every fold, as in Figure 13. The sequential order of the time series is maintained. This additional computation offers a more robust performance estimate of both our models on sliding test data.

Figure 13. Time Series Split for Model Validation.

Indeed, we use 5-fold cross validation to train five Extra Trees and five LSTM-RNN models with a fixed test size and an increasing training set size in every fold, while maintaining the temporal order of the data. The mean performance metrics of the results are plotted in Figure 14.

Figure 14. Time series cross validation with Multiple Train-Test Splits.
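This validation scheme corresponds to scikit-learn's TimeSeriesSplit [49]. The following sketch shows how the per-fold RMSE mean and standard deviation could be collected, assuming X and y are arrays in temporal order and model is any estimator with fit/predict methods.

```python
# Sketch of the multiple train-test split validation: fixed-size test window,
# growing training window, temporal order preserved.
import numpy as np
from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics import mean_squared_error

def time_series_cv(model, X, y, n_splits=5):
    rmses = []
    for train_idx, test_idx in TimeSeriesSplit(n_splits=n_splits).split(X):
        model.fit(X[train_idx], y[train_idx])
        pred = model.predict(X[test_idx])
        rmses.append(np.sqrt(mean_squared_error(y[test_idx], pred)))
    return np.mean(rmses), np.std(rmses)  # mean and std as reported in Table 5
```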


Mean and standard deviation values are collected on RMSE, CV(RMSE) and MAE for the LSTM-RNN model as well as for the best Extra Trees Regressor model. These results are shown in Table 5. They confirm our previous results in Table 4. Furthermore, the LSTM showed a lower forecast error compared to the Extra Trees model using this validation approach.

Table 5. Performance Metrics of LSTM-RNN and Extra Tree Regressor Model using the Cross Validation Approach.

Metric | Model | Mean | Std. Deviation
RMSE | Extra Trees | 513.8 | 90.9
RMSE | LSTM | 378 | 59.8
CV(RMSE) % | Extra Trees | 1.95 | 0.3
CV(RMSE) % | LSTM | 1.31 | 0.2
MAE | Extra Trees | 344 | 55.8
MAE | LSTM | 270.4 | 45.4

5.2. Short and Medium Term Forecasting Results

In order to check the model performance on the short term (i.e., accuracy) and medium term (i.e., stability) horizons, random samples of 100 data points were selected uniformly from various time ranges, from one week to several months, as in Table 6. The mean of the performance metric values is computed for each range. The results show that the error values are consistent, with low standard deviations. The accuracy measured by CV(RMSE) is 0.61% for the short term and an average of 0.56% for the medium term. With very comparable performances for both the short term and the medium term horizons, our LSTM-RNN is accurate and stable.

Table 6. Performance Metrics of LSTM-RNN on various Time Horizons.

Forecasting Horizon | MAE | RMSE | CV(RMSE) %
2 Weeks | 251 | 339 | 0.61
Between 2–4 Weeks | 214 | 258 | 0.56
Between 2–3 Months | 225 | 294 | 0.63
Between 3–4 Months | 208 | 275 | 0.50
Mean (Medium term) | 215.6 | 275.6 | 0.56
Std. Dev. | 8.6 | 18 | 0.06
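For clarity, the reported metrics can be computed per horizon as in the sketch below, where CV(RMSE) is taken as the RMSE normalized by the mean actual load and expressed in percent, consistent with the magnitudes reported above.

```python
# Sketch: error metrics for one forecasting horizon. CV(RMSE) is assumed to
# be RMSE divided by the mean actual load, expressed in percent.
import numpy as np

def horizon_metrics(y_true, y_pred):
    err = np.asarray(y_true) - np.asarray(y_pred)
    mae = np.mean(np.abs(err))
    rmse = np.sqrt(np.mean(err ** 2))
    cv_rmse = 100.0 * rmse / np.mean(y_true)  # CV(RMSE) in %
    return mae, rmse, cv_rmse
```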

6. Discussion: Threats to Validity

In this section, we respond to some threats to the construct validity, internal validity and external validity of our study.

The construct validity in our study is concerned with two threats. The first is related to the sufficiency of time series data to predict the energy load well. This question was taken into account from the beginning of the methodology process, and other sources of data, in particular weather information, were added as inputs rather than excluded, their relevance being assessed by the different feature selection techniques. The second threat concerns the algorithms used in the different steps and components of the methodology, which could rely on unrealistic assumptions inappropriate for our case study context. To avoid such a threat, we have developed all the algorithms ourselves.

The internal validity in our case is concerned with three main threats. The first is related to how suitable the data set is for training a LSTM-RNN model, which needs a large amount of data to mine historical patterns. The RTE power consumption data set ranges from January 2008 until December 2016. This data set is large enough to allow the deep learning that is particularly targeted by LSTM-RNN. The second threat is LSTM-RNN overfitting. It can be observed through the learning curve that describes the error rate according to the number of epochs: when overfitting occurs, after a certain point the test error starts increasing while the training error still decreases. Too much training with a complex forecasting model leads to bad generalization. To prevent overfitting of our LSTM-RNN model, we have randomly shuffled the training examples at each epoch, which helps to improve generalization. Furthermore, early stopping is also used, whereby training is stopped at the lowest error achieved on the test set. This approach is highly effective in reducing overfitting. The diagnostic plot of our LSTM-RNN model in Figure 15 shows that the train and validation losses decrease as the number of epochs increases; the loss stabilizes around the same point, showing a good fit. The third threat is related to the choice of the alternative machine learning techniques for comparison. We may misrepresent the spectrum of techniques used for the energy prediction problem. For this concern, we have selected the most successful techniques according to our literature exploration. A stronger strategy to avoid this threat is to carry out a specific study to identify the top techniques for each class of prediction circumstances. This latter is one of the objectives of our future works.

Figure 15. LSTM learning curve.
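These two safeguards can be expressed in Keras roughly as follows; the patience value is illustrative, and model and the data splits are assumed from the earlier set-up.

```python
# Sketch of the overfitting safeguards: per-epoch shuffling of training
# examples and early stopping on the held-out loss. The patience value is
# illustrative; model, X_train, y_train, X_val, y_val are assumed prepared.
from tensorflow.keras.callbacks import EarlyStopping

early_stop = EarlyStopping(monitor="val_loss", patience=10,
                           restore_best_weights=True)
history = model.fit(X_train, y_train,
                    validation_data=(X_val, y_val),
                    epochs=300, batch_size=125,
                    shuffle=True,  # reshuffle training examples each epoch
                    callbacks=[early_stop])
```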

The external validity in our case is concerned with the generalization of our results. In other terms, the LSTM-RNN may not perform well when it is used in very different circumstances of energy consumption. This threat is mitigated by two choices. The first is the fact that we are using a deep learning approach, which is known to recognize good patterns from a large amount of data, and from the far history in the case of time series. Deep learning is a variation of ANN that produces a robust model. The second choice is the use of a large amount of training data (100 k out of more than 150 k examples) with a rigorous validation method that evaluates the robustness of the models against being trained on different training sets. Besides, one of our future works will be applying our methodology, and in particular LSTM-RNN modeling, to many data sets from distant energy contexts, and training more parameters of the LSTM model via GA in a distributed manner. For instance, our approach could be applied for forecasting several decision variables, such as the energy flow and the load in the context of urban vehicle networks [50]. Many alternative validation methods for our approach can be inspired by the extended discussion on validation in [51].

Finally, for the sake of reuse and duplication of experiments, the used code is available at: https://github.com/UAE-University/CIT_LSTM_TimeSeries.

7. Conclusions

Accurate forecasting of electricity consumption represents an essential part of energy management for sustainable systems. In this research, we built and optimized a LSTM-RNN-based univariate model for demand side load prediction, which forecasted accurately over both short term (few days to 2 weeks) and medium term (few weeks to few months) time horizons. For comparison purposes, we implemented seven forecasting techniques representing a spectrum of commonly used machine learning approaches in the energy prediction domain. The best performing model on the studied consumption data set was used as our benchmark model. We explored wrapper and embedded techniques of feature selection, namely recursive feature elimination and the extra trees regressor, to validate the importance of our model inputs. Hyperparameter tuning was used to further improve the performance of the ensemble-based machine learning model.

These techniques unexpectedly eliminated the temperature and quarter of year features and gave the most importance to historical time lags for forecasting energy consumption. This is due to the fact that temporal load parameters encapsulate information that is originally contained within the weather and weekday features. For time lagged features, a common problem is that of choosing the appropriate lag length, which we solved using GA. The validation of results using a simple train test split, multiple train test splits and various time horizons showed that the LSTM-RNN based forecasting method has lower forecast errors on the challenging short to medium term electric load forecasting problem compared to the best machine learning model. We have modeled electric load forecasting as a univariate time series problem and thus the approach can be generalized to other time series data. Our future works will include applying our methodology, and in particular LSTM-RNN modeling, to many data sets from distant energy contexts. Another objective is to carry out an empirical study to identify the top techniques for each class of prediction circumstances.

Author Contributions: This paper is a collaborative work of all the authors. Conceptualization, S.B. and A.F.; Methodology, S.B. and A.F.; Software, A.F.; Validation, S.B., A.F. and A.O.; Data Curation, A.F.; Writing-Original Draft Preparation, S.B., A.F., A.O. and M.A.S.; Writing-Review & Editing, S.B., A.F., A.O. and M.A.S.; Supervision, S.B.; Project Administration, S.B.; Funding Acquisition, S.B. and M.A.S.
Acknowledgments: The authors would like to acknowledge United Arab Emirates University for supporting the present work by a UPAR grant under grant number G00001930 and providing essential facilities.
Conflicts of Interest: The authors declare no conflict of interest.

References

1. Mocanu, E.; Nguyen, P.H.; Gibescu, M.; Kling, W.L. Deep learning for estimating building energy consumption. Sustain. Energy Grids Netw. 2016, 6, 91–99. [CrossRef]
2. Hyndman, R.J.; Shu, F. Density forecasting for long-term peak electricity demand. IEEE Trans. Power Syst. 2010, 25, 1142–1153. [CrossRef]
3. Chui, F.; Elkamel, A.; Surit, R.; Croiset, E.; Douglas, P.L. Long-term electricity demand forecasting for power system planning using economic, demographic and climatic variables. Eur. J. Ind. Eng. 2009, 3, 277–304. [CrossRef]
4. Hernandez, L.; Baladron, C.; Aguiar, J.M.; Carro, B.; Sanchez-Esguevillas, A.J.; Lloret, J.; Massana, J. A survey on electric power demand forecasting: Future trends in smart grids, microgrids and smart buildings. IEEE Commun. Surv. Tutor. 2014, 16, 1460–1495. [CrossRef]
5. Graves, A.; Jaitly, N. Towards End-To-End Speech Recognition with Recurrent Neural Networks. In Proceedings of the International Conference on Machine Learning, Beijing, China, 21–26 June 2014; pp. 1764–1772.
6. Mao, J.; Xu, W.; Yang, Y.; Wang, J.; Huang, Z.; Yuille, A. Deep captioning with multimodal recurrent neural networks (m-RNN). arXiv, 2014.
7. Sutskever, I.; Vinyals, O.; Le, Q.V. Sequence to sequence learning with neural networks. In Proceedings of the 27th International Conference on Neural Information Processing Systems, Montreal, QC, Canada, 8–13 December 2014; pp. 3104–3112.
8. Hochreiter, S.; Schmidhuber, J. LSTM can solve hard long time lag problems. In Proceedings of the Advances in Neural Information Processing Systems, Denver, CO, USA, 2–5 December 1996; pp. 473–479.
9. Ribeiro, G.H.; Neto, P.S.D.M.; Cavalcanti, G.D.; Tsang, R. Lag selection for time series forecasting using particle swarm optimization. In Proceedings of the IEEE 2011 International Joint Conference on Neural Networks (IJCNN), San Jose, CA, USA, 31 July–5 August 2011; pp. 2437–2444.
10. Goldberg, D.E. Genetic Algorithms in Search, Optimization, Machine Learning; Addison Wesley: Reading, UK, 1989.
11. Wang, Z.; Srinivasan, R.S. A review of artificial intelligence based building energy use prediction: Contrasting the capabilities of single and ensemble prediction models. Renew. Sustain. Energy Rev. 2017, 75, 796–808. [CrossRef]
12. Liu, N.; Tang, Q.; Zhang, J.; Fan, W.; Liu, J. A hybrid forecasting model with parameter optimization for short-term load forecasting of micro-grids. Appl. Energy 2014, 129, 336–345. [CrossRef]

13. Weron, R. Electricity price forecasting: A review of the state-of-the-art with a look into the future. Int. J. Forecast. 2014, 30, 1030–1081. [CrossRef]
14. Ryu, S.; Noh, J.; Kim, H. Deep Neural Network Based Demand Side Short Term Load Forecasting. Energies 2017, 10, 3. [CrossRef]
15. Hagan, M.T.; Behr, S.M. The time series approach to short term load forecasting. IEEE Trans. Power Syst. 1987, 2, 785–791. [CrossRef]
16. Taylor, J.W. Short-term electricity demand forecasting using double seasonal exponential smoothing. J. Oper. Res. Soc. 2003, 54, 799–805. [CrossRef]
17. Taylor, J.W.; de Menezes, L.M.; McSharry, P.E. A comparison of univariate methods for forecasting electricity demand up to a day ahead. Int. J. Forecast. 2006, 22, 1–16. [CrossRef]
18. Park, D.C.; El-Sharkawi, M.; Marks, R.; Atlas, L.; Damborg, M. Electric load forecasting using an artificial neural network. IEEE Trans. Power Syst. 1991, 6, 442–449. [CrossRef]
19. Hernandez, L.; Baladrón, C.; Aguiar, J.M.; Carro, B.; Esguevillas, S.A.J.; Lloret, J. Short-term load forecasting for microgrids based on artificial neural networks. Energies 2013, 6, 1385–1408. [CrossRef]
20. Hippert, H.S.; Pedreira, C.E.; Souza, R.C. Neural networks for short-term load forecasting: A review and evaluation. IEEE Trans. Power Syst. 2001, 16, 44–55. [CrossRef]
21. Box, G.E.; Jenkins, G.M.; Reinsel, G.C. Time Series Analysis: Forecasting and Control; John Wiley & Sons: Hoboken, NJ, USA, 2011; Volume 734.
22. Chen, J.-F.; Wang, W.-M.; Huang, C.M. Analysis of an adaptive time-series autoregressive moving-average (ARMA) model for short-term load forecasting. Electr. Power Syst. Res. 1995, 34, 187–196. [CrossRef]
23. Zhao, H.-X.; Magoulès, F. A review on the prediction of building energy consumption. Renew. Sustain. Energy Rev. 2012, 16, 3586–3592. [CrossRef]
24. Foucquier, A.; Robert, S.; Suard, F.; Stephan, L.; Jay, A. State of the art in building modelling and energy performances prediction: A review. Renew. Sustain. Energy Rev. 2013, 23, 272–288. [CrossRef]
25. Cincotti, S.; Gallo, G.; Ponta, L.; Raberto, M. Modeling and forecasting of electricity spot-prices: Computational intelligence vs classical econometrics. AI Commun. 2014, 27, 301–314.
26. Amjady, N.; Keynia, F. Day ahead price forecasting of electricity markets by a mixed data model and hybrid forecast method. Int. J. Electr. Power Energy Syst. 2008, 30, 533–546. [CrossRef]
27. Bakirtzis, A.G.; Petridis, V.; Kiartzis, S.J.; Alexiadis, M.C.; Maissis, A.H. A neural network short term load forecasting model for the Greek power system. IEEE Trans. Power Syst. 1996, 11, 858–863. [CrossRef]
28. Papadakis, S.E.; Theocharis, J.B.; Kiartzis, S.J.; Bakirtzis, A.G. A novel approach to short-term load forecasting using fuzzy neural networks. IEEE Trans. Power Syst. 1998, 13, 480–492. [CrossRef]
29. Bashir, Z.; El-Hawary, M. Applying wavelets to short-term load forecasting using PSO-based neural networks. IEEE Trans. Power Syst. 2009, 24, 20–27. [CrossRef]
30. Kodogiannis, V.S.; Amina, M.; Petrounias, I. A clustering-based fuzzy wavelet neural network model for short-term load forecasting. Int. J. Neural Syst. 2013, 23, 1350024. [CrossRef][PubMed]
31. Fan, S.; Chen, L. Short-term load forecasting based on an adaptive hybrid method. IEEE Trans. Power Syst. 2006, 21, 392–401. [CrossRef]
32. Bengio, Y.; Courville, A.; Vincent, P. Representation Learning: A Review and New Perspectives. IEEE Trans. Pattern Anal. Mach. Intell. 2013, 35, 1798–1828. [CrossRef][PubMed]
33. Marino, D.L.; Amarasinghe, K.; Manic, M. Building energy load forecasting using Deep Neural Networks. In Proceedings of the IECON 42nd Annual Conference of the IEEE Industrial Electronics Society, Florence, Italy, 23–26 October 2016; pp. 7046–7051.
34. Rahman, A.; Srikumar, V.; Smith, A.D. Predicting electricity consumption for commercial and residential buildings using deep recurrent neural networks. Appl. Energy 2018, 212, 372–385. [CrossRef]
35. Zheng, H.; Yuan, J.; Chen, L. Short-Term Load Forecasting Using EMD-LSTM Neural Networks with a Xgboost Algorithm for Feature Importance Evaluation. Energies 2017, 10, 1168. [CrossRef]
36. Roux, N.L.; Bengio, Y. Deep Belief Networks Are Compact Universal Approximators. Neural Comput. 2010, 22, 2192–2207. [CrossRef]
37. Colah.github.io. Understanding LSTM Networks—Colah's Blog. Available online: http://colah.github.io/posts/2015-08-Understanding-LSTMs (accessed on 5 April 2018).
38. Patterson, J.; Gibson, A. Deep Learning. A Practitioner's Approach; O'Reilly Media, Inc.: Sebastopol, CA, USA, 2017; pp. 150–158.

39. Wei, Y.; Zhang, X.; Shi, Y.; Xia, L.; Pan, S.; Wu, J.; Han, M.; Zhao, X. A review of data-driven approaches for prediction and classification of building energy consumption. Renew. Sustain. Energy Rev. 2018, 82, 1027–1047. [CrossRef]
40. Yildiz, B.; Bilbao, J.; Sproul, A. A review and analysis of regression and machine learning models on commercial building electricity load forecasting. Renew. Sustain. Energy Rev. 2017, 73, 1104–1122. [CrossRef]
41. RTE France. Bilans Électriques Nationaux. Available online: http://www.rte-france.com/fr/article/bilans-electriques-nationaux (accessed on 7 February 2018).
42. Dangeti, P. Statistics for Machine Learning: Techniques for Exploring Supervised, Unsupervised, and Reinforcement Learning Models with Python and R; Packt Publishing: Birmingham, UK, 2017.
43. Brooks, C. Introductory Econometrics for Finance, 2nd ed.; Cambridge University Press: Cambridge, UK, 2008.
44. Hastie, T.J.; Tibshirani, R.J.; Friedman, J.H. The Elements of Statistical Learning: Data Mining, Inference, and Prediction; Springer: New York, NY, USA, 2009.
45. Huang, Y. Advances in Artificial Neural Networks—Methodological Development and Application. Algorithms 2009, 2, 973–1007. [CrossRef]
46. Scikit-learn.org. Parameter Estimation Using Grid Search with Cross-Validation—Scikit-Learn 0.19.1 Documentation. 2018. Available online: http://scikit-learn.org/stable/auto_examples/model_-selection/plotgrid_search_digits.html (accessed on 12 April 2018).
47. Lukoseviciute, K.; Ragulskis, M. Evolutionary algorithms for the selection of time lags for time series forecasting by fuzzy inference systems. Neurocomputing 2010, 73, 2077–2088. [CrossRef]
48. Sun, Z.L.; Huang, D.S.; Zheng, C.H.; Shang, L. Optimal selection of time lags for TDSEP based on genetic algorithm. Neurocomputing 2006, 69, 884–887. [CrossRef]
49. Scikit-learn.org. sklearn.model_selection.TimeSeriesSplit—Scikit-Learn 0.19.1 Documentation. 2018. Available online: http://scikitlearn.org/stable/modules/generated/sklearn.model_selection.Time-Series-Split.html (accessed on 18 April 2018).
50. Scellato, S.; Fortuna, L.; Frasca, M.; Gómez-Gardeñes, J.; Latora, V. Traffic optimization in transport networks based on local routing. Eur. Phys. J. B 2010, 73, 303–308. [CrossRef]
51. Bouktif, S. Improving Software Quality Prediction by Combining and Adapting Predictive Models. Ph.D. Thesis, Montreal University, Montreal, QC, Canada, 2005.

© 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).