Engineering Degree Project

Predictive Autoscaling of Systems using Artificial Neural Networks

Authors: Christoffer Lundström, Camilla Heiding
Supervisor: Sogand Shirinbab
Lnu Supervisor: Jonas Nordqvist
Examiner: Jonas Lundberg
Semester: Spring 2021
Subject: Computer Science

Abstract

Autoscalers handle the scaling of instances in a system automatically based on specified thresholds such as CPU utilization. Reactive autoscalers do not take the delay of initiating a new instance into account, which may lead to overutilization. By applying machine learning methodology to predict future loads and the desired number of instances, it is possible to preemptively initiate scaling such that new instances are available before demand occurs. Leveraging efficient scaling policies keeps the costs and energy consumption low while ensuring the availability of the system. In this thesis, the predictive capability of different multilayer perceptron configurations is investigated to elicit a suitable model for a telecom support system. The results indicate that it is possible to accurately predict future load using a multilayer perceptron regressor model. However, the possibility of reproducing the results in a live environment is questioned as the dataset used is derived from a simulation.

Keywords: autoscaling, predictive autoscaling, machine learning, artificial neural networks, multilayer perceptrons, MLP-regressor, time series forecasting

Preface

Both of us would like to express our sincere gratitude to Ericsson, and especially Sogand Shirinbab, whose guidance and support made this thesis project an invaluable experience for us. We are also grateful to Eldin Malkoc for endorsing us and helping us through the onboarding process. We would like to extend our appreciation and thankfulness to our supervisor Jonas Nordqvist for his generous support and willingness to provide valuable insights.

Christoffer would like to express his love and gratitude for his family, especially his partner Stina and his mother Linda, who have been tremendously supportive and caring during trying times. Finally, he would like to express his gratitude to his project partner Camilla and commend her for outstanding devotion and attention to detail during the course of the project.

Camilla directs a great amount of appreciation towards the family and friends that helped keep her motivation up. Cohabiting in an apartment during a period of working only from home has been particularly challenging, therefore some extra love is directed to her partner Johan. She would also like to return the appreciation to her project partner Christoffer for great commitment, knowledge, and optimism during the work on this thesis.

Contents

1 Introduction
  1.1 Background
  1.2 Related work
  1.3 Problem formulation
  1.4 Motivation
  1.5 Milestones
  1.6 Scope/Limitation
  1.7 Target group
  1.8 Outline

2 Theory
  2.1 Docker
  2.2 Container orchestration
  2.3 Autoscaling
  2.4 Reactive and predictive autoscaling
  2.5 Autoscalers in the market
    2.5.1 Kubernetes and the Horizontal Pod Autoscaler (KHPA)
    2.5.2 Amazon EC2 Autoscaling
    2.5.3 Google Compute Engine Autoscaling
  2.6 Machine learning
    2.6.1 Overview and terminology
    2.6.2 Validation
    2.6.3 Model selection
    2.6.4 Feature selection
    2.6.5 Feature scaling
    2.6.6 Model evaluation
    2.6.7 Artificial neural networks
  2.7 Time series data

3 Method
  3.1 Research Project
  3.2 Method
    3.2.1 Literature review
    3.2.2 Controlled experiment
    3.2.3 Data preprocessing
    3.2.4 Feature selection and scaling
    3.2.5 Cross-validation with time series split
    3.2.6 Hyperparameter tuning
    3.2.7 Model evaluation
    3.2.8 Predictive autoscaling evaluation
  3.3 Reliability and Validity
  3.4 Ethical considerations

4 Implementation
  4.1 Data preprocessing
  4.2 Model compilation
  4.3 Grid search with time series cross-validation

5 Experimental Setup and Results
  5.1 Experimental setup
  5.2 Results

6 Analysis

7 Discussion
  7.1 Model validity
  7.2 Predictions as scaling policy
  7.3 Further improvements
  7.4 Connections to related work

8 Conclusion
  8.1 Future work

References

1 Introduction

This chapter introduces the background and motivation for the problem investigated. Related work is presented, and the focus points for this thesis are elicited in the problem formulation. The following chapters are outlined to guide the reader through the report.

1.1 Background

Autoscaling is a feature that enables organizations to scale cloud services such as server capacities, virtual machines, or pods up or down automatically, based on pre-defined thresholds for resource utilization levels. The overall benefit of autoscaling is that it eliminates the need to respond manually in real time to traffic spikes that merit new resources and instances by automatically changing the active number of servers. Each of these servers requires configuration, monitoring, and decommissioning, which is at the core of autoscaling. Core autoscaling features also allow lower cost and reliable performance by seamlessly increasing and decreasing the number of instances as demand spikes and drops [1]. Cloud computing providers such as Amazon Web Services (AWS), Azure, and Google Cloud Platform (GCP) offer autoscaling tools. However, most of these tools use threshold based mechanisms to control the autoscaling process, which are not very accurate for complex applications such as telecom support systems since they do not consider the start-up time of new instances. A study by Casalicchio [2] has shown that threshold based autoscalers underestimate the number of pods required to keep CPU utilization at a low enough level to satisfy the quality of service constraints on response time.

This thesis is done in cooperation with Ericsson and will target parts of their telecom support systems. Ericsson is a multinational telecommunications company providing services such as software and hardware infrastructure worldwide. Predicting future resource demand based on metrics provided by telecom support systems could potentially achieve more efficient scaling and, by extension, lower response time, resource utilization, and cost. The purpose of this thesis is to propose a machine learning model that can be used to learn and predict the future load on an application. The results obtained in this thesis will be used as a foundation for a software framework that provides recommendations regarding when it is a good time to scale and which application should be scaled up or down.

1.2 Related work

The report [3] by Jiang et al. uses linear regression to predict the average number of web requests in the coming hour in order to adapt the resource capacity accordingly. The motivation for their work is that launching a virtual machine (VM) suffers a delay of considerable length, and by using a predictive approach they can optimize the price-performance ratio and preempt the demand for VMs when the load increases. The authors found seasonality in web requests. For example, in their data, more email services are requested on Monday mornings and the total number of web requests drops around midnight. This seasonality improves the ability to predict future load. Their conclusion is that their approach indicates a better price-performance ratio than other methods.

The doctoral thesis [4] by Yadavarnikravesh addresses the issue that reactive autoscaling neglects the boot-up time of VMs, which leads to under-provisioning. However, the author

raises the issue that existing predictive autoscalers provide limited accuracy, which deters cloud clients from using them. Autoscalers using algorithms such as artificial neural networks (ANN) and support vector machines (SVM) are implemented in the thesis to achieve greater accuracy. The author also experiments with time series window sizes to determine whether it is beneficial to use a broad window of previous data to predict the future or if trends are better captured with recent data. Both works [3][4] compare their solutions with autoscaling approaches offered by Amazon, and both indicate an improvement over threshold based autoscalers.

1.3 Problem formulation

The intent of this thesis is to develop a machine learning model that predicts the optimal number of instances in five-minute forecasts. An instance is a general term that could refer to either a virtual machine or a pod depending on the implementation. The predictions are based on the total number of transactions per second in the current and previous observations. A transaction in a telecom support system context is either a phone call, a text, or a multimedia messaging service (MMS) message. A dataset of historical loads is produced by collecting data from one of Ericsson's databases. The dataset is used to train and evaluate the predictive model. The thesis focuses on predictions and does not include an implementation for scaling systems.

Existing autoscaling techniques are often based on metrics such as CPU and memory utilization threshold levels and do not consider the start-up delay of instances. The expected result of this thesis is that the effectiveness of autoscaling can be improved by training a model with metrics provided by a telecom support system and forecasting the needed number of instances, thus answering the following research question:

Is it possible to predict the future number of instances required by a system based on the historical load?

Related work has proven potential for CPU-intensive systems, and this thesis evaluates whether the approach can be effective for telecom support systems. Problems to solve:

1. Find a suitable machine learning algorithm to predict the needed number of instances.
2. Optimize the chosen algorithm for the problem.
3. Evaluate the model by comparing it to reactive threshold autoscaling.

1.4 Motivation

The societal relevance of this thesis is related to the economics and efficiency of microservices in container orchestration environments. Leveraging physical computing resources using machine learning models in conjunction with custom metrics could potentially achieve lower costs as well as improved resource and energy efficiency. Maintaining unused resources consumes energy, leaving a negative footprint on the environment. Ericsson is a company with millions of customers, and slight increases in the efficiency of their services may provide a significant societal impact.

1.5 Milestones

M1 Investigate available autoscaling tools.
M2 Investigate suitable machine learning algorithms.
M3 Develop and train an ML model based on historical metrics.
M4 Evaluate the ML model on historical metrics.
M5 Compare the final model to a reactive threshold autoscaler.

1.6 Scope/Limitation

In this thesis, a machine learning model that predicts the optimal number of instances five minutes into the future is implemented. The assumed start-up delay for an instance is five minutes; predictions for other delays are not investigated. Scaling the system up and down and using real-time data for the predictions are out of the scope of this thesis, which limits the comparison with currently available autoscalers. The data used for training is simulated data provided by Ericsson. There is an abundance of machine learning algorithms to consider, therefore a literature study is performed to elicit the most suitable ones to examine in experiments.

Another limitation of the scope is that the load on a telecom support system can be measured with several metrics, but only total transactions per second are considered. As mentioned, a transaction can be either a phone call, a text, or MMS. Whether these transactions have a different impact and demand on the system is not considered. A system can consist of several applications, each handling different numbers and types of instances. This thesis targets a single application and the total number of transactions it receives. Basing the desired number of instances on the transactions per second demands that a load balancer is able to distribute the requests evenly among all instances.

1.7 Target group

The target groups of this thesis are software and DevOps engineers working with container orchestrators and application lifecycle management to fulfill and sustain service-level agreements (SLA) for quality attributes such as availability and performance. The thesis could also be of interest to researchers in applied machine learning for time series data.

1.8 Outline

Chapter 2 introduces the reader to container orchestration, autoscaling, and different approaches for these. Section 2.5 presents currently available autoscalers. The chapter also defines the terminology used for machine learning and extends upon basic concepts, followed by sections focused on the algorithms used in this thesis. Chapter 3 describes the methods used for the research and development of a model. Considerations about reliability and validity are also included, presenting assumptions and factors that might affect the result. Chapter 4 presents the structure of the implementation and includes snippets of code central to the project. Chapter 5 presents the results from the controlled experiments. Tables with evaluation scores are presented along with plots comparing predictions with the test set and threshold based scaling. Chapter 6 analyses and gives further meaning to the results presented in the previous chapter. Chapter 7 discusses whether the problem investigated in the thesis is solved, what the results indicate, and possible issues with the solution. Chapter 8 concludes the report and Section 8.1 suggests future work.

2 Theory

This chapter gives an introduction to containers, container orchestration, autoscaling, and machine learning concepts followed by background on artificial neural networks and time series which are used in this thesis project.

2.1 Docker

Docker is a tool and platform used by millions of developers to simplify the steps of building, sharing, and deploying software applications or microservices. It achieves this through an infrastructure-as-code approach where users can package their application into images with commands offered by the Docker API. Images can then be shared and used by others through services such as Docker Hub [5]. Packaging configurations and runtime behavior into images is beneficial because it allows users to create containers and run them in container runtimes compliant with the Open Container Initiative, such as containerd, regardless of the host operating system [6]. Another benefit is that containers are lightweight: they reduce the amount of overhead needed compared to virtual machines, as they share the host operating system, shown in Fig. 2.1. Finally, Docker is validated against the FIPS 140-2 security standard, and containers run in isolation, making them intrinsically secure [7].

Figure 2.1: Container and virtual machine structures.

2.2 Container orchestration

The development of container technologies like Docker, with which applications are packaged and deployed in containers, comes with challenges and benefits. Deploying applications or microservices in reproducible, ephemeral containers allows for complex setups of scalable and portable systems. The complexity can be managed with orchestrators such as Kubernetes (kubernetes.io) or Docker Swarm (docker.com). Orchestrators allow users to manage storage, configuration, scaling, lifecycle, and updates of containers through abstracted operations provided by a controller API [8]. Kubernetes supports a declarative language called YAML with which users can specify the desired state of a system. The orchestration engine then executes actions to achieve that state [9].

2.3 Autoscaling

When the load on a system increases, the system should scale the number of instances accordingly to sustain the desired availability of the services; likewise, it should scale down when the load decreases to avoid wasting resources. Instance is a general term and can refer to either a virtual machine (VM) or a pod. Scaling means to increase or decrease resources such as CPU, memory, or bandwidth. When this is performed automatically according to some defined rules it is known as autoscaling, which can be achieved in two ways: horizontally or vertically [10].

Horizontal scaling refers to the process of adding or removing instances. It requires that the traffic handled by the services is divisible, which is usually handled by a load balancer. Horizontal scaling is therefore often referred to as scaling in and out. Vertical scaling means increasing the capacity of the currently running instance and is referred to as scaling up and down. Vertical scaling generally implies an amount of downtime while upgrading the resource, therefore horizontal scaling may be preferred when the load on the system frequently changes [10]. An analogy would be that, in order to fit more people in a neighborhood, a vertical scaler would build another floor on an existing apartment building, interrupting the residents of the building during construction, while a horizontal scaler would build another apartment building.

Several providers offer services to scale containers automatically [11][12]. By providing a desired utilization level, the service monitors the containers and scales accordingly. This allows resource utilization levels to be kept at a certain threshold. Constant utilization is beneficial because it prevents the system from overutilizing resources, thus keeping latency low. It also prevents underutilization, which would result in higher operating costs [1].

2.4 Reactive and predictive autoscaling

Figure 2.2: Comparison between reactive and predictive autoscalers.

Threshold based autoscaling is a reactive approach. It reacts to the current metrics of the system and when a certain threshold is exceeded, resources are scaled. Predictive autoscaling attempts to predict future metrics and scales the system preemptively as shown

in Fig. 2.2. This is useful when there is a delay in the scaling process since it allows resources to be made available before demand occurs. It is also advantageous during fluctuating workloads because it can prevent unwanted scaling during short spikes in demand, where a threshold based autoscaler would oscillate in the number of resources [13].

2.5 Autoscalers in the market

This section introduces autoscaling techniques used by Kubernetes, Amazon Web Services (AWS), and Google Cloud Platform (GCP).

2.5.1 Kubernetes and the Horizontal Pod Autoscaler (KHPA)

The KHPA is implemented as a resource and a controller, which run in a control loop determined by a sync-period flag. The default sync-period is τ = 15 s, and each cycle the autoscaler collects data from one of two metrics APIs: the resource metrics API or the custom metrics API [14]. The resource metrics API provides relative CPU and memory utilization metrics for the current number of pods. The collected data is used as input for the autoscaling algorithm described by

r_{desired} = \left\lceil r_{current} \cdot \frac{\sum_{i=0}^{pods} U_i}{U_{target}} \right\rceil    (1)

[15]. The algorithm calculates the desired number of replicas, where r is the number of replicas and U is CPU utilization. The relative utilization is summed over all pods. The variable U is expressed either as a fraction of CPU utilization or in milliCPU (mCPU), where 100 mCPU equals 0.1 (10%) utilization [16].
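As an illustration, the following Python sketch (not taken from the Kubernetes source; the function and variable names are our own) transcribes equation (1) directly:

import math

def desired_replicas(current_replicas, pod_utilizations, target_utilization):
    # Sum the relative CPU utilization over all pods, e.g. 0.1 for 100 mCPU.
    total_utilization = sum(pod_utilizations)
    # Equation (1): scale the current replica count by the ratio between the
    # summed utilization and the target, rounding up to whole pods.
    return math.ceil(current_replicas * total_utilization / target_utilization)

# Example usage with four pods and a 75% utilization target.
print(desired_replicas(4, [0.20, 0.15, 0.25, 0.20], 0.75))  # 5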

2.5.2 Amazon EC2 Autoscaling

EC2 is a service offered by Amazon Web Services (AWS) designed to provide scalable computing capacity in the cloud. It provides the ability to configure, run, and scale applications in the AWS computing environment. The EC2 autoscaler offers scheduled, dynamic, and predictive scaling [11]. The scheduled autoscaler scales instances in or out based on a defined schedule. This allows the user to provision extra instances on days known to have a higher load and fewer instances on days with less. The dynamic scaler tracks a specified CloudWatch metric. CloudWatch is a monitoring service offered by AWS to track the health, performance, and logs of applications [17]. With the dynamic scaler, an alarm can be set to execute a scaling policy based on thresholds of the CloudWatch metric [18]. Predictive scaling is the final option for the AWS autoscaler; it uses machine learning to analyze historical data and make forecasts for the future load. Together with the dynamic scaler, it forms a scaling strategy that can be optimized for cost, availability, or balance, corresponding to 70%, 40%, and 50% CPU utilization respectively. Custom metrics can also be used [19][20]. A limitation of the prediction is that it requires at least 14 days of historical data. The forecast then covers two days [21].

2.5.3 Google Compute Engine Autoscaling

Google Cloud Platform (GCP) offers the service Compute Engine, a hosting and computing service that allows users to create and manage virtual machines (VMs) on Google's platform. The service offers autoscaling for managed instance groups (MIGs) to scale VMs out or in based on demand. Predictive autoscaling can be enabled for the MIGs to forecast the future load based on historical data [12]. A limitation of the tool is that it requires three days of historical data before predictions are enabled. Until then, reactive scaling is performed on real-time data. Another limitation is that it only supports CPU utilization as the scaling metric [12].

2.6 Machine learning

This section gives an introduction to the basic concepts of machine learning and introduces artificial neural networks, which are a central point of this thesis. Section 2.6.1 and Section 2.6.2 are based on [22] unless stated otherwise.

2.6.1 Overview and terminology

Machine learning (ML) is divided into unsupervised and supervised learning. In supervised learning, each observation has a set of features and labels. An example of supervised learning is image recognition where images should be classified as cats or dogs. The features would be the color of each pixel, and the label would be cat or dog. By training on a set of images with known labels, the model eventually becomes able to receive an unseen, unlabeled image, compare it with previous images, and determine whether it has the most in common with images in the cat or the dog class. Supervised learning also encompasses regression problems, like predicting the price of a house based on features such as square footage, room count, and year built. In regression problems, a numeric value is to be predicted. Unsupervised learning is a type of learning that has no labels in the training set but tries to recognize patterns. This thesis focuses on supervised learning, hence unsupervised learning is not explained further.

In order to understand certain machine learning concepts, some terminology must be defined. The previous paragraph mentions the term observation, which could be one image in a set of images, or one house in the regression example; each observation has a number of features and labels. When a machine learning model is trained, a set of observations is needed. The set can be divided into a training set, a validation set, and a test set. The training set is used to train the model: the features and labels are passed through an ML algorithm and a model is trained. When the model is trained, an unseen set of observations can be input to produce predicted labels for the observations.

The errors produced during training, validation, and testing are known as the training error, validation error, and test error (generalization error) respectively. Large differences in errors between training and testing are indicative of an overfitted model. An overfitted model tends to have a low number of training errors and a large number of test errors, meaning that it is likely to mislabel unseen data. If the model produces a large number of errors for both training and test data, the model is likely underfitted, meaning that it fails to account for the complexity of the data and instead returns a general answer. The key is to find a balanced model that generalizes well enough to predict unseen data, yet is specific enough to predict the labels accurately.

2.6.2 Validation

A model is trained on a training set and validated on a validation set. The reason for dividing the dataset into a training and a validation set is to avoid overfitting. With this approach, the model is constrained from seeing the test data while tuning hyperparameters, which are explained in Section 2.6.3. The scores on the validation set are compared and the highest scoring model, according to some specified criterion, is chosen. Another technique to avoid overfitting the model is to use regularization techniques such as Lasso (L1) and Ridge (L2) regularization, which add penalties to the loss function for weights of less important features [23]. A loss function calculates the error of a model by comparing the output of the model to the desired output. The returned value can be based on measures such as mean absolute error or mean squared error, explained in Section 2.6.6, and is often referred to as the loss value, or loss.

Dividing a dataset into a training set, validation set, and test set can be problematic for smaller sets. The reason is that results on a small test set give a low degree of statistical certainty. To combat the problem, cross-validation techniques can be used, such as k-fold validation. The method shuffles the dataset, splits it into k equally large folds, and repeats training and validation k times on increments of the folds as shown in Fig. 2.3, where k = 5. The score from each iteration is averaged, making it possible to use the entire training set for both training and validation.

Figure 2.3: 5-fold validation.
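As a minimal sketch of the procedure in Fig. 2.3, assuming scikit-learn is available (the toy data and the choice of a ridge regressor are purely illustrative, not part of this thesis):

import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import Ridge

# Toy regression data; in practice the real dataset would be used.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

# Shuffle and split into k = 5 folds, train on four folds and validate on the
# fifth, then average the validation scores over all five iterations.
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(Ridge(), X, y, cv=kfold, scoring='neg_mean_absolute_error')
print(scores.mean())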

2.6.3 Model selection

When selecting a model for a dataset, it is necessary to make reasonable assumptions about the data. It may be the case that a polynomial model is more suited to a problem than a neural network, and vice versa. There are no guarantees that one model will be more suited for a given dataset, and the only way to establish the most suitable model is to evaluate all of them, which is infeasible. This is known as the no free lunch theorem. A practical approach to selecting a model is therefore to pick a few that are reasonable for the problem and validate them with a validation set. During validation, a set of parameters, known as hyperparameters, can be tuned to achieve the optimal generalization for the model and avoid overfitting. Tuning hyperparameters is useful because it can substantiate the decision to choose one model over another when evaluating the validation set predictions. Hyperparameters differ depending on the model but can be variables like the number of epochs, learning rate, batch size, and optimizer for neural networks, or tree depth and the number of estimators in a random forest model [23].

To select the best values for the hyperparameters of a model, a grid search can be performed, which is a technique to exhaustively test every supplied combination of hyperparameter values and calculate the loss value for each combination. The technique entails extreme computational overhead, and for large sets of combinations it is more reasonable to use a randomized search that tests a random selection of the parameters in a given interval [22].
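The following sketch contrasts the two search strategies using scikit-learn; the estimator and parameter values are arbitrary examples, not the configuration used in this thesis (the grid search variant is the one applied later, see Fig. 4.11):

from scipy.stats import loguniform
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.neural_network import MLPRegressor

# Exhaustive search: every combination of the supplied values is evaluated.
grid = GridSearchCV(MLPRegressor(max_iter=1000),
                    param_grid={'hidden_layer_sizes': [(10,), (50,), (100,)],
                                'learning_rate_init': [0.001, 0.0001]},
                    scoring='neg_mean_absolute_error')

# Randomized search: a fixed number of combinations is sampled from the
# supplied distributions, which scales better for large parameter spaces.
rand = RandomizedSearchCV(MLPRegressor(max_iter=1000),
                          param_distributions={'hidden_layer_sizes': [(10,), (50,), (100,)],
                                               'learning_rate_init': loguniform(1e-5, 1e-2)},
                          n_iter=10, scoring='neg_mean_absolute_error')

# Both objects are then fitted with grid.fit(X_train, y_train) / rand.fit(...).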

2.6.4 Feature selection

It is essential that features correlate with the labels when training a machine learning model. Large sets of features do not guarantee good predictions, and some of the features may adversely affect the model. Utilizing a large number of features also slows down the training. Therefore, it is important to determine which features are predictive. One way to single out important features is to remove features with low variance: if all observations have the same value for a specific feature, it serves no purpose for the predictions. The ideal solution is to train the algorithm with all potential combinations of features. However, the number of combinations grows exponentially with the number of available features, which often makes exhaustive training infeasible [23]. Greedy methods can be used to lower the number of combinations. One such method is sequential feature selection, with which a model is trained with each feature individually and the feature resulting in the highest score is chosen. The chosen feature is then trained along with each of the other features, one at a time, and the combination with the highest score is selected; this is repeated until the desired number of features is reached. The same method can be performed in reverse by training the model with all features except one, and sequentially removing one feature at a time [24]. Backward feature selection is also known as recursive feature elimination (RFE) [25].
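Both greedy strategies are available in scikit-learn (version 0.24 or later is assumed for SequentialFeatureSelector); the following sketch with toy data illustrates them and is not part of the thesis implementation:

from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE, SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Toy data with 10 candidate features of which only a few are informative.
X, y = make_regression(n_samples=200, n_features=10, n_informative=3, random_state=0)

# Forward sequential selection: greedily add the feature that improves the
# cross-validated score the most until three features are selected.
sfs = SequentialFeatureSelector(LinearRegression(), n_features_to_select=3,
                                direction='forward').fit(X, y)

# Recursive feature elimination: start from all features and repeatedly drop
# the least important one until three remain.
rfe = RFE(LinearRegression(), n_features_to_select=3).fit(X, y)

print(sfs.get_support(), rfe.support_)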

2.6.5 Feature scaling

Some machine learning algorithms are sensitive to large differences in the scale of features. A model trained with large differences in scale among the features is often sensitive to input and may produce unreliable results and large generalization errors because large values have a higher impact (weight) on the model. Scaling the population with techniques such as the z-score avoids such issues. The standardization method z-score accelerates convergence in the case of gradient descent or algorithms utilizing a weighted sum, such as artificial neural networks. The z-score scales the population to have a mean of zero and a standard deviation of one, assuming the population fits a Gaussian distribution [25]. An example is if house prices (SEK) are represented along with the house area (m^2), where the values of the prices are larger than the values of the area. This could be visualized as

3000000 150 E = 2500000 175 . (2) 3200000 200

Fitting of a population to a standard distribution with z-score is described by

z = \frac{E_i - \mu}{\sigma}    (3)

where σ is the standard deviation (std) and µ is the mean of the population. Each value of E has the column mean subtracted from it and is then divided by the column std. The transformation applied to E in equation (2) is

z(E) = \begin{bmatrix} 0.33968311 & -1.22474487 \\ -1.35873244 & 0 \\ 1.01904933 & 1.22474487 \end{bmatrix} .    (4)
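The transformation in equations (2)-(4) can, for example, be reproduced with a few lines of Python:

import numpy as np
from sklearn.preprocessing import StandardScaler

# The example population E from equation (2): house prices (SEK) and areas (m^2).
E = np.array([[3000000.0, 150.0],
              [2500000.0, 175.0],
              [3200000.0, 200.0]])

# Z-score standardization per column, as in equation (3).
z_manual = (E - E.mean(axis=0)) / E.std(axis=0)

# StandardScaler applies the same transformation.
z_sklearn = StandardScaler().fit_transform(E)

print(np.allclose(z_manual, z_sklearn))  # True
print(z_manual)                          # matches equation (4)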

2.6.6 Model evaluation

Evaluation of a model can be performed using several different metrics. There is no single metric that is optimal for all scenarios, and usually a combination of metrics gives a broader perspective. Since this thesis models a regression problem, metrics used for regression are explained. One evaluation metric used for regression models is the mean squared error (MSE), defined by

MSE(y, \hat{y}) = \frac{1}{n_{samples}} \sum_{i=0}^{n_{samples}-1} (y_i - \hat{y}_i)^2    (5)

where \hat{y}_i is the predicted value of the i-th observation and y_i is the true value of the same observation [23]. By calculating the difference between the predicted and the true value for each observation and taking the mean of the squared differences, the MSE is obtained. A variant of the MSE is the root mean squared error (RMSE), which is the square root of the MSE. A second metric used to evaluate a regression model is the mean absolute error (MAE), defined by

MAE(y, \hat{y}) = \frac{1}{n_{samples}} \sum_{i=0}^{n_{samples}-1} |y_i - \hat{y}_i| .    (6)

The equation is similar to the MSE; however, instead of the square of the differences it uses the absolute value. The mean absolute error is less sensitive to outliers than the MSE, since the prediction error is squared in the MSE, making large errors weigh far more heavily than small errors. A variant of the metric is the median absolute error, which takes the median instead of the mean [26]. Explained variance (EV) is another metric, defined by

EV(y, \hat{y}) = 1 - \frac{var(y - \hat{y})}{var(y)}    (7)

where var(·) is the variance, i.e. the square of the standard deviation. If \hat{y} and y are identical, the difference in the numerator is a vector of zeros, resulting in a variance of 0, which leads to an explained variance score of 1, the highest score possible [26]. The metric called R^2 is similar to the explained variance and is defined by

R^2(y, \hat{y}) = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} .    (8)

The variable \bar{y} is the mean of all true values of y. In the numerator, the sum of all squared differences between the true values y_i and the predicted values \hat{y}_i is calculated. In the denominator, the sum of all squared differences between the true values y_i and the mean \bar{y} is calculated. The smaller the numerator is, the larger the R^2 score becomes [26]. The last metric explained is the max error, which is the largest absolute difference between a true value y_i and the predicted value \hat{y}_i.
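All of these metrics are available in scikit-learn; the following sketch evaluates a pair of hypothetical true and predicted instance counts (the numbers are illustrative only):

import numpy as np
from sklearn.metrics import (mean_squared_error, mean_absolute_error,
                             median_absolute_error, explained_variance_score,
                             r2_score, max_error)

# Hypothetical true and predicted instance counts.
y_true = np.array([20.0, 22.0, 25.0, 24.0, 30.0])
y_pred = np.array([21.0, 22.0, 23.0, 26.0, 29.0])

print('MSE  :', mean_squared_error(y_true, y_pred))           # equation (5)
print('RMSE :', np.sqrt(mean_squared_error(y_true, y_pred)))
print('MAE  :', mean_absolute_error(y_true, y_pred))          # equation (6)
print('MedAE:', median_absolute_error(y_true, y_pred))
print('EV   :', explained_variance_score(y_true, y_pred))     # equation (7)
print('R2   :', r2_score(y_true, y_pred))                     # equation (8)
print('Max  :', max_error(y_true, y_pred))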

2.6.7 Artificial neural networks

Artificial neural networks (ANN) aim to model artificial representations of neurons in the human brain to achieve intelligence. By forming complex structures, a network of neurons can be trained to classify an image, recognize speech, or play a game of chess. The versatility of a neural network allows it to perform a wide array of tasks, and it often outperforms other machine learning algorithms [22]. One of the simplest ANN units consists of an input and an output layer that form what is known as a perceptron. In a perceptron, input values are combined with weights, and an introduced bias feature, typically one, is used to calculate a weighted sum of the inputs. The weighted sum is passed through an activation function to produce an output, as shown in Fig. 2.4 [22].

Figure 2.4: Two input variables are fed into an activation function.

An activation function is responsible for calculating the output of a model given a set of inputs and is described by

h_{W,b}(X) = \phi(XW + b)    (9)

where X is an input matrix with rows representing observations and columns representing features. In the equation, W is a matrix of weights for X, b is a matrix of weights for the bias input feature(s), and \phi represents the activation function [22]. Common activation functions include the sigmoid function, defined by

\phi(z) = \frac{1}{1 + e^{-z}}    (10)

and the ReLU function, defined by

\phi(x) = \max\{0, x\} = \begin{cases} 0, & x < 0 \\ x, & x \geq 0. \end{cases}    (11)

Perceptrons learn by shifting the weights associated with the neurons of the network when a prediction is erroneous. This is achieved with the perceptron learning rule, described by

w_{i,j}^{(t+1)} = w_{i,j}^{(t)} + \eta(y_j - \hat{y}_j) x_i    (12)

where i and j index the input and output neurons for the preceding weight (w) and input matrix (x). The learning rate is denoted by η, y_j is the target output and \hat{y}_j is the output for the current instance [22].

A single perceptron fails to model problems such as exclusive-or (XOR), since the input values of an XOR function cannot be linearly separated by a single line. This is not an issue for multilayer perceptrons (MLP). An MLP is produced by composing perceptrons in layers. Any layer in between an input layer and an output layer is called a hidden layer. An ANN with multiple layers is often referred to as a deep neural network (DNN) [23].

MLPs use an effective technique to perform gradient descent, called feed forward and backpropagation. It is based on cycling a batch of inputs forward and then back through the network; one cycle is called an epoch. By passing the inputs forward (feed forward) through the network, a loss value can be calculated. The loss value for each epoch is compared against a desired value, and the chain rule [23] is applied during the backward pass (backpropagation) to calculate how much each neuron contributed to the loss value. Finally, the weights are adjusted to minimize the loss value, and the next step of the gradient descent is performed [22].
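As an illustration of the perceptron learning rule in equation (12), the following sketch (our own, not part of the thesis implementation) trains a single perceptron with a step activation on the linearly separable AND problem:

import numpy as np

def train_perceptron(X, y, learning_rate=0.1, epochs=10):
    # Prepend a bias feature of 1 to every observation.
    X = np.hstack([np.ones((X.shape[0], 1)), X])
    weights = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            # Step activation: output 1 if the weighted sum is non-negative.
            y_hat = 1 if x_i @ weights >= 0 else 0
            # Equation (12): shift the weights in proportion to the error.
            weights += learning_rate * (y_i - y_hat) * x_i
    return weights

# Logical AND is linearly separable, so a single perceptron can learn it;
# XOR is not, and this training loop would never converge for it.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])
w = train_perceptron(X, y)
print([(1 if np.r_[1, x] @ w >= 0 else 0) for x in X])  # [0, 0, 0, 1]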

2.7 Time series data

For some machine learning problems, the order of observations does not matter: each observation has an independent set of features and one or more labels. A model is trained on a set of observations and can thereby predict the label of a new observation by receiving the features of that observation. In time series, however, the temporal order of the observations matters. The reason is the time dimension, which assumes that each observation depends on previous observations [27]. Time series prediction can be treated as a supervised learning problem by concatenating the features of previous observations and using a future observation as the label. A standard term for a previous observation in time series forecasting is lag. The current observation is defined as t, and a lag of two refers to observation t − 2 [27]. In Table 2.1, the temperatures for seven consecutive days are shifted and concatenated into four columns. In a training set, the column t + 1 is used as the label and the other three columns as features.

t − 2   t − 1   t      t + 1
NaN     NaN     NaN    20.7
NaN     NaN     20.7   17.9
NaN     20.7    17.9   18.8
20.7    17.9    18.8   14.6
17.9    18.8    14.6   15.8
18.8    14.6    15.8   13.3
14.6    15.8    13.3   12.1

Table 2.1: Temperature values are shifted and concatenated to create features.

The concept is known as a sliding window. The size of the sliding window defines the number of lags that are used as features. A window of size three uses t, t − 1, and t − 2 to predict the following observation t + 1, as illustrated in Table 2.1. In this case, the three lags are considered features and the following observation t + 1 is considered the label to be predicted. As seen in Table 2.1, the first three rows lack values (NaN). When shifting the data, it is necessary to remove all such affected rows to avoid faulty training data [27]. If the label column is shifted more than one step, as shown in Table 2.2 where the column t + 1 is shifted two more steps resulting in t + 3, the model becomes able to predict values further into the future. The series of temperatures 17.9, 18.8, and 14.6 will, according to the table, be followed by the value 12.1 three observations ahead. Since more rows containing NaN values must then be removed, as seen in Table 2.2, it is necessary to have a large dataset available for training and validation [27]. A sketch of this shifting is shown after Table 2.2.

t − 2   t − 1   t      t + 3
20.7    17.9    18.8   13.3
17.9    18.8    14.6   12.1
18.8    14.6    15.8   NaN
14.6    15.8    13.3   NaN

Table 2.2: The column on the far right is shifted two more steps in order to predict further into the future.
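The shifting illustrated in Tables 2.1 and 2.2 can be sketched with pandas as follows (the column names are chosen for illustration):

import pandas as pd

# Daily temperatures as in Table 2.1.
temps = pd.Series([20.7, 17.9, 18.8, 14.6, 15.8, 13.3, 12.1])

# Shift the series to build lag features and a label one step ahead.
frame = pd.DataFrame({'t-2': temps.shift(2),
                      't-1': temps.shift(1),
                      't': temps,
                      't+1': temps.shift(-1)})

# Rows containing NaN values cannot be used for training and are dropped.
print(frame.dropna())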

When validating a model trained on time series data, it is infeasible to use k-fold validation as explained in Section 2.6.2 unless it preserves the order of the data. If observations are shuffled, the model might train on future values and predict past values. However, a k-fold without shuffling of observations, which iterates through the dataset as illustrated in Fig. 2.5, does not violate the temporal order of the data. In the first iteration, the first fold is used as the training set and the second fold as the validation set. In the second iteration, the first two folds are used as the training set and the third as the validation set, and so on. One disadvantage of the approach is that the earlier iterations have smaller training sets than the later ones, which may affect the averaged score [28].

Figure 2.5: 5-fold validation for time series.

To determine predictive features for time series data, it is necessary that observations correlate with previous, lagged observations. If the correlation is statistically significant, it shows that future values depend on previous values. The correlation between time lags is known as autocorrelation [27]. In terms of predictions, autocorrelation does not imply causation. Previous observations are however valuable for detecting trends and seasonality in a time series and can be considered Granger-causal if they improve the forecast [29]. Time series data can be decomposed into three distinct components: trend, seasonality, and noise. A trend in a time series is the linear inclination of the series and is either increasing or decreasing; a series can also exhibit no trend. Seasonality refers to repeating patterns, for example a sine wave with static amplitude. Noise is the variability that remains in the data after the previously mentioned components are removed [27].

3 Method

This chapter presents the methods used to answer the problems stated in Chapter 1 and refers to the milestones elicited in Section 1.5.

3.1 Research Project

For milestones M1 and M2, a literature review is performed. Available container tools and autoscalers in the market are investigated to provide context for the subject area. The literature review determines suitable machine learning algorithms to investigate in the project. It is motivated by the fact that there is a considerable number of algorithms available and the time scope is limited. By investigating previous work in the area of time series forecasting, it is possible to elicit a suitable strategy for the problems.

Milestones M3, M4, and M5 are achieved through controlled experiments [30] where hyperparameters for an MLP regressor are changed one at a time. The models investigated are systematically tested on a preprocessed dataset with equal prerequisites. Performing controlled experiments is suitable since it increases the reliability of the comparisons by altering one factor at a time. The dependent variable is the evaluation score mean absolute error; the independent variables are window size, number of epochs, number of hidden layers, number of neurons in each layer, and learning rate. The hyperparameters and window size with the lowest evaluation score are determined, and the model is evaluated on a test set. The goal of the controlled experiments is to produce a model that predicts the demanded number of instances five minutes ahead as accurately as possible. The outcome of the model is then compared with how a reactive autoscaler would have scaled the instances to determine whether an advantage is achieved.

3.2 Method

This section sequentially explains the process of developing and evaluating the predictive model. Initially, Section 3.2.1 presents the methods of the literature review that is performed on autoscaling and machine learning to cover the milestones M1 and M2 described in Section 1.5. The remainder of Section 3.2 describes the controlled experiment that covers M3, M4, and M5.

3.2.1 Literature review

To perform the literature review on autoscaling and machine learning, a series of questions are defined such that the answers should provide enough background to understand the context of the problems that the project attempts to address. The literature review is divided into two main objectives, which are intended to cover the milestones M1 and M2:

1. Provide background to containers, scaling techniques, and autoscaling.
2. Provide background to machine learning concepts, model development, and evaluation, and propose a suitable ML algorithm for forecasting.

The questions to answer objective 1 are the following:

1.1 What is a container, how are they created, and with what tools?
1.2 What techniques are used to manage container lifecycles?
1.3 How are containers scaled?
1.4 Do services in the industry offer predictive autoscaling and, if so, what are the limitations?

The questions to answer objective 2 are the following:

2.1 What is supervised learning, validation, feature selection, feature scaling, and model evaluation?
2.2 How can a time series problem be modeled as a supervised learning problem?
2.3 What algorithms have been successful for similar projects in previous research?

In order to answer the questions, common search engines are used along with platforms such as ResearchGate and other online libraries where relevant information, books, and their standard references are found. Ericsson's previous implementations and documentation are also reviewed. The following search terms are used to find answers to the questions: docker container, container orchestration tools, Kubernetes, autoscaling, horizontal vs. vertical scaling, predictive autoscaling, reactive autoscaling, threshold based autoscaling, machine learning concepts, artificial neural networks, multilayered perceptrons, MLP regression, autoregression, recursive feature elimination, time series forecasting, cross-validation for time series.

The strategy used to include or exclude information from the searches is that results are evaluated based on whether they answer the question related to the search. If a result is not a scientific publication, a book, or official documentation, it is excluded.

3.2.2 Controlled experiment

To perform the controlled experiment, a number of subtasks are defined. The tasks are executed in numerical order to receive a valid and reproducible outcome of the experiment. In order to fulfill milestone M3, the following tasks are performed:

1. Preprocess the dataset.
2. Determine lag correlation.
3. Scale the features.
4. Tune hyperparameters using cross-validation.

Milestones M3 and M4 are intertwined since the model is evaluated twice: once during hyperparameter tuning and a second time during evaluation on a test set. Milestone M4 can thus be divided into the following tasks:

1. Retrain the model with the selected hyperparameters.
2. Calculate evaluation metrics based on the test set.
3. Visualize a comparison of true values and predictions.

To realize milestone M5, the following tasks are performed:

1. Visualize how a reactive autoscaler would scale compared to the predictive model.
2. Evaluate the comparison, identify potential issues, and outline future steps to consider.

The following sections describe how the tasks are realized in further detail.

3.2.3 Data preprocessing

The transactions per second (tps) and CPU utilization data are retrieved from a database that Ericsson uses to store historical load metrics from a simulation of the telecom support system. This is achieved by connecting a client to the database over HTTP. The client queries data containing the mean values for intervals of 10 seconds between the two timestamps '2021-01-22 11:00:00' and '2021-02-04 10:59:50'. A Grafana monitoring instance is connected to the database, and the data is examined prior to interval selection to avoid periods of simulation downtime. There are multiple categories of tps measures stored in the database, such as texts, phone calls, and MMS. The sum of all of them is calculated to store the total number of tps in a dataframe. The system monitored consists of multiple applications; one is selected and the rest are filtered out due to the scope limitation. The average CPU utilization every 10 seconds for the chosen application is stored in a dataframe and is merged with the tps dataframe. The columns of the tps and CPU utilization matrix are visualized as:

Dataset = \begin{bmatrix} tps & CPU_{utilization} \end{bmatrix} .    (13)

The number of tps is used to produce the features, and the labels are set to the desired number of instances. The application analyzed constantly runs on 24 instances and the desired CPU utilization is set to 75%. The average number of tps over all observations with a CPU utilization of 75 ± 2.5% is calculated and divided by the number of instances (24). This results in the average number of tps one instance can handle at a load of around 75%. The desired number of instances is then calculated based on the tps for each observation. The CPU column in the matrix shown in equation (13) is removed and replaced by the number of instances according to

num\_instances = \left\lceil \frac{tps}{target\_tps\_per\_instance} \right\rceil    (14)

where tps is the current number of tps and target_tps_per_instance is the average number of tps one instance can handle at a load of 75%. The quotient is rounded up since an instance cannot be a decimal number, and a surplus is prioritized over a lacking number of instances. The dataset then contains the columns:

Dataset = \begin{bmatrix} tps & num\_instances \end{bmatrix} .    (15)

Since the interval between observations is 10 seconds, five minutes equal 30 steps. By shifting the column containing the number of instances 30 steps, the number of tps at each observation is paired with the number of instances needed five minutes into the future. When the data is shifted 30 steps, the last 30 observations lack values in the second column, and the last 30 rows are therefore removed. The final dataframe is stored in a pickle file on the filesystem, which allows training to be performed without the need to repeatedly fetch data with the database client.
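A simplified sketch of this preprocessing step is shown below; the column names and the helper function are our own, and the actual implementation is described in Chapter 4:

import numpy as np
import pandas as pd

def build_labels(df, target_tps_per_instance, horizon=30):
    # Sketch of the label construction described above; df has a 'tps' column.
    data = df.copy()
    # Equation (14): round up, since a fractional instance is not possible
    # and a surplus is preferred over a shortage.
    data['num_instances'] = np.ceil(data['tps'] / target_tps_per_instance)
    # Shift the label 30 observations (5 minutes at 10-second intervals) so that
    # each row pairs the current tps with the instances needed 5 minutes ahead.
    data['num_instances'] = data['num_instances'].shift(-horizon)
    # The last 30 rows have no future label and are removed.
    return data.dropna(subset=['num_instances'])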

3.2.4 Feature selection and scaling

As described in Section 2.7, it is essential that time series features strongly correlate with lagged values of said feature. In order to select features for time series data, autocorrelation is used to determine the correlation between t and the lagged values t − n. To investigate the correlation over a larger time lag, the autocorrelation function described in [31] is plotted for 24 hours of observations. Neural networks are sensitive to differences in the scale of features. Therefore the z-score, described in Section 2.6.5 by equation (3), is applied to the dataset before training. The scaling technique described in [32] implements the z-score equation that is used to transform the dataset.
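One way to produce such an autocorrelation plot is with statsmodels; the series below is a synthetic stand-in for the real tps data, which is not reproduced here:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf

# Illustrative stand-in for the tps series: a daily pattern (8640 observations
# at 10-second intervals) with added noise.
t = np.arange(3 * 8640)
tps = pd.Series(100 + 50 * np.sin(2 * np.pi * t / 8640)
                + np.random.normal(0, 5, t.size))

# Plot the autocorrelation for up to 24 hours (8640 lags) of observations.
plot_acf(tps, lags=8640)
plt.show()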

3.2.5 Cross-validation with time series split

Prior to training, the dataset is split into an 80/20% training and test set. Time series data is sensitive to the temporal order of the splits and therefore the data is not shuffled for the splits. The 80% training set is further divided into time series splits as described in Section 2.7. Five splits are used, where the model is trained on 1/6 increments of the total training size and validated on the following 1/6 of the set, as shown in Fig. 3.6.

Figure 3.6: Training, test split with subsequent time series split.
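A sketch of the split illustrated in Fig. 3.6, using arbitrary placeholder data; an object like the tscv referenced later in Fig. 4.11 is presumably created this way with scikit-learn's TimeSeriesSplit:

import numpy as np
from sklearn.model_selection import TimeSeriesSplit, train_test_split

# Illustrative feature matrix and label vector in temporal order.
X = np.arange(1200).reshape(600, 2)
y = np.arange(600)

# Hold out the last 20% as a test set without shuffling, preserving time order.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)

# Five expanding splits over the training set: train on the first 1/6 and
# validate on the next 1/6, then train on 2/6 and validate on the next 1/6, etc.
tscv = TimeSeriesSplit(n_splits=5)
for train_idx, val_idx in tscv.split(X_train):
    print(train_idx[0], train_idx[-1], val_idx[0], val_idx[-1])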

3.2.6 Hyperparameter tuning

Before predicting the test set, hyperparameters are tuned to avoid overfitting the model, as described in Section 2.6.3. Parameters are tested during the time series cross-validation and evaluated by examining the averaged evaluation metrics. To find the optimal combination of parameters, grid search, introduced in Section 2.6.3, is used to exhaustively test every combination of supplied parameters during training. Mean absolute error is the metric used to determine the most predictive model. The cross-validation procedure is repeated for all configurations of the MLP regressor and the average scores are evaluated. The hyperparameters in Table 3.3 are used for tuning the MLP regressor. The number of neurons is the same across all hidden layers.

Hyperparameter   Value(s)
Epochs           500, 750, 1000
Hidden layers    2, 3
Neurons          10, 50, 100
Learning rate    0.001, 0.0001, 0.0002
Window size      5, 500, 5000

Table 3.3: List of hyperparameters for the MLP regressor.

The average MAEs of all models are compared and the top scoring model is chosen for further evaluation.

3.2.7 Model evaluation

Once the grid search elicits the optimal hyperparameters according to MAE, another cross-validation is performed in order to calculate other evaluation scores. For every iteration of the split, the model predicts the validation set and saves the predictions and true labels into lists. Furthermore, the MAE loss value, described by equation (6), is printed at the end of every iteration. The saved lists of predictions and true labels are used to average the evaluation metrics R^2, max error, explained variance, MAE, median absolute error, and RMSE introduced in Section 2.6.6. Finally, the model is retrained on 80% of the dataset, and the remaining 20% test set is used for predictions. Evaluation metrics are calculated based on the true values in the test set and the predicted number of instances based on features from the test set. The regressor provides decimal-valued predictions, which are invalid as numbers of instances. After the evaluation scores are calculated, the predictions are therefore converted to integers by rounding to the nearest integer.

3.2.8 Predictive autoscaling evaluation

After evaluating how the model predicts on the test set, the predictions are compared with the nonshifted data. The nonshifted data represents how a reactive threshold based autoscaler would scale. The comparison is made by plotting the predictions against the nonshifted data to visualize whether the model accurately forecasts five minutes (30 steps) ahead.

3.3 Reliability and Validity

A number of assumptions and choices are made in this thesis based on requirements from Ericsson for the particular application. The desired CPU utilization is, for example, assumed to be 75%, which is not the case for all scenarios. The output of the developed model might therefore not be accurate for a live environment if the desired number of instances is calculated using a different threshold.

During preprocessing, the desired number of instances is calculated using tps, resulting in decimal values. The predictions also result in decimal values. The number of instances cannot be a decimal value, therefore the values are rounded either up, down, or to the nearest integer. This has implications for the efficiency of the model depending on what resource usage strategy is employed: rounding down favors resource efficiency while rounding up favors a resource surplus. In this implementation, rounding up is used when

calculating the desired number of instances, and rounding to the nearest integer is used for the predictions.

The application used to retrieve the dataset of historical data constantly runs on 24 instances. As a result, it is not possible to mimic a threshold based scaler that scales up when the CPU utilization exceeds 75%. The method used instead, which calculates the average tps per instance when the CPU utilization is around 75%, presumes that CPU utilization correlates with the total tps. If this is not the case, the desired number of instances used as labels in this thesis might not correspond to the number of instances a threshold based autoscaler would use.

The load metric used is the total number of tps. This assumes that all transactions are equally demanding, but also that tps is the only metric affecting the CPU load. Metrics used to measure the load on the system in a live environment would typically also include the response time of the system, network bandwidth, and memory load. It is possible that the predictions would be more precise if such values could be collected and included as features. Since the data retrieved from Ericsson is simulated data that mimics user behavior, it might be a threat to the validity of the ML model: the accuracy of the model in a live environment will suffer if the simulation is not a perfect reflection of user behavior.

3.4 Ethical considerations

The thesis project does not have any ethical considerations. If real user data were used for the transactions per second, it would be necessary to secure the anonymity of the telecom users. However, to avoid this issue entirely, Ericsson has decided to provide a simulation of user activity. The development of the simulation may involve ethical considerations when exploring the user data; this is, however, not relevant for this thesis.

4 Implementation

Figure 4.7: Overview of implementation details.

The implementation of this thesis project is illustrated in Fig. 4.7. A database exists with measurements of tps and CPU utilization. Grafana instances visualize what is stored in the database. To retrieve and store measurements, the implementation of the project includes a connection to the database. The main steps of the implementation are summarized as subtasks in Fig. 4.7. The first step is to preprocess the data to generate the desired features and labels and standardize the data. The following step is to set aside a test set that is left unused until a final model is elected. Grid search is used on the training set to select the model configuration that produces the most accurate predictions based on mean absolute error. The model is retrained on the entire training set and predictions are made on the test set. Evaluation scores are calculated based on the predictions, and graphs are plotted both to evaluate the model visually and to compare it with reactive threshold based autoscaling. The following subsections explain some vital code from the implementation.

4.1 Data preprocessing

In order to connect to the database, a client is created with hostname, port, username, and password parameters. Details regarding the implementation of the database client are excluded.

def averageTpsPerInstance(df_measurements, dvName, load):
    # Select the total tps column L(t) and the CPU utilization column of the
    # chosen application (dvName).
    frame = df_measurements[['L(t)', dvName + '_cpu']]
    avg = 0
    count = 0

    # Average the tps over all observations whose CPU utilization lies within
    # +-2.5 percentage points of the desired load (75%).
    for row in frame.itertuples():
        if (row[2] > (load - 2.5)) and (row[2] < (load + 2.5)):
            avg += row[1]
            count += 1
    avg = avg / count

    # Divide by the constant number of running instances to obtain the average
    # tps that one instance handles at the desired load.
    avg_tps_instance = avg / current_num_instances

    return avg_tps_instance

Figure 4.8: Method for calculating average tps per instance at 75% CPU utilization.

After retrieving measurements for tps and CPU utilization from the database for a certain time interval, the measurements are used to calculate the desired number of instances for each observation. In Fig. 4.8, the variable frame is assigned two columns, one for tps and one for the CPU utilization of the selected application. The variable load represents the desired CPU utilization, which is a constant 75% in this thesis. In the loop, each tuple is iterated over, and if the CPU utilization column has a value of 75 ± 2.5, the tps value is added to the variable avg, which is later divided by the count of observations with a CPU utilization in the interval. The selected application runs on a constant number of instances stored in the variable current_num_instances. The average tps for the application, avg, is divided by current_num_instances to get the average number of transactions that one instance can process at a CPU utilization of around 75%.

The measurements received from the database are stored in a dataframe containing the two columns for tps and CPU utilization with a timestamp as index. A third column, num_instances, is calculated by dividing each value of tps by the value returned from the method in Fig. 4.8 to retrieve the desired number of instances. The CPU utilization column is removed after use. The resulting dataframe is stored in a pickle file using pandas' built-in method for dataframes. The pickle containing L(t) and num_instances is loaded and processed separately from the Python file that retrieves the data from the database.

import pandas as pd

def window_size(dataset, lags):
    copy = dataset.copy()
    copy = copy.drop(columns=['num_instances'])

    # Concatenate L(t) with shifted versions of itself to create the lag
    # features L(t-0), L(t-1), ..., L(t-lags).
    data = pd.concat([copy.shift(i) for i in range(0, lags + 1)], axis=1)
    data.columns = [f'L(t-{i})' for i in range(0, lags + 1)]
    data['num_instances'] = dataset['num_instances']
    # The first `lags` rows contain NaN values caused by the shifts.
    data = data.iloc[lags:]

    return data

Figure 4.9: Method for generating lag features according to the specified window size.

Fig. 4.9 displays the method that is used to create features with a specified number of lags. The parameter dataset contains the two columns L(t) and num_instances. The lags parameter is a positive integer specifying how many lag features are appended to the dataframe, created by concatenating the column L(t) with shifted versions of itself, one shift at a time. The shifts result in NaN values in the first rows, as explained in Section 2.7, which are therefore removed.
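As an illustration, calling the method in Fig. 4.9 on a small toy dataframe (the values below are made up) yields the column layout shown later in Fig. 5.13b:

import numpy as np
import pandas as pd

# Toy data standing in for the real measurements.
dataset = pd.DataFrame({
    'L(t)': np.arange(10, dtype=float),
    'num_instances': np.ones(10),
})

windowed = window_size(dataset, lags=5)
print(windowed.columns.tolist())
# ['L(t-0)', 'L(t-1)', 'L(t-2)', 'L(t-3)', 'L(t-4)', 'L(t-5)', 'num_instances']
print(len(windowed))  # 5: the first five rows contain NaNs and are dropped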

4.2 Model compilation
The code structure of the MLP regressor is described by the code in Fig. 4.10. The method baseline_model takes the number of epochs, hidden layers, neurons in each layer, and learning rate as input parameters. It uses the optimizer Adam imported from TensorFlow, the activation function sigmoid for the first layer, and relu for the remaining hidden layers. The model measures the loss by MAE, described in Section 2.6.6.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam

def baseline_model(epochs, hidden_layers, neurons, learning_rate):
    adam = Adam(learning_rate=learning_rate, epsilon=1e-8)

    model = Sequential()
    # First layer is sized after the number of lag features and uses sigmoid.
    model.add(Dense(neurons, input_dim=X_train.shape[1],
                    kernel_initializer='normal', activation='sigmoid'))

    # Remaining hidden layers use relu.
    for i in range(hidden_layers - 1):
        model.add(Dense(neurons, kernel_initializer='normal', activation='relu'))

    # Single output neuron for the regression target.
    model.add(Dense(1, kernel_initializer='normal'))
    model.compile(loss='mean_absolute_error', optimizer=adam)
    return model

Figure 4.10: Method that compiles an MLP-regressor according to specified parameters.

4.3 Grid search with time series cross-validation
The grid search is implemented as described by Fig. 4.11. Parameters are initialized as described in Section 3.2.6. A GridSearchCV object wraps the MLP-regressor, which is built according to the baseline_model in Fig. 4.10, along with the time series cross-validation set tscv, which determines the number of cross-validation splits. The specified hyperparameters are passed as input to the grid search object. The grid search is initiated and the resulting evaluation scores are printed, sorted by mean test score.

from sklearn.model_selection import GridSearchCV
from tensorflow.keras.wrappers.scikit_learn import KerasRegressor

parameters = {
    'epochs': [500, 750, 1000],
    'hidden_layers': [2, 3],
    'learning_rate': [0.001, 0.0001, 0.0002],
    'neurons': [10, 50, 100]
}

# Wrap the Keras model so that it can be used by scikit-learn's grid search.
grid_search = GridSearchCV(KerasRegressor(build_fn=baseline_model, batch_size=6000),
                           cv=tscv, param_grid=parameters,
                           scoring="neg_mean_absolute_error", verbose=0)

grid_search.fit(Xn_train, y_train)

res = pd.DataFrame(grid_search.cv_results_)
save = res[['rank_test_score', 'mean_test_score', 'param_neurons',
            'param_learning_rate', 'param_hidden_layers', 'param_epochs',
            'mean_fit_time']].sort_values('rank_test_score')

Figure 4.11: Implementation of grid search.
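The cross-validation object tscv referenced in Fig. 4.11 is not shown in the figure. A sketch using scikit-learn's TimeSeriesSplit, matching the five folds visualized in Fig. 5.15, could be:

from sklearn.model_selection import TimeSeriesSplit

# Five splits: each successive fold adds roughly 1/6 of the training data
# to the fold's training set and validates on the following block.
tscv = TimeSeriesSplit(n_splits=5)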

5 Experimental Setup and Results

In Section 5.1 the hardware and software specifications and versions used in the experiment are defined. Results produced with the experimental setup are presented in Section 5.2.

5.1 Experimental setup
Hardware specifications for the experiment are listed in Table 5.4.

Attribute            Value
Model                HP EliteBook 840 G5
Processor            Intel Core i7-8650U
CPU Clock            1.90 GHz
RAM                  32.0 GB
Operating System     Windows 10 Enterprise, version 1909

Table 5.4: Hardware specifications of the experiment.

Software versions for the experiment are listed in Table 5.5.

Software       Version
Python         3.6
Jupyter        1.0.0
NumPy          1.20.2
Pandas         1.2.4
Scikit-learn   0.24.1
Matplotlib     3.4.1
TensorFlow     2.4.1
PyCharm        20.3.5
Plotly         4.14.3

Table 5.5: List of software used in the controlled experiment.

TensorFlow's and NumPy's random number generator seeds are both set to one in order to obtain reproducible results.
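The seeding described above amounts to the following two calls (a sketch; the seed value of one is taken from the text):

import numpy as np
import tensorflow as tf

# Fix the random number generator seeds for reproducible results.
np.random.seed(1)
tf.random.set_seed(1)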

5.2 Results
In the literature review it was concluded that models based on MLP regressors were the most promising to investigate. The reason is that related work commonly uses MLPs and other types of neural networks rather than other methods such as support vector machines or random forest ensembles. Another motivation for focusing on MLP neural networks is that, according to the universal approximation theorem, they should in theory be able to approximate any continuous function [33]. Fig. 5.12 contains a plot of the desired number of instances for each observation in the dataset, based on the calculation for a threshold of 75% CPU utilization. The dataset contains 112,320 observations (13 days).

Figure 5.12: Desired number of instances based on 75% CPU utilization for each observation in the dataset.

In Fig. 5.13a the average tps values per 10 seconds, L(t), in the interval 2021-01-22 11:00:00 - 2021-02-04 10:59:50 are shown along with the desired number of instances, num_instances. Fig. 5.13b contains the dataset with a window size of five, where the features are created as described in Section 4.1. The first five rows are removed since they contain NaN values caused by the shifting of columns, hence the dataset contains 112,315 rows starting at 11:00:50 instead of 11:00:00. Each value of num_instances in the figure represents the desired number of instances for L(t−0) of that row. Once the num_instances column is shifted, 30 more rows are removed at the end of the dataset.

(a) Dataset before window feature creation. (b) Features with a window size of five created.

Figure 5.13: Example of feature creation for a window size of five.
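The 30-step shift of the num_instances column mentioned above, corresponding to five minutes of 10-second observations, could be implemented roughly as follows (the helper name is illustrative):

import pandas as pd

def shift_label(data, steps=30):
    # Use the desired number of instances `steps` observations ahead as the
    # label for each row of lag features.
    shifted = data.copy()
    shifted['num_instances'] = shifted['num_instances'].shift(-steps)
    # The last `steps` rows have no label after the shift and are dropped.
    return shifted.iloc[:-steps]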

The autocorrelation of the observations for the previous 24 hours is shown in Fig. 5.14. The lagged values show the correlation between L(t) and L(t − x) along with the calculated confidence interval (light blue).

Figure 5.14: Autocorrelation between time lags. Interval 0–8640 (24 hours).
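A plot such as Fig. 5.14 can be produced with statsmodels' plot_acf function (cf. [31]); the sketch below, which assumes the dataframe from Section 4.1 is available, is illustrative rather than the exact code used.

import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf

def plot_autocorrelation(series, lags=8640):
    # 8,640 lags of 10-second observations correspond to 24 hours.
    plot_acf(series, lags=lags)
    plt.show()

# plot_autocorrelation(dataset['L(t)'])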

Table 5.6 presents the five lowest, as well as the highest, mean absolute error (MAE) obtained when a window size of five is used, along with the combinations of hyperparameters that resulted in the scores. In total, 54 models are trained for window size five, covering all possible combinations of the provided parameters shown in Table 3.3 in Section 3.2.6. Tables 5.7 and 5.8 contain the same attributes but with results from the grid search on datasets with window sizes 500 and 5,000, respectively.

Window size   Neurons   Learning rate   Hidden layers   Epochs   MAE
5             100       0.0002          3               1000     0.252121
5             100       0.001           2               1000     0.260026
5             100       0.0001          3               500      0.264065
5             100       0.0001          2               500      0.265457
5             100       0.0002          2               1000     0.265548
...           ...       ...             ...             ...      ...
5             10        0.0001          2               500      12.729519

Table 5.6: Grid search results for a window size of five.

Window size   Neurons   Learning rate   Hidden layers   Epochs   MAE
500           100       0.001           2               1000     0.149478
500           50        0.001           2               1000     0.159879
500           50        0.001           3               1000     0.161505
500           50        0.001           2               500      0.169572
500           50        0.001           2               750      0.173077
...           ...       ...             ...             ...      ...
500           10        0.0001          2               500      10.726081

Table 5.7: Grid search results for a window size of 500.

Window size   Neurons   Learning rate   Hidden layers   Epochs   MAE
5000          50        0.001           2               1000     0.035448
5000          50        0.001           2               750      0.037547
5000          100       0.0002          3               1000     0.041951
5000          50        0.0002          3               1000     0.042181
5000          100       0.001           2               1000     0.043404
...           ...       ...             ...             ...      ...
5000          10        0.0001          2               500      9.734262

Table 5.8: Grid search results for a window size of 5000.

The hyperparameters that resulted in the lowest MAE are, as seen in Table 5.8, a window size of 5,000 with 50 neurons in each of two hidden layers, a learning rate of 0.001, and 1,000 epochs. Table 5.9 presents the evaluation scores described in Section 2.6.6 using the configuration that resulted in the lowest MAE in the grid search. Both the scores retrieved from the time series cross-validation and those from the test set are presented.

                   Explained variance   Mean absolute error   Median absolute error
Cross-validation   0.9992               0.0503                0.0331
Test score         0.9994               0.0400                0.0201

                   R2       Max error   RMSE
Cross-validation   0.9992   6.6399      0.1868
Test score         0.9994   5.0389      0.1817

Table 5.9: Evaluation scores for the model with the optimal hyperparameter settings.
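The scores in Table 5.9 can be reproduced with scikit-learn's metric functions (cf. [26]); the helper below is an illustrative sketch where y_true and y_pred denote the labels and the model's predictions.

import numpy as np
from sklearn.metrics import (explained_variance_score, mean_absolute_error,
                             median_absolute_error, r2_score, max_error,
                             mean_squared_error)

def evaluation_scores(y_true, y_pred):
    return {
        'explained_variance': explained_variance_score(y_true, y_pred),
        'mean_absolute_error': mean_absolute_error(y_true, y_pred),
        'median_absolute_error': median_absolute_error(y_true, y_pred),
        'r2': r2_score(y_true, y_pred),
        'max_error': max_error(y_true, y_pred),
        'rmse': np.sqrt(mean_squared_error(y_true, y_pred)),
    }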

Fig. 5.15 is a visualization of the five-fold time series cross-validation. For each fold, an additional 1/6 of the total training data is added to the training set used in the fold. The true value is shown in blue and the predicted value in red. Fig. 5.16 excludes the training sets to show a close-up view of the validation sets and predictions of each fold.

Figure 5.15: Time series cross-validation splits using a window size of 5,000 with optimal hyperparameters according to MAE.

Figure 5.16: Time series cross-validation with a window size of 5,000 and optimal hyperparameters according to MAE, training set excluded.

Fig. 5.17 contains the predictions for the 20% test set. The remaining 80% is used as the training set.

Figure 5.17: Plot of test set predictions and test set labels using a window size of 5,000 with optimal hyperparameters according to MAE. Training set excluded from the plot.

(a) Noise in predictions and test set.

(b) Predictions rounded to the nearest integer.

Figure 5.18: Focus on predictions in the interval 3000–7500 of the test set using a window size of 5,000 with optimal hyperparameters according to MAE.

Fig. 5.18 shows the predictions for a smaller interval before and after the values are rounded to the nearest integer. Fig. 5.19 shows the predictions along with the nonshifted threshold values previously shown in Fig. 5.12. The values in the test and validation sets in the previous plots are shifted 30 steps and used to determine the accuracy of the predictions. Since the threshold values in Fig. 5.19 are not shifted, the figure compares threshold-based scaling with predictive scaling.

(a) Full interval of nonshifted data and predictions.

(b) Interval 11300-12200 of nonshifted data and predictions.

Figure 5.19: Predictive scaling compared to threshold scaling suggestions using a window size of 5,000 with optimal hyperparameters according to MAE.

Fig. 5.20 contains predictions on a small interval of the test set using a window size of five instead of 5,000. Fig. 5.20a compares the predictions with the test set to evaluate the accuracy of the predictions, while Fig. 5.20b compares the predictions with the nonshifted data of the threshold scaling approach.

(a) Predictions compared with test set.

(b) Predictive scaling compared to threshold scaling.

Figure 5.20: Predictions of a model trained with optimal hyperparameters for a window size of five.

6 Analysis

The autocorrelation plot is depicted in Fig. 5.14. It reveals that the dataset has a statistically significant correlation between time lags. Points intersecting the confidence interval are not considered statistically significant. The plot also reveals a clear seasonality with no trend over 24 hours (8,640 observations). The conclusion drawn from these facts is that observations in the data depend on previous observations. Lags up to L(t − 500) show over 90% correlation, which makes future values likely to be accurately predicted from previous values through time series forecasting.

In the experiment, three grid searches are performed spanning 54 configurations each, producing a total of 162 models. The grid search results, described in Table 5.6, Table 5.7, and Table 5.8, indicate that an increase in window size results in a lower MAE. Between the top-ranking results for window sizes 500 and 5,000, two independent variables differ: the number of neurons and the window size. The difference in the dependent variable MAE is 0.11403 (0.149478 − 0.035448) between the two configurations. It is unclear how each variable contributes to the difference in MAE; however, the parameters that result in the lowest score are regarded as optimal. Given the difference in MAE, a trade-off is made between a lower MAE and the computational complexity of the training data. The trade-off favors the lower MAE at the expense of a tenfold increase in input features. Such a trade-off is not necessarily beneficial with other datasets, as the increase in window size may imply overfitting. Because of the regularities between the training set and test set of the provided dataset, overfitting is less likely to result in substantial differences between training and test error. This might lead to misguided conclusions that do not generalize to real user data.

The time series cross-validation plot, described in Section 2.7, of the top scoring model is shown in Fig. 5.15. Each split is trained with an increased amount of training data. This might influence the averaged evaluation scores received from the cross-validation, since the first split is several times smaller than the dataset used for training the final model. However, the conditions are equal for all models. A plot excluding the training data, Fig. 5.16, shows a narrowed view of the validation set and prediction traces. Visual inspection of the plot shows that the predictions are valid approximations of the validation set with some noise. The close approximation and the evaluation metrics explained variance, R2, MAE, RMSE, median absolute error, and max error displayed in Table 5.9 indicate that the model is suitable for predicting the test set, based on its high accuracy and low error rate. The top scoring model's configuration is concluded to have a window size of 5,000, two hidden layers with 50 neurons each, a learning rate of 0.001, and 1,000 epochs.

Visual inspection of Fig. 5.17 indicates that the model accurately approximates the test set with some noise. Fig. 5.18 shows that the majority of the noise deviates less than 0.5 units from the test set and can therefore be rounded away for an improved approximation of the test set. Fig. 5.18b shows the predictions rounded to the nearest integer. The predictions disregard the momentary decreases of instances in the test set, which would avoid brief oscillations in the number of instances in a live environment. Fig. 5.19 depicts the test set predictions plotted against nonshifted data, which simulate a reactive threshold with five minutes of delayed response as described by Fig. 2.2 in Section 2.4. Through visual inspection of the plots it is concluded that the predictions accurately forecast five minutes (30 steps on the x-axis) ahead of time with a model of window size 5,000.

The results indicate that the window size determines the forecasting property of the model. As shown in Fig. 5.20a, a window size of five results in a delayed predictive response by the model, causing a lag effect of around 30 steps on the x-axis for all observations, thus eliminating the model's forecasting capability. Fig. 5.20b confirms this with a plot of the predictions against nonshifted data. Because of the lag effect, models with a window size of five, together with their corresponding hyperparameter configurations, are excluded as suitable models. The results indicate that the top scoring model with a window size of five provides no significant increase in forecasting capacity compared to a reactive threshold. It should be noted that an increase in window size does not necessarily improve the model's predictions on other datasets.

The data described by Fig. 5.12 exhibits no trend. It has repetitive seasonality with equal amplitude, frequency, and added noise. These facts are likely the reason why the model provides accurate forecasts for the dataset. Another dataset with irregularities or differences in seasonality, amplitude, and trend will presumably not be predicted accurately. For these reasons, skepticism is raised about the credibility of the dataset. To provide accurate forecasts for a live environment, improved simulations that account for changes in trend, seasonality, and amplitude likely need to be performed.

7 Discussion

During the literature review it became apparent that both Amazon and Google provide predictive autoscalers. However, the offered autoscalers are not very flexible, and it would likely be difficult to fit the telecom support system into the offered services. Hence, implementing a custom autoscaler, for which this thesis lays a foundation, is still a valuable research topic. This is further established by Yadavarnikravesh [4], who states that predictive autoscaling tools in the market often have poor accuracy, which suggests a need for further research. This chapter reasons about the validity of the model developed in this thesis, discusses possible improvements, and compares the results with related work in the research area.

7.1 Model validity
The results received in the controlled experiments indicate that it is possible to predict the preferred number of instances five minutes into the future. The explained variance and R2 scores received from a tuned multilayer perceptron (MLP) model are above 99.9% and the mean and median absolute errors are below 0.1 units, which indicates accurate predictions. Plotting the model's predictions against the nonshifted data, visualized in Fig. 5.19 and Fig. 5.20b, also indicates that training on shifted measurements of the desired number of instances enables predictions into the future based on past observations, given that the window size is large enough. That being said, there are several factors that might affect the validity of the model.

Obtaining reliable results when utilizing simulated data for training and evaluation presumes that the simulation corresponds to real user data. That is likely not the case for the provided dataset. As seen in Fig. 5.12, the data is repetitive, following a strict seasonality pattern. It contains no spikes where the transactions per second (tps) unexpectedly increase for a short interval, and it is constantly decreasing towards the end of each day. The simulation does not account for other patterns during weekends or holidays that could make the tps deviate from the pattern. The model is good at predicting data similar to what it is trained on, but it is uncertain how it would behave during deviations. Since the simulation data was scarce, such patterns have not been investigated but would be valuable to explore further.

The number of instances used as the label for the dataset is currently calculated by dividing tps by a constant, as shown in equation (14) in Section 3.2.3. This might not be the most accurate way to calculate the desired number of instances. A more accurate representation of threshold-based autoscaling would be to retrieve the current CPU utilization and the current number of instances, and set the desired number of instances to the current number increased by one if the CPU utilization is above 75%. For the current system, however, such an implementation is not possible because the current number of instances is a constant of 24. By using separate measures for the features and the label, it is possible to exchange the method for creating the label without affecting the feature creation.

The method for calculating the desired number of instances when preprocessing the data also entails a correlation between tps and CPU utilization that might benefit the predictions. Using the average tps for a system at approximately 75% CPU utilization and dividing it by the number of instances presumes that the CPU utilization is correlated with the total tps. If this is an incorrect assumption, the results may deviate from the desired number of instances received from a threshold-based autoscaler. What this thesis investigates is rather whether future tps can be predicted based on previous tps. If a different percentage than 75% is used, it might affect the allocations if tps and CPU utilization are not linearly dependent.

As discussed in Section 6, a trade-off is made between the computational complexity of the training data when using a large number of features and receiving a low MAE. In this thesis, a trade-off that favors low MAE is chosen. However, using too large a window size increases the risk of the model losing the capacity to detect trends and deviations on narrow intervals. Since the seasonality in the investigated dataset is strong, the evaluation scores from the test set correlate with the evaluation scores from the training set. This makes detection of overfitting difficult. For a different dataset with less seasonality, overfitting might be more visible and contradict the results in this thesis.

7.2 Predictions as scaling policy
The predictive model forecasts five minutes ahead. The predictions can be used as a foundation for a decision to scale up, scale down, or do nothing in an implementation. To directly use the predictions to scale an application would be inappropriate, since it would also scale down five minutes ahead. The assumption is that there is a delay of five minutes when scaling up to get a new instance started. However, terminating an instance might not take five minutes, which may lead to overutilization until the demand decreases.

Implementing the predictive model in autoscaling software would require several trade-offs. When the load decreases for brief intervals it may be undesirable to decrease the number of instances, as they will soon be required again. One trade-off concerns at what intervals oscillations in load should be ignored. To analyze the following period of time and detect these brief decreases, it would likely not be enough to predict one value five minutes ahead. Previous predictions could either be stored, or several labels could be used, which is discussed further in Section 7.3.

Another trade-off concerns how the decimal values of the predictions are rounded, where rounding up implies a surplus of instances at the expense of cost-efficiency, and rounding down implies fewer instances at the expense of availability. In this thesis, rounding up is used when constructing the dataset and rounding to the nearest integer is used on the predicted values. This setup favors a surplus of instances in order to increase the availability of the system. Also rounding the predicted values up is considered excessive and entails a risk of fluctuations if the predicted value varies close to an integer. Other concerns include whether a recommender should use a hybrid approach where predictive and reactive scaling are combined. Such an implementation could scale out according to the predictive model's recommendation and scale in at reactive thresholds to prevent premature termination of instances.
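One way to realize the hybrid approach outlined above is to scale out on the predicted demand and scale in only when a reactive threshold allows it. The sketch below is purely illustrative and not part of the thesis implementation; the function name and the scale-in threshold are assumptions.

def scaling_decision(current_instances, predicted_instances, current_cpu_utilization,
                     scale_in_threshold=50.0):
    # Suggest the number of instances for the next period.
    predicted = round(predicted_instances)  # round to the nearest integer
    if predicted > current_instances:
        # Scale out pre-emptively so new instances are ready when demand arrives.
        return predicted
    if predicted < current_instances and current_cpu_utilization < scale_in_threshold:
        # Scale in reactively, one instance at a time, to avoid premature termination.
        return current_instances - 1
    return current_instances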

7.3 Further improvements
Unfortunately, the simulation has not been running consistently for a longer period of time than the 13 days used in the dataset. Window size is considered a hyperparameter, and the results indicate that a larger window size produces more accurate predictions. If a larger dataset were available, it would be preferable to investigate how the size of the training set impacts the predictions. If data for several years had been collected, it is not guaranteed that the oldest observations represent recent data. Perhaps a trend would emerge as the number of customers increases or decreases over the years. A constantly increasing training size would also affect the time it takes to retrain the model. An alternative approach would be to train on a sliding window of the most recent months of data. Once again, such investigations would likely be more valuable on real user data.

Further improvements to the model could possibly be achieved with another set of hyperparameters. With more compute time it is possible to perform a wider grid search, testing more combinations of parameters. All top scoring models from the grid search perform 1,000 epochs of training, which indicates that there may be room for improvement if the maximum number of epochs is increased. The top scoring model uses the largest window size (5,000) specified for the grid search, which also indicates room for improvement. A deeper neural network (more hidden layers) could also be explored to see if the results vary. It does, however, seem unwarranted to improve the model further for the current dataset, as it is likely a flawed representation of a live environment and the model might already be overfitted.

In the current setup, one label representing the desired number of instances 30 steps ahead (t + 30) is predicted. Previous predictions must be stored for 30 observations in order to cover all observations in the following five-minute interval. The stored predictions of t + 30 might not be as accurate as predictions based on more recent observations that predict fewer steps ahead. An improvement that would facilitate analysis for a recommendation tool deciding which scaling action to take would be to output more than the single t + 30 label; instead, all values ranging from t + 1 to t + 30 could be used as labels in each prediction. This would increase the required computational power and training time but might result in a better foundation for the scaling decisions.

In direct forecasting it is usually the case that the feature measure matches the label measure. In the context of this thesis, that would mean that either both features and labels would be tps, or both would be the desired number of instances. The reason for using tps as features and the desired number of instances as the label is to facilitate adding other measurements as features. Currently, only the total number of transactions is used. If it were possible to access the count of each type of transaction separately, the counts could be used as separate features. By using the three types text, phone call, and MMS, the number of features would be multiplied by three, as seen in Table 7.10, where a window size of two is used but there are six features.

text(t)   text(t − 1)   call(t)   call(t − 1)   MMS(t)   MMS(t − 1)   num_instances(t + 30)
4000      4300          2100      2300          800      850          23
3900      4000          2090      2100          790      800          23
3850      3900          2000      2090          780      790          22

Table 7.10: Features created using three measurements and a window size of two. The values in the table are used as an illustrative example.
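If such per-type counts were available, a feature matrix like Table 7.10 could be built analogously to the method in Fig. 4.9. The sketch below is hypothetical: the column names text, call, and MMS do not exist in the current dataset, and the helper name is illustrative.

import pandas as pd

def multi_measurement_features(dataset, lags, label_shift=30):
    # dataset holds one column per measurement (e.g. 'text', 'call', 'MMS')
    # plus 'num_instances'; each measurement gets its own set of lag features.
    measurements = dataset.drop(columns=['num_instances'])
    frames = []
    for column in measurements.columns:
        shifted = pd.concat([measurements[column].shift(i)
                             for i in range(0, lags + 1)], axis=1)
        shifted.columns = [f'{column}(t-{i})' for i in range(0, lags + 1)]
        frames.append(shifted)
    data = pd.concat(frames, axis=1)
    # Label: the desired number of instances label_shift steps ahead.
    data['num_instances'] = dataset['num_instances'].shift(-label_shift)
    # Drop rows made incomplete by the lagging and by the label shift.
    return data.iloc[lags:-label_shift]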

7.4 Connections to related work
The related work mentioned in Section 1.2 concluded that the researchers were able to improve reactive scaling by using predictions from machine learning models. The results in this thesis are in line with the results received in those reports. Yadavarnikravesh [4] uses an MLP regressor similar to the one used in this thesis. One deviation from his results is that larger window sizes seem to increase the MAE and RMSE in his testing phase. Similar to the results in this thesis, he receives a lower MAE and RMSE when increasing the window size and predicting the training set. Yadavarnikravesh attributes this to overfitting to the training set. In this thesis, however, the data in the training set and test set are similar, hence overfitting might lead to accurate predictions even on the test set. As previously discussed, this would present an issue if the load changed unexpectedly during weekends or holidays.

Jiang et al. [3] proposed an autoscaling scheme based on a linear regression model. Similar to this thesis, they make use of aggregated metrics to predict the needed number of instances (VMs). Their dataset shows comparable seasonality and autocorrelation, with which a higher price-performance ratio is achieved, albeit with service-level-agreement violations, which are not considered in this thesis. The results of this thesis suggest that there may be similar benefits to an MLP regressor if improvements are made to circumvent pre-emptive downscaling of instances.

8 Conclusion

The conclusion of this thesis is that, given the dataset that is used, it is possible to accurately forecast the number of required instances five minutes into the future based on past observations of the total number of transactions per second (tps). However, as discussed in Section 7, there are reasons to believe that the simulation of tps does not mimic real user behaviour accurately. The main issue with the simulation is that it does not account for unexpected deviations that might occur during weekends or on holidays. It is likely that the transaction pattern exhibits varying trends and seasonality.

The work in this thesis could be repeated on a dataset of real measurements for a better approximation of user behaviour. The model developed in this thesis cannot replace a reactive autoscaler, since it would scale down prematurely. It could, however, act as decision support for software that analyzes the predictions and decides whether to scale up, down, or do nothing in a hybrid manner.

The results in this thesis are in line with the results of related work. Predictive autoscaling appears to be beneficial for CPU-intensive systems like telecom support systems. Multilayer perceptron regressors seem to perform well for time series forecasting problems, although further research is warranted to firmly draw this conclusion.

8.1 Future work
The main setback in this thesis is that real user data was not available for the research. Finding a way to obtain a dataset with real user data would increase the validity of the results. One possible area of research could therefore be to explore anonymization of customers' transaction patterns. An alternative approach would be to improve the simulation data to mimic real user behaviour more accurately. This could be done by introducing deviations representing weekends and holidays. An analysis of how the model adapts to the deviations could then be performed to determine whether it outperforms a reactive autoscaler.

To build upon this thesis, other model families such as autoregressive integrated moving average (ARIMA) models and other neural network architectures such as long short-term memory (LSTM) networks could be evaluated with time series forecasting techniques. The MLP used in this thesis could be remodeled with a moving window technique to evaluate how the model adapts to new patterns and trends. In a dataset with more deviations it would be interesting to investigate whether adding other features improves the predictions. Instead of using the total transactions per second, the subtotals of texts, phone calls, and MMS transactions could be used as separate features. Other metrics such as response time and memory load would also be valuable to explore.

References

[1] Amazon Web Services, Inc. AWS auto scaling. [Online]. Available: https://aws.amazon.com/autoscaling/ [Accessed: 2021-03-30]
[2] E. Casalicchio, "A study on performance measures for auto-scaling CPU-intensive containerized applications," 2019. [Online]. Available: http://link.springer.com/10.1007/s10586-018-02890-1 [Accessed: 2021-03-30]
[3] J. Jiang, J. Lu, G. Zhang, and G. Long, "Optimal cloud resource auto-scaling for web applications," 2013. [Online]. Available: https://www.researchgate.net/publication/261481632_Optimal_Cloud_Resource_AutoScaling_for_Web_Applications [Accessed: 2021-05-03]
[4] S. Yadavarnikravesh, "A self-adaptive auto-scaling system in infrastructure-as-a-service layer of cloud computing," 2016. [Online]. Available: https://curve.carleton.ca/a19379ba-61f9-4b21-b6a5-c7fae6ea8525 [Accessed: 2021-05-03]
[5] Docker. What is a container? A standardized unit of software. [Online]. Available: https://www.docker.com/resources/what-container [Accessed: 2021-03-26]
[6] The Linux Foundation. Open Container Initiative. [Online]. Available: https://opencontainers.org/ [Accessed: 2021-03-26]
[7] Docker. The industry-leading container runtime. [Online]. Available: https://www.docker.com/products/container-runtime [Accessed: 2021-03-26]
[8] VMware. What is container orchestration? [Online]. Available: https://www.vmware.com/topics/glossary/content/container-orchestration [Accessed: 2021-03-26]
[9] Kubernetes. Understanding Kubernetes objects. [Online]. Available: https://kubernetes.io/docs/concepts/overview/working-with-objects/kubernetes-objects/ [Accessed: 2021-03-26]
[10] Section.io. (2020) Scaling horizontally vs. scaling vertically. [Online]. Available: https://www.section.io/blog/scaling-horizontally-vs-vertically/ [Accessed: 2021-04-01]
[11] AWS. Amazon EC2. [Online]. Available: https://aws.amazon.com/ec2 [Accessed: 2021-04-01]
[12] Google Cloud. Using predictive autoscaling. [Online]. Available: https://cloud.google.com/compute/docs/autoscaler/predictive-autoscaling [Accessed: 2021-04-01]
[13] M. Yadav, G. Raj, H. Akarte, and D. Yadav, "Horizontal scaling for containerized application using hybrid approach," 2020. [Online]. Available: https://www.researchgate.net/publication/348625015_Horizontal_Scaling_for_Containerized_Application_Using_Hybrid_Approach [Accessed: 2021-04-16]
[14] Kubernetes. Horizontal Pod Autoscaler. [Online]. Available: https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/ [Accessed: 2021-04-01]

[15] ——. Resource metrics pipeline. [Online]. Available: https://kubernetes.io/docs/tasks/debug-application-cluster/resource-metrics-pipeline/ [Accessed: 2021-04-01]
[16] ——. Assign CPU resources to containers and pods. [Online]. Available: https://kubernetes.io/docs/tasks/configure-pod-container/assign-cpu-resource/ [Accessed: 2021-04-01]
[17] AWS. Amazon CloudWatch. [Online]. Available: https://aws.amazon.com/cloudwatch/ [Accessed: 2021-04-01]
[18] ——. Dynamic scaling for Amazon EC2 Auto Scaling. [Online]. Available: https://docs.aws.amazon.com/autoscaling/ec2/userguide/as-scale-based-on-demand.html [Accessed: 2021-04-01]
[19] ——. How scaling plans work. [Online]. Available: https://docs.aws.amazon.com/autoscaling/plans/userguide/how-it-works.html [Accessed: 2021-04-01]
[20] ——. Specify the scaling strategy. [Online]. Available: https://docs.aws.amazon.com/autoscaling/plans/userguide/gs-configure-scaling-plan.html [Accessed: 2021-04-01]
[21] ——. What is AWS Auto Scaling. [Online]. Available: https://docs.aws.amazon.com/autoscaling/plans/userguide/what-is-aws-auto-scaling.html [Accessed: 2021-04-01]
[22] A. Géron, Hands-On Machine Learning with Scikit-Learn, Keras and TensorFlow, 2nd ed. O'Reilly, 2019.
[23] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016, http://www.deeplearningbook.org.
[24] Scikit-learn. 1.13. Feature selection. [Online]. Available: https://scikit-learn.org/stable/modules/feature_selection.html [Accessed: 2021-04-15]
[25] J. Brownlee, Data Preparation for Machine Learning. O'Reilly, 2020.
[26] Scikit-learn. 3.3. Metrics and scoring: quantifying the quality of predictions. [Online]. Available: https://scikit-learn.org/stable/modules/model_evaluation.html [Accessed: 2021-04-15]
[27] J. Brownlee, Introduction to Time Series Forecasting with Python - How to Prepare Data and Develop Models to Predict the Future, v1.9 ed. Machine Learning Mastery, 2020.
[28] Scikit-learn v0.24.1. 3.1. Cross-validation: evaluating estimator performance. [Online]. Available: https://scikit-learn.org/stable/modules/cross_validation.html [Accessed: 2021-05-03]
[29] H. Lutkepohl and M. Krätzig, Applied Time Series Econometrics. Cambridge University Press, 2004.
[30] P. Bhandari. What is a controlled experiment? [Online]. Available: https://www.scribbr.com/methodology/controlled-experiment/ [Accessed: 2021-05-21]

[31] statsmodels v0.12.2. statsmodels.graphics.tsaplots.plot_acf. [Online]. Available: https://www.statsmodels.org/stable/generated/statsmodels.graphics.tsaplots.plot_acf.html [Accessed: 2021-05-03]
[32] Scikit-learn v0.24.1. sklearn.preprocessing.StandardScaler. [Online]. Available: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html [Accessed: 2021-04-15]
[33] K. Hornik, M. Stinchcombe, and H. White. Multilayer feedforward networks are universal approximators. [Online]. Available: https://www.cs.cmu.edu/~epxing/Class/10715/reading/Kornick_et_al.pdf [Accessed: 2021-06-03]
