AUTOMATIC ANOMALY DETECTION AND ROOT CAUSE ANALYSIS FOR MICROSERVICE CLUSTERS

Viktor Forsberg

Master Thesis, 30 credits. Supervisor: Johan Tordsson. External supervisor: Johan Tordsson. Civilingenjörsprogrammet i teknisk datavetenskap, 2019.

Abstract

Large microservice clusters deployed in the cloud can be very difficult to both monitor and debug. Monitoring these clusters is a first step towards detection of anomalies, deviations from normal behaviour. Anomalies are often indicators that a component is failing or is about to fail and should hence be detected as soon as possible. There are often a large number of metrics available to view. Furthermore, any errors that occur often propagate to other microservices, making it hard to manually locate the root cause of an anomaly. Because of this, automatic methods are needed to detect and correct the problems. The goal of this thesis is to create a solution that can automatically monitor a microservice cluster, detect anomalies, and find a root cause. The anomaly detection is based on an unsupervised clustering algorithm that learns the normal behaviour of each service and then looks for data that falls outside that behaviour. Once an anomaly is detected, the proposed method tries to match the data against predefined root causes. The proposed solution is evaluated in a real microservice cluster deployed in the cloud, using Kubernetes together with a service mesh and several other tools to help gather metrics and trace requests in the system.

Acknowledgements

I would like to thank everyone at Elastisys for giving me an interesting thesis and a nice place to work. I have had fun and learned a lot thanks to them. My supervisor, Johan Tordsson, deserves a separate thanks. His help has been very important; he always gave good feedback and ideas during our discussions. Lastly, I also want to thank my family and friends for their overall support and for pushing me to get things done during this project.

Contents

1 Introduction
   1.1 Motivation
   1.2 Problem formulation and goals

2 Background
   2.1 Microservices
   2.2 Containers and container orchestration
   2.3 Service mesh
   2.4 Related work

3 Method
   3.1 Algorithm for anomaly detection and root cause analysis
   3.2 Metrics and tracing
   3.3 Anomaly detection
   3.4 Root cause analysis

4 Evaluation
   4.1 Testbed
   4.2 Experiment design
   4.3 Results
      4.3.1 Test 1 - CPU shortage
      4.3.2 Test 2 - increased number of users
      4.3.3 Test 3 - network delay
      4.3.4 Test 4 - network error
      4.3.5 Test 5 - memory shortage

5 Conclusions
   5.1 Algorithmic considerations
   5.2 System aspects
   5.3 Future work

1 Introduction

1.1 Motivation

Today many large applications use the microservice architecture to split the application into small pieces. These applications are then also often deployed on clusters in the cloud. Such deployments can be very complex, see Figure 1, and hard to monitor and debug. But monitoring these systems is crucial as it provides very valuable information on how the system is performing. The monitoring is a first step towards detecting anomalies, deviations from normal behaviour. Anomalies are often indicators that a component is failing or is about to fail, and these anomalies should hence be detected as soon as possible. A variety of anomaly types exists, such as intrusions, fraud, bad performance, and more. However, this thesis will only focus on performance anomalies. There are often lots of metrics available to monitor, and any errors that occur often propagate to other microservices, making it hard to manually locate the root cause of an anomaly. Because of this, automatic methods are needed to detect and correct the problems.

Figure 1: Examples of large microservice clusters in use today. Figure from [1].

1.2 Problem formulation and goals

The main purpose of this thesis is to investigate to what extent it is possible to create a solution that can automatically detect when an application in a microservice cluster has bad performance (anomaly detection) and why (root cause analysis). The thesis has the following goals to fulfill that purpose.

• Goal 1: Create a system that monitors a microservice cluster and gathers relevant metrics about it.

• Goal 2: Add an anomaly detection algorithm that uses the metrics to detect when the microservices have performance issues.

• Goal 3: Add a root cause analysis algorithm to find why the microservices have performance issues.

• Goal 4: (Optional) Add automatic remediation that takes actions to resolve the performance issues once the root cause is found.

• Goal 5: Evaluate the developed system in a real microservice cluster by injecting performance anomalies and measuring how well the system detects anomalies and determines their root cause.

The optional goal will only be pursued if, once the earlier goals are fulfilled, there is enough time left to do both it and the evaluation. Otherwise only the evaluation will be done.

2 Background

2.1 Microservices

The microservice architecture is a way to structure an application into small, separated pieces: microservices. This contrasts with the traditional way of building software as a monolith application, see Figure 2 for an illustrated comparison. Each microservice should be responsible for a specific part of the application and only loosely coupled to other services, communicating over the network using lightweight protocols such as HTTP/REST or gRPC[2]. It should also be possible to develop, test, maintain, and deploy each microservice independently of others. This architecture results in a system where each individual piece is easy to understand and where developers can rapidly make changes. However, compared to a monolith application there is some additional complexity added in the communication between microservices. Handling delays or failures can be problematic, and if the API of a microservice changes, all other microservices that interact with this API also need to change accordingly. [3][4]

Figure 2: A monolith application (left) compared to a microservice application (right). The business logic and data access from the left side are split up and put into the different microservices on the right side. Figure from [3].

2.2 Containers and container orchestration

A popular way to facilitate the deployment of a microservice application is to use containers and container orchestration. A container is a packaged piece of software together with a list of dependencies that can be run in a standardized way. Containers are run inside a container engine that virtualizes the underlying operating system and also isolates each container from the others and from the outside environment. In this project Docker[5] is used to containerize the microservices. Containers are similar to virtual machines in the sense that both make it possible to run several isolated applications on the same machine. The difference is that a virtual machine virtualizes the underlying hardware and then runs an operating system in each virtual machine. This makes containers more lightweight, efficient, and faster to start, see Figure 3 for an illustration. [5]

Figure 3: Running applications in containers is more efficient and lightweight than running them in separate virtual machines. Figure from [6].

Each microservice is then provisioned in its own containers that can be replicated to scale up capacity. These containers are often also spread out over several machines, called nodes, in a cluster. In order to manage many containers across multiple nodes, a container orchestration tool is often needed. Such a tool helps with the deployment, management, scaling, and networking of the containers. The tool used in this project is called Kubernetes[7].

Figure 4: The general components in a Kubernetes cluster. Figure from [8].

The basic unit in Kubernetes is called a pod. A pod contains one or more containers that run on the same node and that can share some resources. On each node Kubernetes has a container engine that runs the containers in the pods. There is also a component called kubelet that controls the pods and containers. Lastly, Kubernetes adds kube-proxy, a component that provides some network functionality that simplifies communication between pods. Kubernetes controls the nodes in the cluster from a separate master node. The master consists of three major parts: the apiserver, the scheduler, and some controllers. The apiserver is the API that can control other objects in the cluster, such as pods, nodes, and deployments. This is what an administrator connects to when making changes to a cluster. The apiserver is stateless and instead stores the state in a distributed key-value store, etcd. The scheduler is responsible for scheduling all pods on the available nodes, in a way that does not overload any node while following any other placement restrictions set by the developers. The possible restrictions include that pods should/should not be placed on a set of nodes (node selector or node affinity/anti-affinity), that pods should/should not be placed with other specific pods (pod affinity/anti-affinity), or that no pods may be placed on a node unless they are specifically allowed (taints and tolerations). Lastly, the controllers watch the state of the cluster and try to change it into the desired state, e.g. ensuring that the correct number of pods are running. Figure 4 shows the components in a Kubernetes cluster. To control the traffic coming into the cluster, an ingress is often added. The ingress can give external URLs to services and load balance the traffic to the services. Likewise, an egress can control the traffic that goes out from the cluster.

2.3 Service mesh

A service mesh can be used for load balancing, routing, monitoring, security, and more. In a service mesh the microservice instances, the pods in this case, are paired with proxies that simplify the communication between services. The traffic to and from each pod goes through these proxies, letting them control and monitor the network. In the service mesh there is also a control plane that manages these proxies and their interactions[9][10]. This project uses the service mesh Istio[10]. Istio uses Envoy[11] as the proxy of choice; a sidecar container with Envoy is added to each microservice pod. A sidecar container is simply an extra container added to a pod. Istio then adds some components that act as the control plane, enabling the traffic management, security, and configuration for the proxies. Figure 5 shows the different components in Istio. Istio also has the option of ingress/egress gateways to control the traffic coming into and going out of the cluster.

Figure 5: The general components in the Istio service mesh. Figure from [10].

2.4 Related work

The approach of using machine learning to do automatic anomaly detection in computer systems has been researched extensively. Previous work in the area can be divided broadly into supervised and unsupervised methods. Anomaly detection is also used in several other areas as diverse as fraud detection, intrusion detection, errors in patient records due to instrumentation or recording errors, and error detection in factory machines [12]. One paper focusing on intrusion detection compared both supervised and unsupervised anomaly detection. It found that supervised methods outperformed unsupervised methods when dealing with known attacks. But on unknown attacks the methods performed similarly, making the unsupervised method preferable due to the need for labeled training data in supervised methods [13]. Another paper researched the use of combining both unsupervised and supervised machine learning, in particular k-means clustering and ID3 decision trees. The combined method was compared to the individual methods on test data from three different anomaly detection domains: network anomaly data, Duffing equation data, and mechanical system data. The paper concluded that the combined method's performance was mostly in between the other methods' performance [14].

Researchers at TU Berlin have studied several anomaly detection systems with the goal to create a system that can do automatic real-time anomaly detection, root cause analysis, and recovery [15]. They have looked into using mostly unsupervised methods to implement the anomaly detection. Some of their research includes using deep packet inspection in virtualized services [16], online ARIMA (a regression model) in cloud monitoring [17], and distance-based online clustering for black-box services [18]. The last paper has significantly influenced the anomaly detection methods used in this project. One key difference in this project is the focus on higher-level metrics from a service mesh and tracing, which closely relate to user experience, instead of hardware metrics. Another recent paper also used clustering to detect anomalies in microservice clusters [19]. Similar to the other papers mentioned, that paper used hardware metrics to detect anomalies instead of higher-level metrics.

There are also several commercial solutions for anomaly detection and/or root cause analysis. One of them is Moogsoft [20], which can take anomaly reports from other tools and then reduce and correlate them into a smaller number of anomalies. They also run root cause analysis to find the problem. Another tool called Grok [21] can run anomaly detection on many metrics in a cluster and integrate this anomaly detection with other tools. Both Moogsoft and Grok also use machine learning, primarily unsupervised, in order to achieve their anomaly detection and/or root cause analysis. A third tool called Dynatrace [22] offers a full suite of monitoring, anomaly detection, and root cause analysis. Dynatrace's anomaly detection and root cause analysis seems to be based on a deterministic AI with some added machine learning.

3 Method

This chapter describes the developed system to detect anomalies and determine their root causes from an algorithmic perspective. The following sections in turn focus on an algorithm for the overall operation of the system, how the metrics are gathered, how the anomaly detection works, and how the root cause analysis works. The system has two main phases: a learning phase that tries to learn the normal behaviour of the microservice application, and a detection phase that tries to detect anomalies and then find their root causes. In order to accomplish this, the system uses monitoring metrics gathered from the microservice cluster.

3.1 Algorithm for anomaly detection and root cause analysis

Below is an algorithm for the overall system; the different parts are described in the following sections.

Algorithm 1: Overall operation of the system
Set iteration length T
Set number of iterations I needed to learn normal behaviour
/* Learning normal behaviour */
for I iterations do
    Gather metrics for last T seconds
    Normalize metrics
    Add them to clustering algorithm
    Save metrics for root cause analysis if a new maximum is found
    Wait T seconds
Get finished clusters

/* Anomaly detection and root cause analysis */
Set M to suitable value for filtering out possible anomalies close to clusters
while true do
    Gather metrics for last T seconds
    Normalize metrics
    Calculate distance to closest cluster for normalized metrics
    if distance > M * radius of cluster then
        /* Anomaly detected */
        Calculate scores for defined root causes
        Notify about most likely root causes
    Wait T seconds

3.2 Metrics and tracing

In general, monitoring information about the state of a microservice cluster can be divided into three levels.

• Cluster: Metrics for the cluster as a whole or individual nodes in the cluster. This primarily includes hardware metrics for each node, such as CPU, memory, and disk, but it can also be container orchestration metrics such as the number of nodes in the cluster, number of pods on a node, resources allocated on a node, and number of pending pods.

• Pod: Metrics for individual pods and containers. This primarily includes hardware metrics for the pod or container, such as CPU, memory, and disk.

• Application: Specific performance metrics for the application, such as response times, error rates, number of requests, and size of request payloads.

Figure 6: The setup for gathering metrics from a microservice cluster.

For this project the monitoring information is gathered in two ways, see Figure 6 for an illustration. Firstly, there are tools, such as cAdvisor[23], that monitor the nodes in the cluster and gather metrics about each node and the containers/pods running there. This covers metrics from the cluster level and pod level. The metrics can then be saved and aggregated in Prometheus[24] or a similar tool. With Prometheus one can query metrics for specific time intervals, averages, rates of change, etc. Secondly, distributed tracing can be used to monitor the traffic within the cluster. Tracing is the process of tracking a request as it moves through the system. This is done by adding some meta information when a request comes to the ingress gateway and then also adding it to all communication within the cluster that is initiated for that specific request. The proxies that run in each node can then look at that meta information and collectively determine the path that the request took through the system. With other information saved at the proxies, such as timestamps and status codes, the tracing can provide information about latencies, error rates, number of requests, and more. This covers metrics from the application level. In this project, Jaeger is used, a tool that enables distributed tracing by gathering all of the information and then compiling the traces for each request[25].

Each iteration of the system, both in the learning phase and the detection phase, lasts T seconds, and in each iteration the system uses metrics from the last T seconds. Thus larger values of T mean that the system uses older metrics, which in turn means that it can take longer to detect anomalies. Conversely, smaller values of T increase the overhead of gathering metrics, as frequent monitoring uses more computation and also more data storage. To handle this tradeoff the system should strive to iterate as often as possible, while keeping the overhead at an acceptable level. With a system like this where data goes through multiple components before metrics can be used, there is also added overhead for each step in the monitoring chain, e.g. some data in this system goes from cAdvisor on each node to Prometheus and then to the anomaly detection. Some steps are usually necessary as they provide features that make the process more efficient in other ways, e.g. by offering centralized access, condensing the data, or running queries. But too many might slow down the system if the overhead outweighs the efficiency improvements from the new features. The current version of the system iterates once every other second, a frequency determined empirically as there were problems during development with updating metrics more frequently.
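As an illustration of how the cluster- and pod-level metrics can be pulled each iteration, the sketch below queries Prometheus's HTTP API for a pod's CPU usage over a short window. The Prometheus address, the metric labels, and the window length are assumptions about the deployment rather than values prescribed by the thesis.

```python
import requests

PROMETHEUS = "http://prometheus.monitoring:9090"  # assumed in-cluster address

def pod_cpu_usage(pod_name: str, window: str = "10s") -> float:
    """Average CPU usage (in cores) of a pod over the given window.

    container_cpu_usage_seconds_total is exported by cAdvisor; rate() turns
    the counter into cores used per second.
    """
    promql = f'sum(rate(container_cpu_usage_seconds_total{{pod="{pod_name}"}}[{window}]))'
    resp = requests.get(f"{PROMETHEUS}/api/v1/query", params={"query": promql})
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0

# Example: CPU used by a (hypothetical) pod of the Employee service.
print(pod_cpu_usage("employee2-5d9f7c6b8-abcde"))
```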

3.3 Anomaly detection

Anomalies in a system can be defined as deviations from normal behaviour. Anomalies are often indicators that a component is failing or is about to fail and should hence be detected as soon as possible. Anomaly detection is the process of trying to detect anomalies so that someone or something can react to them and fix any underlying problem. There are many ways to do anomaly detection, though the core is looking at different metrics from the system to find deviations. This can be done either manually by system administrators or automatically by software. As microservice applications often produce a large amount of metrics to analyze, a manual process is impractical. Looking at automatic solutions, several types have been tried and are used today, e.g. some sort of machine learning algorithm, either supervised or unsupervised. Using supervised machine learning requires labeled training data, which can be hard to produce and which furthermore cannot capture every possible anomaly. This project instead focuses on unsupervised machine learning, where unlabeled data is used by the algorithm. This reduces the amount of domain-specific knowledge needed and could provide a system that is able to detect unexpected types of anomalies. The general idea for the anomaly detection in this system is to learn the normal behaviour of the microservice application, then look for any deviations from this behaviour. This is done per instance of each microservice, to look for anomalies at each pod instead of in the microservice cluster as a whole. This narrows down the location of each anomaly and thus makes the root cause analysis more focused. The solution uses a clustering algorithm to learn the normal behaviour of each microservice. Multiple metrics from the components are given to the algorithm each second, both during the learning phase and the detection phase. During the learning phase the algorithm uses the metrics to form a set of multidimensional clusters that represent the state of the system. The clusters are then a representation of the normal behaviour. Later, during the detection phase, this representation is used to detect anomalies by checking whether the newly gathered metrics fall inside the clusters or not. If they do not, it is treated as an anomaly, see Figure 7 for an illustration.

Figure 7: Clusters that represent the normal behaviour. Data points outside the clusters are considered anomalies.

When choosing metrics, the idea is to use only the most important metrics that represent the performance of a microservice as seen by the users. The users are only directly impacted by application-level metrics, not cluster-level or pod-level metrics; e.g. if the response time is fast, it does not matter to the user whether a pod uses more CPU than normal. Because of this, the anomaly detection in this project uses two metrics: time to answer a request (response time) and error rate. To keep it simple, the response time is an average and the error rate counts any HTTP response that is not a success (status code 200); other options are discussed in Section 5.3. In this project a version of the clustering algorithm BIRCH[26] (balanced iterative reducing and clustering using hierarchies) is used. During the first phase of BIRCH, several small clusters are created to fit the data given to it. BIRCH is an online clustering algorithm, meaning that it can be fed incremental data during the training instead of reading all data at once. BIRCH has some additional phases to condense the clusters and then refine them; these are not used in this project. Once the training is completed, the resulting clusters are used as the normal behaviour. This project used an implementation of BIRCH from the Python library scikit-learn[27]. The BIRCH clusters have a radius of 0.5, and if any new data cannot be merged into an existing cluster without increasing the radius above 0.5, a new cluster is created. In order to have meaningful clusters, all gathered metrics are normalized to have an average of 0 and standard deviation of 1. In order to filter out minor deviations in behaviour that could perhaps be considered normal, this system only classifies data as anomalous if the distance to the nearest cluster is significantly greater than the radius of the cluster. For this project the threshold is set to two times the radius of the clusters, i.e. a distance of 1. The distance to the closest cluster is called the anomaly detection score; a high score means that there is likely an anomaly.
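A minimal sketch of this detection scheme, using the Birch implementation and standard scaler from scikit-learn, is shown below. The training points and the test sample are invented for illustration; the real system feeds in the per-pod response time and error rate gathered every iteration.

```python
import numpy as np
from sklearn.cluster import Birch
from sklearn.preprocessing import StandardScaler

RADIUS = 0.5            # BIRCH subcluster radius used in this project
THRESHOLD = 2 * RADIUS  # scores above this are treated as anomalies

# Learning phase: normal samples of [response_time (s), error_rate].
normal = np.array([[0.12, 0.00], [0.15, 0.01], [0.11, 0.00], [0.14, 0.02]])
scaler = StandardScaler().fit(normal)
model = Birch(threshold=RADIUS, n_clusters=None)   # n_clusters=None: first phase only
model.fit(scaler.transform(normal))

def anomaly_score(sample):
    """Distance from the normalized sample to the nearest BIRCH subcluster centre."""
    x = scaler.transform([sample])
    return np.linalg.norm(model.subcluster_centers_ - x, axis=1).min()

# Detection phase: a slow, error-prone sample scores well above the threshold.
score = anomaly_score([0.95, 0.10])
print(score, score > THRESHOLD)
```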

3.4 Root cause analysis

Once the anomaly detection finds an anomaly, root cause analysis follows, i.e., the process of identifying the underlying problem or cause. The process involves determining what underlying problem caused the detected anomaly and where the problem is located. Finding the root cause often requires following a chain of causes down to the most fundamental problem. If the root cause is found, it can then be used to select some remediation steps to revert the situation. As an example, this system could detect an anomaly, that pod-A has slow response times, and try to find the root cause of that anomaly. Analyzing metrics from the pod could find that pod-A is slow because it depends on pod-B, which is also slow to respond. Analyzing pod-B could then show that it has had slow response times since a recent deployment of a new version of pod-B. The code could then be analyzed, but this is outside the scope of this thesis. Thus the conclusion in this example is that a bad version of the code in pod-B was the root cause. The remediation could then be to update to a newer, fixed version of the code or to roll back to a previous version.

In this project, the root cause analysis is based on a few predefined root causes and tries to determine which of these is the most likely cause. The general idea is to, for each root cause, look at one or more metrics, learn their maximum values during normal operation, and then compare their values at the time of the anomaly to the recorded maximum. If the value at the time of the anomaly is greater than the recorded maximum, then the difference between them is the score for that root cause. Therefore the score should be greater than 0 in order to consider the root cause a real possibility. In the end, the scores for the possible root causes are compared, and the one with the highest score is determined to be the most likely cause. This could be further elaborated, e.g., by looking at the difference between the highest score and the rest to get an idea about the certainty of the classification. If none of the scores are above 0, then none of the predefined root causes are likely to be the actual cause. Algorithm 2 summarizes the root cause analysis and shows how it integrates with the operation of the system.

Algorithm 2: Root cause analysis
Set iteration length T
Set number of iterations I needed to learn normal behaviour
/* Learning normal behaviour */
for I iterations do
    Gather metrics for last T seconds
    Normalize metrics
    Add them to clustering algorithm
    Save metrics for root cause analysis if a new maximum is found
    Wait T seconds

/* Anomaly detection and root cause analysis */
while true do
    Gather metrics for last T seconds
    Normalize metrics
    Detect anomaly
    if anomaly is detected then
        foreach root cause metric do
            Save score for root cause as current value minus recorded maximum
        Find highest score
        if highest score > 0 then
            Notify that root cause with highest score is likely
        else
            Notify that root cause could not be determined
    Wait T seconds
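A small, self-contained sketch of the scoring step in Algorithm 2 is given below. The root-cause metric names and the numbers are invented for illustration; in the real system each predefined root cause is tied to one or more metrics whose maxima are recorded during the learning phase.

```python
def classify_root_cause(current, learned_max):
    """Score each predefined root cause as (current value - learned maximum).

    Both arguments map a root-cause metric name to a value. A score above 0
    means the metric exceeds anything seen during the learning phase.
    """
    scores = {cause: current[cause] - learned_max[cause] for cause in learned_max}
    best_cause, best_score = max(scores.items(), key=lambda kv: kv[1])
    return (best_cause if best_score > 0 else None), scores

# Example: node CPU is well above its learned maximum, so it wins.
learned_max = {"node_cpu": 0.70, "request_rate": 120.0, "dependency_latency": 0.25}
current     = {"node_cpu": 0.95, "request_rate": 110.0, "dependency_latency": 0.24}
cause, scores = classify_root_cause(current, learned_max)
print(cause, scores)   # -> node_cpu, with only node_cpu scoring above 0
```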

Algorithm 2 is run for every pod to give the most likely cause for that pod. However, this need not always give the true root cause of the problem if the actual anomaly is located in another pod or node. Thus, the cause reported from one pod should be analysed together with reports from all other pods in order to draw conclusions about the cluster as a whole. Reusing the earlier example of general root cause analysis, the scenario is that pod-A reports that its anomaly is caused by slow responses from one of its dependencies, and pod-B in turn reports that it is running slow due to a new version of its code. Combining these reports with the knowledge that pod-B is a dependency of pod-A, it can be concluded that the root cause is located in pod-B.
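In this thesis that cluster-wide combination is done manually (see Sections 4.3 and 5.3), but the sketch below illustrates how it could be automated for simple dependency chains. The service names, the dependency table, and the report labels are assumptions made for the example.

```python
# Partial, assumed dependency table for the demo application.
DEPENDS_ON = {
    "company": ["employee"],
    "greet-employee": ["employee", "greeting"],
}

def locate_root_cause(reports, depends_on):
    """Map each per-pod report to (suspected service, cause).

    reports maps a service to its locally classified cause, e.g.
    "slow_dependency", "cpu_shortage", or None when no anomaly was seen.
    A "slow_dependency" report is blamed on the first dependency that
    itself reports an anomaly.
    """
    located = {}
    for service, cause in reports.items():
        if cause == "slow_dependency":
            culprit = next((d for d in depends_on.get(service, [])
                            if reports.get(d) is not None), service)
            located[service] = (culprit, reports.get(culprit, cause))
        else:
            located[service] = (service, cause)
    return located

reports = {"company": "slow_dependency", "employee": "cpu_shortage"}
print(locate_root_cause(reports, DEPENDS_ON))
# company's anomaly is traced down to employee's CPU shortage
```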

4 Evaluation

4.1 Testbed

The testbed is composed of four major components: the microservices and clients, a service mesh and monitoring tools to gather metrics, an anomaly detection and root cause analysis program, and an anomaly injector to evaluate the solution. This project has used Google Cloud as the cloud provider, and more specifically Google Kubernetes Engine to get a cluster with Kubernetes pre-deployed. Onto this cluster, Istio was deployed as the service mesh, together with Prometheus and a cAdvisor daemonset for metrics and Jaeger for tracing. Next, the microservices were deployed, with sidecar containers injected by Istio. Depending on the tests, some additional applications were sometimes added to inject anomalies into the system. In total the system used five nodes, where two nodes were dedicated to the microservices and cAdvisor, with the three remaining nodes dedicated to the other components (Istio, Prometheus, the anomaly detection program, etc.).

Figure 8: The architecture of the microservice application used in the experiments.

The application used in these experiments consisted of nine different microservices. The application is supposed to mimic an internal human resources system for a company. Three of these microservices (Company, Employee-salary, and Greet-employee) are user-facing, while the others are only used internally by other services. The communication between the microservices uses regular HTTP requests and JSON payloads. The application is only a dummy application developed for the evaluation of the anomaly detection and root cause analysis methods proposed in this thesis, not an application that is actually in use by any company. See Figure 8 for an illustration of the architecture of the microservices. The microservices are built using the programming language Ballerina[28]. Ballerina was chosen because it makes it easy to build small network applications and because it has native support for Docker, Kubernetes, and Istio that can generate the artefacts needed to deploy an application. In order to make the experiments easier to run and reproduce, the microservices were deployed on specific nodes. The deployment used taints and tolerations to ensure that unwanted pods were placed on the other three nodes, and node selectors to place the application pods on specific nodes. Figure 9 shows how the microservices were distributed over the two nodes. Note that one microservice was replicated and deployed with one pod per node.

Figure 9: The microservices used in the experiment, distributed over two nodes.
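As a rough sketch of the placement configuration described above (written here as a Python dictionary rather than the Kubernetes YAML actually used, and with made-up label and taint names), an application pod would carry a node selector plus a toleration for a taint put on the two application nodes:

```python
# Hypothetical fragment of an application pod spec. The "role=apps" label and
# the "dedicated=apps" taint are invented names. Pods without this toleration
# cannot be scheduled on the tainted application nodes and therefore end up
# on the other three nodes.
app_pod_spec = {
    "nodeSelector": {"role": "apps"},
    "tolerations": [{
        "key": "dedicated",
        "operator": "Equal",
        "value": "apps",
        "effect": "NoSchedule",
    }],
}
```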

During the experiments user trac is injected using Locust[29]. Locust was set up to vary the numbers of users between 30 and 60. Each user randomly sends a request to one of the exposed microservices, with a bias towards ”Company” where 50% of all requests go and the other two receives 25% each. Each client then sleeps between 500 ms and 2000 ms before repeating the same steps again. ‘is number of users was selected to not exceed the maximum capacity that the nodes can handle before slowing down signi€cantly, but it is a signi€cant load and at 60 concurrent users the nodes are using almost all of their CPU resources. Version 1.1.0 of Istio was deployed using Helm charts and the values €le values-istio-demo- auth.yaml. Metrics from the microservices are gathered each second to quickly pick up changes. ‘is required changes to Prometheus’s scrape interval and cAdvisor’s housekeeping interval. As this project used Google Kubernetes Engine, it was not possible to con€gure the cAdvisor directly in the kubelet. Instead a new daemonset with cAdvisor was deployed with the new

16 con€guration, from which Prometheus could scrape metrics. ‘e implementation of BIRCH from ”scikit-learn” was used with the default options except for the se‹ing ”n clusters=None”, which limits the algorithm to only run the €rst phase. For these tests there are €ve di‚erent de€ned root causes. ‘ere are only used as a proof of concept and a real production system would need many more possible causes to look for. ‘e €rst two root causes are a shortage of CPU on the pods node or a shortage of memory on the node. ‘ese are meant to mimic the ”noisy neighbor” e‚ect where another application is using up a lot of resources on the node, in this case either CPU or memory. ‘e third root cause is simply having having a higher request load than normal from the users. ‘e fourth and €‰h root causes are that other services respond slower or return more errors than normal to a pod. All of these root causes are based on problems outside the pod, not any internal problem such as bugs in the code. Given this selection of root causes, if none of the scores are above 0, then the default assumption could be that there is an internal problem at the pod that reported the anomaly. Perhaps there is some bug causing the code to run slow or produce more errors. ‘e conclusions drawn from the reports of the root cause analysis at each pod is done by the author of this thesis for these tests.
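The sketch below shows what such a Locust file could look like with a recent Locust release. The endpoint paths and the class name are assumptions about the demo application; the task weights approximate the 50/25/25 split described earlier.

```python
from locust import HttpUser, task, between

class HRUser(HttpUser):
    wait_time = between(0.5, 2.0)   # sleep 500-2000 ms between requests

    @task(2)                        # ~50% of requests go to Company
    def company(self):
        self.client.get("/company")

    @task(1)                        # ~25% to Employee-salary
    def employee_salary(self):
        self.client.get("/employee-salary")

    @task(1)                        # ~25% to Greet-employee
    def greet_employee(self):
        self.client.get("/greet-employee")
```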

4.2 Experiment design

There were five experiments planned to test the system, one for each of the root causes that can be identified by the system. The tests start by having the anomaly detection system learn the normal behaviour of the microservices. The first anomaly is then introduced, and the output of the anomaly detection and root cause system is recorded for about 20 minutes before the anomaly is removed. The microservices then have a chance to stabilize at normal levels again before the next anomaly is introduced. This continues until all five anomalies have been tested. These tests are only used as a first step to see if the system can work correctly in simple situations.

The first experiment involves creating a CPU shortage on one of the nodes, Node 2 in Figure 9. The shortage is created by adding pods running the tool stress[30] that try to use up as much CPU as possible. This should increase the response time of any pod running on the affected node as well as any pods depending on pods on Node 2. In the second experiment the added anomaly is simply that the number of simulated users is increased from between 30 and 60 to between 130 and 160. This should slow down the response time for all or most of the pods due to a lack of capacity in the nodes. The "Employee" pods are generally a bottleneck for the microservices and this experiment should thus at least trigger an anomaly in them and their dependencies, whereas other microservices such as "Time" might not be slowed down. For the third experiment a network delay is added to the "Randomize" microservice, such that any response from it is delayed by 500 ms. This should slow down any microservices that depend on "Randomize". The delay is introduced using Istio; the proxy sidecar in the "Randomize" pod adds the delay to any outgoing responses. This does, however, not increase the time that the "Randomize" pod itself reports for each request. The fourth experiment is similar to the third, but it uses network errors instead of network delays. The sidecar proxies in the "Employee" pods are configured to respond with an error to 10% of all incoming requests instead of sending the requests to the "Employee" microservices.

This causes any depending microservices to also produce errors, due to bad error handling, which should be detected as anomalies. Similarly to Test 3, the "Employee" pods do not see this as responding with errors; they only notice it in terms of a reduction in the number of requests. In the fifth experiment the anomaly is a resource shortage as in the first experiment, except it is a memory shortage instead of a CPU shortage. The resource shortage is located on Node 1, see Figure 9. The shortage is again created by adding pods that run the tool stress, though this time the tool is configured to use up memory instead of CPU. This should increase the response time for any pods on the affected node or cause them to produce errors. Any pod depending on pods on Node 1 should also see an increase in response time or errors.

4.3 Results

Each test produced anomaly data for each pod that indicates whether there was an anomaly or not. This data can be displayed in a graph, see Figure 11a and Figure 11b for examples, to estimate whether the system detected an anomaly. For each test there is a diagram, similar to Figure 9, that shows which pods detected an anomaly. The tests also produced root cause data that indicates which root cause was responsible for the anomaly. This data can also be displayed as graphs to see if the system detected the correct root cause and how certain it was.

4.3.1 Test 1 - CPU shortage

Figure 10: This shows which pods detected an anomaly in the first experiment. The pods marked in red clearly detected an anomaly, the pods marked in yellow had inconclusive results, and the pods marked in white clearly did not detect an anomaly. The orange pods indicate that there were additional pods on Node 2 that were causing a CPU shortage on the node.

The first scenario evaluates an anomaly in the form of noisy neighbours: vicious pods that consume CPU on one of the nodes. Figure 10 shows which pods detected an anomaly during the test. The pods marked in red clearly detected an anomaly, and for the pods in yellow it was not clear whether they detected an anomaly or not. The orange pods are the ones trying to use up the CPU resources on Node 2 and thus the cause of the anomaly. All pods on Node 2 except for "Time" reported anomalies, and the pods "Greet employee" and "Company", which depend on pods on Node 2, also reported anomalies as expected. However, "Calc-salary" should not have reported any anomalies, even though its data is a bit inconclusive. Values from the anomaly detection data that are above 1 could be considered an anomaly, but due to the volatile nature of this data there should be multiple grouped values above 1 before it is considered an anomaly. Figure 11 shows the execution of the anomaly detection in all pods. In general, as data is normalized, scores (much) larger than 1 are considered significant. The data from the pod "Employee2", shown in Figure 11a, clearly shows high values throughout the test; the low values at the beginning and end are from before the anomaly is introduced and after it has been removed. Likewise, the data from the pod "Randomize", shown in Figure 11b, shows low values below 1 except for a few isolated spikes that go slightly above 1. Figure 11c displays the data from the pod "Calc-salary", and that data is a bit inconclusive. Most of the values are clearly below 1 as expected, but there are many more spikes that go above 1 than in the data from "Randomize", and they are more grouped. Similarly, the data from the pod "Time", shown in Figure 11d and again, cropped, in Figure 11e to more clearly show when values go above 1, is also inconclusive. Here the values are also mostly below 1 but contain some grouped spikes that go above 1. This data also has a few values that are more extreme and go a lot higher than the values from "Calc-salary".

The root cause classification data for the pods "Company" and "Greet employee", shown in Figure 12a and 12b respectively, clearly shows that the anomaly's root cause has been classified as increased response time from other services. This makes sense as neither of them is located on the affected node; instead they depend on pods on Node 2. The data for the pods "Department" and "Employee salary", shown in Figure 12d and 12c respectively, also clearly shows that the anomaly would be classified as increased response time from other services, though both also have a small increase in the score for CPU shortage. This is also as expected, because both depend on "Employee2", which acts as a bottleneck, causing the primary classification. Both are also located on the affected node, which causes the increased score for CPU shortage. Lastly, the data for the pods "Employee2" and "Time", shown in Figure 12e and 12f respectively, shows a high score for CPU shortage as expected. Both also show increased scores for memory shortage, especially "Time", but that is ignored for the reasons stated in Section 5.1. Using the dependencies of the pods one can see that "Greet employee", "Company", "Department", and "Employee salary" point to "Employee2" being slow, which, together with "Time", in turn points to a CPU shortage on Node 2. In total, the system managed to correctly detect or not detect anomalies as expected in 8 out of 10 pods, where the remaining 2 were inconclusive. All pods that were expected to detect an anomaly correctly classified the root cause of the anomaly: abnormal CPU usage.

Figure 11: Anomaly detection data from some of the pods during the test with a CPU anomaly. (a) Anomaly detection data for the "Employee2" pod. (b) Anomaly detection data for the "Randomize" pod. (c) Anomaly detection data for the "Calc-salary" pod. (d) Anomaly detection data for the "Time" pod. (e) Anomaly detection data for the "Time" pod, zoomed in to show the scores with low values in more detail.

Figure 12: Root cause classification data from some of the pods during the test with a CPU anomaly. (a) Root cause analysis data for the "Company" pod. (b) Root cause analysis data for the "Greet employee" pod. (c) Root cause analysis data for the "Employee salary" pod. (d) Root cause analysis data for the "Department" pod. (e) Root cause analysis data for the "Employee 2" pod. (f) Root cause analysis data for the "Time" pod.

4.3.2 Test 2 - increased number of users

Figure 13: This shows which pods detected an anomaly in the second experiment. The pods marked in red clearly detected an anomaly and the pods marked in white clearly did not detect an anomaly. The users are marked in orange to indicate that they are the source of the anomaly; the number of simulated users has increased significantly.

The second scenario introduces an anomaly in the form of an increased number of users accessing the front-end services. Figure 13 shows which pods detected an anomaly during the test. The pods marked in red clearly detected an anomaly. The users are marked in orange to indicate that they are the source of the anomaly. The pods "Greet employee", "Company", "Employee salary", "Department", "Employee1", and "Employee2" all detected anomalies. As the "Employee" pods act as a bottleneck for the microservices, this result is as expected. The root cause data from "Company", "Greet employee", "Employee salary", and "Department", shown in Figures 14a, 14b, 14c, and 14d respectively, all classify their root cause as increased response time from other services, as expected. The data from "Employee2", shown in Figure 14f, classifies the root cause as an increase in requests, also as expected. However, the data from "Employee1", shown in Figure 14e, does not have a high score indicating that the root cause is an increase in requests. That score does increase, compared to before and after the anomaly, but values below 0 should be classified as inconclusive. In total the system managed to correctly detect or not detect anomalies in all pods. 5 out of 6 pods that detected an anomaly then correctly classified the root cause of the anomaly.

Figure 14: Root cause classification data from some of the pods during the test with increased requests as the anomaly. (a) Root cause analysis data for the "Company" pod. (b) Root cause analysis data for the "Greet employee" pod. (c) Root cause analysis data for the "Employee salary" pod. (d) Root cause analysis data for the "Department" pod. (e) Root cause analysis data for the "Employee 1" pod. (f) Root cause analysis data for the "Employee 2" pod.

4.3.3 Test 3 - network delay

Figure 15: This shows which pods detected an anomaly in the third experiment. The pods marked in red clearly detected an anomaly and the pods marked in white clearly did not detect an anomaly. The orange marking around the "Randomize" node indicates that there is a network delay injected at that node, which is the source of the anomaly. The network delay affects all responses going out from that node.

The overall setup and results of the network delay anomaly experiment are shown in Figure 15. The pods marked in red clearly detected an anomaly. The orange box around "Randomize" indicates that a network delay at that pod is the source of the anomaly. The pods "Greet employee", "Employee salary", "Greeting", and "Calc-salary" all detected anomalies. Those are all the pods depending on "Randomize", which are the ones expected to detect anomalies. As the anomaly only affects the time between "Randomize" and its dependants, the "Randomize" pod itself is not expected to detect the anomaly. The root cause data from "Greet employee", "Employee salary", "Greeting", and "Calc-salary", shown in Figures 16a, 16b, 16c, and 16d respectively, all classify their root cause as increased response time from other services, as expected. Since "Greeting" and "Calc-salary" both have "Randomize" as their sole dependency and "Randomize" does not experience any anomaly, we can conclude that the network between these pods is the root cause. In total the system managed to correctly detect or not detect anomalies in all pods. All pods that detected an anomaly then also correctly classified the root cause of the anomaly.

Figure 16: Root cause classification data from some of the pods during the test with a network delay as the anomaly. (a) Root cause analysis data for the "Greet employee" pod. (b) Root cause analysis data for the "Employee salary" pod. (c) Root cause analysis data for the "Greeting" pod. (d) Root cause analysis data for the "Calc-salary" pod.

4.3.4 Test 4 - network error

Figure 17: This shows which pods detected an anomaly in the fourth experiment. The pods marked in red clearly detected an anomaly and the pods marked in white clearly did not detect an anomaly. The orange marking around the "Employee" nodes indicates that there are network errors injected at those nodes, which is the source of the anomaly. The network errors affect some of the requests going to those nodes, returning an error instead of letting the request reach the nodes.

In the fourth scenario, an anomaly in the form of network errors was introduced. Figure 17 shows which pods detected an anomaly during the test. The pods marked in red clearly detected an anomaly. The orange "Employee" service indicates that network errors at those pods are the source of the anomaly; some of the requests get error responses instead of reaching the "Employee" containers. The pods "Greet employee", "Company", "Employee salary", and "Department" all detected anomalies. Those are all the pods depending on the "Employee" pods, which are the ones expected to detect anomalies. As the errors occur before the requests arrive at the "Employee" containers, the "Employee" pods are not expected to see the anomaly. The root cause data from "Greet employee", "Company", "Employee salary", and "Department", shown in Figures 18a, 18b, 18c, and 18d respectively, all classify their root cause as an increased number of errors from other services, as expected. The "Department" pod's only dependency is the "Employee" pods, which implies that the anomaly is located there. However, "Greet employee" and "Employee salary" also have other dependencies, which means that other pods could also be returning errors. Still, the likely conclusion to draw is that there are network errors for requests to the "Employee" pods. In total the system managed to correctly detect or not detect anomalies in all pods. All pods that detected an anomaly then also correctly classified the root cause of the anomaly.

Figure 18: Root cause classification data from some of the pods during the test with a network error as the anomaly. (a) Root cause analysis data for the "Greet employee" pod. (b) Root cause analysis data for the "Company" pod. (c) Root cause analysis data for the "Employee salary" pod. (d) Root cause analysis data for the "Department" pod.

4.3.5 Test 5 - memory shortage

Figure 19: This shows the results from testing during development of this experiment. The pods marked with a cross were randomly evicted by Kubernetes due to lack of memory on the node, the pod marked in yellow had inconclusive results as its dependency was occasionally evicted, and the pods marked in white clearly did not detect an anomaly. The orange pods indicate that there were additional pods on Node 1 that were causing a memory shortage on the node.

Unfortunately it was not possible to perform this test and get any meaningful data. The main effect of adding the anomaly was not that the pods ran slower or produced errors. Instead, when the node ran out of memory, Kubernetes would evict (shut down) a pod on the node. When the pod restarted, the node would eventually run out of memory again and a pod would be evicted again. This made the scenario very hard to test, given that the anomaly detection is based on looking at specific pods and starts to fail if any of those pods are evicted.

5 Conclusions

Overall this project has produced a system that is capable of gathering relevant metrics, detecting anomalies, and finding root causes in a microservice cluster. The anomaly detection and root cause analysis have also been evaluated using a real microservice cluster with real injected anomalies. However, no automatic remediation has been added to the system. Thus all goals set for the project in Section 1.2, except the optional goal related to automatic remediation, have been fulfilled to some extent. This chapter will discuss to what extent the goals are met, algorithmic considerations, system aspects, and future work.

The first goal states that the system should be able to monitor the microservice cluster and gather relevant metrics from it. The system monitors the cluster through cAdvisor and the service mesh Istio, together with the tracing from Jaeger. cAdvisor provides hardware metrics for all nodes and pods in the cluster while Jaeger and Istio provide different application metrics. Together they provide thorough monitoring and gather the key metrics for all three monitoring levels (cluster, pod, and application); thus the first goal is achieved.

The second goal states that the system should have an anomaly detection algorithm that can detect performance issues in the microservices. The implemented anomaly detection based on the BIRCH clustering algorithm can detect anomalies given relevant metrics. The proposed metrics of average response time and error rate represent the performance of the services as experienced by the users. This rather simple anomaly detection method should thus be able to detect a broad range of anomalies. Due to the volatile nature of the used metrics there are sudden increases or decreases in the anomaly detection score. The system should contain a way to mitigate this behaviour in order to have reliable anomaly detection. In total, the second goal is partially achieved.

The third goal states that the system should have a root cause analysis algorithm that can determine why an anomaly occurred. The provided root cause analysis can, for each pod, find the most likely cause of that pod's anomaly, if it is included in a predefined list of causes. Given the output from the root cause analysis for each pod, combined with information about the microservice cluster architecture, the true root cause can be found, as outlined in the discussions of the tests in Section 4.3. This is currently not automated by the system, which would be needed for this goal to be completely achieved. Therefore, the third goal is partially achieved.

The fourth, optional goal states that the system should also have automatic remediation that can take actions to resolve the issues that the root cause analysis finds. Due to a lack of time this goal was not pursued and not achieved.

The fifth and last goal states that the system should be tested in a real microservice cluster to see how well the anomaly detection and root cause analysis work. The system was tested with the experiments described in Chapter 4 and the overall results were positive. Most anomalies were detected as expected and the root cause analysis also worked as expected. The goal is achieved, but it would have been interesting to evaluate an automated root cause analysis system with more tests and more predefined root causes.

5.1 Algorithmic considerations

As the anomaly detection and root cause analysis are run for each pod, the location of the anomaly is usually easy to obtain. Depending on the type of anomaly this might not be entirely true, e.g. in the case of a CPU shortage the system might not know which pod is the true root cause, but at least the node that has the CPU shortage should be located.

As the whole system is based on learning the normal behaviour of the microservice cluster, problems arise if the normal behaviour changes over time, so-called model drift. In the general case where traffic increases slowly due to an increase in users, this problem can be handled by periodically re-training the system. However, if the microservices' normal behaviour varies often, such as low usage during the night but high during the day, it could be impractical to retrain the system several times per day. A possible fix would be to have several trained models that the system could cycle through during the day, but such a solution requires knowledge of when to change models.

The system has been built to work in a cluster where pods can come and go, thanks to having the anomaly detection and root cause analysis look at microservice pods separately. But this does require highly specific models that describe the normal behaviour. Thus each new microservice pod requires training a new model. This problem is amplified by the problem described in the previous paragraph, causing the system to potentially retrain several models each time the normal behaviour changes.

During the development and testing of the different root causes, some problems were found with the root cause indicating memory shortage on the node. In general the memory metric was hard to normalize properly; the recorded averages and standard deviations were often not very representative of the values during the test. This led to the metric often being too high, or sometimes too low, to use with the other metrics. During tests the score for this root cause could therefore be much higher than any other root cause, but when looking at the actual amount of memory used by the pod and the node, it did not seem unusually high or low. The score could also be high without adding any anomalies. A high or low memory load can be a good indication that something is wrong, perhaps there is a memory leak in the system. Unfortunately this is not always the case. As an example, the Java runtime is notorious for allocating all memory that it can, no matter the actual load on the application. That of course makes memory usage a very poor indicator of performance. The way memory usage was used as a metric in this project was not useful. For future versions it would need to be incorporated in a different way.

5.2 System aspects

This system is designed to work with most microservice clusters. If the microservice cluster is running on Kubernetes and can add Istio, Prometheus, Jaeger, and cAdvisor (or preferably has access to modify the cAdvisor in the kubelet), then it should be compatible. Depending on how the communication between services works, some modifications to the code will be needed. This would mostly affect tracing, although with HTTP requests this only requires copying some headers from the incoming request to any resulting outgoing requests.

This system bases most metrics on traces from Jaeger, which requires handling a lot of data. The traces from Jaeger that these microservices produce in around two seconds are often several megabytes large. This leads to some practical challenges depending on where the anomaly detection software runs. If the software runs outside the microservice cluster, the amount of time required to get data from Jaeger might be too long. During development, responses from Jaeger could sometimes take more than two seconds when the microservice cluster was running on Google Kubernetes Engine and the anomaly detection software was running on a local computer, which is obviously not acceptable when the software is supposed to iterate once every two seconds. This was fixed by moving the anomaly detection software onto the microservice cluster, greatly increasing network speed, but alternative solutions also exist, as discussed in Section 5.3.

5.3 Future work

As mentioned earlier, the current system only considers each pod in isolation during root cause analysis and then requires a human to analyze the results of several instances in order to get a complete understanding of the root cause of a given anomaly. In essence, this follows the procedure outlined in the discussions of the four anomaly scenarios in Section 4. This should be automated by extending the system to include a module that does fully automated cluster-wide root cause analysis. The system would then be able to report a more complete picture of the situation that can be used to determine the most suitable remediation action(s). The remediation process that tries to resolve the root cause could also be automated or done by humans.

The scores from both the anomaly detection and root cause analysis can be rather volatile. More importantly, they can contain spikes where the score is very high for one iteration before going back to a relatively stable level again, as can be seen in the test results in Section 4.3. These spikes and unstable values are a problem, as the system should not warn about an anomaly if there was just a spike in the score. The system should instead only send warnings when the score has been sufficiently high for a period of time. The spikes in the scores are likely partially caused by having a rather short monitoring and detection interval of two seconds. Longer intervals should smooth out these spikes, but then the system might not respond as quickly to anomalies. A possible solution would be to implement some sort of state machine that has two modes, anomaly present or anomaly not present, plus a counter that increments or decrements when it gets a high or low score. Then, whenever the counter becomes sufficiently high or low, the state machine would change state. Alternative solutions might exist and the specifics would need to be tested, but the general idea of not warning about anomalies because of a single spike should be implemented.

The two defined root causes in the system that indicate that another service is slow or is producing errors could be improved. Currently those root cause analyses are based on all outgoing requests to other services being aggregated into the same metric. This is probably not optimal when a service sends requests to many other services. In such scenarios, it is harder to locate which service is actually causing the slowdown. An improvement would be to instead dynamically create two metrics, one for response times and one for errors, for each dependency. Such a solution would make it much clearer which service is the slow one. If the service is replicated in several pods this still creates a problem, as the metric would not point at a specific pod if any particular pod is the root cause. A possible solution to this problem is to split the metric again for each pod. Depending on the number of replicas and the amount of traffic to those replicas, this could be quite unstable if the load balancer changes the amount of traffic that goes to each replica. In some iterations one replica might receive many requests, and in the next iteration the requests go to other replicas. It could perhaps work, but it definitely requires some testing.
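A minimal sketch of the debouncing state machine suggested above is given below; the threshold, the counter bound, and the score sequence are illustrative values only.

```python
class AnomalyDebouncer:
    """Hysteresis around the anomaly score so single spikes do not raise alarms."""

    def __init__(self, threshold=1.0, trigger=5):
        self.threshold = threshold   # score above this counts as "high"
        self.trigger = trigger       # counter bound needed to flip state
        self.counter = 0
        self.anomalous = False

    def update(self, score):
        # Move the counter up on high scores and down on low ones,
        # clamped between 0 and the trigger value.
        if score > self.threshold:
            self.counter = min(self.counter + 1, self.trigger)
        else:
            self.counter = max(self.counter - 1, 0)
        # Only flip state at the bounds, never on a single sample.
        if self.counter == self.trigger:
            self.anomalous = True
        elif self.counter == 0:
            self.anomalous = False
        return self.anomalous

deb = AnomalyDebouncer()
scores = [0.2, 3.0, 0.3, 0.2, 1.5, 1.8, 2.2, 2.6, 2.1]
print([deb.update(s) for s in scores])
# the lone spike at 3.0 is ignored; the sustained run at the end triggers
```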

Currently the anomaly detection only looks at the average response time and error rate of each microservice. This seems like a good approximation of the overall performance, but there might be room for improvement. For response times, the system could look at percentiles, preferably a few different percentiles, to get more representative data [31]; a small sketch of such a summary is given at the end of this section. For errors, the system could classify errors into categories. There might also be other useful metrics besides response time and error rate, but keeping the number of metrics as low as possible is beneficial. The anomaly detection should only use the most important metrics that are directly tied to customer performance, whereas other metrics can be used in the root cause analysis.

As mentioned in Section 5.2, there is a lot of data produced by Jaeger. The current system requests all relevant data and processes it for each pod and each iteration. However, Jaeger sends complete traces from the exposed front end service down to the lowest back end service, even though the anomaly detection system only requires a fraction of that information. To make this more efficient, a separate piece of software could scrape all traces from Jaeger and let the anomaly detection system query it for more specific information. This would reduce the amount of network traffic and make the anomaly detection a bit simpler, possibly allowing the anomaly detection and root cause analysis to run outside the cluster without the acquisition of tracing data affecting the performance of deployed services.
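To make the percentile suggestion above more concrete, the sketch below summarizes the response times of one monitoring interval with a few percentiles instead of a single average. The function name and the choice of the 50th, 95th, and 99th percentiles are assumptions for illustration, not part of the thesis implementation.

import numpy as np

def response_time_features(latencies_ms):
    """Summarize one monitoring interval with a few latency percentiles
    instead of a single average."""
    samples = np.asarray(latencies_ms, dtype=float)
    return {
        "p50": float(np.percentile(samples, 50)),
        "p95": float(np.percentile(samples, 95)),
        "p99": float(np.percentile(samples, 99)),
    }

# Example: the tail (p99) reacts to a few slow requests that an
# average would largely hide.
print(response_time_features([12, 14, 15, 13, 14, 16, 250, 900]))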

References

[1] Yan Cui. Capture and forward correlation IDs through different Lambda event sources. https://hackernoon.com/capture-and-forward-correlation-ids-through-different-lambda-event-sources-220c227c65f5 (visited 2019-05-23).

[2] gRPC. gRPC main page. https://grpc.io/ (visited 2019-05-21).

[3] Weaveworks. What are Microservices? https://www.weave.works/blog/what-are-microservices/ (visited 2019-04-17).

[4] AWS. Microservices on AWS. https://docs.aws.amazon.com/whitepapers/latest/microservices-on-aws/introduction.html (visited 2019-05-21).

[5] Docker. What is a container? https://www.docker.com/resources/what-container (visited 2019-04-16).

[6] Doug Chamberlain. Containers vs. Virtual Machines (VMs): What’s the Difference? https://blog.netapp.com/blogs/containers-vs-vms/ (visited 2019-04-17).

[7] Kubernetes. Learn Kubernetes basics. https://kubernetes.io/docs/tutorials/kubernetes-basics/ (visited 2019-04-16).

[8] Elastisys. Setting up highly available Kubernetes clusters. https://elastisys.com/wp-content/uploads/2018/01/kubernetes-ha-setup.pdf (visited 2019-04-17).

[9] NGINX. What Is a Service Mesh? https://www.nginx.com/blog/what-is-a-service-mesh/ (visited 2019-04-17).

[10] Istio. What is Istio? https://istio.io/docs/concepts/what-is-istio/ (visited 2019-04-16).

[11] Envoy. Envoy main page. https://www.envoyproxy.io/ (visited 2019-05-21).

[12] Varun Chandola, Arindam Banerjee, and Vipin Kumar. Anomaly detection: A survey. ACM Comput. Surv., 41(3):15:1–15:58, July 2009.

[13] Pavel Laskov, Patrick Düssel, Christin Schäfer, and Konrad Rieck. Learning intrusion detection: Supervised or unsupervised? In ICIAP, 2005.

[14] Shekhar R. Gaddam, Vir V. Phoha, and Kiran S. Balagani. K-means+ID3: A novel method for supervised anomaly detection by cascading k-means clustering and ID3 methods. IEEE Trans. on Knowl. and Data Eng., 19(3):345–354, March 2007.

[15] Anton Gulenko, Florian Schmidt, Marcel Wallschläger, Alexander Acker, Odej Kao, and Sören Bäcker. Self-healing cloud platform. https://www.researchgate.net/project/Self-Healing-Cloud-Platform (visited 2019-05-25).

[16] Marcel Wallschläger, Anton Gulenko, Florian Schmidt, Odej Kao, and Feng Liu. Automated anomaly detection in virtualized services using deep packet inspection. Procedia Computer Science, 110:510–515, December 2017.

[17] Florian Schmidt, Florian Suri-Payer, Anton Gulenko, Marcel Wallschläger, Alexander Acker, and Odej Kao. Unsupervised anomaly event detection for cloud monitoring using online ARIMA, November 2018.

[18] Anton Gulenko, Florian Schmidt, Alexander Acker, Marcel Wallschläger, Odej Kao, and Feng Liu. Detecting anomalous behavior of black-box services modeled with distance-based online clustering, September 2018.

[19] Roman Nikiforov. Clustering-based anomaly detection for microservices. CoRR, abs/1810.02762, 2018.

[20] Moogsoft. Moogsoft product page. https://www.moogsoft.com/product/ (visited 2019-05-27).

[21] Grok. How Grok works. https://www.grokstream.com/how-grok-works/ (visited 2019-05-27).

[22] Dynatrace. Dynatrace platform page. https://www.dynatrace.com/platform/ (visited 2019-05-27).

[23] cAdvisor. GitHub page for cAdvisor. https://github.com/google/cadvisor (visited 2019-04-17).

[24] Prometheus. Prometheus documentation - Overview. https://prometheus.io/docs/introduction/overview/ (visited 2019-04-17).

[25] Jaeger. Jaeger documentation - Introduction. https://www.jaegertracing.io/docs/1.11/ (visited 2019-04-17).

[26] Tian Zhang, Raghu Ramakrishnan, and Miron Livny. BIRCH: An efficient data clustering method for very large databases. SIGMOD Rec., 25(2):103–114, June 1996.

[27] scikit-learn. Birch class documentation. https://scikit-learn.org/stable/modules/generated/sklearn.cluster.Birch.html (visited 2019-05-08).

[28] Ballerina. Ballerina main page. https://ballerina.io/ (visited 2019-05-23).

[29] Locust. Locust main page. https://locust.io/ (visited 2019-05-13).

[30] Amos Waterland. Stress tool. (visited 2019-05-27).

[31] Jeffrey Dean and Luiz André Barroso. The tail at scale. Commun. ACM, 56(2):74–80, February 2013.
