AUTOMATIC ANOMALY DETECTION AND ROOT CAUSE ANALYSIS FOR MICROSERVICE CLUSTERS

Viktor Forsberg

Master Thesis, 30 credits. Supervisor: Johan Tordsson. External supervisor: Johan Tordsson. Civilingenjörsprogrammet i teknisk datavetenskap, 2019.

Abstract

Large microservice clusters deployed in the cloud can be very difficult to both monitor and debug. Monitoring these clusters is a first step towards detection of anomalies, deviations from normal behaviour. Anomalies are often indicators that a component is failing or is about to fail and should hence be detected as soon as possible. There are often a large number of metrics available to view. Furthermore, any errors that occur often propagate to other microservices, making it hard to manually locate the root cause of an anomaly. Because of this, automatic methods are needed to detect and correct the problems. The goal of this thesis is to create a solution that can automatically monitor a microservice cluster, detect anomalies, and find a root cause. The anomaly detection is based on an unsupervised clustering algorithm that learns the normal behaviour of each service and then looks for data that falls outside that behaviour. Once an anomaly is detected, the proposed method tries to match the data against predefined root causes. The proposed solution is evaluated in a real microservice cluster deployed in the cloud, using Kubernetes together with a service mesh and several other tools to help gather metrics and trace requests in the system.

Acknowledgements

I would like to thank everyone at Elastisys for giving me an interesting thesis and a nice place to work. I have had fun and learned a lot thanks to them. My supervisor, Johan Tordsson, deserves a separate thanks. His help has been very important; he always gave good feedback and ideas during our discussions. Lastly, I also want to thank my family and friends for their overall support and for pushing me to get things done during this project.

Contents

1 Introduction
   1.1 Motivation
   1.2 Problem formulation and goals

2 Background
   2.1 Microservices
   2.2 Containers and container orchestration
   2.3 Service mesh
   2.4 Related work

3 Method
   3.1 Algorithm for anomaly detection and root cause analysis
   3.2 Metrics and tracing
   3.3 Anomaly detection
   3.4 Root cause analysis

4 Evaluation
   4.1 Testbed
   4.2 Experiment design
   4.3 Results
      4.3.1 Test 1 - CPU shortage
      4.3.2 Test 2 - increased number of users
      4.3.3 Test 3 - network delay
      4.3.4 Test 4 - network error
      4.3.5 Test 5 - memory shortage

5 Conclusions
   5.1 Algorithmic considerations
   5.2 System aspects
   5.3 Future work

1 Introduction

1.1 Motivation

Today many large applications use the microservice architecture to split the application into small pieces. These applications are then also often deployed on clusters in the cloud. Such deployments can be very complex, see Figure 1, and hard to monitor and debug. But monitoring these systems is crucial as it provides very valuable information on how the system is performing. The monitoring is a first step towards detecting anomalies, deviations from normal behaviour. Anomalies are often indicators that a component is failing or is about to fail, and these anomalies should hence be detected as soon as possible. A variety of anomaly types exists, such as intrusions, fraud, bad performance, and more. However, this thesis will only focus on performance anomalies. There are often lots of metrics available to monitor, and any errors that occur often propagate to other microservices, making it hard to manually locate the root cause of an anomaly. Because of this, automatic methods are needed to detect and correct the problems.

Figure 1: Examples of large microservice clusters in use today. Figure from [1].

1.2 Problem formulation and goals

The main purpose of this thesis is to investigate to what extent it is possible to create a solution that can automatically detect when an application in a microservice cluster has bad performance (anomaly detection) and why (root cause analysis). The thesis has the following goals to fulfill that purpose.

• Goal 1: Create a system that monitors a microservice cluster and gathers relevant metrics about it.

• Goal 2: Add an anomaly detection algorithm that uses the metrics to detect when the microservices have performance issues.

• Goal 3: Add a root cause analysis algorithm to find why the microservices have performance issues.

• Goal 4: (Optional) Add automatic remediation that takes actions to resolve the performance issues once the root cause is found.

• Goal 5: Evaluate the developed system in a real microservice cluster by injecting performance anomalies and measuring how well the system detects anomalies and determines their root cause.

The optional goal will only be pursued if, once the earlier goals are fulfilled, there is enough time left to do both it and the evaluation. Otherwise only the evaluation will be done.

2 Background

2.1 Microservices

The microservice architecture is a way to structure an application into small, separated pieces: microservices. This contrasts with the traditional way of building software as a monolith application, see Figure 2 for an illustrated comparison. Each microservice should be responsible for a specific part of the application and only loosely coupled to other services, communicating over the network using lightweight protocols such as HTTP/REST or gRPC[2]. It should also be possible to develop, test, maintain, and deploy each microservice independently of others. This architecture results in a system where each individual piece is easy to understand and where developers can rapidly make changes. However, compared to a monolith application there is some additional complexity added in the communication between microservices. Handling delays or failures can be problematic, and if the API of a microservice changes, all other microservices that interact with this API also need to change accordingly. [3][4]

Figure 2: A monolith application (left) compared to a microservice application (right). The business logic and data access from the left side are split up and put into the different microservices on the right side. Figure from [3].

2.2 Containers and container orchestration

A popular way to facilitate the deployment of a microservice application is to use containers and container orchestration. A container is a packaged piece of software together with a list of dependencies that can be run in a standardized way. Containers are run inside a container engine that virtualizes the underlying operating system and also isolates each container from the others and from the outside environment. In this project Docker[5] is used to containerize the microservices. Containers are similar to virtual machines in the sense that both make it possible to run several isolated applications on the same machine. The difference is that a virtual machine virtualizes the underlying hardware and then runs an operating system in each virtual machine. This makes containers more lightweight, efficient, and faster to start, see Figure 3 for an illustration. [5]

Figure 3: Running applications in containers is more efficient and lightweight than running them in separate virtual machines. Figure from [6].

Each microservice is then provisioned in its own containers that can be replicated to scale up capacity. These containers are often also spread out over several machines, called nodes, in a cluster. In order to manage many containers across multiple nodes, a container orchestration tool is often needed. Such a tool helps with the deployment, management, scaling, and networking of the containers. The tool used in this project is called Kubernetes[7].

Figure 4: The general components in a Kubernetes cluster. Figure from [8].

The basic unit in Kubernetes is called a pod. A pod contains one or more containers that run on the same node and that can share some resources. On each node Kubernetes has a container engine that runs the containers in the pods. There is also a component called kubelet that controls the pods and containers. Lastly, Kubernetes adds kube-proxy, a component that provides some network functionality that simplifies communication between pods. Kubernetes controls the nodes in the cluster from a separate master node. The master consists of three major parts: the apiserver, the scheduler, and some controllers. The apiserver is the API that can control other objects in the cluster, such as pods, nodes, and deployments. This is what an administrator connects to when making changes to a cluster. The apiserver is stateless and instead stores the state in a distributed key-value store, etcd. The scheduler is responsible for scheduling all pods on the available nodes, in a way that does not overload any node while following any other placement restrictions set by the developers. The possible restrictions include that pods should/should not be placed on a set of nodes (node selector or node affinity/anti-affinity), that pods should/should not be placed with other specific pods (pod affinity/anti-affinity), or that no pods may be placed on a node unless they are specifically allowed (taints and tolerations). Lastly, the controllers watch the state of the cluster and try to change it into the desired state, e.g. ensuring that the correct number of pods are running. Figure 4 shows the components in a Kubernetes cluster. To control the traffic coming into the cluster, an ingress is often added. The ingress can give external URLs to services and load balance the traffic to the services. Likewise, an egress can control the traffic that goes out from the cluster.

2.3 Service mesh

A service mesh can be used for load balancing, routing, monitoring, security, and more. In a service mesh the microservice instances, the pods in this case, are paired with proxies that simplify the communication between services. The traffic to and from each pod goes through these proxies, letting them control and monitor the network. In the service mesh there is also a control plane that manages these proxies and their interactions[9][10]. This project uses the service mesh Istio[10]. Istio uses Envoy[11] as the proxy of choice; a sidecar container with Envoy is added to each microservice pod. A sidecar container is simply an extra container added to a pod. Istio then adds some components that act as the control plane, enabling the traffic management, security, and configuration for the proxies. Figure 5 shows the different components in Istio. Istio also has the option of ingress/egress gateways to control the traffic coming into and going out of the cluster.

Figure 5: The general components in the Istio service mesh. Figure from [10].

2.4 Related work

The approach of using machine learning to do automatic anomaly detection in computer systems has been researched extensively. Previous work in the area can be divided broadly into supervised and unsupervised methods. Anomaly detection is also used in several other areas as diverse as fraud detection, intrusion detection, errors in patient records due to instrumentation or recording errors, and error detection in factory machines [12]. One paper focusing on intrusion detection compared both supervised and unsupervised anomaly detection. It found that supervised methods outperformed unsupervised methods when dealing with known attacks. But on unknown attacks the methods performed similarly, making the unsupervised method preferable due to the need for labeled training data in supervised methods [13]. Another paper researched the use of combining both unsupervised and supervised machine learning, in particular k-means clustering and ID3 decision trees. The combined method was compared to the individual methods on test data from three different anomaly detection domains: network anomaly data, Duffing equation data, and mechanical system data. The paper concluded that the combined method's performance was mostly in between the other methods' performance [14].

Researchers at TU Berlin have studied several anomaly detection systems with the goal to create a system that can do automatic real-time anomaly detection, root cause analysis, and recovery [15]. They have looked into using mostly unsupervised methods to implement the anomaly detection. Some of their research includes using deep packet inspection in virtualized services [16], online ARIMA (a regression model) in cloud monitoring [17], and distance-based online clustering for black-box services [18]. The last paper has significantly influenced the anomaly detection methods used in this project. One key difference in this project is the focus on higher-level metrics from a service mesh and tracing, which closely relate to user experience, instead of hardware metrics. Another recent paper also used clustering to detect anomalies in microservice clusters [19]. Similar to the other papers mentioned, that paper used hardware metrics to detect anomalies instead of higher-level metrics.

There are also several commercial solutions for anomaly detection and/or root cause analysis. One of them is Moogsoft [20], which can take anomaly reports from other tools and then reduce and correlate them into a smaller number of anomalies. They also run root cause analysis to find the problem. Another tool called Grok [21] can run anomaly detection on many metrics in a cluster and integrate this anomaly detection with other tools. Both Moogsoft and Grok also use machine learning, primarily unsupervised, in order to achieve their anomaly detection and/or root cause analysis. A third tool called Dynatrace [22] offers a full suite of monitoring, anomaly detection, and root cause analysis. Dynatrace's anomaly detection and root cause analysis seems to be based on a deterministic AI with some added machine learning.

3 Method

This chapter describes the developed system to detect anomalies and determine their root causes from an algorithmic perspective. The following sections in turn focus on an algorithm for the overall operation of the system, how the metrics are gathered, how the anomaly detection works, and how the root cause analysis works. The system has two main phases: a learning phase that tries to learn the normal behaviour of the microservice application, and a detection phase that tries to detect anomalies and then find their root causes. In order to accomplish this, the system uses monitoring metrics gathered from the microservice cluster.

3.1 Algorithm for anomaly detection and root cause analysis

Below is an algorithm for the overall system; the different parts are described in the following sections.

Algorithm 1: Overall operation of the system
Set iteration length T
Set number of iterations I needed to learn normal behaviour
/* Learning normal behaviour */
for I iterations do
    Gather metrics for last T seconds
    Normalize metrics
    Add them to clustering algorithm
    Save metrics for root cause analysis if a new maximum is found
    Wait T seconds
Get finished clusters

/* Anomaly detection and root cause analysis */
Set M to suitable value for filtering out possible anomalies close to clusters
while true do
    Gather metrics for last T seconds
    Normalize metrics
    Calculate distance to closest cluster for normalized metrics
    if distance > M * radius of cluster then
        /* Anomaly detected */
        Calculate scores for defined root causes
        Notify about most likely root causes
    Wait T seconds

3.2 Metrics and tracing

In general, monitoring information about the state of a microservice cluster can be divided into three levels.

• Cluster: Metrics for the cluster as a whole or individual nodes in the cluster. This primarily includes hardware metrics for each node, such as CPU, memory, and disk, but it can also be container orchestration metrics such as the number of nodes in the cluster, number of pods on a node, resources allocated on a node, and number of pending pods.

• Pod: Metrics for individual pods and containers. This primarily includes hardware metrics for the pod or container, such as CPU, memory, and disk.

• Application: Specific performance metrics for the application, such as response times, error rates, number of requests, and size of request payloads.

Figure 6: The setup for gathering metrics from a microservice cluster.

For this project the monitoring information is gathered in two ways, see Figure 6 for an illustration. Firstly, there are tools, such as cAdvisor[23], that monitor the nodes in the cluster and gather metrics about each node and the containers/pods running there. This covers metrics from the cluster level and pod level. The metrics can then be saved and aggregated in Prometheus[24] or a similar tool. With Prometheus one can query metrics for specific time intervals, averages, rates of change, etc. Secondly, distributed tracing can be used to monitor the traffic within the cluster. Tracing is the process of tracking a request as it moves through the system. This is done by adding some meta information when a request comes to the ingress gateway and then also adding it to all communication within the cluster that is initiated for that specific request. The proxies that run in each node can then look at that meta information and collectively determine the path that the request took through the system. With other information saved at the proxies, such as timestamps and status codes, the tracing can provide information about latencies, error rates, number of requests, and more. This covers metrics from the application level. In this project, Jaeger is used, a tool that enables distributed tracing by gathering all of the information and then compiling the traces for each request[25].

Each iteration of the system, both in the learning phase and the detection phase, lasts T seconds, and in each iteration the system uses metrics from the last T seconds. Thus larger values of T mean that the system uses older metrics, which in turn means that it can take longer to detect anomalies. Conversely, smaller values of T increase the overhead of gathering metrics, as frequent monitoring uses more computation and also more data storage. To handle this tradeoff the system should strive to iterate as often as possible, while keeping the overhead at an acceptable level. With a system like this where data goes through multiple components before metrics can be used, there is also added overhead for each step in the monitoring chain, e.g. some data in this system goes from cAdvisor on each node to Prometheus and then to the anomaly detection. Some steps are usually necessary as they provide features that make the process more efficient in other ways, e.g. by offering centralized access, condensing the data, or running queries. But too many might slow down the system if the overhead outweighs the efficiency improvements from the new features. The current version of the system iterates once every other second, a frequency determined empirically as there were problems during development with updating metrics more frequently.
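As an illustration of how the cluster- and pod-level metrics can be pulled each iteration, the sketch below queries Prometheus's HTTP API for a pod's CPU usage over a short window. The Prometheus address, the metric labels, and the window length are assumptions about the deployment rather than values prescribed by the thesis.

```python
import requests

PROMETHEUS = "http://prometheus.monitoring:9090"  # assumed in-cluster address

def pod_cpu_usage(pod_name: str, window: str = "10s") -> float:
    """Average CPU usage (in cores) of a pod over the given window.

    container_cpu_usage_seconds_total is exported by cAdvisor; rate() turns
    the counter into cores used per second.
    """
    promql = f'sum(rate(container_cpu_usage_seconds_total{{pod="{pod_name}"}}[{window}]))'
    resp = requests.get(f"{PROMETHEUS}/api/v1/query", params={"query": promql})
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0

# Example: CPU used by a (hypothetical) pod of the Employee service.
print(pod_cpu_usage("employee2-5d9f7c6b8-abcde"))
```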

3.3 Anomaly detection

Anomalies in a system can be defined as deviations from normal behaviour. Anomalies are often indicators that a component is failing or is about to fail and should hence be detected as soon as possible. Anomaly detection is the process of trying to detect anomalies so that someone or something can react to them and fix any underlying problem. There are many ways to do anomaly detection, though the core is looking at different metrics from the system to find deviations. This can be done either manually by system administrators or automatically by software. As microservice applications often produce a large amount of metrics to analyze, a manual process is impractical. Looking at automatic solutions, several types have been tried and are used today, e.g. some sort of machine learning algorithm, either supervised or unsupervised. Using supervised machine learning requires labeled training data, which can be hard to produce and which furthermore cannot capture every possible anomaly. This project instead focuses on unsupervised machine learning, where unlabeled data is used by the algorithm. This reduces the amount of domain-specific knowledge needed and could provide a system that is able to detect unexpected types of anomalies. The general idea for the anomaly detection in this system is to learn the normal behaviour of the microservice application, then look for any deviations from this behaviour. This is done per instance of each microservice, to look for anomalies at each pod instead of in the microservice cluster as a whole. This narrows down the location of each anomaly and thus makes the root cause analysis more focused. The solution uses a clustering algorithm to learn the normal behaviour of each microservice. Multiple metrics from the components are given to the algorithm each second, both during the learning phase and the detection phase. During the learning phase the algorithm uses the metrics to form a set of multidimensional clusters that represent the state of the system. The clusters are then a representation of the normal behaviour. Later, during the detection phase, this representation is used to detect anomalies by checking whether the newly gathered metrics fall inside the clusters or not. If they do not, it is treated as an anomaly, see Figure 7 for an illustration.

Figure 7: Clusters that represent the normal behaviour. Data points outside the clusters are considered anomalies.

When choosing metrics, the idea is to use only the most important metrics that represent the performance of a microservice as seen by the users. The users are only directly impacted by application-level metrics, not cluster-level or pod-level metrics; e.g. if the response time is fast, it does not matter to the user whether a pod uses more CPU than normal. Because of this, the anomaly detection in this project uses two metrics: time to answer a request (response time) and error rate. To keep it simple, the response time is an average and the error rate counts any HTTP response that is not a success (status code 200); other options are discussed in Section 5.3. In this project a version of the clustering algorithm BIRCH[26] (balanced iterative reducing and clustering using hierarchies) is used. During the first phase of BIRCH, several small clusters are created to fit the data given to it. BIRCH is an online clustering algorithm, meaning that it can be fed incremental data during the training instead of reading all data at once. BIRCH has some additional phases to condense the clusters and then refine them; these are not used in this project. Once the training is completed, the resulting clusters are used as the normal behaviour. This project used an implementation of BIRCH from the Python library scikit-learn[27]. The BIRCH clusters have a radius of 0.5, and if any new data cannot be merged into an existing cluster without increasing the radius above 0.5, a new cluster is created. In order to have meaningful clusters, all gathered metrics are normalized to have an average of 0 and standard deviation of 1. In order to filter out minor deviations in behaviour that could perhaps be considered normal, this system only classifies data as anomalous if the distance to the nearest cluster is significantly greater than the radius of the cluster. For this project the threshold is set to two times the radius of the clusters, i.e. a distance of 1. The distance to the closest cluster is called the anomaly detection score; a high score means that there is likely an anomaly.
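A minimal sketch of this detection scheme, using the Birch implementation and standard scaler from scikit-learn, is shown below. The training points and the test sample are invented for illustration; the real system feeds in the per-pod response time and error rate gathered every iteration.

```python
import numpy as np
from sklearn.cluster import Birch
from sklearn.preprocessing import StandardScaler

RADIUS = 0.5            # BIRCH subcluster radius used in this project
THRESHOLD = 2 * RADIUS  # scores above this are treated as anomalies

# Learning phase: normal samples of [response_time (s), error_rate].
normal = np.array([[0.12, 0.00], [0.15, 0.01], [0.11, 0.00], [0.14, 0.02]])
scaler = StandardScaler().fit(normal)
model = Birch(threshold=RADIUS, n_clusters=None)   # n_clusters=None: first phase only
model.fit(scaler.transform(normal))

def anomaly_score(sample):
    """Distance from the normalized sample to the nearest BIRCH subcluster centre."""
    x = scaler.transform([sample])
    return np.linalg.norm(model.subcluster_centers_ - x, axis=1).min()

# Detection phase: a slow, error-prone sample scores well above the threshold.
score = anomaly_score([0.95, 0.10])
print(score, score > THRESHOLD)
```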

3.4 Root cause analysis

Once the anomaly detection finds an anomaly, root cause analysis follows, i.e., the process of identifying the underlying problem or cause. The process involves determining what underlying problem caused the detected anomaly and where the problem is located. Finding the root cause often requires following a chain of causes down to the most fundamental problem. If the root cause is found, it can then be used to select some remediation steps to revert the situation. As an example, this system could detect an anomaly, that pod-A has slow response times, and try to find the root cause of that anomaly. Analyzing metrics from the pod could find that pod-A is slow because it depends on pod-B, which is also slow to respond. Analyzing pod-B could then show that it has had slow response times since a recent deployment of a new version of pod-B. The code could then be analyzed, but this is outside the scope of this thesis. Thus the conclusion in this example is that a bad version of the code in pod-B was the root cause. The remediation could then be to update to a newer, fixed version of the code or to roll back to a previous version.

In this project, the root cause analysis is based on a few predefined root causes and tries to determine which of these is the most likely cause. The general idea is to, for each root cause, look at one or more metrics, learn their maximum values during normal operation, and then compare their values at the time of the anomaly to the recorded maximum. If the value at the time of the anomaly is greater than the recorded maximum, then the difference between them is the score for that root cause. Therefore the score should be greater than 0 in order to consider the root cause a real possibility. In the end, the scores for the possible root causes are compared, and the one with the highest score is determined to be the most likely cause. This could be further elaborated, e.g., by looking at the difference between the highest score and the rest to get an idea about the certainty of the classification. If none of the scores are above 0, then none of the predefined root causes are likely to be the actual cause. Algorithm 2 summarizes the root cause analysis and shows how it integrates with the operation of the system.

Algorithm 2: Root cause analysis
Set iteration length T
Set number of iterations I needed to learn normal behaviour
/* Learning normal behaviour */
for I iterations do
    Gather metrics for last T seconds
    Normalize metrics
    Add them to clustering algorithm
    Save metrics for root cause analysis if a new maximum is found
    Wait T seconds

/* Anomaly detection and root cause analysis */
while true do
    Gather metrics for last T seconds
    Normalize metrics
    Detect anomaly
    if anomaly is detected then
        foreach root cause metric do
            Save score for root cause as current value minus recorded maximum
        Find highest score
        if highest score > 0 then
            Notify that root cause with highest score is likely
        else
            Notify that root cause could not be determined
    Wait T seconds
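A small, self-contained sketch of the scoring step in Algorithm 2 is given below. The root-cause metric names and the numbers are invented for illustration; in the real system each predefined root cause is tied to one or more metrics whose maxima are recorded during the learning phase.

```python
def classify_root_cause(current, learned_max):
    """Score each predefined root cause as (current value - learned maximum).

    Both arguments map a root-cause metric name to a value. A score above 0
    means the metric exceeds anything seen during the learning phase.
    """
    scores = {cause: current[cause] - learned_max[cause] for cause in learned_max}
    best_cause, best_score = max(scores.items(), key=lambda kv: kv[1])
    return (best_cause if best_score > 0 else None), scores

# Example: node CPU is well above its learned maximum, so it wins.
learned_max = {"node_cpu": 0.70, "request_rate": 120.0, "dependency_latency": 0.25}
current     = {"node_cpu": 0.95, "request_rate": 110.0, "dependency_latency": 0.24}
cause, scores = classify_root_cause(current, learned_max)
print(cause, scores)   # -> node_cpu, with only node_cpu scoring above 0
```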

Algorithm 2 is run for every pod to give the most likely cause for that pod. However, this need not always give the true root cause of the problem if the actual anomaly is located in another pod or node. Thus, the cause reported from one pod should be analysed together with reports from all other pods in order to draw conclusions about the cluster as a whole. Reusing the earlier example of general root cause analysis, the scenario is that pod-A reports that its anomaly is caused by slow responses from one of its dependencies, and pod-B in turn reports that it is running slow due to a new version of its code. Combining these reports with the knowledge that pod-B is a dependency of pod-A, it can be concluded that the root cause is located in pod-B.
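In this thesis that cluster-wide combination is done manually (see Sections 4.3 and 5.3), but the sketch below illustrates how it could be automated for simple dependency chains. The service names, the dependency table, and the report labels are assumptions made for the example.

```python
# Partial, assumed dependency table for the demo application.
DEPENDS_ON = {
    "company": ["employee"],
    "greet-employee": ["employee", "greeting"],
}

def locate_root_cause(reports, depends_on):
    """Map each per-pod report to (suspected service, cause).

    reports maps a service to its locally classified cause, e.g.
    "slow_dependency", "cpu_shortage", or None when no anomaly was seen.
    A "slow_dependency" report is blamed on the first dependency that
    itself reports an anomaly.
    """
    located = {}
    for service, cause in reports.items():
        if cause == "slow_dependency":
            culprit = next((d for d in depends_on.get(service, [])
                            if reports.get(d) is not None), service)
            located[service] = (culprit, reports.get(culprit, cause))
        else:
            located[service] = (service, cause)
    return located

reports = {"company": "slow_dependency", "employee": "cpu_shortage"}
print(locate_root_cause(reports, DEPENDS_ON))
# company's anomaly is traced down to employee's CPU shortage
```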

4 Evaluation

4.1 Testbed

The testbed is composed of four major components: the microservices and clients, a service mesh and monitoring tools to gather metrics, an anomaly detection and root cause analysis program, and an anomaly injector to evaluate the solution. This project has used Google Cloud as the cloud provider, and more specifically Google Kubernetes Engine to get a cluster with Kubernetes pre-deployed. Onto this cluster, Istio was deployed as the service mesh, together with Prometheus and a cAdvisor daemonset for metrics and Jaeger for tracing. Next, the microservices were deployed, with sidecar containers injected by Istio. Depending on the tests, some additional applications were sometimes added to inject anomalies into the system. In total the system used five nodes, where two nodes were dedicated to the microservices and cAdvisor, with the three remaining nodes dedicated to the other components (Istio, Prometheus, the anomaly detection program, etc.).

Figure 8: The architecture of the microservice application used in the experiments.

The application used in these experiments consisted of nine different microservices. The application is supposed to mimic an internal human resources system for a company. Three of these microservices (Company, Employee-salary, and Greet-employee) are user-facing, while the others are only used internally by other services. The communication between the microservices uses regular HTTP requests and JSON payloads. The application is only a dummy application developed for the evaluation of the anomaly detection and root cause analysis methods proposed in this thesis, not an application that is actually in use by any company. See Figure 8 for an illustration of the architecture of the microservices. The microservices are built using the programming language Ballerina[28]. Ballerina was chosen because it makes it easy to build small network applications and because it has native support for Docker, Kubernetes, and Istio that can generate the artefacts needed to deploy an application. In order to make the experiments easier to run and reproduce, the microservices were deployed on specific nodes. The deployment used taints and tolerations to ensure that unwanted pods were placed on the other three nodes, and node selectors to place the application pods on specific nodes. Figure 9 shows how the microservices were distributed over the two nodes. Note that one microservice was replicated and deployed with one pod per node.

Figure 9: The microservices used in the experiment, distributed over two nodes.
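As a rough sketch of the placement configuration described above (written here as a Python dictionary rather than the Kubernetes YAML actually used, and with made-up label and taint names), an application pod would carry a node selector plus a toleration for a taint put on the two application nodes:

```python
# Hypothetical fragment of an application pod spec. The "role=apps" label and
# the "dedicated=apps" taint are invented names. Pods without this toleration
# cannot be scheduled on the tainted application nodes and therefore end up
# on the other three nodes.
app_pod_spec = {
    "nodeSelector": {"role": "apps"},
    "tolerations": [{
        "key": "dedicated",
        "operator": "Equal",
        "value": "apps",
        "effect": "NoSchedule",
    }],
}
```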

During the experiments user trac is injected using Locust[29]. Locust was set up to vary the numbers of users between 30 and 60. Each user randomly sends a request to one of the exposed microservices, with a bias towards ”Company” where 50% of all requests go and the other two receives 25% each. Each client then sleeps between 500 ms and 2000 ms before repeating the same steps again. ‘is number of users was selected to not exceed the maximum capacity that the nodes can handle before slowing down signi€cantly, but it is a signi€cant load and at 60 concurrent users the nodes are using almost all of their CPU resources. Version 1.1.0 of Istio was deployed using Helm charts and the values €le values-istio-demo- auth.yaml. Metrics from the microservices are gathered each second to quickly pick up changes. ‘is required changes to Prometheus’s scrape interval and cAdvisor’s housekeeping interval. As this project used Google Kubernetes Engine, it was not possible to con€gure the cAdvisor directly in the kubelet. Instead a new daemonset with cAdvisor was deployed with the new

16 con€guration, from which Prometheus could scrape metrics. ‘e implementation of BIRCH from ”scikit-learn” was used with the default options except for the se‹ing ”n clusters=None”, which limits the algorithm to only run the €rst phase. For these tests there are €ve di‚erent de€ned root causes. ‘ere are only used as a proof of concept and a real production system would need many more possible causes to look for. ‘e €rst two root causes are a shortage of CPU on the pods node or a shortage of memory on the node. ‘ese are meant to mimic the ”noisy neighbor” e‚ect where another application is using up a lot of resources on the node, in this case either CPU or memory. ‘e third root cause is simply having having a higher request load than normal from the users. ‘e fourth and €‰h root causes are that other services respond slower or return more errors than normal to a pod. All of these root causes are based on problems outside the pod, not any internal problem such as bugs in the code. Given this selection of root causes, if none of the scores are above 0, then the default assumption could be that there is an internal problem at the pod that reported the anomaly. Perhaps there is some bug causing the code to run slow or produce more errors. ‘e conclusions drawn from the reports of the root cause analysis at each pod is done by the author of this thesis for these tests.
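The sketch below shows what such a Locust file could look like with a recent Locust release. The endpoint paths and the class name are assumptions about the demo application; the task weights approximate the 50/25/25 split described earlier.

```python
from locust import HttpUser, task, between

class HRUser(HttpUser):
    wait_time = between(0.5, 2.0)   # sleep 500-2000 ms between requests

    @task(2)                        # ~50% of requests go to Company
    def company(self):
        self.client.get("/company")

    @task(1)                        # ~25% to Employee-salary
    def employee_salary(self):
        self.client.get("/employee-salary")

    @task(1)                        # ~25% to Greet-employee
    def greet_employee(self):
        self.client.get("/greet-employee")
```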

4.2 Experiment design

There were five experiments planned to test the system, one for each of the root causes that can be identified by the system. The tests start by having the anomaly detection system learn the normal behaviour of the microservices. The first anomaly is then introduced, and the output of the anomaly detection and root cause system is recorded for about 20 minutes before the anomaly is removed. The microservices then have a chance to stabilize at normal levels again before the next anomaly is introduced. This continues until all five anomalies have been tested. These tests are only used as a first step to see if the system can work correctly in simple situations.

The first experiment involves creating a CPU shortage on one of the nodes, Node 2 in Figure 9. The shortage is created by adding pods running the tool stress[30] that try to use up as much CPU as possible. This should increase the response time of any pod running on the affected node as well as any pods depending on pods on Node 2. In the second experiment the added anomaly is simply that the number of simulated users is increased from between 30 and 60 to between 130 and 160. This should slow down the response time for all or most of the pods due to a lack of capacity in the nodes. The "Employee" pods are generally a bottleneck for the microservices and this experiment should thus at least trigger an anomaly in them and their dependencies, whereas other microservices such as "Time" might not be slowed down. For the third experiment a network delay is added to the "Randomize" microservice, such that any response from it is delayed by 500 ms. This should slow down any microservices that depend on "Randomize". The delay is introduced using Istio; the proxy sidecar in the "Randomize" pod adds the delay to any outgoing responses. This does, however, not increase the time that the "Randomize" pod itself reports for each request. The fourth experiment is similar to the third, but it uses network errors instead of network delays. The sidecar proxies in the "Employee" pods are configured to respond with an error to 10% of all incoming requests instead of sending the requests to the "Employee" microservices.

This causes any depending microservices to also produce errors, due to bad error handling, which should be detected as anomalies. Similarly to Test 3, the "Employee" pods do not see this as responding with errors; they only notice it in terms of a reduction in the number of requests. In the fifth experiment the anomaly is a resource shortage as in the first experiment, except it is a memory shortage instead of a CPU shortage. The resource shortage is located on Node 1, see Figure 9. The shortage is again created by adding pods that run the tool stress, though this time the tool is configured to use up memory instead of CPU. This should increase the response time for any pods on the affected node or cause them to produce errors. Any pod depending on pods on Node 1 should also see an increase in response time or errors.

4.3 Results

Each test produced anomaly data for each pod that indicates whether there was an anomaly or not. This data can be displayed in a graph, see Figure 11a and Figure 11b for examples, to estimate whether the system detected an anomaly. For each test there is a diagram, similar to Figure 9, that shows which pods detected an anomaly. The tests also produced root cause data that indicates which root cause was responsible for the anomaly. This data can also be displayed as graphs to see if the system detected the correct root cause and how certain it was.

4.3.1 Test 1 - CPU shortage

Figure 10: This shows which pods detected an anomaly in the first experiment. The pods marked in red clearly detected an anomaly, the pods marked in yellow had inconclusive results, and the pods marked in white clearly did not detect an anomaly. The orange pods indicate that there were additional pods on Node 2 that were causing a CPU shortage on the node.

The first scenario evaluates an anomaly in the form of noisy neighbours: vicious pods that consume CPU on one of the nodes. Figure 10 shows which pods detected an anomaly during the test. The pods marked in red clearly detected an anomaly, and for the pods in yellow it was not clear whether they detected an anomaly or not. The orange pods are the ones trying to use up the CPU resources on Node 2 and thus the cause of the anomaly. All pods on Node 2 except for "Time" reported anomalies, and the pods "Greet employee" and "Company", which depend on pods on Node 2, also reported anomalies as expected. However, "Calc-salary" should not have reported any anomalies, even though its data is a bit inconclusive. Values from the anomaly detection data that are above 1 could be considered an anomaly, but due to the volatile nature of this data there should be multiple grouped values above 1 before it is considered an anomaly. Figure 11 shows the execution of the anomaly detection in all pods. In general, as data is normalized, scores (much) larger than 1 are considered significant. The data from the pod "Employee2", shown in Figure 11a, clearly shows high values throughout the test; the low values at the beginning and end are from before the anomaly is introduced and after it has been removed. Likewise, the data from the pod "Randomize", shown in Figure 11b, shows low values below 1 except for a few isolated spikes that go slightly above 1. Figure 11c displays the data from the pod "Calc-salary", and that data is a bit inconclusive. Most of the values are clearly below 1 as expected, but there are many more spikes that go above 1 than in the data from "Randomize", and they are more grouped. Similarly, the data from the pod "Time", shown in Figure 11d and again, cropped, in Figure 11e to more clearly show when values go above 1, is also inconclusive. Here the values are also mostly below 1 but contain some grouped spikes that go above 1. This data also has a few values that are more extreme and go a lot higher than the values from "Calc-salary".

The root cause classification data for the pods "Company" and "Greet employee", shown in Figure 12a and 12b respectively, clearly shows that the anomaly's root cause has been classified as increased response time from other services. This makes sense as neither of them is located on the affected node; instead they depend on pods on Node 2. The data for the pods "Department" and "Employee salary", shown in Figure 12d and 12c respectively, also clearly shows that the anomaly would be classified as increased response time from other services, though both also have a small increase in the score for CPU shortage. This is also as expected, because both depend on "Employee2", which acts as a bottleneck, causing the primary classification. Both are also located on the affected node, which causes the increased score for CPU shortage. Lastly, the data for the pods "Employee2" and "Time", shown in Figure 12e and 12f respectively, shows a high score for CPU shortage as expected. Both also show increased scores for memory shortage, especially "Time", but that is ignored for the reasons stated in Section 5.1. Using the dependencies of the pods one can see that "Greet employee", "Company", "Department", and "Employee salary" point to "Employee2" being slow, which, together with "Time", in turn points to a CPU shortage on Node 2. In total, the system managed to correctly detect or not detect anomalies as expected in 8 out of 10 pods, where the remaining 2 were inconclusive. All pods that were expected to detect an anomaly correctly classified the root cause of the anomaly: abnormal CPU usage.

Figure 11: Anomaly detection data from some of the pods during the test with a CPU anomaly. (a) Anomaly detection data for the "Employee2" pod. (b) Anomaly detection data for the "Randomize" pod. (c) Anomaly detection data for the "Calc-salary" pod. (d) Anomaly detection data for the "Time" pod. (e) Anomaly detection data for the "Time" pod, zoomed in to show the scores with low values in more detail.

Figure 12: Root cause classification data from some of the pods during the test with a CPU anomaly. (a) Root cause analysis data for the "Company" pod. (b) Root cause analysis data for the "Greet employee" pod. (c) Root cause analysis data for the "Employee salary" pod. (d) Root cause analysis data for the "Department" pod. (e) Root cause analysis data for the "Employee 2" pod. (f) Root cause analysis data for the "Time" pod.

4.3.2 Test 2 - increased number of users

Figure 13: This shows which pods detected an anomaly in the second experiment. The pods marked in red clearly detected an anomaly and the pods marked in white clearly did not detect an anomaly. The users are marked in orange to indicate that they are the source of the anomaly; the number of simulated users has increased significantly.

The second scenario introduces an anomaly in the form of an increased number of users accessing the front-end services. Figure 13 shows which pods detected an anomaly during the test. The pods marked in red clearly detected an anomaly. The users are marked in orange to indicate that they are the source of the anomaly. The pods "Greet employee", "Company", "Employee salary", "Department", "Employee1", and "Employee2" all detected anomalies. As the "Employee" pods act as a bottleneck for the microservices, this result is as expected. The root cause data from "Company", "Greet employee", "Employee salary", and "Department", shown in Figures 14a, 14b, 14c, and 14d respectively, all classify their root cause as increased response time from other services, as expected. The data from "Employee2", shown in Figure 14f, classifies the root cause as an increase in requests, also as expected. However, the data from "Employee1", shown in Figure 14e, does not have a high score indicating that the root cause is an increase in requests. That score does increase, compared to before and after the anomaly, but values below 0 should be classified as inconclusive. In total the system managed to correctly detect or not detect anomalies in all pods. 5 out of 6 pods that detected an anomaly then correctly classified the root cause of the anomaly.

Figure 14: Root cause classification data from some of the pods during the test with increased requests as the anomaly. (a) Root cause analysis data for the "Company" pod. (b) Root cause analysis data for the "Greet employee" pod. (c) Root cause analysis data for the "Employee salary" pod. (d) Root cause analysis data for the "Department" pod. (e) Root cause analysis data for the "Employee 1" pod. (f) Root cause analysis data for the "Employee 2" pod.

4.3.3 Test 3 - network delay

Figure 15: This shows which pods detected an anomaly in the third experiment. The pods marked in red clearly detected an anomaly and the pods marked in white clearly did not detect an anomaly. The orange marking around the "Randomize" node indicates that there is a network delay injected at that node, which is the source of the anomaly. The network delay affects all responses going out from that node.

The overall setup and results of the network delay anomaly experiment are shown in Figure 15. The pods marked in red clearly detected an anomaly. The orange box around "Randomize" indicates that a network delay at that pod is the source of the anomaly. The pods "Greet employee", "Employee salary", "Greeting", and "Calc-salary" all detected anomalies. Those are all the pods depending on "Randomize", which are the ones expected to detect anomalies. As the anomaly only affects the time between "Randomize" and its dependants, the "Randomize" pod itself is not expected to detect the anomaly. The root cause data from "Greet employee", "Employee salary", "Greeting", and "Calc-salary", shown in Figures 16a, 16b, 16c, and 16d respectively, all classify their root cause as increased response time from other services, as expected. Since "Greeting" and "Calc-salary" both have "Randomize" as their sole dependency and "Randomize" does not experience any anomaly, we can conclude that the network between these pods is the root cause. In total the system managed to correctly detect or not detect anomalies in all pods. All pods that detected an anomaly then also correctly classified the root cause of the anomaly.

Figure 16: Root cause classification data from some of the pods during the test with a network delay as the anomaly. (a) Root cause analysis data for the "Greet employee" pod. (b) Root cause analysis data for the "Employee salary" pod. (c) Root cause analysis data for the "Greeting" pod. (d) Root cause analysis data for the "Calc-salary" pod.

4.3.4 Test 4 - network error

Figure 17: This shows which pods detected an anomaly in the fourth experiment. The pods marked in red clearly detected an anomaly and the pods marked in white clearly did not detect an anomaly. The orange marking around the "Employee" nodes indicates that there are network errors injected at those nodes, which is the source of the anomaly. The network errors affect some of the requests going to those nodes, returning an error instead of letting the request reach the nodes.

In the fourth scenario, an anomaly in the form of network errors was introduced. Figure 17 shows which pods detected an anomaly during the test. The pods marked in red clearly detected an anomaly. The orange "Employee" service indicates that network errors at those pods are the source of the anomaly; some of the requests get error responses instead of reaching the "Employee" containers. The pods "Greet employee", "Company", "Employee salary", and "Department" all detected anomalies. Those are all the pods depending on the "Employee" pods, which are the ones expected to detect anomalies. As the errors occur before the requests arrive at the "Employee" containers, the "Employee" pods are not expected to see the anomaly. The root cause data from "Greet employee", "Company", "Employee salary", and "Department", shown in Figures 18a, 18b, 18c, and 18d respectively, all classify their root cause as an increased number of errors from other services, as expected. The "Department" pod's only dependency is the "Employee" pods, which implies that the anomaly is located there. However, "Greet employee" and "Employee salary" also have other dependencies, which means that other pods could also be returning errors. Still, the likely conclusion to draw is that there are network errors for requests to the "Employee" pods. In total the system managed to correctly detect or not detect anomalies in all pods. All pods that detected an anomaly then also correctly classified the root cause of the anomaly.

Figure 18: Root cause classification data from some of the pods during the test with a network error as the anomaly. (a) Root cause analysis data for the "Greet employee" pod. (b) Root cause analysis data for the "Company" pod. (c) Root cause analysis data for the "Employee salary" pod. (d) Root cause analysis data for the "Department" pod.

4.3.5 Test 5 - memory shortage

Figure 19: This shows the results from testing during development of this experiment. The pods marked with a cross were randomly evicted by Kubernetes due to lack of memory on the node, the pod marked in yellow had inconclusive results as its dependency was occasionally evicted, and the pods marked in white clearly did not detect an anomaly. The orange pods indicate that there were additional pods on Node 1 that were causing a memory shortage on the node.

Unfortunately it was not possible to perform this test and get any meaningful data. The main effect of adding the anomaly was not that the pods ran slower or produced errors. Instead, when the node ran out of memory, Kubernetes would evict (shut down) a pod on the node. When the pod restarted, the node would eventually run out of memory again and a pod would be evicted again. This made the scenario very hard to test, given that the anomaly detection is based on looking at specific pods and starts to fail if any of those pods are evicted.

5 Conclusions

Overall this project has produced a system that is capable of gathering relevant metrics, detecting anomalies, and finding root causes in a microservice cluster. The anomaly detection and root cause analysis have also been evaluated using a real microservice cluster with real injected anomalies. However, no automatic remediation has been added to the system. Thus all goals set for the project in Section 1.2, except the optional goal related to automatic remediation, have been fulfilled to some extent. This chapter will discuss to what extent the goals are met, algorithmic considerations, system aspects, and future work.

The first goal states that the system should be able to monitor the microservice cluster and gather relevant metrics from it. The system monitors the cluster through cAdvisor and the service mesh Istio, together with the tracing from Jaeger. cAdvisor provides hardware metrics for all nodes and pods in the cluster while Jaeger and Istio provide different application metrics. Together they provide thorough monitoring and gather the key metrics for all three monitoring levels (cluster, pod, and application); thus the first goal is achieved.

The second goal states that the system should have an anomaly detection algorithm that can detect performance issues in the microservices. The implemented anomaly detection based on the BIRCH clustering algorithm can detect anomalies given relevant metrics. The proposed metrics of average response time and error rate represent the performance of the services as experienced by the users. This rather simple anomaly detection method should thus be able to detect a broad range of anomalies. Due to the volatile nature of the used metrics there are sudden increases or decreases in the anomaly detection score. The system should contain a way to mitigate this behaviour in order to have reliable anomaly detection. In total, the second goal is partially achieved.

The third goal states that the system should have a root cause analysis algorithm that can determine why an anomaly occurred. The provided root cause analysis can, for each pod, find the most likely cause of that pod's anomaly, if it is included in a predefined list of causes. Given the output from the root cause analysis for each pod, combined with information about the microservice cluster architecture, the true root cause can be found, as outlined in the discussions of the tests in Section 4.3. This is currently not automated by the system, which would be needed for this goal to be completely achieved. Therefore, the third goal is partially achieved.

The fourth, optional goal states that the system should also have automatic remediation that can take actions to resolve the issues that the root cause analysis finds. Due to a lack of time this goal was not pursued and not achieved.

The fifth and last goal states that the system should be tested in a real microservice cluster to see how well the anomaly detection and root cause analysis work. The system was tested with the experiments described in Chapter 4 and the overall results were positive. Most anomalies were detected as expected and the root cause analysis also worked as expected. The goal is achieved, but it would have been interesting to evaluate an automated root cause analysis system with more tests and more predefined root causes.

5.1 Algorithmic considerations

As the anomaly detection and root cause analysis are run for each pod, the location of the anomaly is usually easy to obtain. Depending on the type of anomaly this might not be entirely true, e.g. in the case of a CPU shortage the system might not know which pod is the true root cause, but at least the node that has the CPU shortage should be located.

As the whole system is based on learning the normal behaviour of the microservice cluster, problems arise if the normal behaviour changes over time, so-called model drift. In the general case where traffic increases slowly due to an increase in users, this problem can be handled by periodically re-training the system. However, if the microservices' normal behaviour varies often, such as low usage during the night but high during the day, it could be impractical to retrain the system several times per day. A possible fix would be to have several trained models that the system could cycle through during the day, but such a solution requires knowledge of when to change models.

The system has been built to work in a cluster where pods can come and go, thanks to having the anomaly detection and root cause analysis look at microservice pods separately. But this does require highly specific models that describe the normal behaviour. Thus each new microservice pod requires training a new model. This problem is amplified by the problem described in the previous paragraph, causing the system to potentially retrain several models each time the normal behaviour changes.

During the development and testing of the different root causes, some problems were found with the root cause indicating memory shortage on the node. In general the memory metric was hard to normalize properly; the recorded averages and standard deviations were often not very representative of the values during the test. This led to the metric often being too high, or sometimes too low, to use with the other metrics. During tests the score for this root cause could therefore be much higher than any other root cause, but when looking at the actual amount of memory used by the pod and the node, it did not seem unusually high or low. The score could also be high without adding any anomalies. A high or low memory load can be a good indication that something is wrong, perhaps there is a memory leak in the system. Unfortunately this is not always the case. As an example, the Java runtime is notorious for allocating all memory that it can, no matter the actual load on the application. That of course makes memory usage a very poor indicator of performance. The way memory usage was used as a metric in this project was not useful. For future versions it would need to be incorporated in a different way.

5.2 System aspects

This system is designed to work with most microservice clusters. If the microservice cluster is running on Kubernetes and can add Istio, Prometheus, Jaeger, and cAdvisor (or preferably has access to modify the cAdvisor in the kubelet), then it should be compatible. Depending on how the communication between services works, some modifications to the code will be needed. This would mostly affect tracing, although with HTTP requests this only requires copying some headers from the incoming request to any resulting outgoing requests.

This system bases most metrics on traces from Jaeger, which requires handling a lot of data. The traces from Jaeger that these microservices produce in around two seconds are often several megabytes large. This leads to some practical challenges depending on where the anomaly detection software runs. If the software runs outside the microservice cluster, the amount of time required to get data from Jaeger might be too long. During development, responses from Jaeger could sometimes take more than two seconds when the microservice cluster was running on Google Kubernetes Engine and the anomaly detection software was running on a local computer, which is obviously not acceptable when the software is supposed to iterate once every two seconds. This was fixed by moving the anomaly detection software onto the microservice cluster, greatly increasing network speed, but alternative solutions also exist, as discussed in Section 5.3.

5.3 Future work

As mentioned earlier, the current system only considers each pod in isolation during root cause analysis and then requires a human to analyze the results of several instances in order to get a complete understanding of the root cause of a given anomaly. In essence, this follows the procedure outlined in the discussions of the four anomaly scenarios in Section 4. This should be automated by extending the system to include a module that does fully automated cluster-wide root cause analysis. The system would then be able to report a more complete picture of the situation that can be used to determine the most suitable remediation action(s). The remediation process that tries to resolve the root cause could also be automated or done by humans.

The scores from both the anomaly detection and root cause analysis can be rather volatile. More importantly, they can contain spikes where the score is very high for one iteration before going back to a relatively stable level again, as can be seen in the test results in Section 4.3. These spikes and unstable values are a problem, as the system should not warn about an anomaly if there was just a spike in the score. The system should instead only send warnings when the score has been sufficiently high for a period of time. The spikes in the scores are likely partially caused by having a rather short monitoring and detection interval of two seconds. Longer intervals should smooth out these spikes, but then the system might not respond as quickly to anomalies. A possible solution would be to implement some sort of state machine that has two modes, anomaly present or anomaly not present, plus a counter that increments or decrements when it gets a high or low score. Then, whenever the counter becomes sufficiently high or low, the state machine would change state. Alternative solutions might exist and the specifics would need to be tested, but the general idea of not warning about anomalies because of a single spike should be implemented.

The two defined root causes in the system that indicate that another service is slow or is producing errors could be improved. Currently those root cause analyses are based on all outgoing requests to other services being aggregated into the same metric. This is probably not optimal when a service sends requests to many other services. In such scenarios, it is harder to locate which service is actually causing the slowdown. An improvement would be to instead dynamically create two metrics, one for response times and one for errors, for each dependency. Such a solution would make it much clearer which service is the slow one. If the service is replicated in several pods this still creates a problem, as the metric would not point at a specific pod if any particular pod is the root cause. A possible solution to this problem is to split the metric again for each pod. Depending on the number of replicas and the amount of traffic to those replicas, this could be quite unstable if the load balancer changes the amount of traffic that goes to each replica. In some iterations one replica might receive many requests, and in the next iteration the requests go to other replicas. It could perhaps work, but it definitely requires some testing.
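A minimal sketch of the debouncing state machine suggested above is given below; the threshold, the counter bound, and the score sequence are illustrative values only.

```python
class AnomalyDebouncer:
    """Hysteresis around the anomaly score so single spikes do not raise alarms."""

    def __init__(self, threshold=1.0, trigger=5):
        self.threshold = threshold   # score above this counts as "high"
        self.trigger = trigger       # counter bound needed to flip state
        self.counter = 0
        self.anomalous = False

    def update(self, score):
        # Move the counter up on high scores and down on low ones,
        # clamped between 0 and the trigger value.
        if score > self.threshold:
            self.counter = min(self.counter + 1, self.trigger)
        else:
            self.counter = max(self.counter - 1, 0)
        # Only flip state at the bounds, never on a single sample.
        if self.counter == self.trigger:
            self.anomalous = True
        elif self.counter == 0:
            self.anomalous = False
        return self.anomalous

deb = AnomalyDebouncer()
scores = [0.2, 3.0, 0.3, 0.2, 1.5, 1.8, 2.2, 2.6, 2.1]
print([deb.update(s) for s in scores])
# the lone spike at 3.0 is ignored; the sustained run at the end triggers
```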

Currently the anomaly detection only looks at the average response time and error rate of each microservice. This seems like a good approximation of the overall performance, but there might be room for improvement. For response times, the system could look at percentiles, preferably a few different percentiles, to get more representative data [31]; a small sketch of such a summary is given at the end of this section. For errors, the system could classify errors into categories. There might also be other useful metrics besides response time and error rate, but keeping the number of metrics as low as possible is beneficial. The anomaly detection should only use the most important metrics that are directly tied to customer performance, whereas other metrics can be used in the root cause analysis.

As mentioned in Section 5.2, there is a lot of data produced by Jaeger. The current system requests all relevant data and processes it for each pod and each iteration. However, Jaeger sends complete traces from the exposed front end service down to the lowest back end service, even though the anomaly detection system only requires a fraction of that information. To make this more efficient, a separate piece of software could scrape all traces from Jaeger and let the anomaly detection system query it for more specific information. This would reduce the amount of network traffic and make the anomaly detection a bit simpler, possibly allowing the anomaly detection and root cause analysis to run outside the cluster without the acquisition of tracing data affecting the performance of deployed services.
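To make the percentile suggestion above more concrete, the sketch below summarizes the response times of one monitoring interval with a few percentiles instead of a single average. The function name and the choice of the 50th, 95th, and 99th percentiles are assumptions for illustration, not part of the thesis implementation.

import numpy as np

def response_time_features(latencies_ms):
    """Summarize one monitoring interval with a few latency percentiles
    instead of a single average."""
    samples = np.asarray(latencies_ms, dtype=float)
    return {
        "p50": float(np.percentile(samples, 50)),
        "p95": float(np.percentile(samples, 95)),
        "p99": float(np.percentile(samples, 99)),
    }

# Example: the tail (p99) reacts to a few slow requests that an
# average would largely hide.
print(response_time_features([12, 14, 15, 13, 14, 16, 250, 900]))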

References

[1] Yan Cui. Capture and forward correlation IDs through different Lambda event sources. https://hackernoon.com/capture-and-forward-correlation-ids-through-different-lambda-event-sources-220c227c65f5 (visited 2019-05-23).

[2] gRPC. gRPC main page. https://grpc.io/ (visited 2019-05-21).

[3] Weaveworks. What are Microservices? https://www.weave.works/blog/what-are-microservices/ (visited 2019-04-17).

[4] AWS. Microservices on AWS. https://docs.aws.amazon.com/whitepapers/latest/microservices-on-aws/introduction.html (visited 2019-05-21).

[5] Docker. What is a container? https://www.docker.com/resources/what-container (visited 2019-04-16).

[6] Doug Chamberlain. Containers vs. Virtual Machines (VMs): What’s the Difference? https://blog.netapp.com/blogs/containers-vs-vms/ (visited 2019-04-17).

[7] Kubernetes. Learn Kubernetes basics. https://kubernetes.io/docs/tutorials/kubernetes-basics/ (visited 2019-04-16).

[8] Elastisys. Setting up highly available Kubernetes clusters. https://elastisys.com/wp-content/uploads/2018/01/kubernetes-ha-setup.pdf (visited 2019-04-17).

[9] NGINX. What Is a Service Mesh? https://www.nginx.com/blog/what-is-a-service-mesh/ (visited 2019-04-17).

[10] Istio. What is Istio? https://istio.io/docs/concepts/what-is-istio/ (visited 2019-04-16).

[11] Envoy. Envoy main page. https://www.envoyproxy.io/ (visited 2019-05-21).

[12] Varun Chandola, Arindam Banerjee, and Vipin Kumar. Anomaly detection: A survey. ACM Comput. Surv., 41(3):15:1–15:58, July 2009.

[13] Pavel Laskov, Patrick Düssel, Christin Schäfer, and Konrad Rieck. Learning intrusion detection: Supervised or unsupervised? In ICIAP, 2005.

[14] Shekhar R. Gaddam, Vir V. Phoha, and Kiran S. Balagani. K-means+ID3: A novel method for supervised anomaly detection by cascading k-means clustering and ID3 methods. IEEE Trans. on Knowl. and Data Eng., 19(3):345–354, March 2007.

[15] Anton Gulenko, Florian Schmidt, Marcel Wallschläger, Alexander Acker, Odej Kao, and Sören Bäcker. Self-healing cloud platform. https://www.researchgate.net/project/Self-Healing-Cloud-Platform (visited 2019-05-25).

[16] Marcel Wallschläger, Anton Gulenko, Florian Schmidt, Odej Kao, and Feng Liu. Automated anomaly detection in virtualized services using deep packet inspection. Procedia Computer Science, 110:510–515, December 2017.

[17] Florian Schmidt, Florian Suri-Payer, Anton Gulenko, Marcel Wallschläger, Alexander Acker, and Odej Kao. Unsupervised anomaly event detection for cloud monitoring using online ARIMA, November 2018.

[18] Anton Gulenko, Florian Schmidt, Alexander Acker, Marcel Wallschläger, Odej Kao, and Feng Liu. Detecting anomalous behavior of black-box services modeled with distance-based online clustering, September 2018.

[19] Roman Nikiforov. Clustering-based anomaly detection for microservices. CoRR, abs/1810.02762, 2018.

[20] Moogsoft. Moogsoft product page. https://www.moogsoft.com/product/ (visited 2019-05-27).

[21] Grok. How Grok works. https://www.grokstream.com/how-grok-works/ (visited 2019-05-27).

[22] Dynatrace. Dynatrace platform page. https://www.dynatrace.com/platform/ (visited 2019-05-27).

[23] cAdvisor. GitHub page for cAdvisor. https://github.com/google/cadvisor (visited 2019-04-17).

[24] Prometheus. Prometheus documentation - Overview. https://prometheus.io/docs/introduction/overview/ (visited 2019-04-17).

[25] Jaeger. Jaeger documentation - Introduction. https://www.jaegertracing.io/docs/1.11/ (visited 2019-04-17).

[26] Tian Zhang, Raghu Ramakrishnan, and Miron Livny. BIRCH: An efficient data clustering method for very large databases. SIGMOD Rec., 25(2):103–114, June 1996.

[27] scikit-learn. Birch class documentation. https://scikit-learn.org/stable/modules/generated/sklearn.cluster.Birch.html (visited 2019-05-08).

[28] Ballerina. Ballerina main page. https://ballerina.io/ (visited 2019-05-23).

[29] Locust. Locust main page. https://locust.io/ (visited 2019-05-13).

[30] Amos Waterland. Stress tool. (visited 2019-05-27).

[31] Jeffrey Dean and Luiz André Barroso. The tail at scale. Commun. ACM, 56(2):74–80, February 2013.
