in Cloud Computing Environments

submitted by M. Sc. Florian Johannes Schmidt

at Faculty IV - Electrical Engineering and Computer Science of Technische Universität Berlin in fulfillment of the requirements for the academic degree of

Doktor der Ingenieurwissenschaften (Dr.-Ing.)

Approved dissertation

Doctoral committee:

Chair: Prof. Dr. David Bermbach
Reviewer: Prof. Dr. Odej Kao
Reviewer: Prof. Dr. Johan Tordsson
Reviewer: Prof. Dr. Jan Nordholz

Date of the scientific defense: 18 June 2020

Berlin 2020

Thank you!

{ Kathrin, Martina, Hartmut, Friederike, Maryse, Einstein, Felix, Fabian, Fiona, Marion, Thomas, Susanne, Heino, Hildegard, Ella, Harry, Helmut, Anton, Marcel, Odej, Alex, Sören, Ilya, Lauritz, Sasho, Jana, Feng, Tobias, Jan, Kevin, Stefan, Steven, Annika, Alexandra, Elisa, Eva, Paul, René, Sven, Vincent, Johannes, Yannick, Tim }

Zusammenfassung

Cloud computing paradigms are already applied by most companies in modern software development. Providing digital services in a cloud environment offers both cost-efficient use of resources and the possibility to scale applications dynamically on demand. Based on this flexibility, increasingly complex software applications are being developed, which leads to demanding maintenance of the overall infrastructure. At the same time, ever higher demands are placed on the availability of software services (99.999% in an industrial context), which the complexity of modern systems makes increasingly difficult and laborious to guarantee. Due to these trends, there is a growing need for intelligent applications that automatically detect anomalies and derive suggestions to identify, resolve, or at least mitigate problems so that no negative impact cascades onto the service quality.

This thesis addresses the detection of degraded abnormal system states in cloud environments. It describes both a holistic analysis pipeline and infrastructure and discusses the applicability of different machine learning strategies in order to provide a solution that is as fully automated as possible. Based on the underlying assumptions, a novel unsupervised anomaly detection algorithm called CABIRCH is introduced, and its applicability is analyzed and discussed. Since the choice of hyperparameters has a major influence on the accuracy of the algorithm, a hyperparameter selection procedure with a novel fitness function is also presented, which is intended to fully automate the anomaly detection. The procedure is generally applicable to a variety of unsupervised anomaly detection algorithms, which are evaluated comprehensively against recent publications. The applicability to the automated detection of degraded abnormal system states is shown, and possible limitations are discussed. The results show that detection of the various anomalies can be ensured, albeit with a false alarm rate of more than 1%.

Abstract

Cloud computing is widely applied by modern software development companies. Providing digital services in a cloud environment offers both cost-efficient usage of computation resources and the ability to dynamically scale applications on demand. Based on this flexibility, increasingly complex software applications are being developed, leading to growing maintenance efforts to ensure the reliability of the entire system infrastructure. Furthermore, high-availability requirements for cloud services (99.999% as an industry standard) are difficult to guarantee due to the complexity of modern systems and can only be ensured with great effort. Due to these trends, there is an increasing demand for intelligent applications that automatically detect anomalies and provide suggestions for solving, or at least mitigating, problems so that no negative impact cascades onto the service quality.

This thesis focuses on the detection of degraded abnormal system states in cloud environments. A holistic analysis pipeline and infrastructure is proposed, and the applicability of different machine learning strategies is discussed to provide an automated solution. Based on the underlying assumptions, a novel unsupervised anomaly detection algorithm called CABIRCH is presented, and its applicability is analyzed and discussed. Since the choice of hyperparameters has a great influence on the accuracy of the algorithm, a hyperparameter selection procedure with a novel fitness function is proposed, leading to further automation of the integrated anomaly detection. The method is generalized and applicable to a variety of unsupervised anomaly detection algorithms, which are evaluated including a comparison to recent publications. The results show the applicability for the automated detection of degraded abnormal system states, and possible limitations are discussed: detection of system anomaly scenarios achieves accurate detection rates but comes with a false alarm rate of more than 1%.

Contents

1 Introduction ...... 1
1.1 Research Objectives and Main Contributions ...... 2
1.2 Publications ...... 3
1.3 Outline of the Thesis ...... 4

2 Background ...... 6
2.1 Anomaly detection ...... 6
2.1.1 Types of Anomalies ...... 7
2.1.2 Failures and Degraded State Anomalies ...... 7
2.2 Application Domain ...... 9
2.2.1 IP multimedia subsystem ...... 9
2.2.2 Video on demand ...... 10
2.3 Analytic Concepts ...... 10
2.3.1 Machine Learning Methodologies ...... 12
2.3.2 BIRCH ...... 12
2.3.3 ...... 14
2.3.4 ...... 15
2.3.5 Long Short Term Memory Networks ...... 16
2.3.6 Dynamic Threshold Models ...... 17
2.3.7 Genetic Algorithm ...... 18
2.4 Evaluation Metrics ...... 19

3 Related Work ...... 22
3.1 Characteristics of Service Anomalies ...... 22
3.2 Anomaly Detection ...... 23
3.3 Concept Adapting Clustering ...... 28
3.4 Hyperparameter Optimization ...... 30

4 Framework for AI-based Anomaly Detection ...... 33
4.1 ZerOps Framework ...... 33
4.2 Categorization of AI-based Anomaly Detection ...... 35
4.3 Evaluation ...... 39
4.3.1 Evaluation ...... 39
4.3.2 Semi-supervised Evaluation ...... 42
4.3.3 Summary ...... 44

5 Concept Adapting BIRCH ...... 46
5.1 Concept Adapting BIRCH ...... 46
5.1.1 Micro-cluster Aging ...... 47
5.2 Anomaly Detection using Concept Adapting BIRCH ...... 53
5.2.1 Identity Function Threshold Model ...... 54
5.3 Evaluation ...... 55
5.3.1 Influence of Decay Rate Selection ...... 56
5.3.2 CABIRCH-based Anomaly Detection ...... 60

6 Cold Start-Aware Identity Function Threshold Models ...... 67
6.1 Integration of Hyperparameter Optimization into IFTM Framework ...... 68
6.2 Automated Hyperparameter Optimization ...... 70
6.2.1 Initialization, Crossover, Mutation, Termination ...... 70
6.2.2 Fitness Function Definition ...... 71
6.3 Evaluation ...... 73

7 Evaluation ...... 81
7.1 Evaluation Setup ...... 82
7.1.1 Resource Monitoring ...... 83
7.1.2 Anomaly Injection Framework ...... 84
7.2 Evaluation Results ...... 86
7.3 Discussion ...... 93
7.4 Future Work ...... 96

8 Conclusion ...... 98

Appendices ...... 119

A Online ARIMA ...... 120

B Intervals for Identity Functions and Threshold Models ...... 123

C Detailed Evaluation Results ...... 125

Chapter 1

Introduction

Digitalization transforms our world in various areas like Industry 4.0 (product lines, manufacturing automation, predictive maintenance, etc.), transportation (self-driving cars, car-2-car communication, intelligent traffic control systems), smart homes, medically assisted surgery, and many more. Gartner predicts that by 2020 there will be more than 20 billion connected IoT devices [1]. Cisco even forecasts 28.5 billion connected devices by 2022 [2]. While the number of devices and sensors increases, key network technologies like 5G and the virtualization of cloud computing and fog computing change business opportunities and enable flexibility for the infrastructure. The improved flexibility and business opportunities come at a high cost, as the system complexity increases significantly. This introduces the challenge of not only administering the complex IT infrastructure but also of maintaining every remote device, edge cloud, software service, and the heterogeneous networks in between. Managing this complexity surpasses the ability of human experts to oversee the entire system and react quickly enough to meet the promised Quality of Service (QoS) parameters or even Service Level Agreements (SLAs).

In application scenarios such as the softwarization of dedicated hardware solutions into virtualized environments, telecommunication providers hope to benefit from increased flexibility and cost-effectiveness. Still, the given dedicated hardware components provide a reliability of 99.999% [3], which is therefore demanded of the virtualized components as well. Due to the increased complexity of the computation model, which includes hardware components and a stack of virtualized components, softwarized components cannot easily cope with this high demand for reliability. Because of the fragility of such system stacks, the expectations on system administrators to maintain the continuous operation of the services increase [4]. With respect to this and the recent developments in artificial intelligence (AI), concepts are being developed to analyze and automate large portions of administrators' operational tasks. AI-based automation of infrastructure operation (AIOps) provides the vision of establishing a system able to autonomously operate and remediate large IT environments.

The increased demand for qualified operation maintenance support is reflected in the establishment of new job positions like DevOps engineers or Site Reliability Engineers (SREs). The four pillars of effective DevOps within any company assume communication management among teams (Collaboration and Affinity), but also the introduction and usage of helpful tools as well as scaling to the organization's needs [5]. For example, Google coined the term SRE with specific principles to strategically implement the four pillars of DevOps, assuming the mindset that the developed code can cause problems at any time [6]. Gartner also describes four phases of IT operations [7], including descriptive IT, anomaly detection and diagnostics, proactive operations, and avoidance of high-severity outages. In order to provide DevOps engineers and SREs helpful

guidance, machine learning can help to provide solutions that optimize such operational workflows.

Nowadays, active and passive monitoring tools for measuring the system's behavior and QoS performance are widely used by administrators. As a consequence, automatic alerts are integrated and triggered by expert-defined fixed thresholds on these measurements in order to inform the administrators about failures and anomalies. This helps to identify problematic system states, which may cause a decrease in QoS, faster. Even though these tools already support the administrators, it often remains time-consuming to find the correct root causes of problems as well as to determine the correct countermeasures in order to avoid a drop in QoS.

The dynamics of systems additionally increase complexity. Nowadays, systems are quickly scaled up and down due to load changes, e.g. increased user activity throughout the day, as well as due to the agile development of software introducing changes in e.g. microservice behavior. Maintaining service-specific thresholds that mark the border between abnormal and normal behavior therefore becomes increasingly difficult. On the one hand, such a threshold depends on the individual service's or operating system's resource usage, and on the other hand on environmental influences (system load caused by users or by the surrounding virtualized infrastructure). Furthermore, not only is the determination of a suitable threshold difficult, but as the system behavior changes over time, such thresholds have to be adjusted continuously in order to raise anomaly alerts as early as possible, providing the chance to react before a component fails. Thus, automated mechanisms have to be developed to enable such support and, in the best case, to give valuable information to the administrator in real-time.
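The contrast between a fixed expert-set threshold and a continuously adjusted one can be illustrated with a minimal sketch (illustrative only; the window size, warm-up length, and the factor `k` are arbitrary choices here, not parameters from this thesis): the alerting bound is recomputed from a sliding window of recent measurements, so it follows the service's current behavior.

```python
from collections import deque
import math

def dynamic_threshold_alerts(stream, window=60, k=3.0):
    """Flag points exceeding mean + k*std of a sliding window.

    A toy adaptive alternative to fixed expert-set thresholds;
    not one of the IFTM models developed later in the thesis.
    """
    history = deque(maxlen=window)
    alerts = []
    for t, value in enumerate(stream):
        if len(history) >= 10:  # require a minimal history before alerting
            mean = sum(history) / len(history)
            var = sum((x - mean) ** 2 for x in history) / len(history)
            if value > mean + k * math.sqrt(var):
                alerts.append(t)
        history.append(value)  # the threshold adapts as behavior drifts
    return alerts

# Steady load around 50-54, then a single spike at index 80.
metrics = [50.0 + (i % 5) for i in range(80)] + [95.0] + [50.0] * 19
print(dynamic_threshold_alerts(metrics))  # -> [80]
```

A fixed threshold would have to be set above the spike's level by an expert in advance; the sliding-window bound instead derives it from the observed normal variance and re-derives it after every sample.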
Finding an anomaly before it causes a component to fail is a challenging task, as it is highly dependent on the system's individual behavior. Human administrators therefore need a lot of time to investigate large amounts of historic data to gain detailed insights and find the actual root cause. Only then are they able to start effective remediation actions that keep the system highly reliable. Therefore, we proposed a holistic AIOps platform together with a self-healing analytics pipeline to automatically detect and remediate anomalous system components, helping administrators to detect problems more efficiently and giving recommendations to select accurate counteractions [8].

1.1 Research Objectives and Main Contributions

Main research objective: Investigate, develop, and apply anomaly detection techniques within a holistic AIOps platform, enhancing zero-touch administration for highly dependable IT infrastructures.

The objective concentrates on resource monitoring of black-box services, which can run on different levels within the IT environment. The term black-box already indicates that the solution should work without knowledge of the internals of the monitored cloud services. Thus, the algorithmic component must cope with unknown services.

• RO-1: Formulate requirements for anomaly detection approaches to be applicable in productive environments and easily integrable into existing IT infrastructures.

• RO-2: Investigate the applicability of supervised, semi-supervised, and unsupervised techniques for productive usage within IT infrastructures.

• RO-3: Develop, implement, and evaluate strategies for anomaly detection meeting the requirements towards zero-touch administration.

This thesis concentrates on the definition and implementation of the anomaly detection component within a self-healing analytics pipeline, which is necessary for automated anomaly detection on data streams. As the anomaly detection component is based on the holistic AIOps framework [8], we focus on evaluating the proposed methodologies in the domain of black-box service resource monitoring deployed on a replicated virtualized cloud infrastructure. The key contributions of this thesis are the following:

• We describe the overall framework ZerOps, which is capable of decentralized execution of AI approaches in productive cloud environments. Requirements and assumptions are defined to meet industry standards towards zero-touch administration. We analyze and discuss the applicability of different learning strategies for AI-based models and how they can be incorporated into the system design. Additionally, we present results of supervised anomaly detection as baseline results for the given domain.

• As supervised and semi-supervised detection approaches do not meet zero-touch requirements, we develop an unsupervised detection approach capable of alerting e.g. unknown anomalies and adapting to concept changes, while learning and predicting in an online manner. We propose a novel concept adapting clustering algorithm, called CABIRCH, which can be applied to unsupervised anomaly detection and provides the required properties. The impact of hyperparameter settings as well as the applicability towards specific types of anomaly patterns are investigated.

• We define a generalized anomaly detection methodology, called IFTM. Based on the generalized IFTM methodology, we develop and evaluate a strategy to cope with the cold start problem. The cold start problem arises when newly monitored system components are deployed and further IFTM models must be initialized. In order to increase the quality of prediction results right from the beginning, a hyperparameter optimization is proposed utilizing a novel objective function for automated IFTM model configuration.

• Lastly, an extensive evaluation of several IFTM-based anomaly detection models is presented and discussed. The evaluation includes qualitative results from a production-ready cloud environment as well as quantitative runtime benchmarks in order to enable decision support when applying different methods in heterogeneous environments like IoT gateways or edge-cloud environments enabling decentralized computing.

1.2 Publications

Parts of this thesis and related contributions have been published in the following peer-reviewed articles:

[1] René Wetzig, Anton Gulenko, and Florian Schmidt. Unsupervised anomaly alerting for IoT-gateway monitoring using adaptive thresholds and half-space trees. In 2019 IEEE International Conference on Internet of Things: Systems, Management and Security (IOTSMS), pages 161–168. IEEE, 2019.

[2] Florian Schmidt, Florian Suri-Payer, Anton Gulenko, Marcel Wallschläger, Alexander Acker, and Odej Kao. Unsupervised anomaly event detection for cloud monitoring using online ARIMA. In 2018 IEEE/ACM International Conference on Utility and Cloud Computing Companion (UCC Companion), pages 71–76. IEEE, 2018.

[3] Alexander Acker, Florian Schmidt, Anton Gulenko, and Odej Kao. Online density grid pattern analysis to classify anomalies in cloud and NFV systems. In 2018 IEEE International Conference on Cloud Computing Technology and Science (CloudCom), pages 290–295. IEEE, 2018.

[4] Florian Schmidt, Florian Suri-Payer, Anton Gulenko, Marcel Wallschläger, Alexander Acker, and Odej Kao. Unsupervised anomaly event detection for VNF service monitoring using multivariate online ARIMA. In 2018 IEEE International Conference on Cloud Computing Technology and Science (CloudCom), pages 278–283. IEEE, 2018.

[5] Marcel Wallschläger, Anton Gulenko, Florian Schmidt, Alexander Acker, and Odej Kao. Anomaly detection for black box services in edge clouds using packet size distribution. In 2018 IEEE 7th International Conference on Cloud Networking (CloudNet), pages 1–6. IEEE, 2018.

[6] Dora Szücs and Florian Schmidt. Decision tree visualization for high-dimensional numerical data. In 2018 Fifth International Conference on Social Networks Analysis, Management and Security (SNAMS), pages 190–195. IEEE, 2018.

[7] Florian Schmidt, Anton Gulenko, Marcel Wallschläger, Alexander Acker, Vincent Hennig, Feng Liu, and Odej Kao. IFTM - unsupervised anomaly detection for virtualized network function services. In 2018 IEEE International Conference on Web Services (ICWS), pages 187–194. IEEE, 2018.

[8] Anton Gulenko, Florian Schmidt, Alexander Acker, Marcel Wallschläger, Odej Kao, and Feng Liu. Detecting anomalous behavior of black-box services modeled with distance-based online clustering. In 2018 IEEE 11th International Conference on Cloud Computing (CLOUD), pages 912–915. IEEE, 2018.

[9] Florian Schmidt and Yannick Ehrenfeld. Vimec: Interactive application for micro-cluster visualizations. In Proceedings of the Eurographics/IEEE VGTC Conference on Visualization: Posters, pages 29–31. Eurographics Association, 2018.

[10] Marcel Wallschläger, Anton Gulenko, Florian Schmidt, Odej Kao, and Feng Liu. Automated anomaly detection in virtualized services using deep packet inspection. Procedia Computer Science, 110:510–515, 2017.

[11] Alexander Acker, Florian Schmidt, Anton Gulenko, Reinhard Kietzmann, and Odej Kao. Patient-individual morphological anomaly detection in multi-lead electrocardiography data streams. In 2017 IEEE International Conference on Big Data (Big Data), pages 3841–3846. IEEE, 2017.

[12] Anton Gulenko, Marcel Wallschläger, Florian Schmidt, Odej Kao, and Feng Liu. A system architecture for real-time anomaly detection in large-scale NFV systems. Procedia Computer Science, 94:491–496, 2016.

[13] Anton Gulenko, Marcel Wallschläger, Florian Schmidt, Odej Kao, and Feng Liu. Evaluating machine learning algorithms for anomaly detection in clouds. In 2016 IEEE International Conference on Big Data (Big Data), pages 2716–2721. IEEE, 2016.

The full list of publications can be found at http://www.user.tu-berlin.de/flohannes/florianschmidt/.

1.3 Outline of the Thesis

The rest of this thesis is structured as follows:

• Chapter 2 describes the background needed to understand key concepts of this thesis, including definitions of e.g. anomalies, key machine learning methodologies, and evaluation approaches.

• Chapter 3 presents related work on types of system anomaly patterns and anomaly detection methodologies, as well as an overview of online clustering methods and hyperparameter optimization.

• Afterwards, Chapter 4 provides further information about the holistic AIOps framework ZerOps and the inclusion of the anomaly detection within the self-healing pipeline. Furthermore, the chapter explains different supervised and semi-supervised anomaly detection approaches and discusses their applicability for productive usage. These results narrow down the requirements a holistic anomaly detection solution should fulfill.

• Based on these results, Chapter 5 presents the Concept Adapting BIRCH approach, showing the necessary adaptations to the clustering technique BIRCH by Zhang et al. [9] in order to meet the proposed requirements for unsupervised anomaly detection.

• Chapter 6 provides solutions to cope with the cold start problem, including a concept of autonomous hyperparameter optimization and its applicability to several existing unsupervised anomaly detection models.

• Chapter 7 presents the evaluation setup for the black-box service use case and discusses qualitative and quantitative results as well as future research directions.

• Lastly, Chapter 8 summarizes and concludes the thesis.

Chapter 2

Background

2.1 Anomaly detection

There exist many different definitions of the term anomaly detection [10–17]. In this thesis, we refer to the definition by Chandola et al. [18], who distinguish normal from abnormal behavior through probability regions within a stochastic model.

Definition 2.1.1 (Anomaly detection [18]) Normal data instances occur in high probability regions of a stochastic model, while anomalies occur in the low probability regions of the stochastic model.

To provide a clearer understanding of Definition 2.1.1 with respect to data stream capabilities, we extend the definition by Chandola et al. as follows:

Definition 2.1.2 (Anomaly detection extension) The stochastic model is context-dependent and can vary over time. Thus, anomalies are defined with respect to the current context, which at least includes time as a context parameter.

This extension is necessary due to the dynamics of a monitored system environment. Through internal and external changes, the data stream behavior can change, which is referred to as concept drift. Jiang and Gruenwald [19] provide a suitable definition for this:

Definition 2.1.3 (Concept drift [19]) The data distribution in a data stream changes over time.

Anomaly detection algorithms should be robust to concept drifts as long as these reflect normal behavior. It is therefore challenging for a machine learning algorithm to distinguish between normal concept drifts and anomalies. As Definition 2.1.1 suggests, a stochastic model can be learned for normally behaving data. Such a semi-supervised approach is related to a one-class classification task.
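Definitions 2.1.1–2.1.3 can be illustrated with a minimal sketch (illustrative only: the Gaussian assumption, the forgetting factor `alpha`, and the z-score cut-off are choices made here, not methods from this thesis). A univariate Gaussian model is updated online with exponential forgetting, so points in its low-probability region are flagged as anomalies while gradual concept drift is absorbed into the model:

```python
import math

class GaussianStreamModel:
    """Univariate Gaussian updated online with exponential forgetting.

    Points in the model's low-probability region (Definition 2.1.1)
    are anomalies; because the model itself tracks the data
    (Definitions 2.1.2-2.1.3), slow concept drift stays "normal".
    """
    def __init__(self, alpha=0.05):
        self.alpha = alpha  # forgetting factor, illustrative choice
        self.mean = None
        self.var = 1.0

    def is_anomaly(self, x, z=4.0):
        if self.mean is None:       # first observation initializes the model
            self.mean = x
            return False
        # low-probability region: more than z standard deviations away
        anomalous = abs(x - self.mean) > z * math.sqrt(self.var)
        # update afterwards, so the model follows the current concept
        d = x - self.mean
        self.mean += self.alpha * d
        self.var = (1 - self.alpha) * (self.var + self.alpha * d * d)
        return anomalous

model = GaussianStreamModel()
drift = [i * 0.01 for i in range(500)]          # slow drift from 0 to 5
flags = [model.is_anomaly(x) for x in drift]
print(any(flags))             # slow drift absorbed as normal: False
print(model.is_anomaly(25.0)) # sudden jump into low-probability region: True
```

The same jump would be unremarkable to a model trained once offline on a dataset that already contained values around 25; the context-dependence of Definition 2.1.2 is what makes it an anomaly here.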

Definition 2.1.4 (One-class classification [20]) The task in one-class classification is to define a classification boundary around the positive (or target) class, such that it accepts as many objects as possible from the positive class, while it minimizes the chance of accepting non-positive (or outlier) objects.


As Khan and Madden [20] describe, the main challenge within this task is the definition of the boundary around the learned class. The boundary should be compact to provide a proper border between normal and abnormal data points. This concept is used for anomaly detection as well as for the definition of the hyperparameter optimization in this work.
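The boundary-compactness trade-off can be sketched with a deliberately simple one-class model (a hypothetical stand-in, not a classifier used in this thesis): a hypersphere around the centroid of the target class whose radius covers a chosen quantile of the training points. Shrinking the quantile makes the boundary more compact but rejects more target objects.

```python
import math

def fit_boundary(points, quantile=0.95):
    """Fit a toy one-class boundary: a hypersphere around the centroid
    of the target class covering `quantile` of the training points."""
    dim = len(points[0])
    centroid = [sum(p[i] for p in points) / len(points) for i in range(dim)]
    dists = sorted(math.dist(p, centroid) for p in points)
    # radius = empirical quantile of training distances (boundary compactness)
    radius = dists[min(int(quantile * len(dists)), len(dists) - 1)]
    return centroid, radius

def accepts(centroid, radius, point):
    """Accept a point iff it lies inside the learned boundary."""
    return math.dist(point, centroid) <= radius

# target class: normal resource usage around (1.3, 1.8)
normal = [(1.0 + 0.1 * (i % 7), 2.0 - 0.1 * (i % 5)) for i in range(100)]
centroid, radius = fit_boundary(normal)
print(accepts(centroid, radius, (1.2, 1.9)))  # inside the boundary: True
print(accepts(centroid, radius, (9.0, 9.0)))  # far outside: False
```

Real one-class classifiers (e.g. one-class SVMs) learn far more flexible boundaries, but the objective is the same: accept the target class while keeping the accepted region as tight as possible.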

2.1.1 Types of Anomalies

Many publications suggest three different types of anomalies: point, contextual, and collective [21, 22]. We follow the definitions by Hodge and Austin [13] and Chandola et al. [18], who define the three types as follows:

Definition 2.1.5 (Types of anomalies [13, 18])

1. An isolated individual point in a dataset.

2. A data point that is isolated with respect to the other data points in its context. Contextual attributes might be time, location, etc. Barnett and Lewis [11] define type 2 as additive outliers for time series data. Additive outliers do not influence the other data points in their context.

3. A particular group of data points that is anomalous with respect to the entire dataset. Barnett and Lewis [11] call these innovations outliers for time series data. Innovations outliers influence other data points of the same context, which tends to mask them.

Sadik and Gruenwald [23] state that for data streams only type 2 or type 3 anomalies can surface, but never type 1, because a time series attaches a temporal context to each data point. As data streams are considered a possibly infinite number of time-dependent data points, their processing is considered to be online, since we cannot expect to store all data points. Therefore, at any particular moment, only a subset of the entire dataset is available. With respect to type 1, a data point cannot be described as an isolated individual point with respect to the entire dataset, as the data points are connected through the context of time.

Ramchandran and Sangaiah [22] and Goldstein and Uchida [17] furthermore define global and local outliers. Global outliers are data points that are abnormal compared to the complete (global) dataset. Local outliers, on the other hand, are data points that are abnormal within the local context of a subset of the data. Since Sadik and Gruenwald [23] argue that the entire dataset can never be available at any point in time, they conclude that it is never possible to detect global anomalies. As only a subset of data from the latest temporal context is available, local anomalies can be detected, which we also consider as type 1 anomalies with respect to the local context.
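The difference between a global and a contextual (local) outlier can be sketched as follows (a toy example; the window size and z-score cut-off are arbitrary illustrative choices, not thesis parameters): a daytime-level load value occurring at night is unremarkable against the whole dataset but isolated within its temporal context.

```python
def global_outlier(series, idx, z=3.0):
    """Type 1-style check against the complete (global) dataset."""
    mean = sum(series) / len(series)
    std = (sum((x - mean) ** 2 for x in series) / len(series)) ** 0.5
    return abs(series[idx] - mean) > z * std

def local_outlier(series, idx, window=5, z=3.0):
    """Type 2-style check against the preceding temporal context only."""
    ctx = series[max(0, idx - window):idx]
    if not ctx:
        return False  # no context available yet
    mean = sum(ctx) / len(ctx)
    std = (sum((x - mean) ** 2 for x in ctx) / len(ctx)) ** 0.5
    return abs(series[idx] - mean) > z * std + 1e-9

# night-time load 0, daytime load 20 -- plus one daytime-level
# value in the middle of the night
series = [0.0] * 20 + [20.0] * 20
series[10] = 20.0
print(global_outlier(series, 10))  # False: the value 20 is common globally
print(local_outlier(series, 10))   # True: isolated within its night context
```

This mirrors the argument above: on an unbounded stream only the recent context is available, so detection is necessarily local, and a "point" anomaly can only ever be declared relative to that context.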

2.1.2 Failures and Degraded State Anomalies

Within the IT-infrastructure context, there exist multiple meanings and wordings related to the term anomaly, like faults, errors, failures, etc. Avizienis et al. [24] define service faults to be the root cause of an error, while errors represent a deviation from the correct internal service component state. Errors may propagate to further internal service components through e.g. API calls and cause a service failure. Service failures are lastly defined by incorrect service states becoming visible in responses to external service requests. The main focus of this thesis is on component-level errors, which are considered as anomalies in the following.

Failures can be caused by either hardware modules or software components. Hardware failures relate to specific hardware components of a computer system like hard disks, memory modules, network cards, and processors [25]. Gray [26] as well as Barroso and Hölzle [27] showed in their studies that hardware failures cause less than 10% of total service failures. Oppenheimer et al. [28] also indicated that the smallest share of contributing failures relates to hardware (10–25%). In Gray's [26] study (more than 5 years of field data of highly fault-tolerant Tandem servers), 60% of fault events relate to software errors and 20% refer to maintenance and operations faults. Similar measurements are presented by Oppenheimer et al. [28], who studied service-level failures of three internet services consisting of more than 1500 servers. They concluded that misconfiguration of service components, as well as operator-caused errors, contribute the largest portion of service failures. Barroso and Hölzle [27] presented the distribution of disruption events of Google's main services (6 weeks of field data) to be consistent with the above-described studies: service faults are mainly caused by software and configuration errors (60%) and by human error and networking issues (20%).
In addition, Barroso and Hölzle [27] introduced four categories for distinguishing service-level failures, representing the severity (with respect to the QoS) in decreasing order:

• Corrupted: Stored state of the service is lost or corrupted and cannot be regenerated.

• Unreachable: The service is unreachable and does not respond to requests.

• Degraded: The service is available but operates in some degraded state.

• Masked: Faults occur in service components but are completely hidden from the entity using the service.

Masked failures are related to and caused by fault-tolerant operation and design of software and hardware components. With respect to the QoS, masked failures do not influence the QoS negatively, in contrast to the other categories. Within this work, we consider degraded failures as well as masked failures as the anomalies to be detected, for the following two reasons:

1. We assume that unreachable services and corrupted failures are easy to detect through e.g. active probing, while failures causing degraded states, or even having no influence on the QoS yet, are more difficult to detect.

2. In cases of degraded state and masked failures, we assume a larger window of opportunity to remediate such problems before failures lead to unreachability or corruption.

Cotroneo et al. [29] described that system components operating in anomalous states suffer from performance degradation, whereby the degradation severity depends on the severity of the anomalous state. In order to provide self-healing capabilities, degraded state anomalies are the focus of this work, as those anomalies allow the pipeline to react. Additionally, we consider detecting anomalies not only on the coarser service-system level but on a finer-grained service-component level. This enables alerting administrators about anomalies at component level.
In addition to failures, there exists the phenomenon of software aging [30], which has to be considered for anomaly detection. Over time, the failure rate of software increases due to the increasing chance of errors caused by e.g. unbounded resource consumption, data corruption, or the accumulation of numerical errors [31]. Avritzer and Weyuker [32] describe software aging for telecommunication switching software, which leads to gradual performance degradation. Software aging may cause aging-related failures [33], but also exhibits concept shift behavior when applying anomaly detection approaches, as errors might be masked.

Figure 2.1: Clearwater service components and their connections to each other. Source: [36].

2.2 Application Domain

The application domain derives from the rapid digitalization transformations happening in the telecommunication sector. Telecommunication providers are facing several different aspects of change in technology as well as shifts in business opportunities today. For example, dedicated hardware solutions are moved to the cloud. This phenomenon is described as softwarization, meaning the transformation of mostly hardware-related mechanisms into software running in virtualized environments [34, 35]. As described in Chapter 1, this comes with more flexibility, energy advantages, and cost-effectiveness, but also with more complexity and problems regarding the high reliability of the new services (as previous hardware solutions provided 99.999% availability [3]). This thesis focuses on two productively applied VNFs, an IMS and video streaming, as service use cases for evaluation purposes.

2.2.1 IP multimedia subsystem

Project Clearwater1 is considered one of the first examples of a Virtual Network Function. Project Clearwater is open-source software which provides an implementation of the IP multimedia subsystem (IMS). IMS is an emerging architecture for IP-based telecommunication services, such as voice and video calls or messaging. The core implementation of Clearwater consists of several microservices (see Figure 2.1), allowing users to register, authenticate, and initiate the connection for calls within the system. The services named Bono, Sprout, Homestead, and Ellis are considered within the experiments in this thesis.

• Bono: This service functions as an edge proxy, handling clients' connections. Clients can register via the SIP protocol in order to initiate calls. The requests are then routed to the Sprout service.
• Sprout: This service manages the different communications with the other internal services, e.g. requesting authentication.

1 https://www.projectclearwater.org/


Figure 2.2: Structure of web service components for the video-on-demand service.

• Homestead: This service contains the client profile information, which is needed to authenticate clients.
• Homer: This service contains user-specific service settings documents.
• Ellis: This service obtains the information for the management unit. It functions as an account management system.

Project Clearwater concentrates on the IMS core, such that the focus is the initiation process of calls between users. Calls themselves are not handled within the Clearwater service. Clearwater handles mostly user management procedures as well as call management related tasks.

2.2.2 Video on demand

As the increase in network usage is highly influenced by video streaming nowadays [37, 38], the thesis applies a reference implementation of a video-on-demand service. As described in our work [39], we also used the video-on-demand service as a web service use case in this thesis. The video-on-demand service consists of a setup of three individual web service components (see Figure 2.2):

1. The video backend server holds a local copy of the video content and is implemented using the RTMP module2 of the Nginx service3.
2. Furthermore, a load balancer connects clients and replicated backend servers through round-robin-based DNS balancing.
3. Lastly, there exist user clients, hosted on a separate hypervisor. Clients stream the available video content using the ffmpeg4 command line tool.

Both service use cases represent the current leading focuses in the softwarization of hardware solutions as VNFs as well as in network utilization, which the large telecommunication providers need to cope with.

2.3 Analytic Concepts

Resource monitoring data is the targeted data source of this thesis. It is continuously collected and provided to the analysis component. The monitored data is considered as an incoming data stream. We consider the following properties of a data stream, as defined by Babcock et al. [40].

2 https://github.com/arut/nginx-rtmp-module
3 https://nginx.org/
4 https://www.ffmpeg.org/

Definition 2.3.1 (Data stream [40])

• The data elements in the stream arrive online.
• The system has no control over the order in which data elements arrive to be processed, either within a data stream or across data streams.
• Data streams are potentially unbounded in size.
• Once an element from a data stream has been processed, it is discarded or archived — it cannot be retrieved easily unless it is explicitly stored in memory, which typically is small relative to the size of the data streams.

Furthermore, the monitoring data will be considered within this thesis as a time series data stream. A proper definition, which we will use in this thesis, has been provided by Sadik and Gruenwald [23]. Instead of the term time series, they use the nomenclature data stream; we switch to the term time series within this thesis to provide consistent naming for the reader. Thus, the modified version, mainly based on Sadik and Gruenwald [23], is:

Definition 2.3.2 (Time series [23]) A time series is an infinite set of data points $P = \{x_t \mid 0 \le t\}$, where a data point $x_t$ is a set of attribute values with an explicit or implicit timestamp. Formally, a data point is $x_t = (v, \tau_t)$, where $v$ is a $p$-tuple, each value of which corresponds to one attribute, and $\tau_t$ is the timestamp of the $t$-th data point.

Albers [41, 42] provided a general definition of online algorithms:

Definition 2.3.3 (Online algorithm [41, 42]) Formally, an online algorithm receives a sequence of requests $\sigma = \sigma(1), \dots, \sigma(m)$. These requests must be served in the order of occurrence. When serving request $\sigma(t)$, an online algorithm does not know requests $\sigma(t')$ with $t' > t$. Serving requests incurs cost, and the goal is to minimize the total cost paid on the entire request sequence.

Definition 2.3.3 describes that online algorithms have to provide point-wise analysis properties. Thus, arriving data points have to be computed in the incoming sequence. In order to ensure that each data point can be computed, the computation of the previous data point must be completed. Such demands on the computation time are described as real-time processing and just-in-time processing, depending on the flexibility of the exact time needed for computation. Nelson and Wright [43] described that the time to decision is the crucial objective to be considered in such time-demanding applications. In this thesis, we distinguish between real-time processing and just-in-time processing as follows. We consider the term real-time processing to provide hard and fixed bounds on computation steps with respect to computation time. By this, the computation must follow an exact number of instructions, which are all bounded in time. Thus, a given analysis step will always provide an answer after a specific amount of time. Such hard constraints are needed in many applications regarding human safety, e.g. assisted surgery, inflation of airbags, ABS systems in cars, etc. Therefore, the time to decision is fixed by a defined time. In contrast to this, just-in-time processing provides more flexibility with respect to the time to decision. As the name indicates, the computation has to be finished just in time, before a defined scenario happens. For the data stream processing use case, we consider that computation must finish before a new data point arrives.
Thus, the time to decision is bounded by the monitoring frequency. Within this thesis, we first concentrate on the just-in-time processing definition.
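The just-in-time constraint can be made concrete with a small sketch; the `step` analysis function and the `monitoring_interval` parameter are illustrative stand-ins, not part of the thesis pipeline:

```python
import time

def process_stream(points, step, monitoring_interval):
    """Point-wise ("just-in-time") processing: each data point must be
    handled before the next one arrives, i.e. within the monitoring interval."""
    decisions = []
    for x in points:
        start = time.monotonic()
        decisions.append(step(x))  # analysis of the current data point
        elapsed = time.monotonic() - start
        # the time to decision is bounded by the monitoring frequency
        assert elapsed <= monitoring_interval, "deadline missed"
    return decisions
```

A real deployment would of course react to a missed deadline instead of asserting, but the loop shows the ordering constraint of Definition 2.3.3: point t must be fully served before point t+1 is considered.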

2.3.1 Machine Learning Methodologies

The key characteristic of machine learning algorithms is the need for a performance criterion and the need for data to learn from [44]. The performance criterion mostly focuses on the computational goal, like classification, clustering, regression, etc. With respect to the availability of training data for the learning process, machine learning models can be categorized into the following three types: supervised, semi-supervised, and unsupervised. Landauer et al. [45], Ramchandran and Sangaiah [22], Hodge and Austin [13], and Goldstein and Uchida [17] all describe these three types for anomaly detection, dependent on the information available for learning, as follows:

• Supervised learning requires labeled information for the given data points. All different types of anomalies as well as normally behaving data must be presented to the learning approach.
• Semi-supervised learning instead consumes a limited subset of labeled information about the normal system state, while abnormal behavior is not provided as labeled information.
• Unsupervised learning can cope with no prior knowledge about labels. In order to make decisions, this type of learner tries to determine patterns within the given data, but also needs further defined requirements in order to distinguish between normal and abnormal data.

Another characterization of learners is the availability of training data over time. These are mainly differentiated into offline learning and online learning [46].

• Offline learning (also called batch learning) provides the learning algorithm with the complete dataset. Some machine learning algorithms require the complete set as they are optimized to use random input of training data or compute model updates by multiple repeated runs over the data. To ensure such availability of the complete dataset, the data has to be stored or held in memory, which might not be applicable for many use cases.
• Online learning, on the other hand, does not use the entire dataset at once; it is based on iterative learning of new incoming data points one after another.

As a data stream might be endless, online learning is the targeted type of learning method within this thesis. In the following, we describe different machine learning approaches, which are frequently used within anomaly detection algorithms to model the normal system state and are further considered as baseline approaches for evaluation.

2.3.2 BIRCH

BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) [9] is an online clustering algorithm aiming to cluster data points within big datasets through iterative computation. BIRCH applies the concept of micro-clusters as a key methodology to aggregate distribution information of the given data points. This concept enables fixed-size memory usage and logarithmic time for iterative updates. Thus, BIRCH is computationally efficient also for big datasets, but was not designed for time series analysis. The main purpose of BIRCH is the clustering of big datasets, such that data are randomly inserted into the clustering algorithm in order to train an optimized model. There exist many different clustering algorithms using the concept of micro-clusters in order to summarize data points [9, 47–50]. Micro-clusters are defined as a tuple containing three entries, $(N, L, S)$, where

Figure 2.3: Structural overview of a CF-tree.

• $N$ is the number of multi-dimensional data points $x_i$ represented by the micro-cluster,
• $L$ the linear sum $\sum_{i=1}^{N} x_i$ of the data points, and
• $S$ the squared sum $\sum_{i=1}^{N} x_i^2$.

This notation allows maintaining an aggregated summary of data point sets in an online manner by adding new data points to the three components. Thus, the summary is fixed in memory consumption. Furthermore, clusters can be easily combined by adding the single components. They proved the additivity property [51, 52] to merge micro-clustering tuples by:

$$CF_1 + CF_2 + \dots + CF_b = (N_1 + N_2 + \dots + N_b,\ L_1 + L_2 + \dots + L_b,\ S_1 + S_2 + \dots + S_b) \quad (2.1)$$

Zhang et al. [9] introduced a hierarchical tree-based model for BIRCH by defining the nodes of the tree to store Clustering Features (CF) within a height-balanced CF-tree. CFs internally store such micro-clustering information. Each CF is related to a node within the CF-tree. Furthermore, they use the additivity property (2.1) to define the micro-clustering tuples of parent nodes. The nodes are stored in the data structure of a height-balanced B+-tree, interconnecting leaf nodes with each other as illustrated in Figure 2.3. Figure 2.4 shows a CF-tree containing two CFs in each node on the left side. On the right side, the data points are represented, colored according to the CFs of the leaf nodes. Furthermore, the root node information is based on the additivity property (see 2.1): $Root.CF_1 = Leaf_1.CF_1 + Leaf_1.CF_2$ and $Root.CF_2 = Leaf_2.CF_1 + Leaf_2.CF_2$, respectively. Based on the three values stored within a CF, it is possible to determine

• a centroid $c = L/N$ and
• a radius $R = \sqrt{\frac{N \cdot (L/N)^2 + S - 2 \cdot (L/N) \cdot L}{N}}$, representing an n-sphere.

Zhang et al. [9] define four phases for BIRCH:

1. Initialize model by loading data: This phase represents the training of the CF-tree.
2. (Optional) Condense model into a desired size: This phase aggregates the clustering model to be fixed, e.g. in height.

Figure 2.4: Exemplary CF-tree, representing clusters on the right side.

3. Apply clustering: BIRCH first builds the complete tree and then applies the prediction to the dataset again.
4. (Optional and offline) Refine clustering: Zhang et al. [9] present the possibility to use offline clustering techniques like k-means when searching for a defined number of k clusters. Thus, k-means can be applied to the centroids of the leaves' CFs, which are thereby grouped.

The insertion of a new data point can also be described in four phases:

1.1 Identify the appropriate leaf node: This phase identifies the closest leaf node by comparing the distance between CFs and the given data point. This can be achieved by applying the Euclidean distance between the centroid and the data point. Another possibility is to check whether the data point fits into the n-sphere of a CF.
1.2 Modify the leaf node: Given the leaf node, check whether the leaf can absorb the data point without violating a user-defined threshold, which represents the maximum size of a leaf cluster. When the condition is not violated, the point is added to the corresponding CF. When the condition is violated, the new point is added to a new CF.
1.3 Modify the path to the leaf: As a new CF was created, it is possible that the leaf cannot store that many CFs, so the node needs to be split. Thus, the leaf node is replaced with a new node, which again has new leaf nodes as children.
1.4 Rebuilding: This phase can be applied when multiple splits have already occurred. The goal is to group the closest CFs into the same nodes, which might have been separated by previous splits. This improves the overall structure of the tree.
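As an illustration of the micro-cluster bookkeeping, the following sketch maintains a single CF tuple (N, L, S) with the additivity property and the centroid and radius formulas above. It is a minimal stand-in for one Clustering Feature, not the full CF-tree logic of BIRCH:

```python
import math

class CF:
    """Clustering Feature (micro-cluster): the tuple (N, L, S) with
    N = number of points, L = per-dimension linear sum, S = squared sum."""
    def __init__(self, dim):
        self.N = 0
        self.L = [0.0] * dim
        self.S = 0.0

    def add(self, x):
        # absorbing a point only touches the three aggregate components
        self.N += 1
        self.L = [l + v for l, v in zip(self.L, x)]
        self.S += sum(v * v for v in x)

    def merge(self, other):
        # additivity property (2.1): CF1 + CF2 = (N1+N2, L1+L2, S1+S2)
        self.N += other.N
        self.L = [a + b for a, b in zip(self.L, other.L)]
        self.S += other.S

    def centroid(self):
        return [l / self.N for l in self.L]

    def radius(self):
        c = self.centroid()
        cc = sum(v * v for v in c)            # ||c||^2
        cl = sum(a * b for a, b in zip(c, self.L))
        # radius of the n-sphere: sqrt((N*||c||^2 + S - 2*c.L) / N)
        return math.sqrt((self.N * cc + self.S - 2 * cl) / self.N)
```

For two points (0, 0) and (2, 0), the centroid is (1, 0) and the radius is 1, i.e. the smallest sphere-style summary around the cluster mean; memory stays constant regardless of how many points are absorbed.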

2.3.3 Autoencoder

An autoencoder is a neural network consisting of two parts, an encoder η and a decoder ζ, as illustrated in Figure 2.5. Autoencoders are used in several different contexts, e.g. speech recognition [53], image processing [54], and intrusion detection [55]. They are designed to be applied to dimension reduction tasks [56] due to their construction with an internal bottleneck. The encoding function projects the incoming data to a smaller dimension, while the decoding function approximates the encoded version back to the original dimensions. In detail, autoencoders consist of multiple neural network layers, split into η : X → Y and ζ : Y → X. Basically, the number of nodes decreases towards a bottleneck in the encoding part, while the number of nodes in the decoding part increases again to the original number of dimensions.


Figure 2.5: Abstract illustration of an autoencoder network with four input and output layers and a bottleneck of two dimensions.

Figure 2.6: VAE model. Source: [58]

Using an autoencoder as an identity function for a given data stream, each data point is learned simply iteratively. Given a data point x, applying ζ ∘ η : X → X results in an approximation x′ of the same data point, which is used for computing the reconstruction error. Most autoencoders use either simple feedforward neurons or restricted Boltzmann machines as layers.
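To make the reconstruction-error computation concrete, the following sketch applies a hand-written toy encoder/decoder pair as the identity function. The linear 4 → 2 bottleneck is an illustrative stand-in for trained networks, not an actual model from this thesis:

```python
def reconstruction_error(x, encode, decode):
    """Squared reconstruction error of an autoencoder used as an identity
    function: distance between x and x' = (decode . encode)(x)."""
    x_hat = decode(encode(x))
    return sum((a - b) ** 2 for a, b in zip(x, x_hat))

# Hypothetical stand-in for trained encoder/decoder networks: a 4 -> 2
# bottleneck that keeps only the first two components (e.g. CPU, Mem).
encode = lambda x: [x[0], x[1]]
decode = lambda y: [y[0], y[1], 0.0, 0.0]
```

Points lying in the learned (here: hard-coded) subspace reconstruct perfectly, while deviating points produce a large error — exactly the signal a detector thresholds on.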

2.3.4 Variational Autoencoder

In contrast to the autoencoder network defined above, there exists an adapted version mapping the encoded values onto predefined distributions, called the variational autoencoder (VAE) [57, 58]. A VAE is a neural network architecture based on the structure of an autoencoder but including distribution models within the bottleneck. The input vector from the encoding network is mapped to a predefined distribution instead of being mapped onto a vector as in a traditional feedforward autoencoder. Hereby, the vector of the traditional autoencoder is replaced by two vectors: one represents the mean and the other captures the variance of the predefined distribution. The decoding network is fed with such mean and variance of the distribution, which represent the low-dimensional latent space. Figure 2.6 shows the dependencies of the generative process by illustrating the latent variable z and the known variable x, both assumed to follow an underlying distribution (usually a multivariate Gaussian distribution $\mathcal{N}(\mu, \sigma^2)$). The solid lines describe the generative model $p_\theta(x, z) = p_\theta(x|z)\,p_\theta(z)$, while the dashed lines approximate the variational posterior $q_\phi(z|x) = \mathcal{N}(\mu_\phi(x), \sigma^2_\phi(x))$, where $\mu_\phi$ and $\sigma^2_\phi$ derive from the neural network.

Figure 2.7: LSTM cell visualization. Source: [64], part of the image is taken from the original visualization.

2.3.5 Long Short Term Memory Networks

Traditional feedforward neural networks have no notion of order in time, as they do not consider data from previous iterations. This issue is solved by introducing recurrent loops within the neural network model, making historic decisions from previous iterations available to the current calculation. Such networks are known as Recurrent Neural Networks (RNNs) [59]. Due to their construction, RNNs capture only a fixed number of previous states, which is why the concept of long short-term memory networks gained attention. In more detail, many RNNs are not able to learn long time lags between relevant input events due to the backpropagated error, which either exponentially increases or decreases, thus effectively getting lost. This vanishing error problem was tackled by introducing several gating units combined as the Long Short Term Memory (LSTM) cell [60]. LSTMs are widely used in several different application areas, e.g. stock market price predictions [61], patient diagnosis prediction based on the patient's history [62], and video classification [63]. Figure 2.7 by [64] presents an LSTM cell (there called a block), consisting of three gates: input gate, output gate, and forget gate. The image also shows the dependencies between the different states and whether there exist weights, computational operators, and activation functions (with the usual usage indicated in the legend). Besides the gates, the input and output blocks are shown, which are also present in the RNN context. The incoming value to an LSTM cell is first multiplied by the activation of the input gate. The resulting value is further multiplied by the forget gate, which captures the output values of all previous cell states. Lastly, the value is multiplied by the output gate and provided as cell output to the next layers of the network. The gates use element-wise multiplication by sigmoids, so they decide when data is allowed to enter, leave, or be removed.
Their own set of weights is also adjusted through the recurrent network's learning process, so over time the unit learns when to forget information or ignore certain input, such that a state will not impact the network's output. Each LSTM layer consists of a set of these units, which are recurrently connected.

This vanilla version of the LSTM cell is also used within this work with the proposed activation functions. The concrete formulas integrated into the framework are described by Greff et al. [64] and defined by Hochreiter and Schmidhuber [60].

2.3.6 Dynamic Threshold Models

Recent works apply dynamic thresholding models in order to differentiate anomaly scores in an online manner [65–67]. A valuable threshold T has to be selected in order to differentiate between normal and abnormal scores. The following four different models for calculating a threshold are investigated: Gaussian Cumulative Aggregation (CA), Sliding Window Aggregation (SWA), Exponential Moving Model (EMM), and Double Exponential Moving Model (DEMM). All of these methods are well-known approaches that either calculate statistical values used for building a threshold or are used for smoothing time series data [68].

Gaussian Cumulative Aggregation For the cumulative-aggregation-based threshold learning, we assume that the reconstruction error ∆ is normally distributed, as also assumed by [65–67]. Consequently, we define a threshold based on the mean µ(∆) and standard deviation σ(∆) of the complete history of ∆ values. Based on this assumption, we define a threshold of T = µ(∆) + c · σ(∆). The parameter c reflects a sensitivity value, as the rate of false alarms can be approximated through T. Obtaining the mean and standard deviation iteratively over time is straightforward and computed in constant time. Over longer periods of time, new data no longer has a great impact on T. This makes this approach difficult to adapt to dynamic and concept-changing situations. In order to avoid this behavior, there exist further methods tackling this problem.
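A minimal sketch of the cumulative aggregation threshold, updating mean and standard deviation iteratively in constant time. Using Welford's online algorithm for the running variance is an assumed implementation detail, not prescribed by the thesis:

```python
import math

class CumulativeThreshold:
    """Gaussian cumulative aggregation: T = mu + c * sigma over all historic
    reconstruction errors, updated iteratively (Welford's online algorithm)."""
    def __init__(self, c=3.0):
        self.c, self.n, self.mu, self.m2 = c, 0, 0.0, 0.0

    def update(self, delta):
        self.n += 1
        d = delta - self.mu
        self.mu += d / self.n
        self.m2 += d * (delta - self.mu)  # running sum of squared deviations

    def threshold(self):
        sigma = math.sqrt(self.m2 / self.n) if self.n > 1 else 0.0
        return self.mu + self.c * sigma
```

A score above `threshold()` would be flagged as abnormal; since every historic value contributes equally, T reacts ever more slowly as n grows, which motivates the windowed and exponential variants below.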

Sliding Window Aggregation Sliding windows are a concept of storing the latest historic values. The window has a fixed size, so that for new incoming data, older data points need to be removed. For this purpose, we store the latest reconstruction errors in a sliding window. Based on these errors, the mean and standard deviation are calculated, and the same threshold T can be computed as described above. Obtaining this information is also possible iteratively by applying the same mechanism as used in CA while subtracting the older values that are removed from the window. A major disadvantage is that an expert has to define the size of the window, which may be highly dependent on the current situation of the monitored system. In order to automatically give newer data more impact, the following methods are defined.
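A corresponding sketch for the sliding window variant; the fixed window `size` is the expert-chosen parameter mentioned above, and recomputing the statistics over the window (rather than the incremental add/subtract variant) is a simplification for clarity:

```python
from collections import deque
import statistics

class SlidingWindowThreshold:
    """T = mu + c * sigma over only the last `size` reconstruction errors."""
    def __init__(self, size, c=3.0):
        self.window = deque(maxlen=size)  # old values drop out automatically
        self.c = c

    def update(self, delta):
        self.window.append(delta)

    def threshold(self):
        mu = statistics.fmean(self.window)
        sigma = statistics.pstdev(self.window) if len(self.window) > 1 else 0.0
        return mu + self.c * sigma
```

Because only the newest `size` errors contribute, an old outlier stops influencing T as soon as it leaves the window, in contrast to the cumulative model.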

Exponential Moving Model The exponential moving model is used for smoothing values or calculating a running moving average and variance. Basically, the latest data has exponentially more impact on the current value than older data. We use the weighted version of EMM [69], where the user can influence the impact of new data by defining a weight α, also called the learning rate. For a new incoming data point x at time t, the mean µ and variance v can be calculated as:

$$\mu_t = (1 - \alpha) \cdot \mu_{t-1} + \alpha \cdot x \quad (2.2)$$
$$v_t = (1 - \alpha) \cdot (v_{t-1} + \alpha (x - \mu_{t-1})^2),$$

where $\sigma_t = \sqrt{v_t}$. For the initial configuration, we use the first incoming data point as the initial µ. Based on this concept, advanced smoothing models have been developed capturing further time series behaviors.
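Equation 2.2 translates directly into a single iterative update step; this is a sketch, and the default `alpha` is illustrative:

```python
import math

def emm_update(mu, v, x, alpha=0.1):
    """One step of the weighted exponential moving model (Eq. 2.2):
    mu_t = (1-alpha)*mu_{t-1} + alpha*x
    v_t  = (1-alpha)*(v_{t-1} + alpha*(x - mu_{t-1})**2)"""
    mu_new = (1 - alpha) * mu + alpha * x
    v_new = (1 - alpha) * (v + alpha * (x - mu) ** 2)
    return mu_new, v_new, math.sqrt(v_new)  # sigma_t = sqrt(v_t)
```

The threshold is then again T = µ_t + c · σ_t, with α steering how fast old observations fade out.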

Double Exponential Moving Model The DEMM [68] consists of two functions: one for the overall trend m of the data and a non-trend function, also called the baseline n. For a new incoming data point x, the resulting smoothed value x′ combines both functions by simply adding the results: x′ = m(x) + n(x). For DEMM, we again use a weighted model, such that the user can define their own learning rates for both functions: weight α for n and weight β for m. As an example, we show how the calculations are performed for the mean. The standard deviation can be calculated analogously, as shown for EMM.

$$n_t = (1 - \alpha) \cdot (n_{t-1} + m_{t-1}) + \alpha \cdot x$$
$$m_t = (1 - \beta) \cdot m_{t-1} + \beta \cdot (n_t - n_{t-1}) \quad (2.3)$$
$$\mu_t = n_t + m_t.$$

There exist further advanced models like the Holt–Winters model [70, 71], which additionally includes the seasonality of regularly recurring changes within the data stream. Here, the user needs to define the exact seasonal length. As we currently do not consider seasonal data, we did not integrate this method into our evaluation and propose such an evaluation with other datasets as future work.

2.3.7 Genetic Algorithm

Within this thesis, we concentrate on genetic programming as the basis for hyperparameter optimization [72–75]. Genetic programming is also applied in different domains like medical image processing [76–78], stock price prediction [79, 80], electricity load analysis [81], and rainfall forecasting [82, 83].

Algorithm 1 Basic Genetic Algorithm based on [84]
1: population ← random generation of initial pool of chromosomes
2: while terminationCriterion is not met do
3:     Evaluate fitness for each chromosome
4:     Select parents to breed new generation
5:     population ← Crossover parents
6:     Mutate population

Algorithm 1 shows the general process of a typical genetic program as defined in [84]. It consists of the following main steps:

• Initial population: Genes are considered as entities to be optimized (e.g. single hyperparameters). These genes form a chromosome, which represents the combination of specific genes into an algorithmic solution. The population is a selection of chromosomes and is initiated in this first step.
• Fitness function: The fitness function provides a scoring value by which the individual chromosomes' performances can be differentiated. This score functions as the optimization criterion and is used for ranking the individuals to distinguish promising individuals from those that do not provide good results.
• Selection: Based on the fitness scores, the selection of individuals describes which individuals will pass their genes to the next generation. These selected individuals are also called parents.
• Crossover: When breeding the new generation, we cross over two parents from the set of selected individuals. The crossover describes how the genes of the individual parents are selected to form the child.

                 predicted
              anomaly   normal
actual anomaly   TP       FN
       normal    FP       TN

Table 2.1: Binary classification possible outcomes with respect to the ground truth value.

• Mutation: This step describes which crossover genes are mutated with some probability. This evolves the population to break out of local optima, thus gaining the possibility to find the global optimum or a better local optimum.
• Termination: This phase is optional, but needed when the optimization has to terminate. The termination criterion provides the exit condition determining when this optimization technique should stop. For example, when there is no increase detected in the average fitness score, the population has reached an equilibrium.

There exist multiple more complex extensions optimizing the general structure of genetic programming, e.g. [85–88]. We refer the interested reader to the survey by Lim [84], providing an extensive overview of different genetic programming adaptations.
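The steps of Algorithm 1 can be sketched as follows. The gene spaces, selection of the best half as parents, and one-point crossover are illustrative choices, not prescribed by [84]:

```python
import random

def genetic_search(fitness, gene_space, pop_size=20, generations=30,
                   mutation_rate=0.1, seed=0):
    """Minimal genetic algorithm following Algorithm 1: random initial
    population, fitness-based parent selection, one-point crossover, mutation.
    gene_space[g] lists the allowed values for gene g of a chromosome."""
    rng = random.Random(seed)
    genes = len(gene_space)
    # 1: random generation of the initial pool of chromosomes
    pop = [[rng.choice(gene_space[g]) for g in range(genes)]
           for _ in range(pop_size)]
    for _ in range(generations):          # 2: termination criterion
        pop.sort(key=fitness, reverse=True)   # 3: evaluate fitness
        parents = pop[: pop_size // 2]        # 4: select parents (best half)
        children = []
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, genes)     # 5: one-point crossover
            child = a[:cut] + b[cut:]
            for g in range(genes):            # 6: mutate with some probability
                if rng.random() < mutation_rate:
                    child[g] = rng.choice(gene_space[g])
            children.append(child)
        pop = children
    return max(pop, key=fitness)
```

For example, with two integer genes in 0..9 and the sum as fitness, the search converges towards chromosomes close to [9, 9] within a few generations.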

2.4 Evaluation Metrics

Evaluations are performed throughout the thesis. In order to provide the reader with a better understanding of the given evaluation metrics, we next define the metrics used.

Pointwise evaluation As anomaly detection is a binary classification task, the output results are expected to be binary (normal or anomaly). Consequently, a positive class (anomaly detected) and a negative class (normality) can be used for labeling. The assignment of correct and incorrect class labels causes four different possible outcomes regarding the correctness of the prediction compared to the actual state. Table 2.1 illustrates the ground truth values of the given data (actual), which can of course be labeled as one of the two classes normal and anomaly, while the prediction also provides both possible outcomes. This results in the four possible states of true positive (TP), true negative (TN), false positive (FP, also type-I error), and false negative (FN, also type-II error). With respect to anomalies, a positive prediction can be true or false with respect to the ground truth, resulting in TP and FP, respectively. The same holds for the negative state representing the normal indicator. Table 2.1 is also known as the confusion matrix, where TP, TN, FP, FN represent the counts of occurrences. The sum of all four cells is the number of total observations. The perfect prediction model contains exclusively counts for true positives and true negatives within the confusion matrix, while false positives and false negatives do not appear at all. In practice, prediction models produce errors and misclassify instances. While the confusion matrix gives an intuition of the quality of the prediction model, there exist further evaluation metrics capturing quality aspects to compare classification results.

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \quad (2.4)$$

Equation 2.4 represents an overall score, named accuracy, which describes the proportion of correctly classified instances versus all data points. For imbalanced datasets, this score is also imbalanced [89]. Consequently, the TP-rate (sensitivity) and TN-rate (specificity) can be investigated, showing whether abnormality or normality, respectively, is captured more precisely (see Eq. 2.5 and 2.6). Accordingly, the FP-rate and FN-rate represent the complements, i.e. the error rates for abnormal and normal behavior (see Eq. 2.7 and 2.8).

$$\text{TP-rate (recall)} = \frac{TP}{TP + FN} \quad (2.5)$$
$$\text{TN-rate} = \frac{TN}{TN + FP} \quad (2.6)$$
$$\text{FP-rate} = \frac{FP}{TN + FP} \quad (2.7)$$
$$\text{FN-rate} = \frac{FN}{TP + FN} \quad (2.8)$$

The recall (TP-rate) covers how many of the actual anomalies are detected, while the precision (see Equation 2.9) covers the proportion of correctly classified anomalies among all predicted anomalies. The F1 score (see Equation 2.10) represents the harmonic mean of precision and recall as a combination.

$$\mathrm{precision} = \frac{TP}{TP + FP} \quad (2.9)$$
$$F_1 = 2 \cdot \frac{\mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}} \quad (2.10)$$

For imbalanced datasets, which are common in anomaly detection, there exist further metrics representing an overall score, like the Matthews Correlation Coefficient [90] and the more conventional evaluation metric of the area under curve (AUC) of a receiver operating characteristic (ROC) curve [91]. The ROC curve plots the TP-rate (sensitivity) against the FP-rate (1 − specificity = 1 − TN-rate). The area under the convex hull of the multipoint curve represents the AUC value, including both type-I and type-II errors. For a single model, this is the area under the simple trapezoid defined by the TP-rate and TN-rate of the model [91]:

$$AUC = \frac{\text{TP-rate} + \text{TN-rate}}{2} \quad (2.11)$$

The AUC value is also referred to as balanced accuracy, as it captures the TN-rate and TP-rate without being affected by the dataset's imbalance [89].
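The point-wise metrics of Equations 2.4–2.11 can be computed directly from the confusion-matrix counts. This sketch assumes non-degenerate counts (no zero denominators):

```python
def classification_metrics(tp, tn, fp, fn):
    """Point-wise evaluation metrics (Eq. 2.4-2.11) from confusion counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)   # Eq. 2.4
    recall = tp / (tp + fn)                      # TP-rate, Eq. 2.5
    tn_rate = tn / (tn + fp)                     # specificity, Eq. 2.6
    precision = tp / (tp + fp)                   # Eq. 2.9
    f1 = 2 * precision * recall / (precision + recall)  # Eq. 2.10
    auc = (recall + tn_rate) / 2                 # balanced accuracy, Eq. 2.11
    return {"accuracy": accuracy, "recall": recall, "tn_rate": tn_rate,
            "precision": precision, "f1": f1, "auc": auc}
```

Comparing `accuracy` and `auc` on a heavily imbalanced confusion matrix makes the imbalance effect discussed above directly visible.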

Time-based event evaluation The point-wise evaluation metrics do not capture time-dependent aspects. There exist point anomalies, but also anomalies lasting for longer periods of time. Such anomalies appear as a continuous sequence of anomalous observations. We call such sequences anomaly events, during which the self-healing pipeline can try to solve the problem. Anomaly events capture degraded state anomalies and consequently provide the possibility to react. Figure 2.8 illustrates the key measurements that are important for deriving the evaluation metrics. Let us assume that at first there is normal behavior until the point in time marked by the start of the anomaly injection, representing the start of the anomaly event. Identifying the first following observation in time as an anomaly would be the most beneficial for the further analysis and selection of appropriate remediation actions. This time difference between the start of an anomaly event and its actual detection (shown as ∆time in the image) is investigated in the time-based event evaluation for the different anomaly types.
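The per-event detection delay ∆time can be computed as sketched below, over illustrative timestamp values; a missed event is reported as `None`:

```python
def event_detection_delay(injection_start, detections):
    """Detection delay for one anomaly event: time between the start of the
    anomaly injection and the first detection at or after it.
    Returns None if the event was never detected (a missed event)."""
    after = [t for t in detections if t >= injection_start]
    return min(after) - injection_start if after else None
```

Averaging these delays over all injected events (and counting the `None` results as misses) yields the event-wise TP-rate and the mean/standard deviation of ∆time used in the evaluation.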


Figure 2.8: Time-based event evaluation principles and key metrics.

As a consequence, we measure the percentage of events that are correctly detected (TP-rate_event) and the average (and standard deviation) of the time differences. In addition, the false alarms in the normal signal are measured to provide relevant metrics for production environments. Furthermore, such false alarms may appear in phases. Thus, we also consider the average (and standard deviation) duration of such false alarm phases as an evaluation metric.

Chapter 3

Related Work

3.1 Characteristics of Service Anomalies

There exist several possibilities for grouping anomaly scenarios for services by common properties like severity, impact, component type, or origin of cause [27, 92]. Section 2.1.2 provided a definition of a severity-based order of anomaly classes. We concentrate on degraded and masked anomalies. Indication patterns in the monitored resource consumption of a system can be grouped into the following categories:

• Temporal evolution: The time-dependent change of resources varies in different anomaly scenarios.
• Signal variance changes: Resource metrics change in fluctuation over time.
• Propagation to correlated signals: Single resource metrics cause propagation to further resource metrics.

With respect to the temporal evolution, we mainly distinguish between successive change (increase or decrease) over time and rapid changes. While rapid changes always need a preceding event as a trigger, the former can develop over time without an obvious cause [29]. Such triggering events can be, e.g., software bugs or hardware failures leading to instant drops or jumps of the system's resource utilization causing degraded system performance. Jim Gray [93] introduced the terminology of Heisenbugs describing the phenomenon of anomalies occurring during productive runtime, which cannot be detected with modern approaches of automated software testing [29, 94] and bug detection techniques [95]. Grottke and Trivedi [96] described Mandelbugs as a subset of Heisenbugs influencing the execution environment with respect to timing, ordering of inputs, operations, and the time lag between bug activation and failure occurrence [29]. Mandelbugs as well as the generalized Heisenbugs are considered to be examples of triggering events causing rapid changes.

System components can also develop a progressively increasing or decreasing resource utilization over time due to software aging [97] (see Section 2.1.2). For example, Garg et al. [98] evaluated the resource leakage behavior of nine UNIX workstations' operating systems over a total of 53 days and showed the presence of memory leaks and a successive increase of file table entries. More recent studies show that software aging is a common part of today's complex and distributed systems infrastructure. The presence of software aging was shown for cloud environments [99], web servers and service-oriented applications [100, 101], application server environments [102], and e-commerce applications [101].


Signal variance changes describe the volatility of the resource consumption of a system over time. We differentiate between constant changes (small variance) and fluctuating changes (high variance). While rapid changes usually alternate constantly around a value, progressive anomaly scenarios are indicated by a certain trend, which can vary in fluctuation due to its intensity and deviation from the trendline during runtime. Cassidy et al. [103] illustrate the existence of resource fluctuation due to software aging for shared-memory database servers, and Grottke et al. [104] provide pattern models of resource utilization changes for software aging including variance.

Propagation to correlated signals indicates the instant correlation of a single monitored metric with another resource metric. Zhang et al. [105] showed the existence of correlation in network traffic datasets covering a wide range of link types such as backbone, internal, edge, and wireless. Mestres et al. [106] show for a VNF experiment the presence of correlation between network packets and the CPU utilization.

In productive environments, combinations of the indication patterns are likely to occur, combining temporal changes in evolution, variance, and propagation in the multivariate resource data. Analyzing such hybrid indicators is complex and time-consuming for administrators. Thus, AI-based methods are expected to assist users significantly. Based on these categorizations, we included resource-based anomaly injections applied to cloud services, providing emulation of degraded state anomalies (see Section 7.1.2).
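Two of the indication patterns above, variance changes and propagation to correlated signals, can be quantified directly on monitored metrics. The following is an illustrative stdlib-only sketch (not from the thesis): a rolling variance over a sliding window, and the Pearson correlation between two resource metrics.

```python
# Sketch: quantifying variance-change and propagation indication patterns.
from statistics import pvariance, mean

def rolling_variance(signal, window):
    # population variance of each trailing window; a jump signals a variance change
    return [pvariance(signal[i - window:i]) for i in range(window, len(signal) + 1)]

def pearson(x, y):
    # Pearson correlation between two metrics, e.g. network packets vs. CPU
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5
```

A correlation near 1.0 between, say, packet rate and CPU utilization would indicate the propagation pattern reported by Mestres et al. [106].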

3.2 Anomaly Detection

Supervised Anomaly Detection General classification algorithms can be utilized for anomaly detection, as anomaly detection is a binary classification problem. A collection of classification algorithms is available in the WEKA machine learning framework [107]; these are presented in the following.

J48 is based on the implementation of the C4.5 algorithm by Quinlan [108]. The C4.5 algorithm constructs a decision tree by splitting up nodes on the metric with the highest normalized information gain. Starting from the root node (the first node in the tree), the tree is split up until the information gain falls below a predefined threshold. In an optional pruning phase, the resulting tree is optimized from the leaf nodes upwards to decrease overfitting effects. J48 is applied in many different anomaly detection scenarios like network intrusion detection [109, 110], breast cancer detection [111], and credit card fraud [112].

The Logistic Model Tree (LMT) [113] uses the C4.5 decision tree splitting criterion, but works with regression models in the tree nodes instead of the constant values used in simple decision trees. The LogitBoost algorithm [114] is used to efficiently compute the regression models for the internal decision nodes. A point to be analyzed is classified by traversing the tree from the root node and testing the input sample against each contained regression model to decide which path should be followed down the tree. LMT has recently been applied to, e.g., intrusion detection [115] and anomaly detection for industrial control systems [116].

The Hoeffding Tree is an online decision tree learning algorithm, which was proposed by Hulten et al. [117]. As the online scenario accepts a potentially endless data stream to learn from, the authors use the Hoeffding bound [118] to predict the number of data points needed to be learned for a new decision criterion in order to approximate the deviation of the unbounded data stream for single features.
Taking the number of samples into account, the Hoeffding Tree algorithm operates on a window of data. The splitting criterion is the information gain, as used in the C4.5 algorithm. Hoeffding Trees are also applied for intrusion detection [119, 120], and surveys show the applicability of Hoeffding Trees and adapted versions of them for anomaly detection [121, 122].
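The splitting criterion shared by C4.5/J48 and Hoeffding Trees can be made concrete with a short sketch. This is an illustration of information gain over binary labels, not WEKA code; the function names are assumptions.

```python
# Sketch: information gain of a candidate binary split (C4.5-style criterion).
from math import log2

def entropy(labels):
    # Shannon entropy of a list of 0/1 labels
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 0.0 if p in (0.0, 1.0) else -(p * log2(p) + (1 - p) * log2(1 - p))

def information_gain(labels, left, right):
    # parent entropy minus the size-weighted entropy of the two children
    n = len(labels)
    return (entropy(labels)
            - (len(left) / n) * entropy(left)
            - (len(right) / n) * entropy(right))
```

A split that perfectly separates the two classes yields a gain equal to the parent's entropy (1.0 bit for a balanced binary problem), while a useless split yields a gain of 0.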

The Random Tree classifier constructs a decision tree by randomly choosing the features evaluated for each decision node [123]. The Random Forests classifier [123] extends the Random Tree algorithm by applying a multitude of Random Tree models, each trained with a random subset of the total of available features. Random Tree and Random Forests have recently been applied in anomaly detection tasks like credit card fraud [112], insurance fraud [124], and database query access [125].

Naive Bayes [126] learns a Bayesian network as an underlying model. A Bayesian network consists of a graph-based model indicating dependencies between values with associated conditional probability tables. Naive Bayes uses a strong conditional independence assumption, which states that all attributes of a data point are independent, in order to efficiently learn the model. The applicability of Naive Bayes to anomaly detection tasks was shown by Sebyala et al. [127] and Muda et al. [128] for intrusion detection as well as for breast cancer [129] and lung cancer detection [130].

Further classification algorithms provided by WEKA are summarized as follows. The Reduced Error Pruning Tree (REP Tree) applies additional reduced-error pruning [131] as a post-processing step of the C4.5 algorithm. The Decision Stump classification algorithm is based on the C4.5 decision tree learning algorithm but uses only one layer for decisions [132]. The algorithm stops learning when the feature with the most information gain is added. Decision Table maps rules to learned classes [133]. In general, decision tables divide the large hypothesis space into smaller areas, where an optimal feature subset is chosen for the decision table representation. JRip is a classification algorithm learning a set of rules and was proposed by Cohen [134]. JRip uses incremental pruning to reduce the error by finding core rules describing a class to be learned.
OneR [135] is a rule-based classification algorithm using minimum-error attributes for prediction. Frank and Witten [136] proposed the PART algorithm, which utilizes lists of decision rules as a prediction model. For each class, a C4.5 decision tree is created, and its leaves represent rules for the class. Sequential Minimal Optimization (SMO) [137] is an optimized training procedure for Support Vector Machines (SVMs). As SVMs need to solve large quadratic programming problems, SMO divides them into smaller quadratic programming subproblems in order to efficiently learn from large sets of data.

Saibharath and Geethakumari [138] propose an architecture for anomaly detection in cloud environments through log analysis for VMs, network behavior metrics, etc., applying the described WEKA-based classification approaches. Chavan et al. [139] and Krishna and Rao [140] investigated malware detection utilizing the classifiers Random Forest, Random Tree, LMT, J48, and SVM. Rezvani [141] proposed a framework to evaluate machine learning algorithms in a simulation environment, running different cloud-based services and security-related attacks. Rezvani applied J48, Naive Bayes, Decision Table, and Random Forest on various intrusion detection tasks and showed their applicability. Garcia-Teodoro et al. [142] provided a survey of anomaly-based intrusion detection systems including the description of reviewed supervised machine learning approaches like Bayesian networks, neural networks, and statistical approaches to classify types of attacks.

Sauvanaud et al. [143] apply the supervised Random Forest [123] approach to individual service components. Through aggregation by weighted voting, the ensemble predicts anomalies for the overall service. The evaluation is based on the VNF scenario Clearwater, similar to the setup presented in this work. The results show the applicability of supervised approaches for detecting degraded state anomalies. Adamova et al.
[144] evaluate anomaly detection approaches on a combination of network attacks during service migration in cloud environments. They show that Expectation-Maximization (EM) clustering applied as a supervised technique provides high AUC values. Clusters are generated for individual normal and abnormal behaviors. Huang et al. [145] also investigate anomaly detection approaches for service migrations in virtualized environments by applying EM clustering and propose an adapted version of LOF with abnormal feature reasoning.

Semi-supervised Anomaly Detection For anomaly detection in time series, many works apply statistical methods to analyze the underlying behavior and predict deviations from its statistical distribution. Statistical techniques like k-sigma [146, 147], statistical hypothesis testing [148, 149], Multivariate Adaptive Statistical Filtering [150], the Holt-Winters method [151], exponential smoothing [152], window-based PCA [153], robust PCA [154, 155], relative entropy [156], and the Kolmogorov-Smirnov test [157] are applied to find anomalies.

A One-class Support Vector Machine (OSVM) is trained with normal training data [158]. It creates decision boundaries (a frontier) through hyperplanes as a kind of convex hull around the normal data in order to differentiate the abnormal data points. OSVM utilizes a kernel (e.g. the RBF kernel) and a scalar parameter to define the frontier. The margin of the OSVM is configurable as a parameter corresponding to the probability of finding abnormal observations outside the frontier.

Isolation Forest [159] utilizes the mechanisms of Random Forests to create multiple random trees by selecting features and split values randomly. The algorithm assumes knowledge about the maximal and minimal values of the selected features, while Random Forest applies splitting conditions like information gain. The tree is built up until samples are isolated. The path length from the root to the isolated sample is measured, reported as an anomaly score, and averaged over the multiple trees. Abnormal samples are indicated by a short path length compared to normal data points.

The Local Outlier Factor (LOF) [160] measures the local density of data points with respect to their neighbors. The LOF score reflects the ratio of the average local density of a data point's k-nearest neighbors and its own local density.
Abnormal data points are expected to have a smaller local density in contrast to their k-nearest neighbors, while normal data points should exhibit a similar local density.

Gaussian Mixture Models (GMM) [161] are able to build clusters from normal training data. Each cluster is represented by a Gaussian probability density function. Thus, probabilistic regions can be defined where normal data are expected, while abnormal points are expected in regions with low probability. Reddy et al. [162] show the applicability of GMM for anomaly detection of security-related issues in network traffic, as do Bahrololum and Khaleghi [163], providing results for intrusion detection.

The BIRCH [52] clustering algorithm (see Section 2.3.2) was also proposed for anomaly detection in various papers, for money laundering detection [164] and the KDD'99 datasets [165]. Normal data is fitted to clusters by BIRCH, providing border definitions through the micro-clusters in order to differentiate abnormal data, which is assumed to lie outside such micro-clusters [165].

The applicability of neural networks for anomaly detection has been investigated in recent years (e.g. surveys by Chalapathy and Chawla [166] and Chandola et al. [18]). Autoencoders (see Section 2.3.3), for example, have recently been applied to anomaly detection tasks like credit card fraud detection [167], malware detection [168], intrusion detection [55], and anomaly detection for an indoor wireless sensor network as an IoT application [169]. Through training on normal data, the Autoencoder is able to capture the behavior of the normal signal, resulting in a low reconstruction error. By user-defined thresholds or statistical analysis, differentiation of normal and abnormal data is possible, as abnormal data is expected to have higher reconstruction errors.
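The density-ratio idea behind LOF can be illustrated with a deliberately simplified sketch. This is not Breunig et al.'s exact definition (it omits reachability distances): local density is approximated as the inverse mean distance to the k nearest neighbors, and the score is the ratio of the neighbors' average density to the point's own density, so values well above 1 suggest an outlier.

```python
# Simplified LOF-inspired outlier score for 1-D data (illustrative only).
def knn_distances(data, i, k):
    # distances from point i to its k nearest neighbors
    dists = sorted(abs(data[i] - data[j]) for j in range(len(data)) if j != i)
    return dists[:k]

def local_density(data, i, k):
    # inverse mean k-NN distance; epsilon avoids division by zero
    return 1.0 / (sum(knn_distances(data, i, k)) / k + 1e-9)

def outlier_score(data, i, k):
    order = sorted((abs(data[i] - data[j]), j) for j in range(len(data)) if j != i)
    neighbors = [j for _, j in order[:k]]
    avg_neighbor_density = sum(local_density(data, j, k) for j in neighbors) / k
    return avg_neighbor_density / local_density(data, i, k)
```

In a toy sample such as [1.0, 1.1, 0.9, 1.05, 10.0] with k = 2, the isolated value 10.0 receives a score far above 1, while the clustered points score close to 1.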
Anomaly detection through VAE networks (see Section 2.3.4) can be utilized in the same way as with Autoencoders and is applied, for example, to mechanical vibration signals of a motor [170]. LSTM-based neural networks (see Section 2.3.5) are likewise applied for anomaly detection in automotive systems [171] and medical ECG signal analysis [172], which also show their advantages for time-series analysis. AELSTM combines

LSTM cells within an Autoencoder neural network structure. These hybrid models aim to add time-based information to the Autoencoder structure, which otherwise has no notion of time. A recent example is the use of such a detector for recognizing abnormal behavior in robot-assisted feeding [173].

Self-organizing maps (SOM) are a type of neural network whose topology induces competition between features. Thus, features can differ in importance when mapping, e.g., a given data stream. This behavior is applied, for example, to intrusion detection [119] and to anomaly detection for IT-system monitoring [174]. For the latter case, the SOM is configured to represent the overall IT infrastructure in order to predict the overall performance of the system. Liu et al. [174] show that anomalies within virtual machines can be detected in related regions of the infrastructure. In contrast, in our work we do not aggregate data from several hosts in a single model but create models per individual component in order to capture its own behavior.

Unsupervised Anomaly Detection Unsupervised anomaly detection does not assume the existence of any labels. The key difference to semi-supervised learning is the absence of even normal labels.

Laptev et al. [147] developed a framework for unsupervised anomaly detection for univariate time-series data called EGADS. EGADS includes several different anomaly detection approaches. The first group of anomaly detection techniques learns a reconstruction of the time series (e.g. ARIMA, exponential smoothing), where the prediction error (Euclidean distance) or the relative error between the observation x_t and the prediction u_t is applied; a threshold is selected by modeling these errors as a Gaussian distribution and setting T = µ + 3 · σ. The second class of anomaly detection techniques decomposes the time series into trend, seasonality, and noise. Whenever the noise component exceeds a threshold-defined limit, the data point is reported as an anomaly. The threshold is again defined by the threshold function described above. The open source implementation¹ assumes the existence of the complete dataset, although the methods can be partially applied in an online manner.

Hundman et al. [65] provided an unsupervised anomaly detection approach based on LSTMs and dynamic thresholding for data streams. From a multivariate time series, a sequence of a fixed number of data points is selected and fed into the model. The presented model applies an LSTM neural network for every single dimension and predicts the following data point. Based on the prediction, the prediction error is calculated by dimension-wise Euclidean distance. They apply an exponentially weighted moving average to smooth the errors. Based on the smoothed errors, the threshold is calculated from the mean and standard deviation of the set of smoothed historic errors by applying the Gaussian tail.
Furthermore, they propose mitigating false positives through postprocessing and reclassification of data points by selecting a sequence of errors which produce the maximum smoothed errors but are labeled as normal. Another approach would be to investigate the severity of the error and select a border. Experiments on two NASA datasets showed that the approach can be applied to telemetry data, providing a total recall of 90.3% for point anomalies and 69% recall for contextual anomalies.

Malhotra et al. [66] also proposed anomaly detection through LSTMs. The model is trained on assumed normal data points with a fully connected neural network with two layers of recurrently connected LSTM cells. The model forecasts the next l data points and computes an error based on the Euclidean distance in every single dimension. Such vectors of errors are used to fit a multivariate Gaussian distribution. Thus, for newly arriving data points, a probability of normality can be predicted. Based on a user-defined probability threshold, abnormal data points can be detected.
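Several of the approaches above share the same thresholding scheme: prediction (or reconstruction) errors on normal data are modeled as a Gaussian, and an observation is flagged once its error exceeds T = µ + 3 · σ. A minimal stdlib sketch of that scheme (function names are illustrative, not from any of the cited implementations):

```python
# Sketch: Gaussian-tail thresholding of prediction errors, T = mu + 3*sigma.
from statistics import mean, pstdev

def fit_threshold(normal_errors, sigmas=3.0):
    # fit mu and sigma on errors observed during normal operation
    return mean(normal_errors) + sigmas * pstdev(normal_errors)

def detect(errors, threshold):
    # flag every observation whose error exceeds the fitted threshold
    return [e > threshold for e in errors]
```

In practice the errors would come from an upstream forecaster (e.g. an LSTM predictor); here the scheme is shown in isolation.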

¹ https://github.com/yahoo/egads

For this study, the approach estimates this threshold based on an anomaly validation set. The experiments were executed on four different datasets covering ECG signal analysis, Space Shuttle data, engine sensor data, and a power consumption dataset. The results show a high precision (between 71% and 98%) but a low recall (below 20%).

Oh and Yun [67] investigated anomaly detection for machine sound data. Here, the sound signal is transformed through the Short-time Fourier transform (STFT) [175] into the frequency domain. The resulting spectrogram is further used to train a feedforward autoencoder utilizing a U-net compression structure of hidden layers. Likewise to Malhotra et al. [66], the reconstruction error is used to train a Gaussian distribution representing the normal error behavior and to determine a fixed threshold based on a validation set. The article shows the applicability for the machine sound use case with more than 90% accuracy.

Xu et al. [176] propose the approach Donut for detecting abnormal behavior in seasonal web application KPIs. The proposed approach utilizes a sliding window of user-defined KPI observations and applies a variational autoencoder (VAE) to learn the reconstruction of observations in a window. In contrast to the feedforward variant of an autoencoder, a VAE encodes values into predefined distributions [57, 58]. Furthermore, Donut uses the reconstruction probability [177], which includes, besides the reconstruction error, also the variance parameter of the distribution function. They showed the applicability of the approach by analyzing KPI data of a top global internet company, achieving F-scores from 0.75 to 0.9.

Ibidunmoye et al. [178] also analyze the performance of IT services, and also infrastructures, in order to detect anomalies. The two unsupervised methods Behavior-based anomaly detection (BAD) and Prediction-based anomaly detection (PAD) are presented.
BAD analyzes statistical properties of a tumbling window by applying kernel density estimation [179]. Anomalies are predicted whenever the density estimation deviates from the lower side of Shewhart's control chart [180] compared to the latest tumbling window. PAD applies forecasting based on the knowledge of the latest data points in a sliding window. A cubic spline model is applied to forecast the next data point. Based on the forecast and the actually observed data point, a prediction error can be calculated. Again, an adaptive threshold is applied to recognize anomalies based on the errors. As a threshold, an exponentially weighted moving average control chart [180] is applied to the errors of the recent sliding window. Based on the threshold provided by the control chart model, anomalies are triggered whenever the prediction error exceeds these bounds.

Ahmad et al. [181] propose an anomaly detection approach based on Hierarchical Temporal Memory (HTM) networks, which are a special type of artificial neural network inspired by sequential learning and sparse activity in the human cortical regions (they thus do not use fully connected layers, but sparse connections between nodes). HTMs are a rather new sequence learning approach improving with modern neuroscience [182], compared to the longstanding LSTMs, which have been researched for nearly two decades [60]. The HTM network's core components consist of an encoder and a spatial pooler creating a sparse representation of an incoming sequence. The resulting sparse binary vector is used as one of the two outputs of the HTM network. The second output is produced by the sequence memory cell of the HTM network reconstructing the sparse input. Based on these two outputs, the prediction error is calculated, and an anomaly alert is triggered whenever the prediction error exceeds a threshold based on the mean and standard deviation of an assumed Gaussian distribution of the error and a user-defined sensitivity parameter.
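The EWMA control chart used by PAD-style thresholding can be sketched in a few lines. This is an illustrative, PAD-flavored sketch (all names are assumptions, not from Ibidunmoye et al.): control limits are fitted on errors from normal operation using the asymptotic standard deviation of the EWMA statistic, and an alarm is raised when the smoothed error leaves the control band.

```python
# Sketch: EWMA control chart over prediction errors (names illustrative).
from statistics import mean, pstdev

def ewma_limits(normal_errors, alpha=0.3, width=3.0):
    mu, sigma = mean(normal_errors), pstdev(normal_errors)
    # asymptotic standard deviation of the EWMA statistic: sigma*sqrt(alpha/(2-alpha))
    upper = mu + width * sigma * (alpha / (2 - alpha)) ** 0.5
    return mu, upper

def monitor(error_stream, mu, upper, alpha=0.3):
    z, alarms = mu, []
    for e in error_stream:
        z = alpha * e + (1 - alpha) * z   # exponentially weighted smoothing
        alarms.append(z > upper)
    return alarms
```

Because the smoothed statistic has a much smaller variance than the raw errors, the band is tight and sustained shifts are caught after only a few observations.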
Cotroneo et al. [183] use statistical correlation analysis between system components in order to automatically detect anomalies. Changes within these correlations are predicted as anomalies. The focus of Cotroneo et al. [183] is to combine data from different hosts. This approach suffers from scalability issues when applied to dynamic service and infrastructure environments. Thus, we aim to create detection models for individual hosts, which can be applied directly to the monitored entity. Dean et al. [184] also utilize SOMs, but propose an unsupervised detection model. They aim to predict failures within a cloud infrastructure, but do not consider degraded state anomalies as a use case.

3.3 Concept Adapting Clustering

Chapter 5 develops a concept adapting online clustering approach based on BIRCH [9, 52]. The corresponding related research is presented in the following. It consists of an overview of different online clustering approaches and specific BIRCH variations. As we provide a concept adapting method, we also present decay methods, which are applied to online clustering approaches.

Online Clustering BIRCH [9, 52] is an agglomerative hierarchical clusterer storing summarized cluster information as cluster features in a balanced tree structure. The cluster features consist of a 3-tuple, called a micro-cluster, capturing hyper-spherical shapes represented by a centroid and radius. BIRCH provides iterative inserts of new data points and therefore iterative updates of the model. Consequently, the model can be applied in an online manner. Through rebalancing, cluster features can be merged due to the additive property of micro-clusters [9]. The approach assumes randomized input of a large, finite dataset and a fixed number of searched clusters. The resulting centroids aim to approximate the centroids provided by k-means clustering, but with efficient iterative computation using a fixed-sized tree data structure.

Likewise to BIRCH, scalable k-means [185] summarizes randomly chosen data points from a large, finite dataset by a set of clustering features. The memory consumption is continuously checked, and clustering features are summarized when the buffer size of the memory is exceeded. K-means clustering is applied to the compressed information and terminated, e.g., when centroids are not moving anymore. Farnstrom et al. [186] analyzed the compression techniques proposed by Bradley et al. [185] and suggested a simplified variant of a single-pass k-means algorithm. They showed that the simplified variant outperforms any compression variant proposed by Bradley et al. while still providing the same qualitative results.

Ackermann et al. [187] developed the StreamKM++ clustering algorithm, which combines coresets as data stream summaries and k-means++ [188]. Like BIRCH, a tree structure is built, but instead of using micro-clusters, coresets are utilized. Ackermann et al. [187] showed that the runtime is significantly slower than BIRCH, but the clustering results outperform those of BIRCH. Aggarwal et al.
[189] investigated the problem of the influence of historic data in single-pass summaries applied by online clustering approaches and proposed a time window based approach, named CluStream. The time windows aim to capture different lengths of historic time frames, and therefore significant changes can be captured. Different lengths of time windows are stored in a pyramidal structure with micro-clusters. Kranen et al. [190] handle time frame summaries in a tree structure, called ClusTree. The tree structure includes a self-adaptive index structure of micro-clusters. Zhou et al. [191] give more recent data points a higher impact on a cluster feature data stream summary. They developed the SWClustering algorithm, applying an exponential histogram of cluster features through sliding windows.

The online clustering approaches described above are mainly inspired by k-means and provide a technique for summarizing the data stream. Consequently, such algorithms are not capable of detecting arbitrary shapes as they use, e.g., micro-clusters, which provide spherical shapes. Density-based clustering like DBSCAN is therefore applied in order to detect arbitrary shapes, as it assumes neighborhood dependencies through, e.g., Euclidean distance but not structural properties. Chen and Tu [192] proposed D-Stream, a density-based online clustering approach. D-Stream applies density grid mapping in an online manner while applying offline density clustering as a secondary process on the resulting finite grid structure. Cao et al. [49] presented DenStream, which applies data stream compression through micro-clusters, which represent core points of the density structure. Additionally, they included a novel pruning strategy to grow new emerging micro-clusters while avoiding outliers.

Further online clustering techniques for multivariate time series data were investigated by Rodrigues et al. [193], maintaining a hierarchical tree structure of clusters.
By applying correlation-based dissimilarity measures, each node in the tree splits the time series into the farthest neighbor pairs. Gama et al. [194] apply a distributed stream clustering approach to a sensor network. It combines distribution mappings for state representation, which are reported to a centralized component that manages the distribution mappings globally, while updates can be applied locally at the sensors.
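The cluster feature micro-cluster at the heart of BIRCH can be sketched directly from its definition: a 3-tuple (N, LS, SS) of point count, per-dimension linear sum, and per-dimension squared sum. Centroid and radius follow from the tuple, and two cluster features merge by component-wise addition, which is the additive property used when rebalancing the CF tree. The class below is an illustrative sketch, not the thesis implementation.

```python
# Sketch: BIRCH-style cluster feature (N, LS, SS) with insert and merge.
class ClusterFeature:
    def __init__(self, point):
        self.n = 1
        self.ls = list(point)                # linear sum per dimension
        self.ss = [x * x for x in point]     # squared sum per dimension

    def insert(self, point):
        self.n += 1
        self.ls = [a + x for a, x in zip(self.ls, point)]
        self.ss = [a + x * x for a, x in zip(self.ss, point)]

    def merge(self, other):
        # additive property: CFs merge by component-wise addition
        self.n += other.n
        self.ls = [a + b for a, b in zip(self.ls, other.ls)]
        self.ss = [a + b for a, b in zip(self.ss, other.ss)]

    def centroid(self):
        return [a / self.n for a in self.ls]

    def radius(self):
        # root mean squared distance of the summarized points to the centroid
        c = self.centroid()
        sq = sum(s / self.n - m * m for s, m in zip(self.ss, c))
        return max(0.0, sq) ** 0.5
```

Because only the 3-tuple is stored, inserts and merges run in O(d) per micro-cluster regardless of how many points it summarizes, which is what makes the iterative online updates cheap.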

BIRCH Variations Lorbeer et al. [195] introduced A-BIRCH. A-BIRCH provides a solution for the automatic estimation of an optimal threshold through Gap Statistics [196]. In addition to the radius, the threshold decides whether nearby points are integrated into a cluster. Thus, the threshold influences the structure of the micro-clusters. In the original BIRCH paper, Zhang et al. [9] proposed approximating the threshold based on the enlargement of the cluster radii of the tree. Lorbeer et al. [195] additionally introduced the Multiple Branch Descent BIRCH (MBD-BIRCH). MBD-BIRCH adds a mechanism to BIRCH which considers, for a newly arriving data point, multiple descent leaf clusters besides the one closest by distance. They propose to search for nearby clusters within a user-defined distance s around the closest cluster's centroid. Lorbeer et al. [195] showed that the higher s, the less splitting is applied in BIRCH, but it will also slow down the algorithm as the number of branches to descend increases.

Fichtenberger et al. [47] describe a modification to BIRCH, which provides a provable approximation of the trained cluster centroids compared to k-means centroids. The modified version is referred to as BICO as it includes coresets in the tree structure of BIRCH instead of using the original micro-clusters within a cluster feature. Fichtenberger et al. [47] showed that the quality of results is higher compared to BIRCH due to the provably bounded guarantee.

Guo et al. [197] proposed LBIRCH, aiming to provide better results for non-spherical cluster structures. They include a neighbor table (based on the ROCK algorithm [198]) of data points that do not directly fit any micro-cluster at first. Based on the table, the merging of micro-clusters is identified. Guo et al. [197] showed better results on arbitrarily shaped datasets compared to BIRCH.

Li et al. [199] provide a method to overcome the problem of arbitrary shape clustering.
They combine the fast computation of BIRCH with the Artificial Immune Network Clustering algorithm (aiNet) [200]. At first, BIRCH is applied to summarize data into the cluster feature tree data structure. Afterwards, aiNet utilizes the micro-clusters to learn arbitrarily shaped clusters.

Concept Adapting in Online Clustering Robustness against concept changes can be achieved through different methods. SWClustering [191] keeps the recent data as histograms in sliding windows. CluStream [189] applies a pyramidal structure in order to capture different time windows. Besides window-based separation of data points in time, Cao et al. [49] apply a fading function based on exponential decay, f(t) = 2^(−λ·t), where λ > 0 represents the rate of decay and t a successively increasing time value. They propose that the user should define t; thus λ is calculated and adapted depending on the specified t.

Chen and Tu [192] also utilize an exponential decay, but apply it to the grid density coefficients only when updates occur due to newly arriving data points. Consequently, the exponential decay is defined by λ^(t_c − t), where λ ∈ (0, 1), t defines the time of the last update of the density grid coefficient, and t_c represents the current time of the newly arriving data point.
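The two decay schemes above differ mainly in when the decay is applied: DenStream-style fading is a function of a point's age, while D-Stream-style decay is applied lazily, only when a grid coefficient is touched again. A minimal sketch of both (function names are illustrative):

```python
# Sketch: the two exponential decay schemes used for concept adaptation.
def fading(t, lam):
    # DenStream-style fading f(t) = 2^(-lambda*t), lam > 0, t = age of the point
    return 2.0 ** (-lam * t)

def lazy_decay(weight, t_last, t_now, lam):
    # D-Stream-style decay lam^(t_now - t_last), lam in (0, 1),
    # applied only when the grid/cluster is updated again
    return weight * lam ** (t_now - t_last)
```

Both give old observations exponentially vanishing influence, which is what allows the summary structures to follow a drifting concept.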

3.4 Hyperparameter Optimization

Hyperparameter optimization (tuning) aims to find suitable hyperparameters for a given algorithm in order for it to perform at its best. In this work (see Chapter 6), we concentrate on determining hyperparameters for the generalization of the unsupervised anomaly detection algorithms as defined in Chapter 5.

In general, changing the configuration of the hyperparameters of a given machine learning algorithm also changes the functionality of the algorithm. For example, selecting the target number of clusters k in k-means forces a different categorization of data points. When such parameters cannot be determined by experts, they need to be decided by another process. Such selection processes can be performed automatically, but assume the existence of an optimization criterion (also called objective). For anomaly detection, one can choose as objective, e.g., accuracy, TN-rate, TP-rate, or similar qualitative metrics. Thus, the automatic process can try out a wide range of different sets of hyperparameters and rank those by the objective. This type of hyperparameter optimization strategy is referred to as random search. Random search evaluates a finite set of randomly chosen hyperparameters and consequently might not find the global optimum unless the finite set captures the complete set of possible configuration options. When all possible combinations are tested in order to determine the global optimum, the strategy is referred to as exhaustive search. Of course, this comes at the cost of time complexity when aiming to search the complete space [201].
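The random-search strategy described above can be sketched in a few lines. All names are illustrative, and the objective used in the usage example is a hypothetical stand-in for an anomaly detection quality metric such as accuracy or TP-rate.

```python
# Sketch: random search over a box-constrained hyperparameter space.
import random

def random_search(objective, space, n_trials, seed=0):
    """space: dict mapping hyperparameter name -> (low, high) range."""
    rng = random.Random(seed)
    best_cfg, best_score = None, float("-inf")
    for _ in range(n_trials):
        # sample one configuration uniformly at random
        cfg = {name: rng.uniform(lo, hi) for name, (lo, hi) in space.items()}
        score = objective(cfg)
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score
```

For instance, maximizing the toy objective −(x − 2)² over x ∈ [0, 5] with a few hundred trials reliably lands close to the optimum at x = 2, while exhaustive search over a continuous range would be infeasible.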

Optimization Strategies

With respect to the strategy of search (optimization strategy), there exist several approaches which apply greedy methods to the selection process in order to avoid the evaluation of all possibilities. Simulated annealing [202,203] applies a statistical model based on an energy function (e.g. the Metropolis equation [204]) to model the probability of changing hyperparameters in an improving direction. Changes in parameters which improve the objective are always accepted, and new options are evaluated through iterative hill-climbing. The best performing options are reported in the end. Simulated annealing is investigated in several different domains like chemical crystal structure predictions [205], curriculum course planning [206], energy production optimization in water supply networks [207], and performance improvement in learning conversational neural networks [208]. Sequential model-based optimization (SMBO) [209] learns from past evaluations by fitting a surrogate model to approximate the expected objective. SMBO conducted with Gaussian processes as surrogate model is called Bayesian optimization [210–212] and is one of the most popular SMBO approaches. F-race algorithms [213] additionally incorporate computational costs when deciding whether it is appropriate to carry on with the search in order to find better parameters. The concept of racing was first applied to solve model selection for machine learning algorithms [214,215] and generalized as F-race by Birattari et al. [213], incorporating Friedman's two-way analysis of variance by ranks [216]. A practical implementation is provided with the irace package, providing automatic hyperparameter configuration for state-of-the-art machine learning algorithms [217]. Particle swarm optimization (PSO) [218] is inspired by the social behavior of bird and fish colonies. Such groups of social organisms share information among the population by interaction, which is emulated by the PSO [219].
The swarm hereby describes the population's collective decision instead of an individual optimization. All individual participants in a swarm optimize their own objective but contribute to the overall objective through information exchange. Individuals capture their position and velocity and try to optimize their individual objective, while these individual outcomes together decide on the movement of the whole swarm to a new relative location [220]. Suggestions for particular implementation options are provided in the empirical study [221] and the surveys [222,223]. Poli et al. [223] provide an overview of the different application areas, including investment portfolio selection [224,225], motor control in electric and hybrid vehicles [226,227], and cancer classification [228]. Inspired by the genetic concepts of genes, mutations, and crossovers, genetic algorithms [229] also use a population-based formulation for optimizing an objective. A random start population is used to rank the best performing individuals. A parent set is selected among the best individuals to grow a new population through crossover. Additionally, mutations are applied to model randomized changes, enabling the option to break out of local optima in order to find the global optimum. Genetic optimization is also applied in different domains like medical image processing [76–78], feature engineering in stock price prediction [79,80] and medical SVM-based classification [230], electricity load analysis [81], and rainfall forecasting [82,83].
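The selection-crossover-mutation cycle of a genetic algorithm can be sketched as below. This is a deliberately minimal toy (elitist selection, single-point crossover, random gene reset), not any of the cited variants; all parameter values are illustrative.

```python
import random

# Toy genetic algorithm sketch for maximizing an objective over fixed-length
# integer "genes" (illustrative only; operators are deliberately minimal).

def genetic_search(objective, gene_low, gene_high, n_genes,
                   pop_size=20, generations=50, mutation_rate=0.1, seed=0):
    rng = random.Random(seed)
    pop = [[rng.randint(gene_low, gene_high) for _ in range(n_genes)]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=objective, reverse=True)
        parents = pop[: pop_size // 2]          # selection: keep the best half
        children = []
        while len(children) < pop_size - len(parents):
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_genes)     # single-point crossover
            child = a[:cut] + b[cut:]
            if rng.random() < mutation_rate:    # mutation: random gene reset
                child[rng.randrange(n_genes)] = rng.randint(gene_low, gene_high)
            children.append(child)
        pop = parents + children
    return max(pop, key=objective)

# Toy objective: fitness is highest when all genes equal 3.
best = genetic_search(lambda g: -sum(abs(x - 3) for x in g), 0, 9, n_genes=4)
```

For hyperparameter tuning, each gene position would encode one hyperparameter and the objective would be a validation metric such as accuracy.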

Hyperparameter Optimization for Machine Learning

Any of the above presented optimization strategies can be used for the task of hyperparameter tuning when defining suitable objective functions. For example, the F-race algorithms [213] were developed with the aimed task of hyperparameter optimization, while the other optimization approaches were developed for several different purposes and later applied to the task of hyperparameter optimization. Connolly et al. [231] showed the applicability of PSO for ensemble ranking of machine learning models for video-based face recognition. The ranking procedure can also be used for hyperparameter selection, where the ranking score describes which models are the best to apply. Meissner et al. [232] and Lorenzo et al. [233] show the applicability of hyperparameter selection for configuring deep neural network architectures utilizing PSO. Furthermore, Lin et al. [234] apply PSO to the problem of feature and hyperparameter selection for support vector machines. Fischetti and Stringher [235] combined stochastic gradient descent with simulated annealing for hyperparameter tuning of deep neural networks. Tsai et al. [72] also focus on hyperparameter tuning of deep neural networks but use a hybrid Taguchi-genetic algorithm, while Nalçakan and Ensari [236] showed the applicability of a general form of a genetic algorithm, tuning the number of hidden layers, the number of neurons in hidden layers, and the activation function. A non-traditional neural network approach for unsupervised anomaly detection in data streams is Hierarchical Temporal Memory (HTM). The continuously learning HTM network is reported to be less sensitive to changes in its hyperparameters [181], but may need to be analyzed in future work. A survey of neural network structure tuning approaches is provided by Elsken et al. [237], additionally including advanced inter- and extrapolation strategies as speedup mechanisms to find optimized objectives.
Domhan et al. [238] extrapolate learning curves as a performance estimation strategy to identify and discard slowly learning models. Even more advanced, Liu et al. [239] train a surrogate model that scores the performance of the neural network structure in order to forecast the performance behavior of deep convolutional neural networks. Kiss et al. [240] propose a framework for hyperparameter tuning of black-box machine learning algorithms. Their focus is to provide an experimental and understandable UI for educational purposes. The black-box approach utilizes several of the search approaches described above and optimizes the objective of accuracy through e.g. random search, genetic algorithms, PSO, and simulated annealing. They additionally included fish school search [241], differential evolution [242], and SMBO with Gaussian processes (Bayesian optimization) [210]. The user is then allowed to select among the different approaches. Hyperopt [243] is a Python library providing hyperparameter optimization for black-box machine learning models of the Scikit-learn library [244]. Bergstra et al. [243] showed the applicability of Hyperopt utilizing Bayesian optimization as search strategy. Like Hyperopt, Optuna [245] is a Python library for hyperparameter optimization for black-box machine learning models, but it enables the usage of several different search strategies (e.g. evolutionary optimization through covariance matrix adaptation [246] and Bayesian optimization) as well as performance strategies (e.g. automated early stopping [247,248]). Emerging frameworks, referred to as AutoML [249–251], automate the decision on hyperparameter tuning optimization tools and additionally include feature selection. Such AutoML frameworks also apply black-box machine learning model optimization for classification tasks.
Thus, they expect a labeled dataset capturing the ground truth, normally run 10-fold cross-validation on the given dataset, and use accuracy as their objective. As main optimization approaches, Kotthoff et al. [249] and Feurer et al. [250] apply Bayesian optimization, while Olson and Moore [251] utilize genetic programming. Gijsbers et al. [252] evaluated these AutoML frameworks on 39 datasets for solving classification problems. The results show that the different frameworks yield significantly different accuracy. The explanation of this difference for the applied optimization strategies with respect to the given data properties has to be investigated in future work [252]. Tu et al. [253] introduced AutoNE, an AutoML framework specialized for network embedding machine learning approaches. Li and Malik [254] propose a meta-learning approach in order to learn the optimization of algorithms through reinforcement learning. They showed the applicability for neural network learning [255]. The above described state-of-the-art hyperparameter optimization approaches train and optimize the models offline [249,256]. For conducting the offline trained models in a data stream setting, the training data needs to be stored in memory. Furthermore, the optimization process is applied as batch processing combined with cross-validation for ranking purposes [257], which requires a lot of time, resources, and labeled data. Researchers acknowledge the demand for automatic hyperparameter optimization for unsupervised models due to the problem of offline hyperparameter optimization [181,258], but the previously described anomaly detection approaches rely on some assumptions about the data distributions. Consequently, it remains impractical to apply them to online arriving monitoring data. Thus, we aim to provide a solution which can be applied in an online scenario and applies a scoring function as objective, which can also be used with semi-supervised anomaly detection approaches.
For supervised anomaly detection, we assume that the above described approaches are applied, as labeled data are present.

Chapter 4

Framework for AI-based Anomaly Detection

4.1 ZerOps Framework

In order to provide an overall self-healing analysis pipeline of multiple AI analysis steps and to define the scope and placement of the anomaly detection, we propose the ZerOps framework. In [8], we defined the general structure of such an AIOps platform with a closed loop, which is presented in Figure 4.1. In the middle part (in green), the image shows a typical cross-layer infrastructure setup consisting of (from bottom to top) physical resources, on which a hypervisor is deployed, managing virtual resources (e.g. virtual machines), in which services (e.g. VNFs) are deployed. On all layers, we assume that monitoring data is collected in a stream-wise manner and sent to the self-healing pipeline. At first, the self-healing pipeline transforms the incoming data into a common data format in the Monitoring data sink. To do so, information about the system architecture is assumed to be available in order to aggregate the data as well as to provide information to the analytics pipeline, e.g. for normalization. The data is sent as a data stream to the self-healing analysis steps, which run all analytics necessary for gaining insights about the system state. This Data analysis can include further system knowledge about faults, which can be integrated through e.g. a fault catalog. As we aim to provide a zero-touch solution, this information may also be absent; thus, our data analysis should be able to operate without this prior knowledge. Furthermore, the data analysis triggers events when anomalies are detected and attaches to each event all information needed by the Self-stabilization engine, which is also called the Remediation engine later in the thesis. This Self-stabilization engine selects and plans the execution of stabilization actions (also referred to as remediation actions) either through cloud management interfaces like the OpenStack API for e.g.
orchestration purposes, or by directly executing commands within the different layers of the IT infrastructure. Of course, this depends on the access rights and the scope on which this architecture operates. Figure 4.1 assumes a private cloud use case, where all layers can be accessed and operated by the user of the AIOps platform. During the execution of such stabilization actions, the system continues to be monitored, and ideally, the anomalies are remediated. Otherwise, when actions do not have the intended impact (anomalies are still triggered), the self-healing system is capable of executing further actions, which might carry a higher level of risk. Thus, the system provides a closed loop for zero-touch administration of the IT infrastructure and provides high reliability to the system.


Figure 4.1: AIOps platform for zero-touch administration. Source: [8]

Figure 4.2 provides more insights into the components of the holistic framework, named ZerOps. On the left (in red), the framework assumes that there exist virtual resource managers (for this thesis we built a testbed using OpenStack, Docker, and Kubernetes) as well as monitoring services collecting data as a data stream. These data can be gathered by any external monitoring tool, but for simplicity, we assume data is collected using the Bitflow collector [259]. Section 7.1.1 provides further information about the metrics collected by the Bitflow collector. The collected monitoring data are sent to the Data ingestion, while information about the virtual resources is collected by a Topology discovery service, which provides a dependency model to the data analysis. All this information is sent to the Data analyses orchestrator, which manages and orchestrates the self-healing analysis pipelines. This is needed as the platform also aims to provide decentralized execution of analysis components, for example for the anomaly detection. The orchestrator is also capable of automatically deploying analysis pipelines when e.g. new services are deployed. The data analysis orchestrator further manages which data is sent to which analysis component, as not all analysis steps need information about e.g. the topology. The self-healing analysis components are divided into three main parts. First, for each monitored component an anomaly detection analysis is applied. When anomalies are detected, this analysis component sends events to the root cause analysis, which gathers further topology information to determine the concrete root cause of the problem, as anomalies may propagate through many different components, which then can be correlated. Based on this information about the root cause and propagated abnormal components, the decision engine aims to determine suitable remediation actions and has to plan their execution in the system.
In order to close the loop, an execution engine provides an automation solution to run the remediation actions, which depend on the given infrastructure. Furthermore, results are visualized and alarms are triggered, so administrators can intervene when they deem it necessary. The focus of this thesis is the anomaly detection component within

Figure 4.2: ZerOps architecture design including main capabilities, input and output components.

the self-healing pipeline, which is marked in red in Figure 4.2. In summary, the focus of the thesis is a component-based AI solution for anomaly detection triggering local events. Such events are processed in higher-level stages at the root cause analysis and remediation engine, which are out of the scope of this thesis as we concentrate on the anomaly detection.

4.2 Categorization of AI-based Anomaly Detection

When providing an AI-based solution for anomaly detection, we need to differentiate between the learning phase and the prediction phase. While the learning phase depends on the type of machine learning model (ML model), the prediction phase is always applied in the same online manner. Figure 4.3 illustrates the differentiation of both phases and the connections and dependencies within the generalized structure of applied ML models. From the data source, there is an incoming data stream of monitoring data, which in the prediction phase is sent to an existing ML model, providing a binary decision whether an anomaly was detected or not. In the learning phase, hyperparameters are provided to the ML model learning. Based on the historic monitoring data and the hyperparameter settings, an ML model is trained and provided to a model update function. The model updater manages the replacement of the prediction model with the adapted ML model from the learning phase. The model updater may rely on concept change detection in the current data stream, on the degree of change between the predictive and the newly trained model, or on further evaluations testing the accuracy of the predictive and learned models. Alternatively, the updater can simply replace the model at fixed intervals. For online learning ML models, the model is updated on every single incoming data point and exchanged constantly. But machine learning models have to be trained at first. As introduced in Section 2.3.1, machine learning algorithms can be grouped based on the availability of training data:

[Figure 4.3 layout: the monitoring data source feeds both phases; in the learning phase, hyperparameters and the ML model learning (supervised, semi-supervised, or unsupervised) produce a model handed to the model updater, which replaces the predictive model of the prediction phase, outputting a binary decision (normal/anomaly).]

Figure 4.3: General structure for machine learning model embedding into productive system environments.

1. Supervised: There exists labeled information about all anomalies and normal behavior.
2. Semi-supervised: There exists labeled normal data for training, but knowledge of abnormal data is absent.
3. Unsupervised: There are no existing labels available, but assumptions are required about the structure of the analyzed data.

Supervised Machine Learning Supervised behavior learning requires the existence of labeled data for both normal and abnormal behavior. There exist different supervised machine learning algorithms with respect to their task, like regression and classification. While regression focuses on predicting a value, classification aims to predict classes. For anomaly detection, we focus on binary classification. Thus, the classification result consists of the two classes normal and anomaly, while the algorithm expects the monitoring data as a multi-dimensional data point as input. Figure 4.4 shows the general steps necessary to apply supervised machine learning techniques in production in an online manner. The learning phase of supervised models is separated and conducted as the first step. By gathering labeled data, a machine learning model is learned, validated, and tested before the trained model is applied in production. As supervised models consume labeled training data, these are mostly not obtained automatically, but through human labeling. Therefore, such models should be very robust and consequently might not be updated in short intervals. For most machine learning algorithms, training models is computationally expensive. For this reason, the training data is expected to be collected together with labeled fault scenarios and normal system behavior at different load levels from the productive system or a simulation environment. Given a trained supervised machine learning model, the model can be applied in production. The classification should be able to perform its computation just-in-time, as these operations are mostly not computationally expensive. All in all, the training is conducted in an offline phase, while prediction happens in an online mode.
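The offline-training/online-prediction split can be illustrated with a deliberately trivial classifier. The "decision stump" below, the metric values, and the threshold logic are our illustration, not an algorithm evaluated in the thesis; it only shows that training searches the labeled data while prediction is a single cheap comparison.

```python
# Sketch of the offline-training / online-prediction split described above,
# using a trivial one-metric "decision stump" (illustrative only).

def train_stump(samples, labels):
    """Offline phase: pick the threshold on one metric that best separates
    normal (label 0) from anomaly (label 1) on the labeled training data."""
    best_threshold, best_correct = None, -1
    for threshold in sorted(set(samples)):
        correct = sum((x > threshold) == bool(y)
                      for x, y in zip(samples, labels))
        if correct > best_correct:
            best_threshold, best_correct = threshold, correct
    return best_threshold

def predict(model, x):
    """Online phase: a single comparison, cheap enough for just-in-time use."""
    return 1 if x > model else 0

# Hypothetical labeled data: high CPU load was recorded during injected faults.
train_x = [0.2, 0.3, 0.4, 0.5, 0.9, 0.95, 0.85]
train_y = [0, 0, 0, 0, 1, 1, 1]
model = train_stump(train_x, train_y)
```

The asymmetry mirrors the text: `train_stump` scans all training samples (expensive, offline), while `predict` does constant work per incoming data point (cheap, online).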

Semi-supervised Machine Learning Semi-supervised learning aims to model the normal behavior so that deviations from this behavior can be detected, which are consequently identified as anomalies. This type of learning is especially applicable in domains where it is hard or even impossible to obtain abnormal data (e.g. a nuclear power plant) or where artificial injections would harm the system. Thus, one-class learning is applied to existing normal system data.

[Figure 4.4 layout: the monitoring data source and a label provider yield normal system data and anomaly scenario data; after data splitting into training, validation, and test sets, the supervised ML model is learned with the given hyperparameters and finally tested.]

Figure 4.4: Learning phase of supervised machine learning models.

[Figure 4.5 layout: as in Figure 4.4, but only labeled normal system data is split into training and validation sets for learning the semi-supervised ML model; optional anomaly scenario data is used solely for testing.]

Figure 4.5: Learning phase of semi-supervised machine learning models.

As Figure 4.5 shows, normal data has to be labeled in order to make sure that no abnormal data is used for learning. In general, the semi-supervised model does not expect to know any abnormal system scenarios, but when a label provider is able to also create a database of anomaly scenario data, it can be used for ML model testing. Still, this is optional; thus, the learning of an ML model relies on a training and validation set of normally behaving data and on provided hyperparameters. One-class learners train a decision boundary, which has to be inferred from the data as only normal data exists; the sensitivity of the boundary is in general configurable by the hyperparameters. Based on this training set of normally behaving data, strategies like clustering are, among others, utilized to capture this state. Due to the differentiation criterion, many well-known clustering techniques cannot be used as they map the complete space to clusters. For example, when learning a k-means clustering model (see Figure 4.6, left image) on normal data (black dots), the model cannot distinguish between normal and abnormal data points, as any data point can be assigned to a cluster. This behavior is represented by the additional red dots, which shall be distinguished into normal and anomaly in Figure 4.6. In contrast to this, clustering models like BIRCH [52] (right image) provide additional spherical boundaries (dotted lines) around the centroids. Thus, data points can either be assigned to a cluster, indicating normal behavior, or not be assigned to any cluster, indicating abnormal behavior. Consequently, models should be chosen that provide either an estimation whether a point belongs to any cluster (for binary decisions) or a probability of it belonging to a given distribution.
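The spherical-boundary idea can be sketched as follows. This is a simplified stand-in for illustration, not the actual BIRCH CF-tree: each learned cluster keeps only a centroid and a radius, and a point inside no sphere is flagged as an anomaly.

```python
import math

# Sketch of the spherical-boundary decision described above (BIRCH-like,
# but a simplified stand-in, not the actual BIRCH CF tree).

class OneClassSpheres:
    def __init__(self, centroids, radii):
        self.centroids = centroids  # cluster centers learned from normal data
        self.radii = radii          # boundary radius per cluster

    def is_anomaly(self, point):
        for center, radius in zip(self.centroids, self.radii):
            if math.dist(point, center) <= radius:
                return False  # inside a learned sphere -> normal
        return True           # outside all spheres -> anomaly

# Two clusters learned from normal data (illustrative values).
model = OneClassSpheres(centroids=[(0.0, 0.0), (5.0, 5.0)], radii=[1.0, 1.5])
assert model.is_anomaly((3.0, 3.0)) is True    # between the clusters
assert model.is_anomaly((0.5, 0.5)) is False   # inside the first sphere
```

A plain k-means model, by contrast, would assign (3.0, 3.0) to its nearest centroid and could never report "no cluster", which is exactly the differentiation criterion discussed above.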

Unsupervised Machine Learning Compared to the first two learning strategies, unsupervised learning does not rely on any labeled training data. ML models consequently need further assumptions (in general provided by hyperparameters) to

[Figure 4.6 panels: k-means (left) and BIRCH (right), each showing normal and abnormal regions; legend: trained data points, data points of interest, cluster borders.]

Figure 4.6: Example of three k-means clusters and clusters formed by applying BIRCH.

[Figure 4.7 layout: boundary decision assumptions and hyperparameters feed the unsupervised ML model learning; the monitoring data source is split into training and validation sets.]

Figure 4.7: Learning phase of unsupervised machine learning models.

create accurate models. Assumptions can for example be the percentage of expected anomalies, the percentage of the model, or signal changes. Still, the data can be split into a training and a validation set, either for offline learning or, in the case of online learning, through seed learning with window-based storage of the latest data.
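The seed-learning idea with window-based storage of the latest data can be sketched as follows. This detector is our illustration, not an algorithm from the thesis: the sliding window acts as the continuously updated model, and the only "label-free" assumption is a hyperparameter bounding how far a point may deviate from the recent signal.

```python
from collections import deque
import statistics

# Sketch of an unsupervised signal-change detector (illustrative only):
# a sliding window of the latest values serves as the model, and a point
# deviating by more than `z_max` standard deviations from the window mean
# is flagged as abnormal. `z_max` is the boundary-decision assumption.

class WindowZScoreDetector:
    def __init__(self, window_size=60, z_max=3.0):
        self.window = deque(maxlen=window_size)
        self.z_max = z_max

    def update(self, value):
        """Return True if `value` is abnormal w.r.t. the current window."""
        abnormal = False
        if len(self.window) >= 2:
            mean = statistics.fmean(self.window)
            std = statistics.pstdev(self.window)
            if std > 0 and abs(value - mean) / std > self.z_max:
                abnormal = True
        if not abnormal:
            self.window.append(value)  # only normal points update the model
        return abnormal

detector = WindowZScoreDetector(window_size=10, z_max=3.0)
for v in [10.0, 10.5, 9.8, 10.2, 10.1, 9.9, 10.3, 10.0]:
    assert detector.update(v) is False  # steady signal: no alarms
```

Because the window continuously replaces the oldest values, the model follows gradual concept shifts without any labels, at the price of the deviation assumption encoded in `z_max`.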

Recommendations for Applying Learning Strategies Supervised learning promises the highest accuracy for anomaly detection as it has the advantage of labeled data. This technique is highly applicable when labels can be obtained automatically or when the cost of expert-based labeling is negligible. Cost-effective labeling is the case when it is applied to a constrained set of software components where concept shifts rarely happen (normal behavior is known) and there exists a finite set of monitored anomalies. In complex systems, knowledge bases of normal and abnormal labeled data often do not exist. Thus, semi-supervised and unsupervised learning strategies have to be applied. Semi-supervised learning enables the detection of unknown anomalies due to one-class learning of the normal behavior. Still, labeling has to be performed, as the border definitions of the ML models are inferred from the labeled normal behavior. Like supervised learning, semi-supervised learning is advisable when normal behavior labeling is cost-efficient. This is again the case for automatic labeling strategies and when expert-based labeling is cost-efficient, which holds for finite sets of software components without concept shifts. Unsupervised learning does not require the existence of any labeled data. Consequently, unsupervised techniques can be applied at any time in order to differentiate between normal and abnormal states. Like semi-supervised learning, this type of learning is advisable in cases where anomalies are unknown, but it additionally covers cases where concept shifts can happen. When providing an anomaly detection for black-box service monitoring for the ZerOps framework, we cannot rely on knowledge of anomalies and have to consider concept shifts because:

• service components may frequently change due to agile development,
• the configuration of single service components and the external service load vary in each service environment setup.
Consequently, unsupervised techniques are the main focus for the rest of the thesis. Their main disadvantages remain the assumptions that have to be made in order to define a valuable decision border, as well as the lack of knowledge, resulting in less accurate detection results.

4.3 Evaluation

Next, we show the applicability of supervised learning approaches, assuming complete knowledge about anomaly situations and normal behavior patterns. Additionally, we present how semi-supervised learning already suffers from the lack of labeled anomaly information. Unsupervised learning is postponed to Chapter 7 for a larger evaluation.

4.3.1 Supervised Learning Evaluation

Evaluation Setup

The supervised machine learning algorithms J48, Random Forest, PART, REP TREE, LMT, JRIP, ONER, Random Tree, Hoeffding Tree, Decision Stump, SMO, and Naive Bayes (see Section 3.2) are included in the evaluation. We utilized the machine learning library WEKA [107], which provides these algorithms, and chose the predefined hyperparameters suggested by the WEKA library. Ten-fold cross-validation is performed on the cloud monitoring dataset as described in Section 7.1. The input for learning is formed by two datasets: the normal operation data and the anomaly data obtained through anomaly injection. Both datasets are shuffled and split into ten equally sized subsets. Randomly chosen pairs of normal operation subsets and anomaly subsets then form the ten test datasets. Thus, each training set consists of 28,610 data points.
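The fold construction just described (shuffle both datasets, split each into ten subsets, randomly pair them) can be sketched as follows; the helper name and dataset sizes are ours, purely for illustration.

```python
import random

# Sketch of the described fold construction (our illustration): normal data
# and anomaly data are shuffled, split into n_folds subsets each, and
# randomly paired to form the n_folds evaluation datasets.

def make_folds(normal, anomalies, n_folds=10, seed=0):
    rng = random.Random(seed)
    normal, anomalies = normal[:], anomalies[:]  # do not mutate the inputs
    rng.shuffle(normal)
    rng.shuffle(anomalies)

    def split(data):
        size = len(data) // n_folds
        return [data[i * size:(i + 1) * size] for i in range(n_folds)]

    normal_subsets, anomaly_subsets = split(normal), split(anomalies)
    rng.shuffle(anomaly_subsets)  # random pairing of subsets
    return [n + a for n, a in zip(normal_subsets, anomaly_subsets)]

folds = make_folds(list(range(1000)), list(range(1000, 1500)), n_folds=10)
```

Pairing whole subsets, rather than re-shuffling the union, preserves the normal-to-anomaly ratio in every fold, which matters because the dataset is imbalanced.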

Evaluation Results

Table 4.1 shows the results of this evaluation, sorted by accuracy. At the top, the approaches achieving the highest accuracy score are presented. J48 reached the highest score with more than 99%, followed by six algorithms achieving more than 98%. As we utilized the hyperparameters suggested by the WEKA library, even more accurate results can be expected by applying hyperparameter tuning and feature engineering techniques.

Table 4.1 additionally lists the time for learning t_learn and predicting t_pred. The prediction time t_pred represents the average time in milliseconds needed to apply the machine learning algorithm in prediction mode to an individual sample. The prediction results show the efficiency of the computation, as all algorithms are able to provide a prediction in less than a millisecond. This is expected, as the training phase is computationally more expensive for most machine learning approaches, while the prediction differs little between the approaches and can be applied efficiently in an online manner.

Algorithm        t_learn [ms]  t_pred [ms]  precision  recall   F1       accuracy
J48                     3,030       0.0023     94.97%   99.81%   97.33%    99.05%
Random Forest          27,801       0.0984     93.21%   99.22%   96.12%    98.61%
PART                    8,682       0.0039     93.41%   98.84%   96.05%    98.59%
REP TREE                1,943       0.0017     93.03%   99.11%   95.98%    98.56%
LMT                    79,068       0.0039     93.16%   98.87%   95.93%    98.55%
JRIP                   17,668       0.0020     92.52%   99.15%   95.72%    98.46%
ONER                      295       0.0015     92.37%   98.38%   95.28%    98.31%
Random Tree               755       0.0021     90.38%   97.91%   94.00%    97.83%
Hoeffding Tree            820       0.0036     87.29%   97.94%   92.31%    97.17%
Decision Stump            319       0.0018     86.26%   98.35%   91.91%    97.00%
SMO                    18,796       0.0035     88.69%   75.03%   81.29%    94.02%
Naive Bayes               173       0.0129     92.83%   60.80%   73.48%    92.40%

Table 4.1: Quantitative and qualitative evaluation results for supervised learning.

Table 4.2 presents further details about the TP-rate, TN-rate, FP-rate, FN-rate, and AUC value, as the dataset is imbalanced. The best results are achieved when the TP-rate and TN-rate reach 100%, while the FP-rate and FN-rate are consequently 0%. The algorithms with above 95% accuracy also provide the desired high rates, resulting in high AUC values. For this subset of algorithms, the AUC values are similar to the accuracy due to such highly accurate results. On the other hand, SMO and Naive Bayes show high TN-rates, indicating precise predictions for the normally behaving data, while their TP-rate is lower (75.03% and 60.80%, respectively) compared to the other classifiers (>97.91%); they therefore do not achieve equally accurate predictions for the abnormal data points. The AUC values consequently decrease by a total of 7.5% (SMO) and 12.49% (Naive Bayes) compared to the accuracy. The learning time t_learn represents the time (in milliseconds) needed to train a batch of 28,610 data points. In order to provide more guidance in selecting an appropriate machine learning algorithm, Figure 4.8 shows the dependency between learning time and accuracy. The x-axis represents the time, while the y-axis represents the accuracy. Based on these indicators, the preferred algorithm achieves zero training time while reaching 100% accuracy. The upper left corner of the diagram represents this ideal case. The figure shows that many algorithms lie in the left and upper part, marked as best choices and surrounded by a zoom box. The zoom box (Fig. 4.9) illustrates the further differentiation between the best options for choosing a supervised approach. Figure 4.9 provides the Pareto front (the red line connecting the circled ML approaches), giving guidance on selecting the appropriate approach with respect to training time and accuracy.
The J48 classifier provides the highest accuracy, while ONER, REP TREE, and Naive Bayes provide successively decreasing accuracy, but also decreasing learning time (preferred). As described above, Naive Bayes is the least capable of providing accurate predictions for the abnormal cases in contrast to the other algorithms in the Pareto front. Thus, we recommend choosing one of the other three approaches when applying to similar use cases.

Algorithm        TP-rate  TN-rate  FP-rate  FN-rate  AUC
J48               99.81%   98.89%    1.11%    0.19%  99.35%
Random Forest     99.22%   98.49%    1.51%    0.78%  98.61%
PART              98.84%   98.54%    1.46%    1.16%  98.86%
REP TREE          99.11%   98.44%    1.56%    0.89%  98.78%
LMT               98.87%   98.48%    1.52%    1.13%  98.68%
JRIP              99.15%   98.32%    1.68%    0.85%  98.74%
ONER              98.38%   98.30%    1.70%    1.62%  98.34%
Random Tree       97.91%   97.82%    2.18%    2.09%  97.87%
Hoeffding Tree    97.94%   97.01%    2.99%    2.06%  97.48%
Decision Stump    98.35%   96.72%    3.28%    1.65%  97.54%
SMO               75.03%   98.00%    2.00%   24.97%  86.52%
Naive Bayes       60.80%   99.02%    0.98%   39.20%  79.91%

Table 4.2: Detailed evaluation results for supervised learning capturing the different rates (TP, TN, FP, FN, AUC).
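The relationship between the reported rates and the confusion-matrix counts can be made explicit in a few lines. The counts below are illustrative, not taken from the tables; note that AUC is not derivable from these counts alone, as it requires score-based ranking of the predictions.

```python
# Sketch relating the reported rates to the confusion-matrix counts
# (TP, TN, FP, FN); the example values are illustrative only.

def rates(tp, tn, fp, fn):
    return {
        "TP-rate": tp / (tp + fn),          # recall / sensitivity
        "TN-rate": tn / (tn + fp),          # specificity
        "FP-rate": fp / (fp + tn),
        "FN-rate": fn / (fn + tp),
        "precision": tp / (tp + fp),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
    }

r = rates(tp=90, tn=95, fp=5, fn=10)
assert abs(r["TP-rate"] - 0.90) < 1e-9
assert abs(r["TN-rate"] - 0.95) < 1e-9
assert abs(r["TP-rate"] + r["FN-rate"] - 1.0) < 1e-9  # complementary pairs
assert abs(r["TN-rate"] + r["FP-rate"] - 1.0) < 1e-9
```

The complementary pairs explain the table layout: FP-rate and FN-rate carry no extra information beyond TN-rate and TP-rate, but make the error sources of imbalanced classifiers such as SMO and Naive Bayes directly visible.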

[Figure 4.8 plot: accuracy (y-axis, 0.92–1.00) over learning time in ms (x-axis, up to 8·10^4) for all algorithms; the best choices in the upper left are surrounded by a zoom box.]

Figure 4.8: Performance of supervised approaches based on accuracy and runtime.

[Figure 4.9 plot: accuracy (y-axis, 0.92–0.98) over learning time in ms (x-axis, 0–3,000); the Pareto front connects J48, ONER, REP TREE, and Naive Bayes.]

Figure 4.9: Zoom box showing the performance of supervised approaches with respect to accuracy and runtime for the selected algorithms in the upper left part of Figure 4.8.

4.3.2 Semi-supervised Evaluation

Evaluation Setup

We applied the two clustering algorithms BIRCH [52] and BICO [47] to the domain of anomaly detection for cloud service monitoring (see Section 3.2 for details). For the implementation, we utilized JBIRCH by Roberto Perdisci1 for BIRCH and the Java-based MOA (Massive Online Analysis) library2 [260] for BICO. The evaluation was performed on the cloud monitoring dataset as described in Section 7.1. Table 4.3 shows the applied hyperparameters for BIRCH and BICO.

Algorithm   Parameter                  Value
BIRCH       initial threshold          0
            maximum number of nodes    10
BICO        initial threshold          0
            maximum number of nodes    10

Table 4.3: Parameter definitions for the semi-supervised clustering approaches BIRCH and BICO.
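Before turning to the data selection, the semi-supervised protocol itself (fit micro-clusters on normal data only, flag points that fit no cluster) can be sketched with a simplified radius-based clustering as a stand-in for BIRCH/BICO; this sketch is ours, not the JBIRCH or MOA implementation:

```python
import math

# Simplified stand-in for semi-supervised clustering-based detection:
# micro-clusters are centroids fitted on normal data only; a test point
# is flagged abnormal if it is farther than `radius` from every centroid.
def fit_normal(points, radius):
    clusters = []
    for p in points:
        for c in clusters:
            if math.dist(c["centroid"], p) <= radius:
                c["n"] += 1
                # incremental centroid update (running mean)
                c["centroid"] = [ci + (pi - ci) / c["n"]
                                 for ci, pi in zip(c["centroid"], p)]
                break
        else:
            clusters.append({"centroid": list(p), "n": 1})
    return clusters

def is_abnormal(clusters, p, radius):
    return all(math.dist(c["centroid"], p) > radius for c in clusters)
```

The real BIRCH and BICO algorithms additionally maintain the (N, L, S) clustering features in a height-balanced tree, which keeps insertion cost dependent on the tree height rather than linear in the number of clusters.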

A random set of 1,200 data points (10 min) is collected on the basis of complete knowledge of normally behaving data. The experiment was repeated 100 times in order to evaluate many different possible selections of normal data points. The number of 1,200 data points was determined through grid search (over the interval [100, 10000] data points in steps of 100) in order to find an appropriate number of data points which does not cause the CF-tree to grow too much for the given hyperparameters of the tree. Thus, we show next

1 https://github.com/perdisci/jbirch
2 https://moa.cms.waikato.ac.nz/

Algorithm   t_learn [ms]   t_pred [ms]   precision   recall    F1        accuracy
BICO        2              0.1221        92.68%      64.02%    75.73%    91.65%
BIRCH       2              0.0074        48.25%      78.90%    59.88%    78.49%
J48         3030           0.0023        94.97%      99.81%    97.33%    99.05%

Table 4.4: Results for the first setup using randomized selection of training data showing runtimes and qualitative prediction results.

Algorithm   TP-rate   TN-rate   FP-rate   FN-rate   AUC
BICO        64.02%    98.71%    1.29%     35.98%    81.37%
BIRCH       78.90%    78.39%    21.61%    21.10%    78.65%
J48         99.81%    98.89%    1.11%     0.19%     99.35%

Table 4.5: Results for the first setup using randomized selection of training data showing the different rates (TP, TN, FP, FN, AUC).

just the results for the most accurate solution, which we were able to find through the grid search.

Evaluation Results

Table 4.4 shows the precision, recall, F1 score, and accuracy for the BICO and BIRCH clustering algorithms applying the first setup with randomized training data input.

The results show that precision, accuracy, and the F1 score are higher for the BICO algorithm, which is expected as BICO is an optimized version of BIRCH. Nevertheless, BIRCH shows higher recall, indicating more precise predictions for abnormal data points. The supervised learning algorithm J48 was added to this table as a comparison, outperforming both semi-supervised approaches. This is of course expected, as the semi-supervised algorithms lack complete knowledge about any anomaly data.

The comparison of the time to perform learning t_learn and prediction t_pred shows a clear differentiation between the semi-supervised and supervised approaches, although all three presented algorithms internally use a tree as data structure. When predicting a single data point (t_pred), J48 is the quickest. BIRCH uses just 0.0051 ms more computation time. BICO shows the slowest computation with 0.1221 ms, which is presumably quick enough when compared with a monitoring interval of 500 ms. When comparing the time for learning t_learn, the semi-supervised algorithms outperform J48, as incremental learning provides efficient computation. Table 4.5 presents further details due to the imbalance of the data. Again, J48 outperforms the semi-supervised approaches in all measurements. BICO shows higher TN-rates, indicating a more accurate representation of the normal behavior, while BIRCH is more precise in the detection of anomalies. Figure 4.10 illustrates the supervised approaches embedded together with the first setup of the semi-supervised approaches. As for the supervised learning algorithms, the scatter plot aims to provide an overview of the most beneficial algorithms for the given domain. The scatter plot shows that BIRCH does not meet these requirements due to its lower accuracy, but BICO is placed within the zoom box. Figure 4.11 shows the approaches highlighted within the zoom box in more detail.
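For reference, the rate-based measures reported in Tables 4.2–4.5 derive directly from confusion-matrix counts; the following is a minimal sketch (the helper name is ours; AUC is omitted, as it requires ranking scores rather than counts):

```python
def detection_rates(tp, tn, fp, fn):
    """Derive the rate-based measures of Tables 4.2-4.5 from raw
    confusion-matrix counts (anomaly = positive class)."""
    precision = tp / (tp + fp)
    tp_rate = tp / (tp + fn)            # recall
    tn_rate = tn / (tn + fp)
    fp_rate = fp / (fp + tn)
    fn_rate = fn / (fn + tp)
    f1 = 2 * precision * tp_rate / (precision + tp_rate)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return {"precision": precision, "tp_rate": tp_rate, "tn_rate": tn_rate,
            "fp_rate": fp_rate, "fn_rate": fn_rate, "f1": f1,
            "accuracy": accuracy}
```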

[Figure 4.10: scatter plot of accuracy versus learning time [ms, ×10⁴] for BIRCH, BICO, and the supervised approaches, with a zoom box marked in the upper left.]

Figure 4.10: Comparison of the supervised and semi-supervised approaches with respect to learning time and accuracy.

Furthermore, the Pareto front is marked and shows that BICO is a valuable candidate, as it requires less learning time than any of the other approaches.

4.3.3 Summary

The supervised learning approaches provide the most accurate point-wise detection for the normal and abnormal cases and therefore show the applicability of such approaches. Thus, machine learning techniques are capable of learning accurate models, with further potential remaining in hyperparameter optimization and feature engineering. Tree-based and rule-based classifiers (e.g. J48, Random Forest) show the most accurate results, indicating that anomalies can be detected with deductive sets of complex rules. As supervised learning assumes representative knowledge about the normal state as well as all different types of anomalies, these models should be applied to data streams where such assumptions can be foreseen and met (e.g. for stateless functions with specifically defined usage patterns, or for hard real-time services in embedded systems). As the assumption of knowing all possible anomaly causes is mostly difficult or impossible to ensure, semi-supervised approaches provide the possibility to model the normal behavior including border definitions to distinguish between normal and abnormal data points. Due to the absence of fault scenario catalogs and the need to detect unknown anomalies, semi-supervised approaches are applicable to many production scenarios. We showed that, when providing complete knowledge of the normal data to the learner, BICO achieves an accuracy of 91.65% while requiring much less learning time than the supervised algorithms. Still, there is a significant drop in differentiating normal and abnormal behaviors with respect to the accurate supervised approaches. Both options are valid and applicable to the given problem, but the first option introduces further computations through an algorithm running parallel to the anomaly detection and therefore produces additional overhead, which conflicts with the goal of decentralized anomaly detection for the AIOps platform.
In the following chapter, we develop an unsupervised anomaly detection approach based on the semi-supervised BIRCH.

[Figure 4.11: zoomed scatter plot of accuracy versus learning time [ms] for BICO and the supervised approaches, distinguishing Pareto-optimal from non-Pareto algorithms, with the Pareto front marked.]

Figure 4.11: Zoom box for Figure 4.10 showing the Pareto front.

Chapter 5

Concept Adapting BIRCH

We aim to provide a solution for unsupervised anomaly detection enabling zero-touch administration. This can be achieved by autonomously and continuously learning the normal behavior of a system in order to provide concept adaptability to the system domain.

5.1 Concept Adapting BIRCH

We introduce Concept Adapting BIRCH (CABIRCH), which uses the aging of clusters to make it possible to adapt to future concept changes. As we aim to provide unsupervised normal behavior learning on a possibly endless stream of monitoring data, Concept Adapting BIRCH can be used as it combines the positive properties of both efficiently and iteratively updating the internal model and adapting to the time series behavior. We modified the order of the four main phases of BIRCH (see Section 2.3.2) and introduced an additional phase. Furthermore, we do not consider the 2nd and 4th phases, as they are optional and use an offline approach. We consider the following three phases:

1. Inserting a data point: As described in [9], new data points are added to the CF-tree until the maximum number of nodes, defined by the user, is reached. Reaching the maximum triggers the rebuilding phase of the tree. While rebuilding, the user-defined threshold is updated dynamically from then on. As defined by Zhang et al. [9], the updated threshold is approximated based on the size of the root CF entry using linear regression. The rebuilding phase is also invoked when the decay of micro-clusters results in removing CFs.

2. Predicting the corresponding cluster: The original BIRCH algorithm first trains the clustering model with the complete dataset. After the trained model is prepared, each data point from the set is fit against the model without changing it anymore. To use Concept Adapting BIRCH on a time series, we apply point-wise predictions. Thus, after training on the current data point to adapt the CF-tree, we return the inserted cluster as a prediction.

3. Aging of micro-clusters: The new phase of decaying micro-clusters provides a continuous forgetting of clusters over time, denoted as aging. The aging allows us to forget older micro-clusters and strengthen the model's adaptation to changing concepts. Aging can cause the deletion of nodes.
Consequently, the rebuilding of the tree is applied to rebalance the tree structure. The aging procedure is described in more detail in the following.
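The three phases can be sketched as a flat per-point loop; this minimal illustration (function and field names are ours, not the thesis implementation) keeps micro-clusters in a plain list instead of a CF-tree and uses a fixed radius threshold:

```python
import math

def decay_cf(cf, eps):
    """One aging step (Eq. 5.1): remove eps 'points' proportionally,
    lowering the density while leaving centroid and radius unchanged."""
    f = eps / cf["N"]
    cf["N"] -= eps
    cf["L"] = [l * (1 - f) for l in cf["L"]]
    cf["S"] = cf["S"] * (1 - f)

def insert_point(clusters, x, radius_threshold):
    """Insert x into the first fitting micro-cluster or open a new one;
    returns the cluster that absorbed x (the point-wise prediction)."""
    for cf in clusters:
        centroid = [l / cf["N"] for l in cf["L"]]
        if math.dist(centroid, x) <= radius_threshold:
            break
    else:
        cf = {"N": 0.0, "L": [0.0] * len(x), "S": 0.0}
        clusters.append(cf)
    cf["N"] += 1
    cf["L"] = [l + xi for l, xi in zip(cf["L"], x)]
    cf["S"] += sum(xi * xi for xi in x)
    return cf

def cabirch_step(clusters, x, radius_threshold, eps):
    prediction = insert_point(clusters, x, radius_threshold)  # phases 1+2
    for cf in clusters:                                       # phase 3: aging
        decay_cf(cf, eps)
    clusters[:] = [cf for cf in clusters if cf["N"] > 0]      # removal
    return prediction
```

In the real CABIRCH, the removal of decayed CFs additionally triggers the rebuild of the CF-tree; with a flat list, the filtering step is all that is needed.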


5.1.1 Micro-cluster Aging

We introduce micro-cluster aging by applying two steps, consisting of a decay step followed by a removal step:

• Density decay: Let us assume we use the original BIRCH model to train a CF-tree on a data stream. By continuously inserting new data points into the CF-tree, the number of data points within the CFs increases over time. Due to the constantly growing tuple entries of the clusters, older entries retain the same impact on the shape of the cluster. Thus, data points arriving at the beginning have a greater impact on the shape of the CFs than changes later in time. In order to keep the micro-clusters adaptable to concept changes, a decay function is applied to the micro-clusters, giving new data points more impact on a cluster's shape.

• Removal of clusters: Older and now irrelevant data should not impact a model as much when considering concept-adapting behavior over time. Therefore, BIRCH needs the ability to remove older clusters. After removing CFs, a rebuild of the CF-tree is performed to ensure the correctness of the micro-cluster entries of the parent nodes.

The aging process should be applied to forget the history of past behaviors, but the model should still be fixed in memory and time. Storing historic data is not considered, as this would imply higher memory usage. How this can be achieved is described next in detail.

Density Decay

To provide concept adaptability to the change of data over time, we modify the clustering method by introducing a density-based decay function f_d. It defines the amount by which a CF's density is changed. The change is applied to each leaf node in order to give new incoming data points more influence on the shape of the cluster. The function f_d can be defined by the time, the number of points of the cluster, or any other model- or time-related metric, to let the model evolve with the data. We focus on time-related decays, such that at each time step the number of points is decreased by ε and ε points are removed from the center L/N of the micro-cluster, where ε := f_d, as defined by Equation 5.1.

$CF_t = CF_{t-1} - CF_\epsilon = \left(N_{t-1} - \epsilon,\; L_{t-1} - \epsilon\cdot\frac{L}{N},\; S_{t-1} - \epsilon\cdot\frac{S}{N}\right)$   (5.1)

Equation 5.1 provides concept adaptability. CABIRCH CFs remain the same in their structure (centroid and radius remain unchanged), but the relative influence of new incoming data points on the structure becomes larger. Clusters which are not updated frequently anymore are removed over time. Through Equation 5.1, the structure of the micro-cluster does not change in the position of the centroid or the radius. We show that a given $CF_{t-1}$ at time $t-1$ has the same centroid $c_{t-1} = \frac{L}{N}$ and radius $r_{t-1} = \sqrt{\frac{N\cdot\left(\frac{L}{N}\right)^2 + S - 2\cdot\frac{L}{N}\cdot L}{N}}$ before as after applying the decay step. Let the centroid after the decay step be called $c_t$ and the radius $r_t$. As defined by Equation 5.1, the centroid after the decay results in:

$c_t = \frac{L - \epsilon\cdot\frac{L}{N}}{N - \epsilon} = \frac{L\cdot\left(1 - \frac{\epsilon}{N}\right)}{N\cdot\left(1 - \frac{\epsilon}{N}\right)} = \frac{L}{N} = c_{t-1}$   (5.2)

For the radius:

$r_t = \sqrt{\frac{(N-\epsilon)\cdot\left(\frac{L-\epsilon\cdot L/N}{N-\epsilon}\right)^2 + \left(S-\epsilon\cdot\frac{S}{N}\right) - 2\cdot\frac{L-\epsilon\cdot L/N}{N-\epsilon}\cdot\left(L-\epsilon\cdot\frac{L}{N}\right)}{N-\epsilon}}$

$\;\;= \sqrt{\frac{N\left(1-\frac{\epsilon}{N}\right)\cdot\left(\frac{L}{N}\right)^2 + S\left(1-\frac{\epsilon}{N}\right) - 2\cdot\frac{L}{N}\cdot L\left(1-\frac{\epsilon}{N}\right)}{N\cdot\left(1-\frac{\epsilon}{N}\right)}}$   (5.3)

$\;\;= \sqrt{\frac{N\cdot\left(\frac{L}{N}\right)^2 + S - 2\cdot\frac{L}{N}\cdot L}{N}} = r_{t-1}$

Through Equation 5.1, the density of a micro-cluster is changed, and inserting new points into the micro-cluster results in a different structure after applying the decay function than without applying it, as long as the inserted data point is not the centroid itself. We therefore show the change of the centroid with decay applied against not using the technique. Let the centroid using decay be $c_a$ and without applying the decay $c_b$, with initially $c_a = c_b = \frac{L}{N}$. The newly added data point is described by $x$. Thus, based on the additivity property of Equation 2.1, $c_b = \frac{L+x}{N+1}$ and, after also applying the decay, $c_a = \frac{L+x-\epsilon\cdot\frac{L}{N}}{N+1-\epsilon}$. Let the difference of change be described by $c_b + \Delta = c_a$, where we show $\Delta \neq 0$ when $x \neq \frac{L}{N}$:

$\Delta = \frac{L+x-\epsilon\cdot\frac{L}{N}}{N+1-\epsilon} - \frac{L+x}{N+1}$

Multiplying by the positive factor $(N+1-\epsilon)(N+1)$, the sign of $\Delta$ is determined by

$\left(L+x-\epsilon\cdot\frac{L}{N}\right)\cdot(N+1) - (L+x)\cdot(N+1-\epsilon)$
$\;\;= NL + Nx - \epsilon L + L + x - \epsilon\frac{L}{N} - NL - L + \epsilon L - Nx - x + \epsilon x$
$\;\;= \epsilon x - \epsilon\frac{L}{N}.$   (5.4)

If $x = \frac{L}{N}$ then $\Delta = 0$, and if $x \neq \frac{L}{N}$ then $\Delta \neq 0$. The change of the radius can be shown in the same way and yields the same implication: $\Delta \neq 0$ for $x \neq \frac{L}{N}$. Thus, we showed that there is a difference in the change of the structure when using Concept Adapting BIRCH's Equation 5.1 compared to applying just BIRCH's inserts.

Equation 5.1 is valid due to the additivity property of the micro-cluster tuple as shown in Equation 2.1. It does not change the size and radius of the cluster but only its density. Removing points from the hull of the n-sphere, or from anywhere other than the centroid, would influence the shape of the cluster, which is not desired, as this results in an incorrect structure of the CF-tree: clusters would no longer be correctly placed within it. By reducing the density of a micro-cluster, new data points inserted into the cluster have a higher influence on its shape. Additionally, CFs that decay to zero (or fewer) points in N are removed from the node. Consequently, leaf nodes without any CFs are removed from the CF-tree structure. Therefore, clusters require the frequent insertion of new data points to maintain their existence. Through Equation 5.1, clusters which are not updated frequently anymore are removed over time. The density decay is performed after inserting and predicting the current data point, so that micro-clusters decay at the end of every time step. When clusters have zero or fewer data points inside their CF after decay, those CFs are removed and the entire CF-tree is rebuilt. Assuming that a CF is no longer represented by the current data stream, such a cluster can only remain when the insertion of new incoming data points outweighs the decay of N.
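The invariance shown in Equations 5.2 and 5.3 can also be checked numerically; the following one-dimensional sketch (example values are ours) verifies that the decay of Equation 5.1 leaves centroid and radius unchanged:

```python
import math

def centroid_radius(n, L, s):
    """Centroid c = L/N and radius r = sqrt((N*c^2 + S - 2*c*L) / N)
    of a one-dimensional clustering feature CF = (N, L, S)."""
    c = L / n
    r = math.sqrt((n * c * c + s - 2 * c * L) / n)
    return c, r

# Decay step of Eq. 5.1 with example values N=10, L=20, S=50, eps=0.5
n, L, s, eps = 10.0, 20.0, 50.0, 0.5
c0, r0 = centroid_radius(n, L, s)
c1, r1 = centroid_radius(n - eps, L - eps * L / n, s - eps * s / n)
assert math.isclose(c0, c1) and math.isclose(r0, r1)
```

Only the density N changes; hence a newly inserted point shifts the centroid more strongly than it would without the decay, as derived in Equation 5.4.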
As we assume that there are no represented points anymore for such a cluster, the cluster will be removed once N becomes negative. How constant updates of the micro-cluster influence the shape over time depends on the concrete definition of f_d. Therefore, we define different possible decay functions next.

Decay Functions

Next, we define four types of decay functions, which impact the aging of micro-clusters over time differently:

• Decay based on thresholds
• Logarithmic decay
• Exponentially bounded decay
• Logistic decay

While the decay based on a threshold is configured through model parameters, the logarithmic, exponential, and logistic decays are dependent on the time. After defining the decay functions, we summarize and conclude the purpose of using these decay functions in the context of normal state representation.

Decay based on Thresholds

A straightforward approach for defining a decay function is specifying a single static value K ≥ 0 as decay, independent from time. Through constantly applying the decay, the cluster shrinks in its density. The definition of the value K is thereby crucial. For example, high values would lead to quick removal of clusters, while too low values do not impact the aging process at all. As such a value can be user-defined or based on initial model parameters, we introduce possible options to automatically choose K. For the density function f_d, the probability of updating a single CF, P_update, can be applied. Thus, assuming a uniform distribution for updating a single CF (just for the leaf nodes) within the CF-tree, we can define P_update = 1/L, where L denotes the number of leaves. As this number might change over time, we define L to be the maximum expected number, defined by the branching factor b and maximum height h of a given tree, which is defined by the user. Thus, L = b^h and

$f_d = P_{update} = \frac{1}{b^h}.$   (5.5)

As f_d assumes a uniform distribution, this form might not capture the realistic distribution of a given use case. Clusters with many inserts become inflexible to concept change, while clusters with a small number of inserts disappear quickly. Therefore, we introduce time-dependent decay functions next.

Logarithmic Decay

Decay through a static value gives new emerging clusters a chance of not being removed only when the update frequency at the beginning is higher than the average of a cluster. Otherwise, those clusters disappear quickly, as initial values of N are always small at creation time. In order to capture this starting behavior of the cluster variables, the decay function should reflect such behavior as well. Let µ denote the average number of inserts into every CF, b the tree width, β the growth rate, and t the time after creating a cluster. A logarithmic function can be defined capturing the structure (as the height h of a balanced tree is defined by h = ⌈log_b(L)⌉ + 1) and the time dependency of a tree:

[Figure 5.1: logarithmic decay over time t for growth rates β = 1.0, 2.0, 3.0, 4.0.]

Figure 5.1: Logarithmic decay with µ = 1 and b = e.

$d_{log} = \mu \cdot \log_b(\beta \cdot (t + 1))$   (5.6)

Figure 5.1 illustrates different options for defining logarithmic decay functions. At t = 0, all functions represent the initial behavior of an emerging cluster with low values in both N and r. Thus, the decay is also small. Over time, the decay increases constantly with logarithmic behavior. The growth of the decay value is not bounded, as the functions grow to infinity, so that over time the decay increases beyond what a cluster can grow. Thus, old clusters are always removed. In order to limit the growth, we introduce the exponentially bounded decay next.

Exponentially Bounded Decay

As its name already indicates, the logarithmic decay function (see Equation 5.6) uses a logarithmic function as its base. The exponentially bounded decay, shown in Equation 5.7, relies on an exponential function, which is often used to model economic [261] and biological growth [262], as the function can be bounded in its maximum rate. For example, bacterial growth shows that the replication of organisms is exponential, but due to the bounded habitat, the death rate of organisms increases, so that replication and death stabilize at a plateau, as shown in Figure 5.2.

$d_{exp} = \mu \cdot \left(1 - b^{-\beta \cdot t}\right)$   (5.7)

Thus, we can define the maximal decay µ through Equation 5.5. Let b denote the branching factor, β the growth rate, and t the time. Figure 5.2 illustrates the behavior of an exponentially bounded function (Eq. 5.7) with different growth rates, all bounded by the same maximum of µ = 1. The behavior shows that the maximum is already reached after a few time steps, as the function is based on a fixed basis b and a changing exponent t. In the field of biological growth, there exists a solution providing a balanced increase in growth that is also bounded by a defined limit, called the logistic function, which is introduced next.

Logistic Decay

Logistic functions are widely used in different fields like bacterial growth curve prediction [263], stock market analysis [264], and fish population modeling

[Figure 5.2: exponentially bounded decay over time t for growth rates β = 0.5, 1.0, 2.0, 3.0.]

Figure 5.2: Exponentially bounded decay with µ = 1 and b = e.

[265]. In order to define the logistic decay, we use the generalized logistic function [266] seen in Equation 5.8. Let K describe the maximal threshold for the decay, which is again defined by µ through Equation 5.5, while the lower asymptote A is set to zero. Furthermore, let the parameters C = 1 and v = 1, which influence at which value the maximum slope is reached. Lastly, as Q defines the value of Y(0), we set $Q = -C + \frac{K-A}{y_0}$ so we can directly define y_0 = 1, as shown in Equation 5.9.

$Y(t) = A + \frac{K - A}{\left(C + Q \cdot e^{-\beta \cdot t}\right)^{1/v}}$   (5.8)

$Y(0) = \frac{\mu}{1 + \left(-1 + \frac{\mu}{y_0}\right) \cdot e^{-\beta \cdot 0}} = \frac{\mu}{\frac{\mu}{y_0}} = y_0$   (5.9)

$d_{logistic} = \frac{\mu}{1 + (\mu - 1) \cdot e^{-\beta \cdot t}}$   (5.10)

This results in the function displayed in Equation 5.10, which decays micro-clusters up to the average inserts µ, while also taking into account the time t aged after their creation. The rate of growth depends on the factor β, as shown in Figure 5.3. In the beginning, new emerging clusters have the chance to get established, as the decay value is small for low values of t. Later, when the decay increases due to the age of the cluster, the growth rate approaches the maximum decay rate of µ.
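The four decay functions can be sketched directly from Equations 5.5–5.7 and 5.10 (the function names are ours):

```python
import math

# Sketches of the four decay functions; mu is the maximal decay,
# b the branching factor / logarithm base, h the tree height,
# beta the growth rate, and t the time since cluster creation.
def threshold_decay(b, h):
    return 1.0 / b**h                              # Eq. 5.5: static

def log_decay(mu, b, beta, t):
    return mu * math.log(beta * (t + 1), b)        # Eq. 5.6: unbounded

def exp_decay(mu, b, beta, t):
    return mu * (1 - b**(-beta * t))               # Eq. 5.7: bounded by mu

def logistic_decay(mu, beta, t):
    return mu / (1 + (mu - 1) * math.exp(-beta * t))  # Eq. 5.10: bounded by mu
```

For example, with µ = 300 the logistic decay starts at y_0 = 1 for t = 0 and approaches the maximal decay µ for large t, matching the behavior described above.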

Decay Functions Overview

Table 5.1 presents a summary of the four defined decay functions in two categories:

1. Impact of the decay on the initial phase for emerging clusters
2. Impact of the decay on established clusters later in time

While the threshold-based decay function causes quick removal of newly built clusters due to the static behavior of the decay compared to the growing characteristic of emerging clusters, the logarithmic, exponential, and logistic decay functions give emerging clusters the chance to get established over time and not be removed quickly after creation. In contrast to the logarithmic and logistic functions, the exponential decay increases quickly; thus, emerging clusters have just little time to establish themselves. Of course, this behavior can be intentional, such that clusters

[Figure 5.3: logistic decay over time t for growth rates β = 0.1, 0.2, 0.5, 1.0.]

Figure 5.3: Logistic decay with µ = 300 and b = e.

Decay function                     Impact on initial phase    Impact on established
                                   for emerging clusters      clusters later in time
Decay based on static thresholds   High impact (−)            Limited decay (+)
Logarithmic decay                  Low impact (+)             Unlimited decay (−)
Exponentially bounded decay        Medium impact (o)          Limited decay (+)
Logistic decay                     Low impact (+)             Limited decay (+)

Table 5.1: Summary of decay function properties.

always are removed quickly to stay highly adaptable over time. But as we aim to model the normal behavior of a system over time, we also need to capture clusters over a longer time frame. Thus, logarithmic and logistic decay show a moderate impact on emerging clusters. Later in time, the behavior of the functions differs. While the threshold, logistic, and exponential decays are limited in growth, the logarithmic function does not bound the growth of the decay. Thus, older clusters will always be forced out eventually, even when the cluster is frequently updated. Again, this can be intentional in order to provide high adaptability to the current context, but it is not favorable for time-series-based anomaly detection, where older clusters might still influence the normal behavior of a system. All in all, the logistic decay provides the best options, as its impact is low in the initial phase and limited later in time. Thus, emerging clusters can become established, while the rate of decay is bounded and not so drastic as to automatically remove old clusters later. In contrast, clusters with frequent inserts are kept within the model.

Figure 5.4: 2D example of trained clusters and the principle of anomaly detection.

5.2 Anomaly Detection using Concept Adapting BIRCH

Concept Adapting BIRCH is applicable for modeling the normal behavior of a system. This can be achieved by assuming that most data points are normal, so that we can continuously train the model with the incoming data points. For anomaly detection, we swap the first two phases of Concept Adapting BIRCH: we first predict whether the data point belongs to any cluster and then insert the data point into the current CF-tree. If the data point is not contained within any cluster, it is considered an anomaly; otherwise, it is considered normal. This enables detecting anomalies at their beginning, but due to the concept adaptation, longer running anomalies will likely be considered normal, as they build new clusters within the CF-tree. The differentiation between normal and abnormal data points by the clustering result is illustrated in Figure 5.4. As we try to remediate an anomaly after detection, we assume that anomalies disappear after a while. We also assume that continuously appearing anomalies, which are not remediated, can be considered normal behavior, such that after adapting the model to the new concept, it is possible to discover further anomalies within the signal. Beyond deciding whether a data point is abnormal or normal, the approach can also quantify how large the difference to the normal behavior is for a non-fitting data point. For this error distance we define an error-to-normal metric ϕ (Equation 5.11), which calculates the Euclidean distance from a given data point x to the nearest cluster border point P_b.

$\phi = ||P_b - x||_2$   (5.11)

The border point is defined by $P_b = c + \frac{r}{||x - c||_2} \cdot (x - c)$, where c is the centroid and r the radius of a given cluster and x any point outside the cluster. When a data point is normal and the predicted point therefore lies within a cluster, ϕ is set to zero. The error-to-normal metric ϕ can be used for further analysis steps, like anomaly evolution prediction or root cause analysis, as it captures valuable information when specific dimensions have large prediction values and can indicate the concrete cause of an anomaly.
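Equation 5.11 and the border point P_b can be sketched for a single cluster given by centroid c and radius r (a minimal helper of our own; the thesis computes P_b for the nearest cluster in the CF-tree):

```python
import math

def error_to_normal(x, c, r):
    """Error-to-normal metric phi (Eq. 5.11): Euclidean distance from x
    to the cluster border; zero if x already lies inside the cluster."""
    dist = math.dist(x, c)
    if dist <= r:
        return 0.0
    # border point on the ray from the centroid towards x:
    # P_b = c + r / ||x - c|| * (x - c)
    p_b = [ci + r / dist * (xi - ci) for xi, ci in zip(x, c)]
    return math.dist(p_b, x)
```

Since P_b lies on the segment between centroid and x, the value reduces to ||x − c||₂ − r for points outside the cluster.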

5.2.1 Identity Function Threshold Model

The distance ϕ can be used to further differentiate nearby cluster points, considering them as normal even though they do not fit a cluster directly. To model such sensitive borders, we introduced the generalized concept of Identity Function Threshold Models (IFTM), published by us in [267]. It enables the modeling of complex normal behaviors for unsupervised online anomaly detection techniques which work with prediction errors like Equation 5.11. IFTM consists of two main stages in order to detect degraded state anomalies within a running system in an unsupervised manner:

1. The identity function (IF) provides a value representing the normality of the current monitoring data stream. In order to do so, the IF learns the reconstruction of the given data stream, without knowing future events. The IF should be capable of learning online and computing point-wise decisions for a given multivariate data stream.

2. The threshold model (TM) is applied to the output of the IF, distinguishing between normal and abnormal behavior by computing a threshold value. The threshold computed by the TM is based on historic IF outputs and should also be learned online with point-wise decisions.

Let the function $s : \mathbb{R}^n \to \mathbb{R}^n$, $n \in \mathbb{N}$, perfectly represent a given data stream, so that for a given data point $x \in \mathbb{R}^n$ the prediction is always $s(x) = x$. The goal of the identity function $\xi : \mathbb{R}^n \to \mathbb{R}^n$ is to approximate $s$ without knowing future data points, as the stream might be endless. Thus, let $g$ represent the approximation error function, $g(\delta, x) = \delta(s(x), \xi(x))$, where $\delta : \mathbb{R}^n \to \mathbb{R}^n$ represents an error measurement; the function $g$ aims to approach $\vec{0}$. Given a data point $x_t \in \mathbb{R}^n$, the identity function forecasts a value $x'_t = \xi(x_{t-1})$, where $x'_t \in \mathbb{R}^n$.
Based on the prediction, the reconstruction error ∆ (also known as prediction error) is computed using δ, as described above, which can be any kind of error function (also called distance or loss function). Different error functions like the Jaccard index, root mean squared error, or Euclidean distance can be applied depending on the considered identity function. Once the IF reconstruction errors are computed, a valuable threshold T has to be selected in order to differentiate normal and abnormal errors. We assume that the identity function's reconstruction error is low for normally behaving monitoring data, while for abnormal data, which might not be known yet, the reconstruction error is high. Thus, the IF precisely represents the normal behavior of the incoming data stream through continuous learning, as we assume that most of the available data is normal. We assume that the reconstruction error ∆ is normally distributed, as also assumed by [65–67]. Consequently, we define a threshold based on the mean µ(∆) and standard deviation σ(∆) of the complete history of values of ∆: T = µ(∆) + c · σ(∆). The parameter c reflects a sensitivity value, as the rate of false alarms can be approximated through T. Obtaining the mean and standard deviation iteratively over time is straightforward and can be computed in constant time. Figure 5.5 illustrates the phases of applying an IFTM approach. Based on the currently monitored data from the given data stream, the IF ξ reconstructs the given time series. The reconstruction error is computed based on the predicted and monitored values. Combined with the threshold, a binary decision is made as to whether the current data point is abnormal or not. The calculated reconstruction error is further used to optimize the IF as well as the threshold model. In both cases, the functions can adapt to the current context; thus, concept shifts can be captured as well as chronic problems.
This can be helpful, as concept shifts might happen frequently and the IF should be adapted accordingly. Furthermore, the threshold model can adapt its threshold when high errors occur for a long period of time (in the case of

[Figure 5.5: data flow of an IFTM model: the IF ξ predicts x'_t from the stream, the reconstruction error Δ_t = δ(x_t, x'_t) is compared against the threshold T for a binary decision (normal/anomaly), and both the IF and the threshold model are updated with the error.]

Figure 5.5: Input and output behavior of IFTM models.

chronic problems), where the anomaly impacts the system but does not cause a failure and remediations do not exist. For such cases, the anomaly detection is expected to adapt accordingly, but this is highly dependent on the monitored domain. In the case of IT-service monitoring, the anomaly detection should be aware of highly frequent changes of user behavior, but should also accept chronic problems when a lot of different remediation actions do not perform as desired. Through learning rate adaptations, both functions can be influenced in their rate of change. Thus, high rates adapt very drastically to the current data, and low rates do not influence the behavior at all anymore. Due to the concept of updating the model based on a single data point, the method can be applied as online learning. How the IFs update their internal models is highly dependent on the machine learning model used and is investigated in several of our papers [267–269].
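The threshold model described above, T = µ(∆) + c · σ(∆) with constant-time iterative updates, can be sketched using Welford's online algorithm for the running mean and standard deviation (our choice of update rule; the class name is hypothetical):

```python
import math

# Sketch of the IFTM threshold model: running mean and standard deviation
# of the reconstruction errors give the threshold T = mean + c * std,
# updated point-wise in constant time (Welford's online algorithm).
class ThresholdModel:
    def __init__(self, c=3.0):
        self.c = c          # sensitivity parameter
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0       # sum of squared deviations from the mean

    def update(self, delta):
        self.n += 1
        d = delta - self.mean
        self.mean += d / self.n
        self.m2 += d * (delta - self.mean)

    def threshold(self):
        std = math.sqrt(self.m2 / self.n) if self.n > 1 else 0.0
        return self.mean + self.c * std

    def is_anomaly(self, delta):
        anomalous = self.n > 1 and delta > self.threshold()
        self.update(delta)   # learn from every error, point-wise
        return anomalous
```

Updating on every error, including anomalous ones, is one way to realize the adaptation to chronic problems described above; suppressing updates during detected anomalies would instead keep the threshold strict.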

Embedding CABIRCH into IFTM

Incoming data points are predicted by CABIRCH to return the error-to-normal metric ϕ, which is considered as a reconstruction error. In order to use the defined error function, CABIRCH does not need to return ϕ directly, but the border point of the nearest cluster, such that applying the Euclidean distance as error function results in ϕ while fitting the IFTM model. As a threshold function, one could then use a fixed threshold of zero. This would imply the same approach of testing whether a data point fits into any cluster (ϕ ≤ 0) or not (ϕ > 0), i.e. whether a point is abnormal or not. Equation 5.11 provides a metric describing the Euclidean distance to the normal behavior of a component of interest. In order to provide more precise information on single dimensions, Equation 5.11 can also be applied to the individual dimension values within the vector, resulting in distances for individual metrics. Both the overall distance and the metric-specific distances can then be forwarded to analysis steps like RCA and the remediation component. These analysis steps can use them as additional information, as these distances quantify the deviation from expected normal behaviors, but they are out of scope for this thesis and referred to as future work.

5.3 Evaluation

The evaluation aims to show, at first, the impact and suggested configuration options for CABIRCH, and then provides results for anomaly detection on the basis of the cloud monitoring dataset as described in Section 7.1.

5.3.1 Influence of Decay Rate Selection

Selecting a valuable decay rate parameter β is crucial for providing an accurate model of normal behavior. It highly depends on the reflected normal behavior. Consequently, a sequence of 20 min of initial normal load from each monitored component is evaluated, on which CABIRCH is continuously trained. Table 5.2 presents the hyperparameter configuration for CABIRCH. The decay rate β is investigated in the range between 0.0 and 0.08, where 0 corresponds to no decay being applied (and therefore to plain BIRCH).

initial threshold: 0
maximum number of nodes: 20
logistic function decay, max decay: 1
logistic function decay, β: [0, 0.08], step size 1·10⁻³

Table 5.2: Parameter configurations for CABIRCH decay rate evaluations.
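The sweep behind Table 5.2 can be sketched as follows. `run_cabirch` is a hypothetical stand-in for the actual CABIRCH implementation; here it is assumed to return the number of points that fit no micro-cluster:

```python
# Sketch of the decay-rate sweep of Table 5.2: each beta in [0, 0.08]
# (step 1e-3) is evaluated on an initial normal-load sequence, recording
# the false alarm rate, i.e. the fraction of points fitting no cluster.
def sweep_beta(data, run_cabirch, step=1e-3, upper=0.08):
    results = {}
    n_steps = int(round(upper / step)) + 1
    for k in range(n_steps):
        beta = k * step  # avoids floating-point drift of repeated addition
        false_alarms = run_cabirch(data, beta=beta, max_nodes=20, max_decay=1.0)
        results[round(beta, 3)] = false_alarms / len(data)
    return results
```

The result maps each tested β to its false alarm rate, which is exactly the relationship plotted in Figure 5.6.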

Figure 5.6 illustrates different β values compared to the average size of micro-clusters within a CF-tree and the false alarm rate. The false alarm rate reflects the percentage of data points which do not fit any cluster, compared to all given data points. BIRCH (β = 0) achieves a micro-cluster size of 0.43, while the false alarm rate is 0.8%. Increasing β values cause the constant shrinking of clusters, while the false alarm rate rises. This is expected, as a smaller coverage of the space causes a higher chance that following data points do not fit any cluster due to the smaller radii. At β = 0.077, the false alarm rate rises to 100% due to a cluster size of close to zero. At this point, the decay causes the forgetting of micro-clusters at such a high rate that the CF-tree is unstable and clusters are not able to build up. Any further increase of β yields the same result. At which point this change in stability happens depends on the maximal decay value set and the fluctuation of the captured signal. For example, when increasing the maximal decay value in this experiment, the β value has to be decreased in order to produce the same curve characteristics, as the decay would otherwise be too drastic. At β = 0.076, the smallest cluster size (0.11) is achieved before the instability occurs. In contrast, the false alarm rate rises until that point to a value of 25.4%. The relationship between cluster sizes and false alarm rate is presented in Figure 5.7. The illustration reflects not a linear but rather an exponential increase of false alarms with decreasing cluster sizes. When decreasing the cluster sizes by more than half (from 0.43 to 0.21), the false alarm rate increases by less than 3.6%.
In order to further illustrate the impact on the clustering result, Figures 5.8 and 5.9 present a time-based side-by-side comparison between BIRCH and CABIRCH applied to a weather observation dataset from the CityPulse EU FP7 Project [270, 271], representing concept-shifting levels of carbon monoxide and ozone through the year. Overall, these figures show the expected behavior of providing smaller-sized clusters which are adapted to the data stream (having nearer distances to the clusters). As the clusters of BIRCH constantly grow until they cover large areas of the space, its false alarm rate decreases over time. But as the clusters of BIRCH remain at similar positions later in time, they can never cover the complete space, since the n-spheres are not allowed to overlap. Thus, gaps between the n-spheres exist and cannot cover newly drifted data points, due to the inflexibility of adaption later in time. This behavior is shown through selected representations of BIRCH and CABIRCH over time in Figures 5.8 and 5.9. It is especially visible in Figure 5.9, where subfigure (c) shows BIRCH clusters detached from the data, while subfigure (d) covers the current data distribution better with smaller clusters.

[Plot: average micro-cluster radius and false alarm rate (y-axis, 0-1) over β (x-axis, 0 to 0.08).]

Figure 5.6: CABIRCH applied on cloud monitoring data under normal load situations and different β values.

[Plot: false alarm rate (y-axis, 0-1) over average micro-cluster radius (x-axis, 0-0.4).]

Figure 5.7: Comparison between false alarm rate and average micro-cluster size.

(a) BIRCH (b) CABIRCH

(c) BIRCH (d) CABIRCH

(e) BIRCH (f) CABIRCH

Figure 5.8: Illustration of BIRCH and CABIRCH micro-clusters (blue) and the latest 20 data points (green) over time for the dimensions carbon monoxide and ozone of the pollution dataset. (a) and (b) at time t = 100; (c) and (d) at time t = 1,000; (e) and (f) at time t = 2,000.

(a) BIRCH (b) CABIRCH

(c) BIRCH (d) CABIRCH

Figure 5.9: Illustration of BIRCH and CABIRCH micro-clusters (blue) and the latest 20 data points (green) over time for the dimensions carbon monoxide and ozone of the pollution dataset. (a) and (b) at time t = 10,000; (c) and (d) at time t = 100,000.

[Bar chart: false alarm rate (y-axis, 0-0.25) for the threshold configurations 0, µ, µ+0.5σ, µ+1.0σ, µ+1.5σ, µ+2.0σ, µ+2.5σ, µ+3.0σ.]

Figure 5.10: Relationship between the false alarm rate and the configuration of the threshold model.

5.3.2 CABIRCH-based Anomaly Detection

The false alarm rate is decreased by applying the proposed IFTM procedure due to the dynamic threshold model. The threshold model provides a captured boundary between the border of the n-spherical micro-clusters and the threshold up to which a point is considered as normal. Figure 5.10 presents the impact of applying a Gaussian distribution as threshold model with different parameters for the factor of σ when configuring CABIRCH with β = 0.07 and the remaining hyperparameters as described in Table 5.2. The first bar shows the result of the standard configuration of CABIRCH without making use of the dynamic threshold. This configuration yields a false alarm rate of 23.5%. The following bars present a constant decrease of the false alarm rate, as the threshold model increasingly incorporates the fluctuation of the prediction errors of misclassified instances. The second bar (µ) does not incorporate the fluctuations, but only the running average of the prediction errors; the false alarm rate thereby drops by a total of 6.8%. Applying a factor of 2.0 to σ forces the false alarm rate below 1%. The selected factor for σ reflects the sensitivity, as increasing the boundary also risks that potential anomalies are no longer recognized. A configuration with respect to both abnormal and normal data would of course be preferable, but is out of scope for unsupervised applicability. Thus, we consider a factor of 2.0 for the following experiments.

Figures 5.11-5.14 present the prediction error behavior when applying IFTM-based CABIRCH to a continuously simulated data stream. At first, the normal signal is simulated for 10,000 data points, followed by an anomaly pattern (blue signal) consisting of 200 data points, as presented on the x-axis. The anomaly patterns follow the characteristics of real-world anomaly patterns for cloud services as described in Section 3.1.
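The Gaussian threshold model discussed above can be sketched with a running mean and standard deviation of the reconstruction error (here via Welford's algorithm; the class name and interface are ours, not the thesis implementation):

```python
import math

class GaussianThreshold:
    """Sketch of the dynamic threshold model: keep a running mean and
    standard deviation of the reconstruction error and flag a point as
    abnormal when its error exceeds mu + k * sigma. k = 2.0 is the
    sensitivity factor selected in the text."""

    def __init__(self, k=2.0):
        self.k, self.n, self.mean, self.m2 = k, 0, 0.0, 0.0

    def update(self, error):
        # Welford's online update of mean and (population) variance
        self.n += 1
        delta = error - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (error - self.mean)

    def is_abnormal(self, error):
        sigma = math.sqrt(self.m2 / self.n) if self.n > 1 else 0.0
        return error > self.mean + self.k * sigma

model = GaussianThreshold(k=2.0)
for e in [0.1, 0.12, 0.09, 0.11, 0.1, 0.13]:
    model.update(e)
print(model.is_abnormal(0.5))   # far above mu + 2*sigma -> True
print(model.is_abnormal(0.11))  # within normal fluctuation -> False
```

Setting k = 0 reproduces the µ-only configuration of the second bar in Figure 5.10.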
Error A (red signal) represents point-wise continuous learning of CABIRCH, while error B (gray signal) shows the prediction error when applying the latest prediction model trained on the first 10,000 data points. Anomalies are expected to be detected when the errors reflect increases in their values. Figure 5.11 presents in the upper left diagram a rapid jump of the signal. The corresponding error A and error B are presented below, showing that error A detects the point of change, while error B captures the complete series after the anomaly pattern has started. The same behavior of the errors is shown and expected for rapid drops, as presented on the right side. Error A is capable of change point detection and adapts quickly to the anomaly state. Triggering remediation on those change points with high errors relies on the consistently correct detection of such events. Error B captures the complete sequence of the anomaly state but does not adapt to the signal when it captures a chronic, normal state.

[Plots: signal (top), error A (middle), and error B (bottom) over time 0-200 for a rapid jump (left) and a rapid drop (right).]

Figure 5.11: Prediction error behavior of CABIRCH for rapid changes with static increased or decreased anomaly pattern.

Leakage anomalies are represented by Figure 5.12. The signal successively increases (left) or decreases (right) over time. As the signal continuously changes, error A shows higher error values for the phases of increase and decrease, while adapting quickly again when the signal follows a static behavior. Aggregating high spikes in a sliding window would provide the chance to capture those phases of highly fluctuating errors. In the case of error B, the error values increase in intensity as the monitored signal moves away from the trained normal scenario. Error B therefore also represents a degraded phase, capturing the degree of anomaly intensity, which might be beneficial for the remediation engine to select appropriate actions based on the severity. In the case of a fluctuation change within the signal (see Figure 5.13), we consider an increase of fluctuation around a base signal (left) and correspondingly a decrease of fluctuation (right). Error A reflects the change point when the fluctuation pattern changes with increased error intensities. After the pattern of the signal has changed, error A adapts to the signal and the error value decreases.
Likewise to the rapid change of the anomaly signal, error A can be used for change point detection in fluctuation changes. Error B again shows a drastic increase in error intensity when the anomaly pattern starts to change. Figure 5.14 presents a sine signal including two anomaly patterns in order to describe context-based anomalies within a seasonal behavior. Error A reflects both change points, when the anomaly starts and when it ends, while error B provides high error values for the complete time frame of the anomaly. Furthermore, the intensities of error A and error B provide insights into the degree of anomaly intensity. Based on these findings, the anomaly detection is carried out in a tumbling window approach whenever the whole sequence of an anomaly is to be captured: anomaly detection is applied using a pretrained model from the previous tumbling window, while continuous training is performed for the next window. In cases where anomaly detection is expected to provide change points, we recommend applying CABIRCH without tumbling window adaption.

[Plots: signal, error A, and error B over time 0-200 for the successively increasing (left) and decreasing (right) anomaly patterns.]

Figure 5.12: Prediction error behavior of CABIRCH for successively increasing and decreasing anomaly patterns.

[Plots: signal, error A, and error B over time 0-200 for increasing (left) and decreasing (right) fluctuation anomaly patterns.]

Figure 5.13: Prediction error behavior of CABIRCH for changing fluctuations as anomaly pattern.

[Plots: signal, error A, and error B over time 0-200 for a sine signal containing two anomaly patterns.]

Figure 5.14: Prediction error behavior of CABIRCH for anomaly patterns within a sine wave.

Algorithm        TP-rate   TN-rate   FP-rate   FN-rate   AUC
BIRCH-10fold     78.90%    78.39%    21.61%    21.10%    78.65%
CABIRCH          54.27%    96.69%     3.31%    45.73%    75.48%

Table 5.3: Results of CABIRCH compared with semi-supervised learning results of BIRCH.
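The tumbling-window scheme recommended above (predict with the model frozen from the previous window while a copy keeps training on the current one) can be sketched as follows. `make_model` is a hypothetical factory for an IFTM model with `train` and `is_abnormal` methods:

```python
import copy

def detect_stream(stream, make_model, window=200):
    """Sketch of tumbling-window anomaly detection: the model frozen at the
    end of the previous window is used for prediction, while a copy
    continues training on the current window."""
    training = make_model()
    frozen = None          # model pretrained in the previous tumbling window
    labels = []
    for i, x in enumerate(stream):
        # no prediction model exists before the first window completes
        labels.append(frozen.is_abnormal(x) if frozen is not None else False)
        training.train(x)  # continuous training for the next window
        if (i + 1) % window == 0:
            frozen = copy.deepcopy(training)  # roll the tumbling window
    return labels
```

Because the frozen model is not updated mid-window, it behaves like error B above: it keeps flagging the whole abnormal sequence instead of adapting to it.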

Anomaly Detection Applied to the Monitored Cloud Environment

We applied grid search to CABIRCH in order to determine optimal results with respect to the AUC value for anomaly detection on the cloud monitoring dataset (see Section 7.1). The search space for CABIRCH is described in Appendix B. We applied a step size of 1·10⁻³ for the decay function, 1·10⁻² for the maximum decay value, and [100, 200, ..., 1000] for the tumbling window size. Table 5.3 shows that CABIRCH reaches more than 96% in the TN-rate. This describes a precise coverage of the normal behavior. On the other hand, the TP-rate is small (54.27%), predicting only roughly half of all abnormal data points correctly. Compared to the results of BIRCH-10fold in the semi-supervised evaluation, the results show a drop of 3.17% in AUC value, which is expected due to the absence of complete knowledge of the normal behavior (CABIRCH is executed and trained online).

Event-based detection rate [%]:

CPU stress: 100.00
Increasing CPU: 100.00
CPU fluctuations: 100.00
Memory leak: 100.00
Bandwidth stress: 100.00
Increasing fork flooding: 100.00
Fork flooding fluctuations: 100.00
Memory fluctuations: 100.00
Memory stress: 100.00
Disk pollution: 100.00
Disk pollution tmp: 100.00
Disk stress: 100.00
File pointer wasting: 51.04
Overall anomaly: 96.23
Normal: 96.69

Figure 5.15: Event-based detection rates for the individual anomalies and normal data points.

As point-wise predictions are a hard indicator of accurate results, this evaluation metric does not represent the needs of an anomaly detection algorithm for self-healing purposes in production environments. We therefore investigate the event-based evaluation metrics in more detail. Due to the rapid adaption behavior of CABIRCH to the signal, tumbling window sizes would have to represent the length of the abnormal sequences perfectly in order to reach high point-wise results, which is difficult in production. For each anomaly event, metrics are collected showing the rates of detected anomaly events, and the detection times are compared to the start of the anomaly. Figure 5.15 presents the percentages of detected events for the different types of anomalies (gray) and the overall anomaly detection rate (black) for CABIRCH. Except for the anomaly type file pointer wasting, all events were successfully detected, resulting in a 96.23% TP-event rate. In white, the normal detection rate is presented, which is point-wise and indicates a false alarm rate of 3.31% within the normally behaving phases. In addition, the time difference from the start of an anomaly until it is actually detected by CABIRCH is represented in Figure 5.16. The average length of a false alarm is represented in white, which is less than 4 seconds. Again, the different anomaly types are represented in gray and are sorted based on Figure 5.15. Figure 5.16 shows that there are large differences in when an anomaly is detected by the given algorithm. Successively increasing anomalies show larger detection times than rapidly changing or fluctuating anomalies. This behavior can be explained by the adaption of CABIRCH. Furthermore, Figure 5.17 shows the dependency between the average detection times for the different anomaly types and their standard deviations.
The plot shows a correlation of 0.893 (Pearson's correlation coefficient), which indicates a high linear dependency between detection time and standard deviation. This means there is no large spread in detection time for short average detection times, while for larger average detection times the fluctuation increases.

Avg. detection time [s]:

CPU stress: 0.01
Increasing CPU: 15.85
CPU fluctuations: 0.77
Memory leak: 17.54
Bandwidth stress: 9.39
Increasing fork flooding: 31.80
Fork flooding fluctuations: 2.60
Memory fluctuations: 1.52
Memory stress: 6.72
Disk pollution: 55.28
Disk pollution tmp: 58.32
Disk stress: 26.45
File pointer wasting: 41.23
Normal: 3.99

Figure 5.16: Event-based average detection times for the individual anomalies.

Based on the finding that such differences in detection time occur for successively increasing anomalies, the results indicate that the intensity of such anomaly types is relevant. Thus, non-intensive successive changes cannot be recognized. The results show the applicability of CABIRCH for event-based anomaly detection with a 96.23% TP-event rate and 3.31% false alarms. While 96.23% of the anomaly events are detected, the point-wise TP-rate and the detection times per anomaly type show the limitations of this approach. Successively changing anomalies are detected late but accurately, causing a small point-wise TP-rate. In the worst case, this might result in fewer options for risk-aware remediation action selection in productive environments due to late detection times. For the anomaly type of file pointer wasting, there exists further room for improvement. The next chapter continues the automation of applying AI-based models by investigating the cold start problem of providing a valuable machine learning model from the beginning. For zero-touch administration, automated optimization of hyperparameters is crucial, as administrators are usually not AI experts; this is investigated in detail in the following chapter.

[Scatter plot: standard deviation of detection times (y-axis, 0-80 s) over average detection time (x-axis, 0-60 s).]

Figure 5.17: Comparison of average detection times and standard deviation of detection times.

Chapter 6

Cold Start-Aware Identity Function Threshold Models

This chapter focuses on the phase of initializing an IFTM-based anomaly detection model. The initial phase is referred to as the cold start problem, capturing the hyperparameter configuration and initial learning of machine learning models. While determining the best hyperparameters is crucial for achieving high anomaly detection accuracy, it is difficult to find them beforehand in a productive system. This is due to the problem of collecting valuable data which are not outdated and can be used for building robust models in dynamic environments. In general, there exist two key approaches to mitigate the cold start problem:

1. Selecting an already pretrained model (or generalized parts of pretrained models) from e.g. a similar, already deployed service, which is widely investigated in the context of robust models and transfer learning.

2. Optimizing the hyperparameters, if there does not exist a valuable pretrained model in the first place (see Section 6.2).

As shown in Algorithm 2, the suggested approach first checks the existence of a stored model for the particular component. If such a model exists, it is applied; otherwise, it is checked whether a similar model is available for usage. In case that neither an existing model of the component nor a similar model is available, we propose a hyperparameter optimization methodology.

Algorithm 2 Cold start-aware approach for IFTM models
Input: component: component to be initiated
Output: model: IFTM model for anomaly detection
1: if trained model exists for starting component then
2:     return model ← load existing model
3: else if trained model exists for similar component then
4:     models ← load similar models
5:     return model ← select best similar model out of loaded models
6: else
7:     return model ← initiate model through hyperparameter optimization

The definition of the term similar component refers to the consideration of models which are either robust, and therefore independent of any monitored component, or applicable to transfer learning [272-274]. In the absence of any model, we still have to consider training an ML model from scratch, where the ML model has to be configured.

Hyperparameter configuration changes the functionality of the given machine learning algorithm. Determining beneficial hyperparameters for the aimed problem domain is crucial, as it influences the method's quality of results. Even when applying the algorithm to a new component (this might be the change to another monitored service component), such hyperparameters must often be re-tuned [275]. Non-experts in machine learning require off-the-shelf solutions to configure hyperparameters, as the degrees of freedom in choosing them are too high. Therefore, automatic approaches are required to evaluate and configure the machine learning algorithms without the need for detailed knowledge about the exact machine learning algorithms. The main problem for hyperparameter tuning is the size of the search space. Let p1, p2, p3 be hyperparameters, for each of which one of the three values {a, b, c} can be chosen. Consequently, for this setup there exist 3³ = 27 possibilities. In general, there is an exponential coherence nᵖ for a given number of parameters p and number of possible values n, assuming that all parameters have the same number of possible values. It gets even more problematic if there is not a finite number of values to choose from for a parameter, but e.g. an interval containing an infinite number of possible values, like choosing a real number between 0 and 1. Often, interval ranges and step widths are defined in order to discretize such parameters, allowing a wide range of search approaches like grid search to operate, but in general this is not mandatory. Bergstra and Bengio [201] show that random search over non-discretized features is more valuable, as through discretization the global optimum might not be reachable.
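The combinatorics above can be made concrete with a minimal grid-search enumeration (an illustrative helper, not the thesis implementation):

```python
from itertools import product

def grid(search_space):
    """Enumerate every configuration of a discretized search space.

    search_space: dict mapping parameter name -> list of candidate values.
    With p parameters of n values each, this yields n**p configurations,
    which is exactly the exponential growth discussed in the text.
    """
    names = list(search_space)
    for combo in product(*(search_space[n] for n in names)):
        yield dict(zip(names, combo))

space = {"p1": ["a", "b", "c"], "p2": ["a", "b", "c"], "p3": ["a", "b", "c"]}
configs = list(grid(space))
print(len(configs))  # 3**3 = 27
```

Adding a fourth parameter with three values already triples the count to 81, which illustrates why exhaustive search quickly becomes infeasible.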
Due to the exponential complexity of enumerating all possible parameter combinations, there exist different methods beyond exhaustive search to determine the best parameters. Grid search, for example, is a form of exhaustive search where the possible values are discretized and the range of values is finite. As grid search performs a complete search on this set of defined discretized values, the optimum with respect to the given values can be found. However, the user has to shrink the set of values considerably to make the search feasible, as the complexity is still exponential. Random selection of parameters and exhaustive search have a very positive property: it is possible to determine the global optimum, but this comes at the cost of time complexity [201]. When relaxing the problem to finding local optima, not guaranteeing to find the global optimum, there exist several computation-efficient strategies. We like to highlight that applying a local optimum is still better than randomly selecting values. Random search, for example, provides the best option of hyperparameters by testing multiple randomly initialized hyperparameter settings. This mechanism does not even ensure finding a local optimum, but it increases the probability of finding, on average, a better performing set of parameters than simply applying a single randomly initiated model. In order to find local optima, more sophisticated techniques need to be utilized, which guide e.g. the random search more intelligently. In the following, we concentrate on defining an automated hyperparameter optimization technique for IFTM models. The proposed hyperparameter optimization approach is based on genetic programming [84] (see Section 2.3.7), introducing a novel scoring function of ours.

6.1 Integration of Hyperparameter Optimization into IFTM Framework

The proposed hyperparameter optimization approach should be applicable to the online scenario, as the anomaly detection is applied in the same way. We therefore propose the integration of a tumbling window to define the number of steps until a new population is bred. In more detail, a tumbling window represents a set of collected data points. When the set reaches the sufficient size, the window is flushed and collects new data points again. This behavior is illustrated in Figure 6.1, representing the difference to sliding windows. While the sliding window removes the oldest element from the queue for each incoming data point, the tumbling window removes the complete window and starts from scratch, which is the aimed approach for determining the points in time at which to create the new breed of the population.

[Diagram: sliding window vs. tumbling window.]

Figure 6.1: Representation of sliding window and tumbling window concepts.

[Diagram: population 1 trains and predicts in tumbling window 0; in window 1 it predicts while population 2 trains; in window 2 population 2 predicts while population 3 trains.]

Figure 6.2: Applying training and prediction models to the data stream.

Figure 6.2 depicts the different phases in time when populations are point-wise trained and predicted. At first, the models have to be initialized and trained iteratively in tumbling window 0, as well as used for point-wise prediction. After reaching the full size of the tumbling window, the trained models are further used for online prediction but are not trained anymore. A new breeding population is trained on the given data stream. Thus, training and prediction are performed in parallel on the same data, but with different models: the models used for prediction are based on the previous tumbling window phase, while the training is performed on the latest models. Prediction results are based on a population of the latest IFTM models. This population functions as an ensembler, providing individual predictions, which are further aggregated to an overall result. Applying multiple ML models as ensemblers is a well-established solution to increase the robustness of the models and lower the number of false alarms [276-279]. For this work, we introduce an ensembler mode for IFTM models as shown in

Figure 6.3. Given the set of n IFTM models S = (m1, m2, ..., mn), we also have the information about the fitness score for each such model. Thus, we define the aggregated prediction based on the weighted sum of the fitness scores of the models. Depending on its fitness score compared to the other models, the impact of each IFTM model within a single population may vary, providing a different influence on the overall result. The binary decisions of the single models are aggregated, and a point is predicted as abnormal when more than 50% of the fitness-weighted votes of the models report it as such. This decision boundary is configurable by the user and functions as a sensitivity parameter.
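A minimal sketch of this fitness-weighted majority vote, assuming binary per-model decisions and per-model fitness scores (function name and interface are ours):

```python
def ensemble_vote(votes, fitness, boundary=0.5):
    """Fitness-weighted ensemble decision.

    votes:   binary per-model decisions (True = abnormal)
    fitness: corresponding fitness scores used as vote weights
    A point is reported as abnormal when the fitness-weighted share of
    abnormal votes exceeds the (user-configurable) decision boundary.
    """
    total = sum(fitness)
    if total == 0:
        return False  # no informative model available
    weighted = sum(f for v, f in zip(votes, fitness) if v)
    return weighted / total > boundary

# three models: the two fittest say "normal", a weak one says "abnormal"
print(ensemble_vote([False, False, True], [0.9, 0.8, 0.3]))  # False
print(ensemble_vote([True, True, False], [0.9, 0.8, 0.3]))   # True
```

Lowering `boundary` reproduces the sensitivity behavior described next: fewer (or weaker) abnormal votes then suffice to raise an alarm.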

[Diagram: IFTM1 to IFTMn each predict on the incoming time series; their individual results are merged into an aggregated result (legend: anomaly / normal).]

Figure 6.3: Ensemble provides n prediction results, which have to be merged into an aggregated overall prediction result.

Choosing a lower decision boundary results in higher sensitivity, as already a smaller fitness-weighted subset of IFTM models reporting abnormal behavior triggers an alarm.

6.2 Automated Hyperparameter Optimization

6.2.1 Initialization, Crossover, Mutation, Termination

Genetic optimization performs a fixed set of methods, which are defined next.

Randomized Initialization and Mutation

At first, an initial population of models is created by randomly choosing hyperparameters (see Appendix B). The random initialization step is performed once, at the beginning, when creating a novel population of models. Afterward, the crossover phase is applied based on the available population. The crossover includes a mutation phase, where the randomized selection process is applied to individual hyperparameters when they are selected for mutation. The mutation takes place with a user-defined probability. New parameter values thereby get the chance to be tested which may not be available from the crossover or initialization phase.

Parent selection and Crossover

The parent selection as well as the crossover phase are conducted in order to create the new generation of the population (as shown in Figure 6.4). At first, a subpopulation is selected from the current population, from which the new population for the next iteration is created. The creation is implemented by selecting the A models with the highest fitness score in order to build the new population from the fittest among the population. Furthermore, another B models are selected randomly from the rest of the population in order to provide more diversity of genes (differences in hyperparameters) to the new population. Both of these sets are combined and copied to the new population, but are also used as the parent set from which the crossover is performed in order to generate the rest of the models. For each model missing to reach the initial population size again, a new model is created on the basis of the parent set. For creating a single new model, two parent models are selected with uniform probability from the parent set. Based on these two selected parent models, for every single hyperparameter the parent is chosen randomly (with equal probability) and its parameter value is copied. Thus, the resulting new model consists of hyperparameters from both parents.

[Diagram: the A highest-scoring models and B randomly selected models form the parent set, which is copied to the population at time t+1 and used for crossover to fill the remaining slots.]

Figure 6.4: Illustration of the crossover from a given population to the next population.
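The selection, crossover, and mutation steps above can be sketched as follows; the dict-of-hyperparameters representation and the `mutate` helper (drawing a fresh random value for a parameter) are our assumptions:

```python
import random

def next_generation(population, fitness, a=2, b=1, mutation_prob=0.1, mutate=None):
    """Sketch of one breeding step: keep the `a` fittest models plus `b`
    randomly chosen others as parents, carry them over unchanged, then fill
    the population back up via uniform crossover of two parents, mutating
    individual hyperparameters with probability `mutation_prob`."""
    ranked = sorted(range(len(population)), key=lambda i: fitness[i], reverse=True)
    parents = [population[i] for i in ranked[:a]]          # fittest A models
    rest = [population[i] for i in ranked[a:]]
    parents += random.sample(rest, min(b, len(rest)))      # B random for diversity
    new_pop = [dict(p) for p in parents]                   # copied over unchanged
    while len(new_pop) < len(population):
        p1, p2 = random.sample(parents, 2)                 # uniform parent choice
        child = {k: random.choice((p1[k], p2[k])) for k in p1}  # uniform crossover
        for k in child:                                    # mutation phase
            if mutate is not None and random.random() < mutation_prob:
                child[k] = mutate(k)
        new_pop.append(child)
    return new_pop
```

Repeatedly calling this with the fitness scores from the latest tumbling window yields the population sequence sketched in Figure 6.2.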

Termination criteria

As defined by Algorithm 1 in step 2, we have to define the termination of the optimization. In general, it is possible to never let the optimization end, but at some point the technique approaches a local optimum. The local optimum is detected when the average fitness score does not change significantly anymore. We consider the optimization terminated when there is no improvement at all in the average fitness score compared to the previous tumbling window.

6.2.2 Fitness Function Definition

The fitness function is a crucial component of the optimization algorithm, as it defines the optimization criterion on which the hyperparameters are optimized. When this defined criterion is not representative, i.e. when it does not correlate with the aimed high accuracy, the optimized set of hyperparameters will not perform appropriately either. A valid candidate is the accuracy on a validation dataset, but this assumes labeled data for normal and abnormal cases. For providing an automated optimization process, the fitness function cannot rely on supervised knowledge and should consequently function under the same constraints as IFTM approaches. These constraints capture the assumption that most data is normal, which provides the option to define the scoring function based on the false alarm rate. When optimizing this criterion alone, the problem of overestimating the threshold arises: the model then detects all data points as normal due to a too large threshold value, as the model is optimized towards providing a low false alarm rate. Further parameters should therefore be investigated to estimate a well-fitting model. Figure 6.5 illustrates three different models which differ in their threshold computation, considering that the IF predicts the same reconstruction error (indicated by dots). Threshold 1 does not produce any false alarms but has a large distance to the errors, which could be problematic, as anomalies producing higher errors may still remain beneath the threshold and will therefore be misclassified. Threshold 3 is much smaller but produces a high number of false alarms. Threshold 2 instead produces no false alarms but captures normal instances precisely. Thus, small deviations above the threshold would trigger an anomaly, which is the aimed goal.

[Plot: reconstruction errors over time with three candidate thresholds of different tightness.]

Figure 6.5: Examples of different performing IFTM models.

[Plot: error and threshold over time, annotated with the indicators f, Δ, T and |T − Δ|.]

Figure 6.6: Illustration of different indicators for selecting good IFTM models. The dots in the gray area are stated as abnormal data points, consequently increasing the false alarm rate when it is expected that all such data points are normal.

The balance between fluctuations of the reconstruction errors, which are produced by normal data points, and the distance to the threshold has to be optimized. Furthermore, the identity function should be able to provide an accurate prediction for the normal data, and therefore low errors are expected. Based on these properties, we define the compactness of an IFTM model by a fitness score s. For normally behaving signals, we expect that a suited identity function can reconstruct the signal almost perfectly. Consequently, we expect a low reconstruction error Δ. When also given a low threshold T, the model can react to changes and report those as anomalies. The threshold T is thus also expected to be represented by a small value. Accordingly, the distance between the reconstruction error and the threshold value should be small in order to compactly represent the given normal data. The false alarm rate is incorporated and functions as an antagonist: a rise of false alarms is expected, as small spikes of the reconstruction error occur when having a small threshold. The false alarm rate f is aimed to penalize too high thresholds. Figure 6.6 presents these four different aspects f, Δ, T and |T − Δ| with respect to a given IFTM output. The latest n data points from the last tumbling window are used to calculate the fitness score s as defined by:


Figure 6.7: Illustration of the tanh function on a positive interval.

s = 1 − ( w1 · (1/n) Σ_{i=1}^{n} f_i + w2 · tanh( (1/n) Σ_{i=1}^{n} ∆_i )
        + w3 · tanh( (1/n) Σ_{i=1}^{n} T_i ) + w4 · tanh( (1/n) Σ_{i=1}^{n} |T_i − ∆_i| ) )    (6.1)

The weights w1, …, w4 ∈ [0, 1] define the importance of the single terms. Thus, a user-defined configuration can be applied, reflecting the sensitivity of the IFTM model: sensitive configurations quickly detect anomalies but may raise a higher number of false alarms, while with a non-sensitive configuration anomalies might not be recognized.

We recommend configuring these terms as w1 = w2 + w3 + w4 in order to balance false alarms against a compact fit of the IFTM model to the data stream behavior. As IFTM models represent the normal behavior, the reconstruction error should approach zero (2nd term, w2 · tanh((1/n) Σ ∆_i)), as this means that the identity function almost perfectly captures the normal state. Furthermore, the threshold should also be near zero (3rd term, w3 · tanh((1/n) Σ T_i)). This causes a small distance between error and threshold (4th term, w4 · tanh((1/n) Σ |T_i − ∆_i|)), such that the model is highly representative, but may suffer from a high number of false alarms, as the error easily exceeds the threshold. Thus, we also require a low false alarm rate as antagonist for normal data, in the best case also approaching zero (1st term, w1 · (1/n) Σ f_i). The function tanh, as shown in Figure 6.7, is applied to normalize the distances to [0, 1], as they shall approach zero. The fitness score s ∈ [0, 1] is maximized by the genetic optimization.
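As an illustration, the fitness score of Equation 6.1 can be computed in a few lines. This is a minimal sketch; the function name and the array-based interface are our own, not part of the thesis implementation.

```python
import numpy as np

def fitness_score(f, delta, T, w=(0.5, 0.125, 0.125, 0.25)):
    """Fitness score s of Equation 6.1 over the last tumbling window.

    f     -- per-point false-alarm indicators (0/1)
    delta -- per-point reconstruction errors
    T     -- per-point threshold values
    w     -- weights w1..w4 (defaults follow Table 6.2)
    """
    f, delta, T = map(np.asarray, (f, delta, T))
    w1, w2, w3, w4 = w
    s = 1.0 - (w1 * f.mean()                              # false alarm rate
               + w2 * np.tanh(delta.mean())               # low errors
               + w3 * np.tanh(T.mean())                   # low threshold
               + w4 * np.tanh(np.abs(T - delta).mean()))  # compactness
    return float(s)
```

A compact model (small errors, tight threshold, no false alarms) scores higher than one with an over-estimated threshold, matching the intuition of Figure 6.5.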

6.3 Evaluation

Fitness Function Evaluation The applicability of the defined fitness function (see Equation 6.1) is shown by conducting an experiment based on the cloud monitoring data described in Section 7.1. The dataset consists of 952 quantums, which are tested individually and produce their own results. For evaluation purposes, each quantum is split into separate parts: the first 2:30 min of data were used as training set, the following 30 s of normal data as validation set, and the last 2 min (1 min normal and 1 min abnormal data) as test data, as illustrated in

train: 2:30 min (normal) | validate: 0:30 min (normal) | test: 1:00 min (normal) + 1:00 min (abnormal)

Figure 6.8: Separation of data of a single quantum.
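The time-wise split of Figure 6.8 can be expressed as follows. This is a sketch assuming the 500 ms monitoring interval from Section 7.1; the function name is illustrative.

```python
# Sketch: time-wise split of one quantum into train / validate / test.
# With 500 ms sampling: 2:30 min = 300 samples, 30 s = 60 samples,
# 2 min (1 min normal + 1 min abnormal) = 240+ samples.
def split_quantum(samples, interval_ms=500):
    per_sec = 1000 // interval_ms
    n_train = 150 * per_sec          # first 2:30 min
    n_val = 30 * per_sec             # next 30 s
    train = samples[:n_train]
    validate = samples[n_train:n_train + n_val]
    test = samples[n_train + n_val:]  # remaining 1 min normal + 1 min abnormal
    return train, validate, test
```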

Algorithm            Parameters
CABIRCH              max node entries: [2, 20]
                     logistic decay max value: [0, 1]
                     logistic decay learning rate: [0, 1]
VAE                  encoding layers: [1, 4]
                     decoding layers: [1, 4]
                     learning rate: [0, 1]
MArima               mk: [10, 200]
                     d: [1, 8]
AELSTM               encoding layers: [1, 4]
                     decoding layers: [1, 4]
                     learning rate: [0, 1]
Gaussian threshold   σ: [0, 5]

Table 6.1: Table of IFTM model configurations.
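Randomized initialization from such uniform ranges can be sketched as below. The parameter names are hypothetical, and only the CABIRCH ranges of Table 6.1 are shown.

```python
import random

# Hypothetical names for the CABIRCH ranges of Table 6.1.
CABIRCH_RANGES = {
    "max_node_entries": (2, 20),          # integer-valued range
    "logistic_decay_max_value": (0.0, 1.0),
    "logistic_decay_learning_rate": (0.0, 1.0),
}

def random_config(ranges):
    """Draw one hyperparameter configuration uniformly from the ranges."""
    cfg = {}
    for name, (lo, hi) in ranges.items():
        if isinstance(lo, int) and isinstance(hi, int):
            cfg[name] = random.randint(lo, hi)
        else:
            cfg[name] = random.uniform(lo, hi)
    return cfg

# 1,000 randomly initialized models per quantum, as in the experiment.
population = [random_config(CABIRCH_RANGES) for _ in range(1000)]
```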

Figure 6.8. The separation of the quantum into the three phases of training, validation and test set is based on a time-wise split. The IFTM model trains in the first phase and is not changed during the two subsequent phases. This enables evaluating whether the fitness score calculated on the validation set reflects the accuracy on the test set. Furthermore, we applied several different IFTM models based on CABIRCH, Variational Autoencoders (VAE), multivariate Online Arima (MArima) and an Autoencoder with LSTM cells (AELSTM) as identity functions (see Section 2.3), with a Gaussian distribution model as dynamic threshold, using randomized initialization via the uniformly distributed hyperparameters shown in Table 6.1. Based on this configuration, a set of 1,000 randomly initialized models per IFTM variant and quantum is created. As an example, we first present detailed results for the random CABIRCH models. Figure 6.9 shows the Euclidean distance between the reconstruction error and the threshold. It illustrates that models with a larger distance on the validation set show the same behavior on the normal section of the test set, indicated by a Pearson correlation coefficient of 0.9935. The false alarm rate behaves similarly when comparing the validation set with the normal part of the test set, with a correlation coefficient of 0.9211 (see Figure 6.10). Figure 6.11 presents boxplots showing the distribution of reconstruction errors for the validation set and test set. The validation set as well as the first part of the test set show similar distributions, while the abnormal part of the test set shows larger reconstruction errors. This is expected, as the training is


Figure 6.9: Comparison between validation and test set (normal) of the Euclidean distance between reconstruction error and threshold. Pearson correlation coefficient: 0.9935.


Figure 6.10: Comparison between validation and test set (normal) of false alarm rates. Pearson correlation coefficient: 0.9211.


Figure 6.11: Boxplot of reconstruction errors of the validation and both test sets (normal and anomaly).

w1     w2      w3      w4
0.5    0.125   0.125   0.25

Table 6.2: Weights used for the fitness function.

performed on normal data and should therefore represent the normal data in the validation set and the normally behaving part of the test set more precisely. This behavior is expected and indicates that the prediction horizon does not underlie too much change in the normal state for the validation set and normal test set, while the abnormal test set shows higher reconstruction errors. Table 6.2 shows the individual weightings of the fitness function. Half of the score is contributed by the false alarm rate, while the other half is split among the three further metrics. This weights the two antagonists equally, as suggested above. The remaining three terms are in turn split into two groups: w4, representing the distance between threshold and reconstruction error, is weighted as high as the magnitudes of the threshold and the reconstruction error together, resulting in the weights presented in Table 6.2. Figure 6.12 presents the dependency between the distance of the reconstruction error to the threshold and the false alarm rates for the anomaly case and normal case of the test set. The black points represent the normal data of the test set, where a low distance between reconstruction error and threshold (hereafter referred to as distance) shows an increase in the false alarm rate, while a large distance shows a preferred low false alarm rate (near zero). This is expected, as a large distance over-compensates the fluctuations in the error and therefore all data points are detected as normal. A small distance, in contrast, produces more false alarms, as the model may overfit the signal and does not compensate the noise in the error. The brighter points illustrate the dependency between the distance and the ab-


Figure 6.12: Illustration of false alarms within the normal and anomaly test sets compared with the distance between error and threshold.


Figure 6.13: Scatterplot comparing the score based on the validation set and the resulting accuracy on the test set. The resulting Pearson correlation is 0.73.

normal data, showing that small distances indicate a low false alarm rate, while higher distances misclassify the anomalies more, represented by a high false alarm rate for the abnormal data. This is again expected, as a high distance classifies data as normal, leading to high false alarms for abnormal data, while small distances give the error the chance to jump above the threshold, triggering an anomaly. Figure 6.12 illustrates the problem of choosing a robust model, which balances a small distance against the false alarm rate. The scatter plot (Figure 6.13) shows the relationship between the scores calculated on the validation set and the accuracy on the test set. The plot shows an almost constant accuracy of 0.5 for scores below 0.4, representing models which detect only abnormal or only normal behavior, or which produce perfectly balanced misclassifications. Larger fitness scores show an increase in accuracy. This results in a Pearson correlation coefficient of 0.73 for the complete dataset. The figures described above showed detailed behavior for a CABIRCH-based anomaly detection model. Table 6.3 reports the overall correlation in the last column C(s, a), where a denotes the accuracy, s the score and C the applied correlation function, for the different IFTM models (MArima, VAE, AELSTM). All models show high correlations between score and accuracy. In addition, the table presents the correlations of the individual values used within the fitness function with

           C(∆, a)   C(T, a)   C(|T − ∆|, a)   C(f, a)   C(s, a)
CABIRCH    -0.45     -0.73     -0.65           0.34      0.73
MArima     -0.28     -0.44     -0.45           -0.04     0.70
VAE        -0.03     -0.07     -0.07           0.14      0.68
AELSTM     -0.08     -0.09     -0.09           0.19      0.65

Table 6.3: Pearson correlation of individual scoring metrics with respect to the accuracy for different IFTM models.
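The C(·, a) entries are plain Pearson correlations. A dependency-free sketch of the computation (the function name is ours):

```python
import math

# Sketch: Pearson correlation between two equal-length series, as used
# for the C(., a) columns of Table 6.3.
def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```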

Parameter hyperoptimization               Default values
Mutation rate                             5%
Tumbling window size                      100
Initial population size                   100
Parent selection: best percentage         20%
Parent selection: random percentage       20%
Fitness score weightings w1, w2, w3, w4   w1 = 0.5, w2 = w3 = 0.125, w4 = 0.25

Table 6.4: Default values for the hyperoptimization of IFTM models.

respect to the accuracy. None of the individual metrics shows a high correlation for any of the applied approaches. This reflects that the individual metrics alone do not provide enough information to approximate the accuracy a model can reach. Therefore, the combination of the individual metrics through the fitness function is necessary and applicable.
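One generation of the optimization under the defaults of Table 6.4 could be sketched as follows. This is a simplified sketch: the `mutate` callback and the selection details are illustrative assumptions, not the exact thesis implementation.

```python
import random

# Sketch of one generation: keep the best 20% plus a random 20% as
# parents, then refill the population with mutated copies (5% per-gene
# mutation rate), following the defaults of Table 6.4.
def next_generation(population, scores, mutate, best_frac=0.2,
                    rand_frac=0.2, mutation_rate=0.05, rng=random):
    ranked = [cfg for _, cfg in
              sorted(zip(scores, population), key=lambda t: -t[0])]
    n_best = max(1, int(best_frac * len(ranked)))
    parents = ranked[:n_best]                      # elitist selection
    rest = ranked[n_best:]
    n_rand = int(rand_frac * len(ranked))
    parents += rng.sample(rest, min(n_rand, len(rest)))  # random parents
    children = []
    while len(parents) + len(children) < len(population):
        child = dict(rng.choice(parents))          # copy a parent config
        for key in child:
            if rng.random() < mutation_rate:       # mutate single genes
                child[key] = mutate(key, child[key])
        children.append(child)
    return parents + children
```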

Cloud Monitoring Evaluation Table 6.4 presents the configuration of our hyperparameter approach. Combined with the IFTM model specific configurations (see Appendix B), anomaly detection is applied to the cloud monitoring dataset (see Section 7.1). The four plots of Figure 6.14 show the results when applying the different IFTM models (CABIRCH (a), MArima (b), VAE (c), and AELSTM (d), as described above) to the sequence of 20 min of the initial normal load of the monitoring data. The plots show the development of the population's score over time. After each tumbling window (described as iterations, the x-axis of the plots), the average score (red) and the average of the best performing 20% of scores (blue) are reported. Each IFTM model was applied 100 times to provide representative behavior under the randomized internal processes of the hyperparameter optimization. The diagrams show an increase in both scores over time for all individual models. The best scores reach high values (above 0.8) in a short period of time (ca. 5 iterations), while the population's average approaches high scores later, depending on the applied IFTM model. CABIRCH and MArima show in the average case (red) a quicker response of increasing scores, while the neural network-based approaches increase more slowly. All plots illustrate a convergence towards improving scores over time, which reflects the effectiveness of applying hyperparameter optimization. In the previous Chapter 5, the optimal result for CABIRCH was determined through grid search with respect to the AUC. Table 6.5 presents these results and adds the results when applying hyperparameter optimization-based CABIRCH. The

Figure 6.14: Behavior of score development over time for (a) CABIRCH, (b) MArima, (c) VAE and (d) AELSTM. Blue: average of the 20% highest scores of the population. Red: average over the complete population.

Algorithm           TP-rate   TN-rate   TP-event   AUC
Hyperopt CABIRCH    56.04%    85.82%    94.84%     70.93%
Best CABIRCH        54.27%    96.69%    96.23%     75.48%
Difference          +1.77%    -10.87%   -1.39%     -4.55%

Table 6.5: Comparison between the hyperparameter optimized model and the overall best CABIRCH model.

table shows a drop of 4.55% in AUC, which relates to the large difference in representing the normal state. This is reflected by the TN-rate: the hyperparameter optimized version produces 10.87% more false alarms (14.18% false alarms in total), while detecting anomalies 1.77% better. Regarding the production-oriented measure of detecting anomaly events, the TP-event rate decreases by 1.39%. Applying hyperparameter optimization to CABIRCH combined with the Gaussian threshold model thus leaves room for improvement, as its TN-rate drops considerably. In the following Chapter 7, we compare 27 combinations of identity functions and threshold models, investigating the potential for unsupervised anomaly detection when applying CABIRCH and further state-of-the-art techniques as identity functions together with hyperparameter optimization.

Chapter 7

Evaluation

To elaborate unsupervised anomaly detection for detecting abnormal behavior online in cloud monitoring data streams, we apply 28 IFTM algorithms, including state-of-the-art anomaly detection methods. IFTM generalizes over such reconstruction or forecasting methods as identity functions and enables applying different types of dynamic threshold models. Table 7.1 presents all integrated IFTM approaches: seven identity functions and four threshold models. Besides the proposed anomaly detection based on CABIRCH and Gaussian threshold modeling (CA), we apply further dynamic threshold models: Sliding window aggregation (SWA), Exponential moving model (EMM) [69] and Double exponential moving model (DEMM) [68]. In addition, we evaluate further identity functions: Autoencoders (AE), Variational Autoencoder (VAE), Univariate Arima (UArima), Multivariate Arima (MArima), LSTM-based network (LSTM) and Autoencoder with LSTM cells (AELSTM). For all of these approaches, we provided details in Section 2.3.1 and, in the case of Arima, in Appendix A. Laptev et al. [147] proposed applying Arima together with a Gaussian threshold for univariate time-series anomaly detection. Due to applying online learning, we utilized an optimized form of Arima, called Online Arima, and showed its applicability with a Gaussian threshold in the cloud monitoring domain [268, 269]. Whether further improvements are possible with other threshold models is evaluated in this thesis. Hundman et al. [65] applied an LSTM network as identity function together with EMM as threshold model to multivariate time-series anomaly detection for Mars rover data. In contrast, Malhotra et al. [66] utilized a Gaussian threshold, while also applying an LSTM network, and showed its applicability to a wider range of domains like ECG signal analysis, engine sensor data and power consumption monitoring. In both cases, they did not investigate the applicability to cloud anomalies.
Oh and Yun [67] illustrated the applicability of AE and a Gaussian threshold as a key model for machine sound signal anomaly detection, while Xu et al. [176] applied a VAE together with a Gaussian threshold to analyze web application KPIs. Again, both of these papers did not evaluate the applicability to cloud monitoring data and did not investigate further dynamic threshold choices. AELSTM combines the structure of AE with LSTM cells. Such a structure was proposed by Park et al. [173] to detect abnormal behaviors in robot-assisted feeding settings, but was applied in a semi-supervised manner without automatically inferred thresholds. Based on these findings of recent work in unsupervised anomaly detection, we integrated all these different methods into IFTM. The neural network architecture of the integrated autoencoders (AE, VAE, AELSTM) creates for each layer a decreased


Identity function               Threshold model
CABIRCH                         Gaussian aggregation
Autoencoder                     Sliding window aggregation
Variational Autoencoder         Exponential moving model
Univariate Arima                Double exponential moving model
Multivariate Arima
LSTM network
Autoencoder with LSTM cells

Table 7.1: List of integrated IF and TM models in the IFTM framework.

number of neurons, n_{i+1} = ⌈n_i / 2⌉, where n_i is the number of nodes of the previous layer i, for the encoding network. The decoding network is symmetrically mirrored, increasing the number of neurons again. The network structure for LSTMs as an identity function is kept straightforward: the number of nodes in the input and output layer equals the number of dimensions of the multivariate data point. Further hyperparameters are optimized with our optimization approach, using the base configuration of Table 6.4 and the interval selections presented in Appendix B.
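The layer sizing rule can be sketched as below, assuming ceiling rounding for n_{i+1} = ⌈n_i / 2⌉; the helper name is ours.

```python
import math

# Sketch: encoder layer widths halve per layer, n_{i+1} = ceil(n_i / 2);
# the decoder mirrors the encoder symmetrically back to the input width.
def autoencoder_layout(input_dim, n_layers):
    enc = [input_dim]
    for _ in range(n_layers):
        enc.append(math.ceil(enc[-1] / 2))
    dec = enc[-2::-1]          # mirror back up to input_dim
    return enc + dec
```

For example, the 24 resource metrics of the evaluation dataset with two encoding layers yield the widths 24, 12, 6, 12, 24.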

7.1 Evaluation Setup

For evaluation purposes, we built an extensive private cloud testbed running the cloud services Clearwater and Video-on-demand.

Cloud Infrastructure The infrastructure used within these experiments is based on a homogeneous cluster of 13 physical servers, where each server has the following properties: • Intel Xeon X3450 (4 cores @ 2.66GHz) • 16GB RAM • 3x 1TB disks • 2x 1GBit Ethernet interfaces On these servers, a highly available OpenStack installation is deployed, which functions as a private cloud [280]. Since OpenStack consists of management services itself, it occupies a subset of the physical hosts; thus, 9 physical hosts are used as compute nodes for evaluation. Through OpenStack, virtual machines are created with KVM as virtualization engine, in which we deploy the services described above. Each service installation received its own virtual machine. The virtual machines run Ubuntu 16.04 and use 2 vCPU cores, 2GB memory and 20GB disk.

Clearwater Section 2.2.1 described the open source Project Clearwater and its internal components. Based on these components, we deployed a replicated, load-balanced version. The key services Bono and Sprout are deployed three times each, while single deployments are used for Homer, Homestead and Ellis. In order to provide realistic results, we simulate the usage of the IMS by changing the number of client registrations and call initiations randomly between 20 and 40 users every minute.

The system was monitored under normal load behavior for the first 20 minutes. Afterwards, anomaly injections started with a duration of 5 minutes each, followed by a cooldown phase of 1 minute of normal data. The execution of anomalies is switched round-robin, first through the hosts and then through the anomaly types. The complete execution plan runs for 96h. Data was collected with a monitoring interval of 500ms, in parallel on all the given service components. In order to provide a clean dataset for evaluation, phases in which anomaly executions run on different hosts are removed, as they may influence the result due to propagation effects, which are out of scope for the anomaly detection and deferred to the root cause analysis. The resulting dataset for each monitored service consists of 286,108 data points containing 24 resource metrics.

Video-on-demand Section 2.2.2 describes the Video-on-demand service use case. A replicated version of the load balancer and the video backend servers is deployed, with four clients producing the service load. The parts of the service are distributed among three physical servers. The system was again monitored under normal load behavior for the first 20 minutes. Afterwards, anomaly injections started with a duration of 1 minute each, followed by a cooldown phase of 4 minutes of normal data. The execution of anomalies is switched round-robin, first through the hosts and then through the anomaly types. The complete execution plan runs for 96h. Data was collected with a monitoring interval of 500ms, in parallel on all the given service components. In order to provide a clean dataset for evaluation, phases in which anomaly executions run on different hosts are removed, as they may influence the result due to propagation effects, which are out of scope for the anomaly detection and deferred to the root cause analysis. The resulting dataset for each monitored component consists of more than 280,000 data points containing 24 resource metrics.

7.1.1 Resource Monitoring

The Bitflow collector1 is applied for monitoring the cloud infrastructure on different layers. The collector is implemented in Go and monitors resource metrics at user-defined intervals, which are sent through a TCP connection in a unified format to a given destination. The Bitflow collector retrieves monitoring information from different interfaces, mainly from the proc file system, libvirt [281] and ovsdb [282]. Table 7.2 shows the different layers and metrics on which the collector can aggregate information. The left column shows the different types of metrics collected and the following columns define on which level those are collected. The proc file system is used for the system-wide and per-process collection of metrics, while libvirt is used for virtual machine monitoring from the hypervisor level and ovsdb for further information on the network layer. In addition, preprocessing is conducted by the monitoring agent. Based on historic data, it applies min-max scaling:

x' = (x − xmin) / (xmax − xmin)    (7.1)

where x' denotes the normalized value, xmin the minimum value and xmax the maximum value of a given metric.
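Equation 7.1 as a small helper; this is a sketch, and the zero-range guard for constant metrics is our own assumption.

```python
# Sketch: min-max scaling (Equation 7.1) using per-metric historic bounds.
def min_max_scale(value, x_min, x_max):
    if x_max == x_min:        # constant metric: avoid division by zero
        return 0.0
    return (value - x_min) / (x_max - x_min)
```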

1 https://github.com/bitflow-stream/go-bitflow-collector/

Metric              Description                                         system-wide   per process   libvirt per VM   ovsdb per bridge
CPU                 CPU utilization                                     x             x             x
RAM                 RAM utilization                                     x             x             x
Disk IO             Hard disk access in bytes and times                 x             x             x
Disk space          Disk space usage per partition                      x             x
Network IO          Network utilization in bytes and packets            x             x             x                x
Network protocol    Protocol specific counters for IP, UDP, TCP, ICMP   x
Num. of processes                                                       x
Num. of threads                                                         x

Table 7.2: Interfaces of resource metrics monitored by the Bitflow collector.

            Successive increase   Rapid change   Fluctuations
CPU         x                     x              x
Memory      x                     x              x
Hard disk   x                     x              x
Network                           x

Table 7.3: Types of degraded state anomalies and the metrics they target.

7.1.2 Anomaly Injection Framework

For simulating degraded state anomalies, we developed an injection framework. It consists of three main components: the experiment controller, the load controller and the injection agent, as depicted in Figure 7.1. The experiment controller is a centralized service, which is deployed once and manages the triggering of anomalies. Likewise, the load controller is a centralized component for managing the load changes within the system. The load controller contains a SIP load generator module, which initiates sessions for a certain number of users per defined temporal period for the Clearwater IMS system. This load generator is the sip-stress component provided by Clearwater itself, which is deployed within a separate virtual machine and used to generate defined system loads. For the Video-on-demand VNF service, the same principles apply: the load controller initiates the streaming of videos through clients simulating users. The anomaly injection agent is deployed inside each component which is monitored and on which anomalies are to be executed. The agent is controlled through a RESTful API, providing options to trigger or stop anomalies, set anomaly execution time plans, or query the status of all or specific injected anomalies. The integrated anomalies simulate, in dedicated processes, different resource usage patterns based on the patterns defined in Section 3.1. These provide different stationary change types within different metrics like CPU, memory, disk and network. The main differences are realized by either instant changes, which are run by the Stress-ng service2, or by slowly progressing anomalies like memory leaks. Furthermore, the anomalies differ in constant versus fluctuating behavior. Table 7.3 summarizes the main differences in type compared to the metric types included in the anomaly injection framework.

2 https://wiki.ubuntu.com/Kernel/Reference/stress-ng/

Figure 7.1: Distributed experiment controller and placement of injection agents and collector agents.

The complete list of integrated degraded state anomalies consists of: 1. Disk pollution and temporary disk pollution both write data into a log file. In the temporary case, the files are deleted after execution, while otherwise the file remains. 2. HDD stress describes excessive hard disk utilization by writing a file. Stress-ng spawns between 1 and 4 workers executing this task. 3. CPU stress immediately consumes excessive amounts of CPU. CPU leakage describes an anomaly increasing in intensity over time by consuming more and more CPU. CPU fluctuation constantly increases and decreases the CPU usage in order to provide a fluctuating behavior. 4. Like for the CPU, memory anomalies are modeled with the same behaviors: immediately consuming high amounts of memory (memory stress), growing consumption over time (memory leakage) and fluctuating allocation of memory (memory fluctuations). 5. For the anomalies fork flooding leak and fork flooding fluctuation, a process starts creating child processes over time. In the leakage variant, those child processes in turn create more processes, while in the fluctuation variant children are also partly removed. 6. Large file download downloads a large file, resulting in high network traffic. 7. File pointer wasting requests file pointers without releasing them, creating a leakage of pointer IDs over time.
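To illustrate the three change types of Table 7.3, the injected intensity over time can be modeled roughly as follows. This is a synthetic sketch, not the Stress-ng implementation; the 10 s fluctuation period is an arbitrary assumption.

```python
import math

# Sketch: the three stationary change types of Table 7.3 as synthetic
# intensity curves in [0, 1] over the injection duration.
def anomaly_intensity(kind, t, duration):
    """Intensity at time t (seconds) of an injection lasting `duration`."""
    if kind == "rapid":            # e.g. CPU/memory stress: instant change
        return 1.0
    if kind == "successive":       # e.g. memory leak: linear growth
        return min(1.0, t / duration)
    if kind == "fluctuation":      # oscillating usage (assumed 10 s period)
        return 0.5 + 0.5 * math.sin(2 * math.pi * t / 10.0)
    raise ValueError(kind)
```

The successive type explains why detection delays arise: at injection start its intensity is near zero and hardly distinguishable from normal noise.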

7.2 Evaluation Results

Figure 7.2 presents the TP-rate for the detected anomaly events and the TN-rate describing the portion of false alarms. Most algorithmic combinations (19 out of 28) produce a 100% TP-event detection rate, which is the highest value, while 9 do not detect all given events. The TN-rate provides further differentiation: whether the model appropriately captures the normal state is reflected by the TN-rate. Consequently, the preferred approach reaches high rates for the TP-event as well as the TN metric. The best option is the combination AE & DEMM, achieving a 100% TP-event rate and a 96.14% TN-rate. Events might not be detected directly when they are initiated. This is expected, as anomalies like memory leaks start with very little change at the beginning and grow slowly over time. Due to the small changes at the beginning, it is hard to differentiate them from normal changes in the signal. In Figure 7.3, we show the detection delays for the approaches achieving 100% TP-event rates (for the rest, see Figure C.2). As the monitoring interval is 500ms, most detection approaches alert the anomaly correctly at the first or second monitored instance. Some techniques still require a few seconds to detect the anomalies correctly. This behavior is especially visible for IFTM models achieving higher TN-rates. It can be explained by the compactness of the model: those which produce a smaller number of false alarms keep larger distances between the prediction error and the threshold. Consequently, the chance of detecting small changes is smaller. The average detection times and their standard deviations correlate for the anomaly events, as shown in Figure 7.4. This results in a high Pearson correlation coefficient of 0.93 for the abnormal detection times, as well as 0.94 for the duration of false alarm phases on normal data. This behavior was also shown in Chapter 5 (see Figure 5.17).
The complete and detailed list of the combinations' correlations is provided in Appendix C, Figure C.3.

IFTM model        TP-event [%] (abnormal)   TN-rate [%] (normal)
AE & DEMM         100.00                    98.27
LSTM & EMM        100.00                    96.14
CABIRCH & EMM     100.00                    94.57
MArima & EMM      100.00                    94.36
VAE & SWA         100.00                    93.41
CABIRCH & DEMM    100.00                    90.16
UArima & DEMM     100.00                    88.10
CABIRCH & SWA     100.00                    87.95
AE & EMM          100.00                    86.23
VAE & EMM         100.00                    83.20
VAE & DEMM        100.00                    78.04
AELSTM & SWA      100.00                    77.78
LSTM & DEMM       100.00                    76.75
LSTM & SWA        100.00                    76.62
AELSTM & DEMM     100.00                    76.14
AE & SWA          100.00                    75.51
MArima & CA       100.00                    64.29
UArima & SWA      100.00                    57.85
MArima & SWA      100.00                    52.62
AELSTM & EMM      99.35                     94.65
AE & CA           98.06                     82.97
MArima & DEMM     97.42                     91.52
VAE & CA          97.41                     82.54
AELSTM & CA       95.48                     87.16
LSTM & CA         95.48                     84.67
CABIRCH & CA      94.84                     85.82
UArima & CA       74.19                     91.09
UArima & EMM      64.52                     95.44


Figure 7.2: Anomaly event detection rate (TP-event) and TN-rates for all 28 IFTM models.

IFTM model        Avg. detection time [s]
AE & DEMM         31.4256
LSTM & EMM        6.8294
CABIRCH & EMM     0.8357
MArima & EMM      20.4236
VAE & SWA         5.9589
CABIRCH & DEMM    0.1864
UArima & DEMM     5.9891
CABIRCH & SWA     0.3482
AE & EMM          3.9135
VAE & EMM         2.0494
VAE & DEMM        0.7006
AELSTM & SWA      0.6841
LSTM & DEMM       0.2344
LSTM & SWA        0.5594
AELSTM & DEMM     0.5010
AE & SWA          0.2986
MArima & CA       0.78172619
UArima & SWA      0.0385
MArima & SWA      0.0313


Figure 7.3: Avg. detection times in [s] for all models with a 100% event detection rate.


Figure 7.4: Comparison of average detection times and standard deviation of detection times.


Figure 7.5: Normal false alarm duration.

In the case of false alarms, the duration (time in seconds) of those phases is shown in Figure 7.5 as a box plot. The plot is based on the individual results for all 28 combinations of IFTM models. Individual results per IFTM combination are shown in Appendix C in Figure C.1. As the individual results show similar behavior, we show an aggregated diagram here. The results illustrate small false alarm durations of at most half a second. This means that, on average, a false alarm phase consists of between one and two misclassified samples. When comparing the quality of the 28 different combinations of IFTM models, we first have to define what the best approach should be capable of. As described above, the approach should detect all anomaly events. Therefore, we show next only those with a 100% event detection rate. The remaining goals can be described by three metrics. First, the TN-rate should be as high as possible, meaning that the false alarms within the normal signal are close to zero. Secondly, the computation time for a single sample should be as low as possible (when aiming at a resource limit of 3% and the given monitoring interval, the computation should take at most 15ms per sample). The last metric represents the detection delay, describing the average time needed until an anomaly alert is raised. Figure 7.6 presents these three key indicators, from which the reader can choose appropriate algorithms given individual requirements. The y-axis presents the TN-rate and the x-axis the runtime. The 19 different combinations are presented as dots, showing the dependency between these two metrics. The numbering of the 19 approaches is based on the TN-rate, helping to distinguish the differences. The results show that several approaches achieve a small runtime (below the aimed 15ms), but also a high diversity in the TN-rate.
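Shortlisting by these key indicators can be sketched as follows; the thresholds come from the text, while the model values in the usage example are hypothetical.

```python
# Sketch: shortlist IFTM models by TN-rate and per-sample runtime.
# models maps name -> (tn_rate, runtime_ms, detection_delay_s).
def shortlist(models, min_tn=0.90, max_runtime_ms=15.0):
    return [name for name, (tn, runtime_ms, delay_s) in models.items()
            if tn >= min_tn and runtime_ms <= max_runtime_ms]
```

The detection delay (third tuple element) can then be used as a tie-breaker among the remaining candidates.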
Thus, there exist the approaches 3, 4 and 6 with above 90% TN-rate and a runtime lower than 15 ms. When additionally including the detection delay of anomaly events, we can differentiate the quality of results even further. This is also presented in this plot by coloring each dot according to its delay time, from near zero (blue) over yellow (ca. 10 s) to 30 s (red). As this differentiation is hard to visualize when many values appear as blue dots, Figure 7.7 shows the same three metrics from a different angle. This time, the x-axis represents the detection delay for the anomaly events, while the y-axis again represents the TN-rate; the runtime is represented by the color. The plot shows

[Figure 7.6 legend, models ranked by TN-rate: 1 AE & DEMM, 2 LSTM & EMM, 3 CABIRCH & EMM, 4 MArima & EMM, 5 VAE & SWA, 6 CABIRCH & DEMM, 7 UArima & DEMM, 8 CABIRCH & SWA, 9 AE & EMM, 10 VAE & EMM, 11 VAE & DEMM, 12 AELSTM & SWA, 13 LSTM & DEMM, 14 LSTM & SWA, 15 AELSTM & DEMM, 16 AE & SWA, 17 MArima & CA, 18 UArima & SWA, 19 MArima & SWA. Axes: runtime [ms] vs. TN-rate, colored by anomaly detection delay [s].]

Figure 7.6: Relationship between TN-rate and average runtime per sample, colored by detection delay, for all IFTM models achieving a 100% event-detection rate.

that the approach AE & DEMM with the highest TN-rate also produces the highest delay of a little more than 30 seconds on average. The plot also illustrates that the delay reduces together with the TN-rate. This is expected, as these models produce a higher number of FPs and consequently the chance of detecting the anomaly event early increases. Details about the point-wise detection results are presented in Table C.1 and Table C.2 in Appendix C. Six IFTM models achieve above 90% TN-rate, but the models AE & DEMM and LSTM & EMM reach more than 30 and 20 seconds of detection delay respectively, which might not be acceptable depending on the complexity of the remediation actions. On the other hand, a single anomaly event runs for up to 5 minutes. Consequently, assuming that the anomaly would end up in a failure, the selection and execution of a remediation action still has more than 4 minutes.

Combining these two plots, Figure 7.8 illustrates a 3D plot of the given three key indicator metrics. The axes represent the three metrics, while the color of each point encodes the detection delay in order to show the three dimensions in a more readable way. The same findings hold as described above. While the IFTM model AE & DEMM shows the highest TN-rate, there exist more efficient models with respect to computation time and anomaly event delay. The perfect model would have a TN-rate of 100%, near-zero computation time and near-zero detection delay. Considering this and the requirements of the AIOps platform, we recommend applying one of the following three combinations:

• Option 1: AE & DEMM shows the highest TN-rate, but also the highest detection delay, and the computation time might be too high with respect to given resource limits.
• Option 2: LSTM & EMM shows a significantly lower detection delay, but at a higher computation time. Still, the TN-rate is the second-highest.

[Figure 7.7 legend: the same 19 IFTM models as listed for Figure 7.6. Axes: anomaly detection delay [s] vs. TN-rate, colored by runtime [ms].]

Figure 7.7: Relationship between TN-rate and delay for detecting an anomaly event for all IFTM models achieving 100% event-detection rate.

• Option 3: CABIRCH & EMM shows almost zero detection delay and almost zero computation time, but the false alarms in the normal signal increase to more than 5%.

There exist further models which provide better results in one of the metrics and therefore lie on the Pareto front, but they do not provide a relevant impact compared to these three options.

Besides the overall anomaly event detection rates per model, Figure 7.9 shows how accurately the individual anomaly types were detected across all given approaches. The bar plot shows on its y-axis the TP-event rate averaged over all detection approaches. The results show that the anomaly type Bandwidth stress is detected by all 28 approaches, which indicates that this anomaly is easy to detect. On the other hand, the anomaly type File pointer wasting has the lowest rate, below 90%, and represents the most problematic event to detect.

Lastly, Figure 7.10 shows the computation time for the IFTM combinations separated by IF and TM. The bar plot shows that all evaluated neural network models need more than 10 ms of computation time per sample, and when using LSTM cells, the computation time increases to more than 30 ms. In comparison, the runtimes of UArima, MArima, and CABIRCH are far below 10 ms. Furthermore, through the discussed optimization in the Arima computation, the multivariate version computes samples more efficiently than the univariate version. This may be caused by the capturing of linear interdependencies between the monitored metrics, which reduces the need for large internal windows or a high differentiation depth. These white-box details about the Arima models should be investigated in future work.
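The notion of a Pareto front used here can be made explicit: a model is Pareto-optimal if no other model is at least as good in all three metrics and strictly better in at least one. A sketch with illustrative values (the measured ones are shown in Figure 7.8):

```python
def dominates(a, b):
    """True if model `a` is at least as good as `b` in all three metrics
    (higher TN-rate, lower runtime, lower delay) and strictly better in one."""
    at_least = (a["tn_rate"] >= b["tn_rate"]
                and a["runtime_ms"] <= b["runtime_ms"]
                and a["delay_s"] <= b["delay_s"])
    strictly = (a["tn_rate"] > b["tn_rate"]
                or a["runtime_ms"] < b["runtime_ms"]
                or a["delay_s"] < b["delay_s"])
    return at_least and strictly

def pareto_front(models):
    """Names of all models not dominated by any other model."""
    return sorted(name for name, m in models.items()
                  if not any(dominates(o, m)
                             for other, o in models.items() if other != name))

# Illustrative placeholder values only.
models = {
    "AE & DEMM":     {"tn_rate": 0.9827, "runtime_ms": 25.0, "delay_s": 30.0},
    "LSTM & EMM":    {"tn_rate": 0.9614, "runtime_ms": 35.0, "delay_s": 20.0},
    "CABIRCH & EMM": {"tn_rate": 0.9457, "runtime_ms": 0.8,  "delay_s": 1.0},
    "AE & SWA":      {"tn_rate": 0.60,   "runtime_ms": 25.0, "delay_s": 5.0},
}
```

With these placeholders, the last model is dominated (another model is better in every metric) and the remaining three form the front, mirroring the three recommended options.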

[Figure 7.8 legend: the same 19 IFTM models as listed for Figure 7.6. Axes: anomaly detection delay [s], runtime [ms] and TN-rate.]

Figure 7.8: 3D scatter plot illustrating the three dimensions of TN-rate, average execution runtime per sample and average detection time after the anomaly event started.

[Figure 7.9: bar plot of the TP-event rate (y-axis, 0.84–1.0) per anomaly type (x-axis): CPU stress, Disk stress, Memory leak, Disk pollution, Increasing CPU, Memory stress, CPU fluctuations, Bandwidth stress, Disk pollution tmp, Memory fluctuations, File pointer wasting, Increasing fork flooding, Fork flooding fluctuations.]

Figure 7.9: Results of TP-event rate per anomaly type.

[Figure 7.10: bar plot of runtime in ms (0–60) per identity function (LSTM, AELSTM, VAE, AE, UArima, MArima, CABIRCH), grouped by threshold model (CA, SWA, EMM, DEMM).]

Figure 7.10: Runtimes separated by IF and TM.
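Figure 7.10 separates each model into its IF and TM part. To illustrate what the TM part does, the sketch below implements an exponential-moving-model style dynamic threshold over the identity function's reconstruction error. The threshold definition (moving mean plus c standard deviations of the error stream) is our simplifying assumption, not the exact EMM/DEMM formulation used in the thesis:

```python
class ExponentialMovingThreshold:
    """Sketch of an EMM-style dynamic threshold: alert when the current
    error exceeds the exponentially weighted mean plus c standard
    deviations of the error stream (assumed threshold definition)."""

    def __init__(self, alpha=0.05, c=3.0):
        self.alpha, self.c = alpha, c
        self.mean, self.var = 0.0, 0.0
        self.seen = 0

    def update(self, error):
        """Classify `error` against the current threshold, then learn it
        iteratively (point-wise and unsupervised)."""
        self.seen += 1
        if self.seen == 1:
            self.mean = error          # initialize on the first sample
            return False
        is_anomaly = error > self.mean + self.c * self.var ** 0.5
        delta = error - self.mean
        self.mean += self.alpha * delta
        self.var = (1 - self.alpha) * (self.var + self.alpha * delta * delta)
        return is_anomaly
```

Feeding a stable error stream keeps the threshold tight around the observed level, so a sudden jump in the reconstruction error is reported as an anomaly.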

7.3 Discussion

Summarizing, the needs and requirements for applicable anomaly detection in cloud computing environments are:

1. Near-zero human labeling: Human labeling of data comes with costs and time, and requires high expertise for the individually monitored services. The solution must therefore bound the labeling of data by humans, which also increases the automation of the overall ZerOps platform.

2. Near-zero configuration effort for applying ML models: DevOps engineers and administrators of cloud computing environments are in general no ML experts, but need to apply ML models in order to keep companies profitable. Users should therefore be able to apply ML models without extensive ML knowledge such as approach-specific hyperparameter selection.

3. Near-zero learning time: When applying AI-based approaches, the key challenge remains the computational complexity of training a model (for most types of ML algorithms).

4. Accurate detection results: Industry demands 99.999% accuracy [3].

5. Just-in-time processing: The computation time for a sample should be less than the monitoring interval to ensure just-in-time computation on the continuous data stream.

6. Online (point-wise) predictions: The anomaly detection should provide an individual recommendation for each recorded data point on whether an anomaly appeared.

7. Detect anomalies as soon as they appear: The detection mechanism should be able to detect abnormal behavior as soon as it appears in order to give the remediation engine the chance to apply suitable actions. Detecting anomalies too late may impact the service quality or cause a failure of the component without the possibility to mitigate the anomaly.

8. Detect unknown anomalies: The anomaly detection should also be capable of detecting unforeseen problems, thus also handling unknown anomalies.

These requirements provide a broad overview of the desired properties of the anomaly detection algorithm.
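Properties 6 and 7 imply two event-level metrics derived from the point-wise predictions: the event detection rate and the detection delay. A minimal sketch of how such metrics can be computed from boolean per-sample predictions and labeled anomaly intervals (a hypothetical helper, not the evaluation code used in this chapter):

```python
def event_metrics(preds, events, interval_s=0.5):
    """Aggregate point-wise predictions (one boolean per sample) over labeled
    anomaly intervals [(start, end), ...) given as sample indices.
    Returns (event detection rate, average detection delay in seconds)."""
    detected, delays = 0, []
    for start, end in events:
        hits = [i for i in range(start, end) if preds[i]]
        if hits:
            detected += 1
            delays.append((hits[0] - start) * interval_s)
    rate = detected / len(events)
    avg_delay = sum(delays) / len(delays) if delays else float("nan")
    return rate, avg_delay
```

An event counts as detected if at least one sample inside its interval is flagged; the delay is the time from the event start to the first flagged sample, assuming the 500 ms sampling interval.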
For this evaluation, we utilized unsupervised models. Due to the construction and design of IFTM models, no human labeling is required as they run unsupervised. Consequently, the 1st property is fulfilled. Additionally, the

IFTM model provides by design an internal border definition by utilizing dynamic thresholds. This helps to differentiate between the normal system state and abnormal behavior. Thus, the 8th property is captured, assuming that anomalies do not appear for long periods of time and not too often, as the detection algorithm would otherwise adapt to this behavior. IFTM models provide iterative learning and prediction for point-wise inputs. Therefore, all evaluated models meet the 6th property of streaming analysis capabilities.

In Chapter 6, we introduced the hyperparameter optimization approach applicable to any IFTM model, aiming to provide a solution for the 2nd property. Due to its construction, individual hyperparameter configurations of models are automatically optimized. The hyperparameter optimization strategy has its own hyperparameters, but these are generalized and enable ML-inexperienced users to apply a wide range of different anomaly detection techniques.

The rest of the requirements focus on the quantitative and qualitative performance of the operated detection approach. These provide further discrimination between the different IFTM approaches. We showed the overall runtime (see Figure 7.10), which captures the learning time as well as the prediction time for a single data point. The results illustrate that there exist roughly three levels of order in execution time. First, the Arima models and CABIRCH show near-zero execution time (less than 1 ms). Second, AE and VAE show higher execution times between 10 ms and 30 ms. Third, AELSTM and LSTM show the highest times with more than 30 ms, but still less than 60 ms of execution time. With an expected sampling rate of 500 ms, all the given approaches provide just-in-time processing capabilities (5th property).
As the computation highly depends on the sampling rate, on the number of monitored metrics and on the hardware itself, the application of the selected methods should be revised based on the three defined levels of the algorithms. The 3rd property was introduced to reflect the problems of supervised learning approaches when applying them to continuous data streams. But the 3rd property also helps to differentiate among the algorithms by reflecting the three levels of runtime and relating them to the sampling rate. The first level with the least execution time finishes its calculations in less than 1 ms, i.e. 1 ms / 500 ms ≈ 0.2% of the sampling interval. The second level needs roughly between 2% and 6% of the interval, while the third level utilizes up to 12%. Such results can vary depending on the utilized hardware. As the hardware described above was used for these experiments (see Section 7.1), we expect that more powerful, modern hardware solutions are established by now and that the ratio consequently improves further. We suggest that all given approaches provide suitable results to be deployed on several different hardware components.

The computational overhead is approximated by the ratio of the execution time to the sampling rate, as this provides the key indicator of CPU usage compared to the rest of the available time within the monitoring interval, where the CPU is free to execute the actual cloud computing component. The results show that UArima, MArima, and CABIRCH provide the best ratio, which matters especially when planning capacities for running several component analyses on a single physical device. For example, public cloud providers can monitor their services running in VMs indirectly from the hypervisor layer (e.g. via libvirt). Thus, multiple IFTM models run in parallel to analyze the different VMs.
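The overhead ratios above follow directly from the 500 ms sampling rate:

```python
MONITORING_INTERVAL_MS = 500  # sampling rate used in the experiments

def overhead(runtime_ms, interval_ms=MONITORING_INTERVAL_MS):
    """Share of the monitoring interval spent computing one sample."""
    return runtime_ms / interval_ms

level1 = overhead(1)    # Arima models and CABIRCH: < 1 ms  -> 0.2%
level2 = overhead(30)   # AE and VAE: 10-30 ms              -> 2% to 6%
level3 = overhead(60)   # LSTM-based models: up to 60 ms    -> up to 12%
```

The remainder of each interval (99.8%, 94% and 88% respectively) stays available to the actual cloud computing component.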
Public cloud providers may provide VM images in the future which collect further detailed information, enabling the analysis to provide white-box information and consequently the possibility to benefit from e.g. increased analysis accuracy. This also holds for private cloud providers, which are able to influence the software usage within virtual machines. With respect to IoT-gateway monitoring or edge clouds, software components can be analyzed, but such devices currently suffer from limited computational power compared to dedicated servers. The anomaly detection algorithms should consequently also cope with lower computational power. We expect that UArima, MArima as well as CABIRCH are able to provide beneficial computational properties for such devices. Utilizing small-sized neural networks on IoT devices may also be applicable, and we expect that the computational power will grow in the future, so that larger models are going to be deployable on such devices soon.

Besides the computational overhead, the 4th property also describes the demand for high accuracy, aiming at 99.999%. For the point-wise evaluation (see Table C.1 and Table C.2 in Appendix C), none of the given approaches meets the aimed 99.999% accuracy. As the evaluation captures unsupervised approaches, this highly demanded accuracy cannot be met for point-wise predictions due to the lack of labeled knowledge. On the other hand, the unsupervised approaches provide flexibility in their applicability, as industrial providers do not specialize in labeling large amounts of data. Unsupervised approaches can cope with the lack of labeled training data and can therefore be applied.

Figure 7.2 showed the overall results of the different IFTM models with respect to the event detection rates and the TN-rates. Both values are crucial to consider, as machine learning models are able to either overfit the problem (FN-rate increases) or generalize too much (FP-rate increases).
The best algorithm to consider provides a balance between those tendencies. Therefore, applicable anomaly detection approaches should be capable of detecting all given events and provide high TN-rates. The selection of an appropriate algorithm depends on the user's attitude to risk and on whether the impact on the system is at higher risk from higher FN-rates or FP-rates. For this work, we expect that at least all events have to be detectable. Building a model that predicts every single data point to be abnormal is easy to realize, but it would produce high FP-rates and therefore should not be considered. The TN-rate is presented in Figure 7.2. Considering the top three approaches with respect to a 100% detection rate, sorted by TN-rate, the user should consider the approaches AE & DEMM (98.27% TN-rate), LSTM & EMM (96.14% TN-rate) and CABIRCH & EMM (94.57% TN-rate).

Besides the demand for a high TN-rate and the detection of all abnormal events, it is also important to consider when an event is detected. This requirement is captured by the 7th property, which states that the anomaly should be detected as soon as possible with respect to the start of the anomaly event. Early detection provides more time to choose an appropriate remediation action in order to resolve the anomaly. Furthermore, when there is a lot of time until a failure is expected, different actions can be considered, e.g. by taking the risk impact on the system into account. Figure 7.7 illustrates the detection delay with respect to the TN-rate. The scatter plot shows that with decreasing TN-rate (unwanted behavior), the delay also decreases (wanted behavior). Based on the plot, the user can select an appropriate detection algorithm according to their own preferences and needs. The user should therefore consider, on the one hand, how much reaction time is needed to ensure a low-risk impact of false alarms.
On the other hand, the user might not expect high risks, or is able to cope with more false alarms by applying false alarm reduction mechanisms such as intelligent analysis algorithms or human expert knowledge. In this case, one can benefit from selecting an anomaly detection algorithm with a smaller TN-rate (e.g. < 90%), but also a smaller detection delay (< 10 s respectively).

Including the above described runtime in the selection process, Figure 7.8 depicts the dependencies between those three key aspects. The best performing algorithms are described in the previous section, but we would like to highlight the recommendation of selected algorithms next. Focusing on the detection rate and the TN-rate, we recommend applying AE & DEMM, as this IFTM model produces the lowest number of false alarms. In contrast, this model produced the highest detection delay, which still might be acceptable in practice. Furthermore, its execution time is in the second level of order (as described above) compared to all other approaches and should be applicable to any modern server setup, but might be problematic when transferring this task to e.g.

IoT devices. When there are enough computation resources and small requirements towards the detection delay, AE & DEMM is recommended.

In case that a high TN-rate as well as a smaller detection delay are important (compared to AE & DEMM), but the runtime is not required to provide the best performance, the user should apply LSTM & EMM. But as the runtime and detection delay are much higher compared to CABIRCH & EMM and the difference in TN-rate is smaller than 2%, we recommend choosing CABIRCH & EMM when low detection times are assumed for higher-level analysis.
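A false alarm reduction mechanism, as mentioned above, can be as simple as a majority vote over the most recent point-wise detections. The sketch below is a hypothetical postprocessing filter (the window size and vote count are illustrative); note that such smoothing trades fewer false alarms against additional detection delay:

```python
from collections import deque

def majority_filter(raw_alarms, window=5, min_hits=3):
    """Suppress isolated false alarms: raise a smoothed alarm only when at
    least `min_hits` of the last `window` point-wise detections are abnormal."""
    recent = deque(maxlen=window)
    smoothed = []
    for alarm in raw_alarms:
        recent.append(alarm)
        smoothed.append(sum(recent) >= min_hits)
    return smoothed
```

A single spurious detection is filtered out, while a sustained anomaly still raises an alarm after a few samples.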

7.4 Future Work

Based on the given contribution towards a zero-touch anomaly detection solution for resource monitoring in cloud environments, there exist different aspects where this work contributes new starting points for new research directions, as well as a basis for including further aspects which were out of scope in this thesis. As the results showed, one of the key problems for unsupervised anomaly detection is false alarms in the normal signal. In order to increase the quality of results, there exist three options to focus on:

• The preprocessing phase provides the opportunity to apply feature engineering techniques, changing the multivariate signal in such a way that it captures a larger amount of discriminatory information. Consequently, the anomaly detection can discriminate more easily based on the incoming data stream. This process includes e.g. filtering of duplicated metrics with respect to the information carried, smoothing of the signal, noise reduction, and feature construction.

• White-box in-depth analysis of the individual algorithmic functionality and its behavior in the applied domain can be investigated in detail. Especially the combination of the preprocessing and the model characteristics is going to provide helpful insights about the possibility to model more specialized AI models for concrete services. This might be possible as e.g. databases and load balancers exhibit different normal behavior patterns.

• After receiving the results, postprocessing analysis steps can be applied, such as smoothing techniques, state analysis algorithms, and root cause analysis, in order to foresee and avoid false alarms.

In the case of false alarm reduction, future work may also concentrate on the extension of the analytics pipeline conducted in the ZerOps design.
For example, by introducing feedback mechanisms, the subsequent analysis steps may provide further information to the anomaly detection algorithm, such as system load changes, in order to adapt e.g. the hyperparameters and enable higher learning rates, adapting the AI model to the load change. In addition, feedback loops provide the opportunity to indicate whether anomalies were correctly detected, in order to decide which data points should be learned as representing the normal system state.

Generalization of the AI models can be investigated in the future in order to develop a database of AI models, e.g. for groups of similar models. When a new microservice instance is deployed, existing models can then be loaded and used without the need for hyperparameter tuning, mitigating the cold start problem even further. Through federated learning [283], generalized models for even different types of microservices might be developed in the future.

The IFTM framework can be extended to provide unsupervised anomaly detection for additional types of monitored system data. This includes for example tracing data as well as log data. Combining these different types of data into a single AI model can improve the quality of results due to the additional information, as well as provide the chance to detect further types of anomalies which are not recognizable from resource metrics.

Besides including further information sources, the IFTM model provides by design the opportunity to apply forecasting algorithms as identity functions. When utilizing e.g. Arima models, one can forecast not only the next value but several data points into the future. The resulting residuals can be investigated and analyzed as to whether the data stream is going to be abnormal in the near future.
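As a toy illustration of this idea, the sketch below extrapolates the recent trend a few steps ahead and inspects the residuals against the actually observed values; a real deployment would use a fitted Arima forecaster instead of this hypothetical linear extrapolation:

```python
def forecast_residuals(history, future, steps=3):
    """Extrapolate the last observed slope `steps` points ahead and return
    the absolute residuals against the actually observed `future` values."""
    slope = history[-1] - history[-2]
    forecast = [history[-1] + slope * (i + 1) for i in range(steps)]
    return [abs(f - a) for f, a in zip(forecast, future)]
```

A growing residual on the later steps would indicate that the data stream is drifting away from its expected trajectory.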
This could provide additional information to the remediation engine about timing, i.e. until when degraded-state anomalies might cause crash failures, and provide hints about the severity implications in order to select appropriate actions with respect to risk and time.

Lastly, IFTM can be applied to other domains of multivariate anomaly detection in data streams. We showed the applicability of IFTM models for morphological anomaly detection in medical ECG signals [284]. Besides other interdisciplinary domains, there exist further opportunities to apply IFTM within the computer science domain. For example, IT-security threats like DDoS attacks can also be seen as a type of anomaly.

Chapter 8

Conclusion

The major aim of the thesis is the investigation, development and evaluation of the applicability of anomaly detection techniques in a holistic AIOps platform, enhancing zero-touch administration for highly dependable IT infrastructures. The objective concentrates on resource monitoring as a data source.

Chapter 4 described the holistic AIOps platform ZerOps, which provides a decentralized execution of monitoring agents and anomaly detection agents. The applicability and integration possibilities of supervised, semi-supervised and unsupervised machine learning techniques were discussed. In the case of supervised and semi-supervised learning, extensive evaluations on a cloud monitoring dataset were performed. The results showed the applicability of supervised approaches for detecting the different types of system anomalies. The decision tree classifier J48 reached the highest score with more than 99% accuracy. The results show the potential for AI models to be applied in this domain, but supervised models suffer from the unrealistic assumption of the existence of labeled data, as these have to be maintained by e.g. experts. Furthermore, supervised models assume knowledge about all possible anomalies, as they differentiate between a finite number of classes.

In order to overcome this problem, we showed how semi-supervised one-class learning provides the opportunity to model labeled normal behavior. Any deviation from the normal behavior is reported as an anomaly. The evaluation illustrated the applicability of semi-supervised approaches, but also showed a drop in accuracy with respect to supervised learners, which was expected because of the lack of information about abnormal behavior. Semi-supervised learning still assumes labeled data of the normal system behavior, which is in general widely available, but still needs selection and labeling by e.g. experts.
In Chapter 5, we propose CABIRCH to provide a valuable solution towards zero-touch administration. CABIRCH is an unsupervised anomaly detection approach based on the semi-supervised clusterer BIRCH. CABIRCH is capable of adapting its CF-tree to the current data stream through micro-cluster aging. Assuming that most of the signal is normal, CABIRCH continuously trains on the signal and forgets older clusters over time. For anomaly detection, we employed CABIRCH within IFTM, applying dynamic thresholding to distinguish abnormal data. We showed the applicability of CABIRCH to detect anomalies and provided a detailed behavior analysis for several different anomaly patterns. CABIRCH provides change point detection capabilities for rapid changes and fluctuation changes, as well as constant change detection for successive patterns when applied with constant learning. When applying CABIRCH in a tumbling window setting, as described in Chapter 5, sequences of abnormal behavior can be detected for all of the anomaly patterns. This is, of course, highly dependent on the tumbling window size, as too short windows would adapt to long-lasting

anomaly patterns, while too large windows might lead to a too slow concept adaption to the normal signal behavior.

Although CABIRCH is capable of unsupervised anomaly detection, hyperparameters have to be configured in order to provide accurate detection results. For a zero-touch solution, we introduced a hyperparameter optimization approach with a novel fitness function definition (see Chapter 6). The hyperparameter optimization applies genetic programming to infer suitable values optimizing the fitness function. The fitness score captures IFTM-specific parameters, which do not require labeled information. We showed that the fitness function correlates with the accuracy, and we illustrated the convergence towards high fitness scores over time. Through this approach, we are able to provide hyperparameter optimization for a wide range of unsupervised anomaly detection algorithms, enabling non-experts to use ML models for cloud monitoring.

Based on these findings, we conducted an evaluation of 28 different IFTM models, including combinations of anomaly detection approaches from the related work. The evaluation results show the applicability to the domain, with high detection rates of abnormal events and a low number of false alarms in the normal signal (high TN-rate). The top three identified approaches are: AE & DEMM (98.27% TN-rate), LSTM & EMM (96.14% TN-rate) and CABIRCH & EMM (94.57% TN-rate). The selection depends on the user's system requirements and on the further analysis algorithms contained in the self-healing pipeline, which were discussed in Chapter 7.

With respect to the research objectives (see Section 1.1): RO-1 defines, based on the challenges and available strategies, general requirements for ML-based anomaly detection in this application domain in Chapter 4. RO-2 aims to provide further guidance to narrow down applicable solutions meeting the requirements, which are also discussed in Chapter 4.
Lastly, a solution is developed and implemented in Chapters 5-6 and evaluated in Chapter 7, working towards zero-touch administration (RO-3).

Limitations

When applying CABIRCH with continuous learning, it is capable of detecting change points, which mark the beginning and the end of e.g. abnormal behavior. As such change points mostly refer to single data points, it is difficult to differentiate them from false alarms. We therefore provided the option of running CABIRCH in a tumbling window, but the tumbling window is highly dependent on the window size. To overcome this limitation, our proposed hyperparameter optimization approach can be applied to determine suitable window sizes. When applying the hyperparameter optimization, the technique suffers from the same problem, as it also utilizes a tumbling window. Here, the problem arises that for long-lasting anomalies the ML models are going to be adapted towards hyperparameters covering this abnormal behavior best. This can be intentional, as long-lasting anomalies can be seen as chronic behavior, where remediation processes could not determine mitigating actions, but the anomaly stays in a degraded state without failure.

To overcome the cold start problem with fast convergence towards optimal hyperparameters, a wide range of different values should be applied in order to enhance the probability of finding the globally optimal values, resulting in a large population of ML models. Due to resource limitations, the population size is also limited. Consequently, the search might converge to local optima which are far away from the global optimum. The genetic algorithm counters this problem by applying mutations to find not-yet-seen values. Still, this process might be time-demanding. On the other hand, the search space can be reduced by limiting the search intervals of the individual hyperparameters. As this requires domain-specific knowledge as well as knowledge about the specific ML model, automatic methods should be investigated.
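The genetic search with mutation described above can be sketched for a single real-valued hyperparameter. The fitness function below is a toy stand-in peaking inside the search interval, whereas the thesis uses the label-free IFTM fitness from Chapter 6:

```python
import random

def genetic_search(fitness, bounds, pop_size=10, generations=20,
                   mutation_rate=0.3, seed=42):
    """Minimal genetic search over one real-valued hyperparameter:
    selection keeps the fittest half, crossover averages two parents,
    and mutation perturbs children to escape local optima."""
    rng = random.Random(seed)
    lo, hi = bounds
    pop = [rng.uniform(lo, hi) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]              # selection: keep the fittest
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            child = (a + b) / 2                     # crossover: midpoint
            if rng.random() < mutation_rate:        # mutation: explore new values
                child = min(hi, max(lo, child + rng.gauss(0, (hi - lo) * 0.1)))
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness)

# Toy fitness peaking at 0.7 inside the search interval [0, 1].
best = genetic_search(lambda x: -(x - 0.7) ** 2, (0.0, 1.0))
```

Because the fittest parents survive each generation, the best individual never worsens, while mutation keeps probing values outside the converged cluster.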
Crowd-sourcing approaches might help to build up databases of ML models for similar system configurations and services in the future.

Unsupervised approaches come with problematic drawbacks due to the lack of pre-labeled training data. The point-wise detection results show that such techniques suffer from higher false alarm rates. On the other hand, unsupervised techniques can be applied out of the box. In order to approach the problem of false alarms, we provided options for future work in Chapter 7.

In conclusion, this work showed the applicability of anomaly detection approaches for cloud monitoring and provides guidance for selecting appropriate detection approaches with respect to the demanded requirements. Especially the aspects towards a zero-touch solution for a holistic AIOps platform were presented and discussed. The results of the anomaly detection will be a starting point for future higher-level analysis techniques providing holistic self-healing capabilities.

Bibliography

[1] Mark Hung. Leading the IoT, Gartner insights on how to lead in a connected world. Gartner Research, pages 1–29, 2017.

[2] VNI Cisco. Cisco visual networking index: Forecast and trends, 2017–2022. White Paper, 1, 2018.

[3] Eric Bauer, Xuemei Zhang, and Douglas A Kimber. Practical system reliability. John Wiley & Sons, 2009.

[4] Mohammad Hossein Ghahramani, MengChu Zhou, and Chi Tin Hon. Toward cloud computing QoS architecture: Analysis of cloud systems and cloud services. IEEE/CAA Journal of Automatica Sinica, 4(1):6–18, 2017.

[5] Jennifer Davis and Ryn Daniels. Effective DevOps: Building a culture of collaboration, affinity, and tooling at scale. O'Reilly Media, Inc., 2016.

[6] Betsy Beyer, Chris Jones, Jennifer Petoff, and Niall Richard Murphy. Site Reliability Engineering: How Google Runs Production Systems. O'Reilly Media, Inc., 2016.

[7] Gartner, Inc. AIOps Market Guide, 2019. Available online: https://www.bmc.com/blogs/gartner-aiops-market-guide/ (Accessed 30-10-2019).

[8] Anton Gulenko, Marcel Wallschläger, Florian Schmidt, Odej Kao, and Feng Liu. A system architecture for real-time anomaly detection in large-scale NFV systems. Procedia Computer Science, 94:491–496, 2016.

[9] Tian Zhang, Raghu Ramakrishnan, and Miron Livny. BIRCH: An efficient data clustering method for very large databases. In ACM SIGMOD Record, volume 25, pages 103–114. ACM, 1996.

[10] Frank E Grubbs. Procedures for detecting outlying observations in samples. Technometrics, 11(1):1–21, 1969.

[11] Vic Barnett and Toby Lewis. Outliers in statistical data. Wiley, 1974.

[12] Douglas M Hawkins. Identification of outliers, volume 11. Springer, 1980.

[13] Victoria Hodge and Jim Austin. A survey of outlier detection methodologies. Artificial Intelligence Review, 22(2):85–126, 2004.

[14] Mia Hubert and Ellen Vandervieren. An adjusted boxplot for skewed distributions. Computational Statistics & Data Analysis, 52(12):5186–5201, 2008.

[15] Charu C Aggarwal. High-dimensional outlier detection: The subspace method. In Outlier Analysis, pages 135–167. Springer, 2013.

[16] Leonid Kalinichenko, Ivan Shanin, and Ilia Taraban. Methods for anomaly detection: A survey. In CEUR Workshop Proceedings, 2014.

[17] Markus Goldstein and Seiichi Uchida. A comparative evaluation of unsupervised anomaly detection algorithms for multivariate data. PloS one, 11(4):e0152173, 2016.

101 102

[18] Varun Chandola, Arindam Banerjee, and Vipin Kumar. Anomaly detection: A survey. ACM computing surveys (CSUR), 41(3):1–58, 2009. [19] Nan Jiang and Le Gruenwald. Research issues in data stream association rule mining. ACM Sigmod Record, 35(1):14–19, 2006. [20] Shehroz S Khan and Michael G Madden. A survey of recent trends in one class classifcation. In Irish conference on artifcial intelligence and cognitive science, pages 188–197. Springer, 2009. [21] Markus Goldstein and Seiichi Uchida. Behavior analysis using unsupervised anomaly detection. In The 10th Joint Workshop on Machine Perception and Robotics (MPR 2014). Online, 2014. [22] Anitha Ramchandran and Arun Kumar Sangaiah. Unsupervised anomaly de- tection for high dimensional data—an exploratory analysis. In Computational Intelligence for Multimedia Big Data on the Cloud with Engineering Applica- tions, pages 233–251. Elsevier, 2018. [23] Shiblee Sadik and Le Gruenwald. Research issues in outlier detection for data streams. Acm Sigkdd Explorations Newsletter, 15(1):33–40, 2014. [24] Algirdas Avizienis, J-C Laprie, Brian Randell, and Carl Landwehr. Basic con- cepts and taxonomy of dependable and secure computing. IEEE transactions on dependable and secure computing, 1(1):11–33, 2004. [25] Kashi Venkatesh Vishwanath and Nachiappan Nagappan. Characterizing cloud computing hardware reliability. In Proceedings of the 1st ACM symposium on Cloud computing, pages 193–204, 2010. [26] Jim Gray. A census of tandem system availability between 1985 and 1990. IEEE Transactions on reliability, 39(4):409–418, 1990. [27] Luiz André Barroso and Urs Hölzle. The datacenter as a computer: An introduc- tion to the design of warehouse-scale machines. Synthesis lectures on computer architecture, 4(1):1–108, 2009. [28] David Oppenheimer, Archana Ganapathi, and David A Patterson. Why do internet services fail, and what can be done about it? In USENIX symposium on internet technologies and systems, volume 67. Seattle, WA, 2003. 
[29] Domenico Cotroneo, Roberto Pietrantuono, Stefano Russo, and Kishor Trivedi. How do bugs surface? a comprehensive study on the characteristics of software bugs manifestation. Journal of Systems and Software, 113:27–43, 2016. [30] Yennun Huang, Chandra Kintala, Nick Kolettis, and N Dudley Fulton. Software rejuvenation: Analysis, module and applications. In Twenty-ffth international symposium on fault-tolerant computing. Digest of papers, pages 381–390. IEEE, 1995. [31] Vittorio Castelli, Richard E Harper, Philip Heidelberger, Steven W Hunter, Kishor S Trivedi, Kalyanaraman Vaidyanathan, and William P Zeggert. Proac- tive management of software aging. IBM Journal of Research and Development, 45(2):311–332, 2001. [32] Alberto Avritzer and Elaine J Weyuker. Monitoring smoothly degrading systems for increased dependability. Empirical Software Engineering, 2(1):59–77, 1997. [33] Michael Grottke, Rivalino Matias, and Kishor S Trivedi. The fundamentals of software aging. In 2008 IEEE International conference on software reliability engineering workshops (ISSRE Wksp), pages 1–6. Ieee, 2008. [34] Massimo Condoluci, Fragkiskos Sardis, and Toktam Mahmoodi. Softwarization and virtualization in 5g networks for smart cities. In International Internet of Things Summit, pages 179–186. Springer, 2015. 103

[35] Ibrahim Afolabi, Tarik Taleb, Konstantinos Samdanis, Adlen Ksentini, and Hannu Flinck. Network slicing and softwarization: A survey on principles, en- abling technologies, and solutions. IEEE Communications Surveys & Tutorials, 20(3):2429–2453, 2018. [36] Clearwater Service Architecture, 2019. Available online: https://www.daitan.com/wp-content/uploads/2015/08/clearwater.png (Ac- cessed 31-10-2019). [37] CORD Design Notes. Central ofce re-architected as a datacenter (cord), 2016. [38] Larry Peterson, Ali Al-Shabibi, Tom Anshutz, Scott Baker, Andy Bavier, Saurav Das, Jonathan Hart, Guru Palukar, and William Snow. Central ofce re- architected as a data center. IEEE Communications Magazine, 54(10):96–101, 2016. [39] Anton Gulenko, Florian Schmidt, Alexander Acker, Marcel Wallschläger, Odej Kao, and Feng Liu. Detecting anomalous behavior of black-box services mod- eled with distance-based online clustering. In 2018 IEEE 11th International Conference on Cloud Computing (CLOUD), pages 912–915. IEEE, 2018. [40] Brian Babcock, Shivnath Babu, Mayur Datar, Rajeev Motwani, and Jennifer Widom. Models and issues in data stream systems. In Proceedings of the twenty- frst ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, pages 1–16. ACM, 2002. [41] Susanne Albers. Online algorithms: a survey. Mathematical Programming, 97(1- 2):3–26, 2003. [42] Susanne Albers. Online algorithms. In Interactive Computation, pages 143–164. Springer, 2006. [43] Greg Nelson and Jef Wright. Real time decision support: creating a fexible architecture for real time analytics. DSSResources. COM, 11:18, 2005. [44] Ethem Alpaydin. Introduction to machine learning. MIT press, 2009. [45] Max Landauer, Markus Wurzenberger, Florian Skopik, Giuseppe Settanni, and Peter Filzmoser. Time series analysis: unsupervised anomaly detection beyond outlier detection. In International Conference on Information Security Practice and Experience, pages 19–36. Springer, 2018. [46] Shan Suthaharan. 
Machine learning models and algorithms for big data classi- fcation. Integr. Ser. Inf. Syst, 36:1–12, 2016. [47] Hendrik Fichtenberger, Marc Gillé, Melanie Schmidt, Chris Schwiegelshohn, and Christian Sohler. Bico: Birch meets coresets for k-means clustering. In European Symposium on Algorithms, pages 481–492. Springer, 2013. [48] Charu C Aggarwal and Chandan K Reddy. Data clustering: algorithms and applications. CRC press, 2013. [49] Feng Cao, Martin Estert, Weining Qian, and Aoying Zhou. Density-based clus- tering over an evolving data stream with noise. In Proceedings of the 2006 SIAM international conference on , pages 328–339. SIAM, 2006. [50] Irene Ntoutsi, , Themis Palpanas, Peer Kröger, and Hans-Peter Kriegel. Density-based projected clustering over high dimensional data streams. In Proceedings of the 2012 SIAM International Conference on Data Mining, pages 987–998. SIAM, 2012. [51] Tian Zhang. Data clustering for very large datasets plus applications. Techni- cal report, University of Wisconsin-Madison Department of Computer Sciences, 1997. 104

[52] Tian Zhang, Raghu Ramakrishnan, and Miron Livny. Birch: A new data clus- tering algorithm and its applications. Data Mining and Knowledge Discovery, 1(2):141–182, 1997. [53] D. T. Grozdic and S. T. Jovicic. Whispered speech recognition using deep de- noising autoencoder and inverse fltering. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 25(12):2313–2322, Dec 2017. [54] Kun Zeng, Jun Yu, Ruxin Wang, Cuihua Li, and Dacheng Tao. Coupled deep autoencoder for single image super-resolution. IEEE transactions on cybernetics, 47(1):27–37, 2017. [55] Ahmad Javaid, Quamar Niyaz, Weiqing Sun, and Mansoor Alam. A deep learn- ing approach for network intrusion detection system. In Proceedings of the 9th EAI International Conference on Bio-inspired Information and Communications Technologies (formerly BIONETICS), pages 21–26. ICST (Institute for Com- puter Sciences, Social-Informatics and Telecommunications Engineering), 2016. [56] Geofrey E Hinton and Ruslan R Salakhutdinov. Reducing the dimensionality of data with neural networks. science, 313(5786):504–507, 2006. [57] P Kingma Diederik, Max Welling, et al. Auto-encoding variational bayes. In Pro- ceedings of the International Conference on Learning Representations (ICLR), 2014. [58] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013. [59] LR Medsker and LC Jain. Recurrent neural networks. Design and Applications, 5, 2001. [60] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997. [61] David MQ Nelson, Adriano CM Pereira, and Renato A de Oliveira. Stock mar- ket’s price movement prediction with lstm neural networks. In Neural Networks (IJCNN), 2017 International Joint Conference on, pages 1419–1426. IEEE, 2017. [62] Jung-Woo Ha, Adrian Kim, Dongwon Kim, Jeonghee Kim, Jeong-Whun Kim, Jin Joo Park, and Borim Ryu. 
Predicting high-risk prognosis from diagnostic histories of adult disease patients via deep recurrent neural networks. In Big Data and Smart Computing (BigComp), 2017 IEEE International Conference on, pages 394–399. IEEE, 2017. [63] Joe Yue-Hei Ng, Matthew Hausknecht, Sudheendra Vijayanarasimhan, Oriol Vinyals, Rajat Monga, and George Toderici. Beyond short snippets: Deep networks for video classifcation. In Computer Vision and Pattern Recognition (CVPR), 2015 IEEE Conference on, pages 4694–4702. IEEE, 2015. [64] Klaus Gref, Rupesh K Srivastava, Jan Koutník, Bas R Steunebrink, and Jürgen Schmidhuber. Lstm: A search space odyssey. IEEE transactions on neural networks and learning systems, 28(10):2222–2232, 2016. [65] Kyle Hundman, Valentino Constantinou, Christopher Laporte, Ian Colwell, and Tom Soderstrom. Detecting spacecraft anomalies using lstms and nonparametric dynamic thresholding. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 387–395. ACM, 2018. [66] Pankaj Malhotra, Lovekesh Vig, Gautam Shrof, and Puneet Agarwal. Long short term memory networks for anomaly detection in time series. In Proceedings, page 89. Presses universitaires de Louvain, 2015. [67] Dong Oh and Il Yun. Residual error based anomaly detection using auto-encoder in smd machine sound. Sensors, 18(5):1308, 2018. 105

[68] Valeriy Zakamulin. Market Timing with Moving Averages: The Anatomy and Performance of Trading Rules. Springer, 2017. [69] J Stuart Hunter. The exponentially weighted moving average. Journal of quality technology, 18(4):203–210, 1986. [70] Peter R Winters. Forecasting sales by exponentially weighted moving averages. Management science, 6(3):324–342, 1960. [71] Charles C Holt. Forecasting seasonals and trends by exponentially weighted moving averages. International journal of forecasting, 20(1):5–10, 2004. [72] Jinn-Tsong Tsai, Jyh-Horng Chou, and Tung-Kuan Liu. Tuning the structure and parameters of a neural network by using hybrid taguchi-genetic algorithm. IEEE Transactions on Neural Networks, 17(1):69–80, 2006. [73] Stefan Lessmann, Robert Stahlbock, and Sven F Crone. Optimizing hyperpa- rameters of support vector machines by genetic algorithms. In IC-AI, pages 74–82, 2005. [74] Steven R Young, Derek C Rose, Thomas P Karnowski, Seung-Hwan Lim, and Robert M Patton. Optimizing deep learning hyper-parameters through an evo- lutionary algorithm. In Proceedings of the Workshop on Machine Learning in High-Performance Computing Environments, page 4. ACM, 2015. [75] Masanori Suganuma, Shinichi Shirakawa, and Tomoharu Nagao. A genetic pro- gramming approach to designing convolutional neural network architectures. In Proceedings of the Genetic and Evolutionary Computation Conference, pages 497–504. ACM, 2017. [76] Stefano Cagnoni, Andrew B Dobrzeniecki, Riccardo Poli, and Jacquelyn C Yanch. Genetic algorithm-based interactive segmentation of 3d medical images. Image and Vision Computing, 17(12):881–895, 1999. [77] Markus Gudmundsson, Essam A El-Kwae, and Mansur R Kabuka. Edge detec- tion in medical images using a genetic algorithm. IEEE transactions on medical imaging, 17(3):469–474, 1998. [78] Payel Ghosh and Melanie Mitchell. Segmentation of medical images using a genetic algorithm. 
In Proceedings of the 8th annual conference on Genetic and evolutionary computation, pages 1171–1178. ACM, 2006. [79] Kyoung-jae Kim and Ingoo Han. Genetic algorithms approach to feature dis- cretization in artifcial neural networks for the prediction of stock price index. Expert systems with Applications, 19(2):125–132, 2000. [80] Franklin Allen and Risto Karjalainen. Using genetic algorithms to fnd technical trading rules. Journal of fnancial Economics, 51(2):245–271, 1999. [81] Ping-Feng Pai and Wei-Chiang Hong. Forecasting regional electricity load based on recurrent support vector machines with genetic algorithms. Electric Power Systems Research, 74(3):417–425, 2005. [82] Mohsen Nasseri, Keyvan Asghari, and MJ Abedini. Optimized scenario for rain- fall forecasting using genetic algorithm coupled with artifcial neural network. Expert systems with applications, 35(3):1415–1421, 2008. [83] A Sedki, Driss Ouazar, and E El Mazoudi. Evolving neural network using real coded genetic algorithm for daily rainfall–runof forecasting. Expert Systems with Applications, 36(3):4523–4527, 2009. [84] Ting Yee Lim. Structured population genetic algorithms: a literature survey. Artifcial Intelligence Review, 41(3):385–399, 2014. 106

[85] JD Da Silva and PO Simoni. The island model parallel ga and uncertainty reasoning in the correspondence problem. In IJCNN’01. International Joint Conference on Neural Networks. Proceedings (Cat. No. 01CH37222), volume 3, pages 2247–2252. IEEE, 2001. [86] Enrique Alba, Hugo Alfonso, and Bernabé Dorronsoro. Advanced models of cellular genetic algorithms evaluated on sat. In Proceedings of the 7th annual conference on Genetic and evolutionary computation, pages 1123–1130. ACM, 2005. [87] Gregory S Hornby. Alps: the age-layered population structure for reducing the problem of premature convergence. In Proceedings of the 8th annual conference on Genetic and evolutionary computation, pages 815–822. ACM, 2006. [88] Reza Akbari, Vahid Zeighami, and Koorush Ziarati. Mlga: A multilevel co- operative genetic algorithm. In 2010 IEEE Fifth International Conference on Bio-Inspired Computing: Theories and Applications (BIC-TA), pages 271–277. IEEE, 2010. [89] JB Brown. Classifers and their metrics quantifed. Molecular informatics, 37(1- 2):1700127, 2018. [90] Brian W Matthews. Comparison of the predicted and observed secondary struc- ture of t4 phage lysozyme. Biochimica et Biophysica Acta (BBA)-Protein Struc- ture, 405(2):442–451, 1975. [91] David Martin Powers. Evaluation: from precision, recall and f-measure to roc, informedness, markedness and correlation. 2011. [92] Lin Tan, Chen Liu, Zhenmin Li, Xuanhui Wang, Yuanyuan Zhou, and Chengx- iang Zhai. Bug characteristics in open source software. Empirical software engineering, 19(6):1665–1705, 2014. [93] Jim Gray. Why do computers stop and what can be done about it? In Sympo- sium on reliability in distributed software and database systems, pages 3–12. Los Angeles, CA, USA, 1986. [94] George Candea, Stefan Bucur, and Cristian Zamfr. Automated software testing as a service. In Proceedings of ACM Symposium on Cloud Computing, pages 155–160. ACM, 2010. [95] Michael Pradel and Thomas R Gross. 
Leveraging test generation and specifca- tion mining for automated bug detection without false positives. In Proceedings of Intl. Conference on Software Engineering, pages 288–298. IEEE Press, 2012. [96] Michael Grottke and Kishor S Trivedi. Fighting bugs: Remove, retry, replicate, and rejuvenate. Computer, 40(2):107–109, 2007. [97] Domenico Cotroneo, Roberto Natella, Roberto Pietrantuono, and Stefano Russo. A survey of software aging and rejuvenation studies. ACM Journal on Emerging Technologies in Computing Systems (JETC), 10(1):1–34, 2014. [98] Sachin Garg, Aad Van Moorsel, Kalyanaraman Vaidyanathan, and Kishor S Trivedi. A methodology for detection and estimation of software aging. In Proceedings Ninth International Symposium on Software Reliability Engineering (Cat. No. 98TB100257), pages 283–292. IEEE, 1998. [99] Jean Araujo, Rubens Matos, Paulo Maciel, Rivalino Matias, and Ibrahim Be- icker. Experimental evaluation of software aging efects on the eucalyptus cloud computing infrastructure. In Proceedings of the Middleware 2011 Industry Track Workshop, pages 1–7, 2011. [100] Carlos Melo, Jean Araujo, Vandi Alves, and Paulo Romero Martins Maciel. In- vestigation of software aging efects on the openstack cloud computing platform. JSW, 12(2):125–137, 2017. 107

[101] João Paulo Magalhães and Luis Moura Silva. Prediction of performance anoma- lies in web-applications based-on software aging scenarios. In 2010 IEEE second international workshop on software aging and rejuvenation, pages 1–7. IEEE, 2010. [102] Hai-Ning Meng, Yong Qi, Di Hou, and Ying Chen. A rough wavelet network model with genetic algorithm and its application to aging forecasting of ap- plication server. In 2007 International Conference on Machine Learning and Cybernetics, volume 5, pages 3034–3039. IEEE, 2007. [103] Karen J Cassidy, Kenny C Gross, and Amir Malekpour. Advanced pattern recognition for detection of complex software aging phenomena in online trans- action processing servers. In Proceedings international conference on dependable systems and networks, pages 478–482. IEEE, 2002. [104] Michael Grottke, Lei Li, Kalyanaraman Vaidyanathan, and Kishor S Trivedi. Analysis of software aging in a web server. IEEE Transactions on reliability, 55(3):411–420, 2006. [105] Jun Zhang, Yang Xiang, Yu Wang, Wanlei Zhou, Yong Xiang, and Yong Guan. Network trafc classifcation using correlation information. IEEE Transactions on Parallel and Distributed systems, 24(1):104–117, 2012. [106] Albert Mestres, Alberto Rodriguez-Natal, Josep Carner, Pere Barlet-Ros, Ed- uard Alarcón, Marc Solé, Victor Muntés-Mulero, David Meyer, Sharon Barkai, Mike J Hibbett, et al. Knowledge-defned networking. ACM SIGCOMM Com- puter Communication Review, 47(3):2–10, 2017. [107] Mark Hall, Eibe Frank, Geofrey Holmes, Bernhard Pfahringer, Peter Reute- mann, and Ian H Witten. The weka data mining software: an update. ACM SIGKDD explorations newsletter, 11(1):10–18, 2009. [108] J. R. Quinlan. C4. 5: programs for machine learning. Elsevier, 2014. [109] Shailendra Sahu and Babu M Mehtre. Network intrusion detection system using j48 decision tree. In 2015 International Conference on Advances in Computing, Communications and Informatics (ICACCI), pages 2023–2026. IEEE, 2015. 
[110] Shadi Aljawarneh, Muneer Bani Yassein, and Mohammed Aljundi. An enhanced j48 classifcation algorithm for the anomaly intrusion detection systems. Cluster Computing, 22(5):10549–10565, 2019. [111] R Delshi Howsalya Devi and P Deepika. An automated diagnosis of breast cancer using farthest frst clustering and decision tree j48 classifer. Advances in Natural and Applied Sciences, 10(10 SE):161–167, 2016. [112] John Richard D Kho and Larry A Vea. Credit card fraud detection based on transaction behavior. In TENCON 2017-2017 IEEE Region 10 Conference, pages 1880–884. IEEE, 2017. [113] Niels Landwehr, Mark Hall, and Eibe Frank. Logistic model trees. Machine learning, 59(1-2):161–205, 2005. [114] Jerome Friedman, Trevor Hastie, Robert Tibshirani, et al. Additive logistic regression: a statistical view of boosting (with discussion and a rejoinder by the authors). The annals of statistics, 28(2):337–407, 2000. [115] Bayu Adhi Tama and Kyung-Hyune Rhee. Hfste: Hybrid feature selections and tree-based classifers ensemble for intrusion detection system. IEICE TRANS- ACTIONS on Information and Systems, 100(8):1729–1737, 2017. [116] Kyriakos Stefanidis and Artemios G Voyiatzis. An hmm-based anomaly detec- tion approach for scada systems. In IFIP International Conference on Informa- tion Security Theory and Practice, pages 85–99. Springer, 2016. 108

[117] Geof Hulten, Laurie Spencer, and Pedro Domingos. Mining time-changing data streams. In Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining, pages 97–106. ACM, 2001. [118] Wassily Hoefding. Probability inequalities for sums of bounded random vari- ables. In The Collected Works of Wassily Hoefding, pages 409–426. Springer, 1994. [119] Kajal Rai, M Syamala Devi, and Ajay Guleria. Decision tree based algorithm for intrusion detection. International Journal of Advanced Networking and Ap- plications, 7(4):2828, 2016. [120] Mrudula Gudadhe, Prakash Prasad, and Lecturer Kapil Wankhade. A new data mining based network intrusion detection model. In 2010 International Conference on Computer and Communication Technology (ICCCT), pages 731– 735. IEEE, 2010. [121] Swee Chuan Tan, Kai Ming Ting, and Tony Fei Liu. Fast anomaly detection for streaming data. In Twenty-Second International Joint Conference on Artifcial Intelligence, 2011. [122] Asmah Muallem, Sachhin Shetty, Jan W Pan, Juan Zhao, and Biswajit Biswal. Hoefding tree algorithms for anomaly detection in streaming datasets: A survey. Journal of Information Security, 8(4), 2017. [123] Leo Breiman. Random forests. Machine learning, 45(1):5–32, 2001. [124] Yaqi Li, Chun Yan, Wei Liu, and Maozhen Li. Research and application of random forest model in mining automobile insurance fraud. In 2016 12th In- ternational Conference on Natural Computation, Fuzzy Systems and Knowledge Discovery (ICNC-FSKD), pages 1756–1761. IEEE, 2016. [125] Charissa Ann Ronao and Sung-Bae Cho. Anomalous query access detection in rbac-administered databases with random forest and pca. Information Sciences, 369:238–250, 2016. [126] George H John and Pat Langley. Estimating continuous distributions in bayesian classifers. In Proceedings of the Eleventh conference on Uncertainty in artifcial intelligence, pages 338–345. Morgan Kaufmann Publishers Inc., 1995. 
[127] Abdallah Abbey Sebyala, Temitope Olukemi, Lionel Sacks, and Dr Lionel Sacks. Active platform security through intrusion detection using naive bayesian net- work for anomaly detection. In London Communications Symposium, pages 1–5. Citeseer, 2002. [128] Z Muda, W Yassin, MN Sulaiman, and NI Udzir. Intrusion detection based on k-means clustering and naïve bayes classifcation. In 2011 7th International Conference on Information Technology in Asia, pages 1–6. IEEE, 2011. [129] Shweta Kharya, Shika Agrawal, and Sunita Soni. Naive bayes classifers: A prob- abilistic detection model for breast cancer. International Journal of Computer Applications, 92(10):0975–8887, 2014. [130] George Dimitoglou, James A Adams, and Carol M Jim. Comparison of the c4. 5 and a naïve bayes classifer for the prediction of lung cancer survivability. arXiv preprint arXiv:1206.1121, 2012. [131] W Nor Haizan W Mohamed, Mohd Najib Mohd Salleh, and Abdul Halim Omar. A comparative study of reduced error pruning method in decision tree algo- rithms. In 2012 IEEE International conference on control system, computing and engineering, pages 392–397. IEEE, 2012. [132] Wayne Iba and Pat Langley. Induction of one-level decision trees. In Machine Learning Proceedings 1992, pages 233–240. Elsevier, 1992. 109

[133] Ron Kohavi. The power of decision tables. In European conference on machine learning, pages 174–189. Springer, 1995. [134] William W Cohen. Fast efective rule induction. In Machine learning proceedings 1995, pages 115–123. Elsevier, 1995. [135] Robert C Holte. Very simple classifcation rules perform well on most commonly used datasets. Machine learning, 11(1):63–90, 1993. [136] Eibe Frank and Ian H Witten. Generating accurate rule sets without global optimization. 1998. [137] John C Platt. Fast training of support vector machines using sequential minimal optimization, advances in kernel methods. Support Vector Learning, pages 185– 208, 1999. [138] S Saibharath and G Geethakumari. Cloud forensics: evidence collection and pre- liminary analysis. In 2015 IEEE International Advance Computing Conference (IACC), pages 464–467. IEEE, 2015. [139] Neeraj Chavan, Fabio Di Troia, and Mark Stamp. A comparative analysis of android malware. arXiv preprint arXiv:1904.00735, 2019. [140] PSS Siva Krishna and P Venkateswara Rao. An analysis of malware classifca- tion technique by using machine learning. International Journal of Computer Applications, 975:8887. [141] Mohsen Rezvani. Assessment methodology for anomaly-based intrusion detec- tion in cloud computing. Journal of AI and Data Mining, 6(2):387–397, 2018. [142] Pedro Garcia-Teodoro, Jesus Diaz-Verdejo, Gabriel Maciá-Fernández, and En- rique Vázquez. Anomaly-based network intrusion detection: Techniques, sys- tems and challenges. computers & security, 28(1-2):18–28, 2009. [143] Carla Sauvanaud, Kahina Lazri, Mohamed Kaâniche, and Karama Kanoun. Anomaly detection and root cause localization in virtual network functions. In Software Reliability Engineering (ISSRE), 2016 IEEE 27th International Sym- posium on, pages 196–206. IEEE, 2016. [144] Kirila Adamova, Dominik Schatzmann, Bernhard Plattner, and Paul Smith. Network anomaly detection in the cloud: The challenges of virtual service mi- gration. 
In 2014 IEEE International Conference on Communications (ICC), pages 3770–3775. IEEE, 2014. [145] Tian Huang, Yongxin Zhu, Yafei Wu, Stéphane Bressan, and Gillian Dobbie. Anomaly detection and identifcation scheme for vm live migration in cloud infrastructure. Future Generation Computer Systems, 56:736–745, 2016. [146] Zakia Ferdousi and Akira Maeda. Unsupervised outlier detection in time se- ries data. In 22nd International Conference on Data Engineering Workshops (ICDEW’06), pages x121–x121. IEEE, 2006. [147] Nikolay Laptev, Saeed Amizadeh, and Ian Flint. Generic and scalable framework for automated time-series anomaly detection. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1939–1947. ACM, 2015. [148] Christopher Kruegel and Giovanni Vigna. Anomaly detection of web-based at- tacks. In Proceedings of the 10th ACM conference on Computer and communi- cations security, pages 251–261. ACM, 2003. [149] Chengwei Wang, Krishnamurthy Viswanathan, Lakshminarayan Choudur, Van- ish Talwar, Wade Satterfeld, and Karsten Schwan. Statistical techniques for on- line anomaly detection in data centers. In 12th IFIP/IEEE International Sym- posium on Integrated Network Management (IM 2011) and Workshops, pages 385–392. IEEE, 2011. 110

[150] Jefrey P Buzen and Annie W Shum. Masf-multivariate adaptive statistical fltering. In Int. CMG Conference, pages 1–10, 1995. [151] Maciej Szmit and Anna Szmit. Usage of modifed holt-winters method in the anomaly detection of network trafc: Case studies. Journal of Computer Net- works and Communications, 2012, 2012. [152] He Yan, Ashley Flavel, Zihui Ge, Alexandre Gerber, Dan Massey, Christos Pa- padopoulos, Hiren Shah, and Jennifer Yates. Argus: End-to-end service anomaly detection and localization from an isp’s point of view. In 2012 Proceedings IEEE INFOCOM, pages 2756–2760. IEEE, 2012. [153] Anukool Lakhina, Mark Crovella, and Christophe Diot. Diagnosing network- wide trafc anomalies. In ACM SIGCOMM computer communication review, volume 34, pages 219–230. ACM, 2004. [154] Wei Xiao, Xiaolin Huang, Fan He, Jorge Silva, Saba Emrani, and Arin Chaud- huri. Online robust principal component analysis with change point detection. IEEE Transactions on Multimedia, 22(1):59–68, 2019. [155] Mostafa Rahmani and Ping Li. Outlier detection and robust pca using a convex measure of innovation. In Advances in Neural Information Processing Systems, pages 14200–14210, 2019. [156] Yu Gu, Andrew McCallum, and Don Towsley. Detecting anomalies in network trafc using maximum entropy estimation. In Proceedings of the 5th ACM SIG- COMM conference on Internet Measurement, pages 32–32. USENIX Association, 2005. [157] JBD Caberera, B Ravichandran, and Raman K Mehra. Statistical trafc model- ing for network intrusion detection. In Proceedings 8th International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Sys- tems (Cat. No. PR00728), pages 466–473. IEEE, 2000. [158] Bernhard Schölkopf, John C Platt, John Shawe-Taylor, Alex J Smola, and Robert C Williamson. Estimating the support of a high-dimensional distri- bution. Neural computation, 13(7):1443–1471, 2001. [159] Fei Tony Liu, Kai Ming Ting, and Zhi-Hua Zhou. Isolation forest. 
In 2008 Eighth IEEE International Conference on Data Mining, pages 413–422. IEEE, 2008. [160] Markus M Breunig, Hans-Peter Kriegel, Raymond T Ng, and Jörg Sander. Lof: identifying density-based local outliers. In Proceedings of the 2000 ACM SIG- MOD international conference on Management of data, pages 93–104, 2000. [161] Christopher M Bishop. Pattern recognition and machine learning. springer, 2006. [162] Aarthi Reddy, Meredith Ordway-West, Melissa Lee, Matt Dugan, Joshua Whit- ney, Ronen Kahana, Brad Ford, Johan Muedsam, Austin Henslee, and Max Rao. Using gaussian mixture models to detect outliers in seasonal univariate network trafc. In 2017 IEEE Security and Privacy Workshops (SPW), pages 229–234. IEEE, 2017. [163] M Bahrololum and M Khaleghi. Anomaly intrusion detection system using gaussian mixture model. In 2008 Third International Conference on Convergence and Hybrid Information Technology, volume 1, pages 1162–1167. IEEE, 2008. [164] Rui Liu, Xiao-long Qian, Shu Mao, and Shuai-zheng Zhu. Research on anti- money laundering based on core decision tree algorithm. In 2011 Chinese Control and Decision Conference (CCDC), pages 4322–4325. IEEE, 2011. [165] Santosh Kumar, Sumit Kumar, and Sukumar Nandi. Multi-density clustering algorithm for anomaly detection using kdd’99 dataset. In International Confer- ence on Advances in Computing and Communications, pages 619–630. Springer, 2011. 111

[166] Raghavendra Chalapathy and Sanjay Chawla. Deep learning for anomaly detec- tion: A survey. arXiv preprint arXiv:1901.03407, 2019. [167] Apapan Pumsirirat and Liu Yan. Credit card fraud detection using deep learning based on auto-encoder and restricted boltzmann machine. International Journal of advanced computer science and applications, 9(1):18–25, 2018. [168] Alessandra De Paola, Salvatore Favaloro, Salvatore Gaglio, Giuseppe Lo Re, and Marco Morana. Malware detection through low-level features and stacked denoising autoencoders. In ITASEC, 2018. [169] Tie Luo and Sai G Nagarajan. Distributed anomaly detection using autoen- coder neural networks in wsn for iot. In 2018 IEEE International Conference on Communications (ICC), pages 1–6. IEEE, 2018. [170] Shen Zhang, Fei Ye, Bingnan Wang, and Thomas G Habetler. Semi-supervised learning of bearing anomaly detection via deep variational autoencoders. arXiv preprint arXiv:1912.01096, 2019. [171] Adrian Taylor, Sylvain Leblanc, and Nathalie Japkowicz. Anomaly detection in automobile control network data with long short-term memory networks. In 2016 IEEE International Conference on and Advanced Analytics (DSAA), pages 130–139. IEEE, 2016. [172] Sucheta Chauhan and Lovekesh Vig. Anomaly detection in ecg time signals via deep long short-term memory networks. In 2015 IEEE International Conference on Data Science and Advanced Analytics (DSAA), pages 1–7. IEEE, 2015. [173] Daehyung Park, Yuuna Hoshi, and Charles C Kemp. A multimodal anomaly detector for robot-assisted feeding using an lstm-based variational autoencoder. IEEE Robotics and Automation Letters, 3(3):1544–1551, 2018. [174] Jun Liu, Shuyu Chen, Zhen Zhou, and Tianshu Wu. An anomaly detection algo- rithm of cloud platform based on self-organizing maps. Mathematical Problems in Engineering, 2016, 2016. [175] Jont B Allen and Lawrence R Rabiner. A unifed approach to short-time fourier analysis and synthesis. Proceedings of the IEEE, 65(11):1558–1564, 1977. 
[176] Haowen Xu, Wenxiao Chen, Nengwen Zhao, Zeyan Li, Jiahao Bu, Zhihan Li, Ying Liu, Youjian Zhao, Dan Pei, Yang Feng, et al. Unsupervised anomaly detection via variational auto-encoder for seasonal kpis in web applications. In Proceedings of the 2018 World Wide Web Conference, pages 187–196. Interna- tional World Wide Web Conferences Steering Committee, 2018. [177] Jinwon An and Sungzoon Cho. Variational autoencoder based anomaly detection using reconstruction probability. Special Lecture on IE, 2(1), 2015. [178] Olumuyiwa Ibidunmoye, Ali-Reza Rezaie, and Erik Elmroth. Adaptive anomaly detection in performance metric streams. IEEE Transactions on Network and Service Management, 15(1):217–231, 2017. [179] Charu C Aggarwal. Outlier analysis. In Data mining, pages 237–263. Springer, 2015. [180] Douglas C Montgomery. Introduction to statistical quality control. John Wiley & Sons, 2007. [181] Subutai Ahmad, Alexander Lavin, Scott Purdy, and Zuha Agha. Unsupervised real-time anomaly detection for streaming data. Neurocomputing, 262:134–147, 2017. [182] Subutai Ahmad and Jef Hawkins. Properties of sparse distributed represen- tations and their application to hierarchical temporal memory. arXiv preprint arXiv:1503.07469, 2015. 112

[183] Domenico Cotroneo, Roberto Natella, and Stefano Rosiello. A fault correlation approach to detect performance anomalies in virtual network function chains. In Software Reliability Engineering (ISSRE), 2017 IEEE 28th International Symposium on, pages 90–100. IEEE, 2017.
[184] Daniel Joseph Dean, Hiep Nguyen, and Xiaohui Gu. Ubl: Unsupervised behavior learning for predicting performance anomalies in virtualized cloud systems. In Proceedings of the 9th international conference on Autonomic computing, pages 191–200. ACM, 2012.
[185] Paul S Bradley, Usama M Fayyad, Cory Reina, et al. Scaling clustering algorithms to large databases. In KDD, volume 98, pages 9–15, 1998.
[186] Fredrik Farnstrom, James Lewis, and Charles Elkan. Scalability for clustering algorithms revisited. SIGKDD explorations, 2(1):51–57, 2000.
[187] Marcel R Ackermann, Marcus Märtens, Christoph Raupach, Kamil Swierkot, Christiane Lammersen, and Christian Sohler. Streamkm++: A clustering algorithm for data streams. Journal of Experimental Algorithmics (JEA), 17:2–4, 2012.
[188] David Arthur and Sergei Vassilvitskii. k-means++: The advantages of careful seeding. In Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, pages 1027–1035. Society for Industrial and Applied Mathematics, 2007.
[189] Charu C Aggarwal, Jiawei Han, Jianyong Wang, and Philip S Yu. A framework for clustering evolving data streams. In Proceedings of the 29th international conference on Very large data bases-Volume 29, pages 81–92. VLDB Endowment, 2003.
[190] Philipp Kranen, Ira Assent, Corinna Baldauf, and Thomas Seidl. The clustree: indexing micro-clusters for anytime stream mining. Knowledge and information systems, 29(2):249–272, 2011.
[191] Aoying Zhou, Feng Cao, Weining Qian, and Cheqing Jin. Tracking clusters in evolving data streams over sliding windows. Knowledge and Information Systems, 15(2):181–214, 2008.
[192] Yixin Chen and Li Tu.
Density-based clustering for real-time stream data. In Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 133–142. ACM, 2007.
[193] Pedro Pereira Rodrigues, João Gama, and Joao Pedroso. Hierarchical clustering of time-series data streams. IEEE transactions on knowledge and data engineering, 20(5):615–627, 2008.
[194] Joao Gama, Pedro Pereira Rodrigues, and Luís Lopes. Clustering distributed sensor data streams using local processing and reduced communication. Intelligent Data Analysis, 15(1):3–28, 2011.
[195] Boris Lorbeer, Ana Kosareva, Bersant Deva, Dženan Softić, Peter Ruppel, and Axel Küpper. Variations on the clustering algorithm BIRCH. Big data research, 11:44–53, 2018.
[196] Robert Tibshirani, Guenther Walther, and Trevor Hastie. Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 63(2):411–423, 2001.
[197] Dongwei Guo, Jingwen Chen, Yingjie Chen, and Zhiyu Li. Lbirch: An improved birch algorithm based on link. In Proceedings of the 2018 10th International Conference on Machine Learning and Computing, pages 74–78. ACM, 2018.
[198] Sudipto Guha, Rajeev Rastogi, and Kyuseok Shim. Rock: A robust clustering algorithm for categorical attributes. Information systems, 25(5):345–366, 2000.

[199] Yangyang Li, Guangyuan Liu, Peidao Li, and Licheng Jiao. A large-scale data clustering algorithm based on birch and artificial immune network. In International Conference on Swarm Intelligence, pages 327–337. Springer, 2018.
[200] Leandro Nunes de Castro and Fernando J Von Zuben. ainet: an artificial immune network for data analysis. In Data mining: a heuristic approach, pages 231–260. IGI Global, 2002.
[201] James Bergstra and Yoshua Bengio. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13(Feb):281–305, 2012.
[202] Scott Kirkpatrick, C Daniel Gelatt, and Mario P Vecchi. Optimization by simulated annealing. Science, 220(4598):671–680, 1983.
[203] Samuel Xavier-de Souza, Johan AK Suykens, Joos Vandewalle, and Désiré Bollé. Coupled simulated annealing. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 40(2):320–335, 2009.
[204] Nicholas Metropolis, Arianna W Rosenbluth, Marshall N Rosenbluth, Augusta H Teller, and Edward Teller. Equation of state calculations by fast computing machines. The journal of chemical physics, 21(6):1087–1092, 1953.
[205] Jake Graser, Steven K Kauwe, and Taylor D Sparks. Machine learning and energy minimization approaches for crystal structure predictions: A review and new horizons. Chemistry of Materials, 30(11):3601–3612, 2018.
[206] Ruggero Bellio, Sara Ceschia, Luca Di Gaspero, Andrea Schaerf, and Tommaso Urli. Feature-based tuning of simulated annealing applied to the curriculum-based course timetabling problem. Computers & Operations Research, 65:83–92, 2016.
[207] Irene Samora, Mário J Franca, Anton J Schleiss, and Helena M Ramos. Simulated annealing in optimization of energy production in a water supply network. Water resources management, 30(4):1533–1547, 2016.
[208] L.M. Rasdi Rere, Mohamad Ivan Fanany, and Aniati Murni Arymurthy. Simulated annealing algorithm for deep learning. Procedia Computer Science, 72:137–144, 2015.
The Third Information Systems International Conference 2015.
[209] Frank Hutter, Holger H Hoos, and Kevin Leyton-Brown. Sequential model-based optimization for general algorithm configuration. In International conference on learning and intelligent optimization, pages 507–523. Springer, 2011.
[210] Donald R Jones, Matthias Schonlau, and William J Welch. Efficient global optimization of expensive black-box functions. Journal of Global optimization, 13(4):455–492, 1998.
[211] Bobak Shahriari, Kevin Swersky, Ziyu Wang, Ryan P Adams, and Nando De Freitas. Taking the human out of the loop: A review of bayesian optimization. Proceedings of the IEEE, 104(1):148–175, 2015.
[212] Jasper Snoek, Hugo Larochelle, and Ryan P Adams. Practical bayesian optimization of machine learning algorithms. In Advances in neural information processing systems, pages 2951–2959, 2012.
[213] Mauro Birattari, Zhi Yuan, Prasanna Balaprakash, and Thomas Stützle. F-race and iterated f-race: An overview. In Experimental methods for the analysis of optimization algorithms, pages 311–336. Springer, 2010.
[214] Oded Maron and Andrew W Moore. Hoeffding races: Accelerating model selection search for classification and function approximation. In Advances in neural information processing systems, pages 59–66, 1994.
[215] D Anderson and K Burnham. Model selection and multi-model inference. Second. NY: Springer-Verlag, 63, 2004.

[216] WJ Conover. Practical nonparametric statistics. John Wiley & Sons, Inc., New York, 1999.
[217] Manuel López-Ibáñez, Jérémie Dubois-Lacoste, Leslie Pérez Cáceres, Mauro Birattari, and Thomas Stützle. The irace package: Iterated racing for automatic algorithm configuration. Operations Research Perspectives, 3:43–58, 2016.
[218] Russell Eberhart and James Kennedy. A new optimizer using particle swarm theory. In MHS'95. Proceedings of the Sixth International Symposium on Micro Machine and Human Science, pages 39–43. IEEE, 1995.
[219] Clara Marina Martínez and Dongpu Cao. 2 - integrated energy management for electrified vehicles. In Clara Marina Martínez and Dongpu Cao, editors, Ihorizon-Enabled Energy Management for Electrified Vehicles, pages 15–75. Butterworth-Heinemann, 2019.
[220] Peter Wilson and H. Alan Mantooth. Chapter 10 - model-based optimization techniques. In Peter Wilson and H. Alan Mantooth, editors, Model-Based Engineering for Complex Electronic Systems, pages 347–367. Newnes, Oxford, 2013.
[221] Y. Shi and R. C. Eberhart. Empirical study of particle swarm optimization. In Proceedings of the 1999 Congress on Evolutionary Computation-CEC99 (Cat. No. 99TH8406), volume 3, pages 1945–1950 Vol. 3, July 1999.
[222] Riccardo Poli, James Kennedy, and Tim Blackwell. Particle swarm optimization. Swarm intelligence, 1(1):33–57, 2007.
[223] Riccardo Poli. An analysis of publications on particle swarm optimization applications. Essex, UK: Department of Computer Science, University of Essex, 2007.
[224] Wei Chen, Run-tong Zhang, Yong-ming Cai, and Fa-sheng Xu. Particle swarm optimization for constrained portfolio selection problems. In 2006 International Conference on Machine Learning and Cybernetics, pages 2425–2429. IEEE, 2006.
[225] Fasheng Xu and Wei Chen. Stochastic portfolio selection based on velocity limited particle swarm optimization. In 2006 6th World Congress on Intelligent Control and Automation, volume 1, pages 3599–3603. IEEE, 2006.
[226] Zhancheng Wang, Bufu Huang, Weimin Li, and Yangsheng Xu. Particle swarm optimization for operational parameters of series hybrid electric vehicle. In 2006 IEEE International Conference on Robotics and Biomimetics, pages 682–688. IEEE, 2006.
[227] AS Elwer, SA Wahsh, MO Khalil, and AM Nur-Eldeen. Intelligent fuzzy controller using particle swarm optimization for control of permanent magnet synchronous motor for electric vehicle. In IECON'03. 29th Annual Conference of the IEEE Industrial Electronics Society (IEEE Cat. No. 03CH37468), volume 2, pages 1762–1766. IEEE, 2003.
[228] Rui Xu, Georgios C Anagnostopoulos, and Donald C Wunsch. Multiclass cancer classification using semisupervised ellipsoid artmap and particle swarm optimization with gene expression data. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 4(1):65–77, 2007.
[229] David E Goldberg and John H Holland. Genetic algorithms and machine learning. Machine learning, 3(2):95–99, 1988.
[230] B. Sahmadi and D. Boughaci. Hybrid genetic algorithm with svm for medical data classification. In 2018 International Conference on Applied Smart Systems (ICASS), pages 1–6, Nov 2018.

[231] Jean-François Connolly, Eric Granger, and Robert Sabourin. Evolution of heterogeneous ensembles through dynamic particle swarm optimization for video-based face recognition. Pattern Recognition, 45(7):2460–2477, 2012.
[232] Michael Meissner, Michael Schmuker, and Gisbert Schneider. Optimized particle swarm optimization (opso) and its application to artificial neural network training. BMC bioinformatics, 7(1):125, 2006.
[233] Pablo Ribalta Lorenzo, Jakub Nalepa, Michal Kawulok, Luciano Sanchez Ramos, and José Ranilla Pastor. Particle swarm optimization for hyper-parameter selection in deep neural networks. In Proceedings of the genetic and evolutionary computation conference, pages 481–488. ACM, 2017.
[234] Shih-Wei Lin, Kuo-Ching Ying, Shih-Chieh Chen, and Zne-Jung Lee. Particle swarm optimization for parameter determination and feature selection of support vector machines. Expert systems with applications, 35(4):1817–1824, 2008.
[235] Matteo Fischetti and Matteo Stringher. Embedded hyper-parameter tuning by simulated annealing. arXiv preprint arXiv:1906.01504, 2019.
[236] Yağız Nalçakan and Tolga Ensari. Decision of neural networks hyperparameters with a population-based algorithm. In International Conference on Machine Learning, Optimization, and Data Science, pages 276–281. Springer, 2018.
[237] Thomas Elsken, Jan Hendrik Metzen, and Frank Hutter. Neural architecture search: A survey. Journal of Machine Learning Research, 20(55):1–21, 2019.
[238] Tobias Domhan, Jost Tobias Springenberg, and Frank Hutter. Speeding up automatic hyperparameter optimization of deep neural networks by extrapolation of learning curves. In Twenty-Fourth International Joint Conference on Artificial Intelligence, 2015.
[239] Chenxi Liu, Barret Zoph, Maxim Neumann, Jonathon Shlens, Wei Hua, Li-Jia Li, Li Fei-Fei, Alan Yuille, Jonathan Huang, and Kevin Murphy. Progressive neural architecture search.
In Proceedings of the European Conference on Computer Vision (ECCV), pages 19–34, 2018.
[240] P. Kiss, D. Fonyó, and T. Horváth. Blaboo: A lightweight black box optimizer framework. In 2018 World Symposium on Digital Intelligence for Systems and Machines (DISA), pages 213–218, Aug 2018.
[241] Carmelo JA Bastos Filho, Fernando B de Lima Neto, Anthony JCC Lins, Antonio IS Nascimento, and Marilia P Lima. A novel search algorithm based on fish school behavior. In 2008 IEEE International Conference on Systems, Man and Cybernetics, pages 2646–2651. IEEE, 2008.
[242] Rainer Storn and Kenneth Price. Differential evolution – a simple and efficient heuristic for global optimization over continuous spaces. Journal of global optimization, 11(4):341–359, 1997.
[243] James Bergstra, Brent Komer, Chris Eliasmith, Dan Yamins, and David D Cox. Hyperopt: a python library for model selection and hyperparameter optimization. Computational Science & Discovery, 8(1):014008, 2015.
[244] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. Scikit-learn: Machine learning in python. Journal of machine learning research, 12(Oct):2825–2830, 2011.
[245] Takuya Akiba, Shotaro Sano, Toshihiko Yanase, Takeru Ohta, and Masanori Koyama. Optuna: A next-generation hyperparameter optimization framework. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 2623–2631. ACM, 2019.

[246] Nikolaus Hansen and Andreas Ostermeier. Completely derandomized self-adaptation in evolution strategies. Evolutionary computation, 9(2):159–195, 2001.
[247] Daniel Golovin, Benjamin Solnik, Subhodeep Moitra, Greg Kochanski, John Karro, and D Sculley. Google vizier: A service for black-box optimization. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1487–1495. ACM, 2017.
[248] Richard Liaw, Eric Liang, Robert Nishihara, Philipp Moritz, Joseph E Gonzalez, and Ion Stoica. Tune: A research platform for distributed model selection and training. arXiv preprint arXiv:1807.05118, 2018.
[249] Lars Kotthoff, Chris Thornton, Holger H Hoos, Frank Hutter, and Kevin Leyton-Brown. Auto-weka: Automatic model selection and hyperparameter optimization in weka. In Automated Machine Learning, page 81. Springer, 2019.
[250] Matthias Feurer, Aaron Klein, Katharina Eggensperger, Jost Tobias Springenberg, Manuel Blum, and Frank Hutter. Auto-sklearn: Efficient and robust automated machine learning. In Automated Machine Learning, pages 113–134. Springer, 2019.
[251] Randal S Olson and Jason H Moore. Tpot: A tree-based pipeline optimization tool for automating machine learning. In Automated Machine Learning, pages 151–160. Springer, 2019.
[252] Pieter Gijsbers, Erin LeDell, Sébastien Poirier, Janek Thomas, Bernd Bischl, and Joaquin Vanschoren. An open source automl benchmark. In 6th ICML Workshop on Automated Machine Learning, 2019.
[253] Ke Tu, Jianxin Ma, Peng Cui, Jian Pei, and Wenwu Zhu. Autone: Hyperparameter optimization for massive network embedding. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2019.
[254] Ke Li and Jitendra Malik. Learning to optimize. arXiv preprint arXiv:1606.01885, 2016.
[255] Ke Li and Jitendra Malik. Learning to optimize neural nets. arXiv preprint arXiv:1703.00441, 2017.
[256] Dougal Maclaurin, David Duvenaud, and Ryan Adams.
Gradient-based hyper-parameter optimization through reversible learning. In International Conference on Machine Learning, pages 2113–2122, 2015.
[257] Sylvain Arlot, Alain Celisse, et al. A survey of cross-validation procedures for model selection. Statistics surveys, 4:40–79, 2010.
[258] Tomáš Pevný. Loda: Lightweight on-line detector of anomalies. Machine Learning, 102(2):275–304, 2016.
[259] Bitflow collector, 2019. Available online: https://github.com/bitflow-stream/go-bitflow-collector/ (Accessed 31-10-2019).
[260] Albert Bifet, Geoff Holmes, Richard Kirkby, and Bernhard Pfahringer. Moa: Massive online analysis. Journal of Machine Learning Research, 11(May):1601–1604, 2010.
[261] G A Padmanabhan and Prem Vrat. An eoq model for items with stock dependent consumption rate and exponential decay. Engineering Costs and Production Economics, 18(3):241–246, 1990.
[262] Christopher C Moser, Jonathan M Keske, Kurt Warncke, Ramy S Farid, and P Leslie Dutton. Nature of biological electron transfer. Nature, 355(6363):796, 1992.

[263] MH Zwietering, Il Jongenburger, FM Rombouts, and K Van't Riet. Modeling of the bacterial growth curve. Appl. Environ. Microbiol., 56(6):1875–1881, 1990.
[264] Stasys Girdzijauskas and Dalia Štreimikiene. Application of logistic models for stock market bubbles analysis. Journal of Business Economics and Management, 10(1):45–51, 2009.
[265] Daniel Pauly. Fish population dynamics in tropical waters: a manual for use with programmable calculators, volume 8. WorldFish, 1984.
[266] FJ Richards. A flexible growth function for empirical use. Journal of experimental Botany, 10(2):290–301, 1959.
[267] Florian Schmidt, Anton Gulenko, Marcel Wallschläger, Alexander Acker, Vincent Hennig, Feng Liu, and Odej Kao. Iftm-unsupervised anomaly detection for virtualized network function services. In 2018 IEEE International Conference on Web Services (ICWS), pages 187–194. IEEE, 2018.
[268] Florian Schmidt, Florian Suri-Payer, Anton Gulenko, Marcel Wallschläger, Alexander Acker, and Odej Kao. Unsupervised anomaly event detection for cloud monitoring using online arima. In 2018 IEEE/ACM International Conference on Utility and Cloud Computing Companion (UCC Companion), pages 71–76. IEEE, 2018.
[269] Florian Schmidt, Florian Suri-Payer, Anton Gulenko, Marcel Wallschläger, Alexander Acker, and Odej Kao. Unsupervised anomaly event detection for vnf service monitoring using multivariate online arima. In 2018 IEEE International Conference on Cloud Computing Technology and Science (CloudCom), pages 278–283. IEEE, 2018.
[270] Ralf Tönjes, P Barnaghi, M Ali, A Mileo, M Hauswirth, F Ganz, S Ganea, B Kjærgaard, D Kuemper, Septimiu Nechifor, et al. Real time iot stream processing and large-scale data analytics for smart city applications. In poster session, European Conference on Networks and Communications. sn, 2014.
[271] Muhammad Intizar Ali, Feng Gao, and Alessandra Mileo. Citybench: A configurable benchmark to evaluate rsp engines using smart city datasets.
In International Semantic Web Conference, pages 374–389. Springer, 2015.
[272] Steven CH Hoi and Rong Jin. Semi-supervised ensemble ranking. 2008.
[273] Furu Wei, Wenjie Li, and Shixia Liu. irank: A rank-learn-combine framework for unsupervised ensemble ranking. Journal of the American Society for Information Science and Technology, 61(6):1232–1243, 2010.
[274] Thomas G Dietterich. Ensemble methods in machine learning. In International workshop on multiple classifier systems, pages 1–15. Springer, 2000.
[275] James Bergstra, Daniel Yamins, and David Cox. Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures. In International Conference on Machine Learning, pages 115–123, 2013.
[276] Lior Rokach. Ensemble-based classifiers. Artificial Intelligence Review, 33(1-2):1–39, 2010.
[277] Osbert Bastani, Yani Ioannou, Leonidas Lampropoulos, Dimitrios Vytiniotis, Aditya Nori, and Antonio Criminisi. Measuring neural net robustness with constraints. In Advances in neural information processing systems, pages 2613–2621, 2016.
[278] Valentina Timčenko and Slavko Gajin. Ensemble classifiers for supervised anomaly based network intrusion detection. In 2017 13th IEEE International Conference on Intelligent Computer Communication and Processing (ICCP), pages 13–19. IEEE, 2017.

[279] Juan Vanerio and Pedro Casas. Ensemble-learning approaches for network security and anomaly detection. In Proceedings of the Workshop on Big Data Analytics and Machine Learning for Data Communication Networks, pages 1–6. ACM, 2017.
[280] Saeed Haddadi Makhsous, Anton Gulenko, Odej Kao, and Feng Liu. High available deployment of cloud-based virtualized network functions. In 2016 International Conference on High Performance Computing & Simulation (HPCS), pages 468–475. IEEE, 2016.
[281] Matthias Bolte, Michael Sievers, Georg Birkenheuer, Oliver Niehörster, and André Brinkmann. Non-intrusive virtualization management using libvirt. In 2010 Design, Automation & Test in Europe Conference & Exhibition (DATE 2010), pages 574–579. IEEE, 2010.
[282] Ben Pfaff and Bruce Davie. The open vswitch database management protocol. 2013.
[283] Qiang Yang, Yang Liu, Tianjian Chen, and Yongxin Tong. Federated machine learning: Concept and applications. ACM Transactions on Intelligent Systems and Technology (TIST), 10(2):1–19, 2019.
[284] Alexander Acker, Florian Schmidt, Anton Gulenko, Reinhard Kietzmann, and Odej Kao. Patient-individual morphological anomaly detection in multi-lead electrocardiography data streams. In Big Data (Big Data), 2017 IEEE International Conference on, pages 3841–3846. IEEE, 2017.
[285] Paolo Burlando, Renzo Rosso, Luis G Cadavid, and Jose D Salas. Forecasting of short-term rainfall using arma models. Journal of Hydrology, 144(1-4):193–211, 1993.
[286] Patrik Gustavsson and Jonas Nordström. The impact of seasonal unit roots and vector arma modelling on forecasting monthly tourism flows. Tourism Economics, 7(2):117–133, 2001.
[287] Wim van Drongelen. Signal Processing for Neuroscientists: An Introduction to the Analysis of Physiological Signals. Elsevier, 2006.
[288] James Douglas Hamilton. Time series analysis, volume 2. Princeton University Press, 1994.
[289] Edward James Hannan. Multiple time series, volume 38. John Wiley & Sons, 2009.
[290] Chenghao Liu, Steven C. H. Hoi, Peilin Zhao, and Jianling Sun. Online arima algorithms for time series prediction. In Proceedings of AAAI Conference on Artificial Intelligence, AAAI'16, pages 1867–1873. AAAI Press, 2016.
[291] Elad Hazan, Amit Agarwal, and Satyen Kale. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2-3):169–192, 2007.
[292] Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of International Conference on Machine Learning (ICML-03), pages 928–936, 2003.
[293] Jack Sherman and Winifred J Morrison. Adjustment of an inverse matrix corresponding to a change in one element of a given matrix. The Annals of Mathematical Statistics, 21(1):124–127, 1950.
[294] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pages 249–256, 2010.

Appendices

Appendix A

Online Arima

The background and variations applied to UArima and MArima are described in the following and were published by us in [268, 269].

Univariate Online Arima

Forecasting on time series has been applied for many years [285–287] using the Autoregressive Moving Average Model (ARMA) [288, 289], which focuses on univariate time series forecasting. The ARMA model can identify latent correlations within the time series, allowing it to forecast future data points. Based on ARMA, the model was modified to apply differencing in order to capture finer-grained changes than linear dependencies; the resulting model is called ARIMA [288, 289]. However, both ARMA and ARIMA assume that the complete time series is known in advance, so that there is no missing data. This is problematic when predicting on a potentially endless data stream. To overcome this limitation, Liu et al. [290] proposed a game-theoretic framework for online learning to forecast values, called the online ARIMA model. In their setting, an online player sequentially commits to a decision by forecasting the next value and consequently suffers a loss. Liu et al. assume that coefficient vectors are set by an adversary and are not disclosed to the learner at any time. Since the learner has no ability to infer the actual noise term ε_t of a time step t, this term is generated by the adversary while remaining undisclosed to the learner. The learner predicts a value x̃_t and learns the monitored value x_t, subsequently suffering a loss l_t at time t:

\[ l_t(x_t, \tilde{x}_t) = l_t\Big(x_t,\; \nabla^d \tilde{x}_t + \sum_{i=0}^{d-1} \nabla^i x_{t-1}\Big) \tag{A.1} \]

Liu et al. [290] showed that the original model ARIMA(k, d, q) can be approximated by the model ARIMA(k + m, d, 0). The noise term is dropped and compensated for by extending the regression window size k by m. A new (k + m)-dimensional coefficient vector γ weights the past observations, formulating the model:

\[ \tilde{x}_t(\gamma) = \sum_{i=1}^{k+m} \gamma_i \nabla^d x_{t-i} + \sum_{i=0}^{d-1} \nabla^i x_{t-1} \tag{A.2} \]

and subsequently the loss:

\[ l_t(x_t, \tilde{x}_t) = l_t\Big(x_t,\; \sum_{i=1}^{k+m} \gamma_i \nabla^d x_{t-i} + \sum_{i=0}^{d-1} \nabla^i x_{t-1}\Big) \tag{A.3} \]
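To make the approximated model concrete, the one-step forecast of Equation A.2 and its squared loss can be sketched as follows (a minimal Python sketch under the stated model assumptions; the NumPy-based layout and the function names are illustrative and not taken from the IFTM implementation):

```python
import numpy as np

def forecast(history, gamma, d):
    """One-step Online-ARIMA forecast following Eq. (A.2).

    history: past observations x_1..x_{t-1}, most recent last.
    gamma:   coefficient vector of length w = k + m.
    d:       differencing order.
    """
    x = np.asarray(history, dtype=float)
    # diffs[i] holds the i-th order differences nabla^i x.
    diffs = [x]
    for _ in range(d):
        diffs.append(np.diff(diffs[-1]))
    # Autoregressive part: sum_{i=1..w} gamma_i * nabla^d x_{t-i}.
    dth = diffs[d]
    ar = sum(gamma[i] * dth[-1 - i] for i in range(len(gamma)))
    # Integration part: sum_{i=0..d-1} nabla^i x_{t-1}.
    integ = sum(diffs[i][-1] for i in range(d))
    return ar + integ

def squared_loss(x_t, x_pred):
    """Squared prediction loss incurred after observing x_t."""
    return (x_t - x_pred) ** 2
```

For a perfectly linear series such as 1, 2, 3, 4 with d = 1 and a single coefficient γ₁ = 1, the sketch predicts the next value 5, illustrating how the differencing captures the trend.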


Furthermore, Liu et al. [290] described two suitable online convex optimization solvers to iteratively learn γ: the Online Newton Step (ONS) procedure [291] and Online Gradient Descent (OGD) [292]. Both are discussed and implemented within the IFTM framework. While the OGD variant is computationally more efficient, the ONS approach provides a more favorable regret bound, which entails more accurate predictions. Moreover, choosing a suitable learning rate is harder in the gradient descent setting, yet crucial to achieve a good model approximation. Nevertheless, the OGD approach might be favorable when the lag window is large. Thus, in the following we consider the ONS model.

In contrast to Liu et al. [290], we denote no discrete feature space for the coefficients and assume them to be continuous. We therefore randomly initialize γ_i ∈ [−0.5, 0.5] for all i ∈ [1, ..., w] and update them after each observation. A further adaptation in the IFTM implementation compared to the proposal of Liu et al. [290] concerns ONS: the inverse of the pseudo-Hessian matrix A has to be computed, which is computationally expensive. We therefore utilize the Sherman–Morrison formula [293], which leads to a more efficient computation of the inverse.

As ARIMA is usually investigated for univariate time series forecasting, we employ an individual model (Model 1, 2, ..., n) for each dimension of the incoming multivariate data points. Thus, the forecast x̃_t of an observed data point x_t is given by:

\[ \tilde{x}_t = \begin{pmatrix} \tilde{x}_t^1 \\ \tilde{x}_t^2 \\ \vdots \\ \tilde{x}_t^n \end{pmatrix} = \begin{pmatrix} \sum_{i=1}^{w} \gamma_i^1 \nabla^d x_{t-i}^1 + \sum_{i=0}^{d-1} \nabla^i x_{t-1}^1 \\ \sum_{i=1}^{w} \gamma_i^2 \nabla^d x_{t-i}^2 + \sum_{i=0}^{d-1} \nabla^i x_{t-1}^2 \\ \vdots \\ \sum_{i=1}^{w} \gamma_i^n \nabla^d x_{t-i}^n + \sum_{i=0}^{d-1} \nabla^i x_{t-1}^n \end{pmatrix} \tag{A.4} \]

where ∇x_t^i = x_t^i − x_{t−1}^i and x_t^i denotes the datum corresponding to metric i (i ∈ [1, ..., n]) at time step t.
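The ONS coefficient update with the Sherman–Morrison rank-one inverse update can be sketched as follows (a simplified Python sketch: the projection step of the full ONS procedure is omitted for brevity, and the learning rate eta, the initialization A₀ = ε·I, and all names are illustrative assumptions rather than the exact IFTM configuration):

```python
import numpy as np

class OnsCoefficients:
    """Sketch: maintain gamma and the inverse pseudo-Hessian A^{-1}
    via Sherman-Morrison rank-one updates instead of re-inverting A."""

    def __init__(self, w, eta=0.1, eps=1.0, seed=0):
        rng = np.random.default_rng(seed)
        self.eta = eta
        self.A_inv = np.eye(w) / eps                  # inverse of A_0 = eps * I
        self.gamma = rng.uniform(-0.5, 0.5, size=w)   # gamma_i in [-0.5, 0.5]

    def update(self, grad):
        # Sherman-Morrison:
        # (A + g g^T)^{-1} = A^{-1} - (A^{-1} g g^T A^{-1}) / (1 + g^T A^{-1} g)
        Ag = self.A_inv @ grad
        self.A_inv -= np.outer(Ag, Ag) / (1.0 + grad @ Ag)
        # Newton-style step on the coefficients.
        self.gamma -= (1.0 / self.eta) * (self.A_inv @ grad)
        return self.gamma
```

The rank-one update costs O(w²) per step instead of the O(w³) of a full matrix inversion, which is the efficiency gain referred to above.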
For each individual model, the following schema holds: upon receiving the ground truth data value x_t^j, we incur the prediction loss for our forecast x̃_t^j. Thus, the loss is l_t^j(x_t^j, x̃_t^j) (the loss corresponding to model j in time step t):

\[ l_t^j(x_t^j, \tilde{x}_t^j) = (x_t^j - \tilde{x}_t^j)^2 = \Big(x_t^j - \Big(\sum_{i=1}^{w} \gamma_i^j \nabla^d x_{t-i}^j + \sum_{i=0}^{d-1} \nabla^i x_{t-1}^j\Big)\Big)^2 \tag{A.5} \]

Given the loss, we update our model according to the previously discussed methods. Thus, the loss gradient for updating the model's coefficients is:

\[ \nabla l_t^j[\gamma^j](x_t^j, \tilde{x}_t^j) = \begin{pmatrix} -2\,(x_t^j - \tilde{x}_t^j)\,\nabla^d x_{t-1}^j \\ -2\,(x_t^j - \tilde{x}_t^j)\,\nabla^d x_{t-2}^j \\ \vdots \\ -2\,(x_t^j - \tilde{x}_t^j)\,\nabla^d x_{t-w}^j \end{pmatrix} \]

As the model progresses through time, the historical data drops out of the window's tail, while future data is added to the head of the window.

For an observed data point x_t, its differenced values are computed prior to being integrated into the window. The computation can be performed in an iterative manner, following the formula ∇^i x_t = ∇^{i−1} x_t − ∇^{i−1} x_{t−1}, i > 0. Thus, the model provides online computation, which is required for our use case. As this approach focuses on univariate processing of values, inter-dependencies between the dimensions are not captured. Next, we shortly describe how to capture those inter-dependencies and consequently establish multivariate computation through a single Online Arima model.
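The iterative difference update above can be sketched in a few lines (a minimal Python sketch; the list-based storage and the function name are illustrative choices, not the IFTM implementation):

```python
def update_differences(prev_diffs, x_t):
    """Update the stored differences when a new value x_t arrives,
    following nabla^i x_t = nabla^{i-1} x_t - nabla^{i-1} x_{t-1}.

    prev_diffs[i] holds nabla^i x_{t-1}; prev_diffs[0] is x_{t-1} itself.
    Returns [nabla^0 x_t, ..., nabla^d x_t] with d = len(prev_diffs) - 1.
    """
    new_diffs = [x_t]
    for i in range(1, len(prev_diffs)):
        # nabla^i x_t = nabla^{i-1} x_t - nabla^{i-1} x_{t-1}
        new_diffs.append(new_diffs[i - 1] - prev_diffs[i - 1])
    return new_diffs
```

Each arriving value thus requires only O(d) subtractions, never a recomputation over the full window, which is what makes the scheme suitable for stream processing.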

Multivariate Online Arima

As the ARIMA computation formulas are indifferent to the dimensionality of the input data (see Equation A.2), the model can also provide a multivariate forecast x̃_t. The autoregressive part of the model is a linear combination of vectors, weighted by the model coefficients γ. The differenced vectors are given by ∇x_t = x_t − x_{t−1}, where standard vector subtraction is performed. The consequent differences follow: ∇^i x_t = ∇^{i−1} x_t − ∇^{i−1} x_{t−1}. Again, upon receiving the observed data point x_t, we compute the prediction loss for our forecast x̃_t. The incurred loss is also defined by the squared Euclidean distance between the two vectors, but now captures all n dimensions. Let l_t(x_t, x̃_t) denote our incurred loss in time step t:

\[ l_t(x_t, \tilde{x}_t) = \sum_{j=1}^{n} \Big(x_t^j - \Big(\sum_{i=1}^{w} \gamma_i^j \nabla^d x_{t-i}^j + \sum_{i=0}^{d-1} \nabla^i x_{t-1}^j\Big)\Big)^2 \tag{A.6} \]

Thus, through the sum over the given dimensions the inter-dependencies are captured, and the loss gradient for the ONS procedure is consequently:

\[ \nabla l_t[\gamma](X_t, \tilde{X}_t) = \begin{pmatrix} -2 \sum_{j=1}^{n} (x_t^j - \tilde{x}_t^j)\,\nabla^d x_{t-1}^j \\ -2 \sum_{j=1}^{n} (x_t^j - \tilde{x}_t^j)\,\nabla^d x_{t-2}^j \\ \vdots \\ -2 \sum_{j=1}^{n} (x_t^j - \tilde{x}_t^j)\,\nabla^d x_{t-w}^j \end{pmatrix} \]

The adaptations made to the original paper are the same as described above for the univariate Online Arima model.

Appendix B

Intervals for Identity functions and Threshold models

The following default intervals are integrated into the IFTM framework, on which the hyperparameter optimization is applied (see Tables B.1 and B.2). When not stated differently, the standard configurations of Deeplearning4j1 (version 1.0.0-beta4) are used. Furthermore, the weights are initialized with the Xavier function [294], mean squared error is used as loss function, and the output layer utilizes the linear identity as activation function.

1 https://deeplearning4j.org/

Threshold model                  Hyperparameter    Interval
Gaussian aggregation             σ multiplier c    [0,10]
Exponential moving model         α                 [0,1]
                                 σ multiplier c    [0,10]
Double exponential moving model  α                 [0,1]
                                 β                 [0,1]
                                 σ multiplier c    [0,10]
Sliding window aggregation       window size       {2,3,...,500}
                                 σ multiplier c    [0,10]

Table B.1: Intervals for hyperparameter optimization for threshold models.


Identity function            Hyperparameter                       Interval
CABIRCH                      initial threshold                    0
                             number of nodes                      {2,3,...,100}
                             logistic function decay - β          [0,1]
                             logistic function decay - max decay  [0,1]
Univariate Arima             d                                    {1,2,...,10}
                             k + m                                {2,3,...,500}
Multivariate Arima           d                                    {1,2,...,10}
                             k + m                                {2,3,...,500}
Autoencoder                  Encoding layers                      {1,2,...,5}
                             Decoding layers                      {1,2,...,5}
                             Learning rate                        [0,1]
Variational Autoencoder      Encoding layers                      [1,5]
                             Decoding layers                      [1,5]
                             Learning rate                        [0,1]
LSTM network                 Layers                               {1,2,...,10}
                             Learning rate                        [0,1]
Autoencoder with LSTM cells  Encoding layers                      {1,2,...,5}
                             Decoding layers                      {1,2,...,5}
                             Learning rate                        [0,1]

Table B.2: Intervals for hyperparameter optimization for Identity functions.

Appendix C

Detailed Evaluation Results


Identity function  Threshold model  TP-rate [%]  TN-rate [%]  FP-rate [%]  FN-rate [%]  AUC [%]
CABIRCH            DEMM             58.12        90.16         9.84        41.88        74.14
CABIRCH            EMM              51.36        94.57         5.43        48.64        72.97
CABIRCH            SWA              56.14        87.95        12.05        43.86        72.05
CABIRCH            CA               56.04        85.82        14.18        43.96        70.93
AE                 SWA              53.61        75.51        24.49        46.39        64.56
AELSTM             SWA              50.48        77.78        22.22        49.52        64.13
LSTM               SWA              51.33        76.62        23.38        48.67        63.98
MArima             CA               63.56        64.29        35.71        36.44        63.93
VAE                EMM              44.25        83.20        16.80        55.75        63.73
AE                 CA               44.16        82.97        17.03        55.84        63.57
UArima             SWA              63.60        57.85        42.15        36.40        63.41
LSTM               DEMM             49.81        76.75        23.25        50.19        63.28
AE                 EMM              39.75        86.23        13.77        60.25        62.99
AELSTM             DEMM             48.34        76.14        23.86        51.66        62.24
VAE                DEMM             46.42        78.04        21.96        53.58        62.23
AELSTM             CA               34.20        87.16        12.84        65.80        60.68
VAE                CA               37.15        82.54        17.46        62.85        59.85
LSTM               CA               34.34        84.67        15.33        65.66        59.51
UArima             CA               26.66        91.09         8.91        73.34        58.88
AELSTM             EMM              21.91        94.65         5.35        78.09        58.28
MArima             SWA              68.20        52.62        47.38        31.80        55.83
UArima             DEMM             21.85        88.10        11.90        78.15        54.98
MArima             DEMM             14.77        91.52         8.48        85.23        53.15
UArima             EMM               9.02        95.44         4.56        90.98        52.23
LSTM               EMM               7.20        96.14         3.86        92.80        51.67
MArima             EMM               8.17        94.36         5.64        91.83        51.27
AE                 DEMM              1.72        98.27         1.73        98.28        50.00
VAE                SWA               5.88        93.41         6.59        94.12        49.65

Table C.1: Point-wise evaluation results showing the different rates (TP, TN, FP, FN) as well as the AUC value.
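In this point-wise setting the tabulated AUC coincides, up to rounding, with the mean of the TP-rate and TN-rate (balanced accuracy). A minimal sketch (not from the thesis; variable names are illustrative) checking this relation against a few rows of Table C.1:

```python
# Selected rows of Table C.1: (identity, threshold, TP-rate, TN-rate, AUC).
rows = [
    ("CABIRCH", "DEMM", 58.12, 90.16, 74.14),
    ("CABIRCH", "EMM", 51.36, 94.57, 72.97),
    ("MArima", "CA", 63.56, 64.29, 63.93),
    ("AE", "DEMM", 1.72, 98.27, 50.00),
]

# AUC equals the mean of TP-rate and TN-rate up to rounding.
for identity, threshold, tpr, tnr, auc in rows:
    assert abs((tpr + tnr) / 2 - auc) < 0.01
```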

Identity function | Threshold model | Precision [%] | Recall [%] | F1 [%] | Accuracy [%]
CABIRCH | EMM | 71.03 | 51.36 | 59.61 | 85.67
CABIRCH | DEMM | 60.52 | 58.12 | 59.30 | 83.56
CABIRCH | SWA | 54.73 | 56.14 | 55.43 | 81.40
CABIRCH | CA | 50.62 | 56.04 | 53.19 | 79.69
AELSTM | EMM | 51.49 | 21.91 | 30.74 | 79.66
AE | DEMM | 20.44 | 1.72 | 3.17 | 78.38
LSTM | EMM | 32.60 | 7.20 | 11.79 | 77.82
UArima | CA | 43.69 | 26.66 | 33.11 | 77.82
UArima | EMM | 33.91 | 9.02 | 14.24 | 77.64
AE | EMM | 42.82 | 39.75 | 41.23 | 76.66
MArima | EMM | 27.30 | 8.17 | 12.57 | 76.61
AELSTM | CA | 40.86 | 34.20 | 37.23 | 76.25
MArima | DEMM | 31.13 | 14.77 | 20.04 | 75.72
VAE | SWA | 18.79 | 5.88 | 8.96 | 75.38
VAE | EMM | 40.58 | 44.25 | 42.34 | 75.17
AE | CA | 40.21 | 44.16 | 42.09 | 74.98
UArima | DEMM | 32.27 | 21.85 | 26.06 | 74.46
LSTM | CA | 36.75 | 34.34 | 35.50 | 74.30
VAE | CA | 35.56 | 37.15 | 36.34 | 73.20
AELSTM | SWA | 37.08 | 50.48 | 42.75 | 72.15
VAE | DEMM | 35.42 | 46.42 | 40.18 | 71.53
LSTM | SWA | 36.28 | 51.33 | 42.52 | 71.41
LSTM | DEMM | 35.72 | 49.81 | 41.60 | 71.20
AE | SWA | 36.22 | 53.61 | 43.23 | 71.00
AELSTM | DEMM | 34.44 | 48.34 | 40.22 | 70.41
MArima | CA | 31.58 | 63.56 | 42.19 | 64.14
UArima | SWA | 28.13 | 63.60 | 39.01 | 59.03
MArima | SWA | 27.18 | 68.20 | 38.87 | 55.83

Table C.2: Point-wise evaluation results presenting precision, recall, F1 score, and the accuracy.
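The F1 column of Table C.2 is the harmonic mean of precision and recall. A minimal sketch (not from the thesis) reproducing the CABIRCH + EMM row:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# CABIRCH + EMM row of Table C.2: precision 71.03 %, recall 51.36 %,
# tabulated F1 = 59.61 %.
f1 = f1_score(71.03, 51.36)
```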

Identity function | Threshold function | C(time.avg, time.std)
AELSTM | DEMM | 0.9999
AE | CA | 0.9991
CABIRCH | DEMM | 0.9989
MArima | CA | 0.9982
CABIRCH | SWA | 0.9952
AELSTM | CA | 0.9946
CABIRCH | EMM | 0.9936
UArima | SWA | 0.9935
MArima | SWA | 0.9934
LSTM | CA | 0.9914
VAE | CA | 0.9850
LSTM | EMM | 0.9820
LSTM | DEMM | 0.9812
UArima | DEMM | 0.9797
LSTM | SWA | 0.9792
VAE | DEMM | 0.9766
AELSTM | SWA | 0.9766
AE | SWA | 0.9727
AE | EMM | 0.9710
CABIRCH | CA | 0.9643
UArima | CA | 0.9493
AELSTM | EMM | 0.9470
VAE | EMM | 0.9347
MArima | DEMM | 0.8943
UArima | EMM | 0.8627
MArima | EMM | 0.8243
VAE | SWA | 0.6805
AE | DEMM | 0.6704
All normal | | 0.9380

Table C.3: Correlations of the average time and standard deviation for all 28 IFTM models.
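Assuming C(time.avg, time.std) denotes a Pearson correlation coefficient between average detection time and its standard deviation (an assumption for illustration; the data below is invented, not from the thesis), it can be computed as:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Illustrative (invented) per-run timing statistics.
avg_times = [1.0, 2.0, 3.0, 4.0]
std_times = [0.5, 1.1, 1.4, 2.1]
r = pearson(avg_times, std_times)
```

Values close to 1, as in most rows of Table C.3, indicate that the standard deviation grows almost linearly with the average time.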

Identity function & threshold model | False alarm duration [s]
AE & DEMM | 0.0881
LSTM & EMM | 0.1695
CABIRCH & EMM | 0.1654
MArima & EMM | 0.2101
VAE & SWA | 0.0551
CABIRCH & DEMM | 0.3375
UArima & DEMM | 0.4999
CABIRCH & SWA | 0.3074
AE & EMM | 0.2396
VAE & EMM | 0.2640
VAE & DEMM | 0.2689
AELSTM & SWA | 0.2728
LSTM & DEMM | 0.2550
LSTM & SWA | 0.2220
AELSTM & DEMM | 0.2332
AE & SWA | 0.2510
MArima & CA | 0.2052
UArima & SWA | 0.3307
MArima & SWA | 0.1691
AELSTM & EMM | 0.2165
AE & CA | 0.2844
MArima & DEMM | 0.2633
VAE & CA | 0.1132
AELSTM & CA | 0.2346
LSTM & CA | 0.2559
CABIRCH & CA | 0.0044
UArima & CA | 0.2275
UArima & EMM | 0.2384

Figure C.1: False alarm duration [s] for the 28 IFTM combinations.

Identity function & threshold model | Avg. detection time [s]
AELSTM & EMM | 9.9418
AE & CA | 2.8598
MArima & DEMM | 22.0754
VAE & CA | 2.3922
AELSTM & CA | 9.1268
LSTM & CA | 4.3455
CABIRCH & CA | 6.0102
UArima & CA | 20.5932
UArima & EMM | 50.3422

Figure C.2: Average detection times [s] for the IFTM combinations with a detection rate below 100%.