Automating Log Analysis
Master of Science in Computer Science
January 2021

Automating Log Analysis

Sri Sai Manoj Kommineni
Akhila Dindi

Faculty of Computing, Blekinge Institute of Technology, 371 79 Karlskrona, Sweden

This thesis is submitted to the Faculty of Computing at Blekinge Institute of Technology in partial fulfilment of the requirements for the degree of Master of Science in Computer Science. The thesis is equivalent to 20 weeks of full-time studies.

The authors declare that they are the sole authors of this thesis and that they have not used any sources other than those listed in the bibliography and identified as references. They further declare that they have not submitted this thesis at any other institution to obtain a degree.

Contact Information:

Authors:
Sri Sai Manoj Kommineni
E-mail: [email protected]
Akhila Dindi
E-mail: [email protected]

University advisor:
Dr. Hüseyin Kusetogullari
Department of Computer Science

Faculty of Computing
Blekinge Institute of Technology
SE-371 79 Karlskrona, Sweden
Internet: www.bth.se
Phone: +46 455 38 50 00
Fax: +46 455 38 50 57

Abstract

Background: With the advent of the information age, a very large number of services have emerged, running on clusters of many computers. Maintaining such large, complex systems is a difficult task. To troubleshoot problems, developers rely on one tool that is common to almost all software systems: the console logs. Identifying anomalies in the logs leads to the cause of a problem, and doing so automatically is the key step in automating log analysis. This study focuses on anomaly detection in logs.

Objectives: The main goal of the thesis is to identify different algorithms for anomaly detection in logs, implement them, and compare them in an experiment.

Methods: A literature review was conducted to identify the most suitable algorithms for anomaly detection in logs.
An experiment was then conducted to compare the algorithms identified in the literature review. The experiment was performed on a dataset of logs generated by Hadoop Distributed File System (HDFS) servers, consisting of more than 11 million lines of logs. The algorithms compared are K-means, DBSCAN, Isolation Forest, and Local Outlier Factor, all of which are unsupervised learning algorithms.

Results: The performance of these algorithms was compared using the metrics precision, recall, accuracy, F1 score, and run time. Though DBSCAN was the fastest overall, it resulted in poor recall, as did Isolation Forest. Local Outlier Factor was the fastest at prediction. K-means had the highest precision, and Local Outlier Factor had the highest recall, accuracy, and F1 score.

Conclusion: After comparing the metrics of the different algorithms, we conclude that Local Outlier Factor performed better than the other algorithms with respect to most of the metrics measured.

Keywords: Anomaly detection, Log analysis, Unsupervised learning

Acknowledgments

We would like to express our sincere gratitude to our academic supervisor Dr. Hüseyin Kusetogullari for supervising us and giving us useful feedback. We would also like to thank our company supervisors Yael Katzenellenbogen and Martha Dabrowska for advising us throughout our thesis. We would also like to extend our gratitude to our friends and family, who supported and helped us directly and indirectly.

Contents

Abstract
Acknowledgments
Contents
List of Figures
List of Tables
List of Equations
1 Introduction
  1.1 Aim
  1.2 Objectives
  1.3 Research questions
  1.4 Defining the scope of the thesis
  1.5 Outline
2 Background
  2.1 Anomaly Detection
  2.2 Machine Learning
    2.2.1 Supervised Machine Learning
    2.2.2 Unsupervised Machine Learning
    2.2.3 Semi-Supervised Machine Learning
    2.2.4 Reinforcement Machine Learning
  2.3 Word embeddings
  2.4 Algorithms
    2.4.1 GloVe
    2.4.2 Cosine similarity
    2.4.3 K-means algorithm
    2.4.4 DBSCAN
    2.4.5 Isolation Forest
    2.4.6 Local outlier factor
    2.4.7 k-Nearest Neighbour (KNN) Algorithm
  2.5 Performance metrics
    2.5.1 Accuracy
    2.5.2 Precision
    2.5.3 Recall
    2.5.4 F1 score
    2.5.5 Run time
3 Related Work
4 Method
  4.1 Literature Review
    4.1.1 Investigation of primary studies
    4.1.2 Criteria for the selection of research
    4.1.3 Assessment of quality
    4.1.4 Extraction of data
  4.2 Experiment
    4.2.1 Experiment Set-up / Tools used
  4.3 Data Collection
    4.3.1 Dataset
    4.3.2 Data Preprocessing
    4.3.3 Log Parsing
    4.3.4 Feature Extraction
  4.4 Algorithms
    4.4.1 K-means algorithm
    4.4.2 DBSCAN Algorithm
    4.4.3 Isolation Forest
    4.4.4 Local Outlier Factor
  4.5 Testing
5 Results and Analysis
  5.1 K-means algorithm
  5.2 DBSCAN algorithm
  5.3 Isolation Forest algorithm
  5.4 Local Outlier Factor algorithm
  5.5 Comparing Algorithms
6 Discussion
  6.1 Answering RQ1
  6.2 Answering RQ2
  6.3 Validity Threats Analysis
7 Conclusions and Future Work
Bibliography
A Supplemental Information

List of Figures

2.1 DBSCAN [84]
2.2 Isolation Forest [31]
2.3 Local Outlier Factor [25]
4.1 Working method of Anomaly detection
4.2 Before parsing the Console Logs
4.3 Parsed Logs
4.4 Features Extracted
4.5 Distribution of data over distance, x-axis: frequency of data points, y-axis: distance from the centre of cluster
4.6 Distribution of data over distance, x-axis: data point index value ranging from 0 to 100,000, y-axis: distance from the centre of cluster
4.7 Isolation Forest Anomaly Scores, x-axis: frequency of data points, y-axis: anomaly score of data point
5.1 K-means metrics
5.2 DBSCAN metrics
5.3 ISF metrics
5.4 LOF metrics
5.5 Histogram of metrics

List of Tables

4.1 System Configuration
5.1 Metrics Comparison

List of Equations

2.1 Cosine Similarity
2.2 k-means objective function
2.3 DBSCAN distance function
2.4 Anomaly score for Isolation Forest
2.5 Local outlier factor value
2.6 Accuracy
2.7 Precision
2.8 Recall
2.9 F1 score

List of Abbreviations

1. KNN: K-Nearest Neighbors
2. LOF: Local Outlier Factor
3. DBSCAN: Density-Based Spatial Clustering of Applications with Noise
4. LRD: Local Reachability Density
5. IF: Isolation Forest
6. ANN: Artificial Neural Network
7. CNN: Convolutional Neural Network
8. IT: Information Technology
9. BERT: Bidirectional Encoder Representations from Transformers
10. GloVe: Global Vectors for Word Representation
11. TP: True Positive
12. TN: True Negative
13. FP: False Positive
14. FN: False Negative
15. PCA: Principal Component Analysis
16. TF-IDF: Term Frequency-Inverse Document Frequency
17. BDA App: Big Data Analytics Applications
18. OLS: Ordinary Least Squares
19. NLP: Natural Language Processing
20. IDF: Inverse Document Frequency
21. SVM: Support Vector Machine
22. LSTM: Long Short-Term Memory
23. S-LSTM: Stacked Long Short-Term Memory
24. DARPA: Defense Advanced Research Projects Agency
25. CDMC2016: Cyber Security Data Mining Competition 2016
26. RNN: Recurrent Neural Network
27. HDFS: Hadoop Distributed File System

Chapter 1

Introduction

Many large-scale Internet services run in several large server clusters. In recent years, many companies have been running these services on virtualized cloud computing environments provided by companies such as Amazon, Microsoft, and Google, primarily for reasons of scalability and pricing [41]. Designing, maintaining,