System Problem Detection by Mining Console Logs

by Wei Xu

A dissertation submitted in partial satisfaction of the requirements for the degree of Doctor of Philosophy in Computer Science in the Graduate Division of the University of California, Berkeley

Committee in charge:
Professor David A. Patterson, Chair
Professor Armando Fox
Professor Pieter Abbeel
Professor Ray R. Larson

Fall 2010

System Problem Detection by Mining Console Logs
Copyright 2010 by Wei Xu

Abstract

System Problem Detection by Mining Console Logs
by Wei Xu
Doctor of Philosophy in Computer Science
University of California, Berkeley
Professor David A. Patterson, Chair

The console logs generated by an application contain information that the developers believed would be useful in debugging or monitoring the application. Despite the ubiquity and large size of these logs, they are rarely exploited because they are not readily machine-parsable. We propose a fully automatic methodology for mining console logs using a combination of program analysis, information retrieval, data mining, and machine learning techniques. We use source code analysis to recover the structure of the console logs. We then extract features, such as execution traces, from the logs and use data mining and machine learning methods to detect problems. We also use a decision tree to distill the detection results into a format readily understandable by operators who need not be familiar with the anomaly detection algorithms. The whole process requires no human intervention and scales to large volumes of log data. We extend the methods to perform online analysis on console log streams. We evaluate the technique on several real-world systems and detect problems that are insightful to system operators.

To my family

Contents

List of Figures
List of Tables

1 Introduction
  1.1 Background and Motivation
  1.2 Contributions
  1.3 Summary
2 Key Insights and Overview
  2.1 Key Insights
  2.2 Methodology Overview
  2.3 Case Studies and Overview of Results
  2.4 Summary
3 Console Logs Preprocessing
  3.1 Console Log Parser Design
    3.1.1 Challenges in log parsing
    3.1.2 Parser design overview
  3.2 Object Oriented Languages: A Running Example
  3.3 Parsers for Other Languages
    3.3.1 C and C++
    3.3.2 Scripting languages (Python)
  3.4 Extracting Message Templates from Program Binaries
    3.4.1 Java bytecode
    3.4.2 Binaries from C programs
  3.5 Evaluation: Accuracy and Scalability
    3.5.1 Accuracy
    3.5.2 Scalability
  3.6 Summary
4 Offline Problem Detection and Visualization
  4.1 Feature Creation
    4.1.1 State variables and state ratio vectors
    4.1.2 Identifiers and message count vectors
    4.1.3 Implementing feature creation algorithms
  4.2 Anomaly Detection
    4.2.1 PCA-based anomaly detection
    4.2.2 Improving PCA detection results
  4.3 Evaluation and Visualization
    4.3.1 Darkstar experiment results
    4.3.2 Hadoop experiment results
  4.4 Visualizing Detection Results with Decision Trees
  4.5 Summary
5 Problem Detection in Online Log Streams
  5.1 Two-Stage Online Anomaly Detection
    5.1.1 Challenges in online detection
    5.1.2 Our solution: two-stage detection
  5.2 Stage 1: Frequent Pattern Mining
    5.2.1 Mining frequent event patterns
    5.2.2 Estimating distributions of session durations
    5.2.3 Implementation of Stage 1
  5.3 Stage 2: PCA Detection
  5.4 Evaluation
    5.4.1 Stage 1 pattern mining results
    5.4.2 Detection precision and recall
    5.4.3 Detection latency
    5.4.4 Comparison to offline results
  5.5 Discussion
    5.5.1 Limitations of online detection
    5.5.2 Use cases
  5.6 Summary
6 Real world application
  6.1 The art of log sanitization
  6.2 State-based detection
  6.3 Sequence-based detection
  6.4 Summary
7 Related Work
  7.1 System Monitoring and Problem Detection
    7.1.1 System monitoring tools
    7.1.2 System problem detection with structured data
  7.2 Console Log Analysis
    7.2.1 Console generation and collection
    7.2.2 Console log parsing
    7.2.3 Using console logs for problem detection
  7.3 Techniques Used in This Project
  7.4 Summary
8 Future Directions
  8.1 Anomaly Detection and Beyond
  8.2 Tools to Improve Console Log
  8.3 Beyond Console Logs
9 Conclusion
Bibliography

List of Figures

  2.1 Sample log and corresponding source code
  2.2 Overview of four-step console log analysis methodology
  3.1 A sample log and corresponding source code
  3.2 Number of new log printing statements added each month in four real Google systems
  3.3 Using source code information to parse console logs
  3.4 Example source code segments with type inheritance
  3.5 Constructing message templates from source code analysis
  3.6 Scalability of log parsing with number of nodes used
  4.1 The intuition behind PCA detection
  4.2 Fraction of total variance captured by each principal component
  4.3 Darkstar state ratio vector detection results
  4.4 Hadoop message count vector detection results
  4.5 The decision tree visualization of Hadoop message count vectors
  4.6 The decision tree visualization for Darkstar message count vectors
  5.1 Overview of the two stage online detection systems
  5.2 Frequent patterns, anomalous cases and "middle ground"
  5.3 Sample sessions
  5.4 Tail of durations follow power-law distribution
  5.5 Detection latency and number of events kept in detector's buffer
  6.1 Log sanitization overview
  6.2 Detection results on Google data

List of Tables

  2.1 State variables and identifiers
  2.2 Summary of data sets used in evaluation
  2.3 Survey of console logging in popular software systems
  3.1 Parsing accuracy of C/C++ parser
  3.2 Python parser accuracy
  3.3 Comparison between C source parser and C binary parser
  3.4 Parsing accuracy using Java Parser
  3.5 Parsing Accuracy using the C/C++ parser on production data from Google
  4.1 Semantics of rows and columns of features
  4.2 Low effective dimensionality of feature data
  4.3 Detected anomalies and false positives using PCA on Hadoop message count vector feature
  5.1 Frequent patterns discovered from Hadoop logs
  5.2 Hadoop online detection precision and recall
  5.3 Detection accuracy comparison with offline detection results
  5.4 Add caption
  5.5 Detection accuracy comparison with offline detection results
Acknowledgments

I am grateful for the collaboration, guidance and help I received during my entire PhD career.

First and foremost, I am extremely fortunate to have two great co-advisors, Professor David Patterson and Professor Armando Fox. Throughout my career as a Ph.D. student at Berkeley, they provided incredible help on every aspect of the research. From choosing a research topic to further shaping the project, from technical directions to fixing English problems, their help is evident throughout this dissertation. I am especially grateful for their encouragement and advice during the time when I had trouble finding a research topic. I feel very fortunate to have the opportunity to learn from them.

My special thanks go to Ling Huang at Intel. We collaborated on every detail in the project; he caught even the slightest flaws, and his insights into both machine learning and systems shaped the entire project. I also appreciate Professor Michael Jordan, who taught me about machine learning and helped me solve many problems I had during the project.

My thanks go to my qualifying exam committee, Professor Eric Brewer, Professor Pieter Abbeel and Professor Ray Larson, for their great suggestions that helped improve the dissertation, and especially for their encouragement. Professor Pieter Abbeel and Professor Ray Larson are also on my dissertation committee, helping me further improve the dissertation.

I enjoyed the supportive and intellectually stimulating research environment of the RAD Lab. All RAD Lab faculty, students and retreat guests provided important input throughout this research.

Getting real data from industry is an essential part of this research. I appreciate Urs Hoelzle's help in allowing me to use Google's log data. At Google,