AN ABSTRACT OF THE DISSERTATION OF

Md Amran Siddiqui for the degree of Doctor of Philosophy in Computer Science, presented on June 7, 2019.

Title: Anomaly Detection: Theory, Explanation and User Feedback

Abstract approved: Alan P. Fern

Anomaly detection is used in a variety of practical applications, including cyber-security, fraud detection, and fault detection in safety-critical systems. Anomaly detectors produce a ranked list of statistical anomalies, which are typically examined by human analysts in order to extract the actual anomalies of interest. Unfortunately, most anomaly detectors provide no explanation of why an instance was considered anomalous. To address this issue, we propose a feature-based explanation approach called sequential feature explanation (SFE) to help the analyst in their investigation. A second problem with anomaly detection systems is that they usually produce a large number of false positives due to a mismatch between statistical and semantic anomalies. We address this issue by incorporating human feedback: we develop a human-in-the-loop anomaly detection system that can improve its detection rate using a simple form of true/false-positive feedback from the analyst. We empirically demonstrate the efficacy and superior performance of both our explanation and feedback approaches on significant cyber-security applications, including red-team attack data and real corporate network data, along with a large number of benchmark datasets. We also examine a set of state-of-the-art anomaly detection techniques to understand why they perform so well with a small number of training examples. We unify their working principles into a common framework over different pattern spaces and compute their sample complexity for achieving performance guarantees. In addition, we empirically investigate learning curves for anomaly detection in this framework.
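The feedback loop summarized above can be sketched in a few lines. This is an illustrative toy only, not the dissertation's actual tree-based mirror-descent algorithm: it scores points as a weighted sum of fixed binary "pattern" activations and applies an entropic mirror-descent (multiplicative-weights) update after each true/false-positive label. All names here (`score`, `feedback_update`, the simulated analyst rule) are hypothetical.

```python
import numpy as np

# Toy human-in-the-loop anomaly detection: linear scoring over pattern
# activations, with an entropic mirror-descent weight update driven by
# true/false-positive analyst feedback. Illustrative sketch only.

def score(w, phi):
    """Anomaly score: weighted sum of pattern activations."""
    return phi @ w

def feedback_update(w, phi, is_anomaly, eta=0.5):
    """Boost (shrink) weights of the queried point's active patterns
    when the analyst labels it a true (false) positive."""
    grad = phi if is_anomaly else -phi
    w = w * np.exp(eta * grad)      # entropic mirror-descent step
    return w / w.sum()              # re-normalize onto the simplex

rng = np.random.default_rng(0)
Phi = rng.integers(0, 2, size=(200, 5)).astype(float)  # 200 points, 5 patterns
w = np.full(5, 1.0 / 5)                                # uniform initial weights

for _ in range(10):
    top = int(np.argmax(score(w, Phi)))    # query the top-ranked instance
    is_anomaly = bool(Phi[top, 0])         # simulated analyst: pattern 0 marks true anomalies
    w = feedback_update(w, Phi[top], is_anomaly)

# Under this simulated analyst, pattern 0's weight can never fall behind
# the others, so feedback steers the ranking toward true anomalies.
print(np.round(w, 3))
```

The multiplicative form of the update keeps all weights positive and on the probability simplex, which is the standard behavior of mirror descent with an entropic regularizer.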
© Copyright by Md Amran Siddiqui
June 7, 2019
All Rights Reserved

Anomaly Detection: Theory, Explanation and User Feedback

by
Md Amran Siddiqui

A DISSERTATION submitted to Oregon State University in partial fulfillment of the requirements for the degree of Doctor of Philosophy.

Presented June 7, 2019
Commencement June 2019

Doctor of Philosophy dissertation of Md Amran Siddiqui presented on June 7, 2019.

APPROVED:

Major Professor, representing Computer Science
Head of the School of Electrical Engineering and Computer Science
Dean of the Graduate School

I understand that my dissertation will become part of the permanent collection of Oregon State University libraries. My signature below authorizes release of my dissertation to any reader upon request.

Md Amran Siddiqui, Author

ACKNOWLEDGEMENTS

I am immensely grateful to my advisor, Dr. Alan Fern, for giving me the opportunity to pursue my PhD under his supervision. Through his close guidance and feedback I learned many skills, such as how to conduct research, brainstorm ideas, write papers, and give presentations.

I am very grateful to Dr. Thomas Dietterich for providing feedback and discussing interesting research directions during our anomaly detection meetings. I am also grateful to my committee members, Dr. Prasad Tadepalli and Dr. Raviv Raich, and my GCR, Dr. Debashis Mondal, for their valuable time and feedback during the exams.

I am grateful to my industry collaborators from Galois, Inc.: Ryan Wright, Alec Theriault, David W. Archer, and Nichole Schimanski. I learned a lot from them about systems, software engineering, and computer security, areas in which I did not have much background before.

I would like to thank the talented fellow graduate students and faculty members with whom I spent a lot of time discussing and sharing ideas: Tadesse Zemicheal, Shubhomoy Das, Andrew Emmott, Jed Irvine, Michael Slater, Liu Si, and Risheek Garrepalli.
Finally, I would like to thank my friends, family members, and communities in Corvallis for their help with the non-academic aspects of my life.

TABLE OF CONTENTS

1 Introduction
  1.1 Anomaly Detection
  1.2 Outline
2 Rare Pattern Anomaly Detection (RPAD)
  2.1 Introduction
  2.2 Rare Pattern Anomaly Detection
  2.3 Probably Approximately Correct Framework
  2.4 Finite Sample Complexity of RPAD
    2.4.1 Sample Complexity for Finite H
    2.4.2 Sample Complexity for Infinite H
  2.5 Application to Specific Pattern Spaces
    2.5.1 Conjunctions
    2.5.2 Halfspaces
    2.5.3 Axis-Aligned Hyper-Rectangles
    2.5.4 Stripes
    2.5.5 Ellipsoids and Shells
  2.6 Experiments
    2.6.1 Pattern Space and Anomaly Detector Selection
    2.6.2 Learning Curve Generation
    2.6.3 Results
  2.7 Summary
3 Sequential Feature Explanations (SFE)
  3.1 Introduction
  3.2 Related Work
  3.3 Anomaly Detection Formulation
  3.4 Sequential Feature Explanations
  3.5 Explanation Methods
    3.5.1 SFE Objective Function
    3.5.2 Greedy Algorithms
    3.5.3 Branch and Bound Search
  3.6 Framework for Evaluating Explanations
    3.6.1 Anomaly Detection Benchmarks
    3.6.2 Simulated Analyst
  3.7 Empirical Evaluation
    3.7.1 Anomaly Detector
    3.7.2 Empirical Comparison of Greedy vs. BaB
    3.7.3 Evaluation on Benchmark Data Sets
    3.7.4 Comparing Methods with Oracle Detectors
    3.7.5 Evaluation on High-Dimensional Datasets
  3.8 Experiments on Real World Network Data
    3.8.1 Data Collection
    3.8.2 Anomaly Detector Setup
    3.8.3 Explanation Results
  3.9 Main Observations and Recommendation
  3.10 Summary
4 Feedback-Guided Anomaly Discovery
  4.1 Introduction
  4.2 Anomaly Detection Problem Setup
  4.3 Related Work
  4.4 Incorporating Feedback via Online Convex Optimization
    4.4.1 Online Convex Optimization
    4.4.2 Loss Functions for Query-Guided Anomaly Discovery
    4.4.3 The Mirror Descent Learning Algorithm
  4.5 Application to Tree-Based Anomaly Detection
  4.6 Empirical Results
    4.6.1 Experiments on Benchmark Data
    4.6.2 Experiments on Cybersecurity Data
  4.7 Experiments on Real World Network Data
    4.7.1 Incorporating Analyst's Feedback
    4.7.2 Detection Results
  4.8 Summary
5 Class Discovery
  5.1 Introduction
  5.2 Problem Setup
  5.3 Class Discovery Algorithms
    5.3.1 Anomaly Detection
    5.3.2 Likelihood
    5.3.3 Entropy
    5.3.4 DPEA
    5.3.5 PWrong
    5.3.6 SVM and KDE Fusion
    5.3.7 Clustering
    5.3.8 Random
  5.4 Empirical Evaluation
    5.4.1 Benchmark
    5.4.2 Experimental Setup
    5.4.3 Results
  5.5 Summary
6 Conclusions and Future Work
Bibliography
Appendices
  A Proofs

LIST OF FIGURES

1.1 Density-based anomaly detection. The red points are outliers due to lying in a low-density region.

2.1 Learning curves for the three scoring methods (IF, AVE, MIN) with varying pattern space complexity k over three benchmarks (rows). MIN represents the main RPAD approach analyzed in this chapter.

3.1 The anomaly detection pipeline addressed by this chapter (from left to right). The original data set contains both normal points and anomalies. Our goal is to detect the true semantic anomalies. First, an anomaly detector is applied to identify a set of statistical outliers, which are further analyzed by a human analyst in order to identify the true semantic anomalies. The false negatives (missed true anomalies) of the overall system comprise two types of true anomalies: 1) anomalies that are considered statistically normal and are never presented to the human analyst, and 2) anomalies that are statistical outliers but are misidentified by the human analyst as normal. The first type of false negative can only be avoided by changing the anomaly detector and is not the focus of this chapter. The second type can be avoided by making it easier for analysts to recognize true anomalies when presented with them. The focus of this chapter is to compute explanations of statistical outliers that reduce the effort required to detect such true anomalies.

3.2 Analyst certainty curves. These are example curves generated using our simulated analyst on anomalies from the Abalone benchmark, using SFEs produced by SeqMarg. The x-axis shows the index of the feature revealed at each step and the y-axis shows the analyst's certainty about the anomalies being normal. The leftmost curve shows an example where the analyst gradually becomes certain that the point is anomalous, while the middle curve shows more rapidly growing certainty. The rightmost curve is an example where the analyst is certain of the anomaly after the first feature is revealed and remains certain.

3.3 The bars show the Expected Smallest Prefix (ESP) achieved by different methods, averaged across the top 10% most highly ranked ground-truth anomalies in all the benchmarks. 95% confidence intervals are also shown.

3.4 Performance of explanation methods on benchmarks. Each group of bars shows the performance of the six methods on benchmarks derived from a single mother set. The bars show the expected MFP averaged across anomalies in benchmarks for the corresponding mother set. 95% confidence intervals are also shown.

3.5 Performance of explanation methods on benchmarks when using an oracle anomaly detector. Each group of bars shows the performance of the six methods on benchmarks derived from a single mother set.
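The sequential feature explanation idea behind Figures 3.2 and 3.3 (reveal features to the analyst one at a time, most incriminating first) can be illustrated with a toy greedy ordering. This is a stand-in sketch, not the dissertation's SeqMarg method: it simply ranks features by the queried point's per-feature z-score under a Gaussian fit to the background data, and the function name `greedy_sfe` is hypothetical.

```python
import numpy as np

# Toy sequential feature explanation: order features by how anomalous
# the queried instance looks along each one individually, so the analyst
# sees the most suspicious feature first. Illustrative sketch only.

def greedy_sfe(X, x):
    """Return feature indices for instance x, most anomalous first."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0) + 1e-12         # avoid division by zero
    z = np.abs((x - mu) / sigma)          # per-feature anomalousness
    return list(np.argsort(-z))           # reveal largest deviations first

rng = np.random.default_rng(1)
X = rng.normal(0.0, 1.0, size=(500, 4))   # normal background data
x = np.array([0.1, 6.0, -0.2, 0.3])       # anomalous only in feature 1

print(greedy_sfe(X, x))                   # feature 1 is revealed first
```

A metric like ESP then measures how many features of such an ordering the analyst must see before becoming confident the point is truly anomalous.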