Anomaly Detection and Analysis in Big Data
Anomaly Detection and Analysis in Big Data

A Thesis submitted in partial fulfillment of the requirements for the award of degree of Doctor of Philosophy

Submitted by
Sahil Garg (901403026)

Under the guidance of
Dr. Shalini Batra
Associate Professor, Computer Science and Engineering Department,
Thapar Institute of Engineering and Technology, Patiala, India

Computer Science and Engineering Department
Thapar Institute of Engineering and Technology
Patiala-147004, India
September 2018

Contents

List of Figures
List of Tables
List of Algorithms
Certificate
Acknowledgements
Abstract

1 Introduction
    1.1 Background
    1.2 Classification of Anomalies
        1.2.1 Point Anomalies
        1.2.2 Contextual Anomalies
        1.2.3 Collective Anomalies
    1.3 Modes of Machine Learning Algorithms
        1.3.1 Supervised Anomaly Detection
        1.3.2 Unsupervised Anomaly Detection
        1.3.3 Semi-Supervised Anomaly Detection
    1.4 Types of Anomaly Detection Techniques
        1.4.1 Classification Techniques
            One-class classification
            Multi-class classification
        1.4.2 Clustering Techniques
            Partitioning based techniques
            Hierarchy based techniques
            Density based techniques
            Grid based techniques
            Graph based techniques
        1.4.3 Statistical Techniques
            Parametric Techniques
            Non-Parametric Techniques
        1.4.4 Rule-based Techniques
        1.4.5 Information Theory based Techniques
    1.5 Applications of Anomaly Detection
    1.6 Thesis Organization

2 Literature Review
    2.1 Dimensionality Reduction
    2.2 Optimization Schemes
    2.3 Machine Learning Approaches
    2.4 Deep Learning Approaches
    2.5 Comparative Analysis
    2.6 Various Sources of Datasets
    2.7 Crucial Aspects of Anomaly Detection
    2.8 Motivation
    2.9 Need for Anomaly Detection in Big Data
    2.10 Machine Learning for Anomaly Detection
    2.11 Objectives
    2.12 Concluding Remarks

3 Ensemble based Anomaly Detection Technique
    3.1 Working of En-ADT
        3.1.1 Feature Selection
            Fuzzy K-Means Algorithm:
            Complexity Analysis
        3.1.2 Extended Kalman Filter
            Complexity Analysis
        3.1.3 Support Vector Machines
            The Support Vectors of SVM
            Kernel Functions in SVM
            Complexity Analysis
        3.1.4 Ensembled Anomaly Detection Technique
            Complexity Analysis
    3.2 Concluding Remarks

4 Fuzzified Cuckoo Based Clustering Technique
    4.1 Working of F-CBCT
        4.1.1 Training Phase
            Decision Tree Criterion (DTC):
            Multi-objective Cuckoo-Search Optimization Algorithm (CSO):
            K-Means Clustering Algorithm:
            Classification and Anomaly Detection:
        4.1.2 Detection Phase
            Fuzzy Detection Phase:
    4.2 Concluding Remarks

5 Experiments and Implementation Details
    5.1 En-ADT
        5.1.1 Datasets
        5.1.2 Performance Metrics
            Binary-class classification problem:
            Multi-class classification problem:
        5.1.3 Comparison of the Proposed Technique with its counterparts
    5.2 F-CBCT
        5.2.1 Datasets
            NSL-KDD Dataset:
        5.2.2 Performance Evaluation Metrics
            Varying C-Measure and AD-Measure:
            Root Mean Square Error:
            Other Metrics:
    5.3 Concluding Remarks

6 Conclusion and Future Scope
    6.1 Thesis Contributions
    6.2 Future Scope

References
List of Publications

List of Figures

1.1 Overview of the anomaly detection techniques
3.1 Flow of the proposed technique
4.1 Framework of the proposed F-CBCT
4.2 Surface plots for different membership functions
4.3 Trimf rule viewer for fuzzy inference system (FIS) of F-CBCT
5.1 Evaluation of proposed technique on DARPA’98 dataset
5.2 Evaluation of proposed technique on KDD’99 dataset
5.3 Performance evaluation of En-ADT on DARPA’98 and KDD’99 dataset
5.4 Performance evaluation of the membership function in terms of RMSE
5.5 Performance evaluation of F-CBCT

List of Tables

1.1 Several definitions of anomaly
1.2 Distance functions
1.3 Similarity functions
1.4 Anomalies in wireless sensor networks
2.1 Comparison of different feature selection techniques
2.2 Comparison of some recently proposed feature selection schemes
2.3 Comparison of the existing optimization schemes
2.4 Overview of popular machine learning techniques
2.5 Comparison of anomaly detection techniques on the basis of data labels
2.6 Comparison of some existing machine learning based anomaly detection techniques
2.7 Comparison of some existing anomaly detection schemes based on distinctive characteristics
2.8 Sources of datasets for anomaly detection models
4.1 Components of fuzzy inference system
4.2 Rule-matrix for the proposed fuzzy system
4.3 Rule-set for the proposed fuzzy system
4.4 Membership functions with two inputs & one output
5.1 Summary of datasets
5.2 Summary of anomalous classes in KDD’99 Dataset
5.3 Comparison of existing anomaly detection techniques on DARPA’98 dataset
5.4 Comparison of existing anomaly detection techniques on KDD’99 dataset
5.5 Characteristics of datasets from UCI ML repository
5.6 NSL-KDD dataset description
5.7 Description of selected features from NSL-KDD dataset using decision tree
5.8 Number of selected features for all classes from NSL-KDD dataset
5.9 Evaluation of membership functions for anomaly detection
5.10 Comparison of RMSE for different membership functions
5.11 Comparison of the proposed technique with its variants

List of Algorithms

3.1 Feature extraction by FKM
3.2 Feature optimization by EKF
3.3 Label detection by SVM
3.4 Anomaly detection by the proposed algorithm
4.1 Decision tree formation
4.2 Post-Pruning of Decision Tree (DT)
4.3 Multi-objective CSO algorithm
4.4 K-Means clustering algorithm
4.5 Computation of C-Measure and AD-Measure