Autoencoder-Based Representation Learning to Predict Anomalies in Computer Networks

St. Cloud State University
theRepository at St. Cloud State
Culminating Projects in Information Assurance
Department of Information Systems
5-2020

Shehram Khan

Recommended Citation
Khan, Shehram, "Autoencoder-Based Representation Learning to Predict Anomalies in Computer Networks" (2020). Culminating Projects in Information Assurance. 102. https://repository.stcloudstate.edu/msia_etds/102

This Thesis is brought to you for free and open access by the Department of Information Systems at theRepository at St. Cloud State. It has been accepted for inclusion in Culminating Projects in Information Assurance by an authorized administrator of theRepository at St. Cloud State.

Autoencoder-Based Representation Learning to Predict Anomalies in Computer Networks
by
Shehram Sikander Khan

A Thesis Submitted to the Graduate Faculty of St. Cloud State University in Partial Fulfillment of the Requirements for the Degree of Master of Science in Information Assurance
May, 2020

Thesis Committee:
Akalanka Mailewa Dissanayaka, Chairperson
Mark Schmidt
David Robinson

Abstract

With the recent advances in Internet of Things (IoT) devices, cloud-based services, and diversity in network data, there has been a growing need for sophisticated anomaly detection algorithms within the network intrusion detection system (NIDS) that can tackle advanced network threats. Advances in deep learning and machine learning (ML) have been garnering considerable interest among researchers since they have the capacity to provide a solution to advanced threats such as the zero-day attack. An Intrusion Detection System (IDS) is the first line of defense against network-based attacks, compared to other traditional technologies such as firewall systems.
This report adds to the existing approaches by proposing a novel strategy to incorporate both supervised and unsupervised learning into Intrusion Detection Systems (IDS). Specifically, the study will utilize a deep autoencoder (DAE) as a dimensionality reduction tool and a Support Vector Machine (SVM) as a classifier to perform anomaly-based classification. The study diverges from other similar studies by performing a thorough analysis of deep autoencoders as a valid non-linear dimensionality reduction tool, comparing them against Principal Component Analysis (PCA) and tuning hyperparameters to optimize the 'F-1 Micro' score and 'Balanced Accuracy', since we are dealing with a dataset with imbalanced classes. The study employs robust analysis tools such as Precision-Recall Curves, Average-Precision score, Train-Test Times, t-SNE, Grid Search, and L1/L2 regularization. Our model will be trained and tested on the publicly available datasets KDDTrain+ and KDDTest+.

Table of Contents

List of Tables
List of Figures
Chapter
I. Introduction
    Intrusion Detection System
    Machine Learning Algorithms
    KDDCUP99 and NSL-KDD Dataset
    Problem Statement
    Nature and Significance of the Problem
    Objective of the Study
    Study Questions/Hypotheses
    Summary
II. Background and Review of Literature
    Introduction
    Background Related to the Problem
        Deep Learning
        Autoencoders
        Support Vector Machine (SVM)
        Regularization
    Literature Related to the Problem
    Summary
III. Methodology
    Introduction
        Definition of Terms
    Data Preprocessing
        Hardware and Software Environment
    Design and Implementation of the Study
    Tools and Techniques
    Performance Evaluation
        Accuracy
        Precision-Recall Curve
        Test and Train Timings
        F-measure
IV. Results
    Visualizing Data using t-SNE
    Grid Search
    Classification Metrics
        Accuracy, Precision-Recall, F-Score
        Precision-Recall Curves
    Performance Metrics
        Train and Test Time
    Conclusion
References
Appendix

List of Figures

1. Neural Network Architecture
2. SELU plotted for a=1.6732~, Lambda=1.0507~
3. Autoencoder (AE) Neural Architecture
4. t-SNE Representation of Encoded Representation (Perplexity=50, Iterations=500)
5. t-SNE Representation of Encoded Representation (Perplexity=100, Iterations=500)
6. t-SNE Representation of Encoded Representation (Perplexity=50, Iterations=1000)
7. t-SNE Representation of PCA (Perplexity=50, Iterations=500)
8. t-SNE Representation of PCA (Perplexity=100, Iterations=500)
9. t-SNE Representation of PCA (Perplexity=100, Iterations=1000)
10. Standalone SVM for Binary Class Precision-Recall Curve
11. PCA+SVM Binary Class Precision-Recall Curve
12. AE+SVM Precision-Recall Curve (Polynomial Kernel)
13. Standalone SVM for MultiClass Precision-Recall Curves
14. PCA+SVM MultiClass Precision-Recall Curves
15. AE+SVM MultiClass Precision-Recall Curves

List of Tables

1. Attack Types in NSL-KDD Dataset
2. Grid Search with 'F1-Micro' Scoring
3. Grid Search with 'Balanced Accuracy' Scoring
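
The abstract and the grid-search tables name two scoring criteria, 'F1-Micro' and 'Balanced Accuracy', chosen because the classes are imbalanced. The following is a minimal sketch of how the two scores behave on imbalanced labels. It is not the thesis's implementation: the function names `micro_f1` and `balanced_accuracy` and the toy labels are assumptions made for illustration, and both scores use their standard textbook definitions (pooled true/false positives for micro-F1, mean per-class recall for balanced accuracy).

```python
from collections import Counter


def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool TP, FP and FN over all classes, then compute F1."""
    labels = set(y_true) | set(y_pred)
    tp = fp = fn = 0
    for label in labels:
        for t, p in zip(y_true, y_pred):
            if p == label and t == label:
                tp += 1
            elif p == label:      # predicted this class, truth differs
                fp += 1
            elif t == label:      # truth is this class, prediction differs
                fn += 1
    return 2 * tp / (2 * tp + fp + fn)


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so each class counts equally regardless of size."""
    support = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return sum(hits[c] / support[c] for c in support) / len(support)


# Hypothetical toy labels (not NSL-KDD data): 8 normal records, 2 attacks.
y_true = ["normal"] * 8 + ["attack"] * 2
# A degenerate classifier that always predicts the majority class.
y_pred = ["normal"] * 10

print(micro_f1(y_true, y_pred))           # 0.8
print(balanced_accuracy(y_true, y_pred))  # 0.5
```

On this toy input the always-"normal" classifier still reaches a micro-F1 of 0.8, which for single-label classification equals plain accuracy, while its balanced accuracy drops to 0.5 because the attack class is missed entirely. This suggests why reporting both scores is informative when attack classes are rare.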
