Network Traffic Fingerprinting Using Machine Learning and Evolutionary Computing
Total Page:16
File Type:pdf, Size:1020Kb
UNIVERSITY OF NEVADA,RENO Network Traffic Fingerprinting using Machine Learning and Evolutionary Computing A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Science and Engineering by Ahmet Aksoy Dr. Mehmet Hadi Gunes / Dissertation Advisor August 2019 c 2019 Ahmet Aksoy All Rights Reserved UNIVERSITY OF NEVADA THE GRADUATE SCHOOL RENO We recommend that the dissertation prepared under our supervision by AHMET AKSOY entitled Network Traffic Fingerprinting using Machine Learning and Evolutionary Computing be accepted in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY Mehmet Hadi Gunes, Ph.D. – Advisor Sushil J. Louis, Ph.D. – Committee Member Emily M. Hand, Ph.D. – Committee Member Shamik Sengupta, Ph.D. – Committee Member M. Sami Fadali, Ph.D. – Graduate School Representative David Zeh, Ph.D. – Dean, Graduate School August 2019 i Abstract The Internet has become essential to our daily life, especially with a multi- tude of IoT devices. However, the end hosts connected to the Internet are prone to be compromised. An essential measure for protecting attacks on end hosts is through the detection of system characteristics and isolation of vulnerable devices by restriction of communications to the device. Network traffic finger- printing provides the ability to remotely and automatically gather information about the hosts within a network. Fingerprinting can help perform network management and the detection and isolation of vulnerable hosts. It is essential to automate this process to perform fingerprinting more efficiently. It also provides the ability to adapt to changes in the behavior of host devices and software. Network practitioners rely on some classifier tools for fingerprinting, but they rely on an expert to select features/attributes and generate machine learning models. Hence, existing approaches need to be manually updated for each new Operating System (OS) or IoT device introduced to the network. This dissertation addresses automated methods and tools for performing fin- gerprinting of OSes and IoT devices. It also presents SILEA, a new inductive learning algorithm along with improvements to its numerical feature quantiza- tion to further improve classification accuracy. ii SILEA is a covering-method inductive learning algorithm that reliably extracts IF-THEN rules from a collection of examples/instances. The algorithm elimi- nates exhaustive feature selection by reducing the number of features to be con- sidered for each necessary iteration of rule extraction. We also use a genetic algorithm (GA) to determine the maximum number of clusters to be consid- ered for each numeric feature for quantization and observe their contribution to classification accuracy. Once the number of clusters for each numeric feature is determined, we run the k-means algorithm for each feature with the number of clusters that are pre-determined by GA to obtain as optimal ranges for numeric features as possible. We analyze the TCP/IP packet headers to automate OS classification. We utilize a GA to determine the relevant packet header features, which helps reduce the classification complexity and increases accuracy by eliminating noisy features from the data. We use several machine learning algorithms to generate a set of rules and models that can differentiate OSes. We also investigate an automated system, called OSID, for classifying host OSes by analyzing the network packets that they generate without relying on human experts. We introduce another automated system, called SysID, for the classification of IoT device characteristics based on their network traffic. The system uses any single packet that is originated from the device to detect its kind. We utilize a GA to determine relevant features in different protocol headers, and then deploy various machine learning algorithms to classify host device types by analyz- ing features selected by GA. SysID allows a completely automated classification of IoT devices using their TCP/IP packets without expert input. SILEA, OSID and SysID codes and trained models are available at https://github.com/ netml/. iii Dedicated to my family, advisor and friends. iv Acknowledgements Firstly, I would like to express my sincere gratitude to my advisor Dr. Mehmet Hadi Gunes, for the continuous support of my Ph.D. study and related research, for his patience, motivation, and immense help. I could not have imagined hav- ing a more helpful, understanding, and patient advisor and mentor for my Ph.D. study. Besides my advisor, I would like to thank the rest of my thesis committee; Dr. Sushil J. Louis, Dr. Emily M. Hand, Dr. Shamik Sengupta and Dr. M. Sami Fadali for their insightful comments and encouragement, but also for the hard questions which incented me to widen my research from various perspectives. I would also like to thank Dr. Murat Yuksel for all the tremendous support and assistance he has given me on learning about and in pursuing an academic career. I thank my fellow labmates, Dr. M. Abdullah Canbaz, Khalid Bakhshaliyev, Paulo Regis, Jay Thom and Steven Fisher for the stimulating discussions, the sleepless nights we spent working together before deadlines and for all the fun we had in the last five years. I also thank our department’s administrative assis- tants, Lisa Cody, and Heather Lara, for their help throughout this journey. v Ahmet Aksoy (left) and Dr. Mehmet Sabih Aksoy (right) at the University of Wales, Cardiff in South Wales, UK in 1991. Last but not the least, I would like to thank my family: my parents Dr. Mehmet Sabih, Aysel Aksoy; my sisters Yumna, Hamide and my brother Cuneyt Ak- soy for supporting me spiritually throughout writing this thesis and my life in general. I thank my father, a professor of computer science, for his support all these years. His research and insights have been a tremendous help for me to shape my goals and to be where I am today. I could not have asked for a more understanding and supportive family. This thesis is based in part upon work supported by the National Science Foun- dation under grant numbers CNS-1321164, EPS-IIA-1301726, and CICI-1739032. Any opinions, findings, conclusions, or recommendations expressed in this ma- terial are those of the author(s) and do not necessarily reflect the views of the National Science Foundation. vi Contents Abstracti Acknowledgements iv Contents vi List of Tables ix List of Figures xi 1 Introduction1 2 Related Studies7 2.1 Inductive Learning...........................7 2.2 Clustering using Genetic Algorithms.................9 2.3 Genetic Algorithms for Feature Subset Selection.......... 11 2.4 Operating System Fingerprinting................... 12 2.4.1 Active Fingerprinting..................... 13 2.4.2 Passive Fingerprinting..................... 14 2.4.3 Hybrid Fingerprinting..................... 16 2.4.4 Mobile OS Fingerprinting................... 17 2.5 IoT Device Classification........................ 17 2.5.1 Active & Passive Fingerprinting............... 17 2.5.2 Device Authentication..................... 19 2.5.3 Time-domain Based Fingerprinting............. 19 2.5.4 Wireless Network Fingerprinting............... 20 2.5.5 Mobile Device Fingerprinting................. 20 3 SILEA - a System for Inductive LEArning 22 3.1 Proposed Algorithm.......................... 23 3.1.1 Data Initialization....................... 24 3.1.2 Feature Selection Procedure.................. 26 3.1.3 Rule Extraction Procedure................... 28 3.2 Illustrative Problem........................... 30 3.3 Experimental Results.......................... 35 vii 4 Operating System Classification Performance of TCP&IP Protocol Head- ers 39 4.1 Methodology.............................. 41 4.1.1 Data Initialization....................... 42 4.1.2 Feature Selection........................ 44 4.1.3 Rule Extraction......................... 45 4.2 Experimental Results.......................... 45 4.2.1 IP Protocol Headers...................... 46 4.2.2 ICMP Protocol Headers.................... 49 4.2.3 TCP Protocol Headers..................... 50 4.2.4 UDP Protocol Headers..................... 53 4.2.5 HTTP Protocol Headers.................... 54 4.2.6 DNS Protocol Headers..................... 55 4.2.7 SSL Protocol Headers..................... 57 4.2.8 SSH Protocol Headers..................... 58 4.2.9 FTP Protocol Headers..................... 59 4.2.10 Discussion............................ 60 5 Operating System Fingerprinting via Automated Network Traffic Anal- ysis 61 5.1 Methodology.............................. 63 5.1.1 Data Collection......................... 63 5.1.2 Feature Selection........................ 65 5.1.3 OS Classification........................ 66 5.2 Experimental Results.......................... 67 5.2.1 OS Genre Results........................ 68 5.2.2 Linux OS Results........................ 69 5.2.3 Mac OS Results......................... 71 5.2.4 Windows OS Results...................... 73 5.2.5 Feature Subset Selection.................... 74 5.2.6 Discussion............................ 77 6 Operating System Identification using Network Packet Headers 78 6.1 OSID: Operating System Identifier.................. 80 6.1.1 Data Initialization....................... 81 6.1.2 Feature Selection........................ 81 6.1.3 OS Identification........................ 84 6.2 Experimental Results.......................... 85 6.2.1 Data Collection......................... 86 6.2.2 Individual Protocol Classification Results.......... 87 6.2.3 Individual