System Call Analysis and Visualization

SYSTEM CALL ANALYSIS AND VISUALIZATION A Project Presented to the faculty of the Department of Computer Science California State University, Sacramento Submitted in partial satisfaction of the requirements for the degree of MASTER OF SCIENCE in Computer Science by Aditya Singh FALL 2018 © 2018 Aditya Singh ALL RIGHTS RESERVED ii SYSTEM CALL ANALYSIS AND VISUALIZATION A Project by Aditya Singh Approved by: __________________________________, Committee Chair Dr. Xiaoyan Sun __________________________________, Second Reader Dr. Jun Dai ____________________________ Date iii Student: Aditya Singh I certify that this student has met the requirements for format contained in the University format manual, and that this project is suitable for shelving in the Library and credit is to be awarded for the project. __________________________, Graduate Coordinator ___________________ Dr. Jinsong Ouyang Date Department of Computer Science iv Abstract of SYSTEM CALL ANALYSIS AND VISUALIZATION by Aditya Singh Nowadays it is very widespread to see attacks in the system. The attackers try automated tools and programs to attempt and gain access to the data of the users. However, for attackers, it is hard to boycott system calls. System calls are used by the user-level processes to request the different services from the kernel of the operating system. It is very difficult for the attacks to evade the system calls. The system calls are used to make every basic interaction between the operating system and program. The system performs allocating and deallocating memory, closing, reading, renaming and the opening of files, and starting and stopping a process. The size of the system log can be overwhelmingly huge, which makes it hard for the system admins to extract the useful information from it. In this project, we propose to analyze and visualize the system calls so that it can help the system administrators to extract information from the log easily and identify suspicious activities and behavior. The steps in the project include data collection/gathering, data v exploration, data cleaning, data transformation, data mining, and data visualization. This approach helps to extract important information from the system calls by using data mining and machine learning algorithms. The statistics obtained through system call analysis and visualization provide valuable information about the system activities and reveal important patterns. This information and patterns can help identify suspicious behavior which might be related to attacks. _______________________, Committee Chair Dr. Xiaoyan Sun ________________________ Date vi DEDICATION To My Parents vii ACKNOWLEDGEMENTS I want to thank Dr. Xiaoyan Sun, for providing me with an opportunity to work on this project and guiding me throughout this project. She offered me great insight and believed in my abilities. I thank her for continually providing feedback and pushing my limits to improve my mistakes. I would also like to thank Dr. Jun Dai for his readiness in evaluating this report and providing helpful feedback. I would also like to thank the Department of Computer Science at California State University, Sacramento. I am grateful to my parents, friends, and elders for supporting me throughout this journey to complete the Master’s degree program. Also, I would like to thank for the continuous support and feedback from Preetham Dhondaley and Bhuvan Bhatia. viii TABLE OF CONTENTS Page Acknowledgements .......................................................................................................... viii List of Figures .................................................................................................................... xi List of Acronyms .............................................................................................................. xii Chapter 1. INTRODUCTION .......................................................................................................1 1.1 Research Motivation ...........................................................................................1 1.2 Related Work ......................................................................................................2 1.3 Our approach ......................................................................................................4 2. BACKGROUND .........................................................................................................5 2.1 Used Technologies/Tools ...................................................................................5 2.2 Machine Learning concepts ................................................................................8 3. DESIGN .....................................................................................................................10 4. DATA COLLECTION ..............................................................................................12 5. DATA PREPROCESSING .......................................................................................14 6. DATA ANALYSIS AND VISUALIZATION ..........................................................18 6.1 Start time and End time Distribution ................................................................19 6.2 System call vs. Start time Distribution .............................................................21 ix 6.3 PCMD and Start time Distribution ...................................................................24 6.4 Analysis using Machine learning ......................................................................26 6.4.1 K-means Clustering .....................................................................................27 6.4.2 K-means clustering with PCMD and Start time ..........................................30 6.4.3 Clustering with all the essential attributes ...................................................31 7 CONCLUSION ..........................................................................................................34 8 FUTURE WORK .......................................................................................................35 Bibliography ......................................................................................................................36 x LIST OF FIGURES Figures Page 1. ARFF format ........................................................................................................... 7 2. K-means clustering ................................................................................................. 9 3. Project Design ....................................................................................................... 10 4. Data Collection ..................................................................................................... 13 5. Data preprocessing ................................................................................................ 14 6. Transformed data .................................................................................................. 15 7. Code Snippet to create a csv file ........................................................................... 16 8. Start Time distribution .......................................................................................... 19 9. End time distribution............................................................................................. 19 10. System call vs. start time distribution ................................................................... 21 11. System call Count ................................................................................................. 23 12. PCMD table .......................................................................................................... 25 13. Clustering report ................................................................................................... 28 14. Visualization of cluster ......................................................................................... 29 15. Clustering report with PCMD and Start time ....................................................... 30 16. Clustering visualization with PCMD and Start time ............................................. 31 17. Visualization of Start time and PPID .................................................................... 32 18. Result of Start time and PPID clustering .............................................................. 32 xi LIST OF ACRONYMS WEKA: Waikato Environment for Knowledge Analysis ARFF: Attribution Relation File Format SODG: System Object Dependency Graph PCMD: Parallel Command SSL: Secure Sockets Layer SSH: Secure Shell Daemon XSS: Cross-Site Scripting CSV: Comma Separated Values PPID: Parent Process Identity PID: Process Identity xii 1 1. INTRODUCTION In the current world, attackers have been using different tools and technologies to gain access to the systems of an enterprise network. Analyzing system data becomes one of the most commonly used techniques to detect intrusions. Since system calls neutrally capture all system activities, including benign ones and malicious ones, analyzing system calls is a very effective way to detect attacks. However, due to the overwhelming amount of system calls that can be generated by systems, extracting useful information from system call logs is very challenging. Therefore, system call analysis and visualization are very important for efficient and effective detection of attacks. 1.1 Research Motivation The system call is an interface between the kernel and user programs. The kernel provides services

Load more