

SYSTEM CALL ANALYSIS AND VISUALIZATION

A Project

Presented to the faculty of the Department of Computer Science

California State University, Sacramento

Submitted in partial satisfaction of the requirements for the degree of

MASTER OF SCIENCE

in

Computer Science

by

Aditya Singh

FALL 2018

© 2018

Aditya Singh
ALL RIGHTS RESERVED


SYSTEM CALL ANALYSIS AND VISUALIZATION

A Project

by

Aditya Singh

Approved by:

__________________________________, Committee Chair
Dr. Xiaoyan Sun

__________________________________, Second Reader
Dr. Jun Dai

____________________________
Date


Student: Aditya Singh

I certify that this student has met the requirements for format contained in the University format manual, and that this project is suitable for shelving in the Library and that credit is to be awarded for the project.

__________________________________, Graduate Coordinator    ____________
Dr. Jinsong Ouyang                                           Date

Department of Computer Science


Abstract

of

SYSTEM CALL ANALYSIS AND VISUALIZATION

by

Aditya Singh

Nowadays attacks on computer systems are widespread. Attackers use automated tools and programs to attempt to gain access to users' data. However, it is hard for attackers to bypass system calls. System calls are used by user-level processes to request different services from the kernel of the operating system, and it is very difficult for attacks to evade them.

System calls are used for every basic interaction between the operating system and a program, such as allocating and deallocating memory; opening, reading, renaming, and closing files; and starting and stopping a process. The size of the system call log can be overwhelmingly huge, which makes it hard for system administrators to extract useful information from it.

In this project, we propose to analyze and visualize system calls so that system administrators can more easily extract information from the log and identify suspicious activities and behavior. The steps in the project include data collection/gathering, data exploration, data cleaning, data transformation, data mining, and data visualization. This approach helps to extract important information from the system calls by using data mining and machine learning algorithms. The statistics obtained through system call analysis and visualization provide valuable information about the system activities and reveal important patterns. This information and these patterns can help identify suspicious behavior which might be related to attacks.

______, Committee Chair

Dr. Xiaoyan Sun

______

Date

DEDICATION

To My Parents


ACKNOWLEDGEMENTS

I want to thank Dr. Xiaoyan Sun for providing me with the opportunity to work on this project and for guiding me throughout. She offered me great insight and believed in my abilities. I thank her for continually providing feedback and pushing me to improve.

I would also like to thank Dr. Jun Dai for his readiness in evaluating this report and providing helpful feedback. I would also like to thank the Department of Computer Science, California State University, Sacramento.

I am grateful to my parents, friends, and elders for supporting me throughout this journey to complete the Master’s degree program. I would also like to thank Preetham Dhondaley and Bhuvan Bhatia for their continuous support and feedback.


TABLE OF CONTENTS

Page

Acknowledgements ...... viii

List of Figures ...... xi

List of Acronyms ...... xii

Chapter

1. INTRODUCTION ...... 1

1.1 Research Motivation ...... 1

1.2 Related Work ...... 2

1.3 Our approach ...... 4

2. BACKGROUND ...... 5

2.1 Used Technologies/Tools ...... 5

2.2 Machine Learning concepts ...... 8

3. DESIGN ...... 10

4. DATA COLLECTION ...... 12

5. DATA PREPROCESSING ...... 14

6. DATA ANALYSIS AND VISUALIZATION ...... 18

6.1 Start and End time Distribution ...... 19

6.2 System call vs. Start time Distribution ...... 21

6.3 PCMD and Start time Distribution ...... 24

6.4 Analysis using Machine learning ...... 26

6.4.1 K-means Clustering ...... 27

6.4.2 K-means clustering with PCMD and Start time ...... 30

6.4.3 Clustering with all the essential attributes ...... 31

7. CONCLUSION ...... 34

8. FUTURE WORK ...... 35

Bibliography ...... 36


LIST OF FIGURES

Figures Page

1. ARFF format ...... 7

2. K-means clustering ...... 9

3. Project Design ...... 10

4. Data Collection ...... 13

5. Data preprocessing ...... 14

6. Transformed data ...... 15

7. Code Snippet to create a csv ...... 16

8. Start Time distribution ...... 19

9. End time distribution...... 19

10. System call vs. start time distribution ...... 21

11. System call Count ...... 23

12. PCMD table ...... 25

13. Clustering report ...... 28

14. Visualization of cluster ...... 29

15. Clustering report with PCMD and Start time ...... 30

16. Clustering visualization with PCMD and Start time ...... 31

17. Visualization of Start time and PPID ...... 32

18. Result of Start time and PPID clustering ...... 32


LIST OF ACRONYMS

WEKA: Waikato Environment for Knowledge Analysis

ARFF: Attribute-Relation File Format

SODG: System Object Dependency Graph

PCMD: Parallel Command

SSL: Secure Sockets Layer

SSH: Secure Shell

XSS: Cross-Site Scripting

CSV: Comma Separated Values

PPID: Parent Process Identity

PID: Process Identity



1. INTRODUCTION

In the current world, attackers have been using different tools and technologies to gain access to the systems of an enterprise network. Analyzing system data has become one of the most commonly used techniques to detect intrusions. Since system calls neutrally capture all system activities, both benign and malicious, analyzing system calls is a very effective way to detect attacks. However, due to the overwhelming number of system calls that a system can generate, extracting useful information from system call logs is very challenging. Therefore, system call analysis and visualization are very important for efficient and effective detection of attacks.

1.1 Research Motivation

The system call is an interface between the kernel and user programs. The kernel provides services to user programs. To interact with the kernel, every user application makes use of system calls, which together form an Application Programming Interface (API) [1]. When attackers perform malicious activities on a system, these activities are captured by system calls. System calls cannot be hidden or avoided, and they are useful when analyzed in the form of a log to detect suspicious activities. Applications generally interact with the kernel to run their assigned tasks on the system. Hence system calls record the user's interactions with the system and are helpful in detecting suspicious patterns.


Because of the continuous interaction between applications and the kernel, a tremendous number of system calls are generated every second. The exchange between the operating system and an application never stops, and even a log covering 10 minutes produces a large amount of data. Even for a single system, gathering months of system call data and analyzing it is overwhelming.

Similarly, an enterprise network with hundreds of running machines would create an amount of data that is out of bounds for manual analysis. With this volume of data, analysts face difficulty in detecting suspicious activities and in differentiating between a legitimate event and an attack. Because of these difficulties, we want to explore and visualize the data to reveal suspicious patterns.

1.2 Related Work

The major problem for researchers is to detect attacks in the system. In the past, different techniques have been used for intrusion detection. Previous research work by Dai et al. [3] proposed a system named Patrol that detects ‘zero-day attack paths’ at runtime. They build a network-wide System Object Dependency Graph (SODG) that summarizes the dependency relations among OS objects, including processes, files, and sockets. Their system identifies the path by generating a superset graph from the system calls. This system helps to detect the suspicious intrusion propagation path (SIPP) in the network. To audit all the running processes, Patrol implements a kernel-level auditing module: selected system calls are audited, and auditing code is inserted into each system call. The graph is represented using an adjacency matrix in which the edges are different system calls.

In Sun et al.’s [4] work, they mention that existing security measures fall short in detecting zero-day attacks. They discuss identifying the whole path of a zero-day attack. In their approach, a zero-day attack path is captured as a form of dependency graph. To detect and reveal the attack in the graph, they developed a Bayesian network-based system that computes the probability of an object being infected. This system is named ZePro, in which system calls are parsed and converted into system objects and dependencies. One of the significant differences between Patrol and ZePro is that ZePro does not use the SODG; it uses an object instance graph as the supergraph. To detect the zero-day attack path, Patrol identifies the path by performing backward and forward tracking starting from trigger nodes identified by security sensors, whereas ZePro uses a Bayesian network to identify the instance objects that have a high probability of being infected and then connects them into a path.

Xinguang et al. [5] propose a method to detect suspicious behavior by monitoring system call activities. They used different data mining techniques to develop a model that takes a snapshot of the normal functioning of a program. They then used a sequence pattern matching algorithm to compare the current behavior with the historical practice.

There are two stages: the training stage and the detection stage. In the training stage, the underlying system call sequences are stored in the training data based on their support and confidence. In the detection stage, a sequence pattern matching algorithm is used to compare the current behavior with the past normal behavior.

Using different algorithms such as Bayesian networks, Naïve Bayes, and Hidden Markov Models, researchers have been able to build models that estimate the probability of suspicious activities in the system log. The majority of them focused on preventing an attack instead of discovering whether an attack already exists in the system.

1.3 Our approach

Because of the difficulties listed above, in this project we explore approaches for effectively analyzing and visualizing system calls. Our work involves visualizing and performing data mining on the log of system calls to find patterns that would imply the possibility of attacks and vulnerabilities. First, we obtained the raw data of Unix system calls. Second, the data was pre-processed and transformed into useful information. Third, we performed data mining to extract valuable features that could give helpful hints about the attacks. After interpreting and evaluating the data, we built our visualizations in the form of scatter plots and line charts.

The last step includes making predictions on the data set. Since the data set we are using is unlabeled, we use the K-means clustering machine learning algorithm, which helps to determine whether any cluster may indicate an attack. Further, we used WEKA as the tool to implement the machine-learning algorithm.


2. BACKGROUND

The proposed solution performs log analysis of the system calls to detect suspicious activities. Log analysis helps reveal the steps and activities that a user performed to attack the system. To achieve this, we first collect and gather the data. Second, the collected data is preprocessed and transformed into an understandable and useful format. Third, we use WEKA to visualize the data and to implement k-means clustering.

We consider some of the variables that are vital in determining suspicious behavior. Each of the variables carries a different weight in the process of visualization. Since all the variables are values or terms of an operating system, some prior knowledge of these variables is necessary to understand the dataset.

2.1 Used Technologies/Tools

Weka

Weka is a package, or suite, of many machine-learning algorithms for both supervised and unsupervised learning. WEKA was first developed at the University of Waikato as machine learning software. It is written in the Java programming language and is used for discovering useful information from a dataset and visualizing the different patterns in it [6]. The machine learning algorithms can be applied either directly through the interface or by calling them from our own Java code. WEKA accepts datasets only in the CSV and ARFF file formats.

Components/Interface of WEKA

Explorer: It is a graphical user interface that gives us the ability to read a dataset from either an ARFF file or a CSV file. In addition to just reading the dataset, it provides the interface to preprocess the data and to apply different classification, clustering, and association algorithms. It also gives the option to visualize the results of the algorithms.

Knowledge Flow: In this graphical interface we can design configurations for streamed data processing. The interface has boxes that represent machine learning algorithms and data sources, and we can connect them together into the desired configuration.

Experimenter: It gives us an idea of which methods and parameter values work best for a given problem. Likewise, it can also be used for large statistical experiments.

ARFF: It is a file format used in the WEKA machine learning software, and it stands for Attribute-Relation File Format. ARFF was developed at the University of Waikato. It is a text file describing a list of instances sharing a set of attributes. The instances are unordered, independent, and do not involve any relationships among themselves. There are two sections in this file format: Header and Data. The Header section includes the name of a relation, a list of attributes (the columns in the data set), and the type of each attribute. Attributes can take four types of data: numeric, nominal specification, string, and date (formatted "yyyy-MM-dd'T'HH:mm:ss"). The Data section consists of a declaration line and the instance lines. In figure 1, the example of an ARFF file includes the header, the datatypes, the data section, and the instances:

@relation

@attribute 1

@attribute 2

@attribute n

@Data

Instance 1…

Instance 2…

Instance n…

Figure 1: ARFF format
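For concreteness, a small hypothetical ARFF file for system call records might look as follows; the relation name, attribute choices, and data rows here are invented for illustration and are not drawn from the project dataset:

@relation syscalls
@attribute start numeric
@attribute syscall {open,read,write}
@attribute pcmd string
@data
1349912650,read,sshd
1349912650,write,sshd
1349912651,open,updatedb

Each line after @data is one instance, with values given in the same order as the attribute declarations in the header.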


2.2 Machine Learning concepts

Unsupervised Learning

Unsupervised learning is a form of machine learning in which the input examples of the dataset are not class-labeled [7]. It is also known as clustering or learning by observation, where the number or set of classes is not known in advance.

Clustering

Clustering is defined as the process of dividing a set of data objects into subsets. There are different types of clustering methods: the partitioning method, the hierarchical method, the density-based method, and the grid-based method.

We are considering the partitioning method for building the model. The partitioning method is distance-based. To represent the clusters, it can use either the mean or a medoid. It constructs k partitions of the data, where each partition represents a cluster and k ≤ n, with n being the number of objects in the set.


k-means Clustering algorithm

Given a dataset D, the k-means algorithm calculates the mean value of all the points within each cluster. To define the mean value, it uses the centroid-based partitioning method.

Steps in the k-means algorithm:

1. Select k objects from the dataset D that will represent the initial cluster means, or centers.

2. Based on the Euclidean distance between each object and the cluster centers, the remaining objects are allocated to their most similar cluster.

3. The algorithm reduces the within-cluster variation by computing the new mean of each cluster from the objects assigned to it in the last iteration, and the process repeats until the assignments no longer change.

Figure 2 shows the clustering of a set of objects using the k-means algorithm.

In (b) the cluster centers are updated.

(a) Initial clustering   (b) Iterate   (c) Final clustering

Figure 2: K-means clustering
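As an illustration of the steps above, the following is a minimal Python sketch of k-means; it is not part of the project code (the project uses WEKA), it assumes the NumPy library is available, and the two-dimensional points are made up for the example:

import numpy as np

def kmeans(points, k, iterations=10, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: pick k objects from the dataset as the initial cluster centers.
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iterations):
        # Step 2: assign each object to its closest center (Euclidean distance).
        distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 3: recompute each center as the mean of the objects assigned to it.
        centers = np.array([points[labels == j].mean(axis=0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
    return labels, centers

# Hypothetical example: two obvious groups of two-dimensional points.
data = np.array([[1.0, 2.0], [1.2, 1.8], [8.0, 9.0], [8.2, 9.1]])
labels, centers = kmeans(data, k=2)
print(labels)
print(centers)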


3. DESIGN

This chapter describes the workflow and gives an overview of the process of data analysis and visualization. Data selection and data formatting are the most critical parts of the whole project. Figure 3 shows the steps in the process of data analysis and data visualization and how they work together to extract useful information from data.

[Workflow diagram: Data Collection → Data Exploration → Data Preprocessing → Data Analysis → Data Visualization]

Figure 3: Project Design

In our approach, the first step is the collection of data. In this step, we focus on what type of data would be suitable for the analysis and how to capture the related data for the project.

After data collection, we need to analyze the important characteristics of the data. This step includes the study of the different types of system calls that could be important for further analysis. After selecting the major attributes and gathering the information together, we need to preprocess the selected data into a useful and understandable format. Data preprocessing includes transforming the raw data into a readable and understandable format.

Data analysis and data visualization include analyzing the data in the form of tables, charts, and graphs. The statistics observed from the data analysis help to reveal patterns in the system calls. We used k-means clustering in this process. Clustering helps us to categorize the different combinations of the system calls in order to find suspicious patterns in the system.


4. DATA COLLECTION

To detect any suspicious activities in the system calls, we required a log of the operating system commands. The text processing and data extraction from the operating system log was done using the awk programming language. The function of awk is to search for a pattern in a text file, line by line. After finding the pattern, awk performs specified actions on that line. Awk continues the search until the end of the input file is reached [8].

As we can see in figure 4, this is an extract of the log data of the operating system. It gives us detailed information about each timestamp and the services requested by every process and user. The figure is a small snapshot of the log data. The dataset was captured on October 10, 2012, on a UNIX operating system. From the log, we observed an enormous number of system calls within a span of 8 minutes.

For the process of visualization, graphs are handy for observing patterns and extracting useful information. Visualizing the large dataset in a Jupyter Notebook is hard, and sometimes it is difficult to get a clear understanding from it.


Figure 4: Data Collection


5. DATA PREPROCESSING

Before we apply the data mining techniques, it is necessary to pre-process the data. The actual dataset captured as an operating system log is in the form of raw data. To optimize the searching algorithms and explore each attribute of every row in the dataset, it is essential to convert the dataset into a useful table. The original dataset generated from the operating system using the awk programming language contains rows delimited by spaces and tabs. Since the real-world raw data lacks a consistent structure or obvious trends, we extracted the attributes from the dataset using Python as the programming language. These transformations generated a CSV table that lists all the attributes, with a value or a null for each column.

The pre-processing of the data includes the steps shown in figure 5:

[Preprocessing pipeline: Raw data → Data cleaning → Data transformation → Data reduction → Attribute selection]

Figure 5: Data preprocessing

In figure 6, the raw data of the operating system log is converted into a table and cleaned by removing the attribute names from each row and placing each value under a unique column name in the table. Similarly, in the data transformation step, null values are inserted for every attribute missing from a row. Finally, the useful attributes can be selected to help detect a suspicious attack in the log data.

Figure 6: Transformed data

import re

# The 65 attribute names that may appear in each line of the raw trace.
attribute_list = ['fromppid', 'toppid', 'type', 'requestuid', 'source', 'rpathname',
    'newinode', 'cpid', 'newuser', 'toaddress', 'len', 'neweuid', 'cnt', 'startaddr',
    'fromport', 'pipe', 'count', 'end', 'rtn', 'toport', 'MS_SYNCHRONOUS', 'oldeuid',
    'frompcmd', 'wpathname', 'EBADF', 'wtype', 'newmode', 'newport', 'fd', 'start',
    'pcmd', 'MS_DIRSYNC', 'addr', 'oldpathname', 'oldinode', 'oldfsuid', 'port',
    'ftype', 'winode', 'syscall', 'topcmd', 'olduser', 'data', 'newaddress', 'oldsuid',
    'oldmode', 'inode', 'address', 'newfsuid', 'newpathname', 'cppid', 'topid',
    'target', 'pid', 'socket', 'ppid', 'frompid', 'pathname', 'cpcmd', 'olduid',
    'rtype', 'fromaddress', 'newsuid', 'newuid', 'O_CREAT']

filepath = 'trace1_original'
outputFile = 'output_bracketremoval.csv'
fp = open(filepath, 'r')
fp2 = open(outputFile, 'w')
fp2.write(",".join(attribute_list) + '\n')   # header row with all column names

for line in fp:
    row = ""
    cnt = 0
    for attribute in attribute_list:
        cnt = cnt + 1
        # Look for an "attribute:value" pair in the trace line.
        m = re.search(r'[^\w]' + attribute + r':([a-zA-Z0-9./]+)\t*.*$', line)
        if m is not None:
            value = m.group(1)
            if attribute == 'pathname' and "/" not in value:
                value = ""
            if attribute in ('pipe', 'socket'):
                value = value.strip('[]')
            if attribute in ('rpathname', 'wpathname') and "pipe" in value:
                value = ""
            if cnt > 1:
                row = row + "," + value
            else:
                row = value
        else:
            # Attribute not present in this line: leave the column empty.
            if cnt > 1:
                row = row + ","
    fp2.write(row + '\n')

fp.close()
fp2.close()

Figure 7: Code Snippet to create a csv file


In figure 7, we used Python as the programming language to extract all 65 attributes and create a CSV file. Since WEKA accepts files in ARFF format or as comma-separated values (CSV), we used the above snippet to format the dataset and produce the table shown in figure 6.


6. DATA ANALYSIS AND VISUALIZATION

One of the significant threats to the data of an enterprise network is an attacker gaining access to its systems and compromising its data. Attackers have been using different techniques to access users' systems. One of the most common attacks is the brute force attack. In a brute force attack, an attacker uses the method of key guessing, randomly guessing the password by trying combinations of numbers, letters, and symbols. This might eventually yield the correct password, but it might take months or years to discover it. To make the attack faster, attackers use a dictionary that contains millions of password patterns.

In the first examination of the dataset, we consider the spread of all the rows. The best way is to visualize how the start and end timestamps are spread across the complete dataset.


6.1 Start time and End time Distribution

Figure 8: Start Time distribution

Figure 9: End time distribution


In figures 8 and 9, the x-axis is the row number in the dataset, and the y-axis is the start time and end time, respectively. There is a total of 82235 rows in our dataset. In these figures, we observe that the timestamp curve is flatter for the rows between 70000 and 80000. On the other hand, there are a few smaller flat segments at the beginning of the log. A flat line in the graph indicates that there were 4006 system calls for the start time ‘1349912650’ and 2711 system calls for the start time ‘1349912713’. Therefore, these rows might correspond to a brute force key-guessing attack, because the high number of system calls could reflect the number of attempts to access the user's data. As we advance further, we dig deeper into the dataset to get a clearer understanding of the actions of the users.
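To show how such per-second counts can be obtained, the following is a minimal Python sketch, assuming the pandas and matplotlib libraries and the CSV file produced by the script in figure 7, with 'start' as the column name from the attribute list; the project itself performed this exploration with its own tooling:

import pandas as pd
import matplotlib.pyplot as plt

# Load the preprocessed CSV and count how many rows share each start time.
df = pd.read_csv('output_bracketremoval.csv')
counts = df['start'].value_counts().sort_index()
print(counts.head())

# Plot the start time of every row, similar in spirit to figure 8.
df['start'].plot(marker='.', linestyle='none')
plt.xlabel('Row number')
plt.ylabel('Start time')
plt.show()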


6.2 System call vs. Start time Distribution

For the next step, we pick all the system calls and compare them against the start timestamp to see which system calls have the maximum frequency, or count, within each second of the timeframe.

In figure 10, the analysis could be extended to all the system calls, but for practical purposes we consider stat64 (pink), read (black), lstat64 (grey), open (salmon pink), mmap2 (peach pink), write (blue), and fstat64 (electric blue). We subset the data for these categories.

Figure 10: System call vs. start time distribution


First, we look at the read system call, which has the highest count of 27242. Figure 11 lists all the system calls along with their counts in the dataset. Observing the table, the read system call has the highest count among all the system calls. A high number of attempts to read a file within a single second is significant and suspicious behavior for an attack on the system.

In figure 10, for start time ‘1349912650’, read has the maximum number of occurrences among all the system calls, followed by write. For any user, it is hard to read from and write to a file thousands of times in just a single second. The combination of reading and writing a file simultaneously is one of the suspicious activities that an attacker would exhibit.
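A possible way to reproduce this kind of per-second comparison outside WEKA is sketched below; it again assumes pandas and the hypothetical CSV from figure 7, with 'start' and 'syscall' as the column names from the attribute list, and the selected system calls follow figure 10:

import pandas as pd

df = pd.read_csv('output_bracketremoval.csv')
calls = ['stat64', 'read', 'lstat64', 'open', 'mmap2', 'write', 'fstat64']

# Per-second counts of the selected system calls (compare figure 10).
subset = df[df['syscall'].isin(calls)]
per_second = subset.groupby(['start', 'syscall']).size().unstack(fill_value=0)
print(per_second.max())              # peak one-second count for each call

# Overall counts per system call (compare figure 11).
print(df['syscall'].value_counts())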


Figure 11: System call Count


6.3 PCMD and Start time Distribution

PCMD stands for parallel command; it executes the same command on compute nodes in parallel. PCMD is installed as a root-only tool and uses SSL as the security protocol to establish encrypted links for communication [9].

In figure 12, we can observe all the PCMD commands in the dataset. ‘updatedb.mlocat’ has the maximum frequency count in the dataset. This command is used to update the database that lists all the files on the server and serves the searches. Although ‘updatedb.mlocat’ occurs the maximum number of times, i.e., 22972, for the timestamp ‘1349912650’ (the most frequent timestamp in the dataset) there are 756 attempts to change the ownership of files using the chown command. The chown command is used when a user does not have the privilege to access a file but wishes to edit or update that file or directory.

The brute force attack is one of the examples in which an attacker attempts to guess the password multiple times.

Next, we observe the PCMD command that has the second highest count in the dataset. ‘SSHD’ occurs 22972 times overall in our collection of the log. SSHD is the daemon program for SSH; together they provide secure encrypted communications between two non-trusted hosts over an unsecured network. Attackers on an unsecured network use this command so that their connection appears reliable, safe, and genuine to everyone.


Figure 12: PCMD table


6.4 Analysis using Machine learning

Machine learning algorithms help in predicting different types of results and visualizations by using statistical techniques and methods. The algorithms can be tuned based on the requirements of the analyst and the desired outcome. The instances of the model can be recorded, and the input can be changed to obtain the best prediction models.

Many methods and techniques provide strategies to avoid attacks such as brute force attacks, SQL injection, malware, and cross-site scripting. It is hard to find a single model that would be perfect for determining the attacker, the type of attack, and the location or file affected by the attack.

In this project, instead of providing techniques to avoid cyber-attacks, network security attacks, or software attacks, we look for suspicious activities in the system log that indicate whether there has been an intrusion by an attacker. We look for clusters that partition the observations based on the nearest mean.

To analyze the intrusion into the system, the following attributes play a vital role:

1. Start time

2. End time

3. System calls

4. Parallel command


5. Process id

6. Parent Process id

7. Function type

8. IP address

9. File pathname

10. Port number

11. Inode number (unique index number of each file)

In the following section, we consider different combinations of attributes to form the clusters. After preprocessing the data, we move on to the process of clustering; since the data is unlabeled, this is an unsupervised learning problem.

6.4.1 K-means Clustering

Next, we move to clustering. Since the data is unlabeled, making this an unsupervised learning problem, our model is k-means clustering. We use a Euclidean distance function [10]. The algorithm evaluates each data point and assigns it to a class according to which centroid it is closest to.

We tested cluster numbers ranging from 1 to 10, and out of them, six clusters gave good prediction results and separated all the different values among them. In figure 13, cluster 4 has the highest number of ‘start time’ values, i.e., 25580 (31%). These numbers signify that the maximum number of system calls were initiated in this cluster. Further, in figure 14 we visualize which attribute commands fall in this cluster, which could indicate a possible attack in the system.

Figure 13: Clustering report


Figure 14: Visualization of cluster
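The project selected the number of clusters by testing values from 1 to 10 in WEKA. As a rough illustration of such a sweep outside WEKA, the following sketch uses scikit-learn (an assumption, not the tool used in the project) on the hypothetical CSV from figure 7, with 'start' and 'ppid' standing in for the clustered attributes:

import pandas as pd
from sklearn.cluster import KMeans

# Two hypothetical numeric attributes; the project clustered attributes such as
# start time and PPID after preprocessing.
df = pd.read_csv('output_bracketremoval.csv')
X = df[['start', 'ppid']].fillna(0)

# Compare k = 1..10 using the within-cluster sum of squared errors; a value of k
# where this stops dropping sharply (six in the project) is a reasonable choice.
for k in range(1, 11):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, model.inertia_)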


6.4.2 K-means clustering with PCMD and Start time

One of the essential features for detecting an intrusion into the system is the shell command that is used to access files or to execute jobs in parallel on one or more computers. In figure 15 and figure 16 we observe that the pcmd command ‘sshd’ accounts for 54% of all the commands in cluster 4. In addition, in cluster 5 the share of ‘sshd’ is 97%. The ‘sshd’ command is used by attackers to open an SSH daemon. In other words, to access the encrypted communication between untrusted hosts over an insecure network, an attacker uses SSH and ‘sshd’ together [11]. sshd listens for connections from other users or clients and forks a new daemon that handles key exchange and encryption. Since this command is observed the maximum number of times, we can say that the attacker may have used ‘sshd’ to obtain the encryption key and establish a secure connection.

Figure 15: Clustering report with PCMD and Start time


Figure 16: Clustering visualization with PCMD and Start time

6.4.3 Clustering with all the essential attributes

In this clustering model we take the list of 11 attributes, i.e., end, start, pcmd, addr, port, ftype, syscall, inode, pid, ppid, and pathname. We divide the data into six clusters.

From the observation of the clustering report in figure 18, there is only a single IP address, ‘192.168.101.5’, used by the attacker to intrude into the system. The pathnames accessed the most are ‘/etc/mtab’ and ‘/usr/src/’. The ‘/etc/mtab’ file holds information about all the mounted and unmounted file systems in the system. It is not used by the kernel; the kernel maintains its own record in /proc/mounts or /proc/self/mounts. The attacker tracks the file systems that are currently mounted. The count of accesses to the ‘/etc/mtab’ file in cluster 5 is 17721 (92%), and in cluster 0 it is 43267 (52%). It is possible that the attacker was trying to gather data about the mounted file systems. Considering this pattern, data integrity is at a remarkably high risk.

Figure 17: Visualization of Start time and PPID

Figure 18: Result of Start time and PPID clustering


In figure 17 and figure 18, observing the PPID for the clusters, the values are distributed across every cluster, but cluster 4, which contains the highest count of start time values, also has the largest count of PPID values. A similar observation was recorded for the PID, which is unique for every process.

The patterns observed above help to identify some unusual activities that might be related to attacks. After analyzing and visualizing different combinations of system calls and performing k-means clustering, the analysis reveals the timeframes with the highest numbers of system calls.


7. CONCLUSION

In this project, we discussed approaches to analyzing and visualizing system calls. These techniques can help system administrators extract useful information about system activities and thus identify suspicious system behaviors, which may be related to attacks. We used a log from a Unix operating system that covered a short duration but was large enough for analysis and visualization. The results show that system call analysis and visualization are able to reflect suspicious activities in the system. In-depth analysis is required to reveal sufficient information about suspicious system activities. Implementing data mining techniques and visualizing the system call log does not directly confirm the existence of attacks, but the revealed information can be very useful for system or security administrators to make judgements about the possibility of an attack.


8. FUTURE WORK

The goal of analyzing and visualizing system calls has been achieved. As an extension to this work, we could use Principal Component Analysis (PCA) to reduce the large set of variables into a smaller set that retains the useful information. With new advancements in the fields of data analysis and machine learning, we could also use a Spark cluster on a platform such as AWS, Databricks, or Azure to preprocess the data.

We could also extend the analysis to the user's cloud application activities, logging the user's events on an application and analyzing them with pattern matching techniques in real time. In addition, a model could be built that works for any timeframe rather than being restricted to a single date or time.


BIBLIOGRAPHY

[1] M. Bagherzadeh, N. Kahani, C.-P. Bezemer, A. E. Hassan, J. Dingel, and J. R. Cordy, "Analyzing a Decade of Linux System Calls," Empirical Software Engineering, 2017.

[2] J. Dai, X. Sun, and P. Liu, "Patrol: Revealing Zero-day Attack Paths through Network-wide System Object Dependencies," Pennsylvania State University.

[3] X. Sun, J. Dai, P. Liu, A. Singhal, and J. Yen, "Using Bayesian Networks for Probabilistic Identification of Zero-Day Attack Paths," 2018.

[4] X. Tian et al., "Network intrusion detection based on system calls and data mining," Higher Education Press and Springer-Verlag Berlin Heidelberg, p. 7, 2010.

[5] I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, San Francisco: Morgan Kaufmann, 2005.

[6] J. Han, M. Kamber, and J. Pei, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2012.

[7] D. B. Close, A. D. Robbins, P. H. Rubin, R. Stallman, and P. van Oostrum, "The AWK Manual," Free Software Foundation, Cambridge, 1995.

[8] R. Slick, "System Administration for Cray XE and XK Systems," 2012.

[9] N. Krislock and H. Wolkowicz, "Euclidean Distance Matrices and Applications," 2010.

[10] "sshd (8) - Linux Man Pages," SysTutorials. [Online]. Available: https://www.systutorials.com/docs/linux/man/8-sshd/. [Accessed 14 Nov. 2018].

[11] N. Ishkov, "A complete guide to Linux process scheduling," University of Tampere, 2015.