
Cyberattacks Detection and Analysis in a Network Log System Using XGBoost with ELK Stack Chao-Tung Yang ( [email protected] ) Tunghai University https://orcid.org/0000-0002-9579-4426 Yu-Wei Chan Providence University Jung-Chun Liu Liu Tunghai University Endah Kristiani Tunghai University Cing-Han Lai Tunghai University Research Article Keywords: Cyber Security , Machine Learning , ELK Stack , XGBoost , NetFlow Log Posted Date: September 3rd, 2021 DOI: https://doi.org/10.21203/rs.3.rs-838650/v1 License: This work is licensed under a Creative Commons Attribution 4.0 International License. Read Full License Noname manuscript No. (will be inserted by the editor) Cyberattacks Detection and Analysis in a Network Log System Using XGBoost with ELK Stack Chao-Tung Yang · Yu-Wei Chan · Jung-Chun Liu · Endah Kristiani · Cing-Han Lai Received: date / Accepted: date Abstract The usage of artificial intelligence and machine learning methods on cy- berattacks increasing significantly recently. For the defense method of cyberattacks, it is possible to detect and identify the attack event by observing the log data and analyz- ing whether it has abnormal behavior or not. This paper implemented the ELK Stack network log system (NetFlow Log) to visually analyze log data and present several network attack behavior characteristics for further analysis. Additionally, this sys- tem evaluated the extreme gradient enhancement (XGBoost), Recurrent Neural Net- work (RNN), and Deep Neural Network (DNN) model for machine learning methods. Keras was used as a deep learning framework for building a model to detect the at- tack event. From the experiments, it can be confirmed that the XGBoost model has an accuracy rate of 96.01% for potential threats. The full attack data set can achieve 96.26% accuracy, which is better than RNN and DNN models. This work was supported in part by the Ministry of Science and Technology, Taiwan, under Grant MOST 108-2221-E-029-010, 109-2221-E-029-020, and 110-2221-E-029-020-MY3. Corresponding author: [email protected] C-T Yang, J-C Liu, and C-H Lai Department of Computer Science, Tunghai University, Taichung City 407224, Taiwan, (R.O.C.) C-T Yang Research Center for Smart Sustainable Circular Economy, Tunghai University, No. 1727, Sec.4, Taiwan Boulevard, Taichung City 407224, Taiwan, (R.O.C.) E-mail: [email protected] Y-W Chan College of Computing and Informatics, Providence University, Taichung City 43301, Taiwan (R.O.C.) E. Kristiani Department of Industrial Engineering and Enterprise Information, Tunghai University, Taichung City 407224, Taiwan, (R.O.C.) Department of Informatics, Krida Wacana Christian University, Jakarta 11470, Indonesia 2 Chao-Tung Yang et al. Keywords Cyber Security · Machine Learning · ELK Stack · XGBoost · NetFlow Log 1 Introduction In recent years, cyberattacks are evolving and becoming more sophisticated. For example, with the development of Machine Learning algorithms, some illegal users might use the technology of cyberattacks and Machine Learning to analyze informa- tion from the social networks [3]. The specific target of cyberattacks is given based on the data, the attack success rate, or the vulnerability that is discovered [30]. Accord- ing to Neustar’s International Network Benchmark Index report released in 2018 [1], 82% of cybersecurity experts said they are worried that attackers will use artificial in- telligence to make a destructive attack on the network environment. However, a large number of experts believe that artificial intelligence can play a considerable role in network security and provides excellent supporting [26]. As mentioned above, in the campus network environment, various cyberattacks have appeared and tried to attempt the stability of the campus network environment. From the network logs, it can be found that many unusual network usage scenarios are trying to pass the campus network security system [15] [18] [28]. However, the systems with visualized network log data and the capability of detecting cyberattacks have considerable charges. The open-source platform ELK Stack is implemented to build a network log sys- tem (NetFlow Log) in this work. First, the network logs related to the cyber attack behavior were collected in a large amount of data and then obtained preliminary in- formation. Second, the data analysis was observed. After visualizing the log data, the administrator can use the machine learning model to import historical log data for analysis and detection. Then, perform a risk assessment based on the cross-validation analysis of the visual information displayed by the ELK Stack, even if it has not oc- curred or uncertain events. The administrator also has sufficient information to make the right decisions and take precautions to avoid the associated losses in informa- tion security. Our goal is to use XGBoost for machine learning, then implement a visualization system for cyberattack behavior to help administrators detect whether historical network log data has cyber attack behavior or not. The specific objectives are as follows: 1. Demonstrate the visualization and monitoring system of NetFlow log. 2. Compare XGBoost, RNN, and DNN model in two kinds of model, potential at- tack and full attack log data. 2 Background Review and Related Works This section provides the background of this work and several kit information, in- cluding Python, ELK Stack, XGBoost, and so on. Then, the next section is discussed in more detail. Title Suppressed Due to Excessive Length 3 2.1 Keras In Deep Learning, The Microsoft Cognitive Toolkit (CNTK) and TensorFlow are widely used in Deep Learning research. However, although both have compelling features, the actual application is more complicated. Therefore, the Deep Learning project for this job will use Keras to build a Deep Learning model. Keras is an open-source neural network library written in Python that can be executed on TensorFlow, CNTK, Theano. The leading developer is Google engineer Francois Chollet. Keras can quickly implement deep neural networks. 2.2 ELK Stack ELK Stack refers to the architecture based on three open-source software Elastic- search, Logstash, and Kibana [6]. ELK Stack can be used to form a system for query- ing, collecting, and analyzing logs. This work can get data from any source and for- mat. Without changing the original system architecture, ELK Sack is built to instantly search and analyze data and ultimately use visual capabilities to present the analyzed data results [21]. NetFlow Log is the automated network log platform built in this work. It is built on top of these three open-source software. In addition, ELK Stack has three kits and many other software packages, such as Filebeat, Xpack, and ECE. 2.3 Decision Tree In decision theory, a decision tree consists of decision graphs and possible outcomes that are used to help decision-making achieve program goals [22]. In Machine Learn- ing, a decision tree is a predictive model. Use tree graphics to help computers judge, segment our data, and make decisions based on it. Each node in the tree represents a specific target, and each forked path represents a possible feature of data segmen- tation. The information gain is obtained from the action of segmenting the data. The segmentation process is repeated until the leaf node. This leaf node corresponds to the target from the master node to the leaf node and has all the feature values on the path. Decision trees are available for forecasting and data analysis. A complete decision tree typically contains three nodes: decision nodes, opportunity nodes, and endpoints. Decision trees have several generation methods, classification tree analy- sis, regression tree analysis, Classification and Regression Trees (CART), Chi-square Automated Interaction Detection (CHAID). As the most fundamental component of XGBoost, it needs to introduce the CART regression tree. It constructs a decision tree based on the characteristics and data of the training to determine the prediction result of each piece of data. Also, it uses the Gini index to calculate the gain to select the characteristics of the decision tree. The Gini index formula is as follows: K Gini(D)= ∑ pk1 − pk (1) k=1 4 Chao-Tung Yang et al. pk represents the probability of class classification category (k) in dataset D, the number of categories is indicated by K. The Gini index calculates the gain formula as: |D | |D | Gini(D,A)= 1 Gini(D )+ 2 Gini(D ) (2) |D| 1 |D| 2 D represents the entire dataset, D1 and D2 respectively represent data having fea- ture A in the dataset and data other than A. 2.4 Gradient Boosting Gradient Boosting is a Boosting method that is a Machine Learning technique for regression and classification problems [11]. Gradient Boosting generates prediction models in the form of multiple weak classifiers. Each model is established in the gra- dient direction of the loss function of the previous model. Put, when the loss function is large, the model is more error-prone. On the other hand, if our model can make the loss function drop, our model will continue to improve. Thus, the loss function is reduced in the gradient direction by multiple improvements, and a good model is finally obtained [12]. The specific algorithm is as follows: Input Training set T =(x1,y1),(x2,y2)...,(xn,yn) Output Boosting tree fM (x) Procedures: – Initialization f0(x)= 0 f orm = 1,2...,M – Calculating the residual rmi = yi − f((m − 1))(xi),i = 1,2,...,n (3) – Fitting the residual rmi to learn a regression tree and get T(x:Θm ) – Update fm(x)= f((m − 1))(x)+ T(x : Θm) – Get the regression boosting tree f (x)= ∑M T( x ) M m=1 Θm 2.5 XGBoost The objective function of XGBoost consists of two parts [8]. The first part is used to calculate the difference between the predicted score and the true score is Ob j(t)= n t−1 ∑i=1 L(yi,yˆ + ft (xi)) + Ω( ft )+ constant The second part is normalization Ω( ft ) , and the formula is as follows.
Details
-
File Typepdf
-
Upload Time-
-
Content LanguagesEnglish
-
Upload UserAnonymous/Not logged-in
-
File Pages25 Page
-
File Size-