Classification of Event-Related Alerts Using Machine Learning
Wim Kamerman
University of Twente
P.O. Box 217, 7500AE Enschede, The Netherlands
[email protected]

ABSTRACT
DDoS attacks form a severe threat to organizations, as their services can be made unavailable at any moment, resulting in major direct and indirect damage. This research compares the performance of 8 supervised machine learning classifiers on a DDoS Google Alert dataset. We provide an annotated dataset and suggest that the Support Vector Machine and the Neural Network perform best, depending on the desired performance focus. Last, we publish a tool that eases the process of comparing and optimizing multiple (supervised) classifiers on a given dataset, while ensuring valid results. The research can be used to collect all reported DDoS attacks on the internet. This collection can be used to gain a better understanding of DDoS attacks.

Keywords
Machine Learning, Google Alerts, Classification, DDoS

1. INTRODUCTION
Distributed Denial of Service attacks, better known as DDoS attacks, are attempts to make services or infrastructure unavailable by flooding them with requests from multiple sources. In 2016, a botnet with more than a million devices was used to attack DNS provider Dyn. Not only were major services like Netflix and Airbnb unavailable, access to public broadcasting services like the BBC was also cut off. This shows the major damage DDoS attacks can cause.

DDoS attacks are the most common network security threat experienced by enterprises, governments and educational bodies, while the customers of service providers (e.g. ISPs and data centers) remain the number one target of DDoS attacks [3]. Santanna et al. pointed out that DDoS attacks are easy to execute with DDoS-as-a-Service providers, even for people without any technical knowledge [23].

Much about DDoS attacks has been researched, including how to efficiently detect and mitigate them [29, 15, 16]. Various sources have been used to identify DDoS attacks, including leaked booter databases and DNS provider data. In this research, we explore the possibility to use Google Alerts as an additional source to identify DDoS attacks. Google Alerts is a content change detection and notification service [28], notifying users when a new or changed result has been detected on a certain keyword. We use a dataset, comprised of Google Alerts, that has been described by Abhishta et al. [1]. The dataset consists of Google Alerts on the keywords "DDoS" and "Distributed Denial of Service". This research aims to select the results that describe a DDoS attack. To do this, we classify into four categories: "Attack", "Arrest", "Research" and "Other". They are defined in Section 3.1.

The main contributions of this paper are:

• An annotated event-related Google Alert dataset
• A comparison of the efficiency of different supervised machine learning classifiers on this dataset
• A suggestion of the best classifier for this specific dataset as a result of the efficiency comparison
• A tool that can help classify a given dataset into various categories

The contributions can be used to collect all reported DDoS attacks on the internet. This information is important to increase our understanding of DDoS attacks, to ultimately outsmart attackers and stop DDoS attacks from happening. Increasing understanding may happen on a wide variety of topics. For example, it can be analyzed which attacks get reported and which ones don't. By correlating the data with other information on DDoS attacks, new insights may be gathered. We might, for example, gain a better understanding of the impact on individual victims.

This paper is structured as follows: first, we describe the dataset we have used. Second, we explain the classification methodology in detail. Then, we present the results, draw conclusions and discuss future work. Finally, we introduce the classification tool we used for obtaining the results of this research. This tool allows extensive exploration to be performed on a given dataset, resulting in a valid analysis. The annotated dataset, the tool and all other code used will be published on GitHub (https://github.com/wmkamerman) after publication.

2. DATASET
The complete dataset consists of 67.831 alerts, categorized by Google into three categories: web, news and blog. The dataset contains 46.641 web alerts, 21.142 news alerts and 48 blog alerts. Since the number of blog alerts is not sufficient for reliable training and testing, we leave them out.

Table 1. Details of Dataset

        #alerts   Start        End
Web     46.641    2015-08-20   2018-03-22
News    21.142    2015-08-25   2018-03-21

The category "News" is assigned by Google when the source is recognized by Google as being a publisher [11]. The remaining dataset, consisting of only news and web alerts, is summarized in Table 1.

Each Google Alert consists of a title, a short text snippet (215±63 characters), a date and a link to the accessory page. Additionally, the abovementioned predefined category is present.
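To make the record structure concrete, the sketch below models a single alert as a small data class. This is purely illustrative: the field names and the example values are assumptions, not the actual schema of the dataset described above.

# Illustrative only: field names and example values are assumptions, not the dataset's schema.
from dataclasses import dataclass
from datetime import date

@dataclass
class GoogleAlert:
    title: str          # alert title
    snippet: str        # short text snippet (roughly 215 +/- 63 characters)
    published: date     # date of the alert
    url: str            # link to the accessory page
    source: str         # Google-assigned category: "web" or "news"
    label: str = ""     # manual annotation: "Attack", "Arrest", "Research" or "Other"

example = GoogleAlert(
    title="Dutch tax office, banks hit by DDoS cyber attacks",
    snippet="Several Dutch banks reported outages after a wave of DDoS attacks ...",  # invented text
    published=date(2018, 1, 29),                                                      # invented date
    url="https://example.com/article",
    source="news",
    label="Attack",
)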

3. METHODOLOGY
We manually annotate a part of the dataset into 4 classes (Section 3.1), followed by automatic classification using supervised machine learning algorithms (Section 3.2).

3.1 Annotation into 4 classes
We manually annotated 1000 Google Alerts into one and only one of the following categories:

1. Attack: describes an occurrence of a DDoS attack. For example: "Dutch tax office, banks hit by DDoS cyber attacks (...)"

2. Arrest: describes law enforcement after an occurrence of a DDoS attack. This includes the identification of the hacker and responsibility claims. For example: "Two Israeli teens arrested for running major DDoS service", "Teenage hacker arrested for unleashing DDoS on 911 system" and "British Hacker Admits Using Mirai Botnet to DDoS Deutsche Telekom"

3. Research: describes DDoS research or a DDoS background article. This includes guides on how to protect against DDoS attacks, but does not include DDoS protection service providers. DDoS attacks may be described, usually as a reason to write an article, but the main article must not solely be about this particular attack.

4. Other: all other results not fitting in the attack, arrest or research categories. This includes organizations that offer DDoS protection services and complaints of gamers accusing DDoS attacks of being the reason for bad gameplay.

The resulting annotated dataset consists of 630 "Web" alerts and 370 "News" alerts. It is shown in detail in Figure 1. The arrest category is smallest, with 3 web alerts and 14 news alerts.

Figure 1. Annotated Dataset (bar chart: share of Google Alerts per category, for the web, news and combined alerts)

3.2 Multiclass ML Classification
We have used multiclass Machine Learning classification algorithms. As opposed to unsupervised learning, this type of algorithm learns rules based on not only input but also output. It requires known labels (the corresponding correct outputs), so it can learn how to classify based on the targeted output class. We use the annotated classes as output data. Since we categorize into the four beforementioned classes, we need multiclass algorithms, able to distinguish between more than 2 classes.

3.2.1 Used Algorithms
Kotsiantis [14] has described the best-known supervised machine learning classifiers in detail. He categorized the algorithms into 6 categories: Decision Trees, Neural Networks, Naive Bayes, k-Nearest Neighbors, Support Vector Machines and Rule Learners. We have compared algorithms from all categories except k-Nearest Neighbors and Rule Learners. k-Nearest Neighbors was left out because of its intolerance of noise, its preference towards classes with more instances, and its large computational time needed for classification. Rule Learners were left out because they cannot be easily used as incremental learners [14].

We have used the following 9 multiclass supervised algorithms from the remaining four categories:

• Logistic Regression (LR): one of the most widely used algorithms for classification in industry, performing well on linearly separable classes [18]. By default LR is a binary model, that is, only capable of separating two classes. To enable multiclass classification we make use of the One-versus-Rest (OvR) technique.

• Support Vector Machine (SVM): introduced by Cortes and Vapnik, it is designed to maximize the so-called margin [8]. The margin is defined as the distance between the separating hyperplane (the decision boundary) and the training samples that are closest to this hyperplane, the so-called support vectors [18]. Models with a maximum margin tend to have a lower generalization error, whereas smaller margins are more prone to overfitting. Overfitting occurs when the generalization a model creates to classify unseen data corresponds too closely to the training data. This results in poor performance, as the model takes the specific characteristics of the training data into account too much. Like LR, we use the OvR technique to enable multiclass classification, as SVM is binary by default.

• Decision Tree (DT): introduced by Breiman, it is an easily interpretable model, designed to maximize the information gain [18, 4]. The DT makes decisions based on a series of questions the algorithm learns. This forms a tree of questions, in which each question is called a node. Each node leads to a decided category or another question. The question with the highest information gain, that is, the question that has the most influence on the decided class, is on top. Pruning is performed to limit the size of the tree, thus avoiding overfitting. The implementation we use is binary, meaning that every parent node is split up into two child nodes.

• Random Forest (RF): can be intuitively seen as an ensemble of decision trees, in which a voting system for the most popular class is present. By combining multiple decision trees with the same distribution, each suffering from high variance, the generalization error is decreased and the RF is less susceptible to overfitting [5, 18].

• Extremely Randomized Trees (ET): in short, Extra Tree. It was introduced by Geurts et al. as a variation on the Decision Tree [10]. It randomizes both attribute and cut-point choices when splitting a tree node. The strong point of the ET is, besides its high predictive accuracy, its computational efficiency.

• Three Naive Bayes algorithms: Multinomial Naive Bayes (M-NB), Bernoulli Naive Bayes (B-NB) and Gaussian Naive Bayes (G-NB). Naive Bayes algorithms make the strong assumption that each prediction variable is independent from the others. In text classification this means that each word is seen as independent from all other words. In practice, Naive Bayes systems can work surprisingly well, even when the conditional independence assumption is not true [21]. The Multinomial, Bernoulli and Gaussian prefixes refer to the assumed distribution of the probability that a feature belongs to a certain class. The Multinomial Naive Bayes is used for discrete data, so in our case for the number of times a certain word appears. The Bernoulli Naive Bayes assumes the feature distribution is binary: a word either exists (1) or not (0). Finally, the Gaussian Naive Bayes assumes a normal distribution, and so is used when features are continuous.

• Multi-layer Perceptron (MLP): a network consisting of multiple layers containing single neurons. Neurons are referred to as ADAptive LInear NEurons (Adaline), first published in 1960 by Widrow and Hoff [27]. What is interesting about the Adaline algorithm is that it focuses on minimizing continuous cost functions and updates weights based on a linear activation function. Gradient descent optimization is used to learn the weight coefficients. Part of this gradient descent optimization is the learning technique called backpropagation, introduced by Rumelhart et al. [20]. We make use of Adam, a simple and computationally efficient algorithm for gradient-based optimization of stochastic objective functions [12].

According to [14], Decision Trees tend to perform better for classifying categorical features than Support Vector Machines and Neural Networks. Yet, because this might not be true for our unique dataset, we test using all these algorithms.

3.2.2 Pre-Processing
Before applying the algorithms on the dataset, we performed pre-processing. We removed 285 duplicates by retaining the oldest alert when multiple alerts have the same url. Afterwards we selected only English alerts using langdetect (https://pypi.org/project/langdetect/), a language detection library ported from Google's language-detection. After removal of 10.521 non-English alerts, 58.572 English alerts remain.

From these non-duplicate English alerts, 1000 alerts were randomly selected. Annotation was performed as described in Section 3.1.

Further processing has been performed on the annotated dataset. First, we tokenized and lowercased the alerts. Next, we used term frequency-inverse document frequency (TF-IDF) weighting, as described by Salton and Buckley, for feature extraction [22]. This calculates the importance of terms in a document, and in this case, a category. Chi-Square was used to select the most outstanding correlated terms as identified by TF-IDF in each of the categories [25]. We ignored words that appeared in over 50% of all alerts (maximum document frequency = 50%). We assume that these words don't significantly contribute to the correct determination of a certain class. We have taken 50% as an initial max-df and varied this variable to find the optimal setting. English stopwords have been ignored as well, using the English stopword dictionary by the Glasgow Information Retrieval Group (http://ir.dcs.gla.ac.uk/resources/linguistic_utils/stop_words).

Tests have also been performed with applying Stanford Named Entity Recognition (NER) to the Google Alert dataset. Stanford NER labels sequences of words in a text which are the names of things [9]. Using NER, we were able to remove words that were identified as a name (person), an organization (Cisco, ING) or a location (Amsterdam, New York). We compared the performance with and without applying NER to the Google Alerts.

3.2.3 Analysis
To avoid overfitting, we use the stratified k-fold cross-validation method as described by Kohavi [13]. This method splits up the training dataset into k equal-size subsets (folds). One fold is kept as validation data, while the other k-1 folds are used for training. The validation process is repeated k times, so that every fold is used as validation data once. The effectiveness is then the average over the k repetitions. We stratify the folds "so that they contain approximately the same proportions of labels as the original dataset" [13]. This yields better bias and variance estimates, especially in cases of unequal class proportions [7]. In our case, we vary the number of folds, starting with 4 folds. That means that we create 4 folds of 250 Google Alerts each, and validate the algorithms with those folds accordingly.

Additionally, we have applied over- and undersampling. As can be seen in Figure 1, the various annotated categories are not equal in size. Especially "Arrest" is extremely small compared to the others. In other words, we are dealing with a skewed class distribution. According to Weiss et al., "the classifier built from a data set with a highly skewed class distribution generally predicts the more frequently occurring classes much more often than the infrequently occurring classes. This is largely due to the fact that most classifiers are designed to maximize accuracy." [26] Over- and undersampling tackle this problem by either reducing the number of samples in the majority categories (undersampling) or increasing the number of samples in the minority categories (oversampling). We have used oversampling, which appears to be most suitable for small datasets (<10.000 examples), according to Weiss et al. [26].
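The pre-processing and validation steps above (Sections 3.2.2 and 3.2.3) can be combined into a single sampler-aware pipeline. The sketch below is a minimal illustration under the assumption that scikit-learn and imbalanced-learn are used; the paper's tooling points in that direction (pipelines, GridSearch, the imbalanced learning package) but does not state its exact setup, and the toy alerts are invented stand-ins for the annotated data.

# Minimal sketch: TF-IDF + Chi-Square selection + oversampling + OvR linear SVM,
# validated with stratified k-fold cross-validation. Assumes scikit-learn and imbalanced-learn.
from imblearn.over_sampling import RandomOverSampler
from imblearn.pipeline import Pipeline  # sampler-aware pipeline: resampling happens on training folds only
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import LinearSVC

# Invented toy alerts; in the paper the 1000 annotated Google Alerts are used instead.
texts = [
    "bank hit by ddos attack and taken offline",
    "tax office unavailable after ddos attack",
    "webshop flooded by ddos attack on monday",
    "hospital site down after ddos attack wave",
    "teenager arrested for running booter service",
    "police arrest suspect behind ddos extortion",
    "two men arrested after attack on provider",
    "hacker arrested and charged for botnet",
    "guide explains how mitigation services work",
    "vendor launches new ddos protection product",
    "gamers complain about lag during tournament",
    "report reviews security trends of the year",
]
labels = ["Attack"] * 4 + ["Arrest"] * 4 + ["Other"] * 4

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(lowercase=True, stop_words="english", max_df=0.5)),
    ("chi2", SelectKBest(chi2, k="all")),                # number of kept features is tuned later
    ("oversample", RandomOverSampler(random_state=0)),  # applied to the training folds only
    ("clf", LinearSVC()),                                # one-vs-rest by default for more than 2 classes
])

folds = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
scores = cross_val_score(pipeline, texts, labels, cv=folds, scoring="f1_weighted")
print(scores.mean())

Because the oversampler sits inside the pipeline, it is re-fitted on the training part of every split, so the validation folds are never oversampled and the skew correction cannot leak into the scores.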

To prevent information leakage, we used pipelines and a holdout dataset. Information leakage appears when information from the test data is used for training the model. This makes the model biased towards the test data, resulting in a too optimistic performance estimation. Pipelines were used to perform pre-processing solely on training data. Additionally, a part of the dataset was held out, only to be used for final testing, after the model has been trained.

To find the best-fitting hyperparameters for each of the algorithms and pre-processing methods, we have performed hyperparameter optimization. A hyperparameter is a parameter from a prior distribution; it captures the prior belief, before data is observed [19]. For example, we have varied the number of Chi-Square features to be used, and the maximum document frequencies of words. Additionally, algorithm-specific parameters have been tuned. Final testing as described in the previous paragraph has been performed using the found optimal set of hyperparameters.

Lastly, we have varied the input data, based on the assumption that the title of a Google Alert can better be used to distinguish target categories than the body of the alert. We have found that articles belonging in the "Research" category use a recent attack as a reason to write the article. Therefore this attack is mentioned in the body, while the title actually tells us that this alert belongs in another category. Tests have been performed that increase the importance of the words in the title compared to the body.

3.2.4 Performance Measurement
The performance of the algorithms is measured using the F-score, which is based on precision and recall. Precision is the number of correct positive predictions divided by the total number of positive predictions. Recall is the number of correct positive predictions divided by the number of positive instances in the test data [6]. The F-score is the harmonic mean of the two, F = 2 · P · R / (P + R), and therefore takes both false positives and false negatives into account, making it a good measure. The algorithm with the highest F-score is considered to be the most effective for classifying Google Alerts into several classes.

4. RESULTS
In this section we present the outcomes of our classification research. In Section 4.1 we show the results of the stratified fold validation on the four categories, and we show an aggregated confusion matrix. Then, in Section 4.2, we present the outcomes of the various variations described in the Methodology section. In Section 4.3, we suggest the best classifier for this dataset, as a result of the efficiency comparisons in Sections 4.1 and 4.2. Finally, as a sidestep, we show the results of 2-class classification in Section 4.4, distinguishing only Attack alerts.

4.1 Results of Stratified 3-Fold Cross Validation on the Four Categories
Table 2 shows the results of the various algorithms we tested on the original data. For each algorithm and category, the Precision, Recall and F-Score are mentioned.

The Gaussian Naive Bayes (G-NB) algorithm was dropped from the results because it can only accept continuous data. The sparse matrix that the TF-IDF analysis produces is discrete, and can therefore not be accepted. Converting the discrete values to continuous data is possible, but converting costs a lot of computing power. Therefore, we dropped the algorithm from our analysis.

The Extra Tree classifier performed exceptionally badly, with a weighted F-Score of 0,435. The reason ET performs so badly is the skewed dataset. With a recall of 0,964 for the Other category, it does not detect any Attack or Arrest alerts. It focuses on the category with the most alerts, in this case Other.

The Arrest and Research columns both have a low F-score on average. The reason for the score in Arrest is simply that the number of samples (17) is too low to get a diverse group of terms for classification. Oversampling for this category has been performed on all algorithms, but still only ∼5 samples could be included in the test subset, resulting in major drops in the score per alert. The Research category suffers from high numbers of alerts being classified as Other, as can be seen in Figure 2.

Figure 2. Aggregated Confusion Matrix (counts per actual category and algorithm; each entry lists the predictions as attack/arrest/research/other)

Actual attack:   RF 35/0/10/10    SVM 38/1/6/10    M-NB 25/3/10/17   B-NB 30/0/8/17
                 LR 35/2/10/8     DT 32/2/15/6     ET 0/0/2/53       MLP 43/0/7/5
Actual arrest:   RF 0/3/2/0       SVM 1/2/2/0      M-NB 1/2/2/0      B-NB 2/0/3/0
                 LR 1/2/2/0       DT 0/3/2/0       ET 0/0/0/5        MLP 1/2/2/0
Actual research: RF 11/0/44/20    SVM 8/0/42/25    M-NB 2/0/44/29    B-NB 7/0/52/16
                 LR 5/0/46/24     DT 18/2/44/11    ET 0/0/8/67       MLP 9/0/45/21
Actual other:    RF 14/0/35/116   SVM 4/0/25/136   M-NB 4/1/22/138   B-NB 8/0/26/131
                 LR 5/0/25/135    DT 18/1/35/111   ET 0/3/3/159      MLP 7/0/31/127

LR and MLP score best for this dataset, both with a weighted F-score of 0,727. SVM follows closely at 0,725. When comparing these three algorithms, it can be seen that MLP achieves the highest results for Attack (+0,03) and Arrest (+0,07), while being only 0,02 lower for Research and 0,01 lower for Other. Weighted, though, with Other having by far the most alerts, it is equal to LR. Not weighted, it performs 0,037 better than LR and 0,023 better than SVM. It can be concluded that MLP performs best for this dataset, as it is able to differentiate better between the different categories.

4.2 Pre-Processing Variations
Table 3 shows 7 variations of the standard data (Std) we used in the first place. The standard data is described in Section 4.1, and its results are shown in Table 2. The weighted F-scores of the standard data are mentioned in the first column of Table 3 as a reference. The variations are as follows:

1. NER: Stanford NER was applied to remove named entities from the original Google Alert content.
2. News: Analysis was only performed on the "News" alerts (n=370).
3. News NER: Analysis was performed on the "News" alerts, after applying Stanford NER.
4. Web: Analysis was only performed on the "Web" alerts (n=630).
5. Web NER: Analysis was performed on the "Web" alerts, after applying Stanford NER.
6. 2x Title: The term document frequencies of the words in the title were doubled.
7. 3x Title: The term document frequencies of the words in the title were tripled.
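The hyperparameter optimization and holdout setup described in Section 3.2.3, which was run for each of these variations, can be sketched as follows. The sketch assumes scikit-learn; the grid values shown (max-df, number of Chi-Square features, the SVM's C) are illustrative placeholders, not the grid actually used for the results below.

# Sketch of a leakage-safe grid search: pre-processing lives inside the pipeline,
# a stratified holdout split is kept apart, and GridSearchCV brute-forces the grid
# on the training part only. Assumes scikit-learn; values are illustrative.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

def optimise(texts, labels):
    # Holdout: this part of the data never influences fitting or model selection.
    x_train, x_test, y_train, y_test = train_test_split(
        texts, labels, test_size=0.2, stratify=labels, random_state=0)

    pipeline = Pipeline([
        ("tfidf", TfidfVectorizer(lowercase=True, stop_words="english")),
        ("chi2", SelectKBest(chi2)),
        ("clf", LinearSVC()),
    ])
    grid = {
        "tfidf__max_df": [0.3, 0.5, 0.7],   # 50% was the paper's initial setting
        "chi2__k": [500, 1000, "all"],      # number of Chi-Square-selected features
        "clf__C": [0.1, 1.0, 10.0],         # algorithm-specific parameter
    }
    search = GridSearchCV(pipeline, grid, scoring="f1_weighted",
                          cv=StratifiedKFold(n_splits=4, shuffle=True, random_state=0))
    search.fit(x_train, y_train)

    # Final test on the holdout, using the best hyperparameters found.
    return search.best_params_, search.score(x_test, y_test)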

Table 2. Stratified 3-fold classification results: precision (P), recall (R) and F-Score (F)

        Attack               Arrest               Research             Other                Mean
        P     R     F        P     R     F        P     R     F        P     R     F        F*
RF      0.576 0.618 0.596    0.667 0.400 0.500    0.511 0.600 0.552    0.793 0.721 0.756    0.671
SVM     0.745 0.691 0.717    0.667 0.400 0.500    0.560 0.560 0.560    0.795 0.824 0.810    0.725
M-NB    0.781 0.455 0.575    0.333 0.400 0.364    0.564 0.587 0.575    0.750 0.836 0.791    0.690
B-NB    0.641 0.455 0.532    0.600 0.600 0.600    0.575 0.667 0.617    0.781 0.800 0.790    0.697
LR      0.761 0.636 0.693    0.500 0.400 0.444    0.554 0.613 0.582    0.808 0.818 0.813    0.727
DT      0.471 0.582 0.520    0.375 0.600 0.462    0.458 0.587 0.515    0.867 0.673 0.758    0.648
ET      0.000 0.000 0.000    0.000 0.000 0.000    0.615 0.107 0.182    0.560 0.964 0.708    0.435
MLP     0.717 0.782 0.748    1.000 0.400 0.571    0.529 0.600 0.562    0.830 0.770 0.799    0.727
* Weighted average of F-Scores

Table 3. F-Scores for various data columns

        Std         NER        News          News NER    Web         Web NER     2xTitle       3xTitle
RF      0.671       0.651      0.507         0.419       0.691(2)    0.735(2)    0.655         0.683
SVM     0.725       0.738      0.579         0.605       0.741(2)    0.716(2)    0.692         0.713
M-NB    0.690       0.660      0.554         0.527       0.661(2)    0.660(2)    0.662         0.651
B-NB    0.697       0.679      0.608         0.612       0.682(2)    0.666(2)    0.706         0.693
LR      0.727       0.711      0.585(2)      0.584       0.698(2)    0.707(2)    0.701         0.695
DT      0.648       0.639      0.485         0.468       0.650(2)    0.627(2)    0.632         0.625
ET      0.435(1,3)  0.400(1)   0.201(1,2,4)  0.227(2,4)  0.620(1,2)  0.121(2)    0.391(1,2,3)  0.387(2)
MLP     0.727       0.709      0.593(2)      0.573(2)    0.715(2)    0.725(2)    0.699         0.665
Mean    0.665       0.648      0.514         0.502       0.682       0.620       0.642         0.639
Markers (1)-(4) indicate that the F-Score of Attack (1), Arrest (2), Research (3) or Other (4) is 0.000
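For reference, the weighted mean F* reported in Tables 2 and 3 is a support-weighted average of the per-class F-scores, which is why a 0.000 score for a class with very few alerts barely moves the overall figure. A minimal sketch of how such a figure is obtained, assuming scikit-learn's metrics (the labels below are invented, purely to make the call concrete):

# Per-class and support-weighted F-scores; assumes scikit-learn. Invented example labels.
from sklearn.metrics import f1_score

y_true = ["Attack", "Other", "Other", "Research", "Attack", "Other", "Arrest", "Other"]
y_pred = ["Attack", "Other", "Research", "Other", "Attack", "Other", "Other", "Other"]

per_class = f1_score(y_true, y_pred, average=None, zero_division=0,
                     labels=["Attack", "Arrest", "Research", "Other"])
weighted = f1_score(y_true, y_pred, average="weighted", zero_division=0)  # classes weighed by their support
print(per_class, weighted)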

Analysis on only the News alerts results in low performance (highest 0,608). We expect this to be a result of the number of samples (n=370) being too low to perform 4-category classification. Classifying only the Web alerts results in higher accuracies: using SVM on the Web alerts even achieves the highest F-Score (0,741). Even though the number of Web alerts (n=630) is higher than the number of News alerts, there are only 3 Web Arrest alerts. This results in no Arrest alerts being detected in the Web alert dataset. In the weighted F-Score, however, this is not an issue, since the score of 0.000 is only weighed with 3/670, resulting in the high F-score of 0,741.

Applying NER seems to have barely any effect on the results. Only SVM profits from NER on the full dataset (+0,013), while B-NB and ET profit on the News subset as well. On the contrary, for the Web alerts, SVM loses 0,025. However, as discussed in the previous paragraph, the Web alerts subset is not suitable for 4-category classification, and can therefore be neglected. It can be concluded that NER only has a consistent positive effect for the SVM.

Analysis on only the title and only the body both resulted in lower F-scores than the title and body combined. This suggests that the title and body reinforce each other. Increasing the importance of the title in the Google Alerts (2x Title and 3x Title) did not result in higher accuracies than the standard text.

4.3 Suggestion of Best Classifier
According to Kotsiantis, Decision Trees tend to perform better when dealing with categorical features, as opposed to Support Vector Machines and Neural Networks [14]. When we look at the results, though, this does not hold for our dataset: for all tests, the Decision Trees performed worse than the SVM and the MLP.

As can be seen in Table 3, SVM, Logistic Regression and the Neural Network perform best. In Section 4.1 a comparison was made between these three algorithms on the standard dataset, resulting in the conclusion that MLP is best. However, after analysis of the various other datasets, an even higher score was achieved by the SVM on the NER dataset. Table 4 gives a direct comparison between the SVM on the NER dataset (F=0,738) and the MLP on the standard dataset (F=0,727).

Table 4. SVM + NER versus MLP

           SVM                   MLP
           P     R     F         P     R     F
attack     0.632 0.782 0.699     0.717 0.782 0.748
arrest     0.750 0.600 0.667     1.000 0.400 0.571
research   0.580 0.627 0.603     0.529 0.600 0.562
other      0.864 0.770 0.814     0.830 0.770 0.799
avg        0.749 0.733 0.738     0.737 0.723 0.727

The choice of which algorithm is best depends on the desired results. Overall, the SVM performs better, but when high precision in the attack and arrest categories is desired, MLP is the best choice. For quick testing SVM is definitely the best algorithm, as it provides similar results much faster; the computational intensity of the MLP is much higher.

Figure 3. 2-class most distinctive terms
Figure 4. 2-class confusion matrix

4.4 2-class classification
We make a sidestep to show the results of the best performing algorithm for a 2-class classification, instead of 4-class. We have merged the Arrest, Research and Other categories into a new "Other" category. This way, only Attack alerts are distinguished from all other alerts.

Using the same methodology that was used for the 4-class classification, we have found the best performing algorithm to be the Support Vector Machine, with the following result:

Table 5. SVM 2-class results

           P     R     F
attack     0.700 0.495 0.580
other      0.893 0.952 0.922
avg        0.858 0.868 0.859

Note that the F-score of the Attack category is lower than the F-score that was obtained in the 4-class classification, mainly because of the low recall. The categories being even more extremely skewed may be the reason for this lower performance.

The most outstanding correlated terms as identified by TF-IDF can be seen in Figure 3. Various names can be noticed in the term list, including "ethereum", "pokemon" and "github". With Stanford NER it is possible to filter those terms; however, both the weighted F-score and the precision in the Attack category were lower when NER was used. This could point to the SVM being overfitted. Since the 2-class classification is not the main topic of the paper, further research may address this issue.

Finally, a confusion matrix for the 2-class classification can be found in Figure 4.

5. RELATED WORK
No related work on classifying Google Alerts has been found; however, the methodology we have used is similar to the methodology used in the following papers:

• Priante et al. [17]: Twitter profiles are categorized into multiple categories using four algorithms, and their efficiencies are compared.
• Abu-Nimeh et al. [2]: The predictive accuracy of several machine learning methods for predicting phishing emails is compared.
• Santanna et al. [24]: Eight classification algorithms are compared to most accurately detect Booter websites, offering DDoS as a Service.

6. CONCLUSIONS
In this research, we analyzed and compared the performance of 8 supervised machine learning classifiers on a Google Alert dataset on DDoS attacks. We conducted a multiclass experiment, as we classified into four categories: attack, arrest, research and other. First, we annotated 1000 alerts, which we consequently analyzed in detail. We constructed multiple variations of the original dataset, aiming to achieve higher performance. For example, we used Stanford NER to remove named entities, and we varied the importance of the title of every Google Alert.

As a result, we can conclude that two algorithms perform best, depending on the desired performance focus. The Linear Support Vector Machine performs best overall, with a weighted F-score of 0,738. However, the Neural Network performs best when high precision is desired for the attack and arrest categories. The choice for the best algorithm may also be influenced by the required computation power: the Neural Network is much more computationally intensive and therefore takes longer to compute.

We have also taken a first look into 2-class classification, in which only Attack alerts are distinguished. The overall performance is much higher; however, the performance of the attack category is lower.

Our last contribution is a classification tool, described in Section 7. This tool eases the process of comparing and optimizing multiple (supervised) classifiers on a given dataset while ensuring valid results. The tool will be published after submission of this paper.

Several limitations have been identified for this research that should be addressed in future research. First, we make use of 1000 annotated Google Alerts. This is insufficient for a 4-category comparison, especially because the dataset is extremely skewed. The arrest class only represents 3% of all alerts, affecting analysis for skew-sensitive algorithms. We have used oversampling to increase the number of arrest alerts in the training data, which increased the performance for all algorithms. However, the number of arrest alerts in the training set turned out to be the bottleneck. More annotated data is the solution.

Second, we showed only a part of the possible combinations of classifiers, pre-processing techniques and parts of the dataset. We recommend trying more combinations, especially combinations where multiple algorithms are used for one alert. For example, one algorithm may be better able to classify a title, whereas another may be better at classifying the body. Additionally, more experimentation with the Stanford NER, also on these title/body combinations, may pay off. We have seen a slight increase in performance for the SVM with NER, which may further be used in other data combinations.

Third, this research is limited to analyzing full alerts. Some early tests showed that the title and the body separately achieved lower results than the title and body combined. This points to the conclusion that the title and body reinforce each other.

Fourth, in our analysis, we have not differentiated between the dataset with alerts on the "DDoS" keyword and the dataset with alerts on the "Distributed Denial of Service" keyword. Performance differences, in combination with the second point, may be found.

Last, for achieving better results from the Google Alerts, it could be considered to perform web scraping of the accessory pages to get more text to analyze. This may add valuable information to the restricted content of the Google Alert. Besides this, consider performing date clustering of some sort. This may help increase performance in the attack category, where alerts in the same timeframe may be more likely to be about the same attack.

7. APPENDIX: CLASSIFICATION TOOL
Valid classification while preventing common pitfalls is difficult, and mistakes are easily made. Therefore, we have developed a tool to ease the process of comparing and optimizing multiple (supervised) classifiers on a given dataset. The tool includes the following functionality:

1. Any datafile can be loaded, and feature and target (name) columns can be selected.

2. Multiple classifiers can be fed to the program. It will execute every classifier so that results can be compared.

3. Every classifier can be combined with an arbitrary set of pre-processing functions, such as TF-IDF, Chi-Square and LSA dimension reduction, to find the optimal combination of classifiers and pre-processing functions. Functions from the imbalanced-learn Python package, such as various under- and oversampling methods, can be used as well. A set of pipelines is automatically built. The performance can be compared by cross-validating each of them.

4. Every combination of classifiers and pre-processing functions can be optimized. (Stratified) cross-validated GridSearch is applied to tune (sets of) hyperparameters. GridSearch automates this process by brute-force calculating the performance of every hyperparameter combination. Complete pipelines can be optimized this way, using only training data, while a final test will be performed on the test data.

5. A list of optimization parameters can be set at the start of the program. Every pipeline will be equipped with only the parameters that apply, before being processed by the GridSearch.

6. An adjustable part of the dataset is held out by default, to ensure no information from it can leak into the training process.

7. Various visualizations are built in, including confusion matrices for every pipeline, box plots for comparing performances between classifiers and (aggregated) bar charts for visualizing the content of a datafile.

Besides this, an annotation tool is present for annotating (a part of) a datafile. For every row, the content will be displayed and the user is asked for the correct input. This input is validated and saved. It is possible to stop the program and continue annotating at a later time, as progress is saved. The tool will be made public at Github (https://github.com/wmkamerman) after publication.
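A condensed sketch of the comparison loop the tool automates is given below: every classifier is wrapped in the same pre-processing pipeline and cross-validated on identical stratified folds, so that the resulting scores are directly comparable. The library (scikit-learn) and the classifier settings are assumptions for illustration, not the tool's actual implementation.

# Sketch of the tool's core comparison loop; assumes scikit-learn, settings illustrative.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

def compare_classifiers(texts, labels):
    classifiers = {
        "LR": LogisticRegression(max_iter=1000),
        "SVM": LinearSVC(),
        "RF": RandomForestClassifier(),
        "MLP": MLPClassifier(max_iter=500),
    }
    folds = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
    results = {}
    for name, clf in classifiers.items():
        # Identical pre-processing for every classifier keeps the comparison fair.
        pipeline = Pipeline([
            ("tfidf", TfidfVectorizer(stop_words="english", max_df=0.5)),
            ("clf", clf),
        ])
        scores = cross_val_score(pipeline, texts, labels, cv=folds, scoring="f1_weighted")
        results[name] = scores.mean()
    return results  # e.g. {"LR": 0.71, "SVM": 0.72, ...}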

8. ACKNOWLEDGEMENTS
We gratefully thank Abhishta for his support during the research period. His guidance has been of great help.

References
[1] Abhishta, L. Nieuwenhuis, M. Junger, and R. Joosten. Characteristics of DDoS attacks: An analysis of the most reported attack events of 2016, 2017.
[2] S. Abu-Nimeh, D. Nappa, X. Wang, and S. Nair. A comparison of machine learning techniques for phishing detection. In Proceedings of the Anti-Phishing Working Groups 2nd Annual eCrime Researchers Summit, eCrime '07, pages 60-69, New York, NY, USA, 2007. ACM. doi: 10.1145/1299015.1299021.
[3] Arbor Networks. Worldwide infrastructure security report - volume XII, 2016. Retrieved June 29, 2018 from https://pages.arbornetworks.com/rs/082-KNA-087/images/12th_Worldwide_Infrastructure_Security_Report.pdf.
[4] L. Breiman. Classification and Regression Trees. Wadsworth Statistics/Probability Series. Wadsworth International Group, 1984.
[5] L. Breiman. Random forests. Machine Learning, 45(1):5-32, Oct 2001. doi: 10.1023/A:1010933404324.
[6] J. Brownlee. Classification accuracy is not enough: More performance measures you can use, 2014. Retrieved May 6, 2018 from https://machinelearningmastery.com/classification-accuracy-is-not-enough-more-performance-measures-you-can-use/.
[7] G. C. Cawley and N. L. Talbot. On over-fitting in model selection and subsequent selection bias in performance evaluation. Journal of Machine Learning Research, 11:2079-2107, Aug. 2010.
[8] C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20(3):273-297, Sep 1995. doi: 10.1007/BF00994018.
[9] J. R. Finkel, T. Grenager, and C. Manning. Incorporating non-local information into information extraction systems by Gibbs sampling. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, ACL '05, pages 363-370, Stroudsburg, PA, USA, 2005. Association for Computational Linguistics. doi: 10.3115/1219840.1219885.
[10] P. Geurts, D. Ernst, and L. Wehenkel. Extremely randomized trees. Machine Learning, 63(1):3-42, Apr 2006. doi: 10.1007/s10994-006-6226-1.
[11] Google. Welcome to Google News, 2018. URL https://support.google.com/news/publisher-center/answer/6016113. [Online; accessed 20-June-2018].
[12] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014. URL http://arxiv.org/abs/1412.6980.
[13] R. Kohavi. A study of cross-validation and bootstrap for accuracy estimation and model selection. In Proceedings of the 14th International Joint Conference on Artificial Intelligence - Volume 2, IJCAI'95, pages 1137-1143, San Francisco, CA, USA, 1995. Morgan Kaufmann Publishers Inc.
[14] S. Kotsiantis. Supervised machine learning: A review of classification techniques. Informatica, 31, 2007.
[15] J. Mirkovic and P. Reiher. A taxonomy of DDoS attack and DDoS defense mechanisms. SIGCOMM Computer Communication Review, 34(2):39-53, Apr. 2004. doi: 10.1145/997150.997156.
[16] T. Peng, C. Leckie, and K. Ramamohanarao. Survey of network-based defense mechanisms countering the DoS and DDoS problems. ACM Computing Surveys, 39(1), Apr. 2007. doi: 10.1145/1216370.1216373.
[17] A. Priante, D. Hiemstra, T. van den Broek, A. Saeed, M. Ehrenhard, and A. Need. #WhoAmI in 160 characters? Classifying social identities based on Twitter profile descriptions. In Proceedings of the First Workshop on NLP and Computational Social Science, pages 55-65, 2016.
[18] S. Raschka and V. Mirjalili. Python Machine Learning: Machine Learning and Deep Learning with Python, Scikit-learn, and TensorFlow. Expert insight. Packt Publishing, 2017.
[19] C. Riggelsen. Approximation Methods for Efficient Learning of Bayesian Networks. IOS Press, Amsterdam, The Netherlands, 2008.
[20] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by error propagation. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1, pages 318-362. MIT Press, Cambridge, MA, USA, 1986.
[21] S. Russell and P. Norvig. Artificial Intelligence: A Modern Approach. Prentice Hall Press, Upper Saddle River, NJ, USA, 3rd edition, 2009.
[22] G. Salton and C. Buckley. Term-weighting approaches in automatic text retrieval. Information Processing & Management, 24(5):513-523, Aug. 1988. doi: 10.1016/0306-4573(88)90021-0.
[23] J. J. Santanna, R. van Rijswijk-Deij, R. Hofstede, A. Sperotto, M. Wierbosch, L. Z. Granville, and A. Pras. Booters - an analysis of DDoS-as-a-service attacks. In 2015 IFIP/IEEE International Symposium on Integrated Network Management (IM), pages 243-251, May 2015. doi: 10.1109/INM.2015.7140298.
[24] J. J. Santanna, R. d. O. Schmidt, D. Tuncer, J. de Vries, L. Z. Granville, and A. Pras. Booter blacklist: Unveiling DDoS-for-hire websites. In 2016 12th International Conference on Network and Service Management (CNSM), pages 144-152, Oct 2016. doi: 10.1109/CNSM.2016.7818410.
[25] F. Sebastiani. Machine learning in automated text categorization. ACM Computing Surveys, 34(1):1-47, Mar. 2002. doi: 10.1145/505282.505283.
[26] G. M. Weiss, K. L. McCarthy, and B. Zabar. Cost-sensitive learning vs. sampling: Which is best for handling unbalanced classes with unequal error costs? In DMIN, 2007.
[27] B. Widrow and M. E. Hoff. Adaptive switching circuits. In 1960 IRE WESCON Convention Record, Part 4, pages 96-104, New York, 1960. Institute of Radio Engineers.
[28] Wikipedia. Google Alerts — Wikipedia, the free encyclopedia. URL https://en.wikipedia.org/wiki/Google_Alerts. [Online; accessed 19-June-2018].
[29] S. T. Zargar, J. Joshi, and D. Tipper. A survey of defense mechanisms against distributed denial of service (DDoS) flooding attacks. IEEE Communications Surveys & Tutorials, 15(4):2046-2069, 2013. doi: 10.1109/SURV.2013.031413.00127.
