
Article

Effective DGA-Domain Detection and Classification with TextCNN and Additional Features

Chanwoong Hwang 1, Hyosik Kim 1, Hooki Lee 2 and Taejin Lee 1,*

1 Department of Information Security, Hoseo University, Asan 31499, Korea; [email protected] (C.H.); [email protected] (H.K.)
2 Department of Cyber Security Engineering, Konyang University, Nonsan 32992, Korea; [email protected]
* Correspondence: [email protected]

 Received: 8 June 2020; Accepted: 29 June 2020; Published: 30 June 2020 

Abstract: Malicious codes, such as advanced persistent threat (APT) attacks, do not operate immediately after infecting the system, but only after receiving commands from the attacker's command and control (C&C) server. The system infected by the malicious code tries to communicate with the C&C server through the IP address or domain address of the C&C server. If the IP address or domain address is hard-coded inside the malicious code, analysts can extract the address from the code and block access to the C&C server through security policy. In order to circumvent this address blocking technique, domain generation algorithms are included in the malicious code to dynamically generate domain addresses. The domain generation algorithm (DGA) generates domains randomly, so it is very difficult to identify and block malicious domains. Therefore, this paper effectively detects and classifies unknown DGA domains. We extract features that are effective for TextCNN-based label prediction, and add additional domain knowledge-based features to improve our model for detecting and classifying DGA-generated malicious domains. The proposed model achieved 99.19% accuracy for DGA classification and 88.77% accuracy for DGA class classification. We expect that the proposed model can be applied to effectively detect and block DGA-generated domains.

Keywords: security; domain generation algorithm; TextCNN; domain features; classification

1. Introduction

1.1. Background

Most cyberattacks use malicious codes, and according to AV-TEST, more than 1 billion malicious codes are expected to emerge in 2020 [1]. Unlike the recent trend of distributing malicious codes to a large number of unspecified people, advanced persistent threat (APT) attacks are aimed at a single target. APT attacks are characterized by dense and systematic security threats, built on various IT technologies and attack methods, that do not stop until the intrusion succeeds. In addition, APT attacks mainly target government agencies or corporations, and they are difficult to detect because the attackers infiltrate the system and attack continuously. Figure 1 shows the APT attack cycle.


Figure 1. Advanced persistent threat (APT) attack cycle.

In the first step, the APT attacker infects the malicious code through a website or email visited by an internal user to create an internally infected PC. Step 2 collects infrastructure information, such as the organization's internal network. Step 3 invades vulnerable systems and servers through account information theft and malware infections. The fourth stage consists of steps such as internal information leakage and system destruction under the command of the attacker. Some malicious codes, such as those used in APT attacks, operate after receiving commands from a remote command and control (C&C) server once installed on the device. A botnet is a collection of malware-infected machines, or bots. It has become the main means for cyber-criminals to send spam mails, steal personal data, and launch distributed denial of service attacks [2]. Most bots today rely on a domain generation algorithm (DGA) to generate a list of candidate domain names in the attempt to connect with the so-called C&C server.

According to Nominum's domain name system (DNS) security trend analysis report [3], more than 90% of attacks by malicious codes use the domain name system (DNS). A DGA-generated DNS is especially used to infect/update new hosts and communicate with C&C servers. To perform the action desired by the attacker, the C&C server is used to communicate in real time with the malicious code. In order for the malicious code to receive commands from the C&C server, it must first connect to the server, and for communication on the internet, the client PC needs to know the IP address of the server. Since the server is accessed in this way, the malicious code must contain the IP address or domain address of the C&C server. If the malicious code hides the IP address or domain address of the C&C server by hard coding, security equipment or law enforcement agencies can block that IP address or domain address to prevent the malware-infected device from accessing the C&C server. However, malware creators use DGA to circumvent these access blocking techniques.

Attackers use DGA for the purpose of hiding C&C servers. DGA randomly generates numerous domain addresses every day. Since the attacker can predict the domain addresses that will be generated on a specific date, an attacker operating a C&C server registers one of the domain addresses generated for that date through appropriate registration procedures in advance. This is called IP/DNS fast flux. This way, an attacker can change the mapping of DNS to IP every 10 s. However, as many studies related to fast flux detection have been conducted, access control policies, such as blacklist management for anomalous DNS, can be established [4,5]. As a result, attackers have developed DGA and changed the DNS mapping of C&C servers at short time intervals, neutralizing the existing blacklist policy and making C&C server blocking very difficult until recently.

1.2. Domain Generation Algorithm

DGA generates a random string by inputting a predetermined seed value and combines a second-level domain (SLD) and top-level domain (TLD) to generate a domain address. The DGA generates millions of domains, but the attacker can predict which domains the DGA will create because it uses, as a seed value, time series data, such as time or exchange rates, that the attacker can also know.


The attacker calculates the domain addresses to be generated by the DGA in advance and registers and uses the domain address of the C&C server through legitimate procedures. Figure 2 shows the process of malicious code connecting to the C&C server through DGA. Table 1 shows examples of known DGA types, DGA techniques, and DGA-generated domain names. The DGA techniques column represents the SEED used by each DGA.

Figure 2. Overall domain generation algorithm (DGA) operation process.

Table 1. Summary of Existing DGA Techniques and Examples.

DGA | DGA Techniques | Example Domain Names
Zeus | MD5 of the year, month, day, and a sequence number between 0 and 999 | krhafeobleyhiy-trwuduzlbucutwt, vsmfabubenvib-wolvgilhirvmz
Conficker | GMT data as the seed of the random number generator | fabdpdri, sfqzqigzs, whakxpvb
Kraken | A random string of 6 to 11 characters | rhxqccwdhwg, huwoyvagozu, gmkxtm
Srizbi | Data transformation using XOR operations | wqpyygsq, tqdaourf, aqforugp
Torpig | Current date and the number 8 as the seed of the random number generator | 16ah4a9ax0apra, 12ah4a6abx5apra, 3ah0a16ax0apra
Kwyjibo | Markov process on English syllables | overloadable, refillingman, multible

The SEED value for Zeus's domain generation works using the MD5 hash of the year, month, day, and a sequence number between 0 and 999. The right-most 5 bits of the generated MD5 hash are added to the hexadecimal value of the letter 'a.' Then, the numbers are converted to alphabetical characters, and all generated characters are concatenated to form a domain name. The Srizbi botnet has infected around 450,000 computers to send spam [6]. It performs exclusive or (XOR) operations using specific days, months, and years, then divides, changes, and concatenates the results. As a result, four domain names are generated daily for one year. Torpig stole about 500,000 online bank accounts, as well as financial information, such as credit cards. Similarly, this algorithm generates a domain name every day using the current date. Kwyjibo creates a dictionary of random words that are short and easy to pronounce [7]. Moreover, it uses a hyphenation algorithm to separate dictionary words into 2–4 syllables [8].
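To make the Zeus-style scheme above concrete, the following is a minimal Python sketch of an MD5-seeded generator. The seed string format, the handling of 5-bit values above 25, and the appended TLD are all assumptions; the paper does not specify them exactly.

```python
import hashlib

def zeus_like_domain(year, month, day, seq, tld=".com"):
    """Illustrative MD5-seeded generator in the style described for Zeus.

    The seed layout, out-of-alphabet handling, and TLD are assumptions.
    """
    seed = f"{year:04d}{month:02d}{day:02d}{seq:03d}".encode()
    digest = hashlib.md5(seed).digest()
    # Add the right-most 5 bits of each hash byte to 'a' and keep the
    # results that land inside the alphabet.
    letters = [chr(ord("a") + (b & 0x1F)) for b in digest if (b & 0x1F) < 26]
    return "".join(letters) + tld

for i in range(3):
    print(zeus_like_domain(2020, 6, 8, i))
```

Because the seed depends only on the date and a bounded sequence number, both the attacker and the bot can enumerate the same candidate list independently, which is what makes pre-registration of the C&C domain possible.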

As such, there are various DGAs, and there are various technologies for generating domains. Existing domain or IP blocking methods are limited because verifying whether the millions of domain addresses generated by DGA are malicious is very expensive. Therefore, in this paper, we propose a technique to detect malicious domains generated by DGA and classify DGA groups using AI-based

TextCNN and additional features that can be automatically determined. Our proposed model can efficiently detect malicious domain addresses among numerous domain addresses. We expect to be able to effectively neutralize malicious codes using C&C servers by using the detected domain address in the blocking policy.

1.3. Contribution

1.3.1. Binary and Multi-Class Classification Are Possible with One Model

Tran et al. [9] detected malicious domains with a DGA binary classification model, and created a new multi-class classification model with the detected malicious domains. That study proposed an LSTM multi-input (LSTM.MI) model that combines the two models used for binary classification and multi-class classification. In contrast, the technique proposed in this paper is capable of binary classification and multi-class classification in one model. When creating the multi-class classification model, normal domains are also learned at the same time. For example, we train by specifying the normal domain as class 0; if the multi-class classification model predicts zero, it points to the normal domain. Therefore, we propose an approach that allows binary classification and multi-class classification in one model.

1.3.2. Feature Refining Technology that Reflects Features Well According to Purpose

Dimension reduction of features can help to prevent overfitting or improve the performance of the model [10]. Liou et al. [11] used an autoencoder for a word-to-word similarity analysis. This reduced the entropy through training and created a vector that can accommodate the meaning of words. Conversely, in this paper, the dimension was reduced by using a feature refining technology suited to the purpose. There are two main ways to refine features. The first is feature refinement that does not use labels. Typical methods are feature extraction using a principal component analysis (PCA) and an autoencoder. The autoencoder is not suitable for binary or multi-class classification because the input layer and output layer are trained with the same values to extract features [12]. The second is feature refinement using labels. It usually uses a hidden layer in a neural network, such as an artificial neural network (ANN). This extracts the node values of one hidden layer whose number of nodes is set to the number of features to be extracted. The difference is that training is guided by the classification labels specified on the output layer. This feature refinement technique is effective for binary or multi-class classification, as sketched below. Figure 3 shows the results of classifying the DGA class after refining features using a TextCNN and an autoencoder in the same environment as the proposed model. This proves that TextCNN is more effective for classification than an autoencoder.
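As an illustration of the label-guided refinement described above, here is a minimal Keras sketch; the input size and the 128-node layer are assumptions, while the 100-node feature layer and 20-class output mirror the proposed model. A classifier is trained with labels, and the refined features are then read from its penultimate hidden layer.

```python
from tensorflow.keras import layers, models

# A classifier whose penultimate hidden layer holds exactly the number of
# features to refine; training against the labels makes those features
# discriminative for the classification task.
clf = models.Sequential([
    layers.Dense(128, activation="relu", input_shape=(256,)),  # assumed sizes
    layers.Dense(100, activation="relu"),    # 100 label-guided features
    layers.Dense(20, activation="softmax"),  # class labels on the output layer
])
clf.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
# After clf.fit(...), read the refined features from the 100-node layer:
refiner = models.Model(clf.input, clf.layers[-2].output)
```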

Figure 3. Comparison of TextCNN and autoencoder classification results.

2. Related Work

2.1. Similarity Comparison Technique for DGA Classification

It is important to know the DGA in the security community. However, there are limitations to the method of detecting bots and blocking traffic based on existing blacklists. Donghyeon et al. [13] detected an APT attack by analyzing the similarity between DGA and DNS using an n-gram. The n-gram extracts the characteristics of text expressed in natural language by the size of n so that it can be treated as a simple list of symbols. Figure 4 shows an n-gram example when n is 5.
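A character n-gram of the kind illustrated in Figure 4 can be produced with a simple sliding window; a minimal Python sketch (the example string is hypothetical):

```python
def ngrams(text, n=5):
    """All contiguous length-n character substrings of text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(ngrams("googlecom"))
# ['googl', 'oogle', 'oglec', 'gleco', 'lecom']
```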

Figure 4. Example of n-gram creation (n = 5).

The APT attack was detected by comparing the similarities of DNS generated by DGA against the whitelist existing in DNS. In addition, verification showed that the model performance was best when n was 4, with the disadvantage that an appropriate threshold had to be chosen. Therefore, machine learning techniques that generate and predict learning models using domain data generated by DGA are being studied [14–17].

2.2. Clustering Technique for DGA Classification

Chin et al. [18] classified DGA by first using the machine learning classification algorithm J48 and, secondly, through clustering-based DNS and similarity comparison. The machine learning framework applies DNS blacklist policies and detects DGA through 32 feature extractions, classifications, and clustering. Six of the 32 features are linguistic features, and the other 27 are features extracted from the DNS payload. Table 2 shows the six features for DGA classification. Figure 5 shows the DGA classification and clustering model.


Table 2. Examples of features used for DGA classification.

Features | Description
Length | It is simply the length of a domain name.
Meaningful Word Ratio | Measures the proportion of meaningful words in a domain name.
Pronounceability Score | Selects the substring length n (2 or 3) and counts the number of occurrences in the n-gram frequency text.
Percentage of Numerical Characters | Measures the percentage of numbers in a string.
Percentage of the Length of LMS | Measures the length of the longest meaningful string (LMS) in the domain name.
Levenshtein Edit Distance | Measures the minimum number of single character edits between domains.
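Of the features in Table 2, the Levenshtein edit distance is the least self-explanatory; a standard dynamic-programming implementation (not the authors' code, and the example domains are taken from Table 1) looks like this:

```python
def levenshtein(a, b):
    """Minimum number of single-character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

print(levenshtein("vsmfabubenvib.biz", "vsmfabubenvib.net"))  # -> 3
```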

Figure 5. Model of DGA classification and clustering.

This compares DNS similarities based on density-based spatial clustering of applications with noise (DBSCAN) using the domains classified with the J48 algorithm. In order to reduce false positives in the primary results, a secondary classification was conducted through clustering. The first classification using the J48 algorithm achieved 95.14%, and the second classification using DBSCAN clustering achieved 92.02%.

2.3. Deep Learning Technique for DGA Classification

DGA generates a single string: the domain. Since DGA's operation process creates domains using changing time series data as SEED, there are studies that have applied the recurrent neural network (RNN) and long short-term memory (LSTM) [19], which are suitable for learning sequential input data, such as time series data. Woodbridge et al. [20] propose a DGA classifier that leverages the LSTM network for real-time prediction of DGAs without the need for contextual information or manually created features. LSTM is a special kind of deep RNN, which is a neural network designed to better remember and learn, even if the distance between the sequential input data is long. It has mainly been applied to applications such as language modeling, speech recognition, and DGA botnet detection. As a result, a 90% detection rate was achieved. However, Tran et al. [9] stated that although LSTM is effective, it is naturally sensitive to multi-class imbalance problems. Therefore, they proposed a new LSTM.MI algorithm for DGA botnet detection. The basic idea was to rely on both binary and multi-class classification models, where LSTM was adapted to include cost items in its backpropagation learning mechanism. In other words, LSTM.MI performed multi-class classification only when the binary classification result was malicious. They demonstrated that LSTM.MI provided an improvement of at least 7% in terms of macro-averaging recall and precision as compared to the original LSTM and other state-of-the-art cost-sensitive methods.


It was also able to preserve the high accuracy on binary classification (0.9849 F1-score), while helping to recognize five additional bot families. Figure 6 shows the DGA classification models utilizing the LSTM mentioned above.


Figure 6. DGA classification model using long short-term memory (LSTM): (a) LSTM-based DGA classifier without contextual information or manually created features; (b) DGA classifier based on the LSTM.MI algorithm.

The domain name generated by DGA is not fixed, and the proposed deep learning models require a fixed length as input. Qiao et al. [21] set a fixed length using statistics on the length of the DGA domain name. Figure 7 shows the statistical distribution over the length of the domain name.

Figure 7. DGA domain name length distribution.


Most of the domain lengths were concentrated at intervals of 10 to 20, and padding was performed at a total length of 54 by adding 10 to the maximum length domain. It then used Word2Vec's continuous bag of words (CBOW) model to convert a domain name of length 54 into 128 dimensions. As a result, using the LSTM algorithm, the results of 16 multi-class classifications achieved 95.14% accuracy. Yu et al. [22] extracted 11 features from the domain and proposed a method to detect DGA domain names based on the convolutional neural network (CNN) and LSTM. The verification was compared with the machine learning classification algorithms K-nearest neighbor (KNN), support vector machine

(SVM), random forest (RF), and AdaBoost. Multiple character ratio features can be calculated for a given text string: the domain name. Table 3 shows an example of creating such features from a domain name.

Table 3. Example of features that can be extracted from the domain format.

Features | Description
symbol character ratio | The number of characters that do not exist in the English alphabet divided by the string length.
hex character ratio | The number of hexadecimal characters (A–F) divided by the string length.
vowel character ratio | The number of vowels divided by the string length.
TLD hash | A hash value is computed for each potential TLD and normalized to the range 0 to 1.
first character digit | A flag for whether the first character is or is not a numerical digit.
Length | The length of the string taken as the domain name.

The experimental results achieved 72.89% accuracy with CNN and 74.05% with LSTM. Another study [23] verified that there is little difference between CNN and RNN-based architectures. In the process of classifying malicious domains, CNN and LSTM were used to verify a total of five different models: two models using CNN, two models using LSTM, and a hybrid CNN/LSTM model. The models were trained and evaluated with 1 million malicious domains and 1 million normal domains. The performance was compared using the RF and multi-layer perceptron (MLP) algorithms with the same features as the proposed approach. Table 4 shows the results of verifying the performance of RF and MLP to compare with the five models mentioned above. The RF model achieved an 83% malicious domain detection rate, while the deep learning-based CNNs and LSTMs averaged 97–98%. Therefore, it was shown that there is little difference between CNN- and RNN-based architectures.

Table 4. Comparison results on test data [23]. Accuracy, true positive rate (TPR), and false positive rate (FPR) are reported at the threshold that gives an FPR of 0.001 on the validation data.

Model | Architecture | Acc | TPR | FPR | AUC@1%
RF | Lexical features | 91.51% | 83.15% | 0.00128 | 84.77%
MLP | Lexical features | 73.74% | 47.61% | 0.00091 | 58.81%
MLP | Embedding | 84.29% | 68.69% | 0.00108 | 80.88%
Endgame | LSTM | 98.72% | 97.55% | 0.00102 | 98.03%
Invincea | CNN | 98.95% | 98.01% | 0.00109 | 97.47%
CMU | LSTM | 98.54% | 97.18% | 0.00108 | 98.25%
MIT | LSTM + CNN | 98.70% | 97.49% | 0.00099 | 97.55%
NYU | CNN | 98.58% | 97.27% | 0.00116 | 97.93%

3. Proposed Model

3.1. Overview

Recently, there have been cases in which damage is caused by various variants that can be seen as APT attack malware. The attackers use C&C servers to command these malicious codes, and they use DGA to conceal the C&C servers. DGA is an algorithm that generates random domain names in malicious codes. The attacker communicates with the malicious code by registering a DGA-generated domain name in the DNS in advance. DGA creates millions of domains per day, so the existing blacklist-based domain/IP blocking method is limited. In this paper, we propose an AI-based malicious domain detection technology. We set up a layer with 100 nodes in front of the output layer of TextCNN to extract 100 features that were effective for 20 class classification. In addition, we created 10 features of the domain knowledge base and used a total of 110 features in addition to the 100 features obtained earlier.

Finally, DGA and DGA classes were classified using the light gradient boosting model (LightGBM). Figure 8 shows the overall configuration of the proposed model.

Figure 8. Effective DGA domain detection and classification model composition diagram combining TextCNN features and domain knowledge features.

3.2. DGA Analysis Approach

3.2.1. TextCNN based Feature Engineering

Character-by-character embedding is required to represent the string format as a number. Character embedding converts each character to a unique value [24]. In order to represent the domain as a one-dimensional vector, we constructed a dictionary that can index a total of 69 characters, including all 68 characters and 'UNK.' 'UNK' indicates a character that is not in the dictionary. Table 5a shows the character dictionary. We used the dictionary to convert characters in the domain into numbers. However, because the domain length varied, it had to be expressed as a vector with a fixed size in order to apply TextCNN. We set the appropriate fixed size using the distribution of domain length in the dataset. Figure 9 shows the domain length distribution in the dataset.

Figure 9. Domain length distribution in dataset: (a) training dataset; (b) test dataset.


Table 5. Character-level embedding approach in the proposed model. (a) Character dictionary; (b) one-hot vector dictionary.

(a)

Char | Index
a | 1
b | 2
c | 3
d | 4
e | 5
... | ...
[ | 65
] | 66
{ | 67
} | 68
UNK | 69

(b)

Index | One-Hot Vector
0 | [0, 0, 0, 0, ..., 0, 0, 0, 0]
1 | [1, 0, 0, 0, ..., 0, 0, 0, 0]
2 | [0, 1, 0, 0, ..., 0, 0, 0, 0]
... | ...
68 | [0, 0, 0, 0, ..., 0, 0, 1, 0]
69 | [0, 0, 0, 0, ..., 0, 0, 0, 1]

The above figure shows that most domains are between 10 and 40 in length. The maximum domain length was 156 for the training dataset and 51 for the test dataset. We set the fixed size to 100, approximately the average of the maximum values for the two datasets. Any part of a domain exceeding length 100 was removed, and shorter domains were padded with 0 so that every domain had a length of exactly 100. Furthermore, the one-hot encoding method, which takes the character dictionary size of 69 as the dimension of the vector, gives a value of 1 to the index of the character to be expressed and 0 to the other indices. To this, a vector consisting only of zero values was added to construct a total of 70 one-hot vectors, and a two-dimensional vector was created by substituting the vector corresponding to each index. Table 5b shows the one-hot vector corresponding to the index of the character dictionary. As a result, one domain had a 70 × 100 vector.

We used the generated 2D vector as input to TextCNN. Noise was eliminated and key features were extracted through two convolution layers and two max pooling layers. A rectified linear unit (ReLU), which outputs 0 for negative inputs, was used as the activation function between the convolution layer and the max pooling layer. We set the parameters of the convolution layers to 256 filters, with a first kernel size of 5 and a second of 3, and the rest to the default values. Then, the output was transformed into a one-dimensional vector with the flatten function, and, finally, 100 features were extracted through three dense layers. The dense layers also used ReLU as the activation function, and a dropout of 0.5 was applied to reduce overfitting and improve generalization error. We verified by applying a variety of hyper-parameters in order to set these optimal hyper-parameters.

The goal in this section was to select 100 features suitable for a 20 class classification. We analyzed the process of changing the size of the vector by adjusting the number of convolution layers and max pooling layers. The more convolution and pooling layers, the smaller the resulting vector size. We designed two convolution layers, each followed by a max pooling layer, to create a vector size appropriate for the domain size. Moreover, in order to select 100 features suitable for classifying 20 classes, we needed to set the number of dense layer nodes. We set the output layer to 20 and fixed 100 nodes on the layer in front of it to select 100 features. After that, we experimented by setting the number of dense layers and the number of nodes with various hyper-parameters. As the number of dense layers and nodes increased, the number of parameters increased and the network processing speed became slower. Considering various situations, we selected the optimal hyper-parameters for the best results. Table 6 shows the vector sizes varying through the proposed TextCNN. As a result, 100 features suitable for multi-class classification were extracted through the TextCNN that learned domain features.
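Putting the dictionary lookup, the truncation/zero-padding to length 100, and the 70-dimensional one-hot step together, a minimal sketch might look as follows. The exact 68-character set and its ordering are assumptions; here the 36 alphanumerics plus 32 punctuation characters stand in for them.

```python
import string
import numpy as np

# Assumed 68-character dictionary (26 letters + 10 digits + 32 punctuation
# marks); the paper does not list its exact character set or ordering.
CHARS = string.ascii_lowercase + string.digits + string.punctuation  # 68 chars
CHAR_INDEX = {c: i + 1 for i, c in enumerate(CHARS)}   # indices 1..68
UNK, MAX_LEN, DIM = 69, 100, 70   # 'UNK' index, fixed length, one-hot size

def to_indices(domain):
    """Map a domain to a fixed-length index vector (truncate/zero-pad to 100)."""
    idx = [CHAR_INDEX.get(c, UNK) for c in domain.lower()[:MAX_LEN]]
    return np.array(idx + [0] * (MAX_LEN - len(idx)))

def to_one_hot(indices):
    """Expand indices to the 70 x 100 matrix; index 0 stays an all-zero column."""
    m = np.zeros((DIM, MAX_LEN), dtype=np.float32)
    for pos, i in enumerate(indices):
        if i:
            m[i, pos] = 1.0
    return m

vec = to_indices("foreignsmell.ru")
print(vec[:16], to_one_hot(vec).shape)   # -> first indices, (70, 100)
```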

Table 6. TextCNN’s feature extraction process.

Layer | Output Shape
Input | [None, 100]
Embedding | [None, 100, 69]
Convolution1 | [None, 96, 256]
Max Pooling1 | [None, 48, 256]
Convolution2 | [None, 46, 256]
Max Pooling2 | [None, 23, 256]
Flatten | [None, 5888]
Dense1 | [None, 512]
Dense2 | [None, 512]
Dense3 | [None, 100]
Output | [None, 20]
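A Keras sketch reproducing the layer shapes in Table 6 is given below. The filter counts, kernel sizes, dropout of 0.5, and the 100-node feature layer come from the text; the pool size, optimizer, and loss are assumptions. Note that, per Table 6, the network consumes the length-100 index sequence directly and learns a 69-dimensional embedding rather than taking the fixed one-hot matrix as input.

```python
from tensorflow.keras import layers, models

def build_textcnn():
    """TextCNN sketch reproducing the output shapes in Table 6."""
    m = models.Sequential([
        layers.Embedding(input_dim=70, output_dim=69, input_length=100),
        layers.Conv1D(256, kernel_size=5, activation="relu"),  # [None, 96, 256]
        layers.MaxPooling1D(pool_size=2),                      # [None, 48, 256]
        layers.Conv1D(256, kernel_size=3, activation="relu"),  # [None, 46, 256]
        layers.MaxPooling1D(pool_size=2),                      # [None, 23, 256]
        layers.Flatten(),                                      # [None, 5888]
        layers.Dense(512, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(512, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(100, activation="relu"),    # the 100 refined features
        layers.Dense(20, activation="softmax"),  # 20-class output
    ])
    m.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
    return m
```

After training, the 100-dimensional feature vector for each domain can be read from the penultimate dense layer, e.g., `models.Model(m.input, m.layers[-2].output)`.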

3.2.2. Knowledge Based Feature Engineering

An IP address is required to connect to the network. However, because it is difficult to remember the IP address being accessed, DNS allows connection by domain name instead. The structure of the domain address basically consists of a combination of subdomain, domain name, SLD, and TLD. The subdomain and domain name are flexible because the user can register any character string. The SLD and TLD indicate the purpose or type of the domain and information on the country to which it belongs, and one entry from a fixed list is located at the end of the domain address. DGA-generated domain names are randomly generated using a combination of lowercase letters and numbers. Then, a domain format is created by appending one entry from the restricted SLD or TLD list, so a unique value can be used for the SLD or TLD. We found that most of the malicious domains created by DGA had a fixed length and that the number of dots was two or less. In addition, we expected that many vowels would appear in the malicious domain and that the entropy would be high. Based on this, the structured format of domain addresses was analyzed to extract 10 features. Here are the 10 features extracted from domain names (a computational sketch of features 1–7 appears after Figure 10 below):

1. Whether the letters and numbers were mixed: this feature is 1 if the domain name contains both letters and numbers and 0 otherwise.
2. Number count: this feature shows the count of numeric characters in the domain name.
3. Number of dots: this feature indicates the number of dots in the domain name.
4. Length: this feature indicates the length of the domain name.
5. Number of vowels: this feature represents the number of vowels in the domain name.
6. Vowel ratio: this feature represents the vowel rate of the domain name. The method for calculating the vowel ratio is shown in Equation (1).

V = \frac{\mathrm{Number\ of\ vowels}\,(\mathrm{domain\ name})}{\mathrm{Length}\,(\mathrm{domain\ name})} \qquad (1)

7. Information entropy: this feature shows the information entropy value of the domain name. Equation (2), which was used to obtain the information entropy from the domain name excluding the SLD and TLD, is as follows, where p_k denotes the relative frequency of the k-th distinct character:

H = -\sum_{k=0}^{K} p_k \log_2 p_k \qquad (2)

8. TLD: this feature shows the unique value for each TLD.
9. SLD: this feature shows the unique value for each SLD.
10. TLD type: this feature shows the unique value for each TLD type.

For features 1 to 7, we used characteristics that appear in the domain name itself. We could get a feature for whether the domain was a mixture of numbers and letters, and features for the number of numeric characters, the number of dots, and the length of the domain. Moreover, malicious domains are difficult to pronounce because they are randomly generated, so we obtained the vowel count and vowel ratio from the domain name. In addition, since the characters in malicious domains are randomly distributed, entropy values could be obtained to express this uncertainty. Figure 10 shows the entropy calculation pseudo-code. Table 7 shows the average of the extracted feature values in the malicious domains and the normal domains. The main differences between malicious and normal domains were the number count and length. This means that normal domains contained more numbers than malicious domains, and they were longer.

Figure 10. Information entropy calculation pseudo-code.
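Figure 10's pseudo-code is not reproduced in this text; a minimal Python equivalent of features 1–7, including the Equation (1) vowel ratio and the Equation (2) entropy (computed here over character frequencies, which is an assumption about what p_k denotes), might look like:

```python
import math
from collections import Counter

VOWELS = set("aeiou")

def knowledge_features(name):
    """Features 1-7 for one domain name (a sketch of the description
    above, not the authors' exact implementation)."""
    freq = Counter(name)
    probs = [n / len(name) for n in freq.values()]
    vowels = sum(ch in VOWELS for ch in name)
    return {
        "mixed": int(any(c.isalpha() for c in name) and
                     any(c.isdigit() for c in name)),       # feature 1
        "digit_count": sum(c.isdigit() for c in name),      # feature 2
        "dot_count": name.count("."),                       # feature 3
        "length": len(name),                                # feature 4
        "vowel_count": vowels,                              # feature 5
        "vowel_ratio": vowels / len(name),                  # Equation (1)
        "entropy": -sum(p * math.log2(p) for p in probs),   # Equation (2)
    }

print(knowledge_features("fbcfdlcnlaaakffb"))
```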

Table 7. Average of extracted feature values.

Feature No. | Contents | DGA Domain | Normal Domain
1 | Whether letters and numbers are mixed | 0.1671 | 0.3333
2 | Number count | 0.2395 | 3.6437
3 | Number of dots | 1.319 | 1.0446
4 | Length | 11.3472 | 15.8952
5 | Number of vowels | 3.9829 | 2.8458
6 | Vowel ratio | 0.3487 | 0.2041
7 | Information entropy | 2.7387 | 3.2068

For features 8 to 10, the characteristics of the SLD and TLD were used. We created a TLD dictionary for the training data to obtain a unique value for each TLD. Figure 11 shows a part of the table analyzing the relationship between TLD and DGA classes. We proved that the TLD used by each DGA was different. It can be seen that a specific TLD appeared for each DGA except for the normal domain with a class value of 0.

Figure 11. Example of distribution for top-level domain (TLD) used by each DGA.



The TLD feature assigned a number to each TLD using a pre-configured TLD list and used the number corresponding to the TLD of the domain. There were a total of 742 TLDs in the dataset, so the values ranged from 1 to 742. We used the value divided by 10 to eliminate the possibility that the value would overwhelm the learning. The SLD feature, like the TLD, used a pre-configured list to number the SLDs and used the number corresponding to the SLD of the domain. Since this index ranged from 1 to 729, we again used the value divided by 10. The TLD type feature divided the TLD type into gTLD, grTLD, sTLD, ccTLD, tTLD, new gTLD, and proposed gTLD, numbered 1 to 7, and used the number corresponding to the TLD type.

3.2.3. Classification

In Section 3.2.1, we extracted 100 features suitable for the 20 class classification using TextCNN. Moreover, in Section 3.2.2, we extracted 10 features of the domain knowledge base. We combined them to classify DGA and DGA classes using a total of 110 features and the LightGBM algorithm. LightGBM is one of the boosting models of the ensemble learning technique, which trains multiple models in machine learning and uses the prediction results of these models to predict better values than one model alone. Boosting is the concept of creating multiple trees (or other models) and gradually adding them to the existing models. LightGBM and eXtreme Gradient Boosting (XGBoost) are typical forms of gradient boosting. Gradient boosting is a method of training a new model to reduce the residual errors of the model trained before. In this paper, LightGBM was used to compensate for the shortcomings of XGBoost's hyper-parameters and slow learning time. The proposed model had a boosting round of 1000 and an output class number of 20. The rest of the parameters were set as default values.
LightGBM has the advantage of being able to process large amounts of data while using fewer resources than other algorithms. Figure 12 shows a performance comparison graph against other gradient boosting implementations [25]. LightGBM had the fastest convergence, and it does not waste resources because it has the ability to stop learning when the verification accuracy no longer improves.
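A sketch of this final classification stage is shown below. The 1000 boosting rounds and 20 output classes come from the paper; the dummy data standing in for the 110-feature matrix, the validation split, and the early-stopping patience are placeholders, and the callback API assumes a recent LightGBM version.

```python
import numpy as np
import lightgbm as lgb

# Placeholder data standing in for the 110 features
# (100 TextCNN features + 10 knowledge features); labels 0-19.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 110))
y = rng.integers(0, 20, size=1000)
X_train, X_valid = X[:800], X[800:]
y_train, y_valid = y[:800], y[800:]

params = {"objective": "multiclass", "num_class": 20, "verbosity": -1}
train_set = lgb.Dataset(X_train, label=y_train)
valid_set = lgb.Dataset(X_valid, label=y_valid, reference=train_set)

booster = lgb.train(params, train_set, num_boost_round=1000,
                    valid_sets=[valid_set],
                    callbacks=[lgb.early_stopping(stopping_rounds=50)])
pred_class = booster.predict(X_valid).argmax(axis=1)
print((pred_class == y_valid).mean())   # accuracy on the dummy split
```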

Figure 12. LightGBM performance comparison graph.

4. Experiment

4.1. Dataset

The proposed technique was validated for performance and results using a public dataset. This was the dataset provided by the Korea Internet & Security Agency (KISA) in 2019 [26]. The dataset consisted of domain, DGA, and class. In the DGA column, 0 is normal and 1 is malignant. The class column has a total of 20 values, which indicate the family of malicious domains.

The dataset was composed of 4 million pieces in total: 3.6 million pieces of data were used for training, and 400,000 pieces of data were used for testing. The training data consisted of 2.67 million malicious domains and 930,000 normal domains, and the test data consisted of 370,000 malicious domains and 30,000 normal domains. Table 8 shows examples of the dataset contents and the dataset configurations.

Table 8. (a) Dataset contents examples; (b) dataset configurations.

(a)

Domain | DGA | Class
fbcfdlcnlaaakffb.info | 1 | 11
firstbike.kr | 0 | 0
foreignsmell.ru | 1 | 16
booklog.kyobobook.co.kr | 0 | 0

(b)

| DGA Domain | Normal Domain
Train Data | 2,670,000 | 930,000
Test Data | 370,000 | 30,000

4.2. Analysis Environments

The PC used in the experiment had an Intel(R) Core(TM) i7-7700K CPU at 3.60 GHz and 64 GB RAM. The proposed model implemented TextCNN using Python 3.6 with version 2.1.5 of the Keras deep learning library, and used version 2.3.1 of the LightGBM library for Python. The time it took to analyze one domain was approximately 0.00076 s, i.e., about 1303 domains were analyzed per second. This was the time from the creation of the 110 features to classification using LightGBM. Table 9 shows the running time of the proposed model. A GPU was not used; if a GPU is used, the running time is shortened. Furthermore, if the focus is on running time rather than accuracy, classification can be performed with TextCNN alone.

Table 9. Running time of the proposed model.

Unit (Second) | TextCNN | Knowledge | LightGBM | Analysis per Domain (Second) | Analysis per Second (Domain Count)
Training | 2289 | 242 | 1796 | 0.0012019 | 832
Test | 254 | 29 | 24 | 0.0007675 | 1303

4.3. DGA Classification Results

This section shows the results of classifying whether a domain was malicious or normal. The proposed model's predictions were compared against the domains and DGA labels in the dataset. We needed to make sure that each component of the proposed model had a positive impact. Therefore, we compared four cases into which the proposed model was separated. Table 10 shows the results for the four cases. In the first case, DGA was classified with simple TextCNN, and in the second case, it was classified with LightGBM using the 100 features extracted through TextCNN. In the third case, the 10 features of the domain knowledge base were classified with LightGBM, and the last case was the proposed model, which classified with LightGBM using a total of 110 features, combining the 100 features extracted through TextCNN and the 10 features based on domain knowledge. Simple TextCNN achieved good accuracy, but the proposed model showed the best accuracy by adding the additional 10 features based on domain knowledge. At 99.192% accuracy, this was an increase of 0.372% compared to the case where the 10 features were not added. Therefore, the use of additional features had a slightly positive effect on model improvement. Figure 13 shows the ROC curve for the results of experiments with the proposed model, and Figure 14 shows the confusion matrix.

Table 10. Comparison result of evaluating the proposed model.

Type                                         Precision   Recall   F1-Score   Accuracy
TextCNN                                      98.719      99.0     99.36      98.82
TextCNN (100 Feature) + LightGBM             99.744      99.035   99.388     98.872
Domain knowledge (10 Feature) + LightGBM     99.253      95.988   97.593     95.621
Proposed Model                               99.681      99.445   99.563     99.192
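As an illustration of the last row (the proposed model), the sketch below concatenates the two feature blocks and trains a LightGBM binary classifier, then reports the four Table 10 metrics. The feature arrays are random placeholders for the 100 TextCNN activations and 10 knowledge features, and the hyperparameters are library defaults, since the paper does not state them.

```python
import numpy as np
import lightgbm as lgb
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score

rng = np.random.default_rng(0)
# Placeholders for the (N, 100) TextCNN features stacked with the
# (N, 10) domain knowledge features, giving 110 columns in total.
X_train, X_test = rng.normal(size=(2000, 110)), rng.normal(size=(400, 110))
y_train = rng.integers(0, 2, 2000)  # 0 = normal, 1 = DGA
y_test = rng.integers(0, 2, 400)

clf = lgb.LGBMClassifier()  # binary objective inferred from the labels
clf.fit(X_train, y_train)
pred = clf.predict(X_test)

for name, fn in [("Precision", precision_score), ("Recall", recall_score),
                 ("F1-Score", f1_score), ("Accuracy", accuracy_score)]:
    print(name, fn(y_test, pred))
```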

Figure 13. ROC curve for model evaluation.

Figure 14. Confusion matrix for model evaluation.

4.4. DGA Class Classification Results

This section presents the results of the DGA class classification. Using the domain and DGA class labels in the dataset, the proposed model classified each domain into one of 20 classes, one of which represented normal domains. To verify that each component of the proposed model contributed positively, the same four cases as in Section 4.3 were compared. Table 11 shows the results for the four cases. Simple TextCNN achieved good accuracy, but the proposed model achieved the best accuracy, 88.77%, by adding the 10 domain knowledge-based features: an increase of 0.835% over the case without them. The additional features therefore had a slightly positive effect on the model. Figure 15 shows the ROC curve for the experiments with the proposed model, and Figure 16 shows the confusion matrix. In addition, Figure 17 shows the top 10 features of the proposed model. Of the 100 features extracted through TextCNN, the 63rd was the most important; among the 10 domain knowledge features, the domain length, TLD, and entropy features ranked in the top 10.
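As an example of one of the 10 domain knowledge features, the character-level Shannon entropy of a domain string can be computed as below. This is a common formulation in the DGA literature; the paper's exact definition (for instance, whether the TLD is included or the value is normalized by length) may differ.

```python
import math
from collections import Counter

def shannon_entropy(s: str) -> float:
    # H(s) = -sum over characters c of p(c) * log2 p(c),
    # where p(c) is the character's relative frequency in s.
    counts = Counter(s)
    n = len(s)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

print(shannon_entropy("fbcfdlcnlaaakffb"))  # DGA-looking example from Table 8
print(shannon_entropy("firstbike"))         # normal example from Table 8
```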

Table 11. Comparison result of evaluating the proposed classification model.

Type                                         Precision   Recall   F1-Score   Accuracy
TextCNN                                      88.532      87.935   87.683     87.935
TextCNN (100 Feature) + LightGBM             88.61       88.392   88.219     88.392
Domain knowledge (10 Feature) + LightGBM     82.664      81.833   81.219     81.833
Proposed Model                               89.01       88.77    88.695     88.77
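A hedged sketch of the 20-class case follows, again over the 110 combined features. Table 11 does not state the averaging mode for precision, recall, and F1-score; weighted averaging is assumed here because it makes recall coincide with accuracy, as the table shows.

```python
import numpy as np
import lightgbm as lgb
from sklearn.metrics import precision_recall_fscore_support, accuracy_score

rng = np.random.default_rng(0)
# Random placeholders for the 110 combined features and the 20 class labels.
X_train, X_test = rng.normal(size=(4000, 110)), rng.normal(size=(800, 110))
y_train = rng.integers(0, 20, 4000)  # 20 classes; one class is "normal"
y_test = rng.integers(0, 20, 800)

clf = lgb.LGBMClassifier()  # multiclass objective inferred from the labels
clf.fit(X_train, y_train)
pred = clf.predict(X_test)

prec, rec, f1, _ = precision_recall_fscore_support(y_test, pred, average="weighted")
print(prec, rec, f1, accuracy_score(y_test, pred))  # weighted recall == accuracy
```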

Figure 15. ROC curve for classification model evaluation.

Figure 16. Confusion matrix for classification model evaluation.


Figure 17. Top 10 important features in the proposed model.
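The Figure 17 ranking can be obtained from LightGBM's built-in feature importances, as sketched below; the feature names are illustrative placeholders for the 100 TextCNN outputs followed by the 10 knowledge features (which include domain length, TLD, and entropy).

```python
import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(0)
X, y = rng.normal(size=(2000, 110)), rng.integers(0, 2, 2000)
clf = lgb.LGBMClassifier().fit(X, y)  # placeholder data, default parameters

# Columns 0-99: TextCNN features; columns 100-109: knowledge features.
names = [f"textcnn_{i}" for i in range(100)] + [f"knowledge_{i}" for i in range(10)]

importances = clf.feature_importances_  # split counts by default
for idx in np.argsort(importances)[::-1][:10]:
    print(names[idx], importances[idx])
```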

5. Conclusions

Intelligent malware is increasing, and so is the damage it causes. Most malicious codes operate by receiving an attacker's commands after infecting the system. Attackers use C&C servers to issue commands to malware, and the malware needs to know the IP address or domain address of the C&C server to communicate with it. If the malicious code carries the C&C server's IP address in hard-coded form, security equipment or law enforcement agencies can blacklist the IP address or domain address to prevent the malicious code from receiving C&C commands. However, attackers bypass this blocking technique by randomly generating domain addresses with a DGA to conceal the C&C server. This is very efficient for the attacker, because only the attacker knows how the DGA works: the domain is registered in advance, and the malware will connect to it at some point. In order to predict and block the domains generated by these DGAs in advance, this paper combined the TextCNN technique with additional features suited to the domain format to detect DGA-generated malicious domains and classify their DGA class. In the experiments, the proposed model achieved 99.19% accuracy for DGA classification and 88.77% accuracy for DGA class classification. We expect that DGA-generated domains can be detected and blocked effectively by applying the proposed technology.

Author Contributions: Conceptualization, H.K.; methodology, C.H.; software, H.K.; validation, C.H., H.K. and T.L.; formal analysis, C.H.; investigation, C.H.; resources, H.L.; data curation, C.H. and H.L.; writing—original draft preparation, H.K.; writing—review and editing, C.H.; visualization, C.H. and H.K.; supervision, T.L.; project administration, T.L.; funding acquisition, T.L. All authors have read and agreed to the published version of the manuscript.

Funding: This research project was supported by the Ministry of Culture, Sports and Tourism (MCST) and the Korea Copyright Commission in 2020 (No. 2019-PF-9500).

Conflicts1. of Interest:AV-TEST–TheThe authorsIndependent declare IT-Security no conflict Institute. of interest. Available online: https://www.av-test.org/en/ statistics/malware/ (accessed on 8 May 2020). References2. Antonakakis, M; Perdisci, R.; Nadji, Y.; Vasiloglou, N.; Abu-Nimeh, S.; Lee, W.; Dagon, D. From throw- Away traffic to bots: Detecting the rise of DGA-Based Malware. In Proceedings of the Twenty-first USENIX 1. AV-TEST–TheSecurity Symposium Independent (USENIX IT-Security Security Institute. 12), Bellevue, Available WA, USA, online: 8–10 August https: 2012;//www.av-test.org pp. 491–506. /en/statistics/ malware3. Park,/ (accessed J.W. Security on 8trend May analysis 2020). with DNS, Information Sharing Cyber Infringement Accident Seminar in 2. Antonakakis,Korea Internet M.; Perdisci, and Security R.; Nadji, Agen Y.;cy. Vasiloglou, Sept. 2017. N.; https://www.boho.or. Abu-Nimeh, S.; Lee,kr/data/reportView.do?bulletin_ W.; Dagon, D. From throw-Away writing_sequence=26711&queryString=cGFnZT00JnNvcnRfY29kZT0mc29ydF9jb2RlX25hbWU9JnNlYXJja traffic to bots: Detecting the rise of DGA-Based Malware. In Proceedings of the Twenty-First USENIX F9zb3J0PXRpdGxlX25hbWUmc2VhcmNoX3dvcmQ9 (accessed on 10 May 2020). Security4. Truong, Symposium D.; Cheng, (USENIX G. Detecting Security domain-flux 12), Bellevue, botnet based WA, on USA, DNS 8–10 traffic August features 2012; in managed pp. 491–506. network. 3. Park,Secur. J.W. Commun. Security Netw Trend. 2016, 9, 2338–2347. Analysis with DNS, Information Sharing Cyber Infringement Accident Seminar in Korea Internet and Security Agency. September 2017. Available online: https://www.boho.or.kr/data/reportView.do?bulletin_writing_sequence=26711&queryString=cGFnZT00JnN vcnRfY29kZT0mc29ydF9jb2RlX25hbWU9JnNlYXJjaF9zb3J0PXRpdGxlX25hbWUmc2VhcmNoX3dvcmQ9 (accessed on 10 May 2020). 4. Truong, D.; Cheng, G. Detecting domain-flux botnet based on DNS traffic features in managed network. Secur. Commun. Netw. 2016, 9, 2338–2347. [CrossRef] Electronics 2020, 9, 1070 18 of 18

5. Sharifnya, R.; Abadi, M. DFBotKiller: Domain-flux botnet detection based on the history of group activities and failures in DNS traffic. Digit. Investig. 2015, 12, 15–26. [CrossRef]
6. BBC News. Spam on Rise After Brief Reprieve. 2008. Available online: http://news.bbc.co.uk/2/hi/technology/7749835.stm (accessed on 13 May 2020).
7. Crawford, H.; Aycock, J. Kwyjibo: Automatic domain name generation. Softw. Pract. Exp. 2008, 38, 1561–1567. [CrossRef]
8. Liang, F.M. Word Hy-phen-a-tion by Com-put-er; Report No. STAN-CS-83-977; Stanford University, 1983. Available online: https://cds.cern.ch/record/151530/files/cer-000062763.pdf (accessed on 29 June 2020).
9. Tran, D.; Mac, H.; Tong, V. A LSTM based framework for handling multiclass imbalance in DGA botnet detection. Neurocomputing 2018, 275, 2401–2413. [CrossRef]
10. Chandrashekar, G.; Sahin, F. A survey on feature selection methods. Comput. Electr. Eng. 2014, 40, 16–28. [CrossRef]
11. Liou, C.Y.; Cheng, W.C.; Liou, J.W.; Liou, D.R. Autoencoder for words. Neurocomputing 2014, 139, 84–96. [CrossRef]
12. Wang, Y.; Yao, H.; Zhao, S. Auto-encoder based dimensionality reduction. Neurocomputing 2016, 184, 232–242. [CrossRef]
13. Kim, D.H.; Kim, K.S. DGA-DNS Similarity Analysis and APT Attack Detection Using N-gram. J. Korea Inst. Inf. Secur. Cryptol. 2018, 28, 1141–1151.
14. Fu, Y.; Yu, L.; Hambolu, O.; Ozcelik, I.; Husain, B.; Sun, J.; Sapra, K.; Du, D.; Beasley, C.T.; Brooks, R.R. Stealthy domain generation algorithms. IEEE Trans. Inf. Forensics Secur. 2017, 12, 1430–1443. [CrossRef]
15. Anderson, H.S.; Woodbridge, J.; Filar, B. DeepDGA: Adversarially-tuned domain generation and detection. In Proceedings of the 2016 ACM Workshop on Artificial Intelligence and Security; Association for Computing Machinery: New York, NY, USA, 2016.
16. Sood, A.K.; Zeadally, S. A taxonomy of domain-generation algorithms. IEEE Secur. Priv. 2016, 14, 46–53. [CrossRef]
17. Mowbray, M.; Hagen, J. Finding domain-generation algorithms by looking at length distribution. In Proceedings of the 2014 IEEE International Symposium on Software Reliability Engineering Workshops, Naples, Italy, 3–6 November 2014.
18. Chin, T.; Xiong, K.Q.; Hu, C.B.; Li, Y. A machine learning framework for studying domain generation algorithm (DGA)-based malware. In Proceedings of the International Conference on Security and Privacy in Communication Systems, Singapore, 8–10 August 2018.
19. Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780. [CrossRef] [PubMed]
20. Woodbridge, J.; Anderson, H.S.; Ahuja, A.; Grant, D. Predicting domain generation algorithms with long short-term memory networks. arXiv 2016, arXiv:1611.00791.
21. Qiao, Y.; Zhang, B.; Zhang, W.; Sangaiah, A.K.; Wu, H. DGA Domain Name Classification Method Based on Long Short-Term Memory with Attention Mechanism. Appl. Sci. 2019, 9, 4205. [CrossRef]
22. Yu, B.; Gray, D.L.; Pan, J.; De Cock, M.; Nascimento, A.C.A. Inline DGA detection with deep networks. In Proceedings of the 2017 IEEE International Conference on Data Mining Workshops (ICDMW), New Orleans, LA, USA, 18–21 November 2017; pp. 683–692.
23. Yu, B.; Pan, J.; Hu, J.; Nascimento, A.; De Cock, M. Character level based detection of DGA domain names. In Proceedings of the 2018 International Joint Conference on Neural Networks (IJCNN), Rio de Janeiro, Brazil, 8–13 July 2018; pp. 1–8.
24. Shimada, D.; Kotani, R.; Iyatomi, H. Document classification through image-based character embedding and wildcard training.
In Proceedings of the 2016 IEEE International Conference on Big Data (Big Data), Washington, DC, USA, 5–8 December 2016; pp. 3922–3927. 25. Go Lab, Apps & Machine. Available online: http://machinelearningkorea.com/2019/09/25/lightgbm%EC% 9D%98-%ED%95%B5%EC%8B%AC%EC%9D%B4%ED%95%B4/ (accessed on 25 May 2020). 26. KISA. K-Cyber Security Challenge 2019, AI-Based Malicious Domain Prediction Dataset. Available online: https://www.kisis.or.kr/kisis/subIndex/283.do (accessed on 3 December 2019).

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).