Effective DGA-Domain Detection and Classification with Textcnn
Total Page:16
File Type:pdf, Size:1020Kb
electronics Article Effective DGA-Domain Detection and Classification with TextCNN and Additional Features Chanwoong Hwang 1 , Hyosik Kim 1, Hooki Lee 2 and Taejin Lee 1,* 1 Department of Information Security, Hoseo University, Asan 31499, Korea; [email protected] (C.H.); [email protected] (H.K.) 2 Department of Cyber Security Engineering, Konyang University, Nonsan 32992, Korea; [email protected] * Correspondence: [email protected] Received: 8 June 2020; Accepted: 29 June 2020; Published: 30 June 2020 Abstract: Malicious codes, such as advanced persistent threat (APT) attacks, do not operate immediately after infecting the system, but after receiving commands from the attacker’s command and control (C&C) server. The system infected by the malicious code tries to communicate with the C&C server through the IP address or domain address of the C&C server. If the IP address or domain address is hard-coded inside the malicious code, it can analyze the malicious code to obtain the address and block access to the C&C server through security policy. In order to circumvent this address blocking technique, domain generation algorithms are included in the malware to dynamically generate domain addresses. The domain generation algorithm (DGA) generates domains randomly, so it is very difficult to identify and block malicious domains. Therefore, this paper effectively detects and classifies unknown DGA domains. We extract features that are effective for TextCNN-based label prediction, and add additional domain knowledge-based features to improve our model for detecting and classifying DGA-generated malicious domains. The proposed model achieved 99.19% accuracy for DGA classification and 88.77% accuracy for DGA class classification. We expect that the proposed model can be applied to effectively detect and block DGA-generated domains. Keywords: security; domain generation algorithm; TextCNN; domain features; classification 1. Introduction 1.1. Background Most cyberattacks use malicious codes, and according to AV-TEST, more than 1 billion malicious codes are expected to emerge in 2020 [1]. Unlike the recent distribution of malicious codes to a large number of unspecified people, advanced persistent threat (APT) attacks are attempted after targeting one target. The APT attacks are characterized by the fact that they do not stop the attack by producing dense and systematic security threats based on various IT technologies and attack methods until the successful intrusion inside. In addition, APT attacks are mainly targeted at government agencies or corporations, and they are difficult to detect because they are infiltrated into the system and continuously attacked. Figure1 shows the APT attack cycle. Electronics 2020, 9, 1070; doi:10.3390/electronics9071070 www.mdpi.com/journal/electronics Electronics 2020, 9, 1070 2 of 18 Electronics 2020, 9, 1070 2 of 18 FigureFigure 1.1. Advanced persistent threatthreat (APT)(APT) attackattack cycle.cycle. InIn thethe firstfirst step,step, thethe APTAPT attackerattacker infectsinfects thethe maliciousmalicious codecode throughthrough aa websitewebsite oror emailemail visitedvisited byby an internal internal user user to to create create an an internally internally infected infected PC. PC. Step Step 2 collects 2 collects infrastructure infrastructure information, information, such suchas the as organization’s the organization’s internal internal network. network. Step Step 3 invades 3 invades vulnerable vulnerable systems systems and and servers throughthrough accountaccount informationinformation thefttheft andand malwaremalware infections.infections. The fourth stage consists ofof stepssteps suchsuch asas internalinternal informationinformation leakageleakage and and system system destruction destruction under under the the command command of the of attacker. the attacker. Some Some malicious malicious codes, suchcodes, as such APT as attacks, APT attack operates, operate after receiving after receiving commands commands from a from remote a remote command command and control and control (C&C) server(C&C) afterserver being after installed being installed on thedevice. on the device. Botnet isBotn a collectionet is a collection of malware-infected of malware-infected machines machines or bots. Itor has bots. become It has thebecome main the means main for means cyber-criminals for cyber-cr toiminals send spam to send mails, spam steal mails, personal steal data, personal and launch data, distributedand launch denial distributed of service denial attacks of service [2]. Most attacks bots today[2]. Most rely bots on a domaintoday rely generation on a domain algorithm generation (DGA) toalgorithm generate (DGA) a list of to candidate generate a domain list of candidate names in thedomain attempt names to connect in the attempt with the to so-called connect C&Cwith the server. so- calledAccording C&C server. to Nominum’s domain name system (DNS) security trend analysis report [3], more than 90% ofAccording attacks by to maliciousNominum’s codes domain are using name a system domain (DNS) name security system (DNS).trend analysis A DNS-generated report [3], more DNS isthan especially 90% of usedattacks to by infect malicious/update codes new hostsare usin andg communicatea domain name with system C&C (DNS). servers. A ToDNS-generated perform the desiredDNS is actionespecially by the used attacker, to infect/update the C&C server new host is useds and to communicate communicate with in real C&C time servers. with the To malicious perform code.the desired In order action for the by malicious the attacker, code tothe receive C&C commandsserver is used from to the communicate C&C server, itin must real firsttime connect with the to themalicious server. code. For communication In order for the on malicious the internet, code the to client receive PC needscommands to know from the the IP addressC&C server, of the it server must infirst order connect for the to clientthe server. to access For the communication server. Since accessed on the internet, in this way, the the client malicious PC needs code to must know have the the IP IPaddress address of orthe domain server addressin order of for the the C&C client server to inside.access Ifthe the server. malicious Since code accessed hides thein this IP address way, the or domainmalicious address code must of the have C&C the server IP address by hard or domain coding, securityaddress of equipment the C&C server or law inside. enforcement If the malicious agencies cancode block hides the the IP IP address address or or the domain domain address address of to th prevente C&C theserver malware-infected by hard coding, device security from equipment accessing theor law C&C enforcement server. However, agencies malware can block creators the IP use addr DGAess or to circumventthe domain theseaddress access to prevent blocking the techniques. malware- infectedAttackers device usefrom DGA accessing for the the purpose C&C server. of hiding However, C&C malware servers. creators DGA is use supposed DGA to to circumvent randomly generatethese access numerous blocking domain techniques. addresses every day. Since DGA is able to predict the domain address generatedAttackers on a use specific DGA date, for the an attackerpurpose operating of hiding a C&C C&C servers. server must DGA register is supposed one of to the randomly domain addressesgenerate numerous that can bedomain generated addresses on a specificevery day. date Si throughnce DGA appropriate is able to predict registration the domain procedures address in advance.generated This on a is specific called IPdate,/DNS an fast attacker flux. Thisoperating way, ana C&C attacker server can must change register the mapping one of the of DNSdomain to IPaddresses every 10 that s. However, can be generated as many on studies a specific related date to fastthrough flux detectionappropriate have regi beenstration conducted, procedures access in controladvance. policies, This is suchcalled as IP/DNS blacklist fast management flux. This way, for anomaly an attacker DNS, can can change be established the mapping [4,5]. of As DNS a result, to IP attackersevery 10 haves. However, developed as DGAmany and studies changed related DNS to mapping fast fluxwith detection C&C servershave been in short conducted, time intervals, access neutralizingcontrol policies, the existingsuch as blacklist management policy and making for anomaly C&C server DNS,blocking can be established very difficult [4,5]. until As recently.a result, attackers have developed DGA and changed DNS mapping with C&C servers in short time intervals, 1.2. Domain Generation Algorithm neutralizing the existing blacklist policy and making C&C server blocking very difficult until recently. DGA generates a random string by inputting a predetermined seed value and combines second-level1.2. Domain Generation domain (SLD) Algorithm and top-level domain (TLD) to generate a domain address. The DGA generates millions of domains, but the attacker can predict which domain the DGA will create because DGA generates a random string by inputting a predetermined seed value and combines second- it uses time series data, such as time or exchange rates, that the attacker can know at the same time as a level domain (SLD) and top-level domain (TLD) to generate a domain address. The DGA generates millions of domains, but the attacker can predict which domain the DGA will create because it uses ElectronicsElectronics 20202020,, 99,, 10701070 3 of 18 time series data, such as time or exchange rates, that