Real-Time Detection of Dictionary DGA Network Traffic Using Deep
Total Page:16
File Type:pdf, Size:1020Kb
SN Computer Science (2021) 2:110 https://doi.org/10.1007/s42979-021-00507-w ORIGINAL RESEARCH Real‑Time Detection of Dictionary DGA Network Trafc Using Deep Learning Kate Highnam1 · Domenic Puzio2 · Song Luo3 · Nicholas R. Jennings1 Received: 1 August 2020 / Accepted: 6 February 2021 / Published online: 22 February 2021 © The Author(s) 2021 Abstract Botnets and malware continue to avoid detection by static rule engines when using domain generation algorithms (DGAs) for callouts to unique, dynamically generated web addresses. Common DGA detection techniques fail to reliably detect DGA variants that combine random dictionary words to create domain names that closely mirror legitimate domains. To combat this, we created a novel hybrid neural network, Bilbo the “bagging” model, that analyses domains and scores the likelihood they are generated by such algorithms and therefore are potentially malicious. Bilbo is the frst parallel usage of a convolutional neural network (CNN) and a long short-term memory (LSTM) network for DGA detection. Our unique F architecture is found to be the most consistent in performance in terms of AUC, 1 score, and accuracy when generalising across diferent dictionary DGA classifcation tasks compared to current state-of-the-art deep learning architectures. We validate using reverse-engineered dictionary DGA domains and detail our real-time implementation strategy for scoring real-world network logs within a large enterprise. In 4 h of actual network trafc, the model discovered at least fve potential command-and-control networks that commercial vendor tools did not fag. Keywords Domain generation algorithm · Deep learning · Malware · Botnets · Network security · Neural networks Introduction device. Some malware reaches out to a command and con- trol (C&C) centre hosted behind domains generated by an Malware continues to pose a serious threat to individuals algorithm (DGA domains) after it infltrates the target sys- and corporations alike [1]. Typical attack methods such as tem to receive further instructions. Identifcation of such viruses, phishing emails, and worms attempt to retrieve pri- domains in network trafc allows for the detection of mal- vate user data, destroy systems, or start unwanted programs. ware-infected machines. The majority of these attacks may be launched through the A single active DGA has been seen generating up to a network [2], posing a major threat to any Internet-facing few hundred domains per day [1]. At scale within a com- pany, this is infeasible for a human analyst to triage amidst This work was done while working at Capital One in 2017–18 and the thousands of benign domains occurring simultaneously. is based on public talks we gave in 2018. Automated detection systems are developing but the sight- ings of DGAs in worms, botnets, and other malicious set- * Kate Highnam tings is growing [3]. [email protected] In addition, malware that employs DGAs intentionally Domenic Puzio obfuscates its network communication by utilising ran- [email protected] dom seeds when generating their domains [1–5]. Most Song Luo known DGAs combine randomly-selected characters like [email protected] Nicholas R. Jennings [email protected] 1 Imperial College London, London, UK 2 Kensho Technologies, McLean, VA, USA 3 Tencent, Shenzhen, China SN Computer Science Vol.:(0123456789) 110 Page 2 of 17 SN Computer Science (2021) 2:110 Table 1 Examples of domains from our training data, comprised of testing sets but their performance can sufer when attempt- domains from the Alexa Top 1 Million list and domains generated by ing to generalise to new DGA families or new versions of dictionary-based DGA families (discovered through reverse engineer- ing) from DGArchive previously seen families. Against this background, we present a novel deep learning Legitimate Malicious model for dictionary DGA detection. This advances the state microsoft lookhurt of the art in the following ways. First, we present the frst linkedin threetold usage of parallel CNN and LSTM hybrid for DGA detec- paypal threewear tion, specifcally applied to dictionary DGA detection. The steamcommunity pielivingbytes model is trained on standard large-scale datasets of reverse- dailymotion awardsbookcasio engineered dictionary DGA domains. It achieves the most stackoverfow blanketcontent consistent success at dictionary DGA classifcation amongst facebook degreeblindagent state-of-the-art deep learning architectures for classifcation, soundcloud mistakelivegarage generalisability, and time-based resiliency. Second, we detail our insights into dictionary DGA domains’ inter-relation- ships and their efect on generalisability of models as an “myypqmvzkgnrf[.]com”, “otopshphtnhml[.]net”, and outcome. Third, we validate our model on live network traf- “uqhucsontf[.]com”1. fc in a large fnancial institution. In 4 h of logs, it discovered However, DGAs that combine random words from a dic- fve potential C&C networks that commercial vendor tools tionary like “milkdustbadliterally[.]com”, “couragenearest[.] did not fag. Finally, we detail our scalable implementation net”, and “boredlaptopattorney[.]ru” [6] are meaningfully strategy within the security context of a corporation for real- harder for humans to detect (see Table 1 for comparison). time analysis. In this paper, we will refer to this type of DGA as a diction- ary DGA and focus on those using dictionaries composed of English words. Background Common defences against malicious DGA domains include blacklists [7, 8], random forest classifers [9–11], An ever-growing number of malware rely on communication and clustering techniques [12, 13]. When the lists are well with C&C channels to receive instructions and system-spe- maintained and the features are chosen carefully, these meth- cifc code [1]. The destination (domain or IP address) of this ods have acceptable efcacy. However, both blacklists and channel can be hard-coded in the malware itself, making its these models possess serious limitations: relying on hand- location discoverable via reverse engineering or straightfor- picked features which are time-consuming to develop, lack- ward log aggregation techniques. Once known, this domain ing the ability to generalise with the few manual features or IP address can be blacklisted, rendering the malware inert. implemented, and requiring continuous expert maintenance. To avoid this single point of failure, malware authors employ More comprehensive tactics are necessary to detect inces- domain fuxing, in which the destination of the C&C com- sant new DGAs stemming from network-based malware. munication changes systematically as the attacker registers Recent innovations using deep learning have state-of-the- new domains to the C&C hub. art accuracy on DGA detection. Such models are highly fex- The key to malware domain fuxing is the use of unique ible with the proven success in complex language problems. and likely unregistered domains that are known to the They do not require hand-crafted features that are time- attacker but can blend in to regular trafc. To accomplish intensive to make and easy to evade. Woodbridge et al. [11] this, malware families employ domain generation algorithms were the frst to present a long short-term memory (LSTM) (DGAs) to create pseudo-random domains for use in com- network for DGA classifcation. Other architectures were munication. These domains are used for short periods of later applied, such as further variations on an LSTM [10, time and then phased out for newly-generated domains; this 14–17], a convolutional neural network (CNN) [18, 19], quick turnover means that manual techniques are not efec- and a hybrid CNN-LSTM model [20]. Although success- tive. Additionally, reverse engineering these algorithms may ful for random-character DGA domains, these classifers be slow or impossible if the malware is encrypted. For the have largely been inefective in identifying dictionary DGA vast majority of malware samples, trafc related to malicious domains. These models also perform well on their various activity is present in networks weeks or months before the malware is analysed and blacklisted [7]. To prevent DGA-based malware from exfltrating, disa- 1 bling, or tampering with assets, institutions must detect For the rest of this paper, all domain URLs will be referred to with [.] to prevent automatically assigning these domains as real URLs one malicious trafc as soon as possible. Throughout this paper, might click. we will discuss our solution while keeping in mind that it SN Computer Science SN Computer Science (2021) 2:110 Page 3 of 17 110 must be practical, operating in real-time, enriching contex- Much of prior DGA research has involved making tual data within in true threat environments. lookups into historical or related domain name server (DNS) records. Such methods often rely on signals attained from Non-Existent Domain (NXDomain) responses when Domain Generation Algorithms (DGAs) unregistered domains are queried. Since DGAs often gener- ate hundreds of domains per day and at most only a few of DGA usage spans a variety of cases, from benign resource those domains are actually registered by the attacker, large generation to phishing campaigns and the management of numbers of these requests result in NXDomains. Many botnets, groups of machines that have been infected by mal- NXDomain responses from the same computer are unlikely ware, such as Kraken [21], Confcker [22, 23], Murofet [24], to result from expected user behaviour, and thus this pattern and others [25]. The goal of all DGAs is to generate domains of DNS trafc can be associated with DGA activity [12, that do not already exist and, for malicious cases, will not be 25, 30]. fagged by vendor security tools or analysts. To accomplish However, such queries within high-volume DNS log data this, DGA authors typically use either character-based or can be prohibitively slow and unsuitable for real-time deci- dictionary-based pseudo-random assembly process to form sion-making needed to reduce the risk of compromise. It is domains. for this reason that our model considers limited data, only Each method has benefts and downfalls.