Identifying and Characterizing Bashlite and C&C Servers

Gabriel Bastos∗, Artur Marzano∗, Osvaldo Fonseca∗, Elverton Fazzion∗†, Cristine Hoepers‡, Klaus Steding-Jessen‡, Marcelo H. P. C. Chaves‡, Ítalo Cunha∗, Dorgival Guedes∗, Wagner Meira Jr.∗
∗Department of Computer Science – Universidade Federal de Minas Gerais (UFMG)
†Department of Computing – Universidade Federal de São João del-Rei (UFSJ)
‡CERT.br - Brazilian National Computer Emergency Response Team, NIC.br - Brazilian Network Information Center

Abstract: IoT devices are often a vector for assembling massive botnets, as a consequence of being broadly available, having limited security protections, and posing significant challenges in deploying software upgrades. Such botnets are usually controlled by centralized Command-and-Control (C&C) servers, which need to be identified and taken down to mitigate threats. In this paper we propose a framework to infer C&C server IP addresses using four heuristics. Our heuristics employ static and dynamic analysis to automatically extract information from binaries. We use active measurements to validate inferences, and demonstrate the efficacy of our framework by identifying and characterizing C&C servers for 62% of 1050 malware binaries collected using 47 honeypots.

I. INTRODUCTION

The Internet of Things (IoT) is the network of physical devices connected through the Internet, such as security cameras and vehicular systems. The use of IoT devices in different applications creates opportunities for economic and technological development in different sectors of society. However, the minimalist design of most of those devices, constrained by market competition among vendors, compromises security and leads to vulnerabilities. This problem is aggravated by the nature of embedded software and the challenges in applying updates. Malicious agents exploit vulnerabilities in IoT devices to infect them and create botnets [1], [2]. Although the computational power of each infected device (bot) is small, an IoT botnet may coordinate thousands of bots to successfully perform malicious activities, such as massive distributed denial of service (DDoS) attacks.

DDoS attacks have increased in frequency and intensity, with services being attacked daily and some attacks generating traffic on the order of 1 Tbps [3]. The total losses for the attacked companies are on the order of billions of dollars, since those attacks exhaust resources such as processing and bandwidth, even of well-provisioned services, causing unavailability [4]. The large number and topological distribution of infected devices allow IoT botnets to perform massive attacks that are difficult to mitigate.

Two families of IoT botnets, Bashlite and Mirai, have recently gained notoriety after being used to perform DDoS attacks of 400 Gbps and 1 Tbps, respectively [1], [3], with considerable impact on large services (e.g., DynDNS). The source code for both malwares is available on the Internet, which contributes to the rise of a large number of variants. Variants exploit other vulnerabilities, include new forms of attack, and use different mechanisms to circumvent existing forms of defense. The increasing number of variants of these malwares makes the analysis and reverse engineering process expensive, hindering the development of countermeasures and mitigations. To address the growing number of threats, security analysts and researchers need tools to automate the collection, analysis, and reverse engineering of malware.

The C&C server is a core component of botnets, responsible for coordinating its bots [5]. Considering that taking down a C&C makes the botnet innocuous, security analysts dedicate efforts to identify C&C servers, while malware developers implement mechanisms to complicate such identification. As an example of such mechanisms, malware developers add code to obfuscate the identity of C&Cs or to avoid contacting them when executing in sandbox environments.

In this work we extend existing tools to perform automated analysis of the Bashlite and Mirai IoT malware families to identify C&C servers. We improve Detux, a sandbox for malware evaluation, with isolation mechanisms to prevent malware executions from interfering with each other or the Internet. We propose four different heuristics that employ static and dynamic analysis to infer C&C server candidates. We develop active clients that connect to the inferred C&C candidates and exchange messages using the Bashlite and Mirai protocols to validate inferences.

We also propose a technique to identify variants of each malware family. The challenge is in identifying similar binaries even though they contain no metadata. Our technique uses Radare2, a framework with tools for reverse engineering binary executables, to identify and extract functions from binaries. We then use a fuzzy hash to compare similarity among functions, present a metric to calculate the distance between binaries, and execute a hierarchical clustering algorithm to group similar malware variants.

Our study analyzes IoT malware collected by 47 low-interactivity honeypots distributed across 15 Brazilian states. We present static analysis results for 25,183 binaries collected between Jan. 2017 and Dec. 2018, showing the evolution of anti-analysis mechanisms used by Bashlite and Mirai variants. We present dynamic analysis results for 1,050 IoT malware binaries, collected between 2018-11-27 and 2018-12-26, and between 2019-01-30 and 2019-02-28, showing that we are able to infer C&C candidates for 96% and validate inferences for 62% of the binaries. We show that our heuristics detect 29% more C&C servers when compared to a baseline approach based on static analysis alone. Finally, the proposed binary clustering technique identifies different Bashlite and Mirai variants, reducing by 47.8% the number of binaries that security analysts need to analyze in an effort to understand malware functionality.

The framework proposed in this article automates the identification of C&C servers, improving the mitigation of botnets and reducing their impact. In addition, the clustering technique reduces the number of binaries that security analysts need to analyze. We believe these contributions are an additional step in mitigating those threats.

II. MONITORING INFRASTRUCTURE AND DATASET

The malware dataset used in this work was gathered by a passive data collection infrastructure, depicted in Fig. 1, which monitors infection attempts. This infrastructure is composed of 47 low-interactivity honeypots, emulating SSH and telnet services with known default credentials of IoT devices, commonly abused by Bashlite and Mirai malware.

The honeypots collect the authentication credentials used to log in, and the subsequent sequence of commands issued during the infection attempt. The commands are not executed by the honeypots; all replies from the honeypots are computed by interpreting the commands. This is possible because the infection process is automated and uses a limited, known set of programs commonly installed in IoT devices.

The commands issued during infection are logged and processed by a (different) server that identifies malware download attempts (e.g., calls to wget, curl and ftp). The malware is then downloaded into a database for analysis. Our dataset is composed of 31,291 malware samples, downloaded between Jan. 2017 and Feb. 2019, of which 28,103 are ELF binaries. Among the ELF binaries, 29% are MIPS32, a common CPU architecture used in IoT devices.

In this work we have extended the monitoring infrastructure with a new set of tools for automatic malware analysis, capable of inferring and validating C&C servers (blocks with solid border in Fig. 1), which we describe next.

Figure 1: Malware collection and analysis infrastructure.

III. C&C SERVER IDENTIFICATION

To identify C&C servers, we start by performing a specialized static code analysis, which allows us to detect and avoid countermeasures employed by malwares to deceive dynamic analysis (§III-A). Then, we execute the malware in the Detux sandbox, which monitors its behavior in a controlled and realistic IoT environment (§III-B). Next, we use four heuristics based on the dynamic analysis of the network traffic generated by each binary to infer IP addresses that probably host C&C servers (§III-C). Finally, we execute tools to validate the inferred IP addresses and identify real C&C servers (§III-D).

A. Static analysis of malwares

The first part of the static analysis consists of ELF data extraction performed by Detux. Static analysis extracts data such as file and program headers, sections, debug symbols, and strings from a binary. We use the symbol table to identify binaries compiled with and without debugging information (non-stripped and stripped, respectively). In addition, we store strings that correspond to IP addresses contained in each binary. This information is used in the following steps of our analysis.

The second part of the static analysis aims to overcome mechanisms used by the attackers to deceive the dynamic analysis. We observed the absence of DNS requests in several executions of Mirai malwares, an atypical behavior for this malware family, which normally identifies its C&Cs by domain names. Through manual inspection of the source code, we identified the presence of a malware activation mechanism which verifies whether the name of the executable (argv[0]) matches a predefined value in the code (activation key). Binaries containing this activation mechanism behave as expected (by performing network scanning and by contacting the C&C server) only when the verification is successful. Otherwise, the binary runs an alternative routine that contacts a fake C&C server, preventing identification of the real C&C and potentially directing mitigation efforts to the wrong target.

To hinder recovery of the activation key, the bytes that compose it are encoded. Let $B_1, B_2, \ldots, B_N$ be the bytes representing the activation key, in which $B_N$ is the null byte (0x00) that ends the string. The order of these bytes is exchanged for each pair of consecutive bytes and a byte with zero value is inserted between every pair. Thus, the activation key would be encoded in the binary as $B_2 \,|\, B_1 \,|\, \texttt{0x00} \,|\, B_4 \,|\, B_3 \,|\, \texttt{0x00} \,|\, \ldots \,|\, \texttt{0x00} \,|\, B_{N-1} \,|\, \texttt{0x00}$.

Besides the encoding of the activation key, its position in the binary may vary due to modifications made to different variants. Before running each binary, we look into the read-only data section (rodata) for the encoding pattern described above. We consider only activation keys with at least 3 ASCII characters starting with the “./” characters (for running the binary in the current directory).¹

¹ We also evaluated more generic activation key definitions; no difference was observed in the results. We also noticed that the activation keys in our sample always start with “./” characters.
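The activation key encoding of Section III-A can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`encode_key`, `decode_key`, `find_keys`) and the example key `./bot` are hypothetical, and the sketch assumes that a key of odd total length is padded with one extra null byte before pairing, a detail the text does not state.

```python
def encode_key(key: bytes) -> bytes:
    """Swap every pair of consecutive bytes and append 0x00 after each pair."""
    data = key + b"\x00"            # B_N is the null byte that ends the string
    if len(data) % 2:               # assumption: pad odd lengths with a null byte
        data += b"\x00"
    out = bytearray()
    for i in range(0, len(data), 2):
        out += bytes([data[i + 1], data[i], 0x00])
    return bytes(out)


def decode_key(blob: bytes) -> bytes:
    """Invert encode_key by reading (B2, B1, 0x00) triples up to the key's null byte."""
    out = bytearray()
    for i in range(0, len(blob) - 2, 3):
        second, first, separator = blob[i], blob[i + 1], blob[i + 2]
        if separator != 0x00:       # pattern broken: stop decoding
            break
        out += bytes([first, second])
        if first == 0x00 or second == 0x00:   # reached the terminating null
            break
    return bytes(out).split(b"\x00", 1)[0]


def find_keys(rodata: bytes, min_len: int = 3) -> list[bytes]:
    """Scan a read-only data blob for encoded keys that start with './'."""
    keys = []
    start = rodata.find(b"/.\x00")  # './' with its two bytes swapped, plus separator
    while start != -1:
        key = decode_key(rodata[start:])
        if len(key) >= min_len and all(0x20 <= c < 0x7F for c in key):
            keys.append(key)
        start = rodata.find(b"/.\x00", start + 1)
    return keys
```

For instance, under these assumptions `encode_key(b"./abc")` yields the bytes `/.`, 0x00, `ba`, 0x00, 0x00, `c`, 0x00, and `find_keys` recovers `./abc` from a blob that embeds them.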

B. Detux dynamic analysis

Detux is a tool that runs the malware on a QEMU virtual machine running the Debian MIPS operating system, simulating a typical IoT environment while providing isolation and control. It monitors and collects data during the execution of a binary; in particular, Detux captures all network traffic and generates a report including information such as connections attempted (including IP addresses and ports) and DNS requests issued.

We adopt a set of protective measures to prevent malware execution from impacting the Internet. We limit the execution of each malware sample to 90 seconds, and we rate-limit the generated traffic to 10 KB/s. This prevents the binaries from contributing significantly to malicious activities (e.g., DDoS attacks). In addition, we block connections to ports 23 and 2323, which are used to scan for vulnerable devices, so as not to contribute to the discovery and infection of such devices.

We implemented our activation key detection mechanism (§III-A) as a preprocessing step in Detux, and extended the tool to execute the binary with each of the possible activation keys identified in the preprocessing step. This is necessary to get the malware to behave in the same way as it would in a real infection.

We also find that the consecutive execution of malwares in Detux leads to contaminated reports. In particular, retransmitted TCP packets belonging to a connection from a previous execution may be captured as traffic from a later execution. We apply the following method to filter out TCP packets that do not belong to the current execution. During the execution of each malware sample, we create and manage a connection database with flow identifiers (5-tuple) of each attempted connection. We identify a connection attempt by observing outgoing TCP packets with the SYN flag enabled. Every received TCP packet that does not have a corresponding entry in the connection database of the current execution is ignored in our analyses.

C. C&C server inference heuristics

While looking for C&C addresses, we observe connection attempts to a large number of IP addresses during dynamic analyses in Detux as a result of the network scans performed by Bashlite and Mirai malware. Connecting to all addresses in search of C&Cs is undesirable (if not infeasible) due to the possible negative impact of contacting legitimate devices on the Internet. For this reason, we propose four heuristics to identify which IP addresses are more likely to be C&C servers before contacting any devices (§III-D).

C&C contacted on specific port (PORT). We infer as potential C&C servers the IP addresses that are connection targets in fewer than five attempts during the execution of a binary. This heuristic is based on the knowledge that malwares typically create connections either to scan the network for vulnerable devices or to contact the C&C server (or other devices in the botnet, like malware servers) [5], [6]. Since scans perform a large number of connection attempts to well-known service ports (e.g., SSH and telnet), ports with few connection attempts are likely not to be associated with the scanning process.

C&C IP address string in malware (IP). We infer as potential C&C servers the IP addresses that were hardcoded in a binary and were contacted at least once during its execution.

C&C and malware servers (URL). We infer as potential C&C servers the IP addresses and domain names that are parameters in commands used to download malware (e.g., wget and curl) found in the strings of a malware binary. Such commands typically indicate access to a malware server. This heuristic allows the discovery of C&Cs hosted at the same address as malware servers.

C&C contacted by domain name (DNS). We infer as potential C&C servers the IP addresses associated with DNS requests issued by the malware during its execution. Some malware variants identify the C&C server by a domain name, allowing the IP address to be updated in case the original server is taken down.
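The PORT and IP heuristics of Section III-C can be sketched in a few lines of Python. This is only one possible reading of the text: the sketch assumes that connection attempts are counted per destination port (so that heavily scanned ports such as 23 are excluded), and the function names and the `(ip, port)` input format are hypothetical, chosen for illustration.

```python
from collections import Counter


def port_heuristic(attempts, max_attempts=5):
    """PORT: keep IPs contacted on ports that saw fewer than max_attempts SYNs.

    attempts is a list of (dst_ip, dst_port) tuples, one per outgoing SYN
    observed during the execution of a single binary (assumed input format).
    """
    per_port = Counter(port for _, port in attempts)
    return {ip for ip, port in attempts if per_port[port] < max_attempts}


def ip_heuristic(attempts, binary_strings):
    """IP: keep addresses hardcoded in the binary and contacted at least once."""
    contacted = {ip for ip, _ in attempts}
    return contacted & set(binary_strings)


def candidates(attempts, binary_strings):
    """Union of both heuristics, giving the candidate set to validate later."""
    return port_heuristic(attempts) | ip_heuristic(attempts, binary_strings)
```

With twenty scan attempts to port 23 and two attempts to a single address on another port, only that single address is returned by `port_heuristic`; the addresses in the usage below come from documentation ranges and are not from the dataset.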
C&C server inference heuristics validation2. If a successful handshake is observed in the packet While looking for C&C addresses, we observe connection capture or completed in the active validation, the C&C is attempts to a large number IP addresses during dynamic validated as Mirai. analyses in Detux as a result of the network scans performed IV. RESULTS by Bashlite and Mirai malware. Connecting to all addresses to search of C&Cs is undesirable (if not unfeasible) due to Our results show the efficacy of the proposed C&C identifi- the possible negative impact or contacting legitimate devices cation and validation techniques. We first characterize the evo- on the Internet. For this reason, we propose four heuristics lution of static properties of malware in our dataset (§IV-A). to identify which IP addresses are more likely to be C&Cs We next evaluate the contribution of each dynamic analysis servers before contacting any devices (Section III-D) heuristic towards C&C identification (§IV-B). Third, we ana- lyze the success rate of C&C validation in our dataset (§IV-C). C&C contacted on specific port (PORT). We infer as Finally, we characterize the validated C&Cs to provide insights potential C&C server the IP addresses that are connection on botnet operator behavior and mitigation techniques (§IV-E). targets in less than five attempts during the execution of a binary. This heuristic is based on the knowledge that malwares A. Malware static analysis typically create connections either to scan the network for We run our static analysis on all 25.183 binaries in our vulnerable devices or to contact the C&C server (or other dataset (§II). We observe that 22.7% of binaries do not have devices in the botnet, like malware servers) [5], [6]. 
Since scans perform a large number of connection attempts to well- 2The Mirai handshake is a 4-byte request with the value 1 (0x00000001) known service ports (e.g., SSH and telnet), ports with few and a 2-byte response with value zero (0x0000).
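The Mirai handshake check described in footnote 2 can be sketched as follows. This is a minimal illustration rather than the validation client used in the paper: the names `MIRAI_REQUEST`, `is_mirai_reply`, and `probe` are hypothetical, the request is assumed to be sent in network (big-endian) byte order, and the five-second timeout is an arbitrary choice.

```python
import socket
import struct

MIRAI_REQUEST = struct.pack(">I", 1)   # 4-byte request with value 1 (0x00000001)
MIRAI_REPLY = b"\x00\x00"              # 2-byte response with value zero (0x0000)


def is_mirai_reply(data: bytes) -> bool:
    """Return True if the first two received bytes match the expected reply."""
    return data[:2] == MIRAI_REPLY


def probe(host: str, port: int, timeout: float = 5.0) -> bool:
    """Attempt the handshake against one candidate; False on any network error."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(MIRAI_REQUEST)
            return is_mirai_reply(sock.recv(2))
    except OSError:
        return False
```

The same `is_mirai_reply` check can be applied to payloads extracted from a packet capture, which corresponds to the passive validation approach. A single `recv` call may return fewer than two bytes, in which case this sketch conservatively reports failure.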

IV. RESULTS

Our results show the efficacy of the proposed C&C identification and validation techniques. We first characterize the evolution of static properties of malware in our dataset (§IV-A). We next evaluate the contribution of each dynamic analysis heuristic towards C&C identification (§IV-B). Third, we analyze the success rate of C&C validation in our dataset (§IV-C). Finally, we characterize the validated C&Cs to provide insights on botnet operator behavior and mitigation techniques (§IV-E).

A. Malware static analysis

We run our static analysis on all 25,183 binaries in our dataset (§II). We observe that 22.7% of binaries do not have debug symbols and 18% of them are compressed using UPX, requiring decompression before static analysis. Finally, 2.7% of binaries employ the argv[0] activation mechanism.

Figure 2: Evolution of static properties in binaries.

Fig. 2 shows the fraction of binaries without debug symbols, compressed with UPX, and employing the activation mechanism for each month of the collection period. As we cannot identify the family (Bashlite or Mirai) for all binaries, we show aggregated results. We note the steady growth of the fractions of binaries without debug symbols or compressed, which indicates an evolution of measures to make analysis hard. We also observe a slow reduction of the fraction of binaries using the argv[0] activation mechanism. Possible explanations include a behavior change of Mirai botnet operators or evolution of the malware (e.g., to use more elaborate activation mechanisms), which we plan to investigate in future work. Despite the decreasing popularity of the activation mechanism, it remains common enough to make our Detux extension useful. Moreover, assuming only Mirai binaries employ the activation mechanism, we conjecture the global fraction in Fig. 2 is a lower bound on the fraction of Mirai binaries employing the activation mechanism.

B. Dynamic analysis and C&C identification heuristics

Due to computational restrictions, we use a subset of the binaries in our dynamic analysis. In particular, we consider 1050 binaries collected between 2018-11-27 and 2018-12-26, and between 2019-01-30 and 2019-02-28. Tab. II shows statistics on binaries, as well as candidate and validated C&Cs.

Table I: Number of candidates, precision and coverage of C&C heuristics.

Heuristic     Candidates                          Coverage
              Total    Exclusive    Validated
PORT          813      236          410 (50%)     99.7%
IP            579      2            289 (50%)     63.6%
URL           239      26           142 (59%)     31.0%
DNS           36       10           19 (52%)      4.5%
Aggregated    994      -            457 (46%)     100%

Table II: C&C identification and validation statistics.

Malwares    total                   1050    100%
            with candidate C&Cs     1011    96%
            with validated C&Cs     653     62%
C&Cs        candidates              994     100%
            validated               457     46%
            Bashlite                78      8%
            Mirai                   379     38%

Tab. I summarizes statistics, showing the number of candidate C&Cs, of candidates identified exclusively by each heuristic, and of validated C&Cs. We also show the coverage of each heuristic: among all validated C&Cs, how many were identified by the heuristic as a candidate.

We find that each heuristic exclusively identifies some C&C candidates, demonstrating they are complementary. The PORT heuristic has good coverage (99.7%) and contributes significantly to the aggregated set of validated C&Cs, as it exclusively identifies 236 candidates. At the other extreme, the IP heuristic has reasonable coverage (63.6%) but the set of C&Cs it identifies overlaps significantly with those of other heuristics. The URL heuristic is the most accurate (it has the highest ratio between validated and identified C&Cs).

C. C&C validation

We apply our two mechanisms for validating C&C servers: (i) we post-process the network traffic collected during dynamic analysis and (ii) we connect to candidate servers and process the received data to look for Mirai/Bashlite protocol signatures. We found that post-processing traffic allows the detection of 14% more Bashlite C&Cs than using just direct connections. The main reason is that some Bashlite C&C variants do not send any message until an initialization message (e.g., a hello or handshake request) is sent by the bot. This difference is only 7.5% for Mirai, which has more uniform behavior; in particular, previous work has reported more protocol customization on Bashlite than on Mirai, possibly due to Bashlite’s text-based protocol [6], [5].

Table II shows the fraction of candidate C&Cs validated (46%). Reasons for validation failures include false positives; blocking, mitigation, or takedown of C&C servers before the validation attempt; or modifications in the communication protocol. In particular, our processing pipeline can take four days from observing an infection attempt in a honeypot to validating, which negatively impacts validation accuracy. We are currently working to reduce the validation delay to less than 12 hours. We were able to validate C&C servers for 62% of binaries, which shows that our heuristics are effective despite the false positives. Also, when compared to a baseline that employs static analysis alone, our heuristics support the detection of 29.6% more servers.

D. C&C volatility

One of the reasons behind our failure to validate some C&Cs is their unreachability at validation time. Fig. 3 summarizes their reachability over time. After the first validation of a candidate C&C (following its first observation by the honeypots), we revalidate those C&Cs on subsequent days to check whether they are still operational. We consider that 100% of C&Cs are validated on day zero: candidates that are never validated are discarded. We observe that the fraction of operational C&Cs decreases sharply after only a few days of their observation. We also found that some C&Cs that become unreachable on a day may become reachable again (leading to a few increases in the figure).

This sharp decrease in the fraction of operational C&Cs highlights the importance of a quick, automated validation. It is known that most botnet operators cannot or do not rely on large C&C uptimes, which explains botnet recovery mechanisms in case a C&C is taken down, such as Mirai’s use of domain names.

Figure 3: Daily revalidation of C&Cs in time.

Figure 4: Distribution of C&Cs across different ASes.

Figure 5: Number of clusters as a function of the dissimilarity threshold.

E. Characterization of validated C&Cs

Understanding the infrastructure around IoT botnets is key to creating effective defense and mitigation mechanisms. In this section we characterize the validated C&C servers to identify features that may be used to thwart malicious activities.

We first study the location of validated C&C servers. To determine the network hosting each C&C, we map IP addresses to autonomous systems (AS) using Team Cymru’s IP-to-AS database and classify ASes according to CAIDA’s AS-Rank [7]. Fig. 4 shows the cumulative distribution of the fraction of C&Cs hosted by the ASes hosting most C&Cs. We observe a significant concentration in few ASes, with five ASes hosting 67% of all C&Cs. Moreover, 84% of C&Cs are hosted in cloud providers (indicated by circles in Fig. 4). These results indicate that we can concentrate mitigation efforts in few networks, and that measures to penalize infrastructure providers that do not actively take action against these practices may be effective in fighting IoT botnets.

We also analyze the port numbers most frequently used by C&C servers. Fig. 4 shows the distribution of port numbers across all C&Cs in our dataset. We observe that just twelve ports cover almost 60% of the validated C&Cs. Knowledge about the most frequently used ports can be used to calibrate detection and validation mechanisms. For example, some heuristics infer an IP address as a possible candidate, but not the port to connect to (e.g., DNS-based heuristics).

V. SIMILARITY ANALYSIS OF MALWARE BINARIES

Analyzing large sets of malware binaries is an onerous task. Security analysts need to deal with binaries for multiple hardware architectures; binaries without symbol tables, obfuscated, compressed, or encrypted; and binaries with code designed to disorient analyses. Identifying similar binaries is one strategy to choose representative binaries from a population and reduce the overall number of binaries that need to be analyzed to identify trends and draw conclusions.

A. Clustering malware binaries by similarity

Our strategy to group binaries by similarity follows the divide and conquer paradigm, quantifying the similarity between functions of binaries as a proxy for similar functionality and behavior. For this analysis, we use a sample of 554 MIPS binaries for which we successfully validated C&Cs. A challenge in implementing this strategy is that binaries frequently lack symbol tables (§III-A), which complicates identifying functions.

To address this challenge we employ the radare2 open source reverse engineering framework to identify functions in binaries from our dataset. We use radare2 to extract the addresses, parameters, and instructions of each function. We decompress UPX binaries and increase the maximum function depth in radare2 from 64 to 512 to correctly identify functions in binaries with long function call chains.

We generate signatures for all functions identified by radare2 using ssdeep [8], a fuzzy hash function that provides spatial reference locality (i.e., similar inputs have similar outputs). We use the sequence of instructions of each function identified by radare2 as input to ssdeep.

After computing signatures for functions, we compute the similarity between two binaries $b_\alpha$ and $b_\beta$ as follows. Let $f_{\alpha,1}, \ldots, f_{\alpha,n_\alpha}$ denote $b_\alpha$’s functions and $f_{\beta,1}, \ldots, f_{\beta,n_\beta}$ denote $b_\beta$’s functions. We define a complete bipartite graph $G_{\alpha,\beta}$ where functions are vertices, and where edge $e_{ij}$ connects $f_{\alpha,i}$ and $f_{\beta,j}$, and its weight $w_{ij}$ is the similarity between $f_{\alpha,i}$ and $f_{\beta,j}$ computed by ssdeep [8]. We then compute the maximal matching in $G_{\alpha,\beta}$, and define the similarity between $b_\alpha$ and $b_\beta$, denoted $S_{\alpha,\beta}$, as the sum of the weights of the edges in the maximal matching divided by the maximum number of functions in the binaries, i.e., $\max(n_\alpha, n_\beta)$.³

After calculating the similarity between all pairs of binaries as described, we build a distance matrix between all binary pairs. We compute the distance between the binaries as the complement of their similarity, i.e., $D_{\alpha,\beta} = 1 - S_{\alpha,\beta}$.

³ We divide the sum of edge weights by $\max(n_\alpha, n_\beta)$ to consider different numbers of functions as an indication of different functionality and behavior.
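The similarity and distance definitions of this section can be sketched in Python under several assumptions. The sketch does not call ssdeep: `function_similarity` is a stand-in based on the standard library's `difflib`, returning a value in [0, 1], whereas the paper compares ssdeep signatures. It also reads "maximal matching" as a maximum-weight matching and finds it by brute force, which is only practical for very small function counts; an assignment solver would be needed for real binaries. All function names are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import permutations


def function_similarity(f, g):
    """Stand-in for the ssdeep comparison: similarity in [0, 1] between two
    instruction sequences (assumption: the paper uses ssdeep signatures)."""
    return SequenceMatcher(None, f, g).ratio()


def binary_similarity(funcs_a, funcs_b, sim=function_similarity):
    """S = (weight of the best matching between functions) / max(n_a, n_b)."""
    if not funcs_a or not funcs_b:
        return 0.0
    small, large = sorted((funcs_a, funcs_b), key=len)
    # Edge weights of the complete bipartite graph between the two function sets.
    w = [[sim(f, g) for g in large] for f in small]
    # Brute-force maximum-weight matching: try every injective assignment
    # of the smaller function set into the larger one.
    best = max(
        sum(w[i][j] for i, j in enumerate(cols))
        for cols in permutations(range(len(large)), len(small))
    )
    return best / len(large)


def binary_distance(funcs_a, funcs_b):
    """D = 1 - S, the complement of the similarity."""
    return 1.0 - binary_similarity(funcs_a, funcs_b)
```

Dividing by the larger function count means that a binary compared with a copy of itself plus one unrelated function scores 2/3 when it has two functions, which reflects the intent stated in footnote 3.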

We use a hierarchical clustering technique to identify similar binaries. The clustering algorithm is greedy and groups, in each iteration, the two closest clusters to form a new one. On each step, the distance matrix is updated with the distances between current clusters. We use complete-linkage [9] to compute distances between clusters (i.e., we take the maximum distance between any pair of binaries, one from each cluster). On each step, the new cluster formed by combining two smaller clusters is recorded in a tree diagram (dendrogram). The algorithm starts with one cluster per binary.

Figure 6: Hierarchical clustering dendrogram.

B. Results

Figure 6 shows the dendrogram resulting from the clustering of binaries in our sample. The y axis shows the distance between two clusters when they are merged. Long vertical bars indicate merging of two distant clusters, while short vertical bars indicate merging of two close clusters.

To define clusters from the dendrogram, one chooses a dissimilarity (distance) threshold $D_{max}$ and interrupts the clustering algorithm when no clusters have a dissimilarity less than $D_{max}$. In other words, we draw a horizontal line at $y = D_{max}$ in the dendrogram, and the clusters are determined by the binaries in each intersected vertical line. Figure 5 shows the number of clusters as a function of the dissimilarity threshold $D_{max}$ in steps of 5%.

Figure 5 shows a significant reduction in the number of clusters for a dissimilarity threshold of up to 5%, and negligible reduction for larger thresholds. This indicates that many binaries are very similar (and grouped in a single cluster even for low dissimilarity thresholds), and that many binaries are significantly different (grouped in a cluster only for high dissimilarity thresholds). This result supports the argument that Bashlite and Mirai malware variants are different, but that some variants have multiple versions with small changes. Choosing a dissimilarity threshold of 5%, for example, groups 554 binaries in 289 clusters (variants). For this dissimilarity threshold, 77.5% of the 289 variants have a single binary, but 4.5% of the 289 variants have at least five different binaries. The analysis of one version from each variant would reduce by 47.8% the number of binaries that security analysts would need to consider. For example, the largest group is made of 172 Mirai binaries. Another Mirai cluster has 47 binaries carrying a string identifying the botnet operator.
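The complete-linkage clustering and the threshold cut described in this section can be sketched as a small, unoptimized Python function. It is illustrative only: the name `complete_linkage` and the list-of-lists distance matrix are assumptions, and for hundreds of binaries a library implementation of hierarchical clustering would be the practical choice.

```python
def complete_linkage(dist, d_max):
    """Greedy agglomerative clustering with complete linkage.

    dist is a symmetric matrix (list of lists) of pairwise distances in [0, 1].
    Merging stops when no pair of clusters is closer than d_max, which is
    equivalent to cutting the dendrogram at y = d_max.
    """
    clusters = [[i] for i in range(len(dist))]   # start with one cluster per binary

    def linkage(a, b):
        # Complete linkage: the maximum distance between any cross-cluster pair.
        return max(dist[i][j] for i in a for j in b)

    while len(clusters) > 1:
        d, x, y = min(
            (linkage(a, b), x, y)
            for x, a in enumerate(clusters)
            for y, b in enumerate(clusters)
            if x < y
        )
        if d >= d_max:                           # no clusters closer than the threshold
            break
        clusters[x] = clusters[x] + clusters[y]  # merge the two closest clusters
        del clusters[y]
    return clusters
```

With four binaries forming two tight pairs that are far from each other, a 5% threshold yields two clusters, a very small threshold keeps every binary apart, and a threshold of 1.0 merges everything.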
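VI. RELATED WORK

Botnet characterization. DDoS attacks are a recurrent security problem aggravated by the proliferation of botnets [2]. Previous work has proposed observation mechanisms to characterize botnets (including their architectures, proliferation mechanisms, and deployments) and countermeasures to their attacks [2], [1], [5], [6]. Most related to our work are those focused on identifying C&C servers. Some works propose static analysis and reverse engineering for that [5], while others propose dynamic analysis of network traffic to identify connections to C&Cs [10], [11]. Our work complements those by combining and extending both static and dynamic approaches for the Bashlite and Mirai botnets.

Binary similarity. The approach we use to determine the similarity between two binaries is similar to that of previous work that defines similarity between different binary fragments [12], [13]. More advanced techniques include dynamic analysis of multiple executions of a binary [14].

VII. CONCLUSION

We propose a framework to identify Bashlite and Mirai C&C servers. We combine four heuristics based on static and dynamic analysis to infer their IP addresses, and use active measurements to verify inferences. Our framework can successfully identify those servers for 62% of 1050 malware binaries in a dataset collected from 47 low-interactivity honeypots. Identifying C&Cs is key to taking these families of botnets down. We have also characterized the C&C servers in our dataset and found they are usually hosted in cloud infrastructure providers and stay operational for short periods of time.

ACKNOWLEDGMENTS

This work was funded by NIC.br, RNP/CTIC (2955), FAPEMIG, CNPq, CAPES, and EUBra-Atmosphere (H2020-EU.2.1.1 777154).

REFERENCES

[1] K. Angrishi, “Turning Internet of Things (IoT) into Internet of Vulnerabilities (IoV): IoT Botnets,” CoRR, vol. abs/1702.03681, 2017.
[2] C. Kolias, G. Kambourakis, A. Stavrou, and J. Voas, “DDoS in the IoT: Mirai and other Botnets,” IEEE Computer, vol. 50, no. 7, 2017.
[3] Symantec, “Internet Security Threat Report, Volume 22,” April 2017.
[4] Neustar, “DDoS Attacks & Cyber Insights Research Report,” May 2017.
[5] M. Antonakakis, T. April, M. Bailey, M. Bernhard, E. Bursztein, J. Cochran, Z. Durumeric, J. A. Halderman, L. Invernizzi, M. Kallitsis, D. Kumar, C. Lever, Z. Ma, J. Mason, D. Menscher, C. Seaman, N. Sullivan, K. Thomas, and Y. Zhou, “Understanding the Mirai Botnet,” in USENIX Security, 2017.
[6] A. Marzano, D. Alexander, O. Fonseca, E. Fazzion, C. Hoepers, K. Steding-Jessen, M. Chaves, Í. Cunha, D. Guedes, and W. Meira Jr., “The Evolution of Bashlite and Mirai IoT Botnets,” in IEEE ISCC, 2018.
[7] M. Luckie, B. Huffaker, A. Dhamdhere, V. Giotsas, and K. Claffy, “AS Relationships, Customer Cones, and Validation,” in ACM IMC, 2013.
[8] J. Kornblum, “Identifying Almost Identical Files Using Context Triggered Piecewise Hashing,” Digital Investigation, vol. 3, pp. 91–97, 2006.
[9] D. Müllner, “Modern Hierarchical, Agglomerative Clustering Algorithms,” arXiv.org, no. 1109.2378, 2011.
[10] G. Jacob, R. Hund, C. Kruegel, and T. Holz, “JACKSTRAWS: Picking Command and Control Connections from Bot Traffic,” in USENIX Security, 2011.
[11] A. Zand, G. Vigna, X. Yan, and C. Kruegel, “Extracting Probable Command and Control Signatures for Detecting Botnets,” in ACM SAC, 2014.
[12] Y. David, N. Partush, and E. Yahav, “Statistical Similarity of Binaries,” in ACM PLDI, 2016.
[13] R. Smith and S. Horwitz, “Detecting and Measuring Similarity in Code Clones,” in Intl. Workshop on Detecting Software Clones, 2009.
[14] M. Egele, M. Woo, P. Chapman, and D. Brumley, “Blanket Execution: Dynamic Similarity Testing for Program Binaries and Components,” in USENIX Security, 2014.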
