
International Journal of Communication Networks and Information Security (IJCNIS) Vol. 3, No. 2, August 2011

The Analysis and Identification of P2P Botnet’s Traffic Flows

Wernhuar Tarng1, Li-Zhong Den1, Kuo-Liang Ou1 and Mingteh Chen2

1 National Hsinchu University of Education, 521 Nanda Rd., Hsinchu, Taiwan, ROC
2 Micrel Semiconductor Inc., 2180 Fortune Drive, San Jose, CA 95131, USA

Abstract: With the advance of information and communication technologies, the Internet has become an integral part of human life. Although it provides many convenient services, it also exposes its users to potential risks. For example, hackers may try to steal confidential data for illegal benefits, using a variety of attack methods such as Distributed Denial of Service (DDoS), spam and Trojans. These methods require a large number of computers; hence, hackers often spread malicious software to infect computers with weak defense mechanisms. The infected computers become zombie computers in botnets controlled by the hackers. Detecting and defending against botnets is therefore an important subject in network security. Among them, the Peer-to-Peer (P2P) botnet is a new type of botnet in which every zombie computer is a peer controlled by hackers, which makes it more difficult to defend against. The objective of this study is to identify the traffic flows produced by known or unknown malicious software in order to defend against P2P botnets. Based on the analysis of P2P network traffic flows and the ASCII distribution in their packets, a six-step mechanism is proposed to identify the traffic flows of P2P botnets, locate the zombie computers, and finally restrain other computers from further infection.

Keywords: P2P botnets, network traffic flows, network security, decision-tree model.

1. Introduction

With the advance and development of information and communication technologies (ICT), computer networks have become an integral part of human life. Their applications range from online news, online shopping and information searches with Google to online ATM and stock trading. In open network environments, there are always unscrupulous criminals or organizations trying various methods to steal or destroy personal data in order to obtain illegal benefits. Usually, hackers use malicious software to infect a large number of computers with little or no protection to form so-called botnets, and then achieve their purposes by attacking with the zombie computers in the botnets. The methods often used for attacks include Distributed Denial of Service (DDoS), spam, click fraud and information leakage.

Botnets first appeared in 1993 in Internet Relay Chat (IRC) networks and became widespread after 1999. In New Zealand, a 19-year-old hacker controlled 150 million computers through the Internet, which is the largest known botnet; a Chinese hacker controlled 60,000 computers to attack a music website, keeping the website out of service even after its server was transferred to Taiwan or the USA. The two events caused losses of hundreds of millions of dollars [1], and both hackers were finally arrested. Waledac [2] is one of the top 10 botnets in the USA, affecting at least hundreds of thousands of personal computers in the world, and it can send 1.5 billion spam email messages daily, enough to seriously affect global network activities. According to Microsoft’s statistics, as many as 650 million malicious spam emails were sent to Hotmail from December 3 to 21, 2009. At least 233 source IP addresses in Taiwan were involved in sending spam emails for the Waledac botnets during early May 2009, showing that botnets can really influence global computer networks.

Today, the Internet is widely used for communication, multimedia, shopping, entertainment, research, education, and so on, and its application areas keep extending. In open network environments, the computers connected to the Internet are vulnerable and subject to different kinds of attacks. Even a computer with frequently updated antivirus software can still be infected. Because of user neglect and the fast mutation of computer viruses, a computer has a great chance of being infected and becoming a zombie computer. According to Symantec’s global Internet security report [3], Taipei has become the city with the world’s highest density of botnet viruses. Up to 80% of the computers may have been infected and, what is worse, the users may still be unaware of it. Thus, the prevention of malicious attacks cannot rely on antivirus software alone; efficient mechanisms are required to detect and defend against botnets.

A botnet is a collection of software agents, or robots, that run autonomously and automatically [4]. The term is most commonly associated with IRC botnets and, more recently, malicious software, but it can also refer to a computer network using distributed computation software. Botnets are usually named after their malicious software, such as Peacomm and Waledac. Basically, a botnet is composed of the server programs used to control the infected computers, the programs installed on the infected computers waiting for control instructions, and the malicious software that infects normal computers and turns them into zombie computers. These programs often use a unique encryption system to communicate with each other to avoid detection. They run in the background of the infected computers and use an exchange channel (e.g., the RFC1459 standard, Twitter) to communicate with the command and control server. A new robot can automatically scan its environment and exploit weak passwords to infect other computers. The more computers a robot is capable of infecting, the more valuable it is in the botnets controlled by the hackers.

Based on the way the hackers connect to the zombie computers, there are three types of botnets, i.e., IRC, HTTP and P2P botnets. In the first type, an infected computer automatically connects to the IRC chat room controlled by the hackers and waits for the next operational command. Hackers can set up their own IRC servers or use public IRC servers to exchange messages with zombie computers. The architecture of HTTP botnets is similar to that of IRC botnets, mainly launching attacks through malicious HTTP servers set up by the hackers.

IRC and HTTP botnets use the client-server architecture and thus have a single point of failure, which means the entire botnet collapses once the server has been shut down. Therefore, hackers introduced the P2P botnet as a new architecture using P2P communication protocols. In a P2P botnet, any zombie computer can be a client or a server, and it connects to the botnet according to its peer list to form a reciprocal relationship within the network topology. A P2P botnet therefore doesn’t need any particular server to download programs or receive instructions; the hackers can launch attacks from any computer in the P2P botnet. Consequently, the detection and prevention of P2P botnets are more difficult and challenging.

In recent years, the research on botnets has become an important issue. According to the study of Zhu et al. [5], current research about botnets can be divided into three main areas: (a) the investigation of botnets by structural analysis or by observing their operation, (b) detecting and tracking botnets, and (c) defending against the attacks of botnets. That study focused on the IRC protocols of botnets. Currently, most detection mechanisms for P2P botnets are designed to detect a single type of P2P botnet, so they cannot be applied to other types. To remedy this drawback, Liu [6] proposed an adaptive defense mechanism for a variety of P2P botnets, but it can only be applied in the stage when a botnet is launching attacks. Karasaridis et al. [7] tried to detect botnet attacks, such as DDoS and spam, by their traffic flows and, through traffic analysis, to identify possible connections with the command and control server and track its location. Goebel and Holz [8] found that the infected computers connecting to an IRC botnet often had nicknames different from those of normal computers; therefore, they could identify an IRC botnet through the traffic analysis of the computers with special nicknames.

Lu et al. [9] considered that future botnets will be attached to existing network applications (e.g., IRC, HTTP and P2P) as well as other unknown applications for making attacks, so they suggested using the characteristics and behavior of traffic flows to find out what kind of applications are attached, and then identifying the botnets through the classification of traffic flows based on a decision-tree model. Their study focused on IRC botnets only and didn’t address P2P and HTTP botnets. In their approach, the characteristics of traffic flows were determined from the payload, or ASCII (0-255) distribution, of the traffic flows per unit time (1 second). To reduce the complexity of the identification process, string comparisons were used to identify the packets of some recognized applications in advance. However, this could increase the identification time and thus affect the overall efficiency. This study improved the above approach by filtering out the unwanted P2P and non-P2P packets to reduce the time of the identification process. Then, it used a decision-tree model trained with known P2P traffic flows to further increase the identification rate.

A decision tree is a classification procedure that assigns a number of objects to predefined categories. In the classification process, data are collected and divided recursively into several homogeneous subsets. The decision tree consists of the root, intermediate nodes, and end nodes. The root forms the base of all information, so it doesn’t have any input but can have zero or several outputs; an intermediate node is a partitioned data set, which has one input and two or more outputs; an end node, or leaf node, has one input and no output. The J48 decision tree used in this study is an improved decision tree based on Quinlan’s C4.5 decision tree [10], and it expands the tree structure from the root to the end nodes, which makes the generated rules easier to understand.

In this study, P2P botnets were detected by identifying their traffic flows in order to locate the zombie computers and finally restrain other computers from further infection. First, the packets sent from the source ports to the destination ports by the computers in the network were filtered, which helps in understanding the current status of the network. The information obtained from these packets was then used to identify the traffic flows of P2P botnets. The mechanism proposed in this study for identifying P2P botnets contains the following six steps:
• Pre-processing stage: filtering out non-P2P traffic flows to simplify the identification process.
• Identification of P2P application hosts: identifying the hosts running P2P application programs.
• Identification of P2P application’s traffic flows: analyzing the traffic flows produced by P2P application hosts in the communication stage to determine if they belong to some P2P applications.
• Classification of P2P applications: determining which P2P application programs produced the traffic flows based on the analysis of payload characteristics.
• Detection of abnormal traffic flows: dividing the traffic flows of each P2P application program into two groups to detect the abnormal traffic flows produced by unknown P2P botnets.
• Detection of zombie computers: locating the zombie computers according to the information from the analysis of traffic flows produced by P2P botnets.

The objective of this study is to detect the traffic flows of P2P botnets quickly during the communication stage. The Response to Intervention (RTI) method [11] was adopted to observe the traffic flows of normal P2P applications and P2P botnets. Then, the traffic flows were classified into several groups by a trained decision-tree model, and the information obtained was used to identify the abnormal traffic flows and locate the zombie computers. To capture, filter and analyze the packets, this study used VMware (installed on Windows XP SP2) and network management tools (Wireshark and CurrPorts) to observe the network’s traffic flows.

2. P2P Traffic Analysis

The nodes in a P2P network are usually connected through an ad hoc network [12], and the main idea is to form a logical network on top of the existing physical network, rather than constructing a new physical network. No matter what kind of logical network structure is selected, the clients still have to transfer data through the physical layer. When two computers communicate with each other using a network protocol such as BitTorrent, the protocol first estimates the available bandwidth and computation power on both computers to see if data transmission is feasible before making the connection. Each of the two computers can act as either a server or a client, uploading or downloading data. They are reciprocal to each other and there is no obvious client-server architecture. Currently, many application programs use P2P technologies for information sharing, e.g., eDonkey, Foxy, BitTorrent and GoGoBox.

2.1 Characteristics of P2P Application’s Traffic Flows

To transfer files over P2P networks, users need to install P2P application programs on their computers. When the file download or upload function is used, the computer sends out a large number of IP packets within a short time to establish connections with a list of P2P peers. More computers on the lists of other peers then join the connections, so the connected peers continue to change on the fly. The computer continues to work with these peers until the file transfer is completed. Since the computers on the peer list may not be online, not all connection packets receive a response. Figure 1 shows a user’s computer connecting to a P2P network using the software BitComet (its network protocol being BitTorrent). The computer sends a large number of UDP packets to several IP addresses before establishing the connection. In this study, this is defined as the communication stage, and the UDP packets in this stage are usually very small (Figure 2).

Figure 1. User’s computer connecting to a P2P network

Figure 2. Size of UDP packets during communication stage

UDP isn’t a reliable or connection-oriented protocol. Its packet format (Figure 3) includes the source port, destination port, packet length, checksum and data. This study analyzed the characteristics of P2P traffic flows based on the ASCII (0-255) distribution in the data field.

Figure 3. The format of UDP packets

P2P application programs such as eDonkey, Foxy, BitTorrent and GoGoBox, as well as malicious software, typically use UDP to establish connections during the communication stage. However, not all P2P application programs rely on UDP. For example, GoGoBox uses TCP packets to initiate a reliable connection directly by three-way handshaking (Figure 4). TCP is a connection-oriented and reliable transmission protocol with lower transmission speeds, and its packet format is shown in Figure 5. When GoGoBox is establishing a connection, the computer sends packets with PSH=1 and ACK=1 to the P2P network for communication, so the data field of these packets can be retrieved for identification.

Figure 4. The communication stage of GoGoBox

Figure 5. The format of TCP packets

After the communication stage, the computer can proceed with file download/upload, which is defined as the transmission stage in this study. The packet size in this stage varies greatly, from 60 to 1468 bytes. Cho [13] used this information to detect P2P traffic flows with very high accuracy. Based on the above analysis, the normal behavior of P2P applications typically contains two stages: (a) communication stage: UDP packets are mainly used to establish connections in P2P networks, and the packet size and the changes in traffic flows are small; when TCP is used, the computer establishes connections by three-way handshaking and sends TCP packets with the parameters PSH=1 and ACK=1 for communication; and (b) transmission stage: the computer starts file download/upload through the P2P network, so the packet size varies greatly.

2.2 Characteristics of P2P Botnet’s Traffic Flows

According to the adaptive defense mechanism proposed by Liu [6], the behavior of P2P botnets can be divided into the following four stages:
• Infection stage: inducing users to click on malicious links or open the attachments.
• Connection stage: the infected computer connecting to the P2P botnet to receive commands and download programs.
• Download stage: proceeding with secondary infection or receiving commands.
• Attack stage: starting attacks or spreading spam to the target hosts or specified computers.

Liu’s defense mechanism must wait until the attack stage to detect the botnet viruses. This study proposed to detect the viruses during the connection stage. P2P botnets behave similarly to normal P2P applications except in the infection and attack stages: according to previous studies, the connection stage of P2P botnets is similar to the communication stage of P2P applications, and the download stage of P2P botnets is similar to the transmission stage of P2P applications. This study tried to detect P2P botnets as early as possible, so the analysis and identification of P2P traffic flows is performed in the connection or communication stage.

This study investigated two different types of P2P botnets. The first is Trojan.Peacomm, also known as the Storm worm since it spreads quickly to form a large botnet. It was first discovered in 2007 [14], and it uses an implementation of the Distributed Hash Table (DHT) in Kademlia P2P networks. It induces users to click on email attachments, which are then executed on the computers to connect to the botnet through the peer list (Figure 6) and download malicious software from other computers. During the connection stage, Trojan.Peacomm sends UDP packets to a large number of peers in the botnet, attempting to establish connections. Because the changes in its traffic flows are usually small (Figure 7), its behavior is very similar to that of other P2P software.

Figure 6. The connection stage of Trojan.Peacomm

Figure 7. The packet size of Trojan.Peacomm

The second type of P2P botnet under investigation is Waledac, which uses a connection mechanism different from that of Trojan.Peacomm. Waledac establishes connections mainly through TCP packets (Figure 8), and it uses packets with the parameters PSH and ACK to communicate with P2P botnets (Figure 9).

Figure 8. The connection stage of Waledac

Figure 9. The TCP packet of Waledac

Other P2P botnets such as Nugache spread through MSN, email attachments, and Microsoft vulnerabilities (such as MS03-026 and MS04-011). The infected computers connect to the botnet using TCP packets and open TCP Port 8 to download malicious software from other zombie computers to perform DDoS attacks. Nugache can also steal email addresses from the infected computers to send spam emails. Another P2P botnet virus, Sinit, infects computers through the IE vulnerability (Java.ByteVerify) by injecting malicious software through web pages. A computer is infected after browsing the web page, and then it opens TCP/UDP Port 53 and pretends to be an HTTP server. When the infected computer receives HTTP GET requests for ks.htm or ks.exe, it infects other computers by replicating itself through UDP Port 53. The main attack by Sinit is using key loggers to steal information from the infected computers. SpamThru infects computers through users’ careless operation, e.g., clicking on malicious hyperlinks to connect to a botnet server for downloading Kaspersky antivirus software, which can also be used to remove other malicious software. When a botnet server on the network is identified, the hackers can immediately switch to another infected computer as the server, which will download the messenger program and send spam emails through the infected computers again.

2.3 Payload Characteristics of P2P Traffic Flows

In this study, the payload characteristics of P2P traffic flows were analyzed for several P2P application programs, including BitTorrent, eDonkey/eMule, Foxy, and GoGoBox, and two P2P botnet viruses, Waledac and Trojan.Peacomm. The payload characteristics within a small unit of time (1 second) were obtained and analyzed using a trained decision-tree model to classify the packets of different P2P applications and P2P botnets. For example, the communication packets of different P2P applications follow different patterns and may contain special strings, which can be used to identify their traffic flows. As shown in Figure 10 to Figure 15, the payload characteristics of the four types of P2P applications and the two P2P botnets are different. Therefore, this study could distinguish the traffic flows among these P2P applications and P2P botnets using a trained decision-tree model described in the next section.
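The payload characteristics described above can be illustrated with a minimal Python sketch. It builds one feature vector per one-second window: a packet-type flag followed by the distribution of byte values 0-255 in the data fields, giving 257 values in total. The function name `payload_features`, the 0/1 encoding of the packet type, and the normalization to relative frequencies are assumptions made for illustration; the paper does not specify how the distribution is encoded.

```python
from collections import Counter

def payload_features(payloads, proto):
    """Return 257 features for the packets seen in one 1-second window.

    payloads: iterable of bytes objects (the data field of each packet).
    proto:    "TCP" or "UDP"; encoded here as 0.0 / 1.0 (an assumption).
    """
    counts = Counter()
    for payload in payloads:
        counts.update(payload)          # iterating bytes yields ints 0..255
    total = sum(counts.values())
    # Relative frequency of each byte value (normalization is an assumption).
    histogram = [counts.get(b, 0) / total if total else 0.0 for b in range(256)]
    return [1.0 if proto == "UDP" else 0.0] + histogram

features = payload_features([b"\x00\x00\xff", b"A"], "UDP")
print(len(features))   # 257
```

Vectors of this shape, labeled with the application that produced them, are the kind of input a decision-tree learner can be trained on.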

Figure 10. Characteristics of BitTorrent’s traffic flows

Figure 11. Characteristics of eDonkey/eMule’s traffic flows

Figure 12. Characteristics of Foxy’s traffic flows

Figure 13. Characteristics of GoGoBox’s traffic flows

Figure 14. Characteristics of Trojan.Peacomm’s traffic flows

Figure 15. Characteristics of Waledac’s traffic flows

3. Adaptive Mechanisms

In this study, the identification of P2P botnet’s traffic flows is divided into six steps: (a) pre-processing stage, (b) identifying P2P application hosts, (c) identifying P2P application’s traffic flows, (d) classifying P2P applications, (e) detecting abnormal P2P traffic flows, and (f) identifying zombie computers. The first three steps are based on the RTI method and detect all P2P traffic flows according to the payload characteristics of P2P applications and P2P botnets. The last three steps classify the P2P application’s traffic flows, detect the abnormal traffic flows produced by infected zombie computers, and identify the zombie computers.

3.1 Pre-processing Stage

In the pre-processing stage, the identification process can be sped up by filtering out non-P2P packets through the well-known ports. The well-known ports, ranging from 0 to 1023, are those recognized and defined by the Internet Assigned Numbers Authority (IANA), but not all of the port numbers are defined. Although these ports are not very effective for identifying P2P applications, they can be used to filter out some non-P2P packets to reduce the processing time and the amount of data for identification. Because this study focused on P2P traffic flows, it was better to filter out non-P2P packets in the pre-processing stage. Port 80 and Port 443 were excluded from this filtering because P2P applications also communicate through these two ports. This study used a post-association algorithm (as shown in Figure 16) to filter out non-P2P packets, which were determined based on the following three conditions:
• If the source port and destination port are both recognized, then the packet is not a P2P packet.
• If only one of the source port and destination port is recognized, then add the unknown port and its associated IP address to the Port Association Table (PAT) and set the packet as non-P2P.
• If neither the source port nor the destination port is recognized, then check whether the IP address and the port number are in the PAT. If they are, the packet is determined as a non-P2P packet; otherwise, it is treated as a possible P2P packet.

Figure 16. The post-association algorithm used in the pre-processing stage
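The three conditions of the post-association algorithm can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation in Figure 16: the `Packet` type, the `classify` function, and the small `RECOGNIZED_PORTS` set are hypothetical, and the sketch assumes that a PAT hit on either endpoint of the packet is enough to mark it as non-P2P.

```python
from typing import NamedTuple, Set, Tuple

class Packet(NamedTuple):
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int

# Illustrative subset of recognized well-known ports (an assumption);
# ports 80 and 443 are deliberately left out, as in the text.
RECOGNIZED_PORTS = {20, 21, 22, 23, 25, 53, 110, 143}

def classify(pkt: Packet, pat: Set[Tuple[str, int]]) -> str:
    """Return "non-p2p" or "possible-p2p"; may add an entry to the PAT."""
    src_known = pkt.src_port in RECOGNIZED_PORTS
    dst_known = pkt.dst_port in RECOGNIZED_PORTS
    if src_known and dst_known:
        return "non-p2p"                        # condition 1
    if src_known != dst_known:                  # exactly one side recognized
        unknown = (pkt.dst_ip, pkt.dst_port) if src_known else (pkt.src_ip, pkt.src_port)
        pat.add(unknown)                        # remember the associated endpoint
        return "non-p2p"                        # condition 2
    # Condition 3: neither port recognized, so consult the PAT.
    if (pkt.src_ip, pkt.src_port) in pat or (pkt.dst_ip, pkt.dst_port) in pat:
        return "non-p2p"
    return "possible-p2p"

pat: Set[Tuple[str, int]] = set()
print(classify(Packet("10.0.0.1", 50000, "10.0.0.2", 21), pat))    # non-p2p
print(classify(Packet("10.0.0.5", 6881, "10.0.0.6", 51413), pat))  # possible-p2p
```

Keeping the PAT as a set of (IP address, port) pairs makes the lookup in the third condition a constant-time membership test.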

3.2 Identifying P2P Application Hosts

This step identifies the P2P application hosts connected through BitTorrent, eMule, Foxy and other P2P application programs. During the communication stage, a host sends a large number of UDP packets to connect with several computers, one connection per peer, so a host using P2P software to issue communication packets should have almost the same number of IP addresses and port numbers. Thus, this study summarized three characteristics of hosts using P2P software: (a) the communication packets are UDP packets, (b) the number of connected host IP addresses is large, and (c) the number of ports for external connections divided by the number of connected IP addresses is large (close to 1). Figure 17 shows the algorithm for identifying P2P application hosts, where UDP Flag indicates whether UDP packets are used, #dIP is the total number of IP addresses for external connections, #dPort is the total number of external ports, and Ratio equals #dPort/#dIP.
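The quantities used in this step can be computed with a short Python sketch. It derives the UDP flag, #dIP, #dPort and Ratio for one host from its outgoing flows. The `Flow` type and function names are hypothetical, and the thresholds `min_dip=20` and `min_ratio=0.9` are assumptions chosen only so that the rule separates the hosts reported in Table III; the decision logic in Figure 17 may differ.

```python
from typing import Dict, Iterable, NamedTuple

class Flow(NamedTuple):
    proto: str      # "UDP" or "TCP"
    dst_ip: str     # external IP address
    dst_port: int   # external port

def host_features(flows: Iterable[Flow]) -> Dict[str, object]:
    """Compute UDP flag, #dIP, #dPort and Ratio = #dPort / #dIP for one host."""
    flows = list(flows)
    d_ips = {f.dst_ip for f in flows}
    d_ports = {f.dst_port for f in flows}
    ratio = len(d_ports) / len(d_ips) if d_ips else 0.0
    return {
        "udp_flag": any(f.proto == "UDP" for f in flows),
        "dIP": len(d_ips),
        "dPort": len(d_ports),
        "ratio": ratio,
    }

def looks_like_p2p_host(feat, min_dip=20, min_ratio=0.9) -> bool:
    # Hypothetical thresholds: many external IPs and about one port per IP.
    return feat["dIP"] >= min_dip and feat["ratio"] >= min_ratio

p2p_flows = [Flow("UDP", f"10.0.0.{i}", 40000 + i) for i in range(25)]
ftp_flows = [Flow("TCP", f"10.0.1.{i}", 21) for i in range(6)]
print(looks_like_p2p_host(host_features(p2p_flows)))  # True
print(looks_like_p2p_host(host_features(ftp_flows)))  # False
```

The UDP flag is reported but not used in the rule, since Table III lists a host with UDPFlag 0 among the identified P2P hosts.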

Figure 17. The algorithm for identifying P2P application hosts

3.3 Identifying P2P Application’s Traffic Flows

After verifying the hosts with P2P applications, the next step is to identify their traffic flows. In this step, the traffic flows from the source port to the destination port are divided into several groups for identification. The algorithm for identifying P2P application’s traffic flows is shown in Figure 18, where the definitions of Ratio and #dIP are the same as given in the previous step and PSW is the total difference of packet sizes; the larger the PSW, the more the packet sizes differ. In general, communication packets are small, so the value of PSW is relatively small.
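One possible reading of PSW can be sketched in Python. The paper only describes PSW as the total difference of packet sizes, so the definition below (the sum of absolute differences between consecutive packet sizes in a flow) is an assumption, and `psw` is a hypothetical helper name.

```python
def psw(packet_sizes):
    """Sum of absolute size differences between consecutive packets.

    This definition is an assumption; the text only says that PSW is the
    total difference of packet sizes.
    """
    return sum(abs(b - a) for a, b in zip(packet_sizes, packet_sizes[1:]))

# Communication stage: small packets of similar size give a small PSW.
print(psw([60, 62, 60, 61]))        # 5
# Transmission stage: sizes from 60 to 1468 bytes give a large PSW.
print(psw([60, 1468, 120, 900]))    # 3536
```

Under this reading, flows whose PSW stays small are the candidates for communication-stage traffic.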

Figure 18.The algorithm for identifying P2P application’s traffic flows 146 International Journal of Communication Networks and Information Security (IJCNIS) Vol. 3, No. 2, August 2011 unknown botnet viruses, not only the data were used for 3.4 Classifying P2P Applications training the decision tree but also the system had to initiate According to the characteristics obtained in the previous the isolation procedure to prevent the network from further step, this study classified a variety of P2P applications using infection. the J48 decision-tree model. The characteristics of training samples included the type of packets (TCP or UDP) and 4. Simulation Experiment their ASCII distribution within one second, forming a total of 257 features. The packets of P2P applications (e.g., A simulation experiment was conducted to evaluate the BitTorrent, eDonkey, Foxy and GoGoBox) and Waledac proposed mechanism for identifying the traffic flows of P2P virus were collected, each containing 1000 samples with a botnets. The experimental environment was constructed total of 5000 samples, and used to train the decision-tree using two VMware virtual hosts (for the implementation of model (Figure 19), which was then used to classify the P2P botnet programs) and four computers running different traffic flows of P2P applications and P2P botnets in the P2P application programs. The network architecture for the simulation experiment. experimental environment and the role of each computer are shown in Figure 20 and Table 2. This study used CurrPorts, Wireshark, and Weka as the tools for monitoring the network and data analysis. 
CurrPorts is a software program to monitor the connection activities in each port, allowing users to know the connection status on a computer; Wireshark is a program to analyze network packets to show the detail information; Weka is a data-mining and analysis platform where users can implement their algorithms to obtain the information from a large number of data using a decision tree.

Figure 19. The J48 decision-tree model for classifying P2P applications and Waledac virus 3.5 Detecting Abnormal P2P Traffic Flows After the classification of P2P applications, this study used a K-Mean clustering algorithm to divide the traffic flows of each P2P application into two groups, and then calculated the distance between their group centers. If the distance exceeded the standard variation of the standard value T (Table 1), they were regarded as abnormal traffic flows and the mechanism triggered the monitoring and processing procedures. If the traffic flows were resulted from a certain program but the computer didn’t install the program, they were also treated as abnormal traffic flows. The standard Figure 20. Network architecture of experimental value T was derived from the original training samples using environment the group distance as reference data. The basic idea is that the difference of group centers for P2P applications is Table II. The operating systems and rolls played by usually small. computers

Table I. The standard value of group distance for different P2P application's traffic flows

| P2P Applications | Standard Value T |
|---|---|
| BitTorrent | 129.61 |
| eDonkey | 253.64 |
| Foxy | 60.55 |
| GoGoBox | 116.69 |

3.6 Identifying Zombie Computers

In this study, the infected computers can be located according to the information obtained from the traffic flows produced by the known P2P botnet (Waledac) and the unknown P2P botnet (Trojan.Peacomm) detected in the previous steps. If the abnormal traffic flows were from a new P2P application, the data could be used to train the decision-tree model. If the traffic flows were produced from some

| Computer | Operating System | Role |
|---|---|---|
| Computer A | Windows 7 | Executing non-P2P application software (FTP and HTTP) |
| Computer B | Windows 7 | Executing normal P2P application software (Foxy and eDonkey) |
| Computer C | Windows XP SP2 | Executing normal P2P application software (BitTorrent) and P2P botnet virus (Trojan.Peacomm) |
| Computer D | Windows XP SP2 | Executing normal P2P application software (GoGoBox) and P2P botnet virus (Waledac) |
| Detection Server | Linux CentOS 9.4 | Detecting the traffic flows of P2P botnets |
| Database Server | Windows Server 2008 | Recording the related information of packets |

The simulation experiment was conducted on a small local area network, with several computers executing P2P and non-P2P applications. In this study, Waledac was regarded as a known P2P botnet virus, and its traffic flows were used to train the decision-tree model prior to the experiment. Trojan.Peacomm was regarded as an unknown P2P botnet virus. This study used the proposed mechanism to identify the traffic flows of Waledac, detect the traffic flows of the unknown P2P botnet virus, Trojan.Peacomm, and finally locate the zombie computers.

The experiment began with capturing packets for five minutes (300 seconds), in which 1825 traffic flows were retrieved with a total of 20,234 packets, including those of normal P2P applications (BitTorrent, eDonkey, Foxy and GoGoBox), P2P botnet viruses (Trojan.Peacomm and Waledac), and non-P2P applications such as FTP, Telnet, and HTTP.

This study analyzed the characteristics of P2P traffic flows based on the ASCII (0~255) distributions of the packets captured from the source ports and destination ports. The information was used to identify different types of P2P applications and P2P botnet viruses, and finally to locate the zombie computers.

• Pre-processing stage

This study used a post-association algorithm to filter out most non-P2P packets, e.g., port 21 used by FTP, port 25 used by SMTP and port 110 used by POP3, so the number of packets was reduced after the pre-processing stage. However, not all non-P2P packets can be filtered out, e.g., those on port 80 and port 443.

• Identifying P2P application hosts

In this stage, the computers running P2P applications are shown in Table III, where Computers B, C, and D are the three hosts identified as executing P2P application programs.

Table III. The results of identifying the P2P application hosts

|  | Ratio | TCPFlag | UDPFlag | #IP |
|---|---|---|---|---|
| Computer A | 0.34 | 1 | 1 | 6 |
| Computer B | 0.91 | 1 | 1 | 32 |
| Computer C | 0.98 | 1 | 1 | 21 |
| Computer D | 0.92 | 1 | 0 | 23 |

• Identifying P2P application's traffic flows

In this step, the traffic flows of P2P applications were identified according to their packet sizes, yielding 776, 499, and 193 traffic flows on Computers B, C, and D, respectively.

• Classifying P2P applications

Using the trained decision-tree model, the traffic flows of the P2P applications (BitTorrent, eDonkey, Foxy, and GoGoBox) and the P2P botnet virus (Waledac) were classified as shown in Table IV.

Table IV. Classification of P2P applications and botnet viruses

|  | Computer B | Computer C | Computer D |
|---|---|---|---|
| BitTorrent | 0 | 499 | 0 |
| eDonkey | 519 | 0 | 0 |
| Foxy | 257 | 0 | 0 |
| GoGoBox | 0 | 0 | 151 |
| Waledac | 0 | 0 | 42 |

• Detecting abnormal P2P traffic flows

Apart from the traffic flows classified as the known P2P botnet virus, the traffic flows of each remaining application were divided into two groups using a K-means clustering algorithm. The distance between the two group centers was calculated (as shown in Table V) to check whether it exceeded the standard value T of that application. The traffic flows were considered suspicious, i.e., possibly produced by a P2P botnet virus, when the distance exceeded T.

Table V. The distance between group centers for detecting abnormal traffic flows

| Application Program | Distance between group centers | Standard Value | Over T |
|---|---|---|---|
| BitTorrent | 149.73 | 129.61 | Yes |
| eDonkey | 252.43 | 253.64 | No |
| Foxy | 60.18 | 60.55 | No |
| GoGoBox | 116.21 | 116.23 | No |

The above results show that the computer running BitTorrent contains abnormal packet flows. After clustering, the first group contains 464 traffic flows and the second group contains 35 traffic flows. Usually, the traffic flows of a P2P botnet are smaller than normal P2P traffic flows, so it is reasonable to infer that the traffic flows in the smaller group were caused by an unknown P2P botnet virus.

• Identifying zombie computers

By analyzing the abnormal traffic flows with the information obtained from the above steps, Computer C was identified as infected by an unknown P2P botnet virus, and Computer D as infected by the known P2P botnet virus Waledac. According to the rules of network management, the network management personnel must be notified to isolate these two computers immediately and then retrieve the packets of the unknown P2P botnet virus as samples for training the decision-tree model (Figure 21).

Figure 21. Adding the samples of the unknown P2P botnet virus to the trained decision-tree model
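The abnormal-flow check described above (split the flows of one application into two groups, then compare the distance between the group centers with the standard value T) can be sketched as follows. This is a minimal illustration and not the authors' implementation: the feature vectors, the Euclidean distance, the initialisation of the two centers, and the names `two_means` and `check_flows` are all assumptions made for this example. The last lines only re-apply the comparison to the distances and standard values reported in Table V.

```python
import math


def two_means(points, max_iter=100):
    """Split equal-length numeric tuples into two groups with k-means (k = 2)."""
    # Assumed initialisation: the first point and the point farthest from it.
    first = points[0]
    centers = [first, max(points, key=lambda p: math.dist(first, p))]
    groups = ([], [])
    for _ in range(max_iter):
        groups = ([], [])
        for p in points:
            nearer = 0 if math.dist(p, centers[0]) <= math.dist(p, centers[1]) else 1
            groups[nearer].append(p)
        # Recompute each center as the mean of its group (keep it if the group is empty).
        updated = [
            tuple(sum(axis) / len(g) for axis in zip(*g)) if g else c
            for g, c in zip(groups, centers)
        ]
        if updated == centers:
            break
        centers = updated
    return centers, groups


def check_flows(points, threshold_t):
    """Return (is_suspicious, center_distance, smaller_group) for one application."""
    centers, groups = two_means(points)
    distance = math.dist(centers[0], centers[1])
    return distance > threshold_t, distance, min(groups, key=len)


# Distances and standard values as reported in Table V.
TABLE_V = {
    "BitTorrent": (149.73, 129.61),
    "eDonkey": (252.43, 253.64),
    "Foxy": (60.18, 60.55),
    "GoGoBox": (116.21, 116.23),
}
over_t = [app for app, (distance, t) in TABLE_V.items() if distance > t]
print(over_t)  # ['BitTorrent']
```

When a check is positive, the smaller of the two groups is the one the paper treats as the candidate traffic of an unknown P2P botnet virus, which is why `check_flows` returns it.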

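The last step of the experiment, locating the zombie computers, amounts to combining the per-host classification counts of Table IV with the outcome of the threshold check of Table V. The sketch below reproduces that bookkeeping with the reported numbers; the data layout and the helper names `flows_per_host` and `locate_zombies` are hypothetical and chosen only for illustration.

```python
# Classified traffic flows per host, as reported in Table IV.
TABLE_IV = {
    "BitTorrent": {"B": 0, "C": 499, "D": 0},
    "eDonkey": {"B": 519, "C": 0, "D": 0},
    "Foxy": {"B": 257, "C": 0, "D": 0},
    "GoGoBox": {"B": 0, "C": 0, "D": 151},
    "Waledac": {"B": 0, "C": 0, "D": 42},
}
KNOWN_BOTNETS = {"Waledac"}  # label the decision tree was trained on
OVER_T = {"BitTorrent"}      # applications whose center distance exceeded T (Table V)


def flows_per_host(table):
    """Sum the classified traffic flows of every label for each host."""
    totals = {}
    for counts in table.values():
        for host, n in counts.items():
            totals[host] = totals.get(host, 0) + n
    return totals


def locate_zombies(table, known_botnets, over_t):
    """Map each host to the reasons it is considered infected."""
    zombies = {}
    for label, counts in table.items():
        if label in known_botnets:
            verdict = "known P2P botnet"
        elif label in over_t:
            verdict = "suspected unknown P2P botnet"
        else:
            continue
        for host, n in counts.items():
            if n > 0:
                zombies.setdefault(host, []).append(verdict)
    return zombies


print(flows_per_host(TABLE_IV))  # {'B': 776, 'C': 499, 'D': 193}
print(locate_zombies(TABLE_IV, KNOWN_BOTNETS, OVER_T))
```

The per-host totals agree with the 776, 499, and 193 traffic flows reported for Computers B, C, and D, and the two flagged hosts are Computer C (suspected unknown botnet) and Computer D (Waledac).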
5. Conclusion and Future Work

In recent years, research on botnets has become an important issue in network security. Basically, there are three types of botnets, i.e., IRC, HTTP and P2P, according to their network architectures. Currently, most studies focus on IRC botnets, while studies on the other two types are fewer. This study proposed a mechanism to identify the traffic flows of P2P botnets quickly during the connection stage. The mechanism used the RTI method to observe the traffic flows of normal P2P applications and P2P botnets. Then, the traffic flows were classified into several groups by a trained decision-tree model, and the information obtained can be used to identify the abnormal traffic flows and locate the zombie computers. The simulation results showed that the mechanism can effectively identify known and unknown P2P botnet viruses, and then locate the infected computers according to the traffic information.

In the future, other types of botnets may appear in addition to the three types discussed in this paper, and the proposed mechanism could serve as a general approach for analyzing and identifying the traffic flows they produce. In addition, it can be applied to detect unknown botnet viruses and use the samples to train the decision-tree model, which can then be used to identify and defend against a new botnet virus. Since this study was conducted in a small network environment, the reliability and robustness of the proposed mechanism can be improved through further experiments in a larger network environment.

References

[1] Malware Report (2007). “The economic impact of viruses, spyware, adware, botnets, and other malicious code,” Computer Economics, 2007.
[2] G. Sinclair, C. Nunnery and B. B. Kang (2009). “The Waledac protocol: the how and why,” Proceedings of the 4th International Conference on Malicious and Unwanted Software, Montreal, Quebec, Oct. 13-14, 2009.
[3] M. Fossi, D. Turner, E. Johnson, T. Mack, T. Adams, J. Blackbird, S. Entwisle, B. Graveland, D. McKinney, J. Mulcahy, and C. Wueest (2010). “Symantec Global Internet Security Threat Report: Trends for 2009,” Technical Report, Symantec Corporation, April 2010.
[4] C. Schiller, J. Binkley and D. Harley (2007). “Botnets: The killer web applications,” Rockland, MA: Syngress Publishing, Feb. 2007.
[5] Z. Zhu, G. Lu, Y. Chen, Z. J. Fu, P. Roberts, and K. Han (2008). “Botnet research survey,” 32nd Annual IEEE International Computer Software and Applications Conference, Turku, Finland, July 2008.
[6] B. W. Liu (2009). “An adaptive defense mechanism against P2P botnets,” Master thesis, Department of Information Engineering, Chung Yuan Christian University, Chungli, Taiwan.
[7] A. Karasaridis, B. Rexroad, and D. Hoeflin (2007). “Wide-scale botnet detection and characterization,” Proceedings of USENIX Conference (HotBots’07), Cambridge, Massachusetts, April 10, 2007.
[8] J. Goebel, and T. Holz (2007). “Rishi: Identify bot-contaminated hosts by IRC nickname evaluation,” Proceedings of USENIX Conference (HotBots’07), Cambridge, Massachusetts, April 10, 2007.
[9] W. Lu, M. Tavallaee, G. Rammidi, and A. Ghorbani (2009). “BotCop: an online botnet traffic classifier,” 7th Annual Conference on Communication Networks and Services Research, Moncton, Canada, May 11-13, 2009.
[10] J. R. Quinlan (1993). “C4.5: Programs for machine learning,” San Mateo, CA: Morgan Kaufmann.
[11] J. D. Fuchs, and L. S. Fuchs (2006). “Introduction to response to intervention: what, why, and how valid is it?” Reading Research Quarterly, February/March 2006.
[12] S. T. Lee (2008). “Design and implementation of P2P traffic flows management system,” Master thesis, Department of Information Engineering, National Sun Yat-Sen University, Kaohsiung, Taiwan.
[13] F. G. Cho (2006). “Detection of P2P traffic flows,” Master thesis, Department of Electronic Engineering, National Taiwan University of Science and Technology, Taipei, Taiwan.
[14] P. Porras, H. Saidi, and V. Yegneswaran. “A multi-perspective analysis of the Storm (Peacomm) Worm,” Technical report, Computer Science Laboratory, SRI International, October 2007.